Thinking a bit about rearranging the hypergeometric distribution also yields another way of getting a p-value combining the gene-mesh and disease-mesh links. Rather than computing the p-values separately, we can instead ask whether the marked-ball draw rate for the gene is equal to or higher than the rate we see for the disease. That is, is the mesh term annotated on the gene’s articles more often than we’d expect by chance, given the rate at which it occurs for the disease? This loses the background distribution… but then we’d be able to compensate by considering the old p-values. Since we’re comparing rates, this would have to be a t-test comparing the ratios? We can’t use the hypergeometric, since one set of marbles isn’t drawn from the other …
Or maybe we can combine it all (?!). We want the probability that either of the links occurred by chance, and the probability that the term annotation rate is the same. So, more formally:
p1(gene, term) = p-value that the gene-related articles have the term annotation by chance with respect to PubMed
p2(disease, term) = p-value that the disease-related articles have the term annotation by chance with respect to PubMed
p3(gene, disease, term) = p-value that the gene-related articles have the term annotation at the same rate as the disease-related articles
Master p = (p1 + p2 - p1*p2) * p3
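
To make the combination concrete, here is a rough sketch in Python of how the three p-values and the master p might be computed. Several things here are assumptions and not part of the notes above: the function names are made up, p1 and p2 are taken to be upper-tail hypergeometric probabilities against the PubMed background, and p3 uses a pooled two-proportion z-test as a stand-in for the t-test idea (a different test, chosen only because it needs nothing beyond the standard library). The term p1 + p2 - p1*p2 is the probability that at least one of the two links is due to chance only if the two are independent, which is also an assumption. All the counts are invented for illustration.

```python
from math import comb, erfc, sqrt


def hypergeom_sf(k, N, K, n):
    """P(X >= k): draw n articles from N, of which K carry the term."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def two_proportion_p(k1, n1, k2, n2):
    """Two-sided p-value for H0: both annotation rates are equal.

    Pooled two-proportion z-test (an assumed stand-in for the t-test idea).
    """
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # rates are identical and degenerate (all 0 or all 1)
    z = (k1 / n1 - k2 / n2) / se
    return erfc(abs(z) / sqrt(2))


def master_p(p1, p2, p3):
    """Combine as in the notes: (p1 + p2 - p1*p2) * p3."""
    return (p1 + p2 - p1 * p2) * p3


# Invented counts: 1000 background articles, 50 of them carry the term.
N_ARTICLES, N_WITH_TERM = 1000, 50
gene_n, gene_k = 40, 8        # gene-related articles, and those with the term
disease_n, disease_k = 60, 9  # disease-related articles, and those with the term

p1 = hypergeom_sf(gene_k, N_ARTICLES, N_WITH_TERM, gene_n)
p2 = hypergeom_sf(disease_k, N_ARTICLES, N_WITH_TERM, disease_n)
p3 = two_proportion_p(gene_k, gene_n, disease_k, disease_n)
print(p1, p2, p3, master_p(p1, p2, p3))
```

Exact binomial coefficients keep the sketch dependency-free, but they get slow for a PubMed-sized background; a statistics library's hypergeometric survival function would be the practical choice there.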