Since things are still chugging slowly on the servers, started looking at ideas on how to make everything faster. After all, there’s more than a couple CPUs sitting around twiddling their fingers – I ought to think of ways to have them all play. That and making the solution more scalable – too many of the computes take a week or so to complete.
Took a not-so-quick look around the “distributed database” camp, and there’s not really much production level – lots of research, nobody really jumping on the bandwagon. Kinda primitive pseudo-database would be trying to exploit the MapReduce framework from the Google folks, although sadly it would probably have to use a weaker implementation (Hadoop). MySQL cluster looks like it might also work, although I’m not sure about its scalability either. Is it much more than splitting the tables over a couple computers?
Anyways, I’m looking at optimising my current big, slow query:
MeSH (disease) term -> Self and Parent terms -> Article citing this term -> other terms in the article -> Other term and its Parents
Distinct MeSH terms: 24 355
Number of PubMed articles: 16 million (8 million with a disease term)
Number of term-parent relations: 230 963
Anyways, a couple of ideas. One would be to improve the join speed. Right now, joining on the mesh term (character field). One idea is to change this to an internal “mesh_id”. Generating unique ids is easy (auto-increment). This would involve reloading the pubmed_mesh to associate the pubmed IDs (ints) with the new mesh_ids (ints)
Then maybe we should build an intermediate table: pubmed_mesh_parent, where we cross pubmed_mesh with the mesh_child table, associating the PMID with its mesh terms and all the parents of the mesh term. Then generating our final result would simply be crossing this table with itself.
Idea for the “reload compute” will be to have it shared out among the cluster. I’ll have to see if I can get enough disk shared between the cluster nodes – ideally, they’d all have access to where the gzip PubMed files are.