I guess I should be running things through the fine-toothed comb of careful analysis.
The current idea of file-level joins does not seem to be going well.
Really, we ought to be able to do it all in memory… well, given that there are 24355 (2.4e4) terms, that makes the co-occurrence matrix on the order of 6e8 cells … not completely [...]
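As a rough check of the matrix size discussed above, here is a back-of-the-envelope sketch. It assumes a dense matrix with a hypothetical 4 bytes per cell, and that co-occurrence can be treated as symmetric. None of these choices come from the post itself.

```python
# Back-of-the-envelope size of a term-by-term co-occurrence matrix.
N_TERMS = 24355       # number of terms, taken from the post
BYTES_PER_CELL = 4    # assumption: one 32-bit count per cell


def dense_cells(n_terms):
    """Number of cells in a full n x n matrix."""
    return n_terms * n_terms


def upper_triangle_cells(n_terms):
    """Cells needed if the matrix is symmetric (unordered pairs plus diagonal)."""
    return n_terms * (n_terms + 1) // 2


for label, cells in [("dense", dense_cells(N_TERMS)),
                     ("upper triangle", upper_triangle_cells(N_TERMS))]:
    gib = cells * BYTES_PER_CELL / 2**30
    print(f"{label}: {cells:.1e} cells, about {gib:.1f} GiB")
```

If most term pairs never co-occur, a sparse structure (for example a dictionary keyed by term pair) would need far less than either figure.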
Archive for March, 2008
Need Moar Speed
Posted in MeSH, PubMed on March 18, 2008
Sun Grid Engine – Using Makefiles
Posted in cluster on March 18, 2008
Moving away from big script files and towards Makefiles. This should simplify some of the coding (no more need to write big loops!) as well as making it very easily parallelisable: SGE has a specially designed version of make (qmake) that will automatically farm out jobs. Makefiles also make it easy to write “pipelines” with [...]
Python Mesh-Child File generation – Success!
Posted in MeSH, PubMed on March 18, 2008
Successfully generated the mesh-parent datafiles in reasonable time. Loading all the results into a database, which produces a very large table (pubmed_mesh_parent), is slower but still reasonable. Querying the table is still pretty slow right now – attempting to optimise the indexing (right now it is indexed on pmid, term, so I am creating another index on term), but feeling [...]
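A minimal sketch of the extra index on term, using Python's built-in sqlite3 module purely for illustration. The post does not say which database is in use, and the column types, index names and sample rows here are assumptions.

```python
import sqlite3

# In-memory database for illustration only; the real table is far larger.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pubmed_mesh_parent (pmid INTEGER, term TEXT)")

# Existing composite index: helps lookups that start from pmid.
conn.execute("CREATE INDEX idx_pmid_term ON pubmed_mesh_parent (pmid, term)")
# Additional index: helps lookups that start from term.
conn.execute("CREATE INDEX idx_term ON pubmed_mesh_parent (term)")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO pubmed_mesh_parent VALUES (?, ?)",
    [(1, "Neoplasms"), (2, "Neoplasms"), (2, "Humans")],
)

query = "SELECT pmid FROM pubmed_mesh_parent WHERE term = ?"
pmids = sorted(row[0] for row in conn.execute(query, ("Neoplasms",)))

# Ask SQLite how it intends to run the query.
for row in conn.execute("EXPLAIN QUERY PLAN " + query, ("Neoplasms",)):
    print(row)
```

A composite index on (pmid, term) is ordered by pmid first, so a query that filters only on term generally cannot use it efficiently. That is the reason for the second index.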
Plan of attack
Posted in MeSH, PubMed on March 6, 2008
Long term rewrite – Separate datafiles and workfiles from projectfiles to simplify backup
Will probably need a program to handle the join efficiently.
Thinking of writing in Python:
Read (mesh-child) file:
Each line becomes a dictionary entry (key = term), and the related term is added to that entry's value (a set)
(Reverse? Child is the key, parent is the set of [...]
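The plan above could be sketched roughly as follows. The file layout is an assumption (one tab-separated `parent<TAB>child` pair per line), and the function names are hypothetical. The second function covers the "reverse" option, where the child is the key.

```python
from collections import defaultdict


def read_parent_child(lines, sep="\t"):
    """Build {parent: set(children)} from 'parent<sep>child' lines (assumed format)."""
    children = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parent, child = line.split(sep, 1)
        children[parent].add(child)
    return children


def invert(mapping):
    """Turn {parent: set(children)} into {child: set(parents)}."""
    parents = defaultdict(set)
    for parent, kids in mapping.items():
        for child in kids:
            parents[child].add(parent)
    return parents


# Usage with a few made-up lines; an open file object works the same way.
sample = ["A\tB\n", "A\tC\n", "B\tC\n"]
children_of = read_parent_child(sample)
parents_of = invert(children_of)
```

Using sets as values means duplicate lines in the input are absorbed without extra bookkeeping.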
Problem Identified
Posted in Uncategorized on March 6, 2008
Figured out what was going wrong with my cluster script – the result set for a couple of kB of input is over a gigabyte. YIPES!
I was doing sort/uniq, but I need to cut down the tuples. pubmed x mesh x mesh is probably too large to keep. Pubmed x mesh(parents) is probably large, but [...]
Reviewing The Big Problem
Posted in MeSH on March 5, 2008
Currently, I’m trying to find mesh terms overrepresented in articles with a particular disease term.
Turns out I’ve made a couple errors on this front, so I’m going to rewrite the solution here so that I can remember my current take on it.
I want articles with the disease term, or any of its children.
From [...]
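One way to read "the disease term, or any of its children" is sketched below. It assumes a parent-to-children mapping and a mapping from article id to its set of terms. Both structures, the function names and the toy data are hypothetical. The traversal also follows children of children, which is an assumption about the intended meaning.

```python
def descendants(term, children):
    """Return the term itself plus every term reachable through the children mapping."""
    seen = {term}
    stack = [term]
    while stack:
        current = stack.pop()
        for child in children.get(current, ()):
            if child not in seen:  # guard against terms reachable by several paths
                seen.add(child)
                stack.append(child)
    return seen


def articles_with(term, children, article_terms):
    """Ids of articles tagged with the term or any of its descendants."""
    wanted = descendants(term, children)
    return {pmid for pmid, terms in article_terms.items() if wanted & terms}


# Toy data for illustration only.
toy_children = {"Disease": {"SubtypeA", "SubtypeB"}, "SubtypeA": {"SubtypeA1"}}
toy_articles = {101: {"SubtypeA1", "Humans"}, 102: {"Humans"}, 103: {"Disease"}}
matching = articles_with("Disease", toy_children, toy_articles)
```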
Cluster-ification
Posted in Uncategorized on March 4, 2008
And not even data clustering – here I’m just whacking at the Sun Grid Engine cluster to run my data processing scripts more efficiently. Good news is that I’ve done a barebones port and successfully improved my cluster submission experience by noticing the following:
you can put default options for submitted jobs in ~/.sge_request
for me this is
-v [...]