Ideas from RECOMB

Things that I’ve been reminded to do after seeing things in RECOMB:

MeSH term attachment (general paper) – this data is running!  HA! After that,  we can run the validation!

Stability check – predictions change with respect to missing annotation, misannotation

Bayesian mode for predicting term attachment – P(term|papers) = P(term)P(papers|term)/P(papers)
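
A minimal sketch of that Bayes' rule computation, just to pin the formula down. The function name and both input dictionaries are made up for illustration. It also assumes that P(papers|term) is already computed elsewhere, and that P(papers) can be taken as the sum of the joint over the candidate terms (i.e. the terms are treated as mutually exclusive), which may well not hold for MeSH.

```python
def term_posterior(term_counts, likelihoods):
    """Return P(term | papers) for every candidate term via Bayes' rule.

    term_counts: {term: number of papers annotated with it}, used for P(term)
    likelihoods: {term: P(papers | term)}, assumed to be computed elsewhere
    """
    total = sum(term_counts.values())
    # Numerator of Bayes' rule: P(term) * P(papers | term)
    joint = {
        term: (count / total) * likelihoods.get(term, 0.0)
        for term, count in term_counts.items()
    }
    # Denominator P(papers), here approximated by summing over candidate terms
    evidence = sum(joint.values())
    if evidence == 0:
        return {term: 0.0 for term in joint}
    return {term: p / evidence for term, p in joint.items()}


if __name__ == "__main__":
    counts = {"Asthma": 1, "Diabetes": 3}          # hypothetical prior counts
    like = {"Asthma": 0.6, "Diabetes": 0.2}        # hypothetical likelihoods
    print(term_posterior(counts, like))
```

The rare term with the high likelihood and the common term with the low likelihood end up with the same posterior here, which is exactly the prior-versus-evidence trade-off the formula encodes.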

Break down the AUC by term (can actually ignore the tree and do it per term…) – for this I should probably rewrite the AUC calculator as an object…
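
A rough sketch of what an object-style AUC calculator could look like, keeping one list of scored predictions per term and ignoring the tree entirely. The class and method names are hypothetical, not the existing calculator. AUC is computed as the fraction of (positive, negative) pairs ranked correctly, with ties counting half.

```python
from collections import defaultdict


class TermAUC:
    """Accumulate scored predictions per term and report AUC term by term."""

    def __init__(self):
        # term -> list of (score, is_positive)
        self._scores = defaultdict(list)

    def add(self, term, score, is_positive):
        self._scores[term].append((score, bool(is_positive)))

    def auc(self, term):
        pairs = self._scores.get(term, [])
        pos = [s for s, y in pairs if y]
        neg = [s for s, y in pairs if not y]
        if not pos or not neg:
            return None  # AUC is undefined without both classes
        # Count correctly ordered pairs; a tie counts as half a win
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    def by_term(self):
        return {term: self.auc(term) for term in self._scores}


if __name__ == "__main__":
    calc = TermAUC()
    for score, label in [(0.9, True), (0.1, False), (0.8, True), (0.85, False)]:
        calc.add("D1", score, label)
    print(calc.by_term())
```

The pairwise count is quadratic in the number of predictions per term, so for large terms a rank-based version would be the better choice.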

Hausdorff distance == likelihood

PageRank/social network analysis (especially for the author data!)

As for results,  seems that the validation set for pharma-chem/disease annotations is ZERO – I should generate the histogram of annotation over time (this is probably interesting in general) – Histogram is in progress.  Also potential alternate avenues are doing attachment of all MeSH terms rather than just disease,  or looking at the attachment of new pharmacological actions – txt/mesh/mesh_pharma.txt new entries.  Will need to compute chem<->all MeSH profiles

Also want to

2011 Build GO!

Database set up, PubMed files transferred – this only leaves Entrez Gene and the MeSH files to be grabbed (in integrator/Archive now!)

the getMeSH script hardcodes the year being grabbed, so had to switch these to the 2011 files.

Had to remember to set up the database files – if the database access script fails silently (as it does if you call a database that’s not in .dbrc) you get weird errors.

Looks like some of the previous builds are almost done…and looks like I might need to rebuild the geneRIF bits?

 

Still waiting for word on the paper – should probably follow up tonight?

Downloaded the 2011 PubMed files – need to set up a wcdb5 to house it.  Currently scp’ing the baseline over to chickenwire,  then need to move it into position for the build.  Also need versions of Entrez Gene, MeSH, etc…

ALSO, need to update the website and mention exactly which versions of which files are live on the databases.

Grabbed pubmed-chem-term.txt and put it into integrator/mesh-chem.  Will match against the drugbank database, get a list of non-matching pharma.  Also,  get list of compounds with pharmaco action, and see how much that loses.

Re: circular make.  Digenei4 seems to have choked with a “directory doesn’t exist” error for a directory that exists.  Maybe a node with a file system problem?

Suddenly realised that RECOMB is nearly upon us – if I want to put in some new figures, now’s the time!

Seems like there’s a circular/non-updating portion in the Makefile…Or hopefully it’s more that I’ve been twiddling with the makefiles so it’s been unhappy.  Here’s hoping a final build will suffice to finish things off.  And then it’s probably time to build a new version with the new pubmed…Let’s go and download that now, shall we?

Waiting on final confirmation for the paper revisions – will likely submit tomorrow.

Fixing silly keyErrors – unicode causing sed to barf, running the same regex through perl will hopefully fix that

digenei0 is making without errors, but doesn’t seem to be “done” – is there a circular dependency?!

Previously was using the PHP based PDL package,  but that seems to break once the number of articles gets large (once past 51 articles).

Installed R to the web server – we can run it directly using “R --vanilla --slave”,  but that seems more than a bit slow – getting results takes a good chunk of time, probably because it costs a few seconds to compute each p-value, and you have to do it for every one of the MeSH terms.  Maybe a couple of minutes to process them all.  Maybe I should look at figuring out how to batch it all up into one giant computation – maybe make an array that can be read in – which might allow for parallel processing, or at least save on loadup time for R.  Otherwise,  maybe there’s a lightweight stats package that could be used instead?
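
A sketch of the "batch it all up" idea: compute every term's p-value in one process instead of paying the startup cost once per term. This assumes the p-value is a hypergeometric enrichment test, which is a guess on my part since the test isn't written down here. All function and parameter names are hypothetical.

```python
from math import comb


def hypergeom_sf(k, n_corpus, n_term, n_selected):
    """P(X >= k): chance of seeing at least k term-carrying articles when
    n_selected articles are drawn from n_corpus, n_term of which carry the term."""
    upper = min(n_term, n_selected)
    total = comb(n_corpus, n_selected)
    tail = sum(
        comb(n_term, i) * comb(n_corpus - n_term, n_selected - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def batch_pvalues(term_hits, term_totals, n_selected, n_corpus):
    """Compute all p-values in one pass.

    term_hits:   {term: articles in the selected set carrying the term}
    term_totals: {term: articles in the whole corpus carrying the term}
    """
    return {
        term: hypergeom_sf(hits, n_corpus, term_totals[term], n_selected)
        for term, hits in term_hits.items()
    }


if __name__ == "__main__":
    print(batch_pvalues({"T1": 5, "T2": 0}, {"T1": 5, "T2": 5}, 5, 10))
```

The exact integer arithmetic is fine for modest counts but gets slow for a PubMed-sized corpus, so this is more a check on the batching structure than a replacement for a proper stats library.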

Paper compressing

From time to time I need to squeeze things in a paper to get things to fit.  I might as well list my usual ideas in case I need them in the future:

  • shorten paragraphs.  Particularly,  make sure the last line is as full as possible – if there’s only a couple hanging words,  perhaps a little editing can cut enough to save that line.
  • move figures as close as possible to text.  Also related is to crop the figures to eliminate empty space
  • remove sentences which duplicate content – any time a phrase is repeated,  see if it is really necessary, or if it is possible to rearrange to avoid extra occurrences
  • simplify – use shorter, more direct wording and avoid “weasel wording”; this helps clarity too
  • Oftentimes a lot of small short words are unnecessary
  • use common abbreviations/abbreviate long complex phrases
  • Avoid multiline titles/headings (since the font for those is big!)
  • watch out for extra carriage returns between sections

Solution for authors too big

  • Use only one score, and only keep the “highest k” – DO NOT save it all
  • To IMPLEMENT:  modify the profile comparison code to store only the top k lines
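
A minimal sketch of the "highest k" idea, assuming the comparison code can yield (score, line) pairs one at a time. That interface is made up for illustration. A bounded min-heap means at most k entries are ever held in memory, however many comparisons stream past.

```python
import heapq


def top_k_lines(scored_lines, k):
    """Keep only the k highest-scoring (score, line) pairs from an iterable."""
    if k <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest entry currently kept
    for score, line in scored_lines:
        if len(heap) < k:
            heapq.heappush(heap, (score, line))
        elif score > heap[0][0]:
            # New entry beats the weakest kept one, so swap it in
            heapq.heapreplace(heap, (score, line))
    return sorted(heap, reverse=True)


if __name__ == "__main__":
    stream = [(0.1, "a"), (0.9, "b"), (0.5, "c"), (0.7, "d")]
    print(top_k_lines(stream, 2))
```

Memory stays at O(k) per profile, which is the whole point when the author profiles are too big to save in full.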

Need to overlap pharm list with the drugbank list to make sure we’re not losing too many

  • Messed up something here it seems – only 302 of the drugbank generic names map to chem terms (ignoring case)
  • Actually – only 397 of the chem terms are mapping from chem-mesh-refs.txt
  • and only 827 of all-chem-refs.txt is mapping
  • CAS number matching 794 records
  • SIGH…looks like ~/drugcards-sorted is not properly sorted. BOO
  • OK NEW STATS
  • all-chem-refs matches 2666 of the drugcards
  • 1029 of pharma-chem matches the drugcards
  • 1731 are in all-chem but not in pharma-chem
  • 94 are in pharma-chem but not in all-chem…WAITASEC WHAT??
  • DOH – have to be careful on joins regarding whitespace, since names with spaces get split into fields – use the -t param to specify the break field (ie NO BREAK FIELD, so the whole line is the key)
  • 910 in all-chem match drugcards
  • 803 in pharma-chem match drugcards
  • surprisingly – 27 pharma-chem are NOT in all-chem??  Isn’t pharma-chem a subset of all-chem? or is all-chem-refs something else entirely vs pharma-chem-refs?
  • INTERESTING – pharma is NOT a subset of all-chem..although in the pipeline it is used as a filter so doesn’t quite matter.  weird though.
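
A sketch of a cross-check for these counts that sidesteps the sorting and whitespace pitfalls by comparing whole names as set members. It assumes each list can be read as one name per entry. The function names and the normalisation rule (collapse whitespace, ignore case) are my own choices, not what the pipeline does.

```python
def normalise(name):
    """Collapse runs of whitespace, trim the ends, and ignore case."""
    return " ".join(name.split()).casefold()


def overlap_report(drugcards, all_chem, pharma_chem):
    """Count overlaps between the three name lists using whole-name keys."""
    d = {normalise(x) for x in drugcards}
    a = {normalise(x) for x in all_chem}
    p = {normalise(x) for x in pharma_chem}
    return {
        "all_chem_in_drugcards": len(a & d),
        "pharma_chem_in_drugcards": len(p & d),
        # Should be empty if pharma-chem really were a subset of all-chem
        "pharma_not_in_all_chem": sorted(p - a),
    }


if __name__ == "__main__":
    print(overlap_report(
        ["Aspirin ", "Caffeine"],
        ["aspirin", "Ethanol"],
        ["ASPIRIN", "Warfarin  Sodium"],
    ))
```

Since sets need no sorted input and never split on internal spaces, any disagreement with the join-based numbers points back at the sort order or the field separator.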

Ref breaking

Seems that there’s some wonkiness happening when converting/merging some of the entries in

./txt/direct_gene_disease/all-short-author-refs.txt

Weird that it only appeared in digenei4 – maybe some weird names in recent author entries triggered this.

ALLGO\xc1\xba\x85ER, M

maps to

ALLGO|1<C1><BA><85>ER, M

but as can be seen above,  the “|1” portion is for some reason next to ALLGO rather than at the end.

Seems to be an issue with the UNIQ_COUNT?

TROUBLE SHOOT – checking out digenei0

For some reason, that version builds cleanly.  Maybe it’s just a delete and retry…WONKINESS

Looks like it’s fixed.  HUZZAH!

Profile Compute

Target is to do pharma to EVERYTHING today.  Should be doable since pharma is pretty small.

Currently duped the disease-chem block

Next – dupe the gene-disease block,  then add a pharma-pharma block.

Code looks straightforward so far – change the dependency inputs, ADD A BUILD DIRECTORY.

Well, code is written.  I guess I’ll wait till the previous build on the pharm-disease completes,  then run the full thing.

HMMM…noticed that the build directory is actually the profile splitting directory.  Thinking perhaps it might be more efficient to not have a separate split directory for each output, but rather just one split directory per split input.  This does mean that the cleanup procedure can’t delete the split files though…HMMMMM

 
