January 30, 2009 by warrenac
Paper: Skeleton of the methods section for the profile comparisons – we should update that with descriptions of the distance functions.
Fixes: Fixed a sorting error in the all-mesh-refs.txt code, which would have affected some p-value computations. Should double-check whether similar errors exist elsewhere (i.e. in the profile comparison code or the direct association code).
Computation: digenei3 is running at the OICR. Once that is done we can create a cmp-digenei2 that will compare digenei1 and digenei3. Hopefully the results will not change much. Also recomputing with respect to gene2pubmed (apparently this has never been done?), which will show the effect of the recent fixes on the results.
Cmp-digenei: extract the configuration information (ref source and directories) into a config.mk
Profile p-values TODO: Have a gene and disease background for the genes/disease profile p-values.
Term Fraction TODO: Somehow merge this with the p-values? Maybe have a “weighted p-value comparison”, or maybe weight the term fraction by the p-value? There must be some way to use the term fraction, which already works well, to boost overall effectiveness.
Data Examination: Look at the “low-scoring” tuples for each scoring method, to better identify how to fix those.
Alternate Validation: Check other papers – the “check each prediction and compare against X random others”, or check predictions vs. nearby genes.
Posted in Uncategorized | Leave a Comment »
January 21, 2009 by warrenac
This needs to be fixed – badly. txt/direct-gene-disease/all-mesh-refs seems to have a sorting problem, especially obvious around Antigens, CD (and the various CD#s). The -g option does not do what I thought it did when I rewrote the BIGSORT macro.
This does affect a pretty high level mesh term count though. Might be the ideal time to add in the specific backgrounds for gene, disease.
Update: Should be fixable – modification is to cut the file by field separator (the pipe) and specify the keys as 1,2. Hopefully, that should result in the expected behaviour.
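The fix described above (split on the pipe, sort on fields 1 and 2) can be sketched in Python. This only illustrates the intended ordering and is not the BIGSORT macro itself: the function name and the sample records are made up, and the two key fields are assumed to compare as plain text.

```python
def sort_by_first_two_fields(lines, sep="|"):
    """Sort records on their first two separator-delimited fields only."""
    def key(line):
        fields = line.rstrip("\n").split(sep)
        # Pad so that a record with a single field still yields a two-part key.
        fields += [""] * (2 - len(fields))
        return (fields[0], fields[1])
    return sorted(lines, key=key)


# Made-up records in a "term|id|rest" layout (the real layout is an assumption).
records = [
    "Antigens, CD19|200|x",
    "Antigens, CD|123|y",
    "Antigens, CD|045|z",
]
sorted_records = sort_by_first_two_fields(records)
```

Because the key is a tuple of fields, the separator character never takes part in the comparison, so all "Antigens, CD" rows stay together ahead of "Antigens, CD19".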
January 21, 2009 by warrenac
The “simple” feature selection method (based on Zipf’s law?)
Remove the N most common terms
Remove terms that appear in very few documents (document count at most M, with M = 1 or 2)
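The two rules above can be sketched as a small filter over document frequencies. This is a minimal illustration, not the project's code: the function name, the parameter names and the toy documents are all hypothetical, and a "document" is assumed to be a collection of MeSH terms.

```python
from collections import Counter


def select_terms(doc_terms, n_most_common, min_docs):
    """Keep terms that are neither among the N most common nor too rare.

    doc_terms: dict mapping a document id to an iterable of terms.
    """
    doc_freq = Counter()
    for terms in doc_terms.values():
        doc_freq.update(set(terms))  # count each term once per document

    # Rank by descending document frequency, breaking ties alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    too_common = {term for term, _ in ranked[:n_most_common]}

    return {
        term
        for term, count in doc_freq.items()
        if term not in too_common and count >= min_docs
    }


# Toy input, invented for illustration only.
docs = {
    1: ["Humans", "Neoplasms", "Apoptosis"],
    2: ["Humans", "Neoplasms"],
    3: ["Humans", "Apoptosis", "Rare Term"],
}
kept = select_terms(docs, n_most_common=1, min_docs=2)
```

Here the single most common term is dropped by the first rule and the term seen in only one document is dropped by the second.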
January 21, 2009 by warrenac
The comparison here is against genes linking to MeSH terms independently at random. So it may be more a question of choosing the background set well (or rather, the choice of background set determines what the resulting p-values mean). With a “gene-specific” background set (i.e. looking only at articles referenced by GeneRIFs/Gene2Pubmed), we get terms that are unusually overrepresented with respect to all genes. With a background of all disease papers, we get terms that are unusually common compared to drawing at random from that pool. Or we can consider it from the opposite perspective of terms that would be “lost”: in the gene background, high occurrences of “gene-related” terms are expected, so those terms stop standing out. Likewise, high-level “pathology” terms are lost if we use a disease background. This does seem like the behaviour we would like though.
January 21, 2009 by warrenac
We assume a hypergeometric background for the direct connections, but what does the background distribution really look like (and really, how would I figure this out)?
Interdependence between the MeSH terms – and ultimately, dimensionality reduction on the feature set (24k features is too much really). Maybe employ subspace clustering à la gene expression analysis.
Tag clouds – more dynamic range, and it’s worth cutting off the bottom enders and focusing on the top.
Label axes in the figures.
Keep an eye on BioBase – they seem to have a curated set of protein-disease relations.
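For reference, the hypergeometric assumption in the first note can be written out as an upper-tail p-value. This is a generic sketch of the standard test, not the project's implementation: the function name and argument names are hypothetical, and the counts in the example are invented.

```python
from math import comb


def hypergeom_upper_tail(k, population, tagged, draws):
    """P(X >= k) when drawing `draws` articles without replacement.

    population: articles in the chosen background set
    tagged:     background articles carrying the MeSH term
    draws:      articles linked to the gene (or disease)
    k:          how many of those linked articles carry the term
    """
    total = comb(population, draws)
    upper = min(tagged, draws)
    favourable = sum(
        comb(tagged, i) * comb(population - tagged, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Invented counts: 10 background articles, 3 with the term, 4 drawn.
p_at_least_one = hypergeom_upper_tail(1, 10, 3, 4)
p_all_three = hypergeom_upper_tail(3, 10, 3, 4)
```

Swapping in a different background only changes `population` and `tagged`, which is where a gene-specific or disease-specific background would enter.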
January 19, 2009 by warrenac
Now that I’ve got the cancer miniset and the TF gene miniset, let’s try profile2arff and hope for something reasonable!
python profile2arff.py tf-generif-gene-mesh-p.txt cancer-comesh-p.txt mesh_ids.txt curr-generif-hum-disease-validation-tuples.txt > tf-cancer-profile.arff
The upside – successful termination was achieved relatively quickly
Downside – resulting file is 35G – running some sanity checks but it looks like I might need to do some significant dimensionality reduction before using the profiles. Should be feasible since the parent-child relationship probably makes a lot of the parents redundant…
January 19, 2009 by warrenac
Looks like making a “totality” dataset is more than a bit out of the question. The plan now is to cut things down based on genes/diseases.
Ultimately, we could do something like split on diseases, and offer individual profiles for each disease subcategory (or likewise, for each gene category) – split it up, cluster compute each one separately.
Anyways, grabbing a partial list of TF IDs from
~/Warren/cs/research/integrator/digenei/tf-author
Next to do – scan mesh-parent for all terms with cancer as their parent
SELECT * FROM mesh_child WHERE term = 'Neoplasms';
Put into
cmp-digenei/mesh-cancer-list.txt
After it’s implemented, try pushing this analysis upstream to digenei direct_prediction – make this a subparse of disease/hum filtering
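If the table only stores direct parent-child links, collecting every term under Neoplasms needs a walk down the tree. The sketch below assumes a `mesh_child(term, child)` layout in SQLite. Only the table name and the `term` column come from the query above: the `child` column, the database engine and the toy rows are assumptions for illustration.

```python
import sqlite3


def descendants(conn, root):
    """Return every term reachable from `root` through mesh_child links."""
    seen = set()
    frontier = [root]
    while frontier:
        term = frontier.pop()
        rows = conn.execute(
            "SELECT child FROM mesh_child WHERE term = ?", (term,)
        ).fetchall()
        for (child,) in rows:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return sorted(seen)


# In-memory stand-in for the real database, with a few toy rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mesh_child (term TEXT, child TEXT)")
conn.executemany(
    "INSERT INTO mesh_child VALUES (?, ?)",
    [
        ("Neoplasms", "Neoplasms by Site"),
        ("Neoplasms by Site", "Breast Neoplasms"),
        ("Diseases", "Neoplasms"),
    ],
)
cancer_terms = descendants(conn, "Neoplasms")
```

Writing `cancer_terms` out one term per line would give a list file of the kind intended for cmp-digenei/mesh-cancer-list.txt.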
January 16, 2009 by warrenac
It’s running and pumping the output to cmp-digenei/hum-gene-disease-profile.arff via
nohup nice python profile2arff.py ../digenei1/txt/direct_gene_disease/hum-generif-gene-mesh-p.txt ../digenei1/txt/direct_gene_disease/disease-comesh-p.txt ../digenei1/txt/mesh/mesh_ids.txt txt/curr-generif-hum-disease-validation-tuples.txt > hum-gene-disease-profile.arff&
Real question is how big yon file will be when it terminates – it’s hitting 2.7G already.
Ideas for slices – take a subset of the diseases (ignore very high level …), take a subset of the columns (dimensionality reduction).
January 15, 2009 by warrenac
To convert profiles to ARFF files, we can use a Python script:
profile2arff.py [gene_profile] [disease_profile] [mesh_terms] [validation]
For each gene-disease pair, you get the gene MeSH terms and the disease MeSH terms, plus the Y/N validation label.
Can also make a separate set with the p-values.
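A rough sketch of what such a conversion could look like follows. It is not the actual profile2arff.py: the function name, the in-memory input shapes (term sets per gene and per disease), the attribute naming and the choice to emit every gene-disease pair are all assumptions made for illustration.

```python
def profiles_to_arff(gene_profiles, disease_profiles, mesh_terms, validated,
                     relation="gene_disease_profile"):
    """Build ARFF lines with one binary column per (side, MeSH term).

    gene_profiles / disease_profiles: dict mapping an id to a set of terms
    mesh_terms: ordered list of all terms (fixes the column order)
    validated:  set of (gene, disease) pairs known to be true
    """
    lines = ["@RELATION " + relation]
    for side in ("gene", "disease"):
        for index in range(len(mesh_terms)):
            lines.append("@ATTRIBUTE %s_term_%d {0,1}" % (side, index))
    lines.append("@ATTRIBUTE validated {Y,N}")
    lines.append("@DATA")

    for gene in sorted(gene_profiles):
        for disease in sorted(disease_profiles):
            row = ["1" if t in gene_profiles[gene] else "0" for t in mesh_terms]
            row += ["1" if t in disease_profiles[disease] else "0" for t in mesh_terms]
            row.append("Y" if (gene, disease) in validated else "N")
            lines.append(",".join(row))
    return lines


# Toy profiles, invented for illustration only.
arff_lines = profiles_to_arff(
    gene_profiles={"g1": {"A", "B"}, "g2": {"C"}},
    disease_profiles={"d1": {"B"}},
    mesh_terms=["A", "B", "C"],
    validated={("g1", "d1")},
)
```

In this dense layout every row carries two columns per MeSH term and there is one row per gene-disease pair, so the output grows with genes × diseases × terms.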
November 6, 2008 by warrenac
Integrated into makefile, currently building to txt/gene/
Posted in Uncategorized | 1 Comment »