Paper: Skeleton of the methods section for the profile comparisons – we should update that with descriptions of the distance functions.
Fixes: Fixed a sorting error in the all-mesh-refs.txt code that would have affected some p-value computations. Should double-check whether similar errors exist elsewhere (i.e. in the profile comparison code or the direct association code).
Computation: digenei3 is [...]
Archive for January, 2009
Status Update
Posted in Uncategorized on January 30, 2009
Bug Detected
Posted in Uncategorized on January 21, 2009
This needs to be fixed – badly. txt/direct-gene-disease/all-mesh-refs seems to have a sorting problem, especially obvious around Antigens, CD (and the various CD#s). The -g flag does not do what I thought it did when I rewrote the BIGSORT macro.
This does affect a pretty high level mesh term count though. Might be the ideal time to add [...]
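One way to hunt for similar sorting problems in other files is a small order check. The sketch below is not the BIGSORT macro; it assumes a tab-separated file whose first column is the sort key and that plain string comparison is the order the downstream code expects. The function name and the sample rows (including the counts) are made up for illustration.

```python
import io


def first_unsorted_line(lines, key=lambda line: line.split("\t", 1)[0]):
    """Return (line_number, previous_key, current_key) for the first line
    whose key sorts before the key of the line above it, or None if the
    input is already in non-decreasing key order."""
    previous = None
    for number, line in enumerate(lines, start=1):
        current = key(line.rstrip("\n"))
        if previous is not None and current < previous:
            return number, previous, current
        previous = current
    return None


# Hypothetical rows: the third key sorts before the second one.
sample = io.StringIO("Antigens, CD\t3\nAntigens, CD4\t7\nAntigens, CD3\t2\n")
problem = first_unsorted_line(sample)
print(problem)
```

Passing an open file handle instead of the `StringIO` object runs the same check on a real file without loading it into memory.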
Feature Selection
Posted in Uncategorized on January 21, 2009
The “simple” feature selection method (based on Zipf’s law?)
Remove the N most common terms
Remove terms that appear in very few documents (document count M of only 1 or 2)
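The two rules above can be sketched as a filter on document frequency. This is a minimal illustration, assuming each document is represented as a set of terms; `select_terms`, its parameters, and the toy documents are hypothetical and not taken from the project code.

```python
from collections import Counter


def select_terms(docs, n_most_common, min_docs):
    """Keep terms that are neither among the n_most_common terms (ranked by
    document frequency) nor present in fewer than min_docs documents."""
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    too_common = {term for term, _ in doc_freq.most_common(n_most_common)}
    return {
        term
        for term, count in doc_freq.items()
        if term not in too_common and count >= min_docs
    }


# Toy documents, for illustration only.
docs = [
    {"Humans", "Neoplasms", "p53"},
    {"Humans", "Neoplasms"},
    {"Humans", "Apoptosis"},
    {"Humans", "Neoplasms", "Apoptosis"},
]
kept = select_terms(docs, n_most_common=1, min_docs=2)
print(sorted(kept))
```

Here the single most frequent term ("Humans") and the term seen in only one document ("p53") are dropped. Ties at the cut-off for the most common terms are broken by first-seen order, which may or may not be what is wanted.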
Background Distribution
Posted in Uncategorized on January 21, 2009
The comparison here is against genes linking to MeSH terms independently at random. So it may be more a question of choosing the background set better (or rather, the choice of background set determines what the derived p-values mean). If we have a “gene-specific” background set, we get terms which are [...]
Group Meeting Notes
Posted in Uncategorized on January 21, 2009
We assume a hypergeometric background for the direct connections, but what does the background distribution really look like (and really, how would I figure this out)?
Interdependence between the MeSH terms – and ultimately, dimensionality reduction on the feature set (24k features is too much really). Maybe employ subspace clustering à la gene expression analysis.
Tag clouds – [...]
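For reference, the hypergeometric background assumed for the direct connections can be written out as an upper-tail probability. The mapping of symbols onto the data is an assumption here: N is the total number of references, K the number annotated with the MeSH term, n the number linked to the gene, and k the number carrying both. The function name is made up, and `math.comb` needs Python 3.8 or later.

```python
from math import comb


def hypergeom_upper_tail(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): the chance of seeing at
    least k marked items when drawing n items without replacement from a
    population of N items, K of which are marked."""
    upper = min(K, n)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Toy numbers: 10 references, 3 with the term, 4 linked to the gene.
p = hypergeom_upper_tail(1, N=10, K=3, n=4)
print(p)
```

If the real links are not independent draws (for example because MeSH terms co-occur), p-values from this model will be miscalibrated, which is the concern raised in the notes.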
Profile2Arff – Round II
Posted in Uncategorized on January 19, 2009
Now that I’ve got the cancer miniset and the tf gene miniset, let’s try profile2arff and hope for something reasonable!
python profile2arff.py tf-generif-gene-mesh-p.txt cancer-comesh-p.txt mesh_ids.txt curr-generif-hum-disease-validation-tuples.txt > tf-cancer-profile.arff
The upside – successful termination was achieved relatively quickly
Downside – resulting file is 35G – running some sanity checks but it looks like I might need to do some [...]
Filtration – TF/Cancer
Posted in Uncategorized on January 19, 2009
Looks like making a “totality” dataset is more than a bit out of the question; the plan now is to cut things down based on genes/diseases.
Ultimately, we could do something like split on diseases, and offer individual profiles for each disease subcategory (or likewise, for each gene category) – split it up, cluster compute [...]
Profile2Arff
Posted in Uncategorized on January 16, 2009
It’s running and pumping the output to cmp-digenei/hum-gene-disease-profile.arff via
nohup nice python profile2arff.py ../digenei1/txt/direct_gene_disease/hum-generif-gene-mesh-p.txt ../digenei1/txt/direct_gene_disease/disease-comesh-p.txt ../digenei1/txt/mesh/mesh_ids.txt txt/curr-generif-hum-disease-validation-tuples.txt > hum-gene-disease-profile.arff&
Real question is how big yon file will be when it terminates – it’s hitting 2.7G already.
Ideas for slices – take a subset of the diseases (ignore very high level …), take a subset of the columns (dimensionality [...]
Machine Learning ARFF file
Posted in Uncategorized on January 15, 2009
To convert profiles to ARFF files, we can use a Python script:
python profile2arff.py [gene_profile] [disease_profile] [mesh_terms] [validation]
For each gene-disease pair, you get the gene MeSH terms and the disease MeSH terms, plus the Y/N validation label.
Can also make a separate set with the p-values.
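The row layout described above can be illustrated with a small dense ARFF writer. This is not profile2arff.py itself: the in-memory dictionaries, the function name, the attribute naming, and the toy identifiers are all assumptions made for the sketch, and the real script reads its inputs from the four files listed on the command line.

```python
import io


def write_arff(out, relation, terms, gene_profiles, disease_profiles, labels):
    """Write one row per (gene, disease) pair: the gene's value for every
    term, then the disease's value for every term, then the Y/N label.
    Profiles are dicts mapping term -> value; missing terms are written as 0."""
    out.write("@relation %s\n\n" % relation)
    for prefix in ("gene", "disease"):
        for term in terms:
            out.write("@attribute %s_%s numeric\n" % (prefix, term))
    out.write("@attribute validated {Y,N}\n\n@data\n")
    for (gene, disease), label in sorted(labels.items()):
        g = gene_profiles.get(gene, {})
        d = disease_profiles.get(disease, {})
        row = [str(g.get(t, 0)) for t in terms] + [str(d.get(t, 0)) for t in terms]
        out.write(",".join(row + [label]) + "\n")


# Toy inputs with made-up identifiers.
buffer = io.StringIO()
write_arff(
    buffer,
    "toy",
    terms=["T1", "T2"],
    gene_profiles={"g1": {"T1": 1}},
    disease_profiles={"d1": {"T2": 1}},
    labels={("g1", "d1"): "Y"},
)
arff_text = buffer.getvalue()
print(arff_text)
```

Storing p-values instead of 0/1 indicators only changes the values placed in the profile dictionaries. Attribute names containing spaces or quotes would need ARFF quoting, which this sketch leaves out.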