Archive for January, 2009

Status Update

Paper:   Skeleton of the methods section for the profile comparisons – we should update that with descriptions of the distance functions.
Fixes: Fixed a sorting error in the all-mesh-refs.txt code, which would have affected some p-value computations. Should probably double-check whether similar errors exist elsewhere (i.e., in the profile comparison code? the direct association code?)
Computation:  digenei3 is [...]
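
One cheap way to do that double-check is to scan each sorted file and report the first place where the order breaks. This is only a sketch: it assumes the sort key is the first tab-separated field and that the intended order is plain code-point order, neither of which is confirmed for all-mesh-refs.txt. The function name is made up for illustration.

```python
def first_unsorted_line(lines, key=lambda line: line.split("\t", 1)[0]):
    """Return (line_number, previous_key, current_key) for the first
    out-of-order line, or None if the input is sorted.

    Assumption: the key is the first tab-separated field and the intended
    order is Python's default string order (code-point order).
    """
    prev = None
    for lineno, line in enumerate(lines, start=1):
        cur = key(line.rstrip("\n"))
        if prev is not None and cur < prev:
            return lineno, prev, cur
        prev = cur
    return None


if __name__ == "__main__":
    sample = ["Antigens, CD\t1\n", "Antigens, CD1\t2\n", "Antigens\t3\n"]
    print(first_unsorted_line(sample))
```

Passing an open file handle instead of a list works the same way, so the check can run over large files without loading them into memory.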

Read Full Post »

Bug Detected

This needs to be fixed – badly. txt/direct-gene-disease/all-mesh-refs seems to have a sorting problem, especially obvious around Antigens, CD (and the various CD#s). The -g option does not do what I thought it did when I rewrote the BIGSORT macro.
This does affect a pretty high-level MeSH term count though. Might be the ideal time to add [...]

Read Full Post »

Feature Selection

The “simple” feature selection method (based on Zipf’s law?):
Remove the N most common terms
Remove terms that appear in very few documents (document count M of only 1 or 2)
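
A minimal sketch of the two filtering steps above. It assumes each document is just a collection of terms and that "most common" is measured by document frequency rather than total occurrences; the function name and default thresholds are placeholders, not values from the actual pipeline.

```python
from collections import Counter


def select_features(docs, n_most_common=1, min_docs=2):
    """Return the set of terms kept after the two simple filters.

    docs: iterable of iterables of terms (one per document).
    n_most_common: drop this many of the highest document-frequency terms.
    min_docs: drop terms that occur in fewer than this many documents.
    """
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document

    too_common = {term for term, _ in doc_freq.most_common(n_most_common)}
    too_rare = {term for term, count in doc_freq.items() if count < min_docs}
    return set(doc_freq) - too_common - too_rare


if __name__ == "__main__":
    docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "d"}, {"a", "b", "c"}]
    print(sorted(select_features(docs, n_most_common=1, min_docs=2)))
```

Ties among the most common terms are broken by first-seen order, so N should be picked with that in mind.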

Read Full Post »

The comparison here is against genes linking to MeSH terms independently at random. So it may be more a question of choosing the background set better (or rather, the choice of background set determines what the derived p-values mean). If we have a “gene-specific” background set, we get terms which are [...]

Read Full Post »

Group Meeting Notes

We assume a hypergeometric background for the direct connections, but what does the background distribution really look like (and really, how would I figure this out)?
Interdependence between the MeSH terms – and ultimately, dimensionality reduction on the feature set (24k features is really too many). Maybe employ subspace clustering à la gene expression analysis.
Tag clouds – [...]
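
For reference, the hypergeometric assumption in the first note amounts to an upper-tail probability like the one sketched below. The parameter names (N, K, n, k) and the reading of them as document counts are assumptions for illustration; the actual pipeline may count something else.

```python
from math import comb


def hypergeom_pvalue(k, N, K, n):
    """Upper-tail probability P(X >= k) for X ~ Hypergeometric(N, K, n).

    Illustrative reading (an assumption, not taken from the pipeline):
      N: total number of documents in the background set
      K: documents annotated with the MeSH term
      n: documents linked to the gene
      k: documents linked to both
    """
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


if __name__ == "__main__":
    print(hypergeom_pvalue(k=5, N=10, K=5, n=5))
```

Changing the background set changes N and K, which is exactly why the choice of background determines what the resulting p-value means.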

Read Full Post »

Now that I’ve got the cancer miniset and the tf gene miniset, let’s try profile2arff and hope for something reasonable!
python profile2arff.py tf-generif-gene-mesh-p.txt cancer-comesh-p.txt mesh_ids.txt curr-generif-hum-disease-validation-tuples.txt > tf-cancer-profile.arff
The upside – successful termination was achieved relatively quickly
Downside – resulting file is 35G – running some sanity checks but it looks like I might need to do some [...]

Read Full Post »

Looks like making a “totality” dataset is more than a bit out of the question; the plan now is to cut things down based on genes/diseases.
Ultimately,  we could do something like split on diseases,  and offer individual profiles for each disease subcategory (or likewise,  for each gene category) – split it up,  cluster compute [...]

Read Full Post »

Profile2Arff

It’s running and pumping the output to cmp-digenei/hum-gene-disease-profile.arff via
 nohup nice python profile2arff.py ../digenei1/txt/direct_gene_disease/hum-generif-gene-mesh-p.txt ../digenei1/txt/direct_gene_disease/disease-comesh-p.txt ../digenei1/txt/mesh/mesh_ids.txt txt/curr-generif-hum-disease-validation-tuples.txt  > hum-gene-disease-profile.arff&
Real question is how big yon file will be when it terminates – it’s hitting 2.7G already.

Ideas for slices – take a subset of the diseases (ignore very high level …),  take a subset of the columns (dimensionality [...]

Read Full Post »

To convert profiles to ARFF files, we can use a Python script:
python profile2arff.py [gene_profile] [disease_profile] [mesh_terms] [validation]
You get, for each gene-disease pair, the gene MeSH terms and the disease MeSH terms, plus the Y/N validation label.
Can also make a separate set with the p-values.
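
A rough sketch of the idea, not the actual profile2arff.py. It assumes the profiles are already loaded as dicts mapping a gene or disease ID to a set of MeSH IDs, that the validation tuples are a set of (gene, disease) pairs, and that MeSH IDs are plain tokens that need no quoting in ARFF. All names here are hypothetical.

```python
def profiles_to_arff(gene_profiles, disease_profiles, mesh_terms, validated,
                     relation="gene_disease"):
    """Yield ARFF lines with one dense row per gene-disease pair.

    gene_profiles / disease_profiles: dict of id -> set of MeSH IDs (assumed shape).
    mesh_terms: ordered list of MeSH IDs defining the columns.
    validated: set of (gene, disease) pairs labelled Y; all others get N.
    """
    yield f"@RELATION {relation}"
    for prefix in ("gene", "disease"):
        for term in mesh_terms:
            yield f"@ATTRIBUTE {prefix}_{term} {{0,1}}"
    yield "@ATTRIBUTE validated {Y,N}"
    yield "@DATA"

    for gene, gene_terms in sorted(gene_profiles.items()):
        gene_row = ["1" if t in gene_terms else "0" for t in mesh_terms]
        for disease, disease_terms in sorted(disease_profiles.items()):
            disease_row = ["1" if t in disease_terms else "0" for t in mesh_terms]
            label = "Y" if (gene, disease) in validated else "N"
            yield ",".join(gene_row + disease_row + [label])


if __name__ == "__main__":
    genes = {"g1": {"D1"}}
    diseases = {"d1": {"D2"}, "d2": {"D1", "D2"}}
    for line in profiles_to_arff(genes, diseases, ["D1", "D2"], {("g1", "d2")}):
        print(line)
```

A dense encoding like this grows as (genes × diseases) rows by (2 × MeSH terms) columns, so ARFF's sparse row format or a reduced term list would be the obvious ways to keep the output small.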

Read Full Post »