
Status Update

Paper:   Skeleton of the methods section for the profile comparisons – we should update that with descriptions of the distance functions.

Fixes:  Fixed a sorting error in the all-mesh-refs.txt code, which would have affected some p-value computations.  Should probably double-check whether similar errors exist elsewhere (i.e., in the profile comparison code or the direct association code).

Computation:  digenei3 is running at the OICR; once that is done we can create a cmp-digenei2 that will compare digenei1 and digenei3.  Hopefully the results will not change much.  Recomputing with respect to gene2pubmed (apparently this has never been done?) will show the effect of the recent fixes on the results.

Cmp-digenei:  extract configuration information – ref source and directories as a config.mk

Profile p-values TODO:  Have a gene and disease background for the genes/disease profile p-values.

Term Fraction TODO:  Somehow merge this with the p-values?  Maybe have a “weighted p-value comparison”, or maybe weight the term fraction by the p-value?  There must be some way to use the term fraction's effectiveness to boost the overall effectiveness.

Data Examination:  Look at the “low-scoring” tuples for each scoring method, to better identify how to fix those.

Alternate Validation:  Check other papers – the “check each prediction and compare against X random others”,  or check predictions vs.  nearby genes.

Bug Detected

This needs to be fixed, badly.  txt/direct-gene-disease/all-mesh-refs seems to have a sorting problem, especially obvious around Antigens, CD (and the various CD#s).  The -g option does not do what I thought it did when I rewrote the BIGSORT macro.

This does affect a pretty high level mesh term count though.  Might be the ideal time to add in the specific backgrounds for gene, disease.

Update: Should be fixable – modification is to cut the file by field separator (the pipe)  and specify the keys as 1,2.  Hopefully,  that should result in the expected behaviour.
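A minimal sketch of the intended ordering, assuming the file is pipe-separated and that the first two fields are the sort keys, compared as plain strings. The function name and the sample rows are made up for illustration and are not taken from the real all-mesh-refs file.

```python
def sort_by_first_two_fields(lines, sep="|"):
    """Sort lines on fields 1 and 2 only, compared as plain strings."""
    def key(line):
        fields = line.rstrip("\n").split(sep)
        # Pad so that a line with a single field still yields a two-part key.
        fields += [""] * (2 - len(fields))
        return (fields[0], fields[1])
    return sorted(lines, key=key)


# Hypothetical rows: term | reference id | payload
rows = [
    "Antigens, CD8|222|x",
    "Antigens, CD|111|x",
    "Antigens, CD4|333|x",
    "Antigens, CD|100|x",
]
sorted_rows = sort_by_first_two_fields(rows)
for row in sorted_rows:
    print(row)
```

Splitting on the separator first means “Antigens, CD” is compared as a whole field, so it sorts ahead of “Antigens, CD4” and “Antigens, CD8” regardless of what follows the pipe.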

Feature Selection

The “simple” feature selection method (based on Zipf’s law?)

Remove the N most common terms

Remove terms that occur in very few documents (document count M below a small threshold, e.g. 1 or 2)
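The two rules above can be sketched roughly as follows. This assumes each document (or gene) is represented as a list of its MeSH terms; the function name, parameter names, defaults and toy data are all placeholders, not values from the actual pipeline.

```python
from collections import Counter


def select_terms(doc_terms, n_most_common=1, min_docs=2):
    """Keep terms that are neither among the N most common nor too rare."""
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(
        term for terms in doc_terms.values() for term in set(terms)
    )
    too_common = {term for term, _ in doc_freq.most_common(n_most_common)}
    return {
        term
        for term, freq in doc_freq.items()
        if freq >= min_docs and term not in too_common
    }


# Toy input, purely illustrative.
docs = {
    "doc1": ["Humans", "Neoplasms", "Apoptosis"],
    "doc2": ["Humans", "Neoplasms"],
    "doc3": ["Humans", "Zebrafish"],
}
kept = select_terms(docs, n_most_common=1, min_docs=2)
print(sorted(kept))
```

Here “Humans” is dropped as the most common term, and “Apoptosis” and “Zebrafish” are dropped for appearing in only one document each.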

Background Distribution

The comparison here is against genes linking to MeSH terms independently at random.  So it may be more a question of choosing a better background set (or rather, the choice of background set determines what the derived p-values mean).  If we have a “gene-specific” background set (i.e., if we look only at articles referenced by GeneRIFs/Gene2Pubmed), we get terms which are unusually overrepresented with respect to all genes.  If we instead use all disease papers as the background, the significant terms are those that are unusually common compared to drawing at random from that pool.  Or we can consider it from the opposite perspective of terms that would be “lost”: in the gene background, seeing high occurrences for “gene-related” terms is expected.  Likewise, high-level “pathology” terms are lost if we use a disease background.  This does seem like the behaviour we would like though.

Group Meeting Notes

We assume a hypergeometric background for the direct connections, but what does the background distribution really look like (and really, how would I figure this out)?
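For reference, a minimal sketch of the upper-tail p-value under a hypergeometric background. The mapping of the symbols onto the data is an assumption made for illustration: N articles in the chosen background set, K of them annotated with the MeSH term, n articles linked to the gene, and k of those carrying the term.

```python
from math import comb


def hypergeom_pvalue(k, N, K, n):
    """P(X >= k) for X ~ Hypergeometric(population N, K successes, n draws)."""
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible overlaps contribute nothing to the sum.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Toy numbers: 10 background articles, 4 with the term, 3 linked to the gene,
# at least 2 of those 3 carrying the term.
p = hypergeom_pvalue(k=2, N=10, K=4, n=3)
print(p)
```

Changing the background set changes N and K, which is exactly how the choice of background changes what the p-value means.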

Interdependence between the MeSH terms – and ultimately,  dimensionality reduction on the feature set (24k features is too much really).  Maybe employ subspace clustering à la gene expression analysis.

Tag clouds – more dynamic range,  and it’s worth cutting off the bottom enders and focusing on the top.

Label axes in the figures.

Keep an eye on BioBase – they seem to have a curated set of protein-disease relations.

Profile2Arff – Round II

Now that I’ve got the cancer miniset, and the tf gene miniset, let’s try profile2arff and hope for something reasonable!

python profile2arff.py tf-generif-gene-mesh-p.txt cancer-comesh-p.txt mesh_ids.txt curr-generif-hum-disease-validation-tuples.txt > tf-cancer-profile.arff

The upside – successful termination was achieved relatively quickly

Downside – resulting file is 35G – running some sanity checks but it looks like I might need to do some significant dimensionality reduction before using the profiles.   Should be feasible since the parent-child relationship probably makes a lot of the parents redundant…

Filtration – TF/Cancer

Looks like making a “totality” dataset is more than a bit out of the question,  the plan now will be to cut things down based on genes/diseases.

Ultimately,  we could do something like split on diseases,  and offer individual profiles for each disease subcategory (or likewise,  for each gene category) – split it up,  cluster compute each one separately.

Anyways,  grabbing a partial list of TF IDs from

~/Warren/cs/research/integrator/digenei/tf-author

Next to do: scan mesh-parent for all terms with cancer as their parent

SELECT * FROM mesh_child WHERE term='Neoplasms'

Put into 

cmp-digenei/mesh-cancer-list.txt

After it’s implemented, try pushing this analysis upstream to digenei direct_prediction – make this a subparse of disease/hum filtering

Profile2Arff

It’s running and pumping the output to cmp-digenei/hum-gene-disease-profile.arff via

nohup nice python profile2arff.py ../digenei1/txt/direct_gene_disease/hum-generif-gene-mesh-p.txt ../digenei1/txt/direct_gene_disease/disease-comesh-p.txt ../digenei1/txt/mesh/mesh_ids.txt txt/curr-generif-hum-disease-validation-tuples.txt > hum-gene-disease-profile.arff &

Real question is how big yon file will be when it terminates – it’s hitting 2.7G already.
Ideas for slices – take a subset of the diseases (ignore very high level …),  take a subset of the columns (dimensionality reduction).

To convert profiles to ARFF files,  we can use a python script

python profile2arff.py [gene_profile] [disease_profile] [mesh_terms] [validation]

You get, for each gene-disease pair, the gene MeSH terms and the disease MeSH terms, plus the Y/N validation label.
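A rough sketch of that layout, not the actual profile2arff.py: it assumes profiles are given as sets of MeSH IDs per gene and per disease, and that validation is a set of (gene, disease) pairs. The function name, the 0/1 encoding, and the toy IDs are all hypothetical.

```python
def profiles_to_arff(gene_profiles, disease_profiles, mesh_terms, validated,
                     relation="gene_disease_profile"):
    """Build ARFF text with one row per gene-disease pair."""
    lines = ["@RELATION " + relation, ""]
    # One binary attribute per MeSH term on each side of the pair.
    for side in ("gene", "disease"):
        for term in mesh_terms:
            lines.append("@ATTRIBUTE %s_%s {0,1}" % (side, term))
    lines.append("@ATTRIBUTE validated {Y,N}")
    lines += ["", "@DATA"]
    for gene in sorted(gene_profiles):
        for disease in sorted(disease_profiles):
            row = [str(int(t in gene_profiles[gene])) for t in mesh_terms]
            row += [str(int(t in disease_profiles[disease])) for t in mesh_terms]
            row.append("Y" if (gene, disease) in validated else "N")
            lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Toy input with made-up identifiers.
arff_text = profiles_to_arff(
    gene_profiles={"g1": {"D1"}},
    disease_profiles={"d1": {"D2"}},
    mesh_terms=["D1", "D2"],
    validated={("g1", "d1")},
)
print(arff_text)
```

With this dense layout every pair gets two columns per MeSH term, so the output grows with (genes × diseases) × (2 × terms), which may help explain why the files get so large.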

Can also make a separate set with the p-values.

Integrated into makefile,  currently building to txt/gene/
