Posts Tagged ‘mining’
Thoughts on “A few useful things to know about machine learning”
Thursday, February 14th, 2013
Some thoughts on a good paper giving intuition on machine learning approaches:
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
http://dl.acm.org/citation.cfm?id=2347755
In particular, the paper gives good intuition about:
– overfitting (e.g. how it’s related to multiple testing and to bias vs. variance)
– the curse of dimensionality (in high dimensions, all neighbors look nearly equidistant)
– the limited practical value of theoretical guarantees
– how very different decision frontiers can yield the same class predictions
– ensembles (which greatly reduce variance while only slightly increasing bias)
– ensembles vs. Bayesian model averaging (in practice the latter puts nearly all its weight on one model, so it essentially selects the best model rather than combining several)
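To make the curse-of-dimensionality point concrete, here is a minimal sketch (mine, not from the paper) that compares the nearest and farthest distance from a random query point to a set of uniform random points. The function names, the unit-cube data, and the sample size of 200 are all assumptions chosen for illustration.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Return (nearest, farthest) Euclidean distance from a random query
    point to n_points uniform random points in the unit cube [0, 1]^dim.

    Illustrative helper; the uniform data and sample size are assumptions.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return min(dists), max(dists)


def contrast_ratio(dim, **kwargs):
    """Nearest / farthest distance; values near 1 mean that all
    neighbors are almost equally far from the query."""
    nearest, farthest = distance_contrast(dim, **kwargs)
    return nearest / farthest


if __name__ == "__main__":
    for dim in (2, 10, 100, 1000):
        print(dim, round(contrast_ratio(dim), 3))
```

The ratio should climb toward 1 as the dimension grows, which is why nearest-neighbor style similarity becomes less informative in high dimensions.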
A few useful things to know about machine learning
Saturday, February 9th, 2013
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
http://dl.acm.org/citation.cfm?id=2347755
Digging for Drug Facts | October 2012 | Communications of the ACM
Saturday, February 9th, 2013
Inside the Secret World of the Data Crunchers Who Helped Obama Win
Sunday, November 11th, 2012
Competing on Analytics – Harvard Business Review
Sunday, November 11th, 2012
Exploring the human genome with functional maps.
Sunday, November 11th, 2012
This paper has: (1) large-scale datasets compiled from literature and databases, (2) comprehensive gold standards for positive and negative samples, (3) a classifier algorithm (regularized Bayesian), and (4) further analysis beyond “functional prediction”, including an interaction network. It predicts a list of genes with possible functions, which the authors have validated experimentally.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2694471/
Genome Res. 2009 Jun;19(6):1093-106. Epub 2009 Feb 26.
Exploring the human genome with functional maps.
Huttenhower C, Haley EM, Hibbs MA, Dumeaux V, Barrett DR, Coller HA, Troyanskaya OG.
Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields.
Monday, November 5th, 2012
This paper introduces a new method for detecting copy number variants in cancer genomes that addresses deficiencies of previous detection methods. The new method, dubbed HHCRF by the authors, adds the use of sequential correlations in selecting classification features for inferring copy numbers and identifying clinically relevant genes. This improvement results in higher accuracy on noisy data, and the identification of more clinically relevant genes, relative to previous methods. These results were obtained by testing HHCRF on both simulated array-CGH microarray data, and on actual breast cancer, uveal melanoma, and bladder tumor datasets.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2677736/
Bioinformatics. 2009 May 15;25(10):1307-13. Epub 2008 Dec 3. Aneuploidy prediction and tumor classification with heterogeneous hidden conditional random fields.
Barutcuoglu Z, Airoldi EM, Dumeaux V, Schapire RE, Troyanskaya OG.
Article: Graph startup Neo raises $11M as specialized databases take hold
Sunday, November 4th, 2012
http://gigaom.com/data/graph-startup-neo-raises-11m-as-specialized-databases-take-hold
See the open-source graph NoSQL database: http://neo4j.org/