Machine learning applications in genetics and genomics

The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, 98195–2350, Washington, USA
Maxwell W. Libbrecht & William Stafford Noble

Department of Genome Sciences, University of Washington, 3720 15th Ave NE, Seattle, 98195–5065, Washington, USA
William Stafford Noble

Maxwell W. Libbrecht