An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data. (English) Zbl 1283.62011

Summary: Because of its decreasing cost and high digital resolution, next-generation sequencing (NGS) is expected to replace traditional hybridization-based microarray technology. In genetic studies, the first step in analyzing NGS data is often to identify genomic variants among the sequenced samples. Several statistical models and tests have been developed for variant calling in NGS studies. The existing approaches, however, are based on either conventional Bayesian or frequentist methods, which cannot address multiplicity and testing efficiency simultaneously. In this paper, we derive an optimal empirical Bayes testing procedure to detect variants in NGS studies. We use the empirical Bayes technique to exploit information shared across the many testing sites in NGS data. We prove that our testing procedure is valid and optimal in the sense that it rejects the maximum number of nonnulls while controlling the Bayesian false discovery rate at a given nominal level. We show through both simulation studies and real data analysis that the procedure can greatly improve testing efficiency over existing frequentist approaches, which fail to pool and use information across the multiple testing sites.
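
The decision rule described in the summary (reject as many sites as possible while keeping the Bayesian false discovery rate at a nominal level) can be illustrated with a generic step-up rule on posterior null probabilities. The sketch below is not the authors' procedure: it assumes that a posterior probability of being null has already been estimated for each site by some empirical Bayes model, and the function name `bayesian_fdr_rejections` and its arguments are hypothetical.

```python
from typing import List, Sequence


def bayesian_fdr_rejections(null_probs: Sequence[float], alpha: float) -> List[int]:
    """Return the indices of rejected sites, most significant first.

    null_probs[i] is an ASSUMED, already estimated posterior probability
    that site i is null (how it is estimated is outside this sketch).
    Sites are rejected in order of increasing null probability for as long
    as the running average of the rejected probabilities, which estimates
    the Bayesian FDR of the rejection set, stays at or below alpha.
    """
    # Rank sites from most to least likely to be a true variant.
    order = sorted(range(len(null_probs)), key=lambda i: null_probs[i])

    rejected: List[int] = []
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += null_probs[idx]
        # The running mean is nondecreasing because the probabilities are
        # sorted, so the first violation marks the largest valid set.
        if running_sum / rank > alpha:
            break
        rejected.append(idx)
    return rejected


if __name__ == "__main__":
    # Made-up posterior null probabilities for five sites.
    probs = [0.01, 0.9, 0.03, 0.5, 0.02]
    print(bayesian_fdr_rejections(probs, alpha=0.05))
```

Because the sites are ranked by a quantity that pools information from all sites through the estimated prior, a rule of this shape can reject more sites than a per-site test at the same nominal level, which is the efficiency gain the summary refers to.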

MSC:

62C12 Empirical decision procedures; empirical Bayes procedures
92C40 Biochemistry, molecular biology
62P10 Applications of statistics to biology and medical sciences; meta analysis
65C60 Computational problems in statistics (MSC2010)

Software:

vipR; SNVer; GATK; Samtools

References:

[1] Altmann, A., Weber, P., Quast, C., Rex-Haffner, M., Binder, E. B. and Müller-Myhsok, B. (2011). vipR: Variant identification in pooled DNA using R. Bioinformatics 27 i77-i84.
[2] Amaral, A. J., Ferretti, L., Megens, H.-J., Crooijmans, R. P. M. A., Nie, H., Ramos-Onsins, S. E., Perez-Enciso, M., Schook, L. B. and Groenen, M. A. M. (2011). Genome-wide footprints of pig domestication and selection revealed through massive parallel sequencing of pooled DNA. PLoS ONE 6 e14782.
[3] Bansal, V. (2010). A statistical method for the detection of variants from next-generation resequencing of DNA pools. Bioinformatics 26 i318-i324.
[4] Benjamini, Y. and Heller, R. (2008). Screening for partial conjunction hypotheses. Biometrics 64 1215-1222. · Zbl 1152.62045 · doi:10.1111/j.1541-0420.2007.00984.x
[5] Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B Stat. Methodol. 57 289-300. · Zbl 0809.62014
[6] Bodmer, W. and Bonilla, C. (2008). Common and rare variants in multifactorial susceptibility to common diseases. Nat. Genet. 40 695-701.
[7] Calvo, S. E., Tucker, E. J., Compton, A. G., Kirby, D. M., Crawford, G., Burtt, N. P., Rivas, M., Guiducci, C., Bruno, D. L., Goldberger, O. A., Redman, M. C., Wiltshire, E., Wilson, C. J., Altshuler, D., Gabriel, S. B., Daly, M. J., Thorburn, D. R. and Mootha, V. K. (2010). High-throughput, pooled sequencing identifies mutations in NUBPL and FOXRED1 in human complex I deficiency. Nat. Genet. 42 851-858.
[8] Cheng, C., White, B. J., Kamdem, C., Mockaitis, K., Costantini, C., Hahn, M. W. and Besansky, N. J. (2012). Ecological genomics of Anopheles gambiae along a latitudinal cline: A population-resequencing approach. Genetics 190 1417-1432.
[9] Craig, D. W., Pearson, J. V., Szelinger, S., Sekar, A., Redman, M., Corneveaux, J. J., Pawlowski, T. L., Laub, T., Nunn, G., Stephan, D. A., Homer, N. and Huentelman, M. J. (2008). Identification of genetic variants using bar-coded multiplexed sequencing. Nat. Methods 5 887-893.
[10] Daye, Z. J., Li, H. and Wei, Z. (2012). A powerful test for multiple rare variants association studies that incorporates sequencing qualities. Nucleic Acids Res. 40 e60.
[11] Druley, T. E., Vallania, F. L. M., Wegner, D. J., Varley, K. E., Knowles, O. L., Bonds, J. A., Robison, S. W., Doniger, S. W., Hamvas, A., Cole, F. S., Fay, J. C. and Mitra, R. D. (2009). Quantification of rare allelic variants from pooled genomic DNA. Nat. Methods 6 263-265.
[12] Efron, B. (2005). Bayesians, frequentists, and scientists. J. Amer. Statist. Assoc. 100 1-5. · Zbl 1117.62325 · doi:10.1198/016214505000000033
[13] Efron, B. (2008). Microarrays, empirical Bayes and the two-groups model. Statist. Sci. 23 1-22. · Zbl 1327.62046 · doi:10.1214/07-STS236
[14] Efron, B. (2010). Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Institute of Mathematical Statistics (IMS) Monographs 1. Cambridge Univ. Press, Cambridge. · Zbl 1277.62016
[15] Efron, B. and Morris, C. (1971). Limiting the risk of Bayes and empirical Bayes estimators. I. The Bayes case. J. Amer. Statist. Assoc. 66 807-815. · Zbl 0229.62003 · doi:10.2307/2284231
[16] Efron, B. and Morris, C. (1973). Stein’s estimation rule and its competitors-An empirical Bayes approach. J. Amer. Statist. Assoc. 68 117-130. · Zbl 0275.62005 · doi:10.2307/2284155
[17] Efron, B. and Morris, C. N. (1975). Data analysis using Stein’s estimator and its generalizations. J. Amer. Statist. Assoc. 70 311-319. · Zbl 0319.62018 · doi:10.2307/2285814
[18] Efron, B., Tibshirani, R., Storey, J. D. and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. J. Amer. Statist. Assoc. 96 1151-1160. · Zbl 1073.62511 · doi:10.1198/016214501753382129
[19] Elshire, R. J., Glaubitz, J. C., Sun, Q., Poland, J. A., Kawamoto, K., Buckler, E. S. and Mitchell, S. E. (2011). A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS ONE 6 e19379.
[20] Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh. · JFM 51.0414.08
[21] Frazer, K. A., Murray, S. S., Schork, N. J. and Topol, E. J. (2009). Human genetic variation and its contribution to complex traits. Nat. Rev. Genet. 10 241-251.
[22] Genovese, C. and Wasserman, L. (2002). Operating characteristics and extensions of the false discovery rate procedure. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 499-517. · Zbl 1090.62072 · doi:10.1111/1467-9868.00347
[23] Hayden, E. C. (2008). International genome project launched. Nature 451 378-379.
[24] He, L., Sarkar, S. K. and Zhao, Z. (2012). Capturing the severity of type II errors in high-dimensional multiple testing. Technical report. · Zbl 1327.62432
[25] Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S. and Manolio, T. A. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. USA 106 9362-9367.
[26] Huang, X., Feng, Q., Qian, Q., Zhao, Q., Wang, L., Wang, A., Guan, J., Fan, D., Weng, Q., Huang, T., Dong, G., Sang, T. and Han, B. (2009). High-throughput genotyping by whole-genome resequencing. Genome Res. 19 1068-1076.
[27] Kolaczkowski, B., Kern, A. D., Holloway, A. K. and Begun, D. J. (2011). Genomic differentiation between temperate and tropical Australian populations of Drosophila melanogaster. Genetics 187 245-260.
[28] Lander, E. S. (2011). Initial impact of the sequencing of the human genome. Nature 470 187-197.
[29] Li, B. and Leal, S. M. (2009). Discovery of rare variants via sequencing: Implications for the design of complex trait association studies. PLoS Genet. 5 e1000481.
[30] Li, H., Ruan, J. and Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res. 18 1851-1858.
[31] Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G., Durbin, R. and 1000 Genome Project Data Processing Subgroup (2009a). The sequence alignment/map format and SAMtools. Bioinformatics 25 2078-2079.
[32] Li, R., Li, Y., Fang, X., Yang, H., Wang, J., Kristiansen, K. and Wang, J. (2009b). SNP detection for massively parallel whole-genome resequencing. Genome Res. 19 1124-1132.
[33] Manolio, T. A., Collins, F. S., Cox, N. J., Goldstein, D. B., Hindorff, L. A., Hunter, D. J., McCarthy, M. I., Ramos, E. M., Cardon, L. R., Chakravarti, A., Cho, J. H., Guttmacher, A. E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C. N., Slatkin, M., Valle, D., Whittemore, A. S., Boehnke, M., Clark, A. G., Eichler, E. E., Gibson, G., Haines, J. L., Mackay, T. F. C., McCarroll, S. A. and Visscher, P. M. (2009). Finding the missing heritability of complex diseases. Nature 461 747-753.
[34] Mardis, E. R. (2011). A decade’s perspective on DNA sequencing technology. Nature 470 198-203.
[35] Margraf, R. L., Durtschi, J. D., Dames, S., Pattison, D. C., Stephens, J. E. and Voelkerding, K. V. (2011). Variant identification in multi-sample pools by Illumina Genome Analyzer sequencing. J. Biomol. Tech. 22 74-84.
[36] McKenna, A., Hanna, M., Banks, E., Sivachenko, A., Cibulskis, K., Kernytsky, A., Garimella, K., Altshuler, D., Gabriel, S., Daly, M. and DePristo, M. A. (2010). The genome analysis toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20 1297-1303.
[37] Momozawa, Y., Mni, M., Nakamura, K., Coppieters, W., Almer, S., Amininejad, L., Cleynen, I., Colombel, J.-F., de Rijk, P., Dewit, O., Finkel, Y., Gassull, M. A., Goossens, D., Laukens, D., Lémann, M., Libioulle, C., O’Morain, C., Reenaers, C., Rutgeerts, P., Tysk, C., Zelenika, D., Lathrop, M., Del-Favero, J., Hugot, J.-P., de Vos, M., Franchimont, D., Vermeire, S., Louis, E. and Georges, M. (2011). Resequencing of positional candidates identifies low frequency IL23R coding variants protecting against inflammatory bowel disease. Nat. Genet. 43 43-47.
[38] Morris, C. N. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). J. Amer. Statist. Assoc. 78 47-65. · Zbl 0506.62005 · doi:10.2307/2287098
[39] Nejentsev, S., Walker, N., Riches, D., Egholm, M. and Todd, J. A. (2009). Rare variants of IFIH1, a gene implicated in antiviral responses, protect against type 1 diabetes. Science 324 387-389.
[40] Norton, N., Williams, N. M., O’Donovan, M. C. and Owen, M. J. (2004). DNA pooling as a tool for large-scale association studies in complex traits. Ann. Med. 36 146-152.
[41] Out, A. A., van Minderhout, I. J. H. M., Goeman, J. J., Ariyurek, Y., Ossowski, S., Schneeberger, K., Weigel, D., van Galen, M., Taschner, P. E. M., Tops, C. M. J., Breuning, M. H., van Ommen, G.-J. B., den Dunnen, J. T., Devilee, P. and Hes, F. J. (2009). Deep sequencing to reveal new variants in pooled DNA samples. Hum. Mutat. 30 1703-1712.
[42] Prabhu, S. and Pe’er, I. (2009). Overlapping pools for high-throughput targeted resequencing. Genome Res. 19 1254-1261.
[43] Robbins, H. (1951). Asymptotically subminimax solutions of compound statistical decision problems. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, 1950 131-148. Univ. California Press, Berkeley and Los Angeles. · Zbl 0044.14803
[44] Robbins, H. (1956). An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954-1955, Vol. I 157-163. Univ. California Press, Berkeley and Los Angeles.
[45] Sarkar, S. K., Zhou, T. and Ghosh, D. (2008). A general decision theoretic formulation of procedures controlling FDR and FNR from a Bayesian perspective. Statist. Sinica 18 925-945. · Zbl 1149.62003
[46] Sham, P., Bader, J. S., Craig, I., O’Donovan, M. and Owen, M. (2002). DNA pooling: A tool for large-scale association studies. Nat. Rev. Genet. 3 862-871.
[47] Smith, A. M., Heisler, L. E., Onge, R. P. S., Farias-Hesson, E., Wallace, I. M., Bodeau, J., Harris, A. N., Perry, K. M., Giaever, G., Pourmand, N. and Nislow, C. (2010). Highly-multiplexed barcode sequencing: An efficient method for parallel analysis of pooled samples. Nucleic Acids Res. 38 e142.
[48] Storey, J. D. (2003). The positive false discovery rate: A Bayesian interpretation and the \(q\)-value. Ann. Statist. 31 2013-2035. · Zbl 1042.62026 · doi:10.1214/aos/1074290335
[49] Sun, W. and Cai, T. T. (2007). Oracle and adaptive compound decision rules for false discovery rate control. J. Amer. Statist. Assoc. 102 901-912. · Zbl 1469.62318 · doi:10.1198/016214507000000545
[50] Sun, W. and Cai, T. T. (2009). Large-scale multiple testing under dependence. J. R. Stat. Soc. Ser. B Stat. Methodol. 71 393-424. · Zbl 1248.62005 · doi:10.1111/j.1467-9868.2008.00694.x
[51] Sun, W. and Wei, Z. (2011). Multiple testing for pattern identification, with applications to microarray time-course experiments. J. Amer. Statist. Assoc. 106 73-88. · Zbl 1396.62261 · doi:10.1198/jasa.2011.ap09587
[52] Turner, T. L., Bourne, E. C., Wettberg, E. J. V., Hu, T. T. and Nuzhdin, S. V. (2010). Population resequencing reveals local adaptation of Arabidopsis lyrata to serpentine soils. Nat. Genet. 42 260-263.
[53] Turner, T. L., Stewart, A. D., Fields, A. T., Rice, W. R. and Tarone, A. M. (2011). Population-based resequencing of experimentally evolved populations reveals the genetic basis of body size variation in Drosophila melanogaster. PLoS Genet. 7 e1001336.
[54] Vallania, F. L. M., Druley, T. E., Ramos, E., Wang, J., Borecki, I., Province, M. and Mitra, R. D. (2010). High-throughput discovery of rare insertions and deletions in large cohorts. Genome Res. 20 1711-1718.
[55] Wang, W., Wei, Z. and Sun, W. (2010). Simultaneous set-wise testing under dependence, with applications to genome-wide association studies. Stat. Interface 3 501-511. · Zbl 1245.62160 · doi:10.4310/SII.2010.v3.n2.a8
[56] Wei, Z., Sun, W., Wang, K. and Hakonarson, H. (2009). Multiple testing in genome-wide association studies via hidden Markov models. Bioinformatics 25 2802-2808.
[57] Wei, Z., Wang, W., Hu, P., Lyon, G. J. and Hakonarson, H. (2011). SNVer: A statistical tool for variant calling in analysis of pooled or individual next-generation sequencing data. Nucleic Acids Res. 39 e132.
[58] Xie, J., Cai, T. T., Maris, J. and Li, H. (2011). Optimal false discovery rate control for dependent data. Stat. Interface 4 417-430. · Zbl 1245.62091 · doi:10.4310/SII.2011.v4.n4.a1
[59] Zhao, Z., Wang, W. and Wei, Z. (2013). Supplement to “An empirical Bayes testing procedure for detecting variants in analysis of next generation sequencing data.” · Zbl 1283.62011
[60] Zhu, Y., Bergland, A. O., González, J. and Petrov, D. A. (2012). Empirical validation of pooled whole genome population re-sequencing in Drosophila melanogaster. PLoS ONE 7 e41901.