
Fast nonseparable Gaussian stochastic process with application to methylation level interpolation. (English) Zbl 07499253

Summary: The Gaussian stochastic process (GaSP) has been widely used as a prior over functions because of its flexibility and tractability in modeling. However, evaluating the likelihood costs \(O(n^3)\) operations, where \(n\) is the number of observed points in the process, because it requires inverting the covariance matrix. This bottleneck prevents GaSP from being widely used on large-scale data. We propose a general class of nonseparable GaSP models for multiple functional observations, together with a fast and exact algorithm whose computational cost is linear (\(O(n)\)) and which requires no approximation to compute the likelihood. We show that the commonly used linear regression and separable models are special cases of the proposed nonseparable GaSP model. In an epigenetic application, the proposed nonseparable GaSP model accurately predicts genome-wide DNA methylation levels and compares favorably with alternative methods such as linear regression, random forest, and localized Kriging. The algorithm for fast computation is implemented in the FastGaSP R package on CRAN. Supplemental materials for this article are available online.
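The contrast between the \(O(n^3)\) and \(O(n)\) likelihood evaluations can be illustrated with a minimal Python sketch. It is not the algorithm of the article or the FastGaSP implementation. It assumes a single zero-mean function observed with Gaussian noise at sorted inputs, with an exponential covariance \(\sigma^2 \exp(-|d|/\gamma)\), for which the process is Markov and a scalar Kalman filter gives the exact likelihood in one pass. The function names `dense_loglik` and `kalman_loglik` and the parameters `gamma`, `sigma2`, and `noise_var` are hypothetical names chosen for this illustration.

```python
import numpy as np

def dense_loglik(x, y, gamma, sigma2, noise_var):
    """Direct O(n^3) Gaussian log-likelihood via a Cholesky factorization."""
    n = len(x)
    # Exponential covariance between all pairs of inputs, plus noise on the diagonal.
    K = sigma2 * np.exp(-np.abs(x[:, None] - x[None, :]) / gamma)
    C = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L, y)  # alpha @ alpha equals y^T C^{-1} y
    return (-0.5 * alpha @ alpha
            - np.log(np.diag(L)).sum()  # equals -0.5 * log det C
            - 0.5 * n * np.log(2.0 * np.pi))

def kalman_loglik(x, y, gamma, sigma2, noise_var):
    """O(n) log-likelihood of the same model via a scalar Kalman filter.

    Assumes x is sorted in increasing order.
    """
    loglik = 0.0
    m, P = 0.0, sigma2  # stationary prior mean and variance of the latent state
    for i in range(len(x)):
        if i > 0:
            # Exact transition of the latent process between neighboring inputs.
            rho = np.exp(-(x[i] - x[i - 1]) / gamma)
            m = rho * m
            P = rho**2 * P + sigma2 * (1.0 - rho**2)
        S = P + noise_var  # one-step predictive variance of y[i]
        r = y[i] - m       # one-step prediction error
        loglik += -0.5 * (np.log(2.0 * np.pi * S) + r * r / S)
        k = P / S          # Kalman gain
        m = m + k * r
        P = (1.0 - k) * P
    return loglik

# Illustrative data (synthetic, not from the article).
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 10.0, size=200))
y = rng.standard_normal(200)

ll_dense = dense_loglik(x, y, gamma=1.5, sigma2=2.0, noise_var=0.1)
ll_fast = kalman_loglik(x, y, gamma=1.5, sigma2=2.0, noise_var=0.1)
print(ll_dense, ll_fast)
```

Under these assumptions the two functions evaluate the same likelihood, so the printed values should agree up to floating-point error, while the second loop touches each observation only once.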

MSC:

62-XX Statistics
