Bayesian semiparametric analysis for two-phase studies of gene-environment interaction. (English) Zbl 1454.62308

Summary: The two-phase sampling design is a cost-efficient way of collecting expensive covariate information on a judiciously selected subsample. It is natural to apply such a strategy for collecting genetic data in a subsample enriched for exposure to environmental factors for gene-environment interaction (\(G\times E\)) analysis. In this paper, we consider two-phase studies of \(G\times E\) interaction where phase I data are available on exposure, covariates and disease status. Stratified sampling is done to prioritize individuals for genotyping at phase II conditional on disease and exposure. We consider a Bayesian analysis based on the joint retrospective likelihood of phases I and II data. We address several important statistical issues: (i) we consider a model with multiple genes, environmental factors and their pairwise interactions. We employ a Bayesian variable selection algorithm to reduce the dimensionality of this potentially high-dimensional model; (ii) we use the assumption of gene-gene and gene-environment independence to trade off between bias and efficiency for estimating the interaction parameters through use of hierarchical priors reflecting this assumption; (iii) we posit a flexible model for the joint distribution of the phase I categorical variables using the nonparametric Bayes construction of D. B. Dunson and C. Xing [J. Am. Stat. Assoc. 104, No. 487, 1042–1051 (2009; Zbl 1388.62151)]. We carry out a small-scale simulation study to compare the proposed Bayesian method with weighted likelihood and pseudo-likelihood methods that are standard choices for analyzing two-phase data. The motivating example originates from an ongoing case-control study of colorectal cancer, where the goal is to explore the interaction between the use of statins (a drug used for lowering lipid levels) and 294 genetic markers in the lipid metabolism/cholesterol synthesis pathway. 
The subsample of cases and controls on which these genetic markers were measured is enriched in terms of statin users. The example and simulation results illustrate that the proposed Bayesian approach has a number of advantages for characterizing joint effects of genotype and exposure over existing alternatives and makes efficient use of all available data in both phases.

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
62F15 Bayesian inference
62-08 Computational methods for problems pertaining to statistics

Citations:

Zbl 1388.62151

Software:

CGEN; Survey; CODA

References:

[1] Agresti, A. (2002). Categorical Data Analysis, 2nd ed. Wiley, New York. · Zbl 1018.62002
[2] Ahn, J., Mukherjee, B., Gruber, S. B. and Ghosh, M. (2013). Supplement to “Bayesian semiparametric analysis for two-phase studies of gene-environment interaction.” . · Zbl 1454.62308
[3] Amundadottir, L., Kraft, P., Stolzenberg-Solomon, R. Z., Fuchs, C. S., Petersen, G. M., Arslan, A. A. et al. (2009). Genome-wide association study identifies variants in the ABO locus associated with susceptibility to pancreatic cancer. Nat. Genet. 41 986-990.
[4] Bhattacharjee, S., Chatterjee, N. and Wheeler, W. (2011). An R package for analysis of case-control studies in genetic epidemiology. Package CGEN, Version 1.0.0. Available at .
[5] Bhattacharya, A. and Dunson, D. B. (2012). Simplex factor models for multivariate unordered categorical data. J. Amer. Statist. Assoc. 107 362-377. · Zbl 1263.62097 · doi:10.1080/01621459.2011.646934
[6] Breslow, N. E. and Cain, K. C. (1988). Logistic regression for two-stage case-control data. Biometrika 75 11-20. · Zbl 0635.62110 · doi:10.1093/biomet/75.1.11
[7] Breslow, N. E. and Chatterjee, N. (1999). Design and analysis of two-phase studies with binary outcome applied to Wilms tumour prognosis. J. R. Stat. Soc. Ser. C. Appl. Stat. 48 457-468. · Zbl 0957.62091 · doi:10.1111/1467-9876.00165
[8] Breslow, N. E. and Holubkov, R. (1997a). Maximum likelihood estimation of logistic regression parameters under two-phase, outcome-dependent sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. 59 447-461. · Zbl 0886.62071 · doi:10.1111/1467-9868.00078
[9] Breslow, N. E. and Holubkov, R. (1997b). Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Stat. Med. 16 103-116. · Zbl 0886.62071
[10] Chatterjee, N. and Carroll, R. J. (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika 92 399-418. · Zbl 1094.62136 · doi:10.1093/biomet/92.2.399
[11] Chatterjee, N., Chen, Y.-H. and Breslow, N. E. (2003). A pseudoscore estimator for regression problems with two-phase sampling. J. Amer. Statist. Assoc. 98 158-168. · Zbl 1047.62031 · doi:10.1198/016214503388619184
[12] Chatterjee, N. and Chen, Y.-H. (2007). Maximum likelihood inference on a mixed conditionally and marginally specified regression model for genetic epidemiologic studies with two-phase sampling. J. R. Stat. Soc. Ser. B Stat. Methodol. 69 123-142. · Zbl 1120.62096 · doi:10.1111/j.1467-9868.2007.00580.x
[13] Cochran, W. G. (1963). Sampling Techniques. Wiley, New York. · Zbl 0051.10707
[14] Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categorical data. J. Amer. Statist. Assoc. 104 1042-1051. · Zbl 1388.62151 · doi:10.1198/jasa.2009.tm08439
[15] Durt, T. (2010). Experimental proposal for testing the emergence of environment induced (EIN) classical selection rules with biological systems. Studia Logica 95 259-277. · Zbl 1202.81008 · doi:10.1007/s11225-010-9247-5
[16] Flanders, W. D. and Greenland, S. (1991). Analytic methods for 2-stage case-control studies and other stratified designs. Stat. Med. 10 739-747.
[17] Gelman, A. and Rubin, D. B. (1992). Inference from iterative simulation using multiple sequences (with discussion). Statist. Sci. 7 457-472. · Zbl 1386.65060
[18] Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6 721-741. · Zbl 0573.62030 · doi:10.1109/TPAMI.1984.4767596
[19] George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. J. Amer. Statist. Assoc. 88 881-889.
[20] Hachem, C., Morgan, R., Johnson, M., Muebeler, M. and El-Serag, H. (2009). Statins and the risk of colorectal carcinoma: A nested case-control study in veterans with diabetes. Am. J. Gastroenterol. 104 1241-1248.
[21] Haneuse, S. and Chen, J. (2011). A multiphase design strategy for dealing with participation bias. Biometrics 67 309-318. · Zbl 1216.62167 · doi:10.1111/j.1541-0420.2010.01419.x
[22] Haneuse, S. J.-P. A. and Wakefield, J. C. (2007). Hierarchical models for combining ecological and case-control data. Biometrics 63 128-136, 312. · Zbl 1124.62085 · doi:10.1111/j.1541-0420.2006.00673.x
[23] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663-685. · Zbl 0047.38301 · doi:10.2307/2280784
[24] Hunter, D. J., Kraft, P., Jacobs, K. B., Cox, D. G., Yeager, M. et al. (2007). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nat. Genet. 39 870-874.
[25] Ishwaran, H. and Rao, J. S. (2003). Detecting differentially expressed genes in microarrays using Bayesian model selection. J. Amer. Statist. Assoc. 98 438-455. · Zbl 1041.62090 · doi:10.1198/016214503000224
[26] Lawless, J. F., Kalbfleisch, J. D. and Wild, C. J. (1999). Semiparametric methods for response-selective and missing data problems in regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 61 413-438. · Zbl 0915.62030 · doi:10.1111/1467-9868.00185
[27] Lee, A. J., Scott, A. J. and Wild, C. J. (2010). Efficient estimation in multi-phase case-control studies. Biometrika 97 361-374. · Zbl 1406.62139 · doi:10.1093/biomet/asq009
[28] Li, D. and Conti, D. V. (2009). Interactions using a combined case-only and case-control approach. Am. J. Epidemiol. 169 497-504.
[29] Lipkin, S. M. et al. (2010). Genetic variation in 3-hydroxy-3-methylglutaryl CoA reductase modifies the chemopreventive activity of statins for colorectal cancer. Cancer Prev. Res. (Phila) 3 597-603.
[30] Little, R. J. A. and Rubin, D. B. (2002). Statistical Analysis with Missing Data, 2nd ed. Wiley, Hoboken, NJ. · Zbl 1011.62004
[31] Lumley, T. (2011). R for analyzing data from complex surveys. Package Survey, Version 3.2.4. Available at .
[32] Manski, C. F. and Lerman, S. R. (1977). The estimation of choice probabilities from choice based samples. Econometrica 45 1977-1988. · Zbl 0372.62094 · doi:10.2307/1914121
[33] Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83 1023-1036. · Zbl 0673.62051 · doi:10.2307/2290129
[34] Mukherjee, B. and Chatterjee, N. (2008). Exploiting gene-environment independence for analysis of case-control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics 64 685-694. · Zbl 1190.62185 · doi:10.1111/j.1541-0420.2007.00953.x
[35] Mukherjee, B., Ahn, J., Gruber, S. B., Rennert, G., Moreno, V. and Chatterjee, N. (2008). Testing gene-environment interaction from case-control data: A novel study of type-1 error, power and designs. Genet. Epidemiol. 32 615-626.
[36] Mukherjee, B., Ahn, J., Gruber, S. B., Ghosh, M. and Chatterjee, N. (2010). Bayesian sample size determination for case-control studies of gene-environment interaction. Biometrics 66 934-948. · Zbl 1202.62162 · doi:10.1111/j.1541-0420.2009.01357.x
[37] Müller, P., Parmigiani, G., Schildkraut, J. and Tardella, L. (1999). A Bayesian hierarchical approach for combining case-control and prospective studies. Biometrics 55 858-866. · Zbl 1059.62681 · doi:10.1111/j.0006-341X.1999.00858.x
[38] Murcray, C. E., Lewinger, J. P. and Gauderman, W. J. (2009). Gene-environment interaction in genome-wide association studies. Am. J. Epidemiol. 169 219-226.
[39] Neyman, J. (1938). Contribution to the theory of sampling from human populations. J. Amer. Statist. Assoc. 33 101-116. · Zbl 0018.22603 · doi:10.2307/2279117
[40] Park, J. H., Wacholder, S., Gail, M. H., Peters, U., Jacobs, K. B., Chanock, S. J. and Chatterjee, N. (2010). Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nat. Genet. 42 570-575.
[41] Pfeiffer, R. M. and Gail, M. H. (2003). Sample size calculations for population- and family-based case-control association studies on marker genotypes. Genet. Epidemiol. 25 136-148.
[42] Piegorsch, W. W., Weinberg, C. R. and Taylor, J. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Stat. Med. 13 153-162.
[43] Plummer, M., Best, N., Cowles, K. and Vines, K. (2009). Output analysis and diagnostics for MCMC. Package CODA, Version 0.13-4. Available at .
[44] Poynter, J. N., Gruber, S. B., Higgins, P. D. R., Almog, R., Bonner, J. D., Rennert, H. S., Low, M., Greenson, J. K. and Rennert, G. (2005). Statins and the risk of colorectal cancer. N. Engl. J. Med. 352 2184-2192.
[45] Reilly, M. and Pepe, M. S. (1995). A mean score method for missing and auxiliary covariate data in regression models. Biometrika 82 299-314. · Zbl 0828.62097 · doi:10.1093/biomet/82.2.299
[46] Robins, J. M., Rotnitzky, A. and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. J. Amer. Statist. Assoc. 89 846-866. · Zbl 0815.62043 · doi:10.2307/2290910
[47] Schill, W., Jöckel, K. H., Drescher, K. and Timm, J. (1993). Logistic analysis in case-control studies under validation sampling. Biometrika 80 339-352. · Zbl 0783.62097 · doi:10.1093/biomet/80.2.339
[48] Scott, A. J. and Wild, C. J. (1997). Fitting regression models to case-control data by maximum likelihood. Biometrika 84 57-71. · Zbl 1058.62505 · doi:10.1093/biomet/84.1.57
[49] Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statist. Sinica 4 639-650. · Zbl 0823.62007
[50] Umbach, D. M. and Weinberg, C. R. (1997). Designing and analysing case-control studies to exploit independence of genotype and exposure. Stat. Med. 11 259-272.
[51] Vansteelandt, S., VanderWeele, T. J. and Robins, J. M. (2008). Multiply robust inference for statistical interactions. J. Amer. Statist. Assoc. 103 1693-1704. · Zbl 1286.62033 · doi:10.1198/016214508000001084
[52] Wacholder, S., Hartge, P., Prentice, R., Garcia-Closas, M. et al. (2010). Performance of common genetic variants in breast-cancer risk models. N. Engl. J. Med. 362 986-993.
[53] Walker, S. G. (2007). Sampling the Dirichlet mixture model with slices. Comm. Statist. Simulation Comput. 36 45-54. · Zbl 1113.62058 · doi:10.1080/03610910601096262
[54] Whittemore, A. S. and Halpern, J. (1998). Multi-stage sampling in genetic epidemiology. Stat. Med. 16 153-167.
[55] Yeager, M., Orr, N., Hayes, R. B., Jacobs, K. B., Kraft, P., Wacholder, S. et al. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nat. Genet. 39 645-649.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.