Robust prediction of domain compositions from uncertain data using isometric logratio transformations in a penalized multivariate Fay-Herriot model. (English) Zbl 1541.62035

Summary: Assessing regional population compositions is an important task in many research fields. Small area estimation with generalized linear mixed models is a powerful tool for this purpose. However, the method has limitations in practice. When the data are subject to measurement errors, small area models produce inefficient or biased results since they cannot account for data uncertainty. This is particularly problematic for composition prediction, since generalized linear mixed models often rely on approximate likelihood inference, so the obtained predictions are not reliable. We propose a robust multivariate Fay-Herriot model to solve these issues. It combines compositional data analysis with robust optimization theory. The nonlinear estimation of compositions is restated as a linear problem through isometric logratio transformations. Robust model parameter estimation is performed via penalized maximum likelihood. A robust best predictor is derived. Simulations are conducted to demonstrate the effectiveness of the approach. An application to alcohol consumption in Germany is provided.
{© 2021 The Authors. Statistica Neerlandica published by John Wiley & Sons Ltd on behalf of Netherlands Society for Statistics and Operations Research.}
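
To illustrate the isometric logratio (ilr) step named in the summary, the following is a minimal sketch of an ilr transform and its inverse. It is not the authors' implementation: the function names `ilr` and `ilr_inv` and the choice of a Helmert-type (pivot) orthonormal basis are assumptions made here for illustration, and the paper may use a different basis.

```python
import math


def ilr(x):
    """Map a composition x (strictly positive parts) to len(x) - 1 real coordinates.

    Assumed basis: coordinate i contrasts the geometric mean of the first i
    parts with part i + 1, scaled by sqrt(i / (i + 1)).
    """
    logs = [math.log(p) for p in x]
    z = []
    for i in range(1, len(x)):
        mean_first = sum(logs[:i]) / i  # log of geometric mean of first i parts
        z.append(math.sqrt(i / (i + 1)) * (mean_first - logs[i]))
    return z


def ilr_inv(z):
    """Map ilr coordinates back to a composition whose parts sum to one."""
    d = len(z) + 1
    clr = [0.0] * d  # centered logratio representation
    for i in range(1, d):
        s = math.sqrt(i / (i + 1))
        for k in range(i):
            clr[k] += z[i - 1] * s / i
        clr[i] -= z[i - 1] * s
    w = [math.exp(c) for c in clr]
    total = sum(w)
    return [v / total for v in w]


shares = [0.2, 0.3, 0.5]        # hypothetical three-part composition
coords = ilr(shares)            # two unconstrained real coordinates
recovered = ilr_inv(coords)     # back on the simplex
```

Because the coordinates are unconstrained real numbers, a linear model can be fitted to them and its predictions mapped back to shares with `ilr_inv`.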

MSC:

62D05 Sampling theory, sample surveys
62J12 Generalized linear models (logistic models)
62F40 Bootstrap, jackknife and other resampling methods
62P25 Applications of statistics to social sciences

Software:

glmnet

References:

[1] Aitchison, J. (1986). The statistical analysis of compositional data. Boca Raton, FL: Chapman & Hall. · Zbl 0688.62004
[2] Ames, G., & Cunradi, C. (2004). Alcohol use and preventing alcohol‐related problems among young adults in the military. Alcohol Research & Health, 28(4), 252-257.
[3] Arima, S., Bell, W. R., Datta, G. S., Franco, C., & Liseo, B. (2017). Multivariate Fay‐Herriot Bayesian estimation of small area means under functional measurement error. Journal of the Royal Statistical Society Series A (Statistics in Society), 180(4), 1191-1209.
[4] Beard, E., Brown, J., West, R., Kaner, E., Meier, P., & Michie, S. (2019). Associations between socio‐economic factors and alcohol consumption: A population survey of adults in England. PLoS One, 14(2), e0209442.
[5] Benavent, R., & Morales, D. (2016). Multivariate Fay‐Herriot for small area estimation. Computational Statistics and Data Analysis, 94, 372-390. · Zbl 1468.62026
[6] Benavent, R., & Morales, D. (2021). Small area estimation under a temporal bivariate area‐level linear mixed model with independent time effects. Statistical Methods and Applications, 30(1), 195-222. · Zbl 1474.62438
[7] Ben‐Tal, A., El Ghaoui, L., & Nemirovski, A. (2009). Robust optimization (Vol. 28). Princeton, NJ: Princeton University Press. · Zbl 1221.90001
[8] Bertsimas, D., Brown, D., & Caramanis, C. (2011). Theory and applications of robust optimization. SIAM Review, 53(3), 464-501. · Zbl 1233.90259
[9] Bertsimas, D., & Copenhaver, M. S. (2018). Characterization of the equivalence of robustification and regularization in linear and matrix regression. European Journal of Operational Research, 270(3), 931-942. · Zbl 1403.62040
[10] Breslow, N. E., & Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88(421), 9-25. · Zbl 0775.62195
[11] Britton, A., Ben‐Shlomo, Y., Benzeval, M., Kuh, D., & Bell, S. (2015). Life course trajectories of alcohol consumption in the United Kingdom using longitudinal data from nine cohort studies. BMC Medicine, 13(47), 1-9. https://doi.org/10.1186/s12916‐015‐0273‐z · doi:10.1186/s12916‐015‐0273‐z
[12] Burgard, J. P., Esteban, M., Morales, D., & Pérez, A. (2020a). A Fay‐Herriot model when auxiliary variables are measured with error. TEST, 29, 166-195. · Zbl 1439.62055
[13] Burgard, J. P., Esteban, M., Morales, D., & Pérez, A. (2020b). Small area estimation under a measurement bivariate Fay‐Herriot model. Statistical Methods and Applications, 30(1), 79-108. · Zbl 1478.62020
[14] Burgard, J. P., Krause, J., Kreber, D., & Morales, D. (2020). The generalized equivalence of regularization and min‐max robustification in linear mixed models. Statistical Papers. https://doi.org/10.1007/s00362‐020‐01214‐z · Zbl 1483.62119 · doi:10.1007/s00362‐020‐01214‐z
[15] Chambers, R., Salvati, N., & Tzavidis, N. (2016). Semiparametric small area estimation for binary outcomes with application to unemployment estimation for local authorities in the UK. Journal of the Royal Statistical Society Series A (Statistics in Society), 179(2), 453-479.
[16] Chen, S., & Lahiri, P. (2012). Inferences on small area proportions. Journal of the Indian Society of Agricultural Statistics, 66, 121-124. · Zbl 07906665
[17] Connelly, R., Gayle, V., & Lambert, P. S. (2016). Ethnicity and ethnic groups measures in social survey research. Methodological Innovations, 9, 1-10.
[18] Cotto, J. H., Davis, E., Dowling, G. J., Elcano, J. C., Staton, A. B., & Weiss, S. R. B. (2010). Gender effects on drug use, abuse, and dependence: A special analysis of results from the national survey on drug use and health. Gender Medicine, 7(5), 402-413.
[19] Cougle, J. R., Hakes, J. K., Macatee, R. J., Zvolensky, M. J., & Chavarria, J. (2016). Probability and correlates of dependence among regular users of alcohol, nicotine, cannabis, and cocaine: Concurrent and prospective analysis of the national epidemiologic survey on alcohol and related conditions. The Journal of Clinical Psychiatry, 77(4), 444-450.
[20] Crum, R. M., Helzer, J. E., & Anthony, J. C. (1993). Level of education and alcohol abuse and dependence in adulthood: A further inquiry. American Journal of Public Health, 83(6), 830-837.
[21] Egozcue, J. J., & Pawlowsky‐Glahn, V. (2019). Compositional data: The sample space and its structure. TEST, 28, 599-638. · Zbl 1428.62220
[22] Egozcue, J. J., Pawlowsky‐Glahn, V., Mateu‐Figueras, G., & Barceló‐Vidal, C. (2003). Isometric logratio transformations for compositional data analysis. Mathematical Geology, 35(3), 279-300. · Zbl 1302.86024
[23] El Ghaoui, L., & Lebret, H. (1997). Robust solutions to least‐squares problems with uncertain data. SIAM Journal on Matrix Analysis and Applications, 18(4), 1035-1064. · Zbl 0891.65039
[24] Erciulescu, A. L., & Fuller, W. A. (2013). Small area prediction of the mean of a binomial random variable. JSM Proceedings ‐ Survey Research Methods Section. 855-863, Alexandria, VA.
[25] Esteban, M. D., Lombardía, M. J., López‐Vizcaíno, E., Morales, D., & Pérez, A. (2020). Small area estimation of proportions under area‐level compositional mixed models. TEST, 29(3), 793-818. · Zbl 1458.62020
[26] Faltys, O., Hobza, T., & Morales, D. (2020). Small area estimation under area‐level generalized linear mixed models. Communications in Statistics ‐ Simulation and Computation, 44. https://doi.org/10.1080/03610918.2020.1836216 · Zbl 07632274 · doi:10.1080/03610918.2020.1836216
[27] Fan, J., Han, F., & Liu, H. (2014). Challenges of big data analysis. National Science Review, 1(2), 293-314.
[28] Friedman, J., Hastie, T., Höfling, H., & Tibshirani, R. (2007). Pathwise coordinate optimization. Annals of Applied Statistics, 1(2), 302-332. · Zbl 1378.90064
[29] Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1-22.
[30] Ghosh, M., Kim, D., Sinha, K., Maiti, T., Katzoff, M., & Parsons, V. L. (2009). Hierarchical and empirical Bayes small domain estimation and proportion of persons without health insurance for minority subpopulations. Survey Methodology, 35, 53-66.
[31] Goebel, J., Krause, P., Pischner, R., Sieber, I., & Wagner, G. G. (2008). Daten‐ und datenbankstruktur der längsschnittstudie sozio‐oekonomisches panel (soep). [Online]. SOEPpapers on Multidisciplinary Panel Data Research 89, DIW Berlin, The German Socio‐Economic Panel (SOEP).
[32] González‐Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Bootstrap mean squared error of a small‐area eblup. Journal of Statistical Computation and Simulation, 78(5), 443-462. · Zbl 1274.62094
[33] González‐Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2010). Small area estimation under Fay‐Herriot models with nonparametric estimation of heteroscedasticity. Statistical Modelling, 10(2), 215-239. · Zbl 07256823
[34] Hall, P., & Maiti, T. (2006). Nonparametric estimation of mean‐squared prediction error in nested‐error regression models. The Annals of Statistics, 34(4), 1733-1750. · Zbl 1246.62106
[35] Hapke, U., Hanisch, C., Ohlmeier, C., & Rumpf, H.‐J. (2009). Epidemiologie des Alkoholkonsums bei älteren Menschen in Privathaushalten: Ergebnisse des telefonischen Gesundheitssurvey 2007. SUCHT, 55(5), 281-291.
[36] Hart, J., & Alston, J. M. (2019). Persistent patterns in the U.S. alcohol market: Looking at the link between demographics and drinking. Journal of Wine Economics, 14(4), 356-364.
[37] Henkel, D. (2011). Unemployment and substance abuse: A review of the literature (1990‐2010). Current Drug Abuse Reviews, 4(1), 4-27.
[38] Hobza, T., Marhuenda, Y., & Morales, D. (2020). Small area estimation of additive parameters under unit‐level generalized linear mixed models. SORT, 44(1), 3-38. · Zbl 1442.62169
[39] Hobza, T., & Morales, D. (2016). Empirical best prediction under unit‐level logit mixed models. Journal of Official Statistics, 32(3), 661-692.
[40] Hobza, T., Morales, D., & Santamaría, L. (2018). Small area estimation of poverty proportions under unit‐level temporal binomial‐logit mixed models. TEST, 27, 270-294. · Zbl 1404.62075
[41] Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1), 55-67. · Zbl 0202.17205
[42] Jiang, J. (2003). Empirical best prediction for small‐area inference based on generalized linear mixed models. Journal of Statistical Planning and Inference, 111(1‐2), 117-127. · Zbl 1033.62067
[43] Krause, J. (2019). Regularization methods for statistical modelling in small area estimation (Ph. D. thesis). Trier University. https://doi.org/10.25353/ubtr‐xxxx‐de9f‐02c8 · doi:10.25353/ubtr‐xxxx‐de9f‐02c8
[44] Lange, C., Jentsch, F., Allen, J., Hoebel, J., Kratz, A. L., von der Lippe, E., … Ziese, T. (2015). Data resource profile: German health update (geda) ‐ the health interview survey for adults in Germany. International Journal of Epidemiology, 44(2), 442-450.
[45] López‐Vizcaíno, E., Lombardía, M. J., & Morales, D. (2015). Small area estimation of labour force indicators under a multinomial model with correlated time and area effects. Journal of the Royal Statistical Society Series A (Statistics in Society), 178(3), 535-565.
[46] López‐Vizcaíno, E., Lombardía, M. J., & Morales, D. (2013). Multinomial‐based small area estimation of labour force indicators. Statistical Modelling, 13(2), 153-178. · Zbl 07257453
[47] Maiti, T., Ren, H., & Sinha, A. (2014). Prediction error of small area predictors shrinking both means and variances. Scandinavian Journal of Statistics, 41, 775-790. · Zbl 1309.62024
[48] Markovsky, I., & Van Huffel, S. (2007). Overview of total least‐squares methods. Signal Processing, 87, 2283-2302. · Zbl 1186.94229
[49] Militino, A. F., Ugarte, M. D., & Goicoa, T. (2015). Deriving small area estimates from information technology business surveys. Journal of the Royal Statistical Society Series A (Statistics in Society), 178(4), 1051-1067.
[50] Mills, T. C. (2018). Is there convergence in national alcohol consumption patterns? Evidence from a compositional time series approach. Journal of Wine Economics, 13(1), 92-98.
[51] Molina, I., Saei, A., & Lombardía, M. J. (2007). Small area estimates of labour force participation under a multinomial logit mixed model. Journal of the Royal Statistical Society Series A (Statistics in Society), 170(4), 975-1000.
[52] Morais, J., Thomas‐Agnan, C. M., & Simioni, M. (2018). Interpretation of explanatory variables impacts in compositional regression models. Austrian Journal of Statistics, 47(5), 1-25 CoDaWork 2017.
[53] Morales, D., Esteban, M. D., Pérez, A., & Hobza, T. (2021). A course on small area estimation and mixed models: Methods, theory and applications in R. Statistics for Social and Behavioral Sciences. New York, NY: Springer. · Zbl 1464.62009
[54] O’Malley, P. M. (2004). Maturing out of problematic alcohol use. Alcohol Research & Health, 28(4), 202-204.
[55] Patel, K., Savchenko, Y., & Vella, F. (2013). Chapter 12. Occupational sorting of ethnic groups. In A. Constant & K. Zimmermann (Eds.), International handbook of the economics of migration (pp. 227-241). Cheltenham: Edward Elgar Publishing.
[56] Pawlowsky‐Glahn, V., & Buccianti, A. (2011). Compositional data analysis: Theory and applications. Hoboken, NJ: John Wiley & Sons. · Zbl 1103.62111
[57] Rao, J. N. K., & Molina, I. (2015). Small area estimation (2nd ed.). Wiley Series in Survey Methodology. Hoboken, NJ: John Wiley & Sons. · Zbl 1323.62002
[58] Rehm, J., Gmel, G., Sempos, C. T., & Trevisan, M. (2003). Alcohol‐related morbidity and mortality. Alcohol Research & Health, 27(1), 39-51.
[59] Robert Koch Institute (2012). Daten und Fakten: Ergebnisse der Studie “Gesundheit in Deutschland aktuell 2010”. Beiträge zur Gesundheitsberichterstattung des Bundes.
[60] Robert Koch Institute (2013). German health update 2010 (GEDA 2010). Public use file third version. [Online]. https://doi.org/10.7797/27‐200910‐1‐1‐3. · doi:10.7797/27‐200910‐1‐1‐3
[61] Scealy, J. L., & Welsh, A. H. (2017). A directional mixed effects model for compositional expenditure data. Journal of the American Statistical Association, 112(517), 24-36.
[62] Schröder, H., Brückner, G., Schüssel, K., Breitkreuz, J., Schlotmann, A., & Günster, C. (2020). Monitor: Gesundheitliche Beeinträchtigungen ‐ Vorerkrankungen mit erhöhtem Risiko für schwere Verläufe von COVID‐19. Verbreitung in der Bevölkerung Deutschlands und seinen Regionen. [Online]. https://doi.org/10.13140/RG.2.2.14946.27841. · doi:10.13140/RG.2.2.14946.27841
[63] Schumm, J. A., & Chard, K. M. (2012). Alcohol and stress in the military. Alcohol Research: Current Reviews, 34(4), 401-407.
[64] Singh, T. (2011). Efficient small area estimation in the presence of measurement error in covariates (Ph. D. thesis). Texas A&M University.
[65] Sugasawa, S., Tamae, H., & Kubokawa, T. (2017). Bayesian estimators for small area models shrinking both means and variances. Scandinavian Journal of Statistics, 44, 150-167. · Zbl 1361.62008
[66] Tabassum, M. N., & Ollila, E. (2017). Pathwise least angle regression and significance testing for the elastic net. Paper presented at: Proceedings of the 25th European Signal Processing Conference (EUSIPCO). Kos, Greece. Retrieved from https://www.eurasip.org/Proceedings/Eusipco/Eusipco2017/papers/1570347287.pdf
[67] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B (Methodological), 58(1), 267-288. · Zbl 0850.62538
[68] Ubaidillah, A., Notodiputro, K. A., Kurnia, A., & Wayan, I. (2019). Multivariate Fay‐Herriot models for small area estimation with application to household consumption per capita expenditure in Indonesia. Journal of Applied Statistics, 46(15), 2845-2861. · Zbl 1516.62632
[69] Wagner, G. G., Frick, J. R., & Schupp, J. (2007). The German socio‐economic panel study (soep): Scope, evolution and enhancements. [Online]. SOEPpapers on Multidisciplinary Panel Data Research 1, DIW Berlin, The German Socio‐Economic Panel (SOEP).
[70] Wang, C., Hu, J., Blaser, M. J., & Li, H. (2020). Estimating and testing the microbial causal mediation effect with high‐dimensional and compositional microbiome data. Bioinformatics, 36(2), 347-355.
[71] Wang, H., Liu, Q., Mok, H. M. K., Fu, L., & Tse, W. M. (2007). A hyperspherical transformation forecasting model for compositional data. European Journal of Operational Research, 179(2), 459-468. · Zbl 1114.90049
[72] Wood, J. (2008). On the covariance between related Horvitz‐Thompson estimators. Journal of Official Statistics, 24(1), 53-78.
[73] Ybarra, L. M. R., & Lohr, S. L. (2008). Small area estimation when auxiliary information is measured with error. Biometrika, 95, 919-931. · Zbl 1437.62666
[74] Zagheni, E., & Weber, I. (2015). Demographic research with non‐representative internet data. International Journal of Manpower, 36(1), 13-25.
[75] Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society Series B (Methodological), 67(2), 301-320. · Zbl 1069.62054
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.