A survey of differentially private regression for clinical and epidemiological research. (English) Zbl 1527.62082

Int. Stat. Rev. 89, No. 1, 132-147 (2021); correction ibid. 89, No. 2, 433 (2021).
Summary: Differential privacy is a framework for data analysis that provides rigorous privacy protections for database participants. It has increasingly been accepted as the gold standard for privacy in the analytics industry, yet there are few techniques suitable for statistical inference in the health sciences. This is notably the case for regression, one of the most widely used modelling tools in clinical and epidemiological studies. This paper provides an overview of differential privacy and surveys the literature on differentially private regression, highlighting the techniques that hold the most relevance for statistical inference as practiced in clinical and epidemiological research. Research gaps and opportunities for further inquiry are identified.
{© 2020 International Statistical Institute}
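As context for the techniques the survey covers: most differentially private regression methods build on the basic idea of perturbing a statistic with noise calibrated to its global sensitivity (the Laplace mechanism of Dwork et al., 2006b, reference [24] below). A minimal illustrative sketch for a privately released mean, with function names, bounds, and data that are our own and not from the paper:

```python
import random

def laplace_noise(scale: float) -> float:
    # A Laplace(0, scale) draw, obtained as the difference of two
    # independent exponentials with mean `scale`.
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def dp_mean(values, epsilon, lower, upper):
    """Release the mean of `values` under epsilon-differential privacy."""
    n = len(values)
    # Clamp each record to the known range so that changing one record
    # moves the mean by at most (upper - lower) / n: the global sensitivity.
    clamped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clamped) / n + laplace_noise(sensitivity / epsilon)
```

A larger epsilon means less noise and weaker privacy; choosing epsilon is itself a subject of several cited works (e.g. [36], [49], [59]).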

MSC:

62P10 Applications of statistics to biology and medical sciences; meta analysis
68P27 Privacy of data

Software:

GUPT; PrivGene
Full Text: DOI

References:

[1] Agresti, A. (2013). Categorical Data Analysis, 3rd ed. Wiley‐Interscience: Hoboken, NJ. · Zbl 1281.62022
[2] Awan, J. & Slavkovic, A. (2018). Differentially private uniformly most powerful tests for binomial data. arXiv preprint arXiv:1805.09236.
[3] Awan, J. & Slavković, A.B. (2019). Structure and sensitivity in differential privacy: comparing K‐norm mechanisms. arXiv preprint arXiv:1801.09236v3.
[4] Barak, B., Chaudhuri, K., Dwork, C., Kale, S., Mcsherry, F. & Talwar, K. (2007). Privacy, accuracy and consistency too: A holistic solution to contingency table release. In Proceedings of the Twenty‐Sixth ACM SIGMOD‐SIGACT‐SIGART Symposium on Principles of Database Systems. ACM: Beijing, China, pp. 273-282.
[5] Barrientos, A.F., Reiter, J.P., Machanavajjhala, A. & Chen, Y. (2019). Differentially private significance tests for regression coefficients. J. Comput. Graph. Statist., 28, 440-453. · Zbl 07499065
[6] Bassily, R., Smith, A. & Thakurta, A. (2014). Differentially private empirical risk minimization: efficient algorithms and tight error bounds. arXiv preprint arXiv:1405.7085.
[7] Benitez, K. & Malin, B. (2010). Evaluating re‐identification risks with respect to the HIPAA privacy rule. J. Am. Med. Info. Assoc., 17, 169-177.
[8] Blocki, J., Blum, A., Datta, A. & Sheffet, O. (2012). The Johnson-Lindenstrauss transform itself preserves differential privacy. In 2012 IEEE 53rd Annual Symposium on Foundations of Computer Science (FOCS). IEEE: New Brunswick, NJ, USA, pp. 410-419.
[9] Bun, M. & Steinke, T. (2016). Concentrated differential privacy: simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, Springer: Berlin, Heidelberg, pp. 635‐658. · Zbl 1406.94030
[10] Chaudhuri, K., Monteleoni, C. & Sarwate, A. (2011). Differentially private empirical risk minimization. J. Mach. Learn. Res., 12, 1069-1109. · Zbl 1280.62073
[11] Chen, C., Lee, J. & Kifer, D. (2019). Renyi differentially private ERM for smooth objectives. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2037-2046: Naha, Okinawa, Japan.
[12] Chen, Y., Machanavajjhala, A., Reiter, J. & Barrientos, A.F. (2016). Differentially private regression diagnostics. In 2016 IEEE 16Th International Conference on Data Mining (ICDM), pp. 81-90: Barcelona, Spain.
[13] Chen, R., Xiao, Q., Zhang, Y. & Xu, J. (2015). Differentially private high‐dimensional data publication via sampling‐based inference. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: Sydney, Australia, pp. 129-138.
[14] Cole, S.R., Chu, H. & Greenland, S. (2014). Maximum likelihood, profile likelihood, and penalized likelihood: a primer. Am. J. Epidemiol., 179, 252-260.
[15] Couch, S., Kazan, Z., Shi, K., Bray, A. & Groce, A. (2018). A differentially private Wilcoxon signed‐rank test. arXiv preprint arXiv:1809.01635.
[16] D'Orazio, V., Honaker, J. & King, G. (2015). Differential privacy for social science inference. Sloan Foundation Economics Research Paper No. 2676160, Annual Meeting of the Society for Political Methodology, pp. 1-44.
[17] Dankar, F. & El Emam, K. (2013). Practicing differential privacy in health care: a review. Trans. Data Privacy, 5, 35-67.
[18] Dimitrakakis, C., Nelson, B., Mitrokotsa, A. & Rubinstein, B.I. (2014). Robust and private Bayesian inference. In International Conference on Algorithmic Learning Theory. Springer: Cham, pp. 291-305. · Zbl 1432.68132
[19] Ding, B., Nori, H., Li, P. & Allen, J. (2018). Comparing population means under local differential privacy: with significance and power. arXiv preprint arXiv:1803.09027.
[20] Duchi, J.C., Jordan, M.I. & Wainwright, M.J. (2013). Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science (FOCS). IEEE: Berkeley, CA, USA, pp. 429-438.
[21] Dwork, C. (2008). Differential privacy: a survey of results. In International Conference on Theory and Applications of Models of Computation. Springer: Berlin, Heidelberg, pp. 1-19. · Zbl 1139.68339
[22] Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I. & Naor, M. (2006a). Our data, ourselves: privacy via distributed noise generation. In Annual International Conference on the Theory and Applications of Cryptographic Techniques. Springer: Berlin, Heidelberg, pp. 486-503. · Zbl 1140.94336
[23] Dwork, C. & Lei, J. (2009). Differential privacy and robust statistics. In Proceedings of the forty‐first annual ACM symposium on Theory of computing. Bethesda, MD, USA, pp. 371-380. · Zbl 1304.94049
[24] Dwork, C., McSherry, F., Nissim, K. & Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference. Springer: Berlin, Heidelberg, pp. 265-284. · Zbl 1112.94027
[25] Dwork, C. & Roth, A. (2014). The algorithmic foundations of differential privacy. FNT Theoret Comput Sci, 9, 211-407. · Zbl 1302.68109
[26] Dwork, C. & Rothblum, G.N. (2016). Concentrated differential privacy. arXiv preprint arXiv:1603.01887.
[27] Dwork, C., Rothblum, G.N. & Vadhan, S. (2010). Boosting and differential privacy. In 2010 IEEE 51st Annual Symposium on Foundations of Computer Science, pp. 51-60.
[28] Dwork, C. & Smith, A. (2009). Differential privacy for statistics: what we know and what we want to learn. J. Privacy Confidential., 1(2), 135-154.
[29] Dwork, C., Talwar, K., Thakurta, A. & Zhang, L. (2014). Analyze Gauss: optimal bounds for privacy‐preserving principal component analysis. In Proceedings of the Forty‐Sixth Annual ACM Symposium on Theory of Computing. ACM: New York, NY, USA, pp. 11-20. · Zbl 1315.94115
[30] Fienberg, S.E., Rinaldo, A. & Yang, X. (2010). Differential privacy and the risk-utility tradeoff for multi‐dimensional contingency tables. In International Conference on Privacy in Statistical Databases. Springer: Berlin, Heidelberg, pp. 187-199.
[31] Foulds, J., Geumlek, J., Welling, M. & Chaudhuri, K. (2016). On the theory and practice of privacy‐preserving Bayesian data analysis. arXiv:1603.07294v4.
[32] Fredrikson, M., Lantz, E., Jha, S., Lin, S., Page, D. & Ristenpart, T. (2014). Privacy in pharmacogenetics: an end‐to‐end case study of personalized warfarin dosing. In USENIX Security Symposium, pp. 17-32: San Diego, CA, USA.
[33] Gaboardi, M., Lim, H.W., Rogers, R.M. & Vadhan, S.P. (2016). Differentially private chi‐squared hypothesis testing: goodness of fit and independence testing. In ICML’16 Proceedings of the 33rd International Conference on International Conference on Machine Learning‐Volume 48, pp. 2111-2120: New York, NY, USA.
[34] Hall, R., Rinaldo, A. & Wasserman, L. (2012). Random differential privacy. J. Privacy Confidential., 4, 43-59.
[35] Holdren, J. (2013). Memorandum for the heads of executive departments and agencies: increasing access to the results of federally funded scientific research. Office of Science and Technology.
[36] Hsu, J., Gaboardi, M., Haeberlen, A., Khanna, S., Narayan, A., Pierce, B.C. & Roth, A. (2014). Differential privacy: an economic method for choosing epsilon. In 2014 IEEE 27th Computer Security Foundations Symposium (Csf). Vienna, Austria, pp. 398-410.
[37] Hudson, K.L. & Collins, F.S. (2015). Sharing and reporting the results of clinical trials. JAMA, 313, 355-356.
[38] Jain, P., Kothari, P. & Thakurta, A. (2012). Differentially private online learning. In Conference on Learning Theory, pp. 24-1: Edinburgh, Scotland.
[39] Jain, P. & Thakurta, A.G. (2014). (Near) dimension independent risk bounds for differentially private learning. In International Conference on Machine Learning, pp. 476-484: Beijing, China.
[40] Jarmin, R. (2019). Census bureau adopts cutting edge privacy protections for 2020 census. In Census Blogs. United States Census Bureau.
[41] Jiang, X., Sarwate, A.D. & Ohno‐Machado, L. (2013). Privacy technology to support data sharing for comparative effectiveness research: a systematic review. Med. Care, 51, S58-S65.
[42] Johnson, A. & Shmatikov, V. (2013). Privacy‐preserving data exploration in genome‐wide association studies. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: Chicago, IL, USA, pp. 1079-1087.
[43] Kakizaki, K., Fukuchi, K. & Sakuma, J. (2017). Differentially private chi‐squared test by unit circle mechanism. In International Conference on Machine Learning, pp. 1761-1770: Sydney, Australia.
[44] Karwa, V. & Vadhan, S. (2017). Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908. · Zbl 1462.68084
[45] Kasiviswanathan, S.P., Rudelson, M. & Smith, A. (2013). The power of linear reconstruction attacks. In Proceedings of the Twenty‐Fourth Annual ACM‐SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics: New Orleans, LA, USA, pp. 1415-1433. · Zbl 1423.68141
[46] Kifer, D., Smith, A. & Thakurta, A. (2012). Private convex empirical risk minimization and high‐dimensional regression, Conference on Learning Theory, pp. 25-1: Edinburgh, Scotland.
[47] Kiley, R., Peatfield, T., Hansen, J. & Reddington, F. (2017). Data sharing from clinical trials—a research funder’s perspective. New England J. Med., 377, 1990-1992.
[48] Lawler, M., Haussler, D., Siu, L.L., Haendel, M.A., McMurry, J.A., Knoppers, B.M., Chanock, S.J., Calvo, F., Teh, B.T., Walia, G., Banks, I., Yu, P.P., Staudt, L.M. & Sawyers, C.L. (2017). Sharing clinical and genomic data on cancer—the need for global solutions. New England J. Med., 376, 2006-2009.
[49] Lee, J. & Clifton, C. (2011). How much is enough? Choosing epsilon for differential privacy. Info. Sec., 7001, 325-340.
[50] Lei, J. (2011). Differentially private m‐estimators. Adv. Neural Info. Process. Syst., 361-369.
[51] Li, C., Hay, M., Miklau, G. & Wang, Y. (2014). A data‐ and workload‐aware query answering algorithm for range queries under differential privacy. PVLDB, 7, 341-352.
[52] Lo, B. (2015). Sharing clinical trial data: maximizing benefits, minimizing risk. JAMA, 313, 793-794.
[53] Matthews, G.J. & Harel, O. (2011). Data confidentiality: a review of methods for statistical disclosure limitation and methods for assessing privacy. Stat. Surv., 5, 1-29. · Zbl 1274.62055
[54] McSherry, F. & Talwar, K. (2007). Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS'07). IEEE: Providence, RI, USA, pp. 94-103.
[55] Minami, K., Arai, H., Sato, I. & Nakagawa, H. (2016). Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, NIPS: Barcelona, Spain, pp. 956‐964.
[56] Mironov, I. (2017). Rényi differential privacy. arXiv preprint arXiv:1702.07476v3.
[57] Mohammed, N., Chen, R., Fung, B. & Yu, P.S. (2011). Differentially private data release for data mining. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM: San Diego, CA, USA, pp. 493-501.
[58] Mohan, P., Thakurta, A., Shi, E., Song, D. & Culler, D. (2012). GUPT: privacy preserving data analysis made easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 349-360: Scottsdale, AZ, USA.
[59] Naldi, M. & D'Acquisto, G. (2015). Differential privacy: an estimation theory‐based method for choosing epsilon. arXiv preprint arXiv:1510.00917.
[60] Nass, S.J., Levit, L.A. & Gostin, L.O. (Eds.) (2009). Beyond the HIPAA Privacy Rule: Enhancing Privacy, Improving Health Through Research. National Academy of Medicine: Washington, DC.
[61] National Institutes of Health (2003). Final NIH statement on sharing research data. Retrieved from
[62] National Science Foundation (2019). Proposal and award policies and procedures guide. Retrieved from https://www.nsf.gov/pubs/policydocs/pappg20_1/index.jsp
[63] Nguyen, T. & Hui, S. (2017). Differentially private regression for discrete‐time survival analysis. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management Singapore. ACM: Singapore, pp. 1199-1208.
[64] Nissim, K., Raskhodnikova, S. & Smith, A. (2007). Smooth sensitivity and sampling in private data analysis. In Proceedings of the Thirty‐Ninth Annual ACM Symposium on Theory of Computing. ACM: San Diego, CA, USA, pp. 75-84. · Zbl 1232.68039
[65] O’Keefe, C.M. & Rubin, D.B. (2015). Individual privacy versus public good: protecting confidentiality in health research. Stat. Med., 34, 3081-3103.
[66] Qardaji, W., Yang, W. & Li, N. (2013). Understanding hierarchical methods for differentially private histograms. PVLDB, 6, 1954-1965.
[67] Rogers, R. & Kifer, D. (2017). A new class of private chi‐square hypothesis tests. In Proceedings of the 20th Conference on Artificial Intelligence and Statistics, pp. 991-1000: Fort Lauderdale, Florida, USA.
[68] Rubinstein, B.I. & Alda, F. (2017). Pain‐free random differential privacy with sensitivity sampling. arXiv preprint arXiv:1706.02562.
[69] Sheffet, O. (2017). Differentially private ordinary least squares. In International Conference on Machine Learning, pp. 3105-3114: Sydney, Australia.
[70] Shmueli, G. (2010). To explain or to predict? Stat. Sci., 25, 289-310. · Zbl 1329.62045
[71] Skinner, C. (2012). Statistical disclosure risk: separating potential and harm. Int. Stat. Rev., 80, 349-368. · Zbl 1416.62059
[72] Smith, A. (2008). Efficient, differentially private point estimators. CoRR, arXiv:0809.4794.
[73] Smith, A. (2011). Privacy‐preserving statistical estimation with optimal convergence rates. In Proceedings of the Forty‐Third Annual ACM Symposium on Theory of Computing, pp. 813-822: San Jose, CA, USA. · Zbl 1288.62015
[74] Smith, A. & Thakurta, A.G. (2013). Differentially private feature selection via stability arguments, and the robustness of the lasso. In Conference on Learning Theory, pp. 819-850: Princeton, NJ, USA.
[75] Snoke, J. & Slavković, A. (2018). PMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity. Springer International Publishing: Cham.
[76] Solea, E. (2014). Differentially Private Hypothesis Testing for Normal Random Variables. Pennsylvania State University. https://etda.libraries.psu.edu/catalog/21486
[77] Song, S., Chaudhuri, K. & Sarwate, A.D. (2013). Stochastic descent with differentially private updates. In Global Conference on Signal and Information Processing (GlobalSIP). IEEE: Austin, TX, USA., pp. 245-248.
[78] Su, D., Cao, J. & Li, N. (2015). Differentially private projected histograms of multi‐attribute data for classification. arXiv preprint arXiv:1504.05997.
[79] Sweeney, L. (1997). Weaving technology and policy together to maintain confidentiality. J Law Med Ethics, 25, 98-100.
[80] Taichman, D.B., Sahni, P., Pinborg, A., Peiperl, L., Laine, C., James, A., Hong, S.‐T., Haileamlak, A., Gollogly, L., Godlee, F., Frizelle, F.A., Florenzano, F., Drazen, J.M., Bauchner, H., Baethge, C. & Backus, J. (2017). Data sharing statements for clinical trials: a requirement of the International Committee of Medical Journal Editors. Lancet, 389, e12-e14.
[81] Task, C. & Clifton, C. (2016). Differentially private significance testing on paired‐sample data. In Proceedings of the 2016 SIAM International Conference on Data Mining, pp. 153-161: Miami, FL, USA.
[82] Taylor, L., Zhou, X.‐H. & Rise, P. (2018). A tutorial in assessing disclosure risk in microdata. Stat. Med., 37, 3693-3706.
[83] U.S. Department of Health and Human Services (2012). Guidance regarding methods for de‐identification of protected health information in accordance with the health insurance portability and accountability act (HIPAA) privacy rule. Retrieved from
[84] U.S. Department of Health and Human Services (2013). Summary of the HIPAA Privacy Rule. Retrieved from
[85] Uhler, C., Slavković, A. & Fienberg, S.E. (2013). Privacy‐preserving data sharing for genome‐wide association studies. J. Privacy Confidential., 5, 137.
[86] Vu, D. & Slavkovic, A. (2009). Differential privacy for clinical trial data: preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops (ICDMW). IEEE: Miami, FL, USA, pp. 138-143.
[87] Walport, M. & Brest, P. (2011). Sharing research data to improve public health. Lancet, 377, 537-539.
[88] Wang, Y.‐X., Fienberg, S. & Smola, A. (2015b). Privacy for free: posterior sampling and stochastic gradient Monte Carlo. In International Conference on Machine Learning, pp. 2493-2502: Lille, France.
[89] Wang, Y., Kifer, D. & Lee, J. (2018). Differentially private confidence intervals for empirical risk minimization. arXiv preprint arXiv:1804.03794.
[90] Wang, Y., Lee, J. & Kifer, D. (2015a). Revisiting differentially private hypothesis tests for categorical data. arXiv preprint arXiv:1511.03376.
[91] Williams, O. & Mcsherry, F. (2010). Probabilistic inference and differential privacy. In Advances in Neural Information Processing Systems, NIPS: Vancouver, Canada, pp. 2451-2459.
[92] World Health Organization. Policy on use and sharing of data collected in Member States by the World Health Organization (WHO) outside the context of public health emergencies. Retrieved from https://www.who.int/publishing/datapolicy/en/
[93] Wu, X., Fredrikson, M., Wu, W., Jha, S. & Naughton, J.F. (2015). Revisiting differentially private regression: Lessons from learning theory and their consequences. arXiv preprint arXiv:1512.06388.
[94] Wu, X., Li, F., Kumar, A., Chaudhuri, K., Jha, S. & Naughton, J. (2017). Bolt‐on differential privacy for scalable stochastic gradient descent‐based analytics. In Sigmod’17: Proceedings of the 2017 ACM International Conference on Management of Data, pp. 1307-1322: Chicago, IL, USA.
[95] Yu, F., Fienberg, S.E., Slavković, A. & Uhler, C. (2014a). Scalable privacy‐preserving data sharing methodology for genome‐wide association studies. J. Biomed. Info., 50, 133-141.
[96] Yu, F., Rybar, M., Uhler, C. & Fienberg, S.E. (2014b). Differentially‐private logistic regression for detecting multiple‐SNP association in GWAS databases. In International Conference on Privacy in Statistical Databases. Springer: Ibiza, Spain, pp. 170-184.
[97] Zhang, J., Xiao, X., Yang, Y., Zhang, Z. & Winslett, M. (2013). PrivGene: differentially private model fitting using genetic algorithms. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, pp. 665-676: New York, NY, USA.
[98] Zhang, J., Zhang, Z., Xiao, X., Yang, Y. & Winslett, M. (2012). Functional mechanism: regression analysis under differential privacy. PVLDB.
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases these data have been complemented or enhanced by data from zbMATH Open. The list attempts to reflect the references in the original paper as accurately as possible, without claiming completeness or a perfect matching.