
On the empirical estimation of integral probability metrics. (English) Zbl 1295.62035

Summary: Given two probability measures \(\mathbb{P}\) and \(\mathbb{Q}\) defined on a measurable space \(S\), the integral probability metric (IPM) is defined as \[ \gamma_{\mathcal F}(\mathbb{P},\mathbb{Q})=\sup\left\{\left| \int_{S}f d\mathbb{P}-\int_{S} f d\mathbb{Q}\right|\,:\,f\in\mathcal{F}\right\}, \] where \(\mathcal{F}\) is a class of real-valued bounded measurable functions on \(S\). By appropriately choosing \(\mathcal{F}\), various popular distances between \(\mathbb{P}\) and \(\mathbb{Q}\), including the Kantorovich metric, the Fortet-Mourier metric, the dual bounded-Lipschitz distance (also called the Dudley metric), the total variation distance, and the kernel distance, can be obtained.
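As a concrete illustration (not taken from the paper under review), the Kantorovich metric on \(S=\mathbb{R}\) admits a simple closed-form empirical estimator via order statistics: for equal-size samples, the Wasserstein-1 distance between the empirical measures is the mean absolute difference of the sorted samples. The sketch below assumes equal sample sizes and one-dimensional data; the function name is illustrative.

```python
import numpy as np

def empirical_wasserstein1(x, y):
    """Empirical Kantorovich (Wasserstein-1) distance between two
    one-dimensional samples of equal size n, using the closed form
    W1 = (1/n) * sum_i |x_(i) - y_(i)| over the order statistics."""
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("this sketch assumes equal sample sizes")
    return np.abs(x - y).mean()

# A location shift by a constant c yields W1 approximately |c|.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 1000)
print(empirical_wasserstein1(x, x + 2.0))
```

Sorting both samples realizes the optimal coupling on the line, so no linear program is needed in one dimension; the printed value is (up to floating-point rounding) exactly 2.0.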
In this paper, we consider the problem of estimating \(\gamma_{\mathcal{F}}\) from finite random samples drawn i.i.d. from \(\mathbb{P}\) and \(\mathbb{Q}\). Although the above-mentioned distances cannot be computed in closed form for every \(\mathbb{P}\) and \(\mathbb{Q}\), we show their empirical estimators to be easily computable and strongly consistent (except for the total variation distance). We further analyze their rates of convergence. Based on these results, we discuss the advantages of certain choices of \(\mathcal{F}\) (and therefore the corresponding IPMs) over others. In particular, the kernel distance is shown to have three favorable properties compared with the other mentioned distances: it is computationally cheaper, the empirical estimate converges at a faster rate to the population value, and the rate of convergence is independent of the dimension \(d\) of the space (for \(S=\mathbb{R}^{d}\)). We also provide a novel interpretation of IPMs and their empirical estimators by relating them to the problem of binary classification: while the IPM between class-conditional distributions is the negative of the optimal risk associated with a binary classifier, the smoothness of an appropriate binary classifier (e.g., support vector machine, Lipschitz classifier, etc.) is inversely related to the empirical estimator of the IPM between these class-conditional distributions.
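The kernel distance singled out above has a particularly simple plug-in estimator: with a reproducing kernel \(k\), the squared distance between the empirical measures is a sum of three Gram-matrix means, \(\frac{1}{m^2}\sum_{i,j} k(x_i,x_j) + \frac{1}{n^2}\sum_{i,j} k(y_i,y_j) - \frac{2}{mn}\sum_{i,j} k(x_i,y_j)\). A minimal sketch with a Gaussian kernel follows; the bandwidth choice and function names are illustrative, not prescribed by the paper.

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix k(a_i, b_j) for the Gaussian kernel on R^d."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def empirical_kernel_distance(x, y, sigma=1.0):
    """Plug-in estimator of the kernel distance (MMD):
    gamma_k^2 = mean k(x, x') + mean k(y, y') - 2 mean k(x, y)."""
    kxx = gaussian_gram(x, x, sigma).mean()
    kyy = gaussian_gram(y, y, sigma).mean()
    kxy = gaussian_gram(x, y, sigma).mean()
    # Clip at zero: rounding can make the plug-in value slightly negative.
    return np.sqrt(max(kxx + kyy - 2.0 * kxy, 0.0))

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, (200, 2))   # sample from P
y = rng.normal(3.0, 1.0, (200, 2))   # sample from a shifted Q
print(empirical_kernel_distance(x, x))  # identical samples: 0.0
print(empirical_kernel_distance(x, y))  # well-separated samples: > 0
```

Note the cost is \(O((m+n)^2)\) kernel evaluations regardless of the dimension \(d\), which reflects the computational advantage the summary attributes to the kernel distance.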

MSC:

62G05 Nonparametric estimation
60B05 Probability measures on topological spaces

References:

[1] Ali, S. M. and Silvey, S. D. (1966). A general class of coefficients of divergence of one distribution from another., Journal of the Royal Statistical Society, Series B (Methodological) 28 , 131-142. · Zbl 0203.19902
[2] Aronszajn, N. (1950). Theory of reproducing kernels., Trans. Amer. Math. Soc. 68 , 337-404. · Zbl 0037.20701 · doi:10.2307/1990404
[3] Bartlett, P. and Mendelson, S. (2002). Rademacher and Gaussian complexities: Risk bounds and structural results., Journal of Machine Learning Research 3 , 463-482. · Zbl 1084.68549 · doi:10.1162/153244303321897690
[4] Bartlett, P. L., Bousquet, O., and Mendelson, S. (2005). Local Rademacher complexities., Annals of Statistics 33 , 4, 1497-1537. · Zbl 1083.62034 · doi:10.1214/009053605000000282
[5] Beauzamy, B. (1985)., Introduction to Banach spaces and their Geometry . North-Holland, The Netherlands. · Zbl 0585.46009
[6] Berlinet, A. and Thomas-Agnan, C. (2004)., Reproducing Kernel Hilbert Spaces in Probability and Statistics . Kluwer Academic Publishers, London, UK. · Zbl 1145.62002
[7] Bickel, P. J. (1969). A distribution-free version of the Smirnov two-sample test in the \(p\)-variate case., The Annals of Mathematical Statistics 40 , 1, 1-23. · Zbl 0179.48704 · doi:10.1214/aoms/1177697800
[8] Boucheron, S., Lugosi, G., and Massart, P. (2000). A sharp concentration inequality with applications., Random Structures and Algorithms 16 , 3, 277-292. · Zbl 0954.60008 · doi:10.1002/(SICI)1098-2418(200005)16:3<277::AID-RSA4>3.0.CO;2-1
[9] Breiman, L. (1999). Prediction games and arcing algorithms., Neural Computation 11 , 7, 1493-1517.
[10] Cortes, C. and Vapnik, V. (1995). Support-vector networks., Machine Learning 20 , 273-297. · Zbl 0831.68098
[11] Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations., Studia Scientiarium Mathematicarum Hungarica 2 , 299-318. · Zbl 0157.25802
[12] Cucker, F. and Zhou, D.-X. (2007)., Learning Theory: An Approximation Theory Viewpoint . Cambridge University Press, Cambridge, UK. · Zbl 1274.41001
[13] Devroye, L. and Györfi, L. (1985)., Nonparametric Density Estimation: The \(L_1\) View . Wiley, New York. · Zbl 0546.62015
[14] Devroye, L., Györfi, L., and Lugosi, G. (1996)., A Probabilistic Theory of Pattern Recognition . Springer-Verlag, New York. · Zbl 0853.68150
[15] Dudley, R. M. (2002)., Real Analysis and Probability . Cambridge University Press, Cambridge, UK. · Zbl 1023.60001
[16] Fedotov, A. A., Harremoës, P., and Topsøe, F. (2003). Refinements of Pinsker’s inequality., IEEE Trans. Information Theory 49 , 6, 1491-1498. · Zbl 1063.94017 · doi:10.1109/TIT.2003.811927
[17] Fukumizu, K., Gretton, A., Sun, X., and Schölkopf, B. (2008). Kernel measures of conditional dependence. In, Advances in Neural Information Processing Systems 20 , J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, Cambridge, MA, 489-496.
[18] Gibbs, A. L. and Su, F. E. (2002). On choosing and bounding probability metrics., International Statistical Review 70 , 3, 419-435. · Zbl 1217.62014 · doi:10.2307/1403865
[19] Gray, R. M., Neuhoff, D. L., and Shields, P. C. (1975). A generalization of Ornstein’s \(\overline{d}\) distance with applications to information theory., Annals of Probability 3 , 315-328. · Zbl 0304.94025 · doi:10.1214/aop/1176996402
[20] Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., and Smola, A. (2012). A kernel two-sample test., JMLR 13 , 723-773. · Zbl 1283.62095
[21] Gretton, A., Borgwardt, K. M., Rasch, M., Schölkopf, B., and Smola, A. (2007). A kernel method for the two-sample problem. In, Advances in Neural Information Processing Systems 19 , B. Schölkopf, J. Platt, and T. Hoffman, Eds. MIT Press, 513-520. · Zbl 1283.62095
[22] Gretton, A., Fukumizu, K., Teo, C. H., Song, L., Schölkopf, B., and Smola, A. J. (2008). A kernel statistical test of independence. In, Advances in Neural Information Processing Systems 20 , J. Platt, D. Koller, Y. Singer, and S. Roweis, Eds. MIT Press, 585-592.
[23] Khosravifard, M., Fooladivanda, D., and Gulliver, T. A. (2007). Confliction of the convexity and metric properties in \(f\)-divergences., IEICE Trans. Fundamentals E90-A , 9, 1848-1853.
[24] Kolmogorov, A. N. and Tihomirov, V. M. (1961). \(\epsilon\)-entropy and \(\epsilon\)-capacity of sets in functional space., American Mathematical Society Translations 2 , 17, 277-364. · Zbl 0133.06703
[25] Lindvall, T. (1992)., Lectures on the Coupling Method . John Wiley & Sons, New York. · Zbl 0850.60019
[26] Massart, P. (2000). Some applications of concentration inequalities to statistics., Ann. Fac. Sci. Toulouse Math. 9 , 6, 245-303. · Zbl 0986.62002 · doi:10.5802/afst.961
[27] McShane, E. J. (1934). Extension of range of functions., Bulletin of the American Mathematical Society 40 , 837-842. · Zbl 0010.34606 · doi:10.1090/S0002-9904-1934-05978-0
[28] Mendelson, S. (2002). Rademacher averages and phase transitions in Glivenko-Cantelli classes., IEEE Transactions on Information Theory 48 , 1, 251-263. · Zbl 1059.60027 · doi:10.1109/18.971753
[29] Müller, A. (1997). Integral probability metrics and their generating classes of functions., Advances in Applied Probability 29 , 429-443. · Zbl 0890.60011 · doi:10.2307/1428011
[30] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2007). Nonparametric estimation of the likelihood ratio and divergence functionals. In, IEEE International Symposium on Information Theory .
[31] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2009). On surrogate loss functions and \(f\)-divergences., Annals of Statistics 37 , 2, 876-904. · Zbl 1162.62060 · doi:10.1214/08-AOS595
[32] Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization., IEEE Transactions on Information Theory 56 , 11, 5847-5861. · Zbl 1366.62071 · doi:10.1109/TIT.2010.2068870
[33] Rachev, S. T. (1984). On a class of minimum functionals in a space of probability measures., Theory of Probability and its Applications 29 , 41-48. · Zbl 0531.60008
[34] Rachev, S. T. (1985). The Monge-Kantorovich mass transference problem and its stochastic applications., Theory of Probability and its Applications 29 , 647-676. · Zbl 0581.60010 · doi:10.1137/1129093
[35] Rachev, S. T. and Rüschendorf, L. (1998)., Mass transportation problems. Vol. I Theory, Vol. II Applications . Probability and its Applications. Springer-Verlag, Berlin. · Zbl 0990.60500
[36] Rockafellar, R. T. (1970)., Convex Analysis . Princeton University Press, Princeton, NJ. · Zbl 0193.18401
[37] Schölkopf, B. and Smola, A. J. (2002)., Learning with Kernels . MIT Press, Cambridge, MA. · Zbl 1019.68094
[38] Shawe-Taylor, J. and Cristianini, N. (2004)., Kernel Methods for Pattern Analysis . Cambridge University Press, UK. · Zbl 0994.68074
[39] Shorack, G. R. (2000)., Probability for Statisticians . Springer-Verlag, New York. · Zbl 0951.62005 · doi:10.1007/b98901
[40] Smirnov, N. (1939). On the estimation of the discrepancy between empirical curves of distribution for two independent samples., Moscow University Mathematics Bulletin 2 , 3-26. University of Moscow. · Zbl 0023.24902
[41] Srebro, N., Sridharan, K., and Tewari, A. (2010). Smoothness, low noise and fast rates. In, Advances in Neural Information Processing Systems 23 , J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds. 2199-2207.
[42] Sriperumbudur, B. (2011). Mixture density estimation via Hilbert space embedding of measures. In, Proceedings of International Symposium on Information Theory . 1027-1030.
[43] Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Lanckriet, G. R. G., and Schölkopf, B. (2008). Injective Hilbert space embeddings of probability measures. In, Proc. of the 21st Annual Conference on Learning Theory , R. Servedio and T. Zhang, Eds. 111-122. · Zbl 1242.60005
[44] Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. G. (2010). Hilbert space embeddings and metrics on probability measures., Journal of Machine Learning Research 11 , 1517-1561. · Zbl 1242.60005
[45] Steinwart, I. and Christmann, A. (2008)., Support Vector Machines . Springer. · Zbl 1203.68171
[46] Suquet, C. (2009). Reproducing kernel Hilbert spaces and random measures. In, Proc. of the 5th International ISAAC Congress, Catania, Italy, 25-30 July 2005 , H. G. W. Begehr and F. Nicolosi, Eds. World Scientific, 143-152. · Zbl 1195.60072
[47] Vajda, I. (1989)., Theory of Statistical Inference and Information . Kluwer Academic Publishers, Boston. · Zbl 0711.62002
[48] Vallander, S. S. (1973). Calculation of the Wasserstein distance between probability distributions on the line., Theory Probab. Appl. 18 , 784-786. · Zbl 0351.60009 · doi:10.1137/1118101
[49] van de Geer, S. (2000)., Empirical Processes in M-Estimation . Cambridge University Press, Cambridge, UK. · Zbl 0953.62049
[50] van der Vaart, A. W. and Wellner, J. A. (1996)., Weak Convergence and Empirical Processes . Springer-Verlag, New York. · Zbl 0862.60002
[51] von Luxburg, U. and Bousquet, O. (2004). Distance-based classification with Lipschitz functions., Journal of Machine Learning Research 5 , 669-695. · Zbl 1222.68326
[52] Wang, Q., Kulkarni, S. R., and Verdú, S. (2005). Divergence estimation of continuous distributions based on data-dependent partitions., IEEE Trans. Information Theory 51 , 9, 3064-3074. · Zbl 1310.94055 · doi:10.1109/TIT.2005.853314
[53] Wendland, H. (2005)., Scattered Data Approximation . Cambridge University Press, Cambridge, UK. · Zbl 1075.65021
[54] Whitney, H. (1934). Analytic extensions of differentiable functions defined in closed sets., Transactions of the American Mathematical Society 36 , 63-89. · Zbl 0008.24902 · doi:10.2307/1989708
[55] Zolotarev, V. M. (1983). Probability metrics., Theory of Probability and its Applications 28 , 278-302. · Zbl 0533.60025 · doi:10.1137/1128025