×

Motif estimation via subgraph sampling: the fourth-moment phenomenon. (English) Zbl 1486.62079

Summary: Network sampling is an indispensable tool for understanding features of large complex networks where it is practically impossible to search over the entire graph. In this paper, we develop a framework for statistical inference for counting network motifs, such as edges, triangles and wedges, in the widely used subgraph sampling model, where each vertex is sampled independently, and the subgraph induced by the sampled vertices is observed. We derive necessary and sufficient conditions for the consistency and the asymptotic normality of the natural Horvitz-Thompson (HT) estimator, which can be used for constructing confidence intervals and hypothesis testing for the motif counts based on the sampled graph. In particular, we show that the asymptotic normality of the HT estimator exhibits an interesting fourth-moment phenomenon, which asserts that the HT estimator (appropriately centered and rescaled) converges in distribution to the standard normal whenever its fourth-moment converges to 3 (the fourth-moment of the standard normal distribution). As a consequence, we derive the exact thresholds for consistency and asymptotic normality of the HT estimator in various natural graph ensembles, such as sparse graphs with bounded degree, Erdős-Rényi random graphs, random regular graphs and dense graphons.

MSC:

62G05 Nonparametric estimation
62G20 Asymptotic properties of nonparametric inference
05C80 Random graphs (graph-theoretic aspects)
62H22 Probabilistic graphical models

References:

[1] Airoldi, E. M., Costa, T. B. and Chan, S. H. (2013). Stochastic blockmodel approximation of a graphon: Theory and consistent estimation. In Advances in Neural Information Processing Systems 692-700.
[2] ALIAKBARPOUR, M., SHANKHA BISWAS, A., GOULEAKIS, T., PEEBLES, J., RUBINFELD, R. and YODPINYANEE, A. (2018). Sublinear-time algorithms for counting star subgraphs via edge sampling. Algorithmica 80 668-697. · Zbl 1391.68120 · doi:10.1007/s00453-017-0287-3
[3] APICELLA, C. L., MARLOWE, F. W., FOWLER, J. H. and CHRISTAKIS, N. A. (2012). Social networks and cooperation in hunter-gatherers. Nature 481 497-501. · doi:10.1038/nature10736
[4] BANDIERA, O. and RASUL, I. (2006). Social networks and technology adoption in northern Mozambique. Econ. J. 116 869-902.
[5] BARBOUR, A. D. and CHEN, L. H. Y., eds. (2005). An Introduction to Stein’s Method. Lecture Notes Series. Institute for Mathematical Sciences. National University of Singapore 4. World Scientific, Hackensack, NJ. · Zbl 1072.62007 · doi:10.1142/9789812567680
[6] Bertail, P., Chautru, E. and Clémençon, S. (2017). Empirical processes in survey sampling with (conditional) Poisson designs. Scand. J. Stat. 44 97-111. · Zbl 1361.62015 · doi:10.1111/sjos.12243
[7] BHATTACHARYA, A., BISHNU, A., GHOSH, A. and MISHRA, G. (2019). Triangle estimation using tripartite independent set queries. In 30th International Symposium on Algorithms and Computation (ISAAC 2019). Schloss Dagstuhl. Leibniz-Zent. Inform., Wadern. · Zbl 1508.68259
[8] BHATTACHARYA, B. B, DAS, S. and MUKHERJEE, S. (2022). Supplement to “Motif estimation via subgraph sampling: The fourth-moment phenomenon.” https://doi.org/10.1214/21-AOS2134SUPP
[9] Bhattacharya, B. B., Diaconis, P. and Mukherjee, S. (2017). Universal limit theorems in graph coloring problems with connections to extremal combinatorics. Ann. Appl. Probab. 27 337-394. · Zbl 1360.05051 · doi:10.1214/16-AAP1205
[10] Bickel, P. J., Chen, A. and Levina, E. (2011). The method of moments and degree distributions for network models. Ann. Statist. 39 2280-2301. · Zbl 1232.91577 · doi:10.1214/11-AOS904
[11] Borgs, C., Chayes, J. T., Lovász, L., Sós, V. T. and Vesztergombi, K. (2008). Convergent sequences of dense graphs. I. Subgraph frequencies, metric properties and testing. Adv. Math. 219 1801-1851. · Zbl 1213.05161 · doi:10.1016/j.aim.2008.07.008
[12] Borgs, C., Chayes, J. T., Lovász, L., Sós, V. T. and Vesztergombi, K. (2012). Convergent sequences of dense graphs II. Multiway cuts and statistical physics. Ann. of Math. (2) 176 151-219. · Zbl 1247.05124 · doi:10.4007/annals.2012.176.1.2
[13] CAPOBIANCO, M. (1972). Estimating the connectivity of a graph. In Graph Theory and Applications (Proc. Conf., Western Michigan Univ., Kalamazoo, Mich., 1972; Dedicated to the Memory of J. W. T. Youngs). Lecture Notes in Math. 303 65-74. · Zbl 0248.05133
[14] CHANDRASEKHAR, A. and LEWIS, R. (2011). Econometrics of sampled networks.
[15] CHATTERJEE, S. and DIACONIS, P. (2013). Estimating and understanding exponential random graph models. Ann. Statist. 41 2428-2461. · Zbl 1293.62046 · doi:10.1214/13-AOS1155
[16] Chatterjee, S., Diaconis, P. and Sly, A. (2011). Random graphs with a given degree sequence. Ann. Appl. Probab. 21 1400-1435. · Zbl 1234.05206 · doi:10.1214/10-AAP728
[17] Chen, L. H. Y. and Shao, Q.-M. (2004). Normal approximation under local dependence. Ann. Probab. 32 1985-2028. · Zbl 1048.60020 · doi:10.1214/009117904000000450
[18] CRANE, H. (2018). Probabilistic Foundations of Statistical Network Analysis. Monographs on Statistics and Applied Probability 157. CRC Press, Boca Raton, FL. · Zbl 1490.62001
[19] DIACONIS, P. and FREEDMAN, D. (1980). Finite exchangeable sequences. Ann. Probab. 8 745-764. · Zbl 0434.60034
[20] EDEN, T., LEVI, A., RON, D. and SESHADHRI, C. (2017). Approximately counting triangles in sublinear time. SIAM J. Comput. 46 1603-1646. · Zbl 1380.68445 · doi:10.1137/15M1054389
[21] FEIGE, U. (2006). On sums of independent random variables with unbounded variance and estimating the average degree in a graph. SIAM J. Comput. 35 964-984. · Zbl 1098.60026 · doi:10.1137/S0097539704447304
[22] FRANK, O. (1977). Estimation of graph totals. Scand. J. Stat. 4 81-89. · Zbl 0358.62008
[23] FRANK, O. (1978). Estimation of the number of connected components in a graph by using a sampled subgraph. Scand. J. Stat. 5 177-188. · Zbl 0391.05049
[24] GAO, C. and MA, Z. (2021). Minimax rates in network analysis: Graphon estimation, community detection and hypothesis testing. Statist. Sci. 36 16-33. · Zbl 07368217 · doi:10.1214/19-STS736
[25] GOLDREICH, O. and RON, D. (2008). Approximating average parameters of graphs. Random Structures Algorithms 32 473-493. · Zbl 1155.05057 · doi:10.1002/rsa.20203
[26] GOLDREICH, O. and RON, D. (2011). On testing expansion in bounded-degree graphs. In Studies in Complexity and Cryptography. Miscellanea on the Interplay Between Randomness and Computation 68-75. Springer. · Zbl 1343.68302
[27] GONEN, M., RON, D. and SHAVITT, Y. (2011). Counting stars and other small subgraphs in sublinear-time. SIAM J. Discrete Math. 25 1365-1411. · Zbl 1234.68138 · doi:10.1137/100783066
[28] GOODMAN, L. A. (1961). Snowball sampling. Ann. Math. Stat. 32 148-170. · Zbl 0099.14203 · doi:10.1214/aoms/1177705148
[29] GOVINDAN, R. and TANGMUNARUNKIT, H. (2000). Heuristics for Internet map discovery. In Proceedings IEEE INFOCOM 2000. Conference on Computer Communications. Nineteenth Annual Joint Conference of the IEEE Computer and Communications Societies 3 1371-1380. IEEE, New York.
[30] Grimmett, G. (1999). Percolation, 2nd ed. Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences] 321. Springer, Berlin. · Zbl 0926.60004 · doi:10.1007/978-3-662-03981-6
[31] GUPTA, P. and BHATTACHARJEE, G. P. (1984). An efficient algorithm for random sampling without replacement. Int. J. Comput. Math. 16 201-209. · Zbl 0554.65100 · doi:10.1080/00207168408803438
[32] HANDCOCK, M. S. and GILE, K. J. (2010). Modeling social networks from sampled data. Ann. Appl. Stat. 4 5-25. · Zbl 1189.62187 · doi:10.1214/08-AOAS221
[33] Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. J. Amer. Statist. Assoc. 47 663-685. · Zbl 0047.38301
[34] JANSON, S., ŁUCZAK, T. and RUCINSKI, A. (2000). Random Graphs. Wiley-Interscience Series in Discrete Mathematics and Optimization 45. Wiley Interscience, New York. · doi:10.1002/9781118032718
[35] KIM, J. H., SUDAKOV, B. and VU, V. (2007). Small subgraphs of random regular graphs. Discrete Math. 307 1961-1967. · Zbl 1118.05088 · doi:10.1016/j.disc.2006.09.032
[36] KLUSOWSKI, J. M. and WU, Y. (2018). Counting motifs with graph sampling. In Conference on Learning Theory 1966-2011.
[37] KLUSOWSKI, J. M. and WU, Y. (2020). Estimating the number of connected components in a graph via subgraph sampling. Bernoulli 26 1635-1664. · Zbl 1440.62043 · doi:10.3150/19-BEJ1147
[38] Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer Series in Statistics. Springer, New York. · Zbl 1277.62021 · doi:10.1007/978-0-387-88146-1
[39] KOLACZYK, E. D. (2017). Topics at the Frontier of Statistics and Network Analysis: (Re)Visiting the Foundations. SemStat Elements. Cambridge Univ. Press, Cambridge. · Zbl 1398.62005 · doi:10.1017/9781108290159
[40] KRIVELEVICH, M., SUDAKOV, B., VU, V. H. and WORMALD, N. C. (2001). Random regular graphs of high degree. Random Structures Algorithms 18 346-363. · Zbl 0996.05106 · doi:10.1002/rsa.1013
[41] LESKOVEC, J. and FALOUTSOS, C. (2006). Sampling from large graphs. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 631-636.
[42] Lovász, L. (2012). Large Networks and Graph Limits. American Mathematical Society Colloquium Publications 60. Amer. Math. Soc., Providence, RI. · Zbl 1292.05001 · doi:10.1090/coll/060
[43] Marinucci, D. and Peccati, G. (2011). Random Fields on the Sphere: Representation, Limit Theorems and Cosmological Applications. London Mathematical Society Lecture Note Series 389. Cambridge Univ. Press, Cambridge. · Zbl 1260.60004 · doi:10.1017/CBO9780511751677
[44] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D. and Alon, U. (2002). Network motifs: Simple building blocks of complex networks. Science 298 824-827.
[45] Nourdin, I. and Peccati, G. (2009). Stein’s method on Wiener chaos. Probab. Theory Related Fields 145 75-118. · Zbl 1175.60053 · doi:10.1007/s00440-008-0162-x
[46] Nourdin, I. and Peccati, G. (2012). Normal Approximations with Malliavin Calculus: From Stein’s Method to Universality. Cambridge Tracts in Mathematics 192. Cambridge Univ. Press, Cambridge. · Zbl 1266.60001 · doi:10.1017/CBO9781139084659
[47] Nourdin, I., Peccati, G. and Reinert, G. (2010). Invariance principles for homogeneous sums: Universality of Gaussian Wiener chaos. Ann. Probab. 38 1947-1985. · Zbl 1246.60039 · doi:10.1214/10-AOP531
[48] Nualart, D. and Peccati, G. (2005). Central limit theorems for sequences of multiple stochastic integrals. Ann. Probab. 33 177-193. · Zbl 1097.60007 · doi:10.1214/009117904000000621
[49] OVE, F. (2005). Network sampling and model fitting. In Structural Analysis in the Social Sciences 31-56. Cambridge Univ. Press, Cambridge.
[50] PRŽULJ, N., CORNEIL, D. G. and JURISICA, I. (2004). Modeling interactome: Scale-free or geometric? Bioinformatics 20 3508-3515.
[51] RIBEIRO, B. and TOWSLEY, D. (2010). Estimating and sampling graphs with multidimensional random walks. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement 390-403.
[52] ROUZANKIN, P. S. and VOYTISHEK, A. V. (1999). On the cost of algorithms for random selection. Monte Carlo Methods Appl. 5 39-54. · Zbl 0933.65006 · doi:10.1515/mcma.1999.5.1.39
[53] RUAL, J.-F., VENKATESAN, K., HAO, T., HIROZANE-KISHIKAWA, T., DRICOT, A., LI, N., BERRIZ, G. F., GIBBONS, F. D., DREZE, M. et al. (2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature 437 1173-1178.
[54] Stein, C. (1972). A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. II: Probability Theory 583-602. · Zbl 0278.60026
[55] Tang, M., Sussman, D. L. and Priebe, C. E. (2013). Universally consistent vertex classification for latent positions graphs. Ann. Statist. 41 1406-1430. · Zbl 1273.62147 · doi:10.1214/13-AOS1112
[56] TILLÉ, Y. (2006). Sampling Algorithms. Springer Series in Statistics. Springer, New York. · Zbl 1099.62009
[57] UETZ, P., GIOT, L., CAGNEY, G., MANSFIELD, T. A., JUDSON, R. S., KNIGHT, J. R., LOCKSHON, D., NARAYAN, V., SRINIVASAN, M. et al. (2000). A comprehensive analysis of protein-protein interactions in saccharomyces cerevisiae. Nature 403 623-627.
[58] YANG, J. and LESKOVEC, J. (2015). Defining and evaluating network communities based on ground-truth. Knowl. Inf. Syst. 42 181-213.
[59] Zhang, Y., Kolaczyk, E. D. and Spencer, B. D. (2015). Estimating network degree distributions under sampling: An inverse problem, with applications to monitoring social media networks. Ann. Appl. Stat. 9 166-199 · Zbl 1454.62058 · doi:10.1214/14-AOAS800
This reference list is based on information provided by the publisher or from digital mathematics libraries. Its items are heuristically matched to zbMATH identifiers and may contain data conversion errors. In some cases that data have been complemented/enhanced by data from zbMATH Open. This attempts to reflect the references listed in the original paper as accurately as possible without claiming completeness or a perfect matching.