{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,15]],"date-time":"2025-12-15T19:10:52Z","timestamp":1765825852494,"version":"3.37.3"},"reference-count":46,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2021,6,28]],"date-time":"2021-06-28T00:00:00Z","timestamp":1624838400000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2021,6,28]],"date-time":"2021-06-28T00:00:00Z","timestamp":1624838400000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["EPJ Data Sci."],"published-print":{"date-parts":[[2021,12]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>We are interested in the widespread problem of clustering documents and finding topics in large collections of written documents in the presence of metadata and hyperlinks. To tackle the challenge of accounting for these different types of datasets, we propose a novel framework based on Multilayer Networks and Stochastic Block Models. The main innovation of our approach over other techniques is that it applies the same non-parametric probabilistic framework to the different sources of datasets simultaneously. The key difference to other multilayer complex networks is the strong unbalance between the layers, with the average degree of different node types scaling differently with system size. We show that the latter observation is due to generic properties of text, such as Heaps\u2019 law, and strongly affects the inference of communities. 
We present and discuss the performance of our method in different datasets (hundreds of Wikipedia documents, thousands of scientific papers, and thousands of E-mails) showing that taking into account multiple types of information provides a more nuanced view on topic- and document-clusters and increases the ability to predict missing links.<\/jats:p>","DOI":"10.1140\/epjds\/s13688-021-00288-5","type":"journal-article","created":{"date-parts":[[2021,6,28]],"date-time":"2021-06-28T15:03:31Z","timestamp":1624892611000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":14,"title":["Multilayer networks for text analysis with multiple data types"],"prefix":"10.1140","volume":"10","author":[{"given":"Charles C.","family":"Hyland","sequence":"first","affiliation":[]},{"given":"Yuanming","family":"Tao","sequence":"additional","affiliation":[]},{"given":"Lamiae","family":"Azizi","sequence":"additional","affiliation":[]},{"given":"Martin","family":"Gerlach","sequence":"additional","affiliation":[]},{"given":"Tiago P.","family":"Peixoto","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1932-3710","authenticated-orcid":false,"given":"Eduardo G.","family":"Altmann","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2021,6,28]]},"reference":[{"key":"288_CR1","doi-asserted-by":"publisher","DOI":"10.1142\/10282","volume-title":"Statistical data fusion","author":"B Kedem","year":"2017","unstructured":"Kedem B, De Oliveira V, Sverchkov M (2017) Statistical data fusion. World Scientific, Singapore"},{"key":"288_CR2","volume":"2013","author":"F Costanedo","year":"2013","unstructured":"Costanedo F (2013) A review of data fusion techniques. 
Sci World J 2013:704504","journal-title":"Sci World J"},{"key":"288_CR3","doi-asserted-by":"publisher","first-page":"473","DOI":"10.1145\/2487575.2487693","volume-title":"Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining","author":"Y Zhu","year":"2013","unstructured":"Zhu Y, Yan X, Getoor L, Moore C (2013) Scalable text and link analysis with mixed-topic link models. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp\u00a0473\u2013481"},{"issue":"3","key":"288_CR4","doi-asserted-by":"publisher","first-page":"203","DOI":"10.1093\/comnet\/cnu016","volume":"2","author":"M Kivel\u00e4","year":"2014","unstructured":"Kivel\u00e4 M, Arenas A, Barthelemy M, Gleeson J, Moreno Y, Porter M (2014) Multilayer networks. J Complex Netw 2(3):203\u2013271","journal-title":"J Complex Netw"},{"key":"288_CR5","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1016\/j.physrep.2016.04.005","volume":"635","author":"M Zanin","year":"2016","unstructured":"Zanin M, Papo D, Sousa PA, Menasalvas E, Nicchi A, Kubik E, Boccaletti S (2016) Combining complex networks and data mining: why and how. Phys Rep 635:1\u201344","journal-title":"Phys Rep"},{"key":"288_CR6","volume-title":"Proceedings of SysML","author":"E Breck","year":"2019","unstructured":"Breck E, Zinkevich M, Polyzotis N, Whang S, Roy S (2019) Data validation for machine learning. In: Proceedings of SysML"},{"key":"288_CR7","volume-title":"Third conference on machine learning and systems (MLSys)","author":"K O\u2019Leary","year":"2020","unstructured":"O\u2019Leary K, Uchida M (2020) Common problems with creating machine learning pipelines from existing code. 
In: Third conference on machine learning and systems (MLSys)"},{"key":"288_CR8","doi-asserted-by":"publisher","first-page":"391","DOI":"10.1007\/978-3-642-13657-3_43","volume-title":"Advances in knowledge discovery and data mining","author":"R Arun","year":"2010","unstructured":"Arun R, Suresh V, Madhavan CEV, Murthy MNN (2010) On finding the natural number of topics with latent Dirichlet allocation: some observations. In: Advances in knowledge discovery and data mining, 391\u2013402"},{"key":"288_CR9","doi-asserted-by":"publisher","first-page":"1775","DOI":"10.1016\/j.neucom.2008.06.011","volume":"72","author":"J Cao","year":"2009","unstructured":"Cao J, Xia T, Li J, Zhang Y, Tang S (2009) A density-based method for adaptive LDA model selection. Neurocomputing 72:1775\u20131781","journal-title":"Neurocomputing"},{"key":"288_CR10","volume":"6","author":"T Vall\u00e8s-Catal\u00e0","year":"2016","unstructured":"Vall\u00e8s-Catal\u00e0 T, Massucci FA, Guimer\u00e0 R, Sales-Pardo M (2016) Multilayer stochastic block models reveal the multilayer structure of complex networks. Phys Rev X 6:011036","journal-title":"Phys Rev X"},{"issue":"4","key":"288_CR11","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.92.042807","volume":"92","author":"TP Peixoto","year":"2015","unstructured":"Peixoto TP (2015) Inferring the mesoscale structure of layered, edge-valued and time-varying networks. Phys Rev E 92(4):042807","journal-title":"Phys Rev E"},{"key":"288_CR12","volume-title":"Advances in network clustering and blockmodeling, ch. 11","author":"TP Peixoto","year":"2019","unstructured":"Peixoto TP (2019) Bayesian stochastic blockmodeling. In: Advances in network clustering and blockmodeling, ch. 11"},{"key":"288_CR13","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.84.036103","volume":"84","author":"B Ball","year":"2011","unstructured":"Ball B, Karrer B, Newman MEJ (2011) Efficient and principled method for detecting communities in networks. 
Phys Rev E 84:036103","journal-title":"Phys Rev E"},{"issue":"1","key":"288_CR14","volume":"5","author":"A Lancichinetti","year":"2015","unstructured":"Lancichinetti A, Sirer MI, Wang JX, Acuna D, K\u00f6rding K, Amaral LAN (2015) High-reproducibility and high-accuracy method for automated topic classification. Phys Rev X 5(1):011007","journal-title":"Phys Rev X"},{"key":"288_CR15","doi-asserted-by":"publisher","DOI":"10.1126\/sciadv.aaq1360","volume":"4","author":"M Gerlach","year":"2018","unstructured":"Gerlach M, Peixoto TP, Altmann EG (2018) A network approach to topic models. Sci Adv 4:eaaq1360","journal-title":"Sci Adv"},{"key":"288_CR16","doi-asserted-by":"crossref","unstructured":"Blei DM (2012) Probabilistic topic models. Commun ACM 55","DOI":"10.1145\/2133806.2133826"},{"issue":"3\u20135","key":"288_CR17","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1016\/j.physrep.2009.11.002","volume":"486","author":"S Fortunato","year":"2010","unstructured":"Fortunato S (2010) Community detection in graphs. Phys Rep 486(3\u20135):75\u2013174","journal-title":"Phys Rep"},{"key":"288_CR18","doi-asserted-by":"crossref","unstructured":"Bouveyron C, Latouche P, Zreik R (2016) The stochastic topic block model for the clustering of vertices in networks with textual edges. Stat Comput: 1\u201321","DOI":"10.1007\/s11222-016-9713-7"},{"key":"288_CR19","doi-asserted-by":"crossref","unstructured":"Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2)","DOI":"10.1016\/0378-8733(83)90021-7"},{"key":"288_CR20","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.83.016107","volume":"83","author":"B Karrer","year":"2011","unstructured":"Karrer B, Newman MEJ (2011) Stochastic blockmodels and community structure in networks. 
Phys Rev E 83:016107","journal-title":"Phys Rev E"},{"key":"288_CR21","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.74.035102","volume":"74","author":"M Hastings","year":"2006","unstructured":"Hastings M (2006) Community detection as an inference problem. Phys Rev E, Stat Nonlinear Soft Matter Phys 74:035102","journal-title":"Phys Rev E, Stat Nonlinear Soft Matter Phys"},{"key":"288_CR22","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.102.032309","volume":"102","author":"T-C Yen","year":"2020","unstructured":"Yen T-C, Larremore DB (2020) Community detection in bipartite networks with stochastic blockmodels. Phys Rev E 102:032309","journal-title":"Phys Rev E"},{"key":"288_CR23","unstructured":"Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3"},{"issue":"3","key":"288_CR24","volume":"6","author":"D Hric","year":"2016","unstructured":"Hric D, Peixoto TP, Fortunato S (2016) Network structure, metadata, and the prediction of missing nodes and annotations. Phys Rev X 6(3):031038","journal-title":"Phys Rev X"},{"key":"288_CR25","doi-asserted-by":"crossref","unstructured":"Newman M, Clauset A (2015) Structure and inference in annotated networks. Nat Commun 7","DOI":"10.1038\/ncomms11863"},{"key":"288_CR26","doi-asserted-by":"crossref","unstructured":"Altmann EG, Gerlach M (2016) Statistical laws in linguistics. Creativity and universality in language: 7\u201326","DOI":"10.1007\/978-3-319-24403-7_2"},{"key":"288_CR27","doi-asserted-by":"publisher","first-page":"22073","DOI":"10.1073\/pnas.0908366106","volume":"106","author":"R Guimera","year":"2009","unstructured":"Guimera R, Pardo MS (2009) Missing and spurious interactions and the reconstruction of complex networks. 
Proc Natl Acad Sci 106:22073\u201322078","journal-title":"Proc Natl Acad Sci"},{"key":"288_CR28","unstructured":"Codes: TopSBM (Topic Models based on Stochastic Block Models, https:\/\/topsbm.github.io) and graph-tool (Efficient network analysis, https:\/\/graph-tool.skewed.de)"},{"issue":"6","key":"288_CR29","doi-asserted-by":"publisher","DOI":"10.1063\/1.4954215","volume":"26","author":"HF de Arruda","year":"2016","unstructured":"de Arruda HF, Costa LDF, Amancio DR (2016) Topic segmentation via community detection in complex networks. Chaos 26(6):063120","journal-title":"Chaos"},{"key":"288_CR30","doi-asserted-by":"crossref","unstructured":"Leydesdorff L, Nerghes A (2017) Co-word maps and topic modeling: a comparison using small and medium-sized corpora ($N< 1000$). Journal of the Association for Information Science and Technology 68(4)","DOI":"10.1002\/asi.23740"},{"key":"288_CR31","volume-title":"Type-token mathematics","author":"G Herdan","year":"1960","unstructured":"Herdan G (1960) Type-token mathematics. Mouton"},{"key":"288_CR32","volume-title":"Information retrieval","author":"HS Heaps","year":"1978","unstructured":"Heaps HS (1978) Information retrieval. Academic, New York"},{"issue":"1","key":"288_CR33","volume":"4","author":"TP Peixoto","year":"2014","unstructured":"Peixoto TP (2014) Hierarchical block structures and high-resolution model selection in large networks. Phys Rev X 4(1):011047","journal-title":"Phys Rev X"},{"issue":"1","key":"288_CR34","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.95.012317","volume":"95","author":"TP Peixoto","year":"2017","unstructured":"Peixoto TP (2017) Nonparametric Bayesian inference of the microcanonical stochastic block model. 
Phys Rev E 95(1):012317","journal-title":"Phys Rev E"},{"key":"288_CR35","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.90.062805","volume":"90","author":"D Hric","year":"2014","unstructured":"Hric D, Darst RK, Fortunato S (2014) Community detection in networks: structural communities versus ground truth. Phys Rev E 90:062805","journal-title":"Phys Rev E"},{"issue":"1","key":"288_CR36","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.89.012804","volume":"89","author":"TP Peixoto","year":"2014","unstructured":"Peixoto TP (2014) Efficient Monte Carlo and greedy heuristic for the inference of stochastic block models. Phys Rev E 89(1):012804","journal-title":"Phys Rev E"},{"key":"288_CR37","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.102.012305","volume":"102","author":"TP Peixoto","year":"2020","unstructured":"Peixoto TP (2020) Merge-split Markov chain Monte Carlo for community detection. Phys Rev E 102:012305","journal-title":"Phys Rev E"},{"key":"288_CR38","doi-asserted-by":"crossref","DOI":"10.1093\/oso\/9780198517962.001.0001","volume-title":"Monte Carlo methods in statistical physics","author":"MEJ Newman","year":"1999","unstructured":"Newman MEJ, Barkema GT (1999) Monte Carlo methods in statistical physics. Oxford University Press, London"},{"key":"288_CR39","doi-asserted-by":"publisher","first-page":"465","DOI":"10.1016\/0005-1098(78)90005-5","volume":"14","author":"J Rissanen","year":"1978","unstructured":"Rissanen J (1978) Modeling by shortest data description. Automatica 14:465\u2013471","journal-title":"Automatica"},{"key":"288_CR40","doi-asserted-by":"publisher","DOI":"10.7551\/mitpress\/4643.001.0001","volume-title":"The minimum description length principle","author":"P Gr\u00fcnwald","year":"2007","unstructured":"Gr\u00fcnwald P (2007) The minimum description length principle. 
MIT Press, Cambridge"},{"key":"288_CR41","volume":"11","author":"TP Peixoto","year":"2021","unstructured":"Peixoto TP (2021) Revealing consensus and dissensus between network partitions. Phys Rev X 11:021003","journal-title":"Phys Rev X"},{"key":"288_CR42","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.97.062316","volume":"97","author":"T Vall\u00e8s-Catal\u00e0","year":"2018","unstructured":"Vall\u00e8s-Catal\u00e0 T, Peixoto TP, Guimer\u00e0 R, Sales-Pardo M (2018) Consistencies and inconsistencies between model selection and link prediction in networks. Phys Rev E 97:062316","journal-title":"Phys Rev E"},{"key":"288_CR43","series-title":"Procedia computer science","volume-title":"The role of text pre-processing in sentiment analysis","author":"E Haddi","year":"2013","unstructured":"Haddi E, Liu X, Shi Y (2013) The role of text pre-processing in sentiment analysis. Procedia computer science, vol\u00a017"},{"issue":"1","key":"288_CR44","doi-asserted-by":"publisher","DOI":"10.1088\/1742-5468\/aa53f5","volume":"2017","author":"EG Altmann","year":"2017","unstructured":"Altmann EG, Dias L, Gerlach M (2017) Generalized entropies and the similarity of texts. J Stat Mech Theory Exp 2017(1):014002","journal-title":"J Stat Mech Theory Exp"},{"key":"288_CR45","volume-title":"Natural language processing with Python","author":"S Bird","year":"2009","unstructured":"Bird S, Loper E, Klein E (2009) Natural language processing with Python. O\u2019Reilly Media Inc."},{"key":"288_CR46","doi-asserted-by":"publisher","first-page":"98","DOI":"10.1038\/nature06830","volume":"453","author":"A Clauset","year":"2008","unstructured":"Clauset A, Moore C, Newman MEJ (2008) Hierarchical structure and the prediction of missing links in networks. 
Nature 453:98\u2013101","journal-title":"Nature"}],"container-title":["EPJ Data Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1140\/epjds\/s13688-021-00288-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1140\/epjds\/s13688-021-00288-5\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1140\/epjds\/s13688-021-00288-5.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,2]],"date-time":"2024-09-02T20:42:44Z","timestamp":1725309764000},"score":1,"resource":{"primary":{"URL":"https:\/\/epjdatascience.springeropen.com\/articles\/10.1140\/epjds\/s13688-021-00288-5"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2021,6,28]]},"references-count":46,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2021,12]]}},"alternative-id":["288"],"URL":"https:\/\/doi.org\/10.1140\/epjds\/s13688-021-00288-5","relation":{},"ISSN":["2193-1127"],"issn-type":[{"type":"electronic","value":"2193-1127"}],"subject":[],"published":{"date-parts":[[2021,6,28]]},"assertion":[{"value":"5 March 2021","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"14 June 2021","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"28 June 2021","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"The authors declare that they have no competing interests.","order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing interests"}}],"article-number":"33"}}