{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,19]],"date-time":"2026-02-19T16:08:54Z","timestamp":1771517334125,"version":"3.50.1"},"reference-count":51,"publisher":"Springer Science and Business Media LLC","issue":"1","license":[{"start":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T00:00:00Z","timestamp":1739923200000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T00:00:00Z","timestamp":1739923200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100031478","name":"NextGenerationEU","doi-asserted-by":"publisher","award":["PE00000018"],"award-info":[{"award-number":["PE00000018"]}],"id":[{"id":"10.13039\/100031478","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100031478","name":"NextGenerationEU","doi-asserted-by":"publisher","award":["CN00000023"],"award-info":[{"award-number":["CN00000023"]}],"id":[{"id":"10.13039\/100031478","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Appl Netw Sci"],"abstract":"<jats:title>Abstract<\/jats:title>\n          <jats:p>This paper presents a novel methodology, called Word Co-occurrence SVN topic model (WCSVNtm), for document clustering and topic modeling in textual datasets. This method represents the corpus as a bipartite network of words and documents to rigorously assess the statistical significance of word co-occurrences within documents and document overlap based on shared vocabulary. By employing the Leiden community detection algorithm to the SVN, distinct communities of words can be identified and interpreted as topics. Similarly, documents can be sorted into groups based on their thematic similarities. 
We demonstrate the effectiveness of our approach by analyzing three datasets: a set of 120 Wikipedia articles, the arXiv10 dataset, which consists of 100,000 abstracts from scientific papers, and a sampled subset of 10,000 documents from the original arXiv10. To benchmark our results, we compare our approach with several well-established models in the field of topic modeling and document clustering, including the hierarchical Stochastic Block Model (hSBM), BERTopic, and Latent Dirichlet Allocation (LDA). The results show that WCSVNtm achieves competitive performance across all datasets, automatically\u00a0selecting the number of topics and document clusters, whereas state-of-the-art methods require prior knowledge or additional tuning for optimization. Finally, any advancements in community detection algorithms could further improve our method.<\/jats:p>","DOI":"10.1007\/s41109-025-00693-z","type":"journal-article","created":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T18:30:22Z","timestamp":1739989822000},"update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":1,"title":["Statistically validated network for analysing textual data"],"prefix":"10.1007","volume":"10","author":[{"given":"Andrea","family":"Simonetti","sequence":"first","affiliation":[]},{"given":"Alessandro","family":"Albano","sequence":"additional","affiliation":[]},{"given":"Michele","family":"Tumminello","sequence":"additional","affiliation":[]},{"given":"T.","family":"Di Matteo","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2025,2,19]]},"reference":[{"key":"693_CR1","first-page":"1981","volume":"9","author":"EM Airoldi","year":"2008","unstructured":"Airoldi EM, Blei DM, Fienberg SE, Xing EP (2008) Mixed membership stochastic Blockmodels. 
J Mach Learn Res 9:1981\u20132014","journal-title":"J Mach Learn Res"},{"key":"693_CR2","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s41109-018-0109-9","volume":"4","author":"MT Altuncu","year":"2019","unstructured":"Altuncu MT, Mayer E, Yaliraki SN, Barahona M (2019) From free text to clusters of content in health records: an unsupervised graph partitioning approach. Appl Netw Sci 4:1\u201323","journal-title":"Appl Netw Sci"},{"key":"693_CR3","doi-asserted-by":"crossref","unstructured":"Altuncu MT, Yaliraki SN, Barahona M (2021) Graph-based topic extraction from vector embeddings of text documents: application to a corpus of news articles. Complex networks & their applications IX: Volume 2, proceedings of the ninth international conference on complex networks and their applications complex networks 2020 (pp 154\u2013166)","DOI":"10.1007\/978-3-030-65351-4_13"},{"key":"693_CR4","unstructured":"Angelov D (2020) Top2vec: distributed representations of topics. arXiv:2008.09470"},{"key":"693_CR5","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.84.036103","volume":"84","author":"B Ball","year":"2011","unstructured":"Ball B, Karrer B, Newman MEJ (2011) Efficient and principled method for detecting communities in networks. Phys Rev E 84:036103","journal-title":"Phys Rev E"},{"key":"693_CR6","doi-asserted-by":"publisher","first-page":"159","DOI":"10.1016\/j.eswa.2017.08.047","volume":"91","author":"M Belford","year":"2018","unstructured":"Belford M, Mac Namee B, Greene D (2018) Stability of topic modeling via matrix factorization. Expert Syst Appl 91:159\u2013169","journal-title":"Expert Syst Appl"},{"key":"693_CR7","unstructured":"Bengio Y, Ducharme R, Vincent P, Janvin C (2003) A neural probabilistic language model. 
J Mach Learn Res 3:1137\u20131155"},{"issue":"1","key":"693_CR8","doi-asserted-by":"publisher","first-page":"289","DOI":"10.1111\/j.2517-6161.1995.tb02031.x","volume":"57","author":"Y Benjamini","year":"1995","unstructured":"Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol) 57(1):289\u2013300","journal-title":"J R Stat Soc Ser B (Methodol)"},{"key":"693_CR9","doi-asserted-by":"crossref","unstructured":"Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 1165\u20131188","DOI":"10.1214\/aos\/1013699998"},{"key":"693_CR10","unstructured":"Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993\u20131022"},{"key":"693_CR11","doi-asserted-by":"crossref","unstructured":"Blondel VD, Guillaume J-L, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008","DOI":"10.1088\/1742-5468\/2008\/10\/P10008"},{"key":"693_CR12","unstructured":"Casella G, Berger RL (2002) Statistical inference. Duxbury Press"},{"key":"693_CR13","first-page":"439","volume":"8","author":"AB Dieng","year":"2020","unstructured":"Dieng AB, Ruiz FJ, Blei DM (2020) Topic modeling in embedding spaces. Trans Assoc Comput Ling 8:439\u2013453","journal-title":"Trans Assoc Comput Ling"},{"key":"693_CR14","doi-asserted-by":"publisher","DOI":"10.1063\/1.3057290","volume-title":"Transmission of information: a statistical theory of communications","author":"RM Fano","year":"1961","unstructured":"Fano RM (1961) Transmission of information: a statistical theory of communications. MIT Press, Cambridge, MA"},{"key":"693_CR15","doi-asserted-by":"crossref","unstructured":"Farhangi A, Sui N, Hua N, Bai H, Huang A, Guo Z (2022) Protoformer: embedding prototypes for transformers. 
Advances in knowledge discovery and data mining: 26th pacific-asia conference, Pakdd 2022, Chengdu, China, May 16\u201319, 2022, proceedings, part i, pp 447\u2013458","DOI":"10.1007\/978-3-031-05933-9_35"},{"issue":"3\u20135","key":"693_CR16","doi-asserted-by":"publisher","first-page":"75","DOI":"10.1016\/j.physrep.2009.11.002","volume":"486","author":"S Fortunato","year":"2010","unstructured":"Fortunato S (2010) Community detection in graphs. Phys Rep 486(3\u20135):75\u2013174","journal-title":"Phys Rep"},{"key":"693_CR17","doi-asserted-by":"crossref","unstructured":"Gerlach M, Peixoto TP, Altmann EG (2018) A network approach to topic models. Sci Adv 4(7):eaaq1360","DOI":"10.1126\/sciadv.aaq1360"},{"issue":"1","key":"693_CR18","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0084912","volume":"9","author":"R Gramatica","year":"2014","unstructured":"Gramatica R, Di Matteo T, Giorgetti S, Barbiani M, Bevec D, Aste T (2014) Graph theory enables drug repurposing-how a mathematical model can drive the discovery of hidden mechanisms of action. PLoS ONE 9(1):e84912","journal-title":"PLoS ONE"},{"key":"693_CR19","unstructured":"Grootendorst M (2022) Bertopic: neural topic modeling with a class-based tf-idf procedure. arXiv:2203.05794"},{"key":"693_CR20","doi-asserted-by":"crossref","unstructured":"Gr\u00fcnwald PD (2007) The minimum description length principle. MIT Press","DOI":"10.7551\/mitpress\/4643.001.0001"},{"key":"693_CR21","doi-asserted-by":"crossref","unstructured":"Gururangan S, Marasovi\u0107 A, Swayamdipta S, Lo K, Beltagy I, Downey D, Smith NA (2020) Don\u2019t stop pretraining: adapt language models to domains and tasks. 
arXiv:2004.10964","DOI":"10.18653\/v1\/2020.acl-main.740"},{"issue":"4","key":"693_CR22","doi-asserted-by":"publisher","first-page":"693","DOI":"10.1080\/14697688.2014.969889","volume":"15","author":"V Hatzopoulos","year":"2015","unstructured":"Hatzopoulos V, Iori G, Mantegna RN, Micciche S, Tumminello M (2015) Quantifying preferential trading in the e-mid interbank market. Quant Finance 15(4):693\u2013710","journal-title":"Quant Finance"},{"issue":"2","key":"693_CR23","doi-asserted-by":"publisher","first-page":"109","DOI":"10.1016\/0378-8733(83)90021-7","volume":"5","author":"PW Holland","year":"1983","unstructured":"Holland PW, Laskey KB, Leinhardt S (1983) Stochastic blockmodels: first steps. Soc Netw 5(2):109\u2013137","journal-title":"Soc Netw"},{"issue":"6","key":"693_CR24","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.90.062805","volume":"90","author":"D Hric","year":"2014","unstructured":"Hric D, Darst RK, Fortunato S (2014) Community detection in networks: structural communities versus ground truth. Phys Rev E 90(6):062805","journal-title":"Phys Rev E"},{"issue":"1","key":"693_CR25","doi-asserted-by":"publisher","first-page":"33","DOI":"10.1140\/epjds\/s13688-021-00288-5","volume":"10","author":"CC Hyland","year":"2021","unstructured":"Hyland CC, Tao Y, Azizi L, Gerlach M, Peixoto TP, Altmann EG (2021) Multilayer networks for text analysis with multiple data types. EPJ Data Sci 10(1):33","journal-title":"EPJ Data Sci"},{"issue":"1","key":"693_CR26","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.83.016107","volume":"83","author":"B Karrer","year":"2011","unstructured":"Karrer B, Newman ME (2011) Stochastic blockmodels and community structure in networks. 
Phys Rev E 83(1):016107","journal-title":"Phys Rev E"},{"key":"693_CR27","volume":"5","author":"A Lancichinetti","year":"2015","unstructured":"Lancichinetti A, Sirer MI, Wang JX, Acuna D, K\u00f6rding K, Amaral LAN (2015) A high-reproducibility and high-accuracy method for automated topic classification. Phys Rev X 5:011007","journal-title":"Phys Rev X"},{"key":"693_CR28","unstructured":"Mikolov T, Chen K, Corrado G, Dean J (2013) Distributed representations of words and phrases and their compositionality. Nips 1\u20139"},{"key":"693_CR29","unstructured":"Mikolov T, Corrado G, Chen K, Dean J (2013) Efficient estimation of word representations in vector space. In: Proceedings of the international conference on learning representations (iclr 2013), pp 1\u201312"},{"key":"693_CR30","doi-asserted-by":"crossref","unstructured":"Miller RG (1981) Simultaneous statistical inference. Springer-Verlag","DOI":"10.1007\/978-1-4613-8122-8"},{"key":"693_CR31","unstructured":"Paranyushkin D (2011) Identifying the pathways for meaning circulation using text network analysis. Nodus Labs 26"},{"issue":"3","key":"693_CR32","doi-asserted-by":"publisher","first-page":"489","DOI":"10.3233\/SW-160218","volume":"8","author":"H Paulheim","year":"2017","unstructured":"Paulheim H (2017) Knowledge graph refinement: a survey of approaches and evaluation methods. Semant Web 8(3):489\u2013508","journal-title":"Semant Web"},{"issue":"4","key":"693_CR33","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.92.042807","volume":"92","author":"TP Peixoto","year":"2015","unstructured":"Peixoto TP (2015) Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Phys Rev E 92(4):042807","journal-title":"Phys Rev E"},{"key":"693_CR34","doi-asserted-by":"publisher","DOI":"10.1103\/PhysRevE.95.012317","volume":"95","author":"TP Peixoto","year":"2017","unstructured":"Peixoto TP (2017) Nonparametric bayesian inference of the microcanonical stochastic block model. 
Phys Rev E 95:012317","journal-title":"Phys Rev E"},{"issue":"2","key":"693_CR35","volume":"11","author":"TP Peixoto","year":"2021","unstructured":"Peixoto TP (2021) Revealing consensus and dissensus between network partitions. Phys Rev X 11(2):021003","journal-title":"Phys Rev X"},{"key":"693_CR36","doi-asserted-by":"publisher","first-page":"1112","DOI":"10.3758\/s13423-014-0585-6","volume":"21","author":"ST Piantadosi","year":"2014","unstructured":"Piantadosi ST (2014) Zipf\u2019s word frequency law in natural language: a critical review and future directions. Psychon Bull Rev 21:1112\u20131130","journal-title":"Psychon Bull Rev"},{"issue":"5","key":"693_CR37","doi-asserted-by":"publisher","first-page":"465","DOI":"10.1016\/0005-1098(78)90005-5","volume":"14","author":"J Rissanen","year":"1978","unstructured":"Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465\u2013471","journal-title":"Automatica"},{"key":"693_CR38","doi-asserted-by":"crossref","unstructured":"R\u00f6der M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. Proceedings of the eighth ACM international conference on web search and data mining, pp 399\u2013408","DOI":"10.1145\/2684822.2685324"},{"key":"693_CR39","doi-asserted-by":"crossref","unstructured":"Simonetti A, Albano A, Plaia A, Tumminello M (2023) Ranking coherence in topic models using statistically validated networks. J Inf Sci 01655515221148369","DOI":"10.1177\/01655515221148369"},{"key":"693_CR40","doi-asserted-by":"publisher","unstructured":"Sprent P (2011) Fisher exact test. Lovric M (Eds) International encyclopedia of statistical science, pp 524\u2013525. Berlin, Heidelberg: Springer Berlin Heidelberg. 
https:\/\/doi.org\/10.1007\/978-3-642-04898-2-253","DOI":"10.1007\/978-3-642-04898-2-253"},{"issue":"1","key":"693_CR41","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1038\/s41598-019-41695-z","volume":"9","author":"VA Traag","year":"2019","unstructured":"Traag VA, Waltman L, Van Eck NJ (2019) From Louvain to Leiden: guaranteeing well-connected communities. Sci Rep 9(1):1\u201312","journal-title":"Sci Rep"},{"issue":"2","key":"693_CR42","doi-asserted-by":"publisher","first-page":"381","DOI":"10.1111\/jori.12415","volume":"90","author":"M Tumminello","year":"2023","unstructured":"Tumminello M, Consiglio A, Vassallo P, Cesari R, Farabullini F (2023) Insurance fraud detection: a statistically validated network approach. J Risk Insur 90(2):381\u2013419","journal-title":"J Risk Insur"},{"issue":"5","key":"693_CR43","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0064703","volume":"8","author":"M Tumminello","year":"2013","unstructured":"Tumminello M, Edling C, Liljeros F, Mantegna RN, Sarnecki J (2013) The phenomenology of specialization of criminal suspects. PLoS ONE 8(5):e64703","journal-title":"PLoS ONE"},{"issue":"3","key":"693_CR44","doi-asserted-by":"publisher","DOI":"10.1371\/journal.pone.0017994","volume":"6","author":"M Tumminello","year":"2011","unstructured":"Tumminello M, Micciche S, Lillo F, Piilo J, Mantegna RN (2011) Statistically validated networks in bipartite complex systems. PLoS ONE 6(3):e17994","journal-title":"PLoS ONE"},{"issue":"01","key":"693_CR45","doi-asserted-by":"publisher","first-page":"P01019","DOI":"10.1088\/1742-5468\/2011\/01\/P01019","volume":"2011","author":"M Tumminello","year":"2011","unstructured":"Tumminello M, Micciche S, Lillo F, Varho J, Piilo J, Mantegna RN (2011) Community characterization of heterogeneous complex systems. 
J Stat Mech Theory Exp 2011(01):P01019","journal-title":"J Stat Mech Theory Exp"},{"issue":"1","key":"693_CR46","volume":"6","author":"T Valles-Catala","year":"2016","unstructured":"Valles-Catala T, Massucci FA, Guimera R, Sales-Pardo M (2016) Multilayer stochastic block models reveal the multilayer structure of complex networks. Phys Rev X 6(1):011036","journal-title":"Phys Rev X"},{"key":"693_CR47","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s41109-019-0228-y","volume":"4","author":"A Veremyev","year":"2019","unstructured":"Veremyev A, Semenov A, Pasiliao EL, Boginski V (2019) Graph-based exploration and clustering analysis of semantic spaces. Appl Netw Sci 4:1\u201326","journal-title":"Appl Netw Sci"},{"key":"693_CR48","first-page":"581","volume":"7","author":"Y Xu","year":"2019","unstructured":"Xu Y, Lapata M (2019) Weakly supervised domain detection. Trans Assoc Comput Ling 7:581\u2013596","journal-title":"Trans Assoc Comput Ling"},{"key":"693_CR49","doi-asserted-by":"crossref","unstructured":"Zhu Y, Yan X, Getoor L, Moore C (2013) Scalable text and link analysis with mixed-topic link models. In: Proceedings of the 19th ACM sigkdd international conference on knowledge discovery and data mining, pp 473\u2013481","DOI":"10.1145\/2487575.2487693"},{"key":"693_CR50","first-page":"51","volume":"95","author":"GK Zipf","year":"1936","unstructured":"Zipf GK (1936) The psycho-biology of language: An introduction to dynamic philology, london: G. Routledge. INDEX BADIP 95:51\u201353","journal-title":"Routledge. INDEX BADIP"},{"issue":"2","key":"693_CR51","doi-asserted-by":"publisher","first-page":"379","DOI":"10.1007\/s10115-015-0882-z","volume":"48","author":"Y Zuo","year":"2016","unstructured":"Zuo Y, Zhao J, Xu K (2016) Word network topic model: a simple but general solution for short and imbalanced texts. 
Knowl Inf Syst 48(2):379\u2013398","journal-title":"Knowl Inf Syst"}],"container-title":["Applied Network Science"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41109-025-00693-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s41109-025-00693-z\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s41109-025-00693-z.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,2,19]],"date-time":"2025-02-19T18:30:30Z","timestamp":1739989830000},"score":1,"resource":{"primary":{"URL":"https:\/\/appliednetsci.springeropen.com\/articles\/10.1007\/s41109-025-00693-z"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,2,19]]},"references-count":51,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2025,12]]}},"alternative-id":["693"],"URL":"https:\/\/doi.org\/10.1007\/s41109-025-00693-z","relation":{},"ISSN":["2364-8228"],"issn-type":[{"value":"2364-8228","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,2,19]]},"assertion":[{"value":"30 April 2024","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"29 January 2025","order":2,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"19 February 2025","order":3,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}},{"order":1,"name":"Ethics","group":{"name":"EthicsHeading","label":"Declarations"}},{"value":"The authors declare that they have no competing interests.","order":2,"name":"Ethics","group":{"name":"EthicsHeading","label":"Competing 
interests"}}],"article-number":"5"}}