{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,9,26]],"date-time":"2025-09-26T22:28:10Z","timestamp":1758925690177,"version":"3.37.3"},"reference-count":24,"publisher":"Springer Science and Business Media LLC","issue":"3","license":[{"start":{"date-parts":[[2023,5,10]],"date-time":"2023-05-10T00:00:00Z","timestamp":1683676800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"},{"start":{"date-parts":[[2023,5,10]],"date-time":"2023-05-10T00:00:00Z","timestamp":1683676800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0"}],"funder":[{"DOI":"10.13039\/100010661","name":"Horizon 2020 Framework Programme","doi-asserted-by":"publisher","award":["101017452"],"award-info":[{"award-number":["101017452"]}],"id":[{"id":"10.13039\/100010661","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["link.springer.com"],"crossmark-restriction":false},"short-container-title":["Int J Digit Libr"],"published-print":{"date-parts":[[2024,9]]},"abstract":"<jats:title>Abstract<\/jats:title><jats:p>Metadata enrichment through text mining techniques is becoming one of the most significant tasks in digital libraries. Due to the exponential increase of open access publications, several new challenges have emerged. Raw data are usually big, unstructured, and come from heterogeneous data sources. In this paper, we introduce a text analysis framework implemented in extended SQL that exploits the scalability characteristics of modern database management systems. The purpose of this framework is to provide the opportunity to build performant end-to-end text mining pipelines which include data harvesting, cleaning, processing, and text analysis at once. SQL is selected due to its declarative nature which offers fast experimentation and the ability to build APIs so that domain experts can edit text mining workflows via easy-to-use graphical interfaces. Our experimental analysis demonstrates that the proposed framework is very effective and achieves significant speedup, up to three times faster, in common use cases compared to other popular approaches.<\/jats:p>","DOI":"10.1007\/s00799-023-00358-1","type":"journal-article","created":{"date-parts":[[2023,5,11]],"date-time":"2023-05-11T06:02:34Z","timestamp":1683784954000},"page":"457-469","update-policy":"https:\/\/doi.org\/10.1007\/springer_crossmark_policy","source":"Crossref","is-referenced-by-count":3,"title":["DETEXA: declarative extensible text exploration and analysis through SQL"],"prefix":"10.1007","volume":"25","author":[{"ORCID":"https:\/\/orcid.org\/0000-0002-2785-946X","authenticated-orcid":false,"given":"Yannis","family":"Foufoulas","sequence":"first","affiliation":[]},{"given":"Eleni","family":"Zacharia","sequence":"additional","affiliation":[]},{"given":"Harry","family":"Dimitropoulos","sequence":"additional","affiliation":[]},{"given":"Natalia","family":"Manola","sequence":"additional","affiliation":[]},{"given":"Yannis","family":"Ioannidis","sequence":"additional","affiliation":[]}],"member":"297","published-online":{"date-parts":[[2023,5,10]]},"reference":[{"key":"358_CR1","unstructured":"NLTK, https:\/\/www.nltk.org"},{"key":"358_CR2","unstructured":"PySpark, https:\/\/spark.apache.org\/docs\/latest\/api\/python\/"},{"key":"358_CR3","unstructured":"Dask, https:\/\/dask.org"},{"key":"358_CR4","doi-asserted-by":"crossref","unstructured":"Raasveldt, M., M\u00fchleisen, H.: Vectorized udfs in column-stores. In: Proceedings of the 28th International Conference on Scientific and Statistical Database Management (2016)","DOI":"10.1145\/2949689.2949703"},{"key":"358_CR5","unstructured":"https:\/\/www.postgresql.org\/docs\/current\/xfunc.html"},{"key":"358_CR6","unstructured":"https:\/\/www.vertica.com\/docs\/9.2.x\/HTML\/Content\/Authoring\/ExtendingVertica\/UDF-SQLFunctions\/CreatingUser-DefinedSQLFunctions.htm"},{"key":"358_CR7","unstructured":"Declarative Extensible Text EXploration and Analysis (DETEXA): https:\/\/github.com\/madgik\/detexa"},{"key":"358_CR8","doi-asserted-by":"crossref","unstructured":"Foufoulas, Y., Simitsis, A., Stamatogiannakis, L., Ioannidis, Y.: YeSQL: \u201cYou extend SQL\u201d with rich and highly performant user-defined functions in relational databases. PVLDB (2022)","DOI":"10.14778\/3547305.3547328"},{"key":"358_CR9","unstructured":"OpenAIRE. https:\/\/www.openaire.eu"},{"key":"358_CR10","doi-asserted-by":"crossref","unstructured":"Giannakopoulos, T., et\u00a0al.: Content visualization of scientific corpora using an extensible relational database implementation. In: International Conference on Theory and Practice of Digital Libraries. Springer, Cham (2013)","DOI":"10.1007\/978-3-319-14226-5_10"},{"key":"358_CR11","doi-asserted-by":"crossref","unstructured":"Giannakopoulos, T., Foufoulas, Y., Dimitropoulos, H., Manola, N.: Interactive text analysis and information extraction. In: Manghi, P., Candela, L., Silvello, G. (eds) Digital Libraries: Supporting Open Science. IRCDL 2019. Communications in Computer and Information Science, vol 988. Springer, Cham (2019)","DOI":"10.1007\/978-3-030-11226-4_27"},{"key":"358_CR12","doi-asserted-by":"crossref","unstructured":"Foufoulas, Y., Stamatogiannakis, L., Dimitropoulos, H., Ioannidis, Y.: High-pass text filtering for citation matching. In: Kamps, J., Tsakonas, G., Manolopoulos, Y., Iliadis, L., Karydis, I. (eds) Research and Advanced Technology for Digital Libraries. TPDL 2017. Lecture Notes in Computer Science, vol. 10450. Springer, Cham, https:\/\/doi.org\/10.1007\/978-3-319-67008-9_28 (2017)","DOI":"10.1007\/978-3-319-67008-9_28"},{"key":"358_CR13","doi-asserted-by":"crossref","unstructured":"Varoquaux, G., et\u00a0al.: Scikit-learn: Machine learning without learning the machinery. GetMobile: Mobile Comput. Commun. 19(1): 29\u201333 (2015)","DOI":"10.1145\/2786984.2786995"},{"key":"358_CR14","unstructured":"Vasiliev, Y.: Natural Language Processing with Python and SpaCy: A Practical Introduction. No Starch Press (2020)"},{"key":"358_CR15","unstructured":"Gensim text processing library, https:\/\/radimrehurek.com\/gensim"},{"key":"358_CR16","unstructured":"Devlin, J., et\u00a0al.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)"},{"key":"358_CR17","unstructured":"https:\/\/www.tensorflow.org"},{"key":"358_CR18","unstructured":"https:\/\/pytorch.org"},{"key":"358_CR19","unstructured":"Foufoulas, Y., Gogolou, A., Stamatogiannakis, L., Dimitropoulos, H., Manola, N., Ioannidis, Y.: Extracting biological knowledge from literature using SQL. In: Poster in 5th International Workshop on Mining Science Publishing WOSP 2016 (2016)"},{"key":"358_CR20","doi-asserted-by":"publisher","unstructured":"Giannakopoulos, T., Dimitropoulos, H., Metaxas, O., Manola, N., Ioannidis, Y.: Supervised content visualization of scientific publications: a case study on the arXiv dataset. In: K\u0142opotek, M.A., Koronacki, J., Marciniak, M., Mykowiecka, A., Wierzcho\u0144, S.T. (eds) Language Processing and Intelligent Information Systems. IIS 2013. Lecture Notes in Computer Science, vol 7912. Springer, Berlin, https:\/\/doi.org\/10.1007\/978-3-642-38634-3_23 (2013)","DOI":"10.1007\/978-3-642-38634-3_23"},{"key":"358_CR21","doi-asserted-by":"crossref","unstructured":"Giannakopoulos, T., Foufoulas, I., Stamatogiannakis, E., Dimitropoulos, H., Manola, N., Ioannidis, Y.: Discovering and Visualizing Interdisciplinary Content Classes in Scientific Publications. D-Lib Mag., Volume 20, Number 11\/12, https:\/\/doi.org\/10.1045\/november14-giannakopoulos (2014)","DOI":"10.1045\/november14-giannakopoulos"},{"key":"358_CR22","doi-asserted-by":"publisher","unstructured":"Giannakopoulos, T., Foufoulas, I., Stamatogiannakis, E., Dimitropoulos, H., Manola, N., Ioannidis, Y.: Visual-based classification of figures from scientific literature. In: Proceedings of the 24th International Conference on World Wide Web (WWW \u201915 Companion). Association for Computing Machinery, New York, 1059\u20131060, https:\/\/doi.org\/10.1145\/2740908.2742024 (2015)","DOI":"10.1145\/2740908.2742024"},{"key":"358_CR23","unstructured":"OpenAIRE2020 H2020 Project Deliverable D10.2 \u201cClustering Algorithms\u201d (2016), https:\/\/www.openaire.eu\/d10-2-clustering-algorithms"},{"key":"358_CR24","unstructured":"tfidf algorithm, https:\/\/www.kaggle.com\/code\/xfffrank\/tfidf-stemming\/notebook"}],"container-title":["International Journal on Digital Libraries"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-023-00358-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/article\/10.1007\/s00799-023-00358-1\/fulltext.html","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/link.springer.com\/content\/pdf\/10.1007\/s00799-023-00358-1.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2024,9,20]],"date-time":"2024-09-20T12:06:14Z","timestamp":1726833974000},"score":1,"resource":{"primary":{"URL":"https:\/\/link.springer.com\/10.1007\/s00799-023-00358-1"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2023,5,10]]},"references-count":24,"journal-issue":{"issue":"3","published-print":{"date-parts":[[2024,9]]}},"alternative-id":["358"],"URL":"https:\/\/doi.org\/10.1007\/s00799-023-00358-1","relation":{},"ISSN":["1432-5012","1432-1300"],"issn-type":[{"type":"print","value":"1432-5012"},{"type":"electronic","value":"1432-1300"}],"subject":[],"published":{"date-parts":[[2023,5,10]]},"assertion":[{"value":"10 November 2022","order":1,"name":"received","label":"Received","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"13 March 2023","order":2,"name":"revised","label":"Revised","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"16 March 2023","order":3,"name":"accepted","label":"Accepted","group":{"name":"ArticleHistory","label":"Article History"}},{"value":"10 May 2023","order":4,"name":"first_online","label":"First Online","group":{"name":"ArticleHistory","label":"Article History"}}]}}