{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T02:01:48Z","timestamp":1774663308861,"version":"3.50.1"},"reference-count":15,"publisher":"SAGE Publications","issue":"1","license":[{"start":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T00:00:00Z","timestamp":1770076800000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T00:00:00Z","timestamp":1770076800000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Statistical Journal of the IAOS"],"published-print":{"date-parts":[[2026,3]]},"abstract":"<jats:p>The growing availability of online data creates new opportunities to improve the timeliness and detail of official statistics, particularly in domains such as price monitoring and inflation measurement. However, leveraging web-scraped data for official use requires alignment with standardized classification frameworks such as the European Classification of Individual Consumption According to Purpose (ECOICOP). We train two natural-language models, a lightweight convolutional neural network (CNN) and a fine-tuned BERTimbau transformer, to classify Portuguese food and beverage items into ECOICOP categories. Using 100,000 product titles scraped from six national supermarket sites and labeled via a human-in-the-loop workflow, the CNN reaches a macro-F1 of 92.19 % with minimal computing cost, while the transformer attains 94.00 %, the first such result for Portuguese. Both models are published on Hugging Face, enabling reproducible inference at scale while the source data remain confidential. 
The study delivers the first open-source Portuguese ECOICOP classifiers for food and beverage products, a replicable low-resource labeling workflow, and a benchmark of accuracy-speed trade-offs to guide researchers in similar tasks.<\/jats:p>","DOI":"10.1177\/18747655251414407","type":"journal-article","created":{"date-parts":[[2026,2,3]],"date-time":"2026-02-03T17:21:54Z","timestamp":1770139314000},"page":"122-136","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":0,"title":["Turning web data into official statistics: Classifying Portuguese retail products with NLP models"],"prefix":"10.1177","volume":"42","author":[{"ORCID":"https:\/\/orcid.org\/0009-0006-1859-5710","authenticated-orcid":false,"given":"Juliana","family":"de Freitas Ulisses Machado","sequence":"first","affiliation":[{"name":"University of Porto"}]},{"ORCID":"https:\/\/orcid.org\/0000-0001-7980-0972","authenticated-orcid":false,"given":"Bruno","family":"Veloso","sequence":"additional","affiliation":[{"name":"FEP, University of Porto & LIAAD INESC TEC, Porto, Portugal"}]}],"member":"179","published-online":{"date-parts":[[2026,2,3]]},"reference":[{"key":"e_1_3_3_2_2","doi-asserted-by":"publisher","DOI":"10.1257\/jep.30.2.151"},{"key":"e_1_3_3_3_2","unstructured":"Eurostat. Practical guidelines on web scraping for the HICP; 2020. https:\/\/ec.europa.eu\/eurostat\/documents\/272892\/12032198\/Guidelines-web-scraping-HICP-11-2020.pdf."},{"key":"e_1_3_3_4_2","doi-asserted-by":"publisher","DOI":"10.1016\/j.ijforecast.2017.12.002"},{"key":"e_1_3_3_5_2","doi-asserted-by":"publisher","DOI":"10.1108\/BFJ-02-2019-0081"},{"key":"e_1_3_3_6_2","doi-asserted-by":"publisher","DOI":"10.1111\/ajae.12158"},{"key":"e_1_3_3_7_2","unstructured":"Eurostat. European statistics code of practice; 2018. https:\/\/ec.europa.eu\/eurostat\/documents\/4031688\/8971242\/KS-02-18-142-EN-N.pdf\/e7f85f07-91db-4312-8118-f729c75878c7?t=1528447068000. 
DOI: 10.2785\/798269."},{"key":"e_1_3_3_8_2","doi-asserted-by":"crossref","unstructured":"Bertolotto M. The perils of using aggregate data in real exchange rate estimations. Social Science Research Network; 2016. https:\/\/ssrn.com\/abstract=2882339 or http:\/\/dx.doi.org\/10.2139\/ssrn.2882339.","DOI":"10.2139\/ssrn.2882339"},{"key":"e_1_3_3_9_2","unstructured":"Jahanshahi H Ozyegen O Cevik M et\u00a0al. Text classification for predicting multi-level product categories. arXiv; 2021. https:\/\/arxiv.org\/abs\/2109.01084. DOI: 10.48550\/ARXIV.2109.01084."},{"key":"e_1_3_3_10_2","unstructured":"Devlin J Chang MW Lee K et\u00a0al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv; 2018. https:\/\/arxiv.org\/abs\/1810.04805. DOI: 10.48550\/ARXIV.1810.04805."},{"key":"e_1_3_3_11_2","unstructured":"Liu Y Ott M Goyal N et\u00a0al. RoBERTa: a robustly optimized BERT pretraining approach. arXiv; 2019. https:\/\/arxiv.org\/abs\/1907.11692. DOI: 10.48550\/ARXIV.1907.11692."},{"key":"e_1_3_3_12_2","unstructured":"Ma S Yang J Huang H et\u00a0al. XLM-T: Scaling up multilingual machine translation with pretrained cross-lingual transformer encoders. ArXiv; 2020. abs\/2012.15547."},{"key":"e_1_3_3_13_2","unstructured":"Lehmann E Simonyi A Henkel L et\u00a0al. Bilingual transfer learning for online product classification. In: Proceedings of workshop on natural language processing in e-commerce. Barcelona Spain: Association for Computational Linguistics; 2020 pp.21\u201331. https:\/\/aclanthology.org\/2020.ecomnlp-1.3."},{"key":"e_1_3_3_14_2","doi-asserted-by":"publisher","DOI":"10.1017\/dap.2020.13"},{"key":"e_1_3_3_15_2","unstructured":"Hartmann N Fonseca E Shulby C et\u00a0al. Portuguese word embeddings: evaluating on word analogies and natural language tasks. arXiv preprint arXiv:170806025 2017."},{"key":"e_1_3_3_16_2","doi-asserted-by":"crossref","unstructured":"Souza F Nogueira R Lotufo R. 
BERTimbau: pretrained BERT models for Brazilian Portuguese. In: Cerri R and Prati RC (eds). Intelligent systems. Cham: Springer International Publishing 2020 pp.403\u2013417.","DOI":"10.1007\/978-3-030-61377-8_28"}],"container-title":["Statistical Journal of the IAOS"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/18747655251414407","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/18747655251414407","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/18747655251414407","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,3,28]],"date-time":"2026-03-28T01:46:16Z","timestamp":1774662376000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/10.1177\/18747655251414407"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,2,3]]},"references-count":15,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,3]]}},"alternative-id":["10.1177\/18747655251414407"],"URL":"https:\/\/doi.org\/10.1177\/18747655251414407","relation":{},"ISSN":["1874-7655","1875-9254"],"issn-type":[{"value":"1874-7655","type":"print"},{"value":"1875-9254","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,2,3]]}}}