{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T09:06:20Z","timestamp":1761642380219,"version":"build-2065373602"},"reference-count":43,"publisher":"IOP Publishing","issue":"4","license":[{"start":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T00:00:00Z","timestamp":1761609600000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"},{"start":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T00:00:00Z","timestamp":1761609600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/iopscience.iop.org\/info\/page\/text-and-data-mining"}],"funder":[{"DOI":"10.13039\/501100023650","name":"NCCR Catalysis","doi-asserted-by":"crossref","award":["180544"],"award-info":[{"award-number":["180544"]}],"id":[{"id":"10.13039\/501100023650","id-type":"DOI","asserted-by":"crossref"}]}],"content-domain":{"domain":["iopscience.iop.org"],"crossmark-restriction":false},"short-container-title":["Mach. Learn.: Sci. Technol."],"published-print":{"date-parts":[[2025,12,30]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>\n                    Transformers have proven successful in a range of sequence modelling tasks. However, these models have significant limitations: they are inherently data-greedy, and suffer from the risk of training data leakage. These limitations prevent their broad application in various domains. While the advent of foundation models (FMs) addresses the data-greedy nature of Transformers, the risk of exposing training data remains; it has been demonstrated that excerpts of the training data can be obtained by prompt engineering on an FM. To simultaneously address these limitations, we propose unified lookup tables (ULTs), a data preprocessing step that enables building and fine-tuning FMs on encoded data. ULTs enable the reuse of a trained model on new datasets without exposing any unencoded training data. The method leverages data compression methods as efficient modality tokenizers, and a common representation vocabulary to facilitate fine-tuning on encoded data. We theoretically support our claims through numerical estimations of the likelihood of reverse engineering the data encoding and practically through empirical evaluation on domains that can benefit from ULTs. Specifically, we evaluate the impact of using ULTs as a preprocessing step before training both decoder-only and encoder\u2013decoder language models on text, images, and molecules. We demonstrate that the encoding step does not negatively affect model training and leads to an average relative increase of \u223c16% on a collection of text metrics, while producing close to competitive results on image classification and chemical reaction prediction tasks. 
Code to reproduce the experiments is available at:\n                    <jats:ext-link xmlns:xlink=\"http:\/\/www.w3.org\/1999\/xlink\" ext-link-type=\"uri\" xlink:href=\"https:\/\/github.com\/IBM\/unified-lookup-tables\">https:\/\/github.com\/IBM\/unified-lookup-tables<\/jats:ext-link>\n                    .\n                  <\/jats:p>","DOI":"10.1088\/2632-2153\/ae143c","type":"journal-article","created":{"date-parts":[[2025,10,16]],"date-time":"2025-10-16T22:51:38Z","timestamp":1760655098000},"page":"045022","update-policy":"https:\/\/doi.org\/10.1088\/crossmark-policy","source":"Crossref","is-referenced-by-count":0,"title":["Unified lookup tables: training foundation models on encoded data"],"prefix":"10.1088","volume":"6","author":[{"ORCID":"https:\/\/orcid.org\/0000-0001-7886-8385","authenticated-orcid":true,"given":"Nikita","family":"Janakarajan","sequence":"first","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1790-7536","authenticated-orcid":true,"given":"Irina","family":"Espejo Morales","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0009-0003-9198-7866","authenticated-orcid":false,"given":"Marvin","family":"Alberts","sequence":"additional","affiliation":[]},{"given":"Andrea","family":"Giovannini","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0002-8872-0269","authenticated-orcid":true,"given":"Matteo","family":"Manica","sequence":"additional","affiliation":[]},{"ORCID":"https:\/\/orcid.org\/0000-0003-0118-7252","authenticated-orcid":true,"given":"Antonio","family":"Foncubierta-Rodr\u00edguez","sequence":"additional","affiliation":[]}],"member":"266","published-online":{"date-parts":[[2025,10,28]]},"reference":[{"article-title":"Release strategies and the social impacts of language models","year":"2019","author":"Solaiman","key":"mlstae143cbib1","type":"preprint"},{"key":"mlstae143cbib2","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR.2016.90","type":"conference-proceedings","article-title":"Deep residual learning for image recognition","author":"He","year":"2016"},{"key":"mlstae143cbib3","first-page":"pp 6000","type":"conference-proceedings","article-title":"Attention is all you need","author":"Vaswani","year":"2017"},{"key":"mlstae143cbib4","doi-asserted-by":"publisher","first-page":"47","DOI":"10.1038\/s41586-023-06221-2","type":"journal-article","article-title":"Scientific discovery in the age of artificial intelligence","volume":"620","author":"Wang","year":"2023","journal-title":"Nature"},{"key":"mlstae143cbib5","doi-asserted-by":"publisher","DOI":"10.1038\/s41598-025-98483-1","type":"journal-article","article-title":"Industrial applications of large language models","volume":"15","author":"Raza","year":"2025","journal-title":"Sci. 
Rep."},{"article-title":"Extracting training data from large language models","author":"Carlini","key":"mlstae143cbib6","type":"preprint"},{"key":"mlstae143cbib7","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2022.findings-emnlp.148","type":"preprint","article-title":"Are large pre-trained language models leaking your personal information?","author":"Huang","year":"2022"},{"key":"mlstae143cbib8","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2023.inlg-main.3","type":"preprint","article-title":"Preventing verbatim memorization in language models gives a false sense of privacy","author":"Ippolito","year":"2023"},{"article-title":"Coercing LLMs to do and reveal (almost) anything","year":"2024","author":"Geiping","key":"mlstae143cbib9","type":"preprint"},{"key":"mlstae143cbib10","first-page":"pp 9784","type":"conference-proceedings","article-title":"Getting the most out of your tokenizer for pre-training and domain adaptation","author":"Dagan","year":"2024"},{"article-title":"Overcoming vocabulary mismatch: vocabulary-agnostic teacher guided language modeling","year":"2025","author":"Shin","key":"mlstae143cbib11","type":"preprint"},{"key":"mlstae143cbib12","doi-asserted-by":"crossref","DOI":"10.18653\/v1\/2024.naacl-long.395","type":"preprint","article-title":"Bridging the gap between different vocabularies for llm ensemble","author":"Xu","year":"2024"},{"key":"mlstae143cbib13","first-page":"pp 4651","type":"conference-proceedings","article-title":"Perceiver: general perception with iterative attention","author":"Jaegle","year":"2021"},{"key":"mlstae143cbib14","first-page":"pp 78808","type":"conference-proceedings","article-title":"MEGABYTE: predicting million-byte sequences with multiscale transformers","volume":"vol 36","author":"Yu","year":"2023"},{"article-title":"Training data leakage analysis in language models","year":"2021","author":"Inan","key":"mlstae143cbib15","type":"preprint"},{"article-title":"Privacy-preserving instructions for aligning large language models","year":"2024","author":"Yu","key":"mlstae143cbib16","type":"preprint"},{"article-title":"Privacy restore: privacy-preserving inference in large language models via privacy removal and restoration","year":"2024","author":"Zeng","key":"mlstae143cbib17","type":"preprint"},{"key":"mlstae143cbib18","doi-asserted-by":"publisher","DOI":"10.1016\/j.sbi.2023.102545","type":"journal-article","article-title":"Federated learning for molecular discovery","volume":"79","author":"Hanser","year":"2023","journal-title":"Curr. Opin. Struct. Biol."},{"key":"mlstae143cbib19","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/9780470544242.ch21","type":"journal-article","article-title":"Coding theorems for a discrete source with a fidelity criterion","volume":"4","author":"Shannon","year":"1959","journal-title":"IRE Nat. Conv. Rec"},{"article-title":"Language modeling is compression","year":"2023","author":"Del\u00e9tang","key":"mlstae143cbib20","type":"preprint"},{"article-title":"A block-sorting lossless data compression algorithm","year":"1994","author":"Burrows","key":"mlstae143cbib21","type":"other"},{"key":"mlstae143cbib22","doi-asserted-by":"publisher","first-page":"407","DOI":"10.1145\/382780.382782","type":"journal-article","article-title":"An analysis of the Burrows\u2013Wheeler transform","volume":"48","author":"Manzini","year":"2001","journal-title":"J. 
ACM"},{"key":"mlstae143cbib23","doi-asserted-by":"publisher","first-page":"xviii","DOI":"10.1109\/30.125072","type":"journal-article","article-title":"The JPEG still picture compression standard","volume":"38","author":"Wallace","year":"1992","journal-title":"IEEE Trans. Consum. Electron."},{"article-title":"Pointer sentinel mixture models","year":"2016","author":"Merity","key":"mlstae143cbib24","type":"preprint"},{"article-title":"Imagenet: a large-scale hierarchical image database","author":"Deng","key":"mlstae143cbib25","doi-asserted-by":"crossref","DOI":"10.1109\/CVPR.2009.5206848","type":"journal-article"},{"article-title":"PASS: an imagenet replacement for self-supervised pretraining without humans","year":"2021","author":"Asano","key":"mlstae143cbib26","type":"conference-proceedings"},{"article-title":"Learning multiple layers of features from tiny images","year":"2009","author":"Krizhevsky","key":"mlstae143cbib27","type":"other"},{"key":"mlstae143cbib28","doi-asserted-by":"publisher","first-page":"31","DOI":"10.1021\/ci00057a005","type":"journal-article","article-title":"SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules","volume":"28","author":"Weininger","year":"1988","journal-title":"J. Chem. Inf. Comput. Sci."},{"key":"mlstae143cbib29","first-page":"p 30","type":"conference-proceedings","article-title":"Predicting organic reaction outcomes with Weisfeiler-Lehman network","author":"Jin","year":"2017"},{"key":"mlstae143cbib30","doi-asserted-by":"publisher","first-page":"119","DOI":"10.1038\/s41524-024-01304-8","type":"journal-article","article-title":"Predicting polymerization reactions via transfer learning using chemical language models","volume":"10","author":"Ferrari","year":"2024","journal-title":"npj Comput. Mater."},{"article-title":"difflib package","year":"2024","author":"The Python Standard Library","key":"mlstae143cbib31","type":"other"},{"article-title":"An elementary mathematical theory of classification and prediction","year":"1958","author":"Tanimoto","key":"mlstae143cbib32","type":"other"},{"article-title":"Open-source cheminformatics software","year":"2024","author":"Landrum","key":"mlstae143cbib33","type":"other"},{"key":"mlstae143cbib34","first-page":"pp 38","type":"conference-proceedings","article-title":"Transformers: state-of-the-art natural language processing","author":"Wolf","year":"2020"},{"article-title":"SmolLM - blazingly fast and remarkably powerful","year":"2024","author":"Allal","key":"mlstae143cbib35","type":"web-resource"},{"key":"mlstae143cbib36","first-page":"pp 4895","type":"conference-proceedings","article-title":"GQA: training generalized multi-query transformer models from multi-head checkpoints","author":"Ainslie","year":"2023"},{"article-title":"Adam: a method for stochastic optimization","year":"2014","author":"Kingma","key":"mlstae143cbib37","type":"preprint"},{"key":"mlstae143cbib38","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1214\/10-BA521","type":"journal-article","article-title":"Exploring the limits of transfer learning with a unified text-to-text transformer","volume":"21","author":"Raffel","year":"2020","journal-title":"J. Mach.: Learn. 
Res."},{"key":"mlstae143cbib39","first-page":"pp 74","type":"book","article-title":"Rouge: a package for automatic evaluation of summaries","author":"Lin","year":"2004"},{"key":"mlstae143cbib40","first-page":"pp 65","type":"conference-proceedings","article-title":"METEOR: an automatic metric for MT evaluation with improved correlation with human judgments","author":"Banerjee","year":"2005"},{"article-title":"BERTScore: evaluating text generation with BERT","year":"2020","author":"Zhang","key":"mlstae143cbib41","type":"conference-proceedings"},{"article-title":"An image is worth 16x16 words: transformers for image recognition at scale","year":"2021","author":"Dosovitskiy","key":"mlstae143cbib42","type":"preprint"},{"key":"mlstae143cbib43","doi-asserted-by":"crossref","DOI":"10.1007\/978-3-031-20053-3_29","type":"preprint","article-title":"Three things everyone should know about vision transformers","author":"Touvron","year":"2022"}],"container-title":["Machine Learning: Science and Technology"],"original-title":[],"link":[{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c","content-type":"text\/html","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c","content-type":"text\/html","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c\/pdf","content-type":"application\/pdf","content-version":"am","intended-application":"similarity-checking"},{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c\/pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,28]],"date-time":"2025-10-28T09:00:35Z","timestamp":1761642035000},"score":1,"resource":{"primary":{"URL":"https:\/\/iopscience.iop.org\/article\/10.1088\/2632-2153\/ae143c"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,10,28]]},"references-count":43,"journal-issue":{"issue":"4","published-online":{"date-parts":[[2025,10,28]]},"published-print":{"date-parts":[[2025,12,30]]}},"URL":"https:\/\/doi.org\/10.1088\/2632-2153\/ae143c","relation":{},"ISSN":["2632-2153"],"issn-type":[{"type":"electronic","value":"2632-2153"}],"subject":[],"published":{"date-parts":[[2025,10,28]]},"assertion":[{"value":"Unified lookup tables: training foundation models on encoded data","name":"article_title","label":"Article Title"},{"value":"Machine Learning: Science and Technology","name":"journal_title","label":"Journal Title"},{"value":"paper","name":"article_type","label":"Article Type"},{"value":"\u00a9 2025 The Author(s). 
Published by IOP Publishing Ltd","name":"copyright_information","label":"Copyright Information"},{"value":"2025-06-13","name":"date_received","label":"Date Received","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-10-16","name":"date_accepted","label":"Date Accepted","group":{"name":"publication_dates","label":"Publication dates"}},{"value":"2025-10-28","name":"date_epub","label":"Online publication date","group":{"name":"publication_dates","label":"Publication dates"}}]}}
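The abstract above describes ULTs as a preprocessing step that pairs a data compression method (used as a modality tokenizer) with a shared lookup table over a common representation vocabulary. The sketch below illustrates that idea only; zlib as the compressor and every name in it (`build_lookup_table`, `encode`) are assumptions for illustration, not the API of the authors' code at https://github.com/IBM/unified-lookup-tables.

```python
import secrets
import zlib

def build_lookup_table() -> dict[int, int]:
    """Secret random bijection from compressed byte values (0-255) to
    token ids in the common vocabulary. Keeping this permutation private
    is what the abstract's reverse-engineering-likelihood argument
    relies on."""
    token_ids = list(range(256))
    secrets.SystemRandom().shuffle(token_ids)  # cryptographic shuffle
    return dict(enumerate(token_ids))

def encode(sample: bytes, table: dict[int, int]) -> list[int]:
    """Compress a sample of any modality (text, image, or molecule
    bytes), then remap each compressed byte through the shared table,
    so a model trained downstream never sees unencoded data."""
    return [table[b] for b in zlib.compress(sample)]

# Usage: encode a reaction SMILES string; text or image bytes would pass
# through the same table, giving one unified token vocabulary.
table = build_lookup_table()
print(encode(b"CCO.CC(=O)O>>CCOC(C)=O", table)[:10])
```

Because the remapping is a lossless bijection over compressed symbols, a model sees shorter sequences drawn from one shared vocabulary regardless of modality, which is consistent with the abstract's claim that the encoding step need not hurt training.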