{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T22:12:41Z","timestamp":1770675161977,"version":"3.49.0"},"reference-count":27,"publisher":"SAGE Publications","issue":"1","license":[{"start":{"date-parts":[[2026,1,1]],"date-time":"2026-01-01T00:00:00Z","timestamp":1767225600000},"content-version":"tdm","delay-in-days":0,"URL":"https:\/\/journals.sagepub.com\/page\/policies\/text-and-data-mining-license"}],"content-domain":{"domain":["journals.sagepub.com"],"crossmark-restriction":true},"short-container-title":["Journal of Computational Biology"],"published-print":{"date-parts":[[2026,1]]},"abstract":"<jats:p>\n                    Molecular structure prediction is essential for understanding therapeutic functions and accelerating pharmaceutical research. While state-of-the-art deep learning models like AlphaFold demonstrate strong performance on general protein backbone prediction, they struggle with critical regions of VHH antibodies, a novel family of molecules underrepresented in current training datasets. Many academic and industry laboratories can generate high-quality VHH structures for novel sequences, presenting an opportunity to improve model performance through iterative fine-tuning with strategically selected new data. However, experimental structure determination requires weeks to months of effort and significant costs per structure, making exhaustive data collection impractical. Randomly curating a subset of the full collection yields suboptimal improvements, as many structures provide redundant information while key regions remain unexplored. Strategic data selection can identify which structures, once experimentally determined, will maximally improve prediction accuracy, enabling superior model performance with fewer iterations and lower costs. 
We propose DEWDROP, an active learning selection method that guides VHH structure curation to maximally improve fine-tuned model performance. DEWDROP leverages Monte Carlo dropout to generate prediction ensembles that inform optimal data selection. While we focus on VHH antibodies, underrepresentation issues affect many molecular domains, making DEWDROP broadly applicable as a model-agnostic method for structural biology applications. To demonstrate its effectiveness, we evaluate our approach through retrospective iterative fine-tuning experiments and batch selection analysis on two distinct structural families: VHH antibodies from SAbDab-nano, our target application and primary benchmark, and\n                    <jats:italic toggle=\"yes\">Mycobacterium leprae<\/jats:italic>\n                    proteins from the AlphaFold Protein Database, which demonstrate broader applicability across different molecular domains. For all analyses, we use Equifold, a structured prediction model based on coarse-grained molecular representations that operates independently of multiple sequence alignments. 
We demonstrate that DEWDROP (1) improves model training efficiency through optimized batch selection, outperforming baseline methods and (2) selects structurally informative data with high information content.\n                  <\/jats:p>","DOI":"10.1177\/15578666251405823","type":"journal-article","created":{"date-parts":[[2026,1,30]],"date-time":"2026-01-30T10:54:54Z","timestamp":1769770494000},"page":"184-200","update-policy":"https:\/\/doi.org\/10.1177\/sage-journals-update-policy","source":"Crossref","is-referenced-by-count":0,"title":["Deep Batch Active Learning for Protein Structure Modeling"],"prefix":"10.1177","volume":"33","author":[{"given":"Zexin","family":"Xue","sequence":"first","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Michael","family":"Bailey","sequence":"additional","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Abhinav","family":"Gupta","sequence":"additional","affiliation":[{"name":"Large Molecule Research, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ruijiang","family":"Li","sequence":"additional","affiliation":[{"name":"Large Molecule Research, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Alejandro","family":"Corrochano-Navarro","sequence":"additional","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sizhen","family":"Li","sequence":"additional","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, 
USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Lorenzo","family":"Kogler-Anele","sequence":"additional","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Qui","family":"Yu","sequence":"additional","affiliation":[{"name":"Large Molecule Research, Sanofi, Cambridge, Massachusetts, USA."},{"name":"Biologics Engineering, Oncology R&amp;D, AstraZeneca, Waltham, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Heidi","family":"Rommelaere","sequence":"additional","affiliation":[{"name":"NANOBODY Research Platform, Sanofi R&amp;D, Sanofi, Zwijnaarde, Belgium."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Wander","family":"Van Breedam","sequence":"additional","affiliation":[{"name":"NANOBODY Research Platform, Sanofi R&amp;D, Sanofi, Zwijnaarde, Belgium."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Norbert","family":"Furtmann","sequence":"additional","affiliation":[{"name":"Large Molecule Research, Sanofi, Frankfurt, Germany."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Joseph","family":"Batchelor","sequence":"additional","affiliation":[{"name":"Large Molecule Research, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ziv","family":"Bar-Joseph","sequence":"additional","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, USA."}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Sven","family":"Jager","sequence":"additional","affiliation":[{"name":"R&amp;D Data &amp; Computational Science, Sanofi, Cambridge, Massachusetts, 
USA."}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"179","published-online":{"date-parts":[[2026,1,30]]},"reference":[{"key":"e_1_3_3_2_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-024-07487-w"},{"key":"e_1_3_3_3_1","first-page":"8927","volume-title":"Advances in Neural Information Processing Systems, volume 34","author":"Ash J","year":"2021","unstructured":"Ash J, , Goel S, , Krishnamurthy A, et al. Gone fishing: Neural active learning with fisher embeddings. In Ranzato M, , Beygelzimer A, , Dauphin Y, , Liang P, , and Vaughan JW, editors, Advances in Neural Information Processing Systems, volume 34, 8927\u20138939. Curran Associates, Inc; 2021. Available from: https:\/\/proceedings.neurips.cc\/paper\/2021\/file\/4afe044911ed2c247005912512ace23b-Paper.pdf"},{"key":"e_1_3_3_4_1","unstructured":"Ash JT Zhang C Krishnamurthy A et al. Deep batch active learning by diverse uncertain gradient lower bounds. In International Conference on Learning Representations 2020. Available from: https:\/\/openreview.net\/forum?id=ryghZJBKPS"},{"key":"e_1_3_3_5_1","doi-asserted-by":"publisher","DOI":"10.7554\/elife.89679.2"},{"key":"e_1_3_3_6_1","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jmedchem.9b02147"},{"key":"e_1_3_3_7_1","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jmedchem.0c00385"},{"key":"e_1_3_3_8_1","doi-asserted-by":"publisher","DOI":"10.1613\/jair.295"},{"key":"e_1_3_3_9_1","doi-asserted-by":"publisher","DOI":"10.1023\/A:1022673506211"},{"key":"e_1_3_3_10_1","article-title":"Diffdock: Diffusion steps, twists, and turns for molecular docking","author":"Corso G","year":"2022","unstructured":"Corso G, , St\u00e4rk H, , Jing B, et al. Diffdock: Diffusion steps, twists, and turns for molecular docking. 
ArXiv, 2022.","journal-title":"ArXiv"},{"key":"e_1_3_3_11_1","doi-asserted-by":"publisher","DOI":"10.1093\/bioinformatics\/btv552"},{"key":"e_1_3_3_12_1","first-page":"20","volume-title":"Proceedings of The 33rd International Conference on Machine Learning, volume 48, 1050\u20131059","author":"Gal Y","year":"2016","unstructured":"Gal Y, , Ghahramani Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Balcan MF, and Weinberger KQ, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48, 1050\u20131059, Machine Learning Research: New York, New York, USA; 2016, 20\u201322. Available from: https:\/\/proceedings.mlr.press\/v48\/gal16.html"},{"key":"e_1_3_3_13_1","doi-asserted-by":"publisher","DOI":"10.1021\/acs.jcim.2c01052"},{"issue":"164","key":"e_1_3_3_14_1","first-page":"1","article-title":"A framework and benchmark for deep batch active learning for regression","volume":"24","author":"Holzm\u00fcller D","year":"2023","unstructured":"Holzm\u00fcller D, , Zaverkin V, , K\u00e4stner J, et al. A framework and benchmark for deep batch active learning for regression. J Machine Learning Res, 2023; 24(164):1\u201381. Available from: http:\/\/jmlr.org\/papers\/v24\/22-0937.html","journal-title":"J Machine Learning Res"},{"key":"e_1_3_3_15_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-021-03819-2"},{"key":"e_1_3_3_16_1","doi-asserted-by":"publisher","DOI":"10.1101\/2022.10.07.511322"},{"key":"e_1_3_3_17_1","first-page":"500902","article-title":"Language models of protein sequences at the scale of evolution enable accurate structure prediction","author":"Lin Z","year":"2022","unstructured":"Lin Z, , Akin H, , Rao R, et al. Language models of protein sequences at the scale of evolution enable accurate structure prediction. 
BioRxiv, 2022:500902.","journal-title":"BioRxiv"},{"key":"e_1_3_3_18_1","doi-asserted-by":"publisher","DOI":"10.1109\/TIT.1982.1056489"},{"key":"e_1_3_3_19_1","doi-asserted-by":"publisher","DOI":"10.1007\/BF02834632"},{"key":"e_1_3_3_20_1","doi-asserted-by":"publisher","DOI":"10.4155\/fmc-2016-0197"},{"key":"e_1_3_3_21_1","doi-asserted-by":"publisher","DOI":"10.1038\/s41586-023-05905-z"},{"key":"e_1_3_3_22_1","doi-asserted-by":"publisher","DOI":"10.1093\/nar\/gkab1050"},{"key":"e_1_3_3_23_1","volume-title":"Synthesis Lectures on Artificial Intelligence and Machine Learning","author":"Settles B","year":"2012","unstructured":"Settles B. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers; 2012."},{"key":"e_1_3_3_24_1","doi-asserted-by":"publisher","DOI":"10.1016\/j.ailsci.2022.100050"},{"key":"e_1_3_3_25_1","doi-asserted-by":"publisher","DOI":"10.1109\/34.88573"},{"key":"e_1_3_3_26_1","doi-asserted-by":"publisher","DOI":"10.3390\/molecules28103991"},{"key":"e_1_3_3_27_1","doi-asserted-by":"publisher","DOI":"10.1186\/s40537-016-0043-6"},{"key":"e_1_3_3_28_1","doi-asserted-by":"publisher","DOI":"10.1039\/D2DD00034B"}],"container-title":["Journal of Computational 
Biology"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/15578666251405823","content-type":"application\/pdf","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/full-xml\/10.1177\/15578666251405823","content-type":"application\/xml","content-version":"vor","intended-application":"text-mining"},{"URL":"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/15578666251405823","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2026,2,9]],"date-time":"2026-02-09T05:45:49Z","timestamp":1770615949000},"score":1,"resource":{"primary":{"URL":"https:\/\/journals.sagepub.com\/doi\/full\/10.1177\/15578666251405823"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2026,1]]},"references-count":27,"journal-issue":{"issue":"1","published-print":{"date-parts":[[2026,1]]}},"alternative-id":["10.1177\/15578666251405823"],"URL":"https:\/\/doi.org\/10.1177\/15578666251405823","relation":{},"ISSN":["1066-5277","1557-8666"],"issn-type":[{"value":"1066-5277","type":"print"},{"value":"1557-8666","type":"electronic"}],"subject":[],"published":{"date-parts":[[2026,1]]}}}