{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,4,30]],"date-time":"2026-04-30T05:26:35Z","timestamp":1777526795214,"version":"3.51.4"},"reference-count":57,"publisher":"ASME International","issue":"12","license":[{"start":{"date-parts":[[2025,12,1]],"date-time":"2025-12-01T00:00:00Z","timestamp":1764547200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/www.asme.org\/publications-submissions\/publishing-information\/legal-policies"}],"funder":[{"DOI":"10.13039\/100000084","name":"Directorate for Engineering","doi-asserted-by":"publisher","award":["2029905"],"award-info":[{"award-number":["2029905"]}],"id":[{"id":"10.13039\/100000084","id-type":"DOI","asserted-by":"publisher"}]},{"DOI":"10.13039\/100000084","name":"Directorate for Engineering","doi-asserted-by":"publisher","award":["2030093"],"award-info":[{"award-number":["2030093"]}],"id":[{"id":"10.13039\/100000084","id-type":"DOI","asserted-by":"publisher"}]}],"content-domain":{"domain":["asmedigitalcollection.asme.org"],"crossmark-restriction":true},"short-container-title":[],"published-print":{"date-parts":[[2025,12,1]]},"abstract":"<jats:title>Abstract<\/jats:title>\n                  <jats:p>It is time to talk about data in its own right, not just its usage! Machine learning applications are using a wide variety of data sources, some real, such as data collected by sensors and cameras in driving, and some artificial, such as data generated through numerical simulations. The latter mode has been gaining rapid popularity for engineering design and analysis. We are now seeing the beginnings of publicly shared datasets. Thus, the quality and efficacy of such data need to be considered before their use. In this article, we attempt to outline systematic principles and quantifiable quality and efficacy metrics based on insights gained collectively from both data curation projects and usage of large engineering datasets. 
Specifically, this article addresses issues related to generating BIG datasets from computer-aided design (CAD) and finite element analysis (FEA): granularity and modality, applicable to both input and output data; and variety and balance, applicable to input data; also, efficacy for machine learning (ML). Generation of Big Data by simulation requires the use of commercial CAD and finite element packages, which poses multiple challenges: automation, integration and balancing sample variants. We propose parametric variety and balance matrices to study over or under representation of input attributes. Even if we have a large dataset with good balance, it may not be suitable for ML if we do not see significant variation in response or performance variables. Several data generation, curation, and utilization case studies are included in a variety of domains (aero, structural, thermal, manufacturing) and usage of metrics is demonstrated.<\/jats:p>","DOI":"10.1115\/1.4070033","type":"journal-article","created":{"date-parts":[[2025,10,6]],"date-time":"2025-10-06T12:08:51Z","timestamp":1759752531000},"update-policy":"https:\/\/doi.org\/10.1115\/crossmarkpolicy-asme","source":"Crossref","is-referenced-by-count":1,"title":["Principles and Metrics for Curating Large Engineering Simulation Datasets for Machine Learning"],"prefix":"10.1115","volume":"25","author":[{"given":"Jami J.","family":"Shah","sequence":"first","affiliation":[{"id":[{"id":"https:\/\/ror.org\/00rs6vg23","id-type":"ROR","asserted-by":"publisher"}],"name":"The Ohio State University Department of Mechanical and Aerospace Engineering","place":["Columbus, OH, 43210"]}]},{"given":"Satchit","family":"Ramnath","sequence":"additional","affiliation":[{"id":[{"id":"https:\/\/ror.org\/037s24f05","id-type":"ROR","asserted-by":"publisher"}],"name":"Clemson University Department of Mechanical Engineering","place":["Clemson, SC, 
29634"]}]},{"given":"Stefan","family":"Menzel","sequence":"additional","affiliation":[{"name":"Honda Research Institute Europe","place":["Offenbach am Main, Germany, 63073"]}]},{"given":"Thiago","family":"Rios","sequence":"additional","affiliation":[{"name":"Honda Research Institute Europe","place":["Offenbach am Main, Germany, 63073"]}]},{"given":"Fatma","family":"Kocer","sequence":"additional","affiliation":[{"id":[{"id":"https:\/\/ror.org\/05939ef94","id-type":"ROR","asserted-by":"publisher"}],"name":"Altair Engineering","place":["Troy, MI, 48083"]}]},{"given":"Eamon","family":"Whalen","sequence":"additional","affiliation":[{"name":"Altair Engineering","place":["Troy, MI, 48083"]}]},{"given":"Joseph","family":"Pajot","sequence":"additional","affiliation":[{"id":[{"id":"https:\/\/ror.org\/05939ef94","id-type":"ROR","asserted-by":"publisher"}],"name":"Altair Engineering","place":["Troy, MI, 48083"]}]},{"given":"Alex","family":"Adrian","sequence":"additional","affiliation":[{"name":"GE Aerospace","place":["Cincinnati, OH, 43000"]}]},{"given":"Prakash","family":"Kumar","sequence":"additional","affiliation":[{"name":"Amazon Inc. 
, , \u00a0","place":["Bellevue, WA, 98004"]}]}],"member":"33","published-online":{"date-parts":[[2025,12,2]]},"reference":[{"key":"2025120812344211200_CIT0001","doi-asserted-by":"publisher","first-page":"10881","DOI":"10.1109\/ICCV48922.2021.01072","article-title":"Common Objects in 3D: Large-Scale Learning and Evaluation of Real-life 3D Category Reconstruction","author":"Reizenstein"},{"key":"2025120812344211200_CIT0002","author":"Chang","year":"2015"},{"key":"2025120812344211200_CIT0003","doi-asserted-by":"publisher","first-page":"248","DOI":"10.1109\/CVPR.2009.5206848","article-title":"ImageNet: A Large-Scale Hierarchical Image Database","author":"Deng","year":"2009"},{"issue":"5","key":"2025120812344211200_CIT0004","doi-asserted-by":"publisher","first-page":"9","DOI":"10.1111\/cgf.14353","article-title":"SimJEB: Simulated Jet Engine Bracket Dataset","volume":"40","author":"Whalen","year":"2021","journal-title":"Computer Graphics Forum"},{"key":"2025120812344211200_CIT0005","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/SSCI50451.2021.9660034","article-title":"Exploiting Generative Models for Performance Predictions of 3D Car Designs","author":"Saha","year":"2021"},{"key":"2025120812344211200_CIT0006","doi-asserted-by":"publisher","first-page":"730","DOI":"10.1016\/j.cma.2019.02.002","article-title":"Kriging-Assisted Topology Optimization of Crash Structures","volume":"348","author":"Raponi","year":"2019","journal-title":"Comput. Methods Appl. Mech. Eng."},{"key":"2025120812344211200_CIT0007","doi-asserted-by":"publisher","first-page":"100576","DOI":"10.1016\/j.ecmx.2024.100576","article-title":"Identification of Energy Management Configuration Concepts From a Set of Pareto-Optimal Solutions","volume":"22","author":"Lanfermann","year":"2024","journal-title":"Energy Convers. 
Manage.: X"},{"key":"2025120812344211200_CIT0008","doi-asserted-by":"publisher","first-page":"101704","DOI":"10.1016\/j.aei.2022.101704","article-title":"Concept Identification for Complex Engineering Datasets","volume":"53","author":"Lanfermann","year":"2022","journal-title":"Adv. Eng. Inform."},{"issue":"3","key":"2025120812344211200_CIT0009","doi-asserted-by":"publisher","first-page":"160","DOI":"10.1007\/s42979-021-00592-x","article-title":"Machine Learning: Algorithms, Real-World Applications and Research Directions","volume":"2","author":"Sarker","year":"2021","journal-title":"SN Comput. Sci."},{"key":"2025120812344211200_CIT0010","doi-asserted-by":"publisher","first-page":"108197","DOI":"10.1016\/j.knosys.2022.108197","article-title":"Surrogate-Assisted Evolutionary Optimization of Expensive Many-Objective Irregular Problems","volume":"240","author":"Liu","year":"2022","journal-title":"Knowledge-Based Syst."},{"key":"2025120812344211200_CIT0011","doi-asserted-by":"publisher","first-page":"525","DOI":"10.1109\/SSCI52147.2023.10371864","article-title":"Applicability Study of Model-Free Reinforcement Learning Towards an Automated Design Space Exploration Framework","author":"Hoffmann","year":"2023"},{"issue":"4","key":"2025120812344211200_CIT0012","doi-asserted-by":"publisher","first-page":"1172","DOI":"10.1109\/TPAMI.2019.2952353","article-title":"Assessing Transferability From Simulation to Reality for Reinforcement Learning","volume":"43","author":"Muratore","year":"2021","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."},{"key":"2025120812344211200_CIT0013","article-title":"Evaluating RL Agents in Hanabi With Unseen Partners","author":"Canaan","year":"2020"},{"issue":"2","key":"2025120812344211200_CIT0014","doi-asserted-by":"publisher","first-page":"206","DOI":"10.1109\/TEVC.2021.3086308","article-title":"Multitask Shape Optimization Using a 3-D Point Cloud Autoencoder as Unified Representation","volume":"26","author":"Rios","year":"2022","journal-title":"IEEE Trans. Evol. Comput."},{"key":"2025120812344211200_CIT0015","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1109\/IJCNN48605.2020.9207326","article-title":"Feature Visualization for 3D Point Cloud Autoencoders","author":"Rios","year":"2020"},{"key":"2025120812344211200_CIT0016","doi-asserted-by":"publisher","first-page":"315","DOI":"10.1613\/jair.1199","article-title":"Learning When Training Data Are Costly: The Effect of Class Distribution on Tree Induction","volume":"19","author":"Weiss","year":"2003","journal-title":"J. Artif. Intell. Res."},{"issue":"4","key":"2025120812344211200_CIT0017","doi-asserted-by":"publisher","DOI":"10.1145\/3450626.3459818","article-title":"Fusion 360 Gallery: A Dataset and Environment for Programmatic CAD Construction From Human Design Sequences","volume":"40","author":"Willis","year":"2021","journal-title":"ACM Trans. Graph."},{"key":"2025120812344211200_CIT0018","first-page":"15849","article-title":"Joinable: Learning Bottom-Up Assembly of Parametric CAD Joints","author":"Willis","year":"2022"},{"issue":"3","key":"2025120812344211200_CIT0019","doi-asserted-by":"publisher","first-page":"031706","DOI":"10.1115\/1.4052585","article-title":"BIKED: A Dataset for Computational Bicycle Design With Machine Learning Benchmarks","volume":"144","author":"Regenwetter","year":"2021","journal-title":"ASME. J. Mech. 
Des."},{"key":"2025120812344211200_CIT0020","doi-asserted-by":"publisher","DOI":"10.1016\/j.cad.2022.103446","article-title":"FRAMED: An AutoML Approach for Structural Performance Prediction of Bicycle Frames","volume":"156","author":"Regenwetter","year":"2023","journal-title":"Comput.-Aided Des."},{"issue":"4","key":"2025120812344211200_CIT0021","doi-asserted-by":"publisher","first-page":"215","DOI":"10.1515\/rnam-2019-0018","article-title":"Neural Networks for Topology Optimization","volume":"34","author":"Sosnovik","year":"2019","journal-title":"Russ. J. Numer. Anal. Math. Model."},{"issue":"3","key":"2025120812344211200_CIT0022","doi-asserted-by":"publisher","first-page":"031715","DOI":"10.1115\/1.4049533","article-title":"Topologygan: Topology Optimization Using Generative Adversarial Networks Based on Physical Fields Over the Initial Domain","volume":"143","author":"Nie","year":"2021","journal-title":"ASME J. Mech. Des."},{"key":"2025120812344211200_CIT0023","author":"Ramnath","year":"2022"},{"key":"2025120812344211200_CIT0024","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1512.03012","author":"Chang","year":"2015","journal-title":"arXiv preprint arXiv:1512.03012"},{"key":"2025120812344211200_CIT0025","doi-asserted-by":"crossref","first-page":"175","DOI":"10.1007\/978-3-030-58523-5_11","volume-title":"Computer Vision\u2014ECCV 2020","author":"Kim","year":"2020"},{"key":"2025120812344211200_CIT0026","doi-asserted-by":"publisher","first-page":"1912","DOI":"10.1109\/CVPR.2015.7298801","article-title":"3D Shapenets: A Deep Representation for Volumetric Shapes","author":"Wu","year":"2015"},{"key":"2025120812344211200_CIT0027","doi-asserted-by":"publisher","DOI":"10.1115\/DETC2021-71853","article-title":"Design Form and Function Prediction From a Single Image","author":"Edwards","year":"2021"},{"key":"2025120812344211200_CIT0028","doi-asserted-by":"publisher","first-page":"571","DOI":"10.1063\/1.3623659","article-title":"Evaluation of Constitutive Models for 
Springback Prediction in U-Draw\/Bending of DP and TRIP Steel Sheets","volume":"1383","author":"Lee","year":"2011","journal-title":"AIP Conf. Proc."},{"key":"2025120812344211200_CIT0029","article-title":"Applications of ML\/AI for CAE","author":"Kocer","year":"2024"},{"key":"2025120812344211200_CIT0030","doi-asserted-by":"publisher","DOI":"10.1115\/DETC2019-97378","article-title":"Automatically Generating 60,000 CAD Variant for Big Data Applications","author":"Ramnath","year":"2019"},{"key":"2025120812344211200_CIT0031","doi-asserted-by":"publisher","first-page":"V009T09A043","DOI":"10.1115\/DETC2020-22377","article-title":"Design Science Meets Data Science: Curating Large Design Datasets for Engineered Artifacts","author":"Ramnath","year":"2020"},{"key":"2025120812344211200_CIT0032","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1115\/DETC2021-67923","article-title":"Intelligent Design Prediction Aided by Experiment Design and Machine Learning in Feature Based Product Development","author":"Ramnath","year":"2021"},{"key":"2025120812344211200_CIT0033","article-title":"Automation and of a Multi-Stage T-Joint Assembly of Stamped Components and Prediction of Performance Parameters Using Machine Learning","volume-title":"M.S.M.E. thesis","author":"Bolar","year":"2023"},{"key":"2025120812344211200_CIT0034","unstructured":"Jiang, Y.\n          , 2019, \u201cAutomated Generation of CAD Big Data for Geometric Machine Learning,\u201d M.S. thesis, MAE Department, OH."},{"issue":"6","key":"2025120812344211200_CIT0035","doi-asserted-by":"publisher","first-page":"1221","DOI":"10.1109\/TEVC.2022.3147013","article-title":"CarHoods10k: An Industry-Grade Data Set for Representation Learning and Design Optimization in Engineering Applications","volume":"26","author":"Wollstadt","year":"2022","journal-title":"IEEE Trans. Evol. 
Comput."},{"key":"2025120812344211200_CIT0036","author":"Altair Engineering Inc."},{"key":"2025120812344211200_CIT0037","unstructured":"Kumar, P.\n          , 2024, \u201cA Study on the Extraction of Geometrical Parameters From FlexiMech Components and Assemblies and Their Impact on Performance: A Machine Learning Approach,\u201d Master's thesis, Arizona State University."},{"key":"2025120812344211200_CIT0038","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.1511.08458","author":"O\u2019shea","year":"2015","journal-title":"arXiv preprint arXiv:1511.08458"},{"issue":"10","key":"2025120812344211200_CIT0039","doi-asserted-by":"publisher","first-page":"143","DOI":"10.29322\/IJSRP.9.10.2019.p9420","article-title":"Transfer Learning Using vgg-16 With Deep Convolutional Neural Network for Classifying Images","volume":"9","author":"Tammina","year":"2019","journal-title":"Int. J. Sci. Res. Publ."},{"issue":"1","key":"2025120812344211200_CIT0040","doi-asserted-by":"publisher","DOI":"10.1063\/5.0082328","article-title":"ResNet-50 Based Deep Neural Network Using Transfer Learning for Brain Tumor Classification","volume":"2463","author":"Sahaai","year":"2022","journal-title":"AIP Conf. Proc."},{"key":"2025120812344211200_CIT0041","first-page":"40","article-title":"Learning Representations and Generative Models for 3D Point Clouds","author":"Achlioptas","year":"2017"},{"key":"2025120812344211200_CIT0042","unstructured":"Rios, T.\n          , 2022, \u201cLearning-Based Representations of High-Dimensional CAE Models for Automotive Design Optimization,\u201d Ph.D. 
thesis, LIACS, Leiden University."},{"key":"2025120812344211200_CIT0043","doi-asserted-by":"publisher","first-page":"77","DOI":"10.1109\/CVPR.2017.16","article-title":"PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation","author":"Charles","year":"2017"},{"key":"2025120812344211200_CIT0044","doi-asserted-by":"publisher","first-page":"102456","DOI":"10.1016\/j.displa.2023.102456","article-title":"Deep Learning-Based 3D Point Cloud Classification: A Systematic Survey and Outlook","volume":"79","author":"Zhang","year":"2023","journal-title":"Displays"},{"key":"2025120812344211200_CIT0045","first-page":"66833","article-title":"A Near-Linear Time Algorithm for the Chamfer Distance","author":"Bakshi","year":"2024"},{"key":"2025120812344211200_CIT0046","doi-asserted-by":"publisher","DOI":"10.1109\/CEC45853.2021.9504746","article-title":"Exploiting Local Geometric Features in Vehicle Design Optimization With 3D Point Cloud Autoencoders","author":"Rios","year":"2021"},{"key":"2025120812344211200_CIT0047","doi-asserted-by":"publisher","first-page":"1747","DOI":"10.1017\/pds.2022.177","article-title":"Exploiting 3D Variational Autoencoders for Interactive Vehicle Design","author":"Saha","year":"2022"},{"key":"2025120812344211200_CIT0048","doi-asserted-by":"publisher","first-page":"791","DOI":"10.1109\/SSCI44817.2019.9003161","article-title":"On the Efficiency of a Point Cloud Autoencoder as a Geometric Representation for Shape Optimization","author":"Rios","year":"2019"},{"key":"2025120812344211200_CIT0049","first-page":"1","article-title":"RCS Predictions Using Geometric Deep Learning","author":"M\u00e4urer","year":"2025"},{"key":"2025120812344211200_CIT0050","article-title":"Altair Feko","author":"Altair Engineering Inc."},{"issue":"4","key":"2025120812344211200_CIT0051","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1145\/3306346.3322959","article-title":"Meshcnn: A Network With an 
Edge","volume":"38","author":"Hanocka","year":"2019","journal-title":"ACM Trans. Graph."},{"key":"2025120812344211200_CIT0052","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2505.08137","author":"Zhang","year":"2025","journal-title":"arXiv preprint arXiv: 2505.08137"},{"key":"2025120812344211200_CIT0053","first-page":"18563","article-title":"CAD-Llama: Leveraging Large Language Models for Computer-Aided Design Parametric 3D Model Generation","author":"Li","year":"2025"},{"key":"2025120812344211200_CIT0054","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2406.00144","article-title":"Query2CAD: Generating CAD Models Using Natural Language Queries","author":"Badagabettu","year":"2024","journal-title":"arXiv preprint arXiv:2406.00144"},{"key":"2025120812344211200_CIT0055","doi-asserted-by":"publisher","DOI":"10.48550\/arXiv.2503.04417","article-title":"From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design","author":"Ocker","year":"2025","journal-title":"arXiv preprint arXiv: 2503.04417"},{"key":"2025120812344211200_CIT0056","doi-asserted-by":"publisher","first-page":"1","DOI":"10.1007\/s00170-025-15830-2","article-title":"Generative AI Meets CAD: Enhancing Engineering Design to Manufacturing Processes With Large Language Models","author":"Daareyni","year":"2025","journal-title":"Int. J. Adv. Manuf. 
Technol."},{"key":"2025120812344211200_CIT0057","article-title":"AnsysAI","author":"Ansys"}],"container-title":["Journal of Computing and Information Science in Engineering"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/asmedigitalcollection.asme.org\/computingengineering\/article-pdf\/25\/12\/121003\/7544745\/jcise-25-1296.pdf","content-type":"application\/pdf","content-version":"vor","intended-application":"syndication"},{"URL":"https:\/\/asmedigitalcollection.asme.org\/computingengineering\/article-pdf\/25\/12\/121003\/7544745\/jcise-25-1296.pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,8]],"date-time":"2025-12-08T17:34:49Z","timestamp":1765215289000},"score":1,"resource":{"primary":{"URL":"https:\/\/asmedigitalcollection.asme.org\/computingengineering\/article\/25\/12\/120811\/1223157\/Principles-and-Metrics-for-Curating-Large"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,1]]},"references-count":57,"journal-issue":{"issue":"12","published-print":{"date-parts":[[2025,12,1]]}},"URL":"https:\/\/doi.org\/10.1115\/1.4070033","relation":{},"ISSN":["1530-9827","1944-7078"],"issn-type":[{"value":"1530-9827","type":"print"},{"value":"1944-7078","type":"electronic"}],"subject":[],"published":{"date-parts":[[2025,12,1]]},"article-number":"120811"}}