{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2025,12,27]],"date-time":"2025-12-27T05:17:29Z","timestamp":1766812649178,"version":"3.48.0"},"reference-count":60,"publisher":"MDPI AG","issue":"1","license":[{"start":{"date-parts":[[2025,12,21]],"date-time":"2025-12-21T00:00:00Z","timestamp":1766275200000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Data"],"abstract":"<jats:p>Constructed-response items offer rich evidence of writing proficiency, but the linguistic signals they contain vary with grade level. This study presents a cross-sectional analysis of 5638 English Language Arts essays from Grades 6\u201312 to identify which linguistic features predict proficiency and to characterize how their importance shifts across grade levels. We extracted a suite of lexical, syntactic, and semantic-cohesion features, and evaluated their predictive power using an interpretive dual-model framework combining LASSO and XGBoost algorithms. Feature importance was assessed through LASSO coefficients, XGBoost Gain scores, and SHAP values, and interpreted by isolating both consensus and divergences of the three metrics. Results show moderate, generalizable predictive signals in Grades 6\u20138, but no generalizable predictive power was found in the Grades 9\u201312 cohort. Across the middle grades, three findings achieved strong consensus. Essay length, syntactic density, and global semantic organization served as strong predictors of writing proficiency. Lexical diversity emerged as a key divergent feature: it was a top predictor for XGBoost but was ignored by LASSO, suggesting its contribution depends on interactions with other features. 
These findings inform actionable, grade-sensitive feedback, highlighting stable, diagnostic targets for middle school while cautioning that discourse-level features are necessary to model high-school writing.<\/jats:p>","DOI":"10.3390\/data11010002","type":"journal-article","created":{"date-parts":[[2025,12,22]],"date-time":"2025-12-22T08:35:27Z","timestamp":1766392527000},"page":"2","update-policy":"https:\/\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":0,"title":["A Dual-Model Framework for Writing Assessment: A Cross-Sectional Interpretive Machine Learning Analysis of Linguistic Features"],"prefix":"10.3390","volume":"11","author":[{"ORCID":"https:\/\/orcid.org\/0009-0004-6556-7144","authenticated-orcid":false,"given":"Cheng","family":"Tang","sequence":"first","affiliation":[{"name":"Department of Educational Psychology, Mary Frances Early College of Education, The University of Georgia, Athens, GA 30602, USA"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-1694-8942","authenticated-orcid":false,"given":"George","family":"Engelhard","sequence":"additional","affiliation":[{"name":"Department of Educational Psychology, Mary Frances Early College of Education, The University of Georgia, Athens, GA 30602, USA"}]},{"ORCID":"https:\/\/orcid.org\/0009-0007-9697-3091","authenticated-orcid":false,"given":"Yinying","family":"Liu","sequence":"additional","affiliation":[{"name":"School of Automation, Chongqing University, Chongqing 400044, China"}]},{"ORCID":"https:\/\/orcid.org\/0000-0002-2069-8720","authenticated-orcid":false,"given":"Jiawei","family":"Xiong","sequence":"additional","affiliation":[{"name":"Department of Educational Psychology, Mary Frances Early College of Education, The University of Georgia, Athens, GA 30602, 
USA"}]}],"member":"1968","published-online":{"date-parts":[[2025,12,21]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"101","DOI":"10.1080\/15366367.2023.2298135","article-title":"Analysis of Mixed-Format Assessments Using Measurement Models and Topic Modeling","volume":"23","author":"Xiong","year":"2025","journal-title":"Meas. Interdiscip. Res. Perspect."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1111\/j.1745-3984.1990.tb00736.x","article-title":"Scoring Constructed Responses Using Expert Systems","volume":"27","author":"Braun","year":"1990","journal-title":"J. Educ. Meas."},{"key":"ref_3","first-page":"76","article-title":"Analysis of Multiple-Choice versus Open-Ended Questions in Language Tests According to Different Cognitive Domain Levels","volume":"14","author":"Polat","year":"2020","journal-title":"Novitas-R. (Res. Youth Lang.)"},{"key":"ref_4","doi-asserted-by":"crossref","first-page":"431","DOI":"10.1016\/0749-596X(86)90036-7","article-title":"Domain Knowledge and Linguistic Knowledge in the Development of Writing Ability","volume":"25","author":"McCutchen","year":"1986","journal-title":"J. Mem. Lang."},{"key":"ref_5","doi-asserted-by":"crossref","unstructured":"Tang, C., Xiong, J., and Engelhard, G. (2025). Identification of Writing Strategies in Educational Assessments with an Unsupervised Learning Measurement Framework. Educ. Sci., 15.","DOI":"10.3390\/educsci15070912"},{"key":"ref_6","doi-asserted-by":"crossref","first-page":"415","DOI":"10.17239\/jowr-2020.11.03.01","article-title":"Linguistic Features in Writing Quality and Development: An Overview","volume":"11","author":"Crossley","year":"2020","journal-title":"J. Writ. Res."},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"307","DOI":"10.1177\/0741088300017003001","article-title":"Documenting Improvement in College Writing: A Longitudinal Approach","volume":"17","author":"Haswell","year":"2000","journal-title":"Writ. 
Commun."},{"key":"ref_8","doi-asserted-by":"crossref","first-page":"358","DOI":"10.1006\/jecp.1996.0054","article-title":"Individual Differences in Children\u2019s Working Memory and Writing Skill","volume":"63","author":"Swanson","year":"1996","journal-title":"J. Exp. Child Psychol."},{"key":"ref_9","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1177\/0741088316631527","article-title":"Academic Writing Development at the University Level: Phrasal and Clausal Complexity Across Level of Study, Discipline, and Genre","volume":"33","author":"Staples","year":"2016","journal-title":"Writ. Commun."},{"key":"ref_10","doi-asserted-by":"crossref","first-page":"282","DOI":"10.1177\/0741088311410188","article-title":"The Development of Writing Proficiency as a Function of Grade Level: A Linguistic Analysis","volume":"28","author":"Crossley","year":"2011","journal-title":"Writ. Commun."},{"key":"ref_11","unstructured":"Loban, W. (1976). Language Development: Kindergarten Through Grade Twelve, Education Resources Information Center, U.S. Department of Education. NCTE Committee on Research Report No. 18."},{"key":"ref_12","first-page":"59","article-title":"Linguistic Features of Writing Quality and Development: A Longitudinal Approach","volume":"6","author":"Crossley","year":"2022","journal-title":"J. Writ. Anal."},{"key":"ref_13","first-page":"1","article-title":"Examining the Dimensionality of Linguistic Features in L2 Writing Using the Rasch Measurement Model","volume":"2","author":"Effatpanah","year":"2024","journal-title":"Educ. Methods Pract."},{"key":"ref_14","doi-asserted-by":"crossref","unstructured":"Hastie, T. (2009). 
The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.","DOI":"10.1007\/978-0-387-84858-7"},{"key":"ref_15","doi-asserted-by":"crossref","first-page":"45","DOI":"10.1007\/s12559-023-10179-8","article-title":"Interpreting Black-Box Models: A Review on Explainable Artificial Intelligence","volume":"16","author":"Hassija","year":"2024","journal-title":"Cogn. Comput."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"267","DOI":"10.1111\/j.2517-6161.1996.tb02080.x","article-title":"Regression Shrinkage and Selection via the Lasso","volume":"58","author":"Tibshirani","year":"1996","journal-title":"J. R. Stat. Soc. Ser. B Stat. Methodol."},{"key":"ref_17","doi-asserted-by":"crossref","unstructured":"Chen, T., and Guestrin, C. (2016, January 13\u201317). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA.","DOI":"10.1145\/2939672.2939785"},{"key":"ref_18","unstructured":"Martin, J.R., and Rose, D. (2008). Genre Relations. Mapping Culture, Equinox."},{"key":"ref_19","doi-asserted-by":"crossref","first-page":"431","DOI":"10.1016\/S0898-5898(01)00073-0","article-title":"Linguistic Features of the Language of Schooling","volume":"12","author":"Schleppegrell","year":"2001","journal-title":"Linguist. Educ."},{"key":"ref_20","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1016\/S1060-3743(02)00124-8","article-title":"Genre-Based Pedagogies: A Social Response to Process","volume":"12","author":"Hyland","year":"2003","journal-title":"J. Second Lang. Writ."},{"key":"ref_21","doi-asserted-by":"crossref","unstructured":"Schleppegrell, M.J. (2004). The Language of Schooling: A Functional Linguistics Perspective, Routledge.","DOI":"10.4324\/9781410610317"},{"key":"ref_22","unstructured":"Christie, F., and Derewianka, B. (2008). 
School Discourse: Learning to Write Across the Years of Schooling, Bloomsbury Publishing."},{"key":"ref_23","doi-asserted-by":"crossref","first-page":"100540","DOI":"10.1016\/j.asw.2021.100540","article-title":"Examining Lexical Features and Academic Vocabulary Use in Adolescent L2 Students\u2019 Text-Based Analytical Essays","volume":"49","author":"Maamuujav","year":"2021","journal-title":"Assess. Writ."},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"3521","DOI":"10.1016\/j.egyr.2024.03.020","article-title":"Investigating Boosting Techniques\u2019 Efficacy in Feature Selection: A Comparative Analysis","volume":"11","author":"Ahmed","year":"2024","journal-title":"Energy Rep."},{"key":"ref_25","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1038\/s42256-019-0138-9","article-title":"From Local Explanations to Global Understanding with Explainable AI for Trees","volume":"2","author":"Lundberg","year":"2020","journal-title":"Nat. Mach. Intell."},{"key":"ref_26","doi-asserted-by":"crossref","first-page":"37","DOI":"10.1111\/j.1469-8137.1912.tb05611.x","article-title":"The Distribution of the Flora in the Alpine Zone","volume":"11","author":"Jaccard","year":"1912","journal-title":"New Phytol."},{"key":"ref_27","unstructured":"VanRossum, G., and Drake, F.L. (2010). The Python Language Reference, Python Software Foundation Amsterdam."},{"key":"ref_28","unstructured":"Vasiliev, Y. (2020). Natural Language Processing with Python and SpaCy: A Practical Introduction, No Starch Press."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv.","DOI":"10.18653\/v1\/D19-1410"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Colangelo, M.T., Meleti, M., Guizzardi, S., Calciolari, E., and Galli, C. (2025). A Comparative Analysis of Sentence Transformer Models for Automated Journal Recommendation Using PubMed Metadata. Big Data Cogn. 
Comput., 9.","DOI":"10.20944\/preprints202501.1334.v1"},{"key":"ref_31","first-page":"5776","article-title":"Minilm: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers","volume":"33","author":"Wang","year":"2020","journal-title":"Adv. Neural Inf. Process. Syst."},{"key":"ref_32","unstructured":"McCarthy, P.M. (2005). An Assessment of the Range and Usefulness of Lexical Diversity Measures and the Potential of the Measure of Textual, Lexical Diversity (MTLD). [Ph.D. Thesis, The University of Memphis]."},{"key":"ref_33","doi-asserted-by":"crossref","first-page":"201","DOI":"10.1017\/S0305000900012885","article-title":"Type\/Token Ratios: What Do They Really Tell Us?","volume":"14","author":"Richards","year":"1987","journal-title":"J. Child Lang."},{"key":"ref_34","doi-asserted-by":"crossref","first-page":"381","DOI":"10.3758\/BRM.42.2.381","article-title":"MTLD, Vocd-D, and HD-D: A Validation Study of Sophisticated Approaches to Lexical Diversity Assessment","volume":"42","author":"McCarthy","year":"2010","journal-title":"Behav. Res. Methods"},{"key":"ref_35","first-page":"11","article-title":"Comparison Jaccard Similarity, Cosine Similarity and Combined Both of the Data Clustering with Shared Nearest Neighbor Method","volume":"5","author":"Zahrotun","year":"2016","journal-title":"Comput. Eng. Appl. J."},{"key":"ref_36","doi-asserted-by":"crossref","first-page":"115","DOI":"10.1111\/j.1467-9817.2010.01449.x","article-title":"Predicting Second Language Writing Proficiency: The Roles of Cohesion and Linguistic Sophistication","volume":"35","author":"Crossley","year":"2012","journal-title":"J. Res. Read."},{"key":"ref_37","unstructured":"Crossley, S., and McNamara, D. (2011, January 20\u201323). Text Coherence and Judgments of Essay Quality: Models of Quality and Coherence. 
Proceedings of the Annual Meeting of the Cognitive Science Society, Boston, MA, USA."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"17","DOI":"10.1017\/S0267190512000025","article-title":"Formulaic Language and Second Language Acquisition: Zipf and the Phrasal Teddy Bear","volume":"32","author":"Ellis","year":"2012","journal-title":"Annu. Rev. Appl. Linguist."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"1176","DOI":"10.1080\/17470218.2013.850521","article-title":"Subtlex-UK: A New and Improved Word Frequency Database for British English","volume":"67","author":"Mandera","year":"2014","journal-title":"Q. J. Exp. Psychol."},{"key":"ref_40","unstructured":"Speer, R. (Rspeer\/Wordfreq, 2022). Rspeer\/Wordfreq, version 3.0.2."},{"key":"ref_41","first-page":"1","article-title":"Does Quantity Equal Quality? The Relationship between Length of Response and Scores on the SAT Essay","volume":"8","author":"Kobrin","year":"2007","journal-title":"J. Appl. Test. Technol."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"333","DOI":"10.1111\/modl.12468","article-title":"Measuring Syntactic Complexity in L2 Writing Using Fine\u2014Grained Clausal and Phrasal Indices","volume":"102","author":"Kyle","year":"2018","journal-title":"Mod. Lang. J."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"123","DOI":"10.1016\/S1060-3743(00)00019-9","article-title":"Using Computer-Tagged Linguistic Features to Describe L2 Writing Differences","volume":"9","author":"Grant","year":"2000","journal-title":"J. Second Lang. Writ."},{"key":"ref_44","unstructured":"Hunt, K.W. (1965). Grammatical Structures Written at Three Grade Levels, Education Resources Information Center, U.S. Department of Education. NCTE Research Report No. 3."},{"key":"ref_45","unstructured":"Johnston, M., Boguraev, B., and Pustejovsky, J. (1995, January 27\u201329). The Acquisition and Interpretation of Complex Nominals. 
Proceedings of the AAAI Symposium on the Representation and Acquisition of Lexical Knowledge: Polysemy, Ambiguity, and Generativity, Stanford, CA, USA."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"57","DOI":"10.1177\/0741088309351547","article-title":"Linguistic Features of Writing Quality","volume":"27","author":"McNamara","year":"2010","journal-title":"Writ. Commun."},{"key":"ref_47","unstructured":"Xu, W., Portanova, J., Chander, A., Ben-Zeev, D., and Cohen, T. (November, January 30). The Centroid Cannot Hold: Comparing Sequential and Global Estimates of Coherence as Indicators of Formal Thought Disorder. Proceedings of the AMIA Annual Symposium Proceedings, San Diego, CA, USA."},{"key":"ref_48","doi-asserted-by":"crossref","first-page":"304","DOI":"10.1016\/j.schres.2007.03.001","article-title":"Quantifying Incoherence in Speech: An Automated Methodology and Novel Application to Schizophrenia","volume":"93","author":"Foltz","year":"2007","journal-title":"Schizophr. Res."},{"key":"ref_49","doi-asserted-by":"crossref","first-page":"rm4","DOI":"10.1187\/cbe.16-04-0148","article-title":"Rasch Analysis for Instrument Development: Why, When, and How?","volume":"15","author":"Boone","year":"2016","journal-title":"CBE Life Sci. Educ."},{"key":"ref_50","doi-asserted-by":"crossref","first-page":"93","DOI":"10.1111\/j.1745-3984.1994.tb00436.x","article-title":"Examining Rater Errors in the Assessment of Written Composition With a Many\u2014Faceted Rasch Model","volume":"31","author":"Engelhard","year":"1994","journal-title":"J. Educ. Meas."},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Engelhard, G., and Wang, J. (2024). 
Invariant Measurement: Using Rasch Models in the Social, Behavioral, and Health Sciences, Routledge.","DOI":"10.4324\/9781003458746"},{"key":"ref_52","doi-asserted-by":"crossref","first-page":"149","DOI":"10.1007\/BF02296272","article-title":"A Rasch Model for Partial Credit Scoring","volume":"47","author":"Masters","year":"1982","journal-title":"Psychometrika"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v048.i06","article-title":"Mirt: A Multidimensional Item Response Theory Package for the R Environment","volume":"48","author":"Chalmers","year":"2012","journal-title":"J. Stat. Softw."},{"key":"ref_54","unstructured":"R Core Team (2021). R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing."},{"key":"ref_55","doi-asserted-by":"crossref","first-page":"385","DOI":"10.1111\/j.2044-8317.1995.tb01070.x","article-title":"An Investigation of the Standard Errors of Expected A Posteriori Ability Estimates","volume":"48","author":"Schafer","year":"1995","journal-title":"Br. J. Math. Stat. Psychol."},{"key":"ref_56","first-page":"2825","article-title":"Scikit-Learn: Machine Learning in Python","volume":"12","author":"Pedregosa","year":"2011","journal-title":"J. Mach. Learn. Res."},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"1","DOI":"10.18637\/jss.v033.i01","article-title":"Regularization Paths for Generalized Linear Models via Coordinate Descent","volume":"33","author":"Friedman","year":"2010","journal-title":"J. Stat. Softw."},{"key":"ref_58","doi-asserted-by":"crossref","unstructured":"Coup\u00e9, C. (2018). Modeling Linguistic Variables with Regression Models: Addressing Non-Gaussian Distributions, Non-Independent Observations, and Non-Linear Predictors with Random Effects and Generalized Additive Models for Location, Scale, and Shape. Front. 
Psychol., 9.","DOI":"10.3389\/fpsyg.2018.00513"},{"key":"ref_59","doi-asserted-by":"crossref","unstructured":"Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. (2019, January 25). Optuna: A Next-Generation Hyperparameter Optimization Framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA.","DOI":"10.1145\/3292500.3330701"},{"key":"ref_60","unstructured":"Pestana, D., and Viqueira, E.A. (2025). Automating Credit Card Limit Adjustments Using Machine Learning. arXiv."}],"container-title":["Data"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/1\/2\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,12,27]],"date-time":"2025-12-27T05:12:39Z","timestamp":1766812359000},"score":1,"resource":{"primary":{"URL":"https:\/\/www.mdpi.com\/2306-5729\/11\/1\/2"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2025,12,21]]},"references-count":60,"journal-issue":{"issue":"1","published-online":{"date-parts":[[2026,1]]}},"alternative-id":["data11010002"],"URL":"https:\/\/doi.org\/10.3390\/data11010002","relation":{},"ISSN":["2306-5729"],"issn-type":[{"type":"electronic","value":"2306-5729"}],"subject":[],"published":{"date-parts":[[2025,12,21]]}}}