Early detection of Alzheimer’s disease (AD) is a critical yet unresolved challenge in neurology, as subtle cognitive and linguistic impairments often emerge years before formal diagnosis. Traditional approaches, including neuroimaging and cognitive testing, are limited by cost, invasiveness, and low sensitivity at prodromal stages. Speech and language markers have recently emerged as promising, non-invasive digital biomarkers that can be continuously monitored in naturalistic settings. In this study, we present a proof-of-concept framework that leverages natural language processing (NLP) techniques for automated early AD detection using synthetically generated speech transcripts. We generated a balanced dataset of 440 samples (220 healthy controls, 220 early AD-like) designed to capture hallmark linguistic alterations associated with AD, including reduced lexical diversity, shorter sentence length, excessive pronoun use, semantic drift, and increased occurrence of fillers and pauses. Each transcript was processed into two complementary feature sets: (i) term frequency-inverse document frequency (TF-IDF) representations of unigrams and bigrams, and (ii) engineered linguistic biomarkers such as type-token ratio, idea density, repetition rate, pronoun ratio, and Flesch reading ease. A logistic regression classifier trained on the combined features achieved strong discriminative performance, with an area under the ROC curve (AUC) of 0.87 and an average precision score of 0.84. Interpretability analysis revealed that the features most predictive of AD closely aligned with known linguistic deficits, including filler frequency and pronoun ratio, while higher lexical diversity and syntactic complexity were associated with healthy-control predictions. Although this study relies on synthetic data, the framework establishes a transparent, reproducible methodology for integrating speech-based biomarkers into digital phenotyping pipelines.
These findings highlight the potential of language analysis for scalable, non-invasive early detection of AD, motivating future validation on real patient cohorts.
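To make the engineered biomarkers named in the abstract more concrete, the following is a minimal sketch of how a few of them (type-token ratio, pronoun ratio, filler rate, repetition rate, mean sentence length) could be computed from a single transcript. It is not the authors' implementation: the function name `linguistic_features`, the word lists `PRONOUNS` and `FILLERS`, the tokenization rule, and the definition of repetition as immediate word repeats are all assumptions chosen for illustration.

```python
import re

# Hypothetical word lists, chosen for illustration only (not from the paper).
PRONOUNS = {"i", "me", "you", "he", "him", "she", "her", "it", "we", "us",
            "they", "them", "this", "that", "these", "those"}
FILLERS = {"um", "uh", "er", "hmm"}


def linguistic_features(transcript: str) -> dict:
    """Compute a few simple linguistic markers for one transcript."""
    # Split into sentences on terminal punctuation; drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", transcript) if s.strip()]
    # Lowercase word tokens, allowing one internal apostrophe (e.g. "don't").
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", transcript.lower())
    n = len(tokens)
    if n == 0:
        return {"type_token_ratio": 0.0, "pronoun_ratio": 0.0,
                "filler_rate": 0.0, "repetition_rate": 0.0,
                "mean_sentence_length": 0.0}

    # Assumed definition: fraction of adjacent token pairs that are identical.
    immediate_repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))

    return {
        # Lexical diversity: distinct words divided by total words.
        "type_token_ratio": len(set(tokens)) / n,
        "pronoun_ratio": sum(t in PRONOUNS for t in tokens) / n,
        "filler_rate": sum(t in FILLERS for t in tokens) / n,
        "repetition_rate": immediate_repeats / max(n - 1, 1),
        "mean_sentence_length": n / len(sentences),
    }


if __name__ == "__main__":
    sample = "Um, she she put it there. It was, uh, that thing."
    for name, value in linguistic_features(sample).items():
        print(f"{name}: {value:.3f}")
```

In a pipeline like the one described, a feature vector of this kind would be concatenated with the TF-IDF representation before being passed to the logistic regression classifier. Idea density and Flesch reading ease are omitted here because they need part-of-speech tagging and syllable counting, respectively.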
Cite this paper
Filippis, R. D. and Foysal, A. A. (2026). AI-Based Early Detection of Alzheimer’s Disease through Speech and Language Biomarkers: A Synthetic Proof-of-Concept Study. Open Access Library Journal, 13, e14443. doi: http://dx.doi.org/10.4236/oalib.1114443.