How AI could transform speech therapy for children
There aren’t enough speech-language pathologists (SLPs) to support the millions of American children with disordered speech patterns, such as difficulty articulating certain words or speaking fluently. As a result, these children may go on to struggle academically, socially, and emotionally.
To fill the gap, researchers are turning to language models (LMs).
“A language model that can assist SLPs with various tasks in the diagnostic workflow will potentially allow them to help more children,” says Sang Truong, a graduate student in computer science at Stanford University.
But is the technology adequately trained to accurately distinguish disordered speech patterns?
In a paper accepted for presentation at the 2025 Conference on Empirical Methods in Natural Language Processing and partially supported by the Stanford Institute for Human-Centered AI, Truong and his colleagues (including Nick Haber, assistant professor of education; Sanmi Koyejo, assistant professor of computer science; and San Francisco-based SLP Jody Vaynshtok) show that 15 models, including several versions of GPT-4, Whisper, Gemini, and Qwen, do quite poorly straight out of the box.
But when the team fine-tuned several open-source Qwen models, the performance of many improved, suggesting these tools could eventually prove helpful to SLPs, Truong says.
Still, challenges remain: The models exhibited gender, age, and language biases. And fine-tuning LMs depends on the availability of large datasets of children’s speech samples, which are difficult to obtain due to privacy protections for minors, Truong says.
Nevertheless, the team is optimistic about the prospect of using AI to streamline the work of SLPs.
“Although these very general language models are probably not designed with this clinical use in mind, it seems likely that their shortcomings can be overcome,” Truong says.
Koyejo agrees. “This gives us the first systematic benchmark for AI in pediatric speech pathology and demonstrates a technical path forward,” he says. “We’re not just identifying problems, we’re showing they’re solvable.”
High demand, real burnout
More than 3.4 million American children struggle with speech and language challenges. They might stutter, lisp, have difficulty articulating specific sounds, or omit or insert certain sounds. They also might have cognitive, hearing, or swallowing issues.
The SLPs available to help these children typically work in a school setting where there might be one SLP for hundreds of children. As a result, SLP caseloads can be onerous, leading to burnout.
Tasked with identifying children who need help, SLPs must interview children to record their speech patterns in conversation; transcribe and evaluate those interviews; develop treatment plans; provide treatment for weeks, months, or years; and keep careful notes of children’s progress while communicating with their parents and teachers.
Without replacing the important human connection that a clinician provides, an AI system could potentially simplify several of the more tedious steps in this pipeline, freeing SLPs to give children important one-on-one attention, Truong says. For example, an LM could evaluate every child in the school and flag those most in need of help; transcribe interviews; provide games and other forms of therapeutic engagement to help children directly; and track children’s progress.
“If a model can move the needle by easing the SLP’s workload, that’s a win,” Truong says.
Can AI help?
Until two or three years ago, LMs could only analyze text, not audio. The advent of multi-modal language models (MLMs) that can accept audio directly seemed like it might be a game-changer for diagnosing speech disorders. “It’s possible that MLMs, by skipping the transcription step, can capture more nuanced information about the way a child speaks,” Truong says.
To test that theory, Truong and his colleagues prompted 15 LMs (some that accepted audio directly and some that required transcription) to act as SLPs and evaluate children’s speech samples on several key SLP tasks: disorder diagnosis (a basic triage step to distinguish typical from disordered speech); disorder type diagnosis (articulation vs. phonological disorder); symptom diagnosis (stuttering, sound omission, sound substitution, or sound addition); transcription-based diagnosis (comparing a child’s transcribed words to the words they were asked to say); and transcription alone (measuring how well an LM transcribes a child’s disordered speech).
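To make that setup concrete, here is a minimal sketch of what one of these evaluation loops could look like. The prompt wording, the dataset fields, and the query_model placeholder are illustrative assumptions rather than the paper’s actual materials; query_model stands in for a call to whichever text-only or audio-capable model is being tested.

```python
# Illustrative sketch only: query_model is a hypothetical stand-in for the API
# of whichever model is being evaluated, and the prompt, file names, and gold
# labels below are assumptions rather than the paper's actual materials.

def query_model(prompt: str, audio_path: str | None = None) -> str:
    """Placeholder for a call to a text-only or audio-capable model."""
    raise NotImplementedError("Wire this up to the model under test.")

# One record per child speech sample, with a gold label for each task.
samples = [
    {"audio": "sample_001.wav", "disorder": "disordered", "symptom": "sound omission"},
    {"audio": "sample_002.wav", "disorder": "typical", "symptom": "none"},
]

DISORDER_PROMPT = (
    "You are a speech-language pathologist. Listen to this child's speech "
    "sample and answer with exactly one word: 'typical' or 'disordered'."
)

def task_accuracy(task_key: str, prompt: str) -> float:
    """Fraction of samples for which the model's answer matches the gold label."""
    correct = 0
    for sample in samples:
        prediction = query_model(prompt, audio_path=sample["audio"]).strip().lower()
        correct += prediction == sample[task_key]
    return correct / len(samples)

# Once query_model is implemented, e.g.:
# print(f"Disorder diagnosis accuracy: {task_accuracy('disorder', DISORDER_PROMPT):.0%}")
```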
Before a tool can be deployed in a clinical setting, the Food and Drug Administration recommends that it be at least 80-85% accurate, Truong says. None of the 15 LMs tested came close to that level of accuracy. For disorder diagnosis, the most basic task, the best-performing model was only 55% accurate, while most models were wrong more than half the time.
Some key lessons: Recently developed MLMs that can analyze audio directly without first converting it to text were generally better at the more fine-grained SLP tasks (such as symptom diagnosis) than were models that relied on transcribing children’s speech using automatic speech recognition. And bigger models were not necessarily better, Truong says.
In addition, the models were better at diagnosing speech issues in boys than in girls, in English speakers than in speakers of other languages, and in older children than in younger ones.
Despite the models’ poor performance out of the box, there’s reason to believe they can do better, Truong says. When the team fine-tuned several versions of Qwen models using a small dataset of children’s speech, their performance on diagnostic tasks (disorder, disorder type, and symptom diagnosis) often improved.
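As a rough illustration of what that kind of fine-tuning involves, the sketch below uses the Hugging Face transformers and datasets libraries to fine-tune a small open-source Qwen model on labeled transcripts. The paper fine-tuned audio-capable Qwen models; this text-only recipe, the specific model variant, the slp_finetune.jsonl file name, and the label format are assumptions made only for illustration.

```python
# Minimal supervised fine-tuning sketch using Hugging Face transformers/datasets.
# The paper fine-tuned audio-capable Qwen models; this text-only version, the
# model variant, the slp_finetune.jsonl file, and the label format are
# assumptions chosen only to illustrate the general recipe.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # a small open-source Qwen variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token  # ensure padding works
model = AutoModelForCausalLM.from_pretrained(model_name)

# Each JSON line pairs a transcribed speech sample with a diagnostic label, e.g.
# {"text": "Transcript: the wabbit wuns fast\nDiagnosis: articulation disorder"}
dataset = load_dataset("json", data_files="slp_finetune.jsonl")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qwen-slp-finetune",
                           per_device_train_batch_size=2,
                           num_train_epochs=3),
    train_dataset=tokenized,
    # mlm=False gives standard next-token (causal) language modeling labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```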
“For a long time, we weren’t really sure how good these models were when it comes to SLP,” Truong says. “But with this very careful measurement, we can now see a potential path to creating a clinically useful tool.”
Making the leap to clinical practice
Even though fine-tuning is promising, Truong cautions that more work is needed. “The fact that we can do fine-tuning and get some improved performance doesn’t mean that the problem is solved,” he says. “It’s just a first-order demonstration that fine-tuning is a potentially promising solution.” Fine-tuning depends on high-quality data, and collecting speech data from children runs up against significant privacy issues. One possible solution: Generate synthetic data that simulates how children with various speech disorders might speak. Such data could then be used to bootstrap the LMs to achieve higher performance, Truong says.
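As a toy, text-level illustration of that synthetic-data idea (training data for audio-capable models would also require synthesized speech), one could perturb typical transcripts with simplified symptom patterns. The substitution rules below are loose orthographic assumptions, not clinically validated ones; a real pipeline would work with phonemes rather than spelling.

```python
# Toy, text-level illustration of the synthetic-data idea: perturb typical
# transcripts with simplified symptom patterns. The rules below operate on
# spelling rather than phonemes and are loose assumptions, not clinical ones.

import random

random.seed(0)  # reproducible example

# Gliding-style substitutions, e.g. "rabbit" -> "wabbit", "yellow" -> "yewwow".
SUBSTITUTIONS = {"r": "w", "l": "w"}

def simulate_substitution(transcript: str, rate: float = 0.7) -> str:
    """Apply sound-substitution patterns to some words of a typical transcript."""
    words = []
    for word in transcript.split():
        if random.random() < rate:
            for sound, replacement in SUBSTITUTIONS.items():
                word = word.replace(sound, replacement)
        words.append(word)
    return " ".join(words)

def simulate_omission(transcript: str, rate: float = 0.3) -> str:
    """Drop some final consonants to mimic sound omission."""
    words = []
    for word in transcript.split():
        if random.random() < rate and len(word) > 2 and word[-1] not in "aeiou":
            word = word[:-1]
        words.append(word)
    return " ".join(words)

print(simulate_substitution("the little red rabbit runs really fast"))
print(simulate_omission("the little red rabbit runs really fast"))
```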
Truong is also reluctant to dismiss transcription approaches. “It’s possible that very clever people can give instructions to these models in a way that can get the job done using transcription,” he says.
Addressing the models’ observed biases will require further work as well. Gender imbalance in the training data might explain the performance gap between boys and girls. Adjusting for that imbalance might help alleviate the problem, Truong says.
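One simple way to make that adjustment is to oversample the under-represented group in the fine-tuning data, as in the sketch below. The field names and the resampling strategy are assumptions; loss reweighting or collecting more balanced data would be equally plausible alternatives.

```python
# Sketch of one simple way to adjust for gender imbalance: oversample the
# under-represented group so each gender contributes equally to fine-tuning.
# The field names and the oversampling strategy itself are assumptions.

import random
from collections import defaultdict

def balance_by_gender(samples: list[dict], seed: int = 0) -> list[dict]:
    """Oversample smaller groups (with replacement) until all groups match the largest."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample in samples:
        groups[sample["gender"]].append(sample)
    target = max(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

data = ([{"gender": "boy", "audio": f"b{i}.wav"} for i in range(8)]
        + [{"gender": "girl", "audio": f"g{i}.wav"} for i in range(2)])
print(len(balance_by_gender(data)))  # 16: both groups now contribute 8 samples
```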
It’s also important to expand the models’ reach beyond English, Truong says. “Many children all around the world have speech disorders, so we need to understand how these models perform in different languages.”
For now, the team’s benchmarks of LMs’ performance on SLP tasks have been implemented in the HELM benchmarking framework, a key step in making it easier for the community to track progress toward a clinically effective tool. “This benchmark allows people to judge the quality of MLMs for SLP-specific use-cases, which should make such applications easier for the community to create and track,” Haber says.
At some point, Truong says he’d like to slowly deploy an AI-based SLP tool with a clinician to see if it can simplify their typical workflow. “The sheer number of children that need to go through the pipeline is so enormous that any way we can improve productivity would be valuable.”
This story was originally published by Stanford HAI.
