Automated Analysis of Spontaneous Language Samples of Russian-English Bilingual Children and Russian Children with Specific Language Impairment

Name: 

Katsiaryna (Katya) Aharodnik

Department:

Speech-Language-Hearing Sciences

Project Title:

Automated Analysis of Spontaneous Language Samples of Russian-English Bilingual Children and Russian Children with Specific Language Impairment

My name is Katya Aharodnik. I am originally from Minsk, Belarus. I have always been passionate about children and the science of language. While pursuing my PhD in Speech-Language-Hearing Sciences, I have acquired advanced knowledge of typical and atypical language acquisition. In my dissertation, I would like to focus on developing automatic ways of assessing bilingual Russian-English children and children with language impairments. I am happy that I can combine my fields of interest (early education, language acquisition, and computational linguistics) into a feasible research project that I immensely enjoy.

Project

This project aims to investigate whether computational linguistics techniques such as statistical machine learning can differentiate between children with typical bilingual development and children with Specific Language Impairment (SLI). The vast majority of research on SLI has been conducted on monolingual, English-speaking children. Valid and reliable assessment tools are scarce for many other languages, and those that exist are often not sensitive enough to capture the differences between children with atypical language development and children with typical bilingual language skills. This problem arises because the language characteristics of bilinguals in each of their two languages can resemble the errors made by monolingual children with language impairment. Ultimately, this research aims to run a machine learning Natural Language Processing (NLP) experiment that predicts the status of a child (bilingual or language impaired) from elicited narrative samples and, on that basis, to evaluate the potential of automated clinical assessment for these populations.
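The prediction task described above can be sketched in a few lines. This is only an illustration of the general setup, not the project's actual method: the nearest-centroid rule, the leave-one-out evaluation, the two feature names, and every number in toy_data are assumptions chosen to keep the example small.

```python
from statistics import mean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(train, sample):
    """Nearest-centroid rule: return the label whose mean vector is closest."""
    by_label = {}
    for features, label in train:
        by_label.setdefault(label, []).append(features)
    centroids = {label: centroid(rows) for label, rows in by_label.items()}
    return min(centroids, key=lambda label: squared_distance(centroids[label], sample))


def leave_one_out_accuracy(data):
    """Hold out each child once, train on the rest, and score the prediction."""
    hits = 0
    for i, (features, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, features) == label
    return hits / len(data)


# Placeholder values invented for illustration only; NOT data from the study.
# Each vector: (hypothetical case errors per 100 words,
#               hypothetical mean utterance length in words)
toy_data = [
    ([2.0, 4.5], "bilingual"),
    ([3.0, 4.0], "bilingual"),
    ([2.5, 5.0], "bilingual"),
    ([8.0, 2.5], "SLI"),
    ([9.0, 3.0], "SLI"),
    ([7.5, 2.0], "SLI"),
]

accuracy = leave_one_out_accuracy(toy_data)
```

Leave-one-out evaluation is used here because it suits small samples: each child's narrative is classified by a model that never saw that child. A real experiment would need many more features and a careful choice of classifier.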

Three-year-old children are impressively creative individuals who robustly acquire the languages around them. This summer, I had the opportunity to collect data for my dissertation project from three-year-olds speaking Russian and English. It fascinates me how much children's brains are capable of taking in during those early stages of development. At the same time, it is critical to identify any developmental difficulties that children may have during their preschool years. Implementing automatic ways of language assessment might be a step towards identifying developmental language disorders more efficiently and at earlier stages.

In my summer work, I focused on employing machine learning algorithms to distinguish children with language impairment from typically developing bilingual children. I collected data from 17 Russian-English bilingual children aged three to five years. These data were collected at two Russian preschool sites located in Brooklyn. The data collection was time-consuming but fun for both the children and me.

The testing battery that I used for my data collection included parental questionnaires and consent forms, spontaneous narratives, structured elicitation, and semantic fluency tasks. Via the questionnaire, I collected information about language background, the percentage of exposure to each language at home, time spent on reading activities, and the input participants receive at home from siblings and parents. I included questions about the child's medical history so that children who are at risk of language impairment could be excluded from the bilingual group.

During data collection, I elicited spontaneous narratives from children based on two wordless picture books: Fox Story (Gülzow & Gagarina, 2007) and Frog, Where Are You? (Mayer, 1969). I asked the children to look carefully at all the pictures, and I encouraged them to construct a story by looking at one picture at a time. Children told their stories in English and Russian (though occasionally, the preschoolers surprised me with their trilingual skills by throwing in a word or two from Ukrainian or Kazakh), and a native speaker of Russian and English recorded and transcribed them. We tagged each transcript for a variety of errors for future automatic analyses; error types included dysfluencies, errors in verb agreement (subject-verb agreement), errors in noun agreement (case errors), errors in lexicon, and preposition omissions. Language productivity measures were calculated automatically from the speech samples using specialized software.
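As a rough illustration of how tagged transcripts can be turned into numbers, the sketch below computes a few common productivity measures and error rates. It is not the software used in the project. The bracketed tag format (for example "[agreement]"), the choice of measures, the tag names, and the sample utterances are all assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical tag format: an error label in square brackets, e.g. "[case]".
TAG = re.compile(r"\[(\w+)\]")


def analyze(utterances):
    """Compute simple productivity measures and error rates for one transcript."""
    tag_counts = Counter()
    words = []
    lengths = []
    for utt in utterances:
        # Count the error tags, then strip them so they are not counted as words.
        tag_counts.update(TAG.findall(utt))
        tokens = re.findall(r"\w+", TAG.sub(" ", utt.lower()))
        words.extend(tokens)
        lengths.append(len(tokens))
    total = len(words)
    rates = {tag: 100 * n / total for tag, n in tag_counts.items()} if total else {}
    return {
        "total_words": total,
        "different_words": len(set(words)),
        "mean_utterance_length": total / len(lengths) if lengths else 0.0,
        "errors_per_100_words": rates,
    }


# Invented utterances for illustration only; not taken from any child's transcript.
sample = [
    "the frog jump [agreement] out",
    "boy look in [dysfluency] in the boot",
    "the dog fell",
]
result = analyze(sample)
```

Because Python's \w matches Unicode letters, the same function would also tokenize Russian transcripts written in Cyrillic. Measures like these could then serve as input features for a classifier.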

Another task that I implemented was a structured elicitation task, which targeted the children's production of morpho-syntactic markers: third person singular, third person plural, and present and past tense inflections. I asked the children to produce sentences matching the pictures, following the model I provided. In the third task, semantic fluency, I asked children to name as many items related to the concept of kitchen as they could within one minute. This task tapped into lexical retrieval abilities and was an additional, indirect measure of children's vocabulary skills in each language tested. My research assistants and I counted the number of words each child produced in one minute in this task.

I am grateful for the opportunity this Fellowship provided me. It allowed me to accomplish the steps (data collection, transcription, and tagging) that are critical for my future dissertation project. This fall, I plan to continue collecting data from bilingual children at other preschool and kindergarten sites. Following that, I will extract additional, more complex features to run the machine learning classification task and will collect data from children with developmental language impairment.