When we’re learning a second language, we get a sense of how well we’re doing by speaking to other learners, native speakers and language tutors. Sometimes, though, we simply don’t have the opportunity to exercise our verbal skills or the means to be assessed.
Researchers have been exploring ways to score the proficiency of spontaneous speech by ESL (English as a Second Language) speakers automatically. Automated speech-scoring models, which lessen the need for human contact and in-person work, would allow broader assessment coverage while minimising administration costs.
How can we measure proficiency in ESL spontaneous speech?
How well one speaks a language can be defined in many ways: fluency, pronunciation, grammar and vocabulary all shape the learner’s utterances. For second language learners, the most appropriate aspect to focus on may be their grammar usage, or what Ortega (2003) terms “syntactic complexity”. This is a collective term for “the range of [syntactic] forms that surface in language production and the degree of sophistication of such forms”. Adult language learners may exhibit their own style and pace of fluency and pronunciation, so syntactic complexity is more suggestive of the degree of their language acquisition and, in turn, their skill in exercising the language; it is not wholly dependent on individual differences in lexicon size, although these may also contribute to the forms they can and do produce. Syntactic complexity in spontaneous speech is thus a good indicator of the learner’s ability to construct such forms and utterances and produce them verbally.
Automated scoring of the syntactic complexity of spontaneous speech that is comparable to human assessment is not a simple task. We will explore the challenges that arise by looking at two natural language processing approaches to scoring, as realised in the research of Bhat and Yoon (2015): the vector-space modelling (VSM) approach and the language model (LM) approach.
For both models, Bhat and Yoon used part-of-speech (POS) tags from a large set of learners’ responses, classified into four groups according to their proficiency level (assigned by professional raters). A learner’s response to be tested was then compared to the POS tag distributions of the four groups according to each model’s specific scoring criteria.
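This shared setup can be sketched roughly as follows. Assuming each response has already been POS-tagged (a real system would use a tagger; the tag sequences and the grouping below are toy data of my own, not the paper’s):

```python
from collections import Counter

# Toy data: each response is a sequence of POS tags, already grouped
# by its human-assigned score level (1-4).
tagged_responses = {
    1: [["PRP", "VBP", "NN"], ["PRP", "VBP", "JJ"]],
    2: [["PRP", "VBP", "DT", "NN"], ["PRP", "MD", "VB", "NN"]],
    3: [["PRP", "VBP", "DT", "JJ", "NN", "IN", "DT", "NN"]],
    4: [["PRP", "VBP", "DT", "NN", "WDT", "VBZ", "RB", "JJ"]],
}

def bigrams(tags):
    """POS bigrams of one response, e.g. ('PRP', 'VBP')."""
    return list(zip(tags, tags[1:]))

# One POS-bigram distribution per score level, pooled over all
# responses at that level; both models start from counts like these.
class_counts = {
    level: Counter(bg for resp in responses for bg in bigrams(resp))
    for level, responses in tagged_responses.items()
}

print(class_counts[1].most_common(2))
```

A test response is then compared against these per-level distributions, which is where the two models diverge.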
Vector-space modelling approach
Vector-space modelling is often used in information retrieval, e.g. matching a query to a document, and seeks to capture the relevant terms in a document. Here there were four vectors, each formed from the POS tags of the responses at one score level. A given test response, treated much like a query, was converted to a vector of its own and assigned to the score class with the most similar POS distribution, as measured by cosine similarity (the smaller the angle between two vectors, the more similar they are).
The advantage of VSM was that, thanks to the weighting of POS tags, it captured the overall distribution for each score class. POS tags that occurred commonly across all groups were given a low “weight”, highlighting the specific POS tags that determined a test response’s assignment to a given class. In this model, the researcher must decide which POS tag sequences (unigrams, bigrams etc.) are most representative of the task at hand. Here bigram POS sequences were the middle ground that yielded the most significant information: unigrams had limited distinguishing power, while trigrams, although capturing more structure, had poor coverage due to data sparseness.
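A minimal sketch of this idea (the counts, the idf-style weighting scheme and all names are illustrative assumptions, not Bhat and Yoon’s exact formulation): build one POS-bigram vector per score class, down-weight bigrams shared by every class, and assign a test response to the class with the highest cosine similarity.

```python
import math
from collections import Counter

# Illustrative POS-bigram counts for the four score classes (toy numbers).
class_vectors = {
    1: Counter({("PRP", "VBP"): 10, ("VBP", "NN"): 8}),
    2: Counter({("PRP", "VBP"): 9, ("VBP", "DT"): 6, ("DT", "NN"): 6}),
    3: Counter({("PRP", "VBP"): 8, ("DT", "JJ"): 5, ("JJ", "NN"): 5}),
    4: Counter({("PRP", "VBP"): 7, ("NN", "WDT"): 4, ("WDT", "VBZ"): 4}),
}

# idf-style weight: bigrams that appear in every class carry little signal.
n_classes = len(class_vectors)
df = Counter()
for vec in class_vectors.values():
    df.update(set(vec))
idf = {bg: math.log(n_classes / df[bg]) + 1.0 for bg in df}  # +1 keeps shared bigrams non-zero

def weight(vec):
    return {bg: count * idf.get(bg, 1.0) for bg, count in vec.items()}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def score(test_bigrams):
    """Assign a test response to the score class with the most similar vector."""
    q = weight(Counter(test_bigrams))
    return max(class_vectors, key=lambda lvl: cosine(q, weight(class_vectors[lvl])))

# A test response whose down-weighted common bigram matters less than
# its distinctive ones, which point to class 3.
print(score([("PRP", "VBP"), ("DT", "JJ"), ("JJ", "NN")]))  # → 3
```

Note how the ubiquitous ("PRP", "VBP") bigram gets the minimum weight, so the assignment is driven by the rarer, class-specific bigrams.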
Language model approach
While VSM captures the overall POS distribution of each of the four score groups, the LM concentrates on local features of learner responses: it matches a response to the score level whose responses most often contain the grammatical constructions found in that test response. This approach uses POS sequences as the main criterion for matching, with the intuition that speakers at higher proficiency levels use more complex grammatical expressions.
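The matching step can be sketched as one POS bigram model per score class (the add-one smoothing, toy data and names are my own illustrative choices, not the paper’s implementation): the test response is assigned to the class under which its POS sequence is most probable.

```python
import math
from collections import Counter

# Toy POS-tagged training responses for two score levels.
training = {
    1: [["PRP", "VBP", "NN"], ["PRP", "VBP", "JJ"]],
    4: [["PRP", "VBP", "DT", "NN", "WDT", "VBZ", "RB", "JJ"]],
}

def bigrams(tags):
    return list(zip(tags, tags[1:]))

# All POS tags seen anywhere: the vocabulary used for smoothing.
vocab = {t for resps in training.values() for r in resps for t in r}

def log_likelihood(tags, level):
    """Add-one smoothed bigram log-probability of a response under one class."""
    bg = Counter(b for r in training[level] for b in bigrams(r))
    ug = Counter(t for r in training[level] for t in r[:-1])
    ll = 0.0
    for prev, cur in bigrams(tags):
        ll += math.log((bg[(prev, cur)] + 1) / (ug[prev] + len(vocab)))
    return ll

def score(tags):
    return max(training, key=lambda lvl: log_likelihood(tags, lvl))

# A relative clause-like construction is more probable under level 4.
print(score(["PRP", "VBP", "DT", "NN"]))  # → 4
```

Because the model scores at the level of individual POS transitions, a single ASR substitution (say, a complex construction mis-recognised as a simpler one) changes the bigrams directly, which is exactly the fragility discussed below.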
The LM exhibited promising correlations between its predicted scores and the human scores, with exact and adjacent agreement combined at 86%. The main issue Bhat and Yoon found was that the LM was susceptible to automatic speech recognition (ASR) output errors that affected its performance. As the LM is concerned with probabilities at the POS level, recogniser errors such as mis-recognising sophisticated expressions and replacing them with more frequent, simpler ones directly affect how a response is assigned a score level, and in part explain the discrepancy between the predicted and human scores.
It seems that focusing on local syntactic features at the clause level is too micro-centred for a task like automated scoring of spontaneous speech. Drawing on data from language learners rather than native speakers is more appropriate when scoring proficiency automatically, and creative ways of capturing an overall distribution let us weed out which utterances matter, and which don’t, when distinguishing between proficiency levels. Automated scoring may not match the unique quality of a professional human scorer, but it offers a streamlined and, for the most part, objective way for learners to be assessed and to get feedback on how they are doing among their peers.
Bhat, S., & Yoon, S. (2015). Automatic assessment of syntactic complexity for spontaneous speech scoring. Speech Communication, 67, 42–57.
Ortega, L. (2003). Syntactic complexity measures and their relationship to L2 proficiency: a research synthesis of college-level L2 writing. Applied Linguistics, 24, 492–518.