As you prepare your next asynchronous lecture, you set up your laptop on an appropriately sized delivery service box for the best camera angle, turn on the ring light, click start on the extension that embeds your webcam, and press record. You did your best to incorporate engagement, representation, and action and expression into your lecture. You enabled closed captioning for students who cannot hear or listen to the audio. That is, you incorporated the principles of universal design for learning. Your lectures should be accessible to everyone.
Imagine that you are beginning a unit on career development. You chose to include a YouTube video of Steve Harvey using his personal life story as a mechanism to encourage Black Americans to take risks in their own lives. You included this video precisely because of Harvey’s use of his personal narrative as an attempt to connect to the diversity within your classroom. To your surprise, the automatic video transcription (AVT) is—quite frankly—a hot mess. Approximately one minute into the video, Harvey switched from Mainstream American English (MAE) to African American English (AAE) as he narrated his story. He said, “Now, this here. This a gold star moment right here.” AVT transcribed this as, “It is here this is gonna stop bombing right here.”
Say what?
Unfortunately, these types of transcription errors in automated speech recognition are all too common. They affect not only supplemental materials like embedded videos but also what is transcribed when instructors or students who speak non-MAE dialects use AVT systems to capture their own words. We suggest that these errors reflect inherent inequities in the AVT systems on which we have come to rely during the COVID-19 pandemic. They draw negative attention to subtle differences in oral language related to the use of nonmainstream dialects, English-language-learner status, or even communication difficulties. As a result, they have the potential to affect what and how students learn.
Why do these gaps exist?
There are 24–30 different dialects of American English (Joseph & Janda, 2003). The dialects are associated with different income levels, geographic regions, racial and ethnic groups, or combinations of these factors. Some dialects, such as New York City English, do not differ greatly from MAE. Others, such as AAE, diverge widely in speech sound production, grammar, and vocabulary. For example, one speech sound production rule of AAE affects word-final consonant clusters (e.g., the “ft” in “left”): the final consonant of the cluster (i.e., “t”) is not produced when the following word begins with a consonant (e.g., “left hand” becomes “lef hand”). Grammar rules in AAE differ from those of MAE, especially concerning forms of the verb “be,” “wh-” questions, and negative statements. Some examples: “He big”/“He is big”; “He be coming around all the time”/“He always comes around”; “What you did that for?”/“Why did you do that?”; and “She don’t like no vegetables”/“She doesn’t like any vegetables.” AAE vocabulary tends to be more dynamic and flexible than MAE vocabulary: as a vocabulary item from AAE is acculturated into MAE, its use in AAE tends to diminish rapidly (e.g., the rise and demise of “bling” in recent years). The AAE meaning attributed to a word may also cause temporary ambiguity in comprehension if not supported by enough context.
All dialects of English are fully formed linguistic systems; none is, or should be considered, a substandard variant or vernacular of MAE. They simply differ in form and content from MAE, just as MAE differs from British English. The incidence and prevalence of non-MAE dialect use in college classrooms, however, are unknown because these demographic characteristics are not currently captured in national reporting databases.
We argue that the difficulty systems like AVT have with non-MAE dialects stems from an implicit bias against these dialects. This bias begins early, seeps into general society, and affects education at every level. Elementary teachers generally view students who use non-MAE dialects more negatively than their MAE-speaking peers (Diehm & Hendricks, 2020). Yet little to no support is provided to help non-MAE-speaking children become bidialectal in the way support is provided for children who are English-language learners. Researchers and companies have developed programs to record “teacher talk” in classrooms and schools. In one case, however, a developer had to use adult language samples from the Northeast US to train the machine algorithm because adult samples from the Southeast US could not be accurately recognized and transcribed. Voice assistants, smart speakers, and augmentative and alternative communication devices offer MAE, British, Australian, or even Spanish-influenced English options, but no options for AAE or other non-MAE dialects. To put it another way, non-MAE speakers currently must conform to a set of MAE standards to access these increasingly ubiquitous automated systems. It is clear that change is needed to make space for all dialects.
Where do we go from here?
Speech recognition systems will likely remain a large part of both online and face-to-face instruction as we emerge from the confines of the pandemic. We should not stay mired in our usual ways of doing things, which are limited by social biases and experiences. There is ample room to build more inclusive environments and elevate all voices in the classroom.
Companies such as Google and Amazon acknowledge the difficulty systems like AVT have in recognizing non-MAE speech. These systems are developed by and designed for MAE speakers. Errors occur because the systems attempt to match each spoken word to the item in their database that most closely resembles it. Consequently, they cannot negotiate the group and individual variation inherent in different dialects, because the databases do not contain enough tokens of non-MAE language (Biadsy, 2011). On average, automated speech recognition systems fail to accurately recognize 35 percent of the speech of non-MAE speakers (Tatman & Kasten, 2017). One attempt to improve the underlying algorithm reduced the error rate for non-MAE dialects relative to MAE to 14.6 percent (Biadsy, 2011); for comparison, the corresponding error rate for Indian English was 6.3 percent. Tech companies offer some options to build custom vocabularies to help with spoken-language recognition; however, these services are often cost prohibitive and limited to a single product or platform. Clearly, considerable work is required to improve automated systems’ ability to accurately recognize and transcribe the speech of non-MAE speakers in real time.
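For readers curious how figures like the 35 percent above are calculated, the standard metric is word error rate (WER): the word-level edit distance between what was said and what was transcribed, divided by the number of words spoken. The following is a minimal sketch, not code from the cited studies; the function and example strings are ours, applied to the Steve Harvey caption error from the opening anecdote.

```python
# Minimal sketch of word error rate (WER), the standard metric behind
# the error rates cited above. Implementation and example are ours,
# not drawn from the cited studies.

def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between transcripts / reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# What Harvey said vs. what the AVT system produced:
spoken = "Now this here this a gold star moment right here"
transcribed = "It is here this is gonna stop bombing right here"
print(f"WER: {wer(spoken, transcribed):.0%}")  # 60% -- six of ten words wrong
```

A 60 percent error rate on a ten-word utterance makes concrete how quickly a caption can become unusable once a speaker shifts dialect.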
What can instructors do to help resolve these inequities in the classroom?
- Acknowledge that non-MAE dialects are independent, fully formed language systems that vary from MAE in systematic ways. When there are dialect differences in the classroom, instructors first must examine their own implicit biases against the language systems used by non-MAE speakers. In addition to asking for demographic information like students’ preferred names and hometowns, include questions about the languages and dialects they identify as their primary means of communication. Instructors should examine where implicit biases arise around linguistic variation and equip students with strategies to reduce their own biases. This includes refraining from labeling non-MAE dialects as “vernaculars” or presenting MAE as “proper” or “correct” English. Instructors and staff across departments should make a concerted effort to use resources that present non-MAE dialects as the rich, fully formed language systems they are. Free resources like the University of Oregon’s Online Resources for African American Language, which highlight the various aspects of AAE, can assist with this effort. Being open to these changes is critical to eliminating the implicit bias against non-MAE dialects, which appears as early as elementary school.
- Include representative materials written in non-MAE dialects in postsecondary classrooms, and accept correct responses from students whether they are spoken or written in non-MAE dialects. Exposing MAE-speaking students to representative materials written in non-MAE dialects increases their exposure to and understanding of the cultural and linguistic diversity of the United States. Incorporating these materials also helps non-MAE speakers feel accepted and comfortable using their dialect(s) in the classroom. Additionally, accepting correct responses regardless of the dialect in which they are given, in either written or oral form, demonstrates acceptance of all linguistic differences in the classroom.
- Provide instruction on MAE grammar for all students, and on differences between dialects for non-MAE speakers, in introductory English composition courses. Formal grammar instruction does not take place in many K–12 classrooms. Consequently, students enter college without a solid working knowledge of English grammar in oral or written form. Providing instruction on basic MAE grammar to all students ensures that instructors and students alike understand how to revise written products. Instructors can also reiterate that students should use their own words to answer questions, in either spoken or written form, rather than substituting “MAE-sounding” vocabulary incorrectly.
- Explain why a particular set of standards is used in a field of study rather than referring students to support services. Here again, instructors may need to examine their own implicit biases against non-MAE dialects. In many fields, students are learning to undertake the responsibilities of their chosen professions. Often, that means teaching students to write to the standards of the field and explaining why MAE grammar and vocabulary are incorporated. This is particularly important in fields like law, medicine and allied health, psychology, or education, where documents students prepare become part of a legal record. Rather than referring students to a writing center for “poor writing skills,” instructors should consider providing explicit instruction on how MAE and non-MAE dialects can be used across home, school, and professional settings. Drawing attention to culturally specific interactions affirms identities and allows all voices to be heard.
- Enable AVT during virtual or face-to-face lectures to capture a wider variety of non-MAE language examples. Some tech companies offer channels for feedback on the quality of their audio and video products. Representative language from both instructors and students is needed to inform AVT algorithms. Instructors who record their lectures or discussion sections should take advantage of opportunities to send these companies “real-life” examples of non-MAE language samples and the errors AVT systems make on them.
- Expand the current metrics of machine-interpreted and machine-translated speech to become more inclusive of the oral language variation present in American English by increasing the size and diversity of the dialect tokens in the databases (Koenecke et al., 2020). Tech companies are responsive to consumer feedback and consumer power. Instructors in higher education should use that power to request the inclusion of more culturally and linguistically diverse samples when AVT systems are trained. One way to do this is to capture errors as they happen and report them to the companies; the sketch after this list shows one way an instructor might keep such a log.
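There is no standard format for these reports, so the sketch below is only one possibility: a small Python helper that appends each observed error to a running CSV file for later submission through a vendor’s feedback channel. The field names and file name are our own suggestions, not any company’s required schema.

```python
# Minimal sketch of an AVT error log an instructor might keep while
# lecturing. The record fields and file name are hypothetical; reports
# would still go through each vendor's own feedback channel.

import csv
import os
from datetime import datetime

LOG_FILE = "avt_errors.csv"  # hypothetical file name
FIELDS = ["timestamp", "platform", "dialect", "spoken", "transcribed"]

def log_avt_error(platform: str, dialect: str,
                  spoken: str, transcribed: str) -> None:
    """Append one observed transcription error to a running CSV log."""
    write_header = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()  # first run: create the column headers
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "platform": platform,
            "dialect": dialect,
            "spoken": spoken,
            "transcribed": transcribed,
        })

# Example: the Steve Harvey caption error from the opening anecdote.
log_avt_error(
    platform="YouTube automatic captions",
    dialect="AAE",
    spoken="Now, this here. This a gold star moment right here.",
    transcribed="It is here this is gonna stop bombing right here.",
)
```

Even a simple log like this pairs each error with the context a company would need to act on it: what was actually said, what the system produced, and which product produced it.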
To paraphrase Steve Harvey, creating space for all American English dialects in the classroom really would be a gold-star moment.
References
Biadsy, F. (2011). Automatic dialect and accent recognition and its application to speech recognition [Doctoral dissertation, Columbia University]. Academic Commons. https://academiccommons.columbia.edu/doi/10.7916/D8M61S68
Diehm, E. A., & Hendricks, A. E. (2020). Teachers’ content knowledge and pedagogical beliefs regarding the use of African American English. Language, Speech, and Hearing Services in Schools, 52(1), 100–117. https://doi.org/10.1044/2020_LSHSS-19-00101
Joseph, B. D., & Janda, R. D. (Eds.). (2003). The handbook of historical linguistics. Blackwell Publishing. https://doi.org/10.1002/9780470756393
Koenecke, A., Nam, A., Lake, E., Nudell, J., Quartey, M., Mengesha, Z., Toups, C., Rickford, J. R., Jurafsky, D., & Goel, S. (2020). Racial disparities in automated speech recognition. PNAS, 117(14), 7684–7689. https://doi.org/10.1073/pnas.1915768117
Tatman, R., & Kasten, C. (2017). Effects of talker dialect, gender & race on accuracy of Bing Speech and YouTube automatic captions. Interspeech 2017, 934–938. https://doi.org/10.21437/Interspeech.2017-1746
Lori A. Bass, PhD, CCC-SLP, is an assistant professor at Worcester State University. She earned her PhD in communication sciences and disorders from Florida State University. Her areas of scholarship include supporting the needs of students at risk for poor academic outcomes as a result of cultural and linguistic diversity.
Rihana S. Mason, PhD, is a research scientist at the Urban Child Study Center (UCSC) at Georgia State University. She earned her PhD in experimental psychology from the University of South Carolina. Her areas of scholarship include vocabulary development in diverse populations. She also evaluates diversity, equity, and inclusion pipeline programming.