Carnegie Mellon University

Modeling Crosslinguistic Influences between Language Varieties

Natural Language Processing and Computational Linguistics

By Yulia Tsvetkov

Most people in the world today are multilingual. Multilingualism is a gradual phenomenon: it ranges from language learners at various levels of competence, through highly fluent, advanced nonnative speakers, to native speakers who have also mastered other languages, and to translators. While fluency-challenged learner language is a well-established topic in NLP research, text produced by nonnative but highly fluent speakers has received little to no attention to date. Unlike learner language, where grammatical errors are apparent to native speakers, the signature of fluent but nonnative language is subtler: it nevertheless differs from native, monolingual language in the frequencies of certain concepts, constructions, and collocations. This raises the possibility that language technologies – typically trained on "standard" native language – are systematically biased in ways that render them less useful for the majority of users.
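The abstract does not specify how such frequency differences are measured, but as a minimal illustration of the idea, the sketch below ranks word bigrams by a smoothed log-frequency ratio between a native and a nonnative corpus; the toy corpora, the smoothing constant alpha, and the choice of bigrams as the unit of comparison are all assumptions made for this example.

```python
from collections import Counter
from math import log

def bigram_counts(sentences):
    """Count word bigrams in a tokenized corpus (a list of token lists)."""
    counts = Counter()
    for tokens in sentences:
        counts.update(zip(tokens, tokens[1:]))
    return counts

def log_ratio(native, nonnative, alpha=0.5, top_k=10):
    """Rank bigrams by smoothed log-frequency ratio between two corpora.

    Positive scores mark collocations overused in the nonnative corpus
    relative to native text; negative scores mark underused ones.
    """
    n_counts, f_counts = bigram_counts(native), bigram_counts(nonnative)
    n_total = sum(n_counts.values()) or 1
    f_total = sum(f_counts.values()) or 1
    scores = {
        bg: log((f_counts[bg] + alpha) / f_total)
            - log((n_counts[bg] + alpha) / n_total)
        for bg in set(n_counts) | set(f_counts)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Toy usage with pre-tokenized sentences: "take a decision" is a classic
# nonnative collocation where native English prefers "make a decision".
native_corpus = [["make", "a", "decision"], ["make", "a", "choice"]]
nonnative_corpus = [["take", "a", "decision"], ["make", "a", "decision"]]
print(log_ratio(native_corpus, nonnative_corpus))
```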
We propose new NLP techniques to shed light on differences in language use among fluent speakers with varying linguistic backgrounds. We hypothesize that current NLP models are biased toward native language and therefore may not support accurate measurement in nonnative text; the project develops new techniques to mitigate this bias. To examine semantic differences in aggregate over a corpus, we build lexical semantic analyzers and develop a family of multi-variety models – architectures in which a single computational model accounts for diverse speaker populations – to leverage new insights about the differences between language varieties and to improve our lexical semantic analyzers.
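The abstract names multi-variety models without describing their internals. As one plausible reading of "a single computational model that accounts for diverse speaker populations", the sketch below conditions a shared token-level analyzer on a learned variety embedding, so one set of parameters serves all populations while still adapting to each; the class name, dimensions, and the tagging task itself are illustrative assumptions, not the project's actual architecture.

```python
import torch
import torch.nn as nn

class MultiVarietyTagger(nn.Module):
    """Hypothetical multi-variety model: one shared network serves all
    speaker populations, conditioned on a learned variety embedding."""

    def __init__(self, vocab_size, num_varieties, num_labels,
                 word_dim=100, variety_dim=8, hidden_dim=128):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        # One embedding per language variety (e.g., the writer's L1 background)
        self.variety_emb = nn.Embedding(num_varieties, variety_dim)
        self.encoder = nn.LSTM(word_dim + variety_dim, hidden_dim,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, tokens, variety_id):
        # tokens: (batch, seq_len); variety_id: (batch,)
        words = self.word_emb(tokens)
        variety = self.variety_emb(variety_id)  # (batch, variety_dim)
        # Broadcast the variety vector across every token position
        variety = variety.unsqueeze(1).expand(-1, tokens.size(1), -1)
        hidden, _ = self.encoder(torch.cat([words, variety], dim=-1))
        return self.classifier(hidden)  # per-token label scores

# Toy usage: tag a batch of two 5-token sentences from different varieties.
model = MultiVarietyTagger(vocab_size=1000, num_varieties=4, num_labels=20)
tokens = torch.randint(0, 1000, (2, 5))
variety_id = torch.tensor([0, 3])
print(model(tokens, variety_id).shape)  # torch.Size([2, 5, 20])
```

Because the encoder and classifier are shared, evidence from well-resourced varieties can transfer to sparser ones, while the variety embedding leaves room for population-specific differences in usage.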