Carnegie Mellon University

Modeling Lexical Borrowing to Bridge the "Linguistic Divide" in Natural Language Processing

Natural Language Processing and Computational Linguistics

By Yulia Tsvetkov

Identifying lexical correspondences between high- and low-resource languages is fundamentally important for a variety of problems in NLP. This project seeks to enable language-processing technologies in or between low-resource languages by identifying and exploiting cross-lingually borrowed lexical material. We will develop models to identify the ontological status of words and use a string transduction model, scored according to Optimality Theory, to model how borrowed words are adapted between languages. We will incorporate borrowed lexical material as an additional source of evidence for cross-lingual lexical correspondence and use it in syntactic parsing and machine translation, focusing both on improving medium-resource performance and on bootstrapping applications in languages that completely lack parallel data. Using a large-scale Twitter corpus, we will explore the sociological factors that drive linguistic borrowing, identifying how words cross linguistic barriers and are adopted. This project will enable natural language processing tools for a much larger portion of the world’s languages, in particular low-resource languages that lack parallel data or other linguistic resources. Additionally, the tools we develop will be of interest to sociolinguists, historical linguists, and cognitive linguists who wish to study the processes that drive language change and adaptation.