Carnegie Mellon University

stock photo of a statue of Lady Justice,blindfolded and holding scales.

FAI: Quantifying and Mitigating Disparities in Language Technologies

By Graham Neubig

Advances in natural language processing (NLP) technology now make it possible to perform many tasks through natural language or over natural language data -- automatic systems can answer questions, perform web search, or command our computers to perform specific tasks. However, "language'' is not monolithic; people vary in the language they speak, the dialect they use, the relative ease with which they produce language, or the words they choose with which to express themselves. In benchmarking of NLP systems however, this linguistic variety is generally unattested. Most commonly tasks are formulated using canonical American English, designed with little regard for whether systems will work on language of any other variety. In this work we ask a simple question: can we measure the extent to which the diversity of language that we use affects the quality of results that we can expect from language technology systems? This will allow for the development and deployment of fair accuracy measures for a variety of tasks regarding language technology, encouraging advances in the state of the art in these technologies to focus on all, not just a select few.

Specifically, this work focuses on four aspects of this overall research question. First, we will develop a general-purpose methodology for quantifying how well particular language technologies work across many varieties of language. Measures over multiple speakers or demographics are combined to benchmarks that can drive progress in development of fair metrics for language systems, tailored to the specific needs of design teams. Second, we will move beyond simple accuracy measures, and directly quantify the effect that the accuracy of systems has on users in terms of relative utility derived from using the system. These measures of utility will be incorporated in our metrics for system success. Third, we focus on the language produced by people from varying demographic groups, predicting system accuracies from demographics. Finally, we will examine novel methods for robust learning of NLP systems across language or dialectal boundaries, and examine the effect that these methods have on increasing accuracy for all users.