Pangea Tool Expands LLMs' Global Reach
New tool developed by LTI researchers expands access to AI and LLM systems to new languages
Artificial intelligence and large language models (LLMs) increasingly complete or supplement everyday tasks, from using search engines to creating art. However, these tools and their datasets rely mostly on English and Western-centric languages, limiting access for people who speak any of the thousands of other languages used worldwide.
To address this gap, a team of researchers at Carnegie Mellon University's School of Computer Science developed Pangea. The open-source, multilingual multimodal large language model (MLLM) recognizes 39 languages and was trained on six million data samples with the aim of making it culturally inclusive.
"In the long run, AI should not be just a tool, it should make our society better," said Xiang Yue, a postdoctoral research associate in the Language Technologies Institute (LTI). "Most importantly, these kinds of tools should be accessible to everyone no matter where they are and what language they speak."
Pangea supports languages such as Urdu, Polish, Czech and more. Users can chat with Pangea and receive text-based answers. But because it's an MLLM, Pangea was also trained on non-text data. For example, a user can enter a picture of a park in the chat and, in their native language, ask Pangea to describe the picture. Pangea will describe the image in that same language.
The training dataset, PangeaIns, focused on linguistic and cultural diversity, but researchers curating the dataset also had to address challenges such as data scarcity and cultural nuance. To do so, they combined existing open-source resources with newly created instructions that explained how to complete visual reasoning tasks and focused on multicultural understanding. These instructions covered tasks such as captioning and visual reasoning. Researchers curated the instructions in English and then translated and adapted them for multilingual contexts, using machine translation to make the process more efficient.
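The curate-in-English, then translate-and-adapt workflow described above can be pictured with a small Python sketch. This is an illustration only, not the team's pipeline: the `InstructionSample` record, the `expand_to_languages` function and the placeholder `tag_translator` are all hypothetical names invented here, and a real system would call an actual machine translation service and add a cultural adaptation and review step.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class InstructionSample:
    """One image-grounded instruction (hypothetical schema, for illustration)."""
    image_path: str
    task: str            # e.g. "captioning" or "visual_reasoning"
    instruction: str
    response: str
    language: str = "en"

# Assumed translator interface: (text, target_language) -> translated text.
Translator = Callable[[str, str], str]

def expand_to_languages(
    samples: Iterable[InstructionSample],
    languages: Iterable[str],
    translate: Translator,
) -> List[InstructionSample]:
    """Keep each English sample and add one translated copy per target language."""
    targets = list(languages)
    expanded: List[InstructionSample] = []
    for sample in samples:
        expanded.append(sample)
        for lang in targets:
            expanded.append(
                replace(
                    sample,
                    instruction=translate(sample.instruction, lang),
                    response=translate(sample.response, lang),
                    language=lang,
                )
            )
    return expanded

def tag_translator(text: str, lang: str) -> str:
    """Placeholder that only tags the text; stands in for a real MT system."""
    return f"[{lang}] {text}"

seed = [
    InstructionSample(
        image_path="park.jpg",
        task="captioning",
        instruction="Describe this picture.",
        response="A park with a pond.",
    )
]
multilingual = expand_to_languages(seed, ["ur", "pl"], tag_translator)
```

In this sketch the image reference and task label are shared across languages, and only the text fields change, which mirrors the idea of writing an instruction once and reusing it in many languages.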
Researchers had to source these high-quality datasets and then iterate. Graham Neubig, an associate professor in the LTI, said the work took time and required solving unique challenges.
"One issue we encountered was with this visual question-answering dataset in Japanese, the answers were all very short," Neubig said. "In an earlier iteration of our model, we would ask, 'What is this picture about?' And in English it would say, 'Oh, this is a beautiful picture of a dock that's in the middle of a lake with lots of snowy mountains around it, demonstrating a peaceful scene that many people would be happy to be in.' And then the Japanese model would respond, 'dock.' And the dataset that we had for Japanese was good. The answers were high quality, but they were not in the form we expected. So we had to do a lot of work around this, and it will become even harder as we expand to more languages."
Cultural nuances go beyond the languages used as inputs and outputs. Researchers also needed to consider how certain images could be context-dependent. Yueqi Song, a student in the Computer Science Department, said Pangea can provide more culturally relevant responses and understand more culturally relevant inputs.
"It's so important that our model is open-source, and that we support 39 languages," Song said. "We aim to cover this much of the population to promote inclusivity. That way, people from around the globe can input images and text and get responses from our models that are relevant to their culture, which could help people learn more about how these models interact in the multimodal domain."
The final piece of the tool is PangeaBench, which assesses Pangea's performance compared to other LLMs and MLLMs. The researchers evaluated the tool's capabilities such as captioning, cultural understanding and multisubject reasoning.
Neubig says he plans to expand Pangea, adding more languages and building upon this base.
"It's a good template, and the work must be done carefully," Neubig said. "It requires you to be a speaker of the language, which is why we could do Japanese because a team member and I both speak Japanese. We could do Hindi because a team member speaks Hindi. We have team members who speak Chinese. But hopefully, we'll be able to ship this out to other people and, because it's open source, they can build on it."
Learn more about Pangea and try a demo on the team's GitHub page.