Carnegie Mellon University

Jamie Callan

Professor, Language Technologies Institute

  • 5419 Gates & Hillman Centers
  • 412-268-4525

Research Area

Information Retrieval, Text Mining and Analytics

Research

My research and teaching focus on information retrieval and analysis. I have worked on a wide range of topics over the years, but am particularly interested in search engine architectures, information filtering and text mining. A sample of current projects is shown below. See my personal webpage for more information.

Projects

Lemur: The Lemur Project develops open-source search engines, toolbars, text analysis tools, search services and datasets that support international research and development. The project is best known for its Indri and Galago search engines, and large-scale ClueWeb datasets. Our software and datasets are widely used in scientific and research applications, and some commercial applications. Lemur's software development philosophy emphasizes state-of-the-art accuracy, flexibility and efficiency.

Search Engines With Knowledge Resources: This project develops new methods for using knowledge graphs and ontologies to improve search engine accuracy, especially for vague, ambiguous or poorly specified queries. Knowledge graphs and ontologies are less structured than typical relational databases and semantic web resources, but more structured than text stored in full-text search engines. The weak semantics in these semi-structured information resources can support interesting applications, but can also accommodate contradictions, inconsistencies and mistakes — making them easier to scale for large amounts of information. A search engine can use these resources to identify the probable meanings of query terms, and use this knowledge to identify documents that match those meanings.

Retrieval of Scientific Data: The volume of numerical and other non-textual data continues to grow as data-rich sciences produce more research results. This project extends search engine architectures to support large, centralized, universal repositories of affordable and easy-to-use scientific data. Our goal is to make tabular, numeric and other non-textual information as easy to access as documents, without laborious additional work.

Selective and Federated Search: I have a long-term interest in environments that contain numerous search engines. Much of my prior research focused on integrating many independent search engines, perhaps operated by different organizations with different interests, into a single integrated federated search system. My recent work investigates a related problem: decomposing a massive text collection into hundreds or thousands of small search engines designed to have skewed utility distributions that enable most index partitions to be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures, but has far lower computational costs and reveals new challenges and opportunities in large-scale search. The decomposition process creates text collections, thus inviting research on the characteristics desired or to be avoided in a text collection to enable accurate search. We've developed new resource selection algorithms to address efficiency problems in existing algorithms and dynamically adjust search costs based on query difficulty. Our goal is an easily customizable and extensible off-the-shelf method that provides an order of magnitude reduction in search costs over the current state of the art, especially on corpora of more than a billion documents.