HC4: A New Suite of Test Collections for Ad Hoc CLIR

2022-01-24

Dawn Lawrie, James Mayfield, Douglas Oard, Eugene Yang


Abstract

HC4 is a new suite of test collections for ad hoc Cross-Language Information Retrieval (CLIR), with Common Crawl News documents in Chinese, Persian, and Russian, topics in English and in the document languages, and graded relevance judgments. New test collections are needed because existing CLIR test collections, built by pooling traditional CLIR runs, have systematic gaps in their relevance judgments when used to evaluate neural CLIR methods. The HC4 collections contain 60 topics and about half a million documents for each of Chinese and Persian, and 54 topics and five million documents for Russian. The judgment process was seeded using interactive search and judgment, after which active learning was used to determine which documents to annotate. Documents were judged on a three-grade relevance scale. This paper describes the design and construction of the new test collections and provides baseline results that demonstrate their utility for evaluating systems.
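
To illustrate how graded relevance judgments such as these can feed a ranking metric, here is a minimal Python sketch of nDCG@k. It is not taken from the paper: the grade values (0, 1, 2), the linear gain, the document IDs, and the function names are all assumptions made for this example, and the paper's own metric and gain mapping may differ.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount (linear gain)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k for one topic.

    ranking: list of document IDs, best first.
    qrels:   dict mapping document ID -> relevance grade (assumed 0, 1, or 2).
    Unjudged documents are treated as not relevant (grade 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical judgments and system output for a single topic.
qrels = {"doc_a": 2, "doc_b": 1, "doc_c": 0}
ranking = ["doc_b", "doc_x", "doc_a"]  # doc_x is unjudged
score = ndcg_at_k(ranking, qrels, k=3)
```

Treating unjudged documents as not relevant is the usual convention, and it is the reason gaps in the judgments matter: a system that retrieves relevant but unjudged documents is scored as if they were irrelevant.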
