Data Collection vs. Knowledge Graph Completion: What is Needed to Improve Coverage?

2021-11-01 · EMNLP 2021

Kenneth Church, Yuchen Bian

Abstract

This survey/position paper discusses ways to improve coverage of resources such as WordNet. Rapp estimated correlations, rho, between corpus statistics and psycholinguistic norms. rho improves with quantity (corpus size) and quality (balance). 1M words is enough for simple estimates (unigram frequencies), but at least 100x more is required for good estimates of word associations and embeddings. Given such estimates, WordNet’s coverage is remarkable. WordNet was developed on SemCor, a small sample (200k words) from the Brown Corpus. Knowledge Graph Completion (KGC) attempts to learn missing links from subsets. But Rapp’s estimates of sizes suggest it would be more profitable to collect more data than to infer missing information that is not there.
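The kind of correlation the abstract refers to can be illustrated with a minimal sketch. It assumes rho is a Spearman rank correlation between unigram counts and per-word norm scores; the abstract does not say which correlation Rapp used. The function names, the toy token list, and the toy norm scores are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def rank(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(rank(xs), rank(ys))


def corpus_vs_norms(tokens, norms):
    """Correlate corpus unigram counts with norm scores for the normed words."""
    counts = Counter(tokens)
    words = sorted(norms)
    freqs = [counts[w] for w in words]   # words absent from the corpus count as 0
    scores = [norms[w] for w in words]
    return spearman_rho(freqs, scores)


# Hypothetical toy data, for illustration only.
toy_tokens = "the cat sat on the mat the dog saw the cat".split()
toy_norms = {"the": 7.0, "cat": 5.0, "dog": 4.0, "mat": 2.0}
rho = corpus_vs_norms(toy_tokens, toy_norms)
```

With a sample this small, rare words tie at counts of 0 or 1, which weakens the ranking. That is the intuition behind the abstract's claim that rho improves with corpus size.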