'BonTen' -- Corpus Concordance System for 'NINJAL Web Japanese Corpus'

2016-12-01 · COLING 2016

Masayuki Asahara, Kazuya Kawahara, Yuya Takei, Hideto Masuoka, Yasuko Ohba, Yuki Torii, Toru Morii, Yuki Tanaka, Kikuo Maekawa, Sachi Kato, Hikari Konishi

Abstract

The National Institute for Japanese Language and Linguistics, Japan (NINJAL) has undertaken a corpus compilation project to construct a ten-billion-word web corpus for linguistic research. The project is divided into four parts: page collection, linguistic analysis, development of the corpus concordance system, and preservation. This article presents the corpus concordance system, named 'BonTen', which enables the ten-billion-word corpus to be queried by string, by a sequence of morphological information, or by a subtree of the syntactic dependency structure.
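
The three query modes named in the abstract can be illustrated on a toy sentence. The sketch below is not BonTen's interface or implementation, which the abstract does not describe. The `Token` type, the function names, the part-of-speech labels, and the token-level dependency heads are all assumptions made for illustration, and the dependency query matches only a single child-to-head edge, the smallest possible subtree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical token record: surface form, POS tag, index of its head."""
    surface: str
    pos: str
    head: int  # index of the governing token, -1 for the root


# Toy analysis of "猫が魚を食べた" (assumed tags and heads, for illustration only).
SENTENCE = [
    Token("猫", "noun", 4),
    Token("が", "particle", 0),
    Token("魚", "noun", 4),
    Token("を", "particle", 2),
    Token("食べ", "verb", -1),
    Token("た", "auxiliary", 4),
]


def match_string(tokens, needle):
    """String query: does the concatenated surface text contain `needle`?"""
    return needle in "".join(t.surface for t in tokens)


def match_pos_sequence(tokens, pattern):
    """Morphological query: start indices where consecutive POS tags equal `pattern`."""
    tags = [t.pos for t in tokens]
    n = len(pattern)
    return [i for i in range(len(tags) - n + 1) if tags[i:i + n] == list(pattern)]


def match_dependency(tokens, child_pos, head_pos):
    """Dependency query: (child, head) index pairs with the given POS tags."""
    return [
        (i, t.head)
        for i, t in enumerate(tokens)
        if t.head >= 0 and t.pos == child_pos and tokens[t.head].pos == head_pos
    ]
```

For example, `match_pos_sequence(SENTENCE, ["noun", "particle"])` finds the two noun-plus-particle sequences, and `match_dependency(SENTENCE, "noun", "verb")` finds the nouns that depend on the verb. A system operating at ten-billion-word scale would need indexing rather than the linear scans used here.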

Tasks

Morphological Analysis
