NAIL: A Challenging Benchmark for Naïve Logical Reasoning
Xinbo Zhang, Changzhi Sun, Yue Zhang, Lei Li, Hao Zhou
Abstract
Logical reasoning over natural text is an important capability on the path toward human-level intelligence. Existing datasets are either too limited to train and evaluate logical reasoning capability (e.g., LogiQA and ReClor) or not oriented toward logical reasoning (e.g., SQuAD and HotpotQA). In this paper, we focus on a specific category of logical reasoning, named naive logical reasoning, and propose a new large-scale benchmark, named NAIL, targeted at learning and evaluating models' capabilities for naive logical reasoning. NAIL is sourced from standardized exams such as the Chinese National Civil Servants Examination and the Law School Admission Test. Furthermore, to collect more data, we propose to imitate the examples of standardized exams rather than designing new instances from scratch. NAIL is available in both Chinese and English, containing a total of 10,296 × 2 instances. Empirical results show that current state-of-the-art neural models struggle on NAIL with very poor accuracy (the best result is 30.10% for English NAIL and 36.15% for Chinese NAIL), while human experts achieve nearly 100% accuracy. Further results indicate that human imitations can significantly help models learn logic from natural text.