Towards Machine Ethics with Language Models

2021-01-01 · ICLR 2021

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt

Abstract

We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense moral judgments. Models predict widespread moral judgments about diverse written scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may later serve as a general regularizer of behavior in open-ended settings. We find that language models have low but nontrivial performance. With the ETHICS dataset, we enable meaningful progress on value learning to be made today, providing a steppingstone toward AI that is aligned with human values.

Tasks

Ethics, World Knowledge
