Towards Machine Ethics with Language Models
Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, Jacob Steinhardt
Abstract
We show how to assess a language model’s knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense moral judgments. Models predict widespread moral judgments about diverse written scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may later serve as a general regularizer of behavior in open-ended settings. We find that language models have low but nontrivial performance. The ETHICS dataset enables meaningful progress on value learning today, providing a stepping stone toward AI that is aligned with human values.