Value Alignment Verification

2020-10-16 · NeurIPS Workshop HAMLETS 2020

Anonymous

Abstract

As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important that humans can verify these agents' trustworthiness and efficiently evaluate their performance and correctness. In this paper we formalize the problem of value alignment verification: how can a human efficiently test whether the goals and behavior of another agent are aligned with the human's values? We explore several value alignment verification settings and provide foundational theory for the problem. We study verification with idealized human testers that know their own reward function, as well as verification where the human tester has only implicit values. Our theoretical results, together with empirical results in both a discrete grid navigation domain and a continuous autonomous driving domain, demonstrate that it is possible to synthesize highly efficient and accurate value alignment verification tests for certifying the alignment of autonomous agents.
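To make the problem statement concrete, here is a toy sketch of a value alignment verification test. This is not the paper's algorithm; the chain MDP, the one-step greedy policies, and all function names are illustrative assumptions. The idea shown is only the basic setup: a tester who knows their own reward function queries an agent's policy at a small set of test states and certifies the agent as aligned iff every queried action is optimal under the tester's reward.

```python
# Toy value alignment verification on a 1-D chain of states 0..N-1.
# Everything here (dynamics, rewards, policies) is a hypothetical example.

ACTIONS = (-1, +1)  # move left / move right
N = 5

def step(state, action):
    """Deterministic chain dynamics, clipped to [0, N-1]."""
    return min(max(state + action, 0), N - 1)

def greedy_action(state, reward):
    """One-step greedy policy under a given state-reward function."""
    return max(ACTIONS, key=lambda a: reward(step(state, a)))

def alignment_test(agent_policy, human_reward, test_states):
    """Certify alignment iff the agent takes a human-optimal action
    at every queried test state."""
    return all(agent_policy(s) == greedy_action(s, human_reward)
               for s in test_states)

# The human's values: reach the rightmost state.
human_reward = lambda s: -abs(s - (N - 1))

# An aligned agent optimizes the human's reward; a misaligned one
# optimizes a different reward (it prefers state 0).
aligned_agent = lambda s: greedy_action(s, human_reward)
misaligned_agent = lambda s: greedy_action(s, lambda t: -t)

print(alignment_test(aligned_agent, human_reward, range(N)))     # True
print(alignment_test(misaligned_agent, human_reward, range(N)))  # False
```

The efficiency question the paper studies is, in these terms, how to choose a small set of `test_states` (or reward queries) that still distinguishes aligned from misaligned agents.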
