The (ab)use of Open Source Code to Train Large Language Models

2023-02-27

Ali Al-Kaswan, Maliheh Izadi

Code

github.com/aise-tudelft/nlbse23_reading_list
Official · In paper · none · ★ 6
github.com/nlbse2024/code-comment-classification
pytorch · ★ 1

Abstract

In recent years, Large Language Models (LLMs) have gained significant popularity due to their ability to generate human-like text and their potential applications in various fields, such as Software Engineering. LLMs for Code are commonly trained on large, unsanitized corpora of source code scraped from the Internet. The models memorize the content of these datasets and emit it, often verbatim. In this work, we discuss the security, privacy, and licensing implications of memorization. We argue that the use of copyleft code to train LLMs poses a legal and ethical dilemma. Finally, we provide four actionable recommendations to address this issue.

Tasks

Memorization
