"Wide angle photo of a cat zombie, walking dead style, digital art"

Bing Image Creator

Open Source is dead.

Long live Open Source Software.

"Wide angle photo of A cat wearing a king's crown and a red cape, game of thrones style, iron throne, 3d digital art"

Bing Image Creator

Introduction

The Airius Risk Maturity Knowledgebase is intended to give you a snapshot of the developments affecting information risk for June 2023.

The advent of artificial intelligence, and more specifically of Large Language Models (LLMs), has changed how software is developed. An LLM is only as capable as the material it is trained on. As a result, LLMs have started to specialize, focusing on research, natural language, conversation, contracts and, for this discussion, software development.

These models have been trained on readily available data from the internet. They have also used structured data sources to aid in the learning and indexing of data. As a result, mankind now accesses that accumulated knowledge through the machines, which have absorbed whatever can be found on the internet.

The problem lies in this use of everything accessible on the internet, and in whether training an LLM for private and commercial purposes constitutes "fair use". We will discuss this in detail.

[Image: Copyright infringement (Wikipedia). By Columbia Copyright Office, obtained from the Library of Congress, https://www.loc.gov/exhibits/bobhope/vaude.html; transferred from en.wikipedia to Commons by User:Dichter using CommonsHelper. Public Domain, https://commons.wikimedia.org/w/index.php?curid=10858426]

Note: 100% of the research for this project was done with the aid of Bing-GPT. Most of the images were generated with Bing's version of DALL-E. All sources for the research are cited in the references below.

Using technology to copy protected content (copyrighted or copyleft), and then using that inventory to let customers bypass existing license restrictions and earn money, undermines the fair use argument. Using AI to bypass restrictive open source licenses is theft.

What is a Large Language Model (LLM)?

A large language model (LLM) is a type of artificial intelligence (AI) algorithm that uses deep learning techniques and massive data sets to understand, summarize, generate and predict new content. It consists of a neural network with many parameters (typically billions of weights or more), trained on large quantities of unlabeled text using self-supervised or semi-supervised learning. LLMs emerged around 2018 and perform well at a wide variety of tasks.
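
To make the "self-supervised" part concrete, here is a minimal sketch, offered as an illustration only. It assumes the Hugging Face transformers library and uses the small public GPT-2 model as a stand-in for a far larger LLM; the training objective is simply to predict each next token of unlabeled text, and the gap between the model's predictions and the actual text is the loss.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small public model used purely as an illustrative stand-in for a large LLM.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Large language models are trained on large quantities of unlabeled text."
inputs = tokenizer(text, return_tensors="pt")

# Self-supervised objective: the labels are the input tokens themselves.
# The library shifts them internally so each position predicts the next token.
with torch.no_grad():
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"language-modeling loss: {outputs.loss.item():.3f}")
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```

No human labeling is involved anywhere in this loop, which is why any text (or code) that can be scraped is usable as training material.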

An incomplete list of current LLM projects (there are easily hundreds of well-developed projects)

What is Generative AI?

Generative AI is a type of artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts. Generative AI models learn the patterns and structure of their input training data, and then generate new data that has similar characteristics.

Generative AI builds on existing technologies, like large language models (LLMs) which are trained on large amounts of text and learn to predict the next word in a sentence. For example, “peanut butter and _” is more likely to be followed by “jelly” than “shoelace”. Generative AI can not only create new text but also images, videos, or audio.
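
As a concrete illustration of next-word prediction, the sketch below asks a model which tokens are most likely to follow the prompt. It is an assumption for illustration only, using the Hugging Face transformers library and the small public GPT-2 model rather than any particular commercial system.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "peanut butter and"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Probability distribution over the vocabulary for the next token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r:>12}  p={prob.item():.3f}")
```

A continuation like " jelly" ranks far above " shoelace" because the model has seen that phrase pattern many times in its training data; the same mechanism, applied token by token, is what generates whole paragraphs or whole functions.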

Generative AI has potential applications across a wide range of industries, including art, writing, software development, healthcare, finance, gaming, marketing, and fashion. However, there are also concerns about the potential misuse of generative AI, such as in creating fake news or deepfakes which can be used to deceive or manipulate people.

Generative AI LLMs specifically designed to generate code

  • OpenAI and Microsoft
  • GitHub (with technology from OpenAI and Microsoft)
    • Copilot (Codex)
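
Copilot and Codex are proprietary services, but the mechanics can be sketched with an open model. The example below is an assumption for illustration only: it uses the publicly released Salesforce/codegen-350M-mono checkpoint (trained on public Python code) as a stand-in, and completes a function signature the way a Copilot-style assistant would, drawing on patterns learned from its training corpus.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Open code-generation model used as a stand-in for Copilot/Codex.
model_name = "Salesforce/codegen-350M-mono"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A Copilot-style prompt: a signature and docstring, with the body left blank.
prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=False,                      # greedy decoding for a repeatable sketch
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

The completion comes entirely from statistical patterns in the model's training corpus of public code, which is precisely why the licensing and attribution questions discussed next arise.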

Kate Downing explained in her blog:

“The crux of the USCO’s refusal to recognize any copyright interest in the images rests on the idea that Midjourney’s output is unpredictable and that the prompts users provide to it are mere suggestions, with too much “distance between what a user may direct Midjourney to create and the visual material Midjourney actually produces” such that “users lack sufficient control over generated images to be treated as the “mastermind” behind them.” Repeatedly, the USCO seems to argue that the final result has to reflect the artist’s “own original conception,” even going so far as to argue that the “process is not controlled by the user because it is not possible to predict what Midjourney will create ahead of time.”

The ownership of code generated by AI tools like GitHub Copilot is a topic of active debate and legal dispute. Lawsuits have been filed against Microsoft, GitHub and OpenAI alleging that the creation of the AI-powered coding assistant GitHub Copilot relies on "software piracy on an unprecedented scale". The key question in the lawsuits is whether open-source code can be reproduced by AI without its attached licenses.

According to GitHub, the suggestions generated by Copilot and the code you write with its help belong to you, and you are responsible for them. However, there have been instances where Copilot has been found to regurgitate long sections of licensed code without providing credit.

This is a complex issue and the legal landscape is still evolving; consulting a lawyer is recommended for more specific guidance on this topic.

Concerns have also been raised about whether code generated by AI tools like Microsoft's Codex and GitHub Copilot bypasses copyleft licensing terms of use. In some instances, Copilot has generated a substantial amount of GPL'd code and then suggested a non-copyleft license for it.

The question of whether works created by generative AI can be copyrighted is a complex one and the legal landscape around this issue is still evolving. According to the U.S. Copyright Office, there is no copyright protection for works created by non-humans, including machines⁴. However, some argue that AI-generated works should be eligible for copyright protection because they are the product of complex algorithms and programming.

Conclusion: What is "Fair Use"?

Fair use is a legal doctrine that allows for the use of copyrighted material without permission under certain circumstances. It permits a party to use a copyrighted work without the copyright owner’s permission for purposes such as criticism, comment, news reporting, teaching, scholarship, or research.

Four factors must be considered in deciding whether a use constitutes fair use:

  1. The purpose and character of the use: This factor considers whether the use is commercial or non-commercial, and whether the use is transformative. A commercial use is less likely to be fair use, while a non-commercial use is more likely to be. Transformative uses are those that add something new, with a further purpose or different character, and do not substitute for the original use of the work.
  2. The nature of the copyrighted work: This factor considers the nature of the underlying work, specifically whether it is more creative or more factual. Use of a more creative or imaginative underlying work is less likely to support a claim of fair use, while use of a factual work would be more likely to support a fair use claim.
  3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole: This factor considers the amount of the copyrighted work that was used compared to the copyrighted work as a whole. Where the amount used is very small in relation to the copyrighted work, this factor will favor a finding of fair use, but where the amount used is not insignificant, this factor will favor the copyright owner.
  4. The effect of the use upon the potential market for or value of the copyrighted work: This factor considers whether the use would harm the potential market for or value of the copyrighted work.

Comments

  1. Training an AI LLM does not add new capability. Rather, it makes existing knowledge available faster and with less effort.
  2. Code is factual, less artistic and creative. It is limited by the capabilities of languages, APIs and interfaces. Regardless, new code always finds better, faster, more efficient ways to do things. In coding, the art is in the details, and modern interfaces and languages are chosen for their implementations and their creative approaches to solving technical challenges.
  3. AI libraries train on billions of lines of code, digesting entire language libraries and all projects within those libraries. The training is indiscriminate.
  4. A legitimate AI interface to coding would be a highly efficient search tool for finding the existing libraries best suited to a coding challenge. Instead, these engines have effectively used the entire publicly available open source inventory to replace that open source with a more readily available alternative. Commercial AI coding engines are replacing open-source-licensed code for a fee.

For the reasons outlined above, AI engines are not using research samplings of code in order to learn how code works. They grabbed ALL code, and they offer a convenient interface to that code. They offer a way for users to unwittingly bypass license obligations while solving code challenges. For a fee, customers get access to a stolen inventory of code offered by GitHub and Microsoft.

In The News

Notable Mentions

We are amazed at the number of submissions we have received to date, but even more so, we are incredibly grateful to the more than 150 core contributors who have devoted their time and resources to helping us provide up-to-date information. Send your stories and announcements to knowledgebase@airius.com.

The Risk Maturity Knowledgebase restarts an effort that we began in 2007. With hundreds of volunteers, interns and staff members at the time, along with over 60 weekly translations, our predecessor became the standard for GPL and open source security information.

Can you translate the blog? Please reach out.

Ready to Help!

If we can help you with risk management, SOC reporting, or an emergency, or if you just need guidance on INFOSEC or IP issues, please reach out to us.

At Airius, we depend on our friends at A-Lign to provide auditors and experience with the SOC reporting and auditing process. We work closely with companies to get them through it.

Airius and A-Lign

Coming Soon

  • Political choices and reputational risk
  • GDPR
  • Automated risk management
  • Upcoming webinars

Ernest M. Park, Airius, LLC, 2023

License

References and Credits