Your data is our data, or can data be protected from scraping by artificial intelligence?
- What kind of data is used for AI training?
- Copyrighted Objects
- Non-Personal Data
- What can rights holders do to protect their data from being used by artificial intelligence?
The current surge of interest in artificial intelligence (AI) has brought hopes of optimizing and simplifying routine processes, along with a general panic over the risk of job losses. It has also brought two questions to the forefront: whether data may be used for AI training without the consent of its owner, and whether the owner can protect that data. Below, our lawyers Kamal Tserakhau and Ekaterina Erohovec look at this matter in more detail.
It's no secret that AI training involves the use of publicly available data, including texts, images, videos, and other content. The most well-known text-based AIs, such as ChatGPT, Google Bard, and Claude 2, openly state that they use both open data and licensed datasets for their training.
What kind of data is used for AI training?
Broadly, this data can be divided into three groups:
- Information provided by the users themselves and for which they are the rights holders (prompt content, files attached to the prompt);
- Licensed datasets from third parties (e.g., a set of stock images or an archive of newspaper articles);
- Publicly available information, typically obtained through web scraping, including content that AI crawlers discover on their own.
Regarding the first group of data, the primary concern lies in the collection and processing of personal data, as well as its transfer, especially when dealing with products that use third-party AI via APIs. An illustrative example is image processing applications where users upload their photos, and the application processes them using connected AI models like DALL-E or Stable Diffusion.
The use of the second group of data pertains to the processing of intellectual property objects, typically copyrighted materials such as texts, images, and music. The legality of training AI on such data is currently a subject of legal debate, with initial results emerging, but no definitive regulation in place yet.
As for issues related to the processing of personal data and the use of copyrighted objects for AI training, we will provide more detailed information in our upcoming materials.
The most complex question concerns the third group of data, which encompasses a wide range of data categories. This group can include intellectual property objects, personal data, and non-personal data. The latter category covers information that is neither personal data nor an intellectual property object. Examples include the content of a cooking recipe website, technical information, public domain objects, and historical facts. In contrast to the first two categories of data, which have applicable legal frameworks, non-personal data is currently regulated in a fragmented manner, although attempts are being made in the EU to establish a common regulation.
How legal is it to use someone else's data for AI training without the consent of the rights holder?
Copyrighted Objects
The legality of such use remains an open question. Here are the key arguments that currently support the legality of training AI on others' data:
- AI training falls under the text and data mining exception. The essence of this exception is that it permits the use of copyrighted works "for the automated analysis of text and data in digital form to extract information, including but not limited to patterns, trends, and correlations." This rule is already in effect in the EU (Art. 2(2), Art. 3, Art. 4 DSM, i.e. Directive (EU) 2019/790) and allows the use of open data without the rights holder's permission. The text and data mining exception applies to private companies, research organizations, and cultural heritage institutions.
- The use of copyrighted works in the AI training process is considered a temporary act that forms part of a technological process, which is covered by an exception (Art. 5.1 InfoSoc, i.e. Directive 2001/29/EC) and allows data usage without the rights holder's consent. The argument is that copyrighted works are not copied for AI training but are used only briefly to learn patterns and regularities, so this exception also applies.
- AI training is conducted for scientific and research purposes, which is covered by the relevant exception (Art. 5.3 InfoSoc).
It's worth noting that the question of the applicability of the above-mentioned exceptions to AI remains open, and there are opposing opinions.
Non-Personal Data
Regarding non-personal data, the situation is somewhat more complex due to the lack of direct regulation. In certain cases, a separate legal regime may apply to such data, namely the rights in databases or datasets. This means that individual data items are not protected separately but may be protected as a whole. A database is protected by copyright when the selection or arrangement of its contents reflects the author's free and creative choices (Art. 3.1 Directive 96/9/EC, Case Football Dataco C‑403/08 and C‑429/08). For example, a database of films containing annotations, reviews, information about actors, and so on may be protected by copyright.
However, for the average IT company, protection as a dataset is likely to be more applicable. This protection arises when a substantial investment was made in obtaining, verifying, or presenting the contents of the dataset. In such cases, the maker has the right to prohibit the unauthorized extraction or re-utilization of all or a substantial part of the dataset (Art. 7.1 Directive 96/9/EC).
What can rights holders do to protect their data from being used by artificial intelligence?
Website owners have the option to regulate the use of their data or the data of their users on the platform through the terms of service or user agreement. For example, Reddit has taken this approach with the following provision:
"Except as expressly permitted by this section, no other rights or licenses are granted or implied, including any right to use User Content for other purposes, such as for training a machine learning or AI model, without the express permission of rightsholders in the applicable User Content."
Similarly, The New York Times has made it clear that their data cannot be used for AI training:
"Non-commercial use does not include the use of Content without prior written consent from The New York Times Company in connection with: (1) the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system; or (2) providing archived or cached data sets containing Content to another person or entity."
Establishing such limitations is permissible (Case Ryanair C‑30/14), but they do not bind everyone. For instance, a website owner can restrict the aforementioned text and data mining exception only with regard to private companies, as research organizations and cultural heritage institutions retain their mining rights regardless (Art. 7 DSM).
The most critical question is a technical one. Suppose the terms of service include a clause prohibiting the use of data for AI training. It remains uncertain whether a crawler that copies this information will recognize that data from this particular website cannot be used. There is no definitive answer to this question. For example, Google recommends including instructions in the robots.txt file, but whether such instructions will work with other AI systems is unclear.
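To make the robots.txt approach more concrete, here is a minimal sketch in Python of how a well-behaved crawler could check such instructions before collecting a page. It uses the standard library module `urllib.robotparser`. The robots.txt content, the URL, the helper name `allowed_agents`, and the choice of user-agent tokens (`Google-Extended`, `GPTBot`, and a made-up `SomeOtherBot`) are assumptions for illustration only. A site owner should check each AI provider's own documentation for the tokens it actually honors.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a site owner might publish to opt out
# of AI training crawlers while leaving other crawlers unrestricted.
ROBOTS_TXT = """\
User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Return, for each user-agent token, whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network request is made in this sketch.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    ["Google-Extended", "GPTBot", "SomeOtherBot"],
    "https://example.com/recipes/1",  # hypothetical page
)
for agent, allowed in result.items():
    print(f"{agent}: {'allowed' if allowed else 'disallowed'}")
```

Note that this check only works if the crawler chooses to run it. The robots.txt file expresses the owner's wishes but does not technically block access, which is why the effect on AI systems that do not honor it remains unclear.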
Therefore, the question of the permissibility of collecting and processing publicly available data from a website remains open. Nevertheless, website owners can, within certain limits, restrict the use of their data by AI.