OpenAI's AI Models Under Fire: Are They Using Copyrighted Data?
OpenAI is once again in the spotlight, but this time not for its innovations in AI. A new study from the AI Disclosures Project raises serious questions about the data used to train its large language models, particularly GPT-4o. The findings suggest that OpenAI may have trained on copyrighted O’Reilly Media books without authorization, raising concerns about copyright violations and ethical practice in AI development.
The AI Disclosures Project, led by figures including technologist Tim O’Reilly and economist Ilan Strauss, is committed to tackling the potential societal harms of AI commercialization. Their recent paper emphasizes the critical need for transparency within the AI sector, drawing parallels with the financial disclosure standards that underpin healthy markets.
Diving deeper into the methodology, the study analyzed a legally sourced set of 34 copyrighted O’Reilly Media books to determine whether the models had been trained on these texts without consent. The researchers applied a technique known as the “DE-COP membership inference attack,” which tests whether a model can reliably distinguish verbatim human-authored passages from paraphrased versions of them; above-chance performance suggests the original text appeared in the model’s training data.
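The core idea can be illustrated as a multiple-choice quiz: show the model a verbatim excerpt alongside several paraphrases and count how often it picks the original. The sketch below is a simplified illustration of that idea, not the study's actual harness; the `model` and `paraphrase_fn` callables are hypothetical stand-ins for a real LLM API and paraphrasing step.

```python
import random

def decop_quiz(model, excerpts, paraphrase_fn, n_choices=4):
    """Run a DE-COP-style multiple-choice membership test.

    For each verbatim excerpt, present it shuffled among
    n_choices - 1 paraphrases and record whether the model
    identifies the original. Accuracy near 1 / n_choices
    (chance) suggests the text was likely not memorized.
    """
    hits = 0
    for excerpt in excerpts:
        options = [excerpt] + [paraphrase_fn(excerpt)
                               for _ in range(n_choices - 1)]
        random.shuffle(options)
        # The model returns the index it believes is verbatim.
        answer = model(options)
        hits += options[answer] == excerpt
    return hits / len(excerpts)
```

A model that has memorized the books would score near 1.0 on such a quiz, while a model that never saw them should hover near 1 / n_choices.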
What did the researchers find? Here are some eye-opening highlights:
- GPT-4o demonstrated a “strong recognition” of the paywalled content from O’Reilly’s catalog, scoring an AUROC of 82%. By contrast, its predecessor, GPT-3.5 Turbo, barely surpassed the 50% mark.
- GPT-4o recognized non-public O’Reilly material more reliably than publicly available samples (82% versus 64% AUROC).
- Interestingly, GPT-3.5 Turbo showed a tendency to recognize more publicly accessible content (64%) than non-public works (54%).
- A smaller model, GPT-4o Mini, showed no recognition of either public or non-public O’Reilly data, scoring around 50% (chance level) on both.
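The AUROC figures quoted above measure how well a model's scores separate "seen" from "unseen" texts: 100% means perfect separation, while 50% is no better than a coin flip. A minimal sketch of the metric, using illustrative scores rather than any data from the study:

```python
def auroc(pos_scores, neg_scores):
    """Area under the ROC curve, computed directly as the
    probability that a randomly chosen positive (in-training)
    sample scores higher than a negative one; ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

For example, `auroc([0.9, 0.8], [0.1, 0.2])` returns 1.0 (perfect separation), while identical score distributions return 0.5, which is why GPT-3.5 Turbo's near-50% result reads as little evidence of memorization.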
It’s worth noting that the study highlighted possible access violations linked to the LibGen database: all of the books used in the research also appear in that repository. The findings also suggest that newer LLMs are getting better at differentiating human-written from machine-generated text, which the researchers argue does not diminish the overall challenge of assessing data provenance.
One fascinating aspect mentioned is "temporal bias," referring to the evolution of language and its impact on results over time. To combat this, the researchers evaluated models (GPT-4o and GPT-4o Mini) that were trained on data dating from the same period.
The implications of these findings extend beyond just OpenAI and O’Reilly Media. They signal a broader concern about how copyrighted data is being used in AI training, suggesting that using such data without adequate compensation could jeopardize the quality and diversity of content available on the internet. Given that financially supporting professional content creators is critical, the project underscores the urgent need for accountability.
Enhancing corporate transparency around data usage is key, they argue, recommending that liability measures be adopted to encourage companies to disclose information regarding their training data sources. The EU AI Act includes disclosure requirements that, if enforced correctly, could establish a beneficial feedback loop for data disclosure standards.
Interestingly, while there’s strong evidence suggesting that AI firms may be tapping into unauthorized data for model training, a market is simultaneously emerging where developers legally acquire content via licensing deals. Companies like Defined.ai are paving the way by securing consent from data providers and stripping out identifiable information.
In conclusion, the study raises serious questions about training practices at OpenAI, signalling that a collaborative effort is needed to ensure AI models are built on sound ethical foundations and respect copyright law.
(Image by Sergei Tokmakov)