Generative AI – Will large language models (LLMs) eat themselves?

19 December 2023

Many large language models have been trained on freely available web content, whether obtained by crawling the internet or drawn from more specialised sources such as databases of patents or scientific articles. Until recently it has been reasonably safe to assume that most of this content is essentially human-generated. Such content has therefore served as a useful source of training data for generative AI models, with models such as ChatGPT capable of generating text that may not be immediately and easily distinguishable from human-generated content. Similar generative AI models are increasingly used for producing image and audio content. Newer models may therefore start being trained on data that includes a significant, and increasing, proportion of content generated by another model.

What happens then when such model-generated content is used as training data for other models?

This problem has been investigated for large language models¹ and for generative image models², in both cases leading to a finding that using model-generated content in the training process may rapidly cause ‘irreversible defects’ in the resulting models, with the models effectively losing sight of low-probability events in the original underlying data (an effect that has been termed “model collapse”). This model collapse can then lead to errors in the content that is generated by the model, and to increased model bias, as the training process may serve to reinforce incorrect beliefs about reality.
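The loss of low-probability events can be illustrated with a toy simulation. The sketch below is not the method of either cited paper; it assumes a "model" that is nothing more than the observed token frequencies, and repeatedly retrains it on a finite sample drawn from the previous generation. The token names, probabilities and sample size are hypothetical values chosen for illustration only.

```python
import random
from collections import Counter


def resample_generation(probs, n_samples, rng):
    """Draw n_samples tokens from probs, then re-estimate probs from that sample."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # A token that was never sampled gets probability zero and is dropped.
    return {t: counts[t] / n_samples for t in tokens if counts[t] > 0}


def simulate(probs, generations, n_samples, seed=0):
    """Return the list of distributions, starting with the original one."""
    rng = random.Random(seed)
    history = [dict(probs)]
    for _ in range(generations):
        probs = resample_generation(probs, n_samples, rng)
        history.append(probs)
    return history


# Hypothetical "human-generated" data: two common tokens and eight rare ones.
original = {"common_a": 0.48, "common_b": 0.44}
original.update({f"rare_{i}": 0.01 for i in range(8)})

history = simulate(original, generations=30, n_samples=100)
support_sizes = [len(p) for p in history]
print("distinct tokens per generation:", support_sizes)
```

Once a rare token fails to appear in one generation's sample, its estimated probability is zero and no later generation can produce it again, so the number of distinct tokens can only stay the same or fall. This is a deliberately simplified picture of why the defects are described as irreversible.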

The recent example of xAI’s Grok model producing output that referred to OpenAI’s use case policy has been attributed to this issue, with Grok having been trained on web data from the post-ChatGPT era.

It may therefore become increasingly difficult to train new models by crawling existing publicly available sources of content where such sources may themselves become “polluted” with model-generated content. This issue could perhaps be mitigated by a commitment to sourcing ‘old’ content (e.g. from the Internet Archive), but with the risk that the model then becomes outdated and unable to capture real-world developments. Otherwise, there may be a continued reliance on essentially human-generated content at least for training purposes.
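As a rough sketch of the ‘old’ content approach, a crawled corpus could be filtered by publication date. The record layout (an "id" and a "published" field) and the choice of cutoff are assumptions made for illustration; the cutoff used here is the public release of ChatGPT on 30 November 2022, and in practice publication dates on the web may be missing or unreliable.

```python
from datetime import date

# Assumed cutoff: ChatGPT's public release date.
CUTOFF = date(2022, 11, 30)


def keep_pre_cutoff(documents, cutoff=CUTOFF):
    """Keep only documents with a known publication date before the cutoff."""
    return [
        doc for doc in documents
        if doc.get("published") is not None and doc["published"] < cutoff
    ]


# Hypothetical corpus records.
corpus = [
    {"id": "a", "published": date(2019, 5, 1)},
    {"id": "b", "published": date(2023, 6, 15)},
    {"id": "c", "published": None},  # unknown date: excluded to be safe
]

kept = keep_pre_cutoff(corpus)
print([doc["id"] for doc in kept])
```

Excluding undated documents is a conservative design choice; it reduces the risk of polluted data at the cost of discarding some genuinely human-generated content.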

Dehns Insights

This points to advantages for companies that can address this issue and ensure a reliable stream of new content for training purposes. Techniques for detecting and removing AI-generated content, or otherwise modifying the training process to mitigate such potential model collapse, may therefore become increasingly important and this could be an area that is ripe for innovation. In this respect, note that the European Patent Office already recognises innovation in the area of training as potentially patentable intellectual property, for instance, setting out in the Guidelines for Examination:

Where a classification method serves a technical purpose, the steps of generating the training set and training the classifier may also contribute to the technical character of the invention if they support achieving that technical purpose.

Further, the very recent judgment of the High Court of England and Wales in Emotional Perception AI v Comptroller-General of Patents [2023] EWHC 2948 (Ch)³ suggests that the UK Intellectual Property Office may start to look more favourably on AI inventions (as also discussed by my colleague John Somerton), which should particularly encourage companies to consider filing patent applications in the UK for innovations in this technical field.

Therefore, whilst in many cases it may be considered desirable to try to keep specifics of the training process under wraps, with appropriate trade secret management potentially being employed, consideration should also be given to filing patent applications. For instance, the advantages of patent protection (including possible Patent Box tax relief benefits) may become increasingly attractive for companies that are working in this area. This may be especially so given the potential regulatory push towards increased AI transparency, which may mean that some level of disclosure is required, at least for AI products targeting ‘high-risk’ applications (e.g. as defined in the EU AI Act).

For these reasons, it is recommended that any companies developing AI products speak to their IP attorneys to discuss possible strategies for protecting innovation in this area. In this respect, Dehns’ software team has significant experience with advising on appropriate strategies for protecting AI-related innovation.

¹The Curse of Recursion: Training on Generated Data Makes Models Forget, Shumailov, I. et al, arXiv:2305.17493v2

²Self-Consuming Generative Models Go MAD, Alemohammad, S. et al, arXiv:2307.01850

³Emotional Perception AI Ltd v Comptroller-General of Patents, Designs and Trade Marks [2023] EWHC 2948 (Ch) (21 November 2023) (bailii.org)

Our expert:

Tom Parry

Partner
