Many large language models have been trained on freely available web content, whether gathered by crawling the internet or drawn from more specialised sources such as databases of patents or scientific articles. Until recently it has been relatively safe to assume that most of this content is essentially human-generated. Such content has therefore served as a useful source of training data for generative AI models, with models such as ChatGPT now capable of generating text that may not be immediately or easily distinguished from human-written content. Similar generative AI models are increasingly used to produce image and audio content. Newer models may therefore start being trained on content that includes a significant, and increasing, proportion of material that has been generated by another model.
What happens then when such model-generated content is used as training data for other models?
This problem has been investigated for large language models[1] and for generative image models[2], in each case leading to the finding that using model-generated content in the training process may rapidly cause 'irreversible defects' in the resulting models, with the models effectively losing sight of low-probability events in the original underlying data (an effect that has been termed 'model collapse'). Model collapse can then lead to errors in the content generated by the model, and to increased model bias, as the training process may serve to reinforce incorrect beliefs about reality.
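The loss of low-probability events can be illustrated with a deliberately simple toy simulation (this is an illustrative sketch, not the methodology of the cited papers): a trivial 'model' estimates token probabilities by counting, a new corpus is sampled from that model, and the next model is trained only on that sampled corpus. Rare tokens that happen never to be sampled disappear permanently, so the distribution's tail erodes generation by generation.

```python
import random
from collections import Counter

random.seed(42)

# Toy "human" language: 1000 tokens with a long-tailed (Zipf-like) distribution.
vocab = list(range(1000))
weights = [1.0 / (rank + 1) for rank in vocab]

def sample_corpus(population, probs, size=5000):
    """Draw a corpus of `size` tokens from the given distribution."""
    return random.choices(population, weights=probs, k=size)

corpus = sample_corpus(vocab, weights)

support_per_generation = []
for generation in range(20):
    # "Train" a trivial model: estimate token probabilities by counting.
    counts = Counter(corpus)
    tokens = list(counts)
    freqs = [counts[t] for t in tokens]
    support_per_generation.append(len(tokens))
    # The next model is trained only on text generated by this one.
    corpus = sample_corpus(tokens, freqs)

print(support_per_generation)
```

The number of distinct tokens the model can still produce only ever shrinks: once a rare token goes unsampled in one generation, no later model can recover it. Real model collapse is more subtle, but this captures the core mechanism of tail events being lost.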
A recent example is xAI's Grok model outputting references to OpenAI's use case policy, which has been attributed to this issue, Grok having been trained on web data in the post-ChatGPT era.
It may therefore become increasingly difficult to train new models by crawling existing publicly available sources of content, where such sources may themselves become 'polluted' with model-generated content. This issue could perhaps be mitigated by committing to sourcing 'old' content (e.g. from the Internet Archive), but with the risk that the model then becomes outdated and unable to capture real-world developments. Otherwise, there may be a continued reliance on essentially human-generated content, at least for training purposes.
This points to advantages for companies that can address this issue and ensure a reliable stream of new content for training purposes. Techniques for detecting and removing AI-generated content, or for otherwise modifying the training process to mitigate potential model collapse, may therefore become increasingly important, and this could be an area ripe for innovation. In this respect, it is worth noting that the European Patent Office already recognises innovation in the area of training as potentially patentable intellectual property, setting out in its Guidelines for Examination:
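By way of illustration only, one such mitigation is a pre-training filtering step that drops documents a detector flags as likely model-generated. In the sketch below, `ai_likelihood` is a hypothetical stand-in for a real detector (for example a watermark check or a trained classifier); it is not an existing API.

```python
def ai_likelihood(document: str) -> float:
    """Hypothetical detector: return the estimated probability that the
    document is model-generated. A real system would use a watermark
    check or a trained classifier; here we just look for a marker."""
    return 1.0 if "[synthetic]" in document else 0.0

def filter_training_data(documents, threshold=0.5):
    """Keep only documents the detector considers likely human-written."""
    return [d for d in documents if ai_likelihood(d) < threshold]

corpus = ["a human-written article", "[synthetic] model output", "field notes"]
clean = filter_training_data(corpus)
print(clean)
```

The design point is simply that filtering happens before training, so suspect content never enters the training set; the quality of the overall pipeline then rests on the detector, which is where much of the innovation discussed above would lie.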
Where a classification method serves a technical purpose, the steps of generating the training set and training the classifier may also contribute to the technical character of the invention if they support achieving that technical purpose.
Further, the very recent judgment of the High Court of England and Wales in Emotional Perception AI v Comptroller-General of Patents [2023] EWHC 2948 (Ch) suggests that the UK Intellectual Property Office may start to look more favourably on AI inventions (as also discussed by my colleague John Somerton, here), which should particularly encourage companies to consider filing patent applications in the UK for innovations in this technical field.
Therefore, whilst in many cases it may be considered desirable to keep the specifics of the training process under the hood, with appropriate trade secret management potentially being employed, consideration should also be given to filing patent applications. For instance, the advantages of patent protection (including possible Patent Box tax relief benefits) may become increasingly attractive for companies working in this area. This may be especially so given the potential regulatory push towards increased AI transparency, meaning that some level of disclosure may be required, at least for AI products targeting 'high-risk' applications (e.g. as defined in the EU AI Act).
For these reasons, it is recommended that any companies developing AI products speak to their IP attorneys to discuss possible strategies for protecting innovation in this area. In this respect, Dehns’ software team has significant experience with advising on appropriate strategies for protecting AI-related innovation.
[1] Shumailov, I. et al., 'The Curse of Recursion: Training on Generated Data Makes Models Forget', arXiv:2305.17493v2
[2] Alemohammad, S. et al., 'Self-Consuming Generative Models Go MAD', arXiv:2307.01850