When Machines Pollute Knowledge: Legal Implications of AI Data Contamination in the Indian Context


– Anup Koushik Karavadi

Introduction

The exponential rise of generative artificial intelligence has redefined how digital information is created and consumed. However, this rapid advancement has also contaminated the global data environment. Large language models and other generative systems increasingly train on data drawn from the internet, much of which already contains synthetic AI outputs. This recursive process produces what researchers term “model collapse,” where models progressively lose coherence, creativity, and factual integrity over successive training cycles.1 The result is an erosion of trust in digital content and a deepening crisis of authenticity within online ecosystems.

Uncontaminated human-generated data, particularly data collected before 2022, is now treated as a scarce resource. Those who already possess large repositories of such data enjoy a structural advantage, as it becomes nearly impossible for new developers to replicate or purchase comparable datasets.2 This imbalance strengthens the position of a few dominant technology firms and simultaneously constrains innovation and competition in the artificial intelligence sector.3

In India, this issue raises two intertwined legal concerns. The first relates to copyright, where the unconsented use of human-authored works for training artificial intelligence challenges existing interpretations of authorship and originality under the Copyright Act, 1957.4 The second arises under competition law, where the control of uncontaminated data by a small group of entities may constitute an abuse of dominance prohibited by section 4 of the Competition Act, 2002.5

The problem of data contamination is thus not merely technological. It strikes at the foundations of informational reliability, creative ownership, and market fairness, demanding urgent legal and regulatory attention. This article examines these contemporary issues and considers policy measures that could minimize their negative implications.

Data Contamination and Model Collapse

Artificial intelligence systems rely on the assumption that the data used to train them reflects authentic human expression. When this assumption fails, the entire architecture of generative learning begins to deteriorate. Data contamination occurs when synthetic or machine-generated content is mixed with genuine human-created material within training datasets.6 The problem intensifies as generative systems continuously scrape data from the internet, absorbing their own prior outputs without differentiation. This self-referential cycle gradually distorts the representational quality of models, leading to “model collapse,” a condition where the algorithm loses linguistic diversity, factual grounding, and creative variability.7

Empirical research shows that even minimal inclusion of synthetic data can have significant cumulative effects on model integrity.8 Each iteration of training amplifies uniformity, filtering out the rare or atypical patterns that previously contributed to the richness of human language. Over time, such feedback loops diminish the model’s cognitive elasticity and its ability to approximate nuanced human reasoning. This degeneration is not merely a technical malfunction but a manifestation of a larger epistemic risk, as the digital sphere becomes flooded with machine-recycled content that lacks originality or accountability.
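The feedback loop described above can be illustrated with a deliberately simplified simulation (a toy model constructed for this article, not a description of any production system): a “generative model” that merely estimates the mean and spread of its training data and then emits synthetic samples from that estimate. When each generation trains exclusively on the previous generation’s output, estimation error compounds and the distribution’s diversity collapses, mirroring the loss of linguistic variability documented in the research.

```python
import random
import statistics

def train_generation(data, sample_size):
    """A toy 'generative model': fit a normal distribution to the
    training data, then emit synthetic samples from the fitted model."""
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    return [random.gauss(mu, sigma) for _ in range(sample_size)]

random.seed(42)

# Generation 0: authentic "human" data with wide, rich variation.
data = [random.gauss(0, 10) for _ in range(20)]
spreads = [statistics.stdev(data)]

# Each later generation trains only on its predecessor's synthetic
# output: sampling error compounds and diversity is progressively lost.
for _ in range(1000):
    data = train_generation(data, 20)
    spreads.append(statistics.stdev(data))

print(f"spread of generation 0:    {spreads[0]:.3f}")
print(f"spread of generation 1000: {spreads[-1]:.6f}")
```

The simulation shows the spread of the data shrinking dramatically across generations: the rare, atypical values that defined the original distribution are filtered out, just as rare linguistic patterns are filtered out of recursively trained language models.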

The consequences extend beyond computational science. The contamination of informational ecosystems undermines social trust and the authenticity of knowledge itself. As synthetic content becomes indistinguishable from human expression, it becomes increasingly difficult to verify the credibility of sources.9 This shift transforms information from a public good into a private asset controlled by a few entities that possess access to verified, uncontaminated datasets. Such concentration of informational capital creates a dual economy of knowledge: one based on authentic, high-quality data available only to dominant corporations, and another built on contaminated material accessible to smaller actors and the general public.10

This bifurcation has clear implications for innovation and equity. The inability of emerging developers to access pure datasets restricts their capacity to produce reliable models, thus reinforcing existing market hierarchies.11 Over time, the informational advantages of incumbents harden into structural monopolies that are self-perpetuating. The companies that first accumulated large human-generated datasets before 2022 now hold a form of digital capital that cannot be recreated. Such asymmetry not only restricts market competition but also shapes the direction of technological development itself, as innovation becomes conditioned by the datasets owned by a few.12

In response, researchers have proposed technical mitigation mechanisms, such as watermarking of AI-generated content and the use of classifiers to identify synthetic data.13 However, these solutions remain partial and increasingly ineffective as generative models grow more sophisticated. The long-term sustainability of AI therefore depends not merely on better engineering but on legal and ethical governance of data provenance and accessibility. Ensuring the availability of human-generated, uncontaminated data must become a regulatory priority before informational degradation becomes irreversible.14

Copyright Implications in India

The intersection of artificial intelligence and copyright law has become one of the most contested domains in contemporary legal discourse. In India, the Copyright Act, 1957, built upon the foundations of human creativity and originality, is facing structural challenges due to the proliferation of AI-generated content and the contamination of human-authored data. The legal framework was conceived in an era where authorship presupposed human agency. Artificial intelligence, however, subverts this premise by creating outputs derived from massive datasets that blend human creativity with algorithmic reproduction.15

The core difficulty arises from the nature of training data. When an AI system learns from human-authored works without authorization, the process may constitute an act of reproduction within the meaning of section 14(a) of the Copyright Act.16 This raises questions about whether such unconsented use can be justified under the fair dealing exceptions enumerated in section 52. While Indian jurisprudence on fair use has been traditionally liberal, courts have consistently required that the use be transformative and non-substitutive.17 AI training, by contrast, does not engage with the expressive purpose of the original work but absorbs it statistically, converting human creativity into algorithmic probability. Such use, lacking direct transformative intent, arguably exceeds the scope of fair dealing.18

A further complication concerns the originality of AI-generated content. Indian courts, particularly in Eastern Book Co. v. D.B. Modak, adopted the “modicum of creativity” standard, aligning with the U.S. precedent in Feist Publications v. Rural Telephone Service Co.19 Under this test, originality requires the exercise of human skill and judgment beyond mere mechanical labor. AI-generated works, being products of automated computation, cannot meet this threshold unless human authorship is demonstrably involved in the creative process.20 Thus, AI outputs lack independent copyright protection, and their use may, paradoxically, infringe the very rights from which they were derived.

Data contamination deepens these ambiguities. When synthetic outputs mix with human-authored material, it becomes nearly impossible to trace the provenance of creative input.21 This not only frustrates enforcement but also undermines the moral rights of authors under section 57 of the Act, which preserve the right of attribution and integrity. If AI-generated works mimic or alter existing human works without disclosure, they risk violating these rights by distorting original expression. Moreover, when contaminated data re-enters the creative ecosystem through derivative AI training, it perpetuates a cycle of untraceable infringements, eroding the accountability mechanisms that copyright law depends upon.22

From a policy perspective, the Indian copyright regime must evolve toward greater transparency in AI data use. One possible reform is mandating disclosure of training datasets for models deployed commercially, akin to provenance requirements under the Digital Personal Data Protection Act, 2023.23 Another is the recognition of a sui generis data right that distinguishes between human-generated and machine-generated data, ensuring compensation to human creators whose works contribute to model training. Such a framework would align copyright principles with the realities of the data economy, safeguarding both innovation and authorship integrity.

Competition Law Angle

The scarcity of uncontaminated, human-generated data has transformed information itself into an essential economic input. In the artificial intelligence market, access to clean datasets determines the capacity to innovate and compete. This reality gives rise to significant antitrust implications under the Competition Act, 2002, particularly with respect to abuse of dominance and the essential facilities doctrine.24

Under section 4 of the Act, a dominant enterprise is prohibited from engaging in conduct that restricts market access or exploits its position to the detriment of competition.25 When a small number of entities control indispensable data resources, they effectively regulate the entry of new developers into the AI market. The ability to access or withhold uncontaminated datasets thus functions as a form of economic gatekeeping.26 This resembles the essential facilities principle developed in antitrust jurisprudence, where denial of access to a critical input by a dominant firm can amount to abuse.

Although Indian competition law does not explicitly codify the essential facilities doctrine, its elements have been acknowledged by the Competition Commission of India in decisions such as Arshiya Rail Infrastructure Ltd. v. Ministry of Railways, where denial of access to necessary terminal infrastructure was found potentially abusive.27 The doctrine was first articulated in United States v. Terminal Railroad Association, where control over railway facilities by a few companies was deemed anti-competitive.28 Analogously, control over verified pre-2022 datasets by major AI firms such as OpenAI or Alphabet can create similar structural dependencies in the digital economy.

A comparable precedent can be drawn from Telefonica SA v. Commission, where the European Court of Justice affirmed that restricting access to indispensable network facilities amounted to an abuse of dominance.29 Translating this principle to data governance, uncontaminated human-generated data qualifies as a facility essential for market participation. Without it, competitors cannot develop reliable AI systems capable of matching incumbents’ quality or accuracy. The competitive advantage derived from exclusive control of such data is not based on efficiency but on historical possession, which is contrary to the objectives of competition law that favor innovation and consumer welfare.30

The issue becomes more acute when viewed through the lens of the Indian digital market. Dominant platforms already integrate vertically across several technological layers, from cloud infrastructure to generative model deployment.31 Their ability to combine computing power, proprietary data, and user feedback consolidates multi-market dominance that new entrants cannot challenge. In Fx Enterprise Solutions v. Hyundai Motor India Ltd., the Supreme Court underscored that abuse of dominance extends beyond direct pricing or exclusionary tactics to conduct that forecloses market access.32 By analogy, refusal to share or license clean datasets, or imposing discriminatory conditions for access, can amount to a similar foreclosure in the AI sector.

To mitigate such risks, the CCI may consider framing a data-access remedy modelled on section 20(1a) of the German Act Against Restraints of Competition, which allows intervention even without proof of dominance when an enterprise’s dependence on another’s data prevents it from operating competitively.33 A similar approach in India could empower the regulator to compel fair and non-discriminatory data sharing where informational asymmetries threaten innovation. The introduction of such a mechanism would harmonize competition enforcement with India’s broader digital governance strategy under the Digital India framework.

However, caution is necessary. Mandating access to proprietary datasets must balance the right to competition with the protection of trade secrets and data privacy under the Digital Personal Data Protection Act, 2023.34 The solution may lie in federated learning frameworks, where model developers can train algorithms on another’s data without directly accessing it. This method preserves privacy while reducing entry barriers. In the long term, transparent data-sharing obligations combined with technical safeguards could ensure that uncontaminated human-generated data remains a shared infrastructural resource rather than a monopolized asset.
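The federated approach described above can be sketched in miniature (an illustrative FedAvg-style scheme on a trivial linear model, with hypothetical names; real deployments add encryption, secure aggregation, and far larger models): each firm trains on its own private records and shares only learned parameters, which a coordinating server averages into a common model.

```python
import random

def local_update(weights, data, lr=0.01, epochs=5):
    """One firm's private training pass for a linear model y = w*x + b.
    The raw dataset never leaves this function; only the updated
    parameters are returned to the coordinating server."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def federated_round(global_weights, client_datasets):
    """Server-side step: average the locally trained parameters (FedAvg).
    The server sees model weights, never the underlying records."""
    updates = [local_update(global_weights, d) for d in client_datasets]
    avg_w = sum(u[0] for u in updates) / len(updates)
    avg_b = sum(u[1] for u in updates) / len(updates)
    return avg_w, avg_b

def make_private_dataset(n):
    """A firm's confidential data, drawn from the same rule y = 2x + 1."""
    out = []
    for _ in range(n):
        x = random.uniform(-1, 1)
        out.append((x, 2 * x + 1))
    return out

random.seed(7)
# Three firms hold separate private datasets.
clients = [make_private_dataset(50) for _ in range(3)]

weights = (0.0, 0.0)
for _ in range(30):
    weights = federated_round(weights, clients)

print(f"jointly learned model: w={weights[0]:.3f}, b={weights[1]:.3f} "
      f"(true rule: w=2, b=1)")
```

The jointly trained model converges on the shared underlying pattern even though no participant ever discloses its data, which is precisely the property that makes federated frameworks attractive for reconciling data access with trade-secret and privacy protection.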

Conclusion

The contamination of digital data by artificial intelligence represents a turning point in the governance of technology. What began as a question of machine accuracy has evolved into a broader legal and policy dilemma involving intellectual property, competition, and the preservation of informational integrity. The erosion of uncontaminated human-generated data threatens not only the technical functionality of AI systems but also the legal structures that depend upon human creativity and fair market access.35

The challenge is dual in nature. On one hand, copyright law struggles to reconcile human authorship with algorithmic derivation, exposing the limits of traditional concepts such as originality and moral rights.36 On the other, competition law confronts a structural imbalance where a handful of corporations, by virtue of possessing pre-2022 human-generated datasets, hold a permanent advantage over potential entrants.37 Together, these dynamics risk entrenching a form of informational monopoly that stifles innovation and weakens the democratic ideal of open knowledge.

Regulatory frameworks must therefore move toward recognizing the right to uncontaminated data as an essential component of digital governance. This right would not confer ownership over data in the proprietary sense but would safeguard the accessibility and integrity of human-generated content as a shared societal resource.38 Achieving this requires a combination of approaches: transparent provenance requirements for AI training datasets, federated learning mechanisms that preserve privacy, and fair data-sharing obligations that prevent monopolistic control.39

In the Indian context, both the Copyright Act, 1957 and the Competition Act, 2002 must evolve in coordination. The former should mandate disclosure and accountability in AI training processes, while the latter should develop tools to prevent data concentration and enable equitable access. The goal is not to halt technological progress but to ensure that innovation does not devour its own foundation. A sustainable AI future depends on preserving the authenticity of human creativity and the openness of digital markets before the erosion of uncontaminated knowledge becomes irreversible.40

References

  1. Shumailov et al., AI Models Collapse When Trained on Recursively Generated Data, 631 Nature 755 (2024).
  2. John Burden et al., Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training, Cambridge Legal Studies Research Paper No. 35/2024, at 3–5.
  3. Id. at 9–10.
  4. Norton Rose Fulbright, The Interaction Between Intellectual Property Laws and AI: Opportunities and Challenges (2024).
  5. Competition Act, No. 12 of 2003, § 4 (India); see also Harvard JOLT Digest, Model Collapse and the Right to Uncontaminated Human-Generated Data (2024).
  6. John Burden et al., Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training, Cambridge Legal Studies Research Paper No. 35/2024, at 3.
  7. Shumailov et al., AI Models Collapse When Trained on Recursively Generated Data, 631 Nature 755, 756 (2024).
  8. Dohmatob et al., Strong Model Collapse, arXiv preprint arXiv:2410.04840 (2024).
  9. Vaccari & Chadwick, Deepfakes and Disinformation: Exploring the Impact of Synthetic Political Video on Deception, Uncertainty, and Trust in News, 6 Soc. Media + Soc’y 2056305120903408 (2020).
  10. Burden et al., supra note 6, at 9–10.
  11. Id. at 15.
  12. Id. at 16–17.
  13. Hataya, Bao & Arai, Will Large-Scale Generative Models Corrupt Future Datasets?, Proc. IEEE/CVF Int’l Conf. on Computer Vision 20555 (2023).
  14. Harvard JOLT Digest, Model Collapse and the Right to Uncontaminated Human-Generated Data (2024).
  15. Norton Rose Fulbright, The Interaction Between Intellectual Property Laws and AI: Opportunities and Challenges (2024).
  16. Copyright Act, No. 14 of 1957, § 14(a) (India).
  17. See Civic Chandran v. Ammini Amma, 1996 (16) PTC 329 (Ker.).
  18. IIPRD, Artificial Intelligence and Its Negative Impact on Intellectual Property: A Basic Guide (2023).
  19. Eastern Book Co. v. D.B. Modak, (2008) 1 SCC 1; see also Feist Publ’ns, Inc. v. Rural Tel. Serv. Co., 499 U.S. 340 (1991).
  20. IIPRD, supra note 18.
  21. John Burden et al., Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training, Cambridge Legal Studies Research Paper No. 35/2024, at 7–8.
  22. Id. at 12.
  23. Digital Personal Data Protection Act, No. 22 of 2023, § 8 (India).
  24. Competition Act, No. 12 of 2003, § 4 (India).
  25. Id. § 4(2)(c).
  26. John Burden et al., Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training, Cambridge Legal Studies Research Paper No. 35/2024, at 15–17.
  27. Arshiya Rail Infrastructure Ltd. v. Ministry of Railways, Case No. 64/2014, Competition Comm’n of India (2016).
  28. United States v. Terminal R.R. Ass’n of St. Louis, 224 U.S. 383 (1912).
  29. Telefonica SA v. Comm’n, Case C-295/12, ECLI:EU:C:2014:2062.
  30. Burden et al., supra note 26, at 19–20.
  31. Id. at 21.
  32. Fx Enterprise Solutions India Pvt. Ltd. v. Hyundai Motor India Ltd., (2018) 6 SCC 71.
  33. Gesetz gegen Wettbewerbsbeschränkungen [Act Against Restraints of Competition], § 20(1a) (Ger.); see also Bundeskartellamt, Case B9-144/19 (2023).
  34. Digital Personal Data Protection Act, No. 22 of 2023, § 8 (India).
  35. John Burden et al., Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training, Cambridge Legal Studies Research Paper No. 35/2024, at 32.
  36. Eastern Book Co. v. D.B. Modak, (2008) 1 SCC 1.
  37. Fx Enterprise Solutions India Pvt. Ltd. v. Hyundai Motor India Ltd., (2018) 6 SCC 71.
  38. Harvard JOLT Digest, Model Collapse and the Right to Uncontaminated Human-Generated Data (2024).
  39. Norton Rose Fulbright, The Interaction Between Intellectual Property Laws and AI: Opportunities and Challenges (2024).
  40. Burden et al., supra note 2, at 35.

