Decentralized Provenance Layer for Foundation Models: A Framework for Quantifying and Penalizing Synthetic Data Contamination in Recursive LLM Training

SyedTalib Zaheer Zaidi; Muhammad Zamin Ali Khan; Muhammad Usama Khan; Amad Asif; Khalid Bin Muhammad; Faigha Karim; Ammad Mallick

doi:10.71146/kjmr944

Authors

SyedTalib Zaheer Zaidi, HBL, Karachi, Pakistan
Muhammad Zamin Ali Khan, Department of Computer Science, Iqra University, Karachi, Pakistan
Muhammad Usama Khan, UHF Solutions Pvt Ltd, Karachi, Pakistan
Amad Asif, Graphica Pro Artistry (Australia, Broadmeadows, Victoria)
Khalid Bin Muhammad, COCSE, Ziauddin University, Karachi, Pakistan
Faigha Karim, Department of Computer Science, Iqra University, Karachi, Pakistan
Ammad Mallick, Department of Computer Science, Cardiff Metropolitan University, London, UK

Keywords:

LLMs, Future Generation, Exhausted

Abstract

The rapid development of Large Language Models (LLMs) and other foundation models requires substantial amounts of high-quality, human-generated training data [1]. However, the global supply of original human-generated content, known as the 'internet corpus', is in marked decline [1]. To train successive generations of models, increasing quantities of data are sourced from the internet, and this data often includes outputs generated by existing AI models [1]. "Model collapse" describes the progressive degradation that begins when models are trained on data produced by their predecessors, leaving them less able to capture the true underlying data distribution [1]. This degradation is exacerbated by the loss of long-tail information, which is rare, difficult to obtain, and often sensitive. Consequently, model outputs become increasingly similar to one another, and both diversity and quality fall [1]. The result is impaired performance and, potentially, future AI systems that are unreliable and less effective [1]. Shumailov et al. (2024) demonstrate that model collapse results from continual training on synthetic data [1]. Borji (2024) further examines the phenomenon using distribution fitting and iterative sampling of generated data [2].
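The collapse mechanism described above can be illustrated with a toy simulation. The following is a minimal sketch, not the framework proposed in this paper and not the experimental setup of [1] or [2]: it assumes a one-dimensional Gaussian as a stand-in for a model, refits it each generation to a small sample drawn from the previous fit, and tracks the fitted standard deviation. The function name, parameters, and default values are hypothetical choices made for illustration only.

```python
import random
import statistics


def simulate_recursive_fitting(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples and record the fitted std.

    Toy stand-in for recursive training: every generation is estimated
    only from data produced by the previous generation's fit.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the assumed "human" distribution
    history = [sigma]
    for _ in range(generations):
        # Draw synthetic data from the current fit.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # Refit by maximum likelihood (population std, no bias correction).
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


history = simulate_recursive_fitting()
print(f"std at generation 0: {history[0]:.3g}")
print(f"std at generation {len(history) - 1}: {history[-1]:.3g}")
```

Under these assumptions the fitted standard deviation tends to shrink over generations, so the tails of the original distribution are sampled less and less often. This is a simplified analogue of the loss of long-tail information noted in the abstract; real LLM training involves far more complex distributions and sources of error.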

References

(1) Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Papernot, N.; Anderson, R.; Gal, Y. AI Models Collapse When Trained on Recursively Generated Data. Nature 2024, 631 (8022), 755–759. https://doi.org/10.1038/s41586-024-07566-y.

(2) Borji, A. A Note on Shumailov et al. (2024): ‘AI Models Collapse When Trained on Recursively Generated Data’. arXiv 2024. https://doi.org/10.48550/ARXIV.2410.12954.

(3) Schwartz, E. J.; Cohen, C. F.; Gennari, J. S.; Schwartz, S. M. A Generic Technique for Automatically Finding Defense-Aware Code Reuse Attacks. In Proceedings of the 2020 ACM SIGSAC Conference on Computer and Communications Security; ACM, 2020; pp 1789–1801. https://doi.org/10.1145/3372297.3417234.

(4) Moreau, L. The Foundations for Provenance on the Web. Foundations and Trends® in Web Science 2010, 2 (2–3), 99–241. https://doi.org/10.1561/1800000010.

(5) Simmhan, Y. L.; Plale, B.; Gannon, D. Karma2. International Journal of Web Services Research 2008, 5 (2), 1–22. https://doi.org/10.4018/jwsr.2008040101.

(6) Ye, Q.; Lu, M. S2p: Provenance Research for Stream Processing System. Applied Sciences 2021, 11 (12), 5523. https://doi.org/10.3390/app11125523.

(7) Freitas, A.; Knap, T.; O’Riain, S.; Curry, E. W3P: Building an OPM Based Provenance Model for the Web. Future Generation Computer Systems 2011, 27 (6), 766–774. https://doi.org/10.1016/j.future.2010.10.010.

(8) Werder, K.; Ramesh, B.; Zhang, R. (Sophia). Establishing Data Provenance for Responsible Artificial Intelligence Systems. ACM Trans. Manage. Inf. Syst. 2022, 13 (2), 1–23. https://doi.org/10.1145/3503488.

(9) Gao, Y.; Chen, X.; Du, X. A Big Data Provenance Model for Data Security Supervision Based on PROV-DM Model. IEEE Access 2020, 8, 38742–38752. https://doi.org/10.1109/access.2020.2975820.

(10) Souza, R.; Silva, V.; Camata, J. J.; Coutinho, A. L. G. A.; Valduriez, P.; Mattoso, M. Keeping Track of User Steering Actions in Dynamic Workflows. Future Generation Computer Systems 2019, 99, 624–643. https://doi.org/10.1016/j.future.2019.05.011.

(11) Pan, B.; Stakhanova, N.; Ray, S. Data Provenance in Security and Privacy. ACM Comput. Surv. 2023, 55 (14s), 1–35. https://doi.org/10.1145/3593294.

(12) Huber, S. P.; Zoupanos, S.; Uhrin, M.; Talirz, L.; Kahle, L.; Häuselmann, R.; Gresch, D.; Müller, T.; Yakutovich, A. V.; Andersen, C. W.; Ramirez, F. F.; Adorf, C. S.; Gargiulo, F.; Kumbhar, S.; Passaro, E.; Johnston, C.; Merkys, A.; Cepellotti, A.; Mounet, N.; Marzari, N.; Kozinsky, B.; Pizzi, G. AiiDA 1.0, a Scalable Computational Infrastructure for Automated Reproducible Workflows and Data Provenance. Sci Data 2020, 7 (1). https://doi.org/10.1038/s41597-020-00638-4.

(13) Siddiqui, M. S.; Rahman, A.; Nadeem, A. Secure Data Provenance in IoT Network Using Bloom Filters. Procedia Computer Science 2019, 163, 190–197. https://doi.org/10.1016/j.procs.2019.12.100.

(14) Zipperle, M.; Gottwalt, F.; Chang, E.; Dillon, T. Provenance-Based Intrusion Detection Systems: A Survey. ACM Comput. Surv. 2022, 55 (7), 1–36. https://doi.org/10.1145/3539605.

(15) Mahmood, T.; Jami, S. I.; Shaikh, Z. A.; Mughal, M. H. Toward the Modeling of Data Provenance in Scientific Publications. Computer Standards & Interfaces 2013, 35 (1), 6–29. https://doi.org/10.1016/j.csi.2012.02.004.

(16) Sachan, S.; Liu (Lisa), X. Blockchain-Based Auditing of Legal Decisions Supported by Explainable AI and Generative AI Tools. Engineering Applications of Artificial Intelligence 2024, 129, 107666. https://doi.org/10.1016/j.engappai.2023.107666.

(17) Yiu, N. C. K. Toward Blockchain-Enabled Supply Chain Anti-Counterfeiting and Traceability. Future Internet 2021, 13 (4), 86. https://doi.org/10.3390/fi13040086.

(18) Qiao, L.; Lv, Z. A Blockchain-Based Decentralized Collaborative Learning Model for Reliable Energy Digital Twins. Internet of Things and Cyber-Physical Systems 2023, 3, 45–51. https://doi.org/10.1016/j.iotcps.2023.01.003.

(19) Hajlaoui, R.; Dhahri, S.; Mahfoudhi, S.; Moulahi, T.; Alotibi, G. Protecting Machine Learning Systems Using Blockchain: Solutions, Challenges and Future Prospects. Multimedia Tools and Applications 2024, 84 (20), 22755–22782. https://doi.org/10.1007/s11042-024-19993-0.

(20) Balan, K.; Gilbert, A.; Collomosse, J. PDFed: Privacy-Preserving and Decentralized Asynchronous Federated Learning for Diffusion Models. arXiv 2024. https://doi.org/10.48550/ARXIV.2409.18245.

(21) Shafay, M.; Ahmad, R. W.; Salah, K.; Yaqoob, I.; Jayaraman, R.; Omar, M. Blockchain for Deep Learning: Review and Open Challenges. Cluster Comput 2022, 26 (1), 197–221. https://doi.org/10.1007/s10586-022-03582-7.

(22) Peng, C.; Yang, X.; Smith, K. E.; Yu, Z.; Chen, A.; Bian, J.; Wu, Y. Model Tuning or Prompt Tuning? A Study of Large Language Models for Clinical Concept and Relation Extraction. Journal of Biomedical Informatics 2024, 153, 104630. https://doi.org/10.1016/j.jbi.2024.104630.

(23) Luhtaru, A.; Purason, T.; Vainikko, M.; Del, M.; Fishel, M. To Err Is Human, but Llamas Can Learn It Too. arXiv 2024. https://doi.org/10.48550/ARXIV.2403.05493.

(24) Mischler, G.; Li, Y. A.; Bickel, S.; Mehta, A. D.; Mesgarani, N. Contextual Feature Extraction Hierarchies Converge in Large Language Models and the Brain. Nat Mach Intell 2024, 6 (12), 1467–1477. https://doi.org/10.1038/s42256-024-00925-4.

(25) Jacovi, A.; Marasović, A.; Miller, T.; Goldberg, Y. Formalizing Trust in Artificial Intelligence. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency; ACM, 2021; pp 624–635. https://doi.org/10.1145/3442188.3445923.

(26) Cui, T.; Wang, Y.; Fu, C.; Xiao, Y.; Li, S.; Deng, X.; Liu, Y.; Zhang, Q.; Qiu, Z.; Li, P.; Tan, Z.; Xiong, J.; Kong, X.; Wen, Z.; Xu, K.; Li, Q. Risk Taxonomy, Mitigation, and Assessment Benchmarks of Large Language Model Systems. arXiv 2024. https://doi.org/10.48550/ARXIV.2401.05778.

(27) Sarker, I. H. LLM Potentiality and Awareness: A Position Paper from the Perspective of Trustworthy and Responsible AI Modeling. Discov Artif Intell 2024, 4 (1). https://doi.org/10.1007/s44163-024-00129-0.

(28) Stade, E. C.; Stirman, S. W.; Ungar, L. H.; Boland, C. L.; Schwartz, H. A.; Yaden, D. B.; Sedoc, J.; DeRubeis, R. J.; Willer, R.; Eichstaedt, J. C. Large Language Models Could Change the Future of Behavioral Healthcare: A Proposal for Responsible Development and Evaluation. npj Mental Health Res 2024, 3 (1). https://doi.org/10.1038/s44184-024-00056-z.

(29) Hu, X.; Fu, H.; Wang, J.; Wang, Y.; Li, Z.; Xu, R.; Lu, Y.; Jin, Y.; Pan, L.; Lan, Z. Nova: An Iterative Planning and Search Approach to Enhance Novelty and Diversity of LLM Generated Ideas. arXiv 2024. https://doi.org/10.48550/ARXIV.2410.14255.

(30) Ku, A. Y.; Hool, A. Capabilities and Limitations of AI Large Language Models (LLMs) for Materials Criticality Research. Miner Econ 2024. https://doi.org/10.1007/s13563-024-00478-3.

(31) M Zamin Ali Khan, Hussain Saleem et al. “Application of VLSI in Artificial Intelligence.” IOSR JCE 2012, Vol 6, Issue 2, pp 23–25.

(32) Saim Masood Shaikh, Muhammad Zamin Ali Khan et al. “Navigating Contemporary Challenges of Software Quality Assurance in Software Testing.” Vol 3, Issue 9, pp 45–71, April 2025.

(33) Humera Azam, M. Zamin Ali Khan et al. “Quality Assurance in the Digital Age: Exploring Contemporary Challenges in Software Testing.” Vol 5, Issue 2, pp 9–26, 2025.

(34) Muhammad Zulqarnain Siddiqui, Muhammad Zamin Ali Khan et al. “Analysis of the Effectiveness of Generative AI Models for Text-to-SQL Tasks in Business Intelligence Systems.” Vol 3, Issue 12, pp 1777–1794, Dec 2025.

(35) Hussain Saleem, M Zamin Ali Khan et al. “Towards Identification and Recognition of Trace Associations in Software Requirements Traceability.” Vol 9, Issue 5, pp 257–263, Sep 2012.

(36) Hussain Saleem, M Zamin Ali Khan et al. “Mobile Agents: An Intelligent Multi-Agent System for Mobile Phones.” Vol 6, Issue 2, pp 26–34, Oct 2012.