The Failures of Algorithm Transparency with Deep Learning & Data Mining

Cable Network Photo. 2019. Taylor Vick on Unsplash. Unsplash license.

Algorithm Culture 

The worldwide integration of online services and social media platforms has led to new forms of personal exploitation and surveillance. As a result, there is a growing sentiment that users do not retain ownership of their data. Whether aware of it or not, users are willingly handing over their personal information to software companies and advertisers in order to gain access to these platforms. This commodification of user data raises serious concerns for online user privacy. Furthermore, previously unexplored threats emerge with recent advancements in machine learning and AI-based data mining techniques. These concerns suggest a worrying future for the Internet, one far from the liberties of the early ‘free and open’ Internet.

Murray (2021) brings attention to the paradigm shift Internet users underwent after the emergence of Web 2.0: as user-generated content gained prevalence, interacting with algorithms became a ubiquitous part of everyday life, and users grew accustomed to it.

Murray states that this shift towards AI-mediated content has altered the power structure of the Internet. While Web 2.0 brought new power to users by providing widely accessible content creation tools, it also spawned new concerns over control and privacy. As users become more reliant on these platforms for empowerment, they often leave themselves ripe for exploitation.

The Adoption of Deep Learning with Big Data

Deep learning and neural networks existed long before the emergence of “Big Data” corporations. The first demonstration of a neural network reading hand-written digits was given in 1989 by Yann LeCun. He illustrated a practical use of “backpropagation”, a technique still used today in data mining operations (Foote, 2022).

Recording of Yann LeCun’s demonstration in 1989 at Bell Labs, New Jersey.
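The core idea of backpropagation is small enough to sketch by hand. The toy example below is an assumption for illustration only (it is nothing like LeCun’s 1989 digit-reading network): a single sigmoid neuron learns the OR function by following the chain rule, the same gradient mechanism that scales up to deep networks.

```python
import math
import random

random.seed(0)

# Training data for the OR gate: ([input pair], target output)
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]

w = [random.uniform(-1, 1) for _ in range(2)]  # learnable weights
b = 0.0                                        # learnable bias
lr = 0.5                                       # learning rate

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

for epoch in range(5000):
    for x, target in data:
        # Forward pass: weighted sum, then sigmoid activation
        y = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        # Backward pass: gradient of squared error w.r.t. the pre-activation,
        # obtained via the chain rule -- the essence of backpropagation
        grad_z = (y - target) * y * (1 - y)
        w[0] -= lr * grad_z * x[0]
        w[1] -= lr * grad_z * x[1]
        b -= lr * grad_z

preds = [round(sigmoid(w[0] * x[0] + w[1] * x[1] + b)) for x, _ in data]
print(preds)  # [0, 1, 1, 1] -- the neuron has learned OR
```

A deep network repeats exactly this update across many stacked layers, which is what makes the resulting weights so hard for a human to interpret.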

In the era of Big Data and profiled consumer advertising, neural networks offer corporations powerful prospects for maximising the value extracted from their users. Neural networks also provide versatility: unstructured text, images and audio can be understood in ways traditional procedural algorithms fail to achieve without human mediation (IBM, n.d.).

Opaque Nature of Deep Neural Networks

Neural networks often lack the transparency of traditional data mining techniques. Traditionally, researchers and auditors are able to simply inspect the source code of an algorithm. While this is still possible with neural networks, predictions are determined by the individual weights and biases inside the trained network’s layers. A deep neural network (DNN) typically relies on complex layers of potentially millions of mathematical transformations to generate results. As a result, attempting to reverse engineer a DNN with traditional inspection techniques would be futile (Thampi, 2022).

Due to this lack of transparency, deep neural networks are referred to as ‘black boxes’: the underlying calculations the algorithm uses to determine a result are incomprehensible to a human inspector.

This notion is fundamentally problematic. Individuals have voiced concern about their personal privacy when their data is used to train a language model. If a user’s personal information is used to train a neural network, to what extent does their data persist in, and surface through, the outputs of AI models?

GitHub Copilot Discourse

One example of the concerns surrounding an AI model’s lack of transparency is GitHub’s Copilot service, offered to developers as a generative tool that assists in writing code using AI text generation. When users sign up to GitHub, a free platform for sharing and contributing code, they agree that their code may be used to train AI models like GitHub Copilot.

Around the public release of Copilot, discourse erupted when the software community discovered that a majority of the code used to train the model was licensed open-source code found on GitHub (Barber, 2021). This was most evident in a specific case where Copilot reproduced a user’s copyrighted code with no licensing or attribution.

Tim Davis, a professor at Texas A&M University, discovered that his own copyrighted code could be generated, even when configuring Copilot not to use public code sources.
Screenshot. GitHub Copilot’s FAQ states that the current model may reproduce code from the training set approximately 1% of the time.

The existence of GitHub Copilot presents a dilemma for developers. The tool itself is powerful: it can quickly generate code snippets, considerably automating the process of software development. In the future, tools like GitHub Copilot may become indispensable for developers seeking to maintain pace with industry standards.

But the tool carries considerable risks. Due to the nature of neural networks, the model lacks transparency, and the possibility of accidentally breaching copyright or plagiarising existing free and open-source code is non-zero. Supporting such a tool could be seen as an attack on the open-source community.

Furthermore, Copilot collects telemetry data on customers who use the product. This data could be used to further train the model and thus further erode copyright and privacy protections.

Ethics of Data Anonymisation & Consent

One counterargument is that the commodification and exploitation of user data are not necessarily synonymous. User data is inherently valuable because of its personal nature, so the more identifiable the training data is, the more effective it is at training a model. The application of deep learning techniques to medical research poses an interesting ethical dilemma: to what extent should the public concede private medical information for the benefit of medical research?

Bak et al. (2022) argue that the personally identifiable information found in medical records would be instrumental for training a safe and accurate AI, suitable for application in real-world medical scenarios.

Anonymisation techniques that remove personal but insightful medical information could undermine the performance of an AI model. In the worst case, this could adversely harm patients by producing insufficiently informed medical decisions.
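The trade-off can be made concrete with a deliberately artificial sketch. The data below is invented (no real patients, and the outcome is constructed to depend entirely on age); it only illustrates the mechanism: when a predictive but quasi-identifying attribute is stripped out, a model can do no better than guessing the base rate.

```python
import random

random.seed(1)

# Synthetic records: outcome is tied to a quasi-identifying attribute (age).
records = [{"age": random.randint(20, 80)} for _ in range(1000)]
for r in records:
    r["high_risk"] = r["age"] >= 60  # constructed ground truth

def accuracy(predict):
    return sum(predict(r) == r["high_risk"] for r in records) / len(records)

# With access to age, a simple threshold rule recovers the signal perfectly.
with_age = accuracy(lambda r: r["age"] >= 60)

# "Anonymised" model: age removed, so the best it can do is always
# predict the majority class (the base rate).
majority_label = sum(r["high_risk"] for r in records) >= len(records) / 2
without_age = accuracy(lambda r: majority_label)

print(with_age, without_age)  # the anonymised rule is strictly worse
```

Real anonymisation is subtler than deleting one column, but the direction of the effect is the same: removing informative attributes caps what any model trained on the data can learn.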

Person holding pencil near laptop computer photo. 2016. Scott Graham on Unsplash. Unsplash license.

Bak et al. (2022) propose that the traditional ‘consent or anonymise’ approach to medical data breaks down when applied to AI research. It can compromise the training data’s integrity with selection bias, which can then propagate through the AI model; applying a biased AI model risks worsening existing health inequities. A diverse training data set is integral to a safe and accurate AI model (Norori et al., 2021).

The training of AI models for medical research provides a counterexample where the collection of personal information can be done without necessarily being exploitative.


With the recent and rapid advancements in machine learning, the appropriation of personal user data as training data has become increasingly prevalent across online services and platforms.

Ultimately, despite the transparency issues of deep neural networks, a free Internet may not be fundamentally opposed to all uses of AI and deep learning on user data. While software companies like GitHub aim to exploit user data and the goodwill of the open-source community, certain applications, such as medical research, deserve consideration: they hold the potential to greatly increase the accessibility of medical assistance and could help address existing inequities and biases in human-mediated medical services. A careful balance must be struck: where should we draw the line between data accessibility for the public good and individual user privacy?


Bak, M., Madai, V. I., Fritzsche, M.-C., Mayrhofer, M. T., & McLennan, S. (2022). You Can’t Have AI Both Ways: Balancing Health Data Privacy and Access Fairly. Frontiers in Genetics, 13, 929453–929453.

Barber, G. (2021, July 12). GitHub’s commercial AI tool was built from open source code. Wired.

Foote, K. (2022)

GitHub copilot · your AI pair programmer. GitHub. (n.d.).

Morrison, S. (2023, July 27). The tricky truth about how generative AI uses your data. Vox.

Murray, S. (2021). Secret agents: Algorithmic culture, Goodreads and datafication of the contemporary book world. European Journal of Cultural Studies, 24(4), 970–989.

Norori, N., Hu, Q., Aellen, F. M., Faraci, F. D., & Tzovara, A. (2021). Addressing bias in big data and AI for health care: A call for open science. Patterns (New York, N.Y.), 2(10), 100347.

Thampi, A. (2022). 4 Model-agnostic methods: Local interpretability. Interpretable AI : building explainable machine learning systems ([First edition].). Manning Publications.

What is deep learning?. IBM. (n.d.).