The Price of Progress: Protecting Patient Privacy in the Age of AI

by Barry P Chaiken, MD

As artificial intelligence (AI) continues to revolutionize industries across the globe, its impact on healthcare is becoming increasingly apparent. From diagnosis and treatment planning to drug discovery and clinical trials, AI has the potential to transform patient care and clinical outcomes. However, this rapid advancement has also raised significant concerns about using protected health information (PHI) stored in electronic medical records (EMRs) without obtaining prior patient permission.

The recent NY Times article “How Tech Giants Cut Corners to Harvest Data for A.I.” sheds light on a similar issue in the tech industry. Companies like OpenAI, Google, and Meta have been using copyrighted material without permission to train their AI models, sparking debates about intellectual property rights and the ethical use of data in the AI age. OpenAI, in particular, faced a data shortage in late 2021 and resorted to transcribing over one million hours of YouTube videos, potentially violating the platform’s terms of service. Similarly, Google used YouTube video transcripts to train its AI models despite unclear legal and ethical implications.

Running Out of Data

Tech companies are growing concerned about the impending scarcity of high-quality data for training AI models. As the demand for data increases, companies are looking for new sources, including copyrighted material and user-generated content, and even considering the acquisition of publishing houses like Simon & Schuster to gain access to long-form text. The debate around fair use and the need for licensing data has become contentious, with some arguing that the scale required for AI development makes traditional licensing impractical.

Access to data remains critical in the AI race, as researchers emphasize that “scale is all you need” when training large language models. The competition for data has led to a rapid increase in the size of training datasets, from hundreds of billions of tokens to trillions of tokens in just a few years. As a result, tech companies are exploring alternative data sources, including the generation of synthetic data using AI models.

Synthetic Data

In the field of AI, a token is the basic unit of data that an algorithm processes, such as a word, part of a word, or a single character. The size of a training data set is usually measured by the number of tokens it contains.
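
To make the idea of a token concrete, here is a minimal sketch in Python. It assumes a deliberately simple rule (each word and each punctuation mark is one token); the function name simple_tokenize is hypothetical, and production language models typically use subword tokenizers that split text differently.

```python
import re
from typing import List


def simple_tokenize(text: str) -> List[str]:
    """Split text into lowercase word tokens and single punctuation tokens.

    Assumption: one word or one punctuation mark equals one token. Real
    model tokenizers usually work on subword pieces instead.
    """
    return re.findall(r"\w+|[^\w\s]", text.lower())


sentence = "Patients are not just data points; they have rights."
tokens = simple_tokenize(sentence)

print(tokens)
print(len(tokens), "tokens")
```

Under this simple rule the sentence yields 11 tokens. A training data set measured in trillions of tokens is the same count applied to an enormous body of text.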

As mentioned in the article, synthetic data is generated by AI models rather than collected from real-world sources. Tech companies like OpenAI believe that synthetic data could be the solution to the looming data shortage. By training AI models to generate realistic text, images, and other forms of data, companies hope to create a self-sustaining loop where AI can learn from its own outputs. However, this approach has its challenges, as AI-generated data may lack the diversity and quality necessary for robust model training. Misinformation in the data sets used to train these models could create a feedback loop that generates more and more misinformation and poorer-quality AI. Despite these concerns, synthetic data will likely become more prevalent as tech companies seek to maintain their competitive edge in AI.

The parallels between using copyrighted material in the tech industry and using PHI in healthcare are striking. Patients, by law, own and have control over their medical records. This fundamental right has been established to protect individual privacy and maintain trust between patients and healthcare providers. Traditionally, when medical researchers seek to use patient data for studies, they must obtain written permission from every participant. This process ensures transparency and allows patients to decide how researchers can use their personal information.

However, the widespread adoption of EMRs has created a vast repository of digital health data, which is irresistible to researchers and companies developing AI tools. The temptation mirrors the one in the tech industry, where companies have trained their AI models on copyrighted material without asking permission from the people who created it.

In healthcare, the stakes are even higher. PHI is not copyrighted; it is sensitive information directly impacting patients’ lives and well-being. When this data is used without consent, it erodes the trust between patients and the healthcare system. Moreover, it raises questions about the ownership and control of personal health data in an increasingly digital world.

Some argue that de-identifying patient data is sufficient to protect privacy and justify its use in AI development. However, this argument fails to address the fundamental issue of patient consent. Even if data is stripped of personally identifiable information, it belongs to the patient who generated it. Using this data without permission violates patient autonomy and breaches trust.
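
As an illustration of what de-identification typically involves, the following Python sketch removes a handful of direct identifiers from a record. The field names, the DIRECT_IDENTIFIERS list, and the deidentify function are all hypothetical and far from a complete list of identifiers; this is a sketch of the general idea, not a compliant implementation.

```python
from typing import Any, Dict

# Hypothetical, incomplete list of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {
    "name",
    "address",
    "phone",
    "email",
    "ssn",
    "medical_record_number",
}


def deidentify(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the record without the direct-identifier fields."""
    return {
        field: value
        for field, value in record.items()
        if field not in DIRECT_IDENTIFIERS
    }


# Fabricated example record, for illustration only.
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "diagnosis": "type 2 diabetes",
    "age_group": "50-59",
}
clean = deidentify(record)
print(clean)
```

Note that nothing in this step records whether the patient agreed to the use of the remaining data, which is the gap described above.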

Regulations Controlling Use

Developing AI tools in healthcare requires vast amounts of diverse, high-quality data to ensure accuracy and reliability. Obtaining individual consent from every patient may be impractical at that scale, but guidelines and regulations governing the use of PHI in AI research and product development are still needed.

These guidelines must prioritize patient privacy and ensure that the use of data directly benefits patient care. One potential approach is establishing a framework where for-profit organizations accessing patient data must contribute a portion of their profits to non-profit entities dedicated to advancing the public good. These funds can support AI research grants, clinical trials, or other initiatives that directly improve patient outcomes. A public-private partnership offers another path forward.

Additionally, patients should be given the option to opt in or opt out of having their data used for AI development. This can be achieved through clear, concise consent forms that explain how the data will be used, who will have access to it, and what benefits, if any, will be shared with patients. Transparency and patient engagement are key to building trust and fostering a collaborative approach to AI in healthcare.
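
The difference between an opt-in and an opt-out policy can be sketched in a few lines of Python. The PatientRecord type, its consent_ai_use field, and the records_allowed_for_ai function are hypothetical names chosen for illustration; no real EMR system is assumed to store consent this way.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PatientRecord:
    patient_id: str
    # True = opted in, False = opted out, None = patient was never asked.
    consent_ai_use: Optional[bool]


def records_allowed_for_ai(
    records: List[PatientRecord], require_opt_in: bool = True
) -> List[PatientRecord]:
    """Select the records that may be used for AI development.

    Opt-in policy: only patients who explicitly agreed are included.
    Opt-out policy: everyone is included except patients who refused.
    """
    if require_opt_in:
        return [r for r in records if r.consent_ai_use is True]
    return [r for r in records if r.consent_ai_use is not False]


# Fabricated example records, for illustration only.
records = [
    PatientRecord("A", True),
    PatientRecord("B", False),
    PatientRecord("C", None),
]

opt_in_ids = [r.patient_id for r in records_allowed_for_ai(records, require_opt_in=True)]
opt_out_ids = [r.patient_id for r in records_allowed_for_ai(records, require_opt_in=False)]
print(opt_in_ids)
print(opt_out_ids)
```

The two policies differ only in how they treat patients who were never asked: opt-in excludes them, while opt-out includes them by default. That default is the choice any guideline would have to make explicit.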

It is crucial to balance enabling AI innovation and protecting patient rights. The goal is not to stifle progress but to ensure it happens ethically and responsibly. By involving patients in the process and ensuring that they benefit from the use of their data, we can create a healthcare system that leverages the power of AI while maintaining the highest standards of privacy and trust.

As we navigate this new frontier of AI in healthcare, we must have open and honest conversations about data ownership, consent, and the responsible use of PHI. Patients, healthcare providers, researchers, policymakers, and industry leaders must work together to develop a framework that promotes innovation while safeguarding patient rights.

The potential benefits of AI in healthcare are immense – from early detection of diseases to personalized treatment plans and improved outcomes. However, we must not lose sight of the human element at the core of this endeavor. Patients are not just data points; they have the right to control their health information.

By prioritizing patient consent, fostering transparency, and ensuring that AI’s benefits are shared equitably, we can harness the power of this technology to transform healthcare for the better. Our collective responsibility is to build a future where AI and patient rights coexist harmoniously, paving the way for a healthier, more empowered society.

The lessons learned from the tech industry’s struggle with data usage and intellectual property rights serve as a cautionary tale for the healthcare sector. As we embrace the potential of AI in medicine, we must proactively address the ethical and legal challenges surrounding patient data. By engaging in open dialogue, establishing clear guidelines, and prioritizing patient rights, we can ensure that the integration of AI in healthcare is both innovative and responsible, ultimately benefiting patients and society.

Source: How Tech Giants Cut Corners to Harvest Data for A.I., NY Times, April 6, 2024


