Confidential health records from UK Biobank project exposed online

A Guardian investigation has found that confidential health data has been exposed online dozens of times, raising questions about the protection of patient records by one of the UK’s most important medical research projects.

UK Biobank, which holds the medical records of 500,000 British volunteers, is one of the world’s most comprehensive repositories of health information and is known for breakthroughs in cancer, dementia and diabetes research. But scientists who are approved to access the Biobank’s sensitive data sometimes appear to be cavalier about the security of that data.

The files, which appear to have been accidentally posted online by researchers using the data, do not contain names or addresses, but could still raise privacy concerns. A dataset found by the Guardian contained millions of hospital diagnoses and associated dates for more than 400,000 participants.

With the permission of a Biobank volunteer, the Guardian was able to locate extensive hospital diagnostic records for the volunteer using only her month and year of birth and details of a major surgery she had undergone.

A data expert said the scale and persistence of the problem was “shocking” at a time when artificial intelligence and social media have made it easier than ever to cross-reference information online.

UK Biobank dismissed the concerns, saying no identifying data such as names and addresses was provided to researchers.

Prof Sir Rory Collins, chief executive of the UK Biobank, said in a statement: “We have not seen any evidence of any UK Biobank participants being re-identified by others.”

‘They said they would keep our data securely’

Established in 2003 by the Department of Health and medical research charities, the UK Biobank holds genome sequences, scans, blood samples and lifestyle information from 500,000 volunteers. Last month the government expanded Biobank’s access to volunteers’ GP records.

Scientists at universities and private companies around the world apply for access and, until late 2024, were free to download the data directly to their own computer systems.

Data had been accidentally published online before that point, and Biobank still appears to be struggling with the issue.

The issue arose as journals and funders increasingly demanded that researchers publish the code they use to analyse large datasets. While intending to upload only their code, some researchers accidentally published some or all of their Biobank datasets to GitHub, a popular online code-sharing platform. UK Biobank prohibits researchers from sharing data outside its own systems and says it offers further training for all researchers.

Over the past year, data leaks appear to have become a more pressing concern for UK Biobank. It issued 80 legal notices to GitHub between July and December 2025, demanding that the data be removed from the internet. But much of it remains available.

Some of the data files contain only patient IDs or test results for small numbers of participants, while others are more comprehensive. A dataset found online by the Guardian in January included hospital diagnoses and the associated diagnosis dates, gender, and month and year of birth for approximately 413,000 participants.

A data expert who reviewed the file said: “Just opening it gave me goosebumps. I deleted it immediately. It was so detailed, and even just glancing at it felt like a huge invasion of privacy.”

The Guardian turned to several Biobank volunteers to test the risk of re-identification; two of them had undergone medical procedures during the time frame in the data and agreed to share these details with an external data scientist.

One volunteer, who shared the dates of treatment for a fracture and a seizure, could not be found in the dataset. A second volunteer, a woman in her 70s, shared her month and year of birth and the month and year of her hysterectomy. Only one person in the dataset matched these details. The apparent match was also confirmed by five other diagnoses in the records that the volunteer had not initially disclosed.

“You were actually rehearsing the main parts of my medical history to me without me giving you any information. I wasn’t expecting that,” the volunteer said.

The woman said she was not particularly concerned about her own data being exposed and planned to remain a participant, saying she viewed UK Biobank’s work as “extremely important”. But she added: “I’m more concerned about whether Biobank has broken their agreement with people. They said they would keep our data securely… I just feel like that needs to be factored into the equation.”

UK Biobank said the re-identification scenario tested by the Guardian did not highlight a privacy risk because it would be impossible to identify individuals without additional information.

A Biobank spokesperson said: “As we have communicated to our participants on our website: ‘If a participant posts information on a public website that reveals certain information about their health and identity, such as genealogical data, this may enable their identity to be discovered through cross-referencing of UK Biobank research data.’

“You simply showed why we told participants not to do this.”

The spokesperson added that Biobank took extensive measures to protect participants’ privacy, including proactively searching GitHub, contacting researchers directly, and issuing legal takedown notices, which the spokesperson said had led to the removal of approximately 500 repositories. Most of these, the spokesperson said, contained only patient IDs, not health data.

‘There are tensions between guiding research with data and protecting privacy’

Privacy experts said UK Biobank’s approach was at odds with the fact that many people conceivably share some health information online and, in the age of artificial intelligence, this could be easily identified and cross-referenced.

“Are these people aware of the existence of the Internet?” asked Prof Felix Ritchie, an economist at the University of the West of England. “The idea that they can trust their volunteers to never publish any other information about them is completely implausible.”

Luc Rocher, an associate professor at the Oxford Internet Institute who examined various Biobank datasets available online, said removing identifiers often does not guarantee anonymity, and knowing a person’s birthday and the date they broke their leg, for example, can be enough to identify their record with a high degree of confidence.

“Once identified, this record could reveal sensitive information such as a psychiatric diagnosis, HIV test result, or drug use history,” they said.

Prof Niels Peek, professor of data science and healthcare improvement at the University of Cambridge, said the scale of the problem was “shocking”. “If this had happened once or 10 times, I probably would have said: ‘It’s not great that this happened, but zero risk is also impossible,’” he said. “Hundreds of times. That’s a bit much.”

According to Peek, Biobank’s actions show that it took the issue seriously and “did everything that could reasonably be expected.” But he added: “The scale and persistence with which this has happened suggests that there are major tensions between the desire to conduct health research with data at scale and the legal and ethical imperative to protect people’s privacy.”

Experts have questioned whether Biobank will be able to fully regain control of data published online. Although the researchers and GitHub removed most of the offending repositories in response to Biobank’s requests, most of the relevant files remained available on a code archive website.

Additional reporting by Luke Hoyland
