
Cradle-to-grave health data

Researchers have mapped and evaluated electronic data flows in the NHS for the first time. We talk to the lead author, Joe Zhang.

When it comes to health data, the UK is in a unique position. The big data held in the NHS's records for the whole population has the potential to advance life sciences and artificial intelligence, improve clinical decision-making, and support population health and personalised medicine.

This capability was realised during the COVID-19 pandemic, when the rapid, secure and efficient connection of health information from hospitals, GP practices and laboratories enabled quick policy decisions around lockdowns, shielding and vaccines. And in April 2023, the UK government commissioned Professor Cathie Sudlow, Chief Scientist and Deputy Director of Health Data Research UK, to review the landscape of health data.

Helping to inform that review comes a study from Imperial College London’s Institute of Global Health Innovation in which researchers comprehensively mapped all electronic data flows, infrastructure and assets – a first for any national healthcare system. Studying the data within individuals’ electronic health records, and how it is shared between healthcare providers and users, has revealed a complex landscape with tens of thousands of transactions.

A giant data ecosystem

The study found shortfalls in transparency and in best practices for safe data access, as well as a lack of value returned to the NHS. Dr Joe Zhang, Wellcome-funded Clinical Research Fellow at the Institute of Global Health Innovation and lead study author, says a lack of transparency and reporting at all levels was the biggest challenge in carrying out the research.

“There’s an incredible amount of patient data being extracted and passed around, and these flows are quite opaque,” he says. “The majority of data use occurs outside of secure environments. Population-sized – de-identified – files may be physically transferred to and between different consumers. There’s no audit trail, and there is a history of data consumers breaching data use agreements, where audited, so there is a vast unknown landscape.”

Zhang says the scale of the NHS data ecosystem is “beyond what any of us previously imagined”, and it took the team by surprise. “We trawled through more than 10,000 documents, including data privacy notices taken from primary care websites, and performed dozens of literature searches for individual databases,” he adds.

“It was comprehensive, but difficult and took about a year, with a team web crawling and data mining. Imagine trying to find out what is happening to medical data as a patient?” However, given the enormity of the NHS data ecosystem, the research demonstrated how valuable the data is, and how effective the technology used to handle it can be.

Alongside the research is a website, Data Insights, that shows NHS data assets and promotes transparency. The site aims to make it easier to understand how patient data in NHS England is extracted, flows into research assets and is used by consumers.

The research was undertaken for a number of reasons. “For policymakers, it’s hard to develop a useful data strategy without knowing what data is out there and how valuable it actually is,” Zhang explains. “For scientists, it’s incredibly important to understand the provenance of data that you are performing analysis on, over and above what variables you have in front of you. For patients and the public, they have a right to transparency and understanding in how data is used.”

The way data is viewed in the NHS is changing quickly too. A 2022 government-commissioned independent report offered a more nuanced and critical view of the problems of checking, securing, analysing and communicating data. In addition, NHS England is investing hundreds of millions of pounds in data infrastructure: since 2022, it has announced £200 million to support the development of secure data environments, and a further £480 million for a national federated data platform.

“That amount of spending really needs to be backed up by objective information,” Zhang adds. “Data infrastructure is expensive to build, and at the moment we have a lot of it. Better to make the most of existing national data assets (like the NHS TRE and OpenSAFELY) rather than building more that will end up just holding the same data in a different location with a different badge.”

The study also recommends building capabilities beyond observational research: for example, incorporating the five innate characteristics of big data (volume, velocity, variety, veracity and value) into data governance that supports use-cases such as AI deployment and management, and life sciences platforms, rather than just holding curated research datasets.

The NHS should also apply opt-outs at the level of data usage. “It’s what patients want, and because origin flows lead to dozens of use-cases, a blanket opt-out can restrict patients from having access to some data-driven capabilities,” Zhang says. And any value from data should be returned directly to the NHS.

Joe Zhang

  • Wellcome Fellow at Imperial College London, undertaking a PhD in data-centric AI
  • Works as Data Scientist in the NHS, implementing population health AI for the NHS in London
  • Intensive Care Specialty Doctor, Guy’s and St Thomas’ NHS Foundation Trust
  • Trained in internal medicine at King’s College Hospital NHS Foundation Trust, London
  • Received an MA in medical sciences and a BM BCh in medicine and surgery at the University of Oxford
Critical role of data

The study is a one-off piece of work to support current data policy. However, Zhang and co-author Jess Morley – Policy Lead for the Oxford Internet Institute’s DataLab and co-author of the 2022 NHS data strategy publication Better, Broader, Safer – intend to map value chains to produce a real estimate of NHS data value. They are also exploring tooling that can report and track data flows, much like a supply chain, to support auditing, help track transactions, and implement opt-outs at the level of usage.

Zhang, who is an Intensive Care Unit (ICU) doctor as well as a data scientist, highlights one big policy objective for any data infrastructure in the NHS: “Get value for NHS spending on data, because at the moment, it’s largely everyone else who is making money from NHS data.”

He has been working with NHS data for almost as long as he’s been a clinician. “Data is essential, and ICU is a data-heavy specialty,” Zhang adds. During the pandemic, he was working in a national extracorporeal membrane oxygenation (ECMO) and severe respiratory failure centre – the biggest challenge of his career so far.

He and fellow intensive care and data specialist Stephen Whebell built a live database and analytics tools to phenotype critically ill referrals from more than 100 hospitals and track outcomes. “Some great insights and a big national study came out of that data, which gave us an early understanding of effective treatments,” he adds.

Just as healthcare data provided insights during COVID-19, the findings of the study give an objective understanding of gaps in the data landscape. This will guide investment into new data infrastructure and in designing specific policy objectives.

Image credit | Getty
