  • Letter to the Editor
  • Open access

Data saves lives: optimising routinely collected clinical data for rare disease research


Necessity-driven organisational change in the post-pandemic landscape has seen health care providers adopting innovations to manage and process health data. These include the use of ‘real-world’ datasets of routinely collected clinical information, enabling data-driven delivery of care. Rare disease risks being ‘left behind’ unless our clinical and research communities engage with the challenges and opportunities afforded by the burgeoning field of health data informatics. We address the challenges to the meaningful use and reuse of rare disease data and, through a series of recommendations around workforce education, harmonisation of taxonomy, and ensuring an inclusive health data environment, we highlight the role that those who manage rare disease must play in addressing them.

Despite the significant direct and indirect negative impact of the COVID-19 pandemic on global health, the disruption to health care services has, in some areas, resulted in opportunities for the advancement of patient care [1]. Necessity-driven organisational change has seen health care providers adopting innovations to manage and process health data [2, 3]. A next step in addressing the complex challenges of longer-term restoration of quality care will be the operationalisation of ‘real-world’ datasets of routinely collected clinical information. This will enable data-driven delivery of care by supporting agile pragmatic or adaptive studies anchored in ‘real-life’ data [4]. These advances are only possible with stakeholder engagement, particularly of those clinical teams who generate these data. Rare disease, which collectively affects 3.5–6% of the population, an estimated 263–446 million persons [5], risks being ‘left behind’ unless our clinical and research communities engage with the challenges and opportunities afforded by the burgeoning field of health data informatics. Meaningful use of health data is all the more important in areas where those data are particularly scarce. The individual rarity of these uncommon disorders magnifies the adverse impact on the evidence base of ‘data wastage’, which arises from failure to take advantage of appropriately designed and implemented health informatic platforms and applications.

Not all health informatic systems are created equal

Electronic health records (EHRs), also termed electronic patient or electronic medical records in different clinical contexts, vary in quality and usability. Maturity is a key metric of EHR robustness which speaks to the stability, responsiveness, interoperability and usability of the system, as well as the measurable benefit to patient care and the wider population. This measurable positive impact lies at the heart of the importance of EHR maturity, with the ideal system being one which enables meaningful use of data for the delivery of high quality, equitable patient-centred care, consistent with national (eg, the US Centers for Disease Control and Prevention, Fig. 1) [6] and supranational (eg, the United Nations) [7] definitions of such care.

Fig. 1 The requisites of the maturity of electronic health records are determined by the priorities for patient and population health outcomes [6]

Mature EHR systems support the operationalisation of data [8]. The essential requisites for a mature EHR are laid out in the Healthcare Information and Management Systems Society’s inpatient and outpatient Electronic Medical Record Adoption Model (EMRAM and (O)EMRAM, Fig. 1). It is important that clinicians understand these requisites and are familiar with the issue of maturity when called on to engage with the choice and adoption of EHR systems within their practice. However, clinicians should also be aware that the maturity of these systems is irrelevant to the meaningful use of health data without an infrastructure for system implementation. The success of this infrastructure is dependent on clinicians and their teams. We address the challenges to the meaningful use and reuse of rare disease data and, through a series of recommendations, we highlight the role that those who manage rare disease must play in addressing them.

Education: data literacy skills for clinical and non-clinical health care staff

The digitisation of data within a health care setting is “adaptive change of the highest order” [9], irrevocably changing the nature of work and those who do the work [10]. Data management skills are a necessary workforce prerequisite for the successful implementation of an electronic health records system.

Within a health care system, data flows in a cyclical fashion, from generation (by patients or clinicians), to storage, processing, analysis and the use of that analysis to impact care and subsequent generation of data. Weak links in this data cycle limit the effectiveness of analysis and resultant application of data. This health data cycle is particularly vulnerable in rare disease. The evidence base which supports improvements in rare disease care and services is reliant on studies with small population sizes, where the scarcity of generated data makes efficient use of that data critically important. Rare disease care is also reliant on multi-centre collaborations, where efficient processing (harmonisation and integration) and analysis is dependent on the quality of the generated data.

Increasingly, medical schools and nurse training courses include modules on data management, but this training is not routinely offered to allied health professionals or non-medical staff, all of whom generate and use data within health care settings. New staff joining a care facility typically have induction training in the relevant EHR system, but not on the principles of the health data cycle or the importance of understanding the structure of data. Cyclically updated training breeds confidence with EHR interfaces and usability [11], and supports staff in driving ongoing optimisation of the EHR interface, with measurable, significant benefit to staff experience [12]. A data-literate clinical workforce also understands that the ‘findable, accessible, interoperable and reusable’ (FAIR) principles and standards apply to all kinds of data [6, 8, 9, 12]. These data types include lists of eligibility criteria for rare disease registries, or a database of local allied care centres with the capacity to co-manage rare and complex disease.

For large, complex and rapidly evolving datasets, the burden of implementing FAIR standards can be considerable. There will be an increasing role for artificial intelligence in making data findable (eg, through searches improved by natural language processing approaches), accessible (eg, through adaptable interfaces for those with disabilities), interoperable (eg, through algorithmic dataset harmonisation) and reusable (eg, by automating data cleaning or transformation for different purposes). This increasing role will still be reliant on stakeholders. Clinicians and researchers need to come together to consider what ‘FAIR’ looks like, and thus to define the value and utility of data and metadata.

Recommendation: All staff who interact with health data should receive ongoing training in the principles of the health data cycle, and the necessity of ensuring that data are findable, accessible, interoperable, and reusable (FAIR).

Capacity: increase the critical mass of subject matter experts trained in health informatics

Data literacy enables an individual to begin to understand what it takes to ‘ask good questions’ of the data stored within their EHRs [13]. Patient-facing clinicians often originate these ‘good questions’, but answering them in rare disease using routinely collected clinical data requires datasets gathered across multiple centres which are sufficiently granular to allow description and evaluation of complex phenotypes. This typically requires additional analytic and programming skills. EHRs contain highly structured data comprising quantitative or qualitative variables such as age, body mass index and drug names, but may also contain ‘dirty’ or unstructured data, such as free text entries. Free text data within EHRs can hold valuable information on patient experience, disease severity, reported adverse events, or details on concordance with prescribed medication. However, without subject matter expertise, these data are a challenge to transform into a product that can be queried and analysed [14, 15]. The analyses of such data require individuals who understand both the clinical question and context and the capabilities of analytical platforms and programming languages such as the open-source R, SPARQL and Python languages [16]. User-friendly interfaces for these programming languages are available, allowing for intuitive use of these tools to analyse or visualise data, without the need for deep coding knowledge. Wide adoption of these skills may also improve the working experience, health, and wellbeing of the staff members themselves. The great promise of the digitisation of health care is the eventual ‘gift of time’ for healthcare workers [12]. In some settings, eg those which lack the resource of data science staff to support the informed use of information, digitisation is more likely to contribute to physician stress and burnout [17]. Rather than the redirection of attention from the patient which is often noted by clinicians following their hospital’s adoption of EHR [18], implementation of intelligent EHR systems may free clinical staff to spend more time interacting with their patients. This will be particularly important during public health crises [19]. The ‘artificial intelligence’ of the system will be dependent on the ‘good questions’ it has answered, and how it has answered them, and this depends in turn on the involvement of the subject matter expertise of hospital staff.
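As an illustration of the kind of transformation described above, the following minimal Python sketch turns free text entries into structured flags that can be queried. It is a sketch under stated assumptions rather than a validated method: the pattern names, the keyword lists and the example notes are all hypothetical, and in practice they would need to be agreed with clinical subject matter experts and tested against real notes.

```python
import re

# Hypothetical keyword patterns, for illustration only. The negative
# lookbehinds skip simple negations such as "no rash" or "denies nausea".
PATTERNS = {
    "missed_doses": re.compile(
        r"\b(missed|skipped|forgot)\b.*\b(dose|doses|medication|tablets?)\b",
        re.IGNORECASE,
    ),
    "adverse_event": re.compile(
        r"(?<!no )(?<!denies )\b(nausea|rash|headache|vomiting)\b",
        re.IGNORECASE,
    ),
}


def flag_free_text(note: str) -> dict:
    """Return one boolean flag per pattern for a single free text note."""
    return {name: bool(pattern.search(note)) for name, pattern in PATTERNS.items()}


# Invented example notes (not real patient data).
notes = [
    "Mum reports he missed two doses last week. No rash.",
    "Tolerating treatment well, no concerns.",
    "Reports headache after infusion.",
]
flags = [flag_free_text(note) for note in notes]
```

Even this small example shows why clinical input matters: deciding which words count as an adverse event, and how negation is phrased in local notes, is a clinical judgement rather than a programming one.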

Recommendation: The creation and expansion of a rare disease analyst workforce, with data skills present even in staff who lack the terms ‘coding’ or ‘analysis’ in their work title, and subject matter experts with health informatics experience, is urgently needed, and should be a priority across all health care settings.

Collaboration: maintaining metadata

Health service delivery for rare disease involves care across and within different tiers, from primary care to super-specialised tertiary team structures, to links with national or international disease registries, and links with external regulatory authorities (eg, tissue and transplant authorities). Communication across these settings is negatively impacted by the siloed approach to data collection which characterises most health care settings [15]. Whilst the structure of the data generated within the EHR of individual care settings may meet the metadata-related requirements necessary to establish FAIR use, this is not always true of the other datasets, particularly those datasets lacking metadata, ie descriptive information on data elements, dataset structure, location of data storage, and provenance [20]. An illustrative example is the absence of uniform adoption of the Digital Imaging and Communications in Medicine (DICOM) metadata standards [15, 16]. Imaging is particularly important for objective capture of phenotype in rare diseases, which tend to be characterised by heterogeneity and complexity. The DICOM metadata standards ensure documentation of the descriptive data (image type, mode of acquisition, image machine settings) needed to ensure that the images are accessible to and usable by other clinicians and researchers [15, 16]. Metadata capture and data cataloguing thus reduce the risk of ‘health data entrapment’, where critical data are less accessible or interoperable [21, 22].

Recommendation: Clinical and clinical research teams should ensure the creation of detailed metadata (such as modes of data acquisition, authorship details, timestamps) for rare disease study datasets, registries or other data items, to ensure dataset reuse.
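One way a team might act on this recommendation is to store a small metadata record alongside each dataset and check it for gaps before sharing. The Python sketch below is illustrative only: the field names are assumptions drawn from the examples in the text (acquisition mode, authorship, timestamps, storage location, data elements), not a published metadata standard, and the dataset shown is invented.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class DatasetMetadata:
    """Hypothetical minimal metadata record for a study dataset or registry."""

    title: str
    authors: list
    acquisition_mode: str          # eg, how and where the data were captured
    storage_location: str          # where the dataset is held
    created_at: str                # ISO 8601 timestamp
    data_elements: dict = field(default_factory=dict)  # variable name -> description


REQUIRED_FIELDS = (
    "title",
    "authors",
    "acquisition_mode",
    "storage_location",
    "created_at",
    "data_elements",
)


def missing_fields(meta: DatasetMetadata) -> list:
    """List the required fields that are empty, so gaps are caught before sharing."""
    record = asdict(meta)
    return [name for name in REQUIRED_FIELDS if not record[name]]


# Invented example: the data dictionary has not yet been written.
example = DatasetMetadata(
    title="Example rare disease clinic dataset",
    authors=["Example Analyst"],
    acquisition_mode="routine outpatient EHR entry",
    storage_location="hospital research server",
    created_at="2022-12-13T09:00:00+00:00",
)
gaps = missing_fields(example)
```

Here `gaps` contains only `data_elements`, flagging that the variables still need to be described before another team could reuse the dataset.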

Standardization: medical terminology, coding and cataloguing

Prompt identification of specific populations of patients allows for redirection of care, audit of clinical outcomes, and can support recruitment to time-sensitive research. These forms of data utilisation are reliant on a high degree of clinical terminology harmonization among EHR users. During the pandemic, multi-centre collaborative networks of researchers worked together to develop COVID-related projects, with subject matter experts (clinicians) and health informaticians co-developing search algorithms within the EHR to generate lists of eligible individuals for inclusion within the studies. Clinical data have also been used for pragmatic and adaptive randomised controlled trials [5]. Terminology on key clinical elements such as clinical condition is based on the World Health Organization’s International Classification of Diseases (ICD) taxonomy.

The words terminology, taxonomy and ontology are often used interchangeably but have different meanings (Fig. 2). Ontologies provide context for data, by ensuring representation of the relationships between concepts and entities. This supports the complexity needed to integrate and standardise data on related concepts from different sources, and supports the logical reasoning needed to make inferences, conclusions or decisions about data. The Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) is now, internationally, the leading clinical and healthcare ontology [23]. For example, it supports the multiple synonyms typically associated with distinct concepts. Examples of this include pneumonia, which has over 1000 synonyms including ‘bronchitis’ or ‘chest infection’, or concepts such as ‘body weight’, which can be considered a clinical finding, a disorder in individuals with clinical obesity, and an entity seen in context with medications prescribed by body weight. This growing granularity of the EHR environment allows richer capture of concepts such as findings, interventions, pharmaceutical or biologic products, geographical location or social context. However, it can become an obstacle to data utilisation without the involvement of subject matter experts to identify and report associative relationships. Progress requires clinicians and other subject matter experts to collaborate to create consensus-based libraries of SNOMED CT-defined conditions, which can then be shared or accessed more widely for clinical or analytic purposes [5, 24].
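The idea of a shared, consensus-based library can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the concept identifier is a hypothetical placeholder and not a real SNOMED CT code, the synonym list is limited to the terms named in the text, and the patient records are invented.

```python
# Hypothetical consensus library: one concept with its agreed synonyms.
# "CONCEPT-0001" is a placeholder identifier, not a real SNOMED CT code.
CONCEPT_LIBRARY = {
    "CONCEPT-0001": {
        "preferred_term": "pneumonia",
        "synonyms": {"bronchitis", "chest infection"},
    },
}


def build_index(library: dict) -> dict:
    """Map every normalised term (preferred or synonym) to its concept identifier."""
    index = {}
    for concept_id, entry in library.items():
        for term in entry["synonyms"] | {entry["preferred_term"]}:
            index[term.strip().lower()] = concept_id
    return index


def eligible_patients(records: list, index: dict, concept_id: str) -> list:
    """Return sorted patient identifiers with any recorded term for the concept."""
    return sorted(
        {
            patient_id
            for patient_id, term in records
            if index.get(term.strip().lower()) == concept_id
        }
    )


# Invented (patient identifier, recorded term) pairs from different sites.
records = [
    ("P1", "Chest infection"),
    ("P2", "asthma"),
    ("P3", "PNEUMONIA "),
    ("P1", "pneumonia"),
]
index = build_index(CONCEPT_LIBRARY)
matches = eligible_patients(records, index, "CONCEPT-0001")
```

Because every site resolves its local wording to the same concept identifier, the same search can be run across centres, which is the property a shared library is intended to provide.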

Fig. 2 Terminology, taxonomy and ontology

The standardisation of nomenclature is particularly important for the future ‘intelligent’ (ie, data-driven) use of EHR data. The healthcare workforce has been under considerable strain during the COVID-19 pandemic, and now faces the burden of rebuilding non-COVID related care to avoid reversing the earlier gains made in rare disease. Artificial intelligence-based predictive analytics, embedded into EHR systems, and able to personalise treatment by modelling prognosis and treatment response, may be able to release clinician time [12]. The development and maturity of such AI assistance will, to a great extent, be dependent on the development of large, carefully coded and ontologically annotated datasets. Without these datasets, we will lack the domain specific and contextual understanding needed to train AI efficiently. The annotation of these datasets is time-consuming, and resource intensive with regards to the hours needed from clinical subject matter experts. Future ‘hybrid’ AI approaches will enable the use of unannotated or unstructured data for the training of assistive health care AI, providing the adaptability necessary to meet novel or unforeseen challenges such as the emergence of new diseases or significant shifts in health care delivery structures. Until then, the harnessing of supportive AI health tech will depend on datasets which are annotated with explicit definitions of the meanings and relationships contained within them, and those annotations must be anchored within subject matter expertise.

Recommendation: Health care professionals who manage rare disease must take up leadership roles in data science and establish collaborative networks to develop consensus-led data libraries.

Inclusivity: equitable patient access to health care information

International differences in implementation of EHR systems are well-documented, with adoption of EHRs being much lower on average in lower-middle-income countries (35% of which have adopted EHRs) and low-income countries (15%), compared with > 50% in upper-middle- and high-income countries [25]. However, data poverty, where families are unable to access online services due to educational, financial or geographical limitations, is not necessarily predictable by national gross domestic product (GDP). One in six adults in the majority of all Organisation for Economic Co-operation and Development (OECD) countries are at the lowest level of literacy (Level 1, basic vocabulary only, unable to make low-level inferences, and unable to make matches between the text, either digital or printed, and information) [26]. The importance of the individual’s right of access to a computable version of their medical record is widely recognised [7, 27], but although EHR implementation has been associated with improved health outcomes, that association is weakest for those in their country’s lowest socioeconomic strata [28], and there is under-ascertainment of families with low data literacy [29, 30]. Additionally, data governance will be an important consideration when communicating the importance of data use to communities, especially in the context of disengagement from health care professionals, issues around trust, or health misinformation. Transparency around ownership, use and protection of data will be of great importance, particularly where patients are being asked to share data they have generated and which, as with all the data generated about them as patients or service users, they own.

Although EHRs bring many advantages for rare disease patients, empowering them as they receive care from multi-disciplinary teams across multiple centres, or as they transition to different models of complex care, care must be taken to avoid exacerbating existing health and disease outcome disparities. Those who design and implement EHR systems must address how they can maintain patient trust, support wide and equitable accessibility for patients to their health care data [31, 32], and avoid widening the existing disparities in health care access and health outcomes [33].

Recommendation: The system-level, rather than patient-level, drivers behind inequitable EHR impact must be considered during EHR implementation, and clinical teams must also consider whose data are, and whose data are not, being collected within the EHR.


Mature EHR systems are those which address the optimisation of care processes and patient health outcomes, through prediction and prevention of unwanted patient experience or health outcomes [25]. The success of these systems is dependent on their implementation, and the delivery of such systems is a critical national and international goal [7, 20, 25]. Future responses to national health care emergencies will be driven by data [25, 34], and will therefore require robust, unbiased and transparent data collection and management methods to be in place. Without this, data-based diagnostic and prediction models, especially those using artificial intelligence approaches, will be at high risk of amplifying bias, with resultant over-optimistic estimates of accuracy and performance [35]. Beyond the pandemic, well designed and implemented EHRs will enable alignment of clinical data with a broad range of national and international rare disease health policies. The success of such policies aimed at ‘building back better’ will rest on the strength of our ‘analyst workforce’, our terminology harmonisation, our metadata, and the access of patients and families to their health data.

Availability of data and materials

All available data and materials are presented in the article.



Abbreviations

CDC: Centers for Disease Control and Prevention
EHR: Electronic health record
FAIR: Findable, accessible, interoperable, reusable
GDP: Gross domestic product
ICD: International Classification of Diseases
UN: United Nations
WHO: World Health Organization


  1. Wang CJ, Ng CY, Brook RH. Response to COVID-19 in Taiwan: big data analytics, new technology, and proactive testing. JAMA. 2020;323(14):1341–2.
  2. Hollander JE, Carr BG. Virtually perfect? Telemedicine for Covid-19. N Engl J Med. 2020;382(18):1679–81.
  3. Ohannessian R, Duong TA, Odone A. Global telemedicine implementation and integration within health systems to fight the COVID-19 pandemic: a call to action. JMIR Public Health Surveill. 2020;6(2):e18810.
  4. Ford I, Norrie J. Pragmatic trials. N Engl J Med. 2016;375(5):454–63.
  5. Nguengang Wakap S, Lambert DM, Olry A, Rodwell C, Gueydan C, Lanneau V, et al. Estimating cumulative point prevalence of rare diseases: analysis of the Orphanet database. Eur J Hum Genet. 2020;28:165–73.
  6. Horgan D, Bernini C, Thomas PPM, Morre SA. Cooperating on data: the missing element in bringing real innovation to Europe’s healthcare systems. Public Health Genomics. 2019;22(3–4):77–101.
  7. Matthews L, Chin V, Taliangis M, Samanek A, Baynam G. Childhood rare diseases and the UN convention on the rights of the child. Orphanet J Rare Dis. 2021;16:523.
  8. The Office of the National Coordinator for Health Information Technology. Strategy on Reducing Regulatory and Administrative Burden Relating to the Use of Health IT and EHRs. Final Report. 2020. Accessed 13th Dec 2022.
  9. Blumenthal D, Tavenner M. The “meaningful use” regulation for electronic health records. N Engl J Med. 2010;363(6):501–4.
  10. Wachter RM. Making IT Work: Harnessing the Power of Health Information Technology to Improve Care in England. 2016. Accessed 13th Dec 2022.
  11. Priestman W, Sridharan S, Vigne H, Collins R, Seamer L, Sebire NJ. What to expect from electronic patient record system implementation; lessons learned from published evidence. J Innov Health Inform. 2018;25(2):92–104.
  12. Topol E. The Topol review: Preparing the healthcare workforce to deliver the digital future. 2019. Accessed 13th Dec 2022.
  13. Goldacre B, Bardsley M, Benson T, Cheema K, Chinn R, Coughlan E, et al. Bringing NHS data analysis into the 21st century. J R Soc Med. 2020;113(10):383–8.
  14. Ambrosini A, Calabrese D, Avato FM, Catania F, Cavaletti G, Pera MC, et al. The Italian neuromuscular registry: a coordinated platform where patient organizations and clinicians collaborate for data collection and multiple usage. Orphanet J Rare Dis. 2018;13:176.
  15. Berger A, Rustemeier AK, Göbel J, Kadioglu D, Britz V, Schubert K, et al. How to design a registry for undiagnosed patients in the framework of rare disease diagnosis: suggestions on software, data set and coding system. Orphanet J Rare Dis. 2021;16:198.
  16. Sweet LE, Moulaison HL. Electronic health records data and metadata: challenges for big data in the United States. Big Data. 2013;1(4):245–51.
  17. Gardner RL, Cooper E, Haskell J, Harris DA, Poplau S, Kroth PJ, et al. Physician stress and burnout: the impact of health information technology. J Am Med Inform Assoc. 2019;26(2):106–14.
  18. Srivastava R. When the EMR Stole My Pen. N Engl J Med. 2020;383(8):708–9.
  19. Kissler MJ, Kissler K, Burden M. Toward a medical “ecology of attention.” N Engl J Med. 2021;384(4):299–301.
  20. The Office of the National Coordinator for Health Information Technology. National Health IT Priorities for Research. 2020. Accessed 13th Dec 2022.
  21. Bellgard MI, Snelling T, McGree JM. RD-RAP: beyond rare disease patient registries, devising a comprehensive data and analytic framework. Orphanet J Rare Dis. 2019;14(1):176.
  22. Atalaia A, Thompson R, Corvo A, Carmody L, Piscia D, Matalonga L, et al. A guide to writing systematic reviews of rare disease treatments to generate FAIR-compliant datasets: building a Treatabolome. Orphanet J Rare Dis. 2020;15(1):206.
  23. Wardle M, Spencer A. Implementation of SNOMED CT in an online clinical database. Future Healthc J. 2017;4(2):126–30.
  24. Springate DA, Kontopantelis E, Ashcroft DM, Olier I, Parisi R, Chamapiwa E, et al. ClinicalCodes: an online clinical codes repository to improve the validity and reproducibility of research using electronic medical records. PLoS ONE. 2014;9(6):e99825.
  25. World Health Organization. Global strategy on digital health 2020–2025. 2020. Accessed 13th Dec 2022.
  26. Organisation for Economic Co-operation and Development (OECD). Skills Matter: Additional Results from the Survey of Adult Skills. 2019.
  27. Mandl KD, Szolovits P, Kohane IS. Public standards and patients’ control: how to keep electronic medical records accessible but private. BMJ. 2001;322(7281):283–7.
  28. Cebul RD, Love TE, Jain AK, Hebert CJ. Electronic health records and quality of diabetes care. N Engl J Med. 2011;365(9):825–33.
  29. Lyles C, Schillinger D, Sarkar U. Connecting the dots: health information technology expansion and health disparities. PLoS Med. 2015;12(7):e1001852.
  30. Menon DU, Belcher HME. COVID-19 pandemic health disparities and pediatric health care-the promise of telehealth. JAMA Pediatr. 2020.
  31. Kavadichanda C, Shah S, Daber A, Bairwa D, Mathew A, Dunga S, et al. Tele-rheumatology for overcoming socioeconomic barriers to healthcare in resource constrained settings: lessons from COVID-19 pandemic. Rheumatology. 2020.
  32. Sittig DF, Singh H. Electronic health records and national patient-safety goals. N Engl J Med. 2012;367(19):1854–60.
  33. Best S, Vidic N, An K, Collins F, White SM. A systematic review of geographical inequities for accessing clinical genomic and genetic services for non-cancer related rare disease. Eur J Hum Genet. 2022;30:645–52.
  34. United Nations News. World must be ready for the next pandemic, UN says on first International Day of Epidemic Preparedness. 2020. Accessed 13th Dec 2022.
  35. Ashrafian H, Darzi A. Transforming health policy through machine learning. PLoS Med. 2018;15(11):e1002692.


No other acknowledgements.


AL Solebo is supported by an NIHR Clinician Scientist award (CS-2018–18-ST2-005). This work is part supported by a Wellcome Grant 204841/Z/16/Z. JS Rahi is supported in part by the NIHR BRC based at Moorfields Eye Hospital NHS Foundation Trust and UCL Institute of Ophthalmology, and an NIHR Senior Investigator award. This work was undertaken at UCL Institute of Child Health / Great Ormond Street Hospital for Children, which received a proportion of funding from the Department of Health’s NIHR Biomedical Research Centres funding scheme.

Author information


ALS conceptualised and designed the study, interpreted data and contributed to drafting of the manuscript. PH interpreted data and contributed to drafting of the manuscript. LAGH interpreted data and contributed to drafting of the manuscript. JSR interpreted data and contributed to drafting of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Ameenat Lola Solebo.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors have no competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.


About this article


Cite this article

Solebo, A.L., Hysi, P., Horvat-Gitsels, L.A. et al. Data saves lives: optimising routinely collected clinical data for rare disease research. Orphanet J Rare Dis 18, 285 (2023).
