The IDeaS initiative: pilot study to assess the impact of rare diseases on patients and healthcare systems

Background Rare diseases (RD) are a diverse collection of more than 7–10,000 different disorders, most of which affect a small number of people per disease. Because of their rarity and fragmentation of patients across thousands of different disorders, the medical needs of RD patients are not well recognized or quantified in healthcare systems (HCS).
Methodology We performed a pilot IDeaS study, where we attempted to quantify the number of RD patients and the direct medical costs of 14 representative RD within 4 different HCS databases and performed a preliminary analysis of the diagnostic journey for selected RD patients.
Results The overall findings were notable for: (1) RD patients are difficult to quantify in HCS using ICD coding search criteria, which likely results in under-counting and under-estimation of their true impact to HCS; (2) per patient direct medical costs of RD are high, estimated to be around three- to fivefold higher than age-matched controls; and (3) preliminary evidence shows that diagnostic journeys are likely prolonged in many patients, and may result in progressive, irreversible, and costly complications of their disease.
Conclusions The results of this small pilot suggest that RD have high medical burdens to patients and HCS, and collectively represent a major impact to the public health. Machine-learning strategies applied to HCS databases and medical records using sentinel disease and patient characteristics may hold promise for faster and more accurate diagnosis for many RD patients and should be explored to help address the high unmet medical needs of RD patients.
Supplementary Information The online version contains supplementary material available at 10.1186/s13023-021-02061-3.


Introduction
When combined, rare diseases are not actually rare, as they collectively affect around 25-30 million people in the United States (US) and more than 300 million people worldwide [1][2][3][4]. RD represent a diverse spectrum of more than 7-10,000 different disorders, most of which affect only a few hundred to a few thousand people per disease [5][6][7][8]. It is estimated that around 85% of RD are genetic diseases [6], the majority of which are serious or life-threatening conditions that carry substantial morbidity and early mortality, and present considerable medical and financial burdens to RD patients and the families who care for them [9][10][11]. Given the large number of different rare diseases, each of which affects only a small number of patients, assessing the true impact of rare diseases on healthcare systems (HCS) is challenging. RD are generally difficult to diagnose, with many patients undergoing prolonged diagnostic journeys, termed the diagnostic odyssey, in order to obtain an accurate diagnosis [12,13]. Even when RD are accurately diagnosed, fewer than half map to an International Classification of Diseases (ICD) 10 code, and far fewer (< 20%) have a specific ICD 10 code [14]. As a result, most RD are under-recognized and under-counted within HCS databases (such as payor/insurance databases) [15][16][17], with myriad downstream effects, such as imprecise coding of RD patients and poor tracking and understanding of both RD patients and the diseases themselves. Further, without a diagnosis, it is often the case that a set of labs, notes, and other features (e.g., a computable phenotype) cannot be reliably or consistently used to identify RD patients. Hence, the true impact of RD on HCS is not well described, and RD remain largely invisible to the HCS.
There are some estimates in the medical literature that medical care for RD patients may account for more than 10% of overall costs in some HCS, [18] and a few small studies, mainly case series, have shown high direct medical costs of RD at single centers in individual diseases or narrow clusters of related diseases (e.g., severe/refractory seizures) [19,20]. Recently, a patient-reported survey on direct and indirect costs of rare diseases in the U.S. was reported, which showed high direct and indirect medical cost burdens to patients and HCS, with total costs estimated to be about $1 trillion (US) in 2019 [21]. Another recent study examined pediatric and adult hospital discharges in patients with rare and common conditions, which showed substantially higher healthcare utilization in rare versus common diagnoses, with RD accounting for nearly half of the US national healthcare costs [22].
In order to better understand RD medical costs, more accurately identify RD patients, and shorten the diagnostic odyssey for RD patients, additional work needs to be done to develop generalizable methodologies and tools (e.g., clinical decision support tools) that can be used across different HCS to adequately and consistently identify RD patients within HCS and to objectively quantify direct medical costs associated with RD by disease and overall. Similarly, the impact of delayed or misdiagnosis of RD on patients and HCS has not been well quantified [13]. While delayed or misdiagnosis is an issue for both rare and common diseases, delays in diagnosis disproportionately impact RD patients given the often years-long diagnostic odyssey most patients undergo. Misdiagnosis and lack of diagnosis can result in inappropriate care, lack of targeted or, when available, disease-modifying treatment, and missed opportunities for interventions that may ameliorate or prevent disease progression, which in some cases is irreversible, or that must be administered within certain time windows (e.g., in neurodegenerative or metabolic disorders) [23,24].
IDeaS (Impact of Rare Diseases on Patients and Healthcare Systems) is a collaboration between the Office of Rare Diseases Research (ORDR) within the National Institutes of Health (NIH) National Center for Advancing Translational Sciences (NCATS); Eversana™, a commercial life sciences company; the Oregon Health & Science University (OHSU), Oregon's public academic health center; Sanford Health (Sanford), a large integrated healthcare system predominantly in the northern Midwestern states; and a health insurer in Australia. IDeaS is intended to be a small preliminary pilot study whose overall purpose is to explore the feasibility of identifying and describing RD patients in a limited set of 14 representative RD within different and diverse HCS. The 3 main aims are to: (1) explore whether methodologies could be developed to quantify patients with RD and provide estimates of disease prevalence in different HCS; (2) quantify the direct medical costs of a representative set of 14 RD in order to identify additional areas for study into RD direct costs and health burdens that may help identify gaps in RD research; and (3) perform a preliminary assessment of the diagnostic journey for selected patients in 2 RD (Batten disease [BD] late infantile neuronal ceroid lipofuscinosis type 2 [CLN2] and cystic fibrosis [CF]) to start to identify disease-course characteristics that might be used to inform the development of strategies that could accelerate RD diagnosis using graphical representation of the disease course in patient "journey maps" (Figs. 5, 6).
While the IDeaS pilot study is limited in scope, it is hoped that the results of these explorations will contribute to further development of methods and approaches that can help us better understand the complex issues currently impeding our understanding of cost and utilization drivers for RD that could be applicable to the thousands of known RD, as well as to inform larger research questions, such as the relationships between costs and cost savings, patient outcomes and disease rarity. However, these lines of inquiry will require additional and iterative development of analytical tools and approaches that are beyond the scope of this study.

Methodology
We conducted the IDeaS study, a retrospective, descriptive pilot study, to explore the feasibility of quantifying patient numbers and direct medical costs for 14 representative RD (Table 1). The 14 RD (or disease groups) included in the pilot were selected by the study authors to explore a diverse set of disorders that differed in prevalence, organ systems affected, age of onset, clinical course, and availability of an approved treatment or specific ICD code, and were intended to be representative of many RD beyond the 14 used in this pilot. The pilot IDeaS study includes 3 main Aims for exploration.

Aim 1: Estimation of disease prevalence in different HCS databases
We initially attempted to identify patients with the 14 pilot RD within the 5 different HCS databases using diagnostic (ICD) codes (see Table 1, Additional file 1: Table S1); however, due to the substantially different billing methods used by the Australian healthcare system (see below), we were not able to reliably connect the Australian HCS data to the 14 RD used in the pilot. Thus, exploration and comparison of the Australian data could not be performed and was dropped from further consideration.
For the remaining 4 HCS, a patient is considered diagnosed with the RD when there are at least two instances of any one of the corresponding diagnosis codes in the patient's chart or medical claims data, occurring at least 3 months apart. Two diseases, pheochromocytoma (Pheo) and Charcot Marie Tooth (CMT), did not have specific ICD codes and additional analyses were attempted by adding specific Current Procedural Terminology (CPT) codes to the search criteria (see Results section).
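The case definition above can be illustrated with a minimal sketch. It is not the study's actual query: it assumes that any code in a disease's code set counts toward the two instances, that "3 months" can be approximated as 90 days, and that claims arrive as simple tuples. The code strings and patient identifiers are placeholders.

```python
from collections import defaultdict
from datetime import date, timedelta

# Assumption: "at least 3 months apart" is approximated as 90 days.
MIN_GAP = timedelta(days=90)

def identify_cases(claims, disease_codes):
    """Return {patient_id: date of first qualifying code} for patients with
    at least two instances of a disease code at least MIN_GAP apart.

    claims: list of (patient_id, diagnosis_code, service_date) tuples.
    disease_codes: set of diagnosis codes mapped to the disease.
    """
    dates_by_patient = defaultdict(list)
    for patient_id, code, service_date in claims:
        if code in disease_codes:
            dates_by_patient[patient_id].append(service_date)

    cases = {}
    for patient_id, dates in dates_by_patient.items():
        first, last = min(dates), max(dates)
        # A gap of at least MIN_GAP implies at least two distinct instances.
        if last - first >= MIN_GAP:
            cases[patient_id] = first
    return cases

# Hypothetical claims; "DX-A" and "DX-B" are placeholder codes.
claims = [
    ("p1", "DX-A", date(2010, 1, 5)),
    ("p1", "DX-A", date(2010, 6, 1)),   # > 90 days later: p1 qualifies
    ("p2", "DX-A", date(2010, 1, 5)),
    ("p2", "DX-A", date(2010, 2, 1)),   # only 27 days later: p2 does not
    ("p3", "DX-B", date(2010, 1, 5)),   # unrelated code
]
cases = identify_cases(claims, {"DX-A"})
print(cases)
```

Requiring two temporally separated codes is a common way to reduce false positives from rule-out or one-off codes, at the cost of missing patients with short follow-up.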

Percentage of patients
The percentage of patients with a RD was estimated by calculating the number of patients with the disease diagnosis divided by the total number of patients within the HCS database during the specified time period using the source data and HCS approaches summarized in Table 2. For 12 of the 14 RDs that either had specific ICD codes or mapped to 1 or more ICD codes, the 4 HCS databases were searched using the ICD codes listed in Table 1. However, given differences between some of the databases in how patient data is categorized, some customization by system was necessary, including: (1) The NCATS analysis was inclusive of data obtained prior to 2015, and only ICD9 codes were used; and (2) the Eversana database was predominantly organized around billing, and certain non-billable ICD codes were not able to be used in the analyses [for example ICD-9 code 277.0 (Cystic Fibrosis, nonbillable)].
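As a small illustration of the percentage calculation described above, the following sketch divides the number of identified patients by the total number of patients in the database. The function name and the counts are hypothetical and are not taken from the study's tables.

```python
def disease_percentage(n_cases, n_total):
    """Percentage of patients in a database who meet the case definition."""
    if n_total <= 0:
        raise ValueError("total patient count must be positive")
    return 100.0 * n_cases / n_total

# Hypothetical counts: 50 identified patients among 1,000,000 in the database.
pct = disease_percentage(50, 1_000_000)
print(f"{pct:.4f}% of patients")
```

Because the denominator is the database population rather than the general population, the result is a percentage within one HCS and is not directly a prevalence estimate.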
The Australia HCS data assessment was performed using the Australian Refined Diagnosis Related Groups (AR-DRG) system, which is an Australian admitted patient classification system that relates the number and type of patients treated in a hospital to the resources required by the hospital, in a clinically meaningful way [25,26]. AR-DRGs group patients with similar diagnoses requiring similar hospital services. Episodes of admitted hospital acute care are assigned with disease and intervention codes, including Australian Modification ICD-10 (ICD-10-AM) and other coding standards.
The medical literature and public health sources were searched to provide a prevalence estimate comparator for each of the diseases.

Aim 2: Average direct cost estimates by disease
Direct medical costs were estimated for patients with each of 13 of the 14 RDs identified in Aim 1 using HCS data from 2 of the collaborating institutions, NCATS and Eversana. For one disease, CMT, which lacked a specific ICD code, patients were not able to be reliably identified in Aim 1, and this disease was dropped from further analysis. Direct medical costs were estimated using the U.S. dollar amount paid to the HCS that was extracted from the database's billing records. As per Aim 1, a patient was considered diagnosed with the disease when there were at least two instances of any one of the corresponding diagnosis codes in the patient's medical claims, occurring at least 3 months apart. The first occurrence of the diagnosis satisfying these criteria was defined as the date of diagnosis of the disease for the patient. For the NCATS database, patients were first identified using the RD ICD codes (Aim 1), then direct costs were calculated by disease using billing codes that represent what was paid by the State of Florida's Medicare/Medicaid program for the time period 2007-2012. For Eversana, direct costs were extracted from the payment information in the IBM® Marketscan® Research Database in years 2006-2020, which includes the gross payment made to a provider. For a given RD, the total cost was calculated in each of the years for the set of patients with costs in the database during that given year, independent of the stage of diagnosis (both pre-diagnosis and post-diagnosis).

Total cost of care
For NCATS, the total cost of care was calculated by summing the total costs of all visits for each patient in the defined population during the specified time period. For Eversana, the total cost of care for each disease was computed as the sum of cost of care of all patients over all the years.

Average cost per patient (PP)
For NCATS, the average cost per patient (PP) in the 5-year time period was derived by calculating the total cost of all visits for each patient in the defined population and then averaging across patients. For Eversana, the total PP cost was calculated in each year for each disease separately by dividing the total cost of care for all patients in that disease cohort in that year by the number of patients in that disease cohort in that year. Weighted average (wtavg) costs PP for the 13 representative RDs were then calculated using the formula shown in Additional file 2: Figure S1.
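The exact weighting formula is given in Additional file 2: Figure S1 and is not reproduced here. As a rough sketch only, the example below assumes the most common choice, a patient-count-weighted mean of the per-disease average costs. The cohort sizes and costs are hypothetical.

```python
def weighted_average_cost(cohorts):
    """Patient-count-weighted mean of per-disease average costs.

    cohorts: list of (n_patients, avg_cost_per_patient) tuples, one per disease.
    Assumption: weights are patient counts; the study's formula may differ.
    """
    total_patients = sum(n for n, _ in cohorts)
    if total_patients == 0:
        raise ValueError("no patients in any cohort")
    total_cost = sum(n * cost for n, cost in cohorts)
    return total_cost / total_patients

# Hypothetical cohorts: 100 patients at $10,000 and 300 patients at $20,000.
wtavg = weighted_average_cost([(100, 10_000.0), (300, 20_000.0)])
print(wtavg)
```

Under this assumption, larger disease cohorts pull the average toward their own cost, so a few very expensive but very small cohorts have limited influence on the overall figure.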

Control population: average cost of age-matched patients without the rare disease
For NCATS, a control population was created by querying the system for patients that had a general wellness visit within the specified time period. This resulted in patients being pulled with the CPT codes listed as "initial history and examination related to the healthy individual" in adult, adolescent, childhood, and infant age groups (CPT 90750, 90751, 90752, 90754). For Eversana, the average costs for all age-matched patients without the RD within the same HCS database and time period were used as a control using the same methodology.

Aim 3: Creation of patient journey maps in selected diseases
Using patient-level data in the Eversana (IBM ® Marketscan ® ) database, patient "journey maps" were created, which charted the patient's clinical course for two diseases, BD CLN2 and CF, for two patients per disease who were identified as having the highest total direct medical costs (Figs. 5, 6). For each patient, key clinical features and major medical milestones, patient characteristics, disease-modifying therapy, and billing costs were extracted from the individual patient records and mapped over the available time period.

Results

Aim 1: Estimation of disease prevalence in different HCS databases
Disease percentage within the HCS databases was estimated by identifying RD patients by ICD codes as a percentage of total patients within the HCS (Fig. 1). First, two of the 14 RDs, Pheo and CMT, do not have specific ICD codes, and patients with these diseases were not able to be identified using ICD codes alone. With the aim of more specifically identifying only the patients with these 2 RD of interest, additional analyses were attempted by adding specific CPT codes to the search criteria. For Pheo, which is included under the non-specific ICD code "benign neoplasm of adrenal gland" (ICD-9 227.0) inclusive of several non-related diseases and conditions, the CPT codes for labs more specific to Pheo (e.g., catecholamines) were added to the search criteria as a more sensitive indicator of Pheo vs other benign adrenal tumors (see Table 1). This combined search for Pheo was able to be performed within the 4 remaining HCS databases (NCATS, Eversana, Sanford, OHSU), resulting in a more targeted identification of Pheo patients. A similar strategy for CMT was attempted using the CPT codes thought to be more specific to CMT [e.g., PMP22 (peripheral myelin protein 22)] (see Table 1); however, this approach resulted in 3 of the 4 HCS databases yielding 0 patients, and was not able to provide estimates of the percentage of patients across the different HCS databases. Thus, CMT was dropped from further analysis.
Second, overall the percentage estimates for the remaining 13 diseases were found to vary widely by HCS (Table 2, Fig. 2). Consistent with the medical literature, Sickle Cell Disease (SCD), Muscular Dystrophy (MD), CF, and Eosinophilic Esophagitis (EoE) had the highest percentages of patients, and Takayasu's Disease, Pheo, and Mitochondrial NeuroGastroIntestinal Encephalomyopathy (MNGIE) had the lowest. The percentages within a disease were quite variable across the different HCS data analyses, and for many of the diseases, the NCATS analysis showed higher percentages of patients with the selected diseases. These findings may be partially explained by the different populations represented in each of the databases. Many RD, especially genetically-based RD, are known to cluster within certain populations and the variable findings may merely show clustering of populations within certain geographic areas or HCS. For example, many RD are highly debilitating with substantial morbidity that may limit a patient or caregiver's ability to work or attend school. Thus, RD patients may be disproportionately reliant upon public insurance programs for their healthcare, which may partially explain the higher percentages for some RD in the NCATS findings. The estimates from the medical literature also showed that, in many cases, disease percentages by HCS were not consistent with generally reported literature estimates in that the literature-cited prevalence rates tended to be lower for most of the diseases than the percentages calculated from the HCS databases.

Aim 2: Average direct cost estimates by disease

Cost per patient per year (PPPY)
Direct medical costs by disease were estimated independently for the NCATS and Eversana HCS data sources and compared to an age-matched control without the RD. Direct medical costs to payors from HCS billing records were estimated by averaging per patient (PP) cost by disease, and total direct costs vs control were estimated by adding the average cost PP by disease over the respective time periods. The results show that average RD costs ranged from 1.5- to 23.9-fold higher versus control (Fig. 2). The Eversana HCS database estimates (Fig. 2a), which were extracted from a mix of commercial and public insurance/payors over an almost 15-year time period (2006-2020), showed per patient per year (PPPY) costs ranging from $8812 to $140,044 for RD patients vs $5862 for the control. The highest PPPY costs for RDs in the Eversana analysis were for Urea Cycle Disorders (UCD), Lennox Gastaut Syndrome (LGS), and BD, and the lowest for EoE, Hereditary Hemorrhagic Telangiectasia (HHT), and SCD. The NCATS estimates (Fig. 2b), which were extracted from an almost exclusively Medicaid data source for the 5-year period 2007-2012, showed PPPY costs ranging from $4859 to $18,994 for RD patients versus $2211 for the control. The highest PPPY costs were for MNGIE, UCD, and MD, and the lowest for EoE, HHT and Pheo. While the NCATS and Eversana cost estimates differed by PPPY and by cost per disease, in every case, the PPPY cost for RD patients exceeded that of the control.
A PPPY cost averaged across the RD was estimated using a weighted average (wtavg). The wtavg for the Eversana analysis was $16,644 for an average RD patient versus $5862 for the control (2.8-fold higher for RD vs control), and for the NCATS analysis was $10,695 for a RD patient versus $2211 for the control (4.8-fold higher).
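The fold differences quoted in this subsection follow directly from the reported PPPY costs. The short sketch below reproduces that arithmetic from the dollar figures in the text; the function and variable names are illustrative only.

```python
def fold_difference(rd_cost, control_cost):
    """Ratio of RD per-patient cost to control per-patient cost."""
    return rd_cost / control_cost

# Eversana: range of per-disease PPPY costs vs the $5862 control.
eversana_low = fold_difference(8_812, 5_862)
eversana_high = fold_difference(140_044, 5_862)

# Weighted averages vs control for each data source.
eversana_wtavg = fold_difference(16_644, 5_862)
ncats_wtavg = fold_difference(10_695, 2_211)

print(f"Eversana range: {eversana_low:.1f}- to {eversana_high:.1f}-fold")
print(f"Weighted average: Eversana {eversana_wtavg:.1f}-fold, NCATS {ncats_wtavg:.1f}-fold")
```

These ratios match the 1.5- to 23.9-fold range and the 2.8- and 4.8-fold weighted averages reported in the text.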

Total cost within time period
Total costs by RD within the time period, averaged by year, were then calculated by multiplying the number of patients with the disease (or control) by the average cost of the disease (Figs. 3, 4, Table 3). For the Eversana analysis (Fig. 3), the results show that the total costs were higher for the control population than for any individual RD. For NCATS (Fig. 4), there were 3 RD whose total costs exceeded the average total costs per year of the control (LGS, MD and SCD), and the total costs per disease and control differed from the Eversana data. The generally lower total costs per disease vs control are likely due to the small number of patients per disease, despite the high average costs PP for RD. The high total costs for the 3 RD in the NCATS analysis vs control are likely due to LGS, MD and SCD being relatively prevalent for a RD, and to the possible enrichment of patients with RD in a public insurance database.
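The interplay between cohort size and per-patient cost described here can be illustrated with a small sketch. The per-patient costs are the highest Eversana RD PPPY cost and the Eversana control cost reported in the text; the patient counts are hypothetical and chosen only to show the effect.

```python
def total_yearly_cost(n_patients, avg_cost_pppy):
    """Total yearly cost = number of patients x average cost per patient per year."""
    return n_patients * avg_cost_pppy

RD_PPPY = 140_044        # highest Eversana RD PPPY cost reported in the text (US$)
CONTROL_PPPY = 5_862     # Eversana control PPPY cost reported in the text (US$)

rd_patients = 500              # hypothetical cohort size
control_patients = 1_000_000   # hypothetical cohort size

rd_total = total_yearly_cost(rd_patients, RD_PPPY)
control_total = total_yearly_cost(control_patients, CONTROL_PPPY)

print(f"RD total: ${rd_total:,}; control total: ${control_total:,}")
```

Even at roughly 24 times the per-patient cost, a cohort of a few hundred patients contributes far less to total spending than a very large control population, which is why individual RD may not stand out in aggregate cost reports.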

Aim 3: Creation of patient journey maps in selected diseases
In order to better understand the disease course leading to diagnosis for RD, with the hope of identifying and diagnosing patients with RD sooner after clinical presentation, individual patient journeys were plotted on journey maps in an exploratory analysis. The maps document key medical events, diagnoses and treatments in 2 RD areas, BD and CF. For this pilot analysis, the 2 highest-cost patients with each disease were mapped and compared with each other. BD and CF were selected because they have an available disease-modifying therapy that allowed for a preliminary description of clinical course pre- and post-therapy.
For CF (Fig. 5), the 2 highest-cost patients were overlaid, with the date of diagnosis used as time 0 for each patient. The results show the overall clinical course of Patient 1 (red), who experienced 2 upper respiratory tract illnesses approximately 10 and 20 months prior to diagnosis, was later diagnosed with CF at age 5 years, and was started on disease-modifying therapy (ivacaftor) approximately 2 years post-diagnosis. Patient 1's course post-diagnosis shows costs predominantly for prescription drugs, with almost no subsequent clinical events in the post-diagnosis time period. Patient 2 (blue) experienced primary pulmonary hypertension, congestive heart failure, major depressive disorder, and substance abuse disorder clinical events in the approximately 30 months prior to diagnosis, with a CF diagnosis at age 20 years. He subsequently underwent prolonged home infusion therapy and a heart-double lung transplant, accounting for much of the high direct medical costs for this patient.
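Overlaying patients with the date of diagnosis as time 0 amounts to re-expressing each event as an offset from that patient's diagnosis date. The following is a minimal sketch of that alignment step, not the tooling used to build the figures; the dates and event labels are hypothetical, and months are approximated using an average month length.

```python
from datetime import date

DAYS_PER_MONTH = 30.4375  # average month length; an approximation

def months_from_diagnosis(event_date, diagnosis_date):
    """Signed offset in months; negative values are pre-diagnosis."""
    return (event_date - diagnosis_date).days / DAYS_PER_MONTH

def align_events(events, diagnosis_date):
    """Return (months_from_diagnosis, label) pairs in chronological order.

    events: list of (event_date, label) tuples for one patient.
    """
    return sorted(
        (months_from_diagnosis(event_date, diagnosis_date), label)
        for event_date, label in events
    )

# Hypothetical patient record.
diagnosis = date(2015, 1, 1)
events = [
    (date(2017, 1, 1), "disease-modifying therapy started"),
    (date(2014, 3, 1), "respiratory illness"),
]
for offset, label in align_events(events, diagnosis):
    print(f"{offset:+.1f} months: {label}")
```

Placing every patient on the same relative time axis makes pre-diagnosis patterns comparable across patients whose calendar dates and ages differ.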
For BD (Fig. 6), the 2 highest-cost patients were evaluated, one with CLN2, for which there is an approved disease-modifying therapy, and one unspecified BD patient who did not receive disease-modifying therapy. The results show that pre-diagnosis, Patient 1 (CLN2, red), whose HCS data begins at approximately 12 years of age, had neurodegenerative complications of the disease beginning at the start of his known clinical course, and diagnosis at age 14 years. Disease-modifying therapy (cerliponase) was initiated approximately 4 months after diagnosis, and the patient's course post-diagnosis reflects costs predominantly for prescription drugs, with two clinical events for BD-related complications (shunt removal) in the post-diagnosis time period. Patient 2 (BD, blue) had premature birth, numerous ICU and other hospitalizations for convulsions, respiratory failure, nervous system procedures, and other complications of BD, with subsequent diagnosis at age 2 years, and post-diagnosis events, including ICU and hospitalizations relating to neurodegenerative and respiratory complications of the disease, and eventual transition to home nursing care.
Figure abbreviations: SCD, sickle cell disease; MD, muscular dystrophy; CF, cystic fibrosis; HHT, hereditary hemorrhagic telangiectasia; BD, Batten disease; LGS, Lennox Gastaut syndrome; FSGS, focal segmental glomerulosclerosis; EoE, eosinophilic esophagitis; OI, osteogenesis imperfecta; MNGIE, mitochondrial neurogastrointestinal encephalomyopathy; Pheo, pheochromocytoma; TA, Takayasu's arteritis.

Discussion
In this pilot study, we explored the feasibility of quantifying the number of RD patients within different HCS and the direct medical costs for their care, and performed a preliminary analysis of the diagnostic journey for individual RD patients. The results are notable for three major findings.
First, estimating RD percentages within and across different databases and HCS using straightforward ICD code search strategies is not able to provide reliable or consistent RD patient identification or disease percentage estimates. We saw wide variability in percentage estimates for 14 representative RD, which may, in part, be due to differences in patient populations within the different HCS, the different types of HCS, and the type of data being queried (EHR data vs medical claims data). Given that many RD are genetic, clustering of patients in geographic areas or different payor systems with specialized expertise is not unexpected; however, in preliminary analysis of the diagnostic journey, and as reported by others [12,13], we know that many RD patients undergo prolonged periods of time where they are undiagnosed or misdiagnosed, which also may contribute to small percentages and variability across HCS. Furthermore, the lack of infrastructure for sharing RD knowledge and tools for diagnosis in HCS could lead to disparities in diagnostic rate and time to diagnosis. For the 2 RD in our sample that did not have a specific ICD code (Pheo, CMT), identifying patients with these conditions was even more difficult. Pheo patients were relatively consistently identified across the different HCS by developing customized search criteria, in this case using specific CPT codes, but CMT patients could not be reliably identified using a similar approach.
Table 3 Unique patient counts and calculated disease percentages by HCS, and estimates from the medical literature. Unique patient counts by disease were extracted from each healthcare system database, and disease percentages were estimated within each HCS and for the US population using medical literature/published data sources. The unique patient counts were used to calculate the per patient cost and the total disease cost by disease shown in the figures [38].
Given that at least half of RD do not currently map to a specific ICD code, consistently and reliably quantifying the estimated 25-30 million RD patients in the US with the thousands of different RD is a daunting task that would require individualized and computable phenotyping criteria for most RD.
Identifying and quantifying RD patients internationally was shown to be even more difficult. Different countries use different approaches for patient classification and payment, which may not be readily applied in other HCS, and our attempts to incorporate the AR-DRG system into the study were unsuccessful. Interoperability and data/knowledge sharing are crucial to improve the ability of HCS to diagnose and care for patients. Because RD are rare, knowledge and data from around the world need to be usable in local HCS; our attempts to identify and profile RD patients in Australia highlight this persisting need. Many open science initiatives exist to overcome these issues; however, coding systems, classification strategies, and tools for sharing RD case information have yet to be implemented in most HCS. Further, important data and knowledge are needed directly from patients, such as from registries, natural history studies and biobanks; however, these important data sources [27] have not to date been integrated into HCS. A call to action to make data and knowledge openly shareable and interoperable into HCS was recently published [28].
Second, RD direct medical costs are high, with RD average PPPY costs estimated to be approximately three- to fivefold higher than age-matched controls. While there were differences in total direct costs PP depending on the payors and HCS used, the PP costs were still consistently higher across the RD in this sample when compared to non-RD patients. This result is not unexpected: high direct medical costs and healthcare utilization are surrogates for poor health. Patients with complex conditions and serious illnesses, regardless of type or rarity, are generally heavily reliant upon healthcare services to sustain life and relieve pain and suffering, with resultant high costs to patients and their families, HCS, and society writ large.
Most RD are genetic disorders that interrupt or affect fundamental biological processes (e.g., enzyme deficiencies) and are overwhelmingly serious and life-threatening conditions, often affecting more than one organ system, which result in substantial impacts to the patient's overall health and activities of daily living. Unlike most other illnesses, however, RD disproportionately (but not exclusively) affect younger patients (children, adolescents, and young adults), with impacts, on average, showing substantially higher costs versus age-matched non-RD patients.
We additionally note that the total cost of an individual RD was generally lower than for the control overall. Given the fragmentation of small numbers of RD patients across thousands of different disorders and despite the relatively high PP costs per RD disease, many RD are likely to have a relatively low total cost (PP cost times the number of patients) that may not stand out within HCS, and thus, not call sufficient attention to the seriousness and high clinical needs for many RD.
Third, preliminary assessment of high-cost RD patients with two RD (CF, BD) showed that these patients had long (ranging from ~1.5 to 20 years) diagnostic journeys after first clinical presentation prior to receiving a definitive diagnosis, which for 3 of the 4 patients described resulted in the occurrence of irreversible complications of the disease and ongoing high costs and HCS utilization related to disease progression.
Mapping of the clinical course also showed that there is potential for identifying and diagnosing suspected RD patients sooner. These patients showed recurrent engagement with the HCS, persistent and progressive symptoms often falling into more general "basket" terms (e.g., convulsions, developmental delay, recurrent infections), and high utilization relative to age-matched controls. These patterns could be leveraged to escalate patients for definitive diagnosis and intervention sooner in order to slow disease progression or avoid catastrophic presentations and hospital admissions (e.g., organ transplant, ICU stays) [29]. We saw candidate diagnoses within the problem lists, and although often found in clinical notes, they may not be documented as diagnoses until later time points. The administration of disease-modifying treatments was associated with changes in clinical course in the two treated patients in this study. While high costs continued post-diagnosis and after treatment administration, the costs for the treated patients almost entirely clustered into the costs for outpatient treatment administration, vs continuing hospital care for the patients without a disease-modifying therapy. This signal in individual patients shows hope for earlier diagnosis and intervention, where available, potentially offering beneficial effects and altering the clinical course in some RD.

Study limitations
There were several limitations to this study. The study was intended to be a pilot/exploratory study to assess the feasibility of identifying and quantifying costs and utilization in RD in a select sample of 14 RD. Although the sample of RD was chosen to reflect the diversity of RD, with widely varying presentations, clinical course, age and populations affected in this sample, the 14 RD admittedly represent only a small sample of the estimated 7-10,000 different RD and 25-30 million patients in the US with RD, and it is not known if these RD are truly representative of the RD population generally. This study was also intended as a preliminary feasibility pilot to begin to address the large problem of identifying, describing and quantifying RD patient data within the US healthcare system, which could then be used to answer larger research questions currently beyond the scope of the IDeaS analyses, such as relationships between costs and cost savings, patient outcomes and disease rarity. However, we see the current analyses as important first steps in what is intended to be an iterative process of developing methodologies that can progressively and deliberately address these larger research goals over time. Additionally, the widely varying percentages of these diseases in different HCS and versus commonly cited literature sources makes it difficult to understand the true prevalence of RD in HCS in the US. The information sources presented additional limitations. Data included in the EHR, but not placed in structured data fields is not available for simple extraction and limits the ability to identify RD diagnoses. While this may occur with both rare and common disease diagnoses, it disproportionately affects RD because only about half of RD can be mapped to a more specific ICD code or cluster, as well as the prolonged timelines between symptom/disease onset and accurate diagnosis and coding of RD patients that make them especially difficult to identify within HCS. 
Additionally, US patients frequently change their HCS plans, and the lack of continuity of data from one EHR or HCS to another makes it difficult to identify original diagnosis dates or sentinel signs/terms that may facilitate RD recognition [30]. Thus, taken together, our study suggests that RD patients have long diagnostic journeys compounded by a lack of HCS continuity, and tend to be classified under broader, nonspecific terms, at least early in their disease course, so that the percentage estimates reported here likely underestimate the true prevalence and impact of RD on HCS.
Direct costs are also based on the costs to payors, which are known to differ substantially by type of insurance (or no insurance) for individual patients. PP and total costs in the 2 HCS presented in this study varied widely, which likely reflects differences in payor status (e.g., commercial vs public) between the two HCS. In either case, however, PP costs for RD were still notably higher than for matched controls. Direct medical costs also account for only a portion of the total costs borne by patients, families, and HCS. We were not able to assess out-of-pocket costs or indirect costs (such as social and support services) that patients and societies incur for RD patient care and treatment.

Conclusions
Overall, these preliminary findings suggest several major considerations for RD that should form the basis for additional study.
• RD patients are likely to be under-recognized and under-estimated in HCS databases and in cost estimates for their medical care. This under-estimation results in a lack of recognition of the true scope of the public health impact of RD on HCS, as well as of the vast unmet and ongoing medical needs of RD patients.
• PP costs in this study were on average around three- to fivefold higher than those of matched controls. Gross extrapolation of the average cost estimate from a large HCS database (Eversana; approximately $17 K per RD patient per year vs ~$6 K for the control) to an estimated 25 million RD patients in the US would result in total direct medical costs for RD in the range of $400 billion per year, making the cost burden similar to that of other high-cost diseases, such as cancer [31] and heart failure [32], and exceeding that of Alzheimer's disease [33]. Additionally, the large variance in the cost of care among patients with the same RD could have several causes; using HCS and insurance claims databases to stratify patient cohorts within a given RD and surface diagnostic, therapeutic, and utilization patterns will be valuable in the quest to better understand disease course and uncover ideal disease management interventions.
• Machine-assisted strategies for early identification and diagnosis of likely RD patients may be feasible. Journey maps in selected RD patients revealed potential characteristics, such as young age, high utilization, recurrent hospitalizations and severe clinical presentations, that may assist with early identification and escalation for definitive diagnosis. Genetic diagnosis as part of an early diagnosis strategy has been shown to be beneficial in other analyses and, importantly, to impact clinical course and patient management, especially if implemented earlier [34][35][36][37].
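The gross extrapolation in the second point above is simple arithmetic, and a minimal sketch makes its inputs explicit. The per-patient figures and the patient count are the rounded values quoted in the text; the function and variable names are hypothetical and chosen only for illustration, and the excess-over-control figure is derived here rather than reported by the study.

```python
# Rounded inputs quoted in the text (Eversana estimate); not exact study data.
RD_COST_PER_PATIENT_USD = 17_000       # ~ $17 K per RD patient per year
CONTROL_COST_PER_PATIENT_USD = 6_000   # ~ $6 K per matched control per year
US_RD_PATIENTS = 25_000_000            # estimated RD patients in the US


def extrapolate_total_cost(cost_per_patient, n_patients):
    """Scale an average per-patient yearly cost to a whole population.

    Hypothetical helper: a gross extrapolation that ignores variation
    by disease, age, payor type, and undiagnosed patients.
    """
    return cost_per_patient * n_patients


total_rd = extrapolate_total_cost(RD_COST_PER_PATIENT_USD, US_RD_PATIENTS)
total_control = extrapolate_total_cost(CONTROL_COST_PER_PATIENT_USD, US_RD_PATIENTS)

# Cost attributable to RD over and above a matched control population
# (derived here for illustration; not a figure reported in the study).
excess = total_rd - total_control

# Ratio of RD to control per-patient cost in this one database.
fold = RD_COST_PER_PATIENT_USD / CONTROL_COST_PER_PATIENT_USD

print(f"Total RD direct cost: ${total_rd / 1e9:.0f} billion per year")
print(f"Excess over controls: ${excess / 1e9:.0f} billion per year")
print(f"RD vs control ratio:  {fold:.1f}-fold")
```

With these rounded inputs the total comes to about $425 billion per year, consistent with the "range of $400 billion" stated above. The ratio for this single database is about 2.8-fold, near the lower end of the three- to fivefold range reported across the study.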
Thus, we conclude that the 14 RD included in this small pilot study impose high medical burdens on patients and HCS, likely in a range similar to the burdens experienced by patients with other serious diseases, such as cancer, heart failure and Alzheimer's disease; however, these results will need to be confirmed in a larger cohort of RD. These findings suggest that RD have a major impact on public health and carry high unmet medical needs, and that there is an urgent and considerable need for earlier and more accurate RD diagnosis and intervention to improve medical management for RD patients. This conclusion is further supported by the similarly high cost burdens reported in two other recent cost-burden studies [21,22].
Finally, with the information and data gathered from this small pilot study, we have sought to bring attention to key considerations (such as limitations in coding) that have been recognized for many years in the RD community and that continue to limit our ability to understand RD and their impacts on patients and public health. This is an important line of inquiry, and we hope that efforts such as this study will begin to open new areas of research that can improve our ability to identify RD patients more accurately, and to assess and mitigate the impacts (utilization and cost) of RD by leveraging available HCS data.