A scoping review and proposed workflow for multi-omic rare disease research

Background: Patients with rare diseases face unique challenges in obtaining a diagnosis, appropriate medical care and access to support services. Whole genome and exome sequencing have increased identification of causal variants compared to single gene testing alone, with diagnostic rates of approximately 50% for inherited diseases; however, integrated multi-omic analysis may further increase diagnostic yield. Additionally, multi-omic analysis can aid the explanation of genotypic and phenotypic heterogeneity, which may not be evident from single omic analyses.
Main body: This scoping review took a systematic approach to comprehensively search the electronic databases MEDLINE, EMBASE, PubMed, Web of Science, Scopus, Google Scholar, and the grey literature databases OpenGrey / GreyLit for journal articles pertaining to multi-omics and rare disease, written in English and published prior to the 30th December 2018. Additionally, The Cancer Genome Atlas publications were searched for relevant studies and forward citation searching / screening of reference lists was performed to identify further eligible articles. Following title, abstract and full text screening, 66 articles were found to be eligible for inclusion in this review. Of these, 42 (64%) were studies of multi-omics and rare cancer, two (3%) were studies of multi-omics and a pre-cancerous condition, and 22 (33%) were studies of non-cancerous rare diseases. The average age of participants (where known) across studies was 39.4 years. There has been a significant increase in the number of multi-omic studies in recent years, with 66.7% of included studies conducted since 2016 and 33% since 2018. Fourteen combinations of multi-omic analyses for rare disease research were returned, spanning genomics, epigenomics, transcriptomics, proteomics, phenomics and metabolomics.
Conclusions: This scoping review emphasises the value of multi-omic analysis for rare disease research in several ways compared to single omic analysis, ranging from the provision of a diagnosis and the identification of prognostic biomarkers and distinct molecular subtypes (particularly for rare cancers) to the identification of novel therapeutic targets. Moving forward, there is a critical need for collaboration across multi-omic rare disease studies to increase the potential to generate robust outcomes, and for the development of standardised biorepository collection and reporting structures for multi-omic studies.


Background
The scale of the rare disease challenge is a staggering one, with upwards of 8000 types of rare diseases described and an estimated 262.9-446.2 million people living with a rare disease globally. The definition of a rare disease varies internationally; the European definition is any disease with an incidence of less than one in 2000 [1], the United States (US) definition is conditions affecting fewer than 200,000 people [2], and the Chinese definition is disorders prevalent in less than one in 500,000 within the population [3]. Yet while the type and definitions of a rare disease may vary, there are many common issues faced by patients falling under the 'rare' umbrella.
The first hurdle many patients face is escaping the 'diagnostic odyssey', with an average wait of 5.6 years for an accurate diagnosis in the United Kingdom (UK) and 7.6 years in the US [4]. Patients often report receiving several inaccurate diagnoses before the correct conclusion is reached. Obtaining a diagnosis has a significant impact on the development of a patient's defined care pathway, as an accurate diagnosis can enable appropriate medical intervention, access to public services (such as financial support) and connection with vital rare disease support groups [5][6][7]. The difficulties in providing a diagnosis arise from several interacting factors. Rare diseases have been widely reported as presenting with phenotypic and genetic heterogeneity, which can make them difficult to diagnose even by specialists with prior experience [8][9][10][11], and the often multi-system impact of the conditions can mean they are masked by common complex disease symptoms [12]. Overlapping phenotypes in patients with more than one rare disease can also be difficult to differentiate, hindering a conclusive diagnosis [13]. As rare diseases frequently have multi-system impact, patients are usually managed by more than one physician across a range of medical specialities. For example, a national survey of rare disease patients and carers in Northern Ireland showed that 63% of participants reported attending multiple doctors, with 7% reporting management by more than 10 doctors [14]. The nature of typical patient confidentiality can make essential communication between healthcare teams difficult, particularly with care across multiple centres and when accessing external specialist centres of excellence, thus leading to further delays in the diagnosis of a rare disease and fragmented patient care [4,15,16].
Even where a patient is fortunate enough to obtain a diagnosis there are often limited or no treatment options available, and a third of all rare disease patients die before reaching their fifth birthday [2]. Conditions lasting into adulthood are often debilitating and/or life limiting. The development of a panel of sensitive, minimally invasive and clinically accessible molecular biomarkers for faster diagnosis of patients with rare diseases will facilitate optimised care strategies and drive new therapeutic developments. This will be aided by the evolution of international registries, federated datasets, and computational tools that enable secure sharing and analysis of complex data generated within rare disease research networks; one such example of evolving infrastructure for rare disease is the Health Data Research UK (HDRUK) SPRINT exemplar innovation program which, in collaboration with the National Institute for Health Research (NIHR) BioResource and several National Health Service (NHS) trusts, aims to provide a dedicated research resource involving integration of phenotype-genotype information through cloud-based methods [17].
The advent of high-throughput technology in the past decade, such as next generation sequencing (NGS) and high-density microarrays, has enabled large scale genomic analysis of rare diseases and brought hope for many patients and their families [18]. For example, the 100,000 Genomes Project was a UK based project which recently completed whole genome sequencing (WGS) of 119,286 genomes, including those from 74,674 patients with rare diseases and their family members, providing actionable findings for 20-25% of rare disease cases where traditional genetic testing did not identify a causal variant. Additionally, new therapeutic targets have been identified, and this transformational research project, which was embedded within the UK NHS, is moving towards generating sequencing data for ~1 million individuals [19]. Thirty percent of the causal variants identified by the 100,000 Genomes pilot project had not previously been reported [20]. Moving forward, the Health and Social Care secretary announced in 2018 plans to continue this work by sequencing five million genomes in the UK over the next five years, with all seriously ill children being offered WGS from 2019 [21]. Rapid genome/exome sequencing for acutely ill children with a likely genetic diagnosis will enable improved diagnostic rates with rapidly implemented optimised care protocols. Non-invasive prenatal testing that analyses foetal cell free circulating DNA within a maternal blood sample to identify chromosomal disorders has been introduced by many countries, including tests such as Harmony (Ireland, UK, US, Spain, Mexico, Germany, Canada and more) and MaterniT21 (US, Algeria, Belgium, Cameroon, Czech Republic, France and more) [22].
International projects such as the US National Institutes of Health (NIH) Undiagnosed Diseases Program (UDP) aim to provide a diagnosis and identify treatment options, whilst the International Rare Diseases Research Consortium (IRDiRC) aims to provide a diagnosis for rare disease patients within one year of presentation, to develop 1000 new therapies and to assess the impact of these diagnoses and novel therapies by 2027 [23][24][25].
Undoubtedly WGS efforts have increased diagnostic yield, with the figure ranging from 21 to 73% depending on participant age and phenotype [18]. However, for those phenotypes with lower diagnostic yields, WGS by projects such as the 100,000 Genomes Project facilitates further investigations by providing a platform for integrative multi-omic analysis. The term 'omic' stems from the suffix 'ome' added to many fields of biological study, which refers to the study of something in its entirety. There are estimated to be over 500 omic types [26] (Table S4), with the most commonly known being genomics, epigenomics, transcriptomics, proteomics, metabolomics and phenomics (definitions for common examples can be found in Fig. 1). Considered individually, these omic types have been used to identify and/or provide functional supporting information for candidate pathogenic mutations for rare diseases across various medical specialities. For example, transcriptomics from blood samples has been shown to be a useful method of characterising undiagnosed rare diseases, with a validated diagnostic yield of 7.5% where whole exome sequencing (WES) was insufficient to identify a causal variant [27]. Taking a holistic molecular approach by integrating analyses for several different 'omic' types could further increase diagnostic yield and contribute to understanding of phenotypic heterogeneity and disease progression. Furthermore, multi-omic analysis could illuminate opportunities for drug repurposing through identification of novel therapeutic targets. Repurposing is an important component of rare disease treatment: drugs originally intended for treatment and management of common complex diseases can be applied to rare diseases, for which there are unlikely to be many existing treatment options [28]. In reality, the full potential of integrative multi-omic analysis has yet to be realised.
Challenges exist in the integration and processing of large datasets across 'omics' technologies and even between laboratories (with much data now publicly available online), as well as interpreting the clinical impact of the relationships between these omic analyses [29].

Aims and objectives
To fully understand what research has been undertaken and what gaps still exist, this scoping review aims to systematically summarise research into multi-omics and rare disease by:
1. Evaluating what primary research studies exist pertaining to multi-omics and rare disease and which type of omic analysis was undertaken.
2. Highlighting research outcomes with implications for rare disease diagnosis, treatment or improved understanding of disease mechanisms.

Methods summary
The full methodology for this review is available online as a published protocol [30], and follows the Joanna Briggs Institute methodology guidance for scoping reviews. To ensure our search was comprehensive, we followed all applicable aspects of the Preferred Reporting Items for Systematic Reviews and Meta-Analyses Extension for Scoping Reviews guidelines [31].
Fig. 1 The diagram emphasises the potential of studies which, following careful phenotyping at study conception, utilise integrated multi-omic analysis to consider multiple components in the journey from DNA to expression
With reference to the population, concept, context (PCC) guidelines for determining the review research question, our population of interest was studies of patients diagnosed with a rare disease, meeting the European definition (an incidence of less than 5 in 10,000) [32], or with a rare cancer (European definition of less than 6 in 100,000 and the US definition of less than 15 in 100,000) [33,34]. Our concept was multi-omic data generated on rare diseases, where a multi-omic study was defined as one which included two or more omic analysis types [26]. The context of the scoping review was primary studies written in English, published prior to 30th December 2018.
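The eligibility criteria described above can be expressed as a short check. The following Python sketch is illustrative only: the `Study` record, its field names and the function names are hypothetical, the thresholds are those quoted in the text (with the European rare disease definition taken as one in 2000, as given in the Background), and it assumes that meeting either the European or the US rare cancer definition was sufficient, which the text does not state explicitly.

```python
from dataclasses import dataclass
from datetime import date

# Thresholds quoted in the text, expressed as cases per 100,000 people.
EU_RARE_DISEASE_PER_100K = 50.0  # one in 2000 is 50 per 100,000
EU_RARE_CANCER_PER_100K = 6.0
US_RARE_CANCER_PER_100K = 15.0
CUTOFF = date(2018, 12, 30)      # "published prior to 30th December 2018"


@dataclass
class Study:
    """Hypothetical record of the fields needed to apply the PCC criteria."""
    rate_per_100k: float   # frequency of the studied condition
    is_cancer: bool
    omic_types: frozenset  # e.g. frozenset({"genomics", "transcriptomics"})
    language: str
    primary_study: bool
    published: date


def is_rare(study: Study) -> bool:
    """Population: rare disease or rare cancer by the quoted definitions."""
    if study.is_cancer:
        # Assumption: either the European or the US definition qualifies.
        limit = max(EU_RARE_CANCER_PER_100K, US_RARE_CANCER_PER_100K)
        return study.rate_per_100k < limit
    return study.rate_per_100k < EU_RARE_DISEASE_PER_100K


def is_eligible(study: Study) -> bool:
    """Population, concept (two or more omic types) and context combined."""
    return (
        is_rare(study)
        and len(study.omic_types) >= 2
        and study.primary_study
        and study.language == "English"
        and study.published < CUTOFF
    )


example = Study(
    rate_per_100k=1.7,
    is_cancer=False,
    omic_types=frozenset({"genomics", "proteomics"}),
    language="English",
    primary_study=True,
    published=date(2017, 5, 1),
)
print(is_eligible(example))
```

The example record is invented for illustration; a study reporting a single omic type, or one published after the cut-off, would be rejected by the same function.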
Databases searched included MEDLINE, EMBASE, PubMed, Web of Science, Scopus and Google Scholar, as well as the grey literature databases GreyLit and OpenGrey. One additional information source utilised not detailed in the published protocol, was papers published by The Cancer Genome Atlas (TCGA). This resource was identified through an article returned in the initial search. TCGA is a large collaborative project between the National Cancer Institute and the National Human Genome Research Institute, which has conducted multi-omic analyses of 33 cancers [35]. While no hard definition of a rare cancer was used by TCGA, researchers selected uncommon cancers on the basis of public health impact and the feasibility of getting enough samples for meaningful analyses. Review articles and reference lists were searched for any additional eligible articles, as well as forward citation searching using the Web of Science Cited Reference Search Tool. For any conference abstracts identified, full texts were searched.
The reference management software EndNote X8 was used for citation handling throughout duplicate removal and title/abstract screening. Microsoft Excel was used to record results and exclusion reasons, as well as for full text screening and data extraction. Data extraction (otherwise referred to in scoping reviews as data charting) was performed independently and in duplicate (by KK and CB), with any discrepancies resolved by consultation of a third individual. Data extracted included rare diagnosis (or phenotype where patients were undiagnosed), omic analysis type, study design information, experimental methods and key relevant results. As is typical of scoping reviews, a qualitative narrative synthesis was then conducted to summarise key components of the multi-omic rare disease field [31,36,37].

Results
Initial searches identified a total of 1770 articles: n = 173 MEDLINE articles, n = 630 EMBASE articles, n = 17 PubMed articles, n = 206 Google Scholar articles, n = 721 Web of Science articles, n = 23 Scopus articles. A further 19 articles were identified from additional sources, not included in the initial search numbers. This included five articles which were full text versions of conference abstracts [38][39][40][41][42]. One paper returned in the initial search published through TCGA [43], led to the identification of a further 13 articles on multi-omics of rare cancers [44][45][46][47][48][49][50][51][52][53][54][55][56]. Finally, one article was identified from the reference list of a review paper [57,58]. The screening process is summarised in Fig. 2. Following duplicate removal, 1417 articles were identified for title/abstract screening from which 1306 articles were excluded (1018 papers as they were not primary studies of multiomics and/or rare disease, 20 articles as they were not written in English, and 268 articles as they only included one omic analysis type). This left 111 articles for full text screening, from which four articles were excluded as they were qualitative review articles, four as they were conference abstracts and the corresponding full texts were already included in the return, nine articles as they were found not to be primary studies of rare disease, two did not specify which rare cancer and finally a further 26 articles described only a single omic type (a total of 45 articles removed at this stage). Subsequently, 66 articles were eligible for inclusion in this review. General study and participant characteristics are summarised in Table 1, detailed experimental procedures and results are available in Additional file 1: supplementary Table 1 ( Table S1). The year of publication ranged from 2001 to 2018, with a rapid increase in publications over the past decade (Fig. 3). 
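The screening counts reported above can be cross-checked with simple arithmetic. The Python sketch below uses only the numbers stated in this section; the variable names are illustrative, and the derived duplicate count assumes that the 19 additional articles were pooled with the initial returns before duplicate removal, which the text does not confirm.

```python
# Articles returned by each database in the initial search.
initial = {
    "MEDLINE": 173,
    "EMBASE": 630,
    "PubMed": 17,
    "Google Scholar": 206,
    "Web of Science": 721,
    "Scopus": 23,
}
# Articles identified from additional sources.
additional = {
    "full texts of conference abstracts": 5,
    "TCGA publications": 13,
    "reference list of a review paper": 1,
}
after_duplicate_removal = 1417

title_abstract_exclusions = {
    "not primary studies of multi-omics and/or rare disease": 1018,
    "not written in English": 20,
    "single omic analysis type": 268,
}
full_text_exclusions = {
    "qualitative review articles": 4,
    "conference abstracts with full text already included": 4,
    "not primary studies of rare disease": 9,
    "rare cancer not specified": 2,
    "single omic type": 26,
}

initial_total = sum(initial.values())
additional_total = sum(additional.values())
# Assumption: additional articles were pooled before duplicates were removed.
duplicates_removed = initial_total + additional_total - after_duplicate_removal
full_text_screened = after_duplicate_removal - sum(title_abstract_exclusions.values())
included = full_text_screened - sum(full_text_exclusions.values())

print(initial_total, additional_total, full_text_screened, included)
```

Under these inputs the totals agree with the figures reported in the text: 1770 initial articles, 19 additional articles, 111 articles at full text screening and 66 included articles.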
Two of the final 66 included papers were published in 2019, despite the 2018 date restriction, as these were identified from within the additional TCGA search [38,59]. Evolution of inclusion and exclusion criteria is not unusual within the process of conducting scoping reviews [37]. Three study designs were identified: case-control studies (n = 55), familial studies (n = 6) and studies which incorporated a mix of both familial comparisons and external unrelated cohort comparisons (n = 5) into their methodological designs ( Table 1).
Four of the 66 articles were conference abstracts for which no full text was available, but appeared to describe case-control studies. As expected, likely due to the low prevalence of rare diseases, no randomised controlled trials were identified in the search. The most frequent number of participants was 1-5; the mode is reported instead of the mean because a number of studies conducted by TCGA have extremely large participant numbers, which would disproportionately skew the mean of included studies. The mean age of participants was 39.4 years (median = 49); however, this was not reported for 39.4% of the included articles (26 studies). There was a peak in the number of studies which included participants between 0 and 10 years of age, followed by a significant reduction until a second peak from 50 years of age (Table 1). Similar to participant age, participant ethnicity/race was unknown in a large percentage of included studies (60.6%, 40 studies). Where ethnicity and/or race were known there was significant heterogeneity in reporting; therefore these have been summarised in groups in Table 1. The most common participant ethnicity was Caucasian (82.1%), and the least common was mixed race (0.33%). Publication countries of origin included the United States of America (n = 38), France (n = 5), Switzerland (n = 4), the United Kingdom (n = 4), Canada (n = 3), Japan (n = 3), Germany (n = 3), Italy (n = 2), Brazil (n = 1), Finland (n = 1), Korea (n = 1) and Spain (n = 1).
Fourteen different combinations of omic analysis types were identified within this scoping review, spanning genomic, epigenomic, metabolomic, phenomic, proteomic and transcriptomic analyses (Table 2), with transcriptomics being the most commonly integrated omic analysis type. The majority of studies eligible for inclusion were of rare cancers (64%, 42 studies), with a further two studies of pre-cancerous conditions, summarised in Table 3. Of the remaining 22 non-cancerous rare disease articles, neurological disorders were the most common disease type (15%, 10 studies), whilst other rare disease types combined contributed just 20% of the included studies. These included auto-immune diseases, multisystem developmental disorders, cardiovascular disease, muscular disease, neurological disease and renal disease.
Specific rare diseases are detailed in the discussion and in Additional file 1, Table S1. Studies of rare cancers/pre-cancerous rare diseases had more than ten times the mean participant number compared to studies of non-cancerous rare diseases (429.3 ± 1799.5 and 41.2 ± 113.6 mean and standard deviation of participant numbers respectively). However, this was influenced by two studies with high participant numbers (3527 and 11,286 participants) [43,60]. The disproportionate representation of cancerous to non-cancerous rare diseases is summarised in Fig. 4.
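The effect of a few very large cohorts on the mean, and the reason for reporting the mode, can be illustrated with a toy calculation. In the Python sketch below only the two large cohort sizes (3527 and 11,286) come from the text; all other participant numbers are hypothetical values chosen for illustration and are not the data of the included studies.

```python
from statistics import mean, median, mode

# Hypothetical small cohorts plus the two large cohorts quoted in the text.
small_cohorts = [1, 1, 3, 5, 5, 5, 12, 40, 85]  # illustrative values only
large_cohorts = [3527, 11286]                    # reported in the text
cohort_sizes = small_cohorts + large_cohorts

mean_all = mean(cohort_sizes)
mean_without_large = mean(small_cohorts)
most_common = mode(cohort_sizes)
middle = median(cohort_sizes)

print(f"mean with large cohorts:    {mean_all:.1f}")
print(f"mean without large cohorts: {mean_without_large:.1f}")
print(f"median: {middle}, mode: {most_common}")
```

In this toy data the two large cohorts raise the mean by roughly two orders of magnitude, while the mode and median are unchanged, which is the behaviour the text describes when it reports the mode rather than the mean.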
From the data extraction process five key themes were identified in this review, which are expanded upon in a narrative synthesis in the discussion section. These included:
1. Significant use of NGS technologies and high throughput microarrays for multi-omic rare disease analysis.
2. Varied methodological and analytical approaches to multi-omic rare disease research.
3. Multi-omics for diagnosis of undiagnosed rare phenotypes.
4. Multi-omics for identification of pathogenic and prognostic biomarkers of rare disease.
5. Multi-omics for elucidation of novel treatments and drug re-purposing opportunities.
A concise critical appraisal of studies was conducted using a checklist adapted from the Joanna Briggs Institute (JBI) critical appraisal tools in the PRISMA extension for scoping reviews (Additional file 2: Table S2) [31]. Conference abstracts were excluded from critical appraisal (n = 5). Assessment of sample numbers, appropriate matching of cases and controls (e.g. age/gender), appropriate experimental controls and statistical analysis (e.g. accounting for multiple variates) led to the identification of 19 studies with high methodological rigour, 10 with medium methodological rigour and 32 with low methodological rigour (Additional file 1: Table S1). This high proportion of studies deemed to have low methodological rigour was in most cases due to very low sample numbers, e.g. case reports of one person, and is typical of studies of rare disease.

Discussion
Scoping reviews are an increasingly popular method of summarising literature in a researcher's particular area of interest, which can be used to identify themes and significant gaps to inform research hypothesis development [61]. This scoping review provides a comprehensive narrative synthesis of studies of multi-omics and rare disease [36], identifying 66 primary studies published between 2000 and 2019. Estimated European prevalence (where known) of each rare disease and overall study objectives are summarised in Table 4, whilst detailed study design, methodology and results are available from Additional file 1, Table S1.
Whilst this review conducted a comprehensive search, using multiple information sources and developing search terms carefully in collaboration with a Medical Faculty librarian, it was not possible to include all relevant studies, primarily due to the heterogeneity of terms used to identify these studies as multi-omics, and to varying definitions of a rare cancer. Rather, it is intended that this scoping review will provide an overview of general themes in multi-omic rare disease research and provide direction for future projects. This is particularly true of multi-omics and cancer studies, which are conducted far more routinely than studies of non-cancerous rare diseases. Two studies of non-cancerous rare diseases which were eligible for inclusion, but not returned through our original search, were identified during peer evaluation of this scoping review. Both utilised RNA sequencing to increase diagnostic yield of Mendelian disorders, reporting a diagnostic rate of 10% in mitochondriopathy patients and 35% in undiagnosed rare muscular diseases respectively [139,140]. A second limitation is that due to language restrictions we were only able to include articles written in English, which led to the exclusion of 20 articles; these are listed separately (Table S3) and can be reviewed by readers able to interpret them. Following our published protocol, the search strategy can be easily reproduced by researchers hoping to conduct a multi-lingual inclusive search [30]. Furthermore, the vast majority (82%) of participants in the included studies for which ethnicity/race was known were identified as Caucasian, with all other ethnicities comprising just 17.9% of participants. It should be noted, however, that representation bias may have been introduced by the language limitations imposed on this review. Twenty articles were identified that were published in non-English languages; many of these may not meet the criteria for inclusion within the review and so the effect of such bias is likely to be minimal.
This disproportionate representation of Western ethnicity will need to be addressed by international collaborative efforts in future research studies. In the narrative synthesis below, this review reflects how multi-omic rare disease research is the natural next step for progressing our understanding of rare diseases: whether that be for diagnostic or prognostic purposes, development of novel treatment options, or simply understanding the mechanisms behind disease progression. We also discuss the challenges faced by researchers attempting to conduct these projects and areas to be addressed in future projects.
NGS, high density arrays and data integration software have enabled multi-omic research
The success of large-scale genomic analysis projects is largely owed to the development of cost-effective high throughput microarrays with semi-automated analysis and the refinement of NGS technologies. Similarly, emerging epigenomic and transcriptomic data are typically generated using these approaches. Within the papers discussed in this review, platforms provided by Illumina® dominated for WGS, WES and RNA-seq data generation. These included the: More recent versions of Illumina technologies not used in this review include the NextSeq550® and the NextSeq 2000. These platforms, whilst undoubtedly very useful, rely on short read sequencing methods where the DNA is fragmented for sequencing and re-aligned against a reference genome for interpretation. Other providers of NGS less frequently seen included the Ion Proton™ System by Ion Torrent™ and the Applied Biosystems™ 5500xl Genetic Analyzer. Moving forward with multi-omic rare disease research, long read sequencing methodologies (currently commercially provided by Oxford Nanopore Technologies and Pacific Biosciences) offer significant benefits compared to short read sequencing, with Oxford Nanopore also providing ultra-long read sequencing with additional benefits for identifying molecular variation. True long read sequencing has the potential to overcome issues with amplification bias during short read library preparation (presuming the sample to be processed by long read sequencing has not already undergone amplification), errors when aligning to a reference genome due to repetitive regions, difficulties in detecting large structural or copy number variants, and inaccuracies in reference genomes themselves [141][142][143]. Furthermore, long read sequencing enables direct measurement of methylation, and RNA sequencing without the need for reverse transcription to complementary DNA (cDNA), which can introduce additional errors [144,145].
Other platforms utilised included microarrays for the detection of single nucleotide polymorphisms (SNPs) such as the Affymetrix™ Genome-Wide Human SNP Array 6.0 (Applied Biosystems™), which enables the interrogation of approximately 900,000 SNPs across the genome. For studies which included epigenomic analysis of DNA methylation, the primary microarray platforms utilised were the Illumina® Infinium methylation arrays: the HumanMethylation27 (27 K), HumanMethylation450 (450 K) and Illumina's most recent array, the MethylationEPIC (850 K) BeadChip®. These arrays were used by all but two of the studies that analysed DNA methylation; of these two, one utilised targeted bisulphite pyrosequencing of four genes with the Qiagen PyroMark Q96 MD System [65], and the other utilised enhanced reduced representation bisulphite sequencing [92]. Proteomic and metabolomic analyses largely utilised liquid chromatography with tandem mass spectrometry (LC-MS/MS) and nanoLC-MS/MS.

A standardised methodological approach to multi-omic rare disease research is needed
One challenge with multi-omic rare disease research is the large variety of methodological approaches which researchers can choose to undertake. These can complicate data analysis due to between-laboratory batch effects and a lack of independent datasets generated from the same methods, which may be needed to validate potential variants of interest. Therefore it would be valuable to develop a multi-omic analysis pipeline that can be utilised to maximise the power of rare disease studies. As discussed previously, TCGA is a multi-centre cancer genomics programme run by the National Cancer Institute and National Human Genome Research Institute, which began in 2006 and has undertaken extensive multi-omic analysis of 33 cancers including several rare cancers [35]. Sixteen of the 66 articles included in this review were studies of rare cancers conducted by TCGA [39, 43-46, 48-56, 60, 93]. The studies included in this review which followed TCGA methodology all presented with high methodological rigour, usually with large sample numbers and even a broader range of participant ethnicity, which is crucially needed in genomic analysis. With few methodological differences between research projects, these studies followed a comprehensive analytical pipeline which involved the generation and interpretation of genomic, epigenomic, transcriptomic and proteomic data, providing a powerful impetus for standardised multi-omic methodology (Additional file 1, Table S1). This enabled researchers to identify molecular relationships between cancers, cluster prognostic variants and elucidate future therapeutic targets to explore.
Table (extract): rare disease, estimated prevalence, study objectives
• Functional study which developed organoids to assess the molecular profile of CRPC-NE [92].
Unknown prevalence • Identification of novel therapeutic targets using genomic and transcriptomic analysis [108].
Non-cancerous rare diseases
Mevalonate kinase deficiency Unknown • To explain polarised phenotypic heterogeneity in siblings with the same pathogenic mutation [111].
Congenital Disorder of Glycosylation < 100 cases reported of each type [120] • Investigation of key genomic and proteomic variants associated with glycosylation disorders [121].
Congenital absence of the ACL/PCL 1.7/100,000 live births • Investigation of key genomic and proteomic variants associated with congenital ACL/PCL [126].
• Investigation of therapeutic intervention in animal models of Huntington's disease [133].
Abbreviations: TCGA, The Cancer Genome Atlas; ENB, esthesioneuroblastoma; R-GBM, rhabdoid glioblastoma; IGCTs, intracranial germ cell tumours; FL-HCC, fibrolamellar hepatocellular carcinoma; USC, uterine serous carcinoma; UCS, uterine carcinosarcoma; SCCOHT, small cell carcinoma of the ovary hypercalcemic type; VSCC, vulvar squamous cell carcinoma; PCCs, pheochromocytomas; PGLs, paragangliomas; PUCA, primary urethral clear-cell adenocarcinoma; SPPC, small cell prostate cancer; CRPC-NE, castration resistant neuroendocrine prostate cancer; ChRCC, chromophobe renal cell carcinoma; TLFRCC, thyroid-like follicular renal cell carcinoma; MNTI, melanotic neuroectodermal tumour of infancy; FSHD, facioscapulohumeral muscular dystrophy; ICF1, immunodeficiency, centromere instability and facial anomalies syndrome; IPEX, immune dysregulation polyendocrinopathy enteropathy X-linked; PID, primary immunodeficiency disorder; TBS, Townes-Brocks syndrome; ACL/PCL, anterior/posterior cruciate ligaments; SNS, Snyder-Robinson syndrome; HPE, holoprosencephaly; PUV, posterior urethral valves

Furthermore, much of this anonymised data is publicly available on the Genomic Data Commons Data Portal for future research projects to utilise [146], and three non-TCGA cancer studies included in this review reported using this public data to overcome the rarity of their studied cancer type, to confirm cell ontology and even simply as a comparative control for their own gene expression data [69,70,99]. An additional point of interest was the computational algorithms used to overcome the statistical challenge of data integration in multi-omic studies of rare disease. Approaches to the analysis of multi-omic data vary depending on research group preferences and bioinformatic experience, with many choosing to simply analyse each 'omic' dataset independently and identify overlapping molecular variation within top ranked genes (e.g. genes which show differentially methylated CpG sites from microarray analysis that correlate with differential gene expression from mRNA sequencing). However, this approach can miss variants whose biological significance is not immediately clear, for example a relationship between differentially methylated genes which could indirectly impact downstream protein production. For those with bioinformatic expertise, integration of multi-omic data largely falls into three categories: early data integration, late data integration and statistical data integration, with a comprehensive description and examples of each provided by Rappoport and Shamir, 2018 [147]. Early data integration involves combining features from single omic datasets (concatenation) to output a single matrix representing the features from the multi-omic datasets for each participant, e.g. an autoencoder, which has been used to integrate data from three omic analyses (DNA methylation, RNA-Seq and miRNA) for analysis of liver cancer [148]. Late data integration conducts clustering of related variants within each single omic analysis and then integrates the clusters from the single analyses, for example the Cluster-Of-Cluster-Assignments (CoCA) algorithm, which looks across multiple omic analyses to define subclasses whilst removing the need for data normalization prior to clustering and weighting each analysis type so that large platforms do not dominate results (e.g. 450 K array compared to reverse phase protein array) [149]. Another example of a late data integration tool is similarity network fusion (SNF), which develops a network of patient level data rather than individual clusters, enabling prognostic prediction [150]. Finally, statistical algorithms infer the most probable clusters within multi-omic datasets, for example PARADIGM, which infers associations of molecular variants with patient phenotype by incorporating pathway activity and inactivity data [151]. Multi-Omics Factor Analysis (MOFA) is a second example of a statistical multi-omic data integration tool and an unsupervised model for identification of biological and technical variability [152].
In this scoping review we found that studies utilising multi-omic specific software to facilitate data integration across omics platforms comprised just 11% of included articles (7 studies). These studies used three algorithms developed by TCGA: (1) CoCA consensus clustering (described above); (2) iCluster, an integrative multivariate regression clustering algorithm which looks across several data types (DNA methylation, copy number variants (CNVs), mRNA and miRNA) to identify molecular patterns (also an example of a data integration technique which spans the criteria of both early and statistical integration); and (3) PARADIGM (described above) [39,43,45,46,49,60]. In addition to studies which utilised TCGA-specific algorithms, the only other study included in this review which discussed a bioinformatic pipeline for multi-omic data integration was in vulvar carcinoma [81], in which the researchers used the CONEXIC algorithm to combine CNV and gene expression data to construct hypothesised regulatory networks, providing a ranked score which indicates how well a particular variant predicts module behaviour, with high scores indicating high tumour adaptive advantage.
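To make the distinction between these strategies concrete, the following minimal Python sketch contrasts independent analysis (overlap of top-ranked genes), early integration (concatenation of standardised features) and a late, cluster-of-clusters style comparison of per-omic cluster assignments. It is an illustration only and is not the implementation of CoCA, iCluster or PARADIGM used by TCGA. All function names, gene labels and values are hypothetical, and standardising each feature before concatenation is an assumption made here.

```python
from statistics import mean, pstdev


def overlapping_top_genes(ranked_per_omic, top_n):
    """Independent analysis: genes found in the top_n of every omic ranking."""
    return set.intersection(*(set(ranking[:top_n]) for ranking in ranked_per_omic))


def zscore_columns(block):
    """Standardise each feature (column) of a samples x features table."""
    scaled_columns = []
    for column in zip(*block):
        mu, sd = mean(column), pstdev(column)
        sd = sd or 1.0  # constant features become zeros instead of dividing by 0
        scaled_columns.append([(value - mu) / sd for value in column])
    return [list(row) for row in zip(*scaled_columns)]


def early_integration(blocks):
    """Early integration: concatenate standardised omic blocks sample by sample."""
    scaled_blocks = [zscore_columns(block) for block in blocks]
    return [sum(rows, []) for rows in zip(*scaled_blocks)]


def coassignment(labels_per_omic):
    """Late integration: fraction of omic layers in which each pair of samples
    was placed in the same cluster (clusters are computed per omic beforehand)."""
    n_samples = len(labels_per_omic[0])
    n_omics = len(labels_per_omic)
    return [
        [sum(labels[i] == labels[j] for labels in labels_per_omic) / n_omics
         for j in range(n_samples)]
        for i in range(n_samples)
    ]


# Hypothetical toy data: four participants, two omic layers.
methylation = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
expression = [[5.0, 1.0, 2.0], [6.0, 1.5, 2.0], [1.0, 7.0, 2.0], [1.5, 6.0, 2.0]]

shared = overlapping_top_genes([["GENE_A", "GENE_B", "GENE_C"],
                                ["GENE_B", "GENE_A", "GENE_D"]], top_n=2)
combined = early_integration([methylation, expression])  # 4 samples x 5 features
agreement = coassignment([[0, 0, 1, 1], ["a", "a", "b", "a"]])
```

In this toy example the co-assignment matrix could itself be clustered to define integrated subclasses, which is the step a consensus clustering method would perform. Because each omic layer contributes equally to the matrix regardless of how many features it measured, a large platform does not dominate a small one.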
The pipelines for laboratory and computational analysis described here focus primarily on cancer. Development of a similar integrative workflow for non-cancerous rare diseases, coupled with international collaboration to increase sample size, would be useful to increase pathogenic variant identification, diagnostic yield and development of a defined care pathway. Figure 5 illustrates a workflow which could be utilised for the planning and implementation of multi-omic rare disease studies when considering study design, selection of biological material for common omic analysis, data integration and reporting of findings to patients. Furthermore, there is a need for discussion on ensuring that the data we generate is publicly available, whilst protecting patient confidentiality, to enable large-scale collaborative efforts, which would be particularly helpful for diagnosing currently undiagnosed patients. Such a discussion and development of resources should involve continuous consultation with patients and their family members [16,153].
Multi-omics can provide a diagnosis to previously undiagnosed patients with rare phenotypes
Escaping the diagnostic odyssey is a major hurdle many patients with rare diseases face. This review highlighted studies which specifically intended to utilise multi-omics for provision of a diagnosis to previously undiagnosed rare phenotypes. Whilst most articles included in this paper sought to identify disease driving mutations, which could themselves be further investigated to elucidate a definitive molecular diagnosis, just three studies were identified with the specific aim of providing a diagnosis for patients with previously undiagnosed rare diseases through multi-omic analysis [58,124,136], as well as a further two conference abstracts for which no full text was available [134,154].
Through a combination of comprehensive WES, chromosome microarray (CMA), linkage analysis and mRNA analysis, one study aimed to provide a diagnosis for a combination of complex undiagnosed phenotypes (non-syndromic hearing loss (NSHL), aberrant skeletal phenotypes and significant developmental delay) in four individuals [124]. WES identified a recessively inherited splice variant in PDZD7 (c.226+2_226+5delTAGG) likely to explain the NSHL phenotype, which was confirmed through mRNA analysis to inhibit gene expression in affected individuals, as no PDZD7 exons were amplified. Furthermore, the developmental delay and microcephaly phenotype was explained via CMA through identification of a de novo unbalanced translocation involving chromosomes 8 and 18. The skeletal phenotype was associated with an autosomal dominantly inherited variant in COL1A1, which led to a diagnosis of osteogenesis imperfecta. This study reflects the phenotypic heterogeneity that is often present in undiagnosed rare diseases and demonstrates the utility of providing comprehensive genomic analysis with additional confirmatory mRNA analysis to maximise diagnostic yield. A second study aimed to utilise WGS, protein and mRNA analyses to aid the diagnosis of a family with heterogeneous myopathic and neurogenic phenotypes [136], uncovering five likely pathogenic exonic variants. Of these, a single mutation in the gene NEFL (c.1261C>T; p.R421X, associated with truncated NEFL protein), which has previously been associated with Charcot-Marie-Tooth disease, was identified in all affected family members. This study is an excellent example of the power of multi-omic analysis to provide a molecular diagnosis for patients with rare undiagnosed phenotypes, while expanding on a previously known clinical phenotype. Both of the above studies utilised genomic analyses complemented by a form of transcriptomic analysis.
Finally, a third small case-control study of eight patients, from four unrelated families, utilised WES and global metabolomics to identify diagnostic biomarkers of the rare disease mitochondrial aconitase deficiency [58]. The research team identified 758 metabolomic features with a minimum fold change of 1.5 between cases and controls, including α-ketoglutarate, which was reduced 4.3-fold in ACO2-deficient patients and is thus likely to contribute to the pathogenic phenotype. This study is the first to report a diagnostic biomarker of mitochondrial aconitase deficiency using multi-omic technologies.
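The fold-change criterion used here can be illustrated with a short Python sketch. It assumes that the fold change is the ratio of mean feature intensity in cases to that in controls, and that the 1.5 threshold applies symmetrically to increases and decreases; neither detail is specified in this review. The feature names and intensities are hypothetical, chosen so that the first feature shows a 4.3-fold reduction.

```python
from math import log2
from statistics import mean


def fold_change_hits(cases, controls, min_fold=1.5):
    """Return {feature: signed fold change} for features changed by >= min_fold.

    Positive values mean higher in cases; negative values mean reduced in cases.
    Intensities are assumed to be strictly positive.
    """
    hits = {}
    for feature in cases:
        ratio = mean(cases[feature]) / mean(controls[feature])
        if abs(log2(ratio)) >= log2(min_fold):
            hits[feature] = ratio if ratio >= 1 else -1 / ratio
    return hits


# Hypothetical intensities for two features in two cases and two controls.
cases = {"alpha-ketoglutarate": [10.0, 10.0], "metabolite_x": [50.0, 54.0]}
controls = {"alpha-ketoglutarate": [43.0, 43.0], "metabolite_x": [50.0, 50.0]}

hits = fold_change_hits(cases, controls)
```

Working on the log2 scale treats a 1.5-fold increase and a 1.5-fold decrease as equally large changes, which a threshold on the raw ratio alone would not.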
Further to the above studies, two conference abstracts were identified which also briefly discussed the utility of multi-omics for diagnosis of rare neuro-metabolic diseases [134,154]. The first of these described utilising WES and metabolomic analysis of undiagnosed neuro-metabolic diseases in 59 individuals, with a diagnostic yield of 43% [134]. Unfortunately, no full text is currently available for this article and the abstract provides little detail on the target genes and metabolites identified. The second conference abstract also described the application of WES and metabolomics to aid diagnosis of complex rare phenotypes including neuro-metabolic diseases [154]. The researchers reported diagnosis of 179/500 previously undiagnosed individuals, with 8% of this diagnostic yield originating from metabolomic analysis alone, suggesting that a combined omics approach to diagnosis is indeed capable of greater diagnostic yield than WES alone.

Pathogenic and prognostic markers can be identified by integration of multi-omic datasets
Elucidation of disease driving molecular profiles through integrative multi-omic analysis, most commonly genomic, epigenomic, transcriptomic, proteomic and metabolomic analysis, was the primary focus of most articles included in this review. Whilst it would be impractical to discuss each of the driver mutations identified for each rare disease studied within this review across all studied omic types, a comprehensive overview of pathogenic variants identified in individual studies is available for reference in Additional file 1: Table S1 (key results column). One example of note was a study of mevalonate kinase deficiency, a recessively inherited auto-inflammatory disorder with multiple organ involvement. The spectrum of clinical presentations includes hyperimmunoglobulinemia D syndrome, periodic fever syndrome and mevalonic aciduria [155]. This study conducted WES, RNA-Seq and differential protein analysis in a case study of two sisters presenting with polarised phenotypic heterogeneity: both harboured a known driver homozygous mutation in MVK, but only one sibling presented with disease symptoms [111]. The integrative multi-omic analysis identified a rare mutation in the modifier gene STAT1 resulting in upregulated mRNA and protein expression, likely responsible for the phenotype in the affected sister. Single omic analysis alone was insufficient to detect this mutation, which exemplifies why multi-omic studies are crucial in identifying a cause for rare diseases and explaining the phenotypic heterogeneity which can complicate patient care.
The identification of prognostic biomarkers through multi-omic analysis was highlighted in several studies of rare cancer including adrenocortical carcinoma (ACC), sarcoma, uveal melanoma and pseudomyxoma peritonei. Distinct prognostic groups of ACC were discussed in three of the five included ACC studies [39,64,65], including three prognostic molecular subtypes of ACC clustered by DNA methylation profile with 92.4% accuracy [39] and two clusters defined by DNA methylation changes with frequent gene mutations (poor prognosis) and miRNA regulation (good prognosis) [64]. A third study reported increased accuracy of prognostic prediction using multi-omic data compared to single omic analysis, specifically through the integration of several somatic variants, pathway analysis and differential methylation [65]. In a large case-control study of six different sarcoma subtypes, three prognostic clusters were identified through integration of somatic CNAs and DNA methylation data in dedifferentiated liposarcoma, in which the first two groups (JUN amplified and TERT amplified with chromosome instability) had a worse survival rate than the third cluster (6q25.1 amplified and fewer unbalanced chromosome segments), with JUN identified as a potential therapeutic target because its overexpression has previously been shown to increase tumour migration and invasion [49]. In a study of uveal melanoma, four molecularly distinct groups were identified with differences in prognostic outcomes: two associated with poor prognosis (monosomy 3) and two with better prognosis (disomy 3) [43]. Finally, one study of pseudomyxoma peritonei (a very rare form of appendix cancer) showed that aberrant p53 staining reflected a worse overall survival in patients compared to normal p53 staining (19% compared to 80% five-year survival) [90].

Figure 5 Proposed workflow for multi-omic analysis of rare diseases. To conduct an impactful study of multi-omics and rare disease, careful planning from study conceptualisation is crucial.
Multi-omic analysis can identify both novel treatments and drug re-purposing opportunities
Care for patients with rare diseases often relies on symptom management, rather than treatment of the underlying cause, with limited therapeutic options available. The most frequent age group of participants in the studies included in this review was between 0 and 10 years, followed by a significant drop-off until a second peak at the age group of 41-50 years, which stresses the need for early diagnosis and therapeutic intervention to improve survival and quality of life for these children. Therefore, it is unsurprising that identification of promising novel therapeutic targets through multi-omic analysis was a consistently observed research aim across studies of non-cancerous rare diseases and rare cancers.
For example, one study identified 156 differentially methylated genes in ACC, including hypermethylation of CYP1B1, which was shown to have sensitivity to the methylation inhibitor decitabine in an ACC cell line. Furthermore, the same study found that cell proliferation occurred in the mutated genes GATA6, G0S2, MEIS1, NCOA7, KCTD12, FAM1156A following treatment with the oncology drug oncostatin M [41]. However, even where novel therapeutic targets are identified as excellent candidates for clinical research, the expense of trials often results in pharmaceutical companies refusing to test and produce a novel drug. For those novel drugs fortunate enough to be deemed worth the financial investment, the average timeframe from experimentation to clinical implementation is 12 years [156]. Therefore, re-purposing of drugs already approved for use in a different disease has become an increasing focus of the search for therapeutics in rare disease, in particular for precision oncology medicine [157]. This scoping review found that identification of drug re-purposing opportunities to overcome the lack of treatment availability was a strong recurrent theme for the multi-omic analyses of rare cancer studies. For example, the drug Ponatinib, which is normally used to treat leukaemia, was identified as a potential drug re-purposing opportunity for small cell carcinoma of the ovary hypercalcaemic type, through integrated proteomic and transcriptomic analysis with functional cell-line and animal models [42]. Additionally, 16 potential novel ACC drug targets were identified, with varying degrees of evidence for drug targeting of these genes in other cancers: CDK4, NOTCH1, NF1, MDM2, EGFR, BRCA1, BRCA2, ATM, BRAF, PTCH1, TSC1, TSC2, KIT, RET, ESR1, EZH2 [65]. With this in mind, it would be useful to explore opportunities for drug re-purposing via multi-omic analysis for non-cancerous rare diseases also.

Conclusions
This scoping review highlights the rapid increase of multi-omic studies of rare diseases in the past decade, reflecting how the advent of NGS and high-density arrays has enabled multi-omic analysis. We have also highlighted in this review that the most frequent age group of participants was 0-10 years. This is concordant with the life expectancy of less than five years for a third of all rare disease patients, and emphasises the importance of early diagnosis and implementation of a defined care pathway involving optimised treatment, rather than symptom management alone, which multi-omic analyses can help to provide. Taken together, the discussed themes emphasise the need for the development of a standardised pipeline, to ensure unbiased and accurate reporting of biomarkers, as well as international collaboration to address the low participant numbers and biased participant ethnicity numbers which limit the power of rare disease research studies. The workflow provided in this review will be useful for researchers planning multi-omic studies of rare disease, whether for cancer or non-cancerous conditions. Projects such as the previously mentioned 100,000 Genomes Project and, moving forward, the Five Million Genomes project, as well as the NIH UDP and the IRDiRC, provide a platform for multi-omic analysis and are therefore fundamental for the future of rare disease research.