Next generation phenotyping using narrative reports in a rare disease clinical data warehouse

Background Secondary use of data collected in Electronic Health Records opens perspectives for increasing our knowledge of rare diseases. The clinical data warehouse (named Dr. Warehouse) at the Necker-Enfants Malades Children’s Hospital contains data collected during normal care for thousands of patients. Dr. Warehouse is oriented toward the exploration of clinical narratives. In this study, we present our method to find phenotypes associated with diseases of interest. Methods We leveraged the frequency and TF-IDF to explore the association between clinical phenotypes and rare diseases. We applied our method in six use cases: phenotypes associated with the Rett, Lowe, Silver Russell, Bardet-Biedl syndromes, DOCK8 deficiency and Activated PI3-kinase Delta Syndrome (APDS). We asked domain experts to evaluate the relevance of the top-50 (for frequency and TF-IDF) phenotypes identified by Dr. Warehouse and computed the average precision and mean average precision. Results Experts concluded that between 16 and 39 phenotypes could be considered as relevant in the top-50 phenotypes ranked by descending frequency discovered by Dr. Warehouse (resp. between 11 and 41 for TF-IDF). Average precision ranges from 0.55 to 0.91 for frequency and 0.52 to 0.95 for TF-IDF. Mean average precision was 0.79. Our study suggests that phenotypes identified in clinical narratives stored in Electronic Health Record can provide rare disease specialists with candidate phenotypes that can be used in addition to the literature. Conclusions Clinical Data Warehouses can be used to perform Next Generation Phenotyping, especially in the context of rare diseases. We have developed a method to detect phenotypes associated with a group of patients using medical concepts extracted from free-text clinical narratives. Electronic supplementary material The online version of this article (10.1186/s13023-018-0830-6) contains supplementary material, which is available to authorized users.


Background
The global trend toward digital health in the US and in Europe has led to an unprecedented adoption of Electronic Health Records (EHRs). By the end of 2014, 83% of US physicians [1] and 75% of hospitals [2] used some form of EHRs. The increasing number of EHRs opens strong perspectives for the secondary use of data collected during the care process. Many hospitals are now equipped with Clinical Data Warehouses (CDW) integrating all the data produced during the care of the patients for research purposes [3][4][5]. CDWs gather a large variety of information, ranging from structured data (e.g. diagnosis codes, laboratory test results…) to free-text clinical narratives and images. Structured data include coded data using terminologies like the International Classification of Diseases, and questionnaires that provide precise, standardized but somehow limited information. Conversely free-text reports are produced without constraints and may be used to express nuanced, unexpected, and unexplained signs or symptoms regarding the patient case.
Clinical narratives collect information from all aspects of the patient care that might not be collected anywhere else in clinical information system including history of the disease, family history, fine-grained description of all the symptoms, hypothesis of diagnosis or treatment, information from treatments received outside of the hospital, and so forth. Previous studies in different contexts showed the importance of free-text in EHRs. For example Raghavan et al. identified that unstructured data were essential to solve trial criteria from two studies. [6]. The value of text data is even more important to detect phenotypes in specialized hospitals treating patients with rare diseases and for outpatients, for whom clinical information is barely coded [7].
Rare diseases represent a large group of heterogeneous conditions and some cases remain undiagnosed for a long time. A precise phenotypic description of such diseases can be problematic given the small number of cases and the heterogeneity of the phenotypes. Leveraging large CDWs, could be helpful to enrich this description. While structured (standardized) questionnaires exist for several rare diseases (e.g., in France [8,9]), part of the clinical description is still present only in free text in EHRs. We hypothesized that mining large collections of clinical texts in hospitals specialized in rare diseases could offer interesting perspectives to enrich the descriptions provided by dedicated knowledge bases. We investigated this hypothesis at the Necker Enfants Malades Hospital (Necker Children Hospital), a children's hospital in Paris that is associated with the Imagine research institute, specialized in genetic diseases, and hosts 15 national reference centers for rare diseases. We illustrate our approach on six rare diseases: DOCK8 deficiency, the Activated PI3-kinase Delta Syndrome (APDS), Rett, Lowe, Silver Russell and Bardet Biedl syndromes. The combined immunodeficiency due to DOCK8 deficiency (prevalence less than 1/1,000,000) is a form of autosomal recessive combined immunodeficiency (T, B and NK cells), characterized by recurrent lung infections, cutaneous viral infections, allergy, severe skin inflammation and susceptibility to cancer with a high level of IgE [10]. DOCK8 deficiency is caused by homozygous or compound heterozygous mutations in DOCK8 gene [11].
Rett syndrome (estimated prevalence 1/15,000) is characterized by a rapid regression in language and motor skills (i.e. repetitive, stereotypic hand movements) after six to eighteen months of normal psychomotor development [13].
The Lowe syndrome or Oculocerebrorenal syndrome (estimated prevalence 1 to 9 /1,000,000) is a multisystem disorder characterized by congenital cataract, intellectual disabilities, glaucoma, postnatal growth retardation and renal tubular dysfunction [14].
From now on, we will refer to as phenotype any sign or symptom, disease, defects, and so forth, affecting a patient.
In this study, we present the methods that we developed to extract phenotypes associated with rare diseases from clinical texts in Dr. Warehouse® (DrWH), the clinical data warehouse of the Necker Children's hospital. Then, we evaluate the scalability of our approach in the context of high throughput phenotyping.

Material
All data were collected from the Necker Enfants Malades Hospital (Necker Children Hospital), a pediatric University hospital belonging to the Assistance Publique Hôpitaux de Paris group (400 pediatric beds, 200 adult beds). The Necker hospital is a national reference center for rare and undiagnosed diseases. The hospital hosts the Imagine Institute, a research institute focused on genetic diseases. Imagine institute has been developing since 2015 a document-based open-source clinical data warehouse oriented toward free-text: Dr. Warehouse® (DrWH). DrWH includes a full text search engine, and contains, as of August 2017 more than 3.9 million clinical free-text documents for more than 446,000 patients.
In Table 1, we describe the demographic characteristics of the patients included in DrWH. We used all the clinical narratives, ranging from hospitalization to outpatient visits reports, available in DrWH to perform this study. The heterogeneity of the records is illustrated in Table 2 with the distribution of these records by hospital departments and type of reports.
A demonstration version of DrWH is publicly available at the URL: https://imagine-plateforme-bdd.fr/dwh_pubmed/. Note that for privacy reasons, this demo version has been In brackets lower and upper quartile populated with data from PubMed abstracts and not with patient data.
To represent the phenotypes, we used the terminologies from the Unified Medical Language System® (UMLS [17]). The UMLS is considered the lingua franca of medical vocabularies. The UMLS has a large coverage of biomedical vocabularies mostly in English. The UMLS is assembled by integrating 153 medical vocabularies, including generalist terminologies (e.g. MeSH or SNOMED CT), or specialized ones (e.g. the Human Phenotype Ontology -HPO, OMIM, the Gene Ontology). The UMLS Metathesaurus® contains about 3.2 million concepts identified by their unique identifier (the Concept Unique Identifier: CUI). A concept is a cluster of synonymous terms coming from various source vocabularies (13+ million of synonymous terms). In the UMLS the creation of concepts is semi-automatic. For example, the Rett Syndrome CUI is C0035372, this concept is made of terms provided by 145 terms from 50 terminologies. Hierarchical relations or other types of relations are extracted from the source terminologies and included in the UMLS. The UMLS Semantic Network is a much smaller network of 133 semantic types (e.g. Disease or Syndrome, Anatomical Abnormality…). Each Metathesaurus concept is assigned at least one semantic type. The UMLS integrates mostly terms in English, but other language such as French have a non-negligible coverage (397,203 terms).
Our source for reference data was Orphanet, an online resource gathering and integrating knowledge on rare diseases. Orphanet was established in France in 1997 and became a European initiative now involving a consortium of 40 countries in Europe and the rest of the world. Orphanet data are organized using ontologies and structured data [18]. Orphadata is a partial extraction of the data stored in Orphanet freely accessible and organized as XML files [19]. Orphanet proposes a vocabulary for rare diseases. Experts and terminologists have identified synonymous terms associated with disease. Orphanet also provides mappings between Orphanet concepts and a variety of other terminologies (e.g. HPO) to enable interoperability.
Orphanet is dedicated to a specific domain, much narrower than the UMLS but highly specialized and manually curated. In addition, the Orphanet vocabulary has been translated into other languages (including French). The HPO is integrated with the UMLS (as terms of UMLS concepts), and is mapped to Orphanet concepts. Therefore, HPO can serve as a pivot between the two vocabularies.
All the terminologies described above are mainly constituted of English terms, see the related work section of the discussion for further comments on non-English text processing.

Methods
In this study, we aim at using automated methods to extract phenotypes from the narrative reports. For this purpose, we mined the large body of text documents available in the CDW. This section describes the free-text document processing to automatically extract phenotypes from the narrative reports, and details the exploration of phenotypes associated with six use cases.
Processing text-documents.
In a nutshell, we leveraged the UMLS to extract phenotypical terms from patients' text reports. We selected the 397,203 terms (including synonyms) available in French in the UMLS Metathesaurus (version 2017AA) and filtered out terms having less than three characters, or more than 80 characters. To limit the concepts extraction to a phenotypic description, we considered only the concepts assigned to one of the following semantic types: 'Sign or Symptom' , 'Disease or Syndrome' , 'Finding' , 'Pathologic Function' , 'Congenital Abnormality' , 'Physiologic Function' , ' Anatomical Abnormality' , 'Neoplastic Process' , ' Acquired Abnormality' and 'Mental or Behavioral Dysfunction'. Finally, we obtained 91,533 terms. In the remainder of the manuscript, we will refer to these terms as phenotypical concepts or UMLS concepts. We extracted the phenotypes from every text reports through simple terms matching, case insensitive, and insensitive to non-alphanumerical characters (e.g. spaces, parenthesis, dash etc.). In the context of rare and undiagnosed diseases, clinical narratives are likely to contain many sentences expressing the absence of phenotypes (e.g. "Clinical examination does not support a finding of lupus", "absence of diabetes") or describing the family history of the patient (e.g. "the mother has asthma"). Therefore, detecting negation and family history context was essential to exclude these phenotypes from the high throughput phenotyping. We used trigger terms to determine if a phenotype was associated to negated meaning (e.g. "none", "absence" etc.) or family history context (e.g. "cousin", "brother", "sister" etc.). To compute this extraction, we developed an algorithm similar to Context [20,21], and adapted to French [22] (see Fig. 1).
In this study, we considered exclusively the not negated phenotypes associated with the patients (i.e. not associated with their family).

Use cases: Exploring phenotypes of rare disease patients
We created six groups of patients associated with a specific disease. We queried DrWH at Necker hospital using Rett Syndrome (and not atypical Rett syndrome), Lowe, Silver Russell, Bardet Biedl, DOCK8 deficiency and APDS as search criteria. We obtained six sets of patients and their associated corpora of clinical documents (RETT set, LOWE set, SILVER RUSSELL set, BARDET BIEDL set, DOCK8 deficiency set, and APDS set). For each patient set, we extracted all the phenotypes as detailed in the previous section (see Fig. 2).
To rank the extracted phenotypical concepts in terms of relevance, we used two metrics (the frequency and the "term frequency-inverse document frequency" -TF-IDF) classically used in the context of information retrieval. For example, our method identified 1022 distinct phenotypical concepts in the "RETT syndrome" set.
Computing Frequency and TF-IDF: -  Fig. 1 Overview of the method applied to extract phenotypes from the narrative reports more frequent is a phenotype in the population, the less specific it is for a given patient set. Conversely finding several occurrences of a rare phenotypical concept in a single patient set highlights the potential interest of this term for this data set. For example, the TF-IDF of the concept stereotypy in the "Rett syndrome" result set is 0.081 and is computed as follows: We stored the number of relevant phenotypes, and their associated ranks. Based on the experts' feedbacks, we computed the Average Precision for each query, and the overall Mean Average Precision. The average precision expresses the correctness of the top ranked results for a query. The Mean Average Precision evaluates the average precision across a series of queries [23].

Concepts aggregation and ranking
For each patient of the DWH Phenotype associations: Fig. 2 Overview of the method applied to perform next generation phenotyping

Comparison to Orphadata
For each disease set we compared the phenotypical concepts obtained by our method with those in Orphadata with the following steps. We leveraged HPO to map Orphadata and the UMLS (Orphanet is mapped to HPO, and HPO is integrated in the UMLS). We calculated the number of equivalent phenotypical concepts and the number of phenotypical concepts present in only one of the data sources (i.e. DrWH or Orphadata). The phenotypical concepts were considered equivalent (i) in case of exact mapping (same identifier) or (ii) when a broader phenotype was found (in Orphadata: Arrhythmia, in our extraction: Cardiac flutter). The steps are illustrated with the example of Rett syndrome in Fig. 3.

Document processing in DrWH
We extracted a total of 18.7 million phenotypical terms from 3.9 million medical records, representing 446,481 distinct patients. Among these terms, 4% were related to family history. Among the 96% of the remaining terms, 72% were classified into as not negated expression (12.99 million of phenotypes) ( Table 3).

Detailed expert evaluation
The description of the data available in each cohort and the evaluation by the experts are detailed in Table 4. The Fig. 4 is a screenshot of the graphical user interface of Dr. Warehouse for Rett syndrome. The automated phenotyping identified an average of 768 phenotypical concepts associated to each disease. In contrast, the number of UMLS concepts found in Orphadata ranges from 16 for the Silver-Russell syndrome to 120 for the Lowe syndrome. APDS was not documented in Orphanet at the time of redaction of this article. Overall, the experts classified between 11 (SILVER-RUSSELL set, ranked by TF-IDF) and 41 (LOWE set, ranked by TF-IDF) of the top-50 results as relevant to the disease. The number of phenotypical concepts identified by the union of results obtained through ranking by frequency and ranking by TF-IDF ranges from 16 (SILVER-RUSSELL set) to 52 (DOCK8 deficiency and APDS sets).
The Mean Average Precision was 0.79 for results ranked by Frequency and 0.75 for results ranked by TF-IDF. An additional file shows in detail the Top50 phenotypical concepts extracted for each cohort [see Additional file 1].

Comparison with Orphadata
The comparison with Orphadata is detailed in Table 5. The limitation to French terms resulted in a reduction of an average of 16 phenotypes, corresponding to an average of 39% of the UMLS concepts (max: 63%, min 21%).
We obtained the best coverage for the SILVER RUSSEL set with 100% of the Orphadata phenotypes present in the phenotypes of the patient set. The lowest coverage was  Fig. 3 Evaluation procedure for the RETT set found for the LOWE set with 66% of the 76 Orphadata phenotypes present in the phenotypes of the patient set. The average coverage for all the patient sets was 78%.
In the six diseases studied, 2.8% of Orphadata concepts do not belong to the semantic types used for the automated phenotyping. For example, "Dislocated hips" (HP:0002827) is part of the description of Lowe syndrome in Orphadata and is assigned to the semantic type "Injury or Poisoning" in the UMLS.
Among the phenotypical concepts in Orphadata, 41 are not represented in the patient sets phenotypical concepts.

Findings and practical significance
Our method of automated extraction of phenotypes from narrative reports in a clinical data warehouse can be useful even in the context of rare disease with a low number of patients. Indeed, 4 of the 6 examples showed results above 83% in terms of Average Precision in the top-50 phenotypes based on the experts' evaluation. It was above 49% for the 2 remaining examples. It means that the extracted phenotypes are meaningful and can be used as baseline in diverse situations: enrichment of the phenotypic description of diseases, rapid exploration of the phenotypes in a population, or assisting the experts in the identification of phenotypes of interest.
Our approach can be used to enrich existing phenotypic description of rare diseases. For example, osteoporosis was significantly associated with Rett syndrome in the Necker data warehouse. The association is present neither in Orphanet nor in OMIM. It is however described in six articles in Medline [24][25][26][27][28][29].
Moreover, this method enables a quick exploration of phenotypes in a population. This feature is especially meaningful in the context of rare diseases for which the information may be scarce. In a research context, we have shown with the six examples, that our method was able to automatically display the phenotypes associated with rare diseases in a cohort of patients. The same approach could be used to look for undescribed phenotypes associated with new mutations (using gene names as a query for example or a series of patients selected manually). The Necker hospital and Imagine Institute collaborate actively to increase the knowledge on rare diseases and the phenotype explorer from the CDW is used on a daily basis by the staff to support translational research: When a geneticist discovers a new mutation, the exploration of the documents gathered from patients presenting the mutation in the CDW can support the description of the associated phenotypes. For example, the phenotypes associated to APDS 1 and 2 could provide basis for the description of the syndrome.
DrWH may also be used to assist experts in the identification of phenotypes of interest. After a careful review and comparison with other cohorts, such associations could be used to enrich online reference resources. Moreover, the method is easily reproducible, and the  comparison of phenotypes coming from a variety of clinical data warehouses can provide candidates (union of the candidate phenotypes) or reinforce the interest on specific candidate phenotypes (using the intersection of different submissions). In addition, the prevalence of signs and symptoms for a given disorder can be estimated using the frequencies provided by DrWH. Our method can provide the clinicians with an estimated prevalence of phenotypes in addition to the associations. In our running example Rett syndrome, "stereotypy" had a prevalence of 71.8%, consistent with Orphanet (Very frequent 80-99%); similarly "scoliosis" had a prevalence of 51.2%, vs. frequent (30-79%) in Orphanet. Conversely, the prevalence of "apraxia" in DrWH was 12.9%, whereas apraxia is considered very frequent (80-99%) in the Rett syndrome by Orphanet. A more precise estimation of the frequency would require considering not only single phenotypical concepts but also group of semantically close phenotypes.

Limitations
Comparison to a gold standard and, interoperability issues It was complex to perform an automated evaluation of phenotypes found by DrWH by comparison to a gold standard (e.g. Orphanet with Orphadata).
(1) The extraction of the phenotypical concepts was based on the French terms from the UMLS. However, the coverage of French term is limited compared to the extent of the English counterpart, knowing in particular that the French version of HPO was not available in the UMLS 2017AA. For example, in Orphadata the Rett syndrome is associated with 39 phenotypical concepts, of which only 31 exist in French in the UMLS ( Table 5). The  ) is a sign associated to Silver-Russell syndrome in Orphadata and is absent from the UMLS concepts extracted from the corresponding set. Nevertheless, 27 patients of the SILVER-RUSSELL syndrome set have the string "face triangulaire" in their narrative records according to a full text search, but the concept "Triangular Face" does not exist in French in the UMLS. Despite this limitation, the current version of DrWH enables nonetheless relevant explorations, and allows the discovery of phenotypes of interest. The limited coverage of French terms compared to English limits our ability to identify concepts in free-text, but is also a limitation in our evaluation (which tends to underestimate the performance of the method). The integration of new terminologies (with a French translation) provided with mapping to UMLS, or integrated in the UMLS will reduce the gap between English and non-English terms. The recent increase of interest for non-English Natural Language Processing is a step forward in that direction.
(2) The granularity between phenotypical concepts extracted from Orphadata and DrWH may differ (e.g. a very precise term can be identified in DrWH whereas a more general term is present in Orphadata). This issue cannot be addressed by a simple hierarchical reasoning given that phenotypical concepts may be related semantically, but not identical nor hierarchically linked (e.g. Hypotonia (C0026827) vs muscle weakness (C0151786)).
(3) We solely used an exact match strategy (with text normalization) to recognize phenotypical concepts in the reports. Our method does not handle terms presenting the words in a different order (e.g. renal acute injury versus acute renal injury would not match). The presence of multiple synonyms in source terminologies might limit the impact of this strategy to a certain extent. However we intend to upgrade our phenotype recognition strategy to allow more flexibility in the recognition of phenotypical concepts.
(4) Some phenotypical concepts are not in the semantic types that we used for the automated phenotyping. In Orphadata, "autoagression" is a sign associated to Rett syndrome. We found 10 patients in the RETT set with "automutilation" in their narrative reports, but this concept is in the semantic type "Injury or Poisoning".

Study population
Our warehouse hosts data produced by a children hospital, and therefore, phenotypes can be different from adult patients (for example, Alzheimer disease is not represented in pediatrics). However, patients with rare diseases may be followed-up in our institution even during adulthood, enabling an extended longitudinal data collection. Longitudinal follow-up makes it possible to observe the age of apparition of the phenotypes and reconstruct the natural history of rare diseases.

Related work Information extraction
Several approaches have been developed to recognize UMLS concepts, or terminology terms from free-text records. Savova et al. [30] developed cTAKES, an open source modular system of pipelined components combining rule-based and machine learning techniques. cTAKES aims at the extraction of information from the clinical narratives. Despite development in other languages [31,32], most of the open source clinical Natural Language Processing systems have been developed for the English language (MedLee [33], MetaMap [34], HITex [35]). Many challenges have helped to test and assess the different tools and methodologies. In non-English languages, less out-of-the-box tools and less learning datasets are available to work with text. More recently a challenge was dedicated to the extraction of information in multiple language medical documents (including French) [36].

Narrative reports versus coded data
We have shown that text exploration of clinical reports can provide phenotypes of interest. Whereas structured databases are particularly adapted for the collection of data regarding well documented diseases, clinical report based exploration enables the secondary use of data collected during care. Such approaches allow the development of learning health systems in which there is a bidirectional relation between routine care data and research. In addition, patient generated data could be integrated and mined along with the EHRs [37].
We plan to conduct additional studies by comparing our results with the French national rare diseases registry [38].

Scalability
The Phenotype Explorer of Dr. Warehouse enables the exploration of millions of clinical narratives in a simple manner. The algorithm is optimized to display the phenotype analysis of thousands of documents quickly, and limited expertise is needed to write and execute queries. The queries demonstrated in this study only took a few seconds to run, enabling a real time exploration of the data. The expert user can easily sort the associated phenotypes according to their need, depending on the use case.