

Open Access

Next generation phenotyping using narrative reports in a rare disease clinical data warehouse

Orphanet Journal of Rare Diseases 2018, 13:85

Received: 4 October 2017

Accepted: 23 May 2018

Published: 31 May 2018



Secondary use of data collected in Electronic Health Records opens perspectives for increasing our knowledge of rare diseases. The clinical data warehouse (named Dr. Warehouse) at the Necker-Enfants Malades Children’s Hospital contains data collected during normal care for thousands of patients. Dr. Warehouse is oriented toward the exploration of clinical narratives. In this study, we present our method to find phenotypes associated with diseases of interest.


We leveraged the frequency and TF-IDF to explore the association between clinical phenotypes and rare diseases. We applied our method in six use cases: phenotypes associated with the Rett, Lowe, Silver Russell, Bardet-Biedl syndromes, DOCK8 deficiency and Activated PI3-kinase Delta Syndrome (APDS). We asked domain experts to evaluate the relevance of the top-50 (for frequency and TF-IDF) phenotypes identified by Dr. Warehouse and computed the average precision and mean average precision.


Experts concluded that between 16 and 39 of the top-50 phenotypes discovered by Dr. Warehouse and ranked by descending frequency could be considered relevant (between 11 and 41 for TF-IDF). Average precision ranged from 0.55 to 0.91 for frequency and from 0.52 to 0.95 for TF-IDF. Mean average precision was 0.79. Our study suggests that phenotypes identified in clinical narratives stored in Electronic Health Records can provide rare disease specialists with candidate phenotypes that can be used in addition to the literature.


Clinical Data Warehouses can be used to perform Next Generation Phenotyping, especially in the context of rare diseases. We have developed a method to detect phenotypes associated with a group of patients using medical concepts extracted from free-text clinical narratives.


Data warehouse; Next generation phenotyping; Data mining; Rare diseases; Natural language processing


The global trend toward digital health in the US and in Europe has led to an unprecedented adoption of Electronic Health Records (EHRs). By the end of 2014, 83% of US physicians [1] and 75% of hospitals [2] used some form of EHR. The increasing number of EHRs opens strong perspectives for the secondary use of data collected during the care process. Many hospitals are now equipped with Clinical Data Warehouses (CDWs) that integrate, for research purposes, all the data produced during the care of the patients [3–5]. CDWs gather a large variety of information, ranging from structured data (e.g. diagnosis codes, laboratory test results) to free-text clinical narratives and images. Structured data include data coded with terminologies such as the International Classification of Diseases, and questionnaires that provide precise, standardized but somewhat limited information. Conversely, free-text reports are produced without constraints and may be used to express nuanced, unexpected, and unexplained signs or symptoms regarding the patient case. Clinical narratives collect information from all aspects of patient care that might not be collected anywhere else in the clinical information system, including the history of the disease, family history, fine-grained descriptions of all the symptoms, diagnostic or treatment hypotheses, information about treatments received outside of the hospital, and so forth. Previous studies in different contexts have shown the importance of free text in EHRs. For example, Raghavan et al. found that unstructured data were essential to resolve trial criteria from two studies [6]. The value of text data is even greater for detecting phenotypes in specialized hospitals treating patients with rare diseases and for outpatients, for whom clinical information is barely coded [7].

Rare diseases represent a large group of heterogeneous conditions, and some cases remain undiagnosed for a long time. A precise phenotypic description of such diseases can be problematic given the small number of cases and the heterogeneity of the phenotypes. Leveraging large CDWs could help enrich this description. While structured (standardized) questionnaires exist for several rare diseases (e.g., in France [8, 9]), part of the clinical description is still present only in free text in EHRs. We hypothesized that mining large collections of clinical texts in hospitals specialized in rare diseases could offer interesting perspectives to enrich the descriptions provided by dedicated knowledge bases. We investigated this hypothesis at the Necker Enfants Malades Hospital (Necker Children Hospital), a children’s hospital in Paris that is associated with the Imagine research institute, specialized in genetic diseases, and hosts 15 national reference centers for rare diseases. We illustrate our approach on six rare diseases: DOCK8 deficiency, the Activated PI3-kinase Delta Syndrome (APDS), and the Rett, Lowe, Silver Russell and Bardet Biedl syndromes.

The combined immunodeficiency due to DOCK8 deficiency (prevalence less than 1/1,000,000) is a form of autosomal recessive combined immunodeficiency (T, B and NK cells), characterized by recurrent lung infections, cutaneous viral infections, allergy, severe skin inflammation and susceptibility to cancer with a high level of IgE [10]. DOCK8 deficiency is caused by homozygous or compound heterozygous mutations in the DOCK8 gene [11].

The activated phosphoinositide 3-kinase-δ (PI3Kδ) syndrome (APDS) (estimated prevalence < 1 /1,000,000) is characterized by immunodeficiency and recurrent respiratory tract infections, lymphoproliferation and hypogammaglobulinemia. APDS is caused by activating heterozygous mutations in PIK3CD (APDS1) or in PIK3R1 (APDS2) [11, 12].

Rett syndrome (estimated prevalence 1/15,000) is characterized by a rapid regression in language and motor skills (i.e. repetitive, stereotypic hand movements) after six to eighteen months of normal psychomotor development [13].

The Lowe syndrome or Oculocerebrorenal syndrome (estimated prevalence 1 to 9 /1,000,000) is a multisystem disorder characterized by congenital cataract, intellectual disabilities, glaucoma, postnatal growth retardation and renal tubular dysfunction [14].

The Silver-Russell syndrome (prevalence 1–9 /1,000,000) is characterized by growth retardation with antenatal onset, characteristic facies and limb asymmetry [15].

The Bardet-Biedl syndrome (prevalence estimated at 1 to 9 /1,000,000) is a ciliopathy characterized by a combination of clinical signs including obesity, pigmentary retinopathy, post-axial polydactyly, polycystic kidneys [16].

From now on, we use the term phenotype to refer to any sign, symptom, disease, defect, and so forth, affecting a patient.

In this study, we present the methods that we developed to extract phenotypes associated with rare diseases from clinical texts in Dr. Warehouse® (DrWH), the clinical data warehouse of the Necker Children’s hospital. Then, we evaluate the scalability of our approach in the context of high throughput phenotyping.


All data were collected from the Necker Enfants Malades Hospital (Necker Children Hospital), a pediatric university hospital belonging to the Assistance Publique Hôpitaux de Paris group (400 pediatric beds, 200 adult beds). The Necker hospital is a national reference center for rare and undiagnosed diseases. The hospital hosts the Imagine Institute, a research institute focused on genetic diseases. Since 2015, the Imagine Institute has been developing Dr. Warehouse® (DrWH), a document-based, open-source clinical data warehouse oriented toward free text. DrWH includes a full-text search engine and, as of August 2017, contains more than 3.9 million clinical free-text documents for more than 446,000 patients.

In Table 1, we describe the demographic characteristics of the patients included in DrWH. To perform this study, we used all the clinical narratives available in DrWH, ranging from hospitalization reports to outpatient visit reports. The heterogeneity of the records is illustrated in Table 2, which gives the distribution of these records by hospital department and by type of report.
Table 1

Description of the population of the data warehouse at Necker hospital

Nb patients
Sex ratio (M)
Median Nb reports excluding biological reports per patient: 2 [1–6]
Median follow up (years) per patient: 0.06 [0–2]

In brackets lower and upper quartile

Table 2

Number of documents per Hospital department and per type of records

Hospital departments: Pediatric Cardiology; Adult Clinical Hematology; Metabolism-Pediatric Neurology; Nephrology Transplantations Adult; Pediatric Nephrology; Pediatric Immuno-Hematology; Pediatric Radiology; Adult Radiology; Pediatric Cardiac Surgery; Pediatric Visceral Surgery; Pediatric Orthopedic Surgery; Adult Nephrology; Anesthesia intensive care unit Adult And Pediatric; Pediatric Gastroenterology; General Pediatrics; Pediatric ear nose and throat; Pediatric Intensive Care Unit

Types of records: Discharge letter; Diagnostic Related Group; Day hospital; Medical certificate; Pathology report; Multidisciplinary consultation meeting; Staff meeting reports

A demonstration version of DrWH is publicly available online. Note that, for privacy reasons, this demo version has been populated with data from PubMed abstracts and not with patient data.

To represent the phenotypes, we used the terminologies from the Unified Medical Language System® (UMLS [17]). The UMLS is considered the lingua franca of medical vocabularies and has a large coverage of biomedical vocabularies, mostly in English. It is assembled by integrating 153 medical vocabularies, including generalist terminologies (e.g. MeSH or SNOMED CT) and specialized ones (e.g. the Human Phenotype Ontology - HPO, OMIM, the Gene Ontology). The UMLS Metathesaurus® contains about 3.2 million concepts, each identified by a unique identifier (the Concept Unique Identifier: CUI). A concept is a cluster of synonymous terms coming from various source vocabularies (more than 13 million synonymous terms). In the UMLS, the creation of concepts is semi-automatic. For example, the Rett Syndrome CUI is C0035372; this concept groups 145 terms from 50 terminologies. Hierarchical relations and other types of relations are extracted from the source terminologies and included in the UMLS. The UMLS Semantic Network is a much smaller network of 133 semantic types (e.g. Disease or Syndrome, Anatomical Abnormality). Each Metathesaurus concept is assigned at least one semantic type. The UMLS integrates mostly terms in English, but other languages such as French have a non-negligible coverage (397,203 terms).

Our source for reference data was Orphanet, an online resource gathering and integrating knowledge on rare diseases. Orphanet was established in France in 1997 and became a European initiative that now involves a consortium of 40 countries in Europe and the rest of the world. Orphanet data are organized using ontologies and structured data [18]. Orphadata is a freely accessible partial extraction of the data stored in Orphanet, organized as XML files [19]. Orphanet proposes a vocabulary for rare diseases. Experts and terminologists have identified synonymous terms associated with each disease. Orphanet also provides mappings between Orphanet concepts and a variety of other terminologies (e.g. HPO) to enable interoperability.

Orphanet is dedicated to a specific domain, much narrower than the UMLS but highly specialized and manually curated. In addition, the Orphanet vocabulary has been translated into other languages (including French). The HPO is integrated with the UMLS (as terms of UMLS concepts), and is mapped to Orphanet concepts. Therefore, HPO can serve as a pivot between the two vocabularies.

All the terminologies described above consist mainly of English terms; see the related work section of the discussion for further comments on non-English text processing.


In this study, we aim at using automated methods to extract phenotypes from the narrative reports. For this purpose, we mined the large body of text documents available in the CDW. This section describes the free-text document processing to automatically extract phenotypes from the narrative reports, and details the exploration of phenotypes associated with six use cases.

Processing text-documents.

In a nutshell, we leveraged the UMLS to extract phenotypical terms from patients’ text reports. We selected the 397,203 terms (including synonyms) available in French in the UMLS Metathesaurus (version 2017AA) and filtered out terms with fewer than three characters or more than 80 characters. To limit concept extraction to a phenotypic description, we considered only the concepts assigned to one of the following semantic types: ‘Sign or Symptom’, ‘Disease or Syndrome’, ‘Finding’, ‘Pathologic Function’, ‘Congenital Abnormality’, ‘Physiologic Function’, ‘Anatomical Abnormality’, ‘Neoplastic Process’, ‘Acquired Abnormality’ and ‘Mental or Behavioral Dysfunction’. Finally, we obtained 91,533 terms. In the remainder of the manuscript, we will refer to these terms as phenotypical concepts or UMLS concepts.
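The term selection described above can be sketched as a simple filter. This is a minimal illustration, not the DrWH implementation: the `Term` class, the `keep_term` function, the French strings and the placeholder identifiers are assumptions made for the example; only the length bounds, the ten semantic types and the Rett syndrome CUI come from the text.

```python
from dataclasses import dataclass

# The ten semantic types retained for phenotyping, as listed in the text.
PHENOTYPE_SEMANTIC_TYPES = frozenset({
    "Sign or Symptom", "Disease or Syndrome", "Finding", "Pathologic Function",
    "Congenital Abnormality", "Physiologic Function", "Anatomical Abnormality",
    "Neoplastic Process", "Acquired Abnormality", "Mental or Behavioral Dysfunction",
})


@dataclass(frozen=True)
class Term:
    """Hypothetical record for one French UMLS term (illustrative structure)."""
    cui: str
    text: str
    semantic_types: frozenset


def keep_term(term, min_len=3, max_len=80):
    """Keep terms of 3 to 80 characters assigned to a phenotypic semantic type."""
    has_valid_length = min_len <= len(term.text) <= max_len
    is_phenotypic = bool(term.semantic_types & PHENOTYPE_SEMANTIC_TYPES)
    return has_valid_length and is_phenotypic


# Toy candidates; the strings and the PLACEHOLDER identifiers are made up.
candidates = [
    Term("C0035372", "syndrome de Rett", frozenset({"Disease or Syndrome"})),
    Term("PLACEHOLDER_1", "TB", frozenset({"Disease or Syndrome"})),  # too short
    Term("PLACEHOLDER_2", "luxation de la hanche", frozenset({"Injury or Poisoning"})),
]
kept = [t for t in candidates if keep_term(t)]
print([t.text for t in kept])
```

Only the first candidate passes both filters: the second is shorter than three characters and the third carries a semantic type outside the retained list.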

We extracted the phenotypes from every text report through simple term matching that was case-insensitive and insensitive to non-alphanumerical characters (e.g. spaces, parentheses, dashes). In the context of rare and undiagnosed diseases, clinical narratives are likely to contain many sentences expressing the absence of phenotypes (e.g. “Clinical examination does not support a finding of lupus”, “absence of diabetes”) or describing the family history of the patient (e.g. “the mother has asthma”). Therefore, detecting negation and family history context was essential to exclude these phenotypes from the high throughput phenotyping. We used trigger terms to determine whether a phenotype was associated with a negated meaning (e.g. “none”, “absence”) or a family history context (e.g. “cousin”, “brother”, “sister”). To perform this extraction, we developed an algorithm similar to ConText [20, 21] and adapted to French [22] (see Fig. 1).
Fig. 1

Overview of the method applied to extract phenotypes from the narrative reports

In this study, we considered exclusively the non-negated phenotypes associated with the patients themselves (i.e. not with their family).
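The matching and context-detection steps can be illustrated with a deliberately simplified sketch. It assumes sentence-level scoping and English trigger words for readability, whereas the actual algorithm works on French text and follows a ConText-like approach whose details are not given here. The function names, the sentence splitter and the trigger sets are assumptions (the triggers are taken from the examples in the text, with “mother” borrowed from the sample sentence).

```python
import re

# Illustrative trigger terms; the real French trigger lists are not reproduced here.
NEGATION_TRIGGERS = {"none", "absence"}
FAMILY_TRIGGERS = {"cousin", "brother", "sister", "mother"}


def normalize(text):
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def extract(report, terms):
    """Return (term, context) pairs, context being 'negated', 'family' or 'patient'."""
    found = []
    # Naive sentence split: triggers only affect terms in the same sentence.
    for sentence in re.split(r"[.!?\n]+", report):
        padded = f" {normalize(sentence)} "
        tokens = set(padded.split())
        for term in terms:
            if f" {normalize(term)} " not in padded:
                continue  # exact match on whole words only
            if tokens & FAMILY_TRIGGERS:
                context = "family"
            elif tokens & NEGATION_TRIGGERS:
                context = "negated"
            else:
                context = "patient"
            found.append((term, context))
    return found


report = "Absence of diabetes. The mother has asthma. Severe scoliosis (thoracic)."
results = extract(report, ["diabetes", "asthma", "scoliosis"])
print(results)

# Only non-negated phenotypes in patient context are kept for phenotyping.
patient_phenotypes = [term for term, context in results if context == "patient"]
print(patient_phenotypes)
```

In this toy report, only “scoliosis” is retained. Giving family context precedence over negation is an arbitrary choice in the sketch; both contexts are excluded from the analysis anyway.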

Use cases: Exploring phenotypes of rare disease patients

We created six groups of patients associated with a specific disease. We queried DrWH at Necker hospital using Rett Syndrome (and not atypical Rett syndrome), Lowe, Silver Russell, Bardet Biedl, DOCK8 deficiency and APDS as search criteria. We obtained six sets of patients and their associated corpora of clinical documents (RETT set, LOWE set, SILVER RUSSELL set, BARDET BIEDL set, DOCK8 deficiency set, and APDS set). For each patient set, we extracted all the phenotypes as detailed in the previous section (see Fig. 2).
Fig. 2

Overview of the method applied to perform next generation phenotyping

To rank the extracted phenotypical concepts in terms of relevance, we used two metrics (the frequency and the “term frequency–inverse document frequency” - TF-IDF) classically used in the context of information retrieval. For example, our method identified 1022 distinct phenotypical concepts in the “RETT syndrome” set.

Computing Frequency and TF-IDF:
  • The frequency: the proportion of patients in the set who have at least one mention of the phenotypical concept of interest. For example, the frequency of the term stereotypy in the “Rett syndrome” set is 150 (number of patients having at least one mention of stereotypy in at least one document) / 209 (number of patients in the set) = 71.8%.

  • The TF-IDF (term frequency – inverse document frequency) is intended to reflect how important a phenotypical concept is to a patient set relative to the entire data warehouse. The intuition is that the more frequent a phenotype is in the whole population, the less specific it is for a given patient set. Conversely, finding several occurrences of a rare phenotypical concept in a single patient set highlights the potential interest of this term for this data set. For example, the TF-IDF of the concept stereotypy in the “Rett syndrome” result set is 0.081 and is computed as follows:

$$ \text{TF-IDF}(c)=\frac{N_c}{N_{tot}}\times \log \left(\frac{P_{tot}}{P_c}\right) $$

Nc: Number of times the phenotypical concept c is used in the set.

Ntot: Total number of phenotypical concept occurrences (not distinct) in the set.

Ptot: Total number of patients in the DWH with phenotypical concepts extracted.

Pc: Number of patients in the DWH with phenotypical concept c.

$$ \text{TF-IDF}(\mathrm{Stereotypy})=\frac{649}{18{,}538}\times \log \left(\frac{446{,}481}{2{,}233}\right)=0.081 $$
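The two scores can be reproduced with a few lines of code using the stereotypy figures quoted above. The function names are illustrative. The use of a base-10 logarithm is an inference rather than something the text states: it reproduces the reported value of 0.081, whereas a natural logarithm would give roughly 0.19.

```python
import math


def frequency(patients_with_concept, patients_in_set):
    """Share of patients in the set with at least one mention of the concept."""
    return patients_with_concept / patients_in_set


def tf_idf(n_c, n_tot, p_tot, p_c):
    """TF-IDF as defined above; base-10 logarithm assumed (it matches 0.081)."""
    return (n_c / n_tot) * math.log10(p_tot / p_c)


# Figures for the concept "stereotypy" in the Rett syndrome set, from the text.
freq = frequency(150, 209)
score = tf_idf(n_c=649, n_tot=18_538, p_tot=446_481, p_c=2_233)

print(f"{freq:.1%}")   # 71.8%
print(f"{score:.3f}")  # 0.081
```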


Manual evaluation

We considered six use cases. For each of them, a domain expert was asked to browse the highest ranked phenotypes (top-50 phenotypical concepts) found by DrWH and evaluate their relevance with regard to the disease of interest. We presented each expert with two lists of top-50 phenotypes: (i) the top-50 phenotypes ranked by descending frequency and (ii) the top-50 phenotypes ranked by descending TF-IDF. The experts classified the phenotypes as relevant or not relevant to the disease.

We stored the number of relevant phenotypes and their associated ranks. Based on the experts’ feedback, we computed the Average Precision for each query, and the overall Mean Average Precision. The Average Precision expresses the correctness of the top-ranked results for a query. The Mean Average Precision is the mean of the Average Precision across a series of queries [23].
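As an illustration of these two measures, the sketch below uses the common definition of Average Precision: the mean of the precision values computed at each rank where a relevant item appears. Normalizing by the number of relevant items found in the ranked list is an assumption, since the exact variant used in the study is defined in [23] and not restated here. The relevance judgements are toy data, not the experts’ evaluations.

```python
def average_precision(relevance):
    """Average Precision of one ranked list of True/False relevance judgements."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(judgements_per_query):
    """Mean of the Average Precision over several ranked lists (queries)."""
    scores = [average_precision(j) for j in judgements_per_query]
    return sum(scores) / len(scores)


# Toy judgements for two queries, in rank order.
query_a = [True, True, False, True, False]  # AP = (1/1 + 2/2 + 3/4) / 3
query_b = [False, True]                     # AP = (1/2) / 1

ap_a = average_precision(query_a)
map_score = mean_average_precision([query_a, query_b])
print(round(ap_a, 3), round(map_score, 3))
```

With this definition, a list whose relevant items are concentrated at the top scores higher than a list with the same number of relevant items spread further down.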

Comparison to Orphadata

For each disease set, we compared the phenotypical concepts obtained by our method with those in Orphadata using the following steps. We leveraged HPO to map Orphadata to the UMLS (Orphanet is mapped to HPO, and HPO is integrated in the UMLS). We calculated the number of equivalent phenotypical concepts and the number of phenotypical concepts present in only one of the data sources (i.e. DrWH or Orphadata). Phenotypical concepts were considered equivalent (i) in case of exact mapping (same identifier) or (ii) when one source contained a broader phenotype than the other (e.g. Arrhythmia in Orphadata and Cardiac flutter in our extraction).

The steps are illustrated with the example of Rett syndrome in Fig. 3.
Fig. 3

Evaluation procedure for the RETT set
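The comparison logic can be sketched as follows. Concepts are represented by their names rather than by identifiers, and the `broader` mapping stands in for the hierarchical relations available through the terminologies; both are simplifications made for the example. The concept sets are toy data assembled from examples mentioned in this article and do not describe any single disease.

```python
def covered_concepts(orphadata, extracted, broader):
    """Orphadata concepts matched exactly or via a narrower extracted concept.

    `broader` maps an extracted concept to the set of its broader concepts.
    """
    covered = set()
    for concept in orphadata:
        exact = concept in extracted
        via_narrower = any(concept in broader.get(e, set()) for e in extracted)
        if exact or via_narrower:
            covered.add(concept)
    return covered


# Toy data: one exact match, one match through a broader concept, one miss.
orphadata = {"Arrhythmia", "Scoliosis", "Triangular face"}
extracted = {"Cardiac flutter", "Scoliosis", "Osteoporosis"}
broader = {"Cardiac flutter": {"Arrhythmia"}}

covered = covered_concepts(orphadata, extracted, broader)
coverage = len(covered) / len(orphadata)
only_in_orphadata = orphadata - covered

print(sorted(covered), f"{coverage:.0%}", sorted(only_in_orphadata))
```

The coverage percentage computed here corresponds to the ratio of matched Orphadata concepts to all Orphadata concepts for a disease.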


Document processing in DrWH

We extracted a total of 18.7 million phenotypical terms from 3.9 million medical records, representing 446,481 distinct patients. Among these terms, 4% were related to family history. Of the remaining 96%, 72% were classified as non-negated expressions (12.99 million phenotypes) (Table 3).
Table 3

Number of phenotypical terms extracted per context and certainty

Context / Certainty
Not negated
Family history
Total number of terms

Detailed expert evaluation

The description of the data available in each cohort and the evaluation by the experts are detailed in Table 4. Figure 4 is a screenshot of the graphical user interface of Dr. Warehouse for Rett syndrome. The automated phenotyping identified an average of 768 phenotypical concepts associated with each disease. In contrast, the number of UMLS concepts found in Orphadata ranges from 16 for the Silver-Russell syndrome to 120 for the Lowe syndrome. APDS was not documented in Orphanet at the time of writing. Overall, the experts classified between 11 (SILVER-RUSSELL set, ranked by TF-IDF) and 41 (LOWE set, ranked by TF-IDF) of the top-50 results as relevant to the disease. The number of phenotypical concepts identified by the union of results obtained through ranking by frequency and ranking by TF-IDF ranges from 16 (SILVER-RUSSELL set) to 52 (DOCK8 deficiency and APDS sets).
Table 4

Description and evaluation of the 6 sets of patients

DOCK8 deficiency
APDS 1 and 2

Median age at visit (years): 8.2 [4.8–12.6]; 11.4 [9.3–14.1]; 12.8 [5.8–20.3]; 2.4 [0.8–5.4]; 15.7 [10.1–41.5]; 12.8 [7.7–18.6]
Median follow up (years): 2.6 [0–4.9]; 3.1 [0.3–9]; 6.6 [3–10.3]; 2 [0.8–4.7]; 2 [0.1–6.6]; 7.5 [4.8–8.6]
# Patients
# Documents

Phenotypes extracted, not negated and in patient context
# Phenotypes
# distinct Phenotypes

Evaluation by experts in the Top50 phenotypes
Medical Experts
# Phenotypes ranked by Freq
# Phenotypes ranked by TF-IDF
# Phenotypes Freq union TF-IDF
# Phenotypes Freq intersect TF-IDF
Average Precision, ranked by Freq
Average Precision, ranked by TF-IDF

Fig. 4

Screenshot of Dr. Warehouse and the concept tab for “Rett syndrome” query

The Mean Average Precision was 0.79 for results ranked by Frequency and 0.75 for results ranked by TF-IDF. An additional file shows in detail the Top50 phenotypical concepts extracted for each cohort [see Additional file 1].

Comparison with Orphadata

The comparison with Orphadata is detailed in Table 5. The limitation to French terms resulted in a reduction of an average of 16 phenotypes, corresponding to an average of 39% of the UMLS concepts (max: 63%, min 21%).
Table 5

Comparison with Orphadata

# Concepts HPO Orphadata (English)
# Concepts HPO Orphadata (French) [A]
# UMLS distinct phenotypes extracted [B]
# [A] intersection [B] (coverage)
% [A] intersection [B] / [A] (coverage %)

We obtained the best coverage for the SILVER RUSSELL set, with 100% of the Orphadata phenotypes present in the phenotypes of the patient set. The lowest coverage was found for the LOWE set, with 66% of the 76 Orphadata phenotypes present in the phenotypes of the patient set. The average coverage for all the patient sets was 78%.

In the six diseases studied, 2.8% of Orphadata concepts do not belong to the semantic types used for the automated phenotyping. For example, “Dislocated hips” (HP:0002827) is part of the description of Lowe syndrome in Orphadata and is assigned to the semantic type “Injury or Poisoning” in the UMLS.

Among the phenotypical concepts in Orphadata, 41 are not represented in the patient sets phenotypical concepts.


Findings and practical significance

Our method of automated extraction of phenotypes from narrative reports in a clinical data warehouse can be useful even in the context of rare diseases with a low number of patients. Indeed, 4 of the 6 examples showed results above 83% in terms of Average Precision in the top-50 phenotypes based on the experts’ evaluation, and the 2 remaining examples were above 49%. This means that the extracted phenotypes are meaningful and can be used as a baseline in diverse situations: enrichment of the phenotypic description of diseases, rapid exploration of the phenotypes in a population, or assisting the experts in the identification of phenotypes of interest.

Our approach can be used to enrich existing phenotypic descriptions of rare diseases. For example, osteoporosis was significantly associated with Rett syndrome in the Necker data warehouse. The association is present neither in Orphanet nor in OMIM. It is, however, described in six articles in Medline [24–29].

Moreover, this method enables a quick exploration of phenotypes in a population. This feature is especially meaningful in the context of rare diseases, for which information may be scarce. In a research context, we have shown with the six examples that our method was able to automatically display the phenotypes associated with rare diseases in a cohort of patients. The same approach could be used to look for undescribed phenotypes associated with new mutations (for example, using gene names as a query, or a series of patients selected manually). The Necker hospital and the Imagine Institute collaborate actively to increase knowledge on rare diseases, and the phenotype explorer from the CDW is used on a daily basis by the staff to support translational research: when a geneticist discovers a new mutation, exploring the documents gathered in the CDW from patients presenting the mutation can support the description of the associated phenotypes. For example, the phenotypes associated with APDS 1 and 2 could provide a basis for the description of the syndrome.

DrWH may also be used to assist experts in the identification of phenotypes of interest. After a careful review and comparison with other cohorts, such associations could be used to enrich online reference resources. Moreover, the method is easily reproducible, and the comparison of phenotypes coming from a variety of clinical data warehouses can provide candidates (union of the candidate phenotypes) or reinforce the interest in specific candidate phenotypes (using the intersection of different submissions).

In addition, the prevalence of signs and symptoms for a given disorder can be estimated using the frequencies provided by DrWH. Our method can provide clinicians with an estimated prevalence of phenotypes in addition to the associations. In our running example, Rett syndrome, “stereotypy” had a prevalence of 71.8%, consistent with Orphanet (Very frequent 80–99%); similarly, “scoliosis” had a prevalence of 51.2%, vs. frequent (30–79%) in Orphanet. Conversely, the prevalence of “apraxia” in DrWH was 12.9%, whereas apraxia is considered very frequent (80–99%) in Rett syndrome by Orphanet. A more precise estimation of the frequency would require considering not only single phenotypical concepts but also groups of semantically close phenotypes.


Comparison to a gold standard and interoperability issues

It was complex to perform an automated evaluation of phenotypes found by DrWH by comparison to a gold standard (e.g. Orphanet with Orphadata).

(1) The extraction of the phenotypical concepts was based on the French terms from the UMLS. However, the coverage of French terms is limited compared to that of their English counterparts, particularly because the French version of HPO was not available in the UMLS 2017AA. For example, in Orphadata the Rett syndrome is associated with 39 phenotypical concepts, of which only 31 exist in French in the UMLS (Table 5). The difference is more dramatic for the Lowe syndrome: of 120 phenotypical concepts, only 76 have a French counterpart. Our automated exploration is based on the use of medical terminologies in French, and DrWH cannot recognize a phenotypical concept that is not present in French. For example, Triangular Face (HPO: HP:0000325, UMLS: C1835884) is a sign associated with Silver-Russell syndrome in Orphadata and is absent from the UMLS concepts extracted from the corresponding set. According to a full-text search, 27 patients of the SILVER-RUSSELL set have the string “face triangulaire” in their narrative records, but the concept “Triangular Face” does not exist in French in the UMLS. Despite this limitation, the current version of DrWH nonetheless enables relevant explorations and allows the discovery of phenotypes of interest. The limited coverage of French terms compared to English limits our ability to identify concepts in free text, and is also a limitation of our evaluation (which tends to underestimate the performance of the method). The integration of new terminologies (with a French translation) that are mapped to the UMLS or integrated in it will reduce the gap between English and non-English terms. The recent increase of interest in non-English Natural Language Processing is a step forward in that direction.

(2) The granularity of phenotypical concepts extracted from Orphadata and DrWH may differ (e.g. a very precise term can be identified in DrWH whereas a more general term is present in Orphadata). This issue cannot be addressed by simple hierarchical reasoning, given that phenotypical concepts may be semantically related while being neither identical nor hierarchically linked (e.g. Hypotonia (C0026827) vs muscle weakness (C0151786)).

(3) We used only an exact match strategy (with text normalization) to recognize phenotypical concepts in the reports. Our method does not handle terms in which the words appear in a different order (e.g. renal acute injury versus acute renal injury would not match). The presence of multiple synonyms in source terminologies might limit the impact of this strategy to a certain extent. However, we intend to upgrade our phenotype recognition strategy to allow more flexibility in the recognition of phenotypical concepts.
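The word-order limitation can be made concrete with a short sketch. The `same_words` function shows one possible relaxation, comparing terms as unordered collections of words; it is an illustrative assumption and not the upgrade planned for DrWH.

```python
import re


def normalize(text):
    """Lowercase and collapse runs of non-alphanumeric characters to one space."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()


def exact_match(term, mention):
    """Exact match after normalization: word order matters."""
    return normalize(term) == normalize(mention)


def same_words(term, mention):
    """Order-insensitive comparison: same words, any order (illustrative only)."""
    return sorted(normalize(term).split()) == sorted(normalize(mention).split())


term, mention = "acute renal injury", "Renal acute injury"
print(exact_match(term, mention))  # False
print(same_words(term, mention))   # True
```

An order-insensitive comparison recovers this pair but can also create false matches between terms that share words with a different meaning, so it would need its own evaluation.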

(4) Some phenotypical concepts do not belong to the semantic types that we used for the automated phenotyping. In Orphadata, “autoaggression” is a sign associated with Rett syndrome. We found 10 patients in the RETT set with “automutilation” in their narrative reports, but this concept belongs to the semantic type “Injury or Poisoning”.

Study population

Our warehouse hosts data produced by a children’s hospital, and therefore the phenotypes can differ from those of adult patients (for example, Alzheimer disease is not represented in pediatrics). However, patients with rare diseases may be followed up in our institution even during adulthood, enabling an extended longitudinal data collection. Longitudinal follow-up makes it possible to observe the age of onset of the phenotypes and to reconstruct the natural history of rare diseases.

Related work

Information extraction

Several approaches have been developed to recognize UMLS concepts or terminology terms in free-text records. Savova et al. [30] developed cTAKES, an open source modular system of pipelined components combining rule-based and machine learning techniques. cTAKES aims at the extraction of information from clinical narratives. Despite developments in other languages [31, 32], most of the open source clinical Natural Language Processing systems have been developed for the English language (MedLee [33], MetaMap [34], HITex [35]). Many challenges have helped to test and assess the different tools and methodologies. In non-English languages, fewer out-of-the-box tools and fewer training datasets are available to work with text. More recently, a challenge was dedicated to the extraction of information from medical documents in multiple languages (including French) [36].

Narrative reports versus coded data

We have shown that text exploration of clinical reports can provide phenotypes of interest. Whereas structured databases are particularly adapted for the collection of data regarding well documented diseases, clinical report based exploration enables the secondary use of data collected during care. Such approaches allow the development of learning health systems in which there is a bidirectional relation between routine care data and research. In addition, patient generated data could be integrated and mined along with the EHRs [37].

We plan to conduct additional studies by comparing our results with the French national rare diseases registry [38].


The Phenotype Explorer of Dr. Warehouse enables the exploration of millions of clinical narratives in a simple manner. The algorithm is optimized to display the phenotype analysis of thousands of documents quickly, and limited expertise is needed to write and execute queries. The queries demonstrated in this study only took a few seconds to run, enabling a real time exploration of the data. The expert user can easily sort the associated phenotypes according to their need, depending on the use case.


Clinical Data Warehouses can be used to perform Next Generation Phenotyping, especially in the context of rare diseases. We have developed a method to detect phenotypes associated with a group of patients using medical concepts extracted from free-text clinical narratives. There are still hurdles to overcome with terminologies in non-English languages; however, the experts’ evaluation suggests that the phenotypes identified using the Frequency and TF-IDF scores can be useful to populate knowledge bases in addition to literature mining.



BR is supported in part by the SIRIC CARPEM cancer integrated research program.


This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Availability of data and materials

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Authors’ contributions

NG, AN and VB made substantial contributions to the acquisition of data. NG, AN, BR and AB conceived the hypothesis and designed the study. RS, JA, NBB, CP and NM manually evaluated the automated phenotyping. All authors made substantial contributions to the analysis and interpretation of data, were involved in drafting and critically revising the manuscript, gave final approval of the version to be published and agree to be accountable for all aspects of the work.

Ethics approval and consent to participate

We obtained ethical approval from the French IRB CPP Ile-de-France II (IRB registration number 00001072), registered under reference 2016–06-01.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

Institut Imagine, Paris Descartes-Sorbonne Paris Cité University, Paris, France
Institut National de la Santé et de la Recherche Médicale (INSERM), Centre de Recherche des Cordeliers, UMR 1138 Equipe 22, Paris Descartes, Sorbonne Paris Cité University, Paris, France
Department of Medical Informatics, Necker-Enfants Malades Hospital, Assistance Publique des Hôpitaux de Paris (AP-HP), Paris, France
Pediatric Nephrology, Necker Enfants Malades Hospital AP-HP, Université Paris Descartes, Paris, France
Pediatric Neurology, Necker Enfants Malades Hospital AP-HP, Université Paris Descartes, Paris, France
Laboratory of embryology and genetics of congenital malformations, INSERM UMR 1163, Institut Imagine, Paris, France
Department of Genetics, Necker Enfants Malades Hospital AP-HP, Université Paris Descartes, Paris, France
Laboratory of Lymphocyte Activation and Susceptibility to EBV infection, INSERM UMR 1163, Paris Descartes Sorbonne Paris Cité University, Imagine Institute, Paris, France
Study center for primary immunodeficiencies (CEDI) Necker Enfants Malades Hospital AP-HP, Université Paris Descartes, Paris, France
French National Reference Center for Primary Immuno Deficiencies (CEREDIH), Necker Enfants Malades Hospital AP-HP, Université Paris Descartes, Paris, France
Pediatric Immuno-Haematology and Rheumatology Necker Enfants Malades Hospital AP-HP, Université Paris Descartes, Paris, France
Hôpital Européen Georges Pompidou, AP-HP, Université Paris Descartes, Paris, France
Imagine - Institute of Genetic Diseases, Paris, France


  1. Office of the National Coordinator for Health Information Technology Health Record Adoption: 2004-2014, Health IT Quick-Stat #50. [Internet]. 2015 Sep. Available from:
  2. Adler-Milstein J, DesRoches CM, Kralovec P, Foster G, Worzala C, Charles D, et al. Electronic health record adoption in US hospitals: progress continues, but challenges persist. Health Aff Proj Hope. 2015;34:2174–80.
  3. Zapletal E, Rodon N, Grabar N, Degoulet P. Methodology of integration of a clinical data warehouse with a clinical information system: the HEGP case. Stud Health Technol Inform. 2010;160:193–7.
  4. Murphy SN, Weber G, Mendis M, Gainer V, Chueh HC, Churchill S, et al. Serving the enterprise and beyond with informatics for integrating biology and the bedside (i2b2). J Am Med Inform Assoc. 2010;17:124–30.
  5. Danciu I, Cowan JD, Basford M, Wang X, Saip A, Osgood S, et al. Secondary use of clinical data: the Vanderbilt approach. J Biomed Inform. 2014;52:28–35.
  6. Raghavan P, Chen JL, Fosler-Lussier E, Lai AM. How essential are unstructured clinical narratives and information fusion to clinical trial recruitment? AMIA Jt Summits Transl Sci Proc. 2014;2014:218–23.
  7. Escudié J-B, Jannot A-S, Zapletal E, Cohen S, Malamut G, Burgun A, et al. Reviewing 741 patients records in two hours with FASTVISU. AMIA Annu Symp Proc. 2015;2015:553–9.
  8. Choquet R, Maaroufi M, de Carrara A, Messiaen C, Luigi E, Landais P. A methodology for a minimum data set for rare diseases to support national centers of excellence for healthcare and research. J Am Med Inform Assoc. 2015;22:76–85.
  9. Radico - Rare Disease Cohorts [Internet]. [cited 2017 Sep 30]. Available from:
  10. Orphanet: Combined immunodeficiency due to DOCK8 deficiency [Internet]. [cited 2017 Sep 30]. Available from:
  11. Picard C, Al-Herz W, Bousfiha A, Casanova J-L, Chatila T, Conley ME, et al. Primary immunodeficiency diseases: an update on the classification from the International Union of Immunological Societies Expert Committee for primary immunodeficiency 2015. J Clin Immunol. 2015;35:696–726.
  12. Orphanet: Activated PI3K delta syndrome [Internet]. [cited 2017 Sep 30]. Available from:
  13. Orphanet: Rett syndrome [Internet]. [cited 2017 Sep 30]. Available from:
  14. Orphanet: Oculocerebrorenal syndrome of Lowe [Internet]. [cited 2017 Sep 30]. Available from:
  15. Orphanet: Silver Russell syndrome [Internet]. [cited 2017 Sep 30]. Available from:
  16. Orphanet: Bardet Biedl syndrome [Internet]. [cited 2017 Sep 30]. Available from:
  17. Lindberg DA, Humphreys BL, McCray AT. The unified medical language system. Methods Inf Med. 1993;32:281–91.
  18. Orphanet: an online rare disease and orphan drug data base. Copyright, INSERM 1997. [Internet]. [cited 2017 Sep 22]. Available from:
  19. INSERM. Orphadata: Free access data from Orphanet. © INSERM 1997. Available on Data version (XML data version) [Internet]. 1997 [cited 2017 Sep 24]. Available from:
  20. Harkema H, Dowling JN, Thornblade T, Chapman WW. Context: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomed Inform. 2009;42:839–51.
  21. Chapman WW, Hillert D, Velupillai S, Kvist M, Skeppstedt M, Chapman BE, et al. Extending the NegEx lexicon for multiple languages. Stud Health Technol Inform. 2013;192:677–81.
  22. Garcelon N, Neuraz A, Benoit V, Salomon R, Burgun A. Improving a full text search engine: the importance of negation detection and family history context to identify cases in a biomedical data warehouse. J Am Med Inform Assoc.
  23. Beitzel SM, Jensen EC, Frieder O. MAP. In: Liu L, Özsu MT, editors. Encycl. Database Syst [Internet]. Springer US; 2009 [cited 2017 Sep 30]. p. 1691–2. Available from:
  24. Bahi-Buisson N. Genetically determined encephalopathy: Rett syndrome. Handb Clin Neurol. 2013;111:281–6.
  25. Budden SS, Gunness ME. Possible mechanisms of osteopenia in Rett syndrome: bone histomorphometric studies. J Child Neurol. 2003;18:698–702.
  26. Cortelazzo A, De Felice C, Guerranti R, Signorini C, Leoncini S, Pecorelli A, et al. Subclinical Inflammatory Status in Rett Syndrome. Mediators Inflamm. [Internet]. 2014 [cited 2017 Sep 30];2014. Available from:
  27. Jefferson A, Leonard H, Siafarikas A, Woodhead H, Fyfe S, Ward LM, et al. Clinical guidelines for Management of Bone Health in Rett syndrome based on expert consensus and available evidence. PLoS One. 2016;11(2):e0146824. eCollection 2016. PubMed PMID: 26849438; PubMed Central PMCID: PMC4743907.
  28. Lotan M, Reves-Siesel R, Eliav-Shalev RS, Merrick J. Osteoporosis in Rett syndrome: a case study presenting a novel management intervention for severe osteoporosis. Osteoporos Int. 2013;24:3059–63.
  29. Zysman L, Lotan M, Ben-Zeev B. Osteoporosis in Rett syndrome: a study on normal values. ScientificWorldJournal. 2006;6:1619–30.
  30. Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical text analysis and knowledge extraction system (cTAKES): architecture, component evaluation and applications. J Am Med Inform Assoc. 2010;17:507–13.
  31. Roque FS, Jensen PB, Schmock H, Dalgaard M, Andreatta M, Hansen T, et al. Using electronic patient records to discover disease correlations and stratify patient cohorts. PLoS Comput Biol. 2011;7:e1002141.
  32. Deléger L, Grouin C, Zweigenbaum P. Extracting medication information from French clinical texts. Stud Health Technol Inform. 2010;160:949–53.
  33. Friedman C, Shagina L, Lussier Y, Hripcsak G. Automated encoding of clinical documents based on natural language processing. J Am Med Inform Assoc. 2004;11:392–402.
  34. Aronson AR. Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program. Proc AMIA Symp. 2001;2001:17–21.
  35. Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, Lazarus R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak. 2006;6:30.
  36. CLEF e-health 2016 [Internet]. 2016 [cited 2017 Sep 30];2016. Available from:
  37. Friedman C, Rubin J, Brown J, Buntin M, Corn M, Etheredge L, et al. Toward a science of learning systems: a research agenda for the high-functioning learning health system. J Am Med Inform Assoc. 2015;22:43–50.
  38. Maaroufi M, Choquet R, Landais P, Jaulent M-C. Towards data integration automation for the French rare disease registry. AMIA Annu Symp Proc. 2015;2015:880–5.


© The Author(s). 2018