Identification of risk features for complication in Gaucher’s disease patients: a machine learning analysis of the Spanish registry of Gaucher disease

Background Since enzyme replacement therapy for Gaucher disease (MIM#230800) has become available, both awareness of and the natural history of the disease have changed. However, there remain unmet needs such as the identification of patients at risk of developing bone crisis during therapy and late complications such as cancer or parkinsonism. The Spanish Gaucher Disease Registry has worked since 1993 to compile demographic, clinical, genetic, analytical, imaging and follow-up data from more than 400 patients. The aims of this study were to discover correlations between patients’ characteristics at diagnosis and to identify risk features for the development of late complications; for this a machine learning approach involving correlation networks and decision trees analyses was applied. Results A total of 358 patients, 340 type 1 Gaucher disease and 18 type 3 cases were selected. 18% were splenectomyzed and 39% had advanced bone disease. 81% of cases carried heterozygous genotype. 47% of them were diagnosed before the year 2000. Mean age at diagnosis and therapy were 28 and 31.5 years old (y.o.) respectively. 4% developed monoclonal gammopathy undetermined significance or Parkinson Disease, 6% cancer, and 10% died before this study. Previous splenectomy correlates with the development of skeletal complications and severe bone disease (p = 0.005); serum levels of IgA, delayed age at start therapy (> 9.5 y.o. since diagnosis) also correlates with severe bone disease at diagnosis and with the incidence of bone crisis during therapy. High IgG (> 1750 mg/dL) levels and age over 60 y.o. at diagnosis were found to be related with the development of cancer. When modelling the decision tree, patients with a delayed diagnosis and therapy were the most severe and with higher risk of complications. Conclusions Our work confirms previous observations, highlights the importance of early diagnosis and therapy and identifies new risk features such as high IgA and IgG levels for long-term complications.


Introduction
Gaucher Disease (GD)(MIM#230800; MIM#231000; MIM#230900) is the most common lysosomal storage disorder (LSD) [1,2]; some of the most common problems for GD patients are difficulty in diagnosis [3], appearance of complications, variability in the intensity of symptoms and absence of curative treatments with decreased quality of life [4,5]. Clinical characteristics of GD are well established, but there remains a lack of information due to the singularity of the cases [6]. Also, it has been impossible to define a complete phenotypegenotype correlation [7][8][9] or to create a prognosis model for complications. Because of these the initiative to create registries has been developed by different institutions, research groups, and pharmaceutical companies; allowing a continuous improvement in the knowledge of the disease [10][11][12].
GD has a pan-ethnic distribution with cases described worldwide. Outside the Ashkenazi Jewish population, incidence ranges from 1 in 40,000 to 1 in 100,000 inhabitants; however, in the Ashkenazi population a higher incidence has been found (1 in 2500) and GD is not considered a rare disease [13]. Three types have been described: type-1 GD (GD1; MIM#230800), or the nonneuronopathic GD form, is the most common in western countries and it is characterized by the absence of primary involvement in the central nervous system; type-2 GD (GD2; MIM#230900) is the acute neuronopathic form with very severe cases, all of them with a short lifespan of less than 2 years; and type-3 GD (GD3; MIM#231000), or the juvenile/adult neuronopathic form, described for first time in 1959 [14], is characterized by neurological affectation and also involvement of other organs such as lungs, cardiac valves, and kyphosis, among other manifestations [15][16][17][18].
The application of Enzymatic Replacement Therapy (ERT), which began in 1991, has significantly improved awareness of the disease, and has changed the characteristics and expectations of patients as well as the experience of everyone involved in GD management. Nowadays, ERT offers a secure therapy for GD patients with 3 different available enzymes worldwide, two of them in Europe (Imiglucerase, Sanofi-Genzyme and Velaglucerasa alfa, Takeda pharmaceuticals), taliglucerase alfa obtained from plant cell-expressed is until now non approval in EU [19][20][21][22]. Since 2004, Substrate Reduction Therapy (SRT) has been developed for GD treatment, first with one iminosugar (Miglustat, Actelion Pharmaceuticls) and more recently with a ceramide mimetic (Eliglustat Tartatre, Sanofi-Genzyme) [23][24][25] expanding the therapeutic options to GD patients. However, there is still the need to develop means of identifying the small number of patients who are at risk of bone crisis while receiving ERT, as well as those who are at risk of developing late complications such cancer or parkinsonism.
The Spanish Gaucher Disease Registry (SGDR) has worked since 1993 to compile demographic, clinical, genetic, analytical and imaging data of Spanish GD patients (currently numbering 361 GD1, 36 GD, and 21 GD3). The Registry has allowed us to calculate GD prevalence in Spain (about 1/100,000 inhabitants) and to identify the GBA (MIM*606463) variants distribution in the population [12,18].
In the last decades, the explosion of all kind of data has driven to the use of different big data and machine learning techniques for many applications in the healthcare and bioinformatics fields (several reviews can be seen in references [26][27][28]. In particular the application of computational tools and correlations network techniques for the analysis of data can provide new insights into the relationship between different variables and with the disease, as well as informative and descriptive visualizations [28,29]. The main objective of this project is to identify new correlations among the patient characteristics and to made a first approximation to the development of prediction models for the risk of late complications.

Patients
Since the establishment of the SGDR coordinated by the Fundación Española para el Estudio y Terapéutica de la Enfermedad de Gaucher y otras lisosomales (FEETEG), a total of 418 GD patients have been reported in Spain. All patients included in the SGDR provided informed consent for the collection and use of the information and biological samples for research projects, all according to the Helsinki declaration of 1963 revised in October 2013, and in accordance with European Regulation 2016/679 on the protection of personal data and the free movement of such data. For this study, ethics and scientific FEETEG boards gave their approval.
All the registered patients were included except those diagnosed with GD2 and those who had less than 70% of baseline data available (Table 1). Of 418 patients in the SGDR, 358 (85.6%) were analysed.

Study design
In collaboration with Kampal Data Solutions demographic, clinical, analytical, imagining data at diagnosis and comorbidities during the follow-up were evaluated (Table 1).
The aimed conditions for which the analysis sought correlations were the presence of severe bone disease at diagnosis, development of bone crisis during follow-up, and the development of neoplasia or PD.

Statistical analysis
The statistical analysis of the data was made in two parts.

Baseline data analysis
A descriptive analysis was performed by splitting the variables between numerical and categorical. To establish correlation, Pearson, Chi-Square, Mann-Whitney and Mann-Whitney normalized tests were used.

Prediction model
Based on the results of the first step and the correlation between the different variables, we proceeded to the development of a predictive model using decision trees.  To implement the models, a training and validation cohort [30] were used, with application of the crossvalidation technique [31]. This allowed us to offer an estimate of errors. The models have been built only with GD1 patients; GD3 patients have been ruled out because they can die prematurely due to the severity of their disease. Standard quality metrics such as test sample size, accuracy, sensitivity, specificity, odds ratio (OR), positive predictive value (PPV), true positives (TP), true negatives (TN), false positives (FP), false negatives, area under the receptor operator curve (AUC) were calculated. Preprocessing, data analysis and modelling were carried out through the programming language R programming language (version 3.6.2), by using, among others, the following packages: car, ggplot2, vcd, GGally, plyr, igraph, rpart, dplyr [32][33][34].

Correlations between numerical variables
A detailed correlation between the numerical variables (Table S1) and categorical variables (Table S2) can be found in supplemental material. A graph was constructed to provide a representation of how the different variables are related to each other, not only in pairs but in a global way (Fig. 1). In this graph, the nodes are the different variables and a link is established between two of them if the correlation (Pearson's r) calculated between them is statistically significant (p ≤ 0.05). The weight of the link is equal to the correlation between the two variables. The position algorithm used for its creation tries to place more closely those nodes that are joined by stronger links, while those that are unrelated are further away. The highest correlation was established between the age of diagnosis and the age of onset of treatment. The statistical analysis was performed in order to stablish correlation among all variables, however, there were some correlations that need to be taken carefully in an individualized manner, in special the one involving baseline characteristics and variables such as the age at diagnosis, time since diagnosis to therapy, time on therapy. At this respect, for example, some patients did not start therapy because the ERT was not available, and as consequence their age at therapy correlates with the delay of therapy.

Correlation between categorical variables
Table S2 from supplemental material shows the significance of the correlation between the categorical variables. There was a high correlation between spleen removal and the presence of bone disease (χ 2 n = 10.87, p < 0.01) and repeat bone crises (χ 2 n = 15.93, p < 0.01). Almost all of the patients who suffered new bone crises had previous bone lesions (χ 2 n = 30.47, p-value< 0.01). Family history of PD and GBA genotypes no NM_ 000157. 4:c.1226A > G in homozygosity were the variables related to the development of PD (χ 2 n = 4.58, p < 0.01 in the correlation between having PD's or not and the set of 11 different genotypes).

Correlation between numerical variables and conditions
To stablish correlation between the presence of conditions such as severe bone disease, repeated bone crisis, Spleen removal, Parkinson Disease and neoplasia with the numerical variables the normalized Mann-Whitney test was used; for this two levels were stablished, level 1 the absence of the condition and level 2 the presence of the condition (Table S3 supplementary material).

Bone disease
The numerical variables that showed the main relevance for severe bone disease at diagnosis were the S-MRI (U n = 0.98, p < 0.01) and IgA levels (U n = 0.93, p = 0.01), Table S3, supplemental material. Nevertheless, there were other variables that present relatively high correlations and low pvalues, such as high levels of ferritin (U n = 0.85, p = 0.06), triglycerides (U n = 0.75, p < 0.01), delayed age at diagnosis (> 9.5y.o.) (p < 0.001), time in years between diagnosis and the start of treatment (U n = 0.67, p = 0.01) or delayed age of initiation of ERT (U n = 0.61, p = 0.01) (Fig. 3A1).
The same happens with the appearance of successive bone crises during ERT. The variables that correlate and   have greater significance are S-MRI (U n = 0.92, p < 0.01), high IgA levels (U n = 0.91, p = 0.08), and delayed age of initiation of ERT (U n = 0.51, p = 0.07) (Fig. 3B1). The analysis considering the mean age at diagnosis minus the mean age to start therapy (U n = 0.62, p = 0.00001) (Table  S3 Supplementary material) (Fig. 4A2).
In relation to the occurrence of PD, the numerical variables that have a significant correlation were elevated ferritin levels (U n = 0.92, p = 0.04) and age at diagnosis (U n = 0.45, p = 0.01); in this last correlation the age of PD onset probably has more weight (Fig. 4B). The significant correlations are presented in Table S3 of supplemental material.

Correlation between categorical variables
All correlations observed between categorical variables are shown in Fig. 2.
High correlation were found between spleen removal and the severe bone disease and repeated bone crises (p = 0.0001). Almost all of the patients who suffered new bone crises had previous bone lesions (p = 0.0005) in spite of long-term ERT exposure.
The family history of PD and GBA genotypes other than homozygous NM_000157.4:c.1226A > G were also found to be statistically associated with PD development (p < 0.01).
The last correlation between categorial variables with statistical significance was cancer development (not only hematological) and spleen removal (p = 0.05) Fig. 2. (Table S2 supplementary material).

Generation of predictive models for complications by means of decision trees
Decision trees show the best prediction for the development of severe bone disease in patients with an S-MRI Fig. 4 Correlation between numerical variables with the development of neoplasia or Parkinson's disease. Histograms: A Correlation between increased serum IgG levels, time delay between GD diagnosis and the start of ERT with the development of neoplasia. B Correlation between increased serum ferritin levels, age of GD diagnosis with the development of neoplasia > 2.5 who started therapy after 9.5 years; 87% of patients with these characteristics developed a severe bone disease (Fig. 5).
For neoplasia, a higher risk was found when IgG > 1725 mg/dL and age of diagnosis > 60 y.o.
In the case of PD, it was not possible to design a tree that improves the prediction of the risk of disease development, because the percentage of patients was very small. However, it has been observed that there is an important correlation with the GBA genotype (p < 0.01) and also with the existence of relatives with PD (p = 0.08), although this was not statistically significant. In the supplemental material, Tables S3-S4 show the statistical significance of correlation among the selected variables used for the algorithms.

Discussion
Large-scale data (big data) in our case when referring to a rare disease have been adapted to the number of cases available; this kind of analysis is a new tool that has recently been incorporated into biomedical activity; machine learning is the study of computer algorithms that improve automatically through experience and it involves a wide series of algorithms, classification and regression models such as decision trees being some of them [26][27][28][29]. This methodology is especially useful for obtaining pooled information on the diversity of outcomes and identifying prognostic factors potentially related to disease complications [35]. In rare disease research, this is of particular interest due to the scarcity and the spread of the data among the different centers [36]. Various approaches have been applied in the area of rare diseases, especially in looking for genetic associations [37] and making correlations between genotype and phenotype [38].
Registries have an important role in this kind of analysis, because they include complete information about patients, which is especially important for rare disease research. This collected information helps in diagnosis, patient management, treatment strategy planning, health care planning and follow-up. It enables the acceleration of research and paves new pathways for personalized medicine [39,40].
This study is the first attempt to establish a correlation network among different biochemical and clinical characteristics in a national-base cohort. We have aimed to analyse diagnostic data and to relate them with longterm complications as bone crises, development of neoplasia or PD, which are the most common and disabling complications [41][42][43][44][45].
Two observations, already accepted in Gaucher research, were also confirmed in this machine-learning study: first, the fact that spleen-removal patients have a higher risk of presenting more serious and extensive bone disease; second, our observation that almost all The information that appears in each node includes (top down): the value of the target variable assigned by the algorithm: develops bone disease: yes/no. the ratio of patients in this node who had (left) / did not have (right) bone disease. percentage of the total population included in this node. For the prediction of bone complications, a mild bone marrow infiltration of 2.5 points by the Spanish Magnetic Resonance Imaging Score (S-MRI) with the delay of the start therapy over an age 9.5 years were the two characteristics selected by the prediction model patients with new bone crisisdespite having received long-term ERThad previous bone lesions, which remind us that the most feared complication in GD1 are not solved merely by starting ERT. These two facts confirm previous reports and provide validity of our analysis [42,[45][46][47]. In addition, genotypes different from homozygous NM_000175. 4:c.1226A > G are significantly correlated with bone disease (p = 0.05). This last observation is in line with the observation that c.1226A > G variant provides a mild phenotype [48,49].
It is a priority to identify accurate risk factors of bone crisis to improve treatment dosage and to avoid this complication. The standard biomarkers related to GD (ChT activity, CCL18/PARC and GluSph concentrations) have been discarded as risk factors for bone complication [50,51] even though their concentration will be increased during bone crisis, due to the acute inflammatory event [45,49]. This reminds us of the importance to continue searching for other biomarkers. Our results confirm the lack of association between these biomarkers and disease outcomes, but other biomarkers, such as high levels of ferritin, show a tendency in patients with advanced bone disease although it was not statistically significant.
Surprisingly, the high serum IgA concentration correlates with the degree of bone involvement and with the development of bone crisis (p = 0.001). The age of onset of treatment (mean 30.6 y.o.) (p = 0.01) also shows a clearer relevance for the occurrence of bone crises (p = 0.01).
In this study, the development of malignancies appears strongly correlated with the delayed age at the start of ERT (p < 0.01) and the increased concentration of IgG (p = 0.01). Many aspects remain to be unraveled in the complexity of the immune system, but aging is an important factor clearly related to humoral immune dysfunction and the appearance of malignancies [52]. Polyclonal and monoclonal gammopathies in GD patients are common [53] and we observed a significant correlation between high levels of IgG and the appearance of neoplasia [54]. However, the origin of these alterations is not fully clarified, and is attributed to the chronic inflammation state; also, it is related to an increase in levels of inflammatory cytokines such as interleukins (IL-6, IL-10) that could lead to an overproduction of immunoglobulins [53,54]. Another hypothesis could be that B lymphocytes were activated by specific type II natural killer T lymphocytes, with a T follicular helper profile, and that the clonal immunoglobulin in GD patients and in mouse models of GD was reactive against GluSph [55].
The identification of levels of IgA as a risk factor for complication was a surprising finding; it has not been previously reported that IgA levels are related to severe bone disease and the presence of repeated bone crises in GD. Also, our data, shown an analysis of the main clinical features of GD patients at diagnosis; in accordance with previous reports [2,6,[10][11][12], general characteristics such as polyclonal gammopathies, bone pain, bone vascular lesions, hypertriglyceridemia, splenomegaly and family history of parkinsonism, would be findings that can help to identify GD patients.
The SGDR only includes GD patients from Spain, thus the main limitations for the study are the absence of a larger data set. Despite this, the included data reflect the characteristics of the disease in this country. It could be interesting to validate these findings by studying other populations with a greater number of patients; however, taking into account the homogeneity of the series and the single-country origin, the data are solid.

Conclusions
Our work confirms previous observations such as the relationship among bone disease and splenectomy; it highlights the importance of early diagnosis and therapy and identifies new risk features such as high IgA and IgG levels for long-term complications. This is first attempt in which all the baseline diagnosis data has been included in a study to perform analysis of network correlations. This open the possibility to move forward using nowadays technology; it will help us to identified features that can predict risk for complications or maybe, if more patients can be included, a better phenotypegenotype correlation.
Additional file 1: Table S1. (supplementary material). Correlation between numerical variables.  Authors' contributions MAC and PG designed the study, collected the data, interpreted the results and wrote the manuscript. LLF, JJC, ISG, BME and MRE collected the data, reviewed and approved the final version. BGB, JPH and DI analysed the data, create the algorithms for the prediction models, wrote and reviewed and approved the manuscript. The author(s) read and approved the final manuscript.

Funding
This work was supported by FEETEG.

Availability of data and materials
The data analysed and generated during the current study belongs to the SRGD and to the FEETEG are available under request through the corresponding author.
Ethics approval and consent to participate All the patients included in the SRDG have signed an informed consent to the use of their data on research purpose. The scientific and ethics committees of the FEETEG foundation approved this study.

Consent for publication not applicable
Competing interests PhD Jorge J Cebolla is employee of Takeda Pharmaceutical outside the submitted work; all other authors have indicated they have no financial relationships, relevant to this article, to disclose.