Phenotype prediction for mucopolysaccharidosis type I by in silico analysis

Background Mucopolysaccharidosis type I (MPS I) is an autosomal recessive disease due to deficiency of α-L-iduronidase (IDUA), a lysosomal enzyme that degrades glycosaminoglycans (GAG) heparan and dermatan sulfate. To achieve optimal clinical outcomes, early and proper treatment is essential, which requires early diagnosis and phenotype severity prediction. Results To establish a genotype/phenotype correlation of MPS I disease, a combination of bioinformatics tools including SIFT, PolyPhen, I-Mutant, PROVEAN, PANTHER, SNPs&GO and PHD-SNP are utilized. Through analyzing single nucleotide polymorphisms (SNPs) by these in silico approaches, 28 out of 285 missense SNPs were predicted to be damaging. By integrating outcomes from these in silico approaches, a prediction algorithm (sensitivity 94%, specificity 80%) was thereby developed. Three dimensional structural analysis of 5 candidate SNPs (P533R, P496R, L346R, D349G, T374P) were performed by SWISS PDB viewer, which revealed specific structural changes responsible for the functional impacts of these SNPs. Additionally, SNPs in the untranslated region were analyzed by UTRscan and PolymiRTS. Moreover, by investigating known pathogenic mutations and relevant patient phenotypes in previous publications, phenotype severity (severe, intermediate or mild) of each mutation was deduced. Conclusions Collectively, these results identified potential candidate SNPs with functional significance for studying MPS I disease. This study also demonstrates the effectiveness, reliability and simplicity of these in silico approaches in addressing complexity of underlying genetic basis of MPS I disease. Further, a step-by-step guideline for phenotype prediction of MPS I disease is established, which can be broadly applied in other lysosomal diseases or genetic disorders. Electronic supplementary material The online version of this article (doi:10.1186/s13023-017-0678-1) contains supplementary material, which is available to authorized users.


Background
Mucopolysaccharidosis type I (MPS I) is a lysosomal disease included within the genetically heterogeneous group of mucopolysaccharidoses (MPSs). MPS I results from mutations in the gene encoding the lysosomal enzyme α-L-iduronidase (IDUA; glycosaminoglycan a-Liduronohydrolase, OMIM 252800) [1]. Deficiency of IDUA leads to progressive lysosomal accumulation of glycosaminoglycans (GAG) heparan and dermatan sulfate in tissues. Based on the severity of symptoms, MPS I can be divided into three subtypes, from mild (Scheie syndrome, OMIM 607016) to intermediate (Hurler-Scheie syndrome, OMIM 607015) to severe (Hurler syndrome, OMIM 607016). Scheie or Hurler-Scheie patients have symptoms including growth delay, aortic valvular disease, skeletal dysplasias, corneal clouding and joint stiffness. In addition to having these symptoms, but in a more pronounced way, Hurler patients also have growth delay, hepatosplenomegaly, coarse facial features, hydrocephalus, mental retardation and neurodegeneration.
It has been shown that the earlier enzyme replacement therapy or hematopoetic stem cell transplantation is performed, the better the outcome is [2][3][4][5]. Since early initiation of treatment is more likely to improve clinical outcomes, early diagnosis and accurate phenotype prediction are essential. However, genotype/phenotype correlation of MPS I has not been well established [6,7]. To date, assessment of the phenotype is generally based on clinical signs and symptoms. A recent study showed a lack of consensus on the assessment of phenotypic severity solely based on signs and symptoms at presentation [8]. Therefore, establishment of a reliable and easy-to-use phenotype prediction method based on genotype will be of great benefit.
The single nucleotide polymorphisms (SNPs) are the most common form of genetic mutations. SNP was originally defined as a single nucleotide variant with a frequency in genome of more than 1% [9]. In this study, for the simplicity of description, single nucleotide variants with a frequency of less than 1% were also included in the analysis. While many SNPs are phenotypically neutral, others could cause disease, predispose human to disease, or influence response to medicine. Previous studies on polymorphisms screening by in silico analysis contributed to predicting the functional non-synonymous SNPs (nsSNPs) in genes such as G6PD [10], ATM [11], PTEN [12], BRAF [13] and BUB1B [14]. This powerful computed methodology enables prioritizing SNPs with functional significance from a large quantity of neutral non-risk variants. To date, computational analyses of IDUA gene for phenotype prediction have not been performed. To this end, a number of bioinformatics tools, based on recent findings from evolutionary biology, protein structure research, machine learning and computational biology, may provide useful information for assessing the functional impacts of SNPs. A stepwise guideline for phenotype prediction based on genotype is established, which will benefit early diagnosis and proper treatment allocation for MPS I patients.

Dataset
The SNPs information (Protein accession number and SNP ID) of the IDUA gene was retrieved from the NCBI dbSNP (http://www.ncbi.nlm.nih.gov/snp/). Known disease-associated mutations in IDUA gene were retrieved from The Human Gene Mutation Database (http:// www.hgmd.cf.ac.uk/ac/index.php).
SIFT SIFT (Sorting Intolerant From Tolerant; http://sift.jcvi.org/) can predict the effect of amino acid substitution on protein function, and classify it as 'tolerated' or 'deleterious' [15]. SIFT applies multiple alignment information for the query sequence and predicts whether substitutions are 'tolerated' or 'deleterious' by calculating the tolerance index score (0 to 1). Tolerance index score is a normalized probability that an amino acid substitution is tolerated. Substitutions with a tolerance index less than 0.05 are predicted to be 'deleterious' and those with greater than or equal to 0.05 are predicted as 'tolerated'. The analysis was performed using the default settings.
PROVEAN PROVEAN (Protein Variation Effect Analyzer; http:// provean.jcvi.org) is a sequence based predictor that estimates the impact of protein sequence variation on protein function [18]. In PROVEAN, BLAST hits with more than 75% global sequence identity are clustered together, and top 30 such clusters from a supporting sequence are averaged within and across clusters to generate the final score. A protein variant is predicted to be 'deleterious' if the final score is below −2.5, and is predicted to be 'neutral' otherwise.
PANTHER PANTHER (http://www.pantherdb.org/) is a database which contains a collection of protein families and subfamilies that predict the occurrence of an amino acid at a position in a family of evolutionarily related protein [19]. PANTHER uses hidden Markov model (HMM) based statistical modeling methods and multiple sequence alignments to perform evolutionary analysis of coding nsSNPs. By calculating the substitution position-specific evolutionary conservation score (subPSEC) based on an alignment of evolutionarily related proteins, PANTHER estimates the likelihood of a particular nsSNP causing a functional impact. Based on subPSEC scores, PANTHER classifies SNPs as 'deleterious' (score < −3) or 'neutral' (score > −3).

SNPs&GO
SNPs&GO (Single Nucleotide Polymorphism Database & Gene Ontology; http://snps.biofold.org/snps-and-go/ snps-and-go.html) is an support vector machine (SVM) based method used to predict the disease related mutations from protein sequences with a scoring accuracy of 82% and Matthews correlation coefficient of 0.63. For SNPs&GO, FASTA sequence of whole protein is considered to be an input option and output will be the prediction results based on the discrimination among 'disease' and 'neutral' variations of protein sequence. The probability score higher than 0.5 is defined as 'disease' [20].

PHD-SNP
PHD-SNP (Predictor of Human Deleterious Single Nucleotide Polymorphisms; http://snps.biofold.org/ phd-snp/phd-snp.html) is an SVM-based classifier, trained over a million amino acid polymorphism datasets using supervised training. PHD-SNP predicts whether the given amino acid substitution leads to 'disease' or 'neutral' along with the reliability index score [21].

NetSurfP
NetSurfP (http://www.cbs.dtu.dk/services/NetSurfP/) is a web server that predicts the surface accessibility and secondary structure of amino acids. The reliability of this NetsurfP is given in the form of Z-score. The Z-score highlights the surface prediction reliability, but not associated with secondary structure [22].

Modeling of mutant protein structures
The Swiss-PDB Viewer, a free molecular graphics program was used for viewing the modeled structures and for calculation of the root mean square deviation (RMSD) between the native and mutant structures. Swiss-PDB viewer named as Deep View, a stand-alone program, was used as an analytical tool for macromolecules [23]. To superimpose protein structures, the "Magic Interactive Fit" command was used for detection of a stretch of similar residues at sequence level to obtain a structural fit between the two models. Energy minimization for three-dimensional (3D) structures was performed using NOMAD-Ref server (http://lorentz.immstr.pasteur.fr/nomad-ref.php) [24]. Conjugate gradient method was used for energy minimization of the 3D structures.

Project HOPE
Project Have yOur Protein Explained (HOPE; http:// www.cmbi.ru.nl/hope/home) is an easy-to-use web service that analyzes the structural effects of a point mutation in a protein sequence. HOPE provides the 3D structural visualization of mutated proteins by using UniProt and DAS prediction servers. HOPE server predicts the output in the form of structural variation between mutant and wild type residues [25].

UTRscan
UTRscan (http://itbtools.ba.itb.cnr.it/utrscan) is a web server that can analyze the untranslated regions (5′ UTR and 3′ UTR) of eukaryotic mRNA which are involved in many post-transcriptional regulatory pathways that control mRNA localization, stability and translation [26]. The internet resource for UTR analysis are UTRdb, which contains experimentally proven biological activity of functional pattern of UTR sequence from eukaryotic mRNAs. If different sequences for each UTR SNP are found to have different functional patterns, that particular UTR SNP is predicted to have functional significance.

PolymiRTS
PolymiRTS database (http://compbio.uthsc.edu/miRSNP/) was used specifically for the analysis of SNPs in the 3′ UTR. The polymorphic microRNA target sites are classified into four classes [27]. Specifically, class 'D' may cause loss of normal repression, and class 'C' may cause abnormal gene repression control. Therefore, these two classes of PolymiRTS are most likely to have functional impacts.

Analysis of missense SNPs using a combination of bioinformatics tools
Polymorphisms in the IDUA gene were retrieved from NCBI dbSNP database. Non-synonymous SNPs (nsSNPs) from the coding region, and untranslated (5'and 3′) region were selected for further analysis. The impacts of any amino acid substitution with its functional significance and physical properties can be determined using SIFT by aligning homologous and orthologous protein sequence. A total of 285 missense SNPs of IDUA gene were analyzed using SIFT. Out of 285 SNPs, 201(71%) were predicted to be 'deleterious' (tolerance index <0.05), while 157 (55%) were 'highly deleterious' (tolerance index = 0). All 201 SNPs predicted to be 'deleterious' by SIFT were further analyzed by PolyPhen. For every input SNP, Polyphen calculates PSIC score and perform BLAST query to identify homologous protein. A total of 149 SNPs were predicted to be 'probably damaging'. For further confirmation, the PolyPhen results were subjected to I-Mutant, which is a routine SNP prediction tool based on neural network, for adding another layer of confirmation. I-Mutant estimates the effect of substitution on protein stability by calculating the reliability index (25°C, pH 7.0). Out of 149 missense SNPs analyzed, 107 (72%) were predicted to cause 'large decrease' , while 42 were predicted to cause 'neutral stability'. The remaining 107 SNPs were analyzed by PROVEAN, yielding 93 deleterious and 14 neutral SNPs. Therefore, 93 out of 285 SNPs were predicted to be damaging by 4 different methods and summarized in Table 1.
All 93 SNPs identified were further analyzed by PANTHER, SNPs&GO and PHD-SNP. PANTHER characterizes the effect of amino acid variation on protein function via HMM based statistical modeling. PANTHER can classify proteins by function, adding another layer of complexity to refine SNP prediction. SNPs&GO predicts the log-odd (LGO) score from the GO data base by placing the similar proteins in the same dataset. PHD-SNP is an SVM-based classifier, trained over a million amino acid polymorphism datasets using supervised training. Out of the 93 SNPs, 28 were predicted to be disease-associated by three methods ( Table 2).

Biophysical validation and 3D structure analysis of missense SNPs
Based on the in silico analyses performed, 28 SNPs were selected for biophysical analysis using NetSurfP. The location and the type of a mutated residue can affect the stability of the protein by decreasing the solvent accessibility of a residue decreases. NetSurfP Z-score allows for the identification of the most reliable predictions for both buried and exposed amino acids. Out of 28 SNPs, a huge drift in the Z-score was observed for 5 SNPs (Table 3).
To analyze the 3D structural change introduced by these 5 SNPs, we performed structural analysis by comparing the native and mutant protein structures. Briefly, the native structure of IDUA was extracted from Protein Data Bank (ID 3 W81). Single amino acid substitution and superimposition of native and mutated structures were examined using Swiss-PDB viewer, and their degree of similarity was measured as the RMSD value. RMSD values between native and each mutant structure are <0.5 Å, indicating a minor structural change caused by the SNP. An illustration of overall superimposition by Swiss-PDB viewer is shown in Fig. 1, while detailed structural changes in Fig. 2. Total energy values of native structure and 5 mutant structures were calculated after energy minimization by NOMAD_Ref and summarized in Table 4. The total energy of three mutant models (L346R, P496R and P533R) is significantly higher than that of the native model, indicating that the mutation decreases the protein stability.
Specifically, rs772416503 leads to conversion of proline into arginine at position 496 (P496R). The hydrophobic environment around Pro496 leaves no room for a bulky polar residue (arginine). This mutation (P496R) may interfere with the placement of Asn372 glycan over the active site, and thereby affect enzyme catalytic activity. Rs371397270 leads to conversion of aspartic acid into glycine at position 349 (D349G). Asp349 is located in triosephosphateisomerase (TIM) barrel active site and interacts with substrate. Besides, since glycine is smaller than aspartic acid, the mutation will cause an empty space in the core of the protein. The charge of the buried wild-type residue is also lost due to this mutation. Therefore, D349G will also cause loss of hydrogen bonds in the core of the protein and thereby disturb correct folding. Rs121965021 (P533R) is located in the β sandwich. Prolines are known to have a very rigid structure, sometimes forcing the backbone in a specific conformation. P533R may disturb this special conformation and destabilize the β sandwich domain by introducing the side chain of arginine. Besides, only the wild type residue proline is found at this position. Mutation of a 100% conserved residue is usually damaging for the protein.
Rs121965033 (L346R), located in the TIM barrel, may cause steric hindrance and destabilize active site confirmation. The mutant residue (Arg) introduces a charge in a buried residue (Leu) which affects protein folding. Besides, since Leu346 is buried in the core of the protein, Arg is bigger and probably will not fit. This mutation will cause loss of hydrophobic interactions in the core of the protein. Rs775816150 (T374P) is located at Thr374, a conserved N glycosylation site. It has also been shown that N-glycans are essential for substrate binding and catalytic activity of IDUA [28]. Therefore, this mutation (T374P) may lead to decrease or loss of catalytic activity of IDUA.

Establishment and evaluation of SNPs prediction algorithm
By integrating outcomes of the bioinformatics tools listed in Section 3.1, a prediction algorithm (SAAMP: Single Amino Acid Mutation Predictor) with a pathogenic index (PI) was developed. PI is defined as percentage of 'damaging' predictions from these 7 bioinformatics tools. The higher the PI is, the more pathogenic the SNP is. The cutoff value is set at 0.43. When PI is ≥0.43 (larger than or equal to 3 damaging related predictions), the mutation is defined as 'pathogenic' , otherwise it is 'benign'. A total of 81 known disease-associated missense mutations and 15 known benign polymorphisms of IDUA were analyzed by these bioinformatics tools, and the PI of each mutation was calculated. By assessing false positives and false negatives, a sensitivity of 94% and a specificity of 80% were reached. The false positives and false negatives were evaluated manually, however, no significant patterns were observed. It might be due to the differences in methodologies utilized by these in silico tools. Alternatively, when the cut-off value is set as 0.57 (larger than or equal to 4 damaging related predictions), a sensitivity of 79% and a specificity of 93% was calculated. In order to increase the probability of identifying pathogenic mutations and minimize the risk of neglecting patients, high sensitivity is preferable and the cut-off value of 0.43 is recommended.

Functional SNPs in UTRs identified by UTSscan and PolymiRTs
All of the 177 UTR SNPs were analyzed using UTRscan. It has been shown that polymorphisms in 3′ UTR region can affect the gene expression pattern during mRNA translation, while the polymorphisms in 5′ UTR region affect the RNA half-life by altering the polyadenylation [28,29]. After comparing the functional elements for each UTR SNP, we predicted that 6 SNPs in 5′ UTR are related to the functional pattern changes including internal ribosome entry site (IRES) and 15-Lipoxygenase Differentiation Control Element (15-LOX-DICE) ( Table 5). The IRES is involved in internal mRNA ribosome binding, which allows for translation when the   (Table 6).

Phenotypic severity prediction of known disease-associated mutations
Proper and timely treatment allocation based on phenotype severity prediction is essential for benefits of patients. The aforementioned bioinformatics tools are not designed specifically for MPS I disease, and are unable to predict the phenotype severity (Hurler, Hurler-Scheie or Scheie). Therefore, an extensive review of previous publications reporting pathogenic mutations of IDUA was conducted to make inferences about phenotype severity. A total of 185 mutations have been identified, including 86 missense mutations, 22 nonsense mutations, 45 deletions/insertions and 32 splicing mutations. By analyzing the phenotypes and mutations on both alleles of patients from the original reports, phenotype prediction of each mutation was conducted manually. Four general assumptions were used as followed: 1) only when both alleles are predicted to be severe, the phenotype is Hurler; 2) if one allele is predicted to be mild (intermediate) while the other severe, the phenotype is Scheie (Hurler-Scheie); 3) if both alleles are intermediate, the phenotype is Hurler-Scheie or Scheie; 4) even only one allele is predicted to be mild, the phenotype is Scheie (illustrated in Additional file 1: Fig. S1). Further, the crystal structure of IDUA has been elucidated [30,31], which was used to further confirm and rectify the predictions made in Tables 7 and 8. Notably, due to lack of enough information and consensus of phenotype severity, it is difficult to make a comprehensive evaluation of reliability of the original reports. Therefore, we highlighted the severity predictions with relatively low  insertions, 38 are predicted to be severe, which is reasonable due to the usual consequence of frame shift. However, there might be some exceptions: 396insAC, c.1593delG, and 1995del11 with Hurler-Scheie or Scheie phenotype. 1995del11 is in the final exon of IDUA, which may lead to residual enzyme activity. c.1593delG was found to be in trans with a missense mutation (deduced to be severe from multiple reports) in a Hurler-Scheie patient [32]. However, although this patient is defined as Hurler-Scheie, delayed mental development was observed. Therefore, this patient may actually have Hurler disease, which will make 1592delG 'severe'. Similarly, additional evidence is required to determine the phenotypic severity of 396insAC. Missense mutations are the least severe type, with only 31 out of 86 are predicted to be severe. P533R is the most frequent but complicated missense mutation, which has been found in the homozygous state in patients with Hurler, Hurler-Scheie and Scheie phenotypes. Due to convenience consideration, the nomenclature of mutations in this study still uses the old names as reported in previous publications. However, as suggested in the current guideline on nomenclature [33], it will be important to follow this guideline to name newly identified mutations.

Discussion
The identification of SNPs responsible for specific phenotypes with molecular approaches can be expensive and time-consuming [34]. Therefore, computational approaches can be of great help by narrowing down the number of missense mutations to be screened in genetic association studies and advancing the understanding of functional and structural aspects of the protein. Since existing in silico methods have widely varying performance, no single method could be considered as the best and most accurate for predicting functional SNPs. Therefore, a combination of methods based on evolutionary information, protein structure and functional parameters were used in order to increase the prediction accuracy.
Notably, there is no specific order for using these bioinformatics tools.  In this study, significant concordance was observed between the functional consequences of nsSNPs predicted by various combinations of the tools. Out of 201 missense nsSNPs predicted to be 'deleterious' by SIFT, 149 (74%) were also predicted to be 'probably damaging' by PolyPhen. Out of 285 missense nsSNPs, 93 (47%) were predicted to be 'damaging' by SIFT, PolyPhen, I-Mutant and PROVEAN. Then, these 93 nsSNPs were analyzed by PHD-SNP, SNPs&GO and PANTHER, and 28 (30%) were predicted to be disease-associated. Further, the SNPs predicted by these in silico approaches were well supported by experimental and clinical reports. We cross-referenced the results of in silico analysis and previously identified disease-associated mutations in HGMD. Out of 28 missense SNPs (Table 2) predicted, 18 (64%) have been identified to be diseaseassociated in the HGMD. These results demonstrated that implementations of different algorithms could serve as reliable and powerful tools for prioritizing candidate functional nsSNPs.
Based on the results in this study, a step-by-step guiding model for phenotype prediction of MPS I disease was established (Fig. 3). When a mutation is identified, 1) if it is a known disease-associated mutation, refer to Tables 7 and 8 for phenotype severity prediction; 2) if not, conduct the in silico analysis of coding region SNPs and UTR SNPs, respectively. As discussed previously [35], even multiple lines of computational evidence only count as a single supporting criterion for classifying variants as pathogenic or benign. Therefore, further confirmation should be conducted through biochemical and/or clinical analyses. This model will be of great use by providing a valid, time-saving, cheap and easy-to-use method for phenotype prediction for a variety of diseases including MPS I. Admittedly, there are some limitations of this model. First, the in silico analysis is not sensitive enough for phenotype severity prediction because there are no algorithms specifically designed for this purpose. Second, the 3D structural analysis relies on the availability of 3D structure, rendering it difficult for analyzing proteins without solved structures. In this case, homology modeling can be applied to bridge this gap by predicting unknown protein structures.

Conclusions
In conclusion, structural and functional impacts of nsSNPs in the IDUA gene were predicted using powerful computational tools. By predicting the possible deleterious SNPs of IDUA gene, the number of SNPs screened in association with diseases can be narrowed down to those that are most likely to alter gene function. Further, a model of phenotype prediction for MPS I disease by a combination of bioinformatics tools is established, which will benefit diagnosis and treatment allocation of MPS I patients. In the future, it will be essential to optimize the SAAMP algorithm by integrating the scores from each method with more sophisticated statistical methods, and validate it in a broad array of genes.   Fig. 3 Step-by-step guideline for phenotype prediction by in silico analysis