Skip to main content

Psychometric evaluation of the Indolent Systemic Mastocytosis Symptom Assessment Form (ISM-SAF) in a phase 2 clinical study



Indolent systemic mastocytosis (ISM) is a rare, clonal mast cell neoplasm characterized by severe, unpredictable symptoms. The Indolent Systemic Mastocytosis Symptom Assessment Form (ISM-SAF) items compose a Total Symptom Score (TSS), Gastrointestinal Symptom Score (GSS), and Skin Symptom Score (SSS) to assess symptom severity. This study evaluated the psychometric performance of ISM-SAF among ISM patients.


In PIONEER, a Phase 2 trial evaluating safety and efficacy of selective kinase inhibitor avapritinib in patients with ISM, the 12-item ISM-SAF was administered daily. Psychometric evaluation of score reliability, validity, and clinical interpretation was conducted using the trial data.


Thirty-eight patients contributed to analyses (78.9% female; mean age = 49). Baseline internal consistency reliability (α) for bi-weekly TSS, GSS, and SSS was 0.86, 0.83, and 0.82, respectively. Test–retest reliability among patients exhibiting no change in Patient Global Impression of Symptom Severity (PGIS) between Baseline and Day 15 exceeded 0.74 universally. Construct validity and known-groups analysis showed moderate to strong ISM-SAF score correlation (r = 0.382–0.881) to supportive patient-reported questionnaires (e.g., PGIS and Mastocytosis Quality of Life Questionnaire) symptom and skin scores, and ability to distinguish among clinically unique groups. Correlations of ISM-SAF and other assessment change scores reflect evidence of score sensitivity. Clinically important difference and response estimates were 7–10 and 19, respectively.


ISM-SAF produced reliable, construct-valid, sensitive scores when administered in PIONEER to patients in the target population. Results of this study support the use of the ISM-SAF as a reliable and valid measure to evaluate disease symptomology in ISM patients.

Trial registration, NCT03731260. Registered 10 October 2018,


Systemic mastocytosis is a rare, clonal mast cell neoplasm driven by the KIT D816V mutation [1], characterized by uncontrolled proliferation and activation of mast cells that leads to severe and unpredictable symptoms for patients with systemic mastocytosis [2]. The incidence of all systemic mastocytosis subtypes is approximately 0.89 per 100,000 per year [3] and the prevalence of indolent systemic mastocytosis (ISM) is estimated at 9.59/100,000 [3]. Many ISM patients experience severe, life-limiting symptoms that significantly impact daily life (e.g., psychological symptoms, neurological symptoms, asthenia) [4, 5]. Currently, there are limited treatment options available for patients with systemic mastocytosis and no approved therapies for patients with ISM [6].

There is a lack of well-defined and reliable measures of disease symptomology to assess the potential clinical benefits of novel treatments for ISM. To address this gap, the Indolent Systemic Mastocytosis Symptom Assessment Form (ISM-SAF) (©2018 Blueprint Medicines Corporation) was developed in ways consistent with regulatory [7] and scientific guidelines [8, 9] to evaluate clinical benefit hypotheses for use in product approval and labeling decisions. The content validity of the ISM-SAF was established using qualitative research methods, along with feedback from regulatory authorities to ensure the ISM-SAF aligned with regulatory expectations for instruments intended for use in clinical trials. Preliminary psychometric evaluation data generated from an observational study supported the trustworthiness of ISM-SAF scores [10], although the interpretation of scores has not yet been evaluated.

The goals of the present study were to psychometrically evaluate the scores produced by the ISM-SAF among ISM patients and inform the interpretation of ISM-SAF scores. Measurement-focused analyses were executed based on blinded data from Part 1 of the Phase 2 PIONEER trial to evaluate the performance of scores produced by the ISM-SAF with respect to score variability, distribution, and missingness; reliability; construct-related validity; and sensitivity to change. Additionally, distribution-based and anchor-based methods were employed to characterize how meaning is attributed to observed ISM-SAF change scores.


Study design

The ISM-SAF was administered daily to patients with ISM enrolled in Part 1 of PIONEER (NCT03731260), a multicenter, randomized, double-blind, placebo-controlled Phase 2 clinical trial to evaluate the safety and efficacy of avapritinib, a potent and selective inhibitor of KIT D816V, in patients with ISM with symptoms inadequately controlled with standard therapy (Fig. 1).

Fig. 1
figure 1

aAll subjects were randomized at the beginning of the study to one of three avapritinib doses or placebo in Part 1

BLU-285-2203 Part 1 study design. BSC best supportive care, GI gastrointestinal, ISM indolent systemic mastocytosis, PRO patient-reported outcome, RP2D recommended phase 2 dose, TSS total symptom score.

Analysis populations

Two analysis populations were defined: (1) a cross-sectional analysis population (CS-AP) composed of all patients with at least one response on the ISM-SAF evaluated at Baseline (biweekly period from Cycle 1 Day-14 [C1D-14] to C1D-1) and at least one biweekly follow-up score at either Cycle 3 (C3D-14 to C3D-1) or Cycle 4 (C4D-14 to C4D-1); and (2) a test–retest analysis population (TRT-AP) composed of patients who exhibited no change in Patient Global Impression of Severity (PGIS) score from Baseline to C1D15 who provided at least one response for the ISM-SAF at both Baseline and Timepoint 2 (C1D1 to C1D14).

Study assessments


The ISM-SAF is a 12-item diary that assesses 11 symptoms of ISM, including bone pain, abdominal pain, headache, nausea, spots, itching, flushing, fatigue, dizziness, brain fog, and diarrhea, over a 24-h period. Eleven items assess symptom severity using an 11-point numeric rating scale, where 0 = No [symptom] and 10 = Worst imaginable [symptom]; the twelfth item measures diarrhea frequency by asking patients to enter a discrete numerical value. Developed in United States English, the ISM-SAF underwent translation and linguistic/cultural validation in all relevant languages prior to implementation in PIONEER. A handheld electronic device was used to administer the ISM-SAF daily.

The ISM-SAF is scored as a 14-day average at the item, domain, and total score levels. The two symptom domains include the Gastrointestinal Symptom Score (GSS), composed of abdominal pain, nausea, and diarrhea severity (score range 0–30), and the Skin Symptom Score (SSS), composed of spots, itching, and flushing severity (score range 0–30). The Total Symptom Score (TSS) is composed of all 11 severity items (range 0–110). The daily domain and total scores are generated by summing the item scores for contributing items each day; if any contributing items are missing for the day, the daily score cannot be calculated. Biweekly scores were derived by averaging scores over 14 days, with a minimum of seven daily scores required.

Supportive measures

Psychometric evaluation of the ISM-SAF was supported by other patient-reported outcome (PRO) assessments, which were administered at Baseline (except for the Patient Global Impression of Change [PGIC]), C3D1, and C4D1. The administration of the Patient Global Impression of Severity (PGIS) at C1D15 was also used to evaluate test–retest reliability.

12-Item Short Form Health Survey (SF-12v2®)

The SF-12v2® is a 12-item PRO questionnaire developed for a general population assessing physical and emotional health and function using a recall period of “the past week” on three- and five-point verbal response scales (scores range from 0 to 100, with higher scores representing better health) [11, 12].

Mastocytosis Quality of Life Questionnaire (MC-QoL)

The MC-QoL is a 27-item PRO questionnaire assessing the domains of symptoms, emotions, social life/functioning, and skin in patients with cutaneous mastocytosis and ISM [13]. The questionnaire uses a recall period of “the past two weeks” and a five-point verbal rating scale (scores ranges from 0 to 100, where higher scores indicate higher health-related quality-of-life impairment).

Patient Global Impression of Severity (PGIS)

The PGIS is a single item that asks patients to rate their overall symptom severity “right now” on a five-point scale (“0—absent,” “1—minimal,” “2—moderate,” “3—severe,” and “4—very severe”).

Patient Global Impression of Change (PGIC)

The PGIC item assesses a patient’s perception of the change in the state of their condition at a point in time (“degree of change since beginning care at this clinic”) on an 11-point numeric rating scale measuring the full spectrum of change (0 = much better, 5 = no change, and 10 = much worse).

Five-level EQ-5D (EQ-5D-5L)

The EQ-5D-5L is used to measure current health status and provide a generic measure of health for clinical assessment. It comprises two parts: the EQ-5D-5L descriptive system and the EQ-5D-5L Visual Analogue Scale (VAS). The EQ-5D-5L VAS is a single item that asks respondents to self-rate their health on a VAS ranging from 0 to 100 where lower scores indicate a lower overall health state. Only the EQ-5D-5L VAS contributed to the psychometric analyses in this study.


All analyses were conducted in SAS 9.4 and focused on evaluating the performance of the ISM-SAF. There was no imputation of missing data. Unless otherwise specified, analyses were conducted using data at C1D1, C3D1, and C4D1, with C1D15 data additionally used to evaluate test–retest reliability.

Study sample

Descriptive statistics for age, sex, and race were computed for the study sample using the data generated from the CS-AP at Baseline.

Score distribution

Item-level and domain-level ISM-SAF score distributions were evaluated in terms of respondents’ use of the entire scale and for floor and ceiling effects.

Inter-item correlations

Inter-item correlations were evaluated to characterize the extent to which scores on one item of the ISM-SAF relate to scores produced by the other items within that same multi-item scale/domain. Guidelines used to facilitate interpretation of correlations were as follows: negligible relationship, r = 0.0–0.09; small relationship, r = 0.1–0.29; medium relationship, r = 0.30–0.49; and strong relationship, r ≥ 0.50 [14, 15].


Reliability estimates characterize consistency and reproducibility of a particular set of scores produced by a questionnaire when administered to a particular target patient population and in a particular context of use [16]. In this study, the reliability of the ISM-SAF was investigated in terms of both internal consistency reliability and test–retest reliability. Internal consistency reliability, which reflects to what extent individual items are measuring the same general concept [17], was investigated by calculating Cronbach’s alpha coefficient (α, range 0 to 1). Alpha was calculated for the biweekly TSS, GSS, and SSS using the CS-AP at Baseline, C3D1, and C4D1 and again with each individual item within a domain removed. Scores greater than 0.70 are typically seen as sufficient for research purposes [18]. Test–retest reliability, which assesses whether items produce stable scores at different assessment points during which no change (or minimal change) in the patient’s condition is expected to occur [19], was evaluated in the TRT-AP using ISM-SAF biweekly scores at Baseline and C1D15. Intra-class correlation coefficients (ICCs) greater than 0.70 are evidence of adequate test–retest reliability [20].


Construct-related validity measures the associations between concepts of a specified assessment and of other assessments (i.e., reasonably strong associations should exist between related concepts, and low associations between unrelated concepts), and was evaluated for the biweekly ISM-SAF scores by generating correlation coefficients between its scores and other PRO assessments at Baseline, C3D1, and C4D1. The same guidelines were used to facilitate interpretation of correlations as for inter-item correlations.

Known-groups methods characterize the degree to which a PRO questionnaire generates scores capable of distinguishing among patient groups hypothesized to be clinically distinct [7]. This analysis was conducted using the PGIS, EQ-5D-5L VAS, MC-QoL, and SF-12v2® to categorize patients into “known groups” at Baseline, and ISM-SAF biweekly scores were described across patient severity groups. It was hypothesized that higher ISM-SAF scores (greater symptoms) would be associated with worse symptoms/quality of life scores on the other instruments.

Sensitivity to change

Sensitivity-to-change analyses were conducted by examining the mean change and associated effect size [14] of biweekly ISM-SAF scores, as well as the correlations between the ISM-SAF change scores and change scores of other measures. It was hypothesized that improvements (or worsening) in ISM-SAF scores would correspond to improvements (or worsening) in other related measures.

Interpretation of scores

Score interpretation analysis informs how meaning is attributed to the change detected by a PRO questionnaire. Distribution-based methods utilize the observed distribution of the data to generate clinically important difference (CID) estimates, or the difference in mean scores between two treatment groups that can be considered clinically relevant [21, 22]. Two distribution-based analyses were employed here for the biweekly ISM-SAF scores: (1) ½ standard deviation (SD) at Baseline and (2) standard error of measurement (SEm). Anchor-based methods use external criteria (PGIS) to categorize patients into groups, each reflecting an a priori-determined change grouping (e.g., no change, positive change, or negative change), and were employed to generate clinically important response (CIR) estimates to inform conclusions about the meaning of observed within-person change in the scores of the ISM-SAF [22, 23].


Study sample

A total of 38 eligible patients contributed to the psychometric-focused analysis, with < 3% (n = 1) of patients having missing biweekly severity item scores at C3D1 and C4D1. The average age of the CS-AP cohort was 49.0 years (SD = 13), 78.9% of the patients were female (n = 30), and 92.1% of the patients were White (n = 35). Complete demographic and health details are presented in Additional file 1: Table S1.

Score distribution

Descriptive analysis of the ISM-SAF indicated that, while patients used the range of response options available to them for each item (i.e., 0 to 10), not all patients reported experiencing all symptoms and, when symptoms were reported, severity rates were variable. The mean scores of severity items ranged from 3.0 (diarrhea) to 7.2 (fatigue); the mean TSS, GSS, and SSS were 54.2, 10.9, and 16.2, respectively, at Baseline.

Inter-item correlations

At Baseline, the GSS items (abdominal pain, nausea, and diarrhea) were moderately to strongly correlated with one another (r = 0.46 to 0.83), while the SSS items (spots, itching, and flushing) were also moderately to strongly correlated with one another (r = 0.46 to 0.76). The GSS items and other symptom severity items (bone pain, fatigue, dizziness, brain fog, and headache) were moderately to strongly correlated with one another at Baseline (r = 0.41 to 0.67) with the exception of abdominal pain and nausea with bone pain (r = 0.28), and diarrhea severity with headache (r = 0.13). The SSS items and other symptom severity items had small to medium relationships at Baseline (r = 0.11 to 0.42) with the exception of the spots item, which had negative and negligible to small relationships with other symptom items (r = -0.26 to -0.07). In addition, the SSS items were negligibly to moderately related to the GSS items (r = -0.02 to 0.44). As expected, results indicated a strong relationship between the diarrhea frequency and severity items (r = 0.72 at Baseline). As a wider range of values were available for the ISM-SAF at C3D1 and C4D1, the correlations among items were generally enhanced at the later timepoints.


Internal consistency reliability

Internal consistency estimates (α) for the TSS, GSS, and SSS biweekly scores are presented in Table 1 and exceeded pre-specified criteria for adequate reliability (α ranged from 0.72 to 0.86). Removal of items from the TSS did not result in an appreciable increase in alpha coefficients; removal of the diarrhea severity item and spots item resulted in an increase in the Cronbach’s alpha for the GSS and SSS, respectively.

Table 1 Internal consistency reliability estimates of biweekly ISM-SAF domain and total scores

Test–retest reliability

Test–retest reliability ICCs for the biweekly ISM-SAF TSS, GSS, SSS, and item scores for patients maintaining the same PGIS rating at Baseline (C1D1) and at C1D15 (as their scores are expected to remain stable) are presented in Table 2. All ICCs exceeded 0.7 (ranged from 0.741 to 0.986), indicating that the biweekly item, domain, and total ISM-SAF scores were all reliable.

Table 2 Test–retest reliability between baseline and C1D15 (n = 17)


Construct-related validity

The relationships between the TSS and other variables were strong and in the expected direction. Specifically, at C4D1, the biweekly ISM-SAF domain and total scores were more strongly correlated (r = 0.382 to 0.881) to the PGIS, MC-QoL symptom and skin scores than to more distal concepts. Correlations with other measures were generally greater for the TSS than for the GSS and SSS, except for the MC-QoL skin domain, which correlated most strongly with the SSS as expected (Table 3).

Table 3 Pearson correlations of ISM-SAF total and domain scores with other measures administered at C4D1 (N = 34a)

Known-groups analysis

ISM-SAF TSS scores were able to distinguish among clinically unique groups, as evidenced by clearly distinct scores in the hypothesized direction (i.e., participants with greater symptoms, as assessed by the PGIS, EQ-5D-5L VAS, MC-QoL Symptoms, and SF-12v2® Physical Component Summary (PCS), also scored higher on the ISM-SAF). These differences in scores were statistically significant (p < 0.05) across all groups for the TSS at C4D1 (Table 4). For the GSS and SSS, scores for most groups also trended in the hypothesized direction, although the differences were not always significant. In cases where the mean and median scores for GSS and SSS were similar between adjacent severity groups, any deviations from hypotheses were likely due to the limitation of sample size.

Table 4 Known-groups analysis of ISM-SAF scores based on concurrent assessments administered at C4D1

Sensitivity to change

The results indicated that all ISM-SAF scores were sensitive to change, as shown by a decrease from Baseline to C4D1. The mean change scores of the biweekly TSS (− 12.70 [SD = 14.93]), GSS (− 3.83 [SD = 5.98]), and SSS (− 4.07 [SD = 5.64]) all had a moderate effect size (0.8 > d ≥ 0.5). In addition, the results indicated that from Baseline to C4D1, the change scores of the TSS, GSS, and SSS were strongly correlated with each other (r ≥ 0.50) and moderately to strongly correlated with the change scores in the PGIS, EQ-5D-5L VAS, SF-12v2®, MC-QoL domain and total scores, and PGIC (Additional file 1: Table S2), indicating sensitivity to change.

Interpretation of scores

Candidate between-group CIDs for ISM-SAF biweekly scores were generated using distribution-based methods (TSS = 7–10, GSS = 2–4, and SSS = 3–4 points) (Table 5). Among patients who “improved” (n = 16) based on a PGIS reduction of one or two units, the ISM-SAF biweekly TSS, GSS, and SSS decreased 19.0, 6.4, and 6.2 with an average individual percent decrease of 29.4%, 8.4%, and 36.3% at C4D1 from Baseline, respectively. CID estimates based on changes from Baseline to C3D1 were slightly lower (16.2, 6.0, and 4.7 for the TSS, GSS, and SSS, respectively).

Table 5 Distribution-based methods to estimate clinically important difference for ISM-SAF biweekly domain and total scores


The results of the psychometric analysis of the TSS scores produced by the ISM-SAF in Part 1 of PIONEER provide evidence of the reliability and validity of the ISM-SAF’s scores and help to inform score interpretation of the ISM-SAF in future clinical studies.

The data showed strong compliance with the ISM-SAF across all timepoints, with only one patient missing a TSS score at C3D1 and C4D1. The ISM-SAF was able to produce reliable scores in terms of internal consistency and test–retest reliability. The biweekly TSS, GSS, and SSS all met the pre-specified criterion for internal consistency (α > 0.70) at Baseline, and the removal of items from TSS did not appreciably increase alpha coefficients. Test–retest reliability exceeded 0.70 for all biweekly scores. The scores produced by the ISM-SAF were also concluded to be construct-valid based on the evidence that they moderately to strongly correlated with other assessments as expected (e.g., PGIS, MC-QoL symptom and skin scores). In addition, as evidence of validity by known-groups analysis, TSS was clearly distinct by PGIS, EQ-5D-5L VAS, MC-QoL symptom, and SF-12v2® PCS score groups in the hypothesized direction. Lastly, the ISM-SAF scores were also observed to be sensitive to change, as shown by all ISM-SAF scores decreasing from Baseline to C4D1, and the moderate to strong correlation of change scores on the ISM-SAF with change scores of other instruments measuring similar concepts.

Candidate between-group CIDs for ISM-SAF bi-weekly scores were generated using distribution-based methods and, based on a range of 7–10 scale units for the TSS, a 10-point threshold was chosen as a conservative approach to provide guidance for interpreting substantive results when using ISM-SAF for the comparison of treatment group mean differences. Candidate CIR estimates were generated using anchor-based methods based on changes in ISM-SAF scores for those patients who improved on the PGIS from Baseline to C3D1 and C4D1. Based on the upper limit of the range of estimates for individual percentage decrease (i.e., 29.4% for TSS using PGIS anchor at C4D1), a 30% individual percentage decrease on the TSS was selected as a conservative estimate to represent the CIR or improvement at the individual level for future efficacy analyses.

There were a few limitations in this study. The removal of the diarrhea item resulted in a notable increase of Cronbach’s alpha for GSS, and the removal of the spots item also resulted in an increase in the alpha coefficient. The decision as to whether an item should be removed from the calculation of a domain or total score is not solely based on the Cronbach’s alpha coefficient, and the conceptual framework of the measure (e.g., the relevance of diarrhea to gastrointestinal key signs and symptoms) generated from patient interviews should be taken into consideration. For example, based on the results from concept elicitation patient interviews, 75% of the patients (n = 12/16) identified diarrhea as a symptom of ISM, and 90% of patients (n = 9/10) cognitively debriefed reported having experienced diarrhea due to their ISM. Therefore, even though 47.4–62.2% of patients (n = 18–23) in Part 1 of PIONEER scored zero (i.e., no diarrhea) at each biweekly assessment timepoint used in analyses, which might affect the internal consistency of GSS, it was not recommended that the diarrhea severity item be removed from the scale.

Additionally, the confidence in the statistical analysis was reduced due to the limited sample size. Although the ISM-SAF TSS was clearly distinct by PGIS, EQ-5D-5L VAS, MC-QoL, and SF-12v2® groups, the small sample size (N = 38 for CS-AP) limited the interpretation of these known-groups analyses (n < 10 for some groups). Additionally, the small to moderate effect sizes generated using these data were expected because the change from Baseline to C4D1 was examined with combined treatment groups and placebo group, given the blinded nature of the data on which these estimates were based. Furthermore, given the limitations of the PGIC version implemented in the study (e.g., not specific to change in symptoms, and potential recall bias), only CIR estimates generated using PGIS anchors are reported here.

The ISM-SAF was developed through qualitative research including both patients with ISM and those with smoldering systemic mastocytosis. Although the psychometric analyses presented here are based on an ISM population, the findings are consistent with preliminary psychometric analyses that were previously conducted through an observational study involving both patients with ISM and those with smoldering systemic mastocytosis [10], thereby supporting the use of the ISM-SAF in this broader population.

In conclusion, the ISM-SAF produced reliable, construct-valid, and sensitive scores when administered in the target patient population participating in a regulated clinical trial, with a CIR definition of a 30% individual percentage decrease on the TSS. These results, along with the ISM-SAF’s strong development history and evidence of content validity, support its use in clinical studies designed to evaluate ISM treatments and impact on patient symptom improvement.

Availability of data and materials

The datasets generated and/or analyzed during the current study are not publicly available due to the fact that the analyses described in this publication were executed based on blinded data from an ongoing Phase 2 clinical trial but could be available from the corresponding author on reasonable request.



Clinically important difference


Clinically important response


Cross-sectional analysis population


Five-level EQ-5D


Gastrointestinal Symptom Score


Intraclass correlation coefficient


Indolent systemic mastocytosis


Indolent Systemic Mastocytosis Symptom Assessment Form


Mastocytosis Quality of Life Questionnaire


Patient Global Impression of Change


Patient Global Impression of Severity


Patient-reported outcome


Standard deviation


Skin Symptom Score


Test–retest analysis population


Total Symptom Score


Visual analogue scale


  1. Jara-Acevedo M, Teodosio C, Sanchez-Muñoz L, et al. Detection of the KIT D816V mutation in peripheral blood of systemic mastocytosis: diagnostic implications. Mod Pathol. 2015;28(8):1138–49.

    Article  CAS  Google Scholar 

  2. Metcalfe DD. Mast cells and mastocytosis. Blood. 2008;112(4):946–56.

    Article  CAS  Google Scholar 

  3. Cohen SS, Skovbo S, Vestergaard H, et al. Epidemiology of systemic mastocytosis in Denmark. Br J Haematol. 2014;166(4):521–8.

    Article  Google Scholar 

  4. Hermine O, Lortholary O, Leventhal PS, et al. Case–control cohort study of patients’ perceptions of disability in mastocytosis. PLoS ONE. 2008;3(5):e2266.

    Article  Google Scholar 

  5. Jennings S, Russell N, Jennings B, et al. The Mastocytosis Society survey on mast cell disorders: patient experiences and perceptions. J Allergy Clin Immunol Pract. 2014;2(1):70–6.

    Article  Google Scholar 

  6. Valent P, Akin C, Metcalfe DD. Mastocytosis: 2016 updated WHO classification and novel emerging treatment concepts. Blood. 2017;129(11):1420–7.

    Article  CAS  Google Scholar 

  7. US Department of Health and Human Services, Food and Drug Administration, Center for Drug Evaluation and Research, Center for Biologics Evaluation and Research, Center for Devices and Radiological Health. Guidance for industry patient-reported outcome measures: use in medical product development to support labeling claims. 2009.

  8. Patrick DL, Burke LB, Gwaltney CJ, et al. Content validity-establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO good research practices task force report: part 1-eliciting concepts for a new PRO instrument. Value Health. 2011;14(8):967–77.

    Article  Google Scholar 

  9. Patrick DL, Burke LB, Gwaltney CJ, et al. Content validity-establishing and reporting the evidence in newly developed patient-reported outcomes (PRO) instruments for medical product evaluation: ISPOR PRO good research practices task force report: part 2-assessing respondent understanding. Value Health. 2011;14(8):978–88.

    Article  Google Scholar 

  10. Shields, AL, Taylor F, et al. Psychometric performance of the Indolent Systemic Mastocytosis Symptom Assessment Form (ISM-SAF). ISPOR 22nd Annual European Congress, Copenhagen, Denmark; 2019.

  11. Maruish ME, editor. User’s manual for the SF-36v2®. 3rd ed. QualityMetric Inc; 2011.

    Google Scholar 

  12. Ware JE Jr, Kosinski M, Keller SD. A 12-Item Short-Form Health Survey: construction of scales and preliminary tests of reliability and validity. Med Care. 1996;34(3):220–33.

    Article  Google Scholar 

  13. Siebenhaar F, von Tschirnhaus E, Hartmann K, et al. Development and validation of the mastocytosis quality of life questionnaire: MC-QoL. Allergy. 2016;71(6):869–77.

    Article  CAS  Google Scholar 

  14. Cohen J. Statistical power analysis for the behavioral sciences. Lawrence Earlbaum Associates; 1988.

    Google Scholar 

  15. Cohen J. A power primer. Psychol Bull. 1992;112(1):155–9.

    Article  CAS  Google Scholar 

  16. Thompson B, Vacha-Haase T. Psychometrics is datametrics: the test is not reliable. Educ Psychol Measur. 2000;60(2):174–95.

    Article  Google Scholar 

  17. Thompson B. Understanding reliability and coefficient alpha, really. In: Thompson B, editor. Score reliability: contemporary thinking on reliability issues. Sage Publications; 2003. p. 3–21.

    Chapter  Google Scholar 

  18. Nunnally JC. The assessment of reliability. In: Bernstein I, editor. Psychometric theory. McGraw Hill; 1994. p. 248–92.

    Google Scholar 

  19. Guttman L. A basis for analyzing test–retest reliability. Psychometrika. 1945;10:255–82.

    Article  CAS  Google Scholar 

  20. Cicchetti DV. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychol Assess. 1994;6(4):284–90.

    Article  Google Scholar 

  21. Cappelleri JC, Zou KH, Bushmakin AG, Alvir JMJ, Alemayehu D, Symonds T. Patient-reported outcomes: measurement, implementation and interpretation. CRC Press; 2013.

    Book  Google Scholar 

  22. Shields A, Coon C, Hao Y, et al. Patient-reported outcomes for US oncology labeling: review and discussion of score interpretation and analysis methods. Expert Rev Pharmacoecon Outcomes Res. 2015;15(6):951–9.

    Article  Google Scholar 

  23. Coon CD, Cappelleri JC. Interpreting change in scores on patient-reported outcome instruments. Therap Innov Regul Sci. 2016;50(1):22–9.

    Article  Google Scholar 

Download references


Kas Severson for manuscript writing support; Christine Yip for programming QC support.


Funding for this study was provided by Blueprint Medicines.

Author information

Authors and Affiliations



ALS, TG, ALB, H-ML, FS, and BM contributed to concept and design. BP, FT, TG, and H-ML contributed to acquisition of data. BP, ALS, FT, XL, JM, CA, and FS contributed to analysis and interpretation of data. All authors contributed to drafting the manuscript and/or critical revision of the paper for important intellectual content. ALB contributed to obtaining funding. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Fiona Taylor.

Ethics declarations

Ethics approval and consent to participate

Institutional review board or independent ethics committee approval was obtained from each study center that participated in NCT03731260. All patients gave written informed consent.

Consent for publication

Not applicable.

Competing interests

BP, ALS, FT, XL, and JM are, or were at the time of the research, employees of Adelphi Values, which conducted research on behalf of Blueprint Medicines. TG, ALB, H-ML, and BM are employees of Blueprint Medicines and own stock in the company. CA received research funding and consultancy fees from Blueprint Medicines. FS is or recently was a speaker and/or advisor for and/or has received research funding from Allakos, Blueprint, Celldex, Genentech, Moxie, Novartis, and Uriach.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1:

Demographic information and sensitivity to change results. Table S1. Sample demographic information at Baseline (N = 38). Table S2. Sensitivity to change: Correlation between ISM-SAF biweekly domain and total change scores and change in concurrently administered measures from Baseline to C4D1 (N = 36).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Padilla, B., Shields, A.L., Taylor, F. et al. Psychometric evaluation of the Indolent Systemic Mastocytosis Symptom Assessment Form (ISM-SAF) in a phase 2 clinical study. Orphanet J Rare Dis 16, 434 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: