Sources of variation in estimates of Duchenne and Becker muscular dystrophy prevalence in the United States
Orphanet Journal of Rare Diseases volume 18, Article number: 65 (2023)
Direct estimates of rare disease prevalence from public health surveillance may only be available in a few catchment areas. Understanding variation among observed prevalence can inform estimates of prevalence in other locations. The Muscular Dystrophy Surveillance, Tracking, and Research Network (MD STARnet) conducts population-based surveillance of major muscular dystrophies in selected areas of the United States. We identified sources of variation in prevalence estimates of Duchenne and Becker muscular dystrophy (DBMD) within MD STARnet from published literature and a survey of MD STARnet investigators, then developed a logic model of the relationships between the sources of variation and estimated prevalence.
The 17 identified sources of variability fell into four categories: (1) inherent in surveillance systems, (2) particular to rare diseases, (3) particular to medical-records-based surveillance, and (4) resulting from extrapolation. For the sources of uncertainty measured by MD STARnet, we estimated each source’s contribution to the total variance in DBMD prevalence. Based on the logic model we fit a multivariable Poisson regression model to 96 age–site–race/ethnicity strata. Age accounted for 74% of the variation between strata, surveillance site for 6%, race/ethnicity for 3%, and 17% remained unexplained.
Variation in estimates derived from a non-random sample of states or counties may not be explained by demographic differences alone. Applying these estimates to other populations requires caution.
Public health surveillance, defined as the "systematic and continuous collection, analysis, and interpretation of data," is foundational to public health practice. Public health surveillance provides accurate, representative information on the occurrence of a disease in the population from which the data are collected but is not usually designed to be generalizable to other populations. Resources and logistics may limit surveillance programs to a few catchment areas that may not be representative of the entire population. In the absence of other data, prevalence and other epidemiologic measures from these few catchment areas are often generalized to the population, which is valid only if the epidemiology of the disease of interest is consistent across the population.
Significant variation in epidemiologic measures among catchment areas suggests the underlying epidemiology of the disease differs among geographic areas. However, rare diseases are vulnerable to random fluctuation in prevalence estimates, which can be difficult to distinguish from true differences among populations. Structured uncertainty analysis can be an important tool for assessing such differences. Taruscio and Mantovani recently demonstrated the value of uncertainty analysis to identify gaps in our knowledge of the epidemiology of rare diseases and assess their impact. They categorize the sources of uncertainty into epistemic (uncertainty due to lack of knowledge), sampling uncertainty (uncertainty associated with data and disparate methods), and variability (uncertainty due to heterogeneity within a population).
The Muscular Dystrophy Surveillance, Tracking and Research Network (MD STARnet), which conducts population-based surveillance of muscular dystrophies in selected areas of the US, is the sole source of population-based prevalence estimates in the country [4, 5]. The 2007 MD STARnet estimated prevalence of Duchenne/Becker muscular dystrophy (DBMD) among males age 5 to 24 was 1.47 cases per 10,000 males (calculated from data in the article) [6, 7]. The range among the four individual catchment areas was 1.3 to 1.8 cases per 10,000 males ages 5 to 24 years, a variance of 12%. Among the three catchment areas with estimates for 2007 and 2014–2019, the same catchment areas had higher prevalence in both time periods, indicating that the differences between catchment areas are likely not random (Personal communication, Suzanne McDermott, DBMD Ascertainment Progress Presented: Fall 2017 MD STARnet Principal Investigators Meeting. Atlanta, GA, 2017).
Variation across catchment areas could be due to true differences in the population frequency of pathogenic alleles of the dystrophin gene; the population distribution of sex, age or ancestry; or migration among individuals with DBMD. It could also be due to random or systematic error. Our aim was to understand what factors explain the observed differences in DBMD prevalence among catchment areas and the implications for the generalizability of the prevalence estimates. Our analysis examined sources of sampling uncertainty and population variability. If population demographics or regional differences in diagnosis or surveillance practices explain the variation among catchment areas, adjustment for these differences would allow MD STARnet estimates to be extrapolated to the broader U.S. population. Unexplained variation between catchment areas indicates that MD STARnet prevalence estimates may not be an accurate estimate of DBMD prevalence in the broader U.S. population.
Literature review and investigator survey
After abstract and title review, we identified 52 unique citations, of which 12 advanced to full text review (Additional file 1: Fig. S2, Additional file 2). We included findings from five articles, from which we identified 12 potential sources of variation (Table 1) [8,9,10,11,12]. None of the minor discrepancies in abstraction required adjudication. Most information on sources of variation was in surveillance or registry methodological articles. These articles examined rare disease cluster identification, drug registries for treatments of lysosomal storage disorders, a cancer registry, and surveillance based on multiple data sources. The fifth article was an epidemiological report from a registry of arthritis, musculoskeletal and skin diseases.
Twenty investigators from six sites completed our survey on sources and magnitude of bias in MD STARnet. The investigators included six analysts, four abstractors, three clinicians, three study coordinators, two data managers, and two people with unspecified roles. The survey identified 12 sources of variation, five of which had not been identified by the literature review (Table 1). The average investigator estimate of bias in DBMD prevalence from a given source ranged from 5% (for residents obtaining care outside the study region and demographic changes in the population) to 12% (for differences between the MD STARnet and the U.S. population) (Additional file 3: Table S1).
In total, we identified 17 sources of variation in national estimates from the literature review or the investigator survey (Table 1). We grouped the sources of variation into four categories comprising sources of variation that are:
Inherent to all surveillance systems, including case ascertainment, misclassification of disease status, and migration;
Specific to rare disease surveillance, including small case numbers, regional differences in incidence, the relatively large impact of a few misclassified cases, and biases in care-seeking behaviors and diagnostic practices;
Specific to medical records-based surveillance, including lack of standardization and incomplete data; and
Due to extrapolation from local to national estimates, including differences between the local and national populations.
Sources and magnitude of variation
The expanded MD STARnet data set included 720 cases from a surveilled male population of 8 million (Table 2). Of these cases, 249 (35%) were identified in Arizona, 193 (27%) in Colorado, 152 (21%) in Iowa, and 126 (17%) in western New York. The cases were mostly non-Hispanic and white (67%). The racial and ethnic distribution of the cases was similar to that of the surveilled populations, although individuals of Black or Other race were slightly underrepresented among the cases.
Age and ethnicity distributions were significantly associated with prevalence. Age group explained the majority of the variability between strata, accounting for 74% of the deviance in the model. However, the similarity of unadjusted, standardized, and adjusted prevalence estimates indicates that population differences in age and ethnicity or differences in the surveillance process account for very little of the variation between catchment areas (Table 3). Catchment area accounted for the second largest proportion of the variability between strata, 6% of the total variance (Table 4). Arizona was the reference site due to alphabetical coding order. Prevalence in Colorado and Iowa did not differ significantly from that in Arizona (Table 5). However, the prevalence in the New York catchment area was approximately twice that of Arizona (prevalence ratio, 2.2; p < 0.001). Seventeen percent of the variation in prevalence across strata remained unexplained after controlling for the demographic and process factors in the model.
Our primary goal was to determine whether adjusting for sources of variability in site-specific prevalence estimates would reduce differences among catchment areas, increasing confidence that findings are generalizable beyond the areas included within the surveillance system. Unfortunately, adjusting for known and potential sources of variability by standardization or multivariable modeling did not substantially reduce between-site differences. Surveillance site accounted for 6% of the deviance between prevalence rates, and 17% of the deviance was unexplained after adjusting for age, race/ethnicity, and ascertainment details. The large proportion (74%) of the deviance explained by age group is expected given the natural history of DBMD. In this progressive disorder, prevalence is low in children younger than the usual age of diagnosis (approximately 5 years) and highest among children age 5–19 years, when most affected boys have been diagnosed and mortality is still low. The prevalence declines among adults age 20 years and older, when mortality increases.
Our analysis complements the article by Taruscio and Mantovani [3] by providing an example of a structured analysis to evaluate the uncertainty in prevalence estimates of rare diseases. We experienced several challenges in analyzing the sources of variability. Population-level data on potential sources of variation, such as the number of unsurveilled health care providers within a catchment area, were unavailable. We could not evaluate how well our proxy measures, the mean number of sources at which cases were ascertained and the proportion of cases seen at a neuromuscular clinic, estimated the completeness of coverage of health care facilities treating muscular dystrophy for each stratum. Socioeconomic status was unavailable at the case level. The limited data on potential sources of variability and the relatively small number of strata limited our ability to explain the sources and magnitude of variation in DBMD prevalence rates.
Our analysis is strengthened by factors that reduce process variability in case ascertainment. MD STARnet sites use a standard protocol. Cases are actively sought using multiple data sources, and identifying information allows duplicate cases to be identified and consolidated. For the pilot, case eligibility was reviewed by a local clinician experienced in treating muscular dystrophy cases, with additional review of uncertain cases by a committee of clinicians [4, 13].
Our findings suggest that the estimated prevalence of muscular dystrophy may be dependent on which sites are included in MD STARnet. More generally, they suggest that estimates derived from a non-random sample of states or counties cannot be assumed to represent national rates. Although not all the factors that impact MD STARnet estimates are generalizable to other surveillance systems, our study illustrates a valuable approach for evaluating the sources and impact of uncertainty that is applicable to rare disease surveillance systems generally. This analysis provides an example of one methodology for such an evaluation. The Poisson model we used provided estimates of the magnitude and relative contribution of each potential source of variability of DBMD prevalence across demographic strata within the limitations of our data.
Estimating sources of variability in the extrapolation of the prevalence of DBMD from a local to a national scale requires attention to surveillance methodology, the characteristics of the condition under surveillance, and differences and similarities between the local and national populations. In this study, 17% of the variation was not explained by the model.
Our objectives were to identify sources of variation in MD STARnet prevalence estimates between sites and to estimate the magnitude of the total variation in DBMD prevalence estimates and the relative contribution of each source of variation.
Sources of variation
We identified potential sources of variation in prevalence estimates from the scientific literature and expert opinion. We synthesized the findings into a theoretical model of how the sources contributed to potential bias in generalizing the estimates to the US population (Fig. 1).
Literature review. Two analysts independently searched PubMed and Google Scholar and reviewed the retrieved citations for eligibility. Our original criteria for inclusion were methodological studies of the types, sources, or magnitude of bias in surveillance or research studies. PubMed and Google Scholar were chosen because they were available to both analysts and were expected to capture most articles on public health surveillance methods. The search terms included surveillance, rare disease, prevalence, error, limitations, uncertainty, epidemiology, estimation, MD STARnet, muscular dystrophy, US Census, and variations of these terms. Details on the search strategies are provided in Additional file 4. The last search was conducted on November 3, 2016 and included all articles published prior to that date. The search was not updated after the final logic model was constructed.
We adhered to a rigorous search methodology to the extent possible but deviated from a full systematic review methodology in two regards. First, we could not develop a complete, deduplicated count of identified citations because Google Scholar results cannot be exported, making it impossible to identify duplicates. Second, we found very few studies that met our pre-determined eligibility criteria of being designed explicitly to study the sources or magnitude of bias in surveillance systems. Instead, information on sources of bias was more commonly found in reports about surveillance or research study design. We therefore included articles that discussed possible sources of bias in their surveillance system or data even if they did not estimate the magnitude of the bias. The placement of the information within the article and the depth of detail varied greatly among studies. This variability made the use of structured abstraction or a data extraction tool impossible. Instead, relevant information was manually extracted into Word.
Both analysts reviewed the combined list of eligible citations and classified each as included or excluded. Included articles were abstracted by both analysts independently and reviewed for discrepancies.
Investigator survey. We surveyed MD STARnet investigators to explore their experiences and perceptions of different sources of variation that may affect MD STARnet prevalence estimates, and the approximate magnitude of bias that may be introduced by each source (Additional file 5: Fig. S1). Due to the small number of eligible sites, instead of formally piloting the survey, it was reviewed by North Carolina investigators who did not participate in developing the survey. We emailed the link to the SurveyGizmo survey to the principal investigators of six sites (Colorado, Iowa, western New York, central North Carolina, South Carolina, and Utah) funded from 2014 to 2019 and asked them to distribute it to the MD STARnet investigators at their site. Because staff roles and responsibilities vary across MD STARnet sites, we relied on the principal investigators to distribute the survey to appropriate site colleagues. The survey was anonymous; investigators who responded online could not be identified or linked to a specific site, and a formal response rate could not be calculated. There was at least one response from all sites. Four sites submitted responses through the link, and two sites submitted aggregate responses for their site by email. The institutional review board (IRB) at RTI International, employer of the primary analysts, determined the survey was program evaluation, not human subjects research as defined by 45 CFR 46.102. Due to the small sample size and the aggregate responses obtained from two sites, all data were analyzed descriptively.
MD STARnet data
The analytic data were from MD STARnet’s pilot expanded muscular dystrophy surveillance (EMDS). Four geographically defined surveillance sites (Arizona, Colorado, Iowa, and 12 counties in western New York) conducted retrospective active surveillance of nine muscular dystrophies (MD) (Duchenne, Becker, congenital, distal, Emery-Dreifuss, facioscapulohumeral, limb-girdle, and oculopharyngeal MD, MD not otherwise specified, and myotonic dystrophy) from 2011 to 2014. All four sites had authority to conduct public health surveillance by the legal authority of their state department of health and/or institutional review board approval or exemption. Informed consent was waived because the project was public health surveillance. Trained medical coders reviewed electronic or paper medical records of eligible cases to abstract information about signs and symptoms, diagnostic tests, treatment and follow-up care. Eligible individuals had evidence of a physician’s diagnosis of a specific MD type within their medical record, resided within an MD STARnet catchment area, and had at least one healthcare encounter from 2007 to 2011 inclusive. Case ascertainment sources varied between sites but included physician and other provider medical records, hospital records, vital statistics, and administrative data. Cases were ascertained using International Classification of Diseases, Ninth Revision, Clinical Modification codes (359.0: congenital hereditary MD, 359.1: hereditary progressive MD, 359.21: myotonic dystrophy) in medical and administrative records and International Classification of Diseases, Tenth Revision mortality codes (G71.0: MD, G71.1: myotonic dystrophy) in death certificates. At each site, a clinician who treated patients with muscular dystrophy reviewed the abstracted case notes and decided if the MD type specified was consistent with standard diagnostic practice.
If the diagnosis was in question, a panel of 5 neuromuscular experts made the final determination about MD type. The muscular dystrophies differ in inheritance pattern, age and sex of individuals affected, and prevalence of the disorders. Therefore, we limited our analyses to DBMD. Because we estimated the point prevalence of DBMD, we only included individuals with DBMD who were alive on July 1, 2010, leaving a total of 720 cases.
To determine if the variability in site-specific prevalence was within expected random variation, controlling for site population demographics and surveillance procedures, we constructed a dataset with one record for each age-race/ethnicity-site stratum, with a total of 96 strata. The dataset variables were number of DBMD cases, total population, age category (5-year intervals as shown in Table 2), surveillance site, race/ethnicity (White, Black, Hispanic and Other, which included Asian, Pacific Islander, American Indian, and unknown or unspecified race), method of diagnosis (proxy for diagnostic certainty; defined as genetic diagnosis in case or family member, family history of MD, or clinical diagnosis), the average number of reporting sources per patient (proxy for likelihood of identification at surveilled facilities), and the proportion of patients within the stratum treated at an MD clinic (proxy for likelihood of being treated at surveilled facilities). Data were too sparse to include zip code in the strata definition, which would have allowed us to use Census data as a proxy for socioeconomic status. We defined age and vital status as of July 1, 2010.
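The stratum-level dataset described above can be pictured as one record per site, race/ethnicity, and age combination. The following is a minimal sketch under stated assumptions, not the authors' code: the class name `Stratum`, the field names, and the six placeholder age-band labels are hypothetical (the text specifies 5-year intervals from Table 2 but not the cut points; six bands are assumed only because 4 sites x 4 race/ethnicity groups x 6 bands gives the reported 96 strata). Method of diagnosis is omitted because its stratum-level coding is not described.

```python
from dataclasses import dataclass
from itertools import product

SITES = ["Arizona", "Colorado", "Iowa", "Western New York"]
RACE_ETHNICITY = ["White", "Black", "Hispanic", "Other"]
# Assumption: six placeholder age bands; the actual cut points are in Table 2.
AGE_GROUPS = [f"age_band_{i}" for i in range(1, 7)]


@dataclass
class Stratum:
    site: str
    race_ethnicity: str
    age_group: str
    cases: int = 0                       # DBMD cases alive on the index date
    population: int = 0                  # census male population of the stratum
    mean_reporting_sources: float = 0.0  # proxy: identification at surveilled facilities
    prop_md_clinic: float = 0.0          # proxy: treatment at surveilled facilities

    def prevalence_per_10k(self) -> float:
        """Point prevalence per 10,000 males in this stratum."""
        if self.population <= 0:
            raise ValueError("stratum population must be positive")
        return 10_000 * self.cases / self.population


def build_strata() -> dict:
    """Create one empty record per site x race/ethnicity x age-group stratum."""
    return {
        key: Stratum(*key)
        for key in product(SITES, RACE_ETHNICITY, AGE_GROUPS)
    }


strata = build_strata()
print(len(strata))
```

Keying the records by the stratum tuple makes it straightforward to fill in case counts and census populations from separate sources before modeling.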
Sources of variation in calculated prevalence
We calculated the unadjusted prevalence of DBMD overall and by site, age, and race/ethnicity. We calculated standardized prevalence for the US population using standard methods. Briefly, we calculated the prevalence for each age-race/ethnicity stratum, calculated the expected number of cases for the US based on the US population for equivalently defined strata, then calculated the standardized prevalence from the projected number of cases. Similar methods were used for standardized prevalence for subpopulations. We used the July 1, 2010 US Census estimated population of the surveillance catchment areas and the United States for all prevalence calculations and statistical models.
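The standardization steps described above (stratum-specific prevalence, expected cases in the standard population, then prevalence from the projected cases) can be sketched as follows. This is an illustration of direct standardization in general, not the authors' implementation; the function name, the stratum labels, and all counts are hypothetical.

```python
def standardized_prevalence(site_strata, standard_pop, per=10_000):
    """Directly standardized prevalence.

    site_strata:  {stratum: (cases, population)} observed in the catchment area
    standard_pop: {stratum: population} of the standard (e.g., US) population
    """
    expected_cases = 0.0
    for stratum, std_n in standard_pop.items():
        cases, pop = site_strata[stratum]
        # Apply the stratum-specific prevalence to the standard population.
        expected_cases += (cases / pop) * std_n
    return per * expected_cases / sum(standard_pop.values())


# Hypothetical two-stratum example (not MD STARnet data).
site = {"stratum_A": (2, 20_000), "stratum_B": (6, 20_000)}
standard = {"stratum_A": 3_000_000, "stratum_B": 1_000_000}

crude = 10_000 * sum(c for c, _ in site.values()) / sum(n for _, n in site.values())
adjusted = standardized_prevalence(site, standard)
print(round(crude, 2), round(adjusted, 2))
```

In this made-up example the crude prevalence is 2.0 per 10,000 and the standardized prevalence is about 1.5 per 10,000, because the standard population has proportionally more people in the lower-prevalence stratum.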
We used our theoretical model to develop a multivariable Poisson regression model to quantify the contribution of each measured source of variation to the total variance and how much variation remained unexplained. The Poisson model, fit to the stratum-level dataset, controlled for the potential sources of uncertainty for which we had data. The MD STARnet data did not include a measure of socioeconomic status. Independent variables were age group, surveillance site, race/ethnicity, method of diagnosis, average number of reporting sources per patient, and the proportion of patients in the stratum treated at a specialized neuromuscular clinic. The natural log of the total stratum population was used as an offset variable to adjust for the differences in opportunity for the outcome. The number of DBMD cases in each stratum was the dependent variable. Analysis of deviance, which is based on the discrepancy between the predicted and observed outcome values for each record, was used to quantify the contribution of each variable to the variation in prevalence among the 96 strata.
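To make the analysis-of-deviance idea concrete, the sketch below computes the share of Poisson deviance explained by a single categorical factor. It is a deliberately simplified illustration, not the multivariable model described above: with a log-population offset and one categorical predictor, the maximum-likelihood fitted rate for each level is simply that level's total cases divided by its total population, so no iterative fitting is needed. The function names, age labels, and counts are hypothetical.

```python
import math


def poisson_deviance(observed, fitted):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for y, mu in zip(observed, fitted):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += 2.0 * (log_term - (y - mu))
    return total


def fitted_by_group(cases, pops, groups):
    """Fitted counts for a Poisson model with a log(population) offset and one
    categorical factor: each group's rate is its total cases / total population."""
    group_cases, group_pop = {}, {}
    for y, n, g in zip(cases, pops, groups):
        group_cases[g] = group_cases.get(g, 0) + y
        group_pop[g] = group_pop.get(g, 0) + n
    return [n * group_cases[g] / group_pop[g] for n, g in zip(pops, groups)]


# Hypothetical strata (not MD STARnet data): cases, population, age group.
cases = [1, 2, 9, 11, 4, 5]
pops = [60_000, 55_000, 62_000, 58_000, 61_000, 57_000]
age = ["young", "young", "middle", "middle", "older", "older"]

null_fit = fitted_by_group(cases, pops, ["all"] * len(cases))  # intercept only
age_fit = fitted_by_group(cases, pops, age)                    # intercept + age

d_null = poisson_deviance(cases, null_fit)
d_age = poisson_deviance(cases, age_fit)
share_explained = (d_null - d_age) / d_null
print(f"deviance explained by age group: {share_explained:.0%}")
```

Adding further terms (site, race/ethnicity, process proxies) in sequence and recording each drop in deviance gives the kind of decomposition reported in the results; fitting those larger models requires an iterative generalized linear model routine rather than the closed form used here.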
We compared the unadjusted, standardized and modeled estimates of prevalence to assess the extent to which controlling for age, race/ethnicity and differences in surveillance process explained prevalence differences between sites. Primary analyses were conducted in R software, version 3.4.3. The secondary analyst used R software, version 3.6.0 and SAS/STAT software, version 9.4.
Availability of data and materials
Because of state policies governing access to public health surveillance data, MD STARnet data are available only through collaboration with an MD STARnet principal investigator. For more information on access to MD STARnet data, please contact the Centers for Disease Control and Prevention at firstname.lastname@example.org.
Porta M, ed. A Dictionary of Epidemiology. 6th ed. New York: Oxford University Press. ISBN: 978-0-19-997673-7.
Thacker SB, Qualters JR, Lee LM, et al. Public health surveillance in the United States: evolution and challenges. MMWR Suppl. 2012;61(3):3–9.
Taruscio D, Mantovani A. Multifactorial rare diseases: can uncertainty analysis bring added value to the search for risk factors and etiopathogenesis. Medicina. 2021. https://doi.org/10.3390/medicina57020119.
Do TN, Street N, Donnelly J, et al. Muscular dystrophy surveillance, tracking, and research network pilot: population-based surveillance of major muscular dystrophies at four US sites, 2007–2011. Birth Defects Res. 2018;110(19):1404–11. https://doi.org/10.1002/bdr2.1371.
Miller LA, Romitti PA, Cunniff C, et al. The muscular dystrophy surveillance tracking and research network (MD STARnet): surveillance methodology. Birth Defects Res A Clin Mol Teratol. 2006;76(11):793–7. https://doi.org/10.1002/bdra.20279.
Centers for Disease Control and Prevention. Prevalence of Duchenne/Becker muscular dystrophy among males aged 5–24 years - four states, 2007. MMWR Morb Mortal Wkly Rep. 2009;58(40):1119–22.
Romitti PA, Zhu Y, Puzhankara S, et al. Prevalence of Duchenne and Becker muscular dystrophies in the United States. Pediatrics. 2015;135(3):513–21. https://doi.org/10.1542/peds.2014-2044.
Besag J, Newell J. The detection of clusters in rare disease. J R Stat Soc A Stat Soc. 1991;154(1):143–55. https://doi.org/10.2307/2982708.
Hollak CE, Aerts JM, Ayme S, et al. Limitations of drug registries to evaluate orphan medicinal products for the treatment of lysosomal storage disorders. Orphanet J Rare Dis. 2011;6:16. https://doi.org/10.1186/1750-1172-6-16.
Mendez EP, Lipton R, Ramsey-Goldman R, et al. US incidence of juvenile dermatomyositis, 1995–1998: results from the national institute of arthritis and musculoskeletal and skin diseases registry. Arthritis Rheum. 2003;49(3):300–5. https://doi.org/10.1002/art.11122.
Yu JB, Gross CP, Wilson LD, et al. NCI SEER public-use data: applications and limitations in oncology research. Oncology. 2009;23(3):288–95.
Papoz L, Balkau B, Lellouch J. Case counting in epidemiology: limitations of methods based on multiple data sources. Int J Epidemiol. 1996;25(3):474–8. https://doi.org/10.1093/ije/25.3.474.
Mathews KD, Cunniff C, Kantamneni JR, et al. Muscular dystrophy surveillance tracking and research network (MD STARnet): case definition in surveillance for childhood-onset Duchenne/Becker muscular dystrophy. J Child Neurol. 2010;25(9):1098–102. https://doi.org/10.1177/0883073810371001.
SurveyGizmo [program]. https://www.alchemer.com/survey/. Alchemer. Louisville, CO.
Rothman KJ. Standardization of rates. Modern Epidemiology. Boston, MA: Little, Brown and Company 1986:42–44.
R Core Team. R: A language and environment for statistical computing. [program]. 3.4.3 version. Vienna, Austria: R Foundation for Statistical Computing, 2017. https://www.R-project.org.
R Core Team. R: A language and environment for statistical computing. [program]. 3.6.0 version. Vienna, Austria: R Foundation for Statistical Computing, 2019. https://www.R-project.org.
Base SAS® 9.4 Procedures Guide, Seventh Edition. Cary, NC: SAS Institute Inc. https://www.sas.com.
We acknowledge and appreciate the contributions of the MD STARnet network members to data collection and case classification. The analysts for the sources of variability analyses were Nedra Whitehead (primary) and Suzanne McDermott (secondary). The analysts for the magnitude of variation were Stephen Erickson (primary) and Bo Cai (secondary).
This analysis was supported by CDC cooperative agreements 5U01DD00116 and 1U01DD001255 (North Carolina) and 6U01DD00117 and 6U01DD00145 (South Carolina). The Expanded Muscular Dystrophy Surveillance pilot was supported by the following CDC cooperative agreements: DD000830 (Arizona), DD000835 (Colorado), DD000831 (Iowa), DD000836 (Western New York), DD000832 (coordinating center), DD000834 (data coordinating center) and DD000837 (Abstractor QA Center). The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for Disease Control and Prevention or the Department of Health and Human Services.
Ethics approval and consent to participate
This study complies with the guidelines for human studies and was conducted ethically in accordance with the World Medical Association Declaration of Helsinki. As described in the manuscript, all four sites had authority to conduct public health surveillance by the legal authority of their state department of health and/or institutional review board approval or exemption.(3) Informed consent was waived because the project was public health surveillance.
Consent for publication
No individual data included.
The authors have no competing interests to declare.
Disposition of articles from literature review.
Articles included in full text review.
Investigators' estimates of bias, by source.
Uncertainty in surveillance literature review search strategy.
Survey of MD STARnet Investigators.
Cite this article
Whitehead, N., Erickson, S.W., Cai, B. et al. Sources of variation in estimates of Duchenne and Becker muscular dystrophy prevalence in the United States. Orphanet J Rare Dis 18, 65 (2023). https://doi.org/10.1186/s13023-023-02662-0