Phenome Risk Classification Enables Phenotypic Imputation and Gene Discovery in Developmental Stuttering

This document summarizes a study that used machine learning to identify individuals affected by developmental stuttering within a biobank database. The study developed a model called PheML that used comorbidities associated with stuttering to predict stuttering status. Applying this model identified over 9,000 affected individuals, enabling a genome-wide association study. This study identified two genetic variants associated with stuttering in different ancestry groups. The findings were validated in an independent clinical sample and provide new insights into the genetic basis of developmental stuttering.


ARTICLE

Phenome risk classification enables phenotypic imputation and gene discovery in developmental stuttering
Douglas M. Shaw,¹ Hannah P. Polikowsky,¹ Dillon G. Pruett,² Hung-Hsin Chen,¹ Lauren E. Petty,¹
Kathryn Z. Viljoen,³ Janet M. Beilby,³ Robin M. Jones,² Shelly Jo Kraft,⁴ and Jennifer E. Below¹,*

Summary

Developmental stuttering is a speech disorder characterized by disruption in the forward movement of speech. This disruption includes
part-word and single-syllable repetitions, prolongations, and involuntary tension that blocks syllables and words, and the disorder has a
life-time prevalence of 6–12%. Within Vanderbilt’s electronic health record (EHR)-linked biorepository (BioVU), only 142 individuals out
of 92,762 participants (0.15%) are identified with diagnostic ICD9/10 codes, suggesting a large portion of people who stutter do not have a
record of diagnosis within the EHR. To identify individuals affected by stuttering within our EHR, we built a PheCode-driven Gini impurity-based
classification and regression tree model, PheML, by using comorbidities enriched in individuals affected by stuttering as predicting
features and imputing stuttering status as the outcome variable. Applying PheML in BioVU identified 9,239 genotyped affected
individuals (a clinical prevalence of 10%) for downstream genetic analysis. Ancestry-stratified GWAS of PheML-imputed affected individuals
and matched control individuals identified rs12613255, a variant near CYRIA on chromosome 2 (B = 0.323; p value = 1.31 × 10⁻⁸)
in European-ancestry analysis and rs7837758 (B = 0.518; p value = 5.07 × 10⁻⁸), an intronic variant found within the ZMAT4 gene on
chromosome 8, in African-ancestry analysis. Polygenic-risk prediction and concordance analysis in an independent clinically ascertained
sample of developmental stuttering cases validate our GWAS findings in PheML-imputed affected and control individuals and demonstrate
the clinical relevance of our population-based analysis for stuttering risk.

Introduction

Developmental stuttering is a speech disorder characterized by disruption in the forward movement of speech. This disruption includes part-word and single-syllable word repetitions, sound prolongations, and involuntary breaks in syllables and words.¹ Previous population-based studies estimate that 6–12% of children aged 2–4 will develop a stutter and that 15–25% of these speech impediments will persist to adulthood, resulting in approximately 1% prevalence in the adult population.² Risk factors for developmental stuttering include sex (males demonstrate increased risk) and a family history of stuttering.³ Elevated risk in males increases with age; the male-to-female ratio is approximately 2:1 (or lower) in children under 4⁴,⁵ but rises to 5:1 in adolescents and adults,⁴ suggesting a higher rate of recovery in females, by age.⁶

The impact of stuttering across the lifespan is significant and well documented. Children who stutter, especially those in whom stuttering persists, experience decreased overall school performance, including social withdrawal and reduced classroom participation.⁷ In addition to the impact on their academic experiences, adolescents who stutter often experience a higher incidence of bullying.⁷ Adults who continue to stutter can also experience impaired career trajectories because stuttering can increase the risk of unemployment and reduce perceived job performance, both of which contribute to reducing socioeconomic status among people who stutter.⁸ Despite a clear social and vocational impact, no direct causes of developmental stuttering in populations have been previously identified. Given the observed enrichment in families, genetic studies offer a particularly promising approach to understanding underlying genetic causes and provide insight into potential biological mechanisms contributing to this phenotype.⁹

Heritability estimates of developmental stuttering have varied greatly across studies; they have ranged from 0.42 to 0.84 in the two largest twin studies, each comprising a sample size exceeding 20,000 individuals.¹⁰,¹¹ Though heritability estimates vary, there is clear evidence that a genetic component for developmental stuttering exists, and consequently several linkage-based genetic analyses have sought to identify loci within potentially causative genes. These familial genetic studies identified significant hits within GNPTAB (MIM: 607840), GNPTG (MIM: 607838), NAGPA (MIM: 607985), and AP4E1 (MIM: 607244), although there is little concordance in identified loci across studies, indicating that results might be specific to the tested family.⁹,¹²,¹³ Follow-up studies have demonstrated that disruptions in GNPTAB resulted in deficits in astrocyte pathology in the corpus callosum and disruptions in

¹Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN 37203, USA; ²Hearing and Speech Sciences, Vanderbilt University, Nashville, TN 37203, USA; ³Curtin School of Allied Health, Curtin University, Perth 6845, Australia; ⁴Communication Sciences and Disorders, Wayne State University, Detroit, MI 48202, USA
*Correspondence: [email protected]
https://doi.org/10.1016/j.ajhg.2021.11.004.
© 2021 The Authors. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

The American Journal of Human Genetics 108, 2271–2283, December 2, 2021 2271
mouse vocalization.¹⁴ The roles of these astrocytes in the onset of stuttering are not well characterized; however, recent studies have contributed to a growing body of evidence that dopamine receptor D2 blockers can impact stuttering behavior, perhaps because of increased astrocyte metabolism in the striatum.¹⁵ These studies suggest that dopamine projection from the basal ganglia might contribute to disturbances in speech and vocalization, and the findings potentially support pharmacological means for treatment.¹⁶,¹⁷ Recent evidence also implicates autoimmune reactions from group A beta-hemolytic streptococcus (GAS [MIM: 607395]) infections that target specific cell types within the basal ganglia as a potential cause of stuttering.¹⁸ Rheumatic fevers (MIM: 268240) and other sequelae resulting from GAS have been more generally linked with pediatric autoimmune neuropsychiatric disorders and historically have correlated strongly with stuttering in children.¹⁸

Still, to date, most genetic research has provided limited biological insight into potential mechanisms of action that contribute to the stuttering phenotype, and the lack of replicability across the linkage studies suggests that these genetic risk loci do not explain the genetic basis of stuttering at a population level.⁹

Genome-wide association studies (GWASs), an alternative method to linkage analysis for disease gene discovery, typically utilize genome-wide genetic data in large samples drawn from populations to identify common genetic variants that are associated with increased risk of a disease or trait. Prior to this study, no population-based genome-wide association study (GWAS) had successfully identified variants significantly associated with developmental stuttering. Stuttering has a high recovery rate and is frequently diagnosed outside of a hospital or clinical setting; therefore, one reason for the lack of genetic discoveries that explain the general prevalence of stuttering is the challenge of acquiring large numbers of developmental stuttering cases for GWAS approaches to be well powered. To address the issue of case acquisition, researchers today are frequently turning to large-scale biobanks linked to electronic health records (EHRs) to efficiently and cost-effectively develop studies well powered for genetic discovery.¹⁹ Cases for a particular phenotype are often identified in EHRs through the use of phenotyping algorithms based on ICD-9/10 billing codes, CPT procedural codes, and/or notes from clinical records.²⁰ However, the billing codes traditionally used to assess patient status in the electronic health record are heavily underreported for developmental stuttering.²¹

In Vanderbilt University’s large EHR-linked DNA database (BioVU), only 142 of the 92,762 (0.15%) patient samples genotyped on the Illumina Multi-Ethnic Genotyping Array (MEGAEX) had recorded billing codes denoting developmental stuttering (see Table S1), a proportion well below even the most stringent expected prevalence. The nature of this condition might shed some light on its underrepresentation in Vanderbilt’s EHR. In cases where patients do not have an overt stutter or in the event that the patient exhibits early recovery, doctors might simply overlook the condition and not record a diagnosis in the EHR. Even if a patient were to seek evaluation and treatment for developmental stuttering, speech evaluations are typically performed by speech-language pathologists, usually outside of a hospital context (e.g., schools or private clinics).²¹ There are also currently no FDA-approved medications or medical procedures to treat developmental stuttering, making it significantly less likely to be noted in a medical setting. Finally, although treatment exists for stuttering in the form of therapy and even though this is a chronic condition for many adults, this condition is not considered a parity diagnosis, and as such most government and private insurance plans do not cover treatment costs for stuttering. An inability to bill for these diagnoses makes it less likely for providers to include the ICD code for this diagnosis during a patient visit.

In our previous research, we identified individuals affected by stuttering by applying a phenotype-driven machine-learning algorithm (PheML) that uses commonly reported phenotypes significantly associated with clinically diagnosed developmental stuttering as predictor variables to impute a developmental stuttering phenotype in BioVU (individuals with this phenotype are predicted by PheML to be affected by stuttering).²¹ Our model takes a series of binary proxy parameters to impute developmental stuttering in a patient population by using a Gini impurity-based classification and regression tree classifier.²² The PheML algorithm was built and tested with an initial pool of manually reviewed records from individuals affected by stuttering in Vanderbilt University’s EHR (no subjects within BioVU were used) across all ancestries (Figure 1). Model validation testing in an independent dataset containing manually reviewed records resulted in a positive prediction rate of 83.3% (Table 1).²¹ Applying this model in BioVU resulted in a higher proportion of individuals with imputed developmental stuttering (10%) than were observed by diagnostic code (0.15%) or manual chart review.²¹

To execute a well-powered GWAS aimed at identifying associated genetic loci for stuttering, we applied PheML in BioVU to impute a stuttering phenotype in patients with genetic data linked to their EHR. We then leveraged the imputed stuttering phenotype as the dependent variable in a GWAS to identify associated variants and generate a polygenic-risk-prediction model. We validated these results by comparing the concordance of our GWAS summary statistics to the GWAS results obtained from an independent clinically ascertained stuttering sample set acquired through the International Stuttering Project (ISP) and the polygenic-risk-prediction scores in the clinically ascertained cases versus matched population-based controls.²³ This approach allowed us to impute a stuttering phenotype in a large patient set on the basis of the presence of a phenotypic profile that approximates a developmental stuttering phenotype and to amass statistical power from a large sample size to help accommodate the

Figure 1. Outline of PheML development and application
Within a set of 3.1 million deidentified electronic health records (A), we first identified a small pool of subjects (B) with developmental
stuttering through expert manual review. We selected these patients and their demographically matched controls to identify comorbid-
ities as predictive features and develop and test a machine-learning model (C) that would impute stuttering in BioVU (D), an indepen-
dent EHR dataset linked to genetic data. We then performed a GWAS by using the imputed phenotype as the dependent variable in the
labeled genetic dataset (E) to identify genetic variants associated with imputed stuttering (F).
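
The Gini impurity criterion used by the classification tree in step C can be illustrated with a short pure-Python sketch. This is only an illustration of the splitting criterion, not the authors' PheML code: the published model was built with scikit-learn, which this sketch does not use, and the toy records and the feature name `phecode_a` are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum of squared class shares."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(rows, feature):
    """Weighted Gini impurity after splitting rows on one binary feature."""
    present = [r["stutter"] for r in rows if r[feature]]
    absent = [r["stutter"] for r in rows if not r[feature]]
    n = len(rows)
    return len(present) / n * gini(present) + len(absent) / n * gini(absent)


# Hypothetical toy records: 1 = phecode present / affected, 0 = absent / control.
toy = [
    {"phecode_a": 1, "stutter": 1},
    {"phecode_a": 1, "stutter": 1},
    {"phecode_a": 1, "stutter": 0},
    {"phecode_a": 0, "stutter": 0},
    {"phecode_a": 0, "stutter": 0},
    {"phecode_a": 0, "stutter": 1},
]

parent = gini([r["stutter"] for r in toy])   # impurity before the split
child = split_impurity(toy, "phecode_a")     # weighted impurity after the split
gain = parent - child                        # impurity reduction from the split
print(parent, child, gain)
```

A tree learner of this kind would evaluate such a reduction for every candidate feature and split on the one with the largest value, then repeat within each branch.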

lack of clinical specificity. In doing so, we were able to perform a GWAS that identified genome-wide-significant variants associated with the clinical profile of developmental stuttering.

Subjects and methods

Model development and application to BioVU
We developed a model that classified patients as having a high probability of having developmental stuttering if they had one of the phenotypes (denoted as phecodes) associated with developmental stuttering; details are described in Pruett et al.²¹ Phecodes were mapped from ICD-9 codes clustered on the basis of a grouping system developed through the Phecode Map project.²³ Features were selected on the basis of phecode enrichment in individuals with developmental stuttering as compared to matched controls.²¹ Patients with one or more instances of a phecode in their EHR were noted as positive for that feature, and they were noted as negative if they had no mentions of the phecode; only phecodes observed more often in the set of affected individuals than in 10,000 simulations of matched controls were carried forward into model building (corresponding to a p value of 0). A Gini impurity-based classification-and-regression-tree machine-learning model was developed from these features via scikit-learn tree regression software,²² and the model was tested in an independent set of 141 individuals with developmental stuttering and 684 matched controls in Vanderbilt University’s EHR; phenotypic status of these individuals was confirmed by expert manual review.²¹ We then applied this model to a set of 92,762 individuals genotyped on the MEGAEX array with available ICD-9 records and resulting phecodes to impute developmental stuttering status for a downstream GWAS.

Genotyping, imputation, and quality control
All BioVU participants as well as the participants in the independent clinically ascertained developmental stuttering dataset were genotyped on Illumina’s Infinium Expanded Multi-Ethnic Genotyping Array (MEGAEX). Duplicate variants and indels were removed; for duplicate samples, the duplicate with a lower call rate was removed.
BioVU
Quality control was performed primarily with PLINK v1.90.²⁴ Initial filtering thresholds for the entire BioVU sample included excluding variants with a call rate less than 98% and samples with a call rate less than 97%.²⁵ Eigenvectors and eigenvalues were calculated through principal-component analysis (PCA) run in PLINK v1.90, and data were separated according to genetic principal components (eigenvectors) into five broad ancestry groups: European, African, South Asian (SAS), East Asian (EAS), and Hispanic (AMR), with the 1000 Genomes reference used for ancestry classification verification (Figure S1).²⁶ Each ancestry subset was subsequently analyzed separately; a minor-allele filter of 1%, a variant-missingness filter of 5%, and a sample-missingness filter of 10% were applied, and checks for heterozygosity, sex, and variants that did not align with Hardy-Weinberg expectations (variants with a Hardy-Weinberg [hwe] statistic < 1 × 10⁻¹⁰ were removed) were performed.²⁵ Data were prepared for imputation according to specifications outlined on the Michigan server webpage; these included using a pre-imputation data-preparation toolkit (see McCarthy Group tools in the web resources).²⁷ Imputation was performed for each ancestry cohort on the Michigan Imputation server through the use of EAGLE2 phasing, Minimac4 imputation, and the Haplotype Reference Consortium (HRC) reference.²⁷⁻²⁹ Final post-imputation quality-control filtering included selecting variants with a minor-allele frequency above 1% within each ancestry group, as well as removing all variants with an R² imputation info score of less than 0.4.
International Stuttering Project (ISP) dataset
As an independent reference dataset, we obtained 1,345 clinically ascertained developmental stuttering patients (965 male and 380 female) collected from Curtin University Stuttering Center in Perth, Australia, the SpeechMatters Clinic in Dublin, Ireland, the Stuttering Research Laboratory at the University of Pittsburgh, Dr. Shelly Jo Kraft’s research group at Wayne State University, and the Attadale Stuttering Treatment Facility in Australia and through a social-media outreach campaign led by Drs. Below and Kraft on reddit.com. We paired these with 7,019 demographically matched controls from BioVU (4,951 males and 2,068 females; selected as described below and with no overlap with the dataset used in our primary GWAS) (Table S2). A speech pathologist evaluated each participant to confirm their phenotypic status. We applied the methods described by Pluzhnikov et al. to identify possible plate or batch effects prior to merging unique batches of genotypes from affected individuals.³⁰ No plate or batch effects were observed. Initial filtering for stuttering excluded variants and samples with a call rate less than 90%. Next, data for affected individuals were separately assessed for quality control according to broad ancestral groups (European, American/Hispanic, African

Table 1. Performance of PheML classification model

                                            Predicted status
                                            Affected individuals   Control individuals   Total
Classified as stuttering by manual review   97                     44                    141
No indications of stuttering                19                     665                   684
Total                                       116                    709
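
The counts in Table 1 are enough to recompute the usual classification metrics. The short sketch below does only that arithmetic; the variable names are ours, and the mapping of table cells to true/false positives and negatives assumes that manual review is treated as the reference standard.

```python
# Counts copied from Table 1 (rows: manual review; columns: PheML prediction).
tp = 97    # reviewed as stuttering, predicted affected
fn = 44    # reviewed as stuttering, predicted control
fp = 19    # no indications of stuttering, predicted affected
tn = 665   # no indications of stuttering, predicted control

ppv = tp / (tp + fp)            # positive predictive value
sensitivity = tp / (tp + fn)    # share of reviewed cases that the model recovers
specificity = tn / (tn + fp)    # share of reviewed controls predicted as controls
missed = fn / (tp + fn)         # share of reviewed cases that the model misses

for name, value in [("PPV", ppv), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("missed", missed)]:
    print(f"{name}: {value:.3f}")
```

The missed fraction, 44 of 141, is about 31%, in line with the statement in the results that 30% or more of affected individuals might still be missed.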

American, and Asian) as defined by PCA in which the HAPMAP3 reference was used for ancestry classification.³¹ Each ancestry cohort was subsequently analyzed; analysis incorporated a minor-allele filter of 1%, a variant-missingness filter of 3%, and a sample-missingness filter of 5%, as well as checks for heterozygosity, sex, and variants that did not align with Hardy-Weinberg expectations (variants with a hwe statistic < 1 × 10⁻¹⁵ were removed). Then, approximately five ancestry- and sex-matched population-based controls per case were drawn from a quality-control filtered BioVU set (quality control for BioVU as described above). To select ancestry-matched controls, we calculated eigenvectors and eigenvalues through PCA generated by PLINK v1.90. PCA was performed on the maximally unrelated set of affected individuals and potential control individuals (as identified by PRIMUS³²⁻³⁵) through use of a panel of SNPs in low linkage disequilibrium (LD); additional related affected individuals and potential control individuals were projected along each of the calculated eigenvectors. Data from affected individuals were merged with that from their selected matched controls (the control selection method is described below) for imputation according to standard protocols and specifications outlined for the TOPMed server; these included using the same pre-imputation data-preparation toolkit as above (see McCarthy Group tools in web resources).²⁷ The autosomal region was imputed on the TOPMed server with EAGLE v2.4 phasing, Minimac4 imputation, and the TOPMed reference.²⁹,³⁶,³⁷ Post-imputation quality-control filtering included selecting variants with a minor-allele frequency above 1% and removing all variants with an R² imputation info score less than 0.4.

Genome-wide association studies
BioVU
For the GWASs, developmental stuttering patients were stratified by ancestry, with independent association analyses performed for each ancestry group: European, African, South Asian, East Asian, and Hispanic. For each identified case, up to six controls were selected from the cohort of patients identified as controls by the PheML prediction model. Control individuals were matched by age (within 5 years of their matched affected individual) and sex. Additionally, control individuals were matched to the lowest genetic Euclidean pairwise case-control distance that met the previously mentioned criteria. Euclidean pairwise distance was calculated as the sum of the square of the difference in the eigenvectors scaled by their eigenvalue for each principal component calculated from our PCA.³⁸ Any case-control pairs whose distance was not within two standard deviations of the mean of the distribution of all case-control pairwise distances were removed from the analysis. A logistic-regression model was used for the variant association analyses in SUGEN,³⁹ and corrections were made for sex, age, and ancestry (genetic ancestry captured by the first three principal components). To correct for multiple testing, we considered variants with an association p value below 5.0 × 10⁻⁸ to be significant. Manhattan plots were generated with the qqman R package.⁴⁰ Locus and LD structure visualizations and Q-Q plots were generated with the LocusZoom browser tool.⁴¹
International Stuttering Project (ISP) GWAS
Control individuals were selected from a sample set of individuals who were not identified as being affected by stuttering according to ICD9 and ICD10 codes (Table S1) or by the PheML prediction algorithm. These individuals were matched by sex, similarly to the BioVU individuals described above. Genetic Euclidean pairwise distance was minimized, and any pairs of affected and control individuals not within two standard deviations of the mean of the distribution of all pairwise distances were removed. Individuals under 18 were also excluded as potential controls. Dates of birth for individuals in the ISP sample sets were not available, so subjects were not matched by age. A GWAS for the ISP stuttering sample set was performed with a frequency-based additive logistic model via SAIGE (scalable and accurate implementation of generalized mixed model), a method developed for biobank data in order to control for unbalanced case-control ratios and sample relatedness.⁴² The regression model accounted for population substructure by including the first six principal components as covariates.

Calculations of genetic heritability
Genome-wide SNP-based liability-scale heritability within our European ancestry (EUR) sample set was calculated through a genomic-relatedness-based restricted maximum-likelihood (GREML) approach implemented through GCTA software.⁴³,⁴⁴ Observed variance estimates from the observed scale were transformed to an expected underlying scale, for which an expected population prevalence was set to 0.1 on the basis of the observed frequency of predicted cases within BioVU. Heritability estimates included all variants tested in the GWASs (see Genotyping, imputation, and quality control in the methods section for exclusion criteria). We corrected for sex, age, and the first three principal components (see Genotyping, imputation, and quality control).

Variant-effect-size concordance analysis
For the concordance analysis, we compared summary statistics from the EUR GWAS to summary statistics produced from the ISP GWAS to determine whether the concordance rate between the two summary statistics was higher than expected. The concordance rate was calculated as the proportion of variants that had the same direction of effect over the total variants present in both GWAS analyses. A total of 7,570,420 variants that passed previously described QC metrics were present in both GWAS analyses, aligned by strand and reference allele, and analyzed here. Additional concordance rates were calculated for variants with p values below 0.5, 0.05, and 0.005 thresholds in both GWASs. We

Table 2. Demographics of BioVU subjects classified by PheML algorithm

                          Predicted to exhibit stuttering   Predicted not to stutter
Total                     9,239                             83,503
Male                      3,507 (38.0%)                     36,140 (43.3%)
Female                    5,732 (62.0%)                     47,363 (56.7%)
Demographics
Mean age in years (SD)    47.9 (24.9)                       55.2 (22.7)
Ancestry
European                  6,339 (68.6%)                     63,471 (76.0%)
African                   1,869 (20.2%)                     13,728 (16.4%)
East Asian                124 (1.3%)                        772 (0.9%)
South Asian               51 (0.6%)                         363 (0.4%)
Hispanic                  398 (4.3%)                        2,068 (2.5%)
Unknown/other             158 (1.7%)                        3,101 (3.7%)

Ancestry was determined through principal-component analysis. Testing set included 825 subjects (141 individuals confirmed to exhibit developmental stuttering and 684 subjects with no indication of stuttering in their health records). The positive-prediction rate for the model is 83%.

performed a one-sample t test to determine whether each concordance rate was significantly higher than an expected concordance rate of 0.5.

Modeling of the polygenic risk score
We used the summary statistics resulting from our GWAS of PheML-defined EUR-imputed stuttering to develop a polygenic risk score (PRS) model by using data from all 7,751,954 autosomal SNPs meeting the quality-control criteria outlined above. The PRS model was developed with PRScs Python software, which creates a model that estimates genetic liability through a linear combination of the weight of SNP dosage on effect size and p values from the provided GWAS summary data.⁴⁵ Our global shrinkage parameter (phi) was set to 1. This model was applied to the genetic data of the independent clinically ascertained developmental stuttering cohort as well as their matched controls. This analysis was restricted to only samples that were of EUR. Individual polygenic scores were calculated through PLINK v1.9.²⁵ To assess the significance of the difference in genetic liability for stuttering between the individuals with clinically ascertained stuttering and the matched control individuals, we ran a two-sample t test comparing the overall score distribution between these two groups.

Results

Efficacy of PheML prediction
To test the efficacy of our model predicting PheML stuttering, we applied the model to a set of 825 patients (141 patients with developmental stuttering confirmed by manual review and 684 patients with no indications of stuttering in their health records). Of the 116 subjects that our prediction model scored as having developmental stuttering, 97 were among those manually reviewed as having developmental stuttering, and 19 had no indications of stuttering in their records, resulting in a positive prediction rate of 83% (Table 1). Of the 141 manually reviewed individuals with developmental stuttering in the testing set, 97 were predicted as having developmental stuttering by the PheML model, whereas 44 were not, suggesting that despite our high positive predictive values, 30% or more affected individuals might still be missed by our approach.²¹

PheML imputation of developmental stuttering identifies a large case sample
Of a set of 92,742 BioVU subjects, PheML labeled 9,239 as having a high likelihood for developmental stuttering (9.96% prevalence). Of this set, 5,732 affected individuals were female (62.0%; average age of 47.9 years). 6,339 (68.6%) were of EUR, 1,869 (20.2%) were of African ancestry (AFR), 398 (4.3%) were of Hispanic ancestry, 124 were of EAS ancestry (1.3%), and 51 (0.6%) were of South Asian ancestry (SAS) (Table 2). Broad ancestry groups were stratified via principal-component analysis (Figure S1).

GWASs in the PheML prediction set identify genetic loci associated with developmental stuttering
The PheML predicted-stuttering sample set was stratified by PCA-based genetic ancestry (African, EAS, European, Hispanic, and South Asian) for genome-wide association studies (Table 3). We performed a separate GWAS for each ancestry group. Across analyses in the five ancestry groups, the European and AFR groups were the largest and best powered. One locus reached genome-wide significance in the EUR analysis and one locus reached near genome-wide significance in the AFR sample set (Table 4).

In the EUR case set, the GWAS included 6,339 predicted developmental stuttering cases and 33,172 ancestry- and sex-matched controls and 7,751,954 imputed variants (Figure 2; see methods section). One statistically significant locus was identified, and the sentinel variant was determined to be at rs12613255 (beta = 0.323; p = 1.31 × 10⁻⁸

Table 3. Demographics of BioVU subjects used in GWAS

                          Individuals predicted to stutter   Predicted control individuals
Total                     9,221                              45,793
Male                      3,491 (37.9%)                      17,162 (37.5%)
Female                    5,730 (62.1%)                      28,631 (62.5%)
Demographics
Mean age in years (SD)    47.7 (24.8)                        48.6 (24.2)
Ancestry
European                  6,339 (68.7%)                      33,172 (72.4%)
African                   1,853 (20.1%)                      8,372 (18.3%)
East Asian                124 (1.3%)                         592 (1.3%)
South Asian               51 (0.6%)                          228 (0.5%)
Hispanic                  397 (4.3%)                         1,395 (3.0%)

Ancestry was determined through principal-component analysis. GWASs were performed with stratifications by ancestry.
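
The matched controls summarized in Table 3 were selected by age, sex, and genetic distance, as described in the methods. The following is a minimal sketch of that idea under stated assumptions: "scaled by their eigenvalue" is read here as multiplying each squared eigenvector difference by the eigenvalue, the two-standard-deviation outlier filter is omitted, and the record layout (`sex`, `age`, `pcs`) and function names are hypothetical rather than taken from the authors' pipeline.

```python
def scaled_sq_distance(pcs_a, pcs_b, eigenvalues):
    """Sum over principal components of eigenvalue * squared coordinate difference."""
    return sum(ev * (a - b) ** 2 for a, b, ev in zip(pcs_a, pcs_b, eigenvalues))


def pick_controls(case, candidates, eigenvalues, max_controls=6, max_age_gap=5):
    """Return up to max_controls same-sex, age-compatible candidates, nearest first."""
    eligible = [
        c for c in candidates
        if c["sex"] == case["sex"] and abs(c["age"] - case["age"]) <= max_age_gap
    ]
    eligible.sort(key=lambda c: scaled_sq_distance(case["pcs"], c["pcs"], eigenvalues))
    return eligible[:max_controls]


# Hypothetical example with two principal components.
eigenvalues = [2.0, 1.0]
case = {"id": "case1", "sex": "F", "age": 40, "pcs": [0.0, 0.0]}
candidates = [
    {"id": "c1", "sex": "F", "age": 42, "pcs": [0.10, 0.00]},
    {"id": "c2", "sex": "F", "age": 50, "pcs": [0.00, 0.00]},  # age gap too large
    {"id": "c3", "sex": "M", "age": 40, "pcs": [0.00, 0.00]},  # sex differs
    {"id": "c4", "sex": "F", "age": 38, "pcs": [0.00, 0.05]},
]
matched = [c["id"] for c in pick_controls(case, candidates, eigenvalues)]
print(matched)
```

A full implementation would also have to decide whether one control may be assigned to several cases; the sketch leaves that open.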

), 113 kb 3′ of CYFIP-related Rac1 interactor A (CYRIA) (MIM: 606322) (see Figure 3). The developmental-stuttering GWAS in subjects of AFR ancestry included 1,853 affected individuals, 8,402 ancestry- and sex-matched control individuals, and 13,636,593 variants (Figure 4; see methods). The top variant, rs7837758, nearly reached genome-wide significance (beta = 0.518; p = 5.07 × 10⁻⁸). rs7837758 is found in the third intron of ZMAT4, located on chromosome 8 (Figure 5). The GWAS performed on subjects of Hispanic ancestry included 397 affected individuals, 1,457 ancestry- and sex-matched control individuals, and 8,147,169 variants (Figure S2). For subjects of EAS ancestry, the GWAS included 124 affected individuals, 716 ancestry- and sex-matched controls, and 6,922,517 variants (Figure S3). For subjects of South Asian ancestry, the GWAS included 51 affected individuals, 279 ancestry- and sex-matched controls, and 7,058,354 variants (Figure S4). Most likely because of the reduced power in smaller sample sizes, association analyses in the Hispanic, South Asian, and EAS cohorts did not result in any significantly associated variants with our PheML prediction set (see Figures S2–S4 and Table S3). We also report several loci that exceeded a suggestive significance threshold of p = 5 × 10⁻⁶ (108 variants across all ancestry GWASs) and were replicated (p < 0.05) in one or more independent developmental stuttering GWASs (Table 4, Figures S5–S13). Our strongest replications across these studies include rs6415726 (HIS GWAS; beta = 0.730; p = 9.61 × 10⁻⁷), an intronic C9orf92 variant that replicated in the ISP GWAS (beta = 0.197; p = 6.29 × 10⁻⁴), and rs10464899 (AFR GWAS; beta = 0.216; p = 1.51 × 10⁻⁷; see Figure S13), a variant that is 178 kb 5′ of TOX [MIM: 606863] and replicated in the ISP GWAS (beta = 0.139; p = 6.88 × 10⁻³; see Figure S6).

Genome-wide explained variance within the EUR PheML sample set

Genome-wide SNP-based heritability was calculated through GCTA within the European-ancestry PheML sample set. The proportion of phenotypic variance explained by genetic factors on the observed scale was 0.0232 (SE = 0.0083).43 Through GCTA we also transformed the explained variance estimates from the observed scale to the underlying liability scale to account for an expected prevalence of affected individuals of 0.1. The proportion of phenotypic variance explained on the liability scale (liability-scale heritability) was 0.0453 (SE = 0.016, p = 2.29 × 10⁻³).

Concordance analysis reveals genetic similarity between PheML-predicted stuttering individuals and those clinically ascertained by the ISP as having developmental stuttering

To ensure that the genetic profile of our PheML-predicted stuttering individuals properly recapitulated effects associated with clinical developmental stuttering, we compared the direction of effect between our GWAS of EUR PheML-predicted stuttering individuals and a GWAS of an independent, largely European-ancestry, and clinically ascertained set of individuals with developmental stuttering (see Table S2); this latter set was also genotyped by the genotyping core facility at Vanderbilt University, VANTAGE, on the MEGAEX. A total of 7,570,420 imputed variants were present and tested in both analyses by an approach similar to that described in the 2014 DIAGRAM paper.46

For all variants present in both GWASs, 50.41% of the variants were found to have the same direction of effect (3,816,091 of 7,570,420 variants; p = 6.86 × 10⁻¹¹²). For all variants that had a p value below 0.5 in both GWASs, the concordance-of-effect rate was 50.86% (982,614 of 1,931,927 variants; p = 3.77 × 10⁻¹²⁷). Variants with a p value below 0.05 in both GWASs had a concordance rate of 53.47% (10,830 of 20,255 variants; p = 2.84 × 10⁻²³). Variants with a p value below 0.005 in both GWASs had a concordance rate of 73.10% (121 of 171 variants; p = 6.25 × 10⁻¹⁰) (Table 5).
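The observed-to-liability-scale transformation reported for the EUR heritability estimate can be checked with a short calculation. The sketch below assumes the standard transformation for ascertained case-control samples, h²(liability) = h²(observed) × K²(1 − K)² / (z² P(1 − P)), where K is the assumed population prevalence (0.1), P is the case fraction in the sample, and z is the standard normal density at the liability threshold. Taking P from the EUR counts in Table 3 (6,339 predicted cases and 33,172 controls) is an assumption made for illustration; the function name is hypothetical.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, prevalence, case_fraction):
    """Convert an observed-scale heritability estimate to the liability scale."""
    std_normal = NormalDist()
    # Liability threshold above which an individual is affected.
    threshold = std_normal.inv_cdf(1.0 - prevalence)
    # Height of the standard normal density at that threshold.
    z = std_normal.pdf(threshold)
    k_term = prevalence * (1.0 - prevalence)
    p_term = case_fraction * (1.0 - case_fraction)
    return h2_observed * k_term ** 2 / (z ** 2 * p_term)


# Assumption: sample case fraction taken from the EUR rows of Table 3.
cases, controls = 6339, 33172
case_fraction = cases / (cases + controls)

h2_liab = liability_scale_h2(0.0232, 0.1, case_fraction)
# The standard error is rescaled by the same multiplicative factor.
se_liab = liability_scale_h2(0.0083, 0.1, case_fraction)
print(f"liability-scale h2 = {h2_liab:.4f} (SE = {se_liab:.3f})")
```

Under these assumptions the calculation lands close to the reported liability-scale estimate of 0.0453 (SE = 0.016), which suggests that the reported figures follow this transformation.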

Table 4. Top hits identified in GWAS of PheML-imputed affected and control individuals

Columns: rsID; Ancestry; Chr.; Position; Ref.; Alt.; Beta (95% CI); p value; Nearest gene; Location; Replicating analysis (p < 0.05); Replicating beta; Replicating p value.

Values are listed column by column; row alignment across columns is not preserved.

rsID: rs12613255, rs10872381, rs78072807, rs10464899, rs34456770, rs10036373, rs7837758, rs2997903, rs6981922, rs6415726, rs8013614, rs115024493
Ancestry: EUR, EUR, EUR, AFR, AFR, AFR, AFR, AFR, AFR, AFR, EAS, HIS
Chr. (values recovered): 12, 14, 16, 2
Position: 131,599,311; 131,706,657; 237,682,933; 16,628,186; 40,624,542; 91,973,678; 60,209,436; 60,219,451; 16,247,629; 36,425,517; 23,909,919; 126,219
Ref./Alt. (alleles recovered): G, G, C, A, A, T, G, G, G, C, C, C, A, T
Beta (95% CI): 0.323 (0.211, 0.434); 0.518 (0.331, 0.704); 0.803 (0.512, 1.093); 0.376 (0.239, 0.512); 0.371 (0.235, 0.507); 0.216 (0.135, 0.296); 0.308 (0.188, 0.429); 0.197 (0.119, 0.276); 0.730 (0.438, 1.022); 0.256 (0.114, 0.399); 0.097 (0.057, 0.137); −0.120 (−0.170, −0.071)
p value: 1.31 × 10⁻⁸; 5.07 × 10⁻⁸; 8.73 × 10⁻⁸; 1.51 × 10⁻⁷; 1.87 × 10⁻⁶; 1.59 × 10⁻⁶; 9.61 × 10⁻⁷; 2.26 × 10⁻⁶; 6.38 × 10⁻⁸; 5.32 × 10⁻⁷; 6.85 × 10⁻⁸; 9.35 × 10⁻⁷
Nearest gene: BRMS1L, C9orf92, C5orf17, ZMAT4, AKAP7, KYAT1, CYRIA, RYR2, MPG, DCN, TOX, TOX
Location: 188 kb 5′; 102 kb 3′; 397 kb 5′; 797 bp 5′; 113 kb 3′; 178 kb 5′; intronic; intronic; intronic; intronic; 84 kb 5′; 42 kb 5′
Replicating analysis (p < 0.05): ISP, EAS; EUR; N/A; N/A; N/A; N/A; N/A; AFR; ISP; ISP; ISP; ISP
Replicating beta: 0.138 (0.038, 0.238); 0.086 (0.018, 0.015); 0.372 (0.021, 0.723); 0.223 (0.105, 0.341); 0.349 (0.070, 0.628); 0.104 (0.007, 0.201); 0.109 (0.099, 0.119), −0.091 (−0.171, −0.011); N/A for the remaining five variants
Replicating p value: 3.21 × 10⁻², 7.21 × 10⁻³; 1.27 × 10⁻²; 3.79 × 10⁻²; 2.17 × 10⁻⁴; 2.62 × 10⁻²; 1.41 × 10⁻²; 3.61 × 10⁻²; N/A for the remaining five variants

The table includes sentinel variants from loci that exceeded p values of 1 × 10⁻⁷, as well as any variants that exceeded 5 × 10⁻⁶ and that were replicated in a separate analysis. EUR, AFR, EAS, and HIS refer to association results from samples of European, African, East Asian, and Hispanic ancestry, respectively. ISP indicates variant results from the International Stuttering Project stuttering GWAS (see methods). "Ref." refers to the reference allele, and "Alt." refers to the alternative allele the association analysis was conditioned on.

Stuttering polygenic-risk-score models developed with results from the PheML stuttering GWAS show increased genetic liability within the ISP stuttering set

We developed a PRS model by using the summary statistics for 7,751,954 variants produced by the GWAS of EUR PheML-predicted stuttering individuals. We then applied this model to the genetic datasets of our ISP developmental-stuttering subjects and their matched control individuals (the same set used in the variant concordance analysis, although the sample set only included those of PCA-based European ancestry). Our ISP stuttering set scored significantly higher on the PRS model (mean = 8.56 × 10⁻⁸, SD = 1.13 × 10⁻⁶) than their matched control individuals (mean = −3.59 × 10⁻⁷, SD = 1.01 × 10⁻⁶; two-sample t test, t(1131) = 13.12, p = 6.83 × 10⁻³⁹), providing compelling evidence that the genetic architecture identified in the model-imputed phenotyping discriminates the genetic liability for developmental stuttering in clinically ascertained cases and population-based controls (Figure S14).

Discussion

Stuttering classification model development and application to BioVU

We set out to utilize a phenotype-based machine learning algorithm, PheML, to identify unlabeled cases of developmental stuttering, an underdiagnosed phenotype, in a large EHR. Our model shows a positive prediction rate of 83.3%, though it is important to note that, during testing, those who were classified as "controls" during manual review may have had stuttering and simply did not have any mentions of it in their records. We expect that roughly 16.7% of patients in BioVU who were classified as cases are false positives, though this is likely an overestimate.

Applying PheML to BioVU resulted in a large population of patients who, even in the absence of a direct diagnosis of stuttering, exhibited a constellation of traits associated with stuttering: an underlying phenotypic signature that could be leveraged to predict developmental stuttering in a manner akin to imputation. Though 9.96% of the cohort was predicted to be a case by our PheML model, the modest sensitivity of the model (68.8%) suggests that this is likely an underestimation of the actual proportion of the true cases in our sample.

Interestingly, although males exhibit a higher prevalence of stuttering, more females were identified as exhibiting stuttering by our PheML prediction model (the female-to-male ratio was 1.6:1). Although there are more females in BioVU than males (1.3:1; female:male), this does not fully explain the discrepancy. There are several possible explanations for this imbalance. There might be sex imbalance in the rate or quality of

Figure 2. Manhattan plot and qq-plot of results from GWAS of European-ancestry individuals predicted by PheML to exhibit developmental stuttering
Analysis included 7,751,954 variants across chromosomes 1–22. One locus in chromosome 2 reached genome-wide significance (p < 5 × 10⁻⁸); the sentinel variant, rs12613255 (beta = 0.323; p = 1.31 × 10⁻⁸), was 113 kb 3′ of CYRIA (FAM49A is an alias for CYRIA). The red line indicates the threshold for genome-wide significance (5.0 × 10⁻⁸), and the blue line indicates the threshold for suggestive significance (1.0 × 10⁻⁵). Loci reported in Table 4 are labeled on the plot with the nearest gene.

diagnosis of selected predictive phenotypes; also, the ability to predict stuttering might be greater in women than in men. The positive-prediction rate for our model is >83.3%, and more women might be misspecified as affected by stuttering. A third possibility is that the prevalence of stuttering in women is higher than reported but less frequently diagnosed or detected because of a faster or higher rate of recovery. Prior evidence from Ambrose et al. supports this last potential explanation by suggesting that the male-to-female ratio for lifetime prevalence might be more balanced when mild cases of developmental stuttering and individuals who recover early are included.47 Future analyses of sex-stratified GWASs of developmental stuttering and its associated clinical phenome are needed if researchers are to further explore differential risk factors that might contribute to differences in age and rate of recovery between men and women.

Genetic discovery in predicted-stuttering cohort

PheML-imputed affected individuals and well-matched control individuals were stratified by ancestry for GWASs. Our estimated positive-prediction rate (83.3%, implying a false-positive rate of 16.7%) among cases is most likely markedly lower than the positive-prediction rate of samples acquired from affected individuals in treatment clinics, where speech and language pathologists confirm patient status. However, as with studies that leverage population-based controls, our power reduction resulting from imprecision in definitions of affected and control individuals is offset by the size of the dataset identified by our PheML model.

Controls for GWASs were selected from patients our model classified as having low likelihood for developmental stuttering (i.e., these patients were not predicted to stutter). Our sensitivity analysis indicates that the model is classifying 68.8% of manually reviewed developmental stuttering cases as being at high risk for stuttering, whereas 31.2% were classified as control individuals. Therefore, we expect some model-defined controls, roughly equal to one-third of the biobank prevalence for stuttering, to exhibit stuttering, potentially further reducing our power to discriminate allele-frequency differences between affected individuals and controls.
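The misclassification reasoning above can be made concrete with a back-of-envelope calculation. The sketch below takes the reported fraction of the cohort predicted to be a case (9.96%), the positive-prediction rate (83.3%), and the sensitivity (68.8%), and assumes that the manual-review metrics generalize to the whole biobank and that every individual not predicted to stutter is a potential control. The function name and both assumptions are illustrative, not part of the published analysis.

```python
def implied_misclassification(predicted_fraction, ppv, sensitivity):
    """Back-of-envelope misclassification estimates from review metrics.

    Returns (implied prevalence, missed-case fraction of the cohort,
    fraction of model-defined controls expected to be true cases).
    """
    # True cases that the model flags, as a fraction of the whole cohort.
    true_cases_flagged = predicted_fraction * ppv
    # If the model finds only `sensitivity` of true cases, scale up.
    implied_prevalence = true_cases_flagged / sensitivity
    # True cases the model leaves among the non-predicted individuals.
    missed_cases = implied_prevalence * (1.0 - sensitivity)
    # Share of model-defined controls that are expected to be true cases.
    control_contamination = missed_cases / (1.0 - predicted_fraction)
    return implied_prevalence, missed_cases, control_contamination


prevalence, missed, contamination = implied_misclassification(0.0996, 0.833, 0.688)
print(f"implied prevalence of true cases: {prevalence:.1%}")
print(f"missed cases as a share of prevalence: {missed / prevalence:.1%}")
print(f"expected true cases among model-defined controls: {contamination:.1%}")
```

Under these assumptions the implied prevalence of true cases is about 12%, the missed cases amount to 31.2% of that prevalence (the "roughly one-third" above), and about 4% of model-defined controls would be expected to stutter.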

Figure 3. LocusZoom plot for rs12613255 locus in EUR PheML stuttering GWAS
The lead variant (marked as a diamond) was found in chromosome 2, 113 kb 3′ of CYRIA. A dashed line indicates the threshold for genome-wide significance (5.0 × 10⁻⁸).

Figure 4. Manhattan plot and qq plot of results from GWAS of African-ancestry individuals predicted by PheML to exhibit developmental stuttering
Analysis included 13,643,593 variants across chromosomes 1–22. One variant, rs7837758, nearly reached genome-wide significance (beta = 0.518; p = 5.07 × 10⁻⁸), on chromosome 8 within the third intron of ZMAT4. The red line indicates the threshold for genome-wide significance (5.0 × 10⁻⁸), and the blue line indicates the threshold for suggestive significance (1.0 × 10⁻⁵). Loci reported in Table 4 are labeled on the plot with the nearest gene.
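The two plotted thresholds can be applied directly to the lead variants reported for each ancestry group. The snippet below is a minimal sketch: the threshold values come from the figure captions and the p values from the text, while the function and variable names are illustrative only.

```python
GENOME_WIDE = 5.0e-8   # red line in the Manhattan plots
SUGGESTIVE = 1.0e-5    # blue line in the Manhattan plots


def classify(p_value):
    """Label a p value relative to the two plotted thresholds."""
    if p_value < GENOME_WIDE:
        return "genome-wide significant"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not significant"


# Lead variants and p values as reported in the text.
lead_variants = {
    "rs12613255 (EUR)": 1.31e-8,
    "rs7837758 (AFR)": 5.07e-8,
    "rs10872381 (EAS)": 6.4e-8,
}
labels = {name: classify(p) for name, p in lead_variants.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

With a strict cutoff of p < 5.0 × 10⁻⁸, only the EUR lead variant clears the genome-wide line; the AFR and EAS lead variants fall just short and land in the suggestive range.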

Despite the challenge of misclassification of affected and control individuals in the sample size attained with our phenotype-imputation approach in biobank-scale data, our analyses in EUR participants not only identified a genome-wide significant locus but also demonstrate that a significant portion of the variance of the trait captured by our model is heritable (h² = 0.045, SE = 0.016, p = 2.29 × 10⁻³), indicating that there exists a common underlying genetic background among those who were classified as exhibiting developmental stuttering in our model. This heritability estimate is in line with other common complex neurological and psychological traits, such as PTSD in males and anxiety.48,49 We also demonstrate that this common genetic background is consistent with the genetic architecture identified in a clinically ascertained independent GWAS of stuttering.

Association analyses were separated by ancestry groups. For the analyses conducted in AFR, HIS, SAS, and EAS ancestry groups, no variants were observed to be significantly associated with our developmental stuttering cohort after genome-wide Bonferroni correction (p < 5 × 10⁻⁸) (Table S3), although two variants came close to reaching significance in the EAS sample (rs10872381, beta = 0.803; p = 6.4 × 10⁻⁸; see Figure S10) and AFR sample (rs7837758, beta = 0.518; p = 5.07 × 10⁻⁸; see Figure 5). These studies were markedly smaller in size than our EUR study (see Table 3) and were therefore likely insufficiently powered to discover variants of modest effect size.

GWAS in the European cohort revealed one significant locus; the top hit was at rs12613255 (beta = 0.323; p = 1.31 × 10⁻⁸; see Figure 3). The closest gene to this variant, which resides on chromosome 2, is CYRIA, also referred to in the literature as FAM49A. RNA expression data show that CYRIA is highly expressed in the central nervous system (specifically in the cerebral cortex, basal ganglia, and olfactory region) and is also highly expressed in the thyroid gland, granulocytes, and monocytes.50 CYRIA has not previously been implicated in developmental stuttering, although in Asian and Brazilian populations it has been reproducibly associated with cleft lip and palate,51–54 a trait that was not used as a predictor variable in our model.

In the AFR GWAS, variant rs7837758 nearly reached the genome-wide significance threshold of 5.0 × 10⁻⁸ (beta = 0.518; p = 5.07 × 10⁻⁸) (see Figures 4 and 5). This variant is located on chromosome 8 in the third intron of ZMAT4 (Zinc finger matrin-type protein 4) (see Figure 5). Variants within

Figure 5. LocusZoom plot for the rs7837758 locus in the AFR PheML Stuttering GWAS
The lead variant (marked as a diamond) was found on chromosome 8, within the third intron of ZMAT4. A dashed line indicates the threshold for genome-wide significance (5.0 × 10⁻⁸).

Table 5. Results of variant-concordance analysis

Variant set     Concordant variants     Concordance rate (%)    Binomial p value
All variants    3,816,091/7,570,420     50.41                   6.86 × 10⁻¹¹²
p < 0.5         982,614/1,931,927       50.86                   3.77 × 10⁻¹²⁷
p < 0.05        10,830/20,255           53.47                   2.84 × 10⁻²³
p < 0.005       121/171                 73.10                   6.25 × 10⁻¹⁰

Results from analyses comparing summary statistics of the European PheML stuttering GWAS to those of the ISP stuttering GWAS. Concordant variants include any variants that were present in both analyses and had the same direction of effect.
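The concordance counts and their significance can be sketched in a few lines. The example below counts same-direction effects between two sets of summary statistics, optionally restricted to variants below a p value threshold in both studies, and then tests the concordant fraction against 50%. It uses a one-sided normal approximation to the binomial test rather than an exact test, which is an assumption made to keep the example dependency-free; the function names and the toy data are hypothetical.

```python
import math


def count_concordant(betas_a, betas_b, pvals_a, pvals_b, threshold=None):
    """Count variants whose effect has the same sign in both GWASs.

    If `threshold` is given, keep only variants with p < threshold in BOTH studies.
    Returns (number concordant, number of variants kept).
    """
    same, total = 0, 0
    for beta_a, beta_b, p_a, p_b in zip(betas_a, betas_b, pvals_a, pvals_b):
        if threshold is not None and not (p_a < threshold and p_b < threshold):
            continue
        total += 1
        if (beta_a > 0) == (beta_b > 0):
            same += 1
    return same, total


def sign_test_p(concordant, total):
    """One-sided p value for concordance above 50% (normal approximation)."""
    z = (concordant - total / 2) / math.sqrt(total / 4)
    return 0.5 * math.erfc(z / math.sqrt(2))


# Toy summary statistics, made up purely for illustration.
toy_a = [0.2, -0.1, 0.3, 0.05]
toy_b = [0.1, 0.2, 0.4, -0.3]
toy_pa = [0.01, 0.2, 0.04, 0.6]
toy_pb = [0.03, 0.01, 0.02, 0.4]
toy_all = count_concordant(toy_a, toy_b, toy_pa, toy_pb)
toy_strict = count_concordant(toy_a, toy_b, toy_pa, toy_pb, threshold=0.05)

# Counts reported in the first three rows of Table 5.
reported = {
    "All variants": (3_816_091, 7_570_420),
    "p < 0.5": (982_614, 1_931_927),
    "p < 0.05": (10_830, 20_255),
}
results = {
    label: (100 * k / n, sign_test_p(k, n)) for label, (k, n) in reported.items()
}
for label, (rate, p_value) in results.items():
    print(f"{label}: {rate:.2f}% concordant, approximate one-sided p = {p_value:.2e}")
```

For these three rows the approximation gives concordance rates that match the table and p values of the same order of magnitude as the reported binomial p values, which is consistent with a one-sided test having been used.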

this gene have previously been observed to be associated with myopia and fasting blood glucose in African Americans.55,56 Neither of these phenotypes has been previously associated with developmental stuttering, nor did these phenotypes serve as proxy variables in our prediction algorithm.21 ZMAT4 has been observed to be highly expressed in the central nervous system, especially in tissue types present in the cerebral cortex, cerebellum, and hippocampus and to be modestly expressed in the basal ganglia.50

Little is known about the neuronal basis of developmental stuttering, although imaging studies have demonstrated that patients who stutter show abnormal function in the form of overactivity in the cortical motor and pre-motor areas associated with speech, as well as disruptions in the basal ganglia and dopaminergic systems.57–59 Although the previous linkage analyses have identified candidate genes, including DRD2 (MIM: 126450), AP4E1, CYP17A1 (MIM: 609300), GNPTAB, GNPTG, and NAGPA,13,60–62 the mechanisms of action remain uncertain, although both GNPTAB and GNPTG are active in lysosomal enzyme-targeting pathways and energy metabolism.63 We checked for replication within these genes but failed to demonstrate any significant findings (see Table S4). Although the mechanisms of action of our top associated variants on the clinical profile of developmental stuttering are not yet known, our approach enabled variant discovery, and future work will be needed to reproduce these findings and establish their functional role in the clinical profile of susceptibility to developmental stuttering.

Validation of GWAS results: concordance analysis

Using the summary statistics resulting from our GWAS of EUR individuals predicted by PheML to exhibit stuttering, we ran a variant concordance analysis that assessed how many variants had the same direction of effect as the variants tested in a GWAS run on a clinically ascertained developmental-stuttering sample set via the approach outlined in the 2014 DIAGRAM paper, in which concordance measures were used for assessing T2D risk alleles across various ancestry groups.46 Variants that were not well imputed (r² < 0.4) in either GWAS were removed from the concordance analysis, and datasets were verified to have the same strand and reference-allele orientation. This was repeated with only variants that surpassed thresholds of p = 0.5, 0.05, and 0.005 in both association analyses. For example, for the group "p < 0.05," only variants with a p value that was below 0.05 in both the PheML GWAS results and the ISP GWAS results were included in the analysis. Together, the four concordance analyses demonstrated that the proportion of variants with the same direction of effect was significantly greater than expected by chance. Additionally, as the significance threshold for variants included in the concordance analysis became more stringent, the proportion of variants with the same direction of effect increased, and 73.1% of all variants that surpassed a threshold of p = 0.005 in both GWASs have the same direction of effect (see Table 5). The clinically ascertained stuttering sample set was from a multi-ethnic analysis, although it was composed predominantly of EUR participants (84.2%; see Figure S1 and Table S2). This remarkable finding provides compelling evidence that the GWAS using our PheML-imputed stuttering phenotype is capturing a portion of the genetic architecture of clinical developmental stuttering.

Validation of GWAS results: stuttering PRS model development

To further explore our genetic findings, we developed a PRS model built with summary statistics from the GWAS conducted in our EUR PheML stuttering sample. PRS models are used for summarizing variant effects and assessing genetic liability for a trait. We applied this model to the genetic dataset of our ISP developmental-stuttering EUR cohort, as well as their matched controls. The ISP stuttering set scored significantly higher than their matched controls (two-sample t test, t(1131) = 13.12, p = 6.83 × 10⁻³⁹; see Figure S14), indicating that the PRS model developed from our imputed stuttering GWAS is significantly predictive of stuttering liability. This additional evidence strongly supports our conclusion that the PheML classifier captures a phenotype sufficiently similar to that of the ISP stuttering sample set to identify genetic risk factors relevant for stuttering. Although statistically significant, the difference observed in the score distributions for stuttering and control individuals suggests that this PRS model has limited clinical value for predicting developmental-stuttering status (receiver operating characteristic AUC = 0.601; see Figure S15). Ge et al.45 simulate the predictive performance of PRS across various sample sizes and genetic architectures. In these simulations they demonstrate that predictive performances for more polygenic traits benefit from

greater sample sizes, showing that PRS models developed from 50 to 100K sample sets have drastically better performance metrics than models with smaller sample size.45 Our PRS model was developed from a sample set of 39,511, which may be underpowered for developing a model that would be useful for predictive purposes.

We note that the developmental-stuttering comorbidities that formed the basis of our PheML model were determined in a dataset largely comprising European and African American individuals. As such, the underrepresentation of other racial and ethnic minority groups in our comorbidity detection and model building is a limitation. The effect of this population structure in our EHR might lead to our model's missing some population-specific comorbidities, consequently reducing model performance in these subgroups. Future research exploring comorbidities of stuttering and genetic architecture within and across populations is warranted.

Through our stuttering-prediction algorithm, we designed and conducted the largest GWAS for the clinical profile of developmental stuttering. Despite a lack of clinical evaluation and diagnosis of stuttering in our affected individuals and controls, our approach allowed use of an existing EHR-linked DNA databank of sufficient size and power to identify genome-wide significant loci through a population-based genetic analysis of this important trait. These data provide insights into the genetic contributions to developmental stuttering in patients of African and European descent, demonstrating that genetic risk of this clinical profile is dominated by modest to low genetic effects, as well as providing a framework for studying underreported diseases in large-scale EHRs.

Data and code availability

All code used for developing the PheML model described in the methods section is available for download on GitHub (see web resources). Genotyping data from BioVU are only available upon submission of a study proposal and approval through the BioVU Review Committee.

Supplemental information

Supplemental information can be found online at https://doi.org/10.1016/j.ajhg.2021.11.004.

Acknowledgments

This work and investigators D.M.S., H.P., D.G.P., L.E.P., H.-H.C., R.M.J., S.J.K., and J.E.B. were supported by R03DC015329, R01DC017175, and R21DC016723 from the National Institute on Deafness and Other Communication Disorders (NIDCD), National Institutes of Health. D.M.S. is supported by NIDCD grant 5T32GM80178-13. Some of the datasets used for the described analyses were obtained from Vanderbilt University Medical Center's BioVU, which is supported by numerous sources: institutional funding, private agencies, and federal grants. These include the NIH-funded shared instrumentation grant S10RR025141 and Clinical and Translational Science Awards grants UL1TR002243, UL1TR000445, and UL1RR024975. Genomic data are also supported by investigator-led projects that include U01HG004798, R01NS032830, RC2GM092618, P50GM115305, U01HG006378, U19HL065962, and R01HD074711 and additional funding sources listed at https://victr.vumc.org/biovu-funding.

Declaration of interests

The authors declare no competing interests.

Received: June 22, 2021
Accepted: October 27, 2021
Published: December 2, 2021

Web resources

LocusZoom GWAS visualization software, http://locuszoom.org/
McCarthy Group tools, www.well.ox.ac.uk/wrayner/tools/
PheML development code GitHub page, https://github.com/shawdm1/PheMLStutteringCART
PLINK v. 1.90 whole-genome data-analysis toolset, http://pngu.mgh.harvard.edu/purcell/plink/
PRScs Model GitHub page, https://github.com/getian107/PRScs
QQman R software packages, https://cran.r-project.org/web/packages/qqman/
Scikit-learn modeling software, https://github.com/scikit-learn/scikit-learn

References

1. Wingate, M.E. (1964). A standard definition of stuttering. J. Speech Hear. Disord. 29, 484–489.
2. Yairi, E., and Ambrose, N. (2013). Epidemiology of stuttering: 21st century advances. J. Fluency Disord. 38, 66–87.
3. Ajdacic-Gross, V., Vetter, S., Müller, M., Kawohl, W., Frey, F., Lupi, G., Blechschmidt, A., Born, C., Latal, B., and Rössler, W. (2010). Risk factors for stuttering: A secondary analysis of a large data base. Eur. Arch. Psychiatry Clin. Neurosci. 260, 279–286.
4. Yairi, E. (1983). The onset of stuttering in two- and three-year-old children: a preliminary report. J. Speech Hear. Disord. 48, 171–177.
5. Singer, C.M., Hessling, A., Kelly, E.M., Singer, L., and Jones, R.M. (2020). Clinical characteristics associated with stuttering persistence: A meta-analysis. J. Speech Lang. Hear. Res. 63, 2995–3018.
6. Seider, R.A., Gladstien, K.L., and Kidd, K.K. (1983). Recovery and persistence of stuttering among relatives of stutterers. J. Speech Hear. Disord. 48, 402–409.
7. Daniels, D.E., Gabel, R.M., and Hughes, S. (2012). Recounting the K-12 school experiences of adults who stutter: A qualitative analysis. J. Fluency Disord. 37, 71–82.
8. McAllister, J., Collier, J., and Shepstone, L. (2012). The impact of adolescent stuttering on educational and employment outcomes: Evidence from a birth cohort study. J. Fluency Disord. 37, 106–121.
9. Frigerio-Domingues, C., and Drayna, D. (2017). Genetic contributions to stuttering: The current evidence. Mol. Genet. Genomic Med. 5, 95–102.
10. van Beijsterveldt, C.E.M., Felsenfeld, S., and Boomsma, D.I. (2010). Bivariate genetic analyses of stuttering and

nonfluency in a large sample of 5-year-old twins. J. Speech Lang. Hear. Res. 53, 609–619.
11. Fagnani, C., Fibiger, S., Skytthe, A., and Hjelmborg, J.V.B. (2011). Heritability and environmental effects for self-reported periods with stuttering: A twin study from Denmark. Logoped. Phoniatr. Vocol. 36, 114–120.
12. Kang, C., Riazuddin, S., Mundorff, J., Krasnewich, D., Friedman, P., Mullikin, J.C., and Drayna, D. (2010). Mutations in the lysosomal enzyme-targeting pathway and persistent stuttering. N. Engl. J. Med. 362, 677–685.
13. Raza, M.H., Mattera, R., Morell, R., Sainz, E., Rahn, R., Gutierrez, J., Paris, E., Root, J., Solomon, B., Brewer, C., et al. (2015). Association between rare variants in AP4E1, a component of intracellular trafficking and persistent stuttering. Am. J. Hum. Genet. 97, 715–725.
14. Han, T.-U., Root, J., Reyes, L.D., Huchinson, E.B., Hoffmann, J.D., Lee, W.S., Barnes, T.D., and Drayna, D. (2019). Human GNPTAB stuttering mutations engineered into mice cause vocalization deficits and astrocyte pathology in the corpus callosum. Proc. Natl. Acad. Sci. USA 116, 17515–17524.
15. Maguire, G.A., Yoo, B.R., and SheikhBahaei, S. (2019). Investigation of Riperidone treatment associated with enhanced brain activity in patients who stutter. Front. Neurosci. 15, 100.
16. Turk, A.Z., Mahsa, L.M., Fritsch, I., Maguire, G.A., and SheikhBahaei, S. (2019). Dopamine, vocalization, and astrocytes. Front. Neurosci. 15, 100.
17. Chang, S.-E., and Guenther, F.H. (2020). Involvement of the cortico-basal ganglia-thalamocortical loop in developmental stuttering. Front. Psychol. 10, 3088.
18. Alm, P.A. (2020). Streptococcal infection as a major historical cause of stuttering: Data, mechanisms, and current importance. Front. Hum. Neurosci. 14, 569519.
19. Kohane, I.S. (2011). Using electronic health records to drive discovery in disease genomics. Nat. Rev. Genet. 12, 417–428.
20. Wei, W.-Q., Teixeira, P.L., Mo, H., Cronin, R.M., Warner, J.L., and Denny, J.C. (2016). Combining billing codes, clinical notes, and medications from electronic health records provides superior phenotyping performance. J. Am. Med. Inform. Assoc. 23 (e1), e20–e27.
21. Pruett, D.G., Shaw, D.M., Chen, H.-H., Petty, L.E., Polikowsky, H.G., Kraft, S.J., Jones, R.M., and Below, J.E. (2021). Identifying developmental stuttering and associated comorbidities in electronic health records and creating a phenome risk classifier. J. Fluency Disord. 68, 105847.
22. Pedregosa, F., Gael, V., Gramfort, A., Michel, V., and Thririon, B. (2011). Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830.
23. Polikowsky, H.G., Shaw, D.M., Petty, L.E., Chen, H.-H., Pruett, D.G., Linklater, J.P., Viljoen, K.Z., Beilby, J.M., Highland, H.M., Levitt, B., et al. (2021). Population-based genetic effects for developmental stuttering. HGG Advances 3, in press.
24. Denny, J.C., Bastarache, L., Ritchie, M.D., Carroll, R.J., Zink, R., Mosley, J.D., Field, J.R., Pulley, J.M., Ramirez, A.H., Bowton, E., et al. (2013). Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1110.
25. Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M.A.R., Bender, D., Maller, J., Sklar, P., de Bakker, P.I.W., Daly, M.J., et al. (2007). PLINK: A tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–575.
26. Altshuler, D.M., Gibbs, R.A., Peltonen, L., Altshuler, D.M., Gibbs, R.A., Peltonen, L., Dermitzakis, E., Schaffner, S.F., Yu, F., Peltonen, L., et al.; International HapMap 3 Consortium (2010). Integrating common and rare genetic variation in diverse human populations. Nature 467, 52–58.
27. Das, S., Forer, L., Schönherr, S., Sidore, C., Locke, A.E., Kwong, A., Vrieze, S.I., Chew, E.Y., Levy, S., McGue, M., et al. (2016). Next-generation genotype imputation service and methods. Nat. Genet. 48, 1284–1287.
28. McCarthy, S., Das, S., Kretzschmar, W., Delaneau, O., Wood, A.R., Teumer, A., Kang, H.M., Fuchsberger, C., Danecek, P., Sharp, K., et al.; Haplotype Reference Consortium (2016). A reference panel of 64,976 haplotypes for genotype imputation. Nat. Genet. 48, 1279–1283.
29. Loh, P.-R., Danecek, P., Palamara, P.F., Fuchsberger, C., A Reshef, Y., K Finucane, H., Schoenherr, S., Forer, L., McCarthy, S., Abecasis, G.R., et al. (2016). Reference-based phasing using the Haplotype Reference Consortium panel. Nat. Genet. 48, 1443–1448.
30. Pluzhnikov, A., Below, J.E., Konkashbaev, A., Tikhomirov, A., Kistner-Griffin, E., Roe, C.A., Nicolae, D.L., and Cox, N.J. (2010). Spoiling the whole bunch: Quality control aimed at preserving the integrity of high-throughput genotyping. Am. J. Hum. Genet. 87, 123–128.
31. Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909.
32. Staples, J., Qiao, D., Cho, M.H., Silverman, E.K., Nickerson, D.A., Below, J.E.; and University of Washington Center for Mendelian Genomics (2014). PRIMUS: Rapid reconstruction of pedigrees from genome-wide estimates of identity by descent. Am. J. Hum. Genet. 95, 553–564.
33. Staples, J., Nickerson, D.A., and Below, J.E. (2013). Utilizing graph theory to select the largest set of unrelated individuals for genetic analysis. Genet. Epidemiol. 37, 136–141.
34. Staples, J., Witherspoon, D.J., Jorde, L.B., Nickerson, D.A., Below, J.E., Huff, C.D.; and University of Washington Center for Mendelian Genomics (2016). PADRE: Pedigree-aware distant-relationship estimation. Am. J. Hum. Genet. 99, 154–162.
35. Staples, J., Ekunwe, L., Lange, E., Wilson, J.G., Nickerson, D.A., and Below, J.E. (2016). PRIMUS: improving pedigree reconstruction using mitochondrial and Y haplotypes. Bioinformatics 32, 596–598.
36. Taliun, D., Harris, D.N., Kessler, M.D., Carlson, J., Szpiech, Z.A., Torres, R., Taliun, S.A.G., Corvelo, A., Gogarten, S.M., Kang, H.M., et al.; NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium (2021). Sequencing of 53,831 diverse genomes from the NHLBI TOPMed Program. Nature 590, 290–299.
37. Fuchsberger, C., Abecasis, G.R., and Hinds, D.A. (2015). minimac2: faster genotype imputation. Bioinformatics 31, 782–784.
38. Luca, D., Ringquist, S., Klei, L., Lee, A.B., Gieger, C., Wichmann, H.E., Schreiber, S., Krawczak, M., Lu, Y., Styche, A., et al. (2008). On the use of general control samples for genome-wide association studies: genetic matching highlights causal variants. Am. J. Hum. Genet. 82, 453–463.
39. Lin, D.-Y., Tao, R., Kalsbeek, W.D., Zeng, D., Gonzalez, F., 2nd, Fernández-Rhodes, L., Graff, M., Koch, G.G., North, K.E., and Heiss, G. (2014). Genetic association analysis under complex survey sampling: the Hispanic Community Health Study/Study of Latinos. Am. J. Hum. Genet. 95, 675–688.

40. Turner, S.D. (2018). qqman: an R package for visualizing GWAS results using Q-Q and manhattan plots. Journal of Open Source Software 3, 731.
41. Pruim, R.J., Welch, R.P., Sanna, S., Teslovich, T.M., Chines, P.S., Gliedt, T.P., Boehnke, M., Abecasis, G.R., and Willer, C.J. (2010). LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics 26, 2336–2337.
42. Zhou, W., Nielsen, J.B., Fritsche, L.G., Dey, R., Gabrielsen, M.E., Wolford, B.N., LeFaive, J., VandeHaar, P., Gagliano, S.A., Gifford, A., et al. (2018). Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nat. Genet. 50, 1335–1341.
43. Yang, J., Lee, S.H., Goddard, M.E., and Visscher, P.M. (2011). GCTA: a tool for genome-wide complex trait analysis. Am. J. Hum. Genet. 88, 76–82.
44. Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., et al. (2010). Common SNPs explain a large proportion of the heritability for human height. Nat. Genet. 42, 565–569.
45. Ge, T., Chen, C.-Y., Ni, Y., Feng, Y.A., and Smoller, J.W. (2019). Polygenic prediction via Bayesian regression and continuous shrinkage priors. Nat. Commun. 10, 1776.
46. Mahajan, A., Go, M.J., Zhang, W., Below, J.E., Gaulton, K.J., Ferreira, T., Horikoshi, M., Johnson, A.D., Ng, M.C., Prokopenko, I., et al.; DIAbetes Genetics Replication And Meta-analysis (DIAGRAM) Consortium; Asian Genetic Epidemiology Network Type 2 Diabetes (AGEN-T2D) Consortium; South Asian Type 2 Diabetes (SAT2D) Consortium; Mexican American Type 2 Diabetes (MAT2D) Consortium; and Type 2 Diabetes Genetic Exploration by Next-generation sequencing in multi-Ethnic Samples (T2D-GENES) Consortium (2014). Genome-wide trans-ancestry meta-analysis provides insight into the genetic architecture of type 2 diabetes susceptibility. Nat. Genet. 46, 234–244.
47. Ambrose, N.G., Cox, N.J., and Yairi, E. (1997). The genetic basis of persistence and recovery in stuttering. J. Speech Lang. Hear. Res. 40, 567–580.
48. Nievergelt, C.M., Maihofer, A.X., Klengel, T., Atkinson, E.G., Chen, C.-Y., Choi, K.W., Coleman, J.R.I., Dalvie, S., Duncan, L.E., Gelernter, J., et al. (2019). International meta-analysis of PTSD genome-wide association studies identifies sex- and ancestry-specific genetic risk loci. Nat. Commun. 10, 4558.
49. Trzaskowski, M., Eley, T.C., Davis, O.S.P., Doherty, S.J., Hanscombe, K.B., Meaburn, E.L., Haworth, C.M., Price, T., and Plomin, R. (2013). First genome-wide association study on anxiety-related behaviours in childhood. PLoS ONE 8, e58676.
50. Uhlén, M., Fagerberg, L., Hallström, B.M., Lindskog, C., Oksvold, P., Mardinoglu, A., Sivertsson, Å., Kampf, C., Sjöstedt, E., Asplund, A., et al. (2015). Proteomics. Tissue-based map of the human proteome. Science 347, 1260419.
51. Leslie, E.J., Carlson, J.C., Shaffer, J.R., Feingold, E., Wehby, G., Laurie, C.A., Jain, D., Laurie, C.C., Doheny, K.F., McHenry, T., et al. (2016). A multi-ethnic genome-wide association study identifies novel loci for non-syndromic cleft lip with or without cleft palate on 2p24.2, 17q23 and 19q13. Hum. Mol. Genet. 25, 2862–2872.
52. Huang, L., Jia, Z., Shi, Y., Du, Q., Shi, J., Wang, Z., Mou, Y., Wang, Q., Zhang, B., Wang, Q., et al. (2019). Genetic factors define CPO and CLO subtypes of nonsyndromic orofacial cleft. PLoS Genet. 15, e1008357.
53. Yang, Y., Suzuki, A., Iwata, J., and Jun, G. (2020). Secondary genome-wide association study using novel analytical strategies disentangle genetic components of cleft lip and/or cleft palate in 1q32.2. Genes (Basel) 11, 1280.
54. Yu, Y., Zuo, X., He, M., Gao, J., Fu, Y., Qin, C., Meng, L., Wang, W., Song, Y., Cheng, Y., et al. (2017). Genome-wide analyses of non-syndromic cleft lip with palate identify 14 novel loci and genetic heterogeneity. Nat. Commun. 8, 14364.
55. Ramos, E., Chen, G., Shriner, D., Doumatey, A., Gerry, N.P., Herbert, A., Huang, H., Zhou, J., Christman, M.F., Adeyemo, A., et al. (2011). Replication of genome-wide association studies (GWAS) loci for fasting plasma glucose in African-Americans. Diabetologia 54, 783–788.
56. Cheong, K.X., Yong, R.Y.Y., Tan, M.M.H., Tey, F.L.K., and Ang, B.C.H. (2020). Association of VIPR2 and ZMAT4 with high myopia. Ophthalmic Genet. 41, 41–48.
57. Watkins, K.E., Smith, S.M., Davis, S., and Howell, P. (2008). Structural and functional abnormalities of the motor system in developmental stuttering. Brain 131, 50–59.
58. Giraud, A.-L., Neumann, K., Bachoud-Levi, A.-C., von Gudenberg, A.W., Euler, H.A., Lanfermann, H., and Preibisch, C. (2008). Severity of dysfluency correlates with basal ganglia activity in persistent developmental stuttering. Brain Lang. 104, 190–199.
59. Shahed, J., and Jankovic, J. (2001). Re-emergence of childhood stuttering in Parkinson’s disease: a hypothesis. Mov. Disord. 16, 114–118.
60. Lan, J., Song, M., Pan, C., Zhuang, G., Wang, Y., Ma, W., Chu, Q., Lai, Q., Xu, F., Li, Y., et al. (2009). Association between dopaminergic genes (SLC6A3 and DRD2) and stuttering among Han Chinese. J. Hum. Genet. 54, 457–460.
61. Mohammadi, H., Joghataei, M.T., Rahimi, Z., Faghihi, F., Khazaie, H., Farhangdoost, H., and Mehrpour, M. (2017). Sex steroid hormones and sex hormone binding globulin levels, CYP17 MSP AI (-34T:C) and CYP19 codon 39 (Trp:Arg) variants in children with developmental stuttering. Brain Lang. 175, 47–56.
62. Kazemi, N., Estiar, M.A., Fazilaty, H., and Sakhinia, E. (2018). Variants in GNPTAB, GNPTG and NAGPA genes are associated with stutterers. Gene 647, 93–100.
63. Chow, H.M., Garnett, E.O., Li, H., Etchell, A., Sepulcre, J., Drayna, D., Chugani, D., and Chang, S.-E. (2020). Linking lysosomal enzyme targeting genes and energy metabolism with altered gray matter volume in children with persistent stuttering. Neurobiol Lang 1, 365–380.

The American Journal of Human Genetics 108, 2271–2283, December 2, 2021
The American Journal of Human Genetics, Volume 108

Supplemental information

Phenome risk classification enables phenotypic imputation and gene discovery in developmental stuttering
Douglas M. Shaw, Hannah P. Polikowsky, Dillon G. Pruett, Hung-Hsin Chen, Lauren E.
Petty, Kathryn Z. Viljoen, Janet M. Beilby, Robin M. Jones, Shelly Jo Kraft, and Jennifer
E. Below
Supplemental Figures

Figure S1. Principal component analysis of BioVU subjects. Top three principal
components for all subjects in BioVU projected onto 1KG reference data. Subjects were
stratified into broad ancestry groups, African (AFR), European (EUR), Hispanic (HIS), South Asian (SAS),
East Asian (EAS), or admixed American (AMR), based on PCs 1-3 (see methods). EUR
ancestry was stratified

Figure S2. Manhattan and Q-Q plots of the Hispanic ancestry PheML-predicted developmental
stuttering GWAS results. Analysis included 8,147,169 autosomal variants. No variants reached
genome-wide significance (P<5×10⁻⁸). Red line indicates the genome-wide significance threshold
(5.0×10⁻⁸); blue line indicates the suggestive significance threshold (1.0×10⁻⁵). Loci reported in
Table 2 are labeled on the plot.
Figure S3. Manhattan and Q-Q plots of the East Asian ancestry PheML-predicted developmental
stuttering GWAS results. Analysis included 6,922,517 autosomal variants. No variants reached
genome-wide significance (P<5×10⁻⁸). Red line indicates the genome-wide significance threshold
(5.0×10⁻⁸); blue line indicates the suggestive significance threshold (1.0×10⁻⁵). Loci reported in
Table 2 are labeled on the plot.

Figure S4. Manhattan and Q-Q plots of the South Asian ancestry PheML-predicted developmental
stuttering GWAS results. Analysis included 7,058,354 autosomal variants. No variants reached
genome-wide significance (P<5×10⁻⁸). Red line indicates the genome-wide significance threshold
(5.0×10⁻⁸); blue line indicates the suggestive significance threshold (1.0×10⁻⁵). Only 51 subjects
of SAS ancestry were predicted by the PheML model to have developmental stuttering.
Figure S5. LocusZoom Plot for rs2997903 in AFR PheML Stuttering GWAS. Lead variant found
within the first intron of KYAT1 (beta=0.308; P=5.32×10⁻⁷). Dashed line indicates genome-wide
significance threshold (5.0×10⁻⁸).

Figure S6. LocusZoom Plot for rs10464899 in AFR PheML Stuttering GWAS. Lead variant found
178kb 5’ of TOX (beta=0.216; P=1.51×10⁻⁷). We also report rs6981922 (beta=0.197;
P=9.35×10⁻⁷), which also replicated in our East Asian population (P=3.27×10⁻²). Dashed line
indicates genome-wide significance threshold (5.0×10⁻⁸).
Figure S7. LocusZoom Plot for rs34456770 in AFR PheML Stuttering GWAS. Lead variant found
797bp 5’ of MPG (beta=0.256; P=1.87×10⁻⁶). Dashed line indicates genome-wide significance
threshold (5.0×10⁻⁸).

Figure S8. LocusZoom Plot for rs78072807 in AFR PheML Stuttering GWAS. Lead variant found
within the 21st intron of RYR2 (beta=0.371; P=8.73×10⁻⁸). Dashed line indicates genome-wide
significance threshold (5.0×10⁻⁸).
Figure S9. LocusZoom Plot for rs115024493 in AFR PheML Stuttering GWAS. Lead variant
found 397kb 5’ of DCN (beta=0.376; P=6.58×10⁻⁸). Dashed line indicates genome-wide
significance threshold (5.0×10⁻⁸).
Figure S10. LocusZoom Plot for rs10872381 in EAS PheML Stuttering GWAS. Lead variant
found 102kb 3’ of AKAP7 (beta=0.803; P=6.38×10⁻⁸). Dashed line indicates genome-wide
significance threshold (5.0×10⁻⁸).

Figure S11. LocusZoom Plot for rs10036373 in EUR PheML Stuttering GWAS. Lead variant
found 42kb 5’ of C5orf17 (beta=0.701; P=3.68×10⁻⁶). Dashed line indicates genome-wide
significance threshold (5.0×10⁻⁸).
Figure S12. LocusZoom Plot for rs8013614 in EUR PheML Stuttering GWAS. Lead variant found
84kb 5’ of BRMS1L (beta=-0.120; P=1.59×10⁻⁶). Dashed line indicates genome-wide significance
threshold (5.0×10⁻⁸).
Figure S13. LocusZoom Plot for rs6415726 in HIS PheML Stuttering GWAS. Lead variant found
within the second intron of C9orf92 (beta=0.730; P=9.61×10⁻⁶). Dashed line indicates
genome-wide significance threshold (5.0×10⁻⁸).
Figure S14. Polygenic risk score violin plots. PRS model was developed using the summary
statistics from the EUR ancestry PheML stuttering GWAS. The International Stuttering Project
stuttering case set (blue) scored significantly higher on the PRS model (mean=8.56×10⁻⁸,
SD=1.13×10⁻⁶) than their matched controls (orange; mean=-3.59×10⁻⁷, SD=1.01×10⁻⁶;
t(1131)=13.12, P=6.83×10⁻³⁹).
Figure S15. Polygenic risk score receiver operating characteristic (ROC) curve. PRS model was
developed using the summary statistics from the EUR ancestry PheML stuttering GWAS. ROC
curve plotted to demonstrate the model performance in predicting stuttering liability in the
International Stuttering Project stuttering set. Area under the curve (AUC) = 0.60.
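
As a reading aid for the AUC reported in Figure S15, the following is a minimal sketch of how an area under the ROC curve can be computed from scores and case/control labels. It uses the rank-based definition (the probability that a randomly chosen case scores higher than a randomly chosen control, counting ties as one half). The function name `roc_auc` and the toy scores are hypothetical illustrations, not the study's data or code.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (case, control) pairs in which the case
    has the higher score; tied pairs count as 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = case, 0 = control (not taken from the study).
toy_scores = [0.9, 0.4, 0.6, 0.3, 0.5, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
print(roc_auc(toy_scores, toy_labels))
```

Under this definition an AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of cases from controls.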
Supplemental Tables

Table S1. ICD codes used to identify developmental stuttering patients.

| ICD-9 Code | ICD-10 Code | Definition |
|---|---|---|
| 307.0 | F98.5 | Adult-Onset Fluency Disorder |
| 315.35 | F80.81 | Childhood Onset Fluency Disorder |
| 784.5 | R47.82 | Fluency Disorder in Conditions Classified Elsewhere |
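
The code list in Table S1 could be applied to a patient record along the following lines. This is a minimal sketch under stated assumptions: the function name `has_stuttering_code`, the exact-string matching, and the example records are hypothetical, and real EHR exports may store codes in other formats (for example with trailing zeros) that this sketch does not handle.

```python
from typing import Iterable

# Codes copied from Table S1.
STUTTERING_ICD9 = {"307.0", "315.35", "784.5"}
STUTTERING_ICD10 = {"F98.5", "F80.81", "R47.82"}
STUTTERING_CODES = STUTTERING_ICD9 | STUTTERING_ICD10


def has_stuttering_code(codes: Iterable[str]) -> bool:
    """Return True if any code in the record exactly matches a Table S1 code.

    Matching is exact after trimming whitespace and upper-casing
    (an assumption of this sketch, not a documented study rule).
    """
    normalized = {code.strip().upper() for code in codes}
    return not normalized.isdisjoint(STUTTERING_CODES)


# Hypothetical records for illustration only.
print(has_stuttering_code(["F80.81", "J45.909"]))  # record with a Table S1 code
print(has_stuttering_code(["E11.9"]))              # record without one
```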

Table S2. Demographics of the clinically validated stuttering case and control set: demographic
distribution of subjects used in the genome-wide association analysis for the International
Stuttering Project (ISP) stuttering sample set.

| | Stuttering Cases | Population Controls |
|---|---|---|
| Total | 1345 | 7019 |
| Male | 965 (71.7%) | 4951 (70.5%) |
| Female | 380 (28.3%) | 2068 (29.5%) |
| Ancestry, n (%) | | |
| European | 1132 (84.2%) | 6111 (87.1%) |
| African | 68 (5.1%) | 400 (5.7%) |
| East Asian | 42 (3.1%) | 116 (1.7%) |
| South Asian | 44 (3.3%) | 148 (2.1%) |
| Hispanic | 38 (2.8%) | 132 (1.9%) |
| Mixed/Other | 21 (1.5%) | 112 (1.6%) |

See attached table

Table S3. Suggestive hits from the PheML-predicted developmental stuttering GWAS run in each
ancestry. Table includes all variants where P<5.0×10⁻⁶. We also report association results for
each variant in the other ancestries. GWAS results for the European, African, Hispanic, South
Asian, and East Asian ancestry cohorts are denoted EUR, AFR, HIS, SAS, and EAS, respectively.
GWAS results from the clinically validated set are denoted CV.
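
The two cutoffs used here, genome-wide significance (P<5.0×10⁻⁸) and the Table S3 inclusion threshold (P<5.0×10⁻⁶), can be applied to a p-value as in the sketch below. The function name `classify_p_value` and its labels are hypothetical; the first two example p-values are the ones quoted in the captions of Figures S5 and S8, and the last two are made up for illustration.

```python
GENOME_WIDE_THRESHOLD = 5.0e-8  # genome-wide significance threshold from the figure captions
TABLE_S3_THRESHOLD = 5.0e-6     # inclusion cutoff stated in the Table S3 caption


def classify_p_value(p: float) -> str:
    """Label a GWAS p-value against the two thresholds (labels are illustrative)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p < GENOME_WIDE_THRESHOLD:
        return "genome-wide significant"
    if p < TABLE_S3_THRESHOLD:
        return "suggestive (meets Table S3 cutoff)"
    return "above Table S3 cutoff"


examples = {
    "rs2997903 (Figure S5)": 5.32e-7,
    "rs78072807 (Figure S8)": 8.73e-8,
    "hypothetical variant A": 1.0e-9,
    "hypothetical variant B": 0.03,
}
for name, p in examples.items():
    print(f"{name}: P={p:.2e} -> {classify_p_value(p)}")
```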
See attached table

Table S4. Replication results of previously identified genes associated with stuttering.
