1
Department of Statistics, Virginia Polytechnic Institute & State University, Blacksburg, VA
2
Department of Biostatistics, Virginia Commonwealth University, Richmond, VA
3
Department of Human Genetics, University of Pittsburgh, Pittsburgh, PA
Corresponding author details:
Dipankar Bandyopadhyay
Department of Biostatistics, School of Medicine
Virginia Commonwealth University
Richmond, VA
Copyright: © 2020 Wang Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 international License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Adaptive-Weight Burden Test; Dental Caries; GWAS; GENEVA consortium
Dental caries is a chronic, transmissible disease caused by activities of bacteria [1]. Among the common disorders in humans, caries has the highest prevalence, affecting 2.43 billion people worldwide in their permanent teeth and 620 million children in their primary teeth [2]. It is known that multiple factors contribute to a person’s risk for caries development and progression. Besides environmental factors (diet, oral hygiene, etc.) and host factors related to subjects’ oral conditions, genetic factors have been shown to play an important role in caries etiology [3-5].
Analyzing genome-wide data generated from epidemiological studies of dental health opens up cost-effective opportunities to uncover how genetic factors affect inherited risks for dental caries. Previous genome-wide association studies (GWASs) for dental caries have identified several genes that may be associated with caries heritability [6], including enamel formation genes such as AMBN, AMELX, ENAM [7-9], taste receptor genes such as TAS2R38, TAS1R2 [10,11], and genes related to immunity such as HLA [12,13] and saliva such as PRH1 [14]. Although considerable efforts have been made to capture GWAS signals for dental caries, they are often limited to marginal association between caries traits and each individual single nucleotide polymorphism (SNP), hence may have limited power at the gene level (especially when evaluating the genetic effect of low-frequency minor alleles). Moreover, due to genotyping differences in different GWASs, findings from such analyses may exhibit inconsistencies, which pose difficulties for biological interpretations. On the other hand, SNP-set-based or gene-based association analyses [15-19] are receiving increasing attention since genes are the functional unit of the human genome and remain highly consistent across diverse human populations. However, most of these gene-based methods are not sufficient to account for linkage disequilibrium (LD) among SNPs, i.e., the non-random association of alleles at different loci. In fact, since caries GWASs are often conducted on related individuals, e.g., family trios or pedigree samples, it is critical to model the complex genotypic correlations caused by both LD and familial relation in order to improve the power of association testing.
In this paper, we propose to use a novel gene-based association mapping strategy to identify caries-associated genes based on genome-wide data from the GENE enVironment Association studies (GENEVA) consortium. Several suggestive genes were identified, some of which (such as PTPRD) have previously been found in other single-SNP-based GWASs to play plausible biological roles in dental caries. An interesting finding was obtained by comparing the gene sets identified separately from gene-based and SNP-based tests. The non-negligible overlap between these two sets suggests that the two types of analyses are not independent of each other, and that results from gene-based association analyses may therefore contain important signals relevant to cariogenesis, complementing those from the traditional single-SNP-based GWASs.
Dental caries data from the GENEVA program
As part of the quality-control procedure, the genotyped individuals were further filtered by excluding those who met either of the following two criteria:
Among these 652 participants, 571 are from 201 families
(nuclear families or parent-offspring trios), and the rest are unrelated
individuals. Quality control on SNPs was conducted based on the
following conditions:
i. Call rate ≥ 96%, and
ii. Minor allele frequency (MAF) ≥ 1%.
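The two SNP-level conditions above can be sketched as a simple column filter. This is a minimal illustration, not the pipeline used in the study: it assumes genotypes are coded as 0, 1, or 2 copies of one allele with `None` for a missing call, and the function names `snp_passes_qc` and `filter_snps` are hypothetical.

```python
from typing import List, Optional

# Assumed coding: 0, 1, or 2 copies of the counted allele; None = missing call.
Genotype = Optional[int]


def snp_passes_qc(calls: List[Genotype],
                  min_call_rate: float = 0.96,
                  min_maf: float = 0.01) -> bool:
    """Return True if one SNP column meets both QC conditions."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return False
    # Call rate: share of individuals with a non-missing genotype.
    call_rate = len(observed) / len(calls)
    # Frequency of the counted allele among observed genotypes.
    freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is rarer.
    maf = min(freq, 1.0 - freq)
    return call_rate >= min_call_rate and maf >= min_maf


def filter_snps(columns: List[List[Genotype]]) -> List[int]:
    """Return the indices of the SNP columns that pass QC."""
    return [j for j, col in enumerate(columns) if snp_passes_qc(col)]
```

Each SNP is evaluated independently, so the filter can be applied before the genotype matrix is split by gene.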
To perform the gene-based test, the entire genotype matrix (rows corresponding to individuals and columns corresponding to SNPs) was split according to the starting and ending positions of each gene.
Each partition of the genotype matrix was further cleaned to avoid possible singularity problems, in two respects: (1) columns corresponding to non-polymorphic SNPs were excluded, and (2) given the limited sample size, only columns with unique SNP data were retained, whereas duplicated ones were eliminated. These quality assurance steps resulted in 26,831 genes under consideration.
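The partitioning and cleaning steps can be sketched as follows. This is an illustrative sketch under stated assumptions: SNP positions and gene boundaries are given as base-pair coordinates on one chromosome, genotypes are complete (no missing calls), and the helper names `snps_in_gene` and `clean_gene_columns` are hypothetical. Whether the study extended gene boundaries with flanking regions is not stated, so none are added here.

```python
from typing import List


def snps_in_gene(positions: List[int], start: int, end: int) -> List[int]:
    """Indices of SNPs whose position falls within [start, end]."""
    return [j for j, pos in enumerate(positions) if start <= pos <= end]


def clean_gene_columns(columns: List[List[int]]) -> List[List[int]]:
    """Drop non-polymorphic and duplicated SNP columns.

    The first occurrence of each distinct column is kept, so the
    remaining columns are pairwise different.
    """
    seen = set()
    kept = []
    for col in columns:
        key = tuple(col)
        if len(set(key)) < 2:
            continue  # non-polymorphic: every individual has the same genotype
        if key in seen:
            continue  # exact duplicate of an earlier column
        seen.add(key)
        kept.append(list(col))
    return kept


# Toy data: five SNPs (columns) genotyped on three individuals.
positions = [100, 150, 200, 250, 900]
snp_columns = [[0, 1, 2], [0, 1, 2], [1, 1, 1], [2, 0, 1], [0, 0, 1]]

gene_idx = snps_in_gene(positions, start=100, end=300)
gene_block = clean_gene_columns([snp_columns[j] for j in gene_idx])
```

Removing constant and duplicated columns is what keeps the per-gene SNP covariance from being trivially singular, which is the concern raised in the text.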
Statistical approach
It has been shown that the ABT statistic, $S_{ABT}$, is formed from the transformed phenotypic residual under the null hypothesis of no genetic association and the phenotype covariance matrix. Here, $\Phi$ is the kinship matrix of the sampled individuals, and the two variance-component parameters stand for the variance due to random measurement error and the variance attributed to additive polygenic random effects, respectively. The covariance matrix of multiple SNPs in a pre-defined gene region is denoted as $DRD$, where $R$ is the LD correlation matrix and $D = \mathrm{diag}\{\sigma_j\}$, $1 \le j \le m$, is a diagonal matrix of the standard deviations of the SNPs in genotype $G$. ABT adopts adaptive weights to collapse multiple SNPs in each pre-defined gene region to improve the power of association testing. An alternative view of ABT is as a kernel test with generalized Madsen–Browning weights. The null distribution of $S_{ABT}$ can be explicitly derived as a mixture of random variables, each obtained from an eigenvalue of the matrix $WG^{T}PGW$, where $P$ =
To validate the results from the gene-based association analysis, we compared the gene set identified by the gene-based test with that identified by a SNP-based test. For consistency, we chose MASTOR as the SNP-based test. MASTOR is a mixed-model, retrospective score test for genetic association with quantitative traits in samples with related individuals [22]. When using such a single-SNP test, we consider a gene to be significantly associated with the DMFT trait if at least one SNP in that gene region has a p-value below the Bonferroni-corrected threshold α = 0.05/(number of SNPs in the gene), after adjusting for non-genetic covariates. Since all other settings for this study, including the samples, covariates, phenotypes/genotypes, and the retrospective analytical strategy, remain unchanged from the gene-based analysis, the gene sets identified by ABT and by MASTOR should be comparable. A simple chi-square test for independence can then be conducted on a 2×2 contingency table formed from the numbers of significant and non-significant genes identified by these two methods.
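The gene-level decision rule for the single-SNP test and the 2×2 independence test can be sketched as below. This is a hedged illustration rather than the authors' code: it uses the Pearson chi-square statistic without continuity correction (the text does not say whether a correction was applied), the function names are hypothetical, and the cell counts in the usage lines are made-up numbers, not the study's Table 4.

```python
import math
from typing import List, Tuple


def gene_significant_by_snps(snp_pvalues: List[float],
                             alpha: float = 0.05) -> bool:
    """Flag a gene if any SNP p-value is below alpha / (number of SNPs)."""
    threshold = alpha / len(snp_pvalues)
    return min(snp_pvalues) < threshold


def chi_square_independence_2x2(a: int, b: int,
                                c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 df, no continuity correction).

    Table layout (rows: gene-based test, columns: SNP-based test):
        a = significant in both       b = gene-based only
        c = SNP-based only            d = significant in neither
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts for illustration only.
stat, p_value = chi_square_independence_2x2(20, 5, 5, 20)
```

A small p-value here indicates that the two gene sets overlap more (or less) than would be expected if the two tests flagged genes independently.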
The GENEVA dental caries data (dbGaP accession: phs000095.v3.p1) include 5,291 phenotyped and 4,020 genotyped individuals. In this study, we focus on a subset of 652 participants who have complete data for the following phenotypic characteristics: gender, age, education group, water source, presence/absence of S. mutans, home tap water fluoride level, saliva flow, brush frequency, and the Decay-Missing-Filled (DMF) index. These characteristics are summarized in Table 1.
We note that the 652 participants include both children and adults. Ages range from 7 to 61 years, and 406 of the 652 participants were older than 18 at the time of examination. The majority of these participants are white (648 white, two multi- or bi-racial, and two with missing values); therefore, race was not included as a covariate in our study.
We use the adaptive-weight burden test [15] to perform gene-based association testing for the DMFT index in the 652 participants, adjusting for the above eight non-genetic covariates and properly addressing various types of genotypic correlations caused by both LD and familial relation. This test first obtains the transformed phenotypic residual from the phenotype model under the null hypothesis, and then collapses the genotypes of multiple SNPs in each gene region by using data-adaptive weights to achieve a powerful retrospective, gene-based test (details are provided in Statistical approach).
In the step of null phenotype model identification, two variance-component parameters and 10 fixed-effect regression coefficients for the non-genetic covariates (two indicators for water source categories) are estimated by the maximum likelihood method. These estimates are shown in Table 2. In the covariance estimation, we notice that the estimated ratio of the additive polygenic variance to the total variance σ², i.e., the narrow-sense heritability, is about 0.44, which is comparable with traditional heritability estimates of DMFT (or DMFS) in the permanent dentition from other family-based dental caries GWASs [23-25]. In the covariate effect estimation, we see that at the nominal level 0.05, three covariates (sex, age, and S. mutans) are significantly associated with the DMFT trait with positive coefficient estimates, indicating that patients with male gender, younger age, and absence of S. mutans are more likely to have lower DMFT measurements. Brush frequency, while expected to be positively associated with dental caries, seems to have a “borderline” effect: it is significant at the nominal level 0.1 but not at 0.05.
In this gene-based association analysis, the Manhattan plot and quantile-quantile (Q-Q) plot for a total of 26,831 Entrez genes on the human genome are shown in Figures 1 and 2, respectively. In Figure 2, the resulting genomic inflation factor λ is 1.05, which is generally considered benign [26], suggesting that the gene-based p-values did not show substantial departure from the uniform distribution. Therefore, the extent of inflation due to population stratification or other confounders is negligible.
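For readers who want to reproduce a λ of this kind from a list of p-values, one common definition can be sketched as follows. The text does not state how λ was computed, so this is an assumption: each p-value is converted to its equivalent chi-square statistic with 1 degree of freedom, and the median of those statistics is divided by the theoretical median under the null (about 0.455). The function name `genomic_inflation_factor` is hypothetical.

```python
from statistics import NormalDist, median
from typing import List

_STD_NORMAL = NormalDist()

# Median of a chi-square distribution with 1 df: the squared 75th
# percentile of the standard normal (about 0.455).
CHI2_1DF_MEDIAN = _STD_NORMAL.inv_cdf(0.75) ** 2


def genomic_inflation_factor(pvalues: List[float]) -> float:
    """Ratio of the observed median 1-df chi-square statistic to its null median.

    The p-value to chi-square conversion is monotone, so the median
    statistic is obtained by converting the median p-value.
    """
    med_p = median(pvalues)
    # For 1 df, a chi-square statistic x has p-value p when
    # sqrt(x) is the (1 - p/2) quantile of the standard normal.
    observed_median_stat = _STD_NORMAL.inv_cdf(1.0 - med_p / 2.0) ** 2
    return observed_median_stat / CHI2_1DF_MEDIAN
```

Under this definition, uniformly distributed p-values have a median of 0.5 and give λ = 1; values above 1 indicate p-values that are smaller than expected overall.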
Table 3 lists the top 10 genes ranked by ABT p-value for the GENEVA dental caries data. Among these 10 genes, PTPRD (MIM: 601598) has recently been reported to be associated with smooth and pit-and-fissure surface caries in the primary dentition in children by a single-SNP-based GWAS (reported SNP: rs10958998, intronic, [27]). The PTPRD gene encodes a member of the protein tyrosine phosphatase (PTP) family, whose members are known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation.
Our study also identified five other genes that were reported in the GWAS catalogue [28] to be relevant to cariogenesis, namely FHIT (MIM: 601153), CNTN4 (MIM: 607280), CTNNA3 (MIM: 607667), IL17D (MIM: 607587), and CELF2 (MIM: 602538). The reported SNPs for these five genes are: rs9311745 (intron variant, [29]), rs17013524 (intron variant, [27]), rs2441755 (intron variant), rs735539 (intron variant), and rs11256676 (intergenic variant, [30]), respectively. The protein encoded by FHIT is a P1-P3-bis(5’-adenosyl) triphosphate hydrolase involved in purine metabolism. This gene encompasses the common fragile site FRA3B on chromosome 3, where carcinogen-induced damage can lead to translocations and aberrant transcripts. CNTN4 encodes a member of the contactin family of immunoglobulins. The encoded protein may play a role in the formation of axon connections in the developing nervous system. The CTNNA3 gene encodes a protein that belongs to the vinculin/alpha-catenin family and plays a role in cell-cell adhesion in muscle cells. IL17D encodes a cytokine that shares sequence similarity with IL17. The treatment of endothelial cells with this cytokine has been shown to stimulate the production of other cytokines including IL6, IL8, and CSF2/GM-CSF.
Table 1: Sample Characteristics in the GENEVA Data
Table 2: Estimated Covariance and Covariate Effects in the GENEVA
Data.
Table 3: Strongest Association Signals in the GENEVA Data. DMFT-associated genes with the top 10 ABT-ranked p-values are reported. Underlined genes have been previously identified to be associated with dental caries. MIM numbers of genes not mentioned in the text: CSMD1 (MIM: 608397), RBFOX1 (MIM: 605104), and CLRN1 (MIM: 606397).
Table 4: Comparison of Gene Sets Identified by Gene- and SNP-based
Tests
Figure 1: Manhattan Plot of Gene-based Association Testing
P-values for GENEVA DMFT Data.
Figure 2: QQ Uniform Plot of Gene-based Association Testing
P-values for GENEVA DMFT Data.
Traditional GWASs rely on screening the genome on the basis of SNPs. Though such a SNP-based association testing strategy has proven successful in identifying susceptibility loci for several complex genetic diseases [31,32], challenges in GWASs still exist. First, SNP-based testing detects only marginal effects and may be underpowered to evaluate rare-variant effects due to their low allele frequencies. Second, since SNP-based testing only reports significant SNPs, identification of genes is usually ad hoc, depending on the location of the identified SNPs relative to target genes (intron, intergenic, noncoding, up/downstream, regulatory region, etc.), and interpretation of genetic effects at the gene level remains elusive. In contrast, gene-based testing assesses joint effects of multiple variants in a predefined gene region. Compared with SNP-based testing, gene-based testing has several appealing features. First, shifting the testing unit from SNP to gene generates more interpretable and replicable findings in gene function [33] and gene-gene interaction [34]. Second, by aggregating small signals from each SNP variant, gene-based testing usually achieves improved power, especially for low-frequency minor alleles [35]. Third, chance findings due to multiple testing will be reduced by using a gene-wide instead of a genome-wide significance level [10]. Finally, gene-based testing also lends itself to meta-analysis of combined data from multiple studies [36]. For these reasons, gene-based association testing is believed to be a natural approach for association analysis in the post-GWAS era of dense genotyping and fine mapping [37].
In this study, we performed gene-based association testing for dental caries data from the GENEVA consortium, with adjustment for phenotype covariates and accounting for LD in SNPs and familial relation in samples. We observed suggestive associations between the DMFT trait and several genes, some of which have been found to have plausible biological functions relevant to cariogenesis. Three non-genetic covariates (sex, age, and S. mutans) were found to be significantly associated with the DMFT trait, and the narrow-sense heritability was found to be comparable with traditional heritability estimates in previous family-based dental caries GWASs. Due to the differences between study designs and testing methods, it is not reasonable to compare the gene set identified by this study with those identified by other SNP-based GWASs. However, we attempted to compare the results from both gene- and SNP-based association testing on the basis of the GENEVA dental caries data. The comparison revealed that the gene-based test captured 65.34% of the genes that are significant in the SNP-based test. A further test for independence illustrated that, though gene-based association testing has a totally different mechanism, the identified genes overlap significantly with those identified by SNP-based association testing, given that the two tests are performed on the same GENEVA data. Therefore, findings from this gene-based association testing may contain important signals relevant to cariogenesis and could complement those from the traditional SNP-based GWASs.
The Iowa Comprehensive Program to Investigate Craniofacial and Dental Anomalies. The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000095.v3.p1. The authors would like to thank the investigators, staff, and participants who contributed to the GENEVA program. This work is supported by the 4-VA Collaborative Research Grant from the state of Virginia and R01-DE024984.
Not applicable for authorized access to dbGaP Study Accession:
phs000095.v3.p1.
The authors have no conflicts of interest to declare.
X.W. and D.B. conceived and designed the study; Y.W. and X.W. implemented the procedure and conducted the data analysis; Y.W., J.S., and X.W. participated in data preparation, result organization, and discussion; Y.W., D.B., and X.W. wrote the manuscript. All authors discussed the results and commented on the manuscript.
Copyright © 2020 Boffin Access Limited.