Boffin Access – Peer-Reviewed Journals | Global Open Access Publisher

Gene-Based Association Mapping for Dental Caries in The GENEVA Consortium

Yueyao Wang¹, Dipankar Bandyopadhyay^2*, John R. Shaffer³, Xiaowei Wu¹

¹Department of Statistics, Virginia Polytechnic Institute & State University, Blacksburg, Vatican City State
² Department of Biostatistics, Virginia Commonwealth University, Richmond, Vatican City State
³ Department of Human Genetics, University of Pittsburgh, Pittsburgh, Panama

Article Info

Corresponding author details:

Dipankar Bandyopadhyay
Department of Biostatistics School of Medicine
Virginia Commonwealth University
Richmond,Vatican City State

DOI

:10.31021/jddm.20203156

Article Type

:Research Article

Manuscript ID

:JDDM-3-156

Publisher

:Boffin Access Limited.

Journal Type

:Open Access

Volume

Issue

Received Date: 06 April, 2020

Accepted Date: 13 April, 2020

Published Date: 15 April, 2020

CitationCitation COPIED

Wang Y, Bandyopadhyay D, Shaffer JR, Wu X. Gene-Based Association Mapping for Dental Caries in The GENEVA Consortium. J Dents Dent Med. 2020 Apr;3(4):156.

Copyright: © 2020 Wang Y, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 international License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Objective: Dental caries is a multifactorial disease with high prevalence in both children and adults. Recent genome-wide association studies (GWASs) have revealed that genetic factors play an important role in caries incidence. However, existing methods are not sufficient to identify caries-associated genes, due to the complex correlation structure of caries GWAS data, and lack of appropriate summarization at the gene level. This paper attempts to address that by analyzing data from the Gene, Environment Association Studies (GENEVA) consortium.

Methods: We investigated gene-based genetic associations for dental caries based on genome-wide data derived from the GENEVA database, with adjustment to covariates, linkage disequilibrium among single-nucleotide polymorphisms, and family relations, in sampled individuals.
Results: Several suggestive genes were identified, in which some of them have been previously found to have potential biological functions on cariogenesis.
Conclusions: By comparing the gene sets identified from gene-based and SNP-based association testing methods, we found a non-negligible overlap, which indicates that our gene-based analysis can provide substantial supplement to the traditional GWAS analysis.

Keywords

Adaptive-Weight Burden Test; Dental Caries; GWAS; GENEVA consortium

Introduction

Dental caries is a chronic, transmissible disease caused by activities of bacteria [1]. Among the common disorders in humans, caries has the highest prevalence, affecting 2.43 billion people worldwide in their permanent teeth and 620 million children in their primary teeth [2]. It is known that multiple factors contribute to a person’s risk for caries development and progression. Besides environmental factors (diet, oral hygiene, etc) and host factors related to subjects’ oral conditions, genetic factors have been shown to play an important role in caries etiology [3-5].

Analyzing genome-wide data generated from epidemiological studies of dental health opens up cost-effective opportunities to uncover how genetic factors affect inherited risks for dental caries. Previous genome-wide association studies (GWASs) for dental caries have identified several genes that may be associated with caries heritability [6], including enamel formation genes such as AMBN, AMELX, ENAM [7-9], taste receptor genes such as TAS2R38, TAS1R2 [10,11], and genes related to immunity such as HLA [12,13] and saliva such as PRH1 [14]. Although considerable efforts have been made to capture GWAS signals for dental caries, they are often limited to marginal association between caries traits and each individual single nucleotide polymorphism (SNP), hence may have limited power at the gene level (especially when evaluating the genetic effect of low-frequency minor alleles). Moreover, due to genotyping differences in different GWASs, findings from such analyses may exhibit inconsistencies, which pose difficulties for biological interpretations. On the other hand, SNP-set-based or gene-based association analyses [15-19] are receiving increasing attention since genes are the functional unit of the human genome and remain highly consistent across diverse human populations. However, most of these gene-based methods are not sufficient to account for linkage disequilibrium (LD) among SNPs, i.e., the non-random association of alleles at different loci. In fact, since caries GWASs are often conducted on related individuals, e.g., family trios or pedigree samples, it is critical to model the complex genotypic correlations caused by both LD and familial relation in order to improve the power of association testing.

In this paper, we propose to use a novel gene-based association mapping strategy to identify caries-associated genes based on genome-wide data from the GENE enVironment Association studies (GENEVA) consortium. Several suggestive genes were identified, in which some (such as PTPRD) have been previously found in other single-SNPbased GWASs to play plausible biological roles for dental caries. An interesting finding was obtained by comparing the gene sets identified separately from gene-based and SNP-based tests. The nonnegligible overlap between these two sets suggests that the two types of analyses are not independent of each other and therefore results from gene-based association analyses may contain important signals relevant to cariogenesis, complementing those from the traditional single-SNP-based GWASs.

Materials and Methods

Dental caries data from the GENEVA program

Initiated in 2006, the GENEVA program aims to enhance the benefit of collaborative work to further maximize knowledge obtainable through GWAS [20]. One participating study (dbGaP accession: phs000095.v3.p1) of this program collected dental caries data for 5,291 phenotyped and 4,020 genotyped individuals (including cohort of families and unrelated individuals). The major phenotype variable “Tot_D1MFT” describes the status of subjects’ teeth, which was derived by counting all teeth with caries (including white spots). Specifically, a tooth is recorded as a D if a carious lesion(s) or both carious lesion(s) and a restoration are present; M if a tooth has been extracted due to caries; and F if a permanent or temporary filling is present. In addition to the tooth-level caries index, another index at tooth surface level, “Tot_D1MFS”, is also generated in a similar fashion. Eight sociodemographic and environmental variables are collected as covariates, including sex, age at the time of dental exam, source of water (city/public, well and other), tooth brushing frequency, saliva flow rate (calculated as the volume of saliva divided by 3 minutes: in unit ml/min); home water source fluoride level (obtain from home tap water: in unit ppm), educational attainment (up to high school, some college, four-year degree or beyond), presence of oral S. mutans present (yes, no). The genotype data were collected for a total of 589,735 SNPs on an Illumina BeadChip, among which 289,629 (49.11%) are located in 26,449 entrez genes (according to the human gene Ensembl dataset GRCh37). Based on the cleaned genotype data, a total of 20,197 (76.36%) entrez genes were found to contain more than one SNP in the genotype data.

As part of the quality-control procedure, the genotyped individuals were further filtered by excluding those who met either of the following two criteria:

Empirical self-kinship coefficient ii ${\tilde{ϕ}}_{i i} >$ 0.525, i.e., empirical inbreeding coefficient ${\tilde{h}}_{i} \geq$ 0.05, or
Completeness ≤ 96%, where completeness is defined as the proportion of SNPs for which a given individual had genotypes called. Under these criteria, a subset of 652 samples from the GENEVA dental caries data, which have complete data in both phenotypes and genotypes, were selected for the analysis.

Among these 652 participants, 571 are from 201 families (nuclear families or parent-offspring trios), and the rest are unrelated individuals. Quality control on SNPs was conducted based on the following conditions:
i. Call rate ≥ 96%, and
ii. Minor allele frequency (MAF) ≥ 1%.

To perform the gene-based test, the entire genotype matrix (rows corresponding to individuals and columns corresponding to SNPs) was split according to the starting and ending positions of each gene.

Each partition of the genotype matrix was further cleaned in order to avoid the possible non-singular problem regarding the following aspects: (1) columns corresponding non-polymorphic SNPs were excluded, and (2) given limited sample size, only columns with unique SNP data were retained whereas duplicated ones were eliminated. These quality assurance steps resulted in 26,831 genes in consideration.

Statistical approach

The adaptive-weight burden test (ABT) is a retrospective, mixed model test for genetic association of quantitative traits on genotype data with complex correlations. In the GENEVA dental caries study, a group of individuals with known pedigree information were sampled for phenotype (the DMFT index), covariates (sex, age, etc), and genotypes (multiple SNPs in pre-defined gene regions). Under the gene-based testing setup, we denote the phenotype vector by Y, covariate matrix by Z, and genotype matrix by G (for each gene, size n x m. Then, the ABT analysis is conducted on a retrospective paradigm G|(Y,Z), i.e., by treating genotypes as random response in regression. The main advantage of such a retrospective model lies in its ability to directly model genotypic correlations across both SNPs (caused by LD) and samples (due to familial relation) [21], as well as to maintain robustness to phenotype model misspecification [22].

It has been shown that the ABT statistic
$S_{ABT} = \frac{V^{T} G {(\hat{D} R \hat{D})}^{- 1} G^{T} V}{V^{T} Φ V}$
where $V = {\hat{Σ}}_{0}^{- 1} ((Y - Z \hat{β_{0}})$ ) is the transformed phenotypic residual under the null hypothesis of no genetic association, and the phenotype covariance matrix $Σ_{0} = σ_{e}^{2} I + σ_{a}^{2} Φ$ . Here, Φ is the kinship matrix of the sampled individuals, and, $σ_{e}^{2}, σ_{a}^{2}$ stand for variance due to random measurement error and variance attributed to additive polygenic random effects, respectively. The covariance matrix of multiple SNPs in a pre-defined gene region is denoted as DRD, where R is the LD correlation matrix and D = diag {σ_j}, 1≤ j≤m is a diagonal matrix of the standard deviations of the SNPs in genotype G. ABT adopts adaptive weights to collapse multiple SNPs in each pre-defined gene region to improve power of association test ing. Its alternative view is a kernel test with the generalized Madsen–Browning ${({W=(V}^{T} Φ V))}^{- 1 / 2} {(\hat{(D} R \hat{D)})}^{- 1 / 2}$ . The null distribution of S_A_B_T can be explicitly derived as a mixture of $X_{1}^{2}$ random variables each obtained from the eigen value of the matrix, WG^TPGW , where P = ${\hat{Σ}}_{0}^{- 1} - {\hat{Σ}}_{0}^{- 1} Z({(Z^{T} {\hat{Σ}}_{0}^{- 1} Z))}^{- 1} Z^{T} {\hat{Σ}}_{0}^{- 1}$

In order to justify the results from the gene-based association analysis, we compared the gene set identified by gene-based test with that by SNP-based test. For consistency reasons, we chose to use MASTOR as the SNP-based test. MASTOR is a mixed-model, retrospective score test for genetic association with quantitative traits in samples with related individuals [22]. When using such a single-SNP test, we consider a gene to be significantly associated with the DMFT trait if there exists at least one SNP in that gene region with p-value less than the Bonferroni corrected nominal α=0.05/(#SNP in the gene), after adjusting for non-genetic covariates. Since all other settings for this study, including the samples, covariates, phenotypes/ genotypes, and the retrospective analytical strategy, keep unchanged from the gene-based analysis, the gene sets identified by ABT and by MASTOR should be comparable. A simple chi-square test for independence can then be conducted with a 2×2 contingency table formed with the significant/insignificant number of genes identified by these two methods.

Results

The GENEVA dental caries data (dbGaP accession: phs000095. v3.p1) include 5,291 phenotyped and 4,020 genotyped individuals. In this study, we focus on a subset of 652 participants who have complete data in the following phenotypic characteristics: gender, age, education group, water source, presence/absence of S. mutans, home tap water fluoride level, saliva flow, brush frequency, and the Decay- Missing-Filled (DMF) index. These characteristics are summarized in Table 1.

We note that, the 652 participants include both children and adults. The age ranges from 7 to 61 years, and 406 out of the total 652 are older than 18 at the time of examination. The majority of these participants are white (648 white, two multi or bi-racial, and two with missing values), therefore race was not included as a covariate in our study.

We use the adaptive-weight burden test [15] to perform genebased association testing for the DMFT index in the 652 participants, adjusting for the above eight non-genetic covariates and properly addressing various types of genotypic correlations caused by both LD and familial relation. This test first obtains the transformed phenotypic residual from the phenotype model under the null hypothesis, and then collapses the genotypes of multiple SNPs in each gene region by using data-adaptive weights to achieve a powerful retrospective, gene-based test (details are provided in Statistical approach).

In the step of null phenotype model identification, two variancecomponent parameters and 10 fixed-effect regression coefficients for the non-genetic covariates (two indicators for water source categories) are estimated by the maximum likelihood method. These estimates are shown in Table 2. In covariance estimation, we notice that the estimated value of $σ_{a}^{2}$ /σ², i.e., the narrow-sense heritability, is about 0.44, which is comparable with traditional heritability estimates of DMFT (or DMFS) in the permanent dentition of other family-based dental caries GWASs [23-25]. In covariates effects estimation, we see that at nominal level 0.05, three covariates sex, age, and S. mutans are significantly associated with the DMFT trait with positive coefficient estimates, indicating that patients with male gender, younger age, and absence of S. mutans are more likely to have lower DMFT measurements. The brush frequency, while expected to be positively associated with dental caries, seems to have a “boundary” effect, which turns out to be significant at nominal level 0.1 but not at 0.05.

In this gene-based association analysis, the Manhattan plot and Quantile-Quantile (Q-Q) plot for a total number of 26,831 entrez genes on the human genome are shown in Figure 1 & 2, respectively. In Figure 2, the resulting genomic inflation factor λ is reported as 1.05 which is generally considered benign [26], suggesting that the genebased p-values did not show substantial departure from the uniform distribution. Therefore, the extent of inflation due to population stratification or other confounders is negligible.

Table 3 lists the 10 top genes according to ranked ABT p-values for the GENEVA dental caries data. Among these 10 genes, PTPRD (MIM: 601598) has been recently reported to be associated with smooth and pit-and-fissure surface caries in the primary dentition in children by a single-SNP based GWAS (reported SNP: rs10958998, intronic, [27]). The PTPRD gene encodes a member of the PTP (protein tyrosine phosphatase) family which is known to be signaling molecules that regulate a variety of cellular processes including cell growth, differentiation, mitotic cycle, and oncogenic transformation.

Our study also identified five other genes that were reported in the GWAS catalogue [28] to be relevant to cariogenesis, namely FHIT (MIM: 601153), CNTN4 (MIM: 607280), CTNNA3 (MIM: 607667), IL17D (MIM: 607587), and CELF2 (MIM: 602538). The reported SNPs for these five genes are: rs9311745 (intron variant, [29]), rs17013524 (intron variant, [27]), rs2441755 (intron variant), rs735539 (intron variant), and rs11256676 (intergenic variant, [30]), respectively. The protein encoded by FHIT is a P1-P3-bis (5’-adenosyl) triphosphate hydrolase involved in purine metabolism. This gene encompasses the common fragile site FRA3B on chromosome 3, where carcinogen-induced damage can lead to translocations and aberrant transcripts. CNTN4 encodes a member of the contact in family of immunoglobulins. The encoded protein may play a role in the formation of axon connections in the developing nervous system. The CTNNA3 gene encodes a protein that belongs to the vinculin/ alpha-catenin family and plays a role in cell-cell adhesion in muscle cells. IL17D encodes a cytokine that shares the sequence similaritywith IL17. The treatment of endothelial cells with this cytokine has been shown to stimulate the production of other cytokines including IL6, IL8, and CSF2/GM-CSF.

The comparison of gene-based test and SNP-based test shows that, out of the total 26,831 genes, 1,053 exhibit significance by MASTOR, slightly less than the 1,468 genes found by ABT, and the intersection set, i.e., identified by both MASTOR and ABT, contains 688 genes. The contingency table is shown in Table 4, which gives a very significant p-value (< 2.2 × 10⁻¹⁶), indicating that the results from these two types of tests are not independent.

Table 1: Sample Characteristics in the GENEVA Data

Education group has 3 ordinal categories: (i) up to high school; (ii) some college; and (iii) four-year degree or beyond.
DMFT: derived by counting all teeth (including both primary and permanent) that is decayed, missing, and filled.
DMFS: derived by counting all surfaces (including both primary and permanent) those are decayed, missing, and filled.

Table 2: Estimated Covariance and Covariate Effects in the GENEVA Data.

Note:*σ²=

σ_{a}^{2} + σ_{e}^{2}

†: Brush frequency coded as: 1= more than 3 times per day, 2 = 3 times per day, 3 = 2 times per day, 4 = once a day, 5 = less than once a day. In the final samples, category 1 has zero observations, so category 2 is treated as the reference group. §: Education level is coded as: 1 = up to high school, 2 = some college, 3 = four-year degree or beyond.

Table 3: Strongest Association Signals in the GENEVA Data. DMFTassociated genes, with top 10 ABT ranked p-values are reported. Underlined genes have been previously identified to be associated with dental caries. MIM numbers of genes not mentioned in the text: CSMD1 (MIM: 608397), RBFOX1 (MIM: 605104), and CLRN1 (MIM: 606397).

Table 4: Comparison of Gene Sets Identified by Gene- and SNP-based Tests

Figure 1: Manhattan Plot of Gene-based Association Testing P-values for GENEVA DMFT Data.

Figure 2: QQ Uniform Plot of Gene-based Association Testing P-values for GENEVA DMFT Data.

Discussion

Traditional GWASs rely on screening the genome on the basis of SNPs. Though such a SNP-based association testing strategy has been shown successful in identifying susceptibility loci for several complex genetic diseases [31,32], challenges in GWASs still exist: First, SNP-based testing detects only marginal effects and may be underpowered to evaluate rare-variant effects due to their low allele frequencies. Second, since SNP-based testing only reports significant SNPs, identification of genes is usually ad hoc, depending on the relative location (intron, intergenic, noncoding, up/downstream, regulatory region, etc) of the identified SNPs to target genes, and interpretation of genetic effects at the gene level remains elusive. In contrast, gene-based testing assesses joint effects of multiple variants in a predefined gene region. Compared with SNP-based testing, genebased testing has several appealing features. First, shifting the testing unit from SNP to gene generates more interpretable and replicable findings in gene function [33] and gene-gene interaction [34]. Second, by aggregating small signals from each SNP variant, genebased testing usually achieves improved power, especially for lowfrequency minor alleles [35]. Third, chance findings due to multiple testing will be reduced by using gene-wide instead of genome-wide significance level [10]. Finally, gene-based testing also lends itself to meta-analysis of combined data from multiple studies [36]. For these reasons, gene-based association testing is believed to be a natural approach for association analysis in the post-GWAS era of dense genotyping and fine mapping [37].

In this study, we performed gene-based association testing for dental caries data from the GENEVA consortium, with adjustment of phenotype covariates and accounting for LD in SNPs and familial relation in samples. We observed suggestive associations between the DMFT trait and several genes, some of which have been found to have plausible biological functions relevant to cariogenesis. Three nongenetic covariates sex, age, and S. mutans were found significantly associated with the DMFT trait, and the narrow- sense heritability was found comparable with traditional heritability estimates in previous family-based dental caries GWASs. Due to the differences between study designs and testing methods, it is not reasonable to compare the gene set identified by this study with that by other SNPbased GWASs. However, we attempted to compare the results from both gene- and SNP-based association testing on the basis of the GENEVA dental caries data. The comparison revealed that the genebased test captured 65.34% genes that are significant in the SNPbased test. A further test for independence illustrated that, though gene-based association testing has a totally different mechanism, the identified genes are significantly overlapped with those by SNP-based association testing, given that two tests are performed on the same GENEVA data. Therefore, findings from this gene-based association testing may contain important signals relevant to cariogenesis and could complement those from the traditional SNP-based GWASs.

Acknowledgement and Funding

This genome-wide association study is part of the Gene Environment Association Studies (GENEVA) program of the transNIH Genes, Environment and Health Initiative (GEI). Genotyping services were provided by the Center for Inherited Disease Research (CIDR). Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center and by the National Center for Biotechnology Information (NCBI). Data and samples were provided by:

i. The Center for Oral Health Research in Appalachia;

ii. The University of Pittsburgh School of Dental Medicine (SDM) DNA Bank and Research Registry, supported by the SDM and by the University of Pittsburgh Clinical and Translational Sciences Institute;

iii. The Iowa Fluoride Study and the Iowa Bone Development Study; and

The Iowa Comprehensive Program to Investigate Craniofacial and Dental Anomalies. The datasets used for the analyses described in this manuscript were obtained from dbGaP at https://www.ncbi. nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000095. v3.p1. The authors would like to thank the investigators, staff, and participants who contributed to the GENEVA program. This work is supported by the 4-VA Collaborative Research Grant from the state of Virginia and R01-DE024984.

Statement of Ethics

Not applicable for authorized access to dbGaP Study Accession: phs000095.v3.p1.

Disclosure Statement

The authors have no conflicts of interest to declare.

Author Contributions

X.W. and D.B. conceived and designed the study; Y.W. and X.W. implemented procedure and conducted data analysis; Y.W., J.S., and X.W. participated in data preparation, result organization and discussion; Y.W., D.B., and X.W. wrote the manuscript. All authors discussed the results and commented on the manuscript.

JOURNAL OF DENTISTRY AND DENTAL MEDICINE (ISSN:2517-7389)

Gene-Based Association Mapping for Dental Caries in The GENEVA Consortium

Yueyao Wang¹, Dipankar Bandyopadhyay^2*, John R. Shaffer³, Xiaowei Wu¹

Article Info

CitationCitation COPIED

Abstract

Keywords

Introduction

Materials and Methods

Results

Discussion

Acknowledgement and Funding

Statement of Ethics

Disclosure Statement

Author Contributions

References

JOURNAL OF DENTISTRY AND DENTAL MEDICINE (ISSN:2517-7389)

Gene-Based Association Mapping for Dental Caries in The GENEVA Consortium

Yueyao Wang1, Dipankar Bandyopadhyay2*, John R. Shaffer3, Xiaowei Wu1

Article Info

CitationCitation COPIED

Abstract

Keywords

Introduction

Materials and Methods

Results

Discussion

Acknowledgement and Funding

Statement of Ethics

Disclosure Statement

Author Contributions

References

Yueyao Wang¹, Dipankar Bandyopadhyay^2*, John R. Shaffer³, Xiaowei Wu¹