(1) Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Alabama, USA
(2) Division of Biostatistics, Department of Pediatrics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
* Corresponding author Email: MLi@uams.edu
Introduction
Complex human diseases usually have multifactorial causes and may develop as a result of the collective effects of multiple genetic variants, complex gene-gene/gene-environment interactions, rare sequence variants, copy number alterations, epigenetic modifications, etc. Understanding the genetic aetiology of complex human diseases requires a comprehensive assessment of these causes. Recently, penalised regression methods have gained popularity in genetic research, aiming to detect genetic, epigenetic and environmental factors contributing to complex human diseases. In this article, we attempt to provide a brief overview of these methods in light of their applications in various contexts of genetic research.
Conclusion
These methods are built on the assumption that, given a genotype-phenotype association, the genetic similarity would contribute to the phenotype similarity and aggregate multiple rare and common variants through the genetic similarities between individuals.
Recent advances in genotyping and sequencing technology have enabled researchers to rapidly collect an enormous amount of high-dimensional genotype data throughout the entire genome[1]. The first-generation genome-wide association studies (GWAS) commonly test each variant separately. Although thousands of variants have been identified[2], for most complex diseases the currently identified variants explain only a small percentage of the disease heritability[3]. Recently, there has been an intensive effort dedicated to examining multiple variants in a single model for their association with complex diseases/traits. This multi-locus strategy has several advantages over the single-locus search. First, genetic variants are not all independent but form linkage disequilibrium (LD) blocks. Fitting a single model for multiple variants allows one to test the effect of one variant while controlling for the effects of others, increasing the power to detect weak signals by accounting for other causal effects and removing false signals by including a stronger causal association[4,5]. Second, analysing multiple variants simultaneously reduces the burden of multiple testing and thus improves the power to detect a causal association.
Penalised regression methods have become popular in genetic research as an attractive alternative to single-marker analysis. In a penalised regression framework, the genetic effects are shrunk by maximizing the log-likelihood function subject to a penalty term, which is a function of the coefficients indexed by one or more tuning parameters. The form of the penalty term determines the general behaviour of penalised methods. For example, various methods have utilised the least absolute shrinkage and selection operator (LASSO), which assumes that only a small number of genetic variants are causal to the disease phenotypes and allows simultaneous selection of causal variants and estimation of their effect sizes. The selection of causal variants is achieved by shrinking the effects of non-causal variants to zero and retaining only a small subset of genetic variants with non-zero effects[5]. On the other hand, ridge-based methods do not directly conduct model selection for causal variants. However, they are much easier to compute than lasso-based methods and enjoy stable estimation in the presence of multicollinearity. In practice, a variety of penalty terms have been proposed, among which the most popular are ridge, lasso, adaptive lasso, fused lasso, group lasso and elastic net (a mixture of lasso and ridge). The mathematical formulas of all these methods are summarised in Table 1. The tuning parameter(s) control the degree of penalisation, and the ‘best’ value of the tuning parameter can be selected by minimizing the Akaike information criterion (AIC) or the Bayesian information criterion (BIC), or by cross-validation[6]. Penalised regression methods are also closely related to Bayesian shrinkage methods, which achieve the same goal of variable selection by specifying shrinkage prior distributions (probability distributions that have high probability near zero) on the coefficients[7,8].
In Bayesian analysis, it is a standard procedure to fully explore the posterior distribution through Markov chain Monte Carlo (MCMC) simulations. Most of the coefficient estimates are expected to be close to zero through their posterior distributions[7].
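For orientation, the penalised objective and the penalties named above can be written out explicitly. The forms below are the standard textbook versions, given as a sketch; the notation (log-likelihood \ell, coefficients \beta_j for p variants, tuning parameters \lambda, \lambda_1, \lambda_2, \gamma, initial estimates \tilde{\beta}_j, group sizes p_g) is an assumption and may differ from the parameterisation used by the individual methods cited here.

```latex
\begin{align*}
\hat{\beta} &= \arg\max_{\beta}\ \bigl\{ \ell(\beta) - P_{\lambda}(\beta) \bigr\} \\[4pt]
\text{ridge:} \quad P_{\lambda}(\beta) &= \lambda \sum_{j=1}^{p} \beta_j^{2} \\
\text{lasso:} \quad P_{\lambda}(\beta) &= \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{adaptive lasso:} \quad P_{\lambda}(\beta) &= \lambda \sum_{j=1}^{p} w_j |\beta_j|, \qquad w_j = 1 / |\tilde{\beta}_j|^{\gamma} \\
\text{elastic net:} \quad P_{\lambda}(\beta) &= \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \\
\text{group lasso:} \quad P_{\lambda}(\beta) &= \lambda \sum_{g=1}^{G} \sqrt{p_g}\, \lVert \beta_g \rVert_2 \\
\text{fused lasso:} \quad P_{\lambda}(\beta) &= \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=2}^{p} |\beta_j - \beta_{j-1}| \\[4pt]
\text{Laplace prior:} \quad \pi(\beta_j \mid \lambda) &= \tfrac{\lambda}{2} \exp(-\lambda |\beta_j|)
\ \Longrightarrow\
\log \pi(\beta \mid \text{data}) = \ell(\beta) - \lambda \sum_{j=1}^{p} |\beta_j| + \text{const}
\end{align*}
```

The last line illustrates the link to Bayesian shrinkage: under independent double exponential (Laplace) priors, the posterior mode coincides with the lasso estimate, whereas MCMC-based analyses typically report posterior means or medians instead.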
Generally, these penalised regression methods are able to handle high-dimensional data, are robust to multicollinearity due to highly linked variants and reduce the burden of multiple testing. Penalised regression methods attempt to attain a parsimonious model comprising only a small number of variants that are most important for accurate disease prediction, stable effect estimation and easy interpretation. For example, Ayers and Cordell[4] compared various penalised regression methods and single variant analysis under various simulation scenarios and found that penalised methods usually outperform single-SNP analysis, preventing correlated variants from entering the model and producing a sparse model with causal variants. Because of these appealing features, penalised methods and their Bayesian counterparts have been applied extensively in a wide variety of topics in genetic research, such as analyses of multiple genetic variants, complex gene-gene/gene-environment interactions, rare sequence variants, copy number variants, epigenetic modifications, etc. In the following sections, we briefly review some of their successful applications in various contexts of genetic research.
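To make the idea of a sparse multi-variant model concrete, the following is a minimal sketch of the lasso fitted by cyclic coordinate descent on simulated genotypes. It is an illustration only, not the procedure of any study cited above: the simulation settings (200 individuals, 50 independent variants, two causal variants, `lam=0.2`) and the function names `soft_threshold` and `lasso_cd` are assumptions made for this example, and a quantitative trait with squared-error loss is assumed in place of a general log-likelihood.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    Assumes the columns of X and the response y are centred (no intercept).
    """
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0) / n      # per-variant mean sum of squares
    resid = y.copy()                       # current residual y - X beta
    for _ in range(n_iter):
        for j in range(p):
            if col_ss[j] == 0:             # monomorphic variant: leave at zero
                continue
            # covariance of variant j with the partial residual (excluding j)
            rho = X[:, j] @ resid / n + col_ss[j] * beta[j]
            new = soft_threshold(rho, lam) / col_ss[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

# Hypothetical simulation: genotypes coded 0/1/2 with minor allele frequency 0.3
rng = np.random.default_rng(0)
n, p = 200, 50
G = rng.binomial(2, 0.3, size=(n, p)).astype(float)
X = G - G.mean(axis=0)
true_beta = np.zeros(p)
true_beta[[3, 17]] = [1.0, -1.0]           # two causal variants (assumed)
y = X @ true_beta + rng.normal(0.0, 1.0, n)
y = y - y.mean()

beta_hat = lasso_cd(X, y, lam=0.2)
selected = np.flatnonzero(beta_hat)        # variants retained in the model
print("selected variants:", selected)
```

In practice the tuning parameter would be chosen by cross-validation or an information criterion rather than fixed in advance, and a logistic likelihood would replace the squared-error loss for case-control data.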
Penalised regression methods have been effectively applied for multi-locus analyses in both candidate gene-based and genome-wide association studies. In these applications, the joint effect of multiple variants is assumed to be additive without any interactions. For example, Li et al.[9] proposed a Bayesian lasso method for GWAS by assuming a double exponential prior distribution for the genetic effects. The method models each individual variant with an additive effect and a dominant effect, on which lasso penalties are imposed. The method used MCMC simulations to provide posterior median effect estimates for each variant, while adjusting for the effects of the other variants and covariates. Furthermore, based on posterior samples, a heritability value was estimated for each variant to guide the selection of variants contributing significantly to the phenotype[9]. Breheny and Huang[10] incorporated a grouping structure into the analysis and applied a group minimax concave penalty (MCP) method by grouping variants located in the same gene. The group MCP achieves variable selection at both the individual level and the group level. Therefore, it can not only identify important genes but also select important variants within those genes. Cho et al.[11] used an elastic net method to jointly analyse variants on a genome-wide scale. Each variant was assumed to have an additive effect. The elastic net penalty also takes advantage of regularization properties (i.e. automatic variable selection and stable estimation in the presence of multicollinearity). Both Bayesian lasso and adaptive lasso methods have also been applied to detect quantitative trait loci in plant and animal studies[12,13].
Gene-gene interaction, also termed “epistasis”, occurs when the effect of one genetic variant is influenced by the presence of others[8,14]. Accumulating evidence has suggested that genetic interactions exist pervasively in biological pathways[15] and are a major source accounting for the issue of “missing heritability” and the low replication power for the current positive findings[16,17]. Due to their capability of handling high-order interactions and differentiating interaction effects from main genetic effects, penalised regression methods are also widely used to detect epistatic interactions. Wu et al.[18] developed a two-stage lasso penalised logistic regression to handle genetic interactions in genome-wide association studies. In the first stage, the top variants with the most significant effects were identified. In the second stage, the two-way or higher-order interactions among the selected variants were examined. Yang et al.[19] proposed a group adaptive lasso method for GWAS analysis. All variants and their interactions were treated as multi-level factors, which were detected in a group manner.
A particular type of gene-gene interaction in maternal and prenatal research is the maternal-foetal genotype (MFG) interaction, which occurs when an MFG combination jointly alters the phenotype or risk of disease in the offspring. A well-known example of an MFG interaction is Rh incompatibility[20]. An Rh-negative mother may produce immune antibodies to the Rh antigens on the red blood cells of her Rh-positive foetus, causing Rh isoimmunization. Penalised regression methods have also been used to detect MFG interactions. Li et al.[21] defined a genetic conflict indicator that flags whether the baby has a different genotype from its mother. Ridge regression was used to address the collinearity between maternal and foetal genomes. Alternatively, Li et al.[22] used an adaptive lasso embedded within an EM algorithm (to simultaneously estimate phased haplotype probabilities) to detect haplotype-haplotype interactions between maternal and foetal genomes; the method also differentiates the genetic effects of maternal genotypes, foetal genotypes and MFG interactions.
Similar to epistasis, gene-environment interaction plays a crucial role in understanding the genetic basis of complex diseases. Application of penalised methods to detect gene-environment interaction is similar to that for detecting gene-gene interaction. The product terms between genetic variants and environmental factors can be incorporated in the model, and their effects are then estimated subject to a penalty term. Park and Hastie[23] used ridge-penalised logistic regression followed by a forward selection strategy to detect epistasis and gene-environment interactions. Tanck et al.[24] implemented penalised regression with a ridge penalty on main effects and a lasso penalty on epistasis and gene-environment (GXE) interaction effects. Therefore, all the main effects are always included in the model (a property of ridge), while irrelevant interactions are automatically removed (a property of lasso). Due to their capability of handling correlated variables, these methods can handle a large number of variants and their gene-gene (GXG) and GXE interactions.
Recent evidence has shown that rare variants, though individually rare, may have stronger effects and collectively have a significant impact on disease phenotypes[25,26]. Though rare variants have been suggested as a potential source of “missing heritability”[16], detecting rare sequence variants associated with complex diseases remains a challenge. A single rare variant contains little variation owing to its low minor allele frequency (<0.5% or 1%), so testing these variants individually lacks reasonable statistical power[5]. Many researchers have proposed ways of collapsing information across genes or across other regions so that the combined exposure becomes less rare[27,28,29]. A number of penalised regression methods have also been proposed to handle rare variants, most of which follow various collapsing strategies.
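The two-stage idea described above (screen variants first, then examine interactions only among the survivors) can be sketched as follows. This is a simplified stand-in rather than the method of Wu et al.: stage 1 here ranks variants by absolute marginal correlation instead of a lasso-penalised logistic fit, only two-way products are formed, and the function names `screen_top_variants` and `pairwise_interactions`, the data sizes and the simulated trait are all hypothetical.

```python
import numpy as np
from itertools import combinations

def screen_top_variants(G, y, k):
    """Stage 1 (simplified): keep the k variants with the largest absolute
    marginal correlation with the trait y."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Gc ** 2).sum(axis=0) * (yc ** 2).sum())
    # monomorphic variants get a score of zero instead of a division by zero
    score = np.abs(Gc.T @ yc) / np.where(denom > 0, denom, np.inf)
    return np.sort(np.argsort(score)[-k:])

def pairwise_interactions(G, kept):
    """Stage 2 design: main effects of the kept variants plus all two-way
    product terms among them."""
    pairs = list(combinations(kept, 2))
    main = G[:, kept]
    if not pairs:                          # fewer than two variants kept
        return main, pairs
    inter = np.column_stack([G[:, a] * G[:, b] for a, b in pairs])
    return np.hstack([main, inter]), pairs

# Hypothetical data: 100 individuals, 30 variants, one interacting pair
rng = np.random.default_rng(2)
G = rng.binomial(2, 0.3, size=(100, 30)).astype(float)
y = G[:, 2] * G[:, 7] + rng.normal(0.0, 1.0, 100)

kept = screen_top_variants(G, y, k=5)
design, pairs = pairwise_interactions(G, kept)
print(design.shape)   # 5 main effects + 10 two-way products
```

The resulting design matrix would then be passed to a penalised fit, so that the number of candidate interaction terms grows with the square of the number of screened variants rather than of all variants.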
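As an illustration of why ridge regression suits correlated maternal and foetal genotypes, the sketch below simulates a single locus, codes a conflict indicator and computes the closed-form ridge estimate. It is not the model of Li et al.: the conflict coding (1 when the two genotypes differ), the simulated effect size, the penalty values and the helper name `ridge_fit` are assumptions introduced for this example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge estimate (X'X + lam I)^{-1} X'y for centred data."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
n, maf = 500, 0.3

# One locus: the foetus receives one maternal allele and one paternal allele,
# so maternal and foetal genotypes are correlated by construction.
mat_alleles = rng.binomial(1, maf, size=(n, 2))
transmitted = mat_alleles[np.arange(n), rng.integers(0, 2, n)]
paternal = rng.binomial(1, maf, n)
g_mother = mat_alleles.sum(axis=1)
g_foetus = transmitted + paternal

# Hypothetical coding: conflict = 1 when mother and foetus genotypes differ
conflict = (g_mother != g_foetus).astype(float)

X = np.column_stack([g_mother, g_foetus, conflict]).astype(float)
X -= X.mean(axis=0)
y = 0.5 * conflict + rng.normal(0.0, 1.0, n)   # assumed conflict effect
y -= y.mean()

beta_small = ridge_fit(X, y, 1e-8)    # essentially unpenalised
beta_ridge = ridge_fit(X, y, 50.0)    # shrunk, more stable estimates
print(np.round(beta_small, 3), np.round(beta_ridge, 3))
```

Increasing the penalty shrinks the coefficient vector towards zero, which stabilises the estimates when the maternal and foetal columns carry overlapping information.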
Recently, Zhou et al.[30] used a mixture of group lasso and lasso penalties to jointly test common and rare variants for association with disease phenotypes. The method collapsed rare variants with minor allele frequencies less than 1%, and introduced a group structure for all variants by genes and pathways. It was suggested that using a mixture of group lasso and lasso penalties outperformed using lasso penalties alone, especially when both common and rare variants are present. More recently, Ayers and Cordell[5] developed a method that groups SNPs by genes and collapses the rare variants in the gene into a single “super” variant. The penalty term imposed in their method allows an individual regression coefficient to be estimated for each common variant, effectively allowing individual common variants to be selected, while the grouping penalty allows a borrowing of strength between common and rare variants within the same gene. When applied to real and simulated datasets, their approach showed improved performance compared to its predecessors.
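The collapsing step can be sketched in a few lines. The version below assumes one common coding, a carrier indicator that equals 1 when an individual carries at least one rare allele in the gene; the exact coding used in the methods above is not specified here, and the function name `collapse_rare` and the toy genotype matrix are hypothetical.

```python
import numpy as np

def collapse_rare(G, maf_threshold=0.01):
    """Collapse the rare variants of one gene into a single 'super' variant.

    G is an n x p genotype matrix coded as minor allele counts (0/1/2).
    Returns the common-variant columns, the carrier indicator for the
    collapsed rare variants, and the boolean mask of rare columns.
    """
    maf = G.mean(axis=0) / 2.0                 # allele frequency per variant
    rare = maf < maf_threshold
    super_variant = (G[:, rare].sum(axis=1) > 0).astype(float)
    return G[:, ~rare], super_variant, rare

# Toy gene with 200 individuals: one common and two rare variants
G = np.zeros((200, 3))
G[:100, 0] = 1          # common variant, frequency 0.25
G[0, 1] = 1             # rare variant, frequency 0.0025
G[[0, 5], 2] = 1        # rare variant, frequency 0.005

common, super_variant, rare = collapse_rare(G)
print(rare, int(super_variant.sum()))
```

The common-variant columns keep their own coefficients in the penalised model, while the collapsed indicator enters as one additional column for the gene.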
In the past few years, solid evidence has shown that structural variations, due to insertions, deletions and inversions of the DNA, also contribute considerably to the diversity of the human genome[31,32,33]. These structural changes cause copy number differences in particular genomic regions, ranging from one kilobase to a complete chromosome arm. Copy number variants (CNVs) may contribute considerably to the development of complex diseases, such as cancers[34], and are a major source of the “missing heritability” of complex human diseases[16].
Penalised methods have also been proposed to detect CNVs. Huang et al.[35] proposed a least squares regression model that penalises the difference between the relative copy numbers of neighbouring markers. With this lasso-type penalty, the change points of CNVs can be detected. Gao et al.[36] later considered the sparsity of CNVs in the genome and proposed a robust penalised least absolute deviation (LAD) regression model with the adaptive fused lasso penalty. The method was shown to be robust to outliers and correctly detected the numbers and locations of the true breakpoints.
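The two formulations can be contrasted in a generic form. The expressions below are a sketch under assumed notation (y_i is the observed relative copy number at marker i of m ordered markers, \beta_i the underlying true value, w_i and v_i adaptive weights); the exact objective functions and weighting schemes of the cited methods may differ, and the sparsity term may be absent from the least squares version.

```latex
\begin{align*}
\text{least squares, fused penalty:} \quad
\hat{\beta} &= \arg\min_{\beta}\ \sum_{i=1}^{m} (y_i - \beta_i)^2
  + \lambda_1 \sum_{i=1}^{m} |\beta_i|
  + \lambda_2 \sum_{i=2}^{m} |\beta_i - \beta_{i-1}| \\
\text{LAD, adaptive fused penalty:} \quad
\hat{\beta} &= \arg\min_{\beta}\ \sum_{i=1}^{m} |y_i - \beta_i|
  + \lambda_1 \sum_{i=1}^{m} w_i |\beta_i|
  + \lambda_2 \sum_{i=2}^{m} v_i |\beta_i - \beta_{i-1}|
\end{align*}
```

The difference penalty forces neighbouring markers to share the same fitted value except at a few positions, which are the estimated breakpoints; replacing the squared loss with the absolute loss limits the influence of outlying intensity measurements.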
Epigenetics refers to modifications of DNA or associated proteins, other than DNA sequence variation itself, that carry information content during cell division[37]. Two major molecular mechanisms for epigenetic inheritance are DNA methylation and histone modification, both of which may lead to heritable changes in gene expression or cellular phenotypes[38]. Epigenetic alterations have long been linked to complex human diseases, such as cancers[37], disorders of genomic imprinting[39], neuropsychiatric diseases[40], autism[41], etc. Detecting epigenetic alterations contributing to complex human diseases would also help to account for the issue of “missing heritability”[16].
Penalised regression methods are also used to investigate DNA methylation data. Sun and Wang[42] applied a penalised conditional logistic regression model for matched case-control studies. The method used a network-based penalty to favour selection of Cytosine-phosphate-Guanine (CpG) sites within a gene or genetic pathway. Liu et al.[43] applied a bridge-penalised logistic regression method. Compared with the lasso, which usually selects independent variants, the proposed sparse logistic regression was able to select highly correlated variants simultaneously. Application of the method to methylation data selected 6 out of 7 CpG regions, which are known to be predictive of lung cancer subtype.
With the advent of the genomic era, it is now feasible to investigate the influence of the entire spectrum of human genetic variations on complex human diseases. Quite often we need to examine a number of genetic variants far exceeding the number of individuals in the study population. Traditional regression-based methods are overwhelmed when all variants are considered jointly. Over the years, penalised methods have emerged as a powerful tool in genetic research, covering a wide variety of topics, such as multi-variant analysis, gene-gene/gene-environment interactions, rare sequence variants, copy number variants, epigenetics, etc. These research areas represent various sources that may account for the “missing heritability” of complex human diseases. Penalised regression methods have shown a number of advantages, such as facilitating model selection in high-dimensional data analysis, achieving stable effect estimation in the presence of multicollinearity, and reducing the burden of multiple testing. Applications of penalised regression methods in various research topics have also successfully identified genetic/epigenetic/copy number variants associated with complex human diseases, differentiated causal variants from non-causal ones, and estimated their effects while adjusting for the effects of others. In this article, we attempt to provide a survey of the application of penalised regression methods in various contexts of genetic research. The application of penalised methods is not limited to these topics. Adaptation of these methods to transcriptomics, proteomics and metabolomics data can be straightforward and has been investigated[44,45,46].
It should also be noted that penalised regression methods have a few limitations. First, despite their success in model selection, their performance has been unsatisfactory in hypothesis testing and interval estimation[47]. In particular, in genetic studies with a large number of variants, lasso and related methods may produce false positive results[47]. To overcome this limitation, one might combine multiple test statistics by averaging over multiple tuning parameters, rather than building a single ‘best’ model[47]. Second, although penalised methods provide robust performance for detecting causal variants under various disease models[48], they might not be the ‘best’ in all situations. Other non-penalised approaches can be more advantageous depending on the underlying mechanism and allele frequency of the disease model[49]. For example, compared to the lasso and group lasso methods, the multifactor-dimensionality reduction (MDR) method was shown to have higher power to detect pure epistatic interactions among common variants[49]. Therefore, in practice, the optimal performance of these methods is both model- and context-dependent. Third, although penalised methods have been applied to various research areas, their performance requires further improvement, especially for detecting rare sequence variants. The currently available penalised methods are limited by their strategy of collapsing rare variants, because a region may contain (i) both disease-causing and disease-protective variants and (ii) both functional and non-functional variants. There is considerable scope for improvement and further assessment in association testing with rare variants[50]. Recently, a number of similarity-based methods, such as SIMreg and SKAT, have been shown to be robust to the bi-directionality of genetic effects in sequencing data analysis[51,52,53,54,55].
These methods are built on the assumption that, given a genotype-phenotype association, the genetic similarity would contribute to the phenotype similarity and aggregate multiple rare variants and common variants through the genetic similarities between individuals. It would be interesting to see if penalised methods can be incorporated to conduct selection among the similarity of genes, environmental factors and their interactions.
This work is supported in part by a cooperative agreement grant U01 NS041588 from the National Institute of Neurological Disorders and Stroke, National Institutes of Health and Department of Health and Human Services.
GWAS, genome-wide association studies; LD, linkage disequilibrium; LASSO, least absolute shrinkage and selection operator; AIC, Akaike information criterion; BIC, Bayesian information criterion; MCMC, Markov chain Monte Carlo; MCP, minimax concave penalty; MFG, maternal-foetal genotype; CpG, Cytosine-phosphate-Guanine; MDR, multifactor-dimensionality reduction.
All authors contributed to the conception, design, and preparation of the manuscript, as well as read and approved the final manuscript.
None declared.
None declared.
All authors abide by the Association for Medical Ethics (AME) ethical rules of disclosure.
Table 1 Different penalty functions (Assuming
Penalised method | Tuning parameters | Penalty function |
---|---|---|
Lasso | | |
Ridge | | |
Bridge | | |
Adaptive Lasso | | |
Elastic Net | | |
SCAD | | |
MCP | | |
Group Lasso | | |
Fused Lasso | | |