Department of Electrical and Computer Engineering, University of Miami, Coral Gables, FL, USA
*Corresponding author Email: firstname.lastname@example.org
The Mendelian theory of inheritance explains qualitative traits that are controlled by a single genetic locus, while modern quantitative genetic theory states that complex traits are influenced by many factors, including the main effects of many quantitative trait loci, epistatic effects involving more than one quantitative trait locus, environmental effects and the effects of gene-environment interactions. Quantitative trait locus mapping identifies important quantitative trait loci that reveal the genetic basis of complex traits, serve as a map for functional gene cloning, and assist breeding line selection. The aim of this study was to discuss quantitative trait locus mapping methods and the genomic selection that results from quantitative trait locus mapping.
We introduced different breeding populations, genetic models and statistical methods for quantitative trait locus mapping. Genetic background, model principles, and computational algorithms and techniques were reviewed. We specifically discussed whole-genome quantitative trait locus mapping, which simultaneously estimates the genetic effects associated with markers across the entire genome. The whole-genome approach avoids the evaluation of multiple models and model selection while enabling genomic selection, which predicts genetic merit for a quantitative trait. It has drawn great attention in the research community and will be an effective tool in breeding line selection.
While traditional quantitative trait locus mapping was designed around the marker density and modelling capability available over the past two decades, whole-genome quantitative trait locus mapping is the result of advances in generating high-density molecular markers and developments in high-dimensional sparse modelling algorithms. Together, whole-genome quantitative trait locus mapping and genomic selection lead to a systematic genetics that will increase genetic gain and revolutionise crop and livestock breeding.
The genotype of an individual organism is the set of unordered allele pairs at one or more loci. As a result of genetically programmed biological development, individual organisms exhibit observable characteristics called phenotypes or traits, which may be quantitative, such as height and weight, or qualitative, such as gender and disease status. Understanding the genotype/phenotype relationship is of paramount importance for both scientific research and the economy. For example, over the last two decades the use of detected quantitative trait loci (QTLs) has proved to be an effective tool in animal and plant breeding for increasing food production, resistance to diseases and pests, and tolerance to heat, cold and drought, and for improving nutrient content.
The Mendelian theory of inheritance explains qualitative traits that are controlled by a single genetic locus. If complex traits in animal/plant breeding or human diseases are controlled by many genetic loci, the individual effect of each locus can hardly be distinguished. Quantitative inheritance theory instead assumes that complex traits result from multiple genetic factors, gene-gene interactions and environmental effects. Each of the main and interaction effects has only a modest effect on the phenotype, and it is difficult to dissect their individual contributions. Quantitative inheritance theory is applicable to all complex traits in different organisms, even prokaryotes. For example, it has been shown that genes belonging to specific/non-specific membrane channels, the oxidative stress response and the osmotic stress response are involved in conferring bacterial resistance to high arsenic levels, and searching for a single gene or single regulon may yield no meaningful result[3,4]. While systematic sampling and genotyping are still lacking for prokaryotes, the genotype/phenotype association in animal/plant breeding can be modelled by QTL mapping in breeding lines. In human populations, this relationship is studied by examining the association of phenotypes with naturally occurring genetic variations such as single-nucleotide polymorphisms (SNPs).
A QTL is a region of the genome that is responsible for variation in the phenotype of interest. In animal/plant breeding, molecular markers are selected at even spacing throughout the entire genome, and the aim of QTL mapping is to infer which genetic loci are strongly associated with the complex trait and to estimate the genetic effects of these loci. Two inbred lines that differ in the traits of interest are crossed, and the individuals of the first generation (F1) are genetically identical and show complete linkage disequilibrium (the non-random association of alleles at different loci) for the genes that differ between the parental lines. Starting from the F1, a number of designs have been proposed for QTL mapping. For example, a backcross design crosses the F1 individuals to one of the two parental lines; an intercross design crosses siblings among the F1 individuals; a doubled haploid design develops individuals from the pollen of an F1 plant through anther culture and chromosome doubling; and a recombinant inbred lines design crosses sibling individuals for many generations, starting from the F2, until almost all of the segregating loci become homozygous. The different experimental designs produce different breeding populations for QTL mapping, and the F2 population provides the most genetic information among the different types of mapping populations.
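The backcross and intercross designs described above can be illustrated with a minimal single-locus simulation. This is only a sketch: the allele labels `A` and `a`, the function names and the sample sizes are assumptions made for illustration, and linkage between loci is not modelled.

```python
import random
from collections import Counter

def cross(parent1, parent2, rng):
    """Offspring genotype: one random allele from each parent, stored unordered."""
    return tuple(sorted((rng.choice(parent1), rng.choice(parent2))))

def simulate_design(design, n, seed=0):
    """Count offspring genotypes at one locus for a backcross or an intercross (F2)."""
    rng = random.Random(seed)
    line1, line2 = ("A", "A"), ("a", "a")  # two inbred parental lines (assumed labels)
    f1 = cross(line1, line2, rng)          # every F1 individual is heterozygous
    if design == "backcross":
        offspring = [cross(f1, line1, rng) for _ in range(n)]
    elif design == "intercross":
        offspring = [cross(f1, f1, rng) for _ in range(n)]
    else:
        raise ValueError(f"unknown design: {design}")
    return Counter(offspring)

if __name__ == "__main__":
    print(simulate_design("backcross", 1000))
    print(simulate_design("intercross", 1000))
```

In this sketch the backcross population segregates into two genotype classes, whereas the intercross (F2) population segregates into three, which is one way to see why the F2 carries information on both additive and dominance effects.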
With the advent of new DNA sequencing technologies, high-density markers can easily be generated along the genome. However, it is still very likely that the true causal variants are not captured, owing to the large number of genomic variants in living organisms. On the other hand, even with a large number of genetic markers available, researchers usually do not know the number, location or effect of the markers involved in the inheritance of the target phenotypes. Correlation among genetic markers and oversaturated models are therefore two common features of QTL mapping and SNP-based association studies. In this review, we discuss traditional QTL mapping methods and novel statistical methods that enable whole-genome QTL mapping, and introduce the idea of genomic selection (GS), which results from whole-genome QTL mapping.
The authors have referenced some of their own studies in this review. The protocols of these studies have been approved by the relevant ethics committees related to the institution in which they were performed.
Techniques for QTL mapping include single marker mapping, interval mapping, multiple loci mapping and composite interval mapping. The principles of these mapping techniques are generally the same, and methods used in one population can be extended to other experimental populations. A single marker test examines the segregation of quantitative or qualitative traits with respect to the genotype at a single locus, while multiple loci mapping considers multiple markers, possible high-order marker interactions and environmental factors simultaneously. Interval mapping and composite interval mapping are extensions of the single marker test and the multiple loci test, respectively. In interval mapping, two consecutive markers (or, in composite interval mapping, several markers in a testing window) are ordered according to their physical location, and positions where the test statistic peaks are declared as QTLs. The various techniques test the genotype/phenotype association with different genetic effects, such as additive and dominance effects. In regression models, these effects can be examined simultaneously by adding dummy variables that encode them. A widely used genetic model is the Cockerham model, which defines the values of the additive effect as -1, 0 and 1 for the three genotypes and the values of the dominance effect as -0.5 and 0.5 for homozygotes and heterozygotes, respectively.
Similar to QTL mapping, the goal of an association study in a natural population is to identify SNPs that are systematically associated with different disease states. Using naturally occurring DNA variations as markers to trace inheritance in families is similar to QTL mapping, although extra care is required to handle population structure in genome-wide association studies of common human diseases. In the ensuing sections, we focus on the general computational methods that have been applied to both problems.
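The Cockerham coding and a single marker test can be sketched as follows. The genotype encoding (number of copies of one parental allele, 0, 1 or 2), the choice of which homozygote receives -1, and the function names are assumptions made for illustration; the test shown is an ordinary simple linear regression of the phenotype on the additive code, which is only one of several possible single marker tests.

```python
import math

def cockerham(genotype):
    """Map a genotype (copies of one parental allele: 0, 1 or 2) to (additive, dominance).

    Additive code: -1, 0, 1; dominance code: -0.5 for homozygotes, 0.5 for heterozygotes.
    """
    if genotype not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1 or 2")
    additive = genotype - 1
    dominance = 0.5 if genotype == 1 else -0.5
    return additive, dominance

def single_marker_test(codes, phenotypes):
    """Regress the phenotype on one effect code; return (slope, t statistic).

    Assumes the codes vary across individuals and the fit is not perfect.
    """
    n = len(phenotypes)
    mean_x = sum(codes) / n
    mean_y = sum(phenotypes) / n
    sxx = sum((x - mean_x) ** 2 for x in codes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(codes, phenotypes))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(codes, phenotypes))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return slope, slope / std_err

if __name__ == "__main__":
    genotypes = [0, 1, 2, 0, 1, 2, 1, 1]
    phenotypes = [8.1, 9.9, 12.1, 7.9, 10.1, 11.9, 10.0, 10.0]  # made-up values
    additive = [cockerham(g)[0] for g in genotypes]
    print(single_marker_test(additive, phenotypes))
```

Repeating the test marker by marker along the genome and looking for peaks in the test statistic corresponds to the single marker scan described above.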
The single variant approach is simple and straightforward for most models such as
Although it promises high power and a reasonable type I error rate, the multiple-variant approach must address several challenges. Firstly, the traditional ordinary least squares method fails for the case of
Variable selection methods include forward selection, backward elimination and forward stage-wise selection. However, these greedy selection methods may result in a suboptimal subset, and are computationally expensive even for a small number of variables. Variable shrinkage methods, on the other hand, include all variables in the model and apply a penalty function or appropriate prior distributions to the variables so that most non-effects are automatically shrunk towards zero. Moreover, including all possible effects in a single model results in whole-genome QTL mapping, which overcomes the limitations of the genetic models considered by traditional mapping methods and avoids the problems of model selection.
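The shrinkage idea can be sketched with an L1-penalised (Lasso) regression fitted by cyclic coordinate descent. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the objective scaling, the function names, the toy data generator and the penalty value are all hypothetical, and no intercept is fitted.

```python
import random

def soft_threshold(z, lam):
    """Shrink z towards zero by lam; values within [-lam, lam] become exactly zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * sum|b_j| by cyclic coordinate descent.

    X is a list of rows of effect codes and y the list of phenotypes.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residuals for beta = 0
    col_ss = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_ss[j] == 0.0:
                continue  # a constant-zero column carries no information
            # covariance of column j with the residual that excludes its own fit
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = soft_threshold(rho, lam) / col_ss[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta

def simulate(n, effects, noise_sd, seed):
    """Hypothetical toy data: codes drawn from {-1, 0, 1} and a linear phenotype."""
    rng = random.Random(seed)
    X = [[rng.choice((-1, 0, 1)) for _ in effects] for _ in range(n)]
    y = [sum(x * b for x, b in zip(row, effects)) + rng.gauss(0, noise_sd) for row in X]
    return X, y

if __name__ == "__main__":
    true_effects = [2.0, 0, 0, -1.5, 0, 0, 0, 0, 0, 0]
    X, y = simulate(60, true_effects, 0.1, seed=1)
    print([round(b, 2) for b in lasso_cd(X, y, lam=0.2)])
```

All variables enter the model at once, and the soft-thresholding step is what sets most non-effects exactly to zero while shrinking the retained effects, so no separate subset search is needed.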
Strictly speaking, whole-genome QTL mapping includes additive and dominance main effects and all pair-wise interactions. For a QTL model includes
Another natural outcome of whole-genome QTL mapping is GS, which was originally proposed in the context of genome-wide QTL mapping, where the mapping results enable the prediction of estimated breeding values (EBV). GS is rooted in marker-assisted breeding, and can be viewed as a new generation of breeding methods, distinct from traditional breeding, which relies on phenotypic selection and information from relatives. In GS, individual genetic merit is estimated by simultaneously accounting for all markers and all types of marker effects. From a biological point of view, the marker map must cover all genomic positions so that all QTLs can be assessed for their contributions to the EBV, although in practice some adjustment for linkage disequilibrium might be required to avoid collinearity. From a computational point of view, the QTL model must simultaneously estimate all genetic effects, including main and epistatic effects, gene-environment interactions and the effects of rare alleles, so that the breeding values are predicted from all significant effects. Whole-genome QTL mapping provides more accurate estimates of marker effects and phenotypic variance, which results in better prediction of genomic EBV.
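The prediction step of GS can be sketched as follows: marker effects are estimated jointly in a training population and then summed over the markers of new individuals to give their EBV. Ridge regression is used here purely as a simple stand-in for the whole-genome model; it is not the method of the studies reviewed, and the function names, penalty value and toy population are assumptions made for illustration.

```python
import random

def ridge_effects(X, y, lam=0.1, n_sweeps=100):
    """Estimate all marker effects jointly by ridge regression.

    Minimises (1/(2n)) * ||y - X b||^2 + (lam/2) * ||b||^2 by cyclic coordinate
    descent. X is a list of rows of marker codes; y is the list of phenotypes.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_ss = [sum(row[j] ** 2 for row in X) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new = rho / (col_ss[j] + lam)
            delta = new - beta[j]
            for i in range(n):
                resid[i] -= X[i][j] * delta
            beta[j] = new
    return beta

def predict_ebv(X_new, beta):
    """EBV of each new individual: sum of marker codes times estimated effects."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X_new]

def simulate_population(n, effects, noise_sd, rng):
    """Hypothetical toy population: returns marker codes, true genetic values, phenotypes."""
    X = [[rng.choice((-1, 0, 1)) for _ in effects] for _ in range(n)]
    g = [sum(x * b for x, b in zip(row, effects)) for row in X]
    y = [gi + rng.gauss(0, noise_sd) for gi in g]
    return X, g, y

if __name__ == "__main__":
    rng = random.Random(7)
    effects = [0.0] * 20
    effects[2], effects[9], effects[15] = 1.0, -0.8, 0.6
    X_train, _, y_train = simulate_population(100, effects, 0.5, rng)
    X_cand, g_cand, _ = simulate_population(5, effects, 0.5, rng)
    beta = ridge_effects(X_train, y_train)
    for ebv, g in zip(predict_ebv(X_cand, beta), g_cand):
        print(f"EBV {ebv:6.2f}   true genetic value {g:6.2f}")
```

The candidates in this sketch are ranked on marker data alone, without their own phenotypes, which is the practical appeal of GS for breeding line selection.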
Two well-known methods with ability in handling large number of variables are regularised regression and Bayesian shrinkage method. Consider the general multiple linear regression problem with
where y is an
In the Bayesian shrinkage approach, a prior distribution
is assigned to
If we are looking for
which is the penalised ML method. Then is the mode of the posterior distribution and this method is referred as
The Bayesian interpretation of penalised regression methods has sparked interest in developing Bayesian hierarchical regression models, which can be referred to as Bayesian Lasso approaches. A direct gain of the Bayesian Lasso is that it incorporates variance information to facilitate significance tests, and different techniques are employed in different designs to achieve sparse estimation. In theory, any prior distribution with a finite spike at zero and flat tails at both ends can act as a penalty on the log posterior distribution (Figure 1). In practice, conjugate priors are often chosen. Among them the Normal +
Figure 1. Prior distributions that penalise posterior distributions.
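As a worked illustration of how a prior acts as a penalty on the log posterior, consider a Gaussian likelihood with an independent Laplace (double-exponential) prior on each effect. The notation below is assumed for illustration only and is not taken from the equations of this review: y is the vector of phenotypes, X the matrix of effect codes, β the vector of effects, σ² the residual variance and λ the prior scale parameter.

```latex
% Assumed notation: y (n x 1 phenotypes), X (n x p effect codes),
% beta (p x 1 effects), sigma^2 (residual variance), lambda (prior scale).
\begin{align}
p(\beta \mid y) &\propto p(y \mid \beta)\, p(\beta), \\
p(y \mid \beta) &= \mathcal{N}\!\left(y \mid X\beta,\ \sigma^2 I\right), \qquad
p(\beta) = \prod_{j=1}^{p} \frac{\lambda}{2} \exp\!\left(-\lambda\,|\beta_j|\right), \\
\hat{\beta}_{\mathrm{MAP}}
  &= \arg\max_{\beta}\ \log p(\beta \mid y)
   = \arg\min_{\beta} \left\{ \frac{1}{2\sigma^2} \lVert y - X\beta \rVert_2^2
     + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.
\end{align}
```

Under these assumptions the negative log prior contributes the L1 term, so the posterior mode coincides with a Lasso-type penalised estimate; replacing the Laplace prior with a Gaussian prior would give a squared (ridge-type) penalty instead.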
For Bayesian estimation, Markov chain Monte Carlo (MCMC) can be employed to draw samples from the posterior distribution of each parameter. However, for high-dimensional data the MCMC method is known to be computationally intensive. The maximum a posteriori (MAP) estimate is a modal representation of the posterior distribution that achieves faster computation and easier interpretation. Efficient methods for MAP estimation have been developed, which further integrate out the variance components and employ numerical methods to obtain sparse results similar to those of Lasso (named HyperLasso). Another method that stands between MCMC and MAP is the expectation-maximization (EM) algorithm, which estimates the posterior mean and variance simultaneously through iterative expectation (E-step) and maximization (M-step) steps. However, convergence becomes a serious problem as the model dimension increases. Recently, a more efficient algorithm that does not rely on MCMC yet still infers the posterior distribution has been achieved with the empirical Bayesian Lasso (EBlasso) method[23,24].
Unlike the iterative EM algorithm, EBlasso first finds the marginal posterior distribution of the variance components. Owing to the shrinkage applied by the prior distribution on the variance components, the marginal posterior distribution is maximised at zero variance for most variables, and only those variables with non-zero variance stay in the model, resulting in a sparse representation. Next, the posterior means of the non-zero variables are estimated with the given variances. Along with other algorithmic techniques, the EBlasso approach is very efficient and is able to handle a model with several million variables. Our previous studies demonstrated that EBlasso outperformed several other multiple QTL mapping methods, including the empirical Bayes method, the Bayesian hierarchical generalised linear models (BhGLM), HyperLasso and Lasso. EBlasso has also been applied to whole-genome QTL mapping and can easily be extended to GS.
A correct QTL model is one that includes all true QTLs and estimates their effects simultaneously. In real data analyses, a significance test is needed because the true QTLs are unknown. Although Lasso and HyperLasso can both handle large models, they yield only a point estimate without variance information. Bootstrapping, refitting an ordinary least squares model, and covariance test statistics developed along the Lasso selection path have been combined with the two methods to provide significance tests. The EBlasso method, on the other hand, can handle a large model at a speed comparable with that of Lasso, and estimates the posterior distribution for a sparse model. The availability of high-dimensional sparse models capable of handling a few million effects enables whole-genome QTL mapping and GS. In fact, with high-density markers available for many animal and plant organisms, collected from both inbred lines and natural populations, GS can play a significant role in improving breeding technologies.
All authors contributed to the conception, design, and preparation of the manuscript, as well as read and approved the final manuscript.
All authors abide by the Association for Medical Ethics (AME) ethical rules of disclosure.
Popular Bayesian Lasso methods
| Algorithm | Prior distribution | Inference method | Reference |
|---|---|---|---|
| | | | Park T, Casella G |
| Bayesian Lasso | Normal + inverse-χ² | MCMC | Yi N, Xu S |
| | | MCMC | Li J et al. |
| BhGLM | Normal + inverse-χ² | EM | Yi N, Banerjee S |
| Bayesian HyperLasso | NEG | EM | Griffin JE, Brown PJ |
| Bayesian Lasso | Normal + Cauchy | EM | Gelman A et al. |
| | Normal + Uniform | Empirical Bayes | Tipping ME, Faul AC |
| | | | Cai X et al. |
| | | | Huang A et al. |