The standard genetic model assumes that the phenotype is the sum of a genetic component and a non-genetic component (residual), $y = g + \varepsilon$. Genomic Selection uses genetic markers that cover the whole genome and can potentially explain all the genetic variance. These markers are assumed to be in Linkage Disequilibrium (LD) with the QTL, so models that include all markers can estimate breeding values as combinations of the effects of these QTL.
The response variable $y_i$ for the i-th individual (i=1,...,n) is regressed on a function of p marker genotypes that seeks to approximate the true genetic value of the individual, that is

$$y_i = f(x_{i1},...,x_{ip}) + \varepsilon_i$$

where the function $f$ can be parametric or non-parametric and $\varepsilon_i$ are the residuals, which are usually assumed to be normally distributed with constant variance, $\varepsilon_i \sim N(0,\sigma^2_\varepsilon)$.
The genotypic value of an individual is estimated with a linear model that uses a linear combination of the marker genotypes, that is

$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i$$

where $\mu$ is the intercept, $x_{ij}$ is the genotype of the i-th individual at the j-th marker, and $\beta_j$ is the corresponding marker effect.
The model above is difficult to estimate when p is much larger than n, so penalization and regularization approaches are used to overcome this problem. Penalized and regularized solutions can be seen as posterior solutions in the Bayesian context, where the penalty corresponds to a prior distribution on the marker effects.
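The estimation problem can be illustrated with a minimal base-R sketch. The data are simulated, and the sizes `n` and `p`, the effect sizes and the penalty `lambda` are arbitrary assumptions chosen only for illustration. The sketch shows that with p > n the marker matrix is rank deficient, while a ridge penalty still yields a unique solution that shrinks toward zero as the penalty grows.

```r
set.seed(1)
n <- 50; p <- 200                               # more markers than individuals
X <- matrix(rbinom(n * p, 1, 0.5), n, p)        # hypothetical 0/1 genotypes
X <- scale(X, center = TRUE, scale = FALSE)     # center each marker
beta <- rnorm(p, sd = 0.2)                      # simulated marker effects
y <- as.vector(X %*% beta + rnorm(n))
y <- y - mean(y)                                # centering y removes the intercept

# The rank of X is below p, so ordinary least squares has no unique solution
qr(X)$rank

# Ridge: adding lambda to the diagonal of X'X makes the system solvable
lambda <- 10
beta_ridge <- solve(crossprod(X) + diag(lambda, p), crossprod(X, y))

# A larger penalty shrinks the estimates further toward zero
beta_ridge_strong <- solve(crossprod(X) + diag(100 * lambda, p), crossprod(X, y))
c(sum(beta_ridge^2), sum(beta_ridge_strong^2))
```

In the Bayesian reading, `lambda` plays the role of the ratio between the residual variance and the variance of the marker effects.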
Bayesian Ridge Regression (BRR) is a penalized regression that assumes that the regression coefficients independently follow a Gaussian (Normal) prior distribution, that is $\beta_j \sim N(0,\sigma^2_\beta)$. This prior induces shrinkage of the estimates toward zero.
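The Bayesian LASSO assumes that the regression coefficients have a double-exponential (DE, or Laplace) prior distribution with parameters $\lambda^2$ and $\sigma^2_\varepsilon$. This is a thick-tailed prior that can be represented as an infinite mixture of normal densities scaled by exponential ($Exp$) densities, that is

$$\beta_j \sim DE(\lambda^2,\sigma^2_\varepsilon) = \int N(\beta_j \mid 0,\tau^2_j\sigma^2_\varepsilon)\,Exp(\tau^2_j \mid \lambda^2/2)\,d\tau^2_j$$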
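The mixture representation of the double-exponential prior can be checked by simulation. The base-R sketch below assumes arbitrary values for `lambda` and `sigma2_e`, and assumes the exponential density is parameterized by the rate `lambda^2 / 2`. It compares draws from the hierarchical (normal scaled by exponential) construction with direct draws from a Laplace distribution with scale `sqrt(sigma2_e) / lambda`.

```r
set.seed(42)
m <- 200000                                   # number of simulated effects
lambda <- 2; sigma2_e <- 1                    # arbitrary illustrative values

# Hierarchical draw: tau2_j ~ Exp(rate = lambda^2 / 2),
# then beta_j | tau2_j ~ N(0, tau2_j * sigma2_e)
tau2 <- rexp(m, rate = lambda^2 / 2)
beta_mix <- rnorm(m, mean = 0, sd = sqrt(tau2 * sigma2_e))

# Direct draw from the double-exponential with scale b:
# an exponential magnitude with a random sign
b <- sqrt(sigma2_e) / lambda
beta_de <- rexp(m, rate = 1 / b) * sample(c(-1, 1), m, replace = TRUE)

# Both samples should have variance close to 2 * b^2 and mean absolute value close to b
c(mixture = var(beta_mix), direct = var(beta_de), theory = 2 * b^2)
c(mixture = mean(abs(beta_mix)), direct = mean(abs(beta_de)), theory = b)
```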
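In the Bayes A model, the regression effects are assigned another thick-tailed prior, a scaled t distribution with degrees of freedom $df_\beta$ and scale parameter $S_\beta$. As with the double-exponential, the scaled t distribution is represented as a mixture of normal densities scaled by a scaled-inverse Chi-squared ($\chi^{-2}$) density, that is

$$\beta_j \sim t(df_\beta,S_\beta) = \int N(\beta_j \mid 0,\sigma^2_{\beta_j})\,\chi^{-2}(\sigma^2_{\beta_j} \mid df_\beta,S_\beta)\,d\sigma^2_{\beta_j}$$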
In the Bayes B model, marker effects are assumed to be equal to zero with probability $\pi$, and with probability $1-\pi$ they are assumed to follow a scaled t distribution as in the Bayes A model.
In the Bayes C model, similar to Bayes B, marker effects are assumed to be equal to zero with probability $\pi$, and with probability $1-\pi$ they are assumed to follow a Gaussian distribution as in the BRR model.
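The five priors described above can be fitted with the `BGLR()` function, where they are selected with the labels `"BRR"`, `"BL"`, `"BayesA"`, `"BayesB"` and `"BayesC"`. The following is a minimal sketch, not a tuned analysis: it assumes the wheat data shipped with 'BGLR', uses only the first environment, and runs a deliberately short chain (`nIter` and `burnIn` are arbitrary and too small for real inference). The names `priors` and `fits` are illustrative.

```r
library(BGLR)

data(wheat)                      # provides wheat.X, wheat.Y and wheat.A
X <- scale(wheat.X)              # centered and standardized markers
y <- wheat.Y[, 1]                # grain yield in the first environment

# BGLR labels for the five priors on marker effects
priors <- c(BRR = "BRR", BayesianLASSO = "BL", BayesA = "BayesA",
            BayesB = "BayesB", BayesC = "BayesC")

set.seed(123)
fits <- lapply(priors, function(m) {
  BGLR(y = y, ETA = list(list(X = X, model = m)),
       nIter = 600, burnIn = 100,             # short chain, illustration only
       saveAt = file.path(tempdir(), paste0(m, "_")), verbose = FALSE)
})

# Within-sample correlation between observed and fitted values
sapply(fits, function(fm) cor(y, fm$yHat))

# Estimated marker effects are stored in fm$ETA[[1]]$b
head(fits$BRR$ETA[[1]]$b)
```

These correlations are computed on the same data used for fitting, so they overstate predictive ability; a fair comparison of the models requires cross-validation.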
In the G-BLUP model, the response is modeled as $y_i = \mu + g_i + \varepsilon_i$. Its solution is equivalent to that of the BRR model, which arises when the substitution $g_i = \sum_{j=1}^{p} x_{ij}\beta_j$ is made in the model above.
It can be shown that the random vector $g = (g_1,...,g_n)'$ follows a Normal distribution $g \sim N(0,\sigma^2_g G)$, where $G = XX'/p$, with X the matrix of centered and standardized marker genotypes. The matrix $G$ is called the genomic relationship matrix.
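The equivalence between G-BLUP and ridge regression can be verified numerically. The base-R sketch below uses simulated genotypes and phenotypes and treats the variance components as known; the values of `sigma2_e` and `sigma2_beta` are arbitrary assumptions. It uses the relation $\sigma^2_g = p\,\sigma^2_\beta$ implied by $G = XX'/p$, and omits the intercept by centering the phenotype.

```r
set.seed(7)
n <- 40; p <- 300
freq <- runif(p, 0.2, 0.8)                           # hypothetical allele frequencies
M <- sapply(freq, function(f) rbinom(n, 1, f))       # n x p matrix of 0/1 genotypes
M <- M[, apply(M, 2, sd) > 0, drop = FALSE]          # drop monomorphic markers (cannot be standardized)
p <- ncol(M)
X <- scale(M)                                        # centered and standardized markers
y <- rnorm(n); y <- y - mean(y)                      # hypothetical centered phenotype

G <- tcrossprod(X) / p                               # genomic relationship matrix

sigma2_e <- 1; sigma2_beta <- 0.01                   # variance components taken as known
sigma2_g <- p * sigma2_beta                          # implied genetic variance

# Ridge regression: estimate marker effects, then genetic values X %*% beta
lambda <- sigma2_e / sigma2_beta
beta_hat <- solve(crossprod(X) + diag(lambda, p), crossprod(X, y))
g_ridge <- as.vector(X %*% beta_hat)

# G-BLUP: predict genetic values directly from G
g_gblup <- as.vector(G %*% solve(G + diag(sigma2_e / sigma2_g, n), y))

all.equal(g_ridge, g_gblup)
```

G-BLUP solves an n x n system instead of a p x p one, which is the practical advantage when the number of markers is much larger than the number of individuals.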
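In the Reproducing Kernel Hilbert Spaces (RKHS) regression, the genomic function is expressed as a linear combination of positive semi-definite basis functions called Reproducing Kernels (RK), $K(x_i,x_{i'})$, as follows

$$f(x_i) = \sum_{i'=1}^{n} K(x_i,x_{i'})\alpha_{i'}$$

This model can be rewritten in matrix form as $y = \mu 1 + K\alpha + \varepsilon$, where $K = \{K(x_i,x_{i'})\}$ is an $n \times n$ matrix containing the evaluations of the RK function at all pairs of individuals (i,i') and $\alpha = (\alpha_1,...,\alpha_n)'$.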
This problem can be solved in a Bayesian fashion by assuming a prior $\alpha \sim N(0,\sigma^2_\alpha K^{-1})$, so that $u = K\alpha \sim N(0,\sigma^2_\alpha K)$.
Note: Ridge Regression (and, consequently, G-BLUP) can be represented as an RKHS model by setting K=G.
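The note above can be sketched with the `"RKHS"` model of `BGLR()`, which takes a kernel matrix through the `K` argument. The example below assumes the wheat data shipped with 'BGLR' and compares the linear kernel (K=G) with a Gaussian kernel. The bandwidth `0.5`, the chain length and the helper `fit_rkhs` are arbitrary, illustrative choices.

```r
library(BGLR)

data(wheat)                                # provides wheat.X, wheat.Y and wheat.A
X <- scale(wheat.X)                        # centered and standardized markers
y <- wheat.Y[, 1]                          # grain yield in the first environment
p <- ncol(X)

G <- tcrossprod(X) / p                     # linear kernel = genomic relationship matrix
D <- as.matrix(dist(X))^2 / p              # squared Euclidean distance, averaged over markers
K <- exp(-0.5 * D)                         # Gaussian kernel; the bandwidth 0.5 is arbitrary

# Hypothetical helper that fits an RKHS model for a given kernel matrix
fit_rkhs <- function(Kmat, tag) {
  BGLR(y = y, ETA = list(list(K = Kmat, model = "RKHS")),
       nIter = 600, burnIn = 100,          # short chain, illustration only
       saveAt = file.path(tempdir(), paste0(tag, "_")), verbose = FALSE)
}

set.seed(123)
fm_G <- fit_rkhs(G, "gblup")               # K = G gives the G-BLUP model
fm_K <- fit_rkhs(K, "gauss")               # Gaussian kernel

# Predicted genetic values are stored in fm$ETA[[1]]$u
c(GBLUP = cor(y, fm_G$yHat), Gaussian = cor(y, fm_K$yHat))
```

In practice the bandwidth of the Gaussian kernel is usually chosen by cross-validation or by combining several kernels, rather than fixed in advance as done here.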
The models described above are implemented in R using the packages 'BGLR' and 'rrBLUP'. Using public data, it is shown how to run the models for the single-environment case and then how to perform a multi-environment analysis with the G-BLUP model, using the marker-by-environment (MxE) and Reaction Norm approaches, both of which account for GxE interaction.
The data come from CIMMYT’s Global Wheat Program. Lines were evaluated for grain yield (each entry corresponds to an average of two plot records) in four different environments; phenotypes (object wheat.Y) were centered and standardized to unit variance within each environment. Each line was genotyped for 1279 diversity array technology (DArT) markers. At each marker two homozygous genotypes were possible, and these were coded as 0/1. Marker genotypes are given in the object wheat.X. Finally, the matrix wheat.A provides the pedigree relationships between lines, computed from the pedigree records. The data are available in the R-package 'BGLR'.
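# Install the required packages if they are not already available
if(!requireNamespace("BGLR", quietly=TRUE)) install.packages("BGLR")
if(!requireNamespace("rrBLUP", quietly=TRUE)) install.packages("rrBLUP")
library(BGLR)
library(rrBLUP)

# Load the wheat data set shipped with BGLR (provides wheat.X, wheat.Y and wheat.A)
data(wheat)
X <- wheat.X   # marker genotypes coded as 0/1
Y <- wheat.Y   # grain yield in the four environments
A <- wheat.A   # pedigree relationship matrix

# Visualize data
head(Y)
X[1:10,1:5]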
- de los Campos, G., Gianola, D., Rosa, G. J. M., Weigel, K. A., & Crossa, J. (2010). Semi-parametric genomic-enabled prediction of genetic values using reproducing kernel Hilbert spaces methods. Genetics Research, 92(4), 295–308.
- de los Campos, G., Hickey, J. M., Pong-Wong, R., Daetwyler, H. D., & Calus, M. P. L. (2013). Whole-genome regression and prediction methods applied to plant and animal breeding. Genetics, 193(2), 327–345.
- Endelman, J. B. (2011). Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. The Plant Genome Journal, 4(3), 250–255.
- Habier, D., Fernando, R. L., Kizilkaya, K., & Garrick, D. J. (2011). Extension of the Bayesian alphabet for genomic selection. BMC Bioinformatics, 12(186), 1–12.
- Jarquín, D., Crossa, J., Lacaze, X., Du Cheyron, P., Daucourt, J., Lorgeou, J., … de los Campos, G. (2014). A reaction norm model for genomic selection using high-dimensional genomic and environmental data. Theoretical and Applied Genetics, 127(3), 595–607.
- Lopez-Cruz, M., Crossa, J., Bonnett, D., Dreisigacker, S., Poland, J., Jannink, J.-L., … de los Campos, G. (2015). Increased prediction accuracy in wheat breeding trials using a marker × environment interaction genomic selection model. G3: Genes, Genomes, Genetics, 5(4), 569–582.
- Meuwissen, T. H. E., Hayes, B. J., & Goddard, M. E. (2001). Prediction of total genetic value using genome-wide dense marker maps. Genetics, 157(4), 1819–1829.
- Park, T., & Casella, G. (2008). The Bayesian Lasso. Journal of the American Statistical Association, 103(482), 681–686.
- Perez, P., & de los Campos, G. (2014). Genome-wide regression and prediction with the BGLR statistical package. Genetics, 198(2), 483–495.
- R Development Core Team. (2015). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.