Book notes: Molecular Evolution: A Statistical Approach by Ziheng Yang
Last modified: Post views:We decide to read the book Molecular Evolution: A Statistical Approach by Ziheng Yang, to gether on a weekly basis.
Models of nucleotide substitution
- We define $p$ distance as thr proportion of sites that differ between two sequences.
- The true evolutionary distance $d$ is the expected number of substitutions per site between two sequences.
- CTMC (Continuous Time Markov Chain): four nucleotides are states of the chain, it will reach to steady state with a stationary distribution, via a substitution rate matrix.
Markov models of nucleotide substitution and distance estimation
The JC69 model
- Jukes-Cantor 1969 model.
- The substitution rate matrix $Q$ is \(Q = \begin{pmatrix} -3\lambda & \lambda & \lambda & \lambda \\ \lambda & -3\lambda & \lambda & \lambda \\ \lambda & \lambda & -3\lambda & \lambda \\ \lambda & \lambda & \lambda & -3\lambda \end{pmatrix}\)
- The matrix of transition probability: $P(t) = e^{Qt}$. This is using Kolmogorov forward equation, where $\frac{dP(t)}{dt} = QP(t)$.
- We then use Taylor expansion to get $P(t) = I + Qt + \frac{Q^2t^2}{2!} + \cdots$.
- Essense of calculus by 3Blue1Brown.
- Chapman–Kolmogorov equation: $P_ij(t+s) = \sum_k P_ik(t)P_kj(s)$. This account for the multiple (hidden) states between $i$ and $j$.
- As $d=3\lambda t$, then $p(d) = 3p_1(t)=\frac{3}{4}-\frac{3}{4}e^{-4\lambda t}=\frac{3}{4}-\frac{3}{4}e^{-4d/3}$, then $\hat{d}=-\frac{3}{4}log(1-\frac{4}{3}\hat{p})$.
The K80 model
- Kimura 1980 model.
- Transitions: pyrimidine (T $\leftrightarrow$ C), purine (A $\leftrightarrow$ G).
- Transversion: (T, C $\leftrightarrow$ A, G).
- $d = (\alpha + 2\beta)t$.
- $\kappa = \frac{\alpha}{\beta}$.
- $E(S) = p_1(t)$.
- $E(V) = 2p_2(t)$.
- $\hat{d} = -\frac{1}{2} \log(1 - 2S - V) - \frac{1}{4} \log(1 - 2V)$.
HKY85, F84, TN93, etc.
TN93
- Tamura-Nei 1993 model.
- 7 parameters and 6 free parameters.
- $\pi_Y = \pi_T + \pi_C$ and $\pi_R = \pi_A + \pi_G$.
- $P(t)$ can be solved using spectral decomposition of $Q$.
- $Q = U \Lambda U^{-1}$.
- $\Lambda = \text{diag}\lbrace \lambda_1, \lambda_2, \lambda_3, \lambda_4 \rbrace$.
- $P(t) = U e^{\Lambda t} U^{-1} = U \text{diag}\lbrace e^{\lambda_1 t}, e^{\lambda_2 t}, e^{\lambda_3 t}, e^{\lambda_4 t} \rbrace U^{-1}$.
- Essense of linear algebra by 3Blue1Brown.
- Eigenvectors and eigenvalues with prerequisites:
- Linear transformations
- Matrix multiplication
- Determinants
- Linear systems
- Change of basis, Jennifer is a French as indicated in 3:21 of this video :D
- Eigenvectors and eigenvalues with prerequisites:
- $\lambda = -\sum_i \pi_i q_{ii} = 2\pi_T \pi_C \alpha_1 + 2\pi_A \pi_G \alpha_2 + 2\pi_Y \pi_R \beta$.
- $d = \lambda t$.
- Using $E(S_1), E(S_2), E(V)$ to estimate $\alpha_1, \alpha_2, \beta$.
- $\hat{d}$ can be represented as a function of nucleotide frequencies and $E(S_1), E(S_2), E(V)$.
HKY85 and F84
-
6 parameters and 5 free parameters.
- Hasegawa-Kishino-Yano 1985 model.
- Interesting fact: it should be called HKY84, misnamed in Yang 1994.
- Setting $\alpha_1 = \alpha_2 = \alpha$ or $\kappa_1 = \kappa_2 = \kappa$ in TN93 model.
- Felsenstein 1984 model.
- Setting $\alpha_1 = (1+\kappa/\pi_Y)\beta$ and $\alpha_2 = (1+\kappa/\pi_R)\beta$ in TN93 model.
- It is easier to derive a distance formula for F84 than for HKY85.
- Felsenstein 1981 model.
- Setting $\alpha_1 = \alpha_2 = \beta$ in TN93 model.
- A distance formula was derived by Tajima and Nei (1982).
- On a side note, Tajima and Nei (1984) is a classic paper on nucleotide diversity.
The transition/transversion rate ratio
- Three definitions:
- $E(S)/E(V) = p_1(t)/2p_2(t)$ under the K80 model (the uncorrected method).
- $\kappa = \alpha/\beta$ under the K80 model (corrected).
- $R = \alpha/2\beta$ under the K80 model (corredted). It’s called average transition/transversion ratio. More generally, $R = \frac{\pi_T q_{TC} + \pi_C q_{CT} + \pi_A q_{AG} + \pi_G q_{GA}}{\pi_T q_{TA} + \pi_T q_{TG} + \pi_C q_{CA} + \pi_C q_{CG} + \pi_A q_{AT} + \pi_A q_{AC} + \pi_G q_{GT} + \pi_G q_{GC}}$ under UNREST.
Variable substitution rates among sites
- Assuming the rate $r$ for any site is drawn from a gamma distribution.
- Models with Gamma distance denoted by a suffix ‘$+\Gamma$’.
- Gamma distribution have two parameters: shape $\alpha$ and scale $\beta$. The mean is $\alpha/\beta$, and the variance is $\alpha/\beta^2$. You can also consider them as number of Poisson process events of $\alpha$ with rate $1/\beta$.
- Note that the gamma distribution here sets $\alpha = \beta$ so that the mean equals to $1$.
- If a site has a rate $r$, the distance between sequences becomes $d = r t$.
- $p(d\cdot r)$ can be calculated by integrating the gamma distribution.
- For JC69 model, $p = \int_0^\infty \left( \frac{3}{4} - \frac{3}{4} e^{-4d \cdot r / 3} \right) g(r) \, dr = \frac{3}{4} - \frac{3}{4} \left( 1 + \frac{4d}{3\beta} \right)^{-\alpha}$.
- Similarly, one can derive a gamma distance under virtually every model for which a one-rate distance formula is available.
- It is well known that ignoring rate variation among sites leads to underestimation of both the sequence distance and the transition/transversion rate ratio.
Tags: Molecular evolution, Statistical methods
Categories: Book notes
Comments