Poster Presentation 51st Lorne Proteins Conference 2026

Classical and neural models of protein sequence evolution (#409)

Annabel Large 1, Ian Holmes 1
  1. University of California, Berkeley, Berkeley, CA, United States

To a first approximation, proteins evolve via the accumulation of substitutions and indels. With an appropriate statistical model for this evolutionary process, sequence alignment can be framed as a straightforward (albeit expensive) exercise in Bayesian inference: correctly placing gaps corresponds to identifying where the indels occurred. Similarly, phylogenetic tree construction and ancestral sequence reconstruction can be framed as inference tasks, as can more refined bioinformatics exercises like using a multiple alignment to estimate the type of selection occurring at different places in a sequence (purifying vs diversifying selection), identifying evolutionarily accelerated or conserved sites, and detecting various signals of three-dimensional structure.

All of this is contingent on having a realistic continuous-time Markov process generator for the rates at which substitutions and indels occur, and on being able to "solve" (exponentiate) that model. Specifically, we need a time-dependent alignment likelihood of the form P(alignment, descendant | ancestor, time) that obeys the Chapman-Kolmogorov equations for a Markov chain. The pioneering model in this regard was that of Thorne, Kishino, and Felsenstein (TKF91), the first to derive the gap and substitution scoring schemes for dynamic programming sequence alignment directly from an underlying model of the instantaneous rates of point substitutions and indels. One year later, the TKF92 model extended TKF91 to allow multi-residue indels, at the cost of introducing latent information.
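To make the "solve (exponentiate)" requirement concrete, the following minimal sketch covers the substitution-only case, which is much simpler than TKF91 or TKF92 because it has no indels. It exponentiates a rate matrix Q to obtain P(t) = exp(Qt) and checks the Chapman-Kolmogorov property P(s + t) = P(s) P(t) numerically. The three-state alphabet, the rate values in Q, and the helper names (mat_mul, mat_exp, max_abs_diff) are illustrative assumptions and are not taken from any of the models discussed here.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, squarings=10, terms=20):
    """Approximate exp(q * t) by scaling and squaring a truncated Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds a^k / k!, starting at k = 0
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    # Undo the scaling: exp(a) squared `squarings` times equals exp(q * t).
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Hypothetical rate matrix on a toy three-state alphabet.
# Off-diagonal entries are substitution rates; each row sums to zero.
Q = [[-0.3, 0.2, 0.1],
     [0.4, -0.5, 0.1],
     [0.2, 0.3, -0.5]]

s, t = 0.7, 1.9
p_sum = mat_exp(Q, s + t)                       # P(s + t)
p_prod = mat_mul(mat_exp(Q, s), mat_exp(Q, t))  # P(s) P(t)

print("max |P(s+t) - P(s)P(t)| =", max_abs_diff(p_sum, p_prod))
print("row sums of P(s+t):", [sum(row) for row in p_sum])
```

For a substitution-only process the check passes up to floating-point error for any valid rate matrix. The difficulty the indel models address is obtaining a likelihood with the same composability when sequence length itself changes over time.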

Recently, several improvements on this model have been proposed. The first class aims to solve the simple process described by TKF92 more cleanly: De Maio (Systematic Biology, 2020) and Holmes (Genetics, 2020) used a renormalization approach to bypass the latent information introduced in TKF92. The second class pursues realism directly, using neural networks to model the alignment likelihood and forsaking the clean simplicity of TKF92 (and friends) for a more predictive model of evolution.

We have tested several such models, including models based on mixtures and hierarchically nested versions of TKF92 (as in Holmes, 2004), as well as various neural models. Here, we report the results of these comparisons, with recommendations for future protein evolutionary analyses.

  1. An evolutionary model for maximum likelihood alignment of DNA sequences. Thorne JL, Kishino H, Felsenstein J. J Mol Evol. 1991 Aug;33(2):114-24. doi: 10.1007/BF02193625.
  2. Inching toward reality: an improved likelihood model of sequence evolution. Thorne JL, Kishino H, Felsenstein J. J Mol Evol. 1992 Jan;34(1):3-16. doi: 10.1007/BF00163848.
  3. Bayesian coestimation of phylogeny and sequence alignment. Lunter G, Miklós I, Drummond A, Jensen JL, Hein J. BMC Bioinformatics. 2005 Apr 1;6:83. doi: 10.1186/1471-2105-6-83.
  4. A probabilistic model for the evolution of RNA structure. Holmes I. BMC Bioinformatics. 2004 Oct 26;5:166. doi: 10.1186/1471-2105-5-166.
  5. A Model of Indel Evolution by Finite-State, Continuous-Time Machines. Holmes I. Genetics. 2020 Dec;216(4):1187-1204. doi: 10.1534/genetics.120.303630.
  6. The cumulative indel model: fast and accurate statistical evolutionary alignment. De Maio N. Syst Biol. 2020. syaa050. doi: 10.1093/sysbio/syaa050.