Protein evolution behaves as a complex dynamical system, characterised by nonlinearity, sensitivity to initial conditions, and emergent self-organisation [1]. The resulting landscape "ruggedness", driven by high-order epistasis, is the primary determinant of failure for machine learning (ML) models in protein engineering [2]. Here, we present a unified framework for navigating these complex terrains. We first demonstrate quantitatively that landscape ruggedness severely constrains the ability of standard ML architectures to extrapolate beyond their training data. To overcome this, we leverage deep evolutionary history: by training protein language models on multiplexed ancestral sequence reconstructions, a technique we term Local Ancestral Sequence Embedding (LASE), we show that ancestral data effectively "smooths" the representation of fitness landscapes, rendering highly epistatic variance learnable [3]. We apply these principles to the evolution of polyethylene terephthalate (PET) hydrolases, identifying functional variants inaccessible to standard rational design and revealing that innovation in activity proceeds through distal, neutral mutations [4]. By integrating complexity theory with advanced representation learning, we establish a robust paradigm for exploring functional sequence space and designing novel enzymes in sparse-data regimes.
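As a conceptual illustration of the embed-then-regress workflow that LASE-style representations support (a minimal sketch, not the authors' implementation), the code below mean-pools per-residue embeddings into a fixed-length vector and fits a ridge regressor to predict fitness. The embedding table, sequences, and fitness labels are synthetic placeholders; in practice the embeddings would be extracted from a protein language model trained on ancestral sequence reconstructions.

```python
# Conceptual sketch only: a generic embed-then-regress fitness workflow.
# All data below are synthetic placeholders, not experimental values.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(0)

EMB_DIM = 64
# Placeholder per-residue embedding table; a stand-in for representations
# learned by a language model trained on ancestral reconstructions.
EMB_TABLE = rng.standard_normal((len(AA), EMB_DIM))

def embed(seq: str) -> np.ndarray:
    """Mean-pool per-residue embeddings into one fixed-length sequence vector."""
    idx = [AA.index(a) for a in seq]
    return EMB_TABLE[idx].mean(axis=0)

# Synthetic stand-in dataset: random 50-residue variants with toy fitness labels.
L, N = 50, 200
seqs = ["".join(rng.choice(list(AA), size=L)) for _ in range(N)]
w_true = rng.standard_normal(EMB_DIM)          # hidden "ground-truth" weights
X = np.stack([embed(s) for s in seqs])
y = X @ w_true + 0.1 * rng.standard_normal(N)  # noisy toy fitness values

# Fit a simple regressor on the embedding features and check held-out accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = Ridge(alpha=1.0).fit(X_tr, y_tr)
print(f"held-out R^2: {model.score(X_te, y_te):.2f}")
```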
References:
[1] Gall, B., et al. (2025). Protein evolution as a complex system. Nature Chemical Biology, 21, 1293–1299.
[2] Sandhu, M., et al. (2025). Investigating the determinants of performance in machine learning for protein fitness prediction. Protein Science, 34, e70235.
[3] Matthews, D.S., et al. (2024). Leveraging ancestral sequence reconstruction for protein representation learning. Nature Machine Intelligence, 6, 1542–1555.
[4] Vongsouthi, V., et al. (2025). Ancestral reconstruction of polyethylene terephthalate degrading cutinases reveals a rugged and unexplored sequence-fitness landscape. Science Advances, 11, eads8318.