It is well established that protein structure is more conserved than sequence, a fact rooted in the constraints of 3D geometry. This is why structural comparisons often recover deep evolutionary signals where sequence alignments fail. This raises an intriguing question: if moving from a 1D sequence to 3D geometry reveals so much more evolutionary history, what happens if we increase the dimensionality even further? Could analyzing proteins in hundreds of dimensions uncover signals that remain invisible even in 3D? This is where protein language models become useful: they encode the complex interplay of sequence, structure, and function in rich, high-dimensional embeddings.
Our method, Structome-DeepRoots, harnesses these representations in a novel phylogenetic framework. Rather than pooling residue embeddings into a single per-protein vector, which discards information, DeepRoots computes each pairwise distance from the average cosine similarity of individually paired residue embeddings, with the residue pairs identified via structural superposition. This alignment-aware approach provides a granular comparison in a 1280-dimensional latent space. To complete the framework, we also introduce a novel embedding perturbation model for rapid statistical bootstrapping. In this talk, I will demonstrate how this high-dimensional signal resolves complex relationships in the globin superfamily and quantitatively outperforms our established TM-score baseline (Structome-TM) on the PhyloBench benchmark.
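To make the distance idea concrete, here is a minimal sketch of averaging cosine similarity over paired residue embeddings. It is not the DeepRoots implementation. The function names, the toy random embeddings, the hand-written residue pairs, and in particular the conversion of similarity to distance as one minus the mean similarity are all assumptions made for illustration; in practice the embeddings would come from a protein language model and the pairs from a structural superposition.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def paired_residue_distance(emb_a, emb_b, pairs):
    """Distance between two proteins from their paired residue embeddings.

    emb_a, emb_b: per-residue embeddings, one vector per residue.
    pairs: (i, j) index pairs meaning residue i of A is matched to
           residue j of B (assumed to come from a structural superposition).
    Assumption: distance = 1 - mean cosine similarity over the pairs.
    """
    if not pairs:
        raise ValueError("at least one residue pair is required")
    sims = [cosine_similarity(emb_a[i], emb_b[j]) for i, j in pairs]
    return 1.0 - sum(sims) / len(sims)


# Toy data standing in for language-model embeddings (hypothetical values).
DIM = 1280
random.seed(0)
emb_a = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(6)]
emb_b = [[random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(7)]

# Hypothetical residue pairing; unmatched residues are simply left out.
pairs = [(0, 0), (1, 1), (2, 3), (4, 5)]

distance = paired_residue_distance(emb_a, emb_b, pairs)
print(f"distance over {len(pairs)} paired residues: {distance:.3f}")
```

Computing such a distance for every pair of proteins would give a distance matrix that a standard distance-based tree-building method could take as input. The embedding perturbation model used for bootstrapping is not sketched here, since its details are not given above.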
Resources in the Structome suite are accessible here: https://biosig.lab.uq.edu.au/structome/