SAIR: Enabling Deep Learning for Protein-Ligand Interactions with a Synthetic Structural Dataset

Date:

June 21, 2025

Type:

Preprint

Publication:

bioRxiv

Author(s):

Pablo Lemos, Zane Beckwith, Sasaank Bandi, Maarten van Damme, Jordan Crivelli-Decker, Benjamin J. Shields, Thomas Merth, Punit K. Jha, Nicola De Mitri, Tiffany J. Callahan, AJ Nish, Paul Abruzzo, Romelia Salomon-Ferrer, Martin Ganahl

Abstract

Accurate prediction of protein-ligand binding affinities remains a cornerstone problem in drug discovery. While binding affinity is inherently dictated by the 3D structure and dynamics of protein-ligand complexes, current deep learning approaches are limited by the lack of high-quality experimental structures with annotated binding affinities. To address this limitation, we introduce the Structurally Augmented IC50 Repository (SAIR), the largest publicly available dataset of protein-ligand 3D structures with associated activity data. The dataset comprises 5,244,285 structures across 1,048,857 unique protein-ligand systems, curated from the ChEMBL and BindingDB databases and then computationally folded using the Boltz-1x model. We provide a comprehensive characterization of the dataset, including distributional statistics of proteins and ligands, and evaluate the structural fidelity of the folded complexes using PoseBusters. Our analysis reveals that approximately 3% of structures exhibit physical anomalies, predominantly related to internal energy violations. As an initial demonstration, we benchmark several binding affinity prediction methods, including empirical scoring functions (Vina, Vinardo), a 3D convolutional neural network (OnionNet-2), and a graph neural network (AEV-PLIG). While machine learning-based models consistently outperform traditional scoring function methods, neither class exhibits a high correlation with ground truth affinities, highlighting the need for models specifically fine-tuned to synthetic structure distributions. This work provides a foundation for developing and evaluating next-generation structure and binding-affinity prediction models and offers insights into the structural and physical underpinnings of protein-ligand interactions. The dataset can be found at https://www.sandboxaq.com/sair.
