On Improving Experimental Binding Affinity Predictions with Synthetic Data

Date:

March 4, 2026

Type:

Preprint

Publication:

bioRxiv

Author(s):

Kevin Ryczko, Phyo Phyo Zin, Jordan Crivelli-Decker, Ly Le, Punit K. Jha, Benjamin J. Shields, Pablo Lemos, Sasaank Bandi, Maarten van Damme, Amogh Sood, Lee Huntington, Mary Pitman, Martin Ganahl, Andrea Bortolato

Abstract

The success of deep learning models for binding affinity prediction depends critically on expanding experimental data with reliable synthetic data. We extend the Structurally Augmented IC50 Repository (SAIR) with ≈80K absolute free energy perturbation (AFEP) calculations and present two distinct data splits, SAIR-FEP and SAIR-OOD (out-of-distribution), to simulate realistic drug discovery scenarios. We compare sequence-based proteochemometric (PCM) models with state-of-the-art, structure-based deep learning models and demonstrate that PCM models can be enhanced by physics-based descriptors. While structure-based deep learning methods capture finer geometric detail, their performance is highly sensitive to the input structure. By filtering for high-confidence, co-folded complexes, we show that performance improves predictably, whereas training indiscriminately on all complexes does not yield performance gains. Finally, using the SAIR-OOD split, we demonstrate that simultaneous training on synthetic and experimental data improves performance on publicly available experimental benchmarks. These results provide a clear strategy for using synthetic data to advance experimental binding affinity predictions.
