Affinity Landscapes

Short linear motifs (SLiMs) encode compact regulatory instructions across the proteome, and their recognition by peptide recognition domains (PRDs) such as SH2 domains underlies the wiring of major signaling circuits. Because SLiM-PRD interfaces are acutely sensitive to local sequence context, single amino-acid substitutions can reconfigure interaction networks, enabling evolutionary innovation but also fostering pathogenic rewiring. Yet despite decades of progress, accurate quantitative prediction of SH2–SLiM binding affinities has remained elusive. Classical position-specific scoring matrices and modern discriminative classifiers readily identify candidate sites, but they fail to reproduce biophysically meaningful free-energy landscapes. In this context, a collaborative study from the Bussemaker and Shah Groups at Columbia University, published in Protein Science, establishes an integrated experimental–computational framework. Built on massively diverse peptide-display libraries, multi-round affinity selection, next-generation sequencing, and the ProBound free-energy regression method, it yields quantitative, domain-specific sequence-to-affinity models capable of predicting binding free energies across the full theoretical ligand space.

Figure 1. Overview of the concerted experimental and computational strategy for generating SH2-peptide binding free energy models. a| Design of peptide-display libraries. b| Schematic showing how a randomized bacterial display library underwent repeated bead-based affinity selection for SH2 binding. In each selection round, the library was sequenced before and after selection. c| Overview of the regression framework used to learn energetic binding models from the sequencing data. For each possible binding site, the energy received independent additive contributions from the residues flanking the phosphorylated tyrosine, thus controlling for the binding-site context in which the residues reside. These energy contributions were estimated using maximum likelihood estimation, where the likelihood of the observed sequence counts was evaluated by first computing the total affinity for each observed sequence, controlling for multiple possible binding offsets and non-specific binding, and then computing the binomial likelihood for each round, assuming linear selection. d| Sequence logo displaying the inferred energy contributions as letters whose height reflects the magnitude of the contributions, relative to the mean for each position.
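The affinity calculation described for panel c can be illustrated with a minimal Python sketch. It is not the ProBound implementation: the function names, the `"*"` default key, the use of `Y` as a stand-in for phosphotyrosine, and every energy value in `TOY_ENERGY` are hypothetical and chosen only to show how independent per-position contributions, a sum over binding offsets, and a non-specific term combine into one total affinity.

```python
import math

RT = 0.593  # kcal/mol near 298 K

# Hypothetical toy energies (kcal/mol) for model positions -2..+3; NOT values
# from the study. Negative = more favorable. "*" is a per-position default.
TOY_ENERGY = [
    {},                    # -2
    {},                    # -1
    {"Y": 0.0, "*": 3.0},  #  0 ("Y" stands in for the phosphotyrosine)
    {"E": -0.5},           # +1
    {"N": -1.0},           # +2
    {"I": -0.3},           # +3
]


def window_energy(window, energy):
    """Additive energy of one register: sum of independent per-position terms."""
    return sum(pos.get(aa, pos.get("*", 0.0)) for pos, aa in zip(energy, window))


def total_affinity(peptide, energy, nonspecific=0.0):
    """Relative affinity summed over all binding offsets plus a non-specific term."""
    width = len(energy)
    total = nonspecific
    for start in range(len(peptide) - width + 1):
        ddg = window_energy(peptide[start:start + width], energy)
        total += math.exp(-ddg / RT)  # Boltzmann weight of this register
    return total


if __name__ == "__main__":
    # Two registers fit in this 7-mer; the one centered on Y dominates the sum.
    print(total_affinity("AAAYENI", TOY_ENERGY, nonspecific=0.01))
```

Summing over registers means the model never has to be told where the binding site sits in a peptide, which is what allows the same scheme to be applied to fully random libraries.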

Central to this advance is the recognition that high-throughput selection of extremely degenerate peptide libraries introduces inherent noise, bias, positional ambiguity, and non-specific carryover, especially when library designs differ in complexity and coverage. Rather than relying on naive enrichment ratios, which conflate true energetic effects with library-dependent artifacts, the authors employ ProBound to learn additive free-energy contributions directly from the multi-round selection data. By summing over all possible binding registers, controlling for non-specific selection, and maximizing the likelihood of observed sequence counts across rounds, the method deconvolves energetic preferences from the substantial experimental noise that plagues conventional enrichment-based estimates. Applied to multiple SH2 domains, including c-Src, Fyn, Grb2, Lyn, Yes, and Blk, the method yields free-energy matrices that robustly capture specificity determinants from residues flanking the central phosphotyrosine and reveal a dominant energetic footprint spanning positions −2 through +3.
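To make the contrast with enrichment ratios concrete, the following Python sketch scores read counts from one selection round with a binomial likelihood. It shows one plausible reading of "binomial likelihood under linear selection" and should not be taken as ProBound's actual objective: the function name, the `scale` nuisance parameter, and the example counts are all assumptions made for illustration.

```python
import math


def round_log_likelihood(counts, affinities, scale=1.0):
    """Binomial log-likelihood of (before, after) read counts for one round.

    Assumption: under linear selection, a sequence's post/pre abundance ratio
    is proportional to its total affinity K, so a read of that sequence falls
    in the post-selection sample with probability p = scale*K / (1 + scale*K).
    `scale` is a hypothetical nuisance parameter absorbing sequencing depth.
    """
    ll = 0.0
    for (n_before, n_after), k in zip(counts, affinities):
        p = scale * k / (1.0 + scale * k)
        n = n_before + n_after
        # log of the binomial coefficient C(n, n_after)
        log_binom = (math.lgamma(n + 1) - math.lgamma(n_after + 1)
                     - math.lgamma(n_before + 1))
        ll += log_binom + n_after * math.log(p) + n_before * math.log(1.0 - p)
    return ll


# Hypothetical counts: sequence 1 is enriched by selection, sequence 2 depleted.
counts = [(10, 90), (90, 10)]
good = round_log_likelihood(counts, [9.0, 1.0 / 9.0])  # affinities match data
bad = round_log_likelihood(counts, [1.0 / 9.0, 9.0])   # affinities swapped
```

Here `good` exceeds `bad`, so an optimizer that adjusts the energy parameters to raise this quantity is pushed toward affinities that explain the counts. Because low-count sequences contribute little to the likelihood, they also carry little weight in the fit, unlike in a raw enrichment ratio.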

The strategy is validated across diverse library designs, ranging from semirational pTyrVar collections to fully degenerate random libraries. Importantly, the authors demonstrate that even unbiased X11 libraries, whose naïve state is dominated by weak binders and low-frequency sequences, can be transformed into informative training data through carefully optimized multi-round selection, recovering free-energy parameters that agree closely with models derived from more constrained library formats. This universality is significant: it allows a single random library to serve as a target-agnostic substrate for profiling entirely new peptide recognition domains without relying on prior knowledge of binding motifs. Furthermore, by removing constraints on the central tyrosine and still recovering coherent binding models, the approach proves capable of discovering binding determinants without imposing motif-specific assumptions.

Quantitative accuracy is established via rigorous low-throughput biophysical validation. Competitive fluorescence polarization measurements across panels of natural and engineered phosphopeptides reveal strong agreement between predicted and measured affinities over a range spanning nearly two orders of magnitude in KD. For Grb2, the hallmark requirement for an asparagine at +2 emerges naturally and is resolved quantitatively: peptides bearing N+2 are predicted, and observed, to bind with affinities that differ by several orders of magnitude from those of peptides in which this asparagine is substituted. Likewise, nuanced paralog-specific preferences distinguishing c-Src and Fyn are illuminated, and structural modeling suggests that subtle divergence at a single conserved pocket residue, N201 in c-Src vs. H199 in Fyn, reshapes the electrostatics of the pY+1 binding cleft, rationalizing experimentally observed differences in affinity for glutamate, aspartate, or alanine at that position.
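Comparing predicted free energies with measured KD values relies on the standard thermodynamic relation ΔΔG = RT ln(KD,a / KD,b). The short Python sketch below applies it; the function names and the default temperature of 298.15 K are assumptions for illustration, not details taken from the study.

```python
import math

R_KCAL = 1.987204e-3  # gas constant in kcal/(mol*K)


def ddg_from_kd_ratio(kd_a, kd_b, temperature_k=298.15):
    """Free-energy difference (kcal/mol) between ligands a and b.

    Positive values mean ligand a binds more weakly than ligand b.
    """
    return R_KCAL * temperature_k * math.log(kd_a / kd_b)


def kd_fold_change(ddg, temperature_k=298.15):
    """Inverse relation: the KD ratio implied by a free-energy difference."""
    return math.exp(ddg / (R_KCAL * temperature_k))


# A 100-fold difference in KD corresponds to roughly 2.7 kcal/mol at 25 °C.
two_orders = ddg_from_kd_ratio(100.0, 1.0)
```

On this scale, a validation range of nearly two orders of magnitude in KD spans somewhat less than 3 kcal/mol of binding free energy.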

Armed with validated affinity models, the authors extend their analysis to the human tyrosine phosphoproteome, assigning relative affinities to thousands of known phosphosites while filtering for physiologically plausible co-expression. The resulting predictions identify candidate SH2 interactors, including several not previously reported in curated interaction databases, and the model’s ability to discriminate known binding from non-binding phosphopeptides in external datasets further supports its generality. The framework is also applied to single-amino-acid variants from clinically relevant mutation databases; in multiple cases, predicted gains or losses of affinity align with known pathogenic signaling mechanisms, offering mechanistic insight into allelic rewiring events involving Vav2, ZAP-70, HS-1, and ARPC1B.
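Under an additive model, the predicted effect of a single-amino-acid variant near a phosphosite reduces to the difference between two matrix entries at the affected position. The Python sketch below illustrates that idea only: `variant_effect`, the six-residue site encoding (positions −2 through +3, with `Y` standing in for phosphotyrosine), and all values in `TOY_ENERGY` are hypothetical and are not taken from the published models.

```python
import math

RT = 0.593  # kcal/mol near 298 K
OFFSETS = (-2, -1, 0, 1, 2, 3)  # positions relative to the phosphotyrosine

# Hypothetical toy energies (kcal/mol); negative = more favorable.
TOY_ENERGY = {
    -2: {}, -1: {}, 0: {},
    1: {"E": -0.5},
    2: {"N": -1.0, "A": 0.4},
    3: {"I": -0.3},
}


def variant_effect(wt_site, mut_site, energy):
    """Return (ddG, fold change in affinity) for a variant of one binding site.

    Both sites are six-residue strings covering positions -2..+3. Only the
    substituted positions contribute, because the model is additive.
    """
    if len(wt_site) != len(OFFSETS) or len(mut_site) != len(OFFSETS):
        raise ValueError("sites must cover positions -2 through +3")
    ddg = 0.0
    for offset, wt, mut in zip(OFFSETS, wt_site, mut_site):
        if wt != mut:
            ddg += energy[offset].get(mut, 0.0) - energy[offset].get(wt, 0.0)
    return ddg, math.exp(-ddg / RT)  # fold > 1 means a predicted gain


# N-to-A substitution at +2 in a toy site: predicted loss of affinity.
ddg, fold = variant_effect("AAYENI", "AAYEAI", TOY_ENERGY)
```

A proteome-wide scan would apply the same lookup to every annotated phosphosite and its variants, and then rank the results by predicted change in affinity.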

Overall, the study demonstrates that quantitative sequence-to-affinity modeling of SH2 domains is achievable directly from high-throughput peptide display data, without requiring explicit affinity measurements. By accurately reconstructing free-energy landscapes from raw sequencing counts and capturing paralog-specific determinants, this approach offers a powerful route to deciphering SLiM-mediated interaction networks at proteome scale. More broadly, it establishes a generalizable blueprint for mapping binding energetics across diverse peptide-recognition domains, moving beyond binary classifications toward predictive, energetically grounded models of cellular signaling fidelity and its perturbation by natural or disease-associated variation.

Figure 4. Flanking specificity of the c-Src, Grb2 and Fyn SH2 domains. a| Energy logos for the c-Src SH2, Fyn SH2 and Grb2 SH2 binding models. b| Scatter plots comparing the predictions from the binding models in a| with competitive fluorescence polarization measurements. Vertical bars indicate standard error. Dashed black lines, and the accompanying model expressions and r² values, indicate linear regression fits to the log-transformed KD values. c| Comparison of the c-Src and Fyn binding models from a| using an energy logo (top), showing the difference between the model coefficients, and a scatter plot (bottom). d|, e| AlphaFold 3 models of the c-Src and Fyn SH2 domains, shown as surfaces in the central panels, bound to a high-affinity phospho-peptide, GHH-pY-EEIG, shown as purple sticks. Residues on the SH2 domains colored in beige are sites where c-Src and Fyn diverge. A key divergent site, N201 in c-Src and H199 in Fyn, is shown in teal. The zoom-in panels highlight key residues in a cationic pocket on the SH2 domain that interacts with the +1 residue on the peptide ligand.
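The kind of fit summarized in panel b can be reproduced in outline with an ordinary least-squares regression on log-transformed KD values. The Python sketch below is illustrative only: the function name is hypothetical, the choice to regress log10(KD) on predicted ΔΔG is an assumption about how the axes are arranged, and the three data points are made up rather than measured.

```python
import math


def fit_log_kd(predicted_ddg, measured_kd):
    """Least-squares fit of log10(KD) against predicted ddG.

    Returns (slope, intercept, r_squared).
    """
    x = list(predicted_ddg)
    y = [math.log10(kd) for kd in measured_kd]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared


# Made-up example: KD rises tenfold per unit of predicted ddG.
slope, intercept, r2 = fit_log_kd([0.0, 1.0, 2.0], [1.0, 10.0, 100.0])
```

Fitting in log space treats a twofold error the same way for a tight binder as for a weak one, which is appropriate when affinities span orders of magnitude.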


Publication Information


Author Information

Dr. Dejan Gagoski is a Staff Scientist at the Department of Biochemistry at the Albert Einstein College of Medicine in New York City, and former Postdoctoral Researcher at the Department of Biological Sciences and the Chemistry Department at Columbia University. He received his Ph.D. in Biochemistry and Protein Engineering from the University of Queensland, Australia in 2016, where he focused on utilizing cell-free protein expression for higher-throughput enzyme and protein characterization screenings.

Dr. Gagoski’s current research integrates protein and peptide display platforms, high-throughput selections, deep sequencing, and computational analysis to model peptide-protein and protein-protein interactions. He uses this integrated pipeline to develop antiviral antibodies, protein-based vaccines, and therapeutic binders for B-cell receptor-associated leukemia.

Dr. H. Tomas Rube is an Assistant Professor in Applied Mathematics at University of California, Merced. His research lies at the interface of machine learning, biophysics, and high-throughput biology, and centers on understanding the grammar by which proteins recognize peptide, RNA, and DNA molecules. He is particularly interested in how close coordination between algorithms and experimental design creates new opportunities for profiling biomolecular interactions with increased throughput and rigor.

Dr. Rube obtained his Ph.D. in Physics from Stanford University, followed by postdoctoral research in Prof. Jun Song’s lab at UCSF and Prof. Harmen Bussemaker’s lab at Columbia University. His research program is funded by the Hellman Fellows Fund and the National Science Foundation.