Confidence scoring for deep learning-predicted antibody-antigen complexes: AntiConf as a precision-driven metric

Ünsal, Serbülent; Holland, Benjamin; Sardag, Inci; Timucin, Emel

doi:10.1093/bib/bbag137

Briefings in Bioinformatics, vol. 27, no. 2, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 27, Issue: 2
Publication Date: 2026
DOI: 10.1093/bib/bbag137
Journal Name: Briefings in Bioinformatics
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Library, Information Science & Technology Abstracts (LISTA), MEDLINE, Directory of Open Access Journals
Keywords: AlphaFold2, AF3-based implementations, multimer prediction, antibody-antigen complexes, model confidence scores
Boğaziçi University Affiliated: Yes

Abstract

Accurate determination of antibody-antigen (Ab-Ag) complex structures is critical for therapeutic development. While deep learning-based methods, beginning with AlphaFold2 (AF2), have revolutionized multimer prediction, the optimal strategies for Ab-Ag modeling and the reliability of the associated confidence scores remain active areas of research. This study evaluates the performance of AF2, Boltz-1, Boltz-1x, Boltz-2, Chai-1, Protenix, Protenix-1, OpenFold3, and ESMFold on a curated dataset of 200 Ab-Ag complexes. Among the nine methods tested, Protenix-1 emerged as the top performer, with Chai-1 consistently ranking second across multiple success metrics, closely followed by AF2. Recycling iterations had diverse effects: AF2, Chai-1, and the Protenix variants benefited from increased cycles, whereas the Boltz variants did not. We analyzed various model confidence scores, noting high precision from pDockQ2 and high recall from the predicted Template-Modeling (pTM) score. By integrating these two scores, we developed antibody confidence (AntiConf), a novel metric that achieves superior performance for all methods in terms of precision and recall. These strengths make AntiConf a valuable post-prediction score for both computational predictions and downstream experimental workflows, reflecting its potential to improve Ab-Ag complex predictions by AF2- and AF3-based architectures. Altogether, this study addresses current limitations in deep learning-based Ab-Ag complex prediction, showcases the potential of AntiConf for future assessment studies, and provides a guideline for improving the accuracy of Ab-Ag complex prediction.
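
The abstract contrasts confidence scores by precision and recall. As a minimal illustration of what those two quantities mean for a thresholded confidence score, the sketch below labels a model "accepted" when its score meets a cutoff and compares that against whether the model was actually successful. The function name, the score values, the success labels, and the thresholds are all hypothetical and chosen only for illustration; this is not the AntiConf formula, which the abstract does not specify.

```python
from typing import Sequence, Tuple


def precision_recall(
    scores: Sequence[float],
    is_success: Sequence[bool],
    threshold: float,
) -> Tuple[float, float]:
    """Precision and recall when models with score >= threshold are accepted.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(scores) != len(is_success):
        raise ValueError("scores and is_success must have the same length")

    accepted = [s >= threshold for s in scores]
    tp = sum(a and y for a, y in zip(accepted, is_success))      # accepted, truly successful
    fp = sum(a and not y for a, y in zip(accepted, is_success))  # accepted, but unsuccessful
    fn = sum(not a and y for a, y in zip(accepted, is_success))  # rejected, but successful

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up confidence scores and success labels for six predicted complexes.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, False]

strict = precision_recall(scores, labels, threshold=0.7)    # few accepted, all correct
lenient = precision_recall(scores, labels, threshold=0.35)  # more accepted, one wrong
print(strict, lenient)
```

With these made-up numbers, the stricter cutoff accepts only correct models but misses one success, while the more lenient cutoff recovers every success at the cost of one false acceptance. This is the kind of trade-off that motivates combining a high-precision score with a high-recall one.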