Poor Predictions When Hangry
AI on a Diet
I don't mind stealing bread
From the mouths of decadence
But I can't feed on the powerless
When my cup's already overfilled…
I'm goin’ hungry
“Hunger Strike” - Temple of the Dog
After my last two Substack posts, I got a critical (but loving!) phone call from my Heimdall Bio co-founder and CTO asking who the intended audience was and why my writing was “so fluffy.” We had a spirited debate about the value of smuggling the liberal arts into technical writing. In this post (and this post only), I’ll try to limit the literary flourishes while keeping the analysis.
AI is powering a profound pivot in protein science, propelled by protein language models (pLMs). Research groups are now using these foundational frameworks to predict the functions of unfamiliar sequences and propose proteins that have never existed. But as with people, feeling hungry prompts poor decisions.
The accuracy and real‑world utility of these predictions are bounded by the quality of the training data. Large, diverse, and well‑annotated protein datasets are essential, and much of the data needed for industrial enzyme design does not yet exist in AI‑ready form.
This post examines the technical limitations of current protein sequence–function databases and how they constrain AI‑driven protein design: labeling lapses, limits on learning, dataset distortion, overfitting, and the resulting ceiling on creative, useful designs.
Labeling
AI‑mediated protein prediction starts with labeled examples: catalytic activities, binding partners, structural roles, and other functional annotations. Labels lay the linguistic and logical foundation for what the model learns. When they’re lacking, the whole learning loop suffers.
The label space is large because proteins often perform multiple functions, and those functions are context‑dependent. High‑quality labels come from manual annotation that integrates experimental data with expert review. This process is slow, expensive, and unevenly distributed across protein families. Automated annotation can scale faster but depends on incomplete structural data and faces algorithmic issues, such as handling sequences of widely varying lengths without losing relevant information or adding noise.
The result is that many proteins are incompletely or inconsistently labeled. These gaps are a fundamental limit on training predictive models that must learn general functional rules.
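To make the labeling problem concrete, here is a minimal sketch (with hypothetical sequence names and function labels, not drawn from any real database) of multi-label encoding that preserves the distinction between "confirmed negative" and "never assayed." Many pipelines collapse the two, which silently injects false negatives into training.

```python
from typing import Dict, List, Optional

# Hypothetical toy annotations. None means "never assayed," which is
# different from a confirmed negative -- a distinction training data
# often loses when gaps are filled with zeros.
ANNOTATIONS: Dict[str, Dict[str, Optional[bool]]] = {
    "seqA": {"hydrolase": True, "metal_binding": None, "membrane": False},
    "seqB": {"hydrolase": None, "metal_binding": True, "membrane": None},
}

LABELS: List[str] = ["hydrolase", "metal_binding", "membrane"]


def encode(protein: str):
    """Return (targets, mask): targets is 0/1 per label; mask marks
    which labels carry real evidence and can contribute to the loss."""
    ann = ANNOTATIONS[protein]
    targets = [1 if ann.get(label) else 0 for label in LABELS]
    mask = [0 if ann.get(label) is None else 1 for label in LABELS]
    return targets, mask
```

A masked loss trained on `targets * mask` learns only from observed labels, rather than treating every un-assayed function as a negative example.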
Generalization and the Limits of Learning
Generalization is a model’s ability to apply learned patterns to new, unseen data. For protein AI, it means correctly predicting the function of sequences outside the training distribution.
Models trained on limited datasets often perform well on proteins similar to those they’ve seen, but falter when faced with unfamiliar folds, rare chemistries, or sequences from underrepresented clades. Data filtering can deepen the dilemma by removing proteins without complete metadata or excluding functions with few known examples. While this reduces noise, it also eliminates the edge cases that would push the model toward more broadly applicable rules.
Without diversity in the training set, generalization degrades, and models risk misinterpreting sequences from unexplored regions of protein space — a point I touched on in my previous post The Protein Engineering Stack.
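One standard way to measure (rather than accidentally inflate) generalization is a homology-aware split: cluster sequences by similarity and hold out whole clusters, so the test set contains no near-duplicates of the training set. The sketch below uses `difflib` as a crude stand-in for real sequence identity; production pipelines use tools like MMseqs2 or CD-HIT instead.

```python
import difflib
from typing import Dict, List


def identity(a: str, b: str) -> float:
    # Crude string-similarity proxy for sequence identity;
    # real pipelines use alignment-based identity (MMseqs2, CD-HIT).
    return difflib.SequenceMatcher(None, a, b).ratio()


def homology_split(seqs: Dict[str, str], threshold: float = 0.4):
    """Greedily cluster sequences, then assign whole clusters to train
    or test, so no test sequence is a near-duplicate of a training one."""
    clusters: List[List[str]] = []
    for name, seq in seqs.items():
        for cluster in clusters:
            if identity(seq, seqs[cluster[0]]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    train = [n for c in clusters[::2] for n in c]  # alternate clusters
    held_out = [n for c in clusters[1::2] for n in c]
    return train, held_out
```

A model that scores well only on random splits, and poorly on cluster-held-out splits, is memorizing neighborhoods of sequence space rather than learning transferable rules.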
Dataset Bias
Most protein databases are biased toward human biology, model organisms, and a narrow set of pathogens. This “homocentric” bias reflects biomedical research priorities but leaves large portions of sequence space uncharted.
Models trained on such skewed data inherit the skew. They may perform well on overrepresented proteins but become unreliable when applied to sequences from environmental microbiomes, extremophiles, or under‑sampled evolutionary branches (“Fewer than 1% of microbial species have been cultured and characterized in the lab.” Metagenomics, Wikipedia (2025): https://en.wikipedia.org/wiki/Metagenomics).
This is a structural limitation: a model cannot learn functions it has never encountered. A narrow diet leads to narrow predictive potential.
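Quantifying this skew is straightforward: tally where a dataset's sequences come from and flag the taxa a model will see least. The sketch below uses an invented toy composition for illustration; the counts are not real figures.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical source-organism tags for a toy 100-sequence training set;
# the proportions are invented to illustrate skew, not measured.
SOURCES = (["human"] * 70 + ["mouse"] * 20
           + ["E. coli"] * 9 + ["deep-sea archaeon"] * 1)


def composition(sources: List[str]) -> Dict[str, float]:
    """Fraction of the dataset contributed by each source taxon."""
    n = len(sources)
    return {taxon: count / n for taxon, count in Counter(sources).items()}


def underrepresented(sources: List[str], floor: float = 0.05) -> List[str]:
    """Taxa below the floor fraction -- the regions of sequence space
    where the model's predictions will be least reliable."""
    return sorted(t for t, f in composition(sources).items() if f < floor)
```

An audit like this, run before training, makes the "homocentric" bias visible and tells you where targeted data collection would pay off most.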
Designs Bound by Data
True de novo protein design depends on a wide base of known structures, motifs, and functional chemistries to recombine or extend. Narrow training data forces models to produce designs that are statistical variations of known patterns.
Before 2012, programmable CRISPR/Cas9 genome editing was absent from the literature; no AI model trained on pre‑2012 data could have invented it. Models trained only on known nucleases and canonical DNA-targeting motifs could not extrapolate to the unique mechanism embedded in CRISPR loci. The same limitation applies today: models cannot propose genuinely novel mechanisms absent from their training set.
This creates an innovation ceiling. Even as algorithms advance, the designs they can dream up remain dangerously dependent on the depth and diversity of their data.
That said, recent advances in AI reasoning systems are promising. I can imagine future pLMs capable of exploring entirely new, non‑evolutionary fitness landscapes.
High Confidence, Wrong Results
Overfitting occurs when a model learns patterns too specific to its training data — memorizing noise and artifacts instead of mastering general principles. In protein prediction, overfitting leads to high‑confidence predictions that fail experimentally, reflected in the literature by the typically low fraction of in silico designs, filtered only on AlphaFold confidence metrics, that prove expressible and soluble in the lab.
Some biotech startups unintentionally measure success using benchmarks drawn from the same data their models were trained on. This inflates apparent accuracy while hiding weak generalization. A model that performs well in retrospective tests but fails in prospective, real‑world evaluation is overfit, regardless of its architecture.
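A simple safeguard against this kind of benchmark leakage is to partition the benchmark itself: separate entries that closely resemble the training set from genuinely novel ones, and report accuracy on each group. This sketch uses k-mer overlap (Jaccard similarity) as a cheap similarity proxy; the threshold and data are illustrative assumptions.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int = 3) -> Set[str]:
    """All overlapping k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """K-mer Jaccard similarity: cheap proxy for sequence relatedness."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


def split_benchmark(bench: Dict[str, str], train_seqs: List[str],
                    threshold: float = 0.5):
    """Separate benchmark entries that overlap the training set
    ('retrospective') from genuinely novel ones ('prospective'),
    so accuracy can be reported honestly on each group."""
    seen_like, novel = [], []
    for name, seq in bench.items():
        sim = max(jaccard(seq, t) for t in train_seqs)
        (seen_like if sim >= threshold else novel).append(name)
    return seen_like, novel
```

If accuracy is high on the `seen_like` group but collapses on the `novel` group, the headline benchmark number is measuring memorization, not generalization.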
You can imagine millions of new protein structures, but without expression and functional optimization, they’ll remain only that — imagined. The issue isn’t broken models, it’s malnourished ones. And when they’re not starving, they’re living off instant ramen (not that we all didn’t do that in grad school…).
Our Approach
At Heimdall Bio, our strategy for designing and developing high‑value industrial enzymes starts with improving the input data itself:
Expanding metagenomic sampling.
Enriching functional annotations.
Embracing underrepresented folds and chemistries.
We believe these are the non‑negotiable necessities for meaningful, measurable, and reproducible progress in AI‑driven enzyme design.