Building an Apprentice Biologist
Reinforcement Learning, Structural Reasoning, and Developing Scientific Intelligence
Scientist at the Table. Rembrandt van Rijn. (1624)
Well, some say life
Will beat you down
Break your heart
Steal your crown
So I started out
For God knows where
I guess I’ll know
When I get there
Tom Petty and Jeff Lynne. Learning to Fly (1991)
Over the past few weeks, my colleagues at Heimdall Bio ran a small benchmarking experiment with several of today’s leading AI models. The lion’s share of the work was done by Brandon Neel, a pre-eminent protein biologist who has recently become a master of RL saddle platforms for post-training foundation models.
The questions we asked ranged from senior undergraduate biology through graduate and postdoctoral-level protein research. At its core, the test centered on a simple question: could these systems reason about proteins the way a structural biologist does?
The setup was straightforward. Brandon assembled a set of reinforcement learning questions grounded in real protein structures. Each problem required the model to interpret structural data rather than merely recall biological facts. Some tasks involved identifying secondary structures. Others required calculating geometric relationships between atoms. Still others asked the model to interpret crystallographic metadata or reason about the biochemical role of structural features within a protein complex.
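To give a flavor of what "calculating geometric relationships between atoms" means in practice (this is my illustration, not one of the actual benchmark questions), here is a minimal sketch: parse fixed-column PDB ATOM records and measure the distance between two alpha carbons. The two records below are invented for the example.

```python
import math

# Two invented PDB-style ATOM records (fixed-column format).
# Coordinates x, y, z occupy columns 31-38, 39-46, 47-54 (1-indexed).
PDB_LINES = [
    "ATOM      1  CA  ALA A   1       0.000   0.000   0.000  1.00  0.00           C",
    "ATOM      2  CA  GLY A   2       3.800   0.000   0.000  1.00  0.00           C",
]

def parse_ca_coords(lines):
    """Extract (x, y, z) for each alpha-carbon ATOM record."""
    coords = []
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            coords.append(tuple(float(line[i:i + 8]) for i in (30, 38, 46)))
    return coords

def distance(a, b):
    """Euclidean distance between two 3D points, in angstroms."""
    return math.dist(a, b)

ca1, ca2 = parse_ca_coords(PDB_LINES)
print(f"CA-CA distance: {distance(ca1, ca2):.2f} A")  # ~3.8 A, typical for adjacent residues
```

A question is structural in this sense when the answer falls out of the coordinates, not out of anything written about the protein.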
In other words, the models were not asked to talk about biology. They were asked to work with it. The results were instructive.
Many of the systems were impressively fluent. They could describe enzyme mechanisms, summarize biological pathways, and explain the functional significance of particular residues. But when confronted with problems that required structural analysis, such as counting alpha helices, interpreting Ramachandran plots, computing atomic coordinates, or identifying structural equivalence between proteins, their reasoning often faltered.
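To make "interpreting Ramachandran plots" concrete: the plot's two axes are the backbone dihedral angles phi and psi, each defined by four consecutive backbone atoms (C of the previous residue, N, CA, C for phi; N, CA, C, N of the next residue for psi). A minimal sketch of that computation, using invented planar coordinates rather than a real structure:

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points,
    via the standard atan2 formulation."""
    sub = lambda a, b: tuple(x - y for x, y in zip(a, b))
    cross = lambda a, b: (a[1] * b[2] - a[2] * b[1],
                          a[2] * b[0] - a[0] * b[2],
                          a[0] * b[1] - a[1] * b[0])
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    b1, b2, b3 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b1, b2), cross(b2, b3)       # normals of the two planes
    norm = math.sqrt(dot(b2, b2))
    m1 = cross(n1, tuple(x / norm for x in b2))  # frame vector for the sign
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

# Invented coordinates: four coplanar points in a trans arrangement.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0)))  # 180.0
```

Answering a Ramachandran question correctly requires running exactly this kind of arithmetic over the structure's coordinates; no amount of fluent prose about alpha helices substitutes for it.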
In one benchmark sample, even the strongest model correctly answered only about two-thirds of the structural questions. Several others performed substantially worse.
At first glance, these numbers might appear discouraging. But that would miss the more interesting story.
What the experiment revealed was not simply a limitation of current models. It exposed a deeper distinction that runs through much of artificial intelligence today: the difference between linguistic knowledge and structural reasoning.
Language models are extraordinarily good at recognizing patterns in text. They can synthesize vast bodies of literature and produce explanations that sound remarkably authoritative. But science does not live primarily in language; it lives in structure.
Proteins are not sentences. They are three-dimensional objects governed by geometry, chemistry, and physical constraint. Understanding them requires more than recalling facts about biology. It requires the ability to interrogate structure: to compute, compare, and reason about the spatial organization of matter.
And it is precisely here, at the boundary between language and structure, that today’s AI systems struggle. But that boundary is also where something interesting begins to happen.
When we began evaluating the models, we did not simply score their answers. We also introduced hints, constraints, and structured feedback. The goal was not merely to measure performance but to observe how the systems adapted when the reasoning landscape changed.
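The essay does not spell out the grading protocol, but the idea of structured feedback can be sketched as a simple loop: re-pose the question with accumulated hints until a verifiable answer checks out. Everything below, the stub standing in for a model, the hint text, and the ground-truth verifier, is invented for illustration:

```python
def run_with_hints(attempt, verify, hints, max_rounds=4):
    """Re-query an answer source, accumulating hints after each failure.
    Returns (number_of_hints_used, final_answer)."""
    given = []
    answer = None
    for _ in range(max_rounds):
        answer = attempt(given)
        if verify(answer):
            return len(given), answer
        if len(given) < len(hints):
            given.append(hints[len(given)])
    return len(given), answer

# Stub standing in for a model: answers 9 helices until reminded of the
# counting convention, then answers 7. Both numbers are invented.
def stub_model(hints):
    return 7 if any("four residues" in h for h in hints) else 9

used, answer = run_with_hints(
    stub_model,
    verify=lambda a: a == 7,  # invented ground truth from the structure
    hints=["Count only helices of at least four residues."],
)
print(used, answer)  # 1 7
```

The interesting measurement is not the final answer but the trajectory: how many hints a system needs, and whether it actually changes its approach when it receives them.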
In an earlier essay, I described reinforcement learning as carving a kind of saddle into the terrain of reasoning. On one side lies the pressure to produce answers that sound plausible. On the other lies the pressure to produce answers that are structurally correct. A system navigating that terrain must learn to balance between the two.
What we saw in these experiments suggested that this saddle is not just a metaphor; it is something that can be observed directly in the behavior of the models.
When confronted with structural constraints, the models often began by relying on textual heuristics. But as hints and feedback accumulated, some of them began to shift their approach. They started performing calculations. They examined structural metadata more carefully and attempted to reconcile conflicting signals between language and geometry.
The behavior looked less like executing a static program and more like learning.
Watching this process unfold was familiar and oddly nostalgic; anyone who has trained graduate students in a scientific discipline would recognize the pattern. Early answers are often confident but shallow. Over time, as constraints accumulate and incorrect assumptions are exposed, reasoning begins to deepen. The student stops relying solely on explanation and begins engaging directly with the structure of the problem.
Scientific thinking rarely emerges fully formed. It is shaped through constraint, correction, and repeated confrontation with reality.
Which raises the provocative possibility that the path from language models to scientific intelligence may not lie primarily in scaling parameters or training on larger corpora of text. It may lie in something closer to what we did in this experiment: placing models in environments where structure matters, where answers can be verified, and where reasoning must survive the pressures imposed by the geometry of the scientific world.
If that is true, then structural biology may turn out to be one of the most revealing training grounds we have for artificial general intelligence (AGI).
Proteins are unforgiving teachers. Their geometry does not bend to plausible explanations. They reward only those representations that preserve what remains true under variation. And that may be exactly the kind of landscape where artificial intelligence begins to learn how to think like a scientist.
If artificial intelligence is ever to become scientific intelligence, it will not happen through language alone. Science advances by discovering what remains stable under constraint. It advances by finding the invariants that persist when systems are perturbed and then examined from different angles. Structural biology offers a remarkable proving ground for this kind of reasoning. Proteins are physical objects, governed by geometry and chemistry, and their structures impose challenging constraints on explanation. They demand calculations, comparisons, and hypotheses that survive contact with reality. For that reason, structural biology may turn out to be one of the first domains where we can watch artificial intelligence make the transition from fluent description to genuine scientific reasoning.


