My training is as a protein biochemist, with detours through immunology, virology, and proteomics. My last academic role was building a proteomics centre at WEHI that lived right at the interface of biology, medicine, technology, and bureaucracy. Seeing the coming wave of AI, I found the transition to start-up life an easy one to make. This is where I am writing from: from the long, slow, adversarial process by which a claim becomes a fact, and from a first-hand view of how companies and scientists are beginning to grapple with AI.
I am not a software engineer. But for the past decade I have built startups alongside them, and I have learned how much invisible discipline sits behind systems that work at scale. Generated output is useful only when there are habits, interfaces, tests, review gates, and provenance systems that catch its failures. Science now needs its own version of those habits.
AI for Science isn't a capability problem. The capability is already here. The harder problem is what the model can be told, and whether its work can be made defensible. Science already has a name for that defensibility: warrant. Warrant is what we have to build.
In a recent Mass Dynamics survey of scientists across our community, the pattern was coherent: AI use is already widespread, but trust remains thin.
Scientists are using AI for coding, scripting, literature review, and administrative work. These are the places AI has already become, in one respondent's phrasing, a "critical tool." Scientific judgement is another story. The concerns were specific: hallucinated citations, IP leakage, the tells of AI-generated prose appearing in professional output, and a worry that handing analysis to a model erodes scientific integrity itself.
Read carefully, every one of those is a trust concern, and underneath each one sits the same root cause: a capable model is being asked to do scientific work without the context scientific work requires.
Trust is not a feeling scientists will eventually develop toward AI. It is an infrastructure property. It appears when context, provenance, method choice, review, and accountability are made legible. That infrastructure is what produces warrant, and warrant is what science has always required of itself.
Software engineering ran into a version of this problem fifteen years ago. The answer the field converged on, imperfectly but decisively, was infrastructure.
Across the past year on this blog I have argued that proteomics workflows need scalable, AI-ready infrastructure, that integration is biology's next leap, that the field is industrialising and our job is to codify the wizardry of our best practitioners, and that protein-blindness is a decision problem, not a data problem. AI forces the question beneath all four: what infrastructure lets complex data become defensible decisions?
There is a tempting analogy here: that AI for Science will follow the trajectory (read: disruption) of AI for software development. I don't think it will.
A program tells you in seconds whether it runs. A cell line tells you in weeks whether the hypothesis holds. A clinical endpoint tells you in years. The tight-loop dynamics that let software absorb AI so fast, from write to test to iterate to ship, do not exist for most of what scientists actually do.
The slow loop is not a bug. It is biology.
Code is checked against a specification someone wrote. Science is checked against a universe we are discovering, and that we are only beginning to comprehend.
A software program can be formally verified. A scientific hypothesis can only be provisionally held.
A software library has a stable contract: call it with inputs, get defined behaviour back. A scientific method is a judgement call about this dataset, this biology, this claim, and the quality of that judgement depends on the scientist making it.
Skill does not abstract away. It compounds.
Software has many trust mechanisms: automated tests, reproducible builds, formal code review. They confirm that a change behaves correctly against a spec. Peer review does something different. It asks whether a different method could have produced a different truth. It is anonymous, often adversarial, and load-bearing in a way ordinary engineering review is not.
Software has partial analogues, from security review to postmortems, but no direct equivalent.
In software, most of the context that matters lives in the artifact itself: the spec, the comments, the commit history, the test suite, the issue tracker. A capable model can read it. In science, most of the context that matters lives nowhere a model can reach: in the scientist's head, in the lab notebook, in the lineage of who trained whom, in the unwritten reasons this particular dataset gets normalised the way it does.
Every lab has its John or its Jane: the postdoc who knows why we never trust the first few runs on the Mass Spec after a column change, the PI who remembers the failed reagent batch from 2019, the analyst who knows the unspoken QC rules that no SOP quite captures. That tacit context is what the model is missing. It is also what makes scientific work scientific. And, less comfortably, it is also what makes scientific work hard to reproduce. The knowledge that gives a lab its edge is the knowledge that walks out when the postdoc leaves, the knowledge the next group cannot recover. There is strength and fragility in the same property.
The context surface area in science is vastly larger than in software, and almost none of it is currently machine-readable. I think this is the work ahead of us.
Software engineering did not absorb AI faster than every other discipline because its epistemology was superior. It absorbed AI faster because it had spent decades building infrastructure for a world in which generated output was about to arrive at volume.
Version control so every change is attributable. Automated tests so claims about behaviour can be checked without a human. Code review so nothing enters the shared record unchallenged. Reproducible builds so a result on one machine holds on another. Package registries so methods can be named, versioned, and depended on. Attribution trails so when something breaks, the record shows who changed what and when.
None of that was built to make code correct. It was built to make code trustworthy at scale, and to make context legible at scale: to capture, version, and surface the accumulated judgement of thousands of contributors so future work could build on it without re-deriving it from scratch.
That is the problem science now has. And much of the infrastructure that solves it is borrowable.
| Concern the field is naming | What software already learned |
|---|---|
| Hallucinated references and fabricated citations | Versioned, verified provenance. Every claim is traceable to the data and the decision that produced it. |
| IP and proprietary data leaking into frontier models | Governed, isolated environments. Scope enforced by policy, not by hope. |
| "AI slop" contaminating professional and organisational output | Review gates. No work enters the shared record without human sign-off. |
| Integration friction across a fragmented agent landscape | Stable, documented interfaces. The substrate does not care which agent is talking to it. |
| Scientific integrity and craftsmanship being eroded | Attribution, override, and the right to refuse. Automation drafts; scientific judgement decides. |
| Tacit expert knowledge that lives only in heads | Captured, versioned context. The wizardry of best practitioners encoded into infrastructure that AI and humans can both read. |
Every row is the software world's answer to a failure mode the science world has started living with. And every row is implementable as concrete infrastructure that has been shipped, debugged, and hardened in adjacent industries for two decades.
This is what I meant in last June's post when I wrote that the field needed to codify the wizardry of its best practitioners. At the time, that line was about reproducibility. Now, with AI in every lab, it is something more concrete: the John or Jane in every group has 10, 20, 30 years of context in their head, and the model has none of it. Codifying that wizardry is no longer just a nice-to-have for institutional memory. It is the substrate AI needs in order to do scientific work at all.
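As a deliberately simplified sketch of what that codified substrate could look like, here is one way a single piece of tacit knowledge, the column-change rule from earlier, might be made explicit and versionable. The structure and field names are my own assumptions for illustration, not an existing schema or product feature.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LabContextRule:
    """One piece of tacit lab knowledge, made explicit and versionable.

    Field names are illustrative, not a standard: the point is that the rule,
    its rationale, and its provenance travel together.
    """
    rule_id: str          # stable identifier so analyses can cite the rule
    statement: str        # the rule itself, stated as a check an agent can apply
    rationale: str        # why the rule exists -- the part that normally lives in someone's head
    established_by: str   # who encoded it (attribution)
    established_on: date  # when it was encoded (versioning)
    applies_when: str     # scope: the instrument, assay, or analysis step it governs

# Example: the kind of knowledge that usually walks out with the postdoc.
column_change_rule = LabContextRule(
    rule_id="qc-001",
    statement="Exclude the first three runs after a column change from quantitative comparisons.",
    rationale="Retention times and spray stability drift until the new column equilibrates.",
    established_by="J. Doe (hypothetical)",
    established_on=date(2019, 6, 1),
    applies_when="LC-MS runs immediately following a column change",
)
```

The design point is less the format than the pairing: the rule and its rationale are stored together, attributed and dated, so an agent (or a new team member) can apply the rule and also understand why it exists.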
I can't help imagining what becomes possible when the context layer is fully wired. I believe AGI is genuinely here today, but we are six months into an implementation phase that will likely last more than a decade. In the near future, an agent won't just run a generic differential analysis; it will run the lab's analysis, on this question, to the standards of this team, and flag every place its choices departed from precedent. It will surface anomalies that point to new biology or new avenues of research. It will dramatically accelerate the scientists who learn to operate it reliably and at scale. Scientific outputs will evolve from static reports into artifacts a reviewer can re-run end-to-end, with every method choice attributable and every alternative available. The hard question (could a different method have produced a different truth?) becomes cheap to ask, and will likely be surfaced in the primary analysis.
That is a different category of object. Every claim it produces carries its warrant with it.
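To make "carries its warrant with it" concrete, here is a minimal sketch of what such an object might contain. Again, the field names are illustrative assumptions rather than a description of any existing system; the point is that the reviewer's hard question can be answered from the record itself rather than reconstructed from memory.

```python
from dataclasses import dataclass

@dataclass
class WarrantedClaim:
    """A sketch of a claim that carries its warrant with it.

    The fields are assumptions for illustration: data, method, parameters,
    review, and departures from lab precedent are attached to the claim
    rather than recovered after the fact.
    """
    claim: str                            # the scientific statement being made
    dataset_version: str                  # the exact data the claim was derived from
    method: str                           # the named, versioned method that produced it
    parameters: dict                      # every choice that could have changed the result
    context_rules_applied: list[str]      # lab context rules (e.g. "qc-001") the analysis honoured
    departures_from_precedent: list[str]  # where the analysis deviated from the lab's prior practice
    reviewed_by: str                      # the human whose judgement signed it into the record

example = WarrantedClaim(
    claim="Protein X abundance increases roughly two-fold under treatment (hypothetical example)",
    dataset_version="experiment-042/v3",
    method="differential-abundance/1.4.2",
    parameters={"normalisation": "median", "fdr_threshold": 0.01},
    context_rules_applied=["qc-001"],
    departures_from_precedent=["Used median normalisation where the lab default is quantile"],
    reviewed_by="A. Scientist (hypothetical)",
)
```

Whether this lives as a dataclass, a database row, or metadata attached to a report matters less than the guarantee that the claim and its warrant cannot be separated.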
What compounds is the position. The lab that captures its own context first, on durable infrastructure, becomes the lab the field cites and the regulators trust. The platform that holds the context becomes the platform the work runs on. Warrant, captured in infrastructure, is not a feature. It is the new competitive moat for teams and organisations.
None of this replaces scientific judgement. It frees scientific judgement from the infrastructure burden it should never have been carrying.
The trust infrastructure software delivered was in service of correctness against a spec. Science is after something harder, and the era of AGI has made it harder still. For four centuries, scientific trust has been manufactured by the journal system: peer review, tiered prestige, citation. AI is now stress-testing that system, with plausible-looking results produced faster than they can be vetted. I believe this will draw us toward mechanisms in which trust sits less in publication and more in reproducible architectures: infrastructure that lets a stranger, years later, in a different lab, inspect, challenge, reproduce, and build on a claim because the claim itself carries its provenance.
Warrant is the auditable reason a stranger should believe a claim. It is what distinguishes science from every other form of human inquiry. Every part of the scientific path, from hypothesis through method to peer review to replication, exists to manufacture it.
In Consilience (1998), E.O. Wilson wrote that we were drowning in information, while starving for wisdom, and that the world would henceforth be run by synthesizers: people able to put together the right information at the right time and make important choices wisely. In a world where any model can generate plausible scientific prose in seconds, Wilson's distinction is no longer philosophical. It has morphed into a structural challenge. The cost of bad synthesis has collapsed. The premium on good synthesis (synthesis that is trustworthy, defensible, and reproducible) has never been higher. Warrant is what separates the two.
In a world where anyone can vibe-code a plausible-looking scientific result in an afternoon, warrant becomes the only thing that distinguishes science from noise. The discipline that has been quietly manufacturing warrant for four centuries is about to become the discipline whose practices the rest of the knowledge economy has to learn from.
AI does not diminish science. It raises the value of everything science is uniquely good at, provided the infrastructure underneath it can carry the weight.
At Mass Dynamics, this is increasingly how we think about Scientific Intelligence, the phrase Paula Burton introduced earlier this year for what happens when people, process, and technology combine to turn complex data into defensible decisions. Not a chat layer bolted onto a fragile workflow, but the governed context layer beneath scientific work, where every action remains scoped, attributable, reviewable, and tied back to both the data and the decisions that produced it.
AI for Science will not play out like AI for Code. The loops are longer, the ground truth is stranger, the context is more tacit, and the standards of warrant are higher. But software has already shown how trust can be engineered into complex work at scale.
Borrow the infrastructure. Keep the epistemology. Let the discipline that manufactures warrant become the one the rest of the world has to learn from.
This blog post was produced by Assoc. Prof. Andrew Webb using a combination of original notes from discussions and insights. All statistics and quotes referenced are drawn from internal research and published sources. Final compilation was completed with assistance from ChatGPT. Any errors or omissions are unintentional, and the content is provided for informational purposes only. The views, thoughts, and opinions expressed in this text belong solely to the author, and not necessarily to the author's employers, organisations, committees, or any other group or individual.