As my use of coding agents has matured, I revisit two principles that have always organized my scientific life:
- A manuscript should be the source of truth about a project. Thus one should be able to understand the state of a project by reading the manuscript.
- My project should be situated with respect to other papers in the field. Thus one should always keep the literature in mind.
The first principle calls for a way to check that the paper still matches the code that produced it. The second calls for a reference manager that an agent can actually use. Those two needs turned into a system I’ve been building called bipartite; this post describes the gap it fills, what it has grown to become, and what it has meant to me more broadly.
Bipartite enables what I call “manuscript-driven development.” It reflects the way I already think about science: there is a document, the manuscript, that is readable and verifiable and that I can keep in my head, and the work orbits around it. That is how I have always worked as a group leader. The new part is that much of the work is now done by agents.
PIs are vibe coders. Do you know a single PI who reads every line of code their group produces? I don’t. Our group is unusual in having formal pull request (PR) reviews, but even those are necessarily limited in depth, given the many projects going on at once. The PI’s job is to direct at the level of the manuscript. Casting a suspicious eye on results, noticing anomalies, guiding direction, reading the manuscript in detail: this is a scientific PI’s daily work.
Bipartite brings those actions into an agentic workflow.
Concretely, bipartite is a command-line tool, bip, written in Go, plus a library of skills.
(If you don’t know about skills, they are capabilities that can be loaded into your coding agent for a specific task; examples are below and a definition is here.)
The CLI is the engine, the skills are the interface, and they talk to GitHub, Semantic Scholar, Slack, and our compute servers.
The heart of the system is an agentic development loop. It runs as two coupled sides, ideas and experiments, with GitHub as the shared transport layer between them. On the ideas side, manuscript sessions surface new results and turn them into well-scoped GitHub issues. On the experiments side, those issues are picked up by autonomous workers running in dedicated clones, implemented, reviewed, and landed, which surfaces fresh results back for discussion. Two human touchpoints anchor the otherwise-autonomous flow: high-level discussion of new findings on the ideas side, and pre-merge review on the experiments side. This should feel familiar to PIs, who typically help design models/experiments and also interpret results.
The orchestration uses clones and spawning to run many workers in parallel.
It was great to see old-favorite tools like tmux rising to the front here and working beautifully.
(You can even have agents inspect what other agents are saying/doing in other tmux windows and offer their opinion.)
Bipartite is also a “Swiss army knife” of useful things.
For example, it maintains local databases that store information pulled from Slack, and has a scout skill that checks our servers for availability over SSH before we dispatch work to them.
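To make the scouting idea concrete, here is a minimal sketch of the kind of check such a skill could run. It is not bipartite's actual implementation: it asks each host for uptime over SSH and ranks the reachable ones by one-minute load average. The helper names (parse_load, ssh_uptime, rank_hosts) and the choice of load average as the availability signal are assumptions made for illustration.

```python
import re
import subprocess
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical helpers for illustration; not bip's real code.
# Matches both Linux ("load average:") and macOS ("load averages:") output.
LOAD_RE = re.compile(r"load averages?:\s*([\d.]+)")


def parse_load(uptime_output: str) -> Optional[float]:
    """Return the one-minute load average from uptime output, or None."""
    match = LOAD_RE.search(uptime_output)
    return float(match.group(1)) if match else None


def ssh_uptime(host: str, timeout_s: int = 5) -> Optional[str]:
    """Run uptime on host over SSH; return stdout, or None if unreachable."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",  # never prompt for a password
        "-o", f"ConnectTimeout={timeout_s}",
        host,
        "uptime",
    ]
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s + 5
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    return result.stdout if result.returncode == 0 else None


def rank_hosts(
    hosts: Iterable[str],
    probe: Callable[[str], Optional[str]] = ssh_uptime,
) -> List[Tuple[str, float]]:
    """Return (host, load) pairs for reachable hosts, least loaded first."""
    loads = []
    for host in hosts:
        output = probe(host)
        load = parse_load(output) if output is not None else None
        if load is not None:
            loads.append((host, load))
    return sorted(loads, key=lambda pair: pair[1])
```

With the default probe, rank_hosts contacts each host over SSH; passing a different probe function lets the ranking logic be exercised offline.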
Two of the design choices here were explicit, and might be surprising. The first design choice is that this is not meant to be a completely self-running system (in contrast to, for example, Gas Town). There is built-in friction, on purpose. Yes, there is automation: an agent can create new issues that it discovers in the course of implementing a PR. But the decision whether to pursue one of those is up to me.
I really like being able to discuss the work with the agent that implemented it or did the PR review. I mean, it’s cool that you can mention Claude in GitHub and have it make a PR for you, but if I did that I would lose this very important level of interaction. This interaction often leads to iteration: a good fraction of the time I really do need to intervene and direct what is happening. That is the part of the job I want to keep.
The second design choice is that this is not the tidiest possible setup, and that is also on purpose. For example, I use multiple full clones of a repository rather than git worktrees. That means those directories persist, so I can resume conversations with previous agents in them, find the artifacts they left behind, and recover when there has been a server reboot. A little mess (regularly cleaned up by agents) buys a lot of resilience.
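As a rough sketch of the clone-per-worker idea, the function below builds the commands that would give one worker its own full clone and its own detached tmux session. The directory and session naming scheme and the worker_plan helper are hypothetical, chosen only for illustration; bipartite's real layout may differ.

```python
from pathlib import Path
from typing import List, Tuple


def worker_plan(repo_url: str, issue: int, root: Path) -> Tuple[Path, List[List[str]]]:
    """Return the clone directory and the commands that would set up one worker.

    The "<repo>-issue-<n>" naming is an assumption made for this sketch.
    """
    repo_name = repo_url.rstrip("/").split("/")[-1].removesuffix(".git")
    session = f"{repo_name}-issue-{issue}"
    clone_dir = root / session
    commands = [
        # A full clone (not a worktree) persists independently of the
        # original checkout, so it can be revisited later.
        ["git", "clone", repo_url, str(clone_dir)],
        # A detached tmux session whose working directory is the clone.
        ["tmux", "new-session", "-d", "-s", session, "-c", str(clone_dir)],
    ]
    return clone_dir, commands
```

Each command list can be passed to subprocess.run(cmd, check=True). Because the directory outlives the session, a later agent or human can return to it after a reboot.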
But the killer features for me are the science-centric ones.
The first is cold-starting an agent into an informed discussion of where things stand.
The /bip-ms and /bip-epic skills scan the relevant repositories and issues and build a dashboard, so that a fresh session can talk with me about the current state of a project rather than starting from nothing.
The second concerns manuscripts directly: the /bip-ms-audit skill.
It reads the paper and the code in parallel and looks for places where the formulas, algorithms, dimensions, or complexity claims in the manuscript have drifted from what the code actually does.
This means that when I read and edit the paper, it is the authoritative truth.
It also means that the paper is reproducible!
This skill has found bugs in (human-authored) code, including my own.
I note that Lior Pachter has a similar and more formal project in span; we are looking into adapting those ideas.
The third is perhaps my favorite: a reference manager for your agents. It uses a git-backed reference library with search through Semantic Scholar and Asta. AI hallucinations begone! When the agent can read the paper it becomes an extraordinary research assistant rather than a teller of tall tales.
The fourth is how teams of contrarian agents verify ideas and results before they enter your permanent record:
- an idea is catalyzed into a GitHub issue as a collaboration with an agent
- it is then checked using the /bip-issue-check skill
- an implementation agent implements it, spurred on by an issue-lead subagent
- several subagents review the code from independent perspectives
- a surprising-conclusion-skeptic agent verifies results from a contrarian perspective.
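The sequence above can be thought of as a chain of gates, where work only moves forward once the previous check passes. The sketch below is a toy model of that structure, not how bipartite is implemented: in the real system each gate is an agent, whereas here each gate is a placeholder function, and the Gate and run_gates names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Gate:
    """One verification step; check returns True when the item passes."""
    name: str
    check: Callable[[dict], bool]


def run_gates(item: dict, gates: List[Gate]) -> Tuple[List[str], Optional[str]]:
    """Run gates in order and stop at the first failure.

    Returns the names of the gates that passed and the name of the gate
    that failed (None if every gate passed).
    """
    passed = []
    for gate in gates:
        if not gate.check(item):
            return passed, gate.name
        passed.append(gate.name)
    return passed, None


# Placeholder checks standing in for agent judgments.
example_gates = [
    Gate("issue-check", lambda item: item["well_scoped"]),
    Gate("code-review", lambda item: item["review_ok"]),
    Gate("skeptic", lambda item: item["result_survives_scrutiny"]),
]
```

Stopping at the first failing gate mirrors the intent of the list: a result that fails an early check never reaches the later ones, and nothing enters the record without passing the skeptic.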
My hope is that this enables everyone to be a PI: graduate students and postdocs included. The point of “hacking like a PI” is not the PI job title. It is working with a team of agents the way a PI works with a team of researchers, directing the science at a high level while the detailed work runs across many parallel sessions.
I should also be honest that this is not a solution that you should use with no sense of software engineering.
You still have to read, and you still need to know what good code looks like.
That said, there are tools to help keep the code in order.
The /bip-decay-audit skill, for instance, sweeps a repository for the ways agent-written code tends to decay (duplicate symbols, dead code, files that have grown into monsters) so that the interfaces stay sane.
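To give a flavor of what such a sweep can look for, here is a small sketch that covers two of those decay patterns for Python code: top-level functions or classes defined under the same name in more than one file, and files over a line-count threshold. It is an illustration under stated assumptions, not the skill itself: the audit_decay name and the 500-line default are hypothetical, and dead-code detection is left out because it needs real usage analysis.

```python
import ast
from collections import defaultdict
from pathlib import Path


def audit_decay(repo_root, max_lines: int = 500) -> dict:
    """Report duplicated top-level names and oversized files under repo_root."""
    defined_in = defaultdict(set)  # symbol name -> files that define it
    oversized = []                 # (path, line count) for files over the limit

    for path in sorted(Path(repo_root).rglob("*.py")):
        source = path.read_text(encoding="utf-8")
        n_lines = len(source.splitlines())
        if n_lines > max_lines:
            oversized.append((str(path), n_lines))
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue  # skip files that do not parse
        # Only top-level definitions are considered in this sketch.
        for node in tree.body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                defined_in[node.name].add(str(path))

    duplicates = {
        name: sorted(paths)
        for name, paths in defined_in.items()
        if len(paths) > 1
    }
    return {"duplicate_symbols": duplicates, "oversized_files": oversized}
```

A report like this is only a starting point: a name defined in two files is not always a mistake, so the findings still need a reader to decide what to merge or delete.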
Another win for me is that a group-wide collection of skills encodes standards I have spent years trying to instill.
I have an extensive set of wiki pages on how I like things done in the group.
This includes writing guidelines, such as a probably-overbearing rule about which verb tense belongs in which section of a paper.
I am fairly confident that nobody has ever read these documents, except for one German-origin postdoc who has since moved on to a faculty position.
Skills and subagents apply these rules automatically, which dramatically reduces the burden on me.
There is even a plot-reviewer agent that checks figures against Claus Wilke’s guidelines for figure design, so a plot gets reviewed for proportional ink and sensible chart types before I ever lay eyes on it.
There are caveats. First, this takes a lot of tokens. We are in a privileged position with HHMI support, and I don’t mean to offend anyone for whom token cost is real. That said, you can get a lot out of a Claude Max subscription or an equivalent plan.
Second, using any agentic science tool requires discipline and careful human evaluation. Luckily, my research group consists of thoughtful and rigorous scientists who love nothing better than calling bullshit on Claude. (This contrasts with stories from colleagues who have trainees who aren’t able to explain their AI slop.) However, bipartite reduces the “bullshit attack surface” to one place: the manuscript. And we’re all good at reading manuscripts!
Third, working with bipartite took a personal toll for a period of months. To explain this, I should say that my core identity as a PI is that I am an “unblocker.” I don’t care what is blocking a project: if I can hop in and contribute code, answer a question, or email a colleague to get something moving again, I drop everything and do it.
That attitude simply doesn’t fly with agents. You can unblock agent A and let it run, but by the time you check in and unblock agents B, C, and D, agent A is ready again for input and the work never stops. For a period of several months I couldn’t sleep for more than three hours at a time. I have the git logs to prove it.
It felt like some strange combination of joy and being chained to a treadmill. The joy was real: every day felt like Christmas day, knowing how much I would be able to accomplish. Projects I had always dreamed of, but that would have required a big, technically strong team working for a year just to determine viability, were suddenly pursuable alone. On the other hand I was Sisyphus, bound to keep on pushing that rock up the hill 24/7. This was my manifestation of what Andrej Karpathy calls “AI psychosis.” Getting out of it required the help of my wife and a renewed dedication to mindfulness and the humans in my life.
Bipartite is on GitHub under an MIT license, with documentation here. It is early, opinionated, and rough. It is my mise en place and it suits the way I work. Perhaps you work differently, in which case you might use it for inspiration. Or you can submit an issue!
My work is supported by NIH and HHMI, and I’m very grateful to taxpayers and the HHMI for their support.