The idea
Models need traces of actual work.
The goal is to assemble 20 trillion tokens of public, permissioned coding-agent sessions. Not scraped private repositories. Not synthetic tasks pretending to be work. Real, shareable development sessions that teach agents how software actually gets built.
What a session contains
- User request and follow-up constraints.
- Agent reasoning summaries, tool calls, and command output.
- Code edits, tests, lint checks, browser checks, and deployment notes.
- Failures, corrections, and final verification.
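Concretely, a session record might be typed like the sketch below. This is a minimal illustration, not a published schema; the class and field names (`ToolCall`, `Turn`, `Session`) are assumptions for the sake of the example.

```python
from dataclasses import dataclass, field

# Hypothetical schema -- field names are illustrative, not a fixed spec.
@dataclass
class ToolCall:
    name: str        # e.g. "run_tests", "edit_file"
    arguments: dict  # tool-specific parameters
    output: str      # captured command or tool output, including errors

@dataclass
class Turn:
    role: str  # "user" or "agent"
    text: str  # request, follow-up constraint, or reasoning summary
    tool_calls: list[ToolCall] = field(default_factory=list)

@dataclass
class Session:
    turns: list[Turn]  # the full back-and-forth, failures included
    verified: bool     # did final tests, lint, and deploy checks pass?
```

Keeping failed attempts and the final verification status in the record is the point: those fields are exactly what a bare code snippet throws away.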
Why it beats snippets
Code alone shows the answer. A session shows how the answer was found: what context mattered, what was ignored, which tests caught regressions, and how the agent recovered when the first attempt was wrong.
Why 20T
Small datasets overfit models to one tool, one repo style, or one agent personality. A 20T-token target forces breadth across languages, repo sizes, tooling stacks, task difficulty, and human workflows.
Who should share
Open source maintainers, agent builders, eval authors, researchers, and power users who can publish sessions they own or have permission to release.
Walkthrough
Exporters such as `pi-brain` read local sessions from tools like Pi, Codex, Claude Code, OpenCode, and Cursor. They sanitize secrets locally, convert the result into training-friendly formats, and optionally upload the bundle to Hugging Face or another dataset host.
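A minimal sketch of that local pipeline, assuming sessions are stored as JSON files on disk. The redaction patterns, paths, and function names here are illustrative assumptions; a real exporter such as `pi-brain` would ship a much larger, maintained scrubber.

```python
import json
import re
from pathlib import Path

# Illustrative redaction patterns -- a real exporter maintains a far
# more thorough set and runs entirely on the local machine.
SECRET_PATTERNS = [
    re.compile(r"(?i)\b(api[_-]?key|token|secret)\b\s*[:=]\s*\S+"),
    re.compile(r"ghp_[A-Za-z0-9]{36}"),  # classic GitHub personal access tokens
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----[\s\S]*?-----END [A-Z ]*PRIVATE KEY-----"),
]

def sanitize(text: str) -> str:
    """Replace anything that looks like a credential before it leaves the machine."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def scrub(value):
    """Recursively sanitize every string field in a decoded session."""
    if isinstance(value, str):
        return sanitize(value)
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, dict):
        return {key: scrub(val) for key, val in value.items()}
    return value

def export(session_dir: Path, out_path: Path) -> None:
    """Read local session files, scrub them, and write one JSONL training shard."""
    with out_path.open("w") as out:
        for path in sorted(session_dir.glob("*.json")):
            session = scrub(json.loads(path.read_text()))
            out.write(json.dumps(session) + "\n")

export(Path("~/sessions").expanduser(), Path("shard-0001.jsonl"))
```

From there, publishing the shard can be as simple as uploading the JSONL file to a Hugging Face dataset repo with `huggingface_hub`.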
Each shared dataset becomes one more public shard: a record of real tasks, real repos, real errors, and real verification. The aggregate is what makes future agents less brittle.