Training Data Drive

20T tokens
from real work.

Toy benchmarks are not enough. To train useful coding agents, we need real sessions: tasks, mistakes, tool calls, edits, reviews, dead ends, fixes, and the human intent around them.

Why sessions matter

They contain full problem-solving trajectories, not isolated snippets.Trajectory
They show how agents use files, terminals, browsers, tests, and git.Tool use
They preserve failures and recoveries, which are the parts models need most.Repair

sessions

redact

format

publish

Upload to Hugging Face

1. Install the exporter

This reads local sessions from Pi, Codex, Claude Code, OpenCode, and Cursor.

2. Log in to Hugging Face

Create a write token at huggingface.co/settings/tokens, then paste it here.

3. Inspect and export

Open the export folder and spot-check samples before publishing.

4. Publish a dataset

Make it public, add `format:agent-traces`, `agent-traces`, and `coding-agent` tags, then share the link.

Raw text stays local until you choose to publish. Redaction runs before upload.

20T tokensfrom real work.

20T tokens
from real work.