Sybil Solutions

Training Data Drive

20T tokens
from real work.

Toy benchmarks are not enough. To train useful coding agents, we need real sessions: tasks, mistakes, tool calls, edits, reviews, dead ends, fixes, and the human intent around them.

Why sessions matter

  • Trajectory: they contain full problem-solving trajectories, not isolated snippets.
  • Tool use: they show how agents use files, terminals, browsers, tests, and git.
  • Repair: they preserve failures and recoveries, which are the parts models need most.
The pipeline: sessions → redact → format → publish.

What counts

Accepted

Open-source coding sessions from Pi, Codex, Claude Code, OpenCode, Cursor, and compatible JSONL exports.

Preferred

Complete sessions with user intent, tool calls, code edits, test output, review loops, and final state.
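As a rough illustration of what a complete session export can look like, here is a minimal sketch of one JSONL record. The field names (`intent`, `events`, `final_state`) are hypothetical and for illustration only; the drive does not mandate this schema, only that sessions come from the tools listed above or a compatible JSONL export.

```python
import json

# Hypothetical session record; field names are illustrative, not a required schema.
record = {
    "intent": "Fix failing test in parser module",
    "events": [
        {"type": "tool_call", "tool": "terminal", "input": "pytest tests/test_parser.py"},
        {"type": "edit", "file": "parser.py", "diff": "@@ -10,1 +10,1 @@ ..."},
        {"type": "tool_call", "tool": "terminal", "input": "pytest tests/test_parser.py"},
    ],
    "final_state": "tests passing",
}

# One JSON object per line is what makes the export JSONL.
line = json.dumps(record)
```

The point is that a single record carries the user's intent, the tool calls and edits in order, and the final state, so the full trajectory survives in one line of the file.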

Excluded

Secrets, private customer work, proprietary code, personal chats, credentials, and anything you do not have rights to share.
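Before uploading, sessions should be scrubbed of anything in the excluded list. A minimal redaction pass might look like the sketch below; the patterns shown are illustrative examples of common credential shapes (GitHub token, AWS key ID, PEM header), not an exhaustive secret scanner, and a real pipeline should use a dedicated scanning tool.

```python
import re

# Illustrative patterns only; a real pipeline should use a dedicated secret scanner.
PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                 # GitHub personal access token
    re.compile(r"AKIA[0-9A-Z]{16}"),                    # AWS access key ID
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),  # PEM private key header
]

def redact(text: str) -> str:
    """Replace anything matching a known secret pattern with a placeholder."""
    for pattern in PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```

Pattern-based redaction only catches known shapes, so it complements, rather than replaces, a manual check that the session contains nothing private or proprietary.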

Hugging Face traces

After upload

Dataset card

Say which tool produced the sessions, what sources are included, what was redacted, and what license applies.

Tags

Use `format:agent-traces`, `agent-traces`, and `coding-agent` so the dataset appears in the trace index.
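On Hugging Face, these tags live in the YAML front matter at the top of the dataset's README.md (the dataset card). A minimal sketch, with an illustrative placeholder license:

```yaml
---
license: mit            # illustrative; use the license that actually applies
tags:
  - format:agent-traces
  - agent-traces
  - coding-agent
---
```

The card body below the front matter is where the tool, sources, and redaction notes described above belong.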

Share

Send the Hugging Face link to the campaign so the token counter can include your shard in the next snapshot.