Most benchmarks ask one question: does the code pass the tests? We ask three — correctness, reasoning quality, and complexity understanding — in a language where getting things right is genuinely hard: Rust.
"Python hides incomplete understanding behind dynamic typing. Rust's compiler doesn't. We use that as a hard correctness gate."
`Vec<i32>` vs `Vec<Vec<i32>>`, and `Option<Box<ListNode>>` for linked lists. There is no "close enough".

`rustc --test` compiles to a native binary. No interpreter, no venv, no dependency management. One subprocess, one 30-second timeout, one truth.

This isn't a wrapper around an existing benchmark. 952 problems assembled through a five-phase pipeline.
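As a rough illustration of the `rustc --test` gate, here is a minimal sketch using only the standard library. The function names, the file layout, and the choice to share one time budget between the compile step and the test run are assumptions made for this example, not the benchmark's actual harness.

```rust
use std::io;
use std::path::Path;
use std::process::{Child, Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Arguments for compiling `src` into a test binary at `out`.
/// `--test` and `-o` are real rustc flags; the paths are hypothetical.
fn rustc_test_args(src: &Path, out: &Path) -> Vec<String> {
    vec![
        "--test".to_string(),
        src.display().to_string(),
        "-o".to_string(),
        out.display().to_string(),
    ]
}

/// Polls `child` until it exits or `deadline` passes.
/// On timeout the child is killed and `None` is returned.
fn wait_until(child: &mut Child, deadline: Instant) -> io::Result<Option<ExitStatus>> {
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // ignore the error if it exited in the meantime
            child.wait()?; // reap the process
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}

/// Compiles `src` with `rustc --test`, then runs the resulting binary.
/// Returns `Ok(true)` only if both steps succeed within `budget`.
/// `out` should include a directory component (e.g. `./solution_test`).
fn compile_and_test(src: &Path, out: &Path, budget: Duration) -> io::Result<bool> {
    let deadline = Instant::now() + budget;

    let mut compile = Command::new("rustc")
        .args(rustc_test_args(src, out))
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    match wait_until(&mut compile, deadline)? {
        Some(status) if status.success() => {}
        _ => return Ok(false), // compile error or timeout
    }

    let mut run = Command::new(out)
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    Ok(matches!(wait_until(&mut run, deadline)?, Some(status) if status.success()))
}
```

Calling `compile_and_test(Path::new("solution.rs"), Path::new("./solution_test"), Duration::from_secs(30))` would then give a single pass/fail answer: a type error fails at the compile step, a wrong answer fails a test, and a hang is cut off at the deadline.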
doocs/leetcode repository.

rustgym_eng and doocs/leetcode. Each template provides the exact `pub fn` signature.

`ListNode` prepended for linked-list problems, `TreeNode` for trees.

Most benchmarks are one-dimensional. We evaluate three independent signals because they test genuinely different capabilities.
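To make the template convention concrete (an exact `pub fn` signature, with `ListNode` prepended for linked-list problems), here is a sketch of what one assembled problem file might look like. The `ListNode` definition follows the form commonly used for LeetCode-style Rust problems and is assumed to match what gets prepended; `reverse_list` is a hypothetical example problem, and `from_vec` / `to_vec` are helpers invented for this illustration.

```rust
// Assumed to be prepended for linked-list problems.
#[derive(PartialEq, Eq, Clone, Debug)]
pub struct ListNode {
    pub val: i32,
    pub next: Option<Box<ListNode>>,
}

impl ListNode {
    #[inline]
    pub fn new(val: i32) -> Self {
        ListNode { next: None, val }
    }
}

pub struct Solution;

impl Solution {
    // The template fixes this signature; only the body is left to the solver.
    pub fn reverse_list(head: Option<Box<ListNode>>) -> Option<Box<ListNode>> {
        let mut prev = None;
        let mut curr = head;
        while let Some(mut node) = curr {
            curr = node.next.take(); // detach the rest of the list
            node.next = prev; // point this node back at the reversed prefix
            prev = Some(node);
        }
        prev
    }
}

// Hypothetical helper: build a list from a slice.
fn from_vec(values: &[i32]) -> Option<Box<ListNode>> {
    let mut head = None;
    for &v in values.iter().rev() {
        let mut node = Box::new(ListNode::new(v));
        node.next = head;
        head = Some(node);
    }
    head
}

// Hypothetical helper: collect list values into a Vec.
fn to_vec(mut head: &Option<Box<ListNode>>) -> Vec<i32> {
    let mut out = Vec::new();
    while let Some(node) = head {
        out.push(node.val);
        head = &node.next;
    }
    out
}
```

Returning `Option<ListNode>` or `Vec<i32>` instead of `Option<Box<ListNode>>` here would not compile against the tests, which is the sense in which the type system acts as a gate.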
Session-based state management. Each /reset returns a session_id that must be passed to /step.
| Method | Path | Description |
|---|---|---|
| POST | /reset | New episode · returns session_id + observation |
| POST | /step | Submit solution · requires session_id |
| POST | /evaluate | Stateless combined reset + step |
| GET | /state | Active sessions + server state |
| GET | /health | Health check |
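The session flow behind `/reset`, `/step`, and `/evaluate` can be sketched as an in-memory store. This is only an illustration of the idea: the type names, the `session-N` id format, and the return values are assumptions, and the real server's request and response payloads are not shown here.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum StepError {
    UnknownSession,
}

// Hypothetical per-episode state.
struct Episode {
    problem_id: usize,
    steps: u32,
}

#[derive(Default)]
struct SessionStore {
    next_id: u64,
    sessions: HashMap<String, Episode>,
}

impl SessionStore {
    /// Mirrors `/reset`: starts a new episode and returns its session id.
    fn reset(&mut self, problem_id: usize) -> String {
        self.next_id += 1;
        let id = format!("session-{}", self.next_id); // assumed id format
        self.sessions.insert(id.clone(), Episode { problem_id, steps: 0 });
        id
    }

    /// Mirrors `/step`: rejects submissions without a known session id.
    /// Returns the problem id and the number of steps taken so far.
    fn step(&mut self, session_id: &str, _solution: &str) -> Result<(usize, u32), StepError> {
        let episode = self
            .sessions
            .get_mut(session_id)
            .ok_or(StepError::UnknownSession)?;
        episode.steps += 1;
        Ok((episode.problem_id, episode.steps))
    }

    /// Mirrors `/evaluate`: reset and step in one call, leaving no session behind.
    fn evaluate(&mut self, problem_id: usize, solution: &str) -> Result<(usize, u32), StepError> {
        let id = self.reset(problem_id);
        let outcome = self.step(&id, solution);
        self.sessions.remove(&id);
        outcome
    }

    /// Mirrors the session count reported by `/state`.
    fn active(&self) -> usize {
        self.sessions.len()
    }
}
```

In this sketch a `/step` with a missing or stale id fails with `UnknownSession`, while `/evaluate` never adds to the active session count.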