OpenEnv · RL Environment · Rust · 952 Problems · Easy / Medium / Hard
Algo Reasoning Env

Can an AI truly reason about code?

Most benchmarks ask one question: does the code pass the tests? We ask three — correctness, reasoning quality, and complexity understanding — in a language where getting things right is genuinely hard: Rust.

952 Problems
3 Eval dimensions
2.6k Test harnesses
0→1 Reward range
01

Why Rust

"Python hides incomplete understanding behind dynamic typing. Rust's compiler doesn't. We use that as a hard correctness gate."

Compilation
Code either compiles or it doesn't, and there is no partial credit for almost-right syntax. Type and borrow errors surface before anything runs, a signal Python's interpreter cannot provide.
Types
A model must distinguish Vec<i32> from Vec<Vec<i32>> and use Option<Box<ListNode>> for linked lists. There is no "close enough".
Ownership
Linked lists and trees require understanding borrowing, ownership transfer, and smart pointers — going beyond algorithm correctness into memory safety.
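As an illustration of the ownership point above, here is a minimal sketch of a singly linked list in the Option<Box<ListNode>> shape, with a reversal that moves nodes rather than copying them. The helper names from_slice, reverse, and to_vec are hypothetical and chosen for this example; they are not taken from the environment's code.

```rust
// Singly linked list node in the Option<Box<...>> shape named above.
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct ListNode {
    pub val: i32,
    pub next: Option<Box<ListNode>>,
}

// Build a list back to front so each new node takes ownership
// of the tail built so far.
pub fn from_slice(vals: &[i32]) -> Option<Box<ListNode>> {
    let mut head = None;
    for &val in vals.iter().rev() {
        head = Some(Box::new(ListNode { val, next: head }));
    }
    head
}

// Reverse by moving each node off the input and onto the front of the
// output. `take()` transfers ownership of the remaining tail.
pub fn reverse(mut head: Option<Box<ListNode>>) -> Option<Box<ListNode>> {
    let mut prev = None;
    while let Some(mut node) = head {
        head = node.next.take();
        node.next = prev;
        prev = Some(node);
    }
    prev
}

// Read the values through shared borrows only; the list is not consumed.
pub fn to_vec(head: &Option<Box<ListNode>>) -> Vec<i32> {
    let mut out = Vec::new();
    let mut cur = head.as_ref();
    while let Some(node) = cur {
        out.push(node.val);
        cur = node.next.as_ref();
    }
    out
}
```

A solution that gets the algorithm right but tries to hold two mutable references into the list is rejected by the borrow checker, which is the kind of gap this card describes.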
Evaluation
rustc --test compiles to a native binary. No interpreter, no venv, no dependency management. One subprocess, one 30-second timeout, one truth.
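A rough sketch of such an evaluation step using only the standard library. The function names, the polling interval, and the split into a compile command followed by a run command are assumptions made for illustration; the text above describes a single subprocess, so the real harness may be structured differently. Both commands share one deadline here to stay close to the "one timeout" idea.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Run a command until it exits or the deadline passes.
// Ok(Some(status)) on exit, Ok(None) on timeout, Err if it cannot start.
pub fn run_until(cmd: &mut Command, deadline: Instant) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    loop {
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // the process may already have exited
            child.wait()?; // reap it so no zombie is left behind
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(20));
    }
}

// Compile `src` with the built-in test harness, then run the test binary.
// `src` and `bin` are hypothetical paths supplied by the caller.
pub fn compile_and_test(src: &str, bin: &str, budget: Duration) -> io::Result<bool> {
    let deadline = Instant::now() + budget;
    let compiled = run_until(Command::new("rustc").args(["--test", src, "-o", bin]), deadline)?;
    if !matches!(compiled, Some(s) if s.success()) {
        return Ok(false); // compile error or timeout
    }
    let tested = run_until(&mut Command::new(bin), deadline)?;
    Ok(matches!(tested, Some(s) if s.success()))
}
```

Calling compile_and_test("solution.rs", "./solution_test", Duration::from_secs(30)) would mirror the 30-second budget mentioned above.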
02

Built from the ground up

This isn't a wrapper around an existing benchmark. The 952 problems were assembled through a five-phase pipeline.

Phase 1
Source extraction
Problem descriptions from LeetCode. Expert explanations and Big-O annotations from the doocs/leetcode repository.
Phase 2
Rust starter code
Function signatures imported from rustgym_eng and doocs/leetcode. Each template provides the exact pub fn signature.
Phase 3
Test harness conversion
2,641 Python test cases converted to Rust — type mappings, linked list construction, float tolerance, order-agnostic checking.
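Two of the conversions named above, float tolerance and order-agnostic checking, might look like the following helpers in a Rust harness. This is a sketch under assumptions: the function names, the absolute-epsilon comparison, and the i32 element type are illustrative choices, not the project's actual harness code.

```rust
// Float tolerance: compare with an absolute epsilon instead of `==`.
pub fn approx_eq(got: f64, want: f64, eps: f64) -> bool {
    (got - want).abs() <= eps
}

// Order-agnostic check: equal as multisets, compared on sorted copies.
pub fn same_elements(got: &[i32], want: &[i32]) -> bool {
    let mut a = got.to_vec();
    let mut b = want.to_vec();
    a.sort_unstable();
    b.sort_unstable();
    a == b
}

// Order-agnostic check for nested results, ignoring order both inside
// each group and across groups.
pub fn same_groups(got: &[Vec<i32>], want: &[Vec<i32>]) -> bool {
    fn canon(groups: &[Vec<i32>]) -> Vec<Vec<i32>> {
        let mut sorted: Vec<Vec<i32>> = groups
            .iter()
            .map(|g| {
                let mut g = g.clone();
                g.sort_unstable();
                g
            })
            .collect();
        sorted.sort();
        sorted
    }
    canon(got) == canon(want)
}
```

A converted test would then assert on same_elements(&result, &expected) where the Python original compared sorted lists.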
Phase 4
Solution generation
Each LLM-generated solution is compiled and tested, with up to 3 attempts. Compiler errors are fed back to the model for self-correction.
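The refinement loop can be sketched generically as below. The names refine and Check are hypothetical, and the generate and check closures stand in for the LLM call and the compile-and-test step, neither of which is specified here.

```rust
// Outcome of compiling and testing one candidate solution.
pub enum Check {
    Pass,
    Fail(String), // compiler or test output to feed back
}

// Try up to `max_attempts` candidates. `generate` receives the previous
// failure output (None on the first attempt); `check` compiles and tests.
pub fn refine<G, C>(mut generate: G, mut check: C, max_attempts: usize) -> Option<String>
where
    G: FnMut(Option<&str>) -> String,
    C: FnMut(&str) -> Check,
{
    let mut feedback: Option<String> = None;
    for _ in 0..max_attempts {
        let candidate = generate(feedback.as_deref());
        match check(&candidate) {
            Check::Pass => return Some(candidate),
            Check::Fail(msg) => feedback = Some(msg),
        }
    }
    None // no candidate passed within the attempt budget
}
```

With max_attempts set to 3 this matches the attempt limit described above; a problem whose candidates all fail yields None.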
Phase 5
Assembly
Tag-based boilerplate injection: ListNode prepended for linked-list problems, TreeNode for trees.
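A minimal sketch of tag-based injection, assuming tags are plain strings such as "Linked List" and "Tree" and that the node definitions follow the usual LeetCode-style shapes. The assemble function and the exact tag strings are assumptions for illustration.

```rust
// Definitions prepended only when a problem's tags call for them.
const LIST_NODE: &str = "\
#[derive(PartialEq, Eq, Clone, Debug)]
pub struct ListNode {
    pub val: i32,
    pub next: Option<Box<ListNode>>,
}
";

const TREE_NODE: &str = "\
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Debug, PartialEq, Eq)]
pub struct TreeNode {
    pub val: i32,
    pub left: Option<Rc<RefCell<TreeNode>>>,
    pub right: Option<Rc<RefCell<TreeNode>>>,
}
";

// Prepend the boilerplate selected by `tags`, then append the solution.
pub fn assemble(tags: &[&str], solution: &str) -> String {
    let mut out = String::new();
    if tags.iter().any(|&t| t == "Linked List") {
        out.push_str(LIST_NODE);
        out.push('\n');
    }
    if tags.iter().any(|&t| t == "Tree") {
        out.push_str(TREE_NODE);
        out.push('\n');
    }
    out.push_str(solution);
    out
}
```

Keeping the definitions out of untagged problems means a solution cannot accidentally depend on a type it was never given.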
Difficulty distribution
Easy · 347 problems · 36.5% · ×0.3 weight
Medium · 481 problems · 50.5% · ×0.5 weight
Hard · 124 problems · 13.0% · ×1.0 weight
Array 1,278
String 510
Hash Table 430
Dynamic Programming 387
Math 355
Sorting 320
Greedy 301
Binary Search
Tree
Linked List
Graph
Backtracking
03

Three dimensions, one score

Most benchmarks are one-dimensional. We evaluate three independent signals because they test genuinely different capabilities.

Weight
50%
Correctness: Does the Rust code compile and pass all test cases? The compiler is the first gate: code that fails to compile scores 0.0, and code that compiles but fails the tests earns only 0.3.
0.0 compile fail · 0.3 compile · 1.0 pass tests
Weight
30%
Reasoning: An LLM judge compares the model's step-by-step reasoning against expert ground truth. Can the model describe the logic flow?
0.0–1.0 · continuous · semantic matching
Weight
20%
Complexity: Semantic Big-O matching. O(m×n) equals O(n×m), and O(max(m,n)) equals O(m+n). Tests understanding of algorithmic efficiency.
0 or 1 · binary · semantic normalization
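The three signals can be combined as a weighted sum, sketched below. This assumes the final score is exactly 0.5, 0.3, and 0.2 times the three components and leaves the difficulty weights out, since how they enter the score is not stated here. The complexity normalizer is a toy stand-in for the semantic matcher: it only handles flat sums and products plus a top-level max(...), and all function names are hypothetical.

```rust
// Correctness tiers as listed: 0.0 compile fail, 0.3 compiles, 1.0 passes tests.
pub fn correctness_score(compiled: bool, passed: bool) -> f64 {
    match (compiled, passed) {
        (false, _) => 0.0,
        (true, false) => 0.3,
        (true, true) => 1.0,
    }
}

// Canonical form for a flat Big-O string: factors and terms are sorted,
// and max(a,b) is rewritten as a+b (the two differ by at most a factor of 2).
pub fn normalize_complexity(raw: &str) -> String {
    let cleaned: String = raw
        .to_lowercase()
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| if c == '×' || c == '·' { '*' } else { c })
        .collect();
    let cleaned = cleaned.replace('²', "^2");
    // Strip the outer "o(" ... ")" if present.
    let inner = cleaned
        .strip_prefix("o(")
        .and_then(|s| s.strip_suffix(')'))
        .unwrap_or(cleaned.as_str());
    let inner = match inner.strip_prefix("max(").and_then(|s| s.strip_suffix(')')) {
        Some(args) => args.replace(',', "+"),
        None => inner.to_string(),
    };
    let mut terms: Vec<String> = inner
        .split('+')
        .map(|term| {
            let mut factors: Vec<&str> = term.split('*').collect();
            factors.sort_unstable();
            factors.join("*")
        })
        .collect();
    terms.sort_unstable();
    terms.join("+")
}

// Weighted sum of the three dimensions (assumed combination rule).
pub fn combined_score(correctness: f64, reasoning: f64, complexity_match: bool) -> f64 {
    let complexity = if complexity_match { 1.0 } else { 0.0 };
    0.5 * correctness + 0.3 * reasoning + 0.2 * complexity
}
```

Under this assumption, a passing solution with a reasoning score of 0.5 and a matching complexity scores 0.5 + 0.15 + 0.2 = 0.85.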
04

Standard OpenEnv interface

Session-based state management. Each /reset returns a session_id that must be passed to /step.

Action space

solution_code: str
reasoning_steps: str
time_complexity: str

Observation space

problem_description: str
starter_code: str
expected_complexity: str
difficulty: Easy | Medium | Hard
tags: list[str]
reward: float 0.0–1.0
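For a Rust client, the two spaces could be modeled with plain structs like these. The field names follow the lists above; the type choices (String, Vec<String>, f64), the Difficulty helpers, and the reward_in_range check are assumptions, and the weight values are copied from the difficulty distribution.

```rust
// Action submitted to /step.
#[derive(Debug, Clone, PartialEq)]
pub struct Action {
    pub solution_code: String,
    pub reasoning_steps: String,
    pub time_complexity: String,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Difficulty {
    Easy,
    Medium,
    Hard,
}

impl Difficulty {
    // Parse the label used in the observation.
    pub fn parse(label: &str) -> Option<Self> {
        match label {
            "Easy" => Some(Difficulty::Easy),
            "Medium" => Some(Difficulty::Medium),
            "Hard" => Some(Difficulty::Hard),
            _ => None,
        }
    }

    // Weight shown next to each difficulty in the distribution.
    pub fn weight(self) -> f64 {
        match self {
            Difficulty::Easy => 0.3,
            Difficulty::Medium => 0.5,
            Difficulty::Hard => 1.0,
        }
    }
}

// Observation returned by the environment.
#[derive(Debug, Clone, PartialEq)]
pub struct Observation {
    pub problem_description: String,
    pub starter_code: String,
    pub expected_complexity: String,
    pub difficulty: Difficulty,
    pub tags: Vec<String>,
    pub reward: f64,
}

impl Observation {
    // The reward is documented as lying in 0.0..=1.0.
    pub fn reward_in_range(&self) -> bool {
        (0.0..=1.0).contains(&self.reward)
    }
}
```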
API endpoints
Method  Path       Description
POST    /reset     New episode · returns session_id + observation
POST    /step      Submit solution · requires session_id
POST    /evaluate  Stateless combined reset + step
GET     /state     Active sessions + server state
GET     /health    Health check
Usage example
# 1. Reset — get a problem
curl -X POST "https://tm23hgf-rust-algo-reasoning.hf.space/reset" \
  -H "Content-Type: application/json" \
  -d '{}'
 
# 2. Step — submit your solution
curl -X POST "https://tm23hgf-rust-algo-reasoning.hf.space/step" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "abc-123-...",
    "action": {
      "solution_code": "impl Solution { pub fn two_sum(...) ... }",
      "reasoning_steps": "step-1: Use a HashMap for O(n) lookup.",
      "time_complexity": "O(n)"
    }
  }'
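Real solutions contain quotes and newlines, so the /step body has to be JSON-escaped before it is sent. The sketch below builds that body with the standard library only; json_escape and step_body are hypothetical helper names, and the field layout simply mirrors the curl example above.

```rust
// Escape a string for embedding inside a JSON string literal.
pub fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters use the \uXXXX form.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

// Build the request body for POST /step from its four string fields.
pub fn step_body(session_id: &str, code: &str, reasoning: &str, complexity: &str) -> String {
    format!(
        "{{\"session_id\":\"{}\",\"action\":{{\"solution_code\":\"{}\",\"reasoning_steps\":\"{}\",\"time_complexity\":\"{}\"}}}}",
        json_escape(session_id),
        json_escape(code),
        json_escape(reasoning),
        json_escape(complexity)
    )
}
```

The resulting string can be written to a file and passed to curl with -d @file in place of the inline body.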
# stdout format
[START] task=algo_reasoning env=algo_reasoning_env model=gpt-oss-20b
[STEP]  step=1 action="solution=[len=120] complexity=[O(n)]" reward=0.85 done=true error=null
[STEP]  step=2 action="solution=[len=95] complexity=[O(n²)]" reward=0.30 done=true error=null
[END]   success=true steps=200 score=0.45 rewards=0.85,0.30,...
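The log lines above are simple enough to parse without a library. This sketch assumes every line has the form [TAG] followed by key=value pairs, where a value is either a bare token or a double-quoted string without embedded quotes; parse_log_line is a hypothetical helper name.

```rust
use std::collections::HashMap;

// Split a log line into its tag (START, STEP, END) and key=value fields.
// Quoted values may contain spaces and '=' characters.
pub fn parse_log_line(line: &str) -> Option<(String, HashMap<String, String>)> {
    let rest = line.trim().strip_prefix('[')?;
    let (tag, mut rest) = rest.split_once(']')?;
    let mut fields = HashMap::new();
    loop {
        rest = rest.trim_start();
        if rest.is_empty() {
            break;
        }
        let (key, after) = rest.split_once('=')?;
        let (value, remaining) = if let Some(quoted) = after.strip_prefix('"') {
            // Quoted value: runs to the next double quote.
            quoted.split_once('"')?
        } else {
            // Bare value: runs to the next whitespace or end of line.
            match after.split_once(char::is_whitespace) {
                Some((v, r)) => (v, r),
                None => (after, ""),
            }
        };
        fields.insert(key.to_string(), value.to_string());
        rest = remaining;
    }
    Some((tag.to_string(), fields))
}
```

Fields come back as strings, so a caller would parse fields["reward"] as f64 before aggregating rewards across steps.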