Algo Reasoning Env

Why Rust

"Python hides incomplete understanding behind dynamic typing. Rust's compiler doesn't. We use that as a hard correctness gate."

Compilation

⚡

Code either compiles or it doesn't. No runtime surprises, no partial credit for almost-right syntax. This gives a signal Python's interpreter cannot provide.

Types

🧠

A model must know Vec<i32> vs Vec<Vec<i32>>, and Option<Box<ListNode>> for linked lists. There is no "close enough".

Ownership

🔐

Linked lists and trees require understanding borrowing, ownership transfer, and smart pointers — going beyond algorithm correctness into memory safety.
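To make the type and ownership points concrete, here is a minimal sketch of reversing a singly linked list. It assumes the `ListNode` layout used by LeetCode's Rust templates; `from_slice` and `to_vec` are hypothetical helpers added only for illustration.

```rust
// Node layout as in LeetCode's Rust templates (assumed to match the
// boilerplate this environment provides).
#[derive(Debug, PartialEq, Eq, Clone)]
pub struct ListNode {
    pub val: i32,
    pub next: Option<Box<ListNode>>,
}

/// Hypothetical helper: build a list from a slice, last element first.
pub fn from_slice(vals: &[i32]) -> Option<Box<ListNode>> {
    let mut head = None;
    for &v in vals.iter().rev() {
        head = Some(Box::new(ListNode { val: v, next: head }));
    }
    head
}

/// Reverse a list by moving each node; nothing is copied or cloned.
pub fn reverse_list(head: Option<Box<ListNode>>) -> Option<Box<ListNode>> {
    let mut prev = None;
    let mut cur = head;
    while let Some(mut node) = cur {
        cur = node.next.take(); // detach the rest of the list from `node`
        node.next = prev;       // the reversed prefix moves into `node`
        prev = Some(node);      // `node` becomes the new head of the prefix
    }
    prev
}

/// Hypothetical helper: read the values through shared borrows only.
pub fn to_vec(head: &Option<Box<ListNode>>) -> Vec<i32> {
    let mut out = Vec::new();
    let mut cur = head.as_ref();
    while let Some(node) = cur {
        out.push(node.val);
        cur = node.next.as_ref();
    }
    out
}
```

Dropping the `take()` call, or passing `Vec<i32>` where `Option<Box<ListNode>>` is expected, is rejected at compile time rather than surfacing as a wrong answer at runtime.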

Evaluation

⚡

rustc --test compiles the solution and its tests into a native binary. No interpreter, no venv, no dependency management. One subprocess, one 30-second timeout, one truth.

Built from the ground up

This isn't a wrapper around an existing benchmark. 952 problems assembled through a five-phase pipeline.

Phase 1

Source extraction

Problem descriptions from LeetCode. Expert explanations and Big-O annotations from the doocs/leetcode repository.

Phase 2

Rust starter code

Function signatures imported from rustgym_eng and doocs/leetcode. Each template provides the exact pub fn signature.

Phase 3

Test harness conversion

2,641 Python test cases converted to Rust, covering type mappings, linked list construction, float tolerance, and order-agnostic checking.
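The float-tolerance and order-agnostic checks could look roughly like the helpers below. The names and the choice of an absolute tolerance are assumptions; the actual harness may compare results differently.

```rust
/// True when `a` and `b` differ by at most `eps` (absolute tolerance).
pub fn approx_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() <= eps
}

/// True when both slices hold the same elements with the same
/// multiplicities, regardless of order.
pub fn same_elements<T: Ord + Clone>(a: &[T], b: &[T]) -> bool {
    let mut x = a.to_vec();
    let mut y = b.to_vec();
    x.sort();
    y.sort();
    x == y
}

/// Order-agnostic comparison for nested results: neither the order of the
/// groups nor the order inside each group matters.
pub fn same_groups<T: Ord + Clone>(a: &[Vec<T>], b: &[Vec<T>]) -> bool {
    let canon = |groups: &[Vec<T>]| {
        let mut g: Vec<Vec<T>> = groups
            .iter()
            .map(|v| {
                let mut v = v.clone();
                v.sort();
                v
            })
            .collect();
        g.sort();
        g
    };
    canon(a) == canon(b)
}
```

Inside a generated test, a Python assertion like `sorted(result) == sorted(expected)` would then become `assert!(same_elements(&result, &expected))`.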

Phase 4

Solution generation

Each LLM-generated solution is compiled and tested, with up to 3 attempts. Compiler errors are fed back to the model so it can correct itself on the next attempt.
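The refinement loop can be sketched generically. `refine`, `Check`, and the closure shapes are hypothetical; in the real pipeline the generator would be an LLM call and the checker a compile-and-test run.

```rust
/// Outcome of compiling and testing one candidate solution.
pub enum Check {
    Pass,
    /// Compiler or test output to hand back to the generator.
    Fail(String),
}

/// Ask `generate` for a solution up to `max_attempts` times. Each call
/// receives the previous failure output, if any, so it can self-correct.
/// Returns the first passing solution together with the attempt number.
pub fn refine<G, C>(mut generate: G, mut check: C, max_attempts: usize) -> Option<(String, usize)>
where
    G: FnMut(Option<&str>) -> String,
    C: FnMut(&str) -> Check,
{
    let mut feedback: Option<String> = None;
    for attempt in 1..=max_attempts {
        let code = generate(feedback.as_deref());
        match check(&code) {
            Check::Pass => return Some((code, attempt)),
            Check::Fail(msg) => feedback = Some(msg),
        }
    }
    None // every attempt failed
}
```

With `max_attempts` set to 3, a problem whose candidates never pass simply yields `None`; how the pipeline treats such problems is not stated in the source.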

Phase 5

Assembly

Tag-based boilerplate injection: ListNode prepended for linked-list problems, TreeNode for trees.
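A minimal sketch of tag-based injection follows. The tag strings, the `assemble` function, and the injected `Solution` struct are assumptions; the node definitions follow LeetCode's Rust templates.

```rust
// Assumed boilerplate snippets, modelled on LeetCode's Rust templates.
const LIST_NODE: &str = "\
#[derive(PartialEq, Eq, Clone, Debug)]
pub struct ListNode {
    pub val: i32,
    pub next: Option<Box<ListNode>>,
}
";

const TREE_NODE: &str = "\
use std::cell::RefCell;
use std::rc::Rc;

#[derive(Debug, PartialEq, Eq)]
pub struct TreeNode {
    pub val: i32,
    pub left: Option<Rc<RefCell<TreeNode>>>,
    pub right: Option<Rc<RefCell<TreeNode>>>,
}
";

/// Prepend the definitions a problem's tags call for, then append the
/// solution and its tests. Tag names here are hypothetical.
pub fn assemble(tags: &[&str], solution: &str, tests: &str) -> String {
    let has = |name: &str| tags.iter().any(|t| t.eq_ignore_ascii_case(name));
    let mut file = String::new();
    if has("Linked List") {
        file.push_str(LIST_NODE);
        file.push('\n');
    }
    if has("Tree") || has("Binary Tree") {
        file.push_str(TREE_NODE);
        file.push('\n');
    }
    // Assumed: submissions are written as `impl Solution { ... }`.
    file.push_str("pub struct Solution;\n\n");
    file.push_str(solution);
    file.push('\n');
    file.push_str(tests);
    file
}
```

Keying the injection on tags means a plain array problem compiles without unused node definitions, while a linked-list problem gets `ListNode` without the model having to declare it.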

Difficulty distribution

Difficulty	Problems	Share	Weight
Easy	347	36.5%	×0.3
Medium	481	50.5%	×0.5
Hard	124	13.0%	×1.0

Three dimensions, one score

Most benchmarks are one-dimensional. We evaluate three independent signals because they test genuinely different capabilities.

Weight

50%

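Correctness: Does the Rust code compile and pass all test cases? The compiler is the judge. Code that fails to compile scores zero, and code that compiles but fails a test keeps only the compile credit.

0.0 fails to compile · 0.3 compiles but fails tests · 1.0 passes all tests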

Weight

30%

Reasoning: An LLM judge compares the model's step-by-step reasoning against expert ground truth. Can the model describe the logic flow?

0.0–1.0 continuous · semantic matching

Weight

20%

Complexity: Semantic Big-O matching. O(m×n) is treated as equal to O(n×m), and O(max(m,n)) as equal to O(m+n). This tests understanding of algorithmic efficiency.

0 or 1 · binary · semantic normalization
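One plausible reading of the three weights is a linear combination, sketched below. The names are hypothetical, and the normalizer only handles reordered products such as O(m×n) versus O(n*m); it does not attempt the max-versus-sum equivalence, and the environment's own matcher is presumably more thorough.

```rust
/// Correctness tiers from the rubric above.
#[derive(Clone, Copy)]
pub enum Correctness {
    CompileFail,
    CompiledOnly,
    PassedTests,
}

impl Correctness {
    pub fn score(self) -> f64 {
        match self {
            Correctness::CompileFail => 0.0,
            Correctness::CompiledOnly => 0.3,
            Correctness::PassedTests => 1.0,
        }
    }
}

/// Canonicalise a Big-O string: drop whitespace, lowercase, map `×` to `*`,
/// and sort the top-level product factors.
pub fn normalize_big_o(s: &str) -> String {
    let cleaned: String = s
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| if c == '×' { '*' } else { c.to_ascii_lowercase() })
        .collect();
    let inner = cleaned
        .strip_prefix("o(")
        .and_then(|r| r.strip_suffix(')'))
        .unwrap_or(cleaned.as_str());
    let mut factors: Vec<&str> = inner.split('*').collect();
    factors.sort();
    format!("O({})", factors.join("*"))
}

/// Assumed combination: 50% correctness, 30% reasoning, 20% complexity.
pub fn total_reward(c: Correctness, reasoning: f64, predicted: &str, expected: &str) -> f64 {
    let complexity = if normalize_big_o(predicted) == normalize_big_o(expected) {
        1.0
    } else {
        0.0
    };
    0.5 * c.score() + 0.3 * reasoning.clamp(0.0, 1.0) + 0.2 * complexity
}
```

Under this reading, a solution that passes all tests with a reasoning score of 0.5 and a matching complexity would score 0.5 + 0.15 + 0.2 = 0.85. Whether the per-difficulty weights also enter the per-episode reward is not specified here.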

Standard OpenEnv interface

Session-based state management. Each /reset returns a session_id that must be passed to /step.

Action space

solution_code: str

reasoning_steps: str

time_complexity: str

Observation space

problem_description: str

starter_code: str

expected_complexity: str

difficulty: Easy | Medium | Hard

tags: list[str]

reward: float 0.0–1.0
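For a Rust client, the action space maps onto a small struct. The sketch below builds the JSON body for /step by hand to stay within the standard library; `Action`, `json_escape`, and `step_body` are hypothetical names, and a real client would more likely use a JSON crate.

```rust
/// Fields of the action space listed above.
pub struct Action {
    pub solution_code: String,
    pub reasoning_steps: String,
    pub time_complexity: String,
}

/// Escape a string for use inside a JSON string literal.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            // Remaining control characters need a \uXXXX escape.
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Build the request body for POST /step: a session_id plus the action object.
pub fn step_body(session_id: &str, action: &Action) -> String {
    format!(
        "{{\"session_id\":\"{}\",\"action\":{{\"solution_code\":\"{}\",\"reasoning_steps\":\"{}\",\"time_complexity\":\"{}\"}}}}",
        json_escape(session_id),
        json_escape(&action.solution_code),
        json_escape(&action.reasoning_steps),
        json_escape(&action.time_complexity),
    )
}
```

Escaping matters because `solution_code` is Rust source: it routinely contains quotes, backslashes, and newlines that would otherwise break the JSON payload.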

API endpoints

Method	Path	Description
POST	`/reset`	New episode · returns session_id + observation
POST	`/step`	Submit solution · requires session_id
POST	`/evaluate`	Stateless combined reset + step
GET	`/state`	Active sessions + server state
GET	`/health`	Health check

Usage example

# 1. Reset: get a problem
curl -X POST "https://tm23hgf-rust-algo-reasoning.hf.space/reset" \
  -H "Content-Type: application/json" \
  -d '{}'

# 2. Step: submit your solution
curl -X POST "https://tm23hgf-rust-algo-reasoning.hf.space/step" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "abc-123-...",
    "action": {
      "solution_code": "impl Solution { pub fn two_sum(...) ... }",
      "reasoning_steps": "step-1: Use a HashMap for O(n) lookup.",
      "time_complexity": "O(n)"
    }
  }'

# stdout format

[START] task=algo_reasoning env=algo_reasoning_env model=gpt-oss-20b

[STEP] step=1 action="solution=[len=120] complexity=[O(n)]" reward=0.85 done=true error=null

[STEP] step=2 action="solution=[len=95] complexity=[O(n²)]" reward=0.30 done=true error=null

[END] success=true steps=200 score=0.45 rewards=0.85,0.30,...
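If these log lines need to be consumed programmatically, a small key=value parser is enough. This is a sketch based only on the sample lines above; `parse_fields` is a hypothetical helper and assumes quoted values contain no embedded double quotes.

```rust
/// Split one log line into (key, value) pairs, skipping the leading
/// tag such as "[STEP]". Quoted values may contain spaces and '='.
pub fn parse_fields(line: &str) -> Vec<(String, String)> {
    let mut out = Vec::new();
    let mut rest = line.trim();
    // Drop the leading "[TAG]" if present.
    if rest.starts_with('[') {
        match rest.find(']') {
            Some(i) => rest = rest[i + 1..].trim_start(),
            None => return out,
        }
    }
    while let Some(eq) = rest.find('=') {
        let key = rest[..eq].trim().to_string();
        let after = &rest[eq + 1..];
        let (value, remaining) = if let Some(q) = after.strip_prefix('"') {
            // Quoted value: read up to the closing quote.
            match q.find('"') {
                Some(end) => (&q[..end], &q[end + 1..]),
                None => (q, ""),
            }
        } else {
            // Bare value: read up to the next whitespace.
            match after.find(char::is_whitespace) {
                Some(end) => (&after[..end], &after[end..]),
                None => (after, ""),
            }
        };
        out.push((key, value.to_string()));
        rest = remaining.trim_start();
    }
    out
}
```

The reward of a step can then be read with something like `fields.iter().find(|(k, _)| k == "reward")` followed by `parse::<f64>()` on the value.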