Watch the agent design — compare heuristic vs fine-tuned LoRA models in real time

Tip: Episodes take ~30-60s on CPU. Pre-computed results are ready in the Benchmark tab, or read the Blog while you wait.

Task

Each task tests different layout skills: hierarchy, spacing, and reading order.

Policy

Heuristic is the teacher that generated SFT training data. LLM is the student.

Run Mode

Run

Cached = instant pre-computed results. Live = real model inference on CPU.

Benchmark Results

36 episodes total: 4 backends × 3 tasks × 3 seeds. Deterministic environment, MPS (M1) inference. Every number is reproducible.
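The episode count follows from a full cross product of backends, tasks, and seeds. A minimal sketch of how such a grid could be enumerated is shown here; the backend and task labels mirror the tables on this page, while the seed values are placeholders because the page does not list the real ones.

```python
from itertools import product

# Labels taken from the benchmark tables; identifiers are illustrative.
BACKENDS = ["heuristic", "base", "sft", "grpo"]
TASKS = ["poster", "editorial", "dense_flyer"]
SEEDS = [0, 1, 2]  # hypothetical seeds, not the ones used for the benchmark


def episode_grid():
    """Return every (backend, task, seed) combination in a fixed order."""
    return list(product(BACKENDS, TASKS, SEEDS))


grid = episode_grid()
print(len(grid))  # 4 * 3 * 3 combinations
```

Iterating the grid in a fixed order is one simple way to keep a benchmark run repeatable when the environment itself is deterministic.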

Overall Performance

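| Backend | Instruction Score | Total Reward | Avg Time | LLM Steer Rate |
|---|---|---|---|---|
| Heuristic | 0.5564 | 1.588 | 0.0s | |
| Base Qwen (no LoRA) | 0.5367 | 1.679 | 11.5s | 100% |
| SFT Fine-tuned | 0.5557 | 1.789 | 16.8s | 100% |
| GRPO Fine-tuned | 0.5599 | 1.854 | 12.0s | 100% |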

Per-Task Breakdown

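| Task | Backend | Instruction Score | Total Reward |
|---|---|---|---|
| Poster (easy) | Heuristic | 0.5033 | 1.319 |
| Poster (easy) | Base | 0.5087 | 1.400 |
| Poster (easy) | SFT | 0.5238 | 1.435 |
| Poster (easy) | GRPO | 0.5129 | 1.455 |
| Editorial (med) | Heuristic | 0.5424 | 1.544 |
| Editorial (med) | Base | 0.4866 | 1.658 |
| Editorial (med) | SFT | 0.4878 | 1.894 |
| Editorial (med) | GRPO | 0.4795 | 1.966 |
| Dense Flyer (hard) | Heuristic | 0.6235 | 1.900 |
| Dense Flyer (hard) | Base | 0.6148 | 1.980 |
| Dense Flyer (hard) | SFT | 0.6555 | 2.038 |
| Dense Flyer (hard) | GRPO | 0.6872 | 2.139 |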
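The Overall Performance figures appear to be unweighted means of the three per-task rows. The sketch below checks that reading against the published numbers; the tolerance of 0.001 is an assumption to absorb rounding in the displayed values.

```python
from statistics import mean

# Per-task values copied from the breakdown: poster, editorial, dense flyer.
PER_TASK = {
    "Heuristic": {"instr": [0.5033, 0.5424, 0.6235], "reward": [1.319, 1.544, 1.900]},
    "Base": {"instr": [0.5087, 0.4866, 0.6148], "reward": [1.400, 1.658, 1.980]},
    "SFT": {"instr": [0.5238, 0.4878, 0.6555], "reward": [1.435, 1.894, 2.038]},
    "GRPO": {"instr": [0.5129, 0.4795, 0.6872], "reward": [1.455, 1.966, 2.139]},
}

# Overall values copied from the Overall Performance table.
OVERALL = {
    "Heuristic": {"instr": 0.5564, "reward": 1.588},
    "Base": {"instr": 0.5367, "reward": 1.679},
    "SFT": {"instr": 0.5557, "reward": 1.789},
    "GRPO": {"instr": 0.5599, "reward": 1.854},
}


def matches_overall(tol=1e-3):
    """Map each backend to True if both per-task means match the overall row."""
    return {
        name: all(
            abs(mean(rows[key]) - OVERALL[name][key]) <= tol
            for key in ("instr", "reward")
        )
        for name, rows in PER_TASK.items()
    }


print(matches_overall())
```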

Honest Assessment

These results are real. No cherry-picking, no hidden runs. The environment is deterministic — re-run with the same seeds and you get the same numbers.

What works

What's honest

What to improve

What is DesignGym?

DesignGym 2.0 is an OpenEnv-compatible RL environment where an LLM agent learns to improve graphic layouts through sequential actions — move, resize, align, reflow, promote, finalize — evaluated by computable aesthetic metrics (overlap, alignment, spacing, hierarchy, reading order, instruction fit).
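To make "computable aesthetic metric" concrete, here is a minimal sketch of what an overlap metric could look like. It is not DesignGym's actual scoring code: the `Box` type, the normalization by total box area, and the convention that 1.0 means "no overlap" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned layout element (hypothetical representation)."""
    x: float
    y: float
    w: float
    h: float


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two boxes, or 0.0 if they do not intersect."""
    dx = min(a.x + a.w, b.x + b.w) - max(a.x, b.x)
    dy = min(a.y + a.h, b.y + b.h) - max(a.y, b.y)
    return max(dx, 0.0) * max(dy, 0.0)


def overlap_score(boxes: list) -> float:
    """Return 1.0 when no boxes overlap, decreasing as overlap grows."""
    total_area = sum(b.w * b.h for b in boxes)
    if total_area == 0:
        return 1.0
    overlap = sum(intersection_area(a, b) for a, b in combinations(boxes, 2))
    return max(0.0, 1.0 - overlap / total_area)


print(overlap_score([Box(0, 0, 10, 10), Box(5, 5, 10, 10)]))
```

A metric of this kind depends only on element geometry, which is what lets the environment score layouts deterministically without a learned reward model.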

The training pipeline: Heuristic Planner generates expert trajectories → SFT teaches the model the action interface (0% → 100% valid JSON) → GRPO learns which valid actions are better via environment reward.

Project Links

Training Pipeline

DesignGym architecture diagram

End-to-end: OpenEnv environment → heuristic planner bootstraps SFT data → SFT adapter locks in the action interface → GRPO learns design preference from verifiable reward.

SFT: Teaching the Interface

Base Qwen 0.5B understands design language but cannot produce executable JSON actions. SFT on heuristic planner trajectories raises the valid-JSON rate from 0% to 100%: the model gains a capability it did not have before, rather than a marginal improvement on an existing one.
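A valid-JSON rate of this kind can be measured by trying to parse each model output into an executable action. The sketch below assumes an action is a JSON object with a `type` field naming one of the six actions listed on this page; the field name and the example payloads are hypothetical, not the project's actual schema.

```python
import json

# Action names come from the page; the "type" key is an assumed field name.
ACTION_TYPES = {"move", "resize", "align", "reflow", "promote", "finalize"}


def parse_action(text: str):
    """Return the action dict, or None if the text is not an executable action."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    action_type = obj.get("type")
    if not isinstance(action_type, str) or action_type not in ACTION_TYPES:
        return None
    return obj


def valid_json_rate(outputs: list) -> float:
    """Fraction of model outputs that parse into an executable action."""
    if not outputs:
        return 0.0
    return sum(parse_action(o) is not None for o in outputs) / len(outputs)


samples = [
    '{"type": "move", "target": "title", "dx": 4}',
    "I would move the title up a little.",
    '{"type": "fly"}',
    '{"type": "finalize"}',
]
print(valid_json_rate(samples))
```

Free-form prose and well-formed JSON with an unknown action both count as invalid here, since neither can be executed by the environment.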

SFT training metrics

GRPO: Learning Preference

Once the model can act, GRPO teaches it which valid action is better. It samples multiple candidates, executes them in the environment, and increases probability of higher-reward actions. No reward model needed — the environment is the oracle.
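The core of the idea is that each candidate is scored relative to the other candidates sampled for the same state. A minimal sketch of that group-relative normalization follows; `toy_reward` stands in for executing an action in the environment and is purely illustrative, and the actual trainer used for this project may differ in details such as clipping and KL regularization.

```python
from statistics import mean, pstdev
from typing import Callable, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> list:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def score_group(candidates: Sequence[dict], reward_fn: Callable[[dict], float]) -> list:
    """Execute every candidate and pair it with its group-relative advantage."""
    rewards = [reward_fn(c) for c in candidates]
    return list(zip(candidates, group_relative_advantages(rewards)))


def toy_reward(action: dict) -> float:
    """Hypothetical stand-in for the environment: best when dx == -10."""
    return -abs(action.get("dx", 0.0) + 10.0)


candidates = [{"type": "move", "dx": dx} for dx in (-10.0, -5.0, 0.0, 5.0)]
for action, advantage in score_group(candidates, toy_reward):
    print(action["dx"], round(advantage, 3))
```

Candidates with positive advantage have their probability increased and those with negative advantage have it decreased. When every candidate earns the same reward, all advantages are zero and the group contributes no learning signal.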

Results from Training (Blog Table)

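| Policy | Final Score | Instr Score | Valid JSON | Early Finalize |
|---|---|---|---|---|
| Base Qwen 0.5B | 0.6948 | 0.5360 | 0% | 100% |
| SFT Qwen 0.5B | 0.7101 | 0.6263 | 100% | 0% |
| GRPO Qwen 0.5B | 0.6717 | 0.5483 | 98% | 67% |
| GRPO Best-of-4 | 0.6781 | 0.5817 | 100% | 17% |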

How to Make It Better

Environment

3 tasks (poster, editorial, dense flyer) testing different layout skills. Deterministic scoring via 7 computable aesthetic metrics. Fully OpenEnv-compatible: reset(), step(action), typed Pydantic models, FastAPI server, Docker deployment.
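The `reset()` / `step(action)` loop can be illustrated with a toy environment. This is a sketch of the interaction pattern only, not DesignGym's code: it uses standard-library dataclasses in place of Pydantic models, and `ToyLayoutEnv`, `Observation`, `StepResult`, the action fields, and the single alignment metric are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    boxes: dict          # element id -> [x, y] position
    step_count: int = 0
    done: bool = False


@dataclass
class StepResult:
    observation: Observation
    reward: float
    done: bool


class ToyLayoutEnv:
    """Deterministic toy environment with a single left-edge alignment metric."""

    MAX_STEPS = 8

    def reset(self, seed: int = 0) -> Observation:
        # The seed only shifts the title's starting x position.
        self._obs = Observation(boxes={"title": [20.0 + seed, 10.0], "body": [10.0, 60.0]})
        return self._obs

    def _score(self) -> float:
        # 0.0 when the left edges match, negative otherwise.
        return -abs(self._obs.boxes["title"][0] - self._obs.boxes["body"][0])

    def step(self, action: dict) -> StepResult:
        if self._obs.done:
            raise RuntimeError("episode finished; call reset()")
        before = self._score()
        if action["type"] == "move":
            box = self._obs.boxes[action["target"]]
            box[0] += action.get("dx", 0.0)
            box[1] += action.get("dy", 0.0)
        self._obs.step_count += 1
        self._obs.done = action["type"] == "finalize" or self._obs.step_count >= self.MAX_STEPS
        # Reward is the change in score caused by this action.
        return StepResult(self._obs, self._score() - before, self._obs.done)


env = ToyLayoutEnv()
env.reset()
print(env.step({"type": "move", "target": "title", "dx": -10.0}).reward)
print(env.step({"type": "finalize"}).done)
```

Defining the reward as the change in a deterministic score means that the same seed and the same action sequence always yield the same trajectory.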