This repository contains the dataset pipeline and evaluation suite for our ICML paper studying whether explicitly prompting LLMs to write iterative or recursive solutions affects code quality and correctness.
.
├── dataset/ # Pipeline to build the verified iterative/recursive pair dataset
├── Eval/ # Benchmark suite: 4 models × 5 subcases
└── README.md
The evaluation dataset is published on HuggingFace:
CLEVDEV/full_before_conv_icmll_pub
Each row is a problem (LeetCode or Codeforces) with:
- `iterative_solution` / `recursive_solution`: verified Python solutions in both paradigms
- `iterative_solution_obfuscated` / `recursive_solution_obfuscated`: function-name-obfuscated variants for tracing tasks
- `tests`, `examples`, `synthetic_tests`: test cases with `tc_difficulty` labels
- `rename_map`: mapping from original to obfuscated function names
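As a rough illustration of the row layout described above, the sketch below checks that a row carries the listed fields. The field names come from the list; the `train` split name, the helper names `load_rows` and `missing_fields`, and the toy row are assumptions made for illustration, not something the dataset card is known to guarantee.

```python
# Field names taken from the list above; everything else is illustrative.
EXPECTED_FIELDS = (
    "iterative_solution",
    "recursive_solution",
    "iterative_solution_obfuscated",
    "recursive_solution_obfuscated",
    "tests",
    "examples",
    "synthetic_tests",
    "rename_map",
)


def load_rows(split="train"):
    """Load the published dataset. The split name is an assumption."""
    # Imported lazily so the rest of the sketch runs without `datasets`.
    from datasets import load_dataset

    return load_dataset("CLEVDEV/full_before_conv_icmll_pub", split=split)


def missing_fields(row):
    """Return the expected field names that are absent from a row."""
    return [name for name in EXPECTED_FIELDS if name not in row]


# Toy row (hypothetical values) standing in for one dataset record.
toy_row = {name: None for name in EXPECTED_FIELDS if name != "rename_map"}
print(missing_fields(toy_row))  # ['rename_map']
```

Calling `missing_fields(row)` on each row returned by `load_rows()` would be one way to sanity-check a download before running the benchmark.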
The dataset pipeline is in dataset/. The steps in order:
| Step | Script | Description |
|---|---|---|
| 1 | merge_codeforces_datasets.py | Merge CF problem metadata with accepted Python submissions |
| 2 | verify_codeforces.py | Filter to solutions that pass all example test cases |
| 3 | build_dataset.py | Combine LC + CF; classify iterative/recursive via AST |
| 4 | classify_rubric.py | LLM-based iterative/recursive paradigm classification |
| 5 | transform_merged.py | LLM-assisted conversion between paradigms |
| 6 | gen_llm_tests.py | Generate synthetic test cases for CF problems |
| 7 | classify_testcases.py | Measure per-test-case CPU time; assign easy/medium/hard |
| 8 | rename_functions.py | Add obfuscated solution variants → final JSONL |
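To make step 3 more concrete, here is a minimal sketch of what an AST-based iterative/recursive check could look like. It is not the logic in `build_dataset.py`; the function name `classify_paradigm`, the three labels, and the heuristic itself (a self-call by name means recursive, otherwise any loop means iterative) are assumptions chosen for illustration.

```python
import ast


def classify_paradigm(source):
    """Label Python source as 'recursive', 'iterative', or 'neither'.

    Heuristic: a function that calls itself by name counts as recursive;
    otherwise any for/while loop counts as iterative.
    """
    tree = ast.parse(source)
    has_loop = any(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))

    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            # Direct call `f(...)` or attribute call such as `self.f(...)`.
            if isinstance(callee, ast.Name):
                name = callee.id
            else:
                name = getattr(callee, "attr", None)
            if name == func.name:
                return "recursive"
    return "iterative" if has_loop else "neither"


recursive_src = "def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)\n"
iterative_src = (
    "def fact(n):\n"
    "    out = 1\n"
    "    for i in range(2, n + 1):\n"
    "        out *= i\n"
    "    return out\n"
)
print(classify_paradigm(recursive_src))  # recursive
print(classify_paradigm(iterative_src))  # iterative
```

A purely syntactic check like this misses mutual recursion and comprehension-based iteration, which may be one reason the pipeline follows it with an LLM-based classification in step 4.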
See Eval/EVAL_README.md for full details on models, subcases, and how to run the benchmark.
Requirements:
conda create -n vlm_rl python=3.11
conda activate vlm_rl
pip install vllm instructor pydantic datasets python-dotenv transformers torch
Run the benchmark (vLLM backend):
# Terminal A — serve the model
conda activate vllm_serve
bash Eval/launch_vllm.sh qwen-instruct # or: mistral | llama | gemma4
# Terminal B — run all 5 subcases
conda activate vlm_rl
bash Eval/run_qwen_instruct.sh # or: run_mistral.sh | run_llama.sh | run_gemma4.sh
Results are written to Eval/results/<model-slug>/ as JSONL files (one per subcase).
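Once a run finishes, a quick way to confirm that every subcase produced output is to count the records in each JSONL file. The sketch below assumes only that each file holds one JSON object per line; the helper name `count_records`, the demo file name, and the demo record keys are made up, and the actual result schema is not shown here.

```python
import json
import tempfile
from pathlib import Path


def count_records(results_dir):
    """Map each JSONL file stem in results_dir to its record count."""
    counts = {}
    for path in sorted(Path(results_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            # One JSON object per non-empty line; json.loads rejects bad lines.
            records = [json.loads(line) for line in fh if line.strip()]
        counts[path.stem] = len(records)
    return counts


# Demo on a throwaway directory; the file name and record keys are made up.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "codegen_iter.jsonl"
    demo_file.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    demo_counts = count_records(tmp)

print(demo_counts)  # {'codegen_iter': 2}
```

In practice the argument would be the results directory for one model, so that a missing or truncated subcase file shows up as an absent key or a low count.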
| Model | HuggingFace ID |
|---|---|
| Qwen3-4B-Instruct | Qwen/Qwen3-4B-Instruct-2507 |
| Mistral-Small-24B | mistralai/Mistral-Small-24B-Instruct-2501 |
| Llama-3.1-8B | meta-llama/Llama-3.1-8B-Instruct |
| Gemma-4-31B | google/gemma-4-31B-it |
| Subcase | Script | Description |
|---|---|---|
| codegen_iter | codegen_iterative.py | Generate code with explicit iterative prompt |
| codegen_rec | codegen_recursive.py | Generate code with explicit recursive prompt |
| codegen_none | codegen_no_hint.py | Generate code with no paradigm hint |
| tracing_clean | test_solution_tracing.py | Predict return value (original variable names) |
| tracing_obf | test_solution_tracing.py | Predict return value (obfuscated variable names) |
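The tracing subcases ask a model to predict what a solution returns. One way such a prediction could be scored is sketched below; this is not the logic in `test_solution_tracing.py`. The helper names, the assumption that the model answers with a Python literal, and the toy obfuscated function `f_0` are all hypothetical.

```python
import ast


def expected_return(solution_src, func_name, args):
    """Run a solution and return what func_name(*args) evaluates to."""
    namespace = {}
    exec(solution_src, namespace)  # trusted code only; sandbox anything else
    return namespace[func_name](*args)


def prediction_matches(predicted_text, expected):
    """Compare a model's textual prediction with the true return value."""
    try:
        predicted = ast.literal_eval(predicted_text.strip())
    except (ValueError, SyntaxError, TypeError):
        # Anything that is not a plain Python literal counts as a miss.
        return False
    return predicted == expected


# Toy obfuscated solution (hypothetical): Fibonacci under the name f_0.
toy_src = "def f_0(n):\n    return n if n < 2 else f_0(n - 1) + f_0(n - 2)\n"
truth = expected_return(toy_src, "f_0", (10,))
print(truth)                            # 55
print(prediction_matches("55", truth))  # True
print(prediction_matches("54", truth))  # False
```

Parsing the answer with `ast.literal_eval` rather than comparing strings means `[1, 2]` and `[1,2]` are treated as the same prediction, at the cost of rejecting answers that are not valid literals.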
- Python 3.11+
- vLLM (serving): conda env `vllm_serve`
- `instructor`, `pydantic`, `datasets`, `transformers`, `torch`: conda env `vlm_rl`
- ~50 GB GPU VRAM for the 24B–31B models (single A6000 or equivalent)
- Llama-3.1-8B (~16 GB) and Qwen3-4B (~8 GB) fit on a single A6000