vlgiitr/LRLBench

Does Prompting an LLM to Use a Specific Algorithmic Paradigm Improve Code Generation?

This repository contains the dataset pipeline and evaluation suite for our ICML paper studying whether explicitly prompting LLMs to write iterative or recursive solutions affects code quality and correctness.


Repository Structure

.
├── dataset/          # Pipeline to build the verified iterative/recursive pair dataset
├── Eval/             # Benchmark suite: 4 models × 5 subcases
└── README.md

Dataset

The evaluation dataset is published on HuggingFace:

CLEVDEV/full_before_conv_icmll_pub

Each row is a problem (LeetCode or Codeforces) with:

  • iterative_solution / recursive_solution — verified Python solutions in both paradigms
  • iterative_solution_obfuscated / recursive_solution_obfuscated — function-name-obfuscated variants for tracing tasks
  • tests, examples, synthetic_tests — test cases with tc_difficulty labels
  • rename_map — mapping from original to obfuscated function names
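As a rough illustration of how these fields relate, the sketch below checks that a row's rename_map agrees with the function names defined in a solution and in its obfuscated variant. It assumes rename_map is a plain dict and that the solutions are Python source strings; the actual storage format is not confirmed here, and both the helper names and the sample row are made up for the example.

```python
import ast


def defined_function_names(source: str) -> set[str]:
    """Return the names of all functions defined anywhere in the source."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def rename_map_is_consistent(original: str, obfuscated: str, rename_map: dict[str, str]) -> bool:
    """Check that every original name is defined in the original solution
    and every obfuscated name is defined in the obfuscated solution."""
    orig_names = defined_function_names(original)
    obf_names = defined_function_names(obfuscated)
    return all(old in orig_names and new in obf_names for old, new in rename_map.items())


# Hypothetical row; real rows come from the HuggingFace dataset.
row = {
    "recursive_solution": "def fact(n):\n    return 1 if n <= 1 else n * fact(n - 1)\n",
    "recursive_solution_obfuscated": "def f0(n):\n    return 1 if n <= 1 else n * f0(n - 1)\n",
    "rename_map": {"fact": "f0"},
}
consistent = rename_map_is_consistent(
    row["recursive_solution"],
    row["recursive_solution_obfuscated"],
    row["rename_map"],
)
```

If rename_map turns out to be stored as a JSON string, decoding it with json.loads before the check would be the only change needed.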

Rebuilding the dataset (optional)

The dataset pipeline lives in dataset/. Run the scripts in this order:

| Step | Script | Description |
|------|--------|-------------|
| 1 | merge_codeforces_datasets.py | Merge CF problem metadata with accepted Python submissions |
| 2 | verify_codeforces.py | Filter to solutions that pass all example test cases |
| 3 | build_dataset.py | Combine LC + CF; classify iterative/recursive via AST |
| 4 | classify_rubric.py | LLM-based iterative/recursive paradigm classification |
| 5 | transform_merged.py | LLM-assisted conversion between paradigms |
| 6 | gen_llm_tests.py | Generate synthetic test cases for CF problems |
| 7 | classify_testcases.py | Measure per-test-case CPU time; assign easy/medium/hard |
| 8 | rename_functions.py | Add obfuscated solution variants → final JSONL |
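To make the AST-based classification in step 3 more concrete, here is a minimal sketch of one way to detect direct recursion syntactically. It is not taken from build_dataset.py; the function name is_directly_recursive and the heuristic itself are assumptions for illustration.

```python
import ast


def is_directly_recursive(source: str) -> bool:
    """Return True if any function in the source calls itself by name.

    Heuristic only: it matches plain calls (f(...)) and attribute calls
    (self.f(...)) whose name equals the enclosing function's name, so an
    unrelated object's method with the same name would also match.
    """
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for node in ast.walk(func):
            if not isinstance(node, ast.Call):
                continue
            callee = node.func
            if isinstance(callee, ast.Name) and callee.id == func.name:
                return True
            if isinstance(callee, ast.Attribute) and callee.attr == func.name:
                return True
    return False


recursive_src = "def fib(n):\n    return n if n < 2 else fib(n - 1) + fib(n - 2)\n"
iterative_src = (
    "def fib(n):\n"
    "    a, b = 0, 1\n"
    "    for _ in range(n):\n"
    "        a, b = b, a + b\n"
    "    return a\n"
)
recursive_flag = is_directly_recursive(recursive_src)
iterative_flag = is_directly_recursive(iterative_src)
```

A purely syntactic check like this one misses mutual recursion (f calls g, g calls f) and cannot tell whether a loop or the recursive call carries the main logic.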

Evaluation

See Eval/EVAL_README.md for full details on models, subcases, and how to run the benchmark.

Quick start

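Serving uses a separate conda env, vllm_serve, with vLLM installed (see Requirements below). Create the benchmark environment, vlm_rl: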

conda create -n vlm_rl python=3.11
conda activate vlm_rl
pip install vllm instructor pydantic datasets python-dotenv transformers torch

Run the benchmark (vLLM backend):

# Terminal A — serve the model
conda activate vllm_serve
bash Eval/launch_vllm.sh qwen-instruct   # or: mistral | llama | gemma4

# Terminal B — run all 5 subcases
conda activate vlm_rl
bash Eval/run_qwen_instruct.sh           # or: run_mistral.sh | run_llama.sh | run_gemma4.sh

Results are written to Eval/results/<model-slug>/ as JSONL files (one per subcase).
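A small sketch for loading those result files for inspection. It assumes each file ends in .jsonl and holds one JSON object per line; the record schema and the file naming used in the demo (codegen_iter.jsonl) are assumptions, not confirmed by this README.

```python
import json
import tempfile
from pathlib import Path


def load_results(results_dir: str) -> dict[str, list[dict]]:
    """Map each *.jsonl file stem in results_dir to its parsed records."""
    results: dict[str, list[dict]] = {}
    for path in sorted(Path(results_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            # Skip blank lines so a trailing newline does not break parsing.
            results[path.stem] = [json.loads(line) for line in fh if line.strip()]
    return results


# Demo on a throwaway directory with made-up records.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "codegen_iter.jsonl"
    demo_file.write_text('{"id": 1}\n{"id": 2}\n', encoding="utf-8")
    loaded = load_results(tmp)
    record_counts = {name: len(rows) for name, rows in loaded.items()}
```

For a real run, point load_results at the model's directory under Eval/results/ and inspect the keys of the first record to see which fields each subcase writes.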

Models benchmarked

| Model | HuggingFace ID |
|-------|----------------|
| Qwen3-4B-Instruct | Qwen/Qwen3-4B-Instruct-2507 |
| Mistral-Small-24B | mistralai/Mistral-Small-24B-Instruct-2501 |
| Llama-3.1-8B | meta-llama/Llama-3.1-8B-Instruct |
| Gemma-4-31B | google/gemma-4-31B-it |

Subcases (5 per model)

| Subcase | Script | Description |
|---------|--------|-------------|
| codegen_iter | codegen_iterative.py | Generate code with explicit iterative prompt |
| codegen_rec | codegen_recursive.py | Generate code with explicit recursive prompt |
| codegen_none | codegen_no_hint.py | Generate code with no paradigm hint |
| tracing_clean | test_solution_tracing.py | Predict return value (original function names) |
| tracing_obf | test_solution_tracing.py | Predict return value (obfuscated function names) |
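The tracing subcases ask a model to predict what a solution returns. A sketch of how such a prediction could be scored is shown below: run the solution to get the ground-truth value, then compare it with the model's answer parsed as a Python literal. This is not the logic of test_solution_tracing.py; the helper names, the entry point f0, and the assumption that predictions arrive as Python literals are all hypothetical.

```python
import ast


def actual_return_value(source: str, entry_point: str, args: tuple):
    """Execute a solution and return entry_point(*args).

    exec runs arbitrary code, so this is only appropriate for trusted,
    already-verified solutions.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace[entry_point](*args)


def prediction_matches(predicted_text: str, expected) -> bool:
    """Parse the model's answer as a Python literal and compare by equality."""
    try:
        predicted = ast.literal_eval(predicted_text.strip())
    except (ValueError, SyntaxError):
        # Unparseable answers count as wrong rather than raising.
        return False
    return predicted == expected


# Made-up obfuscated solution and input.
source = "def f0(n):\n    return 1 if n <= 1 else n * f0(n - 1)\n"
expected = actual_return_value(source, "f0", (5,))
correct = prediction_matches(" 120 ", expected)
wrong = prediction_matches("one hundred", expected)
```

Comparing parsed literals rather than raw strings avoids penalizing harmless formatting differences such as whitespace or quote style.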

Requirements

  • Python 3.11+
  • vLLM (serving) — conda env vllm_serve
  • instructor, pydantic, datasets, transformers, torch — conda env vlm_rl
  • ~50 GB GPU VRAM for the 24B–31B models (single A6000 or equivalent)
  • Llama-3.1-8B (~16 GB) and Qwen3-4B (~8 GB) fit on a single A6000
