Does Prompting an LLM to Use a Specific Algorithmic Paradigm Improve Code Generation?

This repository contains the dataset pipeline and evaluation suite for our ICML paper studying whether explicitly prompting LLMs to write iterative or recursive solutions affects code quality and correctness.

Repository Structure

.
├── dataset/          # Pipeline to build the verified iterative/recursive pair dataset
├── Eval/             # Benchmark suite: 4 models × 5 subcases
└── README.md

Dataset

The evaluation dataset is published on HuggingFace:

CLEVDEV/full_before_conv_icmll_pub
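A minimal sketch of loading the dataset with the `datasets` library and checking a row against the fields this README documents. The split name `"train"` and the helper names `load_rows` and `missing_fields` are assumptions for illustration, not part of the repository.

```python
from typing import Any, Mapping

# Dataset ID exactly as given in this README.
DATASET_ID = "CLEVDEV/full_before_conv_icmll_pub"

# Row fields named in this README.
EXPECTED_FIELDS = (
    "iterative_solution",
    "recursive_solution",
    "iterative_solution_obfuscated",
    "recursive_solution_obfuscated",
    "tests",
    "examples",
    "synthetic_tests",
    "rename_map",
)


def load_rows(split: str = "train"):
    """Load one split from the Hub. The split name is an assumption."""
    # Imported lazily so the field check below works without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)


def missing_fields(row: Mapping[str, Any]) -> list[str]:
    """Return the documented fields that are absent from a row."""
    return [name for name in EXPECTED_FIELDS if name not in row]


# Example usage (needs network access):
#   rows = load_rows()
#   print(len(rows), missing_fields(rows[0]))
```

If `missing_fields` returns a non-empty list for a real row, the published schema likely differs from the field list in this README.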

Each row is a problem (LeetCode or Codeforces) with:

iterative_solution / recursive_solution — verified Python solutions in both paradigms
iterative_solution_obfuscated / recursive_solution_obfuscated — function-name-obfuscated variants for tracing tasks
tests, examples, synthetic_tests — test cases with tc_difficulty labels
rename_map — mapping from original to obfuscated function names
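To illustrate how an obfuscated variant and its `rename_map` relate, here is a rough sketch of function-name obfuscation using the standard `ast` module. It is not the code in `rename_functions.py`; the `fn_<k>` naming scheme and the helper `obfuscate_function_names` are hypothetical.

```python
import ast


class _Renamer(ast.NodeTransformer):
    """Rewrite function definitions and bare-name references."""

    def __init__(self, rename_map: dict[str, str]):
        self.rename_map = rename_map

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        node.name = self.rename_map.get(node.name, node.name)
        self.generic_visit(node)  # also rename calls inside the body
        return node

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.rename_map.get(node.id, node.id)
        return node


def obfuscate_function_names(source: str) -> tuple[str, dict[str, str]]:
    """Return (obfuscated source, map from original to obfuscated names).

    Simplifications: dunder methods are kept, attribute calls such as
    `self.helper()` are not rewritten, and a variable that shares a
    function's name is renamed along with it.
    """
    tree = ast.parse(source)
    rename_map: dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and not node.name.startswith("__"):
            if node.name not in rename_map:
                rename_map[node.name] = f"fn_{len(rename_map)}"  # hypothetical scheme
    new_tree = _Renamer(rename_map).visit(tree)
    return ast.unparse(new_tree), rename_map
```

Because recursive solutions call themselves by name, the definition and every call site have to be renamed together, which is why the sketch works on the syntax tree rather than on raw text.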

Rebuilding the dataset (optional)

The dataset pipeline is in dataset/. The steps in order:

Step	Script	Description
1	`merge_codeforces_datasets.py`	Merge CF problem metadata with accepted Python submissions
2	`verify_codeforces.py`	Filter to solutions that pass all example test cases
3	`build_dataset.py`	Combine LC + CF; classify iterative/recursive via AST
4	`classify_rubric.py`	LLM-based iterative/recursive paradigm classification
5	`transform_merged.py`	LLM-assisted conversion between paradigms
6	`gen_llm_tests.py`	Generate synthetic test cases for CF problems
7	`classify_testcases.py`	Measure per-test-case CPU time; assign easy/medium/hard
8	`rename_functions.py`	Add obfuscated solution variants → final JSONL
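Step 3 classifies solutions via the AST. One way such a check could look is sketched below, assuming the simplest criterion: a function that calls itself directly is recursive, and otherwise any `for`/`while` loop makes the solution iterative. The function `classify_paradigm` and the `"unknown"` label are hypothetical; the criteria in `build_dataset.py` may differ.

```python
import ast


def _calls_itself(func: ast.AST, name: str) -> bool:
    """True if `func` contains a direct call to `name` or `self.<name>`."""
    for node in ast.walk(func):
        if not isinstance(node, ast.Call):
            continue
        callee = node.func
        if isinstance(callee, ast.Name) and callee.id == name:
            return True
        if (
            isinstance(callee, ast.Attribute)
            and callee.attr == name
            and isinstance(callee.value, ast.Name)
            and callee.value.id == "self"
        ):
            return True  # method-style recursion, e.g. self.solve(...)
    return False


def classify_paradigm(source: str) -> str:
    """Return "recursive", "iterative", or "unknown" for a solution.

    Mutual recursion (a calls b, b calls a) is not detected here.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if _calls_itself(node, node.name):
                return "recursive"
    has_loop = any(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))
    return "iterative" if has_loop else "unknown"
```

A purely syntactic check like this is cheap but coarse, which may be one reason the pipeline follows it with an LLM-based classification in step 4.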

Evaluation

See Eval/EVAL_README.md for full details on models, subcases, and how to run the benchmark.

Quick start

Set up the client environment (the vLLM server runs from a separate conda env, vllm_serve):

conda create -n vlm_rl python=3.11
conda activate vlm_rl
pip install vllm instructor pydantic datasets python-dotenv transformers torch

Run the benchmark (vLLM backend):

# Terminal A — serve the model
conda activate vllm_serve
bash Eval/launch_vllm.sh qwen-instruct   # or: mistral | llama | gemma4

# Terminal B — run all 5 subcases
conda activate vlm_rl
bash Eval/run_qwen_instruct.sh           # or: run_mistral.sh | run_llama.sh | run_gemma4.sh

Results are written to Eval/results/<model-slug>/ as JSONL files (one per subcase).
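A small sketch for inspecting the output, assuming only what is stated above: a directory of JSONL files, one per subcase. The record schema is not documented here, so the helpers `parse_jsonl` and `count_records` (both hypothetical names) only parse and count records.

```python
import json
from pathlib import Path


def parse_jsonl(text: str) -> list[dict]:
    """Parse JSONL text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def count_records(results_dir: str) -> dict[str, int]:
    """Map each *.jsonl file stem in `results_dir` to its record count."""
    root = Path(results_dir)
    if not root.is_dir():
        return {}
    return {
        path.stem: len(parse_jsonl(path.read_text(encoding="utf-8")))
        for path in sorted(root.glob("*.jsonl"))
    }


# Example usage, substituting the actual model slug:
#   print(count_records("Eval/results/<model-slug>"))
```

Comparing the counts across the five files is a quick way to spot a subcase run that stopped early.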

Models benchmarked

Model	HuggingFace ID
Qwen3-4B-Instruct	`Qwen/Qwen3-4B-Instruct-2507`
Mistral-Small-24B	`mistralai/Mistral-Small-24B-Instruct-2501`
Llama-3.1-8B	`meta-llama/Llama-3.1-8B-Instruct`
Gemma-4-31B	`google/gemma-4-31B-it`

Subcases (5 per model)

Subcase	Script	Description
`codegen_iter`	`codegen_iterative.py`	Generate code with explicit iterative prompt
`codegen_rec`	`codegen_recursive.py`	Generate code with explicit recursive prompt
`codegen_none`	`codegen_no_hint.py`	Generate code with no paradigm hint
`tracing_clean`	`test_solution_tracing.py`	Predict return value (original function names)
`tracing_obf`	`test_solution_tracing.py`	Predict return value (obfuscated function names)
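The three code-generation subcases differ only in the paradigm hint given to the model. The sketch below shows one way that could be expressed. The prompt wording and the `build_prompt` helper are hypothetical and are not taken from the `codegen_*.py` scripts.

```python
# Hypothetical hint text per subcase; the actual prompts live in Eval/.
PARADIGM_HINTS = {
    "codegen_iter": "Write an iterative solution: use loops and do not use recursion.",
    "codegen_rec": "Write a recursive solution: the function must call itself.",
    "codegen_none": "",  # no paradigm hint
}


def build_prompt(problem_statement: str, subcase: str) -> str:
    """Assemble a code-generation prompt for one subcase."""
    if subcase not in PARADIGM_HINTS:
        raise ValueError(f"unknown codegen subcase: {subcase!r}")
    parts = ["Solve the following problem in Python.", problem_statement.strip()]
    hint = PARADIGM_HINTS[subcase]
    if hint:
        parts.append(hint)
    return "\n\n".join(parts)
```

Keeping everything except the hint identical means any difference in correctness between subcases can be attributed to the paradigm instruction rather than to other prompt changes.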

Requirements

Python 3.11+
vLLM (serving) — conda env vllm_serve
instructor, pydantic, datasets, transformers, torch — conda env vlm_rl
~50 GB of GPU VRAM for the 24B–31B models (single A6000 or equivalent)
Llama-3.1-8B (~16 GB) and Qwen3-4B (~8 GB) also fit on a single A6000
