Determine the correctness of NL2SQL translations using testing techniques.
TL;DR: SQLegit is an evaluation harness for NL2SQL systems. Given an NL2SQL translation, it runs a test suite with fast-running test cases to provide a correctness judgment.
- Python 3.10 is recommended (the Docker image uses Python 3.10).
- A dataset JSON (see the run modes below) and SQLite DB folder.
- One LLM provider credential set:
- Azure OpenAI via
langchain-openai(e.g.AZURE_OPENAI_API_KEY,AZURE_OPENAI_ENDPOINT,OPENAI_API_VERSION), or - DeepSeek via
langchain-deepseek(DEEPSEEK_API_KEY)
- Azure OpenAI via
Build:
docker build -t sqlegit .Run (mount your data; pass env via --env-file):
docker run --rm -it \
--env-file .env \
-v "/your-data-path:/SQLegit/data" \
sqlegit /bin/bashrun_judgment.py is the main entrypoint. It reads a dataset JSON and writes a JSONL file under a sibling results/ directory.
python run_judgment.py --helpRequirements:
- dataset JSON contains at least
db_id,question, andqueryorSQL - for BIRD,
evidenceis used as a hint --predicted_sql_pathpoints at a text file with one SQL per line (BIRD/Spider); the code reads the first tab-separated field if present--schema_file_pathis a JSON list withdb_identries (Spidertables.json-style)--db_root_pathcontains per-db folders:<db_id>/<db_id>.sqlite
Example:
python run_judgment.py \
--judge_name "sqlegit+sem+nos+crs+orc+slf+nlr(gpt-4o-mini-0708)" \
--benchmark_name nl2sql-bugs \
--db_root_path /path/to/databases \
--data_file_path /path/to/NL2SQL-Bugs.json \
--schema_file_path /path/to/tables.jsonHow --judge_name is interpreted:
- if it contains
sqlegit(case-insensitive), the runner builds aSQLegitJudge- backbone model is parsed from the name if it contains
gpt-...(defaults togpt-4o-mini-0708) - any of these substrings enable the matching check:
sem,nos,crs,orc,nlr,slf
- backbone model is parsed from the name if it contains
If you run SQLegit checks separately (one JSONL per check), you can combine them with a short-circuit + weighted vote policy:
python run_consensus.py --helpTo generate the per-check JSONLs, run SQLegit with only one check enabled each time, e.g.:
python run_judgment.py \
--judge_name "sqlegit+sem(gpt-4o-mini-0708)" \
--benchmark_name nl2sql-bugs \
--db_root_path /path/to/databases \
--data_file_path /path/to/NL2SQL-Bugs.json \
--schema_file_path /path/to/tables.jsonExample:
python run_consensus.py \
--benchmark-name nl2sql-bugs \
--db-root-path /path/to/databases \
--data-file-path /path/to/NL2SQL-Bugs.json \
--sem-jsonl /path/to/judgments_sem.jsonl \
--nos-jsonl /path/to/judgments_nos.jsonl \
--orc-jsonl /path/to/judgments_orc.jsonl \
--crs-jsonl /path/to/judgments_crs.jsonl \
--slf-jsonl /path/to/judgments_slf.jsonl \
--nlr-jsonl /path/to/judgments_nlr.jsonl \
--out-jsonl /path/to/final_consensus.jsonlrun_judgment.py writes under:
- a sibling
results/directory next to your--data_file_path(i.e.<data_dir>/results/) - file name pattern:
judgments,dataset=<benchmark>[+<nl2sql_model>],judge=<judge_name>[+3-shots][+cot].jsonl
Each line is a JSON dict containing:
final_judgment:true/false/"UNDETERMINED"- one key per enabled SQLegit check (e.g.
semantic_check,oracle_result, ...), each with details like:judgment,confidence,tokens_used,elapsed, per-test-case results, traces, etc.
|-- src/
| |-- judges/ # LLMJudge + SQLegitJudge implementations
| |-- testers/ # semantic/noise-row/cross-model/oracle/self-consistency/... checks
| |-- db_utils/ # schema parsing + execution helpers
| |-- eval/ # evaluation scripts (Spider/BIRD-style)
| |-- evalution.py # evaluation entrypoints (name kept for compatibility)
|-- templates/ # prompt templates for LLM-based components
|-- assets/ # design diagrams (PDF)
|-- run_judgment.py # run a judge and write JSONL
|-- run_consensus.py # combine per-check SQLegit outputs into a final decisionContributions and suggestions are welcome.
If you find bugs, encounter problems when running the code, or have suggestions for SQLegit, please submit an issue or reach out to me (kaimary1221@163.com).