ivankartac/OpeNLGauge


OpeNLGauge

Code and resources for the INLG 2025 paper OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs

Installation

Create a virtual environment (recommended Python version is 3.10) and install the dependencies:

pip install -r requirements.txt

Data

Training Data

The synthetic dataset used to train the distilled model can be found in data/training/train_ensemble_merged.json. Each training example consists of the following fields:

  • dataset: name of the source dataset
  • input_id: unique id of the input in the source dataset
  • system: system used to generate the output
  • inputs: mapping of one or more headers (e.g. "Article", "Dialogue history", "Question") to input values
  • outputs: mapping of headers (e.g. "Summary", "Response", "Answer") to output values
  • aspect_name: name of the aspect to evaluate
  • aspect_definition: definition of the evaluation aspect
  • task_type: category of the task
  • task_name: name of the task (e.g. "summarization", "data-to-text")
  • extra_task_info: extra information about the task
  • evaluation: generated evaluation of the output, consisting of error analysis and overall score (between 0 and 100)
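As a quick sanity check on the fields listed above, the following minimal sketch verifies that each example carries the documented keys and counts examples per task and aspect. It assumes the training file is a top-level JSON array of objects, which the README does not state; the helper names (`missing_fields`, `summarize`, `load_examples`) are illustrative and not part of the repository.

```python
import json
from collections import Counter
from pathlib import Path

# Field names exactly as documented in the list above.
REQUIRED_FIELDS = (
    "dataset", "input_id", "system", "inputs", "outputs",
    "aspect_name", "aspect_definition", "task_type", "task_name",
    "extra_task_info", "evaluation",
)


def missing_fields(example: dict) -> list[str]:
    """Return the documented field names that are absent from one example."""
    return [name for name in REQUIRED_FIELDS if name not in example]


def summarize(examples: list[dict]) -> Counter:
    """Count examples per (task_name, aspect_name) pair."""
    return Counter((ex["task_name"], ex["aspect_name"]) for ex in examples)


def load_examples(path: str) -> list[dict]:
    """Load the training file.

    ASSUMPTION: the file is a JSON array of example objects.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

If the layout assumption holds, calling `load_examples("data/training/train_ensemble_merged.json")` and passing the result to `summarize` gives a rough picture of how the tasks and aspects are distributed.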

The file data/training/train_ensemble_original.json contains the evaluation outputs of the individual prompted LLMs. These evaluations are postprocessed and parsed into a structured format.

Meta-evaluation Datasets

Preprocessed datasets used for meta-evaluation of our approach can be found in data/meta_eval.

Ensemble of Prompted Open-Weight LLMs

To reproduce the results with the individual prompted LLMs or the ensemble with Ollama, install the package following the instructions in the Ollama repository. After Ollama is installed and running, use the pull command to download the models. For example, to download the Llama 3.1 Nemotron 70B model, run the following command:

ollama pull nemotron:70b-instruct-q8_0

Then create the model from the modelfile:

ollama create eval_nemotron -f openlgauge/configs/ollama/modelfile_nemotron
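Before launching a long evaluation run, it can help to confirm that the created model is visible to the Ollama server. The sketch below queries Ollama's `/api/tags` endpoint, which lists locally available models. It assumes the server listens on the default address `http://localhost:11434`; the helper names are illustrative and not part of this repository.

```python
import json
from urllib.request import urlopen

# ASSUMPTION: Ollama is serving on its default address; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434"


def model_names(tags_response: dict) -> set[str]:
    """Extract model names from the JSON body returned by GET /api/tags."""
    return {m["name"] for m in tags_response.get("models", [])}


def has_model(tags_response: dict, name: str) -> bool:
    """Return True if `name` is listed, treating a missing tag as ':latest'."""
    wanted = name if ":" in name else f"{name}:latest"
    return wanted in model_names(tags_response)


def fetch_tags(base_url: str = OLLAMA_URL) -> dict:
    """Ask the running Ollama server for its locally available models."""
    with urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return json.load(resp)
```

With the server running, `has_model(fetch_tags(), "eval_nemotron")` should return `True` once the `ollama create` step has succeeded.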

To run the evaluation, use the openlgauge/eval_zero_shot.py script. For example, to evaluate factual consistency on the QAGS dataset with the eval_nemotron model, use the following command:

python openlgauge/eval_zero_shot.py --model eval_nemotron --template openlgauge/templates/zero_shot/qags.jinja --data openlgauge/data/meta_eval/qags.json --aspect-config openlgauge/configs/eval_aspects/qags-factual_consistency.json --output-dir openlgauge/results/openlgauge_ollama

For other models, see the modelfiles in openlgauge/configs/ollama. Prompt templates for zero-shot evaluation on all datasets evaluated in the paper can be found in openlgauge/templates/zero_shot. Configurations of different evaluation aspects are in openlgauge/configs/eval_aspects.
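To evaluate several models or aspects in a row, the command shown above can be assembled programmatically. The following is a minimal sketch that only reuses the flags appearing in that command (`--model`, `--template`, `--data`, `--aspect-config`, `--output-dir`); the wrapper functions are hypothetical and not part of the repository.

```python
import subprocess
import sys


def build_eval_command(model, template, data, aspect_config, output_dir,
                       script="openlgauge/eval_zero_shot.py"):
    """Build the argument list for one zero-shot evaluation run."""
    return [
        sys.executable, script,
        "--model", model,
        "--template", template,
        "--data", data,
        "--aspect-config", aspect_config,
        "--output-dir", output_dir,
    ]


def run_eval(**kwargs):
    """Run one evaluation and raise CalledProcessError if the script fails."""
    subprocess.run(build_eval_command(**kwargs), check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.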

Fine-tuned Model

Training

The distilled OpeNLGauge evaluation metric can be trained using the train.py script, which fine-tunes Llama 3.1 8B on our synthetic dataset.

python openlgauge/train.py --dataset data/training/train_ensemble_merged.json --template openlgauge/templates/finetuned/template.jinja --config openlgauge/configs/training_config.json --model-name openlgauge_ft --output-dir openlgauge/checkpoints/openlgauge_ft

Inference

To run inference with the fine-tuned model, use the eval_finetuned.py script. For example, to evaluate the factual consistency of summaries in the QAGS dataset, use the following command:

python openlgauge/eval_finetuned.py --model openlgauge/model --config openlgauge/configs/inference_config.json --template openlgauge/templates/zero_shot/qags.jinja --data openlgauge/data/qags.json --aspect-config openlgauge/configs/eval_aspects/qags-factual_consistency.json --output-dir openlgauge/results/openlgauge_ft --retry

Meta-evaluation

Please check back soon for updates.

License

Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgements

This work was supported by the European Research Council (Grant agreement No. 101039303, NG-NLG) and the National Recovery Plan funded project MPO 60273/24/21300/21000 CEDMO 2.0 NPO. It used resources of the LINDAT/CLARIAH-CZ Research Infrastructure (Czech Ministry of Education, Youth, and Sports project No. LM2018101).
