TTYG-178 Improve README by pgan002 · Pull Request #51 · Ontotext-AD/graphrag-eval

pgan002 · 2026-02-07T07:50:40Z

Changes

Move most of the documentation from the README file into separate files in a new directory docs/. Benefits:
1. The main page (README) is shorter, and so more welcoming
2. The main page loads faster
3. The sections are shorter and so easier to read
4. The directory helps to understand the contents
Major additions for completeness
Major edits for clarity
Links to documentation sections from README and from other sections
Consistent section heading case: sentence case
Join each paragraph into a single line for easier MD editing

Tests

Spelling, grammar, typos: copy-pasted text into word processor
Links: manually followed each link in README.md and docs/*

Open questions

How to format the key-definitions in § Output: as a list or table?
Move section "Aggregate metrics" to a separate file or into section Metrics?

Clarify config usage (config_file_path) and LLM/embedding requirements. Update examples to async calls; fix temperature and relevance wording. Correct AP calculation; minor whitespace/argparse cleanup.

Expand README and docs/input.md with detailed reference and target schemas, including actual_steps and expected fields for retrieval, SPARQL and time-series steps. Update retrieval-ids.md to use min(k, number of relevant items) in recall@k denominator and rename /contextualize precision@k as average context precision with a corrected averaging formula and example.
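The recall@k change described above can be sketched as follows. This is a minimal illustration, not the library's code: the function names and the list/set inputs are assumptions, and the average precision shown uses the common convention of dividing by the number of relevant items actually retrieved, which may differ from the formula in retrieval-ids.md.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Recall@k with min(k, number of relevant items) as the denominator.

    Hypothetical helper: with this denominator a perfect top-k ranking
    scores 1.0 even when there are more than k relevant items.
    """
    denominator = min(k, len(relevant_ids))
    if denominator == 0:
        return None  # undefined when k == 0 or nothing is relevant
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / denominator


def average_precision(retrieved_ids, relevant_ids):
    """Mean of precision@i over the ranks i that hold a relevant item.

    Assumption: normalised by the number of relevant items retrieved.
    """
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(retrieved_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

For example, with four relevant items and k = 2, one hit in the top two gives 1 / min(2, 4) = 0.5, where dividing by all four relevant items would give 0.25.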

Only actual steps with `status == "success"` are considered for matching. Output contains a `reference_steps` section mirroring the input; matched reference steps get a `matches` string set to the matching `<actual_step.id>`.
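A rough sketch of this behaviour, under stated assumptions: reference steps are taken to be a flat list of dicts, and two steps are considered to match when their `name` fields are equal. Both are simplifications for illustration; the library's real matching rules and data layout are not shown in this thread. Only the `status == "success"` filter and the `matches` field come from the description above.

```python
from copy import deepcopy


def match_steps(reference_steps, actual_steps):
    """Return a copy of reference_steps with `matches` set on matched steps.

    Hypothetical helper. Matching by equal `name` is an assumption.
    """
    # Use .get() so an actual step without a `status` key is skipped
    # instead of raising KeyError.
    successful = [s for s in actual_steps if s.get("status") == "success"]

    output = deepcopy(reference_steps)  # the output mirrors the input
    used_ids = set()
    for reference in output:
        for actual in successful:
            if actual["id"] not in used_ids and actual["name"] == reference["name"]:
                reference["matches"] = actual["id"]
                used_ids.add(actual["id"])
                break
    return output
```

Copying the reference steps before annotating them keeps the caller's input unchanged, which makes it safe to evaluate the same reference dataset repeatedly.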

"autocomplete_search" -> "autocomplete_search" or "sparql_query"

ivelinanikolova

All my comments are inline.

ivelinanikolova · 2026-05-12T14:30:28Z

 # QA Evaluation

-This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics.
+This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive.


Suggested change

This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive.

This is a Python module for assessing the quality of question-answering (QA) systems, such as ones using LLM agents. The evaluation is based on a set of questions, their reference answers and reference steps outlining the context and tool orchestration required to derive the correct response. The final answer and the steps used to reach the answer are verified against the reference dataset. The library provides built-in evaluation metrics ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)) and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive.

ivelinanikolova · 2026-05-12T14:47:57Z

-```toml
-graphrag-eval = {version = "*", extras = ["llm"]}
-```
+- [Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)


I understand the reasons for Metrics to be first in line here but I still find it more logical to be placed below, just before the section 'Custom evaluation'.

Rename Usage to "Quick Start" and move it to top position.

Move LLM use in Evaluation as a subsection of Metrics. See the attached figure suggesting the TOC.

Navigation to Readme or Quick Start is necessary on each of the Readme files.

Done: reduced to quickstart, config, metrics, steps, retrieval-ids, input, output.

ivelinanikolova · 2026-05-12T14:53:32Z

@@ -0,0 +1,14 @@
+# Metrics
+
+The library computes metrics for the quality of the answers. The groups of possible metrics are:


Suggested change

The library computes metrics for the quality of the answers. The groups of possible metrics are:

The quality of the answers is evaluated by computing the following groups of metrics:

ivelinanikolova · 2026-05-12T15:13:02Z

+
+The library computes metrics for the quality of the answers. The groups of possible metrics are:
+1. **[RAGAS answer relevance](https://docs.ragas.io/en/v0.4.3/concepts/metrics/available_metrics/answer_relevance/)** (`answer_relevance`)
+1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`)


Suggested change

1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`)

1. **Answer Correctness**: _Recall_, _Precision_ and _F1-measure_ of the claims extracted from the actual answer with respect to the reference answer claims. The LLM breaks the actual and reference answers down into sets of claims, then matches the actual claims against the reference ones. _Recall_, _Precision_ and _F1-measure_ are calculated from the number of matching claims (`answer_recall`, `answer_precision`, `answer_f1`).

Applied some of the rewording ideas. Rejected: capitalization (for consistency with the rest of the docs), italicization (for consistency with the rest of the list) and the exact wording (it repeats the metric names).
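To make the arithmetic behind the answer correctness metrics concrete, here is a minimal sketch that assumes the LLM has already extracted and matched the claims, so only counts remain. The function name and its parameters are hypothetical; only the output keys `answer_precision`, `answer_recall` and `answer_f1` come from the docs under review.

```python
def claim_scores(n_actual, n_reference, n_actual_supported, n_reference_covered):
    """Precision, recall and F1 from claim counts.

    n_actual_supported: actual-answer claims matched by a reference claim.
    n_reference_covered: reference claims matched by an actual-answer claim.
    Empty claim sets score 0.0 here, which is an assumption.
    """
    precision = n_actual_supported / n_actual if n_actual else 0.0
    recall = n_reference_covered / n_reference if n_reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "answer_precision": precision,
        "answer_recall": recall,
        "answer_f1": f1,
    }
```

With 4 actual claims of which 3 are supported, and 5 reference claims of which 3 are covered, precision is 0.75 and recall is 0.6.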

ivelinanikolova · 2026-05-20T07:33:33Z

+1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`)
+1. **Steps score**: Correctness of the agent's steps in responding to a user query (`steps_score`) ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md#steps-score))
+1. Vector retrieval
+    1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)


Suggested change

1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)

1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)

Why change the YAML indentation from 2 to 4 spaces?

If we change it here, should we do so consistently across the docs?

ivelinanikolova · 2026-05-20T07:33:46Z

+1. **Steps score**: Correctness of the agent's steps in responding to a user query (`steps_score`) ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md#steps-score))
+1. Vector retrieval
+    1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)
+    1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`)


Suggested change

1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`)

1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`)

ivelinanikolova · 2026-06-08T07:50:09Z

@@ -0,0 +1,82 @@
+# Usage


Suggested change

# Usage

# Quick Start

I understand the motivation for renaming this section: it will include installation and usage. But I am not convinced by the name "Quickstart". Maybe "Installation and usage"?

ivelinanikolova · 2026-06-08T07:53:39Z

@@ -0,0 +1,21 @@
+# Installation
+
+To evaluate only steps:


Suggested change

To evaluate only steps:

To evaluate only steps and tool calls other than retrieval, that is, those that do not require an LLM during evaluation (see [§ Metrics using LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)):

ivelinanikolova · 2026-06-08T07:55:53Z

+graphrag-eval = "*"
+```
+
+To evaluate `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`) (see [§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra:


Suggested change

To evaluate `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`) (see [§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra:

To evaluate metrics that rely on an LLM, such as `answer_relevance`, the answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`), the retrieval step metrics (see [§ Metrics using LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md), install the `llm` extra:

- Merge install.md, usage.md into quickstart.md
- Move custom.md, llm.md to metrics.md
- Various rewordings by Ivelina Nikolova
- Fixes and rewordings by LLM

Evaluation would crash if actual_step has no `status` key

pgan002 requested review from atagarev and nelly-hateva February 7, 2026 07:50

pgan002 force-pushed the TTYG-178 branch from 1fb54c2 to ba141bc Compare February 20, 2026 01:42

pgan002 closed this Mar 14, 2026

pgan002 force-pushed the TTYG-178 branch from ba141bc to e338970 Compare March 14, 2026 01:49

pgan002 reopened this Mar 14, 2026

pgan002 requested review from atagarev and nelly-hateva and removed request for atagarev and nelly-hateva March 14, 2026 02:51

pgan002 self-assigned this Mar 14, 2026