TTYG-178 Improve README#51

Open
pgan002 wants to merge 150 commits into main from TTYG-178

Conversation

@pgan002 pgan002 commented Feb 7, 2026

Collaborator

Changes

  • Move most of the documentation from the README file into separate files in a new directory docs/. Benefits:
    1. The main page (README) is shorter, and so more welcoming
    2. The main page loads faster
    3. The sections are shorter and so easier to read
    4. The directory helps to understand the contents
  • Major additions for completeness
  • Major edits for clarity
  • Links to documentation sections from README and from other sections
  • Consistent section heading case: sentence case
  • Join each paragraph into a single line for easier MD editing

Tests

  • Spelling, grammar, typos: copy-pasted text into word processor
  • Links: manually followed each link in README.md and docs/*

Open questions

  • How to format the key-definitions in § Output: as a list or table?
  • Move section "Aggregate metrics" to a separate file or into section Metrics?

@pgan002 pgan002 closed this Mar 14, 2026
@pgan002 pgan002 reopened this Mar 14, 2026
@pgan002 pgan002 requested review from atagarev and nelly-hateva and removed request for atagarev and nelly-hateva March 14, 2026 02:51
@pgan002 pgan002 self-assigned this Mar 14, 2026
Comment thread README.md
Comment thread docs/usage.md Outdated
Comment thread docs/installation.md Outdated
Comment thread docs/install.md Outdated
Comment thread docs/steps-score.md Outdated
Comment thread docs/steps-score.md Outdated
Comment thread docs/steps-score.md Outdated
Comment thread docs/steps-score.md Outdated
Comment thread docs/retrieval-evaluation-using-chunk-ids.md Outdated
Comment thread docs/0-intro.md Outdated
Comment thread docs/0-intro.md Outdated
@pgan002 pgan002 requested a review from nelly-hateva March 30, 2026 08:09
Comment thread README.md Outdated
Comment thread docs/installation.md Outdated
Comment thread docs/usage.md Outdated
Comment thread docs/usage.md Outdated
Comment thread docs/usage.md Outdated
Comment thread docs/usage.md Outdated
Philip Ganchev and others added 16 commits May 11, 2026 13:44
Clarify config usage (config_file_path) and LLM/embedding requirements
Update examples to async calls; fix temperature and relevance wording
Correct AP calculation and minor whitespace/argparse cleanup
Expand README and docs/input.md with detailed reference and target
schemas, including actual_steps and expected fields for retrieval,
SPARQL and time-series steps. Update retrieval-ids.md to use
min(k, number of relevant items) in recall@k denominator and rename
/contextualize precision@k as average context precision with a
corrected averaging formula and example.
Only actual steps with `status == "success"` are considered for
matching.
Output contains a `reference_steps` section mirroring the input; matched
reference steps get a `matches` string set to the matching
`<actual_step.id>`.
"autocomplete_search" -> "autocomplete_search" or "sparql_query"

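The recall@k change described in the commit notes above can be illustrated with a minimal sketch. The function names, argument names, and the use of plain ID lists are assumptions made for illustration, not the library's API. The average-precision function uses the textbook definition (mean of precision at each rank where a relevant item appears) and assumes the retrieved IDs are unique; the corrected formula in `retrieval-ids.md` may differ.

```python
from typing import Sequence


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Sequence[str], k: int) -> float:
    """Recall@k with denominator min(k, number of relevant items).

    Hypothetical helper: capping the denominator at k means a perfect
    top-k list scores 1.0 even when more than k items are relevant.
    """
    relevant = set(relevant_ids)
    if k <= 0 or not relevant:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant)
    return hits / min(k, len(relevant))


def average_precision(retrieved_ids: Sequence[str], relevant_ids: Sequence[str]) -> float:
    """Textbook average precision over a ranked list (assumption, see lead-in).

    Averages precision@rank over the ranks at which a relevant item appears.
    Assumes retrieved_ids contains no duplicates.
    """
    relevant = set(relevant_ids)
    hits = 0
    precision_sum = 0.0
    for rank, item_id in enumerate(retrieved_ids, start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    # Four relevant items but k = 2, so the denominator is min(2, 4) = 2.
    print(recall_at_k(["a", "b", "c"], ["a", "c", "x", "y"], k=2))
    print(average_precision(["a", "b", "c"], ["a", "c"]))
```

With an uncapped denominator, the first call would be limited to 0.5 at best even for a perfect top-2 list; the `min` keeps the achievable maximum at 1.0 for every k.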
@ivelinanikolova ivelinanikolova left a comment

Collaborator

All my comments are inline.

Comment thread README.md Outdated
# QA Evaluation

- This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics.
+ This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive.

Collaborator

Suggested change
- This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive.
+ This is a Python module for assessing the quality of question-answering (QA) systems, such as ones using LLM agents. The evaluation is based on a set of questions, their reference answers and reference steps outlining the context and tool orchestration required to derive the correct response. The final answer and the steps used to reach the answer are verified against the reference dataset. The library provides built-in evaluation metrics ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)) and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive.

Collaborator Author

Done

Comment thread README.md
```toml
graphrag-eval = {version = "*", extras = ["llm"]}
```
- [Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)

Collaborator

I understand the reasons for Metrics to be first in line here but I still find it more logical to be placed below, just before the section 'Custom evaluation'.

Rename Usage to "Quick Start" and move it to top position.

Move LLM use in Evaluation as a subsection of Metrics. See the attached figure suggesting the TOC.

[Image: suggested table of contents]

Navigation to Readme or Quick Start is necessary on each of the Readme files.

Collaborator Author

Done: reduced to quickstart, config, metrics, steps, retrieval-ids, input, output.

Comment thread docs/metrics.md Outdated
@@ -0,0 +1,14 @@
# Metrics

The library computes metrics for the quality of the answers. The groups of possible metrics are:

Collaborator

Suggested change
- The library computes metrics for the quality of the answers. The groups of possible metrics are:
+ The quality of the answers is evaluated by computing the following groups of metrics:

Collaborator Author

Done

Comment thread docs/metrics.md Outdated
Comment thread docs/metrics.md Outdated

The library computes metrics for the quality of the answers. The groups of possible metrics are:
1. **[RAGAS answer relevance](https://docs.ragas.io/en/v0.4.3/concepts/metrics/available_metrics/answer_relevance/)** (`answer_relevance`)
1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`)

Collaborator

Suggested change
- 1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`)
+ 1. **Answer Correctness**: _Recall_, _Precision_ and _F1-measure_ of the claims extracted from the actual answer with respect to the reference answer claims. The actual and reference answers are broken down to sets of claims by the LLM, then the actual claims are matched against the reference ones with the help of the LLM again. Based on the number of matching claims _Recall_, _Precision_ and _F1-measure_ are calculated (`answer_recall`, `answer_precision`, `answer_f1`).

Collaborator Author

Applied some of the rewording ideas. Rejected the capitalization (to stay consistent with the rest of the docs), the italicization (to stay consistent with the rest of the list), and the exact wording (it repeated the metric names).
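The claim-based scoring discussed in this thread reduces to three counts once the LLM has extracted and matched the claims. The following is a minimal sketch, assuming a one-to-one matching so that a single matched count serves both precision and recall. The function name and its arguments are hypothetical; only the three metric keys come from the docs under review.

```python
def claim_scores(n_matched: int, n_actual: int, n_reference: int) -> dict[str, float]:
    """Precision, recall and F1 from claim counts (hypothetical helper).

    n_matched:   number of actual-answer claims matched to reference claims
    n_actual:    number of claims extracted from the actual answer
    n_reference: number of claims extracted from the reference answer
    """
    precision = n_matched / n_actual if n_actual else 0.0
    recall = n_matched / n_reference if n_reference else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall; define it as 0 when both are 0.
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "answer_recall": recall,
        "answer_precision": precision,
        "answer_f1": f1,
    }


if __name__ == "__main__":
    # 3 of 4 actual claims match, out of 6 reference claims.
    print(claim_scores(n_matched=3, n_actual=4, n_reference=6))
```

The claim extraction and matching steps, which the suggestion attributes to the LLM, are deliberately left out: the sketch covers only the arithmetic that follows them.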

Comment thread docs/metrics.md Outdated
1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`)
1. **Steps score**: Correctness of the agent's steps in responding to a user query (`steps_score`) ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md#steps-score))
1. Vector retrieval
1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)

Collaborator

Suggested change
1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)
1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)

@pgan002 pgan002 Jun 8, 2026

Collaborator Author

Why change the YAML indentation from 2 to 4 spaces?

If we change it here, should we do so consistently across the docs?

Comment thread docs/metrics.md Outdated
1. **Steps score**: Correctness of the agent's steps in responding to a user query (`steps_score`) ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md#steps-score))
1. Vector retrieval
1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`)
1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`)

Collaborator

Suggested change
1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`)
1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`)

Comment thread docs/usage.md Outdated
@@ -0,0 +1,82 @@
# Usage

Collaborator

Suggested change
- # Usage
+ # Quick Start

Collaborator Author

I understand the motivation for renaming this section: it will include installation and usage. But I am not convinced by the name "Quickstart". Maybe "Installation and usage"?

Comment thread docs/install.md Outdated
@@ -0,0 +1,21 @@
# Installation

To evaluate only steps:

Collaborator

Suggested change
- To evaluate only steps:
+ To evaluate only steps and tool calls different from retrieval and such that do not require LLM usage during evaluation (see [§ Metrics using LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)):

Comment thread docs/install.md Outdated
```toml
graphrag-eval = "*"
```

To evaluate `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`) (see [§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra:

Collaborator

Suggested change
- To evaluate `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`) (see [§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra:
+ To evaluate metrics based on LLM evaluation, such as `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`), retrieval steps evaluation metrics (see [§ Metrics using LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra:

Collaborator Author

Revamped.

@pgan002 pgan002 requested a review from ivelinanikolova June 12, 2026 10:36
3 participants