Clarify config usage (config_file_path) and LLM/embedding requirements
Update examples to async calls; fix temperature and relevance wording
Correct AP calculation and minor whitespace/argparse cleanup
Expand README and docs/input.md with detailed reference and target schemas, including actual_steps and expected fields for retrieval, SPARQL and time-series steps. Update retrieval-ids.md to use min(k, number of relevant items) in recall@k denominator and rename /contextualize precision@k as average context precision with a corrected averaging formula and example.
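For reviewers, a minimal sketch of the recall@k change described above, assuming retrieved and relevant items are identified by plain string IDs. The function name `recall_at_k` and the choice to return `None` when there are no relevant items are illustrative assumptions, not the library's API.

```python
from typing import Optional, Sequence


def recall_at_k(
    retrieved_ids: Sequence[str],
    relevant_ids: Sequence[str],
    k: int,
) -> Optional[float]:
    """Recall@k with min(k, number of relevant items) as the denominator.

    Hypothetical helper for illustration only.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    if not relevant:
        # Assumption: the metric is undefined without relevant items.
        return None
    # Count distinct relevant IDs among the first k retrieved IDs.
    hits = len(relevant.intersection(retrieved_ids[:k]))
    # Capping the denominator at k keeps a perfect top-k at 1.0
    # even when there are more than k relevant items.
    return hits / min(k, len(relevant))


retrieved = ["a", "b", "c", "d"]
relevant = ["a", "c", "x", "y", "z"]
print(recall_at_k(retrieved, relevant, k=2))  # 1 hit / min(2, 5)
```

With the uncapped denominator the same call would give 1/5, which penalizes the system for relevant items it could never have returned within the first k positions.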
Only actual steps with `status == "success"` are considered for matching. Output contains a `reference_steps` section mirroring the input; matched reference steps get a `matches` string set to the matching `<actual_step.id>`.
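A rough sketch of the matching behaviour described above, under loudly stated assumptions: reference steps are a flat list of dicts, and two steps match when their `name` fields are equal. The function `match_steps` and the name-equality rule are hypothetical; the library's real comparison is likely step-type specific. Only the `status == "success"` filter, the mirrored `reference_steps` output and the `matches` field come from the description.

```python
import copy
from typing import Any, Dict, List

Step = Dict[str, Any]


def match_steps(reference_steps: List[Step], actual_steps: List[Step]) -> List[Step]:
    """Mirror the reference steps and record which actual step matched each.

    Hypothetical helper for illustration only.
    """
    # .get() avoids a KeyError when an actual step has no "status" key;
    # such steps are simply not considered for matching.
    successful = [s for s in actual_steps if s.get("status") == "success"]
    output = copy.deepcopy(reference_steps)  # output mirrors the input
    used_ids = set()
    for ref in output:
        for act in successful:
            # Assumption: steps match on equal names; each actual step is used once.
            if act["id"] not in used_ids and act.get("name") == ref.get("name"):
                ref["matches"] = act["id"]
                used_ids.add(act["id"])
                break
    return output


reference = [{"name": "sparql_query"}, {"name": "retrieval"}]
actual = [
    {"id": "call_1", "name": "sparql_query", "status": "error"},
    {"id": "call_2", "name": "sparql_query", "status": "success"},
    {"id": "call_3", "name": "retrieval"},  # no "status" key
]
print(match_steps(reference, actual))
```

Here only `call_2` is eligible, so the first reference step gets `matches` set to `call_2` and the second stays unmatched.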
"autocomplete_search" -> "autocomplete_search" or "sparql_query"
ivelinanikolova left a comment
All my comments are inline.
| # QA Evaluation | ||
| This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. | ||
| This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive. |
| This is a Python module for assessing the quality of question-answering systems such as ones based on LLM agents, based on a set of questions and reference answers for them. This includes evaluating the final answer and the steps used to reach the answer (such as orchestrated and executed steps), compared to the given reference steps. The library provides built-in evaluation metrics and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive. | |
| This is a Python module for assessing the quality of question-answering (QA) systems, such as ones using LLM agents. The evaluation is based on a set of questions, their reference answers and reference steps outlining the context and tool orchestration required to derive the correct response. The final answer and the steps used to reach the answer are verified against the reference dataset. The library provides built-in evaluation metrics ([§ Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md)) and allows the user to define their own (custom) metrics. The library is agnostic to the agent implementation and LLM it uses. Its input format is versatile and expressive. |
| ```toml | ||
| graphrag-eval = {version = "*", extras = ["llm"]} | ||
| ``` | ||
| - [Metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/metrics.md) |
I understand the reasons for Metrics to be first in line here but I still find it more logical to be placed below, just before the section 'Custom evaluation'.
Rename Usage to "Quick Start" and move it to top position.
Move LLM use in Evaluation as a subsection of Metrics. See the attached figure suggesting the TOC.
Navigation to Readme or Quick Start is necessary on each of the Readme files.
Done: reduced to quickstart, config, metrics, steps, retrieval-ids, input, output.
| @@ -0,0 +1,14 @@ | |||
| # Metrics | |||
| The library computes metrics for the quality of the answers. The groups of possible metrics are: | |||
| The library computes metrics for the quality of the answers. The groups of possible metrics are: | |
| The quality of the answers is evaluated by computing the following groups of metrics: |
| The library computes metrics for the quality of the answers. The groups of possible metrics are: | ||
| 1. **[RAGAS answer relevance](https://docs.ragas.io/en/v0.4.3/concepts/metrics/available_metrics/answer_relevance/)** (`answer_relevance`) | ||
| 1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`) |
| 1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`) | |
| 1. **Answer Correctness**: _Recall_, _Precision_ and _F1-measure_ of the claims extracted from the actual answer with respect to the reference answer claims. The LLM breaks the actual and reference answers down into sets of claims, then matches the actual claims against the reference ones. _Recall_, _Precision_ and _F1-measure_ are calculated from the number of matching claims (`answer_recall`, `answer_precision`, `answer_f1`). |
Applied some of the rewording ideas. Rejected the capitalization (to stay consistent with the rest of the docs), the italicization (to stay consistent with the rest of the list) and the exact wording (it repeated the metric names).
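To make the arithmetic in the suggestion concrete, here is a small sketch of how the three answer correctness values follow from claim counts. It assumes the LLM has already extracted and matched the claims, so only the counts are inputs; the function name `claim_scores` and the zero fallback for empty claim sets are assumptions, not the library's behaviour.

```python
from typing import Dict


def claim_scores(n_matched: int, n_actual: int, n_reference: int) -> Dict[str, float]:
    """Precision, recall and F1 from claim counts.

    Hypothetical helper for illustration only.
    """
    # Precision: share of actual-answer claims that match a reference claim.
    precision = n_matched / n_actual if n_actual else 0.0
    # Recall: share of reference claims covered by the actual answer.
    recall = n_matched / n_reference if n_reference else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {
        "answer_precision": precision,
        "answer_recall": recall,
        "answer_f1": f1,
    }


# 4 claims in the actual answer, 6 in the reference answer, 3 of them matching.
print(claim_scores(n_matched=3, n_actual=4, n_reference=6))
```

For these counts precision is 3/4, recall is 3/6 and F1 is their harmonic mean.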
| 1. **Answer correctness**: Recall, precision, F1 of claims extracted from the actual answer with respect to reference answer claims (`answer_recall`, `answer_precision`, `answer_f1`) | ||
| 1. **Steps score**: Correctness of the agent's steps in responding to a user query (`steps_score`) ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md#steps-score)) | ||
| 1. Vector retrieval | ||
| 1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`) |
| 1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`) | |
| 1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`) |
Why change the YAML indentation from 2 to 4 spaces?
If we change it here, should we do so consistently across the docs?
| 1. **Steps score**: Correctness of the agent's steps in responding to a user query (`steps_score`) ([§ Steps score](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/steps.md#steps-score)) | ||
| 1. Vector retrieval | ||
| 1. **vs. reference answer**: Recall, precision, F1 of the retrieved context claims with respect to the reference answer claims (`retrieval_answer_recall`, `retrieval_answer_precision`, `retrieval_answer_f1`) | ||
| 1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`) |
| 1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`) | |
| 1. **vs. reference context**: Recall, precision, F1 of the retrieved context claims with respect to the reference context claims (`retrieval_context_recall`, `retrieval_context_precision`, `retrieval_context_f1`) |
| @@ -0,0 +1,82 @@ | |||
| # Usage | |||
| # Usage | |
| # Quick Start |
I understand the motivation for renaming this section: it will include installation and usage. But I am not convinced by the name "Quickstart". Maybe "Installation and usage"?
| @@ -0,0 +1,21 @@ | |||
| # Installation | |||
| To evaluate only steps: | |||
| To evaluate only steps: | |
| To evaluate only steps and tool calls other than retrieval, that is, those that do not require an LLM during evaluation (see [§ Metrics using LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)): |
| graphrag-eval = "*" | ||
| ``` | ||
| To evaluate `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`) (see [§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra: |
| To evaluate `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`) (see [§ LLM use in evaluation](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra: | |
| To evaluate metrics based on LLM evaluation, such as `answer_relevance`, answer correctness metrics (`answer_recall`, `answer_precision`, `answer_f1`), retrieval steps evaluation metrics (see [§ Metrics using LLM](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/llm.md)) or [custom metrics](https://github.com/Ontotext-AD/graphrag-eval/blob/main/docs/custom.md) install the `llm` extra: |
- Merge install.md, usage.md into quickstart.md
- Move custom.md, llm.md to metrics.md
- Various rewordings by Ivelina Nikolova
- Fixes and rewordings by LLM
Evaluation would crash if actual_step has no `status` key
Changes
docs/. Benefits:
Tests
README.md and docs/*
Open questions