Evaluation

Evaluation is a form of testing that helps you validate your LLM's responses and ensure they meet your quality bar.

Firebase Genkit supports third-party evaluation tools through plugins, paired with powerful observability features that provide insight into the runtime state of your LLM-powered applications. Genkit tooling helps you automatically extract data including inputs, outputs, and information from intermediate steps to evaluate the end-to-end quality of LLM responses as well as understand the performance of your system's building blocks.

Types of evaluation

Genkit supports two types of evaluation:

  • Inference-based evaluation: This type of evaluation runs against a collection of pre-determined inputs, assessing the corresponding outputs for quality.

    This is the most common evaluation type, suitable for most use cases. This approach tests a system's actual output for each evaluation run.

    You can perform the quality assessment manually, by visually inspecting the results. Alternatively, you can automate the assessment by using an evaluation metric.

  • Raw evaluation: This type of evaluation directly assesses the quality of inputs without running any inference. This approach is typically used with automated evaluation using metrics. All fields required for evaluation (e.g., input, context, output, and reference) must already be present in the input dataset. This is useful when you have data coming from an external source (e.g., collected from your production traces) and you want an objective measurement of its quality (see the example record after this list).

    For more information, see the Advanced use section of this page.
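
    For example, a single raw-evaluation record might look like the sketch below. The values are hypothetical and the exact format expected by Genkit's tooling is covered under Advanced use; the point is that the output and context come from your own data rather than from a fresh inference run.

    ```json
    {
      "input": "Who is man's best friend?",
      "output": "The dog is man's best friend.",
      "context": ["Dog is man's best friend"],
      "reference": "The dog"
    }
    ```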

This section explains how to perform inference-based evaluation using Genkit.

Quick start

Setup

  1. Use an existing Genkit app or create a new one by following our [Getting started](get-started) guide.
  2. Add the following code to define a simple RAG application to evaluate. For this guide, we use a dummy retriever that always returns the same documents.

    ```js
    import { genkit, z, Document } from "genkit";
    import { googleAI, gemini15Flash, gemini15Pro } from "@genkit-ai/googleai";

    // Initialize Genkit
    export const ai = genkit({
      plugins: [googleAI()],
    });

    // Dummy retriever that always returns the same docs
    export const dummyRetriever = ai.defineRetriever(
      {
        name: "dummyRetriever",
      },
      async (i) => {
        const facts = [
          "Dog is man's best friend",
          "Dogs have evolved and were domesticated from wolves",
        ];
        // Just return facts as documents.
        return { documents: facts.map((t) => Document.fromText(t)) };
      }
    );

    // A simple question-answering flow
    export const qaFlow = ai.defineFlow({
        name: 'qaFlow',
        inputSchema: z.string(),
        outputSchema: z.string(),
      },
      async (query) => {
        const factDocs = await ai.retrieve({
          retriever: dummyRetriever,
          query,
          options: { k: 2 },
        });
        const llmResponse = await ai.generate({
          model: gemini15Flash,
          prompt: `Answer this question with the given context ${query}`,
          docs: factDocs,
        });
        return llmResponse.text;
      }
    );
    ```
  3. (Optional) Add evaluation metrics to your application to use while evaluating. This guide uses the `MALICIOUSNESS` metric from the `genkitEval` plugin.

    ```js
    import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator";
    import { gemini15Pro } from "@genkit-ai/googleai";

    export const ai = genkit({
      plugins: [
        ...
        // Add this plugin to your Genkit initialization block
        genkitEval({
          judge: gemini15Pro,
          metrics: [GenkitMetric.MALICIOUSNESS],
        }),
      ],
    });
    ```

    **Note:** The configuration above requires installing the [`@genkit-ai/evaluator`](https://www.npmjs.com/package/@genkit-ai/evaluator) package.

    ```posix-terminal
    npm install @genkit-ai/evaluator
    ```
  4. Start your Genkit application.

    ```posix-terminal
    genkit start -- <command to run your code>
    ```

Create a dataset

Create a dataset to define the examples we want to use for evaluating our flow.

  1. Go to the Dev UI at http://localhost:4000 and click the Datasets button to open the Datasets page.

  2. Click on the Create Dataset button to open the create dataset dialog.

    a. Provide a datasetId for your new dataset. This guide uses myFactsQaDataset.

    b. Select Flow dataset type.

    c. Leave the validation target field empty and click Save.

  3. Your new dataset page appears, showing an empty dataset. Add examples to it by following these steps:

    a. Click the Add example button to open the example editor panel.

    b. Only the input field is required. Enter "Who is man's best friend?" in the input field, and click Save to add the example to your dataset.

    c. Repeat steps (a) and (b) a couple more times to add more examples. This guide adds the following example inputs to the dataset:

    "Can I give milk to my cats?"
    "From which animals did dogs evolve?"
    

    By the end of this step, your dataset should have 3 examples in it, with the values mentioned above.
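
You can also keep the same examples in a plain JSON file instead of (or in addition to) a Dev UI dataset; the CLI commands described under Advanced use accept a file of this shape. The file name testInputs.json is just the convention used later in this guide:

```json
[
  { "input": "Who is man's best friend?" },
  { "input": "Can I give milk to my cats?" },
  { "input": "From which animals did dogs evolve?" }
]
```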

Run evaluation and view results

To start evaluating the flow, click the Evaluations tab in the Dev UI and then click the Run new evaluation button.

  1. Select the Flow radio button to evaluate a flow.

  2. Select qaFlow as the target flow to evaluate.

  3. Select myFactsQaDataset as the target dataset to use for evaluation.

  4. (Optional) If you have installed evaluator metrics using Genkit plugins, you can see them on this page. Select the metrics that you want to use with this evaluation run. This is entirely optional: omitting this step still returns results for the evaluation run, just without any associated metrics.

  5. Finally, click Run evaluation to start the evaluation. Depending on the flow you're testing, this may take a while. Once the evaluation is complete, a success message appears with a link to view the results. Click the link to go to the Evaluation details page.

You can see the details of your evaluation on this page, including the original input, extracted context, and metrics (if any).

Core concepts

Terminology

  • Evaluation: An evaluation is a process that assesses system performance. In Genkit, such a system is usually a Genkit primitive, such as a flow or a model. An evaluation can be automated or manual (human evaluation).

  • Bulk inference: Inference is the act of running an input on a flow or model to get the corresponding output. Bulk inference involves performing inference on multiple inputs simultaneously.

  • Metric: An evaluation metric is a criterion on which an inference is scored. Examples include accuracy, faithfulness, maliciousness, whether the output is in English, etc.

  • Dataset: A dataset is a collection of examples to use for inference-based evaluation. A dataset typically consists of input and optional reference fields. The reference field does not affect the inference step of evaluation, but it is passed verbatim to any evaluation metrics. In Genkit, you can create a dataset through the Dev UI. There are two types of datasets in Genkit: Flow datasets and Model datasets.

Schema validation

Depending on the type, datasets have schema validation support in the Dev UI:

  • Flow datasets support validation of the input and reference fields of the dataset against a flow in the Genkit application. Schema validation is optional and is only enforced if a schema is specified on the target flow.

  • Model datasets have an implicit schema, supporting both string and GenerateRequest input types. String validation provides a convenient way to evaluate simple text prompts, while GenerateRequest provides complete control for advanced use cases (e.g., providing model parameters, message history, tools, and so on); an illustrative sketch follows below. You can find the full schema for GenerateRequest in our API reference docs.
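
    As a rough illustration, a model dataset input can be as simple as the string "Who is man's best friend?", or a full GenerateRequest along the lines of the sketch below. The field values here are hypothetical; the GenerateRequest schema in the API reference is the authoritative shape.

    ```json
    {
      "messages": [
        { "role": "user", "content": [{ "text": "Who is man's best friend?" }] }
      ],
      "config": { "temperature": 0.2 }
    }
    ```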

Supported evaluators

Genkit evaluators

Genkit includes a small number of native evaluators, inspired by RAGAS, to help you get started:

  • Faithfulness -- Measures the factual consistency of the generated answer against the given context
  • Answer Relevancy -- Assesses how pertinent the generated answer is to the given prompt
  • Maliciousness -- Measures whether the generated output intends to deceive, harm, or exploit
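
For example, a configuration that enables all three metrics could look roughly like the sketch below. Treat it as a sketch under assumptions: it presumes the GenkitMetric enum exposes FAITHFULNESS, ANSWER_RELEVANCY, and MALICIOUSNESS, and that embedding-based metrics such as answer relevancy need an embedder; check the @genkit-ai/evaluator documentation for the exact options.

```js
import { genkit } from "genkit";
import { genkitEval, GenkitMetric } from "@genkit-ai/evaluator";
import { googleAI, gemini15Pro, textEmbedding004 } from "@genkit-ai/googleai";

export const ai = genkit({
  plugins: [
    googleAI(),
    genkitEval({
      judge: gemini15Pro, // LLM used to score the generated outputs
      metrics: [
        GenkitMetric.FAITHFULNESS,
        GenkitMetric.ANSWER_RELEVANCY,
        GenkitMetric.MALICIOUSNESS,
      ],
      embedder: textEmbedding004, // assumed to be required by ANSWER_RELEVANCY
    }),
  ],
});
```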

Evaluator plugins

Genkit supports additional evaluators through plugins, like the Vertex Rapid Evaluators, which you access via the VertexAI Plugin.

Advanced use

Evaluation using the CLI

The Genkit CLI provides a rich API for performing evaluation. This is especially useful in environments where the Dev UI is not available (e.g., in a CI/CD workflow).

The Genkit CLI provides three main evaluation commands: eval:flow, eval:extractData, and eval:run.

eval:flow command

The eval:flow command runs inference-based evaluation on an input dataset. This dataset may be provided either as a JSON file or by referencing an existing dataset in your Genkit runtime.

# Referencing an existing dataset
genkit eval:flow qaFlow --input myFactsQaDataset
# or, using a dataset from a file
genkit eval:flow qaFlow --input testInputs.json

Here, testInputs.json should be an array of objects containing an input field and an optional reference field, as shown below:

[
  {
    "input": "What is the French word for Cheese?",
  },
  {
    "input": "What green vegetable looks like cauliflower?",
    "reference": "Broccoli"
  }
]

If your flow requires auth, you may specify it using the --auth argument:

genkit eval:flow qaFlow --input testInputs.json --auth "{\"email_verified\": true}"

By default, the eval:flow and eval:run commands use all available metrics for evaluation. To run on a subset of the configured evaluators, use the --evaluators flag and provide a comma-separated list of evaluators by name:

genkit eval:flow qaFlow --input testInputs.json --evaluators=genkit/faithfulness,genkit/answer_relevancy

You can view the results of your evaluation run in the Dev UI at localhost:4000/evaluate.

eval:extractData and eval:run commands

To support raw evaluation, Genkit provides tools to extract data from traces and run evaluation metrics on extracted data. This is useful, for example, if you are using a different framework for evaluation or if you are collecting inferences from a different environment to test locally for output quality.

You can batch run your Genkit flow and add a unique label to the run, which can then be used to extract an evaluation dataset. A raw evaluation dataset is a collection of inputs for evaluation metrics, without running any prior inference.

Run your flow over your test inputs:

genkit flow:batchRun qaFlow testInputs.json --label firstRunSimple

Extract the evaluation data:

genkit eval:extractData qaFlow --label firstRunSimple --output factsEvalDataset.json

The exported data has a different format from the dataset format presented earlier, because this data is intended to be used directly with evaluation metrics, without any inference step. Here is the schema of the extracted data:

Array<{
  "testCaseId": string,
  "input": any,
  "output": any,
  "context": any[],
  "traceIds": string[],
}>;

The data extractor automatically locates retrievers and adds the produced docs to the context array. You can run evaluation metrics on this extracted dataset using the eval:run command.

genkit eval:run factsEvalDataset.json

By default, eval:run runs against all configured evaluators, and as with eval:flow, results for eval:run appear in the evaluation page of the Dev UI, located at localhost:4000/evaluate.
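
To restrict eval:run to a subset of the configured evaluators, pass the --evaluators flag just as you would with eval:flow:

```posix-terminal
genkit eval:run factsEvalDataset.json --evaluators=genkit/faithfulness,genkit/answer_relevancy
```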

Custom extractors

Genkit provides reasonable default logic for extracting the necessary fields (input, output, and context) while doing an evaluation. However, you may find that you need more control over the extraction logic for these fields. Genkit supports custom extractors to achieve this. You can provide custom extractors to be used with the eval:extractData and eval:flow commands.

First, as a preparatory step, introduce an auxiliary step in our qaFlow example:

export const qaFlow = ai.defineFlow({
    name: 'qaFlow',
    inputSchema: z.string(),
    outputSchema: z.string(),
  },
  async (query) => {
    const factDocs = await ai.retrieve({
      retriever: dummyRetriever,
      query,
      options: { k: 2 },
    });
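    // `run` (exported by the "genkit" package) records this block as a named
    // step in the flow's trace, so extractors can reference it by name.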
    const factDocsModified = await run('factModified', async () => {
        // Let us use only facts that are considered silly. This is a 
        // hypothetical step for demo purposes, you may perform any 
        // arbitrary task inside a step and reference it in custom 
        // extractors.
        //
        // Assume you have a method that checks if a fact is silly
        return factDocs.filter(d => isSillyFact(d.text));
    });

    const llmResponse = await ai.generate({
      model: gemini15Flash,
      prompt: `Answer this question with the given context ${query}`,
      docs: factDocsModified,
    });
    return llmResponse.text;
  }
);

Next, configure a custom extractor to use the output of the factModified step when evaluating this flow.

If you don't already have a tools config file for custom extractors, add one named genkit-tools.conf.js to your project root.

cd /path/to/your/genkit/app
touch genkit-tools.conf.js

In the tools config file, add the following code:

module.exports = {
  evaluators: [
    {
      actionRef: '/flow/qaFlow',
      extractors: {
        context: { outputOf: 'factModified' },
      },
    },
  ],
};

This config overrides the default extractors of Genkit's tooling, specifically changing what is considered context when evaluating this flow.

Running the evaluation again shows that context is now populated with the output of the factModified step.

genkit eval:flow qaFlow --input testInputs.json

Evaluation extractors are specified as follows:

  • The evaluators field accepts an array of EvaluatorConfig objects, which are scoped by actionRef (the flow being evaluated, e.g., /flow/qaFlow).
  • extractors is an object that specifies the extractor overrides. The currently supported keys in extractors are [input, output, context]. The acceptable value types are:
    • string - this should be a step name, specified as a string. The output of this step is extracted for this key.
    • { inputOf: string } or { outputOf: string } - These objects represent specific channels (input or output) of a step. For example, { inputOf: 'foo-step' } would extract the input of step foo-step for this key.
    • (trace) => any; - For further flexibility, you can provide a function that accepts a Genkit trace and returns an any-type value, and specify the extraction logic inside this function (see the sketch after the note below). Refer to genkit/genkit-tools/common/src/types/trace.ts for the exact TraceData schema.

Note: The extracted data for all these extractors matches the type returned by the extractor. For example, if you use context: { outputOf: 'foo-step' }, and foo-step returns an array of objects, the extracted context is also an array of objects.
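
If you need the function form, here is a heavily hedged sketch. The trace shape used here (a spans map, with the step output JSON-encoded in a genkit:output attribute) is an assumption about typical Genkit traces; verify the field names against the TraceData schema in trace.ts before relying on them.

```js
module.exports = {
  evaluators: [
    {
      actionRef: '/flow/qaFlow',
      extractors: {
        // Hypothetical function extractor: find the span recorded for the
        // 'factModified' step and parse the output stored on it.
        context: (trace) => {
          const span = Object.values(trace.spans).find(
            (s) => s.displayName === 'factModified'
          );
          // Assumes the step output is stored as a JSON string under
          // the 'genkit:output' span attribute.
          const output = span?.attributes?.['genkit:output'];
          return output ? JSON.parse(output) : [];
        },
      },
    },
  ],
};
```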

Synthesizing test data using an LLM

Here is an example flow that uses a PDF file to generate potential user questions.

import { genkit, run, z } from "genkit";
import { googleAI, gemini15Flash } from "@genkit-ai/googleai";
import { chunk } from "llm-chunk"; // npm i llm-chunk
import path from "path";
import { readFile } from "fs/promises";
import pdf from "pdf-parse"; // npm i pdf-parse

const ai = genkit({ plugins: [googleAI()] });

const chunkingConfig = {
  minLength: 1000, // minimum number of characters per chunk
  maxLength: 2000, // maximum number of characters per chunk
  splitter: "sentence", // paragraph | sentence
  overlap: 100, // number of overlapping characters
  delimiters: "", // regex for base split method
} as any;

async function extractText(filePath: string) {
  const pdfFile = path.resolve(filePath);
  const dataBuffer = await readFile(pdfFile);
  const data = await pdf(dataBuffer);
  return data.text;
}

export const synthesizeQuestions = ai.defineFlow(
  {
    name: "synthesizeQuestions",
    inputSchema: z.string().describe("PDF file path"),
    outputSchema: z.array(z.string()),
  },
  async (filePath) => {
    filePath = path.resolve(filePath);
    // `extractText` loads the PDF and extracts its contents as text.
    const pdfTxt = await run("extract-text", () => extractText(filePath));

    const chunks = await run("chunk-it", async () =>
      chunk(pdfTxt, chunkingConfig)
    );

    const questions: string[] = [];
    for (let i = 0; i < chunks.length; i++) {
      const qResponse = await ai.generate({
        model: gemini15Flash,
        prompt: {
          text: `Generate one question about the text below: ${chunks[i]}`,
        },
      });
      questions.push(qResponse.text);
    }
    return questions;
  }
);

You can then use the following command to export the data to a file and use it for evaluation:

genkit flow:run synthesizeQuestions '"my_input.pdf"' --output synthesizedQuestions.json