{ "cells": [ { "cell_type": "markdown", "id": "68e21aa1", "metadata": {}, "source": [ "# Evaluating Agents\n", "\n", "We have an email assistant that uses a router to triage emails and then passes the email to the agent for response generation. How can we be sure that it will work well in production? This is why testing is important: it guides our decisions about our agent architecture with quantifiable metrics like response quality, token usage, latency, or triage accuracy. [LangSmith](https://docs.smith.langchain.com/) offers two primary ways to test agents. \n", "\n", "![overview-img](img/overview_eval.png)" ] }, { "cell_type": "markdown", "id": "4d7f7048", "metadata": {}, "source": [ "#### Load Environment Variables" ] }, { "cell_type": "code", "execution_count": null, "id": "c47d4c3d", "metadata": {}, "outputs": [], "source": [ "from dotenv import load_dotenv\n", "load_dotenv(\"../.env\")" ] }, { "cell_type": "markdown", "id": "2005c34d", "metadata": {}, "source": [ "## How to run Evaluations\n", "\n", "#### Pytest / Vitest\n", "\n", "[Pytest](https://docs.pytest.org/en/stable/) and Vitest are well known to many developers as powerful tools for writing tests within the Python and JavaScript ecosystems. LangSmith integrates with these frameworks to allow you to write and run tests that log results to LangSmith. For this notebook, we'll use Pytest.\n", "* Pytest is a great way to get started for developers who are already familiar with the framework. \n", "* Pytest is great for more complex evaluations, where each agent test case requires specific checks and success criteria that are harder to generalize.\n", "\n", "#### LangSmith Datasets \n", "\n", "You can also create a dataset [in LangSmith](https://docs.smith.langchain.com/evaluation) and run our assistant against the dataset using the LangSmith evaluate API.\n", "* LangSmith datasets are great for teams who are collaboratively building out their test suite. 
\n", "* You can leverage production traces, annotation queues, synthetic data generation, and more, to add examples to an ever-growing golden dataset.\n", "* LangSmith datasets are great when you can define evaluators that can be applied to every test case in the dataset (ex. similarity, exact match accuracy, etc.)" ] }, { "cell_type": "markdown", "id": "10b7c989", "metadata": {}, "source": [ "## Test Cases\n", "\n", "Testing often starts with defining the test cases, which can be a challenging process. In this case, we'll just define a set of example emails we want to handle along with a few things to test. You can see the test cases in `eval/email_dataset.py`, which contains the following:\n", "\n", "1. **Input Emails**: A collection of diverse email examples\n", "2. **Ground Truth Classifications**: `Respond`, `Notify`, `Ignore`\n", "3. **Expected Tool Calls**: Tools called for each email that requires a response\n", "4. **Response Criteria**: What makes a good response for emails requiring replies\n", "\n", "Note that we have both\n", "- End to end \"integration\" tests (e.g. Input Emails -> Agent -> Final Output vs Response Criteria)\n", "- Tests for specific steps in our workflow (e.g. 
Input Emails -> Agent -> Classification vs Ground Truth Classification)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "f8fdc2b8", "metadata": {}, "outputs": [], "source": [ "\n", "%load_ext autoreload\n", "%autoreload 2\n", "\n", "from email_assistant.eval.email_dataset import email_inputs, expected_tool_calls, triage_outputs_list, response_criteria_list\n", "\n", "test_case_ix = 0\n", "\n", "print(\"Email Input:\", email_inputs[test_case_ix])\n", "print(\"Expected Triage Output:\", triage_outputs_list[test_case_ix])\n", "print(\"Expected Tool Calls:\", expected_tool_calls[test_case_ix])\n", "print(\"Response Criteria:\", response_criteria_list[test_case_ix])" ] }, { "cell_type": "markdown", "id": "2337bd7c", "metadata": {}, "source": [ "## Pytest Example\n", "\n", "Let's take a look at how we can write a test for a specific part of our workflow with Pytest. We will test whether our `email_assistant` makes the right tool calls when responding to the emails." ] }, { "cell_type": "code", "execution_count": 22, "id": "ae92fe30", "metadata": {}, "outputs": [], "source": [ "import pytest\n", "from email_assistant.eval.email_dataset import email_inputs, expected_tool_calls\n", "from email_assistant.utils import format_messages_string\n", "from email_assistant.email_assistant import email_assistant\n", "from email_assistant.utils import extract_tool_calls\n", "\n", "from langsmith import testing as t\n", "\n", "@pytest.mark.langsmith\n", "@pytest.mark.parametrize(\n", " \"email_input, expected_calls\",\n", " [ # Pick some examples with e-mail reply expected\n", " (email_inputs[0],expected_tool_calls[0]),\n", " (email_inputs[3],expected_tool_calls[3]),\n", " ],\n", ")\n", "def test_email_dataset_tool_calls(email_input, expected_calls):\n", " \"\"\"Test if email processing contains expected tool calls.\n", " \n", " This test confirms that all expected tools are called during email processing,\n", " but does not check the order of tool invocations or the 
number of invocations\n", " per tool. Additional checks for these aspects could be added if desired.\n", " \"\"\"\n", " # Run the email assistant\n", " messages = [{\"role\": \"user\", \"content\": str(email_input)}]\n", " result = email_assistant.invoke({\"messages\": messages})\n", " \n", " # Extract tool calls from messages list\n", " extracted_tool_calls = extract_tool_calls(result['messages'])\n", " \n", " # Check if all expected tool calls are in the extracted ones\n", " missing_calls = [call for call in expected_calls if call.lower() not in extracted_tool_calls]\n", " \n", " t.log_outputs({\n", " \"missing_calls\": missing_calls,\n", " \"extracted_tool_calls\": extracted_tool_calls,\n", " \"response\": format_messages_string(result['messages'])\n", " })\n", "\n", " # Test passes if no expected calls are missing\n", " assert len(missing_calls) == 0" ] }, { "cell_type": "markdown", "id": "700aba2a", "metadata": {}, "source": [ "You'll notice a few things. \n", "- First, to [run with Pytest and log test results to LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest), we only need to add the `@pytest.mark.langsmith` decorator to our function and place it in a file, as you see in `notebooks/test_tools.py`.\n", "- Second, we can pass dataset examples to the test function as shown [here](https://docs.smith.langchain.com/evaluation/how_to_guides/pytest#parametrize-with-pytestmarkparametrize) via `@pytest.mark.parametrize`. \n", "\n", "#### Running Pytest\n", "We can run the test from the command line. We've defined the above code in a Python file. From the project root, run:\n", "\n", "`! LANGSMITH_TEST_SUITE='Email assistant: Test Tools For Interrupt' pytest notebooks/test_tools.py`" ] }, { "cell_type": "markdown", "id": "53165e98", "metadata": {}, "source": [ "#### Viewing Experiment Result\n", "\n", "We can view the results in the LangSmith UI. 
The `assert len(missing_calls) == 0` is logged to the `Pass` column in LangSmith. The `log_outputs` are passed to the `Outputs` column and function arguments are passed to the `Inputs` column. Each input passed in `@pytest.mark.parametrize(` is a separate row logged to the `LANGSMITH_TEST_SUITE` project name in LangSmith, which is found under `Datasets & Experiments`.\n", "\n", "![Test Results](img/test_result.png)" ] }, { "cell_type": "markdown", "id": "fd325e27", "metadata": {}, "source": [ "## LangSmith Datasets Example\n", "\n", "![overview-img](img/eval_detail.png)\n", "\n", "Let's take a look at how we can run evaluations with LangSmith datasets. In the previous example with Pytest, we evaluated the tool calling accuracy of the email assistant. Now, the dataset that we're going to evaluate here is specifically for the triage step of the email assistant, in classifying whether an email requires a response.\n", "\n", "#### Dataset Definition \n", "\n", "We can [create a dataset in LangSmith](https://docs.smith.langchain.com/evaluation/how_to_guides/manage_datasets_programmatically#create-a-dataset) with the LangSmith SDK. The below code creates a dataset with the test cases in the `eval/email_dataset.py` file." 
] }, { "cell_type": "code", "execution_count": 23, "id": "7ea997ac", "metadata": {}, "outputs": [], "source": [ "from langsmith import Client\n", "\n", "from email_assistant.eval.email_dataset import examples_triage\n", "\n", "# Initialize LangSmith client\n", "client = Client()\n", "\n", "# Dataset name\n", "dataset_name = \"E-mail Triage Evaluation\"\n", "\n", "# Create dataset if it doesn't exist\n", "if not client.has_dataset(dataset_name=dataset_name):\n", " dataset = client.create_dataset(\n", " dataset_name=dataset_name, \n", " description=\"A dataset of e-mails and their triage decisions.\"\n", " )\n", " # Add examples to the dataset\n", " client.create_examples(dataset_id=dataset.id, examples=examples_triage)" ] }, { "cell_type": "markdown", "id": "0b2df606", "metadata": {}, "source": [ "#### Target Function\n", "\n", "The dataset has the following structure, with an e-mail input and a ground truth triage classification for the e-mail as output:\n", "\n", "```\n", "examples_triage = [\n", " {\n", " \"inputs\": {\"email_input\": email_input_1},\n", " \"outputs\": {\"classification\": triage_output_1}, # NOTE: This becomes the reference_output in the created dataset\n", " }, ...\n", "]\n", "```" ] }, { "cell_type": "code", "execution_count": null, "id": "f7d7e83f-3006-4386-9230-786545c7b1a1", "metadata": {}, "outputs": [], "source": [ "print(\"Dataset Example Input (inputs):\", examples_triage[0]['inputs'])" ] }, { "cell_type": "code", "execution_count": null, "id": "f292f070-7af6-4370-9338-e90bfd6b3d42", "metadata": {}, "outputs": [], "source": [ "print(\"Dataset Example Reference Output (reference_outputs):\", examples_triage[0]['outputs'])" ] }, { "cell_type": "markdown", "id": "8290e820", "metadata": {}, "source": [ "We define a function that takes the dataset inputs and passes them to our email assistant. The LangSmith [evaluate API](https://docs.smith.langchain.com/evaluation) passes the `inputs` dict to this function. 
This function then returns a dict with the agent's output. Because we are evaluating the triage step, we only need to return the classification decision. " ] }, { "cell_type": "code", "execution_count": 26, "id": "0b9d1ded", "metadata": {}, "outputs": [], "source": [ "def target_email_assistant(inputs: dict) -> dict:\n", " \"\"\"Process an email through the workflow-based email assistant.\"\"\"\n", " response = email_assistant.nodes['triage_router'].invoke({\"email_input\": inputs[\"email_input\"]})\n", " return {\"classification_decision\": response.update['classification_decision']}" ] }, { "cell_type": "markdown", "id": "5ba6ec4c", "metadata": {}, "source": [ "#### Evaluator Function \n", "\n", "Now, we create an evaluator function. What do we want to evaluate? We have reference outputs in our dataset and agent outputs produced by the target function above.\n", "\n", "* Reference outputs: `\"reference_outputs\": {\"classification\": triage_output_1} ...`\n", "* Agent outputs: `\"outputs\": {\"classification_decision\": agent_output_1} ...`\n", "\n", "We want to evaluate if the agent's output matches the reference output. So we simply need an evaluator function that compares the two, where `outputs` is the agent's output and `reference_outputs` is the reference output from the dataset." ] }, { "cell_type": "code", "execution_count": 27, "id": "4fee7532", "metadata": {}, "outputs": [], "source": [ "def classification_evaluator(outputs: dict, reference_outputs: dict) -> bool:\n", " \"\"\"Check if the predicted classification matches the reference classification (case-insensitive).\"\"\"\n", " return outputs[\"classification_decision\"].lower() == reference_outputs[\"classification\"].lower()" ] }, { "cell_type": "markdown", "id": "50fd2de9", "metadata": {}, "source": [ "### Running Evaluation\n", "\n", "Now, the question is: how are these things hooked together? The evaluate API takes care of it for us. It passes the `inputs` dict from our dataset to the target function. 
It passes the `reference_outputs` dict from our dataset to the evaluator function. And it passes the `outputs` of our agent to the evaluator function. \n", "\n", "Note this is similar to what we did with Pytest: in Pytest, we passed in the dataset example inputs and reference outputs to the test function with `@pytest.mark.parametrize`." ] }, { "cell_type": "code", "execution_count": null, "id": "6807306d", "metadata": {}, "outputs": [], "source": [ "# Set to true if you want to kick off evaluation\n", "run_expt = True\n", "if run_expt:\n", " experiment_results_workflow = client.evaluate(\n", " # Run agent \n", " target_email_assistant,\n", " # Dataset name \n", " data=dataset_name,\n", " # Evaluator\n", " evaluators=[classification_evaluator],\n", " # Name of the experiment\n", " experiment_prefix=\"E-mail assistant workflow\", \n", " # Number of concurrent evaluations\n", " max_concurrency=2, \n", " )" ] }, { "cell_type": "markdown", "id": "76baff88", "metadata": {}, "source": [ "We can view the results from both experiments in the LangSmith UI.\n", "\n", "![Test Results](img/eval.png)" ] }, { "cell_type": "markdown", "id": "c5146b52", "metadata": {}, "source": [ "## LLM-as-Judge Evaluation\n", "\n", "We've shown unit tests for the triage step (using evaluate()) and tool calling (using Pytest). \n", "\n", "We'll showcase how you could use an LLM as a judge to evaluate our agent's execution against a set of success criteria. \n", "\n", "![types](img/eval_types.png)\n", "\n", "First, we define a structured output schema for our LLM grader that contains a grade and justification for the grade." 
] }, { "cell_type": "code", "execution_count": 29, "id": "e1d342b8", "metadata": {}, "outputs": [], "source": [ "from pydantic import BaseModel, Field\n", "from langchain.chat_models import init_chat_model\n", "\n", "class CriteriaGrade(BaseModel):\n", " \"\"\"Score the response against specific criteria.\"\"\"\n", " justification: str = Field(description=\"The justification for the grade and score, including specific examples from the response.\")\n", " grade: bool = Field(description=\"Does the response meet the provided criteria?\")\n", " \n", "# Create a global LLM for evaluation to avoid recreating it for each test\n", "criteria_eval_llm = init_chat_model(\"openai:gpt-4o\")\n", "criteria_eval_structured_llm = criteria_eval_llm.with_structured_output(CriteriaGrade)" ] }, { "cell_type": "code", "execution_count": null, "id": "bec02b18", "metadata": {}, "outputs": [], "source": [ "email_input = email_inputs[0]\n", "print(\"Email Input:\", email_input)\n", "success_criteria = response_criteria_list[0]\n", "print(\"Success Criteria:\", success_criteria)" ] }, { "cell_type": "markdown", "id": "38390ccd", "metadata": {}, "source": [ "Our Email Assistant is invoked with the email input and the response is formatted into a string. These are all then passed to the LLM grader to receive a grade and justification for the grade." 
] }, { "cell_type": "code", "execution_count": null, "id": "cbff28fc", "metadata": {}, "outputs": [], "source": [ "response = email_assistant.invoke({\"email_input\": email_input})" ] }, { "cell_type": "code", "execution_count": null, "id": "d64619fb", "metadata": {}, "outputs": [], "source": [ "from email_assistant.eval.prompts import RESPONSE_CRITERIA_SYSTEM_PROMPT\n", "\n", "all_messages_str = format_messages_string(response['messages'])\n", "eval_result = criteria_eval_structured_llm.invoke([\n", " {\"role\": \"system\",\n", " \"content\": RESPONSE_CRITERIA_SYSTEM_PROMPT},\n", " {\"role\": \"user\",\n", " \"content\": f\"\"\"\\n\\n Response criteria: {success_criteria} \\n\\n Assistant's response: \\n\\n {all_messages_str} \\n\\n Evaluate whether the assistant's response meets the criteria and provide justification for your evaluation.\"\"\"}\n", " ])\n", "\n", "eval_result" ] }, { "cell_type": "code", "execution_count": null, "id": "64275647-6fdb-4bf3-806b-4dbc770cbd6f", "metadata": {}, "outputs": [], "source": [ "RESPONSE_CRITERIA_SYSTEM_PROMPT" ] }, { "cell_type": "markdown", "id": "7994952c", "metadata": {}, "source": [ "We can see that the LLM grader returns an eval result with a schema matching our `CriteriaGrade` base model." ] }, { "cell_type": "markdown", "id": "0b44111d", "metadata": {}, "source": [ "## Running against a Larger Test Suite\n", "Now that we've seen how to evaluate our agent using Pytest and evaluate(), and seen an example of using an LLM as a judge, we can use evaluations over a bigger test suite to get a better sense of how our agent performs over a wider variety of examples." ] }, { "cell_type": "markdown", "id": "9280d5ae-3070-4131-8763-454073176081", "metadata": {}, "source": [ "Let's run our email_assistant against a larger test suite.\n", "```\n", "! 
LANGSMITH_TEST_SUITE='Email assistant: Test Full Response Interrupt' LANGSMITH_EXPERIMENT='email_assistant' pytest tests/test_response.py --agent-module email_assistant\n", "```\n", "\n", "In `test_response.py`, you can see a few things. \n", "\n", "We pass our dataset examples into functions that will run pytest and log to our `LANGSMITH_TEST_SUITE`:\n", "\n", "```\n", "# Reference output key\n", "@pytest.mark.langsmith(output_keys=[\"criteria\"])\n", "# Variable names and a list of tuples with the test cases\n", "# Each test case is (email_input, email_name, criteria, expected_calls)\n", "@pytest.mark.parametrize(\"email_input,email_name,criteria,expected_calls\",create_response_test_cases())\n", "def test_response_criteria_evaluation(email_input, email_name, criteria, expected_calls):\n", "```\n", "\n", "We use LLM-as-judge with a grading schema:\n", "```\n", "class CriteriaGrade(BaseModel):\n", " \"\"\"Score the response against specific criteria.\"\"\"\n", " grade: bool = Field(description=\"Does the response meet the provided criteria?\")\n", " justification: str = Field(description=\"The justification for the grade and score, including specific examples from the response.\")\n", "```\n", "\n", "\n", "We evaluate the agent response relative to the criteria:\n", "```\n", " # Evaluate against criteria\n", " eval_result = criteria_eval_structured_llm.invoke([\n", " {\"role\": \"system\",\n", " \"content\": RESPONSE_CRITERIA_SYSTEM_PROMPT},\n", " {\"role\": \"user\",\n", " \"content\": f\"\"\"\\n\\n Response criteria: {criteria} \\n\\n Assistant's response: \\n\\n {all_messages_str} \\n\\n Evaluate whether the assistant's response meets the criteria and provide justification for your evaluation.\"\"\"}\n", " ])\n", "```" ] }, { "cell_type": "markdown", "id": "ca836fbf", "metadata": {}, "source": [ "Now let's take a look at this experiment in the LangSmith UI and look into what our agent did well, and what it could improve on.\n", "\n", "#### Getting Results\n", 
"\n", "We can also get the results of the evaluation by reading the tracing project associated with our experiment. This is great for creating custom visualizations of our agent's performance." ] }, { "cell_type": "code", "execution_count": 34, "id": "70b655f8", "metadata": { "lines_to_next_cell": 0 }, "outputs": [], "source": [ "# TODO: Copy your experiment name here\n", "experiment_name = \"email_assistant:8286b3b8\"\n", "# Set this to load expt results\n", "load_expt = False\n", "if load_expt:\n", " email_assistant_experiment_results = client.read_project(project_name=experiment_name, include_stats=True)\n", " print(\"Latency p50:\", email_assistant_experiment_results.latency_p50)\n", " print(\"Latency p99:\", email_assistant_experiment_results.latency_p99)\n", " print(\"Token Usage:\", email_assistant_experiment_results.total_tokens)\n", " print(\"Feedback Stats:\", email_assistant_experiment_results.feedback_stats)" ] }, { "cell_type": "code", "execution_count": null, "id": "0ccdfaa6", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "main_language": "python", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }