{ "cells": [ { "cell_type": "markdown", "id": "9bba24bf-3592-47d2-bfb1-5177324a418e", "metadata": {}, "source": [ "\n", "\n", "\n", "\n", "\n", "
\n", "\n", "Supplementary code for the Build a Reasoning Model (From Scratch) book by Sebastian Raschka
\n", "
Code repository: https://github.com/rasbt/reasoning-from-scratch\n", "
\n", "
\n", "\n", "
\n" ] }, { "cell_type": "markdown", "id": "90fa0a7f-b86b-4a92-9957-18f8a4398290", "metadata": {}, "source": [ "# Appendix F: Common Approaches to LLM Evaluation" ] }, { "cell_type": "code", "execution_count": 1, "id": "d2c83184-31d0-4bcd-a7ea-67ee366736ea", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "reasoning_from_scratch version: 0.1.0\n", "torch version: 2.7.1\n", "tokenizers version: 0.21.2\n" ] } ], "source": [ "from importlib.metadata import version\n", "\n", "used_libraries = [\n", " \"reasoning_from_scratch\",\n", " \"torch\",\n", " \"tokenizers\" # Used by reasoning_from_scratch\n", "]\n", "\n", "for lib in used_libraries:\n", " print(f\"{lib} version: {version(lib)}\")" ] }, { "cell_type": "markdown", "id": "65ecfa9d-b502-4ec5-b80c-e67142273718", "metadata": {}, "source": [ " \n", "## F.1 Understanding the main evaluation methods for LLMs" ] }, { "cell_type": "markdown", "id": "17412c95-a620-4b3f-978f-39525dba7fd9", "metadata": {}, "source": [ "- No code in this section" ] }, { "cell_type": "markdown", "id": "f815d4c6-71ae-4d21-80a3-5451822d6bd3", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "ead62842-52dd-4162-a4f2-c91256a9f624", "metadata": {}, "source": [ " \n", "### F.2 Evaluating answer-choice accuracy" ] }, { "cell_type": "markdown", "id": "0d78ffe2-7cf0-45c0-af60-98921a56d3c2", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "10b2cc4f-5fc4-49ff-be4b-38a920aa5997", "metadata": {}, "source": [ "- Note that this figure depicts a simplified version of a multiple-choice-based evaluation (like MMLU), where we check the generated output letter against the correct answer letter\n", "- In practice, variants of this include log-probability scoring, where instead of checking only the final letter, we compute how likely the model considers each candidate answer\n", "- For reasoning models, this can also involve evaluating the likelihood of the correct answer being produced 
when fed into the model\n", "- In either case, the evaluation still checks whether the model selects one of the pre-defined answers\n", "- (Output probability scores are discussed in more detail in chapter 4, where we improve the text generation function)" ] }, { "cell_type": "markdown", "id": "bf03e1ce-fa47-43a3-8d68-6d522a7855d0", "metadata": {}, "source": [ " \n", "#### F.2.1 Loading the model" ] }, { "cell_type": "code", "execution_count": 2, "id": "36896c72-6327-4b4d-ae38-563386520ea9", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Using Apple Silicon GPU (MPS)\n", "✓ qwen3/qwen3-0.6B-base.pth already up-to-date\n", "✓ qwen3/tokenizer-base.json already up-to-date\n" ] } ], "source": [ "from pathlib import Path\n", "import torch\n", "\n", "from reasoning_from_scratch.ch02 import (\n", " get_device\n", ")\n", "from reasoning_from_scratch.qwen3 import (\n", " download_qwen3_small,\n", " Qwen3Tokenizer,\n", " Qwen3Model,\n", " QWEN_CONFIG_06_B\n", ")\n", "\n", "device = get_device()\n", "torch.set_float32_matmul_precision(\"high\")\n", "\n", "# If you have compatibility issues, try to\n", "# uncomment the line below and rerun the notebook\n", "# device = \"cpu\"\n", "\n", "WHICH_MODEL = \"base\"\n", "\n", "if WHICH_MODEL == \"base\":\n", "\n", " download_qwen3_small(\n", " kind=\"base\", tokenizer_only=False, out_dir=\"qwen3\"\n", " )\n", "\n", " tokenizer_path = Path(\"qwen3\") / \"tokenizer-base.json\"\n", " model_path = Path(\"qwen3\") / \"qwen3-0.6B-base.pth\"\n", " tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)\n", "\n", "elif WHICH_MODEL == \"reasoning\":\n", "\n", " download_qwen3_small(\n", " kind=\"reasoning\", tokenizer_only=False, out_dir=\"qwen3\"\n", " )\n", "\n", " tokenizer_path = Path(\"qwen3\") / \"tokenizer-reasoning.json\"\n", " model_path = Path(\"qwen3\") / \"qwen3-0.6B-reasoning.pth\"\n", " tokenizer = Qwen3Tokenizer(\n", " tokenizer_file_path=tokenizer_path,\n", " 
apply_chat_template=True,\n", " add_generation_prompt=True,\n", " add_thinking=True,\n", " )\n", "\n", "else:\n", " raise ValueError(f\"Invalid choice: WHICH_MODEL={WHICH_MODEL}\")\n", "\n", "\n", "model = Qwen3Model(QWEN_CONFIG_06_B)\n", "model.load_state_dict(torch.load(model_path))\n", "\n", "model.to(device)\n", "\n", "\n", "USE_COMPILE = False # Set to true to enable compilation\n", "if USE_COMPILE:\n", " torch._dynamo.config.allow_unspec_int_on_nn_module = True\n", " model = torch.compile(model)" ] }, { "cell_type": "markdown", "id": "3eb95069-28da-4ddc-9a39-1a68b5ef3be9", "metadata": {}, "source": [ " \n", "#### F.2.2 Checking the generated answer letter" ] }, { "cell_type": "code", "execution_count": 3, "id": "e641be06-1e63-41f7-95dc-ddb3dafb80b0", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "How many ways are there to put 4 distinguishable balls into 2 indistinguishable boxes?\n", "A. 7\n", "B. 11\n", "C. 16\n", "D. 8\n", "Answer: \n" ] } ], "source": [ "example = {\n", " \"question\": (\n", " \"How many ways are there to put 4 distinguishable\"\n", " \" balls into 2 indistinguishable boxes?\"\n", " ),\n", " \"choices\": [\"7\", \"11\", \"16\", \"8\"],\n", " \"answer\": \"D\",\n", "}\n", "\n", "def format_prompt(example):\n", " return (\n", " f\"{example['question']}\\n\"\n", " f\"A. {example['choices'][0]}\\n\"\n", " f\"B. {example['choices'][1]}\\n\"\n", " f\"C. {example['choices'][2]}\\n\"\n", " f\"D. 
{example['choices'][3]}\n", " \"Answer: \" # trailing space encourages a single-letter next token\n", " )\n", "\n", "prompt = format_prompt(example)\n", "print(prompt)" ] }, { "cell_type": "markdown", "id": "502beb50-aa85-45d7-93ea-51b96407f2cd", "metadata": {}, "source": [ "---\n", "\n", "\n", "- You can load examples from the MMLU dataset directly via the `datasets` library (which can be installed via `pip install datasets` or `uv add datasets`):\n", "\n", "```python\n", "from datasets import load_dataset\n", "\n", "dataset = load_dataset(\"cais/mmlu\", \"high_school_mathematics\")\n", "\n", "# Inspect the first example from the test set:\n", "example = dataset[\"test\"][0]\n", "print(example)\n", "```\n", "\n", "- Above, we used the `\"high_school_mathematics\"` subset; to get a list of the other subsets, use the following code:\n", "\n", "\n", "```python\n", "from datasets import get_dataset_config_names\n", "\n", "subsets = get_dataset_config_names(\"cais/mmlu\")\n", "print(subsets)\n", "```\n", "\n", "---" ] }, { "cell_type": "code", "execution_count": 4, "id": "35ddc825-1736-49ac-b5cc-d8752adcc6ef", "metadata": {}, "outputs": [], "source": [ "prompt_ids = tokenizer.encode(prompt)\n", "prompt_fmt = torch.tensor(prompt_ids, device=device).unsqueeze(0)" ] }, { "cell_type": "markdown", "id": "f91335d4-5560-407f-9346-9820ad003d7c", "metadata": {}, "source": [ "- We generate a few tokens and extract the first instance of the letter A/B/C/D that the model prints:" ] }, { "cell_type": "code", "execution_count": 5, "id": "05de0b33-19de-4609-9e93-b68314abc61f", "metadata": {}, "outputs": [], "source": [ "from reasoning_from_scratch.ch02 import generate_text_basic_stream_cache\n", "\n", "\n", "def predict_choice(\n", " model, tokenizer, prompt_fmt, max_new_tokens=8\n", "):\n", " pred = None\n", " for t in generate_text_basic_stream_cache(\n", " model=model,\n", " token_ids=prompt_fmt,\n", " 
max_new_tokens=max_new_tokens,\n", " eos_token_id=tokenizer.eos_token_id,\n", " ):\n", " answer = tokenizer.decode(t.squeeze(0).tolist())\n", " for letter in answer:\n", " letter = letter.upper()\n", " if letter in \"ABCD\":\n", " pred = letter\n", " break\n", " if pred: # stop as soon as a letter appears\n", " break\n", " return pred" ] }, { "cell_type": "code", "execution_count": 6, "id": "4ede3f46-7207-4aed-825c-53eb9f307699", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Generated letter: C\n", "Correct? False\n" ] } ], "source": [ "pred1 = predict_choice(model, tokenizer, prompt_fmt)\n", "\n", "print(\n", " f\"Generated letter: {pred1}\\n\"\n", " f\"Correct? {pred1 == example['answer']}\"\n", ")" ] }, { "cell_type": "markdown", "id": "8965c1a0-fe7d-4b7b-8ff1-b3c9c4b0ee0d", "metadata": {}, "source": [ " \n", "### F.3 Using verifiers to check answers" ] }, { "cell_type": "markdown", "id": "0a55f47c-7ff0-4ae8-aa7b-eb2c69227e14", "metadata": {}, "source": [ "- No code in this section (see chapter 3)" ] }, { "cell_type": "markdown", "id": "055c9f45-5608-4fb5-a3a3-8eb5dd842ca7", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "40052a3a-3e41-4512-85fc-3872da7c8d62", "metadata": {}, "source": [ "
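\n", "- Chapter 3 develops verifiers in detail; as a minimal, illustrative sketch of the idea (the helper names `extract_final_answer` and `verify_answer` below are placeholders, not functions from the book's code), a verifier programmatically extracts the final answer from the model's free-form output and compares it against a known ground truth:\n", "\n", "```python\n", "import re\n", "\n", "def extract_final_answer(text):\n", "    # Heuristic: treat the last number in the response as the final answer\n", "    matches = re.findall('-?[0-9]+(?:[.][0-9]+)?', text)\n", "    return matches[-1] if matches else None\n", "\n", "def verify_answer(response_text, ground_truth):\n", "    answer = extract_final_answer(response_text)\n", "    return answer is not None and float(answer) == float(ground_truth)\n", "\n", "print(verify_answer('Adding them up gives 14 apples.', '14'))  # True\n", "print(verify_answer('The answer is 15.', '14'))  # False\n", "```" ] }, { "cell_type": "markdown", "id": "7d3e9f2a-1b4c-4d5e-8f6a-0c1d2e3f4a5b", "metadata": {}, "source": [ "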
\n", " \n", "\n", "### F.4 Comparing models using preferences and leaderboards" ] }, { "cell_type": "markdown", "id": "0365d882-1d8f-4eeb-b938-401257568748", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "f71ecf02-69b6-4527-aaf6-206c70ab1d36", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "242ed04b-3cb3-4eda-8e9f-f46777a9ea2c", "metadata": {}, "source": [ "- The Elo rating (\"algorithm of 400\") is inspired by chess rankings: https://en.wikipedia.org/wiki/Performance_rating_(chess)\n", "- Note that LM Arena switched to a statistical Bradley-Terry model that provides scores on an Elo-like scale; however, the same concept of pairwise ranking still applies" ] }, { "cell_type": "code", "execution_count": 7, "id": "9ea4c83f-a50b-4c0d-a2ad-f42a588dce8b", "metadata": {}, "outputs": [], "source": [ "# Pairwise \"arena votes\" where the first model is the winner and\n", "# the second model is the loser\n", "votes = [\n", " (\"GPT-5\", \"Claude-3\"), # First match-up: GPT-5 was preferred over Claude-3\n", " (\"GPT-5\", \"Llama-4\"),\n", " (\"Claude-3\", \"Llama-3\"),\n", " (\"Llama-4\", \"Llama-3\"),\n", " (\"Claude-3\", \"Llama-3\"),\n", " (\"GPT-5\", \"Llama-3\"),\n", "]" ] }, { "cell_type": "code", "execution_count": 8, "id": "5aeb5424-a942-4771-a069-cfb77f389704", "metadata": {}, "outputs": [], "source": [ "def elo_ratings(vote_pairs, k_factor=32, initial_rating=1000):\n", " # Initialize all models with the same base rating\n", " ratings = {\n", " model: initial_rating\n", " for pair in vote_pairs\n", " for model in pair\n", " }\n", "\n", " # Update ratings after each match\n", " for winner, loser in vote_pairs:\n", "\n", " # Expected score for the current winner given the ratings\n", " expected_winner = 1.0 / (\n", " 1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0)\n", " )\n", "\n", " # k_factor determines sensitivity of rating updates\n", " ratings[winner] = (\n", " ratings[winner] + k_factor * (1 - expected_winner)\n", " 
)\n", " ratings[loser] = (\n", " ratings[loser] + k_factor * (0 - (1 - expected_winner))\n", " )\n", "\n", " return ratings" ] }, { "cell_type": "code", "execution_count": 9, "id": "256b3f88-f911-46ef-967f-e6bfcbbf3c49", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "GPT-5 : 1043.7\n", "Claude-3 : 1015.2\n", "Llama-4 : 1000.7\n", "Llama-3 : 940.4\n" ] } ], "source": [ "ratings = elo_ratings(votes, k_factor=32, initial_rating=1000)\n", "\n", "for model in sorted(ratings, key=ratings.get, reverse=True):\n", " print(f\"{model:8s} : {ratings[model]:.1f}\")" ] }, { "cell_type": "markdown", "id": "42fd74fe-108d-4f6f-997d-f8cf80ee9282", "metadata": {}, "source": [ "- The expected winner score is calculated as follows:\n", "\n", "$$\\text{expected\\_winner} \\;=\\; \\frac{1}{1 + 10^{\\tfrac{\\text{rating\\_loser} - \\text{rating\\_winner}}{400}}}\n", "$$" ] }, { "cell_type": "markdown", "id": "6a670b39-423f-4fb7-83ba-e881e2fa2cc8", "metadata": {}, "source": [ "- Intuition:\n", " - If rating_winner >> rating_loser:\n", " - exponent → very negative\n", " - denominator ≈ 1\n", " - expected_winner ≈ 1 (almost certain win)\n", " - If rating_winner << rating_loser:\n", " - exponent → very positive\n", " - denominator → very large\n", " - expected_winner ≈ 0 (almost certain loss)\n", " - If rating_winner == rating_loser:\n", " - exponent = 0\n", " - denominator = 2\n", " - expected_winner = 0.5 (even match)" ] }, { "cell_type": "markdown", "id": "994b80e0-eaf9-4ece-921e-1956a2c1f195", "metadata": {}, "source": [ " \n", "### F.5 Judging responses with other LLMs" ] }, { "cell_type": "markdown", "id": "c436ebda-4f65-4db3-b38d-6c5401b4f684", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "id": "7921505e-fdc9-40c2-b24b-c88fa7e99b98", "metadata": { "id": "68d2b9d3-b6ff-4533-a89d-7b66079b4fd1" }, "source": [ "- In this section, we automate the evaluation of a model's responses using another, larger LLM\n", "- In particular, 
we use an instruction-finetuned 20-billion-parameter gpt-oss model by OpenAI that can be run locally via ollama ([https://ollama.com](https://ollama.com))" ] }, { "cell_type": "markdown", "id": "59dfc5c8-f123-44ec-b140-2754ce7676d9", "metadata": { "id": "ea427a30-36ba-44e3-bb1f-eb0d7008d6e9" }, "source": [ "- Ollama is an open-source application to run LLMs efficiently\n", "- It is a wrapper around llama.cpp ([https://github.com/ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)), which implements LLMs in pure C/C++ to maximize efficiency\n", "- Note that it is a tool for using LLMs to generate text (inference), not for training or finetuning LLMs\n", "- Before running the code below, install ollama by visiting [https://ollama.com](https://ollama.com) and following the instructions (for instance, clicking on the \"Download\" button and downloading the ollama application for your operating system)" ] }, { "cell_type": "markdown", "id": "ec4bc59d-954d-427a-a2fb-9f2e74fd9bd7", "metadata": {}, "source": [ "- For macOS and Windows users, click on the ollama application you downloaded; if it prompts you to install the command line usage, say \"yes\"\n", "- Linux users can use the installation command provided on the ollama website\n", "- There are 3 ways we can run ollama on our computer:" ] }, { "cell_type": "markdown", "id": "8dd2794c-d032-4c81-894c-0b1020f4db0a", "metadata": {}, "source": [ "**1. `ollama serve`**\n", "\n", "- This runs the ollama backend as a server, usually on `http://localhost:11434`. It doesn't load a model until we call it through the API. This is the mode we want when using ollama through Python.\n", "\n", "**2. `ollama run gpt-oss:20b`**\n", "\n", "- This is a convenience wrapper. If the server is not already running, it will start it, then download the model (the first time), and drop us into an interactive terminal where we can chat with the model. Behind the scenes, it uses the same server API.\n", "\n", "**3. 
Ollama desktop app**\n", "\n", "- This runs the same backend automatically and provides a GUI on top of it (as shown in the figure above).\n", "It also applies defaults (system prompt, temperature, stop sequences), which can explain why answers look different from raw API usage." ] }, { "cell_type": "markdown", "id": "79f31931-7aa9-4d16-afd5-022805e663ec", "metadata": {}, "source": [ "---\n", "\n", "**Note**:\n", "\n", "- When running `ollama serve` in the terminal, as described above, you may encounter an error message saying `Error: listen tcp 127.0.0.1:11434: bind: address already in use`\n", "- If that's the case, try using the command `OLLAMA_HOST=127.0.0.1:11435 ollama serve` (and if this address is also in use, keep incrementing the port number by one until you find an address that is not in use)\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "00c9f02d-789f-49bd-96d9-ef4234dc5c08", "metadata": { "id": "747a2fc7-282d-47ec-a987-ed0a23ed6822" }, "source": [ "- For example, to give ollama a try, we can use `ollama run gpt-oss:20b` to try the 20-billion-parameter gpt-oss model. The model\n", " (about 13 GB) will be automatically downloaded the first time you run\n", " this command. 
(Alternatively, you can use it in the Desktop app, as shown in the previous figure.)\n", " \n", "```bash\n", "ollama run gpt-oss:20b\n", "```\n", "\n", "\n", "- The output looks as follows:\n", "\n", "```\n", "$ ollama run gpt-oss:20b\n", "pulling manifest \n", "pulling b112e727c6f1: 100% ▕█████████████████████████████████▏ 13 GB \n", "pulling fa6710a93d78: 100% ▕█████████████████████████████████▏ 7.2 KB \n", "pulling f60356777647: 100% ▕█████████████████████████████████▏ 11 KB \n", "pulling d8ba2f9a17b3: 100% ▕█████████████████████████████████▏ 18 B \n", "pulling 55c108d8e936: 100% ▕█████████████████████████████████▏ 489 B \n", "verifying sha256 digest \n", "writing manifest \n", "removing unused layers \n", "success\n", "```\n", "\n", "- For more information on gpt-oss, please see my in-depth article, [From GPT-2 to gpt-oss: Analyzing the Architectural Advances](https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the) \n", "- Using ollama with the `\"gpt-oss:20b\"` model requires approximately 13 GB of RAM; if this is not supported by your machine, you can try a smaller model, such as the 4-billion-parameter `qwen3:4b` model, which only requires approximately 4 GB of RAM\n", "- Alternatively, you can also use the larger 120-billion-parameter gpt-oss model (`gpt-oss:120b`) or even the 235-billion-parameter Qwen3 model (`qwen3:235b`), if your machine supports it\n", "- After the download has been completed, you will see a command line prompt that allows you to chat with the model\n", "- Try a prompt like \"What is 1+2?\", which should return an output similar to the following:\n", "\n", "```\n", ">>> What is 1+2?\n", "Thinking...\n", "User asks: \"What is 1+2?\" This is simple: answer 3. Provide explanation? Possibly ask for simple \n", "arithmetic. 
Provide answer: 3.\n", "...done thinking.\n", "\n", "1 + 2 = **3**\n", "```" ] }, { "cell_type": "markdown", "id": "d720a970-8528-4b7f-82ae-be6ceaee4d9c", "metadata": { "id": "7b7b341c-ba0e-40bb-a52c-cb328bbd1fe4" }, "source": [ "- You can end this session using the input `/bye`" ] }, { "cell_type": "markdown", "id": "d803811f-29a1-43bc-be40-b122a9714bf4", "metadata": { "id": "faaf3e02-8ca0-4edf-be23-60625a5b14e3" }, "source": [ "- The following code checks whether the ollama session is running correctly before we proceed to use ollama to judge model responses" ] }, { "cell_type": "code", "execution_count": 10, "id": "b286154e-8393-4c9d-90cb-6f8605c8a0f9", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 193 }, "id": "026e8570-071e-48a2-aa38-64d7be35f288", "outputId": "e30d3533-e1f5-4aa9-b24f-33273fc7b30e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Ollama running: True\n" ] } ], "source": [ "import psutil\n", "\n", "def check_if_running(process_name):\n", " running = False\n", " for proc in psutil.process_iter([\"name\"]):\n", " if process_name in proc.info[\"name\"]:\n", " running = True\n", " break\n", " return running\n", "\n", "ollama_running = check_if_running(\"ollama\")\n", "\n", "if not ollama_running:\n", " raise RuntimeError(\n", " \"Ollama not running. 
Launch ollama before proceeding.\"\n", " )\n", "print(\"Ollama running:\", check_if_running(\"ollama\"))" ] }, { "cell_type": "markdown", "id": "58474292-dbc0-46bd-96bc-d31e773fad48", "metadata": { "id": "b3464705-d026-4594-977f-fb357e51c3a9" }, "source": [ "- Now, as an alternative to the `ollama run` command we used earlier, we can interact with the model via its REST API in Python, using the following function\n", "- Before you run the next cells in this notebook, make sure that ollama is still running (the previous code cell should print `\"Ollama running: True\"`)\n", "- Next, run the following code cell to query the model" ] }, { "cell_type": "code", "execution_count": 11, "id": "dae50f52-5144-469a-90df-13f7d9c824fd", "metadata": { "id": "e3ae0e10-2b28-42ce-8ea2-d9366a58088f", "outputId": "cc43acb3-8216-43cf-c77d-71d4089dc96c" }, "outputs": [], "source": [ "import json\n", "import requests\n", "\n", "\n", "def query_model(\n", " prompt,\n", " model=\"gpt-oss:20b\",\n", " # If you used OLLAMA_HOST=127.0.0.1:11435 ollama serve\n", " # update the address from 11434 to 11435\n", " url=\"http://localhost:11434/api/chat\"\n", "):\n", " # Create the data payload as a dictionary\n", " data = {\n", " \"model\": model,\n", " \"messages\": [\n", " {\"role\": \"user\", \"content\": prompt}\n", " ],\n", " \"options\": { # Settings below are required for deterministic responses\n", " \"seed\": 123,\n", " \"temperature\": 0,\n", " \"num_ctx\": 2048\n", " }\n", " }\n", "\n", " # Send the POST request\n", " with requests.post(url, json=data, stream=True, timeout=30) as r:\n", " r.raise_for_status()\n", " response_data = \"\"\n", " for line in r.iter_lines(decode_unicode=True):\n", " if not line:\n", " continue\n", " response_json = json.loads(line)\n", " if \"message\" in response_json:\n", " response_data += response_json[\"message\"][\"content\"]\n", "\n", " return response_data" ] }, { "cell_type": "code", "execution_count": 12, "id": "27d10442-cd63-49df-ad06-b249e22447b3", 
"metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "3\n" ] } ], "source": [ "ollama_model = \"gpt-oss:20b\"\n", "result = query_model(\"What is 1+2?\", ollama_model)\n", "print(result)" ] }, { "cell_type": "markdown", "id": "1e504fcb-7864-44a1-92a5-6d548543cdc0", "metadata": { "id": "207ae28f-0f8c-4fda-aeef-e7e3046249cc" }, "source": [ "- Now, using the `query_model` function we defined above, we can evaluate the responses of our own model" ] }, { "cell_type": "code", "execution_count": 16, "id": "7c520083-247e-4817-ad9e-22f0ac468a9c", "metadata": {}, "outputs": [], "source": [ "def rubric_prompt(instruction, reference_answer, model_answer):\n", " rubric = (\n", " \"You are a fair judge assistant. You will be given an instruction, \"\n", " \"a reference answer, and a candidate answer to evaluate, according \"\n", " \"to the following rubric:\\n\\n\"\n", " \"1: The response fails to address the instruction, providing \"\n", " \"irrelevant, incorrect, or excessively verbose content.\\n\"\n", " \"2: The response partially addresses the instruction but contains \"\n", " \"major errors, omissions, or irrelevant details.\\n\"\n", " \"3: The response addresses the instruction to some degree but is \"\n", " \"incomplete, partially correct, or unclear in places.\\n\"\n", " \"4: The response mostly adheres to the instruction, with only \"\n", " \"minor errors, omissions, or lack of clarity.\\n\"\n", " \"5: The response fully adheres to the instruction, providing a \"\n", " \"clear, accurate, and relevant answer in a concise and efficient \"\n", " \"manner.\\n\\n\"\n", " \"Now here is the instruction, the reference answer, and the \"\n", " \"response.\\n\"\n", " )\n", "\n", " prompt = (\n", " f\"{rubric}\\n\"\n", " f\"Instruction:\\n{instruction}\\n\\n\"\n", " f\"Reference Answer:\\n{reference_answer}\\n\\n\"\n", " f\"Answer:\\n{model_answer}\\n\\n\"\n", " f\"Evaluation: \"\n", " )\n", " return prompt" ] }, { "cell_type": "markdown", "id": 
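"4a5b6c7d-8e9f-4a0b-9c1d-2e3f4a5b6c7d", "metadata": {}, "source": [ "- To score a whole set of responses, we can render one rubric prompt per example, send each to the judge via `query_model`, and parse the 1-5 rating out of the judge's reply; the helper below is an illustrative sketch (`extract_score` is a hypothetical name, and the pattern assumes the rating appears as `**Score: 5**`, as in the judge output shown later):\n", "\n", "```python\n", "import re\n", "\n", "def extract_score(judge_reply):\n", "    # Heuristic: take the digit 1-5 that follows 'Score: ' in the reply\n", "    match = re.search('Score: ([1-5])', judge_reply)\n", "    return int(match.group(1)) if match else None\n", "\n", "print(extract_score('**Score: 5** The answer is correct.'))  # 5\n", "print(extract_score('No rating given.'))  # None\n", "```\n", "\n", "- Averaging such scores over all test examples then yields a single judge-based quality metric for a model" ] }, { "cell_type": "markdown", "id": 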
"639036f2-a801-41c6-b4dd-54f180ad244e", "metadata": {}, "source": [ "- The `model_answer` could be the answer produced by our own model; here we hardcode a possible model answer for simplicity" ] }, { "cell_type": "code", "execution_count": 17, "id": "32156f36-f6e9-49ec-ae60-251b5b6f62b8", "metadata": { "id": "86b839d4-064d-4178-b2d7-01691b452e5e", "outputId": "1c755ee1-bded-4450-9b84-1466724f389a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "You are a fair judge assistant. You will be given an instruction, a reference answer, and a candidate answer to evaluate, according to the following rubric:\n", "\n", "1: The response fails to address the instruction, providing irrelevant, incorrect, or excessively verbose content.\n", "2: The response partially addresses the instruction but contains major errors, omissions, or irrelevant details.\n", "3: The response addresses the instruction to some degree but is incomplete, partially correct, or unclear in places.\n", "4: The response mostly adheres to the instruction, with only minor errors, omissions, or lack of clarity.\n", "5: The response fully adheres to the instruction, providing a clear, accurate, and relevant answer in a concise and efficient manner.\n", "\n", "Now here is the instruction, the reference answer, and the response.\n", "\n", "Instruction:\n", "If all birds can fly, and a penguin is a bird, can a penguin fly?\n", "\n", "Reference Answer:\n", "Yes, according to the premise that all birds can fly, a penguin can fly.\n", "\n", "Answer:\n", "Yes – under those premises a penguin would be able to fly.\n", "\n", "Evaluation: \n" ] } ], "source": [ "rendered_prompt = rubric_prompt(\n", " instruction=(\n", " \"If all birds can fly, and a penguin is a bird, \"\n", " \"can a penguin fly?\"\n", " ),\n", " reference_answer=(\n", " \"Yes, according to the premise that all birds can fly, \"\n", " \"a penguin can fly.\"\n", " ),\n", " model_answer=(\n", " \"Yes – under those premises a 
penguin would be able to fly.\"\n", " )\n", ")\n", "print(rendered_prompt)" ] }, { "cell_type": "code", "execution_count": 18, "id": "f516f2cf-f6a7-4b78-96ec-190b941cf10c", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "**Score: 5**\n", "\n", "The candidate answer directly addresses the question, correctly applies the given premises, and concisely states that a penguin would be able to fly. It is accurate, relevant, and clear.\n" ] } ], "source": [ "result = query_model(rendered_prompt, ollama_model)\n", "print(result)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.16" } }, "nbformat": 4, "nbformat_minor": 5 }