{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "lbsCkPa7csK_" }, "source": [ "# **CSE 291 / DSC 291: Machine Learning Systems — Programming Assignment 1**\n", "\n", "Welcome to PA1! In this assignment, you will build a **mini deep learning framework from scratch** and use it to train a **transformer language model** that generates text.\n", "\n", "### What You Will Learn\n", "\n", "- How **automatic differentiation** works under the hood — the same technique that powers PyTorch, TensorFlow, and JAX\n", "- How to implement forward and backward passes for core operations: matrix multiply, softmax, layer normalization, and more\n", "- How a **decoder-only transformer** (the architecture behind GPT, LLaMA, etc.) is constructed from these primitives\n", "- How **next-token prediction**, **cross-entropy loss**, and **greedy decoding** work in a language model\n", "\n", "### What You Will Build\n", "\n", "1. **An autodiff engine** (`auto_diff.py`): You will implement operators (matmul, softmax, LayerNorm, etc.), a graph evaluator, and reverse-mode gradient computation.\n", "2. **A transformer language model** (`transformer.py`): Using only your autodiff engine (no `loss.backward()`!), you will build causal self-attention, a decoder layer, a training loop, and a text generation function.\n", "3. By the end, you will **train your model to memorize 10 sentences** and run generation demos directly from this notebook.\n", "\n", "### Logistics\n", "\n", "- **Language:** Python only — no C++ or CUDA required.\n", "- **Hardware:** No GPU needed. Everything runs on CPU in under a minute.\n", "- **Collaboration:** This assignment must be completed individually. Do not post your work on public platforms.\n", "- **Submission:** Submit the entire PA1 folder as a zip to Gradescope. Refer to the final section for details.\n", "\n", "### Testing and Grading\n", "\n", "We have provided public test scripts in `tests/` that you can run to verify your implementation. 
Your grade will also depend on private tests. You may submit multiple times — only the last submission before the deadline counts. We strongly encourage writing your own tests as well." ] }, { "cell_type": "markdown", "metadata": { "id": "_P0xzZPwiZ5x" }, "source": [ "## **Overview**\n", "\n", "Automatic differentiation forms the core technique for training machine learning models. In this assignment, you are required to develop a basic automatic differentiation system from scratch. Several functions will be given to you, and you will focus on creating standard operations used in transformers and other ML architectures, such as LayerNorm, ReLU, and Softmax.\n", "\n", "Let's first go through a run-down of how autodiff works.\n", "\n", "The automatic differentiation algorithm in this assignment operates using a computational graph. A computational graph visually represents the sequence of operations needed to compute an expression. You can reference lecture 2, slide 37 for a quick example of how this graph works.\n", "\n", "Let's begin by exploring the fundamental concepts and data structures used in the framework. A computational graph is composed of nodes, each representing a distinct computation step in the evaluation of the entire expression.\n", "\n", "Each node consists of three components, as shown in auto_diff.py line 6:\n", "\n", "* an operation (field op), specifying the type of computation the node performs.\n", "* a list of input nodes (field inputs), detailing the sources of input for the computation.\n", "* optionally, additional \"attributes\" (field attrs), which vary depending on the node's operation.\n", "\n", "These attributes will be discussed in more detail later in this section.\n", "\n", "Input nodes in a computational graph can be defined using ad.Variable. 
For instance, the input variable nodes $x_1$ and $x_2$ might be set up as follows:\n", "\n", "```python\n", "import auto_diff as ad\n", "\n", "x1 = ad.Variable(name=\"x1\")\n", "x2 = ad.Variable(name=\"x2\")\n", "```\n", "\n", "In auto_diff.py (line 81), the ad.Variable class is used to create a node with the operation placeholder and a specified name. Input nodes have empty inputs and attrs:\n", "\n", "```python\n", "class Variable(Node):\n", " def __init__(self, name: str) -> None:\n", " super().__init__(inputs=[], op=placeholder, name=name)\n", "```\n", "\n", "Here, the placeholder operation signifies that the input variable node does not perform any computation. Apart from placeholder, there are other operations defined in auto_diff.py, like add and matmul. You should not create your own instances of these ops.\n", "\n", "Consider the example where\n", "$y = x_1 * x_2 + x_1$. With x1 and x2 already established as input variables, the rest of the graph can be defined using just one line of Python:\n", "\n", "```python\n", "y = x1 * x2 + x1\n", "```\n", "\n", "This code first creates a node with the operation mul, taking x1 and x2 as its inputs. It then constructs another node with add, which utilizes the result of the multiplication node along with x1 as inputs. Consequently, this computational graph ultimately comprises four nodes.\n", "\n", "**Important Note**\n", "\n", "It's important to note that a computational graph (e.g., the four nodes we defined) does not inherently store the actual values of its nodes. The structure of this assignment aligns with the TensorFlow v1 approach that was covered in our lectures. This method contrasts with frameworks like PyTorch, where input tensor values are specified upfront, and the values of intermediate tensors are computed immediately as they are defined. 
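For instance, in plain PyTorch (eager execution, shown purely for contrast and not used in this assignment) the same expression computes immediately, with no separate graph or evaluator:\n", "\n", "```python\n", "import torch\n", "\n", "x1 = torch.tensor(2.0)\n", "x2 = torch.tensor(3.0)\n", "y = x1 * x2 + x1  # evaluated on the spot: tensor(8.)\n", "```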
\n", "\n", "In our framework, values are instead provided at evaluation time — you first define the graph structure, then supply input values through the `Evaluator.run` method (described below) to compute results.\n", "\n", "\n", "#### **Evaluator**\n", "\n", "Here's a walkthrough of how Evaluator works. The constructor of Evaluator accepts a list of nodes that it needs to evaluate. By initiating an Evaluator with:\n", "\n", "```evaluator = ad.Evaluator(eval_nodes=[y])```\n", "\n", "you are essentially setting up an Evaluator instance designed to compute the value of y. To calculate this, input tensor values are provided via the `Evaluator.run` method, which you will implement. **These input tensors are assumed to be of type `torch.Tensor` throughout this assignment.** Here's how it works:\n", "\n", "Coming back to the example $y = x_1 * x_2 + x_1$ , to calculate the value of the output y given the inputs x1 and x2, we do: \n", "```python\n", "import torch\n", "\n", "x1_value = torch.tensor(2.0)\n", "x2_value = torch.tensor(3.0)\n", "y_value = evaluator.run(input_values={x1: x1_value, x2: x2_value})\n", "\n", "```\n", "\n", "In this process, the run method takes the input values using a dictionary of the form `Dict[Node, torch.Tensor]`, calculates the value of the node y internally, and outputs the result. For instance, with the input values 2 * 3 + 2 = 8, the expected result for y_value would be `torch.tensor(8.0)`.\n", "\n", "The `Evaluator.run` method is responsible for the forward computation of nodes. Building on what was discussed in the lectures, to calculate the gradient of the output with respect to each input node within a computational graph, we enhance the forward graph with an additional backward component. 
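For the running example $y = x_1 * x_2 + x_1$, the backward graph seeds the output gradient at $\\bar{y} = 1$ and accumulates adjoints node by node, which yields\n", "\n", "$$\\frac{\\partial y}{\\partial x_1} = x_2 + 1, \\qquad \\frac{\\partial y}{\\partial x_2} = x_1,$$\n", "\n", "so with $x_1 = 2$ and $x_2 = 3$ the gradient values are $4$ and $2$.\n", "\n", "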
By integrating both forward and backward graphs, and providing values for the input nodes, the Evaluator can compute the output value, the loss value, and the gradient values for each input node in a single execution of `Evaluator.run`.\n", "\n", "You are tasked with implementing the function ```gradients(output_node: Node, nodes: List[Node]) -> List[Node]``` in auto_diff.py. This function constructs the backward graph needed for gradient computation. It accepts an output node (typically the node representing the loss function in machine learning applications), whose gradient is preset to 1. It also takes a list of nodes for which gradients are to be computed and returns a list of gradient nodes corresponding to each node in the input list.\n", "\n", "Returning to our earlier example, once you have implemented the gradients function, you can use it to calculate the gradients of $y$ with respect to $x_1$ and $x_2$. This is done by running:\n", "\n", "```x1_grad, x2_grad = ad.gradients(output_node=y, nodes=[x1, x2])```\n", "\n", "to obtain the respective gradients. Following this, you can set up an Evaluator with nodes y, x1_grad, and x2_grad. This allows you to use the Evaluator.run method to compute both the output value and the gradients for the input nodes.\n", "\n", "Before you start working on the assignment, let's clarify how operations (ops) work. 
Within auto_diff.py, each op is equipped with three methods:\n", "\n", "```__call__(self, **kwargs) -> Node```:\n", "\n", "* accepts input nodes (and attributes), creates a new node utilizing this op, and returns the newly created node.\n", "\n", "```compute(self, node: Node, input_values: List[torch.Tensor]) -> torch.Tensor```\n", "\n", "* processes the specified node along with its input values and delivers the resultant node value.\n", "\n", "```gradient(self, node: Node, output_grad: Node) -> List[Node]```\n", "\n", "* receives a node and its gradient node, returning the partial adjoint nodes for each input node.\n", "\n", "In short, each `Op` handles a single node: `compute` evaluates the forward pass and calculates the actual output value, while `gradient` builds the backward path for that node. `Evaluator.run` and the `gradients` function orchestrate these across the entire graph. Accordingly, your `Evaluator.run` implementation should effectively utilize each op's `compute`, and your `gradients` implementation should call each op's `gradient`." ] }, { "cell_type": "markdown", "metadata": { "id": "iEVXLeBpeVhw" }, "source": [ "## **Question 1: Auto Diff Library (45 pt)**\n", "\n", "### Part 1: Operators (32 pt)\n", "In this problem, you will finish several operators in `auto_diff.py`. Your goal is to implement the `compute` (forward) and `gradient` (backward) methods for these operators, with a few examples, such as `AddOp`, `AddByConstOp`, `MulOp`, and `MulByConstOp`, already provided to you. 
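\n", "\n", "As a toy illustration of this compute/gradient pattern (standalone Python for intuition only, not the assignment's `Op` classes), multiplication looks like:\n", "\n", "```python\n", "import torch\n", "\n", "def mul_compute(a, b):\n", "    # forward pass: elementwise product of the input values\n", "    return a * b\n", "\n", "def mul_gradient(a, b, output_grad):\n", "    # backward pass: partial adjoints with respect to each input\n", "    return [output_grad * b, output_grad * a]\n", "\n", "a, b = torch.tensor(2.0), torch.tensor(3.0)\n", "y = mul_compute(a, b)  # tensor(6.)\n", "grad_a, grad_b = mul_gradient(a, b, torch.tensor(1.0))  # tensor(3.), tensor(2.)\n", "```\n", "\n", "In the real framework, `gradient` returns graph `Node`s rather than tensor values, so the backward pass stays symbolic until `Evaluator.run` is called.\n", "\n", "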
We have also implemented several other functions, such as `BroadcastOp`, which will be useful later in the assignment, but you can ignore them for now.\n", "\n", "The operators you will need to implement are:\n", "\n", "* `DivOp`\n", "* `DivByConstOp`\n", "* `TransposeOp`\n", "* `ReLUOp`\n", "* `SqrtOp`\n", "* `PowerOp`\n", "* `MeanOp`\n", "* `MatMulOp`\n", "* `SoftmaxOp`\n", "* `LayerNormOp`\n", "\n", "You will find that most of your time is spent on the `MatMulOp`, `SoftmaxOp`, and `LayerNormOp` operators. Accordingly, these three operators alone account for half of the points in this section.\n", "\n", "### Part 2: Evaluator (13 pt)\n", "You will also complete the entire `Evaluator` class. The details of the `Evaluator` class and what we want to achieve are given in the introduction section. We have provided a framework, including a topological sort method that you may choose to implement to help you more easily complete this class.\n", "\n", "There are several tests provided to ensure your operators are working.\n", "\n", "To run our sample testing library, you can use the commands:\n", "\n", "```pytest tests/test_auto_diff_node_forward.py```\n", "\n", "```pytest tests/test_auto_diff_node_backward.py```\n", "\n", "```pytest tests/test_auto_diff_graph_forward.py```\n", "\n", "```pytest tests/test_auto_diff_graph_backward.py```\n", "\n", "Feel free to also edit these test files to include your own test cases!\n", "\n", "**NOTE: These tests are not fully comprehensive, so we highly encourage you to write your own tests and ensure that your implemented operators are working.** You may find yourself going back and forth between this part and Question 2 to ensure your operators are fully correct.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pytest tests/test_auto_diff_node_forward.py\n", "!pytest tests/test_auto_diff_node_backward.py\n", "!pytest tests/test_auto_diff_graph_forward.py\n", "!pytest 
tests/test_auto_diff_graph_backward.py" ] }, { "cell_type": "markdown", "metadata": { "id": "7ucU5GfDrxTl" }, "source": [ "## **Question 2: Transformer Language Model & Training (45 pt)**\n", "\n", "In this question, you will implement core components of a **decoder-only Transformer language model** using our custom automatic differentiation framework. You will train it to **overfit** a small set of 10 sentences, then use it to **generate text** autoregressively.\n", "\n", "This exercise mirrors the core training loop of modern large language models (LLMs) — just at a much smaller scale so everything runs comfortably on a CPU.\n", "\n", "### Allowed Operations\n", "Your implementation should only use operations **THAT YOU IMPLEMENTED** from the **auto_diff** module, not pre-built torch methods. Any code that uses PyTorch's built-in autograd (e.g., `loss.backward()`) will receive zero credit.\n", "\n", "### Background\n", "\n", "A **decoder-only transformer** processes a sequence of tokens and predicts the next token at each position. The key difference from an encoder is **causal masking**: each position can only attend to itself and earlier positions, preventing information from \"leaking\" from the future.\n", "\n", "**Word-level tokenizer.** We provide a simple word-level tokenizer that splits sentences on whitespace and maps each unique word to an integer index. This is conceptually the same idea as real LLM tokenizers (BPE, SentencePiece, etc.), but much simpler — our \"vocabulary\" is just whole words. A special reserved token (index 0) exists for use during generation but never appears in the training data.\n", "\n", "**Training setup.** We provide 10 short sentences, each exactly 5 words long. The model is small enough (~39K parameters) to train on CPU in under a minute. 
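\n", "\n", "The word-level tokenizer described above boils down to something like this (an illustrative sketch, not the provided implementation; the name of the reserved index-0 token here is made up):\n", "\n", "```python\n", "def build_vocab(sentences):\n", "    vocab = {\"<reserved>\": 0}  # hypothetical name for the special index-0 token\n", "    for s in sentences:\n", "        for w in s.split():\n", "            if w not in vocab:\n", "                vocab[w] = len(vocab)\n", "    return vocab\n", "\n", "def encode(sentence, vocab):\n", "    return [vocab[w] for w in sentence.split()]\n", "\n", "vocab = build_vocab([\"the cat sat down today\"])\n", "print(encode(\"the cat sat down today\", vocab))  # [1, 2, 3, 4, 5]\n", "```\n", "\n", "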
Since we have very little data, the goal is to **overfit** — the model should memorize these 10 sentences and be able to reproduce them from a 2-word prompt.\n", "\n", "### Part 1: Causal Self-Attention (10 pt)\n", "\n", "Implement single-head **causal** self-attention in the `causal_self_attention()` function:\n", "\n", "$$Q = X W_Q, \\quad K = X W_K, \\quad V = X W_V$$\n", "\n", "$$\\text{scores} = \\frac{Q K^T}{\\sqrt{d_{\\text{model}}}} + \\text{mask}$$\n", "\n", "$$\\text{output} = \\text{Softmax}(\\text{scores}) \\cdot V \\cdot W_O$$\n", "\n", "The **causal mask** is a matrix with 0 for allowed positions and $-10^9$ for future positions, which drives those attention weights to ~0 after softmax. The mask is pre-tiled to `(batch, seq_len, seq_len)` and passed as an input variable.\n", "\n", "### Part 2: Decoder Layer (5 pt)\n", "\n", "Implement `decoder_layer()`, which combines self-attention with a feed-forward network, using **residual connections** and **layer normalization**:\n", "\n", "1. `attn_out = causal_self_attention(X, ...)`\n", "2. `h = LayerNorm(X + attn_out)` ← residual + norm\n", "3. `ff_out = ReLU(h @ W_ff1) @ W_ff2` ← feed-forward\n", "4. `output = LayerNorm(h + ff_out)` ← residual + norm\n", "\n", "### Part 3: LM Forward Pass + Loss (10 pt)\n", "\n", "**`transformer_lm()` (5 pt):** Build the full forward pass:\n", "\n", "1. Token embedding: `X_onehot @ W_embed` (one-hot word index × embedding matrix)\n", "2. Add position embeddings (pre-tiled to batch size, passed as input)\n", "3. One decoder layer\n", "4. 
Output projection: `h @ W_head` → logits of shape `(batch, seq_len, vocab_size)`\n", "\n", "**`cross_entropy_loss()` (5 pt):** Compute the average cross-entropy loss for next-token prediction:\n", "\n", "$$\\text{loss} = -\\frac{1}{N} \\sum \\text{targets} \\cdot \\log(\\text{softmax}(\\text{logits}))$$\n", "\n", "where $N$ is the total number of target tokens (`BATCH_SIZE × SEQ_LEN`).\n", "\n", "### Part 4: Training & Generation (20 pt)\n", "\n", "**`sgd_epoch()` (10 pt):** Data preparation and the forward/backward pass are provided. You only need to implement the **weight update loop**:\n", "- Each gradient has a leading batch dimension — sum over dim 0 before updating\n", "- `new_W = W - lr * grad.sum(dim=0)`\n", "\n", "**`generate()` (10 pt):** Graph setup and an inference helper `run_forward()` are provided. You only need to implement the **generation loop**:\n", "- Encode the prompt\n", "- Loop: call `run_forward(tokens)`, argmax the logits at the last valid position, append the new token\n", "- Return the decoded text\n", "\n", "**`train_model()`** is **fully provided** — it wires together the forward graph, backward graph, evaluator, weight initialization, and training loop. You do not need to modify it. Just make sure the functions it calls (`transformer_lm`, `cross_entropy_loss`, `sgd_epoch`, `generate`) are correct.\n", "\n", "**Recommended order:** Read `train_model()` first to understand the pipeline. Then implement bottom-up: attention → decoder layer → forward pass → loss → weight update → generation loop.\n", "\n", "### Submission, Testing, and Expected Results\n", "\n", "Write your implementation in `transformer.py`. In addition to code, save your trained model weights as `weights.pt` using `save_weights(weights)`.\n", "\n", "During evaluation, we will load your submitted `transformer.py` and `weights.pt`, then run `tests/test_transformer.py`. 
The public test checks that your weights can be loaded, that the tensor shapes are correct, and that generation from the first two words of each training sentence reproduces at least **8 out of 10** training sentences.\n", "\n", "You can run training from the notebook or from the command line. Both should save a `weights.pt` file.\n", "\n", "Command-line usage:\n", "- `python transformer.py` trains the model and saves `weights.pt`\n", "- `python transformer.py --playground` trains the model, saves `weights.pt`, and then launches the interactive playground\n", "\n", "Notebook usage:\n", "- The next code cell runs training and saves `weights.pt`\n", "- The following code cell runs `tests/test_transformer.py`\n", "\n", "Expected behavior:\n", "- Training loss should decrease over 500 epochs, reaching near 0\n", "- Generation from 2-word prompts should reproduce at least **8 out of 10** training sentences\n", "- `python -m pytest tests/test_transformer.py` should pass after `weights.pt` has been saved\n", "\n", "Scoring for Q2:\n", "- Part 1: 10 pt\n", "- Part 2: 5 pt\n", "- Part 3: 10 pt\n", "- Part 4: 20 pt\n", "- Q2 required total: 45 pt\n", "- Bonus section below: 10 pt\n", "\n", "Together with Q1 (45 pt), PA1 is graded out of **100 total points**.\n", "\n", "No GPU is needed. Training should take **under 2 minutes** on a modern CPU.\n", "\n", "### Key Differences from a Standard Transformer\n", "\n", "| Simplification | Why |\n", "|---------------|-----|\n", "| Single attention head | Keeps implementation simple |\n", "| No bias terms | Fewer parameters to manage |\n", "| Word-level tokenizer | Simple vocabulary, no subword splitting |\n", "| Full forward pass for generation | No KV cache to implement |\n", "| Pre-tiled position embeddings | Avoids complex broadcasting inside the graph |\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Q2 Notebook Workflow\n", "\n", "The cells below give a complete notebook workflow for Q2:\n", "\n", "1. 
Train the model and save `weights.pt`\n", "2. Run the public transformer test\n", "3. Complete the bonus writeup on decoding complexity without KV cache\n", "4. Load the saved weights and try a few generation prompts\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import transformer\n", "\n", "weights = transformer.train_model()\n", "transformer.save_weights(weights, \"weights.pt\")\n", "print(\"Saved weights to weights.pt\")\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pytest tests/test_transformer.py\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Bonus: Decoding Complexity and Repeated Computation (10 pt)\n", "\n", "This bonus asks you to analyze the mathematical complexity of the current autoregressive decoding implementation and think about how repeated computation could be reduced. In our current `generate()` implementation, the model runs a **full forward pass over the entire sequence** every time a new token is generated.\n", "\n", "For asymptotic analysis, treat sequence length as a variable, even though this assignment uses a small fixed sequence length in practice.\n", "\n", "For this bonus, write a short response in `bonus.txt` that answers the following in mathematical terms:\n", "\n", "1. What is the **time complexity per decoding step** of the current implementation without KV cache?\n", "2. What is the **total time complexity** of generating $T$ new tokens with the current implementation without KV cache?\n", "3. What is the **space complexity** of the current implementation without KV cache during generation?\n", "4. Is there a way to reduce the repeated computation during decoding? Briefly describe the idea and how the time and space complexity would change. 
*(Hint: think about caching computations that are reused across decoding steps.)*\n", "\n", "Requirements:\n", "- No code is required for this bonus.\n", "- Use Big-O notation and clearly define any symbols you use, such as $t$ for current context length, $T$ for the number of generated tokens, and $d$ for model dimension.\n", "- Keep your answer concise but complete.\n", "- We will grade this based on clarity and correctness of the mathematical reasoning.\n" ] }, { "cell_type": "markdown", "metadata": { "vscode": { "languageId": "plaintext" } }, "source": [ "## Submission\n", "\n", "Congratulations, you have finished PA1!\n", "\n", "Submit the entire `PA1` folder as a zipped file to Gradescope under the assignment **PA1**.\n", "\n", "For this assignment, make sure your submission includes at least:\n", "- `auto_diff.py`\n", "- `transformer.py`\n", "- `weights.pt`\n", "- `bonus.txt` if you attempt the 10-point bonus\n", "\n", "Q1 will be evaluated from your submitted `auto_diff.py`. Q2 will load your submitted `transformer.py` and `weights.pt`, then run `tests/test_transformer.py`. If `weights.pt` is missing, Q2 cannot be evaluated.\n", "\n", "**Deadline: Wednesday, April 22 at 11:59 PM.**\n", "\n", "Please note that you have a total of **5 free late days** across all assignments. You can allocate these late days to any assignments you choose, but submissions more than 5 days late will not receive credit without prior approval.\n", "\n", "If you have any feedback on the difficulty of this assignment, feel free to leave it in `feedback.txt`. We appreciate your input!\n" ] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 0 }