% ========================================================= % Chapter 3: Classification % ========================================================= \chapter{Classification} In the previous two chapters we studied \emph{linear regression}, whose goal was to predict a continuous numerical output such as a house price or a temperature. We now make a fundamental shift: instead of predicting a number along a continuous spectrum, we wish to assign each input to one of a \emph{finite set of categories}. This task is called \textbf{classification}. Classification problems arise everywhere in applied machine learning: \begin{itemize} \item \textbf{Spam detection}: is an incoming email spam or legitimate? \item \textbf{Fraud detection}: is a credit-card transaction fraudulent? \item \textbf{Medical diagnosis}: is a tumour malignant or benign? \item \textbf{Image recognition}: which digit (0--9) appears in this image? \item \textbf{Sentiment analysis}: is a product review positive or negative? \end{itemize} The simplest and most common form is \textbf{binary classification}, where the output $y$ takes exactly two values. By convention we encode these as: \[y\in\{0,\,1\},\] where $y=1$ denotes the \emph{positive class} (presence of the condition) and $y=0$ the \emph{negative class} (absence). This naming carries no moral connotation. %---------------------------------------------------------- \section{Classification with Logistic Regression} \label{sec:logreg} \subsection{Why Linear Regression Is Inadequate for Classification} A natural first instinct is to reuse the linear regression model $f_{\vec{w},b}(\vec{x})=\vec{w}\cdot\vec{x}+b$ for classification by thresholding: predict $\hat{y}=1$ if $f>0.5$, else $\hat{y}=0$. While this can work in simple situations, it suffers from two fundamental problems. \paragraph{Problem 1: The outlier effect.} Suppose we have a well-separated dataset of benign ($y=0$) and malignant ($y=1$) tumours plotted against tumour size. A linear model may yield a reasonable boundary at some threshold $x^*$. Now add a single extreme outlier — a very large tumour that is still malignant. The least-squares line is pulled toward the outlier, shifting the boundary $x^*$ and misclassifying previously correct points. Linear regression is globally sensitive to every training point; one aberrant observation can corrupt the boundary for the rest. \paragraph{Problem 2: Unbounded output range.} Linear regression produces outputs in $(-\infty,+\infty)$. For a yes/no problem, predicting values such as $1.8$ or $-0.4$ is awkward: they cannot be interpreted as probabilities. We need a model whose output is \emph{guaranteed} to lie in $[0,1]$. \subsection{The Sigmoid (Logistic) Function} The resolution is to \emph{squash} the linear output through a function that maps $\R$ onto $(0,1)$. The standard choice is the \textbf{sigmoid function}: \begin{definition}[Sigmoid Function] \[ g(z)=\frac{1}{1+e^{-z}},\qquad z\in\R. \] \end{definition} \paragraph{Key properties.} \begin{itemize} \item As $z\to+\infty$: $e^{-z}\to 0$, so $g(z)\to 1$. \item As $z\to-\infty$: $e^{-z}\to\infty$, so $g(z)\to 0$. \item At $z=0$: $g(0)=\tfrac{1}{2}$. \item $g$ is smooth, strictly increasing, and S-shaped (sigmoidal). \item Derivative: $g'(z)=g(z)\bigl(1-g(z)\bigr)$. 
\end{itemize}

\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[
width=10cm, height=6.5cm,
xlabel={$z$}, ylabel={$g(z)$},
xmin=-6.5, xmax=6.5, ymin=-0.05, ymax=1.1,
xtick={-6,-4,-2,0,2,4,6}, ytick={0,0.25,0.5,0.75,1},
grid=both, grid style={gray!15, thin},
tick label style={font=\small}, label style={font=\small},
axis lines=left,
every axis plot/.append style={line width=1.5pt}]
\addplot[pBlue, smooth, samples=200, domain=-6.5:6.5] {1/(1+exp(-x))};
\addplot[pGray, dashed, thin] coordinates{(-6.5,0.5)(6.5,0.5)};
\addplot[pGray, dashed, thin] coordinates{(0,-0.05)(0,0.5)};
\addplot[only marks, mark=*, color=pRed, mark size=3.5pt] coordinates{(0,0.5)};
\node[below right, font=\small] at (axis cs:0.1,0.49) {$(0,\;0.5)$};
\node[right, font=\small, pBlue] at (axis cs:3.5,0.92) {$g(z)\to 1$};
\node[left, font=\small, pBlue] at (axis cs:-3.5,0.08) {$g(z)\to 0$};
\end{axis}
\end{tikzpicture}
\caption{The sigmoid (logistic) function. It maps every real number to $(0,1)$ and equals exactly $0.5$ at $z=0$.}
\label{fig:sigmoid}
\end{figure}

\subsection{The Logistic Regression Model}
Logistic regression is built in two stages.

\textbf{Stage 1 --- Linear combination.} Compute a weighted sum:
\[z=\vec{w}\cdot\vec{x}+b.\]

\textbf{Stage 2 --- Sigmoid activation.} Pass $z$ through the sigmoid:
\[
f_{\vec{w},b}(\vec{x})=g(z)=g(\vec{w}\cdot\vec{x}+b)
=\frac{1}{1+e^{-(\vec{w}\cdot\vec{x}+b)}}.
\]
The output $f_{\vec{w},b}(\vec{x})$ is interpreted as a \emph{probability}:
\[
f_{\vec{w},b}(\vec{x})=P(y=1\mid\vec{x};\;\vec{w},b).
\]
This is the estimated probability that the label equals 1 given the input $\vec{x}$ and parameters $(\vec{w},b)$. The complementary probability is $P(y=0\mid\vec{x})=1-f_{\vec{w},b}(\vec{x})$.

\begin{example}[Tumour classification]
Let $w=2$, $b=-5$, $x=3$ cm (tumour size). Then
\[z=2\cdot 3-5=1,\qquad g(1)=\frac{1}{1+e^{-1}}\approx 0.731.\]
The model reports a $73.1\%$ probability that the tumour is malignant.
\end{example}

\subsection{From Probability to Class Prediction}
To obtain a hard class label we introduce a \textbf{decision threshold} $\tau$ (typically $\tau=0.5$):
\[
\hat{y}=\begin{cases}1&\text{if }f_{\vec{w},b}(\vec{x})\ge\tau,\\
0&\text{otherwise.}\end{cases}
\]
Because $g(z)\ge 0.5\iff z\ge 0$, the prediction rule is equivalent to:
\[\hat{y}=1\iff\vec{w}\cdot\vec{x}+b\ge 0.\]

\subsection{Decision Boundaries}
The \textbf{decision boundary} is the set of inputs $\vec{x}$ for which $\vec{w}\cdot\vec{x}+b=0$ (i.e.\ where $f=0.5$ and the model is exactly neutral between the two classes).

\paragraph{Linear boundaries.} With two features $x_1,x_2$ and parameters $w_1,w_2,b$, the boundary is the straight line $w_1 x_1+w_2 x_2+b=0$. \textit{Example:} $w_1=1$, $w_2=1$, $b=-3$ gives boundary $x_1+x_2=3$. Points with $x_1+x_2>3$ are classified as $\hat{y}=1$.

\paragraph{Non-linear boundaries.} By augmenting the feature vector with polynomial terms, the boundary in the \emph{original} feature space can be non-linear. With features $(x_1^2,x_2^2)$ and $w_1=w_2=1$, $b=-1$, the boundary becomes $x_1^2+x_2^2=1$, a unit circle.
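To make the two-stage computation concrete, here is a minimal NumPy sketch (the function and variable names are ours, chosen purely for illustration); it evaluates $f_{\vec{w},b}$ for a batch of examples, applies the $0.5$ threshold, and reproduces the tumour example above.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)), applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # Stage 1: linear combination z = Xw + b (one value per example)
    # Stage 2: squash through the sigmoid to obtain P(y = 1 | x)
    return sigmoid(X @ w + b)

def predict(X, w, b, threshold=0.5):
    # Hard class label: 1 wherever the probability reaches the threshold
    return (predict_proba(X, w, b) >= threshold).astype(int)

# Tumour example: w = 2, b = -5, x = 3 cm
X = np.array([[3.0]])
w = np.array([2.0])
b = -5.0
print(predict_proba(X, w, b))   # approximately [0.731]
print(predict(X, w, b))         # [1]
\end{verbatim}

Because the sigmoid acts element-wise, the same code handles a single example or an entire training set, and feeding it polynomial features such as $x_1^2,x_2^2$ yields the non-linear boundaries described above.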
\begin{figure}[h] \centering \begin{tikzpicture} \begin{axis}[ width=7.5cm, height=7.5cm, xlabel={$x_1$}, ylabel={$x_2$}, xmin=-2.5, xmax=2.5, ymin=-2.5, ymax=2.5, xtick={-2,-1,0,1,2}, ytick={-2,-1,0,1,2}, grid=both, grid style={gray!15, thin}, tick label style={font=\small}, label style={font=\small}, axis lines=center] \addplot[pBlue, thick, smooth, domain=0:360, samples=200] ({cos(x)},{sin(x)}); \addplot[only marks, mark=*, color=pBlue, mark size=2pt] coordinates{(0.3,0.3)(0.4,-0.3)(-0.3,0.4)(-0.2,-0.2)(0.0,0.5)}; \addplot[only marks, mark=square*, color=pRed, mark size=2pt] coordinates{(1.5,1.2)(1.3,-1.4)(-1.6,0.8)(-1.2,-1.5)(1.8,-0.5) (0.5,1.8)(-0.4,2.0)(2.0,0.4)}; \node[pBlue, font=\small] at (axis cs:0,0) {$\hat{y}=0$}; \node[pRed, font=\small] at (axis cs:1.9,1.9) {$\hat{y}=1$}; \end{axis} \end{tikzpicture} \caption{Circular decision boundary arising from polynomial features $(x_1^2,x_2^2)$ with $w_1=w_2=1$, $b=-1$.} \label{fig:circle-boundary} \end{figure} %---------------------------------------------------------- \section{Cost Function for Logistic Regression} \label{sec:logcost} \subsection{Why Squared Error Fails} Reusing the MSE cost from linear regression with the non-linear sigmoid output produces a \emph{non-convex} cost surface with multiple local minima and flat plateaus. Gradient descent may converge to a sub-optimal solution. We need a cost function that (a) preserves convexity and (b) strongly penalises confident wrong predictions. \subsection{Loss vs.\ Cost: A Key Distinction} \begin{definition}[Loss and Cost] The \textbf{loss} $L\bigl(f,y\bigr)$ measures the penalty on a \emph{single} training example: the cost of predicting $f$ when the true label is $y$. The \textbf{cost} $J$ is the \emph{average loss} over all $m$ training examples: \[J(\vec{w},b)=\frac{1}{m}\sum_{i=1}^m L\!\bigl(f_{\vec{w},b}(\vec{x}^{(i)}),y^{(i)}\bigr).\] \end{definition} \subsection{The Logistic (Cross-Entropy) Loss} We define the loss piecewise: \begin{align} \text{If }y=1:&\quad L=-\log\!\bigl(f\bigr),\label{eq:loss-y1}\\ \text{If }y=0:&\quad L=-\log\!\bigl(1-f\bigr).\label{eq:loss-y0} \end{align} \paragraph{Interpretation for $y=1$ (Eq.~\ref{eq:loss-y1}).} \begin{itemize} \item When $f\approx 1$ (correctly confident): $-\log(1)=0$ — no penalty. \item When $f\to 0$ (confidently wrong): $-\log(f)\to+\infty$ — catastrophic penalty. \end{itemize} \paragraph{Interpretation for $y=0$ (Eq.~\ref{eq:loss-y0}).} \begin{itemize} \item When $f\approx 0$ (correct): $-\log(1)=0$ — no penalty. \item When $f\to 1$ (confidently wrong): $-\log(0)\to+\infty$ — catastrophic penalty. \end{itemize} \begin{figure}[h] \centering \begin{tikzpicture} \begin{axis}[ width=11cm, height=6.5cm, xlabel={Prediction $f$}, ylabel={Loss $L$}, xmin=0.01, xmax=0.99, ymin=0, ymax=5, xtick={0,0.2,0.4,0.6,0.8,1}, ytick={0,1,2,3,4,5}, grid=both, grid style={gray!15, thin}, tick label style={font=\small}, label style={font=\small}, axis lines=left, legend style={at={(0.5,0.97)}, anchor=north, font=\small, fill=white}, every axis plot/.append style={line width=1.4pt}] \addplot[pBlue, smooth, samples=200, domain=0.005:0.995]{-ln(x)}; \addlegendentry{$L=-\log(f)$\quad ($y=1$)} \addplot[pRed, dashed, smooth, samples=200, domain=0.005:0.995]{-ln(1-x)}; \addlegendentry{$L=-\log(1-f)$\quad ($y=0$)} \end{axis} \end{tikzpicture} \caption{Logistic loss functions for $y=1$ (solid, blue) and $y=0$ (dashed, red). 
Both approach $+\infty$ when the prediction is maximally wrong.}
\label{fig:loss-curves}
\end{figure}

\subsection{The Unified Loss Formula}
The piecewise definition collapses elegantly into a single expression:
\begin{equation}
\boxed{L\bigl(f,y\bigr)
=-y\,\log(f)-(1-y)\,\log(1-f).}
\label{eq:unified-loss}
\end{equation}

\paragraph{Verification.}
\begin{itemize}
\item If $y=1$: $(1-y)=0$ annihilates the second term $\Rightarrow L=-\log(f)$.\checkmark
\item If $y=0$: $y=0$ annihilates the first term $\Rightarrow L=-\log(1-f)$.\checkmark
\end{itemize}

\subsection{The Full Logistic Regression Cost Function}
Averaging the unified loss over all $m$ training examples:
\begin{equation}
\boxed{J(\vec{w},b)=-\frac{1}{m}\sum_{i=1}^m\Bigl[
y^{(i)}\log\!\bigl(f_{\vec{w},b}(\vec{x}^{(i)})\bigr)
+(1-y^{(i)})\log\!\bigl(1-f_{\vec{w},b}(\vec{x}^{(i)})\bigr)\Bigr].}
\label{eq:logistic-cost}
\end{equation}
This is the \textbf{cross-entropy cost}, also known as the \textbf{log-loss}. It has two complementary justifications:
\begin{enumerate}
\item \textbf{Statistical (MLE).} Eq.~\eqref{eq:logistic-cost} is exactly the negative log-likelihood of the training data under a Bernoulli model with success probability $f^{(i)}$. Minimising cross-entropy is equivalent to Maximum Likelihood Estimation — a principled statistical objective.
\item \textbf{Convexity.} Both $g(z)$ and $1-g(z)$ are log-concave, so $-\log g(z)$ and $-\log\bigl(1-g(z)\bigr)$ are convex in $z$; since $z$ is an affine function of $(\vec{w},b)$, the cost $J$ is \emph{convex} in $(\vec{w},b)$. The cost surface therefore has no spurious local minima, and gradient descent with a suitable learning rate converges to the global optimum regardless of initialisation.
\end{enumerate}

%----------------------------------------------------------
\section{Gradient Descent for Logistic Regression}
\label{sec:loggd}

\subsection{The Training Objective}
We seek $(\vec{w}^*,b^*)=\argmin_{\vec{w},b}\;J(\vec{w},b)$. Once found, the trained model estimates $P(y=1\mid\vec{x};\;\vec{w}^*,b^*)$ for any new input $\vec{x}$.

\subsection{The Update Rules}
Gradient descent updates all parameters simultaneously:
\begin{align}
w_j&\leftarrow w_j-\alpha\,\pd{J}{w_j},\quad j=1,\ldots,n,\\[4pt]
b &\leftarrow b-\alpha\,\pd{J}{b}.
\end{align}
Differentiating Eq.~\eqref{eq:logistic-cost} (using the chain rule and the identity $g'(z)=g(z)(1-g(z))$):
\begin{align}
\pd{J}{w_j}&=\frac{1}{m}\sum_{i=1}^m
\bigl(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\bigr)\,x_j^{(i)},\label{eq:log-dJdw}\\[4pt]
\pd{J}{b} &=\frac{1}{m}\sum_{i=1}^m
\bigl(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\bigr).\label{eq:log-dJdb}
\end{align}

\begin{remark}
The gradient expressions~\eqref{eq:log-dJdw}--\eqref{eq:log-dJdb} are \emph{structurally identical} to those of linear regression. The critical difference lies in the definition of $f$: in linear regression $f=\vec{w}\cdot\vec{x}+b$; in logistic regression $f=g(\vec{w}\cdot\vec{x}+b)$. Same form, different meaning.
\end{remark}

\paragraph{Derivation sketch for $\partial J/\partial w_j$.}
Let $f^{(i)}=g(z^{(i)})$, $z^{(i)}=\vec{w}\cdot\vec{x}^{(i)}+b$. The single-example loss is $\ell^{(i)}=-y^{(i)}\log f^{(i)}-(1-y^{(i)})\log(1-f^{(i)})$. Using $\partial f^{(i)}/\partial w_j=f^{(i)}(1-f^{(i)})x_j^{(i)}$:
\[
\frac{\partial\ell^{(i)}}{\partial w_j}
=\Bigl(-\frac{y^{(i)}}{f^{(i)}}+\frac{1-y^{(i)}}{1-f^{(i)}}\Bigr)
f^{(i)}(1-f^{(i)})x_j^{(i)}
=\bigl(f^{(i)}-y^{(i)}\bigr)x_j^{(i)}.
\]
Averaging over all $m$ examples gives Eq.~\eqref{eq:log-dJdw}.\quad$\square$
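To see the pieces assembled, the following minimal NumPy sketch (our own illustrative names, assuming a feature matrix \texttt{X} of shape $m\times n$ and a label vector \texttt{y} of 0s and 1s) evaluates the cost of Eq.~\eqref{eq:logistic-cost} and the gradients \eqref{eq:log-dJdw}--\eqref{eq:log-dJdb} in vectorised form, then performs one simultaneous update. The small constant \texttt{eps} is a numerical safeguard against $\log(0)$ and is not part of the mathematical definition.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost_and_gradients(X, y, w, b):
    # X: (m, n) feature matrix, y: (m,) labels in {0, 1}
    m = len(y)
    f = sigmoid(X @ w + b)          # predictions f^(i), one per example
    eps = 1e-12                     # numerical guard against log(0)
    J = -np.mean(y * np.log(f + eps) + (1 - y) * np.log(1 - f + eps))
    err = f - y                     # common factor (f^(i) - y^(i))
    dJ_dw = (X.T @ err) / m         # partial derivatives w.r.t. all w_j at once
    dJ_db = np.mean(err)            # partial derivative w.r.t. b
    return J, dJ_dw, dJ_db

def gradient_descent_step(X, y, w, b, alpha):
    # Simultaneous update of all parameters
    _, dJ_dw, dJ_db = cost_and_gradients(X, y, w, b)
    return w - alpha * dJ_dw, b - alpha * dJ_db
\end{verbatim}

Iterating \texttt{gradient\_descent\_step} until the cost stops decreasing (see the convergence criterion below) yields the trained parameters $(\vec{w}^*,b^*)$.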
\subsection{Practical Considerations}

\paragraph{Choosing the learning rate $\alpha$.} Monitor the learning curve (cost vs.\ iteration). A well-chosen $\alpha$ produces a monotonically decreasing, eventually flat curve.

\paragraph{Feature scaling.} When features have vastly different magnitudes, the contours of the cost surface become elongated ellipses. Standardising features (zero mean, unit variance) makes the contours far more circular, allowing much faster convergence.

\paragraph{Vectorisation.} For large datasets, express the gradient update as matrix--vector operations (e.g.\ in NumPy) to exploit parallelism.

\paragraph{Convergence criterion.} Halt when the relative change in $J$ between successive iterations falls below a tolerance $\varepsilon$:
\[\frac{|J^{(t)}-J^{(t-1)}|}{|J^{(t-1)}|+\varepsilon_0}<\varepsilon.\]

%----------------------------------------------------------
\section{The Problem of Overfitting}
\label{sec:overfit}

\subsection{The Bias--Variance Trade-off}
Every supervised learning algorithm navigates a fundamental tension: a model that is too simple cannot capture the true structure (\textbf{underfitting}, or \emph{high bias}), while a model that is too complex memorises the training data and fails to generalise (\textbf{overfitting}, or \emph{high variance}).

\paragraph{Underfitting (High Bias).} An underfitting model has a strong preconception about the form of the relationship. Fitting a straight line to clearly curved data forces the model to ignore obvious patterns. Such a model performs poorly even on the training set.

\paragraph{Overfitting (High Variance).} An overfitting model is excessively complex — perhaps a degree-15 polynomial fitted to 20 data points. It interpolates the training data perfectly (near-zero training error) but its oscillatory curve is driven by noise rather than the true signal. Small changes in the training set produce drastically different models.

\paragraph{Good generalisation.} The ideal model captures the true underlying pattern while ignoring random noise. It achieves low error on both training and unseen test data.

\begin{figure}[h]
\centering
\begin{tikzpicture}
\begin{axis}[
name=ax1, width=4.8cm, height=4.6cm,
title={\small\textbf{Underfitting}}, title style={font=\small\bfseries},
xmin=0, xmax=4, ymin=0, ymax=4,
xtick=\empty, ytick=\empty, axis lines=left]
\addplot[only marks, mark=*, mark size=1.8pt, pBlue]
coordinates{(0.3,0.4)(0.7,0.9)(1.1,1.5)(1.5,2.0)(2.0,2.3)
(2.5,2.0)(3.0,1.5)(3.5,0.8)(3.9,0.3)};
\addplot[pRed, thick, domain=0:4]{1.2};
\end{axis}
\begin{axis}[
name=ax2, at={(ax1.east)}, xshift=1.4cm, width=4.8cm, height=4.6cm,
title={\small\textbf{Good generalisation}}, title style={font=\small\bfseries},
xmin=0, xmax=4, ymin=0, ymax=4,
xtick=\empty, ytick=\empty, axis lines=left]
\addplot[only marks, mark=*, mark size=1.8pt, pBlue]
coordinates{(0.3,0.4)(0.7,0.9)(1.1,1.5)(1.5,2.0)(2.0,2.3)
(2.5,2.0)(3.0,1.5)(3.5,0.8)(3.9,0.3)};
\addplot[pRed, thick, smooth, domain=0.1:3.9, samples=80]
{-0.57*(x-2)^2+2.3};
\end{axis}
\begin{axis}[
name=ax3, at={(ax2.east)}, xshift=1.4cm, width=4.8cm, height=4.6cm,
title={\small\textbf{Overfitting}}, title style={font=\small\bfseries},
xmin=0, xmax=4, ymin=-0.5, ymax=4.5,
xtick=\empty, ytick=\empty, axis lines=left]
\addplot[only marks, mark=*, mark size=1.8pt, pBlue]
coordinates{(0.3,0.4)(0.7,0.9)(1.1,1.5)(1.5,2.0)(2.0,2.3)
(2.5,2.0)(3.0,1.5)(3.5,0.8)(3.9,0.3)};
\addplot[pRed, thick, smooth, domain=0.2:3.95, samples=200]
{-0.57*(x-2)^2+2.3+0.62*sin(deg(5.3*x))};
\end{axis}
\end{tikzpicture}
\caption{Three models fitted to the same data. \textit{Left}: underfitting (too simple). \textit{Centre}: good generalisation.
\textit{Right}: overfitting (too complex, fitting the noise).}
\label{fig:fitting}
\end{figure}

\subsection{Remedies for Overfitting}

\subsubsection{Strategy 1: Collect More Training Data}
This is the most direct remedy. With more data, any fixed-complexity model is less able to memorise individual points and is forced to learn the true underlying pattern. Unfortunately, collecting more data is not always feasible — it may be expensive, time-consuming, or impossible for rare events.

\subsubsection{Strategy 2: Feature Selection}
If the number of features $n$ is large relative to the number of training examples $m$, overfitting is likely. Reducing the feature set — keeping only the most informative features — reduces the model's capacity to overfit. Feature selection can be done manually (guided by domain knowledge) or automatically (using statistical tests, embedded methods, or regularisation paths).

\subsubsection{Strategy 3: Regularisation}
\label{sec:regularisation}
Regularisation is generally the preferred remedy because it allows us to retain \emph{all} features while preventing any single parameter from dominating the model.

\paragraph{Core idea.} Large weight parameters are what allow a model to oscillate wildly. By \emph{penalising} large weights in the cost function, the optimiser is forced to keep them small, producing a smoother, more parsimonious model.

\paragraph{Regularised cost function ($L_2$ / Ridge).}
\begin{equation}
J_{\text{reg}}(\vec{w},b)
=J(\vec{w},b)+\frac{\lambda}{2m}\sum_{j=1}^n w_j^2,
\label{eq:reg-cost}
\end{equation}
where $\lambda\ge 0$ is the \textbf{regularisation parameter}. Note that $b$ is conventionally \emph{not} regularised.

\paragraph{The role of $\lambda$.}
\begin{itemize}
\item $\lambda=0$: no regularisation; the model may overfit.
\item $\lambda\to\infty$: all weights forced to zero; the model underfits (the prediction collapses to the constant $f=g(b)$).
\item $\lambda$ ``just right'': balance between fitting data and model simplicity, achieving good generalisation.
\end{itemize}

\subsection{Regularised Linear Regression}
\paragraph{Gradient descent updates.}
\begin{align}
w_j&\leftarrow w_j-\alpha\Bigl[\frac{1}{m}\sum_{i=1}^m
\bigl(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\bigr)x_j^{(i)}
+\frac{\lambda}{m}w_j\Bigr],\label{eq:reg-lin-w}\\
b &\leftarrow b-\frac{\alpha}{m}\sum_{i=1}^m
\bigl(f_{\vec{w},b}(\vec{x}^{(i)})-y^{(i)}\bigr).\label{eq:reg-lin-b}
\end{align}

\paragraph{Weight decay interpretation.} Rearranging \eqref{eq:reg-lin-w}:
\[
w_j\leftarrow\underbrace{\Bigl(1-\frac{\alpha\lambda}{m}\Bigr)}_{\text{weight decay factor}}w_j
-\frac{\alpha}{m}\sum_{i=1}^m\bigl(f^{(i)}-y^{(i)}\bigr)x_j^{(i)}.
\]
Since $\alpha\lambda/m$ is a small positive number (e.g.\ $0.0002$), the weight decay factor is slightly less than 1. At every step, each weight is \emph{gently shrunk} before the gradient update, creating a constant restoring force toward zero.

\subsection{Regularised Logistic Regression}
The same approach applies directly:
\begin{equation}
J(\vec{w},b)=-\frac{1}{m}\sum_{i=1}^m\!\Bigl[
y^{(i)}\log f^{(i)}+(1-y^{(i)})\log(1-f^{(i)})\Bigr]
+\frac{\lambda}{2m}\sum_{j=1}^n w_j^2,
\end{equation}
with gradient descent updates identical in form to \eqref{eq:reg-lin-w}--\eqref{eq:reg-lin-b} but with the sigmoid $f$.
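As a code-level illustration of the regularised update \eqref{eq:reg-lin-w}, the following short NumPy sketch (again with our own illustrative names) adds the $\lambda w_j/m$ penalty term to the weight gradient only, leaving $b$ unpenalised; setting \texttt{lam = 0} recovers the unregularised step of Section~\ref{sec:loggd}.

\begin{verbatim}
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularised_step(X, y, w, b, alpha, lam):
    # One gradient-descent step on the regularised cross-entropy cost
    m = len(y)
    err = sigmoid(X @ w + b) - y              # (f - y) for every example
    dJ_dw = (X.T @ err) / m + (lam / m) * w   # penalty term for w only
    dJ_db = np.mean(err)                      # b is not regularised
    # Equivalent "weight decay" form of the w update:
    #   w <- (1 - alpha*lam/m) * w - alpha * (X.T @ err) / m
    return w - alpha * dJ_dw, b - alpha * dJ_db
\end{verbatim}

For regularised \emph{linear} regression the same step applies with the sigmoid removed from the computation of \texttt{err}.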
\paragraph{Summary table.} \begin{center} \begin{tabular}{llll} \toprule \textbf{Situation} & \textbf{$J_{\text{train}}$} & \textbf{$J_{\text{test}}$} & \textbf{Remedy}\\ \midrule Underfitting (High Bias) & High & High & More features; reduce $\lambda$\\ Good generalisation & Low & Low & ---\\ Overfitting (High Variance) & Low & High & More data; reduce features; increase $\lambda$\\ \bottomrule \end{tabular} \end{center} %---------------------------------------------------------- \section*{Chapter Summary} \addcontentsline{toc}{section}{Chapter Summary} \begin{itemize} \item \textbf{Logistic regression} uses the sigmoid function to map a linear combination of features onto $(0,1)$, yielding a probability estimate $P(y=1\mid\vec{x})$. \item The decision boundary $\vec{w}\cdot\vec{x}+b=0$ is linear; polynomial features extend it to non-linear shapes (circles, ellipses, etc.). \item The \textbf{cross-entropy cost} is derived from MLE and is convex, guaranteeing a unique global minimum. Squared error would produce a non-convex cost with local minima. \item The gradient formulas for logistic regression look identical to those of linear regression, but $f$ is the sigmoid in the logistic case. \item \textbf{Overfitting} occurs when the model is too complex relative to the training data. The three main remedies are more data, feature selection, and $L_2$ \textbf{regularisation} (controlled by $\lambda$). \item Regularisation adds a weight-penalty term to the cost, shrinking weights at every gradient step (``weight decay''), smoothing the decision boundary. \end{itemize}