% =========================================================
% Chapter 7: Decision Trees
% =========================================================
\chapter{Decision Trees}

Decision trees are among the most widely used and interpretable machine learning algorithms. Unlike linear models, they naturally capture complex non-linear relationships without requiring feature scaling or distributional assumptions, and they serve as the building block for advanced ensemble methods such as Random Forests and XGBoost, which consistently rank among the top-performing algorithms on structured tabular data.

\begin{remark}
Decision trees are particularly effective for \textbf{tabular (structured) data}. For unstructured data (images, audio, raw text), neural networks are generally preferred.
\end{remark}

%----------------------------------------------------------
\section{Decision Tree Basics}

\subsection{Dataset Representation}

Consider classifying animals as cats or not cats, given three binary features and a label:
\begin{itemize}
  \item $x_1$: Ear Shape --- Pointy or Floppy
  \item $x_2$: Face Shape --- Round or Not Round
  \item $x_3$: Whiskers --- Present or Absent
  \item $y\in\{0,1\}$: the label, with $1=$ Cat and $0=$ Not Cat
\end{itemize}
Decision trees can handle both categorical and continuous features. For categorical features with more than two values, one-hot encoding (Section~\ref{sec:ohe}) converts each category into a binary indicator.

\subsection{Anatomy of a Decision Tree}

A decision tree is a rooted tree (a hierarchical, acyclic graph) encoding a sequence of if--then--else rules.
\begin{description}
  \item[Root Node] The topmost node. Receives the full training dataset.
  \item[Internal Nodes] Test a specific feature and route examples to child branches based on the outcome.
  \item[Leaf Nodes] Terminal nodes storing a class label (classification) or a numeric value (regression).
\end{description}
The \textbf{depth} of a node is the number of edges from the root; the root is at depth~$0$.

\subsection{Inference}

To classify a new example:
\begin{enumerate}
  \item Start at the root node.
  \item Evaluate the feature tested at the current node.
  \item Follow the branch matching the example's value of that feature.
  \item Repeat at each subsequent node.
  \item Return the prediction stored in the leaf that is reached.
\end{enumerate}
Inference costs $O(d)$ for a tree of depth $d$, so it is very efficient.

%----------------------------------------------------------
\section{Decision Tree Learning}

\subsection{The Recursive Splitting Algorithm}

Tree construction is top-down, greedy, and recursive:
\begin{enumerate}
  \item Place the entire training set at the root.
  \item Select the feature that maximises Information Gain (Section~\ref{sec:infogain}).
  \item Split the examples into subsets according to their value of that feature.
  \item Recursively repeat the procedure on each child subset.
  \item Apply stopping criteria (Section~\ref{sec:stopping}) to decide when a node becomes a leaf.
\end{enumerate}

\subsection{Entropy: Measuring Impurity}
\label{sec:entropy}

At any node, the training subset may mix multiple classes. The degree of mixing is called \textbf{impurity}, and \textbf{entropy} is the standard measure:
\begin{equation}
  H(p_1) = -p_1\log_2 p_1 - (1-p_1)\log_2(1-p_1),
  \label{eq:entropy}
\end{equation}
where $p_1$ is the fraction of positive-class examples at the node. Convention: $0\log_2 0\equiv 0$.

Properties:
\begin{itemize}
  \item $H(p_1)=0$ when $p_1\in\{0,1\}$ (pure node).
  \item $H(p_1)=1$ when $p_1=0.5$ (maximum uncertainty).
  \item $H$ is concave and symmetric about $p_1=0.5$, where it achieves its maximum.
\end{itemize}
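For readers who want to check these values numerically, the short Python sketch below evaluates Equation~(\ref{eq:entropy}) directly; the function name and the use of NumPy are illustrative choices, not anything prescribed in this chapter.

\begin{verbatim}
import numpy as np

def binary_entropy(p1):
    """Entropy H(p1) in bits, using the convention 0 * log2(0) = 0."""
    if p1 == 0.0 or p1 == 1.0:
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

# e.g. a node with 5 cats out of 6 examples:
# binary_entropy(5/6)  ->  0.650 bits
\end{verbatim}

Evaluating it at the class fractions in the table below reproduces the listed entropies.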
\begin{figure}[h]
  \centering
  \begin{tikzpicture}
    \begin{axis}[
        width=0.62\textwidth, height=5.5cm,
        xlabel={Positive-class fraction $p_1$},
        ylabel={Entropy $H(p_1)$ (bits)},
        xmin=0, xmax=1, ymin=0, ymax=1.12,
        xtick={0,0.25,0.5,0.75,1}, ytick={0,0.25,0.5,0.75,1},
        grid=both, grid style={gray!15, thin},
        tick label style={font=\small}, label style={font=\small},
        axis lines=left,
        every axis plot/.append style={line width=1.5pt}]
      \addplot[pBlue, domain=0.001:0.999, samples=200] {-x*log2(x)-(1-x)*log2(1-x)};
      \addplot[only marks, mark=*, color=pRed, mark size=3.5pt] coordinates{(0.5,1.0)};
      \node[above right, font=\small, pRed] at (axis cs:0.52,1.0) {max at $p_1=0.5$};
    \end{axis}
  \end{tikzpicture}
  \caption{Binary entropy $H(p_1)$. It peaks at 1 bit (maximum uncertainty) when the two classes are equally likely, and vanishes at pure nodes.}
  \label{fig:entropy-curve}
\end{figure}

\begin{center}
  \begin{tabular}{lccr}
    \toprule
    \textbf{Composition} & $p_1$ & $H(p_1)$ & \textbf{Impurity}\\
    \midrule
    6 cats, 0 dogs & 1.000 & 0.000 & None (pure)\\
    5 cats, 1 dog  & 0.833 & 0.650 & Low\\
    4 cats, 2 dogs & 0.667 & 0.918 & Moderate\\
    3 cats, 3 dogs & 0.500 & 1.000 & Maximum\\
    0 cats, 6 dogs & 0.000 & 0.000 & None (pure)\\
    \bottomrule
  \end{tabular}
\end{center}

\subsection{Information Gain}
\label{sec:infogain}

Let $n$ be the number of examples at the current node, and $n_L$ and $n_R$ the numbers reaching the left and right children. The \textbf{Information Gain} of splitting on feature $f$ is:
\begin{equation}
  \IG(f) = H(p_1) - \Bigl[w^L H(p_1^L) + w^R H(p_1^R)\Bigr],
  \qquad w^L = \frac{n_L}{n},\; w^R = \frac{n_R}{n},
  \label{eq:ig}
\end{equation}
where $p_1^L$ and $p_1^R$ are the positive-class fractions in the two children. IG is always non-negative (by concavity of entropy, via Jensen's inequality) and equals zero when the split provides no useful information.

\begin{example}
Root: 10 examples, 5 cats, 5 dogs $\Rightarrow H(p_1)=1$.

\textbf{Ear Shape:} Left (5 ex., 4 cats): $H=0.72$. Right (5 ex., 1 cat): $H=0.72$.
$\IG = 1 - [0.5(0.72) + 0.5(0.72)] = \mathbf{0.28}$.

\textbf{Face Shape:} $\IG\approx\mathbf{0.03}$. \quad \textbf{Whiskers:} $\IG\approx\mathbf{0.12}$.

\textbf{Decision}: select \textit{Ear Shape} (largest $\IG$).
\end{example}

\subsection{Stopping Criteria}
\label{sec:stopping}

\begin{enumerate}
  \item \textbf{Perfect purity}: stop when $H=0$.
  \item \textbf{Maximum depth}: stop at depth $d_{\max}$.
  \item \textbf{Minimum IG}: stop if the best $\IG<\epsilon_{\IG}$.
  \item \textbf{Minimum node size}: stop if fewer than $n_{\min}$ examples reach the node.
\end{enumerate}
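To connect the recursive splitting algorithm, the information gain of Equation~(\ref{eq:ig}), and the stopping criteria above, the Python sketch below builds a small tree greedily on binary (0/1) features. It is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the dictionary-based node representation, and the default thresholds (corresponding to $d_{\max}=3$, $\epsilon_{\IG}=10^{-3}$, $n_{\min}=2$) are all illustrative choices.

\begin{verbatim}
import numpy as np

def entropy(y):
    """Binary entropy in bits of a 0/1 label vector y."""
    if len(y) == 0:
        return 0.0
    p1 = y.mean()
    if p1 == 0.0 or p1 == 1.0:            # convention: 0 * log2(0) = 0
        return 0.0
    return -p1 * np.log2(p1) - (1 - p1) * np.log2(1 - p1)

def information_gain(X, y, f):
    """Information gain of splitting on binary feature column f."""
    left, right = X[:, f] == 1, X[:, f] == 0
    wL, wR = left.mean(), right.mean()    # w^L = n_L/n, w^R = n_R/n
    return entropy(y) - (wL * entropy(y[left]) + wR * entropy(y[right]))

def build_tree(X, y, depth=0, max_depth=3, min_gain=1e-3, min_size=2):
    """Greedy top-down splitting with the stopping criteria above."""
    # Criteria 1, 2 and 4: perfect purity, maximum depth, minimum node size.
    if entropy(y) == 0.0 or depth >= max_depth or len(y) < min_size:
        return {"leaf": True, "label": int(y.mean() >= 0.5)}
    gains = [information_gain(X, y, f) for f in range(X.shape[1])]
    best = int(np.argmax(gains))
    if gains[best] < min_gain:            # criterion 3: minimum information gain
        return {"leaf": True, "label": int(y.mean() >= 0.5)}
    left, right = X[:, best] == 1, X[:, best] == 0
    return {"leaf": False, "feature": best,
            "left": build_tree(X[left], y[left], depth + 1,
                               max_depth, min_gain, min_size),
            "right": build_tree(X[right], y[right], depth + 1,
                                max_depth, min_gain, min_size)}

def predict(tree, x):
    """Inference: walk from the root to a leaf and return its label."""
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["feature"]] == 1 else tree["right"]
    return tree["label"]

# Usage sketch: X is an (n, 3) array of 0/1 features
# (Ear Shape, Face Shape, Whiskers) and y a length-n 0/1 label vector.
# tree = build_tree(X, y)
# predict(tree, np.array([1, 0, 1]))
\end{verbatim}

Applied to the ten-example dataset of the example above, this procedure would choose Ear Shape at the root, since it has the largest information gain (0.28), matching the hand computation.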