AI and Machine Learning
How machines learn patterns from data. From linear regression to neural networks, perceptrons to deep learning architectures like CNN, RNN, and LSTM.
- What is Machine Learning?
- Setting Up the Optimization Problem
- Modeling as Optimization
- Convexifying the Loss Function
- Linear Regression
- Data Fitting (OLS)
- Neural Networks
- Perceptron
- Activation Functions
- Perceptron Learning (Backpropagation)
- Multi-Output Perceptron
- Perceptron Limitations
- Multi-Layer Network
- Backpropagation for Multi-Layer Networks
- Adaptive Learning (QuickProp)
- Semi-Static Neural Network (T-NN)
- Deep Learning
- Deep Neural Networks
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
1. What is Machine Learning?
Machine Learning (ML) is a method of solving problems where the solution is not pre-programmed by a human. Instead, the model accumulates knowledge from observation data and uses that knowledge to make decisions or predictions.
Many real-world problems are too complex to solve with hand-written rules. Consider recognizing faces, translating languages, or predicting stock prices. The relationships involved are so intricate that no human could write explicit rules for them. ML exists because it lets data reveal the rules automatically.
Data Representation
All ML starts with data. The data is represented as a set of observations, where each observation is a pair of an input and an output:
\[\Omega = \left\{ \left(\mathbf{x}^{(i)}, y^{(i)}\right) \right\}_{i=1}^{N}\]
Here:
- \(\mathbf{x}^{(i)} \in D \subset \mathbb{R}^n\) is the feature vector (also called independent variables). It is a list of \(n\) numbers that describe the input. For example, if predicting house price, the features might be square footage, number of bedrooms, and age of the house.
- \(y^{(i)} \in \mathbb{R}\) is the label (also called the dependent variable). It is the output we want to predict. In the house example, this is the price.
- \(N\) is the total number of observations.
The Goal of ML
Find a function \(f: D \to \mathbb{R}\) that models the observed data, meaning:
\[f\!\left(\mathbf{x}^{(i)}\right) \approx y^{(i)} \quad \text{for all } i = 1, \ldots, N\]
But there is a critical additional requirement: the function must also generalize. That means it must correctly predict outputs for inputs \(\mathbf{x}\) that do not appear in any observation of \(\Omega\).
A model that perfectly memorizes the training data but fails on new data is useless. Generalization is the entire point of machine learning. Everything that follows in this chapter — loss functions, regularization, train/test splits — exists to achieve good generalization.
2. Setting Up the Optimization Problem
The goal "find some function \(f\) that fits the data" is too vague. There are infinitely many possible functions. We need to narrow the search space to make the problem solvable.
Parameterized Function Families
Instead of searching over all possible functions, we choose a family of functions that are parameterized by a vector of weights \(\mathbf{w} \in \mathbb{R}^m\). We write the function as \(f_{\mathbf{w}}\) to show that it depends on \(\mathbf{w}\).
The search for the best function becomes a search for the best weights \(\mathbf{w}\). This converts an impossible problem (searching over all functions) into a concrete numerical problem (searching over a vector of numbers).
Error Criterion
We want the model's predictions to be close to the actual values. Specifically, we require:
\[\left| f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right| \leq \varepsilon \quad \text{for all } i\]
where \(\varepsilon\) is the maximum allowed error per data point. Since we have \(N\) data points, we measure the overall model error as the average error across all observations:
\[E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left| f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|\]
The model error is the average absolute difference between the model's predictions and the actual observed values, taken over the entire training dataset.
3. Modeling as Optimization
By expressing ML as an optimization problem, we can use all the mathematical tools from Chapter 1 (gradient descent, convex optimization, etc.) to find the best model.
The Loss Function
The loss function (also called cost function or objective function) measures how far the model's predictions are from the actual data. It is a function of the weights \(\mathbf{w}\):
\[L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left| f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|\]
The Optimization Goal
We want the weights that make the loss as small as possible:
\[\mathbf{w}^{*} = \arg\min_{\mathbf{w}} L(\mathbf{w})\]
This is now a standard optimization problem. The notation \(\arg\min\) means "the value of \(\mathbf{w}\) that minimizes \(L\)." We have converted machine learning into the optimization framework from Chapter 1. The loss function is the objective function, and the weights are the variables we optimize over.
4. Convexifying the Loss Function
The absolute value \(|f_{\mathbf{w}}(\mathbf{x}) - y|\) is not differentiable at zero, which makes optimization harder. By switching to a squared error, we get a smooth, convex function that has a unique minimum and can be solved exactly.
Matrix Notation
To work with all data points at once, we stack them into matrices and vectors:
- Stack all input vectors \(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\) as rows of a matrix \(A \in \mathbb{R}^{N \times n}\).
- Stack all labels \(y^{(1)}, \ldots, y^{(N)}\) into a vector \(B \in \mathbb{R}^N\).
If the model is linear — \(f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x} \cdot \mathbf{w}\) — then the predictions for all data points are \(A\mathbf{w}\), and the squared loss function becomes:
\[L(\mathbf{w}) = \|A\mathbf{w} - B\|^2\]
Theorem: Closed-Form Solution
If \(\text{Rank}(A) = n\) (meaning the columns of \(A\) are linearly independent), then the loss function \(L(\mathbf{w}) = \|A\mathbf{w} - B\|^2\) has a unique minimum at:
\[\mathbf{w}^{*} = (A^T A)^{-1} A^T B\]
Proof
Expand \(L(\mathbf{w})\) by writing out the squared norm:
\[L(\mathbf{w}) = (A\mathbf{w} - B)^T (A\mathbf{w} - B) = \mathbf{w}^T A^T A \mathbf{w} - 2 B^T A \mathbf{w} + B^T B\]
Take the derivative with respect to \(\mathbf{w}\):
\[\nabla L(\mathbf{w}) = 2 A^T A \mathbf{w} - 2 A^T B\]
The second derivative (Hessian matrix) is:
\[\nabla^2 L(\mathbf{w}) = 2 A^T A\]
The matrix \(A^T A\) is always positive semidefinite (for any vector \(\mathbf{v}\), \(\mathbf{v}^T A^T A \mathbf{v} = \|A\mathbf{v}\|^2 \geq 0\)). When \(\text{Rank}(A) = n\), it is positive definite. This confirms that \(L\) is convex, so any critical point is a global minimum.
Set the gradient to zero and solve:
\[2 A^T A \mathbf{w} - 2 A^T B = 0 \quad \Longrightarrow \quad \mathbf{w} = (A^T A)^{-1} A^T B\]
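The closed-form solution can be checked numerically. A minimal NumPy sketch (the synthetic noiseless data is assumed purely for illustration):

```python
import numpy as np

# Synthetic data generated by a known linear model, so the
# closed-form solution should recover the true weights exactly.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))        # N = 50 observations, n = 2 features
true_w = np.array([2.0, 3.0])
B = A @ true_w

# w = (A^T A)^{-1} A^T B; solving the normal equations directly
# avoids forming an explicit inverse, which is numerically safer.
w = np.linalg.solve(A.T @ A, A.T @ B)
print(w)
```

With 50 random rows and 2 columns, \(\text{Rank}(A) = 2\) holds almost surely, so the minimum is unique and `w` matches `true_w`.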
5. Linear Regression
Linear regression is the simplest ML model. It assumes the output is a linear combination of the input features:
\[f_{\mathbf{w}}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{x} \cdot \mathbf{w}\]
Linear regression is the starting point of ML because it has a closed-form solution (no iterative optimization needed), it is fast to compute, easy to understand, and works well when the relationship between inputs and output is approximately linear.
Adding a Bias Term
The model above forces the prediction to be zero when all inputs are zero. Usually we want a bias (also called intercept) \(w_0\):
\[f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n\]
To keep the matrix notation clean, we extend each input vector by prepending a 1: \(\mathbf{x} \to [1, x_1, x_2, \ldots, x_n]\), and extend the weight vector to include \(w_0\): \(\mathbf{w} = [w_0, w_1, \ldots, w_n]^T\). Now the model is still \(f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x} \cdot \mathbf{w}\), and the solution remains:
\[\mathbf{w} = (A^T A)^{-1} A^T B\]
where \(A\) now has a column of ones as its first column.
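A short sketch of the bias trick, with synthetic data \(y = 5 + 2x\) assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
y = 5.0 + 2.0 * x                        # intercept 5, slope 2

# Prepend a column of ones so the first weight acts as the bias w0.
A = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(A.T @ A, A.T @ y)    # w = [w0, w1]
print(w)
```

The recovered weight vector is `[5.0, 2.0]`: the ones column lets the same closed-form formula learn the intercept.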
6. Data Fitting (Ordinary Least Squares)
If the true relationship between inputs and output is non-linear, then linear regression will produce large errors no matter what weights you choose. You need a way to handle non-linear relationships while still using the linear regression machinery.
Basis Function Transformation
Transform the original features using basis functions \(f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_p(\mathbf{x})\). These are functions you choose that create new features from the originals. After the transformation, apply linear regression to the new features.
Define the transformation:
\[T(\mathbf{x}) = \left( f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_p(\mathbf{x}) \right)\]
Then apply linear regression to the transformed data. The model becomes:
\[f_{\mathbf{w}}(\mathbf{x}) = w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}) + \cdots + w_p f_p(\mathbf{x})\]
This is still linear in the weights \(\mathbf{w}\), so the closed-form solution \(\mathbf{w} = (A^T A)^{-1} A^T B\) still applies — but now the matrix \(A\) has entries \(A_{ij} = f_j(\mathbf{x}^{(i)})\).
The Workflow
Step 1: Choose basis functions \(f_1, \ldots, f_p\) based on your understanding of the data.
Step 2: Transform all data points: replace each \(\mathbf{x}^{(i)}\) with \(T(\mathbf{x}^{(i)})\).
Step 3: Run linear regression on the transformed data to find the optimal weights.
Step 4: Compute the error. If it is too large, go back to Step 1 and choose different basis functions.
Choosing the right basis functions is the hardest part. It requires domain knowledge — understanding of the problem area. For example, if you suspect the data follows a quadratic trend, you might use \(f_1(x) = x\) and \(f_2(x) = x^2\). If you suspect periodic behavior, you might use \(f_1(x) = \sin(x)\) and \(f_2(x) = \cos(x)\). There is no automatic way to choose — this is where human expertise enters.
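As a sketch of the workflow: quadratic data fitted with the basis functions \(f_1(x) = x\) and \(f_2(x) = x^2\) plus a bias column (the data-generating coefficients are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=60)
y = 1.0 - 0.5 * x + 3.0 * x**2          # a quadratic trend

# Transformed design matrix with entries A_ij = f_j(x_i)
A = np.column_stack([np.ones_like(x), x, x**2])
# Ordinary linear regression on the transformed features
w = np.linalg.solve(A.T @ A, A.T @ y)
print(w)
```

The fit is still linear in \(\mathbf{w}\) even though the model is quadratic in \(x\), which is why the closed-form machinery carries over unchanged.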
7. Neural Networks
The basis function approach from Section 6 requires a human expert to choose the right transformations. Neural networks exist because they can learn the basis functions automatically from data, removing the need for domain expertise in feature engineering.
A neural network is a computational graph made of nodes (called neurons) connected by weighted edges. Each neuron receives inputs, multiplies them by weights, sums the results, and passes the sum through a function to produce an output.
The Simplest Neural Network (2-Layer)
A 2-layer neural network has input nodes and one output node. Each input \(x_i\) is connected to the output by a weight \(w_i\). The output is:
\[f(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i\]
This is exactly linear regression. The simplest neural network is linear regression. To handle non-linear relationships, we need additional components: more layers and non-linear activation functions (covered in the following sections).
8. Perceptron
A perceptron is the simplest neural network that makes binary decisions. It has one output neuron that classifies input into one of two categories: 0 or 1 (false or true, no or yes).
Many problems require a yes/no answer rather than a continuous number: Is this email spam? Does this image contain a cat? Is this transaction fraudulent? The perceptron is the simplest model that can make such binary decisions.
How it Works (Forward Pass)
The perceptron computes a weighted sum of its inputs and applies a step function at a threshold of zero:
\[z = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}\]
The weights \(\mathbf{w}\) are learned from training data. During training, the network sees examples with known labels and adjusts the weights to make correct predictions.
9. Activation Functions
The perceptron uses a step function (output is either 0 or 1, with a sharp jump). This step function is not differentiable at the threshold point. Since gradient-based learning requires computing derivatives, we cannot train the perceptron using gradients if we keep the step function. We need a smooth replacement.
Sigmoid Functions
A sigmoid function \(s: \mathbb{R} \to \mathbb{R}\) is any function that is continuous, bounded (output stays within a fixed range), and smooth (differentiable everywhere). It produces an S-shaped curve that transitions gradually from one value to another.
The Logistic Function
The most commonly used sigmoid is the logistic function:
\[g(u) = \frac{1}{1 + e^{-u}}\]
Properties of the logistic function:
- It maps any real number to the interval \((0, 1)\).
- When \(u\) is very negative, \(g(u) \approx 0\). When \(u\) is very positive, \(g(u) \approx 1\).
- At \(u = 0\), \(g(0) = 0.5\).
- It transitions smoothly, so it has derivatives everywhere.
Key Property: Self-Referential Derivative
The derivative of the logistic function can be expressed using the function's own output:
\[g'(u) = g(u)\,\bigl(1 - g(u)\bigr)\]
This is extremely useful for computation. During the forward pass we already compute \(g(u)\). To compute the derivative (needed for learning), we just multiply \(g(u) \times (1 - g(u))\) — no additional expensive computation required.
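The identity is easy to verify numerically against a central-difference approximation of the derivative:

```python
import numpy as np

def g(u):
    """Logistic function g(u) = 1 / (1 + e^{-u})."""
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-5, 5, 101)
z = g(u)

# Central-difference estimate of g'(u) vs. the closed form z * (1 - z)
h = 1e-6
numeric = (g(u + h) - g(u - h)) / (2 * h)
print(np.max(np.abs(numeric - z * (1 - z))))  # tiny
```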
10. Perceptron Learning (Backpropagation for a Single Neuron)
This section shows the fundamental mechanism that all neural network training is built on: computing how much each weight contributed to the error, then adjusting it proportionally. Understanding this for a single neuron makes it possible to understand deeper networks.
Forward Pass
With the sigmoid activation, the perceptron computes:
\[z = g\!\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)\]
Loss Function
We use the mean squared error over all training data:
\[L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \left( z^{(i)} - y^{(i)} \right)^2\]
The factor of \(\frac{1}{2}\) is included to make the derivative cleaner.
Gradient Computation via Chain Rule
To find how the loss changes when we adjust weight \(w_i\), we break the computation into a chain of simpler steps. Define \(u = w_0 + \sum w_i x_i\) and \(z = g(u)\). Then:
\[\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial w_i}\]
Computing each piece:
\(\dfrac{\partial L}{\partial z} = z - y\) — how much the loss changes when the output changes.
\(\dfrac{\partial z}{\partial u} = z(1 - z)\) — the derivative of the sigmoid (using the self-referential property).
\(\dfrac{\partial u}{\partial w_i} = x_i\) for regular weights, or \(1\) for the bias \(w_0\).
Weight Update Rule
Using the gradient, we update each weight by moving in the opposite direction of the gradient, scaled by a learning rate \(\alpha\):
\[w_i \leftarrow w_i - \alpha\, \Delta_i\]
where \(\Delta_i = (z - y) \cdot z(1 - z) \cdot x_i\) for each weight (with \(x_i = 1\) for the bias \(w_0\)).
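Putting the forward pass, the gradient, and the update rule together, a single sigmoid neuron can be trained on the AND function (the learning rate and iteration count are assumptions chosen for this toy example):

```python
import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

# AND truth table; the leading 1 in each row is the bias input.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(3)
alpha = 1.0
for _ in range(20000):
    z = g(X @ w)                     # forward pass for all 4 examples
    delta = (z - y) * z * (1 - z)    # error signal (z - y) * z(1 - z)
    w -= alpha * X.T @ delta         # gradient step for every weight

pred = (g(X @ w) > 0.5).astype(int)
print(pred)
```

AND is linearly separable, so the single neuron converges to a correct classifier; XOR would not, as Section 12 explains.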
11. Multi-Output Perceptron
A single perceptron outputs one number (0 or 1), which can classify into 2 categories. For problems with \(k\) categories (e.g., classifying handwritten digits 0–9), we need multiple output neurons.
For \(k\) classes, use \(m\) output neurons (where \(m\) is enough to encode \(k\) classes as binary strings). The labels are encoded as binary vectors. The weight matrix becomes \(\mathbf{W} \in \mathbb{R}^{(n+1) \times m}\), where each column corresponds to one output neuron.
Error Gradient
The gradient for each weight \(w_{ij}\) (connecting input \(i\) to output \(j\)) follows the same pattern as the single-neuron case:
\[\Delta_{0j} = (z_j - y_j)\, z_j (1 - z_j), \qquad \Delta_{ij} = (z_j - y_j)\, z_j (1 - z_j)\, x_i\]
The first equation is for the bias weight (input is always 1). The second is for a regular weight connecting input \(x_i\) to output neuron \(j\).
12. Perceptron Limitations
A single-layer perceptron can only solve linearly separable problems. A problem is linearly separable if you can draw a straight line (or hyperplane in higher dimensions) that separates the two classes.
Linearly Separable: AND and OR
The AND function (output 1 only when both inputs are 1) and the OR function (output 1 when at least one input is 1) are linearly separable. A single line can separate the 1-outputs from the 0-outputs in each case.
Not Linearly Separable: XOR
The XOR function (output 1 when inputs differ) is not linearly separable. No single straight line can separate the points (0,0), (1,1) (which output 0) from (0,1), (1,0) (which output 1). A single perceptron simply cannot learn XOR.
Two Solutions
Solution 1: Add one or more hidden layers of neurons between the input and output. This creates a multi-layer network that can represent non-linear decision boundaries. This is covered in Section 13.
Solution 2: Use domain knowledge to create new features that make the problem linearly separable. For example, adding the feature \(x_1 \cdot x_2\) to XOR makes it solvable by a single perceptron. This approach is formalized as the semi-static neural network in Section 16.
13. Multi-Layer Network (3-Layer Feed-Forward)
A 3-layer feed-forward neural network consists of:
- Input layer: \(n\) nodes (one per input feature), plus a bias node.
- Hidden layer: \(J\) neurons that compute intermediate representations.
- Output layer: \(m\) neurons that produce the final predictions.
Hidden layers allow the network to learn non-linear functions. Each hidden neuron applies a non-linear activation to a linear combination of inputs. Combining multiple such neurons creates a rich space of possible functions that can approximate complex relationships.
Parameters
- Matrix \(A \in \mathbb{R}^{(n+1) \times J}\) — weights from input to hidden layer.
- Matrix \(B \in \mathbb{R}^{(J+1) \times m}\) — weights from hidden to output layer.
The "+1" accounts for the bias node at each layer.
Forward Pass
Compute hidden layer activations by applying sigmoid element-wise (with \(\mathbf{x}\) extended by a leading 1 for the bias):
\[\mathbf{t} = G(A^T \mathbf{x})\]
where \(G\) applies the sigmoid function \(g\) to each element independently.
Compute output layer activations:
\[\mathbf{z} = G(B^T \mathbf{t})\]
where \(\mathbf{t}\) is extended with a bias term (prepend 1) before multiplication.
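The forward pass in code; the layer sizes and random weights here are assumptions for the sketch:

```python
import numpy as np

def G(u):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))

n, J, m = 2, 3, 1                   # input, hidden, output sizes
rng = np.random.default_rng(3)
A = rng.normal(size=(n + 1, J))     # input -> hidden (bias row included)
B = rng.normal(size=(J + 1, m))     # hidden -> output (bias row included)

x = np.array([0.5, -1.0])
x_ext = np.concatenate([[1.0], x])  # prepend the bias input
t = G(x_ext @ A)                    # hidden activations, shape (J,)
t_ext = np.concatenate([[1.0], t])  # prepend bias for the next layer
z = G(t_ext @ B)                    # output activations, shape (m,)
print(z)
```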
14. Backpropagation for Multi-Layer Networks
Backpropagation is the algorithm that makes training deep networks possible. It efficiently computes how each weight in every layer contributes to the overall error, by propagating error signals backward from the output to the input.
Gradients for Output Weights (Matrix B)
The gradient for a weight \(b_{jk}\) connecting hidden neuron \(j\) to output neuron \(k\):
\[\frac{\partial L}{\partial b_{jk}} = p_k\, t_j\]
where the output error signal is:
\[p_k = (z_k - y_k)\, z_k (1 - z_k)\]
This is the same formula as the single-neuron case from Section 10.
Gradients for Hidden Weights (Matrix A)
The gradient for a weight \(a_{ij}\) connecting input \(i\) to hidden neuron \(j\):
\[\frac{\partial L}{\partial a_{ij}} = q_j\, x_i\]
where the hidden error signal is:
\[q_j = t_j (1 - t_j) \sum_{k} p_k\, b_{jk}\]
The hidden error signal \(q_j\) has two parts multiplied together:
- \(\sum_k p_k \cdot b_{jk}\): the sum of output errors weighted by the connections from hidden neuron \(j\) to each output neuron. This represents how much hidden neuron \(j\) contributed to the total output error.
- \(t_j(1 - t_j)\): the derivative of the sigmoid at hidden neuron \(j\).
The error "flows backward" from outputs through the weights to the hidden layer — hence the name backpropagation.
Weight Update
\[A \leftarrow A - \alpha\, \Delta_A, \qquad B \leftarrow B - \alpha\, \Delta_B\]
where \(\alpha\) is the learning rate and \(\Delta_A, \Delta_B\) contain the gradients computed above.
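A complete training loop for the 3-layer network on XOR (the hidden-layer size, learning rate, iteration count, and random seed are assumptions for this sketch):

```python
import numpy as np

def G(u):
    return 1.0 / (1.0 + np.exp(-u))

# XOR: not linearly separable, so the hidden layer is essential.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
Xe = np.hstack([np.ones((4, 1)), X])          # prepend bias column

rng = np.random.default_rng(42)
J = 8                                          # hidden neurons
A = rng.normal(size=(3, J))                    # input (+bias) -> hidden
B = rng.normal(size=(J + 1, 1))                # hidden (+bias) -> output
alpha = 0.5

def forward(A, B):
    t = G(Xe @ A)                              # hidden activations
    te = np.hstack([np.ones((4, 1)), t])       # add bias for next layer
    return t, te, G(te @ B)

_, _, z = forward(A, B)
loss_before = 0.5 * np.sum((z - Y) ** 2)

for _ in range(20000):
    t, te, z = forward(A, B)
    p = (z - Y) * z * (1 - z)                  # output error signal p_k
    q = (p @ B[1:].T) * t * (1 - t)            # hidden error signal q_j
    B -= alpha * te.T @ p                      # gradient step for B
    A -= alpha * Xe.T @ q                      # gradient step for A

_, _, z = forward(A, B)
loss_after = 0.5 * np.sum((z - Y) ** 2)
print(loss_before, "->", loss_after)
```

With these settings the loss typically falls close to zero and all four XOR rows are classified correctly; the essential check is that the loss decreases as the error signals flow backward through \(B\) to \(A\).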
15. Adaptive Learning (QuickProp)
Standard gradient descent uses a fixed learning rate \(\alpha\). If \(\alpha\) is too large, the training overshoots and oscillates. If \(\alpha\) is too small, training is extremely slow. An adaptive method adjusts the effective step size automatically based on the shape of the error surface.
QuickProp is an adaptive learning method that uses a parabolic approximation of the error surface. It estimates the optimal weight change by fitting a parabola through the current and previous gradient values.
The Formula
QuickProp computes the weight change at step \(t\) using gradients from the current and previous steps:
\[C^{(t)} = \frac{\Delta^{(t)}}{\Delta^{(t-1)} - \Delta^{(t)}}\; C^{(t-1)}\]
where \(\Delta^{(t)}\) is the gradient at step \(t\) and \(C^{(t)}\) is the actual weight change applied. The ratio of gradient differences estimates the curvature of the error surface, allowing larger steps in flat regions and smaller steps near minima.
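On an exactly quadratic error surface the parabolic model is exact, so a single QuickProp step lands on the minimum. The toy objective \(f(w) = (w - 3)^2\) is an assumption for this sketch:

```python
# QuickProp's secant step on an exactly parabolic error surface:
# the parabolic approximation is exact, so one step reaches the minimum.
def grad(w):
    return 2.0 * (w - 3.0)          # derivative of f(w) = (w - 3)^2

w_prev, w = 0.0, 1.0                # two starting points
d_prev, d = grad(w_prev), grad(w)   # gradients at the two points
C_prev = w - w_prev                 # previous weight change C^(t-1)

C = d / (d_prev - d) * C_prev       # QuickProp step C^(t)
w = w + C
print(w)  # 3.0, the exact minimum
```

On non-quadratic surfaces the step is only an estimate, which is why practical QuickProp implementations bound the step size.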
16. Semi-Static Neural Network (T-NN)
A semi-static neural network (T-NN) is a hybrid architecture where the hidden layer weights are fixed (all set to 1) and the hidden layer applies pre-defined basis functions. Only the output layer weights are learned during training.
It combines the strengths of two approaches: (1) the data fitting approach from Section 6 (using domain knowledge to choose basis functions) and (2) the perceptron learning approach (learning weights from data). You get better control over what the network learns, while still benefiting from automated weight optimization.
How it Works
The hidden layer computes pre-defined transformations:
\[T(\mathbf{x}) = \left( f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_p(\mathbf{x}) \right)\]
The output layer is a standard perceptron operating on the transformed features:
\[f(\mathbf{x}) = G\!\left( B^T\, T(\mathbf{x}) \right)\]
Only the matrix \(B\) is learned. The transformation \(T\) is chosen by the human designer and remains fixed.
Solving XOR with T-NN
XOR cannot be solved by a single perceptron because it is not linearly separable. But if we choose basis functions \(f_1(\mathbf{x}) = x_1\), \(f_2(\mathbf{x}) = x_2\), and \(f_3(\mathbf{x}) = x_1 \cdot x_2\), the transformed data becomes linearly separable. The perceptron on the transformed features can then learn XOR.
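The transformed XOR problem can even be solved in closed form: with the four truth-table rows and the features \([1, x_1, x_2, x_1 x_2]\), the system is square and invertible, so the output weights follow from a single linear solve:

```python
import numpy as np

# T-NN hidden layer: fixed basis functions f1 = x1, f2 = x2, f3 = x1*x2.
# Only the output weights are fitted, here by exact linear algebra.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])            # XOR labels

# Transformed design matrix: columns [1, x1, x2, x1*x2]
A = np.column_stack([np.ones(4), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
w = np.linalg.solve(A, y)          # square system: exact interpolation
print(w)           # [0, 1, 1, -2]: f(x) = x1 + x2 - 2*x1*x2
print(A @ w)       # reproduces the XOR labels exactly
```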
17. Deep Learning
Deep learning is machine learning using neural networks with many layers. "Deep" refers to the depth of the network (many layers stacked), not a philosophical statement. In practice, deep learning means: multi-layer neural networks trained on large datasets.
Fitting vs. Generalization
Every ML model has two performance measures:
- Fitting (training accuracy): How well the model matches the training data.
- Generalization (test accuracy): How well the model performs on data it has never seen.
These two can conflict with each other.
The Overfitting Problem
Adding more hidden neurons improves fitting (the model can memorize more of the training data). But at some point, the model starts memorizing noise and random patterns in the training data that do not exist in the real world. This causes generalization to decrease. This is called overfitting.
Solution: Train/Validation Split
Split the data into two disjoint sets:
- \(\Omega_{\text{train}}\): Used to update the weights during training.
- \(\Omega_{\text{validation}}\): Never used during training. Used only to evaluate generalization.
During training, monitor the error on \(\Omega_{\text{validation}}\). Stop training when the validation error stops decreasing (even if the training error is still going down). This technique is called early stopping.
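A sketch of early stopping with a patience counter, on a noisy linear-regression task trained by gradient descent (the dataset, learning rate, and patience value are assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=0.05, size=200)

x_train, y_train = x[:150], y[:150]      # used for weight updates
x_val, y_val = x[150:], y[150:]          # used only to decide when to stop

w = np.zeros(1)
alpha, patience = 0.1, 20
best_val, since_best = np.inf, 0
for epoch in range(1000):
    grad = 2.0 * x_train.T @ (x_train @ w - y_train) / len(y_train)
    w -= alpha * grad
    val_err = np.mean((x_val @ w - y_val) ** 2)
    if val_err < best_val:               # validation still improving
        best_val, best_w, since_best = val_err, w.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:       # validation stalled: stop early
            break
print(best_w)
```

The key design point is that the returned weights are `best_w`, the snapshot taken at the lowest validation error, not the weights at the moment training halts.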
18. Deep Neural Networks
A deep neural network (DNN) is a neural network with many hidden layers and many neurons per layer. "Many" typically means 3 or more hidden layers, though modern networks can have hundreds.
Deep networks are powerful for several reasons:
- Universal approximation: With enough neurons and enough data, a deep network can approximate any function to arbitrary accuracy.
- Hierarchical feature learning: Each layer learns increasingly abstract features. Early layers detect simple patterns; deeper layers combine these into complex concepts.
- Domain knowledge integration: The architecture of the network (how layers are connected, what operations they perform) can encode domain knowledge about the problem.
The following three sections cover three specialized DNN architectures, each designed for a specific type of data.
19. Convolutional Neural Networks (CNN)
A Convolutional Neural Network (CNN) is a deep neural network designed specifically for processing 2-dimensional data, most commonly images. It uses a special operation called convolution to detect spatial patterns.
A standard neural network treats each input independently. For an image with 1000 pixels, that means 1000 independent inputs with no awareness that neighboring pixels are related. CNNs exist because they exploit the spatial structure of images: nearby pixels are more related than distant pixels, and the same pattern (e.g., an edge) can appear anywhere in the image.
CNN Components
Convolution Layer
A small matrix called a filter (or kernel) slides across the image. At each position, it computes a dot product with the image patch underneath, producing one number. Sliding across the entire image produces a feature map. Different filters detect different patterns (horizontal edges, vertical edges, colors, etc.).
Activation (ReLU)
After convolution, apply the ReLU (Rectified Linear Unit) function: \(f(x) = \max(0, x)\). This is simpler than sigmoid and works better in deep networks because it does not suffer from the vanishing gradient problem (covered in Section 20).
Pooling Layer
Pooling reduces the spatial dimensions of the feature map. Max pooling takes the maximum value in each small region; average pooling takes the average. This reduces computation and makes the network more robust to small shifts in the input.
Fully Connected Layer
After several rounds of convolution and pooling, the resulting feature maps are flattened into a 1-dimensional vector and fed into a standard neural network layer. This layer combines all the detected features to make the final decision.
Softmax Output
For classification tasks, the final layer uses the softmax function, which converts raw scores into probabilities that sum to 1. The class with the highest probability is the prediction.
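Convolution, ReLU, and max pooling are each a few lines of NumPy. The vertical-edge image and the \([-1, 1]\) filter below are assumptions for the sketch (and, as in most CNN libraries, the "convolution" is really cross-correlation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a small filter."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trims edges that do not fit."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])       # fires on dark-to-bright transitions

fmap = relu(conv2d(image, kernel))     # feature map: edge column lights up
pooled = max_pool(fmap)                # downsampled, still marks the edge
print(fmap.shape, pooled.shape)
```

Only the column at the edge produces a nonzero response, and after pooling the response survives at half the resolution: this is the shift-robust downsampling described above.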
20. Recurrent Neural Networks (RNN)
A Recurrent Neural Network (RNN) is a neural network designed for sequential data — data where order matters, such as time series, text, or audio. It processes inputs one at a time and maintains a hidden state that carries information from previous steps.
Standard neural networks and CNNs assume each input is independent. But in sequences, the meaning of the current element depends on what came before. The word "bank" means different things after "river" vs. "investment." RNNs exist to capture these sequential dependencies.
The Recurrence Relation
At each time step \(t\), the RNN updates its hidden state using both the current input and the previous hidden state:
\[\mathbf{h}_t = \sigma\!\left( W_{hh}\, \mathbf{h}_{t-1} + W_{xh}\, \mathbf{x}_t + \boldsymbol{\Delta}_h \right)\]
where:
- \(\mathbf{h}_t\) is the hidden state at time \(t\).
- \(\mathbf{x}_t\) is the input at time \(t\).
- \(W_{hh}\) is the weight matrix applied to the previous hidden state.
- \(W_{xh}\) is the weight matrix applied to the current input.
- \(\boldsymbol{\Delta}_h\) is a bias vector.
- \(\sigma\) is an activation function (typically sigmoid or tanh).
The same weight matrices \(W_{hh}\) and \(W_{xh}\) are used at every time step. This means the network learns a single set of rules that it applies repeatedly, regardless of the sequence length. This dramatically reduces the number of parameters.
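One step of the recurrence in code (the tanh activation and the sizes are assumptions for this sketch):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(7)
hidden, n_in = 4, 3
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_xh = rng.normal(scale=0.5, size=(hidden, n_in))
b_h = np.zeros(hidden)

# Process a sequence of 5 inputs with the SAME weights at every step.
h = np.zeros(hidden)
for t in range(5):
    x_t = rng.normal(size=n_in)
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
print(h)
```

The loop illustrates the weight sharing: the parameter count is fixed no matter how long the sequence grows.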
RNN Problems
During backpropagation through time, gradients are multiplied by the weight matrix at each step. Over many steps:
- If the weights are less than 1 in magnitude, gradients shrink exponentially toward zero (vanishing gradients). The network cannot learn long-range dependencies.
- If the weights are greater than 1 in magnitude, gradients grow exponentially (exploding gradients). Training becomes unstable.
The practical consequence: standard RNNs forget information from earlier parts of long sequences. This is the problem that LSTM (Section 21) was designed to solve.
21. Long Short-Term Memory (LSTM)
LSTM (Long Short-Term Memory) is a specialized RNN architecture that can learn long-range dependencies in sequential data. It replaces the simple hidden state of a standard RNN with a more complex unit that includes a memory cell and three gates.
Standard RNNs fail on long sequences because of vanishing gradients. They cannot remember information from many steps ago. LSTM solves this by providing a mechanism to selectively remember important information and forget irrelevant information, even over hundreds of time steps.
The Three Gates
Forget Gate
The forget gate decides what information to discard from the memory cell. It looks at the current input and previous hidden state, and outputs a number between 0 and 1 for each element in the memory cell. A value of 0 means "completely forget this," and 1 means "completely keep this."
Write Gate
The write gate decides what new information to store in the memory cell. It has two parts: (a) a sigmoid layer that decides which values to update, and (b) a tanh layer that creates candidate values. The memory cell is updated by combining old memory (scaled by the forget gate) with new candidates (scaled by the write gate).
Output Gate
The output gate decides what part of the memory cell to output as the hidden state. It applies a sigmoid to decide which memory elements are relevant, then multiplies by the tanh of the memory cell to produce the output.
The memory cell provides a direct path for gradients to flow backward through time without being multiplied by weight matrices at every step. The forget gate can learn to keep the gradient flowing by outputting values close to 1. This allows the network to maintain useful information over long sequences without the gradient shrinking to zero.
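A single LSTM step with the three gates, in the common stacked-weight formulation (the weight layout, sizes, and random initialization are assumptions for this sketch):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the stacked gate
    pre-activations [forget; write; candidate; output]."""
    v = np.concatenate([h_prev, x_t])
    H = len(h_prev)
    z = W @ v + b
    f = sigmoid(z[0:H])              # forget gate: what to discard
    i = sigmoid(z[H:2 * H])          # write gate: what to store
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate values
    o = sigmoid(z[3 * H:4 * H])      # output gate: what to expose
    c = f * c_prev + i * c_tilde     # memory cell update
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(11)
H, n_in = 4, 3
W = rng.normal(scale=0.5, size=(4 * H, H + n_in))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(h, c, rng.normal(size=n_in), W, b)
print(h, c)
```

Note the cell update `c = f * c_prev + i * c_tilde`: the old memory is only element-wise scaled, never multiplied by a weight matrix, which is the direct gradient path described above.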