AI and Machine Learning
How machines learn patterns from data. From linear regression to neural networks, perceptrons to deep learning architectures like CNN, RNN, and LSTM.
- What is Machine Learning?
- Setting Up the Optimization Problem
- Modeling as Optimization
- Convexifying the Loss Function
- Linear Regression
- Data Fitting (OLS)
- Neural Networks
- Perceptron
- Activation Functions
- Perceptron Learning (Backpropagation)
- Multi-Output Perceptron
- Perceptron Limitations
- Multi-Layer Network
- Backpropagation for Multi-Layer Networks
- Adaptive Learning (QuickProp)
- Semi-Static Neural Network (T-NN)
- Deep Learning
- Deep Neural Networks
- Convolutional Neural Networks (CNN)
- Recurrent Neural Networks (RNN)
- Long Short-Term Memory (LSTM)
1. What is Machine Learning?
Machine Learning (ML) is a method of solving problems where the solution is not pre-programmed by a human. Instead, the model accumulates knowledge from observation data and uses that knowledge to make decisions or predictions.
Many real-world problems are too complex to solve with hand-written rules. Consider recognizing faces, translating languages, or predicting stock prices. The relationships involved are so intricate that no human could write explicit rules for them. ML exists because it lets data reveal the rules automatically.
Data Representation
All ML starts with data. The data is represented as a set of observations, where each observation is a pair of an input and an output:
\[\Omega = \left\{ \left(\mathbf{x}^{(i)}, y^{(i)}\right) \right\}_{i=1}^{N}\]
Here:
- \(\mathbf{x}^{(i)} \in D \subset \mathbb{R}^n\) is the feature vector (also called independent variables). It is a list of \(n\) numbers that describe the input. For example, if predicting house price, the features might be square footage, number of bedrooms, and age of the house.
- \(y^{(i)} \in \mathbb{R}\) is the label (also called the dependent variable). It is the output we want to predict. In the house example, this is the price.
- \(N\) is the total number of observations.
The Goal of ML
Find a function \(f: D \to \mathbb{R}\) that models the observed data, meaning:
\[f\!\left(\mathbf{x}^{(i)}\right) \approx y^{(i)} \quad \text{for all } i = 1, \ldots, N\]
But there is a critical additional requirement: the function must also generalize. That means it must correctly predict outputs for inputs \(\mathbf{x}\) that do not appear in any observation of \(\Omega\).
A model that perfectly memorizes the training data but fails on new data is useless. Generalization is the entire point of machine learning. Everything that follows in this chapter — loss functions, regularization, train/test splits — exists to achieve good generalization.
2. Setting Up the Optimization Problem
The goal "find some function \(f\) that fits the data" is too vague. There are infinitely many possible functions. We need to narrow the search space to make the problem solvable.
Parameterized Function Families
Instead of searching over all possible functions, we choose a family of functions that are parameterized by a vector of weights \(\mathbf{w} \in \mathbb{R}^m\). We write the function as \(f_{\mathbf{w}}\) to show that it depends on \(\mathbf{w}\).
The search for the best function becomes a search for the best weights \(\mathbf{w}\). This converts an impossible problem (searching over all functions) into a concrete numerical problem (searching over a vector of numbers).
Error Criterion
We want the model's predictions to be close to the actual values. Specifically, we require:
\[\left| f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right| \leq \varepsilon \quad \text{for all } i\]
where \(\varepsilon\) is the maximum allowed error per data point. Since we have \(N\) data points, we measure the overall model error as the average error across all observations:
\[E(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left| f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|\]
The model error is the average absolute difference between the model's predictions and the actual observed values, taken over the entire training dataset.
3. Modeling as Optimization
By expressing ML as an optimization problem, we can use all the mathematical tools from Chapter 1 (gradient descent, convex optimization, etc.) to find the best model.
The Loss Function
The loss function (also called cost function or objective function) measures how far the model's predictions are from the actual data. It is a function of the weights \(\mathbf{w}\):
\[L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left| f_{\mathbf{w}}\!\left(\mathbf{x}^{(i)}\right) - y^{(i)} \right|\]
The Optimization Goal
We want the weights that make the loss as small as possible:
\[\mathbf{w}^{*} = \arg\min_{\mathbf{w}} L(\mathbf{w})\]
This is now a standard optimization problem. The notation \(\arg\min\) means "the value of \(\mathbf{w}\) that minimizes \(L\)." We have converted machine learning into the optimization framework from Chapter 1. The loss function is the objective function, and the weights are the variables we optimize over.
4. Convexifying the Loss Function
The absolute value \(|f_{\mathbf{w}}(\mathbf{x}) - y|\) is not differentiable at zero, which makes optimization harder. By switching to a squared error, we get a smooth, convex function that has a unique minimum and can be solved exactly.
Matrix Notation
To work with all data points at once, we stack them into matrices and vectors:
- Stack all input vectors \(\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\) as rows of a matrix \(A \in \mathbb{R}^{N \times n}\).
- Stack all labels \(y^{(1)}, \ldots, y^{(N)}\) into a vector \(B \in \mathbb{R}^N\).
If the model is linear — \(f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x} \cdot \mathbf{w}\) — then the predictions for all data points are \(A\mathbf{w}\), and the squared loss function becomes:
\[L(\mathbf{w}) = \|A\mathbf{w} - B\|^2\]
Theorem: Closed-Form Solution
If \(\text{Rank}(A) = n\) (meaning the columns of \(A\) are linearly independent), then the loss function \(L(\mathbf{w}) = \|A\mathbf{w} - B\|^2\) has a unique minimum at:
\[\mathbf{w}^{*} = (A^T A)^{-1} A^T B\]
Proof
Expand \(L(\mathbf{w})\) by writing out the squared norm:
\[L(\mathbf{w}) = (A\mathbf{w} - B)^T (A\mathbf{w} - B) = \mathbf{w}^T A^T A \mathbf{w} - 2 B^T A \mathbf{w} + B^T B\]
Take the derivative with respect to \(\mathbf{w}\):
\[\nabla L(\mathbf{w}) = 2 A^T A \mathbf{w} - 2 A^T B\]
The second derivative (Hessian matrix) is:
\[\nabla^2 L(\mathbf{w}) = 2 A^T A\]
The matrix \(A^T A\) is always positive semidefinite (for any vector \(\mathbf{v}\), \(\mathbf{v}^T A^T A \mathbf{v} = \|A\mathbf{v}\|^2 \geq 0\)). When \(\text{Rank}(A) = n\), it is positive definite. This confirms that \(L\) is convex, so any critical point is a global minimum.
Set the gradient to zero and solve:
\[2 A^T A \mathbf{w} - 2 A^T B = 0 \quad \Longrightarrow \quad \mathbf{w} = (A^T A)^{-1} A^T B\]
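The closed-form solution can be checked numerically. A minimal NumPy sketch (the synthetic noiseless data is assumed purely for illustration):

```python
import numpy as np

# Synthetic data generated by a known linear model, so the
# closed-form solution should recover the true weights exactly.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 2))        # N = 50 observations, n = 2 features
true_w = np.array([2.0, 3.0])
B = A @ true_w

# w = (A^T A)^{-1} A^T B; solving the normal equations directly
# avoids forming an explicit inverse, which is numerically safer.
w = np.linalg.solve(A.T @ A, A.T @ B)
print(w)
```

With 50 random rows and 2 columns, \(\text{Rank}(A) = 2\) holds almost surely, so the minimum is unique and `w` matches `true_w`.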
5. Linear Regression
Linear regression is the simplest ML model. It assumes the output is a linear combination of the input features:
\[f_{\mathbf{w}}(\mathbf{x}) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = \mathbf{x} \cdot \mathbf{w}\]
Linear regression is the starting point of ML because it has a closed-form solution (no iterative optimization needed), it is fast to compute, easy to understand, and works well when the relationship between inputs and output is approximately linear.
Adding a Bias Term
The model above forces the prediction to be zero when all inputs are zero. Usually we want a bias (also called intercept) \(w_0\):
\[f_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n\]
To keep the matrix notation clean, we extend each input vector by prepending a 1: \(\mathbf{x} \to [1, x_1, x_2, \ldots, x_n]\), and extend the weight vector to include \(w_0\): \(\mathbf{w} = [w_0, w_1, \ldots, w_n]^T\). Now the model is still \(f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x} \cdot \mathbf{w}\), and the solution remains:
\[\mathbf{w} = (A^T A)^{-1} A^T B\]
where \(A\) now has a column of ones as its first column.
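A short sketch of the bias trick, with synthetic data \(y = 5 + 2x\) assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, size=40)
y = 5.0 + 2.0 * x                        # intercept 5, slope 2

# Prepend a column of ones so the first weight acts as the bias w0.
A = np.column_stack([np.ones_like(x), x])
w = np.linalg.solve(A.T @ A, A.T @ y)    # w = [w0, w1]
print(w)
```

The recovered weight vector is `[5.0, 2.0]`: the ones column lets the same closed-form formula learn the intercept.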
6. Data Fitting (Ordinary Least Squares)
If the true relationship between inputs and output is non-linear, then linear regression will produce large errors no matter what weights you choose. You need a way to handle non-linear relationships while still using the linear regression machinery.
Basis Function Transformation
Transform the original features using basis functions \(f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_p(\mathbf{x})\). These are functions you choose that create new features from the originals. After the transformation, apply linear regression to the new features.
Define the transformation:
\[T(\mathbf{x}) = \left( f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_p(\mathbf{x}) \right)\]
Then apply linear regression to the transformed data. The model becomes:
\[f_{\mathbf{w}}(\mathbf{x}) = w_1 f_1(\mathbf{x}) + w_2 f_2(\mathbf{x}) + \cdots + w_p f_p(\mathbf{x})\]
This is still linear in the weights \(\mathbf{w}\), so the closed-form solution \(\mathbf{w} = (A^T A)^{-1} A^T B\) still applies — but now the matrix \(A\) has entries \(A_{ij} = f_j(\mathbf{x}^{(i)})\).
The Workflow
Step 1: Choose basis functions \(f_1, \ldots, f_p\) based on your understanding of the data.
Step 2: Transform all data points: replace each \(\mathbf{x}^{(i)}\) with \(T(\mathbf{x}^{(i)})\).
Step 3: Run linear regression on the transformed data to find the optimal weights.
Step 4: Compute the error. If it is too large, go back to Step 1 and choose different basis functions.
Choosing the right basis functions is the hardest part. It requires domain knowledge — understanding of the problem area. For example, if you suspect the data follows a quadratic trend, you might use \(f_1(x) = x\) and \(f_2(x) = x^2\). If you suspect periodic behavior, you might use \(f_1(x) = \sin(x)\) and \(f_2(x) = \cos(x)\). There is no automatic way to choose — this is where human expertise enters.
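As a sketch of the workflow: quadratic data fitted with the basis functions \(f_1(x) = x\) and \(f_2(x) = x^2\) plus a bias column (the data-generating coefficients are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=60)
y = 1.0 - 0.5 * x + 3.0 * x**2          # a quadratic trend

# Transformed design matrix with entries A_ij = f_j(x_i)
A = np.column_stack([np.ones_like(x), x, x**2])
# Ordinary linear regression on the transformed features
w = np.linalg.solve(A.T @ A, A.T @ y)
print(w)
```

The fit is still linear in \(\mathbf{w}\) even though the model is quadratic in \(x\), which is why the closed-form machinery carries over unchanged.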
7. Neural Networks
The basis function approach from Section 6 requires a human expert to choose the right transformations. Neural networks exist because they can learn the basis functions automatically from data, removing the need for domain expertise in feature engineering.
A neural network is a computational graph made of nodes (called neurons) connected by weighted edges. Each neuron receives inputs, multiplies them by weights, sums the results, and passes the sum through a function to produce an output.
The Simplest Neural Network (2-Layer)
A 2-layer neural network has input nodes and one output node. Each input \(x_i\) is connected to the output by a weight \(w_i\). The output is:
\[f(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i\]
This is exactly linear regression. The simplest neural network is linear regression. To handle non-linear relationships, we need additional components: more layers and non-linear activation functions (covered in the following sections).
8. Perceptron
A perceptron is the simplest neural network that makes binary decisions. It has one output neuron that classifies input into one of two categories: 0 or 1 (false or true, no or yes).
Many problems require a yes/no answer rather than a continuous number: Is this email spam? Does this image contain a cat? Is this transaction fraudulent? The perceptron is the simplest model that can make such binary decisions.
How it Works (Forward Pass)
The perceptron computes a weighted sum of its inputs and applies a step function at a threshold of zero:
\[z = \begin{cases} 1 & \text{if } w_0 + \sum_{i=1}^{n} w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}\]
The weights \(\mathbf{w}\) are learned from training data. During training, the network sees examples with known labels and adjusts the weights to make correct predictions.
9. Activation Functions
The perceptron uses a step function (output is either 0 or 1, with a sharp jump). This step function is not differentiable at the threshold point. Since gradient-based learning requires computing derivatives, we cannot train the perceptron using gradients if we keep the step function. We need a smooth replacement.
Sigmoid Functions
A sigmoid function \(s: \mathbb{R} \to \mathbb{R}\) is any function that is continuous, bounded (output stays within a fixed range), and smooth (differentiable everywhere). It produces an S-shaped curve that transitions gradually from one value to another.
The Logistic Function
The most commonly used sigmoid is the logistic function:
\[g(u) = \frac{1}{1 + e^{-u}}\]
Properties of the logistic function:
- It maps any real number to the interval \((0, 1)\).
- When \(u\) is very negative, \(g(u) \approx 0\). When \(u\) is very positive, \(g(u) \approx 1\).
- At \(u = 0\), \(g(0) = 0.5\).
- It transitions smoothly, so it has derivatives everywhere.
Key Property: Self-Referential Derivative
The derivative of the logistic function can be expressed using the function's own output:
\[g'(u) = g(u)\,\bigl(1 - g(u)\bigr)\]
This is extremely useful for computation. During the forward pass we already compute \(g(u)\). To compute the derivative (needed for learning), we just multiply \(g(u) \times (1 - g(u))\) — no additional expensive computation required.
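The identity is easy to verify numerically against a central-difference approximation of the derivative:

```python
import numpy as np

def g(u):
    """Logistic function g(u) = 1 / (1 + e^{-u})."""
    return 1.0 / (1.0 + np.exp(-u))

u = np.linspace(-5, 5, 101)
z = g(u)

# Central-difference estimate of g'(u) vs. the closed form z * (1 - z)
h = 1e-6
numeric = (g(u + h) - g(u - h)) / (2 * h)
print(np.max(np.abs(numeric - z * (1 - z))))  # tiny
```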
10. Perceptron Learning (Backpropagation for a Single Neuron)
This section shows the fundamental mechanism that all neural network training is built on: computing how much each weight contributed to the error, then adjusting it proportionally. Understanding this for a single neuron makes it possible to understand deeper networks.
Forward Pass
With the sigmoid activation, the perceptron computes:
\[z = g\!\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)\]
Loss Function
We use the mean squared error over all training data:
\[L(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \left( z^{(i)} - y^{(i)} \right)^2\]
The factor of \(\frac{1}{2}\) is included to make the derivative cleaner.
Gradient Computation via Chain Rule
To find how the loss changes when we adjust weight \(w_i\), we break the computation into a chain of simpler steps. Define \(u = w_0 + \sum w_i x_i\) and \(z = g(u)\). Then:
\[\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial w_i}\]
Computing each piece:
\(\dfrac{\partial L}{\partial z} = z - y\) — how much the loss changes when the output changes.
\(\dfrac{\partial z}{\partial u} = z(1 - z)\) — the derivative of the sigmoid (using the self-referential property).
\(\dfrac{\partial u}{\partial w_i} = x_i\) for regular weights, or \(1\) for the bias \(w_0\).
Weight Update Rule
Using the gradient, we update each weight by moving in the opposite direction of the gradient, scaled by a learning rate \(\alpha\):
\[w_i \leftarrow w_i - \alpha\, \Delta_i\]
where \(\Delta_i = (z - y) \cdot z(1 - z) \cdot x_i\) for each weight (with \(x_i = 1\) for the bias \(w_0\)).
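Putting the forward pass, the gradient, and the update rule together, a single sigmoid neuron can be trained on the AND function (the learning rate and iteration count are assumptions chosen for this toy example):

```python
import numpy as np

def g(u):
    return 1.0 / (1.0 + np.exp(-u))

# AND truth table; the leading 1 in each row is the bias input.
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])

w = np.zeros(3)
alpha = 1.0
for _ in range(20000):
    z = g(X @ w)                     # forward pass for all 4 examples
    delta = (z - y) * z * (1 - z)    # error signal (z - y) * z(1 - z)
    w -= alpha * X.T @ delta         # gradient step for every weight

pred = (g(X @ w) > 0.5).astype(int)
print(pred)
```

AND is linearly separable, so the single neuron converges to a correct classifier; XOR would not, as Section 12 explains.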
11. Multi-Output Perceptron
A single perceptron outputs one number (0 or 1), which can classify into 2 categories. For problems with \(k\) categories (e.g., classifying handwritten digits 0–9), we need multiple output neurons.
For \(k\) classes, use \(m\) output neurons (where \(m\) is enough to encode \(k\) classes as binary strings). The labels are encoded as binary vectors. The weight matrix becomes \(\mathbf{W} \in \mathbb{R}^{(n+1) \times m}\), where each column corresponds to one output neuron.
Error Gradient
The gradient for each weight \(w_{ij}\) (connecting input \(i\) to output \(j\)) follows the same pattern as the single-neuron case:
\[\Delta_{0j} = (z_j - y_j)\, z_j (1 - z_j), \qquad \Delta_{ij} = (z_j - y_j)\, z_j (1 - z_j)\, x_i\]
The first equation is for the bias weight (input is always 1). The second is for a regular weight connecting input \(x_i\) to output neuron \(j\).
12. Perceptron Limitations
A single-layer perceptron can only solve linearly separable problems. A problem is linearly separable if you can draw a straight line (or hyperplane in higher dimensions) that separates the two classes.
Linearly Separable: AND and OR
The AND function (output 1 only when both inputs are 1) and the OR function (output 1 when at least one input is 1) are linearly separable. A single line can separate the 1-outputs from the 0-outputs in each case.
Not Linearly Separable: XOR
The XOR function (output 1 when inputs differ) is not linearly separable. No single straight line can separate the points (0,0), (1,1) (which output 0) from (0,1), (1,0) (which output 1). A single perceptron simply cannot learn XOR.
Two Solutions
Solution 1: Add one or more hidden layers of neurons between the input and output. This creates a multi-layer network that can represent non-linear decision boundaries. This is covered in Section 13.
Solution 2: Use domain knowledge to create new features that make the problem linearly separable. For example, adding the feature \(x_1 \cdot x_2\) to XOR makes it solvable by a single perceptron. This approach is formalized as the semi-static neural network in Section 16.
13. Multi-Layer Network (3-Layer Feed-Forward)
A 3-layer feed-forward neural network consists of:
- Input layer: \(n\) nodes (one per input feature), plus a bias node.
- Hidden layer: \(J\) neurons that compute intermediate representations.
- Output layer: \(m\) neurons that produce the final predictions.
Hidden layers allow the network to learn non-linear functions. Each hidden neuron applies a non-linear activation to a linear combination of inputs. Combining multiple such neurons creates a rich space of possible functions that can approximate complex relationships.
Parameters
- Matrix \(A \in \mathbb{R}^{(n+1) \times J}\) — weights from input to hidden layer.
- Matrix \(B \in \mathbb{R}^{(J+1) \times m}\) — weights from hidden to output layer.
The "+1" accounts for the bias node at each layer.
Forward Pass
Compute hidden layer activations by applying sigmoid element-wise (with \(\mathbf{x}\) extended by a leading 1 for the bias):
\[\mathbf{t} = G(A^T \mathbf{x})\]
where \(G\) applies the sigmoid function \(g\) to each element independently.
Compute output layer activations:
\[\mathbf{z} = G(B^T \mathbf{t})\]
where \(\mathbf{t}\) is extended with a bias term (prepend 1) before multiplication.
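The forward pass in code; the layer sizes and random weights here are assumptions for the sketch:

```python
import numpy as np

def G(u):
    """Element-wise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))

n, J, m = 2, 3, 1                   # input, hidden, output sizes
rng = np.random.default_rng(3)
A = rng.normal(size=(n + 1, J))     # input -> hidden (bias row included)
B = rng.normal(size=(J + 1, m))     # hidden -> output (bias row included)

x = np.array([0.5, -1.0])
x_ext = np.concatenate([[1.0], x])  # prepend the bias input
t = G(x_ext @ A)                    # hidden activations, shape (J,)
t_ext = np.concatenate([[1.0], t])  # prepend bias for the next layer
z = G(t_ext @ B)                    # output activations, shape (m,)
print(z)
```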
14. Backpropagation for Multi-Layer Networks
Backpropagation is the algorithm that makes training deep networks possible. It efficiently computes how each weight in every layer contributes to the overall error, by propagating error signals backward from the output to the input.
Gradients for Output Weights (Matrix B)
The gradient for a weight \(b_{jk}\) connecting hidden neuron \(j\) to output neuron \(k\):
\[\frac{\partial L}{\partial b_{jk}} = p_k\, t_j\]
where the output error signal is:
\[p_k = (z_k - y_k)\, z_k (1 - z_k)\]
This is the same formula as the single-neuron case from Section 10.
Gradients for Hidden Weights (Matrix A)
The gradient for a weight \(a_{ij}\) connecting input \(i\) to hidden neuron \(j\):
\[\frac{\partial L}{\partial a_{ij}} = q_j\, x_i\]
where the hidden error signal is:
\[q_j = t_j (1 - t_j) \sum_{k} p_k\, b_{jk}\]
The hidden error signal \(q_j\) has two parts multiplied together:
- \(\sum_k p_k \cdot b_{jk}\): the sum of output errors weighted by the connections from hidden neuron \(j\) to each output neuron. This represents how much hidden neuron \(j\) contributed to the total output error.
- \(t_j(1 - t_j)\): the derivative of the sigmoid at hidden neuron \(j\).
The error "flows backward" from outputs through the weights to the hidden layer — hence the name backpropagation.
Weight Update
\[A \leftarrow A - \alpha\, \Delta_A, \qquad B \leftarrow B - \alpha\, \Delta_B\]
where \(\alpha\) is the learning rate and \(\Delta_A, \Delta_B\) contain the gradients computed above.
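A complete training loop for the 3-layer network on XOR (the hidden-layer size, learning rate, iteration count, and random seed are assumptions for this sketch):

```python
import numpy as np

def G(u):
    return 1.0 / (1.0 + np.exp(-u))

# XOR: not linearly separable, so the hidden layer is essential.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0.0], [1.0], [1.0], [0.0]])
Xe = np.hstack([np.ones((4, 1)), X])          # prepend bias column

rng = np.random.default_rng(42)
J = 8                                          # hidden neurons
A = rng.normal(size=(3, J))                    # input (+bias) -> hidden
B = rng.normal(size=(J + 1, 1))                # hidden (+bias) -> output
alpha = 0.5

def forward(A, B):
    t = G(Xe @ A)                              # hidden activations
    te = np.hstack([np.ones((4, 1)), t])       # add bias for next layer
    return t, te, G(te @ B)

_, _, z = forward(A, B)
loss_before = 0.5 * np.sum((z - Y) ** 2)

for _ in range(20000):
    t, te, z = forward(A, B)
    p = (z - Y) * z * (1 - z)                  # output error signal p_k
    q = (p @ B[1:].T) * t * (1 - t)            # hidden error signal q_j
    B -= alpha * te.T @ p                      # gradient step for B
    A -= alpha * Xe.T @ q                      # gradient step for A

_, _, z = forward(A, B)
loss_after = 0.5 * np.sum((z - Y) ** 2)
print(loss_before, "->", loss_after)
```

With these settings the loss typically falls close to zero and all four XOR rows are classified correctly; the essential check is that the loss decreases as the error signals flow backward through \(B\) to \(A\).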
15. Adaptive Learning (QuickProp)
Standard gradient descent uses a fixed learning rate \(\alpha\). If \(\alpha\) is too large, the training overshoots and oscillates. If \(\alpha\) is too small, training is extremely slow. An adaptive method adjusts the effective step size automatically based on the shape of the error surface.
QuickProp is an adaptive learning method that uses a parabolic approximation of the error surface. It estimates the optimal weight change by fitting a parabola through the current and previous gradient values.
The Formula
QuickProp computes the weight change at step \(t\) using gradients from the current and previous steps:
\[C^{(t)} = \frac{\Delta^{(t)}}{\Delta^{(t-1)} - \Delta^{(t)}}\; C^{(t-1)}\]
where \(\Delta^{(t)}\) is the gradient at step \(t\) and \(C^{(t)}\) is the actual weight change applied. The ratio of gradient differences estimates the curvature of the error surface, allowing larger steps in flat regions and smaller steps near minima.
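On an exactly quadratic error surface the parabolic model is exact, so a single QuickProp step lands on the minimum. The toy objective \(f(w) = (w - 3)^2\) is an assumption for this sketch:

```python
# QuickProp's secant step on an exactly parabolic error surface:
# the parabolic approximation is exact, so one step reaches the minimum.
def grad(w):
    return 2.0 * (w - 3.0)          # derivative of f(w) = (w - 3)^2

w_prev, w = 0.0, 1.0                # two starting points
d_prev, d = grad(w_prev), grad(w)   # gradients at the two points
C_prev = w - w_prev                 # previous weight change C^(t-1)

C = d / (d_prev - d) * C_prev       # QuickProp step C^(t)
w = w + C
print(w)  # 3.0, the exact minimum
```

On non-quadratic surfaces the step is only an estimate, which is why practical QuickProp implementations bound the step size.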
16. Semi-Static Neural Network (T-NN)
A semi-static neural network (T-NN) is a hybrid architecture where the hidden layer weights are fixed (all set to 1) and the hidden layer applies pre-defined basis functions. Only the output layer weights are learned during training.
It combines the strengths of two approaches: (1) the data fitting approach from Section 6 (using domain knowledge to choose basis functions) and (2) the perceptron learning approach (learning weights from data). You get better control over what the network learns, while still benefiting from automated weight optimization.
How it Works
The hidden layer computes pre-defined transformations:
\[T(\mathbf{x}) = \left( f_1(\mathbf{x}), f_2(\mathbf{x}), \ldots, f_p(\mathbf{x}) \right)\]
The output layer is a standard perceptron operating on the transformed features:
\[f(\mathbf{x}) = G\!\left( B^T\, T(\mathbf{x}) \right)\]
Only the matrix \(B\) is learned. The transformation \(T\) is chosen by the human designer and remains fixed.
Solving XOR with T-NN
XOR cannot be solved by a single perceptron because it is not linearly separable. But if we choose basis functions \(f_1(\mathbf{x}) = x_1\), \(f_2(\mathbf{x}) = x_2\), and \(f_3(\mathbf{x}) = x_1 \cdot x_2\), the transformed data becomes linearly separable. The perceptron on the transformed features can then learn XOR.
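The transformed XOR problem can even be solved in closed form: with the four truth-table rows and the features \([1, x_1, x_2, x_1 x_2]\), the system is square and invertible, so the output weights follow from a single linear solve:

```python
import numpy as np

# T-NN hidden layer: fixed basis functions f1 = x1, f2 = x2, f3 = x1*x2.
# Only the output weights are fitted, here by exact linear algebra.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])            # XOR labels

# Transformed design matrix: columns [1, x1, x2, x1*x2]
A = np.column_stack([np.ones(4), X[:, 0], X[:, 1], X[:, 0] * X[:, 1]])
w = np.linalg.solve(A, y)          # square system: exact interpolation
print(w)           # [0, 1, 1, -2]: f(x) = x1 + x2 - 2*x1*x2
print(A @ w)       # reproduces the XOR labels exactly
```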
17. Deep Learning
Deep learning is machine learning using neural networks with many layers. "Deep" refers to the depth of the network (many layers stacked), not a philosophical statement. In practice, deep learning means: multi-layer neural networks trained on large datasets.
Fitting vs. Generalization
Every ML model has two performance measures:
- Fitting (training accuracy): How well the model matches the training data.
- Generalization (test accuracy): How well the model performs on data it has never seen.
These two can conflict with each other.
The Overfitting Problem
Adding more hidden neurons improves fitting (the model can memorize more of the training data). But at some point, the model starts memorizing noise and random patterns in the training data that do not exist in the real world. This causes generalization to decrease. This is called overfitting.
Solution: Train/Validation Split
Split the data into two disjoint sets:
- \(\Omega_{\text{train}}\): Used to update the weights during training.
- \(\Omega_{\text{validation}}\): Never used during training. Used only to evaluate generalization.
During training, monitor the error on \(\Omega_{\text{validation}}\). Stop training when the validation error stops decreasing (even if the training error is still going down). This technique is called early stopping.
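A sketch of early stopping with a patience counter, on a noisy linear-regression task trained by gradient descent (the dataset, learning rate, and patience value are assumptions for this sketch):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * x[:, 0] + rng.normal(scale=0.05, size=200)

x_train, y_train = x[:150], y[:150]      # used for weight updates
x_val, y_val = x[150:], y[150:]          # used only to decide when to stop

w = np.zeros(1)
alpha, patience = 0.1, 20
best_val, since_best = np.inf, 0
for epoch in range(1000):
    grad = 2.0 * x_train.T @ (x_train @ w - y_train) / len(y_train)
    w -= alpha * grad
    val_err = np.mean((x_val @ w - y_val) ** 2)
    if val_err < best_val:               # validation still improving
        best_val, best_w, since_best = val_err, w.copy(), 0
    else:
        since_best += 1
        if since_best >= patience:       # validation stalled: stop early
            break
print(best_w)
```

The key design point is that the returned weights are `best_w`, the snapshot taken at the lowest validation error, not the weights at the moment training halts.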
18. Deep Neural Networks
A deep neural network (DNN) is a neural network with many hidden layers and many neurons per layer. "Many" typically means 3 or more hidden layers, though modern networks can have hundreds.
Deep networks are powerful for several reasons:
- Universal approximation: With enough neurons and enough data, a deep network can approximate any function to arbitrary accuracy.
- Hierarchical feature learning: Each layer learns increasingly abstract features. Early layers detect simple patterns; deeper layers combine these into complex concepts.
- Domain knowledge integration: The architecture of the network (how layers are connected, what operations they perform) can encode domain knowledge about the problem.
The following three sections cover three specialized DNN architectures, each designed for a specific type of data.
19. Convolutional Neural Networks (CNN)
A Convolutional Neural Network (CNN) is a deep neural network designed specifically for processing 2-dimensional data, most commonly images. It uses a special operation called convolution to detect spatial patterns.
A standard neural network treats each input independently. For an image with 1000 pixels, that means 1000 independent inputs with no awareness that neighboring pixels are related. CNNs exist because they exploit the spatial structure of images: nearby pixels are more related than distant pixels, and the same pattern (e.g., an edge) can appear anywhere in the image.
CNN Components
Convolution Layer
A small matrix called a filter (or kernel) slides across the image. At each position, it computes a dot product with the image patch underneath, producing one number. Sliding across the entire image produces a feature map. Different filters detect different patterns (horizontal edges, vertical edges, colors, etc.).
Activation (ReLU)
After convolution, apply the ReLU (Rectified Linear Unit) function: \(f(x) = \max(0, x)\). This is simpler than sigmoid and works better in deep networks because it does not suffer from the vanishing gradient problem (covered in Section 20).
Pooling Layer
Pooling reduces the spatial dimensions of the feature map. Max pooling takes the maximum value in each small region; average pooling takes the average. This reduces computation and makes the network more robust to small shifts in the input.
Fully Connected Layer
After several rounds of convolution and pooling, the resulting feature maps are flattened into a 1-dimensional vector and fed into a standard neural network layer. This layer combines all the detected features to make the final decision.
Softmax Output
For classification tasks, the final layer uses the softmax function, which converts raw scores into probabilities that sum to 1. The class with the highest probability is the prediction.
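Convolution, ReLU, and max pooling are each a few lines of NumPy. The vertical-edge image and the \([-1, 1]\) filter below are assumptions for the sketch (and, as in most CNN libraries, the "convolution" is really cross-correlation):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation of a 2-D image with a small filter."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trims edges that do not fit."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    return x[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

# A 6x6 image: dark left half, bright right half (one vertical edge).
image = np.zeros((6, 6))
image[:, 3:] = 1.0
kernel = np.array([[-1.0, 1.0]])       # fires on dark-to-bright transitions

fmap = relu(conv2d(image, kernel))     # feature map: edge column lights up
pooled = max_pool(fmap)                # downsampled, still marks the edge
print(fmap.shape, pooled.shape)
```

Only the column at the edge produces a nonzero response, and after pooling the response survives at half the resolution: this is the shift-robust downsampling described above.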
20. Recurrent Neural Networks (RNN)
A Recurrent Neural Network (RNN) is a neural network designed for sequential data — data where order matters, such as time series, text, or audio. It processes inputs one at a time and maintains a hidden state that carries information from previous steps.
Standard neural networks and CNNs assume each input is independent. But in sequences, the meaning of the current element depends on what came before. The word "bank" means different things after "river" vs. "investment." RNNs exist to capture these sequential dependencies.
The Recurrence Relation
At each time step \(t\), the RNN updates its hidden state using both the current input and the previous hidden state:
\[\mathbf{h}_t = \sigma\!\left( W_{hh}\, \mathbf{h}_{t-1} + W_{xh}\, \mathbf{x}_t + \boldsymbol{\Delta}_h \right)\]
where:
- \(\mathbf{h}_t\) is the hidden state at time \(t\).
- \(\mathbf{x}_t\) is the input at time \(t\).
- \(W_{hh}\) is the weight matrix applied to the previous hidden state.
- \(W_{xh}\) is the weight matrix applied to the current input.
- \(\boldsymbol{\Delta}_h\) is a bias vector.
- \(\sigma\) is an activation function (typically sigmoid or tanh).
The same weight matrices \(W_{hh}\) and \(W_{xh}\) are used at every time step. This means the network learns a single set of rules that it applies repeatedly, regardless of the sequence length. This dramatically reduces the number of parameters.
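One step of the recurrence in code (the tanh activation and the sizes are assumptions for this sketch):

```python
import numpy as np

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

rng = np.random.default_rng(7)
hidden, n_in = 4, 3
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
W_xh = rng.normal(scale=0.5, size=(hidden, n_in))
b_h = np.zeros(hidden)

# Process a sequence of 5 inputs with the SAME weights at every step.
h = np.zeros(hidden)
for t in range(5):
    x_t = rng.normal(size=n_in)
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)
print(h)
```

The loop illustrates the weight sharing: the parameter count is fixed no matter how long the sequence grows.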
RNN Problems
During backpropagation through time, gradients are multiplied by the weight matrix at each step. Over many steps:
- If the weights are less than 1 in magnitude, gradients shrink exponentially toward zero (vanishing gradients). The network cannot learn long-range dependencies.
- If the weights are greater than 1 in magnitude, gradients grow exponentially (exploding gradients). Training becomes unstable.
The practical consequence: standard RNNs forget information from earlier parts of long sequences. This is the problem that LSTM (Section 21) was designed to solve.
21. Long Short-Term Memory (LSTM)
LSTM (Long Short-Term Memory) is a specialized RNN architecture that can learn long-range dependencies in sequential data. It replaces the simple hidden state of a standard RNN with a more complex unit that includes a memory cell and three gates.
Standard RNNs fail on long sequences because of vanishing gradients. They cannot remember information from many steps ago. LSTM solves this by providing a mechanism to selectively remember important information and forget irrelevant information, even over hundreds of time steps.
The Three Gates
Forget Gate
The forget gate decides what information to discard from the memory cell. It looks at the current input and previous hidden state, and outputs a number between 0 and 1 for each element in the memory cell. A value of 0 means "completely forget this," and 1 means "completely keep this."
Write Gate
The write gate decides what new information to store in the memory cell. It has two parts: (a) a sigmoid layer that decides which values to update, and (b) a tanh layer that creates candidate values. The memory cell is updated by combining old memory (scaled by the forget gate) with new candidates (scaled by the write gate).
Output Gate
The output gate decides what part of the memory cell to output as the hidden state. It applies a sigmoid to decide which memory elements are relevant, then multiplies by the tanh of the memory cell to produce the output.
The memory cell provides a direct path for gradients to flow backward through time without being multiplied by weight matrices at every step. The forget gate can learn to keep the gradient flowing by outputting values close to 1. This allows the network to maintain useful information over long sequences without the gradient shrinking to zero.
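A single LSTM step with the three gates, in the common stacked-weight formulation (the weight layout, sizes, and random initialization are assumptions for this sketch):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def lstm_step(h_prev, c_prev, x_t, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the stacked gate
    pre-activations [forget; write; candidate; output]."""
    v = np.concatenate([h_prev, x_t])
    H = len(h_prev)
    z = W @ v + b
    f = sigmoid(z[0:H])              # forget gate: what to discard
    i = sigmoid(z[H:2 * H])          # write gate: what to store
    c_tilde = np.tanh(z[2 * H:3 * H])  # candidate values
    o = sigmoid(z[3 * H:4 * H])      # output gate: what to expose
    c = f * c_prev + i * c_tilde     # memory cell update
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(11)
H, n_in = 4, 3
W = rng.normal(scale=0.5, size=(4 * H, H + n_in))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for t in range(5):
    h, c = lstm_step(h, c, rng.normal(size=n_in), W, b)
print(h, c)
```

Note the cell update `c = f * c_prev + i * c_tilde`: the old memory is only element-wise scaled, never multiplied by a weight matrix, which is the direct gradient path described above.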