Chapter 00 — Prerequisites

Math Foundations You Need First

This chapter covers the basic math vocabulary and ideas used throughout the course. If you know what a function is, what a variable is, and basic arithmetic, you can skip this. Otherwise, start here.

1. Variables and Numbers

What It Is

A variable is a letter that stands for a number we don't know yet. We use variables because we want to talk about rules that work for any number, not just one specific number.

For example, when we write \(x = 5\), we mean "the variable \(x\) currently has the value 5." But \(x\) could be anything — that's the point.

Types of Numbers

Name | Symbol | What It Means | Examples
Natural numbers | \(\mathbb{N}\) | Counting numbers | 1, 2, 3, 42, 1000
Integers | \(\mathbb{Z}\) | Whole numbers including negatives and zero | -3, 0, 7, -100
Real numbers | \(\mathbb{R}\) | All numbers on the number line, including decimals | 3.14, -2.5, 0, √2

When you see \(x \in \mathbb{R}\), read it as "x is a real number" (x can be any number on the number line).

When you see \(x \in \mathbb{R}^n\), it means x is a list of \(n\) real numbers. For example, \(x \in \mathbb{R}^3\) means x is a list of 3 numbers, like \(x = (2.1, -0.5, 7.0)\). This is called a vector and is covered in detail in Chapter 1.

Why You'll See This Everywhere

AI deals with data, and data is made of numbers. When we write \(x \in \mathbb{R}^n\), we're saying "x is a data point with n features." A photo with 1000 pixels could be \(x \in \mathbb{R}^{1000}\) — each pixel is one number in the list.

2. Functions

What It Is

A function is a rule that takes an input and produces exactly one output. We write it as \(f(x) = \text{something}\).

For example:

  • \(f(x) = 2x + 1\) — this function takes any number, doubles it, then adds 1. So \(f(3) = 7\), \(f(0) = 1\), \(f(-1) = -1\).
  • \(f(x) = x^2\) — this function squares the input. So \(f(3) = 9\), \(f(-2) = 4\).
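In code, a function is exactly this: a rule mapping inputs to outputs. A minimal Python sketch of the two examples above (the names `f` and `g` are my own):

```python
def f(x):
    """f(x) = 2x + 1: double the input, then add 1."""
    return 2 * x + 1

def g(x):
    """g(x) = x^2: square the input."""
    return x ** 2

print(f(3), f(0), f(-1))  # 7 1 -1, matching the worked values above
print(g(3), g(-2))        # 9 4
```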
Why It Matters for AI

In AI, the entire goal is to find the right function. You have data (inputs and outputs), and you want to find a function \(f\) that correctly maps inputs to outputs. The whole course is about techniques for finding that function.

Functions with Multiple Inputs

Functions can take more than one input:

\[ f(x_1, x_2) = 3x_1 + 5x_2 - 2 \]

This function takes two numbers and combines them. If \(x_1 = 1\) and \(x_2 = 2\), then \(f(1, 2) = 3(1) + 5(2) - 2 = 11\).

The notation \(f: \mathbb{R}^n \to \mathbb{R}\) means "f takes n numbers as input and produces 1 number as output." This is the most common type of function in this course.

The Notation \(f_w\)

When you see \(f_w(x)\), it means "a function of x that also depends on some parameters w." The parameters w are adjustable knobs. Different values of w give different functions. Learning in AI = finding the best values of w.
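A quick sketch of the \(f_w\) idea in Python (the helper `make_f` and the linear form \(f_w(x) = w_0 + w_1 x\) are my own illustrative choices): one piece of code, many functions, one per setting of the knobs.

```python
def make_f(w0, w1):
    """Build f_w(x) = w0 + w1 * x for one fixed setting of the parameters w."""
    def f_w(x):
        return w0 + w1 * x
    return f_w

f_a = make_f(1.0, 2.0)   # the function f(x) = 1 + 2x
f_b = make_f(0.0, -3.0)  # the function f(x) = -3x

print(f_a(3), f_b(3))  # 7.0 -9.0: same rule, different parameters
# Learning = searching for the parameter values that best fit the data.
```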

3. Equations and Solving Them

What It Is

An equation is a statement that two things are equal. Solving an equation means finding the value(s) of the variable that make the statement true.

Example: solve \(2x + 1 = 7\).

  • Subtract 1 from both sides: \(2x = 6\)
  • Divide both sides by 2: \(x = 3\)

The solution is \(x = 3\) because plugging it back in gives \(2(3) + 1 = 7\). True.

Quadratic Equations

A quadratic equation has the form \(ax^2 + bx + c = 0\). The "\(x^2\)" makes it quadratic (degree 2). Solutions are given by:

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]

The \(\pm\) means there can be two solutions (one with +, one with −). This formula is referenced in Chapter 1 as an example of a correct algorithm.
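The formula translates directly into code. A minimal sketch (the name `solve_quadratic` is mine; it assumes \(a \neq 0\) and real solutions, i.e. a non-negative discriminant):

```python
import math

def solve_quadratic(a, b, c):
    """Solve ax^2 + bx + c = 0 via the quadratic formula.

    Assumes a != 0 and a non-negative discriminant (real roots).
    Returns the two roots (equal when the discriminant is zero).
    """
    disc = b * b - 4 * a * c
    root = math.sqrt(disc)
    return ((-b + root) / (2 * a), (-b - root) / (2 * a))

# x^2 - 5x + 6 = 0 factors as (x - 2)(x - 3), so the roots are 3 and 2.
print(solve_quadratic(1, -5, 6))  # (3.0, 2.0)
```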

When Equations Have No Formula

Most equations you encounter in AI don't have a clean formula. For example, there's no formula to directly solve a neural network's parameters. That's why we need optimization — systematic methods to find approximate solutions by trial and improvement.

4. Graphs of Functions

What It Is

A graph of a function \(f(x)\) is a visual plot. The horizontal axis shows input values (x), and the vertical axis shows output values (f(x)). Every point on the curve represents one input-output pair.

Key things to notice on a graph:

  • Where the curve goes down — the function is decreasing.
  • Where the curve goes up — the function is increasing.
  • The lowest point — this is the minimum. Finding this is the central goal of optimization.
  • Where the curve crosses zero — these are the roots (solutions to \(f(x) = 0\)).
Why It Matters

In AI, we graph the loss function (how wrong the model is). Training means moving along this graph toward the lowest point. The entire field of optimization is about finding the bottom of graphs efficiently.

5. Summation Notation

What It Is

The symbol \(\sum\) (capital Greek letter sigma) means "add up a series of things." It's shorthand for repeated addition.

\[ \sum_{i=1}^{n} x_i = x_1 + x_2 + x_3 + \cdots + x_n \]

Read this as: "sum of x_i, for i going from 1 to n." It just means add up all the x values from the first one to the nth one.

Example

If \(x_1 = 3\), \(x_2 = 7\), \(x_3 = 1\), then:

\(\sum_{i=1}^{3} x_i = 3 + 7 + 1 = 11\)

You'll see this constantly because AI works with large datasets. Instead of writing out thousands of additions, we use \(\sum\).

The Product Notation

\(\prod\) (capital pi) means "multiply a series of things":

\[ \prod_{i=1}^{n} x_i = x_1 \times x_2 \times x_3 \times \cdots \times x_n \]
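Both notations map directly onto standard library calls. A quick Python check using the values from the summation example:

```python
import math

x = [3, 7, 1]

# Sum: x_1 + x_2 + ... + x_n
total = sum(x)          # 3 + 7 + 1 = 11

# Product: x_1 * x_2 * ... * x_n
product = math.prod(x)  # 3 * 7 * 1 = 21

print(total, product)  # 11 21
```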

6. Exponents and Logarithms

Exponents

What It Is

\(x^n\) means "multiply x by itself n times." So \(2^3 = 2 \times 2 \times 2 = 8\).

Special cases:

  • \(x^0 = 1\) for \(x \neq 0\) (any nonzero number to the power of 0 is 1)
  • \(x^1 = x\)
  • \(x^{-1} = \frac{1}{x}\)
  • \(x^{1/2} = \sqrt{x}\)

The Number \(e\)

\(e \approx 2.71828\) is a special mathematical constant. It appears naturally in growth/decay processes. The function \(e^x\) (also written \(\exp(x)\)) shows up constantly in AI, especially in:

  • The sigmoid function: \(g(x) = \frac{1}{1 + e^{-x}}\) (used in neural networks)
  • Probability distributions (the Gaussian/normal distribution)

Logarithms

What It Is

A logarithm is the reverse of an exponent. If \(2^3 = 8\), then \(\log_2(8) = 3\). It answers the question: "what power do I raise the base to, to get this number?"

When you see \(\log\) or \(\ln\) in this course, it means the natural logarithm (base \(e\)). So \(\ln(e^3) = 3\).

Why Logs Appear in AI

Logarithms turn multiplication into addition: \(\log(a \times b) = \log(a) + \log(b)\). In AI, we often work with probabilities, which involve multiplying many small numbers together. Multiplying many small numbers gives extremely tiny results that computers struggle with. Taking the log converts this to adding numbers, which is much more computationally stable. That's why you'll see "log-likelihood" and "log-probability" everywhere.
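A small numerical illustration of the stability point (the probability value 0.01 and the count 1000 are arbitrary choices of mine): the direct product underflows to zero, while the log-sum stays an ordinary number.

```python
import math

p = 0.01
n = 1000

# Direct product of 1000 small probabilities underflows to 0.0:
prod = 1.0
for _ in range(n):
    prod *= p
print(prod)  # 0.0: the true value 1e-2000 is below the range of floats

# Summing logs instead is perfectly stable:
log_prob = n * math.log(p)
print(log_prob)  # about -4605.17, an ordinary float
```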

7. Derivatives (Rate of Change)

What It Is

The derivative of a function tells you how fast the output changes when you slightly change the input. It is written as \(f'(x)\) or \(\frac{df}{dx}\).

If \(f(x) = x^2\), then \(f'(x) = 2x\). This means:

  • At \(x = 3\), the function is changing at rate \(2 \times 3 = 6\) (increasing steeply).
  • At \(x = 0\), the rate is \(0\) (the function is flat here — this is the minimum).
  • At \(x = -1\), the rate is \(-2\) (the function is decreasing).

Common Derivative Rules

Function | Derivative
\(f(x) = c\) (constant) | \(f'(x) = 0\)
\(f(x) = x^n\) | \(f'(x) = nx^{n-1}\)
\(f(x) = e^x\) | \(f'(x) = e^x\)
\(f(x) = \ln(x)\) | \(f'(x) = \frac{1}{x}\)

Partial Derivatives

When a function has multiple inputs, like \(f(x_1, x_2)\), the partial derivative \(\frac{\partial f}{\partial x_1}\) measures how \(f\) changes when you change only \(x_1\), keeping \(x_2\) fixed.

Why Derivatives Are Central to AI

The derivative tells you which direction to adjust parameters to reduce error. If the derivative is positive, the error increases when you increase the parameter — so you should decrease it. If negative, you should increase it. This is the foundation of gradient descent, the main algorithm used to train AI models.
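The rule in this paragraph is gradient descent in its simplest form. A one-dimensional sketch for \(f(x) = (x - 3)^2\) (the starting point, learning rate, and iteration count are arbitrary choices of mine):

```python
def f_prime(x):
    """Derivative of f(x) = (x - 3)^2, which is 2(x - 3)."""
    return 2 * (x - 3)

x = 0.0          # arbitrary starting point
alpha = 0.1      # learning rate (step size)
for _ in range(100):
    # Move opposite to the derivative: downhill.
    x = x - alpha * f_prime(x)

print(x)  # converges toward 3, the minimum of f
```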

8. Minimum and Maximum

What It Is

The minimum of a function is the smallest output value it can produce. The maximum is the largest. Finding the minimum is written as:

\[ x^* = \arg\min_x f(x) \]

Read this as: "\(x^*\) is the value of \(x\) that makes \(f(x)\) as small as possible." The \(\arg\min\) gives you the input that achieves the minimum, not the minimum value itself.

  • \(\min_x f(x)\) = the smallest value of \(f\)
  • \(\arg\min_x f(x)\) = the \(x\) that produces that smallest value
Example

For \(f(x) = (x - 3)^2\):

  • \(\min f(x) = 0\) (the smallest output)
  • \(\arg\min f(x) = 3\) (the input that gives that output)
Key Insight

At a minimum, the derivative equals zero: \(f'(x^*) = 0\). The function is "flat" — not going up or down. This is how optimization algorithms find minima: they search for points where the derivative is zero.

9. Partial Derivatives in Depth

What It Is

A partial derivative measures how a function changes when you adjust one input variable while keeping all others fixed. It is written as \(\frac{\partial f}{\partial x_1}\) (using the curly ∂ instead of straight d).

Why It Exists

In AI, functions almost always have multiple inputs (hundreds, millions, or billions of parameters). You need to know how each individual parameter affects the output, so you can adjust each one independently. Partial derivatives give you exactly that — the effect of one parameter at a time.

How to Compute Partial Derivatives

The rule is simple: treat every other variable as if it were a constant number, then take the derivative normally with respect to the variable you care about.

Example 1 — Two Variables

Let \(f(x_1, x_2) = 3x_1^2 + 5x_1 x_2 + 2x_2^2\).

Partial derivative with respect to \(x_1\): treat \(x_2\) as a constant.

  • \(3x_1^2\) → derivative is \(6x_1\) (normal power rule)
  • \(5x_1 x_2\) → \(x_2\) is a constant, so this is "\(5x_2\) times \(x_1\)" → derivative is \(5x_2\)
  • \(2x_2^2\) → this is just a constant (no \(x_1\) in it) → derivative is \(0\)
\[ \frac{\partial f}{\partial x_1} = 6x_1 + 5x_2 \]

Partial derivative with respect to \(x_2\): treat \(x_1\) as a constant.

  • \(3x_1^2\) → no \(x_2\), derivative is \(0\)
  • \(5x_1 x_2\) → this is "\(5x_1\) times \(x_2\)" → derivative is \(5x_1\)
  • \(2x_2^2\) → derivative is \(4x_2\)
\[ \frac{\partial f}{\partial x_2} = 5x_1 + 4x_2 \]
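Hand-computed partials are easy to verify numerically: nudge one variable while holding the other fixed, exactly as the definition says. A sketch for the function above (the finite-difference helpers are mine):

```python
def f(x1, x2):
    return 3 * x1**2 + 5 * x1 * x2 + 2 * x2**2

def partial_x1(x1, x2, h=1e-6):
    """Approximate df/dx1 by nudging only x1."""
    return (f(x1 + h, x2) - f(x1 - h, x2)) / (2 * h)

def partial_x2(x1, x2, h=1e-6):
    """Approximate df/dx2 by nudging only x2."""
    return (f(x1, x2 + h) - f(x1, x2 - h)) / (2 * h)

# At (x1, x2) = (1, 2): the hand formulas give 6(1) + 5(2) = 16 and 5(1) + 4(2) = 13.
print(partial_x1(1.0, 2.0), partial_x2(1.0, 2.0))
```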
Example 2 — Three Variables

Let \(f(x, y, z) = x^2 y + yz^3 + 4xz\).

\[ \frac{\partial f}{\partial x} = 2xy + 4z \quad \text{(treat } y, z \text{ as constants)} \]
\[ \frac{\partial f}{\partial y} = x^2 + z^3 \quad \text{(treat } x, z \text{ as constants)} \]
\[ \frac{\partial f}{\partial z} = 3yz^2 + 4x \quad \text{(treat } x, y \text{ as constants)} \]

The Gradient — All Partial Derivatives Together

The gradient \(\nabla f\) collects all partial derivatives into a single vector:

\[ \nabla f(x_1, x_2, \ldots, x_n) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]
Key Fact

The gradient points in the direction where the function increases the fastest. If you want to decrease the function (minimize error in AI), you go in the opposite direction of the gradient. This is literally what gradient descent does: move in the direction \(-\nabla f\).

Second Partial Derivatives and the Hessian

You can take the derivative of a derivative. The second partial derivative \(\frac{\partial^2 f}{\partial x_i \partial x_j}\) means: first differentiate with respect to \(x_j\), then differentiate that result with respect to \(x_i\).

The Hessian matrix \(\nabla^2 f\) collects ALL second partial derivatives into a square matrix:

\[ \nabla^2 f = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \]
Why the Hessian Exists

The gradient tells you the slope (which direction the function is going). The Hessian tells you the curvature (how the slope itself is changing). This matters because curvature tells you whether a flat point (gradient = 0) is a minimum (curves up), maximum (curves down), or a saddle point (curves up in some directions, down in others). In AI, the Hessian helps determine whether your optimization has found a real minimum or a fake one.

Example — Computing the Hessian

For \(f(x_1, x_2) = 3x_1^2 + 5x_1 x_2 + 2x_2^2\), the gradient is:

\[ \nabla f = \begin{bmatrix} 6x_1 + 5x_2 \\ 5x_1 + 4x_2 \end{bmatrix} \]

Now take partial derivatives of each gradient component:

  • \(\frac{\partial}{\partial x_1}(6x_1 + 5x_2) = 6\), \(\frac{\partial}{\partial x_2}(6x_1 + 5x_2) = 5\)
  • \(\frac{\partial}{\partial x_1}(5x_1 + 4x_2) = 5\), \(\frac{\partial}{\partial x_2}(5x_1 + 4x_2) = 4\)
\[ \nabla^2 f = \begin{bmatrix} 6 & 5 \\ 5 & 4 \end{bmatrix} \]

Notice the Hessian is symmetric (the two off-diagonal entries both equal 5). This is always true for well-behaved functions, meaning those whose second partial derivatives are continuous.
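One way to double-check a Hessian computed by hand is to difference the gradient numerically. A sketch for this example (helper names are mine; since \(f\) is quadratic, the Hessian is the same at every point):

```python
def grad(x1, x2):
    """Gradient of f(x1, x2) = 3x1^2 + 5x1x2 + 2x2^2, computed by hand above."""
    return (6 * x1 + 5 * x2, 5 * x1 + 4 * x2)

def hessian(x1, x2, h=1e-5):
    """Approximate the 2x2 Hessian by finite differences of the gradient."""
    col1 = [(grad(x1 + h, x2)[i] - grad(x1 - h, x2)[i]) / (2 * h) for i in (0, 1)]
    col2 = [(grad(x1, x2 + h)[i] - grad(x1, x2 - h)[i]) / (2 * h) for i in (0, 1)]
    return [[col1[0], col2[0]],
            [col1[1], col2[1]]]

H = hessian(1.0, 2.0)
print(H)  # close to [[6, 5], [5, 4]] at any point, since f is quadratic
```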

10. The Chain Rule

What It Is

The chain rule is a formula for computing the derivative of a composition of functions — a function inside another function. If \(y = f(g(x))\), the chain rule tells you how to find \(\frac{dy}{dx}\).

Why It Exists

Neural networks are built by stacking functions inside functions inside functions (layers). The input goes through layer 1, the output of layer 1 feeds into layer 2, and so on. To train the network, you need the derivative of the final output with respect to every weight in every layer. The chain rule is the mathematical tool that makes this possible. Backpropagation — the core algorithm for training neural networks — is just the chain rule applied repeatedly.

Chain Rule for Single Variables

If you have two functions composed together: \(y = f(g(x))\), the derivative is:

\[ \frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]

In words: the derivative of the outer function (evaluated at the inner function) multiplied by the derivative of the inner function.

Example 1 — Simple Chain Rule

Find the derivative of \(y = (3x + 2)^4\).

Identify the parts:

  • Outer function: \(f(u) = u^4\) → derivative: \(f'(u) = 4u^3\)
  • Inner function: \(g(x) = 3x + 2\) → derivative: \(g'(x) = 3\)

Apply chain rule:

\[ \frac{dy}{dx} = 4(3x + 2)^3 \cdot 3 = 12(3x + 2)^3 \]
Example 2 — With Exponential

Find the derivative of \(y = e^{-x^2}\).

  • Outer: \(f(u) = e^u\) → derivative: \(e^u\)
  • Inner: \(g(x) = -x^2\) → derivative: \(-2x\)
\[ \frac{dy}{dx} = e^{-x^2} \cdot (-2x) = -2x \, e^{-x^2} \]
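Chain-rule results can be sanity-checked against a numerical derivative. A sketch for Example 2 (the test point \(x = 0.7\) and the step size are arbitrary choices of mine):

```python
import math

def y(x):
    return math.exp(-x**2)

def dy_dx(x):
    """Chain-rule result: -2x * e^(-x^2)."""
    return -2 * x * math.exp(-x**2)

# Compare against a finite-difference derivative at x = 0.7:
x0 = 0.7
h = 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)
print(dy_dx(x0), numeric)  # the two values agree closely
```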

Chain Rule with Three Functions

If \(y = f(g(h(x)))\), the chain extends:

\[ \frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx} \]

Each link in the chain contributes one factor. A neural network with 10 layers means a chain of 10 factors multiplied together. This is exactly how backpropagation works.

Example 3 — Three-Layer Chain

Let \(y = \ln(\sin(x^2))\). Three functions stacked: \(\ln(\cdot)\) wraps \(\sin(\cdot)\) which wraps \(x^2\).

  • Outermost: \(f(u) = \ln(u)\) → derivative: \(\frac{1}{u}\)
  • Middle: \(g(v) = \sin(v)\) → derivative: \(\cos(v)\)
  • Innermost: \(h(x) = x^2\) → derivative: \(2x\)
\[ \frac{dy}{dx} = \frac{1}{\sin(x^2)} \cdot \cos(x^2) \cdot 2x = \frac{2x \cos(x^2)}{\sin(x^2)} \]

Chain Rule with Multiple Variables

When functions have multiple inputs, the chain rule uses partial derivatives and sums:

If \(z = f(u, v)\) where \(u = g(x, y)\) and \(v = h(x, y)\), then:

\[ \frac{\partial z}{\partial x} = \frac{\partial f}{\partial u} \cdot \frac{\partial u}{\partial x} + \frac{\partial f}{\partial v} \cdot \frac{\partial v}{\partial x} \]

You sum over all paths from \(z\) to \(x\). Each path is a product of partial derivatives along that path.

This Is Backpropagation

In a neural network, the loss \(L\) depends on output \(z\), which depends on hidden values \(t_j\), which depend on weights \(a_{ij}\). To find \(\frac{\partial L}{\partial a_{ij}}\), you trace all paths from \(L\) back to \(a_{ij}\) using the chain rule, multiplying partial derivatives along each path and summing over all paths. That's the entire backpropagation algorithm.

Example 4 — Neural Network Chain Rule

Suppose a single neuron computes:

  • Weighted sum: \(u = w_0 + w_1 x_1 + w_2 x_2\)
  • Activation: \(z = g(u) = \frac{1}{1 + e^{-u}}\) (sigmoid)
  • Loss: \(L = \frac{1}{2}(z - y)^2\)

To find how the loss changes when we adjust weight \(w_1\):

\[ \frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z} \cdot \frac{\partial z}{\partial u} \cdot \frac{\partial u}{\partial w_1} \]

Computing each piece:

  • \(\frac{\partial L}{\partial z} = z - y\) (how loss changes with output)
  • \(\frac{\partial z}{\partial u} = z(1 - z)\) (sigmoid derivative — a known formula)
  • \(\frac{\partial u}{\partial w_1} = x_1\) (because \(u = w_0 + w_1 x_1 + w_2 x_2\))
\[ \frac{\partial L}{\partial w_1} = (z - y) \cdot z(1 - z) \cdot x_1 \]

This tells you exactly how much to adjust \(w_1\) to reduce the error. Multiply by a learning rate \(\alpha\) and subtract from the current weight: \(w_1 \leftarrow w_1 - \alpha \cdot \frac{\partial L}{\partial w_1}\).
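The whole computation fits in a few lines of plain Python. A sketch of one update for this neuron (the input values, target, initial weights, and learning rate are arbitrary illustrative choices of mine):

```python
import math

# Inputs, target, and current weights (arbitrary illustrative values)
x1, x2 = 1.0, 2.0
y = 1.0
w0, w1, w2 = 0.1, 0.2, -0.1
alpha = 0.5  # learning rate

# Forward pass
u = w0 + w1 * x1 + w2 * x2          # weighted sum
z = 1.0 / (1.0 + math.exp(-u))      # sigmoid activation
L = 0.5 * (z - y) ** 2              # squared-error loss

# Backward pass: the chain rule, exactly the three factors above
dL_dw1 = (z - y) * z * (1 - z) * x1

# Gradient descent update
w1 = w1 - alpha * dL_dw1
print(L, dL_dw1, w1)
```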

11. Vectors

What It Is

A vector is an ordered list of numbers. A vector with \(n\) numbers is called an \(n\)-dimensional vector and belongs to \(\mathbb{R}^n\).

\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^n \]
Why Vectors Exist in AI

All data in AI is stored as vectors. A single data point with 5 features is a vector in \(\mathbb{R}^5\). An image with 784 pixels is a vector in \(\mathbb{R}^{784}\). Model weights are also stored as vectors. Every computation in AI — from input processing to gradient updates — operates on vectors.

Vector Operations

Addition

Add two vectors by adding their corresponding entries:

\[ \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} + \begin{bmatrix} 2 \\ -1 \\ 4 \end{bmatrix} = \begin{bmatrix} 1+2 \\ 3+(-1) \\ 5+4 \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \\ 9 \end{bmatrix} \]

Both vectors must have the same number of entries. You cannot add a 3D vector to a 2D vector.

Scalar Multiplication

Multiply a vector by a single number (scalar) — multiply every entry:

\[ 3 \cdot \begin{bmatrix} 2 \\ -1 \\ 4 \end{bmatrix} = \begin{bmatrix} 6 \\ -3 \\ 12 \end{bmatrix} \]

Dot Product (Inner Product)

Definition

The dot product of two vectors \(\mathbf{x}\) and \(\mathbf{y}\) is a single number computed by multiplying corresponding entries and adding them all up:

\[ \mathbf{x} \cdot \mathbf{y} = \langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{n} x_i \cdot y_i = x_1 y_1 + x_2 y_2 + \cdots + x_n y_n \]
Example

\(\begin{bmatrix} 2 \\ 3 \\ -1 \end{bmatrix} \cdot \begin{bmatrix} 4 \\ -2 \\ 5 \end{bmatrix} = (2)(4) + (3)(-2) + (-1)(5) = 8 - 6 - 5 = -3\)
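These definitions map one-to-one onto short helper functions. A plain-Python sketch reproducing the worked examples (real AI code would use a numerical library, but nothing more than this is going on underneath):

```python
def add(x, y):
    """Entry-wise vector addition (vectors as plain lists, same length)."""
    return [xi + yi for xi, yi in zip(x, y)]

def scale(c, x):
    """Scalar multiplication: multiply every entry by c."""
    return [c * xi for xi in x]

def dot(x, y):
    """Dot product: multiply corresponding entries, then sum."""
    return sum(xi * yi for xi, yi in zip(x, y))

print(add([1, 3, 5], [2, -1, 4]))    # [3, 2, 9]
print(scale(3, [2, -1, 4]))          # [6, -3, 12]
print(dot([2, 3, -1], [4, -2, 5]))   # -3, matching the worked example
```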

Why the Dot Product Matters

The dot product is the most fundamental operation in neural networks. Every neuron computes a dot product: it multiplies each input by its corresponding weight and sums the results. When you see \(\mathbf{w} \cdot \mathbf{x}\) or \(\mathbf{w}^T \mathbf{x}\), it means "compute the weighted sum of inputs" — which is what every neuron does.

Norm (Length/Magnitude)

Definition

The norm of a vector, written \(\|\mathbf{x}\|\), measures its "length" or "size":

\[ \|\mathbf{x}\| = \sqrt{x_1^2 + x_2^2 + \cdots + x_n^2} = \sqrt{\sum_{i=1}^{n} x_i^2} = \sqrt{\mathbf{x} \cdot \mathbf{x}} \]
Example

\(\left\| \begin{bmatrix} 3 \\ 4 \end{bmatrix} \right\| = \sqrt{3^2 + 4^2} = \sqrt{9 + 16} = \sqrt{25} = 5\)

In AI, \(\|\mathbf{x} - \mathbf{y}\|\) measures how "far apart" two vectors are. The loss function \(\|A\mathbf{w} - B\|^2\) measures the total distance between predictions and actual values — the thing we're trying to minimize.
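A short sketch of the norm, and of distance as the norm of a difference (the helper name and the second pair of vectors are my own):

```python
import math

def norm(x):
    """Euclidean norm: square root of the sum of squared entries."""
    return math.sqrt(sum(xi * xi for xi in x))

print(norm([3, 4]))  # 5.0, the worked example above

# Distance between two vectors = norm of their difference:
x, y = [1.0, 2.0], [4.0, 6.0]
diff = [xi - yi for xi, yi in zip(x, y)]
print(norm(diff))  # 5.0 again: the difference is (-3, -4)
```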

Transpose

A column vector written sideways becomes a row vector. This is called the transpose, written \(\mathbf{x}^T\):

\[ \mathbf{x} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \quad \Rightarrow \quad \mathbf{x}^T = \begin{bmatrix} 1 & 2 & 3 \end{bmatrix} \]

The dot product can be written as matrix multiplication: \(\mathbf{x} \cdot \mathbf{y} = \mathbf{x}^T \mathbf{y}\). You'll see this notation throughout the course.

12. Matrices and Linear Algebra

What It Is

A matrix is a rectangular grid of numbers arranged in rows and columns. A matrix with \(n\) rows and \(m\) columns is called an \(n \times m\) matrix and belongs to \(\mathbb{R}^{n \times m}\).

\[ A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix} \in \mathbb{R}^{2 \times 3} \]

The entry \(a_{ij}\) sits in row \(i\) and column \(j\). First subscript = row, second = column.

Why Matrices Exist in AI

Matrices are how we organize and process large amounts of data efficiently. A dataset with 1000 data points, each having 10 features, is stored as a \(1000 \times 10\) matrix. All the weights connecting one neural network layer to the next are stored as a matrix. Matrix operations let us process entire datasets and layers in a single operation instead of looping through items one at a time.

Matrix Addition

Add two matrices of the same size by adding corresponding entries:

\[ \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} + \begin{bmatrix} 5 & 6 \\ 7 & 8 \end{bmatrix} = \begin{bmatrix} 6 & 8 \\ 10 & 12 \end{bmatrix} \]

Matrix Multiplication

Critical Operation

Matrix multiplication is NOT entry-by-entry. It follows a specific rule and is the most important operation in all of AI.

To multiply matrix \(A\) (\(n \times p\)) by matrix \(B\) (\(p \times m\)), the result \(C = A \cdot B\) is an \(n \times m\) matrix where:

\[ c_{ij} = \sum_{k=1}^{p} a_{ik} \cdot b_{kj} \]

Each entry \(c_{ij}\) is the dot product of row \(i\) of \(A\) with column \(j\) of \(B\).

Size Rule

You can only multiply \(A \cdot B\) if the number of columns in A equals the number of rows in B. The result has the number of rows from A and the number of columns from B.

\((n \times \textbf{p}) \cdot (\textbf{p} \times m) = (n \times m)\). The inner dimensions must match.

Example — Step by Step

Multiply a \(2 \times 3\) matrix by a \(3 \times 2\) matrix:

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \cdot \begin{bmatrix} 7 & 8 \\ 9 & 10 \\ 11 & 12 \end{bmatrix} = \begin{bmatrix} ? & ? \\ ? & ? \end{bmatrix} \]

Result is \(2 \times 2\). Computing each entry:

  • Row 1 of A · Column 1 of B: \(1 \cdot 7 + 2 \cdot 9 + 3 \cdot 11 = 7 + 18 + 33 = 58\)
  • Row 1 of A · Column 2 of B: \(1 \cdot 8 + 2 \cdot 10 + 3 \cdot 12 = 8 + 20 + 36 = 64\)
  • Row 2 of A · Column 1 of B: \(4 \cdot 7 + 5 \cdot 9 + 6 \cdot 11 = 28 + 45 + 66 = 139\)
  • Row 2 of A · Column 2 of B: \(4 \cdot 8 + 5 \cdot 10 + 6 \cdot 12 = 32 + 50 + 72 = 154\)
\[ = \begin{bmatrix} 58 & 64 \\ 139 & 154 \end{bmatrix} \]
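The definition \(c_{ij} = \sum_k a_{ik} b_{kj}\) becomes a triple loop in code. A plain-Python sketch that reproduces the example above (production code would call an optimized library routine instead):

```python
def matmul(A, B):
    """Multiply an n x p matrix by a p x m matrix (matrices as lists of rows)."""
    n, p, m = len(A), len(B), len(B[0])
    assert len(A[0]) == p, "inner dimensions must match"
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # c_ij = dot product of row i of A with column j of B
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(p))
    return C

A = [[1, 2, 3], [4, 5, 6]]
B = [[7, 8], [9, 10], [11, 12]]
print(matmul(A, B))  # [[58, 64], [139, 154]]
```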

Matrix-Vector Multiplication

A very common case: multiplying a matrix by a vector. This is how neural network layers work.

\[ \begin{bmatrix} 2 & 3 \\ 1 & -1 \\ 4 & 0 \end{bmatrix} \cdot \begin{bmatrix} 5 \\ 2 \end{bmatrix} = \begin{bmatrix} 2 \cdot 5 + 3 \cdot 2 \\ 1 \cdot 5 + (-1) \cdot 2 \\ 4 \cdot 5 + 0 \cdot 2 \end{bmatrix} = \begin{bmatrix} 16 \\ 3 \\ 20 \end{bmatrix} \]

A \(3 \times 2\) matrix times a 2D vector gives a 3D vector. Each entry in the result is the dot product of one row of the matrix with the input vector.

Why This Is Important

In a neural network layer with 2 input neurons and 3 output neurons, the weight matrix is \(3 \times 2\). The forward pass is exactly this matrix-vector multiplication: input vector goes in, output vector comes out. Each output neuron computes a dot product of its weights with the input. Matrix multiplication does all neurons at once.

The Identity Matrix

Definition

The identity matrix \(I\) is a square matrix with 1s on the diagonal and 0s everywhere else. Multiplying any matrix by \(I\) gives back the same matrix: \(A \cdot I = I \cdot A = A\).

\[ I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \quad \text{(3×3 identity)} \]

It's the matrix equivalent of multiplying a number by 1.

Matrix Inverse

Definition

The inverse of a square matrix \(A\), written \(A^{-1}\), is the matrix that "undoes" \(A\):

\[ A \cdot A^{-1} = A^{-1} \cdot A = I \]

Not every matrix has an inverse. A matrix that has one is called invertible (or non-singular). A matrix without one is called singular.

Why Inverses Matter in AI

The closed-form solution for linear regression is \(\mathbf{w}^* = (A^T A)^{-1} A^T B\). This formula requires computing a matrix inverse. If \(A^T A\) is not invertible, you can't directly solve for the optimal weights using this formula (and must use iterative methods like gradient descent instead).

Matrix Transpose

The transpose \(A^T\) flips a matrix over its diagonal — rows become columns and columns become rows:

\[ A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \quad \Rightarrow \quad A^T = \begin{bmatrix} 1 & 4 \\ 2 & 5 \\ 3 & 6 \end{bmatrix} \]

If \(A\) is \(n \times m\), then \(A^T\) is \(m \times n\). A matrix equal to its own transpose (\(A = A^T\)) is called symmetric.

Determinant

Definition

The determinant of a square matrix, written \(\det(A)\) or \(|A|\), is a single number computed from the matrix entries. Its key property: a matrix is invertible if and only if its determinant is not zero.

For a \(2 \times 2\) matrix:

\[ \det\begin{bmatrix} a & b \\ c & d \end{bmatrix} = ad - bc \]
Example

\(\det\begin{bmatrix} 3 & 1 \\ 2 & 4 \end{bmatrix} = 3 \cdot 4 - 1 \cdot 2 = 12 - 2 = 10\). Since \(10 \neq 0\), this matrix is invertible.

\(\det\begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix} = 2 \cdot 2 - 4 \cdot 1 = 4 - 4 = 0\). Since the determinant is 0, this matrix is NOT invertible.

For larger matrices, the determinant is computed recursively by expanding along a row or column (cofactor expansion). The formula grows complex, but the concept is the same: it's one number that tells you if the matrix is invertible.
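For the \(2 \times 2\) case, both the determinant and the inverse have short closed forms (the inverse formula \(A^{-1} = \frac{1}{ad - bc}\begin{bmatrix} d & -b \\ -c & a \end{bmatrix}\) is standard, though not derived in this chapter). A sketch:

```python
def det2(M):
    """Determinant of a 2x2 matrix [[a, b], [c, d]]: ad - bc."""
    (a, b), (c, d) = M
    return a * d - b * c

def inv2(M):
    """Inverse of an invertible 2x2 matrix; raises if the determinant is zero."""
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular (det = 0), no inverse exists")
    return [[d / det, -b / det], [-c / det, a / det]]

M = [[3, 1], [2, 4]]
print(det2(M))                 # 10, so M is invertible
print(inv2(M))                 # [[0.4, -0.1], [-0.2, 0.3]]
print(det2([[2, 4], [1, 2]]))  # 0, so no inverse exists
```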

Positive Semidefinite Matrices

Definition

A symmetric matrix \(A\) is positive semidefinite (written \(A \geq 0\)) if for every vector \(\mathbf{h}\):

\[ \mathbf{h}^T \cdot A \cdot \mathbf{h} \geq 0 \]
Why This Matters

Positive semidefinite matrices determine convexity. If the Hessian matrix of a function is positive semidefinite, the function is convex. Convex functions have a single minimum — so gradient descent is guaranteed to find the best solution. This is the mathematical test that tells you whether your optimization problem is "easy" (convex) or "hard" (non-convex).
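One way to probe the definition numerically is to evaluate \(\mathbf{h}^T A \mathbf{h}\) at many random vectors. A 2x2 sketch (helper names are mine; random sampling only gives evidence, not a proof; the rigorous test checks that all eigenvalues are non-negative):

```python
import random

def quad_form(A, h):
    """Compute h^T A h for a 2x2 matrix A and a 2-vector h."""
    h1, h2 = h
    return (A[0][0] * h1 * h1 + (A[0][1] + A[1][0]) * h1 * h2
            + A[1][1] * h2 * h2)

def looks_psd(A, trials=1000):
    """Heuristic check: h^T A h >= 0 for many random h (seeded for repeatability)."""
    rng = random.Random(0)
    return all(quad_form(A, (rng.uniform(-1, 1), rng.uniform(-1, 1))) >= -1e-12
               for _ in range(trials))

print(looks_psd([[2, 1], [1, 2]]))   # True: eigenvalues 1 and 3, both non-negative
print(looks_psd([[1, 2], [2, 1]]))   # False: eigenvalues 3 and -1
```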

Summary: How Linear Algebra Powers AI

Concept | Where It's Used in AI
Vectors | Representing data points, model weights, gradients
Dot product | What every neuron computes (weighted sum of inputs)
Norm | Measuring errors, distances between predictions and targets
Matrices | Storing weights between layers, storing datasets
Matrix multiplication | Forward pass through neural network layers
Transpose | Computing gradients, loss function formulas
Inverse | Closed-form solution for linear regression
Determinant | Checking if inverse exists (if solution is computable)
Gradient (vector) | Direction to update weights in gradient descent
Hessian (matrix) | Checking if a function is convex (easy to optimize)
Positive semidefinite | Test for convexity via the Hessian

13. Sets and Notation

What It Is

A set is a collection of items. We write sets with curly braces: \(\{1, 2, 3\}\) is the set containing 1, 2, and 3.

Notation you'll see:

Symbol | Meaning | Example
\(\in\) | "is a member of" / "belongs to" | \(x \in \{1,2,3\}\) means x is 1, 2, or 3
\(\notin\) | "is not a member of" | \(4 \notin \{1,2,3\}\)
\(\subset\) | "is a subset of" | \(\{1,2\} \subset \{1,2,3\}\)
\(\cup\) | union (combine two sets) | \(\{1,2\} \cup \{3\} = \{1,2,3\}\)
\(\cap\) | intersection (items in both sets) | \(\{1,2\} \cap \{2,3\} = \{2\}\)
\(\emptyset\) | empty set (no items) |
\(\forall\) | "for all" / "for every" | \(\forall x \in S\) means "for every x in S"
\(\exists\) | "there exists" | \(\exists x\) means "there is at least one x"

Intervals

\([a, b]\) means all real numbers from \(a\) to \(b\), including both endpoints. \((a, b)\) means the same but excluding the endpoints. \([0, 1]\) is all numbers from 0 to 1.

14. Probability Basics

What It Is

Probability is a number between 0 and 1 that measures how likely something is to happen. 0 means impossible, 1 means certain, 0.5 means equally likely to happen or not.

Key notation:

  • \(P(A)\) = probability of event A happening
  • \(P(A|B)\) = probability of A happening given that B has already happened (conditional probability)
  • \(P(A, B)\) = probability of both A and B happening (joint probability)

Expected Value

What It Is

The expected value \(E[X]\) is the average outcome you'd get if you repeated an experiment infinitely many times. It is written as:

\[ E[X] = \sum_i x_i \cdot P(x_i) \]

"Multiply each possible outcome by its probability, then add them all up."

Normal (Gaussian) Distribution

The normal distribution \(\mathcal{N}(\mu, \sigma^2)\) is a probability distribution shaped as a bell curve. It is defined by two parameters:

  • \(\mu\) (mu) = the center (mean/average)
  • \(\sigma^2\) (sigma squared) = the spread (variance). Larger \(\sigma^2\) = more spread out.
Why This Matters

The Gaussian distribution appears throughout AI. Neural networks often initialize weights from a Gaussian distribution. Diffusion models (Chapter 3) add Gaussian noise to data. The entire theory of generative AI relies on properties of Gaussian distributions.