Optimization for AI
How to find the best solution when perfect answers are impossible — from classical algorithms to gradient descent and convex optimization.
- From Algorithms to Heuristics
- Binary Search for Root Finding
- General Optimization Reformulation
- Greedy Search
- Numerical Optimization
- Definitions and Notation (Linear Algebra)
- Gradient and Hessian
- Convex Optimization
- Special Functions in Convex Optimization
- Non-Convex Optimization and Gradient Descent
- Brent's Method
1. From Algorithms to Heuristics
What is an Algorithm?
An algorithm is a finite sequence of well-defined instructions that transforms an input into an output. Every algorithm must satisfy two properties:
- Correctness — the algorithm always produces the correct result for every valid input.
- Finiteness — the algorithm must terminate (finish running) after a bounded number of steps.
These two properties define the ideal we aim for when solving problems with computers. In practice, real-world problems often violate one or both of these properties, which is why we need alternatives (heuristics). Understanding the ideal helps us understand what we are giving up and why.
Correctness and Its Limits
Correctness means the algorithm works regardless of the specific input; it does not depend on context or luck. For example, the quadratic formula always gives the roots of any quadratic equation \(ax^2 + bx + c = 0\):

\[ x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \]
Given any values of \(a\), \(b\), and \(c\) (with \(a \neq 0\)), this formula produces the correct roots every time. There is no guessing involved.
However, correctness in theory does not guarantee correctness in practice. Two sources of error arise when running algorithms on real computers:
- Roundoff error — hardware represents numbers with finite precision (e.g., 64-bit floating point). The true mathematical answer may not be exactly representable, so the computer rounds it. These small errors can accumulate.
- Truncation error — programming languages and libraries sometimes approximate infinite processes (like infinite series) by cutting them off after a finite number of terms.
A solution is stable if small changes in the input (or small computational errors) cause only small changes in the output. When we say a result "may not be stable," we mean that tiny perturbations — like roundoff — could cause the output to change significantly. Stability is a critical practical concern even for mathematically correct algorithms.
Finiteness and Hard Problems
Finiteness means the algorithm must return a result within an acceptable amount of time. An algorithm that technically terminates but takes billions of years is useless in practice.
Many problems in computer science are believed to be "hard," meaning no known algorithm can solve them quickly for all inputs, even though the set of possible answers is finite. The canonical examples are the NP-hard problems; for others, such as integer factorization below, no efficient algorithm is known even though the problem has not been proven NP-hard.
Given a large number \(n = p \times q\) (where \(p\) and \(q\) are both prime), finding \(p\) and \(q\) is computationally hard. There is no known efficient algorithm for this. This hardness is the basis of RSA cryptography — RSA security relies on the assumption that factoring large numbers is infeasible.
Mersenne primes are primes of the form \(p = 2^r - 1\) where \(r\) is itself prime. For example, \(r = 2\) gives \(p = 3\) and \(r = 3\) gives \(p = 7\) (both prime), but \(r = 11\) gives \(p = 2047 = 23 \times 89\) (not prime). Some values of \(r\) yield primes and some do not. Searching over special prime forms like this for factorization is finite (the search space is bounded), but the approach is not stable: small changes in the input can change which factors, if any, are found.
The Problem with Continuous Domains
For discrete problems (like factoring integers), the search space is at least finite, even if large. But many problems have continuous domains.
Consider the problem: find \(x\) such that \(f(x) = 0\) on the interval \([a, b]\), where \(f\) is continuous. Even though the interval is bounded, there can be infinitely many solutions. No algorithm can enumerate them all.
A simple approach: pick random values of \(x\) in \([a, b]\) and check if \(f(x) = 0\).
- Finiteness: Yes — we can set a maximum number of random trials, so it always terminates.
- Correctness/Stability: No — each run may find a different root (or none at all). The output is not consistent.
This illustrates the fundamental trade-off: we can guarantee termination, but we lose consistency.
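The random-sampling approach above can be sketched in a few lines of Python (the function and parameter names here are illustrative, not from the original text):

```python
import random

def random_root_search(f, a, b, tol=1e-6, max_trials=100_000):
    """Heuristic root search: sample x in [a, b] and keep the best candidate."""
    best_x, best_val = None, float("inf")
    for _ in range(max_trials):          # finiteness: a bounded number of trials
        x = random.uniform(a, b)
        val = abs(f(x))
        if val < best_val:
            best_x, best_val = x, val
        if best_val < tol:               # accept f(x) ≈ 0, not f(x) = 0
            break
    return best_x, best_val

# Each run may return a slightly different approximate root of x^2 - 2 = 0.
root, err = random_root_search(lambda x: x * x - 2.0, 0.0, 2.0)
```

Running this repeatedly gives approximate roots near \(\sqrt{2}\), but never the same value twice: termination is guaranteed, consistency is not.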
What is a Heuristic?
A heuristic is a search method that uses experience, knowledge, or practical rules to find solutions to problems. Unlike a true algorithm, a heuristic does not guarantee that results will be consistent across runs. However, it does guarantee that the program terminates in a reasonable amount of time.
Many real-world problems (especially in AI) are too hard for exact algorithms. Heuristics exist because we need practical solutions. The key insight is a deliberate relaxation:
- Instead of requiring an exact solution where \(P(s) = 0\), we accept an approximate solution where \(P(s) \approx 0\).
- By relaxing from \(=\) to \(\approx\), solutions always exist. The trade-off is that different runs may produce different approximate solutions — the results are not guaranteed to be stable.
This relaxation is the foundation of optimization in AI. Almost every AI training algorithm is a heuristic that finds "good enough" solutions rather than perfect ones.
2. Binary Search for Root Finding
Binary search is a method for finding a root of a continuous function \(f\) on an interval \([c, d]\), given that \(f(c)\) and \(f(d)\) have opposite signs (i.e., \(f(c) \cdot f(d) < 0\)). By the Intermediate Value Theorem, there must be at least one point in \([c, d]\) where \(f(x) = 0\).
The idea: repeatedly halve the interval. At each step, compute the midpoint and check which half contains the sign change. This narrows down the location of the root.
- Start with interval \([c, d]\) where \(f(c) \cdot f(d) < 0\).
- Compute the midpoint: \(x = \frac{c + d}{2}\).
- Evaluate \(f(x)\).
- If \(f(c) \cdot f(x) < 0\), the root is in \([c, x]\) — set \(d = x\).
- If \(f(x) \cdot f(d) < 0\), the root is in \([x, d]\) — set \(c = x\).
- Repeat until \(d - c < \varepsilon\) (the interval is small enough).
The number of iterations required to reach precision \(\varepsilon\) is:

\[ k \geq \log_2 \frac{d - c}{\varepsilon} \]
Each iteration cuts the interval in half, so the interval shrinks exponentially. After \(k\) iterations, the interval width is \(\frac{d-c}{2^k}\). This is both finite (it terminates) and stable (it always converges to a root in the original interval). Binary search is one of the few methods that is both a proper algorithm and practically useful for root-finding.
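The bisection loop described above fits in a few lines of Python (a minimal sketch; names are illustrative):

```python
def bisection(f, c, d, eps=1e-10):
    """Find a root of f in [c, d], assuming f(c) and f(d) have opposite signs."""
    if f(c) * f(d) >= 0:
        raise ValueError("f(c) and f(d) must have opposite signs")
    while d - c > eps:
        x = (c + d) / 2
        if f(c) * f(x) < 0:   # sign change in the left half: root is in [c, x]
            d = x
        else:                 # sign change in the right half: root is in [x, d]
            c = x
    return (c + d) / 2

root = bisection(lambda x: x * x - 2.0, 1.0, 2.0)  # converges to sqrt(2)
```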
3. General Optimization Reformulation
Root-finding (solving \(f(x) = 0\)) can be transformed into an optimization problem. This reformulation is fundamental — it is how nearly all AI problems are framed.
Instead of directly solving \(f(x) = 0\), define a new function:

\[ L(x) = f(x)^2 \]
Notice that \(L(x) = 0\) if and only if \(f(x) = 0\). Also, \(L(x)\) is always non-negative. So finding a root of \(f\) is equivalent to finding the point where \(L(x)\) achieves its minimum value of 0.
The root-finding problem \(f(x) = 0\) becomes the optimization problem:

\[ x^* = \arg\min_{x \in [a, b]} L(x) \]
This reformulation is powerful for two reasons:
- Generality: Many problems that are not root-finding problems can still be written as "minimize some function." Optimization is a universal framework.
- Graceful failure: Even if no exact root exists (\(f(x) \neq 0\) for all \(x\)), the optimization version still gives a meaningful answer — the \(x\) that makes \(L(x)\) as small as possible. This is the "good enough" solution that heuristics aim for.
In AI, the function \(L\) is called the loss function (or cost function, or objective function). Training an AI model means minimizing its loss function.
4. Greedy Search
Greedy search is a general-purpose iterative method for minimizing a function when you have no special structure to exploit. It works purely by sampling and comparing.
- Pick a random starting point \(x^*\) in the domain. Evaluate \(L(x^*)\).
- Sample \(N\) random points "near" \(x^*\) (in some neighborhood).
- Evaluate \(L\) at each of these \(N\) points.
- If any of the \(N\) points has a lower value of \(L\), update \(x^*\) to that point.
- Repeat steps 2–4 until some stopping criterion is met (e.g., a maximum number of iterations \(N_{\max}\)).
The computational cost is \(O(N \cdot N_{\max})\) function evaluations: each of the at most \(N_{\max}\) iterations evaluates \(L\) at \(N\) sampled points.
Greedy search has no guarantee of finding the global minimum. It can get stuck in local minima — points that are lower than all nearby points but not the lowest overall. It also has no way to know how close it is to the true minimum. Despite these limitations, greedy search is simple to implement and works as a baseline for harder problems.
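The greedy loop above can be sketched as follows (the neighborhood radius and sample count are arbitrary illustrative choices):

```python
import random

def greedy_search(L, low, high, n_samples=20, radius=0.5, max_iters=500):
    """Minimize a 1D function by repeatedly sampling near the current best point."""
    x_best = random.uniform(low, high)   # step 1: random starting point
    L_best = L(x_best)
    for _ in range(max_iters):
        for _ in range(n_samples):       # steps 2-3: evaluate N nearby candidates
            x = x_best + random.uniform(-radius, radius)
            x = min(max(x, low), high)   # clamp to stay inside the domain
            val = L(x)
            if val < L_best:             # step 4: keep any improvement
                x_best, L_best = x, val
    return x_best, L_best

x_min, L_min = greedy_search(lambda x: (x - 3.0) ** 2, 0.0, 10.0)
```

On this convex toy function greedy search reliably approaches the minimum at \(x = 3\); on a multi-modal function the same code can settle into whichever local minimum is nearest the starting point.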
5. Numerical Optimization
Numerical optimization is the systematic study of how to minimize (or maximize) functions. This is the mathematical core of machine learning and AI.
Given a continuous function \(L : \mathbb{R}^n \to \mathbb{R}\) and a domain \(D \subseteq \mathbb{R}^n\), the optimization problem is to find:

\[ \mathbf{x}^* = \arg\min_{\mathbf{x} \in D} L(\mathbf{x}) \]
This notation means: \(x^*\) is the point in \(D\) where \(L\) achieves its smallest value. The function \(L\) maps \(n\)-dimensional vectors to real numbers. The domain \(D\) specifies which inputs are allowed.
If the domain \(D\) is convex and the function \(L\) is convex (both defined precisely later in this chapter), then the optimization problem has a very special property: any local minimum is automatically the global minimum. This means simple methods like setting the derivative to zero are guaranteed to find the best possible answer. Convex optimization is covered in detail in Section 8.
6. Definitions and Notation (Linear Algebra)
Optimization in multiple dimensions requires the language of linear algebra. This section defines the essential objects: vectors, matrices, and their operations.
Vectors
A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered list of \(n\) real numbers:

\[ \mathbf{x} = (x_1, x_2, \ldots, x_n)^T \]
In optimization, we often need to find the best values for multiple variables simultaneously. A vector is simply a way to package all those variables into one object. Instead of writing "find the best \(x_1\), \(x_2\), \(\ldots\), \(x_n\)," we write "find the best \(\mathbf{x}\)."
Vector Operations
Vector addition: Add corresponding components: \((\mathbf{x} + \mathbf{y})_i = x_i + y_i\).
Scalar multiplication: Multiply every component by a number \(\alpha\): \((\alpha \mathbf{x})_i = \alpha x_i\).
Dot product (inner product): Multiply corresponding components and sum them: \(\langle \mathbf{x}, \mathbf{y} \rangle = \sum_{i=1}^{n} x_i y_i\).
The dot product measures how "aligned" two vectors are. If two vectors point in the same direction, their dot product is large and positive. If they are perpendicular, the dot product is zero. This concept is used extensively in optimization to measure how a function changes along a particular direction (via the gradient, defined later).
Norm (length): The Euclidean length of a vector:

\[ \|\mathbf{x}\| = \sqrt{\sum_{i=1}^{n} x_i^2} \]
The norm measures the "size" or "magnitude" of a vector. Note that \(\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}\).
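A small Python illustration of the dot product and norm (hand-rolled rather than using a library, to keep the definitions visible):

```python
import math

def dot(x, y):
    """Inner product: multiply corresponding components and sum."""
    return sum(xi * yi for xi, yi in zip(x, y))

def norm(x):
    """Euclidean norm: the square root of the dot product of x with itself."""
    return math.sqrt(dot(x, x))

x = [1.0, 2.0, 2.0]
y = [2.0, 0.0, -1.0]
d = dot(x, y)    # 1*2 + 2*0 + 2*(-1) = 0, so x and y are perpendicular
n = norm(x)      # sqrt(1 + 4 + 4) = 3.0
```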
Matrices
A matrix \(A \in \mathbb{R}^{n \times m}\) is a rectangular array of real numbers with \(n\) rows and \(m\) columns, with entry \(a_{ij}\) in row \(i\), column \(j\):

\[ A = \begin{pmatrix} a_{11} & \cdots & a_{1m} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nm} \end{pmatrix} \]
A matrix can be viewed as a collection of column vectors \(A = [\mathbf{A}_{\cdot 1},\; \mathbf{A}_{\cdot 2},\; \ldots,\; \mathbf{A}_{\cdot m}]\), where each \(\mathbf{A}_{\cdot j}\) is a vector in \(\mathbb{R}^n\). Alternatively, it can be viewed as a stack of row vectors.
Matrices represent linear transformations — operations that take a vector as input and produce another vector as output, while preserving the structure of addition and scalar multiplication. In optimization, matrices appear naturally as the second-order behavior of functions (the Hessian, defined later), and in many AI models (neural network weights are matrices).
Matrix Operations
Matrix addition: Add corresponding entries. Both matrices must have the same dimensions.
Matrix multiplication: For \(A \in \mathbb{R}^{n \times k}\) and \(B \in \mathbb{R}^{k \times m}\), the product \(C = A \cdot B \in \mathbb{R}^{n \times m}\) has entries:

\[ c_{ij} = \sum_{l=1}^{k} a_{il}\, b_{lj} \]
Each entry \(c_{ij}\) is the dot product of row \(i\) of \(A\) with column \(j\) of \(B\). The inner dimensions must match: if \(A\) is \(n \times k\), then \(B\) must have \(k\) rows.
Inverse Matrix
For a square matrix \(A \in \mathbb{R}^{n \times n}\), the inverse \(A^{-1}\) is the matrix such that:

\[ A \cdot A^{-1} = A^{-1} \cdot A = I \]
where \(I\) is the identity matrix (1s on the diagonal, 0s elsewhere). The inverse exists if and only if the matrix is invertible (also called non-singular).
If you have the equation \(A\mathbf{x} = \mathbf{b}\) and you need to solve for \(\mathbf{x}\), the answer is \(\mathbf{x} = A^{-1}\mathbf{b}\) (when the inverse exists). In optimization, solving for the minimum of a quadratic function requires solving a linear system, which involves matrix inverses.
Determinant
The determinant of a square matrix \(A\), written \(\det(A)\), is a scalar value computed by cofactor expansion:

\[ \det(A) = \sum_{j=1}^{n} a_{ij}\, C_{ij} \]
where \(C_{ij} = (-1)^{i+j} M_{ij}\) is the cofactor, and \(M_{ij}\) is the minor — the determinant of the matrix obtained by deleting row \(i\) and column \(j\) from \(A\). This expansion can be done along any row \(i\) or column \(j\).
A square matrix \(A\) is invertible if and only if \(\det(A) \neq 0\). If \(\det(A) = 0\), the matrix is singular and has no inverse.
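A toy 2x2 illustration of these operations (hand-rolled for clarity; real code would use a linear algebra library):

```python
def matmul(A, B):
    """C[i][j] is the dot product of row i of A with column j of B."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][l] * B[l][j] for l in range(k)) for j in range(m)]
            for i in range(n)]

def det2(A):
    """Determinant of a 2x2 matrix: a11*a22 - a12*a21."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def inv2(A):
    """Inverse of a 2x2 matrix; requires det(A) != 0."""
    d = det2(A)
    if d == 0:
        raise ValueError("matrix is singular")
    return [[A[1][1] / d, -A[0][1] / d],
            [-A[1][0] / d, A[0][0] / d]]

A = [[2.0, 1.0], [1.0, 1.0]]        # det(A) = 1, so A is invertible
I = matmul(A, inv2(A))              # A times its inverse gives the identity
```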
7. Gradient and Hessian
The gradient and Hessian are the multi-variable generalizations of the first and second derivatives. They are the primary tools for optimization in multiple dimensions.
Gradient
For a function \(f : \mathbb{R}^n \to \mathbb{R}\), the gradient \(\nabla f(\mathbf{x})\) is the vector of all partial derivatives:

\[ \nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1},\; \frac{\partial f}{\partial x_2},\; \ldots,\; \frac{\partial f}{\partial x_n} \right)^T \]
Each component \(\frac{\partial f}{\partial x_i}\) tells you how much \(f\) changes when you change \(x_i\) by a small amount, while holding all other variables fixed.
The gradient points in the direction of steepest increase of the function. If you want to decrease the function as quickly as possible, you move in the opposite direction of the gradient. This is the basis of gradient descent (Section 10).
At a minimum, the function is not increasing in any direction, so the gradient must be the zero vector: \(\nabla f(\mathbf{x}^*) = \mathbf{0}\). This is the necessary condition for optimality.
Hessian
For a function \(f : \mathbb{R}^n \to \mathbb{R}\), the Hessian \(\nabla^2 f(\mathbf{x})\) is the \(n \times n\) matrix of all second partial derivatives:

\[ \left( \nabla^2 f(\mathbf{x}) \right)_{ij} = \frac{\partial^2 f}{\partial x_i\, \partial x_j} \]
The Hessian captures how the function curves in each direction and in combinations of directions. It tells you whether a critical point (where the gradient is zero) is a minimum, maximum, or saddle point. It is also used to determine whether a function is convex (curves upward everywhere), which is the key property that makes optimization tractable.
Worked Example
Let \(f(x_1, x_2, x_3) = x_1^2 + 2x_2^2 - 5x_3^2 + 3x_1 x_2 - x_1 x_3 + 2x_2 x_3\).
Gradient:
- \(\frac{\partial f}{\partial x_1} = 2x_1 + 3x_2 - x_3\)
- \(\frac{\partial f}{\partial x_2} = 4x_2 + 3x_1 + 2x_3\)
- \(\frac{\partial f}{\partial x_3} = -10x_3 - x_1 + 2x_2\)
Hessian: Take the partial derivatives of each gradient component:

\[ \nabla^2 f = \begin{pmatrix} 2 & 3 & -1 \\ 3 & 4 & 2 \\ -1 & 2 & -10 \end{pmatrix} \]
Notice that the Hessian is a constant matrix (it does not depend on \(\mathbf{x}\)). This happens because \(f\) is a quadratic function. Also notice the Hessian is symmetric (\(\frac{\partial^2 f}{\partial x_i \partial x_j} = \frac{\partial^2 f}{\partial x_j \partial x_i}\)), which is always the case for smooth functions.
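One way to sanity-check the worked example is to compare the analytic gradient against central finite differences (a standard numerical check, not part of the original derivation):

```python
def f(x1, x2, x3):
    return x1**2 + 2*x2**2 - 5*x3**2 + 3*x1*x2 - x1*x3 + 2*x2*x3

def grad(x1, x2, x3):
    """Analytic gradient from the worked example."""
    return [2*x1 + 3*x2 - x3,
            4*x2 + 3*x1 + 2*x3,
            -10*x3 - x1 + 2*x2]

def numeric_grad(x, h=1e-6):
    """Central differences: perturb one variable at a time, hold the rest fixed."""
    g = []
    for i in range(3):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(*xp) - f(*xm)) / (2 * h))
    return g

x = [1.0, -2.0, 0.5]
analytic = grad(*x)       # [-4.5, -4.0, -10.0]
numeric = numeric_grad(x) # agrees up to roundoff
```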
8. Convex Optimization
Convex optimization is the class of optimization problems where finding the answer is "easy" in a precise mathematical sense. Understanding convexity is essential because it tells you when simple methods are guaranteed to work.
Convex Domain
A set \(D \subseteq \mathbb{R}^n\) is convex if for any two points \(\mathbf{x}, \mathbf{y} \in D\) and any \(\alpha \in [0, 1]\):

\[ \alpha \mathbf{x} + (1 - \alpha)\, \mathbf{y} \in D \]
In plain terms: pick any two points in the set. Draw a straight line segment between them. If every point on that line segment is also in the set, then the set is convex. If even one point on the line falls outside the set, it is not convex.
Convex Function
A function \(f : D \to \mathbb{R}\) is convex if for any \(\mathbf{x}, \mathbf{y} \in D\) and any \(\alpha \in [0, 1]\):

\[ f(\alpha \mathbf{x} + (1 - \alpha)\, \mathbf{y}) \leq \alpha f(\mathbf{x}) + (1 - \alpha)\, f(\mathbf{y}) \]
What this says: take any two points and look at the function values at those points. The weighted average of the function values (right side) is always greater than or equal to the function value at the weighted average of the points (left side). Geometrically, the line segment connecting any two points on the graph of \(f\) always lies on or above the graph itself.
Why Convexity Matters
If \(f\) is a convex function defined on a convex domain \(D\), then every local minimum of \(f\) is also a global minimum.
This means: if you find any point where the function is lower than all nearby points, that point is guaranteed to be the lowest point overall. You cannot get "stuck" at a wrong answer. This is what makes convex optimization tractable — simple search methods are guaranteed to find the best solution.
How to Check Convexity
There are two equivalent ways to verify that a function is convex:
Check that the Hessian matrix \(\nabla^2 f(\mathbf{x})\) is positive semidefinite at every point \(\mathbf{x}\) in the domain. A matrix \(H\) is positive semidefinite if:

\[ \mathbf{h}^T H\, \mathbf{h} \geq 0 \quad \text{for all } \mathbf{h} \in \mathbb{R}^n \]
This condition says: no matter which direction \(\mathbf{h}\) you look, the function curves upward (or stays flat) in that direction. It never curves downward.
For any points \(\mathbf{x}\) and \(\mathbf{x}_0\) in the domain:

\[ f(\mathbf{x}) \geq \nabla f(\mathbf{x}_0)^T (\mathbf{x} - \mathbf{x}_0) + f(\mathbf{x}_0) \]
This says: the function always lies on or above its tangent line (or tangent plane in higher dimensions) at every point. The tangent approximation always underestimates the true function value.
For a twice-differentiable function \(f\), the following are equivalent:
- \(f\) is convex (satisfies the definition above)
- \(f(\mathbf{x}) \geq \nabla f(\mathbf{x}_0)^T(\mathbf{x} - \mathbf{x}_0) + f(\mathbf{x}_0)\) for all \(\mathbf{x}, \mathbf{x}_0\)
- \(\nabla^2 f(\mathbf{x}) \succeq 0\) (Hessian is positive semidefinite) for all \(\mathbf{x}\)
Solving Convex Optimization Problems
Because any local minimum of a convex function is the global minimum, we can find the optimum by setting the gradient to zero:

\[ \nabla f(\mathbf{x}^*) = \mathbf{0} \]
If the function is convex, any solution to this equation is guaranteed to be the global minimizer. This is much simpler than searching the entire domain.
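As a quick illustration, we can probe the Hessian from the Section 7 worked example along a chosen direction \(\mathbf{h}\); a single direction with \(\mathbf{h}^T H\, \mathbf{h} < 0\) is enough to show the matrix is not positive semidefinite:

```python
def quad_form(H, h):
    """Compute h^T H h."""
    n = len(H)
    return sum(h[i] * H[i][j] * h[j] for i in range(n) for j in range(n))

# Hessian of the worked example from Section 7
H = [[2.0, 3.0, -1.0],
     [3.0, 4.0, 2.0],
     [-1.0, 2.0, -10.0]]

# Along h = (0, 0, 1) the function curves downward, so H is not
# positive semidefinite and the worked-example f is not convex.
value = quad_form(H, [0.0, 0.0, 1.0])   # -10.0
```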
9. Special Functions in Convex Optimization
Certain function types appear frequently in optimization. Understanding their properties (especially whether they are convex) saves time in practice.
Linear Transformation
A linear function from \(\mathbb{R}^n\) to \(\mathbb{R}\) has the form:

\[ f_{\mathbf{a}}(\mathbf{x}) = \mathbf{a}^T \mathbf{x} = \sum_{i=1}^{n} a_i x_i \]
where \(\mathbf{a} \in \mathbb{R}^n\) is a fixed vector of coefficients.
- Gradient: \(\nabla f_{\mathbf{a}}(\mathbf{x}) = \mathbf{a}\) (constant — does not depend on \(\mathbf{x}\)).
- Hessian: \(\nabla^2 f_{\mathbf{a}}(\mathbf{x}) = 0\) (the zero matrix).
- Convexity: Always convex (the zero matrix is trivially positive semidefinite).
Affine Transformation
An affine function adds a constant to a linear function:

\[ f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b \]
where \(b \in \mathbb{R}\) is a constant. The gradient and Hessian are the same as the linear case (adding a constant does not change derivatives). Affine functions are always convex.
Quadratic Form
A quadratic form is a function of the form:

\[ f(\mathbf{x}) = \mathbf{x}^T A\, \mathbf{x} \]

where \(A \in \mathbb{R}^{n \times n}\) is a matrix. Expanding this expression:

\[ \mathbf{x}^T A\, \mathbf{x} = \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij}\, x_i x_j, \]

the sum of all squared terms (\(a_{ii}x_i^2\)) and all cross terms (\(a_{ij}x_i x_j\) for \(i \neq j\)).
Symmetric Representation
Every quadratic form can be represented using a symmetric matrix \(S\) where \(s_{ij} = s_{ji}\). Given any matrix \(A\), define:

\[ S = \frac{A + A^T}{2} \]
Then \(\mathbf{x}^T A\, \mathbf{x} = \mathbf{x}^T S\, \mathbf{x}\). The symmetric matrix \(S\) is unique — there is exactly one symmetric matrix that produces the same quadratic form.
Using the symmetric representation simplifies analysis. The gradient and Hessian of a quadratic form take clean, simple forms when expressed in terms of the symmetric matrix \(S\).
For \(f(\mathbf{x}) = \mathbf{x}^T S\, \mathbf{x}\) with \(S\) symmetric:
- Gradient: \(\nabla f(\mathbf{x}) = 2S\,\mathbf{x}\)
- Hessian: \(\nabla^2 f(\mathbf{x}) = 2S\)
The quadratic form is convex if and only if \(S\) is positive semidefinite (i.e., \(\mathbf{h}^T S\, \mathbf{h} \geq 0\) for all \(\mathbf{h}\)).
To find the minimum of a convex quadratic form, set the gradient to zero:

\[ \nabla f(\mathbf{x}^*) = 2S\,\mathbf{x}^* = \mathbf{0} \]
For a pure quadratic form (no linear term), the minimum is always at the origin. For a more general quadratic \(f(\mathbf{x}) = \mathbf{x}^T S\,\mathbf{x} + \mathbf{b}^T \mathbf{x} + c\), setting the gradient to zero gives \(2S\,\mathbf{x}^* + \mathbf{b} = \mathbf{0}\), so \(\mathbf{x}^* = -\frac{1}{2}S^{-1}\mathbf{b}\) (when \(S\) is invertible).
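A small sketch of this recipe for a 2x2 positive definite \(S\) (Cramer's rule stands in for a general linear solver; the specific numbers are illustrative):

```python
def solve2(S, rhs):
    """Solve the 2x2 linear system S x = rhs by Cramer's rule."""
    d = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [(rhs[0] * S[1][1] - S[0][1] * rhs[1]) / d,
            (S[0][0] * rhs[1] - rhs[0] * S[1][0]) / d]

# f(x) = x^T S x + b^T x with a positive definite (hence convex) S
S = [[2.0, 0.0], [0.0, 1.0]]
b = [-4.0, 2.0]

# Setting the gradient 2 S x + b to zero means solving 2 S x* = -b
x_star = solve2([[2 * s for s in row] for row in S], [-bi for bi in b])  # [1.0, -1.0]
```

Plugging back in, the gradient \(2S\,\mathbf{x}^* + \mathbf{b}\) is exactly zero at the computed point, confirming it is the global minimizer.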
10. Non-Convex Optimization and Gradient Descent
Most functions in AI (particularly in deep learning) are not convex. When the objective function is non-convex, we lose the guarantee that any local minimum is the global minimum. The function may have many local minima, and simple methods can get trapped at any of them.
Despite the lack of guarantees, we still need to minimize non-convex functions. Neural networks have non-convex loss landscapes, yet gradient-based methods work remarkably well in practice. The key insight: having the gradient gives us directional information that is far more useful than blind random search, even without convexity guarantees.
Gradient Descent
Gradient descent is an iterative algorithm that updates the current estimate \(\mathbf{x}^*\) by moving in the opposite direction of the gradient:

\[ \mathbf{x}^* \leftarrow \mathbf{x}^* - \alpha\, \nabla f(\mathbf{x}^*) \]

where \(\alpha > 0\) is the learning rate (also called step size), typically a small value such as \(0.1\) or \(0.01\).
- Choose a random starting point \(\mathbf{x}^*\) and a learning rate \(\alpha\).
- Compute the gradient \(\nabla f(\mathbf{x}^*)\).
- Update: \(\mathbf{x}^* \leftarrow \mathbf{x}^* - \alpha \nabla f(\mathbf{x}^*)\).
- Repeat steps 2–3 until the gradient is approximately zero (\(\|\nabla f(\mathbf{x}^*)\| < \varepsilon\)) or a maximum number of iterations is reached.
The gradient \(\nabla f(\mathbf{x}^*)\) points in the direction where \(f\) increases most steeply. By subtracting it (moving in the opposite direction), we move toward lower values of \(f\). The learning rate \(\alpha\) controls how large each step is:
- If \(\alpha\) is too large, the steps overshoot and the method may diverge.
- If \(\alpha\) is too small, convergence is very slow.
- Choosing a good \(\alpha\) is one of the practical challenges of gradient descent.
Gradient descent finds local optima, not necessarily the global optimum. For non-convex functions, different starting points may lead to different local minima. This is an inherent limitation — there is no efficient general method for finding global minima of non-convex functions.
However, for convex functions, gradient descent is guaranteed to converge to the global minimum (with appropriate learning rate).
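The update rule above, in a minimal Python sketch (the learning rate and stopping threshold are arbitrary illustrative choices):

```python
def gradient_descent(grad, x0, alpha=0.1, eps=1e-8, max_iters=10_000):
    """Minimize by stepping against the gradient until it is nearly zero."""
    x = list(x0)
    for _ in range(max_iters):
        g = grad(x)
        if sum(gi * gi for gi in g) ** 0.5 < eps:   # ||grad|| < eps: converged
            break
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x

# Convex example: f(x, y) = (x - 1)^2 + 2 y^2, with gradient (2(x - 1), 4y).
x_min = gradient_descent(lambda x: [2 * (x[0] - 1), 4 * x[1]], [5.0, -3.0])
```

On this convex function the iterates converge to the global minimum \((1, 0)\) from any starting point; on a non-convex function the same loop can stop at whichever local minimum the starting point leads to.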
11. Brent's Method
Brent's method is a classical 1D optimization technique that combines the reliability of bracketing methods with the speed of parabolic interpolation.
Brent's method approximates the function \(f(x)\) locally by a parabola (quadratic polynomial) that passes through three known points \((a, f(a))\), \((b, f(b))\), and \((c, f(c))\). The minimum of this parabola gives the next estimate for the minimum of \(f\).
Binary search (Section 2) is reliable but slow — it only uses sign information. Gradient descent (Section 10) is faster but requires computing derivatives. Brent's method occupies a middle ground: it uses only function values (no derivatives needed) but converges faster than binary search by fitting a parabola to exploit the shape of the function.
Given three points \(a\), \(b\), \(c\) with known function values \(f(a)\), \(f(b)\), \(f(c)\), the minimum of the interpolating parabola is at:

\[ x = b - \frac{1}{2} \cdot \frac{(b-a)^2\,[f(b)-f(c)] - (b-c)^2\,[f(b)-f(a)]}{(b-a)\,[f(b)-f(c)] - (b-c)\,[f(b)-f(a)]} \]
- Start with three points \(a\), \(b\), \(c\) that bracket the minimum.
- Compute the parabola minimum \(x\) using the formula above.
- Evaluate \(f(x)\).
- Replace the worst of the three points with \(x\) and repeat.
- If the parabolic step would go outside the bracket, fall back to a bisection step (this is what makes Brent's method robust).
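The parabolic-interpolation step can be sketched directly from the standard three-point formula (for an exactly quadratic function, a single step lands on the true minimum):

```python
def parabola_min(a, b, c, fa, fb, fc):
    """Minimum of the parabola through (a, fa), (b, fb), (c, fc)."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

# f is itself a parabola, so one interpolation step finds the exact minimum.
f = lambda x: (x - 2.0) ** 2 + 1.0
x = parabola_min(0.0, 1.0, 3.0, f(0.0), f(1.0), f(3.0))   # x == 2.0
```

For a general smooth function the step is only approximate, which is why the full method keeps a bracket and falls back to bisection when the parabolic step misbehaves.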
Extension to Multiple Dimensions
Brent's method is inherently a 1D method. To use it for multidimensional optimization, you can apply it along one direction at a time:
- Choose a search direction \(\mathbf{d}\) in \(\mathbb{R}^n\).
- Define the 1D function \(g(t) = f(\mathbf{x} + t\,\mathbf{d})\).
- Use Brent's method to find \(t^*\) that minimizes \(g(t)\).
- Update \(\mathbf{x} \leftarrow \mathbf{x} + t^*\,\mathbf{d}\).
- Choose a new direction \(\mathbf{d}\) and repeat.
The choice of search directions \(\mathbf{d}\) at each step determines the efficiency of the overall method. Common strategies include using the coordinate axes (one variable at a time) or using conjugate directions.
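A sketch of this coordinate-wise scheme, with a simple ternary search standing in for Brent's method as the 1D minimizer (all names and bounds here are illustrative):

```python
def line_min(g, lo=-10.0, hi=10.0, iters=100):
    """1D minimization by ternary search (a stand-in for Brent's method)."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2

def coordinate_descent(f, x0, sweeps=50):
    """Minimize f by line-minimizing along one coordinate axis at a time."""
    x = list(x0)
    for _ in range(sweeps):
        for i in range(len(x)):
            # restrict f to axis i: vary coordinate i, hold the others fixed
            g = lambda t, i=i: f(x[:i] + [t] + x[i+1:])
            x[i] = line_min(g)
    return x

x_min = coordinate_descent(lambda v: (v[0] - 1) ** 2 + (v[1] + 2) ** 2, [0.0, 0.0])
```

For this separable quadratic, one sweep over the axes already reaches the minimum at \((1, -2)\); for functions with strongly coupled variables, coordinate directions can zigzag slowly, which is why conjugate directions are preferred.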