Gradient boosting is gradient descent in function space

Posted on Thu 13 July 2017 in posts

There are so many different ways to look at a machine learning algorithm and understand what it does, how it does it, and why it works. Linear regression can be thought of as minimizing a particular loss function or, equivalently, projecting a vector onto a subspace; SVMs can be thought of as finding a hyperplane which maximizes the margin or, equivalently, finding basis coefficients of kernel functions which are members of a reproducing kernel Hilbert space!

I've read a great deal about gradient boosting and each explanation varies in its granularity and its perspective. Here, I'd like to start from the most generic perspective on gradient boosting and walk through the motivation and mechanics of the algorithm. Mainly, I'll be "explaining to myself" the classic paper by Friedman, Greedy Function Approximation: A Gradient Boosting Machine.

Gradient boosting != gradient boosted decision trees

This might be more appropriate to mention after I've introduced the problem, but I think this is an important point. Gradient boosting is a method to build up a function as the sum of other functions. These individual functions could be anything, but in practice they are usually decision trees. In that case, we call it "gradient boosted decision trees" or "gradient boosted trees" or ... something similar. So, gradient boosting with decision trees is a particular subset of the overall gradient boosting paradigm. To understand gradient boosting in the broader sense, I'll keep references to trees out of this post.

The machine learning problem

First, separate yourself from the idea that machine learning is just "minimizing a loss function over a training set." Instead, imagine a system that produces a random outcome variable $y$ from a vector of explanatory variables $\mathbf{x}$. We want to find a function that minimizes the expected value of the loss function over the joint distribution of all possible inputs and outputs.

$$ F^* = \arg \min_F E_{y, \mathbf{x}} \left[ L(y, F(\mathbf{x})) \right] = \arg \min_F E_{\mathbf{x}} \left[ E_y \left[ L(y, F(\mathbf{x})) \mid \mathbf{x} \right] \right] $$

For any given feature vector $\mathbf{x}$, the system can produce many different values of $y$. In fact, there's a distribution over $y$, and for any given $\mathbf{x}$, we want to find a function which minimizes the expected value of the loss function over all possible values of $y$. And since the explanatory variables take on values over a distribution, we want to further minimize the expectation over all possible values of $\mathbf{x}$.

In practice we don't know the true joint distributions of the inputs and outputs, so we seek to minimize the loss over a set of training samples $\{y_i, \mathbf{x}_i\}$ in hopes that it will provide a good approximation to the true minimizer.

Optimization with parameters

We'll get more into what $F$ is momentarily, but consider how we approach many other machine learning problems. We would typically represent $F$ in terms of some finite number of "parameters" (think of parameters as being real numbers), and we'd then find the parameters that minimize the loss over the training set. In linear regression, for example, $F$ would be $F(\mathbf{x}) = \boldsymbol{\beta}^T\mathbf{x}$, where $\boldsymbol{\beta}$ is the vector of parameters that we use to represent $F$. Finding the $\beta_j$ that minimize the loss is the same as finding the linear function $F$ that minimizes the loss.

A common way to find those parameters is to start with some initial guess, and then keep incrementing the parameters (or taking "steps") until we arrive at the optimum. In this way, we can think of the best parameters as an additive expansion of the form:

$$ \mathbf{P}^* = \sum_{m=0}^M \mathbf{p}_m $$

In other words, we build up the best parameter vector as the sum of other parameter vectors. This is what optimization methods like gradient descent do.
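
Here's a minimal sketch of that idea for a least-squares linear regression (the data, learning rate, and variable names are my own illustrative choices, not from the paper): the final parameter vector is literally the sum of the initial guess and all of the gradient descent steps.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(100)

beta = np.zeros(3)                            # initial guess p_0
steps = [beta.copy()]                         # keep every increment p_m
for m in range(500):
    grad = -X.T @ (y - X @ beta) / len(y)     # gradient of the squared-error loss
    p_m = -0.1 * grad                         # step in the negative gradient direction
    beta = beta + p_m
    steps.append(p_m)

# the optimal parameters are an additive expansion: the sum of the steps
print(np.allclose(beta, np.sum(steps, axis=0)))  # True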

Optimization with functions

We're used to differentiating the loss with respect to parameters to get gradients for those parameters, but something more foreign is to differentiate the loss with respect to functions. Here, we wish to take a nonparametric approach, i.e. instead of representing the prediction function in terms of some parameters, we want to find an arbitrary prediction function which minimizes the loss.

It will be helpful to be able to think of a function as a vector whose entries are simply the values $F(\mathbf{x})$ for every possible input $\mathbf{x}$ in the function's domain. For a function defined on a continuous domain, this would be an infinitely long vector, but later we'll see cases where it simplifies to something finite.

As an example, consider the function $f(x) = \sin(x)$

In [9]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# sample sin(x) at 10 evenly spaced points on [-pi, pi]
x = np.linspace(-np.pi, np.pi, 10)
y = np.sin(x)
plt.scatter(x, y)
Out[9]: [scatter plot of the 10 sampled points of $\sin(x)$]

We can think of the function $\sin(x)$ as the vector $y$ in the code above.

In [10]:
print(y[:, np.newaxis])
[[ -1.22464680e-16]
 [ -6.42787610e-01]
 [ -9.84807753e-01]
 [ -8.66025404e-01]
 [ -3.42020143e-01]
 [  3.42020143e-01]
 [  8.66025404e-01]
 [  9.84807753e-01]
 [  6.42787610e-01]
 [  1.22464680e-16]]

In the case of a true function defined for all $x$, imagine that vector growing ever longer as points on the graph are filled in at infinitely small intervals.

In the problem at hand, we can think of the loss function as a vector with infinite entries, one for each $\mathbf{x}$, where each entry is:

$$ \phi(F(\mathbf{x})) = E_y \left[ L(y, F(\mathbf{x})) \mid \mathbf{x} \right] $$

The arbitrary function which best minimizes the loss is the sum of individual functions:

$$ F^*(\mathbf{x}) = \sum_{m=0}^M f_m(\mathbf{x}) $$

where the first function $f_0(\mathbf{x})$ is an initial guess and the subsequent $f_m(\mathbf{x})$ are function steps, usually in the direction of the negative gradient. To perform steepest descent using functions, we need to find the gradient, which is also a function. The gradient function is just:

$$ g_m(\mathbf{x}) = \left[\frac{\partial \phi(F(\mathbf{x}))}{\partial F(\mathbf{x})}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} $$

You can show that this reduces to:

$$ g_m(\mathbf{x}) = E_y \left[\frac{\partial L(y, F(\mathbf{x}))}{\partial F(\mathbf{x})} \mid \mathbf{x}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} $$

So, the gradient is a function, or infinite-dimensional vector, with each entry being the expected derivative of the loss with respect to $F(\mathbf{x})$ at a particular value of $\mathbf{x}$.

To make it slightly more concrete, consider the squared-error loss:

$$ L(y, F(\mathbf{x})) = \frac{1}{2}(y - F(\mathbf{x}))^2 \\ g_m(\mathbf{x}) = -E_y \left[(y - F(\mathbf{x})) \mid \mathbf{x}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} $$

So the negative gradient function in this case is the expected residual $y - F(\mathbf{x})$ for each value of $\mathbf{x}$. Neat!
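
As a quick sanity check (my own illustration, not from the paper), we can drop the expectation, treat the single observed $y_i$ at each training point as the target, and verify numerically that the negative derivative of the squared-error loss with respect to each $F(\mathbf{x}_i)$ is just the residual $y_i - F(\mathbf{x}_i)$:

import numpy as np

rng = np.random.RandomState(1)
x = rng.uniform(-3, 3, size=20)
y = np.sin(x) + 0.1 * rng.randn(20)
F = np.zeros_like(y)                     # some current model; here F(x) = 0 everywhere

def total_loss(F_vals):
    return np.sum(0.5 * (y - F_vals) ** 2)

# numerically differentiate the loss with respect to each entry F(x_i)
eps = 1e-6
grad = np.empty_like(F)
for i in range(len(F)):
    F_plus, F_minus = F.copy(), F.copy()
    F_plus[i] += eps
    F_minus[i] -= eps
    grad[i] = (total_loss(F_plus) - total_loss(F_minus)) / (2 * eps)

print(np.allclose(-grad, y - F))         # True: the negative gradient is the residual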

Finite data

In practice, the gradient function is only defined at the $N$ data points, since the function $F$ only affects the loss when evaluated at the individual $\mathbf{x}_i$. Because of this, we can think of the functions as $N$-dimensional vectors, and the gradient function also as an $N$-dimensional vector.

$$ -\mathbf{g}_m = \{-g_m(\mathbf{x}_i)\}_1^N \\ g_m(\mathbf{x}_i) = \left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})} $$

If we want to iteratively add up arbitrary $N$-dimensional vectors without any constraints to get to one that minimizes the training loss, we'll end up with the vector $\mathbf{y}$. But this is a useless model! Think of it as a partial function that is only defined for values of $\mathbf{x}$ that appear in the training set. It's just the vector of training labels!

It doesn't predict any other values than what's in the training set. Instead, we can choose to add functions that are members of a parameterized class of functions $h(\mathbf{x}; \mathbf{a})$ which produces functions $\mathbf{h}_m = \{h(\mathbf{x}_i; \mathbf{a}_m)\}_1^N$ that are most parallel to $-\mathbf{g}_m \in \mathbb{R}^N$.

That way, we are still stepping approximately in the direction of the negative gradient in the $N$-dimensional data space, but using functions which are defined for all possible inputs, so that they generalize beyond the training set. This is like constrained gradient descent in function space.

More specifically, at each step we will add a function $h(\mathbf{x}; \mathbf{a})$ that is "closest" to the gradient vector by finding the parameters of h that minimize the following:

$$ \mathbf{a}_m = \arg \min_{\mathbf{a}, \beta} \sum_{i=1}^N\left[-g_m(\mathbf{x}_i) - \beta h(\mathbf{x}_i;\mathbf{a})\right]^2 $$

This produces a "constrained" negative gradient $h(\mathbf{x}; \mathbf{a}_m)$ that is used in place of the unconstrained negative gradient $-\mathbf{g}_m$. At each boosting step, the function $F$ is updated as:

$$ F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m h(\mathbf{x}; \mathbf{a}_m) $$

where the step size $\rho_m$ is determined via the line search:

$$ \rho_m = \arg \min_{\rho} \sum_{i=1}^N L\left(y_i, F_{m-1}(\mathbf{x}_i) + \rho h(\mathbf{x}_i; \mathbf{a}_m)\right) $$
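
Here's what a single boosting step looks like in code. This is a sketch under my own assumptions, not Friedman's: squared-error loss, a cubic polynomial standing in for the parameterized family $h(\mathbf{x}; \mathbf{a})$ (so $\beta$ gets absorbed into the least-squares fit), and scipy's minimize_scalar doing the line search.

import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.RandomState(2)
x = rng.uniform(-3, 3, size=50)
y = np.sin(x) + 0.1 * rng.randn(50)

F_prev = np.full_like(y, y.mean())        # the current model F_{m-1} at the training points

# negative gradient of the squared-error loss: just the residuals
neg_grad = y - F_prev

# fit h(x; a_m) to the negative gradient by least squares
a_m = np.polyfit(x, neg_grad, deg=3)
h_m = np.polyval(a_m, x)

# line search for the step size rho_m
step_loss = lambda rho: np.sum(0.5 * (y - (F_prev + rho * h_m)) ** 2)
rho_m = minimize_scalar(step_loss).x

F_m = F_prev + rho_m * h_m                # the updated model F_m at the training points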

Algorithm: Gradient Boost

  1. $F_0(\mathbf{x}) = \arg \min_\rho \sum_{i=1}^N L(y_i, \rho)$
  2. $\text{For } m = 1 \text{ to } M \text{ do:}$
  3. $\tilde{y}_i = -\left[\frac{\partial L(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)}\right]_{F(\mathbf{x}) = F_{m-1}(\mathbf{x})}, \quad i = 1, \ldots, N$
  4. $\mathbf{a}_m = \arg \min_{\mathbf{a}, \beta} \sum_{i=1}^N\left[\tilde{y}_i - \beta h(\mathbf{x}_i;\mathbf{a})\right]^2$
  5. $\rho_m = \arg \min_{\rho} \sum_{i=1}^N L\left(y_i, F_{m-1}(\mathbf{x}_i) + \rho h(\mathbf{x}_i; \mathbf{a}_m)\right)$
  6. $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \rho_m h(\mathbf{x}; \mathbf{a}_m)$
  7. $\text{end For}$
  8. $\text{end Algorithm}$
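
To tie it all together, here is a minimal sketch of the algorithm above in code. Nothing here is prescribed by the paper: I pass in the loss and its derivative so the loop isn't tied to squared error, use cubic polynomials as the base learner class $h(\mathbf{x}; \mathbf{a})$ to keep trees out of it, and let scipy's minimize_scalar handle both the initial constant and the line searches. The numbers in the comments refer to the steps above.

import numpy as np
from scipy.optimize import minimize_scalar

def gradient_boost(x, y, loss, dloss_dF, M=20, degree=3):
    # 1. F_0 is the constant that minimizes the training loss
    F0 = minimize_scalar(lambda c: np.sum(loss(y, np.full_like(y, c)))).x
    F = np.full_like(y, F0)
    steps = []
    for m in range(1, M + 1):                               # 2. for m = 1 to M
        y_tilde = -dloss_dF(y, F)                           # 3. pseudo-residuals
        a_m = np.polyfit(x, y_tilde, deg=degree)            # 4. fit h(x; a_m) by least squares
        h = np.polyval(a_m, x)
        rho_m = minimize_scalar(                            # 5. line search for the step size
            lambda rho: np.sum(loss(y, F + rho * h))).x
        F = F + rho_m * h                                   # 6. update the model
        steps.append((rho_m, a_m))
    return F0, steps, F

# squared-error loss and its derivative with respect to F(x_i)
sq_loss = lambda y, F: 0.5 * (y - F) ** 2
sq_grad = lambda y, F: -(y - F)

rng = np.random.RandomState(3)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.randn(200)
F0, steps, F = gradient_boost(x, y, sq_loss, sq_grad)
print("training MSE:", np.mean((y - F) ** 2))

With squared-error loss the pseudo-residuals are literally the residuals $y_i - F_{m-1}(\mathbf{x}_i)$, so each round fits the base learner to whatever the current model hasn't yet explained; swapping in a different loss only changes the pseudo-residuals and the line search.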

References

This post was an attempt at explaining sections 1, 2, and 3 of the paper by Jerome Friedman:

Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29(5), 1189-1232.