GLMs Part II - Exponential families and the GLM log-likelihood

Posted on Thu 06 July 2017 in posts

Generalizing the linear model

Sure, everybody knows what linear regression is (unless they are seriously uncool), but only the most hip among us know that a linear regression is just a Generalized Linear Model (GLM) with a Gaussian family and an identity link function. Say wut? Yes, GLMs are the old, unpopular parents who spawned famous children like linear and logistic regression. GLMs are generalized, which means that they are far less specific than a linear regression and far more adaptable to different types of prediction problems. A consequence of this, as we'll soon find out, is that there are a LOT of symbols, the notation is heavy, and shit gets crazy real fast.

But GLMs are important, ok? They're not as cool as deep learning. Hell, they're not even as cool as decision trees! But they work well, they've been studied and in use for a long time, and they're fairly easy to interpret. Linear models are the "Hello World!" of machine learning. They also make for great job interview questions - don't ever trust a data scientist that can't give a sound, thorough explanation of a linear model!

With that, we'll begin the dirty work of trying to understand these archaic beasts. The notation is heavy and there are going to be lots of greek letters along the way, but we'll take it slow, add some intuition along the way, and maybe drink a beer (wine also acceptable) while we do it.

Formulating the problem

I always find it helpful to state what it is we're trying to accomplish, at a high level first, and then try to drill down from there. To start, we have some data $X$ that influences some outcome, some result $y$, in some way. Going one step further, we can say that our output $y$ is produced by our data $X$ through some process.

We want to use the data to find the process by which the data generates the outcome.

For clarity, let's use an example. We have some data on a house, say the size and number of bedrooms, and we know the price of the house. The data is known and fixed and the price is a result of the data being what it is. There is some process by which the price of a house is produced from the size and number of bedrooms of that house (a coarse model, to be sure). Say a house is 1000 square feet and has 2 bedrooms - that is our data. Let's say the price of the house is 1.5 million (it's in Palo Alto, people!) - that is our outcome. That process, in real life, might be that a seller decides that their 1000 sq ft and 2 bedroom house should go for 1.4 million, which we'll call the listing price. But due to some other unpredictable circumstances, the house doesn't sell for exactly 1.4 million. There is a bit of "randomness" in the final price of each house relative to the initial listing price.

We have now actually loosely specified a model by which the data produces the outcome:

  1. Use the data to find a reasonable expected value for the outcome.
  2. Inject randomness.

We can state this more generically. First, injecting that randomness is the same as saying that our outcome $Y$ is a random variable that is drawn from a probability distribution. For that house with 2 bedrooms and 1000 square feet, we would say that the price is drawn from some distribution that has a mean of 1.4 million. The shape and width of that distribution must be determined - we'll use some data for that! Let's restate our goal:

We want to use our data to find the probability distribution that the random variable $Y$ is drawn from.

This is still pretty vague! A couple of things will help us simplify this. First, since we're dealing with generalized linear models here, we know we want to use linear combinations of the data.

We want to use a linear combination of our data to find the probability distribution that the random variable $Y$ is drawn from.

Second, we assume that the outcomes are all drawn from the same type of distribution. For example, we might assume all of the outcomes are drawn from a Gaussian distribution or a Poisson distribution or a Binomial distribution. The type of distribution is a modeling choice and is selected beforehand. In a GLM, we use only specific types of probability distributions that can be fully specified by a finite number of distribution parameters. This assumption means that "finding the probability distribution that the random variable $Y$ is drawn from" actually means "finding the parameters of the probability distribution that the random variable $Y$ is drawn from." Ok, so now we restate:

We want to use a linear combination of our data to find the parameters of the probability distribution that the random variable $Y$ is drawn from.

For GLMs, it is possible to drill down even further because of yet another assumption. In a GLM, we limit ourselves to only specific types of parameterizable distributions - distributions from the exponential family. We will find out, in short time, all about this family and the mathematics that come with it. But for now, we'll just restate our problem to incorporate this knowledge:

We want to use a linear combination of our data to find the parameters of the exponential family distribution that the random variable $Y$ is drawn from.

We still have a problem, that we don't know and cannot ever truly know, the parameters of the exponential family distribution that outcome is drawn from. In fact, it is unlikely that the outcome was even generated this way! So, what we are truly after is to find, given our assumptions about how the data was generated, the parameters of the exponential family distribution that makes the data most likely to have occurred. This is called maximum likelihood estimation and was introduced in the first post of this series.

We want to use a linear combination of our data to find the parameters of the exponential family distribution that maximize the likelihood of observing the outcome data in the training set.

And that, my friends, is really it. That's the problem that a GLM aims to solve - given some distribution from the exponential family, what is the best way to relate a linear combination of the data to the parameters of that distribution, in order to maximize some likelihood function?

Exponential families

There is surely a whole lot of literature on exponential family distributions that will indulge those who seek a formal treatment of them. Here, I will try to stick to a more intuitive approach.

Exponential family distributions are probability distributions that obey a specific form. All distributions that come from this family - Gaussian, Poisson, Binomial, Gamma, etc... - share nice mathematical properties that, in the case of GLMs, will be rather important. Without further ado, exponential families are probability distributions of the form:

$$ f\left(y|\theta, \phi, w\right) = e^{\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)}\\ Y_i \sim f\left(\cdot|\theta_i, \phi, w_i\right) $$

I warned you about the symbols, didn't I? Let's not get too worked up before we know just how bad this really is. Let's break this down:

  • $w_i$ are known weights, usually 1
  • $\phi$ is a constant scale parameter that is the same for all $Y_i$
  • $\theta_i$ is a canonical parameter, the parameter of interest, that is different for each sample

Ok, so the weights are known beforehand and are usually equal to one. We shouldn't worry too much about the weights then. $\phi$ is a constant - that makes things much simpler. Finally, $\theta_i$ is the parameter of interest, so that's what we'll worry about. $\theta_i$ is the parameter we're talking about above when we stated that "We want to use a linear combination of our data to find the parameters of the exponential family distribution that the random variable $y$ is drawn from."

Some common exponential family distributions

Before we go too much further, let's take a quick moment to list some example distributions that fit the form of the density function shown above:

  • Gaussian (normal)
  • Bernoulli
  • Poisson
  • Gamma
  • Chi-square
  • Dirichlet

Now, I know what you're probably thinking... "I know the density functions for some of those distributions and they sure do not fit that $e^{\text{stuff}}$ format!" Well, fortunately or unfortunately, that is incorrect. Yes, even the Bernoulli distribution $p^k (1-p)^{1-k}$ can be stuffed into that format shown above. Check it out here

How does the exponential family fit into the GLM problem?

We said above that $\theta_i$ is the important part, so let's talk more about it. In an exponential family distribution, we have the relation:

$$ \theta_i = h(\mu_i)\\ \mu_i = E\left[Y_i\right] $$

This says that the parameter that defines our distribution $\theta_i$ is related to the expected value of the outcome through some function $h(\mu)$. This function is known and is defined by the specific exponential family distribution. For example, in a normal distribution $h(\mu) = \mu$ and in a Poisson distribution $h(\mu) = ln(\mu)$.

Now that we know $\theta_i$ is related to $\mu_i$, we can relate this back to our original goal. We know we are seeking to relate a linear combination of the input data to the parameter that defines our distribution. Specifically, we want relate $X\vec{\beta}$ to $\vec{\theta}$; but since we know that $\vec{\theta}$ is just some function of $\mu$, we can restate this by saying we want to relate $X\vec{\beta}$ to $\vec{\mu}$. For most of this post, we will use a variable $\vec{\eta} = X\vec{\beta}$ which is sometime called the "linear predictor." With that in mind, we say finally that our task now is to relate the linear predictor, $\eta$, to the expected value of the outcome, $\vec{\mu}$.

Intuitively, we are asking "how does the expected value of $Y$ change as the data changes linearly?" For an ordinary least squares model, we say that $E\left[Y\right]$ varies identically with $\vec{\eta}$. For a logistic regression, we need to restrict $E\left[Y\right]$ to lie in the interval [0, 1], so we cannot say the same. Instead, the sigmoid function is used to restrict $E\left[Y\right]$ to [0, 1]. If $Y$ were a count variable, we would want to restrict $Y$ to be positive.

In order to relate the linear predictor to the expected value of the outcome, we use what is called a link function. The link function is a model choice made by the practitioner and depends on the nature of the outcome variable. Formally, the link function $g(\mu)$ is defied as:

$$ g(\mu_i) = \eta_i $$

Now, recall that the distribution parameter $\theta_i$ is related to the expected value $\mu_i$ through the function $\theta_i = h(\mu_i)$. This means that we can now relate the distribution parameter to the linear predictor:

$$ \theta_i = h(\mu_i)\\ \mu_i = g^{-1}(\eta_i)\\ \theta_i = h(g^{-1}(\eta_i)) $$

Remember that we are trying to find the parameter $\vec{\theta}$ that makes our data most likely to have occurred. This is done by finding a "likelihood" equation that depends on $\vec{\theta}$ and then finding the value of $\vec{\theta}$ that maximizes the likelihood. Since $\vec{\theta}$ depends on $\vec{\eta}$, we can reformulate this problem by finding the values of $\vec{\eta}$ that maximize the likelihood. Finally, since $\vec{\eta} = X\vec{\beta}$, then we can find the regression coefficients that give us the maximum likelihood.

We have effectively stated how $\vec{\theta}$ depends on $\vec{\beta}$ and so we are ready to find the likelihood equation and solve it for the best value of $\vec{\beta}$.

Note: It is often the case that the link function is chosen such that $g(\mu) = h(\mu)$. If this is the case, then we say that $g(\mu)$ is the canonical link function and much of the math simplifies nicely. In particular, we have that $\theta_i = h(g^{-1}(\eta_i)) = \eta_i$. Many texts and papers present the derivations for GLMs assuming the canonical link, but I think it is better to understand the more general case, and simplify only if possible.

MLE for GLMs

We are tasked with finding the regression coefficients that maximize the likelihood of the data. So, first we need to define the likelihood function, then we need to see how our regression coefficients $\vec{\beta}$ influence that function. Finally we will want to find the values for the $\vec{\beta}$ vector that maximize the likelihood function.

Since we assume that our outcome variable is drawn from an exponential family distribution, we know that the probability density function for the distribution is given by:

$$ f\left(y_i|\theta_i, \phi, w_i\right) = e^{\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)} $$

Following the same logic used for linear regression, we can find the joint density for all the $y_i$ as:

$$ \mathcal{L}(\vec{\theta}|\vec{y},X) = \prod_{i=1}^{N} e^{\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)} $$

The log-likelihood function is:

$$ \mathcal{l}(\vec{\theta}|\vec{y},X) = \sum_{i=1}^{N} \frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi) $$

Now we have a function of $\vec{\theta}$ that we would like to maximize with respect to $\vec{\theta}$ (we will connect this to $\vec{\beta}$ as we go). Pulling out our calculus I skills, we can take the derivative of the function and set equal to zero. The derivative of the log-likelihood, $\partial{\mathcal{l}}/\partial{\theta}$ is often referred to as the "score" and will be denoted as $U$. Taking the score and setting equal to zero, we have (assume the $w_i = 1$):

$$ \vec{U}(\vec{\beta}) = \frac{\partial \mathcal{l}}{\partial \vec{\beta}} = \frac{\partial \vec{\theta}}{\partial \vec{\beta}} \cdot \frac{\partial \mathcal{l}}{\partial \vec{\theta}}\\ = \vec{0} $$

Expanding the second term above with the definition of the likelihood function:

$$ \frac{\partial \mathcal{l}}{\partial \vec{\theta_j}} = \frac{\partial}{\partial \vec{\theta_i}} \left( \sum_{i=1}^{N} \frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi) \right)\\ = \frac{y_j - b'(\theta_j)}{\phi}\\ = \frac{y_j - \mu_j}{\phi} $$

$$ \frac{\partial \mathcal{l}}{\partial \vec{\theta}} = \frac{1}{\phi} \begin{bmatrix} y_0 - \mu_0 \\ y_1 - \mu_1 \\ \vdots \\ y_N - \mu_N \end{bmatrix} = \frac{1}{\phi} \left(\vec{y} - \vec{\mu}\right) $$

Substituting the above for $\frac{\partial \mathcal{l}}{\partial \vec{\theta}}$ the score equation becomes:

$$ \vec{U}(\vec{\beta}) = \frac{1}{\phi} \left(\frac{\partial \vec{\theta}}{\partial \vec{\beta}}\right) \left[\vec{y} - \vec{\mu}\right] = \frac{1}{\phi} \left(\frac{\partial \vec{\theta}}{\partial \vec{\beta}}\right) \left[\vec{y} - g^{-1}(X \vec{\beta})\right] = \vec{0} $$

We now have a concrete equation which we need to solve to find the optimal regression coefficients $\beta$. This is serious progress! Take a moment to soak in where we are and how we got here. There really is a lot to keep track of and the connection from one parameter to the other, as well as the meaning of each, can get lost easily. I find it useful to remember the problem at a high level, and make simple, logical deductions to arrive at this final form.

Solving for the coefficients

It feels great to have gotten the problem into a form that we could potentially solve. Still, as you may have feared, we have a problem. We know that $\vec{\mu} = g^{-1}(X \vec{\beta})$ and that $g(\vec{\mu})$ may not be a linear function. If it is not, we lack a closed form solution to the score equation.

One common method of overcoming this issue is to first use a first order Taylor expansion of $\vec{\mu}$ as an approximation. Then, we will see that the score equation can be manipulated to resemble the form of a simple weighted least squares regression (an ordinary least squares with weights), which has a nice closed form solution. We can then use the method of iteratively reweighted least squares to find incrementally better approximations to the regression coefficients. A follow-up post will provide a comprehensive overview of the derivation and application of IRLS, so hopefully the promise of Taylor expansions to come can tide you over until then.

Further reading

The generalized linear regression interface allows the specification of generalized linear models (GLMs). GLMs allow for flexible specification of linear models which can be used for various types of problems including linear regression, Poisson regression, logistic regression, etc...

When working with data that has a relatively small number of features (< 4096) a GLR can be used to solve linear models on the driver node instead of running an optimization algorithm on the distributed dataset.

Contrasted with linear regression where the output is assumed to have a Gaussian distribution, GLMs are specifications of linear models where the output may take on any distribution from the [exponential family of distributions] (https://en.wikipedia.org/wiki/Exponential_family). An exponential family distribution is any probability distribution of the form

$$ f\left(y|\theta, \phi, w\right) = e^{\frac{y\theta - b(\theta)}{\phi/w} - c(y, \phi)}\\ Y_i \sim f\left(\cdot \vert \theta_i, \phi, w_i\right) $$

where the parameter of interest $\theta_i$ is related to the expected value of the response variable $\mu_i$ by

$$ \theta_i = h(\mu_i) $$

A GLM finds the regression coefficients $\vec{\beta}$ which maximize the joint probability density of the data, also known as the likelihood.

$$ \min_{\vec{\beta}} \mathcal{L}(\vec{\theta}\vert\vec{y},X) = \prod_{i=1}^{N} e^{\frac{y_i\theta_i - b(\theta_i)}{\phi/w_i} - c(y_i, \phi)} $$

where the parameter of interest $\theta_i$ is related to the regression coefficients $\vec{\beta}$ by

$$ \theta_i = h(g^{-1}(\vec{x_i} \cdot \vec{\beta})) $$