The goal in a regression problem is to predict the value of a variable \(y\) given values of a set of input variables \([x_1, x_2, \ldots, x_L]\). The input variables are the components of an input vector \(\vec{x}=(x_1, x_2, \ldots, x_L)^T\). A general linear model predicts \(y\) as a linear combination of functions of the input vector.

\[y = \vec{\beta}^T\vec{\phi}(\vec{x}) + \epsilon\]

Here \(\vec{\beta}\) is a weight vector, \(\vec{\phi}(\vec{x})\) is a vector whose \(i^{th}\) component is the basis function \(\phi_i(\vec{x})\), and \(\epsilon\) is an error term that is usually taken to be zero-mean Gaussian noise.

The problem is to find a weight vector that minimizes the error. Doing this requires a training set of input vectors along with their corresponding \(y\) values. The optimal \(\vec{\beta}\) can then be found using either a least-squares (maximum likelihood) approach or a full Bayesian treatment. You can find the details in chapter 3 of Pattern Recognition and Machine Learning by Christopher Bishop.

The final result is that given a new input vector \(\vec{x}\) the predicted \(y\) value is a linear combination of the known \(y\) values.

\[\hat{y}=\sum_{n=1}^N k(\vec{x},\vec{x}_n) y_n\]

The function \(k(\vec{x},\vec{x}_n)\) is called an equivalent kernel and it is constructed from the vectors \(\vec{\phi}\) evaluated at each of the training vectors \(\vec{x}_n\) and at the new vector \(\vec{x}\). The really interesting thing about this equation for \(\hat{y}\) is that the kernel function looks very much like a Green's Function.
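As a concrete illustration, here is a minimal sketch of the least-squares version of this identity. It assumes a simple polynomial basis \(\phi(x)=(1, x, x^2)\) and synthetic data, neither of which comes from the post; for least squares the equivalent kernel is \(k(\vec{x},\vec{x}_n)=\vec{\phi}(\vec{x})^T(\Phi^T\Phi)^{-1}\vec{\phi}(\vec{x}_n)\), where \(\Phi\) is the design matrix.

```python
import numpy as np

# Sketch: the least-squares prediction written as a kernel-weighted sum
# of the training targets. Polynomial basis and synthetic data are
# illustrative assumptions, not from the original post.

rng = np.random.default_rng(0)

def phi(x):
    """Basis-function vector phi(x) = (1, x, x^2) for scalar input x."""
    return np.array([1.0, x, x**2])

# Training data: y = sin(x) + noise
x_train = np.linspace(0.0, 3.0, 20)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(20)

Phi = np.array([phi(x) for x in x_train])   # design matrix, N x M
S = np.linalg.inv(Phi.T @ Phi)              # (Phi^T Phi)^{-1}

def k(x, xn):
    """Equivalent kernel k(x, x_n) = phi(x)^T (Phi^T Phi)^{-1} phi(x_n)."""
    return phi(x) @ S @ phi(xn)

x_new = 1.5
# Prediction as a kernel-weighted sum over all training targets
y_hat_kernel = sum(k(x_new, xn) * yn for xn, yn in zip(x_train, y_train))

# Same prediction via the usual least-squares weights beta
beta = S @ Phi.T @ y_train
y_hat_direct = phi(x_new) @ beta
```

The two predictions agree, which is exactly the point of the equation above: the fitted model never needs to be evaluated directly, since the kernel redistributes the training targets.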

The kernel must, in some sense, measure the similarity between the new input vector and each training vector. This similarity determines how much influence the training value \(y_n\) has on the new predicted value \(\hat{y}\). The simplest similarity measure between \(\vec{x}\) and \(\vec{x}_n\) is just their inner product \(\vec{x}^T\vec{x}_n\), which is closely related to the distance between them since \(\|\vec{x}-\vec{x}_n\|^2 = \|\vec{x}\|^2 - 2\vec{x}^T\vec{x}_n + \|\vec{x}_n\|^2\). This is indeed what you get when \(\vec{\phi}(\vec{x})=\vec{x}\). In this case the kernel function reduces to

\[k(\vec{x},\vec{x}_n) = \vec{x}^T\Sigma^{-1}\vec{x}_n\]

where \(\Sigma^{-1}\) is the inverse of the covariance matrix for the components of the training vectors. It acts as a metric tensor for the space in which the input vectors live.
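A quick sketch of this linear-basis case, under the assumption that \(\Sigma\) is the second-moment matrix \(X^TX/N\) of the training inputs (which coincides with their covariance for zero-mean data); the data here is synthetic and purely illustrative.

```python
import numpy as np

# Sketch: with phi(x) = x the equivalent kernel is the Mahalanobis-style
# inner product x^T Sigma^{-1} x_n. Sigma is taken as X^T X / N here,
# i.e. the covariance of zero-mean inputs -- an assumption for this demo.

rng = np.random.default_rng(1)
N, L = 50, 3
X = rng.standard_normal((N, L))            # training inputs, roughly zero mean
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.05 * rng.standard_normal(N)

Sigma_inv = np.linalg.inv(X.T @ X / N)     # metric tensor for the input space

def k(x, xn):
    """Equivalent kernel x^T Sigma^{-1} x_n (scaled by 1/N)."""
    return x @ Sigma_inv @ xn / N

x_new = np.array([0.3, -1.0, 2.0])
y_hat = sum(k(x_new, X[n]) * y[n] for n in range(N))

# The kernel sum reproduces the ordinary least-squares prediction
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]
```

Note the factor of \(1/N\): since \(\Sigma = X^TX/N\), we have \(\vec{x}^T\Sigma^{-1}\vec{x}_n/N = \vec{x}^T(X^TX)^{-1}\vec{x}_n\), which is the least-squares equivalent kernel.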

The prediction equation suggests a link between the k-nearest neighbors (knn) algorithm and regression. Knn predicts using a (possibly weighted) average of the \(y\) values of the k nearest neighbors of the new input vector, whereas the prediction equation sums over all the \(y\) values, with presumably greater weight given to values whose inputs are closer to the new input vector.
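To make the contrast concrete, here is a minimal knn sketch (equal weights, scalar inputs, toy data all assumed for illustration): knn hard-selects the k closest training points and ignores the rest, while the kernel sum assigns every training point a weight.

```python
import numpy as np

# Minimal knn sketch: keep only the k closest training points and
# average their y values, in contrast to the kernel sum over all points.
# Toy data and equal weights are illustrative assumptions.

def knn_predict(x_new, x_train, y_train, k=3):
    dists = np.abs(x_train - x_new)      # distances to every training input
    nearest = np.argsort(dists)[:k]      # indices of the k closest points
    return y_train[nearest].mean()       # unweighted average of their targets

x_train = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_train = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
print(knn_predict(1.8, x_train, y_train, k=2))  # averages y at x=2 and x=1 -> 2.5
```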

The interpretation of the kernel as a Green's function also suggests a possible electrostatic analog of regression. The \(y\) values in this case become charges, and the kernel is the Green's function for the Poisson equation. This is just one possible electrostatic interpretation, and if you don't like electrostatics you can think in terms of gravity or heat. The point is that, just as the same types of equations appear in many different areas of physics, they also appear in data analysis and machine learning.

© 2010-2012 Stefan Hollos and Richard Hollos
