March 29, 1996 at :<10 0
Regression is the term for fitting ordered discrete or continuous output data, typically based on ordered discrete or continuous input data. Other terms used are function approximation, curve fitting, and surface fitting. Myers (REF) discusses the origin of the least squares method and the term regression. Here are some examples of regression problems:
There are two views of how to proceed, a probabalistic view and a curve fitting view. The first is a statistical point of view, in which the process that generated the training set is:
is additive noise.
It is typically assumed that the form of f() is known,
that the
are known
perfectly, and the probability distribution for
the
is also known.
Our goal is to estimate the true value of
and the only problem is
the unknown additive noise
in the measurement process.
The curve fitting point of view is that we don't know the form of
f().
There may or may not be additive noise, but the noise is not the primary
problem.
We will choose a criteria that measures the error in our prediction of
the
in the training set, and attempt to minimize that criteria.
Fortunately, these two views are compatible. I will largely present the statistical point of view. In the learning context, we typically don't know the form of f(), and we don't know the probability distribution of any noise that may be present, however.
Defining
here are a number of possible criteria to minimize in order to fit the data:
where g() is plotted in Figure 1.1.
Figure 1.1: A candidate error criterion.
All of the above criteria treat the error from each fitted data point
equivalently. This does not have to be the case, and I will later
describe reasons for treating errors from different points differently.
We desire a number of properties of the criteria.
We would like to treat positive and negative errors symmetrically,
so the criteria must be an even function of the errors.
Criteria
and
are odd, and are minimized when the errors become
negative and as large as possible. Therefore they are not acceptable
criteria.
We would like unique answers from our criteria. Consider fitting two
data points
and
with the simplest model possible,
a constant term
.
Criteria
scores any estimate of
that is between
and
equally: for
:
Criteria
also has a discontinuity in its derivative for zero error,
which is an undesirable property.
Criteria
has the interesting property that it penalizes large errors
relatively little. In the example just presented, this criteria would lead
to an estimate of
or
, with the other point ignored
in the fitting process.
This type of criteria is used in robust regression to identify outliers,
points which do not fit the pattern of the data.
Relative to the squared error criterion
, using higher order even
functions will push the fit to make all errors
the same.
Typically the
sum of the squared fit error:
is used as the fit criterion. This criteria is appealing because it is even, usually provides a unique answer, has a continous derivative, and is simple to deal with mathematically. It also turns out to be the same criteria the statistical approach ends up with.
There are in fact several probabalistic viewpoints. One attempts to maximize
the probability of the outputs
by adjusting the estimate for the
weights
This is called the maximum likelihood approach.
Other approaches attempt to find an unbiased estimator for
with minimum variance (the Minimum Variance Unbiased Estimator or MVUE
approach and the Best Linear Unbiased Estimator or BLUE approach (Myers)).
These approaches result in the same answer for the assumptions we are
going to make. Let's pursue the maximum likelihood approach.
Warning: This argument is heuristic rather than rigorous. For a rigorous
version of this argument see (REF).
If we assume the additive noise
is normally distributed
with zero mean and variance
and
is independent of
all other
, the prediction error for each
training point has the same probability distribution:
If the additive noise is independent, then the probability of the training data is:
Since the products of exponentials is equal to the exponential of the sum of the exponents, maximizing Equation 1.11 is equivalent to minimizing (to take into account the - sign in the exponent in Equation 1.10) the quantity:
Note that we get the same criteria that the curve fitting point of view uses.
If we know more about the additive noise, the criteria changes. If the
are independent but have different variances
the the criteria to minimize becomes:
If we know the correlation (variance) and cross correlation between the
additive noise for all training points (a matrix
) then the
criteria to minimize becomes (using vector notation):
where
and
assuming n data points in the regression. We will refer to this criteria as the weighted regression case.
A linear model is linear in the parameters. This means that the partial derivatives with respect to any parameter do not depend on any parameter. Linear models can be put in the form:
An entire training set can be fit be putting the model in the form:
where
and
assuming n data points in the regression.
It is often useful to include a constant term in the regression.
If the input vector does not already include a constant term,
a constant term might be added to each
,
and the number of parameters increased
by one.
In the linear case the general criteria can be written as:
The estimated parameters will be indicated by a carat or hat:
One way to find the minimum of the above criteria is to take its
partial derivative with respect to the weight vector. At the minimum
value this derivative should be zero.
Setting this to zero leaves us with the following equation, known as the normal equation:
If
is full rank, the estimate of
is given by:
The case where
is not full rank or is not well conditioned
is discussed in the section on
Multicollinearity, Ridge Regression, and Principal Components Regression.
In the weighted regression case:
Iterative searches for the minimizing parameters must be done in the case of a model that is nonlinear in the parameters. The simplest (and slowest) search technique is gradient descent. I will explain more sophisticated techniques in the chapter on Training Methods.
Gradient descent takes the derivative of the criteria being minimized, and steps down that gradient:
There are several quantities we might use to assess prediction error:
Show figure of PDF of parameters (ellipses)
Show figure confidence interval of E(f()).
Add this prediction interval to previous figure.
The value of the criteria
gives us a best case estimate of how
well we will predict new data. It is unlikely that on average we will predict
new data better than we can predict data used in the regression. The
squared error per point is:
Other goodness of fit criteria are given by:
We assume our model structure is correct so:
where
Since we have an unbiased estimator of
the expected value of
is
If we don't know
in advance, we can estimate it from the data:
where n is the number of data points and m is the number of weights. m is included in the denominator to avoid bias in this estimate.
Weighted regression case:
One approach to seeing how reliable the parameter estimates are is to simulate many different training sets, and see how the parameter estimates change. This is known as Monte Carlo simulation.
A more direct approach is to locally linearize the nonlinear model about
the estimated parameter vector. Then the formulas from the previous
section on linear parameter variance apply directly.
A nonlinear model of the form
can be linearized
about
:
This defines a new model with parameters
and an additional constant term:
The matrix form of this new model is:
where the ijth
element of
is given by
except for last (n+1th) column of
, which is all 1s.
Given this local linear model, the parameter uncertainty is:
Discuss how far this approximation is valid.
The variance of the estimated value of the function at a particular
location
is based on the parameter variance.
In the linear case
In the nonlinear case the result is
where
and
Can look for structure in the residuals: variable noise, missing terms.
Plot
vs
Histogram
Figure of what residuals should look like: random noise about zero.
Figure of linear fit of quadratic.
Figure of changing
.
Standardized residuals:
An important matrix is the hat matrix:
which plays the following role in regresion:
Note that
applied again has no effect:
Therefore,
is an idempotent matrix:
with the following properties:
Additional facts:
reflects the leverage of a data point.
A large leverage is
Note that this only reflects the data distribution, not the training
outputs
Studentized residuals:
The measures discussed so far all measure how well the data is fit,
In order to measure how well new data is predicted, a technique known
as leave one out cross validation can be used.
The deleted residual
measures how well the
is predicted if
the data point
is removed from the data set:
where the subscript -i indicates that this prediction is made without using the ith data point.
It turns out that
can be computed without performing a regression
with the ith point explicitly deleted.
In order to estimate
we need to estimate
Deleting the ith data point is equivalent to
deleting the ith row
from the matrix
to form a
new matrix
We can use the resulting normal equation to estimate
We note that
Deleting the ith data point removes
from the sum:
We make use of the Sherman-Morrison-Woodbury theorem:
You can prove this by multiplying both sides by
We must also do this deletion for the other part of the normal equations:
so
Putting this all together:
so
Expanding
to
Since
and
we get:
Effect of observation i on fitted value
Cook's distance is an aggregate measure of how much the weights change when a point is deleted/added. It also measures an aggregate of how much the predicted ys change.
Myers 221-242, 348-356, Ch 6.
concept of high influence points.
Myers
If you thought your model structure
was correct for the data,
but that some of the fit errors were due to errors in measuring
, as
well as additive noise (errors in y), there are approaches for dealing with
errors in
: (REF: Myers 357-358).
One approach (for a linear model) is to append the output to the input vector, and perform a singular value decomposition or eigenvalue/eigenvector decomposition of
The vector corresponding to the smallest singular value or eigenvalue is the desired weight vector.
Do this in separate chapter.
Myers 96-111, 185-198.
Mallows
.
Too few terms -> extra bias.
Too many terms -> extra variance.
Neter ch. 12
Glantz ch 6
This document was generated using the LaTeX2HTML translator Version 96.1-d (Mar 10, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.
The command line arguments were:
latex2html -split 0 -t NML: Regression -no_navigation -show_section_numbers regression.
The translation was initiated by Christopher G. Atkeson on Fri Mar 29 00:04:55 EST 1996