1 Regression

March 29, 1996 at :<10 0

1.1 Introduction

Regression is the term for fitting ordered discrete or continuous output data, typically based on ordered discrete or continuous input data. Other terms used are function approximation, curve fitting, and surface fitting. Myers (REF) discusses the origin of the least squares method and the term regression. Here are some examples of regression problems:

1.1.1 What will be learned?

There are two views of how to proceed, a probabalistic view and a curve fitting view. The first is a statistical point of view, in which the process that generated the training set is:

equation170

tex2html_wrap_inline1390 is additive noise. It is typically assumed that the form of f() is known, that the tex2html_wrap_inline1394 are known perfectly, and the probability distribution for the tex2html_wrap_inline1390 is also known. Our goal is to estimate the true value of tex2html_wrap_inline1384 and the only problem is the unknown additive noise tex2html_wrap_inline1390 in the measurement process.

The curve fitting point of view is that we don't know the form of f(). There may or may not be additive noise, but the noise is not the primary problem. We will choose a criteria that measures the error in our prediction of the tex2html_wrap_inline1404 in the training set, and attempt to minimize that criteria.

Fortunately, these two views are compatible. I will largely present the statistical point of view. In the learning context, we typically don't know the form of f(), and we don't know the probability distribution of any noise that may be present, however.

1.2 The curve fitting approach

Defining tex2html_wrap_inline1408 here are a number of possible criteria to minimize in order to fit the data:

equation173

equation175

equation177

equation179

equation181

equation183

where g() is plotted in Figure 1.1.

   figure186
Figure 1.1: A candidate error criterion.

All of the above criteria treat the error from each fitted data point equivalently. This does not have to be the case, and I will later describe reasons for treating errors from different points differently. We desire a number of properties of the criteria. We would like to treat positive and negative errors symmetrically, so the criteria must be an even function of the errors. Criteria tex2html_wrap_inline1412 and tex2html_wrap_inline1414 are odd, and are minimized when the errors become negative and as large as possible. Therefore they are not acceptable criteria. We would like unique answers from our criteria. Consider fitting two data points tex2html_wrap_inline1416 and tex2html_wrap_inline1418 with the simplest model possible, a constant term tex2html_wrap_inline1420 . Criteria tex2html_wrap_inline1422 scores any estimate of tex2html_wrap_inline1420 that is between tex2html_wrap_inline1416 and tex2html_wrap_inline1418 equally: for tex2html_wrap_inline1430 :

equation192

Criteria tex2html_wrap_inline1422 also has a discontinuity in its derivative for zero error, which is an undesirable property. Criteria tex2html_wrap_inline1434 has the interesting property that it penalizes large errors relatively little. In the example just presented, this criteria would lead to an estimate of tex2html_wrap_inline1436 or tex2html_wrap_inline1438 , with the other point ignored in the fitting process. This type of criteria is used in robust regression to identify outliers, points which do not fit the pattern of the data. Relative to the squared error criterion tex2html_wrap_inline1440 , using higher order even functions will push the fit to make all errors tex2html_wrap_inline1442 the same. Typically the sum of the squared fit error:

equation194

is used as the fit criterion. This criteria is appealing because it is even, usually provides a unique answer, has a continous derivative, and is simple to deal with mathematically. It also turns out to be the same criteria the statistical approach ends up with.

1.3 The probablistic/statistical approach

There are in fact several probabalistic viewpoints. One attempts to maximize the probability of the outputs tex2html_wrap_inline1404 by adjusting the estimate for the weights tex2html_wrap_inline1448 This is called the maximum likelihood approach. Other approaches attempt to find an unbiased estimator for tex2html_wrap_inline1384 with minimum variance (the Minimum Variance Unbiased Estimator or MVUE approach and the Best Linear Unbiased Estimator or BLUE approach (Myers)). These approaches result in the same answer for the assumptions we are going to make. Let's pursue the maximum likelihood approach. Warning: This argument is heuristic rather than rigorous. For a rigorous version of this argument see (REF).

If we assume the additive noise tex2html_wrap_inline1390 is normally distributed with zero mean and variance tex2html_wrap_inline1454 and tex2html_wrap_inline1390 is independent of all other tex2html_wrap_inline1458 , the prediction error for each training point has the same probability distribution:

  equation197

If the additive noise is independent, then the probability of the training data is:

  equation204

Since the products of exponentials is equal to the exponential of the sum of the exponents, maximizing Equation 1.11 is equivalent to minimizing (to take into account the - sign in the exponent in Equation 1.10) the quantity:

equation210

Note that we get the same criteria that the curve fitting point of view uses.

If we know more about the additive noise, the criteria changes. If the tex2html_wrap_inline1390 are independent but have different variances tex2html_wrap_inline1464 the the criteria to minimize becomes:

equation213

If we know the correlation (variance) and cross correlation between the additive noise for all training points (a matrix tex2html_wrap_inline1466 ) then the criteria to minimize becomes (using vector notation):

equation218

where

equation221

and

equation226

assuming n data points in the regression. We will refer to this criteria as the weighted regression case.

1.4 Solving The Linear Case: The Normal Equations

A linear model is linear in the parameters. This means that the partial derivatives with respect to any parameter do not depend on any parameter. Linear models can be put in the form:

equation233

An entire training set can be fit be putting the model in the form:

equation235

where

equation237

and

equation226

assuming n data points in the regression.

It is often useful to include a constant term in the regression. If the input vector does not already include a constant term, a constant term might be added to each tex2html_wrap_inline1394 , and the number of parameters increased by one.

In the linear case the general criteria can be written as:

equation210

equation250

equation252

equation254

The estimated parameters will be indicated by a carat or hat: tex2html_wrap_inline1474 One way to find the minimum of the above criteria is to take its partial derivative with respect to the weight vector. At the minimum value this derivative should be zero.

equation256

Setting this to zero leaves us with the following equation, known as the normal equation:

equation261

If tex2html_wrap_inline1476 is full rank, the estimate of tex2html_wrap_inline1384 is given by:

equation263

The case where tex2html_wrap_inline1476 is not full rank or is not well conditioned is discussed in the section on Multicollinearity, Ridge Regression, and Principal Components Regression.

In the weighted regression case:

equation267

1.5 Solving The Nonlinear Case: Gradient Descent

Iterative searches for the minimizing parameters must be done in the case of a model that is nonlinear in the parameters. The simplest (and slowest) search technique is gradient descent. I will explain more sophisticated techniques in the chapter on Training Methods.

Gradient descent takes the derivative of the criteria being minimized, and steps down that gradient:

equation210

equation277

equation284

equation288

1.6 Assessing the Prediction Error

There are several quantities we might use to assess prediction error:

1.6.1 Goodness of fit

The value of the criteria tex2html_wrap_inline1486 gives us a best case estimate of how well we will predict new data. It is unlikely that on average we will predict new data better than we can predict data used in the regression. The squared error per point is:

equation302

Other goodness of fit criteria are given by:

equation306

equation310

1.6.2 Parameter Variance: Linear Case

We assume our model structure is correct so:

equation315

equation320

where

equation324

equation329

Since we have an unbiased estimator of tex2html_wrap_inline1384 tex2html_wrap_inline1490 the expected value of

equation334

is

equation338

If we don't know tex2html_wrap_inline1454 in advance, we can estimate it from the data:

equation341

where n is the number of data points and m is the number of weights. m is included in the denominator to avoid bias in this estimate.

Weighted regression case:

equation346

1.6.3 Parameter Variance: Nonlinear Case

One approach to seeing how reliable the parameter estimates are is to simulate many different training sets, and see how the parameter estimates change. This is known as Monte Carlo simulation.

A more direct approach is to locally linearize the nonlinear model about the estimated parameter vector. Then the formulas from the previous section on linear parameter variance apply directly. A nonlinear model of the form tex2html_wrap_inline1386 can be linearized about tex2html_wrap_inline1388 :

equation351

This defines a new model with parameters

equation356

and an additional constant term:

equation358

The matrix form of this new model is:

equation361

where the ijth element of tex2html_wrap_inline1506 is given by

equation363

except for last (n+1th) column of tex2html_wrap_inline1506 , which is all 1s.

Given this local linear model, the parameter uncertainty is:

equation368

Discuss how far this approximation is valid.

1.6.4 Variance of Estimated Values

The variance of the estimated value of the function at a particular location tex2html_wrap_inline1382 is based on the parameter variance. In the linear case

equation373

equation375

In the nonlinear case the result is

equation378

where

equation381

and tex2html_wrap_inline1514

1.7 Assessing the Residuals

Can look for structure in the residuals: variable noise, missing terms.

Plot tex2html_wrap_inline1442 vs tex2html_wrap_inline1404

Histogram tex2html_wrap_inline1442

Figure of what residuals should look like: random noise about zero.

Figure of linear fit of quadratic.

Figure of changing tex2html_wrap_inline1522 .

Standardized residuals:

equation388

1.8 The Hat Matrix

An important matrix is the hat matrix:

equation394

which plays the following role in regresion:

equation397

equation401

Note that tex2html_wrap_inline1524 applied again has no effect:

equation405

Therefore, tex2html_wrap_inline1524 is an idempotent matrix:

equation409

with the following properties:

equation411

equation415

equation419

equation421

Additional facts:

equation423

equation426

equation429

tex2html_wrap_inline1528 reflects the leverage of a data point. A large leverage is tex2html_wrap_inline1530 Note that this only reflects the data distribution, not the training outputs tex2html_wrap_inline1532

Studentized residuals:

equation437

1.9 The PRESS residual

The measures discussed so far all measure how well the data is fit, In order to measure how well new data is predicted, a technique known as leave one out cross validation can be used. The deleted residual tex2html_wrap_inline1534 measures how well the tex2html_wrap_inline1404 is predicted if the data point tex2html_wrap_inline1538 is removed from the data set:

equation443

where the subscript -i indicates that this prediction is made without using the ith data point.

It turns out that tex2html_wrap_inline1534 can be computed without performing a regression with the ith point explicitly deleted. In order to estimate tex2html_wrap_inline1548 we need to estimate tex2html_wrap_inline1550 Deleting the ith data point is equivalent to deleting the ith row tex2html_wrap_inline1556 from the matrix tex2html_wrap_inline1506 to form a new matrix tex2html_wrap_inline1560 We can use the resulting normal equation to estimate tex2html_wrap_inline1562

equation453

We note that

equation461

Deleting the ith data point removes tex2html_wrap_inline1566 from the sum:

equation463

We make use of the Sherman-Morrison-Woodbury theorem:

eqnarray467

You can prove this by multiplying both sides by tex2html_wrap_inline1568 We must also do this deletion for the other part of the normal equations:

equation478

so

equation480

Putting this all together:

eqnarray484

so

eqnarray502

eqnarray517

eqnarray531

Expanding

equation555

to

equation559

Since tex2html_wrap_inline1570 and tex2html_wrap_inline1572 we get:

equation570

eqnarray572

1.10 Influence on the fitted value: DFFITS

Effect of observation i on fitted value tex2html_wrap_inline1576

equation585

1.11 Influence on the regression coefficients: DFBETAS and COOK's D

equation594

Cook's distance is an aggregate measure of how much the weights change when a point is deleted/added. It also measures an aggregate of how much the predicted ys change.

equation601

1.12 Outliers and Robust Regression

Myers 221-242, 348-356, Ch 6.

concept of high influence points.

1.13 Multicollinearity, Ridge Regression, Principal Components Regression

Myers

1.14 Errors in input variables regression

If you thought your model structure tex2html_wrap_inline1380 was correct for the data, but that some of the fit errors were due to errors in measuring tex2html_wrap_inline1382 , as well as additive noise (errors in y), there are approaches for dealing with errors in tex2html_wrap_inline1382 : (REF: Myers 357-358).

One approach (for a linear model) is to append the output to the input vector, and perform a singular value decomposition or eigenvalue/eigenvector decomposition of

equation615

The vector corresponding to the smallest singular value or eigenvalue is the desired weight vector.

1.15 Incremental Model Building and Feature Selection, model selection

Do this in separate chapter.

Myers 96-111, 185-198.

Mallows tex2html_wrap_inline1590 .

Too few terms -> extra bias.

Too many terms -> extra variance.

Neter ch. 12

Glantz ch 6

1.16 Incremental solution to normal equations

About this document ...

This document was generated using the LaTeX2HTML translator Version 96.1-d (Mar 10, 1996) Copyright © 1993, 1994, 1995, 1996, Nikos Drakos, Computer Based Learning Unit, University of Leeds.

The command line arguments were:
latex2html -split 0 -t NML: Regression -no_navigation -show_section_numbers regression.

The translation was initiated by Christopher G. Atkeson on Fri Mar 29 00:04:55 EST 1996


Christopher G. Atkeson
Fri Mar 29 00:04:55 EST 1996