
When performing linear regression, we have a number of data points.

Let's say that we have 1, 2, 3 and so on up through M data points.

Each data point has an output variable, Y,

and a number of input variables, X1 through XN.

So in our baseball example Y is the lifetime number of home runs.

And X1 through XN are things like height and weight.

Our one through M samples might be different baseball players.

So maybe data point one is Derek Jeter, data point two is Barry Bonds, and

data point M is Babe Ruth.
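The data layout described so far can be sketched as a small table: M rows (one per player) and N input variables per row, plus one output value per row. The height, weight, and home-run numbers below are placeholders for illustration, not real statistics.

```python
# M = 3 data points (players), N = 2 input variables (height, weight).
# All numeric values here are made up for illustration.
players = ["Derek Jeter", "Barry Bonds", "Babe Ruth"]

# X[i] holds the input variables (X1 = height, X2 = weight)
# for data point i; Y[i] is that player's output variable
# (lifetime home runs).
X = [
    [75, 195],  # data point 1
    [74, 185],  # data point 2
    [74, 215],  # data point M
]
Y = [260, 762, 714]

M = len(X)      # number of data points
N = len(X[0])   # number of input variables per data point
print(M, N)     # 3 2
```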

Generally speaking, we are trying to predict the values of

the output variable for each data point, by multiplying the input variables by

some set of coefficients that we're going to call theta 1 through theta N.

Each theta, which we'll from here on out call a parameter or

a weight of the model, tells us how important an input variable is

when predicting a value for the output variable.

So if theta 1 is very small,

X1 must not be very important in general when predicting Y.

Whereas if theta N is very large,

then XN is generally a big contributor to the value of Y.

This model is built in such a way that we can multiply each X by

the corresponding theta, and sum them up to get Y.

So that our final equation will look something like the equation down here.

Theta 1 times X1, plus theta 2 times X2,

all the way up to theta N times XN, equals Y.
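The prediction step, multiplying each input variable by its theta and summing, can be sketched in a few lines. The theta and X values below are arbitrary, chosen only to make the arithmetic easy to follow.

```python
def predict(thetas, x):
    # Predicted Y: multiply each input variable by its
    # corresponding theta, then sum the products.
    return sum(theta * xi for theta, xi in zip(thetas, x))

# Hypothetical weights for two input variables (e.g. height, weight).
thetas = [2.0, 0.5]
x = [3.0, 4.0]
print(predict(thetas, x))  # 2.0*3.0 + 0.5*4.0 = 8.0
```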

And we'd want to be able to predict Y for each of our M data points.

In this illustration, the dark blue points represent our observed data points,

whereas the green line shows the predictive value of Y for

every value of X given the model that we may have created.

The best equation is the one that's going to minimize the difference across all

data points between our predicted Y, and our observed Y.

What we need to do is find the thetas that produce the best predictions.

That is, making these differences as small as possible.

If we wanted to create a value that describes the total error of our model,

we'd probably sum up the errors.

That is, sum over all of our data points from I equals 1, to M.

The predicted Y minus the actual Y.

However, since these errors can be both negative and

positive, if we simply sum them up, we could have

a total error term that's very close to 0, even if our model is very wrong.

In order to correct this, rather than simply adding up the error terms,

we're going to add the square of the error terms.

This guarantees that the magnitude of each individual error term,

Y predicted minus Y actual is positive.
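The contrast between the two error measures can be seen directly: signed errors can cancel to zero even when the model is badly off, while squared errors cannot. A small sketch with made-up predictions:

```python
y_pred = [10.0, 2.0]
y_actual = [5.0, 7.0]   # model is off by +5 on one point, -5 on the other

# Raw signed errors: positive and negative terms cancel out.
signed_sum = sum(p - a for p, a in zip(y_pred, y_actual))

# Sum of squared errors: each term is guaranteed positive.
squared_sum = sum((p - a) ** 2 for p, a in zip(y_pred, y_actual))

print(signed_sum)   # 0.0  -- hides how wrong the model is
print(squared_sum)  # 50.0 -- reveals it
```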

Now, let's make sure the distinction between input variables and

output variables is clear.