Thursday, January 15, 2015

Multivariate Linear Regression

Uni-variate Linear Regression: I will try to explain linear regression using gradient descent with the problem of house price prediction, which involves only a single feature.

The graph shows the relationship between the number of rooms in a house and its price. As you can see, the house price depends linearly on only one feature, the number of bedrooms. That is, the house price increases as the number of bedrooms increases.

Given a training set (sample data for which the number of rooms and the corresponding house price are known), our goal is to develop a hypothesis (a linear equation, as is visible from the graph) which is trained on the training set and can be used to predict house prices for data outside the training set. There is a linear relationship between the number of rooms and the price, so the equation of a LINE (of the form y = ax + b) can serve as a simple hypothesis.
We can have a simple hypothesis of the form
h(x) = theta_0 + (theta_1 * x)
where x is the number of rooms in the house

Our job is to find appropriate values of theta_0 and theta_1 so that our hypothesis gives predictions with low error. In machine learning it is very difficult to arrive at a perfect hypothesis; the goal of ML is to make guesses that are as close to "perfect" as possible. In other words, its goal is to reduce the error in prediction. With different values of theta_0 and theta_1 we get different lines (hypotheses).
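To make this concrete, here is a minimal sketch in Python (the post itself does not use any particular language; the function name is just illustrative). The three (theta_0, theta_1) pairs are the ones used for Line 1, Line 2 and Line 3 below.

def hypothesis(theta_0, theta_1, x):
    # Predict the house price (in $K) for a house with x bedrooms.
    return theta_0 + theta_1 * x

# Different choices of theta give different lines (hypotheses).
for theta_0, theta_1 in [(0.25, 0.25), (0.5, 0.5), (1.0, 1.0)]:
    print(theta_0, theta_1, hypothesis(theta_0, theta_1, 1))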

Line 1: Here theta_0 = 0.25 and theta_1 = 0.25. This hypothesis doesn't fit the data given in the sample set. For example, when the number of bedrooms is 1, the sample data says that the price of the house is $2K, but according to Line 1 it is $0.5K. There is a large difference between the outputs predicted by the hypothesis and the actual values in the training set, hence this can't be a hypothesis that fits it.

Line 2: Here theta_0 = 0.5 and theta_1 = 0.5. As you can see, Line 2 predicts that the price of a house with 1 bedroom is $1K, which is better than the value predicted by Line 1. But it is still not the best; theta can be adjusted further to arrive at an even better hypothesis, one with a smaller prediction error.

Line 3: Here theta_0 = 1 and theta_1 = 1. For x = 1 it gives a good prediction of $2K, with zero error. Note that even though Line 3 makes a good prediction when x = 1, for other values of x it gives predictions which are not 100% correct, but still much better than the other hypotheses.

As we can see, Line 3 is the one whose predictions are closest to the original values. There won't be a considerable change in the predictions if we adjust theta further, i.e. our system has converged to a particular value of theta.
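A rough sketch of this comparison: only the point (1 bedroom, $2K) is stated in the text above; the other two training points below are made-up values used purely for illustration.

# Assumed toy training set: (number of bedrooms, price in $K).
# Only (1, 2.0) comes from the example above; the rest are illustrative.
training_set = [(1, 2.0), (2, 3.1), (3, 3.9)]

lines = {"Line 1": (0.25, 0.25), "Line 2": (0.5, 0.5), "Line 3": (1.0, 1.0)}

for name, (theta_0, theta_1) in lines.items():
    errors = [(theta_0 + theta_1 * x) - y for x, y in training_set]
    total_squared_error = sum(e ** 2 for e in errors)
    print(name, "total squared error:", round(total_squared_error, 3))

Line 3 comes out with the smallest total squared error, which is exactly what "converged" means here.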

The correctness of the hypothesis is measured using the 'cost function' J(theta):
J(theta) = ( 1 / (2 * m) ) * sum over i = 1 to m of ( h(x_i) - y_i )^2
where
x_i is the value of the feature (i.e. the number of rooms) of the ith entry in the sample data,
h(x_i) is the predicted output value (i.e. the house price) using the hypothesis with theta_0 and theta_1,
y_i is the actual output price, and
m is the number of training examples.
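As a sketch, the same cost function in Python (the 1/(2*m) scaling is the usual convention; it does not change where the minimum lies):

def cost(theta_0, theta_1, xs, ys):
    # J(theta): half the average of the squared prediction errors over the m training samples.
    m = len(xs)
    total = sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)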

Our aim is to find the theta_0 and theta_1 which give the least error, i.e. the values of theta for which the cost function is least. The method that we are going to use to tune theta is called gradient descent; it can be explained with the following steps:

1> Assign some random initial values to theta_0 and theta_1
2> Repeat the following 'I' times (I = number of iterations) or till theta converges:
   For each theta_i in theta (i.e. theta_0 and theta_1):
      For each entry in the training sample:

         1. let x be the number of bedrooms and y be the corresponding output (the known house price) of the current training sample
         2. calculate the output h(x) using the 'OLD' values of theta and x
         3. calculate the error, h(x) - y, for the current training sample
         4. calculate the product (error * value of the variable whose coefficient is theta_i in the current sample)

      Sum all the products obtained in the above step.
      Use this calculated sum to adjust theta_i so as to make the hypothesis "less wrong":
      new_theta_i = old_theta_i - ( alpha * sum / m )
      where alpha is the learning rate; it determines how fast gradient descent moves towards the minimum.
3> Use this tuned/trained hypothesis to predict values
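Putting steps 1> to 3> together, here is a minimal batch gradient descent sketch in Python; the learning rate, iteration count and the toy data in the call at the bottom are illustrative values, not values from the post.

def gradient_descent(xs, ys, alpha=0.01, iterations=1000):
    theta_0, theta_1 = 0.0, 0.0                      # step 1: initial values
    m = len(xs)
    for _ in range(iterations):                      # step 2: repeat I times
        # Sum (error * variable whose coefficient is theta_i); for theta_0 that variable is 1.
        grad_0 = sum((theta_0 + theta_1 * x - y) * 1 for x, y in zip(xs, ys))
        grad_1 = sum((theta_0 + theta_1 * x - y) * x for x, y in zip(xs, ys))
        # Both sums were computed with the OLD thetas, so this update is simultaneous.
        theta_0 = theta_0 - alpha * grad_0 / m
        theta_1 = theta_1 - alpha * grad_1 / m
    return theta_0, theta_1

# step 3: train on a toy sample and predict.
theta_0, theta_1 = gradient_descent([1, 2, 3], [2.0, 3.1, 3.9])
print("price of a 4-bedroom house:", theta_0 + theta_1 * 4)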

In short, gradient descent changes the values of theta so that it finally arrives at (converges to) a value of theta for which the hypothesis has the least error (i.e. the cost function is minimized). The above steps can be represented by the following short update equation, applied simultaneously for every theta_i:
theta_i := theta_i - ( alpha / m ) * sum over all training samples of ( ( h(x) - y ) * x_i )
where x_i is the variable whose coefficient is theta_i (with x_0 = 1 for theta_0).

Note

  • Theta values should be updated simultaneously: compute the new values of all theta_i using the OLD values, and only then overwrite them (see the sketch after this list).
  • The x(i) value for theta_0 is 1, since h(x) = ( theta_0 * 1 ) + ( theta_1 * x )
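A tiny sketch of what "simultaneously" means in code (the helper name and arguments are just illustrative): both new values are computed from the old thetas before either one is overwritten.

def simultaneous_update(theta_0, theta_1, grad_0, grad_1, alpha, m):
    # Compute both new values from the OLD thetas first...
    temp_0 = theta_0 - alpha * grad_0 / m
    temp_1 = theta_1 - alpha * grad_1 / m
    # ...and only then return them to overwrite the old pair. Updating theta_0
    # in place before theta_1 would mix new and old values.
    return temp_0, temp_1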
You get the following 3D graph if you plot the values of theta_0 and theta_1 against J(theta):
 
What gradient descent is trying to do is find the global minimum of J(theta) in the above graph. At the global minimum the value of J(theta) is the least, i.e. the prediction error is the least.
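The bowl-shaped surface can be reproduced with a short matplotlib sketch; the training data, grid range and colormap below are illustrative choices, not taken from the post.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative training data (not from the post).
xs = np.array([1.0, 2.0, 3.0])
ys = np.array([2.0, 3.1, 3.9])

t0, t1 = np.meshgrid(np.linspace(-2.0, 4.0, 100), np.linspace(-2.0, 4.0, 100))
# Evaluate J(theta) on the grid of (theta_0, theta_1) pairs.
J = sum((t0 + t1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

ax = plt.figure().add_subplot(projection="3d")
ax.plot_surface(t0, t1, J, cmap="viridis")
ax.set_xlabel("theta_0")
ax.set_ylabel("theta_1")
ax.set_zlabel("J(theta)")
plt.show()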

Multivariate Linear Regression
In multivariate problems the number of features on which the output depends is greater than or equal to 2. Our goal here remains the same: given sample training data, derive a hypothesis which is trained on it and can be used to predict the output for inputs outside the training set. Since there is more than one feature here, the hypothesis is going to be more complex.
h(x1, x2....) = theta_0 + ( theta_1 * x1 ) + ( theta_2 * x2 ) + ...
As in uni-variate linear regression, gradient descent is used here to find appropriate values of theta_0, theta_1, ... so that the hypothesis gives a very small prediction error.
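A compact numpy sketch of the same idea for the multivariate case (the learning rate and iteration count are illustrative; feature scaling, which usually speeds up convergence, is left out):

import numpy as np

def multivariate_gradient_descent(X, y, alpha=0.01, iterations=5000):
    # X is an (m, n) matrix of feature values, y a length-m vector of known outputs.
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])      # prepend the constant feature x0 = 1 for theta_0
    theta = np.zeros(n + 1)                   # [theta_0, theta_1, ..., theta_n]
    for _ in range(iterations):
        errors = Xb @ theta - y               # h(x_i) - y_i for every training sample
        theta -= alpha * (Xb.T @ errors) / m  # simultaneous update of all thetas
    return theta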

A problem on multivariate linear regression from hackerrank
And my solution to it using gradient descent