Tuesday, 7 January 2014

Gradient descent versus normal equation

Gradient descent and normal equation (also called batch processing) both are methods for finding out the local minimum of a function. I have given some intuition about gradient descent in previous article. Gradient descent is actually an iterative method to find out the parameters. We start with the initial guess of the parameters (usually zeros but not necessarily), and gradually adjust those parameters so that we get the function that best fit the given data points. Here is the mathematical formulation of gradient descent in the case of linear regression.

The average cost function is given by
J(θ0,θ1)=12mi=1m(hθ(x(i))y(i))2........(1)

Taking the partial derivative w.r.t θ0 and θ1
J(θ0,θ1)θ0=1mi=1m(hθ(x(i))y(i)) J(θ0,θ1)θ1=1mi=1m(hθ(x(i))y(i)).x(i)

These equations gives the slopes of  (1) at points theta0 and theta1. To minimize (1), we subtract or add the slopes from parameters depending upon whether slopes are decreasing or increasing. If increasing, we subtract and if decreasing we add the slopes so that we reach to the local minimum. To control the speed of convergence, we generally multiply slope with a parameter also called learning parameter and is denoted by α. So, we perform the following two operations until reach a condition at which two consecutive values of parameters are same or nearly same.
θ0:=θ0α1mi=1m(hθ(x(i))y(i))  θ1:=θ1α1mi=1m(hθ(x(i))y(i)).x(i)
The normal equation is slightly different than the gradient descent. Normal equation finds out the parameter in a single step involving matrix inversion and other expensive matrix operations. The beauty behind the normal equation is  it calculates the parameters using the analytical approach in a single step. You don't need the learning parameter and number of iterations (and other stopping criteria). The feature space (i.e. feature matrix), output vector (considering supervised learning) and matrix operations among them are sufficient for calculating the parameters.

Mathematically the normal equation is given by
θ=(XTX)1XTy

Where X is a n x (m+1) dimensional feature matrix containing m features and n variables and y is the output vector. Here is the derivation of Normal equation.
The least square error is given by
err2=(ypredictedyactual))2err2i=(yi(θ0+θ1xi1+θ2xi2))2

The partial derivative should be zero for error minimization i.e.
err2i(θk)=0

For k = 0, 1, 2, we get
2(yi(θ0+θ1xi1+θ2xi2).(1)=02(yi(θ0+θ1xi1+θ2xi2).(xi1)=02(yi(θ0+θ1xi1+θ2xi2).(xi2)=0

Simplification results
yi=nθ0+θ1xi1+θ2xi2x1iyi=θ0xi1+θ1xi1xi1+θ2xi1xi2x2iyi=θ0xi2+θ1xi1xi2+θ2xi2xi2

Expressing these equations in matrix form
1x11x121x21x221x31x32 y1y2y3= 1x11x121x21x221x31x32 111x11x21x31x12x22x32θ0θ1θ2

XTy=XTXθ (XXT)1XTy= (XXT)1 XTXθθ=(XTX)1XTy

But, as the number of features (i.e. dimension) increases, the Normal equation's performance gradually decreases. This is because of the expensive matrix operations. That is way gradient descent is preferred over normal equation in most of the machine learning problem involving large number of features.

As a concluding remarks, use the Normal equation when the number of features are relatively small (in hundreds or sometimes thousands) and if #features are very large (in millions ) choose the gradient descent. 

Note

There are various other optimization techniques besides gradient descent and normal equation. These are actually very basic techniques for minimization using least square method. If you want to learn more about mathematical optimization then check out the wikipedia page. There is very nice optimization library in python. Python developer can look into it

0 comments:

Post a Comment

Twitter Delicious Facebook Digg Stumbleupon Favorites More

 
Design by Free WordPress Themes | Bloggerized by Lasantha - Premium Blogger Themes | Affiliate Network Reviews