Gradient descent and the normal equation are both methods for finding the parameters that minimize a cost function. I have given some intuition about gradient descent in a previous article. Gradient descent is an iterative method: we start with an initial guess of the parameters (usually zeros, but not necessarily) and gradually adjust them so that we get the function that best fits the given data points. Here is the mathematical formulation of gradient descent in the case of linear regression.
The average cost function is given by
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \qquad (1)$$
Taking the partial derivatives w.r.t. θ0 and θ1,
$$\frac{\partial J(\theta_0,\theta_1)}{\partial \theta_0}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$
$$\frac{\partial J(\theta_0,\theta_1)}{\partial \theta_1}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}$$
These equations give the slopes of (1) with respect to θ0 and θ1. To minimize (1), we always move the parameters against the slope: where the cost is increasing (positive slope) the parameter is decreased, and where it is decreasing (negative slope) the parameter is increased, so that we move towards the minimum. To control the speed of convergence, the slope is multiplied by a factor called the learning rate, denoted by α. So we perform the following two updates repeatedly until two consecutive values of the parameters are the same, or nearly the same.
$$\theta_0:=\theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$
$$\theta_1:=\theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}$$
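To make these update rules concrete, here is a minimal sketch of batch gradient descent for single-feature linear regression in Python with NumPy. The function names, the learning rate of 0.01, the tolerance-based stopping rule, and the toy data are illustrative assumptions, not part of the derivation above.

```python
import numpy as np

def compute_cost(theta0, theta1, x, y):
    """Average cost J(theta0, theta1) from equation (1)."""
    m = len(y)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

def gradient_descent(x, y, alpha=0.01, max_iters=10000, tol=1e-9):
    """Batch gradient descent for the hypothesis h(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0                # initial guess: zeros
    for _ in range(max_iters):
        error = (theta0 + theta1 * x) - y    # h(x^(i)) - y^(i)
        grad0 = np.sum(error) / m            # dJ/dtheta0
        grad1 = np.sum(error * x) / m        # dJ/dtheta1
        new0 = theta0 - alpha * grad0        # simultaneous update of both parameters
        new1 = theta1 - alpha * grad1
        # stop once two consecutive parameter values are nearly the same
        converged = abs(new0 - theta0) < tol and abs(new1 - theta1) < tol
        theta0, theta1 = new0, new1
        if converged:
            break
    return theta0, theta1

# toy data: y is roughly 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.9, 14.1])
theta0, theta1 = gradient_descent(x, y)
print(theta0, theta1)                        # close to (2, 3)
print(compute_cost(theta0, theta1, x, y))    # small residual cost
```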
The normal equation is slightly different from gradient descent. It finds the parameters in a single step involving matrix inversion and other expensive matrix operations. The beauty of the normal equation is that it calculates the parameters analytically, in one step: you don't need a learning rate, a number of iterations, or any other stopping criterion. The feature matrix, the output vector (since we are considering supervised learning) and the matrix operations between them are sufficient for calculating the parameters.
Mathematically the normal equation is given by

$$\theta=(X^TX)^{-1}X^Ty$$

where X is an n × (m+1) dimensional feature matrix holding the m features of each of the n training examples (plus a column of ones for the intercept term), and y is the output vector. Here is the derivation of the normal equation.
The least square error is given by
$$\sum err^2=\sum\left(y_{predicted}-y_{actual}\right)^2$$
$$\sum err_i^2=\sum_i\left(y_i-(\theta_0+\theta_1x_{i1}+\theta_2x_{i2})\right)^2$$
The partial derivatives should be zero for error minimization, i.e.
$$\frac{\partial \sum err_i^2}{\partial \theta_k}=0$$
For k = 0, 1, 2, we get
$$\sum 2\left(y_i-(\theta_0+\theta_1x_{i1}+\theta_2x_{i2})\right)\cdot(-1)=0$$
$$\sum 2\left(y_i-(\theta_0+\theta_1x_{i1}+\theta_2x_{i2})\right)\cdot(-x_{i1})=0$$
$$\sum 2\left(y_i-(\theta_0+\theta_1x_{i1}+\theta_2x_{i2})\right)\cdot(-x_{i2})=0$$
Simplification results in
$$\sum y_i=n\theta_0+\theta_1\sum x_{i1}+\theta_2\sum x_{i2}$$
$$\sum x_{i1}y_i=\theta_0\sum x_{i1}+\theta_1\sum x_{i1}x_{i1}+\theta_2\sum x_{i1}x_{i2}$$
$$\sum x_{i2}y_i=\theta_0\sum x_{i2}+\theta_1\sum x_{i1}x_{i2}+\theta_2\sum x_{i2}x_{i2}$$
Expressing these equations in matrix form
$$\begin{pmatrix}1&1&1\\x_{11}&x_{21}&x_{31}\\x_{12}&x_{22}&x_{32}\end{pmatrix}\begin{pmatrix}y_1\\y_2\\y_3\end{pmatrix}=\begin{pmatrix}1&1&1\\x_{11}&x_{21}&x_{31}\\x_{12}&x_{22}&x_{32}\end{pmatrix}\begin{pmatrix}1&x_{11}&x_{12}\\1&x_{21}&x_{22}\\1&x_{31}&x_{32}\end{pmatrix}\begin{pmatrix}\theta_0\\\theta_1\\\theta_2\end{pmatrix}$$
$$X^Ty=X^TX\theta$$
$$(X^TX)^{-1}X^Ty=(X^TX)^{-1}X^TX\theta$$
$$\theta=(X^TX)^{-1}X^Ty$$
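For illustration, here is a minimal sketch of the normal equation in NumPy, applied to the same toy data as in the gradient descent sketch above. The column of ones for the intercept and the use of np.linalg.solve (rather than an explicit matrix inverse, which is less stable numerically) are implementation choices, not part of the derivation.

```python
import numpy as np

def normal_equation(X, y):
    """Solve theta = (X^T X)^{-1} X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# same toy data: y is roughly 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 4.9, 8.2, 10.9, 14.1])

# n x (m+1) feature matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])
print(normal_equation(X, y))  # close to [2, 3], matching gradient descent in one step
```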
But as the number of features (i.e. the dimension) increases, the normal equation's performance gradually degrades. This is because of the expensive matrix operations: solving the X^T X system takes time roughly cubic in the number of features. That is why gradient descent is preferred over the normal equation in most machine learning problems involving a large number of features.

As a concluding remark, use the normal equation when the number of features is relatively small (in the hundreds, or sometimes thousands), and choose gradient descent when the number of features is very large (in the millions).
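A rough way to see this trade-off in practice is to time the normal-equation solve as the number of features grows. This is only an illustrative sketch: the matrix sizes are arbitrary and the exact timings depend on the machine and the underlying linear algebra library.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

# time the normal-equation solve for an increasing number of features
for m in [100, 500, 1000, 2000]:
    X = rng.standard_normal((5000, m))         # 5000 examples, m features
    y = rng.standard_normal(5000)
    start = time.perf_counter()
    theta = np.linalg.solve(X.T @ X, X.T @ y)  # cost grows quickly with m
    print(f"{m:5d} features: {time.perf_counter() - start:.3f} s")
```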