Wednesday 8 January 2014

Gradient descent versus normal equation

Gradient descent and the normal equation are both methods for finding the parameters that minimize a cost function. I have given some intuition about gradient descent in a previous article. Gradient descent is an iterative method: we start with an initial guess for the parameters (usually zeros, but not necessarily) and gradually adjust them until we get the function that best fits the given data points. When every update uses all of the training examples, as it does below, this is called batch gradient descent. Here is the mathematical formulation of gradient descent in the case of linear regression.

The average cost function is given by
$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 \qquad (1)$$
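Read as code, (1) is just the squared residual averaged over the m training examples (with the conventional extra factor of 2 in the denominator). A minimal numpy sketch, assuming X already carries a leading column of ones for θ0 (the function name compute_cost is mine, not from the post):

```python
import numpy as np

def compute_cost(X, y, theta):
    """Average squared-error cost J(theta) from equation (1).

    X is assumed to be the m x 2 matrix [1, x] (a column of ones for theta_0
    plus the feature column), y the m targets, theta the vector [theta_0, theta_1].
    """
    m = len(y)
    residuals = X @ theta - y                 # h_theta(x^(i)) - y^(i) for every example
    return (residuals @ residuals) / (2 * m)  # (1/2m) * sum of squared residuals
```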

Taking the partial derivative w.r.t θ0 and θ1
$$\frac{\partial J(\theta_0,\theta_1)}{\partial \theta_0}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$
$$\frac{\partial J(\theta_0,\theta_1)}{\partial \theta_1}=\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}$$

These expressions give the slope of (1) at the current values of θ0 and θ1. To minimize (1), we subtract the slope from each parameter: if the slope is positive the parameter moves down, and if it is negative the parameter moves up, so either way we head toward the minimum. To control the speed of convergence, the slope is multiplied by a learning rate, denoted by α. We repeat the following two updates (applied simultaneously) until two consecutive values of the parameters are the same or nearly the same.
$$\theta_0 := \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$
$$\theta_1 := \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}$$
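Here is a sketch of the full loop in numpy. The learning rate, tolerance, and iteration cap are illustrative values I picked, not prescriptions from the post; the stopping test is the "two consecutive values are nearly the same" condition described above.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, tol=1e-8, max_iters=100000):
    """Batch gradient descent for linear regression.

    X: m x 2 matrix whose first column is ones (for theta_0) and second is the feature.
    Stops when two consecutive parameter vectors are nearly identical.
    """
    m = len(y)
    theta = np.zeros(X.shape[1])                # initial guess: all zeros
    for _ in range(max_iters):
        gradient = X.T @ (X @ theta - y) / m    # (1/m) * sum (h(x) - y) * x
        new_theta = theta - alpha * gradient    # simultaneous update of theta_0, theta_1
        converged = np.max(np.abs(new_theta - theta)) < tol
        theta = new_theta
        if converged:
            break
    return theta

# Made-up example: points lying exactly on y = 2 + 3x.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x
print(gradient_descent(X, y))                   # approximately [2, 3]
```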
The normal equation is quite different from gradient descent. It finds the parameters in a single step involving a matrix inversion and other expensive matrix operations. The beauty of the normal equation is that it computes the parameters analytically, in one shot: you don't need a learning rate, a number of iterations, or any other stopping criterion. The feature matrix, the output vector (we are doing supervised learning), and a few matrix operations on them are sufficient to calculate the parameters.

Mathematically the normal equation is given by
$$\theta=(X^{T}X)^{-1}X^{T}y$$

where X is the n × (m+1) feature matrix, with one row per training example (n examples) and one column per feature (m features) plus a leading column of ones for the intercept, and y is the vector of observed outputs. Here is the derivation of the normal equation, written out for two features.
The least square error is given by
$$\text{err}^2=(y_{\text{predicted}}-y_{\text{actual}})^2,\qquad \text{err}_i^2=\big(y_i-(\theta_0+\theta_1 x_{i1}+\theta_2 x_{i2})\big)^2$$

At the minimum, the partial derivative of the total error with respect to each parameter must be zero, i.e.
$$\frac{\partial}{\partial\theta_k}\sum_i \text{err}_i^2=0$$

For k = 0, 1, 2, we get
$$\sum_i 2\big(y_i-(\theta_0+\theta_1 x_{i1}+\theta_2 x_{i2})\big)\cdot(-1)=0$$
$$\sum_i 2\big(y_i-(\theta_0+\theta_1 x_{i1}+\theta_2 x_{i2})\big)\cdot(-x_{i1})=0$$
$$\sum_i 2\big(y_i-(\theta_0+\theta_1 x_{i1}+\theta_2 x_{i2})\big)\cdot(-x_{i2})=0$$

Simplification gives
$$\sum_i y_i=n\theta_0+\theta_1\sum_i x_{i1}+\theta_2\sum_i x_{i2}$$
$$\sum_i x_{i1}y_i=\theta_0\sum_i x_{i1}+\theta_1\sum_i x_{i1}x_{i1}+\theta_2\sum_i x_{i1}x_{i2}$$
$$\sum_i x_{i2}y_i=\theta_0\sum_i x_{i2}+\theta_1\sum_i x_{i1}x_{i2}+\theta_2\sum_i x_{i2}x_{i2}$$

Expressing these equations in matrix form (written out for n = 3 training examples)
$$\begin{pmatrix}1&1&1\\x_{11}&x_{21}&x_{31}\\x_{12}&x_{22}&x_{32}\end{pmatrix}\begin{pmatrix}y_1\\y_2\\y_3\end{pmatrix}=\begin{pmatrix}1&1&1\\x_{11}&x_{21}&x_{31}\\x_{12}&x_{22}&x_{32}\end{pmatrix}\begin{pmatrix}1&x_{11}&x_{12}\\1&x_{21}&x_{22}\\1&x_{31}&x_{32}\end{pmatrix}\begin{pmatrix}\theta_0\\\theta_1\\\theta_2\end{pmatrix}$$

$$X^{T}y=X^{T}X\theta \;\Rightarrow\; (X^{T}X)^{-1}X^{T}y=(X^{T}X)^{-1}X^{T}X\theta \;\Rightarrow\; \theta=(X^{T}X)^{-1}X^{T}y$$
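In numpy the whole derivation collapses into a couple of lines. The data below is made up purely to show the shapes (three examples, two features, plus the column of ones for θ0), and the linear system is solved directly rather than forming the inverse explicitly, which is the usual numerically safer way to apply the formula:

```python
import numpy as np

# Three hypothetical training examples with two features each;
# the leading column of ones corresponds to theta_0.
X = np.array([[1.0, 2.0, 3.0],
              [1.0, 4.0, 1.0],
              [1.0, 0.5, 2.5]])
y = np.array([5.0, 6.0, 3.0])

# theta = (X^T X)^{-1} X^T y, computed by solving (X^T X) theta = X^T y.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)
```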

But as the number of features (i.e. the dimension of the problem) grows, the normal equation's performance degrades, because computing the inverse of X^T X costs roughly cubic time in the number of features. That is why gradient descent is preferred over the normal equation for most machine learning problems involving a large number of features.

As a concluding remark: use the normal equation when the number of features is relatively small (in the hundreds, or sometimes thousands), and choose gradient descent when the number of features is very large (in the millions).
