Gradient Descent
Gradient descent is an iterative optimisation algorithm that is commonly used in machine learning to minimize cost functions.
[Interactive plot: remove a point by clicking on it; the blue line depicts the fit.]
Description
Consider a set of \(m\) points of the form \( (x_i, y_i) \), and a polynomial \( h_{\theta} (x) \) of degree \( n \): \[ h_{\theta} (x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_n x^n \] Our task is to choose the \(\theta\)'s so that the polynomial fits the points. To quantify how good the fit is, we define the cost function \( J(\theta) \): \[ J(\theta) = \frac{1}{2m} \sum_{i = 1}^{m} (h_{\theta} (x_i) - y_i)^2 \] A lower cost function implies a better fit, so we want to find the \(\theta\)'s for which this function is minimized. Gradient descent achieves this by repeatedly updating every \(\theta_j\) in the direction that decreases \(J(\theta)\): \[ \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i = 1}^{m} (h_{\theta} (x_i) - y_i) \cdot x_i^{\,j} \] Here, \( \alpha \) is known as the learning rate. It controls how quickly the \(\theta\)'s approach the minimum, and it must be tuned for the problem at hand. A learning rate that is too low makes the cost function decrease very slowly, while one that is too high causes the cost function to blow up or oscillate.
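Below is a minimal sketch of this update rule in Python with NumPy. The sample data, degree, learning rate, and iteration count are illustrative only; they are not the values used by the demo on this page.

```python
# A minimal sketch of the gradient descent update described above.
import numpy as np

def fit_polynomial(x, y, degree, alpha=0.1, iterations=5000):
    """Fit a polynomial of the given degree to the points (x, y) by gradient descent."""
    m = len(x)
    # Design matrix: column j holds x_i^j, so h_theta(x_i) is the i-th entry of X @ theta.
    X = np.vander(x, degree + 1, increasing=True)
    theta = np.zeros(degree + 1)
    for _ in range(iterations):
        error = X @ theta - y            # h_theta(x_i) - y_i for every point
        gradient = (X.T @ error) / m     # dJ/dtheta_j = (1/m) * sum_i error_i * x_i^j
        theta -= alpha * gradient        # update all theta_j simultaneously
    cost = (error @ error) / (2 * m)     # final value of J(theta)
    return theta, cost

# Example: fit a straight line (degree 1) to a few noisy points.
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
theta, cost = fit_polynomial(x, y, degree=1)
print(theta, cost)   # theta approaches roughly [1.0, 2.1]
```

Higher-degree polynomials work the same way, but usually need a smaller learning rate (or rescaled inputs) because the columns of the design matrix grow quickly in magnitude.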
Note:
- The learning rate must be tuned correctly. If the blue line vanishes, the cost function has blown up: click on reset and lower the learning rate.
- The concept of gradient descent scales easily to more variables. In fact, neural networks use gradient descent to optimize the weights and biases of the neurons in every layer.
- Polynomial regression finds the minimum of the cost function directly, using calculus to derive the conditions for the minimum and linear algebra to solve for them (see the sketch after this list). However, that method doesn't scale as well as the number of points increases.
- Related: K-means clustering, Polynomial regression, Genetic Algorithm.
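For comparison, here is a minimal sketch of the direct approach mentioned in the note above, under the same illustrative data as before: setting the gradient of \( J(\theta) \) to zero yields a linear system (the normal equation) that can be solved in one step.

```python
# A minimal sketch of the closed-form solution, for comparison with gradient descent.
import numpy as np

def fit_polynomial_direct(x, y, degree):
    """Solve for theta in one step by setting the gradient of J(theta) to zero."""
    X = np.vander(x, degree + 1, increasing=True)   # column j holds x_i^j
    # Normal equation: (X^T X) theta = X^T y, solved with linear algebra.
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([1.0, 2.1, 2.9, 4.2, 5.1])
print(fit_polynomial_direct(x, y, degree=1))   # the same minimizer gradient descent approaches
```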
Developed by ChanRT | Fork me at GitHub