Why Gradient Descent Evolved Into Stochastic Gradient Descent

In this article, we will explore not only how but also why optimization techniques like gradient descent and stochastic gradient descent are employed.

We’ve previously covered linear regression, including its connection to vectors and projections in a recent discussion.

Our focus now shifts to grasping gradient descent through the lens of solving a linear regression problem.

But first, let’s quickly revisit the foundational concepts of linear regression and the mathematical principles that drive it, ensuring readers new to the topic can easily follow along.

If you’re already familiar with the core mathematics behind linear regression, you may skip directly to the section titled Why Is Gradient Descent Necessary?

Imagine you’ve embarked on your machine learning journey and successfully implemented a linear regression model using Python.

The model performed well, yielding optimal values for the slope and intercept.

This raises an important question: What computational processes occur behind the scenes within this algorithm?

Our objective is to uncover the underlying mathematical framework.

Linear Regression Refresher

To illustrate, consider the following dataset.

Image by Author

Next, we aim to decode the algorithm’s mathematical structure.

The fundamental Simple Linear Regression equation is:

\[
\hat{y} = \beta_0 + \beta_1 x
\]

These are the standard formulas we encounter for calculating the slope and intercept.

The slope calculation formula is:

\[
\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\]

The intercept calculation formula is:

\[
\beta_0 = \bar{y} - \beta_1 \bar{x}
\]

Next, we use these formulas to determine the slope and intercept values.

Our dataset consists of:

\[
x = [1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8]
\]
\[
y = [39344, 46206, 37732, 43526, 39892, 56643, 60151, 54446, 64446, 57190]
\]

First, we calculate the mean of x:

\[
\bar{x} = \frac{1.2+1.4+1.6+2.1+2.3+3.0+3.1+3.3+3.3+3.8}{10} = \frac{25.1}{10} = 2.51
\]

Next, we calculate the mean of y:

\[
\bar{y} = \frac{39344+46206+37732+43526+39892+56643+60151+54446+64446+57190}{10} = \frac{499576}{10} = 49957.6
\]

We then compute the numerator of the slope formula. After performing substitutions and calculations:

\[
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = 67555.54
\]

We then calculate the denominator:

\[
\sum_{i=1}^{n}(x_i-\bar{x})^2 = 7.489
\]

Now we determine the slope:

\[
\beta_1 = \frac{67555.54}{7.489} \approx 9020.64
\]

Next, we solve for the intercept, using the unrounded slope 9020.636:

\[
\beta_0 = 49957.6-(9020.636)(2.51) \approx 27315.80
\]

The results are:

\[
\beta_0 \approx 27315.80
\]
\[
\beta_1 \approx 9020.64
\]

Our final regression model is:

\[
\hat{y} = 27315.80+9020.64x
\]
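The hand calculation above can be reproduced in a few lines of NumPy. This is a minimal sketch rather than what a library does internally; the function name `fit_simple_linear_regression` is a hypothetical helper introduced only for illustration.

```python
import numpy as np

# Input values (x) and targets (y) from the dataset used in this article.
x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) from the closed-form least-squares formulas."""
    x_mean, y_mean = x.mean(), y.mean()
    s_xy = np.sum((x - x_mean) * (y - y_mean))  # numerator of the slope formula
    s_xx = np.sum((x - x_mean) ** 2)            # denominator of the slope formula
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean
    return intercept, slope


beta_0, beta_1 = fit_simple_linear_regression(x, y)
print(f"intercept = {beta_0:.2f}, slope = {beta_1:.2f}")
```

Carrying full floating-point precision through the calculation avoids the small drift that appears when intermediate sums are rounded by hand.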

Although we arrived at these values using formulas, we aren’t quite ready to stop there—we want to delve deeper.

Our next step is to understand how these formulas were actually derived.

To gain this insight, we’ll examine a bowl-shaped surface in three dimensions. This surface emerges when we plot the mean squared error (MSE) against every possible combination of $\beta_0$ and $\beta_1$.

The best-fit line corresponds to the lowest point of this surface, where the MSE is smallest, and at that point the gradient is zero.

Differentiation gives the slope of a curve at any point, so we differentiate the loss function that the bowl-shaped surface represents.

Because the loss depends on two variables, we use partial differentiation, set each partial derivative to zero, and solve to derive the formulas for the slope and intercept.

Deriving the Formulas for Slope and Intercept

Begin with the Mean Squared Error (MSE) loss function:

\[
\text{MSE}(\beta_0,\beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i-(\beta_0+\beta_1x_i)\right)^2
\]

Restructure the inner term:

\[
= \frac{1}{n}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2
\]

Take the partial derivative with respect to \( \beta_0 \):

\[
\frac{\partial \text{MSE}}{\partial \beta_0} = \frac{\partial}{\partial \beta_0}\left(\frac{1}{n}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2\right)
\]

Factor out the constant:

\[
= \frac{1}{n}\frac{\partial}{\partial \beta_0}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2
\]

Move the derivative inside the summation:

\[
= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial \beta_0}(y_i-\beta_0-\beta_1x_i)^2
\]

Apply the chain rule:

\[
= \frac{1}{n}\sum_{i=1}^{n}2(y_i-\beta_0-\beta_1x_i)\cdot\frac{\partial}{\partial \beta_0}(y_i-\beta_0-\beta_1x_i)
\]

Apply derivative rules:

\[
\frac{\partial}{\partial\beta_0}(y_i)=0, \qquad
\frac{\partial}{\partial\beta_0}(-\beta_0)=-1, \qquad
\frac{\partial}{\partial\beta_0}(-\beta_1x_i)=0
\]

The inner derivative simplifies to:

\[
\frac{\partial}{\partial \beta_0}(y_i-\beta_0-\beta_1x_i) = -1
\]

Substitute back:

\[
\frac{\partial \text{MSE}}{\partial \beta_0} = \frac{1}{n}\sum_{i=1}^{n}2(y_i-\beta_0-\beta_1x_i)(-1)
\]

Simplify:

\[
= -\frac{2}{n}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)
\]

Set the derivative equal to zero:

\[
-\frac{2}{n}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i) = 0
\]

Multiply both sides by \( -\frac{n}{2} \):

\[
\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i) = 0
\]

Expand:

\[
\sum_{i=1}^{n}y_i - n\beta_0 - \beta_1\sum_{i=1}^{n}x_i = 0
\]

Rearrange:

\[
n\beta_0 = \sum_{i=1}^{n}y_i - \beta_1\sum_{i=1}^{n}x_i
\]

Divide both sides by \( n \). This is how we find the intercept:

\[
\beta_0 = \frac{1}{n}\sum_{i=1}^{n}y_i - \beta_1\frac{1}{n}\sum_{i=1}^{n}x_i
\]

Using averages:

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i, \qquad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n}y_i
\]

This simplifies to:

\[
\beta_0 = \bar{y} - \beta_1\bar{x}
\]

Now take the partial derivative with respect to \( \beta_1 \):

\[
\frac{\partial \text{MSE}}{\partial \beta_1} = \frac{\partial}{\partial \beta_1}\left(\frac{1}{n}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2\right)
\]

Take the constant outside:

\[
= \frac{1}{n}\frac{\partial}{\partial \beta_1}\sum_{i=1}^{n}(y_i-\beta_0-\beta_1x_i)^2
\]

Move the derivative inside the summation:

\[
= \frac{1}{n}\sum_{i=1}^{n}\frac{\partial}{\partial \beta_1}(y_i-\beta_0-\beta_1x_i)^2
\]

Apply the chain rule:

\[
= \frac{1}{n}\sum_{i=1}^{n}2(y_i-\beta_0-\beta_1x_i)\cdot\frac{\partial}{\partial \beta_1}(y_i-\beta_0-\beta_1x_i)
\]

Compute each part:

\[
\frac{\partial}{\partial\beta_1}(y_i)=0, \qquad
\frac{\partial}{\partial\beta_1}(-\beta_0)=0, \qquad
\frac{\partial}{\partial\beta_1}(-\beta_1x_i)=-x_i
\]

Therefore:

\[
\frac{\partial}{\partial \beta_1}(y_i-\beta_0-\beta_1x_i) = -x_i
\]

Substitute this in:

\[
\frac{\partial \text{MSE}}{\partial \beta_1} = \frac{1}{n}\sum_{i=1}^{n}2(y_i-\beta_0-\beta_1x_i)(-x_i)
\]

Simplify:

\[
= -\frac{2}{n}\sum_{i=1}^{n}x_i(y_i-\beta_0-\beta_1x_i)
\]

Set the derivative to zero:

\[
-\frac{2}{n}\sum_{i=1}^{n}x_i(y_i-\beta_0-\beta_1x_i) = 0
\]

Multiply both sides by \( -\frac{n}{2} \):

\[
\sum_{i=1}^{n}x_i(y_i-\beta_0-\beta_1x_i) = 0
\]

Expand:

\[
\sum_{i=1}^{n}x_iy_i - \beta_0\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0
\]

Substitute the intercept formula:

\[
\beta_0 = \bar{y} - \beta_1\bar{x}
\]

into the equation:

\[
\sum_{i=1}^{n}x_iy_i - (\bar{y}-\beta_1\bar{x})\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0
\]

Expand:

\[
\sum_{i=1}^{n}x_iy_i - \bar{y}\sum_{i=1}^{n}x_i + \beta_1\bar{x}\sum_{i=1}^{n}x_i - \beta_1\sum_{i=1}^{n}x_i^2 = 0
\]

Remember that:

\[
\sum_{i=1}^{n}x_i = n\bar{x}
\]

Substitute:

\[
\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} + \beta_1n\bar{x}^2 - \beta_1\sum_{i=1}^{n}x_i^2 = 0
\]

Group terms with \( \beta_1 \):

\[
\beta_1\left(n\bar{x}^2-\sum_{i=1}^{n}x_i^2\right) = n\bar{x}\bar{y} - \sum_{i=1}^{n}x_iy_i
\]

Multiply both sides by -1:

\[
\beta_1\left(\sum_{i=1}^{n}x_i^2-n\bar{x}^2\right) = \sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}
\]

This gives us the formula for slope:

\[
\beta_1 = \frac{\sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n}x_i^2 - n\bar{x}^2}
\]

Expanding the centered sums shows that \( \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_iy_i - n\bar{x}\bar{y} \) and \( \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - n\bar{x}^2 \), so the slope can also be expressed in terms of the covariance of x and y and the variance of x:

\[
\beta_1 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}
\]

Finally, plug the value of \( \beta_1 \) back into the intercept formula to get:

\[
\beta_0 = \bar{y} - \beta_1\bar{x}
\]

Our final regression line is now:

\[
\hat{y} = \beta_0 + \beta_1x
\]
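One way to convince yourself of the derivation is to check it numerically. The sketch below, which assumes the same dataset and uses a hypothetical helper `mse_partials`, evaluates both partial derivatives at the closed-form solution and confirms that the two versions of the slope formula agree.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
n = len(x)


def mse_partials(b0, b1):
    """Partial derivatives of the MSE with respect to b0 and b1."""
    residuals = y - b0 - b1 * x
    d_b0 = -2.0 / n * np.sum(residuals)
    d_b1 = -2.0 / n * np.sum(x * residuals)
    return d_b0, d_b1


# Slope from the raw-sum form and from the centered form of the formula.
slope_raw = (np.sum(x * y) - n * x.mean() * y.mean()) / (np.sum(x ** 2) - n * x.mean() ** 2)
slope_centered = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept = y.mean() - slope_centered * x.mean()

d_b0, d_b1 = mse_partials(intercept, slope_centered)
print("slopes:", slope_raw, slope_centered)
print("partials at the solution:", d_b0, d_b1)  # both numerically close to zero
```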

We have now fully derived the slope and intercept formulas.

However, keep in mind that this derivation only applies when there is a single input feature.

Even for just one feature, the math was quite involved. When dealing with real-world data containing multiple features, the calculations become much more complicated.

To handle multiple features, we rewrite everything using matrix notation. This leads to the normal equation, which works for any number of features.

Deriving the Normal Equation

In simple linear regression, we only solved for an intercept and a single slope:

\[
\hat{y} = \beta_0+\beta_1x
\]

Real-world situations, however, usually require more than one input:

years of experience
education level
age

With multiple inputs, the regression model becomes:

\[
\hat{y} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \cdots + \beta_px_p
\]

Here:

\( \beta_0 \) represents the intercept, and
\( \beta_1,\beta_2,\beta_3,\dots,\beta_p \) represent the coefficients for each feature.

As the number of features grows, manually calculating each parameter becomes impractical.

To simplify this, we rewrite the model using matrix operations.

Assume we have ( n ) observations and ( p ) features.

Define the target vector as:

\[
Y =
\begin{bmatrix}
y_1 \\
y_2 \\
y_3 \\
\vdots \\
y_n
\end{bmatrix}
\]

Next, define the feature matrix. The first column is all 1s to account for the intercept term.

\[
X =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
1 & x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\]

Next, define the parameter vector:

\[
\beta =
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_p
\end{bmatrix}
\]

By multiplying these together:

\[
X\beta =
\begin{bmatrix}
1 & x_{11} & x_{12} & \cdots & x_{1p} \\
1 & x_{21} & x_{22} & \cdots & x_{2p} \\
1 & x_{31} & x_{32} & \cdots & x_{3p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{n1} & x_{n2} & \cdots & x_{np}
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_p
\end{bmatrix}
\]

Carrying out the multiplication results in:

\[
=
\begin{bmatrix}
\beta_0+\beta_1x_{11}+\beta_2x_{12}+\cdots+\beta_px_{1p} \\
\beta_0+\beta_1x_{21}+\beta_2x_{22}+\cdots+\beta_px_{2p} \\
\beta_0+\beta_1x_{31}+\beta_2x_{32}+\cdots+\beta_px_{3p} \\
\vdots \\
\beta_0+\beta_1x_{n1}+\beta_2x_{n2}+\cdots+\beta_px_{np}
\end{bmatrix}
\]

This produces the vector of predictions:

\[
\hat{Y} = X\beta
\]
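In code, the matrix form amounts to prepending a column of ones and taking one matrix-vector product. The following is a small sketch with made-up numbers; `add_intercept_column`, the feature values, and the coefficients are all hypothetical and chosen only to show the shapes involved.

```python
import numpy as np


def add_intercept_column(features):
    """Prepend a column of ones so the first coefficient acts as the intercept."""
    features = np.asarray(features, dtype=float)
    if features.ndim == 1:
        features = features.reshape(-1, 1)
    ones = np.ones((features.shape[0], 1))
    return np.hstack([ones, features])


# Hypothetical data: n = 3 observations, p = 2 features.
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])
beta = np.array([0.5, 2.0, -1.0])  # [beta_0, beta_1, beta_2], illustrative values

X = add_intercept_column(features)  # shape (n, p + 1)
y_hat = X @ beta                    # one prediction per observation
print(X.shape)
print(y_hat)
```

Each entry of `y_hat` equals `beta_0` plus the weighted sum of that row's features, which is exactly the row-by-row expansion written out above.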

Now, define the residual vector, which represents the difference between actual values and predicted values:

\[
Y-\hat{Y} = Y-X\beta
\]

The Mean Squared Error (MSE) is then given by:

\[
\text{MSE} = \frac{1}{n}(Y-X\beta)^T(Y-X\beta)
\]

The transpose is used because \( (Y-X\beta) \) is a column vector. Multiplying its transpose by the vector itself produces a scalar: the sum of squared residuals.

Expanding this expression gives:

\[
\text{MSE} = \frac{1}{n}\left(Y^TY - Y^TX\beta - (X\beta)^TY + (X\beta)^TX\beta\right)
\]

Using properties of transposes:

\[
(X\beta)^T = \beta^TX^T
\]

Substitute:

\[
\text{MSE} = \frac{1}{n}\left(Y^TY - Y^TX\beta - \beta^TX^TY + \beta^TX^TX\beta\right)
\]

Recognize that \( Y^TX\beta \) is a scalar, and a scalar equals its own transpose. Since \( (Y^TX\beta)^T = \beta^TX^TY \), the two middle terms are equal:

\[
Y^TX\beta = \beta^TX^TY
\]

Combining the two middle terms:

\[
\text{MSE} = \frac{1}{n}\left(Y^TY - 2\beta^TX^TY + \beta^TX^TX\beta\right)
\]

To minimize the MSE, we take its derivative with respect to \( \beta \).

Step-by-step derivative breakdown:

The derivative of \( Y^TY \) is zero since it contains no \( \beta \).

The derivative of \( -2\beta^TX^TY \) is:

\[
-2X^TY
\]

The derivative of \( \beta^TX^TX\beta \) is (using the fact that \( X^TX \) is symmetric):

\[
2X^TX\beta
\]

Putting it all together:

\[
\frac{\partial \text{MSE}}{\partial \beta} = \frac{1}{n}\left(-2X^TY + 2X^TX\beta\right)
\]

Simplifying:

\[
= \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta
\]

To find the minimum, set this derivative to zero:

\[
\frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta = 0
\]

Multiply both sides by \( \frac{n}{2} \):

\[
-X^TY + X^TX\beta = 0
\]

Rearranging gives:

\[
X^TX\beta = X^TY
\]

Next, assuming \( X^TX \) is invertible, multiply both sides by \( (X^TX)^{-1} \):

\[
(X^TX)^{-1}X^TX\beta = (X^TX)^{-1}X^TY
\]

Since \( (X^TX)^{-1}X^TX = I \), we get:

\[
I\beta = (X^TX)^{-1}X^TY
\]

And since \( I\beta = \beta \), the final normal equation is:

\[
\beta = (X^TX)^{-1}X^TY
\]

This single formula simultaneously yields:

The intercept
All slope coefficients
All optimal parameters

that minimize the Mean Squared Error.

Typically, we derive the normal equation by minimizing the RSS (Residual Sum of Squares). However, since MSE is just RSS divided by the number of samples, minimizing MSE leads to the same normal equation result.
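The matrix gradient used in the derivation, \( \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta \), can be sanity-checked against a finite-difference approximation. The sketch below uses randomly generated data (an assumption made purely for the check, not the article's dataset), and the helper names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, p))])  # intercept column + p features
Y = rng.normal(size=n)
beta = rng.normal(size=p + 1)


def mse(b):
    """Mean squared error written in matrix form."""
    r = Y - X @ b
    return r @ r / n


def mse_gradient(b):
    """Analytic gradient from the derivation: -2/n X^T Y + 2/n X^T X b."""
    return (-2.0 / n) * X.T @ Y + (2.0 / n) * X.T @ X @ b


def numerical_gradient(f, b, eps=1e-6):
    """Central-difference approximation of the gradient of f at b."""
    grad = np.zeros_like(b)
    for j in range(b.size):
        step = np.zeros_like(b)
        step[j] = eps
        grad[j] = (f(b + step) - f(b - step)) / (2 * eps)
    return grad


analytic = mse_gradient(beta)
numeric = numerical_gradient(mse, beta)
max_gap = np.max(np.abs(analytic - numeric))
print("largest difference:", max_gap)
```

A very small `max_gap` indicates that the analytic expression and the numerical estimate agree to within rounding error.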

Now that we have the normal equation, let’s use it again to solve for the slope and intercept.

Finding the Slope and Intercept via the Normal Equation

The matrix representation of linear regression is:

\[
\beta = (X^TX)^{-1}X^TY
\]

Build the feature matrix.

A column of ones in the first column accounts for the intercept.

\[
X =
\begin{bmatrix}
1 & 1.2 \\
1 & 1.4 \\
1 & 1.6 \\
1 & 2.1 \\
1 & 2.3 \\
1 & 3.0 \\
1 & 3.1 \\
1 & 3.3 \\
1 & 3.3 \\
1 & 3.8
\end{bmatrix}
\]

Form the target vector:

\[
Y =
\begin{bmatrix}
39344 \\
46206 \\
37732 \\
43526 \\
39892 \\
56643 \\
60151 \\
54446 \\
64446 \\
57190
\end{bmatrix}
\]

The parameter vector is:

\[
\beta =
\begin{bmatrix}
\beta_0 \\
\beta_1
\end{bmatrix}
\]

Find the matrix transpose:

\[
X^T =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
\end{bmatrix}
\]

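Calculate \( X^TX \). Its entries are \( n \), \( \sum x_i \), and \( \sum x_i^2 \):

\[
X^TX =
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\]

Then compute the matrix inverse. The determinant is \( (10)(70.49)-(25.1)^2 = 74.89 \), so:

\[
(X^TX)^{-1} = \frac{1}{74.89}
\begin{bmatrix}
70.49 & -25.1 \\
-25.1 & 10
\end{bmatrix}
\approx
\begin{bmatrix}
0.9412 & -0.3352 \\
-0.3352 & 0.1335
\end{bmatrix}
\]

Multiply to get \( X^TY \), whose entries are \( \sum y_i \) and \( \sum x_iy_i \):

\[
X^TY =
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
\]

Plug into the normal equation. Because the entries of \( X^TY \) are large, we keep the exact fraction instead of the rounded inverse to avoid rounding error:

\[
\beta = \frac{1}{74.89}
\begin{bmatrix}
70.49 & -25.1 \\
-25.1 & 10
\end{bmatrix}
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
= \frac{1}{74.89}
\begin{bmatrix}
2045680.61 \\
675555.40
\end{bmatrix}
\]

After matrix multiplication:

\[
\beta \approx
\begin{bmatrix}
27315.80 \\
9020.64
\end{bmatrix}
\]

So:

\[
\beta_0 \approx 27315.80
\]
\[
\beta_1 \approx 9020.64
\]

The resulting regression equation is:

\[
\hat{y} = 27315.80+9020.64x
\]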
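The same computation can be sketched with NumPy. Note one deliberate substitution: instead of forming \( (X^TX)^{-1} \) explicitly, the sketch calls `np.linalg.solve` on the system \( X^TX\beta = X^TY \), which is mathematically equivalent here and generally better behaved numerically.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)

# Feature matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

XtX = X.T @ X
XtY = X.T @ y

# Solve (X^T X) beta = X^T Y rather than inverting X^T X explicitly.
beta = np.linalg.solve(XtX, XtY)
print(XtX)
print(XtY)
print(beta)  # [intercept, slope]
```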

Why Is Gradient Descent Necessary?

After learning the normal equation, we might believe we can always solve for the best coefficients, regardless of the number of features.

However, this direct method works best with small or medium datasets. For very large datasets, the normal equation becomes computationally demanding.

Recall the normal equation:

\[
\beta = (X^TX)^{-1}X^TY
\]

Forming \( X^TX \) takes on the order of \( np^2 \) operations, and inverting the resulting matrix takes on the order of \( p^3 \), so the cost climbs steeply as the number of features grows. The inverse is what makes this approach slow and resource-intensive.

Real-world problems often involve thousands of features and millions of records. Under these conditions, the normal equation becomes too slow and computationally expensive.

That’s where gradient descent comes in. Rather than solving directly, we move stepwise toward the best solution.

To understand gradient descent, let’s explore its mathematical foundation.

The Math Behind Gradient Descent

When deriving the normal equation, we arrived at this key formula:

\[
\frac{\partial \text{MSE}}{\partial \beta} = \frac{2}{n}X^T(X\beta - Y)
\]

This represents the slope (gradient) of the bowl-shaped loss function.

Previously, we set this to zero and solved, yielding the normal equation.

With gradient descent, we stop here and pick starting values for $\beta$, often at random. Using these values, we compute the slope and iteratively move toward minimum loss.

For example, take $\beta_0 = 2$ and $\beta_1 = 5$:

\[
\beta^{(0)} =
\begin{bmatrix}
\beta_0 \\
\beta_1
\end{bmatrix}
=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]

These are just the initial values. They determine the starting point from which Gradient Descent begins its search for the minimum loss.

Next, plug these into the gradient expression to compute the current slope of the bowl curve. In expanded form, the gradient is:

\[
\frac{\partial \text{MSE}}{\partial \beta} = \frac{-2}{n}X^TY + \frac{2}{n}X^TX\beta
\]

Next, we’ll build the feature matrix.

With only one feature present, the matrix \( X \) takes the following form:

\[
X =
\begin{bmatrix}
1 & 1.2 \\
1 & 1.4 \\
1 & 1.6 \\
1 & 2.1 \\
1 & 2.3 \\
1 & 3.0 \\
1 & 3.1 \\
1 & 3.3 \\
1 & 3.3 \\
1 & 3.8
\end{bmatrix}
\]

The first column consists of ones to account for the intercept term.

Now compute \( X^T \):

\[
X^T =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 & 1 \\
1.2 & 1.4 & 1.6 & 2.1 & 2.3 & 3.0 & 3.1 & 3.3 & 3.3 & 3.8
\end{bmatrix}
\]

Now compute \( X^TX \):

\[
X^TX =
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\]

Now define the target vector as:

\[
Y =
\begin{bmatrix}
39344 \\
46206 \\
37732 \\
43526 \\
39892 \\
56643 \\
60151 \\
54446 \\
64446 \\
57190
\end{bmatrix}
\]

Now compute \( X^TY \):

\[
X^TY =
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
\]

Our dataset has \( n = 10 \) observations.

Now plug all these values into the gradient equation, with \( \beta = (2, 5) \):

\[
\frac{\partial \text{MSE}}{\partial \beta} =
\frac{-2}{10}
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
+
\frac{2}{10}
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]

First, evaluate the matrix multiplication:

\[
\begin{bmatrix}
10 & 25.1 \\
25.1 & 70.49
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
=
\begin{bmatrix}
(10)(2)+(25.1)(5) \\
(25.1)(2)+(70.49)(5)
\end{bmatrix}
=
\begin{bmatrix}
20+125.5 \\
50.2+352.45
\end{bmatrix}
=
\begin{bmatrix}
145.5 \\
402.65
\end{bmatrix}
\]

Now scale by \( \frac{2}{10} \):

\[
\frac{2}{10}
\begin{bmatrix}
145.5 \\
402.65
\end{bmatrix}
=
\begin{bmatrix}
29.1 \\
80.53
\end{bmatrix}
\]

Next, compute:

\[
\frac{-2}{10}
\begin{bmatrix}
499576 \\
1321491.3
\end{bmatrix}
=
\begin{bmatrix}
-99915.2 \\
-264298.26
\end{bmatrix}
\]

Now combine everything:

\[
\frac{\partial \text{MSE}}{\partial \beta} =
\begin{bmatrix}
-99915.2 \\
-264298.26
\end{bmatrix}
+
\begin{bmatrix}
29.1 \\
80.53
\end{bmatrix}
=
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
\]

This gradient indicates the steepness of the bowl-shaped MSE loss curve at the current parameter settings.

Here, \( -99886.1 \) denotes the slope relative to \( \beta_0 \), and \( -264217.73 \) denotes the slope relative to \( \beta_1 \).
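The gradient at the starting point can also be evaluated directly in code. This minimal sketch uses the factored form \( \frac{2}{n}X^T(X\beta - Y) \), which avoids building \( X^TX \) at all; the function name `mse_gradient` is a hypothetical helper.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def mse_gradient(beta, X, Y):
    """Gradient of the MSE with respect to beta: (2/n) X^T (X beta - Y)."""
    return (2.0 / len(Y)) * X.T @ (X @ beta - Y)


beta_start = np.array([2.0, 5.0])  # starting values used in the article
grad = mse_gradient(beta_start, X, y)
print(grad)  # [d MSE / d beta_0, d MSE / d beta_1]
```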

Since both values are negative, the loss decreases as \( \beta_0 \) and \( \beta_1 \) increase, so Gradient Descent raises both parameters (a move to the right along each axis) to reduce the loss.

Rather than directly solving for the optimal parameters using a closed-form solution, Gradient Descent incrementally adjusts the parameter values step by step until it converges to the lowest point of the bowl-shaped loss curve.

This adjustment is carried out using the Gradient Descent update rule:

\[
\beta := \beta - \alpha\frac{\partial \text{MSE}}{\partial \beta}
\]

where \( \alpha \) is referred to as the learning rate and determines the magnitude of each update step.

The update rule can be broken down as follows.

\( \beta \) stands for the current parameter values.

\( \frac{\partial \text{MSE}}{\partial \beta} \) represents the slope (gradient) of the bowl-shaped loss curve at the present point.

The gradient points in the direction where the loss rises most rapidly.

Consequently, to lower the loss, we move in the direction opposite to the gradient. This is precisely why the update rule subtracts the gradient, with \( \alpha \) governing the step size taken toward the minimum.

When a component of the gradient is positive, Gradient Descent decreases that parameter (a move to the left along its axis).

When a component of the gradient is negative, Gradient Descent increases that parameter (a move to the right along its axis).

By repeatedly evaluating gradients and refining parameters, Gradient Descent steadily advances toward the lowest point of the bowl-shaped loss curve.

Once the parameters are updated, the whole cycle repeats until the loss is minimized and the model settles on the best-fit parameters.

What we can note here is that no matrix inversion is involved at any stage.
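Putting the update rule in a loop gives batch gradient descent. The sketch below is one possible implementation, not a reference one: the function name, the fixed iteration count of 20,000, and the learning rate of 0.01 are all assumptions chosen so that the loop has time to settle on this particular dataset.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def batch_gradient_descent(X, Y, beta_init, learning_rate=0.01, n_iters=20000):
    """Repeat beta := beta - alpha * gradient, using the full dataset each step."""
    beta = np.asarray(beta_init, dtype=float).copy()
    n = len(Y)
    for _ in range(n_iters):
        gradient = (2.0 / n) * X.T @ (X @ beta - Y)
        beta -= learning_rate * gradient
    return beta


beta_gd = batch_gradient_descent(X, y, beta_init=[2.0, 5.0])
beta_exact = np.linalg.solve(X.T @ X, X.T @ y)  # normal-equation answer for comparison
print("gradient descent:", beta_gd)
print("normal equation: ", beta_exact)
```

The loop only ever multiplies by \( X \) and \( X^T \); the call to `np.linalg.solve` is there purely to compare the result against the closed-form solution.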

Learning Rate

A crucial concept to grasp here is the learning rate.

Suppose \( \alpha = 0.01 \), and the gradient computed for our dataset at \( \beta = (2, 5) \) is:

\[
\frac{\partial \text{MSE}}{\partial \beta} =
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
\]

Now plug these values into the update rule:

\[
\beta =
\begin{bmatrix}
2 \\
5
\end{bmatrix}
- 0.01
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
\]

First, scale the gradient by the learning rate:

\[
0.01
\begin{bmatrix}
-99886.1 \\
-264217.73
\end{bmatrix}
=
\begin{bmatrix}
-998.861 \\
-2642.1773
\end{bmatrix}
\]

Now substitute back:

\[
\beta =
\begin{bmatrix}
2 \\
5
\end{bmatrix}
-
\begin{bmatrix}
-998.861 \\
-2642.1773
\end{bmatrix}
=
\begin{bmatrix}
2+998.861 \\
5+2642.1773
\end{bmatrix}
=
\begin{bmatrix}
1000.861 \\
2647.1773
\end{bmatrix}
\]

After one iteration of Gradient Descent, \( \beta_0 \) shifted from \( 2 \rightarrow 1000.861 \) and \( \beta_1 \) shifted from \( 5 \rightarrow 2647.1773 \).

These revised parameter values bring us closer to the lowest point of the bowl-shaped MSE loss curve.

Now, using these updated values, the entire cycle runs again:

\[
\text{Predictions} \rightarrow \text{Residuals} \rightarrow \text{Loss} \rightarrow \text{Gradient} \rightarrow \text{Parameter Update}
\]

This looping process continues until the loss reaches its minimum and the model converges to the best-fit parameters.

Now let’s explore why selecting the right learning rate is so critical.

If the learning rate is too small, for example \( \alpha = 0.000001 \), then each update becomes negligible.

As a result, convergence is very slow, and Gradient Descent might need an impractically large number of iterations to reach the minimum.

Conversely, if the learning rate is excessively large, for example \( \alpha = 10 \), then each update becomes excessively large.

As a result, Gradient Descent may repeatedly overshoot the minimum and never converge to a solution.

Hence, picking an appropriate learning rate is essential for efficient optimization.
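The effect of the learning rate is easy to observe by running a few iterations with different values. The sketch below compares the three rates mentioned above on the article's dataset; the helper names and the choice of 20 iterations are assumptions made for illustration.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def mse(beta):
    """Mean squared error of the line described by beta on the dataset."""
    r = y - X @ beta
    return r @ r / len(y)


def loss_after_gd(learning_rate, n_iters=20):
    """Run a few batch gradient descent steps from (2, 5) and return the final loss."""
    beta = np.array([2.0, 5.0])
    for _ in range(n_iters):
        beta = beta - learning_rate * (2.0 / len(y)) * X.T @ (X @ beta - y)
    return mse(beta)


initial_loss = mse(np.array([2.0, 5.0]))
losses = {lr: loss_after_gd(lr) for lr in (1e-6, 0.01, 10.0)}

print(f"initial loss: {initial_loss:.3e}")
for lr, loss in losses.items():
    print(f"alpha = {lr:g}: loss after 20 steps = {loss:.3e}")
```

With the tiny rate the loss barely moves, with 0.01 it drops substantially, and with 10 it grows instead of shrinking because every step overshoots the minimum.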

GIF by Author

Stochastic Gradient Descent

Let’s start by recalling what gradient descent is all about.

In standard gradient descent, we calculate gradients using the complete dataset before making any parameter updates.

This approach, known as batch gradient descent, processes every single data point in each update cycle.

Think about a dataset with millions of entries.

For every iteration, standard gradient descent requires you to:

Traverse the complete dataset
Compute the loss value
Determine the gradients

Only after completing all these steps do we adjust the parameters.

This repetitive computation becomes both computationally heavy and extremely time-consuming.

That’s exactly why Stochastic Gradient Descent (SGD) was developed.

Rather than computing gradients across the entire dataset, SGD picks just one random sample at a time and updates the parameters immediately afterward.

The update formula remains unchanged:

\[
\beta := \beta - \alpha\frac{\partial \text{MSE}}{\partial \beta}
\]

The key difference lies in how the gradient is computed: SGD uses a single random sample rather than the full dataset.

Let’s walk through an example using one data point from our dataset.

Here are our initial parameter values:

\[
\beta^{(0)} =
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]

And here is the learning rate:

\[
\alpha = 0.01
\]

Suppose SGD randomly picks this training sample from our dataset:

\[
(x, y) = (3.0, 56643)
\]

For this individual observation:

\[
X =
\begin{bmatrix}
1 & 3.0
\end{bmatrix}
\]

along with

\[
y =
\begin{bmatrix}
56643
\end{bmatrix}
\]

Let’s compute:

\[
X^T =
\begin{bmatrix}
1 \\
3.0
\end{bmatrix}
\]

Then calculate \( X^TX \):

\[
X^TX =
\begin{bmatrix}
1 \\
3.0
\end{bmatrix}
\begin{bmatrix}
1 & 3.0
\end{bmatrix}
=
\begin{bmatrix}
1 & 3.0 \\
3.0 & 9.0
\end{bmatrix}
\]

Now let’s find \( X^Ty \):

\[
X^Ty =
\begin{bmatrix}
1 \\
3.0
\end{bmatrix}
\begin{bmatrix}
56643
\end{bmatrix}
=
\begin{bmatrix}
56643 \\
169929
\end{bmatrix}
\]

Since SGD operates on one sample at a time:

\[
n = 1
\]

Plugging everything into the gradient formula:

\[
\frac{\partial \text{MSE}}{\partial \beta} = \frac{-2}{n}X^Ty + \frac{2}{n}X^TX\beta
\]

Substituting the values:

\[
= \frac{-2}{1}
\begin{bmatrix}
56643 \\
169929
\end{bmatrix}
+
\frac{2}{1}
\begin{bmatrix}
1 & 3.0 \\
3.0 & 9.0
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
\]

Start with the matrix multiplication:

\[
\begin{bmatrix}
1 & 3.0 \\
3.0 & 9.0
\end{bmatrix}
\begin{bmatrix}
2 \\
5
\end{bmatrix}
=
\begin{bmatrix}
(1)(2) + (3.0)(5) \\
(3.0)(2) + (9.0)(5)
\end{bmatrix}
=
\begin{bmatrix}
2 + 15 \\
6 + 45
\end{bmatrix}
=
\begin{bmatrix}
17 \\
51
\end{bmatrix}
\]

Scaling by \( \frac{2}{1} \):

\[
\frac{2}{1}
\begin{bmatrix}
17 \\
51
\end{bmatrix}
=
\begin{bmatrix}
34 \\
102
\end{bmatrix}
\]

Now compute:

\[
\frac{-2}{1}
\begin{bmatrix}
56643 \\
169929
\end{bmatrix}
=
\begin{bmatrix}
-113286 \\
-339858
\end{bmatrix}
\]

Putting everything together:

\[
\frac{\partial \text{MSE}}{\partial \beta} =
\begin{bmatrix}
-113286 \\
-339858
\end{bmatrix}
+
\begin{bmatrix}
34 \\
102
\end{bmatrix}
\]

This simplifies to:

\[
\frac{\partial \text{MSE}}{\partial \beta} =
\begin{bmatrix}
-113252 \\
-339756
\end{bmatrix}
\]

The negative of this gradient gives the direction of steepest descent for this particular training sample.

Now let’s update the parameters using:

\[
\beta := \beta - \alpha\frac{\partial \text{MSE}}{\partial \beta}
\]

Inserting all known values:

\[
\beta =
\begin{bmatrix}
2 \\
5
\end{bmatrix}
- 0.01
\begin{bmatrix}
-113252 \\
-339756
\end{bmatrix}
\]

First, apply the learning rate:

\[
=
\begin{bmatrix}
2 \\
5
\end{bmatrix}
-
\begin{bmatrix}
-1132.52 \\
-3397.56
\end{bmatrix}
\]

Perform the subtraction:

\[
=
\begin{bmatrix}
2 + 1132.52 \\
5 + 3397.56
\end{bmatrix}
\]

The final result:

\[
\beta =
\begin{bmatrix}
1134.52 \\
3402.56
\end{bmatrix}
\]

After computing just one sample, the parameters are updated instantly.
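The single-sample update worked through above fits in a few lines. This is a sketch under the same assumptions as the worked example (starting point \( (2, 5) \), learning rate 0.01, sample \( (3.0, 56643) \)); the function name `sgd_step` is hypothetical.

```python
import numpy as np


def sgd_step(beta, x_i, y_i, learning_rate):
    """One SGD update for simple linear regression using a single sample."""
    features = np.array([1.0, x_i])     # leading 1 accounts for the intercept
    error = features @ beta - y_i       # prediction minus target
    gradient = 2.0 * error * features   # gradient of this sample's squared error
    return beta - learning_rate * gradient, gradient


beta_new, grad = sgd_step(np.array([2.0, 5.0]), x_i=3.0, y_i=56643.0, learning_rate=0.01)
print(grad)      # gradient for this one sample
print(beta_new)  # parameters after one update
```

With one sample, the matrix products collapse to a scalar error multiplied by the feature vector, which is why each SGD step is so cheap.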

SGD then selects another random sample from the dataset and performs the same calculations again.

In contrast to batch gradient descent, which waits until it has processed the entire dataset before adjusting parameters, SGD modifies them after each individual training example.

Thanks to these rapid, frequent updates, SGD often starts making progress toward a solution much sooner, because it does not wait for a full pass over the data before each step.

Just look at how much simpler the math becomes when working with a single data point.

SGD keeps iterating through different training samples, adjusting parameters continuously, until the loss reaches its minimum or stops decreasing noticeably.

However, the optimization path tends to be erratic and jagged, resembling a zigzag pattern.

Despite this noisy behavior, SGD is incredibly valuable for tackling modern machine learning and deep learning challenges that involve massive datasets.
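Repeating the single-sample update many times gives the full SGD procedure. The sketch below is one plausible version: it draws a random sample index at every step, keeps the learning rate constant, and runs for a fixed number of steps. The seed, step count, and function names are assumptions made for reproducibility and illustration.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def mse(beta):
    """Mean squared error over the whole dataset, used only for monitoring."""
    r = y - X @ beta
    return r @ r / len(y)


def sgd(X, Y, beta_init, learning_rate=0.01, n_steps=20000, seed=0):
    """Stochastic gradient descent: one randomly chosen sample per update."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta_init, dtype=float).copy()
    for _ in range(n_steps):
        i = rng.integers(len(Y))             # pick one sample at random
        error = X[i] @ beta - Y[i]
        beta -= learning_rate * 2.0 * error * X[i]
    return beta


beta_start = np.array([2.0, 5.0])
beta_sgd = sgd(X, y, beta_start)
print("start loss:", mse(beta_start))
print("final loss:", mse(beta_sgd))
print("parameters:", beta_sgd)
```

Because the learning rate is constant, the parameters keep jittering around the minimum instead of settling exactly on it, which is the zigzag behavior described above. Shrinking the learning rate over time is a common way to dampen that jitter.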

Conclusion

We now have a solid grasp of both gradient descent and stochastic gradient descent.

We first derived the normal equation and discovered that computing the inverse matrix becomes computationally demanding and memory-intensive for large datasets.

To address this challenge, we turned to gradient descent, which isn’t restricted to linear regression but is widely applied across various machine learning and deep learning algorithms.

We then learned that even batch gradient descent, the first variant we examined, can be painfully slow for massive datasets because it requires scanning the entire dataset before each parameter update.

This naturally led us to stochastic gradient descent (SGD), which processes one training example at a time and is significantly faster than batch gradient descent for large-scale datasets.

There’s also a third approach called mini-batch gradient descent, where we use a small subset of training samples, typically 32 or 64 rows, before updating the parameters.

This offers a balanced middle ground – faster than batch gradient descent while producing more stable updates than stochastic gradient descent.
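A mini-batch version can be sketched by averaging the gradient over a small random subset at each step. Because the dataset here has only 10 rows, the sketch assumes a batch size of 4 instead of the typical 32 or 64; the function name, seed, and epoch count are likewise illustrative assumptions.

```python
import numpy as np

x = np.array([1.2, 1.4, 1.6, 2.1, 2.3, 3.0, 3.1, 3.3, 3.3, 3.8])
y = np.array([39344, 46206, 37732, 43526, 39892,
              56643, 60151, 54446, 64446, 57190], dtype=float)
X = np.column_stack([np.ones_like(x), x])


def mse(beta):
    """Mean squared error over the whole dataset."""
    r = y - X @ beta
    return r @ r / len(y)


def minibatch_gradient_descent(X, Y, beta_init, batch_size=4,
                               learning_rate=0.01, n_epochs=3000, seed=0):
    """Update the parameters once per mini-batch of rows."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta_init, dtype=float).copy()
    n = len(Y)
    for _ in range(n_epochs):
        order = rng.permutation(n)                 # reshuffle the rows every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, Yb = X[idx], Y[idx]
            gradient = (2.0 / len(idx)) * Xb.T @ (Xb @ beta - Yb)
            beta -= learning_rate * gradient
    return beta


beta_start = np.array([2.0, 5.0])
beta_mb = minibatch_gradient_descent(X, y, beta_start)
print("final loss:", mse(beta_mb))
print("parameters:", beta_mb)
```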

Although linear regression does have a direct analytical solution, practitioners often choose gradient descent when dealing with large datasets containing millions of records because the normal equation becomes both computationally expensive and practically unworkable.

When it comes to deep learning, analytical solutions generally don’t exist at all, making optimization techniques like gradient descent absolutely essential.

Dataset License

The dataset referenced in this article is the Salary dataset.

It is openly accessible on Kaggle and is distributed under the Creative Commons Zero (CC0 Public Domain) license. This means you are free to use, adapt, and redistribute it for any purpose—whether personal, academic, or commercial—without any restrictions.

I hope this article has helped clarify gradient descent and stochastic gradient descent for you.

You can find more of my work on Medium and LinkedIn.

I recently published an in-depth exploration of Lasso Regression explained through geometric intuition.

Check it out here.

Thanks for reading!
