Introduction to Gradient Descent with Python

In this article, I’m going to talk about a popular optimization algorithm in machine learning: gradient descent. I’ll explain what gradient descent is, how it works, and then we’ll write the gradient descent algorithm from scratch in Python. This article assumes you are familiar with derivatives in Calculus.

What is Gradient Descent?

Gradient descent is an optimization algorithm for finding the minimum of a function. The algorithm does this by checking the steepness of the slope along the graph of the function and using that information to slowly move towards the lowest point, which presumably has a slope of 0.

You can think of gradient descent as akin to a blind-folded hiker on top of a mountain trying to get to the bottom. The hiker must feel the incline, or the slope, of the mountain in order to get an idea of where she is going. If the slope is steep, the hiker is closer to the peak and can take bigger steps. If the slope is less steep, the hiker is closer to the bottom and takes smaller steps. If the hiker feels flat ground (a zero slope), she can assume she’s reached the bottom, or minimum.

So given a function with a convex graph, the gradient descent algorithm attempts to find the minimum of the function by using the derivative to check the steepness of points along the line and slowly move towards a slope of zero. After all, “gradient” is just another word for slope.
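At its core, every step of the algorithm applies the same simple update rule, which is exactly what we will implement below (here slope_at_previous_x is just shorthand for the derivative evaluated at the current guess):

new_x = previous_x - learning_rate * slope_at_previous_x

The learning_rate controls how big each step is, and because the step is scaled by the slope, the steps naturally get smaller as the algorithm approaches flat ground.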

Implement Gradient Descent in Python

Before we start, import the SymPy library and create a “symbol” called x. We’ll be needing these lines later when we are working with math functions and derivatives.

from sympy import *
x = Symbol('x')

We create our gradient_descent function and give it three parameters: cost_fn, initial_guess, and learning_rate. The cost_fn is the math function that we want to find the minimum of. The initial_guess parameter is the number that is our first guess for the x-value of the minimum of the function. We will update this variable to be our new guess after each learning iteration. The last parameter is the learning rate, which controls how big each adjustment step is.

def gradient_descent(cost_fn, initial_guess, learning_rate):
    df = cost_fn.diff(x)
    df = lambdify(x, df)

    new_x = initial_guess

    for n in range(100):
        # Step 1: Predict (Make a guess)
        previous_x = new_x

        # Step 2: Calculate the error
        gradient = df(previous_x)

        # Step 3: Learn (Make adjustments)
        new_x = previous_x - learning_rate * gradient

Inside the function, we first get the derivative of the cost function that was passed in as a parameter, using the diff function of the SymPy library. We store the derivative in the df variable. Then, we use the lambdify function, which turns the symbolic derivative into a regular Python function so we can plug our guesses into it. Read my article on calculating derivatives in Python for more info on this.
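If the diff and lambdify steps feel abstract, here is a tiny standalone sketch of what they do, using the same x**2 cost function we’ll pass in later:

from sympy import Symbol, lambdify

x = Symbol('x')
cost_fn = x**2

df = cost_fn.diff(x)    # symbolic derivative: 2*x
df = lambdify(x, df)    # turn it into a regular Python function we can call

print(df(3))            # prints 6, the slope of x**2 at x = 3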

In the for loop, our gradient descent function is following the 3-step algorithm that is used to train many machine learning tools:

  1. Predict (Make a guess)
  2. Calculate the error
  3. Learn (Make adjustments)

You can learn more about this process in this article on how machines “learn.”

In the for loop, the first step is to make an arbitrary guess for the x-value of the minimum of the function. We do this by setting previous_x to new_x, which starts out as the user’s initial guess. previous_x will help us keep track of the preceding prediction value as we make new guesses.

Next, we calculate the error or, in other words, we see how far our current guess is from the minimum of the function. We do this by calculating the derivative of the function at the point we guessed, which will give us the slope at that point. If the slope is large, the guess is far from the minimum. But if the slope is close to 0, the guess is getting closer to the minimum.

Next, we “learn” from the error. In the previous step, we calculated the slope at the x-value that we guessed. We multiply that slope by the learning_rate and subtract that from the current guess value stored in previous_x. Then, we store this new guess value back into new_x.
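To see these three steps in action, here is one iteration worked out by hand, using the same cost function, starting guess, and learning rate as the example we’ll run shortly:

# cost function x**2, so its derivative is 2*x
# starting guess of 3 and a learning rate of 0.1
previous_x = 3
gradient = 2 * previous_x               # the slope at x = 3 is 6
new_x = previous_x - 0.1 * gradient     # 3 - 0.1 * 6 = 2.4

The next iteration starts from 2.4, where the slope (4.8) is smaller, so the next step is smaller too, just like the hiker slowing down near the bottom.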

We then repeat these steps over and over in the for loop until it finishes its 100 iterations.

Before we run our gradient descent function, let’s add some print statements at the end of the function so we can see the x-value it settled on and the slope at that point.

    print('Minimum occurs at x-value:', new_x)
    print('Slope at the minimum is: ', df(new_x))

Now, let’s run our gradient descent function and see what type of output we get with an example. In this example, the cost function is f(x) = x², the initial guess for x is 3, and the learning rate is 0.1.

my_fn = x**2
gradient_descent(my_fn, 3, 0.1)
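With these inputs, each iteration multiplies the current guess by 0.8, since new_x = x - 0.1 * 2x = 0.8x. After 100 iterations the printed x-value should therefore be roughly 3 * 0.8^100, which is on the order of 10^-10, essentially 0. That matches what we expect: the true minimum of f(x) = x² is at x = 0, where the slope is also 0.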

Currently, we are running the learning loop an arbitrary number of times. In this example, the loop runs 100 times. But maybe we don’t need to run the loop that many times. Oftentimes you already know ahead of time how precise a result you need, and you can tell the loop to stop running once that level of precision is met. There are many ways you can implement this, but I’ll show you one using the for loop we already have.

precision = 0.0001

for n in range(100):
    previous_x = new_x
    gradient = df(previous_x)
    new_x = previous_x - learning_rate * gradient
    
    # stop the loop early once the change between guesses is smaller than our precision
    step_size = abs(new_x - previous_x)
    
    if step_size < precision:
        break

First, we define a precision value that the gradient descent algorithm should reach: once the change between successive guesses is smaller than this value, we consider the result good enough. You can also make this a parameter to the function if you choose.

Inside the loop, we create a new variable called step_size, which is the distance between previous_x and new_x, the new guess that was just calculated in the “learning” step. We take the absolute value of this difference in case it’s negative.

If the step_size is less than the precision we specified, the loop will finish, even if it hasn’t reached 100 iterations yet.
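Putting the pieces together, here is one way the complete function could look, with the precision check folded into the loop, the print statements at the end, and precision exposed as an optional parameter (making it a parameter is the optional tweak mentioned above). It still relies on the SymPy import and the x = Symbol('x') line from the beginning of the article.

def gradient_descent(cost_fn, initial_guess, learning_rate, precision=0.0001):
    # derivative of the cost function, converted to a callable Python function
    df = cost_fn.diff(x)
    df = lambdify(x, df)

    new_x = initial_guess

    for n in range(100):
        # Step 1: Predict (Make a guess)
        previous_x = new_x

        # Step 2: Calculate the error
        gradient = df(previous_x)

        # Step 3: Learn (Make adjustments)
        new_x = previous_x - learning_rate * gradient

        # stop early once the steps are smaller than the desired precision
        step_size = abs(new_x - previous_x)
        if step_size < precision:
            break

    print('Minimum occurs at x-value:', new_x)
    print('Slope at the minimum is: ', df(new_x))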

Instead of solving for the minimum of a cost function analytically, the gradient descent algorithm converges on it numerically, one small step at a time. Like a blind-folded hiker, the algorithm walks down the valley (the cost function), following the slope of the graph until it reaches the minimum point.

How Do Machines Learn?

You’ve probably heard of machine learning models that can read human handwriting or understand speech. You might know that these models had to be trained in order to accomplish these tasks– they had to learn. But how exactly does a machine “learn”? What are the steps involved?

In this article, I’m going to be giving a high-level overview of how the “learning” in machine learning happens. I’m going to talk about fundamental ML concepts including cost functions, optimization, and linear regression. I’ll outline the basic framework used in most machine learning techniques.

Data is the foundation of any machine learning model. In a nutshell, the data scientist feeds a bunch of data into the ML model and, as it starts to “learn” from the data, the model will eventually develop a solution. What is the solution? The solution is typically a function that describes the relationship in the data. For a given input, the function should be able to provide the expected output.

In the case of linear regression, one of the most basic ML models, the regression model “learns” two parameters: the slope and the intercept. Once the model learns these parameters to the desired extent, the model can be used to compute the output y for a given input X (in the linear regression equation y = b0 + b1*X). If you’re unfamiliar with linear regression, take a look at my article on linear regression to understand this better.
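For instance, if the model had learned an intercept of 2 and a slope of 0.5 (made-up values for illustration), making a prediction is just a matter of plugging an input into that equation:

# hypothetical learned parameters for y = b0 + b1*X
b0 = 2.0    # intercept
b1 = 0.5    # slope

def predict(X):
    return b0 + b1 * X

print(predict(4))   # 2.0 + 0.5 * 4 = 4.0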

So now that we know what the goal of machine learning is, we can talk about how exactly the learning happens. The machine learning model usually follows three core steps in order to “learn” the relationship in the data as described by the solution function:

  1. Predict
  2. Calculate the error
  3. Learn

The first step is for the model to make a prediction. To start, the model may make arbitrary guesses for the values that it is solving for in the solution function. In the case of linear regression, the ML model would make guesses for the values of the slope and intercept.

Next, the model would check its prediction against the actual test data and see how good/bad the prediction was. In other words, the model calculates the error in its prediction. In order to compare the prediction against the data, we need to find a way to measure how “good” our prediction was.

Finally, the model will “learn” from its error by adjusting its prediction to have a smaller error.

The model will repeat these 3 steps (predict, calculate the error, and learn) a bunch of times and slowly converge on the best coefficients for the solution. This simple 3-step algorithm is the basis for training most machine learning models.
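As a rough sketch (not pulled from any particular library), here is what those three steps might look like for simple linear regression, using gradient-descent-style adjustments like the ones from the first half of this article. The data points and learning rate are made up for illustration.

# made-up training data that roughly follows y = 2 + 0.5*x
X_data = [1, 2, 3, 4, 5]
y_data = [2.4, 3.1, 3.4, 4.1, 4.4]

b0, b1 = 0.0, 0.0          # arbitrary starting guesses for intercept and slope
learning_rate = 0.01
n = len(X_data)

for _ in range(5000):
    # Step 1: Predict with the current parameters
    predictions = [b0 + b1 * x_i for x_i in X_data]

    # Step 2: Calculate the error (residuals between actual and predicted values)
    errors = [y_i - p for y_i, p in zip(y_data, predictions)]

    # Step 3: Learn, nudging each parameter in the direction that shrinks the error
    b0 += learning_rate * (2 / n) * sum(errors)
    b1 += learning_rate * (2 / n) * sum(e * x_i for e, x_i in zip(errors, X_data))

print(b0, b1)   # should land near an intercept of about 2 and a slope of about 0.5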

When I talked about calculating error earlier, I didn’t talk about the ways in which we measure how “good” or “bad” our predictions are. That leads me to the next topic: cost functions. In machine learning, a cost function is a mechanism that returns the error between predicted outcomes and the actual outcomes. Cost functions measure the size of the error to help achieve the overall goal of optimizing for a solution with the lowest cost.

The objective of an ML model is to find the values of the parameters that minimize the cost function. Cost functions will be different depending on the use case but they all have this same goal.

The Residual Sum of Squares is an example of a cost function. In linear regression, the Residual Sum of Squares is used to calculate and measure the error in predicted coefficient values. It does this by summing the squared gaps (residuals) between the predicted values on the linear regression line and the actual data point values (check out this article for more detail). The lowest sum indicates the most accurate solution.
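To make this concrete, here is a tiny example of computing a residual sum of squares by hand for a handful of made-up values:

# made-up actual values and the corresponding predictions from some regression line
actual    = [3.0, 5.0, 7.0]
predicted = [2.5, 5.5, 6.0]

# sum of the squared gaps (residuals) between actual and predicted values
rss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
print(rss)   # 0.25 + 0.25 + 1.0 = 1.5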

Cost functions fall under the broader category of optimization. Optimization is a term used in a variety of fields, but in machine learning it is defined as the process of progressing towards the defined goal, or solution, of an ML model. This includes minimizing “bad things” or “costs”, as is done in cost functions, but it also includes maximizing “good things” in other types of functions.

In summary, machine learning is typically done with a fundamental 3-step process: make a prediction, calculate the error, and learn / make adjustments. The error in a prediction is calculated using a cost function. Once the error is minimized, the model is done “learning” and is left with a function that should provide the expected result for future data.