Machine Learning Basics Part 3: Basic model training using Linear Regression and Gradient Descent

Machine Learning Basics Part 3: Basic model training using Linear Regression and Gradient Descent

If you missed part one in the series, you can start here (Machine Learning Basics Part 1: An Overview).

Linear Regression is a straightforward way to find the linear relationship between one or more variables and a predicted target using a supervised learning algorithm. In simple linear regression, the model predicts the relationship between two variables. In multiple linear regression, additional variables that influence the relationship can be included. Output for both types of linear regression is a value within a continuous range.

Simple Linear Regression: Linear Regression works by finding the best fit line to a set of data points.

For example, a plot of the linear relationship between study time and test scores allows the prediction of a test score given the amount of hours studied.

To calculate this linear relationship, use the following:

In this example, ŷ is the predicted value, x is a given data point, θ1 is the feature weight, and θ0 is the intercept point, also known as the bias term. The best fit line is determined by using gradient descent to minimize the cost function. This is a complex way of saying the best line is one that makes predictions closest to actual values. In linear regression, the cost function is calculated using mean squared error (MSE): #331b9

Mean Squared Error for Linear Regression1

In the equation above, the letter m represents the number of data points, ????T is the transpose of the model parameters theta, x is the feature value, and y is the prediction. Essentially, the line is evaluated by the distance between the predicted values and the actual values. Any difference between predicted value and actual value is an error. Minimizing mean squared error increases the accuracy of the model by selecting the line where the predictions and actual values are closest together.

Gradient descent is the method of iteratively adjusting the parameter theta (????) to find the lowest possible MSE. A random parameter is used initially and each iteration of the algorithm takes a small step—the size of which is determined by the learning rate—to gradually change the value of the parameter until the MSE has reached the minimum value. Once this minimum is reached, the algorithm is said to have converged.


Be aware that choosing a learning rate that is smaller than ideal will result in an algorithm that converges extremely slowly because the steps it takes with each iteration are too small. Choosing a learning rate that is too large can result in a model that never converges because step size is too large and it can overshoot the minimum.

Learning Rate set too small1

Learning Rate set too large1


Multiple Linear Regression: Multiple linear regression, or multivariate linear regression, works similarly to simple linear regression but adds additional features. If we revisit the previous example of hours studied to predict test scores, a multiple linear regression example could be using hours studied and hours of sleep the night before exam to predict test scores. This model allows us to use unrelated features on a single data point to make a prediction about that data point. This can be represented visually as finding the plane that best fits the data. In the example below, we can see the relationship between horsepower, weight, and miles per gallon.

Multiple Linear Regression3

Thanks for reading our machine learning series, and keep and eye out for our next blog!



  1. Geron, Aurelien (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA: O’Reilly.
Machine Learning Basics Part 1: An Overview

Machine Learning Basics Part 1: An Overview

This is the first in a series of Machine Learning posts meant to act as a gentle introduction to Machine Learning techniques and approaches for those new to the subject. The material is strongly sourced from Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron and from the Coursera Machine Learning class by Andrew Ng. Both are excellent resources and are highly recommended.

Machine Learning is often defined as “the field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959).

More practically, it is a program that employs a learning algorithm or neural net architecture that once trained on an initial data set, can make predictions on new data.

Common Learning Algorithms:¹

Linear and polynomial regression

Logistic regression

K-nearest neighbors

Support vector machines

Decision trees

Random forests

Ensemble methods

While the above learning algorithms can be extremely effective, more complex problems -, like image classification and natural language processing (NLP) – often require a deep neural net approach.

Common Neural Net (NN) Architectures:¹

Feed forward NN

Convolutional NN (CNN)

Recurrent NN (RNN)

Long short-term memory (LSTM)


We will go into further detail on the above learning algorithms and neural nets in later blog posts.

Some Basic terminology:

Features – These are attributes of the data. For example, a common dataset used to introduce Machine Learning techniques is the Pima Indians Diabetes dataset, which is used to predict the onset of diabetes given additional health indicators. For this dataset, the features are pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, etc.

Labels – These are the desired model predictions. In supervised training, this value is provided to the model during training so that it can learn to associate specific features with a label and increase prediction accuracy. In the Pima Indians Diabetes example, this would be a 1 (indicating diabetes onset is likely) or a 0 (indicating low likelihood of diabetes).

Supervised Learning – This is a learning task in which the training set used to build the model includes labels. Regression and classification are both supervised tasks.

Unsupervised Learning -This is a learning task in which training data is not labeled. Clustering, visualization, dimensionality reduction and association rule learning are all unsupervised tasks.

Some Supervised Learning Algorithms:¹

K-nearest neighbors

Linear regression

Logistic regression

Support vector machines (SVMs)

Decision trees and random forests

Neural networks

Unsupervised Learning Algorithms:¹


• K-means

• Hierarchical cluster analysis (HCA)

• Expectation maximization

Visualization and Dimensionality Reduction

• Principal component analysis (PCA)

• Kernel PCA

• Locally-linear embedding (LLE)

• t-distributed Stochastic Neighbor Embedding (t-SNE)

Association Rule Learning

• Apriori

• Eclat

Dimensionality Reduction: This is the act of simplifying data without losing important information. An example of this is feature extraction, where correlated features are merged into a single feature that conveys the importance of both. For example, if you are predicting housing prices, you may be able to combine square footage with number of bedrooms to create a single feature representing living space

Batch Learning: This is a system that is incapable of learning incrementally and must be trained using all available data at once1. To learn new data, it must be retrained from scratch.

Online Learning: This is a system that is trained incrementally by feeding it data instances sequentially. This system can learn new data as it arrives.

Underfitting:  This is what happens when you creating a model that generalizes too broadly. It does not perform well on the training or test set.

Overfitting:  This is what occurs when you creating a model that performs well on the training set, but has become too specialized and no longer performs well on new data.

Common Notations:

m: The total number of instances in the dataset

X: A matrix containing all of the feature values of every instance of the dataset

x(i): A vector containing all of the feature values of a single instance of the dataset, the ith instance.

y: A vector containing the labels of the dataset. This is the value the model should predict


  1. Géron, Aurélien (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA: O’Reilly.