Machine Learning Basics Part 3: Basic model training using Linear Regression and Gradient Descent

If you missed part one in the series, you can start here (Machine Learning Basics Part 1: An Overview).

Linear Regression is a straightforward way to find the linear relationship between one or more variables and a predicted target using a supervised learning algorithm. In simple linear regression, the model predicts the relationship between two variables. In multiple linear regression, additional variables that influence the relationship can be included. Output for both types of linear regression is a value within a continuous range.

Simple Linear Regression: Linear Regression works by finding the best fit line to a set of data points.

For example, a plot of the linear relationship between study time and test scores allows the prediction of a test score given the amount of hours studied.

To calculate this linear relationship, use the following:

ŷ = θ0 + θ1x

In this example, ŷ is the predicted value, x is a given data point, θ1 is the feature weight, and θ0 is the intercept point, also known as the bias term. The best fit line is determined by using gradient descent to minimize the cost function. This is a complex way of saying the best line is the one that makes predictions closest to the actual values. In linear regression, the cost function is calculated using mean squared error (MSE):

MSE(𝛉) = (1/m) Σ (𝛉Tx(i) − y(i))²

Mean Squared Error for Linear Regression1

In the equation above, the letter m represents the number of data points, 𝛉T is the transpose of the model parameters theta, x is the vector of feature values, and y is the actual target value. Essentially, the line is evaluated by the distance between the predicted values and the actual values. Any difference between a predicted value and an actual value is an error. Minimizing mean squared error increases the accuracy of the model by selecting the line where the predictions and actual values are closest together.

Gradient descent is the method of iteratively adjusting the parameter theta (𝛉) to find the lowest possible MSE. A random parameter is used initially and each iteration of the algorithm takes a small step—the size of which is determined by the learning rate—to gradually change the value of the parameter until the MSE has reached the minimum value. Once this minimum is reached, the algorithm is said to have converged.
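To make this concrete, here is a minimal NumPy sketch of batch gradient descent for the study-time example. The data, learning rate, and iteration count are invented for illustration, not taken from the original post:

```python
import numpy as np

# Toy data: test score ≈ 50 + 10 * hours_studied, plus noise (invented for illustration)
rng = np.random.default_rng(42)
hours = rng.uniform(0, 5, size=(100, 1))
scores = 50 + 10 * hours + rng.normal(0, 2, size=(100, 1))

X = np.hstack([np.ones((100, 1)), hours])  # prepend a column of 1s for the bias term θ0
theta = rng.normal(size=(2, 1))            # start from random parameters
learning_rate = 0.05                       # step size for each iteration
m = len(X)

for _ in range(2000):
    gradients = (2 / m) * X.T @ (X @ theta - scores)  # gradient of the MSE cost
    theta -= learning_rate * gradients                # take one small step downhill

print(theta.ravel())  # converges toward roughly [50, 10] (bias, feature weight)
```

Each iteration moves theta a little further down the MSE surface; once the gradient is near zero, the algorithm has converged.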


Be aware that choosing a learning rate that is smaller than ideal will result in an algorithm that converges extremely slowly because the steps it takes with each iteration are too small. Choosing a learning rate that is too large can result in a model that never converges because step size is too large and it can overshoot the minimum.

Learning Rate set too small1

Learning Rate set too large1


Multiple Linear Regression: Multiple linear regression, or multivariate linear regression, works similarly to simple linear regression but adds additional features. If we revisit the previous example of hours studied to predict test scores, a multiple linear regression example could use hours studied and hours of sleep the night before the exam to predict test scores. This model allows us to use multiple independent features of a single data point to make a prediction about that data point. This can be represented visually as finding the plane that best fits the data. In the example below, we can see the relationship between horsepower, weight, and miles per gallon.
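As a sketch of how this might look in code (the data below is invented; scikit-learn is assumed, since the series uses it elsewhere):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: [hours_studied, hours_of_sleep] for six students, with their test scores
X = np.array([[1, 6], [2, 7], [3, 5], [4, 8], [5, 7], [6, 8]])
y = np.array([62, 70, 71, 85, 88, 95])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # one weight per feature, plus the bias term
print(model.predict([[4, 6]]))        # predicted score for 4 hours studied, 6 hours slept
```

With two features, the fitted model defines a plane rather than a line.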

Multiple Linear Regression3

Thanks for reading our machine learning series, and keep an eye out for our next blog!



  1. Geron, Aurelien (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA: O’Reilly.
Machine Learning Basics Part 2: Regression and Classification

If you missed part one in the series, you can start here (Machine Learning Basics Part 1: An Overview).


Common real-world problems that are addressed with regression models are predicting housing values, financial forecasting, and predicting travel commute times. Regression models can have a single input feature, referred to as univariate, or multiple input features, referred to as multivariate. When evaluating a regression model, performance is determined by calculating the mean squared error (MSE) cost function. MSE is the average of the squared errors of each data point from the hypothesis, or simply how far each prediction was from the desired outcome. A model that has a high MSE cost function fits the training data poorly and should be revised.

A visual representation of MSE:

In the image above,1 the actual data point values are represented by red dots. The hypothesis, which is used to make any predictions on future data, is represented by the blue line. The difference between the two is indicated by the green lines. These green lines are used to compute MSE and evaluate the strength of the model’s predictions.
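The calculation itself is short; a minimal sketch with invented values:

```python
import numpy as np

# Invented values: actual data points (the red dots) and the hypothesis line's predictions
y_actual = np.array([3.0, 5.0, 7.5, 9.0])
y_predicted = np.array([3.5, 4.5, 7.0, 9.5])

errors = y_predicted - y_actual  # the green lines in the image
mse = np.mean(errors ** 2)       # mean squared error
print(mse)  # 0.25
```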

Regression Problem Examples:

  • Given BMI, current weight, activity level, gender, and calorie intake, predict future weight.
  • Given calorie intake, fitness level, and family history, predict percent probability of heart disease.


Commonly Used Regression Models:

Linear Regression: This is a model that represents the relationship between one or more input variables and a linear scalar response. Scalar refers to a single real number.

Ridge Regression: This is a linear regression model that incorporates a regularization term to prevent overfitting. If the regularization term (𝝰) is set to 0, ridge regression acts as simple linear regression. Note that data must be scaled before performing ridge regression.

Lasso Regression: Lasso is an abbreviation for least absolute shrinkage and selection operator regression. Similar to ridge regression, lasso regression includes a regularization term. One benefit to using lasso regression is that it tends to set the weights of the least important features to zero, effectively performing feature selection.2 You can implement lasso regression in scikit-learn using the built-in model library.

Elastic Net: This model uses a regularization term that is a mix of both ridge and lasso regularization terms. By setting r=0 the model behaves as a ridge regression, and setting r=1 makes it behave like a lasso regression. This additional flexibility in customizing regularization can provide the benefits of both models.2 Implement elastic net in scikit-learn using the built-in model library. Select an alpha value to control regularization and an l1_ratio to set the mix ratio r.
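A sketch of all three regularized models in scikit-learn (the data and alpha values are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet
from sklearn.preprocessing import StandardScaler

# Invented data: y depends on the first two features; the third is irrelevant noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.1, size=50)

X_scaled = StandardScaler().fit_transform(X)  # scale the data before regularized regression

ridge = Ridge(alpha=1.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_scaled, y)  # l1_ratio is the mix ratio r

print(ridge.coef_)    # all weights shrunk toward zero
print(lasso.coef_)    # weight of the irrelevant third feature driven to exactly zero
print(elastic.coef_)  # behavior between ridge (r=0) and lasso (r=1)
```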

Classification: Classification problems predict a class. They can also return a probability value, which is then used to determine the class most likely to be correct. For classification problems, model performance is determined by calculating accuracy.

model accuracy = (correct predictions / total predictions) * 100
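In code, this is a one-liner; a small sketch with invented predictions:

```python
def accuracy(predictions, labels):
    """Percentage of predictions that match the true labels."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels) * 100

print(accuracy([1, 0, 1, 1, 0], [1, 0, 0, 1, 0]))  # 80.0
```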

Classification Problem Examples: Classification has its benefits for predictions in the healthcare industry. For example, given a dataset with features including glucose levels, pregnancies, blood pressure, skin thickness, insulin, and BMI, predictions can be made on the likelihood of the onset of diabetes. Because this prediction should be a 0 or 1, it is considered a binary classification problem.

Commonly Used Classification Models:

Logistic Regression: This is a model that uses a regression algorithm, but is most often used for classification problems since its output can be used to determine the probability of belonging to a certain class.2 Logistic regression uses the sigmoid function to output a value between 0 and 1. If the probability is >= 0.5 that an instance is in the positive class (represented by a 1), the model predicts 1. Otherwise, it predicts 0.
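The sigmoid function and the 0.5 decision threshold can be sketched in a few lines:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + math.exp(-z))

def predict_class(z):
    # Predict the positive class (1) when the estimated probability is >= 0.5
    return 1 if sigmoid(z) >= 0.5 else 0

print(sigmoid(0))           # 0.5
print(predict_class(2.0))   # 1
print(predict_class(-1.5))  # 0
```

In a full logistic regression model, z would be the weighted sum of the input features.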

Softmax Regression: This is a logistic regression model that can support multiple classes. Softmax predicts the class with the highest estimated probability. It can only be used when classes are mutually exclusive.2

Naive Bayes: This is a classification system that assumes that the value of a feature is independent from the value of any other feature and ignores any possible correlations between features in making predictions. The model then predicts the class with the highest probability.4

Support Vector Machines (SVM): This is a classification system that identifies a decision border, or hyperplane, as wide as possible between class types and predicts class based on the side of the border that any point falls on. This system does not use probability to assign a class label. SVM models can be fine-tuned by adjusting kernel, regularization, gamma, and margin. We will explore these hyperparameters further in an upcoming blog post focused solely on SVM. Note that SVM can also be used to perform regression tasks.

Decision Trees and Random Forests: A decision tree is a model that separates data into branches by asking a binary question at each fork. For example, in a fruit classification problem one tree fork could ask if a fruit is red. Each fruit instance would either go to one branch for yes or the other for no. At the end of each branch is a leaf with all of the training instances that followed the same decision path. The common problem of overfitting can often be avoided by combining multiple trees into a random forest and taking the majority vote of the individual trees' predictions.
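A brief scikit-learn sketch of a random forest (the built-in iris dataset stands in for the fruit example; the hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a small built-in dataset and hold out a test set
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees; the forest's prediction aggregates the individual trees' votes
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print(forest.score(X_test, y_test))  # accuracy on held-out data
```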

Neural Networks (NN): This is a model composed of layers of connected nodes. The model takes information in via an input layer and passes it through one or more hidden layers composed of nodes. These nodes are activated by their input, make some determination, and generate output for the next layer of nodes. Connections between nodes have edges, which have a weight that can be adjusted to influence learning. A bias term can also be added to the edges to create a threshold theta (𝛉), which is customizable and determines if the node’s output will continue to the next layer of nodes. The final layer is the output layer, which generates class probabilities and makes a final prediction. When a NN has two or more hidden layers, it’s called a deep neural network. There are multiple types of neural networks and we will explore this in more detail in later blog posts.

K-nearest Neighbor: This model evaluates a new data point by its proximity to training data points and assigns a class based on the majority class of its closest neighbors as determined by feature similarity. K is an integer set when the model is built and determines how many neighbors are considered. The neighborhood around a new point is expanded until it includes k neighbors.
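A small sketch with invented fruit data (redness and diameter are made-up features for illustration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Invented data: [redness, diameter_cm]; class 0 = apple, class 1 = grapefruit
X = [[0.9, 7.0], [0.8, 7.5], [0.85, 7.2], [0.2, 8.5], [0.1, 9.0], [0.15, 8.8]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # the k = 3 closest neighbors vote
print(knn.predict([[0.8, 7.1]]))  # the nearest three points are apples, so class 0
```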


  2. Geron, Aurelien (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA: O’Reilly.
Machine Learning Basics Part 1: An Overview

This is the first in a series of Machine Learning posts meant to act as a gentle introduction to Machine Learning techniques and approaches for those new to the subject. The material is strongly sourced from Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron and from the Coursera Machine Learning class by Andrew Ng. Both are excellent resources and are highly recommended.

Machine Learning is often defined as “the field of study that gives computers the ability to learn without being explicitly programmed” (Arthur Samuel, 1959).

More practically, it is a program that employs a learning algorithm or neural net architecture and that, once trained on an initial data set, can make predictions on new data.

Common Learning Algorithms:¹

Linear and polynomial regression

Logistic regression

K-nearest neighbors

Support vector machines

Decision trees

Random forests

Ensemble methods

While the above learning algorithms can be extremely effective, more complex problems, like image classification and natural language processing (NLP), often require a deep neural net approach.

Common Neural Net (NN) Architectures:¹

Feed forward NN

Convolutional NN (CNN)

Recurrent NN (RNN)

Long short-term memory (LSTM)


We will go into further detail on the above learning algorithms and neural nets in later blog posts.

Some Basic Terminology:

Features – These are attributes of the data. For example, a common dataset used to introduce Machine Learning techniques is the Pima Indians Diabetes dataset, which is used to predict the onset of diabetes given additional health indicators. For this dataset, the features are pregnancies, glucose, blood pressure, skin thickness, insulin, BMI, etc.

Labels – These are the desired model predictions. In supervised training, this value is provided to the model during training so that it can learn to associate specific features with a label and increase prediction accuracy. In the Pima Indians Diabetes example, this would be a 1 (indicating diabetes onset is likely) or a 0 (indicating low likelihood of diabetes).

Supervised Learning – This is a learning task in which the training set used to build the model includes labels. Regression and classification are both supervised tasks.

Unsupervised Learning – This is a learning task in which training data is not labeled. Clustering, visualization, dimensionality reduction, and association rule learning are all unsupervised tasks.

Some Supervised Learning Algorithms:¹

K-nearest neighbors

Linear regression

Logistic regression

Support vector machines (SVMs)

Decision trees and random forests

Neural networks

Unsupervised Learning Algorithms:¹

Clustering

• K-means

• Hierarchical cluster analysis (HCA)

• Expectation maximization

Visualization and Dimensionality Reduction

• Principal component analysis (PCA)

• Kernel PCA

• Locally-linear embedding (LLE)

• t-distributed Stochastic Neighbor Embedding (t-SNE)

Association Rule Learning

• Apriori

• Eclat

Dimensionality Reduction: This is the act of simplifying data without losing important information. An example of this is feature extraction, where correlated features are merged into a single feature that conveys the importance of both. For example, if you are predicting housing prices, you may be able to combine square footage with number of bedrooms to create a single feature representing living space.
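As a sketch of the housing example (the data is invented; PCA, covered in the list above, is one common feature-extraction technique):

```python
import numpy as np
from sklearn.decomposition import PCA

# Invented, correlated features: square footage and number of bedrooms
rng = np.random.default_rng(1)
sqft = rng.uniform(800, 3000, size=100)
bedrooms = sqft / 700 + rng.normal(0, 0.3, size=100)
X = np.column_stack([sqft, bedrooms])

# Merge the two correlated features into a single "living space" component
pca = PCA(n_components=1)
living_space = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # nearly all of the variance survives in one feature
```

Note that because square footage dominates the scale here, features would normally be standardized before applying PCA.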

Batch Learning: This is a system that is incapable of learning incrementally and must be trained using all available data at once.¹ To learn new data, it must be retrained from scratch.

Online Learning: This is a system that is trained incrementally by feeding it data instances sequentially. This system can learn new data as it arrives.

Underfitting: This is what happens when you create a model that generalizes too broadly. It does not perform well on the training or test set.

Overfitting: This is what occurs when you create a model that performs well on the training set, but has become too specialized and no longer performs well on new data.

Common Notations:

m: The total number of instances in the dataset

X: A matrix containing all of the feature values of every instance of the dataset

x(i): A vector containing all of the feature values of a single instance of the dataset, the ith instance.

y: A vector containing the labels of the dataset. This is the value the model should predict.


  1. Géron, Aurélien (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA: O’Reilly.


The Costly Impact of No Shows

It is estimated that missed appointments cost the healthcare industry $150 billion each year. While the blame for this issue is often directed toward patients, inefficiency at the practice and clinic level can also be a cause.

This paper outlines the most common reasons patients miss their appointments, provides benchmarks by specialty, highlights the impact on revenue, and reviews the tactics that practices and physicians can employ as a solution.


Missed appointments by type

To understand the pervasive nature of missed appointments, it is important to know the variance among practices and specialties. For example, no show rates can range from as low as 2 percent all the way to 50 percent.

Below are several average rates according to the type of practice:

It should be noted that patients with chronic conditions are more likely to not show for a scheduled appointment, as the challenges associated with their condition can make it difficult to maintain a time commitment. Unfortunately, these patients are also the ones who stand to gain the most by showing up.

Another group of patients that contributes to higher-than-average no show rates is those with Medicaid insurance. One explanation is socioeconomic, such as a patient relying on public transportation or living in a rural area far from their physician’s office.


Common reasons for no shows

A patient may miss their appointment for a variety of reasons; however, the most common causes are:

Lack of reliable transportation to the appointment
Too much time between the scheduling and the appointment
Emotional barriers such as a negative perception of seeing the doctor
Belief that staff do not respect their time or needs

Interestingly, in reviewing the scores of satisfaction surveys, the friendliness of the staff is more important to the patient than the actual outcome of the care that is delivered. This could be due to a lack of clarity the patient has around measuring the quality of care they received. However, it is easy for them to know whether they felt respected or were met with kindness by practice staff.

There is also the belief among patients that they are doing a practice a favor when they cancel an appointment. A patient may think they are giving staff time back in their day or that a new appointment can easily take its place, when in reality it creates lost time and resources for practices.


Impact on the bottom line

There are approximately 230,000 physician practices in the U.S. Of those, 47 percent are group practices, meaning there is more than one physician at the location. Patient no shows cost this group more than $100 billion each year.


Patient no shows cost group practices more than $100 billion each year.

In a study of one practice, the average rate of appointment no shows was 18 percent, which resulted in a daily loss of $725.42. When employing tactics to reduce the number of no shows, the practice was able to recoup between 3.8 and 10.5 percent in revenue, or $166.61 to $463.09.

In another study, a multi-physician clinic had more than 14,000 patient no shows in a single year, resulting in an estimated loss of $1 million in revenue. In single-physician practices, revenue losses can be as much as $150,000 each year.

On average, a primary care practice earns $143.97 per patient visit, whereas a non-surgical specialty practice earns $78.43 per patient visit. While these examples outline the revenue a practice stands to lose, they do not take into account other negative impacts, such as increased wait times or patient dissatisfaction.


Current solutions and tactics

When considering possible alternatives to decrease the number of patient no shows, practices and clinics have employed several tactics. These include text messages, direct mail, live phone calls, and automated phone calls.

While all these tactics have proven to be successful in reducing the number of no shows, it is important to implement a solution that is cost-effective and complements existing practice efforts. Depending on the goals and objectives, a combination of solutions may be the best option. Below is a baseline introduction to these tactics:

Text Messages
Many software solutions for healthcare practices offer a way to send text messages to patients to remind them of upcoming appointments. These messages also provide an opportunity for a patient to confirm they will keep their appointment, such as replying with a ‘C’ for confirmation. If a patient does not reply or responds with a cancellation answer, this signals the practice staff that there is a need to reach out directly to the patient to either confirm or reschedule their appointment.


Text messages are opened 99–100% of the time.


Depending on the solution, the cost of sending text messages can be free or included as part of a larger software package or service offering. Additionally, text messages have a 99 to 100 percent open rate and reach a patient directly via their mobile device.

Direct Mail
Another option practices and clinics use to remind their patients of appointments is direct mail, often in the form of a simple postcard. A printed piece can cut through digital clutter and offers space to include additional information or callouts.

While printing costs can be relatively inexpensive for postcards, averaging $0.15 to $0.32, they also rely on having accurate addresses for patients. Another drawback is the inability to have a patient immediately confirm they will keep their appointment. A postcard makes the patient aware, but a follow-up phone call, either by the patient or practice staff, is required for a confirmation.

Live Phone Calls
A call made by practice staff to a patient is a direct and personable way to reduce no shows. The live phone call also allows the patient to reschedule immediately if they are unable to make their appointment.

However, this is a very manual process, requiring a dedicated staff person to devote time and energy to making and completing calls. A patient may not be available or answer when called, requiring a voice message be left or another call be made.

With this in mind, the benefits of speaking directly to a patient versus the resources spent must be weighed against one another.

Automated Phone Calls
An alternative to live phone calls is an automated service that calls patients on a list, using a prerecorded voice. These services can run in the background with minimal maintenance required by staff.

While these calls can be made indefinitely, they can give the impression of being highly impersonal. A patient may not listen to the full length of the call, choosing to hang up as soon as they recognize it as an automated call.


Additional measures to take

Missed appointment fees
As an alternative to appointment reminders, some practices have opted to implement a fee when a patient misses their appointment. The fee can apply to an outright no show, or when a patient cancels their appointment too late, such as within forty-eight hours of the scheduled time.

While a fee does act as a deterrent, this can also cause a negative perception of the practice as a patient can feel penalized for missing an appointment for a legitimate reason.

Overbooking
Another option to overcome no shows is to overbook an office’s scheduled appointments. When this is done, an additional patient is already present in the event that a patient does not show up for their appointment.

However, the process of overbooking can be highly unreliable as it relies on predicting whether or not a patient will show up. If an unconfirmed patient does show up for an overbooked time slot, this can cause crowding in a waiting room, resulting in longer than normal wait times and a lower quality of service. If a patient’s wait time is severe enough, this can force the practice staff to fall behind for the day and struggle to catch up. Not only can service levels be negatively impacted for patients throughout the day, but this can also force the physician to cut appointments short, sacrificing face time with the patient.


The process of overbooking can be highly unreliable as it relies on predicting whether or not a patient will show up.


With this in mind, overbooking can solve the issue of no show patients and potentially increase revenue, but could create new problems in its place. Therefore, the benefit of seeing more patients must be weighed against the risk of increased patient waiting time and staff overtime.



It is clear that patient no shows represent a significant problem to the healthcare industry, in both the primary care and specialty office space. However, just as the issues with missed appointments impact patients and providers alike, the solution must also be one that accommodates both parties.

For example, one solution may be economically viable for a practice, but not effective or utilized on behalf of the patient. By engaging with patients in the way they prefer, the foundation for an ongoing relationship can be established. Over time, the conversation moves beyond simple transactional communications and becomes more valuable to the patient and practice.


Download as PDF ›



Berg, B., Murr, M. et. al. (2013). Estimating the Cost of No-shows and Evaluating the Effects of Mitigation Strategies. National Center for Biotechnology Information. Found online at

Toland, Bill. “No-shows cost health care system billions,” Pittsburgh Post-Gazette. Feb 24, 2013.

Gold, Jenny, “In cities, the average doctor wait-time is 18.5 days,” The Washington Post. Jan 29, 2014.

Lacy, Naomi. “Why We Don’t Come: Patient Perceptions on No-Shows,” Annals of Family Medicine. vol. 2 no 6. Nov 1, 2004.

Evans, Melanie. “When revenue is a no-show,” Modern Healthcare. Nov 3, 2012.

Mckee, Shawn. “Measuring the Cost of Patient No-Shows.”

Molfenter, Todd. Reducing Appointment No-Shows: Going from Theory to Practice.

The American Journal of Medicine. The Effectiveness of Outpatient Appointment Reminder Systems in Reducing No-Show Rates.

Hasvold PE, Wootton R. Use of telephone and SMS reminders to improve attendance at hospital appointments: a systematic review. Journal of Telemedicine and Telecare. 2011;17(7):358-64.

Guy R, Hocking J, Wand H, Stott S, Ali H, Kaldor J. How Effective Are Short Message Service Reminders at Increasing Clinic Attendance? A Meta Analysis and Systematic Review. Health services research. 2012

Appointment reminder systems are effective but not optimal: results of a systematic review and evidence synthesis employing realist principles

MGMA Cost Survey: 2014 Report Based on 2013 Data. Key Findings Summary Report.