If you missed part one in the series, you can start here (Machine Learning Basics Part 1: An Overview).
Common real-world problems addressed with regression models include predicting housing values, financial forecasting, and predicting commute times. Regression models can have a single input feature, referred to as univariate, or multiple input features, referred to as multivariate. A regression model's performance is commonly evaluated with the mean squared error (MSE) cost function. MSE is the average of the squared errors between each data point and the hypothesis, or simply a measure of how far the predictions were from the desired outcomes. A model with a high MSE fits the training data poorly and should be revised.
A visual representation of MSE:
In the image above,¹ the actual data point values are represented by red dots. The hypothesis, which is used to make any predictions on future data, is represented by the blue line. The difference between the two is indicated by the green lines. These green lines are used to compute MSE and evaluate the strength of the model’s predictions.
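As a rough illustration of the calculation described above, here is a minimal sketch in plain Python. The function name `mean_squared_error` and the sample values are made up for this example; the "actual" list plays the role of the red dots and the "predicted" list plays the role of points on the blue line.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one data point is required")
    # Each term corresponds to one green line in the figure, squared.
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data, chosen only for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]

print(mean_squared_error(actual, predicted))  # 0.375
```

Squaring the errors means that a few large misses raise the MSE more than many small ones.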
Regression Problem Examples:
- Given BMI, current weight, activity level, gender, and calorie intake, predict future weight.
- Given calorie intake, fitness level, and family history, predict percent probability of heart disease.
Commonly Used Regression Models:
Linear Regression: This is a model that represents a scalar response as a linear function of one or more input variables. Scalar refers to a single real number.
Ridge Regression: This is a linear regression model that incorporates a regularization term to prevent overfitting. If the regularization strength (α) is set to 0, ridge regression acts as simple linear regression. Note that the data should be scaled before performing ridge regression, because the regularization term is sensitive to the scale of the input features.
Lasso Regression: Lasso is an abbreviation for least absolute shrinkage and selection operator. Similar to ridge regression, lasso regression includes a regularization term. One benefit of using lasso regression is that it tends to set the weights of the least important features to zero, effectively performing feature selection.² You can implement lasso regression in scikit-learn using the built-in model library.
Elastic Net: This model uses a regularization term that is a mix of both the ridge and lasso regularization terms. Setting r=0 makes the model behave like ridge regression, and setting r=1 makes it behave like lasso regression. This additional flexibility in customizing regularization can provide the benefits of both models.² You can implement elastic net in scikit-learn using the built-in model library. Select an alpha value to control the regularization strength and an l1_ratio to set the mix ratio r.
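To make the mix ratio more concrete, the sketch below computes only the regularization term that gets added to the MSE cost. It assumes the common formulation in which the lasso part is the sum of absolute weights and the ridge part is half the sum of squared weights. The function name `elastic_net_penalty` and the sample weights are hypothetical, and this is not how any particular library implements it internally.

```python
def elastic_net_penalty(weights, alpha, r):
    """Regularization term added to the MSE cost; r is the mix ratio."""
    l1_part = sum(abs(w) for w in weights)   # lasso-style penalty
    l2_part = sum(w ** 2 for w in weights)   # ridge-style penalty
    return r * alpha * l1_part + (1 - r) / 2 * alpha * l2_part


# Hypothetical feature weights, chosen only for illustration.
weights = [0.5, -2.0, 0.0]

ridge_only = elastic_net_penalty(weights, alpha=0.1, r=0.0)  # lasso part vanishes
lasso_only = elastic_net_penalty(weights, alpha=0.1, r=1.0)  # ridge part vanishes
mixed = elastic_net_penalty(weights, alpha=0.1, r=0.5)       # half of each

print(ridge_only, lasso_only, mixed)
```

With alpha set to 0 the penalty disappears entirely, which matches the earlier note that ridge regression without regularization acts as simple linear regression.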
Classification: Classification problems predict a class. They can also return a probability value, which is then used to determine the class most likely to be correct. For classification problems, model performance is often measured by calculating accuracy.
model accuracy = correct predictions / total predictions * 100
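The accuracy formula above can be sketched in a few lines of Python. The function name `accuracy` and the label lists are made up for this example.

```python
def accuracy(actual_labels, predicted_labels):
    """Percentage of predictions that match the actual labels."""
    if len(actual_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not actual_labels:
        raise ValueError("at least one prediction is required")
    correct = sum(1 for a, p in zip(actual_labels, predicted_labels) if a == p)
    return 100 * correct / len(actual_labels)


# Hypothetical binary labels, chosen only for illustration.
actual_labels = [1, 0, 1, 1, 0]
predicted_labels = [1, 0, 0, 1, 0]

print(accuracy(actual_labels, predicted_labels))  # 80.0
```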
Classification Problem Examples: Classification is useful for making predictions in the healthcare industry. For example, given a dataset with features including glucose levels, pregnancies, blood pressure, skin thickness, insulin, and BMI, predictions can be made on the likelihood of the onset of diabetes. Because this prediction should be a 0 or 1, it is considered a binary classification problem.
Commonly Used Classification Models:
Logistic Regression: This is a model that uses a regression algorithm, but is most often used for classification problems since its output can be used to determine the probability of belonging to a certain class.² Logistic regression uses the sigmoid function to output a value between 0 and 1. If the probability is >= 0.5 that an instance is in the positive class (represented by a 1), the model predicts 1. Otherwise, it predicts 0.
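The sigmoid and the 0.5 threshold can be sketched as follows. The weights, bias, and feature values are hypothetical and were not learned from data; in practice a training procedure would supply them. The function names are also made up for this example.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    # Two branches avoid overflow in math.exp for large negative or positive z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


def predict_class(features, weights, bias, threshold=0.5):
    """Return (predicted class, estimated probability of the positive class)."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    probability = sigmoid(z)
    return (1 if probability >= threshold else 0), probability


# Hypothetical parameters and instance, chosen only for illustration.
label, probability = predict_class(features=[2.0, 1.0], weights=[1.0, -1.0], bias=0.0)
print(label, probability)
```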
Softmax Regression: This is a logistic regression model that can support multiple classes. Softmax predicts the class with the highest estimated probability. It can only be used when classes are mutually exclusive.²
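A minimal sketch of the softmax step is shown below. It assumes the model has already produced one raw score per class; the scores here are invented for illustration. Subtracting the largest score before exponentiating is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Convert one raw score per class into probabilities that sum to 1."""
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-class scores, chosen only for illustration.
scores = [2.0, 1.0, 0.1]
probabilities = softmax(scores)

# The predicted class is the index with the highest estimated probability.
predicted_class = probabilities.index(max(probabilities))
print(probabilities, predicted_class)
```

Because the probabilities always sum to 1, raising the estimate for one class lowers it for the others, which is why the classes need to be mutually exclusive.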
Naive Bayes: This is a classification system that assumes that, given the class, the value of a feature is independent of the value of any other feature, ignoring any possible correlations between features when making predictions. The model then predicts the class with the highest probability.⁴
Support Vector Machines (SVM): This is a classification system that identifies a decision boundary, or hyperplane, that leaves as wide a margin as possible between class types, and predicts class based on the side of the boundary that a point falls on. This system does not use probability to assign a class label. SVM models can be fine-tuned by adjusting the kernel, regularization, gamma, and margin. We will explore these hyperparameters further in an upcoming blog post focused solely on SVM. Note that SVM can also be used to perform regression tasks.
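To illustrate only the prediction step of a linear SVM, the sketch below checks which side of a hyperplane a point falls on. The hyperplane's weights and bias are hypothetical; finding the widest-margin hyperplane is the training step and is not shown here.

```python
def svm_predict(point, weights, bias):
    """Return +1 or -1 depending on which side of the hyperplane the point is on."""
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


# Hypothetical hyperplane x1 + x2 - 5 = 0, chosen only for illustration.
weights = [1.0, 1.0]
bias = -5.0

print(svm_predict([4.0, 4.0], weights, bias))  # 1
print(svm_predict([1.0, 2.0], weights, bias))  # -1
```

Note that the output is a class label taken from the sign of the score, not a probability.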
Decision Trees and Random Forests: A decision tree is a model that separates data into branches by asking a binary question at each fork. For example, in a fruit classification problem one tree fork could ask if a fruit is red. Each fruit instance would either go to one branch for yes or the other for no. At the end of each branch is a leaf containing all of the training instances that followed the same decision path. The common problem of overfitting can often be reduced by combining multiple trees into a random forest and aggregating their predictions, typically by taking the class that the most trees vote for.
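The sketch below imitates the fruit example with three tiny hand-written "trees", each asking a single binary question, and combines them by majority vote, which is the usual way a random forest aggregates class predictions. The questions, feature names, and classes are invented for illustration; a real forest learns its trees from random subsets of the training data.

```python
from collections import Counter


# Each hypothetical tree asks one binary question and ends in a leaf (a class).
def tree_color(fruit):
    return "apple" if fruit["color"] == "red" else "banana"


def tree_shape(fruit):
    return "apple" if fruit["shape"] == "round" else "banana"


def tree_length(fruit):
    return "banana" if fruit["length_cm"] > 12 else "apple"


def forest_predict(trees, instance):
    """Majority vote over the predictions of the individual trees."""
    votes = Counter(tree(instance) for tree in trees)
    return votes.most_common(1)[0][0]


trees = [tree_color, tree_shape, tree_length]

# A yellow apple: the color tree is wrong, but the other two outvote it.
yellow_apple = {"color": "yellow", "shape": "round", "length_cm": 7}
print(forest_predict(trees, yellow_apple))  # apple
```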
Neural Networks (NN): This is a model composed of layers of connected nodes. The model takes information in via an input layer and passes it through one or more hidden layers composed of nodes. Each node computes a weighted sum of its inputs, applies an activation function, and generates output for the next layer of nodes. Connections between nodes are called edges, and each edge has a weight that is adjusted during training. A bias term can also be added to each node, where it acts as a threshold theta (θ) that determines whether the node’s output continues to the next layer of nodes. The final layer is the output layer, which generates class probabilities and makes a final prediction. When a NN has two or more hidden layers, it’s called a deep neural network. There are multiple types of neural networks, and we will explore them in more detail in later blog posts.
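As a rough sketch of how information flows through the layers, the code below runs one forward pass through a network with a single hidden layer. It assumes a sigmoid activation for every node, and all weights and biases are hypothetical values rather than learned ones; training, which adjusts those weights, is not shown.

```python
import math


def sigmoid(z):
    """Activation function assumed for this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Compute the outputs of one layer.

    weights[j] holds the edge weights feeding node j; biases[j] is its bias.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(node_weights, inputs)) + bias)
        for node_weights, bias in zip(weights, biases)
    ]


# Hypothetical network: 2 inputs -> 2 hidden nodes -> 1 output node.
inputs = [0.5, -1.0]
hidden = layer_forward(inputs, weights=[[0.1, 0.4], [-0.3, 0.2]], biases=[0.0, 0.1])
output = layer_forward(hidden, weights=[[0.7, -0.5]], biases=[0.2])

print(hidden, output)
```

With a sigmoid output node, the final value can be read as the estimated probability of the positive class.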
K-nearest Neighbor: This model evaluates a new data point by its proximity to training data points and assigns a class based on the majority class of its closest neighbors, as determined by feature similarity. K is an integer set when the model is built and determines how many neighbors the model considers. Conceptually, the boundary circle around the new point expands until it includes k neighbors.
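The idea can be sketched in plain Python as follows. This version assumes Euclidean distance as the measure of feature similarity, and the training points and labels are invented for illustration.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training_points, training_labels, new_point, k=3):
    """Assign the majority class among the k training points closest to new_point."""
    if not 1 <= k <= len(training_points):
        raise ValueError("k must be between 1 and the number of training points")
    # Pair each training point's distance with its label, then sort by distance.
    neighbors = sorted(
        ((euclidean_distance(p, new_point), label)
         for p, label in zip(training_points, training_labels)),
        key=lambda pair: pair[0],
    )
    nearest_labels = [label for _, label in neighbors[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical two-feature training data with two classes.
training_points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
training_labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(training_points, training_labels, (2, 2), k=3))  # a
```

An odd k is often chosen for two-class problems so that the vote cannot end in a tie.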
- Geron, Aurelien (2017). Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol, CA: O’Reilly.