## Machine Learning

**1. Course Introduction**

Machine learning is present in many fields and industries. It is used heavily in the self-driving car industry to classify objects that a car might encounter while driving, for example, people, traffic signs, and other cars.

Many cloud computing service providers like IBM and Amazon use machine learning to protect their services. It is used to detect and prevent attacks such as distributed denial-of-service attacks, as well as suspicious or malicious usage.

Machine learning is also used to find trends and patterns in stock data that can help decide which stocks to trade or which prices to buy and sell at.

Another use for machine learning is to help identify cancer in patients. Using an x-ray scan of the target area, machine learning can help detect any potential tumors.

This course consists of four modules: Introduction and Regression, Classification, Clustering, and the Final Project. Each module comprises videos with hands-on labs to apply what you have learned.

The hands-on labs use JupyterLab, which is hosted on Skills Network Labs and uses the Python programming language and various Python libraries like Pandas, NumPy, and scikit-learn.

You will explore different machine learning algorithms in this course and work with a variety of data sets to help you apply machine learning.

With linear regression, you will work with an automobile data set to estimate the CO2 emission of cars using various features, and then predict the CO2 emissions of cars that haven’t even been produced yet.

In regression trees, you will work with real estate data to predict the price of houses.

In logistic regression, you will work with customer data for telecommunication companies and see how machine learning is used to predict customer loyalty.

With K-nearest neighbors you will use telecommunication customer data to classify customers.

For support vector machines, you will classify human cell samples as benign or malignant.

In multiclass prediction, you will work with the popular iris data set to classify types of flowers.

With decision trees, you will build a model to determine which drugs to prescribe to patients.

And finally, with K-means, you will learn to segment a customer data set into groups of individuals with similar characteristics.

In the last module, you will complete the final project where you will use many of the classification algorithms to predict rain in Australia.

After completing this course, you will be able to explain, compare, and contrast various machine learning topics and concepts like supervised learning, unsupervised learning, classification, regression, and clustering. You will also be able to describe how the various machine learning algorithms work. And finally, you will learn how to apply these machine learning algorithms in Python using various Python libraries.

**2. Introduction to Machine Learning**

Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.”

Assume that you have a dataset of images of animals such as cats and dogs, and you want to have software or an application that can recognize and differentiate them.

The first thing that you have to do here is represent each image as a set of features. For example, does the image show the animal’s eyes? If so, what is their size? Does it have ears? What about a tail? How many legs? Does it have wings?

Prior to machine learning, each image would be transformed into a vector of features. Then, traditionally, we had to write down rules or methods by hand in order to get computers to be intelligent and detect the animals. This approach failed: it needed a lot of rules, was highly dependent on the current dataset, and was not generalized enough to detect out-of-sample cases.

This is when machine learning entered the scene. Machine learning allows us to build a model that looks at all the feature sets and their corresponding types of animals, and learns the pattern of each animal.

The model is built by machine learning algorithms, and it detects animals without being explicitly programmed to do so. In essence, machine learning follows the same process that a 4-year-old child uses to learn, understand, and differentiate animals.

So, machine learning algorithms, inspired by the human learning process, iteratively learn from data, and allow computers to find hidden insights. These models help us in a variety of tasks, such as object recognition, summarization, recommendation, and so on.

How does Netflix recommend videos, movies, and TV shows to its users? It uses machine learning to produce suggestions that you might enjoy! This is similar to how your friends might recommend a television show to you, based on their knowledge of the types of shows you like to watch.

How do banks make a decision when approving a loan application? They use machine learning to predict the probability of default for each applicant, and then approve or refuse the loan application based on that probability. Telecommunication companies use their customers’ demographic data to segment them, or predict if they will unsubscribe from their company the next month.

There are many other applications of machine learning that we see every day, such as chatbots, and logging into our phones or even computer games using face recognition. Each of these uses different machine learning techniques and algorithms. So, let’s quickly examine a few of the more popular techniques.

**The Regression/Estimation technique** is used for predicting a continuous value, for example, predicting the price of a house based on its characteristics, or estimating the CO2 emission from a car’s engine.

A **Classification technique** is used for predicting the class or category of a case, for example, if a cell is benign or malignant, or whether or not a customer will churn.

**Clustering** groups similar cases; for example, it can find similar patients, or can be used for customer segmentation in the banking field.

**Association technique** is used for finding items or events that often co-occur, for example, grocery items that are usually bought together by a particular customer.

**Anomaly detection** is used to discover abnormal and unusual cases, for example, it is used for credit card fraud detection.

**Sequence mining** is used for predicting the next event, for instance, the click-stream in websites.

**Dimension reduction** is used to reduce the size of data.

And finally, **recommendation systems** associate people’s preferences with others who have similar tastes, and recommend new items to them, such as books or movies.

What is the difference between Artificial Intelligence (or AI), Machine Learning, and Deep Learning?

AI tries to make computers intelligent in order to mimic the cognitive functions of humans. So, Artificial Intelligence is a general field with a broad scope including: Computer Vision, Language Processing, Creativity, and Summarization.

Machine Learning is the branch of AI that covers the statistical part of artificial intelligence. It teaches the computer to solve problems by looking at hundreds or thousands of examples, learning from them, and then using that experience to solve the same problem in new situations.

Deep Learning is a very special field of Machine Learning where computers can actually learn and make intelligent decisions on their own. Deep learning involves a deeper level of automation in comparison with most machine learning algorithms.

**3. Using Python for Machine Learning**

We will use:

NumPy – a math library to work with N-dimensional arrays in Python

SciPy – a collection of numerical algorithms and domain specific toolboxes, including signal processing, optimization, statistics and much more. SciPy is a good library for scientific and high performance computation

Matplotlib – a package that provides 2D plotting, as well as 3D plotting

Basic knowledge about these three packages which are built on top of Python is a good asset for data scientists who want to work with real-world problems.

Also:

Pandas – a very high-level Python library that provides high-performance, easy-to-use data structures. It has many functions for data importing, manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

scikit-learn – a collection of algorithms and tools for machine learning. It has most of the classification, regression and clustering algorithms, and it’s designed to work with NumPy and SciPy.

**4. Supervised Algorithms versus Unsupervised Algorithms**

An easy way to begin grasping the concept of supervised learning is by looking directly at the words that make it up.

Supervise means to observe, and direct the execution of a task, project, or activity. Obviously we aren’t going to be supervising a person; instead, we will be supervising a machine learning model that might be able to produce classification regions.

We do this by teaching the model; that is, we load the model with knowledge so that it can predict future instances.

But how exactly do we teach a model? We teach the model by training it with some data from a labeled dataset. It’s important to note that the data is labeled, and what does a labeled dataset look like? Well, it could look something like this.

This example is taken from the cancer dataset. As you can see, we have some historical data for patients, and we already know the class of each row (eg benign or malignant).

Let’s start by introducing some components of this table. The names up here which are called clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion and so on are called attributes.

The columns, which hold the data, are called features. If you plot this data and look at a single data point on the plot, it will have all of these attributes; together they make up one row of the table, which is also referred to as an observation.

Looking directly at the value of the data, you can have two kinds. The first is numerical. When dealing with machine learning, the most commonly used data is numeric.

The second is categorical; that is, it’s non-numeric because it contains characters rather than numbers. In this case, it’s categorical because this dataset is made for classification.

There are two types of supervised learning techniques. They are **classification**, and **regression**.

Classification is the process of predicting a discrete class label, or category (again, eg benign/malignant).

Regression is the process of predicting a continuous value as opposed to predicting a categorical value in classification.

Here’s another dataset. It is related to CO2 emissions of different cars.

It includes engine size, cylinders, fuel consumption, and CO2 emission of various models of automobiles.

Given this dataset, you can use regression to predict the CO2 emission of a new car by using other fields such as engine size, or number of cylinders.
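As a rough sketch of how such a labeled dataset might be laid out in code, the snippet below separates the features from the target. The rows, the column names (`engine_size`, `cylinders`, `fuel_consumption`, `co2_emission`), and every value are invented for illustration; they are not taken from the course dataset.

```python
# Hypothetical labeled rows: each dict is one observation (one car).
cars = [
    {"engine_size": 2.0, "cylinders": 4, "fuel_consumption": 8.5, "co2_emission": 196.0},
    {"engine_size": 2.4, "cylinders": 4, "fuel_consumption": 9.6, "co2_emission": 221.0},
    {"engine_size": 3.5, "cylinders": 6, "fuel_consumption": 11.1, "co2_emission": 255.0},
]

feature_names = ["engine_size", "cylinders", "fuel_consumption"]
target_name = "co2_emission"

# X holds one list of feature values per observation (row).
X = [[row[name] for name in feature_names] for row in cars]

# y holds the known label for each observation. The label is a continuous
# number, so predicting it is a regression task rather than classification.
y = [row[target_name] for row in cars]

print(X[0], y[0])
```

If the target column held categories such as benign or malignant instead of numbers, the same layout would describe a classification task.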

In Unsupervised Learning, we do not supervise the model, but we let the model work on its own to discover information that may not be visible to the human eye.

The unsupervised algorithm trains on the dataset, and draws conclusions on unlabeled data.

Unlabeled data = pieces of data that have not been tagged with labels identifying characteristics, properties, or classifications.

Generally speaking, unsupervised learning has more difficult algorithms than supervised learning since we know little to no information about the data, or the outcomes that are to be expected.

Dimension reduction, density estimation, market basket analysis, and clustering are the most widely used unsupervised machine learning techniques.

**Dimensionality reduction**, and/or feature selection, play a large role in this by reducing redundant features to make the classification easier.

**Market basket analysis** is a modeling technique based upon the theory that if you buy a certain group of items, you’re more likely to buy another group of items.

**Density estimation** is a very simple concept that is mostly used to explore the data to find some structure within it.

**Clustering** is considered to be one of the most popular unsupervised machine learning techniques used for grouping data points, or objects that are somehow similar.

Cluster analysis has many applications in different domains, whether it be a bank’s desire to segment its customers based on certain characteristics, or helping an individual to organize their favourite types of music in groups.

Generally speaking though, clustering is used mostly for discovering structure, summarization, and anomaly detection.

So, to recap, the biggest difference between supervised and unsupervised learning is that supervised learning deals with labeled data while unsupervised learning deals with unlabeled data.

In supervised learning, we have machine learning algorithms for classification and regression. In unsupervised learning, we have methods such as clustering.

In comparison to supervised learning, unsupervised learning has fewer models and fewer evaluation methods that can be used to ensure that the outcome of the model is accurate. As such, unsupervised learning creates a less controllable environment as the machine is creating outcomes for us.

Quiz.

1. Supervised learning deals with unlabeled data, while unsupervised learning deals with labelled data.

Answer: False

2. The “Regression” technique in Machine Learning is a group of algorithms that are used for:

(a) Finding items/events that often co-occur; for example grocery items that are usually bought together by a customer.

(b) Prediction of class/category of a case; for example, a cell is benign or malignant, or a customer will churn or not.

(c) Predicting a continuous value; for example predicting the price of a house based on its characteristics.

Answer: (c)

3. When comparing Supervised with Unsupervised learning, is this sentence True or False?

In contrast to Supervised learning, Unsupervised learning has more models and more evaluation methods that can be used in order to ensure the outcome of the model is accurate.

Answer: False

4. In a dataset, what do the columns represent?

(a) Observations

(b) Variable Type

(c) Independent Variables

(d) Features

Answer: (d)

5. What is a major benefit of unsupervised learning over supervised learning?

(a) Discover previously unknown information about the dataset.

(b) Being able to produce a prediction based on unlabelled data.

(c) Explore the relationship between features and the target.

(d) Better evaluates the performance of a built model.

Answer: (a)

(b) WRONG because supervised learning does this as well.

(c), (d) these are characteristics of supervised learning

6. What’s the correct order for using a model?

(a) Split the data into the training and test sets, fit the model on the train set, clean the data, evaluate model accuracy.

(b) Clean the data, fit the model on the entire dataset, split the data into training and test sets, evaluate model accuracy.

(c) Split the data into training and test sets, fit the model on the train set, evaluate model accuracy.

(d) Clean the data, split the data into training and test sets, fit the model on the train set, evaluate model accuracy.

Answer: (d)

7. Which of the following is suitable for an unsupervised learning?

(a) Examine the relationship between academic performance and level of in-class participation using observations that include a feature recording each student’s grade.

(b) Segment customers into groups for discovering similar characteristics between them.

(c) Classifying benign and malignant tumors using historical data on tumor shape, color, etc.

(d) Predict house price based on location, house size, and number of rooms.

Answer: (b)

(a) – correlation or regression analysis (supervised)

(c) – classification (supervised)

(d) – regression (supervised)

8. The main purpose of the NumPy library is to:

(a) Achieve scientific computations.

(b) Construct machine learning models.

(c) Visualize results in 2D and 3D plots.

(d) Perform computations on arrays efficiently.

Answer: (d)

**4. Introduction to Regression**

Example: look at this data set. It’s related to CO2 emissions from different cars.

It includes engine size, number of cylinders, fuel consumption, and CO2 emission from various automobile models.

The question is: given this data set, can we predict the CO2 emission of a car using other fields such as engine size or cylinders?

Let’s assume we have some historical data from different cars and assume that a car such as in row 9 has not been manufactured yet, but we’re interested in estimating its approximate CO2 emission after production. Is it possible?

We can use **regression methods** to predict a continuous value such as CO2 emission using some other variables.

**Regression is the process of predicting a continuous value.**

In regression there are two types of variables: **a dependent variable and one or more independent variables.**

**The dependent variable** can be seen as the state, target, or final goal we study and try to predict.

**The independent variables**, also known as explanatory variables, can be seen as the causes of those states.

The independent variables are shown conventionally by X and the dependent variable is notated by Y.

A regression model relates Y or the dependent variable to a function of X i.e. the independent variables.

The key point in regression is that the dependent variable should be continuous and cannot be discrete.

- Continuous variables represent measurable amounts (eg water volume or weight).
- Discrete variables represent counts (eg the number of objects in a collection).

However, the independent variable, or variables, can be measured on either a categorical or continuous measurement scale.

So, what we want to do here is to use the historical data of some cars using one or more of their features and from that data make a model.

We use regression to build such a regression estimation model; then the model is used to predict the expected CO2 emission for a new or unknown car.

There are two types of regression models:

**Simple regression** is when one independent variable is used to estimate a dependent variable. It can be either linear or non-linear. For example, predicting CO2 emission using the variable of engine size.

Linearity of regression is based on the nature of the relationship between independent and dependent variables.

When more than one independent variable is present, the process is called **multiple regression**.

For example, predicting CO2 emission using engine size and the number of cylinders in any given car.

Again, depending on the relation between dependent and independent variables it can be either linear or non-linear regression.

Let’s examine some sample applications of regression. Essentially we use regression when we want to estimate a continuous value.

For instance, one of the applications of regression analysis could be in the area of sales forecasting. You can try to predict a sales person’s total yearly sales from independent variables such as age, education, and years of experience.

It can also be used in the field of psychology, for example, to determine individual satisfaction, based on demographic and psychological factors.

We can use regression analysis to predict the price of a house in an area, based on its size, number of bedrooms, and so on. We can even use it to predict employment income from independent variables such as hours of work, education, occupation, sex, age, years of experience, and so on.

You can find many examples of the usefulness of regression analysis in these and many other fields or domains, such as finance, healthcare, retail, and more. We have many regression algorithms; each has its own strengths and specific conditions to which it is best suited.

**5. Simple linear regression**

Let’s go back to our sample data set.

Question – can we predict the CO2 emission of a car using another field such as engine size?

Yes. We can use linear regression to predict a continuous value such as CO2 emission by using other variables. Linear regression is the approximation of a linear model used to describe the relationship between two or more variables.

In simple linear regression, there are two variables, a dependent variable and an independent variable. The key point in the linear regression is that our dependent value should be continuous and cannot be a discrete value.

However, the independent variables can be measured on either a categorical or continuous measurement scale.

There are two types of linear regression models – simple regression and multiple regression.

- If we predict CO2 emissions based on Engine Size only, this is a Simple Linear Regression.
- If we predict CO2 emissions based on Engine Size and Cylinders, this is a Multiple Linear Regression.

Let’s look at a simple linear regression. To find CO2 emissions based on Engine Size, we can plot variables on a graph. Here’s an example of many cars (more than the ones in the data set).

We can see a relationship here. With linear regression, you can fit a line through the data.

For instance, as the engine size increases, so do the emissions. With linear regression you can model the relationship of these variables. A good model can be used to predict what the approximate emission of each car is.

So for our example in row 9 of the data, a sample car with engine size 2.4, you can find that the emission is 214.

The fitting line can be written as ŷ = θ0 + θ1 × x1, where ŷ (y hat) is the predicted CO2 emission and x1 is the engine size.

θ0 is the intercept and θ1 is the gradient or slope of the fitting line. θ0 and θ1 are known as the coefficients of the linear equation.

*How would you draw a line through the points? And how do you determine which line fits best?*

Linear regression estimates the coefficients of the line. This means we must calculate θ0 and θ1 to find the best line to fit the data. This line would best estimate the emission of the unknown data points.

Let’s see how we can find this line or, to be more precise, how we can adjust the parameters to make the line the best fit for the data.

For a moment, let’s assume we’ve already found the best fit line for our data. Now, let’s go through all the points and check how well they align with this line.

Best fit here means that if we have, for instance, a car with engine size x1 = 5.4 and actual CO2 = 250, its CO2 should be predicted very close to the actual value, which is y = 250 based on historical data.

But if we use the fitted line, or more precisely our polynomial with known parameters, to predict the CO2 emission, it will return y hat = 340. (You can see that on the graph.) Now if you compare the actual value of the emission of the car with what we’ve predicted using our model, you will find that we have a 90-unit error. This means our prediction line is not accurate. This error is also called the residual error. So we can say the error is the distance from the data point to the fitted regression line.

The mean of all residual errors shows how poorly the line fits the whole data set. Mathematically, this is expressed by the **Mean Squared Error (MSE)**, which is the mean of the squared residual errors.
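Written out, the MSE takes the following form. The symbol names are a choice made here for illustration: $y_i$ is the actual value for row $i$, $\hat{y}_i$ is the value predicted by the line, and $n$ is the number of data points.

$$
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2
$$

For the single car discussed above, the residual is $250 - 340 = -90$, so that one point contributes $(-90)^2 = 8100$ to the sum.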

The objective of linear regression is to minimize this MSE equation. To do this, we should find the best parameters θ0 and θ1. Now the question is how to find θ0 and θ1 in such a way that they minimize this error.

How can we find such a perfect line? Or said another way, how should we find the best parameters for our line? Should we move the line a lot randomly and calculate the MSE value every time and choose the minimum one? Not a good idea!

We have two options here. Option one, we can use a mathematical approach, or option two, we can use an optimization approach.

Mathematical approach:

We can start by estimating θ1.

x bar and y bar are the average engine size and CO2 emissions from the data set. xi and yi are the individual values of these for each row.

We can find the averages and then put them into the slope equation to find θ1, and then use this value of θ1 to find the value of θ0 in the intercept equation. Here is a worked-through calculation using our data set:

θ0 is also called the bias coefficient, and θ1 is the coefficient for the engine size column.

As a side note, you really don’t need to remember the formula for calculating these parameters, as most of the libraries used for machine learning in Python, R and Scala can easily find these parameters for you. But it’s always good to understand how it works.
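To make the slope and intercept equations concrete, here is a minimal sketch using only the Python standard library. It uses the standard least-squares formulas: the slope θ1 is the sum of (xi − x̄)(yi − ȳ) divided by the sum of (xi − x̄)², and the intercept θ0 is ȳ − θ1 × x̄. The function name `fit_simple_linear_regression` and the four data points are made up for illustration; they are not the course data.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (theta0, theta1) of the least-squares line y ~ theta0 + theta1 * x."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_bar, y_bar = mean(x), mean(y)
    # Slope: sum of cross-deviations divided by sum of squared x-deviations.
    numerator = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    denominator = sum((xi - x_bar) ** 2 for xi in x)
    if denominator == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    theta1 = numerator / denominator
    # Intercept: the fitted line passes through the point (x_bar, y_bar).
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


# Invented engine sizes and CO2 emissions that lie exactly on y = 80 + 40 * x.
engine_size = [1.0, 2.0, 3.0, 4.0]
co2 = [120.0, 160.0, 200.0, 240.0]

theta0, theta1 = fit_simple_linear_regression(engine_size, co2)
print(theta0, theta1)
```

Because the invented points lie exactly on a line, the recovered intercept and slope match that line; with real, noisy data the result is the line that minimizes the MSE.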

Now, we can write down the polynomial of the line.

So for ID=9 in our data set, where the car had an engine size of 2.4, with θ0 = 125.74 and θ1 = 39, our predicted CO2 = 125.74 + 39*2.4 = 219.34.

(Note: there was an error in the calculation of θ0 in the course I took. The correct value is θ0 = 226.22 − 39 × 3.03 = 108.05, which gives Co2Emission = 108.05 + 39 × 2.4 = 201.65.)
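The arithmetic in the note above can be checked in a few lines. This sketch assumes that 226.22 and 3.03 are the average CO2 emission and average engine size (y bar and x bar) from the course's worked calculation; the variable names are chosen here for illustration.

```python
# Values quoted in the notes above.
x_bar = 3.03     # assumed: average engine size
y_bar = 226.22   # assumed: average CO2 emission
theta1 = 39      # slope

# Intercept from the intercept equation: theta0 = y_bar - theta1 * x_bar.
theta0 = y_bar - theta1 * x_bar

# Prediction for the car with engine size 2.4.
predicted_co2 = theta0 + theta1 * 2.4

print(round(theta0, 2), round(predicted_co2, 2))  # prints 108.05 201.65
```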

Why is linear regression useful?

- It’s very fast
- No parameter tuning
- Easy to understand
- Highly interpretable

**6. Model Evaluation in Regression Models**

The goal of regression is to build a model to accurately predict an unknown case. To this end, we have to perform regression evaluation after building the model.

When considering evaluation models, we clearly want to choose the one that will give us the most accurate results. So, the question is, how can we calculate the accuracy of our model? In other words, how much can we trust this model for prediction of an unknown sample using a given dataset and having built a model such as linear regression?

Two types of evaluation approaches can be used to achieve this goal:

(a) train and test on the same dataset

In this approach, we train on the entire dataset and then select a portion of the same dataset for testing. For instance, assume that we have 10 records in our dataset. We use the entire dataset for training, and we build a model using this training set.

We then select a small portion of the dataset, e.g. rows six to nine, but without the labels.

This portion is called the test set. It has labels, but they are not used for prediction; they are used only as ground truth. These labels are called the actual values of the test set.

Now we pass the feature set of the testing portion to our built model and predict the target values. Finally, we compare the predicted values by our model with the actual values in the test set.

This indicates how accurate our model actually is. There are different metrics to report the accuracy of the model, but most of them work generally based on the similarity of the predicted and actual values.

Let’s look at one of the simplest metrics to calculate the accuracy of our regression model. As mentioned, we just compare the actual values y with the predicted values, which is noted as y hat for the testing set.

The error of the model is calculated as the average difference between the predicted and actual values for all the rows. We can write this error as an equation.

Here’s all the maths so far discussed:

So – you train the model on the entire dataset, then you test it using a portion of the same dataset.

When you test with a dataset in which you know the target value for each data point, you’re able to obtain a percentage of accurate predictions for the model.

This evaluation approach would most likely have a high training accuracy and a low out-of-sample accuracy, since the model has already seen all of the testing data points during training.

What is training accuracy and out-of-sample accuracy?

Training accuracy is the percentage of correct predictions that the model makes on the dataset it was trained on; in this approach, the test rows are part of that dataset. However, a high training accuracy isn’t necessarily a good thing. For instance, a high training accuracy may be the result of over-fitting the data. This means that the model is overly trained to the dataset, so it may capture noise and produce a non-generalized model.

Out-of-sample accuracy is the percentage of correct predictions that the model makes on data that the model has not been trained on.

Doing a train and test on the same dataset will most likely have low out-of-sample accuracy due to the likelihood of being over-fit.
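The effect can be illustrated with a deliberately extreme model that simply memorizes its training rows. The sketch below uses a 1-nearest-neighbor predictor, which is a choice made here for illustration and not the model discussed in this section. Because this is regression, it reports error rather than accuracy (a low error corresponds to a high accuracy). All data values are invented.

```python
def nearest_neighbor_predict(train, x_new):
    """Predict y for x_new by copying the y of the closest training x."""
    _, y_closest = min(train, key=lambda row: abs(row[0] - x_new))
    return y_closest


def mean_absolute_error(rows, train):
    """Average absolute difference between actual y and the memorizer's prediction."""
    return sum(abs(y - nearest_neighbor_predict(train, x)) for x, y in rows) / len(rows)


# Invented (engine size, CO2 emission) pairs.
train = [(1.5, 150.0), (2.4, 190.0), (3.5, 240.0), (4.7, 290.0), (5.4, 320.0)]
unseen = [(2.0, 170.0), (3.0, 215.0), (4.0, 260.0)]

# Testing on rows the model was trained on: every row finds itself.
training_error = mean_absolute_error(train, train)

# Testing on rows the model has never seen.
out_of_sample_error = mean_absolute_error(unseen, train)

print(training_error, out_of_sample_error)
```

The training error is zero because every test row is also a training row, yet the error on unseen rows is clearly larger. Evaluating on the training data alone would hide that gap.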

It’s important that our models have high out-of-sample accuracy because the purpose of our model is, of course, to make correct predictions on unknown data.

So, how can we improve out-of-sample accuracy? We can try another evaluation approach:

(b) train/test split

In this approach, we select a portion of our dataset for training, for example, row zero to five, and the rest is used for testing, for example, row six to nine.

The model is built on the training set. Then, the test feature set is passed to the model for prediction. Finally, the predicted values for the test set are compared with the actual values of the testing set.

The train and test sets are mutually exclusive. You train with the training set and test with the testing set.

This will provide a more accurate evaluation of out-of-sample accuracy because the testing dataset is not part of the dataset that has been used to train the model.

This is more realistic for real-world problems. We know the outcome of each data point in the dataset, making it great to test with.

Since this data has not been used to train the model, the model has no knowledge of the outcome of these data points. So, in essence, it’s truly out-of-sample testing.
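The train/test split described above can be sketched with the standard library alone. The helper names (`train_test_split`, `fit_line`, `mean_absolute_error`), the 60/40 split, the fixed random seed, and the ten data rows are all assumptions made for this illustration; libraries such as scikit-learn provide their own, more complete utilities for the same job.

```python
import random
from statistics import mean


def train_test_split(rows, test_fraction=0.4, seed=0):
    """Shuffle rows and return mutually exclusive (train, test) lists."""
    shuffled = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(rows):
    """Least-squares fit of y = theta0 + theta1 * x to (x, y) pairs."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
              / sum((x - x_bar) ** 2 for x, _ in rows))
    return y_bar - theta1 * x_bar, theta1


def mean_absolute_error(rows, theta0, theta1):
    """Average absolute difference between actual and predicted y."""
    return mean(abs(y - (theta0 + theta1 * x)) for x, y in rows)


# Ten invented (engine size, CO2 emission) records.
data = [(1.5, 150.0), (2.0, 170.0), (2.4, 190.0), (3.0, 215.0), (3.5, 240.0),
        (3.7, 245.0), (4.0, 260.0), (4.7, 290.0), (5.0, 300.0), (5.4, 320.0)]

train, test = train_test_split(data)             # 6 rows for training, 4 for testing
theta0, theta1 = fit_line(train)                 # the model only ever sees `train`
test_error = mean_absolute_error(test, theta0, theta1)
print(len(train), len(test), test_error)
```

The rows are shuffled before splitting so that the test set is not just the tail of the table; the seed is fixed only to make the sketch repeatable.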

However, once the evaluation is done, please ensure that you also train your model with the testing set, as you don’t want to lose potentially valuable data.

The issue with train/test split is that it’s highly dependent on the datasets on which the model was trained and tested. Train/test split gives a better out-of-sample estimate than training and testing on the same dataset, but the result still varies because of this dependency. Another evaluation approach, called K-fold cross-validation, resolves most of these issues.

**K-fold cross-validation**

How do you fix a high variation that results from a dependency? Well, you average it.

The entire dataset is represented by the points in the image at the top left. If we have K equals four folds, then we split up this dataset as shown here.

In the first fold for example, we use the first 25% of the dataset for testing and the rest for training. The model is built using the training set and is evaluated using the test set.

Then, in the next round (second fold), the second 25% of the dataset is used for testing and the rest for training the model. Again, the accuracy of the model is calculated.

We continue for all folds. Finally, the results of all four evaluations are averaged. That is, the accuracy of each fold is averaged, keeping in mind that each fold is distinct: no data point used for testing in one fold is used for testing in another.

K-fold cross-validation in its simplest form performs multiple train/test splits, using the same dataset where each split is different. Then, the results are averaged to produce a more consistent out-of-sample accuracy.
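The procedure can be sketched as follows, again using only the standard library. The helper names (`k_fold_indices`, `fit_line`, `mean_absolute_error`, `cross_validate`) and the eight data rows are invented for illustration. The sketch scores each fold with mean absolute error instead of accuracy because the model is a regression line, and it splits the rows in their existing order; in practice the rows are usually shuffled first.

```python
from statistics import mean


def k_fold_indices(n_rows, k):
    """Yield (train_idx, test_idx) pairs; every row is tested exactly once."""
    if not 2 <= k <= n_rows:
        raise ValueError("k must be between 2 and the number of rows")
    base, extra = divmod(n_rows, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)   # spread any remainder over early folds
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_rows) if not start <= i < start + size]
        yield train_idx, test_idx
        start += size


def fit_line(rows):
    """Least-squares fit of y = theta0 + theta1 * x to (x, y) pairs."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    theta1 = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
              / sum((x - x_bar) ** 2 for x, _ in rows))
    return y_bar - theta1 * x_bar, theta1


def mean_absolute_error(rows, theta0, theta1):
    """Average absolute difference between actual and predicted y."""
    return mean(abs(y - (theta0 + theta1 * x)) for x, y in rows)


def cross_validate(rows, k=4):
    """Return the per-fold test errors and their average."""
    fold_errors = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        train = [rows[i] for i in train_idx]
        test = [rows[i] for i in test_idx]
        theta0, theta1 = fit_line(train)           # fit on the other folds only
        fold_errors.append(mean_absolute_error(test, theta0, theta1))
    return fold_errors, mean(fold_errors)


# Eight invented (engine size, CO2 emission) records: 25% per fold when K = 4.
data = [(1.5, 150.0), (2.0, 170.0), (2.4, 190.0), (3.0, 215.0),
        (3.5, 240.0), (4.0, 260.0), (4.7, 290.0), (5.4, 320.0)]

fold_errors, average_error = cross_validate(data, k=4)
print(fold_errors, average_error)
```

Each row appears in exactly one test fold, while the training sets of different folds overlap, which is what lets every record contribute to both training and evaluation.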

**7. Evaluation Metrics in Regression Models**

Evaluation metrics are used to explain the performance of a model.

As mentioned, we can compare the actual values and predicted values to calculate the accuracy of our regression model.

Evaluation metrics play a key role in the development of a model, as they provide insight into areas that require improvement.

In the context of regression, the error of the model is the difference between the data points and the trend line generated by the algorithm.

Since there are multiple data points, an error can be determined in multiple ways.

Mean Absolute Error is the mean of the absolute value of the errors. This is the easiest of the metrics to understand, since it’s just the average magnitude of the errors.

Mean Squared Error is the mean of the squared error. It’s more popular than Mean Absolute Error because it places more weight on large errors. This is due to the squared term, which makes larger errors grow much faster than smaller ones (an error twice as large contributes four times as much).

Root Mean Squared Error is the square root of the Mean Squared Error. This is one of the most popular of the evaluation metrics because Root Mean Squared Error is interpretable in the same units as the response vector or Y units, making it easy to relate its information.

Relative Absolute Error (RAE) is a metric expressed as a ratio normalizing the absolute error. It measures the average absolute difference between the actual and predicted values relative to the average absolute difference between the actual values and their mean.

Relative Squared Error is very similar to relative absolute error, but is widely adopted by the data science community as it is used for calculating R-squared.

R-squared is not an error per se, but is a popular metric for the accuracy of your model. It represents how close the data values are to the fitted regression line. The higher the R-squared, the better the model fits your data.

Each of these metrics can be used to quantify the quality of your predictions. The choice of metric depends on the type of model, your data type, and your domain knowledge.
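To make the definitions concrete, here is a minimal sketch that computes all six metrics with plain Python. The function name `regression_metrics` and the four sample values are invented for illustration, and R-squared is computed as one minus the Relative Squared Error, following the relationship mentioned above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute common regression metrics from actual and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    abs_err = [abs(a - p) for a, p in zip(y_true, y_pred)]
    sq_err = [(a - p) ** 2 for a, p in zip(y_true, y_pred)]

    mae = sum(abs_err) / n
    mse = sum(sq_err) / n
    rmse = math.sqrt(mse)
    # Relative errors normalize by how far the actual values sit from their mean.
    rae = sum(abs_err) / sum(abs(a - mean_y) for a in y_true)
    rse = sum(sq_err) / sum((a - mean_y) ** 2 for a in y_true)
    r2 = 1.0 - rse
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "RAE": rae, "RSE": rse, "R2": r2}

# Made-up actual and predicted values.
y_true = [200.0, 220.0, 250.0, 230.0]
y_pred = [210.0, 215.0, 240.0, 235.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(name, round(value, 4))
```

For these sample values the errors are 10, 5, 10, and 5 in absolute terms, so MAE is 7.5 and MSE is 62.5. scikit-learn offers ready-made versions of several of these in `sklearn.metrics`, such as `mean_absolute_error`, `mean_squared_error`, and `r2_score`.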

**8. Multiple Linear Regression**

We’ve looked at simple regression to predict CO2 emissions but in reality there are multiple variables that predict this.

When multiple independent variables are present, the process is called multiple linear regression. For example, predicting CO2 emission using engine size and the number of cylinders in the car’s engine.

Multiple linear regression is an extension of the simple linear regression model.

There are two applications for multiple linear regression.

First, it can be used when we would like to identify the strength of the effect that the independent variables have on the dependent variable.

For example, do revision time, test anxiety, lecture attendance, and gender have any effect on the exam performance of students?

Second, it can be used to predict the impact of changes, that is, to understand how the dependent variable changes when we change the independent variables.

For example, if we were reviewing a person’s health data, a multiple linear regression can tell you how much that person’s blood pressure goes up or down for every unit increase or decrease in their body mass index, holding other factors constant.

Multiple linear regression is a method of predicting a continuous variable. It uses multiple variables called independent variables or predictors that best predict the value of the target variable which is also called the dependent variable.

In multiple linear regression, the target value, Y, is a linear combination of independent variables, X.

For example, you can predict how much CO2 a car might emit due to independent variables such as the car’s engine size, number of cylinders, and fuel consumption.

Multiple linear regression is very useful because you can examine which variables are significant predictors of the outcome variable. Also, you can find out how each feature impacts the outcome variable.

The best model for our data set is the one with minimum error for all prediction values. So, the objective of multiple linear regression is to minimize the MSE equation.

To minimize it, we should find the best parameters theta, but how?

There are many ways to estimate the value of these coefficients in MLR. However, the most common methods are ordinary least squares and an optimization approach.

Ordinary least squares tries to estimate the values of the coefficients by minimizing the mean squared error. This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values for theta.

The problem with this technique is the time complexity of calculating matrix operations as it can take a very long time to finish. When the number of rows in your data set is less than 10,000, you can think of this technique as an option. However, for greater values, you should try other faster approaches.
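As a rough sketch of the matrix approach, the snippet below fits coefficients with NumPy's `np.linalg.lstsq`. The six rows of data are hypothetical, and the targets are generated from known coefficients (with no noise) purely so the recovered values can be checked; real data would not fit this exactly.

```python
import numpy as np

# Hypothetical toy data: each row is (engine size, cylinders).
features = np.array([
    [2.0, 4.0],
    [2.4, 4.0],
    [1.5, 4.0],
    [3.5, 6.0],
    [3.7, 6.0],
    [5.0, 8.0],
])

# Coefficients used only to generate noise-free targets for this demo.
true_theta = np.array([125.0, 6.2, 14.0])

# Prepend a column of ones so theta[0] acts as the intercept.
A = np.column_stack([np.ones(len(features)), features])
y = A @ true_theta

# Least-squares solution of A @ theta ≈ y.
theta_ols, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(theta_ols, 3))
```

The same solution can be written with the normal equations, solving `A.T @ A @ theta = A.T @ y`, but `lstsq` is generally the numerically safer choice.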

The second option is to use an optimization algorithm to find the best parameters. That is, you can use a process of optimizing the values of the coefficients by iteratively minimizing the error of the model on your training data.

For example, you can use gradient descent, which starts the optimization with random values for each coefficient, then calculates the error and tries to minimize it by adjusting the coefficients over multiple iterations.

Gradient descent is a proper approach if you have a large data set. Please understand however, that there are other approaches to estimate the parameters of the multiple linear regression.
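The iterative idea can be sketched as follows. This is a bare-bones batch gradient descent on the MSE, not a production implementation: the learning rate, iteration count, function name, and the small grid of inputs are all assumptions chosen so that the example converges quickly.

```python
import numpy as np

def gradient_descent(A, y, learning_rate=0.1, n_iters=2000, seed=0):
    """Minimize the MSE of A @ theta against y by batch gradient descent."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=A.shape[1])  # random starting coefficients
    n = len(y)
    for _ in range(n_iters):
        residual = A @ theta - y                 # current prediction errors
        gradient = (2.0 / n) * (A.T @ residual)  # gradient of the MSE
        theta -= learning_rate * gradient        # step against the gradient
    return theta

# Hypothetical inputs on a small grid, with an intercept column of ones.
grid = np.array([[a, b] for a in (0.0, 0.5, 1.0) for b in (0.0, 0.5, 1.0)])
A_gd = np.column_stack([np.ones(len(grid)), grid])

# Noise-free targets from known coefficients, so the result can be checked.
y_gd = A_gd @ np.array([1.0, 2.0, 3.0])

theta_gd = gradient_descent(A_gd, y_gd)
print(np.round(theta_gd, 4))
```

With features on very different scales, gradient descent typically needs a smaller learning rate or standardized inputs to converge.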

After you find the best parameters for your model, you can go to the prediction phase.

Once we have found the parameters of the linear equation, making predictions is as simple as solving the equation for a specific set of inputs.

Imagine we are predicting CO2 emission or Y from other variables for the automobile in record number nine. Our linear regression model representation for this problem would be y hat equals theta transpose x.

Once we find the parameters, we can plug them into the equation of the linear model. For example, let’s use theta zero equals 125, theta one equals 6.2, theta two equals 14, and so on. If we map it to our data set, we can rewrite the linear model as CO2 emission equals 125 plus 6.2 multiplied by engine size, plus 14 multiplied by cylinders, and so on. As you can see, multiple linear regression estimates the relative importance of predictors.

For example, it shows that the number of cylinders has a higher impact on CO2 emission amounts in comparison with engine size.

Now, let’s plug in the ninth row of our data set and calculate the CO2 emission for a car with an engine size of 2.4.

So, CO2 emission equals 125 plus 6.2 times 2.4, plus 14 times four, and so on. We can predict the CO2 emission for this specific car would be 214.1.
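The calculation is just a dot product between the parameter vector and the input vector. The sketch below uses only the three terms the text spells out; the remaining terms hidden behind "and so on" are not given, so this partial sum comes to about 195.88 rather than the full prediction of 214.1.

```python
# Parameters stated in the text: intercept, engine size, cylinders.
theta = [125.0, 6.2, 14.0]

# Inputs for the ninth record: 1 for the intercept, engine size 2.4, 4 cylinders.
x = [1.0, 2.4, 4.0]

# y hat = theta transpose x, restricted to the terms that are given.
partial_prediction = sum(t * v for t, v in zip(theta, x))
print(round(partial_prediction, 2))
```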

Now, let me address some concerns that you might already be having regarding multiple linear regression.

As you saw, you can use multiple independent variables to predict a target value in multiple linear regression.

It sometimes results in a better model compared to using a simple linear regression which uses only one independent variable to predict the dependent variable.

The question is, how many independent variables should we use for the prediction? Should we use all the fields in our data set? Does adding independent variables to a multiple linear regression model always increase the accuracy of the model?

Adding too many independent variables without any theoretical justification may result in an overfit model. An overfit model is a real problem because it is too complicated for your data set and not general enough to be used for prediction.

It is recommended to avoid using many variables for prediction. There are different ways to avoid overfitting a model in regression, however that is outside the scope of this lesson.

The next question is, should independent variables be continuous? Basically, categorical independent variables can be incorporated into a regression model by converting them into numerical variables.

For example, given a binary variable such as car type, code a dummy variable: zero for manual and one for automatic cars.
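A minimal sketch of that dummy coding, using a made-up list of transmission types:

```python
# Hypothetical categorical column: the transmission type of four cars.
transmission = ["manual", "automatic", "automatic", "manual"]

# Dummy coding as described above: 0 for manual, 1 for automatic.
codes = {"manual": 0, "automatic": 1}
is_automatic = [codes[t] for t in transmission]
print(is_automatic)
```

For variables with more than two categories, one indicator column per category is the usual extension; `pandas.get_dummies` is one way to produce those columns.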

As a last point, remember that multiple linear regression is a specific type of linear regression. So, there needs to be a linear relationship between the dependent variable and each of your independent variables. There are a number of ways to check for a linear relationship. For example, you can use scatter plots and then visually check for linearity. If the relationship displayed in your scatter plot is not linear, then you need to use non-linear regression.

Quiz.

1. Which of the following is the meaning of “Out of Sample Accuracy” in the context of evaluation of models?

(a) “Out of Sample Accuracy” is the accuracy of an overly trained model (which may capture noise and produce a non-generalized model)

(b) “Out of Sample Accuracy” is the percentage of correct predictions that the model makes on data that the model has NOT been trained on.

(c) “Out of Sample Accuracy” is the percentage of correct predictions that the model makes using the test dataset.

(d) “Out of Sample Accuracy” is the accuracy of a model on all the data available.

Answer: (b)

2. When should we use Multiple Linear Regression? (Select two)

(a) When there are multiple dependent variables

(b) When we would like to identify the strength of the effect that the independent variables have on a dependent variable.

(c) When we would like to predict impacts of changes in independent variables on a dependent variable.

(d) When we would like to examine the relationship between multiple variables.

Answer: (b), (c)

3. Which sentence is TRUE about linear regression?

(a) A linear relationship is necessary between the independent variables and the dependent variable.

(b) A linear relationship is necessary between the independent and dependent variables as well as in between independent variables.

(c) Simple linear regression requires a linear relationship between the predictor and the response, but multiple linear regression does not.

(d) Multiple linear regression requires a linear relationship between the predictors and the response, but simple linear regression does not.

Answer: (a)

4. What are the requirements for independent and dependent variables in regression?

(a) Independent and dependent variables can be either categorical or continuous.

(b) Independent variables can be either categorical or continuous. Dependent variables must be continuous.

(c) Independent and dependent variables must be continuous.

(d) Independent variables must be continuous. Dependent variables can be either categorical or continuous.

Answer: (b)

5. The key difference between simple and multiple regression is:

(a) Simple regression assumes a linear relationship between variables, whereas this assumption is not necessary for multiple regression.

(b) Multiple linear regression introduces polynomial features.

(c) Simple linear regression compresses multidimensional space into one dimension.

(d) To estimate a single dependent variable, simple regression uses one independent variable whereas multiple regression uses multiple.

Answer: (d)

6. Recall that we tried to predict CO2 emission with car information. Say that now we can describe the relationship as: CO2_emission = 130 – 2.4*cylinders + 8.3*fuel_consumption

What is TRUE of this relationship?

(a) When “cylinders” decreases by 1 while fuel_consumption remains constant, CO2_emission increases by 2.4 units.

(b) When “cylinders” increases by 1 while fuel_consumption remains constant, CO2_emission increases by 2.4 units.

(c) Since the coefficient for “fuel_consumption” is greater than that for “cylinders”, “fuel_consumption” has lower impact on CO2_emission.

(d) When both “cylinders” and “fuel_consumption” increase by 1 unit, CO2_emission decreases.

Answer: (a)

7. What could be the cause of a model yielding high training accuracy and low out-of-sample accuracy?

(a) The model is training on a small training set, so it is underfitting.

(b) The model is training on the entire dataset, so it is underfitting.

(c) The model is training on a small training set, so it is overfitting.

(d) When we perform multiple train/test splits using the same dataset, it will cause overfitting.

Answer: (c)

8. Multiple Linear Regression is appropriate for:

(a) Predicting the sales amount based on month.

(b) Predicting whether a drug is effective for a patient based on her characteristics.

(c) Predicting tomorrow’s rainfall amount based on the wind speed and temperature.

Answer: (c)
