Linear Regression In Python With A Real-world Example
Image
Introduction
In this blog, we will be working linear regression using data we analyzed in the previous blog. The objective of this blog is to make you understand how to implement linear regression using python. We shall be using sklearn library to do the regression, matplotlib for visualization and pandas to do the analysis.
Linear Regression
Linear regression is about getting a line of best fit for values. It helps model the relationship between variables for example predicting the cost of a hotel given its neighbourhood, service and type. It is a supervised learning algorithm meaning we provide data to the model for it to learn patterns. Regression helps to predict or interpolate the values of that we haven’t seen given data we have seen before.
Regression is an intuitive algorithm which we all have ever done in our mind, for example, predicting the price of vegetables in a given season.it is the easiest and one of the commonest machine learning algorithms. Actually, most problems just require regression instead of fancy machine learning Algorithms. You can apply it in finance, health, economics problems where the variables are linearly related etc. The limitation of the algorithm is it works with dependant variables that are continuous in nature.
Variable Types
We have two types of variables. Independent variable and dependent variable. Dependent variable /outcome must be continuous type but Independent variables/features can be any time, for example, discrete, continuous, categorical type(gender, class).
Equation
We write the function for linear equation as
Image
where: m = coefficient/ rate of improvement b-bias The aim is to find the optimal parameters( coefficient and bias). y /f(x) is the predicted value. Y is The regression line or line of best fit is one for which an error is minimized. The errors or residuals can be drawn as vertical lines from the observed value to the regression line (see figure one). Residuals are the difference between the points and the fitted line.
Cost Function
We write the cost function for the above equation as:
Image
N is the data points Y is the actual value of observation y=mx+b which is the predicted value Our goal is to minimize it to improve the accuracy by finding m for which mse(m) is minimum
Simple/Univariate Regression
Simple regression is one where we have one feature and one target variable. Simple regression is an easy one. An example is trying to predict the performance of students given the hours they put into studying. Income was given the position or years of experience of the developer.
Equation
This is the same as the general equation, i represents the observation .
Image
It has one feature x and output/ target one. the goal is to find f(x) so that when new data is exposed to the function we can get a prediction which is close to the actual value.
Example: Consider our situation where we have government expenditure (G)and unemployment(U) ignoring other factors that lead to unemployment we model the relationship as.
Image
Implementation We are going to use the African economic data which has government expenditure and unemployment percentage. We shall take unemployment as our target value and government expenditure as x value. For this task, we shall be using sklearn for the regression class, Pandas for analysis and loading data and Matplotlib for visualization.
#import necessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#Load data
df = pd.read_csv('economic-data-africa.csv', sep=';', encoding='ISO-8859-1')
Let us take a peek at the first 5 rows of our dataset by calling df.head(). the index count start at zero thats the reason the last value 4.
Image
Summarize Data
Before we go ahead and train our regression model there is a need for us to do data analysis on it. This step is important because it helps us discover any outliers and other anomalies in the data. We shall do this in two ways. First using descriptive statistics and then data visualization.
Descriptive statistics
Descriptive statistics help us find distributions in our data. Here we perform statistical tasks like finding the mean, median, count etc on the data. This can be done individually, but thanks to pandas describe function we can do this in one line. let us check the dimension and data types of our variables
#check dimension of data
df.shape
(51, 3)
#check data type
df.dtypes
CountryName object
GovExpenditurePerc int64
UnemploymentPerc int64
dtype: object
The results show us that we have a matrix of 51 by 3 dimension. our dataset has three columns/features and 51 rows. Matrix is written row by column. we also see that our features are country name, UnemploymentPerc and GovExpenditurePerc with data types object/string, integer respectively.
We use describe method to check the the statistical properties of the data.
Image
Data visualizations
Visualization makes it easy for us to see the variable distributions, identify outliers and map relationship in the data. We shall draw a scatter plot of the data to see how they relate to each other. Because linear regression works well with linearly related data, we use this step to verify the variables are linearly related. the bar graph will show the variable distribution.
df.hist( color ="#FF69B4")
plt.savefig("simple_hist.png")
plt.show()
Image
# Visualising the data distribution
plt.scatter(df["GovExpenditurePerc"], df["UnemploymentPerc"], color = '#FF69B4')
plt.title('Unemployment vs Government expenditure',fontsize =20)
plt.xlabel('Government Expenditure',fontsize =20)
plt.ylabel('Unemployment',fontsize =20)
plt.grid()
plt.savefig("UnempVsExp.png")
plt.show()
Image
From our plots, we see that there is an outlier. The expenditure above 80 which may affect our model. Let us remove data that is above 80 per cent.
# Remove rows with outliers
df = df.loc[df['GovExpenditurePerc'] < 70]
df.shape
(50, 3)
Model training
Let us assign our variables to x and y. X = government expenditure y= unemployment We reshape x into two dimension using reshape(-1, 1). We divide data into training and test sections using sklearn test split function so that we can use the test set to use to evaluate how our linear regression model performs on unseen data. We assigned 80% of the data as a train set and 20% as a test set. We shall train our model using the LinearRegression class from sklearn. After creating an instance of the class we call the fit method and pass in x and y data to fit the data.
# deciding target and data value
Y =df.iloc[:,2].values
X = df.iloc[:, 1].values
#reshape x
X =X.reshape(-1, 1)
# Split the data into Train set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2,random_state = 0)
# Fitting Simple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
##Print our w and E value
print(regressor.coef_) # the slope
print(regressor.intercept_) # the intercept
#ouput
[0.33058722]
0.8284703168542435
We are now done training our regression model and have the coefficients and intercept of our model. These can be put in our equation to find the find the line of best fit.we can use this to get unemployment percentage of any country when we have its government expenditure.
Image
Let us plot and see our line of best fit in the training set.
# Visualize the Train set line of regression line
plt.scatter(X_train, y_train, color = '#FF69B4')
plt.plot(X_train, regressor.predict(X_train), color = 'black')
plt.title('Unemployment vs Government expenditure (Training set)',fontsize =15)
plt.xlabel('Government Expenditure',fontsize =20)
plt.ylabel('Unemployment',fontsize =20)
plt.grid()
plt.savefig("UnempVsExpenditure.png")
plt.show()
Model Evaluation
The aim of this step is to see how the model performs on unseen data. Often times algorithms perform well during training data and poorly when new data is used that is why it is important to evaluate the performance on the test set. if we are not satisfied with the results we iteratively to improve the performance. First, we use the inbuilt predict function to get the predicted values, get the mean square error and then compare the actual and predicted value.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
#mean squared error
metrics.mean_squared_error(y_test, y_pred)
values = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': regressor.predict(X_test).flatten()})
values
Let us visualize the actual values alongside the predicted to see how our model performs, discover any hidden patterns. From what we see our model doesn’t perform well.
df_error = values
df_error.plot(kind='bar',figsize=(16,10),color =["gray","pink"])
plt.grid(which='major', linestyle='-', linewidth='0.5', color='green')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.title("Actual and predicted values",fontsize =20)
plt.savefig("simple_actualpredicted.png")
plt.show()
Image
Multivariate Linear Regression
Multivariate linear regression has many two or more features. In real-world situations, we usually use multivariate regression because many features have to be put into consideration. Many of the concepts here are the same the ones in the simple linear regression section with a few extensions so we won’t be going into much details on them.
Equation
Image
The hypothesis for multivariate linear regression is an extension of the simple linear regression. Y =the predicted value. W0 = bias term . this is the point where the line intercepts the y-axis. W1,…..Wn are the parameters. X1,…..Xn are the feature values.
For our problem, the distinct features are population penetration, internet usage and government expenditure. We can model it as follows.
Image
We have to add a bias term because even when there is no population growth, government expenditure and internet usage it will be hard for unemployment to be zero because there are other factors that influence like past government expenditure.
Implementation
We shall use the previous data combined data with population and internet usage. We have to check data and see if each input feature is linearly related to the target value and remove some features that don’t. We shall use the same sklearn linear regression class as before to train the regression model.
Let use import the necessary libraries and load the dataset.
#import the neccessary packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
#load the dataset
m_df =pd.read_csv("combined_data.csv")
m_df.head()
Image
Summarize Data
I did an analysis before and found out some features don’t contribute much to the performance of the model so let us drop them before we do the analysis. We shall remaining with three input features.
# delete less contributing features
m_df =X = m_df.drop(["Unnamed: 0",
'Population\n (2018 Est.)',
'InternetUsers\n 31-Dec-2017',
'InternetGrowth %\n2000 - 2017',
'Facebook\n subscribers31-Dec-2017'], axis=1)
m_df.head()
Image
Descriptive statistics
In this section we are exploring the mathematical properties of our data set. We check for the data type of variables, get the dimension of the dataset and find the mean, median count and other statistical properties.
# check dimension of data
m_df.shape
(51, 5)
m_df.dtypes
CountryName object
GovExpenditurePerc int64
UnemploymentPerc int64
InternetUsers\n 31-Dec-2000 int64
Penetration\n (% Population) float64
dtype: object
# descriptions
m_df.describe()
# descriptions
m_df.describe()
Image
Data visualizations
This step is for us to get an idea of how our data is distributed. From the plots we shall be able to identify any outliers or skewness in our dataset.it is very hard to identify data in high dimension. In this section we are going to only plot histogram.
# histograms
m_df.hist(color ="#FF69B4",alpha=0.5, figsize=(20, 10))
plt.savefig("multi_hist.png")
plt.show()
Image
We see outliers in almost all features. In this task we will only remove the government expenditure outlier. I encourage you to try and remove the rest and see if there is an improvement in how the regression model performance.
# Remove the outlier rows
m_df = m_df.loc[m_df['GovExpenditurePerc'] < 80]
m_df.shape
(50, 5)
Model training
We shall assign our x and y values then divide the data set into train and test set. because our features have different ranges we are going to normalize so it can have the same ranges.the normalization rescales values to ranges between 0 to 1.This helps speed the computation and the model is less sensitive to feature scale. Sklearn provides this functionality, all we need is to make normalize =True.
# x and y values
Y =m_df.iloc[:,2].values
X = m_df.drop(['CountryName', 'UnemploymentPerc'], axis=1).values
# Split the datas into the Train set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2,
random_state = 0)
# Fitt Linear Regression to the Train set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression(normalize=True)
regressor.fit(X_train, y_train)
#print weights and bias
print(regressor.coef_) # the coefficients
print(regressor.intercept_) # the intercept
[ 3.31007491e-01 -8.64072992e-06 7.37252335e-02]
-1.040968768854757
For clarity let use create a table with coefficients of the features so it is easy for us to feed in the equation.
#coeficients
x =m_df.drop(['CountryName', 'UnemploymentPerc'], axis=1)
coef =pd.DataFrame(regressor.coef_,x.columns,columns =["coefficient"])
coef
Image
Model Evaluation
Let us evaluate and see how our model performs on unseen data.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
values = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': regressor.predict(X_test).flatten()})
values
image
df_error = values
df_error.plot(kind='bar',figsize=(16,10),color =["gray","pink"])
plt.grid(which='major', linestyle='-', linewidth='0.5', color='black')
plt.grid(which='minor', linestyle=':', linewidth='0.5', color='black')
plt.title("Actual and predicted values",fontsize =20)
plt.savefig("Multi_actualpredicted.png")
plt.show()
Image
Conclusion
In this blog, we have looked at the basic linear regression model for both simple and Multivariate regression this involved how to make the predictions and how to evaluate our predictions. We have used the basic form of regression but there are other types of regression i.e Ridge regression, Lasso regression and ElasticNet regression. I encourage you to look at them and compare their performances using the evaluation metrics. Simple data was used to do regression here but the same concept can be applied to complex datasets.
Thanks for reading and see you in the next blog.