Predicting Insurance Premium With Linear Regression

Linear regression is a statistical method used to model the relationship between two or more variables. The variable being predicted is the dependent variable; in our scenario, it is the insurance premium. The variables used to predict the value of the dependent variable are called the independent variables, e.g. age, gender and BMI.

The purpose of this exercise is to examine the relationships between the variables and fit a multiple linear regression model that relates several features of an individual, such as age and physical and health condition, to their existing medical expenses. The model can then be used to predict the future medical expenses of individuals, which can help insurance companies decide how much premium to charge.

Getting Started

The first step is to import the libraries to be used for analysis and charting. pandas and NumPy are used for data analysis, while seaborn and matplotlib are used for plotting charts.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
sns.set()

After importing our libraries, we need to prepare our dataset. For this purpose, we will load the data from a CSV file.
After loading the dataset, we can spot-check it with "df.sample(10)", which returns a random sample of 10 rows.

In [2]:
df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/2_Insurance/datasets-insurance.csv')
df.sample(10)
Out[2]:
age sex bmi children smoker region charges
1291 19 male 34.900 0 yes southwest 34828.65400
219 24 female 23.210 0 no southeast 25081.76784
1199 31 female 25.800 2 no southwest 4934.70500
112 37 male 30.800 0 no southwest 4646.75900
412 26 female 17.195 2 yes northeast 14455.64405
1330 57 female 25.740 2 no southeast 12629.16560
1007 47 male 28.215 3 yes northwest 24915.22085
160 42 female 26.600 0 yes northwest 21348.70600
780 30 male 24.400 3 yes southwest 18259.21600
65 19 female 28.900 0 no southwest 1743.21400

As some of the data are in text format, we need to convert them to numeric 0/1 indicators before further analysis. After conversion, let's view the sample data again. The region column stays as text, since it is not used in the model.
We will use the following encoding:
male = 1, female = 0; smoker = 1, non-smoker = 0

In [3]:
# Map text labels to 0/1 and assign back, rather than using inplace on a column
df['sex'] = df['sex'].map({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})

df.sample(10)
Out[3]:
age sex bmi children smoker region charges
688 47 0 24.100 1 0 southwest 26236.57997
1292 21 1 23.210 0 0 southeast 1515.34490
861 38 0 28.000 3 0 southwest 7151.09200
417 36 0 22.600 2 1 southwest 18608.26200
38 35 1 36.670 1 1 northeast 39774.27630
903 49 1 36.850 0 0 southeast 8125.78450
693 24 1 23.655 0 0 northwest 2352.96845
933 45 0 35.300 0 0 southwest 7348.14200
125 26 0 28.785 0 0 northeast 3385.39915
1259 52 0 23.180 0 0 northeast 10197.77220
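
The encoding step can also be written with explicit lookup tables and a check that no label was missed. This is a minimal sketch on made-up toy rows, not the insurance dataset; the names `toy`, `sex_codes`, `smoker_codes` and `encoded` are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows (made-up values) with the same text columns as the dataset.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "smoker": ["yes", "no", "no", "yes"],
})

# Explicit lookup tables make the encoding easy to audit.
sex_codes = {"male": 1, "female": 0}
smoker_codes = {"yes": 1, "no": 0}

encoded = toy.assign(
    sex=toy["sex"].map(sex_codes),
    smoker=toy["smoker"].map(smoker_codes),
)

# Series.map returns NaN for any label missing from the lookup table,
# so a NaN check catches unexpected spellings such as "Male" or "Y".
assert not encoded.isna().any().any()
print(encoded)
```

Checking for NaN after mapping is a cheap safeguard when the text columns may contain inconsistent labels.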

We can also get a high-level summary of our data using "df.describe()". There are 1,338 records in our dataset, with an average age of about 39 years. The gender split is nearly even (50.5% male), and the average insurance charge is $13,270.
The standard deviation and quartiles show how dispersed the values in each column are relative to the average. A low standard deviation means that most of the values are close to the average, while a high standard deviation means that the values are more spread out.
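
To make the point about dispersion concrete, here is a minimal sketch comparing two made-up samples that share the same mean but differ in spread. The values are illustrative only and are not taken from the insurance dataset; `ddof=1` is used because "df.describe()" reports the sample standard deviation.

```python
import numpy as np

# Two hypothetical samples with the same mean (30) but different spread.
tight = np.array([29.0, 30.0, 31.0, 30.0, 30.0])
wide = np.array([10.0, 50.0, 30.0, 5.0, 55.0])

tight_std = tight.std(ddof=1)  # sample standard deviation
wide_std = wide.std(ddof=1)

print("means:", tight.mean(), wide.mean())
print("std (tight):", round(tight_std, 2))
print("std (wide):", round(wide_std, 2))
```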

In [4]:
df.describe()
Out[4]:
age sex bmi children smoker charges
count 1338.000000 1338.000000 1338.000000 1338.000000 1338.000000 1338.000000
mean 39.207025 0.505232 30.663397 1.094918 0.204783 13270.422265
std 14.049960 0.500160 6.098187 1.205493 0.403694 12110.011237
min 18.000000 0.000000 15.960000 0.000000 0.000000 1121.873900
25% 27.000000 0.000000 26.296250 0.000000 0.000000 4740.287150
50% 39.000000 1.000000 30.400000 1.000000 0.000000 9382.033000
75% 51.000000 1.000000 34.693750 2.000000 0.000000 16639.912515
max 64.000000 1.000000 53.130000 5.000000 1.000000 63770.428010

Data Analysis and Data Plots

Correlation

To further understand the relationship between each of the variables in our data, an important step is to find out the correlation between the variables. This will help us understand the impact that each variable has on one another.


We will do this by plotting a heatmap using the seaborn and matplotlib libraries imported earlier. The heatmap shows, for each pair of variables, whether they are positively or inversely related and how strong that relationship is.

Observations:
Insurance charges are most strongly related to whether the insured is a smoker, followed by his/her age and BMI.
This observation can be further analysed using the scatterplots below.

In [5]:
sns.set(rc={'figure.figsize':(10,8)})
sns.set_context("talk", font_scale=0.8)
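# Correlate numeric columns only; the text column region cannot be correlated
sns.heatmap(df.select_dtypes('number').corr(), cmap='Blues', annot=True)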
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x1100bd550>
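
As a complement to the heatmap, the correlations with charges can also be listed and ranked directly. The sketch below uses a small made-up table in the same layout as the dataset, so its numbers say nothing about the real data; `toy`, `corr` and `ranked` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy rows in the same layout as the dataset (values are made up).
toy = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "smoker": [0, 1, 0, 0, 1, 0],
    "bmi": [22.0, 27.5, 30.1, 24.3, 33.0, 28.4],
    "region": ["southwest", "southeast", "northwest",
               "northeast", "southeast", "southwest"],
    "charges": [1800.0, 17500.0, 4500.0, 6200.0, 39000.0, 12500.0],
})

# Keep only numeric columns so text columns such as region are skipped.
corr = toy.select_dtypes("number").corr()

# Pearson correlation of each feature with charges, strongest first.
ranked = corr["charges"].drop("charges").sort_values(ascending=False)
print(ranked)
```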
In [6]:
sns.pairplot(x_vars=["age","smoker","bmi"],y_vars="charges",hue="smoker",data=df, height=6, aspect=0.8)
Out[6]:
<seaborn.axisgrid.PairGrid at 0x1a24c1f150>

All three scatterplots show that insurance premiums are higher for smokers than for non-smokers, regardless of age and BMI.
Older smokers with a high BMI have the highest insurance premiums, since they are at higher risk of developing life-threatening diseases.

Linear Regression

Let us fit a linear regression model using the three factors identified above: age, smoker and BMI.
We will use the scikit-learn library for this purpose. We will need to:
1) Create X (the input data) and y (our output)
2) Identify the features to use, which are our input columns (independent variables)
3) Set y as our output, or target variable, which we are trying to predict (insurance premium)

In [7]:
feature_cols = ['age', 'smoker', 'bmi']
X = df[feature_cols]
y = df['charges']

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
Out[7]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

Let's view the intercept, coefficients and R squared score for our model:

In [8]:
print("intercept", lm.intercept_)
print("coefficients:", lm.coef_)
print("R^2 score", lm.score(X, y))
intercept -11676.830425187807
coefficients: [  259.54749155 23823.68449531   322.61513282]
R^2 score 0.7474771588119513

Interpretation of our linear regression model :

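1) The 3 variables (age, smoker and bmi) together explain about 74.7% of the variation in insurance charges (R squared = 0.747). Note that this score is computed on the same data used to fit the model.
2) All three coefficients are positive, so an increase in any one variable, with the other two held fixed, raises the predicted charge: by about $260 per additional year of age, about $323 per additional BMI point, and about $23,824 for smokers relative to non-smokers. This agrees with the heatmap and scatterplots earlier.
3) The intercept is the predicted charge when all predictor variables are zero. Since an age and BMI of zero are not meaningful in our case, we will not interpret the negative intercept on its own; it still enters every prediction.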
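
To show what the R squared score measures, the sketch below computes it by hand as 1 minus the ratio of unexplained variation to total variation. It is an illustration under stated assumptions: the data are synthetic (not the insurance dataset), and the fit uses NumPy's `np.linalg.lstsq` rather than scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: y depends linearly on two features plus noise.
n = 200
X = rng.normal(size=(n, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an explicit intercept column.
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
y_hat = A @ beta

ss_res = np.sum((y - y_hat) ** 2)      # variation left unexplained by the model
ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
r_squared = 1.0 - ss_res / ss_tot

print("R^2:", round(r_squared, 3))
```

A value close to 1 means the fitted line accounts for most of the variation in the target; a value close to 0 means it does little better than always predicting the mean.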

Next, we can predict the insurance premium for different groups of customers.


1) For a customer with age = 50, smoker = yes, BMI = 35, the predicted annual premium is $36,415.76.

In [9]:
# Feature order must match feature_cols: age, smoker, bmi
predict = np.round(lm.predict([[50,1,35]]),2)
predict
Out[9]:
array([36415.76])

2) For a customer with age = 50, smoker = no, BMI = 35, the predicted annual premium is $12,592.07.
--> A non-smoker's insurance premium is lower, as smokers are at higher risk of contracting life-threatening diseases. The gap between this prediction and the previous one, about $23,824, is exactly the smoker coefficient.

In [10]:
# Feature order must match feature_cols: age, smoker, bmi
predict = np.round(lm.predict([[50,0,35]]),2)
predict
Out[10]:
array([12592.07])

3) For a customer with age = 25, smoker = no, BMI = 21, the predicted annual premium is $1,586.77.
--> A younger non-smoker with a healthy BMI will have a much lower annual insurance premium.

In [11]:
# Feature order must match feature_cols: age, smoker, bmi
predict = np.round(lm.predict([[25,0,21]]),2)
predict
Out[11]:
array([1586.77])
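
The three predictions above can be reproduced by hand, since a linear regression prediction is just the intercept plus each coefficient multiplied by its feature value. The sketch below copies the intercept and coefficients printed earlier; the helper `predict_charges` is a hypothetical name introduced only for illustration.

```python
import numpy as np

# Intercept and coefficients as printed by the fitted model above
# (coefficient order: age, smoker, bmi).
intercept = -11676.830425187807
coef = np.array([259.54749155, 23823.68449531, 322.61513282])

def predict_charges(age, smoker, bmi):
    """Hypothetical helper: intercept + sum(coefficient * feature value)."""
    return intercept + coef @ np.array([age, smoker, bmi], dtype=float)

# The same three customers as in the predictions above.
for customer in [(50, 1, 35), (50, 0, 35), (25, 0, 21)]:
    print(customer, round(float(predict_charges(*customer)), 2))
```

Writing the formula out this way makes it easy to see how much each feature contributes to a given customer's predicted premium.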