Predicting Insurance Premium With Linear Regression

Linear regression is a statistical method used to model the relationship between two or more variables. The variable being predicted is the dependent variable; in our scenario, it is the insurance premium. The variables used to predict it are called the independent variables, e.g. age, gender and BMI.
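For reference, the general form of a multiple linear regression model can be written as below. The symbols (β for coefficients, p for the number of independent variables, n for the number of records) are standard notation introduced here for illustration and do not appear in the original notebook.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p,
\qquad
(\beta_0, \dots, \beta_p) = \arg\min_{\beta} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
```

Here ŷ is the predicted value of the dependent variable, β₀ is the intercept, and each βⱼ is the change in the prediction for a one-unit increase in xⱼ with the other variables held fixed. Ordinary least squares picks the coefficients that minimise the sum of squared differences between observed and predicted values.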

The purpose of this exercise is to examine the relationships between the variables and fit a multiple linear regression model that relates several features of an individual, such as age and physical and health condition, to their existing medical expenses. Such a model can be used to predict the future medical expenses of individuals, which can help insurance companies decide what premium to charge.

Getting Started

The first step is to import the libraries used for analysis and charting. pandas and numpy are used for data analysis, while seaborn and matplotlib are used for plotting.

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
sns.set()

After importing our libraries, we need to load our dataset. Here we read it from a CSV file.
Once the data is loaded, we can spot-check it with "df.sample(10)", which returns a random sample of 10 rows from the dataset.

df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/2_Insurance/datasets-insurance.csv')
df.sample(10)
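Beyond eyeballing a sample, it can help to list which columns hold text and whether any values are missing. The sketch below is illustrative only: "toy" is a hypothetical three-row stand-in for the real CSV, with values copied from the sample output of this notebook.

```python
import pandas as pd

# Hypothetical three-row stand-in for the real CSV (values copied from the sample output).
toy = pd.DataFrame({
    "age": [19, 24, 31],
    "sex": ["male", "female", "female"],
    "bmi": [34.9, 23.21, 25.8],
    "children": [0, 0, 2],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southwest"],
    "charges": [34828.654, 25081.76784, 4934.705],
})

# Columns that are not numeric must be encoded before they can enter a regression.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Missing values per column; any non-zero count would need handling first.
missing = toy.isna().sum()

print("text columns:", text_cols)
print(missing)
```

On the real data, the same two checks would be run on df instead of toy.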

As some of the data are in text format, we need to convert them to numeric 0/1 indicator values before further analysis. After conversion, let's view the sample data again.
We will use the following encoding (the region column is left as text):
male = 1, female = 0; smoker = 1, non-smoker = 0

# Encode the two-level text columns as 0/1. Assigning the result back avoids
# relying on in-place replacement through attribute access.
df['sex'] = df['sex'].map({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})

df.sample(10)
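The region column has four levels and is not used in the model in this notebook. If one wanted to include it later, a common option is one-hot encoding. The following is only a sketch of that option on a hypothetical four-row frame (values copied from the sample output), not a step the original analysis performs.

```python
import pandas as pd

# Hypothetical four-row frame, one row per region (values copied from the sample output).
toy = pd.DataFrame({
    "region": ["southwest", "southeast", "northeast", "northwest"],
    "charges": [4934.705, 25081.76784, 14455.64405, 24915.22085],
})

# One indicator column per region; drop_first=True drops one level so the
# indicators are not perfectly collinear with the intercept in a regression.
encoded = pd.get_dummies(toy, columns=["region"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

The dropped level (the first in alphabetical order) becomes the baseline against which the other region coefficients would be read.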

We can also get a high-level summary of our data using "df.describe()". There are 1,338 records in our dataset, with an average age of about 39 years. The gender split is almost even (about 50.5% of records are male), and the average insurance charge is about $13,270.
The standard deviation and quartiles tell us how dispersed the values in each column are relative to the average. A low standard deviation means that most of the values are close to the average, while a high standard deviation means that the values are more spread out.
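To make the point about dispersion concrete, here is a small sketch with two invented sets of ages that share the same average but differ in spread. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Two invented age samples with the same mean (40) but different spread.
tight = np.array([38, 39, 40, 41, 42])
spread = np.array([18, 25, 40, 55, 62])

# ddof=1 gives the sample standard deviation, the same convention pandas uses in describe().
tight_std = tight.std(ddof=1)
spread_std = spread.std(ddof=1)

print("means:", tight.mean(), spread.mean())
print("standard deviations:", round(tight_std, 2), round(spread_std, 2))
```

Both samples would report the same mean in a summary table, so the standard deviation is what reveals that the second one is far more dispersed.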

df.describe()

Data Analysis and Data Plots

Correlation

To further understand our data, an important step is to measure the correlation between each pair of variables. This shows how strongly the variables move together, although correlation alone does not establish that one variable causes another.

We will do this by plotting a heatmap using the seaborn and matplotlib libraries imported earlier. The heatmap shows, for each pair of variables, whether they are positively or inversely related and how strong that relationship is.
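Each cell of such a heatmap is a Pearson correlation coefficient between two columns. As a rough sketch of what that number measures, the helper below (a hypothetical function named pearson, written for illustration) computes it by hand on invented values and compares it with numpy's built-in result.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance of x and y scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Invented values: one series rises with age, the other falls with age.
age = [19, 24, 31, 37, 26]
rising = [2 * a + 5 for a in age]
falling = [100 - a for a in age]

r_up = pearson(age, rising)     # perfectly positive relationship
r_down = pearson(age, falling)  # perfectly inverse relationship
r_numpy = np.corrcoef(age, rising)[0, 1]

print(r_up, r_down, r_numpy)
```

Values near +1 indicate that two variables rise together, values near -1 that one falls as the other rises, and values near 0 that there is little linear relationship.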

Observations:
Insurance charges are most strongly related to whether the insured is a smoker, their age and their BMI.
These relationships can be examined further using the scatterplots below.

sns.set(rc={'figure.figsize':(10,8)})
sns.set_context("talk", font_scale=0.8)
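# numeric_only=True (pandas 1.5 or later) skips the text column (region), which pandas 2.x would otherwise reject
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True)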

sns.pairplot(x_vars=["age","smoker","bmi"],y_vars="charges",hue="smoker",data=df, height=6, aspect=0.8)

All three scatterplots show that insurance charges are higher for smokers than for non-smokers across the range of ages and BMI values.
Older smokers with a high BMI have the highest charges, which is consistent with their higher risk of developing life-threatening diseases.
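The gap between smokers and non-smokers can also be checked numerically with a group average. The sketch below uses only the 10 rows shown in the post-conversion sample output of this notebook, so it illustrates the technique rather than a result for the full dataset.

```python
import pandas as pd

# The 10 rows from the post-conversion sample output (smoker: 1 = yes, 0 = no).
sample = pd.DataFrame({
    "smoker": [0, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    "charges": [26236.57997, 1515.3449, 7151.092, 18608.262, 39774.2763,
                8125.7845, 2352.96845, 7348.142, 3385.39915, 10197.7722],
})

# Average charge within each smoker group.
mean_by_smoker = sample.groupby("smoker")["charges"].mean()

print(mean_by_smoker.round(2))
```

On the full dataset the same call would be made on df; with only two smokers in this small sample, the averages here should not be read as estimates.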

Linear Regression

Let us fit a linear regression model using the three factors identified above: age, smoker and BMI.
We will use the scikit-learn library for this purpose. We will need to:
1) Identify the features to use, which are our input columns (the independent variables)
2) Create X (the input data) from those columns
3) Create y, our output or target variable, which we are trying to predict (the insurance premium)

feature_cols = ['age', 'smoker', 'bmi']
X = df[feature_cols]
y = df['charges']

from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

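Let's view the intercept, coefficients and R² score of our model.

# The coefficients are listed in the same order as feature_cols: age, smoker, bmi
print("intercept", lm.intercept_)
print("coefficients:", lm.coef_)
# R^2 here is computed on the same data the model was fitted on
print("R^2 score", lm.score(X, y))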

intercept -11676.830425187807
coefficients: [  259.54749155 23823.68449531   322.61513282]
R^2 score 0.7474771588119513
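The R² score above is measured on the same rows the model was fitted on, which tends to be optimistic. A minimal sketch of a held-out check follows. It uses synthetic data generated from rounded versions of the coefficients above plus random noise, because the original CSV is not available here; the noise level, sample size, feature distributions and split ratio are all arbitrary assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): not the real insurance dataset.
rng = np.random.default_rng(0)
n = 500
age = rng.integers(18, 65, size=n)
smoker = rng.binomial(1, 0.2, size=n)
bmi = rng.normal(30.0, 6.0, size=n)
noise = rng.normal(0.0, 6000.0, size=n)
charges = -11676.83 + 259.55 * age + 23823.68 * smoker + 322.62 * bmi + noise

X_syn = np.column_stack([age, smoker, bmi])

# Hold back 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_syn, charges, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = model.score(X_test, y_test)

print("R^2 on training rows:", round(r2_train, 3))
print("R^2 on held-out rows:", round(r2_test, 3))
```

On the real data, the same split would be applied to X and y before calling lm.fit, and the held-out score would give a fairer picture of how the model performs on new customers.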

Interpretation of our linear regression model:

1) The 3 variables (age, smoker and bmi) together explain about 74.7% of the variance in insurance charges (R squared = 0.747). This score is computed on the same data the model was fitted on.
2) All 3 coefficients are positive: holding the other two variables fixed, each additional year of age adds about $259.55 to the predicted charge, each additional BMI point adds about $322.62, and being a smoker adds about $23,823.68. This agrees with the heatmap and scatterplots earlier.
3) The intercept is the predicted charge when all predictor variables are zero. An age and BMI of zero are not realistic in our case, so the negative intercept has no practical interpretation on its own, although it is still used when computing predictions.
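To make the coefficients easier to read, they can be paired with their feature names. The sketch below hard-codes the coefficient values printed above so that it runs on its own; with the fitted model available, lm.coef_ would be used in their place.

```python
# Feature names and coefficients, in the order reported by the model output above.
feature_cols = ["age", "smoker", "bmi"]
coefficients = [259.54749155, 23823.68449531, 322.61513282]

# Map each feature to the change in predicted charge per one-unit increase.
effects = dict(zip(feature_cols, coefficients))
for name, value in effects.items():
    print(f"{name}: +${value:,.2f} per unit increase")

# Scale matters when comparing: ten extra years of age versus smoking status.
ten_years = 10 * effects["age"]
print(f"10 years of age: +${ten_years:,.2f}")
```

Because smoker is a 0/1 indicator while age and BMI span many units, the raw coefficients are not directly comparable without considering the range each variable covers.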

Next, we can predict the insurance premium for different groups of customers

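1) For a customer with age = 50, smoker = yes, BMI = 35, the predicted annual premium is $36,415.76.

# Passing a DataFrame with the training column names avoids a feature-name warning in newer scikit-learn
customer = pd.DataFrame([[50, 1, 35]], columns=feature_cols)
prediction = np.round(lm.predict(customer), 2)
prediction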

array([36415.76])

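2) For a customer with age = 50, smoker = no, BMI = 35, the predicted annual premium is $12,592.07.
--> The predicted premium for the non-smoker is lower by the smoker coefficient (about $23,824), which is consistent with smokers being at higher risk of contracting life-threatening diseases.

# Same profile as before, with smoker set to 0
customer = pd.DataFrame([[50, 0, 35]], columns=feature_cols)
prediction = np.round(lm.predict(customer), 2)
prediction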

array([12592.07])

3) For a customer with age = 25, smoker = no, BMI = 21, the predicted annual premium is $1,586.77.
--> A younger non-smoker with a healthy BMI has a much lower predicted annual premium.

# Younger non-smoker with a lower BMI
customer = pd.DataFrame([[25, 0, 21]], columns=feature_cols)
prediction = np.round(lm.predict(customer), 2)
prediction

array([1586.77])
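The three predictions can be reproduced by hand from the intercept and coefficients, which shows what the predict call is doing. The sketch below hard-codes the values printed earlier so that it runs without the fitted model; the array name profiles is introduced here for illustration.

```python
import numpy as np

# Intercept and coefficients as printed by the model (order: age, smoker, bmi).
intercept = -11676.830425187807
coef = np.array([259.54749155, 23823.68449531, 322.61513282])

# The three customer profiles from the examples: [age, smoker, bmi].
profiles = np.array([
    [50, 1, 35],
    [50, 0, 35],
    [25, 0, 21],
])

# Prediction = intercept + sum of (coefficient * feature value) for each row.
manual = np.round(intercept + profiles @ coef, 2)

print(manual)
```

The results agree with the three outputs reported above (36415.76, 12592.07 and 1586.77), and the difference between the first two is the smoker coefficient.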

Output of the first df.sample(10), before conversion:

      age     sex     bmi  children  smoker     region      charges
1291   19    male  34.900         0     yes  southwest  34828.65400
 219   24  female  23.210         0      no  southeast  25081.76784
1199   31  female  25.800         2      no  southwest   4934.70500
 112   37    male  30.800         0      no  southwest   4646.75900
 412   26  female  17.195         2     yes  northeast  14455.64405
1330   57  female  25.740         2      no  southeast  12629.16560
1007   47    male  28.215         3     yes  northwest  24915.22085
 160   42  female  26.600         0     yes  northwest  21348.70600
 780   30    male  24.400         3     yes  southwest  18259.21600
  65   19  female  28.900         0      no  southwest   1743.21400

Output of the second df.sample(10), after converting sex and smoker to 0/1:

      age  sex     bmi  children  smoker     region      charges
 688   47    0  24.100         1       0  southwest  26236.57997
1292   21    1  23.210         0       0  southeast   1515.34490
 861   38    0  28.000         3       0  southwest   7151.09200
 417   36    0  22.600         2       1  southwest  18608.26200
  38   35    1  36.670         1       1  northeast  39774.27630
 903   49    1  36.850         0       0  southeast   8125.78450
 693   24    1  23.655         0       0  northwest   2352.96845
 933   45    0  35.300         0       0  southwest   7348.14200
 125   26    0  28.785         0       0  northeast   3385.39915
1259   52    0  23.180         0       0  northeast  10197.77220

Output of df.describe():

                age           sex           bmi      children        smoker       charges
count   1338.000000   1338.000000   1338.000000   1338.000000   1338.000000   1338.000000
mean      39.207025      0.505232     30.663397      1.094918      0.204783  13270.422265
std       14.049960      0.500160      6.098187      1.205493      0.403694  12110.011237
min       18.000000      0.000000     15.960000      0.000000      0.000000   1121.873900
25%       27.000000      0.000000     26.296250      0.000000      0.000000   4740.287150
50%       39.000000      1.000000     30.400000      1.000000      0.000000   9382.033000
75%       51.000000      1.000000     34.693750      2.000000      0.000000  16639.912515
max       64.000000      1.000000     53.130000      5.000000      1.000000  63770.428010