The first step is to import the libraries used for analysis and charting. pandas and NumPy handle the data analysis, while seaborn and matplotlib are used for plotting the charts.
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
sns.set()
After importing our libraries, we need to prepare our dataset. For this purpose, we will load the data from a CSV file.
After loading our dataset, we can spot-check it with "df.sample(10)", which returns 10 randomly selected rows.
df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/2_Insurance/datasets-insurance.csv')
df.sample(10)
As some of the columns are stored as text, we need to convert them to numeric 0/1 flags before further analysis. After the conversion, let's view a sample of the data again.
We will use the following encoding:
male = 1, female = 0; smoker = 1, non-smoker = 0
# Assign the result back to each column; replace(..., inplace=True) on df.sex may not update df in recent pandas versions
df['sex'] = df['sex'].map({'male': 1, 'female': 0})
df['smoker'] = df['smoker'].map({'yes': 1, 'no': 0})
df.sample(10)
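The encoding step can also be written with one explicit mapping per column, which makes it easy to spot unexpected labels. The sketch below is a minimal illustration on a small made-up frame (the rows and the names `demo`, `encodings` and `unmapped` are invented for this example, not taken from our dataset). It assumes the labels are exactly 'male'/'female' and 'yes'/'no'.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the real dataset.
demo = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "smoker": ["yes", "no", "no", "yes"],
})

# One dictionary per column keeps the whole encoding in a single place.
encodings = {
    "sex": {"male": 1, "female": 0},
    "smoker": {"yes": 1, "no": 0},
}

encoded = demo.copy()
for column, mapping in encodings.items():
    # map() returns NaN for any label missing from the mapping,
    # so unexpected values surface instead of passing through silently.
    encoded[column] = demo[column].map(mapping)

# Count cells that could not be encoded (0 means every label was recognised).
unmapped = int(encoded[list(encodings)].isna().sum().sum())
print(encoded)
print("unmapped labels:", unmapped)
```

If `unmapped` is greater than zero, the column probably contains a label with different spelling or capitalisation, which is worth checking before modelling.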
We can also get a high-level summary of our data using "df.describe()". There are 1,338 records in our dataset, with an average age of about 39 years. The gender split is roughly equal, and the average insurance charge is about $13,270.
The standard deviation and quartiles tell us how dispersed the values in each column are compared to the average. A low standard deviation means that most of the values are close to the average, while a high standard deviation means that the values are more spread out.
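To make the idea of dispersion concrete, here is a small sketch comparing two made-up samples that share the same average but have very different spreads. The numbers are invented for illustration and are not from our dataset.

```python
import numpy as np

# Two made-up samples with the same mean but different spread.
tight = np.array([48, 49, 50, 51, 52])
wide = np.array([10, 30, 50, 70, 90])

for name, values in [("tight", tight), ("wide", wide)]:
    mean = values.mean()
    # ddof=1 gives the sample standard deviation, the same convention df.describe() uses.
    std = values.std(ddof=1)
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    print(f"{name}: mean={mean:.1f}, std={std:.2f}, quartiles=({q1}, {median}, {q3})")
```

Both samples average 50, but the second has a much larger standard deviation and a wider gap between its first and third quartiles.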
df.describe()
To understand how the variables in our data relate to one another, an important step is to compute the correlation between them. This shows how strongly each pair of variables moves together.
We will do this by plotting a heatmap using the seaborn and matplotlib libraries imported earlier. For each pair of variables, the heatmap shows whether they are positively or inversely related and how strong that relationship is.
Observations:
Insurance charges are strongly related to whether the insured is a smoker, their age and their BMI.
This observation can be explored further using the scatterplots below.
sns.set(rc={'figure.figsize':(10,8)})
sns.set_context("talk", font_scale=0.8)
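# numeric_only=True skips any remaining text columns (pandas 2.0+ raises an error on them otherwise)
sns.heatmap(df.corr(numeric_only=True), cmap='Blues', annot=True)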
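Each cell in the heatmap is a Pearson correlation coefficient, which is what "df.corr()" computes by default. As a rough sketch of what that number means, the example below calculates it by hand for two made-up columns and compares the result with NumPy. The values and the helper name `pearson` are invented for illustration.

```python
import numpy as np

# Made-up values for illustration only; not taken from the dataset.
age = np.array([19, 25, 33, 41, 52, 60], dtype=float)
charges = np.array([1800, 2900, 4100, 6300, 9800, 12500], dtype=float)

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )

r_manual = pearson(age, charges)
r_numpy = np.corrcoef(age, charges)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

A value close to 1 means the two columns rise together, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.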
sns.pairplot(x_vars=["age","smoker","bmi"],y_vars="charges",hue="smoker",data=df, height=6, aspect=0.8)
All three scatterplots show that insurance premiums are higher for smokers than for non-smokers, regardless of age and BMI.
Older smokers with a high BMI have the highest insurance premiums, since they are at higher risk of developing life-threatening diseases.
Let us fit a linear regression model using the three factors identified above: age, smoker and BMI.
We will use the scikit-learn library for this purpose. We will need to:
1) Create X (the input data) and y (the output)
2) Choose the features for X, which are our input columns (independent variables)
3) Set y to our target variable, which is what we are trying to predict (insurance premium)
feature_cols = ['age', 'smoker', 'bmi']
X = df[feature_cols]
y = df['charges']
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# Output: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
Let's view the intercept, coefficients and R^2 score for our model.
print("intercept:", lm.intercept_)
print("coefficients:", lm.coef_)
print("R^2 score:", lm.score(X, y))
1) The three variables (age, smoker and bmi) together explain 74.7% of the variance in insurance charges (R squared = 0.747).
2) All three coefficients are positive: holding the other two variables fixed, an increase in any one of them raises the predicted insurance premium. This matches what the heatmap and scatterplots showed earlier.
3) The intercept is the predicted charge when every predictor is zero, so it is only meaningful if that case is realistic. An age and BMI of zero are not realistic, so we will not interpret the negative intercept.
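To show where the intercept, coefficients and R squared come from, here is a minimal sketch that fits the same kind of model by ordinary least squares and computes R squared by hand. It uses NumPy's least-squares solver rather than scikit-learn, and it runs on synthetic data with invented coefficients, so the numbers it prints will not match those of our model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for illustration only; the coefficients below are invented.
n = 200
age = rng.integers(18, 65, size=n).astype(float)
smoker = rng.integers(0, 2, size=n).astype(float)
bmi = rng.normal(30, 5, size=n)
charges = -10000 + 250 * age + 24000 * smoker + 300 * bmi + rng.normal(0, 3000, size=n)

# Ordinary least squares: a leading column of ones makes the first weight the intercept.
X = np.column_stack([np.ones(n), age, smoker, bmi])
weights, *_ = np.linalg.lstsq(X, charges, rcond=None)
intercept, coefficients = weights[0], weights[1:]

# R^2 = 1 - (residual sum of squares / total sum of squares).
predicted = X @ weights
ss_res = ((charges - predicted) ** 2).sum()
ss_tot = ((charges - charges.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print("intercept:", round(intercept, 2))
print("coefficients:", np.round(coefficients, 2))
print("R^2:", round(r_squared, 3))
```

R squared is the share of the variation in charges that the fitted line accounts for, which is why it is read as "variance explained" rather than as an impact percentage.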
1) For a customer with age = 50, smoker = yes, BMI = 35, the predicted annual premium is $36,415.76.
lm.predict([[50,1,35]])
predict = np.round(lm.predict([[50,1,35]]),2)
predict
2) For a customer with age = 50, smoker = no, BMI = 35, the predicted annual premium is $12,592.07.
--> A non-smoker's insurance premium is usually lower, as smokers are at higher risk of contracting life-threatening diseases.
lm.predict([[50,0,35]])
predict = np.round(lm.predict([[50,0,35]]),2)
predict
3) For a customer with age = 25, smoker = no, BMI = 21, the predicted annual premium is $1,586.77.
--> A younger non-smoker with a healthy BMI will have a much lower annual insurance premium.
lm.predict([[25,0,21]])
predict = np.round(lm.predict([[25,0,21]]),2)
predict
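The three predictions above can also be produced in one call by passing the customers as a DataFrame with named columns, which avoids mixing up the feature order. The sketch below is only an illustration: it trains on synthetic data with invented coefficients (the names `train`, `model` and `customers` are made up for this example), so it will not reproduce the premiums quoted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic training data for illustration only; not the real insurance dataset.
n = 300
train = pd.DataFrame({
    "age": rng.integers(18, 65, size=n),
    "smoker": rng.integers(0, 2, size=n),
    "bmi": rng.normal(30, 5, size=n).round(1),
})
train["charges"] = (
    -10000 + 250 * train["age"] + 24000 * train["smoker"] + 300 * train["bmi"]
    + rng.normal(0, 3000, size=n)
)

feature_cols = ["age", "smoker", "bmi"]
model = LinearRegression().fit(train[feature_cols], train["charges"])

# Score several customers at once, using the same column names as the training data.
customers = pd.DataFrame(
    [[50, 1, 35], [50, 0, 35], [25, 0, 21]],
    columns=feature_cols,
)
customers["predicted_charges"] = np.round(model.predict(customers[feature_cols]), 2)

# Each prediction is the intercept plus the sum of coefficient * feature value.
manual = model.intercept_ + customers[feature_cols].to_numpy() @ model.coef_
print(customers)
```

The `manual` array reproduces the model's predictions from the intercept and coefficients alone, which is a quick way to confirm how the linear model turns a customer's details into a premium.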