The first step is to import the libraries used for analysis and charting. pandas and NumPy are used for data analysis, while seaborn and matplotlib are used for plotting charts.
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
sns.set()
After importing our libraries, we need to prepare our dataset. For this purpose, we will load the data from a CSV file.
Once the dataset is loaded, we can spot-check it with "df.sample(20)", which displays 20 randomly selected rows.
df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/1_Diabetic/datasets-228-482-diabetes.csv')
df.sample(20)
Next, we can perform general checks on the data loaded from the CSV file. There are 768 rows in our dataset, made up of integer columns and decimal (floating-point) columns. No null values were found in any row or column (i.e. there is no missing data in our dataset).
df.info()
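The null check described above can also be made explicit. The following is a minimal sketch on a small, hypothetical DataFrame (the `toy` frame and its values are invented for illustration and are not taken from the diabetes CSV); the same two calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Glucose": [148, 85, np.nan, 89],
    "BMI": [33.6, 26.6, 23.3, np.nan],
    "Outcome": [1, 0, 1, 0],
})

# Number of missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# True only when no cell in the whole frame is missing
has_no_missing = not toy.isnull().values.any()
print(has_no_missing)
```

On a dataset with no missing data, every per-column count would be zero and `has_no_missing` would be `True`.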
We can also perform a high-level summary check of our data using "df.describe()". As noted earlier, there are 768 records in our dataset. By looking at the means, we can see that the average patient in our data is middle-aged and has a relatively normal glucose level of 120.89 mg/dL (below the WHO's diabetic threshold of 126 mg/dL and above).
The standard deviation and quartile ranges tell us how dispersed the data within each column is compared to the average. A low standard deviation means that most of the values are close to the average, while a high standard deviation means that the values are more spread out.
df.describe()
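To make the point about standard deviation concrete, here is a small sketch with two hypothetical series (the numbers are invented) that share the same mean but differ in spread. Note that pandas computes the sample standard deviation (dividing by n - 1) by default.

```python
import pandas as pd

# Two made-up series with the same mean but very different spread
tight = pd.Series([118, 120, 122])   # values close to the mean
spread = pd.Series([60, 120, 180])   # values far from the mean

print(tight.mean(), spread.mean())   # both means are 120
print(tight.std(), spread.std())     # sample standard deviations
```

Both series average 120, but the first has a standard deviation of 2 and the second a standard deviation of 60, so the mean alone says little about how typical a value of 120 really is.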
To further understand the relationships between the variables in our data, an important step is to find the correlation between each pair of variables. This helps us understand how strongly the variables move together.
We will do this by plotting a heatmap using the seaborn and matplotlib libraries imported earlier. For each pair of variables, the heatmap shows whether they are positively or inversely related and how strong that relationship is.
Observations :
1) Age and pregnancies are highly correlated. This is very relevant in today's world as more women are bearing children at much later stages of their lives.
2) Blood glucose level is strongly associated with a diabetes diagnosis.
sns.set(rc={'figure.figsize':(10,8)})
sns.set_context("talk", font_scale=0.8)
sns.heatmap(df.corr(), cmap='Blues', annot=True)
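Besides reading the heatmap by eye, the correlations with the target column can be ranked directly. This is a minimal sketch on synthetic data (the `toy` frame, the `Noise` column and the way `Outcome` is generated are all assumptions made for illustration); on the real data the same expression could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic data: Outcome depends on Glucose, Noise is unrelated (illustration only)
rng = np.random.default_rng(0)
n = 200
glucose = rng.normal(120, 30, n)
noise = rng.normal(0, 1, n)
outcome = (glucose + rng.normal(0, 20, n) > 130).astype(int)
toy = pd.DataFrame({"Glucose": glucose, "Noise": noise, "Outcome": outcome})

# Correlation of every other column with Outcome, strongest first
corr_with_outcome = (
    toy.corr()["Outcome"]
    .drop("Outcome")
    .sort_values(ascending=False)
)
print(corr_with_outcome)
```

Keep in mind that a correlation coefficient only measures linear association; it does not by itself show that one variable causes another.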
Next, we will plot a graph to look at the number of diabetic patients in our current dataset.
Observations : 268 patients were diagnosed with diabetes, while 500 were not.
sns.set(rc={'figure.figsize':(6,6)})
ax = sns.countplot(x='Outcome', hue='Outcome', data=df)
loc, labels = plt.xticks()
ax.set_xticklabels(labels);
# Annotate each bar with its count
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '{0:.0f}'.format(p.get_height()),
            fontsize=12, color='black', ha='center', va='bottom')
plt.title("Women With Diabetes")
plt.xlabel('Diabetic')
plt.ylabel('Number of Patients')
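The counts shown on the bars can also be obtained without a chart. Below is a minimal sketch using a short, hypothetical `Outcome` series (the six values are invented); applied to `df['Outcome']`, the same calls would return the class counts and their shares.

```python
import pandas as pd

# Made-up outcome labels: 0 = not diabetic, 1 = diabetic
outcome = pd.Series([0, 1, 0, 0, 1, 0], name="Outcome")

counts = outcome.value_counts()                 # rows per class
shares = outcome.value_counts(normalize=True)   # fraction of rows per class

print(counts)
print(shares)
```

For the dataset described above, 268 diabetic patients out of 768 corresponds to a share of roughly 35%, so the two classes are noticeably imbalanced.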
Let's look at the age of all women. We plot a graph to analyse the number of diabetic patients by age in our current dataset.
Observations :
1) There is a high number of diabetic patients from the mid-twenties to the early forties.
2) Most non-diabetic patients are in their early twenties. This could be due to fewer pregnancies in this age group (as noted earlier, age and pregnancies have a positive relationship).
sns.set(rc={'figure.figsize':(16,8)})
sns.countplot(x="Age", hue="Outcome", data=df)
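With one bar pair per year of age, the chart can be hard to read. One option is to group ages into bands before counting. This is a sketch on hypothetical data; the band edges and labels are assumptions chosen for illustration, not values used in the analysis above.

```python
import pandas as pd

# Made-up ages and outcomes
toy = pd.DataFrame({
    "Age": [21, 23, 29, 35, 41, 52, 24, 33],
    "Outcome": [0, 0, 1, 1, 1, 0, 0, 1],
})

# Hypothetical ten-year bands; pd.cut uses right-inclusive intervals by default,
# so the first band covers ages 21 to 30
bins = [20, 30, 40, 50, 60]
labels = ["21-30", "31-40", "41-50", "51-60"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Rows: age band, columns: outcome, cells: number of patients
band_counts = pd.crosstab(toy["AgeBand"], toy["Outcome"])
print(band_counts)
```

The resulting table gives the diabetic and non-diabetic counts per band, which makes statements such as "mid-twenties to early forties" easier to check.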
For subsequent analyses, we will look at data for diabetic patients only, so we first filter the dataset to diabetic patients.
newdf = df[(df.Outcome == 1)]
newdf
Next, we plot a scatterplot to analyse the relationship between age and number of pregnancies among women with diabetes.
Observations : Similar to the observation above, most of the women diagnosed with diabetes who have been pregnant are between their mid-twenties and early forties.
sns.set(rc={'figure.figsize':(10,6)})
sns.scatterplot(x = 'Age', y = 'Pregnancies', data = newdf)
Let's look at the age of diabetic patients. Plot a histogram based on the patients' ages.
Observations : The highest number of patients is in the late twenties age group.
sns.set(rc={'figure.figsize':(6,6)})
sns.distplot(newdf['Age'], hist=True, kde=False, bins=int(180/10), hist_kws={'edgecolor':'black'})
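The `bins` argument in the call above fixes how many equal-width bars the histogram has. The following sketch uses NumPy's `histogram` function on hypothetical ages (the `ages` array is invented) to show what that binning does numerically.

```python
import numpy as np

# Made-up ages, used only to illustrate binning
ages = np.array([21, 22, 25, 25, 28, 29, 31, 33, 36, 41, 45, 52, 60])

n_bins = int(180 / 10)  # evaluates to 18, as in the call above
counts, edges = np.histogram(ages, bins=n_bins)

# All bins have the same width: (max - min) / n_bins
bin_width = edges[1] - edges[0]
print(counts)
print(bin_width)
```

The bin width depends on the range of the data, so the same `bins` value gives wider or narrower bars on different columns. Also note that recent seaborn releases deprecate `distplot` in favour of `histplot`, so the plotting call may emit a warning depending on the installed version.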
What are the glucose levels of the diabetic patients? Plot the diabetic patients' glucose levels as follows.
Observations : The majority of patients have glucose levels within the range of 125-150 mg/dL. The WHO's threshold for a diabetic person is 126 mg/dL and above.
sns.distplot(newdf['Glucose'], norm_hist=False, kde=False, hist_kws={'edgecolor':'black'})
How many times have the patients been pregnant? Plot the number of pregnancies among diabetic patients.
Observations : The majority of patients have had 2 pregnancies or fewer.
sns.distplot(newdf['Pregnancies'], bins=int(500/60), norm_hist=False, kde=False, hist_kws={'edgecolor':'black'})
The analyses above have given us an idea of how the variables in the dataset relate to being diabetic. However, to predict the outcome from our variables more accurately, we can make use of logistic regression, which is useful for predicting outcomes of a binary nature, e.g. yes or no results.
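As a rough illustration of how logistic regression turns inputs into a yes or no answer, the sketch below passes a linear score through the logistic (sigmoid) function and applies a 0.5 cut-off. The intercept and coefficient are hypothetical values chosen for illustration; they are not fitted to the diabetes data.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters, chosen only for illustration (not fitted values)
intercept = -6.5
glucose_coef = 0.05

glucose_values = np.array([90.0, 120.0, 150.0])

# Linear score -> probability of the positive class
probabilities = sigmoid(intercept + glucose_coef * glucose_values)

# Probability at or above 0.5 is labelled 1 (yes), otherwise 0 (no)
predicted_labels = (probabilities >= 0.5).astype(int)
print(probabilities)
print(predicted_labels)
```

Fitting a model amounts to choosing the intercept and coefficients so that these probabilities match the observed outcomes as closely as possible.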
Let's run a logistic regression model and check which factors influence the probability of being diabetic. We will use the scikit-learn library for this purpose.
Use model from scikit-learn to check model prediction accuracy.
1) Age & Pregnancies
from sklearn.linear_model import LogisticRegression
columns = ['Age','Pregnancies']
X = df[columns]
y = df['Outcome']
from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
2) Insulin & Glucose
from sklearn.linear_model import LogisticRegression
columns = ['Insulin','Glucose']
X = df[columns]
y = df['Outcome']
from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
3) Skin Thickness & Glucose
from sklearn.linear_model import LogisticRegression
columns = ['SkinThickness','Glucose']
X = df[columns]
y = df['Outcome']
from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
4) Age & Pregnancies
from sklearn.linear_model import LogisticRegression
columns = ['Age','Pregnancies']
X = df[columns]
y = df['Outcome']
from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
Observations :
1) Women with high insulin and glucose levels are highly likely to develop diabetes.
2) Women with high skin thickness measurements and high glucose levels are also highly likely to develop diabetes.
-- Insulin level and skin thickness are indicators to look out for in potential diabetic patients.
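The pairwise comparisons above score each model on the same rows it was trained on, which tends to overstate accuracy. A possible refinement, sketched below on synthetic data, is to wrap the repeated steps in a helper that holds out a test set and to compare the result with a majority-class baseline. The `evaluate_pair` helper, the `toy` frame and the way its `Outcome` column is generated are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate_pair(data, columns, target="Outcome", seed=0):
    """Fit on 75% of the rows and return accuracy on the held-out 25%."""
    X = data[columns]
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed, stratify=y
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))

# Synthetic data: Outcome is driven by Glucose only (illustration, not real data)
rng = np.random.default_rng(0)
n = 400
toy = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "Insulin": rng.normal(80, 40, n),
    "Age": rng.integers(21, 70, n),
})
toy["Outcome"] = (toy["Glucose"] + rng.normal(0, 15, n) > 130).astype(int)

# Accuracy obtained by always predicting the more common class
positive_share = toy["Outcome"].mean()
baseline = max(positive_share, 1 - positive_share)

pairs = (["Insulin", "Glucose"], ["Age", "Insulin"])
scores = {tuple(cols): evaluate_pair(toy, cols) for cols in pairs}

print(baseline)
print(scores)
```

For the real dataset, always predicting "not diabetic" would already be right for 500 of 768 patients (about 65%), so a feature pair is only informative to the extent that its held-out accuracy exceeds that figure.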
After identifying the factors which can lead to diabetes, we can now perform prediction analysis using logistic regression to find out which other patients in our dataset are highly likely to develop diabetes.
We will remove the "Outcome" column from our original dataset (akin to initialising our data) and insert the predicted answers into a newly created column called "Diabetic".
We will also filter the results to show only the list of patients predicted to be diabetic.
predict = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/1_Diabetic/datasets_diabetes_prediction.csv')
X_predict = predict[columns]
predictions = logmodel.predict(X_predict)
predict['Diabetic'] = predictions
predict
is_1=predict['Diabetic']==1
predict_1=predict[is_1]
predict_1
The table above shows the list of patients highly likely to develop diabetes (only a sample of the final results is shown here as the list is too long).
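Since the list of predicted patients is long, one way to prioritise it is to rank patients by predicted probability rather than by the 0/1 label alone. The sketch below uses scikit-learn's `predict_proba` on synthetic data; the `train` and `new_patients` frames, the `DiabeticProbability` column name and the chosen feature pair are hypothetical and are not the model fitted above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic training data (illustration only)
rng = np.random.default_rng(1)
n = 300
train = pd.DataFrame({
    "Glucose": rng.normal(120, 30, n),
    "Insulin": rng.normal(80, 40, n),
})
train["Outcome"] = (train["Glucose"] + rng.normal(0, 15, n) > 130).astype(int)

features = ["Insulin", "Glucose"]
model = LogisticRegression(max_iter=1000).fit(train[features], train["Outcome"])

# Hypothetical new patients without a known outcome
new_patients = pd.DataFrame({"Insulin": [80, 80, 80], "Glucose": [90, 130, 180]})

# Column 1 of predict_proba holds the probability of class 1 (diabetic)
new_patients["DiabeticProbability"] = model.predict_proba(new_patients[features])[:, 1]

# Highest predicted probability first
ranked = new_patients.sort_values("DiabeticProbability", ascending=False)
print(ranked)
```

Sorting by probability makes it possible to show the top rows of the list first, instead of an arbitrary sample of the patients labelled 1.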