Predicting Diabetes Probability With Logistic Regression

Logistic regression is a statistical method for predicting categorical outcomes. In this scenario, we use it to predict whether the patients in our dataset will develop diabetes, given their health information. The same method can be applied to sales (for example, whether a customer will purchase a certain product) or to human resources (for example, whether an employee will leave the company).
This dataset is originally from the US National Institute of Diabetes and Digestive and Kidney Diseases. Data was obtained from kaggle.com.
The objective of this article is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. All patients here are females at least 21 years old of Native American Indian heritage.
The dataset consists of several medical predictor variables and one target variable, Outcome (where Outcome = 1 indicates that the patient has a high probability of developing diabetes and Outcome = 0 indicates otherwise). Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Getting Started

The first step is to import the libraries used for analysis and charting. The pandas and numpy libraries are used for data analysis, while seaborn and matplotlib are used for plotting charts.

In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
sns.set()

After importing our libraries, we need to prepare our dataset by importing it from a csv file.

Once the dataset is loaded, we can spot-check it with "df.sample(20)", which shows 20 randomly selected rows.

In [2]:
df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/1_Diabetic/datasets-228-482-diabetes.csv')
In [3]:
df.sample(20)
Out[3]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
427 1 181 64 30 180 34.1 0.328 38 1
129 0 105 84 0 0 27.9 0.741 62 1
633 1 128 82 17 183 27.5 0.115 22 0
761 9 170 74 31 0 44.0 0.403 43 1
635 13 104 72 0 0 31.2 0.465 38 1
273 1 71 78 50 45 33.2 0.422 21 0
479 4 132 86 31 0 28.0 0.419 63 0
457 5 86 68 28 71 30.2 0.364 24 0
616 6 117 96 0 0 28.7 0.157 30 0
400 4 95 64 0 0 32.0 0.161 31 1
698 4 127 88 11 155 34.5 0.598 28 0
256 3 111 56 39 0 30.1 0.557 30 0
732 2 174 88 37 120 44.5 0.646 24 1
464 10 115 98 0 0 24.0 1.022 34 0
331 2 87 58 16 52 32.7 0.166 25 0
515 3 163 70 18 105 31.6 0.268 28 1
429 1 95 82 25 180 35.0 0.233 43 1
317 3 182 74 0 0 30.5 0.345 29 1
741 3 102 44 20 94 30.8 0.400 26 0
52 5 88 66 21 23 24.4 0.342 30 0

Next, we can perform general checks on the data from the csv file. There are 768 rows in our dataset, made up of integer and decimal (floating-point) columns. No null values were found in any row or column (i.e. no values are recorded as missing).

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

We can also produce a high-level summary of our data using "df.describe()". As noted earlier, there are 768 records in our dataset. The mean age is about 33 (median 29), so the patients skew fairly young, and the mean glucose level is 120.89 mg/dL, which is below WHO's diabetic range of >126 mg/dL.

The standard deviation and quartiles tell us how dispersed the values in each column are compared to the average. A low standard deviation means that most of the numbers are close to the average, while a high standard deviation means that the numbers are more spread out.
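
As a small illustration of that point, the sketch below compares two made-up lists of readings that share the same mean but differ in spread. The values are hypothetical and are not taken from the dataset; `statistics.stdev` computes the sample standard deviation, which is the same convention "df.describe()" uses.

```python
import statistics

# Two hypothetical sets of readings with the same mean (100) but different spread.
tight = [98, 99, 100, 101, 102]
wide = [60, 80, 100, 120, 140]

for name, values in [("tight", tight), ("wide", wide)]:
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    print(f"{name}: mean={mean}, std={std:.2f}")
```

The "wide" list has a much larger standard deviation even though both lists average 100.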

In [5]:
df.describe()
Out[5]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
count 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000 768.000000
mean 3.845052 120.894531 69.105469 20.536458 79.799479 31.992578 0.471876 33.240885 0.348958
std 3.369578 31.972618 19.355807 15.952218 115.244002 7.884160 0.331329 11.760232 0.476951
min 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.078000 21.000000 0.000000
25% 1.000000 99.000000 62.000000 0.000000 0.000000 27.300000 0.243750 24.000000 0.000000
50% 3.000000 117.000000 72.000000 23.000000 30.500000 32.000000 0.372500 29.000000 0.000000
75% 6.000000 140.250000 80.000000 32.000000 127.250000 36.600000 0.626250 41.000000 1.000000
max 17.000000 199.000000 122.000000 99.000000 846.000000 67.100000 2.420000 81.000000 1.000000
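
One detail in the summary is worth a second look: the minimum of Glucose, BloodPressure, SkinThickness, Insulin and BMI is 0. It may be that some zeros stand in for measurements that were not taken, although that is an assumption and not something the dataset states here. A minimal sketch for counting zeros per column is shown below; `count_zeros` is a hypothetical helper, and the small frame holds five rows copied from the sample output earlier so the snippet runs on its own.

```python
import pandas as pd

# Five rows copied from the df.sample(20) output above, used only as a stand-in.
# With the full data loaded, pass `df` instead of `sample`.
sample = pd.DataFrame(
    {
        "Glucose": [181, 105, 128, 170, 104],
        "BloodPressure": [64, 84, 82, 74, 72],
        "SkinThickness": [30, 0, 17, 31, 0],
        "Insulin": [180, 0, 183, 0, 0],
        "BMI": [34.1, 27.9, 27.5, 44.0, 31.2],
    }
)


def count_zeros(frame, columns):
    """Return the number of zero entries in each of the given columns."""
    return (frame[columns] == 0).sum()


suspect_columns = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
zero_counts = count_zeros(sample, suspect_columns)
print(zero_counts)
```

If the counts on the full dataset turn out to be large, it would be reasonable to decide how to treat those rows before modelling.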

Data Analysis and Data Plots

Correlation

To further understand the relationship between each of the variables in our data, an important step is to find out the correlation between the variables. This will help us understand the impact that each variable has on one another.

We will do this by plotting a heatmap using the seaborn and matplotlib libraries imported earlier. The heatmap shows whether each pair of variables is positively or inversely related and how strong that relationship is.

Observations :
1) Age and number of pregnancies are highly correlated. This is very relevant in today's world, as more women are bearing children at much later stages of their lives.
2) Blood glucose level is strongly associated with diabetes.

In [7]:
sns.set(rc={'figure.figsize':(10,8)})
sns.set_context("talk", font_scale=0.8)
sns.heatmap(df.corr(), cmap='Blues', annot=True)
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a22811850>
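
The heatmap can also be read off numerically. The sketch below ranks the predictors by the absolute value of their correlation with Outcome. `rank_by_outcome_correlation` is a hypothetical helper, and the frame contains only eight rows and three predictors copied from the sample output earlier, which is far too few for the numbers to mean anything; it is there so the snippet runs on its own.

```python
import pandas as pd

# Eight rows copied from the df.sample(20) output above (stand-in data only).
# With the full data loaded, pass `df` instead of `sample`.
sample = pd.DataFrame(
    {
        "Pregnancies": [1, 0, 1, 9, 13, 1, 4, 5],
        "Glucose": [181, 105, 128, 170, 104, 71, 132, 86],
        "Age": [38, 62, 22, 43, 38, 21, 63, 24],
        "Outcome": [1, 1, 0, 1, 1, 0, 0, 0],
    }
)


def rank_by_outcome_correlation(frame, target="Outcome"):
    """Return each predictor's correlation with the target, strongest first."""
    corr = frame.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


ranked = rank_by_outcome_correlation(sample)
print(ranked)
```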

Number of Diabetic Patients

Next, we will plot a graph to look at the number of diabetic patients in our current dataset.
Observations : 268 patients are diagnosed with diabetes, while 500 are not.

In [8]:
sns.set(rc={'figure.figsize':(6,6)})

ax = sns.countplot(x='Outcome', data=df)  # one bar per Outcome value (0 or 1)

# Label each bar with its count
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '{0:.0f}'.format(p.get_height()),
        fontsize=12, color='black', ha='center', va='bottom')

plt.title("Women With Diabetes")
plt.xlabel('Diabetic')
plt.ylabel('Number of Patients')
Out[8]:
Text(0, 0.5, 'Number of Patients')
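
The class counts also give a useful reference point: a trivial model that always predicts "not diabetic" would already be right for every non-diabetic patient. The arithmetic below uses only the two counts from the observation above; the variable names are illustrative.

```python
# Class counts reported in the observation above.
n_diabetic = 268
n_non_diabetic = 500
total = n_diabetic + n_non_diabetic

share_diabetic = n_diabetic / total
# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(n_diabetic, n_non_diabetic) / total

print(f"Share diabetic: {share_diabetic:.3f}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any accuracy score is easier to interpret when compared against this baseline of roughly 0.651.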

Age Analysis

Let's look at the age of all the women. We plot a graph to analyse the number of diabetic patients by age in our current dataset.
Observations :
1) There is a high number of diabetic patients from the mid-twenties to early forties.
2) Non-diabetic patients are mostly in their early twenties. This could be because fewer women in this age group have been pregnant (as noted earlier, age and pregnancies have a positive relationship).

In [9]:
sns.set(rc={'figure.figsize':(16,8)})
sns.countplot(x="Age", hue="Outcome", data=df)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a245c0b10>

For subsequent analyses, we will look at data for diabetic patients only, so we filter the dataset to diabetic patients.

In [10]:
newdf = df[(df.Outcome == 1)]
newdf
Out[10]:
Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Outcome
0 6 148 72 35 0 33.6 0.627 50 1
2 8 183 64 0 0 23.3 0.672 32 1
4 0 137 40 35 168 43.1 2.288 33 1
6 3 78 50 32 88 31.0 0.248 26 1
8 2 197 70 45 543 30.5 0.158 53 1
... ... ... ... ... ... ... ... ... ...
755 1 128 88 39 110 36.5 1.057 37 1
757 0 123 72 0 0 36.3 0.258 52 1
759 6 190 92 0 0 35.5 0.278 66 1
761 9 170 74 31 0 44.0 0.403 43 1
766 1 126 60 0 0 30.1 0.349 47 1

268 rows × 9 columns

Next, we plot a scatterplot to analyse the relationship between the number of pregnancies and the age of women with diabetes.
Observations : Similar to the observation above, more of the women diagnosed with diabetes who have been pregnant are between their mid-twenties and early forties.

In [11]:
sns.set(rc={'figure.figsize':(10,6)})
sns.scatterplot(x = 'Age', y = 'Pregnancies', data = newdf)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a24719990>

Let's look at the age of diabetic patients by plotting a histogram of patient age.
Observations : The highest number of patients are in the late-twenties age group.

In [13]:
sns.set(rc={'figure.figsize':(6,6)})
sns.histplot(newdf['Age'], bins=18, edgecolor='black')  # histplot replaces the deprecated distplot
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2592a310>

What are the glucose levels of the diabetic patients? We plot the diabetic patients' glucose levels as follows.
Observations : The majority of patients' glucose levels are within the range of 125-150 mg/dL. WHO's definition of a diabetic person is 126 mg/dL and above.

In [14]:
sns.histplot(newdf['Glucose'], edgecolor='black')  # histplot replaces the deprecated distplot
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a25d5dd10>

How many pregnancies have the patients had? We plot the number of pregnancies among diabetic patients.
Observations : The majority of patients have had 2 pregnancies or fewer.

In [15]:
sns.histplot(newdf['Pregnancies'], bins=8, edgecolor='black')  # histplot replaces the deprecated distplot
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a25fc5a10>

The analyses above have given us an idea of how the variables in the dataset relate to being diabetic. However, to model these relationships more rigorously, we can use logistic regression, which is suited to predicting binary outcomes such as yes or no results.
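
To make the idea concrete, logistic regression passes a weighted sum of the predictors through the logistic (sigmoid) function, which maps any number to a probability between 0 and 1. The sketch below shows that calculation by hand. The coefficient and intercept are hypothetical values chosen only for illustration; they are not fitted to this dataset.

```python
import math


def logistic(z):
    """Map a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, coefficients, intercept):
    """Probability of the positive class for one observation."""
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return logistic(z)


# Hypothetical single-feature model: one coefficient applied to a glucose reading.
p = predict_probability(features=[150.0], coefficients=[0.04], intercept=-5.0)
label = int(p >= 0.5)  # a probability of 0.5 or more is classed as 1
print(f"probability={p:.3f}, label={label}")
```

The 0.5 cut-off is the usual default for turning a probability into a yes or no answer.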

Logistic Regression

Let's run a logistic regression model and check which factors influence the probability of diabetes. We will use the scikit-learn library for this purpose, and its accuracy score to check each model's prediction accuracy. Note that accuracy here is measured on the same rows used to fit the model.
1) Age & Pregnancies

In [16]:
from sklearn.linear_model import LogisticRegression
columns = ['Age','Pregnancies']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
Out[16]:
0.6627604166666666

2) Insulin & Glucose

In [17]:
from sklearn.linear_model import LogisticRegression
columns = ['Insulin','Glucose']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
Out[17]:
0.74609375

3) Skin Thickness & Glucose

In [18]:
from sklearn.linear_model import LogisticRegression
columns = ['SkinThickness','Glucose']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
Out[18]:
0.7395833333333334

4) Age & Pregnancies

In [19]:
from sklearn.linear_model import LogisticRegression
columns = ['Age','Pregnancies']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)
Out[19]:
0.6627604166666666
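
The scores above are computed on the same rows the models were fitted on. A common alternative is to hold back part of the data and score on rows the model has not seen. The sketch below shows one way to do that with scikit-learn's `train_test_split`. `holdout_accuracy` is a hypothetical helper, and the data frame is synthetic (generated so that the outcome depends on glucose only) purely so the snippet runs without the csv file.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def holdout_accuracy(frame, feature_columns, target="Outcome", test_size=0.25, seed=0):
    """Fit on a training split and score on rows the model has not seen."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame[feature_columns],
        frame[target],
        test_size=test_size,
        random_state=seed,
        stratify=frame[target],  # keep the class ratio similar in both splits
    )
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in data, NOT the real dataset.
rng = np.random.default_rng(0)
n = 400
glucose = rng.normal(120, 30, n)
age = rng.integers(21, 80, n)
# By construction, the outcome depends on glucose (plus noise) and not on age.
outcome = (glucose + rng.normal(0, 15, n) > 130).astype(int)
demo = pd.DataFrame({"Glucose": glucose, "Age": age, "Outcome": outcome})

for cols in (["Glucose"], ["Age"]):
    print(cols, round(holdout_accuracy(demo, cols), 3))
```

With the real data loaded, the same helper could be called as `holdout_accuracy(df, ['Insulin', 'Glucose'])` and so on for each pair of columns.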

Conclusion

1) Women with high insulin and glucose levels are highly likely to develop diabetes (this pair gave the highest accuracy, 0.746).
2) Women with high skin thickness and high glucose are also highly likely to develop diabetes (accuracy 0.740).
-- Insulin level and skin thickness are symptoms to look out for in potential diabetic patients.

Prediction

After identifying the factors which can lead to diabetes, we can now use logistic regression to predict which other patients in our dataset are highly likely to develop diabetes. We remove the "Outcome" column from our original dataset (akin to initialising our data) and insert the predicted answers into a newly created column called "Diabetic". Note that the code below reuses the "columns" and "logmodel" variables from the most recently run cell, In [19], so the predictions come from the Age & Pregnancies model.
We also filter the results to show only the patients predicted to be diabetic.

In [20]:
predict = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/1_Diabetic/datasets_diabetes_prediction.csv')
X_predict = predict[columns]
predictions = logmodel.predict(X_predict)
predict['Diabetic'] = predictions
predict

is_1=predict['Diabetic']==1
predict_1=predict[is_1]
predict_1
Out[20]:
ID Pregnancies Glucose BloodPressure SkinThickness Insulin BMI DiabetesPedigreeFunction Age Diabetic
0 1 6 148 72 35 0 33.6 0.627 50 1
9 10 8 125 96 0 0 0.0 0.232 54 1
12 13 10 139 80 0 0 27.1 1.441 57 1
21 22 8 99 84 0 0 35.4 0.388 50 1
24 25 11 143 94 33 146 36.6 0.254 51 1
... ... ... ... ... ... ... ... ... ... ...
749 750 6 162 62 0 0 24.3 0.178 50 1
754 755 8 154 78 32 0 32.4 0.443 45 1
759 760 6 190 92 0 0 35.5 0.278 66 1
761 762 9 170 74 31 0 44.0 0.403 43 1
763 764 10 101 76 48 180 32.9 0.171 63 1

121 rows × 10 columns

The table above lists the patients predicted as highly likely to develop diabetes (only a sample of the final results is shown here as the list is too long).
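
The "Diabetic" column holds hard 0/1 labels. If a probability per patient is wanted instead, scikit-learn's `LogisticRegression` also provides `predict_proba`. The sketch below shows the call on a tiny, made-up training set with two columns standing in for age and number of pregnancies; the values are hypothetical and not from the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical training set: columns are [age, number of pregnancies].
X = np.array([[21, 0], [24, 1], [29, 2], [33, 3], [41, 5], [50, 6], [57, 8], [63, 10]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression(max_iter=1000).fit(X, y)

new_patients = np.array([[25, 1], [55, 7]])
# predict_proba returns one column per class; column 1 is the probability of class 1.
proba = model.predict_proba(new_patients)[:, 1]
labels = model.predict(new_patients)

for row, p, label in zip(new_patients, proba, labels):
    print(f"age={row[0]}, pregnancies={row[1]}: probability={p:.2f}, label={label}")
```

Keeping the probability alongside the label makes it possible to sort patients by predicted risk or to choose a cut-off other than 0.5.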