Predicting Diabetes Probability With Logistic Regression

Logistic regression is a statistical method for predicting categorical outcomes. In this scenario, we use it to predict whether the patients in our dataset will develop diabetes, given their health information. The same method can also be used for sales prediction (e.g. will a customer purchase a certain product) or in human resources (e.g. will an employee leave the company).

This dataset is originally from the US National Institute of Diabetes and Digestive and Kidney Diseases. The data was obtained from kaggle.com.

The objective of this article is to predict whether or not a patient has diabetes, based on diagnostic measurements included in the dataset. All patients here are females at least 21 years old of Native American Indian heritage.

The dataset consists of several medical predictor variables and one target variable, Outcome (where Outcome = 1 indicates that the patient has a high probability of developing diabetes and Outcome = 0 indicates otherwise). Predictor variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.

Getting Started

The first step is to import the libraries used for analysis and charting. The pandas and numpy libraries are used for data analysis, while seaborn and matplotlib are used for plotting charts.

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%config InlineBackend.figure_format = 'retina'
sns.set()

After importing our libraries, we need to prepare our dataset by importing the data from a csv file.

After loading our dataset, we can perform a sample check of our data by using "df.sample(20)" to view a random sample of 20 rows.

df = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/1_Diabetic/datasets-228-482-diabetes.csv')

df.sample(20)

Next, we can perform general checks on the data in the csv file. There are 768 rows of data in our dataset, comprising integers and decimal numbers. No null values were found in any row or column (i.e. no missing data in our dataset).

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
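Although df.info() reports no nulls, the summary statistics in this article show a minimum of 0 for Glucose, BloodPressure, SkinThickness, Insulin and BMI, which may indicate missing measurements recorded as 0. The following is a minimal sketch of how such zeros could be counted. It uses a small hand-typed frame (five rows copied from the diabetic-patient table in this article) in place of the full csv, and the helper name count_zeros is hypothetical.

```python
import pandas as pd

# Five rows copied from the diabetic-patient table in this article.
# With the full dataset, this frame would come from pd.read_csv(...) instead.
sample = pd.DataFrame({
    'Glucose': [148, 183, 137, 78, 197],
    'BloodPressure': [72, 64, 40, 50, 70],
    'SkinThickness': [35, 0, 35, 32, 45],
    'Insulin': [0, 0, 168, 88, 543],
    'BMI': [33.6, 23.3, 43.1, 31.0, 30.5],
})

def count_zeros(frame):
    """Return the number of zero entries in each column (hypothetical helper)."""
    return (frame == 0).sum()

zero_counts = count_zeros(sample)
print(zero_counts)
```

Whether a zero should be treated as missing is a judgment call that depends on the column; a BMI of 0 is not physically plausible, whereas 0 pregnancies is a valid value.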

We can also perform a high-level summary check of our data using "df.describe()". As noted earlier, there are 768 records in our dataset. By looking at the mean, we can see that the average patient in our data is about 33 years old and has a glucose level of 120.89 mg/dL, which is below WHO's fasting threshold of 126 mg/dL and above for diabetes.

The standard deviation and quartiles show how dispersed the data in each column is relative to the average. A low standard deviation means that most of the numbers are close to the average, while a high standard deviation means that the numbers are more spread out.

df.describe()
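To illustrate the point about standard deviation, here is a small sketch with two made-up lists that share the same mean but differ in spread. The values are invented for illustration and are not taken from the dataset. It uses the sample standard deviation, which is also what pandas reports in describe().

```python
from statistics import mean, stdev

# Two made-up lists with the same mean (120) but different spread.
tight = [118, 119, 120, 121, 122]
spread = [80, 100, 120, 140, 160]

tight_mean, spread_mean = mean(tight), mean(spread)
tight_std, spread_std = stdev(tight), stdev(spread)

print('means:', tight_mean, spread_mean)
print('standard deviations:', round(tight_std, 2), round(spread_std, 2))
```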

Data Analysis and Data Plots

Correlation

To further understand the relationship between the variables in our data, an important step is to find the correlation between them. This will help us understand how strongly each variable moves with the others.

We will do this by plotting a heatmap using the seaborn and matplotlib libraries imported earlier. For each pair of variables, we can observe whether they are positively or inversely related and how strong that relationship is.

Observations:
1) Age and pregnancies are highly correlated. This is very relevant in today's world as more women are bearing children at much later stages of their lives.
2) Blood glucose level is strongly associated with the diabetes outcome.

sns.set(rc={'figure.figsize':(10,8)})
sns.set_context("talk", font_scale=0.8)
sns.heatmap(df.corr(), cmap='Blues', annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a22811850>
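Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of df.corr(). As a rough sketch of what that number measures, the coefficient can be computed by hand. The function name pearson is hypothetical, and the two lists hold the Age and Pregnancies values of the 20 sampled rows shown in this article, so the result describes that small sample only, not the full dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Age and Pregnancies from the 20 sampled rows shown in this article.
ages = [38, 62, 22, 43, 38, 21, 63, 24, 30, 31,
        28, 30, 24, 34, 25, 28, 43, 29, 26, 30]
pregnancies = [1, 0, 1, 9, 13, 1, 4, 5, 6, 4,
               4, 3, 2, 10, 2, 3, 1, 3, 3, 5]

r = pearson(ages, pregnancies)
print(round(r, 3))
```

A value near 1 means the two variables rise together, a value near -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.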

Number of Diabetic Patients

Next, we will plot a graph to look at the number of diabetic patients in our current dataset.
Observations: 268 patients were diagnosed with diabetes while 500 were not.

sns.set(rc={'figure.figsize':(6,6)})

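# Pass the column by name. Combining a positional series with hue="Outcome"
# can create empty bars with NaN heights, which makes the text labels below
# fail with "posx and posy should be finite values".
ax = sns.countplot(x='Outcome', data=df)

# Label each bar with its patient count.
for p in ax.patches:
    ax.text(p.get_x() + p.get_width()/2., p.get_height(), '{0:.0f}'.format(p.get_height()),
        fontsize=12, color='black', ha='center', va='bottom')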


plt.title("Women With Diabetes")
plt.xlabel('Diabetic')
plt.ylabel('Number of Patients')

Text(0, 0.5, 'Number of Patients')
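The class counts above also give a useful reference point for the accuracy scores reported later in this article: a model that always predicts the majority class (non-diabetic) would already be right for 500 of the 768 patients. A short sketch of that arithmetic follows; the variable names are illustrative.

```python
# Class counts taken from the observation above.
n_diabetic = 268
n_non_diabetic = 500
total = n_diabetic + n_non_diabetic

# Accuracy of always predicting the larger class.
baseline_accuracy = max(n_diabetic, n_non_diabetic) / total

# Accuracy reported later in this article for the Age & Pregnancies model.
model_accuracy = 0.6627604166666666
improvement = model_accuracy - baseline_accuracy

print('baseline accuracy:', round(baseline_accuracy, 4))
print('improvement over baseline:', round(improvement, 4))
```

Comparing each model against this baseline shows how much the predictors actually add beyond the class imbalance.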


Age Analysis

Let's look at the age of all women. We plot a graph to analyse the number of diabetic patients by age in our current dataset.
Observations:
1) There is a high number of diabetic patients from the mid-twenties to the early forties.
2) Most non-diabetic patients are in their early twenties. This could be due to fewer pregnancies in this age group (as noted earlier, age and pregnancies have a positive relationship).

sns.set(rc={'figure.figsize':(16,8)})
sns.countplot(x="Age", hue="Outcome", data=df)

<matplotlib.axes._subplots.AxesSubplot at 0x1a245c0b10>

For subsequent analyses, we will look at data for diabetic patients only, so we filter the dataset to diabetic patients.

newdf = df[(df.Outcome == 1)]
newdf

Next, we plot a scatterplot to analyse the relationship between the number of pregnancies of women with diabetes and their age.
Observations: Similar to the observation above, most of the women diagnosed with diabetes who have been pregnant are between their mid-twenties and early forties.

sns.set(rc={'figure.figsize':(10,6)})
sns.scatterplot(x = 'Age', y = 'Pregnancies', data = newdf)

<matplotlib.axes._subplots.AxesSubplot at 0x1a24719990>

Let's look at the age of diabetic patients. We plot a histogram based on the patients' age.
Observations: The highest number of patients is in the late twenties age group.

sns.set(rc={'figure.figsize':(6,6)})
# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(newdf['Age'], bins=int(180/10), edgecolor='black')

<matplotlib.axes._subplots.AxesSubplot at 0x1a2592a310>

What are the glucose levels of the diabetic patients? We plot the diabetic patients' glucose levels as follows.
Observations: The majority of patients' glucose levels are within the range of 125-150 mg/dL. WHO's fasting threshold for a diabetic person is 126 mg/dL and above.

# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(newdf['Glucose'], edgecolor='black')

<matplotlib.axes._subplots.AxesSubplot at 0x1a25d5dd10>

How many pregnancies have the diabetic patients had? We plot the number of pregnancies for diabetic patients.
Observations: The majority of patients have had 2 pregnancies or fewer.

# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(newdf['Pregnancies'], bins=int(500/60), edgecolor='black')

<matplotlib.axes._subplots.AxesSubplot at 0x1a25fc5a10>

The analyses above have given us an idea of how the variables in the dataset relate to being diabetic. However, to predict the outcome more accurately from our variables, we can make use of logistic regression, which is useful for predicting outcomes of a binary nature, e.g. yes or no results.
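To make the idea concrete, logistic regression computes a weighted sum of the input variables and passes it through the logistic (sigmoid) function, which squeezes any number into a probability between 0 and 1. The sketch below shows only that calculation. The function names and the coefficient values are hypothetical and chosen for illustration; they are not the coefficients fitted in this article.

```python
from math import exp

def logistic(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + exp(-z))

def predict_probability(features, weights, intercept):
    """Weighted sum of the features, passed through the logistic function."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return logistic(z)

# Hypothetical coefficients for (Age, Pregnancies), for illustration only.
weights = [0.03, 0.1]
intercept = -2.0

# A hypothetical patient aged 50 with 6 pregnancies.
probability = predict_probability([50, 6], weights, intercept)

# A probability of 0.5 or more is turned into the label 1 (diabetic).
label = 1 if probability >= 0.5 else 0
print(round(probability, 3), label)
```

Fitting the model means finding the weights and intercept that best match the observed outcomes, which is what scikit-learn does in the next section.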

Logistic Regression

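Let's run a logistic regression model and check which factors influence diabetic probability. We will use the scikit-learn library for this purpose, fitting a model on each pair of variables below and checking its prediction accuracy. Note that the accuracy here is measured on the same data the model was fitted on.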
1) Age & Pregnancies

from sklearn.linear_model import LogisticRegression
columns = ['Age','Pregnancies']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)

0.6627604166666666

2) Insulin & Glucose

from sklearn.linear_model import LogisticRegression
columns = ['Insulin','Glucose']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)

0.74609375

3) Skin Thickness & Glucose

from sklearn.linear_model import LogisticRegression
columns = ['SkinThickness','Glucose']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)

0.7395833333333334

4) Age & Pregnancies (a repeat of 1)

from sklearn.linear_model import LogisticRegression
columns = ['Age','Pregnancies']
X = df[columns]
y = df['Outcome'] 

from sklearn.metrics import accuracy_score
logmodel = LogisticRegression()
logmodel.fit(X,y)
predictions = logmodel.predict(X)
accuracy_score(y,predictions)

0.6627604166666666
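The accuracy scores above are computed on the same rows the models were fitted on, which tends to be optimistic. A common alternative is to hold out part of the data for testing. The sketch below shows that pattern with scikit-learn's train_test_split. It runs on synthetic data generated with numpy, since the csv is not bundled here, so the numbers it prints say nothing about the diabetes dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for two predictor columns and a binary outcome.
rng = np.random.default_rng(0)
n_rows = 400
X = rng.normal(size=(n_rows, 2))
true_probability = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 0.5 * X[:, 1])))
y = (rng.random(n_rows) < true_probability).astype(int)

# Keep 25% of the rows aside; stratify keeps the class balance similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

model = LogisticRegression()
model.fit(X_train, y_train)

train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print('train accuracy:', round(train_accuracy, 3))
print('test accuracy:', round(test_accuracy, 3))
```

With the real data, X and y would be df[columns] and df['Outcome'] as in the cells above, and the test accuracy would be the fairer number to compare across variable pairs.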

Conclusion

1) Women with high insulin and glucose levels are highly likely to develop diabetes.
2) Women with greater skin thickness and high glucose are also highly likely to develop diabetes.
-- Alongside glucose, insulin level and skin thickness are indicators to look out for in potential diabetic patients.

Prediction

After identifying the factors which can lead to diabetes, we can now use logistic regression to predict who else in our dataset is highly likely to develop diabetes. We will load a version of our original dataset with the "Outcome" column removed (akin to initialising our data) and insert the predicted answers into a newly created column called "Diabetic". Note that the code below reuses columns and logmodel as last defined above, i.e. the Age & Pregnancies model.
We will also filter the results to show only the patients predicted to be diabetic.

predict = pd.read_csv('~/Documents/1_DS_360/4_Porftolio/1_Diabetic/datasets_diabetes_prediction.csv')
X_predict = predict[columns]
predictions = logmodel.predict(X_predict)
predict['Diabetic'] = predictions
predict

is_1=predict['Diabetic']==1
predict_1=predict[is_1]
predict_1

The final table in this article shows the patients predicted to be highly likely to develop diabetes (only a sample of the final results is shown here as the list is too long).
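The predict method used above returns only the labels 0 or 1. To obtain a probability for each patient, scikit-learn's predict_proba can be used instead. The sketch below is fitted on just the Age, Pregnancies and Outcome values of the 20 sampled rows shown in this article, so its output is illustrative and will differ from a model fitted on all 768 rows. The three new patients are hypothetical.

```python
from sklearn.linear_model import LogisticRegression

# Age, Pregnancies and Outcome from the 20 sampled rows shown in this article.
ages = [38, 62, 22, 43, 38, 21, 63, 24, 30, 31,
        28, 30, 24, 34, 25, 28, 43, 29, 26, 30]
pregnancies = [1, 0, 1, 9, 13, 1, 4, 5, 6, 4,
               4, 3, 2, 10, 2, 3, 1, 3, 3, 5]
outcomes = [1, 1, 0, 1, 1, 0, 0, 0, 0, 1,
            0, 0, 1, 0, 0, 1, 1, 1, 0, 0]

# Same column order as columns = ['Age', 'Pregnancies'] above.
X_small = [[age, preg] for age, preg in zip(ages, pregnancies)]

model = LogisticRegression(max_iter=1000)
model.fit(X_small, outcomes)

# Hypothetical new patients as (Age, Pregnancies).
new_patients = [[50, 6], [22, 1], [63, 10]]

# Column 1 of predict_proba is the probability of class 1 (diabetic).
probabilities = model.predict_proba(new_patients)[:, 1]
labels = model.predict(new_patients)

for patient, prob, lab in zip(new_patients, probabilities, labels):
    print(patient, round(float(prob), 3), int(lab))
```

Reporting the probability alongside the label makes it possible to rank patients by risk rather than only splitting them into two groups.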

Output of df.sample(20) (rows are sampled at random):
	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
427	1	181	64	30	180	34.1	0.328	38	1
129	0	105	84	0	0	27.9	0.741	62	1
633	1	128	82	17	183	27.5	0.115	22	0
761	9	170	74	31	0	44.0	0.403	43	1
635	13	104	72	0	0	31.2	0.465	38	1
273	1	71	78	50	45	33.2	0.422	21	0
479	4	132	86	31	0	28.0	0.419	63	0
457	5	86	68	28	71	30.2	0.364	24	0
616	6	117	96	0	0	28.7	0.157	30	0
400	4	95	64	0	0	32.0	0.161	31	1
698	4	127	88	11	155	34.5	0.598	28	0
256	3	111	56	39	0	30.1	0.557	30	0
732	2	174	88	37	120	44.5	0.646	24	1
464	10	115	98	0	0	24.0	1.022	34	0
331	2	87	58	16	52	32.7	0.166	25	0
515	3	163	70	18	105	31.6	0.268	28	1
429	1	95	82	25	180	35.0	0.233	43	1
317	3	182	74	0	0	30.5	0.345	29	1
741	3	102	44	20	94	30.8	0.400	26	0
52	5	88	66	21	23	24.4	0.342	30	0

Output of df.describe():
	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
count	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000	768.000000
mean	3.845052	120.894531	69.105469	20.536458	79.799479	31.992578	0.471876	33.240885	0.348958
std	3.369578	31.972618	19.355807	15.952218	115.244002	7.884160	0.331329	11.760232	0.476951
min	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.078000	21.000000	0.000000
25%	1.000000	99.000000	62.000000	0.000000	0.000000	27.300000	0.243750	24.000000	0.000000
50%	3.000000	117.000000	72.000000	23.000000	30.500000	32.000000	0.372500	29.000000	0.000000
75%	6.000000	140.250000	80.000000	32.000000	127.250000	36.600000	0.626250	41.000000	1.000000
max	17.000000	199.000000	122.000000	99.000000	846.000000	67.100000	2.420000	81.000000	1.000000

Output of newdf (diabetic patients only):
	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
2	8	183	64	0	0	23.3	0.672	32	1
4	0	137	40	35	168	43.1	2.288	33	1
6	3	78	50	32	88	31.0	0.248	26	1
8	2	197	70	45	543	30.5	0.158	53	1
...	...	...	...	...	...	...	...	...	...
755	1	128	88	39	110	36.5	1.057	37	1
757	0	123	72	0	0	36.3	0.258	52	1
759	6	190	92	0	0	35.5	0.278	66	1
761	9	170	74	31	0	44.0	0.403	43	1
766	1	126	60	0	0	30.1	0.349	47	1

Output of predict_1 (patients predicted to be diabetic):
	ID	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Diabetic
0	1	6	148	72	35	0	33.6	0.627	50	1
9	10	8	125	96	0	0	0.0	0.232	54	1
12	13	10	139	80	0	0	27.1	1.441	57	1
21	22	8	99	84	0	0	35.4	0.388	50	1
24	25	11	143	94	33	146	36.6	0.254	51	1
...	...	...	...	...	...	...	...	...	...	...
749	750	6	162	62	0	0	24.3	0.178	50	1
754	755	8	154	78	32	0	32.4	0.443	45	1
759	760	6	190	92	0	0	35.5	0.278	66	1
761	762	9	170	74	31	0	44.0	0.403	43	1
763	764	10	101	76	48	180	32.9	0.171	63	1