Wednesday, July 8, 2020

Predicting Heart Disease with Machine Learning and Data Mining Models

Heart_Disease_Harvard_Dataset

Comparing Machine Learning Models for Heart Disease Prediction

This is an example of comparing machine learning models, together with an exploratory data analysis (EDA) of cardiovascular disease data. The dataset was downloaded from the University of California, Irvine (UCI) Machine Learning Repository.

First, we import all the libraries needed for the EDA and for the subsequent implementation of the machine learning algorithms.

In [38]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
In [2]:
url="https://raw.githubusercontent.com/jonathan-marin-pavia/Pandas/master/proc_heart_cleve_3_withheader.csv"
data_name="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names"
In [3]:
df=pd.read_csv(url)

Let's take a look at the first few rows of our dataset:

In [4]:
df.head()
Out[4]:
Disease Age Sex ind_typ_angina ind_atyp_angina ind_non_ang_pain resting_BP Serum_cholest blood_sugar_exc120 ind_for_ecg_1 ind_for_ecg_2 Max_heart_rate ind_exerc_angina ST_dep_by_exerc ind_for_slope_up_exerc ind_for_slope_down_exerc num_vessels_fluro Thal_rev_defect Thal_fixed_defect
0 -1 63 1 1 0 0 145 233 1 0 1 150 0 2.3 0 1 0 0 1
1 1 67 1 0 0 0 160 286 0 0 1 108 1 1.5 0 0 3 0 0
2 1 67 1 0 0 0 120 229 0 0 1 129 1 2.6 0 0 2 1 0
3 -1 37 1 0 0 1 130 250 0 0 0 187 0 3.5 0 1 0 0 0
4 -1 41 0 0 1 0 130 204 0 0 1 172 0 1.4 1 0 0 0 0

A quick look at the descriptive statistics follows:

In [42]:
df.describe()
Out[42]:
Disease Age Sex ind_typ_angina ind_atyp_angina ind_non_ang_pain resting_BP Serum_cholest blood_sugar_exc120 ind_for_ecg_1 ind_for_ecg_2 Max_heart_rate ind_exerc_angina ST_dep_by_exerc ind_for_slope_up_exerc ind_for_slope_down_exerc num_vessels_fluro Thal_rev_defect Thal_fixed_defect
count 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000
mean -0.076923 54.528428 0.675585 0.076923 0.163880 0.280936 131.668896 247.100334 0.147157 0.013378 0.491639 149.505017 0.327759 1.051839 0.468227 0.070234 0.672241 0.384615 0.060201
std 0.998709 9.020950 0.468941 0.266916 0.370787 0.450210 17.705668 51.914779 0.354856 0.115079 0.500768 22.954927 0.470183 1.163809 0.499826 0.255970 0.937438 0.487320 0.238257
min -1.000000 29.000000 0.000000 0.000000 0.000000 0.000000 94.000000 126.000000 0.000000 0.000000 0.000000 71.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% -1.000000 48.000000 0.000000 0.000000 0.000000 0.000000 120.000000 211.000000 0.000000 0.000000 0.000000 133.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
50% -1.000000 56.000000 1.000000 0.000000 0.000000 0.000000 130.000000 242.000000 0.000000 0.000000 0.000000 153.000000 0.000000 0.800000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 1.000000 61.000000 1.000000 0.000000 0.000000 1.000000 140.000000 275.500000 0.000000 0.000000 1.000000 165.500000 1.000000 1.600000 1.000000 0.000000 1.000000 1.000000 0.000000
max 1.000000 77.000000 1.000000 1.000000 1.000000 1.000000 200.000000 564.000000 1.000000 1.000000 1.000000 202.000000 1.000000 6.200000 1.000000 1.000000 3.000000 1.000000 1.000000

Now let's look at more detailed information about our dataset:

In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Disease                   299 non-null    int64  
 1   Age                       299 non-null    int64  
 2   Sex                       299 non-null    int64  
 3   ind_typ_angina            299 non-null    int64  
 4   ind_atyp_angina           299 non-null    int64  
 5   ind_non_ang_pain          299 non-null    int64  
 6   resting_BP                299 non-null    int64  
 7   Serum_cholest             299 non-null    int64  
 8   blood_sugar_exc120        299 non-null    int64  
 9   ind_for_ecg_1             299 non-null    int64  
 10  ind_for_ecg_2             299 non-null    int64  
 11  Max_heart_rate            299 non-null    int64  
 12  ind_exerc_angina          299 non-null    int64  
 13  ST_dep_by_exerc           299 non-null    float64
 14  ind_for_slope_up_exerc    299 non-null    int64  
 15  ind_for_slope_down_exerc  299 non-null    int64  
 16  num_vessels_fluro         299 non-null    int64  
 17  Thal_rev_defect           299 non-null    int64  
 18  Thal_fixed_defect         299 non-null    int64  
dtypes: float64(1), int64(18)
memory usage: 44.5 KB
In [7]:
df.shape
Out[7]:
(299, 19)
In [8]:
df.isnull().sum()
# df.isnull().values.any()  # Another way to check; the dataset has no missing values
Out[8]:
Disease                     0
Age                         0
Sex                         0
ind_typ_angina              0
ind_atyp_angina             0
ind_non_ang_pain            0
resting_BP                  0
Serum_cholest               0
blood_sugar_exc120          0
ind_for_ecg_1               0
ind_for_ecg_2               0
Max_heart_rate              0
ind_exerc_angina            0
ST_dep_by_exerc             0
ind_for_slope_up_exerc      0
ind_for_slope_down_exerc    0
num_vessels_fluro           0
Thal_rev_defect             0
Thal_fixed_defect           0
dtype: int64

We define X and y:

In [28]:
X=df.iloc[:,1:]
y=df["Disease"]
In [29]:
print(X.shape)
print(y.shape)
(299, 18)
(299,)

We now have a feature matrix X with 299 records and 18 features, and a target vector y with 299 instances.

Number of patients with and without cardiovascular disease:

In [10]:
df["Disease"].value_counts()
Out[10]:
-1    161
 1    138
Name: Disease, dtype: int64

There are 138 patients with heart disease. The distribution is visualized below.
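Before modeling, it helps to see the class balance as proportions rather than raw counts; a minimal sketch, using a hypothetical stand-in series in place of the real `df["Disease"]` column:

```python
import pandas as pd

# Hypothetical stand-in for df["Disease"]: -1 = no disease, 1 = disease
disease = pd.Series([-1, -1, -1, 1, 1])

# normalize=True turns raw counts into class proportions,
# a quick check for class imbalance before modeling
proportions = disease.value_counts(normalize=True)
print(proportions)
```

On the real column this would show the 161/138 split as roughly 54% versus 46%, i.e. a fairly balanced problem.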

In [12]:
sns.countplot(x="Disease", data=df, palette="RdBu_r")
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x180aa7f56c8>

Number of patients with and without cardiovascular disease, by age.

In [15]:
plt.figure(figsize=(15,10))
sns.countplot(x="Age", hue="Disease", data=df, palette="RdBu_r")
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x180aade3f08>

Now we analyze the correlations between each pair of attributes.

In [16]:
df.corr()
Out[16]:
Disease Age Sex ind_typ_angina ind_atyp_angina ind_non_ang_pain resting_BP Serum_cholest blood_sugar_exc120 ind_for_ecg_1 ind_for_ecg_2 Max_heart_rate ind_exerc_angina ST_dep_by_exerc ind_for_slope_up_exerc ind_for_slope_down_exerc num_vessels_fluro Thal_rev_defect Thal_fixed_defect
Disease 1.000000 0.225775 0.283300 -0.091024 -0.246763 -0.310013 0.152840 0.078722 0.013111 0.067379 0.149680 -0.415031 0.425476 0.424672 -0.384730 0.060585 0.460442 0.481585 0.104142
Age 0.225775 1.000000 -0.091813 0.042989 -0.162418 -0.043286 0.290696 0.203377 0.128676 0.083677 0.139149 -0.392342 0.095108 0.197376 -0.183068 0.026018 0.362605 0.104754 0.060092
Sex 0.283300 -0.091813 1.000000 0.092803 -0.040600 -0.123170 -0.065521 -0.195907 0.045862 -0.105856 0.038425 -0.052064 0.149038 0.110237 -0.022648 0.050676 0.093185 0.327572 0.145351
ind_typ_angina -0.091024 0.042989 0.092803 1.000000 -0.127802 -0.180439 0.150260 -0.055531 0.057231 -0.033615 0.067592 0.081268 -0.094614 0.084343 -0.044501 0.068006 -0.059834 -0.021830 0.032472
ind_atyp_angina -0.246763 -0.162418 -0.040600 -0.127802 1.000000 -0.276725 -0.080136 -0.015501 -0.056382 -0.051552 -0.091995 0.256764 -0.232138 -0.281040 0.236417 -0.050966 -0.153886 -0.201429 -0.036080
ind_non_ang_pain -0.310013 -0.043286 -0.123170 -0.180439 -0.276725 1.000000 -0.055227 -0.024900 0.097436 -0.008015 -0.078853 0.150852 -0.262072 -0.121395 0.099450 -0.026198 -0.138891 -0.157658 -0.095631
resting_BP 0.152840 0.290696 -0.065521 0.150260 -0.080136 -0.055227 1.000000 0.132284 0.177623 0.058177 0.141046 -0.048053 0.065885 0.191615 -0.087458 0.121396 0.098773 0.110482 0.075538
Serum_cholest 0.078722 0.203377 -0.195907 -0.055531 -0.015501 -0.024900 0.132284 1.000000 0.006664 0.032914 0.160090 0.002179 0.056388 0.040431 -0.014490 -0.050027 0.119000 0.050995 -0.098157
blood_sugar_exc120 0.013111 0.128676 0.045862 0.057231 -0.056382 0.097436 0.177623 0.006664 1.000000 -0.048370 0.063599 -0.003387 0.011637 0.009093 -0.011390 0.107496 0.145478 0.020898 0.093319
ind_for_ecg_1 0.067379 0.083677 -0.105856 -0.033615 -0.051552 -0.008015 0.058177 0.032914 -0.048370 1.000000 -0.114513 -0.120705 0.042728 0.167688 -0.109266 0.081915 0.040781 -0.032220 0.092917
ind_for_ecg_2 0.149680 0.139149 0.038425 0.067592 -0.091995 -0.078853 0.141046 0.160090 0.063599 -0.114513 1.000000 -0.063417 0.068687 0.090282 -0.118375 0.043866 0.122813 0.006347 0.032359
Max_heart_rate -0.415031 -0.392342 -0.052064 0.081268 0.256764 0.150852 -0.048053 0.002179 -0.003387 -0.120705 -0.063417 1.000000 -0.376359 -0.341262 0.442894 -0.055172 -0.264246 -0.209410 -0.158969
ind_exerc_angina 0.425476 0.095108 0.149038 -0.094614 -0.232138 -0.262072 0.065885 0.056388 0.011637 0.042728 0.068687 -0.376359 1.000000 0.289573 -0.283956 0.059028 0.145570 0.297415 0.062916
ST_dep_by_exerc 0.424672 0.197376 0.110237 0.084343 -0.281040 -0.121395 0.191615 0.040431 0.009093 0.167688 0.090282 -0.341262 0.289573 1.000000 -0.514906 0.393261 0.295832 0.306719 0.102466
ind_for_slope_up_exerc -0.384730 -0.183068 -0.022648 -0.044501 0.236417 0.099450 -0.087458 -0.014490 -0.011390 -0.109266 -0.118375 0.442894 -0.283956 -0.514906 1.000000 -0.257901 -0.151212 -0.232087 -0.181135
ind_for_slope_down_exerc 0.060585 0.026018 0.050676 0.068006 -0.050966 -0.026198 0.121396 -0.050027 0.107496 0.081915 0.043866 -0.055172 0.059028 0.393261 -0.257901 1.000000 -0.029606 0.051734 0.095509
num_vessels_fluro 0.460442 0.362605 0.093185 -0.059834 -0.153886 -0.138891 0.098773 0.119000 0.145478 0.040781 0.122813 -0.264246 0.145570 0.295832 -0.151212 -0.029606 1.000000 0.225453 0.088639
Thal_rev_defect 0.481585 0.104754 0.327572 -0.021830 -0.201429 -0.157658 0.110482 0.050995 0.020898 -0.032220 0.006347 -0.209410 0.297415 0.306719 -0.232087 0.051734 0.225453 1.000000 -0.200089
Thal_fixed_defect 0.104142 0.060092 0.145351 0.032472 -0.036080 -0.095631 0.075538 -0.098157 0.093319 0.092917 0.032359 -0.158969 0.062916 0.102466 -0.181135 0.095509 0.088639 -0.200089 1.000000

These correlations are visualized with a heat map.

In [17]:
plt.figure(figsize=(15,15))
sns.heatmap(df.corr(), annot=True, cmap="Blues", fmt=".0%")
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x180aac260c8>
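To read the heat map more easily, the correlations with the target alone can be extracted and sorted by strength; a sketch, using a hypothetical miniature frame in place of the real `df` loaded above:

```python
import pandas as pd

# Hypothetical miniature frame standing in for df; the real notebook
# uses the full UCI dataframe loaded above
df_mini = pd.DataFrame({
    "Disease":        [-1,  1,  1, -1,  1, -1],
    "Age":            [63, 67, 67, 37, 41, 56],
    "Max_heart_rate": [150, 108, 129, 187, 172, 153],
})

# Absolute correlation of every feature with the target, strongest first
target_corr = (df_mini.corr()["Disease"]
               .drop("Disease")
               .abs()
               .sort_values(ascending=False))
print(target_corr)
```

On the full dataset this ranking would surface `Thal_rev_defect`, `num_vessels_fluro`, `ind_exerc_angina`, and `Max_heart_rate` as the attributes most associated with `Disease`, matching the correlation table above.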
In [30]:
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.25)  # 75/25 split; no random_state, so results vary between runs
In [31]:
sc=StandardScaler()
X_train=sc.fit_transform(X_train)
X_test=sc.transform(X_test)
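Note that the scaler is fit on the training split only, and the test split is transformed with the training statistics; a toy sketch of why this matters:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: the scaler learns mean/std on the training split only,
# then applies those same statistics to the test split
Xtr = np.array([[1.0], [3.0], [5.0]])   # training mean is 3
Xte = np.array([[3.0]])

sc = StandardScaler()
Xtr_s = sc.fit_transform(Xtr)
Xte_s = sc.transform(Xte)               # reuses the training mean/std

print(sc.mean_)   # [3.]
print(Xte_s)      # [[0.]] because 3 equals the training mean
```

Calling `fit_transform` on the test set instead would leak information about the test distribution into the preprocessing step.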
In [32]:
X_train.shape
Out[32]:
(224, 18)
In [33]:
X_test.shape
Out[33]:
(75, 18)
In [34]:
y_train.shape
Out[34]:
(224,)
In [35]:
y_test.shape
Out[35]:
(75,)
In [40]:
def modelos(X_train, y_train):
    # Logistic regression
    from sklearn.linear_model import LogisticRegression
    log = LogisticRegression(max_iter=100000)
    log.fit(X_train, y_train)

    # Decision tree
    from sklearn.tree import DecisionTreeClassifier
    arbol = DecisionTreeClassifier(criterion="entropy", random_state=0)
    arbol.fit(X_train, y_train)

    # Random forest
    from sklearn.ensemble import RandomForestClassifier
    forest = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
    forest.fit(X_train, y_train)

    # K-nearest neighbors
    from sklearn.neighbors import KNeighborsClassifier
    knn = KNeighborsClassifier(n_neighbors=10)
    knn.fit(X_train, y_train)

    # Print each model's accuracy on the training set
    print("[0]Training accuracy, Logistic Regression:", log.score(X_train, y_train))
    print("[1]Training accuracy, Decision Tree Classifier:", arbol.score(X_train, y_train))
    print("[2]Training accuracy, Random Forest Classifier:", forest.score(X_train, y_train))
    print("[3]Training accuracy, KNN Classifier:", knn.score(X_train, y_train))

    return log, arbol, forest, knn

model = modelos(X_train, y_train)
[0]Training accuracy, Logistic Regression: 0.8883928571428571
[1]Training accuracy, Decision Tree Classifier: 1.0
[2]Training accuracy, Random Forest Classifier: 0.9866071428571429
[3]Training accuracy, KNN Classifier: 0.875

This shows that the Decision Tree classifier is overfitting, and that a classifier cannot be chosen on the basis of training data alone. Testing and evaluating the classifiers on the held-out set gives the following:
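A single train/test split also gives a noisy estimate; k-fold cross-validation averages over several splits. A sketch with synthetic stand-in data of the same shape as the heart dataset (the real comparison would use the `X` and `y` defined above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with the same shape as the heart dataset
X_demo, y_demo = make_classification(n_samples=299, n_features=18, random_state=0)

# Putting the scaler inside the pipeline lets each fold fit it
# on its own training portion, avoiding leakage into the validation fold
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=100000))
scores = cross_val_score(pipe, X_demo, y_demo, cv=5)
print(scores.mean())
```

The pipeline is the important detail: scaling inside `cross_val_score` is refit per fold, mirroring the fit-on-train-only discipline used above.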

In [41]:
from sklearn.metrics import confusion_matrix

for i in range(len(model)):
    print("model: ", i)
    cm = confusion_matrix(y_test, model[i].predict(X_test))
    # With labels sorted as [-1, 1], row 0 is the healthy class:
    TN = cm[0][0]   # healthy correctly classified
    FP = cm[0][1]   # healthy misclassified as diseased
    FN = cm[1][0]   # diseased misclassified as healthy
    TP = cm[1][1]   # diseased correctly classified
    print(cm)
    print("Model accuracy on the test data = ", (TP + TN) / (TP + TN + FN + FP))
    print()
model:  0
[[39  6]
 [ 6 24]]
Model accuracy on the test data =  0.84

model:  1
[[34 11]
 [ 8 22]]
Model accuracy on the test data =  0.7466666666666667

model:  2
[[40  5]
 [ 6 24]]
Model accuracy on the test data =  0.8533333333333334

model:  3
[[37  8]
 [ 7 23]]
Model accuracy on the test data =  0.8

Evaluating by accuracy, the Random Forest classifier performs best, at 85.33% on the test set.
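The hand-computed accuracy can also be checked against scikit-learn's built-in metrics; for a diagnosis task, recall on the disease class is worth reporting alongside accuracy, since missing a sick patient is the costlier error. A sketch on hypothetical labels with the same -1/1 encoding:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels mirroring the -1/1 encoding of the Disease column
y_true = np.array([-1, -1, 1, 1, 1])
y_pred = np.array([-1, 1, 1, 1, -1])

# accuracy_score reproduces the (TP+TN)/total computed by hand above
acc = accuracy_score(y_true, y_pred)
# recall on the positive (disease) class: fraction of sick patients caught
rec = recall_score(y_true, y_pred, pos_label=1)
print(acc, rec)
```

Applied to the real `y_test` and each model's predictions, these calls would replace the manual confusion-matrix arithmetic in the loop above.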

