배고픈 개발자 이야기

[2021/08/24] Python Machine Learning: Decision Tree

Infosec Academy

이융희 2021. 8. 24. 17:48

Practice

 

Modules used

from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.metrics import accuracy_score

from IPython.display import Image

import pandas as pd
import numpy as np

import pydotplus
import os

 

Preprocessing

# Load the data set (playtennis.csv is expected in the working directory).
tennis_data = pd.read_csv('playtennis.csv')

# Encode every categorical value as an integer so scikit-learn can use it.
tennis_data.Outlook = tennis_data.Outlook.replace({'Sunny': 0, 'Overcast': 1, 'Rain': 2})
tennis_data.Temperature = tennis_data.Temperature.replace({'Hot': 1, 'Mild': 2, 'Cool': 3})
tennis_data.Humidity = tennis_data.Humidity.replace({'High': 1, 'Normal': 2})
tennis_data.Wind = tennis_data.Wind.replace({'Weak': 1, 'Strong': 2})
tennis_data.PlayTennis = tennis_data.PlayTennis.replace({'No': 1, 'Yes': 2})
tennis_data
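
The encoding above can also be written with one lookup table per column. The following is a minimal sketch rather than the post's original code: the three-row DataFrame is a hypothetical stand-in for playtennis.csv, and it uses `Series.map` instead of `replace`, so that a category missing from a table shows up as NaN instead of staying unencoded. The integer codes are the ones used in the post.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of playtennis.csv.
sample = pd.DataFrame({
    'Outlook': ['Sunny', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Mild', 'Cool'],
    'Humidity': ['High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak'],
    'PlayTennis': ['No', 'Yes', 'Yes'],
})

# One lookup table per column, with the same integer codes as the post.
codes = {
    'Outlook': {'Sunny': 0, 'Overcast': 1, 'Rain': 2},
    'Temperature': {'Hot': 1, 'Mild': 2, 'Cool': 3},
    'Humidity': {'High': 1, 'Normal': 2},
    'Wind': {'Weak': 1, 'Strong': 2},
    'PlayTennis': {'No': 1, 'Yes': 2},
}

encoded = sample.copy()
for column, table in codes.items():
    # map() yields NaN for any value that is not a key of the table.
    encoded[column] = sample[column].map(table)

# Fail early if a category was missing from a lookup table.
if encoded.isna().any().any():
    raise ValueError('unmapped category found')

encoded = encoded.astype(int)
print(encoded)
```

Keeping the codes in one dictionary also makes it easy to decode predictions later by inverting the `PlayTennis` table.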

 

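Splitting the data into test and training sets

P.S. The uppercase X holds several columns, which is why it is also called a matrix. By convention, a variable with multiple columns is written in uppercase and a single-column variable in lowercase.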

# Feature matrix (four columns) and target vector (one column).
X = tennis_data[['Outlook', 'Temperature', 'Humidity', 'Wind']].to_numpy()
y = tennis_data['PlayTennis'].to_numpy()

# Default split: 25% of the rows go to the test set, chosen at random on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y)
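
To make the uppercase/lowercase convention and the default split concrete, here is a small sketch. The array contents are placeholders rather than the real tennis data, and `random_state=0` is an assumption added only so that the split is repeatable; the post's own call uses no seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data shaped like a 14-row, 4-feature table.
X = np.arange(14 * 4).reshape(14, 4)   # 2-D feature matrix, hence uppercase
y = np.array([1, 2] * 7)               # 1-D label vector, hence lowercase

# With no test_size given, train_test_split holds out 25% of the rows.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X.shape, y.shape)              # (14, 4) (14,)
print(X_train.shape, X_test.shape)   # (10, 4) (4, 4)
```

With 14 rows, 25% rounds up to 4 test rows, which leaves 10 rows for training.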

 

Overfit

# Train on the full data set and evaluate on the same rows.
dt_clf = DecisionTreeClassifier()
dt_clf = dt_clf.fit(X, y)

dt_prediction = dt_clf.predict(X)
accuracy = accuracy_score(y, dt_prediction)
print("Accuracy:", accuracy)

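Here the accuracy is 1.0 because the model is scored on the same rows it was trained on. Training accuracy is not always 1.0, though: it drops when identical feature rows carry different labels.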
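
The two cases can be reproduced on toy data. This is a minimal sketch with made-up arrays, not the tennis data; `training_accuracy` is a hypothetical helper, and `random_state=0` is an assumption added for repeatability.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier


def training_accuracy(X, y):
    """Fit a tree on (X, y) and score it on the same rows."""
    clf = DecisionTreeClassifier(random_state=0).fit(X, y)
    return accuracy_score(y, clf.predict(X))


# Every distinct feature row has a single label: the tree can memorise it.
X_clean = np.array([[0], [1], [2], [3]])
y_clean = np.array([1, 1, 2, 2])

# Three identical rows with conflicting labels: no split can separate them,
# so the leaf predicts the majority label and one row is misclassified.
X_conflict = np.array([[0], [0], [0], [1]])
y_conflict = np.array([1, 1, 2, 2])

acc_clean = training_accuracy(X_clean, y_clean)
acc_conflict = training_accuracy(X_conflict, y_conflict)
print(acc_clean, acc_conflict)   # 1.0 0.75
```

A perfect training score therefore says little about how the tree behaves on unseen rows, which is why the next step evaluates on a held-out test set.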

 

 

Test : training = 0.25 : 0.75 (the train_test_split default)

# Train on the training split only and evaluate on the held-out test split.
dt_clf = DecisionTreeClassifier()
dt_clf = dt_clf.fit(X_train, y_train)

dt_prediction = dt_clf.predict(X_test)
accuracy = accuracy_score(y_test, dt_prediction)
print("Accuracy:", accuracy)

Results: 0.75, 0.5, 1.0 (the split is random, so the accuracy changes from run to run)
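
The spread in these scores can be reproduced by repeating the split with different seeds. The sketch below rests on an assumption: that playtennis.csv holds the classic 14-row PlayTennis table, written here already encoded with the post's integer codes. The seeds are arbitrary and were not used in the post.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed contents of playtennis.csv (classic PlayTennis table), encoded as
# Outlook, Temperature, Humidity, Wind, PlayTennis with the post's codes.
data = np.array([
    [0, 1, 1, 1, 1], [0, 1, 1, 2, 1], [1, 1, 1, 1, 2], [2, 2, 1, 1, 2],
    [2, 3, 2, 1, 2], [2, 3, 2, 2, 1], [1, 3, 2, 2, 2], [0, 2, 1, 1, 1],
    [0, 3, 2, 1, 2], [2, 2, 2, 1, 2], [0, 2, 2, 2, 2], [1, 2, 1, 2, 2],
    [1, 1, 2, 1, 2], [2, 2, 1, 2, 1],
])
X, y = data[:, :4], data[:, 4]

accuracies = []
for seed in range(5):  # arbitrary seeds, chosen only for illustration
    # Same default 75/25 split as in the post, but with a fixed seed.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed)
    clf = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf.predict(X_test)))

print(accuracies)
```

The test set has only four rows, so every score is a multiple of 0.25, which matches the values listed above. Passing `random_state` to `train_test_split` makes a single run repeatable.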

 

 

Generating the decision tree image

feature_names = tennis_data.columns.tolist()
feature_names = feature_names[0:4]
target_name = np.array(['Play No', 'Play Yes'])

dt_dot_data = tree.export_graphviz(dt_clf, out_file = None,
                                   feature_names = feature_names,
                                   class_names = target_name,
                                   filled = True, rounded = True,
                                   special_characters = True)
dt_graph = pydotplus.graph_from_dot_data(dt_dot_data)
Image(dt_graph.create_png())
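
When Graphviz or pydotplus is not installed, a fitted tree can also be inspected as plain text. This is an alternative sketch and not part of the original post: it uses `sklearn.tree.export_text`, and a tiny made-up data set stands in for the fitted `dt_clf`.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data: the label depends only on the first feature.
X = np.array([[0, 1], [0, 2], [1, 1], [1, 2]])
y = np.array([1, 1, 2, 2])   # 1 = 'No', 2 = 'Yes', as in the post

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# export_text needs no external Graphviz binary; it returns a string
# with one line per split and per leaf.
rules = export_text(clf, feature_names=['Outlook', 'Wind'])
print(rules)
```

The text form lists each split threshold and the class predicted at each leaf, which is often enough for a tree as small as this one.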

 

Result
