Split training data and test data

Environment: IDE: VS Code, Language: Python

❓ What to watch out for when building an AI model and training it

❗ Split the data into a Train Data Set and a Test Data Set

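If the model is tested on the Train Data, it is only checked against data it has already learned from. The measured loss can then look small, while the loss on predictions for new, unseen data turns out to be larger.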
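As a rough illustration of the point above, here is a minimal sketch that uses the same toy data as the scripts in this post. It deliberately swaps the Keras model for a simple "memorizing" predictor (it returns the y of the nearest training x), which is an assumption made for the example and not the method used below. The helper names `predict_nearest` and `mae` are hypothetical.

```python
import numpy as np

# Same toy data as in the post: y = x - 1
x = np.arange(1, 11)   # [1 .. 10]
y = np.arange(10)      # [0 .. 9]

x_train, x_test = x[:7], x[7:]
y_train, y_test = y[:7], y[7:]


def predict_nearest(x_query):
    """Hypothetical memorizing model: return the y of the closest training x."""
    distances = np.abs(x_query[:, None] - x_train[None, :])
    nearest = distances.argmin(axis=1)
    return y_train[nearest]


def mae(y_true, y_pred):
    """Mean absolute error, the same loss the scripts below use."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Evaluating on the data the model memorized looks perfect ...
train_mae = mae(y_train, predict_nearest(x_train))
# ... while the held-out data reveals the real error.
test_mae = mae(y_test, predict_nearest(x_test))

print("Train MAE:", train_mae)
print("Test MAE: ", test_mae)
```

The train loss alone would suggest a flawless model; only the held-out test loss shows that it does not generalize.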

How to split the Train Data Set and the Test Data Set

1. Split the data by hand

# split_train_test1.py

import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


# 1. Data
x = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array(range(10)) # [0,1,2,3,4,5,6,7,8,9]

x_train = np.array([1, 2, 3, 4, 5, 6, 7])
x_test = np.array([8, 9, 10])

y_train = np.array(range(7))
y_test = np.array(range(7, 10))


# 2. Model Construction
model = Sequential()
model.add(Dense(64, input_dim=1))
model.add(Dense(64))
model.add(Dense(16))
model.add(Dense(16))
model.add(Dense(1))


# 3. Compile and train
model.compile(loss='mae', optimizer='adam')
model.fit(x_train, y_train, epochs=256, batch_size=5)


# 4. evaluate and predict
loss = model.evaluate(x_test, y_test)
print("Loss: ", loss)

result = model.predict(np.array([11])) # pass an ndarray; recent Keras versions may reject a plain list
print("Result: ", result)


'''
# Result

Epoch 256/256
2/2 [==============================] - 0s 5ms/step - loss: 0.0557
1/1 [==============================] - 0s 288ms/step - loss: 0.0259
Loss:  0.02594725228846073
Result:  [[10.031645]]

'''
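Typing the split by hand makes it easy to drop or duplicate a value. A small sanity check along the following lines may help. `check_split` is a hypothetical helper written for this post, not part of NumPy or Keras, and it assumes the x values are unique.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array(range(10))

x_train = np.array([1, 2, 3, 4, 5, 6, 7])
x_test = np.array([8, 9, 10])
y_train = np.array(range(7))
y_test = np.array(range(7, 10))


def check_split(x, y, x_train, x_test, y_train, y_test):
    """Hypothetical helper: True if the hand-written split is consistent."""
    # Each x part must have as many samples as its y part
    same_lengths = len(x_train) == len(y_train) and len(x_test) == len(y_test)
    # No sample may appear in both train and test (assumes unique x values)
    no_overlap = len(np.intersect1d(x_train, x_test)) == 0
    # Train and test together must contain exactly the original data
    covers_x = np.array_equal(np.sort(np.concatenate([x_train, x_test])), np.sort(x))
    covers_y = np.array_equal(np.sort(np.concatenate([y_train, y_test])), np.sort(y))
    return same_lengths and no_overlap and covers_x and covers_y


split_ok = check_split(x, y, x_train, x_test, y_train, y_test)
print("Split OK:", split_ok)

# A split where 7 leaks into the test set is rejected
leaky_ok = check_split(x, y, x_train, np.array([7, 8, 9]), y_train, y_test)
print("Leaky split OK:", leaky_ok)
```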

2. Slice

# split_train_test2.py

import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense


# 1. Data
x = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array(range(10)) # [0,1,2,3,4,5,6,7,8,9]

x_train = x[:7] # the start index can be omitted
x_test = x[7:] # omitting the end index slices through the last element
y_train = y[:7]
y_test = y[7:]
print(x_train, x_test, y_train, y_test)
# [1 2 3 4 5 6 7] [ 8  9 10] [0 1 2 3 4 5 6] [7 8 9]

# Negative indices express positions counted from the end
x_train2 = x[:-3]
x_test2 = x[-3:]
print(x_train2, x_test2)
# [1 2 3 4 5 6 7] [ 8  9 10]


# 2. Model Construction
model = Sequential()
model.add(Dense(64, input_dim=1))
model.add(Dense(64))
model.add(Dense(32))
model.add(Dense(16))
model.add(Dense(1))


# 3. Compile and train
model.compile(loss='mae', optimizer='adam')
model.fit(x_train, y_train, epochs=128, batch_size=2)


# 4. evaluate and predict
loss = model.evaluate(x_test, y_test)
print("Loss: ", loss)

result = model.predict(np.array([11])) # pass an ndarray; recent Keras versions may reject a plain list
print("Result: ", result)



'''
# Result

Epoch 128/128
4/4 [==============================] - 0s 2ms/step - loss: 0.0481
1/1 [==============================] - 0s 283ms/step - loss: 0.0506
Loss:  0.050618648529052734
Result:  [[9.937546]]

'''
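The slicing above hard-codes the index 7. A possible variation, sketched here under the assumption that a 70% train ratio is wanted, computes the split index from the ratio and shuffles x and y with one shared permutation so that each pair stays aligned. The names `train_ratio`, `split_index`, and `order` are illustrative only.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array(range(10))  # y = x - 1

train_ratio = 0.7  # assumed ratio for this example
# round() avoids an off-by-one from floating-point error in len(x) * ratio
split_index = round(len(x) * train_ratio)

# Shuffle with a seeded generator so the split is repeatable,
# and index x and y with the same order to keep pairs together
rng = np.random.default_rng(1)
order = rng.permutation(len(x))
x_shuffled, y_shuffled = x[order], y[order]

x_train, x_test = x_shuffled[:split_index], x_shuffled[split_index:]
y_train, y_test = y_shuffled[:split_index], y_shuffled[split_index:]

print(x_train, x_test)
print(y_train, y_test)
```

Without the shuffle, the test set always holds the largest x values, as in the plain slice above.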

3. train_test_split()

# split_train_test3.py

import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from sklearn.model_selection import train_test_split


# 1. Data
x = np.array([1,2,3,4,5,6,7,8,9,10])
y = np.array(range(10)) # [0,1,2,3,4,5,6,7,8,9]

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    train_size = 0.7,
    shuffle = True,
    random_state=1)
print("x_train, x_test: ", x_train, x_test, "\ny_train, y_test: ", y_train, y_test)
'''
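train_test_split(*arrays, test_size, train_size, random_state, shuffle, stratify)

Parameters
arrays: the data to split
test_size: fraction of the whole data to use as the test set
train_size: fraction to use as the train set; if omitted, it is 1 - test_size
random_state: when set, the split does not change between runs of the function
* Set random_state so that repeated training runs use the same data and stay comparable
shuffle: whether to shuffle the raw data before splitting (default = True)
Shuffling keeps the split from being skewed by the order of the data
stratify: preserves the class proportions of the given data
Ex. data = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
Share of 0: 70%, share of 1: 30%
With stratify=y, the test set and the train set keep the same shares of 0 and 1 as the data
Only usable when the data (y) is categorical (since it preserves class proportions)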
'''


# 2. Model Construction
model = Sequential()
model.add(Dense(64, input_dim=1))
model.add(Dense(64))
model.add(Dense(32))
model.add(Dense(16))
model.add(Dense(1))


# 3. Compile and train
model.compile(loss='mae', optimizer='adam')
model.fit(x_train, y_train, epochs=128, batch_size=4)


# 4. evaluate and predict
loss = model.evaluate(x_test, y_test)
print("Loss: ", loss)

result = model.predict(np.array([11])) # pass an ndarray; recent Keras versions may reject a plain list
print("Result: ", result)


'''
Result

Epoch 128/128
2/2 [==============================] - 0s 10ms/step - loss: 0.1477
1/1 [==============================] - 0s 480ms/step - loss: 0.2799
Loss:  0.27993300557136536
Result:  [[9.541215]]

'''
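The `random_state` and `stratify` notes in the script above can be checked with a short sketch. The label array is the example from those notes; the feature array `features` is made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels from the stratify example: 70% zeros, 30% ones
labels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
features = np.arange(10)  # made-up features, one per label

# stratify=labels keeps the 0/1 proportions in both parts
f_train, f_test, l_train, l_test = train_test_split(
    features, labels,
    test_size=0.3,
    shuffle=True,
    random_state=1,
    stratify=labels)

print("train labels:", l_train)
print("test labels: ", l_test)

# The same random_state reproduces the same split
f_train2, f_test2, l_train2, l_test2 = train_test_split(
    features, labels,
    test_size=0.3,
    shuffle=True,
    random_state=1,
    stratify=labels)

same_split = np.array_equal(f_test, f_test2) and np.array_equal(f_train, f_train2)
print("same split with same random_state:", same_split)
```

With 10 samples and `test_size=0.3`, the test set should end up with two 0s and one 1, mirroring the 70/30 ratio of the full data.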

Source code

🔗 HJ0216/TIL

References

📑 [Python.NumPy] ndarray indexing and slicing

📑 [Python] How to use sklearn's train_test_split()
