Posts in the 'Naver Clould with BitCamp' category

ImageDataGenerator

HJ0216 2023. 2. 5.

ImageDataGenerator

A way to enlarge the training set when only a few training images are available: slightly transform the existing images to generate more training data

ImageDataGenerator Processing

1. Create an ImageDataGenerator object

: Applies the augmentation transforms configured by the user to the image files

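train_datagen = ImageDataGenerator(
    rescale=1./255., # scale pixel values from [0, 255] to [0, 1]
    horizontal_flip=True, # random horizontal flip
    vertical_flip=True, # random vertical flip
    width_shift_range=0.1, # random horizontal shift (fraction of the width)
    height_shift_range=0.1, # random vertical shift (fraction of the height)
    rotation_range=5, # random rotation in degrees; shift and rotation help reduce overfitting
    zoom_range=0.2, # zoom by up to 20%: a float z samples the zoom factor from [1-z, 1+z]
    shear_range=0.7, # shear intensity
    fill_mode='nearest' # how to fill the pixels left empty after a shift
)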

test_datagen = ImageDataGenerator(
    rescale=1./255.
)
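
To make a few of the options above concrete, here is a minimal NumPy sketch of what rescale, horizontal_flip and vertical_flip do to one image array. It is an illustration only, not how Keras implements them; the helper name augment_once and the toy 2x3 image are made up for this example.

```python
import numpy as np

def augment_once(image, flip_horizontal=False, flip_vertical=False, rescale=1.0 / 255.0):
    """Apply rescale and optional flips to an image shaped (height, width, channels)."""
    out = image.astype(np.float32) * rescale  # rescale: [0, 255] -> [0, 1]
    if flip_horizontal:
        out = out[:, ::-1, :]  # mirror left <-> right (reverse the width axis)
    if flip_vertical:
        out = out[::-1, :, :]  # mirror top <-> bottom (reverse the height axis)
    return out

# toy grayscale image: 2 rows, 3 columns, 1 channel
image = np.array([[0, 51, 102],
                  [153, 204, 255]], dtype=np.uint8).reshape(2, 3, 1)

flipped = augment_once(image, flip_horizontal=True)
print(flipped.squeeze())  # each row is reversed and values lie in [0, 1]
```

ImageDataGenerator draws such transforms at random for every batch, so the model rarely sees exactly the same pixels twice.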

2. Create an iterator with flow() / flow_from_directory() / flow_from_dataframe() (flow_from_directory() returns a DirectoryIterator)

xy_train = train_datagen.flow_from_directory(
    './_data/brain/train', # data path
    target_size=(200, 200), # resize every image to the same shape
    batch_size=10,
    # total data: 160 -> batch_size=10: the 160 images are fed 10 at a time
    # 16 iterations per epoch
    # dataset size check: with a large enough batch_size, the first batch holds the whole dataset
    class_mode='binary', # labels inferred from the folders: binary (0 / 1)
    color_mode='grayscale', # 'grayscale' or 'rgb'
    shuffle=True, # a trailing ',' after the last argument is fine
    )
# Found 160 images belonging to 2 classes
# 160 images in total, stored in 2 class folders

* Number of iterations for a given batch_size

'''
total_data: 160
batch_size: 10
x0: xy_train[0][0], y0: xy_train[0][1]
x1: xy_train[1][0], y1: xy_train[1][1]
...
x15: xy_train[15][0], y15: xy_train[15][1]
'''
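
The batch count above is simply the dataset size divided by batch_size, rounded up. A minimal sketch (the function name is illustrative):

```python
import math

def steps_per_epoch(total_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(total_samples / batch_size)

print(steps_per_epoch(160, 10))   # 16 batches: indices 0..15, as listed above
print(steps_per_epoch(160, 32))   # 5 full batches
print(steps_per_epoch(165, 10))   # 17: the last batch holds only 5 samples
```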

* Printing xy_train[0]: the first batch of xy_train

print(xy_train[0])
'''
(array([[[[0.08627451], [0.08627451], ..., [0.0858703 ], [0.08754121]],
        ...,
        [[0.3088285 ], [0.22372028], ..., [0.17596895], [0.15582304]]]], dtype=float32),
array([1., 1., 0., 1., 1., 1., 1., 0., 1., 0.], dtype=float32))
'''

* Shape of each iteration (batch)

print(xy_train[0][0].shape) # data_x: (10, 200, 200, 1) = (batch_size, target_height, target_width, channels)
print(xy_train[0][1].shape) # data_y: (10,) = (batch_size,)

* Data types of xy_train

print(type(xy_train)) # <class 'keras.preprocessing.image.DirectoryIterator'>: a sequence of (x, y) tuples of NumPy arrays
print(type(xy_train[0])) # <class 'tuple'>: (x, y), an immutable sequence
print(type(xy_train[0][0])) # x: <class 'numpy.ndarray'>
print(type(xy_train[0][1])) # y: <class 'numpy.ndarray'>

3. Build and compile a model that matches flow_from_directory()

xy_test = test_datagen.flow_from_directory(
    './_data/brain/test',
    target_size=(200, 200),
    batch_size=10,
    class_mode='binary', # binary labels pair with the sigmoid output below
    color_mode='grayscale',
    )

# Final model output
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])

# model.add(Dense(2, activation='softmax')) # two classes: 0, 1
# with one-hot labels (class_mode='categorical'): model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

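4. Train with the flow_from_directory() iterators

hist = model.fit(xy_train, steps_per_epoch=16, epochs=5,
                    validation_data=xy_test,
                    validation_steps=4)
# fit() accepts the iterator directly (fit_generator is deprecated in TensorFlow 2.x); each batch is the (x, y) pair produced by flow_from_directory()
# steps_per_epoch = total_data / batch_size = 160 / 10
# validation_steps = number of validation batches per epoch; 120 / 10 = 12 would cover the whole test set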

5. Evaluate the training results

print("Loss: ", loss[-1]) # last element of the list = value of the final epoch
print("Val_Loss: ", val_loss[-1])
print("Accuracy: ", accuracy[-1])
print("Val_acc: ", val_acc[-1])

'''
Result
Loss:  0.6932721734046936
Val_Loss:  0.6928479075431824
Accuracy:  0.5
Val_acc:  0.550000011920929

'''
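
A loss of about 0.693 is worth a second look: it equals ln 2, the binary cross-entropy obtained when the predicted probability is 0.5 for every sample. Together with an accuracy of 0.5, this is consistent with a model that has not yet learned to separate the two classes. A minimal sketch in plain Python (the function name and the toy labels are illustrative):

```python
import math

def binary_crossentropy(y_true, y_prob):
    """Mean binary cross-entropy over samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

labels = [1, 1, 0, 1, 0]
chance = binary_crossentropy(labels, [0.5] * 5)
confident = binary_crossentropy(labels, [0.9, 0.8, 0.1, 0.7, 0.2])

print(chance, math.log(2))  # both are ln 2
print(confident)            # lower, because the predictions are confident and correct
```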

➕ Displaying ImageDataGenerator output (Matplotlib)

import matplotlib.pyplot as plt

batch = xy_train[0] # first batch: an (x, y) tuple holding 10 images

plt.figure(figsize=(20, 10))
for i, img in enumerate(batch[0]): # enumerate yields (index, element) tuples
    # batch[0] is x with shape (10, 200, 200, 1)
    # on each pass, i is the index of the current image and img is the image itself
    plt.subplot(1, 10, i+1) # subplot(rows, cols, index starting at 1): position inside the figure grid
    plt.axis('off')
    plt.imshow(img.squeeze(), cmap='gray') # squeeze() drops axes of size 1: (200, 200, 1) -> (200, 200)
plt.tight_layout()
plt.show()

⭐ Full Source Code

# imageDataGenerator.py

import numpy as np

from tensorflow.keras.preprocessing.image import ImageDataGenerator


# 1. Data
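train_datagen = ImageDataGenerator(
    rescale=1./255., # scale pixel values from [0, 255] to [0, 1]
    horizontal_flip=True, # random horizontal flip
    vertical_flip=True, # random vertical flip
    width_shift_range=0.1, # random horizontal shift (fraction of the width)
    height_shift_range=0.1, # random vertical shift (fraction of the height)
    rotation_range=5, # random rotation in degrees; shift and rotation help reduce overfitting
    zoom_range=0.2, # zoom by up to 20%: a float z samples the zoom factor from [1-z, 1+z]
    shear_range=0.7, # shear intensity
    fill_mode='nearest' # how to fill the pixels left empty after a shift
)

test_datagen = ImageDataGenerator(
    rescale=1./255.
)
# test data: rescale only, no augmentation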


xy_train = train_datagen.flow_from_directory(
    './_data/brain/train', # data path
    target_size=(200, 200), # resize every image to the same shape
    batch_size=10,
    # total data: 160 -> batch_size=10: the 160 images are fed 10 at a time
    # 16 iterations per epoch
    # dataset size check: with a large enough batch_size, the first batch holds the whole dataset
    class_mode='binary', # labels inferred from the folders: binary (0 / 1)
    color_mode='grayscale', # 'grayscale' or 'rgb'
    shuffle=True, # a trailing ',' after the last argument is fine
    )
# Found 160 images belonging to 2 classes
# 160 images in total, stored in 2 class folders


xy_test = test_datagen.flow_from_directory(
    './_data/brain/test',
    target_size=(200, 200),
    batch_size=10,
    class_mode='binary',
    color_mode='grayscale',
    shuffle=True,
    )
# Found 120 images belonging to 2 classes.
# each batch is an (x, y) tuple of NumPy arrays


print(xy_train)
# <keras.preprocessing.image.DirectoryIterator object at 0x000002134BCFCA60>
# data type: a sequence of (x(numpy), y(numpy)) tuples
'''
print(xy_train[0])
(array([[[[0.08627451], [0.08627451], ..., [0.0858703 ], [0.08754121]],
        ...,
        [[0.3088285 ], [0.22372028], ..., [0.17596895], [0.15582304]]]], dtype=float32),
array([1., 1., 0., 1., 1., 1., 1., 0., 1., 0.], dtype=float32))


total_data: 160
batch_size: 10
x0: xy_train[0][0], y0: xy_train[0][1]
x1: xy_train[1][0], y1: xy_train[1][1]
...
x15: xy_train[15][0], y15: xy_train[15][1]

'''

print(xy_train[0][0].shape) # data_x: (10, 200, 200, 1) = (batch_size, target_height, target_width, channels)
print(xy_train[0][1].shape) # data_y: (10,) = (batch_size,)

print(type(xy_train)) # <class 'keras.preprocessing.image.DirectoryIterator'>: a sequence of (x, y) tuples of NumPy arrays
print(type(xy_train[0])) # <class 'tuple'>: (x, y), an immutable sequence
print(type(xy_train[0][0])) # x: <class 'numpy.ndarray'>
print(type(xy_train[0][1])) # y: <class 'numpy.ndarray'>


# 2. Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten

model = Sequential()
model.add(Conv2D(64,(2,2), input_shape=(200,200,1)))
model.add(Conv2D(64, (3,3), activation='relu'))
model.add(Conv2D(64, (3,3), activation='relu'))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# model.add(Dense(2, activation='softmax')) # two classes: 0, 1
# with one-hot labels: model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
# with integer labels: model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])


# 3. Compile and Train
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['acc'])
hist = model.fit(xy_train, steps_per_epoch=16, epochs=5,
                    validation_data=xy_test,
                    validation_steps=4)
# fit() accepts the iterator directly (fit_generator is deprecated in TensorFlow 2.x); each batch is the (x, y) pair produced by flow_from_directory()
# steps_per_epoch = total_data / batch_size
# validation_steps = number of validation batches per epoch; 120 / 10 = 12 would cover the whole test set

accuracy = hist.history['acc']
val_acc = hist.history['val_acc']

loss = hist.history['loss']
val_loss = hist.history['val_loss']

'''
print("Loss: ", loss)
len(loss) == the number of epochs passed to fit()
hist.history stores the loss of every epoch as a list
'''

print("Loss: ", loss[-1]) # last element of the list = value of the final epoch
print("Val_Loss: ", val_loss[-1])
print("Accuracy: ", accuracy[-1])
print("Val_acc: ", val_acc[-1])

'''
Result
Loss:  0.6932721734046936
Val_Loss:  0.6928479075431824
Accuracy:  0.5
Val_acc:  0.550000011920929

'''


import matplotlib.pyplot as plt

batch = xy_train[0] # first batch: an (x, y) tuple holding 10 images

plt.figure(figsize=(20, 10))
for i, img in enumerate(batch[0]): # enumerate yields (index, element) tuples
    # batch[0] is x with shape (10, 200, 200, 1)
    # on each pass, i is the index of the current image and img is the image itself
    plt.subplot(1, 10, i+1) # subplot(rows, cols, index starting at 1): position inside the figure grid
    plt.axis('off')
    plt.imshow(img.squeeze(), cmap='gray') # squeeze() drops axes of size 1: (200, 200, 1) -> (200, 200)
'''
squeeze()
x3: array([[[0]],
           [[1]],
           [[2]],
           [[3]],
           [[4]],
           [[5]]])
x3.shape: (6,1,1)

x3.squeeze()
array([0, 1, 2, 3, 4, 5])
'''
plt.tight_layout()
plt.show()

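➕ Tuple vs. List

List []: elements can be added, removed, and modified

Tuple (): elements cannot be removed or changed

-> A tuple with a single element needs a trailing ',' after that element

-> The parentheses can be omitted, as in x1 = 1, 2, 3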
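
The differences listed above can be checked in a few lines of plain Python (the variable names are illustrative):

```python
numbers_list = [1, 2, 3]
numbers_list[0] = 10        # lists can be modified in place
numbers_list.append(4)      # and grown

numbers_tuple = (1, 2, 3)
try:
    numbers_tuple[0] = 10   # tuples reject item assignment
except TypeError as err:
    print("TypeError:", err)

single = (1,)               # the trailing comma makes a one-element tuple
not_a_tuple = (1)           # without it, this is just the integer 1
x1 = 1, 2, 3                # parentheses are optional

print(type(single), type(not_a_tuple), type(x1))
```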

Source code

🔗 HJ0216/TIL

Ensemble Model

HJ0216 2023. 1. 31.

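Environment: IDE: VS Code, Language: Python

Model ensemble: using several models together to get better performance than a single model

The examples below wire three sub-models into one model (two input branches plus a merge block, then one shared trunk plus two output branches), aiming for better predictive power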

# ensemble_model1.py

import numpy as np

x1_datasets = np.array([range(100), range(301, 401)]).transpose() # .transpose() = .T
print(x1_datasets.shape) # (100, 2)
print(x1_datasets)
'''
[[  0 301]
 [  1 302]
 [  2 303]
 ...
  [ 98 399]
 [ 99 400]]
'''

x2_datasets = np.array([range(101, 201), range(411, 511), range(150, 250)]).transpose()
print(x2_datasets.shape) # (100, 3)

y = np.array(range(2001, 2101)) # (100,)

from sklearn.model_selection import train_test_split
x1_train, x1_test, x2_train, x2_test, y_train, y_test = train_test_split(
    x1_datasets, x2_datasets, y, train_size=0.7, random_state=1234
)

print(x1_train.shape, x2_train.shape, y_train.shape) # (70, 2) (70, 3) (70,)
print(x1_test.shape, x2_test.shape, y_test.shape) # (30, 2) (30, 3) (30,)


# 2. Model Construction
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input

# 2-1. Model_1
input1 = Input(shape=(2,))
dense1 = Dense(11, activation='relu', name='ds11')(input1) # name: the label shown for this layer in model.summary()
dense2 = Dense(12, activation='relu', name='ds12')(dense1)
dense3 = Dense(13, activation='relu', name='ds13')(dense2)
output1 = Dense(11, activation='relu', name='ds14')(dense3)

# 2-2. Model_2
input2 = Input(shape=(3,))
dense21 = Dense(21, activation='linear', name='ds21')(input2)
dense22 = Dense(22, activation='linear', name='ds22')(dense21)
output2 = Dense(23, activation='linear', name='ds23')(dense22)

# 2-3. Model_merge
from tensorflow.keras.layers import concatenate
merge1 = concatenate([output1, output2], name='mg1')
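# merge1 = Concatenate(name='mg1')([output1, output2]) # layer-class form: name goes to the constructor; needs "from tensorflow.keras.layers import Concatenate"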
merge2 = Dense(12, activation='relu', name='mg2')(merge1)
merge3 = Dense(13, name='mg3')(merge2)
last_output = Dense(1, name='last')(merge3)

model = Model(inputs=[input1, input2], outputs=last_output)

model.summary()
'''
Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to
==================================================================================================
 input_1 (InputLayer)           [(None, 2)]          0           []

 ds11 (Dense)                   (None, 11)           33          ['input_1[0][0]']
__________________________________________________________________________________________________

 input_2 (InputLayer)           [(None, 3)]          0           []

 ds12 (Dense)                   (None, 12)           144         ['ds11[0][0]']
__________________________________________________________________________________________________

 ds21 (Dense)                   (None, 21)           84          ['input_2[0][0]']
__________________________________________________________________________________________________

 ds13 (Dense)                   (None, 13)           169         ['ds12[0][0]']
__________________________________________________________________________________________________

 ds22 (Dense)                   (None, 22)           484         ['ds21[0][0]']
__________________________________________________________________________________________________

 ds14 (Dense)                   (None, 11)           154         ['ds13[0][0]']
__________________________________________________________________________________________________

 ds23 (Dense)                   (None, 23)           529         ['ds22[0][0]']
__________________________________________________________________________________________________

 mg1 (Concatenate)              (None, 34)           0           ['ds14[0][0]',
                                                                  'ds23[0][0]']

 mg2 (Dense)                    (None, 12)           420         ['mg1[0][0]']

 mg3 (Dense)                    (None, 13)           169         ['mg2[0][0]']

 last (Dense)                   (None, 1)            14          ['mg3[0][0]']

==================================================================================================
Total params: 2,200
Trainable params: 2,200
Non-trainable params: 0
__________________________________________________________________________________________________
'''


# 3. compile and train
model.compile(loss='mse', optimizer='adam')
model.fit([x1_train, x2_train], y_train, epochs=10, batch_size=8)
# the model has two inputs, so fit() takes a list of two input arrays


# 4. evaluate and predict
loss = model.evaluate([x1_test, x2_test], y_test)
# the same two-input list is needed for evaluation
print("Loss: ", loss)



'''
Result
Loss:  15543.3212890625

'''
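
The Param # column of the summary above can be reproduced by hand: a Dense layer has inputs * units weights plus one bias per unit, and the concatenate layer has no weights of its own. A minimal sketch, with the layer widths read off the model definition (the helper name dense_params is illustrative):

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units

# (name, input width, units) taken from the ensemble_model1.py layers
layers = [
    ("ds11", 2, 11), ("ds12", 11, 12), ("ds13", 12, 13), ("ds14", 13, 11),
    ("ds21", 3, 21), ("ds22", 21, 22), ("ds23", 22, 23),
    ("mg2", 11 + 23, 12),  # mg1 concatenates 11 + 23 = 34 features
    ("mg3", 12, 13), ("last", 13, 1),
]

total = 0
for name, inputs, units in layers:
    count = dense_params(inputs, units)
    total += count
    print(f"{name}: {count}")
print("Total params:", total)
```

The total matches the 2,200 reported by model.summary().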

# ensemble_model2.py

import numpy as np

from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input


# 1. Data
x_datasets = np.array([range(100), range(301, 401)]).transpose()

y1 = np.array(range(2001, 2101)) # (100,)
y2 = np.array(range(201, 301)) # (100,)

x_train, x_test, y1_train, y1_test, y2_train, y2_test = train_test_split(
    x_datasets, y1, y2, train_size=0.7, random_state=1234
)
# a single call splits all three arrays with the same shuffled indices, so the rows stay aligned

print(x_train.shape, y1_train.shape, y2_train.shape) # (70, 2) (70,) (70,)
print(x_test.shape, y1_test.shape, y2_test.shape) # (30, 2) (30,) (30,)


# 2. Model Construction
# 2-1. Model_1
input1 = Input(shape=(2,))
dense1 = Dense(11, activation='relu', name='ds11')(input1)
dense2 = Dense(12, activation='relu', name='ds12')(dense1)
dense3 = Dense(13, activation='relu', name='ds13')(dense2)
output = Dense(14, activation='relu', name='ds14')(dense3)

# 2-2. Model_branch1
dense21 = Dense(11, activation='relu', name='ds21')(output)
# no new Input layer: the branch takes the trunk's output tensor directly
dense22 = Dense(12, activation='relu', name='ds22')(dense21)
dense23 = Dense(13, activation='relu', name='ds23')(dense22)
output_b1 = Dense(14, activation='relu', name='ds24')(dense23)
# note: y1 and y2 hold one value per sample; a regression head for such targets is usually Dense(1)

# 2-3. Model_branch2
dense31 = Dense(11, activation='relu', name='ds31')(output)
dense32 = Dense(12, activation='relu', name='ds32')(dense31)
dense33 = Dense(13, activation='relu', name='ds33')(dense32)
output_b2 = Dense(14, activation='relu', name='ds34')(dense33)

model = Model(inputs=[input1], outputs=[output_b1, output_b2])
model.summary()


# 3. compile and train
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.fit(x_train, [y1_train, y2_train], epochs=128, batch_size=8)


# 4. evaluate and predict
loss = model.evaluate(x_test, [y1_test, y2_test])
print("Loss: ", loss)



'''
Result
loss: 3059665.2500 / ds24_loss: 3034864.0000 / ds34_loss: 24801.2871
-> with n outputs, evaluate() reports the loss of each output plus their sum (n+1 values)

ds24_mae: 1481.5680 / ds34_mae: 104.1349
-> with n outputs, the mae of each output is reported (only the loss has a total)

'''
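
The first number in the result above is the sum of the two per-output losses, up to float32 rounding. A minimal sketch of that bookkeeping follows. The function name total_loss is illustrative, and the optional weights mirror the loss_weights argument of compile(), which this script leaves at its default of equal weights.

```python
def total_loss(per_output_losses, weights=None):
    """Weighted sum of per-output losses; equal weights of 1.0 by default."""
    if weights is None:
        weights = [1.0] * len(per_output_losses)
    return sum(w * l for w, l in zip(weights, per_output_losses))

ds24_loss = 3034864.0000
ds34_loss = 24801.2871

print(total_loss([ds24_loss, ds34_loss]))              # close to the reported 3059665.25
print(total_loss([ds24_loss, ds34_loss], [0.5, 1.0]))  # down-weighting the first output
```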

Source code

🔗 HJ0216/TIL

LSTM, Bidirectional, Conv1D

HJ0216 2023. 1. 30.

Environment: IDE: VS Code, Language: Python

Applying LSTM, Bidirectional, and Conv1D to various datasets

1. LSTM(Long Short Term Memory)

: A model introduced to address the long-term dependency problem of RNN models
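
To show where the long-term memory comes from, here is a minimal NumPy sketch of one LSTM time step. It is an illustration under assumptions, not the Keras implementation: the gate ordering inside the stacked weight matrices, the random weights, and the function names are all chosen for this example. The sizes match LSTM(units=64, input_shape=(13, 1)) used in lstm_boston.py.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4*units, features), U: (4*units, units), b: (4*units,)."""
    units = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b           # all four gates at once, shape (4 * units,)
    i = sigmoid(z[0 * units:1 * units])    # input gate: how much new information to write
    f = sigmoid(z[1 * units:2 * units])    # forget gate: how much of the old cell state to keep
    g = np.tanh(z[2 * units:3 * units])    # candidate values for the cell state
    o = sigmoid(z[3 * units:4 * units])    # output gate: how much of the cell state to expose
    c_t = f * c_prev + i * g               # cell state: the additive long-term path
    h_t = o * np.tanh(c_t)                 # hidden state passed to the next step
    return h_t, c_t

rng = np.random.default_rng(0)
features, units, timesteps = 1, 64, 13
W = rng.normal(scale=0.1, size=(4 * units, features))
U = rng.normal(scale=0.1, size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
sequence = rng.normal(size=(timesteps, features))
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, U, b)

n_params = W.size + U.size + b.size
print(h.shape, n_params)  # n_params = 4 * (units * (units + features) + units)
```

Because the cell state is updated by addition (f * c_prev + i * g) rather than by repeated matrix multiplication, information can survive many steps when the forget gate stays close to 1.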

# lstm_boston.py

import numpy as np

from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Dropout, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


# 1. Data
dataset = load_boston() # load_boston was removed in scikit-learn 1.2, so this script needs an older version

x = dataset.data # for training
y = dataset.target # for predict

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    train_size=0.7,
    random_state=123
)

scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape) # (354, 13), (152, 13)

x_train = x_train.reshape(354, 13, 1)
x_test = x_test.reshape(152, 13, 1)
# when reshaping, timesteps * features must stay equal to the original feature count (13 * 1 = 13)


# 2. Model Construction
model = Sequential()
model.add(LSTM(units=64, input_shape=(13,1)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1))


# 3. Compile and Training
model.compile(loss='mse', optimizer='adam')

earlyStopping = EarlyStopping(monitor='loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train, epochs=2, callbacks=[earlyStopping], batch_size=2) # train on the scaled, reshaped training split


# 4. Evaluation and Prediction
loss = model.evaluate(x_test,y_test)
print("Loss: ", loss)
y_predict = model.predict(x_test)

r2 = r2_score(y_test, y_predict)
print("R2: ", r2)



'''
Result
RMSE:  21.002980488736824
R2:  -4.457572735948721

'''

# lstm_fatch_covtype.py

import numpy as np

from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Dropout, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import accuracy_score, mean_squared_error, r2_score


# 1. Data
dataset = fetch_covtype()
x = dataset.data # for training
y = dataset.target # for predict

y = to_categorical(y)
y = np.delete(y, 0, axis=1)

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    train_size=0.7,
    random_state=123
)

scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape)

x_train = x_train.reshape(406708, 54, 1)
x_test = x_test.reshape(174304, 54, 1)


# 2. Model Construction
model = Sequential()
model.add(LSTM(units=64, input_shape=(54,1)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(7, activation='softmax'))


# 3. Compile and Training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

earlyStopping = EarlyStopping(monitor='loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train, epochs=2, callbacks=[earlyStopping], batch_size=128) # train on the scaled, reshaped training split


# 4. Evaluation and Prediction
loss, accuracy = model.evaluate(x_test, y_test)
print("loss: ", loss)
print("accuracy: ", accuracy)

y_predict = model.predict(x_test)
y_predict = np.argmax(y_predict, axis=1) # (174304, 7) -> (174304,)
y_test = np.argmax(y_test, axis=1) # (174304, 7) -> (174304,)
# y was one-hot encoded, so its shape is (n_samples, n_classes) until argmax is applied



'''
Result
loss:  0.062044188380241394
acc:  0.9850000143051147
R2:  0.972367969950624

'''
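
The to_categorical / np.delete pair in the script above exists because the covtype labels run from 1 to 7: one-hot encoding creates a column for class 0 that is never used. A minimal NumPy sketch of the same steps follows; one_hot is a stand-in written for this example, not the Keras function.

```python
import numpy as np

def one_hot(labels, num_classes):
    """NumPy stand-in for to_categorical: row i has a 1 in column labels[i]."""
    return np.eye(num_classes, dtype=np.float32)[labels]

labels = np.array([1, 7, 3, 2])              # classes run 1..7, there is no class 0
encoded = one_hot(labels, labels.max() + 1)
print(encoded.shape)                          # 8 columns: 0..7, column 0 is always empty

trimmed = np.delete(encoded, 0, axis=1)       # drop the unused column 0
print(trimmed.shape)                          # 7 columns, one per real class

recovered = np.argmax(trimmed, axis=1) + 1    # +1 undoes the column shift
print(recovered)
```

Note that np.argmax on the trimmed array returns 0..6, so predictions and targets are compared in that shifted space unless 1 is added back.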

# lstm_mnist.py

import numpy as np

from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Dropout, Conv2D, Flatten, MaxPooling2D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.datasets import mnist, cifar10, cifar100, fashion_mnist

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


# 1. Data
(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape, x_test.shape) # (60000, 28, 28) (10000, 28, 28)

x_train=x_train/255.
x_test=x_test/255.

'''
print(np.unique(y_train, return_counts=True))

(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=uint8),
array([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949], dtype=int64))
'''


# 2. Model Construction
model = Sequential()
model.add(LSTM(units=64, input_shape=(28,28)))
model.add(Dense(32, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(16, activation='relu'))
model.add(Dense(10, activation='softmax')) # multi-class classification


# 3. Compile and Training
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['acc'])
# multi-class with integer labels (no one-hot encoding) -> sparse categorical crossentropy

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=2,
          callbacks=[earlyStopping],
          batch_size=512)


# 4. Evaluation and Prediction
result = model.evaluate(x_test, y_test)
print("loss: ", result[0])
print("acc: ", result[1])

y_predict = model.predict(x_test)
y_predict = np.argmax(y_predict, axis=1)

r2 = r2_score(y_test, y_predict)
print("R2: ", r2)



'''
Result
loss:  0.07328376173973083
acc:  0.9812999963760376
R2:  0.9632089716433644

'''

2. Bidirectional RNN

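: A standard (unidirectional) RNN predicts the next item from what it has seen so far. To improve performance, the Bidirectional RNN was devised to make a prediction while also looking at what comes afterwards

❗ I am ? student.

Guessing ? from both 'I am' and 'student.' can work better than guessing it from 'I am' alone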
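
The idea can be sketched with a plain tanh RNN in NumPy: run one pass over the sequence, run a second pass with separate weights over the reversed sequence, and concatenate the two final states. This is a simplified illustration with random weights and made-up function names. Concatenation is also the default merge mode of the Keras Bidirectional wrapper, but the wrapper itself is not reimplemented here.

```python
import numpy as np

def simple_rnn(sequence, W, U):
    """Run h_t = tanh(W x_t + U h_{t-1}) over the sequence and return the last hidden state."""
    h = np.zeros(U.shape[0])
    for x_t in sequence:
        h = np.tanh(W @ x_t + U @ h)
    return h

def bidirectional(sequence, fwd, bwd):
    """Concatenate a forward pass and a pass over the reversed sequence."""
    h_forward = simple_rnn(sequence, *fwd)
    h_backward = simple_rnn(sequence[::-1], *bwd)
    return np.concatenate([h_forward, h_backward])

rng = np.random.default_rng(0)
features, units = 1, 32
fwd = (rng.normal(size=(units, features)), rng.normal(size=(units, units)) * 0.1)
bwd = (rng.normal(size=(units, features)), rng.normal(size=(units, units)) * 0.1)

sequence = np.array([[1.0], [2.0], [3.0], [4.0]])  # 4 timesteps, 1 feature
out = bidirectional(sequence, fwd, bwd)
print(out.shape)  # twice the width of a single direction
```

Two independent weight sets and two passes are also why a bidirectional layer costs roughly twice as much as the layer it wraps.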

# biDirectional.py

import numpy as np

from sklearn.model_selection import train_test_split

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, SimpleRNN, LSTM, GRU, Bidirectional # Bidirectional: wrapper for two-direction processing
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint


# 1. Data
a = np.array(range(1, 101))
x_predict = np.array(range(96, 106))
# x_predict holds inputs only, so there is no y to split off

timesteps = 5 # 4 values for x, 1 for y

def split_x(dataset, timesteps):
    data_list = [] # start with an empty list
    for i in range(len(dataset) - timesteps + 1):
        # number of windows returned: len(dataset) - timesteps + 1
        subset = dataset[i: (i+timesteps)]
        # dataset[0:5], [1:6], [2:7], ...: a slice includes the start index and excludes the end index
        data_list.append(subset)
    return np.array(data_list)

a_split = split_x(a, timesteps)
x_pred_split = split_x(x_predict, timesteps-1)
'''
# separate variables timesteps1 and timesteps2 could be used instead
timesteps1 = 5
timesteps2 = 4

a_split = split_x(a, timesteps1) # window of 5
x_pred_split = split_x(x_predict, timesteps2) # window of 4

'''

x = a_split[:, :-1] # all rows, every column except the last
y = a_split[:, -1] # all rows, last column only
x_predict = x_pred_split[:,:]


'''
print(x.shape, y.shape) # (96, 4) (96,)
x: [1 2 3 4] ... [96 97 98 99]
y: [5 6 ... 99 100]

print(x_predict) # (7, 4)
[[ 96  97  98  99]
 [ 97  98  99 100]
 [ 98  99 100 101]
 [ 99 100 101 102]
 [100 101 102 103]
 [101 102 103 104]
 [102 103 104 105]]
'''

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    test_size=0.2,
    shuffle= True,
    random_state = 333
)

print(x_train.shape, x_test.shape, x_predict.shape) # (76, 4) (20, 4) (7, 4)

x_train = x_train.reshape(76,4,1)
x_test = x_test.reshape(20,4,1)
x_predict = x_predict.reshape(7,4,1)
# reshape to 3D (samples, timesteps, features) for the RNN layers after the 2D arrays have been split
# x_train, x_test, y_train, y_test = train_test_split()


# 2. Model Construction
model = Sequential()
model.add(Bidirectional(LSTM(units=32, return_sequences=True), input_shape=(4,1)))
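# Bidirectional is a wrapper, not a model by itself, so it needs an RNN layer to wrap
# return_sequences=True: return the output of every timestep (3D output), so that another RNN layer can follow
# RNN layers may perform worse when the input is not sequential data
# unless Bidirectional(LSTM) returns sequence data, stacking another RNN after it does not necessarily improve performance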
model.add(GRU(32, activation='relu'))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1))
model.summary()
# Bidirectional: about twice the computation of the wrapped (non-bidirectional) layer


# 3. Compile and Training
model.compile(loss='mse', optimizer='adam')

earlyStopping = EarlyStopping(monitor='loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train, epochs=128, callbacks=[earlyStopping], batch_size=2)


# 4. Evaluation and Prediction
loss = model.evaluate(x_test, y_test)

result = model.predict(x_predict)
print("Predict[100 ... 106]: ", result)



'''
Result(Non-Bi)
[[ 99.99038 ]
 [100.99039 ]
 [101.99034 ]
 [102.99025 ]
 [103.99007 ]
 [104.98979 ]
 [105.989456]]

Result(Bi)
[[ 99.07903 ]
 [ 99.89542 ]
 [100.68855 ]
 [101.46187 ]
 [102.215195]
 [102.94844 ]
 [103.6616  ]]

'''
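
The windowing done by split_x in the script above is easier to follow on a short series. The sketch below uses a compact version of the same function on the numbers 1..10 and then separates each window into inputs and target, as the script does.

```python
import numpy as np

def split_x(dataset, timesteps):
    """Slide a window of length `timesteps` over the series, one step at a time."""
    return np.array([dataset[i:i + timesteps] for i in range(len(dataset) - timesteps + 1)])

data = np.arange(1, 11)   # 1..10
windows = split_x(data, 5)
print(windows.shape)      # 10 - 5 + 1 = 6 windows of length 5

x = windows[:, :-1]       # first 4 values of each window
y = windows[:, -1]        # the value that follows them
print(x[0], y[0])
print(x[-1], y[-1])
```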

3. Conv1D

: A CNN mostly consists of convolution layers, pooling layers, and fully connected layers

Convolution and pooling layers mainly extract useful features and can capture spatial information from the raw data

: A 1D CNN is mostly used for time series or text rather than for image analysis

-> 1D: the convolution kernel and the data sequence it slides over are both one-dimensional
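
A minimal NumPy sketch of a 1D convolution may help here: the kernel slides along the time axis only and, at each position, combines all channels of the steps it covers. The function name conv1d is illustrative. The 'same' branch pads with zeros so that the output keeps the input length; placing the extra zero at the end for an even kernel size is an assumption of this sketch.

```python
import numpy as np

def conv1d(x, kernel, bias, padding="valid"):
    """x: (steps, channels); kernel: (k, channels, filters); bias: (filters,)."""
    k = kernel.shape[0]
    if padding == "same":
        pad_total = k - 1
        pad_before = pad_total // 2
        x = np.pad(x, ((pad_before, pad_total - pad_before), (0, 0)))  # zero padding on the time axis
    steps_out = x.shape[0] - k + 1
    out = np.empty((steps_out, kernel.shape[2]))
    for t in range(steps_out):
        window = x[t:t + k]  # (k, channels): the steps covered by the kernel
        out[t] = np.tensordot(window, kernel, axes=([0, 1], [0, 1])) + bias
    return out

x = np.arange(8.0).reshape(4, 2)  # 4 steps, 2 channels, like input_shape=(4, 2)
kernel = np.ones((2, 2, 3))       # kernel_size=2, 2 input channels, 3 filters
bias = np.zeros(3)

valid = conv1d(x, kernel, bias, padding="valid")
same = conv1d(x, kernel, bias, padding="same")
print(valid.shape, same.shape)    # 'valid' loses k - 1 steps, 'same' keeps all 4
print(valid[:, 0])
```

With this layout, the layer holds k * channels * filters weights plus one bias per filter.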

# conv1D_california.py

import numpy as np

from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Dropout, Conv2D, Flatten, MaxPooling2D, Conv1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.datasets import mnist, cifar10, cifar100, fashion_mnist

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


# 1. Data
dataset = fetch_california_housing()

x = dataset.data # for training
y = dataset.target # for predict

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    train_size=0.7,
    random_state=123
)

scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape) # (14447, 8) (6193, 8)

x_train = x_train.reshape(14447, 4, 2)
x_test = x_test.reshape(6193, 4, 2)
# any shape works as long as the product of the dimensions after the sample count stays the same
# (14447,8,1), (14447,4,2), (14447,2,4), (14447,1,8) all hold the same 8 values per sample


# 2. Model Construction
model = Sequential()
model.add(Conv1D(128, 2, padding='same', input_shape=(4,2))) 
model.add(Conv1D(64, 2, padding='same')) 
model.add(Dropout(0.2))
model.add(Conv1D(32, 2, padding='same'))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(1))


# 3. Compile and Training
model.compile(loss='mse', optimizer='adam')

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=128,
          callbacks=[earlyStopping],
          batch_size=32)


# 4. Evaluation and Prediction
result = model.evaluate(x_test, y_test)
print("loss: ", result)

y_predict = model.predict(x_test)

r2 = r2_score(y_test, y_predict)
print("R2: ", r2)



'''
Result
loss:  0.3196415901184082
R2:  0.7582663893325312

'''
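
The reshape rule used in the script above (keep timesteps * features equal to the original feature count) can be checked on a toy array. A small NumPy sketch with 3 samples of 8 features, all values illustrative:

```python
import numpy as np

x = np.arange(24).reshape(3, 8)    # 3 samples, 8 features each

a = x.reshape(-1, 8, 1)            # 8 timesteps x 1 feature
b = x.reshape(-1, 4, 2)            # 4 timesteps x 2 features
c = x.reshape(x.shape[0], 2, 4)    # 2 timesteps x 4 features
print(a.shape, b.shape, c.shape)

print(b[0])                        # the 8 values of sample 0, regrouped in pairs

try:
    x.reshape(3, 3, 2)             # 3 * 2 = 6 values per sample does not match 8
except ValueError as err:
    print("ValueError:", err)
```

Passing the sample count explicitly, as in the last call, makes a mismatched shape fail loudly instead of silently moving values between samples.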

# conv1D_fetch_covtype.py

import pandas as pd
import numpy as np

from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Dropout, Conv2D, Flatten, MaxPooling2D, Conv1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.datasets import mnist, cifar10, cifar100, fashion_mnist

from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


# 1. Data
dataset = fetch_covtype()

x = dataset.data # for training
y = dataset.target # for predict

y = to_categorical(y)
print(y.shape) # (581012, 8)

y = np.delete(y, 0, axis=1)
print(y.shape)  # (581012, 7)

x_train, x_test, y_train, y_test = train_test_split(
    x, y,
    train_size=0.7,
    random_state=123
)

scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

print(x_train.shape, x_test.shape) # (406708, 54) (174304, 54)

x_train = x_train.reshape(406708, 9, 6)
x_test = x_test.reshape(174304, 9, 6)


# 2. Model Construction
model = Sequential()
model.add(Conv1D(128, 2, padding='same', input_shape=(9,6))) 
model.add(Conv1D(64, 2, padding='same')) 
model.add(Dropout(0.2))
model.add(Conv1D(32, 2, padding='same'))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(7, activation='softmax'))


# 3. Compile and Training
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=128,
          callbacks=[earlyStopping],
          batch_size=256)


# 4. Evaluation and Prediction
loss, accuracy = model.evaluate(x_test, y_test)
print("loss: ", loss)
print("accuracy: ", accuracy)

y_predict = model.predict(x_test)
y_predict = np.argmax(y_predict, axis=1) # (174304, 7) -> (174304,)
y_test = np.argmax(y_test, axis=1) # (174304, 7) -> (174304,)
# y was one-hot encoded, so its shape is (n_samples, n_classes) until argmax is applied

r2 = r2_score(y_test, y_predict)
print("R2: ", r2)


'''
Result
loss:  0.5182842016220093
accuracy:  0.7794886827468872
R2:  0.42494066843707257

'''

# conv1D_fashion.py

import pandas as pd
import numpy as np

from tensorflow.keras.models import Sequential, Model, load_model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.layers import Dense, SimpleRNN, LSTM, Dropout, Conv2D, Flatten, MaxPooling2D, Conv1D
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.datasets import mnist, cifar10, cifar100, fashion_mnist

from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.metrics import mean_squared_error, r2_score


# 1. Data
(x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()

print(x_train.shape, x_test.shape) # (60000, 28, 28) (10000, 28, 28)

x_train=x_train/255.
x_test=x_test/255.


# 2. Model Construction
model = Sequential()
model.add(Conv1D(128, 2, padding='same', input_shape=(28,28))) 
model.add(Conv1D(64, 2, padding='same')) 
model.add(Dropout(0.2))
model.add(Conv1D(32, 2, padding='same'))
model.add(Flatten())
model.add(Dense(16, activation='relu'))
model.add(Dense(10, activation='softmax'))


# 3. Compile and Training
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', patience=32, restore_best_weights=True, verbose=1)

model.fit(x_train, y_train,
          validation_split=0.2,
          epochs=128,
          callbacks=[earlyStopping],
          batch_size=256)


# 4. Evaluation and Prediction
loss, accuracy = model.evaluate(x_test, y_test)
print("loss: ", loss)
print("accuracy: ", accuracy)

y_predict = model.predict(x_test)
y_predict = np.argmax(y_predict, axis=1) # (10000, 10) -> (10000,)

r2 = r2_score(y_test, y_predict) # y_test = (10000,), y_predict = (10000,)
print("R2: ", r2)



'''
Result
loss:  0.3900017738342285
accuracy:  0.864799976348877
R2:  0.7752727272727273

'''
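
To see where the capacity of the Conv1D model above sits, the trainable parameter counts can be worked out by hand. The sketch below is only an illustration: the helper names `conv1d_params` and `dense_params` are hypothetical, and the formulas assume the default settings used in the listing (bias enabled, `padding='same'`, stride 1).

```python
def conv1d_params(kernel_size, in_channels, filters):
    """Weights plus one bias per filter for a Conv1D layer."""
    return (kernel_size * in_channels + 1) * filters


def dense_params(in_features, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (in_features + 1) * units


timesteps, channels = 28, 28  # input_shape=(28, 28) in the fashion_mnist listing

layers = [
    ("Conv1D(128, 2)", conv1d_params(2, channels, 128)),
    ("Conv1D(64, 2)", conv1d_params(2, 128, 64)),
    ("Conv1D(32, 2)", conv1d_params(2, 64, 32)),
    # padding='same' keeps all 28 steps, so Flatten yields 28 * 32 values
    ("Dense(16)", dense_params(timesteps * 32, 16)),
    ("Dense(10)", dense_params(16, 10)),
]

total_params = sum(count for _, count in layers)
for name, count in layers:
    print(f"{name:>15}: {count:>6}")
print(f"{'total':>15}: {total_params:>6}")
```

Dropout and Flatten have no weights, so they do not appear in the list. The numbers can be compared against `model.summary()`.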

Source code

🔗 HJ0216/TIL

References

📑 [Deep Learning][NLP] Bidirectional RNN

📑 [Pytorch] Conv1D + LSTM model implemented in Pytorch


[Warning] Allocation of ... exceeds 10% of free system memory

HJ0216 2023. 1. 29.

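Environment: IDE: VS Code, Language: Python

⚠️ With a large dataset, raising batch_size during model training triggers this warning.

→ The warning means that system (CPU) memory is running low. If the amount of data processed in a single training step is set too large, an out-of-memory error can follow, so reduce the batch_size passed to fit.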

model.fit(x_train, y_train, validation_split=0.2, epochs=1, callbacks=[earlyStopping], batch_size=128)
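
As a rough illustration of why a smaller batch_size helps, the size of a single batch tensor grows linearly with the batch size. The sketch below is an assumption-laden estimate, not a description of TensorFlow's allocator: `tensor_bytes` is a hypothetical helper, values are assumed to be float32 (4 bytes), and the `(9, 128)` per-sample shape is only an example.

```python
from math import prod


def tensor_bytes(batch_size, per_sample_shape, bytes_per_value=4):
    """Size in bytes of one tensor that holds a whole batch (float32 assumed)."""
    return batch_size * prod(per_sample_shape) * bytes_per_value


# Hypothetical layer output of (9, 128) values per sample
for batch_size in (1024, 256, 128):
    size = tensor_bytes(batch_size, (9, 128))
    print(f"batch_size={batch_size:>5}: {size / 1024**2:.2f} MiB")
```

Halving batch_size halves every such tensor, which is why lowering it is usually the first thing to try when the warning appears.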

References

📑 Allocation of 406978560 exceeds 10% of free system memory


[Project] Stock price prediction using Ensemble model

HJ0216 2023. 1. 28.

Environment: IDE: VS Code, Language: Python

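1. Project overview
1.1. Background

Date	KOSPI market cap (trillion KRW)	Samsung Electronics market cap (trillion KRW)	Share (%)
2023-01-27	1,966.56	385.65	approx. 19.61%

As of January 27, 2023, Samsung Electronics accounted for roughly 20% of the total KOSPI market capitalization.
Because a single company makes up one fifth of KOSPI, which serves as both an economic and an investment indicator for Korea, forecasting the Samsung Electronics share price is worthwhile.

1.2. Goal

Predict the opening price of Samsung Electronics on January 30, 2023.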

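2. Data analysis
2.1. Data description

Date (일자): data from 2015.01 to 2023.01
Open (시가): price of the first trade after the market opens
High (고가): highest traded price during the session
Low (저가): lowest traded price during the session
Close (종가): price of the last trade of the session
Change (전일비): difference between the close and the previous day's close
Change rate (등락률): (change / previous day's close) * 100
Volume (거래량): number of shares traded during the session
Amount, million KRW (금액(백만)): value of shares traded during the session
Credit ratio (신용비): value of shares bought on credit (margin)
Retail (개인): net trading volume of retail investors during the session
Institutional (기관): net trading volume of institutional investors during the session
Foreign, shares (외인(수량)): net trading volume of foreign investors during the session
Foreign brokerage (외국계): net trading volume of parties trading through foreign securities firms
Program (프로그램): net program-trading volume during the session
Foreign ownership ratio (외인비): shares held by foreign investors as a percentage of listed shares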
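
The change (전일비) and change rate (등락률) columns can be recomputed from two consecutive closing prices. The sketch below is a minimal illustration: `change_and_rate` is a hypothetical helper, and the two closing prices are the 종가 values of the last two rows in the sample output shown in section 3.

```python
def change_and_rate(prev_close, close):
    """Return the day-over-day change and the change rate in percent."""
    change = close - prev_close
    rate = change / prev_close * 100  # relative to the previous day's close
    return change, rate


# 종가 on two consecutive trading days (63,400 then 63,900)
change, rate = change_and_rate(prev_close=63400, close=63900)
print(f"change: {change:+d} KRW, rate: {rate:+.2f}%")
```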

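2.2. Data preprocessing
2.2.1. Samsung Electronics stock was split 50:1 on May 4, 2018, which creates a large discontinuity in the price series, so data before that date is not used.
2.2.2. Since the target is the Samsung Electronics opening price ('시가'), the five features whose trend lines most closely follow the '시가' column were selected for the model.
(To favor recent behavior, only the most recent two years of data were used for feature selection.)
-> Open, Low, High, Close, Foreign ownership ratio

Trend lines of Open, High, Low, Close, 2021.01.04 ~ 2023.01.27

Trend lines of Open and Foreign ownership ratio, 2021.01.04 ~ 2023.01.27

* In the Open vs. Foreign ownership ratio chart, the opening price was divided by 1400 to bring the two series onto a comparable scale.

2.2.3. The foreign ownership ratio is far smaller in magnitude than the other x features, so it is rescaled to a comparable range.
Foreign ownership ratio (%) -> Foreign ownership ratio * 1400
2.2.4. For the RNN model, the rows are re-sorted from oldest to newest so that the most recent data comes last and carries more influence.
2.2.5. The model assumes that trading on a given day affects the next day's opening price, so the target '시가' is offset by one day from the other features (illustrated by an image in the original post).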

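2.2.6. Amorepacific stock data is used to test whether merging in a model trained on an unrelated company's stock improves performance.

The Amorepacific data goes through the same preprocessing as the Samsung Electronics data.
~~- Because of its stock split, data before May 8, 2015 is not used~~
-> The ensemble model needs datasets of equal size
-> Data from May 4, 2018 onward is used, matching the Samsung Electronics split date
- Features are the same as for Samsung Electronics: Open, High, Low, Close, Foreign ownership ratio
- Foreign ownership ratio rescaled: ratio * 5700
- For the RNN model, rows are sorted by index in descending order so that the most recent data carries more influence
- Dates are separated into training data and prediction data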

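	Training data	Prediction data
x dates	2018.05.04 - 2023.01.26	2023.01.27
x features	Open, High, Low, Close, Foreign ownership ratio	Open, High, Low, Close, Foreign ownership ratio
y dates	next trading day after 2018.05.04 - 2023.01.27	2023.01.30
y target	Open	Open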
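
The one-day offset between features and target described in 2.2.5 can be sketched with plain lists. This is an illustration only: the variable names are hypothetical, and the rows are the (시가, 고가, 저가, 종가) values from the sample output in section 3, ordered oldest to newest.

```python
# Rows ordered oldest to newest: (open, high, low, close)
rows = [
    (60700, 61000, 59900, 60400),
    (60500, 61500, 60400, 61500),
    (62100, 62300, 61100, 61800),
    (63500, 63700, 63000, 63400),
    (63800, 63900, 63300, 63900),
]

# Features of day t are paired with the opening price of day t+1
features = rows[:-1]                      # every day except the last
targets = [row[0] for row in rows[1:]]    # opening price of the following day

# The last row has no known next-day open yet; it is the prediction input
latest = rows[-1]

for x, y in zip(features, targets):
    print(f"x={x} -> next-day open y={y}")
print(f"prediction input: {latest}")
```

Slicing the features with `[:-1]` and the target with `[1:]` is the same idea as `samsung[:1165]` and `samsung['시가'][1:]` in the listing in section 3.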

3. 모델링 및 분석 결과

# samsung_predict_open_stock_price.py
# For Detail: https://hj0216.tistory.com/74

import pandas as pd
import numpy as np

from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Input, concatenate, SimpleRNN, LSTM, Dropout, GRU, Bidirectional, Flatten
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

from sklearn.model_selection import train_test_split

path = './keras/'
# path: the keras directory under the current working directory


# 1.1. Data Preprocessing(Samsung)
samsung = pd.read_csv(path+'stock_samsung.csv', encoding='CP949', nrows=1166, usecols=[1,2,3,4,16], header=0)
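# Korean text is garbled when pandas read_csv uses the default encoding -> add encoding='CP949'
# The stock split changed the price by a factor of about 50, so only post-split rows are read -> nrows
# Only 5 feature columns are read: 1 (시가, open), 2 (고가, high), 3 (저가, low), 4 (종가, close), 16 (외인비, foreign ownership ratio) -> usecols
# The first row holds the column names -> header=0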

samsung['시가'] = samsung['시가'].str.replace(',', '').astype('float')
samsung['고가'] = samsung['고가'].str.replace(',', '').astype('float')
samsung['저가'] = samsung['저가'].str.replace(',', '').astype('float')
samsung['종가'] = samsung['종가'].str.replace(',', '').astype('float')
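# When the CSV stores numbers in accounting/currency format, the ',' makes pandas read the column as object (str), so remove ',' and then cast
samsung['외인비'] = samsung['외인비']*1400
# 외인비 is far smaller than the other features, so multiply by 1400 to bring it to a similar magnitude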

samsung = samsung.sort_index(ascending=False)
# sort_index returns a new DataFrame rather than sorting in place, so the result must be assigned back
# samsung = samsung.sort_values(ascending=True, by=['일자'])

'''
print(samsung[:1165].tail())
Last 5 rows of the samsung dataset, excluding the final row: 2023.01.18 - 2023.01.26

    시가     고가    저가    종가    외인비
5  60,700  61,000  59,900  60,400  70126.0
4  60,500  61,500  60,400  61,500  70210.0
3  62,100  62,300  61,100  61,800  70238.0
2  63,500  63,700  63,000  63,400  70392.0
1  63,800  63,900  63,300  63,900  70476.0
'''

samsung_open = samsung['시가'][1:]
# The next day's value is predicted from the previous day's data, so the 시가 series used as y drops the first day (2018.05.04)
'''
print(samsung_open.head())
First 5 rows of samsung_open: the five trading days following 2018.05.04

1164    52,600
1163    52,600
1162    51,700
1161    52,000
1160    51,000
'''

x1_train, x1_test, y_train, y_test = train_test_split(
    samsung[:1165], samsung_open,
    shuffle=True,
    train_size=0.7,
    random_state=123
)
# samsung[:1165]: exclude the 2023.01.27 row, which is needed for the prediction, from the training data

print(x1_train.shape, x1_test.shape) # (815, 5) (350, 5)
print(y_train.shape, y_test.shape) # (815,) (350,)

x1_train = x1_train.to_numpy()
x1_test = x1_test.to_numpy()
# DataFrame -> NumPy so that reshape can be applied

x1_train = x1_train.reshape(815, 5, 1)
x1_test = x1_test.reshape(350, 5, 1)


# 1.2. Data Preprocessing(Amore)
amore = pd.read_csv(path+'stock_amore.csv', encoding='CP949', nrows=1166, usecols=[1,2,3,4,16])
# nrows=1902 (amore's own split date) fails with: Make sure all arrays contain the same number of samples.
# -> nrows=1166 (match the shape of the post-split samsung data)
amore['시가'] = amore['시가'].str.replace(',', '').astype('float')
amore['고가'] = amore['고가'].str.replace(',', '').astype('float')
amore['저가'] = amore['저가'].str.replace(',', '').astype('float')
amore['종가'] = amore['종가'].str.replace(',', '').astype('float')
amore['외인비'] = amore['외인비']*5700
amore = amore.sort_index(ascending=False)

x2_train, x2_test = train_test_split(
    amore[:1165],
    shuffle=True,
    train_size=0.7,
    random_state=123
)

print(x2_train.shape, x2_test.shape) # (815, 5) (350, 5)


# 2. Model Construction
# 2-1. Model_1(Samsung)
input1 = Input(shape=(5,1))
lstm1_1 = LSTM(units=64, return_sequences=True)(input1)
# the input shape is already defined by Input(shape=(5,1)), so input_shape is not repeated here
# gru1_2 = GRU(64, activation='relu')(lstm1_1)
dense1_2 = Dense(32, activation='relu')(lstm1_1)
dense1_3 = Dense(16, activation='relu')(dense1_2)
flatten1_4 = Flatten()(dense1_3)
# A `Concatenate` layer requires inputs with matching shapes except for the concatenation axis.
output1 = Dense(16, activation='relu')(flatten1_4)

# 2-2. Model_2(Amore)
input2 = Input(shape=(5,))
dense2_1 = Dense(64, activation='relu')(input2)
dense2_2 = Dense(32, activation='linear')(dense2_1)
dropout2_3 = Dropout(0.1)(dense2_2)
dense2_4 = Dense(16, activation='linear')(dropout2_3)
output2 = Dense(8, activation='relu')(dense2_4)

# 2-3. Model_merge
merge3 = concatenate([output1, output2])
merge3_1 = Dense(64, activation='relu')(merge3)
merge3_2 = Dense(32, activation='relu')(merge3_1)
last_output = Dense(1)(merge3_2)

model = Model(inputs=[input1, input2], outputs=last_output)

model.summary()


# 3. compile and train
model.compile(loss='mse', optimizer='adam')

earlyStopping = EarlyStopping(monitor='val_loss', mode='min', patience=64, restore_best_weights=True, verbose=1)

modelCheckPoint = ModelCheckpoint(monitor='val_loss', mode='auto', verbose=1,
                                   save_best_only=True,
                                   filepath='./_save/MCP/samsumg_open_MCP.hdf5')

model.fit([x1_train, x2_train], y_train,
          validation_split=0.2,
          callbacks=[earlyStopping, modelCheckPoint],
          epochs=256,
          batch_size=64)


# 4. evaluate and predict
loss = model.evaluate([x1_test, x2_test], y_test)


result = model.predict([samsung[1165:].to_numpy().reshape(1,5,1),amore[1165:]])
# the predict input must have the same shape as the training data

print("Samsung Electronics market price prediction : ", result)

'''
Result
Samsung Electronics market price prediction :  [[65331.867]]

'''

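3.1. Model construction
3.1.1. Model 1: trained on about 4 years and 6 months of Samsung Electronics prices and related data
3.1.2. Model 2: trained on about 4 years and 6 months of Amorepacific prices and related data
3.1.3. Model 3: models 1 and 2 merged into a new model that predicts the Samsung Electronics opening price for Monday, January 30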
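
The way the two branches are merged can be checked by tracing tensor shapes by hand (batch axis omitted). The sketch below is a simplified illustration, not Keras itself: the helpers `lstm`, `dense` and `flatten` are hypothetical stand-ins that only reproduce the output-shape rules of the corresponding layers.

```python
def lstm(shape, units, return_sequences):
    """LSTM keeps the time-step axis only when return_sequences=True."""
    timesteps, _ = shape
    return (timesteps, units) if return_sequences else (units,)


def dense(shape, units):
    """Dense acts on the last axis and leaves the others untouched."""
    return shape[:-1] + (units,)


def flatten(shape):
    """Flatten collapses every non-batch axis into one."""
    size = 1
    for dim in shape:
        size *= dim
    return (size,)


# Branch 1 (Samsung): Input(5, 1) -> LSTM -> Dense -> Dense -> Flatten -> Dense
s = (5, 1)
s = lstm(s, 64, return_sequences=True)   # (5, 64)
s = dense(s, 32)                         # (5, 32)
s = dense(s, 16)                         # (5, 16)
s = flatten(s)                           # (80,)
branch1 = dense(s, 16)                   # (16,)

# Branch 2 (Amore): Input(5,) -> Dense x4 (Dropout does not change the shape)
b = (5,)
for units in (64, 32, 16, 8):
    b = dense(b, units)
branch2 = b                              # (8,)

# concatenate joins the two 1-D outputs along the last axis
merged = (branch1[0] + branch2[0],)
print(branch1, branch2, merged)
```

Because `return_sequences=True` leaves a (5, 16) tensor after the Dense layers, the Flatten step is what makes branch 1 one-dimensional so that it can be concatenated with branch 2.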

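3.2. Training and prediction
Samsung Electronics market price prediction : [[65331.867]]

3.3. Evaluation

(Actual) Samsung Electronics opening price on January 30: 64,900 KRW
(Predicted) Samsung Electronics opening price on January 30: 65,331 KRW
(Difference): +431 KRW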
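
The difference above can also be expressed relative to the actual price. The short sketch below only redoes that arithmetic with the two numbers from this section; the variable names are illustrative, and a single-day error says little about how the model would behave on other days.

```python
actual = 64900.0        # actual opening price on January 30
predicted = 65331.867   # model output from section 3.2

abs_error = predicted - actual
pct_error = abs_error / actual * 100

print(f"absolute error: {abs_error:+.3f} KRW")
print(f"percentage error: {pct_error:+.2f}%")
```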

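3.4. Implications and limitations
3.4.1. Implications
Even when the model is built and trained on stocks that have no relationship to each other, it can produce a result that does not stray far from the current share price, that is, a meaningful result.

3.4.2. Limitations
The opening price was not expected to move far from the current price, so the model was trained with a value near the current price as the goal.
Hyper-parameter tuning*, training and prediction were repeated until the expected value came out.
When the prediction was close to the current share price, the loss was very high, which raises doubts about the suitability of the model architecture and the training.
-> When tuning for the minimum loss and the best weights, the loss dropped but the prediction moved further from the current price, so training in effect proceeded by lowering the model's performance.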

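* Hyper-parameter tuning details
Model
- Samsung Electronics dataset: first implemented as Bidirectional(LSTM) + GRU
The predicted price stayed in the 50,000 KRW range, so a unidirectional LSTM was used instead
return_sequences=True -> the LSTM output keeps the time-step axis of the input (one output vector per step), and Flatten() is applied afterwards
-> If the data fed to an RNN layer is not sequential, the RNN may actually perform worse
It still needs to be checked whether the output of Bidirectional(LSTM) is time-series data
- Amore dataset: DNN model
- Merge model: DNN model
- Every layer initially started at 128 nodes, but the predicted price came out in the 50,000 KRW range, so this was lowered to 64
- Models with 5 to 7 layers including dropout were tried, but the predicted price never rose above the low 60,000 KRW range
-> Using Dropout moved the prediction away from the expected opening price
Dropout is meant to address overfitting, but perhaps because a dataset of about 1,200 rows is small, it lowered performance rather than helping
- All activation functions are relu or linear
-> Inputs and outputs are all positive and of similar magnitude, so the difference between the two functions was not considered significant

Training
- With EarlyStopping, training ends before the epoch count given to fit, so patience was tuned rather than epochs
- To use validation loss as the criterion for EarlyStopping and ModelCheckpoint, 20% of the training data was set aside as validation data
- restore_best_weights=True in EarlyStopping restores the weights from the best epoch rather than from the epoch where training stopped
- batch size (number of samples processed together in one training step): values of about 16 to 64 gave predictions close to the January 27 share price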
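
To make the batch size setting concrete, the number of weight updates per epoch can be estimated from the 815 training rows and the 20% validation split used above. This is a rough sketch: `steps_per_epoch` is a hypothetical helper, and it assumes that the validation rows are removed before batching and that the last partial batch still counts as one step.

```python
import math


def steps_per_epoch(n_samples, validation_split, batch_size):
    """Return (rows left for training, weight updates per epoch)."""
    n_train = math.floor(n_samples * (1.0 - validation_split))
    return n_train, math.ceil(n_train / batch_size)


for batch_size in (16, 32, 64):
    n_train, steps = steps_per_epoch(815, 0.2, batch_size)
    print(f"batch_size={batch_size:>2}: {n_train} training rows, {steps} steps per epoch")
```

A smaller batch size therefore means more, noisier updates per epoch, which is one reason the prediction can shift when only batch_size is changed.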

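4. Reflections on the project
This was my first time doing both data preprocessing and model construction since starting the AI course, and I felt a large gap between what I wanted to implement and what my skills allowed.
- I wanted to compare features by drawing line graphs with matplotlib's plot(), but the figure would not grow beyond a certain size, so in the end I made the judgment in Excel.
- Because of time constraints I could not use any data beyond the original dataset, which I regret.
- The target is a Monday opening price, the first trade after a weekend, so I wanted to extract the day of the week and check whether Mondays behave differently from other days. Without a grasp of basic Python syntax, however, I judged that there was not enough time to extract the weekday and filter on it, and I regret skipping it.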

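Even so, building this stock price prediction model gave me a strong sense of accomplishment.
- Above all, I am glad that I finished without giving up. Over about 12 hours and many errors, I kept going and produced a prediction close to the current Samsung Electronics share price (64,600 KRW).
- For Dacon's Seoul Ttareungi (public bike) demand forecast and Kaggle's bike demand forecast, the models were built together in class, so I never had to decide which features to include or how to handle missing values. In this project I had to think through the preprocessing myself, and what I remember most is the effort of applying what I had learned and running many rounds of hyper-parameter tuning to reduce the loss.
- On the knowledge side, I learned once again that what matters when processing data is its type and shape.
I spent about 6 hours on preprocessing and expected the modelling to go quickly from there, but I found that when a CSV column is stored in currency or accounting format it is read as a string and the model cannot be built, and I also found the fix.
I also learned, while reshaping data for the RNN, that the available methods differ depending on the data container (pandas, numpy, etc.).
- I learned several ways to use only part of the data, such as nrows and usecols, and slicing, a data-splitting method I used to find difficult, now comes easily thanks to this project.
- It was a frequent mistake, so I note it again so as not to forget: the shape of the training data and the shape of the prediction data must always match. Put simply, if the model was trained on data in a particular form, prediction must be prepared in the same form.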

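In the future I would like to study Python and deep-neural-network-based AI on my own and redo this project, which surely has many shortcomings. I also believe I was only able to complete the model thanks to the many people who share their knowledge online. With gratitude, I will keep sharing what I learn, accurately and consistently.

Finally, a brief note on my impressions and the kind of developer I want to become.
Writing Python code for tasks I used to do easily in Excel was hard. From simple operations such as addition and multiplication to drawing graphs, code written exactly as in the search results failed more often than not. Repeating this over the past two days made me think about how much effort developers must have put in to give users convenient features.
Through this, my vague intention to become a developer turned into a decision about 'what kind' of developer to be. In keeping with my attitude toward life, to give at least as much as I receive, I want to offer users a variety of conveniences as a developer and keep sharing what I learn with those who need it.

This concludes the post on the 12-hour project to build a Samsung Electronics opening price prediction model.

Thank you.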

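➕ Weights change when loading a model saved by ModelCheckpoint
Result when saving with ModelCheckpoint

Result when loading from ModelCheckpoint

After checking the prediction, the weights were saved to an .hdf5 file through ModelCheckpoint, but the prediction differed after loading the file
-> No further training was done after obtaining a prediction close to the opening price, so the root cause was not found. The lesson is that a saved model should always be loaded and checked.

(+ Thanks also to my old LG Gram laptop, which ran the model faithfully despite having a Pentium CPU.
Go LG Electronics..!)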

➕ Drawing graphs with Python

# Visualization to find features whose trend resembles that of the opening price (시가)

import pandas as pd

import matplotlib.pyplot as plt

path = './keras/'


# Load every column (no usecols) because all of them are plotted below;
# thousands=',' parses values such as "60,700" as numbers instead of strings
samsung = pd.read_csv(path+'stock_samsung.csv', encoding='CP949', nrows=1166, thousands=',')

date = samsung['일자']
features = ['시가', '고가', '저가', '종가', '거래량', '금액(백만)',
            '개인', '기관', '외인(수량)', '외국계', '프로그램', '외인비']

plt.figure(figsize=(20, 6)) # width, height in inches
for name in features:
    plt.plot(date, samsung[name]) # one line per feature
plt.gca().axes.xaxis.set_visible(False) # hide the x-axis values
plt.gca().axes.yaxis.set_visible(False) # hide the y-axis values

plt.show()

➕ Pandas/Python Data Type

Pandas dtype	Python type	Notes
object	str	string (including single characters)
int64	int	integer
float64	float	floating point
bool	bool	True or False
datetime64	datetime	date and time values

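Source code
🔗 HJ0216/TIL

📚 References
Korea Exchange index price statistics
Samsung Electronics stock information
[Python] Finding rows with a specific value in a CSV with pandas
Pandas df.head(), df.tail()
01. Sorting data by value: sort_values()
Removing the axes from a matplotlib graph in Python
Loading only part of an Excel or CSV file in Pandas and specifying data types and formats
[Python pandas] Reading numbers in a CSV file as strings (using the dtype option)
[Python pandas] Removing the thousands-separator comma (',') from numbers in a DataFrame
[Python] Converting a Pandas DataFrame to a numpy array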
