[Time Series Analysis] Stock Price Prediction with Stock Data (ARIMA Model Training and Testing)


멈머이 2024. 1. 16. 23:31


https://coding-heading.tistory.com/109

[Time Series Analysis] Stock Price Prediction with Stock Data (ARIMA, auto_arima)

This post continues from the previous one, linked above.

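* Checking the residuals with the Best Model

- Residual: the difference between the actual value and the predicted value
- Residual tests: check whether the residuals satisfy properties such as stationarity and normality

- Functions used for the checks: summary(), plot_diagnostics()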

model.summary()

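- What to check
* P>|z|: the p-value of each coefficient (below 0.05 is the usual sign that the coefficient is significant)
* Heteroskedasticity (H): tests whether the variance of the residuals stays constant over time. It is not a normality test (normality is reported by Jarque-Bera (JB)).

Criterion => read H together with Prob(H) (two-sided): a value above 0.05 means constant variance cannot be rejected.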
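To make the H statistic more concrete, here is a minimal sketch of the idea behind it: compare the size of the residuals late in the series with their size early in the series. The function name `break_variance_ratio` and the synthetic residuals are assumptions for illustration, and this is a simplified version of the idea, not the library's exact computation.

```python
import numpy as np

def break_variance_ratio(residuals):
    """Sum of squared residuals in the last third divided by the first third."""
    r = np.asarray(residuals, dtype=float)
    h = len(r) // 3                      # size of each compared block
    first = np.sum(r[:h] ** 2)
    last = np.sum(r[-h:] ** 2)
    return last / first

rng = np.random.default_rng(0)
stable = rng.normal(0.0, 1.0, size=600)          # constant variance
growing = stable * np.linspace(1.0, 3.0, 600)    # variance grows over time

ratio_stable = break_variance_ratio(stable)
ratio_growing = break_variance_ratio(growing)
print(ratio_stable, ratio_growing)
```

A ratio near 1 is what constant variance looks like; a ratio far from 1 in either direction suggests the residual variance changes over time.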

* Visualization

model.plot_diagnostics(figsize=(16,8))
plt.show()

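* Training and testing the ARIMA model

  - Split the data into training and test sets at a 9 : 1 ratio
  - Do not use train_test_split() for time series data, since it shuffles the rows by default.
     -> Because the observations are ordered in time, the series is cut at a single point: the earlier part becomes the training set and the later part the test set.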

train_data = data.iloc[0:int(len(data)*0.9)]
# data[ : int(len(data)*0.9)]
test_data = data.iloc[int(len(data)*0.9):]
# data[int(len(data)*0.9) : ]
train_data.shape, test_data.shape


# The commented-out lines above give the same result

Result: ((2265,), (252,))
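The same split can be wrapped in a small helper that also makes the "train comes before test" property easy to check. This is a sketch on a toy series; `chronological_split` and the toy data are hypothetical names, not part of the original post.

```python
import pandas as pd

def chronological_split(series, train_ratio=0.9):
    """Cut a time-ordered series at one point: earlier part -> train, later part -> test."""
    cut = int(len(series) * train_ratio)
    return series.iloc[:cut], series.iloc[cut:]

# Toy daily series standing in for the stock data
idx = pd.date_range("2021-01-01", periods=20, freq="D")
toy = pd.Series(range(20), index=idx)

train_part, test_part = chronological_split(toy)
print(train_part.shape, test_part.shape)
```

Every date in the training part precedes every date in the test part, which is the property a shuffled split would break.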

* Selecting and training the best model

- auto_arima fits the candidate models and returns the best one in a single call.
- auto_arima: configures the search and extracts the Best Model

model_fit = pm.auto_arima(
    y = train_data,
    d=n_diffs,
    start_p=0, max_p=3,
    start_q=0, max_q=3,
    m=1, seasonal=False,
    stepwise=True,
    trace=True
)

Result:

Performing stepwise search to minimize aic
ARIMA(0,1,0)(0,0,0)[0] intercept   : AIC=6127.619, Time=0.13 sec
ARIMA(1,1,0)(0,0,0)[0] intercept   : AIC=6119.112, Time=0.06 sec
ARIMA(0,1,1)(0,0,0)[0] intercept   : AIC=6119.471, Time=0.11 sec
ARIMA(0,1,0)(0,0,0)[0] : AIC=6134.024, Time=0.03 sec
ARIMA(2,1,0)(0,0,0)[0] intercept   : AIC=6120.306, Time=0.14 sec
ARIMA(1,1,1)(0,0,0)[0] intercept   : AIC=6118.765, Time=0.29 sec  <- lowest AIC, so this model is selected
ARIMA(2,1,1)(0,0,0)[0] intercept   : AIC=6120.763, Time=0.44 sec
ARIMA(1,1,2)(0,0,0)[0] intercept   : AIC=6120.763, Time=0.56 sec
ARIMA(0,1,2)(0,0,0)[0] intercept   : AIC=6120.804, Time=0.23 sec
ARIMA(2,1,2)(0,0,0)[0] intercept   : AIC=6120.503, Time=1.39 sec
ARIMA(1,1,1)(0,0,0)[0] : AIC=6126.014, Time=0.14 sec

Best model:  ARIMA(1,1,1)(0,0,0)[0] intercept
Total fit time: 3.519 seconds
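The selection rule in the trace above is simply "take the candidate with the smallest AIC". As a small sketch, the AIC values from the trace can be put in a dictionary (the labels are shortened by hand) and the minimum picked:

```python
# AIC values copied from the stepwise search trace above
aic_by_model = {
    "ARIMA(0,1,0) intercept": 6127.619,
    "ARIMA(1,1,0) intercept": 6119.112,
    "ARIMA(0,1,1) intercept": 6119.471,
    "ARIMA(0,1,0)": 6134.024,
    "ARIMA(2,1,0) intercept": 6120.306,
    "ARIMA(1,1,1) intercept": 6118.765,
    "ARIMA(2,1,1) intercept": 6120.763,
    "ARIMA(1,1,2) intercept": 6120.763,
    "ARIMA(0,1,2) intercept": 6120.804,
    "ARIMA(2,1,2) intercept": 6120.503,
    "ARIMA(1,1,1)": 6126.014,
}

# The best model is the one with the lowest AIC
best = min(aic_by_model, key=aic_by_model.get)
print(best, aic_by_model[best])
```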

model_fit.summary()

* As above, check the P>|z| p-values and Heteroskedasticity (H)

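* Forecasting (Predict) with the Best Model

- Terminology: in time series analysis, a prediction is called a forecast.
- Forecast output: the predicted values plus the upper and lower bounds of the confidence interval
- Visualization: a plot in which the actual and predicted values are connected
- Procedure: define a forecast function, then run predict inside it
: returns the forecast results

* Defining the forecast function

- model : Best Model
- n : number of future periods to forecast (default 1)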

import numpy as np

def forecast_n_step(model, n=1):

    fc, conf_int = model.predict(n_periods=n, return_conf_int=True)

    # print(fc, conf_int)

    # Return the results as plain lists
    return (
        fc.tolist()[0:n],
        np.asarray(conf_int).tolist()[0:n]
    )

- n_periods : number of periods to forecast (days here)
- return_conf_int : whether to return the confidence interval
- fc : forecast result (y_pred)
- conf_int : confidence interval
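The confidence interval comes back as one row per forecast step with two columns, which is why the code later reads `conf[0][0]` as the lower bound and `conf[0][1]` as the upper bound. A minimal sketch of that layout, using made-up numbers in place of a real model's output:

```python
import numpy as np

# Illustrative values only: one row per forecast step, columns = [lower, upper]
conf_int = np.array([
    [94.0, 98.9],
    [143.6, 149.9],
])

lower = conf_int[:, 0]   # first column: lower bounds
upper = conf_int[:, 1]   # second column: upper bounds
width = upper - lower    # interval width for each step

print(lower, upper, width)
```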

import pandas as pd

def forecast(n_steps, model, index, data=None):
    # Lists that collect the results to return
    y_pred = []
    pred_upper = []
    pred_lower = []

    # 1. Observed data is available: forecast one step, then update with the actual value
    if data is not None:
        for new_data in data:
            # One-step-ahead forecast (wrapped in a function so it can be repeated)
            fc, conf = forecast_n_step(model)

            # Store the point forecast
            y_pred.append(fc[0])

            # Upper bound
            pred_upper.append(conf[0][1])

            # Lower bound
            pred_lower.append(conf[0][0])

            # Update the model with each new observation
            model.update(new_data)

    # 2. No observed data: feed each forecast back into the model for n_steps steps
    else:
        for i in range(n_steps):
            # One-step-ahead forecast (wrapped in a function so it can be repeated)
            fc, conf = forecast_n_step(model)

            # Store the point forecast
            y_pred.append(fc[0])

            # Upper bound
            pred_upper.append(conf[0][1])

            # Lower bound
            pred_lower.append(conf[0][0])

            # Update the model with its own forecast
            model.update(fc[0])

    # Return the forecasts as a Series; the bounds stay as lists
    return pd.Series(y_pred, index=index), pred_upper, pred_lower
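The loop above is a walk-forward pattern: forecast one step, record it, then give the model the value that actually arrived. The sketch below shows the same pattern without pmdarima, using a hypothetical stand-in class `LastValueModel` that only mimics the `predict`/`update` shape of a fitted model; it is not a real ARIMA.

```python
class LastValueModel:
    """Hypothetical stand-in for a fitted model: forecasts the last value it has seen."""

    def __init__(self, history):
        self.history = list(history)

    def predict(self, n_periods=1):
        return [self.history[-1]] * n_periods

    def update(self, new_value):
        self.history.append(new_value)


def walk_forward(model, observations):
    """Forecast one step ahead, then update the model with the actual observation."""
    preds = []
    for actual in observations:
        preds.append(model.predict(n_periods=1)[0])
        model.update(actual)
    return preds


toy_model = LastValueModel([100.0, 101.0])
toy_preds = walk_forward(toy_model, [102.0, 103.0, 104.0])
print(toy_preds)
```

Each forecast is made before the model sees the corresponding actual value, so the test observations never leak into their own predictions.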

* Calling the function

- fc : forecast result
- upper : upper bound
- lower : lower bound

fc, upper, lower = forecast(len(test_data), model_fit, test_data.index, data=test_data)

fc, upper, lower

Result:

(Date
2021-10-29     96.433292
2021-11-01    146.727736
2021-11-02    145.113554 ====> fc : forecast result
2021-11-03    145.121247
2021-11-04    147.477981
                  ...

[98.85955210308333,
  149.885945735709,
  148.27285122968655,
  148.28001832760364, ====> upper : upper bound
  150.6140166410022,
  151.37735391245843,

...

[94.00703115015008,
  143.56952557499235,
  141.95425733814298, ====> lower : lower bound
  141.96247660445064,
  144.34194627399938,
  145.06294206348412,

...

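The forecast is already indexed by date, but the upper and lower bounds need one more step before they can be plotted below.

* Converting the upper- and lower-bound lists

- Convert them into Series indexed by date
- So that they line up with the index of the forecast when plotted later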

lower_series = pd.Series(lower, index=test_data.index)
upper_series = pd.Series(upper, index=test_data.index)

lower_series, upper_series

Result:

(Date
2021-10-29     94.007031
2021-11-01    143.569526
2021-11-02    141.954257
2021-11-03    141.962477
2021-11-04    144.341946
                  ...

Date
2021-10-29     98.859552
2021-11-01    149.885946
2021-11-02    148.272851
2021-11-03    148.280018
2021-11-04    150.614017
                  ...
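Once the bounds share the date index with the actual values, one simple check becomes possible: how often the actual value falls inside the interval. The sketch below uses small made-up Series rather than the real results, and the names `actual`, `lower_s`, `upper_s` are hypothetical.

```python
import pandas as pd

idx = pd.date_range("2021-11-01", periods=4, freq="D")

# Illustrative values only
actual = pd.Series([146.0, 150.5, 147.0, 140.0], index=idx)
lower_s = pd.Series([143.0, 142.0, 142.0, 144.0], index=idx)
upper_s = pd.Series([150.0, 148.0, 148.0, 151.0], index=idx)

# True where the actual value lies between the two bounds (aligned by date)
inside = (actual >= lower_s) & (actual <= upper_s)
coverage = inside.mean()
print(coverage)
```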

* Full visualization

# Plot the training and test data

plt.figure(figsize=(20,6))
plt.title("Time series analysis results")

### Plot the training data
# plt.plot(train_data, label="train_data")

### Plot the test data
plt.plot(test_data, label="test_data (actual values)", c="b")

### Plot the forecast for the test period
plt.plot(fc, label="forecast", c="r")

### Plot the upper and lower bounds
plt.fill_between(lower_series.index, lower_series,
                 upper_series, alpha=.5, color="k", label="upper/lower bound")

plt.legend()
plt.show()

* Model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error
import math

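* Mean squared error (MSE)

# np.exp() undoes a log transform, so it is only appropriate when the series was log-transformed before fitting.
# The values here (around 146) look like raw prices, so exponentiating them is what pushes the errors below to astronomical magnitudes.
mse = mean_squared_error(np.exp(test_data), np.exp(fc))
mse

Result: 4.53049754595729e+128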
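The size of this number is easier to understand with a quick look at what the exponential does to a value around 146. This is a small standard-library sketch; the price 146.0 is just a representative value taken from the forecast output earlier in the post.

```python
import math

price = 146.0

# Exponentiating a raw price produces an astronomically large number
exploded = math.exp(price)

# exp() is only the right inverse when the value was produced by log() first
log_price = math.log(price)
round_trip = math.exp(log_price)

print(exploded, log_price, round_trip)
```

If the model was fitted on raw prices, the metrics can be computed directly on `test_data` and `fc`; the exponential belongs only in a workflow that took the logarithm first.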

* Mean absolute error (MAE)

mae = mean_absolute_error(np.exp(test_data), np.exp(fc))
mae

Result: 2.98508982420223e+63

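* RMSE (Root Mean Squared Error)
- Measures the distance between the predicted and actual values, in the same units as the data
- The smaller the value, the better the model performs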

rmse = math.sqrt(mean_squared_error(np.exp(test_data), np.exp(fc)))
rmse

Result: 2.1284965459115245e+64

* MAPE (Mean Absolute Percentage Error)
- The mean of the absolute percentage errors between the predicted and actual values

mape = np.mean(np.abs(np.exp(fc) - np.exp(test_data)) / np.abs(np.exp(test_data)))
mape * 100

Result: 11890.53450462006
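For reference, the four metrics can be written out directly with NumPy, which makes the relationship between them easy to see. This is a sketch on two made-up points computed on the original scale; the function name `regression_metrics` is hypothetical.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, RMSE and MAPE (in percent) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err) / np.abs(y_true)) * 100
    return mse, mae, rmse, mape

# Illustrative values: errors of +10 and -20
mse_toy, mae_toy, rmse_toy, mape_toy = regression_metrics([100.0, 200.0], [110.0, 180.0])
print(mse_toy, mae_toy, rmse_toy, mape_toy)
```

On this toy input the errors are 10 and 20, giving an MSE of 250, an MAE of 15, and a MAPE of 10%. MAE can never exceed RMSE, which is a handy sanity check on reported results.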
