TensorFlow

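Overview

This post works out how to read a large dataset in TensorFlow and feed it from NumPy to a model in batches, based on r1.8, and collects the code snippets I wrote along the way.
The documentation is detailed, but it is unfriendly about how to actually use the API, so I looked into this part and wrote it up.

The model is a simple DNN regression. The example below does not use TFRecord, which loads file objects directly, because the dataset is less than 1 GB.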

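The core of TensorFlow data import

You need to understand the dataset concept
- How a dataset is loaded depends on whether the data is read from memory or from disk
- For in-memory data, from_tensor_slices() is the usual choice
You need to understand the iterator concept
- An iterator needs an initialization step and a get_next() step that reads the data
- make_initializable_iterator() is usually sufficient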
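
The two concepts above can be modelled in plain Python. This is only a conceptual sketch, not how TensorFlow implements them: `SlicedBatches`, `initialize`, `run_epochs`, and the `OutOfRange` exception are hypothetical names that mimic the behaviour of from_tensor_slices(), batch(), the iterator's initializer, and get_next().

```python
class OutOfRange(Exception):
    """Hypothetical stand-in for tf.errors.OutOfRangeError."""


class SlicedBatches:
    """Toy model of from_tensor_slices((features, labels)).batch(n)."""

    def __init__(self, features, labels, batch_size):
        assert len(features) == len(labels)
        self.features = features
        self.labels = labels
        self.batch_size = batch_size
        self.pos = None  # not usable until initialize() is called

    def initialize(self):
        # Plays the role of sess.run(iterator.initializer): rewind to row 0.
        self.pos = 0

    def get_next(self):
        # Plays the role of running the tensors returned by iterator.get_next().
        if self.pos is None:
            raise RuntimeError("iterator has not been initialized")
        if self.pos >= len(self.features):
            raise OutOfRange()
        end = self.pos + self.batch_size
        batch = (self.features[self.pos:end], self.labels[self.pos:end])
        self.pos = end
        return batch


def run_epochs(it, epochs):
    """Collect the batch sizes seen in each epoch."""
    seen = []
    for _ in range(epochs):
        it.initialize()
        sizes = []
        while True:
            try:
                x, y = it.get_next()
            except OutOfRange:
                break
            sizes.append(len(x))
        seen.append(sizes)
    return seen


features = [[float(i), float(i) * 2] for i in range(5)]
labels = [[float(i)] for i in range(5)]
batch_sizes = run_epochs(SlicedBatches(features, labels, batch_size=2), epochs=2)
print(batch_sizes)
```

With 5 rows and a batch size of 2, every epoch ends with a smaller final batch, which is also what batch() does when the row count is not a multiple of the batch size.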

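Key points of the code below

The model is a DNN regression.
It shows how to read about 1 GB of CSV data from local disk and process it in batches.
Because the data is only about 1 GB, it is loaded into memory as NumPy arrays.
For speed and efficiency, training reads the data BATCH_SIZE (1000) rows at a time.
One pass over all the batches is an epoch, and this is repeated for 200 epochs.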
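
To make the batch and epoch numbers concrete, here is a small arithmetic sketch. The row count is a made-up value, since the post does not state how many rows the CSV has, and `steps_per_epoch` is a hypothetical helper name.

```python
import math

BATCH_SIZE = 1000
EPOCHS = 200


def steps_per_epoch(num_rows, batch_size=BATCH_SIZE):
    # The final batch may be partial, so round up.
    return math.ceil(num_rows / batch_size)


num_rows = 1234567  # hypothetical row count, not taken from the post
per_epoch = steps_per_epoch(num_rows)
total_steps = per_epoch * EPOCHS
last_batch = num_rows - (per_epoch - 1) * BATCH_SIZE
print(per_epoch, total_steps, last_batch)
```

Each step is one optimizer update, so the number of updates grows with both the row count and the number of epochs.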

Code

Loading packages

import tensorflow as tf
import pandas as pd
import numpy as np
import math

Loading the CSV files with pandas

trainX = pd.read_csv('./input/my_model_2_train_x.csv')
trainY = pd.read_csv('./input/my_model_2_train_y.csv')
testX = pd.read_csv('./input/my_model_2_test_x.csv')

Converting to NumPy

# as_matrix() is deprecated and was removed in pandas 1.0; .values returns the same NumPy array.
features = trainX.values.astype('float32')
labels = trainY.values.astype('float32')
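
Casting to float32 also fixes how much memory the arrays need, which is the reason loading everything into NumPy is workable here. A rough estimate, assuming a hypothetical row count (the post does not give one) and the 11 feature columns plus 1 label column that the shapes in this post show:

```python
BYTES_PER_FLOAT32 = 4


def float32_bytes(rows, cols):
    """Bytes taken by a dense rows x cols float32 array."""
    return rows * cols * BYTES_PER_FLOAT32


# Hypothetical row count; 11 feature columns plus 1 label column as in the post.
rows = 10000000
total = float32_bytes(rows, 11) + float32_bytes(rows, 1)
print(total / 1e6, "MB")
```

This counts only the final arrays. The pandas DataFrames exist alongside them while the conversion runs, so peak usage is higher.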

Setting up X and Y

X = tf.placeholder(tf.float32, shape=[None, features.shape[1]])
Y = tf.placeholder(tf.float32, shape=[None, labels.shape[1]])

Checking the shapes of X and Y

X.shape
TensorShape([Dimension(None), Dimension(11)])

Y.shape
TensorShape([Dimension(None), Dimension(1)])

Creating the Dataset

BATCH_SIZE = 1000
dataset = tf.data.Dataset.from_tensor_slices((features, labels))
dataset = dataset.batch(BATCH_SIZE)

Designing the layers

W1 = tf.get_variable("W1", shape=[X.shape[1], 50],
                     initializer=tf.contrib.layers.xavier_initializer())
b1 = tf.Variable(tf.random_normal([50]))
L1 = tf.nn.relu(tf.matmul(X, W1) + b1)


# W2 = tf.Variable(tf.random_normal([50, 10]))
W2 = tf.get_variable("W2", shape=[50, 10],
                     initializer=tf.contrib.layers.xavier_initializer())
b2 = tf.Variable(tf.random_normal([10]))
L2 = tf.nn.relu(tf.matmul(L1, W2) + b2)


# W3 = tf.Variable(tf.random_normal([10, 1]))
W3 = tf.get_variable("W3", shape=[10, 1],
                     initializer=tf.contrib.layers.xavier_initializer())
b3 = tf.Variable(tf.random_normal([1]))
hypothesis = tf.matmul(L2, W3) + b3
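
As a rough guide to what the Xavier initializer does with these shapes, the sketch below computes the sampling range for each weight matrix. It assumes the initializer's uniform mode, where weights are drawn from U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out)); `xavier_uniform_limit` is a hypothetical helper, not a TensorFlow function.

```python
import math


def xavier_uniform_limit(fan_in, fan_out):
    # Glorot/Xavier uniform: weights are drawn from U(-limit, +limit).
    return math.sqrt(6.0 / (fan_in + fan_out))


# Layer shapes from the post: 11 -> 50 -> 10 -> 1.
layer_shapes = [(11, 50), (50, 10), (10, 1)]
limits = [xavier_uniform_limit(i, o) for i, o in layer_shapes]
# Each layer has fan_in * fan_out weights plus fan_out biases.
n_params = sum(i * o + o for i, o in layer_shapes)
for shape, limit in zip(layer_shapes, limits):
    print(shape, round(limit, 4))
print("trainable parameters:", n_params)
```

The narrower the layer, the wider the range, so the last layer (10 to 1) starts with the largest weights.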

Designing the cost function

# Cost: root mean squared error (RMSE) between prediction and label
cost = tf.sqrt(tf.reduce_mean(tf.square(hypothesis - Y)))
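
The cost expression is the root mean squared error. A plain-Python version on three made-up values shows what it computes; `rmse` is a hypothetical helper written for this illustration.

```python
import math


def rmse(predictions, targets):
    # sqrt(mean((prediction - target)^2)), the same expression as `cost`.
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))


# Errors are 1, 0 and -2, so the mean squared error is 5/3.
value = rmse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
print(value)
```

Because of the square root, the cost is in the same unit as the label, which makes the printed values easier to read.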

Configuring the optimizer

# Minimize
optimizer = tf.train.AdamOptimizer(learning_rate=0.00001)
train = optimizer.minimize(cost)

Setting up the session

# Launch the graph in a session.
sess = tf.Session()
# Initializes global variables in the graph.
sess.run(tf.global_variables_initializer())

Setting up the iterator and the data tensors

iterator = dataset.make_initializable_iterator()
f, l = iterator.get_next()

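Training

for epoch in range(200):
    # Rewind the iterator to the start of the dataset for each epoch.
    sess.run(iterator.initializer)

    while True:
        try:
            # Pull the next batch as NumPy arrays, then feed it to the placeholders.
            x, y = sess.run([f, l])
            cost_val, hy_val, train_val = sess.run([cost, hypothesis, train], feed_dict={X: x, Y: y})
        except tf.errors.OutOfRangeError:
            # Raised once the dataset is exhausted: the epoch is over.
            break
    # cost_val is the cost of the last batch of this epoch.
    print("epoch: ", epoch, cost_val)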

    epoch:  0 45.7199
    epoch:  1 10.9965
    epoch:  2 5.58393
    epoch:  3 3.41988
    epoch:  4 1.95303
    epoch:  5 1.69875
    epoch:  6 1.51069
    epoch:  7 1.38712
    epoch:  8 1.1897
    epoch:  9 1.03761
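
Note that the value printed per epoch is the cost of the last batch only. If a figure for the whole epoch is wanted, per-batch RMSE values cannot simply be averaged; the squared errors have to be recombined first. A minimal sketch, where `epoch_rmse` and the sample numbers are hypothetical:

```python
import math


def epoch_rmse(batch_rmses, batch_sizes):
    """Combine per-batch RMSE values into the RMSE over every row of the epoch."""
    # Undo sqrt and mean per batch to recover each batch's sum of squared errors.
    total_sq = sum(r * r * n for r, n in zip(batch_rmses, batch_sizes))
    return math.sqrt(total_sq / sum(batch_sizes))


# Hypothetical per-batch costs for one epoch; the last batch is partial.
batch_rmses = [2.0, 1.0, 4.0]
batch_sizes = [1000, 1000, 500]
combined = epoch_rmse(batch_rmses, batch_sizes)
print(combined)
```

In the training loop this would mean appending cost_val and len(x) to two lists inside the try block and calling the helper after the loop ends.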
