Lido Blog: Deep Learning - Understanding RNN & LSTM & GRU

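For the recurrent neural network (RNN), see the wiki entry, though after reading it you will probably still not understand what an RNN actually does.
Put simply, an RNN takes a sequence as input and processes it one element at a time, feeding the output of each time step back in alongside the next input. ...Still not clear, right?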

Some simple code explains it better. Be sure to read it line by line.

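Suppose we have a sequence input_sequence with elements input_t, and the network's current state is state_t. We first have to initialize state_t; before the network has seen any input, state_t = 0. An RNN can then be expressed simply as: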

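state_t = 0        # initial state of the network
for input_t in input_sequence:   # every element of the sequence passes through the network
    output_t = f(input_t, state_t)    # the output is computed from both input_t and state_t
    state_t = output_t     # the output becomes the state for the next step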

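The function f is the step function of the network: a linear transform of the input and the state, followed by an activation function. For example:
output_t = f(input_t, state_t) = np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)
W, U and b are the weight parameters learned during training, similar to those of a Dense layer, and tanh is the activation applied to the result.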
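
The loop and the step function above can be combined into a small runnable NumPy sketch. The sizes (timesteps, input_features, output_features) and the random, untrained weights are assumptions chosen only for illustration; a real layer would learn W, U and b during training.

```python
import numpy as np

timesteps = 5          # length of the input sequence (illustrative)
input_features = 3     # size of each input_t (illustrative)
output_features = 4    # size of state_t and output_t (illustrative)

rng = np.random.default_rng(0)
input_sequence = rng.normal(size=(timesteps, input_features))

# Random, untrained weights; a trained layer would have learned these.
W = rng.normal(size=(output_features, input_features))
U = rng.normal(size=(output_features, output_features))
b = np.zeros(output_features)

def f(input_t, state_t):
    # Linear transform of input and state, followed by the tanh activation.
    return np.tanh(np.dot(W, input_t) + np.dot(U, state_t) + b)

state_t = np.zeros(output_features)   # initial state of the network
outputs = []
for input_t in input_sequence:
    output_t = f(input_t, state_t)
    outputs.append(output_t)
    state_t = output_t                # the output becomes the next state

outputs = np.stack(outputs)           # one output vector per time step
print(outputs.shape)
```

Each row of outputs is the state after one time step, so the last row already depends on every element of the sequence.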

If this is still unclear, see the tutorial by the expert GoKu.

In Keras, adding an RNN to a model is very simple.

# Load the IMDb dataset and preprocess the data
from keras.datasets import imdb
from keras.preprocessing import sequence
max_features = 10000
maxlen = 200
batch_size = 32
print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')
print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)
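
To see what the padding step does to each review, here is a pure-Python sketch of the default behaviour of pad_sequences (pad at the front with 0, truncate from the front). The helper pad_pre and the toy reviews are hypothetical and only for illustration; they are not part of Keras.

```python
def pad_pre(seq, maxlen, value=0):
    """Hypothetical helper: keep the last `maxlen` items, left-pad with `value`."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short_review = [11, 12, 13]            # shorter than maxlen: gets padded
long_review = [1, 2, 3, 4, 5, 6, 7]    # longer than maxlen: gets truncated

print(pad_pre(short_review, maxlen=5))  # [0, 0, 11, 12, 13]
print(pad_pre(long_review, maxlen=5))   # [3, 4, 5, 6, 7]
```

With maxlen = 200, every review therefore becomes exactly 200 word indices, and only the last 200 words of a longer review are kept.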

Now add an RNN layer when defining the model and try it out in practice.

# Define the model and add an RNN layer
from keras.layers import Dense , Embedding , SimpleRNN
from keras.models import Sequential
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32)) # add the RNN layer to the model
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# Choose the optimizer, loss function and metrics
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])

# Train the model with fit
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

# Plot the training history
import matplotlib.pyplot as plt
acc = history.history['acc']
val_acc = history.history['val_acc']
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc) + 1)
plt.plot(epochs, acc, 'bo', label='Training acc')
plt.plot(epochs, val_acc, 'b', label='Validation acc')
plt.title('Training and validation accuracy')
plt.legend()
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

The accuracy reaches about 85%, but an earlier exercise using only fully connected layers reached 87.5%. Why is the accuracy lower with an RNN!?

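The reason is that when an RNN processes long text sequences, the gradient has to be propagated back through every time step h_t, so the recurrent weight W is multiplied into it once per step. When W > 1, W^t explodes exponentially, and when W < 1, it quickly approaches 0 (for a weight matrix, the largest singular value plays the role of W). This is the exploding and vanishing gradients problem, and it makes long-term dependencies hard to capture.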
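
A few lines of arithmetic show how quickly W^t grows or shrinks. This is a scalar simplification with illustrative values 0.9 and 1.1; it is not a gradient computation for a real network.

```python
steps = [1, 10, 50, 100]   # number of time steps the gradient passes through

factors = {}
for w in (0.9, 1.1):       # a weight slightly below 1 and one slightly above 1
    factors[w] = [w ** t for t in steps]
    print(w, ["%.3g" % x for x in factors[w]])
```

After 100 steps the factor for 0.9 is below 0.0001 while the factor for 1.1 is above 10000, even though both weights are close to 1.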

(Image source)

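As the figure illustrates, the longer the text being processed, the longer the unrolled expansion on the right-hand side of the equals sign becomes. This makes computation expensive and leads to exploding or vanishing gradients, so in practice the sequence length is capped instead of using the full text, which is why the result is not as good as with fully connected layers.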

So we turn to an improved version of the RNN: LSTM (Long Short-Term Memory).

Bengio et al., "Learning long-term dependencies with gradient descent is difficult", IEEE Transactions on Neural Networks, 1994
Pascanu et al., "On the difficulty of training recurrent neural networks", ICML 2013

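Long short-term memory (LSTM) is a special type of RNN that is able to handle long-term dependencies.

LSTM also offers a remedy for the vanishing/exploding gradient problem, mainly on the vanishing side (exploding gradients are usually handled separately, for example with gradient clipping).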

RNN vs LSTM cell representation, source: stanford

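A simple LSTM cell is made up of 4 gates,
namely i, f, o and g:

i: input gate, decides whether to write to the cell
f: forget gate, decides whether to erase the cell
o: output gate, decides how much of the cell to reveal
g: gate gate, so called for lack of a better name, decides how much to write to the cell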

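The difference from a plain RNN is that the network carries information across time steps. Its value at each time step is called Ct, where C stands for "carry".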

RNN vs LSTM cell representation, source: stanford

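Forget gate: given the previous output h(t-1) and the current input, the forget gate decides what must be removed from the previous cell state C(t-1), so that only useful information is kept. The forget gate is a sigmoid function, so it outputs values between 0 and 1.
Input gate + gate gate: together these decide what new content from the current input is added to the cell state, scaled by how much of it we want to add.
The sigmoid layer decides which values to update, and the tanh layer builds a vector of new candidate values to add to the current cell state.
Output gate: finally, this decides what to output from the cell state, which is done by a sigmoid function.
We pass the cell state through tanh to squash its values into (-1, 1), then multiply the result by the output of the sigmoid, so that we output only the parts we want.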

The gradient flow is shown in the figure below:

RNN vs LSTM cell representation, source: stanford

I turned it into an animated walkthrough, which may be easier to follow:

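Multiplying C(t-1) by f erases irrelevant information from the carry dataflow. Meanwhile, i and g both provide information about the present and update the carry track C(t) with new information: C(t) = f * C(t-1) + i * g, where * is elementwise multiplication (not a dot product).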
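
The gate descriptions and the carry update can be put together in a minimal NumPy sketch of a single LSTM step. The helper names sigmoid and lstm_step, the stacking order i, f, o, g inside W, U and b, the sizes, and the random untrained weights are all assumptions made for illustration; Keras stores its weights in its own layout.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step; W, U, b hold the stacked weights for i, f, o, g."""
    z = W @ x_t + U @ h_prev + b            # shape (4 * units,)
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # gates between 0 and 1
    g = np.tanh(g)                          # candidate values in (-1, 1)
    c_t = f * c_prev + i * g                # erase old carry, write new content
    h_t = o * np.tanh(c_t)                  # reveal part of the squashed carry
    return h_t, c_t

units, features, timesteps = 4, 3, 6        # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4 * units, features))
U = rng.normal(scale=0.5, size=(4 * units, units))
b = np.zeros(4 * units)

h_t = np.zeros(units)
c_t = np.zeros(units)
for x_t in rng.normal(size=(timesteps, features)):
    h_t, c_t = lstm_step(x_t, h_t, c_t, W, U, b)

print(h_t.shape, c_t.shape)
```

Note that the carry c_t is updated only by elementwise multiplication and addition, which is what lets information pass across many time steps.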

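In the end, though, these interpretations do not mean much, because what these operations actually do is determined by the weights that parameterize them, and the weights are learned end to end, starting from scratch with every training run. It is not possible to assign a specific purpose to any single operation.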

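As the author of Keras puts it:

You only need to remember what the LSTM cell is for: allowing past information to re-enter later, which addresses the vanishing gradient problem.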

With LSTM understood, adding an LSTM layer to a model in Keras is just as easy. In practice:

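# Add an LSTM layer to the model
from keras.layers import Dense , Embedding , LSTM
from keras.models import Sequential
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))  # LSTM layer
model.add(Dense(1, activation='sigmoid'))

# Compile and train with the same settings as the SimpleRNN model
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10, batch_size=128, validation_split=0.2)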

Plotting the training results:

The accuracy reaches 88%.

Setting maxlen = 500:

The accuracy reaches 88.5%.

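The gated recurrent unit (GRU) is similar to the LSTM and works on the same principle, but with some simplifications: it keeps a single state instead of a separate carry, and uses two gates (update and reset). This lets it perform better on smaller datasets (training is faster, though accuracy may be lower than with an LSTM).
In short, when training a model you can start with a GRU (faster) to get preliminary results, then switch to an LSTM to refine the model (more accurate).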
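
For comparison with the LSTM, here is a minimal NumPy sketch of a single GRU step. The helper names, the stacking order (update, reset, candidate), the sizes and the random untrained weights are assumptions for illustration. The final blend follows the convention in which the update gate z keeps the old state; some references swap z and 1 - z.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W, U, b):
    """One GRU time step; W, U, b hold stacked weights for update, reset, candidate."""
    Wz, Wr, Wh = np.split(W, 3)
    Uz, Ur, Uh = np.split(U, 3)
    bz, br, bh = np.split(b, 3)
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
    h_cand = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)   # candidate state
    return z * h_prev + (1.0 - z) * h_cand                # blend old and new

units, features, timesteps = 4, 3, 6        # illustrative sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(3 * units, features))
U = rng.normal(scale=0.5, size=(3 * units, units))
b = np.zeros(3 * units)

h_t = np.zeros(units)
for x_t in rng.normal(size=(timesteps, features)):
    h_t = gru_step(x_t, h_t, W, U, b)

print(h_t.shape)
```

There is no separate carry here: the single state h_t plays both roles, which is the main simplification relative to the LSTM.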

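# Define the model
from keras.models import Sequential
from keras.layers import Embedding, Dense , GRU
model = Sequential()
model.add(Embedding(max_features, 32))
model.add(GRU(32))  # add the GRU layer
model.add(Dense(1, activation='sigmoid'))  # sigmoid output for binary classification

# Compile and train with the same settings as the earlier models
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10, batch_size=128, validation_split=0.2)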
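
To make the size difference between the three layers concrete, the sketch below counts weights using the textbook formulation: one block of input weights, recurrent weights and biases per gate or candidate. The helper recurrent_params is hypothetical, and the numbers reported by Keras for a GRU may differ slightly depending on how the layer configures its biases.

```python
def recurrent_params(units, features, blocks):
    """Hypothetical helper: weights for `blocks` sets of (W, U, b)."""
    return blocks * (units * features + units * units + units)

units = 32      # size of the recurrent layer, as in the models above
features = 32   # output size of the Embedding layer feeding it

counts = {
    "SimpleRNN": recurrent_params(units, features, blocks=1),
    "GRU": recurrent_params(units, features, blocks=3),   # update, reset, candidate
    "LSTM": recurrent_params(units, features, blocks=4),  # i, f, o, g
}
for name, n in counts.items():
    print(name, n)
```

Under these assumptions the GRU has three quarters of the LSTM's recurrent weights, which is consistent with it training faster.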

Monday, January 28, 2019
