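Lido Blog: Deep Learning - LSTM Text Generation - Tang Poetry

After the text has been preprocessed, the next step is to feed it to the model in batches for training.

Here I practice directly with the code written by youyuge34.

File structure under the root directory:

poetry_model holds the model training code, shown below:

Libraries used:
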
import random
import os

import keras
import numpy as np
from keras.callbacks import LambdaCallback
from keras.models import Model, load_model
from keras.layers import Input, LSTM, Dropout, Dense
from keras.optimizers import Adam

from data_utils import *
from config import Config

First, let's look at the model class as a whole:

class PoetryModel(object):
    def __init__(self, config):
        self.model = None
        self.do_train = True
        self.loaded_model = True
        self.config = config

        # Preprocess the corpus file
        self.word2numF, self.num2word, self.words, self.files_content = preprocess_file(self.config)

        # List of poems (']' marks the end of each poem in the preprocessed text)
        self.poems = self.files_content.split(']')
        # Total number of poems
        self.poems_num = len(self.poems)

        # If a saved model file exists (already trained), load it; otherwise start training
        if os.path.exists(self.config.weight_file) and self.loaded_model:
            self.model = load_model(self.config.weight_file)
        else:
            self.train()

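A few details:
__init__: the constructor that initializes an instance of this class, a very common pattern in Python. Here it sets up the attributes that the later methods rely on, so they are easy to reach afterwards.
File preprocessing: in the previous post this was written as a function and saved in data_utils.py, so after importing it you can call preprocess_file(Config) directly. As for training, if a trained model already exists it is loaded directly instead of being trained again.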
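To make the load-or-train decision easier to follow, here is a minimal sketch of what the config object might look like. Only the attribute names (weight_file, max_len, batch_size, learning_rate) and max_len = 6 come from the code in this post; the class body, the other values, and the helper should_load are assumptions for illustration, not the actual config.py.

```python
import os


class Config(object):
    """Hypothetical stand-in for config.Config (values are assumptions)."""
    weight_file = "poetry_model.h5"  # assumed path of the saved model
    max_len = 6                      # from the post: six characters predict the seventh
    batch_size = 32                  # assumed value
    learning_rate = 0.001            # assumed value


def should_load(config, allow_load=True):
    """Mirror the check in __init__: load only if allowed and the file exists."""
    return allow_load and os.path.exists(config.weight_file)


if should_load(Config):
    print("found", Config.weight_file, "-> load the saved model")
else:
    print("no saved model -> train from scratch")
```

Because the check runs inside __init__, simply constructing PoetryModel either loads the saved model or starts training.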

Building the model: built with the Keras functional API, using two LSTM layers, each followed by Dropout.

    def build_model(self):
        "Build the model"
        print('building model')

        # Input shape: (max_len time steps, one-hot vector over the vocabulary)
        input_tensor = Input(shape=(self.config.max_len, len(self.words)))
        # First LSTM returns the full sequence so the second LSTM can consume it
        lstm = LSTM(512, return_sequences=True)(input_tensor)
        dropout = Dropout(0.6)(lstm)
        lstm = LSTM(256)(dropout)
        dropout = Dropout(0.6)(lstm)
        # Softmax over the vocabulary: probability of each candidate next character
        dense = Dense(len(self.words), activation='softmax')(dropout)
        # Model definition
        self.model = Model(inputs=input_tensor, outputs=dense)
        # Choose the optimizer (newer Keras releases name this argument learning_rate)
        optimizer = Adam(lr=self.config.learning_rate)
        self.model.compile(loss='categorical_crossentropy',
                           optimizer=optimizer,
                           metrics=['accuracy'])
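To get a feel for the size of the network above, the trainable parameters can be estimated by hand. The sketch below uses the standard count for an LSTM layer with biases (four gates, each with input weights, recurrent weights, and a bias). The vocabulary size of 5000 is a made-up number for illustration; the real value is len(self.words).

```python
def lstm_params(input_dim, units):
    """Trainable weights in one standard LSTM layer (4 gates, with biases)."""
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    """Trainable weights in a fully connected layer with bias."""
    return input_dim * units + units


vocab_size = 5000  # hypothetical; the post does not state the vocabulary size

layers = {
    "lstm_512": lstm_params(vocab_size, 512),        # input is a one-hot vector
    "lstm_256": lstm_params(512, 256),               # input is the first LSTM's output
    "dense_softmax": dense_params(256, vocab_size),  # one logit per character
}
total = sum(layers.values())

for name, count in layers.items():
    print(f"{name:>14}: {count:,}")
print(f"{'total':>14}: {total:,}")
```

Under this assumption the first LSTM layer holds most of the weights, because its one-hot input is as wide as the vocabulary. Dropout layers add no parameters.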

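Text generator:
max_len is set to 6, so the generator takes six consecutive characters and has the model predict the seventh.
x is therefore six characters of text and y is the character that follows them; both are then converted to one-hot encoding.
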
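    def data_generator(self):
        "Generator that yields one training sample at a time"
        i = 0
        while 1:
            # Wrap around at the end of the text so indexing never runs past it
            if i + self.config.max_len >= len(self.files_content):
                i = 0
            x = self.files_content[i: i + self.config.max_len]
            y = self.files_content[i + self.config.max_len]

            # Skip windows that cross a poem boundary
            if ']' in x or ']' in y:
                i += 1
                continue
            y_vec = np.zeros(
                shape=(1, len(self.words)),
                dtype=bool  # the builtin bool works on every NumPy version
            )
            y_vec[0, self.word2numF(y)] = 1.0
            x_vec = np.zeros(
                shape=(1, self.config.max_len, len(self.words)),
                dtype=bool
            )

            for t, char in enumerate(x):
                x_vec[0, t, self.word2numF(char)] = 1.0
            yield x_vec, y_vec
            i += 1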
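The windowing done by the generator above can be tried on a toy string. This is a minimal standalone sketch, not the author's code: make_pairs, one_hot_pair, and char_to_id are names invented here, and the vocabulary is built from the toy text only instead of coming from preprocess_file.

```python
import numpy as np


def make_pairs(content, max_len=6, sep="]"):
    """Yield (x, y): x is max_len characters, y is the character right after."""
    for i in range(len(content) - max_len):
        x = content[i:i + max_len]
        y = content[i + max_len]
        if sep in x or sep in y:
            continue  # window crosses a poem boundary, skip it
        yield x, y


def one_hot_pair(x, y, char_to_id):
    """Encode one (x, y) pair with the same shapes the generator uses."""
    vocab = len(char_to_id)
    x_vec = np.zeros((1, len(x), vocab), dtype=bool)
    y_vec = np.zeros((1, vocab), dtype=bool)
    for t, ch in enumerate(x):
        x_vec[0, t, char_to_id[ch]] = True
    y_vec[0, char_to_id[y]] = True
    return x_vec, y_vec


# One couplet followed by the ']' end marker (toy corpus for illustration)
toy = "床前明月光，疑是地上霜。]"
char_to_id = {ch: idx for idx, ch in enumerate(sorted(set(toy)))}

pairs = list(make_pairs(toy))
for x, y in pairs:
    print(x, "->", y)

x_vec, y_vec = one_hot_pair(pairs[0][0], pairs[0][1], char_to_id)
print(x_vec.shape, y_vec.shape)
```

Each sample has a leading dimension of 1, so every step of training sees a batch containing a single window.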

Training the model:

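    def train(self):
        "Train the model"
        print('training')
        # Usable samples: total characters minus the (max_len + 1) window positions lost per poem
        number_of_epoch = len(self.files_content)-(self.config.max_len + 1)*self.poems_num
        # Each epoch runs batch_size steps, and each step consumes one sample
        number_of_epoch /= self.config.batch_size
        # Dividing by 1.5 means the run covers roughly two thirds of the samples
        number_of_epoch = int(number_of_epoch / 1.5)
        print('epochs = ',number_of_epoch)
        print('poems_num = ',self.poems_num)
        print('len(self.files_content) = ',len(self.files_content))

        if not self.model:
            self.build_model()

        # fit_generator is deprecated in TensorFlow 2.x Keras, where fit() accepts generators
        self.model.fit_generator(
            generator=self.data_generator(),
            verbose=True,
            steps_per_epoch=self.config.batch_size,
            epochs=number_of_epoch,
            callbacks=[
                # Save the full model (not only the weights) after every epoch
                keras.callbacks.ModelCheckpoint(self.config.weight_file, save_weights_only=False),
                #LambdaCallback(on_epoch_end=self.generate_sample_result)
            ]
        )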
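The epoch arithmetic in train() is easier to see with numbers. In the sketch below, count_samples counts by brute force the windows the generator would yield, and estimate_epochs repeats the formula from train(). The function names, the toy string, and the corpus figures (1,000,000 characters, 40,000 poems, batch_size 32) are assumptions for illustration, not values from the post.

```python
def count_samples(content, max_len=6, sep="]"):
    """Brute-force count of windows that do not touch a poem boundary."""
    width = max_len + 1  # six input characters plus the target character
    return sum(
        sep not in content[i:i + width]
        for i in range(len(content) - max_len)
    )


def estimate_epochs(content_len, poems_num, max_len, batch_size):
    """Same arithmetic as train(): usable samples / steps per epoch / 1.5."""
    samples = content_len - (max_len + 1) * poems_num
    return int(samples / batch_size / 1.5)


# Two toy "poems", each terminated by ']'
toy = "abcdefgh]ijklmnopq]"
poems_in_toy = toy.count("]")
print(count_samples(toy), len(toy) - 7 * poems_in_toy)  # both counts agree

# Hypothetical corpus size, chosen only to show the scale of the result
print(estimate_epochs(content_len=1_000_000, poems_num=40_000,
                      max_len=6, batch_size=32))
```

Note that train() takes poems_num from len(self.files_content.split(']')), which is one more than the number of ']' markers when the text ends with ']'. On a large corpus that difference is negligible.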

Finally, we start training the model:

PoetryModel(Config)

Training ran for a little over 9 hours and completed 33,415 epochs.


Sunday, March 3, 2019

Deep Learning - LSTM Text Generation - Tang Poetry - Training the Model
