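When working on text and sequence problems, the first thing you run into is data preprocessing. A computer cannot work with words or sentences directly, so the text must first be converted into numeric vectors. This process is called vectorization.
There are many ways to vectorize text. The two most common are one-hot encoding and token embedding (called word embedding when it is applied only to words).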
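To make the difference between the two approaches concrete, here is a minimal sketch. The toy vocabulary, the embedding size, and the random embedding matrix are assumptions made up for illustration; in a real model the embedding matrix would be learned during training rather than drawn at random.

```python
import numpy as np

# Hypothetical toy vocabulary; index 0 is left unused, as in the exercises in this post.
vocab = {"my": 1, "name": 2, "is": 3}
vocab_size = len(vocab) + 1

def one_hot(word):
    """Return a sparse vector with a single 1 at the word's index."""
    vec = np.zeros(vocab_size)
    vec[vocab[word]] = 1.0
    return vec

# A dense embedding is a row lookup in a (vocab_size, embedding_dim) matrix.
# The matrix here is random (an assumption for illustration only).
embedding_dim = 4
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

def embed(word):
    """Return the dense vector stored in the row for this word."""
    return embedding_matrix[vocab[word]]

one_hot_name = one_hot("name")
embedded_name = embed("name")
print(one_hot_name)         # sparse: as long as the vocabulary, mostly zeros
print(embedded_name.shape)  # dense: fixed, much smaller dimension
```

Multiplying the one-hot vector by the embedding matrix picks out the same row as the lookup, which is one way to see an embedding as a compact replacement for a one-hot vector.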
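We already took a first look at the one-hot operation earlier. The exercises below practice it again, first on words and then on characters.
One-hot encoding exercise - word-level one-hot encoding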
import numpy as np

samples = ['My name is Lido', 'My insetting is learning Deep-learning.']

token_index = {}
for sample in samples:
    for word in sample.split():
        if word not in token_index:
            token_index[word] = len(token_index) + 1

max_length = 10
results = np.zeros(shape=(len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
One-hot encoding exercise - character-level one-hot encoding
import string
import numpy as np

samples = ['My name is Lido', 'My insetting is learning Deep-learning.']

# All printable ASCII characters, indexed from 1 (index 0 is not assigned)
characters = string.printable
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Truncate to max_length so longer samples do not index past the array
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
One-hot encoding exercise - word-level one-hot encoding with the Keras built-in Tokenizer
import numpy as np
from keras.preprocessing.text import Tokenizer

samples = ['My name is Lido', 'My insetting is learning Deep-learning.']

tokenizer = Tokenizer(num_words=100)  # only consider the 100 most common words
tokenizer.fit_on_texts(samples)
sequences = tokenizer.texts_to_sequences(samples)  # convert strings into lists of integer indices
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

# One-hot hashing trick
dimensionality = 1000
max_length = 10
results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word to an integer index in the range 0 to 999.
        # Note: Python 3 salts str hashes per process, so indices differ between runs.
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
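The one-hot hashing trick is a way to cap the dimensionality. It is useful when the number of unique tokens in the text is so large that the vectorized representation no longer fits in memory during training. Instead of keeping an explicit index for every word, each word is hashed into a vector of fixed size. The trade-off is hash collisions: two different words can end up at the same index.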
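One practical caveat with the built-in hash() is that Python 3 randomizes string hashes per process, so the same word can land on a different index in each run. The sketch below is one possible workaround, not part of the original exercise: it swaps in an MD5 digest from hashlib so the indices are reproducible. The helper name stable_hash_index is hypothetical.

```python
import hashlib
import numpy as np

def stable_hash_index(word, dimensionality):
    """Map a word to an index in [0, dimensionality) using an MD5 digest.

    Hypothetical helper: unlike the built-in hash(), the result is the
    same in every Python process.
    """
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % dimensionality

samples = ['My name is Lido', 'My insetting is learning Deep-learning.']
dimensionality = 1000
max_length = 10

hashed = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in enumerate(sample.split()[:max_length]):
        hashed[i, j, stable_hash_index(word, dimensionality)] = 1.0

print(hashed.shape)
```

A larger dimensionality lowers the chance of collisions at the cost of more memory, so the value is a trade-off to tune for the vocabulary at hand.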