Deep Learning - Understanding one-hot data preprocessing
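One-hot data preprocessing is one way to vectorize training and test data. See the following encoding function:

import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # One row per sequence, one column per possible word index, all zeros at first
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Set to 1 every column whose index appears in this sequence
        results[i, sequence] = 1.
    return results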
Taking the IMDb dataset as an example:
In[3]: train_data.shape
Out[3]: (25000,)
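The training data is a one-dimensional array with 25000 entries. Each entry is one review, stored as a list of word indices.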
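The shape (25000,) can look odd at first. The following is a minimal sketch with made-up data (reviews and toy_data are hypothetical names and values, not the real IMDb data), assuming train_data is a NumPy object array of Python lists, which is what the shape above and the later call to sort suggest:

```python
import numpy as np

# Hypothetical stand-in for train_data: three "reviews" of different lengths.
reviews = [[1, 14, 22], [1, 5, 5, 9], [1, 2]]

# Lists of unequal length cannot form a 2-D array, so each list is
# stored as a single object inside a 1-D array.
toy_data = np.empty(len(reviews), dtype=object)
for i, review in enumerate(reviews):
    toy_data[i] = review

print(toy_data.shape)      # (3,)
print(type(toy_data[0]))   # <class 'list'>
print(len(toy_data[1]))    # 4
```

The array has only one axis because the reviews differ in length; the per-review length lives inside each list, not in the array shape.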
In[4]:train_data
Out[4]:
for i in train_data[0]:
    print('%s,' % i, end="")
Printing train_data[0] shows what it actually looks like:
1,14,22,16,43,530,973,1622,1385,65,458,4468,66,3941,4,173,36,256,5,25,100,43,838,112,50,670,2,9,35,.......
# sorted() returns a new list, so train_data[0] itself keeps its original order
train_data_0 = sorted(train_data[0])
for i in train_data_0[:128]:
    print('%s,' % int(i), end="")
Printing train_data[0] after sorting shows that 2, 4, 5, 6, 7, 8, ... are repeated:
1,2,2,2,2,2,2,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,6,7,7,8,8,8,9,12,12,12,12,12,12,13,13,13,14,14,14,15,15,15,15,16,16,16,16,16,16,16,16,16,16,16,17,17,17,18,18,18,18,.........
This will help us understand what the one-hot encoding does next.
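The repeats matter because the encoding only records whether an index is present, not how often it occurs. A small sketch with a made-up review (the review list is hypothetical and far shorter than the real train_data[0]) counts the repeats and lists the distinct indices:

```python
from collections import Counter

# Hypothetical short review; the real train_data[0] is much longer.
review = [1, 14, 2, 4, 2, 16, 4, 4, 2]

counts = Counter(review)          # how often each word index occurs
unique_indices = sorted(counts)   # the distinct indices, in ascending order

print(counts[2], counts[4])   # 3 3
print(unique_indices)         # [1, 2, 4, 14, 16]
```

Only the distinct indices end up as 1s in the encoded vector, so the counts are discarded.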
np.zeros((len(sequences), dimension))
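In this example, the one-hot encoding first creates an all-zero matrix with one row per sequence. Each row is a 10000-dimensional vector [0, 0, 0, 0, ......0, ......0], 10000 zeros in total. It then sets to 1 every position whose index appears in the sequence.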
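To make the two steps concrete, here is a minimal sketch on a single row, assuming a toy dimension of 8 instead of 10000 and a made-up sequence (both values are illustrative only):

```python
import numpy as np

dimension = 8                # toy size; the article uses 10000
sequence = [1, 4, 4, 2, 6]   # hypothetical word indices; 4 appears twice

row = np.zeros(dimension)    # step 1: a vector of zeros
row[sequence] = 1.           # step 2: set every listed position to 1 at once

print(row)   # [0. 1. 1. 0. 1. 0. 1. 0.]
```

Indexing with a list assigns to all listed positions in one statement, and a repeated index simply writes the same 1 again.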
The loop relies on enumerate, which yields each element together with its position. A usage example:
>>> seq = ['one', 'two', 'three']
>>> for i, element in enumerate(seq):
...     print(i, seq[i])
for i, sequence in enumerate(sequences):
    results[i, sequence] = 1.
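Here, row i gets a 1 at every position whose index appears in the sequence.
Now pass the training data through the one-hot vectorization function and print the result to see what happens: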
x_train = vectorize_sequences(train_data)
x_test = vectorize_sequences(test_data)
for i in x_train[0][:128]:
    print('%s,' % int(i), end="")
This gives the following data:
0,1,1,0,1,1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,0,1,1,0,0,1,1,0,1,0,1,0,1,1,0,1,1,0,1,1,0,0,0,1,0,0,1,0,1,0,1,1,1,0,0,0,1,0,0,0,0,0,1,0,0,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,1,0,1,0,0,1,1,0,1,1,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,
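Comparing train_data with x_train makes this clear.
For example, the indices 3, 10, and 11 never appear in train_data[0], so the corresponding positions in x_train[0] are 0.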
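The comparison can also be done in code. This is a sketch on made-up data (toy_train and the dimension of 16 are hypothetical), which checks that the positions holding 1 are exactly the distinct indices of each review:

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

# Hypothetical reviews; the real data holds 25000 of them.
toy_train = [[1, 2, 2, 4, 5, 9, 12], [3, 3, 7]]
x_toy = vectorize_sequences(toy_train, dimension=16)

# For every review, the positions equal to 1 match its distinct indices.
for review, row in zip(toy_train, x_toy):
    ones = np.flatnonzero(row).tolist()
    assert ones == sorted(set(review))

# Positions 0..12 of the first row that stayed 0.
zeros = np.flatnonzero(x_toy[0][:13] == 0).tolist()
print(zeros)   # [0, 3, 6, 7, 8, 10, 11]
```

The same check could be run on the real x_train[0] and train_data[0], provided the dimension covers the largest index in the data.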