Keras, sparse matrix issue

Question

I am trying to feed a huge sparse matrix to a Keras model. Since the dataset does not fit into RAM, the workaround is to train the model on data produced batch by batch by a generator.

To test this approach and make sure my solution works, I slightly modified Keras's simple MLP example on the Reuters newswire topic classification task. The idea is to compare the original model with the edited one. I simply convert the numpy.ndarray to a scipy.sparse.csr.csr_matrix and feed it to the model.

However, my model crashes at some point, and I need a hand figuring out why.

Here is the original model, followed by my additions:

    from __future__ import print_function
    import numpy as np
    np.random.seed(1337)  # for reproducibility

    from keras.datasets import reuters
    from keras.models import Sequential
    from keras.layers import Dense, Dropout, Activation
    from keras.utils import np_utils
    from keras.preprocessing.text import Tokenizer

    max_words = 1000
    batch_size = 32
    nb_epoch = 5

    print('Loading data...')
    (X_train, y_train), (X_test, y_test) = reuters.load_data(nb_words=max_words, test_split=0.2)
    print(len(X_train), 'train sequences')
    print(len(X_test), 'test sequences')

    nb_classes = np.max(y_train)+1
    print(nb_classes, 'classes')

    print('Vectorizing sequence data...')
    tokenizer = Tokenizer(nb_words=max_words)
    X_train = tokenizer.sequences_to_matrix(X_train, mode='binary')
    X_test = tokenizer.sequences_to_matrix(X_test, mode='binary')
    print('X_train shape:', X_train.shape)
    print('X_test shape:', X_test.shape)

    print('Convert class vector to binary class matrix (for use with categorical_crossentropy)')
    Y_train = np_utils.to_categorical(y_train, nb_classes)
    Y_test = np_utils.to_categorical(y_test, nb_classes)
    print('Y_train shape:', Y_train.shape)
    print('Y_test shape:', Y_test.shape)

    print('Building model...')
    model = Sequential()
    model.add(Dense(512, input_shape=(max_words,)))
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(nb_classes))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy',
                  optimizer='adam',
                  metrics=['accuracy'])

    history = model.fit(X_train, Y_train,
                        nb_epoch=nb_epoch, batch_size=batch_size,
                        verbose=1)  # , validation_split=0.1)
    # Evaluate on the held-out test set so that `score` is defined below
    score = model.evaluate(X_test, Y_test,
                           batch_size=batch_size, verbose=1)
    print('Test score:', score[0])
    print('Test accuracy:', score[1])

It outputs:

    Loading data...
    8982 train sequences
    2246 test sequences
    46 classes
    Vectorizing sequence data...
    X_train shape: (8982, 1000)
    X_test shape: (2246, 1000)
    Convert class vector to binary class matrix (for use with categorical_crossentropy)
    Y_train shape: (8982, 46)
    Y_test shape: (2246, 46)
    Building model...
    Epoch 1/5
    8982/8982 [==============================] - 5s - loss: 1.3932 - acc: 0.6906
    Epoch 2/5
    8982/8982 [==============================] - 4s - loss: 0.7522 - acc: 0.8234
    Epoch 3/5
    8982/8982 [==============================] - 5s - loss: 0.5407 - acc: 0.8681
    Epoch 4/5
    8982/8982 [==============================] - 5s - loss: 0.4160 - acc: 0.8980
    Epoch 5/5
    8982/8982 [==============================] - 5s - loss: 0.3338 - acc: 0.9136
    Test score: 1.01453569163
    Test accuracy: 0.797417631398

Finally, here is my part:

    from scipy import sparse

    X_train_sparse = sparse.csr_matrix(X_train)

    def batch_generator(X, y, batch_size):
        n_batches_for_epoch = X.shape[0]//batch_size
        for i in range(n_batches_for_epoch):
            index_batch = range(X.shape[0])[batch_size*i:batch_size*(i+1)]
            # Densify only the current batch, never the whole matrix
            X_batch = X[index_batch,:].todense()
            y_batch = y[index_batch,:]
            yield(np.array(X_batch),y_batch)

    model.fit_generator(generator=batch_generator(X_train_sparse, Y_train, batch_size),
                        nb_epoch=nb_epoch,
                        samples_per_epoch=X_train_sparse.shape[0])

The crash:

    Exception                                 Traceback (most recent call last)
    <ipython-input-120-6722a4f77425> in <module>()
          1 model.fit_generator(generator=batch_generator(X_trainSparse, Y_train, batch_size),
          2                     nb_epoch=nb_epoch,
    ----> 3                     samples_per_epoch=X_trainSparse.shape[0])

    /home/kk/miniconda2/envs/tensorflow/lib/python2.7/site-packages/keras/models.pyc in fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose, callbacks, validation_data, nb_val_samples, class_weight, max_q_size, **kwargs)
        648                                         nb_val_samples=nb_val_samples,
        649                                         class_weight=class_weight,
    --> 650                                         max_q_size=max_q_size)
        651
        652     def evaluate_generator(self, generator, val_samples, max_q_size=10, **kwargs):

    /home/kk/miniconda2/envs/tensorflow/lib/python2.7/site-packages/keras/engine/training.pyc in fit_generator(self, generator, samples_per_epoch, nb_epoch, verbose, callbacks, validation_data, nb_val_samples, class_weight, max_q_size)
       1356                     raise Exception('output of generator should be a Tuple '
       1357                                     '(x, y, sample_weight) '
    -> 1358                                     'or (x, y). Found: ' + str(generator_output))
       1359                 if len(generator_output) == 2:
       1360                     x, y = generator_output

    Exception: output of generator should be a Tuple (x, y, sample_weight) or (x, y). Found: None

I think the problem is caused by a wrong setting of samples_per_epoch. I would really appreciate it if someone could comment on this.

Kirk · Answer

Here is my solution.

    def batch_generator(X, y, batch_size):
        # Number of full batches in one pass, derived from the data itself
        number_of_batches = X.shape[0] // batch_size
        counter = 0
        shuffle_index = np.arange(np.shape(y)[0])
        np.random.shuffle(shuffle_index)
        X = X[shuffle_index, :]
        y = y[shuffle_index]
        while 1:  # loop forever so Keras can keep drawing batches across epochs
            index_batch = shuffle_index[batch_size*counter:batch_size*(counter+1)]
            X_batch = X[index_batch,:].todense()
            y_batch = y[index_batch]
            counter += 1
            yield(np.array(X_batch),y_batch)
            # Once every batch has been served, reshuffle and start over
            if (counter >= number_of_batches):
                np.random.shuffle(shuffle_index)
                counter = 0

In my case, X is a sparse matrix and y is an array.
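The important difference from the generator in the question appears to be that this one never terminates. The original stops after a single pass over the data, so Keras most likely ran out of batches when it asked for more, which would explain the "Found: None" in the error. Below is a minimal, Keras-free sketch of the same idea that can be checked on its own. The name sparse_batch_generator, its shuffle and seed parameters, and the small synthetic X_demo / y_demo arrays are all assumptions made for illustration, not part of the original code.

```python
import numpy as np
from scipy import sparse


def sparse_batch_generator(X, y, batch_size, shuffle=True, seed=None):
    """Yield (dense X batch, y batch) forever, densifying one batch at a time.

    Hypothetical helper: X is assumed to be a scipy.sparse CSR matrix and
    y a NumPy array with the same number of rows.
    """
    n_samples = X.shape[0]
    rng = np.random.RandomState(seed)
    while True:  # never exhausts, so the consumer can run any number of epochs
        # New row order at the start of every pass over the data
        order = rng.permutation(n_samples) if shuffle else np.arange(n_samples)
        for start in range(0, n_samples, batch_size):
            idx = order[start:start + batch_size]
            # Only this slice is converted to a dense array
            yield X[idx].toarray(), y[idx]


# Tiny synthetic stand-in for the real data (assumed shapes, illustration only)
X_demo = sparse.random(10, 6, density=0.3, format='csr', random_state=0)
y_demo = np.arange(10)

gen = sparse_batch_generator(X_demo, y_demo, batch_size=4, shuffle=False)
# Draw more batches than one pass contains to show the generator wraps around
shapes = [next(gen)[0].shape for _ in range(4)]
print(shapes)
```

With 10 rows and a batch size of 4, one pass yields batches of 4, 4 and 2 rows, and the fourth call starts a new pass. Under the Keras version used in the question, such a generator would presumably be passed to model.fit_generator with samples_per_epoch=X.shape[0], so that one epoch corresponds to exactly one pass.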