Spark Java Error: Size exceeds Integer.MAX_VALUE

Question

I am trying to use Spark for a simple machine learning task. I used pyspark and Spark 1.2.0 to solve a simple logistic regression problem. I have 1.2 million records for training, and I hashed the features of the records. When I set the number of hashed features to 1024, the program works fine, but when I set it to 16384, the program fails several times with the following error:

Py4JJavaError: An error occurred while calling o84.trainLogisticRegressionModelWithSGD.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 4.0 failed 4 times, most recent failure: Lost task 1.3 in stage 4.0 (TID 9, workernode0.sparkexperience4a7.d5.internal.cloudapp.net): java.lang.RuntimeException: java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
    at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
    at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
    at org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:307)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
    at org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$2.apply(NettyBlockRpcServer.scala:57)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
    at org.apache.spark.network.netty.NettyBlockRpcServer.receive(NettyBlockRpcServer.scala:57)
    at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:124)
    at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:97)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:91)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
    at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:156)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:93)
    at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:44)
    at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:163)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1214)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1203)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1202)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1202)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:696)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:696)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1420)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1375)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

This error happens when I train LogisticRegressionWithSGD after converting the data into LabeledPoint objects.

Does anyone have an idea about this?

My code is as follows (I am using an IPython Notebook for this):

    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.classification import LogisticRegressionWithSGD
    from numpy import array
    from sklearn.feature_extraction import FeatureHasher
    from pyspark import SparkConf, SparkContext

    sf = SparkConf().setAppName("test").set("spark.executor.memory", "50g").set("spark.cores.max", 30)
    sc = SparkContext(conf=sf)
    training_file = sc.textFile("train_small.txt")

    def hash_feature(line):
        # First tab-separated field is the label; the rest become "<position>_<value>" keys.
        values = [0, dict()]
        for index, x in enumerate(line.strip("\n").split('\t')):
            if index == 0:
                values[0] = float(x)
            else:
                values[1][str(index)+"_"+x] = 1
        return values

    n_feature = 2**14
    hasher = FeatureHasher(n_features=n_feature)
    training_file_hashed = training_file.map(lambda line: [hash_feature(line)[0], hasher.transform([hash_feature(line)[1]])])

    def build_lable_points(line):
        # Copies the sparse hashed row into a dense list of n_feature floats.
        values = [0.0] * n_feature
        for index, value in zip(line[1].indices, line[1].data):
            values[index] = value
        return LabeledPoint(line[0], values)

    parsed_training_data = training_file_hashed.map(lambda line: build_lable_points(line))
    model = LogisticRegressionWithSGD.train(parsed_training_data)

The error happens when executing the last line.

Daniel Langdon · Answer

The Integer.MAX_VALUE restriction is on the size of a file being stored. In your trace it is raised by FileChannelImpl.map, called from DiskStore.getBytes, so a single block read from disk is larger than Integer.MAX_VALUE bytes (about 2 GB). 1.2M rows is not a big deal, so I am not sure your problem is "the limits of Spark". More likely, some part of your job is creating something too big to be handled by any given executor.

I am not a Python coder, but when you "hashed the features of the records" you may be taking a very sparse set of records and creating a non-sparse array. That means a lot of memory for 16384 features. In particular, the loop over zip(line[1].indices, line[1].data) copies each sparse row into a dense list of n_feature values. The only reason that does not run you out of memory right there is the large amount of it you seem to have configured (50G).
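To make the sparse versus dense point concrete, here is a minimal sketch in plain Python. It is not the asker's pipeline: `hash_line_sparse` and `to_dense` are hypothetical helpers, and `zlib.crc32` is only a stand-in for sklearn's FeatureHasher so the example runs without extra libraries.

```python
import zlib

N_FEATURES = 2 ** 14  # same value as n_feature in the question


def hash_line_sparse(line, n_features=N_FEATURES):
    """Return (label, {index: value}) for one tab-separated record.

    zlib.crc32 is a stand-in hash, NOT what FeatureHasher uses.
    """
    fields = line.strip("\n").split("\t")
    label = float(fields[0])
    sparse = {}
    for position, raw in enumerate(fields[1:], start=1):
        key = "{}_{}".format(position, raw).encode("utf-8")
        index = zlib.crc32(key) % n_features
        # Colliding keys accumulate in the same bucket.
        sparse[index] = sparse.get(index, 0.0) + 1.0
    return label, sparse


def to_dense(sparse, n_features=N_FEATURES):
    """Mimic build_lable_points: one float per feature, mostly zeros."""
    values = [0.0] * n_features
    for index, value in sparse.items():
        values[index] = value
    return values


label, sparse = hash_line_sparse("1\ta\tb\tc")
dense = to_dense(sparse)
stored_sparse = len(sparse)  # at most 3 entries for 3 input fields
stored_dense = len(dense)    # always 16384 entries
```

If this matches what is happening, the dict could probably be handed to MLlib without densifying it, for example as LabeledPoint(label, SparseVector(n_feature, sparse)) with SparseVector from pyspark.mllib.linalg. I have not run that against Spark 1.2.0, so treat it as a suggestion to try.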

Another thing that might help is to increase the partitioning. If you cannot make your rows use less memory, you can at least try to have fewer rows in any given task. Any temporary files being created are likely to depend on this, so you will be less likely to hit file size limits.
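As a rough back-of-the-envelope sketch of how many partitions that might take, the following assumes every row is stored densely as 64-bit floats and that rows are spread evenly. Both are assumptions, and serialization overhead is ignored, so the result is only a lower bound. `min_partitions` is a hypothetical helper, not a Spark function.

```python
import math

INT_MAX = 2 ** 31 - 1   # Java's Integer.MAX_VALUE, read here as a byte limit
ROWS = 1200000          # training records, from the question
BYTES_PER_VALUE = 8     # assumption: one 64-bit float per feature


def min_partitions(n_features, rows=ROWS):
    """Smallest partition count that keeps each dense partition under INT_MAX bytes."""
    total_bytes = rows * n_features * BYTES_PER_VALUE
    return int(math.ceil(total_bytes / float(INT_MAX)))


partitions_1024 = min_partitions(1024)       # 5
partitions_16384 = min_partitions(2 ** 14)   # 74
```

In PySpark such a count could be applied with the minPartitions argument of sc.textFile or with repartition on the RDD. The real threshold depends on how Spark serializes the blocks, so a comfortable margin above these figures seems safer.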

And, totally unrelated to the error but relevant to what you are trying to do:

16384 is indeed a large number of features. In the optimistic case where each one is just a boolean feature, you have a total of 2^16384 possible combinations to learn from, which is a huge number (try it here: https://defuse.ca/big-number-calculator.htm).
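To get a feel for that number without an online calculator, here is a small sketch that counts its decimal digits with logarithms and compares it with the number of training samples. The variable names are mine.

```python
import math

N_FEATURES = 16384
SAMPLES = 1200000  # training records, from the question

# Number of decimal digits in 2**N_FEATURES, computed with logarithms
# so the full integer never has to be printed.
digits_in_combinations = int(math.floor(N_FEATURES * math.log10(2))) + 1

# For comparison: the sample count is a 7-digit number.
log10_samples = math.log10(SAMPLES)
```

Here digits_in_combinations is 4933, so the space of boolean combinations has thousands of digits while the sample count has seven.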

It is VERY, VERY likely that no algorithm will be able to learn a decision boundary with only 1.2M samples. You would probably need at least a few trillion trillion examples to make a dent in such a feature space. Machine learning has its limitations, so do not be surprised if you do not get better-than-random accuracy.

I would definitely recommend trying some sort of dimensionality reduction first!

Baptiste Wicht · Answer

At some point, it tries to store the features, and 1.2M * 16384 is greater than Integer.MAX_VALUE, so you are trying to store more than the maximum size of features supported by Spark.
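The arithmetic behind that claim can be checked in a few lines. This only counts feature values. Whether Spark really stores them in one structure subject to this limit is the answer's assumption, not something the sketch verifies.

```python
INT_MAX = 2 ** 31 - 1   # Java's Integer.MAX_VALUE: 2147483647
ROWS = 1200000          # training records, from the question

values_1024 = ROWS * 1024     # 1228800000
values_16384 = ROWS * 16384   # 19660800000

fits_1024 = values_1024 <= INT_MAX
fits_16384 = values_16384 <= INT_MAX
```

With 1024 features the count stays below the limit and with 16384 it does not, which is at least consistent with the behaviour reported in the question.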

You are probably running into the limits of Apache Spark.

gsamaras · Answer

Increasing the number of partitions can lead to "Active tasks is a negative number in Spark UI", which probably means that the number of partitions is too high.