
collect() or toPandas() on a large DataFrame in pyspark / EMR

I have an EMR cluster with a single "c3.8xlarge" machine. After reading several resources, I understood that I need to allow a decent amount of off-heap memory because I am using pyspark, so I configured the cluster as follows:

One executor:

  • spark.executor.memory 6g
  • spark.executor.cores 10
  • spark.yarn.executor.memoryOverhead 4096

Driver:

  • spark.driver.memory 21g
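
For reference, the same values can also be passed programmatically when the session is built. This is only an illustrative sketch (the app name is made up), not my actual EMR setup, where these live in the cluster's spark-defaults:

    from pyspark.sql import SparkSession

    # Illustrative only: mirrors the settings listed above. On EMR these
    # would normally go in spark-defaults via the cluster configuration JSON.
    spark = (
        SparkSession.builder
        .appName("collect-vs-toPandas")                        # hypothetical name
        .config("spark.executor.memory", "6g")
        .config("spark.executor.cores", "10")
        .config("spark.yarn.executor.memoryOverhead", "4096")
        .config("spark.driver.memory", "21g")                  # only takes effect if set before the driver JVM starts
        .getOrCreate()
    )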

When I cache() the DataFrame, it takes about 3.6 GB of memory.
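
(A minimal sketch of how a cached size like this can be observed, assuming df is the DataFrame in question:)

    df.cache()   # mark the DataFrame for caching
    df.count()   # force an action so the cache is actually materialized
    # The cached size is then visible under the "Storage" tab of the Spark UI.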

Now, when I call collect() or toPandas() on the DataFrame, the process crashes.

I know I am bringing a large amount of data into the driver, but I don't think it is that large, and I am unable to figure out the reason for the crash.

When I call collect() or toPandas() I get this error:

Py4JJavaError: An error occurred while calling o181.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 6.0 failed 4 times, most recent failure: Lost task 5.3 in stage 6.0 (TID 110, ip-10-0-47-207.prod.eu-west-1.hs.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_1511879540686_0005_01_000016 on host: ip-10-0-47-207.prod.eu-west-1.hs.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

==== Update ====

As suggested by @user6910411, I tried the solution mentioned here, and in that case I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2.0 failed 4 times, most recent failure: Lost task 7.3 in stage 2.0 (TID 41, ip-10-0-33-57.prod.eu-west-1.hs.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 13.5 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Any idea what is going on here?

Rami

TL;DR I think you are seriously underestimating the memory requirements.

Even assuming the data is fully cached, the storage information will show only a fraction of the peak memory needed to bring the data back to the driver.

Since the data is actually quite large, I would consider writing it to Parquet and reading it back directly in Python using PyArrow (Reading and Writing the Apache Parquet Format), skipping the intermediate stages entirely.
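
A minimal sketch of that approach (the path is hypothetical, and pyarrow must be installed where the reading happens; reading straight from S3 may additionally need an S3-capable filesystem such as s3fs):

    import pyarrow.parquet as pq

    # Hypothetical location; it must be reachable both by the Spark executors
    # (for writing) and by the Python process that reads it back.
    path = "s3://my-bucket/tmp/large_df.parquet"

    # Write from Spark instead of funnelling everything through the driver JVM.
    df.write.mode("overwrite").parquet(path)

    # Read back with PyArrow and convert to pandas, skipping collect()/toPandas().
    pdf = pq.read_table(path).to_pandas()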

zero323

As mentioned above, when toPandas() is called, all records of the DataFrame are collected to the driver program, so it should only be done on a small subset of the data. ( https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html )
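
A sketch of that recommendation, combining the Arrow setting described on the linked page with an explicit subset (the row limit is arbitrary):

    # Enable Arrow-based conversion for toPandas(); this is the property name used
    # in the linked docs (newer Spark versions call it
    # spark.sql.execution.arrow.pyspark.enabled).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # Only bring a small, bounded subset back to the driver.
    pdf = df.limit(10000).toPandas()   # 10,000 rows is just an example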