
collect() or toPandas() on a large DataFrame in pyspark / EMR

I have an EMR cluster with a single "c3.8xlarge" machine. After reading several resources, I understood that I need to allow a decent amount of off-heap memory because I am using pyspark, so I configured the cluster as follows:

One executor:

  • spark.executor.memory 6g
  • spark.executor.cores 10
  • spark.yarn.executor.memoryOverhead 4096

Driver:

  • spark.driver.memory 21g
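
For reference, the same values can also be passed programmatically when the session is built. This is only an illustrative sketch (the app name is made up), not my actual EMR setup, where these live in the cluster's spark-defaults:

    from pyspark.sql import SparkSession

    # Illustrative only: mirrors the settings listed above. On EMR these
    # would normally go in spark-defaults via the cluster configuration JSON.
    spark = (
        SparkSession.builder
        .appName("collect-vs-toPandas")                        # hypothetical name
        .config("spark.executor.memory", "6g")
        .config("spark.executor.cores", "10")
        .config("spark.yarn.executor.memoryOverhead", "4096")
        .config("spark.driver.memory", "21g")                  # only takes effect if set before the driver JVM starts
        .getOrCreate()
    )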

When I cache() the DataFrame, it takes about 3.6 GB of memory.
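
(A minimal sketch of how a cached size like this can be observed, assuming df is the DataFrame in question:)

    df.cache()   # mark the DataFrame for caching
    df.count()   # force an action so the cache is actually materialized
    # The cached size is then visible under the "Storage" tab of the Spark UI.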

Now, when I call collect() or toPandas() on the DataFrame, the process crashes.

I know I am bringing a large amount of data into the driver, but I don't think it is that large, and I am unable to figure out the reason for the crash.

When I call collect() or toPandas() I get this error:

Py4JJavaError: An error occurred while calling o181.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 6.0 failed 4 times, most recent failure: Lost task 5.3 in stage 6.0 (TID 110, ip-10-0-47-207.prod.eu-west-1.hs.internal, executor 9): ExecutorLostFailure (executor 9 exited caused by one of the running tasks) Reason: Container marked as failed: container_1511879540686_0005_01_000016 on host: ip-10-0-47-207.prod.eu-west-1.hs.internal. Exit status: 137. Diagnostics: Container killed on request. Exit code is 137
Container exited with a non-zero exit code 137
Killed by external signal
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:278)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2803)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2800)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2823)
    at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2800)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

==== Update ====

As suggested by @user6910411, I tried the solution mentioned here, and in that case I get the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 2.0 failed 4 times, most recent failure: Lost task 7.3 in stage 2.0 (TID 41, ip-10-0-33-57.prod.eu-west-1.hs.internal, executor 5): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 13.5 GB of 12 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1690)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1678)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1677)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1677)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:855)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:855)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1905)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1849)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:671)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2022)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2043)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2062)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:458)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

Any idea what is going on here?

Rami

TL;DR I think you are seriously underestimating the memory requirements.

Even assuming the data is fully cached, the storage information will show only a fraction of the peak memory needed to bring the data back to the driver.

Since the data is actually quite large, I would consider writing it to Parquet and reading it back directly in Python using PyArrow (Reading and Writing the Apache Parquet Format), skipping the intermediate stages entirely.
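
A minimal sketch of that approach (the path is hypothetical, and pyarrow must be installed where the reading happens; reading straight from S3 may additionally need an S3-capable filesystem such as s3fs):

    import pyarrow.parquet as pq

    # Hypothetical location; it must be reachable both by the Spark executors
    # (for writing) and by the Python process that reads it back.
    path = "s3://my-bucket/tmp/large_df.parquet"

    # Write from Spark instead of funnelling everything through the driver JVM.
    df.write.mode("overwrite").parquet(path)

    # Read back with PyArrow and convert to pandas, skipping collect()/toPandas().
    pdf = pq.read_table(path).to_pandas()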

zero323

As mentioned above, when toPandas() is called, all records of the DataFrame are collected to the driver program, so it should only be done on a small subset of the data. ( https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html )
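
A sketch of that recommendation, combining the Arrow setting described on the linked page with an explicit subset (the row limit is arbitrary):

    # Enable Arrow-based conversion for toPandas(); this is the property name used
    # in the linked docs (newer Spark versions call it
    # spark.sql.execution.arrow.pyspark.enabled).
    spark.conf.set("spark.sql.execution.arrow.enabled", "true")

    # Only bring a small, bounded subset back to the driver.
    pdf = df.limit(10000).toPandas()   # 10,000 rows is just an example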