
Apache Spark Codegen Stage grows beyond 64 KB

I am getting an error when feature engineering on 30+ columns to create about 200+ columns. It is not failing the job, but the ERROR shows. I want to know how I can avoid this.

Spark - 2.3.1, Python - 3.6

Cluster configuration - 1 master - 32 GB RAM, 16 cores; 4 slaves - 16 GB RAM, 8 cores

Input data - 8 partitions of parquet file with snappy compression.

My spark-submit ->

spark-submit --master spark://192.168.60.20:7077 --num-executors 4 --executor-cores 5 --executor-memory 10G --driver-cores 5 --driver-memory 25G --conf spark.sql.shuffle.partitions=60 --conf spark.driver.maxResultSize=2G --conf "spark.executor.extraJavaOptions=-XX:+UseParallelGC" --conf spark.scheduler.listenerbus.eventqueue.capacity=20000 --conf spark.sql.codegen=true /appdata/bblite-codebase/pipeline_data_test_run.py > /appdata/bblite-data/logs/log_10_iter_pipeline_8_partitions_33_col.txt

Stack trace below -

ERROR CodeGenerator:91 - failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass": Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
    at org.codehaus.janino.UnitCompiler.compileUnit(UnitCompiler.java:361)
    at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:234)
    at org.codehaus.janino.SimpleCompiler.compileToClassLoader(SimpleCompiler.java:446)
    at org.codehaus.janino.ClassBodyEvaluator.compileToClass(ClassBodyEvaluator.java:313)
    at org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:235)
    at org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:204)
    at org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1417)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1493)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1490)
    at org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
    at org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
    at org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
    at org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
    at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
    at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
    at org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
    at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1365)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:579)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:578)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
    at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:371)
    at org.apache.spark.sql.execution.SortExec.inputRDDs(SortExec.scala:121)
    at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:605)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.joins.SortMergeJoinExec.doExecute(SortMergeJoinExec.scala:150)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.ProjectExec.doExecute(basicPhysicalOperators.scala:70)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.buildBuffers(InMemoryRelation.scala:107)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation.<init>(InMemoryRelation.scala:102)
    at org.apache.spark.sql.execution.columnar.InMemoryRelation$.apply(InMemoryRelation.scala:43)
    at org.apache.spark.sql.execution.CacheManager$$anonfun$cacheQuery$1.apply(CacheManager.scala:97)
    at org.apache.spark.sql.execution.CacheManager.writeLock(CacheManager.scala:67)
    at org.apache.spark.sql.execution.CacheManager.cacheQuery(CacheManager.scala:91)
    at org.apache.spark.sql.Dataset.persist(Dataset.scala:2924)
    at sun.reflect.GeneratedMethodAccessor78.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.codehaus.janino.InternalCompilerException: Code of method "processNext()V" of class "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3426" grows beyond 64 KB
7
Aakash Basu

The problem is that the Java programs Catalyst generates from DataFrame and Dataset code are compiled into Java bytecode, and the Java class-file format limits the bytecode of a single method to less than 64 KB. When a generated method exceeds that limit, this exception occurs.

To mask the error:

spark.sql.codegen.wholeStage= "false"
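
The same setting can also be applied from an existing PySpark session at runtime; a minimal sketch, assuming spark is your already-built SparkSession:

spark.conf.set("spark.sql.codegen.wholeStage", "false")

Note that this only hides the error by turning off whole-stage code generation; it does not shrink the code Spark would otherwise generate.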

Solution:

To avoid the exception caused by this restriction, Spark's own mitigation is to split a generated method that is likely to compile to more than 64 KB of bytecode into several smaller methods when Catalyst generates the Java program.

Use persist, or any other logical separation, in the pipeline.
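
A minimal sketch of that idea in PySpark, assuming spark is an existing SparkSession; the input path, column names, and feature expressions are hypothetical stand-ins for the 30+ engineered columns in the question:

from pyspark.sql import functions as F

df = spark.read.parquet("/appdata/bblite-data/input")  # hypothetical path

# Derive the ~200 columns in small batches instead of one huge projection,
# persisting between batches so each whole-stage-codegen method that
# Catalyst generates stays well under the 64 KB bytecode limit.
feature_batches = [
    [("f1", F.col("c1") * 2), ("f2", F.col("c2") + 1)],  # hypothetical features
    [("f3", F.col("c3") - F.col("c4"))],
]
for batch in feature_batches:
    for name, expr in batch:
        df = df.withColumn(name, expr)
    df = df.persist()
    df.count()  # an action forces materialization, cutting the plan here

After the count(), downstream stages read from the cached data instead of replaying the whole accumulated lineage.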

8
vaquar khan

We resolved this error by adding extra "checkpoints" in the code.

Checkpoint = you rewrite the dataframe (the data) to disk, S3 in our case, then read it back into a new dataframe. This flushes out the Spark JVM containers and restarts with fresh generated code.

Details on checkpointing:

https://github.com/JerryLead/SparkInternals/blob/master/markdown/english/6-CacheAndCheckpoint.md
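
A minimal sketch of that write-and-read-back pattern, assuming spark is an existing SparkSession; the S3 path is hypothetical, and any shared filesystem your cluster can reach works:

def checkpoint_df(spark, df, path):
    # Round-trip through storage: this truncates the lineage, so the
    # following stages are code-generated from a fresh, short plan.
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path)

df = checkpoint_df(spark, df, "s3a://my-bucket/tmp/pipeline_step1")  # hypothetical path

Spark also offers a built-in df.checkpoint() (after spark.sparkContext.setCheckpointDir(...)) that truncates lineage in a similar way.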

0
Mandeep Singh

As vaquar wrote, introducing a logical separation in the pipeline should help.

One way to cut the lineage and introduce a break in the plan seems to be a DF -> RDD -> DF round-trip conversion:

df = spark_session.createDataFrame(df.rdd, schema=df.schema)

In the book High Performance Spark they further mention that it is preferable (faster) to do this using the Java RDD, i.e. using j_rdd = df._jdf.toJavaRDD() and its schema j_schema = df._jdf.schema() to construct a new Java DataFrame, and finally to convert that back into a PySpark DataFrame:

from pyspark.sql import DataFrame

sql_ctx = df.sql_ctx
java_sql_context = sql_ctx._jsqlContext
new_java_df = java_sql_context.createDataFrame(j_rdd, j_schema)
new_df = DataFrame(new_java_df, sql_ctx)
0
Ferrard

If you are using pyspark 2.3+, try

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local')
         .appName('tow-way')
         .config('spark.sql.codegen.wholeStage', 'false')  # <-- add this line
         .getOrCreate())
0
Dustin Sun