
Write an RDD as a text file with Apache Spark

I am exploring Spark for batch processing. I am running Spark on my local machine in standalone mode.

I am trying to save the Spark RDD as a single file [final output] using the saveAsTextFile() method, but it is not working.

For example, if I have multiple partitions, how can I get a single file as the final output?

Update:

I tried the approaches below, but I get a NullPointerException.

person.coalesce(1).toJavaRDD().saveAsTextFile("C://Java_All//output");
person.repartition(1).toJavaRDD().saveAsTextFile("C://Java_All//output");

The exception is:

15/06/23 18:25:27 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/06/23 18:25:27 INFO deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/06/23 18:25:27 INFO deprecation: mapred.output.key.class is deprecated. Instead, use mapreduce.job.output.key.class
15/06/23 18:25:27 INFO deprecation: mapred.output.value.class is deprecated. Instead, use mapreduce.job.output.value.class
15/06/23 18:25:27 INFO deprecation: mapred.working.dir is deprecated. Instead, use mapreduce.job.working.dir
15/06/23 18:25:27 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
15/06/23 18:25:27 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

15/06/23 18:25:27 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
15/06/23 18:25:27 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool 
15/06/23 18:25:27 INFO TaskSchedulerImpl: Cancelling stage 1
15/06/23 18:25:27 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at TestSpark.java:40) failed in 0.249 s
15/06/23 18:25:28 INFO DAGScheduler: Job 0 failed: saveAsTextFile at TestSpark.java:40, took 0.952286 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.NullPointerException
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
    at org.apache.hadoop.util.Shell.run(Shell.java:379)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:589)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:678)
    at org.apache.hadoop.util.Shell.execCommand(Shell.java:661)
    at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSystem.java:639)
    at org.apache.hadoop.fs.FilterFileSystem.setPermission(FilterFileSystem.java:468)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:456)
    at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:424)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:905)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:798)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:123)
    at org.apache.spark.SparkHadoopWriter.open(SparkHadoopWriter.scala:90)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1104)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
    at org.apache.spark.scheduler.Task.run(Task.scala:70)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/06/23 18:25:28 INFO SparkContext: Invoking stop() from shutdown hook
15/06/23 18:25:28 INFO SparkUI: Stopped Spark web UI at http://10.37.145.179:4040
15/06/23 18:25:28 INFO DAGScheduler: Stopping DAGScheduler
15/06/23 18:25:28 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/06/23 18:25:28 INFO Utils: path = C:\Users\crh537\AppData\Local\Temp\spark-a52371d8-ae6a-4567-b759-0a6c66c1908c\blockmgr-4d17a5b4-c8f8-4408-af07-0e88239794e8, already present as root for deletion.
15/06/23 18:25:28 INFO MemoryStore: MemoryStore cleared
15/06/23 18:25:28 INFO BlockManager: BlockManager stopped
15/06/23 18:25:28 INFO BlockManagerMaster: BlockManagerMaster stopped
15/06/23 18:25:28 INFO SparkContext: Successfully stopped SparkContext
15/06/23 18:25:28 INFO Utils: Shutdown hook called

Regards, Shankar

6
Shankar

You can use the coalesce method to save to a single file. That way, your code will look like this:

val myFile = sc.textFile("file.txt")
val finalRdd = doStuff(myFile)
finalRdd.coalesce(1).saveAsTextFile("newfile")

There is also another method, repartition, that does the same thing; however, it causes a shuffle, which can be very expensive, whereas coalesce will try to avoid a shuffle.
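For reference, here is a minimal, self-contained sketch of the same idea using the Java API from the question (the class and variable names below are illustrative, not taken from the original code). Note that even with a single partition, saveAsTextFile writes a directory at the given path containing one part-00000 file, not a bare file:

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SingleFileOutput {
    public static void main(String[] args) {
        // Local standalone mode, as in the question.
        SparkConf conf = new SparkConf().setAppName("SingleFileOutput").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // An RDD deliberately spread over several partitions.
        JavaRDD<String> data = sc.parallelize(Arrays.asList("a", "b", "c", "d"), 4);

        // coalesce(1) merges the partitions without a full shuffle; the output
        // path is still a directory, containing a single part-00000 file.
        data.coalesce(1).saveAsTextFile("C://Java_All//output");

        sc.stop();
    }
}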

6
Maksud

Are you running this on Windows? If so, you need to add the following line:

System.setProperty("hadoop.home.dir", "C:\\winutil\\")

You can download winutils from the following link:

http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
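A minimal sketch of where that line goes, assuming the driver class from the question's stack trace (TestSpark) and the C:\winutil\ location above; the property has to be set before the SparkContext is created, and winutils.exe is expected under <hadoop.home.dir>\bin:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class TestSpark {
    public static void main(String[] args) {
        // Must run before the SparkContext is created; Hadoop will then look for
        // C:\winutil\bin\winutils.exe when it shells out to set file permissions.
        System.setProperty("hadoop.home.dir", "C:\\winutil\\");

        SparkConf conf = new SparkConf().setAppName("TestSpark").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // ... build the RDD and call saveAsTextFile here, as in the question ...

        sc.stop();
    }
}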

10
Harvinder Singh

Spark internally uses the Hadoop file system. Therefore, when you try to read from or write to a file system, it first looks for the HADOOP_HOME configuration folder, which must contain bin\winutils.exe. You have probably not set this, which is why it throws the NullPointerException.
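As a quick way to confirm this is the cause, here is a small standalone check (assuming the Windows setup described above) that prints whether hadoop.home.dir or HADOOP_HOME is set and whether bin\winutils.exe actually exists there:

import java.io.File;

public class WinutilsCheck {
    public static void main(String[] args) {
        // Hadoop's Shell utility checks the hadoop.home.dir system property
        // first, then the HADOOP_HOME environment variable.
        String home = System.getProperty("hadoop.home.dir");
        if (home == null) {
            home = System.getenv("HADOOP_HOME");
        }
        System.out.println("hadoop.home.dir / HADOOP_HOME = " + home);
        if (home != null) {
            File winutils = new File(home, "bin" + File.separator + "winutils.exe");
            System.out.println("winutils.exe present: " + winutils.isFile());
        }
    }
}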

0
Arjun gangineni