
PySpark in an iPython notebook raises Py4JJavaError when using count() and first()

I'm using PySpark (v2.1.0) in an iPython notebook (Python v3.6) inside a virtualenv on my Mac (Sierra 10.12.3 beta).

1. I launched an iPython notebook by firing this in Terminal:

    PYSPARK_PYTHON=python3 PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" /Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark

2. Loaded my file into the Spark context and made sure it was loaded:

>>>lines = sc.textFile("/Users/PanchusMac/Dropbox/Learn_py/Virtual_Env/pyspark/README.md") 

>>>for i in lines.collect(): 
    print(i)

And this worked fine, printing the result to my console as shown:

# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>


## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions. 

Also checked the sc:

>>>print(sc)

<pyspark.context.SparkContext object at 0x101ce4cc0>
3. Now, when I try to run lines.count() or lines.first() on my RDD, I get the following error:


    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input-33-44aeefde846d> in <module>()
    ----> 1 lines.count()
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in count(self)
       1039         3
       1040         """
    -> 1041         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
       1042 
       1043     def stats(self):
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in sum(self)
       1030         6.0
       1031         """
    -> 1032         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
       1033 
       1034     def count(self):
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in fold(self, zeroValue, op)
        904         # zeroValue provided to each partition is unique from the one provided
        905         # to the final reduce call
    --> 906         vals = self.mapPartitions(func).collect()
        907         return reduce(op, vals, zeroValue)
        908 
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/rdd.py in collect(self)
        807         """
        808         with SCCallSiteSync(self.context) as css:
    --> 809             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
        810         return list(_load_from_socket(port, self._jrdd_deserializer))
        811 
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
       1131         answer = self.gateway_client.send_command(command)
       1132         return_value = get_return_value(
    -> 1133             answer, self.gateway_client, self.target_id, self.name)
       1134 
       1135         for temp_arg in temp_args:
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()
    
    /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        317                 raise Py4JJavaError(
        318                     "An error occurred while calling {0}{1}{2}.\n".
    --> 319                     format(target_id, ".", name), value)
        320             else:
        321                 raise Py4JError(
    
    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 22, localhost, executor driver): org.apache.spark.SparkException: 
    Error from python worker:
      Traceback (most recent call last):
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
          mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
          __import__(pkg_name)
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/__init__.py", line 44, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/context.py", line 36, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/Java_gateway.py", line 25, in <module>
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module>
          "system node release version machine processor")
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/serializers.py", line 393, in namedtuple
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
    PYTHONPATH was:
      /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/:
    java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    
    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
        at scala.Option.foreach(Option.scala:257)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1958)
        at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:935)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
        at org.apache.spark.rdd.RDD.collect(RDD.scala:934)
        at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:453)
        at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.SparkException: 
    Error from python worker:
      Traceback (most recent call last):
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 183, in _run_module_as_main
          mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 109, in _get_module_details
          __import__(pkg_name)
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/__init__.py", line 44, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/context.py", line 36, in <module>
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/Java_gateway.py", line 25, in <module>
        File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/platform.py", line 886, in <module>
          "system node release version machine processor")
        File "/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.Zip/pyspark/serializers.py", line 393, in namedtuple
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
    PYTHONPATH was:
      /Applications/spark-2.1.0-bin-hadoop2.7/python/lib/pyspark.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/jars/spark-core_2.11-2.1.0.jar:/Applications/spark-2.1.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip:/Applications/spark-2.1.0-bin-hadoop2.7/python/:
    java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:166)
        at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
        at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
        at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
        at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
        at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
        at org.apache.spark.scheduler.Task.run(Task.scala:99)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        ... 1 more
    

Could someone explain to me where this went wrong? Note: when I performed the same operations from my Mac terminal, they worked as expected.

6
Panchu

Pyspark 2.1.0 is not compatible with Python 3.6; see https://issues.apache.org/jira/browse/SPARK-19019 .

You have to use an earlier version of Python, or you can try building the master or the 2.1 branch from GitHub and it should work.
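
For example, a minimal sketch of the first option, assuming a separate Python 3.5 environment is available (the ~/py35env path below is only a placeholder): point both the driver and the workers at that interpreter when launching the same notebook command:

    # Hypothetical Python 3.5 virtualenv; create it first and install ipython/jupyter into it:
    #   python3.5 -m venv ~/py35env && ~/py35env/bin/pip install ipython jupyter
    # Then launch pyspark so that both the driver and the workers use that interpreter:
    PYSPARK_PYTHON=~/py35env/bin/python \
    PYSPARK_DRIVER_PYTHON=~/py35env/bin/ipython \
    PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
    /Applications/spark-2.1.0-bin-hadoop2.7/bin/pyspark

The exact paths are placeholders; the point is that both PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON resolve to a Python 3.5 (or earlier) interpreter, so neither side hits the namedtuple() change introduced in Python 3.6.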

6
Mariusz

Yes, I ran into the same problem a while ago with Pyspark in Anaconda. I tried several ways to fix it; the one I finally found on my own was installing Java for Anaconda separately.

https://anaconda.org/cyclus/java-jdk

6
Raja Rajan

If you are using Anaconda, try installing java-jdk for Anaconda:

conda install -c cyclus java-jdk
1
Tung Nguyen