How to run GraphX with Python / pyspark?

Question

I'm trying to run Spark GraphX with Python using pyspark. My installation appears to be correct, since I can run both the pyspark tutorials and the (Java) GraphX tutorials. Presumably, since GraphX is part of Spark, pyspark should be able to interface with it, right?

Here are the tutorials for pyspark: http://spark.apache.org/docs/0.9.0/quick-start.html http://spark.apache.org/docs/0.9.0/python-programming-guide.html

Here are the ones for GraphX: http://spark.apache.org/docs/0.9.0/graphx-programming-guide.html http://ampcamp.berkeley.edu/big-data-mini-course/graph-analytics-with-graphx.html

Can anyone convert the GraphX tutorial to Python?

Misty Nodine · Accepted Answer

It looks like the Python bindings for GraphX are delayed until at least Spark ~~1.4~~ ~~1.5~~ ∞. They are waiting behind the Java API.

You can track the status at SPARK-3789 (GRAPHX: Python bindings for GraphX) on the ASF JIRA.

zhibo · Answer

You should look at GraphFrames ( https://github.com/graphframes/graphframes ), which wraps the GraphX algorithms under the DataFrames API and provides a Python interface.

Here is a quick example from https://graphframes.github.io/graphframes/docs/_site/quick-start.html , slightly modified so that it works.

First, start pyspark with the graphframes package loaded:

pyspark --packages graphframes:graphframes:0.1.0-spark1.6

Python code:

    from graphframes import *

    # Create a Vertex DataFrame with unique ID column "id"
    v = sqlContext.createDataFrame([
        ("a", "Alice", 34),
        ("b", "Bob", 36),
        ("c", "Charlie", 30),
    ], ["id", "name", "age"])

    # Create an Edge DataFrame with "src" and "dst" columns
    e = sqlContext.createDataFrame([
        ("a", "b", "friend"),
        ("b", "c", "follow"),
        ("c", "b", "follow"),
    ], ["src", "dst", "relationship"])

    # Create a GraphFrame
    g = GraphFrame(v, e)

    # Query: Get in-degree of each vertex.
    g.inDegrees.show()

    # Query: Count the number of "follow" connections in the graph.
    g.edges.filter("relationship = 'follow'").count()

    # Run PageRank algorithm, and show results.
    results = g.pageRank(resetProbability=0.01, maxIter=20)
    results.vertices.select("id", "pagerank").show()
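If you want to sanity-check what the first two queries should return before involving Spark, here is a minimal plain-Python sketch over the same three edges. It does not use GraphFrames at all; the names `edges`, `in_degrees` and `follow_count` are just illustrative, and it only mirrors the in-degree and "follow" count queries, not PageRank.

```python
from collections import Counter

# Same edge list as the GraphFrames example: (src, dst, relationship)
edges = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
]

# In-degree: number of edges arriving at each vertex.
# Vertices with no incoming edge (here "a") simply do not appear.
in_degrees = Counter(dst for _src, dst, _rel in edges)

# Number of edges whose relationship is "follow".
follow_count = sum(1 for _src, _dst, rel in edges if rel == "follow")

print(dict(in_degrees))  # {'b': 2, 'c': 1}
print(follow_count)      # 2
```

So `g.inDegrees.show()` should list an in-degree of 2 for "b" and 1 for "c", and the filter query should count 2 "follow" edges. As far as I know, GraphFrames also leaves out vertices with zero in-degree, so "a" should not be listed there either.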

Wildfire · Answer

GraphX 0.9.0 does not have a Python API yet. It is expected in upcoming releases.