web-dev-qa-db-fra.com

Quels sont les différents types de jointure dans Spark?

J'ai consulté les documents et il est indiqué que les types de jointures suivants sont pris en charge: 

Type de jointure à effectuer. Par défaut interne. Doit être l'un des suivants: intérieur, croix, outer, full, full_outer, left, left_outer, right, right_outer, left_semi, left_anti.

J'ai examiné la réponse StackOverflow sur les jointures SQL et les deux premières réponses ne mentionnent pas certaines des jointures ci-dessus, par exemple. left_semi et left_anti. Que veulent-ils dire dans Spark?

17
pathikrit

Voici une expérience illustrative simple:

import org.Apache.spark._
import org.Apache.spark.sql._
import org.Apache.spark.sql.expressions._
import org.Apache.spark.sql.functions._

object SparkSandbox extends App {

  case class Row(id: Int, value: String)

  private[this] implicit val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._
  spark.sparkContext.setLogLevel("ERROR")

  val r1 = Seq(Row(1, "A1"), Row(2, "A2"), Row(3, "A3"), Row(4, "A4")).toDS()
  val r2 = Seq(Row(3, "A3"), Row(4, "A4"), Row(4, "A4_1"), Row(5, "A5"), Row(6, "A6")).toDS()

  val joinTypes = Seq("inner", "outer", "full", "full_outer", "left", "left_outer", "right", "right_outer", "left_semi", "left_anti")

  joinTypes foreach {joinType =>
    println(s"${joinType.toUpperCase()} JOIN")
    r1.join(right = r2, usingColumns = Seq("id"), joinType = joinType).orderBy("id").show()
  }
}

Sortie

 INNER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

FULL_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

LEFT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  1|   A1| null|
|  2|   A2| null|
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
+---+-----+-----+

RIGHT JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4|   A4|
|  4|   A4| A4_1|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

RIGHT_OUTER JOIN
+---+-----+-----+
| id|value|value|
+---+-----+-----+
|  3|   A3|   A3|
|  4|   A4| A4_1|
|  4|   A4|   A4|
|  5| null|   A5|
|  6| null|   A6|
+---+-----+-----+

LEFT_SEMI JOIN
+---+-----+
| id|value|
+---+-----+
|  3|   A3|
|  4|   A4|
+---+-----+

LEFT_ANTI JOIN
+---+-----+
| id|value|
+---+-----+
|  1|   A1|
|  2|   A2|
+---+-----+
40
pathikrit

Voici une explication visuelle des différents types de jointures disponibles dans Spark .

Découvrez ce super blog sur Spark rejoint: http://kirillpavlov.com/blog/2016/04/23/beyond-traditional-join-with-Apache-spark/

15
Sohum Sachdev

Aimé l'exemple de Pathikrit. Voici une traduction possible en Java avec Spark v2 et des cadres de données, y compris des jointures croisées.

package net.jgp.books.sparkInAction.ch12.lab940AllJoins;

import Java.util.ArrayList;
import Java.util.List;

import org.Apache.spark.sql.Dataset;
import org.Apache.spark.sql.Row;
import org.Apache.spark.sql.RowFactory;
import org.Apache.spark.sql.SparkSession;
import org.Apache.spark.sql.types.DataTypes;
import org.Apache.spark.sql.types.StructField;
import org.Apache.spark.sql.types.StructType;

/**
 * All joins in a single app, inspired by
 * https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark.
 * 
 * Used in Spark in Action 2e, http://jgp.net/sia
 * 
 * @author jgp
 */
public class AllJoinsApp {

  /**
   * main() is your entry point to the application.
   * 
   * @param args
   */
  public static void main(String[] args) {
    AllJoinsApp app = new AllJoinsApp();
    app.start();
  }

  /**
   * The processing code.
   */
  private void start() {
    // Creates a session on a local master
    SparkSession spark = SparkSession.builder()
        .appName("Processing of invoices")
        .master("local")
        .getOrCreate();

    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField(
            "id",
            DataTypes.IntegerType,
            false),
        DataTypes.createStructField(
            "value",
            DataTypes.StringType,
            false) });

    List<Row> rows = new ArrayList<Row>();
    rows.add(RowFactory.create(1, "A1"));
    rows.add(RowFactory.create(2, "A2"));
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    Dataset<Row> dfLeft = spark.createDataFrame(rows, schema);
    dfLeft.show();

    rows = new ArrayList<Row>();
    rows.add(RowFactory.create(3, "A3"));
    rows.add(RowFactory.create(4, "A4"));
    rows.add(RowFactory.create(4, "A4_1"));
    rows.add(RowFactory.create(5, "A5"));
    rows.add(RowFactory.create(6, "A6"));
    Dataset<Row> dfRight = spark.createDataFrame(rows, schema);
    dfRight.show();

    String[] joinTypes = new String[] { 
        "inner", // v2.0.0. default
        "cross", // v2.2.0
        "outer", // v2.0.0
        "full", // v2.1.1
        "full_outer", // v2.1.1
        "left", // v2.1.1
        "left_outer", // v2.0.0
        "right", // v2.1.1
        "right_outer", // v2.0.0
        "left_semi", // v2.0.0, was leftsemi before v2.1.1
        "left_anti" // v2.1.1
        };

    for (String joinType : joinTypes) {
      System.out.println(joinType.toUpperCase() + " JOIN");
      Dataset<Row> df = dfLeft.join(
          dfRight, 
          dfLeft.col("id").equalTo(dfRight.col("id")), 
          joinType);
      df.orderBy(dfLeft.col("id")).show();
    }
  }
}

Je vais mettre cet exemple dans le répertoire Spark in Action, 2e 's - chapitre 12 .

3
jgp
Spark data frame support following types of joins between two dataframes.
Please find the list of joins and joining string with respect to join types along with scala syntax.
We can use following joining values used for specify the join type in Scala- Spark code. 
***Mathod:*** Leftdataframe.join(Rightdataframe, join_conditions, joinStringName)

Join Name : Join String name in scala -Spark code

1. inner : 'inner'
2. cross: 'cross'
3. outer: 'outer'
4. full: 'full'
5. full outer: 'fullouter'
6. left : 'left'
7. left outer : 'leftouter'
8. right : 'right'
9. right outer : 'rightouter'
10. left semi: 'leftsemi'
11. left anti: 'leftanti'

example: 1. Left Semi join: 
Leftdataframe.join(Rightdataframe, join_conditions, "leftsemi");
2. inner Join Example:
Leftdataframe.join(Rightdataframe, join_conditions, "inner");

Its tested and working well.
1
Rajeev Rathor