
How to create a table using delta with Spark 2.4.4?

This is Spark 2.4.4 and Delta Lake 0.5.0.

I'm trying to create a table using the delta data source and it seems I'm missing something. Although the CREATE TABLE USING delta command worked fine, the table directory is not created and insertInto does not work.

The following CREATE TABLE USING delta worked fine, but insertInto failed.

scala> sql("""
create table t5
USING delta
LOCATION '/tmp/delta'
""").show

scala> spark.catalog.listTables.where('name === "t5").show
+----+--------+-----------+---------+-----------+
|name|database|description|tableType|isTemporary|
+----+--------+-----------+---------+-----------+
|  t5| default|       null| EXTERNAL|      false|
+----+--------+-----------+---------+-----------+

scala> spark.range(5).write.option("mergeSchema", true).insertInto("t5")
org.apache.spark.sql.AnalysisException: `default`.`t5` requires that the data to be inserted have the same number of columns as the target table: target table has 0 column(s) but the inserted data has 1 column(s), including 0 partition column(s) having constant value(s).;
  at org.apache.spark.sql.execution.datasources.PreprocessTableInsertion.org$apache$spark$sql$execution$datasources$PreprocessTableInsertion$$preprocess(rules.scala:341)
  ...

I thought of creating the table with the columns defined explicitly, but that did not work either.

scala> sql("""
create table t6
(id LONG, name STRING)
USING delta
LOCATION '/tmp/delta'
""").show
org.apache.spark.sql.AnalysisException: delta does not allow user-specified schemas.;
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:325)
  at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:78)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:194)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$2(Dataset.scala:3370)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:78)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3370)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:194)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
  ... 54 elided
Jacek Laskowski

Delta Lake 0.7.0 with Spark 3.0.0 (both just released) supports the CREATE TABLE SQL command.

Be sure to "install" Delta SQL by setting the spark.sql.catalog.spark_catalog configuration property to org.apache.spark.sql.delta.catalog.DeltaCatalog.

$ ./bin/spark-submit \
  --packages io.delta:delta-core_2.12:0.7.0 \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog

scala> spark.version
res0: String = 3.0.0

scala> sql("CREATE TABLE delta_101 (id LONG) USING delta").show
++
||
++
++

scala> spark.table("delta_101").show
+---+
| id|
+---+
+---+

scala> sql("DESCRIBE EXTENDED delta_101").show(truncate = false)
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|id                          |bigint                                                   |       |
|                            |                                                         |       |
|# Partitioning              |                                                         |       |
|Not partitioned             |                                                         |       |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Name                        |default.delta_101                                        |       |
|Location                    |file:/Users/jacek/dev/oss/spark/spark-warehouse/delta_101|       |
|Provider                    |delta                                                    |       |
|Table Properties            |[]                                                       |       |
+----------------------------+---------------------------------------------------------+-------+
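As a follow-up (a sketch under the same session configuration, not shown in the original answer), the freshly created table should accept standard SQL inserts, since the DeltaCatalog handles DML as well as DDL:

```scala
// Sketch, assuming the same Spark 3.0.0 + Delta Lake 0.7.0 session
// with DeltaSparkSessionExtension and DeltaCatalog configured.
sql("INSERT INTO delta_101 VALUES (1), (2)")

// The rows are now visible through the catalog table.
sql("SELECT * FROM delta_101 ORDER BY id").show
```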
Jacek Laskowski