Combine PySpark DataFrame ArrayType fields into a single ArrayType field

Question

I have a PySpark DataFrame with 2 ArrayType fields:

>>> df
DataFrame[id: string, tokens: array<string>, bigrams: array<string>]
>>> df.take(1)
[Row(id='ID1', tokens=['one', 'two', 'two'], bigrams=['one two', 'two two'])]

I would like to combine them into a single ArrayType field:

>>> df2
DataFrame[id: string, tokens_bigrams: array<string>]
>>> df2.take(1)
[Row(id='ID1', tokens_bigrams=['one', 'two', 'two', 'one two', 'two two'])]

The syntax that works with strings doesn't seem to work here:

df2 = df.withColumn('tokens_bigrams', df.tokens + df.bigrams)

Thanks!

zero323 · Accepted Answer

Spark >= 2.4

You can use the concat function (SPARK-23736):

from pyspark.sql.functions import col, concat

df.select(concat(col("tokens"), col("tokens_bigrams"))).show(truncate=False)

# +---------------------------------+
# |concat(tokens, tokens_bigrams)   |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |null                             |
# +---------------------------------+

To keep the data when one of the values is NULL, you can coalesce with array:

from pyspark.sql.functions import array, coalesce

df.select(concat(
    coalesce(col("tokens"), array()),
    coalesce(col("tokens_bigrams"), array())
)).show(truncate=False)

# +--------------------------------------------------------------------+
# |concat(coalesce(tokens, array()), coalesce(tokens_bigrams, array()))|
# +--------------------------------------------------------------------+
# |[one, two, two, one two, two two]                                   |
# |[three]                                                             |
# +--------------------------------------------------------------------+
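The NULL behaviour seen in the two outputs above can be modelled in plain Python, without a Spark session. This is only a sketch: `sql_concat` and `sql_coalesce` are hypothetical stand-ins written for illustration, not Spark functions, and `None` plays the role of SQL NULL.

```python
def sql_concat(*arrays):
    """Hypothetical stand-in for array concat: NULL if any input is NULL."""
    if any(a is None for a in arrays):
        return None
    return [item for a in arrays for item in a]


def sql_coalesce(*values):
    """Hypothetical stand-in for coalesce: first non-NULL argument, else NULL."""
    return next((v for v in values if v is not None), None)


# Second row of the example data: tokens_bigrams is NULL.
row = {"tokens": ["three"], "tokens_bigrams": None}

# Without coalesce the NULL propagates to the result.
plain = sql_concat(row["tokens"], row["tokens_bigrams"])

# Replacing NULL with an empty array first keeps the non-NULL side.
safe = sql_concat(
    sql_coalesce(row["tokens"], []),
    sql_coalesce(row["tokens_bigrams"], []),
)

print(plain)
print(safe)
```

The point of the model is that the empty array acts as a neutral element for concatenation, so substituting it for NULL changes nothing for rows that already have both values.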

Spark < 2.4

Unfortunately, to concatenate array columns in the general case you will need a UDF, for example like this:

from itertools import chain
from pyspark.sql.functions import col, udf
from pyspark.sql.types import *


def concat(type):
    def concat_(*args):
        return list(chain.from_iterable((arg if arg else [] for arg in args)))
    return udf(concat_, ArrayType(type))

which can be used as:

df = spark.createDataFrame(
    [(["one", "two", "two"], ["one two", "two two"]), (["three"], None)],
    ("tokens", "tokens_bigrams")
)

concat_string_arrays = concat(StringType())
df.select(concat_string_arrays("tokens", "tokens_bigrams")).show(truncate=False)

# +---------------------------------+
# |concat_(tokens, tokens_bigrams)  |
# +---------------------------------+
# |[one, two, two, one two, two two]|
# |[three]                          |
# +---------------------------------+
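The function wrapped by the UDF is ordinary Python, so its flattening logic can be checked on its own before involving Spark. The sketch below reuses the same `chain.from_iterable` expression; the name `concat_arrays` is made up for this illustration, and `None` stands in for a NULL column value.

```python
from itertools import chain


def concat_arrays(*arrays):
    """Flatten several lists into one, treating None (a NULL value) as empty."""
    # `arr if arr else []` maps both None and [] to an empty list,
    # so a NULL input never breaks the iteration.
    return list(chain.from_iterable(arr if arr else [] for arr in arrays))


merged = concat_arrays(["one", "two", "two"], ["one two", "two two"])
with_null = concat_arrays(["three"], None)

print(merged)
print(with_null)
```

Because the helper takes `*arrays`, the same logic extends to any number of array columns, which is why the UDF version accepts more than two column names.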

David Vrba · Answer

As of Spark 2.4.0 (2.3 on the Databricks platform), you can do this natively in the DataFrame API using the concat function. In your example, that would be:

from pyspark.sql.functions import col, concat

df.withColumn('tokens_bigrams', concat(col('tokens'), col('bigrams')))

Here is the related JIRA.