PySpark: modify column values when another column's value satisfies a condition

Question

I have a PySpark DataFrame with two columns, Id and Rank:

+---+----+
| Id|Rank|
+---+----+
|  a|   5|
|  b|   7|
|  c|   8|
|  d|   1|
+---+----+

For each row, I want to replace Id with "other" if Rank is greater than 5.

In pseudocode:

for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")

The result should look like this:

+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+

Any idea how to achieve this? Thanks!

To create this DataFrame:

df = spark.createDataFrame([('a',5),('b',7),('c',8),('d',1)], ["Id","Rank"])

Pushkr · Accepted Answer

You can use when and otherwise, like this:

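from pyspark.sql.functions import col, when

# Build Id_New (Id when Rank <= 5, else 'other'), drop the old Id, then rename Id_New back to Id
df\
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other'))\
    .drop(df.Id)\
    .select(col('Id_New').alias('Id'), col('Rank'))\
    .show()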

This gives the following output:

+-----+----+
|   Id|Rank|
+-----+----+
|    a|   5|
|other|   7|
|other|   8|
|    d|   1|
+-----+----+