Using a dictionary would be a shame: you would have to collect the entire dataframe
onto the driver, which is very bad for performance and could cause an OOM error.
You could simply perform a left outer join between the two dataframes and use na.fill
to replace the resulting nulls with -1:
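data = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ['id', 'duration'])
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ['id', 'x'])
# The left join keeps every row of df; ids missing from data get a null
# id_duration, which na.fill(-1) then replaces with -1.
(df
    .join(data.withColumnRenamed("duration", "id_duration"), ['id'], 'left')
    .na.fill(-1)
    .show())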
+---+---+-----------+
| id|  x|id_duration|
+---+---+-----------+
|  5|  6|         -1|
|  1|  2|         10|
|  3|  4|         30|
+---+---+-----------+