Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - How to map a column to create a new column in spark sql dataframe?

In Python with pandas, I can create a new column like this.

First, I build a dict from two columns of a pandas DataFrame:

 dict1 = dict(zip(data["id"], data["duration"]))

Then I apply this dict to create a new column in a second DataFrame, falling back to -1 for ids that are not in the dict:

df['id_duration'] = df['id'].map(lambda x: dict1[x] if x in dict1.keys() else -1)
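
For reference, the pandas lookup described above can be run end to end as a minimal sketch. The sample rows are assumptions chosen for illustration, and dict.get is used as a shorter equivalent of the `x in dict1.keys()` check:

```python
import pandas as pd

# Hypothetical sample data, not taken from the question
data = pd.DataFrame({"id": [1, 2, 3], "duration": [10, 20, 30]})
df = pd.DataFrame({"id": [1, 3, 5]})

# Build the id -> duration lookup from the two columns
dict1 = dict(zip(data["id"], data["duration"]))

# dict.get returns -1 for any id that has no entry in dict1
df["id_duration"] = df["id"].map(lambda x: dict1.get(x, -1))

print(df)  # id_duration is 10, 30, -1 for ids 1, 3, 5
```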

How can I create the new column id_duration in a Spark SQL DataFrame, given a DataFrame data (with two columns, id and duration) and a DataFrame df (with a column id)?

Question from: https://stackoverflow.com/questions/65838706/how-to-map-a-column-to-create-a-new-column-in-spark-sql-dataframe


1 Reply

Building a dictionary is a poor fit for Spark: you would have to collect the entire data DataFrame onto the driver, which hurts performance and can cause an out-of-memory (OOM) error on large inputs.

Instead, perform a left outer join between the two DataFrames and use na.fill to replace the resulting nulls with -1:

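data = spark.createDataFrame([(1, 10), (2, 20), (3, 30)], ['id', 'duration'])
# df includes an id (5) with no match in data, to exercise the -1 default
df = spark.createDataFrame([(1, 2), (3, 4), (5, 6)], ['id', 'x'])

# The outer parentheses let the method chain span several lines
(df
    .join(data.withColumnRenamed("duration", "id_duration"), ['id'], 'left')
    .na.fill(-1, subset=['id_duration'])  # fill only the new column
    .show())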
+---+---+-----------+
| id|  x|id_duration|
+---+---+-----------+
|  5|  6|         -1|
|  1|  2|         10|
|  3|  4|         30|
+---+---+-----------+
