python - How to find the mean value of an array column and then subtract the mean from each element in a PySpark DataFrame?

posted Oct 6, 2021 in Technique by 深蓝 (71.8m points)

This is the DataFrame in PySpark:

id	list1	list2
1	[10, 20, 30]	[30, 40, 50]
2	[35, 65, 85]	[15, 5, 45]

question from:https://stackoverflow.com/questions/66057237/how-to-find-the-mean-value-of-a-array-column-and-then-subtract-the-mean-from-eac

1 Reply

深蓝 · Answer 1 · 2021-10-06T03:09:23+0000

You can use aggregate to calculate the mean of each list, then use transform on the array columns to subtract that mean from each element:

from pyspark.sql import functions as F

# Compute each list's mean with aggregate, subtract it from every element
# with transform, then drop the helper columns. The parentheses allow the
# method chain to span several lines.
df1 = (
    df.withColumn("list1_avg", F.expr("aggregate(list1, bigint(0), (acc, x) -> acc + x, acc -> acc / size(list1))"))
    .withColumn("list2_avg", F.expr("aggregate(list2, bigint(0), (acc, x) -> acc + x, acc -> acc / size(list2))"))
    .withColumn("list1", F.expr("transform(list1, x -> x - list1_avg)"))
    .withColumn("list2", F.expr("transform(list2, x -> x - list2_avg)"))
    .drop("list1_avg", "list2_avg")
)

df1.show(truncate=False)

#+---+-------------------------------------------------------------+-------------------------------------------------------------+
#|id |list1                                                        |list2                                                        |
#+---+-------------------------------------------------------------+-------------------------------------------------------------+
#|1  |[-10.0, 0.0, 10.0]                                           |[-10.0, 0.0, 10.0]                                           |
#|2  |[-26.666666666666664, 3.3333333333333357, 23.333333333333336]|[-6.666666666666668, -16.666666666666668, 23.333333333333332]|
#+---+-------------------------------------------------------------+-------------------------------------------------------------+
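
To check the numbers without a Spark session, the same two steps can be sketched in plain Python. This is only an illustration of the logic, not part of the Spark answer: the helper name mean_center and the rows list are hypothetical, and the helper assumes every list is non-empty.

```python
from functools import reduce


def mean_center(values):
    """Subtract the mean of `values` from each element (assumes a non-empty list)."""
    # Mirrors: aggregate(list, 0, (acc, x) -> acc + x, acc -> acc / size(list))
    total = reduce(lambda acc, x: acc + x, values, 0)
    mean = total / len(values)
    # Mirrors: transform(list, x -> x - mean)
    return [x - mean for x in values]


# Hypothetical stand-in for the DataFrame rows: (id, list1, list2)
rows = [
    (1, [10, 20, 30], [30, 40, 50]),
    (2, [35, 65, 85], [15, 5, 45]),
]

centered = [(row_id, mean_center(l1), mean_center(l2)) for row_id, l1, l2 in rows]
for row in centered:
    print(row)
```

Each centered list sums to zero up to floating-point rounding, which is a quick sanity check on the Spark output as well.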
