My question is straightforward: what is the best way to apply a custom function to all the columns of a PySpark dataframe?
I am trying to apply a sum over a window in a large panel dataframe (300 columns and more than 500k rows). Assume this simple dataframe:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# toDF on an RDD requires an active SparkSession, so create one first
spark = SparkSession.builder.master("local").appName("Trial").getOrCreate()
sc = spark.sparkContext
df = sc.parallelize([
('A', 0, 1, 0), ('A', 1, -1, 0), ('A', 0, 0, -1),
('B', 0, -1, -1), ('B', 0, 1, 0), ('C', 0, 0, 1)
]).toDF(["id", "col1", "col2", "col3"])
df.show()
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| A| 0| 1| 0|
| A| 1| -1| 0|
| A| 0| 0| -1|
| B| 0| -1| -1|
| B| 0| 1| 0|
| C| 0| 0| 1|
+---+----+----+----+
I know that I can compute a rolling sum over a 3-row window for each of the three columns as follows:
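# 3-row rolling window per id: the two preceding rows plus the current row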
w = Window.partitionBy('id').orderBy('id').rowsBetween(-2, 0)
df = df.select('id', *[F.sum(F.col(c)).over(w).alias(c) for c in df.columns[1:]])
df.orderBy('id').show()
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| A| 0| 1| 0|
| A| 1| 0| 0|
| A| 1| 0| -1|
| B| 0| 0| -1|
| B| 0| -1| -1|
| C| 0| 0| 1|
+---+----+----+----+
The problem is that with large dataframes, it takes hours for an action to return a result. Is there a way to speed up this computation, perhaps by avoiding the loop over the columns in the list comprehension?
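For reference, this is roughly the generic pattern I am after: one select that applies an arbitrary function over the window to every value column (apply_over_window is just an illustrative name, and F.sum stands in for the custom function):

def apply_over_window(df, func, window, id_col='id'):
    # apply func over the window to every column except the id column
    value_cols = [c for c in df.columns if c != id_col]
    return df.select(id_col, *[func(F.col(c)).over(window).alias(c) for c in value_cols])

w = Window.partitionBy('id').orderBy('id').rowsBetween(-2, 0)
df = apply_over_window(df, F.sum, w)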
question from:
https://stackoverflow.com/questions/65939908/best-way-to-apply-a-transformation-to-all-the-columns-pyspark-dataframe