Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Best way to apply a transformation to all the columns - PySpark dataframe

My question is straightforward: what is the best way to apply a custom function to all the columns of a PySpark dataframe?

I am trying to apply a sum over a window in a large panel dataframe (300 columns and more than 500k rows). Assume this simple dataframe:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Creating a SparkSession is what makes RDD.toDF() available
spark = SparkSession.builder.master("local").appName("Trial").getOrCreate()
sc = spark.sparkContext

df = sc.parallelize([
    ('A', 0, 1, 0), ('A', 1, -1, 0), ('A', 0, 0, -1),
    ('B', 0, -1, -1), ('B', 0, 1, 0), ('C', 0, 0, 1)
]).toDF(["id", "col1", "col2", 'col3'])

df.show()

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  A|   0|   1|   0|
|  A|   1|  -1|   0|
|  A|   0|   0|  -1|
|  B|   0|  -1|  -1|
|  B|   0|   1|   0|
|  C|   0|   0|   1|
+---+----+----+----+

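I know that I can compute a running sum over the three value columns (each row plus the two preceding rows with the same id) as follows:

# 3-row window per id; ordering by the partition key itself leaves the row order within each id undefined
w = Window.partitionBy('id').orderBy('id').rowsBetween(-2, 0)
# Build one windowed-sum expression per value column and evaluate them all in a single select
df = df.select('id', *[F.sum(F.col(c)).over(w).alias(c) for c in df.columns[1:]])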

df.orderBy('id').show()

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  A|   0|   1|   0|
|  A|   1|   0|   0|
|  A|   1|   0|  -1|
|  B|   0|   0|  -1|
|  B|   0|  -1|  -1|
|  C|   0|   0|   1|
+---+----+----+----+
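
For reference, here is a minimal plain-Python sketch of what rowsBetween(-2, 0) computes on the sample data: for each row, the sum of that row and up to two earlier rows with the same id. The names rolling_sum_by_id and preceding are hypothetical, and the sketch assumes the rows are processed in the order they are listed, which Spark itself does not guarantee here because the window is ordered by the partition key.

```python
from collections import defaultdict, deque

# Same sample rows as the dataframe above: (id, col1, col2, col3)
rows = [
    ('A', 0, 1, 0), ('A', 1, -1, 0), ('A', 0, 0, -1),
    ('B', 0, -1, -1), ('B', 0, 1, 0), ('C', 0, 0, 1),
]


def rolling_sum_by_id(rows, preceding=2):
    """Sum every value column over the current row and up to
    `preceding` earlier rows that share the same id."""
    # One bounded buffer per id; old rows fall out automatically
    recent = defaultdict(lambda: deque(maxlen=preceding + 1))
    out = []
    for key, *values in rows:
        recent[key].append(values)
        # zip(*...) regroups the buffered rows column by column
        sums = [sum(col) for col in zip(*recent[key])]
        out.append((key, *sums))
    return out


result = rolling_sum_by_id(rows)
for row in result:
    print(row)
```

Up to the order of the two B rows, this yields the same rows as the table above.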

The problem is that on large dataframes, calling an action takes hours to display a result. Is there a way to speed up this computation, perhaps by avoiding the loop over the columns in the list comprehension?

Question from: https://stackoverflow.com/questions/65939908/best-way-to-apply-a-transformation-to-all-the-columns-pyspark-dataframe


