My question is straightforward: what is the best way to apply a custom function to all the columns of a PySpark dataframe?
I am trying to apply a sum over a window in a large panel dataframe (300 columns and more than 500k rows). Assume this simple dataframe:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# toDF on an RDD requires an active SparkSession, so create one first
spark = SparkSession.builder.master("local").appName("Trial").getOrCreate()
sc = spark.sparkContext
df = sc.parallelize([
('A', 0, 1, 0), ('A', 1, -1, 0), ('A', 0, 0, -1),
('B', 0, -1, -1), ('B', 0, 1, 0), ('C', 0, 0, 1)
]).toDF(["id", "col1", "col2", "col3"])
df.show()
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| A| 0| 1| 0|
| A| 1| -1| 0|
| A| 0| 0| -1|
| B| 0| -1| -1|
| B| 0| 1| 0|
| C| 0| 0| 1|
+---+----+----+----+
I know that I can compute a rolling sum over a 3-row window for each of the three columns as follows:
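# 3-row rolling window per id: the two preceding rows plus the current row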
w = Window.partitionBy('id').orderBy('id').rowsBetween(-2, 0)
df = df.select('id', *[F.sum(F.col(c)).over(w).alias(c) for c in df.columns[1:]])
df.orderBy('id').show()
+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
| A| 0| 1| 0|
| A| 1| 0| 0|
| A| 1| 0| -1|
| B| 0| 0| -1|
| B| 0| -1| -1|
| C| 0| 0| 1|
+---+----+----+----+
The problem is that with large dataframes, it takes hours for an action to return a result. Is there a way to speed up this computation, perhaps by avoiding the loop over the columns in the list comprehension?
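For reference, this is roughly the generic pattern I am after: one select that applies an arbitrary function over the window to every value column (apply_over_window is just an illustrative name, and F.sum stands in for the custom function):

def apply_over_window(df, func, window, id_col='id'):
    # apply func over the window to every column except the id column
    value_cols = [c for c in df.columns if c != id_col]
    return df.select(id_col, *[func(F.col(c)).over(window).alias(c) for c in value_cols])

w = Window.partitionBy('id').orderBy('id').rowsBetween(-2, 0)
df = apply_over_window(df, F.sum, w)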
question from:
https://stackoverflow.com/questions/65939908/best-way-to-apply-a-transformation-to-all-the-columns-pyspark-dataframe