
python - Filtering a column with an empty array in PySpark

I have a DataFrame that contains a lot of repeated values. An aggregated, distinct count looks like this:

df.groupby('fruits').count().sort(F.desc('count')).show()


| fruits         | count |
| -------------- | ----- |
| [Apples]       | 123   |
| []             | 344   |
| [Apples, plum] | 444   |

My goal is to filter all rows where the value is either [Apples] or [].

Surprisingly, the following works for a non-empty array, but for an empty one it doesn't:

import pyspark.sql.functions as F
import pyspark.sql.types as T

is_apples = F.udf(lambda arr: arr == ['Apples'], T.BooleanType())
df.filter(is_apples(df.fruits)).count()  # Works! Shows 123 correctly.

is_empty = F.udf(lambda arr: arr == [], T.BooleanType())
df.filter(is_empty(df.fruits)).count()  # Doesn't work! Should show 344 but shows zero.

Any idea what I am doing wrong?

Question from: https://stackoverflow.com/questions/65662265/filtering-a-column-with-an-empty-array-in-pyspark


1 Reply


It might be an array containing an empty string:

is_empty = F.udf(lambda arr: arr == [''], T.BooleanType())

Or it might be an array of null:

is_empty = F.udf(lambda arr: arr == [None], T.BooleanType())
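
To see which case you actually have, one quick diagnostic (my own sketch, not part of the original answer) is to count the elements in each distinct value: a truly empty array has size 0, while [''] and [None] each have size 1.

import pyspark.sql.functions as F

# Distinct array values alongside their element counts:
# [] -> 0, [''] -> 1, [None] -> 1, ['Apples'] -> 1, ...
df.select("fruits", F.size("fruits").alias("n_elems")).distinct().show(truncate=False)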

To check them all at once you can use:

is_empty = F.udf(lambda arr: arr in [[], [''], [None]], T.BooleanType())
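
Applied to the question's data, the combined check should then return the expected count:

df.filter(is_empty(df.fruits)).count()  # should report 344 instead of 0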

But you don't actually need a UDF for this; for example, you can do:

df.filter("fruits = array() or fruits = array('') or fruits = array(null)")
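
For reference, here is a minimal self-contained reproduction (the toy rows are an assumption, not the asker's actual data) showing that this filter catches all three "empty-looking" variants:

from pyspark.sql import SparkSession
import pyspark.sql.types as T

spark = SparkSession.builder.master("local[*]").getOrCreate()

schema = T.StructType([T.StructField("fruits", T.ArrayType(T.StringType()))])
df = spark.createDataFrame(
    [(["Apples"],), ([],), ([""],), ([None],), (["Apples", "plum"],)],
    schema,
)

# Keeps [], [''] and [None]; drops ['Apples'] and ['Apples', 'plum'].
df.filter("fruits = array() or fruits = array('') or fruits = array(null)").show()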
