hive - How to control partition size in Spark SQL

Question

Welcome To Ask or Share your Answers For Others

hive - How to control partition size in Spark SQL

posted Oct 17, 2021 in Technique[技术] by 深蓝 (71.8m points)

hive - How to control partition size in Spark SQL

I have a requirement to load data from an Hive table using Spark SQL HiveContext and load into HDFS. By default, the DataFrame from SQL output is having 2 partitions. To get more parallelism i need more partitions out of the SQL. There is no overloaded method in HiveContext to take number of partitions parameter.

Repartitioning of the RDD causes shuffling and results in more processing time.

>

val result = sqlContext.sql("select * from bt_st_ent")

Has the log output of:

Starting task 0.0 in stage 131.0 (TID 297, aster1.com, partition 0,NODE_LOCAL, 2203 bytes)
Starting task 1.0 in stage 131.0 (TID 298, aster1.com, partition 1,NODE_LOCAL, 2204 bytes)

I would like to know is there any way to increase the partitions size of the SQL output.

Question&Answers:os

与恶龙缠斗过久,自身亦成为恶龙；凝视深渊过久,深渊将回以凝视…

1 Reply

深蓝 · Answer 1 · 2021-10-16T22:23:08+0000

Spark < 2.0:

You can use Hadoop configuration options:

mapred.min.split.size.
mapred.max.split.size

as well as HDFS block size to control partition size for filesystem based formats*.

val minSplit: Int = ???
val maxSplit: Int = ???

sc.hadoopConfiguration.setInt("mapred.min.split.size", minSplit)
sc.hadoopConfiguration.setInt("mapred.max.split.size", maxSplit)

Spark 2.0+:

You can use spark.sql.files.maxPartitionBytes configuration:

spark.conf.set("spark.sql.files.maxPartitionBytes", maxSplit)

In both cases these values may not be in use by a specific data source API so you should always check documentation / implementation details of the format you use.

* Other input formats can use different settings. See for example

Furthermore Datasets created from RDDs will inherit partition layout from their parents.

Similarly bucketed tables will use bucket layout defined in the metastore with 1:1 relationship between bucket and Dataset partition.

Categories

hive - How to control partition size in Spark SQL

hive - How to control partition size in Spark SQL

Please log in or register to add a comment.

Please log in or register to reply this article.

1 Reply

Please log in or register to add a comment.

Just Browsing Browsing

Most popular tags