I have a 40 GB CSV file, and I use a Spark job to convert it into a Parquet file with Snappy compression; the result is a 1.8 GB Parquet file.
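The question does not include the job itself, so here is a minimal sketch of how such a conversion might be submitted in local mode. The script name `convert_to_parquet.py`, the paths, and the exact flags are assumptions based on the description, not the asker's actual code:

```shell
# Hypothetical write job: read the 40 GB CSV and write Snappy-compressed Parquet.
# The PySpark script itself (not shown in the question) would boil down to:
#   spark.read.csv("input.csv", header=True) \
#        .write.option("compression", "snappy") \
#        .parquet("output.parquet")
spark-submit \
  --master "local[8]" \
  --driver-memory 14g \
  convert_to_parquet.py
```

In local mode the whole job runs inside the driver process, so `--driver-memory` (14 GB here, matching the figure given below) is the setting that bounds the job's memory.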
Then I have another Spark job that reads the Parquet file and processes it.
And I found that even if I just read the file without any processing, I need to assign 75 GB of memory to the reading job for it to run smoothly, while the writing job only needs 14 GB!
I use Spark 2.8 on a single machine with an 8-core CPU and 128 GB of RAM.
All the settings are the same for both the read and write jobs.
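For comparison, the read job might be submitted the same way, differing only in the memory setting (again, the script name and paths are hypothetical, since the question shows no code):

```shell
# Hypothetical read job: simply load the Parquet file and materialize it,
# with no further processing, e.g.:
#   spark.read.parquet("output.parquet").count()
spark-submit \
  --master "local[8]" \
  --driver-memory 75g \
  read_parquet.py
```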
I think it is really weird that reading takes over five times as much memory as writing (75 GB vs. 14 GB).
Does anyone have any idea why? Thanks!
asked by Danny, translated from SO