I have a 40 GB CSV file, and I use a Spark job to convert it into a Parquet file with Snappy compression; the result is a 1.8 GB Parquet file.
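The question does not include the job itself, so here is a minimal sketch of how such a conversion might be submitted in local mode. The script name `convert_to_parquet.py`, the paths, and the exact flags are assumptions based on the description, not the asker's actual code:

```shell
# Hypothetical write job: read the 40 GB CSV and write Snappy-compressed Parquet.
# The PySpark script itself (not shown in the question) would boil down to:
#   spark.read.csv("input.csv", header=True) \
#        .write.option("compression", "snappy") \
#        .parquet("output.parquet")
spark-submit \
  --master "local[8]" \
  --driver-memory 14g \
  convert_to_parquet.py
```

In local mode the whole job runs inside the driver process, so `--driver-memory` (14 GB here, matching the figure given below) is the setting that bounds the job's memory.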
Then I have another Spark job that reads the Parquet file and processes it.
And I found that even if I just read the file without any processing, I need to assign 75 GB of memory to the reading job for it to run smoothly, while the writing job only needs 14 GB!
I use Spark 2.8 on a single machine with an 8-core CPU and 128 GB of RAM.
All the settings are the same for both the read and write jobs.
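For comparison, the read job might be submitted the same way, differing only in the memory setting (again, the script name and paths are hypothetical, since the question shows no code):

```shell
# Hypothetical read job: simply load the Parquet file and materialize it,
# with no further processing, e.g.:
#   spark.read.parquet("output.parquet").count()
spark-submit \
  --master "local[8]" \
  --driver-memory 75g \
  read_parquet.py
```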
I think it is really weird that reading takes over five times as much memory as writing (75 GB vs. 14 GB).
Does anyone have any idea why? Thanks!
asked by Danny, translated from SO