Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

0 votes
587 views
in Technique[技术] by (71.8m points)

amazon web services - How to configure Spark / Glue to avoid creation of empty $_folder_$ after Glue job successful execution

I have a simple Glue ETL job that is triggered by a Glue workflow. It drops duplicate data from a crawler table and writes the result back to an S3 bucket. The job completes successfully. However, the empty "$folder$" objects that Spark generates remain in S3. They clutter the hierarchy and cause confusion. Is there any way to configure Spark or the Glue context to hide or remove these folders after the job completes successfully?




1 Reply

0 votes
by (71.8m points)

OK, after a few days of testing I finally found the solution. Before pasting the code, let me summarize what I found:

  • Those $folder$ objects are created by Hadoop. Apache Hadoop creates these files when it creates a folder in an S3 bucket (Source 1). They are actually directory markers, stored as path + / (Source 2).
  • To change this behavior, you need to change the Hadoop S3 write configuration in the Spark context. Read this and this and this.
  • Read about S3, S3a and S3n here and here.
  • Thanks to @stevel's comment here.
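
For markers that earlier runs already left in the bucket, a small helper can at least pick them out of a key listing. This is only a sketch: the `_$folder$` suffix, the function name `find_folder_markers`, and the sample keys are assumptions made for illustration, so check them against what your bucket actually contains.

```python
# Assumed suffix of the marker objects; adjust it if your bucket differs.
MARKER_SUFFIX = "_$folder$"


def find_folder_markers(keys, suffix=MARKER_SUFFIX):
    """Return the keys that look like empty directory-marker objects."""
    return [key for key in keys if key.endswith(suffix)]


# Hypothetical key listing, for illustration only.
sample_keys = [
    "output/DDD=1_$folder$",
    "output/DDD=1/part-00000.parquet",
    "output_$folder$",
]

print(find_folder_markers(sample_keys))
```

The helper only filters names. Deleting the matching objects is a separate step and is left out here on purpose.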

The solution is to set the following in the Spark context's Hadoop configuration.

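from pyspark.context import SparkContext

sc = SparkContext()
hadoop_conf = sc._jsc.hadoopConfiguration()
# Handle s3:// URIs with the S3A filesystem implementation
hadoop_conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")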

To avoid creating _SUCCESS files, set the following configuration as well:

hadoop_conf.set("mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
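
If the job sets several options, it may be tidier to keep them in one place and apply them in a loop. The following is a minimal sketch, not part of the original solution: `S3_OUTPUT_SETTINGS`, `apply_s3_output_settings`, and `RecordingConf` are hypothetical names, and `RecordingConf` is only a stand-in so the helper can be tried without a Spark cluster.

```python
# The two settings from the answer above, collected in one place.
S3_OUTPUT_SETTINGS = {
    "fs.s3.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "mapreduce.fileoutputcommitter.marksuccessfuljobs": "false",
}


def apply_s3_output_settings(hadoop_conf, settings=S3_OUTPUT_SETTINGS):
    """Call set(key, value) on hadoop_conf for every setting, then return it."""
    for key, value in settings.items():
        hadoop_conf.set(key, value)
    return hadoop_conf


class RecordingConf:
    """Hypothetical stand-in for a Hadoop configuration, for local checks only."""

    def __init__(self):
        self.values = {}

    def set(self, key, value):
        self.values[key] = value


conf = apply_s3_output_settings(RecordingConf())
print(conf.values["fs.s3.impl"])
```

In the Glue job itself, the call would presumably be apply_s3_output_settings(sc._jsc.hadoopConfiguration()), made before any write.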

Make sure you use the s3:// URI when writing to the S3 bucket. For example:

myDF.write.mode("overwrite").parquet('s3://XXX/YY', partitionBy=['DDD'])


...