Welcome to OGeek Q&A Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


scala - Spark fixed-width file import: large number of columns causing high execution time

I receive a fixed-width .txt source file from which I need to extract 20K columns. Since there are few libraries for processing fixed-width files with Spark, I have written my own code to extract the fields from fixed-width text files.

The code reads the text file as an RDD with

sparkContext.textFile("abc.txt")

then reads a JSON schema to get the name and width of each column.

  • A function takes each fixed-length line and, using the start and end position of every column, calls substring to build an Array of field values.

  • The function is mapped over the RDD.

  • The resulting RDD is converted to a DataFrame, the column names are applied, and the result is written to Parquet.
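The substring step above can be sketched in plain Scala, without Spark. This is only an illustration: the object name FixedWidthSlicer and both method names are hypothetical, and clamping short lines to empty strings is an assumption that the original code does not make (there a short line would throw). The offsets are computed once from the widths, so the per-line work is just the substring calls.

```scala
object FixedWidthSlicer {
  // Start offset of every column plus one trailing end offset.
  // Widths (2, 3, 1) give offsets (0, 2, 5, 6).
  def offsets(widths: Seq[Int]): Array[Int] =
    widths.scanLeft(0)(_ + _).toArray

  // Cut one record into fields using the precomputed offsets.
  // Assumption: a line shorter than the schema yields empty strings
  // for the missing columns instead of throwing an exception.
  def slice(line: String, offs: Array[Int]): Array[String] = {
    val out = new Array[String](offs.length - 1)
    var k = 0
    while (k < out.length) {
      val start = math.min(offs(k), line.length)
      val end = math.min(offs(k + 1), line.length)
      out(k) = line.substring(start, end)
      k += 1
    }
    out
  }
}
```

For example, FixedWidthSlicer.slice("abcdef", FixedWidthSlicer.offsets(Seq(2, 3, 1))) cuts the line into "ab", "cde" and "f".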

Representative code:

val rdd1 = spark.sparkContext.textFile("file1")

// Split one fixed-width line into fields using the column widths.
def SubstrSting(line: String, colLength: Seq[Int]): Seq[String] = {
  var now = 0  // start position of the current column
  val collector = new Array[String](colLength.length)
  for (k <- colLength.indices) {
    collector(k) = line.substring(now, now + colLength(k))
    now = now + colLength(k)
  }
  collector.toSeq
}

// ColLengthSeq holds the column widths, read from a separate schema file.
val StringArray = rdd1.map(SubstrSting(_, ColLengthSeq))



import spark.implicits._  // needed for toDF and the $"..." column syntax
StringArray.toDF("StringCol")
  .select((0 until ColCount).map(j => $"StringCol"(j) as column_seq(j)): _*)
  .write.mode("overwrite").parquet("c"home")

This code works fine for files with a small number of columns, but it takes a lot of time and resources with 20K columns. As the number of columns increases, so does the execution time.

Has anyone faced this issue with a large number of columns? I need suggestions on performance tuning: how can I tune this job or code?
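One variation that might be worth trying, offered as an untested sketch rather than a confirmed fix, is to build Row objects with an explicit StructType. This avoids creating a single array column and then selecting 20K elements out of it. The object name FixedWidthToParquet, the sample column names and widths, and the use of command-line arguments for the paths are all assumptions made for illustration. Whether this actually reduces the run time at 20K columns has not been measured here.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object FixedWidthToParquet {
  // Slice one line into fields using precomputed offsets
  // (offsets has one more element than there are columns).
  def slice(line: String, offsets: Array[Int]): Array[String] = {
    val out = new Array[String](offsets.length - 1)
    var k = 0
    while (k < out.length) {
      out(k) = line.substring(offsets(k), offsets(k + 1))
      k += 1
    }
    out
  }

  def main(args: Array[String]): Unit = {
    // args(0) = input text file, args(1) = output Parquet directory (assumed).
    // The master is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("fixed-width-import").getOrCreate()

    // Placeholder schema; in the real job these come from the JSON schema file.
    val colNames = Seq("col_a", "col_b", "col_c")
    val colLengths = Seq(2, 3, 1)

    val offsets = colLengths.scanLeft(0)(_ + _).toArray
    val schema = StructType(colNames.map(n => StructField(n, StringType, nullable = true)))

    // One Row per line, with one string field per column.
    val rows = spark.sparkContext
      .textFile(args(0))
      .map(line => Row.fromSeq(slice(line, offsets).toSeq))

    spark.createDataFrame(rows, schema)
      .write.mode("overwrite").parquet(args(1))

    spark.stop()
  }
}
```

The offsets array is small and is captured by the map closure, so each task receives its own copy. With 20K entries that is still only tens of kilobytes.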


1 Reply

Waiting for answers
