
0 votes
624 views
in Technique by (71.8m points)

scala - Why does Spark fail with java.lang.OutOfMemoryError: GC overhead limit exceeded?

I'm trying to reimplement in Spark a Map/Reduce job that previously worked fine in Hadoop. The Spark application is defined as follows:

val data = spark.textFile(file, 2).cache()
val result = data
  .map(//some pre-processing)
  .map(docWeightPar => (docWeightPar(0), docWeightPar(1)))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey( _ + _)

Where MyFunctions.combine is

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

The combine function produces a lot of map keys when the input list is big, and this is where the exception is thrown.

In the Hadoop MapReduce setting this wasn't a problem, because the point where combine yields its pairs is the point where Hadoop wrote them to disk. Spark seems to keep everything in memory until it fails with java.lang.OutOfMemoryError: GC overhead limit exceeded.
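
To put rough numbers on it (taking, say, a hypothetical line with 10,000 tuples):

// Back-of-envelope sketch: the two generators in combine run independently,
// so a single call yields (n - 1) * (n - 1) pairs.
val n = 10000L                        // hypothetical tuples.length for one input line
val pairsYielded = (n - 1) * (n - 1)  // roughly 100 million (String, Double) pairs
println(s"one call to combine yields about $pairsYielded pairs")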

I am probably doing something really basic wrong, but I couldn't find any pointers on how to move forward from this, so I would like to know how I can avoid it. Since I am a total noob at Scala and Spark, I am not sure whether the problem comes from one, the other, or both. I am currently running this program on my own laptop, and it works for inputs where the tuples array is not very long. Thanks in advance.


1 Reply

0 votes
by (71.8m points)

Add the following JVM arg when you launch spark-shell or spark-submit:

-Dspark.executor.memory=6g
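
If you prefer to set it in code instead, a minimal sketch (assuming the standard SparkConf API; "MyApp" is just a placeholder name):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: the same memory setting applied programmatically before the context is created.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.memory", "6g")  // note: in local mode the driver's heap is what actually matters
val sc = new SparkContext(conf)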

You may also consider explicitly setting the number of workers when you create an instance of SparkContext:

Distributed Cluster

Set the slave names in conf/slaves, then point the SparkContext at the cluster's master URL ("master" below is a placeholder):

val sc = new SparkContext("master", "MyApp")
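
For a local run on a laptop (as in the question), a sketch of the same idea, assuming Spark's local[N] master URL where N is the number of worker threads:

import org.apache.spark.SparkContext

// Sketch for a local run: local[2] starts two worker threads in one JVM.
val sc = new SparkContext("local[2]", "MyApp")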
