OpenSource Name: Kotlin/kotlin-spark-api
OpenSource URL: https://github.com/Kotlin/kotlin-spark-api
OpenSource Language: Kotlin 94.8%

# Kotlin for Apache® Spark™

Your next API to work with Apache Spark.

This project adds a missing layer of compatibility between Kotlin and Apache Spark. It allows Kotlin developers to use familiar language features such as data classes and lambda expressions as simple expressions in curly braces or method references.

We have opened a Spark Project Improvement Proposal: Kotlin support for Apache Spark, to work with the community towards getting Kotlin supported as a first-class citizen in Apache Spark. We encourage you to voice your opinions and participate in the discussion.

## Table of Contents

- Supported versions of Apache Spark
- Releases
- How to configure Kotlin for Apache Spark in your project
- Kotlin for Apache Spark features
- Streaming
- Examples
- Reporting issues / Support
- Code of Conduct
- License
## Supported versions of Apache Spark

See the Releases section below; the artifact naming convention indicates which Apache Spark version each artifact targets.
## Releases

The list of Kotlin for Apache Spark releases is available here.
The Kotlin for Spark artifacts adhere to the following convention: the artifact ID carries the targeted Apache Spark version, as in `kotlin-spark-api-3.2` for Spark 3.2 in the Maven example below.
## How to configure Kotlin for Apache Spark in your project

You can add Kotlin for Apache Spark as a dependency to your project. Here's an example of the Maven configuration:

```xml
<dependency>
  <groupId>org.jetbrains.kotlinx.spark</groupId>
  <artifactId>kotlin-spark-api-3.2</artifactId>
  <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>
```
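If you build with Gradle instead of Maven, the same coordinates apply. A minimal Gradle Kotlin DSL sketch (the version placeholders are ours):

```kotlin
dependencies {
    // Same artifacts as in the Maven example above
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.2:<kotlin-spark-api-version>")
    implementation("org.apache.spark:spark-sql_2.12:<spark-version>")
}
```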
Once you have configured the dependency, you only need to add the following import to your Kotlin file:

```kotlin
import org.jetbrains.kotlinx.spark.api.*
```

## Jupyter

The Kotlin Spark API also supports Kotlin Jupyter notebooks. To use it, simply add

```
%use spark
```

to the top of your notebook. This will get the latest version of the API, together with the latest version of Spark. To define a certain version of Spark or of the API itself, pass the versions to the `%use` directive (see the wiki for the exact syntax).
Inside the notebook a Spark session will be initiated automatically and can be accessed via the `spark` value. There is also support for HTML rendering of Datasets and simple (Java)RDDs; check out the examples as well. To use the Spark Streaming abilities, instead use

```
%use spark-streaming
```
This does not start a Spark session right away, meaning you can call `withSparkStreaming { ... }` (see the Streaming section below) in whichever cell you want. NOTE: You need a sufficiently recent version of the Kotlin Jupyter kernel for this to work. For more information, check the wiki.

## Kotlin for Apache Spark features

### Creating a SparkSession in Kotlin

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application")
    .orCreate
```

This is not needed when running the Kotlin Spark API from a Jupyter notebook.

### Creating a Dataset in Kotlin

```kotlin
spark.dsOf("a" to 1, "b" to 2)
```

The example above produces `Dataset<Pair<String, Int>>`.

### Null safety

There are several aliases in the API, like `leftJoin` and `rightJoin`. These are null-safe by design: `leftJoin`, for example, is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`. In Spark, you might also come across Scala-native `Option`s; the API provides helpers to convert these to and from Kotlin nullable values. Similarly, you can also create `Option`s from nullable types where a Spark API expects them.
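As a sketch of how this plays out (assuming `leftJoin` takes the right-hand `Dataset` plus a join `Column`, and that `Pair` fields are encoded as `first`/`second`; the `withSpark` helper is explained below):

```kotlin
withSpark {
    val left = dsOf(1 to "a", 2 to "b")
    val right = dsOf(1 to "x")

    // Rows without a match carry null on the right-hand side,
    // and the Kotlin type system forces you to handle it.
    left.leftJoin(right, left.col("first") eq right.col("first"))
        .show()
}
```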
### withSpark function

We provide you with the useful `withSpark` function, which accepts everything that may be needed to run Spark: properties, an app name, the master location and so on. It also accepts a block of code to execute inside the Spark context, and after the work block ends, `spark.stop()` is called automatically.

Do not use this when running the Kotlin Spark API from a Jupyter notebook.

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it X it } // creates Tuple2<Int, Int>
        .show()
}
```
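Data classes work as Dataset element types as well; a minimal sketch (the `Person` class here is our own invention):

```kotlin
data class Person(val name: String, val age: Int)

withSpark {
    dsOf(Person("Alice", 29), Person("Bob", 31))
        .filter { it.age > 30 } // plain Kotlin lambda, no Scala boilerplate
        .show()
}
```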
### withCached function

It can easily happen that we need to fork our computation into several paths. To compute things only once we should call the `cache` method. However, it is easy to forget to unpersist cached data when it is no longer needed, which can hold more memory than intended or break things unexpectedly. To solve these problems we've added the `withCached` function, which caches the Dataset for the duration of its block and unpersists it afterwards.

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { tupleOf(it, it + 2) }
        .withCached {
            showDS()
            filter { it._1 % 2 == 0 }.showDS()
        }
        .map { tupleOf(it._1, it._2, (it._1 + it._2) * 2) }
        .show()
}
```

Here we're showing the cached Dataset for debugging purposes, then filtering it. The `filter` method returns a filtered Dataset, after which the cached Dataset is unpersisted, so more memory is available for the subsequent `map` and for collecting the resulting Dataset.

### toList and toArray methods

For more idiomatic Kotlin code we've added `toList` and `toArray` methods to this API, so you can collect a Dataset into familiar Kotlin collections rather than Scala arrays.
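A small sketch of these collection helpers (we assume here that `toList` infers the element type from the Dataset):

```kotlin
withSpark {
    // Collect into a Kotlin List instead of a Scala array
    val evens: List<Int> = dsOf(1, 2, 3, 4)
        .filter { it % 2 == 0 }
        .toList()
    println(evens) // [2, 4]
}
```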
### Column infix/operator functions

Similar to the Scala API for `Column`s, many of the operator functions are available, for example:

```kotlin
dataset.select( col("colA") + 5 )
dataset.select( col("colA") / col("colB") )

dataset.where( col("colA") `===` 6 )
// or alternatively
dataset.where( col("colA") eq 6 )
```

To read more, check the wiki.

### Overload resolution ambiguity

We had to implement the functions `reduceGroupsK` (instead of `reduceGroups`) and `reduceK` (instead of `reduce`), because the original names caused resolution ambiguity between the Kotlin, Scala and Java APIs, which was quite hard to solve. We have a special example of working with these functions in the Groups example.
### Tuples

Inspired by ScalaTuplesInKotlin, the API introduces a lot of helper extension functions to make working with Scala Tuples a breeze in your Kotlin Spark projects. While working with data classes is encouraged, Scala Tuples are recommended for pair-like Datasets / RDDs / DStreams, both for the useful helper functions and for Spark performance. To enable these features simply add

```kotlin
import org.jetbrains.kotlinx.spark.api.tuples.*
```

to the start of your file. Tuple creation can be done in the following manners:

```kotlin
val a: Tuple2<Int, Long> = tupleOf(1, 2L)
val b: Tuple3<String, Double, Int> = t("test", 1.0, 2)
val c: Tuple3<Float, String, Int> = 5f X "aaa" X 1
```

To read more about tuples and all the added functions, refer to the wiki.
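Beyond creation, the helpers make Scala tuples feel native in Kotlin. A small sketch (we assume the tuples package provides `componentN` operators for destructuring, alongside the `_1`, `_2`, ... accessors used elsewhere in this README):

```kotlin
import org.jetbrains.kotlinx.spark.api.tuples.*
import scala.Tuple2

fun main() {
    val pair: Tuple2<Int, String> = tupleOf(1, "one")
    val (number, name) = pair    // destructuring, assuming componentN extensions
    println("$number -> $name")  // 1 -> one
    println(pair._1)             // 1
}
```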
## Streaming

A popular Spark extension is Spark Streaming. Of course, the Kotlin Spark API also introduces a more Kotlin-esque approach to writing your streaming programs. There are examples for use with a checkpoint, Kafka and SQL in the examples module. We shall also provide a quick example below:

```kotlin
// Automatically provides ssc: JavaStreamingContext which starts and awaits termination or timeout
withSparkStreaming(batchDuration = Durations.seconds(1), timeout = 10_000) { // this: KSparkStreamingSession
    // Create an input stream from, for instance, Netcat: `$ nc -lk 9999`
    val lines: JavaReceiverInputDStream<String> = ssc.socketTextStream("localhost", 9999)

    // Split the input stream on spaces
    val words: JavaDStream<String> = lines.flatMap { it.split(" ").iterator() }

    // Perform an action on each RDD formed in the stream
    words.foreachRDD { rdd: JavaRDD<String>, _: Time ->

        // To convert the JavaRDD to a Dataset, we need a Spark session using the RDD's context
        withSpark(rdd) { // this: KSparkSession
            val dataframe: Dataset<TestRow> = rdd.map { TestRow(word = it) }.toDS()
            dataframe
                .groupByKey { it.word }
                .count()
                .show()
            // +-----+--------+
            // |  key|count(1)|
            // +-----+--------+
            // |hello|       1|
            // |   is|       1|
            // |    a|       1|
            // | this|       1|
            // | test|       3|
            // +-----+--------+
        }
    }
}
```

For more information, check the wiki.

## Examples

For more, check out the examples module. To get up and running quickly, check out this tutorial.

## Reporting issues / Support

Please use GitHub issues for filing feature requests and bug reports. You are also welcome to join the kotlin-spark channel in the Kotlin Slack.

## Code of Conduct

This project and the corresponding community are governed by the JetBrains Open Source and Community Code of Conduct. Please make sure you read it.

## License

Kotlin for Apache Spark is licensed under the Apache 2.0 License.