sparkle [spär′kəl]: a library for writing resilient analytics
applications in Haskell that scale to thousands of nodes, using
Spark and the rest of the Apache ecosystem under the hood.
See this blog post for the details.
Getting started
The tl;dr using the hello app as an example on your local machine:
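A sketch of the two commands involved, assuming the sparkle-example-hello target under apps/hello from the Build section below; the exact target label and output path may differ in your checkout:
$ nix-shell --pure --run "bazel build //apps/hello:sparkle-example-hello_deploy.jar"
$ nix-shell --pure --run "bazel run spark-submit -- $PWD/bazel-bin/apps/hello/sparkle-example-hello_deploy.jar"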
sparkle is a tool for creating self-contained Spark applications in
Haskell. Spark applications are typically distributed as JAR files, so
that's what sparkle creates. We embed Haskell native object code as
compiled by GHC in these JAR files, along with any shared library
required by this object code to run. Spark dynamically loads this
object code into its address space at runtime and interacts with it
via the Java Native Interface (JNI).
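Since a JAR is just a ZIP archive, you can list its contents to see the embedded pieces. The command below is a generic sketch using an illustrative output path; the exact layout inside the archive is an implementation detail:
$ unzip -l bazel-bin/apps/hello/sparkle-example-hello_deploy.jar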
How to use
To run a Spark application, the process is as follows:
1. create an application in the apps/ folder, in-repo or as a submodule;
2. build the app;
3. submit it to a local or cluster deployment of Spark.
If you run into issues, read the Troubleshooting section below
first.
Build
Linux
Include the following in a BUILD.bazel file next to your source code.
package(default_visibility = ["//visibility:public"])

load(
    "@rules_haskell//haskell:defs.bzl",
    "haskell_library",
)
load("@io_tweag_sparkle//:sparkle.bzl", "sparkle_package")

# hello-hs needs to contain a Main module with a main function.
# This main function will be invoked by Spark.
haskell_library(
    name = "hello-hs",
    srcs = ...,
    deps = ...,
    ...
)

sparkle_package(
    name = "sparkle-example-hello",
    src = ":hello-hs",
)
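For reference, a minimal sketch of what the Main module behind hello-hs might look like; it assumes newSparkConf, getOrCreateSparkContext, parallelize and collect bindings with roughly these shapes in Control.Distributed.Spark. The apps in the apps/ folder are the authoritative examples.
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Control.Distributed.Spark as Spark
import Data.Text (Text)
import qualified Data.Text.IO as Text

main :: IO ()
main = do
    -- Configure and obtain a SparkContext, then run a trivial job.
    conf <- Spark.newSparkConf "Hello sparkle!"
    sc   <- Spark.getOrCreateSparkContext conf
    rdd  <- Spark.parallelize sc (["Hello", "sparkle!"] :: [Text])
    Spark.collect rdd >>= mapM_ Text.putStrLn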
You might want to add the following settings to your .bazelrc.local
file.
common --repository_cache=~/.bazel_repo_cache
common --disk_cache=~/.bazel_disk_cache
common --local_cpu_resources=4
macOS
sparkle builds on macOS, but running it requires installing binaries
for Spark and possibly Hadoop (see .circleci/config.yml).
Alternatively, you can build and run sparkle via Docker on non-Linux
platforms, using a Docker image provisioned with Nix.
Integrating sparkle in another project
As sparkle interacts with the JVM, you need to tell ghc
where JVM-specific headers and libraries are. It needs to be able to
locate jni.h, jni_md.h and libjvm.so.
sparkle uses inline-java to embed fragments of Java code in Haskell
modules. This requires running the javac compiler, which must be
available in the shell's PATH. Moreover, javac needs to find the
Spark classes that inline-java quotations refer to, so these classes
need to be added to the CLASSPATH when building sparkle.
Depending on your build system, how to do this might vary. In this
repo, we use gradle to install Spark, and we query gradle for the
paths we need to add to the CLASSPATH.
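For illustration, a sketch of doing this by hand with environment variables, assuming a JDK at $JAVA_HOME and a Spark distribution unpacked at $SPARK_HOME (this repo derives the CLASSPATH from gradle instead, and the exact library paths vary between JDK versions):
# Headers and libjvm.so for GHC and the C toolchain it invokes.
export C_INCLUDE_PATH="$JAVA_HOME/include:$JAVA_HOME/include/linux:$C_INCLUDE_PATH"
export LIBRARY_PATH="$JAVA_HOME/lib/server:$LIBRARY_PATH"
export LD_LIBRARY_PATH="$JAVA_HOME/lib/server:$LD_LIBRARY_PATH"
# Spark classes for javac, which inline-java invokes at build time.
export CLASSPATH="$(echo "$SPARK_HOME"/jars/*.jar | tr ' ' ':'):$CLASSPATH"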
Additionally, the classes need to be found at runtime in order to be loaded.
The main thread can find them, but other threads need to invoke
initializeSparkThread or runInSparkThread from
Control.Distributed.Spark.
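For instance, a sketch of forking an auxiliary thread that performs JNI work, assuming runInSparkThread wraps an IO action (check Control.Distributed.Spark for the exact signature):
import Control.Concurrent (forkIO)
import Control.Monad (void)
import Control.Distributed.Spark (runInSparkThread)

-- Fork a thread whose context class loader is set up so that JNI
-- lookups of Spark classes can succeed.
forkSparkThread :: IO () -> IO ()
forkSparkThread act = void (forkIO (runInSparkThread act))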
If the main function terminates with unhandled exceptions, they
can be propagated to Spark with
Control.Distributed.Spark.forwardUnhandledExceptionsToSpark. This
allows Spark both to report the exception and to clean up before
termination.
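A sketch of that wrapping, assuming forwardUnhandledExceptionsToSpark has the IO a -> IO a shape its use here suggests:
import Control.Distributed.Spark (forwardUnhandledExceptionsToSpark)

main :: IO ()
main = forwardUnhandledExceptionsToSpark $ do
    -- Any exception escaping this block is reported to Spark,
    -- which can then clean up before terminating.
    putStrLn "running the job"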
Submit
Finally, to run your application, for example locally:
$ nix-shell --pure --run "bazel run spark-submit -- /path/to/$PWD/<app-target-name>_deploy.jar"
The <app-target-name> is the name of the Bazel target producing the JAR file.
See the apps/ folder for examples.
Troubleshooting
JNI calls in auxiliary threads fail with ClassNotFoundException
The context class loader of threads needs to be set appropriately
before JNI calls can find classes in Spark. Calling
initializeSparkThread or runInSparkThread from
Control.Distributed.Spark should set it.
Anonymous classes in inline-java quasiquotes fail to deserialize
When using inline-java, it is recommended to use the Kryo serializer,
which is currently not the default in Spark but is faster anyway. If
you don't use the Kryo serializer, objects of anonymous classes, which
arise e.g. when using Java 8 function literals, won't be deserialized
properly in multi-node setups. To avoid this problem, switch to the
Kryo serializer by setting the following configuration properties in
your SparkConf:
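The key property is spark.serializer. A sketch of setting it from Haskell, assuming the newSparkConf and confSet helpers from Control.Distributed.Spark (you can equivalently pass --conf spark.serializer=... to spark-submit):
{-# LANGUAGE OverloadedStrings #-}
import Control.Distributed.Spark (SparkConf, newSparkConf, confSet)

-- Build a SparkConf that selects the Kryo serializer.
kryoConf :: IO SparkConf
kryoConf = do
    conf <- newSparkConf "my-sparkle-app"
    confSet conf "spark.serializer" "org.apache.spark.serializer.KryoSerializer"
    return conf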
java.lang.UnsatisfiedLinkError: /tmp/sparkle-app...: failed to map segment from shared object
Sparkle unzips the Haskell binary program in a temporary location on
the filesystem and then loads it from there. For loading to succeed, the
temporary location must not be mounted with the noexec option.
Alternatively, the temporary location can be changed via JVM options.
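A hedged sketch, assuming the unpack location follows the JVM's standard java.io.tmpdir property (verify against the sparkle source for your version); the Spark properties below are standard spark-submit configuration, and the path is illustrative:
$ spark-submit \
    --conf spark.driver.extraJavaOptions=-Djava.io.tmpdir=/path/to/executable/tmp \
    --conf spark.executor.extraJavaOptions=-Djava.io.tmpdir=/path/to/executable/tmp \
    <app-target-name>_deploy.jar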
java.io.IOException: No FileSystem for scheme: s3n
Spark 2.4 requires explicitly specifying extra JAR files to spark-submit
in order to work with AWS. To work around this, add a --packages
argument when submitting the job:
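A sketch of such an invocation; the Maven coordinates and versions are illustrative and should match your Spark and Hadoop versions:
$ nix-shell --pure --run "bazel run spark-submit -- --packages com.amazonaws:aws-java-sdk:1.11.920,org.apache.hadoop:hadoop-aws:2.8.4 /path/to/$PWD/<app-target-name>_deploy.jar"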