sparkplug
about sparkplug
sparkplug allows Apache Spark to act as a near real-time/first-class citizen in an enterprise architecture by exposing Spark via RabbitMQ.
Simply put, you write PlugHandler
classes that are triggered based on commands
, allowing multiple tasks to be embedded in one library. The PlugClient
allows for an easy way to submit jobs to the server. One of the easiest ways to get data between the two systems is ot have the Spark job output to a shared cache, which the business logic can then access once the PlugClient
response is handled.
using sparkplug
To achieve this goal, one creates a custom library that includes sparkplug-server
, then:
-
Implement
PlugHandler
as many times as you want; eachPlugHandler
registers to listen for a specificcommand
(a string tag to allow different messages to have different actions) -
Get a handle to the
Plug
object and register the variousPlugHandler
implementationsval plug = Plug.getInstance() plug.connect() plug.registerHandler("test", plugHandler)
-
Tell the
Plug
to start consuming messages!plug.run()
-
When you are done with the
Plug
, clean up nicely viashutdown()
Compile the JAR, use spark-submit
to load it, and you will now be able to have Spark fire off jobs based on messages from a queue.
customizing sparkplug
sparkplug uses the Typesafe configuration library, and comes with an embedded reference.conf
with the following defaults:
sparkplug {
master = "spark://localhost:7077"
inbound-queue = "q_sparkplug"
outbound-queue = "r_sparkplug"
}
amqp {
addresses = [
{ host = "localhost", port = 5672 }
],
virtual-host = "/",
username = "guest",
password = "guest",
heartbeat = disable,
timeout = 30000,
automatic-recovery = true,
recovery-interval = 5s
ssl = disable
}
These settings are reasonable for development/testing, but are NOT viable in production; you really will want to turn SSL/TLS on, ensure heartbeats are working, and use a RabbitMQ cluster. Similarly, the sparkplug.master
URL will probably need to be changed.
the sparkplug client
The flip-side of having a system that listens to queues is having some other system that is putting messages on the queue and consuming the responses from Spark when a job completes. You can do this manually, simply invert submit messages to the listener-queue
and then listen for responses on the outbound-queue
. However, if this sounds like effort, we've created the PlugClient
for you!
-
Implement the
PlugResponseHandler
(note: unlike the server where there are multiple handlers per command, there is only one response handler - you can embed the logic to discern what to do internally) -
Get a handle to the
PlugClient
object and register thePlugResponseHandler
implementationval plugClient = PlugClient.getInstance() plugClient.setHandler(handler)
-
Tell the
PlugClient
to start consuming messages!plugClient.connect()
-
When you are done with the
PlugClient
, clean up nicely viashutdown()
testing sparkplug
Due to the fact that this library requires Spark and RabbitMQ, full integration testing requires running two Docker containers - one with RabbitMQ and another with a mini Spark cluster along with Gradle for testing.
To instantiate the environment, simply run the following:
./build_environment.sh /bin/bash
./gradlew clean test
roadmap
- Better error handling
- Java-based sparkplug client
- sparkplug broker that allows one to distribute jobs across multiple clusters