For the second time, the European edition of the Spark Summit took place this year in Brussels, after its debut last year in Amsterdam. I'll give a short overview and summary of the talks I attended and of what I gathered from speaking with attendees and companies.
Developer Day
Spark’s Performance: The Past, Present, and Future (Developer)
This talk gives a short overview of hardware developments over the last couple of years. While disk and network access keeps getting faster, the raw computing power of CPUs has not changed significantly. That is why one of Databricks' main focuses has been making better use of modern CPU architectures. Examples of this are avoiding virtual function calls, keeping data in CPU registers, and better use of modern features like SIMD; more examples can be found on the Mechanical Sympathy blog. A central feature for this is the code generation implemented in Project Tungsten. The basic idea is to transform high-level code, written with SQL, DataFrames, or Datasets, into simple, low-level code that looks as if it had been written specifically to implement your query. No experienced developer would write such low-level code, but it is far easier for a CPU to optimize than the typical abstraction-over-abstraction code found in most complex projects. Spark 2.0 already uses this so-called whole-stage code generation by default, where an entire job stage is rewritten as a single hand-optimized-looking function.
Additionally, Spark uses a columnar in-memory format that enables optimizations such as skipping repeated values.
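You can inspect what whole-stage code generation produces yourself. Here is a minimal sketch (the toy aggregation is my own example; debugCodegen() is, as far as I know, available from Spark 2.0 on):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.debug._   // brings debugCodegen() into scope

val spark = SparkSession.builder().master("local[*]").appName("codegen-demo").getOrCreate()

// A toy aggregation; any DataFrame, Dataset, or SQL query works the same way.
val df = spark.range(1000000L)
  .selectExpr("id % 10 AS key")
  .groupBy("key")
  .count()

df.explain()        // stages prefixed with "*" are whole-stage generated
df.debugCodegen()   // prints the generated Java source for those stages
```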
How to Connect Spark to Your Own Datasource (Developer)
Using the MongoDB connector as an example, Ross Lawley of MongoDB shows how to implement your own data source. No documentation exists for this, so you have to look at the source code of existing connectors. One of the most important aspects is partitioning the data correctly; in the case of MongoDB there are multiple ways of doing this, so the developer has to choose a partitioner manually.
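To give a rough idea of the moving parts, here is a minimal sketch of a custom data source against the Spark data source API, reading plain text lines. The package and class names are made up; a real connector such as the MongoDB one mainly differs in schema inference and in how buildScan() partitions the remote data:

```scala
package com.example.lines   // hypothetical package, passed to spark.read.format(...)

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Spark looks for a class named DefaultSource in the package given to format().
class DefaultSource extends RelationProvider {
  override def createRelation(sqlContext: SQLContext,
                              parameters: Map[String, String]): BaseRelation =
    new LineRelation(sqlContext, parameters("path"))
}

class LineRelation(val sqlContext: SQLContext, path: String)
    extends BaseRelation with TableScan {

  // The schema Spark SQL exposes for this source.
  override def schema: StructType =
    StructType(Seq(StructField("line", StringType, nullable = true)))

  // Each partition of the returned RDD becomes a Spark partition; this is
  // where a real connector decides how to split the remote data.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.textFile(path).map(Row(_))
}
```

It would then be loaded with spark.read.format("com.example.lines").option("path", "...").load().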
Next up is support for structured data, which will make the connector usable from SparkSQL (and thus Python and R) and enable predicate push-down.
Data-Aware Spark (Research)
A typical problem in Spark jobs is the unequal distribution of keys across partitions (data skew), which leads to poor runtime performance in key-value operations (such as reduceByKey()) because some executors have more work to do than others. To address this, Zoltan Zvara introduces a research system that samples the data and tracks its distribution; dynamic partitioners can then redistribute the data automatically.
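The usual manual workaround, shown here as a sketch of the common "salting" trick (this is not the dynamic repartitioning system from the talk), spreads hot keys over several buckets, pre-aggregates, and then aggregates again per original key:

```scala
import org.apache.spark.sql.SparkSession
import scala.util.Random

val spark = SparkSession.builder().master("local[*]").appName("salting-demo").getOrCreate()
val numSalts = 8

// A skewed key-value RDD: almost everything lands on the key "hot".
val pairs = spark.sparkContext.parallelize(
  Seq.fill(100000)(("hot", 1)) ++ Seq(("cold", 1), ("cold", 1)))

val result = pairs
  .map { case (k, v) => ((k, Random.nextInt(numSalts)), v) } // spread each key over numSalts buckets
  .reduceByKey(_ + _)                                        // first, cheaper partial aggregation
  .map { case ((k, _), v) => (k, v) }                        // drop the salt again
  .reduceByKey(_ + _)                                        // final aggregation per original key

result.collect().foreach(println)   // -> ("hot", 100000), ("cold", 2)
```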
A Deep Dive into the Catalyst Optimizer (Developer)
Herman van Hovell dives into the depths of Catalyst, the SparkSQL optimizer. First, SQL statements are transformed into an AST (logical analysis), which is then translated into several possible execution plans; based on cost models, the best plan is selected for execution (physical planning). Some optimizations are already possible on the logical plan, such as constant folding, which collapses multiple constant expressions into one. Other optimizations include predicate push-down, which executes a filter directly at the data source, and column pruning, which reads only the columns required for the final result.
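You can watch Catalyst do some of this yourself: explain(true) prints the parsed, analyzed, and optimized logical plans as well as the physical plan. A minimal sketch (the Parquet path and column names are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("catalyst-demo").getOrCreate()
import spark.implicits._

val events = spark.read.parquet("/data/events.parquet")   // hypothetical data set

events
  .filter($"country" === "BE")   // shows up as a pushed-down filter on the Parquet scan
  .select($"userId")             // column pruning: only userId is read from disk
  .explain(true)                 // prints parsed, analyzed, optimized and physical plans
```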
SparkLint: a Tool for Monitoring, Identifying and Tuning Inefficient Spark Jobs Across Your Cluster (Ecosystem)
SparkLint is a tool that was developed internally at Groupon. It analyzes Spark event logs to calculate and visualize the CPU usage of a job in order to help optimize it. The tool should be available as open source by now.
From Single-Tenant Hadoop to 3000 Tenants in Apache Spark: Experiences from Watson Analytics (Experience and Use Cases)
The Watson Analytics for Social Media team uses a Spark Streaming pipeline for text analysis. Since a single social media channel produces only a small amount of relevant data, Watson uses the pipeline to process the data of 3000 customers in parallel. One lesson learned at this scale is that HTTP fetches should be executed outside of Spark to avoid disturbing the batch interval with unpredictable latency.
Mastering Spark Unit Testing (Developer)
Ted Malaska of Blizzard (the company behind World of Warcraft, StarCraft, and so on) presents how to unit test Spark code. He shows some of the helper classes he uses to test Spark, SparkSQL, and Spark Streaming jobs.
For integration tests you can use mini clusters (locally started services). Examples can be found in the integration tests of the corresponding Spark modules (e.g. spark-streaming-kafka). Or just use Docker.
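As a rough idea of what such a test can look like, here is my own minimal sketch using ScalaTest and a local SparkSession (not one of the helper classes from the talk):

```scala
import org.apache.spark.sql.SparkSession
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Spin up a local SparkSession once per suite and test transformations against it.
class WordCountSpec extends FunSuite with BeforeAndAfterAll {

  @transient private var spark: SparkSession = _

  override def beforeAll(): Unit = {
    spark = SparkSession.builder()
      .master("local[2]")
      .appName("unit-test")
      .getOrCreate()
  }

  override def afterAll(): Unit = spark.stop()

  test("word count aggregates per word") {
    val words  = spark.sparkContext.parallelize(Seq("a", "b", "a"))
    val counts = words.map((_, 1)).reduceByKey(_ + _).collectAsMap()
    assert(counts == Map("a" -> 2, "b" -> 1))
  }
}
```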
Apache Spark 2.0 Performance Improvements Investigated With Flame Graphs (Developer)
Luca Canali from CERN uses SparkSQL to analyze 160 PB of data (50 PB/year). Spark runs on three YARN clusters with a total of around 1000 cores and is used mainly for analytics and monitoring; using it for physics analysis is also planned.
He advises using recent versions of Spark: one of their analyses that took 12 hours on SQL Server completes in 20 minutes on Spark 1.6 and in two minutes on Spark 2.0. He recommends "active benchmarking" to find existing bottlenecks and identify their root causes.
The first step in investigating performance issues is to take a look at the execution plan. Stages that are marked with an asterisk (*) are generated (whole-stage code generation, see above). After that, flame graphs of stack profiles can help identify "hot" code paths. To analyze the JVM, tools like jstack and Java Flight Recorder (only available under a commercial JVM license) can be used. On the OS level, HProfiler and Linux's perf stat give more insights. Interpreting the results, however, remains the most difficult task.
Enterprise Day
TensorFrames: Deep Learning with TensorFlow on Apache Spark (Developer)
As mentioned in the Nvidia keynote (not discussed here), and in stark contrast to CPUs, GPUs have seen huge improvements in raw computing power over the last couple of years. While Tungsten and Catalyst optimize CPU usage, they are tailored to general workloads and not to the numerical problems ubiquitous in Machine Learning. GPUs are very good at numerical computations, but they are not always easy to program: each vendor has its own API, and abstractions such as Metal are not cross-platform. There are already some Machine Learning libraries geared towards GPU processing, such as Caffe or Google's TensorFlow. TensorFrames is an adaptation of TensorFlow to Spark DataFrames and Datasets. Its performance is already better than that of pure DataFrames, and there are still some optimizations to come regarding the data exchange between Spark and TensorFlow.
Finding Outliers in Streaming Data: A Scalable Approach (Data Science)
Casey Stella, committer on the cyber-security project Apache Metron, presents how to detect anomalies using Spark Streaming. In order to keep the processing time down, the pipeline is divided into two stages: the first determines candidates using mean values to save computing power, the second analyzes the candidates in detail using Principal Component Analysis. The method currently only works on one-dimensional data.
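To illustrate the cheap first stage only, here is a small sketch of my own (not Metron's code): values far from the mean are flagged as candidates that a more expensive second stage would then inspect.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("outlier-demo").getOrCreate()

// Toy one-dimensional data with an obvious outlier.
val values = spark.sparkContext.parallelize(Seq(1.0, 1.2, 0.9, 1.1, 1.0, 42.0))

val stats      = values.stats()                  // mean and stdev in one pass
val threshold  = 3.0 * stats.stdev
val candidates = values.filter(v => math.abs(v - stats.mean) > threshold)

candidates.collect().foreach(println)            // -> 42.0
```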
The demo unfortunately is cut a bit short because someone unplugged the external disk containing all the running services just before the talk. Nevertheless, some interesting sponsorship payments from pharmaceutical companies to US doctors are identified as outliers.
Paddling Up the Stream (Enterprise)
This talk gives a short overview of Spark Streaming as well as some of the common error messages in streaming jobs and how to solve them.
Spark Streaming at Bing Scale (Experience and Use Cases)
Microsoft's Bing team processes around one billion search requests per month, amounting to tens of terabytes per hour, using Spark Streaming. The data stems from Kafka clusters in multiple data centers, and the streams are correlated with one another and with static data sets. Incoming events are merged and analyzed, and the results are written back to Kafka. The following problems occur when handling data at this scale:
Unbalanced partitions – the data in the different Kafka topics arrives at different rates. KafkaDirectStream maps Kafka partitions 1:1 onto Spark partitions, so you have to deal with skewed partitions in your Spark jobs as well. To solve this, the Bing team created a custom KafkaRDD that distributes the data evenly across partitions without needing a repartition().
Slow brokers – similarly, accessing the various Kafka instances can involve different latencies and throughput, so related data may not arrive in the same batch. A business-defined window of ten minutes is laid over the data, and all events that are older are ignored. To avoid losing events that arrive late because of slow brokers, data from the brokers is fetched in advance in a separate thread and cached. It remains unclear, however, why they are not using MirrorMaker to at least reduce the variance in latency.
Finding offsets – the number of offsets after a restart can be very high at this scale. In order to speed up finding offsets, they are fetched and cached in a separate thread as well.
Joins – right now, joins are only possible by batch time, not by business time (planned for Spark 2.1). To work around this, joins are handled manually in updateStateByKey(); see the sketch below.
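Here is a very reduced sketch of my own (not Bing's code) of how such a manual join by key can look with updateStateByKey(); real code would also evict buffered events that fall outside the business-defined window, and checkpointing must be enabled on the StreamingContext:

```scala
import org.apache.spark.streaming.dstream.DStream

// Toy event type: "source" tells us which of the two streams it came from.
case class Event(source: String, payload: String)

def manualJoin(events: DStream[(String, Event)]): DStream[(String, (Event, Event))] = {
  // Buffer all events per key as state until both sides have arrived.
  val buffered = events.updateStateByKey[Seq[Event]] { (newEvents, state) =>
    Some(state.getOrElse(Seq.empty) ++ newEvents)
  }
  buffered
    .filter { case (_, evts) => evts.map(_.source).distinct.size >= 2 }  // both sides arrived
    .map    { case (key, evts) => key -> (evts.head, evts.last) }        // emit the joined pair
}
```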
This talk has some parallels to How We Built an Event-Time Merge of Two Kafka-Streams with Spark Streaming, presented on the first day by Otto GmbH.
SparkOscope (Developer)
SparkOscope (a wordplay on stethoscope) is a tool for finding bottlenecks in Spark jobs. It requires Sigar to be installed on all nodes to capture system-wide CPU and memory usage and feed it into Spark's monitoring framework. An extension of the Spark UI allows visualizing both Spark and SparkOscope metrics directly on the job page. More features are planned.
Problem Solving Recipes Learned from Supporting Spark (Experience and Use Cases)
Lightbend offers commercial Spark support and summarizes a selection of common problems and their solutions.
OOM – for one, you can tune spark.memory.fraction and spark.storage.memoryFraction, although the defaults are already pretty good. A more important tip is to avoid creating too many objects.
NoSuchMethodError – caused by a dependency conflict when both your code and Spark use the same library, but in different versions, and one version changes a method signature. You can try to force your version of the library by setting --driver-class-path; if this makes Spark crash, you have to shade the dependency.
Speculation – slow tasks can be executed speculatively on other nodes. There are a couple of threshold parameters to avoid using up too many resources.
Strategize Joins – a couple of join strategies are presented, with the broadcast join being the most interesting one. Apparently the sort-merge join doesn't always perform as expected.
Safe Stream Recovery – a note on how to correctly initialize RDDs in streaming jobs in order to be able to recover from checkpoints; see the sketch below.
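For the last point, the usual pattern is to build the entire DStream graph inside a factory function and hand it to StreamingContext.getOrCreate(), so that an existing checkpoint is restored instead of a fresh context being created. A minimal sketch (the checkpoint path is made up):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-job"   // hypothetical path

// The whole DStream graph must be defined inside this factory function;
// it is only called when no checkpoint exists yet.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-job")
  val ssc  = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define sources and transformations here, before the context starts ...
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```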
Extending Spark with Java Agents (Developer)
This talk gives a short overview of how to instrument code using Java agents. Java agents are loaded dynamically at runtime and can either inject additional code or rewrite existing classes. They are, however, hard to debug and maintain, and should mostly be used for hotfixes or for collecting metrics without polluting business code. They are also able to access internal Spark data structures and could, for example, cache RDDs based on runtime behavior.
BTrace is a Java annotation DSL which makes writing Java agents easier.
No One Puts Spark in the Container (Developer)
In this excellent talk, Jörg Schad of Mesosphere introduces the Linux basics underlying Docker. He shows how isolation is implemented using namespaces and control groups (cgroups) and explains the pitfalls of running the JVM in Docker. For example, you have to be careful when assigning CPU resources to JVM containers, or the JVM might see the wrong number of cores. On the other hand, you can prevent the Metaspace from clogging up an entire system by assigning a hard memory limit to the container, which is then killed when the JVM exceeds the limit.
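A quick sanity check of what the JVM actually sees inside a container can save a lot of debugging; the small sketch below simply prints the core count and maximum heap the JVM has detected:

```scala
// Run inside the container to verify the JVM's view of its resources,
// since CPU shares and memory limits are not always reflected correctly.
object JvmResources extends App {
  val cores  = Runtime.getRuntime.availableProcessors()
  val heapMb = Runtime.getRuntime.maxMemory() / (1024 * 1024)
  println(s"JVM sees $cores cores and a max heap of $heapMb MB")
}
```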
Overall Impressions
Developers are still enthusiastic about Spark’s nice API and how easy it is to get a job up and running. There were around 1200 attendees.
Ecosystem – Many companies no longer use Spark solely for proofs of concept, but are moving to somewhat more advanced use cases with increasing workloads. They now have a bigger need to understand why jobs are slow or crashing.
Data Science – there were fewer talks about Machine Learning this year, and some of them went back to the basics.
Enterprises – Spark adoption has increased further, and Spark can now be found in companies of various sizes and industries, such as finance, telco, e-commerce, and biotech, as well as at more prominent names such as Facebook and Netflix.