This week I read Spark: Cluster Computing with Working Sets. The paper introduces a new framework called Spark, which focuses on use cases where MapReduce falls short. While many applications fit MapReduce's acyclic data flow model, there are use cases requiring iterative jobs where MapReduce is not very efficient. Many machine learning algorithms fall into this category. Secondly, with MapReduce each query incurs a significant overhead because it is effectively a new job each time, so MapReduce is not ideal for interactive analysis. This is the space Spark intends to fill. Spark is implemented in Scala and now supports APIs in Java, Python, and R.
The model supports RDDs, parallel operations, and two types of shared variables. A driver program implements the high-level control flow and launches different operations in parallel.
Resilient Distributed Datasets (RDDs) are read-only collections of objects, partitioned across a set of machines in a cluster. A handle to an RDD contains enough information to recompute the RDD from data in case of a partition failure. RDDs can be constructed in four ways: from a file in a shared file system (such as HDFS), by parallelizing an existing collection in the driver program, by transforming an existing RDD, or by changing the persistence of an existing RDD.
flatMap, map and filter are standard operations supported by various functional programming languages. For example, map takes a list of items and returns a new list obtained by applying a function to each item of the original list.
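As a plain-Python illustration of these semantics (local lists, not actual Spark code):

```python
# Plain-Python illustration of map, filter, and flatMap semantics.
# These mirror the RDD transformations but run on local lists.
nums = [1, 2, 3, 4]

mapped = [n * n for n in nums]              # map: one output per input
filtered = [n for n in nums if n % 2 == 0]  # filter: keep matching items
# flatMap: map each item to a small list, then flatten the results
flat_mapped = [x for n in nums for x in [n, -n]]

print(mapped)       # [1, 4, 9, 16]
print(filtered)     # [2, 4]
print(flat_mapped)  # [1, -1, 2, -2, 3, -3, 4, -4]
```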
RDDs are lazy and ephemeral: they are constructed on demand and discarded after use. The persistence can be changed to cache, in which case they are still lazy but are kept in memory (spilling to disk if they cannot fit in memory), or to save, in which case they are written to disk.
Some parallel operations supported at the time the paper was written include: reduce, which combines dataset elements using an associative function; collect, which sends all elements of the dataset to the driver program; and foreach, which passes each element through a user-provided function.
Since the paper was authored, Spark has come a long way and supports many more transformations (sample, union, distinct, groupByKey, reduceByKey, sortByKey, join, cartesian, etc.), parallel operations (shuffle, count, first, take, takeSample, etc.), and persistence options (memory only, memory and disk, disk only, memory only serialized, etc.).
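As a local sketch of what two of these key-based transformations compute (plain Python over (key, value) pairs, not Spark's API):

```python
from collections import defaultdict

# Plain-Python sketch of groupByKey / reduceByKey semantics
# on a local list of (key, value) pairs -- not actual Spark code.
pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]

grouped = defaultdict(list)
for key, value in pairs:
    grouped[key].append(value)   # groupByKey: collect all values per key

# reduceByKey with + as the (associative) reduce function
reduced = {key: sum(vals) for key, vals in grouped.items()}

print(dict(grouped))  # {'a': [1, 3], 'b': [2, 4]}
print(reduced)        # {'a': 4, 'b': 6}
```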
Spark supports two kinds of shared variables:
Broadcast variables — These are read-only variables that are distributed to the worker nodes once and can then be read multiple times. A use case would be training data that is sent to all worker nodes once and reused for learning different models, instead of shipping the same data with each model.
Accumulators — These variables are also shared with the workers, the difference being that the driver program can only read them while the workers can only apply associative operations to them. A use case: to count the total number of entries in a data set, each worker adds to its copy of the accumulator and sends it to the driver, which adds up all the received values.
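A plain-Python sketch of how the two kinds of shared variables behave (the worker function and names are hypothetical, not Spark's API):

```python
# Plain-Python sketch of broadcast and accumulator semantics.
# The worker function and names here are hypothetical, not Spark's API.

lookup = {"error", "warn"}   # broadcast: read-only, shipped to workers once

def worker(partition, shared_lookup):
    """Each worker reads the broadcast value and keeps a local accumulator."""
    local_count = 0                  # accumulator copy starts at the "zero" value
    for line in partition:
        if line in shared_lookup:
            local_count += 1         # workers may only add (associative op)
    return local_count               # sent back to the driver

partitions = [["error", "ok"], ["warn", "error", "ok"]]
# Only the driver reads the final value, by merging per-worker results.
total = sum(worker(p, lookup) for p in partitions)
print(total)  # 3
```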
The core of Spark is the implementation of RDDs. Suppose we start by reading a file, then filter the lines to get those containing the word “ERROR”, cache the result, and finally count the number of such lines using the standard map-reduce trick. An RDD is formed for each of these steps, and these RDDs are stored as a linked list to capture the lineage of each RDD.
Each RDD contains a pointer to its parent and information about how it was derived from it. This lineage information is sufficient to reconstruct any lost partition, so no checkpointing is required. There is no overhead if no node fails, and even if some nodes fail, only the affected RDDs need to be reconstructed.
Internally, an RDD object implements a simple interface consisting of three operations: getPartitions, which returns a list of partition IDs; getIterator(partition), which iterates over a partition; and getPreferredLocations(partition), which is used by the task scheduler to achieve data locality.
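A rough single-machine sketch of this interface and the lineage chain it enables (simplified; real RDDs are distributed across machines, and the class names here are invented for illustration):

```python
# Simplified, single-machine sketch of the RDD interface and lineage.
# Real RDDs are partitioned across machines; here each "partition" is a list.

class SourceRDD:
    """Leaf of the lineage chain, e.g. an RDD backed by a file."""
    def __init__(self, partitions):
        self.partitions = partitions

    def get_partitions(self):
        return list(range(len(self.partitions)))

    def get_iterator(self, partition):
        return iter(self.partitions[partition])

class FilteredRDD:
    """Child RDD holding a pointer to its parent plus the transformation."""
    def __init__(self, parent, predicate):
        self.parent = parent        # lineage: pointer to the parent RDD
        self.predicate = predicate  # how this RDD was derived from it

    def get_partitions(self):
        return self.parent.get_partitions()

    def get_iterator(self, partition):
        # Recomputes lazily from the parent -- this is what makes a lost
        # partition reconstructible from lineage alone.
        return (x for x in self.parent.get_iterator(partition)
                if self.predicate(x))

lines = SourceRDD([["ERROR a", "ok"], ["ERROR b"]])
errors = FilteredRDD(lines, lambda line: "ERROR" in line)
count = sum(1 for p in errors.get_partitions()
            for _ in errors.get_iterator(p))
print(count)  # 2
```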
Spark is similar to MapReduce in that it sends the computation to the data instead of the other way round. This requires shipping the closures used to define and process a distributed dataset to the workers, which is easy given that Scala closures are Java-serializable. However, unlike MapReduce, operations are performed on RDDs that can persist across operations.
Shared variables are implemented using classes with custom serialization formats. When a broadcast variable b is created with a value v, v is saved to a file in the shared file system. The serialized form of b is a path to this file. When b’s value is queried, Spark checks if v is in the local cache. If not, it is read from the file system. For accumulators, each accumulator is given a unique ID upon creation and its serialized form contains its ID and the “zero” value. On the workers, a separate copy of the accumulator is created for each thread and is reset to “zero” value. Once the task finishes, the updated value is sent to the driver program.
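A minimal sketch of the accumulator lifecycle described above — a unique ID on creation, per-task copies reset to the “zero” value, and updated values merged on the driver — using a hypothetical class, not Spark's implementation:

```python
import itertools

# Hypothetical sketch of the accumulator lifecycle described in the paper:
# unique ID on creation, per-task copies reset to the "zero" value,
# and updated values merged on the driver once each task finishes.
_ids = itertools.count()

class Accumulator:
    def __init__(self, zero):
        self.id = next(_ids)   # unique ID assigned upon creation
        self.zero = zero
        self.value = zero      # driver-side value

    def serialized(self):
        # The serialized form carries only the ID and the "zero" value.
        return {"id": self.id, "zero": self.zero}

    def merge(self, task_value):
        # The driver adds the value received from a finished task.
        self.value += task_value

acc = Accumulator(zero=0)

def run_task(partition, acc_form):
    local = acc_form["zero"]   # worker copy starts from the "zero" value
    for _ in partition:
        local += 1             # only associative updates on the worker
    return local               # sent back to the driver when the task ends

for part in [[10, 20], [30]]:
    acc.merge(run_task(part, acc.serialized()))

print(acc.value)  # 3
```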
The paper describes how an early-stage implementation performs on logistic regression, alternating least squares, and interactive queries. The results outperform MapReduce largely because of caching the results of previous computations. This makes Spark a good alternative for use cases where the same data is read into memory again and again (iterative jobs fit this category). Spark has come a long way since the paper was written. It now supports libraries for handling SQL-like queries (Spark SQL), streaming data (Spark Streaming), graphs (GraphX), and machine learning (MLlib), along with more transformations and parallel operations. I came across Spark while working at Adobe Analytics and have been reading about it to learn more. The cool thing about Spark is that it supports interactive analysis and has APIs in Python, R, and Java, making it easy to adopt. While I have not done much work with Spark yet, I am looking forward to building something on top of it.