Spark papers (1)

By | March 21, 2019

Copyright: All rights reserved, SpanningWings blog

Main Spark papers

Spark is very interesting, and it became (or is becoming) the new MapReduce. It tries to do everything MapReduce can do and claims it can do so at least as well as MapReduce, which is represented by Hadoop in the open-source world. According to this article, Google stopped using MapReduce and replaced it with Dataflow, as described in the VLDB 2015 Dataflow paper. Google Dataflow was presented in this slide deck. As of today (2019), Google Dataflow has not gained as much attention as Spark. Does this mean it is too far ahead of its time? That would be a good topic for another blog post.

Spark: Cluster Computing with Working Sets (HotCloud’10)

Article link

MapReduce and other systems were built around an acyclic data flow. Spark optimized this model by caching intermediate results in memory for repeated use. The significant performance gain, plus the improved user experience, attracted many users to Spark and built an ecosystem around it. Examples of workloads that benefit from cached intermediate results include iterative machine learning algorithms and interactive data analysis.

The key abstraction of Spark is the Resilient Distributed Dataset (RDD). RDDs are ephemeral in that they can be regenerated by recomputation along their lineage, starting from data saved on persistent storage (e.g., disks). RDDs can be stored as serialized Java objects and can spill over to disk.

Spark inherited the map stage and the reduce stage from MapReduce. It also emphasized the shuffle stage, which is not emphasized in MapReduce. The shuffle stage passes the data generated by the map stage to the reducers. With M mappers and R reducers, a naive Spark implementation produces M*R intermediate result files. Because a Spark job typically runs many more mappers and reducers than a comparable MapReduce job, the naive Spark implementation was reported to perform worse than Hadoop in the first round of data analysis. This shuffle issue has since been investigated by many researchers.
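To make the M*R blow-up concrete, here is a minimal plain-Python sketch (not Spark code; the task counts and record format are made up for illustration) of a naive hash shuffle, where every mapper hashes each output record into one bucket per reducer:

```python
# Sketch of a naive hash shuffle: each of M map tasks partitions its
# output into R buckets, one per reduce task, so M*R intermediate
# buckets (potentially files on disk) are produced in total.

M, R = 4, 3  # 4 map tasks, 3 reduce tasks (illustrative numbers)

def map_task(records):
    """Partition one mapper's output into R buckets by key hash."""
    buckets = [[] for _ in range(R)]
    for key, value in records:
        buckets[hash(key) % R].append((key, value))
    return buckets

# Each mapper emits some (word, 1) pairs.
mapper_inputs = [[("a", 1), ("b", 1)], [("b", 1)], [("c", 1)], [("a", 1)]]
all_buckets = [map_task(recs) for recs in mapper_inputs]

# One bucket per (mapper, reducer) pair, M*R in total.
n_files = sum(len(buckets) for buckets in all_buckets)
print(n_files)  # 12, i.e., M * R
```

With many small tasks, these small intermediate files dominate I/O cost, which is what the combiner and compression work below tries to mitigate.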

  1. Compression, e.g., columnar compression of the intermediate data.
  2. Combiners. The many small files are combined into larger units; both sort-based and hash-based combiners have been explored.
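A hash-based combiner can also be sketched in a few lines of plain Python (again an illustration, not Spark's implementation): the mapper pre-aggregates values in a local hash table so far fewer records are shuffled.

```python
# Sketch of a hash-based map-side combiner for word count: instead of
# emitting one (word, 1) record per occurrence, the mapper keeps a
# local hash table of partial counts and emits one record per distinct
# key, shrinking the data passed through the shuffle.
from collections import defaultdict

def map_with_combiner(words):
    counts = defaultdict(int)  # local hash-based combiner
    for w in words:
        counts[w] += 1
    return list(counts.items())  # one record per distinct word

out = map_with_combiner(["spark", "rdd", "spark", "spark"])
print(sorted(out))  # [('rdd', 1), ('spark', 3)]
```

A sort-based combiner achieves the same reduction by sorting the records by key and merging adjacent duplicates, which trades the hash table's memory for sequential access.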

Spark was heavily influenced by MapReduce and Dryad/DryadLINQ.

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (NSDI’12)

Article link

Apparently Spark evolved from the 2010 paper and became more mature by 2012. In the 2012 paper, the sets of transformations and actions were officially defined. Scala is a statically typed functional language, and Spark adopted many terms from it. For someone who is not a big fan of functional programming (burnt by LISP and PROLOG debugging), Scala might be a fresh starting point. For Spark 2.2, the transformations are listed in the Spark Programming Guide.

A few RDD transformations will need explanation:

  • flatMap(func), in which func returns a Seq instead of a single item as in map(). One example is a list being “flattened” into several items: (key, [a, b, c]) becomes (key, f(a)), (key, f(b)), (key, f(c))
  • join(), (RDD[(K,V)], RDD[(K,W)]) => RDD[(K,(V,W))]. Join pairs up the values from the two RDDs that share the same key
  • groupByKey(), RDD[(K,V)] => RDD[(K, Seq[V])], in which all values for a key are gathered into one sequence. This article suggests avoiding groupByKey() when aggregating multiple items by key
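The semantics of these three transformations can be sketched in plain Python (per-record logic only; this is not the Spark API, and the helper names are mine):

```python
# Plain-Python sketches of flatMap, join, and groupByKey semantics
# over lists standing in for RDDs.
from itertools import chain

def flat_map(func, rdd):
    # func returns a sequence per record; all sequences are concatenated
    return list(chain.from_iterable(func(x) for x in rdd))

def join(rdd1, rdd2):
    # (K,V) joined with (K,W) -> (K,(V,W)) for every matching key pair
    return [(k, (v, w)) for k, v in rdd1 for k2, w in rdd2 if k == k2]

def group_by_key(rdd):
    # (K,V) -> (K, [V, ...]): gather all values of a key into one list
    groups = {}
    for k, v in rdd:
        groups.setdefault(k, []).append(v)
    return list(groups.items())

# (key, [a, b, c]) flattened to (key, f(a)), (key, f(b)), (key, f(c)),
# with f = str.upper here.
flattened = flat_map(lambda kv: [(kv[0], s.upper()) for s in kv[1]],
                     [("key", ["a", "b", "c"])])
print(flattened)                             # [('key', 'A'), ('key', 'B'), ('key', 'C')]
print(join([("k", 1)], [("k", "x")]))        # [('k', (1, 'x'))]
print(group_by_key([("k", 1), ("k", 2)]))    # [('k', [1, 2])]
```

The sketch also shows why groupByKey() is costly for aggregation: it materializes the full value list per key, whereas a reduceByKey-style fold would combine values as they arrive.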


When persist() is used, “user can get an RDD’s partition order, which is represented by a Partitioner class, and partition another dataset according to it”.

  • hash partitioned RDD (The order is no longer preserved, as the hash function will evenly distribute the items)
  • range partitioned RDD (The order is preserved in the range)
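The difference between the two partitioners can be sketched in plain Python (illustrative only; Spark's RangePartitioner samples the data to pick boundaries, which I skip here):

```python
# Sketch of the two Partitioner behaviors: a hash partitioner scatters
# keys across partitions (order lost), while a range partitioner keeps
# each partition's keys in a contiguous sorted range.

def hash_partition(keys, n):
    parts = [[] for _ in range(n)]
    for k in keys:
        parts[hash(k) % n].append(k)  # evenly spread, order not preserved
    return parts

def range_partition(sorted_keys, n):
    # Boundaries would normally come from sampling; here we just slice
    # the sorted keys into n contiguous ranges.
    size = (len(sorted_keys) + n - 1) // n
    return [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]

keys = list(range(10))
print(range_partition(keys, 2))  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
print(hash_partition(keys, 2))   # evens and odds interleave partitions
```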

Spark lets the user control the way data is partitioned. In the paper’s example, links can be hash-partitioned by URL across nodes, and ranks can be hash-partitioned by URL in the same way. So later on when links and ranks are joined, the join operation “requires no communication (as each URL’s rank will be on the same machine as its link list)”. That’s really good.
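A tiny plain-Python sketch shows why co-partitioning makes the join communication-free (the URLs and rank values are made up; the point is only that the two datasets share one partitioner):

```python
# Sketch of a co-partitioned join: links and ranks are partitioned by
# the same hash function, so every URL's link list and its rank land in
# the same partition, and each node joins its own partition locally.

N = 3  # number of partitions / nodes (illustrative)
part = lambda url: hash(url) % N  # the shared partitioner

links = {"u1": ["u2"], "u2": ["u1"], "u3": ["u1", "u2"]}
ranks = {"u1": 1.0, "u2": 0.5, "u3": 0.5}

# Both datasets partitioned with the same function...
link_parts = [{u: l for u, l in links.items() if part(u) == i} for i in range(N)]
rank_parts = [{u: r for u, r in ranks.items() if part(u) == i} for i in range(N)]

# ...so joining partition i needs only partition i's data on both sides.
joined = []
for i in range(N):
    for url, ls in link_parts[i].items():
        joined.append((url, (ls, rank_parts[i][url])))  # lookup never misses
print(sorted(joined))
```

If the two datasets were partitioned differently, the inner lookup would miss and a shuffle would be needed to bring matching keys together.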

Representing RDDs

Five pieces of information describe an RDD:

  • a set of Partitions
  • a set of Dependencies
  • a function for computing the dataset based on its parents
  • metadata about its partitioning scheme
  • metadata about data placement (preferredLocations())
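The five-piece interface can be sketched as a plain Python dataclass (field names follow the paper's description, not Spark's actual internal classes):

```python
# Sketch of the paper's five-piece RDD interface as a dataclass.
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class RDDDescriptor:
    partitions: List[int]                # a set of partitions
    dependencies: List["RDDDescriptor"]  # parent RDDs, i.e., the lineage
    compute: Callable                    # derives this dataset from parents
    partitioner: Optional[str] = None    # partitioning-scheme metadata
    preferred_locations: List[str] = field(default_factory=list)  # placement

# A source RDD with no parents, then a mapped RDD that depends on it.
source = RDDDescriptor(partitions=[0, 1], dependencies=[],
                       compute=lambda p: [],
                       preferred_locations=["node-a", "node-b"])
mapped = RDDDescriptor(partitions=[0, 1], dependencies=[source],
                       compute=lambda p: [x * 2 for x in p])
print(len(mapped.dependencies))  # 1
```

The dependencies field is exactly what fault recovery walks: losing a partition of `mapped` triggers re-running `compute` on the corresponding partition of `source`.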

*note: The gitbook Apache Spark – Best Practices and Tuning indicates that the same computation can be expressed with different arrangements of transformations, and some arrangements perform much better than others. The programmer, in effect, acts as a “human” execution planner.

*note: The morning paper has a review about the Dataflow paper