https://spark-summit.org/2014/wp-content/uploads/2014/07/A-Deeper-Understanding-of-Spark-Internals-Aaron-Davidson.pdf
https://spark-summit.org/2014/wp-content/uploads/2015/03/SparkSummitEast2015-AdvDevOps-StudentSlides.pdf
The following is a summary of the two PDF documents above.
Spark's major core components:
– Execution Model
– The Shuffle
– Caching
Spark Execution Model
1. Create a DAG of RDDs to represent the computation
2. Create a logical execution plan for the DAG
• Pipeline as much as possible
• Split into “stages” based on the need to reorganize data
3. Schedule and execute individual tasks
• Split each stage into tasks
• A task is data + computation
• Execute all tasks within a stage before moving on
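As a rough illustration of this model, the lineage (DAG) and the stage boundary introduced by a shuffle can be inspected from the spark-shell (where sc is predefined) with toDebugString; the file path and variable names below are only placeholders:

    // Runs in the spark-shell, where sc (SparkContext) is predefined
    val lines  = sc.textFile("hdfs:///tmp/input.txt")      // placeholder path
    val words  = lines.flatMap(_.split(" "))               // narrow: pipelined into the same stage
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _) // wide: starts a new stage

    println(counts.toDebugString)  // prints the lineage; the indentation/+- markers show the stage split
    counts.collect()               // the action: schedules one task per partition, stage by stage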
---------------------------------------------
“The main abstraction in Spark is that of a resilient distributed dataset (RDD), which represents a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
Users can explicitly cache an RDD in memory across machines and reuse it in multiple MapReduce-like parallel operations.
RDDs achieve fault tolerance through a notion of lineage: if a partition of an RDD is lost, the RDD has enough information about how it was derived from other RDDs to be able to rebuild just that partition.”
“We present Resilient Distributed Datasets (RDDs), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
RDDs are motivated by two types of applications that current computing frameworks handle inefficiently: iterative algorithms and interactive data mining tools.
In both cases, keeping data in memory can improve performance by an order of magnitude.”
An RDD can be created in two ways:
1) Parallelize a collection
2) Read data from an external source (S3, C*, HDFS, etc.)
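A minimal sketch of both creation paths, again assuming the spark-shell's predefined sc; the paths are placeholders:

    // 1) Parallelize an in-memory collection
    val numbers = sc.parallelize(1 to 1000, 8)         // 8 partitions

    // 2) Read from an external source
    val logs = sc.textFile("hdfs:///tmp/app.log")      // placeholder path; file://, s3://, etc. also work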
Life Cycle of a Spark Program
1) Create some input RDDs from external data or parallelize a collection in your driver program.
2) Lazily transform them to define new RDDs using transformations like filter() or map()
3) Ask Spark to cache() any intermediate RDDs that will need to be reused.
4) Launch actions such as count() and collect() to kick off a parallel computation, which is then optimized and executed by Spark.
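The four steps above written out as a small sketch (spark-shell, sc predefined; the path and the filter predicate are made up for illustration):

    // 1) Create an input RDD from external data
    val lines  = sc.textFile("hdfs:///tmp/server.log")   // placeholder path
    // 2) Lazily define new RDDs with transformations
    val errors = lines.filter(_.contains("ERROR"))
    // 3) Cache an intermediate RDD that will be reused
    errors.cache()
    // 4) Launch actions to kick off the computation
    val total  = errors.count()                          // first action: computes and caches
    val sample = errors.take(10)                         // reuses the cached partitions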
Transformations (Lazy)
map(), flatMap(), filter(), mapPartitions(), mapPartitionsWithIndex(), sample(), union(), intersection(), distinct(), groupByKey(), reduceByKey(), sortByKey(), join(), cogroup(), cartesian(), pipe(), coalesce(), repartition(), partitionBy(), ...
Transformations are lazy. Most transformations are element-wise (they work on one element at a time), but this is not true for all of them.
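For example, map() is element-wise, while mapPartitions() hands the function an entire partition's iterator at once. A small spark-shell sketch (the numbers are made up):

    val nums = sc.parallelize(1 to 10, 2)

    // element-wise: the function sees one element at a time
    val squared = nums.map(n => n * n)

    // per-partition: the function sees an Iterator over the whole partition,
    // useful for per-partition setup such as opening one connection per partition
    val sums = nums.mapPartitions(iter => Iterator(iter.sum))

    squared.collect()   // Array(1, 4, 9, ..., 100)
    sums.collect()      // Array(15, 40): one sum per partition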
Actions
reduce(), collect(), count(), first(), take(), takeSample(), takeOrdered(), countByKey(), foreach(), saveAsTextFile(), saveAsSequenceFile(), saveAsObjectFile(), saveToCassandra()
Types of RDDs
HadoopRDD • FilteredRDD • MappedRDD • PairRDD • ShuffledRDD • UnionRDD • PythonRDD • DoubleRDD • JdbcRDD • JsonRDD • SchemaRDD • VertexRDD • EdgeRDD
RDD Interface
1) Set of partitions (“splits”)
2) List of dependencies on parent RDDs
3) Function to compute a partition given parents
4) Optional preferred locations
5) Optional partitioning info for k/v RDDs (Partitioner)
Example: HadoopRDD
Partitions = one per HDFS block
Dependencies = none
Compute (partition) = read corresponding block
preferredLocations (part) = HDFS block location
Partitioner = none
Example: FilteredRDD
Partitions = same as parent RDD
Dependencies = “one-to-one” on parent
Compute (partition) = compute parent and filter it
preferredLocations (part) = none (ask parent)
Partitioner = none
Example: JoinedRDD
Partitions = One per reduce task
Dependencies = “shuffle” on each parent
Compute (partition) = read and join shuffled data
preferredLocations (part) = none
Partitioner = HashPartitioner(numTasks)
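To make the five-part interface concrete, here is a hedged sketch of a toy RDD subclass (not from the slides): a range RDD with no parents, so its dependency list is empty and compute() simply generates its slice of numbers. The class and field names are invented for illustration.

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // One partition = one slice of the range
    class RangeSlice(val index: Int, val start: Int, val end: Int) extends Partition

    class ToyRangeRDD(sc: SparkContext, n: Int, slices: Int)
      extends RDD[Int](sc, Nil) {                        // Nil: no parent dependencies

      // (1) set of partitions
      override protected def getPartitions: Array[Partition] = {
        val step = math.max(1, n / slices)
        (0 until slices).map { i =>
          new RangeSlice(i, i * step, if (i == slices - 1) n else (i + 1) * step)
        }.toArray
      }

      // (3) compute a partition given its parents (none here)
      override def compute(split: Partition, ctx: TaskContext): Iterator[Int] = {
        val s = split.asInstanceOf[RangeSlice]
        (s.start until s.end).iterator
      }

      // (4) preferred locations and (5) partitioner keep their defaults (empty / None)
    }

With the spark-shell's sc, new ToyRangeRDD(sc, 20, 4).collect() would return the numbers 0 to 19, computed one slice per task.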
Memory and Persistence
Worker machine guidelines:
• It is recommended to use at most 75% of a machine’s memory for Spark
• Minimum executor heap size should be 8 GB
• Maximum executor heap size depends… maybe 40 GB (watch GC)
• Memory usage is greatly affected by storage level and serialization format
- If RDD fits in memory, choose MEMORY_ONLY
- If not, use MEMORY_ONLY_SER w/ fast serialization library
- Don’t spill to disk unless functions that computed the datasets are very expensive or they filter a large amount of data. (recomputing may be as fast as reading from disk)
- Use replicated storage levels sparingly and only if you want fast fault recovery (maybe to serve requests from a web app)
- Intermediate data is automatically persisted during shuffle operations.
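A small sketch of the storage-level advice above (spark-shell; the data set and path are made up):

    import org.apache.spark.storage.StorageLevel

    val pairs = sc.textFile("hdfs:///tmp/big.txt")        // placeholder path
                  .map(line => (line.length, line))

    pairs.persist(StorageLevel.MEMORY_ONLY)               // fits in memory: fastest
    // pairs.persist(StorageLevel.MEMORY_ONLY_SER)        // tighter fit: serialized, costs more CPU
    // pairs.persist(StorageLevel.MEMORY_AND_DISK)        // only if recomputation is very expensive
    // pairs.persist(StorageLevel.MEMORY_ONLY_2)          // replicated: fast recovery, uses 2x space

    pairs.count()                                         // first action materializes the cache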
Spark uses memory for:
RDD Storage: when you call .persist() or .cache(). Spark will limit the amount of memory used when caching to a certain fraction of the JVM’s overall heap, set by spark.storage.memoryFraction
Shuffle and aggregation buffers: When performing shuffle operations, Spark will create intermediate buffers for storing shuffle output data. These buffers are used to store intermediate results of aggregations in addition to buffering data that is going to be directly output as part of the shuffle.
User code: Spark executes arbitrary user code, so user functions can themselves require substantial memory. For instance, if a user application allocates large arrays or other objects, these will contend for overall memory usage. User code has access to everything “left” in the JVM heap after the space for RDD storage and shuffle storage is allocated.
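A hedged sketch of how these regions were sized in the legacy (pre-1.6) memory model the slides refer to; the values below are the old defaults, and newer Spark versions replace these settings with unified memory management:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("memory-regions-example")
      .setMaster("local[*]")                          // local master just for this sketch
      .set("spark.storage.memoryFraction", "0.6")     // share of heap for cached RDDs (legacy default)
      .set("spark.shuffle.memoryFraction", "0.2")     // share of heap for shuffle/aggregation buffers (legacy default)
    // whatever is left over is what user code effectively has to work with

    val sc = new SparkContext(conf)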
Serialization is used when data is transferred over the network (for example during a shuffle), when RDDs are cached or persisted in serialized form, and when data is spilled to disk.
Jobs ----------------> Stages ----------> Tasks (a job is split into stages, and each stage into tasks)
Schedulers: the DAG scheduler turns a job’s RDD lineage into stages; the task scheduler then launches each stage’s tasks on the executors.
Lineage
“One of the challenges in providing RDDs as an abstraction is choosing a representation for them that can track lineage across a wide range of transformations.”
“The most interesting question in designing this interface is how to represent dependencies between RDDs.”
“We found it both sufficient and useful to classify dependencies into two types:
• narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD
• wide dependencies, where multiple child partitions may depend on it [i.e., on a single parent partition].”
Stages
Dependencies: Narrow vs. Wide
“This distinction is useful for two reasons:
1) Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis. In contrast, wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.
2) Recovery after a node failure is more efficient with a narrow dependency, as only the lost parent partitions need to be recomputed, and they can be recomputed in parallel on different nodes. In contrast, in a lineage graph with wide dependencies, a single failed node might cause the loss of some partition from all the ancestors of an RDD, requiring a complete re-execution.”
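The two dependency types are visible directly on an RDD’s dependencies field; a small spark-shell sketch (data made up):

    val nums    = sc.parallelize(1 to 100, 4)
    val doubled = nums.map(_ * 2)                     // narrow
    val grouped = doubled.map(n => (n % 10, n))
                         .groupByKey()                // wide

    println(doubled.dependencies)  // List(org.apache.spark.OneToOneDependency@...)
    println(grouped.dependencies)  // List(org.apache.spark.ShuffleDependency@...)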
How do you know if a shuffle will be called on a transformation?
- repartition, join, cogroup, and any of the *By or *ByKey transformations can result in shuffles
- If you declare a numPartitions parameter, it’ll probably shuffle
- If a transformation constructs a ShuffledRDD, it’ll probably shuffle
- combineByKey calls a shuffle (so do other transformations like groupByKey, which actually end up calling combineByKey)
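One quick, informal way to check the ShuffledRDD point is to look at the class of the RDD a transformation returns; a spark-shell sketch with invented data:

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // a narrow transformation: no ShuffledRDD (the concrete class name varies by Spark version)
    println(pairs.filter(_._2 > 1).getClass.getSimpleName)
    // wide transformations: both are built on combineByKey and return a ShuffledRDD
    println(pairs.reduceByKey(_ + _).getClass.getSimpleName)   // ShuffledRDD
    println(pairs.groupByKey().getClass.getSimpleName)         // ShuffledRDD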
Common Performance issue checklist
1. Ensure enough partitions for concurrency
2. Minimize memory consumption (esp. of sorting and large keys in groupBys)
3. Minimize the amount of data shuffled (see the sketch below)
4. Know the standard library
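A sketch touching points 1 and 3: make sure there are enough partitions, and prefer reduceByKey (which combines on the map side) over groupByKey when only an aggregate is needed. Spark-shell; the path, partition count, and key extraction are made up:

    val events = sc.textFile("hdfs:///tmp/events.txt", 200)     // placeholder path; 200 partitions for concurrency
    val pairs  = events.map(line => (line.takeWhile(_ != ','), 1))

    // shuffles every (key, 1) record across the network
    val slow = pairs.groupByKey().mapValues(_.sum)

    // combines on the map side first, so far less data is shuffled
    val fast = pairs.reduceByKey(_ + _)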
Spark supports 2 types of shared variables:
• Broadcast variables – allow your program to efficiently send a large, read-only value to all the worker nodes for use in one or more Spark operations, like sending a large, read-only lookup table to all the nodes.
Broadcast variables let the programmer keep a read-only variable cached on each machine rather than shipping a copy of it with every task, for example to give every node a copy of a large input dataset efficiently.
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
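A small broadcast sketch (spark-shell; the lookup table and codes are invented):

    // A read-only lookup table shipped once per node instead of once per task
    val countryNames = Map("US" -> "United States", "IN" -> "India", "DE" -> "Germany")
    val bcNames      = sc.broadcast(countryNames)

    val codes    = sc.parallelize(Seq("US", "DE", "US", "IN"))
    val resolved = codes.map(code => bcNames.value.getOrElse(code, "unknown"))

    resolved.collect()   // Array(United States, Germany, United States, India)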
• Accumulators – allow you to aggregate values from worker nodes back to the driver program; for example, they can be used to count the number of errors seen in an RDD of lines spread across hundreds of nodes. Only the driver program can read an accumulator’s value; to tasks, accumulators are write-only.
Accumulators are variables that can only be “added” to through an associative operation, and are used to implement counters and sums efficiently in parallel.
Spark natively supports accumulators of numeric value types and standard mutable collections, and programmers can extend them to new types.
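A small accumulator sketch using the classic sc.accumulator API of the Spark versions these slides cover (later superseded by sc.longAccumulator); the log lines are invented:

    val badLines = sc.accumulator(0)            // numeric accumulator, starts at 0

    val lines = sc.parallelize(Seq("ok", "ERROR: disk", "ok", "ERROR: net"))
    lines.foreach { line =>
      if (line.startsWith("ERROR")) badLines += 1   // tasks can only add to it
    }

    println(badLines.value)   // 2: only the driver can read the value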
Spark use cases