Tuesday 30 December 2014

Tasks and Stages in Spark

1. Unlike MapReduce, a Spark application can contain multiple jobs,
since each action() in the application triggers one job.
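
For example, a minimal Scala sketch (assuming it runs in spark-shell, where sc is predefined) in which each action triggers its own job:

    val nums    = sc.parallelize(1 to 100, 4)   // transformation setup, no job yet
    val squared = nums.map(x => x * x)           // transformation, still no job

    val total = squared.reduce(_ + _)            // action #1 -> job 0
    val count = squared.count()                  // action #2 -> job 1

Each of the two actions shows up as a separate job in the web UI.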

2. Each job contains multiple stages, and each stage contains multiple tasks.
The last stage produces the job's result. A stage is executed only after its parent stage has finished.
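
As a sketch of this hierarchy (again for spark-shell; the data is only illustrative), one action that involves a shuffle yields one job with two stages:

    val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 3)
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // reduceByKey needs a shuffle

    counts.collect()   // one action -> one job; the shuffle splits it into two stages:
                       // a parent stage for parallelize/map, a final stage for reduceByKey/collect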

3. NarrowDependency vs. ShuffleDependency.
If each partition of the child RDD uses the full data of every parent partition it depends on,
the dependency is a full dependency, called NarrowDependency.

If each partition of the child RDD uses only a part of each parent partition it depends on
(a parent partition's data is split across several child partitions),
the dependency is a partial dependency, called ShuffleDependency.
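
The difference can be observed directly on an RDD's dependencies field (spark-shell sketch; mapValues yields a narrow OneToOneDependency, reduceByKey a ShuffleDependency):

    val pairs   = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), 2)
    val mapped  = pairs.mapValues(_ * 10)    // narrow: each child partition uses the whole parent partition
    val reduced = mapped.reduceByKey(_ + _)  // shuffle: each child partition needs a part of every parent partition

    mapped.dependencies.head    // OneToOneDependency, a NarrowDependency
    reduced.dependencies.head   // ShuffleDependency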

4. NarrowDependency allows records to be pipelined within a stage. The number of tasks in a stage is the same as the number of partitions of the stage's last RDD.
Data is computed only when it is actually needed, and at the position where the result is generated.
Although the dependency chain is traced backwards, actual computation starts from the partitions of the leftmost RDD in the stage.
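
A sketch of such a pipeline (spark-shell; input.txt is a hypothetical file): three narrow transformations collapse into a single stage, and each task pulls one record at a time through the whole chain:

    val lines    = sc.textFile("input.txt", 4)   // hypothetical input, 4 partitions
    val trimmed  = lines.map(_.trim)              // narrow
    val nonEmpty = trimmed.filter(_.nonEmpty)     // narrow
    val lengths  = nonEmpty.map(_.length)         // narrow

    lengths.count()   // one stage, 4 tasks (= partitions of the last RDD);
                      // no intermediate RDD is materialized between map and filter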

5. The computing chain is built from back to front based on the data dependencies.
Each NarrowDependency is added into the current stage, and a new stage is created whenever a ShuffleDependency is encountered. The last stage gets id 0, its parent stage id 1, and so on.
If a stage produces the final result, its tasks are ResultTasks (like reducers); otherwise its tasks are ShuffleMapTasks (like mappers).
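
toDebugString makes this visible (spark-shell sketch; input.txt is hypothetical): reading the printed lineage from the final RDD backwards mirrors how the scheduler builds the stages:

    val counts = sc.textFile("input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // The indentation of the printed lineage changes at the shuffle boundary:
    // the final stage (ResultTasks) covers reduceByKey's output,
    // its parent stage (ShuffleMapTasks) covers textFile/flatMap/map.
    println(counts.toDebugString)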

6. The number of tasks in a stage is determined by the number of partitions of the stage's last RDD.
Within a stage, each RDD's compute() calls parentRDD.iterator() to fetch the records of its parent partition.
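
As a minimal custom-RDD sketch of that mechanism (the class name DoubledRDD and the doubling logic are just illustrations), compute() asks the parent partition for an iterator and transforms its records lazily:

    import org.apache.spark.{Partition, TaskContext}
    import org.apache.spark.rdd.RDD

    // Each child partition reads its parent partition through firstParent.iterator(),
    // which is what keeps records flowing through the stage's pipeline.
    class DoubledRDD(parent: RDD[Int]) extends RDD[Int](parent) {
      override protected def getPartitions: Array[Partition] = firstParent[Int].partitions

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        firstParent[Int].iterator(split, context).map(_ * 2)
    }

    new DoubledRDD(sc.parallelize(1 to 5)).collect()   // Array(2, 4, 6, 8, 10)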




Figure: Data and Stage Graph

Reference:
The contents of this article are from https://github.com/JerryLead/SparkInternals
