PACT
Definition
Parallelization Contracts (PACTs) are data processing operators in a data flow. A PACT therefore has one or more data inputs and one or more outputs. A PACT consists of three components:
- Input Contract
- User function
- User code annotation
Figure 3 shows the components of a PACT, which are exactly one Input Contract and an optional Output Contract. The Input Contract of a PACT defines how the UF (User Function) can be evaluated in parallel. Output Contracts allow the optimizer to infer certain properties of the output data of a UF and hence to create a more efficient execution strategy for a program.
PACT Architecture
The architecture has four components, each described in the following sections:
- PACT Programming Model
- PACT Compiler
- Internal PACT Strategies
- PACT Clients
PACT is a parallel programming model that extends MapReduce. PACT provides a user API for writing parallel data processing tasks. Programs written in the PACT programming model are translated by the PACT Compiler into a Nephele job and executed on Nephele. Nephele treats PACT programs as regular Nephele jobs; to Nephele, PACT programs are arbitrary user programs.
Advantages
1. The PACT programming model encourages a more modular programming style. Although the number of user functions is usually higher, they are more fine-grained and focus on specific problems. Hence, the interweaving of functionality that is common in MapReduce jobs can be avoided.
2. Data analysis tasks can be expressed as straightforward data flows, especially when multiple inputs are required.
3. PACT has a record-based data model, which reduces the need to specify custom data types because not all data items need to be packed into a single value type.
4. PACT frequently eliminates the need for auxiliary structures, such as the distributed cache, which "break" the parallel programming model.
5. Data organization operations, such as building a Cartesian product or combining records with equal keys, are done by the runtime system. In MapReduce, such frequently needed functionality must be provided by the developer of the user code.
6. PACTs specify data parallelization in a declarative way, which leaves several degrees of freedom to the system. These degrees of freedom are an important prerequisite for automatic optimization. The PACT compiler enumerates different execution strategies and chooses the strategy with the least estimated amount of data to ship. In contrast, Hadoop always executes MapReduce jobs with the same strategy.
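The strategy selection described in point 6 can be illustrated with a small sketch. This is not the PACT compiler's actual cost model: the three candidate strategies, the cost formulas, and the names estimate_shipped_bytes and choose_strategy are assumptions made for illustration. The sketch only shows the idea of enumerating candidates for a two-input operator and picking the one with the least estimated data to ship.

```python
def estimate_shipped_bytes(left_bytes, right_bytes, parallelism):
    """Rough, hypothetical estimates of network traffic per strategy."""
    return {
        # Hash-partition both inputs on the key: each record is shipped once.
        "repartition_both": left_bytes + right_bytes,
        # Send a full copy of one input to every parallel instance,
        # leaving the other input where it is.
        "broadcast_left": left_bytes * parallelism,
        "broadcast_right": right_bytes * parallelism,
    }


def choose_strategy(left_bytes, right_bytes, parallelism):
    """Enumerate the candidate strategies and return the cheapest one."""
    costs = estimate_shipped_bytes(left_bytes, right_bytes, parallelism)
    return min(costs, key=costs.get)


# A very large input combined with a tiny one: broadcasting the tiny input wins.
print(choose_strategy(10_000_000_000, 1_000_000, parallelism=8))
# Two inputs of equal size: repartitioning both is cheaper than broadcasting.
print(choose_strategy(100, 100, parallelism=8))
```

A system that always uses the same strategy corresponds to hard-coding one of these entries regardless of the input sizes.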
Working principle
A PACT consists of exactly one second-order function, called the Input Contract, and an optional Output Contract. An Input Contract takes a first-order function with task-specific user code and one or more data sets as input parameters. The Input Contract invokes its associated first-order function on independent subsets of its input data in a data-parallel fashion. In this context, the well-known functions map and reduce are examples of Input Contracts. The first-order function has to implement an interface that is specific to the PACT it uses, as is familiar from MapReduce.
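To make the distinction between second-order and first-order functions concrete, here is a minimal single-process sketch in Python. It is not the PACT API: the names map_contract and reduce_contract and the (key, value) record layout are assumptions chosen for illustration. Each contract decides which subsets of the data the user function sees; a real runtime would process those subsets in parallel.

```python
from collections import defaultdict


def map_contract(uf, records):
    """Map: call the user function once per record, independently."""
    out = []
    for record in records:
        out.extend(uf(record))
    return out


def reduce_contract(uf, records):
    """Reduce: group records by key, call the user function once per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    out = []
    for key, values in groups.items():
        out.extend(uf(key, values))
    return out


# First-order user functions: the task-specific code.
def tokenize(record):
    _, line = record
    return [(word, 1) for word in line.split()]


def count(key, values):
    return [(key, sum(values))]


lines = [(0, "a b a"), (1, "b c")]
word_counts = reduce_contract(count, map_contract(tokenize, lines))
print(sorted(word_counts))  # [('a', 2), ('b', 2), ('c', 1)]
```

Because every call to the user function receives an independent subset (one record for map, one key group for reduce), the calls can be distributed across machines without changing the result.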
PACT features a richer set of second-order functions (Map/Reduce/Match/CoGroup/Cross) that can be flexibly composed as DAGs into programs. PACT programs use a generic schema-free tuple data model to ease composition of more complex programs.
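The two-input contracts can be sketched in the same simplified style. The function names and the (key, value) record layout below are assumptions for illustration, not the PACT API; the semantics follow the way these contracts are commonly described: Cross pairs every record of one input with every record of the other, Match pairs records with equal keys, and CoGroup hands the user function all records of both inputs that share a key.

```python
from collections import defaultdict
from itertools import product


def _group(records):
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def cross_contract(uf, left, right):
    """Cross: call uf on every pair of the Cartesian product."""
    return [out for l, r in product(left, right) for out in uf(l, r)]


def match_contract(uf, left, right):
    """Match: call uf once per pair of records with equal keys."""
    lg, rg = _group(left), _group(right)
    out = []
    for key in lg.keys() & rg.keys():
        for lv, rv in product(lg[key], rg[key]):
            out.extend(uf(key, lv, rv))
    return out


def cogroup_contract(uf, left, right):
    """CoGroup: call uf once per key with all values from both inputs."""
    lg, rg = _group(left), _group(right)
    out = []
    for key in lg.keys() | rg.keys():
        out.extend(uf(key, lg.get(key, []), rg.get(key, [])))
    return out


orders = [(1, "book"), (1, "pen"), (2, "lamp")]
customers = [(1, "Ada"), (3, "Bob")]

joined = match_contract(lambda k, o, c: [(k, c, o)], orders, customers)
sizes = cogroup_contract(lambda k, os, cs: [(k, len(os), len(cs))], orders, customers)
pairs = cross_contract(lambda o, c: [(o[1], c[1])], orders, customers)

print(sorted(joined))  # [(1, 'Ada', 'book'), (1, 'Ada', 'pen')]
print(sorted(sizes))   # [(1, 2, 1), (2, 1, 0), (3, 0, 1)]
print(len(pairs))      # 6
```

Note that CoGroup also calls the user function for keys that appear in only one input, which Match does not.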
The programmer can attach optional Output Contracts to PACTs to denote certain properties of the user code's output data that are relevant to parallelization. The compiler can exploit that information and, in some cases, deduce that suitable partitionings or orders already exist and reuse them. Data processing tasks are implemented by providing custom code to PACTs and assembling them into a workflow graph.
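The benefit of an Output Contract can be illustrated with a toy planner. Everything in this sketch is hypothetical: the Operator class, the preserves_key flag (standing in for an Output Contract stating that the user code does not change the key), and the assumption that all operators use the same key. It only shows how such an annotation lets a planner reuse an existing partitioning instead of shipping the data again.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    name: str
    needs_partitioning: bool     # e.g. a Reduce needs its input partitioned by key
    preserves_key: bool = False  # hypothetical Output Contract annotation


def plan_shuffles(pipeline):
    """Return the operators that need a repartitioning step in front of them."""
    partitioned = False
    shuffles = []
    for op in pipeline:
        if op.needs_partitioning and not partitioned:
            shuffles.append(op.name)
            partitioned = True
        # The partitioning survives this operator only if the planner knows
        # that the user code leaves the key unchanged.
        partitioned = partitioned and op.preserves_key
    return shuffles


without_contracts = [
    Operator("reduce_a", needs_partitioning=True),
    Operator("map_b", needs_partitioning=False),
    Operator("reduce_c", needs_partitioning=True),
]
with_contracts = [
    Operator("reduce_a", needs_partitioning=True, preserves_key=True),
    Operator("map_b", needs_partitioning=False, preserves_key=True),
    Operator("reduce_c", needs_partitioning=True),
]

print(plan_shuffles(without_contracts))  # ['reduce_a', 'reduce_c']
print(plan_shuffles(with_contracts))     # ['reduce_a']
```

Without the annotations the planner must treat user code as a black box and repartition before the second reduce; with them it can skip that step.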
Relation between PACT Programming Model and Nephele
Figure 4: Compiling a program to a data flow
To execute a PACT program, the user submits it to the PACT Compiler. The compiler translates the program into a data flow program and hands it to the Nephele system for parallel execution. Input and output data are stored in the distributed file system HDFS.