
Title: Parallator – Transformation towards C++ parallelism using TBB
File Size: 2.7 MB
Total Pages: 107
Table of Contents
	What is Parallelism
		Parallelism vs. Concurrency
		Explicit vs. Implicit Parallelism
		Performance in Parallelism
	Parallel Programming
		The Need For Explicit Parallelism
		Strategies for Parallelism
		Mechanism for Parallelism
		Key Features For Performance
	Selected C++11 Features
		Lambda Expression
		Defined Multi-Threaded Behavior with STL Container
	Threading Building Blocks
		Introducing Example
		Task Scheduling
		Iteration Space / Range Concept
		Controlling Chunking
		Parallel Algorithms
	Existing Tools
		Intel Parallel Advisor
	Project Goals
	About this Report
	Loop Analysis and Transformation in DS-8 Project
		Focus on Specific Loop Variants
		Loop Body Transformation
	Potential Transformations Using TBB
		Loops to TBB Parallel Algorithms
		STL Algorithms to TBB Parallel Algorithms
	Semantic Analysis Towards STL Algorithms
		Pass Lambda Parameter by Reference / Constant Reference
		Capture by Reference or by Value
	Semantic Analysis Towards Parallelism
		Conflicting Memory Accesses
		Lambda Functors Towards Parallelism
		Independent Iterations
		Additional Criteria for Parallelization
	Tree Pattern Matching
		Available Implementation
		Usage and Limits in Tree Pattern Matching
	Potential further Refactorings
		Divide-and-Conquer Algorithms
		Replace Threading Architecture
	Library for Parallelization
		The Library
	The Plug-In
		Replacement and Transformation
		Codan Integration
		Used Extension Points
	Pattern for Semantic Analysis
		Semantic Analysis Pattern
		Pattern for Parallelism Analysis
		For_Each call Pattern for Parallelism
		Iteration Variable Criteria for Parallelism
		Resulting Plug-In
	Known Issues
		STL Implementation Dependent Behavior
		Index-based Loops and size_t
		Error in Nested Loop Detection
		Improvements to the Existing Implementation
		Further Loop Variants
		Thrust Integration
	Personal Reflection
Parallelize STL
User Guide
	Installation of Plug-in
	Use of Plug-in
		Recognition of Possible Transformations
		Apply a Transformation
		Choose Transformations to Recognize
		Configure Argument Binding
		Configure Launch Mode
Project Management
	Time Schedule
Project Setup
	Development environment
		Used Versions
		Installation and Configuration of TBB
	Continuous Integration
		Build Environment
Upgrade to CDT 8.1.1
List of Figures
Document Text Contents
Page 1

HSR – University of Applied Sciences Rapperswil

Institute for Software

Transformation towards C++ parallelism

using TBB

Master Thesis: Fall Term 2012

Martin Kempf
[email protected]

Supervised by Prof. Peter Sommerlad

Version: February 21, 2013

Page 53

Parallator 2. Analysis

Figure 2.9.: Sequential and parallel pseudo code for a divide-and-conquer algorithm


A concrete implementation of a parallel version using tbb::parallel_invoke is shown in Listing 2.30. It uses lambda expressions with reference capture to work around two restrictions of parallel_invoke: it accepts only nullary functors as parameters, and it ignores the return value of a provided nullary function. Line 4 shows the threshold below which the serial version is used. The parallel_invoke call on line 8 spawns two tasks that recursively calculate the Fibonacci number; their results are captured in x and y. The composition of the subresults can be seen on line 12.

1 void parallelFib(int N, long &sum) {
2   if (N < 2)
3     sum = N;
4   else if (N < 1000)
5     sum = serialFib(N);
6   else {
7     long x, y;
8     tbb::parallel_invoke(
9       [&]() { parallelFib(N - 1, x); },
10      [&]() { parallelFib(N - 2, y); }
11    );
12    sum = x + y;
13  }
14 }

Listing 2.30: Parallel Fibonacci number calculation using tbb::parallel_invoke [Gue12].

Another parallel implementation of the Fibonacci number computation can use the low-level tasking interface of TBB [Cor12c]. Both implementations make use of the task scheduler, which turns the potential parallelism of the tasks into real parallelism in a very efficient way, as described in Section 1.4.2 on page 10.

For Java, it has been shown that refactoring sequential divide-and-conquer algorithms into parallel versions that make use of Java's ForkJoinTask is feasible and results in a speedup [DME09].

2.6.2. Replace Threading Architecture

Threads were in use long before TBB was introduced. There might therefore be useful transformations that let algorithms which do not match one of the high-level templates benefit from TBB's rich and scalable task scheduler. While such a refactoring would increase maintainability and productivity, and perhaps also performance, recognizing the TBB-equivalent parts in custom-made threading infrastructure and transforming them towards the TBB feature is not considered feasible.

Page 45 of 99

Page 54

Parallator 3. Implementation

3. Implementation

Parallator extends the DS-8 project. It introduces the transformation using TBB into the CDT plug-in of the DS-8 project. This chapter describes the integration of the transformation towards parallelism as well as the changes that make the existing transformation to STL algorithms applicable in practice. Section 3.1.2 on the next page describes the implemented parallel library that simplifies the transformation towards parallelism. Section 3.2 on page 49 covers the details of the integration into the existing plug-in and the final architecture. A detailed description of the patterns used for the semantic analysis is given in Section 3.3 on page 62.

3.1. Library for Parallelization

A header-only library to parallelize iterator-based loops has been implemented. For usage, the header file can simply be copied into the project. The motivation for such a library, the implementation details, and the design decisions are reported here.

3.1.1. Motivation

Listing 3.1 is a good example that shows several aspects which motivated a separate library. The iteration over the Range object is boilerplate code that needs to be inserted for every iterator-based loop, and code generation in the IDE always has the issue of introducing names for variables; arbitrarily chosen variable names make the generated code harder to understand. The boilerplate code can be parameterized with template parameters and extracted into a Body functor that can be reused in other transformations. In that way, the code to generate is minimized to the call of the library function, and the boilerplate code is hidden. The library is inspired by Thrust, but TBB-specific features like the selection of the grain size as well as of the partitioner shall remain available to the user.

std::vector<WebSite> sites = getWebSites();
tbb::parallel_for(
    tbb::blocked_range<std::vector<WebSite>::iterator>(sites.begin(), sites.end()),
    [=](const tbb::blocked_range<std::vector<WebSite>::iterator>& site) {
      const std::vector<WebSite>::iterator it_end = site.end();
      for (std::vector<WebSite>::iterator it = site.begin(); it != it_end; ++it) {
        it->search(searchString);
      }
    });

Listing 3.1: Motivation for an own library – tbb::parallel_for usage with iterator.


Page 106

Parallator Bibliography


[AK12] Alexey Kukanov, Intel Corporation. Parallel programming with Intel Threading Building Blocks. Website access October 23, 2012.

[Bec12] Pete Becker. C++ standard. Website access November 04, 2012.
[Boo13] Boost C++ Libraries. Website access February 1, 2013.

[Cor12a] Intel Corporation. Design parallel performance with less risk and more impact.
Website access November 27, 2012.

[Cor12b] Intel Corporation. Intel threading building blocks for open source. Website
access October 23, 2012.

[Cor12c] Intel Corporation. Intel Threading Building Blocks, reference manual, document number 315415-016US. Website access October 23, 2012.

[Cor12d] Intel Corporation. parallel_for and parallel_for_each usage with STL vector. Website access October 26, 2012.

[DD12] Danny Dig and Ralph Johnson. ReLooper: Refactoring for loop parallelism. Website access November 1, 2012.

[DME09] Danny Dig, John Marrero, and Michael D. Ernst. Refactoring sequential java
code for concurrency via concurrent libraries. In Proceedings of the 31st Inter-
national Conference on Software Engineering, ICSE ’09, pages 397–407, Wash-
ington, DC, USA, 2009. IEEE Computer Society.

[ea13] Jared Hoberock et al. Parallelizing the standard algorithms library. Website access January 20, 2013.

[Fou12] The Eclipse Foundation. Eclipse CDT (C/C++ Development Tooling). Website access November 27, 2012.

[Fou13] The Eclipse Foundation. CDT – static analysis. Website access January 30, 2013.


Page 107


[GHJ95] Erich Gamma, Richard Helm, and Ralph Johnson. Design Patterns – Elements of Reusable Object-Oriented Software. Addison-Wesley Longman, Amsterdam, 1st edition, 1995.

[Gue12] Paul Guermonprez. Parallel programming course: Threading Building Blocks (TBB). Website access November 1, 2012. Parallel_Programming_Course_-_InDepth/IntelAcademic_Parallel_08_TBB.pdf.

[JH13] Jared Hoberock and Nathan Bell. Thrust – parallel algorithms library. Website access January 17, 2013.

[Kem12] Martin Kempf. DSL for specifying DS-8 tree patterns. Technical report, Institute for Software, HSR, Switzerland, 2012.

[Kes10] Pascal Kesseli. Loop analysis and transformation towards STL algorithms. Technical report, Institute for Software, HSR, Switzerland, 2010.

[KV11] Wooyoung Kim and Michael Voss. Multicore desktop programming with intel
threading building blocks. IEEE Softw., 28(1):23–31, January 2011.

[Lea12] Doug Lea. ParallelArray package extra166y. Website access November 1, 2012.

[MRR12] Michael D. McCool, Arch D. Robison, and James Reinders. Structured Parallel Programming: Patterns for Efficient Computation. Elsevier/Morgan Kaufmann, Waltham, MA, 2012.

[plu13] size_t – C++ reference. Website access February 6, 2013.

[RR99] Radu Rugina and Martin Rinard. Automatic parallelization of divide and con-
quer algorithms. In Proceedings of the seventh ACM SIGPLAN symposium on
Principles and practice of parallel programming, PPoPP ’99, pages 72–83, New
York, NY, USA, 1999. ACM.

[Sch12] Markus Schorn. CDT – Bugzilla bug 299911. Website access October 8, 2012.

[Sip06] Michael Sipser. Introduction to the Theory of Computation. Thomson Course
Technology, 2nd edition, 2006.

[SL05] Herb Sutter and James Larus. Software and the concurrency revolution. Queue,
3(7):54–62, September 2005.

[Sut05] Herb Sutter. The Free Lunch Is Over: A Fundamental Turn Toward Concur-
rency in Software. Dr. Dobb’s Journal, 30(3), 2005.

[ZC91] Hans Zima and Barbara Chapman. Supercompilers for parallel and vector com-
puters. ACM, New York, NY, USA, 1991.

