Page 1

EPL372 Parallel Processing

Introduction: Parallel Processing

Γιάννος Σαζεϊδης, Spring Semester 2014

READING
1. www.pearsonhighered.com/assets/hip/us/hip_us_pearsonhighered/samplechapter/0321487907.pdf
2. Illiac IV: http://www.cs.auckland.ac.nz/courses/compsci703s1c/resources/Bouknight-ILIAC-IV.pdf
3. Parallel Computing Landscape: http://www.youtube.com/watch?v=On-k-E5HpcQ
4. Homework #1

Slides based on notes by Calvin and Snyder and Pearson Pub.

Page 2

Consider A Simple Task …
• Adding a sequence of numbers A[0],…,A[n-1]
  – Standard way to express it:
      sum = 0;
      for (i=0; i<n; i++) {
        sum += A[i];
      }
• Semantics require: (…((sum+A[0])+A[1])+…)+A[n-1]
  – That is, sequential
• Can it be executed in parallel?

Page 3

Parallel Summation
• To sum a sequence in parallel
  – add pairs of values producing 1st level results,
  – add pairs of 1st level results producing 2nd level results,
  – sum pairs of 2nd level results …
• That is,
  – (…((A[0]+A[1]) + (A[2]+A[3])) + … + ((A[n-4]+A[n-3]) + (A[n-2]+A[n-1]))…)
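The pairing scheme above can be sketched in C as follows. This is only an illustration: the function name tree_sum is made up here, the loop runs sequentially, and the point is just that the additions inside one level do not depend on each other.

```c
#include <stddef.h>

/* Pairwise (tree) summation, done in place.
 * At each level, element i absorbs element i + stride.
 * All additions within one level are independent of each other,
 * so a parallel machine could perform a whole level in one step.
 * Note: this overwrites a[]; the total ends up in a[0]. */
long tree_sum(long *a, size_t n)
{
    if (n == 0)
        return 0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* Independent additions: candidates for parallel execution. */
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}
```

The outer loop runs about log2(n) times, which is the number of levels in the tree.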

Page 4

Express the two formulations
• Same number of operations, different order
• Which version is more parallel?
• How to go from sequential to parallel?
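A small worked count may help with the first two bullets. Assuming n is a power of two (an assumption made here only to keep the arithmetic clean), both formulations perform the same number of additions, but the longest chain of dependent additions differs:

```latex
\begin{align*}
\text{sequential:}\quad & \text{additions} = n-1,
  & \text{longest dependence chain} &= n-1 \\
\text{tree:}\quad & \text{additions} = \tfrac{n}{2}+\tfrac{n}{4}+\dots+1 = n-1,
  & \text{longest dependence chain} &= \log_2 n
\end{align*}
```

For n = 8 both versions do 7 additions, but the sequential chain is 7 additions long while the tree needs only 3 levels.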

Page 5

The dream of automatic parallelization...
• Since the 70s (Illiac IV days) the dream has been to compile sequential programs into parallel object code
• More than three decades of continual, well-funded research implies it’s hopeless
  – For a tight loop summing numbers, it’s doable
  – For other computations it has proved extremely challenging to generate parallel code, even with pragmas or other assistance from programmers
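As an illustration of the "doable" case with programmer assistance, here is a minimal sketch using an OpenMP pragma. OpenMP is not named in the slides; it is used here only as one well-known example of a pragma-based approach, and the function name sum_with_pragma is hypothetical.

```c
#include <stddef.h>

/* Sum a[0..n-1]. The pragma asks an OpenMP-enabled compiler to split the
 * loop across threads and to combine the per-thread partial sums.
 * When OpenMP is not enabled, the guarded pragma is skipped and the loop
 * simply runs sequentially. */
long sum_with_pragma(const long *a, size_t n)
{
    long sum = 0;
    long i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < (long)n; i++)
        sum += a[i];
    return sum;
}
```

Note that the programmer, not the compiler, states that sum is a reduction variable; the compiler only carries out the transformation.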

Page 6

What’s the Problem?
• It’s not likely a compiler will produce parallel code from a C specification any time soon…
• Fact: For most computations, the “best” (practically, not theoretically) sequential solution and a “best” parallel solution are usually fundamentally different …
  – Different solution paradigms imply computations are not “simply” related
  – Compiler transformations generally preserve the solution paradigm
• Therefore... the programmer must discover the parallel (||) solution

Page 7

A Related Computation
• Consider computing the prefix sums
    for (i=1; i<n; i++) {
      A[i] += A[i-1];
    }
• Semantics ...
  – A[0] is unchanged
  – A[1] = A[1] + A[0]
  – A[2] = A[2] + (A[1] + A[0])
  – ...
  – A[n-1] = A[n-1] + (A[n-2] + ( ... (A[1] + A[0]) … ))
• A[i] is the sum of the first i + 1 elements

Page 8

Comparison of two Approaches
• The sequential summation produces every prefix sum along the way … the parallel (tree) summation produces only the last one, the total…

Page 9

Parallel Prefix Algorithm
• R. E. Ladner and M. J. Fischer, “Parallel Prefix Computation”, Journal of the ACM, 1980
• 2 log n time
• Applies to a wide class of operations
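To make the "2 log n" figure concrete, here is a sketch of one common up-sweep/down-sweep prefix scheme. It is not claimed to be the exact construction of the cited paper. It assumes n is a power of two, the name prefix_sum_tree is made up, and the loops run sequentially; within each level the updates are independent, so each level could be one parallel step.

```c
#include <stddef.h>

/* In-place inclusive prefix sum; n is assumed to be a power of two. */
void prefix_sum_tree(long *a, size_t n)
{
    /* Up-sweep: build sums over blocks of size 2, 4, 8, ...
     * (log2 n levels; the last element ends up holding the total). */
    for (size_t d = 1; d < n; d *= 2)
        for (size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];          /* independent within a level */

    /* Down-sweep: push block sums into the positions not yet complete
     * (log2 n - 1 levels). */
    for (size_t d = n / 4; d >= 1; d /= 2)
        for (size_t i = 3 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];          /* independent within a level */
}
```

The two sweeps together take about 2 log2(n) levels, and the same structure works for any associative operator in place of +.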

Page 10

Definitions: reduction and scan
• Tree-like operation used for parallel sum is called Reduction
• Tree-like operation used for parallel-prefix is called Scan
• Reduction and scan can be applied to other operators: max, min, second largest, etc.
• When should we use these operators?
  – When they are faster than the sequential operation, which is not guaranteed because of communication overhead
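"Second largest" is a useful example because the partial result is a pair rather than a single value. The sketch below shows one way to phrase it as a reduction; the type top2 and both function names are hypothetical, and the recursion runs sequentially although its two halves are independent.

```c
#include <limits.h>
#include <stddef.h>

/* Partial result of the "second largest" reduction. */
typedef struct { int largest; int second; } top2;

/* Combine two partial results. The operator is associative,
 * so partial results can be merged in any tree order. */
static top2 top2_combine(top2 x, top2 y)
{
    top2 r;
    if (x.largest >= y.largest) {
        r.largest = x.largest;
        r.second = (y.largest > x.second) ? y.largest : x.second;
    } else {
        r.largest = y.largest;
        r.second = (x.largest > y.second) ? x.largest : y.second;
    }
    return r;
}

/* Reduce a[lo..hi) by splitting the range in half, as a tree would.
 * The two recursive calls are independent of each other. */
top2 top2_reduce(const int *a, size_t lo, size_t hi)
{
    if (hi - lo == 0) { top2 e = { INT_MIN, INT_MIN }; return e; }
    if (hi - lo == 1) { top2 leaf = { a[lo], INT_MIN }; return leaf; }
    size_t mid = lo + (hi - lo) / 2;
    return top2_combine(top2_reduce(a, lo, mid), top2_reduce(a, mid, hi));
}
```

With duplicates this version counts multiplicity, so for {5, 5, 3} the second largest is reported as 5; INT_MIN stands for "no value yet".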

Page 11

Parallel Compared to Sequential Programming
• Has different costs, different advantages
• Requires different, unfamiliar algorithms
• Must use different abstractions
• More complex to understand a program’s behavior
• More difficult to control the interactions of the program’s components
• Knowledge/tools/understanding more primitive
• NEXT: Illustrate complexities of writing parallel programs

Page 12

Consider a Simple Problem
• Count the 3s in array[] of length values
• Sequential program
    count = 0;
    for (i=0; i<length; i++) {
      if (array[i] == 3)
        count += 1;
    }

Page 13

Basic background on Thread parallel programming
• Thread is the unit of parallelism
• Each thread has its own PC and sequences its instructions independently of the others
• Has private registers and stack
• Shares program text and global data
• Constructs to create, synchronize, and kill threads
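A minimal sketch of these ideas using POSIX threads. The slides do not name a thread library, so pthreads is an assumption, and NUM_THREADS, worker and run_threads are made-up names.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4            /* hypothetical thread count */

static int shared_total = 0;     /* global data: visible to every thread */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    int id = *(int *)arg;        /* copy kept on this thread's private stack */
    int local = id + 1;          /* private: other threads do not see it */

    pthread_mutex_lock(&total_lock);      /* synchronize */
    shared_total += local;
    pthread_mutex_unlock(&total_lock);
    return NULL;                          /* the thread ends here */
}

/* Create NUM_THREADS threads, wait for all of them, return the shared sum. */
int run_threads(void)
{
    pthread_t tid[NUM_THREADS];
    int ids[NUM_THREADS];

    shared_total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&tid[i], NULL, worker, &ids[i]);   /* create */
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);                       /* wait for exit */
    return shared_total;
}
```

Each thread runs the same function with its own program counter and stack, while shared_total lives in the global data that all threads see.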

Page 14

Multi-core computer system
• Each processor has a private L1 cache; it shares an L2 cache with its “chip-mate” and shares an L3 cache with the other processors.
• What is a possible bottleneck to this parallel system?

Page 15

Data allocation to threads
• Allocations are consecutive indices.
• Data allocation/partitioning is a natural step in developing a parallel algorithm and program
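One way to compute such consecutive blocks is sketched below. The function name block_range and its parameters are hypothetical; the slides only state that each thread receives consecutive indices.

```c
#include <stddef.h>

/* Give thread `id` (0 <= id < t) the half-open index range [*start, *end).
 * The first (length % t) threads get one extra element, so block sizes
 * differ by at most one. */
void block_range(size_t length, size_t t, size_t id,
                 size_t *start, size_t *end)
{
    size_t base = length / t;
    size_t extra = length % t;
    *start = id * base + (id < extra ? id : extra);
    *end = *start + base + (id < extra ? 1 : 0);
}
```

For length = 10 and t = 4 this yields the blocks [0,3), [3,6), [6,8) and [8,10).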

Page 16

The first try at Count 3s

Page 17

First Solution Incorrect: Race Condition
• One of several possible interleavings of references to the unprotected variable count, illustrating a race condition
• This is not a coherence problem.
• Imagine debugging the above…
• A high-level count++ becomes, in assembly:
    ld    reg, [count]
    ++reg
    store reg, [count]
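The lost update can be replayed deterministically without real threads. In the sketch below, reg1 and reg2 stand in for the private registers of two threads, and the statements are written in one of the bad interleavings; the function name is made up.

```c
/* Deterministic replay of one bad interleaving of two threads that each
 * execute count++ as load / increment / store. No real threads are used. */
int lost_update_demo(void)
{
    int count = 0;      /* the shared, unprotected variable */
    int reg1, reg2;     /* each thread's private register */

    reg1 = count;       /* thread 1: ld reg, [count]     reg1 is 0 */
    reg2 = count;       /* thread 2: ld reg, [count]     reg2 is 0 */
    reg1 = reg1 + 1;    /* thread 1: ++reg */
    reg2 = reg2 + 1;    /* thread 2: ++reg */
    count = reg1;       /* thread 1: store reg, [count]  count is 1 */
    count = reg2;       /* thread 2: store reg, [count]  count is still 1 */

    return count;       /* 1, although two increments were executed */
}
```

Both threads read the same old value, so one increment is overwritten. The memory system delivered consistent values throughout, which is why this is a race condition and not a coherence problem.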

Page 18

Second Solution: Avoid the race condition by protecting shared variables
• mutex protection for the count variable
• mutex: construct that allows one thread at a time in the critical section
• Is a mutex always good? No: it has a large overhead
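A sketch of what such a mutex-protected version might look like with POSIX threads. The slide's own code did not survive extraction, so everything here (block_t, count3s_block, count3s_mutex, MAX_THREADS) is an illustrative assumption rather than the original program.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* hypothetical upper bound */

typedef struct {
    const int *array;
    size_t start, end;          /* this thread's block: [start, end) */
} block_t;

static int count;                                     /* shared result */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *count3s_block(void *arg)
{
    const block_t *b = arg;
    for (size_t i = b->start; i < b->end; i++) {
        if (b->array[i] == 3) {
            pthread_mutex_lock(&m);      /* enter critical section */
            count++;
            pthread_mutex_unlock(&m);    /* leave critical section */
        }
    }
    return NULL;
}

/* Count the 3s using t threads; assumes 1 <= t <= MAX_THREADS. */
int count3s_mutex(const int *array, size_t length, size_t t)
{
    pthread_t tid[MAX_THREADS];
    block_t blocks[MAX_THREADS];

    count = 0;
    for (size_t id = 0; id < t; id++) {
        blocks[id].array = array;
        blocks[id].start = id * length / t;        /* consecutive blocks */
        blocks[id].end = (id + 1) * length / t;
        pthread_create(&tid[id], NULL, count3s_block, &blocks[id]);
    }
    for (size_t id = 0; id < t; id++)
        pthread_join(tid[id], NULL);
    return count;
}
```

Removing the lock and unlock calls gives a racy program of the kind the first try illustrates. With them, the answer is correct, but every 3 that is found costs a lock acquisition.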

Page 19

Performance of second Count 3s solution
• Locks serialize execution. Can we avoid them?
• IDEA: Do we really need to protect every count update?

Page 20

3rd Solution: private counts per thread
• private_count array elements, one for each thread
• Still need a critical section, but only when combining the result of each thread, which is much less frequent.
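A sketch of the private-count idea with POSIX threads. Only the name private_count comes from the slide; the rest (block_t, count3s_block, count3s_private, MAX_THREADS) is assumed for illustration.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* hypothetical upper bound */

typedef struct {
    const int *array;
    size_t start, end;          /* this thread's block: [start, end) */
    size_t id;                  /* index into private_count[] */
} block_t;

static int count;                                     /* shared result */
static int private_count[MAX_THREADS];                /* one per thread */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *count3s_block(void *arg)
{
    const block_t *b = arg;
    for (size_t i = b->start; i < b->end; i++)
        if (b->array[i] == 3)
            private_count[b->id]++;      /* no lock: own element only */

    pthread_mutex_lock(&m);              /* one critical section per thread */
    count += private_count[b->id];
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Count the 3s using t threads; assumes 1 <= t <= MAX_THREADS. */
int count3s_private(const int *array, size_t length, size_t t)
{
    pthread_t tid[MAX_THREADS];
    block_t blocks[MAX_THREADS];

    count = 0;
    for (size_t id = 0; id < t; id++) {
        private_count[id] = 0;
        blocks[id].array = array;
        blocks[id].start = id * length / t;
        blocks[id].end = (id + 1) * length / t;
        blocks[id].id = id;
        pthread_create(&tid[id], NULL, count3s_block, &blocks[id]);
    }
    for (size_t id = 0; id < t; id++)
        pthread_join(tid[id], NULL);
    return count;
}
```

The lock is now taken once per thread instead of once per 3. The adjacent int elements of private_count[] are, however, packed closely together in memory.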

Page 21

Performance for third Count 3s solution
• Algorithmically parallel solution suffers from serialization…

Page 22

System causes serialization: False Coherence Traffic
FALSE SHARING:
1. The granularity of coherence is the cache block
2. A block fits many variables that are independently and concurrently updated by different threads

Page 23

4th Solution: Per-thread counter in a distinct cache block
• Private count elements are padded to force them to be allocated to different cache lines
• Padding is platform dependent. Problem? Portability…
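The padding itself can be sketched as a struct. The 64-byte line size is an assumption (it is exactly the platform-dependent part the slide warns about), and padded_int, MAX_THREADS and counter_stride are made-up names.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* assumption: differs between platforms */
#define MAX_THREADS 16        /* hypothetical upper bound */

/* One counter per cache-line-sized slot. */
struct padded_int {
    int value;
    char padding[CACHE_LINE_BYTES - sizeof(int)];
};

static struct padded_int private_count[MAX_THREADS];

/* Byte distance between the counters of two neighbouring threads. */
size_t counter_stride(void)
{
    return (size_t)((char *)&private_count[1].value -
                    (char *)&private_count[0].value);
}
```

Because neighbouring counters are a full line apart, no two of them can fall in the same 64-byte block, so an update by one thread no longer invalidates the block holding another thread's counter. On a machine with a different line size the constant would have to change.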

Page 24

[Code figure not recovered; surviving fragments: int count; int private_count = 0; ++]
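The fragments on this slide (a shared count and a private_count initialised to 0) suggest a variant in which each thread accumulates into a local variable. The following is a guess at that idea, not a reconstruction of the lost figure; it uses POSIX threads and all names other than count and private_count are hypothetical.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16          /* hypothetical upper bound */

typedef struct {
    const int *array;
    size_t start, end;          /* this thread's block: [start, end) */
} block_t;

static int count;                                     /* shared result */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *count3s_block(void *arg)
{
    const block_t *b = arg;
    int private_count = 0;      /* local: lives on this thread's own stack */

    for (size_t i = b->start; i < b->end; i++)
        if (b->array[i] == 3)
            private_count++;

    pthread_mutex_lock(&m);     /* combine once per thread */
    count += private_count;
    pthread_mutex_unlock(&m);
    return NULL;
}

/* Count the 3s using t threads; assumes 1 <= t <= MAX_THREADS. */
int count3s_local(const int *array, size_t length, size_t t)
{
    pthread_t tid[MAX_THREADS];
    block_t blocks[MAX_THREADS];

    count = 0;
    for (size_t id = 0; id < t; id++) {
        blocks[id].array = array;
        blocks[id].start = id * length / t;
        blocks[id].end = (id + 1) * length / t;
        pthread_create(&tid[id], NULL, count3s_block, &blocks[id]);
    }
    for (size_t id = 0; id < t; id++)
        pthread_join(tid[id], NULL);
    return count;
}
```

A local variable needs no platform-specific padding constant, since each thread's stack is a separate region of memory and the compiler may keep the counter in a register.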

Page 25

Performance for third Count 3s solution
• Finally a performance improvement. But performance does not scale beyond 4???

Page 26

Analysis to determine source of serialization
• Memory bandwidth limited
• How to validate? Measure performance with an array that does not contain any 3s. The results are the same as before.
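The reasoning behind this experiment: with no 3s in the array, no counter is ever updated, so locks and false sharing cannot be the cause of any remaining slowdown; what is left is the cost of streaming the array through memory. A minimal sequential sketch of the measurement follows. The function names are hypothetical, and clock() reports CPU time, so a threaded run would need a wall-clock timer instead.

```c
#include <stddef.h>
#include <time.h>

/* Fill the array with values that are never 3 (0, 1, 2 repeating). */
void fill_without_3s(int *array, size_t length)
{
    for (size_t i = 0; i < length; i++)
        array[i] = (int)(i % 3);
}

/* Time one sequential pass over the array. Stores the number of 3s
 * found in *count_out and returns the elapsed CPU time in seconds. */
double timed_count3s(const int *array, size_t length, int *count_out)
{
    clock_t t0 = clock();
    int count = 0;
    for (size_t i = 0; i < length; i++)
        if (array[i] == 3)
            count++;
    clock_t t1 = clock();

    *count_out = count;
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

If the time on this input matches the time on the original input, the counting work is not what limits performance.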

Page 27

Count 3s Summary
• The obvious “break into blocks” program
  – Wrong answer. Race condition.
• Avoid race condition by protecting the count variable
  – We got the right answer but the program was slower … lock congestion
• Privatized memory and 1-process was fast enough, 2-processes slow … false sharing
• Separated private variables to own cache line
  – Finally success.
• Analyze why no performance scaling

Page 28

Parallel Programming Goals
Must be significantly faster than single thread implementation
• Goal: Correct and scalable programs with performance and portability
  – Scalable: More processors can be “usefully” added to solve the problem faster
  – Performance: Programs run as fast as those produced by experienced parallel programmers for the specific machine
  – Portability: The solutions run well on all parallel platforms
• Minimize programming effort

Page 29

Other non-obvious system causes of serialization
• Virtual to physical address translation can lead to cache conflicts (within and across threads)
  – OS page mapping algorithms
• What happens when there are more threads than processors?
  – Memory pressure when compute bound
  – Useful when I/O bound
  – Stress memory (paging)
• The same thread can have different performance depending on which threads it co-executes with
  – How do we learn this?

Page 30

Next
• Parallel Architectures
• Hw#1
• Read about Illiac IV
• Questions for reading/material