Parallel Programming (IN2147) - Overview

Interactive Study Guide - Summer Exam 2024 Retake

Exam: IN2147 / Retake    Date: Monday, 30th September 2024
Examiner: Prof. Dr. Martin Schulz    Time: 08:30 – 10:30

Working Instructions

  • This exam consists of 28 pages with a total of 7 problems. Please make sure now that you received a complete copy of the exam.
  • The total amount of achievable credits in this exam is 93 credits.
  • Detaching pages from the exam is prohibited.
  • Allowed resources: one analog dictionary English ↔ native language
  • Subproblems marked by * can be solved without results of previous subproblems.
  • Answers are only accepted if the solution approach is documented. Give a reason for each answer unless explicitly stated otherwise in the respective subproblem. When asked to name a concept, providing the correct name is sufficient.
  • Do not write with red or green colors nor use pencils.
  • Physically turn off all electronic devices, put them into your bag and close the bag.

Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.

Problem 1: Amdahl's Law (8 credits)

You have written a multi-threaded program and want it to run as fast as possible. You first run it sequentially and achieve a runtime of 60s. Next, you run it on all 6 cores of your laptop and achieve a runtime of 15s. Assume the parallel regions have ideal speedup and show your calculations.

a)* What percentage of the sequential program's execution time is parallelized?
Show Hint

Use Amdahl's Law for parallel runtime: \( T(p) = (1-f)T_s + \frac{fT_s}{p} \). You are given \(T(6)=15s\), \(T_s=60s\), and \(p=6\). Solve for \(f\).

Show Solution

We use the formula for parallel runtime based on Amdahl's Law: \( T(p) = T_s \cdot ((1-f) + \frac{f}{p}) \).

Substitute the given values:

\( 15 = 60 \cdot ((1-f) + \frac{f}{6}) \)

\( \frac{15}{60} = 1 - f + \frac{f}{6} \)

\( 0.25 = 1 - \frac{5f}{6} \)

\( \frac{5f}{6} = 1 - 0.25 = 0.75 \)

\( f = \frac{0.75 \cdot 6}{5} = \frac{4.5}{5} = 0.9 \)

90% of the execution time is parallelized.

Show Analysis

This calculation directly applies Amdahl's Law to determine the parallelizable fraction of a program based on empirical measurements. A high parallel fraction (90%) is good, but the remaining 10% of serial code will ultimately limit the maximum achievable speedup.
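
As a quick reference point, that serial limit can be quantified with the speedup formula from part b): letting \(p \to \infty\) gives the maximum achievable speedup.

\( SU_{max} = \lim_{p \to \infty} \frac{1}{(1-f) + \frac{f}{p}} = \frac{1}{1-f} = \frac{1}{0.1} = 10 \)

No matter how many cores are added, this program can never run more than 10 times faster than the sequential version.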

b) Now you want to improve your speedup by upgrading to a new laptop. How many parallel processing units would your new computer need to achieve a speedup of 5? Assume that the cores are identical to the ones in the old laptop.
Show Hint

Use the formula for speedup: \( SU(p) = \frac{1}{(1-f) + \frac{f}{p}} \). You want \(SU(p)=5\) and know \(f=0.9\) from the previous part. Solve for \(p\).

Show Solution

We use the formula for speedup: \( SU(p) = \frac{1}{(1-f) + \frac{f}{p}} \).

We want a speedup of 5, and we know \(f=0.9\):

\( 5 = \frac{1}{(1-0.9) + \frac{0.9}{p}} \)

\( 5 = \frac{1}{0.1 + \frac{0.9}{p}} \)

Take the reciprocal of both sides:

\( 0.1 + \frac{0.9}{p} = \frac{1}{5} = 0.2 \)

\( \frac{0.9}{p} = 0.2 - 0.1 = 0.1 \)

\( p = \frac{0.9}{0.1} = 9 \)

9 parallel processing units are needed.

Show Analysis

This demonstrates how the serial portion of the code (10%) limits scalability. Even though we are adding more cores, the speedup does not increase linearly. To achieve a speedup of 5, we need more than 5 cores because the serial part of the program becomes a bottleneck.

c)* When you run your program on your new laptop, you actually observe a speedup of 6. What is this phenomenon called, and what could be the reason for it?
Show Hint

This is a speedup greater than the number of processors would suggest is possible according to Amdahl's law. What hardware component's behavior changes significantly when the problem is distributed across more cores?

Show Solution

This phenomenon is called Super-linear speedup.

The likely reason is related to cache effects. With multiple cores, the total available cache memory increases. If the program's working set fits better into the aggregated cache of the parallel system compared to the sequential execution (which might have suffered from frequent cache misses), memory access latency significantly decreases, leading to performance improvements beyond simple scaling.

Show Analysis

Super-linear speedup is a counter-intuitive but real effect where parallelization provides benefits beyond just computation distribution. By distributing the data, each core works on a smaller piece that may now fit entirely within its local cache, drastically reducing memory access times and boosting performance beyond theoretical limits that assume constant memory access speed.

Problem 2: Threading (14 credits)

The ParProg team implemented custom mutex and condition variable classes with the structure given in the code template on the next page. The functions provide the commonly known functionality as used throughout the lecture. Now we want to implement our own thread-safe queue with these custom synchronization primitives, but need your help getting the synchronization right.

a)* Your task is to insert all missing mutex lock and unlock statements at the most efficient positions in the code template so that the condition variable is used conceptually correctly. Use the provided condition variable and mutex classes. Do not add or alter functions.
Code Template (C++):
class condition_variable {
    void notify_one();
    void wait();
};
class mutex {
    void lock();
    void unlock();
};
template <typename E> class queue {
public:
    void push(E element) {



        queue.push_back(element);



        cv.notify_one();



    }

    E pop() {



        while (queue.empty()) {



            cv.wait();



        }



        E element = queue.front();



        queue.pop_front();



        return element;



    }

private:
    std::deque<E> queue;
    mutex mutex;
    condition_variable cv;
};
Show Hint

1. Any access to the shared `queue` must be protected by a lock.

2. A `wait()` call on a condition variable must be done while holding the lock.

3. For efficiency in `push`, should you unlock before or after `notify_one()`?

Show Solution
Completed Solution (C++):
class condition_variable {
    void notify_one();
    void wait();
};
class mutex {
    void lock();
    void unlock();
};
template <typename E> class queue {
public:
    void push(E element) {
        // L1: Acquire lock before modifying shared data
        mutex.lock();

        queue.push_back(element);

        // L2: Release lock after modification.
        // Done before notify for efficiency (allows woken thread to acquire lock sooner).
        mutex.unlock();

        cv.notify_one();
    }

    E pop() {
        // L3: Acquire lock before checking condition or waiting
        mutex.lock();

        // Wait while the queue is empty (must hold the lock)
        while (queue.empty()) {
            // Assumes wait() correctly handles atomic release/reacquire of the associated mutex.
            cv.wait();
        }

        // Lock is held, queue is not empty.
        E element = queue.front();
        queue.pop_front();

        // L4: Release the lock
        mutex.unlock();

        return element;
    }

private:
    std::deque<E> queue;
    mutex mutex;
    condition_variable cv;
};
Show Analysis

This is a classic producer-consumer queue implementation. The mutex protects the shared data (`queue`), and the condition variable prevents the consumer (`pop`) from busy-waiting. The `while` loop in `pop` is crucial to handle spurious wakeups, ensuring the condition (`!queue.empty()`) is re-checked after waking up. Unlocking before `notify_one` in `push` is a performance optimization to avoid the "hurry up and wait" problem, where the notified thread wakes up only to find the lock is still held.

b) Briefly state for each of the statements you added why it was necessary at that position to ensure correct execution. Provide your answer in the format: `Line : Explanation` for each lock and unlock statement. Clearly reference other relevant line numbers in your explanations. Hint: You shouldn't need more than one sentence per explanation.
Show Hint

For each lock/unlock, explain its role in preventing data races or ensuring the correct use of the condition variable.

Show Solution
  • L1 (`mutex.lock()` in `push`): Required to ensure exclusive access to the shared `queue` before modification (`push_back`), preventing data races.
  • L2 (`mutex.unlock()` in `push`): Required to release the lock after the critical section is complete.
  • L3 (`mutex.lock()` in `pop`): Required before checking the shared condition (`queue.empty()`) and is a prerequisite for correctly calling `cv.wait()`.
  • L4 (`mutex.unlock()` in `pop`): Required to release the lock after the element has been successfully removed from the queue.
Show Analysis

The placement of locks and unlocks defines the "critical sections" of the code. The goal is to make these sections as small as possible while still protecting all shared state. The interaction with condition variables is particularly strict: the lock must be held when calling `wait()` so the variable can atomically release it and go to sleep.

c) What correctness problem needs to be further considered to make this code safe in all situations? How can this be avoided using the C++ standard library instead of our custom implementation?
Show Hint

What happens if an operation like `queue.push_back()` throws an exception (e.g., out of memory)? Will the `unlock()` statement be executed? What is the C++ pattern for managing resources like locks in the presence of exceptions?

Show Solution

The code lacks Exception Safety. If an operation within the critical section (e.g., `queue.push_back()` potentially throws `std::bad_alloc`) throws an exception, the corresponding `mutex.unlock()` call is skipped. This leaves the mutex permanently locked, causing a deadlock for any other thread that tries to acquire the lock.

This is avoided in the C++ standard library by using RAII (Resource Acquisition Is Initialization) wrappers, such as `std::lock_guard<std::mutex>` or `std::unique_lock<std::mutex>`. These wrappers guarantee that the mutex is automatically unlocked when the scope is exited, regardless of whether it's a normal return or an exception was thrown.

Show Analysis

RAII is a cornerstone of modern C++ for writing robust code. By tying resource management (like locks, memory, or file handles) to the lifetime of an object on the stack, it eliminates a whole class of resource leak bugs. For mutexes, `std::lock_guard` is the simplest RAII wrapper, while `std::unique_lock` is more flexible and is required when working with condition variables because it allows for manual unlocking and re-locking.
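
For illustration, here is a minimal sketch of the same queue built on the standard-library primitives mentioned above (`std::mutex`, `std::lock_guard`, `std::unique_lock`, `std::condition_variable`); the class name `safe_queue` and the predicate overload of `wait()` are choices made for this sketch, not part of the exam template.

Example Sketch (C++):
#include <condition_variable>
#include <deque>
#include <mutex>

template <typename E> class safe_queue {
public:
    void push(E element) {
        {
            // lock_guard unlocks automatically at scope exit, even if push_back throws
            std::lock_guard<std::mutex> lock(m);
            q.push_back(std::move(element));
        } // lock released here, before the notify, as in subproblem a)
        cv.notify_one();
    }

    E pop() {
        // unique_lock is required because cv.wait() must unlock and re-lock it
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return !q.empty(); }); // re-checks the predicate, handling spurious wakeups
        E element = std::move(q.front());
        q.pop_front();
        return element; // unique_lock's destructor releases the mutex
    }

private:
    std::deque<E> q;
    std::mutex m;
    std::condition_variable cv;
};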

Problem 3: OpenMP (12 credits)

Consider the following program:

Code Snippet (C++):
#include "omp.h"
#include <stdio.h>
int main() {
    int a = 1, b = 2, c = 3;
    #pragma omp parallel num_threads(8)
    #pragma omp for schedule(static) firstprivate(b) lastprivate(c)
    for (int i = 0; i < 8; i++){
        #pragma omp critical
        {
            a += omp_get_thread_num();

            b += omp_get_thread_num();

            c += omp_get_thread_num();

        }

        if (i == 4) {
            printf("a: %d\n", a);

            printf("b: %d\n", b);

            printf("c: %d\n", c);
        }
    }

    printf("a: %d\n", a);

    printf("b: %d\n", b);

    printf("c: %d\n", c);

    return 0;
}
a)* State the output of the program in lines 19, 21 and 23. If the output of a line is non-deterministic, briefly explain the reason for the non-determinism instead (one short sentence per reason should suffice).
Show Hint

The `if (i == 4)` block is executed by the thread assigned to iteration 4. With `schedule(static)` and 8 threads/8 iterations, thread 4 executes iteration 4.

- `a` is shared. Its value depends on which threads have entered the critical section before thread 4.

- `b` is `firstprivate`. Each thread gets a private copy initialized to the original value.

- `c` is `lastprivate`. Each thread gets a private, *uninitialized* copy.

Show Solution
  • Line 19 (a): Non-deterministic. `a` is a shared variable, and its value at this point depends on the non-deterministic order in which threads have executed the critical section before thread 4.
  • Line 21 (b): 6. Thread 4 gets a private copy of `b` initialized to 2 (`firstprivate`). Inside the critical section, it executes `b += 4`, so its private `b` becomes 6.
  • Line 23 (c): Non-deterministic. `c` is `lastprivate` but not `firstprivate`, so each thread's private copy is uninitialized. The output is the result of adding 4 to an indeterminate value.
Show Analysis

This problem tests understanding of OpenMP data sharing clauses. `shared` variables are visible to all threads. `firstprivate` gives each thread a private copy initialized from the master thread's value. `lastprivate` copies the value from the thread that executed the logically last iteration back to the master thread's variable after the loop. The key trap here is that `lastprivate` does not imply `firstprivate`, leading to uninitialized private variables.
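
To make the trap concrete, here is a hedged sketch (not part of the exam program) showing how listing the variable in both clauses removes the non-determinism: every private copy of `c` then starts at 3, and the value copied back from the logically last iteration is well-defined.

Example Sketch (C++):
#include <omp.h>
#include <stdio.h>

int main() {
    int c = 3;
    // firstprivate initializes every private copy of c to 3;
    // lastprivate copies the value from the logically last iteration (i == 7) back.
    #pragma omp parallel num_threads(8)
    #pragma omp for schedule(static) firstprivate(c) lastprivate(c)
    for (int i = 0; i < 8; i++) {
        c += omp_get_thread_num();
    }
    printf("c: %d\n", c); // deterministic (with 8 threads): 3 + 7 = 10
    return 0;
}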

b)* State the output of the program in lines 27, 29 and 31. If the output of a line is non-deterministic, briefly explain the reason for the non-determinism instead (one short sentence per reason should suffice).
Show Hint

These `printf` statements execute after the parallel region has finished.

- `a`: All threads (0 through 7) will have executed the critical section exactly once. Calculate the final sum.

- `b`: `firstprivate` only creates private copies. The original `b` is not modified.

- `c`: `lastprivate` copies the value from the thread that executed the last iteration (i=7) back to the original `c`. What was the state of that thread's private `c`?

Show Solution
  • Line 27 (a): 29. The `critical` section ensures that the updates to the shared `a` are atomic. After the loop, `a` will be its initial value (1) plus the sum of all thread numbers from 0 to 7. Sum = (7*8)/2 = 28. Final value = 1 + 28 = 29.
  • Line 29 (b): 2. The original variable `b` is not affected by modifications to the `firstprivate` copies within the parallel region.
  • Line 31 (c): Non-deterministic. The `lastprivate` clause copies the value from the private `c` of the thread that executed the last iteration (thread 7). Since that private copy was uninitialized, the final value of `c` is indeterminate.
Show Analysis

The final state of the variables demonstrates the one-way nature of `firstprivate` (master -> thread) and `lastprivate` (last thread -> master). The atomicity provided by `critical` is essential for the correct final value of `a`. The non-determinism of `c` is a common bug when `lastprivate` is used without an initialization mechanism like `firstprivate`.

c)* Describe the static and dynamic OpenMP for-loop scheduling modifiers and name a typical use case for when each schedule is preferred on a UMA node.
Show Hint

One schedule divides the work up-front, the other divides it as threads become free. Which one is better for balanced workloads? Which one is better for unbalanced workloads?

Show Solution
  • Static Scheduling: Iterations are divided into chunks and assigned to threads in a fixed (static) manner before the loop begins.
    • Use Case (UMA): Preferred for loops where each iteration takes roughly the same amount of time (a balanced workload), as it has the lowest scheduling overhead.
  • Dynamic Scheduling: Iterations are assigned to threads in smaller chunks dynamically as threads finish their current work and become available.
    • Use Case (UMA): Preferred for loops where iteration execution time varies significantly (an unbalanced workload), as it provides better load balancing at the cost of higher scheduling overhead.
Show Analysis

Choosing the right schedule is a trade-off between overhead and load balancing. `static` is fast and simple but can lead to idle threads if work is uneven. `dynamic` ensures all threads stay busy but incurs the cost of fetching new work from a central queue. A third option, `guided`, is a compromise that starts with large chunks and reduces the chunk size over time.
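
As a usage reference, a minimal sketch contrasting the two schedules discussed above; the `variable_cost()` helper and the chunk size of 16 are illustrative assumptions, not part of the exam.

Example Sketch (C++):
#include <omp.h>

double variable_cost(int i);   // hypothetical helper whose runtime varies strongly with i

void schedules_example(float *x, float *y, float a, int n) {
    // Balanced loop: every iteration does the same work,
    // so schedule(static) divides it up front with minimal overhead.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];

    // Unbalanced loop: iteration cost varies, so schedule(dynamic) hands out
    // chunks (here 16 iterations) as threads become free.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; i++)
        y[i] += (float) variable_cost(i);
}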

d)* Briefly explain why a dynamic schedule can cause overhead on a NUMA system.
Show Hint

Consider the "First Touch" memory allocation policy on NUMA systems. If data is initialized by one thread, where is its memory page located? What happens if a `dynamic` schedule later assigns an iteration that uses this data to a thread running on a different socket?

Show Solution

On a NUMA (Non-Uniform Memory Access) system, memory access latency is lower for local memory than for remote memory (on another socket). A `dynamic` schedule assigns iterations to threads as they become available, without regard for data locality. This can frequently result in a thread being assigned an iteration that operates on data located in a remote NUMA node, forcing it to perform slow, high-latency remote memory accesses, which causes significant performance overhead.

Show Analysis

This is a key challenge in NUMA programming. The performance gain from load balancing with a `dynamic` schedule can be completely negated by the penalty of remote memory access. For NUMA systems, a `static` schedule is often preferred, combined with careful data initialization (e.g., parallel initialization) to ensure that threads primarily work on data that is local to their socket.
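
A minimal sketch of the NUMA-friendly pattern described above, relying on the first-touch policy: the array is initialized and later processed with the same static schedule, so each thread mostly touches pages on its own socket. The function name and the trivial update are illustrative only.

Example Sketch (C++):
#include <omp.h>

void numa_friendly(double *data, int n) {
    // First touch: with schedule(static), each thread writes its own contiguous
    // block first, so the OS places those pages on that thread's NUMA node.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        data[i] = 0.0;

    // Compute phase with the same static mapping: each thread reads and writes
    // mostly local memory instead of paying remote-access latency.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++)
        data[i] += 1.0;
}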

Problem 4: SIMD (17 credits)

You have a vector processor that can execute 256-bit vector instructions.

a)* The processor can execute the vector instructions with the same latency and throughput as regular integer instructions. What is the expected speedup for 16-bit integer instructions, assuming no overhead?
Show Hint

Calculate the Vector Length (VL), which is the number of data elements that can fit into a single vector register. Speedup = VL.

Show Solution

Vector width = 256 bits.
Data element size = 16 bits.

Vector Length (VL) = Vector Width / Element Size = 256 / 16 = 16.

The expected speedup is 16x.

Show Analysis

SIMD (Single Instruction, Multiple Data) achieves performance gains by performing the same operation on multiple data elements simultaneously. The theoretical speedup is equal to the number of elements that can be processed in parallel, which is the vector length.


Now consider the following sequential code:

Code Snippet (C):
void operation(float *a, float *b, float *c, float d, int size)
{
    for (int i = 1; i < size; i++) {
        a[i] = b[i] / d;
        c[i] = a[i-1] - b[i-1];
    }
}
b) What problem prevents vectorization of this loop? Name and briefly explain the problem.
Show Hint

Look at the data dependencies between loop iterations. Does an iteration depend on a value calculated in a previous iteration? The calculation of `c[i]` uses `a[i-1]`.

Show Solution

The problem is a Loop-Carried Flow Dependence.

Explanation: The statement `c[i] = a[i-1] - b[i-1];` in iteration `i` reads the value of `a[i-1]`. This value was written by the statement `a[i] = b[i] / d;` in the *previous* iteration (`i-1`). This dependency across iterations prevents the loop from being vectorized directly, as vectorization requires iterations to be independent.

Show Analysis

This is a classic example of a dependency that hinders vectorization. The compiler cannot safely execute multiple iterations at once because the result of one iteration is needed as an input for the next. To resolve this, the dependency must be broken, often by restructuring the loop.

c)* Transform the loop using loop-distribution to make it vectorizable.
Show Hint

Split the original loop into two separate loops. The first loop will calculate all values of `a`, and the second loop will calculate all values of `c`. This resolves the dependency.

Show Solution

We use Loop Distribution (also known as Loop Fission) to separate the dependent statements into two loops.

Transformed Code (C):
void operation_distributed(float *a, float *b, float *c, float d, int size)
{
    // Loop 1: Calculates 'a'. This loop has no cross-iteration dependencies.
    for (int i = 1; i < size; i++) {
        a[i] = b[i] / d;
    }
    
    // Loop 2: Calculates 'c'. This loop is also vectorizable as all 'a' values are now computed.
    for (int i = 1; i < size; i++) {
        c[i] = a[i-1] - b[i-1];
    }
}
Show Analysis

Loop distribution is a powerful transformation technique. By splitting the loop, we ensure that all writes to array `a` are completed before any reads of `a` occur in the second loop. This completely removes the loop-carried dependency, allowing a vectorizing compiler to process each of the new loops independently using SIMD instructions.

d)* Vectorize the computation of 'a' using the transformed code from c) with 256-bit vector operations. You do not need to vectorize the computation of 'c'; it should still be computed sequentially. Assume that all arrays are unaligned. Do not try to align the arrays but use the correct intrinsics. Do not write remainder loops but use masked instructions instead. You can use the `minint` function defined below.

Here is a list of definitions you might find useful:

Intrinsics               Intrinsics              Intrinsics
__m256                   _mm256_set1_ps          __m256i
__m256f                  _mm256_set1_pf          __m256d
__m256d                  _mm256_set1_pd          _mm256_store_ps
_mm256_load_ps           _mm256_sub_ps           _mm256_storeu_ps
_mm256_loadu_ps          _mm256_sub_pf           _mm256_store_pf
_mm256_load_pf           _mm256_sub_pd           _mm256_storeu_pf
_mm256_loadu_pf          _mm256_mul_ps           _mm256_store_pd
_mm256_load_pd           _mm256_mul_pf           _mm256_storeu_pd
_mm256_loadu_pd          _mm256_mul_pd           _mm256_mask_store_ps
_mm256_mask_load_ps      _mm256_div_ps           _mm256_mask_storeu_ps
_mm256_mask_loadu_ps     _mm256_div_pf           _mm256_mask_store_pf
_mm256_mask_load_pf      _mm256_div_pd           _mm256_mask_storeu_pf
_mm256_mask_loadu_pf     _mm256_mask_div_ps      _mm256_mask_store_pd
_mm256_mask_load_pd      _mm256_mask_div_pf      _mm256_mask_storeu_pd
_mm256_mask_loadu_pd     _mm256_mask_div_pd

The masked instructions have the following parameters:

  • _mm256_mask_div_*(src, mask, a, b)
  • _mm256_mask_load_*(src, mask, addr)
  • _mm256_mask_loadu_*(src, mask, addr)
  • _mm256_mask_store_*(addr, mask, a)
  • _mm256_mask_storeu_*(addr, mask, a)
Helper Function (C):
// Computes the minimum of two integers
int minint(int x, int y) {
    if (x <= y) {
        return x;
    } else {
        return y;
    }
}
Code Template (C):
void operation (float *a, float *b, float *c, float d, int size) {
    int i, ub = size - (size % 4);
    __m_____ a256, b256, d256, zero256;

    d256 = ______;

    for (i = 1; i < ub; i += 8) {
        b256 = ______;

        a256 = ______;

        ______;
    }

    __mmask8 mask = ______;

    zero256 = ______;

    b256m = ______;

    a256m = ______;

    ______;
}
Show Hint

1. The vector length (VL) for 256-bit registers and 32-bit floats is 8.

2. Broadcast the scalar `d` into a vector register `d256` using `_mm256_set1_ps`.

3. The main loop should iterate from `i=1` to `size` in steps of `VL` (8).

4. Inside the loop, you need to generate a mask that handles the final partial vector. For example, if `size-i` is 3, the mask should be `0b00000111`.

5. Use `_mm256_mask_loadu_ps` to load data for `b`, `_mm256_mask_div_ps` for the calculation, and `_mm256_mask_storeu_ps` to store the result into `a`.

6. After the vectorized loop for `a`, add the sequential loop for `c`.

Show Solution
Vectorized Solution (C/AVX):
#include <immintrin.h>

// Assuming __mmask8 is defined, as implied by the masked intrinsics.
// For AVX-512, this is part of the header. For AVX2, one might need a typedef.
typedef unsigned char __mmask8;

void operation_vec(float *a, float *b, float *c, float d, int size) {
    const int VL = 8; // Vector Length for float on 256-bit registers
    int i;

    // --- Vectorized computation of 'a' ---
    __m256 a256, b256, d256, zero256;

    // Broadcast scalar 'd' into a vector
    d256 = _mm256_set1_ps(d);
    // Create a zero vector to use as a source for masked loads
    zero256 = _mm256_setzero_ps();

    // Main vector loop, starting from i=1
    for (i = 1; i < size; i += VL) {
        // Determine how many elements remain to be processed in this iteration
        int remaining = size - i;
        
        // Create a mask for the remaining elements.
        // If remaining >= VL, all lanes are active (0xFF).
        // Otherwise, create a bitmask for the 'remaining' lanes.
        __mmask8 mask = (remaining >= VL) ? 0xFF : (__mmask8)((1U << remaining) - 1);

        // Load unaligned data from b, masked. Inactive lanes get zero from zero256.
        b256 = _mm256_mask_loadu_ps(zero256, mask, &b[i]);

        // Perform the division, masked. Inactive lanes in 'a256' will be zero.
        a256 = _mm256_mask_div_ps(zero256, mask, b256, d256);

        // Store the result to a, unaligned and masked. Only active lanes are written.
        _mm256_mask_storeu_ps(&a[i], mask, a256);
    }

    // --- Sequential computation of 'c' (as requested) ---
    for (i = 1; i < size; i++) {
       c[i] = a[i-1] - b[i-1];
    }
}
Show Analysis

This solution correctly vectorizes the first loop using AVX intrinsics. The key is the use of masked instructions to handle the end of the array without a separate scalar remainder loop. The mask `(1U << remaining) - 1` is a standard bit manipulation trick to generate a sequence of `remaining` ones (e.g., if `remaining` is 3, `1U << 3` is `0b1000`, subtracting 1 gives `0b0111`). This approach is generally more efficient than a separate cleanup loop as it avoids branching and keeps the pipeline full.

Problem 5: MPI Theory (10 credits)

a)* Illustrate an `MPI_Bsend` operation that uses the rendezvous protocol. Draw boxes for buffers and arrows for data movement and communication. Clearly annotate the boxes and arrows. Ignore the UMQ.
Show Hint

1. `MPI_Bsend` *always* copies data from the application buffer to a user-provided MPI buffer first.

2. The rendezvous protocol then involves a handshake: Sender sends a "Request to Send" (RTS), Receiver responds with "Clear to Send" (CTS), then the data is transferred from the MPI buffer to the receiver.

Show Solution

An `MPI_Bsend` (Buffered Send) first copies the data to a user-managed buffer. The underlying transfer from this buffer can then use a protocol like rendezvous for large messages.

Diagram (described):

  Sender (S): [App Memory] --(1. Bsend copy)--> [User MPI Buffer]
  Sender (S): [User MPI Buffer] ---- 2. Request to Send (RTS) ----> Receiver (R)
  Receiver (R) ---- 3. Clear to Send (CTS) ----> Sender (S)
  Sender (S): [User MPI Buffer] ---- 4. Data Transfer ----> Receiver (R): [Recv Buffer]
Show Analysis

The key characteristic of `MPI_Bsend` is the initial copy to the user buffer (Step 1). This allows the `MPI_Bsend` call to return immediately, decoupling the application from the network communication. The subsequent network transfer (Steps 2-4) happens in the background and can use any protocol; in this case, the rendezvous protocol is shown, which is typical for large messages to avoid overwhelming the receiver's buffers.

b)* Briefly explain the difference between blocking and nonblocking routines in the terminology used up to MPI 3.1.
Show Hint

The distinction centers on when the function call returns and whether the resources (like the send buffer) are safe to reuse immediately after the return.

Show Solution
  • Blocking routines (e.g., `MPI_Send`, `MPI_Recv`): The function call does not return until the operation is locally complete. This guarantees that the communication buffer is safe to be reused immediately after the call returns.
  • Nonblocking routines (e.g., `MPI_Isend`, `MPI_Irecv`): The function call initiates the operation and returns immediately, before the operation is necessarily complete. The application must not reuse the communication buffer until completion is verified later using a separate call like `MPI_Wait` or `MPI_Test`.
Show Analysis

This is a fundamental concept in MPI. Blocking calls are simpler to reason about but can lead to deadlocks if not ordered correctly. Nonblocking calls are more complex but enable the powerful pattern of overlapping communication with computation, where the CPU can perform useful work while data is being transferred over the network in the background.
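
A minimal sketch of that overlap pattern; the `do_independent_work()` helper is hypothetical and stands for any computation that does not touch `buf`.

Example Sketch (C++):
#include <mpi.h>

void do_independent_work();   // hypothetical: any work that does not use buf

void send_with_overlap(double *buf, int count, int dest, MPI_Comm comm) {
    MPI_Request request;

    // Nonblocking: returns immediately; buf must not be modified yet
    MPI_Isend(buf, count, MPI_DOUBLE, dest, 0, comm, &request);

    // Overlap: compute while the transfer proceeds in the background
    do_independent_work();

    // Completion: only after MPI_Wait returns is buf safe to reuse
    MPI_Wait(&request, MPI_STATUS_IGNORE);
}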

c)* Write code using the template to emulate an efficient `MPI_Alltoall` using other non-alltoall collective MPI library functions.
Code Template (C):
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(recvtype, &type_size);



    return MPI_SUCCESS;
}
Show Hint

An `MPI_Alltoall` can be implemented by having each process `i` perform an `MPI_Scatter` to all other processes from its chunk of the send buffer, and an `MPI_Gather` from all other processes into its chunk of the receive buffer. However, a more direct emulation is to use `MPI_Alltoallv`, which is a more general collective. You can set up the count and displacement arrays for `MPI_Alltoallv` to mimic the regular pattern of `MPI_Alltoall`.

Show Solution

The most direct way to emulate `MPI_Alltoall` is by using its more general vectorized version, `MPI_Alltoallv`, and setting up the parameters to describe a regular, non-vectorized communication pattern.

Emulation using MPI_Alltoallv (C):
#include <mpi.h>
#include <stdlib.h>

int My_MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                    void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
{
    int comm_size;
    MPI_Comm_size(comm, &comm_size);

    // Allocate arrays required for MPI_Alltoallv
    int *sendcounts = (int *)malloc(comm_size * sizeof(int));
    int *recvcounts = (int *)malloc(comm_size * sizeof(int));
    int *sdispls = (int *)malloc(comm_size * sizeof(int));
    int *rdispls = (int *)malloc(comm_size * sizeof(int));

    if (!sendcounts || !recvcounts || !sdispls || !rdispls) {
        return MPI_ERR_NO_MEM;
    }

    // Initialize the arrays for the regular Alltoall pattern
    for (int i = 0; i < comm_size; i++) {
        sendcounts[i] = sendcount;
        recvcounts[i] = recvcount;
        // Displacements are in terms of number of elements, not bytes
        sdispls[i] = i * sendcount;
        rdispls[i] = i * recvcount;
    }

    // Call the collective function MPI_Alltoallv
    int result = MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype,
                               recvbuf, recvcounts, rdispls, recvtype, comm);

    // Cleanup
    free(sendcounts);
    free(recvcounts);
    free(sdispls);
    free(rdispls);

    return result;
}
Show Analysis

While one could implement `MPI_Alltoall` with a series of point-to-point `MPI_Send` and `MPI_Recv` calls in a loop, this is inefficient and not a collective operation. Using `MPI_Alltoallv` is the correct approach as it is a single collective call that allows for a more optimized implementation by the MPI library. This solution correctly sets up the count and displacement arrays to specify that each process sends and receives a fixed-size chunk to/from every other process, which is the definition of `MPI_Alltoall`.

Problem 6: Hybrid Programming (23 credits)

In the exercises, we covered how to write a hybrid OpenMP and MPI program. We are now faced with harmonizing two very large arrays of vectors. We achieve this by normalizing every vector in array `b` and scaling it with the average length of the vectors in array `a`. The sequential version of the `harmonize` operation is shown below. The result of the harmonization is the final state of array `b`.

Sequential Code (C++):
struct vector {
    float x;
    float y;
};

float length(vector& vec) {
    return sqrt(vec.x*vec.x + vec.y*vec.y);
}

void normalize(vector& vec) {
    float len = length(vec);
    vec.x /= len;
    vec.y /= len;
}

void harmonize_sequential(vector *a, vector *b, int size) {
    float avg_a = 0;
    for (int i = 0; i < size; i++){
        avg_a += length(a[i]);
    }
    avg_a /= size;
    
    for (int i = 0; i < size; i++){
        normalize(b[i]);
        b[i].x *= avg_a;
        b[i].y *= avg_a;
    }
}
a)* In the lecture we learned how to create custom MPI datatypes. For the struct used in our hybrid program we want to create the custom contiguous type `MPI_VECTOR`. Finish the method below so it returns a usable custom MPI type. Below you find an excerpt from the MPI 4.0 standard that might be helpful:

MPI_TYPE_CONTIGUOUS(count, oldtype, newtype)

IN       count       replication count (non-negative integer)
IN       oldtype    old datatype (handle)
OUT    newtype   new datatype (handle)

C binding
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)

Code Template (C++):
void create_mpi_vector(MPI_Datatype *type) {

}
Show Hint

The `vector` struct contains two `float` members. You need to create a new MPI type that represents two contiguous `MPI_FLOAT`s. After creating the type, you must `MPI_Type_commit` it before it can be used.

Show Solution
Completed Solution (C++):
#include <mpi.h>

void create_mpi_vector(MPI_Datatype *type) {
    // Create the datatype representing 2 contiguous MPI_FLOATs
    MPI_Type_contiguous(2, MPI_FLOAT, type);
    
    // Commit the datatype so it can be used in communication
    MPI_Type_commit(type);
}
Show Analysis

Creating custom MPI datatypes is essential for sending non-primitive data structures like structs or classes. `MPI_Type_contiguous` is the simplest constructor, used when the elements of the struct are of the same type and stored contiguously in memory. Committing the type with `MPI_Type_commit` registers it with the MPI library, making it a valid handle for communication calls.

b) Write code to efficiently parallelize the `harmonize()` function using MPI and OpenMP. The vectors are provided at root. Use MPI non-blocking collectives to communicate between ranks wherever you can utilize them to overlap communication and calculation. You must wait for the result as late as possible to provide optimal overlap of communication and calculation. Use OpenMP to further speed up your code. Your code should produce output in array `b` identical to running the `harmonize_sequential()` function at root.

To solve tasks b) and c), we have a cluster of compute nodes. You can assume the following conditions:

  • Each compute node has identical hardware
  • Each node has 2 sockets with NUMA between the sockets
  • Each socket has 6 cores with 3-way simultaneous multithreading
  • mpirun starts 2 processes per node
  • MPI was correctly initialized with MPI_THREAD_MULTIPLE
  • The number of processes evenly divides the size of the vector arrays
  • Assume no padding is added to the Vector struct by the compiler
  • You should use the create_mpi_vector() function from subproblem a) in your code
  • You may ignore omp loop schedules, NUMA and thread affinity for question b)
Code Template (C++):
void harmonize(vector *a, vector *b, int size, int rank, int comm_size,
               int root, MPI_Comm comm) {
    // TODO: prepare datatype



    int local_size = size / comm_size;
    vector *local_a = (vector *) malloc(local_size * sizeof(vector));
    vector *local_b = (vector *) malloc(local_size * sizeof(vector));




    float avg_a = 0;

    for (int i = 0; i < local_size; i++){
        avg_a += length(local_a[i]);
    }





    for (int i = 0; i < local_size; i++){
        normalize(local_b[i]);
    }





    for (int i = 0; i < local_size; i++){
        local_b[i].x *= avg_a; local_b[i].y *= avg_a;
    }



}
Show Hint

1. **Data Distribution:** Scatter arrays `a` and `b` from the root to all processes.

2. **Local Sum (A):** Each process calculates the sum of lengths for its local chunk of `a`. Use an OpenMP `parallel for reduction` for this.

3. **Overlap:** The key is to overlap the global sum calculation with the normalization of `b`.

  • Initiate a non-blocking `MPI_Iallreduce` on the local sums.
  • While the reduction is in progress, normalize the local `b` array. This can also be parallelized with an OpenMP `parallel for`.
  • Call `MPI_Wait` *after* the normalization of `b` is complete.

4. **Final Steps:** Calculate the global average, scale the normalized local `b` vectors, and gather the results back to the root.

Show Solution
Hybrid Solution (C++ with MPI + OpenMP):
#include <omp.h>
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

// Assume struct vector, length(), normalize(), create_mpi_vector() are defined.

void harmonize(vector *a, vector *b, int size, int rank, int comm_size,
               int root, MPI_Comm comm) {
    // 1. Prepare custom MPI datatype
    MPI_Datatype MPI_VECTOR;
    create_mpi_vector(&MPI_VECTOR);

    int local_size = size / comm_size;
    vector *local_a = (vector *) malloc(local_size * sizeof(vector));
    vector *local_b = (vector *) malloc(local_size * sizeof(vector));

    // 2. Distribute data from root to all ranks
    MPI_Scatter(a, local_size, MPI_VECTOR, local_a, local_size, MPI_VECTOR, root, comm);
    MPI_Scatter(b, local_size, MPI_VECTOR, local_b, local_size, MPI_VECTOR, root, comm);

    // 3. Calculate local sum of lengths of A using OpenMP for parallel reduction
    float local_sum_a = 0.0f;
    #pragma omp parallel for reduction(+:local_sum_a)
    for (int i = 0; i < local_size; i++){
        local_sum_a += length(local_a[i]);
    }

    // 4. Initiate NON-BLOCKING global reduction to get the global sum
    float global_sum_a;
    MPI_Request request;
    MPI_Iallreduce(&local_sum_a, &global_sum_a, 1, MPI_FLOAT, MPI_SUM, comm, &request);

    // 5. OVERLAP: While communication happens, normalize local B using OpenMP
    #pragma omp parallel for
    for (int i = 0; i < local_size; i++){
        normalize(local_b[i]);
    }

    // 6. WAIT for communication to complete (as late as possible)
    MPI_Wait(&request, MPI_STATUS_IGNORE);

    // 7. Now that we have the global sum, calculate the average
    float avg_a = global_sum_a / size;

    // 8. Scale the normalized local B vectors by the global average, using OpenMP
    #pragma omp parallel for
    for (int i = 0; i < local_size; i++){
        local_b[i].x *= avg_a;
        local_b[i].y *= avg_a;
    }

    // 9. Gather the final results for B back to the root process
    MPI_Gather(local_b, local_size, MPI_VECTOR, b, local_size, MPI_VECTOR, root, comm);

    // 10. Cleanup
    free(local_a);
    free(local_b);
    MPI_Type_free(&MPI_VECTOR);
}
Show Analysis

This solution demonstrates a classic hybrid programming pattern. MPI handles the coarse-grained data distribution and global reduction across nodes. OpenMP handles the fine-grained, loop-level parallelism within each node. The most critical part for performance is the overlap achieved in steps 4, 5, and 6. By using a non-blocking `MPI_Iallreduce`, we allow the CPU to perform the independent `normalize` computation while the MPI library handles the global sum communication in the background, effectively hiding communication latency.

c)* In testing, you realize your program is running much slower than expected. To debug, you look at the following MPI binding report of node 0 and the OpenMP environment output of one process. Briefly explain all issues you see and show how to fix them. MPI binding report:
rank0: [BBB/BBB/BBB/BBB/BBB/BBB] [.../.../.../.../.../...]
rank1: [.../.../.../.../.../...] [BBB/BBB/BBB/BBB/BBB/BBB]
OPENMP DISPLAY ENVIRONMENT BEGIN:
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '1'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'MASTER'
OMP_PLACES = '{0,1,2,3,4,5,12,13,14,15,16,17,24,25,26,27,28,29}'
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OPENMP DISPLAY ENVIRONMENT END
Show Hint

1. **MPI Binding:** The report shows two sockets (`[...]` `[...]`). Where is `rank0` bound? Where is `rank1` bound? Is this optimal?

2. **OpenMP Threads:** Check `OMP_NUM_THREADS`. How many threads are being used per MPI process? How many cores are available per socket (6 physical, 18 with SMT)?

3. **OpenMP Binding:** Check `OMP_PROC_BIND`. Are the (few) threads being spread out or are they all stuck in one place?

4. **Thread Limit:** Is there any other variable limiting the number of threads?

Show Solution

Issues Identified:

  1. MPI Binding is Correct: The MPI binding report is actually good. It shows `rank0` is bound to the first socket and `rank1` is bound to the second socket. This is the ideal placement for a hybrid code with 2 processes per node on a 2-socket machine.
  2. Severe OpenMP Undersubscription: The main issue is `OMP_NUM_THREADS = '1'`. Each MPI process is only using a single thread, leaving the other 5 physical cores (and 17 hardware threads) on its socket completely idle.
  3. Inefficient OpenMP Thread Binding: `OMP_PROC_BIND = 'MASTER'` forces all OpenMP threads to run on the same processing unit as the master thread. Even if `OMP_NUM_THREADS` were increased, this would cause massive contention instead of utilizing the available cores.
  4. Restrictive Thread Limit: `OMP_THREAD_LIMIT = '4'` imposes an additional, low limit on the total number of threads that can be created, which could conflict with setting a higher `OMP_NUM_THREADS`.

How to Fix (by setting environment variables):

The goal is to make each MPI process use the 6 physical cores available on its socket. Using SMT (hyper-threading) is often not beneficial for HPC, so we target physical cores.

# Set the number of threads to match the physical cores per socket
export OMP_NUM_THREADS=6

# Define the places for threads to run on as physical cores
export OMP_PLACES=cores

# Bind threads closely to the master thread's place (within the socket)
export OMP_PROC_BIND=close

# It's good practice to remove the artificial limit
unset OMP_THREAD_LIMIT
Show Analysis

This is a classic hybrid performance tuning problem. While the MPI process placement was correct (one per NUMA domain), the OpenMP configuration was preventing any intra-node parallelism. The fix involves correcting the OpenMP environment variables to fully utilize the resources of the socket that each MPI process is bound to. Setting `OMP_PLACES=cores` and `OMP_PROC_BIND=close` ensures that the 6 OpenMP threads are pinned to the 6 physical cores of the socket, maximizing performance and memory locality.

Problem 7: Advanced MPI (9 credits)

Complete the following MPI program such that a message is encrypted before sending it to the receiver in parallel using MPI partitioned communication. The receiver waits until it has received all partitions and only then decrypts the whole message at once. You can assume that 'len' is a multiple of the number of OpenMP threads and that MPI has been initialized correctly.

Hint: You may find the following definitions helpful:

int MPI_Psend_init(const void *buf, int partitions, MPI_Count count,
                   MPI_Datatype datatype, int dest, int tag, MPI_Comm comm,
                   MPI_Info info, MPI_Request *request);

int MPI_Precv_init(void *buf, int partitions, MPI_Count count,
                   MPI_Datatype datatype, int source, int tag, MPI_Comm comm,
                   MPI_Info info, MPI_Request *request);

int MPI_Pready(int partition, MPI_Request request);

int MPI_Parrived(MPI_Request request, int partition, int *flag);

int MPI_Start(MPI_Request *request);

int MPI_Wait(MPI_Request *request, MPI_Status *status);

If the info parameter is not needed, MPI_INFO_NULL can be passed.

Code Template (C):
void encrypt(char *buf, int len, const char *key);
void decrypt(char *buf, int len, const char *key);

void transfer_encrypted(char *message, int len, const char* key,
                       MPI_Comm comm, int sender, int receiver, int num_partitions)
{
    MPI_Request request;
    int rank;
    int part_size = len / num_partitions;

    MPI_Comm_rank(comm, &rank);

    if (rank == sender) {





        omp_set_num_threads(num_partitions);
        #pragma omp parallel shared(message, key, part_size, request)
        {
            int i = omp_get_thread_num();
            encrypt(&message[i * part_size], part_size, key);



        }



        MPI_Request_free(&request);
    } else if (rank == receiver) {









        MPI_Request_free(&request);
        decrypt(message, len, key);
    }
}
Show Hint

Sender Side:

1. Initialize the partitioned send with `MPI_Psend_init`.

2. Start the communication with `MPI_Start`.

3. Inside the OpenMP parallel region, after each thread encrypts its partition, it must call `MPI_Pready` to signal that its partition is ready for sending.

4. After the parallel region, the main thread must wait for the entire send to complete with `MPI_Wait`.

Receiver Side:

1. Initialize the partitioned receive with `MPI_Precv_init`.

2. Start the communication with `MPI_Start`.

3. Wait for *all* partitions to be received with a single `MPI_Wait` call.

4. After waiting, the entire message is in the buffer and can be decrypted.

Show Solution
Completed Solution (C with MPI + OpenMP):
#include <mpi.h>
#include <omp.h>

void encrypt(char *buf, int len, const char *key);
void decrypt(char *buf, int len, const char *key);

void transfer_encrypted(char *message, int len, const char* key,
                       MPI_Comm comm, int sender, int receiver, int num_partitions)
{
    MPI_Request request;
    int rank;
    int part_size = len / num_partitions;
    int tag = 0; // Define a tag for the communication

    MPI_Comm_rank(comm, &rank);

    if (rank == sender) {
        // 1. Initialize the partitioned send operation.
        // The 'count' is the size of each partition.
        MPI_Psend_init(message, num_partitions, part_size, MPI_CHAR, receiver, tag, comm, MPI_INFO_NULL, &request);

        // 2. Start the communication request.
        MPI_Start(&request);

        // 3. Encrypt partitions in parallel and mark them as ready.
        omp_set_num_threads(num_partitions);
        #pragma omp parallel shared(message, key, part_size, request)
        {
            int i = omp_get_thread_num();
            // a. Encrypt the thread's assigned partition
            encrypt(&message[i * part_size], part_size, key);
            
            // b. Signal to MPI that this partition is ready to be sent
            MPI_Pready(i, request);
        }

        // 4. Wait for the entire partitioned send to complete.
        MPI_Wait(&request, MPI_STATUS_IGNORE);

        MPI_Request_free(&request);
    } else if (rank == receiver) {
        // 1. Initialize the partitioned receive operation.
        MPI_Precv_init(message, num_partitions, part_size, MPI_CHAR, sender, tag, comm, MPI_INFO_NULL, &request);

        // 2. Start the communication request.
        MPI_Start(&request);

        // 3. Wait for ALL partitions to be received. MPI_Wait will only complete
        //    when the entire message has arrived.
        MPI_Wait(&request, MPI_STATUS_IGNORE);

        // 4. Free the request and then decrypt the fully received message.
        MPI_Request_free(&request);
        decrypt(message, len, key);
    }
}
Show Analysis

Partitioned communication is an advanced MPI feature designed for scenarios where pieces of a large message become ready at different times. This example shows a perfect use case: multiple threads generate different parts of the message buffer in parallel. As soon as a thread finishes its `encrypt` task, it calls `MPI_Pready`, allowing the MPI library to start transferring that partition while other threads are still working. This creates a pipeline, overlapping computation (encryption) with communication, which can significantly improve performance for large messages.