Parallel Programming (IN2147) - Overview
Interactive Study Guide - Summer Exam 2024 Retake
| Exam: | IN2147 / Retake | Date: | Monday 30th September, 2024 |
|---|---|---|---|
| Examiner: | Prof. Dr. Martin Schulz | Time: | 08:30 – 10:30 |
Working Instructions
- This exam consists of 28 pages with a total of 7 problems. Please make sure now that you received a complete copy of the exam.
- The total amount of achievable credits in this exam is 93 credits.
- Detaching pages from the exam is prohibited.
- Allowed resources: one analog dictionary English ↔ native language
- Subproblems marked by * can be solved without results of previous subproblems.
- Answers are only accepted if the solution approach is documented. Give a reason for each answer unless explicitly stated otherwise in the respective subproblem. When asked to name a concept providing the correct name is sufficient.
- Do not write with red or green colors nor use pencils.
- Physically turn off all electronic devices, put them into your bag and close the bag.
Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.
Problem 1: Amdahl's Law (8 credits)
You have written a multi-threaded program and want it to run as fast as possible. You first run it sequentially and achieve a runtime of 60s. Next, you run it on all 6 cores of your laptop and achieve a runtime of 15s. Assume the parallel regions have ideal speedup and show your calculations.
Show Hint
Use Amdahl's Law for parallel runtime: \( T(p) = (1-f)T_s + \frac{fT_s}{p} \). You are given \(T(6)=15s\), \(T_s=60s\), and \(p=6\). Solve for \(f\).
Show Solution
We use the formula for parallel runtime based on Amdahl's Law: \( T(p) = T_s \cdot ((1-f) + \frac{f}{p}) \).
Substitute the given values:
\( 15 = 60 \cdot ((1-f) + \frac{f}{6}) \)
\( \frac{15}{60} = 1 - f + \frac{f}{6} \)
\( 0.25 = 1 - \frac{5f}{6} \)
\( \frac{5f}{6} = 1 - 0.25 = 0.75 \)
\( f = \frac{0.75 \cdot 6}{5} = \frac{4.5}{5} = 0.9 \)
90% of the execution time is parallelized.
Show Analysis
This calculation directly applies Amdahl's Law to determine the parallelizable fraction of a program based on empirical measurements. A high parallel fraction (90%) is good, but the remaining 10% of serial code will ultimately limit the maximum achievable speedup.
Show Hint
Use the formula for speedup: \( SU(p) = \frac{1}{(1-f) + \frac{f}{p}} \). You want \(SU(p)=5\) and know \(f=0.9\) from the previous part. Solve for \(p\).
Show Solution
We use the formula for speedup: \( SU(p) = \frac{1}{(1-f) + \frac{f}{p}} \).
We want a speedup of 5, and we know \(f=0.9\):
\( 5 = \frac{1}{(1-0.9) + \frac{0.9}{p}} \)
\( 5 = \frac{1}{0.1 + \frac{0.9}{p}} \)
Take the reciprocal of both sides:
\( 0.1 + \frac{0.9}{p} = \frac{1}{5} = 0.2 \)
\( \frac{0.9}{p} = 0.2 - 0.1 = 0.1 \)
\( p = \frac{0.9}{0.1} = 9 \)
9 parallel processing units are needed.
Show Analysis
This demonstrates how the serial portion of the code (10%) limits scalability. Even though we are adding more cores, the speedup does not increase linearly. To achieve a speedup of 5, we need more than 5 cores because the serial part of the program becomes a bottleneck.
Show Hint
This is a speedup greater than the number of processors would suggest is possible according to Amdahl's law. What hardware component's behavior changes significantly when the problem is distributed across more cores?
Show Solution
This phenomenon is called Super-linear speedup.
The likely reason is related to cache effects. With multiple cores, the total available cache memory increases. If the program's working set fits better into the aggregated cache of the parallel system compared to the sequential execution (which might have suffered from frequent cache misses), memory access latency significantly decreases, leading to performance improvements beyond simple scaling.
Show Analysis
Super-linear speedup is a counter-intuitive but real effect where parallelization provides benefits beyond just computation distribution. By distributing the data, each core works on a smaller piece that may now fit entirely within its local cache, drastically reducing memory access times and boosting performance beyond theoretical limits that assume constant memory access speed.
Problem 2: Threading (14 credits)
The ParProg team implemented their custom mutex and condition variable implementations with the class structure given in the code template on the next page. The functions implement the commonly known functionality as used throughout the lecture. Now we want to implement our own thread safe queue with our custom synchronization primitives, but need your help getting the synchronization right.
class condition_variable {
void notify_one();
void wait();
};
class mutex {
void lock();
void unlock();
};
template <typename E> class queue {
public:
void push(E element) {
queue.push_back(element);
cv.notify_one();
}
E pop() {
while (queue.empty()) {
cv.wait();
}
E element = queue.front();
queue.pop_front();
return element;
}
private:
std::deque<E> queue;
mutex mutex;
condition_variable cv;
};
Show Hint
1. Any access to the shared `queue` must be protected by a lock.
2. A `wait()` call on a condition variable must be done while holding the lock.
3. For efficiency in `push`, should you unlock before or after `notify_one()`?
Show Solution
class condition_variable {
void notify_one();
void wait();
};
class mutex {
void lock();
void unlock();
};
template <typename E> class queue {
public:
void push(E element) {
// L1: Acquire lock before modifying shared data
mutex.lock();
queue.push_back(element);
// L2: Release lock after modification.
// Done before notify for efficiency (allows woken thread to acquire lock sooner).
mutex.unlock();
cv.notify_one();
}
E pop() {
// L3: Acquire lock before checking condition or waiting
mutex.lock();
// Wait while the queue is empty (must hold the lock)
while (queue.empty()) {
// Assumes wait() correctly handles atomic release/reacquire of the associated mutex.
cv.wait();
}
// Lock is held, queue is not empty.
E element = queue.front();
queue.pop_front();
// L4: Release the lock
mutex.unlock();
return element;
}
private:
std::deque<E> queue;
mutex mutex;
condition_variable cv;
};
Show Analysis
This is a classic producer-consumer queue implementation. The mutex protects the shared data (`queue`), and the condition variable prevents the consumer (`pop`) from busy-waiting. The `while` loop in `pop` is crucial to handle spurious wakeups, ensuring the condition (`!queue.empty()`) is re-checked after waking up. Unlocking before `notify_one` in `push` is a performance optimization to avoid the "hurry up and wait" problem, where the notified thread wakes up only to find the lock is still held.
Show Hint
For each lock/unlock, explain its role in preventing data races or ensuring the correct use of the condition variable.
Show Solution
- L1 (`mutex.lock()` in `push`): Required to ensure exclusive access to the shared `queue` before modification (`push_back`), preventing data races.
- L2 (`mutex.unlock()` in `push`): Required to release the lock after the critical section is complete.
- L3 (`mutex.lock()` in `pop`): Required before checking the shared condition (`queue.empty()`) and is a prerequisite for correctly calling `cv.wait()`.
- L4 (`mutex.unlock()` in `pop`): Required to release the lock after the element has been successfully removed from the queue.
Show Analysis
The placement of locks and unlocks defines the "critical sections" of the code. The goal is to make these sections as small as possible while still protecting all shared state. The interaction with condition variables is particularly strict: the lock must be held when calling `wait()` so the variable can atomically release it and go to sleep.
Show Hint
What happens if an operation like `queue.push_back()` throws an exception (e.g., out of memory)? Will the `unlock()` statement be executed? What is the C++ pattern for managing resources like locks in the presence of exceptions?
Show Solution
The code lacks Exception Safety. If an operation within the critical section (e.g., `queue.push_back()` potentially throws `std::bad_alloc`) throws an exception, the corresponding `mutex.unlock()` call is skipped. This leaves the mutex permanently locked, causing a deadlock for any other thread that tries to acquire the lock.
This is avoided in the C++ standard library by using RAII (Resource Acquisition Is Initialization) wrappers, such as `std::lock_guard<std::mutex>` or `std::unique_lock<std::mutex>`. These wrappers guarantee that the mutex is automatically unlocked when the scope is exited, regardless of whether it's a normal return or an exception was thrown.
Show Analysis
RAII is a cornerstone of modern C++ for writing robust code. By tying resource management (like locks, memory, or file handles) to the lifetime of an object on the stack, it eliminates a whole class of resource leak bugs. For mutexes, `std::lock_guard` is the simplest RAII wrapper, while `std::unique_lock` is more flexible and is required when working with condition variables because it allows for manual unlocking and re-locking.
Problem 3: OpenMP (12 credits)
Consider the following program:
Code Snippet (C++):#include "omp.h"
#include <stdio.h>
int main() {
int a = 1, b = 2, c = 3;
#pragma omp parallel num_threads(8)
#pragma omp for schedule(static) firstprivate(b) lastprivate(c)
for (int i = 0; i < 8; i++){
#pragma omp critical
{
a += omp_get_thread_num();
b += omp_get_thread_num();
c += omp_get_thread_num();
}
if (i == 4) {
printf("a: %d\n", a);
printf("b: %d\n", b);
printf("c: %d\n", c);
}
}
printf("a: %d\n", a);
printf("b: %d\n", b);
printf("c: %d\n", c);
return 0;
}
Show Hint
The `if (i == 4)` block is executed by the thread assigned to iteration 4. With `schedule(static)` and 8 threads/8 iterations, thread 4 executes iteration 4.
- `a` is shared. Its value depends on which threads have entered the critical section before thread 4.
- `b` is `firstprivate`. Each thread gets a private copy initialized to the original value.
- `c` is `lastprivate`. Each thread gets a private, *uninitialized* copy.
Show Solution
- Line 19 (a): Non-deterministic. `a` is a shared variable, and its value at this point depends on the non-deterministic order in which threads have executed the critical section before thread 4.
- Line 21 (b): 6. Thread 4 gets a private copy of `b` initialized to 2 (`firstprivate`). Inside the critical section, it executes `b += 4`, so its private `b` becomes 6.
- Line 23 (c): Non-deterministic. `c` is `lastprivate` but not `firstprivate`, so each thread's private copy is uninitialized. The output is the result of adding 4 to an indeterminate value.
Show Analysis
This problem tests understanding of OpenMP data sharing clauses. `shared` variables are visible to all threads. `firstprivate` gives each thread a private copy initialized from the master thread's value. `lastprivate` copies the value from the thread that executed the logically last iteration back to the master thread's variable after the loop. The key trap here is that `lastprivate` does not imply `firstprivate`, leading to uninitialized private variables.
Show Hint
These `printf` statements execute after the parallel region has finished.
- `a`: All threads (0 through 7) will have executed the critical section exactly once. Calculate the final sum.
- `b`: `firstprivate` only creates private copies. The original `b` is not modified.
- `c`: `lastprivate` copies the value from the thread that executed the last iteration (i=7) back to the original `c`. What was the state of that thread's private `c`?
Show Solution
- Line 27 (a): 29. The `critical` section ensures that the updates to the shared `a` are atomic. After the loop, `a` will be its initial value (1) plus the sum of all thread numbers from 0 to 7. Sum = (7*8)/2 = 28. Final value = 1 + 28 = 29.
- Line 29 (b): 2. The original variable `b` is not affected by modifications to the `firstprivate` copies within the parallel region.
- Line 31 (c): Non-deterministic. The `lastprivate` clause copies the value from the private `c` of the thread that executed the last iteration (thread 7). Since that private copy was uninitialized, the final value of `c` is indeterminate.
Show Analysis
The final state of the variables demonstrates the one-way nature of `firstprivate` (master -> thread) and `lastprivate` (last thread -> master). The atomicity provided by `critical` is essential for the correct final value of `a`. The non-determinism of `c` is a common bug when `lastprivate` is used without an initialization mechanism like `firstprivate`.
Show Hint
One schedule divides the work up-front, the other divides it as threads become free. Which one is better for balanced workloads? Which one is better for unbalanced workloads?
Show Solution
- Static Scheduling: Iterations are divided into chunks and assigned to threads in a fixed (static) manner before the loop begins.
- Use Case (UMA): Preferred for loops where each iteration takes roughly the same amount of time (a balanced workload), as it has the lowest scheduling overhead.
- Dynamic Scheduling: Iterations are assigned to threads in smaller chunks dynamically as threads finish their current work and become available.
- Use Case (UMA): Preferred for loops where iteration execution time varies significantly (an unbalanced workload), as it provides better load balancing at the cost of higher scheduling overhead.
Show Analysis
Choosing the right schedule is a trade-off between overhead and load balancing. `static` is fast and simple but can lead to idle threads if work is uneven. `dynamic` ensures all threads stay busy but incurs the cost of fetching new work from a central queue. A third option, `guided`, is a compromise that starts with large chunks and reduces the chunk size over time.
Show Hint
Consider the "First Touch" memory allocation policy on NUMA systems. If data is initialized by one thread, where is its memory page located? What happens if a `dynamic` schedule later assigns an iteration that uses this data to a thread running on a different socket?
Show Solution
On a NUMA (Non-Uniform Memory Access) system, memory access latency is lower for local memory than for remote memory (on another socket). A `dynamic` schedule assigns iterations to threads as they become available, without regard for data locality. This can frequently result in a thread being assigned an iteration that operates on data located in a remote NUMA node, forcing it to perform slow, high-latency remote memory accesses, which causes significant performance overhead.
Show Analysis
This is a key challenge in NUMA programming. The performance gain from load balancing with a `dynamic` schedule can be completely negated by the penalty of remote memory access. For NUMA systems, a `static` schedule is often preferred, combined with careful data initialization (e.g., parallel initialization) to ensure that threads primarily work on data that is local to their socket.
Problem 4: SIMD (17 credits)
You have a vector processor that can execute 256-bit vector instructions.
Show Hint
Calculate the Vector Length (VL), which is the number of data elements that can fit into a single vector register. Speedup = VL.
Show Solution
Vector width = 256 bits.
Data element size = 16 bits.
Vector Length (VL) = Vector Width / Element Size = 256 / 16 = 16.
The expected speedup is 16x.
Show Analysis
SIMD (Single Instruction, Multiple Data) achieves performance gains by performing the same operation on multiple data elements simultaneously. The theoretical speedup is equal to the number of elements that can be processed in parallel, which is the vector length.
Now consider the following sequential code:
Code Snippet (C):void operation(float *a, float *b, float *c, float d, int size)
{
for (int i = 1; i < size; i++) {
a[i] = b[i] / d;
c[i] = a[i-1] - b[i-1];
}
}
b) What problem prevents vectorization of this loop? Name and briefly explain the problem.
Show Hint
Look at the data dependencies between loop iterations. Does an iteration depend on a value calculated in a previous iteration? The calculation of `c[i]` uses `a[i-1]`.
Show Solution
The problem is a Loop-Carried Flow Dependence.
Explanation: The statement `c[i] = a[i-1] - b[i-1];` in iteration `i` reads the value of `a[i-1]`. This value was written by the statement `a[i] = b[i] / d;` in the *previous* iteration (`i-1`). This dependency across iterations prevents the loop from being vectorized directly, as vectorization requires iterations to be independent.
Show Analysis
This is a classic example of a dependency that hinders vectorization. The compiler cannot safely execute multiple iterations at once because the result of one iteration is needed as an input for the next. To resolve this, the dependency must be broken, often by restructuring the loop.
Show Hint
Split the original loop into two separate loops. The first loop will calculate all values of `a`, and the second loop will calculate all values of `c`. This resolves the dependency.
Show Solution
We use Loop Distribution (also known as Loop Fission) to separate the dependent statements into two loops.
Transformed Code (C):void operation_distributed(float *a, float *b, float *c, float d, int size)
{
// Loop 1: Calculates 'a'. This loop has no cross-iteration dependencies.
for (int i = 1; i < size; i++) {
a[i] = b[i] / d;
}
// Loop 2: Calculates 'c'. This loop is also vectorizable as all 'a' values are now computed.
for (int i = 1; i < size; i++) {
c[i] = a[i-1] - b[i-1];
}
}
Show Analysis
Loop distribution is a powerful transformation technique. By splitting the loop, we ensure that all writes to array `a` are completed before any reads of `a` occur in the second loop. This completely removes the loop-carried dependency, allowing a vectorizing compiler to process each of the new loops independently using SIMD instructions.
Here is a list of definitions you might find useful:
| Intrinsics | Intrinsics | Intrinsics |
|---|---|---|
__m256 | _mm256_set1_ps | __m256i |
__m256f | _mm256_set1_pf | __m256d |
__m256d | _mm256_set1_pd | _mm256_store_ps |
_mm256_load_ps | _mm256_sub_ps | _mm256_storeu_ps |
_mm256_loadu_ps | _mm256_sub_pf | _mm256_store_pf |
_mm256_load_pf | _mm256_sub_pd | _mm256_storeu_pf |
_mm256_loadu_pf | _mm256_mul_ps | _mm256_store_pd |
_mm256_load_pd | _mm256_mul_pf | _mm256_storeu_pd |
_mm256_loadu_pd | _mm256_mul_pd | _mm256_mask_store_ps |
_mm256_mask_load_ps | _mm256_div_ps | _mm256_mask_storeu_ps |
_mm256_mask_loadu_ps | _mm256_div_pf | _mm256_mask_store_pf |
_mm256_mask_load_pf | _mm256_div_pd | _mm256_mask_storeu_pf |
_mm256_mask_loadu_pf | _mm256_mask_div_ps | _mm256_mask_store_pd |
_mm256_mask_load_pd | _mm256_mask_div_pf | _mm256_mask_storeu_pd |
_mm256_mask_loadu_pd | _mm256_mask_div_pd |
The masked instructions have the following parameters:
_mm256_mask_div_*(src, mask, a, b)_mm256_mask_load_*(src, mask, addr, mask)_mm256_mask_loadu_*(src, mask, addr, mask)_mm256_mask_store_*(addr, mask, a)_mm256_mask_storeu_*(addr, mask, a)
// Computes the minimum of two integers
int minint(int x, int y) {
if (x <= y) {
return x;
} else {
return y;
}
}
Code Template (C):
void operation (float *a, float *b, float *c, float d, int size) {
int i, ub = size - (size % 4);
__m_____ a256, b256, d256, zero256;
d256 = ______;
for (i = 1; i < ub; i += 8) {
b256 = ______;
a256 = ______;
______;
}
__mmask8 mask = ______;
zero256 = ______;
b256m = ______;
a256m = ______;
______;
}
Show Hint
1. The vector length (VL) for 256-bit registers and 32-bit floats is 8.
2. Broadcast the scalar `d` into a vector register `d256` using `_mm256_set1_ps`.
3. The main loop should iterate from `i=1` to `size` in steps of `VL` (8).
4. Inside the loop, you need to generate a mask that handles the final partial vector. For example, if `size-i` is 3, the mask should be `0b00000111`.
5. Use `_mm256_mask_loadu_ps` to load data for `b`, `_mm256_mask_div_ps` for the calculation, and `_mm256_mask_storeu_ps` to store the result into `a`.
6. After the vectorized loop for `a`, add the sequential loop for `c`.
Show Solution
#include <immintrin.h>
// Assuming __mmask8 is defined, as implied by the masked intrinsics.
// For AVX-512, this is part of the header. For AVX2, one might need a typedef.
typedef unsigned char __mmask8;
void operation_vec(float *a, float *b, float *c, float d, int size) {
const int VL = 8; // Vector Length for float on 256-bit registers
int i;
// --- Vectorized computation of 'a' ---
__m256 a256, b256, d256, zero256;
// Broadcast scalar 'd' into a vector
d256 = _mm256_set1_ps(d);
// Create a zero vector to use as a source for masked loads
zero256 = _mm256_setzero_ps();
// Main vector loop, starting from i=1
for (i = 1; i < size; i += VL) {
// Determine how many elements remain to be processed in this iteration
int remaining = size - i;
// Create a mask for the remaining elements.
// If remaining >= VL, all lanes are active (0xFF).
// Otherwise, create a bitmask for the 'remaining' lanes.
__mmask8 mask = (remaining >= VL) ? 0xFF : (__mmask8)((1U << remaining) - 1);
// Load unaligned data from b, masked. Inactive lanes get zero from zero256.
b256 = _mm256_mask_loadu_ps(zero256, mask, &b[i]);
// Perform the division, masked. Inactive lanes in 'a256' will be zero.
a256 = _mm256_mask_div_ps(zero256, mask, b256, d256);
// Store the result to a, unaligned and masked. Only active lanes are written.
_mm256_mask_storeu_ps(&a[i], mask, a256);
}
// --- Sequential computation of 'c' (as requested) ---
for (i = 1; i < size; i++) {
c[i] = a[i-1] - b[i-1];
}
}
Show Analysis
This solution correctly vectorizes the first loop using AVX intrinsics. The key is the use of masked instructions to handle the end of the array without a separate scalar remainder loop. The mask `(1U << remaining) - 1` is a standard bit manipulation trick to generate a sequence of `remaining` ones (e.g., if `remaining` is 3, `1U << 3` is `0b1000`, subtracting 1 gives `0b0111`). This approach is generally more efficient than a separate cleanup loop as it avoids branching and keeps the pipeline full.
Problem 5: MPI Theory (10 credits)
Show Hint
1. `MPI_Bsend` *always* copies data from the application buffer to a user-provided MPI buffer first.
2. The rendezvous protocol then involves a handshake: Sender sends a "Request to Send" (RTS), Receiver responds with "Clear to Send" (CTS), then the data is transferred from the MPI buffer to the receiver.
Show Solution
An `MPI_Bsend` (Buffered Send) first copies the data to a user-managed buffer. The underlying transfer from this buffer can then use a protocol like rendezvous for large messages.
Show Analysis
The key characteristic of `MPI_Bsend` is the initial copy to the user buffer (Step 1). This allows the `MPI_Bsend` call to return immediately, decoupling the application from the network communication. The subsequent network transfer (Steps 2-4) happens in the background and can use any protocol; in this case, the rendezvous protocol is shown, which is typical for large messages to avoid overwhelming the receiver's buffers.
Show Hint
The distinction centers on when the function call returns and whether the resources (like the send buffer) are safe to reuse immediately after the return.
Show Solution
- Blocking routines (e.g., `MPI_Send`, `MPI_Recv`): The function call does not return until the operation is locally complete. This guarantees that the communication buffer is safe to be reused immediately after the call returns.
- Nonblocking routines (e.g., `MPI_Isend`, `MPI_Irecv`): The function call initiates the operation and returns immediately, before the operation is necessarily complete. The application must not reuse the communication buffer until completion is verified later using a separate call like `MPI_Wait` or `MPI_Test`.
Show Analysis
This is a fundamental concept in MPI. Blocking calls are simpler to reason about but can lead to deadlocks if not ordered correctly. Nonblocking calls are more complex but enable the powerful pattern of overlapping communication with computation, where the CPU can perform useful work while data is being transferred over the network in the background.
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
{
int type_size;
MPI_Type_size(recvtype, &type_size);
return MPI_SUCCESS;
}
Show Hint
An `MPI_Alltoall` can be implemented by having each process `i` perform a `MPI_Scatter` to all other processes from its chunk of the send buffer, and an `MPI_Gather` from all other processes into its chunk of the receive buffer. However, a more direct emulation is to use `MPI_Alltoallv`, which is a more general collective. You can set up the count and displacement arrays for `MPI_Alltoallv` to mimic the regular pattern of `MPI_Alltoall`.
Show Solution
The most direct way to emulate `MPI_Alltoall` is by using its more general vectorized version, `MPI_Alltoallv`, and setting up the parameters to describe a regular, non-vectorized communication pattern.
Emulation using MPI_Alltoallv (C):#include <mpi.h>
#include <stdlib.h>
int My_MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype,
void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
{
int comm_size;
MPI_Comm_size(comm, &comm_size);
// Allocate arrays required for MPI_Alltoallv
int *sendcounts = (int *)malloc(comm_size * sizeof(int));
int *recvcounts = (int *)malloc(comm_size * sizeof(int));
int *sdispls = (int *)malloc(comm_size * sizeof(int));
int *rdispls = (int *)malloc(comm_size * sizeof(int));
if (!sendcounts || !recvcounts || !sdispls || !rdispls) {
return MPI_ERR_NO_MEM;
}
// Initialize the arrays for the regular Alltoall pattern
for (int i = 0; i < comm_size; i++) {
sendcounts[i] = sendcount;
recvcounts[i] = recvcount;
// Displacements are in terms of number of elements, not bytes
sdispls[i] = i * sendcount;
rdispls[i] = i * recvcount;
}
// Call the collective function MPI_Alltoallv
int result = MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype,
recvbuf, recvcounts, rdispls, recvtype, comm);
// Cleanup
free(sendcounts);
free(recvcounts);
free(sdispls);
free(rdispls);
return result;
}
Show Analysis
While one could implement `MPI_Alltoall` with a series of point-to-point `MPI_Send` and `MPI_Recv` calls in a loop, this is inefficient and not a collective operation. Using `MPI_Alltoallv` is the correct approach as it is a single collective call that allows for a more optimized implementation by the MPI library. This solution correctly sets up the count and displacement arrays to specify that each process sends and receives a fixed-size chunk to/from every other process, which is the definition of `MPI_Alltoall`.
Problem 6: Hybrid Programming (23 credits)
In the exercises, we covered how to write an OpenMP and MPI hybrid program. We are now faced with harmonizing two very large arrays of vectors. We are achieving this by normalizing every vector in array `b` and scaling it with the average length of the vectors in array `a`. The sequential version of the `harmonize` operation is shown below. The result of the harmonization is the final state of array `b`.
Sequential Code (C++):struct vector {
float x;
float y;
};
float length(vector& vec) {
return sqrt(vec.x*vec.x + vec.y*vec.y);
}
void normalize(vector& vec) {
float len = length(vec);
vec.x /= len;
vec.y /= len;
}
void harmonize_sequential(vector *a, vector *b, int size) {
float avg_a = 0;
for (int i = 0; i < size; i++){
avg_a += length(a[i]);
}
avg_a /= size;
for (int i = 0; i < size; i++){
normalize(b[i]);
b[i].x *= avg_a;
b[i].y *= avg_a;
}
}
Code Template (C++):
MPI_TYPE_CONTIGUOUS(count, oldtype, newtype)IN
countreplication count (non-negative integer)
INoldtypeold datatype (handle)
OUTnewtypenew datatype (handle)C binding
int MPI_Type_contiguous(int count, MPI_Datatype oldtype, MPI_Datatype *newtype)
void create_mpi_vector(MPI_Datatype *type) {
}
Show Hint
The `vector` struct contains two `float` members. You need to create a new MPI type that represents two contiguous `MPI_FLOAT`s. After creating the type, you must `MPI_Type_commit` it before it can be used.
Show Solution
#include <mpi.h>
void create_mpi_vector(MPI_Datatype *type) {
// Create the datatype representing 2 contiguous MPI_FLOATs
MPI_Type_contiguous(2, MPI_FLOAT, type);
// Commit the datatype so it can be used in communication
MPI_Type_commit(type);
}
Show Analysis
Creating custom MPI datatypes is essential for sending non-primitive data structures like structs or classes. `MPI_Type_contiguous` is the simplest constructor, used when the elements of the struct are of the same type and stored contiguously in memory. Committing the type with `MPI_Type_commit` registers it with the MPI library, making it a valid handle for communication calls.
To solve the tasks b) and c) we have a cluster of compute nodes. You can assume the following conditions:
- Each compute node has identical hardware
- Each node has 2 sockets with NUMA between the sockets
- Each socket has 6 cores with 3-way simultaenous multithreading
mpirunstarts 2 processes per node- mpi was correctly initialized with
MPI_THREAD_MULTIPLE - The number of processes evenly divides the size of the pixel array
- Assume no padding is added to the Vector struct by the compiler
- You should use the
create_mpi_vector()function from subproblem a) in your code - You may ignore omp loop schedules, NUMA and thread affinity for question b)
void harmonize(vector *a, vector *b, int size, int rank, int comm_size,
int root, MPI_Comm comm) {
// TODO: prepare datatype
int local_size = size / comm_size;
vector *local_a = (vector *) malloc(local_size * sizeof(vector));
vector *local_b = (vector *) malloc(local_size * sizeof(vector));
float avg_a = 0;
for (int i = 0; i < local_size; i++){
avg_a += length(local_a[i]);
}
for (int i = 0; i < local_size; i++){
normalize(local_b[i]);
}
for (int i = 0; i < local_size; i++){
local_b[i].x *= avg_a; local_b[i].y *= avg_a;
}
}
Show Hint
1. **Data Distribution:** Scatter arrays `a` and `b` from the root to all processes.
2. **Local Sum (A):** Each process calculates the sum of lengths for its local chunk of `a`. Use an OpenMP `parallel for reduction` for this.
3. **Overlap:** The key is to overlap the global sum calculation with the normalization of `b`.
- Initiate a non-blocking `MPI_Iallreduce` on the local sums.
- While the reduction is in progress, normalize the local `b` array. This can also be parallelized with an OpenMP `parallel for`.
- Call `MPI_Wait` *after* the normalization of `b` is complete.
4. **Final Steps:** Calculate the global average, scale the normalized local `b` vectors, and gather the results back to the root.
Show Solution
#include <omp.h>
#include <mpi.h>
#include <stdlib.h>
#include <math.h>
// Assume struct vector, length(), normalize(), create_mpi_vector() are defined.
void harmonize(vector *a, vector *b, int size, int rank, int comm_size,
int root, MPI_Comm comm) {
// 1. Prepare custom MPI datatype
MPI_Datatype MPI_VECTOR;
create_mpi_vector(&MPI_VECTOR);
int local_size = size / comm_size;
vector *local_a = (vector *) malloc(local_size * sizeof(vector));
vector *local_b = (vector *) malloc(local_size * sizeof(vector));
// 2. Distribute data from root to all ranks
MPI_Scatter(a, local_size, MPI_VECTOR, local_a, local_size, MPI_VECTOR, root, comm);
MPI_Scatter(b, local_size, MPI_VECTOR, local_b, local_size, MPI_VECTOR, root, comm);
// 3. Calculate local sum of lengths of A using OpenMP for parallel reduction
float local_sum_a = 0.0f;
#pragma omp parallel for reduction(+:local_sum_a)
for (int i = 0; i < local_size; i++){
local_sum_a += length(local_a[i]);
}
// 4. Initiate NON-BLOCKING global reduction to get the global sum
float global_sum_a;
MPI_Request request;
MPI_Iallreduce(&local_sum_a, &global_sum_a, 1, MPI_FLOAT, MPI_SUM, comm, &request);
// 5. OVERLAP: While communication happens, normalize local B using OpenMP
#pragma omp parallel for
for (int i = 0; i < local_size; i++){
normalize(local_b[i]);
}
// 6. WAIT for communication to complete (as late as possible)
MPI_Wait(&request, MPI_STATUS_IGNORE);
// 7. Now that we have the global sum, calculate the average
float avg_a = global_sum_a / size;
// 8. Scale the normalized local B vectors by the global average, using OpenMP
#pragma omp parallel for
for (int i = 0; i < local_size; i++){
local_b[i].x *= avg_a;
local_b[i].y *= avg_a;
}
// 9. Gather the final results for B back to the root process
MPI_Gather(local_b, local_size, MPI_VECTOR, b, local_size, MPI_VECTOR, root, comm);
// 10. Cleanup
free(local_a);
free(local_b);
MPI_Type_free(&MPI_VECTOR);
}
Show Analysis
This solution demonstrates a classic hybrid programming pattern. MPI handles the coarse-grained data distribution and global reduction across nodes. OpenMP handles the fine-grained, loop-level parallelism within each node. The most critical part for performance is the overlap achieved in steps 4, 5, and 6. By using a non-blocking `MPI_Iallreduce`, we allow the CPU to perform the independent `normalize` computation while the MPI library handles the global sum communication in the background, effectively hiding communication latency.
rank0: [BBB/BBB/BBB/BBB/BBB/BBB] [.../.../.../.../.../...]
rank1: [.../.../.../.../.../...] [BBB/BBB/BBB/BBB/BBB/BBB]
OPENMP DISPLAY ENVIRONMENT BEGIN:
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '1'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'MASTER'
OMP_PLACES = '{0,1,2,3,4,5,12,13,14,15,16,17,24,25,26,27,28,29}'
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OPENMP DISPLAY ENVIRONMENT END
Show Hint
1. **MPI Binding:** The report shows two sockets (`[...]` `[...]`). Where is `rank0` bound? Where is `rank1` bound? Is this optimal?
2. **OpenMP Threads:** Check `OMP_NUM_THREADS`. How many threads are being used per MPI process? How many cores are available per socket (6 physical, 18 with SMT)?
3. **OpenMP Binding:** Check `OMP_PROC_BIND`. Are the (few) threads being spread out or are they all stuck in one place?
4. **Thread Limit:** Is there any other variable limiting the number of threads?
Show Solution
Issues Identified:
- MPI Binding is Correct: The MPI binding report is actually good. It shows `rank0` is bound to the first socket and `rank1` is bound to the second socket. This is the ideal placement for a hybrid code with 2 processes per node on a 2-socket machine.
- Severe OpenMP Undersubscription: The main issue is `OMP_NUM_THREADS = '1'`. Each MPI process is only using a single thread, leaving the other 5 physical cores (and 17 hardware threads) on its socket completely idle.
- Inefficient OpenMP Thread Binding: `OMP_PROC_BIND = 'MASTER'` forces all OpenMP threads to run on the same processing unit as the master thread. Even if `OMP_NUM_THREADS` were increased, this would cause massive contention instead of utilizing the available cores.
- Restrictive Thread Limit: `OMP_THREAD_LIMIT = '4'` imposes an additional, low limit on the total number of threads that can be created, which could conflict with setting a higher `OMP_NUM_THREADS`.
How to Fix (by setting environment variables):
The goal is to make each MPI process use the 6 physical cores available on its socket. Using SMT (hyper-threading) is often not beneficial for HPC, so we target physical cores.
# Set the number of threads to match the physical cores per socket
export OMP_NUM_THREADS=6
# Define the places for threads to run on as physical cores
export OMP_PLACES=cores
# Bind threads closely to the master thread's place (within the socket)
export OMP_PROC_BIND=close
# It's good practice to remove the artificial limit
unset OMP_THREAD_LIMIT
Show Analysis
This is a classic hybrid performance tuning problem. While the MPI process placement was correct (one per NUMA domain), the OpenMP configuration was preventing any intra-node parallelism. The fix involves correcting the OpenMP environment variables to fully utilize the resources of the socket that each MPI process is bound to. Setting `OMP_PLACES=cores` and `OMP_PROC_BIND=close` ensures that the 6 OpenMP threads are pinned to the 6 physical cores of the socket, maximizing performance and memory locality.
Problem 7: Advanced MPI (9 credits)
Complete the following MPI program such that a message is encrypted before sending it to the receiver in parallel using MPI partitioned communication. The receiver waits until it received all partitions and only then decrypts the whole message at once. You can assume that 'len' is a multiple of the number of OpenMP threads and that MPI has been initialized correctly.
Hint: You may find the following definitions helpful:
int MPI_Psend_init(const void *buf, int partitions, MPI_Count count,
MPI_Datatype datatype, int dest, int tag, MPI_Comm comm,
MPI_Info info, MPI_Request *request);
int MPI_Precv_init(void *buf, int partitions, MPI_Count count,
MPI_Datatype datatype, int source, int tag, MPI_Comm comm,
MPI_Info info, MPI_Request *request);
int MPI_Pready(int partition, MPI_Request request);
int MPI_Parrived(MPI_Request request, int partition, int *flag);
int MPI_Start(MPI_Request *request);
int MPI_Wait(MPI_Request *request, MPI_Status *status);
If the info parameter is not needed, MPI_INFO_NULL can be passed.
void encrypt(char *buf, int len, const char *key);
void decrypt(char *buf, int len, const char *key);
void transfer_encrypted(char *message, int len, const char* key,
MPI_Comm comm, int sender, int receiver, int num_partitions)
{
MPI_Request request;
int rank;
int part_size = len / num_partitions;
MPI_Comm_rank(comm, &rank);
if (rank == sender) {
omp_set_num_threads(num_partitions);
#pragma omp parallel shared(message, key, part_size, request)
{
int i = omp_get_thread_num();
encrypt(&message[i * part_size], part_size, key);
}
MPI_Request_free(&request);
} else if (rank == receiver) {
MPI_Request_free(&request);
decrypt(message, len, key);
}
}
Show Hint
Sender Side:
1. Initialize the partitioned send with `MPI_Psend_init`.
2. Start the communication with `MPI_Start`.
3. Inside the OpenMP parallel region, after each thread encrypts its partition, it must call `MPI_Pready` to signal that its partition is ready for sending.
4. After the parallel region, the main thread must wait for the entire send to complete with `MPI_Wait`.
Receiver Side:
1. Initialize the partitioned receive with `MPI_Precv_init`.
2. Start the communication with `MPI_Start`.
3. Wait for *all* partitions to be received with a single `MPI_Wait` call.
4. After waiting, the entire message is in the buffer and can be decrypted.
Show Solution
#include <mpi.h>
#include <omp.h>
void encrypt(char *buf, int len, const char *key);
void decrypt(char *buf, int len, const char *key);
void transfer_encrypted(char *message, int len, const char* key,
MPI_Comm comm, int sender, int receiver, int num_partitions)
{
MPI_Request request;
int rank;
int part_size = len / num_partitions;
int tag = 0; // Define a tag for the communication
MPI_Comm_rank(comm, &rank);
if (rank == sender) {
// 1. Initialize the partitioned send operation.
// The 'count' is the size of each partition.
MPI_Psend_init(message, num_partitions, part_size, MPI_CHAR, receiver, tag, comm, MPI_INFO_NULL, &request);
// 2. Start the communication request.
MPI_Start(&request);
// 3. Encrypt partitions in parallel and mark them as ready.
omp_set_num_threads(num_partitions);
#pragma omp parallel shared(message, key, part_size, request)
{
int i = omp_get_thread_num();
// a. Encrypt the thread's assigned partition
encrypt(&message[i * part_size], part_size, key);
// b. Signal to MPI that this partition is ready to be sent
MPI_Pready(i, request);
}
// 4. Wait for the entire partitioned send to complete.
MPI_Wait(&request, MPI_STATUS_IGNORE);
MPI_Request_free(&request);
} else if (rank == receiver) {
// 1. Initialize the partitioned receive operation.
MPI_Precv_init(message, num_partitions, part_size, MPI_CHAR, sender, tag, comm, MPI_INFO_NULL, &request);
// 2. Start the communication request.
MPI_Start(&request);
// 3. Wait for ALL partitions to be received. MPI_Wait will only complete
// when the entire message has arrived.
MPI_Wait(&request, MPI_STATUS_IGNORE);
// 4. Free the request and then decrypt the fully received message.
MPI_Request_free(&request);
decrypt(message, len, key);
}
}
Show Analysis
Partitioned communication is an advanced MPI feature designed for scenarios where pieces of a large message become ready at different times. This example shows a perfect use case: multiple threads generate different parts of the message buffer in parallel. As soon as a thread finishes its `encrypt` task, it calls `MPI_Pready`, allowing the MPI library to start transferring that partition while other threads are still working. This creates a pipeline, overlapping computation (encryption) with communication, which can significantly improve performance for large messages.