Parallel Programming (IN2147) - Overview
Interactive Study Guide - Retake Exam 2023
| Exam: | IN2147 / Retake | Date: | Tuesday 10th October, 2023 |
|---|---|---|---|
| Examiner: | Prof. Dr. Martin Schulz | Time: | 11:00-12:30 |
Working Instructions
- This exam consists of 7 problems with a total of 91 credits.
- Subproblems marked by * can be solved without results of previous subproblems.
- Answers are only accepted if the solution approach is documented. Give a reason for each answer unless explicitly stated otherwise.
- Allowed resources: one analog dictionary English ↔ native language.
- Do not write with red or green colors nor use pencils.
Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.
Problem 1: Amdahl's Law (8 credits)
You are given a parallelized program that you run on your 16-core laptop. When running the program on all 16 cores, you observe a speedup of 4 compared to the sequential execution. Assume the parallel regions have perfect speedup. What fraction of the execution time is parallelized? Show your calculations.
Show Hint
Start with the formula for Amdahl's Law: $SU(p) = \frac{1}{(1-f) + \frac{f}{p}}$. You are given $SU(16) = 4$. Solve for $f$.
Show Solution
We use Amdahl's Law (Summary Section 1.3.2): $SU(p) = \frac{1}{(1-f) + \frac{f}{p}}$.
$4 = \frac{1}{(1-f) + \frac{f}{16}}$
$(1-f) + \frac{f}{16} = \frac{1}{4} = 0.25$
$1 - 0.25 = f - \frac{f}{16}$
$0.75 = f(1 - \frac{1}{16}) = \frac{15f}{16}$
$f = \frac{0.75 \times 16}{15} = \frac{12}{15} = 0.8$
80% of the execution time is parallelized.
Show Analysis
This is a direct application of Amdahl's Law. It demonstrates how to calculate the parallelizable fraction of a program given its performance on a specific number of cores. This fraction, $f$, is crucial for predicting scalability.
Show Hint
The maximum theoretical speedup is limited by the sequential portion of the code. Use the formula for Amdahl's limit as the number of processors approaches infinity.
Show Solution
The maximum speedup (Amdahl's Limit) occurs as $p \to \infty$.
$SU(\infty) = \frac{1}{1-f}$ (Section 1.3.2).
$SU(\infty) = \frac{1}{1-0.8} = \frac{1}{0.2} = 5$.
The expected speedup is 5.
Show Analysis
This result highlights the core concept of Amdahl's Law: the serial fraction (1-f) becomes the ultimate bottleneck. No matter how many processors are added, the speedup can never exceed the reciprocal of the serial fraction.
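The two formulas can also be checked numerically. Below is a minimal C++ sketch (the helper names `amdahl_speedup` and `parallel_fraction` are made up for illustration, not part of the exam) that recovers the parallel fraction from a measured speedup and then evaluates the Amdahl limit:
#include <cstdio>
// Amdahl's Law: SU(p) = 1 / ((1 - f) + f / p)
double amdahl_speedup(double f, double p) {
    return 1.0 / ((1.0 - f) + f / p);
}
// Inverted: from 1/SU = 1 - f * (1 - 1/p) it follows that f = (1 - 1/SU) / (1 - 1/p)
double parallel_fraction(double su, double p) {
    return (1.0 - 1.0 / su) / (1.0 - 1.0 / p);
}
int main() {
    double f = parallel_fraction(4.0, 16.0);                 // 0.8, as derived above
    std::printf("f = %.2f\n", f);
    std::printf("SU(inf) = %.2f\n", 1.0 / (1.0 - f));        // 5, the Amdahl limit
    std::printf("SU(16) = %.2f\n", amdahl_speedup(f, 16.0)); // sanity check: 4
    return 0;
}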
Show Hint
Think about the overheads introduced by parallelization. What extra work do parallel threads/processes need to do that a sequential program doesn't?
Show Solution
(Section 6.3.1)
- Synchronization Overhead: Time spent coordinating threads, such as waiting at barriers or acquiring locks (Section 6.3.2).
- Load Imbalance: Uneven distribution of work, causing some processing units to remain idle while waiting for others to finish.
Show Analysis
These are two of the most common practical limitations to parallel performance. Other factors include communication overhead (in distributed systems) and memory contention. Achieving perfect speedup is rare in real-world applications because these overheads are almost always present.
Show Hint
Consider how the problem is divided. Do you split the data or do you split the tasks?
Show Solution
(Section 1.5)
- Data Parallelism (1.5.1): The same computation is performed concurrently on different subsets of the data (Split data, same computation).
- Functional Parallelism (1.5.2): Different computations (tasks) are performed concurrently (Split computation).
Show Analysis
Data parallelism is very common in scientific computing (e.g., processing different parts of a large array). Functional parallelism is more like an assembly line, where different stages of a pipeline are executed concurrently on different data items.
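To make the distinction concrete, here is a small OpenMP sketch (the arrays and the `stage_a`/`stage_b` functions are made up for illustration): the first loop is data parallelism, the `sections` construct is functional parallelism.
#include <vector>
static void stage_a(std::vector<double>& v) { for (double& x : v) x *= 2.0; }  // task 1
static void stage_b(std::vector<double>& v) { for (double& x : v) x += 1.0; }  // task 2
int main() {
    std::vector<double> a(1 << 20, 1.0), b(1 << 20, 1.0);
    // Data parallelism: the same computation on different chunks of one array.
    #pragma omp parallel for
    for (long i = 0; i < (long)a.size(); ++i)
        a[i] = a[i] * a[i];
    // Functional parallelism: different tasks executed concurrently.
    #pragma omp parallel sections
    {
        #pragma omp section
        stage_a(a);
        #pragma omp section
        stage_b(b);
    }
    return 0;
}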
Problem 2: Terminology and Memory (11 credits)
- Server Rack
- Node
- Process
- Main memory
- Software Thread
- Power supply
- NUMA Socket
- Rank
- Processor
- Cache
- Hardware Thread
- Water cooling
Show Hint
Organize the terms from largest physical container down to the smallest logical unit of execution. Think about the hierarchy from a data center level down to a single core.
Show Solution
A typical HPC architecture involves the following hierarchy:
- Physical Structure: A Server Rack houses multiple Nodes (the blades). Nodes require infrastructure like a Power supply and Water cooling.
- Node Internals (Mainboard): The mainboard features one or more NUMA Sockets, each holding a Processor (CPU). Main memory (DRAM) is attached to the sockets.
- Processor Internals: The processor contains cores and Cache.
- Execution: Cores execute Hardware Threads. The operating system runs a Process, which consists of Software Threads. In MPI, a process is identified by its Rank.
Show Analysis
Understanding this hierarchy is fundamental to parallel programming. The physical layout (Racks, Nodes, Sockets) directly influences performance, especially in terms of communication latency and memory access patterns. The logical concepts (Processes, Threads, Ranks) are the abstractions we use to manage parallel execution on this hardware.
Show Hint
Consider a system with multiple CPU sockets. Is all the RAM equally "close" to every CPU core?
Show Solution
In modern multi-socket systems, memory is physically distributed, with memory banks attached locally to specific sockets. Accessing local memory is fast (low latency). Accessing memory attached to a remote socket requires traversing the interconnect between the sockets, which incurs significantly higher latency.
This effect is called NUMA (Non-Uniform Memory Access) (Section 1.6.2.1).
Show Analysis
NUMA is a critical performance consideration in multi-socket servers. Operating systems try to allocate memory on the same socket as the thread that first accesses it (the "First Touch" policy). Mismatches between thread placement and data placement can lead to severe performance degradation due to high-latency remote memory accesses.
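A minimal sketch of parallel first-touch initialization (assuming static scheduling and, ideally, pinned threads): initializing and computing with the same partitioning keeps each memory page on the socket that later uses it.
#include <cstdlib>
int main() {
    const long n = 1 << 26;
    double *data = static_cast<double*>(std::malloc(n * sizeof(double)));
    // First touch happens here: each page is physically allocated on the NUMA node
    // of the thread that writes it first, not inside malloc().
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] = 0.0;
    // The compute phase uses the same static partitioning, so accesses stay local.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i)
        data[i] += 1.0;
    std::free(data);
    return 0;
}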
For the next question assume you have the following machine:
Additionally, assume that the memory hierarchy has the following specifications:
| Level | Type | Size | Line size |
|---|---|---|---|
| L1 Cache | Fully associative | 32KB | 16 bytes |
| L2 Cache | Fully associative | 1MB | 32 bytes |
| L3 Cache | 4-way set associative | 8MB | 32 bytes |
You have the following data structure, where counterThreadZero is read/written by a thread on Core 0 and counterThreadTwo is read/written by a thread on Core 2.
struct CounterData {
uint32_t counterThreadZero;
uint32_t counterThreadTwo;
};
c) Insert the minimum required padding into the template below to prevent false sharing in this scenario.
Code Template:
struct CounterData {
uint32_t counterThreadZero;
uint32_t counterThreadTwo;
};
Show Hint
False sharing happens when independent variables, modified by different cores, reside on the same cache line. Identify the largest cache line size that is private to the cores (or core groups). The padding must be sufficient to push the second variable onto a new cache line of that size.
Show Solution
False sharing occurs when different cores modify independent variables located on the same cache line, causing excessive coherence traffic between their private caches (Section 6.4). Core 0 and Core 2 have private L1 (16B line) and L2 (32B line) caches. To prevent false sharing, we must align the data based on the largest private cache line size involved, which is 32 bytes.
uint32_t is 4 bytes. Padding needed = 32 bytes (Line Size) - 4 bytes (variable size) = 28 bytes.
struct CounterData {
uint32_t counterThreadZero;
// Padding to ensure variables are on separate 32-byte cache lines
char padding[28];
uint32_t counterThreadTwo;
};
Show Analysis
The L1 cache line is 16 bytes and the L2 is 32 bytes. Since Core 0 and Core 2 are in different L2 cache domains, the relevant cache line size is 32 bytes. The first variable `counterThreadZero` occupies the first 4 bytes of a 32-byte cache line. To ensure `counterThreadTwo` starts on a *new* 32-byte cache line, we need to add 28 bytes of padding. This prevents modifications to one variable from invalidating the cache line holding the other variable.
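As an alternative to counting padding bytes by hand, C++11 `alignas` expresses the same intent; a sketch assuming the 32-byte private cache line from the table above:
#include <cstdint>
// Each counter gets its own 32-byte aligned slot, so the two fields can never
// share a 32-byte cache line (32 = largest private cache line size here).
struct CounterDataAligned {
    alignas(32) uint32_t counterThreadZero;
    alignas(32) uint32_t counterThreadTwo;
};
static_assert(sizeof(CounterDataAligned) == 64,
              "each counter occupies its own 32-byte line");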
Problem 3: Threading (14 credits)
System Specifications:
- Ubuntu 22.04.3 LTS
- 8 cores
- 2 x NUMA sockets, 4 cores each
- 16 x 32 KiB L1 Cache, 16 x 256 KiB L2 Cache, 2 x 3 MiB L3 cache (shared by 4 cores)
#include <thread>
#include <cstdlib>
#define NUM_THREADS 8
#define IMG_WIDTH 16384
#define IMG_HEIGHT 32768
struct Pixel {
double r, g, b;
};
void normalize_image(int start_row, int end_row, Pixel *image) {
for (int col = 0; col < IMG_HEIGHT; col++){
for (int row = start_row; row < end_row; row++){
/* Stateless function to normalize the pixel in place */
normalize(&image[row * IMG_WIDTH + col]);
}
}
}
int main() {
std::thread threads[NUM_THREADS];
/* Allocate Memory for the Pixels of an image */
Pixel *image = (Pixel *) malloc(IMG_HEIGHT * IMG_WIDTH * sizeof(Pixel));
/* Read pixels from file and store it in 'image' */
for (int row = 0; row < IMG_HEIGHT; row++){
for(int col = 0; col < IMG_WIDTH; col ++){
image[row * IMG_WIDTH + col] = get_pixel("input.bmp", row, col);
}
}
/* IMG_HEIGHT is divisible by NUM_THREADS -> no undercounting */
int ps = IMG_HEIGHT / NUM_THREADS;
for(int i = 0; i < NUM_THREADS; ++i){
threads[i] = std::thread(normalize_image, i * ps, (i + 1) * ps, image);
}
for (int i = 0; i < NUM_THREADS; ++i){
threads[i].join();
}
/* ... */
return 0;
}
Show Hint
1. **Memory Allocation:** Look at how the `image` buffer is initialized. On a NUMA system, where will that memory be physically located? What happens when threads on the *other* socket try to access it?
2. **Memory Access Pattern:** Look at the nested loops in `normalize_image`. Is the inner loop accessing memory contiguously or with large strides? How does this affect cache performance?
Show Solution
System context: 2 NUMA sockets, 4 cores each (8 total). 8 threads used.
- NUMA Effects / First Touch Policy (Section 6.5.2):
The `image` data is initialized sequentially by the main thread (the nested read loop in `main`). Due to the First Touch policy, all memory pages are physically allocated on the NUMA node where the main thread runs. During the parallel computation in `normalize_image`, threads running on the remote socket will experience high latency due to remote memory access.
- Solution: Initialize the `image` data in parallel using the same partitioning scheme used for the computation.
- Poor Spatial Locality / Strided Access (Section 6.4.1):
In `normalize_image`, the loops are nested such that the inner loop iterates over `row`. Since the image is stored in row-major order (`image[row * IMG_WIDTH + col]`), the inner loop accesses memory with a large stride (`IMG_WIDTH` pixels). This leads to poor cache utilization.
- Solution: Interchange the loops so the inner loop iterates over `col`, ensuring contiguous memory access.
Show Analysis
These two issues are classic performance killers in scientific and image processing codes on modern hardware. The NUMA problem arises from the disconnect between sequential initialization and parallel processing. The locality problem arises from an access pattern that fights against the hardware's prefetching and caching mechanisms. Fixing both is essential for good performance.
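A sketch of both fixes together, under the assumption that `normalize()` behaves as in the exam listing (a placeholder body is used so the snippet compiles); alternatively, the file read itself could be parallelized with the same partitioning.
#include <thread>
#define NUM_THREADS 8
#define IMG_WIDTH 16384
#define IMG_HEIGHT 32768
struct Pixel { double r, g, b; };
static void normalize(Pixel *p) { p->r *= 0.5; p->g *= 0.5; p->b *= 0.5; } // placeholder
// Fix 2: loop interchange -- the inner loop now walks contiguously along a row,
// matching the row-major layout image[row * IMG_WIDTH + col].
void normalize_image_fixed(int start_row, int end_row, Pixel *image) {
    for (int row = start_row; row < end_row; row++)
        for (int col = 0; col < IMG_WIDTH; col++)
            normalize(&image[row * IMG_WIDTH + col]);
}
// Fix 1: parallel first touch -- touch each block with the same row partitioning
// the worker threads later use, so pages land on the right socket before the
// sequential file read fills in the actual pixel values.
void first_touch(Pixel *image) {
    std::thread threads[NUM_THREADS];
    const int ps = IMG_HEIGHT / NUM_THREADS;
    for (int i = 0; i < NUM_THREADS; ++i)
        threads[i] = std::thread([=] {
            for (int row = i * ps; row < (i + 1) * ps; row++)
                for (int col = 0; col < IMG_WIDTH; col++)
                    image[row * IMG_WIDTH + col] = Pixel{0.0, 0.0, 0.0};
        });
    for (auto &t : threads) t.join();
}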
Show Hint
Think about the role of the operating system (kernel). Which type of thread requires kernel intervention for scheduling and context switching, and which doesn't? How does this affect performance and capabilities?
Show Solution
- Advantage: Low overhead for creation and context switching, as they do not require expensive system calls (kernel involvement).
- Disadvantage: A blocking system call made by one user-level thread may block the entire process, as the OS scheduler is typically unaware of individual user-level threads.
- Use Case: Implementing fine-grained concurrency models like fibers or coroutines, often used in event-driven applications.
Show Analysis
User-level threads are managed entirely in user space by a runtime library, making them lightweight. System threads (or kernel threads) are managed by the OS. Most modern programming languages (like C++ with `std::thread` or Java) use a mapping to system threads, giving the OS full control over scheduling for true parallelism on multi-core systems.
Show Hint
What does a thread do when it fails to acquire the lock? Does it keep trying actively, or does it give up its turn to run?
Show Solution
(Section 2.4.2)
- Spin-Lock: A thread waits by continuously polling the lock variable ("busy-waiting").
- Advantage: Low latency (fast acquisition when the lock is released).
- Disadvantage: Wastes CPU cycles while spinning.
- Yielding Lock: A thread yields its execution slice to the OS scheduler (goes to sleep) if the lock is occupied.
- Advantage: Does not waste CPU resources while waiting.
- Disadvantage: High latency due to the overhead of context switching and rescheduling.
Show Analysis
The choice between them depends on the expected lock contention time. For very short critical sections where the lock is held briefly, a spin-lock can be more efficient because the cost of spinning is less than the cost of two context switches (to sleep and wake up). For longer critical sections, a yielding lock is better to free up the CPU for other useful work.
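A minimal C++ sketch of the two waiting strategies (using `std::atomic_flag`; an illustration of the difference, not the lecture's implementation):
#include <atomic>
#include <thread>
// Spin-lock: busy-waits on an atomic flag. Cheap to acquire, burns CPU while waiting.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlock() { flag.clear(std::memory_order_release); }
};
// Yielding lock: hands the time slice back to the OS scheduler instead of spinning.
class YieldingLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire))
            std::this_thread::yield();   // reschedule instead of burning cycles
    }
    void unlock() { flag.clear(std::memory_order_release); }
};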
Problem 4: SIMD (16 credits)
You have a vector processor that can execute 256-bit vector instructions. Consider the following code:
void operation(float *a, float *b, float *c, int size){
for(int i = 1; i < size; i++){
a[i] = a[i] * b[i];
c[i] = a[i-1] + b[i-1];
}
}
Show Hint
Look at the data dependencies between loop iterations. Does an operation in iteration `i` depend on a result calculated in iteration `i-1`?
Show Solution
The problem is a Loop-Carried Flow Dependence (Section 4.2.1). Statement `c[i] = a[i-1] + b[i-1]` at iteration $i$ reads `a[i-1]`, which is written by the statement `a[i] = a[i] * b[i]` in the previous iteration $i-1$. This dependency across iterations prevents the loop from being vectorized, as all iterations in a vector operation must be independent.
Show Analysis
Vectorization relies on executing multiple iterations in parallel. If iteration `i` needs a value that is only computed in iteration `i-1`, they cannot be executed simultaneously. The compiler must respect this data dependency to ensure correctness, and therefore cannot vectorize the loop as written.
void operation_aligned(float *a, float *b, float *c, int size) {
}
Show Hint
The goal is to remove the loop-carried dependency. Try to rewrite the loop so that the value of `a[i]` used to calculate `c` is the *new* value computed in the same iteration, not the old value from the previous one. This involves shifting the calculation for `c`. You will need to handle the first and last iterations separately (peeling).
Show Solution
We use loop alignment (Section 4.2.2) to shift the execution so the dependency occurs within the same iteration (loop-independent). We align the computation of `a[i]` with `c[i+1]`. This requires peeling the first iteration for the `c` calculation and the last iteration for the `a` calculation.
Solution:
void operation_aligned(float *a, float *b, float *c, int size) {
if (size <= 1) return;
// Peel first iteration for the 'c' calculation (original i=1)
c[1] = a[0] + b[0];
// Aligned loop (runs from i=1 to size-2)
for(int i = 1; i < size - 1; i++){
a[i] = a[i] * b[i]; // S1(i)
// Now uses the value of a[i] computed in this iteration
c[i+1] = a[i] + b[i]; // S2(i+1)
}
// Peel last iteration for the 'a' calculation (original i=size-1)
a[size-1] = a[size-1] * b[size-1];
}
Show Analysis
By shifting the `c` calculation to `c[i+1]`, we make it dependent on `a[i]` and `b[i]`, which are both available within the same iteration `i`. The dependency is no longer "carried" across iterations, making the main loop body fully parallelizable. The peeled iterations handle the boundary conditions that don't fit the new pattern.
Here is a list of definitions you might find useful:
| Types / Loads | Arithmetic / Stores |
|---|---|
| __m256 | _mm256_add_ps |
| __m256f | _mm256_add_pf |
| __m256i | _mm256_add_pd |
| __m256d | _mm256_mul_ps |
| _mm256_load_ps | _mm256_mul_pf |
| _mm256_loadu_ps | _mm256_mul_pd |
| _mm256_load_pf | _mm256_store_ps |
| _mm256_loadu_pf | _mm256_storeu_ps |
| _mm256_load_pd | _mm256_store_pf |
| _mm256_loadu_pd | _mm256_storeu_pf |
| | _mm256_store_pd |
| | _mm256_storeu_pd |
void operation_vectorized(float *a, float *b, float *c, int size){
int i = 1;
for(; i < size - 1 - ((size - 2) % 8); i+= ){
__m256 a256, b256, c256;
a256 = _mm256_load
b256 = _mm256_load
a256 = _mm256_mul
_mm256_store
c256 = _mm256_add
_mm256_store
}
for( ; ; ){
}
}
Show Hint
1. Start with the peeled scalar operations from part b).
2. The vector length (VL) for 256-bit registers with 32-bit floats is 8.
3. Use unaligned loads (`_mm256_loadu_ps`) and stores (`_mm256_storeu_ps`) as alignment is not guaranteed.
4. Implement the main vector loop, processing 8 elements at a time.
5. Add a scalar remainder loop to handle the elements that don't fit into a full vector chunk.
6. Finish with the final peeled scalar operation.
Show Solution
We vectorize the aligned loop. VL = 8 (256-bit / 32-bit float). We use unaligned intrinsics.
Solution:
#include <immintrin.h>
void operation_vectorized(float *a, float *b, float *c, int size){
if (size <= 1) return;
// Peeling (scalar)
c[1] = a[0] + b[0];
int i = 1;
const int VL = 8;
// Main Vector Loop (from i=1 up to the last full vector in the size-2 range)
for(; i < size - 1 - ((size - 2) % 8); i += VL){
__m256 a256, b256, c256;
// Load a[i..i+7] and b[i..i+7] (Unaligned)
a256 = _mm256_loadu_ps(&a[i]);
b256 = _mm256_loadu_ps(&b[i]);
// S1(i): a = a * b
a256 = _mm256_mul_ps(a256, b256);
// Store updated a
_mm256_storeu_ps(&a[i], a256);
// S2(i+1): c = a + b (using the newly computed a)
c256 = _mm256_add_ps(a256, b256);
// Store c[i+1..i+8]
_mm256_storeu_ps(&c[i+1], c256);
}
// Remainder Loop (Scalar)
for( ; i < size - 1; i++ ){
a[i] = a[i] * b[i];
c[i+1] = a[i] + b[i];
}
// Peeling (scalar)
a[size-1] = a[size-1] * b[size-1];
}
Show Analysis
This solution correctly implements the full vectorization strategy. It combines the loop alignment transformation from part (b) with SIMD intrinsics. The code correctly handles the boundary conditions with scalar peeling, processes the bulk of the data with a fast vectorized loop, and cleans up any remaining elements with a scalar remainder loop. Using unaligned loads/stores (`_...u_...`) makes the code robust, as it doesn't require specific memory alignment of the input arrays.
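A small test harness is useful to convince oneself that the transformation preserves semantics; this sketch assumes `operation` and `operation_vectorized` from above are compiled into the same program (the size is deliberately not a multiple of 8 to exercise the remainder loop):
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>
void operation(float *a, float *b, float *c, int size);
void operation_vectorized(float *a, float *b, float *c, int size);
int main() {
    const int n = 1003;
    std::vector<float> a1(n), b(n), c1(n, 0.0f);
    for (int i = 0; i < n; ++i) { a1[i] = 0.5f * i; b[i] = 1.0f / (i + 1); }
    std::vector<float> a2 = a1, c2 = c1;
    operation(a1.data(), b.data(), c1.data(), n);             // scalar reference
    operation_vectorized(a2.data(), b.data(), c2.data(), n);  // SIMD version
    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        max_err = std::max({max_err, std::fabs(a1[i] - a2[i]), std::fabs(c1[i] - c2[i])});
    std::printf("max abs difference: %g\n", max_err);         // expected: 0
    return 0;
}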
Problem 5: OpenMP (11 credits)
In contrast to a dense matrix, where most entries are non-zero, most entries of a sparse matrix are zero. To save memory, sparse matrices are usually stored in a sparse matrix format, which only stores the non-zero entries of each row of the matrix.
Dense matrix:
| 1 | 5 | 3 | 0 |
| 0 | 2 | 0 | 0 |
| 0 | 0 | 0 | 1 |
| 2 | 3 | 0 | 0 |
Sparse storage (non-zero entries per row):
| 1 | 5 | 3 |
| 2 |
| 1 |
| 2 | 3 |
double get_max_in_matrix(Sparse_Matrix *matrix, int num_rows) {
double *nnz, max = DBL_MIN;
int num_nnz;
/* OMP PRAGMA */
for (int i = 0; i < num_rows; ++i) {
/* returns the entries and the number of entries in row i of the matrix */
get_nonzeros_in_row(matrix, i, &nnz, &num_nnz);
for (int u = 0; u < num_nnz; u++){
if (nnz[u] > max) {
max = nnz[u];
}
}
}
return max;
}
Show Hint
The number of non-zero elements per row can vary greatly. This suggests an imbalanced workload. What `schedule` clause is best for this? The operation is a maximum reduction. Also, consider which variables need to be private to each thread.
Show Solution
The workload per row in a sparse matrix is irregular, leading to potential load imbalance. We must use `dynamic` scheduling (Section 3.3.3) for efficiency. We also need a reduction for the `max` variable (Section 3.6) and ensure loop-local variables (`nnz`, `num_nnz`) are private.
OMP Pragma:
#pragma omp parallel for reduction(max:max) private(nnz, num_nnz) schedule(dynamic)
Show Analysis
This pragma correctly addresses the key challenges. `reduction(max:max)` handles the parallel maximum finding correctly and efficiently. `private(nnz, num_nnz)` ensures each thread has its own pointers and counters for the row it is processing. `schedule(dynamic)` is crucial for performance with sparse matrices, as it allows threads that finish short rows to dynamically pick up new rows, preventing load imbalance.
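For reference, a compilable sketch of the complete function with the pragma in place (`Sparse_Matrix` and `get_nonzeros_in_row` are the exam's opaque type and helper, only forward-declared here; as in the template, `max` starts at `DBL_MIN`, which assumes at least one positive entry):
#include <float.h>
struct Sparse_Matrix;
void get_nonzeros_in_row(Sparse_Matrix *matrix, int row, double **nnz, int *num_nnz);
double get_max_in_matrix(Sparse_Matrix *matrix, int num_rows) {
    double *nnz, max = DBL_MIN;
    int num_nnz;
    // max reduction, per-thread row buffers, dynamic scheduling for irregular rows
    #pragma omp parallel for reduction(max:max) private(nnz, num_nnz) schedule(dynamic)
    for (int i = 0; i < num_rows; ++i) {
        get_nonzeros_in_row(matrix, i, &nnz, &num_nnz);
        for (int u = 0; u < num_nnz; u++)
            if (nnz[u] > max)
                max = nnz[u];
    }
    return max;
}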
int calculate_sum(double *data, int size){
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < size; i++) {
sum += data[i];
}
return sum;
}
Code Template:
int calculate_sum2(double *data, int size){
double sum = 0.0;
/* OMP PRAGMA1 */
{
int local_sum = 0;
/* OMP PRAGMA2 */
for (int i = 0; i < size; i++) {
local_sum += data[i];
}
/* OMP PRAGMA3 */
{
sum += local_sum;
}
}
return sum;
}
Show Hint
This is a manual reduction.
1. Start a parallel region.
2. Distribute the loop iterations among the threads. Each thread accumulates into its own `local_sum`.
3. After the loop, protect the update to the shared `sum` variable. What construct ensures only one thread can execute a block of code at a time?
Show Solution
We implement a manual reduction using thread-private variables and a critical section for the final aggregation (Section 3.5.1).
Solution:
int calculate_sum2(double *data, int size){
double sum = 0.0;
/* OMP PRAGMA1 */
#pragma omp parallel
{
// Note: Template used 'int local_sum', corrected to 'double' as data is double.
double local_sum = 0.0;
/* OMP PRAGMA2 */
#pragma omp for
for (int i = 0; i < size; i++) {
local_sum += data[i];
}
/* OMP PRAGMA3 */
#pragma omp critical
{
sum += local_sum;
}
}
return sum;
}
Show Analysis
This pattern demonstrates how a reduction works under the hood. Each thread computes a partial sum into a private `local_sum`. The implicit barrier at the end of the `omp for` loop ensures all partial sums are computed before any thread proceeds. Then, the `omp critical` section serializes access to the shared `sum` variable, allowing each thread to safely add its partial result to the global total, avoiding a data race.
Show Hint
How do you tell the compiler to parallelize code? How do you query information like the thread ID at runtime? How do you configure the number of threads without changing the code?
Show Solution
(Section 3.1.2)
- Compiler Directives (pragmas, e.g., `#pragma omp parallel`).
- Runtime Library Routines (functions, e.g., `omp_get_thread_num()`).
- Environment Variables (e.g., `OMP_NUM_THREADS`).
Show Analysis
These three components form the complete OpenMP API. Directives are the primary method for expressing parallelism. Runtime routines provide fine-grained control and introspection. Environment variables allow for configuring the parallel execution environment without recompiling the code, which is essential for tuning performance on different machines.
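A tiny program touching all three components (directive, runtime routine, environment variable):
#include <omp.h>
#include <cstdio>
int main() {
    // 1. Compiler directive: expresses the parallelism.
    #pragma omp parallel
    {
        // 2. Runtime library routines: query the execution environment.
        std::printf("thread %d of %d\n",
                    omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}
// 3. Environment variable: configures the run without recompiling, e.g.
//    OMP_NUM_THREADS=4 ./a.out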
Problem 6: MPI (13 credits)
Consider the following MPI program:
void print_members(int *ranks) {
int i = 0;
while (ranks[i] >= 0){
printf("%d ", ranks[i]);
i++;
}
printf("\n");
}
int main(int argc, char *argv[])
{
int rank, rank_c1, rank_c2, size, row, col;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Comm c1, c2;
col = rank / 4;
row = rank % 4;
MPI_Comm_split(MPI_COMM_WORLD, row, col, &c1);
MPI_Comm_split(MPI_COMM_WORLD, col, 4 - row, &c2);
MPI_Comm_rank(c1, &rank_c1);
MPI_Comm_rank(c2, &rank_c2);
int ranks_c1[size];
memset(ranks_c1, -1, size * sizeof(int));
int ranks_c2[size];
memset(ranks_c2, -1, size * sizeof(int));
MPI_Allgather(&rank, 1, MPI_INT, ranks_c1, 1, MPI_INT, c1);
MPI_Allgather(&rank, 1, MPI_INT, ranks_c2, 1, MPI_INT, c2);
if (rank == 6){
printf("rank_c1: %d members: ", rank_c1);
print_members(ranks_c1);
printf("rank_c2: %d members: ", rank_c2);
print_members(ranks_c2);
}
MPI_Finalize();
return 0;
}
Show Hint
Trace the execution for `rank = 6` with `size = 12`.
1. Calculate `row` and `col` for rank 6.
2. For `c1`, determine the `color` (`row`) and `key` (`col`). Identify all other ranks with the same color and order them by their key to find the new rank (`rank_c1`).
3. For `c2`, determine the `color` (`col`) and `key` (`4 - row`). Identify all other ranks with the same color and order them by their key to find `rank_c2`.
4. The `MPI_Allgather` collects the original world ranks from all processes within each new communicator.
Show Solution
Size=12 (Ranks 0-11). `col = rank / 4`; `row = rank % 4`.
We analyze Rank 6: `col`=6/4=1. `row`=6%4=2.
Communicator c1: `MPI_Comm_split(color=row, key=col)` (Section 9.8.2). Rank 6 is in group `color=2` (all ranks where `rank % 4 == 2`). Members: {2, 6, 10}. The keys for these members are their `col` values: Rank 2 (col=0), Rank 6 (col=1), Rank 10 (col=2). Ranks in the new communicator are assigned based on ascending key order: Rank 2 gets new rank 0, Rank 6 gets new rank 1, Rank 10 gets new rank 2. So, `rank_c1` for Rank 6 is 1. `Allgather` collects the world ranks {2, 6, 10}.
Communicator c2: `MPI_Comm_split(color=col, key=4-row)`. Rank 6 is in group `color=1` (all ranks where `rank / 4 == 1`). Members: {4, 5, 6, 7}. The keys for these members are `4-row`: Rank 4 (4-0=4), Rank 5 (4-1=3), Rank 6 (4-2=2), Rank 7 (4-3=1). Ranks are assigned based on ascending key order: Rank 7 (key=1) gets new rank 0, Rank 6 (key=2) gets new rank 1, Rank 5 (key=3) gets new rank 2, Rank 4 (key=4) gets new rank 3. So, `rank_c2` for Rank 6 is 1. `Allgather` collects the world ranks in the new order: {7, 6, 5, 4}.
Output (at Rank 6):
rank_c1: 1 members: 2 6 10
rank_c2: 1 members: 7 6 5 4
Show Analysis
This problem tests understanding of `MPI_Comm_split`, which is a powerful tool for creating subgroups of processes. The key is to correctly identify the `color` (which defines the group) and the `key` (which defines the ordering within the group) for the specific rank in question. The new ranks are always assigned starting from 0 based on a sorted order of the keys.
Show Hint
When does the function call return? Does the return of the function imply that the operation is complete and the associated memory buffers are safe to reuse?
Show Solution
- Blocking routines (e.g., `MPI_Send`): return only after the operation is locally complete, meaning the communication buffer is safe to reuse (Section 8.4.2).
- Nonblocking routines (e.g., `MPI_Isend`): return immediately after initiating the operation. The application must later verify completion (e.g., using `MPI_Wait`) before reusing the buffer (Section 9.5).
Show Analysis
The key difference is the decoupling of initiation and completion. Nonblocking routines allow the program to start a communication operation and then continue with other computations, potentially overlapping communication with computation to hide latency. However, this flexibility comes with the responsibility for the programmer to explicitly check for completion before reusing buffers.
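A minimal sketch of this decoupling with two ranks (the buffers and message size are made up for illustration; not part of the exam code): the transfers are initiated, independent work can overlap them, and the buffers are only reused after `MPI_Waitall` confirms completion.
#include <mpi.h>
#include <vector>
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    const int n = 1 << 20;
    std::vector<double> sendbuf(n, rank), recvbuf(n);
    int peer = 1 - rank;               // assumes exactly 2 ranks
    MPI_Request reqs[2];
    // Initiate both transfers; the calls return immediately.
    MPI_Isend(sendbuf.data(), n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(recvbuf.data(), n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
    // ... independent computation can run here, overlapping the communication ...
    // Only after completion is it safe to reuse sendbuf or read recvbuf.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Finalize();
    return 0;
}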
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int local_results[] = { rank, rank * 2, rank * 2};
// TODO: declare recvbuf
// TODO: reduce local_results into recvbuf using the sum operation at rank 0
MPI_Reduce(local_results,
if (rank == 0) {
// print content of recvbuf
print(recvbuf);
}
MPI_Finalize();
return 0;
}
Show Hint
The receive buffer (`recvbuf`) is only significant on the root process (rank 0). It must be large enough to hold the result. The `MPI_Reduce` call needs the send buffer, receive buffer, count, datatype, operation, root rank, and communicator.
Show Solution
#include <mpi.h>
// assume print() is defined
int main(int argc, char *argv[])
{
MPI_Init(&argc, &argv);
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int local_results[] = { rank, rank * 2, rank * 2}; // 3 elements
// TODO: declare recvbuf
// Needs to hold the result (3 integers), only significant on root
int recvbuf[3];
// TODO: reduce local_results into recvbuf using the sum operation at rank 0
// MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
MPI_Reduce(local_results, recvbuf, 3, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
// print content of recvbuf
// print(recvbuf);
}
MPI_Finalize();
return 0;
}
Show Analysis
This is a standard application of `MPI_Reduce`. All processes provide their `local_results` as the send buffer. The operation `MPI_SUM` is applied element-wise across all processes, and the final result of 3 integers is stored in `recvbuf` on the root process (rank 0).
Show Hint
Write out the `local_results` array for each rank (0, 1, 2, 3). Then, sum the arrays element by element.
Show Solution
Ranks 0, 1, 2, 3. `local_results = {rank, rank*2, rank*2}`.
- Rank 0: {0, 0, 0}
- Rank 1: {1, 2, 2}
- Rank 2: {2, 4, 4}
- Rank 3: {3, 6, 6}
The sum is performed element-wise:
Sum: {0+1+2+3, 0+2+4+6, 0+2+4+6} = {6, 12, 12}.
The content of `recvbuf` at rank 0 is {6, 12, 12}.
Show Analysis
This demonstrates the element-wise nature of MPI reduction operations. Each element of the output array is the reduction of the corresponding elements from the input arrays of all participating processes.
Problem 7: Hybrid Programming (18 credits)
In the exercises we covered how to program an OpenMP and MPI hybrid. We are now faced with the task of computing the average pixel brightness (luminance) of a large image, as shown in the sequential code below.
struct Pixel {
unsigned short r;
unsigned short g;
unsigned short b;
};
void average_luminance(Pixel *pixels, int size){
double sum_lum = 0;
for(int i = 0; i < size; i++){
double luminance = 0.2126 * pixels[i].r
+ 0.7152 * pixels[i].g
+ 0.0722 * pixels[i].b;
sum_lum += luminance;
}
printf("Average luminance: %f\n", sum_lum / size);
}
int main(int argc, char *argv[]) {
// assume the arguments are initialized correctly
average_luminance(pixels, size, root);
return 0;
}
Show Hint
When using OpenMP threads within an MPI program, you must initialize MPI with a specific function that requests a certain level of thread support. What is this function, and what is the appropriate support level if only the main thread will be making MPI calls?
Show Solution
We must use `MPI_Init_thread` to initialize MPI with thread support (Section 9.13.1). For the typical hybrid model where MPI calls are made outside OpenMP regions by the main thread, `MPI_THREAD_FUNNELED` is sufficient and appropriate (Section 9.13.4).
Solution:
#include <mpi.h>
#include <stdio.h>
// Forward declaration of the function and Pixel struct
struct Pixel;
void average_luminance(Pixel *pixels, int size, int root);
int main(int argc, char *argv[]) {
int required = MPI_THREAD_FUNNELED;
int provided;
MPI_Init_thread(&argc, &argv, required, &provided);
if (provided < required) {
fprintf(stderr, "Error: MPI implementation does not provide required thread support.\n");
MPI_Abort(MPI_COMM_WORLD, 1);
}
// Assume pixels, size, root are initialized here for the example
// average_luminance(pixels, size, root);
MPI_Finalize();
return 0;
}
Show Analysis
Using `MPI_Init_thread` is mandatory for correct hybrid programming. It informs the MPI library about the application's threading model. `MPI_THREAD_FUNNELED` guarantees safety when only the main thread makes MPI calls. Higher levels like `MPI_THREAD_SERIALIZED` (multiple threads make calls, but not concurrently) or `MPI_THREAD_MULTIPLE` (fully thread-safe) are available but may incur performance overhead and are not needed for this common hybrid pattern.
To solve the tasks b) and c) we have a cluster of compute nodes. You can assume the following conditions:
- Each compute node has identical hardware
- Each node has 2 sockets with NUMA between the sockets
- Each socket has 4 cores with 2-way simultaneous multithreading
- `mpirun` starts 4 processes per node
- The number of processes evenly divides the size of the pixel array
- Assume no padding is added to the Pixel struct by the compiler
- You may ignore NUMA and thread affinity for question b)
struct Pixel {
unsigned short r;
unsigned short g;
unsigned short b;
};
void average_luminance(Pixel *pixels, int size, int root){
int commSize, commRank;
MPI_Comm_size(MPI_COMM_WORLD, &commSize);
MPI_Comm_rank(MPI_COMM_WORLD, &commRank);
Pixel *localPixels = (Pixel*) malloc( );
double local_sum = 0;
for(int i = ; ; ){
double luminance = 0.2126 * localPixels[i].r
+ 0.7152 * localPixels[i].g
+ 0.0722 * localPixels[i].b;
local_sum += luminance;
}
double global_sum;
printf("Average luminance: %f\n", );
}
Show Hint
1. **Data Distribution:** Use `MPI_Scatter` to send chunks of the `pixels` array from the `root` to all processes, including itself.
2. **Custom Datatype:** Since `Pixel` is a struct, you must create a custom `MPI_Datatype` for it before you can use it in communication calls. `MPI_Type_contiguous` is suitable here.
3. **Local Computation:** Use an OpenMP `parallel for` loop with a `reduction(+:local_sum)` to calculate the sum of luminances for the local chunk of pixels.
4. **Global Aggregation:** Use `MPI_Reduce` to sum all the `local_sum` values into a `global_sum` on the `root` process.
5. **Final Calculation:** The `root` process performs the final division and prints the result.
Show Solution
We use a hybrid approach: MPI Scatter to distribute the data, OpenMP parallel for reduction to calculate local sums, and MPI Reduce to aggregate the global sum at the root.
Solution:
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
struct Pixel {
unsigned short r;
unsigned short g;
unsigned short b;
};
void average_luminance(Pixel *pixels, int size, int root){
int commSize, commRank;
MPI_Comm_size(MPI_COMM_WORLD, &commSize);
MPI_Comm_rank(MPI_COMM_WORLD, &commRank);
// 1. Define MPI Datatype for Pixel (3 contiguous unsigned shorts)
MPI_Datatype MPI_PIXEL;
MPI_Type_contiguous(3, MPI_UNSIGNED_SHORT, &MPI_PIXEL);
MPI_Type_commit(&MPI_PIXEL);
int local_size = size / commSize;
// 2. Allocate local buffer
Pixel *localPixels = (Pixel*) malloc(local_size * sizeof(Pixel));
// 3. Distribute data from root to all processes
MPI_Scatter(pixels, local_size, MPI_PIXEL, localPixels, local_size, MPI_PIXEL, root, MPI_COMM_WORLD);
// 4. Local computation (OpenMP Parallelization)
double local_sum = 0.0;
#pragma omp parallel for reduction(+:local_sum) schedule(static)
for(int i = 0; i < local_size; i++){
double luminance = 0.2126 * localPixels[i].r
+ 0.7152 * localPixels[i].g
+ 0.0722 * localPixels[i].b;
local_sum += luminance;
}
// 5. Global reduction to sum up local_sum from all ranks at root
double global_sum;
MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);
// 6. Output at root
if (commRank == root) {
printf("Average luminance: %f\n", global_sum / size);
}
// Cleanup
free(localPixels);
MPI_Type_free(&MPI_PIXEL);
}
Show Analysis
This solution demonstrates a standard and effective hybrid parallelization pattern.
- **MPI for Data Distribution:** `MPI_Scatter` is the correct collective to distribute the large dataset across different nodes' memory spaces.
- **Custom Datatype:** Creating `MPI_PIXEL` is crucial. MPI needs to know the memory layout of the data it's sending; it cannot automatically handle custom structs.
- **OpenMP for Node-level Parallelism:** `omp parallel for reduction` efficiently utilizes the multiple cores within a single node to process its local data chunk.
- **MPI for Aggregation:** `MPI_Reduce` is the correct collective to combine the partial results from all nodes back to the root for the final answer.
MPI binding report:
rank0: [BB/../../..] [../../../..]
rank1: [../../../..] [../BB/../..]
rank2: [../../BB/..] [../../../..]
rank3: [../../../..] [../../../BB]
OPENMP DISPLAY ENVIRONMENT BEGIN
OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '16'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'false'
OMP_PLACES =
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OPENMP DISPLAY ENVIRONMENT END
Show Hint
1. **Hardware vs. Software:** Compare the number of hardware threads per node (sockets * cores * SMT) with the total number of OpenMP threads being launched (MPI processes * `OMP_NUM_THREADS`). Is there a mismatch?
2. **Thread Affinity:** Look at `OMP_PROC_BIND`. Are threads pinned to specific cores, or can they migrate? Why is this important on a NUMA machine?
3. **MPI Process Placement:** The binding report shows 4 MPI ranks on a 2-socket node. Is this the most efficient way to map processes to the hardware for a hybrid code?
Show Solution
Hardware Context: 2 sockets, 4 cores/socket, 2-way SMT = 16 hardware threads/node.
Configuration: 4 MPI processes per node.
Performance Issues:
- Massive OpenMP Oversubscription: `OMP_NUM_THREADS = '16'`. Each of the 4 MPI processes attempts to launch 16 threads, resulting in 64 total software threads competing for 16 hardware threads. This causes excessive context switching overhead.
- Lack of Thread Pinning (Poor Locality): `OMP_PROC_BIND = 'false'`. Threads are not bound to cores (Section 6.5.3). They can migrate freely across the node, including across NUMA boundaries, which destroys data locality and cache performance.
- Suboptimal Hybrid Strategy: Running 4 MPI processes on a 2-socket node (2 per socket) is generally less efficient than the recommended hybrid strategy of 1 MPI process per socket (Section 9.13.4), as it leads to resource contention for memory bandwidth within each socket.
Fixes (Optimal Strategy):
The best approach is to reconfigure the execution to match the hardware: 1 MPI process per socket (2 total per node), and have OpenMP utilize the cores within each socket. We prioritize physical cores (4 per socket) as SMT is often not beneficial for HPC.
1. Fix MPI Execution Command (assuming OpenMPI):
# Start one MPI process per socket on the node (2 processes total)
mpirun --map-by socket --bind-to socket -np 2 ./your_program
2. Set OpenMP Environment Variables:
export OMP_NUM_THREADS=4 # Utilize the 4 physical cores per socket
export OMP_PROC_BIND=close # Keep threads close within the socket's cores
export OMP_PLACES=cores # Define the places for binding as cores
Show Analysis
This is a classic hybrid performance tuning problem. The initial setup completely ignores the machine's NUMA topology and leads to resource contention. The fix aligns the software configuration with the hardware reality:
- The `mpirun` command ensures each MPI process "owns" a NUMA socket, maximizing memory locality for that process.
- The `OMP_` variables ensure that the OpenMP threads created by an MPI process stay within that process's socket (`--bind-to socket` + `OMP_PROC_BIND=close`) and are pinned to physical cores (`OMP_PLACES=cores`), which prevents migration and avoids the potential negative effects of SMT for compute-bound tasks.
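After applying these fixes, the binding can be verified from inside the program; a short sketch using the OpenMP 4.5 place routines (in addition to the environment report above, which is typically produced by `OMP_DISPLAY_ENV=true`):
#include <omp.h>
#include <cstdio>
int main() {
    #pragma omp parallel
    {
        // With OMP_PLACES=cores and OMP_PROC_BIND=close, each thread should
        // report a distinct place number that stays constant over the run.
        std::printf("thread %d runs on place %d of %d\n",
                    omp_get_thread_num(), omp_get_place_num(), omp_get_num_places());
    }
    return 0;
}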