Parallel Programming (IN2147) - Overview

Interactive Study Guide - Retake Exam 2023

Exam:IN2147 / Retake Date:Tuesday 10th October, 2023
Examiner:Prof. Dr. Martin Schulz Time:11:00-12:30

Working Instructions

  • This exam consists of 7 problems with a total of 91 credits.
  • Subproblems marked by * can be solved without results of previous subproblems.
  • Answers are only accepted if the solution approach is documented. Give a reason for each answer unless explicitly stated otherwise.
  • Allowed resources: one analog dictionary English ↔ native language.
  • Do not write with red or green colors nor use pencils.

Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.

Problem 1: Amdahl's Law (8 credits)

You are given a parallelized program that you run on your 16 core laptop. When running the program on all 16 cores you observe a speedup of 4 compared to the sequential execution. Assume the parallel regions have perfect speedup. Show your calculations.

a)* How many percent of the sequential programs execution time are parallelized?
Show Hint

Start with the formula for Amdahl's Law: $SU(p) = \frac{1}{(1-f) + \frac{f}{p}}$. You are given $SU(16) = 4$. Solve for $f$.

Show Solution

We use Amdahl's Law (Summary Section 1.3.2): $SU(p) = \frac{1}{(1-f) + \frac{f}{p}}$.

$4 = \frac{1}{(1-f) + \frac{f}{16}}$

$(1-f) + \frac{f}{16} = \frac{1}{4} = 0.25$

$1 - 0.25 = f - \frac{f}{16}$

$0.75 = f(1 - \frac{1}{16}) = \frac{15f}{16}$

$f = \frac{0.75 \times 16}{15} = \frac{12}{15} = 0.8$

80% of the execution time is parallelized.

Show Analysis

This is a direct application of Amdahl's Law. It demonstrates how to calculate the parallelizable fraction of a program given its performance on a specific number of cores. This fraction, $f$, is crucial for predicting scalability.

b) You run the same code on AwesomeMUC that has an infinite amount of parallel processing units. How much speedup would you expect on this machine?
Show Hint

The maximum theoretical speedup is limited by the sequential portion of the code. Use the formula for Amdahl's limit as the number of processors approaches infinity.

Show Solution

The maximum speedup (Amdahl's Limit) occurs as $p \to \infty$.

$SU(\infty) = \frac{1}{1-f}$ (Section 1.3.2).

$SU(\infty) = \frac{1}{1-0.8} = \frac{1}{0.2} = 5$.

The expected speedup is 5.

Show Analysis

This result highlights the core concept of Amdahl's Law: the serial fraction (1-f) becomes the ultimate bottleneck. No matter how many processors are added, the speedup can never exceed the reciprocal of the serial fraction.

c)* Name two reasons why a parallel region might not achieve perfect speedup.
Show Hint

Think about the overheads introduced by parallelization. What extra work do parallel threads/processes need to do that a sequential program doesn't?

Show Solution

(Section 6.3.1)

  1. Synchronization Overhead: Time spent coordinating threads, such as waiting at barriers or acquiring locks (Section 6.3.2).
  2. Load Imbalance: Uneven distribution of work, causing some processing units to remain idle while waiting for others to finish.
Show Analysis

These are two of the most common practical limitations to parallel performance. Other factors include communication overhead (in distributed systems) and memory contention. Achieving perfect speedup is rare in real-world applications because these overheads are almost always present.

d)* Briefly explain the key difference between data parallelism and functional parallelism.
Show Hint

Consider how the problem is divided. Do you split the data or do you split the tasks?

Show Solution

(Section 1.5)

  • Data Parallelism (1.5.1): The same computation is performed concurrently on different subsets of the data (Split data, same computation).
  • Functional Parallelism (1.5.2): Different computations (tasks) are performed concurrently (Split computation).
Show Analysis

Data parallelism is very common in scientific computing (e.g., processing different parts of a large array). Functional parallelism is more like an assembly line, where different stages of a pipeline are executed concurrently on different data items.

Problem 2: Terminology and Memory (11 credits)

a)* Given this exemplary images of a mainboard and a SuperMUC blade (next page) label the marked logical components appropriately with the terms from the following list. Do not use other terms or names.(since you cannot access images directly from this markdown file, please describe the typical layout and the corresponding functionalities)
  • Server Rack
  • Node
  • Process
  • Main memory
  • Software Thread
  • Power supply
  • NUMA Socket
  • Rank
  • Processor
  • Cache
  • Hardware Thread
  • Water cooling
Show Hint

Organize the terms from largest physical container down to the smallest logical unit of execution. Think about the hierarchy from a data center level down to a single core.

Show Solution

A typical HPC architecture involves the following hierarchy:

  • Physical Structure: A Server Rack houses multiple Nodes (the blades). Nodes require infrastructure like a Power supply and Water cooling.
  • Node Internals (Mainboard): The mainboard features one or more NUMA Sockets, each holding a Processor (CPU). Main memory (DRAM) is attached to the sockets.
  • Processor Internals: The processor contains cores and Cache.
  • Execution: Cores execute Hardware Threads. The operating system runs a Process, which consists of Software Threads. In MPI, a process is identified by its Rank.
Show Analysis

Understanding this hierarchy is fundamental to parallel programming. The physical layout (Racks, Nodes, Sockets) directly influences performance, especially in terms of communication latency and memory access patterns. The logical concepts (Processes, Threads, Ranks) are the abstractions we use to manage parallel execution on this hardware.

b)* Given the mainboard of subproblem a). Please explain why memory access from a core can be different depending on memory location? What is this effect called?
Show Hint

Consider a system with multiple CPU sockets. Is all the RAM equally "close" to every CPU core?

Show Solution

In modern multi-socket systems, memory is physically distributed, with memory banks attached locally to specific sockets. Accessing local memory is fast (low latency). Accessing memory attached to a remote socket requires traversing the interconnect between the sockets, which incurs significantly higher latency.

This effect is called NUMA (Non-Uniform Memory Access) (Section 1.6.2.1).

Show Analysis

NUMA is a critical performance consideration in multi-socket servers. Operating systems try to allocate memory on the same socket as the thread that first accesses it (the "First Touch" policy). Mismatches between thread placement and data placement can lead to severe performance degradation due to high-latency remote memory accesses.

For the next question assume you have the following machine:

Core 0 Core 1 Core 2 Core 3 L1 Cache L1 Cache L1 Cache L1 Cache L2 Cache L2 Cache L3 Cache Memory Bank

Additionally, assume that the memory hierarchy has the following specifications:

LevelTypeSizeLine size
L1 CacheFully associative32KB16 bytes
L2 CacheFully associative1MB32 bytes
L3 Cache4-way set associative8MB32 bytes

You have the following data structure, where counterThreadZero is read/written by a thread on Core 0 and counterThreadTwo is read/written by a thread on Core 2.


struct CounterData {
    uint32_t counterThreadZero;
    uint32_t counterThreadTwo;
};
c) Insert the minimum required padding into the template below to prevent false sharing in this scenario. Code Template:

struct CounterData {

    uint32_t counterThreadZero;



    uint32_t counterThreadTwo;

};
Show Hint

False sharing happens when independent variables, modified by different cores, reside on the same cache line. Identify the largest cache line size that is private to the cores (or core groups). The padding must be sufficient to push the second variable onto a new cache line of that size.

Show Solution

False sharing occurs when different cores modify independent variables located on the same cache line, causing excessive coherence traffic between their private caches (Section 6.4). Core 0 and Core 2 have private L1 (16B line) and L2 (32B line) caches. To prevent false sharing, we must align the data based on the largest private cache line size involved, which is 32 bytes.

uint32_t is 4 bytes. Padding needed = 32 bytes (Line Size) - 4 bytes (variable size) = 28 bytes.

Solution:

struct CounterData {
    uint32_t counterThreadZero;

    // Padding to ensure variables are on separate 32-byte cache lines
    char padding[28];

    uint32_t counterThreadTwo;
};
Show Analysis

The L1 cache line is 16 bytes and the L2 is 32 bytes. Since Core 0 and Core 2 are in different L2 cache domains, the relevant cache line size is 32 bytes. The first variable `counterThreadZero` occupies the first 4 bytes of a 32-byte cache line. To ensure `counterThreadTwo` starts on a *new* 32-byte cache line, we need to add 28 bytes of padding. This prevents modifications to one variable from invalidating the cache line holding the other variable.

Problem 3: Threading (14 credits)

a)* The following code is executed on a machine with the specifications given below the code box. Name and briefly explain two likely reasons why the performance of the program below is not optimal on the given machine. Reference the corresponding lines in the code. Briefly describe solutions to the problems.

System Specifications:

  • Ubuntu 22.04.3 LTS
  • 8 cores
  • 2 x NUMA socket รก 4 cores
  • 16 x 32 KiB L1 Cache, 16 x 256 KiB L2 Cache, 2 x 3 MiB L3 cache (shared by 4 cores)
Code Snippet:

#include 
#include 

#define NUM_THREADS 8
#define IMG_WIDTH 16384
#define IMG_HEIGHT 32768

struct Pixel {
    double r, g, b;
};

void normalize_image(int start_row, int end_row, Pixel *image) {
    for (int col = 0; col < IMG_HEIGHT; col++){
        for (int row = start_row; row < end_row; row++){
            /* Stateless function to normalize the pixel in place */
            normalize(&image[row * IMG_WIDTH + col]);
        }
    }
}

int main() {
    std::thread threads[NUM_THREADS];

    /* Allocate Memory for the Pixels of an image */
    Pixel *image = (Pixel *) malloc(IMG_HEIGHT * IMG_WIDTH * sizeof(Pixel));

    /* Read pixels from file and store it in 'image' */
    for (int row = 0; row < IMG_HEIGHT; row++){
        for(int col = 0; col < IMG_WIDTH; col ++){
            image[row * IMG_WIDTH + col] = get_pixel("input.bmp", row, col);
        }
    }

    /* IMG_HEIGHT is divisible by NUM_THREADS -> no undercounting */
    int ps = IMG_HEIGHT / NUM_THREADS;
    for(int i = 0; i < NUM_THREADS; ++i){
        threads[i] = std::thread(normalize_image, i * ps, (i + 1) * ps, image);
    }

    for (int i = 0; i < NUM_THREADS; ++i){
        threads[i].join();
    }
    /* ... */
    return 0;
}
Show Hint

1. **Memory Allocation:** Look at how the `image` buffer is initialized. On a NUMA system, where will that memory be physically located? What happens when threads on the *other* socket try to access it?

2. **Memory Access Pattern:** Look at the nested loops in `normalize_image`. Is the inner loop accessing memory contiguously or with large strides? How does this affect cache performance?

Show Solution

System context: 2 NUMA sockets, 4 cores each (8 total). 8 threads used.

  1. NUMA Effects / First Touch Policy (Section 6.5.2): The `image` data is initialized sequentially by the main thread (Lines 35-39). Due to the First Touch policy, all memory pages are physically allocated on the NUMA node where the main thread runs. During the parallel computation (Lines 44-46), threads running on the remote socket will experience high latency due to remote memory access.
    • Solution: Initialize the `image` data in parallel using the same partitioning scheme used for the computation.
  2. Poor Spatial Locality / Strided Access (Section 6.4.1): In `normalize_image` (Lines 16-21), the loops are nested such that the inner loop iterates over `row`. Since the image is stored in row-major order (`image[row * IMG_WIDTH + col]`), the inner loop accesses memory with a large stride (`IMG_WIDTH`). This leads to poor cache utilization.
    • Solution: Interchange the loops so the inner loop iterates over `col`, ensuring contiguous memory access.
Show Analysis

These two issues are classic performance killers in scientific and image processing codes on modern hardware. The NUMA problem arises from the disconnect between sequential initialization and parallel processing. The locality problem arises from an access pattern that fights against the hardware's prefetching and caching mechanisms. Fixing both is essential for good performance.

b)* List one advantage and one disadvantage of user-level threads vs. system threads. List a typical use case of user-level threads.
Show Hint

Think about the role of the operating system (kernel). Which type of thread requires kernel intervention for scheduling and context switching, and which doesn't? How does this affect performance and capabilities?

Show Solution
  • Advantage: Low overhead for creation and context switching, as they do not require expensive system calls (kernel involvement).
  • Disadvantage: A blocking system call made by one user-level thread may block the entire process, as the OS scheduler is typically unaware of individual user-level threads.
  • Use Case: Implementing fine-grained concurrency models like fibers or coroutines, often used in event-driven applications.
Show Analysis

User-level threads are managed entirely in user space by a runtime library, making them lightweight. System threads (or kernel threads) are managed by the OS. Most modern programming languages (like C++ with `std::thread` or Java) use a mapping to system threads, giving the OS full control over scheduling for true parallelism on multi-core systems.

c)* Briefly explain the difference between Spin-Locks and Yielding Locks. List one advantage and one disadvantage for each.
Show Hint

What does a thread do when it fails to acquire the lock? Does it keep trying actively, or does it give up its turn to run?

Show Solution

(Section 2.4.2)

  • Spin-Lock: A thread waits by continuously polling the lock variable ("busy-waiting").
    • Advantage: Low latency (fast acquisition when the lock is released).
    • Disadvantage: Wastes CPU cycles while spinning.
  • Yielding Lock: A thread yields its execution slice to the OS scheduler (goes to sleep) if the lock is occupied.
    • Advantage: Does not waste CPU resources while waiting.
    • Disadvantage: High latency due to the overhead of context switching and rescheduling.
Show Analysis

The choice between them depends on the expected lock contention time. For very short critical sections where the lock is held briefly, a spin-lock can be more efficient because the cost of spinning is less than the cost of two context switches (to sleep and wake up). For longer critical sections, a yielding lock is better to free up the CPU for other useful work.

Problem 4: SIMD (16 credits)

You have a vector processor that can execute 256-bit vector instructions. Consider the following code:


void operation(float *a, float *b, float *c, int size){
    for(int i = 1; i < size; i++){
        a[i] = a[i] * b[i];
        c[i] = a[i-1] + b[i-1];
    }
}
a)* What problem prevents vectorization of this loop? Name and briefly explain the problem.
Show Hint

Look at the data dependencies between loop iterations. Does an operation in iteration `i` depend on a result calculated in iteration `i-1`?

Show Solution

The problem is a Loop-Carried Flow Dependence (Section 4.2.1). Statement `c[i] = a[i-1] + b[i-1]` at iteration $i$ reads `a[i-1]`, which is written by the statement `a[i] = a[i] * b[i]` in the previous iteration $i-1$. This dependency across iterations prevents the loop from being vectorized, as all iterations in a vector operation must be independent.

Show Analysis

Vectorization relies on executing multiple iterations in parallel. If iteration `i` needs a value that is only computed in iteration `i-1`, they cannot be executed simultaneously. The compiler must respect this data dependency to ensure correctness, and therefore cannot vectorize the loop as written.

b)* Transform the loop using loop-alignment to make it vectorizable. Code Template:

void operation_aligned(float *a, float *b, float *c, int size) {

}
Show Hint

The goal is to remove the loop-carried dependency. Try to rewrite the loop so that the value of `a[i]` used to calculate `c` is the *new* value computed in the same iteration, not the old value from the previous one. This involves shifting the calculation for `c`. You will need to handle the first and last iterations separately (peeling).

Show Solution

We use loop alignment (Section 4.2.2) to shift the execution so the dependency occurs within the same iteration (loop-independent). We align the computation of `a[i]` with `c[i+1]`. This requires peeling the first iteration for the `c` calculation and the last iteration for the `a` calculation.

Solution:

void operation_aligned(float *a, float *b, float *c, int size) {
    if (size <= 1) return;

    // Peel first iteration for the 'c' calculation (original i=1)
    c[1] = a[0] + b[0];

    // Aligned loop (runs from i=1 to size-2)
    for(int i = 1; i < size - 1; i++){
        a[i] = a[i] * b[i];       // S1(i)
        // Now uses the value of a[i] computed in this iteration
        c[i+1] = a[i] + b[i];     // S2(i+1)
    }

    // Peel last iteration for the 'a' calculation (original i=size-1)
    a[size-1] = a[size-1] * b[size-1];
}
Show Analysis

By shifting the `c` calculation to `c[i+1]`, we make it dependent on `a[i]` and `b[i]`, which are both available within the same iteration `i`. The dependency is no longer "carried" across iterations, making the main loop body fully parallelizable. The peeled iterations handle the boundary conditions that don't fit the new pattern.

c) Vectorize your code of subproblem b) using 256-bit vector operations. Work inplace on the provided arrays. Complete the code and add additional code where necessary.

Here is a list of definitions you might find useful:

IntrinsicsIntrinsics
__m256_mm256_storeu_pd
__m256f_mm256_mul_ps
_mm256_load_ps_mm256_mul_pf
_mm256_loadu_ps_mm256_mul_pd
_mm256_load_pf
_mm256_loadu_pf
_mm256_load_pd
_mm256_loadu_pd
_mm256_add_ps
_mm256_add_pf
_mm256_add_pd
__m256i
__m256d
_mm256_store_ps
_mm256_storeu_ps
_mm256_store_pf
_mm256_storeu_pf
_mm256_store_pd
Code Template:

void operation_vectorized(float *a, float *b, float *c, int size){
    int i = 1;
    for(; i < size - 1 - ((size - 2) % 8); i+= ){
        __m256 a256, b256, c256;
        a256 = _mm256_load
        b256 = _mm256_load
        a256 = _mm256_mul
        _mm256_store
        c256 = _mm256_add
        _mm256_store
    }
    for( ; ; ){

    }
}
Show Hint

1. Start with the peeled scalar operations from part b).

2. The vector length (VL) for 256-bit registers with 32-bit floats is 8.

3. Use unaligned loads (`_mm256_loadu_ps`) and stores (`_mm256_storeu_ps`) as alignment is not guaranteed.

4. Implement the main vector loop, processing 8 elements at a time.

5. Add a scalar remainder loop to handle the elements that don't fit into a full vector chunk.

6. Finish with the final peeled scalar operation.

Show Solution

We vectorize the aligned loop. VL = 8 (256-bit / 32-bit float). We use unaligned intrinsics.

Solution:

#include 

void operation_vectorized(float *a, float *b, float *c, int size){
    if (size <= 1) return;

    // Peeling (scalar)
    c[1] = a[0] + b[0];

    int i = 1;
    const int VL = 8;

    // Main Vector Loop (from i=1 up to the last full vector in the size-2 range)
    for(; i <= (size - 2) - VL; i += VL){
        __m256 a256, b256, c256;

        // Load a[i..i+7] and b[i..i+7] (Unaligned)
        a256 = _mm256_loadu_ps(&a[i]);
        b256 = _mm256_loadu_ps(&b[i]);

        // S1(i): a = a * b
        a256 = _mm256_mul_ps(a256, b256);

        // Store updated a
        _mm256_storeu_ps(&a[i], a256);

        // S2(i+1): c = a + b (using the newly computed a)
        c256 = _mm256_add_ps(a256, b256);

        // Store c[i+1..i+8]
        _mm256_storeu_ps(&c[i+1], c256);
    }

    // Remainder Loop (Scalar)
    for( ; i < size - 1; i++ ){
        a[i] = a[i] * b[i];
        c[i+1] = a[i] + b[i];
    }

    // Peeling (scalar)
    a[size-1] = a[size-1] * b[size-1];
}
Show Analysis

This solution correctly implements the full vectorization strategy. It combines the loop alignment transformation from part (b) with SIMD intrinsics. The code correctly handles the boundary conditions with scalar peeling, processes the bulk of the data with a fast vectorized loop, and cleans up any remaining elements with a scalar remainder loop. Using unaligned loads/stores (`_...u_...`) makes the code robust, as it doesn't require specific memory alignment of the input arrays.

Problem 5: OpenMP (11 credits)

In contrast of a dense matrix, where most entries in the matrix are non-zero, in a sparse matrix most entires are zero. To save memory, sparse matrices are usually stored in a sparse matrix format, which only stores the non-zero entries of each row of the matrix.

Dense Matrix Format
1530
0200
0001
2300
Sparse Matrix Format
153
2
1
23
a) Use an OpenMP pragma to ensure efficient parallelization of the loop so that the function below returns the maximum of the values in a sparse matrix. You are allowed to use reductions. Write the pragma in the solution box below the code snippet. Code Snippet:

double get_max_in_matrix(Sparse_Matrix *matrix, int num_rows) {
    double *nnz, max = DBL_MIN;
    int num_nnz;
    /* OMP PRAGMA */
    for (int i = 0; i < num_rows; ++i) {
        /* returns the entries and the number of entries in row i of the matrix */
        get_nonzeros_in_row(matrix, i, &nnz, &num_nnz);
        for (int u = 0; u < num_nnz; u++){
            if (nnz[u] > max) {
                max = nnz[u];
            }
        }
    }
    return max;
}
Show Hint

The number of non-zero elements per row can vary greatly. This suggests an imbalanced workload. What `schedule` clause is best for this? The operation is a maximum reduction. Also, consider which variables need to be private to each thread.

Show Solution

The workload per row in a sparse matrix is irregular, leading to potential load imbalance. We must use `dynamic` scheduling (Section 3.3.3) for efficiency. We also need a reduction for the `max` variable (Section 3.6) and ensure loop-local variables (`nnz`, `num_nnz`) are private.

OMP Pragma:

#pragma omp parallel for reduction(max:max) private(nnz, num_nnz) schedule(dynamic)
Show Analysis

This pragma correctly addresses the key challenges. `reduction(max:max)` handles the parallel maximum finding correctly and efficiently. `private(nnz, num_nnz)` ensures each thread has its own pointers and counters for the row it is processing. `schedule(dynamic)` is crucial for performance with sparse matrices, as it allows threads that finish short rows to dynamically pick up new rows, preventing load imbalance.

b) The code snippet below uses an OpenMP reduction to caculate the sum of an array. Write a semantically equivalent implementation without using an OpenMP reduction. Write the three required OpenMP pragmas in lines 4/5, 9/10 and 15/16 in the code template below. Code Snippet 1:

int calculate_sum(double *data, int size){
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < size; i++) {
        sum += data[i];
    }
    return sum;
}
Code Template:

int calculate_sum2(double *data, int size){
    double sum = 0.0;
    /* OMP PRAGMA1 */

    {
        int local_sum = 0;
        /* OMP PRAGMA2 */

        for (int i = 0; i < size; i++) {
            local_sum += data[i];
        }
        /* OMP PRAGMA3 */

        {
            sum += local_sum;
        }
    }
    return sum;
}
Show Hint

This is a manual reduction. 1. Start a parallel region. 2. Distribute the loop iterations among the threads. Each thread accumulates into its own `local_sum`. 3. After the loop, protect the update to the shared `sum` variable. What construct ensures only one thread can execute a block of code at a time?

Show Solution

We implement a manual reduction using thread-private variables and a critical section for the final aggregation (Section 3.5.1).

Solution:

int calculate_sum2(double *data, int size){
    double sum = 0.0;
    /* OMP PRAGMA1 */
    #pragma omp parallel
    {
        // Note: Template used 'int local_sum', corrected to 'double' as data is double.
        double local_sum = 0.0;
        /* OMP PRAGMA2 */
        #pragma omp for
        for (int i = 0; i < size; i++) {
            local_sum += data[i];
        }
        /* OMP PRAGMA3 */
        #pragma omp critical
        {
            sum += local_sum;
        }
    }
    return sum;
}
Show Analysis

This pattern demonstrates how a reduction works under the hood. Each thread computes a partial sum into a private `local_sum`. The implicit barrier at the end of the `omp for` loop ensures all partial sums are computed before any thread proceeds. Then, the `omp critical` section serializes access to the shared `sum` variable, allowing each thread to safely add its partial result to the global total, avoiding a data race.

c) Name the three elements through which an OpenMP developer/application interacts with the OpenMP runtime.
Show Hint

How do you tell the compiler to parallelize code? How do you query information like the thread ID at runtime? How do you configure the number of threads without changing the code?

Show Solution

(Section 3.1.2)

  1. Compiler Directives (Pragmas, e.g., #pragma omp parallel).
  2. Runtime Library Routines (Functions, e.g., omp_get_thread_num()).
  3. Environment Variables (e.g., OMP_NUM_THREADS).
Show Analysis

These three components form the complete OpenMP API. Directives are the primary method for expressing parallelism. Runtime routines provide fine-grained control and introspection. Environment variables allow for configuring the parallel execution environment without recompiling the code, which is essential for tuning performance on different machines.

Problem 6: MPI (13 credits)

Consider the following MPI program:


void print_members(int *ranks) {
    int i = 0;
    while (ranks[i] >= 0){
        printf("%d ", ranks[i]);
        i++;
    }
    printf("\n");
}

int main(int argc, char *argv[])
{
    int rank, rank_c1, rank_c2, size, row, col;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm c1, c2;
    col = rank / 4;
    row = rank % 4;
    MPI_Comm_split(MPI_COMM_WORLD, row, col, &c1);
    MPI_Comm_split(MPI_COMM_WORLD, col, 4 - row, &c2);
    MPI_Comm_rank(c1, &rank_c1);
    MPI_Comm_rank(c2, &rank_c2);
    int ranks_c1[size];
    memset(ranks_c1, -1, size * sizeof(int));
    int ranks_c2[size];
    memset(ranks_c2, -1, size * sizeof(int));
    MPI_Allgather(&rank, 1, MPI_INT, ranks_c1, 1, MPI_INT, c1);
    MPI_Allgather(&rank, 1, MPI_INT, ranks_c2, 1, MPI_INT, c2);
    if (rank == 6){
        printf("rank_c1: %d members: ", rank_c1);
        print_members(ranks_c1);
        printf("rank_c2: %d members: ", rank_c2);
        print_members(ranks_c2);
    }
    MPI_Finalize();
    return 0;
}
a)* What's the output of the program when run with 12 processes?
Show Hint

Trace the execution for `rank = 6` with `size = 12`. 1. Calculate `row` and `col` for rank 6. 2. For `c1`, determine the `color` (`row`) and `key` (`col`). Identify all other ranks with the same color. Order them by their key to find the new rank (`rank_c1`). 3. For `c2`, determine the `color` (`col`) and `key` (`4 - row`). Identify all other ranks with the same color. Order them by their key to find `rank_c2`. 4. The `MPI_Allgather` will collect the original world ranks from all processes within each new communicator.

Show Solution

Size=12 (Ranks 0-11). `col = rank / 4`; `row = rank % 4`.

We analyze Rank 6: `col`=6/4=1. `row`=6%4=2.

Communicator c1: `MPI_Comm_split(color=row, key=col)` (Section 9.8.2). Rank 6 is in group `color=2` (all ranks where `rank % 4 == 2`). Members: {2, 6, 10}. The keys for these members are their `col` values: Rank 2 (col=0), Rank 6 (col=1), Rank 10 (col=2). Ranks in the new communicator are assigned based on ascending key order: Rank 2 gets new rank 0, Rank 6 gets new rank 1, Rank 10 gets new rank 2. So, `rank_c1` for Rank 6 is 1. `Allgather` collects the world ranks {2, 6, 10}.

Communicator c2: `MPI_Comm_split(color=col, key=4-row)`. Rank 6 is in group `color=1` (all ranks where `rank / 4 == 1`). Members: {4, 5, 6, 7}. The keys for these members are `4-row`: Rank 4 (4-0=4), Rank 5 (4-1=3), Rank 6 (4-2=2), Rank 7 (4-3=1). Ranks are assigned based on ascending key order: Rank 7 (key=1) gets new rank 0, Rank 6 (key=2) gets new rank 1, Rank 5 (key=3) gets new rank 2, Rank 4 (key=4) gets new rank 3. So, `rank_c2` for Rank 6 is 1. `Allgather` collects the world ranks in the new order: {7, 6, 5, 4}.

Output (at Rank 6):


rank_c1: 1 members: 2 6 10 
rank_c2: 1 members: 7 6 5 4 
Show Analysis

This problem tests understanding of `MPI_Comm_split`, which is a powerful tool for creating subgroups of processes. The key is to correctly identify the `color` (which defines the group) and the `key` (which defines the ordering within the group) for the specific rank in question. The new ranks are always assigned starting from 0 based on a sorted order of the keys.

b)* Briefly explain the difference between blocking and nonblocking routines in the terminology used up to MPI 3.1.
Show Hint

When does the function call return? Does the return of the function imply that the operation is complete and the associated memory buffers are safe to reuse?

Show Solution
  • Blocking routines (e.g., MPI_Send): Return only after the operation is locally complete, meaning the communication buffer is safe to reuse (Section 8.4.2).
  • Nonblocking routines (e.g., MPI_Isend): Return immediately after initiating the operation. The application must later verify completion (e.g., using MPI_Wait) before reusing the buffer (Section 9.5).
Show Analysis

The key difference is the decoupling of initiation and completion. Nonblocking routines allow the program to start a communication operation and then continue with other computations, potentially overlapping communication with computation to hide latency. However, this flexibility comes with the responsibility for the programmer to explicitly check for completion before reusing buffers.

c)* Complete the TODO statements in the following MPI program so that it correctly executes the `MPI_Reduce`. Code Template:

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int local_results[] = { rank, rank * 2, rank * 2};
    // TODO: declare recvbuf

    // TODO: reduce local_results into recvbuf using the sum operation at rank 0
    MPI_Reduce(local_results, 

    if (rank == 0) {
        // print content of recvbuf
        print(recvbuf);
    }
    MPI_Finalize();
    return 0;
}
Show Hint

The receive buffer (`recvbuf`) is only significant on the root process (rank 0). It must be large enough to hold the result. The `MPI_Reduce` call needs the send buffer, receive buffer, count, datatype, operation, root rank, and communicator.

Show Solution
Solution:

#include 
// assume print() is defined

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int local_results[] = { rank, rank * 2, rank * 2}; // 3 elements
    
    // TODO: declare recvbuf
    // Needs to hold the result (3 integers), only significant on root
    int recvbuf[3];

    // TODO: reduce local_results into recvbuf using the sum operation at rank 0
    // MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
    MPI_Reduce(local_results, recvbuf, 3, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        // print content of recvbuf
        // print(recvbuf);
    }
    MPI_Finalize();
    return 0;
}
Show Analysis

This is a standard application of `MPI_Reduce`. All processes provide their `local_results` as the send buffer. The operation `MPI_SUM` is applied element-wise across all processes, and the final result of 3 integers is stored in `recvbuf` on the root process (rank 0).

d)* When running the program shown in c) with 4 ranks, what is the content of `recvbuf` when it's passed to the print function?
Show Hint

Write out the `local_results` array for each rank (0, 1, 2, 3). Then, sum the arrays element by element.

Show Solution

Ranks 0, 1, 2, 3. `local_results = {rank, rank*2, rank*2}`.

  • Rank 0: {0, 0, 0}
  • Rank 1: {1, 2, 2}
  • Rank 2: {2, 4, 4}
  • Rank 3: {3, 6, 6}

The sum is performed element-wise:

Sum: {0+1+2+3, 0+2+4+6, 0+2+4+6} = {6, 12, 12}.

The content of `recvbuf` at rank 0 is {6, 12, 12}.

Show Analysis

This demonstrates the element-wise nature of MPI reduction operations. Each element of the output array is the reduction of the corresponding elements from the input arrays of all participating processes.

Problem 7: Hybrid Programming (18 credits)

In the exercises we covered how to program an OpenMP and MPI hybrid. We are now faced with the task to find the average brightest pixel in a large image as shown in the sequential code below.


struct Pixel {
    unsigned short r;
    unsigned short g;
    unsigned short b;
};

void average_luminance(Pixel *pixels, int size){
    double sum_lum = 0;
    for(int i = 0; i < size; i++){
        double luminance = 0.2126 * pixels[i].r
                         + 0.7152 * pixels[i].g
                         + 0.0722 * pixels[i].b;
        sum_lum += luminance;
    }
    printf("Average luminance: %f\n", sum_lum / size);
}
a)* Write the code necessary to safely setup MPI for combined usage with OpenMP in the main function below. Code Template:

int main(int argc, char *argv[]) {


    // assume the arguments are initialized correctly
    average_luminance(pixels, size, root);


    return 0;
}
Show Hint

When using OpenMP threads within an MPI program, you must initialize MPI with a specific function that requests a certain level of thread support. What is this function, and what is the appropriate support level if only the main thread will be making MPI calls?

Show Solution

We must use `MPI_Init_thread` to initialize MPI with thread support (Section 9.13.1). For the typical hybrid model where MPI calls are made outside OpenMP regions by the main thread, `MPI_THREAD_FUNNELED` is sufficient and appropriate (Section 9.13.4).

Solution:

#include 
#include 

// Forward declaration of the function and Pixel struct
struct Pixel; 
void average_luminance(Pixel *pixels, int size, int root);

int main(int argc, char *argv[]) {
    int required = MPI_THREAD_FUNNELED;
    int provided;

    MPI_Init_thread(&argc, &argv, required, &provided);

    if (provided < required) {
        fprintf(stderr, "Error: MPI implementation does not provide required thread support.\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    // Assume pixels, size, root are initialized here for the example
    // average_luminance(pixels, size, root);

    MPI_Finalize();
    return 0;
}
Show Analysis

Using `MPI_Init_thread` is mandatory for correct hybrid programming. It informs the MPI library about the application's threading model. `MPI_THREAD_FUNNELED` guarantees safety when only the main thread makes MPI calls. Higher levels like `MPI_THREAD_SERIALIZED` (multiple threads make calls, but not concurrently) or `MPI_THREAD_MULTIPLE` (fully thread-safe) are available but may incur performance overhead and are not needed for this common hybrid pattern.

b) Write code to parallelize `average_luminance()` using OpenMPI and OpenMP. The pixels are provided at root. Distribute the pixels across all MPI ranks using a collective and speed up the for loop using OpenMP. Your code should provide identical output as running the sequential function.

To solve the tasks b) and c) we have a cluster of compute nodes. You can assume the following conditions:

  • Each compute node has identical hardware
  • Each node has 2 sockets with NUMA between the sockets
  • Each socket has 4 cores with 2-way simultaenous multithreading
  • mpirun starts 4 processes per node
  • The number of processes evenly divides the size of the pixel array
  • Assume no padding is added to the Pixel struct by the compiler
  • You may ignore NUMA and thread affinity for question b)
Code Template:

struct Pixel {
    unsigned short r;
    unsigned short g;
    unsigned short b;
};

void average_luminance(Pixel *pixels, int size, int root){
    int commSize, commRank;
    MPI_Comm_size(MPI_COMM_WORLD, &commSize);
    MPI_Comm_rank(MPI_COMM_WORLD, &commRank);

    Pixel *localPixels = (Pixel*) malloc( );

    double local_sum = 0;
    for(int i = ; ; ){
        double luminance = 0.2126 * localPixels[i].r
                         + 0.7152 * localPixels[i].g
                         + 0.0722 * localPixels[i].b;
        local_sum += luminance;
    }

    double global_sum;


    printf("Average luminance: %f\n", );
}
Show Hint

1. **Data Distribution:** Use `MPI_Scatter` to send chunks of the `pixels` array from the `root` to all processes, including itself. 2. **Custom Datatype:** Since `Pixel` is a struct, you must create a custom `MPI_Datatype` for it before you can use it in communication calls. `MPI_Type_contiguous` is suitable here. 3. **Local Computation:** Use an OpenMP `parallel for` loop with a `reduction(+:local_sum)` to calculate the sum of luminances for the local chunk of pixels. 4. **Global Aggregation:** Use `MPI_Reduce` to sum all the `local_sum` values into a `global_sum` on the `root` process. 5. **Final Calculation:** The `root` process performs the final division and prints the result.

Show Solution

We use a hybrid approach: MPI Scatter to distribute the data, OpenMP parallel for reduction to calculate local sums, and MPI Reduce to aggregate the global sum at the root.

Solution:

#include 
#include 
#include 
#include 

struct Pixel {
    unsigned short r;
    unsigned short g;
    unsigned short b;
};

void average_luminance(Pixel *pixels, int size, int root){
    int commSize, commRank;
    MPI_Comm_size(MPI_COMM_WORLD, &commSize);
    MPI_Comm_rank(MPI_COMM_WORLD, &commRank);

    // 1. Define MPI Datatype for Pixel (3 contiguous unsigned shorts)
    MPI_Datatype MPI_PIXEL;
    MPI_Type_contiguous(3, MPI_UNSIGNED_SHORT, &MPI_PIXEL);
    MPI_Type_commit(&MPI_PIXEL);

    int local_size = size / commSize;

    // 2. Allocate local buffer
    Pixel *localPixels = (Pixel*) malloc(local_size * sizeof(Pixel));

    // 3. Distribute data from root to all processes
    MPI_Scatter(pixels, local_size, MPI_PIXEL, localPixels, local_size, MPI_PIXEL, root, MPI_COMM_WORLD);

    // 4. Local computation (OpenMP Parallelization)
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+:local_sum) schedule(static)
    for(int i = 0; i < local_size; i++){
        double luminance = 0.2126 * localPixels[i].r
                         + 0.7152 * localPixels[i].g
                         + 0.0722 * localPixels[i].b;
        local_sum += luminance;
    }

    // 5. Global reduction to sum up local_sum from all ranks at root
    double global_sum;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, root, MPI_COMM_WORLD);

    // 6. Output at root
    if (commRank == root) {
        printf("Average luminance: %f\n", global_sum / size);
    }

    // Cleanup
    free(localPixels);
    MPI_Type_free(&MPI_PIXEL);
}
Show Analysis

This solution demonstrates a standard and effective hybrid parallelization pattern. - **MPI for Data Distribution:** `MPI_Scatter` is the correct collective to distribute the large dataset across different nodes' memory spaces. - **Custom Datatype:** Creating `MPI_PIXEL` is crucial. MPI needs to know the memory layout of the data it's sending; it cannot automatically handle custom structs. - **OpenMP for Node-level Parallelism:** `omp parallel for reduction` efficiently utilizes the multiple cores within a single node to process its local data chunk. - **MPI for Aggregation:** `MPI_Reduce` is the correct collective to combine the partial results from all nodes back to the root for the final answer.

c)* In testing you realize your program is running much slower than expected. To debug you look at the following MPI bindings report of node 0 and OpenMP environment output of one process. Briefly state all performance issues you see and show how to fix them.

MPI binding report:


rank0: [BB/././.] [./../../..]
rank1: [../.././.] [../BB/../..]
rank2: [../../BB/.] [../../../../]
rank3: [../../../..] [../../../BB]

OPENMP DISPLAY ENVIRONMENT BEGIN


OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '16'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'false'
OMP_PLACES = 
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'

OPENMP DISPLAY ENVIRONMENT END

Show Hint

1. **Hardware vs. Software:** Compare the number of hardware threads per node (sockets * cores * SMT) with the total number of OpenMP threads being launched (MPI processes * `OMP_NUM_THREADS`). Is there a mismatch? 2. **Thread Affinity:** Look at `OMP_PROC_BIND`. Are threads pinned to specific cores, or can they migrate? Why is this important on a NUMA machine? 3. **MPI Process Placement:** The binding report shows 4 MPI ranks on a 2-socket node. Is this the most efficient way to map processes to the hardware for a hybrid code?

Show Solution

Hardware Context: 2 sockets, 4 cores/socket, 2-way SMT = 16 hardware threads/node.
Configuration: 4 MPI processes per node.

Performance Issues:

  1. Massive OpenMP Oversubscription: `OMP_NUM_THREADS = '16'`. Each of the 4 MPI processes attempts to launch 16 threads, resulting in 64 total software threads competing for 16 hardware threads. This causes excessive context switching overhead.
  2. Lack of Thread Pinning (Poor Locality): `OMP_PROC_BIND = 'false'`. Threads are not bound to cores (Section 6.5.3). They can migrate freely across the node, including across NUMA boundaries, which destroys data locality and cache performance.
  3. Suboptimal Hybrid Strategy: Running 4 MPI processes on a 2-socket node (2 per socket) is generally less efficient than the recommended hybrid strategy of 1 MPI process per socket (Section 9.13.4), as it leads to resource contention for memory bandwidth within each socket.

Fixes (Optimal Strategy):

The best approach is to reconfigure the execution to match the hardware: 1 MPI process per socket (2 total per node), and have OpenMP utilize the cores within each socket. We prioritize physical cores (4 per socket) as SMT is often not beneficial for HPC.

1. Fix MPI Execution Command (assuming OpenMPI):


# Start 2 processes per node, mapping one to each socket
mpirun --map-by socket --bind-to socket -np 2 ./your_program

2. Set OpenMP Environment Variables:


export OMP_NUM_THREADS=4       # Utilize the 4 physical cores per socket
export OMP_PROC_BIND=close     # Keep threads close within the socket's cores
export OMP_PLACES=cores        # Define the places for binding as cores
Show Analysis

This is a classic hybrid performance tuning problem. The initial setup completely ignores the machine's NUMA topology and leads to resource contention. The fix aligns the software configuration with the hardware reality: - The `mpirun` command ensures each MPI process "owns" a NUMA socket, maximizing memory locality for that process. - The `OMP_` variables ensure that the OpenMP threads created by an MPI process stay within that process's socket (`--bind-to socket` + `OMP_PROC_BIND=close`) and are pinned to physical cores (`OMP_PLACES=cores`), which prevents migration and avoids the potential negative effects of SMT for compute-bound tasks.