Parallele Programmierung (IN2147) - Overview
Interactive Study Guide - Summer Exam 2025
| Exam: | IN2147 / Endterm | Date: | Friday 1st August, 2025 |
|---|---|---|---|
| Examiner: | Prof. Dr. Martin Schulz | Time: | 11:00-13:00 |
Working Instructions
- This exam consists of 8 problems with a total of 92 credits.
- Subproblems marked by * can be solved without results of previous subproblems.
- Answers are only accepted if the solution approach is documented. Give a reason for each answer unless explicitly stated otherwise.
- Allowed resources: one analog dictionary English ↔ native language.
- Do not write with red or green colors nor use pencils.
Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.
Problem 1: Parallel Runtime (7 credits)
You are given a program whose parallelization you want to measure. First, you run the program on the server FoundForABargainMachine1 with 2 parallel processing units and measure a runtime of 100 seconds. Next, you run the program on the server SomewhatDecentMachine2 with 6 parallel processing units and measure a runtime of 60 seconds. Assume that the parallel processing units of both servers are identical and that the parallel regions have perfect speedup. Show your calculations.
Show Hint
This law describes the theoretical maximum speedup achievable by parallelizing a task, considering the inherent sequential portion of the code. (Ref: Summary 1.3)
Show Solution
The law is Amdahl's Law.
The formula for the execution time T(p) using p processors, given the sequential execution time T and the fraction of parallel execution f (assuming perfect speedup), is:
T(p) = (1 - f)*T + f*T/p
Show Analysis
Amdahl's Law (Section 1.3.2) is fundamental to understanding the limits of strong scaling. It highlights that the serial portion of the code (1-f) quickly becomes the bottleneck, limiting the overall speedup regardless of how many processors (p) are added.
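For reference, Amdahl's Law in the notation used above (parallel fraction f, p processors), written as a short LaTeX sketch:
S(p) = \frac{T(1)}{T(p)} = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}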
Show Hint
Define T_s (serial time) and T_p (parallelizable time at p=1). The formula becomes T(p) = T_s + T_p/p.
You have T(2)=100s and T(6)=60s. Set up a system of two equations and solve for T_s and T_p. Finally, calculate f = T_p / (T_s + T_p).
Show Solution
We use T(p) = T_s + T_p/p.
System of equations:
- 100 = T_s + T_p/2
- 60 = T_s + T_p/6
Subtract equation (2) from (1):
100 - 60 = (T_p/2) - (T_p/6)
40 = T_p(1/2 - 1/6) = T_p(3/6 - 1/6) = T_p(2/6) = T_p/3
T_p = 40 × 3 = 120s
Substitute T_p back into equation (1):
100 = T_s + 120/2
100 = T_s + 60
T_s = 40s
The total sequential execution time T = T_s + T_p = 40s + 120s = 160s.
The fraction of parallel execution f is T_p / T:
f = 120 / 160 = 3/4 = 0.75.
The fraction of parallel execution is 0.75 (or 75%).
Show Analysis
This is a standard method for applying Amdahl's law when the total sequential time is not given directly. By using two different parallel execution times, we can isolate the underlying serial and parallel components of the workload.
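As a quick sanity check, substituting the derived values back into the measured runtimes (LaTeX sketch):
T(2) = T_s + \frac{T_p}{2} = 40 + 60 = 100\,\mathrm{s}, \qquad T(6) = T_s + \frac{T_p}{6} = 40 + 20 = 60\,\mathrm{s}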
Show Hint
Consider the historical trends in CPU development. What happened to clock speed increases (frequency scaling), and what did chip manufacturers do instead?
Show Solution
According to the course summary (Section 1.1 Motivation for Parallel Computing):
- Single-threaded performance improvements are no longer significant (due to power/heat limitations).
- The trend in hardware architecture is towards multi-core chips, meaning performance scales with the number of cores utilized.
Show Analysis
The industry hit the "power wall," making further increases in CPU frequency unsustainable. The shift to multi-core architectures means that software must be written in parallel to take advantage of modern hardware performance gains.
Problem 2: Terminology and Memory (11 credits)
Show Hint
One measures how much faster you can solve a fixed problem by adding resources. The other measures how large a problem you can solve in the same time by adding resources.
Show Solution
The two types of scaling (Section 10.1 Scaling Types) are:
- Strong Scaling: The total problem size is kept constant. It measures how execution time decreases as the number of processors increases.
- Weak Scaling: The workload (problem size) per processor is kept constant. It measures the ability to solve larger problems in the same amount of time as the number of processors increases.
Show Analysis
These metrics are essential for evaluating parallel algorithms. Strong scaling is typically constrained by Amdahl's law. Weak scaling often demonstrates how well an application utilizes larger machine resources for larger problems (related to Gustafson's Law).
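The corresponding quantitative definitions, given here as a standard textbook sketch (not quoted from the course material), with serial fraction s = 1 - f:
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} \quad \text{(strong scaling: speedup and efficiency)}
S_{\text{scaled}}(p) = s + f \cdot p \quad \text{(weak scaling, Gustafson's Law)}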
Show Hint
This is a strong memory model. It requires a global ordering of operations and the preservation of local program order.
Show Solution
According to the summary (Section 4.4.1 Memory Consistency), sequential consistency requires that:
- The result of any execution is the same as if the operations of all processors were executed in some sequential order (a global interleaving exists), AND
- The operations of each individual processor appear in this sequence in the order specified by its program (local program order is preserved).
Show Analysis
Sequential consistency is the most intuitive model for programmers. However, it imposes strict requirements on the hardware, often limiting performance optimizations (like write buffering or instruction reordering), making it expensive to implement on large-scale systems.
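A classic Dekker-style litmus test makes this concrete (a minimal C++ sketch, not exam material; it uses seq_cst atomics, which provide exactly this sequentially consistent behaviour). The outcome r1 == 0 && r2 == 0 is impossible, because any single global interleaving that respects both program orders must execute at least one store before the corresponding load:
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};   // shared flags
int r1 = 0, r2 = 0;            // results observed by each thread

int main() {
    std::thread a([] { x.store(1); r1 = y.load(); });  // store/load default to seq_cst
    std::thread b([] { y.store(1); r2 = x.load(); });
    a.join(); b.join();
    // Under sequential consistency, (r1, r2) may be (0,1), (1,0) or (1,1),
    // but never (0,0): that would require both loads to precede both stores,
    // which no global order respecting program order allows.
    return 0;
}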
Show Hint
What mechanism is required to keep memory synchronized across many different physical locations? How does the overhead of this mechanism scale?
Show Solution
The primary reason is scalability (Section 6.6). Maintaining cache coherence across a large number of nodes in a shared memory system (like NUMA) introduces significant complexity, communication overhead (coherence traffic), and cost, which limits the effective scaling of the system (Section 6.6.2). Distributed memory architectures avoid this overhead.
Show Analysis
While shared memory (ccNUMA) is easier to program, the hardware protocols required to ensure a consistent memory view become a bottleneck as the node count increases. Distributed memory systems prioritize scalability by putting the burden of data movement explicitly on the programmer (e.g., via MPI).
Show Hint
9 nodes suggests a 3x3 arrangement. A torus is a mesh network where the edges wrap around (top connects to bottom, left connects to right).
Show Solution
A 3x3 2D Torus: arrange the 9 nodes in a 3x3 grid; each node (i, j) is linked to its four neighbors ((i±1) mod 3, j) and (i, (j±1) mod 3), so every row and every column forms a ring of 3 nodes, each node has degree 4, and the wrap-around links close the mesh at the edges.
Show Analysis
Network topology (Section 7.3) defines how nodes are connected. The torus topology is advantageous because it eliminates boundaries, providing uniform connectivity (every node has 4 neighbors) and often better bisection bandwidth compared to a simple mesh.
Problem 3: Threading (10 credits)
Suppose you want to write a program that finds the largest value in a large array of non-negative doubles.
Show Hint
What is the specific term for the condition where the outcome depends on the unpredictable timing of thread execution when accessing shared memory, and at least one access is a write?
Show Solution
A Data Race can occur (Section 4.1.1). This leads to undefined behavior and non-deterministic results.
A method to prevent it is using a Mutex (Mutual Exclusion Lock) (Section 2.4.1). A mutex ensures that only one thread can access the critical section (the code modifying the shared data) at a time.
Show Analysis
Data races are a primary source of bugs in shared memory programming. Ensuring thread safety through synchronization mechanisms (like mutexes, atomics, or critical sections) is fundamental to correct parallel execution.
#define NUM_THREADS 8
void thread_max(int start, int end, double *data , double *max) {
}
void find_max_in_parallel(double *array, int array_size, double *array_max) {
int local_size, local_start, local_end;
std::thread threads[NUM_THREADS];
local_size = array_size / NUM_THREADS;
*array_max = -1.0;
for (int i = 0; i < NUM_THREADS; ++i) {
threads[i] = std::thread(thread_max,
);
}
}
Show Hint
Efficiency: Avoid locking a global variable inside the loop. Use a parallel reduction pattern.
Strategy:
1. Create an auxiliary array (e.g., local_maxes[NUM_THREADS]).
2. Calculate start/end indices carefully, ensuring the remainder is handled (load balancing).
3. Pass the address of the thread's slot in the auxiliary array to thread_max.
4. Join the threads and perform the final reduction on the auxiliary array.
Show Solution
#define NUM_THREADS 8
#include <thread>
#include <algorithm>
// The 'max' parameter is used here as a pointer to the storage location
// for this specific thread's local maximum.
void thread_max(int start, int end, double *data , double *max) {
double local_max = -1.0; // Initialize based on non-negative data assumption
for (int i = start; i < end; ++i) {
if (data[i] > local_max) {
local_max = data[i];
}
}
*max = local_max;
}
void find_max_in_parallel(double *array, int array_size, double *array_max) {
int local_size, local_start, local_end;
std::thread threads[NUM_THREADS];
double local_maxes[NUM_THREADS]; // Storage for local results
// Calculate base chunk size and remainder for balanced distribution
local_size = array_size / NUM_THREADS;
int remainder = array_size % NUM_THREADS;
int current_start = 0;
*array_max = -1.0;
for (int i = 0; i < NUM_THREADS; ++i) {
local_start = current_start;
// Distribute the remainder among the first 'remainder' threads
local_end = local_start + local_size + (i < remainder ? 1 : 0);
// Ensure bounds safety and handle case where array_size < NUM_THREADS
if (local_end > array_size) local_end = array_size;
current_start = local_end;
// Launch the thread only if it has work
if (local_start < local_end) {
// Launch the thread, passing the address of its slot in local_maxes
threads[i] = std::thread(thread_max,
local_start, local_end, array, &local_maxes[i]
);
}
}
// Join threads (Fork/Join model, Section 2.3.2) and perform final reduction
for (int i = 0; i < NUM_THREADS; ++i) {
// Check if the thread was launched
if (threads[i].joinable()) {
threads[i].join();
// Reduce (must happen after join to ensure data is written)
if (local_maxes[i] > *array_max) {
*array_max = local_maxes[i];
}
}
}
}
Show Analysis
This solution uses a parallel reduction pattern (Section 6.3.2), which is efficient because it minimizes synchronization overhead.
Common Pitfall (Inefficient): A common mistake is to pass the address of the global array_max to all threads and use a mutex inside thread_max. This would serialize the updates and destroy performance due to lock contention.
Load Balancing: The solution correctly handles arbitrary array sizes by calculating a base size and distributing the remainder, ensuring all elements are processed and the load is balanced.
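For concreteness, a minimal sketch of that inefficient variant (function and variable names are illustrative): every element comparison takes the same lock, so the threads effectively make progress one at a time.
#include <mutex>

std::mutex max_mutex;        // protects shared_max
double shared_max = -1.0;    // assumes non-negative input data

// Correct but slow: the lock is acquired once per element, so all updates
// are serialized and the threads contend heavily on max_mutex.
void update_shared_max(const double *data, int start, int end) {
    for (int i = start; i < end; ++i) {
        std::lock_guard<std::mutex> lock(max_mutex);  // critical section
        if (data[i] > shared_max) shared_max = data[i];
    }
}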
Problem 4: OpenMP (13 credits)
double random_noise();
int operation(double *x, double *y, int n)
{
int i;
double w;
double min_y = y[0];
int min_y_index = 0;
for (i = 0; i < n - 1; ++i) {
w = random_noise();
x[i] = (y[i] + y[i + 1]) / 2. + w;
if (y[i + 1] < min_y) {
min_y = y[i + 1];
min_y_index = i;
}
}
x[i] = y[i] + random_noise();
return min_y_index;
}
Show Hint
1. Use default(none). Identify shared (x, y, n, min_y, min_y_index) and private (i, w) variables.
2. The loop performs a "minloc" (minimum value and index) reduction.
3. Standard OpenMP reductions (Section 3.6) do not support "minloc". You must implement a manual reduction.
4. Use thread-private variables for local minimums and combine them using #pragma omp critical after the loop.
Show Solution
double random_noise();
int operation(double *x, double *y, int n)
{
int i;
double w;
// Initialize global minimums (assuming n > 0)
double min_y = y[0];
int min_y_index = 0;
// Start parallel region. Explicitly define data sharing attributes.
#pragma omp parallel default(none) shared(x, y, n, min_y, min_y_index) private(i, w)
{
// Thread-private copies for local minimum tracking
double local_min_y = min_y;
int local_min_y_index = min_y_index;
// Distribute loop iterations among threads (Section 3.3.3)
#pragma omp for
for (i = 0; i < n - 1; ++i) {
w = random_noise();
x[i] = (y[i] + y[i + 1]) / 2. + w;
// Update local minimum
if (y[i + 1] < local_min_y) {
local_min_y = y[i + 1];
local_min_y_index = i;
}
}
// Implicit barrier at the end of omp for
// Update global minimum in a critical section (Section 3.5.1)
#pragma omp critical
{
if (local_min_y < min_y) {
min_y = local_min_y;
min_y_index = local_min_y_index;
}
}
} // End parallel region
// Handle the final iteration sequentially
if (n > 0) {
i = n - 1;
x[i] = y[i] + random_noise();
}
return min_y_index;
}
Show Analysis
The key challenge is the "minloc" operation. Since it's not a standard reduction, we manually manage it. Threads compute local minimums privately and then combine them globally within a critical section (Section 3.5.1) to ensure atomicity when updating the shared global minimums. This minimizes synchronization overhead.
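As an aside (not required by the exam), OpenMP 4.0 and later can express minloc as a user-defined reduction via declare reduction; a hedged sketch, with the struct, identifier, and function names chosen here for illustration:
#include <omp.h>

typedef struct { double val; int idx; } MinLoc;

// User-defined "minloc" reduction: keep the smaller value and its index.
// Each thread's private copy starts as a copy of the original (omp_orig).
#pragma omp declare reduction(minloc : MinLoc : \
        omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
        initializer(omp_priv = omp_orig)

MinLoc find_min(const double *y, int n) {
    MinLoc best = { y[0], 0 };
    #pragma omp parallel for reduction(minloc : best)
    for (int i = 1; i < n; ++i) {
        if (y[i] < best.val) { best.val = y[i]; best.idx = i; }
    }
    return best;
}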
Now consider the following code snippet:
void main()
{
const int N = 100000;
int a[N];
int b[N];
for (int i = 0; i < N; ++i) {
a[i] = i;
}
#pragma omp parallel for schedule(static , 1)
for (int i = 0; i < N; ++i) {
a[i] = a[i] + 17;
b[i] = a[i] % 23;
}
}
b)* State the 3 performance problems that can occur when running this code snippet on a two-socket system.
Show Hint
Analyze the interaction with a NUMA architecture (two sockets).
1. Initialization is sequential. Consider the "First Touch" policy.
2. schedule(static, 1) is cyclic distribution. What happens when different threads write to adjacent elements (on the same cache line)?
3. How does cyclic distribution affect spatial locality for individual threads?
Show Solution
- NUMA Effects / First Touch Policy (Section 6.5.2): Array `a` is initialized sequentially. The First Touch policy allocates the memory on the socket running the main thread. In the parallel loop, threads on the second socket will suffer high latency due to remote memory access.
- False Sharing (Section 6.4): `schedule(static, 1)` distributes iterations cyclically, so adjacent iterations (which access adjacent memory locations) are processed by different threads. Writes to the same cache line by different threads cause excessive cache coherence traffic (cache line bouncing).
- Poor Spatial Locality / Cache Thrashing (Section 6.4): The cyclic distribution destroys spatial locality for individual threads, as they access scattered elements rather than contiguous blocks. This leads to inefficient cache utilization and prefetching.
Show Analysis
This combination of sequential initialization and cyclic scheduling (static, 1) is notorious for poor performance on NUMA systems. The cyclic schedule maximizes false sharing and minimizes spatial locality. The sequential initialization ensures NUMA penalties for threads on the remote socket.
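One way to address all three issues at once (a sketch under the assumption that the arrays are later processed with the same distribution): initialize the data in parallel so the First Touch policy places each page in the NUMA domain of the thread that will use it, and use a blocked static schedule so each thread works on a contiguous chunk.
#include <omp.h>

#define N 100000
int a[N], b[N];

int main(void) {
    // Parallel first touch: each thread writes the block of 'a' it will later
    // process, so pages land in that thread's NUMA domain.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        a[i] = i;
    }

    // Same blocked static schedule (default chunk size): contiguous per-thread
    // blocks avoid false sharing and preserve spatial locality.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        a[i] = a[i] + 17;
        b[i] = a[i] % 23;
    }
    return 0;
}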
float operation(float *a, float *b, int n) {
float prod = 0.0f;
for (int k = 0; k < n; k++) {
prod *= a[k] + b[k];
}
return prod;
}
Show Hint
1. Use the OpenMP directive for SIMD (vectorization).
2. This is a product reduction. Identify the correct clause.
3. Critical Error Check: The initialization value for a product reduction should be 1.0f, not 0.0f.
Show Solution
(Note: Corrected initialization from 0.0f to 1.0f)
Vectorized Solution (C++ with OpenMP SIMD):
float operation(float *a, float *b, int n) {
float prod = 1.0f; // Corrected initialization for product reduction
#pragma omp simd reduction(*:prod)
for (int k = 0; k < n; k++) {
prod *= a[k] + b[k];
}
return prod;
}
Show Analysis
We use the #pragma omp simd directive (Section 5.5.4). This instructs the compiler to utilize the CPU's vector instructions (like AVX) to process multiple iterations simultaneously within a single thread. The reduction clause is necessary to correctly handle the accumulation across vector lanes.
struct Node {
struct Node *left;
struct Node *right;
};
void visit(struct Node *p);
void traverse(struct Node *p)
{
if (p->left != nullptr) {
#pragma omp
traverse(p->left);
}
if (p->right != nullptr) {
#pragma omp
traverse(p->right);
}
visit(p);
}
void traverse_tree_in_parallel(struct Node *tree)
{
#pragma omp
{
#pragma omp
traverse(tree);
}
}
Show Hint
1. Tree traversal is irregular and recursive, suitable for Task Parallelism (not omp for).
2. Start a parallel region in the wrapper function, and use omp single to initiate the first call.
3. Use omp task for the recursive calls.
4. This is a post-order traversal (visit after children). Synchronization is required before visit(p) using omp taskwait.
Show Solution
struct Node {
struct Node *left;
struct Node *right;
};
void visit(struct Node *p);
void traverse(struct Node *p)
{
if (p->left != nullptr) {
#pragma omp task
traverse(p->left);
}
if (p->right != nullptr) {
#pragma omp task
traverse(p->right);
}
// Wait for child tasks to complete before visiting the node
#pragma omp taskwait
visit(p);
}
void traverse_tree_in_parallel(struct Node *tree)
{
// Start parallel region
#pragma omp parallel
{
// Use a single thread to initiate the traversal (Section 4.3.3)
#pragma omp single
traverse(tree);
}
}
Show Analysis
Task Parallelism (Section 4.3) is designed for this dynamic workload. The taskwait (Section 4.3.4) is crucial to ensure that the child tasks complete before the parent node is visited, maintaining the post-order traversal logic.
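A common practical refinement (not part of the exam task) is to stop creating tasks for very small subtrees; a sketch using the final clause, reusing struct Node and visit from above, with the depth parameter and cutoff added here for illustration:
#define TASK_DEPTH_CUTOFF 8

void traverse_with_cutoff(struct Node *p, int depth)
{
    if (p->left != nullptr) {
        // Beyond the cutoff depth the task is "final": it and all tasks it
        // generates execute immediately, avoiding task-creation overhead.
        #pragma omp task final(depth >= TASK_DEPTH_CUTOFF)
        traverse_with_cutoff(p->left, depth + 1);
    }
    if (p->right != nullptr) {
        #pragma omp task final(depth >= TASK_DEPTH_CUTOFF)
        traverse_with_cutoff(p->right, depth + 1);
    }
    #pragma omp taskwait
    visit(p);
}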
Problem 5: SIMD (17 credits)
Show Hint
Compilers must be conservative to guarantee correctness. Consider issues related to memory access (pointers/aliasing) and code structure (branches/calls).
Show Solution
Compilers must be conservative (Section 5.5.2). Reasons include:
- Potential Data Dependencies / Pointer Aliasing: The compiler cannot prove that different pointers used in the loop do not point to overlapping memory regions (aliasing). To guarantee correctness, it assumes dependencies exist and prevents vectorization.
- Complex Control Flow or Function Calls: The presence of complex branches (`if`/`else`) or calls to functions that cannot be inlined or vectorized prevents the compiler from generating efficient SIMD code (Section 5.5.4).
Show Analysis
Autovectorization relies on static analysis. If the compiler cannot definitively prove independence (e.g., due to ambiguous pointers), it defaults to sequential execution. Manual vectorization allows the programmer to assert this independence based on their knowledge of the code.
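One common way to help the autovectorizer is to assert the absence of aliasing; a minimal sketch using the C99 restrict qualifier (C++ compilers typically offer __restrict__ as an extension):
// The restrict qualifiers promise the compiler that a and b never overlap,
// removing the assumed dependency that otherwise blocks autovectorization.
void scale_add(float *restrict a, const float *restrict b, float c, int n)
{
    for (int i = 0; i < n; ++i) {
        a[i] = c * a[i] + b[i];
    }
}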
Now consider the following sequential code:
void linear_and_avg(float *a, float *b, float c, int size, float *avg_out)
{
float sum = 0.0f;
for (int i = 0; i < size; i ++) {
a[i] = c * a[i] + b[i];
sum += a[i];
}
*avg_out = sum / size;
}
b)* Vectorize the sequential code using 256-bit vector operations in the code template below. Assume that all arrays are unaligned. Do not try to align the arrays but use the correct intrinsics.
Here is a list of definitions you might find useful:
| Intrinsics | Intrinsics | Intrinsics |
|---|---|---|
| __m256 | _mm256_add_ps | __m256i |
| __m256f | _mm256_add_pf | __m256f |
| __m256d | _mm256_add_pd | _mm256_store_ps |
| _mm256_load_ps | _mm256_dp_ps | _mm256_storeu_ps |
| _mm256_loadu_ps | _mm256_dp_pf | _mm256_store_pf |
| _mm256_load_pf | _mm256_dp_pd | |
| _mm256_loadu_pf | _mm256_mul_ps | _mm256_storeu_pf |
| _mm256_load_pd | _mm256_mul_pf | _mm256_store_pd |
| _mm256_loadu_pd | _mm256_mul_pd | _mm256_storeu_pd |
| _mm256_set1_ps | _mm256_set1_pf | _mm256_set1_pd |
void linear_and_avg_vec(float *a, float *b, float c, int size, float *avg_out)
{
__m sum256, a256, b256, c256, prod256, new_a256;
sum256 =
c256 =
float sum = 0.0f;
int i = 0;
for (; i < size - (size % ); ) {
a256 =
b256 =
prod256 =
new_a256 =
sum256 =
}
*avg_out = sum / size;
}
Show Hint
1. VL: 256-bit vectors / 32-bit floats = 8.
2. Initialization: Initialize vector sum to zero (_mm256_set1_ps(0.0f)). Broadcast scalar c (_mm256_set1_ps(c)).
3. Main Loop: Iterate in steps of 8. Use unaligned instructions (_mm256_loadu_ps, _mm256_storeu_ps).
4. Cleanup: Perform a horizontal reduction of the sum vector. Handle the remainder loop.
Show Solution
#include <immintrin.h>
void linear_and_avg_vec(float *a, float *b, float c, int size, float *avg_out)
{
__m256 sum256, a256, b256, c256, prod256, new_a256;
const int VL = 8; // Vector Length for float (32-bit) on 256-bit registers
// Initialize sum vector to zero
sum256 = _mm256_set1_ps(0.0f);
// Broadcast scalar 'c' into a vector
c256 = _mm256_set1_ps(c);
float sum = 0.0f;
int i = 0;
// Main Vector Loop
for (; i < size - (size % VL); i += VL) {
// Load unaligned data
a256 = _mm256_loadu_ps(&a[i]);
b256 = _mm256_loadu_ps(&b[i]);
// Calculate prod = c * a
prod256 = _mm256_mul_ps(c256, a256);
// Calculate new_a = (c * a) + b
new_a256 = _mm256_add_ps(prod256, b256);
// Store the result back to a (unaligned) - Required step
_mm256_storeu_ps(&a[i], new_a256);
// Accumulate the sum
sum256 = _mm256_add_ps(sum256, new_a256);
}
// Horizontal Reduction
float temp_sum[VL];
_mm256_storeu_ps(temp_sum, sum256);
for(int j=0; j<VL; ++j) {
sum += temp_sum[j];
}
// Remainder Loop (Scalar) (Section 5.4.1)
for (; i < size; ++i) {
a[i] = c * a[i] + b[i];
sum += a[i];
}
*avg_out = sum / size;
}
Show Analysis
This solution correctly implements the vectorized version using AVX (256-bit) intrinsics.
Key Requirement Check: The instruction explicitly states to assume arrays are unaligned, mandating the use of the _loadu_ and _storeu_ variants. The structure includes the necessary steps: vectorized main loop, horizontal reduction, and the remainder loop.
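If the target CPU also supports FMA (an assumption not stated in the exam), the multiply and add in the loop body can be fused; a small sketch of that variant:
#include <immintrin.h>

// Computes c*a + b in a single fused multiply-add (requires FMA3, e.g.
// compile with -mfma); otherwise the separate mul/add version above applies.
static inline __m256 linear_step_fma(__m256 a256, __m256 b256, __m256 c256)
{
    return _mm256_fmadd_ps(c256, a256, b256);
}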
Problem 6: MPI Theory (10 credits)
Show Hint
Does the routine need communication or actions from another MPI process to complete its execution?
Show Solution
- Local routine: A routine whose completion depends only on the local calling process. It does not require communication or synchronization events to occur at another process to complete. Example: `MPI_Comm_rank`, `MPI_Comm_size`.
- Non-local routine: A routine whose completion may depend on actions taken by other MPI processes (e.g., matching communication calls or collective participation). Example: `MPI_Barrier`, `MPI_Bcast`.
Show Analysis
This distinction is important for understanding MPI semantics. Note that initiating a non-blocking send (MPI_Isend) is considered a local routine because the function call returns immediately based on local state, even though the underlying communication operation is inherently non-local.
Show Hint
This relates to non-blocking operations and when the resources (like buffers) used in the operation are safe to reuse. (Section 9.5).
Show Solution
- Incomplete procedure: A procedure that initiates a communication operation but may return before the operation is locally complete. Resources (like buffers) cannot be reused yet. Example: `MPI_Isend`, `MPI_Irecv`.
- Completing procedure: A procedure that ensures or verifies the completion of an operation. Upon return, the resources involved are available for reuse. Example: `MPI_Wait`, `MPI_Test` (if completion is detected), or blocking calls like `MPI_Send`.
Show Analysis
Understanding this is vital for managing memory buffers in non-blocking MPI calls. You must not modify a send buffer or read from a receive buffer after an incomplete procedure until a corresponding completing procedure has finished.
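A minimal sketch of that rule (buffer, count, and tag chosen for illustration): the send buffer must stay untouched between the incomplete call and its completing call.
#include <mpi.h>

void send_then_reuse(double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    // Incomplete procedure: returns immediately, the send is only initiated.
    MPI_Isend(buf, count, MPI_DOUBLE, dest, /*tag=*/0, comm, &req);

    // ... independent computation may overlap here, but 'buf' must not be modified ...

    // Completing procedure: after this returns, 'buf' may safely be reused.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}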
Show Hint
In "passive target" synchronization (RMA), consider the role of the target process. What MPI calls are used by the origin process to control the access epoch? The concurrency options refer to the locking mechanisms.
Show Solution
Explanation:
Passive target synchronization (Section 9.11.2, Lock/Unlock) is a mechanism in Remote Memory Access (RMA) where the target process does not actively participate in the synchronization calls. The origin process entirely controls the access epoch to the target's memory window using MPI_Win_lock (to start the epoch) and MPI_Win_unlock (to end it).
Concurrency Options (Lock Types):
- Exclusive Lock (`MPI_LOCK_EXCLUSIVE`): Grants the origin process exclusive access to the target window. No other process can access it concurrently.
- Shared Lock (`MPI_LOCK_SHARED`): Allows multiple processes to concurrently access the target window, provided they also use shared locks (e.g., for concurrent reads).
Show Analysis
Passive synchronization contrasts with Active Target synchronization (like Fence or Post/Start/Complete/Wait), where both the origin and target must participate in synchronization calls. Passive synchronization is useful when you do not want to interrupt the computation occurring on the target process.
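A minimal sketch of a passive-target access epoch (window creation and error handling omitted; function and variable names are illustrative):
#include <mpi.h>

// Origin-side update of one double at displacement 0 in the target's window;
// the target process makes no matching synchronization call (passive target).
void put_value(MPI_Win win, int target_rank, double value)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, /*assert=*/0, win);
    MPI_Put(&value, 1, MPI_DOUBLE,            /* origin buffer         */
            target_rank, /*target_disp=*/0,   /* where on the target   */
            1, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank, win);         /* ends the access epoch */
}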
Problem 7: MPI (15 credits)
Complete the MPI program in the code block below such that it finds the 3 largest elements in 'full_data' consisting of non-negative floats and prints them in the last rank relative to MPI_COMM_WORLD using a manager-worker pattern. Each MPI process searches through a chunk of 'full_data' and sends the 3 locally largest elements to the last rank. The last rank then finds the 3 globally largest elements.
Requirement: In the last rank, the receiving of the partial results and the computation of the 3 globally largest elements should be overlapped using MPI nonblocking communication. You can assume that the total data size is divisible by 3x the number of MPI processes. Use the most efficient MPI calls possible.
Here is a sequential version of the operation:
#define TOTAL_LENGTH 420000000
void read_data(float *output);
void update_top3(const float *data, float *top3_out, int count) {
for (int i = 0; i < count; ++i) {
if (data[i] > top3_out[0]) {
top3_out[2] = top3_out[1];
top3_out[1] = top3_out[0];
top3_out[0] = data[i];
} else if (data[i] > top3_out[1]) {
top3_out[2] = top3_out[1];
top3_out[1] = data[i];
} else if (data[i] > top3_out[2]) {
top3_out[2] = data[i];
}
}
}
int main(int argc, char** argv) {
float * full_data = new float [TOTAL_LENGTH];
read_data(full_data); // initialize data
float top3[3] = {-1.f, -1.f, -1.f};
update_top3(full_data, top3, TOTAL_LENGTH);
print_top3 (top3);
delete[] full_data;
return 0;
}
Code Template (C++ with MPI):
#include <mpi.h>
int main(int argc, char** argv) {
int rank, world_size;
float *full_data = new float[TOTAL_LENGTH];
if (rank == world_size - 1) read_data(full_data); // initialize data
int chunk_size = TOTAL_LENGTH / world_size;
float *local_data = new float[chunk_size];
float top3[3] = { -1.0f, -1.0f, -1.0f };
if (rank == world_size - 1) { // Manager
update_top3(local_data, top3, chunk_size);
MPI_Request *requests = new MPI_Request[world_size - 1];
float *recv_buf = new float[3 * (world_size - 1)];
// receive top3 from all other ranks
for (int from_rank = 0; from_rank < world_size - 1; ++from_rank) {
}
int completed = 0;
while (completed < world_size -1) {
// check if received top3 from any rank and update global top3
}
delete[] recv_buf; delete[] requests;
} else { // Worker
update_top3 (local_data, top3, chunk_size);
}
if (rank == world_size - 1) print_top3 (top3);
delete[] full_data; delete[] local_data;
return 0;
}
Show Hint
1. Data Distribution: Use MPI_Scatter to efficiently distribute chunks from the manager (last rank) to all processes.
2. Worker Logic: Workers calculate local top 3 and use MPI_Send to the manager.
3. Manager Logic (Overlap):
a. Post non-blocking receives (MPI_Irecv) for all incoming results.
b. Use MPI_Waitany in a loop. This returns the index of the first completed receive.
c. Process the data corresponding to that index immediately using update_top3.
Show Solution
#include <mpi.h>
// Assume TOTAL_LENGTH, read_data(), update_top3(), print_top3() are defined.
int main(int argc, char** argv) {
// Initialize MPI environment
MPI_Init(&argc, &argv);
int rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
const int MANAGER_RANK = world_size - 1;
float *full_data = nullptr;
// Initialize data only on the manager (last rank)
if (rank == MANAGER_RANK) {
full_data = new float[TOTAL_LENGTH];
read_data(full_data);
}
int chunk_size = TOTAL_LENGTH / world_size;
float *local_data = new float[chunk_size];
// Distribute the data efficiently using MPI_Scatter (Section 9.7.3)
MPI_Scatter(full_data, chunk_size, MPI_FLOAT, local_data, chunk_size, MPI_FLOAT, MANAGER_RANK, MPI_COMM_WORLD);
float top3[3] = { -1.0f, -1.0f, -1.0f };
if (rank == MANAGER_RANK) { // Manager
// 1. Manager computes its own local top3
update_top3(local_data, top3, chunk_size);
int num_workers = world_size - 1;
MPI_Request *requests = new MPI_Request[num_workers];
// Buffer to hold incoming results (3 floats per worker)
float *recv_buf = new float[3 * num_workers];
// 2. Initiate non-blocking receives (MPI_Irecv, Section 9.5.1)
for (int from_rank = 0; from_rank < num_workers; ++from_rank) {
MPI_Irecv(&recv_buf[from_rank * 3], 3, MPI_FLOAT, from_rank, 0, MPI_COMM_WORLD, &requests[from_rank]);
}
// 3. Wait for completion and process results concurrently
int completed = 0;
while (completed < num_workers) {
// Wait for any request to complete (MPI_Waitany)
int index;
MPI_Waitany(num_workers, requests, &index, MPI_STATUS_IGNORE);
// Process the completed request immediately (Overlap achieved here)
if (index != MPI_UNDEFINED) {
update_top3(&recv_buf[index * 3], top3, 3);
completed++;
}
}
delete[] recv_buf; delete[] requests;
} else { // Worker
// 1. Worker computes its local top3
update_top3 (local_data, top3, chunk_size);
// 2. Worker sends result to the manager
MPI_Send(top3, 3, MPI_FLOAT, MANAGER_RANK, 0, MPI_COMM_WORLD);
}
if (rank == MANAGER_RANK) {
print_top3 (top3);
delete[] full_data; // Clean up full data on manager
}
delete[] local_data;
MPI_Finalize();
return 0;
}
Show Analysis
The solution addresses the requirements effectively:
- Efficiency: `MPI_Scatter` is used for optimal data distribution.
- Overlap: The key requirement is overlapping reception and computation. Using `MPI_Irecv` followed by an `MPI_Waitany` loop achieves this. As soon as one message arrives, the manager processes it (`update_top3`) while waiting for the next message, hiding communication latency with computation.
- Alternative (Not Allowed): Using `MPI_Gather` would simplify the code but would block until all data arrived, failing the overlap requirement.
Problem 8: Hybrid Programming (9 credits)
void compute_energy(double *local_mass, double *local_velocity,
int local_count, double *total_energy)
{
}
Show Hint
1. Local (OpenMP): Use #pragma omp parallel for with a reduction clause to sum the local energy within the MPI rank.
2. Global (MPI): Combine the local sums. Since the result must be available in *each* rank, use MPI_Allreduce.
Show Solution
#include <omp.h>
#include <mpi.h>
void compute_energy(double *local_mass, double *local_velocity,
int local_count, double *total_energy)
{
double local_energy_sum = 0.0;
// 1. Local computation using OpenMP parallel for with reduction (Section 3.6)
// E = 0.5 * m * v^2
#pragma omp parallel for reduction(+:local_energy_sum) schedule(static)
for (int i = 0; i < local_count; ++i) {
double m = local_mass[i];
double v = local_velocity[i];
local_energy_sum += 0.5 * m * v * v;
}
// 2. Global reduction using MPI_Allreduce (Section 9.7.4)
// MPI_Allreduce ensures the result is available in every rank.
MPI_Allreduce(&local_energy_sum, total_energy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
Show Analysis
This is a typical structure for hybrid programming (Section 9.13.4). OpenMP handles the fine-grained parallelism within a node (shared memory), while MPI handles the coarse-grained parallelism across nodes (distributed memory). Using MPI_Allreduce ensures all processes receive the final result.
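One detail this pattern relies on but the snippet does not show (a hedged sketch, not part of the exam solution): a hybrid program should initialize MPI with an explicit thread-support level. MPI_THREAD_FUNNELED suffices here because only the thread outside the OpenMP region calls MPI.
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    // Request FUNNELED: threads exist, but only the main thread makes MPI
    // calls (true for compute_energy, where MPI_Allreduce runs after the
    // OpenMP parallel region).
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        MPI_Abort(MPI_COMM_WORLD, 1);  // requested support level unavailable
    }

    // ... distribute data, call compute_energy(...), print the result ...

    MPI_Finalize();
    return 0;
}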
You run your hybrid program on a cluster of compute nodes. You can assume the following conditions:
- Each compute node has identical hardware
- Each node has 3 sockets and each socket is its own NUMA domain
- Each socket has 6 cores with 2-way simultaneous multithreading (SMT)
- `mpirun` starts 3 processes per node
MPI binding report:
rank0: [BB/../../../../..][../../../../../..][../../../../../..]
rank1: [../BB/../../../..][../../../../../..][../../../../../..]
rank2: [../../BB/../../..][../../../../../..][../../../../../..]
OPENMP DISPLAY ENVIRONMENT BEGIN:
OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '2'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'FALSE'
OMP_PLACES =
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
Show Hint
1. MPI Placement: The binding report shows where the 3 ranks are located (brackets are sockets, 'B' is binding). Are they utilizing all 3 sockets or clustered on one?
2. OpenMP Threads: Are they using all available cores (6 physical)?
3. Pinning: Is thread pinning enabled (`OMP_PROC_BIND`) to ensure locality?
Fix: Aim for 1 MPI rank per socket, utilizing the physical cores within that socket.
Show Solution
Performance Issues:
- Poor MPI Process Placement (Resource Imbalance): The MPI binding report shows all three ranks are bound to the first socket. Sockets 1 and 2 are idle, while Socket 0 is contended. This violates the ideal hybrid strategy of one MPI process per socket (Section 9.13.4).
- OpenMP Thread Undersubscription: `OMP_NUM_THREADS = '2'`, but each socket has 6 physical cores. Using only 2 threads per rank significantly underutilizes the resources.
- Lack of OpenMP Thread Pinning (Poor Locality): `OMP_PROC_BIND = 'FALSE'`. Threads are not bound (Section 6.5.3) and may migrate, hurting cache locality.
Commands to Fix:
We aim to utilize the 6 physical cores per socket (ignoring SMT is often better for HPC).
1. Set OpenMP Environment Variables:
export OMP_NUM_THREADS=6 # Utilize the physical cores per socket
export OMP_PLACES=cores # Bind to physical cores
export OMP_PROC_BIND=close # Keep threads close together within the socket
2. Fix MPI Execution Command (assuming OpenMPI):
mpirun --map-by socket --bind-to socket -np 3 ./your_program
Show Analysis
Affinity (process and thread placement) is crucial on NUMA architectures. The default behavior often leads to suboptimal performance.
The fix ensures that MPI ranks are distributed across the NUMA domains (--map-by socket --bind-to socket) and that the OpenMP threads spawned by each rank remain local to that domain and pinned to specific cores (OMP_PROC_BIND=close, OMP_PLACES=cores). This maximizes memory locality and minimizes contention.