IN2147 Parallel Programming Exam Study Guide (S18)

Parallel Programming (IN2147) - Overview

Interactive Study Guide - Summer Exam 2018

Exam:	IN2147 / Endterm	Date:	Tuesday 24th July, 2018
Examiner:	Prof. Dr. Martin Schulz	Time:	16:00-17:30

Working Instructions

This exam consists of 16 pages with a total of 11 problems. Please make sure now that you received a complete copy of the exam.
The exam duration is exactly 90 minutes.
The total amount of achievable credits in this exam is 90 credits.
Detaching pages from the exam is prohibited.
You are not allowed to bring your own scratch papers. There are extra pages available as scratch in the exam booklet. These pages are only graded if they are clearly referenced by corresponding (sub)problems.
Neither write with red or green colors nor use pencils. Also avoid white out and ink eraser.
Clearly cross out answers which you do not want to be graded.
Subproblems marked by * can be solved without results of previous subproblems.
This is a closed book exam. No additional resources or devices (calculators of any sort) are allowed. Physically turn off all electronic devices (cell phones, smartwatches, etc.) and leave them switched off at all times, put them in your bags, and close the bags. Setting them to silent / airplane mode is not sufficient.

Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.

Problem 1: Amdahl's Law (5 credits)

a)* What can be computed by applying Amdahl's Law?

Show Solution

Amdahl's Law is used to compute the theoretical maximum speedup achievable for a program when executed on a parallel system.

Show Analysis

It takes into account the fraction of the program that is parallelizable ($f$) and the number of processors ($p$). The law highlights the limitations imposed by the sequential portion of the code, which becomes the bottleneck as more processors are added. (Ref: Summary Section 1.3)

b)* Assume you have a program, which is split into regions that either cannot be parallelized at all or that can be parallelized perfectly. Assuming the regions that cannot be parallelized consume 1% of execution time during a sequential run, what is the maximal speedup that can be achieved?

Show Solution

The fraction of execution time that cannot be parallelized (the serial fraction) is $1-f = 0.01$.

The maximal speedup is achieved when the number of processors $p \to \infty$.

The formula for maximum speedup is $SU(\infty) = \frac{1}{1-f}$.

$SU(\infty) = \frac{1}{0.01} = 100$.

The maximal speedup is 100.

Problem 2: Lock Granularity (4 credits)

Lock granularity can have a significant impact on performance of shared memory codes.

a)* Name one possible reason for a negative performance impact in case lock granularity is chosen too fine:

Show Solution

High Locking Overhead: Using many fine-grained locks requires frequent acquisition and release operations. The overhead associated with these synchronization calls (e.g., atomic instructions, memory fences) can become significant and degrade performance.

b)* Name one possible reason for a negative performance impact in case lock granularity is chosen too coarse:

Show Solution

High Lock Contention / Serialization: Using a single coarse-grained lock protects a large critical section, which reduces the available parallelism. This increases the likelihood that multiple threads will try to acquire the same lock simultaneously and be forced to wait, leading to serialization of the execution. (Ref: Section 6.3.2)

Problem 3: Shared memory parallelization using OpenMP (6 credits)

a)* Parallelize for loop in the following function using OpenMP by replacing <OPENMP HERE> with the appropriate OpenMP directive(s) and clause(s). Write your solution in the box below.

void function(float *a, float *b, int n)
{
    int i;
    <OPENMP HERE>
    for (i = 0; i < n - 1; i++)
        a[i] = (b[i] + b[i + 1]) / 2;
    a[i] = b[i] / 2;
}

Show Solution

#pragma omp parallel for private(i)

Show Analysis

The loop iterations are independent (no loop-carried dependencies), so a simple parallel for is sufficient. Since the loop variable `i` is declared outside the parallel region in C style, it should be explicitly made private for correctness, although OpenMP often handles this implicitly for loop variables.

b)* Parallelize for loop in the following function using OpenMP by replacing <OPENMP HERE> with the appropriate OpenMP directive(s) and clause(s). Write your solution in the box below.

double function(float *x, int *y, int n)
{
    int i, b = y[0];
    float a = 1.0;
    <OPENMP HERE>
    for (i = 0; i < n; i++)
    {
        a += x[i];
        if (b < y[i])
            b = y[i];
    }
    return a * b;
}

Show Solution

#pragma omp parallel for private(i) reduction(+:a) reduction(max:b)

Show Analysis

This loop performs two independent reductions: a summation on the float variable `a` and a maximum value search on the integer variable `b`. OpenMP's `reduction` clause (Section 3.6) is the correct and most efficient way to handle this pattern.

Problem 4: OpenMP Variable Privatization (14 credits)

Consider the following program being executed with OMP_NUM_THREADS set to 16.

int value[64], k, t;
for (int i = 0; i < 64; i++)
    value[i] = 63 - i;

t = omp_get_thread_num();
k = 42;
#pragma omp parallel for schedule(static, 2) CLAUSE
for (int i = 0; i < 64; i++)
{
    k = value[i] * 2 + t;
}
printf("Final value of k=%i\n", k);

a)* Which iterations are executed by thread 13 in the parallel for loop?

Show Solution

With 64 iterations and a chunk size of 2, there are 32 total chunks. `schedule(static, 2)` distributes these chunks in a round-robin fashion to the 16 threads.

Thread T executes chunks T, T+16, T+32, ...

Therefore, Thread 13 executes chunk 13 and chunk (13+16) = 29.

Chunk 13 corresponds to iterations: [13*2, 13*2+1] = [26, 27].
Chunk 29 corresponds to iterations: [29*2, 29*2+1] = [58, 59].

The iterations executed by thread 13 are: 26, 27, 58, 59.

b)* Fill in the following table. If multiple values are possible, state MULTIPLE.

CLAUSE set to:	Printed value of k
NONE
`private(k,t)`
`shared(k,t)`
`firstprivate(k,t)`
`lastprivate(k,t)`
`firstprivate(k,t) lastprivate(k,t)`

Show Solution

CLAUSE set to:	Printed value of k
NONE	MULTIPLE
`private(k,t)`	42
`shared(k,t)`	MULTIPLE
`firstprivate(k,t)`	42
`lastprivate(k,t)`	MULTIPLE
`firstprivate(k,t) lastprivate(k,t)`	0

Show Analysis

The key is understanding the data-sharing attributes of `k` and `t`.

NONE / shared(k,t): By default, `k` and `t` are shared (Section 3.4.1). `t` is read-only with value 0 (from the master thread). Multiple threads write to the shared `k` concurrently, causing a data race. The final value is unpredictable.
private(k,t): `k` and `t` are private to each thread and are uninitialized within the parallel region. Any modifications made by threads are local. The original global `k` (initialized to 42) is never touched by the parallel loop and remains 42.
firstprivate(k,t): Private copies are created for each thread and initialized with the master thread's values before the loop (k=42, t=0). Modifications are still local. The global `k` remains 42.
lastprivate(k,t): `k` and `t` are private and uninitialized. The value of `k` from the thread that executes the logically last iteration (i=63) is copied back to the global `k`. However, the private `t` used in that thread's calculation is uninitialized, so the final result is indeterminate (MULTIPLE).
firstprivate(k,t) lastprivate(k,t): Private copies are initialized (k=42, t=0). The last iteration (i=63) is executed by some thread. That thread calculates `k = value[63]*2 + t = (63-63)*2 + 0 = 0`. This final value of 0 is copied back to the global `k`.

Problem 5: OpenMP Tasking (3 credits)

a)* List the OpenMP pragma used to specify a task:

Show Solution

#pragma omp task

b)* What is the difference between a tied and an untied task?

Show Solution

A tied task (the default) is bound to the thread that first begins executing it. If the task is suspended (e.g., at a task scheduling point), it must be resumed by the same thread.

An untied task can be suspended by one thread and later resumed by any available thread in the team, which offers more scheduling flexibility.

Show Analysis

(Ref: Section 4.3.2) Untied tasks are particularly useful for scenarios involving nested tasks or I/O operations where the original thread might become blocked, allowing another available thread to pick up the work and continue making progress.

Problem 6: Performance issues of a two-socket node (6 credits)

a) Name three potential performance drawbacks in the following code snippet if it is executed on a two-socket node.

void main()
{
    const int N = 1000000;
    int a[N];
    int b[N];
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
    }
    #pragma omp parallel for schedule(static, 1)
    for (int i = 0; i < N; i++)
    {
        a[i] = a[i] + 17;
        b[i] = a[i] / 23;
    }
}

Show Solution

NUMA Effects (First Touch Policy): The array `a` is initialized sequentially by the main thread. On a NUMA system, the operating system's "First Touch" policy allocates the physical memory pages on the socket where that thread is running. During the subsequent parallel loop, threads running on the *other* socket will suffer high latency due to remote memory accesses. (Ref: Section 6.5.2)
False Sharing: The `schedule(static, 1)` directive implements a cyclic (or round-robin) distribution of loop iterations. This means adjacent iterations (e.g., `i` and `i+1`) are assigned to different threads. Since these iterations write to adjacent memory locations (`a[i]` and `a[i+1]`, `b[i]` and `b[i+1]`), it is highly likely that different threads will be modifying data within the same cache line. This causes excessive cache coherence traffic as the cache line is repeatedly invalidated and transferred between the sockets. (Ref: Section 6.4)
Poor Spatial Locality: The cyclic distribution also destroys spatial locality for each thread. Instead of accessing a contiguous block of memory, each thread accesses scattered elements (e.g., `a[0]`, `a[16]`, `a[32]`, ...). This prevents effective use of the cache and hardware prefetchers, leading to increased memory bandwidth pressure and more cache misses.

Problem 7: Memory Consistency (4 credits)

a)* What conditions must hold for a program to be sequentially consistent?

Show Solution

For a program to be sequentially consistent, two conditions must be met:

The result of any execution must be the same as if the operations of all processors were executed in some single sequential order (a global total order of operations exists).
The operations of each individual processor must appear in this sequence in the order specified by its program (program order is maintained for each processor).

Show Analysis

(Ref: Section 4.4.1) Sequential consistency is the most intuitive memory model for programmers, as it behaves as if all memory operations were simply interleaved in some global sequence. However, enforcing this strict ordering can prohibit many hardware and compiler optimizations, making it less performant than more relaxed memory models.

Problem 8: Message Passing basics (5 credits)

a)* What is the difference between an eager and a rendevouz protocol?

Show Solution

Eager Protocol: The sender transmits the message data immediately, assuming the receiver has buffer space. The message is buffered by the system if the corresponding receive has not yet been posted. This is typically used for small messages to minimize latency.
Rendezvous Protocol: The sender and receiver first perform a handshake to synchronize. The sender waits for a "ready to receive" signal from the receiver before transmitting the actual data. This is used for large messages to avoid the overhead of creating large system buffers.

Show Analysis

(Ref: Section 9.3) The choice between these protocols is typically handled automatically by the MPI library based on a message size threshold. The goal is to balance the low-latency advantage of the eager protocol against the memory efficiency of the rendezvous protocol.

b)* Describe the functionality of MPI_Waitany.

Show Solution

MPI_Waitany is a blocking function used to complete one operation from a set of non-blocking MPI communications. It takes an array of MPI_Request handles and waits until at least one of the operations in the array has completed. It then returns the index of the single request that completed, allowing the program to process the result of that specific operation.

Show Analysis

(Ref: Section 9.5) This function is useful for overlapping communication and computation when results can be processed in any order as they arrive.

Problem 9: MPI communicators and collective operations (20 credits)

Consider the following MPI program:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int value, temp;
    int size, rank, row, col;
    MPI_Comm c1, c2, c3;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    col = rank % 4;
    row = rank / 4;

    MPI_Comm_split(MPI_COMM_WORLD, row, col, &c1);
    MPI_Comm_split(MPI_COMM_WORLD, col, row, &c2);
    MPI_Comm_split(MPI_COMM_WORLD, rank, col, &c3);

    MPI_Comm_rank(c1, &value);

    MPI_Comm_rank(c2, &value);

    MPI_Allreduce(&rank, &value, 1, MPI_INT, MPI_MAX, c3);
    temp = value;

    MPI_Allreduce(&temp, &value, 1, MPI_INT, MPI_SUM, c2);
    temp = value;

    MPI_Allreduce(&temp, &value, 1, MPI_INT, MPI_MIN, c1);

    MPI_Finalize();
}

a)* Describe the functionality of MPI_Comm_split().

Show Solution

MPI_Comm_split is a collective operation that partitions the processes of an existing communicator into one or more disjoint subgroups. A new communicator is created for each subgroup. The partitioning is determined by the color argument: all processes that provide the same integer `color` value become part of the same new communicator. The key argument is used to determine the rank of each process within its new communicator (lower keys result in lower ranks).

Show Analysis

(Ref: Section 9.8.2) This function is fundamental for creating custom communication topologies, such as grids or task-based groups, enabling collective operations on specific subsets of processes.

b)* Assuming the program is executed with 12 MPI processes, sketch the communicators MPI_COMM_WORLD, c1, c2 and c3 by enclosing circles around the MPI processes (represented by the boxes with their respective ranks in MPI_COMM_WORLD) that are in the same communicator.

Show Solution

c)* Assuming the program is executed with 12 MPI processes, complete the table below according to the content of variable value at specific points in program in each rank:

Line number	rank	value
23	6
25	7
27	8
30	9
33	10

Show Solution

Line number	rank	value
23	6	2
25	7	1
27	8	8
30	9	15
33	10	12

Show Analysis

Let's trace the execution step-by-step:

L23, rank 6: `MPI_Comm_rank(c1, &value)`. Rank 6 is in row 1 (`color=1`). The members of this row are {4, 5, 6, 7}. Their keys are their column numbers {0, 1, 2, 3}. So, rank 6 has the 3rd lowest key, making its rank in `c1` equal to 2. `value` becomes 2.
L25, rank 7: `MPI_Comm_rank(c2, &value)`. This overwrites the previous value. Rank 7 is in column 3 (`color=3`). The members of this column are {3, 7, 11}. Their keys are their row numbers {0, 1, 2}. Rank 7 has the 2nd lowest key, making its rank in `c2` equal to 1. `value` becomes 1.
L27, rank 8: `MPI_Allreduce(&rank, &value, MAX, c3)`. For `c3`, the color is the world rank, so each process is in its own communicator of size 1. The `MPI_MAX` of a single value is just the value itself. So, `value` becomes `rank`, which is 8. Then `temp` is set to 8.
L30, rank 9: `MPI_Allreduce(&temp, &value, SUM, c2)`. Rank 9 is in column 1 (`color=1`) with members {1, 5, 9}. The `temp` value for each of these ranks (from the previous step) is their world rank. The sum is 1 + 5 + 9 = 15. This sum is distributed to all members of column 1. `value` becomes 15.
L33, rank 10: `MPI_Allreduce(&temp, &value, MIN, c1)`. Rank 10 is in row 2 (`color=2`) with members {8, 9, 10, 11}. We need the `temp` values for each of these ranks, which were set after the column-wise sum in the previous step.
- For rank 8 (col 0): sum = 0+4+8 = 12.
- For rank 9 (col 1): sum = 1+5+9 = 15.
- For rank 10 (col 2): sum = 2+6+10 = 18.
- For rank 11 (col 3): sum = 3+7+11 = 21.
The `MPI_MIN` across the `temp` values {12, 15, 18, 21} is 12. This minimum is distributed to all members of row 2. `value` becomes 12.

Problem 10: Blocking and non-bocking collective MPI (7 credits)

Consider the following MPI program:

// function to initialize two of size n
void init(float *x, float *y, int n){...}

// very expensive function which only works on elements of input array x
void costly_calculation(float *x, int n){...}

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float *a = (float *)malloc(size * sizeof(float));
    float *b = (float *)malloc(size * sizeof(float));
    init(a, b, size);
    float data = rank;

    // point to point communication
    if (rank == 0)
        for (int r = 1; r < size; r++)
        {
            a[0] = data;
            MPI_Recv(a + r, 1, MPI_FLOAT, r, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    else
        MPI_Send(&data, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD);

    costly_calculation(b, size);

    costly_calculation(a, size);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}

a)* List the blocking collective operation with its parameters that can be used to replace the point to point communication between lines 20 to 28.

Show Solution

MPI_Gather(&data, 1, MPI_FLOAT, a, 1, MPI_FLOAT, 0, MPI_COMM_WORLD);

Show Analysis

The code implements a pattern where every process (including rank 0) contributes a single float (`data`), and rank 0 collects all these floats into an array (`a`). This is the definition of the `MPI_Gather` collective operation (Ref: Section 9.7.3).

b)* List the non-blocking collective operation with its parameters that can be used to replace the blocking communication from previous subproblem (a).

Show Solution

Assuming MPI_Request request; is declared:

MPI_Igather(&data, 1, MPI_FLOAT, a, 1, MPI_FLOAT, 0, MPI_COMM_WORLD, &request);

c)* Name the appropriate operation when using a non-blocking collective to insure its completion and specify the proper line number to put this operation for getting the best performance.

Show Solution

Operation: MPI_Wait(&request, MPI_STATUS_IGNORE);

Line Number: The call should be placed at Line 31.

Show Analysis

The goal of non-blocking communication is to overlap communication with computation. The `costly_calculation(b, size)` at line 30 is independent of the gather operation (which fills array `a`). Therefore, we can initiate the `MPI_Igather` before line 30, perform the calculation on `b`, and then place the `MPI_Wait` immediately before line 32, where the result of the gather (`a`) is first needed. This hides the communication latency behind the computation on `b`.

Problem 11: Efficient MPI reduction with virtual hypercube topology (16 credits)

A hypercube of dimension d consists of p=2^d nodes; there exists a direct connection between two nodes n1 and n2 if (and only if) the binary representation of n1 and n2 differs by exactly one bit.

In this task we want to implement an efficient summation reduction across all MPI processes in a communicator optimized for the hypercube topology using point-to-point MPI operations, where MPI process with rank n is mapped to node n in the hypercube. All ranks are with respect to the communicator c. For simplification, we restrict ourselves to the case where rank 0 is the root of the reduction and the size of c is a power of 2 (i.e., the size is 2^d). Under the assumption that each point-to-point operation takes one timestep, the overall reduction across all MPI processes in c should be performed in at most d timesteps and communication is only allowed with direct neighbors in the virtual hypercube topology.

a)* Sketch the three communication timesteps of the reduction using hypercube topology for p = 8 processes by adding arrows to the edges according to the communication direction.

Show Solution

Show Analysis

This reduction algorithm uses a recursive halving (or binomial tree) communication pattern. We iterate over the dimensions of the hypercube from k=0 to d-1.

The algorithm proceeds in `d` steps, where `d` is the dimension of the hypercube (d=3 for 8 processes).
In each step `k` (from 0 to d-1), processes are paired up based on the k-th bit of their rank.
A process whose k-th bit is 1 sends its data to its partner whose k-th bit is 0. The receiving process adds the value to its own.
Timestep 1 (k=0, LSB): Processes with an odd rank (k-th bit is 1) send to their even-ranked neighbor. (1→0, 3→2, 5→4, 7→6). After this, only ranks {0, 2, 4, 6} hold partial sums.
Timestep 2 (k=1): Now considering the 2nd bit (value 2). Processes where this bit is 1 send to partners where it's 0. (2→0, 6→4). Now only ranks {0, 4} hold partial sums.
Timestep 3 (k=2, MSB): Considering the 3rd bit (value 4). The process where this bit is 1 sends to its partner. (4→0).

After d=3 steps, rank 0 has accumulated the final sum from all other processes.

b)* Implement the efficient summation reduction of a float buffer according to your communication schema developed in subproblem (a) in MPI process with rank != 0 using a virtual hypercube topology of dimension d and blocking point-to-point MPI operations.

/**
* @brief efficient summation reduction in rank zero using
*
* virtual hypercube topology
* @param d dimension of hypercube
* @param rank rank of MPI process within communicator c
* @param p number of MPI processes in communicator c
* @param buffer communication buffer
* @param c communicator
* @return float reduced value
*/
float hc_zero_reduce(int d, int rank, int p, float buffer, MPI_Comm c)
{
    if (p != 1 << d) {
        MPI_Abort(c, 1);
    }
}

Show Solution

#include <mpi.h>

float hc_zero_reduce(int d, int rank, int p, float buffer, MPI_Comm c)
{
    if (p != 1 << d) {
        MPI_Abort(c, 1);
    }

    float result = buffer;
    float recv_val;
    int tag = 0;

    // Iterate through the dimensions from k=0 to d-1 (Recursive Halving)
    for (int k = 0; k < d; k++) {
        // Mask for the k-th bit
        int mask = 1 << k;

        // Determine role based on the k-th bit of the current rank
        if ((rank & mask) != 0) {
            // If the k-th bit is 1, this process is a SENDER in this step.
            // Calculate the partner's rank by flipping the k-th bit to 0.
            int partner = rank ^ mask;
            MPI_Send(&result, 1, MPI_FLOAT, partner, tag, c);
            // Once a process sends its data, it is done participating.
            break;
        } else {
            // If the k-th bit is 0, this process is a RECEIVER in this step.
            // Calculate the partner's rank by flipping the k-th bit to 1.
            int partner = rank ^ mask;
            // The problem statement guarantees p=2^d, so the partner always exists.
            MPI_Recv(&recv_val, 1, MPI_FLOAT, partner, tag, c, MPI_STATUS_IGNORE);
            // Perform the summation
            result += recv_val;
        }
    }
    // Rank 0 will complete the loop and return the final sum.
    // Other ranks will break out of the loop after they send.
    return result;
}