Parallele Programmierung (IN2147) - Overview
Interactive Study Guide - Summer Exam 2025
| Exam: | IN2147 / Endterm | Date: | Friday 1st August, 2025 |
|---|---|---|---|
| Examiner: | Prof. Dr. Martin Schulz | Time: | 11:00-13:00 |
Working Instructions
- This exam consists of 8 problems with a total of 92 credits.
- Subproblems marked by * can be solved without results of previous subproblems.
- Answers are only accepted if the solution approach is documented. Give a reason for each answer unless explicitly stated otherwise.
- Allowed resources: one analog dictionary English ↔ native language.
- Do not write with red or green colors nor use pencils.
Select a problem from the sidebar to begin studying. Use the toggles to reveal hints, solutions, and analysis.
Problem 1: Parallel Runtime (7 credits)
You are given a program whose parallelization you want to measure. First, you run the program on the server FoundForABargainMachine1 with 2 parallel processing units and measure a runtime of 100 seconds. Next, you run the program on the server SomewhatDecentMachine2 with 6 parallel processing units and measure a runtime of 60 seconds. Assume that the parallel processing units of both servers are identical and that the parallel regions have perfect speedup. Show your calculations.
Show Hint
This law describes the theoretical maximum speedup achievable by parallelizing a task, considering the inherent sequential portion of the code. (Ref: Summary 1.3)
Show Solution
The law is Amdahl's Law.
The formula for the execution time T(p) using p processors, given the sequential execution time T and the fraction of parallel execution f (assuming perfect speedup), is:
T(p) = (1 - f)*T + f*T/p
Show Analysis
Amdahl's Law (Section 1.3.2) is fundamental to understanding the limits of strong scaling. It highlights that the serial portion of the code (1-f) quickly becomes the bottleneck, limiting the overall speedup regardless of how many processors (p) are added.
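For reference, Amdahl's Law in the notation used above (parallel fraction f, p processors), written as a short LaTeX sketch:
S(p) = \frac{T(1)}{T(p)} = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}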
Show Hint
Define T_s (serial time) and T_p (parallelizable time at p=1). The formula becomes T(p) = T_s + T_p/p.
You have T(2)=100s and T(6)=60s. Set up a system of two equations and solve for T_s and T_p. Finally, calculate f = T_p / (T_s + T_p).
Show Solution
We use T(p) = T_s + T_p/p.
System of equations:
- 100 = T_s + T_p/2
- 60 = T_s + T_p/6
Subtract equation (2) from (1):
100 - 60 = (T_p/2) - (T_p/6)
40 = T_p(1/2 - 1/6) = T_p(3/6 - 1/6) = T_p(2/6) = T_p/3
T_p = 40 × 3 = 120s
Substitute T_p back into equation (1):
100 = T_s + 120/2
100 = T_s + 60
T_s = 40s
The total sequential execution time T = T_s + T_p = 40s + 120s = 160s.
The fraction of parallel execution f is T_p / T:
f = 120 / 160 = 3/4 = 0.75.
The fraction of parallel execution is 0.75 (or 75%).
Show Analysis
This is a standard method for applying Amdahl's law when the total sequential time is not given directly. By using two different parallel execution times, we can isolate the underlying serial and parallel components of the workload.
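As a quick sanity check, substituting the derived values back into the measured runtimes (LaTeX sketch):
T(2) = T_s + \frac{T_p}{2} = 40 + 60 = 100\,\mathrm{s}, \qquad T(6) = T_s + \frac{T_p}{6} = 40 + 20 = 60\,\mathrm{s}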
Show Hint
Consider the historical trends in CPU development. What happened to clock speed increases (frequency scaling), and what did chip manufacturers do instead?
Show Solution
According to the course summary (Section 1.1 Motivation for Parallel Computing):
- Single-threaded performance improvements are no longer significant (due to power/heat limitations).
- The trend in hardware architecture is towards multi-core chips, meaning performance scales with the number of cores utilized.
Show Analysis
The industry hit the "power wall," making further increases in CPU frequency unsustainable. The shift to multi-core architectures means that software must be written in parallel to take advantage of modern hardware performance gains.
Problem 2: Terminology and Memory (11 credits)
Show Hint
One measures how much faster you can solve a fixed problem by adding resources. The other measures how large a problem you can solve in the same time by adding resources.
Show Solution
The two types of scaling (Section 10.1 Scaling Types) are:
- Strong Scaling: The total problem size is kept constant. It measures how execution time decreases as the number of processors increases.
- Weak Scaling: The workload (problem size) per processor is kept constant. It measures the ability to solve larger problems in the same amount of time as the number of processors increases.
Show Analysis
These metrics are essential for evaluating parallel algorithms. Strong scaling is typically constrained by Amdahl's law. Weak scaling often demonstrates how well an application utilizes larger machine resources for larger problems (related to Gustafson's Law).
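The corresponding quantitative definitions, given here as a standard textbook sketch (not quoted from the course material), with serial fraction s = 1 - f:
S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p} \quad \text{(strong scaling: speedup and efficiency)}
S_{\text{scaled}}(p) = s + f \cdot p \quad \text{(weak scaling, Gustafson's Law)}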
Show Hint
This is a strong memory model. It requires a global ordering of operations and the preservation of local program order.
Show Solution
According to the summary (Section 4.4.1 Memory Consistency), sequential consistency requires that:
- The result of any execution is the same as if the operations of all processors were executed in some sequential order (a global interleaving exists), AND
- The operations of each individual processor appear in this sequence in the order specified by its program (local program order is preserved).
Show Analysis
Sequential consistency is the most intuitive model for programmers. However, it imposes strict requirements on the hardware, often limiting performance optimizations (like write buffering or instruction reordering), making it expensive to implement on large-scale systems.
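A classic Dekker-style litmus test makes this concrete (a minimal C++ sketch, not exam material; it uses seq_cst atomics, which provide exactly this sequentially consistent behaviour). The outcome r1 == 0 && r2 == 0 is impossible, because any single global interleaving that respects both program orders must execute at least one store before the corresponding load:
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};   // shared flags
int r1 = 0, r2 = 0;            // results observed by each thread

int main() {
    std::thread a([] { x.store(1); r1 = y.load(); });  // store/load default to seq_cst
    std::thread b([] { y.store(1); r2 = x.load(); });
    a.join(); b.join();
    // Under sequential consistency, (r1, r2) may be (0,1), (1,0) or (1,1),
    // but never (0,0): that would require both loads to precede both stores,
    // which no global order respecting program order allows.
    return 0;
}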
Show Hint
What mechanism is required to keep memory synchronized across many different physical locations? How does the overhead of this mechanism scale?
Show Solution
The primary reason is scalability (Section 6.6). Maintaining cache coherence across a large number of nodes in a shared memory system (like NUMA) introduces significant complexity, communication overhead (coherence traffic), and cost, which limits the effective scaling of the system (Section 6.6.2). Distributed memory architectures avoid this overhead.
Show Analysis
While shared memory (ccNUMA) is easier to program, the hardware protocols required to ensure a consistent memory view become a bottleneck as the node count increases. Distributed memory systems prioritize scalability by putting the burden of data movement explicitly on the programmer (e.g., via MPI).
Show Hint
9 nodes suggests a 3x3 arrangement. A torus is a mesh network where the edges wrap around (top connects to bottom, left connects to right).
Show Solution
A 3x3 2D Torus: arrange the 9 nodes in a 3x3 grid; each node (i, j) is linked to its four neighbors ((i±1) mod 3, j) and (i, (j±1) mod 3), so every row and every column forms a ring of 3 nodes, each node has degree 4, and the wrap-around links close the mesh at the edges.
Show Analysis
Network topology (Section 7.3) defines how nodes are connected. The torus topology is advantageous because it eliminates boundaries, providing uniform connectivity (every node has 4 neighbors) and often better bisection bandwidth compared to a simple mesh.
Problem 3: Threading (10 credits)
Suppose you want to write a program that finds the largest value in a large array of non-negative doubles.
Show Hint
What is the specific term for the condition where the outcome depends on the unpredictable timing of thread execution when accessing shared memory, and at least one access is a write?
Show Solution
A Data Race can occur (Section 4.1.1). This leads to undefined behavior and non-deterministic results.
A method to prevent it is using a Mutex (Mutual Exclusion Lock) (Section 2.4.1). A mutex ensures that only one thread can access the critical section (the code modifying the shared data) at a time.
Show Analysis
Data races are a primary source of bugs in shared memory programming. Ensuring thread safety through synchronization mechanisms (like mutexes, atomics, or critical sections) is fundamental to correct parallel execution.
#define NUM_THREADS 8
void thread_max(int start, int end, double *data , double *max) {
}
void find_max_in_parallel(double *array, int array_size, double *array_max) {
int local_size, local_start, local_end;
std::thread threads[NUM_THREADS];
local_size = array_size / NUM_THREADS;
*array_max = -1.0;
for (int i = 0; i < NUM_THREADS; ++i) {
threads[i] = std::thread(thread_max,
);
}
}
Show Hint
Efficiency: Avoid locking a global variable inside the loop. Use a parallel reduction pattern.
Strategy:
1. Create an auxiliary array (e.g., local_maxes[NUM_THREADS]).
2. Calculate start/end indices carefully, ensuring the remainder is handled (load balancing).
3. Pass the address of the thread's slot in the auxiliary array to thread_max.
4. Join the threads and perform the final reduction on the auxiliary array.
Show Solution
#define NUM_THREADS 8
#include <thread>
#include <algorithm>
// The 'max' parameter is used here as a pointer to the storage location
// for this specific thread's local maximum.
void thread_max(int start, int end, double *data , double *max) {
double local_max = -1.0; // Initialize based on non-negative data assumption
for (int i = start; i < end; ++i) {
if (data[i] > local_max) {
local_max = data[i];
}
}
*max = local_max;
}
void find_max_in_parallel(double *array, int array_size, double *array_max) {
int local_size, local_start, local_end;
std::thread threads[NUM_THREADS];
double local_maxes[NUM_THREADS]; // Storage for local results
// Calculate base chunk size and remainder for balanced distribution
local_size = array_size / NUM_THREADS;
int remainder = array_size % NUM_THREADS;
int current_start = 0;
*array_max = -1.0;
for (int i = 0; i < NUM_THREADS; ++i) {
local_start = current_start;
// Distribute the remainder among the first 'remainder' threads
local_end = local_start + local_size + (i < remainder ? 1 : 0);
// Ensure bounds safety and handle case where array_size < NUM_THREADS
if (local_end > array_size) local_end = array_size;
current_start = local_end;
// Launch the thread only if it has work
if (local_start < local_end) {
// Launch the thread, passing the address of its slot in local_maxes
threads[i] = std::thread(thread_max,
local_start, local_end, array, &local_maxes[i]
);
}
}
// Join threads (Fork/Join model, Section 2.3.2) and perform final reduction
for (int i = 0; i < NUM_THREADS; ++i) {
// Check if the thread was launched
if (threads[i].joinable()) {
threads[i].join();
// Reduce (must happen after join to ensure data is written)
if (local_maxes[i] > *array_max) {
*array_max = local_maxes[i];
}
}
}
}
Show Analysis
This solution uses a parallel reduction pattern (Section 6.3.2), which is efficient because it minimizes synchronization overhead.
Common Pitfall (Inefficient): A common mistake is to pass the address of the global array_max to all threads and use a mutex inside thread_max. This would serialize the updates and destroy performance due to lock contention.
Load Balancing: The solution correctly handles arbitrary array sizes by calculating a base size and distributing the remainder, ensuring all elements are processed and the load is balanced.
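For concreteness, a minimal sketch of that inefficient variant (function and variable names are illustrative): every element comparison takes the same lock, so the threads effectively make progress one at a time.
#include <mutex>

std::mutex max_mutex;        // protects shared_max
double shared_max = -1.0;    // assumes non-negative input data

// Correct but slow: the lock is acquired once per element, so all updates
// are serialized and the threads contend heavily on max_mutex.
void update_shared_max(const double *data, int start, int end) {
    for (int i = start; i < end; ++i) {
        std::lock_guard<std::mutex> lock(max_mutex);  // critical section
        if (data[i] > shared_max) shared_max = data[i];
    }
}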
Problem 4: OpenMP (13 credits)
double random_noise();
int operation(double *x, double *y, int n)
{
int i;
double w;
double min_y = y[0];
int min_y_index = 0;
for (i = 0; i < n - 1; ++i) {
w = random_noise();
x[i] = (y[i] + y[i + 1]) / 2. + w;
if (y[i + 1] < min_y) {
min_y = y[i + 1];
min_y_index = i;
}
}
x[i] = y[i] + random_noise();
return min_y_index;
}
Show Hint
1. Use default(none). Identify shared (x, y, n, min_y, min_y_index) and private (i, w) variables.
2. The loop performs a "minloc" (minimum value and index) reduction.
3. Standard OpenMP reductions (Section 3.6) do not support "minloc". You must implement a manual reduction.
4. Use thread-private variables for local minimums and combine them using #pragma omp critical after the loop.
Show Solution
double random_noise();
int operation(double *x, double *y, int n)
{
int i;
double w;
// Initialize global minimums (assuming n > 0)
double min_y = y[0];
int min_y_index = 0;
// Start parallel region. Explicitly define data sharing attributes.
#pragma omp parallel default(none) shared(x, y, n, min_y, min_y_index) private(i, w)
{
// Thread-private copies for local minimum tracking
double local_min_y = min_y;
int local_min_y_index = min_y_index;
// Distribute loop iterations among threads (Section 3.3.3)
#pragma omp for
for (i = 0; i < n - 1; ++i) {
w = random_noise();
x[i] = (y[i] + y[i + 1]) / 2. + w;
// Update local minimum
if (y[i + 1] < local_min_y) {
local_min_y = y[i + 1];
local_min_y_index = i;
}
}
// Implicit barrier at the end of omp for
// Update global minimum in a critical section (Section 3.5.1)
#pragma omp critical
{
if (local_min_y < min_y) {
min_y = local_min_y;
min_y_index = local_min_y_index;
}
}
} // End parallel region
// Handle the final iteration sequentially
if (n > 0) {
i = n - 1;
x[i] = y[i] + random_noise();
}
return min_y_index;
}
Show Analysis
The key challenge is the "minloc" operation. Since it's not a standard reduction, we manually manage it. Threads compute local minimums privately and then combine them globally within a critical section (Section 3.5.1) to ensure atomicity when updating the shared global minimums. This minimizes synchronization overhead.
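As an aside (not required by the exam), OpenMP 4.0 and later can express minloc as a user-defined reduction via declare reduction; a hedged sketch, with the struct, identifier, and function names chosen here for illustration:
#include <omp.h>

typedef struct { double val; int idx; } MinLoc;

// User-defined "minloc" reduction: keep the smaller value and its index.
// Each thread's private copy starts as a copy of the original (omp_orig).
#pragma omp declare reduction(minloc : MinLoc : \
        omp_out = (omp_in.val < omp_out.val ? omp_in : omp_out)) \
        initializer(omp_priv = omp_orig)

MinLoc find_min(const double *y, int n) {
    MinLoc best = { y[0], 0 };
    #pragma omp parallel for reduction(minloc : best)
    for (int i = 1; i < n; ++i) {
        if (y[i] < best.val) { best.val = y[i]; best.idx = i; }
    }
    return best;
}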
Now consider the following code snippet:
void main()
{
const int N = 100000;
int a[N];
int b[N];
for (int i = 0; i < N; ++i) {
a[i] = i;
}
#pragma omp parallel for schedule(static , 1)
for (int i = 0; i < N; ++i) {
a[i] = a[i] + 17;
b[i] = a[i] % 23;
}
}
b)* State the 3 performance problems that can occur when running this code snippet on a two-socket system.
Show Hint
Analyze the interaction with a NUMA architecture (two sockets).
1. Initialization is sequential. Consider the "First Touch" policy.
2. schedule(static, 1) is cyclic distribution. What happens when different threads write to adjacent elements (on the same cache line)?
3. How does cyclic distribution affect spatial locality for individual threads?
Show Solution
- NUMA Effects / First Touch Policy (Section 6.5.2): Array `a` is initialized sequentially. The First Touch policy allocates the memory on the socket running the main thread. In the parallel loop, threads on the second socket will suffer high latency due to remote memory access.
- False Sharing (Section 6.4): `schedule(static, 1)` distributes iterations cyclically, so adjacent iterations (which access adjacent memory locations) are processed by different threads. Writes to the same cache line by different threads cause excessive cache coherence traffic (cache line bouncing).
- Poor Spatial Locality / Cache Thrashing (Section 6.4): The cyclic distribution destroys spatial locality for individual threads, as they access scattered elements rather than contiguous blocks. This leads to inefficient cache utilization and prefetching.
Show Analysis
This combination of sequential initialization and cyclic scheduling (static, 1) is notorious for poor performance on NUMA systems. The cyclic schedule maximizes false sharing and minimizes spatial locality. The sequential initialization ensures NUMA penalties for threads on the remote socket.
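One way to address all three issues at once (a sketch under the assumption that the arrays are later processed with the same distribution): initialize the data in parallel so the First Touch policy places each page in the NUMA domain of the thread that will use it, and use a blocked static schedule so each thread works on a contiguous chunk.
#include <omp.h>

#define N 100000
int a[N], b[N];

int main(void) {
    // Parallel first touch: each thread writes the block of 'a' it will later
    // process, so pages land in that thread's NUMA domain.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        a[i] = i;
    }

    // Same blocked static schedule (default chunk size): contiguous per-thread
    // blocks avoid false sharing and preserve spatial locality.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; ++i) {
        a[i] = a[i] + 17;
        b[i] = a[i] % 23;
    }
    return 0;
}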
float operation(float *a, float *b, int n) {
float prod = 0.0f;
for (int k = 0; k < n; k++) {
prod *= a[k] + b[k];
}
return prod;
}
Show Hint
1. Use the OpenMP directive for SIMD (vectorization).
2. This is a product reduction. Identify the correct clause.
3. Critical Error Check: The initialization value for a product reduction should be 1.0f, not 0.0f.
Show Solution
(Note: Corrected initialization from 0.0f to 1.0f)
Vectorized Solution (C++ with OpenMP SIMD):
float operation(float *a, float *b, int n) {
float prod = 1.0f; // Corrected initialization for product reduction
#pragma omp simd reduction(*:prod)
for (int k = 0; k < n; k++) {
prod *= a[k] + b[k];
}
return prod;
}
Show Analysis
We use the #pragma omp simd directive (Section 5.5.4). This instructs the compiler to utilize the CPU's vector instructions (like AVX) to process multiple iterations simultaneously within a single thread. The reduction clause is necessary to correctly handle the accumulation across vector lanes.
struct Node {
struct Node *left;
struct Node *right;
};
void visit(struct Node *p);
void traverse(struct Node *p)
{
if (p->left != nullptr) {
#pragma omp
traverse(p->left);
}
if (p->right != nullptr) {
#pragma omp
traverse(p->right);
}
visit(p);
}
void traverse_tree_in_parallel(struct Node *tree)
{
#pragma omp
{
#pragma omp
traverse(tree);
}
}
Show Hint
1. Tree traversal is irregular and recursive, suitable for Task Parallelism (not omp for).
2. Start a parallel region in the wrapper function, and use omp single to initiate the first call.
3. Use omp task for the recursive calls.
4. This is a post-order traversal (visit after children). Synchronization is required before visit(p) using omp taskwait.
Show Solution
struct Node {
struct Node *left;
struct Node *right;
};
void visit(struct Node *p);
void traverse(struct Node *p)
{
if (p->left != nullptr) {
#pragma omp task
traverse(p->left);
}
if (p->right != nullptr) {
#pragma omp task
traverse(p->right);
}
// Wait for child tasks to complete before visiting the node
#pragma omp taskwait
visit(p);
}
void traverse_tree_in_parallel(struct Node *tree)
{
// Start parallel region
#pragma omp parallel
{
// Use a single thread to initiate the traversal (Section 4.3.3)
#pragma omp single
traverse(tree);
}
}
Show Analysis
Task Parallelism (Section 4.3) is designed for this dynamic workload. The taskwait (Section 4.3.4) is crucial to ensure that the child tasks complete before the parent node is visited, maintaining the post-order traversal logic.
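A common practical refinement (not part of the exam task) is to stop creating tasks for very small subtrees; a sketch using the final clause, reusing struct Node and visit from above, with the depth parameter and cutoff added here for illustration:
#define TASK_DEPTH_CUTOFF 8

void traverse_with_cutoff(struct Node *p, int depth)
{
    if (p->left != nullptr) {
        // Beyond the cutoff depth the task is "final": it and all tasks it
        // generates execute immediately, avoiding task-creation overhead.
        #pragma omp task final(depth >= TASK_DEPTH_CUTOFF)
        traverse_with_cutoff(p->left, depth + 1);
    }
    if (p->right != nullptr) {
        #pragma omp task final(depth >= TASK_DEPTH_CUTOFF)
        traverse_with_cutoff(p->right, depth + 1);
    }
    #pragma omp taskwait
    visit(p);
}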
Problem 5: SIMD (17 credits)
Show Hint
Compilers must be conservative to guarantee correctness. Consider issues related to memory access (pointers/aliasing) and code structure (branches/calls).
Show Solution
Compilers must be conservative (Section 5.5.2). Reasons include:
- Potential Data Dependencies / Pointer Aliasing: The compiler cannot prove that different pointers used in the loop do not point to overlapping memory regions (aliasing). To guarantee correctness, it assumes dependencies exist and prevents vectorization.
- Complex Control Flow or Function Calls: The presence of complex branches (`if`/`else`) or calls to functions that cannot be inlined or vectorized prevents the compiler from generating efficient SIMD code (Section 5.5.4).
Show Analysis
Autovectorization relies on static analysis. If the compiler cannot definitively prove independence (e.g., due to ambiguous pointers), it defaults to sequential execution. Manual vectorization allows the programmer to assert this independence based on their knowledge of the code.
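One common way to help the autovectorizer is to assert the absence of aliasing; a minimal sketch using the C99 restrict qualifier (C++ compilers typically offer __restrict__ as an extension):
// The restrict qualifiers promise the compiler that a and b never overlap,
// removing the assumed dependency that otherwise blocks autovectorization.
void scale_add(float *restrict a, const float *restrict b, float c, int n)
{
    for (int i = 0; i < n; ++i) {
        a[i] = c * a[i] + b[i];
    }
}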
Now consider the following sequential code:
void linear_and_avg(float *a, float *b, float c, int size, float *avg_out)
{
float sum = 0.0f;
for (int i = 0; i < size; i ++) {
a[i] = c * a[i] + b[i];
sum += a[i];
}
*avg_out = sum / size;
}
b)* Vectorize the sequential code using 256-bit vector operations in the code template below. Assume that all arrays are unaligned. Do not try to align the arrays but use the correct intrinsics.
Here is a list of definitions you might find useful:
| Intrinsics | Intrinsics | Intrinsics |
|---|---|---|
| __m256 | _mm256_add_ps | __m256i |
| __m256f | _mm256_add_pf | __m256f |
| __m256d | _mm256_add_pd | _mm256_store_ps |
| _mm256_load_ps | _mm256_dp_ps | _mm256_storeu_ps |
| _mm256_loadu_ps | _mm256_dp_pf | _mm256_store_pf |
| _mm256_load_pf | _mm256_dp_pd | |
| _mm256_loadu_pf | _mm256_mul_ps | _mm256_storeu_pf |
| _mm256_load_pd | _mm256_mul_pf | _mm256_store_pd |
| _mm256_loadu_pd | _mm256_mul_pd | _mm256_storeu_pd |
| _mm256_set1_ps | _mm256_set1_pf | _mm256_set1_pd |
void linear_and_avg_vec(float *a, float *b, float c, int size, float *avg_out)
{
__m sum256, a256, b256, c256, prod256, new_a256;
sum256 =
c256 =
float sum = 0.0f;
int i = 0;
for (; i < size - (size % ); ) {
a256 =
b256 =
prod256 =
new_a256 =
sum256 =
}
*avg_out = sum / size;
}
Show Hint
1. VL: 256-bit vectors / 32-bit floats = 8.
2. Initialization: Initialize vector sum to zero (_mm256_set1_ps(0.0f)). Broadcast scalar c (_mm256_set1_ps(c)).
3. Main Loop: Iterate in steps of 8. Use unaligned instructions (_mm256_loadu_ps, _mm256_storeu_ps).
4. Cleanup: Perform a horizontal reduction of the sum vector. Handle the remainder loop.
Show Solution
#include <immintrin.h>
void linear_and_avg_vec(float *a, float *b, float c, int size, float *avg_out)
{
__m256 sum256, a256, b256, c256, prod256, new_a256;
const int VL = 8; // Vector Length for float (32-bit) on 256-bit registers
// Initialize sum vector to zero
sum256 = _mm256_set1_ps(0.0f);
// Broadcast scalar 'c' into a vector
c256 = _mm256_set1_ps(c);
float sum = 0.0f;
int i = 0;
// Main Vector Loop
for (; i < size - (size % VL); i += VL) {
// Load unaligned data
a256 = _mm256_loadu_ps(&a[i]);
b256 = _mm256_loadu_ps(&b[i]);
// Calculate prod = c * a
prod256 = _mm256_mul_ps(c256, a256);
// Calculate new_a = (c * a) + b
new_a256 = _mm256_add_ps(prod256, b256);
// Store the result back to a (unaligned) - Required step
_mm256_storeu_ps(&a[i], new_a256);
// Accumulate the sum
sum256 = _mm256_add_ps(sum256, new_a256);
}
// Horizontal Reduction
float temp_sum[VL];
_mm256_storeu_ps(temp_sum, sum256);
for(int j=0; j<VL; ++j) {
sum += temp_sum[j];
}
// Remainder Loop (Scalar) (Section 5.4.1)
for (; i < size; ++i) {
a[i] = c * a[i] + b[i];
sum += a[i];
}
*avg_out = sum / size;
}
Show Analysis
This solution correctly implements the vectorized version using AVX (256-bit) intrinsics.
Key Requirement Check: The instruction explicitly states to assume arrays are unaligned, mandating the use of the _loadu_ and _storeu_ variants. The structure includes the necessary steps: vectorized main loop, horizontal reduction, and the remainder loop.
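If the target CPU also supports FMA (an assumption not stated in the exam), the multiply and add in the loop body can be fused; a small sketch of that variant:
#include <immintrin.h>

// Computes c*a + b in a single fused multiply-add (requires FMA3, e.g.
// compile with -mfma); otherwise the separate mul/add version above applies.
static inline __m256 linear_step_fma(__m256 a256, __m256 b256, __m256 c256)
{
    return _mm256_fmadd_ps(c256, a256, b256);
}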
Problem 6: MPI Theory (10 credits)
Show Hint
Does the routine need communication or actions from another MPI process to complete its execution?
Show Solution
- Local routine: A routine whose completion depends only on the local calling process. It does not require communication or synchronization events to occur at another process to complete. Example: `MPI_Comm_rank`, `MPI_Comm_size`.
- Non-local routine: A routine whose completion may depend on actions taken by other MPI processes (e.g., matching communication calls or collective participation). Example: `MPI_Barrier`, `MPI_Bcast`.
Show Analysis
This distinction is important for understanding MPI semantics. Note that initiating a non-blocking send (MPI_Isend) is considered a local routine because the function call returns immediately based on local state, even though the underlying communication operation is inherently non-local.
Show Hint
This relates to non-blocking operations and when the resources (like buffers) used in the operation are safe to reuse. (Section 9.5).
Show Solution
- Incomplete procedure: A procedure that initiates a communication operation but may return before the operation is locally complete. Resources (like buffers) cannot be reused yet. Example: `MPI_Isend`, `MPI_Irecv`.
- Completing procedure: A procedure that ensures or verifies the completion of an operation. Upon return, the resources involved are available for reuse. Example: `MPI_Wait`, `MPI_Test` (if completion is detected), or blocking calls like `MPI_Send`.
Show Analysis
Understanding this is vital for managing memory buffers in non-blocking MPI calls. You must not modify a send buffer or read from a receive buffer after an incomplete procedure until a corresponding completing procedure has finished.
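A minimal sketch of that rule (buffer, count, and tag chosen for illustration): the send buffer must stay untouched between the incomplete call and its completing call.
#include <mpi.h>

void send_then_reuse(double *buf, int count, int dest, MPI_Comm comm)
{
    MPI_Request req;
    // Incomplete procedure: returns immediately, the send is only initiated.
    MPI_Isend(buf, count, MPI_DOUBLE, dest, /*tag=*/0, comm, &req);

    // ... independent computation may overlap here, but 'buf' must not be modified ...

    // Completing procedure: after this returns, 'buf' may safely be reused.
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}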
Show Hint
In "passive target" synchronization (RMA), consider the role of the target process. What MPI calls are used by the origin process to control the access epoch? The concurrency options refer to the locking mechanisms.
Show Solution
Explanation:
Passive target synchronization (Section 9.11.2, Lock/Unlock) is a mechanism in Remote Memory Access (RMA) where the target process does not actively participate in the synchronization calls. The origin process entirely controls the access epoch to the target's memory window using MPI_Win_lock (to start the epoch) and MPI_Win_unlock (to end it).
Concurrency Options (Lock Types):
- Exclusive Lock (`MPI_LOCK_EXCLUSIVE`): Grants the origin process exclusive access to the target window. No other process can access it concurrently.
- Shared Lock (`MPI_LOCK_SHARED`): Allows multiple processes to concurrently access the target window, provided they also use shared locks (e.g., for concurrent reads).
Show Analysis
Passive synchronization contrasts with Active Target synchronization (like Fence or Post/Start/Complete/Wait), where both the origin and target must participate in synchronization calls. Passive synchronization is useful when you do not want to interrupt the computation occurring on the target process.
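A minimal sketch of a passive-target access epoch (window creation and error handling omitted; function and variable names are illustrative):
#include <mpi.h>

// Origin-side update of one double at displacement 0 in the target's window;
// the target process makes no matching synchronization call (passive target).
void put_value(MPI_Win win, int target_rank, double value)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, /*assert=*/0, win);
    MPI_Put(&value, 1, MPI_DOUBLE,            /* origin buffer         */
            target_rank, /*target_disp=*/0,   /* where on the target   */
            1, MPI_DOUBLE, win);
    MPI_Win_unlock(target_rank, win);         /* ends the access epoch */
}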
Problem 7: MPI (15 credits)
Complete the MPI program in the code block below such that it finds the 3 largest elements in 'full_data' consisting of non-negative floats and prints them in the last rank relative to MPI_COMM_WORLD using a manager-worker pattern. Each MPI process searches through a chunk of 'full_data' and sends the 3 locally largest elements to the last rank. The last rank then finds the 3 globally largest elements.
Requirement: In the last rank, the receiving of the partial results and the computation of the 3 globally largest elements should be overlapped using MPI nonblocking communication. You can assume that the total data size is divisible by 3x the number of MPI processes. Use the most efficient MPI calls possible.
Here is a sequential version of the operation:
#define TOTAL_LENGTH 420000000
void read_data(float *output);
void update_top3(const float *data, float *top3_out, int count) {
for (int i = 0; i < count; ++i) {
if (data[i] > top3_out[0]) {
top3_out[2] = top3_out[1];
top3_out[1] = top3_out[0];
top3_out[0] = data[i];
} else if (data[i] > top3_out[1]) {
top3_out[2] = top3_out[1];
top3_out[1] = data[i];
} else if (data[i] > top3_out[2]) {
top3_out[2] = data[i];
}
}
}
int main(int argc, char** argv) {
float * full_data = new float [TOTAL_LENGTH];
read_data(full_data); // initialize data
float top3[3] = {-1.f, -1.f, -1.f};
update_top3(full_data, top3, TOTAL_LENGTH);
print_top3 (top3);
delete[] full_data;
return 0;
}
Code Template (C++ with MPI):
#include <mpi.h>
int main(int argc, char** argv) {
int rank, world_size;
float *full_data = new float[TOTAL_LENGTH];
if (rank == world_size - 1) read_data(full_data); // initialize data
int chunk_size = TOTAL_LENGTH / world_size;
float *local_data = new float[chunk_size];
float top3[3] = { -1.0f, -1.0f, -1.0f };
if (rank == world_size - 1) { // Manager
update_top3(local_data, top3, chunk_size);
MPI_Request *requests = new MPI_Request[world_size - 1];
float *recv_buf = new float[3 * (world_size - 1)];
// receive top3 from all other ranks
for (int from_rank = 0; from_rank < world_size - 1; ++from_rank) {
}
int completed = 0;
while (completed < world_size -1) {
// check if received top3 from any rank and update global top3
}
delete[] recv_buf; delete[] requests;
} else { // Worker
update_top3 (local_data, top3, chunk_size);
}
if (rank == world_size - 1) print_top3 (top3);
delete[] full_data; delete[] local_data;
return 0;
}
Show Hint
1. Data Distribution: Use MPI_Scatter to efficiently distribute chunks from the manager (last rank) to all processes.
2. Worker Logic: Workers calculate local top 3 and use MPI_Send to the manager.
3. Manager Logic (Overlap):
a. Post non-blocking receives (MPI_Irecv) for all incoming results.
b. Use MPI_Waitany in a loop. This returns the index of the first completed receive.
c. Process the data corresponding to that index immediately using update_top3.
Show Solution
#include <mpi.h>
// Assume TOTAL_LENGTH, read_data(), update_top3(), print_top3() are defined.
int main(int argc, char** argv) {
// Initialize MPI environment
MPI_Init(&argc, &argv);
int rank, world_size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &world_size);
const int MANAGER_RANK = world_size - 1;
float *full_data = nullptr;
// Initialize data only on the manager (last rank)
if (rank == MANAGER_RANK) {
full_data = new float[TOTAL_LENGTH];
read_data(full_data);
}
int chunk_size = TOTAL_LENGTH / world_size;
float *local_data = new float[chunk_size];
// Distribute the data efficiently using MPI_Scatter (Section 9.7.3)
MPI_Scatter(full_data, chunk_size, MPI_FLOAT, local_data, chunk_size, MPI_FLOAT, MANAGER_RANK, MPI_COMM_WORLD);
float top3[3] = { -1.0f, -1.0f, -1.0f };
if (rank == MANAGER_RANK) { // Manager
// 1. Manager computes its own local top3
update_top3(local_data, top3, chunk_size);
int num_workers = world_size - 1;
MPI_Request *requests = new MPI_Request[num_workers];
// Buffer to hold incoming results (3 floats per worker)
float *recv_buf = new float[3 * num_workers];
// 2. Initiate non-blocking receives (MPI_Irecv, Section 9.5.1)
for (int from_rank = 0; from_rank < num_workers; ++from_rank) {
MPI_Irecv(&recv_buf[from_rank * 3], 3, MPI_FLOAT, from_rank, 0, MPI_COMM_WORLD, &requests[from_rank]);
}
// 3. Wait for completion and process results concurrently
int completed = 0;
while (completed < num_workers) {
// Wait for any request to complete (MPI_Waitany)
int index;
MPI_Waitany(num_workers, requests, &index, MPI_STATUS_IGNORE);
// Process the completed request immediately (Overlap achieved here)
if (index != MPI_UNDEFINED) {
update_top3(&recv_buf[index * 3], top3, 3);
completed++;
}
}
delete[] recv_buf; delete[] requests;
} else { // Worker
// 1. Worker computes its local top3
update_top3 (local_data, top3, chunk_size);
// 2. Worker sends result to the manager
MPI_Send(top3, 3, MPI_FLOAT, MANAGER_RANK, 0, MPI_COMM_WORLD);
}
if (rank == MANAGER_RANK) {
print_top3 (top3);
delete[] full_data; // Clean up full data on manager
}
delete[] local_data;
MPI_Finalize();
return 0;
}
Show Analysis
The solution addresses the requirements effectively:
- Efficiency: `MPI_Scatter` is used for optimal data distribution.
- Overlap: The key requirement is overlapping reception and computation. Using `MPI_Irecv` followed by an `MPI_Waitany` loop achieves this. As soon as one message arrives, the manager processes it (`update_top3`) while waiting for the next message, hiding communication latency with computation.
- Alternative (Not Allowed): Using `MPI_Gather` would simplify the code but would block until all data arrived, failing the overlap requirement.
Problem 8: Hybrid Programming (9 credits)
void compute_energy(double *local_mass, double *local_velocity,
int local_count, double *total_energy)
{
}
Show Hint
1. Local (OpenMP): Use #pragma omp parallel for with a reduction clause to sum the local energy within the MPI rank.
2. Global (MPI): Combine the local sums. Since the result must be available in *each* rank, use MPI_Allreduce.
Show Solution
#include <omp.h>
#include <mpi.h>
void compute_energy(double *local_mass, double *local_velocity,
int local_count, double *total_energy)
{
double local_energy_sum = 0.0;
// 1. Local computation using OpenMP parallel for with reduction (Section 3.6)
// E = 0.5 * m * v^2
#pragma omp parallel for reduction(+:local_energy_sum) schedule(static)
for (int i = 0; i < local_count; ++i) {
double m = local_mass[i];
double v = local_velocity[i];
local_energy_sum += 0.5 * m * v * v;
}
// 2. Global reduction using MPI_Allreduce (Section 9.7.4)
// MPI_Allreduce ensures the result is available in every rank.
MPI_Allreduce(&local_energy_sum, total_energy, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
Show Analysis
This is a typical structure for hybrid programming (Section 9.13.4). OpenMP handles the fine-grained parallelism within a node (shared memory), while MPI handles the coarse-grained parallelism across nodes (distributed memory). Using MPI_Allreduce ensures all processes receive the final result.
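One detail this pattern relies on but the snippet does not show (a hedged sketch, not part of the exam solution): a hybrid program should initialize MPI with an explicit thread-support level. MPI_THREAD_FUNNELED suffices here because only the thread outside the OpenMP region calls MPI.
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    // Request FUNNELED: threads exist, but only the main thread makes MPI
    // calls (true for compute_energy, where MPI_Allreduce runs after the
    // OpenMP parallel region).
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        MPI_Abort(MPI_COMM_WORLD, 1);  // requested support level unavailable
    }

    // ... distribute data, call compute_energy(...), print the result ...

    MPI_Finalize();
    return 0;
}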
You run your hybrid program on a cluster of compute nodes. You can assume the following conditions:
- Each compute node has identical hardware
- Each node has 3 sockets and each socket is its own NUMA domain
- Each socket has 6 cores with 2-way simultaneous multithreading (SMT)
- `mpirun` starts 3 processes per node
MPI binding report:
rank0: [BB/../../../../..][../../../../../..][../../../../../..]
rank1: [../BB/../../../..][../../../../../..][../../../../../..]
rank2: [../../BB/../../..][../../../../../..][../../../../../..]
OPENMP DISPLAY ENVIRONMENT BEGIN:
OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '2'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'FALSE'
OMP_PLACES =
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
Show Hint
1. MPI Placement: The binding report shows where the 3 ranks are located (brackets are sockets, 'B' is binding). Are they utilizing all 3 sockets or clustered on one?
2. OpenMP Threads: Are they using all available cores (6 physical)?
3. Pinning: Is thread pinning enabled (`OMP_PROC_BIND`) to ensure locality?
Fix: Aim for 1 MPI rank per socket, utilizing the physical cores within that socket.
Show Solution
Performance Issues:
- Poor MPI Process Placement (Resource Imbalance): The MPI binding report shows all three ranks are bound to the first socket. Sockets 1 and 2 are idle, while Socket 0 is contended. This violates the ideal hybrid strategy of one MPI process per socket (Section 9.13.4).
- OpenMP Thread Undersubscription: `OMP_NUM_THREADS = '2'`, but each socket has 6 physical cores. Using only 2 threads per rank significantly underutilizes the resources.
- Lack of OpenMP Thread Pinning (Poor Locality): `OMP_PROC_BIND = 'FALSE'`. Threads are not bound (Section 6.5.3) and may migrate, hurting cache locality.
Commands to Fix:
We aim to utilize the 6 physical cores per socket (ignoring SMT is often better for HPC).
1. Set OpenMP Environment Variables:
export OMP_NUM_THREADS=6 # Utilize the physical cores per socket
export OMP_PLACES=cores # Bind to physical cores
export OMP_PROC_BIND=close # Keep threads close together within the socket
2. Fix MPI Execution Command (assuming OpenMPI):
mpirun --map-by socket --bind-to socket -np 3 ./your_program
Show Analysis
Affinity (process and thread placement) is crucial on NUMA architectures. The default behavior often leads to suboptimal performance.
The fix ensures that MPI ranks are distributed across the NUMA domains (--map-by socket --bind-to socket) and that the OpenMP threads spawned by each rank remain local to that domain and pinned to specific cores (OMP_PROC_BIND=close, OMP_PLACES=cores). This maximizes memory locality and minimizes contention.