Parallel Programming Exam Study Guide

Welcome to the comprehensive study guide for Parallel Programming and High-Performance Computing exams. This resource covers all major topics tested in recent exams (2018-2025), with detailed solutions, analysis, and practical implementation examples.

📝 Available Exam Solutions

📌 Key Trends: Recent exams (2023-2025) heavily emphasize Hybrid Programming (MPI+OpenMP) and Performance Engineering, particularly understanding NUMA architectures and debugging suboptimal configurations.

1. Foundations: Performance Metrics and Architecture

1.1 Amdahl's Law and Scalability

Amdahl's Law is a guaranteed component (it appears as Problem 1 in every exam).

Core Techniques:

  • Calculating the parallel fraction (f) given speedup and processors
  • Determining required processors (p) for target speedup
  • Calculating maximum theoretical speedup SU(∞)
  • Solving systems of equations when sequential time is unknown
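For reference, the formula and a short worked example (the numbers below are illustrative, not taken from a specific exam):

```latex
% Amdahl's Law: speedup with parallel fraction f on p processors
S(p) = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad
S(\infty) = \lim_{p \to \infty} S(p) = \frac{1}{1 - f}

% Worked example (illustrative): measured speedup S(4) = 3. Solve for f:
% (1 - f) + f/4 = 1/3  =>  1 - (3/4) f = 1/3  =>  f = 8/9 \approx 0.89
% Maximum theoretical speedup: S(\infty) = 1 / (1 - 8/9) = 9
```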

Beyond the Formula:

  • Reasons for sub-linear speedup (overhead, load imbalance)
  • Explaining anomalies such as super-linear speedup (typically caused by cache effects)
  • Distinguishing Strong vs. Weak scaling

1.2 Architectures and Memory Models

  • NUMA Awareness: Understanding Non-Uniform Memory Access is critical for performance analysis
  • Memory Consistency: Sequential Consistency definition (each thread's operations appear in program order, and all operations form a single global total order)

2. Shared Memory I: Threading Fundamentals

2.1 Synchronization Primitives

  • Application: Implementing thread-safe data structures using mutexes and condition variables
  • Techniques: Understanding critical sections, lock placement optimization
  • Robustness: Exception Safety and RAII (e.g., std::lock_guard) to prevent deadlocks
  • Lock Types: Trade-offs of Spin-Locks vs. Yielding Locks, Coarse vs. Fine granularity
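A minimal sketch of the pattern, assuming a simple producer/consumer queue; the class name and interface are illustrative:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Illustrative thread-safe queue: RAII locks guarantee the mutex is
// released even if an exception is thrown inside the critical section.
template <typename T>
class ThreadSafeQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);   // RAII: unlocks on scope exit
            queue_.push(std::move(value));
        }
        cv_.notify_one();                               // wake one waiting consumer
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);      // unique_lock required for wait()
        cv_.wait(lock, [this] { return !queue_.empty(); }); // guards against spurious wakeups
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};
```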

2.2 Parallel Patterns

  • Manual Implementation: Implementing parallel algorithms without high-level APIs
  • Techniques: Fork/Join model, efficient reduction with local computation followed by global aggregation
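A sketch of the manual reduction pattern with std::thread: local accumulation per thread followed by a single aggregation step after the join (the chunking scheme and names are illustrative):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Manual fork/join reduction: local accumulation per thread, global aggregation after join.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    std::vector<double> partial(num_threads, 0.0);       // one slot per thread, no locking needed
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            // Each thread works on a contiguous block of the input.
            std::size_t chunk = data.size() / num_threads;
            std::size_t begin = t * chunk;
            std::size_t end   = (t == num_threads - 1) ? data.size() : begin + chunk;
            double local = 0.0;                          // local computation...
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;                          // ...written once at the end
        });
    }
    for (auto& th : threads) th.join();                  // join phase

    return std::accumulate(partial.begin(), partial.end(), 0.0);  // global aggregation
}
```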

3. Shared Memory II: OpenMP

⚠️ Crucial Focus Area: The analysis of variable privatization is a recurring and challenging theme.

3.1 The Data Environment

  • Meticulous tracing of execution flow based on specified schedule
  • Key Insight: lastprivate without firstprivate results in uninitialized private variables
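A short sketch of the pitfall and its fix; the schedule, thread count, and values are chosen for illustration so the final value can be traced by hand:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    int x = 42;

    // BUG: lastprivate alone gives each thread an UNINITIALIZED private x,
    // so the read in "x += i" would be undefined behavior:
    //   #pragma omp parallel for lastprivate(x)
    //   for (int i = 0; i < 8; ++i) x += i;

    // Fix: firstprivate initializes every private copy from the original x (42);
    // lastprivate copies the value after the sequentially last iteration (i == 7) back.
    #pragma omp parallel for num_threads(4) schedule(static, 2) \
                             firstprivate(x) lastprivate(x)
    for (int i = 0; i < 8; ++i) {
        x += i;
    }
    // With schedule(static, 2) and 4 threads, thread 3 executes iterations {6, 7}:
    // its private x ends at 42 + 6 + 7 = 55, which lastprivate copies out.
    std::printf("x = %d\n", x);   // prints 55
    return 0;
}
```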

3.2 Work Sharing and Scheduling

  • static for uniform workloads
  • dynamic for irregular workloads (e.g., sparse matrix processing)
  • NUMA Impact: dynamic scheduling can destroy data locality on NUMA systems
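A sketch contrasting the two choices on an irregular workload (the sparse-row representation here is illustrative):

```cpp
#include <omp.h>
#include <vector>

// Illustrative: per-row work varies widely (e.g., nonzeros per row of a sparse matrix).
void process_rows(const std::vector<std::vector<double>>& rows,
                  std::vector<double>& row_sum) {
    // Uniform work per iteration -> schedule(static) keeps contiguous blocks per thread
    // (good for locality and NUMA first-touch placement).
    // Irregular work per iteration -> schedule(dynamic) balances load, at the cost of
    // scheduling overhead and potentially scattered memory accesses.
    #pragma omp parallel for schedule(dynamic, 16)
    for (long i = 0; i < static_cast<long>(rows.size()); ++i) {
        double s = 0.0;
        for (double v : rows[i]) s += v;   // cost proportional to rows[i].size()
        row_sum[i] = s;
    }
}
```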

3.3 Reductions

  • Standard vs. Manual implementation using thread-private variables and omp critical
  • Manual implementation required when OpenMP lacks standard operator (e.g., "MinLoc")
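A sketch of the manual MinLoc pattern: thread-private candidates combined inside omp critical (function and variable names are illustrative):

```cpp
#include <cstddef>
#include <limits>
#include <omp.h>
#include <vector>

// Manual "MinLoc" reduction: OpenMP has no built-in operator that returns both
// the minimum and its index, so each thread keeps a private candidate and the
// combination step is serialized with omp critical.
void min_loc(const std::vector<double>& a, double& min_val, std::size_t& min_idx) {
    min_val = std::numeric_limits<double>::infinity();
    min_idx = 0;

    #pragma omp parallel
    {
        double local_val = std::numeric_limits<double>::infinity();  // thread-private
        std::size_t local_idx = 0;

        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(a.size()); ++i) {
            if (a[i] < local_val) { local_val = a[i]; local_idx = static_cast<std::size_t>(i); }
        }

        #pragma omp critical   // global aggregation, one thread at a time
        {
            if (local_val < min_val) { min_val = local_val; min_idx = local_idx; }
        }
    }
}
```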

3.4 Task Parallelism

  • Tasks for irregular or recursive parallelism (tree traversals)
  • Initiation within parallel/single region
  • Using omp taskwait for synchronization
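A sketch of recursive task parallelism on a binary tree; the Node type and function names are illustrative:

```cpp
#include <omp.h>

// Illustrative binary tree node.
struct Node {
    long value;
    Node* left;
    Node* right;
};

long sum_tree(Node* n) {
    if (!n) return 0;
    long left_sum = 0, right_sum = 0;

    // Each subtree becomes a task; shared() lets the child tasks write the results.
    #pragma omp task shared(left_sum) if(n->left)
    left_sum = sum_tree(n->left);

    #pragma omp task shared(right_sum) if(n->right)
    right_sum = sum_tree(n->right);

    #pragma omp taskwait            // wait for both child tasks before combining
    return n->value + left_sum + right_sum;
}

long parallel_sum_tree(Node* root) {
    long result = 0;
    #pragma omp parallel            // create the thread team...
    #pragma omp single              // ...but let only one thread spawn the root task
    result = sum_tree(root);
    return result;
}
```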

4. Shared Memory Performance Engineering

4.1 The "Classic NUMA Pitfalls" Problem

Identifying performance drawbacks on multi-socket systems is frequently tested.

Key Issues:

  • NUMA Effects (First Touch): Sequential initialization allocates all memory to one socket. Fix: parallel initialization
  • False Sharing: Fine-grained scheduling causes cache line conflicts. Fix: padding based on cache line size
  • Poor Locality: Incorrect loop nesting or cyclic scheduling reduces cache efficiency
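A sketch of the two standard fixes, assuming a 64-byte cache line; the array sizes and the per-thread counter array are illustrative:

```cpp
#include <cstdlib>
#include <omp.h>

int main() {
    const std::size_t n = 1 << 26;
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* b = static_cast<double*>(std::malloc(n * sizeof(double)));

    // First touch: pages land on the NUMA node of the thread that first writes them.
    // Initializing in parallel with the SAME static schedule as the compute loop
    // keeps each thread's data on its local socket.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    #pragma omp parallel for schedule(static)     // same distribution -> local accesses
    for (std::size_t i = 0; i < n; ++i) a[i] += 2.0 * b[i];

    // False-sharing fix: pad per-thread data to (at least) one cache line,
    // assuming a 64-byte line size.
    struct alignas(64) PaddedCounter { long value = 0; };
    PaddedCounter counters[64];                   // one slot per possible thread (illustrative cap)
    #pragma omp parallel
    counters[omp_get_thread_num()].value += 1;    // no two threads share a cache line

    std::free(a);
    std::free(b);
    return 0;
}
```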

5. SIMD and Vectorization

5.1 Dependencies and Transformations

  • Identifying Barriers: Loop-Carried Flow Dependencies
  • Loop Alignment: Shifting statement instances between iterations so that loop-carried dependencies become loop-independent
  • Loop Distribution (Fission): Separating independent statements
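A sketch of loop distribution; the #pragma omp simd hint on the second loop is one possible way to request vectorization:

```cpp
// Before: the recurrence on a[] blocks vectorization of the whole loop.
//   for (int i = 1; i < n; ++i) {
//       a[i] = a[i - 1] + x;   // loop-carried flow dependence
//       b[i] = c[i] * c[i];    // independent of the recurrence
//   }

// After loop distribution (fission): the second loop has no loop-carried
// dependence and can be vectorized; the first stays sequential.
void fissioned(int n, float* a, float* b, const float* c, float x) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + x;       // sequential recurrence

    #pragma omp simd
    for (int i = 1; i < n; ++i)
        b[i] = c[i] * c[i];        // vectorizable
}
```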

5.2 Implementation Techniques

  • Intrinsics (AVX/AVX-512): Implementing vectorized loops
  • Handling unaligned memory (loadu/storeu)
  • Horizontal reductions (collapsing a vector register to a scalar) and scalar remainder loops
  • Advanced: Masked instructions (e.g., AVX-512 masks) for handling the remainder without a scalar epilogue
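A sketch of an AVX summation with unaligned loads, a horizontal reduction, and a scalar remainder loop (assumes AVX is available; a masked AVX-512 epilogue would replace the scalar loop):

```cpp
#include <immintrin.h>

// AVX sketch: sum of a float array using unaligned loads, a horizontal
// reduction of the vector register, and a scalar loop for the remainder.
float avx_sum(const float* x, int n) {
    __m256 acc = _mm256_setzero_ps();

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);   // loadu: no alignment requirement
        acc = _mm256_add_ps(acc, v);
    }

    // Horizontal reduction: add the upper and lower 128-bit halves, then reduce.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float total = _mm_cvtss_f32(s);

    for (; i < n; ++i) total += x[i];        // scalar remainder handling
    return total;
}
```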

6. Distributed Memory Programming: MPI

6.1 Communication Mechanisms

  • Protocols: Eager (buffering) vs. Rendezvous (handshake)
  • Modes: Blocking vs. non-blocking calls, chosen based on when the send buffer is safe to reuse
  • Overlap: Using non-blocking operations for computation/communication overlap
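A sketch of the overlap pattern in a halo-exchange style; compute_interior and compute_boundary are placeholder functions, not part of MPI:

```cpp
#include <mpi.h>
#include <vector>

static void compute_interior() { /* work that does not need the halo data */ }
static void compute_boundary(const std::vector<double>&) { /* work that uses the received halo */ }

// Start communication early, compute on data that does not depend on it,
// and only then wait for completion.
void exchange_and_compute(std::vector<double>& halo_in, std::vector<double>& halo_out,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];

    MPI_Irecv(halo_in.data(),  static_cast<int>(halo_in.size()),  MPI_DOUBLE,
              left,  0, comm, &reqs[0]);
    MPI_Isend(halo_out.data(), static_cast<int>(halo_out.size()), MPI_DOUBLE,
              right, 0, comm, &reqs[1]);

    compute_interior();                           // overlapped with the messages in flight

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);    // messages guaranteed complete here
    compute_boundary(halo_in);                    // work that depends on the received halo
}
```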

6.2 Communicator Management

⚠️ Crucial Focus Area: MPI_Comm_split is heavily tested and requires meticulous analysis.
  • Understanding color parameter for subgroup definition
  • Understanding key parameter for rank ordering
  • Tracing collective operations within new communicators
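A sketch of the split mechanics; the grouping by parity and the reversed key ordering are illustrative choices:

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // color: processes with the same color end up in the same sub-communicator.
    // key:   determines rank ordering inside the new communicator
    //        (ascending by key, ties broken by the old rank).
    int color = world_rank % 2;              // evens vs. odds (illustrative grouping)
    int key   = world_size - world_rank;     // reverse the original order within each group

    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &sub);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub, &sub_rank);
    MPI_Comm_size(sub, &sub_size);

    // Collectives now operate only within the sub-communicator.
    int sum = 0;
    MPI_Allreduce(&world_rank, &sum, 1, MPI_INT, MPI_SUM, sub);
    std::printf("world %d -> color %d, sub rank %d/%d, group sum %d\n",
                world_rank, color, sub_rank, sub_size, sum);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```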

6.3 Advanced Concepts

  • Recognizing P2P patterns for collective replacement
  • Custom datatypes for struct communication
  • Hypercube topology reduction (recursive halving)
  • Partitioned Communication for fine-grained synchronization
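A sketch of a hypercube reduction to rank 0 by recursive halving, assuming the process count is a power of two:

```cpp
#include <cstdio>
#include <mpi.h>

// In step d the "upper half" of the remaining processes sends its partial sum
// to the partner obtained by flipping bit d, then drops out; after log2(p)
// steps, rank 0 holds the full result.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = static_cast<double>(rank + 1);   // illustrative local value
    double sum = local;

    for (int step = 1; step < size; step <<= 1) {
        int partner = rank ^ step;                  // flip one hypercube dimension
        if (rank & step) {
            // Upper half: send the partial result and leave the reduction.
            MPI_Send(&sum, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
            break;
        } else {
            double recv;
            MPI_Recv(&recv, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;
        }
    }
    if (rank == 0) std::printf("total = %f\n", sum);   // equals size*(size+1)/2

    MPI_Finalize();
    return 0;
}
```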

7. Hybrid Programming (MPI + OpenMP)

Hybrid programming is a cornerstone of recent exams, integrating knowledge across paradigms.

7.1 Implementation Strategy

Standard Workflow:

  1. Initialization with appropriate thread support (e.g., MPI_THREAD_FUNNELED)
  2. MPI data distribution (e.g., Scatter)
  3. Local computation using OpenMP (parallel for)
  4. MPI global aggregation (e.g., Allreduce)

Optimization: Non-blocking MPI collectives for overlap with OpenMP computations
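A sketch of the standard workflow end to end; the problem size is illustrative:

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv) {
    // 1. Initialization with thread support: FUNNELED = only the main thread calls MPI.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 1 << 20;                       // elements per rank (illustrative)
    std::vector<double> full;                        // only the root holds the full array
    if (rank == 0) full.assign(static_cast<std::size_t>(chunk) * size, 1.0);
    std::vector<double> local(chunk);

    // 2. MPI data distribution.
    MPI_Scatter(full.data(), chunk, MPI_DOUBLE,
                local.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 3. Local computation with OpenMP.
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < chunk; ++i) local_sum += local[i];

    // 4. MPI global aggregation (called by the main thread only, so FUNNELED suffices).
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```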

7.2 Performance Tuning and Debugging

⚠️ Most Frequently Tested: Analyzing hardware descriptions, MPI binding reports, and OpenMP environment variables.

Diagnosis and Fixes:

  • MPI Placement: Fix with mpirun --map-by socket --bind-to socket
  • Subscription Level: Set OMP_NUM_THREADS so that ranks × threads per rank matches the available cores (avoiding over-subscription)
  • Thread Pinning: Use OMP_PROC_BIND=close and OMP_PLACES=cores

Best Practice: "1 MPI Rank per Socket" to optimize NUMA locality