Parallel Programming Exam Study Guide

Welcome to the comprehensive study guide for Parallel Programming and High-Performance Computing exams. This resource covers all major topics tested in recent exams (2018-2025), with detailed solutions, analysis, and practical implementation examples.

📝 Available Exam Solutions

📌 Key Trends: Recent exams (2023-2025) heavily emphasize Hybrid Programming (MPI+OpenMP) and Performance Engineering, particularly understanding NUMA architectures and debugging suboptimal configurations.

1. Foundations: Performance Metrics and Architecture

1.1 Amdahl's Law and Scalability

Amdahl's Law is a guaranteed component (it appears as Problem 1 in every exam).

Core Techniques:

  • Calculating the parallel fraction (f) given speedup and processors
  • Determining required processors (p) for target speedup
  • Calculating maximum theoretical speedup SU(∞)
  • Solving systems of equations when sequential time is unknown
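For reference, the formula and a short worked example (the numbers below are illustrative, not taken from a specific exam):

```latex
% Amdahl's Law: speedup with parallel fraction f on p processors
S(p) = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad
S(\infty) = \lim_{p \to \infty} S(p) = \frac{1}{1 - f}

% Worked example (illustrative): measured speedup S(4) = 3. Solve for f:
% (1 - f) + f/4 = 1/3  =>  1 - (3/4) f = 1/3  =>  f = 8/9 \approx 0.89
% Maximum theoretical speedup: S(\infty) = 1 / (1 - 8/9) = 9
```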

Beyond the Formula:

  • Reasons for sub-linear speedup (overhead, load imbalance)
  • Explaining anomalies such as super-linear speedup (typically caused by cache effects)
  • Distinguishing Strong vs. Weak scaling

1.2 Architectures and Memory Models

  • NUMA Awareness: Understanding Non-Uniform Memory Access is critical for performance analysis
  • Memory Consistency: Sequential Consistency definition (each thread's operations appear in program order, and all operations form a single global total order)

2. Shared Memory I: Threading Fundamentals

2.1 Synchronization Primitives

  • Application: Implementing thread-safe data structures using mutexes and condition variables
  • Techniques: Understanding critical sections, lock placement optimization
  • Robustness: Exception Safety and RAII (e.g., std::lock_guard) to prevent deadlocks
  • Lock Types: Trade-offs of Spin-Locks vs. Yielding Locks, Coarse vs. Fine granularity
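A minimal sketch of the pattern, assuming a simple producer/consumer queue; the class name and interface are illustrative:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Illustrative thread-safe queue: RAII locks guarantee the mutex is
// released even if an exception is thrown inside the critical section.
template <typename T>
class ThreadSafeQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);   // RAII: unlocks on scope exit
            queue_.push(std::move(value));
        }
        cv_.notify_one();                               // wake one waiting consumer
    }

    T pop() {
        std::unique_lock<std::mutex> lock(mutex_);      // unique_lock required for wait()
        cv_.wait(lock, [this] { return !queue_.empty(); }); // guards against spurious wakeups
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};
```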

2.2 Parallel Patterns

  • Manual Implementation: Implementing parallel algorithms without high-level APIs
  • Techniques: Fork/Join model, efficient reduction with local computation followed by global aggregation
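A sketch of the manual reduction pattern with std::thread: local accumulation per thread followed by a single aggregation step after the join (the chunking scheme and names are illustrative):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Manual fork/join reduction: local accumulation per thread, global aggregation after join.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    std::vector<double> partial(num_threads, 0.0);       // one slot per thread, no locking needed
    std::vector<std::thread> threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            // Each thread works on a contiguous block of the input.
            std::size_t chunk = data.size() / num_threads;
            std::size_t begin = t * chunk;
            std::size_t end   = (t == num_threads - 1) ? data.size() : begin + chunk;
            double local = 0.0;                          // local computation...
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;                          // ...written once at the end
        });
    }
    for (auto& th : threads) th.join();                  // join phase

    return std::accumulate(partial.begin(), partial.end(), 0.0);  // global aggregation
}
```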

3. Shared Memory II: OpenMP

⚠️ Crucial Focus Area: The analysis of variable privatization is a recurring and challenging theme.

3.1 The Data Environment

  • Meticulous tracing of execution flow based on specified schedule
  • Key Insight: lastprivate without firstprivate results in uninitialized private variables
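A short sketch of the pitfall and its fix; the schedule, thread count, and values are chosen for illustration so the final value can be traced by hand:

```cpp
#include <cstdio>
#include <omp.h>

int main() {
    int x = 42;

    // BUG: lastprivate alone gives each thread an UNINITIALIZED private x,
    // so the read in "x += i" would be undefined behavior:
    //   #pragma omp parallel for lastprivate(x)
    //   for (int i = 0; i < 8; ++i) x += i;

    // Fix: firstprivate initializes every private copy from the original x (42);
    // lastprivate copies the value after the sequentially last iteration (i == 7) back.
    #pragma omp parallel for num_threads(4) schedule(static, 2) \
                             firstprivate(x) lastprivate(x)
    for (int i = 0; i < 8; ++i) {
        x += i;
    }
    // With schedule(static, 2) and 4 threads, thread 3 executes iterations {6, 7}:
    // its private x ends at 42 + 6 + 7 = 55, which lastprivate copies out.
    std::printf("x = %d\n", x);   // prints 55
    return 0;
}
```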

3.2 Work Sharing and Scheduling

  • static for uniform workloads
  • dynamic for irregular workloads (e.g., sparse matrix processing)
  • NUMA Impact: dynamic scheduling can destroy data locality on NUMA systems
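A sketch contrasting the two choices on an irregular workload (the sparse-row representation here is illustrative):

```cpp
#include <omp.h>
#include <vector>

// Illustrative: per-row work varies widely (e.g., nonzeros per row of a sparse matrix).
void process_rows(const std::vector<std::vector<double>>& rows,
                  std::vector<double>& row_sum) {
    // Uniform work per iteration -> schedule(static) keeps contiguous blocks per thread
    // (good for locality and NUMA first-touch placement).
    // Irregular work per iteration -> schedule(dynamic) balances load, at the cost of
    // scheduling overhead and potentially scattered memory accesses.
    #pragma omp parallel for schedule(dynamic, 16)
    for (long i = 0; i < static_cast<long>(rows.size()); ++i) {
        double s = 0.0;
        for (double v : rows[i]) s += v;   // cost proportional to rows[i].size()
        row_sum[i] = s;
    }
}
```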

3.3 Reductions

  • Standard vs. Manual implementation using thread-private variables and omp critical
  • Manual implementation required when OpenMP lacks standard operator (e.g., "MinLoc")
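A sketch of the manual MinLoc pattern: thread-private candidates combined inside omp critical (function and variable names are illustrative):

```cpp
#include <cstddef>
#include <limits>
#include <omp.h>
#include <vector>

// Manual "MinLoc" reduction: OpenMP has no built-in operator that returns both
// the minimum and its index, so each thread keeps a private candidate and the
// combination step is serialized with omp critical.
void min_loc(const std::vector<double>& a, double& min_val, std::size_t& min_idx) {
    min_val = std::numeric_limits<double>::infinity();
    min_idx = 0;

    #pragma omp parallel
    {
        double local_val = std::numeric_limits<double>::infinity();  // thread-private
        std::size_t local_idx = 0;

        #pragma omp for nowait
        for (long i = 0; i < static_cast<long>(a.size()); ++i) {
            if (a[i] < local_val) { local_val = a[i]; local_idx = static_cast<std::size_t>(i); }
        }

        #pragma omp critical   // global aggregation, one thread at a time
        {
            if (local_val < min_val) { min_val = local_val; min_idx = local_idx; }
        }
    }
}
```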

3.4 Task Parallelism

  • Tasks for irregular or recursive parallelism (tree traversals)
  • Initiation within parallel/single region
  • Using omp taskwait for synchronization
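A sketch of recursive task parallelism on a binary tree; the Node type and function names are illustrative:

```cpp
#include <omp.h>

// Illustrative binary tree node.
struct Node {
    long value;
    Node* left;
    Node* right;
};

long sum_tree(Node* n) {
    if (!n) return 0;
    long left_sum = 0, right_sum = 0;

    // Each subtree becomes a task; shared() lets the child tasks write the results.
    #pragma omp task shared(left_sum) if(n->left)
    left_sum = sum_tree(n->left);

    #pragma omp task shared(right_sum) if(n->right)
    right_sum = sum_tree(n->right);

    #pragma omp taskwait            // wait for both child tasks before combining
    return n->value + left_sum + right_sum;
}

long parallel_sum_tree(Node* root) {
    long result = 0;
    #pragma omp parallel            // create the thread team...
    #pragma omp single              // ...but let only one thread spawn the root task
    result = sum_tree(root);
    return result;
}
```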

4. Shared Memory Performance Engineering

4.1 The "Classic NUMA Pitfalls" Problem

Identifying performance drawbacks on multi-socket systems is frequently tested.

Key Issues:

  • NUMA Effects (First Touch): Sequential initialization allocates all memory to one socket. Fix: parallel initialization
  • False Sharing: Fine-grained scheduling causes cache line conflicts. Fix: padding based on cache line size
  • Poor Locality: Incorrect loop nesting or cyclic scheduling reduces cache efficiency
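A sketch of the two standard fixes, assuming a 64-byte cache line; the array sizes and the per-thread counter array are illustrative:

```cpp
#include <cstdlib>
#include <omp.h>

int main() {
    const std::size_t n = 1 << 26;
    double* a = static_cast<double*>(std::malloc(n * sizeof(double)));
    double* b = static_cast<double*>(std::malloc(n * sizeof(double)));

    // First touch: pages land on the NUMA node of the thread that first writes them.
    // Initializing in parallel with the SAME static schedule as the compute loop
    // keeps each thread's data on its local socket.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    #pragma omp parallel for schedule(static)     // same distribution -> local accesses
    for (std::size_t i = 0; i < n; ++i) a[i] += 2.0 * b[i];

    // False-sharing fix: pad per-thread data to (at least) one cache line,
    // assuming a 64-byte line size.
    struct alignas(64) PaddedCounter { long value = 0; };
    PaddedCounter counters[64];                   // one slot per possible thread (illustrative cap)
    #pragma omp parallel
    counters[omp_get_thread_num()].value += 1;    // no two threads share a cache line

    std::free(a);
    std::free(b);
    return 0;
}
```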

5. SIMD and Vectorization

5.1 Dependencies and Transformations

  • Identifying Barriers: Loop-Carried Flow Dependencies
  • Loop Alignment: Shifting statement instances between iterations so that loop-carried dependencies become loop-independent
  • Loop Distribution (Fission): Separating independent statements
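A sketch of loop distribution; the #pragma omp simd hint on the second loop is one possible way to request vectorization:

```cpp
// Before: the recurrence on a[] blocks vectorization of the whole loop.
//   for (int i = 1; i < n; ++i) {
//       a[i] = a[i - 1] + x;   // loop-carried flow dependence
//       b[i] = c[i] * c[i];    // independent of the recurrence
//   }

// After loop distribution (fission): the second loop has no loop-carried
// dependence and can be vectorized; the first stays sequential.
void fissioned(int n, float* a, float* b, const float* c, float x) {
    for (int i = 1; i < n; ++i)
        a[i] = a[i - 1] + x;       // sequential recurrence

    #pragma omp simd
    for (int i = 1; i < n; ++i)
        b[i] = c[i] * c[i];        // vectorizable
}
```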

5.2 Implementation Techniques

  • Intrinsics (AVX/AVX-512): Implementing vectorized loops
  • Handling unaligned memory (loadu/storeu)
  • Horizontal reductions (collapsing a vector register to a scalar) and scalar remainder loops
  • Advanced: Masked instructions (e.g., AVX-512 masks) for handling the remainder without a scalar epilogue
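A sketch of an AVX summation with unaligned loads, a horizontal reduction, and a scalar remainder loop (assumes AVX is available; a masked AVX-512 epilogue would replace the scalar loop):

```cpp
#include <immintrin.h>

// AVX sketch: sum of a float array using unaligned loads, a horizontal
// reduction of the vector register, and a scalar loop for the remainder.
float avx_sum(const float* x, int n) {
    __m256 acc = _mm256_setzero_ps();

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);   // loadu: no alignment requirement
        acc = _mm256_add_ps(acc, v);
    }

    // Horizontal reduction: add the upper and lower 128-bit halves, then reduce.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float total = _mm_cvtss_f32(s);

    for (; i < n; ++i) total += x[i];        // scalar remainder handling
    return total;
}
```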

6. Distributed Memory Programming: MPI

6.1 Communication Mechanisms

  • Protocols: Eager (buffering) vs. Rendezvous (handshake)
  • Modes: Blocking vs. non-blocking calls, chosen based on when the send buffer is safe to reuse
  • Overlap: Using non-blocking operations for computation/communication overlap
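A sketch of the overlap pattern in a halo-exchange style; compute_interior and compute_boundary are placeholder functions, not part of MPI:

```cpp
#include <mpi.h>
#include <vector>

static void compute_interior() { /* work that does not need the halo data */ }
static void compute_boundary(const std::vector<double>&) { /* work that uses the received halo */ }

// Start communication early, compute on data that does not depend on it,
// and only then wait for completion.
void exchange_and_compute(std::vector<double>& halo_in, std::vector<double>& halo_out,
                          int left, int right, MPI_Comm comm) {
    MPI_Request reqs[2];

    MPI_Irecv(halo_in.data(),  static_cast<int>(halo_in.size()),  MPI_DOUBLE,
              left,  0, comm, &reqs[0]);
    MPI_Isend(halo_out.data(), static_cast<int>(halo_out.size()), MPI_DOUBLE,
              right, 0, comm, &reqs[1]);

    compute_interior();                           // overlapped with the messages in flight

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);    // messages guaranteed complete here
    compute_boundary(halo_in);                    // work that depends on the received halo
}
```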

6.2 Communicator Management

⚠️ Crucial Focus Area: MPI_Comm_split is heavily tested and requires meticulous analysis.
  • Understanding color parameter for subgroup definition
  • Understanding key parameter for rank ordering
  • Tracing collective operations within new communicators
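A sketch of the split mechanics; the grouping by parity and the reversed key ordering are illustrative choices:

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // color: processes with the same color end up in the same sub-communicator.
    // key:   determines rank ordering inside the new communicator
    //        (ascending by key, ties broken by the old rank).
    int color = world_rank % 2;              // evens vs. odds (illustrative grouping)
    int key   = world_size - world_rank;     // reverse the original order within each group

    MPI_Comm sub;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &sub);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub, &sub_rank);
    MPI_Comm_size(sub, &sub_size);

    // Collectives now operate only within the sub-communicator.
    int sum = 0;
    MPI_Allreduce(&world_rank, &sum, 1, MPI_INT, MPI_SUM, sub);
    std::printf("world %d -> color %d, sub rank %d/%d, group sum %d\n",
                world_rank, color, sub_rank, sub_size, sum);

    MPI_Comm_free(&sub);
    MPI_Finalize();
    return 0;
}
```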

6.3 Advanced Concepts

  • Recognizing P2P patterns for collective replacement
  • Custom datatypes for struct communication
  • Hypercube topology reduction (recursive halving)
  • Partitioned Communication for fine-grained synchronization
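A sketch of a hypercube reduction to rank 0 by recursive halving, assuming the process count is a power of two:

```cpp
#include <cstdio>
#include <mpi.h>

// In step d the "upper half" of the remaining processes sends its partial sum
// to the partner obtained by flipping bit d, then drops out; after log2(p)
// steps, rank 0 holds the full result.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = static_cast<double>(rank + 1);   // illustrative local value
    double sum = local;

    for (int step = 1; step < size; step <<= 1) {
        int partner = rank ^ step;                  // flip one hypercube dimension
        if (rank & step) {
            // Upper half: send the partial result and leave the reduction.
            MPI_Send(&sum, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD);
            break;
        } else {
            double recv;
            MPI_Recv(&recv, 1, MPI_DOUBLE, partner, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            sum += recv;
        }
    }
    if (rank == 0) std::printf("total = %f\n", sum);   // equals size*(size+1)/2

    MPI_Finalize();
    return 0;
}
```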

7. Hybrid Programming (MPI + OpenMP)

Hybrid programming is a cornerstone of recent exams, integrating knowledge across paradigms.

7.1 Implementation Strategy

Standard Workflow:

  1. Initialization with appropriate thread support (e.g., MPI_THREAD_FUNNELED)
  2. MPI data distribution (e.g., Scatter)
  3. Local computation using OpenMP (parallel for)
  4. MPI global aggregation (e.g., Allreduce)

Optimization: Non-blocking MPI collectives for overlap with OpenMP computations
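A sketch of the standard workflow end to end; the problem size is illustrative:

```cpp
#include <cstdio>
#include <mpi.h>
#include <omp.h>
#include <vector>

int main(int argc, char** argv) {
    // 1. Initialization with thread support: FUNNELED = only the main thread calls MPI.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int chunk = 1 << 20;                       // elements per rank (illustrative)
    std::vector<double> full;                        // only the root holds the full array
    if (rank == 0) full.assign(static_cast<std::size_t>(chunk) * size, 1.0);
    std::vector<double> local(chunk);

    // 2. MPI data distribution.
    MPI_Scatter(full.data(), chunk, MPI_DOUBLE,
                local.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 3. Local computation with OpenMP.
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < chunk; ++i) local_sum += local[i];

    // 4. MPI global aggregation (called by the main thread only, so FUNNELED suffices).
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum = %f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```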

7.2 Performance Tuning and Debugging

⚠️ Most Frequently Tested: Analyzing hardware descriptions, MPI binding reports, and OpenMP environment variables.

Diagnosis and Fixes:

  • MPI Placement: Fix with mpirun --map-by socket --bind-to socket
  • Subscription Level: Set OMP_NUM_THREADS so that ranks × threads per rank matches the available cores (avoiding over-subscription)
  • Thread Pinning: Use OMP_PROC_BIND=close and OMP_PLACES=cores

Best Practice: "1 MPI Rank per Socket" to optimize NUMA locality