Parallel Programming Exam Study Guide
Welcome to the comprehensive study guide for Parallel Programming and High-Performance Computing exams. This resource covers all major topics tested in recent exams (2018-2025), with detailed solutions, analysis, and practical implementation examples.
📝 Available Exam Solutions
- 2018 Summer Exam (11 Problems): Complete solutions covering Amdahl's Law, OpenMP, MPI, and Hybrid Programming
- 2023 Summer Retake (7 Problems): Focus on NUMA architectures, performance optimization, and modern MPI features
- 2024 Summer Retake (7 Problems): Advanced topics including partitioned communication and vectorization
- 2025 Summer Exam (8 Problems): Latest exam with emphasis on hybrid programming and performance debugging
1. Foundations: Performance Metrics and Architecture
1.1 Amdahl's Law and Scalability
Amdahl's Law is a guaranteed component (P1 in all exams).
Core Techniques (a worked sketch follows this list):
- Calculating the parallel fraction f given the speedup and the number of processors
- Determining the required number of processors p for a target speedup
- Calculating the maximum theoretical speedup SU(∞)
- Solving systems of equations when the sequential time is unknown
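A worked sketch in C++ of the first two techniques, using made-up example numbers (a measured speedup of 3 on 4 processors): Amdahl's Law S(p) = 1 / ((1 − f) + f/p) is rearranged to recover f, which then bounds SU(∞) = 1 / (1 − f).

```cpp
#include <cstdio>

int main() {
    // Example inputs (hypothetical): measured speedup S on p processors.
    double S = 3.0;
    double p = 4.0;

    // Amdahl's Law: S = 1 / ((1 - f) + f / p).
    // Solving for the parallel fraction f:
    //   1/S = 1 - f * (1 - 1/p)   =>   f = (1 - 1/S) / (1 - 1/p)
    double f = (1.0 - 1.0 / S) / (1.0 - 1.0 / p);

    // Maximum theoretical speedup with infinitely many processors:
    //   SU(inf) = 1 / (1 - f)
    double su_inf = 1.0 / (1.0 - f);

    std::printf("parallel fraction f = %.4f\n", f);      // ~0.8889
    std::printf("SU(infinity)       = %.2f\n", su_inf);  // ~9.00
    return 0;
}
```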
Beyond the Formula:
- Reasons for sub-linear speedup (overhead, load imbalance)
- Explaining anomalies such as super-linear speedup (typically caused by cache effects)
- Distinguishing Strong vs. Weak scaling
1.2 Architectures and Memory Models
- NUMA Awareness: Understanding Non-Uniform Memory Access is critical for performance analysis
- Memory Consistency: the definition of Sequential Consistency (each thread's operations appear in program order, within a single global total order)
2. Shared Memory I: Threading Fundamentals
2.1 Synchronization Primitives
- Application: Implementing thread-safe data structures using mutexes and condition variables (see the sketch after this list)
- Techniques: Understanding critical sections, lock placement optimization
- Robustness: Exception safety and RAII (e.g., `std::lock_guard`) to prevent deadlocks
- Lock Types: Trade-offs of spin-locks vs. yielding locks, coarse vs. fine granularity
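A minimal sketch of such a structure, assuming a simple unbounded FIFO is enough for illustration (the name `ThreadSafeQueue` and the producer/consumer setup are made up):

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative thread-safe FIFO: every access to the underlying queue
// happens while holding the mutex (RAII via lock_guard/unique_lock).
template <typename T>
class ThreadSafeQueue {
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(mutex_);  // released on scope exit, even on exceptions
            queue_.push(std::move(value));
        }
        cv_.notify_one();                              // wake one waiting consumer
    }

    T wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });  // predicate guards against spurious wakeups
        T value = std::move(queue_.front());
        queue_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<T> queue_;
};

int main() {
    ThreadSafeQueue<int> q;
    std::thread producer([&] { for (int i = 0; i < 5; ++i) q.push(i); });
    std::thread consumer([&] { for (int i = 0; i < 5; ++i) std::printf("got %d\n", q.wait_and_pop()); });
    producer.join();
    consumer.join();
    return 0;
}
```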
2.2 Parallel Patterns
- Manual Implementation: Implementing parallel algorithms without high-level APIs
- Techniques: Fork/Join model, efficient reduction with local computation followed by global aggregation
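A minimal fork/join sketch of that reduction pattern without any high-level API: each `std::thread` accumulates a partial sum over its own slice, and the main thread aggregates globally after joining (the thread count and data are arbitrary example values).

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    const std::size_t n = 1'000'000;
    const unsigned num_threads = 4;                 // example value
    std::vector<double> data(n, 1.0);

    std::vector<double> partial(num_threads, 0.0);  // one slot per thread: no sharing during compute
    std::vector<std::thread> threads;

    // Fork: each thread reduces its own contiguous slice into a local variable.
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&, t] {
            std::size_t begin = t * n / num_threads;
            std::size_t end   = (t + 1) * n / num_threads;
            double local = 0.0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;                     // single write per thread to its own slot
        });
    }

    // Join, then aggregate the partial results sequentially.
    for (auto& th : threads) th.join();
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);

    std::printf("sum = %.1f\n", total);             // prints 1000000.0
    return 0;
}
```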
3. Shared Memory II: OpenMP
3.1 The Data Environment
- Meticulous tracing of execution flow based on the specified `schedule` clause
- Key Insight: `lastprivate` without `firstprivate` results in uninitialized private variables (demonstrated below)
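A short sketch of that key insight (loop bounds and values are arbitrary): each thread's private copy of `x` starts uninitialized because `firstprivate` is not used, so reading it before writing inside the loop would be undefined; `lastprivate` only copies the value of the sequentially last iteration back out.

```cpp
#include <cstdio>

int main() {
    int x = 42;  // value of x before the parallel loop

    // lastprivate(x): each thread gets its own *uninitialized* private x;
    // without firstprivate(x), the original value 42 is NOT copied in.
    // Reading x before writing it inside the loop would be undefined.
    #pragma omp parallel for lastprivate(x) schedule(static)
    for (int i = 0; i < 8; ++i) {
        x = i * i;  // each iteration overwrites its private copy
    }

    // After the loop, x holds the value from the sequentially last iteration (i == 7).
    std::printf("x after loop: %d\n", x);  // prints 49
    return 0;
}
```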
3.2 Work Sharing and Scheduling
- `static` for uniform workloads
- `dynamic` for irregular workloads (e.g., sparse matrix processing)
- NUMA Impact: `dynamic` scheduling can destroy data locality on NUMA systems (see the sketch below)
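A sketch of `dynamic` scheduling on an artificially irregular workload (the triangular inner loop is just a stand-in for varying per-row cost, e.g. rows of a sparse matrix):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<double> row_sum(n, 0.0);

    // Irregular workload: iteration i does O(i) work, so equal-sized static
    // chunks would leave early threads idle while late threads keep working.
    // dynamic hands out small chunks on demand and rebalances automatically,
    // at the cost of scheduling overhead and (on NUMA) reduced data locality.
    #pragma omp parallel for schedule(dynamic, 16)
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int j = 0; j <= i; ++j) s += j * 0.5;
        row_sum[i] = s;
    }

    // For uniform per-iteration cost, schedule(static) is usually preferable:
    // no runtime scheduling overhead, and iterations (and their data) stay
    // bound to the same threads across loops.
    std::printf("row_sum[%d] = %.1f\n", n - 1, row_sum[n - 1]);
    return 0;
}
```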
3.3 Reductions
- Standard vs. manual implementation using thread-private variables and `omp critical`
- Manual implementation is required when OpenMP lacks a standard operator (e.g., "MinLoc"); see the sketch below
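A sketch of the manual "MinLoc" pattern: thread-private candidates in the hot loop, with the global update confined to a short `omp critical` section (the input array is a made-up example).

```cpp
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> a = {5.0, 3.0, 9.0, 1.0, 7.0, 2.0};
    double min_val = a[0];
    int    min_idx = 0;

    #pragma omp parallel
    {
        // Thread-private candidates: no synchronization needed in the hot loop.
        double local_val = a[0];
        int    local_idx = 0;

        #pragma omp for nowait
        for (int i = 1; i < (int)a.size(); ++i) {
            if (a[i] < local_val) { local_val = a[i]; local_idx = i; }
        }

        // Global aggregation: one short critical section per thread, because
        // OpenMP has no built-in "minloc" reduction operator for a value/index pair.
        #pragma omp critical
        {
            if (local_val < min_val) { min_val = local_val; min_idx = local_idx; }
        }
    }

    std::printf("min %.1f at index %d\n", min_val, min_idx);  // min 1.0 at index 3
    return 0;
}
```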
3.4 Task Parallelism
- Tasks for irregular or recursive parallelism (tree traversals)
- Initiation within a `parallel`/`single` region
- Using `omp taskwait` for synchronization (see the sketch below)
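A minimal task sketch for a recursive tree sum (the `Node` structure and the tiny hard-coded tree are illustrative only): tasks are spawned inside a `parallel`/`single` region and `taskwait` ensures both children finish before their results are combined.

```cpp
#include <cstdio>

// Hypothetical binary tree node used only for illustration.
struct Node {
    int value;
    Node* left;
    Node* right;
};

long long sum_tree(Node* n) {
    if (!n) return 0;
    long long left_sum = 0, right_sum = 0;

    // Each subtree becomes a task; shared(...) lets the child task write the result.
    #pragma omp task shared(left_sum)
    left_sum = sum_tree(n->left);

    #pragma omp task shared(right_sum)
    right_sum = sum_tree(n->right);

    #pragma omp taskwait  // wait for both child tasks before combining
    return n->value + left_sum + right_sum;
}

int main() {
    Node leaves[4] = {{1, nullptr, nullptr}, {2, nullptr, nullptr},
                      {3, nullptr, nullptr}, {4, nullptr, nullptr}};
    Node inner[2]  = {{10, &leaves[0], &leaves[1]}, {20, &leaves[2], &leaves[3]}};
    Node root      = {100, &inner[0], &inner[1]};

    long long total = 0;
    #pragma omp parallel   // team of threads executes the tasks
    #pragma omp single     // only one thread starts the recursion
    total = sum_tree(&root);

    std::printf("tree sum = %lld\n", total);  // 140
    return 0;
}
```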
4. Shared Memory Performance Engineering
4.1 The "Classic NUMA Pitfalls" Problem
Identifying performance drawbacks on multi-socket systems is frequently tested.
Key Issues:
- NUMA Effects (First Touch): Sequential initialization allocates all memory on one socket. Fix: parallel initialization (see the sketch after this list)
- False Sharing: Fine-grained scheduling causes cache line conflicts. Fix: padding based on cache line size
- Poor Locality: Incorrect loop nesting or cyclic scheduling reduces cache efficiency
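A sketch of the first-touch fix, assuming threads are pinned (e.g. via `OMP_PROC_BIND=close` and `OMP_PLACES=cores`) and using raw allocation so no constructor touches the pages first; the array size is an arbitrary example.

```cpp
#include <cstdio>
#include <memory>

int main() {
    const std::size_t n = 1 << 24;
    // new double[n] does not write to the memory, so physical page placement
    // is still undecided at this point and will be made at first touch.
    std::unique_ptr<double[]> a(new double[n]);
    std::unique_ptr<double[]> b(new double[n]);

    // First touch: the thread that first writes a page determines which NUMA
    // node backs it. Parallel initialization with a static schedule spreads
    // the pages across sockets instead of concentrating them on the socket
    // of the (sequentially initializing) master thread.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = 0.0;
        b[i] = 1.0;
    }

    // Compute loop with the identical static schedule: each thread mostly
    // accesses memory on its own socket.
    #pragma omp parallel for schedule(static)
    for (std::size_t i = 0; i < n; ++i)
        a[i] += 2.0 * b[i];

    std::printf("a[0] = %.1f\n", a[0]);  // 2.0
    return 0;
}
```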
5. SIMD and Vectorization
5.1 Dependencies and Transformations
- Identifying Barriers: Loop-Carried Flow Dependencies
- Loop Alignment: Shifting execution to turn loop-carried into loop-independent dependencies
- Loop Distribution (Fission): Separating independent statements
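A small sketch of loop distribution: in the original fused loop the recurrence on `t` blocks vectorization of everything, while after fission the first loop is free of loop-carried dependencies and can be vectorized (array contents are arbitrary example values).

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), s(n), t(n, 0.0f);
    t[0] = 1.0f;

    // Original fused loop (for reference):
    //   for (int i = 1; i < n; ++i) {
    //       s[i] = a[i] + b[i];     // independent across iterations
    //       t[i] = t[i-1] * s[i];   // loop-carried flow dependence on t
    //   }
    // The dependence on t[i-1] prevents vectorizing the whole loop.

    // After distribution: the first loop has no loop-carried dependence and
    // can be vectorized by the compiler; the recurrence remains sequential.
    for (int i = 1; i < n; ++i)
        s[i] = a[i] + b[i];

    for (int i = 1; i < n; ++i)
        t[i] = t[i - 1] * s[i];

    std::printf("t[3] = %.1f\n", t[3]);  // 1 * 3 * 3 * 3 = 27.0
    return 0;
}
```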
5.2 Implementation Techniques
- Intrinsics (AVX/AVX-512): Implementing vectorized loops
- Handling unaligned memory (`loadu`/`storeu`)
- Horizontal reductions and remainder handling (sketched below)
- Advanced: Masked instructions for remainder handling
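A sketch of an AVX dot product using unaligned loads/stores, a horizontal reduction of the accumulator lanes, and a scalar remainder loop; it assumes compilation with AVX enabled (e.g. `-mavx`), and the array length is deliberately not a multiple of 8.

```cpp
#include <cstdio>
#include <immintrin.h>
#include <vector>

// Sums a[i] * b[i]; n need not be a multiple of the vector width.
float dot(const float* a, const float* b, int n) {
    __m256 acc = _mm256_setzero_ps();        // 8 partial sums

    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);  // unaligned loads: no alignment guarantee needed
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }

    // Horizontal reduction: store the 8 lanes and add them up.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) sum += lanes[k];

    // Scalar remainder loop for the last n % 8 elements
    // (with AVX-512 this could instead use a masked load).
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

int main() {
    const int n = 19;                        // not a multiple of 8 on purpose
    std::vector<float> a(n, 2.0f), b(n, 3.0f);
    std::printf("dot = %.1f\n", dot(a.data(), b.data(), n));  // 19 * 6 = 114.0
    return 0;
}
```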
6. Distributed Memory Programming: MPI
6.1 Communication Mechanisms
- Protocols: Eager (buffering) vs. Rendezvous (handshake)
- Modes: Blocking vs. Non-blocking based on buffer safety
- Overlap: Using non-blocking operations for computation/communication overlap
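A minimal overlap sketch with a ring-style halo exchange (the buffer size and neighbor pattern are example choices): the transfers are started with `MPI_Irecv`/`MPI_Isend`, interior work proceeds while they are in flight, and `MPI_Waitall` completes them before the halo value is used.

```cpp
#include <cstdio>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;
    std::vector<double> local(n, rank), halo(1, 0.0);
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;

    // Start the exchange: send my last element to the right neighbor,
    // receive the left neighbor's last element into the halo cell.
    MPI_Request reqs[2];
    MPI_Irecv(halo.data(), 1, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&local[n - 1], 1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    // Overlap: do the work that does not depend on the halo value
    // while the messages are in flight.
    double interior_sum = 0.0;
    for (int i = 0; i < n; ++i) interior_sum += local[i];

    // Complete communication before touching the halo-dependent part.
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    double total = interior_sum + halo[0];

    std::printf("rank %d: total = %.1f\n", rank, total);
    MPI_Finalize();
    return 0;
}
```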
6.2 Communicator Management
MPI_Comm_split is heavily tested and requires meticulous analysis.
- Understanding the `color` parameter for subgroup definition
- Understanding the `key` parameter for rank ordering
- Tracing collective operations within new communicators (see the sketch below)
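A sketch of `MPI_Comm_split` with a made-up grouping (even vs. odd world ranks): `color` selects the subgroup, `key` fixes the ordering inside it, and the subsequent `MPI_Allreduce` spans only the new communicator.

```cpp
#include <cstdio>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int world_rank, world_size;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // color: processes with the same color end up in the same sub-communicator
    //        (here: 0 = even world ranks, 1 = odd world ranks).
    // key:   determines rank ordering within the sub-communicator
    //        (here: keep the world-rank order).
    int color = world_rank % 2;
    int key   = world_rank;

    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &sub_comm);

    int sub_rank, sub_size;
    MPI_Comm_rank(sub_comm, &sub_rank);
    MPI_Comm_size(sub_comm, &sub_size);

    // This collective involves only the processes sharing this color.
    int sum = 0;
    MPI_Allreduce(&world_rank, &sum, 1, MPI_INT, MPI_SUM, sub_comm);

    std::printf("world %d -> color %d, sub rank %d/%d, sum of world ranks in group = %d\n",
                world_rank, color, sub_rank, sub_size, sum);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}
```

Tracing such code by hand means working out, per process, which communicator it lands in and what its new rank is before evaluating the collective.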
6.3 Advanced Concepts
- Recognizing P2P patterns for collective replacement
- Custom datatypes for struct communication (see the sketch after this list)
- Hypercube topology reduction (recursive halving)
- Partitioned Communication for fine-grained synchronization
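A sketch of a derived datatype for the struct case, using a hypothetical `Particle` layout and assuming at least two ranks: `MPI_Type_create_struct` records the member offsets so a whole struct can be sent as one typed message.

```cpp
#include <cstddef>
#include <cstdio>
#include <mpi.h>

struct Particle {        // hypothetical payload
    int    id;
    double pos[3];
};

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // Describe the struct layout: one int followed by three doubles,
    // at their actual offsets (compiler padding is captured via offsetof).
    int          blocklengths[2]  = {1, 3};
    MPI_Aint     displacements[2] = {(MPI_Aint)offsetof(Particle, id),
                                     (MPI_Aint)offsetof(Particle, pos)};
    MPI_Datatype member_types[2]  = {MPI_INT, MPI_DOUBLE};

    MPI_Datatype particle_type;
    MPI_Type_create_struct(2, blocklengths, displacements, member_types, &particle_type);
    MPI_Type_commit(&particle_type);
    // For arrays of Particle, the extent would additionally be fixed
    // with MPI_Type_create_resized; a single element needs no resize.

    if (rank == 0) {
        Particle p = {7, {1.0, 2.0, 3.0}};
        MPI_Send(&p, 1, particle_type, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        Particle p;
        MPI_Recv(&p, 1, particle_type, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("received particle %d at (%.1f, %.1f, %.1f)\n",
                    p.id, p.pos[0], p.pos[1], p.pos[2]);
    }

    MPI_Type_free(&particle_type);
    MPI_Finalize();
    return 0;
}
```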
7. Hybrid Programming (MPI + OpenMP)
Hybrid programming is a cornerstone of recent exams, integrating knowledge across paradigms.
7.1 Implementation Strategy
Standard Workflow (sketched in code below):
- Initialization with appropriate thread support (e.g., `MPI_THREAD_FUNNELED`)
- MPI data distribution (e.g., `Scatter`)
- Local computation using OpenMP (`parallel for`)
- MPI global aggregation (e.g., `Allreduce`)
Optimization: Non-blocking MPI collectives for overlap with OpenMP computations
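A compact sketch of the standard workflow above, with the non-essential details made up (vector length, data values) and the total element count assumed to divide evenly by the number of ranks:

```cpp
#include <cstdio>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    // 1. Initialize with the required thread support level.
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // 2. Distribute the data: rank 0 scatters equal chunks (n assumed divisible by size).
    const int n = 1 << 20;
    const int chunk = n / size;
    std::vector<double> full;
    if (rank == 0) full.assign(n, 1.0);
    std::vector<double> local(chunk);
    MPI_Scatter(full.data(), chunk, MPI_DOUBLE,
                local.data(), chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    // 3. Local computation with OpenMP threads inside each rank.
    double local_sum = 0.0;
    #pragma omp parallel for reduction(+ : local_sum)
    for (int i = 0; i < chunk; ++i)
        local_sum += local[i] * local[i];

    // 4. Global aggregation across ranks. MPI is called outside the parallel
    //    region by the main thread, which is all MPI_THREAD_FUNNELED permits.
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("global sum = %.1f\n", global_sum);
    MPI_Finalize();
    return 0;
}
```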
7.2 Performance Tuning and Debugging
Diagnosis and Fixes:
- MPI Placement: Fix with `mpirun --map-by socket --bind-to socket`
- Subscription Level: Set `OMP_NUM_THREADS` appropriately
- Thread Pinning: Use `OMP_PROC_BIND=close` and `OMP_PLACES=cores`
Best Practice: "1 MPI Rank per Socket" to optimize NUMA locality