Contents

Parallel and Distributed Computing - Detailed Study Summary

Detailed learning notes distilled from the lecture slide PDFs in this directory (decks 01-14). Content has been cross-checked and enriched against the FS21-exam-tuned cheatsheet.html and cheatsheet_extended.html, which are the authoritative references for what is exam-relevant and for the exact formulas. Every dense concept is paired with a short plain-language Intuition note (rendered as a highlighted callout in the HTML version) so you recall the idea, not just the formula.

Course: FHNW MSE TSM_ProgAlg Purpose: exam and exercise preparation

How to Use This Summary

Read the four parts in order for the full story: foundations -> shared memory -> distributed memory -> heterogeneous systems.
Use the deck links in the navigation tables and source index when you need the original figures, code listings, or formulas from the slides.
Use the cross-deck connection sections to understand how formulas, algorithms, and programming models depend on each other.
Use the collapsible Contents list (HTML version) or the heading structure for keyword lookup; this file is the detailed learning summary.

Course Map

Introduction and C++ basics
    -> performance metrics and scalability laws
    -> decomposition, task graphs, mapping, and granularity
        -> shared-memory systems: caches, OpenMP, shared sorting, graph search
        -> distributed-memory systems: MPI, collectives, DMS sorting, numerical algorithms
        -> heterogeneous systems: GPUs, OpenCL/SYCL, matrix multiplication, image processing
    -> conclusion: real HPC systems, interconnects, and exam expectations

Slide deck	Summary part	Main topics
`01 Introduction.pdf`	Part I	Course structure, motivation, machine model, concurrency vs parallelism, abbreviations
`01 Parallel Programming in C++.pdf`	Part I	C++ tooling, threads, futures, synchronization, execution policies
`02 Performance Metrics.pdf`	Part I	RAM/PRAM, Big-O, speedup, efficiency, overhead, Amdahl, Gustafson, Karp-Flatt
`03 Decomposition and Mapping.pdf`	Part I	Task graphs, interaction graphs, granularity, decomposition, mapping, LU
`04 Shared Memory Systems.pdf`	Part II	UMA/NUMA, cache coherence, OpenMP directives, clauses, scheduling
`04 OpenMP_common_mistakes.pdf`	Part II	OpenMP correctness and performance mistakes, best practices, checklist
`05 ParallelSorting1.pdf`	Part II	Comparison sorting, merge sort, quicksort, bitonic sort on shared memory
`06 GraphSearch.pdf`	Part II	Graph representations, DFS, best-first, A*, discrete optimization, task mapping
`07 Scalability.pdf`	Part I	Strong/weak scaling, isoefficiency, cost-optimality, degree of concurrency, minimum time
`08 Distributed Memory Systems.pdf`	Part III	DMS architecture, MPI basics, point-to-point semantics, communicators, topologies
`09 Collective Communication.pdf`	Part III	Communication cost model, topologies, broadcast, reduction, scan, scatter/gather, all-to-all
`10 ParallelSorting2.pdf`	Part III	Distributed odd-even transposition, compare-split, parallel Shellsort
`11 NumericalAlgorithms.pdf`	Part III	Matrix-vector, matrix-matrix, Cannon, SYCL matrix multiply, numerical integration
`12 Heterogeneous Systems.pdf`	Part IV	Heterogeneous architectures, GPU model, SYCL/DPC++, local memory, performance
`12 OpenCL by Ofer Rosenberg.pdf`	Part IV	OpenCL platform, execution, memory model, host workflow, kernel rules, profiling
`12 from OpenCL to Data Parallel C++.pdf`	Part IV	OpenCL to SYCL/DPC++ migration, oneAPI tools, oneMKL, USM, roofline
`13 ImageProcessing.pdf`	Part IV	Digital images, linear filters, borders, GPU image processing, edge detection
`14 Conclusion.pdf`	Part IV	CALCULON, Top500, interconnects, exam information, course synthesis

High-Value Topic Threads

Thread	Decks to revisit	Why it matters
Performance analysis	`02 Performance Metrics.pdf`, `07 Scalability.pdf`, algorithm decks	Speedup alone is not enough; cost, overhead, efficiency, and isoefficiency decide whether an algorithm scales.
Decomposition -> mapping -> communication	`03 Decomposition and Mapping.pdf`, `04 Shared Memory Systems.pdf`, `08 Distributed Memory Systems.pdf`, `09 Collective Communication.pdf`	The same task graph can become OpenMP scheduling, MPI messages, or GPU work-item mapping depending on the architecture.
Sorting across architectures	`05 ParallelSorting1.pdf`, `10 ParallelSorting2.pdf`, `12 Heterogeneous Systems.pdf`	Compare-exchange in shared memory becomes compare-split and neighbor exchange in DMS; bitonic sorting is poor asymptotically but maps well to regular hardware.
Matrix algorithms	`03 Decomposition and Mapping.pdf`, `09 Collective Communication.pdf`, `11 NumericalAlgorithms.pdf`, `12 Heterogeneous Systems.pdf`	Matrix-vector and matrix-matrix examples connect output partitioning, collectives, Cartesian topologies, Cannon, and GPU tiling/local memory.
Programming models	`01 Parallel Programming in C++.pdf`, `04 Shared Memory Systems.pdf`, `08 Distributed Memory Systems.pdf`, `12 Heterogeneous Systems.pdf`, `12 OpenCL by Ofer Rosenberg.pdf`, `12 from OpenCL to Data Parallel C++.pdf`	C++ threads, OpenMP, MPI, OpenCL, and SYCL expose different parts of the same parallel-design problem.
Image and stencil-like data parallelism	`13 ImageProcessing.pdf`, `12 Heterogeneous Systems.pdf`	Output decomposition, halo/border handling, memory locality, and GPU work-group synchronization appear in a concrete application.

Part I - Foundations, Metrics, Decomposition, and Scalability

Parallel Computing Motivation and Machine Models (01 Introduction)

Parallel and distributed computing context

Parallel computing is computation in which many calculations are carried out simultaneously. The basic idea is that large problems can often be divided into smaller subproblems that are solved concurrently. Distributed computing extends this across multiple computers connected by a network, while shared-memory and heterogeneous systems may keep all processing elements inside one machine.

Intuition

Parallel = "split the work over many workers and run them at the same time." The whole course is about how to split (decomposition), where to put the pieces (mapping), and what it costs to make them cooperate (communication/synchronization).

These themes recur because the course is built as one connected story. Performance metrics give you the vocabulary to judge a parallel program — runtime, speedup, efficiency, overhead, cost, cost-optimality, and scalability. Decomposition and mapping give you the recipe for creating parallel work: split a problem into tasks, map tasks to processes, map processes to processors, distribute the data, synchronize, and control the resulting communication. From there the three architecture families each instantiate that recipe differently. Shared-memory systems use threads and OpenMP over a shared address space; distributed-memory systems use message passing, especially MPI; and heterogeneous shared-memory systems add accelerators such as GPUs programmed with SYCL/OpenCL.

Parallel machine model (01 Introduction)

A realistic parallel system is hierarchical, and that hierarchy is the single most important picture to keep in your head. At the top, a cluster is a set of nodes connected by a network. Each node can hold CPUs, GPUs, main memory, and accelerator memory. Inside a CPU you find cores, and each core carries its own caches and arithmetic units such as ALUs and FPUs; a core may further expose hardware threads (hyperthreads). GPUs sit off to the side with many simpler processing elements and high-throughput memory, which is exactly what makes them shine on data-parallel workloads.

This hierarchy matters because computation is cheap only if data movement is controlled. Moving data across cache levels, between CPU and GPU memory, or across a network can dominate runtime.

Intuition

Arithmetic is essentially free; moving data is the expensive part. Every layer of the hierarchy (cache -> RAM -> GPU memory -> network) is slower than the one above it, so good parallel code keeps data close to the unit that uses it. This single idea reappears as cache locality, false sharing, message cost t_s + m*t_w, and GPU local-memory tiling later in the course.

Concurrent programs vs parallel programs (01 Introduction)

Concurrent programs are designed as collections of interacting computational processes — think OS processes or threads. Those processes can run sequentially on a single processor by interleaving their execution steps, or in parallel by assigning different processes to different processors. Either way, the hard part is coordination: you have to ensure correct sequencing of interactions and communication, coordinate access to shared resources, and avoid races, deadlocks, and invalid assumptions about ordering.

Concurrency and parallelism are related but not identical, and the distinction is worth getting precise:

	Concurrency	Parallelism
About	Structure: several tasks in progress, possibly depending on each other	Execution: several units doing mostly independent work at once
Central concern	Coordination and synchronization	Minimizing synchronization
Hardware needed	Can exist on a single core by interleaving	Needs multiple cores

Intuition

Concurrency is about structure (dealing with many things at once, with dependencies); parallelism is about execution (doing many things at once on real hardware). You can have concurrency without parallelism (time-slicing one CPU) and parallelism without much concurrency (independent data-parallel work). A useful pair of pictures: building a house is concurrent — many activities overlap, but dependencies such as "walls before roof" force ordering — whereas farming a large field is parallel — multiple workers divide independent regions and work mostly independently.

Forms of parallelism (01 Introduction)

Parallelism shows up at several levels of the stack at once:

Level	What runs in parallel
Bit-level	Multiple bits of a machine word are processed together
Instruction-level	Multiple instructions are overlapped or executed in parallel by the processor
Data	The same operation is applied to many data elements
Task	Different tasks run in parallel and cooperate

Intuition

Bit- and instruction-level parallelism are "free" — the hardware does them for you. Data and task parallelism are what you design: data parallelism = same code, many elements (the GPU/SIMD/std::execution::par style); task parallelism = different code paths cooperating (the threads/OpenMP-sections style).

Modern computer architecture leans on parallelism because single-core frequency scaling is capped by power and heat. The common hardware forms are multi-core CPUs, multi-processor machines, clusters and grids, and heterogeneous systems with accelerators.

Intuition

Clock speeds stopped rising — the "power wall": faster clocks burn too much heat. The only way left to get more performance is more cores doing work in parallel, which is exactly why this course exists.

Why parallel computing is needed (01 Introduction)

Parallelism is driven by the demand for computational speed, for larger available memory, and sometimes for fault tolerance. Many scientific and engineering problems are simply too large to finish sequentially in a reasonable time, and two classic back-of-the-envelope estimates make the point.

Worked example

Weather forecasting. Model the atmosphere as 3-D cells with attributes. The global atmosphere covers about 5 * 10^8 square miles; with cells of size 1 mile x 1 mile x 1 mile up to 10 miles in height you get roughly 50 * 10^8 cells. If each cell update costs 2000 FLOP, a single time step costs about 10^13 FLOP. Seven days at one-minute intervals is 10080 time steps, so the full forecast is about 10^17 FLOP. A 100 GFLOP/s machine would need about 10^6 seconds — over 11.6 days. To finish in one second you would need about 100 PFLOP/s.

Intuition

The forecast is useless if computing "tomorrow's weather" takes 11 days. Parallelism is what turns an intractable wall-clock time into a useful one.

Worked example

N-body / astronomical simulation. Each of N bodies is attracted by every other body, so direct computation needs N - 1 forces per body, i.e. O(N^2) work per iteration. Approximation algorithms cut this to O(N log N). With 10^11 stars and 10^-9 s per interaction, the direct O(N^2) method takes about 10^13 seconds for one iteration, whereas O(N log N) brings one iteration down to about 10^3 seconds.

Algorithmic improvement and parallelism are complementary, not interchangeable. A better sequential algorithm often matters more than throwing processors at a poor one, but parallel algorithms such as parallel dot product and parallel merge sort exploit extra hardware once the algorithmic structure is suitable.

Intuition

Algorithm first, then parallelize. Going O(N^2) -> O(N log N) above buys a 10^10x speedup — far more than any number of processors could. Parallelizing a bad algorithm just makes it finish a constant factor sooner. (This is the "optimize serial first" rule that returns in the OpenMP and image-processing decks.)

Important abbreviations (01 Introduction)

You will meet these acronyms constantly, so keep this glossary handy:

Abbr.	Meaning	Abbr.	Meaning
ALU	Arithmetic Logic Unit	MIMD	Multiple Instruction Multiple Data
APU	Accelerated Processing Unit	MPP	Massive Parallel Processing
CPU	Central Processing Unit	PRAM	Parallel Random-Access Machine
DMS	Distributed Memory System	SIMD	Single Instruction Multiple Data
GPU	Graphics Processing Unit	SMP	Symmetric Multiprocessing
HPC	High Performance Computing	SMS	Shared Memory System
HSMS	Heterogeneous Shared Memory System	SMT	Simultaneous Multithreading
HT	Hyperthread	SPMD	Single Program Multiple Data

Intuition (Flynn taxonomy)

The classic axes are instruction streams x data streams. SIMD = one instruction drives many data lanes (vector units, GPUs). MIMD = independent cores running independent instructions (multi-core CPUs, clusters). SPMD is the practical programming style on MIMD hardware: one program, every process/rank runs it on its own data slice (the MPI and GPU-kernel model — same kernel, different work-item indices).

Parallel Programming in C++ (01 Parallel Programming in C++)

Tooling and C++ baseline

The deck assumes C++20 or newer; the ISO C++ standard, cppreference.com, and the C++ algorithms library documentation are the references you will return to. On the tooling side you have a choice of IDEs (VS Code, CLion, Visual Studio) and compilers (g++, clang, Intel, VC++). The Intel oneAPI Base Toolkit supplies OpenMP and SYCL, the oneAPI HPC Toolkit adds MPI, and the dev container bundles OpenMP, MPI, and SYCL together. For real runs there is the HPC cluster calculon.informatik.fhnw.ch, driven by SLURM and a Singularity container.

The exam-level point behind all of this is simple: the language and toolchain must support modern concurrency features — std::thread, std::future, std::async, the synchronization primitives, and the standard execution policies.

C++ basics relevant to parallel code

A few C++ fundamentals carry surprising weight once threads enter the picture. Recall first that the build process runs in three stages: preprocessing resolves includes, compilation turns source into object files, and linking combines object files and libraries into an executable.

When it comes to user types, the only difference between struct and class is the default access level — struct members are public, class members are private. Mark read-only access with const member functions, and pass objects by reference (especially large ones) to avoid expensive copies. That last habit is where threading bites: be careful with object lifetime when references are captured by threads.

Intuition

Lifetime bugs are the #1 threading trap. If a thread captures a reference to a local object and the creating function returns before the thread finishes, the thread reads freed memory. Either join() before the object dies, or capture by value / shared_ptr.

Memory placement follows the usual stack-vs-heap split: a stack object is just Point p2{3, 5};, while a heap/shared object is auto p3 = std::make_shared<Person>("Peter", 23); — and remember to use . for direct objects but -> for pointers and shared pointers. For sequences, C arrays can be static or allocated dynamically with new[] and released with delete[], but you should prefer the C++ containers: std::array<T, N> for a static size and std::vector<T> for a dynamic one. Iterators are what let the standard algorithms generalize across arrays and containers alike.

Three approaches to modern C++ parallelism

Modern C++ gives you three levels of abstraction for parallelism, and they form one ladder distinguished by how much the runtime does on your behalf:

Approach	What you write	Tradeoff
Explicit threading	`std::thread`, locks, condition variables, custom scheduling	Maximum control; you own data movement, synchronization, and every safety decision
Execution policies	A standard algorithm called with a sequential, parallel, vectorized, or combined policy	Same algorithmic expression targets different execution strategies; relies on the runtime and standard semantics
Directive-based (e.g. OpenMP)	Parallel directives layered over threads	Least boilerplate for loop partitioning, thread setup, and synchronization; least direct control

The conceptual tradeoff is control vs abstraction: explicit threads give fine-grained control, while execution policies and directives are simpler but lean more on runtime semantics.

Intuition

Same ladder, three rungs of "how much the runtime does for you." std::thread = you manage everything (and own every bug). Execution policies = "add one argument to a standard algorithm and it parallelizes." OpenMP = "add one #pragma to a loop." Higher abstraction means less control over scheduling but far fewer ways to introduce races.

Threads (01 Parallel Programming in C++)

A thread is a single stream of control in a program. In C++, a std::thread starts running automatically in its constructor — there is no separate "start" call:

std::thread th1(printFibs, 28, 35);
std::thread th2([&img] { img.fill(0, 1, 2); });
th1.join();
th2.join();

The constructor pattern is thread(executable_object, parameters...), where the executable object can be a function object, a lambda expression, or a function pointer. By default the arguments are copied into the thread, so to share state you must opt in deliberately — with a [&] reference capture or std::ref. Finally, the parent must either join the thread (wait for it to finish) or detach it.

Intuition

Constructing a std::thread immediately launches the work. Arguments are copied into the thread by default (safe but a hidden cost for big objects); to share state you must opt in with std::ref(x) or a [&] capture. Every thread must be join()-ed or detach()-ed before its destructor runs, or the program calls std::terminate.

Threads are a deliberately low-level abstraction: any data exchange must be synchronized manually, and thread_local storage gives each thread its own instance of a static or global variable. The sharpest gotcha concerns exceptions — an uncaught exception inside a thread function terminates the whole program rather than propagating to the parent.

Watch out

An exception that escapes a thread function does not propagate to the parent — it kills the whole process. That is a key reason to prefer std::async/future, which capture the exception and rethrow it cleanly at get().

Futures and async (01 Parallel Programming in C++)

A std::future represents a result that will become available later, and it is safer than a raw thread both for returning values and for propagating exceptions. The mechanics are small: std::async() starts or schedules a computation and returns immediately with a future<RT> (where RT is the return type); future.get() then blocks until the result is ready. Any exception thrown by the asynchronous computation is rethrown in the parent thread at the moment you call get(). A future can also be produced from a std::promise, which acts as a one-shot communication channel.

Intuition

A future is an "IOU for a result." async hands you the IOU instantly and the work runs in the background; get() is where you cash it in and block until it is ready (and where any exception surfaces). This is the clean, value-returning alternative to a bare thread + manual shared variable.

How the work actually runs depends on the launch policy. std::launch::async guarantees asynchronous execution on a separate thread or execution resource, whereas std::launch::deferred postpones the computation until get() or wait() is called. The difference has visible consequences: if fut1 = async(launch::async, fibrec, 35) and fut2 = async(launch::deferred, fibrec, 35), then fut1 can run concurrently while the parent does other work, but calling fut2.get() first executes fut2 in the calling thread and may delay collecting fut1. Changing the order of your get() calls changes the overlap and therefore the observed runtime.

Intuition

deferred is lazy — nothing happens until you get()/wait(), and then it runs in the calling thread, not in parallel. So a deferred task gives you no speedup; the order in which you get() futures decides how much real overlap you observe. (If you call the deferred one first, you serialize it ahead of the async one.)

Synchronization primitives (01 Parallel Programming in C++)

Synchronization becomes necessary the moment multiple threads access common data and at least one of them writes.

Intuition

Concurrent reads are always fine. The moment one thread writes data another thread touches, you have a potential race and need synchronization. No shared writes -> no synchronization needed (the ideal case: disjoint outputs).

The standard library gives you a toolbox of primitives, ordered roughly from cheapest to heaviest:

Primitive	Purpose
`std::atomic<T>`	Atomic access to a value; operations are indivisible
`std::atomic_flag`	Atomic Boolean-like flag, intended to be lock-free
`std::once_flag` + `std::call_once`	Ensures exactly one thread executes a function
`std::mutex`	Mutual exclusion
`std::recursive_mutex`	Lets the same thread re-enter a critical section — useful for recursive functions but often a design smell
`std::lock_guard`	RAII lock: locks on construction, unlocks on destruction; simple and safe
`std::unique_lock`	RAII lock with explicit locked/unlocked states; supports more flexible locking patterns
`std::condition_variable`	Blocks a thread until another thread signals/notifies a condition

Intuition

Prefer the cheapest tool that is still correct. A lock-free atomic is cheaper than a mutex; an RAII lock_guard is safer than manual lock/unlock (it can never forget to unlock, even on an exception). Reach for a mutex only when you must protect a block of operations, not a single value.

There is a tension to keep in mind: synchronization fixes correctness, but it can hurt scalability by adding overhead, contention, and serialization.

Intuition

Every lock is a little serial section — if many threads queue on the same lock, they take turns instead of running in parallel, and you can end up slower than a single thread. The cure is to share less: give each thread private data and combine results once at the end (the reduction pattern), rather than hammering one shared variable.

Serial vs parallel for-loops (01 Parallel Programming in C++)

Many languages reuse for for two semantically different situations. A serial loop runs a fixed number of repetitive steps with a sequential ordering — reading records from a sequential file, for instance. A parallel loop also runs a fixed number of repetitive steps, but they may execute in any order, like applying an independent operation to every array element. A loop is parallelizable only if its iterations are independent, or if the dependencies are controlled; the hidden dependencies to watch for are shared writes, accumulation variables, I/O ordering, and mutation of non-thread-safe containers.

Intuition

Ask "if I ran the iterations in a random order (or all at once), would the result change?" If yes, there is a dependency. The classic blocker is a loop-carried dependency like a prefix sum c[i] = c[i-1] + x[i], where each step needs the previous one — not parallelizable as written.

Parallel for-each and scheduling (01 Parallel Programming in C++)

A recursive parallelForEach splits an iterator range into halves and conquers them in parallel. The base case is a range of length 0 or 1; the recursive case splits at the midpoint, spawns one half asynchronously, and processes the other half in the current thread, then waits on (gets) the future while handling exceptions so the spawned work is never abandoned. This is a divide-and-conquer task decomposition, and its usefulness hinges on granularity: too fine creates excessive task overhead, too coarse starves you of concurrency.

Intuition

"Spawn one half, do the other half yourself" keeps the current thread busy instead of blocking. But splitting all the way down to single elements drowns you in task-creation overhead, so real implementations stop splitting at a coarse cutoff size (granularity control).

When you assign loop iterations to workers explicitly, the choice is between static and dynamic scheduling:

	Static scheduling	Dynamic scheduling
How work is assigned	The range `[from, to)` is split among `p` threads in fixed chunks up front, often striding by `p * chunkSize`	A shared state object stores the next available chunk; workers call `next()` under a mutex to claim work
Overhead	Low scheduling overhead	Higher synchronization overhead (a lock on every `next()`)
Load balance	Works well when iterations cost the same; can imbalance badly when costs vary	Better balance when task sizes are unknown or irregular

Intuition (static stride pattern)

Thread i takes items i, i+p, i+2p, ... (loop j = i; j < n; j += nThreads). The split is decided up front for ~zero runtime overhead and good locality — perfect when every iteration costs the same.

Intuition

Dynamic = a shared "work queue"; idle threads pull the next chunk at runtime. You pay a lock on every next(), but no thread sits idle while another is overloaded — worth it only when per-iteration cost varies.

This connects directly back to decomposition and mapping: static scheduling is static mapping of tasks to processes, while dynamic scheduling is runtime load balancing.

Ordering and standard execution policies (01 Parallel Programming in C++)

To reason about parallel algorithms you need C++'s ordering vocabulary. Sequenced-before is an asymmetric, transitive relation between evaluations in one thread: if A is sequenced before B, A happens before B in that thread. Indeterminately sequenced evaluations may occur in either order but cannot overlap. Unsequenced evaluations may occur in either order and may overlap or interleave.

Intuition

The key distinction is can they overlap? Indeterminately sequenced = some order, but one finishes before the next starts (no interleaving). Unsequenced = no guarantees at all, they may interleave/run simultaneously (e.g. vectorized lanes). Unsequenced is the dangerous one: you cannot safely take a lock or do anything order-dependent inside it.

The standard execution policies map this vocabulary onto two independent switches — whether multiple threads are used, and whether work may be vectorized/interleaved within a thread:

Policy	Parallel threads	Vectorized	Element-access ordering	Key constraint
`std::execution::seq`	No	No	Indeterminately sequenced (order may differ from a hand-written sequential loop)	Do not rely on incidental order unless the algorithm specifies it
`std::execution::par`	Yes (may use multiple threads)	No	Indeterminately sequenced within each thread	User functions must not create data races or deadlocks
`std::execution::par_unseq`	Yes (may be migrated across threads, including work stealing)	Yes	Unordered across unspecified threads, unsequenced within a thread	Blocking synchronization such as mutex locking can deadlock or be invalid
`std::execution::unseq`	No (one thread)	Yes	Unsequenced within the thread	Runs on one thread using instructions that operate on multiple data elements

Exam

Read the names as two switches — par = "use multiple threads", unseq = "may vectorize/interleave within a thread." So seq = neither, par = threads only, unseq = SIMD only, par_unseq = both. The rule that matters in the exam: anything with unseq forbids blocking synchronization (a mutex can deadlock because a lock taken in one vector lane may never be released before another lane needs it).

These rules turn into concrete correctness verdicts. for_each(par, ..., [&](int i) { v.push_back(i); }) is wrong because std::vector::push_back is not thread-safe. for_each(par, ..., lock_guard<mutex>{m}; ++x;) can be correct because mutual exclusion protects x — but the same mutex pattern under par_unseq is wrong, because blocking synchronization is unsafe with unsequenced/vectorized execution.

Watch out

Correct but slow (key Ex01 lesson). for_each(par, ..., [&](int i){ s += i; }) with a single std::atomic accumulator s is race-free and correct but catastrophically slow — every thread contends on one cache line, with measured speedup S ~= 0.02, E ~= 0 (about 50x slower than serial). The mutex version above is correct but also serializes every write.

Exam

Fix / preferred pattern. Use std::reduce(par, begin, end, init, op), which keeps per-thread private partials and combines them once at the end — no per-element lock or shared atomic. For range queries, have each thread build a local vector (static stride j=i; j<n; j+=nThreads) and merge once. Even ideal reduce here only reaches S ~= 2 because plain summation is memory-bandwidth bound, not compute bound. For max, prefer reduce(par, b, e, init, max) over max_element(par), and make init a true lower bound (e.g. -INFINITY, or 0.0 only if all values are non-negative). The decision rule: reduction-style (private partials) > single atomic > mutex. Never reduce into one shared atomic or lock per element.

Complexity and Performance Metrics (02 Performance Metrics)

Before you can argue that one algorithm is "better" than another, you need a yardstick that does not depend on whose laptop you happened to run it on. That yardstick starts with the Random-Access Machine.

RAM, time, and memory requirements

The Random-Access Machine (RAM) model is an idealized sequential machine with one processor and infinitely large memory cells. It deliberately abstracts away hardware details so that you can compare algorithms on their own terms.

Intuition

RAM is the "fair playing field" — one core, infinite uniform memory, every basic operation costs 1. It lets us talk about an algorithm's growth rate without arguing about whose laptop is faster.

For any algorithm there are three quantities worth distinguishing. The memory requirement is the minimum number of RAM memory cells needed for correct execution. The runtime in wall-clock time is the measured time for a concrete input on a concrete machine. And the runtime in RAM instructions is the number of RAM commands the algorithm executes. The first and last are properties of the algorithm; the middle one is a property of your hardware on that day. That is exactly why we lean on instruction counts: wall-clock measurements are machine-dependent, whereas asymptotic instruction counts support platform-independent algorithm comparison.

Efficient algorithms and analysis cases

An algorithm is considered practicable or efficient when both its runtime and its memory requirements are bounded by a polynomial in the input size. This is a coarse line in the sand, but a useful one.

Intuition

"Efficient" here is the theory-of-computation sense — polynomial = good, exponential = hopeless for large n. It is a coarse classification, not the same as "fast on this benchmark".

The exponential side of that line behaves asymmetrically. If memory is exponential, then runtime is at least exponential too, because that memory must be touched or produced. The converse does not hold: exponential runtime does not necessarily imply exponential memory. Either way, exponential-time algorithms are generally classified as not efficient for large inputs.

When you actually analyze an algorithm you usually pick one of two lenses. Worst-case analysis gives the most useful and practicable guarantee — for example, quicksort with a first-element pivot on reverse-sorted input can require c * n^2 operations. Average-case analysis instead requires a clearly defined distribution over all problem instances of size n, which is often difficult to justify; under suitable assumptions, average quicksort time comes out to c * n * log(n).

Asymptotic complexity and Landau notation

Asymptotic analysis studies runtime T(n) and memory M(n) as n -> infinity. For large n the growth class dominates constants and lower-order terms, so those drop out of the conversation. The deck's Big-O statement makes this precise: T(n) in O(f(n)) means T(n) <= c1 * f(n) + c2 for all relevant n and non-negative constants c1, c2.

The five Landau symbols then form a complete vocabulary for comparing growth rates:

Relation	Meaning
`f in o(g)`	f grows asymptotically slower than g
`f in O(g)`	f grows no faster than g; g is an asymptotic upper bound
`f in Theta(g)`	f grows as fast as g; g is both lower and upper bound
`f in Omega(g)`	f grows no slower than g; g is a lower bound
`f in omega(g)`	f grows asymptotically faster than g

Intuition

Read the five Landau symbols like comparison operators on growth rates: o < O <= Theta >= Omega < omega. Lowercase = strict ("strictly slower/faster"), uppercase = with equality allowed, Theta = "same speed up to constants".

These symbols satisfy a handful of set relations that are worth keeping at your fingertips: Theta(g) subset O(g) and Theta(g) subset Omega(g), with Theta(g) = O(g) intersection Omega(g). The strict classes nest inside their non-strict counterparts, o(g) subset O(g) and omega(g) subset Omega(g), while o(g) intersection omega(g) = empty set (nothing is both strictly slower and strictly faster). Finally there is the duality you will reach for constantly: f in Omega(g) iff g in O(f).

For quick reference, the typical growth classes ordered from better to worse for large n are O(1), O(log n), O(n), O(n log n), O(n^2), O(n^3), O(2^n), and O(n!).

Analytical modeling of parallel systems

A sequential algorithm is usually evaluated by its asymptotic runtime as a function of input size, and in the abstract model that runtime is independent of the serial platform. Parallelism breaks that comfortable simplicity, because a parallel runtime depends on far more variables: the input size n, the number of processing elements p, the communication parameters of the machine, sometimes the output size k, and the messy realities of memory hierarchy, synchronization, scheduling, and load balance.

This is why we always speak of a parallel system — the combination of a parallel algorithm and the underlying parallel platform. A good algorithm on one architecture may be poor on another if communication, memory, or synchronization costs differ.

Intuition

Sequential analysis has one knob (n); parallel analysis has at least two (n and p) plus the machine's communication cost. That is why a single Big-O number is never enough — we always report T_P as a function of both n and p.

PRAM model (02 Performance Metrics)

The Parallel Random-Access Machine (PRAM) is the natural extension of RAM into the parallel world. It gives you p processors sharing an unbounded global memory with uniform access from all processors, a common clock, and the freedom for each processor to execute a different instruction every cycle.

The interesting design question is what happens when two processors reach for the same cell in the same step — and reads and writes are treated independently. That gives four access variants:

Variant	Reads	Writes
EREW	Exclusive	Exclusive
CREW	Concurrent	Exclusive
ERCW	Exclusive	Concurrent
CRCW	Concurrent	Concurrent

Intuition

The four variants are just "can two processors touch the same cell in the same step?" for reads and for writes independently. E = exclusive (one at a time), C = concurrent (many at once). EREW is the strictest/most realistic, CRCW the most permissive/most powerful.

Once you allow concurrent writes, you have to decide who wins, and there are several standard semantics:

Semantics	Resolution rule
Common	Write only if all processors write the same value
Arbitrary	A randomly selected processor's value wins
Priority	A predetermined priority order decides
Sum	Memory receives the sum of all written values

PRAM is wonderful for algorithm analysis, but you should remember it is unrealistic: real machines have a memory hierarchy, finite bandwidth, contention, and nonuniform communication costs that PRAM simply wishes away.

Sources of parallel overhead

Using p processing elements rarely gives you p times the speedup, and the reason always traces back to overhead. There are four sources to keep in mind. Communication (interprocess interaction) is unavoidable because any nontrivial parallel task needs to exchange data. Idling happens when processors wait — for load imbalance, for synchronization, or for serial parts of the computation. Excess computation is work the parallel algorithm does that the serial version never had to, typically to manage parallelism or to reduce communication and synchronization. And contention arises when multiple processors compete for the same memory, lock, communication channel, or master process. Overhead is the central quantity that connects performance metrics, mapping decisions, and scalability — it is the thread running through this entire section.

Intuition

Remember the four overhead sources as communication, idling, excess computation, contention. Every lost bit of speedup traces back to one of them, and they are exactly what the overhead term T_O accounts for.

Core performance metrics

Everything here is built on two execution times. The serial runtime T_S is the elapsed time of the best sequential algorithm on a sequential computer, and the parallel runtime T_P is the wall-clock time from the first processing element's start to the last processing element's finish.

Intuition

T_P is the time of the slowest PE — everyone else is already done and waiting for it. That is why load imbalance hurts: the laggard sets the clock.

From the times you get execution cost, the total processor-time you pay for. Serially that is Cost_ser = 1 * T_S; in parallel it is Cost_par = p * T_P.

Intuition

Cost = "core-seconds you actually paid for" (p cores held for T_P each). If this is much larger than T_S, you bought a lot of idle/overhead time.

The gap between the two costs is the overhead, T_O = Cost_par - Cost_ser = p * T_P - T_S. This is the total processor-time spent in communication, excess computation, idling, and related non-useful work — the four sources from above, now wearing a number.

The headline metric is speedup, S = T_S / T_P. The one rule you must not break: T_S should be the best sequential algorithm for the same problem, not an artificially slow baseline.

Intuition

Speedup answers "how many times faster than the best single-core program?". Always benchmark against the best serial algorithm — beating a deliberately bad baseline inflates S and is a classic exam trap (optimize the serial code first).

Closely related is efficiency, E = S / p = T_S / (p * T_P) = Cost_ser / Cost_par, which measures useful work per processor relative to the sequential baseline.

Intuition

Efficiency is the fraction of each core's time spent doing genuinely useful work. E = 1 means zero waste; E = 0.5 means half of your hardware-time evaporated into overhead.

How you read these numbers falls into three cases. Ideal linear speedup is S = p with E = 1. Sublinear speedup is S < p, which is the usual case. And superlinear speedup is S > p, which is genuinely possible in practice thanks to cache effects or exploratory decomposition — more on that shortly.

Adding n numbers: p = n example (02 Performance Metrics)

Take a concrete problem: add n numbers using p = n processing elements, each owning one number, with n a power of two. The best sequential algorithm must at least read all the input, so T_S = Theta(n). In parallel you can do a tree reduction: pairwise partial sums propagate up a binary tree, and assuming the addition and the communication of one word each take constant time, the whole thing finishes in T_P = Theta(log n).

That looks like a triumph until you compute the metrics. Speedup is S = Theta(n / log n), and since p = n, efficiency is only E = S/p = Theta(1 / log n). The cost is p * T_P = O(n log n), but the serial cost is just Theta(n), so this p = n version is not cost-optimal. The lesson is that fast runtime alone is not enough; using too many processors can waste total work.

Intuition

A binary reduction tree has log n levels, so it finishes in Theta(log n) — blazing fast. But it keeps n PEs busy for that whole time while most of them sit idle after the early levels, so total work blows up to n log n instead of n. Speed bought at the price of efficiency.

Exam

FS21 / Ex02 tip — use concrete costs, not Theta. If the exam gives weights (e.g. 1 add costs 1, 1 communication costs 10), each tree level costs add + comm = 11 and there are log2 p levels; with p < n each PE first does n/p - 1 local adds. So T_P = (n/p - 1) + 11 * log2 p and T_S = n - 1. Recipe: T_P = (local ops) + (#levels) * (add + comm) — plug the numbers in, do not leave the answer as Theta.

Speedup bounds and superlinear speedup

Speedup has a wide range. At the bottom it can be as low as 0 — if the parallel program never terminates, T_P is infinite and S collapses. At the top, ideal theory usually caps speedup at O(p), yet in practice you can observe S > p.

There are two honest reasons for superlinear speedup. Cache effects: splitting the data may make each worker's working set fit in cache, so the aggregate fast memory exceeds what one core ever had. Exploratory decomposition: a parallel search may find a solution earlier than sequential search simply by exploring a lucky branch sooner.

Intuition

Superlinear (S > p) looks like cheating but is real: either each PE's smaller slice now fits in fast cache (more aggregate cache than one core had), or a parallel search gets lucky and stumbles on the goal in a branch the serial version would have reached only much later.

Worked example

Exploratory search makes this concrete. A sequential search takes T_S = O(14). A parallel search with p = 2 finds the answer in T_P = O(5). That gives S = 14/5 = 2.8 > 2 — superlinear on just two processors.

Cost-optimality

Recall that cost is the total processor-time, Cost = p * T_P. A parallel system is cost-optimal when its cost is asymptotically identical to the best serial cost, p * T_P = Theta(T_S). Viewed through efficiency, E = T_S / (p * T_P), this is the same as saying E = Theta(1): efficiency stays bounded away from zero asymptotically.

It pays to memorize the three equivalent characterizations, because the exam can ask for any one of them:

Characterization	Reading
`Cost = p * T_P = Theta(T_S)`	cost matches the best serial work
`E = Theta(1)`	efficiency stays constant as you scale
`T_O = O(W)`	overhead grows no faster than the useful work

The flip side, non-cost-optimality, means the parallel system uses asymptotically more total work than necessary.

Intuition

Cost-optimal = "the parallel version wastes no asymptotic work." You may add cores and the efficiency does not collapse to zero. The three rows above (Cost=Theta(T_S), E=Theta(1), T_O=O(W)) are the same statement viewed from three angles.

Bitonic sort example: impact of non-cost-optimality

Sorting makes a nice cautionary tale. The best comparison-sort serial baseline is T_S = Theta(n log n). The deck's bitonic sort has parallel runtime T_P = (n log^2 n) / p, so its parallel cost is p * T_P = n log^2 n.

Plug in p = n and you get S = (n log n) / (log^2 n) = n / log n with E = S/p = 1 / log n. Scale down to p < n and the numbers become S = p / log n and E = 1 / log n. The pattern is telling: the algorithm is not cost-optimal, but only by a factor of log n. For fixed p the speedup decreases as n increases, and efficiency does not depend on p in this model yet still decreases with n.

Intuition

Bitonic sort is a sorting network — a fixed, data-oblivious grid of compare-exchange columns of depth O(log^2 n), which maps beautifully to SIMD/GPU/regular hardware. The price is that it does log n times more comparisons than an optimal serial sort, so it misses cost-optimality by exactly that factor of log n. (For p < n: sort locally, then repeatedly compare-split with partners.)

Scaling down and cost-optimal parallel addition

Scaling down means using fewer processing elements than the maximum possible, and the payoff is usually better efficiency.

Intuition

Fewer cores, each doing more local work, means each core is busy a larger fraction of the time — so efficiency goes up. The trick is to give every real PE a fat chunk of sequential work before any communication happens.

The naive way to think about it is to treat the processors of the high-p version as virtual processors, assign those virtual processors equally to fewer real processors, and let each real processor do correspondingly more local computation. Applied to addition with p < n, this gives a genuinely cost-optimal scheme. Each processing element first locally adds n/p numbers in Theta(n/p), then the p partial sums are reduced in Theta(log p), so T_P = Theta(n/p + log p). The cost works out to Cost = p * T_P = O(n + p log p), the speedup to S = n / (n/p + log p), and the efficiency to E = 1 / (1 + (p log p)/n).

Cost-optimality then hinges on a single condition: you need p log p = O(n), equivalently n = Omega(p log p). This is the very same expression that reappears later as the isoefficiency function for parallel addition, f(p) = Theta(p log p).

Intuition

The two-phase scheme — local sum (n/p) then tree-reduce the p partials (log p) — turns the wasteful p=n tree into a cost-optimal one as long as the local-work term n/p dominates the reduction term log p. That is exactly the condition n = Omega(p log p). The max number of PEs you can still use cost-optimally is therefore O(n / log n).

Problem size W

The problem (or work) size W is the total number of basic operations required to solve the problem, expressed as a function of input size n. For instance, conventional matrix multiplication of two n x n matrices has W = Theta(n^3). When an optimal sequential algorithm is known, you simply define W = Theta(T_S). With W in hand, parallel runtime becomes a function of both work and processors, T_P = T_P(W, p).

Now hold W fixed and start adding processors. Increasing p decreases the per-processor useful work but usually increases overhead, and the result is unavoidable: efficiency decreases for all parallel systems as p grows with W fixed.

Intuition

W = Theta(T_S) is just "the amount of useful work to do", measured in basic operations. Holding W fixed and cranking p up is strong scaling: each core gets a thinner slice of real work while overhead keeps growing, so efficiency always drops eventually.

Amdahl's Law (02 Performance Metrics)

Amdahl's Law gives an upper bound on speedup for a fixed total work size W — the strong-scaling view. Split the work as W = W_ser + W_par, where W_ser is the non-parallelizable part and W_par is the parallelizable part, and let the serial fraction be f = W_ser / W with 0 <= f <= 1. Sequential time is proportional to W, while parallel time is proportional to fW + ((1 - f)W)/p — the serial part stays put while only the parallel part is divided among p processors.

That model yields the speedup S = 1 / (f + (1 - f)/p), equivalently S = p / (1 + (p - 1)f), with the sobering limit lim_{p->infinity} S = 1/f. The takeaway is that even a small serial fraction caps speedup, and for fixed W adding processors eventually gives diminishing returns.

Intuition

The serial slice you cannot parallelize is a hard ceiling: no matter how many cores you throw at it, speedup saturates at 1/f. Example: f = 5% ⇒ you can never beat 20x, even with infinite cores. This is the pessimistic, fixed-problem ("strong scaling") view.

Gustafson-Barsis Law (02 Performance Metrics)

Gustafson-Barsis flips the question around. Instead of "how much faster for fixed W?", it asks "how much larger a problem can we solve in the same time if p grows?" — the weak-scaling view, giving an upper bound for a scaled problem size at fixed parallel runtime.

Here you decompose the parallel runtime as T_P = T_ser + T_par and define sigma = T_ser / T_P with 0 <= sigma <= 1. Watch the subtle but exam-critical difference: sigma is the serial time fraction of the parallel run, not Amdahl's serial work fraction f. The scaled speedup is S' = sigma + p(1 - sigma), equivalently S' = p + sigma(1 - p). As p -> infinity, S' can grow without the fixed-work Amdahl cap, provided the problem size also grows. In practice this is why large machines are usually justified by scaling up the problem rather than by shrinking a fixed problem's runtime — and it connects directly to scalability and isoefficiency.

Intuition

Grow the problem along with the machine and the fixed serial part becomes a relatively shrinking sliver, so the achievable speedup keeps climbing (near-linear) instead of hitting Amdahl's wall. Amdahl = fixed W, add p, watch E fall (strong scaling); Gustafson = let W grow with p, keep runtime/efficiency (weak scaling).

The two laws are best held side by side:

	Amdahl	Gustafson-Barsis
Scaling regime	Strong (fixed W)	Weak (W grows with p)
What is held fixed	Total work W	Parallel runtime T_P
Serial fraction	`f = W_ser / W` (work)	`sigma = T_ser / T_P` (time)
Speedup	`S = 1 / (f + (1 - f)/p)`	`S' = sigma + p(1 - sigma)`
Behavior as p grows	Saturates at `1/f`	Can grow without the fixed-work cap

Karp-Flatt metric (02 Performance Metrics)

The Karp-Flatt metric works backwards from measurement: it estimates the experimentally observed serial fraction from the speedup you actually recorded. The underlying runtime model is T_P(W, p) = T_ser(W) + T_par(W)/p + T_O(W, p), and given an observed speedup S on p > 1 processors, you compute e = (1/S - 1/p) / (1 - 1/p).

The value e is the experimentally determined serial fraction. If overhead is ignored, it is consistent with Amdahl's serial fraction. What makes the metric genuinely useful is how e moves as you vary p: if e increases with p, then overheads such as communication, contention, or load imbalance are worsening, whereas if e stays roughly constant, the system is behaving like it has a fixed serial fraction.

Decomposition and Mapping (03 Decomposition and Mapping)

Specifying a nontrivial parallel algorithm is more than just "split the work and go." It involves several intertwined decisions. You start with decomposition, splitting the computation into smaller parts and identifying which of them can run concurrently. Then comes mapping, assigning those concurrent pieces of work to processes. Around that you have to handle data distribution (where input, output, and intermediate data physically live), shared data access management (preventing races and reducing contention), and synchronization (coordinating stages and dependencies). Get any of these wrong and your beautiful parallel algorithm grinds.

The deck frames this as three levels of translation. A problem becomes tasks by decomposition; tasks become processes by mapping; and processes become processors by processor mapping.

Intuition

The pipeline is Problem -(decompose)-> Tasks -(map)-> Processes -(map)-> Processors. Decomposition decides how much parallelism exists; mapping decides whether you actually get it.

This pipeline is also where your performance metrics come from. A bad decomposition limits concurrency before you even start; bad mapping causes idling and communication; and bad data distribution causes contention and excess data movement. Keep that chain in mind, because nearly every result later in the course traces back to one of these three stages.

Dense matrix-vector multiplication (03 Decomposition and Mapping)

Dense matrix-vector multiplication is the friendliest case you will meet. Each output element of vector y can be computed independently, which hands you n independent tasks, one per output element. Those tasks share the input vector but have no control dependencies on each other, and in the dense case they all do the same number of operations.

The consequences are almost entirely good news: you get high potential parallelism, a natural output data partitioning falls out for free, and load balancing is trivial as long as the rows have equal length. The only thing you have to think about is communication, which comes from sharing or distributing the input vector.

Intuition

y_i is just the dot product of row i with the whole vector b. Rows do not talk to each other, so the only coupling is everyone needing to read b. That is the textbook "embarrassingly parallel with shared input" case.

Database query processing and task dependency graphs

Not every problem is so cooperative. Consider the query

MODEL = "Yaris" AND YEAR = 2021 AND (COLOR = "GREEN" OR COLOR = "WHITE")

which you can decompose into tasks that each compute an intermediate table: filter model, filter year, filter color white, filter color green, and then combine the partial results with AND/OR operations. Unlike mat-vec, some of these tasks depend on the outputs of others — that is a control dependency — and they are not even the same size, because different filters and joins process different amounts of data.

To reason about this you draw a task dependency graph. A vertex is a task (and its weight can represent the amount of work it does), a directed edge means the result of one task is needed by another, and a directed path is a sequence that must execute in order.

Intuition

The dependency graph answers "who must wait for whom." A directed edge A -> B means B cannot start until A's result exists, no matter how many processors you own.

Two quantities of this graph matter enormously. The critical path length is the length (weight) of the longest directed path, and it determines the shortest possible parallel execution time even with infinitely many processors. The degree of concurrency is the number of tasks that can execute at the same moment; it varies during execution. Its maximum, C, is the most parallelism ever available at any instant, while the average degree of concurrency is total work divided by critical path length.

Intuition

With W = total work and h = critical-path length, W/h (the average degree of concurrency) equals the maximum achievable speedup S_max with unlimited processors. The maximum degree of concurrency C is the widest "front" of simultaneously-runnable tasks, so no more than C processors are ever useful at once.

Watch out

There is a subtlety in how you count C. C(n) counts the most tasks runnable at once, while C(W) sums their weights — and these coincide only for unit-weight tasks. The widest front is often brief, so the minimum number of processors that still finishes at T_P = h is usually far below C(n). Finding it exactly is NP-hard, so in practice you draw a Gantt chart and pack.

For the exam, tie these back to bounds. The critical path gives a lower bound on T_P — more precisely, with limited p you have T_P >= max(h, ceil(W/p)). Total work gives a lower bound on p * T_P, and the maximum concurrency bounds the useful p.

Granularity (03 Decomposition and Mapping)

Granularity simply describes task size, and it is best understood as a single dial you turn between two extremes. The table below contrasts the ends:

	Fine granularity	Coarse granularity
Number of tasks	More	Fewer
Potential concurrency	Higher	Lower
Scheduling / sync / comm overhead	More	Less
Load balance	Easier to balance	Possibly worse

Intuition

Granularity is the comm-to-work ratio dial. Finer = more concurrency but each task does little real work between expensive interactions; coarser = cheaper to manage but you may starve some processors. The right setting balances overhead against available parallelism.

The right granularity, then, balances overhead against available parallelism — exactly the same tradeoff you saw in C++ dynamic scheduling and in the performance deck's cost-optimality examples.

Task interaction graphs (03 Decomposition and Mapping)

The dependency graph captured control dependencies — who must wait for whom. A task interaction graph captures data dependencies instead: a vertex is still a task, but an edge now means a task interaction or data exchange.

Intuition

Two graphs, two questions. Dependency graph = ordering (who waits). Interaction graph = communication (who exchanges data). Mapping wants to honor both at once, and they often pull in opposite directions.

Sparse matrix-vector multiplication is the canonical example. Each output element is again an independent task, but now only the nonzero matrix entries participate, so for memory optimality you may partition the input vector across tasks. The neat result is that the interaction graph is identical to the graph represented by the sparse matrix's adjacency structure. You then use that graph two ways: map highly interacting tasks to the same process to reduce communication, and partition the graph so that work is balanced while cutting as few edges as possible.

Intuition

Put chatty tasks on the same process and the chatter becomes free local memory access. The communication cost of a mapping = number of interaction-graph edges crossing process boundaries (edges inside a process are free).

To make that concrete, take y = A * b and assume one dot product costs 1 time unit and each task interaction also costs 1 time unit. With fine granularity you set p = n = |b|, one task per vector element: the essential computation is 12 and the overhead is 42, which is exactly 2 times the number of graph edges in the example. Coarsen to p = n/4 by grouping vertices into tasks, and the essential computation stays at 12 while the overhead drops to 13, because many interactions are now internal to a process.

Intuition

Essential work is fixed at 12 — coarsening cannot reduce it. What coarsening shrinks is the overhead (42 -> 13), because grouping 4 vertices per process turns most cross-task edges into intra-process memory accesses that cost nothing.

The lesson cuts both ways: coarsening reduces communication by converting inter-task edges into local work, but overdo it and you reduce available parallelism or create load imbalance.

Decomposition techniques

You have three main techniques for producing tasks in the first place, and the cleanest way to choose between them is to ask what is unknown about your problem. Recursive decomposition suits divide-and-conquer problems: you split a problem into subproblems and recursively split until you reach the desired granularity — finding the minimum of an array by combining left and right minima, or quicksort, are the standard examples. Data decomposition partitions input, output, or intermediate data, and the computation tasks are then induced by that data partitioning; this is usually the most natural choice for arrays, matrices, images, and vectors. Exploratory decomposition is the odd one out: decomposition happens together with execution, as in search over a state space (graph search, the 15 puzzle), where the task sizes and even the number of tasks are unknown in advance, so dynamic mapping and load balancing are usually required.

Intuition

Pick the technique by what is unknown. Known structure that splits cleanly -> recursive. Lots of regular data -> data decomposition. You do not even know the task set until you start exploring -> exploratory (and you will need dynamic load balancing).

Output data partitioning and owner computes

When you do data decomposition, the first question is which data you partition — the output or the input. They are governed by the same owner-computes rule but behave very differently, so it is worth seeing them side by side.

With output data partitioning you assign a task to each output data partition, and that task computes its part of the output. This works when the output has many separable values that can each be computed naturally from the input. The owner-computes rule here states that the task assigned a particular output data item performs all computation needed for it. Image filters fit perfectly (each output pixel or tile becomes a task), as does dense matrix multiplication (partition the output matrix C so the owner of C_ij computes that element or block). The benefits are clear ownership, no write conflicts on the output, and correctness that is easy to reason about. The cost is that an owner may have to read large input regions, which increases communication if that input is not local.

Intuition

"You wrote it, you computed it." Because each output cell has exactly one owner, no two tasks ever write the same location, so output partitioning is naturally race-free. The price is that an owner may have to read a lot of distributed input.

Input data partitioning

Input data partitioning flips this around. You assign a task to each input data partition, each task does as much computation as it can using its local data, and then a follow-up combination or reduction phase merges the partials. You reach for this when the input has many separable values but the output has only a few values, or when output elements cannot be computed efficiently in isolation. The owner-computes rule restated for input: the task owning an input partition performs the computations that use that input data. Classic cases are reductions — minimum, maximum, median, sum — and sorting a set of values.

Intuition

When the output is tiny (one number from a reduction) you cannot partition the output, so you partition the input instead: every process crunches its local chunk, then a combine/reduce step merges the partials. That combine step adds a dependency tree (and communication) that output partitioning avoids.

This connects directly to performance. Doing local work first improves data locality, but the combination/reduction step introduces a dependency tree and possible communication — the price you pay for partitioning the input rather than the output.

Mapping tasks to processes

Usually you have more tasks than processes, so mapping is the act of assigning tasks to processes. You are juggling three goals at once: map independent tasks to different processes for load balance, schedule critical-path tasks onto processes as soon as possible, and keep densely interacting tasks on the same process to reduce communication. The two graphs you built earlier feed this directly — the task dependency graph helps you reduce idling and maintain load balance over time, while the task interaction graph helps you reduce communication by preserving locality.

The catch is that these goals conflict. Assigning all work to one processor, for instance, minimizes communication perfectly but maximizes idling and destroys parallelism entirely.

Intuition

The two goals fight. "Spread tasks out" balances load (uses the dependency graph to cut idling) but increases communication; "keep interacting tasks together" cuts communication (uses the interaction graph) but risks imbalance. The degenerate extreme — everything on one PE — has zero communication and zero speedup.

Static and dynamic mapping

The biggest single decision in mapping is when you decide it: before execution (static) or during it (dynamic). The contrast:

	Static mapping	Dynamic mapping (dynamic load balancing)
When tasks are assigned	Before execution	At runtime
Needs good size/comm estimates	Yes	No — handles unknown/generated tasks
Runtime overhead	Very low	Adds scheduling and synchronization overhead
Optimal mapping is	NP-complete to find	n/a
Best for	Regular, predictable work (dense matrix ops)	Irregular workloads, search, variable-cost tasks

Intuition

Decide by predictability. Known, even work -> static (assign once, pay no runtime overhead). Unknown or uneven work -> dynamic (pay scheduling overhead to keep everyone busy). This is exactly the OpenMP schedule(static) vs schedule(dynamic/guided) decision in the C++ deck.

If you go dynamic, you then choose between centralized and distributed schemes. Centralized dynamic mapping has a master managing a pool of available tasks that workers request whenever they go idle — simple, but the master can become a bottleneck or hot spot. Distributed dynamic mapping lets each process send or receive work from others, which avoids the single bottleneck but forces you to answer four design questions:

How are sending and receiving processes paired?
Who initiates the transfer — the idle receiver or the overloaded sender?
How much work is transferred?
When is a transfer triggered?

Intuition

Centralized = one shared work-pool that idle workers pull from (simple, but the master is a hot spot). Distributed = peer-to-peer work stealing (no bottleneck, but you must answer the four design questions above).

Dense matrix multiplication mapping

For C = A * B you use output data partitioning and owner-computes: partition C into blocks and assign an equal number of C elements or blocks to each process for load balance. The interesting variable is the decomposition dimension, because it controls communication. A 1-D decomposition is simpler but may require more communication per process; a 2-D decomposition can reduce communication and use more processes effectively; and higher-dimensional decompositions can expose even more parallelism at the cost of more complicated communication and mapping.

Intuition

The owner of a C-block must read whole rows of A and columns of B. A 2-D block partition makes each process touch a smaller slice of A and B than a 1-D strip partition, so the surface-to-volume (communication-to-compute) ratio drops — this is the same idea that powers Cannon's algorithm later in the course.

Sorting matrix rows: static vs dynamic mapping

Suppose you must sort the entries within each row of an n x n matrix. The decomposition is input data partitioning by rows, with each row becoming a task. Now the static-vs-dynamic question becomes concrete. Static mapping assigns rows to processes before execution, which is fine until rows vary in sorting difficulty — then load balance suffers and processes idle. Dynamic mapping keeps the unsorted rows in a master, and each worker requests a row, sorts it, and asks for another; this balances variable row costs at the cost of some master and communication overhead.

Intuition

This is the canonical "centralized work-pool" example. The textbook motivation is uneven rows -> static idles, dynamic wins.

Exam

FS21/Ex04 nuance: in the actual exercise the rows were equal-cost random rows, so the work was already balanced and static was perfectly fine. dynamic,1 / guided,1 only won marginally (S ~ 8.1 vs 7.4) by smoothing scheduling jitter, while large fixed chunks (static,100) lost when the number of chunks approached the number of threads (uneven tail, S ~ 5-6). Rule of thumb: use dynamic/guided only when per-task cost actually varies.

Solving linear systems and LU decomposition

Given a nonsingular square matrix A and a vector b, you want to solve Ax = b. The LU approach factors A = L * U, where L is lower triangular and U is upper triangular (with a unit diagonal in one convention). Once you have the factorization you solve L y = b by forward substitution and then U x = y by back substitution.

Intuition

Factoring once into triangular L and U turns a hard general solve into two cheap triangular solves. You pay O(n^3) once for the factorization, then each right-hand side b only costs O(n^2).

The substitution steps have closed forms. Forward substitution starts with y_1 = b_1 / l_11 and proceeds with the general form y_i = (b_i - sum_{k=1}^{i-1} l_ik * y_k) / l_ii. Backward substitution starts at the other end with x_n = y_n / u_nn and uses x_i = (y_i - sum_{k=i+1}^{n} u_ik * x_k) / u_ii. The costs line up with the intuition above: LU decomposition is O(n^3), while forward and backward substitution are each O(n^2).

Parallelism here is limited by structure. Gaussian elimination has sequential dependencies across elimination steps, but within each elimination step many updates can run in parallel. So you get high concurrency inside a stage, yet the stages themselves are chained.

Intuition

The elimination steps form a chain (step k+1 needs step k's pivot row), so you cannot parallelize across stages — only the row/column updates within a stage. The max degree of concurrency of Gaussian elimination is therefore C = Theta(n^2) = Theta(W^(2/3)), so it needs W = Omega(p^(3/2)) to stay efficient: a classic example of concurrency capping useful p.

LU task decomposition and static mapping issue

The deck decomposes a block LU factorization into 14 tasks with block size q = n/3 (a 3x3 grid of blocks). Each task has time complexity Theta(q^3), so the total task work is 14 q^3 = 14 * (n/3)^3 = (14/27)n^3 = Theta(n^3) asymptotically.

The trouble is that a static mapping based on input data partitioning can produce severe load imbalance, because some blocks are needed in only one task while others are needed in several later tasks. Treating each task as 1 time unit, the critical path forces T_P >= h = 7 for any mapping — the dependency chain dominates, and adding processors cannot beat that critical-path lower bound. With T_S = 14 serial task units, you can watch efficiency collapse as you add processors that the critical path won't let you use:

Worked example

All three configurations finish at T_P = 7 (the critical path), with T_S = 14:

`p`	cost `p * T_P`	overhead `T_O`	efficiency `E`
3	21	`21 - 14 = 7`	`14/21 = 0.67`
4	28	`28 - 14 = 14`	`14/28 = 0.5`
9	63	`63 - 14 = 49`	`14/63 = 0.22`

p = 9 is one process per block — the natural input partition — and it is the worst offender.

Intuition

Once the critical path pins T_P at 7, throwing more processors at the problem cannot lower runtime — it only inflates cost p * T_P and craters efficiency. Same T_P, more p => strictly worse E. The natural "one block per process" map (p = 9) is the worst offender because most blocks sit idle while the dependency chain crawls forward.

You count communication for a given map the same way as before: #sends = #receives = number of task-interaction edges that cross a process boundary, while edges between two tasks on the same process are free local memory. The broader lesson is that a natural data partition can be a poor one if it ignores task dependencies and uneven reuse — load imbalance and critical-path constraints can dominate everything else.

Minimizing interaction overheads

Finally, several strategies attack interaction overhead directly, and they make most sense once you see that any message carries two separate costs: a fixed startup latency t_s per message, and a per-word transfer cost. You maximize data locality by reusing intermediate data and restructuring computation so reused data is accessed in short time windows. You minimize communication volume, since every communicated word has a cost. You minimize communication frequency, since each interaction pays the startup latency t_s, by merging small messages into fewer larger ones where you can. You minimize contention and hot spots by decentralizing, distributing metadata, and replicating read-mostly data when needed. And you overlap computation with communication using non-blocking communication, multithreading, and prefetching to hide whatever latency remains.

Intuition

There are two separate costs in any message — a fixed startup t_s per message and a per-word transfer cost. "Fewer/larger messages" attacks t_s; "less volume" attacks the per-word part; "overlap with compute" hides whatever is left. Together they shrink T_O.

Taken together, these strategies reduce T_O, which improves speedup, efficiency, cost-optimality, and scalability.

Scalability of Parallel Systems (07 Scalability)

Scaling characteristics

Start with the simplest experiment you can run: freeze the problem size W and keep throwing more processors at it. What happens to efficiency? It falls — almost always. The serial runtime T_S is a fixed constant (the problem hasn't changed), but the overhead T_O keeps growing as you add cores — for example because of Amdahl's Law — and since efficiency is essentially "useful work divided by total work," that rising overhead drags E down with every processor you add.

Intuition

With the problem size frozen, every extra core gets a thinner slice of useful work but still pays its share of communication, idling, and serial bottlenecks. Eventually you are buying cores that mostly wait. This is the "strong scaling" regime (fixed W, raise p), and Amdahl's Law is its pessimist.

You can watch this play out algebraically through Amdahl's Law with serial fraction f. Speedup is S = p / (1 + (p - 1)f) = T_S / T_P, which means the cost is Cost = p T_P = T_S(1 + (p - 1)f) and the overhead works out to T_O = p T_P - T_S = (p - 1) f T_S. The crucial observation is what happens in the limit: as p -> infinity, T_O -> infinity whenever f > 0. Because efficiency can be written as E = S/p = 1 / (1 + T_O/T_S), that exploding overhead forces E -> 0 as p -> infinity for any fixed W and nonzero serial fraction.

Parallel addition tells the same story with concrete formulas. There T_P = n/p + log p, so S = n / (n/p + log p) and E = 1 / (1 + (p log p)/n). This makes the fixed-size trap obvious: if n is fixed and p grows, the ratio (p log p)/n grows without bound and efficiency drops toward zero.

Intuition

The overhead-to-work ratio here is (p log p)/n. With n nailed down, adding cores blows up the numerator while the denominator stays put, so efficiency slides toward zero. The only escape is to let n grow with p — which is exactly the isoefficiency idea below.

Scalable parallel systems

To talk about scalability precisely, recall three building blocks. The work W = Theta(T_S) is the runtime of the best serial algorithm; T_O(W, p) is the overhead written as a function of both problem size and processor count; and a system is cost-optimal exactly when E = Theta(1) and T_O = O(W). With those in hand, the definition is clean: a parallel system is scalable when you can grow the problem size along with p so that efficiency stays constant.

The key enabling condition is how overhead grows with work. If T_O grows sublinearly with W at fixed p — that is, T_O = o(W) — then enlarging W while holding p fixed actually increases efficiency, because useful work outpaces overhead. Once that holds, you can raise W and p together along a curve that keeps E pinned in place.

Intuition

"Scalable" means you can always buy back lost efficiency by making the problem bigger. The precise condition is T_O = o(W) (overhead grows strictly slower than useful work): pour in more work and the overhead becomes relatively negligible, so E climbs back up. You then ramp W and p together along a curve that pins E in place.

Scalability and cost-optimality are related but not identical. A scalable parallel system can always be made cost-optimal by choosing p and W appropriately — but the converse fails: not every cost-optimal system is scalable.

Efficiency as a function of W and p

Think of efficiency as a surface over two axes. Move along the p axis at fixed W and efficiency always decreases — this is true for every system. Move along the W axis at fixed p and, in a scalable system, efficiency increases, because useful work grows faster than overhead. This two-directional behavior is exactly why small benchmark results can mislead you: an algorithm that looks bad for small n may actually scale beautifully if its overhead grows slowly relative to W.

Intuition

E moves in opposite directions along the two axes — down as you add cores, up as you add work. Scalability is the art of moving along both axes at once so E stays level. Never judge scalability from a single small-n timing.

Isoefficiency metric

Isoefficiency turns the previous picture into a single, answerable question: at what rate must W grow as a function of p to keep efficiency E fixed? The answer is a function f(p), and the rule of thumb is simple — smaller isoefficiency growth means better scalability.

Intuition

Isoefficiency f(p) is the price tag of scalability: "to keep efficiency at, say, 0.8 while I double my cores, how much bigger must the problem get?" A gently growing f(p) (close to linear) means cheap scaling; a steep f(p) (e.g. p^3) means you need a vastly bigger problem just to stay even.

Deriving it is mechanical. Begin from T_P = (W + T_O(W, p)) / p, so speedup is S = W / T_P = pW / (W + T_O(W, p)) and efficiency is E = S/p = W / (W + T_O(W, p)), which rearranges to the compact form E = 1 / (1 + T_O(W, p)/W). Holding E fixed pins the overhead-to-work ratio: T_O(W, p) / W = (1 - E)/E, or equivalently W = K * T_O(W, p) with the constant K = E/(1 - E). The isoefficiency function is what you get when you eliminate W from that relation and express it purely in terms of p, giving W = f(p). The payoff: if you change the core count from p to p', the problem must grow by the factor f(p')/f(p) to hold efficiency steady.

Exam

The recipe is three steps. (1) Compute T_O = p T_P - W. (2) Set W = K T_O with K = E/(1-E). (3) Solve for W as a function of p only to get f(p). If T_O has several terms, the asymptotically largest term wins — see the multi-term case below.

Isoefficiency of parallel addition

Run the recipe on adding n numbers. Here W = n and T_P = n/p + log p, so the overhead is T_O = p T_P - W = p log p. The fixed-efficiency condition W = K * T_O = K p log p then collapses to the isoefficiency f(p) = Theta(p log p). Concretely, moving from p to p' requires W' = W * (p' log p') / (p log p).

What does doubling the cores cost you? Plug in p' = 2p:

Factor = (2p log(2p)) / (p log p)
       = 2 log(2p)/log p
       = 2(1 + log 2 / log p)

For large p this is just a hair above 2 — you need slightly more than double the work to keep up.

Intuition

Doubling cores needs slightly more than doubling the work — the extra log 2 / log p is the cost of the deeper reduction tree. Because that excess shrinks as p grows, parallel addition scales almost ideally.

Isoefficiency with multiple overhead terms

Real overhead is often a sum of competing terms, and the trick is to handle each one separately and then take the worst. Consider a hypothetical system with T_O = p^(3/2) + p^(3/4) W^(3/4).

The first term gives W = K p^(3/2), hence f_1(p) = Theta(p^(3/2)). The second term is W = K p^(3/4) W^(3/4); dividing through by W^(3/4) leaves W^(1/4) = K p^(3/4), and raising both sides to the fourth power yields W = K^4 p^3, so f_2(p) = Theta(p^3). The overall isoefficiency is the asymptotically largest of these, f(p) = Theta(p^3). The reason is that every overhead term has to be controlled to keep efficiency fixed, so the worst-growing term is the one that dictates scalability.

Intuition

Each overhead term sets its own minimum growth rate for W. To keep E fixed you must satisfy every one of them, so the steepest term is the binding constraint — exactly like a chain being only as strong as its weakest link. Here p^3 dwarfs p^(3/2), so f(p) = Theta(p^3).

Cost-optimality and isoefficiency

Cost-optimality and isoefficiency turn out to be two views of the same guarantee. Starting again from T_P = (W + T_O(W, p)) / p, the cost is Cost = p T_P = W + T_O(W, p). Cost-optimality demands p T_P = Theta(W), so W + T_O(W, p) = Theta(W), which forces T_O(W, p) = O(W) — equivalently W = Omega(T_O(W, p)). Translated into isoefficiency language, a scalable system stays cost-optimal precisely when W = Omega(f(p)). For parallel addition, where f(p) = p log p, this says cost-optimality requires n = W = Omega(p log p) — exactly the condition the earlier performance deck gave for cost-optimal parallel addition.

Intuition

Cost = work + overhead. "Cost-optimal" just means overhead never out-grows the real work (T_O = O(W)), so the processor-time product stays Theta(T_S) and you waste no asymptotic effort. Staying on or above the isoefficiency curve (W = Omega(f(p))) is the same guarantee written in terms of how big the problem must be for a given core count.

Lower bound on isoefficiency

There is a floor below which isoefficiency physically cannot go. You can never use more than W processors cost-optimally on a problem of work size W — each processor needs at least one unit of work to do. That gives p = O(W), equivalently W = Omega(p), so any isoefficiency function must satisfy f(p) = Omega(p). The best possible case is therefore ideal isoefficiency, f(p) = Theta(p): if doubling p only requires doubling W to hold efficiency, scalability is asymptotically perfect, and anything faster than linear growth is strictly less scalable.

Intuition

You can never keep more than W processors usefully busy (each needs at least one unit of work), so f(p) can never grow slower than p. Linear f(p) = Theta(p) is the gold standard: weak scaling in its purest form, where adding a core just means adding a constant amount of work.

Degree of concurrency and isoefficiency

Communication is not the only thing that caps scalability — the algorithm's own structure does too. The maximum degree of concurrency C(W) is the largest number of operations that can execute simultaneously at any instant of the parallel algorithm. It is a property of the algorithm, independent of the architecture, and it means no more than C(W) processing elements can ever be used effectively.

Intuition

The average degree of concurrency (total work / critical-path length) equals the maximum achievable speedup with unlimited processors; the maximum degree of concurrency C(W) caps how many PEs are ever useful at one instant. Communication and scheduling overhead usually make the genuinely useful p smaller still.

The effect on scalability follows directly. Concurrency-imposed isoefficiency is optimal only when C(W) = Theta(W). If instead C(W) = o(W), then the useful processor count is limited to p = o(W), so W = omega(p) and the isoefficiency must grow strictly faster than linear. And because several ceilings act at once, the overall isoefficiency is the maximum of the isoefficiency functions imposed by limited concurrency, communication, excess computation, and any other overhead.

Intuition

Concurrency is a second ceiling on scalability, independent of communication. If the algorithm's structure only ever exposes C(W) = o(W) parallelism (sequential phases, dependency chains), then most of your work cannot be spread over W cores no matter how cheap communication is — so W must grow super-linearly with p, ruling out ideal scaling.

Gaussian elimination concurrency example

Gaussian elimination on n equations in n variables is the classic illustration. Its work is W = Theta(n^3), but the variables must be eliminated sequentially, one after another, and eliminating a single variable takes Theta(n^2) operations. So at any moment at most Theta(n^2) processors can be kept busy. Since W = Theta(n^3) gives n = Theta(W^(1/3)), the concurrency ceiling is C(W) = O(n^2) = O(W^(2/3)) = o(W). That forces p = O(C(W)) = O(W^(2/3)), hence p^(3/2) = O(W) and finally W = Omega(p^(3/2)). The conclusion is striking: even before you account for a single byte of communication, Gaussian elimination cannot reach ideal Theta(p) isoefficiency, purely because its dependency structure limits concurrency.

Intuition

The n eliminations are a forced sequence (each needs the previous one done), so only the n^2 operations within one elimination step run in parallel. That n^2 = W^(2/3) ceiling — not communication — already forces W = Omega(p^(3/2)). A great lesson: an algorithm's data dependencies can cap scalability before a single message is sent.

Minimum execution time

A subtle but important point: for a fixed problem size W, the configuration that runs fastest may use a p that is not cost-optimal. To find that fastest configuration, treat T_P(W, p) as a function of p, differentiate, and solve d/dp T_P = 0; call the solution p_0. If p_0 <= C(W) you simply use T_P(p_0); otherwise the concurrency ceiling binds and you fall back to p = C(W).

Intuition

Treat p as a real number and find the sweet-spot core count p_0 where adding more cores stops helping — past it, communication/overhead grows faster than the compute term shrinks, so the machine actually slows down. At the optimum the rising overhead term balances the falling compute term, so T_P_min is often just twice either term.

Watch out

First check the p -> infinity limit. If T_P keeps falling toward a finite constant (no interior minimum), there is no p_0 — cap p at the concurrency limit instead: T_P_min = T_P(C(W)).

That procedure gives the minimum execution time, which is not necessarily the minimum cost-optimal execution time. For the latter, recall that with isoefficiency f(p), cost-optimality requires W = Omega(f(p)), so the allowable cost-optimal processor count is p = O(f^{-1}(W)). And since cost-optimal systems have T_O = O(W), the runtime simplifies to T_P = (W + T_O)/p = Theta(W/p), giving the cost-optimal minimum T_P_cost_opt = Omega(W / f^{-1}(W)).

Intuition

The raw minimum time may use so many processors that efficiency tanks. The cost-optimal minimum time asks instead: "what is the fastest I can go while still wasting no asymptotic work?" You take the most cores cost-optimality allows (p = f^{-1}(W)) and divide. Worked shortcuts: f(p) = K^2 p^4 ⇒ T_P_cost_opt = Omega(W^(3/4)); f(p) = Theta(p^2) ⇒ Omega(W^(1/2)). If p_0 happens to equal Theta(f^{-1}(W)), the raw minimum is already cost-optimal.

Minimum time for parallel addition

Make it concrete with parallel addition, where the scalability deck uses T_P = n/p + 2 log p. To minimize, differentiate and solve:

dT_P/dp = -n/p^2 + 2/p = 0
n/p = 2
p_0 = n/2

Substituting back gives the minimum runtime:

T_P_min = n/(n/2) + 2 log(n/2)
        = 2 + 2(log n - 1)   (log base 2)
        = 2 log n

But check the cost: Cost = p_0 * T_P_min = (n/2) * 2 log n = n log n, which is not cost-optimal, because the serial work is only Theta(n).

Now find the cost-optimal minimum instead. The isoefficiency f(p) = K p log p = W = n, inverted approximately, gives p_cost_opt = Theta(n / log n). At that core count the runtime is T_P(p_cost_opt) = n/p_cost_opt + log p_cost_opt ≈ log n + (log n - log log n) = 2 log n - log log n = O(log n).

The two answers sit side by side like this:

Objective	Processors used	Total cost	Wall-clock time
Minimum raw time	about `n/2`	`n log n` (wastes work)	`2 log n`
Minimum cost-optimal time	about `n/log n`	`Theta(n)` (preserves efficiency)	`O(log n)`

Intuition

Both finishes are O(log n) wall-clock, but the raw minimum throws n/2 cores at the problem and pays n log n total — half of it wasted — while the cost-optimal version uses only ~n/log n cores for the same asymptotic time at full efficiency. Lesson: fastest and most-efficient core counts are different; an exam answer should say which one it is optimizing.

Cross-Deck Connections for Exam Prep

From decomposition to metrics

It helps to see the whole course as one pipeline that turns an algorithm's structure into performance numbers. The workflow runs end to end like this:

Choose a decomposition.
Derive the task dependency graph.
Determine total work W and the critical path.
Determine the degree of concurrency C(W).
Derive the task interaction graph and the communication needs.
Choose a mapping / scheduling.
Compute or estimate T_P, T_O, S, E, cost, and scalability.

Each stage hands a hard constraint to the next. The critical path lower-bounds runtime; total work lower-bounds cost; the degree of concurrency upper-bounds the useful processor count; the interaction graph drives communication overhead; granularity trades overhead against concurrency; and the mapping ultimately determines load balance, idling, locality, and contention.

Intuition

Read this chain as "structure → numbers." The task DAG fixes the hard physical limits (critical path = floor on time, work = floor on cost, max concurrency = ceiling on cores); your design choices (granularity, mapping) only decide how close to those limits you land.

Formula summary

Here is the entire Part I metric vocabulary in one place. The core definitions:

Quantity	Definition
`T_S`	serial runtime of the best sequential algorithm
`T_P`	parallel wall-clock runtime
`p`	number of processing elements
`W = Theta(T_S)`	work / problem size
`Cost_par`	`p T_P`
`T_O`	`p T_P - T_S`, or with W: `p T_P - W`
`S`	`T_S / T_P`
`E`	`S/p = T_S/(p T_P)`

A system is cost-optimal when p T_P = Theta(T_S) (equivalently p T_P = Theta(W)), which is the same as saying E = Theta(1), or under the work model T_O(W, p) = O(W).

The scaling laws each answer a different question, and it pays to keep them straight:

Law	Formula	Question it answers
Amdahl (fixed-size)	`S = 1/(f + (1-f)/p) = p/(1 + (p-1)f)`, with `lim_{p->infinity} S = 1/f`	strong scaling: how much faster on fixed `W`?
Gustafson–Barsis (scaled)	`S' = sigma + p(1 - sigma) = p + sigma(1 - p)`	weak scaling: how much more work in the same time?
Karp–Flatt	`e = (1/S - 1/p) / (1 - 1/p)`	what is the measured serial fraction?

Intuition

Amdahl (strong scaling) freezes W and asks how much faster — capped at 1/f. Gustafson (weak scaling) grows W with p and asks how much more work in the same time — uncapped. Same machine, different question. f is a fraction of work; sigma is a fraction of the parallel run's time. Karp-Flatt then runs Amdahl backwards: measure S and recover the hidden serial fraction e — if e stays constant as p grows the limit is genuinely serial code, if e rises the culprit is overhead (communication/imbalance/contention).

For isoefficiency, the chain to memorize is T_P = (W + T_O(W, p))/p, giving E = 1 / (1 + T_O(W, p)/W); for fixed E you set W = K T_O(W, p) with K = E/(1 - E) and express the result as W = f(p). Cost-optimality of a scalable system then reads W = Omega(f(p)), and the ideal lower bound is f(p) = Omega(p).

Finally, the parallel-addition worked case, which shows up again and again: T_P = Theta(n/p + log p), overhead T_O = p log p, efficiency E = 1 / (1 + (p log p)/n). It is cost-optimal iff n = Omega(p log p), with isoefficiency f(p) = Theta(p log p).

Common exam pitfalls

A handful of mistakes account for most lost marks, so internalize them. The biggest is the baseline trap: never compare a parallel algorithm against a deliberately poor sequential baseline — speedup is always measured against the best sequential algorithm for the same problem. In the same spirit, do not treat a low T_P as proof of good parallel performance; always check cost and efficiency too, and never write a runtime expression that omits p, since without p it cannot say anything about scalability.

Watch out

Several modeling assumptions look harmless but quietly break your analysis. Do not assume all tasks carry equal work — irregular tasks make static mapping fragile. Do not assume the maximum concurrency equals the useful processor count once communication and scheduling overhead are paid. Do not choose granularity merely to maximize the task count; past a point, overhead dominates. And do not infer large-instance scalability from a single small-instance runtime — isoefficiency is precisely the tool that tells you how W must grow with p.

Watch out

Two correctness pitfalls round out the list. Inside par_unseq code, never use mutexes or blocking operations, and never rely on element ordering in parallel algorithms unless both the execution policy and the algorithm guarantee it. Finally, keep your graphs straight: control (data) dependencies are not the same as data interactions — dependency graphs constrain ordering, while interaction graphs drive communication.

Part II - Shared Memory, OpenMP, Sorting, and Graph Search

Shared-address-space platforms (04 Shared Memory Systems)

Shared-memory systems give all threads a common address space: a load or store uses the same virtual/global address independent of which processor executes it. This makes programming easier than explicit message passing, because data structures can be shared directly, but it also moves correctness and performance problems into the memory hierarchy: caches, coherence traffic, synchronization, and memory consistency.

Intuition

Message passing makes communication explicit and visible; shared memory makes it invisible. The communication does not disappear - it hides inside cache-coherence traffic, lock contention, and false sharing, where it is harder to spot.

Two major hardware organizations matter, and the difference comes down to whether all of memory is equally far away. In a UMA (uniform memory access) machine, every processor or core reaches main memory with the same latency, except for cache effects; the canonical model is a bus-based multiprocessor with local caches and a shared global memory. In a NUMA (non-uniform memory access) machine, each processor or socket has its own local memory, and local accesses are faster than remote ones. The address space is still shared, but placement now matters, so good shared-memory programs preserve locality and avoid unnecessary remote traffic.

Intuition

UMA = "all memory equally far"; NUMA = "my own memory is close, yours is far." On NUMA, where a page physically lives changes performance even though the code looks identical, so keep each thread working on data it allocated/touched first (first-touch placement).

Why do caches help at all? Because of locality: temporal locality means recently used data is likely to be reused, and spatial locality means nearby data is likely to be used soon. In a multiprocessor, several caches may hold copies of the same memory word or the same cache line, and cache coherence is the hardware mechanism that gives these copies well-defined semantics. The deck describes the intended semantic as serializability: there exists some serial order of memory operations that corresponds to the observed parallel execution.

When one processor writes a value that may exist in other caches, the system has a choice: it must either invalidate the other copies or update them. These are the two coherence protocol families, and they trade bandwidth against re-read latency in opposite ways.

	Invalidate protocol	Update protocol
On a write	makes other cached copies invalid	sends the new value to other cached copies
Implementation	commonly write-back caches	commonly write-through behavior
Best when	other processors do not need the value again soon	processors repeatedly interleave tests and updates of the same variable
Weakness	reader pays a refetch if it still wants the value	wastes bandwidth when readers only read once
In practice	most current machines use invalidate	rarer

Intuition

Invalidate says "your copy is stale, refetch if you still care"; update says "here is the new value, whether you want it or not." Invalidate wins when writes are not immediately re-read (the common case), so real machines use it.

A typical invalidate protocol tracks three states per cache line. Shared means multiple valid cached copies exist, so a later write must generate invalidations. Dirty (also called Modified) means exactly one cache holds the modified valid copy; because all other copies are already invalid, a further local write needs no invalidation. Invalid means this cache's copy cannot be used, so a read triggers a data request and a coherence-state change. When the line a reader wants is dirty in another processor, that owning cache must supply it: the read triggers a flush/transfer so the reader obtains the current value and the states change consistently.

The exam-critical subtlety is that coherence operates at cache-line granularity, not source-variable granularity. False sharing occurs when two threads update different variables that happen to lie on the same cache line. The variables are not logically shared, but the hardware treats the whole line as shared, generating invalidation or update traffic that can destroy scalability even when the source code has no data race at all.

Watch out

Coherence tracks lines, not variables. Two threads hammering neighbouring counters in one array ping-pong the whole line between caches as if they were sharing - "false" sharing because there is no logical race. Fix: pad / align per-thread data so distinct threads' hot variables land on separate cache lines.

OpenMP programming model (04 Shared Memory Systems)

OpenMP is a directive-based API for shared-address-space programming in C, C++, and Fortran. In C/C++, directives are expressed as pragmas:

#pragma omp directive [clause list]

An OpenMP program executes sequentially until it enters a parallel region:

#pragma omp parallel [clause list]
{
    /* structured block */
}

The thread that encounters the region becomes the master thread of the team and gets thread id 0. At the end of the parallel block, the threads join, and the master waits until all team threads finish.

Intuition

OpenMP is a fork/join model. #pragma omp parallel forks a team that all run the same block; the closing brace is an implicit barrier + join. You annotate sequential code rather than rewriting it (unlike MPI, which is SPMD from the first line).

The directive vocabulary is small but expressive. The parallel directive creates a team of threads for a structured block, and within such a region the work-sharing directives divide work up: for splits loop iterations among threads, while sections/section split non-iterative tasks. To restrict execution, single lets exactly one thread (not necessarily the master) run a block, and master restricts a block to the master thread. The synchronization directives are barrier (all threads wait until every thread arrives), critical (one thread at a time in a named or unnamed region), atomic (one memory update performed atomically), ordered (force part of a parallel loop to run in sequential loop order), and flush (enforce a consistent view of shared memory for the listed or shared objects). Finally, threadprivate makes global/static variables private to each thread.

Clauses tune behavior and, above all, data sharing. You can guard a region with if(expr) so a team is created only when a runtime expression is true, or request a concurrency level with num_threads(k). The data-sharing clauses are where most correctness lives: private(x) gives each thread an uninitialized private instance; firstprivate(x) is the same but initialized from the original value; lastprivate(x) copies the value from the lexically last loop iteration or section back to the original; and shared(x) makes all threads access the same object. The default(shared | none) clause controls the default sharing behavior, reduction(op: vars) gives each thread a private partial that is combined once at the end (discussed below), and collapse(n) fuses n perfectly nested loops into one larger iteration space to expose more parallelism.

Omitting default means default(shared), but default(none) is the better habit for both exam-quality and production-quality OpenMP, because it forces every variable to be explicitly scoped, declared inside the parallel region, made threadprivate, const-qualified, or be an automatically private loop-control variable. This single discipline prevents a large class of accidental races.

Exam

default(none) forces you to declare your intent for every variable. The compiler then rejects anything you forgot to scope, turning a silent race into a compile error. This is the single best habit for correct OpenMP.

OpenMP loop scheduling and task decomposition (04 Shared Memory Systems)

#pragma omp for maps loop iterations to threads, and the schedule clause decides exactly how. The four kinds trade scheduling overhead against load balance:

Schedule	How iterations are assigned	Trade-off
`static[, chunk]`	partitioned before the loop runs; with a chunk, chunks go round-robin, otherwise each thread gets one ~equal chunk	low overhead, good locality, but poor balance for irregular work
`dynamic[, chunk]`	chunks (default size 1) handed to idle threads at runtime	better balance, higher scheduling overhead
`guided[, min_chunk]`	decreasing chunk sizes at runtime down to `min_chunk`	big early chunks (cheap, local) shrinking toward the tail (balanced)
`runtime`	taken from the `OMP_SCHEDULE` environment variable	deferred decision

Exam

This is a recurring exam question. Equal work per iteration -> static (cheapest, best locality, no runtime bookkeeping). Uneven/irregular work -> dynamic or guided (idle threads grab the next chunk so nobody starves while one thread is stuck on a heavy iteration). guided = "dynamic with shrinking chunks": big early chunks keep overhead low, a small tail keeps the finish balanced.

A for construct has an implicit barrier at its end unless you add nowait. That is exactly what you want for sequences of independent loops in the same parallel region, where no thread needs the previous loop to be globally complete before it starts the next one.

The reduction(op: vars) clause creates private local copies, combines them at the end, and thereby avoids a shared-update race. Its operators include the arithmetic and logical operations +, *, -, &, |, ^, &&, ||, min, and max, which makes reductions the right tool for sums, counts, maxima, minima, and similar associative accumulations.

Intuition

A reduction is "give every thread its own private accumulator, then add the accumulators together once at the very end." There is no per-iteration lock, so it is both correct and fast - the gold standard for accumulation. Gotcha: a reduction variable is automatically private and combined - do not also list it in shared(). With default(none), the reduction() clause alone classifies it. Putting shared(sum) with sum += ... is a race; reduction(+:sum) with sum += ... is correct.

For task parallelism rather than data-parallel loops, sections is the tool: each section is a different structured block, and threads execute different sections concurrently. This fits structurally different tasks such as taskA, taskB, and taskC.

One thing that is not automatically useful is nested parallelism. Nested parallel for directives may only create logical nested teams executed by the same outer thread unless nested parallelism is explicitly enabled. Real nested teams require settings such as OMP_NESTED=TRUE and OMP_MAX_ACTIVE_LEVELS, and even when they are available, nesting can add overhead and oversubscribe the cores.

OpenMP synchronization and memory model (04 Shared Memory Systems; 04 OpenMP_common_mistakes)

OpenMP's synchronization constructs are correctness tools first and performance costs second. A barrier is a full-team wait; single lets one arbitrary thread execute a block while master restricts execution to thread 0; critical serializes a block, and atomic serializes a simple memory update and is usually cheaper than critical; ordered restores sequential order for a block inside an ordered loop; and flush controls the visible memory view for shared objects. OpenMP also offers explicit locks (omp_init_lock, omp_set_lock, omp_unset_lock, and friends) for mutual exclusion, but they are easier to misuse than critical.

Exam

Cost ranking, fastest to slowest: reduction(+:s) > atomic >> critical ~= omp_lock. A reduction uses private partials combined once (fully parallel); atomic is one cheap hardware-locked update; critical/lock serialize every iteration and add roughly the cost of a barrier each time - they can make a "parallel" loop slower than the serial version. Measured example (summing 1e7 values, p=12): reduction ~1.6 ms, atomic ~149 ms, critical ~1523 ms, lock ~2685 ms, plain serial ~3.4 ms. A per-iteration critical was ~450x slower than serial here.

The prefix-sum example shows a common trap: adding ordered can make a loop syntactically parallel while keeping the real dependency sequential.

#pragma omp parallel for ordered
for (int i = 1; i < n; i++) {
    /* parallelizable work */
    #pragma omp ordered
    cumulSum[i] = cumulSum[i - 1] + list[i];
}

The dependency on cumulSum[i - 1] forces the update order. This may be correct, but it does not give a scalable prefix sum, and the exam lesson is to identify loop-carried dependencies before parallelizing.

Watch out

ordered does not remove the dependency, it just enforces the sequential order while paying parallel overhead - the worst of both worlds. A loop-carried dependency like c[i] = c[i-1] + x[i] is not parallelizable as-is; you must restructure it (e.g. a real log-depth prefix-sum / scan), not annotate it.

The OpenMP memory model is itself a major source of bugs. A read of a shared variable is not automatically guaranteed to observe the latest write unless synchronization creates the necessary visibility; explicit or implicit flush operations can be required. Many constructs include implicit flushes, which hides the issue on common machines, but relying on accidental visibility is not portable reasoning.

Watch out

Without a flush, a thread may keep a shared value in a register/cache and never see another thread's update. Barriers, critical, atomic, and the ends of work-sharing regions all imply flushes - so correct synchronization usually fixes visibility "for free." Sprinkling volatile or manual flush is a blunt, non-portable workaround.

Rounding out the API, the runtime library exposes three groups of functions: thread counts and ids (omp_set_num_threads, omp_get_num_threads, omp_get_max_threads, omp_get_thread_num, omp_get_num_procs, omp_in_parallel), nested/dynamic control (omp_set_dynamic, omp_get_dynamic, omp_set_max_active_levels, omp_get_max_active_levels), and locks (omp_init_lock, omp_destroy_lock, omp_set_lock, omp_unset_lock, omp_test_lock). The behavior of all this is steered by a handful of environment variables: OMP_NUM_THREADS sets the default team size, OMP_SET_DYNAMIC decides whether the runtime may change the thread count dynamically, OMP_NESTED enables nested parallelism, OMP_MAX_ACTIVE_LEVELS caps active nested regions, and OMP_SCHEDULE supplies the schedule for schedule(runtime) loops. OpenMP has also evolved substantially over time: tasks arrived after 2.5, and later versions added array sections, thread affinity (proc_bind), SIMD constructs, device constructs for accelerators, user-defined reductions, task dependencies (depend), an extended memory model, modern C++ support, and CUDA/OpenCL/SYCL/HIP-related support.

Common OpenMP correctness mistakes (04 OpenMP_common_mistakes)

The mistakes paper divides errors into correctness mistakes and performance mistakes, and its overarching lesson is sobering: compilers often do not catch OpenMP mistakes. Race-detection and thread-checking tools help, but you must still reason explicitly about data sharing, memory visibility, and synchronization.

The major correctness mistakes cluster around shared data and visibility. The most basic is unprotected access to shared variables: if multiple threads access a variable and at least one access writes, synchronization is required - even a simple assignment such as i = 1 is not safe by default. Closely related is reading a shared variable without the required memory synchronization or flush, because both writer and reader visibility matter; declaring variables volatile is a blunt workaround that inhibits optimization and is usually inferior to correct synchronization. Forgetting to make variables private is endemic precisely because the default is shared, so loop temporaries and per-thread accumulators silently become shared data races; prefer declaring temporaries inside the parallel region and using default(none).

Several mistakes are about misunderstanding the loop construct. Declaring the loop variable of parallel for as shared reveals a misunderstanding - OpenMP makes canonical loop variables private, and the explicit shared may be silently ignored. Forgetting for in #pragma omp parallel for is worse: every thread may then execute the whole loop, creating races and duplicated work.

Exam

Always state schedule(...) - then a bare parallel schedule without for will not even compile, catching the dropped for.

The remaining traps are smaller but real. Listing ordered in the clause list without an ordered construct inside the loop is a mismatch; trying to change the number of threads after entering a parallel region is illegal, since the team size must be set before the region or on the construct; calling omp_unset_lock from a thread that does not own the lock is wrong, because locks must be acquired and released by the same owning thread; and changing the loop control variable inside an OpenMP for violates the specification.

The best practices follow directly: use default(none) with explicit shared, private, firstprivate, lastprivate, or reduction; declare private temporaries inside the parallel block where possible; specify the loop schedule explicitly since the default is implementation-defined; use reductions for accumulation instead of manual shared updates; test with multiple compilers at high warning levels because different compilers catch different issues; and run thread-checking tools for races and shared-memory errors.

Common OpenMP performance mistakes (04 OpenMP_common_mistakes)

Performance mistakes do not necessarily change the result, but they reduce or destroy speedup. The first family is too fine-grained parallelism: sprinkling parallel for over small loops often loses to thread creation, scheduling, and synchronization overhead, so you should prefer coarser-grained parallel regions and parallelize outer loops when loops are nested.

Intuition

The fork/join + scheduling overhead is a fixed tax per parallel region. If the loop body is tiny, you pay the tax and get nothing. Parallelize the outer loop so each thread gets a large, coarse chunk of work that dwarfs the overhead.

The second family is wrong synchronization strength: use atomic instead of critical whenever a simple atomic update suffices, since a compiler or runtime has more room to optimize an atomic update than an arbitrary critical block; avoid unnecessary critical regions, especially around private data or code reached by only one thread; and avoid unnecessary flush directives, because many constructs already imply flushes. The third family is too much work in critical regions: keep critical sections as small as possible, compute expensive function calls before entering the critical region if the computation itself does not need protection, and avoid entering a critical region on every loop iteration when a local/private computation plus one final merge would do.

The maximum example is the pattern to remember, in three escalating versions:

Bad: every iteration enters critical and checks/updates max.
Better: check a condition before entering critical, then recheck inside.
Best: compute a private per-thread maximum, then merge the per-thread maxima with one protected update or a reduction.

Intuition

The "max pattern": do the bulk of the work privately, synchronize once. A per-iteration critical serializes the entire loop; a private partial result merged once at the end keeps the loop parallel. This is the same shape as a reduction - and is exactly why reduction > atomic >> critical.

A few more pitfalls round out the list. Orphaned constructs outside a parallel region may execute serially or not as intended. Nested parallelism often adds overhead and may not create real additional threads. I/O inside parallel loops can serialize execution, so buffer output and write in larger chunks. Dynamic schedules improve load balance but add runtime overhead, while static schedules are cheap but poor for irregular work. And, returning to a theme from the coherence section, false sharing from independent per-thread variables that happen to share a cache line generates coherence traffic; pad or align them onto separate lines.

Parallel sorting on shared memory systems (05 ParallelSorting1)

Sorting is a core kernel, and the deck draws a line between two families. Comparison-based sorting needs only a relation such as <, which makes it flexible for arbitrary data; its fundamental operation is compare-exchange, and its sequential lower bound is Omega(n log n). Value-based sorting uses numeric values directly, as in bucket sort or sample sort, and can be faster when the value range is suitable.

The deck focuses on comparison-based sorting in shared memory, where the array is shared and the sort runs in-place or with shared temporary buffers. The main parallel issue is not just "split the array"; it is whether the expensive phase of the algorithm can be parallelized without too much extra work, contention, or memory traffic.

Divide and conquer, Merge Sort, and Quicksort (05 ParallelSorting1)

Divide and conquer has three phases - divide (split the problem into similar-size subproblems), conquer (solve subproblems recursively and independently), and merge (combine the subproblem solutions) - and the interesting thing is that Merge Sort and Quicksort put the hard work in different phases. Merge Sort has a cheap divide, easy independent recursive calls, and an expensive merge. Quicksort is the reverse: an expensive divide/partition, easy independent recursive calls, and no expensive merge.

Intuition

The two algorithms are mirror images. Merge Sort is "split trivially, work hard combining"; Quicksort is "work hard splitting, combine trivially." When parallelizing, always ask which phase holds the work - that is the phase you must parallelize, and it is the phase that limits scalability.

Merge Sort obeys the recurrence T_S(1) = c, T_S(n) = 2 T_S(n/2) + c*n, giving runtime Theta(n log n), which is optimal for comparison sorting. Recursive parallelization looks easy - just run one recursive half in another thread - but naive full recursion can spawn n threads while the longest thread still performs a linear merge of n/2 elements. That makes the parallel cost too high, so the simple parallel formulation is not cost-optimal. For p < n, the parallel runtime is T_P = Theta(n + (n/p) log(n/p)), which is cost-optimal only up to p = Theta(log n) PEs.

Intuition

Why does Merge Sort barely scale? The merge tree has log p levels; at level i the team merges blocks of size 2^i, and sum_{i=1..log p} 2^i = 2p, so the total merge work per element sums to (n/p)*2p = 2n - independent of p. The merge phase stays Theta(n) no matter how many cores you add, so Merge Sort is cost-optimal only for about p = O(log n) processors. Its parallelism is essentially wasted beyond that.

Quicksort selects a pivot, partitions into values <= pivot and >= pivot, and recursively sorts the partitions. Good pivots give balanced partitions and average Theta(n log n), but bad pivots can split off one element at a time for a worst case of Theta(n^2). In practice Quicksort usually performs very well despite that worst case, and a practical pivot choice - the median of a[left], a[mid], a[right] - avoids pathological splits.

Parallel Quicksort for p < n proceeds in six steps:

One thread chooses a pivot. In shared memory the pivot is immediately accessible to all threads; in distributed memory it would need broadcast.
Each thread locally partitions its n/p elements into small and large elements.
Threads compute prefix sums over the counts of small elements |S_i| and large elements |L_i|.
The prefix sums tell each thread where to copy its local small and large blocks in the destination array.
Threads copy concurrently into non-overlapping target ranges.
The resulting global small and large partitions are recursively processed; once a partition belongs to a single thread, that thread sorts locally.

Intuition

The prefix sums turn "where does my block go?" into disjoint, conflict-free offsets in a destination array. Because every thread writes to a non-overlapping range, there is no write race during global rearrangement - the scan replaces a lock.

The sub-step costs are local rearrangement Theta(n/p), a parallel prefix sum of p values Theta(log p), global rearrangement Theta(n/p), and local final sorting averaging Theta((n/p) log(n/p)). With an average recursion depth of Theta(log p), the deck summarizes the total parallel runtime as:

T_P = Theta((n/p) log p + log^2 p) + Theta((n/p) log(n/p))
parallel cost = p*T_P = Theta(n log n + p log^2 p)

For p = n the extra term makes the cost Theta(n log^2 n), a factor Theta(log n) worse than optimal. Cost optimality therefore requires p small enough that Theta(p log^2 p) = O(n log n).

Exam

Cost-optimal needs p * log^2 p = O(n log n), i.e. roughly log p = sqrt(log n). For n = 2^25 this gives p = 2^sqrt(25) = 2^5 = 32 cost-optimal processors. (The provided OMP solution uses a simple O(p) prefix block, which adds a Theta(p log p) overhead term - so do not quote a tiny fixed p bound without the constants.)

These costs map straight onto shared-memory performance. Prefix sums are synchronization-heavy but avoid write conflicts by assigning disjoint target positions, and the destination array prevents threads from overwriting each other during global rearrangement. Pivot quality controls load balance, since bad partitions create idle threads; repeated global rearrangement moves a lot of data, so cache locality and memory bandwidth matter; and false sharing can appear in count arrays or per-thread metadata if per-thread counters share cache lines.

Bitonic Sort and sorting networks (05 ParallelSorting1)

A sorting network is built from fixed columns of comparators, where each comparator performs compare-exchange on two wires. An increasing comparator outputs (min(x1, x2), max(x1, x2)), and a decreasing comparator outputs (max(x1, x2), min(x1, x2)). The defining property is that sorting networks are highly synchronous and data-oblivious: the compare pattern does not depend on values, which makes them attractive for hardware, FPGA-style implementations, SIMD-like execution, and GPU workgroups.

Intuition

"Data-oblivious" means the sequence of comparisons is fixed in advance, independent of the actual values. There are no data-dependent branches, so every lane does the same work in lockstep - exactly what SIMD/GPU/FPGA hardware wants.

A bitonic sequence is an increasing subsequence followed by a decreasing subsequence, or a cyclic shift of such a sequence, and any two-element sequence is bitonic. Bitonic Sort builds larger bitonic sequences and then applies bitonic merge. Bitonic merge on a sequence of length 2k compares elements a_i with a_{i+k}: the first half receives the minima, the second half receives the maxima, the first half ends up less than or equal to the second half, and both halves are themselves bitonic, so recursively merging both halves yields a sorted sequence. For p = n, each element passes through 1 + 2 + ... + log n comparator stages, so the runtime is Theta(log^2 n) - and the parallel cost is worse than optimal by a factor log n.

Intuition

The network depth is O(log^2 n) (there are log n merge phases, the i-th using i stages, summing to ~ (log^2 n)/2). With p = n you get an extremely short span Theta(log^2 n), but the total comparator count is Theta(n log^2 n) - a factor log n more work than the n log n lower bound, so it is not cost-optimal. You trade extra work for a regular, synchronous structure the hardware loves.

For p < n, comparators become compare-split operations and each processing unit first sorts its local partition. The deck gives:

T_P = Theta((n/p) log(n/p)) + Theta((n/p) log^2 p)
cost = Theta(n * (log(n/p) + log^2 p))

The isoefficiency due to extra work is Theta(p log(p) log^2(p)), which is poor for large p. So bitonic sort is not the most scalable general shared-memory sorting algorithm, but it shines where the fixed synchronous comparator structure matches the hardware.

Exam

Implementation note (OMP, p < n): sort each local block with std::sort, then run compareSplit on block pairs. In the p = n in-array version the loop nest is for i (log n networks) -> for j (stages) -> for k (comparators); only the inner k-loop is independent. Open the team once outside both outer loops (#pragma omp parallel) and put a bare #pragma omp for on k, so you do not re-fork per stage and the implicit barrier of each omp for synchronizes the stages.

Graph definitions and representations (06 GraphSearch)

A graph G = (V, E) has vertices V and edges E, with n = |V| and m = |E|. An undirected edge is an unordered pair (u, v) and a directed edge is an ordered pair (u, v); a weighted graph attaches weights to edges. An edge that touches its endpoint vertices is incident, vertices joined by an edge are adjacent, a path from v to u is a sequence of vertices where consecutive vertices are connected by edges, and the path length is the number of edges on it.

The two standard representations make opposite memory/access trade-offs. An adjacency matrix is an n x n matrix using Theta(n^2) memory, good for small or dense graphs where m = Theta(n^2). An adjacency list is an array of neighbor lists using Theta(n + m) memory, better for sparse graphs and most large graph problems.

Intuition

Matrix = "is there an edge u->v?" in O(1) but n^2 memory even if empty; list = "who are u's neighbours?" in O(degree) and only stores edges that exist. Most real/large graphs are sparse (m << n^2), so the adjacency list wins.

Parallel graph algorithms depend heavily on partitioning. If a graph has disconnected components of similar size, partitioning is easy; for connected graphs, the edges that cross partitions determine the communication or shared-data interaction - which in shared memory means contention and synchronization, and in distributed memory means messages.

Search algorithms, heuristics, and search overhead (06 GraphSearch)

Many algorithmic problems can be modeled as graph search over explicit vertices or implicit states. In discrete optimization, states are vertices in a huge state-space graph, and feasible solutions are paths to goal states. The search notation is small: g(x) is the cost to reach state x from the initial state s along the current path, h(x) is a heuristic estimate of the cost from x to a goal, h is admissible if it never overestimates the true remaining cost, and l(x) = g(x) + h(x) is a lower-bound estimate of total solution cost when h is admissible.

Intuition

g is "how far I have come", h is "my optimistic guess of how far is left", l = g + h is "optimistic guess of the whole trip through x." Admissible = never pessimistic; this optimism is exactly what guarantees A/IDA return the true optimum (an underestimate can never wrongly discard the best path).

Parallel search may do different work from serial search because threads explore different states at different times. Writing W for the work done by the serial search, W_P for the work done by the parallel search, and f = W_P / W for the search overhead factor, the speedup is bounded by:

S <= p/f

This is crucial: even with perfect hardware, extra search work reduces the maximum possible speedup.

Intuition

Parallel searchers, lacking the serial algorithm's perfect global ordering, expand states the serial search would have pruned. That wasted work is the overhead factor f >= 1, and it caps speedup at p/f - sometimes parallel search even does less work (f < 1, "speedup anomaly") if a worker stumbles onto the goal early.

DFS and parallel DFS (06 GraphSearch)

Depth-first search expands the initial vertex, then repeatedly expands one of the most recently generated vertices: it follows a path to a leaf or dead end, then backtracks. Its strengths are very low memory (linear in search depth), a natural recursive implementation, and suitability for huge or implicit search spaces where breadth-style storage is impossible. Its weakness is optimality - the first solution found is usually not guaranteed optimal, because DFS may search deeply in a poor part of the tree before finding a shallow, better solution elsewhere.

Intuition

DFS keeps only the current path on its stack, so memory is O(depth) - tiny. The price is tunnel vision: it can plunge deep into a bad subtree before noticing a shallow, better solution elsewhere.

Several variants tame that weakness. Simple backtracking stops at the first solution with no optimality guarantee, while ordered backtracking orders successors using a heuristic. Depth-first branch-and-bound (DFBB) keeps the best solution cost so far and prunes paths that cannot beat it, so on termination its best solution is globally optimal. Iterative deepening (IDA) runs DFS with increasing depth bounds to avoid missing shallow solutions for too long, and IDA* bounds l(x) = g(x) + h(x) rather than depth, finding an optimal solution when h is admissible.

Intuition

DFBB = "DFS that remembers the best complete answer so far and refuses to explore any partial path already worse than it" - gives optimality with DFS-sized memory. IDA/IDA* = "repeat DFS with a growing budget" - recovers BFS-like shallow-first behaviour while keeping linear memory, at the cost of re-expanding upper levels each round.

Parallelizing DFS starts by expanding several tree levels until there are enough vertices to give work to p workers. The catch is that subtree sizes are hard to predict, so static assignment often load-imbalances badly, and dynamic load balancing is usually required. In shared memory, workers keep local stacks/open lists and steal work from one another under locks, and when one process or thread finds a solution, other workers may need to terminate or update their pruning bound.

Intuition

Subtree sizes are wildly uneven and unknowable in advance, so a fixed split leaves some workers idle while others drown. The fix is work stealing: an idle worker grabs part of a busy worker's stack. Coordination (locks, global bound) must stay cheap or it becomes the bottleneck.

In other words, parallel DFS is a direct application of OpenMP/multithreaded synchronization issues: the work lists need mutual exclusion, but too much locking turns the work queue into a bottleneck, and a global best bound improves pruning only if updating and reading it is done with correct visibility and synchronization.

Best-first search, A*, and parallel open lists (06 GraphSearch)

Here the deck uses BFS to mean best-first search, not breadth-first search. Best-first search keeps an open list of generated but unexpanded vertices and a closed list of expanded vertices. At each step it expands the most promising open vertex, inserting successors if they are new or if a better path/heuristic value is found than an existing open/closed entry. A is the standard best-first algorithm, using l(x) = g(x) + h(x) as the priority; with an admissible heuristic, A finds an optimal solution first.

Intuition

A* always expands the open vertex with the smallest l = g + h - the most promising candidate. With an admissible h, the first goal it pulls off the open list is provably optimal. Open + closed lists give it a perfect global ordering, which is exactly what makes it hard to parallelize without contention.

Its major drawback is memory: open/closed storage is linear in the explored search space, which can be exponential in search depth.

The simplest parallel scheme is a centralized open list, where multiple workers expand the currently best vertices concurrently. It is easy to implement in shared memory, but the open list is accessed for every expansion, which causes contention and limits speedup. Sequential termination is no longer valid - if one worker finds a goal, other concurrently expanded vertices may still lead to better goals - and the parallelism can expand vertices a sequential A* would never touch, increasing W_P and the overhead factor f.

Watch out

One shared priority queue touched on every expansion is a serialization point - the cleaner the global order, the worse the contention. This is the central tension: a centralized open list gives the best search order but the worst scalability.

Local/distributed open lists reduce contention by initially distributing parts of the search space to worker-local open lists, from which workers expand. Without communication, duplicate work and poor search order grow; with communication, workers periodically exchange promising vertices. The deck names three communication strategies, which differ in how fast good work spreads:

Strategy	Mechanism	Character
Random	periodically send good vertices to random workers	good search regions can spread quickly but unstructured
Ring	exchange good vertices with two neighbors in a virtual bidirectional ring	simple but slow to spread high-quality work globally
Blackboard	a shared structure holds good vertices; workers give/take depending on whether their local best is much better or worse	especially suited to shared memory, since the board is checked frequently

Intuition

Local open lists trade global order for throughput. The communication strategy decides how fast good work spreads: random = fast but unstructured, ring = simple but slow to propagate, blackboard = a shared "best-so-far" board that fits shared memory because it can be polled cheaply and often.

Parallel graph search also adds duplicate detection. The common scheme maps each generated vertex to an owner process or thread, typically by hashing the vertex to a process number, then sends or checks that vertex at its owner. The owner checks its local open/closed lists for duplicates, inserting the vertex if it is new and replacing/updating it if it already exists with worse cost.

Intuition

Hashing a vertex to a fixed owner guarantees all copies of "the same state" land at the same place, so duplicates can be detected and merged locally - turning global duplicate-checking into local lookups.

The shared-memory link is the recurring tension: centralized structures improve global search order but create contention, while distributed structures improve throughput but increase extra work and duplicate management. This is the same trade-off as OpenMP synchronization, where stronger coordination helps correctness and search quality but can dominate runtime.

Discrete optimization as graph search (06 GraphSearch)

A discrete optimization problem is a pair (S, f): S is a finite or countably infinite set of feasible solutions satisfying the constraints, f is a cost function mapping each solution to a real value, and the goal is to find x_opt with f(x_opt) <= f(x) for all feasible x. Such problems can be reformulated as state-space graph search, where a feasible solution is a path from an initial state to a goal state - examples include VLSI layout, robot motion planning, test-pattern generation, facility location, task mapping, and scheduling.

The 0/1 integer linear programming example makes this concrete. The variables are binary, a partial assignment is a search-tree vertex, and assigning the next variable branches to value 0 or 1. After each assignment you can check feasibility and prune the branch if the partial assignment can no longer satisfy the constraints; feasible complete assignments are solution states, and the cost function selects the minimum-cost feasible solution. This is branch-and-bound thinking in miniature: do not generate the entire feasible set if constraints or bounds already prove a branch cannot contain the optimum.

Intuition

The exponential set of candidate solutions becomes a search tree where each level fixes one decision. Pruning infeasible/too-costly branches early means you never materialize most of the 2^n leaves - that pruning is the whole point of branch-and-bound.

Optimal task dependency graph mapping (06 GraphSearch)

Task mapping assigns tasks to processing elements, and a good mapping pulls in three directions at once: it wants to maximize concurrency by placing independent tasks on different PEs, minimize completion time by keeping processors free for critical-path tasks, and minimize interaction by mapping highly interacting tasks close together or on the same PE. These goals conflict. One task on one processor has no communication or idle overhead but no speedup, whereas maximum spreading exposes concurrency but increases interaction and scheduling overhead.

Intuition

Spreading tasks out buys concurrency but adds communication/idle cost; packing tasks together cuts communication but kills parallelism. The optimal mapping is the sweet spot between these two pulls - and finding it exactly is NP-hard.

The scheduling problem then asks: for a given task dependency graph and number of processing elements, what is the minimum parallel runtime T_P? This is itself a discrete optimization problem and can be solved as graph search. The deck's approach has three steps:

Topologically sort the task dependency DAG. If there is no single root, add an artificial root of duration 0. Topological sorting is possible because the dependency graph is acyclic and costs O(n + m). During the sort, compute each task's earliest possible start time from its predecessors' completion times.
Compute the critical path length. This is a lower bound on parallel runtime because critical-path tasks must execute sequentially, and in a DAG the longest path can be computed in O(n + m) using topological order.
Map tasks in topological order to all possible PEs, forming a search tree. A search state stores the partial task assignment and the process execution times; leaf states are complete mappings. For a solution state s, T_P(s) = max_p T(p, s), and the optimum is the minimum T_P(s) over all solution states.

Exam

The critical path length (longest weighted path through the DAG) is a hard lower bound on T_P - those tasks form a dependency chain that must run one after another, no matter how many cores you have. So the search can stop the instant it finds a mapping that reaches the critical-path length: that mapping is provably optimal.

Before parallelizing, sharpen the sequential search. Use DFBB: once a complete mapping gives an upper bound, prune any partial mapping whose current simulated runtime already exceeds it, and update the upper bound whenever a better complete solution is found. Also avoid symmetric search - if several PEs have the same current execution time, mapping the next task to each of them can be equivalent, so explore only one representative unless another criterion such as communication cost distinguishes them.

Exam

Symmetry pruning: if PEs 2, 3 and 4 are all currently idle at the same time, assigning the next task to any of them produces an identical schedule - exploring all three just triples the work. Try one representative.

To parallelize, move from recursive DFS to iterative DFS with explicit open lists, so states can be exchanged among workers, and run one DFSearcher per concurrent search thread. Each worker processes states from its own open list; if a worker runs out of work while others still have many states, it steals from the worker with the most pending states, moving every pth state (or another subset) into its own open list. Open-list access requires mutual exclusion. When a worker finds a better solution, it broadcasts the new upper bound so others can prune more aggressively, and if a worker finds a solution equal to the critical path length, the optimum has been reached and all other workers can stop.

Intuition

Why steal every p-th node, not half? At the start one searcher holds the root and works alone until its open list grows; then all p-1 idle searchers steal from that single victim at once. One worker has the work and p-1 want it (= p total), so taking every p-th node gives each thief roughly 1/p of the frontier - balanced. Stealing every 2nd node would let the first thief grab half and starve the rest. (Implementation: lock both lists in ID order to stay deadlock-free; the bound is a std::atomic, the open list guarded by a std::mutex.)

This is a concrete example of the whole shared-memory theme: parallelism is available in the search tree, but performance depends on reducing contention, balancing irregular work, synchronizing global bounds correctly, and terminating safely.

Cross-topic exam connections

Step back and the parts rhyme. Shared memory makes data access convenient but does not remove communication: cache coherence, invalidations, false sharing, locks, atomics, barriers, open-list contention, and prefix-sum phases are all forms of communication cost.

Correctness and performance are linked in both directions. A race-free OpenMP program can still be slow because of false sharing, too many barriers, oversized critical regions, poor scheduling, or nested-parallel overhead. Conversely, a fast-looking program without synchronization may simply be incorrect, because shared updates and memory visibility are not guaranteed.

Algorithmic decomposition must match the expensive phase, and the sorting and search algorithms each illustrate a different failure mode: Merge Sort's recursive halves parallelize easily, but its merge dominates; Quicksort's halves are easy after partitioning, but partitioning and load balance dominate; Bitonic Sort has massive synchronous parallelism, but its extra work hurts scalability; DFS has many subtrees, but their irregular sizes demand work stealing; and best-first search has a useful global order, but its global priority/open lists become contention bottlenecks.

Cost optimality is the recurring standard. A parallel algorithm is not good just because its span is small - its total work/cost must be close to the best sequential work. Naive parallel Merge Sort and p = n parallel Quicksort/Bitonic Sort all show that excessive parallelism can add enough overhead to lose cost optimality, and parallel search can have W_P > W, where the overhead factor f reduces the speedup bound to p/f.

Finally, your synchronization choice should follow the shared-state pattern:

Shared state pattern	Right tool
Independent per-thread data	private variables, combine later
Simple shared update	reduction or atomic
Compound shared invariant	critical or locks, but keep the region small
Irregular work pools	local queues/stacks plus work stealing
Global stopping or pruning bound	synchronize updates and visibility, but avoid making every expansion wait on a global lock

For exam answers, explicitly connect the algorithm to the machine: identify the shared data, the cache-line behavior, the synchronization points, the load balance, the extra work, and the cost model. That is the bridge between the Shared Memory Systems/OpenMP decks and the Sorting/Graph Search decks.

Part III - Distributed Memory, Collectives, Sorting, and Numerical Algorithms

Distributed-memory model and motivation (08 Distributed Memory Systems)

A distributed-memory system consists of nodes connected by a network. The defining property is that each process has its own exclusive address space, so data is simply not implicitly visible to other processes. Every data element belongs to exactly one address-space partition, and any interaction between partitions requires explicit cooperation through communication. This is the fundamental contrast to shared-memory systems, where tasks are usually threads that carry private state but still share access to a common memory.

Intuition

In shared memory you reach over and read a neighbour's variable for free; in distributed memory there is no shared variable at all — the only way to learn someone else's value is to ask them to mail it to you. That mail has a real cost and shows up directly in your algorithm's runtime.

Why bother with distributed memory at all? The economics are the answer. Large shared-memory MIMD systems require expensive interconnects such as big crossbars, whereas distributed interconnects — meshes, rings, tori, and hypercubes — are cheaper and scale better for coarse-grained computations. Distributed systems are also the natural choice when the data volume or compute demand simply exceeds what any single shared-memory machine could hold.

The programming consequence is what matters for exams, and it is worth internalizing: communication is explicit, costly, and algorithmically visible. Data must be partitioned and placed deliberately. Read-only access to remote data is still communication. Read-write access requires communication plus synchronization or ownership discipline. Nothing is free.

Most MPI programs settle into a recognizable rhythm. Execution is asynchronous, meaning tasks run independently most of the time; it is only loosely synchronous, in that subsets of tasks occasionally synchronize to exchange data; and it is SPMD, in that all processes run the same program and branch on rank and local data.

Intuition

SPMD = one binary, launched p times. The code looks identical on every rank; if (id == 0) ... is how a single program plays different roles (e.g. "the collector" vs "a worker").

It helps to contrast MPI with MapReduce, the other classic distributed model. MapReduce also targets large distributed data sets, but it deliberately hides the details of parallelization and fault tolerance. You write a map that produces key-value pairs from input items, the framework groups values by key, and your reduce combines all values for a given key. A canonical example maps each word to (word length, word) and reduces the grouped lengths into counts. MPI sits one level below this: it exposes the raw communication mechanisms that MapReduce abstracts away.

Intuition

MapReduce trades control for convenience — you write only map and reduce, and the framework handles distribution, shuffling, and crash recovery. MPI gives you the raw Send/Recv knobs and makes you responsible for all of that.

MPI basics and process structure (08 Distributed Memory Systems)

MPI is a standardized, portable message-passing interface for C and Fortran, with implementations such as MPICH, Open MPI, Intel MPI, and Microsoft MPI. Crucially, the standard specifies routine syntax and semantics, not a single implementation strategy — which is exactly why the same program must behave correctly across very different runtimes. MPI 4.1 is the current standard mentioned in the deck, but the examined material focuses on core MPI and collective communication.

You can get a surprising amount done with a minimal routine set. MPI_Init(&argc, &argv) initializes the environment before any other MPI call and may strip MPI-specific command-line arguments; MPI_Finalize() cleans up and terminates at the end. To orient yourself inside the job you call MPI_Comm_size(MPI_COMM_WORLD, &p) for the number of processes and MPI_Comm_rank(MPI_COMM_WORLD, &id) for your own rank. The two workhorses are MPI_Send(...) and MPI_Recv(...). Finally, MPI_Wtime() gives wall-clock seconds for timing — the idiomatic pattern is to MPI_Barrier first, then MPI_Reduce(..., MPI_MAX, ...) so that the slowest finisher defines the parallel runtime.

MPI_COMM_WORLD is the default communicator containing all processes. A communicator is genuinely part of the address of a message: source and destination ranks are meaningful only inside a communicator.

Intuition

The launcher decides p. mpiexec -n 12 prog starts 12 ranks; running the bare binary starts just one. Match -n to the number of physical cores for realistic measurements.

I/O needs a little care because all processes may be able to write to stdout and stderr, but the output order is unpredictable. The best practice for deterministic output is to let only rank 0 write, with other ranks sending their output to rank 0. Input is symmetric: usually only rank 0 reads from stdin, and if every rank needs the input, rank 0 reads it and broadcasts it.

Exam

In the FS21 "greetings" pattern: if rank 0 collects with a fixed loop for(i=1;i<p;i++) Recv(src=i), output prints in rank order but stalls on a slow rank. Using src=MPI_ANY_SOURCE and reading the real sender from status.MPI_SOURCE prints in arrival order and never blocks on a laggard.

For debugging, the deck's advice is pragmatic: start with one or two processes, run everything on one machine when you can, and lean on assertions. Switching to synchronous sends is a good way to expose unsafe ordering. And to attach a debugger to a specific rank, print or query its process id, block that process, attach the debugger, set your breakpoints, and only then release it.

Point-to-point communication semantics (08 Distributed Memory Systems)

The basic point-to-point operations are:

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm);

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm, MPI_Status *status);

The semantic requirement underpinning all of this is simple to state: the receiver obtains the value that was in the send buffer at the time of the send. If process P0 sends x = 100 and then immediately changes x = 0, the receive must still obtain 100. That single requirement is what forces send protocols to define precisely when a send buffer becomes safe to reuse.

Intuition

"Send" semantically captures a snapshot of the buffer. Whether that snapshot is achieved by blocking until the receiver takes it, or by copying it into a hidden buffer, is the implementation's choice — and that choice is exactly what the safety rule below is about.

How does a receive decide which incoming message it matches? Four things line up. The communicator must be shared, since sender and receiver must use the same communication context. The source rank can be a specific rank or MPI_ANY_SOURCE, and the tag can be a specific tag or MPI_ANY_TAG. Finally the datatype and count must be compatible buffers.

A handful of special values smooth the rough edges. MPI_ANY_SOURCE and MPI_ANY_TAG let a receive match any sender or any tag. MPI_PROC_NULL is a dummy destination/source — a no-op partner that is invaluable at array and grid boundaries, because edge processes can then run the same code without special-casing. And MPI_STATUS_IGNORE lets you discard receive status when you do not need it.

When you do want the status, MPI_Status stores at least MPI_SOURCE, MPI_TAG, and MPI_ERROR, and MPI_Get_count(status, datatype, &count) reports how many data items actually arrived. One ordering guarantee is worth memorizing: MPI messages are non-overtaking for the same sender-to-same-receiver pair, so they become available in send order. Messages from different senders may arrive in any order, and once you add buffering and differing tags you have to reason about receive ordering carefully.

Intuition

Two letters from the same friend arrive in the order they were posted (non-overtaking). Letters from different friends can arrive in any order. That is why a fixed-source receive loop serializes you behind the slowest sender.

Blocking, buffering, safety, and deadlocks (08 Distributed Memory Systems)

A blocking send is allowed to be implemented in more than one way, and the difference matters. A non-buffered blocking send returns only once the matching receive has been encountered: this guarantees safe reuse of the send buffer but can cause idling and even deadlock. A buffered blocking send returns as soon as the data has been copied into an internal buffer, which reduces idling at the price of a buffer-copy overhead — and, importantly, still does not make every program safe.

Intuition

A blocking MPI_Send returns when the buffer is safe to reuse — not necessarily when the message has been received. Whether that happens immediately (data copied aside) or only once the receiver shows up is the implementation's call.

This leads to the single most important MPI correctness rule: a correct MPI program must run correctly regardless of whether MPI_Send behaves more like buffered or non-buffered communication. The deck calls this property safety. Two named variants let you control the behaviour explicitly. MPI_Ssend is a synchronous send that blocks until the matching receive starts — perfect for checking safety, because unsafe code that only works thanks to buffering tends to deadlock under it. MPI_Bsend is the explicitly buffered counterpart.

Exam

Treat MPI_Send as if it were MPI_Ssend (synchronous) when reasoning about correctness. If the program is correct under that pessimistic assumption, it is safe. Swapping in MPI_Ssend is the practical litmus test: unsafe code that only "worked" because the runtime happened to buffer will now deadlock.

The classic unsafe pattern is worth keeping in your head: P0 sends a message with tag 1 and then tag 2, while P1 first receives tag 2 and then tag 1. This may run fine with buffering, but it deadlocks under synchronous or non-buffered semantics, because P0 is waiting for the receive of tag 1 while P1 is waiting for tag 2. And buffering is no panacea — deadlocks remain possible because receives always block until a matching message is available, so if both processes call receive first and only send afterward, both wait forever.

Watch out

The two canonical deadlocks: (1) both ranks Recv before Send ⇒ both wait forever for a message nobody has sent yet (this deadlocks even with buffering). (2) both ranks do a non-buffered Send first ⇒ neither send can complete because no receive has been posted. Fix both with MPI_Sendrecv.

MPI gives you MPI_Sendrecv and MPI_Sendrecv_replace for exactly these simultaneous exchanges. They are the right tool for neighbor exchanges, shifts, and compare-split patterns precisely because they sidestep the "both sides send first" and "both sides receive first" traps. MPI_Sendrecv_replace goes one step further by using a single buffer — incoming data overwrites outgoing — which is exactly what circular shifts and Cannon's rotations need.

Nonblocking communication (08 Distributed Memory Systems)

Nonblocking operations return before the operation is complete:

MPI_Isend(..., MPI_Request *request);
MPI_Irecv(..., MPI_Request *request);
MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
MPI_Wait(MPI_Request *request, MPI_Status *status);
MPI_Waitall(int count, MPI_Request *requests, MPI_Status *statuses);

Their purpose is to overlap communication with useful computation. The contract you must honour is that you may not reuse or modify the external buffer until completion has been confirmed by MPI_Test or MPI_Wait. And be honest with yourself about the payoff: nonblocking communication is not automatically faster. It helps only when the program has independent work to do while the transfer progresses and the implementation or network can actually exploit that overlap.

Intuition

Isend/Irecv post the request and return immediately ("the mail is on its way"); you do unrelated work, then call Wait to make sure it landed before touching the buffer. If there is nothing useful to do meanwhile, nonblocking buys you nothing.

Watch out

FS21 Cannon gotcha: on a single host, nonblocking ≈ blocking, because the MPI runtime moves data through shared memory — there is no real network transfer to overlap with. Nonblocking double-buffering only pays off across multiple physical nodes.

Communicators and virtual topologies (08 Distributed Memory Systems)

A communicator defines a communication domain: a group of processes plus a context that distinguishes messages within that domain. A process may belong to several communicators at once, including overlapping ones — and this is precisely how algorithms restrict communication to rows, columns, subproblems, or pipeline stages.

Intuition

A communicator is a private "chat room" with its own message namespace. The context tag means a message sent in one communicator can never be accidentally matched by a receive in another — even if ranks and tags coincide.

The key routines fall into two clusters. For carving up and managing groups, MPI_Comm_split(comm, color, key, &newComm) partitions a communicator into subcommunicators by color, with ranks inside each new communicator ordered by key; MPI_Intercomm_create(...) builds an inter-communicator between two groups; the trio MPI_Comm_group, MPI_Group_size, MPI_Group_rank lets you access and query process groups (which also support difference, intersection, and union set operations); MPI_Comm_create(comm, group, &newcomm) turns a group into a communicator; and MPI_Comm_free(&comm) releases one when you are done.

Virtual topologies are the second cluster, and they let a program express its logical communication pattern. MPI may then map ranks to hardware in a way that reduces communication cost — especially when you permit reorder. The Cartesian routines are the ones you will reach for: MPI_Cart_create(commOld, nDims, dims, periods, reorder, &commCart) builds a k-dimensional grid or torus, where periods[] toggles wrap-around per dimension and setting all entries to 1 makes a torus; MPI_Cart_coords and MPI_Cart_rank convert between rank and coordinates in either direction; MPI_Cart_shift(commCart, dir, step, &rankSource, &rankDest) hands you the source/destination ranks for a shift; and MPI_Cart_sub(commCart, keep_dims, &commSubcart) forms lower-dimensional subgrids such as rows or columns.

Intuition

A virtual topology lets you say "process (i,j) on a 2D grid" instead of juggling raw ranks. MPI_Cart_shift then hands you the neighbour ranks for a shift in one call, and periods=1 makes the edges wrap so corner processes still have neighbours.

This matters directly for algorithms such as Cannon's matrix multiplication, where processes are logically arranged on a 2D torus and repeatedly shift matrix blocks left and up.

Network topologies and communication cost model (09 Collective Communication)

Distributed-memory performance depends on topology, routing, and contention. The deck draws a line between physical and logical topology, and it is the logical topology that matters for algorithm design: it describes how data flows, even when the physical cabling is different.

The logical topologies you should recognize range from a linear array (each node has left/right neighbours) and a ring or 1D torus (the endpoints connect), through a 2D mesh (north/south/east/west) and a 2D torus (a mesh with wraparound), up to a general d-dimensional mesh where each node has 2d neighbours. The star of the show is the hypercube: it has p = 2^d nodes and d = log p dimensions, every node has log p neighbours, and the distance between two nodes is simply the number of bit positions in which their labels differ.

Intuition

Label every hypercube node with a d-bit string. Two nodes are neighbours iff their labels differ in exactly one bit, so you reach any node by flipping wrong bits one at a time — at most d = log p hops. This bit-flip structure is exactly what makes id XOR 2^i the natural "partner in dimension i."

Routing is modelled at three levels of sophistication: store-and-forward (a router forwards only after receiving the whole message), packet routing (the message is split into packets), and cut-through routing (packets follow the same path and forwarding can begin before the whole message has arrived). The cost is parameterized by ts, the startup time at sender and receiver including routing setup and protocol overhead; th, the per-hop time including switch and network latency; tw, the per-word transfer time dominated by bandwidth and data-size-dependent overhead (tw = 1/bandwidth); m, the message size in words; and l, the number of hops.

For cut-through routing over l hops the full model is:

t_comm = ts + l * th + m * tw

In practice th is often small compared with the startup and bandwidth terms, placement and routing may not even be controllable, and large messages drown out the hop cost anyway. So the collectives deck adopts a simplified model throughout:

t_comm = ts + m * tw

Intuition

Every message costs a fixed "stamp" ts (latency to get started) plus a per-word "postage" m * tw (bandwidth). To go fast: send fewer, larger messages to amortize ts, and move less data to shrink m * tw.

This model assumes an uncongested network. The moment several messages share a link, the tw term must be scaled by the number of messages using that link — and different collective patterns generate very different congestion on rings, meshes, and hypercubes. The deck also fixes some ground rules: collectives are built from point-to-point primitives, channels are bidirectional and FIFO, the network is single-ported, and reading and writing on the same port at the same time is not allowed.

Taxonomy of collective communication (08 Distributed Memory Systems; 09 Collective Communication)

Collective operations involve all processes in a communicator, and every process in that communicator must call the collective. Each one is characterized by who sends, what data is sent, and how received data is handled.

Intuition

A collective is a coordinated pattern every rank in the communicator must enter together. Forgetting one rank's call hangs the whole group. Think of each as a named, optimized macro over many point-to-point messages.

The main categories, together with their MPI calls, are easiest to absorb as a table:

Operation	What it does	MPI call
One-to-all broadcast	One source sends the same data to all processes	`MPI_Bcast`
All-to-one reduction	All processes contribute data combined by an associative operator at one target	`MPI_Reduce`
All-reduce	Like reduction, but all processes receive the result	`MPI_Allreduce`
Prefix sum	Process `k` gets the reduction of ranks `0..k`	`MPI_Scan` (exclusive: `MPI_Exscan`)
Scatter	One source sends different pieces to different processes	`MPI_Scatter`
Gather	One target collects different pieces from all processes	`MPI_Gather`
All-to-all broadcast / allgather	Every process broadcasts its own same message to all others	`MPI_Allgather`
Total exchange (all-to-all personalized)	Every process has a distinct message for every other process	`MPI_Alltoall`
All-to-all reduction	Simultaneous reductions with different destinations (not the same as all-reduce)	`MPI_Reduce_scatter`

The reduction operators themselves include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, logical and bitwise operators, and the location-aware MPI_MAXLOC and MPI_MINLOC.

Intuition

Collectives operate element-wise across ranks. Scan on rank r = reduce ranks 0..r including self; Exscan excludes self and leaves rank 0 untouched. Allgather concatenates each rank's block in rank order (same result on all). Alltoall is a transpose: recv[j][i] = send[i][j]. Reduce_scatter = Allreduce then scatter block i to rank i. Reduce writes only at the root (non-root buffers untouched).

Broadcast and reduction algorithms (09 Collective Communication)

The naive one-to-all broadcast simply sends p - 1 messages from the source. It is inefficient on three counts at once: the source becomes a bottleneck, the network is underused, and the whole thing takes p - 1 communication steps.

The central improvement is recursive doubling. After the source sends to one selected node, there are two informed nodes; each then helps broadcast to a disjoint half of the remaining nodes. Inverting the very same idea gives you all-to-one reduction.

Intuition

The set of nodes that already hold the data doubles every step — 1, 2, 4, 8, ... — so everyone is reached in log p steps instead of p - 1. Reduction is the same tree run backwards: partial results merge pairwise up toward the root.

The doubling idea adapts to richer topologies. On a 2D mesh you view the grid as rows and columns: first broadcast (or reduce) along one row, then do all the columns concurrently, and the scheme generalizes naturally to d-dimensional meshes. On a hypercube, since p = 2^d means d = log p, broadcast and reduction each take log p sequential point-to-point steps, giving:

T_bcast = T_reduce = (ts + m * tw) * log p

There are actually two algorithmic branches for one-to-all broadcast, and the real cost is the smaller of the two:

T_bcast = min( (ts + m * tw) * log p ,  2 * (ts * log p + m * tw) )

The first branch is the spanning-tree / recursive-doubling approach: log p steps, each forwarding the whole message, which is best when m is small because the m * tw * log p term stays cheap. The second branch scatters then all-gathers: split the message into p parts, scatter the parts, and all-gather them, which costs about 2 * (ts * log p + m * tw) and crucially drops the m * tw * log p term — the winner for large m.

Intuition

For a small message, just flood it down a doubling tree. For a huge message, don't re-send all m words log p times; cut it into p pieces, scatter the pieces (each node carries 1/p), then all-gather — every word now crosses the network roughly twice instead of log p times.

One last trick makes the hypercube algorithm fully general. If the source is not rank 0, define virtualId = id XOR source so the same bitwise algorithm runs as though the source were rank 0; at each dimension, the processes that already hold the message send to the partners differing in that bit. Reduction simply reverses the direction, with nodes sending partial results toward rank 0 (or the chosen root) and combining received values with an associative operator.

All-to-all broadcast, all-reduce, and prefix sums (09 Collective Communication)

All-to-all broadcast — also called allgather — means every node is simultaneously a source and a sink: each node sends its own message and receives everyone else's. The simple method is just p independent broadcasts, but the good algorithms exploit topology.

Intuition

Allgather is "everybody tells everybody their own value, and ends up holding the full roster." The total data each node must receive is m * (p - 1) words no matter how clever the schedule, which is the floor on the tw term.

On a ring, each node sends its data to one neighbour, and in every later step forwards what it just received onward; after p - 1 steps everyone holds everything:

T_ring_allgather = (p - 1) * (ts + m * tw)
                 = ts * (p - 1) + m * tw * (p - 1)

Exam

FS21 variant — bidirectional ring (2-in/2-out channels). If every node can send and receive in both ring directions at once, the frontier advances both ways, so the number of rounds drops to ceil((p - 1) / 2) = floor(p / 2). The throughput argument: 2 new packets arrive per round and p - 1 are needed, giving ceil((p - 1) / 2) rounds. The diameter argument: the farthest node is floor(p / 2) hops away, so you cannot beat that — both bounds agree, so it is optimal. And the reason it is not log p: only neighbour links exist, so merging caps the message size, not the one-hop-per-round frontier. Sanity check: p = 7 gives 3 rounds.

A 2D mesh does it in two phases. First each row performs an all-to-all broadcast, then each node consolidates its row messages into one larger message, and finally the columns broadcast those consolidated messages:

T_mesh_allgather = 2 * ts * (sqrt(p) - 1) + m * tw * (p - 1)

The hypercube generalizes the mesh idea to log p dimensions. At each step a node exchanges its accumulated result with partner id XOR 2^i, and the message size doubles every step:

T_hypercube_allgather = ts * log p + m * tw * (p - 1)

In all cases the data-volume lower bound is the same:

Omega(m * tw * (p - 1))

All-reduce combines p buffers under an associative operator and leaves the identical result on every node. You could implement it as a reduce followed by a broadcast, but the better hypercube method follows the same exchange pattern as allgather — except that it never doubles the useful data each step:

T_hypercube_allreduce = (ts + m * tw) * log p

Intuition

All-reduce looks like allgather, but instead of accumulating all messages (size doubling) each node combines incoming values with its own (size stays m). That is why all-reduce keeps the cheap (ts + m * tw) * log p cost while allgather pays m * tw * (p - 1).

Prefix sum computes s_k = sum_{i=0..k} n_i, so node k receives the reduction of all earlier-or-equal ranks. It reuses the all-to-all broadcast kernel with one twist: a received value is added to the result buffer only if it came from a lower-ranked partner, while the outgoing message accumulates every received value. That is the conceptual basis of MPI_Scan; MPI_Exscan is the same thing with the process's own contribution excluded. The FS21 unit-message cost includes the local add work, because the slowest node performs both msg += val and (conditionally) result += val in each of the log p rounds:

T_prefix = (ts + m * tw + 2 * t_add) * log p

It is not cost-optimal for p = n, because the cost Theta(n log n) exceeds the Theta(n) of a sequential prefix sum.

Intuition

Run the allgather/XOR exchange, but keep two accumulators: msg (always add the partner's value — your running subtree total, which doubles its coverage each round) and result (add the partner's value only if the partner has a smaller id). After log p rounds, node k ends with exactly the sum of ranks 0..k.

Scatter, gather, and total exchange (09 Collective Communication)

Scatter is one-to-all personalized communication: one source sends a different message of size m to every node, and gather is the exact inverse. On a hypercube, both take log p steps, during which the number of involved nodes doubles while the forwarded data size halves:

T_scatter = T_gather = ts * log p + m * tw * (p - 1)

This is asymptotically optimal in message size, simply because every destination must receive its own data.

Intuition

Scatter and broadcast both fan out over a doubling tree, but in scatter each node passes on only the half of the data destined for the subtree beyond it — so the data shrinks as it descends, instead of being copied whole.

All-to-all personalized communication, or total exchange, is a different and heavier beast than allgather. In allgather each process sends the same message to everyone; in total exchange each process holds a distinct message for every other process. Matrix transposition, with one full row per process, is the standard example.

Intuition

Allgather = "everyone shouts the same announcement to the whole room." Total exchange = "everyone hands every other person a personal, different note." Total exchange therefore moves far more data — it is the all-pairs transpose pattern.

On a ring, total exchange works by having each node send one consolidated message of size m * (p - 1) to its neighbour, extract the part intended for itself, and forward the rest; the message shrinks by m each step and the process terminates after p - 1 steps:

T_ring_total_exchange = ts * (p - 1) + tw * m * p * (p - 1) / 2

Bidirectional communication can roughly halve that tw term. The hypercube offers two flavours. The dimension-wise algorithm runs log p iterations, each sending m * p / 2 words, and is not optimal:

T = (ts + tw * m * p / 2) * log p

The optimal pairwise-exchange scheme has node i exchange data with i XOR j in step j, for p - 1 steps, each a congestion-free transfer of exactly m words:

T_opt_total_exchange = (ts + tw * m) * (p - 1)

This is asymptotically optimal in message size, but it demands a topology and schedule that avoid link congestion — a hypercube being the natural fit.

Intuition

The dimension-wise scheme reuses cube links and pushes m * p / 2 words log p times (cheaper startup, more bandwidth). The pairwise scheme schedules p - 1 direct, congestion-free swaps of exactly m words each (more startups, minimal bandwidth) — optimal in the data-volume term.

How the MPI and collective topics link together (08 Distributed Memory Systems; 09 Collective Communication)

It helps to see the whole arc as one conceptual chain. Distributed memory forces data placement to be explicit; point-to-point communication is the primitive mechanism for moving that data; blocking, buffering, and nonblocking semantics determine whether the communication is safe and whether it can overlap computation; communicators define which set of processes participates; virtual topologies express the algorithm's logical neighbour structure; collective operations package the recurring communication patterns; and finally cost models tell you which algorithmic formulation is appropriate.

That chain shows up across the catalogue of algorithms. Matrix-vector multiplication needs an all-to-all broadcast (allgather) of the vector partitions when the full input vector is not already shared. Matrix multiplication with 2D block partitioning needs row and column broadcasts or shifts of matrix blocks. Cannon's algorithm uses a 2D torus topology and repeated MPI_Cart_shift-style neighbour exchanges. Odd-even transposition sort uses repeated neighbour compare-split operations, best implemented safely with send-receive exchanges. And numerical integration is just independent local work followed by a reduction.

Distributed sorting problem definition (10 ParallelSorting2)

In distributed-memory sorting, each of the p processing units owns a local array, and the aggregate of all those arrays must end up sorted while remaining distributed. "Sorted" here means two conditions hold at once: each local array is sorted in ascending order, and every element on process P_i is less than or equal to every element on process P_j whenever i < j.

Intuition

"Sorted" in a DMS means two things simultaneously: each rank's local chunk is internally sorted and the ranks are globally ordered (everything on rank 0 ≤ everything on rank 1 ≤ ...). Concatenating the chunks in rank order then yields the full sorted sequence.

If all data initially lives on one process it must first be distributed, typically with a scatter, and if the final sorted data is needed on one process it must be gathered. The deck deliberately excludes this distribution/aggregation cost from the sorting algorithm's own cost.

Odd-even transposition sort in DMS (10 ParallelSorting2)

Odd-even transposition sort is a bubble-sort variant that suits distributed memory because it consists entirely of local, neighbour interactions. There are two cases to distinguish.

When p = n, each process owns a single element, and a compare-exchange between neighbouring processes P_i and P_j has them send their values to each other so the lower-rank process keeps min(a_i, a_j) and the higher-rank process keeps max(a_i, a_j).

When p < n, each process owns n/p elements and the operation becomes a compare-split. Each process first sorts its local block. Then two neighbouring processes exchange their sorted blocks, merge them, and split the result back: the lower-rank process keeps the smaller n/p elements, the higher-rank process keeps the larger n/p. Merging two sorted blocks costs Theta(n/p).

Intuition

Compare-split is the block version of compare-exchange: two neighbours pool their sorted halves into one sorted run of 2n/p, then split it back — the low-rank process keeps the small half, the high-rank process the large half. Implement the exchange with MPI_Sendrecv to stay deadlock-safe, and set edge partners to MPI_PROC_NULL so boundary processes can run the identical loop.

The algorithm runs p phases, alternating which disjoint neighbour pairs interact: odd phases pair one set, even phases pair the shifted set. Because the pairs in a phase are disjoint, all compare-split operations inside that phase run in parallel. An element or block can move at most one process position per phase, which is why p phases suffice for an item at the far right to reach the far left.

For p <= n, the runtime decomposes into three parts: local sorting of each block costs about Theta((n/p) log(n/p)); the compare-split merging over p phases costs p * Theta(n/p) = Theta(n); and the communication, exchanging n/p words with a neighbour in each of p phases, costs:

Theta(p * ts + n * tw)

The deck rolls these together into one parallel-runtime expression:

T_P = Theta((n/p) log(n/p)) + Theta(n) + Theta(p * ts + n * tw)

For p = n the cost is optimal relative to odd-even transposition sort itself, but not optimal compared with asymptotically optimal sequential sorting. It scales poorly: it is cost-optimal only for p = O(log n), and its isoefficiency function is exponential, Theta(p * 2^p).

Intuition

Why is the isoefficiency Theta(p * 2^p)? The overhead carries a p * n term; setting n log n = p * n gives log n = p, i.e. n = 2^p. So to hold efficiency constant you must grow the problem exponentially in p — the textbook signature of a badly scaling algorithm.

The exam reading is that odd-even transposition sort is simple and maps cleanly onto neighbour communication, but it is communication-phase heavy, and its core weakness is that data moves only one process position per phase.

Distributed Shellsort / parallel Shellsort (10 ParallelSorting2)

Parallel Shellsort improves on odd-even transposition sort by prepending an acceleration phase before the local neighbour phases. The idea is to move elements that sit far from their final positions close to the right region in a few large jumps, so that fewer odd-even phases remain afterward.

For p = 2^d the acceleration phase is recursive:

Pair process P_i with process P_{p - i - 1} for i < p/2.
All pairs perform compare-split in parallel.
Split the processes into two groups of size p/2.
Repeat the reverse-order pairing inside each group until groups have size 1.

This takes log p acceleration steps, after which odd-even transposition sort runs — but usually for fewer than p phases. Because the number of required phases now depends on the input distribution, you need a decentralized termination criterion.

Intuition

The acceleration phase pairs each process with its mirror across the array (rank i with rank p-1-i) and compare-splits. This catapults a far-away element across half the machine in one step, instead of inching one position per odd-even phase — so the slow cleanup phase has much less work left.

The termination rule is to stop the odd-even phases once every process leaves its data unchanged in two consecutive phases. Two phases are required because edge processes are active only every second phase. In practice this is implemented decentrally with MPI_Allreduce(..., MPI_LOR, ...) over a per-process "did I swap anything?" flag, so that all ranks agree on when to stop. The runtime is therefore expressed not as the full p phases but as local sort plus acceleration followed by l remaining odd-even phases, where l depends on the data.

The exam reading: parallel Shellsort is odd-even transposition sort with a long-distance compare-split preconditioner. It trades additional structured communication for fewer local-neighbour cleanup phases.

Dense matrix-vector multiplication (11 NumericalAlgorithms)

The problem is to compute y = A * b for a dense n x n matrix A and a vector b. Sequentially, each output element is a dot product:

y_i = row_i(A) dot b

There are n independent dot products, each with n multiply-add pairs, so:

T_S = W = n^2

You have three partitioning choices. Row-wise 1D partitioning is the natural output-data partitioning: each task computes one or more y_i values and needs the corresponding rows of A plus the full vector b. Column-wise 1D partitioning can be useful depending on storage order or a rectangular shape. And 2D partitioning can reduce part of the scalar-product addition through a parallel reduction — cutting it to Theta((n/sqrt(p)) log sqrt(p)) — but the per-PE work is so small that communication and synchronization overhead usually dominate, so the saving rarely pays off here.

Intuition

Mat-vec is output partitioning: assign each output y_i (a row's dot product) to a task. Each task needs its rows of A (which it owns) plus the whole vector b. Whether b is free or costs an allgather is the entire story of the distributed cost.

Memory layout matters too. If A is stored row-major, row-wise traversal exploits cache locality because consecutive row elements are loaded together — so as a rule you should read matrix data in storage order wherever you can.

In the case p = n, each process computes one output element with Theta(n) work. In shared memory the vector b can be read by all tasks without synchronization because it is read-only; in distributed memory b must somehow be present on every process, and if it was partitioned alongside A, the processes must exchange their b partitions using an all-to-all broadcast (allgather).

In the case p < n, you assign at most q = ceil(n/p) rows per process, and each computes q dot products. With a shared b the runtime is simply:

T_P = Theta(q * n) = Theta(n^2 / p)

With a distributed b you add the allgather cost, and the deck gives:

T_overhead = ts * log p + tw * q * (p - 1)
T_P = q * n + ts * log p + tw * q * (p - 1)
    = n^2 / p + ts * log p + O(tw * n)

This is a clean illustration of computation plus collective-communication cost, and it is cost-optimal for p = O(n).

Dense matrix multiplication: shared memory and distributed memory (11 NumericalAlgorithms)

The problem is C = A * B for dense square n x n matrices, where conventional sequential multiplication performs n^3 multiply-add work. Cache behaviour is decisive here: row-major storage makes row access cheap, but column access into B has poor locality, so the improved serial loop order processes whole rows of C while iterating through B's rows, improving cache use without ever explicitly transposing B.

Intuition

Loop order can matter ~30x. The naive i,j,k order strides B down a column (a cache miss every step). Swapping to i,k,j order makes the inner j loop stream B and C contiguously while A[i][k] stays constant across the row — the same flops, dramatically fewer cache misses.

The shared-memory/OpenMP formulation is pleasantly simple: parallelize the outer loop over rows of C, so each worker computes a different set of rows. Coarse granularity keeps overhead low, with p < n giving each PE about n/p rows, and because rows are independent no synchronization is needed between workers. The runtime is:

T_P = n^3 / p

which is cost-optimal relative to the conventional n^3 sequential algorithm.

The distributed-memory/MPI formulation starts from a different premise: for very large matrices you assume the matrices are already distributed, both because one node cannot store them and because the outputs of earlier distributed operations often become the inputs to later ones. The deck uses 2D block partitioning, replacing scalar entries by q x q blocks:

C(i,j,q) = sum_k A(i,k,q) * B(k,j,q)

Each target block of C needs all blocks from the corresponding row of A and column of B. If there are p processes equal to the number of blocks of C, then p = (n/q)^2, hence q = n / sqrt(p). The computation per process is:

Theta(n^3 / p)

and the communication for the all-to-all broadcast of row/column blocks on a hypercube is:

ts * log sqrt(p) + tw * q^2 * (sqrt(p) - 1)

giving the overall deck result:

T_P = n^3 / p + ts * log sqrt(p) + tw * n^2 / sqrt(p)

with parallel cost:

p * T_P = n^3 + ts * p * log sqrt(p) + tw * n^2 * sqrt(p)

The algorithm is cost-optimal for p = O(n^2). The catch is memory: a simple row/column broadcast approach can require each node to hold many blocks at once, around:

M = 2 * n^2 / sqrt(p)

per node for the communicated A/B blocks, so total storage becomes non-optimal for large p.

Intuition

The simple block-broadcast algorithm is fast but memory-hungry: each node must hold an entire block-row of A and block-column of B at once (2 n^2 / sqrt(p) per node). Cannon's algorithm below fixes exactly this by streaming one block-pair at a time.

Cannon's algorithm (08 Distributed Memory Systems; 11 NumericalAlgorithms)

Cannon's algorithm is the memory-efficient distributed matrix multiplication that both decks emphasize. It assumes a logical 2D torus of sqrt(p) x sqrt(p) processes, where process P_{i,j} initially owns blocks A_{i,j} and B_{i,j} and is responsible for computing C_{i,j}.

The core idea is to never store an entire row of A and column of B on a single process. Instead you schedule block rotations so that each process sees the pairs it needs one at a time — at any instant a process holds only one block of A, one block of B, and its accumulated block of C. Getting this started requires an initial alignment: shift row i of A left by i positions with wraparound, and shift column j of B up by j positions with wraparound. After that, the compute-and-shift phase repeats:

Each process multiplies its local A and B blocks and accumulates into local C.
Shift every A block one step left.
Shift every B block one step up.
Repeat for sqrt(p) block multiplications.

Intuition

After the skew alignment, every process holds an A/B block pair whose k-index matches, so it can multiply-add immediately. One left-shift of A and one up-shift of B brings in the next matching pair. After sqrt(p) such rotations, every process has seen all k blocks it needs — using only one block of each matrix in memory at a time.

On the communication side, the logical topology is a torus where each node has four direct logical neighbours, and a single horizontal or vertical circular shift can be done by simultaneous send/receive on all involved processes. One shift of a block of m words costs ts + m * tw, and since the block size is q x q we have m = q^2 = n^2 / p. Summing over both matrices, the compute-and-shift communication is:

2 * (ts + tw * n^2 / p) * sqrt(p)

The initial alignment can often be ignored asymptotically against this repeated compute-and-shift communication for large enough p. The payoff is memory: total memory over all processes is Theta(n^2) (per node Theta(n^2 / p)), which is the key improvement over the simple broadcast-based scheme where each node grows to 2 n^2 / sqrt(p). The full parallel runtime is T_P = n^3 / p + 2 * sqrt(p) * (ts + tw * n^2 / p), with isoefficiency Theta(p^1.5).

Exam

FS21 — Cannon vs DNS decision rule. DNS is the alternative dense matmul that uses p = n^3 processing elements (one per scalar multiply) plus a log p reduction; its isoefficiency is Theta(p * log^3 p). To choose, convert isoefficiency with W = n^3: Cannon needs n^3 >= p^1.5, i.e. n >= sqrt(p); DNS needs n^3 >= p log^3 p, i.e. n >= p^(1/3) * log p. For p = 1024: Cannon needs n >= 32, DNS needs n ≈ 101. So at a modest n = 50 you pick Cannon. DNS scales better asymptotically but needs far more PEs and memory; for small n or fixed p the lower-overhead Cannon wins.

Watch out

Cannon requires a square number of processes (4, 9, 16, ...) so that the sqrt(p) x sqrt(p) grid exists, and sqrt(p) must divide n so the blocks are equal-sized.

The MPI connection is direct: Cannon is naturally implemented with Cartesian communicators, periodic dimensions, coordinate/rank conversion, and shifts — concretely MPI_Cart_create with periods={1,1}, MPI_Cart_shift, and MPI_Sendrecv_replace. Non-blocking Isend/Irecv into double buffers can overlap the block multiply with the shift, but, as noted earlier, this only helps across multiple physical nodes; on one host MPI uses shared memory, so there is nothing to overlap.

GPU/SYCL matrix multiplication and local memory (11 NumericalAlgorithms)

The SYCL section is not distributed-memory MPI, but it reinforces exactly the same block-reuse principle as Cannon's algorithm. For GPU matrix multiplication you use p = n^2 logical work-items, with one work-item computing one C[i,j] via a dot product over k. The simple version reads A[i,k] and B[k,j] straight from global memory, which generates a flood of global-memory reads:

T_P = Theta(n * t_g)

where t_g is the relative global-memory access time. The optimized version overlays a workgroup/block structure so that the threads in a workgroup cooperate:

Load one block of A and one block of B from global memory into local memory.
Synchronize with a local barrier.
Multiply using fast local memory.
Synchronize before loading the next block.

This is exactly Cannon's "use one pair of blocks at a time" idea, only now it lives inside a GPU compute unit rather than across MPI processes.

Intuition

Global memory is slow (t_g), local memory is fast (t_l ≈ L1). The work-group cooperatively copies a tile global→local once, barriers so everyone sees it, then every work-item reuses it many times from fast local memory. Each input word is read from global memory once instead of n times. The barriers are mandatory: one after loading (don't read before all data is staged), one before reloading (don't overwrite a tile others are still using).

With block size q, local-memory access time t_l, and global-memory access time t_g, the deck gives:

T_P = Theta((n/q) * t_g + n * t_l)

so when q is large enough relative to the global/local memory speed ratio, local-memory tiling substantially improves runtime.

Intuition

Tiling pays off because matmul has high arithmetic intensity — Theta(n^3) flops over Theta(n^2) data means each element is reused ~n times. By contrast vector-add (O(1) reuse) is bandwidth-bound and barely benefits from the GPU or from tiling.

Parallel numerical integration (11 NumericalAlgorithms)

Numerical integration approximates:

integral_a^b f(x) dx

by subdividing [a,b] into n intervals of width:

Delta x = (b - a) / n

The deck introduces two rules. The midpoint/rectangle rule approximates each interval by a constant value taken at the midpoint:

integral_{x_i}^{x_{i+1}} f(x) dx
  approx Delta x * f((x_i + x_{i+1}) / 2)

The trapezoidal rule approximates each interval by a line through the endpoint values:

integral_{x_i}^{x_{i+1}} f(x) dx
  approx Delta x * (f(x_i) + f(x_{i+1})) / 2

Although the trapezoidal rule appears to need two function evaluations per interval, neighbouring intervals share their endpoint evaluations, so the average cost is about one evaluation per interval. The composite form makes this explicit by giving the two outer endpoints half weight: T = Delta x * (f(x_0)/2 + f(x_n)/2 + sum_i f(x_i)).

Parallelization could hardly be cleaner. Each interval's contribution is independent, so processes compute local partial sums and a single reduction combines those partial sums into the final integral estimate (MPI_Reduce with MPI_SUM).

Intuition

This is the textbook embarrassingly parallel pattern — slice the domain into independent strips, each rank sums its own strips with zero communication, and a single final reduction stitches the partial sums into the answer.

When p = n, one process computes one interval's contribution and the final result still requires the reduction. When p <= n, each process computes about ceil(n/p) intervals sequentially; choose n as a multiple of p where possible. Because every interval costs the same amount of work, a static mapping is fine and block and cyclic distributions balance the load equally. The two rules differ only in a subtlety: for the rectangle rule, a cyclic distribution balances the slightly unequal interval counts when n is not a multiple of p, whereas for the trapezoidal rule, block partitioning is preferred so that neighbouring intervals can reuse their endpoint evaluations — which is only possible when they are assigned to the same process.

Worked example

integral_0^1 1/(1+x^2) dx = arctan(1) = pi/4, so multiplying the estimate by 4 recovers pi — the standard exam test integral. The error shrinks like 1/n^2, so 10x more intervals gives roughly 100x smaller error.

The exam link is that numerical integration is the cleanest example of "embarrassingly parallel local work plus reduction." Matrix-vector multiplication adds an allgather-like dependency on b; matrix multiplication adds repeated structured communication; sorting adds iterative neighbour exchanges.

Exam-oriented comparison of communication patterns across algorithms

Use this table to connect an algorithm's structure to the collective or point-to-point pattern it relies on:

Problem	Decomposition	Communication pattern	Main cost issue
MPI greetings / rank output (08 Distributed Memory Systems)	Rank-specific branch	Many-to-one sends to rank 0	Output nondeterminism if everyone writes
Dense matrix-vector (11 NumericalAlgorithms)	Rows of `A`, elements of `y`	Allgather/all-to-all broadcast of vector `b` partitions	Communication `O(tw * n)` plus startup
Dense matrix multiplication, simple MPI (11 NumericalAlgorithms)	2D blocks	Row/column all-to-all broadcasts	Memory overhead from storing many blocks
Cannon's matrix multiplication (08 Distributed Memory Systems; 11 NumericalAlgorithms)	2D torus blocks	Neighbor circular shifts	Repeated block shifts, but optimal memory; iso `Theta(p^1.5)`
Numerical integration (11 NumericalAlgorithms)	Intervals	Reduction of partial sums	Reduction overhead after independent work
Odd-even transposition sort (10 ParallelSorting2)	Local sorted blocks	Neighbor compare-split over `p` phases	Too many phases, poor scalability (iso `Theta(p * 2^p)`)
Parallel Shellsort (10 ParallelSorting2)	Local sorted blocks	Long-distance compare-split acceleration, then neighbor cleanup	Data-dependent number of cleanup phases
Broadcast/reduction (09 Collective Communication)	Whole communicator tree/hypercube	Recursive doubling	`log p` startup terms
Total exchange (09 Collective Communication)	Distinct message per pair	Pairwise exchange or ring forwarding	High data volume, congestion avoidance

And here are the formulas worth committing to memory, gathered in one place:

Point-to-point simplified:       T = ts + m * tw
Hypercube broadcast/reduce:      T = min((ts + m*tw)*log p, 2*(ts*log p + m*tw))
Hypercube allgather:             T = ts * log p + m * tw * (p - 1)
Hypercube allreduce:             T = (ts + m * tw) * log p
Hypercube prefix sum (scan):     T = (ts + m * tw + 2 * t_add) * log p
Scatter/gather:                  T = ts * log p + m * tw * (p - 1)
Ring all-to-all bcast:           T = (ts + m * tw) * (p - 1)   [bidirectional: floor(p/2) rounds]
Mesh all-to-all bcast:           T = 2 * ts * (sqrt(p) - 1) + m * tw * (p - 1)
Ring total exchange:             T = ts * (p - 1) + tw * m * p * (p - 1) / 2
Optimal total exchange:          T = (ts + m * tw) * (p - 1)
Matrix-vector with dist. b:      T = n^2/p + ts * log p + O(tw * n)
Block matrix multiplication:     T = n^3/p + ts * log sqrt(p) + tw * n^2/sqrt(p)
Cannon shift communication:      T_comm = 2 * (ts + tw * n^2/p) * sqrt(p)   [iso Theta(p^1.5)]
Odd-even communication:          T_comm = Theta(p * ts + n * tw)

The recurring exam principle is to separate computation, startup overhead, bandwidth/data-volume overhead, and congestion/topology effects. The very same sequential work can lead to wildly different parallel efficiency depending on whether the communication is a broadcast, an allgather, a reduction, a neighbour shift, or a total exchange.

Part IV - Heterogeneous Systems, SYCL/OpenCL, Image Processing, and Course Conclusion

Why Heterogeneous Systems Matter (Deck: 12 Heterogeneous Systems)

Performance no longer climbs mainly by cranking up the clock. Smaller CMOS circuits do let you run at higher frequency, pack in more cores, and grow the caches, but frequency scaling hits a wall: power and heat grow nonlinearly, and pushing the voltage down any further makes transistors switch incorrectly. So modern performance comes from two levers instead — parallelism and specialization. More cores only help if your program actually exposes enough parallel work to fill them. Heterogeneous systems push further still, improving both performance and energy efficiency by routing each class of workload to the architecture that runs it well. The blunt truth underneath all of this is that no single architecture is best for all workloads.

Intuition

The free lunch from clock speed is over (power and heat scale super-linearly with frequency). The only way left to go faster is to do many things at once and to send each kind of work to the chip that is built for it. That is the whole reason heterogeneous systems (CPU + GPU + FPGA + ...) exist.

Once you accept that one chip cannot win everywhere, the natural next question is which chip wins for which kind of work. The course groups workloads into three classes and matches each to the hardware built for it:

Workload class	Examples	Best-fit hardware
Control intensive	searching, sorting, parsing	superscalar CPUs with strong branch prediction, caches, latency-oriented execution
Data intensive	image processing, simulations, modeling, data mining	vector/SIMD architectures such as GPUs
Compute intensive	iterative numerical methods, financial modeling	high arithmetic throughput plus enough parallelism to keep many execution units busy

Real applications almost never sit cleanly in one box; they mix all three classes, which is exactly why practical machines combine CPUs, GPUs, DSPs, FPGAs, APUs, and multicore processors.

Intuition

Lots of branches/irregular work goes to a CPU (latency-optimized, big caches, branch prediction). The same operation over a huge regular array goes to a GPU (throughput-optimized SIMD). Custom dataflow/spatial pipelines go to an FPGA. The exam wants you to justify which device fits a given workload class.

The edge-detection and matrix-multiplication motivation slides hammer home the same practical lesson: GPU acceleration can be enormous, but only when the algorithm exposes enough data parallelism and uses the GPU memory hierarchy well. For matrix multiplication specifically, the deck distinguishes three flavours — plain GPU/SYCL, GPU with vectorization, and GPU with local memory — and the best numbers come from the last one, where computation is tiled and reused out of fast local memory.

Intuition

The deciding factor is arithmetic intensity = FLOPs per byte moved. Vector add does Theta(n) work over Theta(n) data, so intensity is O(1) — it is bandwidth-bound and the GPU barely beats a CPU. Matrix multiply does Theta(n^3) work over Theta(n^2) data, so intensity is Theta(n) (each element reused n times) — the GPU wins big once you tile to actually exploit that reuse.

Device and GPU Architecture (Deck: 12 Heterogeneous Systems)

A handful of architectural styles recur throughout the course. SIMD/vector processing applies one instruction to many data elements at once, which is the natural model for arrays, images, dense matrices, and numerical kernels. Hardware multithreading keeps multiple instruction streams available so the processor can hide stalls: simultaneous multithreading (SMT) interleaves instructions from several threads on the same execution resources, while temporal multithreading switches among threads over time, for example to cover cache misses. Multicore architectures place several mostly independent cores that share data through a memory system, usually kept consistent with cache coherence. Finally, SoC/APU designs integrate CPUs, mobile GPUs, memory controllers, and media/wireless blocks into compact, low-power systems.

The GPU is the centerpiece of this part of the course. A GPU is a heavily multithreaded device with hardware task management; it was designed for graphics pipelines, but that same structure happens to map beautifully onto throughput-oriented data-parallel computation. Structurally, a GPU is built from compute units — called streaming multiprocessors (SMs) in NVIDIA terminology and compute units (CUs) in AMD terminology. Each compute unit packs processing elements, thread schedulers/dispatchers, local memory, an instruction cache, texture memory, and caches. The whole design philosophy emphasizes massive parallelism, high memory bandwidth, and latency hiding, rather than low latency for any single scalar thread.

Intuition

A GPU does not try to make one thread fast. It keeps thousands of work-items in flight so that whenever some are stalled waiting on memory, others are ready to run. Throughput hides latency instead of avoiding it — the opposite philosophy from a CPU core.

The hardware examples in the deck illustrate the scale: modern GPUs provide many thousands of FP32 cores, hundreds of compute units/SMs, and very high-bandwidth memory. The exam-relevant interpretation is that peak FLOP/s is only reachable if the kernel has enough parallelism, regular control flow, coalesced memory access, and sufficient arithmetic intensity — peak is a ceiling you earn, not a number you get for free.

Programming Models for Heterogeneous Systems (Deck: 12 Heterogeneous Systems)

Several programming models compete for these devices, and it helps to see them side by side rather than as a list of names:

Model	What it is
CUDA	NVIDIA platform and programming model for NVIDIA GPUs
OpenCL	Khronos standard: C99-based language plus runtime/API for heterogeneous platforms; separates host code from device kernels
SYCL	Khronos single-source C++ model; host and device code in one program. SYCL 1.2.1 built on OpenCL; SYCL 2020 generalizes to arbitrary backends
DPC++	Intel oneAPI compiler/implementation extending SYCL 2020, targeting CPUs, GPUs, FPGAs, and other accelerators
HIP	C++ runtime API/kernel language for portable AMD/NVIDIA GPU code, a thin CUDA-like layer

Exam

Memorize the terminology bridge, because it makes everything click: work-item ≈ CUDA thread, work-group ≈ CUDA thread block, compute unit ≈ SM, processing element ≈ CUDA core. OpenCL and SYCL use the vendor-neutral names; CUDA uses the NVIDIA names for the exact same concepts.

SYCL's platform model is worth keeping straight. The host is the CPU-side program. Backends are the low-level APIs underneath, such as OpenCL or CUDA. Devices are the things that actually run kernels — CPU, Intel GPU, NVIDIA GPU, FPGA, and so on. A SYCL program submits work through queues to those devices, and the selected backend maps SYCL constructs down onto the device implementation. The reason the course teaches OpenCL alongside SYCL is that OpenCL explains what happens "behind the scenes": kernels, queues, buffers, work-items, work-groups, memory movement, and synchronization all still exist under SYCL — SYCL simply raises the abstraction level.

Intuition

SYCL is OpenCL with the boilerplate hidden and full C++ on top. Everything OpenCL does explicitly (build a program, move buffers, set args) still happens — SYCL just lets the runtime do it for you from one single-source .cpp file.

OpenCL Platform, Execution, and Memory Models (Decks: 12 Heterogeneous Systems; 12 OpenCL by Ofer Rosenberg)

The OpenCL platform model is a strict hierarchy. A host connects to one or more OpenCL devices; a device is a collection of compute units; a compute unit contains one or more processing elements; and those processing elements execute code in SIMD/SPMD style. In practice, the multiple cores on a CPU or GPU are usually exposed as a single OpenCL device, and OpenCL distributes work-items across that device.

Intuition

The hierarchy in one breath: host → device → compute units (SM/CU) → processing elements. Work flows down: you launch a kernel from the host onto a device, the device splits it into work-groups (one per compute unit), and each work-group runs its work-items on the processing elements in SIMD lockstep.

The execution model centers on the kernel, the basic executable unit — similar to a C function, but executed by many work-items at once. A program is a collection of kernels and helper functions, analogous to a dynamic library. The application enqueues kernel execution instances into command queues, which may be in-order or out-of-order. Crucially, kernel execution is asynchronous: enqueueing a kernel does not mean it has run yet.

Intuition

Enqueueing a kernel is like dropping a job in an outbox, not running it. The host call returns immediately; the kernel runs later on the device. This is why you need events / blocking reads / queue waits to know when results are actually ready.

Data parallelism is expressed by defining an N-dimensional computation domain, normally 1D, 2D, or 3D. Each point in that domain is a work-item, and the total domain is the global range, or NDRange. Work-items are then grouped into work-groups using a local range. The payoff of grouping is that work-items in one work-group execute together on a compute unit, can share local memory, and can synchronize with barriers. The catch is that work-items in different work-groups cannot directly synchronize inside one kernel — so you design kernels so that global work-items are independent, or split dependent phases into multiple kernel launches.

Watch out

The one synchronization rule that matters: you can sync within a work-group (barrier), but never across work-groups inside a single kernel. If two pieces of work must see each other's results and they live in different groups, you must end the kernel and launch a new one — the kernel boundary is the only global barrier.

OpenCL also supports task parallelism, where a task kernel executes as a single work-item. That is useful for coarse tasks or device-side operations that are not naturally data-parallel, but be aware that essentially all the GPU speedups in the decks come from data-parallel kernels, not task kernels.

The memory model is a hierarchy of scope and speed. Host memory lives on the CPU side. Global memory is device memory accessible by all work-items on the device. Constant memory is read-only device memory optimized for broadcast/constant data. Local memory is shared by the work-items in one work-group, faster than global memory, and invisible to other work-groups. Private memory is per work-item, conceptually the fastest, and used for scalars and temporaries.

Memory space	Scope	Speed
Private	one work-item (registers)	fastest
Local	one work-group (shared/scratch)	fast, ≈ L1
Constant	whole device, read-only	optimized for broadcast
Global	whole device	slow
Host	CPU side	separate, crosses the bus

Intuition

The memory hierarchy is a speed-vs-scope tradeoff: private (registers, 1 work-item, fastest) → local (shared/scratch, 1 work-group, fast ≈ L1) → global/constant (whole device, slowest). The smaller the audience a memory serves, the faster it is. Good kernels pull data up this ladder (global → local → private) and reuse it before pushing results back down.

The key rule to carry away is that memory management is explicit in OpenCL. Data must be moved from host memory to device/global memory, often copied into local memory for reuse, and copied back when the results are needed.

OpenCL Host-Side Workflow and Objects (Decks: 12 Heterogeneous Systems; 12 OpenCL by Ofer Rosenberg)

Every OpenCL program performs roughly the same host-side setup ceremony:

Discover platforms and devices.
Choose a device, often after querying properties such as number of compute units, clock frequency, global memory size, single/double precision support, and extensions.
Create a context. Devices can share memory objects only if they are in the same context.
Create one or more command queues. Each device needs a queue for submitted work.
Create memory objects: buffers for unstructured memory blocks, images for formatted 2D/3D image data.
Load/build the program, then create kernel objects.
Set kernel arguments.
Enqueue memory transfers and kernel launches.
Synchronize using blocking reads, queue waits, or events.
Read results back to host memory if needed.

Intuition

Every OpenCL program repeats this same long setup dance — discover, context, queue, buffers, build program, kernel, set args, enqueue, sync, read back. SYCL collapses most of these steps into "create a queue, wrap data in buffers, submit a command group." When you see the 10-step list, picture all the boilerplate that SYCL's single-source model removes.

The objects you juggle along the way each have a clear job. Devices are the CPUs, GPUs, DSPs, and so on. Contexts are collections of devices and shared objects. Queues are command streams for a device. Buffers are blocks of memory, accessible as arrays, pointers, or structs inside kernels, while images are opaque formatted 2D/3D data reachable only through image read/write functions. Programs are collections of kernels built for devices, and a kernel is an executable function plus its argument bindings. Finally, events are the synchronization and profiling objects.

OpenCL compiles kernels dynamically. It uses a runtime compilation model much like graphics APIs: code may be compiled to an intermediate representation and then to device-specific machine code at runtime. That flexibility has a cost — compiling programs can be expensive — so you should reuse programs or precompiled binaries whenever you can.

Intuition

OpenCL compiles kernels at runtime so one binary can target whatever GPU is actually present. The price is a slow first build — so cache/reuse compiled programs instead of rebuilding each launch.

Synchronization follows from the queue structure. A single in-order queue naturally serializes the commands submitted to it. The moment you have multiple queues or multiple devices, you need explicit event dependencies. Events are versatile: they can be waited on, inserted as markers, used in command wait lists, queried for status, and used for profiling timestamps.

This brings the performance costs into focus. Program compilation can be expensive. Host-device transfers can dominate on discrete GPUs, for example over PCIe. Kernel launch overhead is nontrivial, so each launch should do enough work to be worth it. And events themselves can be expensive, so use them only where dependencies or profiling genuinely require them.

Watch out

There are four hidden taxes: compile, transfer (PCIe), launch overhead, and events. None of these show up in your FLOP count, but all of them can dominate a small or chatty kernel. The fix is always the same: do more work per launch, keep data on the device, and don't over-synchronize.

OpenCL C Language and Kernel-Level Rules (Deck: 12 OpenCL by Ofer Rosenberg)

OpenCL C is derived from C99 but deliberately leaves several features out: there are no standard C99 headers, no function pointers, no recursion, no variable-length arrays, and no bit fields.

Intuition

GPU hardware has no per-thread call stack and no dynamic dispatch. No recursion / no function pointers / no VLAs all follow from that — everything must be statically resolvable so thousands of work-items can run the same straight-line code in lockstep.

In exchange, OpenCL C adds the machinery you need for data-parallel work. There are work-item and work-group built-ins such as get_global_id, get_local_id, get_group_id, get_global_size, get_local_size, and get_num_groups. There are vector types such as char2, ushort4, int8, float16, and double2. There are address space qualifiers — global, local, constant, and the default private storage — plus synchronization primitives such as barriers, image types and image access functions, and a broad library of built-in math, integer, common, relational, geometric, and vector load/store functions.

Intuition

Every work-item runs the same kernel source; the only thing that distinguishes them is their index. get_global_id(0) answers "which element of the array is mine?" — it is the GPU equivalent of the loop counter i in a serial for loop.

The address-space rules are strict and worth memorizing. Kernel pointer arguments must use global, local, or constant, and they can never be pointers to private memory. Local variables default to private memory, while program-scope variables must live in the constant address space. Casting between address spaces is undefined behavior. And image2d_t and image3d_t are always in the global address space, accessed only through image functions.

The vector and conversion rules follow the same flavour of explicitness. Vector lengths are commonly 2, 4, 8, and 16; vector operations are portable and aligned at the vector length, and components can be accessed and swizzled. OpenCL refuses to do ambiguous implicit vector conversions, so you use explicit convert_type functions, which can specify saturation and rounding modes — for example, saturate to [0,255] and round-to-even. Separately, as_type reinterprets the bits without numerical conversion and therefore requires same-sized types.

Intuition

convert_uchar_sat(x) computes a new value (clamping to [0,255] — exactly what image pixels need); as_int(x) reinterprets the same bits as another type without any arithmetic. One changes the number, the other changes the label on the bits.

On precision, OpenCL defines floating-point accuracy requirements, and common single-precision math functions may come in full, half, or native precision variants. Full precision is the most accurate; native variants may be the fastest but are implementation-defined in accuracy — so choosing a precision is a deliberate performance/accuracy tradeoff.

Barriers come with their own contract: a barrier must be encountered by all work-items in the work-group with compatible arguments, which makes conditional barriers illegal if some work-items in the group would skip them. For moving data, async_work_group_copy can copy between global and local memory, and wait_group_events waits for those copies to finish.

Watch out

The classic barrier bug: putting a barrier inside an if that only some work-items take will hang the work-group forever — the ones that skipped it never arrive, so the ones that reached it wait eternally. Barriers must be on a path every work-item in the group executes.

When something goes wrong, the debugging advice is consistent: start on the CPU if you can, and check all global/local index bounds explicitly when diagnosing GPU crashes. Remember that GPUs may not be preemptively scheduled like CPUs, so avoid kernels that monopolize the device unexpectedly. Use extra output buffers or images to record intermediate values, and lean on context callbacks and event status/profiling to surface API errors and timing.

SYCL Core Concepts (Deck: 12 Heterogeneous Systems)

SYCL is single-source standard C++ for heterogeneous programming. It expresses kernels and data movement with ordinary C++ constructs while preserving the OpenCL-like execution model underneath. The classes you will actually touch are these. A queue schedules kernels on a SYCL device; work is submitted as command groups, and wait or wait_and_throw blocks for submitted work. A buffer<T,D> is a D-dimensional shared data object of type T, usable by both host and kernels. An accessor is how kernels and host code reach buffers, images, or local memory; accessors also express memory requirements and dependencies, which is what lets the runtime schedule kernels safely. A range<D> is a 1D/2D/3D extent for a global iteration space or work-group size, an id<D> is a D-dimensional index usable for accessor indexing, and an item<D> identifies one function instance in a simple parallel_for. When you need work-groups, an nd_range<D> combines a global range with a local work-group range, and the matching nd_item<D> identifies one work-item within that nd_range, giving you global id, local id, group id, barriers, and memory fences.

Intuition

In SYCL you never memcpy to the device by hand. You wrap data in a buffer and declare how each kernel touches it with an accessor (read_only / write_only). The runtime reads those declarations, figures out the data dependencies between kernels, and moves data only when needed. Accessors are both "device view of the data" and "dependency contract."

Intuition

Use a plain parallel_for with item when work-items are fully independent (e.g. vector add). Reach for nd_range + nd_item only when you need work-groups — i.e. when work-items must share local memory and synchronize with a barrier (e.g. tiled matmul, image filter with halo).

The basic SYCL vector addition shows the whole flow in miniature. You create host vectors, create a queue, and wrap the data in buffers. Then you submit a command group, create read/write accessors inside it, and run a parallel_for in which each work-item computes one output element.

Intuition

Vector add is underwhelming on a GPU: one add but three words moved (read a, read b, write c) — purely bandwidth-bound with zero reuse, and the whole array has to cross PCIe. For small n the fixed launch + transfer cost dominates, so the GPU can be slower than the serial CPU. float4 does not help because the limit is data transfer, not the ALU.

When you do need work-groups, you switch from a plain range to an nd_range: the global range gives the total number of work-items, the local range gives the work-group size, and the kernel now receives an nd_item, which unlocks local IDs, group IDs, barriers, and local memory.

SYCL also supports vectorization directly. It provides vector types for integer and floating-point values with lengths such as 2, 3, 4, 8, and 16, and vector operations spanning math, integer, common, relational, and geometric functions. The point is that one work-item can process multiple data elements using wide registers/ALUs, which reduces overhead and increases arithmetic throughput — provided the memory access patterns stay efficient.

Local Memory and Matrix Multiplication (Decks: 12 Heterogeneous Systems; 12 from OpenCL to Data Parallel C++)

The single most important optimization idea in this whole part is data reuse. Global memory has high latency and limited effective bandwidth, while local memory is much faster but only visible inside a work-group. So a good GPU kernel loads a tile of data into local memory, synchronizes, reuses it many times, and only then moves on to the next tile.

Intuition

Instead of every work-item fetching its operands from slow global memory over and over, the work-group cooperatively copies one block into fast local memory once, hits a barrier so everyone sees the full block, then reuses it many times. Read once globally, reuse many times locally — that is where the GPU speedup comes from.

Tiled matrix multiplication C = A * B makes this concrete. The procedure is:

Use output data partitioning: each work-item computes one element (or a small block) of C.
Use block partitioning with BlockSize x BlockSize work-groups, so each work-group owns one tile of the output matrix.
For each tile index k, copy one tile of A and one tile of B from global memory into local memory.
Hit a barrier so all work-items see the fully loaded local tiles.
Each work-item accumulates its partial result in private memory, e.g. Cij += locA[local_row][s] * locB[s][local_col].
Hit another barrier before overwriting local memory with the next tiles.
Write the final accumulated value back to global memory.

Watch out

There are two barriers per tile, and both matter. The first barrier comes after loading the tile (don't compute on a half-loaded tile); the second comes after computing, before the next iteration overwrites local memory (don't clobber data others are still reading). Forgetting either one gives wrong results.

The SYCL matmul deck gives a performance model. Let t_g be the global-memory access time, t_l the local-memory access time, q = BlockSize the tile/work-group side, and p = n^2 the number of work-items. The two cases compare directly:

Variant	Runtime	Why
Simple (no local memory)	`T_P = Theta(n * t_g)`	every one of the `n` multiply steps touches slow global memory
Tiled (local memory, analogous to Cannon)	`T_P = Theta((n/q) * t_g + n * t_l)`	global accesses drop by a factor of `q`, replaced by cheap local accesses

Intuition

Tiling trades n slow global reads for n/q slow reads plus n fast local reads. Since t_l << t_g, raising the tile size q directly shrinks the dominant global-memory term.

The reason this helps is straightforward. Without tiling, the same A and B elements are loaded from global memory again and again. With tiling, each tile element is loaded once from global memory and then reused many times from local memory, which raises arithmetic intensity and cuts the stalls caused by global-memory latency.

Choosing BlockSize is a constrained optimization. The global range should be an integer multiple of the work-group size, or the kernel has to handle boundary cases explicitly. Device work-group range limits cap the maximum local range. Register usage per work-item limits occupancy, and local memory capacity per compute unit limits how big a tile can be. The best size is therefore the largest useful size that satisfies all of these constraints, with divisibility and occupancy both pulling on the answer.

Worked example

For a 2000 x 2000 integer matrix on a GTX 1080, the deck selects BlockSize = 25: it stays below the register-derived limit and it divides 2000 evenly, so there are no ragged borders. The general rule is to pick the largest tile (more reuse, fewer global trips) that simultaneously (a) divides the matrix dimension, (b) stays under the device's max work-group size, and (c) does not blow the register/local-memory budget. 16 is a common safe default.

The payoff is large. Tiled matmul reaches roughly ~50x vs serial / ~6x vs OMP on a GTX 1080, and ~100x vs serial / ~10x vs OMP on an RTX 3080. This is the reward of high arithmetic intensity (Theta(n^3) work over Theta(n^2) data) combined with local-memory reuse — and it is exactly the opposite of vector add, where the GPU barely helps.

The DPC++ deck adds a cautionary profiling case study. A naive DPC++ GPU matrix multiplication showed high apparent GPU utilization, yet its execution units were stalling. VTune GPU Hotspots revealed that shared local memory was not being used effectively. Adding cache blocking plus local-memory tiling with a 16 x 16 tile improved elapsed time from about 2.017 s to 1.22 s, roughly 1.64x. Roofline analysis then showed whether the remaining limit was compute throughput or a memory subsystem such as L3/LLC/global transfer. The exam takeaway is blunt: high GPU utilization alone is not proof of efficient execution — use profiling to separate useful arithmetic from stalls and memory bottlenecks.

Watch out

"GPU is 95% busy" can mean "95% busy stalling on memory." Utilization measures occupancy, not useful work. Only a tool like VTune/Roofline tells you whether those cycles are doing arithmetic or waiting — which is why the fix here (local-memory tiling) raised performance even though utilization already looked high.

OpenCL Performance Rules (Deck: 12 OpenCL by Ofer Rosenberg)

The core tuning rules all push in the same direction: feed the device enough independent, regular, local work. Use large global work sizes (the deck suggests thousands of work-items rather than tiny launches) to hide memory latency and amortize overhead. Keep the work-items in a work-group on similar control-flow paths, because divergence wrecks SIMD/SIMT efficiency. Reuse data through local memory wherever possible. Access memory sequentially across neighboring work-items so the hardware can coalesce transactions and hit high bandwidth. Keep data on the device across multiple kernels, especially for discrete GPUs. Do enough work per kernel launch to pay back the launch overhead. And trade precision for performance only when the numerical requirements actually allow it.

Intuition

The work-items in a group share one instruction fetch and run in SIMD lockstep. If they take different branches of an if, the hardware runs both sides and masks off the inactive lanes — so a divergent branch can halve (or worse) throughput. Keep control flow uniform across a group.

Memory coalescing deserves its own emphasis. When consecutive work-items access consecutive memory addresses, the hardware can combine the memory transactions and dramatically improve bandwidth. The flip side is that poor stride patterns, scattered access, or divergence can destroy effective bandwidth even when the device's theoretical peak bandwidth is enormous.

Intuition

If work-items 0,1,2,3 read addresses 0,1,2,3 (neighbor items → neighbor addresses), the hardware fuses them into one wide memory transaction. If instead they read scattered or strided addresses, it issues many narrow transactions and effective bandwidth collapses — even on a GPU with huge peak bandwidth. Lay out data so adjacent work-items touch adjacent memory.

OpenCL to Data Parallel C++ Migration (Deck: 12 from OpenCL to Data Parallel C++)

The GPU-Quicksort article is the course's worked example of incremental migration from OpenCL 1.2 to DPC++/SYCL. The motivation is a clean comparison between what OpenCL gives you and what it costs you, set against what DPC++ adds:

OpenCL: strengths and limitations	DPC++/SYCL: benefits
Strength: portable, high-performance heterogeneous programming	Single source for host and device code
Strength: explicit control over runtime, host API, device kernels, and memory	Standard C++ templates and template metaprogramming for device code
Limitation: host code and device code are separate	Targets CPUs, GPUs, FPGAs, and other accelerators
Limitation: OpenCL C lacks templates and modern C++ features, making generic algorithms hard	Still permits accelerator-specific tuning
Limitation: debugging two code worlds increases complexity	Interoperates with OpenCL, allowing stepwise migration instead of a risky rewrite

Intuition

OpenCL C has no templates, so a sorter for uint, float, and double means three near-duplicate kernels. SYCL is real C++, so one templated functor covers all three types — generic device code is the headline win, and it is exactly what makes GPU-Quicksort portable across element types.

GPU-Quicksort itself is built from two kernels, gqsort_kernel and lqsort_kernel. A dispatcher iteratively calls gqsort_kernel until the data is partitioned into small chunks, and then lqsort_kernel sorts each remaining chunk locally. Around the kernels, the OpenCL program carries the usual platform/device setup, kernel build, memory allocation, buffer creation, argument binding, and dispatch.

Intuition

gqsort does the coarse, cross-group partitioning (global phase) until chunks are small enough to fit a single work-group; then lqsort finishes each small chunk entirely in local memory (local phase). This mirrors the "no cross-work-group sync inside a kernel" rule — the global phase needs repeated kernel launches, the local phase does not.

The migration follows an incremental pattern, where each step keeps the application compiling and running:

Replace OpenCL platform/device/context/queue initialization with DPC++ queue/device selection.
Retrieve the underlying OpenCL context/device/queue if the rest of the application still uses OpenCL.
Wrap original OpenCL kernels with cl::sycl::kernel.
Replace clSetKernelArg with DPC++ argument setters.
Replace clEnqueueNDRangeKernel with parallel_for.
Convert OpenCL buffers to SYCL buffers and pass accessors rather than raw memory objects.
Query platform/device properties using SYCL get_info<>.
Translate OpenCL C kernels into SYCL lambdas or functors.

Intuition

OpenCL interop lets you migrate one piece at a time — wrap the old kernels first, swap the host plumbing to SYCL, and only later rewrite the kernels themselves as C++. Every intermediate step still compiles and runs, so you never face a risky all-at-once rewrite.

The kernel translation has its own recurring details. Small kernels can become lambdas, while complex kernels with helper functions are better as functor classes; a functor kernel usually implements operator()(nd_item<1> id) or a similar operator for the chosen dimensionality. Global and local OpenCL pointers become global/local accessors or global_ptr/local_ptr, and local memory arrays become local accessors, often constructed outside the functor and passed in. Calls to get_group_id and get_local_id go through the nd_item, local barriers become item.barrier(...) with appropriate memory-space/fence arguments, and atomics require SYCL atomic objects/views, which are often more verbose than OpenCL C atomics.

Templates are what make the port pay off. The migration templatizes helper structs/functions and functor classes so the sorter works for uint, float, and double; OpenCL macros such as UINT_MAX become type-generic C++ constructs such as std::numeric_limits<T>::max(). One gotcha to watch is cl::sycl::select: the mask/third-argument type can depend on operand size, which requires type traits such as a select_type_selector. The broader portability lesson is that the article ports to both Windows and Linux with only small platform-specific adjustments, DPC++ lets old OpenCL kernels and new SYCL code coexist, and practical migration should be staged so each step leaves a working application.

oneAPI Tools, Profiling, and Roofline Reasoning (Deck: 12 from OpenCL to Data Parallel C++)

The oneAPI performance article frames heterogeneous systems as four architecture types — scalar, vector, matrix, and spatial. The goal it pushes is not "always use the GPU"; it is to map the right code regions to the right hardware and then verify that choice with tools.

Intuition

Scalar = CPU core, vector = SIMD/GPU, matrix = tensor/matrix engines, spatial = FPGA. oneAPI's pitch is one source language (DPC++) that can target all four, with tools to tell you which regions to offload — not a blanket "GPU everything."

Intel Advisor is the offload-decision tool. Its Offload Advisor identifies regions likely to benefit from GPU offload and regions not worth offloading, estimating accelerator speedup and weighing compute benefit against transfer/offload overhead. Its Vectorization Advisor finds under-vectorized loops and explains why vectorization may be blocked, the Threading Advisor helps design and tune threading, and Roofline Analysis places loops/kernels against compute and memory ceilings to expose bottlenecks and headroom.

Intuition

Plot performance (FLOP/s) against arithmetic intensity (FLOP/byte). A diagonal "memory roof" (bandwidth) rises until it hits a flat "compute roof" (peak FLOP/s). If your kernel sits under the diagonal it is memory-bound (improve locality/coalescing/tiling); if it sits under the flat part it is compute-bound (improve vectorization/algorithm). The roofline tells you which optimization is worth doing.

Intel VTune Profiler is the runtime-behavior tool. It finds hotspots and resource underutilization and correlates CPU and GPU activity in one timeline. Its GPU Offload Analysis identifies whether a program is CPU-bound or GPU-bound and how effectively kernels are offloaded, while GPU Compute/Media Hotspots pinpoints expensive GPU kernels, stalls, memory-latency problems, inefficient algorithms, and incorrect work-item configurations.

Put together on the matrix-multiplication example, the workflow reads as a story: Advisor identifies the triple-nested loop as a profitable offload candidate; after porting, VTune reveals stalls even when utilization looks high; if local/shared memory is unused and memory traffic is high, you tile/block the computation; and Roofline finally tells you whether further work should target arithmetic throughput, cache locality, global memory bandwidth, or transfer overhead.

oneMKL, Monte Carlo, USM, and Data Movement (Deck: 12 from OpenCL to Data Parallel C++)

The Monte Carlo pi example demonstrates DPC++ usage models and, above all, the importance of data placement. The algorithm itself is simple: generate n random 2D points (x,y) uniformly in [0,1) x [0,1), count the points satisfying x^2 + y^2 <= 1, and estimate pi as 4 * k / n where k is the count inside the quarter circle. Larger n improves accuracy by the law of large numbers.

Intuition

The quarter circle of radius 1 has area pi/4; the unit square has area 1. So the fraction of random points landing inside the quarter circle approximates pi/4, and multiplying by 4 recovers pi. It is embarrassingly parallel — every point is independent — which is exactly why it fits a GPU.

The DPC++/oneMKL usage shows the idiomatic pieces. A cl::sycl::queue selects the execution device and is passed to oneMKL functions; oneMKL RNG engines and distributions generate random numbers into SYCL buffers; buffers and accessors track transfers and consistency across kernels; and Parallel STL can do postprocessing closer to the device, which reduces host-device transfers.

USM (unified shared memory) is the pointer-based alternative to buffers. cl::sycl::malloc allocates directly on host/device/shared memory and supports pointer arithmetic, but requires a manual free. cl::sycl::usm_allocator lets standard containers use USM-backed allocation. And USM APIs return events, so you manage dependencies explicitly with event.wait, wait_and_throw, or queue.wait.

Intuition

Buffers/accessors hide pointers and auto-track dependencies; USM (malloc_host/malloc_device/malloc_shared) gives you raw pointers you can do arithmetic on, at the cost of manual free and manual event-based dependency management. USM feels like CUDA; buffers feel like idiomatic SYCL. Either way the golden rule is: minimize host↔device transfer.

For heterogeneous execution across several devices, you use multiple queues, split work between CPU and accelerator when it makes sense, and construct RNG engines for the exact target device. The overriding performance lesson, repeated once more, is that minimizing data transfer between host and device can dramatically improve performance.

Parallel Image Processing Foundations (Deck: 13 ImageProcessing)

Start with how an image is even represented. An RGB image can be viewed as three grayscale images, one per color channel — red, green, blue. With 8 bits per channel, each channel has 256 levels, so an RGB pixel has 256^3 = 2^24 = 16,777,216 possible colors. RGBA adds an alpha/opacity channel, commonly using 4 bytes per pixel. Be careful: component order can vary by operating system and file format, so your code must not assume a universal byte order without checking.

Storage adds another layer. External image files are often compressed, and that compression may be lossless (e.g. LZW) or lossy (e.g. JPEG); either way, images are decompressed into main memory before display or processing. In memory, an uncompressed RGB/RGBA image is usually a one-dimensional row-major array, where the width w, height h, number of channels, and bits per channel determine the raw row size. The subtlety is stride: the actual number of bytes between consecutive image rows in memory, which may exceed the minimal row size because rows are padded to align their starts for faster access.

Watch out

The bytes per row (stride) can be larger than width * bytesPerPixel, because rows are padded so each starts at an aligned address. So y * width + x indexing is only correct when there is no padding. Use the stride (e.g. getScanWidth / getScanLine(y)) to step between rows, or your image will shear.

This matters directly for GPU kernels: indexing a pixel by y * width + x is only correct if the row stride equals the packed row width. Otherwise you must use the byte/element stride when addressing rows.

Linear Filters and Border Handling (Deck: 13 ImageProcessing)

Many image operations compute an output image from an input image, and often each channel is processed independently as its own grayscale image. The linear filter model takes an input image I, an output image I', and a filter matrix (kernel) H. For each output pixel (u,v), you compute a weighted sum of neighboring input pixels:

I'(u,v) = sum over (i,j) in R of I(u+i, v+j) * H(i,j)

Here (u,v) is the image coordinate with origin at the top-left, and (i,j) is the filter coordinate with origin at the center of the filter matrix.

Intuition

Slide the small H matrix over every pixel, multiply each neighbor by the matching weight, and sum. The weights decide the effect: positive-and-summing-to-1 blurs, mixed signs detect change. It is a convolution, and because each output pixel is independent it is embarrassingly parallel.

The weights are the whole story of what a filter does. Positive weights summing to 1 compute a weighted average — a smoothing filter. Negative weights can measure intensity differences and approximate derivatives, and edge detectors combine negative and positive coefficients. Smaller filters are preferred for efficiency, because cost grows with filter area. Integer weights are often used for performance, with normalization dividing by a fixed value such as the sum of absolute weights. The deck's example filter types are the box filter, the Gaussian filter, and the LoG (Laplacian of Gaussian).

Near the image edges, part of the filter footprint falls outside the valid image — the border problem — and there is no single mathematically correct practical answer. The common strategies trade edge correctness against simplicity:

Border strategy	What it does
Constant border value	write a fixed value, e.g. 0
Keep originals	leave the original unfiltered border pixels
Constant out-of-image	assume out-of-image pixels have a constant value
Extend	extend edge pixels outward (clamp)
Reflect	reflect the image at its border
Partial filters	use partial filters with adjusted normalization, but this complicates the implementation and is less practical

Intuition

At the edge the filter "hangs off" the image, so you must invent the missing neighbors. The choices — zero-pad, keep, clamp/extend, reflect, or just skip the border (interior-only) — all trade correctness at the edge against simplicity. There is no single right answer; pick what is cheap and good enough.

For the GPU, the practical implication is that border handling tends to introduce branches. The fix is to isolate borders, clamp coordinates, or otherwise structure the kernel so that most work-items follow uniform control flow.

GPU Image Processing Pattern (Deck: 13 ImageProcessing)

Image filtering is a natural fit for GPUs because the same operation runs over many pixels and most output pixels are independent. The execution pattern is the canonical one. You use output data partitioning: each output pixel is assigned to one work-item, or each work-item computes a small block of output pixels, and the 2D GPU grid corresponds to the target image. In the simplest mapping, work-item (x,y) computes output pixel (x,y). You do not control kernel invocation order, so you assume work-items may run concurrently and independently; and when the image has more pixels than physical execution units, the GPU scheduler maps many work-items onto the available hardware.

Intuition

Assign one output pixel per work-item — each reads its own neighborhood and writes one result, with no two work-items writing the same pixel, so no synchronization is needed. The 2D NDRange is the image. This is the canonical embarrassingly-parallel GPU pattern.

To go faster, you bring local memory into the picture exactly as you did for matmul. Each work-group computes a tile of the output image, but the input region it needs is larger than the output tile because filters reach into neighboring pixels — for a 3 x 3 filter, a one-pixel halo around the tile. The work-items cooperatively copy the tile plus its halo from global memory into local memory, a barrier ensures the full local tile is loaded, and then each work-item computes its output pixel from local memory. Reused input pixels are thus read once from global memory and many times from local memory.

Intuition

Neighboring output pixels share input pixels, so each tile must also load a ring of border pixels — the halo — floor(fSize/2) wide (1 pixel for a 3x3 filter). Adjacent work-groups' input regions overlap by exactly that halo. Load tile + halo into local memory once, then every work-item reuses shared pixels from fast memory.

In other words, this is the image-processing twin of tiled matrix multiplication: load a shared working set into local memory, synchronize, reuse it, and cut global memory traffic.

Gradient-Based Edge Detection (Deck: 13 ImageProcessing; Deck: 12 Heterogeneous Systems)

Edge detection turns a grayscale image into a black-and-white line drawing of the strongest intensity changes. The process runs in six steps:

Apply an x derivative filter Hx to compute Dx(u,v).
Apply a y derivative filter Hy to compute Dy(u,v).
Interpret (Dx, Dy) as the gradient vector at pixel (u,v).
Compute the edge strength:

E(u,v) = sqrt(Dx(u,v)^2 + Dy(u,v)^2)

Compare E(u,v) with a threshold.
If it is above the threshold, write a foreground/edge pixel; otherwise write background.

Intuition

Dx and Dy measure how fast brightness changes in x and y. Their combined magnitude E = sqrt(Dx^2 + Dy^2) is large exactly where intensity changes sharply — i.e. at edges. Threshold it to turn "strong gradient" into a clean black/white line drawing.

The deck derives simple derivative/smoothing filters:

Hx = [-1  0  1
      -1  0  1
      -1  0  1]

Hy = [-1 -1 -1
       0  0  0
       1  1  1]

Reading these, Dx estimates horizontal intensity change and Dy estimates vertical intensity change. The gradient direction points toward the strongest intensity increase and is orthogonal to the edge direction itself, and your choice of threshold controls how many details surface as edges. On the GPU, one work-item per output pixel is again natural; because Hx and Hy use overlapping neighborhoods, local-memory tiling can reduce repeated global reads; and since border handling and thresholding introduce branches, you keep them simple and uniform where possible.

Worked example

Optimize the serial code FIRST (Sol.13) — parallelizing bad serial code produces misleading "speedups." The deck's measured chain for a filter on a 4080x4080 image goes: naive getPixelColor per pixel 26953 ms → hoist row = getScanLine(y) out and index row[x], looping over rows (row-major) 2019 ms (~13.4x, serial only) → OpenMP over the independent outer row loop 299 ms (~6.8x more, p=12 ≈ physical cores, rows independent so no sync) → SYCL GPU 88 ms (~22.9x) → sycl::int3 to process all 3 colour channels at once with mad24(y,width,x) = y*width + x for the index 79 ms (~25.5x). The int3 vectorization gain is hardware-dependent and can even be slower — always measure.

Course Conclusion Links: HPC Systems, GPUs, and Interconnects (Deck: 14 Conclusion)

The conclusion deck ties the programming models back to real HPC infrastructure. The local example is FHNW CALCULON, a self-managed HPC cluster for research and production projects in AI and data science. It contains nodes with different CPU core counts and, more to the point, heterogeneous GPU nodes spanning several generations — NVIDIA RTX 2080 Ti, RTX 3080, RTX A4500, and NVIDIA H200. The flagship H200 GPU nodes are wired together with high-speed 400 Gbit/s networking, jobs are scheduled with SLURM, and they run inside Singularity containers.

Intuition

CALCULON is literally the course's thesis made physical — mixed CPU nodes plus several GPU generations, fast interconnect for the flagship nodes, a batch scheduler (SLURM) to share it, and containers (Singularity) for reproducible environments.

The Top500/Green500 message scales the same idea up to the whole field. Current leading supercomputers are massively parallel and heterogeneous, combining many CPU cores with accelerator GPUs such as AMD Instinct, Intel GPU Max, and NVIDIA GH200/H100. Green500 shifts the ranking to performance per watt, which shows that energy efficiency is central to modern parallel computing, and the sheer prevalence of GPU-accelerated systems reinforces the course's theme: performance and energy efficiency both come from matching workloads to specialized hardware.

Intuition

Top500 ranks raw FLOP/s; Green500 ranks FLOP/s per watt. Both are dominated by GPU-accelerated, heterogeneous machines — which is the whole course in one data point: speed and energy efficiency both come from specialization.

The interconnects are the plumbing that makes all this work. InfiniBand is a high-throughput, low-latency switched-fabric network used in HPC for node-to-node and storage communication, with speeds progressing through FDR/EDR/HDR/NDR to XDR. Slingshot is HPE Cray's exascale-era interconnect, Ethernet-compatible, with dynamic routing and Dragonfly topology support; the Dragonfly topology keeps the path diameter small, with a typical path being local link, global link, local link. PCI Express connects CPUs and accelerators within a node, and its bandwidth can matter for host-device transfers. Terabit Ethernet variants support high-speed cluster networking, and NVIDIA NVLink provides high-bandwidth, energy-efficient CPU-GPU and GPU-GPU communication, with newer versions raising aggregate bandwidth substantially.

Intuition

Which wire for which hop? PCIe = inside a node, between CPU and accelerator (and the transfer cost the GPU section kept warning about). NVLink = much faster GPU↔GPU / CPU↔GPU links. InfiniBand / Slingshot = between nodes across the cluster, with Dragonfly keeping any two nodes within "local-global-local" hops so the network diameter stays small.

Finally, the deck links all of this back to the exam. The exam emphasizes completing code in C++/OpenMP/MPI/SYCL, parallelizing sequential algorithms, implementing parallel algorithms, and analyzing performance/scalability. For this part specifically, you should be able to explain how a sequential loop becomes a kernel over an NDRange, how work-groups/local memory/barriers change performance, and how to reason about GPU memory movement — and then connect those low-level choices to high-level metrics like speedup, scalability, bandwidth limits, arithmetic intensity, launch/transfer overhead, and energy efficiency.

Exam-Focused Checklist Across These Decks

Before the exam, make sure you can do each of the following for this part of the course:

Topic	What you must be able to explain
Why heterogeneous	frequency scaling limits, workload diversity, energy efficiency, and specialized architectures
Workload → hardware	control intensive to CPUs, data/SIMD to GPUs, specialized dataflow/spatial work to FPGAs or accelerators
OpenCL/SYCL terms	host, device, compute unit, processing element, kernel, NDRange, work-item, work-group, global range, local range, queue, context, buffer, image, accessor
CUDA bridge	work-item ≈ thread, work-group ≈ block, compute unit ≈ SM, processing element ≈ CUDA core
Memory spaces	host, global, constant, local, private; scope, speed intuition, and synchronization implications
Synchronization rule	barriers synchronize only within a work-group and must be reached by all work-items in that group
OpenCL host workflow	device discovery, context/queue creation, buffer/image allocation, build program, create kernel, set args, enqueue transfers/kernel, synchronize/read back
Asynchronous kernels	why OpenCL kernels are asynchronous and how events/blocking reads/queue waits establish ordering
Local-memory tiling	for matrix multiplication and image filtering, including halo regions for stencils
Arithmetic intensity	vec-add `Theta(n)/Theta(n)=O(1)` (GPU barely helps, bandwidth-bound) vs mat-mul `Theta(n^3)/Theta(n^2)=Theta(n)` (GPU wins big with tiling)
Tiled-matmul model	simple `T_P = Theta(nt_g)` vs tiled `T_P = Theta((n/q)t_g + n*t_l)`
Work-group size	justify using global divisibility, device limits, register pressure, local memory capacity, and occupancy
Memory coalescing	why neighboring work-items should access neighboring addresses
Divergent control flow	why similar work-item paths matter
SYCL single-source	queues, buffers/accessors, command groups, `parallel_for`, `range`, `nd_range`, `item`, `nd_item`
DPC++ migration	interoperability first, then replace queues/kernel launches/buffers, then translate kernels to lambdas/functors
DPC++ templates	why they help generic GPU algorithms such as GPU-Quicksort
Profiling	high utilization can still mean stalls; use Advisor/VTune/Roofline to find offload candidates, hotspots, memory bottlenecks, and headroom
Image processing	RGB/RGBA layout, stride, linear filters, border handling, output data partitioning, gradient magnitude, and thresholded edge detection
Optimization order	optimize the serial code first, or parallel speedups are misleading
Models → HPC reality	modern Top500 systems are heterogeneous, GPU-heavy, and depend on fast node and accelerator interconnects

Compact Exam Checklist

Across the whole course, make sure each of these is at your fingertips:

Area	What to have ready
Core metrics	`T_S`, `T_P`, `S`, `E`, cost, overhead `T_O`, cost-optimality, Amdahl, Gustafson-Barsis, Karp-Flatt, isoefficiency, degree of concurrency, and minimum execution time
Decomposition & mapping	explain a decomposition and mapping choice for matrix operations, sorting, graph search, image filtering, and numerical integration
OpenMP	data-sharing clauses, schedules, implicit barriers, reductions, race patterns, false sharing, and the common mistakes from the supplementary paper
MPI	message matching, blocking vs nonblocking communication, safe sends, deadlocks, `MPI_Sendrecv`, communicators, Cartesian topologies, and collective-call rules
Collectives	the simplified cost model and the hypercube runtimes for broadcast/reduction, all-to-all broadcast/allgather, scatter/gather, scan, all-reduce, and total exchange
Algorithms	connect complexity to architecture: shared quicksort/bitonic, distributed odd-even/Shellsort, matrix-vector, Cannon, numerical integration, GPU tiled matrix multiplication, and image filters
Heterogeneous systems	the OpenCL/SYCL platform, execution, and memory model: host/device, queues, kernels, work-items, work-groups, buffers/accessors, local memory, barriers, and host-device transfer costs
Coding	be ready to write or complete small C++/OpenMP/MPI/SYCL code snippets and to explain why a parallel version may be wrong or slow

Source Files Checked

All PDF decks in the directory were re-extracted and checked while preparing this summary:

Parallel and Distributed Computing - Detailed Study Summary

How to Use This Summary

Course Map

Deck Coverage and Navigation

High-Value Topic Threads

Part I - Foundations, Metrics, Decomposition, and Scalability

Parallel Computing Motivation and Machine Models (01 Introduction)

Parallel and distributed computing context

Parallel machine model (01 Introduction)

Concurrent programs vs parallel programs (01 Introduction)

Forms of parallelism (01 Introduction)

Why parallel computing is needed (01 Introduction)

Important abbreviations (01 Introduction)

Parallel Programming in C++ (01 Parallel Programming in C++)

Tooling and C++ baseline

C++ basics relevant to parallel code

Three approaches to modern C++ parallelism

Threads (01 Parallel Programming in C++)

Futures and async (01 Parallel Programming in C++)

Synchronization primitives (01 Parallel Programming in C++)

Serial vs parallel for-loops (01 Parallel Programming in C++)

Parallel for-each and scheduling (01 Parallel Programming in C++)

Ordering and standard execution policies (01 Parallel Programming in C++)

Complexity and Performance Metrics (02 Performance Metrics)

RAM, time, and memory requirements

Efficient algorithms and analysis cases

Asymptotic complexity and Landau notation

Analytical modeling of parallel systems

PRAM model (02 Performance Metrics)

Sources of parallel overhead

Core performance metrics

Adding n numbers: p = n example (02 Performance Metrics)

Speedup bounds and superlinear speedup

Cost-optimality

Bitonic sort example: impact of non-cost-optimality

Scaling down and cost-optimal parallel addition

Problem size W

Amdahl's Law (02 Performance Metrics)

Gustafson-Barsis Law (02 Performance Metrics)

Karp-Flatt metric (02 Performance Metrics)

Decomposition and Mapping (03 Decomposition and Mapping)

Dense matrix-vector multiplication (03 Decomposition and Mapping)

Database query processing and task dependency graphs

Granularity (03 Decomposition and Mapping)

Task interaction graphs (03 Decomposition and Mapping)

Decomposition techniques

Output data partitioning and owner computes

Input data partitioning

Mapping tasks to processes

Static and dynamic mapping

Dense matrix multiplication mapping

Sorting matrix rows: static vs dynamic mapping

Solving linear systems and LU decomposition

LU task decomposition and static mapping issue

Minimizing interaction overheads

Scalability of Parallel Systems (07 Scalability)

Scaling characteristics

Scalable parallel systems

Efficiency as a function of W and p

Isoefficiency metric

Isoefficiency of parallel addition

Isoefficiency with multiple overhead terms

Cost-optimality and isoefficiency

Lower bound on isoefficiency

Degree of concurrency and isoefficiency

Gaussian elimination concurrency example

Minimum execution time

Minimum time for parallel addition

Cross-Deck Connections for Exam Prep

From decomposition to metrics

Formula summary

Common exam pitfalls

Part II - Shared Memory, OpenMP, Sorting, and Graph Search

Shared-address-space platforms (04 Shared Memory Systems)

Cache coherence protocols and false sharing (04 Shared Memory Systems)

OpenMP programming model (04 Shared Memory Systems)

OpenMP loop scheduling and task decomposition (04 Shared Memory Systems)

OpenMP synchronization and memory model (04 Shared Memory Systems; 04 OpenMP_common_mistakes)

Common OpenMP correctness mistakes (04 OpenMP_common_mistakes)

Common OpenMP performance mistakes (04 OpenMP_common_mistakes)