HPC Practices

This section consolidates the HPC guidance (MPI, GPU, determinism, logging, performance).

1. Determinism & Reproducibility

  • Avoid hidden global state; seed RNGs explicitly; document nondeterministic paths.

// Deterministic RNG (repeatable experiments)
std::mt19937 gen(1337);
std::uniform_real_distribution<double> dist(0.0, 1.0);
double u = dist(gen);
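
In MPI runs, reproducibility usually also requires a documented per-rank seeding scheme; a minimal sketch (the base seed and the use of std::seed_seq are arbitrary choices for illustration, not a project convention):

// Reproducible yet distinct stream per rank: seed derived from a fixed base and the rank
int rank = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
std::seed_seq seq{1337, rank};
std::mt19937 gen(seq);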

2. Memory & Data Layout

  • Reuse allocations in loops; prefer contiguous memory; consider SoA for SIMD/vectorization (a layout sketch follows the glossary note below).

// Wrong: allocates each iteration
for (int i = 0; i < N; ++i)
{
	std::vector<double> buf(1024);
	process(buf);
}

// Correct: reuse
std::vector<double> buf(1024);
for (int i = 0; i < N; ++i)
{
	process(buf);
}

Glossary: SoA = Structure of Arrays vs AoS = Array of Structures.
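
To make the layout point concrete, a small sketch (ParticleAoS, ParticlesSoA and scaleX are illustrative names, not project types):

#include <cstddef>
#include <vector>

// AoS: the fields of each particle are interleaved; per-field access is strided
struct ParticleAoS { double x, y, z; };

// SoA: each field is stored contiguously; per-field loops are unit-stride and SIMD-friendly
struct ParticlesSoA
{
	std::vector<double> x, y, z;
};

void scaleX(ParticlesSoA & p, double a)
{
	// unit-stride loop over contiguous memory: straightforward for the compiler to vectorize
	for (std::size_t i = 0; i < p.x.size(); ++i)
		p.x[i] *= a;
}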

3. Concurrency & Atomics

  • Minimize shared mutable state; prefer message passing; when needed, use std::atomic with explicit memory order.

std::atomic<int> S_counter{0};
S_counter.fetch_add(1, std::memory_order_relaxed);
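
When one thread publishes data for another, release/acquire ordering is the usual pattern; a minimal sketch (S_ready and S_payload are illustrative names):

#include <atomic>

std::atomic<bool> S_ready{false};
int S_payload = 0;

void producer()
{
	S_payload = 42;                                 // write the data first
	S_ready.store(true, std::memory_order_release); // then publish the flag
}

void consumer()
{
	while (!S_ready.load(std::memory_order_acquire)) // spin until published
		;
	int v = S_payload; // release/acquire pairing makes the payload write visible here
	(void)v;
}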

4. MPI Practices

  • Prefer collectives over manual send/recv loops — simpler, faster, less error-prone.

  • Avoid gratuitous barriers; synchronize only when required (timing, phases).

  • Logging is MPI-aware and integrated with Google glog.

4.1. Prefer collectives over manual loops

// Wrong: manual broadcast via send/recv
if (comm.rank() == 0)
{
	for (int r = 1; r < comm.size(); ++r)
	{
		MPI_Send(buf.data(), count, MPI_DOUBLE, r, 0, MPI_COMM_WORLD);
	}
}
else
{
	MPI_Recv(buf.data(), count, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

// Correct: use a collective broadcast
MPI_Bcast(buf.data(), count, MPI_DOUBLE, 0, MPI_COMM_WORLD);

// (boost::mpi3 also provides collective methods, e.g. comm.broadcast_n)
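
The same rule applies to reductions; a sketch replacing a hand-written gather-and-sum with a single collective (localNorm is the per-rank value used in the logging examples, globalNorm is illustrative):

// Correct: one collective reduction instead of gathering partial values by hand
double globalNorm = 0.0;
MPI_Allreduce(&localNorm, &globalNorm, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);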

4.2. MPI-aware logging with glog

// Transparent: behavior depends on logging configuration
LOG(INFO) << fmt::format("Starting computation with N={}", N);

// In selective/debug mode, all ranks may emit if configured
// (glog has no DEBUG severity: use VLOG/DLOG for debug-level output)
VLOG(1) << fmt::format("[rank {}] localNorm={}", comm.rank(), localNorm);

4.3. Avoid unnecessary barriers

// Wrong: barrier in every loop step (expensive!)
for (int step = 0; step < steps; ++step)
{
	compute_local();
	MPI_Barrier(MPI_COMM_WORLD); // unnecessary
}

// Correct: rely on nonblocking ops or collectives
std::vector<MPI_Request> reqs;
// issue nonblocking Isend/Irecv into reqs...
MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);
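
As an illustration, the skeleton above might be fleshed out as a ring exchange (the neighbors left/right, the buffers sbuf/rbuf, and the ring topology are assumptions for this sketch; count is used as in the earlier examples):

// Hypothetical ring exchange: post nonblocking receive/send, then wait once
int rank = 0, size = 1;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int left  = (rank - 1 + size) % size;
int right = (rank + 1) % size;

std::vector<double> sbuf(count), rbuf(count);
std::vector<MPI_Request> reqs(2);
MPI_Irecv(rbuf.data(), count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sbuf.data(), count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
MPI_Waitall(static_cast<int>(reqs.size()), reqs.data(), MPI_STATUSES_IGNORE);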

// Barrier only when timing phases
MPI_Barrier(MPI_COMM_WORLD);
double t0 = MPI_Wtime();
do_work();
MPI_Barrier(MPI_COMM_WORLD);
double t1 = MPI_Wtime();
LOG(INFO) << fmt::format("Phase time = {:.6f}s", t1 - t0);

5. GPU / Accelerators

  • Don’t leak CUDA/HIP types in public headers; keep device pointers opaque; keep kernels focused.

// header (opaque device handle; no CUDA/HIP types exposed)
#include <cstddef>

class DeviceBuffer
{
public:
	explicit DeviceBuffer( std::size_t n ); // allocates on device; defined in .cu/.hip
	~DeviceBuffer();                        // frees device memory; defined in .cu/.hip
private:
	void * M_dev = nullptr; // opaque device pointer; managed in .cu/.hip
};
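
For context, a minimal CUDA-side sketch of the matching implementation file (the file names are hypothetical, the buffer is assumed to hold n doubles, and error checking of the CUDA runtime calls is omitted for brevity):

// device_buffer.cu -- CUDA types appear only here, never in the public header
#include <cuda_runtime.h>
#include "device_buffer.hpp" // hypothetical header declaring DeviceBuffer

DeviceBuffer::DeviceBuffer( std::size_t n )
{
	cudaMalloc( &M_dev, n * sizeof(double) );
}

DeviceBuffer::~DeviceBuffer()
{
	cudaFree( M_dev );
}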

6. Logging with Google glog (GLOG)

  • MPI-aware logging modes:
    1. Master-only: only rank 0 produces output (other ranks get a NoOp stream).
    2. All ranks: every rank logs (useful for debugging).
    3. Selective: rank 0 logs info; specific ranks can emit debug output.

Users do not need to write if (comm.rank() == 0) guards; LOG(...) honors the configured mode, and on ranks where logging is disabled the call compiles to a NoOp.

// Transparent to the user
LOG(INFO) << fmt::format("Starting computation with N={}", N);
LOG(WARNING) << fmt::format("[rank {}] localNorm={}", comm.rank(), localNorm);

Initialization (done once per process):

#include <glog/logging.h>

int main(int argc, char **argv)
{
	google::InitGoogleLogging(argv[0]);
	// Feel++ logging setup chooses master-only / all-ranks / selective
}
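
When the all-ranks mode writes log files, a common complement (a plain-glog sketch, not the Feel++ API; the path prefix is illustrative) is to give each rank its own destination so ranks do not interleave in one file:

// Hypothetical per-rank log destination (call once after InitGoogleLogging)
int rank = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
std::string prefix = fmt::format("./logs/app.rank{}.", rank);
google::SetLogDestination(google::GLOG_INFO, prefix.c_str());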

7. Testing & Benchmarking

  • Fix RNG seeds; make tests deterministic. Separate microbenchmarks from unit tests. Use sanitizers (ASan/UBSan/TSan) in dedicated CI jobs.

std::mt19937 gen(42); // fixed seed for tests
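
For instance, a minimal deterministic test might compare two identically seeded engines (plain assert is used only for illustration; the project's test framework would normally be used instead). Sanitizer runs are usually separate CI builds with flags such as -fsanitize=address,undefined:

#include <cassert>
#include <random>

// Two engines with the same fixed seed must produce identical sequences
void test_rng_is_deterministic()
{
	std::mt19937 a(42), b(42);
	for (int i = 0; i < 1000; ++i)
		assert(a() == b());
}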

8. Glossary (HPC)

  • MPI: Message Passing Interface — standard for distributed-memory parallelism.

  • GPU: Graphics Processing Unit — accelerator used for massively parallel workloads.

  • SoA/AoS: Structure of Arrays / Array of Structures — data layout patterns affecting vectorization and cache behavior.

  • SIMD: Single Instruction, Multiple Data — CPU vector instructions (e.g., AVX).

  • Barrier: global synchronization point across MPI ranks (use sparingly).