Par Lab Bootcamp 2013: Parallelizing a Particle Simulation

Problem statement

Your goal is to parallelize, on XSEDE's Stampede (online) or NERSC's Hopper (on-site), a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behaviour shown in the following animation:

The range of interaction forces (cutoff) is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.

Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time T/p when using p processors; for example, a run that takes 32 seconds serially would ideally take about 2 seconds on 16 processors. We'd like you to write parallel codes that approach these expectations.


Correctness and Performance

A simple correctness check is provided which computes the minimal distance between any 2 particles during the entire simulation. In a correct simulation, particles stay at greater than 0.4 (of cutoff), with typical values between 0.7-0.8. In a simulation where particles don't interact correctly, the minimum will be less than 0.4 (of cutoff), with typical values between 0.01-0.05. More details, as well as an average distance, are described in the source file.
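
The check is driven from the interaction function; the sketch below shows roughly how such a tracker can work (the function and variable names here are hypothetical, not necessarily those in common.cpp):

    #include <algorithm>
    #include <cmath>

    // Hypothetical min-distance tracker, updated on every pair the
    // simulation considers (alongside the force computation).
    static double dmin = 1.0;  // running minimum, in units of the cutoff

    void track_distance(double dx, double dy, double cutoff) {
        double r = sqrt(dx * dx + dy * dy);
        if (r < cutoff)                       // only interacting pairs count
            dmin = std::min(dmin, r / cutoff);
    }
    // After the run: dmin > 0.4 indicates a correct simulation;
    // dmin < 0.4 indicates particles are passing through each other.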

The code we are providing performs this distance check based on the calls to the interaction function, but the full autograder performs the checks based on the output txt file. We recommend keeping the correctness checker in your code (for the OpenMP, Pthreads and MPI codes the overhead isn't significant), but depending on your performance goals it can be deleted, as long as the simulation can still output a correct txt file.

The performance of the code is determined by doing multiple runs of the simulation with increasing particle numbers and comparing the results against the provided autograder.cpp. This can be done automatically with the auto-* scripts.

There will be two types of scaling tested for your parallel codes: strong scaling, where the number of particles is held fixed as processors are added, and weak scaling, where the number of particles grows proportionally with the number of processors.
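
Using the conventional definitions (these formulas are ours, not spelled out in the assignment), with T1 the time on one processor and Tp the time on p processors:

    speedup:                   S(p) = T1 / Tp
    strong-scaling efficiency: E(p) = T1 / (p * Tp)        (problem size fixed)
    weak-scaling efficiency:   E(p) = T1(n) / Tp(p * n)    (problem size grows with p)

An ideal parallel code keeps both efficiencies close to 1.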

For more details on the options provided by each program, use the -h flag on the compiled executables.

Besides the text summary outputs, we also provide a visualization tool that generates a picture similar to the animation and lets you analyze the txt output file. The relevant files can be found under visualization in the Source Files below.

Important note for Performance:

While the scripts we are providing use small numbers of particles (500-1000) to allow the O(n²) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance. (Keep in mind that with 100 times the particles, an O(n²) code does 10,000 times the work, while an O(n) code does only 100 times the work.)


Grading

Your grade will depend on the performance and efficiency sustained by your codes on Stampede/Hopper and will be broken into 3 sections (Serial - 30%, OpenMP or Pthreads - 35%, MPI - 35%):

* While we encourage you to try writing your own serial algorithm, since a code of your own will be easier to parallelize, we also know that time is limited, so we are providing a sample serial O(n) implementation for you to use in the Help Files below.

Example grade

Let's assume that John Student completes the assignment and has:

His grade for each section would be:

His total grade = Serial * 0.3 + MAX(OpenMP, Pthreads) * 0.35 + MPI * 0.35 = 100 * 0.3 + 95.83 * 0.35 + 67.5 * 0.35 = 87.1655


Source files

You may start with the serial and parallel implementations supplied below. All of them run in O(n²) time, which is unacceptably inefficient.

serial.cpp
a serial implementation,
openmp.cpp
a shared memory parallel implementation done using OpenMP (see the sketch after this list),
pthreads.cpp
a shared memory parallel implementation done using pthreads (if you prefer it over OpenMP),
mpi.cpp
a distributed memory parallel implementation done using MPI,
common.cpp, common.h
an implementation of common functionality, such as I/O, numerics and timing,
Makefile
a Makefile that should work directly on Stampede (see the note below on switching it to Hopper),
autograder.cpp
a code to help determine performance and grading,
job-hopper-serial, job-hopper-pthreads16, job-hopper-openmp16, job-hopper-mpi16
sample batch files to launch jobs on Hopper. Use qsub to submit on Hopper. Use these files to check correctness
auto-hopper-serial, auto-hopper-pthreads16, auto-hopper-openmp16, auto-hopper-mpi16
sample batch files to launch autograder jobs on Hopper. Use qsub to submit on Hopper. Use these files to check performance
job-carver-serial, job-carver-pthreads8, job-carver-openmp8, job-carver-mpi8
sample batch files to launch jobs on Carver. Use qsub to submit on Carver. Use these files to check correctness
auto-carver-serial, auto-carver-pthreads8, auto-carver-openmp8, auto-carver-mpi8
sample batch files to launch autograder jobs on Carver. Use qsub to submit on Carver. Use these files to check performance
job-stampede-serial, job-stampede-pthreads16, job-stampede-openmp16, job-stampede-mpi16
sample batch files to launch jobs on Stampede. Use sbatch to submit on Stampede. Use these files to check correctness
auto-stampede-serial, auto-stampede-pthreads16, auto-stampede-openmp16, auto-stampede-mpi16
sample batch files to launch autograder jobs on Stampede. Use sbatch to submit on Stampede. Use these files to check performance
particles.tgz
all above files in one tarball archive.
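
As a point of reference for the shared memory codes, one simulation step of the naive O(n²) algorithm parallelizes along the following lines. This is a minimal sketch assuming the particle_t type and the apply_force and move helpers from common.h (the exact signatures may differ); it is not the actual contents of openmp.cpp:

    #include "common.h"

    // One time step of the naive O(n^2) simulation, parallelized with OpenMP.
    void step(particle_t *particles, int n) {
        // Compute forces: the iterations are independent, so threads
        // can each take a slice of the particles.
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            particles[i].ax = particles[i].ay = 0;
            for (int j = 0; j < n; j++)
                apply_force(particles[i], particles[j]);
        }
        // Move particles only after all forces are in place; the implicit
        // barrier at the end of the first loop guarantees this.
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            move(particles[i]);
    }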

If you wish to build it on other systems, you might need a custom implementation of pthread barrier, such as: pthread_barrier.c, pthread_barrier.h.
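
Such a barrier is classically built from a mutex and a condition variable; a minimal sketch of that construction follows (illustrative names, not necessarily the contents of the provided files):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        unsigned count;   // threads yet to arrive in the current cycle
        unsigned total;   // number of participating threads
        unsigned cycle;   // generation counter, guards against spurious wakeups
    } my_barrier_t;

    void my_barrier_init(my_barrier_t *b, unsigned total) {
        pthread_mutex_init(&b->mutex, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->count = b->total = total;
        b->cycle = 0;
    }

    void my_barrier_wait(my_barrier_t *b) {
        pthread_mutex_lock(&b->mutex);
        unsigned cycle = b->cycle;
        if (--b->count == 0) {              // last thread to arrive
            b->cycle++;                     // release the current cycle
            b->count = b->total;            // reset for reuse
            pthread_cond_broadcast(&b->cond);
        } else {
            while (cycle == b->cycle)       // ignore spurious wakeups
                pthread_cond_wait(&b->cond, &b->mutex);
        }
        pthread_mutex_unlock(&b->mutex);
    }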

You may consider using the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version.

To download all the files directly to Stampede/Hopper you can use the following command:

wget http://www.eecs.berkeley.edu/~carazvan/2013bootcamp/particles.tgz

Note that the Makefile contains 2 sets of configuration options, one for Stampede and one for Hopper, with Stampede (online) being the default. In order for the compilation to go correctly you must comment out/uncomment the appropriate lines within the Makefile.


Login, Compilation and Job submission

Hopper

For onsite attendees an introduction to the NERSC Hopper system will be provided on the first day, and there will be TAs present to help with any issues.

Login to Hopper can be achieved by ssh-ing into:

hopper.nersc.gov

To submit jobs on the Hopper system:

qsub job-hopper-mpi16

To check the status of user train120's running jobs you can use the following command:

qstat -u train120

Note that on Hopper you can go up to 24 cores per node.

Stampede

The easiest way to log in to XSEDE resources is via the Accounts tab in the XSEDE User Portal. To reach this page, log in to the XSEDE User Portal, navigate to the MY XSEDE tab and from there select the Accounts page. All machines you have access to will have a login option next to them that launches OpenSSH via a Java applet. For this assignment we will be using the Stampede supercomputer.

For setting up XSEDE single login and OpenSSH separately from XUP, please check the instructions on doing so within the XSEDE User Portal. Please be aware that most XSEDE resources are not available via direct ssh and all requests must go through the XSEDE single login system (some machines do allow direct login; for more information please check each supercomputer's "System access" page).

You can extract the downloaded directory from the tgz archive with:

tar -xvf particles.tgz

The default Makefile is set up for the default Intel compiler; therefore you can simply type make to compile the code.

To submit jobs on the Stampede system you will use the SLURM batch environment interface, for example:

sbatch job-stampede-mpi16

To check the status of running jobs you can use the following command:

showq -u

Once you have code that gives correct output, you should check its efficiency with the provided autograder by running the command:

sbatch auto-stampede-mpi16

This example will run your MPI code with the "no output" (-no) option on 1, 2, 4, 8 and 16 processors with varying problem sizes to determine strong and weak scaling.

Note that at the moment the Stampede system is very fast in processing jobs, so depending on your problem size the job might finish even before you run the showq command. If your submission succeeded, check the folder for the .stdout output file.

For more details on sbatch commands please see Stampede's documentation page.


Resources


Optional: GPU Source files, CUDA Programming

Consider starting this and other optional parts after you are happy with your submitted code.

An optional part of the assignment is to take advantage of the GPUs attached to Dirac and of the manycore Intel Xeon Phi coprocessor on Stampede via CUDA.

Source Files

NERSC Dirac

Dirac is actually a subset of Carver's nodes, so login can be achieved by ssh-ing into:

carver.nersc.gov

To submit jobs on the Dirac system:

qsub job-dirac-gpu

To check the status of user train120's running jobs you can use the following command:

qstat -u train120

XSEDE Stampede

You should note that the Intel Xeon Phi is NOT a GPU, and the speedups you can achieve will be limited, since it has tens of cores versus the thousands in a GPU. However, the CUDA model will still apply, and the code you write should be nearly identical to a GPU implementation.

To download all the files directly to Stampede/Dirac you can use the following command:

wget http://www.eecs.berkeley.edu/~carazvan/2013bootcamp/particles_gpu.tgz

To load CUDA before compiling on Stampede, use the command:

module load cuda

More information on GPU Programming on Stampede is available at the XSEDE Stampede webpage.


Optional: Other improvements


Help Files

These files are here to help you with the assignment if you get stuck on achieving an O(n) algorithm or on implementing the serial code, since our goal is for you to focus on the parallel implementation.

Theory - a short pdf detailing how to achieve an O(n) algorithm

Serial Code - an implementation that achieves an O(n) algorithm for the problem
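
The key idea in both is spatial binning: divide the square domain into cells of side at least the cutoff, so each particle only needs to consider particles in its own and the eight neighbouring cells. The sketch below shows the force phase under that scheme; the data layout is illustrative rather than the provided code, and size/cutoff are passed in as parameters because common.h may not expose them:

    #include <algorithm>
    #include <vector>
    #include "common.h"   // assumed to provide particle_t and apply_force

    // One O(n) force computation via binning; illustrative, not the provided code.
    void compute_forces(particle_t *particles, int n, double size, double cutoff) {
        int nbins = std::max(1, (int)(size / cutoff));   // cells of side >= cutoff
        double bin_size = size / nbins;
        std::vector<std::vector<int>> bins(nbins * nbins);

        // Assign each particle to its cell: O(n).
        for (int i = 0; i < n; i++) {
            int bx = std::min((int)(particles[i].x / bin_size), nbins - 1);
            int by = std::min((int)(particles[i].y / bin_size), nbins - 1);
            bins[bx + by * nbins].push_back(i);
        }

        // Each particle scans only the 3x3 block of cells around it.
        for (int i = 0; i < n; i++) {
            particles[i].ax = particles[i].ay = 0;
            int bx = std::min((int)(particles[i].x / bin_size), nbins - 1);
            int by = std::min((int)(particles[i].y / bin_size), nbins - 1);
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = bx + dx, ny = by + dy;
                    if (nx < 0 || nx >= nbins || ny < 0 || ny >= nbins)
                        continue;
                    for (int j : bins[nx + ny * nbins])
                        apply_force(particles[i], particles[j]);
                }
        }
        // With the low density guaranteeing O(1) particles per cell,
        // the whole pass is O(n) instead of O(n^2).
    }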