Par Lab Bootcamp 2013: Parallelizing a Particle Simulation

Problem statement

Your goal is to parallelize, on XSEDE's Stampede (online) or NERSC's Hopper (on-site), a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behaviour shown in the following animation:

The range of interaction forces (cutoff) is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.

Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time T/p when using p processors; for example, a run that takes 32 seconds serially would ideally take about 2 seconds on 16 processors. We'd like you to write parallel codes that approach these expectations.


Correctness and Performance

A simple correctness check is provided which computes the minimal distance between any 2 particles during the entire simulation. In a correct simulation, particles stay at greater than 0.4 (of cutoff), with typical values between 0.7-0.8. In a simulation where particles don't interact correctly, the minimum will be less than 0.4 (of cutoff), with typical values between 0.01-0.05. More details, as well as an average distance, are described in the source file.
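
The check is driven from the interaction function; the sketch below shows roughly how such a tracker can work (the function and variable names here are hypothetical, not necessarily those in common.cpp):

    #include <algorithm>
    #include <cmath>

    // Hypothetical min-distance tracker, updated on every pair the
    // simulation considers (alongside the force computation).
    static double dmin = 1.0;  // running minimum, in units of the cutoff

    void track_distance(double dx, double dy, double cutoff) {
        double r = sqrt(dx * dx + dy * dy);
        if (r < cutoff)                       // only interacting pairs count
            dmin = std::min(dmin, r / cutoff);
    }
    // After the run: dmin > 0.4 indicates a correct simulation;
    // dmin < 0.4 indicates particles are passing through each other.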

The code we are providing performs this distance check based on the calls to the interaction function, but the full autograder performs the checks based on the output txt file. We recommend keeping the correctness checker in your code (for the OpenMP, Pthreads and MPI codes the overhead isn't significant), but depending on your performance goals it can be deleted, as long as the simulation can still output a correct txt file.

The performance of the code is determined by doing multiple runs of the simulation with increasing particle numbers and comparing the results against the provided autograder.cpp. This can be done automatically with the auto-* scripts.

There will be two types of scaling tested for your parallel codes: strong scaling, where the number of particles is held fixed as processors are added, and weak scaling, where the number of particles grows proportionally with the number of processors.
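
Using the conventional definitions (these formulas are ours, not spelled out in the assignment), with T1 the time on one processor and Tp the time on p processors:

    speedup:                   S(p) = T1 / Tp
    strong-scaling efficiency: E(p) = T1 / (p * Tp)        (problem size fixed)
    weak-scaling efficiency:   E(p) = T1(n) / Tp(p * n)    (problem size grows with p)

An ideal parallel code keeps both efficiencies close to 1.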

For more details on the options provided by each program, use the -h flag on the compiled executables.

Besides the text summary outputs, we also provide a visualization tool that generates a picture similar to the animation and lets you analyze the txt output file. The relevant files can be found under visualization in the Source Files below.

Important note for Performance:

While the scripts we are providing use small numbers of particles (500-1000) to allow the O(n²) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance. (Keep in mind that with 100 times the particles, an O(n²) code does 10,000 times the work, while an O(n) code does only 100 times the work.)


Grading

Your grade will depend on the performance and efficiency sustained by your codes on Stampede/Hopper and will be broken into 3 sections (Serial - 30%, OpenMP or Pthreads - 35%, MPI - 35%):

* While we encourage you to try writing your own serial algorithm, since a code of your own will be easier to parallelize, we also know that time is limited, so we are providing a sample serial O(n) implementation for you to use in the Help Files below.

Example grade

Let's assume that John Student completes the assignment and has:

His grade for each section would be:

His total grade = Serial * 0.3 + MAX(OpenMP, Pthreads) * 0.35 + MPI * 0.35 = 100 * 0.3 + 95.83 * 0.35 + 67.5 * 0.35 = 87.1655


Source files

You may start with the serial and parallel implementations supplied below. All of them run in O(n²) time, which is unacceptably inefficient.

serial.cpp
a serial implementation,
openmp.cpp
a shared memory parallel implementation done using OpenMP (see the sketch after this list),
pthreads.cpp
a shared memory parallel implementation done using pthreads (if you prefer it over OpenMP),
mpi.cpp
a distributed memory parallel implementation done using MPI,
common.cpp, common.h
an implementation of common functionality, such as I/O, numerics and timing,
Makefile
a Makefile that should work directly on Stampede (see the note below on switching it to Hopper),
autograder.cpp
a code to help determine performance and grading,
job-hopper-serial, job-hopper-pthreads16, job-hopper-openmp16, job-hopper-mpi16
sample batch files to launch jobs on Hopper. Use qsub to submit on Hopper. Use these files to check correctness
auto-hopper-serial, auto-hopper-pthreads16, auto-hopper-openmp16, auto-hopper-mpi16
sample batch files to launch autograder jobs on Hopper. Use qsub to submit on Hopper. Use these files to check performance
job-carver-serial, job-carver-pthreads8, job-carver-openmp8, job-carver-mpi8
sample batch files to launch jobs on Carver. Use qsub to submit on Carver. Use these files to check correctness
auto-carver-serial, auto-carver-pthreads8, auto-carver-openmp8, auto-carver-mpi8
sample batch files to launch autograder jobs on Carver. Use qsub to submit on Carver. Use these files to check performance
job-stampede-serial, job-stampede-pthreads16, job-stampede-openmp16, job-stampede-mpi16
sample batch files to launch jobs on Stampede. Use sbatch to submit on Stampede. Use these files to check correctness
auto-stampede-serial, auto-stampede-pthreads16, auto-stampede-openmp16, auto-stampede-mpi16
sample batch files to launch autograder jobs on Stampede. Use sbatch to submit on Stampede. Use these files to check performance
particles.tgz
all above files in one tarball archive.
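
As a point of reference for the shared memory codes, one simulation step of the naive O(n²) algorithm parallelizes along the following lines. This is a minimal sketch assuming the particle_t type and the apply_force and move helpers from common.h (the exact signatures may differ); it is not the actual contents of openmp.cpp:

    #include "common.h"

    // One time step of the naive O(n^2) simulation, parallelized with OpenMP.
    void step(particle_t *particles, int n) {
        // Compute forces: the iterations are independent, so threads
        // can each take a slice of the particles.
        #pragma omp parallel for
        for (int i = 0; i < n; i++) {
            particles[i].ax = particles[i].ay = 0;
            for (int j = 0; j < n; j++)
                apply_force(particles[i], particles[j]);
        }
        // Move particles only after all forces are in place; the implicit
        // barrier at the end of the first loop guarantees this.
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            move(particles[i]);
    }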

If you wish to build it on other systems, you might need a custom implementation of pthread barrier, such as: pthread_barrier.c, pthread_barrier.h.
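
Such a barrier is classically built from a mutex and a condition variable; a minimal sketch of that construction follows (illustrative names, not necessarily the contents of the provided files):

    #include <pthread.h>

    typedef struct {
        pthread_mutex_t mutex;
        pthread_cond_t  cond;
        unsigned count;   // threads yet to arrive in the current cycle
        unsigned total;   // number of participating threads
        unsigned cycle;   // generation counter, guards against spurious wakeups
    } my_barrier_t;

    void my_barrier_init(my_barrier_t *b, unsigned total) {
        pthread_mutex_init(&b->mutex, NULL);
        pthread_cond_init(&b->cond, NULL);
        b->count = b->total = total;
        b->cycle = 0;
    }

    void my_barrier_wait(my_barrier_t *b) {
        pthread_mutex_lock(&b->mutex);
        unsigned cycle = b->cycle;
        if (--b->count == 0) {              // last thread to arrive
            b->cycle++;                     // release the current cycle
            b->count = b->total;            // reset for reuse
            pthread_cond_broadcast(&b->cond);
        } else {
            while (cycle == b->cycle)       // ignore spurious wakeups
                pthread_cond_wait(&b->cond, &b->mutex);
        }
        pthread_mutex_unlock(&b->mutex);
    }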

You may consider using the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version.

To download all the files directly to Stampede/Hopper you can use the following command:

wget http://www.eecs.berkeley.edu/~carazvan/2013bootcamp/particles.tgz

Note that the Makefile contains 2 sets of configuration options, one for Stampede and one for Hopper, with Stampede (online) being the default. In order for the compilation to go correctly you must comment out/uncomment the appropriate lines within the Makefile.


Login, Compilation and Job submission

Hopper

For onsite attendees an introduction to the NERSC Hopper system will be provided on the first day, and there will be TAs present to help with any issues.

Login to Hopper can be achieved by ssh-ing into:

hopper.nersc.gov

To submit jobs on the Hopper system:

qsub job-hopper-mpi16

To check the status of user train120's running jobs you can use the following command:

qstat -u train120

Note that on Hopper you can go up to 24 cores per node.

Stampede

The easiest way to log in to XSEDE resources is via the Accounts tab in the XSEDE User Portal. To reach this page, log in to the XSEDE User Portal, navigate to the MY XSEDE tab and from there select the Accounts page. All machines you have access to will have a login option next to them that launches OpenSSH via a Java applet. For this assignment we will be using the Stampede supercomputer.

For setting up XSEDE single login and OpenSSH separately from XUP, please check the instructions on doing so within the XSEDE User Portal. Please be aware that most XSEDE resources are not available via direct ssh and all requests must go through the XSEDE single login system (some machines do allow direct login; for more information please check each supercomputer's "System access" page).

You can extract the downloaded directory from the tgz archive with:

tar -xvf particles.tgz

The default Makefile is set up for the default Intel compiler; therefore you can simply type make to compile the code.

To submit jobs on the Stampede system you will use the SLURM batch environment interface, for example:

sbatch job-stampede-mpi16

To check the status of running jobs you can use the following command:

showq -u

Once you have code that gives correct output, you should check its efficiency with the provided autograder by running the command:

sbatch auto-stampede-mpi16

This example will run your MPI code with the "no output" (-no) option on 1, 2, 4, 8 and 16 processors with varying problem sizes to determine strong and weak scaling.

Note that at the moment the Stampede system is very fast in processing jobs, so depending on your problem size the job might finish even before you run the showq command. If your submission succeeded, check the folder for the .stdout output file.

For more details on sbatch commands please see Stampede's documentation page.


Resources


Optional: GPU Source files, CUDA Programming

Consider starting this and other optional parts after you are happy with your submitted code.

An optional part of the assignment is to take advantage of the GPUs attached to Dirac and of the manycore Intel Xeon Phi coprocessor on Stampede via CUDA.

Source Files

NERSC Dirac

Dirac is actually a subset of Carver's nodes, so login can be achieved by ssh-ing into:

carver.nersc.gov

To submit jobs on the Dirac system:

qsub job-dirac-gpu

To check the status of user train120's running jobs you can use the following command:

qstat -u train120

XSEDE Stampede

You should note that the Intel Xeon Phi is NOT a GPU, and the speedups you can achieve will be limited, since it has tens of cores versus the thousands in a GPU. However, the CUDA model will still apply, and the code you write should be nearly identical to a GPU implementation.

To download all the files directly to Stampede/Dirac you can use the following command:

wget http://www.eecs.berkeley.edu/~carazvan/2013bootcamp/particles_gpu.tgz

To load CUDA before compiling on Stampede, use the command:

module load cuda

More information on GPU Programming on Stampede is available at the XSEDE Stampede webpage.


Optional: Other improvements


Help Files

These files are here to help you with the assignment if you get stuck on achieving an O(n) algorithm or on implementing the serial code, since our goal is for you to focus on the parallel implementation.

Theory - a short pdf detailing how to achieve an O(n) algorithm

Serial Code - an implementation that achieves an O(n) algorithm for the problem
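
The key idea in both is spatial binning: divide the square domain into cells of side at least the cutoff, so each particle only needs to consider particles in its own and the eight neighbouring cells. The sketch below shows the force phase under that scheme; the data layout is illustrative rather than the provided code, and size/cutoff are passed in as parameters because common.h may not expose them:

    #include <algorithm>
    #include <vector>
    #include "common.h"   // assumed to provide particle_t and apply_force

    // One O(n) force computation via binning; illustrative, not the provided code.
    void compute_forces(particle_t *particles, int n, double size, double cutoff) {
        int nbins = std::max(1, (int)(size / cutoff));   // cells of side >= cutoff
        double bin_size = size / nbins;
        std::vector<std::vector<int>> bins(nbins * nbins);

        // Assign each particle to its cell: O(n).
        for (int i = 0; i < n; i++) {
            int bx = std::min((int)(particles[i].x / bin_size), nbins - 1);
            int by = std::min((int)(particles[i].y / bin_size), nbins - 1);
            bins[bx + by * nbins].push_back(i);
        }

        // Each particle scans only the 3x3 block of cells around it.
        for (int i = 0; i < n; i++) {
            particles[i].ax = particles[i].ay = 0;
            int bx = std::min((int)(particles[i].x / bin_size), nbins - 1);
            int by = std::min((int)(particles[i].y / bin_size), nbins - 1);
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = bx + dx, ny = by + dy;
                    if (nx < 0 || nx >= nbins || ny < 0 || ny >= nbins)
                        continue;
                    for (int j : bins[nx + ny * nbins])
                        apply_force(particles[i], particles[j]);
                }
        }
        // With the low density guaranteeing O(1) particles per cell,
        // the whole pass is O(n) instead of O(n^2).
    }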