Your goal is to parallelize, on XSEDE's Stampede (online) or NERSC's Hopper (on-site), a toy particle simulator (similar particle simulators are used in mechanics, biology, astronomy, etc.) that reproduces the behaviour shown in the following animation:
The range of interaction forces (cutoff) is limited as shown in grey for a selected particle. Density is set sufficiently low so that given n particles, only O(n) interactions are expected.
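The cutoff test is the key to the O(n) expectation: pairs farther apart than the cutoff contribute nothing and can be skipped entirely. A minimal sketch of such a short-range force routine is below; the struct fields, the `cutoff` value, and the force law are illustrative stand-ins, not the provided starter code's definitions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical particle type and cutoff; the real definitions live in the
// starter code's headers and may differ.
struct particle_t { double x, y, ax, ay; };
const double cutoff = 0.01;

// Accumulate the force from `neighbor` onto `p` only when the pair lies
// inside the cutoff radius (the grey circle in the animation).
void apply_force(particle_t &p, const particle_t &neighbor) {
    double dx = neighbor.x - p.x;
    double dy = neighbor.y - p.y;
    double r2 = dx * dx + dy * dy;
    if (r2 > cutoff * cutoff) return;       // outside the cutoff: no interaction
    r2 = fmax(r2, 1e-12);                   // guard against division by zero
    double r = sqrt(r2);
    double coef = (1.0 - cutoff / r) / r2;  // placeholder repulsive force law
    p.ax += coef * dx;
    p.ay += coef * dy;
}
```

Because the density is kept low, each particle has O(1) neighbors inside the cutoff, which is what makes an O(n) total step possible.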
Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time T/p when using p processors. We'd like you to write parallel codes that approach these expectations.
A simple correctness check that computes the minimal distance between two particles over the entire simulation is provided. In a correct simulation, particles stay at distances greater than 0.4 (of the cutoff), with typical values between 0.7 and 0.8. In a simulation where particles don't interact correctly, the distance drops below 0.4 (of the cutoff), with typical values between 0.01 and 0.05. More details, as well as an average distance, are described in the source file.
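For intuition, the quantity being checked can be sketched as a brute-force pass over one snapshot: the minimum pairwise distance expressed as a fraction of the cutoff. This is an illustrative O(n²) version for small n; the provided checker instead accumulates the statistic incrementally from the interaction-function calls.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

const double cutoff = 0.01;   // assumed cutoff scale, for illustration only

struct pos_t { double x, y; };

// Return the minimum pairwise distance divided by the cutoff.
// A correct run keeps this above roughly 0.4.
double min_dist_fraction(const std::vector<pos_t> &ps) {
    double best = 1e30;
    for (size_t i = 0; i < ps.size(); i++)
        for (size_t j = i + 1; j < ps.size(); j++) {
            double dx = ps[i].x - ps[j].x, dy = ps[i].y - ps[j].y;
            best = fmin(best, sqrt(dx * dx + dy * dy));
        }
    return best / cutoff;
}
```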
The code we are providing will do this distance check based on the calls to the interaction function, while the full autograder will do the checks based on the output txt file. We'd recommend keeping the correctness checker in your code (for the OpenMP, Pthreads, and MPI codes the overhead isn't significant), but depending on performance desires it can be deleted as long as the simulation can still output a correct txt file.
The performance of the code is determined by doing multiple runs of the simulation with increasing particle counts and comparing the performance against the provided autograder.cpp. This can be done automatically with the auto-* scripts.
There will be two types of scaling that are tested for your parallel codes:
For more details on the options provided by each file you can use the -h flag on the compiled executables.
Besides the text summary outputs, we also provide a visualization tool that generates a picture similar to the animation, allowing you to analyze the txt output file. The relevant files can be found in the Source Files below under visualization.
While the scripts we are providing use small numbers of particles (500-1000) to allow the O(n²) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance.
Your grade will depend on the performance and efficiency sustained by your codes on Stampede/Hopper and will be broken into 3 sections (Serial - 30%, OpenMP or Pthreads - 35%, MPI - 35%):
* While we encourage you to try writing your own serial algorithm, since personal code will be easier to parallelize, we also know that time is limited, so we are providing a sample serial O(n) implementation for you to use in the Help Files below
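For the OpenMP portion, the usual starting point is to parallelize the per-particle force loop, since each particle's accumulators are written by exactly one iteration. The sketch below is a hypothetical illustration of that pattern; `particle_t`, the loop structure, and the force law are stand-ins for the starter code's, and the inner loop is still the naive O(n²) version you are expected to improve.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct particle_t { double x, y, ax, ay; };

// Parallelize over particles: iteration i only writes parts[i], so the
// outer loop needs no locks. (Compile with -fopenmp to enable the pragma.)
void compute_forces(std::vector<particle_t> &parts) {
    #pragma omp parallel for
    for (int i = 0; i < (int)parts.size(); i++) {
        parts[i].ax = parts[i].ay = 0;
        for (size_t j = 0; j < parts.size(); j++) {
            if ((size_t)i == j) continue;
            double dx = parts[j].x - parts[i].x;
            double dy = parts[j].y - parts[i].y;
            double r2 = dx * dx + dy * dy + 1e-12;
            double coef = 1.0 / r2;      // placeholder force law
            parts[i].ax += coef * dx;
            parts[i].ay += coef * dy;
        }
    }
}
```

The same decomposition carries over to Pthreads by assigning each thread a contiguous range of `i` values and barriering between the force and move phases.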
Let's assume that John Student completes the assignment and has:
His grades per each section would be:
His total grade = Serial * 0.3 + MAX(OpenMP, Pthreads) * 0.35 + MPI * 0.35 = 100 * 0.3 + 95.83 * 0.35 + 67.5 * 0.35 = 87.1655
You may start with the serial and parallel implementations supplied below. All of them run in O(n²) time, which is unacceptably inefficient.
If you wish to build it on other systems, you might need a custom implementation of pthread barrier, such as: pthread_barrier.c, pthread_barrier.h.
You may consider using the following visualization program to check the correctness of the result produced by your code: Linux/Mac version (requires SDL), Windows version.
To download all the files directly to Stampede/Hopper you can use the following command:
wget http://www.eecs.berkeley.edu/~carazvan/2013bootcamp/particles.tgz
Note that the Makefile contains 2 sets of configuration options, one for Stampede and one for Hopper, with Stampede (online) being the default. In order for the compilation to go correctly you must comment out/uncomment the appropriate lines within the Makefile.
Hopper
For on-site attendees an introduction to the NERSC Hopper system will be provided on the first day, and there will be TAs present to help with any issues.
Login to Hopper can be achieved by ssh-ing into:
hopper.nersc.gov
To submit jobs on the Hopper system:
qsub job-hopper-mpi16
To check the status of user train120's running jobs you can use the following command:
qstat -u train120
Note that on Hopper you can use up to 24 cores per node.
Stampede
The easiest way to log in to XSEDE resources is via the Accounts tab in the XSEDE User Portal. To reach this page, log in to the XSEDE User Portal, navigate to the MY XSEDE tab, and from there select the Accounts page. All machines you have access to will have a login option next to them that will launch OpenSSH via a Java Applet. For this assignment we will be using the Stampede supercomputer.
For setting up XSEDE single login and OpenSSH separately from XUP, please check the instructions within the XSEDE User Portal. Please be aware that most XSEDE resources are not available via direct ssh and all requests must go through the XSEDE single login system (some machines do allow direct login; for more information please check each supercomputer's "System access" page).
You can extract the downloaded directory from the tgz archive with:
tar -xvf particles.tgz
The default Makefile uses the default Intel compiler; therefore you can simply type make to compile the code.
To submit jobs on the Stampede system you will use the SLURM batch environment interface, for example:
sbatch job-stampede-mpi16
To check the status of running jobs you can use the following command:
showq -u
Once you have code that gives correct output you should check its efficiency with the provided autograder by running the command:
sbatch auto-stampede-mpi16
This example will run your MPI code with the "no output" (-no) option and perform 1-, 2-, 4-, 8-, and 16-processor runs of your parallel code with varying problem sizes to determine strong and weak scaling.
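As a reminder of what those two measurements mean: strong scaling fixes the problem size and increases p, while weak scaling grows n in proportion to p so the work per processor stays constant. A minimal sketch of the two efficiency formulas, with illustrative (not measured) timings in the test, is:

```cpp
#include <cassert>

// t1 = wall-clock time on 1 processor, tp = time on p processors.
// Strong scaling: same n on more processors; ideal is tp = t1 / p.
double strong_efficiency(double t1, double tp, int p) { return t1 / (p * tp); }

// Weak scaling: n grows with p (fixed n per processor); ideal is tp = t1.
double weak_efficiency(double t1, double tp) { return t1 / tp; }
```

Efficiencies near 1.0 mean your code is approaching the T/p expectation stated earlier; the autograder compares your sustained performance against this ideal.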
Note that at the moment the Stampede system is very fast in processing jobs; depending on your problem size, the job might finish even before you run the showq command. If the submission was successful, check the folder for the .stdout output file.
For more details on sbatch commands please see Stampede's documentation page.
Consider starting this and other optional parts after you are happy with your submitted code.
An optional part of the assignment is to take advantage of the attached GPUs on Dirac and of the manycore Intel Xeon Phi Coprocessor on Stampede via CUDA.
Source Files
We will provide a naive O(n²) GPU implementation, similar to the OpenMP, Pthreads, and MPI codes you started from. It will be your task to make the necessary algorithmic changes and machine optimizations to achieve favorable performance across a range of problem sizes.
autocorrect
We are also providing an executable, prebuilt on Stampede, that can be used to check the correctness of a program by verifying the txt output file. This program runs at serial O(n) speed and reports the minimum and maximum distance between particles during the run. (It can be run on any txt output file from a simulation with small-sized tests, n < 10000.)
autocorrectDirac
We are also providing an executable, prebuilt on Dirac, that can be used to check the correctness of a program by verifying the txt output file. This program runs at serial O(n) speed and reports the minimum and maximum distance between particles during the run. (It can be run on any txt output file from a simulation with small-sized tests, n < 10000.)
NERSC Dirac
Dirac is actually a subset of Carver nodes; therefore login can be achieved by ssh-ing into:
carver.nersc.gov
To submit jobs on the Dirac system:
qsub job-dirac-gpu
To check the status of user train120's running jobs you can use the following command:
qstat -u train120
XSEDE Stampede
You should note that the Intel Xeon Phi is NOT a GPU, and the speedups you can achieve will be limited, since it has tens of cores versus thousands in a GPU. However, the CUDA model will still apply, and the code you write should be nearly identical to a GPU implementation.
To download all the files directly to Stampede/Dirac you can use the following command:
wget http://www.eecs.berkeley.edu/~carazvan/2013bootcamp/particles_gpu.tgz
In order to load cuda before compiling on Stampede use the command:
module load cuda
More information on GPU Programming on Stampede is available at the XSEDE Stampede webpage.
These files are here to help you with the assignment if you get stuck on achieving an O(n) algorithm or on implementing the serial code, since our goal is for you to focus on the parallel implementation.
Theory - a short PDF detailing how to achieve an O(n) algorithm
Serial Code - an implementation that achieves an O(n) algorithm for the problem
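The standard route to an O(n) step, and the one the theory PDF is about, is spatial binning: partition the square domain into cells at least one cutoff wide, so each particle only needs to examine its own cell and the 8 surrounding ones. The sketch below shows just the binning step; the names, the flat-array layout, and the parameters are illustrative, not those of the provided serial code.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Cells of side >= cutoff, stored as a flat ncells x ncells grid of
// index lists. With low density, each cell holds O(1) particles, so a
// full force pass over 3x3 neighborhoods costs O(n) total.
struct bins_t {
    int ncells;                         // cells per side
    double cell;                        // cell width
    std::vector<std::vector<int>> bin;  // particle indices per cell
};

bins_t build_bins(const std::vector<double> &xs, const std::vector<double> &ys,
                  double size, double cutoff) {
    bins_t b;
    b.ncells = (int)fmax(1.0, floor(size / cutoff));
    b.cell = size / b.ncells;
    b.bin.assign((size_t)b.ncells * b.ncells, {});
    for (size_t i = 0; i < xs.size(); i++) {
        // clamp to the last cell so particles on the far edge stay in range
        int cx = (int)fmin(xs[i] / b.cell, (double)(b.ncells - 1));
        int cy = (int)fmin(ys[i] / b.cell, (double)(b.ncells - 1));
        b.bin[(size_t)cy * b.ncells + cx].push_back((int)i);
    }
    return b;
}
```

Rebuilding (or incrementally updating) the bins each timestep and restricting `apply_force` to the 3x3 cell neighborhood is what turns the O(n²) starter codes into O(n) ones; the same decomposition also gives a natural domain partition for the MPI version.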