--------------------------------
   LAMMPS ACCELERATOR LIBRARY
--------------------------------

  W. Michael Brown (ORNL)
  Trung Dac Nguyen (ORNL/Northwestern)
  Nitin Dhamankar (Intel)
  Axel Kohlmeyer (Temple)
  Peng Wang (NVIDIA)
  Anders Hafreager (UiO)
  V. Nikolskiy (HSE)
  Maurice de Koning (Unicamp/Brazil)
  Rodolfo Paula Leite (Unicamp/Brazil)
  Steve Plimpton (SNL)
  Inderaj Bains (NVIDIA)

------------------------------------------------------------------------------

This directory has source files to build a library that LAMMPS links against
when using the GPU package.

This library must be built with a C++ compiler along with CUDA, HIP, or OpenCL
before LAMMPS is built, so LAMMPS can link against it.

This library, libgpu.a, provides routines for acceleration of certain
LAMMPS styles and neighbor list builds using CUDA, OpenCL, or ROCm HIP.

Pair styles supported by this library are marked in the list of Pair style
potentials with a "g". See the online version at:

https://docs.lammps.org/Commands_pair.html

In addition, the (plain) pppm kspace style is supported as well.

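For example, the rhodopsin benchmark in the bench directory uses pppm.
Assuming the lmp_omp binary from the Quick Start below, a GPU-accelerated run
of it might look like this (run from the bench directory so the data file is
found):

   cd ../bench
   mpirun -np 4 ../src/lmp_omp -sf gpu -in in.rhodo
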
------------------------------------------------------------------------------
DEVICE QUERY
------------------------------------------------------------------------------
The gpu library includes binaries to check for available GPUs and their
properties. It is a good idea to run this on first use to make sure the
system and build are set up properly. Additionally, the GPU numbering for
specific selection of devices should be taken from this output. The GPU
library may split some accelerators into separate virtual accelerators for
efficient use with MPI.

After building the GPU library, for OpenCL:
   ./ocl_get_devices
for CUDA:
   ./nvc_get_devices
and for ROCm HIP:
   ./hip_get_devices

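As a hedged illustration of using this output: the "-pk gpu N" command-line
switch tells LAMMPS how many of the reported devices to use per node (finer
selection is possible with the gpuID keyword of the package gpu command; see
the LAMMPS documentation):

   mpirun -np 8 ./lmp_omp -in ../bench/in.lj -log none -sf gpu -pk gpu 2

Here eight MPI ranks share two devices per node, which is often faster than
running one rank per device.
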
------------------------------------------------------------------------------
QUICK START
------------------------------------------------------------------------------
OpenCL: Mac without MPI:
   make -f Makefile.mac_opencl -j; cd ../../src/; make mpi-stubs
   make g++_serial -j
   ./lmp_g++_serial -in ../bench/in.lj -log none -sf gpu

OpenCL: Mac with MPI:
   make -f Makefile.mac_opencl_mpi -j; cd ../../src/; make g++_openmpi -j
   mpirun -np $NUM_MPI ./lmp_g++_openmpi -in ../bench/in.lj -log none -sf gpu

OpenCL: Linux with Intel oneAPI:
   make -f Makefile.oneapi -j; cd ../../src; make oneapi -j
   export OMP_NUM_THREADS=$NUM_THREADS
   mpirun -np $NUM_MPI ./lmp_oneapi -in ../bench/in.lj -log none -sf gpu

OpenCL: Linux with MPI:
   make -f Makefile.linux_opencl -j; cd ../../src; make omp -j
   export OMP_NUM_THREADS=$NUM_THREADS
   mpirun -np $NUM_MPI ./lmp_omp -in ../bench/in.lj -log none -sf gpu

NVIDIA CUDA:
   make -f Makefile.cuda_mps -j; cd ../../src; make omp -j
   export CUDA_MPS_LOG_DIRECTORY=/tmp; export CUDA_MPS_PIPE_DIRECTORY=/tmp
   nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
   /usr/bin/nvidia-cuda-mps-control -d
   export OMP_NUM_THREADS=$NUM_THREADS
   mpirun -np $NUM_MPI ./lmp_omp -in ../bench/in.lj -log none -sf gpu
   echo quit | /usr/bin/nvidia-cuda-mps-control

AMD HIP:
   make -f Makefile.hip -j; cd ../../src; make omp -j
   export OMP_NUM_THREADS=$NUM_THREADS
   mpirun -np $NUM_MPI ./lmp_omp -in ../bench/in.lj -log none -sf gpu

------------------------------------------------------------------------------
Installing oneAPI, OpenCL, CUDA, or ROCm
------------------------------------------------------------------------------
The easiest approach is to use the Linux package manager to perform the
installation from the Intel, NVIDIA, etc. repositories. All are available for
free. The oneAPI installation includes Intel-optimized MPI and C++ compilers,
along with many libraries. Alternatively, Intel OpenCL can also be installed
separately from the Intel repository.

NOTE: Installation of the CUDA SDK is not required, only the CUDA toolkit.

See:

https://software.intel.com/content/www/us/en/develop/tools/oneapi/hpc-toolkit.html

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

https://github.com/RadeonOpenCompute/ROCm

------------------------------------------------------------------------------
Build Intro
------------------------------------------------------------------------------

You can type "make lib-gpu" from the src directory to see help on how
to build this library via make commands, or you can do the same thing
by typing "python Install.py" from within this directory, or you can
do it manually by following the instructions below.

Build the library using one of the provided Makefile.* files or create
your own, specific to your compiler and system. For example:

   make -f Makefile.linux_opencl

When you are done building this library, two files should
exist in this directory:

   libgpu.a           the library LAMMPS will link against
   Makefile.lammps    settings the LAMMPS Makefile will import

Makefile.lammps is created by the make command, by copying one of the
Makefile.lammps.* files. See the EXTRAMAKE setting at the top of the
Makefile.* files.

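For example, the top of a bundled Makefile typically selects the settings file
like this (the exact suffix is illustrative; check the Makefile you use):

   EXTRAMAKE = Makefile.lammps.standard
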
IMPORTANT: You should examine the final Makefile.lammps to ensure it is
correct for your system, else the LAMMPS build can fail.

IMPORTANT: If you re-build the library, e.g. for a different precision
(see below), you should do a "make clean" first, e.g. make -f
Makefile.linux clean, to ensure all previous derived files are removed
before the new build is done.

NOTE: The system-specific setting LAMMPS_SMALLBIG (default) or LAMMPS_BIGBIG
- if specified when building LAMMPS (i.e. in src/MAKE/Makefile.foo) -
should be consistent with that specified when building libgpu.a (i.e.
by LMP_INC in the lib/gpu/Makefile.bar).

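A sketch of that consistency requirement (the real LMP_INC lines usually carry
additional defines):

   # in src/MAKE/Makefile.foo used to build LAMMPS:
   LMP_INC = -DLAMMPS_BIGBIG
   # must match in lib/gpu/Makefile.bar used to build libgpu.a:
   LMP_INC = -DLAMMPS_BIGBIG
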
------------------------------------------------------------------------------
PRECISION MODES
------------------------------------------------------------------------------
The GPU library supports 3 precision modes: single, double, and mixed. Mixed
precision is the default for most Makefiles, aside from the Mac-specific
Makefiles, due to the more restrictive nature of Apple OpenCL on some devices.

To specify the precision mode (printed to the screen before LAMMPS runs for
verification), set CUDA_PRECISION, OCL_PREC, or HIP_PRECISION to one of
-D_SINGLE_SINGLE, -D_DOUBLE_DOUBLE, or -D_SINGLE_DOUBLE.

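For example, a CUDA Makefile might carry the default mixed-precision setting
like this (an illustrative fragment; OCL_PREC and HIP_PRECISION play the same
role in the OpenCL and HIP Makefiles):

   CUDA_PRECISION = -D_SINGLE_DOUBLE    # mixed precision (the usual default)
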
Some accelerators or OpenCL implementations only support single precision.
This mode should be used with care and with appropriate validation, since the
errors can scale with system size in this implementation. It can be useful for
accelerating test runs when setting up a simulation for production runs on
another machine. When only single precision is supported, either LAMMPS must
be compiled with -DFFT_SINGLE to use PPPM with GPU acceleration, or GPU
acceleration should be disabled for PPPM (e.g. with suffix off or pair/only
as described in the LAMMPS documentation).

------------------------------------------------------------------------------
CUDA BUILD NOTES
------------------------------------------------------------------------------
NOTE: When compiling with CMake, all of the considerations listed below are
handled within the CMake configuration process, so no separate compilation of
the gpu library is required. The CMake build also includes support for all
compute architectures that are supported by the CUDA toolkit version used to
build the gpu library. A similar setup is possible using Makefile.linux_multi
after adjusting the settings for the CUDA toolkit in use.

Only CUDA toolkit versions 8.0 and later and only GPU architectures 3.0
(aka Kepler) and later are supported by this version of LAMMPS. If you want
to use older hardware or software, you have to compile for OpenCL or use an
older version of LAMMPS.

If you do not want to use a fat binary that supports multiple CUDA
architectures, CUDA_ARCH must be set to match the GPU architecture. This is
reported by the nvc_get_devices executable created by the build process, and
a detailed list of GPU architectures and CUDA compatible GPUs can be found
e.g. here: https://en.wikipedia.org/wiki/CUDA#GPUs_supported

The CUDA_HOME variable should be set to the location of the CUDA toolkit.

To build, edit the CUDA_ARCH, CUDA_PRECISION, and CUDA_HOME variables in one
of the Makefiles. CUDA_ARCH should be set based on the compute capability of
your GPU. This can be verified by running the nvc_get_devices executable after
the build is complete. Additionally, the GPU package must be installed and
compiled for LAMMPS. This may require editing the gpu_SYSPATH variable in the
LAMMPS makefile.

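As an illustration, these settings might look as follows in a CUDA Makefile;
the values below are assumptions for a compute-capability 8.0 card with the
toolkit installed in /usr/local/cuda, so substitute what nvc_get_devices
reports for your system:

   CUDA_HOME = /usr/local/cuda
   CUDA_ARCH = -arch=sm_80
   CUDA_PRECISION = -D_SINGLE_DOUBLE
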
Please note that the GPU library accesses the CUDA driver library directly,
so it needs to be linked with the CUDA driver library (libcuda.so) that ships
with the NVIDIA driver. If you are compiling LAMMPS on the head node of a GPU
cluster, this library may not be installed, so you may need to copy it over
from one of the compute nodes (best into this directory). Recent CUDA
toolkits, starting with CUDA 9, provide a dummy libcuda.so library (typically
under $(CUDA_HOME)/lib64/stubs) that can be used for linking.

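A minimal sketch of pointing the link path at the stub directory via the
gpu_SYSPATH setting mentioned above (the exact contents of your
Makefile.lammps may differ):

   gpu_SYSPATH = -L$(CUDA_HOME)/lib64 -L$(CUDA_HOME)/lib64/stubs
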
Best performance with the GPU library is typically obtained with multiple MPI
processes sharing the same GPU cards. For NVIDIA, this is most efficient with
CUDA MPS enabled. To prevent runtime errors for GPUs configured in exclusive
process mode with MPS, the GPU library should be built with the
-DCUDA_MPS_SUPPORT flag.

------------------------------------------------------------------------------
HIP BUILD NOTES
------------------------------------------------------------------------------

1. GPU sorting requires installing hipCUB
   (https://github.com/ROCmSoftwarePlatform/hipCUB). The HIP CUDA backend
   additionally requires cub (https://nvlabs.github.io/cub). Download and
   extract the cub directory to lammps/lib/gpu/ or specify an appropriate
   path in lammps/lib/gpu/Makefile.hip.
2. In Makefile.hip it is possible to specify the target platform via
   export HIP_PLATFORM=amd (ROCm >= 4.1), HIP_PLATFORM=hcc (ROCm <= 4.0),
   or HIP_PLATFORM=nvcc, as well as the target architecture (gfx803, gfx900,
   gfx906, etc.). An illustrative build is sketched after this list.
3. If your MPI implementation does not support the `mpicxx --showme` command,
   you must specify the corresponding MPI compiler and linker flags in
   lammps/lib/gpu/Makefile.hip and in lammps/src/MAKE/OPTIONS/Makefile.hip.

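A minimal sketch of item 2 for an AMD gfx906 card with ROCm >= 4.1; the
architecture variable name (HIP_ARCH) is an assumption here, so check
Makefile.hip for the exact name your version uses:

   export HIP_PLATFORM=amd
   make -f Makefile.hip HIP_ARCH=gfx906 -j
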
------------------------------------------------------------------------------
OPENCL BUILD NOTES
------------------------------------------------------------------------------
If GERYON_NUMA_FISSION is defined at build time, LAMMPS will consider separate
NUMA nodes on GPUs or accelerators as separate devices. For example, a
2-socket CPU would appear as two separate devices for OpenCL (and LAMMPS would
require two MPI processes to use both sockets with the GPU library - each with
its own device ID as output by ocl_get_devices). OpenCL version 1.2 or later
is required.

For a debug build, use "-DUCL_DEBUG -DGERYON_KERNEL_DUMP" and remove
"-DUCL_NO_EXIT" and "-DMPI_GERYON" from the build options.

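For instance, assuming the 2-socket CPU case above is exposed as two devices
by ocl_get_devices, a run would need at least one MPI rank per virtual device:

   mpirun -np 2 ./lmp_omp -in ../bench/in.lj -log none -sf gpu -pk gpu 2
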
------------------------------------------------------------------------------
ALL PREPROCESSOR OPTIONS (For Advanced Users)
------------------------------------------------------------------------------
_SINGLE_SINGLE        Build library for single precision mode
_SINGLE_DOUBLE        Build library for mixed precision mode
_DOUBLE_DOUBLE        Build library for double precision mode
CUDA_MPS_SUPPORT      Do not generate errors for exclusive mode for CUDA
MPI_GERYON            Library should use MPI_Abort for unhandled errors
GERYON_NUMA_FISSION   Accelerators with main memory NUMA are split into
                      multiple virtual accelerators for each NUMA node
LAL_USE_OMP=0         Disable OpenMP in lib, regardless of compiler setting
LAL_USE_OMP_SIMD=0    Disable OpenMP SIMD in lib, regardless of compiler
                      setting
GERYON_OCL_FLUSH      For OpenCL, flush queue after every enqueue
LAL_NO_OCL_EV_JIT     Turn off JIT specialization for kernels in OpenCL
LAL_USE_OLD_NEIGHBOR  Use old neighbor list algorithm
USE_CUDPP             Enable GPU binning in neighbor builds (not recommended)
USE_HIP_DEVICE_SORT   Enable GPU binning for HIP builds
                      (only w/ LAL_USE_OLD_NEIGHBOR)
LAL_NO_BLOCK_REDUCE   Use host for energy/virial accumulation
LAL_OCL_EXTRA_ARGS    Supply extra args for OpenCL compiler delimited with :
UCL_NO_EXIT           LAMMPS should handle errors instead of Geryon lib
UCL_DEBUG             Debug build for Geryon
GERYON_KERNEL_DUMP    Dump all compiled OpenCL programs with compiler
                      flags and build logs
GPU_CAST              Casting performed on GPU, untested recently
THREE_CONCURRENT      Concurrent 3-body calcs in separate queues, untested
LAL_SERIALIZE_INIT    Force serialization of initialization and compilation
                      for multiple MPI tasks sharing the same accelerator.
                      Some accelerator API implementations have had issues
                      with temporary file conflicts in the past.
LAL_DISABLE_PREFETCH  Disable prefetch in kernels
GERYON_FORCE_SHARED_MAIN_MEM_ON   Should only be used for builds where the
                                  accelerator is guaranteed to share physical
                                  main memory with the host (e.g. integrated
                                  GPU or CPU device). Default behavior is to
                                  auto-detect. Impacts OpenCL only.
GERYON_FORCE_SHARED_MAIN_MEM_OFF  Should only be used for builds where the
                                  accelerator is guaranteed to have discrete
                                  physical main memory vs the host (discrete
                                  GPU card). Default behavior is to
                                  auto-detect. Impacts OpenCL only.

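These symbols are passed to the compiler as -D defines from the Makefile used
to build the library. A minimal sketch, assuming the extra defines are
appended to the LMP_INC variable of the lib/gpu Makefile (which Makefile
variable carries them can differ, so check the one you use):

   LMP_INC = -DLAMMPS_SMALLBIG -DLAL_USE_OLD_NEIGHBOR -DLAL_SERIALIZE_INIT
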
------------------------------------------------------------------------------
References for Details
------------------------------------------------------------------------------

Brown, W.M., Wang, P., Plimpton, S.J., Tharrington, A.N. Implementing
Molecular Dynamics on Hybrid High Performance Computers - Short Range
Forces. Computer Physics Communications. 2011. 182: p. 898-911.

and

Brown, W.M., Kohlmeyer, A., Plimpton, S.J., Tharrington, A.N. Implementing
Molecular Dynamics on Hybrid High Performance Computers - Particle-Particle
Particle-Mesh. Computer Physics Communications. 2012. 183: p. 449-459.

and

Brown, W.M., Masako, Y. Implementing Molecular Dynamics on Hybrid High
Performance Computers - Three-Body Potentials. Computer Physics
Communications. 2013. 184: p. 2785-2793.