From ec3f68bed2b62a9f8f44fa19fbedbb7c198ce8d4 Mon Sep 17 00:00:00 2001
From: sjplimp

10.2 GPU package

The GPU package was developed by Mike Brown at ORNL.

A few LAMMPS pair styles can be run on graphics processing units (GPUs). We plan to add more over time. Currently, they only support NVIDIA GPU cards. To use them you need to install certain NVIDIA CUDA software on your system.

When using GPUs, you are restricted to one physical GPU per LAMMPS process. Multiple processes can share a single GPU, and in many cases it will be more efficient to run with multiple processes per GPU. Any GPU accelerated style requires that fix gpu be used in the input script to select and initialize the GPUs. The styles can be built in single or double precision; the latter requires that your GPU card supports double precision.

Additional requirements in your input script to run the styles with a gpu suffix are as follows:
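As an illustration, a minimal input-script fragment is sketched below; the fix ID, group, and fix gpu arguments shown are placeholders for this sketch, not a definitive syntax — consult the fix gpu documentation for the exact form:

```
# Select and initialize the GPUs before any gpu-suffixed style is used
fix          0 all gpu force/neigh 0 1 1.0    # arguments are illustrative only

# Compute the pair interactions with a gpu-suffixed style
pair_style   lj/cut/gpu 2.5
```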
The USER-CUDA package was developed by Christian Trott at U Technology Ilmenau in Germany.

This package will only be of use to you if you have a CUDA(tm)-enabled NVIDIA(tm) graphics card. Your GPU needs to support Compute Capability 1.3. This list may help you to find out the Compute Capability of your card:
http://en.wikipedia.org/wiki/Comparison_of_Nvidia_graphics_processing_units
Install the Nvidia Cuda Toolkit, version 3.2 or higher, and the corresponding GPU drivers. The Nvidia Cuda SDK is not required for LAMMPSCUDA, but we recommend installing it and making sure that the sample projects can be compiled without problems.
You should also be able to compile LAMMPS by typing

make YourMachine

inside the src directory of the LAMMPS root path. If not, you should consult the LAMMPS documentation.

If your CUDA toolkit is not installed in the default directory /usr/local/cuda, edit the file lib/cuda/Makefile.common accordingly.
Go to lib/cuda/ and type

make OPTIONS

where OPTIONS are one or more of the following:

The settings will be written to lib/cuda/Makefile.defaults. When compiling with make, only those settings will be used.
Go to src, install the USER-CUDA package with make yes-USER-CUDA, and compile the binary with make YourMachine. You might need to delete old object files if you previously compiled without the USER-CUDA package using the same machine file (rm Obj_YourMachine/*).
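Taken together, the steps above form a short build sequence. A sketch, assuming a machine file named YourMachine and OPTIONS as described above (adapt both to your setup):

```
cd lib/cuda
make OPTIONS             # settings are saved to Makefile.defaults
cd ../../src
make yes-USER-CUDA       # install the USER-CUDA package sources
rm -f Obj_YourMachine/*  # clear stale objects from a previous non-CUDA build
make YourMachine         # compile the LAMMPS binary
```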
CUDA versions of classes are only installed if the corresponding CPU versions are installed as well. E.g., you need to install the KSPACE package to use pppm/cuda.
In order to make use of the GPU acceleration provided by the USER-CUDA package, you only have to add

accelerator cuda

at the top of your input script. See the accelerator command for details of additional options.
When compiling with USER-CUDA support, the -accelerator command-line switch is effectively set to "cuda" by default and does not have to be given.

If you want to run simulations without using the "cuda" styles with the same binary, you need to turn it off explicitly by giving "-a none", "-a opt", or "-a gpu" as a command-line argument.
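For example, a launch line that disables the "cuda" styles might look like the following; the binary name and input file are placeholders:

```
mpirun -np 4 ./lmp_YourMachine -a none -in in.yourscript
```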
The kspace style pppm/cuda has to be requested explicitly.
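A minimal sketch of such an explicit request in an input script; the accuracy value here is illustrative:

```
kspace_style pppm/cuda 1.0e-4
```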
The USER-CUDA package is an alternative package for GPU acceleration that runs as much of the simulation as possible on the GPU. Depending on the simulation, this can provide a significant speedup when the number of atoms per GPU is large.

The styles available for GPU acceleration will be different in each package.
The main difference between the GPU and the USER-CUDA package is that while the latter aims at calculating everything on the device, the GPU package uses the device as an accelerator for the pair force, neighbor list, and pppm calculations only. As a consequence, in different scenarios either package can be faster. Generally, the GPU package is faster than the USER-CUDA package if the number of atoms per device is small. The GPU package also profits from oversubscribing devices; hence one usually wants to launch two (or more) MPI processes per device.
The exact crossover where the USER-CUDA package becomes faster depends strongly on the pair style. For example, for a simple Lennard-Jones system the crossover (in single precision) can often be found between 50,000 and 100,000 atoms per device. When performing double precision calculations, this threshold can be significantly smaller. As a result, the GPU package can show better "strong scaling" behaviour in comparison with the USER-CUDA package as long as this limit of atoms per GPU is not reached.
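The rule of thumb above can be applied as a quick sizing check. A sketch for a hypothetical run of a 256,000-atom system spread over 8 devices, using the lower single-precision crossover figure (the exact threshold varies with pair style and precision):

```shell
atoms=256000
devices=8
per_device=$((atoms / devices))
echo "atoms per device: $per_device"
# Below the ~50,000-atom single-precision crossover, the GPU package
# would typically be the better choice; above it, USER-CUDA.
if [ "$per_device" -lt 50000 ]; then
  echo "below crossover: GPU package likely faster"
else
  echo "above crossover: USER-CUDA package likely faster"
fi
```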
Another scenario where the GPU package can be faster is when a lot of bonded interactions are calculated. Those are handled by both packages on the host, while the device simultaneously calculates the pair forces. Since, when using the GPU package, one launches several MPI processes per device, this work is spread over more CPU cores compared to running the same simulation with the USER-CUDA package.
As a side note: the GPU package performance depends to some extent on optimal bandwidth between host and device. Hence its performance suffers if fewer than the full 16 PCIe lanes are available for each device. In HPC environments this can be the case if S2050/70 servers are used, where two devices generally share one PCIe 2.0 16x slot. Also, many multi-GPU mainboards do not provide full 16 lanes to each of the PCIe 2.0 16x slots.
While the GPU package uses considerably more device memory than the USER-CUDA package, this is generally not much of a problem. Typically, run times become longer than desired before the memory is exhausted.
Currently the USER-CUDA package supports a wider range of force fields. On the other hand, its performance is considerably reduced if one has to use, at every timestep, a fix that is not yet available in a "CUDA"-accelerated version.
In the end, for each simulation it is best to just try both packages and see which one performs better in the particular situation.
In the following, 4 benchmark systems which are supported by both the GPU and the USER-CUDA package are shown:
1. Lennard-Jones, 2.5 A
   256,000 atoms
   2.5 A cutoff
   0.844 density

2. Lennard-Jones, 5.0 A
   256,000 atoms
   5.0 A cutoff
   0.844 density

3. Rhodopsin model
   256,000 atoms
   10 A cutoff
   Coulomb via PPPM

4. Lithium-Phosphate
   295,650 atoms
   15 A cutoff
   Coulomb via PPPM
Hardware:

Workstation:
   2x GTX 470
   i7 950 @ 3 GHz
   24 GB DDR3 @ 1066 MHz
   CentOS 5.5
   CUDA 3.2
   Driver 260.19.12

eStella:
   6 nodes
   2x C2050
   2x QDR InfiniBand interconnect (aggregate bandwidth 80 Gbps)
   Intel X5650 hex-core @ 2.67 GHz
   SL 5.5
   CUDA 3.2
   Driver 260.19.26

Keeneland:
   HP SL-390 (Ariston) cluster
   120 nodes
   2x Intel Westmere hex-core CPUs
   3x C2070s
   QDR InfiniBand interconnect