add info on how to debug if LAMMPS seems stuck
@ -235,3 +235,53 @@ from GDB. In addition you get a more specific hint about what cause the
segmentation fault, i.e. that it is a NULL pointer dereference. To find
out which pointer exactly was NULL, you need to use the debugger, though.

Debugging when LAMMPS appears to be stuck
=========================================

Sometimes a LAMMPS calculation appears to be stuck: the LAMMPS process
or processes are active, but there is no visible progress. This can have
several causes:

- The selected styles are slow and require a lot of CPU time, and the
  system is large. When extrapolating the expected speed from smaller
  systems, one has to factor in that not all models scale linearly with
  system size, e.g. :doc:`kspace styles like ewald or pppm
  <kspace_style>`. There is very little that can be done in this case.
- The output interval is not set or set to a large value with the
  :doc:`thermo <thermo>` command. In the first case, there will be output
  only at the first and last step.
- The output is block-buffered instead of line-buffered. The output
  will only be written to the screen after 4096 or 8192 characters of
  output have accumulated. This most often happens for files, but also
  with MPI parallel executables for output to the screen, since the
  output to the screen is handled by the MPI library so that output from
  all processes can be shown. This can be suppressed by using the
  ``-nonbuf`` or ``-nb`` command-line flag, which turns off buffering
  for screen and logfile output.
- An MPI parallel calculation has a bug where a collective MPI function
  is called (e.g. ``MPI_Barrier()``, ``MPI_Bcast()``,
  ``MPI_Allreduce()`` and so on) before pending point-to-point
  communications are completed, or where the collective function is only
  called from a subset of the MPI processes. This also applies to some
  internal LAMMPS functions like ``Error::all()``, which uses
  ``MPI_Barrier()``; thus ``Error::one()`` must be called instead if the
  error condition does not happen on all MPI processes simultaneously.
- Some function in LAMMPS has a bug where a ``for`` or ``while`` loop
  does not trigger the exit condition and thus will loop forever. This
  can happen when the wrong variable is incremented or when one value in
  a comparison becomes ``NaN`` due to an overflow.

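The last item in the list above can be illustrated with a small
stand-alone C++ sketch. It is not LAMMPS code: the function name
``iterate_until_converged`` and its parameters are hypothetical and
chosen only for illustration. Any ordered comparison with ``NaN``
evaluates to false, so a loop whose only exit test is ``residual < tol``
cannot terminate once ``residual`` is ``NaN``. One possible guard is an
explicit ``std::isfinite()`` check combined with an iteration cap:

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper, not part of LAMMPS. Returns the number of
// iterations performed, or -1 if the residual became non-finite.
int iterate_until_converged(double residual, double tol, int maxiter)
{
  int iter = 0;
  // Without the two guards below, this loop would never exit for a NaN
  // residual, because (residual < tol) is always false for NaN.
  while (!(residual < tol)) {
    if (!std::isfinite(residual)) return -1;  // guard against NaN or Inf
    if (iter >= maxiter) break;               // guard against slow convergence
    residual *= 0.5;                          // stand-in for the real work
    ++iter;
  }
  return iter;
}
```

With a finite starting value the function returns normally, while a call
with ``std::numeric_limits<double>::quiet_NaN()`` as the first argument
returns ``-1`` instead of looping forever.
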
In the latter two cases, further information and stack traces (see above)
can be obtained by attaching a debugger to a running process. For that,
the process ID (PID) is needed; it can be found on Linux machines with
the ``top``, ``htop``, ``ps``, or ``pstree`` commands.

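When many MPI processes run on the same machine, it can be hard to tell
from the process list which PID belongs to which rank. The following
sketch shows one way a program could report this itself at startup. It
is an assumption for illustration only and not something LAMMPS does:
``pid_banner`` is a hypothetical helper, the rank is passed in as a plain
integer (in an MPI code it would come from ``MPI_Comm_rank()``), and the
POSIX functions ``getpid()`` and ``gethostname()`` are assumed to be
available.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <unistd.h>  // POSIX: getpid(), gethostname()

// Hypothetical helper, not part of LAMMPS. Builds a one-line message of
// the form "rank <rank> runs as PID <pid> on host <hostname>".
std::string pid_banner(int rank)
{
  char host[256];
  if (gethostname(host, sizeof(host)) != 0) {
    std::snprintf(host, sizeof(host), "unknown");
  }
  host[sizeof(host) - 1] = '\0';  // gethostname() may not terminate on truncation

  return "rank " + std::to_string(rank) + " runs as PID " +
         std::to_string(getpid()) + " on host " + host;
}
```

Printing the result of ``pid_banner()`` once per process would give the
PID to pass to the debugger for a specific rank.
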
Running the (GNU) debugger ``gdb`` with the ``-p`` flag followed by the
process ID then attaches the debugger to that process and stops its
execution. From there on, it is possible to issue all debugger commands
in the same way as when LAMMPS was started from the debugger (see
above). Most importantly, it is possible to obtain a stack trace with
the ``where`` command and thus determine where in the execution of a
timestep this process is. Internal data can also be printed, and
execution can be single-stepped or continued. When the debugger is
exited, the calculation will resume normally.