From 24763bfd8e8c8f46294736b1a13d9ce9b37500c9 Mon Sep 17 00:00:00 2001
From: Axel Kohlmeyer
Date: Wed, 8 Jan 2025 12:16:37 -0500
Subject: [PATCH] add info on how to debug if LAMMPS seems stuck

---
 doc/src/Errors_debug.rst | 93 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/doc/src/Errors_debug.rst b/doc/src/Errors_debug.rst
index cc28273aa3..61fb1f7525 100644
--- a/doc/src/Errors_debug.rst
+++ b/doc/src/Errors_debug.rst
@@ -235,3 +235,96 @@
 from GDB. In addition you get a more specific hint about what caused the
 segmentation fault, i.e. that it is a NULL pointer dereference. To find
 out which pointer exactly was NULL, you need to use the debugger, though.
+
+Debugging when LAMMPS appears to be stuck
+=========================================
+
+Sometimes a LAMMPS calculation appears to be stuck; that is, the LAMMPS
+process or processes are active, but there is no visible progress.
+This can have multiple causes:
+
+- The selected styles are slow and require a lot of CPU time and the
+  system is large.  When extrapolating the expected speed from smaller
+  systems, one has to factor in that not all models scale linearly
+  with system size, e.g. :doc:`kspace styles like ewald or pppm
+  <kspace_style>`.  There is very little that can be done in this case.
+- The output interval is not set or is set to a large value with the
+  :doc:`thermo <thermo>` command.  In the first case, there will be
+  output only at the first and the last step.
+- The output is block-buffered instead of line-buffered, so it is only
+  written out after 4096 or 8192 characters of output have accumulated.
+  This most often happens for files, but also with MPI parallel
+  executables for output to the screen, since the output to the screen
+  is handled by the MPI library so that output from all processes can
+  be shown.  This can be suppressed by using the ``-nonbuf`` or ``-nb``
+  command-line flag, which turns off buffering for screen and logfile
+  output.
+- An MPI parallel calculation has a bug where a collective MPI function
+  (e.g. ``MPI_Barrier()``, ``MPI_Bcast()``, ``MPI_Allreduce()``, and so
+  on) is called before pending point-to-point communications have
+  completed, or where the collective function is called from only a
+  subset of the MPI processes (see the sketch at the end of this
+  section).  This also applies to some internal LAMMPS functions like
+  ``Error::all()``, which uses ``MPI_Barrier()``; thus ``Error::one()``
+  must be called if the error condition does not occur on all MPI
+  processes simultaneously.
+- Some function in LAMMPS has a bug where a ``for`` or ``while`` loop
+  never triggers its exit condition and thus loops forever.  This can
+  happen when the wrong variable is incremented or when one value in a
+  comparison becomes ``NaN`` due to an overflow.
+
+In the latter two cases, further information and stack traces (see
+above) can be obtained by attaching a debugger to a running process.
+For that the process ID (PID) is needed; on Linux machines it can be
+found with the ``top``, ``htop``, ``ps``, or ``pstree`` commands.
+
+Then running the (GNU) debugger ``gdb`` with the ``-p`` flag followed
+by the process ID will attach the debugger to that process and stop
+its execution.  From there on it is possible to issue all debugger
+commands in the same way as when LAMMPS was started from the debugger
+(see above).  Most importantly, it is possible to obtain a stack trace
+with the ``where`` command and thus determine where in the execution
+of a timestep this process is.
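+
+For illustration, attaching to a running LAMMPS process and requesting
+a stack trace could look as follows (the PID 12345 is a placeholder
+for the actual process ID):
+
+.. code-block:: console
+
+   $ gdb -p 12345
+   (gdb) where
+   [... stack trace output ...]
+   (gdb) detach
+   (gdb) quit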
+
+Internal data can also be printed, and execution can be single-stepped
+or continued.  When the debugger is detached and exited, the
+calculation will resume normally.
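+
+As an illustration of the MPI deadlock case above, the following
+minimal MPI program (not LAMMPS code) hangs, because a collective
+function is called from only a subset of the processes:
+
+.. code-block:: c++
+
+   // BUG demo: MPI_Barrier() is a collective call and must be entered
+   // by *all* ranks of the communicator; calling it on a subset of
+   // the ranks makes those ranks wait forever.
+   #include <mpi.h>
+
+   int main(int argc, char **argv)
+   {
+     MPI_Init(&argc, &argv);
+     int rank;
+     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+     // pretend that only rank 0 detects an error condition
+     if (rank == 0) {
+       // deadlock: the other ranks never reach this barrier
+       MPI_Barrier(MPI_COMM_WORLD);
+     }
+
+     MPI_Finalize();
+     return 0;
+   }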