From 24763bfd8e8c8f46294736b1a13d9ce9b37500c9 Mon Sep 17 00:00:00 2001
From: Axel Kohlmeyer
Date: Wed, 8 Jan 2025 12:16:37 -0500
Subject: [PATCH] add info on how to debug if LAMMPS seems stuck

---
 doc/src/Errors_debug.rst | 93 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)

diff --git a/doc/src/Errors_debug.rst b/doc/src/Errors_debug.rst
index cc28273aa3..61fb1f7525 100644
--- a/doc/src/Errors_debug.rst
+++ b/doc/src/Errors_debug.rst
@@ -235,3 +235,96 @@
 from GDB. In addition you get a more specific hint about what caused the
 segmentation fault, i.e. that it is a NULL pointer dereference. To find
 out which pointer exactly was NULL, you need to use the debugger, though.
+
+Debugging when LAMMPS appears to be stuck
+=========================================
+
+Sometimes a LAMMPS calculation appears to be stuck; that is, the LAMMPS
+process or processes are active, but there is no visible progress.
+This can have multiple causes:
+
+- The selected styles are slow and require a lot of CPU time and the
+  system is large.  When extrapolating the expected speed from smaller
+  systems, one has to factor in that not all models scale linearly
+  with system size, e.g. :doc:`kspace styles like ewald or pppm
+  <kspace_style>`.  There is very little that can be done in this case.
+- The output interval is not set or is set to a large value with the
+  :doc:`thermo <thermo>` command.  In the first case, there will be
+  output only at the first and the last step.
+- The output is block-buffered instead of line-buffered, so it is only
+  written out after 4096 or 8192 characters of output have accumulated.
+  This most often happens for files, but also with MPI parallel
+  executables for output to the screen, since the output to the screen
+  is handled by the MPI library so that output from all processes can
+  be shown.  This can be suppressed by using the ``-nonbuf`` or ``-nb``
+  command-line flag, which turns off buffering for screen and logfile
+  output.
+- An MPI parallel calculation has a bug where a collective MPI function
+  (e.g. ``MPI_Barrier()``, ``MPI_Bcast()``, ``MPI_Allreduce()``, and so
+  on) is called before pending point-to-point communications have
+  completed, or where the collective function is called from only a
+  subset of the MPI processes (see the sketch at the end of this
+  section).  This also applies to some internal LAMMPS functions like
+  ``Error::all()``, which uses ``MPI_Barrier()``; thus ``Error::one()``
+  must be called if the error condition does not occur on all MPI
+  processes simultaneously.
+- Some function in LAMMPS has a bug where a ``for`` or ``while`` loop
+  never triggers its exit condition and thus loops forever.  This can
+  happen when the wrong variable is incremented or when one value in a
+  comparison becomes ``NaN`` due to an overflow.
+
+In the latter two cases, further information and stack traces (see
+above) can be obtained by attaching a debugger to a running process.
+For that the process ID (PID) is needed; on Linux machines it can be
+found with the ``top``, ``htop``, ``ps``, or ``pstree`` commands.
+
+Then running the (GNU) debugger ``gdb`` with the ``-p`` flag followed
+by the process ID will attach the debugger to that process and stop
+its execution.  From there on it is possible to issue all debugger
+commands in the same way as when LAMMPS was started from the debugger
+(see above).  Most importantly, it is possible to obtain a stack trace
+with the ``where`` command and thus determine where in the execution
+of a timestep this process is.
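+
+For illustration, attaching to a running LAMMPS process and requesting
+a stack trace could look as follows (the PID 12345 is a placeholder
+for the actual process ID):
+
+.. code-block:: console
+
+   $ gdb -p 12345
+   (gdb) where
+   [... stack trace output ...]
+   (gdb) detach
+   (gdb) quit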
+
+Internal data can also be printed, and execution can be single-stepped
+or continued.  When the debugger is detached and exited, the
+calculation will resume normally.
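+
+As an illustration of the MPI deadlock case above, the following
+minimal MPI program (not LAMMPS code) hangs, because a collective
+function is called from only a subset of the processes:
+
+.. code-block:: c++
+
+   // BUG demo: MPI_Barrier() is a collective call and must be entered
+   // by *all* ranks of the communicator; calling it on a subset of
+   // the ranks makes those ranks wait forever.
+   #include <mpi.h>
+
+   int main(int argc, char **argv)
+   {
+     MPI_Init(&argc, &argv);
+     int rank;
+     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+
+     // pretend that only rank 0 detects an error condition
+     if (rank == 0) {
+       // deadlock: the other ranks never reach this barrier
+       MPI_Barrier(MPI_COMM_WORLD);
+     }
+
+     MPI_Finalize();
+     return 0;
+   }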