Getting 'BUG: soft lockup - CPU#0 stuck for Xs!' in /var/log/messages

Solution Verified - Updated 7 Aug 2024

Environment

Red Hat Enterprise Linux 5, 6, 7, 8, 9

Issue

Any message in /var/log/messages referencing soft lockups like these:

kernel: BUG: soft lockup - CPU#0 stuck for 10s! [bond1:3307]

kernel: BUG: soft lockup - CPU#0 stuck for 67s! [bond1:3307]

Not necessarily specific to CPU#0 or process bond1
The system does not boot (or boots only very slowly); I only see many "BUG: soft lockup" messages. Booting the system with maxcpus=1 or disabling the slots for the fiber HBA cards in the BIOS makes the system boot.
modprobe qla2xxx causes an infinite loop with soft lockups.
I freshly installed a new server and loaded the qla2xxx module with modprobe to activate the HBA in the system. While building this system the qla2xxx module was blacklisted because it already blocked the system at installation time. Loading the module now leads to the soft lockups.
After discovering new SAN LUNs, the system locks up.
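
To see which CPUs and processes the messages name, the log can be filtered with a short script. The following is only a minimal sketch, assuming the message format shown above; summarize_lockups is a hypothetical helper name, not an existing tool.

```shell
# summarize_lockups: hypothetical helper. Reads log lines on stdin and prints
# one "cpu=N secs=N comm=NAME pid=N" line per soft lockup message.
summarize_lockups() {
    sed -n 's/.*BUG: soft lockup - CPU#\([0-9]*\) stuck for \([0-9]*\)s! \[\(.*\):\([0-9]*\)].*/cpu=\1 secs=\2 comm=\3 pid=\4/p'
}

# Count how often each CPU/process pair appears.
# Reading /var/log/messages usually requires root.
if [ -r /var/log/messages ]; then
    summarize_lockups < /var/log/messages | sed 's/ secs=[0-9]*//' | sort | uniq -c
fi
```

If the counts are spread over many CPUs and unrelated processes, that points away from a single misbehaving driver or task.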

Resolution

Investigate whether a misconfiguration causes the issue

Check /etc/grub.conf and /boot/grub/grub.conf in RHEL 6 and below, or /etc/sysconfig/grub from RHEL 7 onwards, to verify whether console output is redirected to a serial console, e.g. using console=ttyS1 or console=ttyS1,9600. In both of these cases the output is restricted to 9600 baud, which throttles console output and can cause issues.
A possible fix is to stop logging to the serial console, or to explicitly configure a higher baud rate, e.g. using console=ttyS1,115200. Please note that in some situations even 115200 baud might be a limiting factor.
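
The console check described above can also be run against the command line of the running kernel. This is only a sketch: list_serial_consoles is a hypothetical helper name, and it assumes the console= arguments follow the ttyS<N>[,<speed>] form shown above.

```shell
# list_serial_consoles: hypothetical helper. Prints "<device> <baud>" for every
# serial console= argument in the kernel command line passed as $1.
list_serial_consoles() {
    for arg in $1; do
        case "$arg" in
            console=ttyS*)
                spec="${arg#console=}"
                case "$spec" in
                    *,*)
                        baud="${spec#*,}"
                        baud="${baud%%[!0-9]*}"   # drop parity/bits, e.g. 115200n8
                        ;;
                    *)
                        baud=9600                 # no speed given: 9600 baud applies
                        ;;
                esac
                echo "${spec%%,*} $baud"
                ;;
        esac
    done
}

# Inspect the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    list_serial_consoles "$(cat /proc/cmdline)"
fi
```

Any line reporting 9600 marks a console that is worth reconfiguring or removing before investigating further.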

Otherwise, investigate further root cause conditions

Determine whether the system was under extremely high load at the time the soft lockups were seen in the logs. If the sysstat package was already installed, it will have recorded the load average every 10 minutes using a cron job.
The load average can be found by searching for ldavg in /var/log/sa/sar<day>, where <day> is the day of the month on which the soft lockups were seen. If the load average is significantly higher than the number of logical CPU cores on the system, the soft lockups probably occurred because of extremely high workloads.
In this case, determine which processes drove the load so high and make changes so that they do not cause the issue again.
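
The comparison of load average against CPU count can be sketched as follows. flag_high_load is a hypothetical helper, the day 07 is only an example, and the script assumes the file contains the usual sar -q table with an ldavg-1 column.

```shell
# flag_high_load: hypothetical helper. Reads sar -q style text on stdin and
# prints the timestamp and 1-minute load average of every sample whose load
# exceeds the CPU count given as $1.
flag_high_load() {
    awk -v ncpu="$1" '
        /ldavg-1/ {
            # Header row: remember which column holds ldavg-1.
            for (i = 1; i <= NF; i++) if ($i == "ldavg-1") col = i
            next
        }
        col && $1 != "Average:" && NF >= col && $col + 0 > ncpu {
            print $1, $col
        }
    '
}

# Example: check the report for day 07 against the number of online CPUs.
if [ -r /var/log/sa/sar07 ]; then
    flag_high_load "$(getconf _NPROCESSORS_ONLN)" < /var/log/sa/sar07
fi
```

Timestamps printed here can then be matched against the times of the soft lockup messages.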
Since defects in the kernel could also have caused the soft lockups, the full logs around the time of the soft lockups need to be investigated to see whether the issue is a bug or is fixed by an erratum. It can help to look in the changelog of the latest kernel available on Red Hat Network and see whether any soft lockup issues were fixed since the version of the installed kernel.
Another way to rule out a known issue that has already been fixed is to run the system with the latest kernel and see whether the soft lockups happen again. Red Hat support may be required to conclusively determine whether the issue is a bug.
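
Searching a kernel changelog for soft lockup fixes could look like the sketch below. grep_lockup_fixes is a hypothetical helper; the rpm call reads the changelog of the installed kernel package(s), so the newest kernel has to be installed (or queried from a package file with rpm -qp) for the comparison to be meaningful.

```shell
# grep_lockup_fixes: hypothetical helper. Filters changelog text on stdin for
# entries that mention soft lockups ("soft lockup" or "softlockup").
grep_lockup_fixes() {
    grep -i -E 'soft ?lockup' || true
}

# Show the running kernel version, then matching changelog entries.
uname -r
if command -v rpm >/dev/null 2>&1; then
    rpm -q --changelog kernel | grep_lockup_fixes || true
fi
```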
Also verify with a hardware vendor that the issue is not hardware related. One way to verify that the issue is not a known and solved hardware problem is to update the firmware or BIOS to the latest available from the hardware vendor.
On virtual systems, soft lockups can indicate that the underlying hypervisor is overcommitted. Please see this article addressing this issue: VMware virtual machine guest suffers multiple soft lockups at the same time
If all of the above have been ruled out as the cause, the soft lockups may not indicate a problem at all; this can happen, for example, on systems with very large numbers of CPU cores.

If this is encountered in RHEL 5, then increase the threshold at which the messages appear using the following procedures:

Run the following command and check whether "soft lockup" errors are still encountered on the system:

    # sysctl -w kernel.softlockup_thresh=30

To make this parameter persistent across reboots, add the following line to the /etc/sysctl.conf file:

    kernel.softlockup_thresh=30

In RHEL 6 and above, the threshold is named "watchdog_thresh" and can be set no higher than 60.

To make this change in RHEL 6 and above, set the tunable kernel.watchdog_thresh in /etc/sysctl.conf.
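
As an illustration for RHEL 6 and above, the sketch below builds the sysctl.conf line while enforcing the upper limit of 60 mentioned above. watchdog_conf_line is a hypothetical helper, and the value 30 simply mirrors the RHEL 5 example; it is not a recommendation.

```shell
# watchdog_conf_line: hypothetical helper, not part of any RHEL package.
# Prints a sysctl.conf line for the requested threshold and refuses values
# outside 1..60 (0 would switch lockup detection off entirely).
watchdog_conf_line() {
    case "$1" in
        ''|*[!0-9]*) echo "not a whole number: $1" >&2; return 1 ;;
    esac
    if [ "$1" -lt 1 ] || [ "$1" -gt 60 ]; then
        echo "out of range (1-60): $1" >&2
        return 1
    fi
    echo "kernel.watchdog_thresh = $1"
}

watchdog_conf_line 30
# As root, the change could then be applied and made persistent with:
#   sysctl -w kernel.watchdog_thresh=30
#   watchdog_conf_line 30 >> /etc/sysctl.conf
```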

Additional Notes:

The softlockup_thresh kernel parameter was introduced in Red Hat Enterprise Linux 5.2 in kernel-2.6.18-92.el5, so it cannot be modified on older versions.

Root Cause

Soft lockups are situations in which the kernel's scheduler subsystem has not been given a chance to perform its job for more than the limit set by the watchdog threshold, in seconds; they can be caused by defects in the kernel, by hardware issues or by extremely high workloads.
If lockups are encountered on a virtual system, it is important to ensure that the hypervisor is not overcommitted.
Hardware issues related to newly installed memory might cause soft lockups.
Misconfigurations might also cause the issue, such as redirecting console output to a serial device and limiting it to, e.g., 9600 baud.
On systems with very large numbers of CPU cores, soft lockups might not indicate a problem.

Diagnostic Steps

In the event that a hang occurs and soft lockups are noted in the logs, it may be necessary to collect a vmcore from the hung system to fully diagnose the issue. Often, a system experiencing soft lockups does not respond to keyboard commands. In this case, follow this guide to ensure the system can be panicked via an NMI when it is unresponsive to keyboard commands:
How to collect system information to provide to Red Hat Support for analysis when a system hangs

Sometimes, a system that is hung with soft lockup messages may fail to produce any logs or SAR data. In this case, a vmcore collected may be the only source of data about the event.

You can configure the kernel to panic the system in the event of a softlockup using the following command. Note the command as listed will take effect immediately and be persistent across reboots:

    # sysctl -w 'kernel.softlockup_panic=1' >> /etc/sysctl.d/99-kdump.conf
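
A panic only yields a vmcore if a crash kernel has been loaded by kdump, so it can be worth confirming both settings afterwards. The following is a minimal sketch; show_setting is a hypothetical helper that just prints the contents of a kernel interface file.

```shell
# show_setting: hypothetical helper. Prints "<file> = <value>" for a readable
# file, or a note when the file does not exist on this kernel.
show_setting() {
    if [ -r "$1" ]; then
        echo "$1 = $(cat "$1")"
    else
        echo "$1 not available"
    fi
}

# 1 means the kernel panics on a soft lockup.
show_setting /proc/sys/kernel/softlockup_panic
# 1 means a crash (kdump) kernel is loaded and a vmcore can be captured.
show_setting /sys/kernel/kexec_crash_loaded
```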

Another possible cause of these messages in a VMware-hosted RHEL environment is that the VMware host steals CPU time from the guest, as described in this article:
VMware virtual machine guest suffers multiple soft lockups at the same time

Some further discussion on a CPU soft lockup is also available here:
What is a CPU soft lockup?

SBR

Kernel

Product(s)

Red Hat Enterprise Linux

Components

kernel

Category

Troubleshoot
