NVME idle timeout at bootup
These days I'm mostly running EasyOS on the internal NVME SSD in my Zenbook laptop, rather than booting up the Lenovo desktop. I noticed that sometimes there was a 20 second delay after the speed-test and before it asks for the password. It only happened very occasionally, which made it difficult to capture.
Isolated the problem by inserting 'dmesg' before the delay occurs and again just after. Found this:
[ 2.464309] usb 3-2.4: Product: USB OPTICAL MOUSE
[ 2.468319] input: USB OPTICAL MOUSE as /devices/pci0000:00/0000:00:14.0/usb3/3-2/3-2.4/3-2.4:1.0/0003:0000:3825.0001/input/input1
[ 2.468573] hid-generic 0003:0000:3825.0001: input: USB HID v1.11 Mouse [ USB OPTICAL MOUSE] on usb-0000:00:14.0-2.4/input0
[ 3.197955] init (1): drop_caches: 3
[ 33.892978] nvme nvme0: I/O tag 1015 (c3f7) QID 9 timeout, completion polled
[ 33.924005] EXT4-fs (nvme0n1p2): recovery complete
[ 33.924271] EXT4-fs (nvme0n1p2): mounted filesystem a6446008-7f3a-4196-be5c-c0416f4edf44 r/w with ordered data mode. Quota mode: disabled.
The delay occurs while trying to mount the ext4 working-partition: the timestamps jump from 3.19 to 33.89 seconds, ending at that "QID 9 timeout" line.
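For anyone chasing a similar intermittent stall, here is a minimal sketch of how the pause could be picked out of saved dmesg output automatically. It is not something EasyOS does; the function name `largest_gap` and the shortened sample lines are just for illustration, and it assumes every line of interest starts with the usual bracketed timestamp.

```python
import re

# Matches the "[ seconds.micros]" prefix that dmesg puts on each line.
STAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")

def largest_gap(lines):
    """Return (gap_seconds, text_before, text_after) for the biggest pause."""
    best = (0.0, None, None)
    prev_time, prev_text = None, None
    for line in lines:
        m = STAMP.match(line.strip())
        if not m:
            continue  # skip lines without a timestamp
        now, text = float(m.group(1)), m.group(2)
        if prev_time is not None and now - prev_time > best[0]:
            best = (now - prev_time, prev_text, text)
        prev_time, prev_text = now, text
    return best

# Shortened copies of lines from the log above (hypothetical input list).
sample = [
    "[ 2.464309] usb 3-2.4: Product: USB OPTICAL MOUSE",
    "[ 3.197955] init (1): drop_caches: 3",
    "[ 33.892978] nvme nvme0: I/O tag 1015 (c3f7) QID 9 timeout, completion polled",
    "[ 33.924005] EXT4-fs (nvme0n1p2): recovery complete",
]

gap, before, after = largest_gap(sample)
print(f"{gap:.1f} s pause between '{before}' and '{after}'")
```

Feeding it the full output of 'dmesg' saved to a file, one line per list entry, should point at the same pair of lines without having to read through the whole log.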
A search revealed that the kernel very aggressively drops the NVME drive into idle-mode. Lots of online reports of this causing trouble, even dropping into idle-mode while using the drive. Here is some information:
https://netrouting.com/knowledge_base/intel-p4510-nvme-qid-timeout-linux-fix-bare-metal/
My experience with NVME SSDs is that they can run hot, so idle-mode is probably necessary. A lot of online advice is to use the kernel commandline parameter "pcie_aspm=off", which did seem to work for me, but that disables PCIe link power management (ASPM) altogether, so I think it turns off idle-mode entirely.
There is another suggestion, to set the idle-mode wakeup delay, with kernel parameter:
nvme_core.default_ps_max_latency_us=<microseconds>
Google AI says this:
nvme_core.default_ps_max_latency_us is a Linux kernel
parameter that sets the maximum latency (in microseconds) for an
NVMe drive to wake up from its power-saving states, often used
to prevent unstable drives from disconnecting. Setting this to 0
disables deep, high-latency sleep states (APST), improving
stability and performance but increasing power consumption and
heat.
I have done this, with a value of "200000", being 200 milliseconds. So far so good. Have put this permanently into the limine.cfg file in easy-*.img.
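After a reboot it is worth confirming that the parameter actually reached the kernel. The following is a rough sketch, not part of EasyOS: it assumes the booted command line is readable at /proc/cmdline and that the nvme_core module exposes its setting under /sys/module/nvme_core/parameters/, which is where module parameters normally appear. The helper names `latency_from_cmdline` and `read_text` are made up for this example.

```python
from pathlib import Path

PARAM = "nvme_core.default_ps_max_latency_us"

def latency_from_cmdline(cmdline):
    """Return the value of PARAM in microseconds, or None if it is absent."""
    for word in cmdline.split():
        key, sep, value = word.partition("=")
        if key == PARAM and sep and value.isdigit():
            return int(value)
    return None

def read_text(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None

if __name__ == "__main__":
    # What the bootloader passed to the kernel.
    cmdline = read_text("/proc/cmdline") or ""
    requested = latency_from_cmdline(cmdline)
    # What the nvme_core module reports (None if the file is not there).
    in_effect = read_text(f"/sys/module/nvme_core/parameters/{PARAM.split('.')[1]}")
    print("requested on command line (us):", requested)
    print("reported by nvme_core (us):", in_effect)
```

If both lines show 200000 then the limine.cfg change has taken effect. A value of None on the first line would mean the parameter never made it onto the command line.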
Tags: easy