kioxia nvme / num_err_log_entries 0xc004 / smartctl
So, these new Kioxia NVMe drives were incrementing the num_err_log_entries as soon as they were inserted into the machine. But the error said INVALID_FIELD. What gives?
In contrast to the other (mostly Intel) drives, these drives started incrementing the num_err_log_entries as soon as they were plugged in:
# nvme smart-log /dev/nvme21n1
Smart Log for NVME device:nvme21n1 namespace-id:ffffffff
...
num_err_log_entries : 932
The relevant errors should be readable in the error-log. All 64 errors in the log looked the same:
error_count : 932
sqid : 0
cmdid : 0xc
status_field : 0xc004(INVALID_FIELD)
parm_err_loc : 0x4
lba : 0xffffffffffffffff
nsid : 0x1
vs : 0
INVALID_FIELD, what is this?
The error count kept increasing regularly, like clockwork actually. And the internet gave us no clues as to what this might be.
It turns out it was our monitoring. The Zabbix scripts we employ fetch drive health status values from various sources. And one of the things they do is run smartctl -a on all drives. And for every such call, the error count was incremented.
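A quick way to confirm a suspicion like this is to read the counter before and after a single call. A minimal sketch, assuming nvme-cli is installed and that its smart-log output contains a num_err_log_entries line like the one shown above; parse_num_err and err_delta are names made up for this example:

```shell
#!/bin/sh
# Extract the num_err_log_entries value from `nvme smart-log` output on stdin.
parse_num_err() {
    awk -F: '/^num_err_log_entries/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Print how much the error counter of device $1 grows while the remaining
# arguments are run as a command. Needs root and nvme-cli on a real system.
err_delta() {
    dev=$1; shift
    before=$(nvme smart-log "$dev" | parse_num_err)
    "$@" >/dev/null 2>&1
    after=$(nvme smart-log "$dev" | parse_num_err)
    echo $((after - before))
}
```

Run as root, for example: err_delta /dev/nvme21n1 smartctl -a /dev/nvme21n1. A non-zero result means the command itself is what bumps the counter.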
# nvme list
Node SN Model FW Rev
------------- ------------ ------------------- --------
...
/dev/nvme20n1 PHLJ9110xxxx INTEL SSDPE2KX010T8 VDV10131
/dev/nvme21n1 X0U0A02Dxxxx KCD6DLUL3T84 0102
/dev/nvme22n1 X0U0A02Jxxxx KCD6DLUL3T84 0102
If we run smartctl -a on the Intel drive, we get this:
# smartctl -a /dev/nvme20n1
...
Model Number: INTEL SSDPE2KX010T8
...
=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x4002
# nvme smart-log /dev/nvme20n1 | grep ^num_err
num_err_log_entries : 0
# nvme error-log /dev/nvme20n1 | head -n12
Error Log Entries for device:nvme20n1 entries:64
.................
Entry[ 0]
.................
error_count : 0
sqid : 0
cmdid : 0
status_field : 0(SUCCESS)
parm_err_loc : 0
lba : 0
nsid : 0
vs : 0
But on the Kioxias, we get this:
# smartctl -a /dev/nvme21n1
...
Model Number: KCD6DLUL3T84
...
=== START OF SMART DATA SECTION ===
Read NVMe SMART/Health Information failed: NVMe Status 0x6002
# nvme smart-log /dev/nvme21n1 | grep ^num_err
num_err_log_entries : 933
# nvme error-log /dev/nvme21n1 | head -n12
Error Log Entries for device:nvme21n1 entries:64
.................
Entry[ 0]
.................
error_count : 933
sqid : 0
cmdid : 0x6
status_field : 0xc004(INVALID_FIELD)
parm_err_loc : 0x4
lba : 0xffffffffffffffff
nsid : 0x1
vs : 0
Apparently the Kioxia drive does not like what smartctl is sending.
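The two status values are worth decoding. The sketch below assumes that smartctl prints the 15-bit completion status, while the status_field in the error log still carries the phase tag in bit 0, so that 0xc004 shifted right by one is the same 0x6002. The bit layout follows the NVMe spec (status code in bits 7:0, status code type in bits 10:8, More in bit 13, Do Not Retry in bit 14); the function names are made up for this example:

```shell
#!/bin/sh
# Decode a 15-bit NVMe status value (phase tag already shifted out).
decode_nvme_status() {
    st=$(($1))
    printf 'sc=0x%02x sct=%d more=%d dnr=%d\n' \
        $((st & 0xff)) $(((st >> 8) & 0x7)) \
        $(((st >> 13) & 1)) $(((st >> 14) & 1))
}

# The error-log status_field has the phase tag in bit 0: shift it out first.
decode_error_log_status() {
    decode_nvme_status $(($1 >> 1))
}

decode_nvme_status 0x4002        # Intel, as printed by smartctl
decode_nvme_status 0x6002        # Kioxia, as printed by smartctl
decode_error_log_status 0xc004   # Kioxia, from nvme error-log
```

Both drives answer with status code 0x02 (Invalid Field in Command) and Do Not Retry. The difference is the More bit: only the Kioxia sets it, which per the spec signals that further details are available in the Error Information log. That would fit the Kioxia logging an entry on every call while the Intel error log stays empty.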
Luckily this turned out to be an issue that smartctl claims responsibility for. And it had already been fixed.
The explanation given on the smartmontools side: "If this works, the problem is that this drive requires that the broadcast namespace is specified if SMART/Health and Error Information logs are requested. This issue was unspecified in early revisions of the NVMe standard."
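This lines up with the parm_err_loc value in the error log. Per the NVMe spec, its low byte is the byte offset into the submitted command and bits 10:8 are the bit offset; bytes 4 to 7 of a command hold the namespace ID. A small sketch of that decoding (decode_parm_err_loc is a made-up name, and only the NSID range is mapped to a field name):

```shell
#!/bin/sh
# Decode the Parameter Error Location from an NVMe error log entry.
decode_parm_err_loc() {
    v=$(($1))
    byte=$((v & 0xff))        # byte offset into the command
    bit=$(((v >> 8) & 0x7))   # bit offset within that byte
    if [ "$byte" -ge 4 ] && [ "$byte" -le 7 ]; then
        field=NSID            # command bytes 4..7 are the namespace ID
    else
        field=other
    fi
    printf 'byte=%d bit=%d field=%s\n' "$byte" "$bit" "$field"
}

decode_parm_err_loc 0x4
```

So parm_err_loc 0x4 together with nsid 0x1 suggests the drive objected to the namespace ID 1 in the request, where it wanted the broadcast value 0xffffffff. Note that nvme smart-log reports namespace-id:ffffffff in its header and did not trigger the error.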
In our case, applying this fix was easy on this Ubuntu/Bionic machine:
# apt-cache policy smartmontools
smartmontools:
Installed: 6.5+svn4324-1ubuntu0.1
Candidate: 6.5+svn4324-1ubuntu0.1
Version table:
7.0-0ubuntu1~ubuntu18.04.1 100
100 http://MIRROR/ubuntu bionic-backports/main amd64 Packages
*** 6.5+svn4324-1ubuntu0.1 500
500 http://MIRROR/ubuntu bionic-updates/main amd64 Packages
100 /var/lib/dpkg/status
# apt-get install smartmontools=7.0-0ubuntu1~ubuntu18.04.1
This smartmontools update from 6.5 to 7.0 not only got rid of the new errors, it also showed more relevant health output.
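For monitoring that runs on many machines, one option is to skip smartctl on NVMe devices until the package is new enough. A sketch that assumes the first line of smartctl --version starts with "smartctl MAJOR.MINOR" and that 7.0 is sufficient, as observed here; smartctl_major and smartctl_nvme_ok are made-up names:

```shell
#!/bin/sh
# Print the major version from `smartctl --version` output on stdin.
smartctl_major() {
    awk 'NR == 1 && $1 == "smartctl" { split($2, v, "."); print v[1] }'
}

# Succeed if the version read from stdin is at least 7.x.
# The threshold of 7 is an assumption based on the upgrade described above.
smartctl_nvme_ok() {
    major=$(smartctl_major)
    [ -n "$major" ] && [ "$major" -ge 7 ]
}
```

In a monitoring script this could be used as: smartctl --version | smartctl_nvme_ok && smartctl -a /dev/nvme21n1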
Now if we could just reset the error-log count on the drives, then this would be even better...