segfault in library / addr2line / objdump
Yesterday, we spotted some SEGFAULTs on an Ubuntu/Focal server. We did not have core dumps, but the kernel message in dmesg was sufficient to find a culprit.
The observed messages were these:
nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 sp 00007fffd51c8420 error 4 in libperl.so.5.30.0[7ff40dbc7000+166000]
Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 49 c1 ed 03 49
nginx[951947]: segfault at 10 ip 00007fba4a1645a3 sp 00007ffe57b0f8a0 error 4 in libperl.so.5.30.0 (deleted)[7fba4a0fc000+166000]
Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 49 c1 ed 03 49
And after upgrading libperl5.30 from 5.30.0-9ubuntu0.3 to 5.30.0-9ubuntu0.4, we got these similar ones:
traps: nginx[955774] general protection fault ip:7f6af33345a3 sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]
nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]
Code: 00 0f b6 40 30 49 c1 ed 03 49 29 c5 0f 84 17 01 00 00 48 8b 76 10 48 8b 52 10 4c 8d 3c fe 4c 8d 0c c2 84 c9 0f 84 c7 02 00 00 <49> 83 39 00 0f 85 ad 03 00 00 49 83 c1 08 49 83 ed 01 49 8d 74 1d
Apparently they were triggered by an nginx reload.
If we had a proper core dump, we could extract lots of useful info from it: where the crash occurred, which registers and variables were set, and the call chain (backtrace). With the info from above, we can at most get where the crash happened, and maybe which register had a bad value. But it is definitely better than nothing.
Feeding calculated offset to addr2line
For the most basic attempt, I found a box which still had libperl version 5.30.0-9ubuntu0.3. I installed the perl-debug apt package (perl-debug_5.30.0-9ubuntu0.3_amd64.deb from https://launchpadlibrarian.net/) there. From the kernel message
“nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 sp 00007fffd51c8420 error 4 in libperl.so.5.30.0[7ff40dbc7000+166000]”
we take the instruction pointer 00007ff40dc2f5a3 and subtract the library starting position 7ff40dbc7000:
0x7ff40dc2f5a3 - 0x7ff40dbc7000 = 0x685a3
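The same subtraction can be scripted. This is a minimal sketch, assuming the "segfault at" message layout shown above; the names `SEGFAULT_RE` and `naive_offset` are made up for illustration.

```python
import re

# Hypothetical pattern for the "segfault at ..." form of the kernel message.
SEGFAULT_RE = re.compile(
    r'segfault at (?P<fault>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) '
    r'sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+) '
    r'in (?P<lib>\S+)(?: \(deleted\))?'
    r'\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]')


def naive_offset(line):
    """Return (library name, ip minus mapping start) for a segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError('not a "segfault at" line')
    ip = int(m.group('ip'), 16)
    base = int(m.group('base'), 16)
    return m.group('lib'), ip - base


line = ('nginx[854]: segfault at 6d702e746379 ip 00007ff40dc2f5a3 '
        'sp 00007fffd51c8420 error 4 in '
        'libperl.so.5.30.0[7ff40dbc7000+166000]')
lib, offset = naive_offset(line)
print(lib, hex(offset))  # libperl.so.5.30.0 0x685a3
```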
Feed that to addr2line and get the location of the crash... right?
$ addr2line -Cfe /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 685a3
Perl_vload_module
op.c:7750
At first glance that appears okay. But when we check what happens in the machine instructions there, it is not:
$ objdump -d /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 --disassemble=Perl_vload_module
...
00000000000685a3 <Perl_vload_module@@Base+0xd3>:
...
6859b: 83 c0 08 add $0x8,%eax
6859e: 49 03 57 10 add 0x10(%r15),%rdx
685a2: 41 89 07 mov %eax,(%r15)
685a5: 48 8b 0a mov (%rdx),%rcx
685a8: 45 31 e4 xor %r12d,%r12d
685ab: 48 85 c9 test %rcx,%rcx
...
There is no instruction start at 0x685a3!
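That check can also be done mechanically on the disassembly text. The sketch below assumes the usual `objdump -d` line layout (address, colon, hex bytes, mnemonic); `instruction_spans` and `locate` are hypothetical helper names, and the sample is the excerpt from above.

```python
import re

# An objdump -d code line: address, colon, hex bytes, then optional text.
INSN_RE = re.compile(r'^\s*([0-9a-f]+):((?:\s[0-9a-f]{2})+)(?=\s|$)')


def instruction_spans(disassembly):
    """Yield (start address, byte count) for every code line."""
    for line in disassembly.splitlines():
        m = INSN_RE.match(line)
        if m:
            yield int(m.group(1), 16), len(m.group(2).split())


def locate(disassembly, addr):
    """Return 'start', 'inside' or 'unknown' for the given address."""
    for start, length in instruction_spans(disassembly):
        if addr == start:
            return 'start'
        if start < addr < start + length:
            return 'inside'
    return 'unknown'


sample = """\
   6859b: 83 c0 08      add $0x8,%eax
   6859e: 49 03 57 10   add 0x10(%r15),%rdx
   685a2: 41 89 07      mov %eax,(%r15)
   685a5: 48 8b 0a      mov (%rdx),%rcx
"""
print(locate(sample, 0x685a3))  # inside
print(locate(sample, 0x685a5))  # start
```

An address that lands inside an instruction is a strong hint that the offset calculation is wrong, because the instruction pointer in the kernel message always points at an instruction start.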
Searching for machine code inside a binary
What if we simply look for the instructions as shown in the Code: message?
To this end, I hacked together a script that does the following:
- Spawn a copy of objdump to disassemble the binary;
- look for the instructions as passed on the command line;
- display where the instructions are found.
The objdump-find-instructions.py script:
#!/usr/bin/env python3
import re
import subprocess
import sys

# Look for these:
# >   19640c:  48 89 44 24 28   mov    %rax,0x28(%rsp)
# >   196411:  31 c0            xor    %eax,%eax
# >   196413:  48 85 db         test   %rbx,%rbx
code_re = re.compile(
    br'^\s+(?P<addr>[0-9a-f]+):(?P<code>(\s[0-9a-f]{2})+)\s+'
    br'(?P<decoded>.*)')
code_without_decoded_re = re.compile(
    br'^\s+(?P<addr>[0-9a-f]+):(?P<code>(\s[0-9a-f]{2})+)\s*$')

# Look for these:
# > 000000000004ea40 <Perl_ck_concat@@Base>:
func_re = re.compile(br'^(?P<addr>[0-9a-f]+) <(?P<name>[^<>]*)>:')

# Look for blanks:
blank_re = re.compile(br'^\s*$')

# Lines to ignore:
ignore_re = re.compile(
    br'^/.*:\s+file format |^Disassembly of section ')


def to_bin(binstr_array):
    return bytes([int(i, 16) for i in binstr_array])


def to_hex(binarray):
    return ' '.join('{:02x}'.format(i) for i in binarray)


# Get executable/binary from argv
executable = sys.argv[1]  # /usr/lib/x86_64-linux-gnu/libperl.so.5.30

# Get needle from argv
needle = [i.encode() for i in sys.argv[2:]]  # ['48', '89', '44', '24', '28']
needle_len = len(needle)
assert needle_len >= 2, 'must specify XX XX XX bytes to search for'
needle_bin = to_bin(needle)
MAX_BUF = needle_len + 30


class Matcher:
    """Remember the last successful regex match and its named groups."""

    def search(self, haystack, regex):
        self.match = regex.search(haystack)
        if self.match:
            self.dict = self.match.groupdict()
        return self.match

    def get(self, key):
        return self.dict[key]


# Execute
proc = subprocess.Popen(
    ['/usr/bin/objdump', '-d', executable], stdout=subprocess.PIPE)

# Parse
code_bin = bytearray()
last_func = None
last_addr = None
matcher = Matcher()
for line in proc.stdout:
    line = line.rstrip()
    if matcher.search(line, blank_re):
        last_func = None
        last_addr = None
    elif matcher.search(line, func_re):
        last_func = matcher.get('name')
        last_addr = matcher.get('addr')
    # Try the bytes-only pattern first: code_re would also match such a
    # line by backtracking and taking the last byte as the decoded text.
    elif (matcher.search(line, code_without_decoded_re) or
            matcher.search(line, code_re)):
        new_code_bin = to_bin(matcher.get('code').lstrip().split())
        code_bin.extend(new_code_bin)
        code_bin = code_bin[-MAX_BUF:]  # truncate early
        # This contains search on binary is pretty fast, compared to doing
        # sub-array comparisons.
        # real 0m9.873s --> 0m4.000s
        # user 0m12.637s --> 0m6.624s
        if needle_bin in code_bin:
            print(
                last_addr.decode(), last_func.decode(),
                matcher.get('addr').decode(),
                matcher.dict.get('decoded', b'').decode(),
                to_hex(new_code_bin))
            # print('//', to_hex(code_bin))
            assert needle_len > 1
            code_bin = code_bin[(-needle_len + 1):]  # skip same results
    elif matcher.search(line, ignore_re):
        pass
    else:
        print('discarding', line)
        sys.exit(2)
The script is invoked like this:
$ python3 objdump-find-instructions.py PATH_TO_BINARY INSTRUCTIONS...
We include all instructions up to and including the <4c> and invoke it like this:
$ python3 objdump-find-instructions.py \
/usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 \
48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 \
5e 41 5f c3 0f 1f 40 00 0f b6 7f 30 48 c1 e8 \
03 48 29 f8 48 89 c3 74 89 48 8b 02 4c
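The byte list passed above is everything in the Code: line up to and including the marked byte. Deriving it by hand is error-prone, so here is a small sketch that does it; `needle_from_code_line` is a hypothetical helper, not part of the script above.

```python
def needle_from_code_line(code_line):
    """Return the hex bytes up to and including the <xx> marked byte."""
    tokens = code_line.split()
    if tokens and tokens[0] == 'Code:':
        tokens = tokens[1:]
    needle = []
    for token in tokens:
        if token.startswith('<') and token.endswith('>'):
            needle.append(token[1:-1])  # strip the angle brackets
            return needle
        needle.append(token)
    raise ValueError('no <xx> marker found in Code: line')


code_line = (
    'Code: 48 89 43 10 48 83 c4 18 5b 5d 41 5c 41 5d 41 5e 41 5f c3 '
    '0f 1f 40 00 0f b6 7f 30 48 c1 e8 03 48 29 f8 48 89 c3 74 89 '
    '48 8b 02 <4c> 8b 68 10 4d 85 ed 0f 84 28 01 00 00 0f b6 40 30 '
    '49 c1 ed 03 49')
print(' '.join(needle_from_code_line(code_line)))
```

The printed bytes can be pasted as the INSTRUCTIONS arguments of objdump-find-instructions.py.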
The script spews out this one line:
00000000000b0500 Perl__invlist_intersection_maybe_complement_2nd@@Base b05a3 mov 0x10(%rax),%r13 4c 8b 68 10
That contains the following info:
- The function Perl__invlist_intersection_maybe_complement_2nd@@Base starts at 00000000000b0500.
- At 0xb05a3 there is a mov 0x10(%rax),%r13 instruction.
- That instruction is 4c 8b 68 10 in machine code.
That instruction corresponds with the position in the Code: log line.
Code: [... 8b 02] <4c> 8b 68 10 [4d 85 ...]
This looks like a much better candidate than the Perl_vload_module we got from addr2line. The reading of 0x10(%rax) matches the second crash perfectly: if the %rax register is 0, a common value, then this would produce a segfault at 10.
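Working backwards from the fault address gives the register value for each crash. This is a sketch under the assumption that the faulting access really is the `mov 0x10(%rax),%r13` found above; `implied_base` is a made-up name.

```python
def implied_base(fault_addr, displacement=0x10):
    """Register value implied by a faulting access to displacement(%reg)."""
    return fault_addr - displacement


# Fault addresses taken from the two "segfault at" messages.
for fault in (0x10, 0x6d702e746379):
    rax = implied_base(fault)
    # Also show the value as it would sit in memory (little-endian).
    print(hex(fault), '->', hex(rax), rax.to_bytes(8, 'little'))
```

For the second crash this yields a %rax of 0. For the first crash it yields 0x6d702e746369, whose in-memory bytes happen to be printable ASCII (`ict.pm`). Whether that means the pointer was overwritten with string data is speculation; the kernel message alone cannot confirm it.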
Getting the surrounding code from objdump:
$ objdump -d /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 --start-address=0xb0500
...
00000000000b0500 <Perl__invlist_intersection_maybe_complement_2nd@@Base>:
...
b059e: 74 89 je b0529 <Perl__invlist_intersection_maybe_complement_2nd@@Base+0x29>
b05a0: 48 8b 02 mov (%rdx),%rax
b05a3: 4c 8b 68 10 mov 0x10(%rax),%r13
b05a7: 4d 85 ed test %r13,%r13
b05aa: 0f 84 28 01 00 00 je b06d8 <Perl__invlist_intersection_maybe_complement_2nd@@Base+0x1d8>
...
Offset 0x48000
I was confident that this was the right crash location. And because Perl did have a problem with the code in this vicinity, it was easy to file bug report lp2035339.
But I could not explain yet why the calculated offset of 0x685a3 is off. The difference between 0x685a3 and 0xb05a3 is 0x48000.
A bit of poking around the binary did turn up this:
$ objdump -p /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0
...
Dynamic Section:
NEEDED libdl.so.2
NEEDED libm.so.6
NEEDED libpthread.so.0
NEEDED libc.so.6
NEEDED libcrypt.so.1
SONAME libperl.so.5.30
INIT 0x0000000000048000
FINI 0x00000000001ad6b4
...
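The machine instructions reside between 0x48000 and 0x1ad6b4. That fits the range in the kernel message: 0x1ad6b4 - 0x48000 = 0x1656b4, which rounded up to whole pages gives the 0x166000 from [7ff40dbc7000+166000]. So the start address the kernel reports belongs to the executable mapping, which begins 0x48000 into the library, and not to the start of the library itself. That's where we got the extra 0x48000 we need.
So, next time we do an addr2line lookup of a library, we should check the INIT offset, and add that to the calculated instruction pointer position.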
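Put together, the corrected lookup address is `ip - mapping_start + INIT`. The sketch below parses the INIT entry out of `objdump -p` text and applies it; `init_address` and `file_address` are hypothetical helpers, and the sample text is a shortened copy of the output above.

```python
import re


def init_address(objdump_p_output):
    """Return the INIT value from `objdump -p` output as an integer."""
    for line in objdump_p_output.splitlines():
        # Requires whitespace after INIT, so INIT_ARRAY does not match.
        m = re.match(r'\s*INIT\s+0x([0-9a-f]+)\s*$', line)
        if m:
            return int(m.group(1), 16)
    raise ValueError('no INIT entry found')


def file_address(ip, mapping_start, init):
    """Address to feed to addr2line, per the rule of thumb above."""
    return ip - mapping_start + init


sample = """\
Dynamic Section:
  NEEDED               libc.so.6
  SONAME               libperl.so.5.30
  INIT                 0x0000000000048000
  FINI                 0x00000000001ad6b4
"""
init = init_address(sample)
print(hex(file_address(0x7ff40dc2f5a3, 0x7ff40dbc7000, init)))  # 0xb05a3
```

Using INIT as the offset is a heuristic that worked for this library; it relies on the init code sitting at the very start of the executable mapping.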
Check with newer version
After upgrading both libperl and perl-debug on the test box, we could confirm that the latest crashes were caused by the same problem.
From
“traps: nginx[955774] general protection fault ip:7f6af33345a3 sp:7ffe74310100 error:0 in libperl.so.5.30.0[7f6af32cc000+166000]”
and the INIT offset of 0x48000 we get 0xb05a3 and from
“nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 sp 00007ffe7b5f08c0 error 4 in libperl.so.5.30.0[7f5e60cbe000+166000]”
we get 0xb05d9.
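The two messages use different layouts (`ip:` versus `ip `), but both carry the instruction pointer and the `[start+size]` mapping. A loose sketch that handles either form, with the 0x48000 offset from this particular library as an assumed default; `corrected_address` is a made-up name:

```python
import re

IP_RE = re.compile(r'\bip[: ]([0-9a-f]+)')
MAP_RE = re.compile(r'\[([0-9a-f]+)\+([0-9a-f]+)\]')


def corrected_address(line, init=0x48000):
    """Return ip - mapping start + init for either kernel message form."""
    ip = int(IP_RE.search(line).group(1), 16)
    start = int(MAP_RE.search(line).group(1), 16)
    return ip - start + init


gp_line = ('traps: nginx[955774] general protection fault '
           'ip:7f6af33345a3 sp:7ffe74310100 error:0 in '
           'libperl.so.5.30.0[7f6af32cc000+166000]')
segv_line = ('nginx[1049280]: segfault at 205bd ip 00007f5e60d265d9 '
             'sp 00007ffe7b5f08c0 error 4 in '
             'libperl.so.5.30.0[7f5e60cbe000+166000]')
print(hex(corrected_address(gp_line)))    # 0xb05a3
print(hex(corrected_address(segv_line)))  # 0xb05d9
```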
addr2line gives us:
$ addr2line -Cfe /usr/lib/x86_64-linux-gnu/libperl.so.5.30.0 b05a3 b05d9
Perl__invlist_intersection_maybe_complement_2nd
invlist_inline.h:51
Perl__invlist_intersection_maybe_complement_2nd
regcomp.c:9841
Both in Perl__invlist_intersection_maybe_complement_2nd. Same problem.
general protection fault vs. segfault
Lastly, why did we get a “traps: ... general protection fault ... error:0” for one crash and “segfault at ... ip ... error 4” for the others?
I'm not entirely sure. As far as I can gather, this could be the difference between the segmentation violation happening while running in kernel mode versus running in user mode. The error code of 0 vs. 4 does indicate as much. (The error code bits are defined in arch/x86/include/asm/trap_pf.h, quoted here.)
/*
 * Page fault error code bits:
 *
 *   bit 0 ==    0: no page found       1: protection fault
 *   bit 1 ==    0: read access         1: write access
 *   bit 2 ==    0: kernel-mode access  1: user-mode access
 *   bit 3 ==                           1: use of reserved bit detected
 *   bit 4 ==                           1: fault was an instruction fetch
 *   bit 5 ==                           1: protection keys block access
 *   bit 15 ==                          1: SGX MMU page-fault
 */
enum x86_pf_error_code {
    X86_PF_PROT  = 1 << 0,
    X86_PF_WRITE = 1 << 1,
    X86_PF_USER  = 1 << 2,
    X86_PF_RSVD  = 1 << 3,
    X86_PF_INSTR = 1 << 4,
    X86_PF_PK    = 1 << 5,
    X86_PF_SGX   = 1 << 15,
};
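Read through that table, the codes from the log lines decode as follows. This is only a sketch of the bit layout quoted above; `describe_pf_error` is a hypothetical helper, and it is an open assumption whether the error:0 of the general protection fault may be read as a page fault error code at all.

```python
# (mask, meaning when clear, meaning when set), per the comment above.
TWO_STATE_BITS = [
    (1 << 0, 'no page found', 'protection fault'),
    (1 << 1, 'read access', 'write access'),
    (1 << 2, 'kernel-mode access', 'user-mode access'),
]
# (mask, meaning when set); these bits say nothing when clear.
FLAG_BITS = [
    (1 << 3, 'use of reserved bit detected'),
    (1 << 4, 'fault was an instruction fetch'),
    (1 << 5, 'protection keys block access'),
    (1 << 15, 'SGX MMU page-fault'),
]


def describe_pf_error(code):
    """Return a list of descriptions for a page fault error code."""
    parts = [on if code & mask else off for mask, off, on in TWO_STATE_BITS]
    parts += [text for mask, text in FLAG_BITS if code & mask]
    return parts


print(describe_pf_error(4))  # the "segfault ... error 4" lines
print(describe_pf_error(0))  # error:0, if it were a page fault code
```

So error 4 reads as a user-mode read of an address with no page mapped, which matches the `mov 0x10(%rax),%r13` load.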
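But maybe it has a different reason, like the specific memory location that was tried (we don't see it in this message). One candidate: on x86-64, an access to a non-canonical address raises a general protection fault instead of a page fault, and in that case the error code is not a page fault error code at all. Let me know if you know!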