Opened 3 years ago

Last modified 3 years ago

#5056 accepted defect

Running kea-dhcp4 on many interfaces causes segfault

Reported by: awm1
Owned by: marcin
Priority: medium
Milestone: Outstanding Tasks
Component: Unclassified
Version: git
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: N/A
Sub-Project: DHCP
Feature Depending on Ticket:
Estimated Difficulty: 0
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

Since the ability to serve multiple IPv4 subnets on a single interface was added in Kea 1.1, I wanted to use it in our network (many VLANs with multiple end devices on each, every one of which has its own /30 subnet).

However, when I start kea-dhcp4 on a router with ~80 VLANs, it terminates after several seconds with a segmentation fault. kea-dhcp6 (which I use as well) keeps running without any problem.

I tried running kea-dhcp4 on a router with only 10 VLANs and it hasn't crashed yet. My OS is FreeBSD 10.3, and I use Kea 1.1 compiled from source.

I'm enclosing a relevant part of my kea.conf file. Of course I can share any other debug output, however, there isn't any output in the log file (/var/log/kea-dhcp4.log in my case) even when using DEBUG severity.

Subtickets

Attachments (3)

kea.conf (4.5 KB) - added by awm1 3 years ago.
kea-dhcp4.core.zip (381.1 KB) - added by awm1 3 years ago.
kea-dhcp4.truss.zip (72.5 KB) - added by awm1 3 years ago.

Change History (9)

Changed 3 years ago by awm1

comment:1 Changed 3 years ago by marcin

Hi,
Thank you for your bug report. Would it be possible for you to provide a backtrace from this segfault?

Marcin Siodelski
ISC

Changed 3 years ago by awm1

Changed 3 years ago by awm1

comment:2 Changed 3 years ago by awm1

Thank you for your response. Here is the output of "bt full", which I ran in gdb just after kea-dhcp4 crashed:

(gdb) bt full
#0  0x0000000000000010 in ?? ()
No symbol table info available.
#1  0x00000008058b5290 in ?? ()
No symbol table info available.
#2  0x0000000000000016 in ?? ()
No symbol table info available.
#3  0x0000000805838400 in ?? ()
No symbol table info available.
#4  0x0000000000000016 in ?? ()
No symbol table info available.
#5  0x00007fffdfffdc90 in ?? ()
No symbol table info available.
#6  0x0000000800c07a37 in isc::dhcp::TimerMgrImpl::timerCallback (this=<value optimized out>, timer_name=<value optimized out>) at timer_mgr.cc:513
No locals.
Previous frame inner to this frame (corrupt stack?)

I'm not sure whether this will be helpful, so I have also attached to this ticket a crash dump produced by kea-dhcp4 and a system call log from "truss".

comment:3 Changed 3 years ago by hschempf

  • Milestone changed from Kea-proposed to Kea1.2

Per Oct 27 Kea team meeting, accept 1.2

comment:4 Changed 3 years ago by marcin

  • Owner set to marcin
  • Status changed from new to accepted

comment:5 Changed 3 years ago by marcin

I have been working on this issue but have had no luck reproducing it so far.

I have a FreeBSD 11 system on which I created 100 VLANs and made Kea listen on all these interfaces. It works fine for me, though it is not a router but a regular FreeBSD box. I guess a router may have some specifics that cause the behavior you're reporting.

The provided truss file seems to confirm that the segfault occurs within the timer callback. The timer callback is invoked every 10 seconds to try to clean up expired leases in the lease database. I wonder if disabling lease expiration would help to work around your crash. To do so, you'd need to put this in the configuration:

  "expired-leases-processing": {
    "reclaim-timer-wait-time": 0
  },

If the issue goes away, it may point to some problem related to the asynchronous timer.

There is an interesting thing in the truss file, in the following part:

kevent(1094,0x0,0,{ },128,{ 9.997204000 })	 = 0 (0x0)
gettimeofday({ 1477299971.776066 },0x0)		 = 0 (0x0)
fcntl(1093,F_GETFL,)				 = 6 (0x6)
select(1094,{ 1029 1031 1032 1033 1034 1035 1036 1038 1039 1041 1042 1043 1044 1048 1050 1059 1092 1093 },0x0,0x0,{ 0.000000 }) = 1 (0x1)

kevent() is invoked when the lease expiration timer is due. The callback function is invoked, and it signals the timer readiness to the main thread by writing to a pipe. Before it writes to the pipe, it checks whether the pipe has already been written to. So, it does this:

    fd_set read_fds;
    FD_ZERO(&read_fds);

    // Add select_fd socket to listening set
    FD_SET(sink_,  &read_fds);

    // Set zero timeout (non-blocking).
    struct timeval select_timeout;
    select_timeout.tv_sec = 0;
    select_timeout.tv_usec = 0;

    // Return true only if read ready, treat error same as not ready.
    return (select(sink_ + 1, &read_fds, NULL, NULL, &select_timeout) > 0);

This corresponds to the select() call visible in the truss file (see above). The code specifies a single descriptor to be checked by select(), whereas the truss file shows many descriptors. This looks like a sign of some memory corruption, stack overrun etc. The problem, however, is to find out where it originates. I ran kea-dhcp4 through valgrind but this experiment didn't reveal any obvious issues on my system.

It would be good if I could get the following data to further chase this problem:
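For reference, the readiness check described above can be sketched as a standalone helper. This is only a minimal illustration under stated assumptions, not Kea's actual code: the function names isReadable() and markReady() are hypothetical, and the FD_SETSIZE guard is an addition that the quoted snippet does not contain. The guard is there because POSIX leaves FD_SET() undefined for descriptors outside the range 0..FD_SETSIZE-1.

```cpp
#include <sys/select.h>
#include <unistd.h>
#include <cassert>

// Hypothetical helper (not Kea's API): returns true if 'fd' has data
// waiting to be read, without blocking. Errors are treated as "not ready".
bool isReadable(int fd) {
    // Assumption added for this sketch: refuse descriptors that do not
    // fit into an fd_set, because FD_SET() is undefined for them.
    if (fd < 0 || fd >= FD_SETSIZE) {
        return false;
    }

    fd_set read_fds;
    FD_ZERO(&read_fds);
    FD_SET(fd, &read_fds);

    // Zero timeout makes select() return immediately.
    struct timeval select_timeout;
    select_timeout.tv_sec = 0;
    select_timeout.tv_usec = 0;

    return (select(fd + 1, &read_fds, NULL, NULL, &select_timeout) > 0);
}

// Hypothetical helper: writes a single marker byte to the pipe unless one
// is already pending. Returns true if the pipe is marked ready on return.
bool markReady(int source_fd, int sink_fd) {
    if (isReadable(sink_fd)) {
        return true;  // already marked, do not write a second byte
    }
    const char marker = 1;
    return (write(source_fd, &marker, sizeof(marker)) == 1);
}
```

With a pipe created by pipe(fds), markReady(fds[1], fds[0]) can be called repeatedly and only the first call writes a byte; the reader clears the marker by reading from fds[0].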

  • Does this problem occur when lease expiration/reclamation is disabled?
  • Can you build a version of Kea with debug symbols and provide more meaningful backtrace?
  • Can you provide logs? I see your point about the log file being empty but it is hard to believe because the truss file clearly contains writes to a log file. Also, the truss file indicates that there are some errors written to this log but it doesn't include the whole error string so I can't infer what the errors are about.
  • Can you run kea through valgrind, using the configuration that causes it to crash and see if valgrind reports any issues?
  • Can you provide your config.log output from the configuration stage of Kea build?
  • Does Kea respond to any DHCP queries between startup and the crash?

Thanks,
Marcin

comment:6 Changed 3 years ago by hschempf

  • Milestone changed from Kea1.2 to Outstanding Tasks

16 Feb: per team discussion, move from 1.2 to outstanding. Marcin looked at the ticket and sent additional questions to the submitter; pending a response.