Opened 7 years ago

Closed 6 years ago

#3074 closed defect (fixed)

kea6 fails to start after repeated start/stop

Reported by: wlodekwencel
Owned by: muks
Priority: very high
Milestone: Sprint-20131015
Component: ~bind-ctl (obsolete)
Version:
Keywords: kea6 config
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: High
Sub-Project: DHCP
Feature Depending on Ticket:
Estimated Difficulty: 0
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no


Kea6 fails to start after exactly 108 start/stop cycles.

Each Forge test sets a new Kea6 configuration as follows:

  1. stop

config remove Init/components b10-dhcp6
config commit
Dhcp6 shutdown

  2. start fresh

config add Init/components b10-dhcp6
config set Init/components/b10-dhcp6/kind dispensable
config commit

  3. config (e.g.)

config add Dhcp6/subnet6
config set Dhcp6/subnet6[0]/subnet "2001:db8:1::/64"
config set Dhcp6/subnet6[0]/pool [ "2001:db8:1::0 - 2001:db8:1::ffff" ]
config commit
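
For anyone trying to reproduce this outside Forge, the cycle above can be sketched as a small generator of bindctl command lines. This is only an illustration: the function name `bindctl_commands` and the constants are hypothetical, and how the resulting lines are fed to bindctl is an assumption left open here. The command strings themselves are copied from the steps above.

```python
# Hypothetical helper: builds the bindctl command sequence from this report.
# It only produces text; it does not talk to bindctl itself.

STOP = [
    "config remove Init/components b10-dhcp6",
    "config commit",
    "Dhcp6 shutdown",
]

START = [
    "config add Init/components b10-dhcp6",
    "config set Init/components/b10-dhcp6/kind dispensable",
    "config commit",
]

CONFIGURE = [
    "config add Dhcp6/subnet6",
    'config set Dhcp6/subnet6[0]/subnet "2001:db8:1::/64"',
    'config set Dhcp6/subnet6[0]/pool [ "2001:db8:1::0 - 2001:db8:1::ffff" ]',
    "config commit",
]


def bindctl_commands(cycles):
    """Return the stop / start fresh / config commands repeated `cycles` times."""
    commands = []
    for _ in range(cycles):
        commands += STOP + START + CONFIGURE
    return commands


if __name__ == "__main__":
    # 108 is the cycle count at which the failure was observed in this report.
    print("\n".join(bindctl_commands(108)))
```

Each cycle yields ten command lines, in the same stop, start fresh, config order that the Forge tests use.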

After exactly 108 start/stop cycles, it fails, and every time the failure is on the stop step.
bindctl stdout:
Error: [Errno 32] Broken pipe
Configuration not committed
"Shutting down."
bind log:
ERROR [b10-cmdctl.cmdctl/7827] CMDCTL_COMMAND_ERROR error in command set_config to module ConfigManager: [Errno 32] Broken pipe
INFO [b10-dhcp6.dhcp6/12922] DHCP6_SHUTDOWN server shutdown
INFO [b10-init.init/7820] BIND10_PROCESS_ENDED process 12922 of b10-dhcp6 ended with status 0

After that, any attempt to start, stop, or reconfigure kea6 fails with this log message:
INFO [b10-init.init/7820] BIND10_CONFIGURATOR_RECONFIGURE reconfiguring running components

It seems related to #2757.


Change History (10)

comment:1 Changed 6 years ago by muks

  • Sub-Project changed from Core to DHCP

Stephen: We discussed this during our sprint meeting. Please can you look at this ticket and check if it's DHCP related?

comment:2 Changed 6 years ago by tomek

  • Defect Severity changed from N/A to High
  • Milestone changed from New Tasks to Sprint-DHCP-20130918

comment:3 Changed 6 years ago by muks

  • Defect Severity changed from High to Very High
  • Milestone changed from Sprint-DHCP-20130918 to Sprint-20131001
  • Owner set to muks
  • Status changed from new to assigned

Stephen has asked me to look at this urgently as it could be a blocker for the DHCP team's presentation, so I'll pick this up next.

comment:4 Changed 6 years ago by muks

  • Defect Severity changed from Very High to High
  • Priority changed from medium to very high

comment:5 Changed 6 years ago by muks

When this bug is fixed, remember to close #2757 and #3041 too.

comment:6 Changed 6 years ago by muks

  • Owner changed from muks to UnAssigned
  • Status changed from assigned to reviewing

trac3074 is now ready for review.

The problem was that Init has two sockets open to msgq, both subscribed to the Init group (it has to open a first socket to track process startup), and it reads from only one of them. msgq therefore keeps queuing data for the unread socket on its side and, after a while, times out and closes that socket. However, Init still uses this socket to communicate with msgq, so when msgq closes it (resulting in EPIPE) everything collapses. The fix was simple: we unsubscribe from the Init group on the unmonitored socket so that msgq queues nothing on it.
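
The failure mode can be illustrated with a toy model. This is a minimal sketch, not the BIND 10 code: `ToyBroker`, `run`, and the socket names `early` and `main` are hypothetical, and a queue-length limit stands in for the msgq timeout described above.

```python
import errno
from collections import deque


class ToyBroker:
    """Toy stand-in for msgq: one bounded outgoing queue per connected socket."""

    def __init__(self, max_queued=3):
        self.max_queued = max_queued  # assumption: a length limit replaces the real timeout
        self.queues = {}              # socket name -> undelivered messages
        self.groups = {}              # group name -> subscribed socket names

    def connect(self, sock):
        self.queues[sock] = deque()

    def subscribe(self, sock, group):
        self.groups.setdefault(group, set()).add(sock)

    def unsubscribe(self, sock, group):
        self.groups.get(group, set()).discard(sock)

    def publish(self, group, msg):
        for sock in sorted(self.groups.get(group, ())):
            queue = self.queues.get(sock)
            if queue is None:
                continue  # the broker already closed this socket
            queue.append(msg)
            if len(queue) > self.max_queued:
                del self.queues[sock]  # the broker gives up and closes the socket

    def read_all(self, sock):
        self.queues[sock].clear()

    def send(self, sock, msg):
        # Writing on a socket the broker has closed fails like EPIPE.
        if sock not in self.queues:
            raise BrokenPipeError(errno.EPIPE, "Broken pipe")


def run(unsubscribe_first, cycles=108):
    broker = ToyBroker()
    for sock in ("early", "main"):
        broker.connect(sock)
        broker.subscribe(sock, "Init")
    if unsubscribe_first:
        broker.unsubscribe("early", "Init")  # the idea behind the fix
    for i in range(cycles):
        broker.publish("Init", "notification %d" % i)
        broker.read_all("main")  # only the monitored socket is ever drained
    try:
        broker.send("early", "set_config")
        return "ok"
    except BrokenPipeError:
        return "EPIPE"
```

In this model, `run(False)` ends with the write on the unread socket failing, while `run(True)` succeeds because nothing is ever queued on that socket.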

While debugging this problem, quite a bit of additional logging was necessary, but I've only committed the ones that are absolutely essential as otherwise the amount of trace logging will hinder performance.

Some changes were suggested (and made locally) to the DHCP lettuce ticket #3140. It seems these have now been made in trac3140, so these patches have been dropped.

DHCP team have tried this patch and have found it to fix the issue.

comment:7 follow-up: Changed 6 years ago by muks

A couple of lettuce tests failed on this branch:

We need to investigate whether it is due to this branch, and what caused them.

comment:8 in reply to: ↑ 7 Changed 6 years ago by muks

Replying to muks:

A couple of lettuce tests failed on this branch:

We need to investigate whether it is due to this branch, and what caused them.

I'm not able to reproduce either issue on my workstation on this branch. But these are real issues anyway, so I'll ask Jeremy to check on the MacOS environment.

comment:9 Changed 6 years ago by muks

  • Owner changed from UnAssigned to muks

Shane confirmed on Jabber that he has reviewed this ticket (but could not update it at the time as the Trac website was down). He said it is ready for merge.

comment:10 Changed 6 years ago by muks

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged to master branch in commit ed672a898d28d6249ff0c96df12384b0aee403c8:

* 8e5d945 [3074] Unsubscribe from earlier CC session first
* 33a9204 [3074] Add logging
* 2115ed0 [3074] Add logging
* dab0927 [3074] Add logging

Also added ChangeLog:

+698.   [bug]           muks
+       A bug was fixed in the interaction between b10-init and b10-msgq
+       that caused BIND 10 failures after repeated start/stop of
+       components.
+       (Trac #3094, git ed672a898d28d6249ff0c96df12384b0aee403c8)

Resolving as fixed. Thank you for the review Shane.
