Opened 10 years ago

Closed 5 years ago

#134 closed defect (wontfix)

Boss more careful shutdown

Reported by: jreed
Owned by: UnAssigned
Priority: low
Milestone: DNS Outstanding Tasks
Component: ~Boss of BIND (obsolete)
Version:
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: Medium
Sub-Project: DNS
Feature Depending on Ticket:
Estimated Difficulty: 0.0
Add Hours to Ticket:
Total Hours:
Internal?: no

Description

I had a b10-xfrout process running since March 23 (with nothing else using it). I couldn't kill it except with SIGKILL.

Here is a bind10 shutdown (for a different xfrout):

Sending SIGTERM to b10-xfrout (PID 1267).
Sending SIGTERM to b10-cfgmgr (PID 15979).
Process b10-cfgmgr (PID 15979) died.
Sending SIGKILL to b10-xfrout (PID 1267).
Process b10-xfrout (PID 1267) died.

This ticket is opened to track down why b10-xfrout won't exit on its own.

Attachments (1)

ticket_134.diff (5.1 KB) - added by zhanglikun 10 years ago.
The patch for fixing ticket 134

Change History (17)

comment:1 Changed 10 years ago by jreed

I realize I already opened a ticket for this a couple weeks ago: #121

comment:2 follow-up: Changed 10 years ago by zhanglikun

  1. When the Boss process receives CTRL-C, it sends a 'shutdown' message to all subprocesses (cmdctl, cfgmgr, etc.), sleeps for 0.1 second, and then sends SIGTERM to any subprocess that is still alive. Finally, Boss sends SIGKILL to make sure every process is killed.
  2. When CTRL-C is entered, xfrout also receives SIGINT and tries to exit. The problem is that xfrout receives the SIGTERM sent by Boss before it has finished exiting.

So the easy way to fix this is: make Boss sleep for more seconds after sending out the "shutdown" message?

comment:3 in reply to: ↑ 2 ; follow-up: Changed 10 years ago by jreed

Replying to zhanglikun:

So the easy way to fix this is: make Boss sleep for more seconds after sending out the "shutdown" message?

The BSD reboot(8) since at least 1990 waits 5 seconds after SIGTERM before doing a SIGKILL. I think that 5 seconds is good. If it ever gets to SIGKILL then we have a problem and should report that.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 10 years ago by shane

Replying to jreed:

Replying to zhanglikun:

So the easy way to fix this is: make Boss sleep for more seconds after sending out the "shutdown" message?

The BSD reboot(8) since at least 1990 waits 5 seconds after SIGTERM before doing a SIGKILL. I think that 5 seconds is good. If it ever gets to SIGKILL then we have a problem and should report that.

Well... we have a few things to do to make this as good as possible.

First, xfrout should implement shutdown.

Second, bind10 can wait a much longer time before proceeding from one severity to another.

To do this "properly", bind10 should do something like this:

  • Send a shutdown request to all processes
  • Wait for processes to die
  • When they all die, exit
  • If more than X seconds pass, send SIGTERM to all processes
  • Wait for processes to die
  • When they all die, exit
  • If more than X seconds pass, send SIGKILL to all processes
  • Wait for processes to die

This would both speed up the usual case (right now we wait 0.2 seconds whether we need to or not) and slow down the extreme case (since we'll be waiting SECONDS instead of tenths of seconds). I tend to think X should be 1 or 2 - nothing should take that long to stop unless things are broken.

Something we can also do is create a "fast-shutdown" option. This might change X from 1 or 2 to 0.1 like we have now (although still exiting faster under normal circumstances).

Does this make sense?
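
The sequence above could be sketched roughly as follows. This is only an illustration, not the bind10 code: it assumes the children are subprocess.Popen objects on a POSIX system, and wait_for_exit(), staged_shutdown() and the request_shutdown callback are hypothetical names standing in for however the shutdown request is really delivered.

```python
import signal
import subprocess
import sys
import time


def wait_for_exit(children, timeout, poll_interval=0.05):
    """Poll until every child has exited or `timeout` seconds pass.

    Returns the children that are still running.
    """
    deadline = time.monotonic() + timeout
    alive = [c for c in children if c.poll() is None]
    while alive and time.monotonic() < deadline:
        time.sleep(poll_interval)
        alive = [c for c in alive if c.poll() is None]
    return alive


def staged_shutdown(children, request_shutdown, timeout=2.0):
    """Escalate: shutdown request, then SIGTERM, then SIGKILL.

    Each stage is applied only to children that are still alive, and the
    wait ends as soon as they have all exited. Returns the names of the
    stages that were actually needed.
    """
    stages = [
        ("request", request_shutdown),  # hypothetical polite request
        ("SIGTERM", lambda c: c.send_signal(signal.SIGTERM)),
        ("SIGKILL", lambda c: c.kill()),
    ]
    alive = [c for c in children if c.poll() is None]
    used = []
    for name, action in stages:
        if not alive:
            break  # usual case: nothing left to wait for
        used.append(name)
        for child in alive:
            action(child)
        alive = wait_for_exit(alive, timeout)
    return used
```

With this shape, a "fast-shutdown" option would only need to pass a smaller timeout (for example 0.1), and a non-empty result past "request" is the signal that something needs to be reported.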

comment:5 in reply to: ↑ 4 Changed 10 years ago by shane

Replying to shane:

To do this "properly", bind10 should do something like this:

  • Send a shutdown request to all processes
  • Wait for processes to die
  • When they all die, exit
  • If more than X seconds pass, send SIGTERM to all processes
  • Wait for processes to die
  • When they all die, exit
  • If more than X seconds pass, send SIGKILL to all processes
  • Wait for processes to die

Additionally, we should probably kill msgq last, in case some process needs it to exit. This complicates things a bit.

    import signal

    def terminate(processes, timeout=2.0):
        # Escalate in three steps, reaping dead processes after each one.
        for process in processes:
            ask_politely_to_exit(process)
        reap_all_processes(timeout)
        for process in processes:
            process.kill(signal.SIGTERM)
        reap_all_processes(timeout)
        for process in processes:
            process.kill(signal.SIGKILL)
        reap_all_processes(timeout)


    def shutdown():
        # Stop msgq last, in case some other process needs it to exit.
        to_terminate = [p for p in processes if p is not msgq]
        terminate(to_terminate)
        terminate([msgq])

Note this could slow things down if msgq was being unruly, but that is unlikely. :)

Changed 10 years ago by zhanglikun

The patch for fixing ticket 134

comment:6 Changed 10 years ago by zhanglikun

Hi Jeremy and Shane, please help review my patch for this ticket. What I did is:

  1. Avoid calling xfrout's shutdown() function twice: once because of SIGINT and once because of the "shutdown" message sent by Boss.
  2. Xfrout uses select() in one thread to wait for connections from the auth process, so when we shut down xfrout this thread should be stopped (with join()). But sometimes this thread is blocked in select(), so I set the select() poll interval to 0.1 second and let Boss sleep for 0.5s (the old value was 0.1s) after sending the "shutdown" message. Does this make sense?
  3. Xfrout shutdown logic:

join all the threads of xfrout
remove the unix socket file which is shared with the auth process.
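
For readers without the patch at hand, the three points above might look something like the sketch below. It is not taken from ticket_134.diff: the Listener class, its method names and the socket path are assumptions made for illustration, and a real server would hand accepted connections to a worker instead of closing them.

```python
import os
import select
import socket
import tempfile
import threading


class Listener:
    """Accepts connections on a UNIX socket in a background thread."""

    POLL_INTERVAL = 0.1  # seconds; bounds how long select() can block

    def __init__(self, path):
        self._path = path
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self._sock.bind(path)
        self._sock.listen(5)
        self._stop = threading.Event()
        self._lock = threading.Lock()
        self._closed = False
        self._thread = threading.Thread(target=self._serve)
        self._thread.start()

    def _serve(self):
        # The select() timeout lets the loop notice the stop event promptly.
        while not self._stop.is_set():
            readable, _, _ = select.select(
                [self._sock], [], [], self.POLL_INTERVAL)
            if readable:
                conn, _ = self._sock.accept()
                conn.close()  # placeholder for real request handling

    def shutdown(self):
        """Stop the thread and clean up; only the first call does any work."""
        with self._lock:
            if self._closed:
                return False  # e.g. SIGINT arrived, then the "shutdown" message
            self._closed = True
        self._stop.set()
        self._thread.join()
        self._sock.close()
        os.unlink(self._path)  # remove the socket file shared with auth
        return True


if __name__ == "__main__":
    sock_path = os.path.join(tempfile.mkdtemp(), "xfrout.sock")
    listener = Listener(sock_path)
    listener.shutdown()
    listener.shutdown()  # second call is a harmless no-op
```

Because select() wakes at least every 0.1 second, join() should return well within the 0.5s that Boss now waits.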

comment:7 Changed 10 years ago by zhanglikun

Also, the patch tries to fix ticket 135 and ticket 151.

comment:8 Changed 10 years ago by zhanglikun

  • Owner set to shane
  • Status changed from new to assigned

comment:9 follow-up: Changed 10 years ago by shane

The problem with this approach is that it will only shut down the server after all xfrout transfers have completed. This could take several seconds or even HOURS, depending on what is going on. (Imagine a server sending the .CN zone)

Just had a quick conversation with Feng, and I think we actually have a harder problem here. :)

The question is: What does the administrator want to do?

  1. Wait until all in-progress operations are done, then shut down?
  2. Shut down right away?

Sometimes you want to wait until everything is done, then exit. Sometimes you want to stop RIGHT AWAY.

Also note that when you say "wait", what you actually mean is "wait for a while". Sometimes this is a couple of seconds (when you are rebooting the box perhaps), sometimes this is several minutes (like when you are stopping the process by hand), sometimes this is hours (like if a really long XFR is going out).

So, I propose that we need to change the behavior to implement two different shutdown styles from the point of view of xfrout:

  1. Begin shutdown
  2. Shut down right away

In the first case, we would wait until all threads are done then exit. In the second case, xfrout can just call sys.exit() and let the OS clean everything up.

What the administrator probably wants from either using ctrl-C or "killall bind10" is:

  • Tell me that shutdown has started
  • If you are waiting on something, tell me how many are left and update me as they exit
  • Let me hit ctrl-C or send SIGTERM again to stop right away (tell me I can do this, too)

If the administrator is using bindctl, we probably want something like:

  • Give me two options "shutdown" and "force-shutdown" - the second stops right away
  • If I do "shutdown", keep me informed of the status of the shutdown

This may seem a bit complicated, but I think it is probably the right answer.
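
As a rough illustration of the ctrl-C behaviour proposed above (first signal begins shutdown, second one stops right away), something like the following could work. The ShutdownController class, its state names and the status() wording are all hypothetical; nothing like this exists in bind10 as far as this ticket shows.

```python
import signal


class ShutdownController:
    """First signal starts a graceful shutdown; a second one forces it."""

    def __init__(self):
        self.state = "running"

    def handle_signal(self, signum, frame):
        # running -> draining -> forced; the handler only records the
        # request, the main loop acts on it.
        self.state = "draining" if self.state == "running" else "forced"

    def install(self):
        for signum in (signal.SIGINT, signal.SIGTERM):
            signal.signal(signum, self.handle_signal)

    def status(self, pending):
        """Message for the administrator; `pending` counts unfinished work."""
        if self.state == "draining":
            return ("Shutdown started: waiting for %d operation(s); "
                    "send the signal again to stop right away." % pending)
        if self.state == "forced":
            return "Stopping right away."
        return "Running."
```

The main loop would call status() each time an operation finishes while draining, and exit immediately once the state becomes "forced". The bindctl "shutdown" and "force-shutdown" commands could drive the same two states.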

I'm going to cut & paste this to bind10-dev for wider discussion...

comment:10 in reply to: ↑ 9 Changed 10 years ago by shane

Replying to shane:

The problem with this approach is that it will only shut down the server after all xfrout transfers have completed. This could take several seconds or even HOURS, depending on what is going on. (Imagine a server sending the .CN zone)

Likun just pointed out that the code sets a shutdown event, so it actually exits after the next message is sent (which could in principle be a while due to TCP, but will usually be very quick).

So, I propose that we apply his patch because it improves behavior, but leave this ticket open pending the discussion about shutdown on bind10-dev. I'll take ownership until that is done.
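
The behaviour Likun points out, where a transfer stops after the next message once the shutdown event is set, can be pictured like this. It is a simplified sketch, not the xfrout code; send_zone() and its parameters are made-up names.

```python
import threading


def send_zone(messages, send, shutdown_event):
    """Send messages one at a time, stopping once shutdown is requested.

    Returns the number of messages actually sent.
    """
    sent = 0
    for message in messages:
        if shutdown_event.is_set():
            break  # give up on the rest of the zone
        send(message)  # may block for a while on a slow TCP peer
        sent += 1
    return sent
```

The event is only checked between messages, so the delay before exit is bounded by one send() call and not by the size of the zone.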

comment:11 Changed 10 years ago by shane

  • Component changed from Unclassified to Boss of BIND
  • Milestone set to feature backlog item
  • Priority changed from major to minor
  • Summary changed from xfrout not killed by SIGTERM to Boss more careful shutdown
  • Type changed from defect to enhancement

comment:12 Changed 9 years ago by stephen

  • Milestone feature backlog item deleted

comment:13 Changed 9 years ago by shane

  • billable set to 0
  • Internal? unset
  • Owner changed from shane to UnAssigned

comment:14 Changed 8 years ago by shane

  • Defect Severity set to Medium
  • Sub-Project set to DNS
  • Type changed from enhancement to defect

I think there is still some improvement we can make here. It should be relatively minor work, in the shutdown() method, to catch all exceptions and continue to kill, kill, kill until everything is cleaned up.
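
A minimal sketch of that idea, assuming each child object has a kill() method (as subprocess.Popen does); kill_all() and its log parameter are hypothetical names, not part of the Boss code:

```python
def kill_all(children, log=print):
    """Try to kill every child, continuing past individual failures.

    Returns the children whose kill attempt raised an exception.
    """
    failed = []
    for child in children:
        try:
            child.kill()
        except Exception as exc:  # deliberately broad: keep going no matter what
            log("could not kill %r: %s" % (child, exc))
            failed.append(child)
    return failed
```

Returning the failures lets the caller report them after the loop without letting one bad process stop the cleanup of the others.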

comment:15 Changed 6 years ago by stephen

  • Milestone set to DNS Outstanding Tasks

comment:16 Changed 5 years ago by tomek

  • Resolution set to wontfix
  • Status changed from assigned to closed

The DNS and BIND10 framework is outside the scope of the Kea project.
The corresponding code has been removed from the Kea git repository.
If you want to follow up on DNS or former BIND10 issues, see
the http://bundy-dns.de project.

Closing ticket.
