Opened 8 years ago

Closed 5 years ago

#1049 closed defect (wontfix)

Processes not shutting down cleanly

Reported by: jreed
Owned by:
Priority: medium
Milestone: Common Outstanding Tasks
Component: ~Boss of BIND (obsolete)
Version: bind10-old
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: N/A
Sub-Project: DNS
Feature Depending on Ticket:
Estimated Difficulty: 6
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

I don't think bind10 (boss) should fall back to SIGKILL.

I see it often on multiple systems. Processes aren't dying with SIGTERM, so we brute-force them to close. (Generally, a "shutdown" sent over a command channel does work, though.)

I understand it is useful so everything dies, but it hides problems -- why aren't the processes closing correctly?

We shouldn't ever have real data loss on abrupt shutdown (because we should never respond about success until data has been written and synced to final storage).

But we do have potential data loss in log message output. If we SIGKILL, we may lose debugging output that may be useful.
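
As an illustration of "written and synced to final storage", here is a minimal Python sketch. The helper name `durable_write` is hypothetical and not taken from the BIND 10 code; it only shows the general write, fsync, rename pattern on a POSIX system.

```python
import os
import tempfile


def durable_write(path: str, data: bytes) -> None:
    """Write data to path; return only after it has been synced to storage."""
    directory = os.path.dirname(os.path.abspath(path))
    # Use a temporary file in the same directory so the final rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # push Python's buffer to the OS
            os.fsync(tmp.fileno())  # ask the OS to push its cache to storage
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Do not leave a half-written temporary file behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

A caller would report success to the client only after `durable_write` returns, so a SIGKILL at any point leaves either the old file or the complete new one.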

If the SIGKILL is still desired, please:

  • don't do the SIGKILL only 0.1 second after SIGTERM. Wait much longer. Even check if children are still alive first and then wait a few seconds.
  • allow an option to turn this off. I think that developers should never use SIGKILL or we won't fix real exit problems.
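
The two requests above could look roughly like the following Python sketch. It is not the boss's actual shutdown code; the function name `stop_children` and its parameters (`grace_seconds`, `allow_sigkill`, `poll_interval`) are assumptions made for illustration.

```python
import signal
import subprocess
import time


def stop_children(children, grace_seconds=5.0, allow_sigkill=True,
                  poll_interval=0.05):
    """Send SIGTERM, wait up to grace_seconds, then optionally SIGKILL.

    children is a list of subprocess.Popen objects. Returns the children
    that did not exit within the grace period.
    """
    for child in children:
        if child.poll() is None:          # still running
            child.send_signal(signal.SIGTERM)

    # Check whether the children are still alive before escalating.
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        if all(child.poll() is not None for child in children):
            return []
        time.sleep(poll_interval)

    survivors = [child for child in children if child.poll() is None]
    if allow_sigkill:                     # developers can turn this off
        for child in survivors:
            child.kill()                  # SIGKILL on POSIX
            child.wait()
    return survivors
```

With `allow_sigkill=False` the survivors are returned untouched, so a developer can see exactly which components failed to exit instead of having the problem hidden.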

Change History (14)

comment:1 Changed 8 years ago by shane

  • Milestone changed from New Tasks to Year 3 Task Backlog

There are two issues here.

First, I agree that a process not exiting with the shutdown message - or at least SIGTERM - is a bug. I also agree that using SIGKILL masks this. So we do need a way for developers to disable this.

The second issue is that in real operation the system needs to promptly shut down, even if there are developer errors. So I argue that for default operation we need to keep the system "as is".

As for the timing... waiting a few seconds seems insane to me. All of our tasks need to be able to handle being shut down at any time anyway (we may get an accidental "kill -9" from an administrator mistyping a process ID, the Linux out-of-memory handler can shut things down, power cables can be tripped over, and so on). Having the system take more than a second to shut down seems like a defect to me, and there doesn't seem to be any real need for it.

So I think this ticket has a few work items, both minor:

  1. Add a flag to disable SIGKILL on shutdown (or perhaps one to set the interval before falling back to SIGKILL?)
  2. Fix any processes that currently block or catch SIGTERM (besides the boss)
  3. Ensure all processes properly handle shutdown requests
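
For items 2 and 3, a component that handles SIGTERM properly might follow the pattern in this Python sketch. The names `stop_requested`, `request_stop` and `run` are hypothetical and are not taken from any BIND 10 component.

```python
import logging
import signal
import threading

# Set by the signal handler, checked by the main loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    # Only set a flag here; the real cleanup happens in the main loop.
    stop_requested.set()


def run(do_work, poll_seconds=0.1):
    """Call do_work repeatedly until SIGTERM arrives, then exit cleanly."""
    signal.signal(signal.SIGTERM, request_stop)
    while not stop_requested.is_set():
        do_work()
        stop_requested.wait(poll_seconds)
    logging.info("shutdown requested, exiting cleanly")
    logging.shutdown()  # flush and close log handlers before exiting
```

Because the handler does nothing but set a flag, the process leaves its loop within one poll interval and still gets to flush its log output, which a SIGKILL would not allow.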

comment:2 Changed 8 years ago by jreed

Note it is not just SIGKILL: I now have an example of nine SIGTERMs being sent and then six SIGKILLs. The SIGTERMs should not happen for msgq-controlled components ("shutdown" sent to each).

comment:3 follow-up: Changed 8 years ago by jreed

This problem continues. We should figure out why components aren't exiting, and not resort to a forced kill.

Here are some examples:

2012-01-18 12:54:51.120 DEBUG [b10-boss.boss] BIND10_CONFIGURATOR_TASK performing task stop on Socket creator
2012-01-18 12:54:51.120 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component Socket creator is being stopped
2012-01-18 12:54:51.161 INFO  [b10-boss.boss] BIND10_SOCKCREATOR_TERMINATE terminating socket creator
2012-01-18 12:54:51.295 DEBUG [b10-boss.boss] BIND10_CONFIGURATOR_TASK performing task stop on msgq
2012-01-18 12:54:51.303 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component msgq is being stopped
2012-01-18 12:54:51.303 DEBUG [b10-boss.boss] BIND10_CONFIGURATOR_TASK performing task stop on b10-cmdctl
2012-01-18 12:54:51.303 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component b10-cmdctl is being stopped
2012-01-18 12:54:51.329 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-cmdctl to shut down
2012-01-18 12:54:51.411 ERROR [b10-boss.boss] BIND10_CONFIGURATOR_PLAN_INTERRUPTED configurator plan interrupted, only 2 of 8 done
2012-01-18 12:54:52.503 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 1481 of Socket creator ended with status 0
...
2012-01-18 12:54:52.580 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to cfgmgr (PID 1671)
...
2012-01-18 12:54:52.626 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to b10-cmdctl (PID 1935)
...
2012-01-18 12:54:52.628 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to msgq (PID 1778)
...
2012-01-18 12:54:52.733 INFO  [b10-boss.boss] BIND10_SEND_SIGKILL sending SIGKILL to cfgmgr (PID 1671)
2012-01-18 12:54:52.792 INFO  [b10-boss.boss] BIND10_SEND_SIGKILL sending SIGKILL to b10-cmdctl (PID 1935)
2012-01-18 12:54:52.793 INFO  [b10-boss.boss] BIND10_SEND_SIGKILL sending SIGKILL to msgq (PID 1778)
2012-01-18 12:54:52.903 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 1778 of msgq ended with status 9
2012-01-18 12:54:52.903 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 1935 of b10-cmdctl ended with status 9
2012-01-18 12:54:52.903 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 1671 of cfgmgr ended with status 9

Why try to SIGKILL so fast?
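
To quantify "so fast", the gap between the two signals can be read out of the log. The Python sketch below assumes the line layout shown in the excerpt above; the function name `sigterm_to_sigkill_delays` is hypothetical.

```python
import re
from datetime import datetime

# Matches the BIND10_SEND_SIGTERM / BIND10_SEND_SIGKILL lines shown above.
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*"
    r"BIND10_SEND_(?P<sig>SIGTERM|SIGKILL) .*\(PID (?P<pid>\d+)\)"
)


def sigterm_to_sigkill_delays(lines):
    """Return {pid: seconds between SIGTERM and SIGKILL} for the log lines."""
    term_time = {}
    delays = {}
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # not a signal-sending line
        when = datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S.%f")
        pid = int(match["pid"])
        if match["sig"] == "SIGTERM":
            term_time[pid] = when
        elif pid in term_time:
            delays[pid] = (when - term_time[pid]).total_seconds()
    return delays
```

For the three PIDs in the excerpt (1671, 1935, 1778), the timestamps give delays of between about 0.15 and 0.17 seconds.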

comment:4 in reply to: ↑ 3 Changed 8 years ago by vorner

Hello

Replying to jreed:

Why try to SIGKILL so fast?

I guess this one is a bug in emergency shutdown. It just didn't give the components time to terminate, for some reason. This indeed should be fixed.

comment:5 Changed 8 years ago by jelte

  • Milestone changed from Year 3 Task Backlog to Next-Sprint-Proposed

comment:6 Changed 8 years ago by shane

Note also that sometimes the problem is that the msgq is in a broken state, or simply broken with certain components. I think the 3 work items I outlined are the appropriate ones for this ticket. Possibly separate tickets are needed for each.

Last edited 8 years ago by shane

comment:7 Changed 8 years ago by jinmei

It's not clear to me, after all of the discussions, what the goal
of the ticket is now. Without knowing it I cannot give an estimate.

comment:8 Changed 8 years ago by shane

The concern is that use of SIGKILL is masking real problems. Which is why I proposed:

  • Add a flag to disable SIGKILL on shutdown (or perhaps one to set the interval before falling back to SIGKILL?)

We know that some processes inappropriately block signals (xfrin and xfrout, I believe). These should be fixed to no longer do that blocking, which is why I proposed:

  • Fix any processes that currently block or catch SIGTERM (besides the boss)
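
The ticket does not say how those processes block the signal. As one way to check from inside a Python process on a POSIX system, the sketch below inspects the signal mask and the installed handler. The helper names are hypothetical.

```python
import signal


def sigterm_is_blocked() -> bool:
    """Report whether SIGTERM is blocked in the calling thread."""
    # Blocking an empty set changes nothing and returns the current mask.
    current_mask = signal.pthread_sigmask(signal.SIG_BLOCK, [])
    return signal.SIGTERM in current_mask


def sigterm_is_ignored() -> bool:
    """Report whether SIGTERM is currently set to be ignored."""
    return signal.getsignal(signal.SIGTERM) == signal.SIG_IGN


def unblock_sigterm() -> None:
    """Remove SIGTERM from the calling thread's signal mask."""
    signal.pthread_sigmask(signal.SIG_UNBLOCK, [signal.SIGTERM])
```

A component could log the result of these checks at startup, which would make an inherited or accidental block visible without waiting for a failed shutdown.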

Finally, we need to make sure that all of our processes are actually receiving and handling shutdown requests (I think they are), which is why I proposed:

  • Ensure all processes properly handle shutdown requests

There is a discussion about whether the 0.1 second delay before SIGKILL is appropriate. Personally I think the delay needs to be small - nothing is more annoying for an administrator than having to wait for a system to shut down. But I don't think there are any specific actions for that, at least not without discussion on bind10-dev.

comment:9 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0.0 to 6

comment:10 Changed 7 years ago by shane

  • Summary changed from bind10 should not resort to SIGKILL to Processes not shutting down cleanly

comment:11 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

comment:12 Changed 7 years ago by jreed

  • Milestone set to Next-Sprint-Proposed

comment:13 Changed 6 years ago by stephen

  • Milestone set to Common Outstanding Tasks

comment:14 Changed 5 years ago by tomek

  • Resolution set to wontfix
  • Status changed from new to closed
  • Version set to old-bind10

This issue is related to bind10 code that is no longer part of Kea.

If you are interested in BIND10/Bundy framework or its DNS components,
please check http://bundy-dns.de.

Closing ticket.
