Opened 8 years ago

Closed 7 years ago

#1858 closed defect (fixed)

sockcreator doesn't die

Reported by: jinmei Owned by: jinmei
Priority: medium Milestone: Sprint-20121023
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 4 Add Hours to Ticket: 0
Total Hours: 4.52 Internal?: no

Description

On my personal server the socket creator often stays alive after I
shut down the system (via bindctl 'Boss shutdown'). ps indicates it
sleeps on socket read.

I couldn't reproduce it by restarting things and stopping them
immediately, but in my experience it often happens if I shut the
system down after it has run for some time, such as a few days (or
perhaps even several hours).

So I suspect it's a real bug and should be solved. Not super urgent,
but I'm putting it to the next sprint proposed queue.

Change History (16)

comment:1 Changed 8 years ago by jinmei

This is a log when this happened.

2012-03-27 23:12:11.780 INFO  [b10-boss.boss] BIND10_CONFIGURATOR_STOP bind10 component configurator is shutting down
2012-03-27 23:12:11.782 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component b10-auth-2 is being stopped
2012-03-27 23:12:11.782 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-auth-2 to shut down
2012-03-27 23:12:11.783 ERROR [b10-boss.boss] BIND10_CONFIGURATOR_PLAN_INTERRUPTED configurator plan interrupted, only 0 of 11 done
2012-03-27 23:12:12.800 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to Socket creator (PID 29154)
2012-03-27 23:12:12.801 WARN  [b10-boss.boss] BIND10_SOCKCREATOR_KILL killing the socket creator
...
2012-03-27 23:12:12.993 ERROR [b10-xfrout.xfrout] XFROUT_RECEIVE_FILE_DESCRIPTOR_ERROR error receiving the file descriptor for an XFR connection
(lots of this)
...
2012-03-27 23:12:19.347 INFO  [b10-boss.boss] BIND10_SEND_SIGKILL sending SIGKILL to Socket creator (PID 29154)
(lots of this)
...
2012-03-27 23:13:21.874 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 29154 of Socket creator ended with status 15
2012-03-27 23:13:21.875 INFO  [b10-boss.boss] BIND10_SHUTDOWN_COMPLETE all processes ended, shutdown complete

In this case I killed the socket creator by hand. I guess the log at
23:13:21.874 was due to this manual operation.

So, what apparently happened is that Boss's shutdown message to the
socket creator was somehow lost, and while Boss tried to kill it
forcefully (and of course unsuccessfully, because it didn't have the
permission), both Boss and the creator stayed alive.

The lost message is itself an issue, but the subsequent behavior is
obviously very bad (if I hadn't noticed it, it could effectively have
resulted in a busy loop). At least Boss should give up sending a
signal if it fails with permission denied, because a retry would
never succeed.
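
A minimal sketch of that "give up on permission denied" idea, not the
actual Boss code. The function name send_signal_once, its return
strings, and the injectable kill parameter are hypothetical and exist
only for illustration; os.kill and the errno values are standard
Python/POSIX.

```python
import errno
import os
import signal


def send_signal_once(pid, sig=signal.SIGTERM, kill=os.kill):
    """Try to deliver sig to pid exactly once.

    Returns "sent" on success, "gone" if the process no longer exists,
    and "denied" if the caller lacks permission (EPERM).  A "denied"
    result means the caller should stop retrying, since a retry with
    the same privileges cannot succeed.
    """
    try:
        kill(pid, sig)
    except OSError as err:
        if err.errno == errno.EPERM:
            return "denied"
        if err.errno == errno.ESRCH:
            return "gone"
        raise  # anything else is unexpected; let the caller see it
    return "sent"
```

A shutdown loop could then drop a process from its "still to kill" list
as soon as it sees "denied" or "gone", instead of sending the same
signal again.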

comment:2 Changed 8 years ago by shane

There are a couple things here. I think that loop with xfrout may be related to the "infinite loop on xfrout" ticket (#988).

comment:3 follow-up: Changed 8 years ago by vorner

Hello

Two notes:

  • Something failed during the shutdown (I don't know what it was, but there was an exception during the shutdown), so the shutdown was aborted and it went to the emergency shutdown. This should be changed, and I believe there's a ticket for it somewhere: if we're shutting down, we should try to „accumulate“ the exceptions, but shut down as many things as possible the usual way.
  • The fact that it fails to deliver the SIGKILL is a problem. I don't completely like aborting the KILLs generally, as that could leave some other things running without boss and there's less chance of administrator noticing that. On the other hand, termination of boss should cause the socket creator to shut down, as it will lose the stdin it reads commands from and fail.

Anyway, we might want to examine an ability to keep the „bind low ports“ privilege only and run as the user the rest runs at. That would solve the problem of undeliverable KILL.

comment:4 Changed 8 years ago by vorner

Looking at it, I believe #1412 should fix the problem.

comment:5 in reply to: ↑ 3 Changed 8 years ago by jinmei

Replying to vorner:

Hello

Two notes:

  • The fact that it fails to deliver the SIGKILL is a problem. I don't completely like aborting the KILLs generally, as that could leave some other things running without boss and there's less chance of administrator noticing that. On the other hand, termination of boss should cause the socket creator to shut down, as it will lose the stdin it reads commands from and fail.

We could also give feedback to the administrator if the shutdown is
triggered from a command via cmdctl (in practice, which means it's
from a bindctl terminal) instead of returning a 'success' answer
unconditionally.

            if command == "shutdown":
                self.runnable = False
                answer = isc.config.ccsession.create_answer(0)
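
One possible shape for that feedback, as a sketch only: the helper
shutdown_answer and its (rcode, message) return value are hypothetical
and are not the real isc.config.ccsession API. The assumption is just
that a non-zero result code plus a text message can be returned to the
command sender.

```python
def shutdown_answer(undeliverable):
    """Build an (rcode, message) pair for the 'shutdown' command.

    undeliverable is a list of names of processes that could not be
    stopped (for example because signalling them failed with EPERM).
    """
    if not undeliverable:
        # Everything was stopped: plain success, no message.
        return (0, None)
    # Something is still running: report it instead of claiming success.
    return (1, "shutdown incomplete; could not stop: "
            + ", ".join(undeliverable))
```

With something like this, the administrator would see which processes
are still running instead of an unconditional success answer.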

Anyway, we might want to examine an ability to keep the „bind low ports“ privilege only and run as the user the rest runs at. That would solve the problem of undeliverable KILL.

...for those systems that have fine-grained capability control. The
"world is not Linux" rule applies here (as far as I know many, if not
all, BSD variants don't have an equivalent interface).

comment:6 Changed 8 years ago by jinmei

This ticket is now probably confusing. If we include this in the next sprint,
my suggestion is to focus on not repeating a signal if sending it fails due to
permission denied. I believe it will at least work as a workaround for the
problem even if it may not solve all the underlying issues.

comment:7 Changed 7 years ago by jinmei

  • Milestone set to Next-Sprint-Proposed

comment:8 Changed 7 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20121023

comment:9 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:10 Changed 7 years ago by jinmei

trac1858 is ready for review.

While there are some derivative discussions in the ticket comments so
far, I focused on two major issues:

  • the socket creator doesn't die even after the bind10 process terminates. On looking into it, I found the cause of this: Python 3.1 (or older) doesn't close FDs other than stdin/out/err in a child process created via Popen. The open FD prevents the socket creator from getting EOF from the channel with the bind10 process.
  • the boss keeps trying to kill the process even after EPERM. It simply doesn't make sense because it will never magically succeed.

Combining these, it at least solves my problem: the bind10 process
exits, just ignoring the socket creator; and then the socket creator
exits correctly.
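
To illustrate the first point, here is a small self-contained
demonstration (POSIX only), not the code on the branch. It uses a
plain pipe rather than the real bind10/sockcreator channel, and the
function name eof_seen_after_parent_closes is hypothetical. Since
Python 3.2, Popen defaults to close_fds=True, and since Python 3.4 new
descriptors are non-inheritable by default, so the sketch calls
os.set_inheritable to mimic the older behaviour.

```python
import os
import select
import subprocess
import sys


def eof_seen_after_parent_closes(close_fds):
    """Return True if the pipe reader sees EOF within one second after
    the parent closes its own copy of the write end."""
    read_fd, write_fd = os.pipe()
    # Make the write end inheritable, as every FD was on old Pythons.
    os.set_inheritable(write_fd, True)
    # The child never touches the pipe; it only sleeps.  With
    # close_fds=False it still holds an inherited copy of write_fd.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(3)"],
        close_fds=close_fds)
    os.close(write_fd)
    try:
        # EOF is delivered only once *every* copy of the write end is
        # closed, including any copy leaked into the child.
        readable, _, _ = select.select([read_fd], [], [], 1.0)
        return bool(readable) and os.read(read_fd, 1) == b""
    finally:
        child.kill()
        child.wait()
        os.close(read_fd)
```

Calling it with close_fds=True should report EOF, while
close_fds=False should not, because the leaked descriptor in the child
keeps the pipe open. Passing close_fds=True explicitly to Popen gives
the same result regardless of the Python version's default.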

Proposed changelog entry:

494.?	[bug]		jinmei
	Fixed a problem where shutting down BIND 10 left some of the
	processes alive.  It was two-fold: when the main bind10 process
	started as root, started b10-sockcreator with that privilege, and
	then dropped the privilege, the bind10 process could no longer
	kill the sockcreator via signal (when it had to), but it kept
	sending the signal and didn't stop.  Also, when running on Python
	3.1 (or older), the sockcreator had some additional file
	descriptors open, which prevented it from exiting even after the
	bind10 process terminated.  Now the bind10 process simply gives up
	killing a subprocess if it fails due to lack of permission, and it
	makes sure the socket creator is spawned without any unnecessary
	FDs open.
	(Trac #1858, git TBD)

comment:11 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:12 Changed 7 years ago by jinmei

Since no one seems to have started the review, I made one unrelated
cleanup at ae0565a. I think it's an important cleanup because
a confusing warning-level log message can be harmful, but it'd be
too small to warrant a separate ticket. I hope it's okay to piggyback
it on this ticket.

comment:13 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:14 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei
  • Total Hours changed from 0 to 0.52

Hello

The branch seems OK, please merge.

comment:15 in reply to: ↑ 14 Changed 7 years ago by jinmei

Replying to vorner:

Hello

The branch seems OK, please merge.

Thanks, merge done, closing.

comment:16 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 0.52 to 4.52