Opened 9 years ago

Closed 8 years ago

#642 closed defect (wontfix)

SIGHUP and other signals cause boss to leave BIND 10 processes lying around

Reported by:           shane                       Owned by:                     shane
Priority:              low                         Milestone:                    Year 3 Task Backlog
Component:             ~Boss of BIND (obsolete)    Version:
Keywords:                                          Cc:
CVSS Scoring:                                      Parent Tickets:
Sensitive:             no                          Defect Severity:              Low
Sub-Project:           DNS                         Feature Depending on Ticket:
Estimated Difficulty:  10.0                        Add Hours to Ticket:          0
Total Hours:           0                           Internal?:                    no

Description

If the boss process gets killed by SIGHUP (or any other reason) then it leaves the BIND 10 processes lying around. While some signals cannot be caught, most can, and we should try to cleanup in these cases.
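
A minimal sketch of the idea (the handler and the shutdown routine here are only illustrative names, not the actual boss code):

    import signal

    def install_cleanup_handlers(shutdown_children):
        # shutdown_children is whatever routine stops the started processes.
        # SIGKILL and SIGSTOP cannot be caught, but the common fatal signals can.
        def handler(signum, frame):
            shutdown_children()
        for sig in (signal.SIGHUP, signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, handler)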

Subtickets

Change History (16)

comment:1 Changed 9 years ago by shane

Okay, this has been done and is ready for review in branch trac642.

It involved a bit of refactoring of the code to allow tests to be written for this. The main thing was splitting the main() function up into separate functions. This allows the setup code to be run without actually starting the server. This is useful for testing the signal handling code, since we can then have signal handlers installed but not actually run the main loop.

I did have to poke around a bit inside the bind10 module itself, in order to use the "mock" boss class instead of the real one. This ensures we don't actually start all of the programs when we begin.
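
The resulting shape is roughly the following (placeholder names, not the actual trac642 code):

    import signal

    class MockBoss:
        # Test stand-in: looks enough like the real boss, but starts no processes.
        def __init__(self):
            self.runnable = True

    def setup(boss):
        # Everything except the main loop, so the tests can have the signal
        # handlers installed without actually running the server.
        def request_shutdown(signum, frame):
            boss.runnable = False
        for sig in (signal.SIGHUP, signal.SIGINT, signal.SIGTERM):
            signal.signal(sig, request_shutdown)
        return boss

    def run(boss):
        while boss.runnable:
            signal.pause()       # the real boss runs its select loop here

    def main():
        run(setup(MockBoss()))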

comment:2 Changed 9 years ago by shane

  • Owner changed from shane to UnAssigned
  • Status changed from new to reviewing

comment:3 Changed 9 years ago by shane

  • Milestone changed from A-Team-Task-Backlog to A-Team-Sprint-20110316

I'm going to put this on the review queue for the next release, since it does fix a bug and does not seem to be a drastic fix. If it can't get reviewed & merged, it is not a tragedy of course.

comment:4 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:5 Changed 9 years ago by vorner

  • Owner changed from vorner to shane

Hello

I'm not sure if it really works. If I start bind10 in the background (with &) and close the terminal, all the processes survive for a while, but then they start to die out one by one. I guess it's because they can no longer write to the terminal that was closed.

It works if I send a TERM signal. But in that case (and in the other cases as well), it probably shouldn't terminate with exit status 0.

And the tests are quite repetitive. Would it be possible to have some kind of common „body“ of the test and just call it with the name of the signal to test?
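
Something along these lines, perhaps (just a quick, untested sketch of the idea):

    import os
    import signal
    import unittest

    class ExitSignalTest(unittest.TestCase):
        def _check_exit_signal(self, sig):
            # Common body: install a handler, send the signal to ourselves,
            # and check that a shutdown was requested.
            state = {'runnable': True}
            def request_shutdown(signum, frame):
                state['runnable'] = False
            old_handler = signal.signal(sig, request_shutdown)
            try:
                os.kill(os.getpid(), sig)
                self.assertFalse(state['runnable'])
            finally:
                signal.signal(sig, old_handler)

        def test_sighup(self):
            self._check_exit_signal(signal.SIGHUP)

        def test_sigint(self):
            self._check_exit_signal(signal.SIGINT)

        def test_sigterm(self):
            self._check_exit_signal(signal.SIGTERM)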

Thanks

comment:6 Changed 9 years ago by stephen

  • Milestone A-Team-Task-Backlog deleted

comment:7 Changed 9 years ago by shane

  • Milestone set to Sprint-20110405

comment:8 Changed 9 years ago by shane

I sent a mail to the bind10-dev list about this:

From: Shane Kerr <shane@isc.org>
To: bind10-dev <bind10-dev@lists.isc.org>
Date: Fri, 18 Mar 2011 14:30:01 +0100
Subject: [bind10-dev] Handling Disappearing Terminals

All,

We need to think about what happens to the server when the terminal it
is running in disappears.

History
-------
(Skip if you are impatient for the good stuff.)

At the end of last month, Jeremy sent a mail about his problems setting
up a forwarding resolver:

https://lists.isc.org/pipermail/bind10-dev/2011-February/002038.html

He reported this:

        I know why my bind10 was killed; it doesn't daemonize so when I
        closed terminal it was running in, it was killed -- but
        sometimes children didn't get killed. HUP or whatever signal was
        not trapped or passed to children?

This led me to make a ticket so that we handle SIGHUP and other signals
that might kill the boss process:

http://bind10.isc.org/ticket/642

However, Michal noted that this didn't seem to do anything at all when
he started a process in the background and the terminal was closed. So I
had a look and discovered that the behavior for processes varies quite a
bit depending on the exact details of how the controlling terminal goes
away.


Details of Terminal Closing
---------------------------
I looked at what happens to a process under 3 ways of being started:

1. Running the program
2. Using "su" and then running the program 
3. Using "sudo" to run the program

My theory was that these might set things up in slightly different ways,
and it turns out that is true.

I tried 3 types of test:

A. Start program and close the terminal window
B. Start program in the background (with & at the shell) then logout
C. Start program in the background then close the terminal window

I wrote small Python programs to use for this test, to concentrate on
figuring out the behavior.

My 1st program intercepted all signals possible, and then just waited
around for a KILL signal. :)

My 2nd program intercepted all signals possible, and then wrote a stream
to STDOUT in a loop.

My 3rd program intercepted all signals possible, and then used select()
to see if anything was available for reading, and tried to read if it
was.
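
For reference, the first one was essentially this (a rough sketch rather than the exact script):

    import signal
    import time

    def report(signum, frame):
        print("got signal", signum)

    for signum in range(1, signal.NSIG):
        try:
            signal.signal(signum, report)
        except (OSError, RuntimeError, ValueError):
            pass        # SIGKILL, SIGSTOP and friends cannot be caught

    while True:         # idle until something uncatchable (e.g. KILL) ends us
        time.sleep(60)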

Results:

--[ 1: idle ]----------------------------------------------------------
              Start/Close     Background/Logout    Background/Close
normal          SIGHUP            nothing               SIGHUP
su              nothing           nothing               nothing
sudo          SIGHUP (3x)         nothing               SIGHUP

--[ 2: writing ]-------------------------------------------------------
              Start/Close      Background/Logout   Background/Close
normal   SIGHUP, SIGTSTP, err         err             SIGHUP, err
su       err, SIGHUP, SIGTSTP         err                 err
sudo     SIGHUP, SIGTSTP, err         err             SIGHUP, err

--[ 3: reading ]-------------------------------------------------------
              Start/Close      Background/Logout   Background/Close
normal        SIGHUP, EOF             EOF             SIGHUP, EOF
su                EOF             SIGTTIN, EOF       SIGTTIN, EOF
sudo          SIGHUP, err      SIGTTIN, SIGTSTP,      SIGHUP, EOF
                             SIGTERM, SIGTSTP, EOF

If more than one thing happened, they are listed in the order they
occurred.

Key:
  SIGXXX is a signal arriving
  err is an I/O error (either writing or reading)
  EOF means a read returned 0 bytes, indicating end of file



Michal's Observation
--------------------
I think we can understand Michal's results:

      * When the terminal window closed, the boss got no signal at all. 
      * Then when one of the child processes tried to output some
        message, it got a write error.
      * When the boss caught the dying child, it tried to output a
        message explaining this and *also* got a write error.
      * Over time, more and more children got write errors and died.


Analysis
--------
The boss process can adapt itself to handle the terminal going away:
based on the research above, we can detect this and change its outputs
so that they go to /dev/null (or, better yet, so that they call empty
functions).
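
Roughly, that adaptation could look like this (just a sketch; the
function name is made up):

    import os
    import sys

    def write_or_detach(text):
        # If the terminal is gone, writing raises an I/O error; from then on
        # send both stdout and stderr to /dev/null and carry on.
        try:
            sys.stdout.write(text)
            sys.stdout.flush()
        except (IOError, OSError):
            devnull = open(os.devnull, 'w')
            sys.stdout = devnull
            sys.stderr = devnull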

The problem becomes what we do with child processes. If we want them to
write to the console, then they will get some sort of error too.

      * We could let the children die, and restart them, but this is...
        inelegant.
      * We could perhaps have the boss act as a proxy and use pipes to
        read the output.
      * We could do the same thing, but with pseudo-ttys. Python even
        has a module for this:
        http://docs.python.org/py3k/library/pty.html
      * We could shut down.

I realize some people want us to 'properly' daemonize. This would make
the problem go away, but we'll have to change all of the processes to
live in such an environment, and we'll *still* have to deal with these
issues when the program is run in the equivalent of '-f' or '-g' from
BIND 9 (run in foreground).

Please let me know what you think.

--
Shane

_______________________________________________
bind10-dev mailing list
bind10-dev@lists.isc.org
https://lists.isc.org/mailman/listinfo/bind10-dev

comment:9 Changed 9 years ago by shane

  • Owner changed from shane to vorner

Based on this analysis and feedback (there was not much) I decided the best thing is simply to shut down if the output TTY goes away.

The code exits if it has a problem writing, or indeed on any uncaught exception (which it probably should have done anyway).
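
Roughly, the intended shape is this (a sketch with placeholder names, not the literal diff):

    def run_until_fatal(boss):
        # Any exception escaping the loop (including an I/O error from writing
        # to a vanished terminal) ends it, and the children are shut down.
        try:
            while boss.runnable:
                boss.run_one_iteration()
        finally:
            boss.shutdown_processes()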

I consolidated some of the tests as requested. Please have another look, thanks.

comment:10 Changed 9 years ago by vorner

  • Owner changed from vorner to shane

Hello

The code looks OK. But still, if I start it in a terminal with & and then exit the terminal (this time it was an ssh connection), it survives at first and then the processes start dying out one by one. Should I have a look at what happens on my side and why this is happening?

Thanks

comment:11 Changed 9 years ago by shane

  • Defect Severity set to N/A
  • Sub-Project set to DNS

Ugh... okay, this is most likely caused by the boss process itself terminating when it writes output during shutdown.

I have added checks for problems writing throughout the shutdown code, and also changed the test to look at that case. I also actually tried the code by running it in the same way you have been. :)

One note - unless something actually writes output to stdout, there is no error. This will happen eventually, due to the stats program logging debugging output (or sometimes some other task), but it may take a while. AFAIK, there is no way to avoid this in a portable fashion. We could check for hangup using POLLHUP, but that wouldn't work on OS X, so we'd also have to code a version with kqueues, which is probably a bit beyond the scope of this exercise, which has already gotten quite big!
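
For reference, the POLLHUP check would look roughly like this on Linux (a sketch only; as said, OS X would need a kqueue-based version instead):

    import select
    import sys

    def stdout_hung_up():
        poller = select.poll()
        poller.register(sys.stdout.fileno(), select.POLLHUP | select.POLLERR)
        return any(events & (select.POLLHUP | select.POLLERR)
                   for fd, events in poller.poll(0))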

Note that a lot of this problem should go away when we switch to our actual logging system, instead of just writing to stdout.

comment:12 Changed 9 years ago by shane

  • Defect Severity changed from N/A to Low
  • Owner changed from shane to vorner

comment:13 Changed 9 years ago by vorner

  • Owner changed from vorner to shane

Good morning

Still, this doesn't work like it should. If I run it as ./sbin/bind10 -v & and close the terminal, the processes start dying out one by one as before. If I redirect stderr (but not stdout), it shuts down properly, with this at the end of the shutdown:

[b10-msgq] Closing socket fd 9
[b10-msgq] Receive error: EOF
Traceback (most recent call last):
  File "./sbin/bind10", line 1042, in main_loop
    next_rstart = boss_of_bind.restart_processes()
  File "./sbin/bind10", line 818, in restart_processes
    proc_info.name)
IOError: [Errno 5] Input/output error
[bind10] Exception in main loop: [Errno 5] Input/output errorException IOError: (5, 'Input/output error') in <_io.TextIOWrapper name='<stdout>' encoding='UTF-8'> ignored

So maybe the boss dies because of some automatic stack-trace printing?

Anyway, the new code solves part of the problem, I guess. But it is a little bit repetitive. Wouldn't it be better to have a method like:

def safe_write(self, what):
    try:
        sys.stdout.write(what)
    except:
        self.runnable = False

And use that?

With regards

comment:14 Changed 9 years ago by shane

  • Estimated Difficulty changed from 0.0 to 10

comment:15 Changed 9 years ago by shane

  • Milestone changed from Sprint-20110816 to Year 3 Task Backlog

Moving this to year 3 backlog, as clearly I have not had time to work on it in months. :(

comment:16 Changed 8 years ago by jelte

  • Resolution set to wontfix
  • Status changed from reviewing to closed

Ok, I don't think this ticket and its branch are going anywhere, and the process-startup code has changed significantly due to the configurable-module changes (the addition of the Boss/components configurables).

I am creating a new ticket for this functionality, since I think we still need it, and moving the branch to historic (origin/trac642-historic). New ticket is #1521.

Closing this one.
