Opened 9 years ago

Closed 6 years ago

Last modified 6 years ago

#530 closed defect (fixed)

[kean] msgq error message for unable to create socket

Reported by: jreed Owned by: muks
Priority: medium Milestone: Sprint-20131015
Component: ~msgq (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

  File "/home/reed/opt/bind10/sbin/bind10", line 544, in startup
    self.start_all_processes(c_channel_env)
  File "/home/reed/opt/bind10/sbin/bind10", line 485, in start_all_processes
    self.start_msgq(c_channel_env)
  File "/home/reed/opt/bind10/sbin/bind10", line 343, in start_msgq
    raise CChannelConnectError("Unable to connect to c-channel after 5 seconds")
__main__.CChannelConnectError: Unable to connect to c-channel after 5 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/reed/opt/bind10/sbin/bind10", line 899, in <module>
    main()

  File "/home/reed/opt/bind10/sbin/bind10", line 838, in main
    startup_result = boss_of_bind.startup()
  File "/home/reed/opt/bind10/sbin/bind10", line 546, in startup
    self.kill_started_processes()
  File "/home/reed/opt/bind10/sbin/bind10", line 263, in kill_started_processes
    self.processes[pid].process.kill()
  File "/usr/pkg/lib/python3.1/subprocess.py", line 1337, in kill
    self.send_signal(signal.SIGKILL)
  File "/usr/pkg/lib/python3.1/subprocess.py", line 1327, in send_signal
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process

Verbose mode gave me more information:

  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 156, in setup_listener
    self.listen_socket.bind(self.socket_file)
socket.error: [Errno 13] Permission denied

But didn't give me the path.

msgq should give a clear error message that says the path without the python noise.

bind10 should also not have any traceback. It should say that the msgq exited.

And why SIGKILL? (versus friendlier signal)

(I know the problem with msgq, but that is unrelated to this ticket.)

Subtickets

Change History (12)

comment:1 Changed 6 years ago by muks

  • Defect Severity set to N/A
  • Milestone set to Sprint-20130903
  • Sub-Project set to DNS
  • Summary changed from msgq error message for unable to create socket to [kean] msgq error message for unable to create socket

comment:2 Changed 6 years ago by kean

  • Owner set to kean
  • Status changed from new to accepted

comment:3 Changed 6 years ago by kean

  • Owner changed from kean to muks
  • Status changed from accepted to reviewing

Two small fixes for this. First, display the name of the socket file, if it is known, in the startup error message. Second, when running in verbose mode as described in other error messages, do not redirect STDOUT from b10-msgq so that any errors that msgq encounters during startup are displayed.

No ChangeLog? entry required.

comment:4 Changed 6 years ago by muks

  • Owner changed from muks to UnAssigned

Please don't assign tickets for review directly to a developer. Mark it as unassigned.

I have to finish currently assigned tickets before picking any new, and Shane or Jeremy may pick some meanwhile.

comment:5 in reply to: ↑ description Changed 6 years ago by muks

  • Owner changed from UnAssigned to kean

Hi Kean

Here are my review comments:

Your commits look ok (I think we discussed the verbose changes on Jabber), but do they address the following points in the ticket description:

bind10 should also not have any traceback. It should say that the msgq exited.

Would it still generate traceback or not?

And why SIGKILL? (versus friendlier signal)

Why SIGKILL? What happens that causes the ticket description's log?

comment:6 follow-up: Changed 6 years ago by kean

  • Resolution set to fixed
  • Status changed from reviewing to closed

No it no longer generates a traceback. The bug is over 3 years old and obviously things have changed a lot in that time.

The SIGKILL is deliberate because it is a "forceful" shutdown. See src/bin/init.py.in round about line 848. Here is what the output looks like:

./bind10 -m /tmp/sock/msgq -v
DEBUG [b10-msgq.msgq]: MSGQ_START Msgq version b10-msgq 20110127 (BIND 10 20130529) starting
DEBUG [b10-msgq.msgq]: MSGQ_LISTENER_SETUP Starting to listen on socket file '/tmp/sock/msgq'
FATAL [b10-msgq.msgq]: MSGQ_LISTENER_FAILED Failed to initialize listener on socket file '/tmp/sock/msgq': [Errno 13] Permission denied
FATAL [b10-msgq.msgq]: MSGQ_START_FAIL Error during startup: [Errno 13] Permission denied
INFO [b10-init.init]: BIND10_STARTING starting BIND10: bind10 20110223 (BIND 10 20130529)
DEBUG [b10-init.init]: BIND10_CHECK_MSGQ_ALREADY_RUNNING checking if msgq is already running
INFO [b10-init.init]: BIND10_CONFIGURATOR_START bind10 component configurator is starting up
DEBUG [b10-init.init]: BIND10_CONFIGURATOR_BUILD building plan '{}' -> '{'sockcreator': {'priority': 200, 'kind': 'core', 'special': 'sockcreator'}, 'msgq': {'priority': 199, 'kind': 'core', 'special': 'msgq'}, 'cfgmgr': {'priority': 198, 'kind': 'core', 'special': 'cfgmgr'}}'
DEBUG [b10-init.init]: BIND10_CONFIGURATOR_RUN running plan of 3 tasks
DEBUG [b10-init.init]: BIND10_CONFIGURATOR_TASK performing task start on Socket creator
INFO [b10-init.init]: BIND10_COMPONENT_START component Socket creator is starting
INFO [b10-init.init]: BIND10_SOCKCREATOR_INIT initializing socket creator parser
DEBUG [b10-init.init]: BIND10_STARTED_PROCESS_PID started b10-sockcreator (PID 4763)
DEBUG [b10-init.init]: BIND10_CONFIGURATOR_TASK performing task start on msgq
INFO [b10-init.init]: BIND10_COMPONENT_START component msgq is starting
INFO [b10-init.init]: BIND10_STARTING_PROCESS starting process b10-msgq
DEBUG [b10-init.init]: BIND10_STARTED_PROCESS_PID started b10-msgq (PID 4764)
ERROR [b10-init.init]: BIND10_CONNECTING_TO_CC_FAIL failed to connect to configuration/command channel; try -v to see output from msgq
ERROR [b10-init.init]: BIND10_COMPONENT_START_EXCEPTION component msgq failed to start: Unable to connect to c-channel after 5 seconds
ERROR [b10-init.init]: BIND10_COMPONENT_FAILED component msgq (pid None) failed: unknown condition
FATAL [b10-init.init]: BIND10_COMPONENT_UNSATISFIED component msgq is required to run and failed
ERROR [b10-init.init]: BIND10_CONFIGURATOR_PLAN_INTERRUPTED configurator plan interrupted, only 1 of 3 done
INFO [b10-init.init]: BIND10_KILLING_ALL_PROCESSES killing all started processes
INFO [b10-init.init]: BIND10_SEND_SIGKILL sending SIGKILL to Socket creator (PID 4763)
WARN [b10-init.init]: BIND10_SOCKCREATOR_KILL killing the socket creator
FATAL [b10-init.init]: BIND10_STARTUP_ERROR error during startup: Unable to start b10-msgq: Unable to connect to c-channel after 5 seconds

comment:7 in reply to: ↑ 6 Changed 6 years ago by muks

  • Resolution fixed deleted
  • Status changed from closed to reopened

Hi Kean

Replying to kean:

The SIGKILL is deliberate because it is a "forceful" shutdown. See src/bin/init.py.in round about line 848.

The question is: WHY was SIGKILL used? What happened that required doing a forceful shutdown?

comment:8 follow-up: Changed 6 years ago by kean

  • Owner changed from kean to muks
  • Status changed from reopened to reviewing

if you look at the code I pointed you to you will see how the decision is made. A few lines above it in the shutdown() method, it first sends SIGTERM by calling self.__kill_children(False) and then if there are any remaining components, it does a "forceful" shutdown by calling self.__kill_children(True). There are probably ways that can be improved but that was an "aside" question in the bug, not the purpose of the fix. If we want to change that code and put in a sleep to give the SIGTERM time to do its thing then by all means lets do so but then it belongs in another bug (IMHO). It was you who told me that bugs need to be very targeted in nature and this bug was about the msgq socket error messages not about why SIGKILL was used (again, except as an aside). If you disagree I am happy to address the SIGKILL issue too.

comment:9 Changed 6 years ago by jreed

  • Resolution set to complete
  • Status changed from reviewing to closed

I think it is okay for this ticket to be closed. The SIGKILL ticket is #1049. (Since it was my bug report, I now resolved this ticket.)

comment:10 in reply to: ↑ 8 ; follow-up: Changed 6 years ago by muks

  • Resolution complete deleted
  • Status changed from closed to reopened

Replying to kean:

if you look at the code I pointed you to you will see how the decision is made.

I think I was not clear in my last comments. I asked about why SIGKILL has to be sent, i.e., why it has to be shutdown forcefully like that. It is not about the code that sends the SIGKILL signal. In the normal way of things, components should gracefully shutdown.

What happens that causes the ticket description's log?

#1049 says we ought not to send the SIGKILL, but that is unrelated to what caused this component to not shutdown gracefully.

Can you look at the traceback in the description, look at the code in question and diagnose:

  • What component is being shutdown? Is it b10-msgq?
  • Why is the component being shutdown at this point by b10-init when startup itself has failed and there is no process to kill?
  • If b10-msgq startup fails, it's fatal. How should b10-init handle it? Getting such a traceback would be bad. It would need to be logged appropriately and BIND 10 should stop gracefully. This problem will recur if we leave it as-is.

These issues need not be fixed in this ticket. You can create new tickets for new work. But please diagnose what exactly went wrong.

comment:11 in reply to: ↑ 10 ; follow-up: Changed 6 years ago by kean

  • Resolution set to fixed
  • Status changed from reopened to closed

Replying to muks:

Replying to kean:

if you look at the code I pointed you to you will see how the decision is made.

I think I was not clear in my last comments. I asked about why SIGKILL has to be sent, i.e., why it has to be shutdown forcefully like that. It is not about the code that sends the SIGKILL signal. In the normal way of things, components should gracefully shutdown.

If you won't look at the code my responses to you will be meaningless. It explains why it sends SIGKILL. I will quote the code here so that hopefully this will be clear to you as it addresses all your questions except 1 (which component is being killed - the answer is - any - it doesn't matter they will all be killed the same way but in this particular case it was msgq).

        # Send TERM and KILL signals to modules if we're not prevented
        # from doing so
        if not self.nokill:
            # next try sending a SIGTERM
            self.__kill_children(False)
            # finally, send SIGKILL (unmaskable termination) until everybody
            # dies
            while self.components:
                # XXX: some delay probably useful... how much is uncertain
                time.sleep(0.1)
                self.reap_children()
                self.__kill_children(True)
            logger.info(BIND10_SHUTDOWN_COMPLETE)

So what happens first is the call to __kill_children(False). This sends the SIGTERM signal to each component to kill it. If any of those processes has a SIGTERM handler and doesn't exit *immediately* - and by immediately I mean in the time it takes this loop to execute, it drops into the while loop and calls __kill_children(True) which sends a SIGKILL.

The root of the problem is that there is no pause between sending the SIGTERM and the SIGKILL. Even the most trivial of SIGTERM handler will take longer than the amount of time it takes for the code to drop into sending SIGKILL. Therefore, as stated above in comment 8 the thing to do is to put in some kind of pause, possibly even the Python equivalent of a wait() in that loop.

What happens that causes the ticket description's log?

There was a permissions problem on the socket file or directory and the system was unable to start up.

  • What component is being shutdown? Is it b10-msgq?

Yes.

  • Why is the component being shutdown at this point by b10-init when startup itself has failed and there is no process to kill?

Because it is msgq itself that was being started up, one of the very first things init attempts. So until that point startup hasn't failed yet - it is the failure to start msgq that is the failure.

  • If b10-msgq startup fails, it's fatal. How should b10-init handle it? Getting such a traceback would be bad. It would need to be logged appropriately and BIND 10 should stop gracefully. This problem will recur if we leave it as-is.

As was stated above there is no more traceback. I quoted the output showing what the failure case looks like. Your statement about the problem recurring isn't correct. In the 3 years since this bug was filed changes have been made to catch the exceptions and log them cleanly rather than generating tracebacks due to unhandled exceptions.

It is very important to note that the above diagnosis pertains only to the old code. As the code currently stands in master, none of the above is relevant at all because the correct exception handling is done. The entire line of questions is moot. The only error that needed to be fixed was not squelching the stdout output from b10-msgq so that startup errors are visible, and doing a better job of displaying the socket name during exception handling. However the above issue of not waiting for SIGTERM to be able to do its job is still relevant.

Last edited 6 years ago by kean (previous) (diff)

comment:12 in reply to: ↑ 11 Changed 6 years ago by muks

Replying to kean:

If you won't look at the code my responses to you will be meaningless. It explains why it sends SIGKILL. I will quote the code here so that hopefully this will be clear to you as it addresses all your questions except 1 (which component is being killed - the answer is - any - it doesn't matter they will all be killed the same way but in this particular case it was msgq).

You have pointed at this code several times now. I'm aware of what this code does having written it. Let me put it in another way. The questions are not about this code at all. What was asked is:

  • Why does the component have to be forcefully terminated? This is not normal behavior. Components need to gracefully shutdown themselves. Sending TERM or KILL is a worst-case scenario. That code should not even exist if components shutdown properly. But because this is a b10-msgq startup failure,
  • If the component failed startup, why is its pid being passed to os.kill() ? It'll always not find the pid.
    os.kill(self.pid, sig)
OSError: [Errno 3] No such process

The root of the problem is that there is no pause between sending the SIGTERM and the SIGKILL. Even the most trivial of SIGTERM handler will take longer than the amount of time it takes for the code to drop into sending SIGKILL. Therefore, as stated above in comment 8 the thing to do is to put in some kind of pause, possibly even the Python equivalent of a wait() in that loop.

The topic of such a pause has been discussed before. Check the lists and earlier bug reports. I don't remember what the exact discussion was, but what we do is based on that.

  • If b10-msgq startup fails, it's fatal. How should b10-init handle it? Getting such a traceback would be bad. It would need to be logged appropriately and BIND 10 should stop gracefully. This problem will recur if we leave it as-is.

As was stated about there is no more traceback. I quoted the output showing what the failure case looks like. Your statement about the problem recurring isn't correct. In the 3 years since this bug was filed changes have been made to catch the exceptions and log them cleanly rather than generating tracebacks due to unhandled exceptions.

I looked at the entire startup codepath in init.py.in, and it's the try/except in __kill_children() that catches this OSError due to #1858. This is the explanation I was asking for, for the OSError traceback in the ticket description.

Note: See TracTickets for help on using tickets.