Opened 8 years ago

Closed 8 years ago

#1271 closed defect (fixed)

the boss process is too impatient while waiting for cfgmgr

Reported by: jreed Owned by: stephen
Priority: medium Milestone: Sprint-20111108
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 6 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

On a virtual machine I received the following Exception:

freebsd8-64-4# bind10 --verbose                                                 2011-09-29 15:39:46.584 INFO  [b10-boss.boss] BIND10_STARTING starting BIND10: bind10 20110223 (BIND 10 20110819)
2011-09-29 15:39:46.956 INFO  [b10-boss.boss] BIND10_SOCKCREATOR_INIT initializing socket creator parser
2011-09-29 15:39:46.959 INFO  [b10-boss.boss] BIND10_STARTING_PROCESS starting process b10-msgq
2011-09-29 15:39:48.253 INFO  [b10-boss.boss] BIND10_STARTING_PROCESS starting process b10-cfgmgr
2011-09-29 15:39:49.376 INFO  [b10-boss.boss] BIND10_STARTING_PROCESS starting process ccsession
2011-09-29 15:39:55.454 INFO  [b10-boss.boss] BIND10_KILLING_ALL_PROCESSES killing all started processes
2011-09-29 15:39:55.457 WARN  [b10-boss.boss] BIND10_SOCKCREATOR_KILL killing the socket creator
2011-09-29 15:39:55.473 INFO  [b10-boss.boss] BIND10_KILL_PROCESS killing process b10-msgq
2011-09-29 15:39:55.474 INFO  [b10-boss.boss] BIND10_KILL_PROCESS killing process b10-cfgmgr
Exception socket.error: error(32, 'Broken pipe') in <bound method ModuleCCSession.__del__ of <isc.config.ccsession.ModuleCCSession object at 0x802996dd0>> ignored
2011-09-29 15:39:55.504 FATAL [b10-boss.boss] BIND10_STARTUP_ERROR error during startup: Unable to start ccsession: No answer from ConfigManager when asking about Remote module Logging

This probably took over two minutes.

Subtickets

Change History (19)

comment:1 follow-up: Changed 8 years ago by shane

Even though the report is that it took over two minutes, the program thought it only took 9 seconds.

I'm not sure what we can do on a machine so badly broken. :(

comment:2 Changed 8 years ago by shane

  • Milestone New Tasks deleted

comment:3 in reply to: ↑ 1 Changed 8 years ago by jinmei

Replying to shane:

Even though the report is that it took over two minutes, the program thought it only took 9 seconds.

I'm not sure what we can do on a machine so badly broken. :(

This exact same happened on my personal server. (You might call it
broken:-) it's (in today's standard) unusually slow, but otherwise
it works perfectly fine. In my case sleeping for 1 second wasn't
sufficient for cfgmr to start up:

    def start_cfgmgr(self, c_channel_env):
        """
            Starts the configuration manager process
        """
[...]
        # sleep until b10-cfgmgr is fully up and running, this is a good place
        # to have a (short) timeout on synchronized groupsend/receive
        # TODO: replace the sleep by a listen for ConfigManager started
        # message
        time.sleep(1)

and so it encounters "broken pipe" in start_ccssession:

        self.start_cfgmgr(c_channel_env)
        self.start_ccsession(c_channel_env)

I tentatively changed the sleep time to 5 seconds, and it worked,
but it's needless to say that it's just an ugly workaround.

As noted in TODO, what the boss process should actually do is to establish
more explicit synchronization with cfgmgr. It should wait on msgq,
cfgmgr sends an "I'm ready" message to the boss when it starts up,
the boss continues its process then. We might also want to specify
a limit on the waiting time (and since it's before the boss gets the
configuration, it must be given via a command line option or an
environment variable).

I'd also suggest revising the log messages in this failure case. It's
difficult to know what was actually wrong from these messages. Seeing
the bare 'broken pipe' exception is also not cool.

Finally, not directly related to this issue, but this log message is
not good:

2011-09-29 15:39:49.376 INFO  [b10-boss.boss] BIND10_STARTING_PROCESS starting process ccsession

because ccsession is not a "process". This is a result of a naive
copy-paste here:

        self.log_starting("ccsession")

it should be fixed.

I proposed this ticket to be handled in the next sprint.

comment:4 Changed 8 years ago by jinmei

  • Milestone set to Next-Sprint-Proposed
  • Summary changed from exception on extremely slow startup to the process is too impatient while waiting for cfgmgr

(also changing the problem summary as it's not specific to an "extremely" slow setup)

comment:5 Changed 8 years ago by jinmei

  • Priority changed from minor to blocker

(I'm also changing the priority to "blocker" to indicate my stronger preference on this one)

comment:6 Changed 8 years ago by jinmei

  • Summary changed from the process is too impatient while waiting for cfgmgr to the boss process is too impatient while waiting for cfgmgr

comment:7 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20111025

comment:8 Changed 8 years ago by jinmei

  • Priority changed from blocker to major

comment:9 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 6

comment:10 Changed 8 years ago by stephen

  • Owner set to stephen
  • Status changed from new to assigned

comment:11 Changed 8 years ago by stephen

  • Owner changed from stephen to jelte
  • Status changed from assigned to reviewing

Changes seem to work. My personal system is fast enough that I don't get the problems, but tests with a sleep of 5 seconds added as the first statement of b10-cfgmgr appear to indicate that the handshake is OK.

comment:12 Changed 8 years ago by jelte

  • Owner changed from jelte to stephen

Code looks ok, and indeed seems to work.

just some documentation comments; I'd add the unit (seconds) of pwait to the documentation lines, and add the variable description to __init__() of bob as well, and the new command line argument needs a description in the bind 10 guide.

I suppose our unit test framework isn't equipped to test this? (this may be more of a system test thing, although atm i don't know how exactly to do that either)

comment:13 Changed 8 years ago by jelte

oh and we might want to have a changelog entry :)

comment:14 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111025 to Sprint-20111108

comment:15 Changed 8 years ago by stephen

  • Owner changed from stephen to jelte

just some documentation comments

Done.

oh and we might want to have a changelog entry

How about:

XXX.	[bug]		stephen
        Boss process now waits for the configuration manager to initialize
        itself before continuing with startup.  This fixes a race condition
        whereby the Boss could start the configuration manager and then
        immediately start components that depended on that component being
        fully initialized.
	(Trac #1271, git yyyy...)

comment:16 follow-up: Changed 8 years ago by jreed

What about having the components react correctly if the ConfigManager? is not answering too? Some crash and output a Traceback, others do nice logging and do their own shutdown. Maybe they should have retry on startup too?

comment:17 in reply to: ↑ 16 Changed 8 years ago by jelte

Replying to jreed:

What about having the components react correctly if the ConfigManager? is not answering too? Some crash and output a Traceback, others do nice logging and do their own shutdown. Maybe they should have retry on startup too?

most do almost the same thing, there are a few modules that explicitely catch the error, some just let it bubble up. Changing that shouldn't be too much trouble. Probably more of a question what 'the right thing' is here, for components that is. (i guess log and shutdown; they should be killed soon anyway). Make that a separate ticket?

comment:18 Changed 8 years ago by jelte

  • Owner changed from jelte to stephen

looks good, this can be merged

comment:19 Changed 8 years ago by stephen

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged in commit 607cbae949553adac7e2a684fa25bda804658f61
ChangeLog entry added in commit b16e9d26953cd7117d14ea8dde9e739cb34cb878

Note: See TracTickets for help on using tickets.