Opened 9 years ago

Closed 9 years ago

#565 closed defect (fixed)

Boss should start/stop b10-auth and b10-recurse without restart

Reported by: vorner Owned by: vorner
Priority: medium Milestone: A-Team-Sprint-20110309
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 3.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

Currently, if the Boss.start_auth or Boss.start_recursive is modified, the change is not reflected. To choose a different server to run, one needs to modify these configuration variables and then restart boss (therefore the whole system).

While something more flexible (with configurable, yet unknown services, dependencies, etc) is needed in future, it would be helpful if boss could start and stop the servers as the configuration changes at runtime, without a restart.

Subtickets

Change History (13)

comment:1 Changed 9 years ago by vorner

  • Milestone changed from A-Team-Task-Backlog to A-Team-Sprint-20110223
  • Owner set to vorner
  • Status changed from new to accepted

comment:2 Changed 9 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

I think it is ready for review.

However, I'd like to note few things. For one, I believe we will need to have more flexible boss (with list of subsystems to have running, it isn't possible not to start stats for example, or start auth without xfrout), therefore this is kind of short to middle term solution.

Another one is, there are no tests for part of it - the part that shuts down or starts the processes. I added only little bit of it, and I would need to create a whole testing framework for it (to tests respawns, starts, stops, etc). It seems an overkill for this task right now (because this is a nice feature that the users might notice, so I want it to get to the Y2 release). I'll create another ticket for it and put it to the A team backlog, where I describe some ideas about how the boss could be generalized and tested.

I propose this changelog entry:

[bug]      vorner
It is possible to start and stop resolver and authoritative server without
restart of the whole system. Change of the configuration (Boss/start_auth
and Boss/start_resolver) is enough.

comment:3 Changed 9 years ago by vorner

  • Estimated Difficulty changed from 0.0 to 3

It should have some estimate as well.

comment:4 Changed 9 years ago by jinmei

  • Owner changed from UnAssigned to jinmei

comment:5 follow-up: Changed 9 years ago by jinmei

I don't see critical problems in the patch itself at this stage. Some
minor points are commented below.

While looking at the code, however, some fundamental architecture
issues arose. The bob (or the BIND10 system in general) doesn't seem
to be able to handle subtle failure cases due to the asynchronous
architecture. For example, what if we start/stop b10-auth
successfully but fail to do it for b10-xfrout? Or what would happen
if we start a server (either auth or resolver) immediately after
stopping it? For bob, "stop" means sending a shutdown message, so it
would be possible to try to start it again before the process actually
dies.

These are probably not an issue specific to this patch but rather a
larger architectural one for the BIND10 process model, though. To
make it really production ready, I guess we need to carefully design
state management of the subprocesses so that the questions like the
above ones are clarified.

At the current level of overall maturity of BIND 10, I'm okay with
leaving this as a backlog.

bind10.py.in

  • In config_handler() I'd say 'If this is...' here:
            # This is initial update, don't do anything by it, leave it to startup
    
    Or move the comment in the 'if not self.runnable' block.
  • I see some duplicate pattern in config_handler(). maybe we need refactoring.
  • In stop_process(), s/second/sendto/ ?
    +        Stop the given process, friendly-like. The process is the name it has
    +        (in logs, etc), the second is the address on msgq.
    
    Or do you mean the "second argument" (for the caller)? In any case the variable name 'second' sounds a bit misleading to me; it looks more like a function/operation name. I'd name it something like 'recipient', 'remote', etc.

bind10_test.py

  • StartAllProcessesBob? should better be renamed. same for the comment of the class. Also, 'process' is not really the correct term because ccsession is not a process (although it's not specific to this patch).
  • In StartAllProcessesBob?, we probably don't need all of the stop_xxx (e.g., stop_ccsession())
  • check_started_xxx's contain a lot of duplicates. I would combine them into, e.g, check_started_common, where msgq/cfgmgr/ccsession/stats/cmdctl/(and perhaps more) are checked, and call it from check_started_xxx's (or from each specific test). We might also pass true/false to the common checker so that it can be used for check_preconditions() and check_started_none().
  • test_start_none(): s/Created Bob/Create? BoB/? Same for other similar comments (I noticed this is actually not for this patch).
            # Created Bob and ensure initialization correct
    

changelog

I would regard this change as 'func', although I don't oppose if you
prefer calling it a bug.

comment:6 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:7 in reply to: ↑ 5 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

While looking at the code, however, some fundamental architecture
[ ... ]
These are probably not an issue specific to this patch but rather a
larger architectural one for the BIND10 process model, though. To
make it really production ready, I guess we need to carefully design
state management of the subprocesses so that the questions like the
above ones are clarified.

I agree. I think boss is one of the parts we want to have a closer look and refactor it. We need the dependencies there, and more general system of starting things.

I want to have them in case someone starts using stop_ something that is not used now and forgets to update the tests. This way, all possible ones are overriden.

I would regard this change as 'func', although I don't oppose if you
prefer calling it a bug.

OK, func is probably OK as well.

I should have addressed everything mentioned in the comments.

Thanks

comment:8 in reply to: ↑ 7 Changed 9 years ago by jinmei

Replying to vorner:

I want to have them in case someone starts using stop_ something that is not used now and forgets to update the tests. This way, all possible ones are overriden.

OK.

I should have addressed everything mentioned in the comments.

  • As for config_handler() refactoring, the idea looks good, but it's not readable to me. It took some time for me to understand the calls to start_stop() is not an indentation error. It's also not easy to understand the entire logic of the main function. I'd at least like to see the main part in a unified place:
            if self.verbose:
                sys.stdout.write("[bind10] Handling new configuration: " +
                    str(new_config) + "\n")
            # no intermediate large def here.
            start_stop('resolver', self.started_resolver_family, resolver_on,
                resolver_off)
            # or here
            start_stop('auth', self.started_auth_family, auth_on, auth_off)
    
            answer = isc.config.ccsession.create_answer(0)
            return answer
            # TODO
    
    BTW, I just noticed the "TODO" comment doesn't make sense (understanding it's a leftover problem before this patch). If there's a specific TODO item, we should describe it; if it was now garbage, we should remove it.
  • As for s/second/sendto/ in stop_process(), there was a clear typo. I've fixed it. You didn't say anything about the naming, so I was not sure if you disagreed with me or overlooked that part of comment, but in any case this is a minor point and if that's intentional I'm okay with keeping the name.
  • As for renaming StartAllProcessesBob?, I realized the comment was unclear, sorry. I mainly intended to point out that it's now not specific to "start", so it would better be something like "StartStopCheckBob?"
  • In the following, s/should running/should be running/? Also, not all of them are "processes".
            Check that the right sets of processes are started. The ones that
            should running are specified by the core, auth and resolver parameters
            (they are groups of processes, eg. auth means b10-auth, -xfrout, -xfrin
            and -zonemgr).
    

comment:9 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:10 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

I reordered and commented the function, hopefully it's more readable now.

About the rest of the points, I probably understood wrongly or overlooked them. It should be better now.

Thank you

comment:11 in reply to: ↑ 10 Changed 9 years ago by jinmei

Replying to vorner:

I reordered and commented the function, hopefully it's more readable now.

About the rest of the points, I probably understood wrongly or overlooked them. It should be better now.

It now looks okay. Feel free to merge.

comment:12 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:13 Changed 9 years ago by vorner

  • Resolution set to fixed
  • Status changed from reviewing to closed

Hello

Thanks, merged, closing.

Note: See TracTickets for help on using tickets.