Opened 10 years ago

Closed 8 years ago

Last modified 8 years ago

#213 closed enhancement (complete)

Change hard-coded process startups to configuration-driven

Reported by: shane Owned by: vorner
Priority: medium Milestone: Sprint-20111122
Component: ~Boss of BIND (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 9 Add Hours to Ticket: 79.43
Total Hours: Internal?: no

Description

The boss process starts everything in a single function with hard-coded values. This needs to be driven by configuration, like everything else in the system.

Exceptions are the msgq and the cfgmgr.

Subtickets

Change History (55)

comment:1 Changed 9 years ago by stephen

  • Milestone feature backlog item deleted

Milestone feature backlog item deleted

comment:2 Changed 9 years ago by shane

  • Defect Severity set to N/A
  • Internal? unset
  • Milestone set to Year 3 Task Backlog
  • Sub-Project set to Core

comment:3 Changed 9 years ago by shane

Some tickets depend on this:

#791
#792
#793
#794
#795

comment:4 Changed 9 years ago by shane

Ticket #123 provides insights into how this should operate.

comment:5 Changed 9 years ago by shane

Ticket #230 also provides a suggestion.

comment:6 Changed 8 years ago by stephen

  • Milestone changed from Year 3 Task Backlog to Sprint-20110927

comment:7 Changed 8 years ago by jelte

One big open question is whether 'module-dependencies' (if any) should be coded into the boss configuration checker, or come directly from module specifications.

comment:8 Changed 8 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

comment:9 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20110927 to Sprint-20111011

comment:10 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111011 to Sprint-20111025

comment:11 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

Hello

The branch took longer and grew bigger than I expected. I hope the commit messages and a little description helps, though.

So, it currently looks like this:

  • There's a configurator object, that keeps track of configuration changes. It creates components and let them start and stop accordingly. This is in new isc.bind10.component module.
  • The components (living in the same place) are either common ones, starting as any other process, or they can be subclassed to support different types of startup.
  • The components are of three kinds, differing in the way they solve their failure. Dispensable one always tries to restart. Needed shuts down the boss if it fails to start (or crashes less than 10s after being started), but restarts if it happens later on. The core ones just terminate the system every time.
  • The old way of starting and stopping was mostly removed, only the functions that did the actual starting and stopping were left and are used from within the components. The boss just creates the configurator and passes the configuration there.

However, the branch in current state has few drawbacks:

  • I had a hard time editing boss. I didn't distract myself with proper committing at the time, so I did some history rewriting afterwards to clean it up and make it slightly sane. So some commit dates might look strange. It also might be possible I put some of the diffs into wrong commits by accident, but I think it should not be much.
  • The support of brittle mode is gone. It can be simulated by setting all components to core, but it might be less comfortable. The tests for it are just disabled. We should decide if we want to reimplement the brittle mode (if then, this ticket, or different one?), or remove it (the command line option, docs and tests).
  • The support for delayed restarts is not present currently. I want to get the review sooner than later, but I don't think it is usable without this support, as when a dispensable component fails to start, it ends up in an infinite loop.
  • Some parts are less tested than it would deserve. They are mostly stuff when external processes need to be started, so we might want some kind of system-level tests for these.
  • There are missing parts, like:
    • Some more error handling and validation of the config would be nice.
    • Support to specify command line parameters in the configuration. The configuration already allows that, but it doesn't get through the component all the way to the thing that starts the command. This should be small, though.
    • We probably want to get rid of command line arguments of the started processes and give them their config by the usual ways. For one, we can't change them at runtime, for another, we might want to get rid of some special components for auth and resolver.
    • If we're shutting down, we might want to gather exceptions and keep shutting down, then pass all the exceptions at once back, or something. This way, if one process fails to stop, all the other's aren't stopped in the clean way, but killed.
    • Stopping of socket creator causes an exception often.
    • The ccsession is starting log message is wrong, as ccsession is not a process.
    • We might want to have a way to kill a component if it doesn't want comply, when the user removes it from the configuration.
    • We might want to support waiting for start/stop of a component, we don't do that now at all.
    • The named_set bindctl handling is buggy. I found only one way to add a component to the configuration, and it is providing the JSON of the whole component in config set. Removing a default component works only if the configuration is already modified.
    • There's no user-level documentation, but there definitely needs to be.
    • We'd need a way for the name of the process and the name on the message buss to correspond, this way the configuration needs to provide both names, which is unnecessarily verbose.
    Is it OK if I create tickets for these?

So, my question is, is the branch bearable for review, or should I think of splitting it somehow (the problem is I don't see an easy way to do it, except for splitting it into the new components.py and the changes in boss, which is easy to do while reviewing). And, if the branch is reviewabla, what should we do with it? I think we should solve some of the regressions caused by the new implementation before giving it to users, maybe even before giving it to developers (namely the restarts, brittle mode and documentation). If so, we might want to do the tickets soon enough, branch them from here, but merge master to it from time to time (once a week?) to minimise the diversion.

Thank you

comment:12 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0.0 to 9

comment:13 Changed 8 years ago by jinmei

  • Owner changed from UnAssigned to jinmei

From a quick look it seems quite big. I'll try to look into some more
details to figure out if this is the possible minimum chunk anyway.

comment:14 follow-up: Changed 8 years ago by jinmei

I'll first provide comments on the component module (and its tests).

(In order to not disappoint you due to the amount of comment:-) In
general the design looks sensible to me. But since this is a newly
introduced concept, I encountered many points to discuss.

Regressions

  • This branch seems to break system tests (maybe related to the next one)
  • When I stopped the system via bindctl 'Boss shutdown', the bind10 process raised an exception:
    Traceback (most recent call last):
      File "bind10", line 952, in <module>
        main()
      File "bind10", line 947, in main
        boss_of_bind.shutdown()
      File "bind10", line 646, in shutdown
        component.pid())
      File "/Users/jinmei/src/isc/git/bind10-213/src/lib/python/isc/bind10/component.py", line 232, in pid
        return self.__creator.pid()
    AttributeError: 'NoneType' object has no attribute 'pid'
    
    and b10-msgq process remained alive.

Design/General? issues

  • Do we allow multiple instances (processes) of the same component? Like multiple auth processes for multiple cores? If we do, can we handle that scenario in this framework?
  • Don't we need an interval (and/or allow having an interval) before restarting a component? (and should the interval also be configured?). This is probably a TODO thing, which is okay for now, but just checking.
  • I wouldn't consider Auth/Resolver/CmdCtl? "needed" components. For example, if the system is intended to be DHCP only, we don't need either auth or resolver.
  • I'd avoid using magic numbers/values directly in the source code as much as possible: e.g. 10 seconds, some well-known command names like 'start', 'stop', so that we can unify the place where they are defined and can be easily changed at once (when necessary) and also so that the intent of the magic values is clearer.
  • I'd keep this module independent from the knowledge of which component is special for the boss, and let it focus on the generic framework. In that sense, it's awkward to see classes like SockCreator?, Msgq, etc and the mapping dictionary of 'special' be defined in this module. It seems to me to be more reasonable if they are defined in the boss module (or even independently from the module implementation itself, e.g. in a separate configuration file or something) and passed to this module as kind of generic parameters (see also the comment about 'special' in Configurator below).

Component class

  • An object of this class is a sort of finite state machine, but since its transition is represented via changes to multiple mutable attributes (running (which can be True or False), dead) it's difficult to follow at least to me (I needed to draw a state transition diagram to understand the code and be sure about its soundness). I'd suggest making this point more explicit by, e.g., introducing a single attribute representing the current state. (see bullets below for more specifics)
  • Whether or not adopting the idea of explicit state transition, a diagram in the documentation would still help for others to read and understand the code. It would look something like:
                             start_intl fail
                               OR failed()
            start()          /---------->Failed----*----->Dead
      Init----------->Running<----+----/(call
                      (call            failed_intl)
    		 start_intl)
                        ^  |           *: if kind == core
                 start()|  |stop()        or kind == needed and
                        |  v              it fails too soon
                      Stopped          +: otherwise
                   (call stop_intl)
    
  • What if stop_internal() raises an exception?
  • The notion of "dead" is not documented.
  • failed() doesn't check if it's called with dead being True while start() does this check. While it may not be a bug because if dead is true running must be False, in order to understand that you need to follow how and where these values can be changed. If we had a single attribute of e.g., state, it would be immediately clear that the check of "state != RUNNING" covers "state == DEAD". (You may or may not agree with this idea itself, but at least it was the case for me that it took time to understand this implementation was okay due to the existence of multiple attributes).
  • It seems to make sense to allow derived class to override some of the methods. But I suspect the requirement to the overridden methods should be more clarified in terms of which part (attribute) of the base class they can modify/ignore/refer to, or even which method can be overridden:
    • Can running() be overridden? If it can, and if the overridden version ignores running, behaviors that depend on the result of running() may not meet the assumption of the implementation (and may cause disruption).
    • if a derived class overrides start_internal() and in its version it doesn't set _procinfo, then pid() won't work unless pid() is also overridden.
    • there may be some more similar cases that I'm missing
  • A related point to the previous general issue, but not specific to that: strange thing will happen if pid() is called before _procinfo is set (e.g., at the "Init" state). Also, what should happen if this method is called without the underlying process? Further, this is actually not specific to PID, but also generally applicable to _procinfo.
  • On the specific point of the relationship between start_internal() and _procinfo, I guess it's better to request derived method not modify base attributes and instead return an ProcessInfo? object as a contract. The start() method will then set _procinfo to the returned value, and we can make sure _procinfo is always valid as long as start() succeeds.
  • Shouldn't xxx_internal better be named like "protected" methods, e.g. _start_internal()?

SockCreator? class

  • stop_internal: as long as the guideline is held creator should never be None when this method is called (in other words should this ever happen it means a bug). So I guess raising an exception is probably better.

CfgMgr? class, etc

  • isn't it better (at least in terms of conciseness) to pass 'address' to the base class constructor rather than setting it by themselves? that is, instead of
            self._address = 'ConfigManager'
    
    do this:
            Component.__init__(self, process, boss, kind, 'ConfigManager')
    

Configurator class

  • The syntax and semantics of the 'configuration' (argument to startup() and reconfigure()) are not clear enough in the documentation. I needed to look into the implementation and test cases to understand how it should be and how it works. This should be documented somewhere (probably part of the class doc string), more explicitly. It would cover the following points:
    • 'configuration' is a dictionary, whose keys are the identifiers of components (often derived from the program invoked for the component)
    • each value of a dictionary entry is another dictionary, which gives the configuration for the component. The (currently available) keys are: 'special', 'process', 'kind', 'address', and 'param'. The expected type of each key value is explained with other restrictions (e.g., 'kind' can only be one of the well-known kinds). For those configuration parameters that can be omitted, the default values are shown. And explain how these values are used.
  • (A related point) Isn't it okay to not perform syntax check against the given parameters? What if an unknown parameter is given? What if a parameter value is of an unexpected type? Or is this assumed to be checked at the caller side? (if so, it should be documented so).
  • Configurator.running doesn't have doc string.
  • The notion of 'special' is confusing, at least to me. This essentially specifies a specific derived Component class (if explicitly specified). So I'd rather simply call it something like 'component_class', specify the actual class Component class name (e.g. 'SockCreator?'), and use some reflection technique to instantiate an object of the specified class.
  • _run_plan(): Like the configuration, I'd like to see doc string that explains more details of 'plan'.
  • _run_plan(): s/beforehead/beforehand/?
            Run a plan, created beforehead by _build_plan.
    
  • _run_plan(): this doesn't seem to provide the strong exception

guarantee, i.e. self._components can have an incomplete update when
an exception is raised. Is that okay? (for example, _old_config
and _components will be inconsistent once this happens).

  • A related comment: since _old_config and _components are inter related (and have one-to-one relationships if I understand it correctly), I'd rather manage them in a combined way, e.g.
      _components = { 'msgq': (comonent_obj, componet_config), ... }
    
    That way we can be more sure, when reading the code, about the consistency between them, and it will be easier to ensure the consistency at least per component basis.
  • The exception message seems to be too broad:
                            raise NotImplementedError('Changing configuration of' +
                                                      ' a running component is ' +
                                                      'not yet supported. Remove' +
                                                      ' and re-add ' + cname +
                                                      'to get the same effect')
    
    because it only checks a subset of known configuration parameters (e.g. even if 'address' is modified it doesn't complain)
  • reconfigure() docstring: I'd like to see more desc, like the fact that currently running components that are not specified in 'configuration' will be stopped.
  • build_plan: using "params" as a local variable while also having "'params'" in the configuration is a bit confusing. I'd rename the former to something else.
  • is it okay to allow Configurator.reconfigure() to even stop core components? It seems to contradict the idea that the system cannot even start if such components fail to start.
  • what should happen if we do Configurator.shutdown() and try to start it again? Is it expected to be an okay behavior? Unexpected but not prohibited? Or should be prohibited?

component_test.py

ComponentTests?

  • check_dead: what about calling failed()? (also see above for the comments about the class)
  • check_dead: 'failed' instead of 'stopped'? ('dead' is a result of 'failed' not 'stop')
            # Surely it can't be stopped again
    
  • check_dead: should also check it's not None?
            self.assertNotEqual(0, self._exitcode)
    

ConfiguratorTest?

  • Is it intentional to test these two? (same for the shutdown case. also note it's inconsistent with the other two cases)
            self.assertTrue(configurator._running)
            self.assertTrue(configurator.running())
    

main

  • I don't know about this, but referring to other similar cases it doesn't seem to be necessary:
        isc.log.init("bind10") # FIXME Should this be needed?
    

comment:15 Changed 8 years ago by jinmei

(forgot to mention this) I made some minor editorial changes and pushed them to the branch.

comment:16 follow-ups: Changed 8 years ago by jinmei

General

There were too many changes and too many removed tests, so it was
difficult to be confident about compatibility, stability, regression
freeness, etc...(some were already noted in the tickets, and I'm okay
with deferring compatibility support for some of the existing ones,
but I'm still now sure if the rest of the changes are okay in terms of
this sense).

Maybe it was deemed to be too difficult or simply impossible, but due
to the above concerns I'd still like to try step-by-step transition.
A possibility would be:

  1. first merge the component module (this should be safe,
  2. choose a small part of the existing functionality, and replace only that part with the new module. do it carefully by preserving the semantics of the original test as much as possible. Maybe we should initially simply add code to call the corresponding component module but do nothing, then confirm the behavior with tests, and then remove the old code while letting the component module do the real job.
  3. repeat the 2nd process until we remove all old code.

I'll provide feedback on some details of the rest of diff anyway.

And one more thing: Do we need a changelog entry for this task? Or is
the plan to provide it when we complete the entire migration?

bind10_config.py.in

I don't understand the trick here:

    LIBEXECDIR = ("@libexecdir@/@PACKAGE@"). \
        replace("${exec_prefix}", "@exec_prefix@"). \
        replace("${prefix}", "@prefix@")

The relationship between @libexecdir@, ${exec_prefix}, @exec_prefix@,
${prefix}, and @prefix@ is not clear. Please add comments about the
intent.

bind10_src.py.in

  • Some of the removed log IDs seem to still remain in the message file. e.g.
    -        logger.info(BIND10_CONFIGURATION_START_AUTH, self.cfg_start_auth)
    -        logger.info(BIND10_CONFIGURATION_START_RESOLVER, self.cfg_start_resolver)
    
    (There seem to be more) These orphan messages should be identified comprehensively and cleaned up.
  • BoB.__init__(): ideally I'd like to move the setting of __core_components outside of the bind10 module (and even more ideally make it "configurable" by a command line option, etc).
  • Also about __core_components: the intent of the priorities should be documented (commented).
  • config_handler(): I'm often confused about this point, but doesn't this callback only get the difference from the current configuration? If that's the case, I suspect it doesn't meet the assumption of the component module, that is, it could accidentally kill a running component.
  • config_handler(): shouldn't we catch any exception while handling the new config and make sure return some "answer"?
  • config_handler(): while this point may become moot due to the previous point (and it's not really about this branch), but I don't see the need for having a temporary variable 'answer' here:
            answer = isc.config.ccsession.create_answer(0)
            return answer
    


  • kill_started_processes(): from a quick look at the library reference, os.kill(signal.SIGTERM) is less portable than Poepen.kill(), which was used for the original code (especially if we consider Windows support), and in that sense this is a regression. The same comment applies to other cases where os.kill() is used.
  • read_bind10_config(): the method's doc string may need to be updated, depending on whether this work also means we are going to decide how/where a list of running components should be defined.
  • start_msgq(): It's not about this branch, but variable 'c_channel' is confusing to me because it doesn't sound like a process. I'd rather say 'msgq', 'msgq_proc', etc.
  • start_simple(): this part of doc is not valid any more:
                The port and address arguments are for log messages only.
    
  • start_auth(), etc: While the comment above start_auth() seems to suggest the use of them in tests, they now are not used in tests (overkilling?). Also, as already stated, ideally I'd move these outside the bind10 module.
  • start_all_processes(): I'd add comment that other processes will be started in read_bind10_config():
            # Extract the parameters associated with Bob.  This can only be
            # done after the CC Session is started.
    
  • start_all_processes(): is it okay to remove this?
    -        # Everything after the main components can run as non-root.
    -        # TODO: this is only temporary - once the privileged socket creator is
    -        # fully working, nothing else will run as root.
    -        if self.uid is not None:
    -            posix.setuid(self.uid)
    
  • stop_all_processes(): this is now one-line method so may be removed (the caller could directly do what this method does)
  • stop_process(): I was not sure how we could remove this comment due to this branch:
    -        # TODO: Some timeout to solve processes that don't want to die would
    -        # help. We can even store it in the dict, it is used only as a set
    
    and this line (or the concept of "expected_shutdowns"):
    -        self.expected_shutdowns[process] = 1
    
    (I'm not necessarily saying we can't. I simply didn't understand).
  • component_shutdown(): shouldn't this check be at the beginning of the method? At least before setting exitcode?
            if not self.__started:
                raise Exception("Component failed during startup");
    
  • component_shutdown(): if I read the code correctly, __stopping can never be True in the context where this method can be called:
            if self.__stopping:
                return
    
    If the intent is to prevent a duplicate call, we'll probably need a dedicated attribute like "shutting_down", and set it to True in this own method.
  • reap_children(): I was not sure if it's safe to remove this:
    -                self.dead_processes[proc_info.pid] = proc_info
    
  • reap_children(): maybe we should pass the exit_status to component.failed() so that we can log the event with that information? (btw, otherwise we won't need a specific variable).
  • restart_processes(): I was not sure if we can safely remove these:
    -            if proc_info.name in self.expected_shutdowns:
    -                # We don't restart, we wanted it to die
    -                del self.expected_shutdowns[proc_info.name]
    -                continue
    
    see also comments about stop_process().
  • register_process() doesn't have doc string.

bob.spec

  • not actually about the spec itself, but when we specify the default for some configuration parameters, the application that uses them should get the default from the spec, rather than hardcoding them in the source code. One example:
                    component = creator(params.get('process', cname), self.__boss,
                                        params.get('kind', 'dispensable'),
                                        params.get('address'),
                                        params.get('params'))
    
  • I'm not sure "b10-xxx" is the best choice for the component name. It seems more like a kind of component, rather than the program (file) name, so things like auth (or Auth) sound a bit better to me. But this is probably a minor preference matter.
  • likewise, while "params" are defined as a list of strings, component_test uses integers:
            plan = configurator._build_plan({}, {
                'component': {
                    'kind': 'needed',
                    'params': [1, 2],
                    'address': 'address'
                }
            })
    
  • I can't find "priority" in the spec.
  • dhcp related config is missing, too.

bind10_test.py.in

General comment: TestStartStopProcessesBob? was completely removed and
TestBossComponents? was completely newly introduced. Both are pretty
big, and the changes are not trivial, so I was not sure such changes
in tests ensure safe refactoring. (For example, the original tests
check whether auth/resolver are started based on some
run-time/config-time conditions. Such kind of tests seem to be
missing). Overall, I've not seen anything obviously wrong in the
added tests, but I couldn't be confident that these tests are
sufficient, and I was not really comfortable about that.

  • TestBossComponents?.test_correct_run: about this point, my understanding is that we are expected to return an error answer in such cases:
            # Is it OK to raise, or should it catch also and convert to error
            # answer?
            self.assertRaises(Exception, bob.config_handler,
                              {'components': compconf})
    
    See also the comment about config_handler() above.

comment:17 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:18 in reply to: ↑ 11 Changed 8 years ago by jinmei

Replying to vorner:

Hello

  • There are missing parts, like:

[...]

Is it OK if I create tickets for these?

I've not examined each of these closely, but, in general, I'd rather
positively agree with deferring small TODOs to separate tickets as
this one is already too big. We need to ensure some level of
operational compatibility, and opinions on which level of
compatibility is necessary would probably vary. So we should probably
discuss that point in a wider place.

comment:19 in reply to: ↑ 14 ; follow-ups: Changed 8 years ago by vorner

Hello

Replying to jinmei:

(In order to not disappoint you due to the amount of comment:-) In
general the design looks sensible to me. But since this is a newly
introduced concept, I encountered many points to discuss.

I kind of expected a lot of comments, as the branch is quite large.

  • This branch seems to break system tests (maybe related to the next one)

Hmm, maybe I need to look at the configurations there. I'll handle it sometime soon.

  • When I stopped the system via bindctl 'Boss shutdown', the bind10 process raised an exception:
    Traceback (most recent call last):
      File "bind10", line 952, in <module>
        main()
      File "bind10", line 947, in main
        boss_of_bind.shutdown()
      File "bind10", line 646, in shutdown
        component.pid())
      File "/Users/jinmei/src/isc/git/bind10-213/src/lib/python/isc/bind10/component.py", line 232, in pid
        return self.__creator.pid()
    AttributeError: 'NoneType' object has no attribute 'pid'
    
    and b10-msgq process remained alive.

I couldn't reproduce this, it worked for me. But I added some handling of this, so it shouldn't cause an exception any more (I expect this happens in some time when the component is already stopped, but it didn't yet get the SIGCHLD or something).

Anyway, I find it strange it did bring the system down without killing msgq. Or maybe not, considering where the exception happened.

  • Do we allow multiple instances (processes) of the same component? Like multiple auth processes for multiple cores? If we do, can we handle that scenario in this framework?

This framework should be able to handle that, provided their names are different. I actually expected it to be used that way (or maybe having a 'count': 64 option for a component sometime in future, if copy-pasting bunch of components is deemed uncomfortable).

However, the rest of the system can't handle it yet (like we need them to have different addresses on the message bus or something). Maybe we should warn the user about it in config, that starting two auths won't do what he wants.

  • Don't we need an interval (and/or allow having an interval) before restarting a component? (and should the interval also be configured?). This is probably a TODO thing, which is okay for now, but just checking.

We do, it's in the TODO somewhere.

  • I wouldn't consider Auth/Resolver/CmdCtl? "needed" components. For example, if the system is intended to be DHCP only, we don't need either auth or resolver.

I'm not sure, maybe the "needed" name is a bit misleading. It says that it should not start if these can't be started, but not bring the system down if they crash later on.

If someone uses boss to start dhcp, he would just remove auth and resolver from configuration and have the dhcp part as needed.

  • I'd keep this module independent from the knowledge of which component is special for the boss, and let it focus on the generic framework. In that sense, it's awkward to see classes like SockCreator?, Msgq, etc and the mapping dictionary of 'special' be defined in this module. It seems to me to be more reasonable if they are defined in the boss module (or even independently from the module implementation itself, e.g. in a separate configuration file or something) and passed to this module as kind of generic parameters (see also the comment about 'special' in Configurator below).

I put it to a different module.

  • An object of this class is a sort of finite state machine, but since its transition is represented via changes to multiple mutable attributes (running (which can be True or False), dead) it's difficult to follow at least to me (I needed to draw a state transition diagram to understand the code and be sure about its soundness). I'd suggest making this point more explicit by, e.g., introducing a single attribute representing the current state. (see bullets below for more specifics)

Yes, you're right. It happened in kind of evolutionary way, the running one was there first, then the dead appeared later on and I didn't think about it. This way it looks simpler. I also added the diagram.

  • What if stop_internal() raises an exception?

Then we have a problem.

Actually, the component is considered shut down at the time and the exception is propagated. The idea behind this is, we can't really consider it running, because it might be already stopped and if there's problem stopping, if we try again (during system shutdown or sometime else), it would fail again. This way, if it happens during real shutdown and the process is still running, it will be at last killed. If it happens during reconfiguration, I don't know. Any ideas what to do then?

  • It seems to make sense to allow derived class to override some of the methods. But I suspect the requirement to the overridden methods should be more clarified in terms of which part (attribute) of the base class they can modify/ignore/refer to, or even which method can be overridden:

I added some documentation.

  • A related point to the previous general issue, but not specific to that: strange thing will happen if pid() is called before _procinfo is set (e.g., at the "Init" state). Also, what should happen if this method is called without the underlying process? Further, this is actually not specific to PID, but also generally applicable to _procinfo.

The _procinfo should not be accessed if it's not running and it should in fact be more private than protected. I'm not sure if I access it from some kind of tests, or if it's leftover from some older version. Anyway, pid() now checks and returns None if it's not running (or derived component is bad).

  • On the specific point of the relationship between start_internal() and _procinfo, I guess it's better to request derived method not modify base attributes and instead return an ProcessInfo? object as a contract. The start() method will then set _procinfo to the returned value, and we can make sure _procinfo is always valid as long as start() succeeds.

I think there are none that modify it. This is actually what _start_func hook is for, it returns the procinfo and the rest is handled the usual way.

Overriding start_internal is for bigger differences, where there might be no procinfo involved or no process (therefore PID) at all.

I added some documentation there.

SockCreator? class

  • stop_internal: as long as the guideline is held creator should never be None when this method is called (in other words should this ever happen it means a bug). So I guess raising an exception is probably better.

Yes, you're right. It's leftover from the copy-paste from boss.

  • isn't it better (at least in terms of conciseness) to pass 'address' to the base class constructor rather than setting it by themselves? that is, instead of

Yes, evolution O:-). Modified.

  • (A related point) Isn't it okay to not perform syntax check against the given parameters? What if an unknown parameter is given? What if a parameter value is of an unexpected type? Or is this assumed to be checked at the caller side? (if so, it should be documented so).

It's not completely OK, but it's one of the TODO items (better error handling and checks). It currently implicitly relies on bindctl not allowing anything that's not allowed by spec file, but the checks should be definitelly added.

essentially specifies a specific derived Component class (if
explicitly specified). So I'd rather simply call it something like
'component_class', specify the actual class Component class name
(e.g. 'SockCreator?'), and use some reflection technique to
instantiate an object of the specified class.

I'm not against some renaming (but the component_class is rather long and expects that someone has an idea what a class is). But I'd be against the reflection techniques, for one, they complicate tests, for another, I don't want the user to be able to type any class name there and expect things to work.

  • _run_plan(): Like the configuration, I'd like to see doc string that explains more details of 'plan'.

I added a little bit, but that thing is mostly private to the configurator and tests.

  • A related comment: since _old_config and _components are inter related (and have one-to-one relationships if I understand it correctly), I'd rather manage them in a combined way, e.g.
      _components = { 'msgq': (comonent_obj, componet_config), ... }
    
    That way we can be more sure, when reading the code, about the consistency between them, and it will be easier to ensure the consistency at least per component basis.

I implemented this. But should we try some kind of rollback in case of failure-to-start or something?

  • The exception message seems to be too broad:
                            raise NotImplementedError('Changing configuration of' +
                                                      ' a running component is ' +
                                                      'not yet supported. Remove' +
                                                      ' and re-add ' + cname +
                                                      'to get the same effect')
    
    because it only checks a subset of known configuration parameters (e.g. even if 'address' is modified it doesn't complain)

Not checking address was actually a bug (as the check was there before the address was added). Anyway, I'd like this one to go sometime soon and be able to handle the change if at all possible without bothering the user.

  • is it okay to allow Configurator.reconfigure() to even stop core components? It seems to contradict the idea that the system cannot even start if such components fail to start.

Yes, as described at the class's docstring, it's up to the caller to ensure this doesn't happen (and boss does that by making sure the three core components are always there).

I actually want to allow user to provide more core components (eg. say that auth is core) and simulate kind of semi-brittle mode with it. But then, the user should be able to remove the auth and shut it down. And the configurator is not the place which should know which are system-core components and which are core because user wants them so.

  • what should happen if we do Configurator.shutdown() and try to start it again? Is it expected to be an okay behavior? Unexpected but not prohibited? Or should be prohibited?

I didn't think about that one. Boss doesn't use it so and I don't think there ever will be another user than a boss.

But I guess it would work, so I wouldn't bother forbidding it explicitly or making sure it does unless it's really needed.

    isc.log.init("bind10") # FIXME Should this be needed?

Hmm, it seems to fail without it. It's just a copied fixme from some other test, I'd like the tests to be able to start by themself without the init in tests (as there's a test-specific call just below it).

More answers and code will follow, I just need a flush O:-).

comment:20 in reply to: ↑ 16 ; follow-up: Changed 8 years ago by vorner

Hello

Replying to jinmei:

There were too many changes and too many removed tests, so it was
difficult to be confident about compatibility, stability, regression
freeness, etc...(some were already noted in the tickets, and I'm okay
with deferring compatibility support for some of the existing ones,
but I'm still now sure if the rest of the changes are okay in terms of
this sense).

Maybe it was deemed to be too difficult or simply impossible, but due
to the above concerns I'd still like to try step-by-step transition.
A possibility would be:

  1. first merge the component module (this should be safe,
  2. choose a small part of the existing functionality, and replace only that part with the new module. do it carefully by preserving the semantics of the original test as much as possible. Maybe we should initially simply add code to call the corresponding component module but do nothing, then confirm the behavior with tests, and then remove the old code while letting the component module do the real job.
  3. repeat the 2nd process until we remove all old code.

I don't know how to do it well. I did write the module first (and I think it could be merged first, if it makes any difference), but I had enough hard time getting through the boss code to start by plugging it in by parts, finding out that it breaks it completely, because then the two ways mostly collided (like the way processes are stopped differed, the way terminated processes were handled) and it turned out that mostly two copies of the code (the old and the new one) would need to coexist.

My original history contained some of these steps and situations where they coexisted a little bit (with large parts being broken at the time), but I removed these temporary parts of code so the diff is more readable and reviewable. If you'd like, I can try to look into the git internals if it's still there.

The removed tests were removed simply because they were testing something that no longer makes any sense and does not exist. There's no test to see if resolver (for example) is started, because the boss now has (almost) no knowledge about existence of resolver. It tests that a component configuration is passed to the configurator (which might be a resolver or whatever else, the boss doesn't care) and the configurator should be tested quite a lot.

It is true the boss is probably tested less than it should.

And one more thing: Do we need a changelog entry for this task? Or is
the plan to provide it when we complete the entire migration?

Possibly. There should be a changelog in the end, because it's a really well visible change, but I'm not sure when. If we decide to go with the big branch way, maybe when merging back to master. If we choose to merge now, maybe I'd add a changelog then. But I'd rather like to add it when there's a little of documentation for a user somewhere to point to it.

I don't understand the trick here:

    LIBEXECDIR = ("@libexecdir@/@PACKAGE@"). \
        replace("${exec_prefix}", "@exec_prefix@"). \
        replace("${prefix}", "@prefix@")

The relationship between @libexecdir@, ${exec_prefix}, @exec_prefix@,
${prefix}, and @prefix@ is not clear. Please add comments about the
intent.

This was handled in the boss by a trick from Makefile, which I didn't like. But it turned out the replacement (the things in @@) happened to contain more and more variables and I kind of continued to replace them until it stopped. Do you think it is OK (with a comment) or I should find a better way to do it?

  • Some of the removed log IDs seem to still remain in the message file. e.g.
    -        logger.info(BIND10_CONFIGURATION_START_AUTH, self.cfg_start_auth)
    -        logger.info(BIND10_CONFIGURATION_START_RESOLVER, self.cfg_start_resolver)
    
    (There seem to be more) These orphan messages should be identified comprehensively and cleaned up.

I grepped them out, I left only the BIND10_START_AS_NON_ROOT, as I'd like to reintroduce the message.

  • BoB.__init__(): ideally I'd like to move the setting of __core_components outside of the bind10 module (and even more ideally make it "configurable" by a command line option, etc).

I don't think configuration on command line makes any sense. These are the real core components. The boss won't be able to work (and load the configuration) without them. I don't see a reason why boss shouldn't know the bootstrap process and why a user would want to modify it (or in which way it would make any sense). In fact, I'd like to keep users as far from these as possible. If anybody has a reason to have a different set of these in bootstrap process, they are coding already anyway. I don't expect even the authors of additional bind10 components to need to modify them.

  • config_handler(): I'm often confused about this point, but doesn't this callback only get the difference from the current configuration? If that's the case, I suspect it doesn't meet the assumption of the component module, that is, it could accidentally kill a running component.

It gets all the modified top-level items, but they are complete. So, the components config is there complete or is not there at all.

Thanks (there'll be more coming).

comment:21 in reply to: ↑ 19 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

  • When I stopped the system via bindctl 'Boss shutdown', the bind10 process raised an exception:

[...]

I couldn't reproduce this, it worked for me. But I added some handling of this, so it shouldn't cause an exception any more (I expect this happens in some time when the component is already stopped, but it didn't yet get the SIGCHLD or something).

Anyway, I find it strange it did bring the system down without killing msgq. Or maybe not, considering where the exception happened.

It (after pulling the latest code) still fails, but now in a different way. Maybe it's MacOS specific?

Traceback (most recent call last):
  File "/Users/jinmei/opt/sbin/bind10", line 967, in <module>
    main()
  File "/Users/jinmei/opt/sbin/bind10", line 962, in main
    boss_of_bind.shutdown()
  File "/Users/jinmei/opt/sbin/bind10", line 655, in shutdown
    component.kill()
  File "/Users/jinmei/opt/lib/python3.1/site-packages/isc/bind10/component.py", line 286, in kill
    self._procinfo.terminate()
AttributeError: 'ProcessInfo' object has no attribute 'terminate'

and b10-msgq didn't die.

comment:22 in reply to: ↑ 21 Changed 8 years ago by vorner

Hello

Replying to jinmei:

Anyway, I find it strange it did bring the system down without killing msgq. Or maybe not, considering where the exception happened.

It (after pulling the latest code) still fails, but now in a different way. Maybe it's MacOS specific?

Traceback (most recent call last):
  File "/Users/jinmei/opt/sbin/bind10", line 967, in <module>
    main()
  File "/Users/jinmei/opt/sbin/bind10", line 962, in main
    boss_of_bind.shutdown()
  File "/Users/jinmei/opt/sbin/bind10", line 655, in shutdown
    component.kill()
  File "/Users/jinmei/opt/lib/python3.1/site-packages/isc/bind10/component.py", line 286, in kill
    self._procinfo.terminate()
AttributeError: 'ProcessInfo' object has no attribute 'terminate'

Sorry, this is my fault. This is a code from yesterday afternoon, moving away from os.kill, I didn't have time to try it then (and, as usual, this is the part which handles external processes so it doesn't have a unittest :-|). Should be fixed now.

comment:23 in reply to: ↑ 16 ; follow-up: Changed 8 years ago by vorner

Hello

Replying to jinmei:

  • kill_started_processes(): from a quick look at the library reference, os.kill(signal.SIGTERM) is less portable than Poepen.kill(), which was used for the original code (especially if we consider Windows support), and in that sense this is a regression. The same comment applies to other cases where os.kill() is used.

The killing was moved into the components which handle them in concrete way.

  • read_bind10_config(): the method's doc string may need to be updated, depending on whether this work also means we are going to decide how/where a list of running components should be defined.

I believe that this ticket explicitly requires it to be in boss (it doesn't seem doable to search for the correct configuration everywhere). Besides, configuration for a module should be something the module itself needs for correct working. As it is not the job of each component to start itself, it doesn't seem to be the best place to put the configuration into each respective component. But if you think we should talk this over on the dev list, it is possible.

  • start_auth(), etc: While the comment above start_auth() seems to suggest the use of them in tests, they now are not used in tests (overkilling?). Also, as already stated, ideally I'd move these outside the bind10 module.

I'd like that too, in the long term. But as they use quite a lot from boss internals, I think we should defer it.

  • start_all_processes(): is it okay to remove this?
    -        # Everything after the main components can run as non-root.
    -        # TODO: this is only temporary - once the privileged socket creator is
    -        # fully working, nothing else will run as root.
    -        if self.uid is not None:
    -            posix.setuid(self.uid)
    

Not really, I missed that one. But it's unhappy in the current state, it doesn't do what it should, as there's no right place. We should finish the socket creator soon enough.

  • stop_process(): I was not sure how we could remove this comment due to this branch:
    -        # TODO: Some timeout to solve processes that don't want to die would
    -        # help. We can even store it in the dict, it is used only as a set
    

Ah, that one was removed because it now belongs somewhere else. But it got lost in the ether until it got there, so I put it there now.

and this line (or the concept of "expected_shutdowns"):

-        self.expected_shutdowns[process] = 1

(I'm not necessarily saying we can't. I simply didn't understand).

The expected shutdowns are no longer needed. A component is told to shut down, which puts it into a stopped state. A failed is called only on the running components. So the component remembers itself instead of relying on external dictionary.

  • component_shutdown(): shouldn't this check be at the beginning of the method? At least before setting exitcode?
            if not self.__started:
                raise Exception("Component failed during startup");
    

No, it shouldn't, as this is not an exception from this method, but rather two different ways to stop the boss, depending on when it happens. I added a comment explaining what it does and why.

  • component_shutdown(): if I read the code correctly, __stopping can never be True in the context where this method can be called:
            if self.__stopping:
                return
    
    If the intent is to prevent a duplicate call, we'll probably need a dedicated attribute like "shutting_down", and set it to True in this own method.

I don't think it can be called twice currently and it probably wouldn't hurt either. I don't know what I was thinking before, when I added it, so I removed it.

  • reap_children(): I was not sure if it's safe to remove this:
    -                self.dead_processes[proc_info.pid] = proc_info
    

The dead_processes were used for the delayed starts. We need to reintroduce them sometime, but maybe in a completely different way. So that's why they are removed from the code now.

  • restart_processes(): I was not sure if we can safely remove these:
    -            if proc_info.name in self.expected_shutdowns:
    -                # We don't restart, we wanted it to die
    -                del self.expected_shutdowns[proc_info.name]
    -                continue
    
    see also comments about stop_process().

Yes, the same explanation applies here ‒ it is handled differently now.

  • not actually about the spec itself, but when we specify the default for some configuration parameters, the application that uses them should get the default from the spec, rather than hardcoding them in the source code. One example:
                    component = creator(params.get('process', cname), self.__boss,
                                        params.get('kind', 'dispensable'),
                                        params.get('address'),
                                        params.get('params'))
    

The only one that had a default value was the kind, so I set it to required instead of optional.

The rest is optional and has no default. These defaults are either „empty“ (eg. nothing there) in the code or guessed from something that changes for each component (the process to start), which is impossible to state in the spec file.

  • I'm not sure "b10-xxx" is the best choice for the component name. It seems more like a kind of component, rather than the program (file) name, so things like auth (or Auth) sound a bit better to me. But this is probably a minor preference matter.

That is because I use the name as the default value of process to start, to save the user some typing when introducing a component. I'd like to unify it with the address on the message queue somehow as well, so the only thing needed to add a new component would be to type:

config add Boss/components b10-component

And be done.

  • likewise, while "params" are defined as a list of strings,
  • I can't find "priority" in the spec.
  • dhcp related config is missing, too.

What dhcp related config? I think the DHCP wasn't started before (and it didn't really have a start_dhcp option yet). But if a user wants, he can always add the component, right? There's no special need for DHCP. Or did I overlook something?

General comment: TestStartStopProcessesBob? was completely removed and
TestBossComponents? was completely newly introduced. Both are pretty
big, and the changes are not trivial, so I was not sure such changes
in tests ensure safe refactoring. (For example, the original tests
check whether auth/resolver are started based on some
run-time/config-time conditions. Such kind of tests seem to be
missing). Overall, I've not seen anything obviously wrong in the
added tests, but I couldn't be confident that these tests are
sufficient, and I was not really comfortable about that.

Well, it's simply because the old way of configuration no longer exists, so there's no reason (or way) to test it. Instead of testing that if there's a start_resolver, the resolver is started, we test that if there's a component in configuration, that component is started (by passing it to the configurator and by the configurator tests).

  • TestBossComponents?.test_correct_run: about this point, my understanding is that we are expected to return an error answer in such cases:
            # Is it OK to raise, or should it catch also and convert to error
            # answer?
            self.assertRaises(Exception, bob.config_handler,
                              {'components': compconf})
    
    See also the comment about config_handler() above.

I was thinking it should be the job of whatever calls the handler to catch the exceptions and convert them to error messages. But I probably didn't have a reason to expect that to do so.

I know I still haven't looked at the system tests, but I'm assigning the current changes to you for now, as I don't expect to have time for it the next 2 or 3 days :-|.

Thanks

comment:24 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111025 to Sprint-20111108

comment:25 Changed 8 years ago by jinmei

Some response:

  • Do we allow multiple instances (processes) of the same component? Like multiple auth processes for multiple cores? If we do, can we handle that scenario in this framework?

This framework should be able to handle that, provided their names are different. I actually expected it to be used that way (or maybe having a 'count': 64 option for a component sometime in future, if copy-pasting bunch of components is deemed uncomfortable).

However, the rest of the system can't handle it yet (like we need them to have different addresses on the message bus or something). Maybe we should warn the user about it in config, that starting two auths won't do what he wants.

It would be nice to document it somewhere. In any case actually
realizing it is far beyond the scope of this ticket.

  • I wouldn't consider Auth/Resolver/CmdCtl? "needed" components. For example, if the system is intended to be DHCP only, we don't need either auth or resolver.

I'm not sure, maybe the "needed" name is a bit misleading. It says that it should not start if these can't be started, but not bring the system down if they crash later on.

If someone uses boss to start dhcp, he would just remove auth and resolver from configuration and have the dhcp part as needed.

Perhaps the point to consider is what should be specified as 'needed'
by default in bob.spec. If we see BIND 10 as a generic framework for
various kind of Internet servers (starting with DNS, then DHCP, and
perhaps even HTTP, etc), it would be more reasonable to begin with an
empty list of specific servers. If a user wants to use the framework
for DNS services, auth (and/or resolver) will then be specified as
'needed'. On the other hand, realistically speaking most people will
see BIND 10 as DNS software (at least for initial N years), so it
might be over generalization and just increase the configuration
overhead. Right now I have a strong opinion either way. Maybe one
option is to decide it at ./configure time, and make its default DNS
related servers. But in any case I'm okay with deferring this point.

I don't have a strong opinion about the naming of 'needed', btw.

  • I'd keep this module independent from the knowledge of which component is special for the boss, and let it focus on the generic framework. [...]

I put it to a different module.

Okay.

  • An object of this class is a sort of finite state machine, [...]

Yes, you're right. It happened in kind of evolutionary way, the running one was there first, then the dead appeared later on and I didn't think about it. This way it looks simpler. I also added the diagram.

  • What if stop_internal() raises an exception?

Then we have a problem.

Actually, the component is considered shut down at the time and the exception is propagated. The idea behind this is, we can't really consider it running, because it might be already stopped and if there's problem stopping, if we try again (during system shutdown or sometime else), it would fail again. This way, if it happens during real shutdown and the process is still running, it will be at last killed. If it happens during reconfiguration, I don't know. Any ideas what to do then?

On thinking about it more as being explicitly asked, I think we should
keep truck of the status of spawned processes more precisely. Right
now (both before or after this branch), it seems that we are not very
accurate on this point.

A child process can have the following states:

  • dead (process doesn't exist)
  • alive but not ready to run (in initialization)
  • alive and running
  • alive and shutting down (boss has sent a shutdown command)
  • alive but hang (process exists but cannot do any active work and cannot even receive a shutdown command)

We (at least partly) manage these states via BoB.processes and
BoB._componet_configurator(._components), but the relationship among
these doesn't seem to be well clarified. And, it causes some real bad
things:

  • since we don't explicitly recognize the 'not ready' state, we have a problem like #1271. We can (should) fix individual problems, but I suspect it's a tip of iceberg.
  • as far as I know we don't have any explicit way to detect the "hang" state.
  • we don't explicitly recognize the "shutting down" state, and once the boss sends a shutdown command the boss basically forgets that component (and cannot deal with the situation the process somehow doesn't die)

My original question about stop_internal() is related to the last
point. Based on this observation, for this particular issue I believe
we should keep truck of the transition from "shutting down" to "dead"
more closely. For example, we don't immediately remove the component
on stop() it but maintain it in some "shutting down queue" and watch
the process. If it doesn't die for a certain amount of period the
boss will kill it more forcefully. (It's just an example sketch of
idea, rather than a concrete proposal).

In longer term, I believe we should clarify the above relationship,
then define, implement, and test the behavior based on the
clarification.

I think we should also check what other multi-process systems such as
postfix and xorp handle the issue of managing child processes.

All that said, this will be beyond the scope of this already-fat
task. After all, the pre-213 implementation isn't good in this sense,
so in the sense of porting the current behavior under a new framework
we don't have to solve it now. So, at the moment, I'm okay with just
leaving a comment that e.g. stop_process() is generally expected to be
exception free (for now) and the behavior is undefined if and when
that happens.

comment:26 in reply to: ↑ 19 Changed 8 years ago by jinmei

Replying to (the rest of) vorner:

  • It seems to make sense to allow derived class to override some of the methods. But I suspect the requirement to the overridden methods should be more clarified [...]

I added some documentation.

Ack. One small point in the revised doc:

        You should also register any processes started within boss.

If this is something that all derived class (whether overridden or
not) should meet, maybe we should move it to start().

  • A related point to the previous general issue, but not specific to that: strange thing will happen if pid() is called before _procinfo is set (e.g., at the "Init" state). Also, what should happen if this method is called without the underlying process? Further, this is actually not specific to PID, but also generally applicable to _procinfo.

The _procinfo should not be accessed if it's not running and it should in fact be more private than protected. I'm not sure if I access it from some kind of tests, or if it's leftover from some older version. Anyway, pid() now checks and returns None if it's not running (or derived component is bad).

Maybe the accessibility assumption for _procinfo should be clearly
documented. As for pid(), I guess the (more) proper condition is
if self._procinfo is not None here:

        return self._procinfo.pid if self._procinfo else None

or revise the whole statement to:

        return None if self._procinfo is None else self._procinfo.pid

(same comment applies to kill()).

Also, for pid() to always work correctly, I suspect we should reset
_procinfo to None once the component stops running.

  • On the specific point of the relationship between start_internal() and _procinfo, I guess it's better to request derived method not modify base attributes and instead return an ProcessInfo? object as a contract. The start() method will then set _procinfo to the returned value, and we can make sure _procinfo is always valid as long as start() succeeds.

I think there are none that modify it. This is actually what _start_func hook is for, it returns the procinfo and the rest is handled the usual way.

Overriding start_internal is for bigger differences, where there might be no procinfo involved or no process (therefore PID) at all.

I added some documentation there.

Maybe my intent wasn't clearly delivered...let me rephrase it:
Assume I wanted to completely override _start_internal() for some
reason. Then I should set _procinfo in the override version;
otherwise kill() won't work correctly. And, just like I commented
about "You should also register..." part, if this is something all
overridden the original implementation should do, I'd move it to
start(), and (in this specific case) ask the implementor to return a
ProcessInfo object from the overridden version of _start_internal().
Or, if you don't like that, at least we should document that
_procinfo must be set in the overridden version of _start_internal().

  • (A related point) Isn't it okay to not perform syntax check against the given parameters? [...]

It's not completely OK, but it's one of the TODO items (better error handling and checks). It currently implicitly relies on bindctl not allowing anything that's not allowed by spec file, but the checks should be definitelly added.

Okay.

essentially specifies a specific derived Component class (if
explicitly specified). So I'd rather simply call it something like
'component_class', specify the actual class Component class name
(e.g. 'SockCreator?'), and use some reflection technique to
instantiate an object of the specified class.

I'm not against some renaming (but the component_class is rather long and expects that someone has an idea what a class is). But I'd be against the reflection techniques, for one, they complicate tests, for another, I don't want the user to be able to type any class name there and expect things to work.

It doesn't necessarily have to be reflection-like programming
technique, but my point was that I'd like to separate the knowledge
and process of how specific components should be created/killed from
boss related modules, because it's ultimately something that only the
author of that component can (and should) know. So, for example, in
my ideal world we'd write down how b10-auth should be spawned and
killed somewhere and somehow, maybe as part of b10-config.db and/or in
a form of plugin python module. The boss process passes it to the
component module mostly transparently, and eventually performs the
required spawning steps, again, transparently.

I'd also envision some external developers want to use the BIND 10
framework to control a third party server process and want the boss
process to invoke the third party server without modifying the BIND 10
source code. My overall point is that I'd like this framework to have
this type of flexibility. Whether to use the so-called "reflection"
is implementation details. But in any case, I don't think we need it
today, especially not in this ticket, so I'm not insisting this should
be solved within this task.

As for the name, 'special' sounded too vague to me. But naming is
often a bikeshed matter (while sometimes it's very important), so as
long as it's clearly documented we should probably focus on more
important points.

  • _run_plan(): Like the configuration, I'd like to see doc string that explains more details of 'plan'.

I added a little bit, but that thing is mostly private to the configurator and tests.

I know it's private. My point is that we need some form of
documentation that helps people who try to understand (and perhaps
modify) the implementation. It may not necessarily have to be doc string.

  • A related comment: since _old_config and _components are inter related (and have one-to-one relationships if I understand it correctly), I'd rather manage them in a combined way, e.g.
      _components = { 'msgq': (comonent_obj, componet_config), ... }
    
    That way we can be more sure, when reading the code, about the consistency between them, and it will be easier to ensure the consistency at least per component basis.

I implemented this.

Change (18e970e1) looks okay, but using two different words
'config(uration)' and 'spec' is confusing. Please unify them.

But should we try some kind of rollback in case of failure-to-start or something?

Ideally, but I suspect it's very difficult if not impossible because
spawning/killing a process is not always a reversible operation. If
and when we clarify the relationship between the process status and
component status (discussed in my previous comment message) we may be
able to solve this in a cleaner way (for example, we may be able to
ensure that Configurator.reconfigure() itself doesn't involve
a non-reversible operation). But for now, I suspect it's too
difficult and should be remove from the scope of the ticket.

  • The exception message seems to be too broad:
                            raise NotImplementedError('Changing configuration of' +
                                                      ' a running component is ' +
                                                      'not yet supported. Remove' +
                                                      ' and re-add ' + cname +
                                                      'to get the same effect')
    
    because it only checks a subset of known configuration parameters (e.g. even if 'address' is modified it doesn't complain)

Not checking address was actually a bug (as the check was there before the address was added). Anyway, I'd like this one to go sometime soon and be able to handle the change if at all possible without bothering the user.

Hardcoding the (un)expected items is quite error prone (it easily
causes inconsistency), but considering it a short term workaround, I
wouldn't bother further. Also we'll need a space before 'to' here:

                                                  'to get the same effect')

but, again, if it's short term it doesn't matter much anyway.

  • is it okay to allow Configurator.reconfigure() to even stop core components? It seems to contradict the idea that the system cannot even start if such components fail to start.

Yes, as described at the class's docstring, it's up to the caller to ensure this doesn't happen (and boss does that by making sure the three core components are always there).

Okay, but do you mean this?

    Note that this will allow you to stop (by invoking reconfigure) a core
    component. There should be some kind of layer protecting users from ever
    doing so (users must not stop the config manager, message queue and stuff
    like that or the system won't start again).

I fail to understand that this means that the caller shouldn't stop a
core component. It's probably better to describe that more
explicitly.

  • what should happen if we do Configurator.shutdown() and try to start it again? Is it expected to be an okay behavior? Unexpected but not prohibited? Or should be prohibited?

I didn't think about that one. Boss doesn't use it so and I don't think there ever will be another user than a boss.

But I guess it would work, so I wouldn't bother forbidding it explicitly or making sure it does unless it's really needed.

I have no strong opinion about what it should be, but it's probably
worth noting that it's not expected behavior but not explicitly
prohibited (for now).

comment:27 in reply to: ↑ 23 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

Hello

Replying to jinmei:

  • kill_started_processes(): from a quick look at the library reference, os.kill(signal.SIGTERM) is less portable than Poepen.kill(), which was used for the original code (especially if we consider Windows support), and in that sense this is a regression. The same comment applies to other cases where os.kill() is used.

The killing was moved into the components which handle them in concrete way.

As for kill(), I'd use higher level abstraction than SIGKILL instead
of SIGTERM. to match the abstraction level of the python interface.

  • read_bind10_config(): the method's doc string may need to be updated, depending on whether this work also means we are going to decide how/where a list of running components should be defined.

I believe that this ticket explicitly requires it to be in boss (it doesn't seem doable to search for the correct configuration everywhere). Besides, configuration for a module should be something the module itself needs for correct working. As it is not the job of each component to start itself, it doesn't seem to be the best place to put the configuration into each respective component. But if you think we should talk this over on the dev list, it is possible.

I already start forgetting my original intent on this comment:-),
but I guess it's just that the doc string did now not seem to match
what the code does and they should be consistent, rather than what
we actually should do. Your argument rather seems to focus on the latter.
As for what we should do, I personally believe what a module should start
should be determined and defined by the module itself, not around in
the boss as I believe I already mentioned. But I'm okay with leaving
that discussion out of scope of this ticket.

BTW, I guess this comment shouldn't be affected (thus shouldn't be
removed) because of this branch:

        # [...]  Note that the logging
        # configuration may override the "-v" switch set on the command line.
  • start_auth(), etc: While the comment above start_auth() seems to suggest the use of them in tests, they now are not used in tests (overkilling?). Also, as already stated, ideally I'd move these outside the bind10 module.

I'd like that too, in the long term. But as they use quite a lot from boss internals, I think we should defer it.

Deferring it is fine.

  • start_all_processes(): is it okay to remove this?
    -        # Everything after the main components can run as non-root.
    -        # TODO: this is only temporary - once the privileged socket creator is
    -        # fully working, nothing else will run as root.
    -        if self.uid is not None:
    -            posix.setuid(self.uid)
    

Not really, I missed that one. But it's unhappy in the current state, it doesn't do what it should, as there's no right place. We should finish the socket creator soon enough.

Hmm...I'm especially not happy about letting xfrin with the root
privilege as a result of this. I suspect it's unlikely that
the socketcreator work is completed and merged before the boss config
stuff, I would introduce some intermediate workaround to retain the
previous behavior on this point. This would also be one example case
where incremental upgrade should have been beneficial.

  • stop_process(): I was not sure how we could remove this comment due to this branch:
    -        # TODO: Some timeout to solve processes that don't want to die would
    -        # help. We can even store it in the dict, it is used only as a set
    

Ah, that one was removed because it now belongs somewhere else. But it got lost in the ether until it got there, so I put it there now.

I don't understand this. If you are referring to a post-review
change, please be more specific about the exact commit ID (but of
course the more fundamental issue is that this branch is too big, so
it's extremely difficult for a reviewer to keep truck of every small
change).

and this line (or the concept of "expected_shutdowns"):

-        self.expected_shutdowns[process] = 1

(I'm not necessarily saying we can't. I simply didn't understand).

The expected shutdowns are no longer needed. A component is told to shut down, which puts it into a stopped state. A failed is called only on the running components. So the component remembers itself instead of relying on external dictionary.

On further look based on this comment, I was first afraid that
removing expected_shutdowns might introduce a bad side effect in
restart_processes() because it changed the condition of how the for
loop goes. I then noticed that dead_processes were effectively
removed (it's never updated), so we could probably remove this for
loop and dead_processes altogether.

Frankly, this branch has too many such things, and it made me
(as a reviewer) frustrated. In your brain as the main architect you
may have a perfect view of everything and can be sure which one is
safe, which one is now unnecessary, etc, but as a reviewer who is
suddenly shown the whole set of changes, it's mostly impossible to
have that higher level confidence. I could point out individual
things that may break other parts (and that's what I've done so far),
but I can never be sure about the big picture, and can never feel
safe.

This again leads to the possibility of incremental upgrade. I'll see
what we can do on this point later.

  • component_shutdown(): if I read the code correctly, __stopping can never be True in the context where this method can be called:
            if self.__stopping:
                return
    
    If the intent is to prevent a duplicate call, we'll probably need a dedicated attribute like "shutting_down", and set it to True in this own method.

I don't think it can be called twice currently and it probably wouldn't hurt either. I don't know what I was thinking before, when I added it, so I removed it.

This doesn't seem to be removed.

  • reap_children(): I was not sure if it's safe to remove this:
    -                self.dead_processes[proc_info.pid] = proc_info
    

The dead_processes were used for the delayed starts. We need to reintroduce them sometime, but maybe in a completely different way. So that's why they are removed from the code now.

See above for expected_shutdowns. It may be safe, but it's unclear
and probably incomplete, and made me so confused.

  • I'm not sure "b10-xxx" is the best choice for the component name.

That is because I use the name as the default value of process to start, to save the user some typing when introducing a component. I'd like to unify it with the address on the message queue somehow as well, so the only thing needed to add a new component would be to type:

config add Boss/components b10-component

And be done.

Then please describe the intent somewhere (it's okay to defer it to
the user doc task).

  • likewise, while "params" are defined as a list of strings,
  • I can't find "priority" in the spec.
  • dhcp related config is missing, too.

What dhcp related config? I think the DHCP wasn't started before (and it didn't really have a start_dhcp option yet). But if a user wants, he can always add the component, right? There's no special need for DHCP. Or did I overlook something?

Ah, I thought DHCP daemons were invoked by the boss process, if that's
not the case, forget the DHCP comment per se. But if we'd say "if a
user wants...", I'd also argue that it's the same for
xfrin/xfrout/auth, etc. (or is this perhaps an intermediate state so
we preserve the current operational practice and it will eventually
move outside the boss spec? if so, that's okay to me)

General comment: TestStartStopProcessesBob? was completely removed and
TestBossComponents? was completely newly introduced. Both are pretty
big, and the changes are not trivial, so I was not sure such changes
in tests ensure safe refactoring. [...]

Well, it's simply because the old way of configuration no longer exists, so there's no reason (or way) to test it. Instead of testing that if there's a start_resolver, the resolver is started, we test that if there's a component in configuration, that component is started (by passing it to the configurator and by the configurator tests).

With all due respect, it might be "simple" in the architect's brain,
but not for someone suddenly introduced to the whole big changes.
This is actually another instance of whether we couldn't make it
incremental introduction. I'll re-consider it in that context later.

  • TestBossComponents?.test_correct_run: about this point, my understanding is that we are expected to return an error answer in such cases:
            # Is it OK to raise, or should it catch also and convert to error
            # answer?
            self.assertRaises(Exception, bob.config_handler,
                              {'components': compconf})
    
    See also the comment about config_handler() above.

I was thinking it should be the job of whatever calls the handler to catch the exceptions and convert them to error messages. But I probably didn't have a reason to expect that to do so.

config_handler is a callback function, and the caller is
ModuleCCSession. Since it's beyond the module boundary, IMO it's
generally safer not to propagate arbitrary exceptions to the caller
side (even if re-raise is okay it's safer to catch them once and make
sure the re-raised exception is one of the expected exceptions for the
caller). Besides, in my understanding, ModuleCCSession doesn't
actually expect to get an exception on the call to the callback
handler.

comment:28 in reply to: ↑ 20 Changed 8 years ago by jinmei

Replying to vorner:

Hello

There were too many changes and too many removed tests, [...]

Maybe it was deemed to be too difficult or simply impossible, but due
to the above concerns I'd still like to try step-by-step transition.
A possibility would be:

  1. first merge the component module (this should be safe,
  2. choose a small part of the existing functionality, and replace only that part with the new module. do it carefully by preserving the semantics of the original test as much as possible. Maybe we should initially simply add code to call the corresponding component module but do nothing, then confirm the behavior with tests, and then remove the old code while letting the component module do the real job.
  3. repeat the 2nd process until we remove all old code.

I don't know how to do it well.

I've created a kind of quick hack branch ("trac213-incremental") to
see how the step-by-stp approach was difficult or could be feasible.
It begins with the latest component stuff and the master version of
boss code, and modifies the latter so that only the start and stop
processes are updated to use the component/configurator framework.

It doesn't yet introduce any new configuration syntax, and restart is
still handled in the old way. In that sense it's incomplete, but it
keeps all existing tests with a relatively minor modifications so we
can be more confident that it doesn't break the existing behavior. In
fact, it passes system tests, too.

This is a kind of proof-of-concept stuff and I'm not saying we should
do this exactly the way I did in that branch, but to me it proved we
could effectively make the changes more gradually. On top of the
first small changes like the ones in the "incremental" branch, we can
then add support for restarting using the configurator/component
modules. We can also then update the boss configuration syntax so
that we can replace the old style with the new one. Throughout the
process, we'd be able to replace "the tests that no longer make sense"
with corresponding new ones more gradually, thereby making us believe
it's safe migration.

Isn't it possible we can once step back to something like this and
rebuild the things one by one? If you still think it's impossible...,
I don't know how we can move forward. Frankly, the size of the
changes exceeded my brain capacity. Maybe we should find another
reviewer, or you might divide the diffs into several sets so I can eat
them one set to another.

BTW, in either case I believe we need commit c916095. Without this
killing the boss process by a single would also kill sockcreator, and
it will make the shutdown process a bit unexpected (this problem
exists in the current code, but the configurator code logic will make
it more visible).

And one more thing: Do we need a changelog entry for this task? Or is
the plan to provide it when we complete the entire migration?

Possibly. There should be a changelog in the end, because it's a really well visible change, but I'm not sure when. If we decide to go with the big branch way, maybe when merging back to master. If we choose to merge now, maybe I'd add a changelog then. But I'd rather like to add it when there's a little of documentation for a user somewhere to point to it.

That's fine.

I don't understand the trick here:

    LIBEXECDIR = ("@libexecdir@/@PACKAGE@"). \
        replace("${exec_prefix}", "@exec_prefix@"). \
        replace("${prefix}", "@prefix@")

The relationship between @libexecdir@, ${exec_prefix}, @exec_prefix@,
${prefix}, and @prefix@ is not clear. Please add comments about the
intent.

This was handled in the boss by a trick from Makefile, which I didn't like. But it turned out the replacement (the things in @@) happened to contain more and more variables and I kind of continued to replace them until it stopped. Do you think it is OK (with a comment) or I should find a better way to do it?

At least as long as the intent is commented I'm okay with it.

  • BoB.__init__(): ideally I'd like to move the setting of __core_components outside of the bind10 module (and even more ideally make it "configurable" by a command line option, etc).

I don't think configuration on command line makes any sense. These are the real core components. The boss won't be able to work (and load the configuration) without them. I don't see a reason why boss shouldn't know the bootstrap process and why a user would want to modify it (or in which way it would make any sense). In fact, I'd like to keep users as far from these as possible. If anybody has a reason to have a different set of these in bootstrap process, they are coding already anyway. I don't expect even the authors of additional bind10 components to need to modify them.

I'm not 100% sure if we never want to modularize (e.g.) the config
manager and make it replaceable, but I admit that may be
over-generalization. So I don't oppose to leaving this point as is.

comment:29 Changed 8 years ago by jinmei

I believe I responded to all comments (that need a discussion). The
ticket wasn't given back to me, so I'll simply keep it assigned to you
(=vorner).

comment:30 in reply to: ↑ 27 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

I've created a kind of quick hack branch ("trac213-incremental") to
see how the step-by-stp approach was difficult or could be feasible.
It begins with the latest component stuff and the master version of
boss code, and modifies the latter so that only the start and stop
processes are updated to use the component/configurator framework.

Hmm, that looks like it could work. As it seems my current approach doesn't, as it is hard to review, let's try it.

So, I'll try to address everything related to the configurator framework and the stuff that's already in the branch and then branch from it to add a little bit more.

Anyway, I noticed two things:

  • The start_all_pracesses method starts only b10-auth, not all the xfrin, xfrout, etc. That doesn't seem to matter much, as the configuration handler is called soon enough anyway. Is there a reason for this? Anyway, this is temporary, so it doesn't matter.
  • The start_xfrin in master now has some kind of hack with paths. Should we preserve that one?

Isn't it possible we can once step back to something like this and
rebuild the things one by one? If you still think it's impossible...,

Yes, it seems doable now. I had some kind of mental block, most probably because when I started the task, I didn't have really clear idea how the result will look like.

BTW, in either case I believe we need commit c916095. Without this
killing the boss process by a single would also kill sockcreator, and
it will make the shutdown process a bit unexpected (this problem
exists in the current code, but the configurator code logic will make
it more visible).

Ah, now I see why the socket creator threw so often on shutdown for me. That was one of the TODO items up there.

  • BoB.__init__(): ideally I'd like to move the setting of __core_components outside of the bind10 module (and even more ideally make it "configurable" by a command line option, etc).

I don't think configuration on command line makes any sense. These are the real core components. The boss won't be able to work (and load the configuration) without them. I don't see a reason why boss shouldn't know the bootstrap process and why a user would want to modify it (or in which way it would make any sense). In fact, I'd like to keep users as far from these as possible. If anybody has a reason to have a different set of these in bootstrap process, they are coding already anyway. I don't expect even the authors of additional bind10 components to need to modify them.

I'm not 100% sure if we never want to modularize (e.g.) the config
manager and make it replaceable, but I admit that may be
over-generalization. So I don't oppose to leaving this point as is.

We may want to have some modularization. But it wouldn't work from command line anyway, as these core components are tightly coupled into the rest of the code (the socket creator will be used to answer requests for sockets, we need to connect to the c_channel using the message queue, etc.). So, I think we should leave this for something in the year 5 or so.

Replying to jinmei:

If someone uses boss to start dhcp, he would just remove auth and resolver from configuration and have the dhcp part as needed.

Perhaps the point to consider is what should be specified as 'needed'
by default in bob.spec. If we see BIND 10 as a generic framework for
various kind of Internet servers (starting with DNS, then DHCP, and
perhaps even HTTP, etc), it would be more reasonable to begin with an
empty list of specific servers. If a user wants to use the framework
for DNS services, auth (and/or resolver) will then be specified as
'needed'. On the other hand, realistically speaking most people will
see BIND 10 as DNS software (at least for initial N years), so it
might be over generalization and just increase the configuration
overhead. Right now I have a strong opinion either way. Maybe one
option is to decide it at ./configure time, and make its default DNS
related servers. But in any case I'm okay with deferring this point.

When we introduced the start_auth and start_resolver configuration options, we discussed this AFAIK and we decided to have start_auth as true as the default. The current (nonincremental) version starts the same set of processes as the default. I did want the default to be empty and let user start what he wants at the time and we might bring it up sometime for wider discussion. But I don't think it should be just changed in the ticket, there'll be enough new things for users in that anyway.

  • An object of this class is a sort of finite state machine, [...]

Yes, you're right. It happened in kind of evolutionary way, the running one was there first, then the dead appeared later on and I didn't think about it. This way it looks simpler. I also added the diagram.

  • What if stop_internal() raises an exception?

Then we have a problem.

Actually, the component is considered shut down at the time and the exception is propagated. The idea behind this is, we can't really consider it running, because it might be already stopped and if there's problem stopping, if we try again (during system shutdown or sometime else), it would fail again. This way, if it happens during real shutdown and the process is still running, it will be at last killed. If it happens during reconfiguration, I don't know. Any ideas what to do then?

On thinking about it more as being explicitly asked, I think we should
keep truck of the status of spawned processes more precisely. Right
now (both before or after this branch), it seems that we are not very
accurate on this point.

[...]

All that said, this will be beyond the scope of this already-fat
task. After all, the pre-213 implementation isn't good in this sense,
so in the sense of porting the current behavior under a new framework
we don't have to solve it now. So, at the moment, I'm okay with just
leaving a comment that e.g. stop_process() is generally expected to be
exception free (for now) and the behavior is undefined if and when
that happens.

Yes, you're right, that would help. We could require components to report to boss „I'm ready“ when it starts (which could solve the problem of explicitly configuring address) and have few more states and more things like that. I also agree with leaving it out for now, but I added little bit of documentation with that regard.

Replying to jinmei:

Replying to (the rest of) vorner:

  • It seems to make sense to allow derived class to override some of the methods. But I suspect the requirement to the overridden methods should be more clarified [...]

I added some documentation.

Ack. One small point in the revised doc:

        You should also register any processes started within boss.

If this is something that all derived class (whether overridden or
not) should meet, maybe we should move it to start().

Hmm, I'm not sure about that. Actually, I envisioned the component could have more than one process running (in which case it would register multiple times) or no process at all. So, doing this would decrease the flexibility. Should I document this idea somewhere?

And, well, I don't expect that many components that are not started the usual way. In the long term, I'd really like to see only the real core components to be inherited.

documented. As for pid(), I guess the (more) proper condition is
if self._procinfo is not None here:

        return self._procinfo.pid if self._procinfo else None

Ah, right. Too much perl education in my case O:-). The original was in fact correct, as the _procinfo is an object, so it can be False only if it is None, but this is more the python way.

Also, for pid() to always work correctly, I suspect we should reset
_procinfo to None once the component stops running.

Then kill wouldn't work in some cases, or I'd need to store the ProcInfo? object twice, once for kill and once for pid. Or check the state in pid, but then, it wouldn't be there for the log message when it is killed.

So, instead of making it more complicated this way, I believe nothing is hurt by the way it is now, so I just added a note.

I added some documentation there.

Maybe my intent wasn't clearly delivered...let me rephrase it:
Assume I wanted to completely override _start_internal() for some
reason. Then I should set _procinfo in the override version;

As the current version of _start_internal only calls the _start_func(), stores the procinfo and registers the process, the only reason I can come up with would be to not use ProcInfo? at all (the way the socket creator does). If you have the ProcInfo? object, you are OK by providing the _start_func hook.

otherwise kill() won't work correctly. And, just like I commented
about "You should also register..." part, if this is something all
overridden the original implementation should do, I'd move it to
start(), and (in this specific case) ask the implementor to return a
ProcessInfo object from the overridden version of _start_internal().
Or, if you don't like that, at least we should document that
_procinfo must be set in the overridden version of _start_internal().

Ah, I added the kill later than the documentation. It should say kill should be overridden in the case you provide _start_internal. In that case, everything that uses _procinfo would be overriden.

I'd also envision some external developers want to use the BIND 10
framework to control a third party server process and want the boss
process to invoke the third party server without modifying the BIND 10
source code. My overall point is that I'd like this framework to have
this type of flexibility. Whether to use the so-called "reflection"
is implementation details. But in any case, I don't think we need it
today, especially not in this ticket, so I'm not insisting this should
be solved within this task.

Well, in that case we would need a way to push the code into the running master somehow. I'm not against that, but I had hoped that in the long term, the only 'special' components would be the real core ones (msgq, sockcreator, cfgmgr). The rest would have all the parameters & other needed info in the specification and started the usual way. Then we could take 'special' out of the spec file as well.

If it turns out we really need a way to push additional code into boss that does a different startup, I hope we would come up with a way how to modify the dictionary with special components (eg. each plugin or whatever it would be would plug itself into some place to be known).

Okay, but do you mean this?

    Note that this will allow you to stop (by invoking reconfigure) a core
    component. There should be some kind of layer protecting users from ever
    doing so (users must not stop the config manager, message queue and stuff
    like that or the system won't start again).

I fail to understand that this means that the caller shouldn't stop a
core component. It's probably better to describe that more
explicitly.

No, that doesn't mean caller shouldn't stop a core component, that is completely safe. The caller shouldn't stop a component which can't be started again. Stoping b10-auth set as core is completely OK. I updated the doc to be more clear about it.

Replying to jinmei:

Replying to vorner:

Hello

Replying to jinmei:

  • kill_started_processes(): from a quick look at the library reference, os.kill(signal.SIGTERM) is less portable than Poepen.kill(), which was used for the original code (especially if we consider Windows support), and in that sense this is a regression. The same comment applies to other cases where os.kill() is used.

The killing was moved into the components which handle them in concrete way.

As for kill(), I'd use higher level abstraction than SIGKILL instead
of SIGTERM. to match the abstraction level of the python interface.

I'm not sure I understand here. Currently the component uses the terminate() and kill() calls from the procinfo.process(). That is what was used before.

Do you mean something else?

Not really, I missed that one. But it's unhappy in the current state, it doesn't do what it should, as there's no right place. We should finish the socket creator soon enough.

Hmm...I'm especially not happy about letting xfrin with the root
privilege as a result of this. I suspect it's unlikely that
the socketcreator work is completed and merged before the boss config
stuff, I would introduce some intermediate workaround to retain the
previous behavior on this point. This would also be one example case
where incremental upgrade should have been beneficial.

This one is not in there right now, but I might have a proposal for a workaround. We may define a special drop-root-privileges component to place the dropping in the middle of the startup by the priorities. Does it sound reasonable?

  • # TODO: Some timeout to solve processes that don't want to die would
  • # help. We can even store it in the dict, it is used only as a set

}}}

Ah, that one was removed because it now belongs somewhere else. But it got lost in the ether until it got there, so I put it there now.

I don't understand this. If you are referring to a post-review
change, please be more specific about the exact commit ID (but of
course the more fundamental issue is that this branch is too big, so
it's extremely difficult for a reviewer to keep truck of every small
change).

It's git fd3c952098c46d84c9a277b1409442813a263876.

That is because I use the name as the default value of process to start, to save the user some typing when introducing a component. I'd like to unify it with the address on the message queue somehow as well, so the only thing needed to add a new component would be to type:

config add Boss/components b10-component

And be done.

Then please describe the intent somewhere (it's okay to defer it to
the user doc task).

Yes, I think it should be in the user docs. The programmer doc's describe the behaviour, but it's true not really the rationale behind that, but I don't think it belongs there.

What dhcp related config? I think the DHCP wasn't started before (and it didn't really have a start_dhcp option yet). But if a user wants, he can always add the component, right? There's no special need for DHCP. Or did I overlook something?

Ah, I thought DHCP daemons were invoked by the boss process, if that's
not the case, forget the DHCP comment per se. But if we'd say "if a
user wants...", I'd also argue that it's the same for
xfrin/xfrout/auth, etc. (or is this perhaps an intermediate state so
we preserve the current operational practice and it will eventually
move outside the boss spec? if so, that's okay to me)

Actually, there was a code to start it depending the config, but the config wasn't in the spec file yet.

So, this should be all regarding the code that is there for now. I can give this to you and do the next incremental step on a new branch (without a new ticket).

Thank you

comment:31 in reply to: ↑ 30 ; follow-ups: Changed 8 years ago by jinmei

Replying to part of vorner before my lunch break...

Anyway, I noticed two things:

  • The start_all_pracesses method starts only b10-auth, not all the xfrin, xfrout, etc. That doesn't seem to matter much, as the configuration handler is called soon enough anyway. Is there a reason for this? Anyway, this is temporary, so it doesn't matter.

I thought they started as a result of this:

        if self.cfg_start_auth:
            component_config['b10-xfrout'] = { 'kind': 'dispensable',
                                               'address': 'Xfrout' }
            component_config['b10-xfrin'] = { 'kind': 'dispensable',
                                              'address': 'Xfrin' }
            component_config['b10-zonemgr'] = { 'kind': 'dispensable',
                                              'address': 'Zonemgr' }
            self.__propagate_component_config(component_config)

didn't they?

BTW, in this sense the resolver part was incorrect. I intended to call
self.__propagate_component_config here, too:

        if self.cfg_start_resolver:
            component_config['b10-resolver'] = { 'kind': 'needed',
                                                 'special': 'resolver' }
            self.started_resolver_family = True

And, the reason why the invocation of xfrin, etc, was separated from
that of auth was partly because I tried to keep the original code
logic as much as possible, and partly because that would allow xfrin,
etc to start as a non privileged user while allowing auth (and
resolver) to have the privilege.

  • The start_xfrin in master now has some kind of hack with paths. Should we preserve that one?

Yes, we should, unless #1292 is resolved soon.

Replying to jinmei:

If someone uses boss to start dhcp, he would just remove auth and resolver from configuration and have the dhcp part as needed.

Perhaps the point to consider is what should be specified as 'needed'
by default in bob.spec. [...]

When we introduced the start_auth and start_resolver configuration options, we discussed this AFAIK and we decided to have start_auth as true as the default. The current (nonincremental) version starts the same set of processes as the default. I did want the default to be empty and let user start what he wants at the time and we might bring it up sometime for wider discussion. But I don't think it should be just changed in the ticket, there'll be enough new things for users in that anyway.

Okay. (And I agree we should generally focus on realizing the current
behavior with the new framework at least in this ticket).

Replying to jinmei:

Ack. One small point in the revised doc:

        You should also register any processes started within boss.

If this is something that all derived class (whether overridden or
not) should meet, maybe we should move it to start().

Hmm, I'm not sure about that. Actually, I envisioned the component could have more than one process running (in which case it would register multiple times) or no process at all. So, doing this would decrease the flexibility. Should I document this idea somewhere?

Yes please. I'd leave it to you whether to update the code itself.

documented. As for pid(), I guess the (more) proper condition is
if self._procinfo is not None here:

        return self._procinfo.pid if self._procinfo else None

Ah, right. Too much perl education in my case O:-). The original was in fact correct, as the _procinfo is an object, so it can be False only if it is None, but this is more the python way.

Right (the original was correct), hence "(more)" proper. In this
particular case I don't much care about the difference, but since
there are actually subtle cases where not X is not always
X is not None and the difference can cause a real bug, it would be a
good practice if we always try to be stricter.

comment:32 in reply to: ↑ 31 Changed 8 years ago by vorner

Hello

Replying to jinmei:

  • The start_all_pracesses method starts only b10-auth, not all the xfrin, xfrout, etc. That doesn't seem to matter much, as the configuration handler is called soon enough anyway. Is there a reason for this? Anyway, this is temporary, so it doesn't matter.

I thought they started as a result of this:
[...]
didn't they?

Ah, right. I got confused when it was separated, the first condition with start_auth looked too short O:-).

BTW, in this sense the resolver part was incorrect. I intended to call
self.__propagate_component_config here, too:

        if self.cfg_start_resolver:
            component_config['b10-resolver'] = { 'kind': 'needed',
                                                 'special': 'resolver' }
            self.started_resolver_family = True

Added and tried manually.

  • The start_xfrin in master now has some kind of hack with paths. Should we preserve that one?

Yes, we should, unless #1292 is resolved soon.

I returned the workaround by creating an xfrin special component. The start should be tested by system tests (there are some testing xfrin specially). However, my system does not check the workaround itself.

Hmm, I'm not sure about that. Actually, I envisioned the component could have more than one process running (in which case it would register multiple times) or no process at all. So, doing this would decrease the flexibility. Should I document this idea somewhere?

Yes please. I'd leave it to you whether to update the code itself.

Added a comment.

comment:33 follow-up: Changed 8 years ago by vorner

The next step I did is getting rid of the started_*_family variables. They are redundant and keep information kept in the ConfigManager? already.

As a result, the started as non-root messages got moved to the start_auth and start_resolver functions, being called even on restart. Also, I moved them to WARN, as this probably won't do what the user wanted.

It is in the 213-incremental-families branch.

comment:34 in reply to: ↑ 31 ; follow-up: Changed 8 years ago by jinmei

Replying to part of vorner before my lunch break...

...continued...

Replying to vorner:

Also, for pid() to always work correctly, I suspect we should reset
_procinfo to None once the component stops running.

[...]

I added some documentation there.

Maybe my intent wasn't clearly delivered...let me rephrase it:
Assume I wanted to completely override _start_internal() for some
reason. Then I should set _procinfo in the override version;

[...]

otherwise kill() won't work correctly. And, just like I commented
about "You should also register..." part, if this is something all
overridden the original implementation should do, I'd move it to
start(), [...]

Ah, I added the kill later than the documentation. It should say kill should be overridden in the case you provide _start_internal. In that case, everything that uses _procinfo would be overriden.

All these three points now seem to be related this point: it was not
clear what was the intended guideline to customize the start/stop details.
Also, as indicated by the missing documentation update to kill(),
it's not clear which set of methods are supposed to be overridden at
once. Even though documentation helps, it seems to be more error prone.

I'd suggest something like this:

  • Introduce a separate class, say ComponentInternal?. It has "start", "stop", "name", and "pid" methods. This class is an abstract base class; it's not expected to be used directly, and all of these methods are expected to be overridden in a derived class.
  • Define a derived DefaultComponentInternal?. Move the current _start_internal (renamed to start), _stop_internal (renamed to stop), name and pid, _procinfo to this class. This class takes a callable object (start_func) on initialization to allow small customization for startup only.
  • Component is constructed with a concrete ComponentInternal?, whose default is DefaultComponentInternal?. Component.start() calls component_internal.start(), etc, instead of Component._start(). _procinfo.
  • Say if the user wants complete customization, define a separate ComponentInternal? derived class and use it with Component. If the user wants a small customization for startup, only pass a different callable to the Component (or set DefaultComponentInternal?._start_func to the user's callable in some way)

This way it's clearer what is expected to be overridden as a set.
It's also safer in terms of avoiding to accidentally refer to things
like _procinfo with partially overridden methods.

I'd also envision some external developers want to use the BIND 10
framework to control a third party server process and want the boss
process to invoke the third party server without modifying the BIND 10
source code. [...]

Well, in that case we would need a way to push the code into the running master somehow. I'm not against that, but I had hoped that in the long term, the only 'special' components would be the real core ones (msgq, sockcreator, cfgmgr). The rest would have all the parameters & other needed info in the specification and started the usual way. Then we could take 'special' out of the spec file as well.

If it turns out we really need a way to push additional code into boss that does a different startup, I hope we would come up with a way how to modify the dictionary with special components (eg. each plugin or whatever it would be would plug itself into some place to be known).

I have no further comment on this. This point will be out of scope of
this ticket anyway.

Okay, but do you mean this?

    Note that this will allow you to stop (by invoking reconfigure) a core
    component. There should be some kind of layer protecting users from ever
    doing so (users must not stop the config manager, message queue and stuff
    like that or the system won't start again).

I fail to understand that this means that the caller shouldn't stop a
core component. It's probably better to describe that more
explicitly.

No, that doesn't mean caller shouldn't stop a core component, that is completely safe. The caller shouldn't stop a component which can't be started again. Stoping b10-auth set as core is completely OK. I updated the doc to be more clear about it.

Okay.

Replying to jinmei:

  • kill_started_processes(): from a quick look at the library reference, os.kill(signal.SIGTERM) is less portable [...]

The killing was moved into the components which handle them in concrete way.

As for kill(), I'd use higher level abstraction than SIGKILL instead
of SIGTERM. to match the abstraction level of the python interface.

I'm not sure I understand here. Currently the component uses the terminate() and kill() calls from the procinfo.process(). That is what was used before.

Do you mean something else?

I meant this comment is UNIX (or POSIX) specific while Python's
subprocess class is more generic:

        If the forcefull is true, it uses SIGKILL instead of SIGTERM.

I'd say something like this instead:

        If the forcefull is true, it literally "kills" the component
	instead of "stopping" it.  If the component represents a
        process in a POSIX based OS, this means the process is sent
	a SIGKILL signal instead of SIGTERM.

Not really, I missed that one. But it's unhappy in the current state, it doesn't do what it should, as there's no right place. We should finish the socket creator soon enough.

Hmm...I'm especially not happy about letting xfrin with the root
privilege as a result of this. [...]

This one is not in there right now, but I might have a proposal for a workaround. We may define a special drop-root-privileges component to place the dropping in the middle of the startup by the priorities. Does it sound reasonable?

Probably, as a workaround (by definition as long as it works and is
relatively short term, any solution should be reasonable). Note also
that the experimental 213-incremental branch has another workaround
(not specifically intending to address this, but as a result of the
attempt of keeping the original behavior as much as possible).

Ah, I thought DHCP daemons were invoked by the boss process, if that's
not the case, forget the DHCP comment per se. But if we'd say "if a
user wants...", I'd also argue that it's the same for
xfrin/xfrout/auth, etc. (or is this perhaps an intermediate state so
we preserve the current operational practice and it will eventually
move outside the boss spec? if so, that's okay to me)

Actually, there was a code to start it depending the config, but the config wasn't in the spec file yet.

Ah, cfg_start_dhcp6 (in master) is always False (unless the test code
explicitly modifies it). Confusing:-) While it's not a problem of
this task per se, it would be nicer to add some comment around here:

        if self.cfg_start_dhcp6:
            self.start_dhcp6(c_channel_env)

comment:35 follow-up: Changed 8 years ago by vorner

Hello

In between, I created another microbranch, 213-internal-restarts. It uses the component framework to do the restarts.

As the main branch, it destroys the delayed restarts and brittle mode. Now, should it be fixed in this micro-branch, or is it OK for now?

comment:36 in reply to: ↑ 33 Changed 8 years ago by jinmei

Replying to vorner:

The next step I did is getting rid of the started_*_family variables. They are redundant and keep information kept in the ConfigManager? already.

As a result, the started as non-root messages got moved to the start_auth and start_resolver functions, being called even on restart. Also, I moved them to WARN, as this probably won't do what the user wanted.

It is in the 213-incremental-families branch.

The changes in the "family" micro branch (including the ones made
on top of the original 212-incremental branch) look okay.

comment:37 in reply to: ↑ 35 Changed 8 years ago by jinmei

Replying to vorner:

Hello

In between, I created another microbranch, 213-internal-restarts. It uses the component framework to do the restarts.

This branch seems to break systest.

comment:38 Changed 8 years ago by jinmei

As for delayed restart and brittle, it would depend on how much work
we need for them. Also, *personally*, I'm okay with skipping the
brittle mode.

In any case this ticket is becoming too long now (and several micro
branches are running in parallel, each of which is discussed in this
ticket), so it would probably make more sense to create separate
tickets for each separate microbranch.

At this point I'm giving the ticket back to you.

comment:39 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:40 in reply to: ↑ 34 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

All these three points now seem to be related this point: it was not
clear what was the intended guideline to customize the start/stop details.
Also, as indicated by the missing documentation update to kill(),
it's not clear which set of methods are supposed to be overridden at
once. Even though documentation helps, it seems to be more error prone.

I'd suggest something like this:

  • Introduce a separate class, say ComponentInternal?. It has "start", "stop", "name", and "pid" methods. This class is an abstract base class; it's not expected to be used directly, and all of these methods are expected to be overridden in a derived class.
  • Define a derived DefaultComponentInternal?. Move the current _start_internal (renamed to start), _stop_internal (renamed to stop), name and pid, _procinfo to this class. This class takes a callable object (start_func) on initialization to allow small customization for startup only.
  • Component is constructed with a concrete ComponentInternal?, whose default is DefaultComponentInternal?. Component.start() calls component_internal.start(), etc, instead of Component._start(). _procinfo.
  • Say if the user wants complete customization, define a separate ComponentInternal? derived class and use it with Component. If the user wants a small customization for startup, only pass a different callable to the Component (or set DefaultComponentInternal?._start_func to the user's callable in some way)

This way it's clearer what is expected to be overridden as a set.
It's also safer in terms of avoiding to accidentally refer to things
like _procinfo with partially overridden methods.

This sounded like too many classes to me, it's hard to follow it. So I made a compromise to have a base class which has the methods which should not be overriden and took the procinfo and stuff to a separate derived class. Would this help?

I meant this comment is UNIX (or POSIX) specific while Python's
subprocess class is more generic:

        If the forcefull is true, it uses SIGKILL instead of SIGTERM.

I updated it to say „for example“ to make it clear this is not the only possible way.

This one is not in there right now, but I might have a proposal for a workaround. We may define a special drop-root-privileges component to place the dropping in the middle of the startup by the priorities. Does it sound reasonable?

Probably, as a workaround (by definition as long as it works and is
relatively short term, any solution should be reasonable). Note also
that the experimental 213-incremental branch has another workaround
(not specifically intending to address this, but as a result of the
attempt of keeping the original behavior as much as possible).

But this workaround won't work if we move to the configurable way completely. Because if the list of processes is defined completely by configuration, the boss doesn't have the knowledge to know where to put the „drop privileges“ call.

Actually, there was a code to start it depending the config, but the config wasn't in the spec file yet.

Ah, cfg_start_dhcp6 (in master) is always False (unless the test code
explicitly modifies it). Confusing:-) While it's not a problem of
this task per se, it would be nicer to add some comment around here:

        if self.cfg_start_dhcp6:
            self.start_dhcp6(c_channel_env)

If this will disappear at this branch, I don't think the comment is needed any more.

Giving it to you, as I'll create a new branch for the restarts branch so we don't mix them too much.

comment:41 in reply to: ↑ 40 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

This way it's clearer what is expected to be overridden as a set.
It's also safer in terms of avoiding to accidentally refer to things
like _procinfo with partially overridden methods.

This sounded like too many classes to me, it's hard to follow it. So I made a compromise to have a base class which has the methods which should not be overriden and took the procinfo and stuff to a separate derived class. Would this help?

Works for me. See comments about implementation details.

I meant this comment is UNIX (or POSIX) specific while Python's
subprocess class is more generic:

        If the forcefull is true, it uses SIGKILL instead of SIGTERM.

I updated it to say „for example“ to make it clear this is not the only possible way.

OK.

Probably, as a workaround (by definition as long as it works and is
relatively short term, any solution should be reasonable). Note also
that the experimental 213-incremental branch has another workaround
(not specifically intending to address this, but as a result of the
attempt of keeping the original behavior as much as possible).

But this workaround won't work if we move to the configurable way completely. Because if the list of processes is defined completely by configuration, the boss doesn't have the knowledge to know where to put the „drop privileges“ call.

I know. But I believe we could still leave hardcoded workaround in
bind10_src.py that checks the given configuration and performs the
incremental startups. We could have more generic drop-root-priv
components. In either case it will be a not-too-long workaround until
sockcreator is done, so I'd go for easier way (and basically I'm okay
with any workable solution).

Some comments on specific details:

  • 'boss' and 'kind' seem to be a duplicate for BaseComponent? and Component. For the latter we can simply refer to the BaseComponent?'s doc?
  • Msgq doesn't meet the recommendation of the Component class:
        directly. It is not recommended to override methods of this class
        on one-by-one basis.
    
    If this exceptional customization is still considered the best, I think we should note more explicitly that it (intentionally) breaks the recommendation and explain more detailed rationale than just saying "hackish".
  • in component tests, we now don't test the Component class itself. I suspect we should introduce a mock boss object and test Component (that doesn't actually spawn a process or kill it, but just internally emulates such operations).
  • I realized it's not clear from the doc string what is expected for the 'info' parameter of BoB.register_process(). Right now it's a ProcInfo? object, but it's a short term workaround, and it will eventually be a BaseComponent? object. It will help if we clarify that in the docstring.

comment:42 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:43 in reply to: ↑ 41 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Some comments on specific details:

The boss was so short it would be longer to reference the other docs. But I did the reference for the kind, as it had the description of values.

  • Msgq doesn't meet the recommendation of the Component class:
        directly. It is not recommended to override methods of this class
        on one-by-one basis.
    
    If this exceptional customization is still considered the best, I think we should note more explicitly that it (intentionally) breaks the recommendation and explain more detailed rationale than just saying "hackish".

I'm not sure it's the best possible, but I don't like to have another copy of the common parts, when the only thing I want to do is to make it not stop. This is reasonably short, if nothing else. O:-)

  • in component tests, we now don't test the Component class itself. I suspect we should introduce a mock boss object and test Component (that doesn't actually spawn a process or kill it, but just internally emulates such operations).

I added some tests for the methods that don't do the manipulation with external world. Which isn't much, because most of the Component is the actual manipulation, but there's something.

Or you meant something else?

  • I realized it's not clear from the doc string what is expected for the 'info' parameter of BoB.register_process(). Right now it's a ProcInfo? object, but it's a short term workaround, and it will eventually be a BaseComponent? object. It will help if we clarify that in the docstring.

Actually, even now it is the BaseComponent? object, just it now extracts the ProcInfo? out of it. Anyway, the doc is added.

Thanks

comment:44 in reply to: ↑ 43 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

The boss was so short it would be longer to reference the other docs. But I did the reference for the kind, as it had the description of values.

Okay.

  • Msgq doesn't meet the recommendation of the Component class:
        directly. It is not recommended to override methods of this class
        on one-by-one basis.
    
    If this exceptional customization is still considered the best, I think we should note more explicitly that it (intentionally) breaks the recommendation and explain more detailed rationale than just saying "hackish".

I'm not sure it's the best possible, but I don't like to have another copy of the common parts, when the only thing I want to do is to make it not stop. This is reasonably short, if nothing else. O:-)

Okay.

  • in component tests, we now don't test the Component class itself. I suspect we should introduce a mock boss object and test Component (that doesn't actually spawn a process or kill it, but just internally emulates such operations).

I added some tests for the methods that don't do the manipulation with external world. Which isn't much, because most of the Component is the actual manipulation, but there's something.

Or you meant something else?

Not fully looked into it yet, but the test failed:

ERROR: test_component_attributes (__main__.ComponentTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jinmei/src/isc/git/bind10-213inc/src/lib/python/isc/bind10/tests/component_test.py", line 469, in test_component_attributes
    self.assertIsInstance(component._procinfo, TestProcInfo)
AttributeError: 'ComponentTests' object has no attribute 'assertIsInstance'
  • I realized it's not clear from the doc string what is expected for the 'info' parameter of BoB.register_process(). Right now it's a ProcInfo? object, but it's a short term workaround, and it will eventually be a BaseComponent? object. It will help if we clarify that in the docstring.

Actually, even now it is the BaseComponent? object, just it now extracts the ProcInfo? out of it. Anyway, the doc is added.

We'll soon see this be the case as we complete #1365. The revised doc
looks okay.

comment:45 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:46 in reply to: ↑ 44 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Not fully looked into it yet, but the test failed:

ERROR: test_component_attributes (__main__.ComponentTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jinmei/src/isc/git/bind10-213inc/src/lib/python/isc/bind10/tests/component_test.py", line 469, in test_component_attributes
    self.assertIsInstance(component._procinfo, TestProcInfo)
AttributeError: 'ComponentTests' object has no attribute 'assertIsInstance'

It seems your version of unittest lacks the assertIsInstance check. Does the workaround work for you?

Thanks

comment:47 in reply to: ↑ 46 ; follow-up: Changed 8 years ago by jinmei

Replying to vorner:

Not fully looked into it yet, but the test failed:

ERROR: test_component_attributes (__main__.ComponentTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jinmei/src/isc/git/bind10-213inc/src/lib/python/isc/bind10/tests/component_test.py", line 469, in test_component_attributes
    self.assertIsInstance(component._procinfo, TestProcInfo)
AttributeError: 'ComponentTests' object has no attribute 'assertIsInstance'

It seems your version of unittest lacks the assertIsInstance check. Does the workaround work for you?

Ah, yes. According to http://docs.python.org/py3k/library/unittest.html
assertIsInstance is new in Python 3.2 (I'm (normally) using 3.1).

As for the test itself: it looks good. I'll test some other cases:

  • this code path
            else:
                # TODO Handle params, etc
                procinfo = self._boss.start_simple(self._process)
    
  • stop()
  • kill()

comment:48 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:49 in reply to: ↑ 47 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

As for the test itself: it looks good. I'll test some other cases:

  • this code path
            else:
                # TODO Handle params, etc
                procinfo = self._boss.start_simple(self._process)
    
  • stop()
  • kill()

I originally thought it's impossible to test, as it starts/stop external processes. But then I realized anything in python can be hijacked, so I added some more tests for these.

Thanks

comment:50 in reply to: ↑ 49 Changed 8 years ago by jinmei

Replying to vorner:

As for the test itself: it looks good. I'll test some other cases:

  • this code path
            else:
                # TODO Handle params, etc
                procinfo = self._boss.start_simple(self._process)
    
  • stop()
  • kill()

I originally thought it's impossible to test, as it starts/stop external processes. But then I realized anything in python can be hijacked, so I added some more tests for these.

Looks okay.

comment:51 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:52 Changed 8 years ago by jelte

  • Milestone changed from Sprint-20111108 to Sprint-20111122

comment:53 Changed 8 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed

OK, all merged together.

comment:54 Changed 8 years ago by vorner

  • Add Hours to Ticket set to 47.25

Ah, the real time spent on it.

comment:55 Changed 8 years ago by jinmei

  • Add Hours to Ticket changed from 47.25 to 79.43
Note: See TracTickets for help on using tickets.