Opened 9 years ago

Closed 8 years ago

#640 closed defect (fixed)

cfgmgr hanging and command and configurations for no-longer-running components

Reported by: jreed Owned by: jelte
Priority: medium Milestone: Sprint-20120221
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 20.67 Internal?: no

Description (last modified by jreed)

Easy to reproduce:

> config set Boss/start_auth false
> config set Boss/start_resolver true
> config commit
> Auth sendstats
{
    "error": "Module 'Auth' not responding"
}
> config set Auth/statistics-interval 120
> config commit
Error: Module 'ConfigManager' not responding
Configuration not committed

Note that the not responding error I think is accurate. I think the cfgmgr is now hanging.

This ticket is two problems.

Subtickets

Change History (21)

comment:1 Changed 9 years ago by jreed

  • Description modified (diff)

comment:2 Changed 8 years ago by jreed

  • Defect Severity set to N/A
  • Sub-Project set to DNS

This problem continues.

It is slightly different now. (Note that the Boss components configuration is different now also.)

> config remove Boss/components b10-auth
> config add Boss/components b10-resolver
> config set Boss/components/b10-resolver/kind needed
> config set Boss/components/b10-resolver/priority 10
> config set Boss/components/b10-resolver/special resolver
> config commit
> config set Auth/statistics-interval 120
> config commit
Error: Timeout waiting for answer from Auth
Configuration not committed

It should not error on configuring Auth. If it must error now since support doesn't exist yet then the error should be more clear.

Also see:

> config set Resolver/timeout_query 2001
> config commit
Error: Module 'ConfigManager' not responding
Configuration not committed

I then tried as root and this time it failed differently (even before commit):

> config set Auth/statistics-interval 120 
Traceback (most recent call last):
  File "/usr/pkg/lib/python3.1/cmd.py", line 213, in onecmd
    func = getattr(self, 'do_' + cmd)
AttributeError: 'BindCmdInterpreter' object has no attribute 'do_config'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./bin/bindctl", line 149, in <module>
    result = tool.run()
  File "/home/reed/builder/work/BIND10/20120124093934-NetBSD5-i386/install/lib/python3.1/site-packages/bindctl/bindcmd.py", line 138, in run
    self.cmdloop()
  File "/usr/pkg/lib/python3.1/cmd.py", line 139, in cmdloop
    stop = self.onecmd(line)
  File "/home/reed/builder/work/BIND10/20120124093934-NetBSD5-i386/install/lib/python3.1/site-packages/bindctl/bindcmd.py", line 466, in onecmd
    Cmd.onecmd(self, line)
  File "/usr/pkg/lib/python3.1/cmd.py", line 215, in onecmd
    return self.default(line)
  File "/home/reed/builder/work/BIND10/20120124093934-NetBSD5-i386/install/lib/python3.1/site-packages/bindctl/bindcmd.py", line 432, in default
    self._parse_cmd(line)
  File "/home/reed/builder/work/BIND10/20120124093934-NetBSD5-i386/install/lib/python3.1/site-packages/bindctl/bindcmd.py", line 568, in _parse_cmd
    self._validate_cmd(cmd)
  File "/home/reed/builder/work/BIND10/20120124093934-NetBSD5-i386/install/lib/python3.1/site-packages/bindctl/bindcmd.py", line 366, in _validate_cmd
    params[param_count - 2] += params[param_count - 1]
KeyError: -1

Note this may be an old ticket, but I hit the same problem (Error: Module 'ConfigManager?' not responding
) yesterday when trying to set up a resolver for some tests.

comment:3 Changed 8 years ago by jelte

  • Milestone set to Sprint-20120207

comment:4 Changed 8 years ago by jelte

Just a few initial comments, these are indeed all related (except maybe that last one), and the problem is that configmanager is never told about modules being stopped, resulting in timeouts when it tries to contact them.

There are a few levels of solutions IMO;

  • Make sure the ModuleCCSession also sends a message when it is closed, so that ConfigManager? can remove it from its list of currently running modules. This will (or should) also result in bindctl at least erroring sooner and not with a timeout ('unknown module') when sending a direct command.
  • We may need to add another message from boss 'I don't know if you have noticed, but process X has disappeared.', but this may require another addition (need to match module names to process names somewhere if we don't already)
  • The second issue then becomes one that we wanted for a long time as well; offline configuration, though this time only 'partly' offline; i.e. configmanager should be able to check and change configuration values of modules that are not running). But I am not entirely sure on what its behaviour should look like (e.g. why should an admin be bothered with config values for things he isn't running)

comment:5 Changed 8 years ago by jelte

  • Owner set to jelte
  • Status changed from new to assigned

comment:6 Changed 8 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from assigned to reviewing

Notes:

the diff has gotten quite big already, so it's about time to call this good enough for now; it solves the problem, though not ideally just yet.

What it does now is inform the config manager that a module is stopping, which through some more signaling ends up removing it from the 'running modules' list in bindctl; so when you remove a component from Boss/components, commit it, and then try to send that component a command or change its configuration, bindctl will immediately report that the module is unknown or not running.

So in effect, this is actually an implementation of bullet point 1 of #123; when a module (defined by using the ModuleCCSession class) quits correctly, it causes a message 'stopping' to be sent to the channel ConfigManager? (when it is killed, currently there is no such message, we can consider that as well, and implement it in Boss, but as we expect the module to be restarted anyway, I think we can live without that for now).

Reviewing notes:
We currently have separate implementations for ModuleCCsession in C++ and python, which in a way is a small advantage; the C++ version automatically handles this when the destructor is called, but for python this is not practical; the moment when del() will be called is not predictable enough, and some quick tests have shown that is is currently not called when the object goes out of scope in a number of our modules.

So for the python version, I added an explicit send_stopping() method, which will send the message. If you have a suggestion for a better name, please let me know :) Originally called it stop(), but it's not to be confused with close(), which actually closes the communication channel, but I'd like to keep them separate.

This does mean that I needed to add a call to every python module, which is the main cause of the patch growing. And since our modules use the ModuleCCSession in a different way, it's not as straightforward everywhere as one may hope. In fact, I think that we should have a followup task to make the ModuleCCSession a Context (the pythonic way to do RAII), and make our modules more consistent in this regard. I considered doing this right now but for some (of not all) modules this requires some big changes in the way they work. As for the followup, I recommend we make ModuleCCSession a context-usable object, implement it in a relatively clean module (ddns comes to mind), then use that as the template to clean up the others.

Another thing, closely related to this, is that our tests have a similar problem; we have a growing number of local Mock classes (and I had to add a few more). In a lot of cases, these tend to check the same thing. For some we can move them to testutil, so at least there is a common class for them; I added a very minimal MockModuleCCSession to testutils, and used that one in a number of tests (but not all; some mockclasses provide extra things only needed for those specific tests). But I have been playing around a bit and I think that we can do something way better in python; a general mock/proxy call-counting class, I'll send a proposal to the list separately. This would not be direct followup for this, but since this branch inspired me to look into that, I thought i'd mention it anyway.

So the actual changes are these:

  • updated ccsession.h ccsession.cc and ccsession.py to allow for the sending of the 'stopping' informational message
  • updated cfgmgr to act on this message and send cmdctl another message (in fact a form of spec_update, but the content signaling 'remove this spec from your known list).
  • fixed bindctl to clear its list of known modules before asking the new ones (it originally only overwrote them)
  • added calls to the python modules to send said stopping message (for the C++ modules, it is automatically called upon destruction of the ModuleCCSession)
  • added MockCCSession test and check whether stop() is called

Proposed changelog:
[func] jelte
modules now inform the system when they are stopping. As a result, they are removed from the 'active modules' list in bindctl, which can then inform the user directly when it tries to send them a command or configuration update. Previously this would result in a 'not responding' error instead of 'not running'.

(i personally think this is more func than bug, but no strong opinion)

comment:7 follow-up: Changed 8 years ago by jinmei

I've not yet reviewed everything, but as I'm taking a day off tomorrow
I'll provide feedback on the things I've covered so far.

First off, it was not clear which part of the original report was
solved. I see the potential that this set of changes can make bindctl
more responsive in the reported scenario, but was it identified why
cfgmgr was hanging (or did it really hang)? And does this branch
solve that problem? What about the traceback output? (BTW I
understand this branch cannot solve the issue of configuring inactive
component).

A related point is that since this is an issue concerning interactions
of multiple processes, unittests wouldn't be enough to be sure whether
the problem was solved. It may be okay to defer it to a separate
ticket, but I think we need a system level test for this kind of
thing.

Next, I made a few trivial changes to the branch.

ccsession.cc

  • Not a big deal, but sendStopping() is a private method, not very big, and used only within the ModuleCCSession destructor. Is it worth a separate method?
  • logging itself could cause an exception, so the destructor could still throw. However, this is not the only point that has this problem, so maybe we can leave it open for now...

ccsession.h

  • I thought we decided to use C++ style comments for doxygen
  • In our convention we'd explicitly add 'virtual' (for clarity) to the destructor declaration.

config_message.mes

If not yet done, you may want to reorder the messages using
tools/reorder_message_file.py

ccsession_unittests.cc

  • maybe a matter of taste, but the intermediate block just to have a separate scope looks a bit awkward. I'd do something like this:
        scoped_ptr<ModuleCCSession> mccs(new ModuleCCSession(
                                             ccspecfile("spec2.spec"),
                                             session, NULL, NULL,
                                             true, false));
        ...
        mccs.reset();               // this will invoke destructor
    
    At least I think some comments about the intent of the scope would be necessary.
  • it would be nicer if we could test the case where the destructor throws.

ccsession.py

  • send_stopping: while the pydoc says "any errors are logged", only SessionError? is caught. Is that intentional? Can we assume create_command() doesn't raise an exception?
  • send_stopping: I believe you don't need 'str' for 'se'. (might not be a big deal in this case, but delaying 'str' could possibly help improve performance).

cfgmgr.py

  • if _handle_module_stopping is private, should it better be prefixed with __? Same for _send_module_spec_to_cmdctl.
  • _handle_module_stopping returns None, saying "This command is not expected to be answered". But returning None doesn't seem to suppress sending a response.
  • Not directly related to this ticket, but the if-else in handle_msg is getting too big to understand/manage. I guess it's time to consider refactoring, maybe in some OO way.

ccsession_test.py

  • test_stop: not necessarily a problem, but it seems the same pattern repeats in multiple tests. It would be better to unify the common sequence of tests for conciseness.

comment:8 Changed 8 years ago by jinmei

  • Owner changed from UnAssigned to jinmei

comment:9 in reply to: ↑ 7 ; follow-up: Changed 8 years ago by jelte

Replying to jinmei:

I've not yet reviewed everything, but as I'm taking a day off tomorrow
I'll provide feedback on the things I've covered so far.

Ack, thanks so far :)

First off, it was not clear which part of the original report was
solved. I see the potential that this set of changes can make bindctl
more responsive in the reported scenario, but was it identified why
cfgmgr was hanging (or did it really hang)? And does this branch
solve that problem? What about the traceback output? (BTW I
understand this branch cannot solve the issue of configuring inactive
component).

Oh, sorry for not being clear; that ConfigManager? timeout was actually closely related; the timeout values for communication from bindctl to configmanager and from configmanager to any other (in this case non-running) module are the same; so the 'configmanager is not responding' was not because configmanager was hanging, but because it was simply waiting for a timeout itself. So this would address the root cause, but not that specific problem (we probably need to make cmdctl->cfgmgr timeouts a bit higher than the rest).

A related point is that since this is an issue concerning interactions
of multiple processes, unittests wouldn't be enough to be sure whether
the problem was solved. It may be okay to defer it to a separate
ticket, but I think we need a system level test for this kind of
thing.

oh yeah, good point. I've added a few steps to terrain/bind10_control.py that I needed, and then cleaned it up a bit.
Added a new feature file 'bindctl_commands.feature', which currently only contains this test (run basic setup, remove a number of modules, and check if they are not shown anymore)

Next, I made a few trivial changes to the branch.

ccsession.cc

  • Not a big deal, but sendStopping() is a private method, not very big, and used only within the ModuleCCSession destructor. Is it worth a separate method?

I personally prefer it in a separate method; not so much because of its size, but because the action it performs is not directly related to local resource cleanup, and is kind of a side effect, so constructing the message and sending it seems (imho) nicer in a separate method. No strong opinion though (side not, i even considered making it a public call, and having some check to see if it had been called already in the destructor, so the client code can decide when it should be called. But that did not seem worth it).

  • logging itself could cause an exception, so the destructor could still throw. However, this is not the only point that has this problem, so maybe we can leave it open for now...

hmm, imo we should either make logging calls exception-free or add exception-free ones, since it is usually about the only thing you can do in a destructor.

ccsession.h

  • I thought we decided to use C++ style comments for doxygen

oops, fixed

  • In our convention we'd explicitly add 'virtual' (for clarity) to the destructor declaration.

actually, it's even needed for technical reasons for classes that are subclassed, but this one isn't supposed to be subclassed afaik I realize we can't stop people from doing it, but personally i tend to look at the virtualness of the destructor to see if it should be subclassed in the first place. Of course, one could argue that it would be better to add it when it's not necessary than to not have it when it is, so if you insist i'll make it virtual.

config_message.mes

If not yet done, you may want to reorder the messages using
tools/reorder_message_file.py

done (it only moved an older one)

ccsession_unittests.cc

  • maybe a matter of taste, but the intermediate block just to have a separate scope looks a bit awkward. I'd do something like this:
        scoped_ptr<ModuleCCSession> mccs(new ModuleCCSession(
                                             ccspecfile("spec2.spec"),
                                             session, NULL, NULL,
                                             true, false));
        ...
        mccs.reset();               // this will invoke destructor
    
    At least I think some comments about the intent of the scope would be necessary.

changed

  • it would be nicer if we could test the case where the destructor throws.

ack, added a simple throw_on_send option to fakesession, and added test

ccsession.py

  • send_stopping: while the pydoc says "any errors are logged", only SessionError? is caught. Is that intentional? Can we assume create_command() doesn't raise an exception?

Oh, right, made it except Exception. create_command can raise as well, but i don't really want to catch those; imo they can only be programmer errors (well, except for out of mem of course), and should fail as loudly as possible. Added a comment to that effect.

  • send_stopping: I believe you don't need 'str' for 'se'. (might not be a big deal in this case, but delaying 'str' could possibly help improve performance).

ack, changed

cfgmgr.py

  • if _handle_module_stopping is private, should it better be prefixed with __? Same for _send_module_spec_to_cmdctl.

as you can see from the rest of the code here, i tend to only use the one when I don't think the class/method name mangling is necessary. Should we start doing this as a principle? (and if so, should we update all 'private' methods?)

  • _handle_module_stopping returns None, saying "This command is not expected to be answered". But returning None doesn't seem to suppress sending a response.

ah doh, fixed.

  • Not directly related to this ticket, but the if-else in handle_msg is getting too big to understand/manage. I guess it's time to consider refactoring, maybe in some OO way.

depends on how; as it is now, it is indeed getting long, but is essentially a switch. I'm open to proposals, but some 'smart' solutions might actually be more complicated :p

ccsession_test.py

  • test_stop: not necessarily a problem, but it seems the same pattern repeats in multiple tests. It would be better to unify the common sequence of tests for conciseness.

Yes, actually, more generally, this is partly why I was looking at the callcounter thingy a few nights ago; i am working during my down-hours on a slightly more general version of that, and I think we can replace a lot of 'checking if stuff has been called' tests with such a beasty. But I guess you're not just talking about that. I actually think we should not just unify more test code; we can probably do a much better job of unifying the common patterns in the modules themselves as well (and unified test code should then kind of roll out as part of that work).

comment:10 Changed 8 years ago by jinmei

First, comments on the rest of the original branch (before you updated
it based on the initial comments).

some higher level points

The relationship between bindctl and module CC session is complicated
and difficult to understand. For example, it took some time for me to
understand the interaction of UIModuleCCSession and bindctl - these
two (defined in separate modules) are tightly coupled and work based
on some implicit assumption of each other's features. So, to
understand changes to the module CC session I needed to look into
bindctl. Is it only me, or is this a known issue and kind of a longer
term TODO? In any case, however, I'm not requesting it be resolved
within this ticket.

Maybe unrelated to this ticket but: does bindctl try to get the full
latest spec and config data of all running modules on accepting every
command? If so, is the overhead acceptable?

When the application doesn't die gracefully, sending the stop message
can be skipped. Is it a TODO thing when it supports some kind of
RAII? For that matter, the whatever successor of msgq may make it
possible for cfgmgr to (easily) detect it when a specific module is
disconnected. If so, we may not have to have a "send stopping"
primitive, either directly or indirectly.

ccsession.py

I've come up with additional comments on this file while reading other
part of the branch.

  • request_specifications: it would be helpful if it had more detailed description, including it clears the previous data. It would also be helpful how this method (via update_specs_and_config()) is expected to be used - and, for that matter, whether request_specification() is expected to be public or essentially private (in the latter case adding __ will clarify the intent a bit clearer, although this will make some update to the test code).
  • request_specifications: not directly related to this ticket, but I suspect we can remove or at least revise the comment.

ccsession_mock.py

  • I'd like to have some pydoc description for MockModuleCCSession.

cfgmgr_test.py

  • This part is a bit confusing:
            # drop the two messages for now
            self.assertEqual(len(self.fake_session.message_queue), 2)
            self.fake_session.get_message("Cmdctl", None)
            self.fake_session.get_message("TestModule", None)
    
            self.assertEqual(len(self.fake_session.message_queue), 0)
    
    It's probably for clearing the faked message queue for the convenience of subsequent tests, but this part itself raises some unrelated questions: what are these two messages? Where do the magic string of "Cmdctl" or "TestModule?" come from? Maybe the answer can be found if I read the other tests above this part (I didn't), but unless these preceding tests are prerequisite of the added tests, I'd rather be free from the duty of understanding it. Also related, test_handle_msg() now looks too long and is generally difficult to understand when something is added. I'd suggest separating the message handling tests into independent sub tests. I suspect the handling of "stopping" can be a separate test case, and if so, it will much easier to understand.

cmdctl.py.in

  • While the comment says: "If it is None, delete the module if it is known (and ignore otherwise)", it seems that None will be set in the "otherwise" case. And, is it okay to ignore it anyway? If this is a bogus input I think we should rather return an error, log it, or both. Or, if it's an okay value, the comment should be updated.

cmdctl_test.py

  • I think it should also test the case when arg[1] is NULL and args[0] is not in modules_spec. See also the comment on cmdctl.py.

bind10_src.py

  • Is "rest of the system" correct (or appropriate)?
            # If ccsession is still there, inform rest of the system this module
            # is stopping.
    
    I guess it's only delivered to cfgmgr and cmdctl (and perhaps bindctl via cmdctl).
  • Is it possible that self.ccs is None?
            if self.ccs is not None:
                self.ccs.send_stopping()
    

ddns_test.py

  • Why doesn't this use MockModuleCCSession?

Others

  • stats and xfrout don't seem to be tested.

comment:11 in reply to: ↑ 9 Changed 8 years ago by jinmei

Replying to jelte:

First off, it was not clear which part of the original report was
solved. I see the potential that this set of changes can make bindctl
more responsive in the reported scenario, but was it identified why
cfgmgr was hanging (or did it really hang)? And does this branch
solve that problem? What about the traceback output? (BTW I
understand this branch cannot solve the issue of configuring inactive
component).

Oh, sorry for not being clear; that ConfigManager? timeout was actually closely related; the timeout values for communication from bindctl to configmanager and from configmanager to any other (in this case non-running) module are the same; so the 'configmanager is not responding' was not because configmanager was hanging, but because it was simply waiting for a timeout itself. So this would address the root cause, but not that specific problem (we probably need to make cmdctl->cfgmgr timeouts a bit higher than the rest).

Okay, I actually suspected it was not hanging, just a long timeout.
What about the traceback?

ccsession.cc

  • Not a big deal, but sendStopping() is a private method, not very big, and used only within the ModuleCCSession destructor. Is it worth a separate method?

I personally prefer it in a separate method;

That's fine.

  • logging itself could cause an exception, so the destructor could still throw. However, this is not the only point that has this problem, so maybe we can leave it open for now...

hmm, imo we should either make logging calls exception-free or add exception-free ones, since it is usually about the only thing you can do in a destructor.

Yeah, I made a ticket before, but it didn't seem to be very easy.

  • In our convention we'd explicitly add 'virtual' (for clarity) to the destructor declaration.

actually, it's even needed for technical reasons for classes that are subclassed, but this one isn't supposed to be subclassed afaik I realize we can't stop people from doing it, but personally i tend to look at the virtualness of the destructor to see if it should be subclassed in the first place. Of course, one could argue that it would be better to add it when it's not necessary than to not have it when it is, so if you insist i'll make it virtual.

I meant this is because it has a base class and the destructor is
derived from it, not because this class could be derived from others.
I thought it was in the coding guideline, but I couldn't find it...

cfgmgr.py

  • if _handle_module_stopping is private, should it better be prefixed with __? Same for _send_module_spec_to_cmdctl.

as you can see from the rest of the code here, i tend to only use the one when I don't think the class/method name mangling is necessary. Should we start doing this as a principle? (and if so, should we update all 'private' methods?)

I think it's a good practice, but not a strong opinion. In Python
it's not fully enforcible anyway.

  • Not directly related to this ticket, but the if-else in handle_msg is getting too big to understand/manage. I guess it's time to consider refactoring, maybe in some OO way.

depends on how; as it is now, it is indeed getting long, but is essentially a switch. I'm open to proposals, but some 'smart' solutions might actually be more complicated :p

It's probably related to extract common initial patterns of Python
apps. In any case it'd be beyond the scope of this ticket.

ccsession_test.py

  • test_stop: not necessarily a problem, but it seems the same pattern repeats in multiple tests. It would be better to unify the common sequence of tests for conciseness.

Yes, actually, more generally, this is partly why I was looking at the callcounter thingy a few nights ago; [...]

I'm not sure if we are talking about the same thing...what I meant is
something rather simple: repeating something like this:

        fake_session = FakeModuleCCSession()
        self.assertFalse("Spec1" in fake_session.subscriptions)
        mccs = self.create_session("spec1.spec", None, None, fake_session)
        self.assertTrue("Spec1" in fake_session.subscriptions)

but anyway, I don't request it be solved within this ticket. The
branch is already quite large, so it's better to complete it only with
focused and relevant changes.

One other comment on the revised diff:

(lettuce)bind10_control.py

  • run_bindctl: I'd set the default for cmdctl_port to None to avoid duplicate hardcoding of 47805.

comment:12 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:13 Changed 8 years ago by jelte

  • Owner changed from vorner to jelte

comment:14 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to jinmei

Answering both your comment fields as one, in an attempt to cope with the potentially confusing non-threaded trac comment view :)

Replying to jinmei:

First, comments on the rest of the original branch (before you updated
it based on the initial comments).

some higher level points

The relationship between bindctl and module CC session is complicated
and difficult to understand. For example, it took some time for me to
understand the interaction of UIModuleCCSession and bindctl - these
two (defined in separate modules) are tightly coupled and work based
on some implicit assumption of each other's features. So, to
understand changes to the module CC session I needed to look into
bindctl. Is it only me, or is this a known issue and kind of a longer
term TODO? In any case, however, I'm not requesting it be resolved
within this ticket.

It isn't an explicit TODO so far, but it grew this way partly because of the way bindctl was originally written and how it changed over time. The idea should be that bindctl is just one tool that can use 'the API' (UIModuleCCSession) to control everything (hence a lot of functionality does not reside in bindctl itself), but I readily admit that it may have gotten quite muddled over time. Also see next point.

Maybe unrelated to this ticket but: does bindctl try to get the full
latest spec and config data of all running modules on accepting every
command? If so, is the overhead acceptable?

Yes, it (currently) does; IIRC this got put in by Likun when he made bindctl a stateless http server, and we found that on the one hand you want to modify things locally (i.e. you need to be able change multiple settings and submit them as one), but do want things like tab-completion, for which you need at least the full specs and config. Rather than inventing a 'given your previous state, these are the changes' protocol he opted for a full fetch of the remote data on every command, since that was easiest at the time. As long as it's run locally and we don't have kilobytes of data to send, I think the overhead is acceptable. There are a few easy targets for improvements; we could consider having a very simple 'anything changed?' request/response prior to requesting all data, and we could make the actual data one big request (instead of a small one for each module). But I must say I have not yet though much about doing it completely differently (we could for instance go the other way completely and make it more like AJAX websites; putting all interactive knowledge on the server side in a session and making bindctl a much more basic tool).

When the application doesn't die gracefully, sending the stop message
can be skipped. Is it a TODO thing when it supports some kind of
RAII? For that matter, the whatever successor of msgq may make it
possible for cfgmgr to (easily) detect it when a specific module is
disconnected. If so, we may not have to have a "send stopping"
primitive, either directly or indirectly.

Yes, this is what I meant by making ModuleCCSession a Context; the way to do RAII in python is different from C++, since the often implicit reference counting messes up much of the reliability of del(), so what you do is make your objects implement the Context interface; give it a method enter() and exit(), and use your object with the with-statement. So what I had in mind would look something like

    with ModuleCCSession(args) as mccs:
        main_proccessing_loop()

But since this requires some fundamental changes in nearly all modules, I deemed it out of scope for this ticket itself, but have not made a new ticket yet (want to work out a bit of an approach for that first).

ccsession.py

I've come up with additional comments on this file while reading other
part of the branch.

  • request_specifications: it would be helpful if it had more detailed description, including it clears the previous data. It would also be helpful how this method (via update_specs_and_config()) is expected to be used - and, for that matter, whether request_specification() is expected to be public or essentially private (in the latter case adding __ will clarify the intent a bit clearer, although this will make some update to the test code).
  • request_specifications: not directly related to this ticket, but I suspect we can remove or at least revise the comment.

updated the docs a bit (and moved the update function to below the two it calls). In short, the update one is a convenience function, since most of the time you need them both. There are exceptions (mainly in the tests, at this moment, but if we change the way we do the updates to something more efficient, this may or may not be the case for bindctl too).

ccsession_mock.py

  • I'd like to have some pydoc description for MockModuleCCSession.

ok

cfgmgr_test.py

  • This part is a bit confusing:
            # drop the two messages for now
            self.assertEqual(len(self.fake_session.message_queue), 2)
            self.fake_session.get_message("Cmdctl", None)
            self.fake_session.get_message("TestModule", None)
    
            self.assertEqual(len(self.fake_session.message_queue), 0)
    
    It's probably for clearing the faked message queue for the convenience of subsequent tests, but this part itself raises some unrelated questions: what are these two messages? Where do the magic string of "Cmdctl" or "TestModule?" come from? Maybe the answer can be found if I read the other tests above this part (I didn't), but unless these preceding tests are prerequisite of the added tests, I'd rather be free from the duty of understanding it. Also related, test_handle_msg() now looks too long and is generally difficult to understand when something is added. I'd suggest separating the message handling tests into independent sub tests. I suspect the handling of "stopping" can be a separate test case, and if so, it will much easier to understand.

Refactored the entire test, split it into four separate ones (and redid a bit of the internals)

cmdctl.py.in

  • While the comment says: "If it is None, delete the module if it is known (and ignore otherwise)", it seems that None will be set in the "otherwise" case. And, is it okay to ignore it anyway? If this is a bogus input I think we should rather return an error, log it, or both. Or, if it's an okay value, the comment should be updated.

i originally opted for lenience, but returning an error is probably a good idea (though i'm not sure what we should do with that error). Anyway the code itself was wrong to put in that none, so added a test for that as well, and it now sends back an error

cmdctl_test.py

  • I think it should also test the case when arg[1] is NULL and args[0] is not in modules_spec. See also the comment on cmdctl.py.

ah, yes, just did :)

bind10_src.py

  • Is "rest of the system" correct (or appropriate)?
            # If ccsession is still there, inform rest of the system this module
            # is stopping.
    
    I guess it's only delivered to cfgmgr and cmdctl (and perhaps bindctl via cmdctl).

I used 'system' here because the fact that this message goes to the ConfigManager? channel is kind of an implementation detail for ModuleCCSession (and if we want to be pedantic, this is just a channel that cfgmgr happens to listen on, its name being the same as the module name by convention).

  • Is it possible that self.ccs is None?
            if self.ccs is not None:
                self.ccs.send_stopping()
    

hmm no maybe i was overly conservative there, removed the check

ddns_test.py

  • Why doesn't this use MockModuleCCSession?

these tests use much more details about the CCSession than the rest, and tbh i figured if we're gonna seriously refactor those parts of the tests anyway, we could better do it then (for starters, mockmoduleccsession would need to be a subclass of configdata for these tests, and so it needs specfiles for input, etc.). So I chose to defer that part, and left the MyCCSession in here.

Others

  • stats and xfrout don't seem to be tested.

i knew i missed some :p

Or actually just one; in the first commit after the review I've added the b10-stats test.

Xfrout went into a separate commit; the main Xfrout code wasn't tested at all, and there is a problem with the interaction between XfroutServer?.shutdown() and threads in unit tests. I've modified the shutdown code to only wait for other threads when the was a UnixSocketServer?, but simply go on and not join() them if there wasn't. I'm not entirely sure whether this is correct, so i've put it in a separate commit for now.

(next points will go into a third commit)

Replying to jinmei:

Replying to jelte:

Oh, sorry for not being clear; that ConfigManager? timeout was actually closely related; the timeout values for communication from bindctl to configmanager and from configmanager to any other (in this case non-running) module are the same; so the 'configmanager is not responding' was not because configmanager was hanging, but because it was simply waiting for a timeout itself. So this would address the root cause, but not that specific problem (we probably need to make cmdctl->cfgmgr timeouts a bit higher than the rest).

Okay, I actually suspected it was not hanging, just a long timeout.
What about the traceback?

I could not reproduce it at this time, and if it is an existing bug I consider it a separate problem.

  • logging itself could cause an exception, so the destructor could still throw. However, this is not the only point that has this problem, so maybe we can leave it open for now...

hmm, imo we should either make logging calls exception-free or add exception-free ones, since it is usually about the only thing you can do in a destructor.

Yeah, I made a ticket before, but it didn't seem to be very easy.

ok, added a catchall there, but we should probably 'try again' :)

  • In our convention we'd explicitly add 'virtual' (for clarity) to the destructor declaration.

actually, it's even needed for technical reasons for classes that are subclassed, but this one isn't supposed to be subclassed afaik I realize we can't stop people from doing it, but personally i tend to look at the virtualness of the destructor to see if it should be subclassed in the first place. Of course, one could argue that it would be better to add it when it's not necessary than to not have it when it is, so if you insist i'll make it virtual.

I meant this is because it has a base class and the destructor is
derived from it, not because this class could be derived from others.
I thought it was in the coding guideline, but I couldn't find it...

oh, ok. Can't find it either, but do recall some discussion about this. OK, made it virtual.

cfgmgr.py

  • if _handle_module_stopping is private, should it better be prefixed with __? Same for _send_module_spec_to_cmdctl.

as you can see from the rest of the code here, i tend to only use the one when I don't think the class/method name mangling is necessary. Should we start doing this as a principle? (and if so, should we update all 'private' methods?)

I think it's a good practice, but not a strong opinion. In Python
it's not fully enforcible anyway.

ok, changed, there were a number of tests that actually used them, but they were easily converted to use the general public method

  • Not directly related to this ticket, but the if-else in handle_msg is getting too big to understand/manage. I guess it's time to consider refactoring, maybe in some OO way.

depends on how; as it is now, it is indeed getting long, but is essentially a switch. I'm open to proposals, but some 'smart' solutions might actually be more complicated :p

It's probably related to extract common initial patterns of Python
apps. In any case it'd be beyond the scope of this ticket.

noted

ccsession_test.py

  • test_stop: not necessarily a problem, but it seems the same pattern repeats in multiple tests. It would be better to unify the common sequence of tests for conciseness.

Yes, actually, more generally, this is partly why I was looking at the callcounter thingy a few nights ago; [...]

I'm not sure if we are talking about the same thing...what I meant is
something rather simple: repeating something like this:

        fake_session = FakeModuleCCSession()
        self.assertFalse("Spec1" in fake_session.subscriptions)
        mccs = self.create_session("spec1.spec", None, None, fake_session)
        self.assertTrue("Spec1" in fake_session.subscriptions)

but anyway, I don't request it be solved within this ticket. The
branch is already quite large, so it's better to complete it only with
focused and relevant changes.

ah, right, noted again :)

One other comment on the revised diff:

(lettuce)bind10_control.py

  • run_bindctl: I'd set the default for cmdctl_port to None to avoid duplicate hardcoding of 47805.

ack, done

comment:15 in reply to: ↑ 14 ; follow-up: Changed 8 years ago by jinmei

Replying to jelte:

(let's leave bigger issues open for now. They are basically well
beyond the scope of this ticket. Beside, we've already spent too much
time on this)

cfgmgr_test.py

[...]

Refactored the entire test, split it into four separate ones (and redid a bit of the internals)

It generally looks good, but see below for some specific.

Others

  • stats and xfrout don't seem to be tested.

i knew i missed some :p

Or actually just one; in the first commit after the review I've added the b10-stats test.

Xfrout went into a separate commit; [...] I'm not entirely sure whether this is correct, so i've put it in a separate commit for now.

I have a suggestion, see below.

Okay, I actually suspected it was not hanging, just a long timeout.
What about the traceback?

I could not reproduce it at this time, and if it is an existing bug I consider it a separate problem.

Okay.

Yeah, I made a ticket before, but it didn't seem to be very easy.

ok, added a catchall there, but we should probably 'try again' :)

I'd basically leave it to you, but see below.

ccsession.py

  • I realized I forgot to mention one thing I had noticed before (sigh): is this comment still valid?
            # this step should be unnecessary but is the current way cmdctl returns stuff
            # so changes are needed there to make this clean (we need a command to simply get the
            # full specs for everything, including commands etc, not separate gets for that)
    
    at least there are no "separate gets" here now.
  • what's the purpose of the change in commit?
    -                self.request_current_config()
    +                self.update_specs_and_config()
    
    Was this a necessary change but forgotten, or is it an effect of more recent changes in the branch? Also, is this necessary? If update_spec_and_config() is performed for even every input, this seems to be a redundant update. Maybe it's related to the higher level design issue on the relationship between ccsession and bindctl.

cmdctl.py

  • maybe we should revise the comment "and ignore otherwise" too?

xfrout

  • the change to xfrout.py doesn't seem to be correct. Even if it happens to be safe it looks quite fragile (and at the very least it's not clear why the thread related code is in that 'if'). IMO it's basically another instance that the xfrout design is so spaghetti and hard to extend (it also too much relies on threads, making the code rather non-understandable)...but even if so that will be another topic. For now I'd work around it in some other way, like this:
            if self._unix_socket_server:
                self._unix_socket_server.shutdown()
            self._wait_for_threads()
    
        def _wait_for_threads(self):
            # Wait for all threads to terminate.  This is a dedicated subroutine
            # of shutdown(), but is extracted into the separate thread so we
            # can test shutdown without involving thread operation (the test would
            # override this method)
            main_thread = threading.currentThread()
            for th in threading.enumerate():
                if th is main_thread:
                    continue
                th.join()
    
    and do this in the test:
    class MyXfroutServer(XfroutServer):
        def __init__(self):
            ....
            self._wait_for_threads = lambda : None
    

ccession.cc

  • Hmm, since this problem is more generic, I'd keep the code simpler for now, rather than introducing the additional try-catch encapsulation.

cfgmgr_test.py

  • The refactoring generally looks good and correct, but too substantial to be sure if they preserve the previous semantics. In particular, I was not sure if test_handle_msg_update_config() was okay.
  • test_stopping_message: what's different between the last two chunks of tests? ('but if...' and 'If the module...')
Last edited 8 years ago by jinmei (previous) (diff)

comment:16 Changed 8 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:17 in reply to: ↑ 15 ; follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to jinmei

Replying to jinmei:

Replying to jelte:

(let's leave bigger issues open for now. They are basically well
beyond the scope of this ticket. Beside, we've already spent too much
time on this)

ack

Others

  • stats and xfrout don't seem to be tested.

i knew i missed some :p

Or actually just one; in the first commit after the review I've added the b10-stats test.

Xfrout went into a separate commit; [...] I'm not entirely sure whether this is correct, so i've put it in a separate commit for now.

I have a suggestion, see below.

ccsession.py

  • I realized I forgot to mention one thing I had noticed before (sigh): is this comment still valid?
            # this step should be unnecessary but is the current way cmdctl returns stuff
            # so changes are needed there to make this clean (we need a command to simply get the
            # full specs for everything, including commands etc, not separate gets for that)
    
    at least there are no "separate gets" here now.

oh, indeed, somehow i skipped over that too (i though i had removed that already)

  • what's the purpose of the change in commit?
    -                self.request_current_config()
    +                self.update_specs_and_config()
    
    Was this a necessary change but forgotten, or is it an effect of more recent changes in the branch? Also, is this necessary? If update_spec_and_config() is performed for even every input, this seems to be a redundant update. Maybe it's related to the higher level design issue on the relationship between ccsession and bindctl.

You are right, it shouldn't be necessary in the first place. Removed it.

cmdctl.py

  • maybe we should revise the comment "and ignore otherwise" too?

Ack, removed.

xfrout

  • the change to xfrout.py doesn't seem to be correct. Even if it happens to be safe it looks quite fragile (and at the very least it's not clear why the thread related code is in that 'if'). IMO it's basically another instance that the xfrout design is so spaghetti and hard to extend (it also too much relies on threads, making the code rather non-understandable)...but even if so that will be another topic. For now I'd work around it in some other way, like this:
            if self._unix_socket_server:
                self._unix_socket_server.shutdown()
            self._wait_for_threads()
    
        def _wait_for_threads(self):
            # Wait for all threads to terminate.  This is a dedicated subroutine
            # of shutdown(), but is extracted into the separate thread so we
            # can test shutdown without involving thread operation (the test would
            # override this method)
            main_thread = threading.currentThread()
            for th in threading.enumerate():
                if th is main_thread:
                    continue
                th.join()
    
    and do this in the test:
    class MyXfroutServer(XfroutServer):
        def __init__(self):
            ....
            self._wait_for_threads = lambda : None
    

Oh hey, cute :) much better, thanks for the suggestion, I put it in

ccession.cc

  • Hmm, since this problem is more generic, I'd keep the code simpler for now, rather than introducing the additional try-catch encapsulation.

so, er, revert to the original? done.

cfgmgr_test.py

  • The refactoring generally looks good and correct, but too substantial to be sure if they preserve the previous semantics. In particular, I was not sure if test_handle_msg_update_config() was okay.

I ran it through the coverage tool, and all code still seems to be covered at least. Made one more modification, to hit another path.

  • test_stopping_message: what's different between the last two chunks of tests? ('but if...' and 'If the module...')

none, somehow i added the same test twice, removed the second.

comment:18 in reply to: ↑ 17 Changed 8 years ago by jinmei

It now looks okay. Please feel free to merge.

comment:19 Changed 8 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:20 Changed 8 years ago by jinmei

  • Total Hours changed from 0 to 8.67

comment:21 Changed 8 years ago by jelte

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 8.67 to 20.67

Thanks! Merged (with the additional race condition issue in a test I found and we discussed on jabber, and which resulted in the addition of ticket #1668), closing this ticket. Finally :p

Note: See TracTickets for help on using tickets.