Opened 9 years ago

Closed 9 years ago

#296 closed enhancement (fixed)

Timeouts on msgq's cc-channel

Reported by: jelte Owned by: jelte
Priority: medium Milestone: y2 6 month milestone
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

Sometimes modules simply hang waiting for a blocking read forever, mostly because they expect some answer from another module that either crashed or is looping. Either way this is a programmatic error, but other modules should at least be able to recover from this.

So we need timeouts in or below the recvmsg() calls (both in c++ and python).

And of course the callers should be able to handle whatever exception is raised.

Subtickets

Attachments (1)

test.diff (1.5 KB) - added by jinmei 9 years ago.

Download all attachments as: .zip

Change History (22)

comment:1 Changed 9 years ago by jelte

r2632 contains a start (timer+async_read instead of direct blocking read). The external API has not been changed so you can change/disable the default timeout (currently 5 seconds), but I'd like someone to take a look at this first.

comment:2 follow-up: Changed 9 years ago by jinmei

Some random comments:

  • overall it seems to work as intended, but I'd confirm it works even if there's another outstanding event and it's completed before this async_read() is done.
  • I hope this type of trick is a short term workaround. I guess we should eventually make these asynchronous.
  • if I remember correctly there's a version of async_read where you don't have to specify "transfer_at_least".
  • is it okay to ignore other errors than operation_aborted in async_read()? it's also not clear what this code tries to do by explicitly excluding operation_aborted.
  • there are some style guide violations.
  • I thought the type of the error_code arg in asio callbacks is const asio::error_code&.
  • the policy of whether to use the asio namespace doesn't seem to be consistent. (and I'm not sure why we don't need to add it for "async_read"...)
  • this seems to be an incomplete what() message:
                isc_throw(SessionTimeout, "Timeout or error on ");
    

comment:3 in reply to: ↑ 2 ; follow-up: Changed 9 years ago by jelte

Replying to jinmei:

Some random comments:

thanks :)

  • overall it seems to work as intended, but I'd confirm it works even if there's another outstanding event and it's completed before this async_read() is done.

and also block until such is the case?

  • I hope this type of trick is a short term workaround. I guess we should eventually make these asynchronous.

I think it'll require quite a bit of redesign of the msgq interface, as well as the way it is used (which i suspect is the reason parts of it now do an 'empty' async read followed by a sync read if that one is triggers)

But yes, we should consider it.

  • if I remember correctly there's a version of async_read where you don't have to specify "transfer_at_least".

Oh yes, I think it should also work without that. Removing.

  • is it okay to ignore other errors than operation_aborted in async_read()? it's also not clear what this code tries to do by explicitly excluding operation_aborted.

No the 'real' errors aren't handled right now; the operation_aborted is an 'error' value that is given when cancel() is called, in which case the 'i am done' boolean isn't set here. Perhaps we should not use a bool but a return code directly.

  • there are some style guide violations.
  • I thought the type of the error_code arg in asio callbacks is const asio::error_code&.
  • the policy of whether to use the asio namespace doesn't seem to be consistent. (and I'm not sure why we don't need to add it for "async_read"...)
  • this seems to be an incomplete what() message:
                isc_throw(SessionTimeout, "Timeout or error on ");
    

Ack. Fixed these (well, I just put asio:: before the async_read()s)

r2634

comment:4 in reply to: ↑ 3 Changed 9 years ago by jinmei

Not sure if we are now in the formal review cycle:-) Would you like it?

If this is a formal review I'd request usual check list:

  • add test please
  • provide changelog entry please
  • and perhaps more

Replying to jelte:

  • overall it seems to work as intended, but I'd confirm it works even if there's another outstanding event and it's completed before this async_read() is done.

and also block until such is the case?

Not fully thought about what should be specifically tested, but yes, probably.

I think it'll require quite a bit of redesign of the msgq interface, as well as the way it is used (which i suspect is the reason parts of it now do an 'empty' async read followed by a sync read if that one is triggers)

But yes, we should consider it.

A middle term solution might be to extend our ASIO wrapper to support this mode and have other modules use it. In any case, that's beyond the scope of this ticket.

  • is it okay to ignore other errors than operation_aborted in async_read()? it's also not clear what this code tries to do by explicitly excluding operation_aborted.

No the 'real' errors aren't handled right now; the operation_aborted is an 'error' value that is given when cancel() is called, in which case the 'i am done' boolean isn't set here. Perhaps we should not use a bool but a return code directly.

The 'real' errors don't seem to be handled anywhere...and yes, we should probably pass the error code itself back to the caller.

Also, as commented this code logic is not so trivial. Some comments will help.

  • I thought the type of the error_code arg in asio callbacks is const asio::error_code&.

It still seems to lack '&'.

And one additional comment:

  • personally, I'd avoid heavily relying on the default parameter. since it's currently used in a (essentially) private function to session.cc and all usage uses the default value, I'd rather define a class constant SessionImpl? and use that value in readData(). Please also describe the rationale (if any) of the magic number of 5.

comment:5 Changed 9 years ago by jelte

Just a note: this is not ready for full review yet; as you already noted, error handling is not complete, and neither is testing. And we need a similar addition to the python part.

But I did want someone to look at it to see if i didn't abuse asio too much :)

comment:6 follow-up: Changed 9 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from new to reviewing

Ok, i think it should be about ready now;

I've added error handling (raises a SessionError? containing the lowlevel error message), and added tests. There is an interface to set timeouts now; but rather than add an optional argument to everything that might send a message, you can set it on the session object itself, in which case the value will be used for any blocking read.

I also added timeouts to the python version, though in this case no exception is raised (but the answer and envelope are None). Should we have an exception here too (ie if (blocking and timeout and no answer) raise TimeoutError?)?

comment:7 Changed 9 years ago by jinmei

  • Owner changed from UnAssigned to jinmei

comment:8 in reply to: ↑ 6 Changed 9 years ago by jinmei

Replying to jelte:

Ok, i think it should be about ready now;

I also added timeouts to the python version, though in this case no exception is raised (but the answer and envelope are None). Should we have an exception here too (ie if (blocking and timeout and no answer) raise TimeoutError?)?

I wondered this point when I first saw the code (without checking this comment closely). It would depend on whether we want to separate timeouts from other cases where an empty answer is returned. I'm not so sure which is the case in this scenario, but in general I guess we want to know it's a timeout if it is.

One substantial point in the patch, if I understand it correctly:

In session.cc, it seems possible that SessionImpl::setResult can be called after SessionImpl::readData() returns. If that happens invalid stack positions will be overritten.

One easy way to avoid that scenario is to make xxx_result and xxx_code class member variables. There's still a risk that setResult() could be called after the SessionImpl object is destructed, but I guess it's not specific to the session class and can be considered a separate issue (including whether we need to address it).

Other comments (below) are basically minor.

common

  • What's the rationale of the magic number of 4 seconds? Please describe. I'd also avoid hardcoding the magic number in the code logic: I'd define a constant variable with the rationale description and use that variable in other part of the code. I'd also centralize the definition of the default timeout value for python and C++, if it can be done easily.
  • I wish we could test the case where we specify 'no timeout', but I understand it's very difficult if not impossible.

session.cc

  • SessionImpl::setTimeout()/getTimeout() don't have to be (and so shouldn't even be) virtual

session.h

  • I'd make (Abstract)Session::getTimeout() const member function.
  • I'd move the generacl description of set/getTimeout() to the abstract base class. Also, the description of "defaults to 4000" for setTimeout() doesn't seem to be very correct (or it seems to be ambiguous) in that 4000 is not the default parameter of this function. I'd add a description to the (concrete) Session about the timeout behavior, and explain the default timeout value there. See also the comment about hardcoding the magic number (above).
  • according the latest yet-undocumented coding guideline there should be a blank line before "\brief"
  • I've noticed some minor editorial/coding style issues and directly made suggested changes to the branch (r2696).

session_unittest.cc

  • why explicitly specifying the global namespace for unlink? It's not incorrect, but not consistent with how we specify standard C libraries in other part of the code...hmm, maybe it's because it conflicts with some name in the asio name space?
  • it's a matter of taste, but I'd use a fixture (TEST_F) for tests that have a common setup phase (like unlinking the file) to avoid having duplicate setup code.
  • it's not actually specific to this patch, but I'd avod using int (or size_t in session.cc), e.g.:
            const unsigned int length_net = htonl(length);
    
    in fact, in this specific case it's actually a big buggy because the size of 'int' may be smaller than 32 bits. We should use uint32_t, uint16_t, etc. (but this can go to a separate ticket).
  • the in-function block in connect_ok_connection_reset is ugly and obscures the intention (even with the comment). I'd add a close() method to TestDomainSocket and call it explicitly.

msgq.py.in

  • is it related to the timeout hack, or an independent change? if so, maybe some comments may help.

comment:9 Changed 9 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:10 follow-up: Changed 9 years ago by jelte

  • Owner changed from jelte to jinmei

Adressed all the minor issues in r2699;

(note; for the unlink in test I made a SessionTest? class; which contains a pointer to the TestDomainSocket? class; this is because we need to unlink before we initialize it, so it can't be done with a member initializer. I briefly entertained to make a 'reverse file holder' class, that is empty except for a constructor that calls unlink(), but that was too uglyfor me :))

About the namespace, we do use it somewhere else; in the class inheritance definition of the classes we use in TEST_F. But i've removed the one from unlink. (second one also removed in r2701)

Lastly, the int size thing; yes you are right, I could add it to this one (but have not yet), but we could also make it another ticket (not sure if there are other places where this comes up)

Oh and i found another problem in cfgmgr.py; with the current way the timeouts are handled, it should not exit if group_recvmsg() returns None (r2700)

About the setResult function. I don't think it can actually happen in practice, since either the timer expires, in which case the async_read() is canceled, or async_read returns an error, in which case the timer is canceled, so after this method exits there should be no pending actions that result in callbacks.

However, if we are uncomfortable with it; i did commit a change that makes it use member variables (and two setResult functions instead of 1); r2702. Though i'm actually not at all sure whether this is better.

comment:11 in reply to: ↑ 10 Changed 9 years ago by jinmei

Replying to jelte:

About the setResult function. I don't think it can actually happen in practice, since either the timer expires, in which case the async_read() is canceled, or async_read returns an error, in which case the timer is canceled, so after this method exits there should be no pending actions that result in callbacks.

Depending on "in practice", I believe it can happen. At least according to the ASIO documentation, the cancel method doesn't remove the event but trigger an immediate event with an error.

"In practice", we'll basically terminate the process in our usage, but if we accept this condition and continue the io_service, the outstanding event will be invoked after exiting from setResult().

I've confirmed this with a simple experimental patch (attached). If you apply it and rebuild everything, and:

  • start b10-msgq
  • start b10-auth

you'll see something like this:

[b10-auth] Server created.
[b10-auth] IOService created.
[b10-auth] Configuration session channel created.
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe3d0:Undefined error: 0
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe3c0:Undefined error: 0
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe590:Undefined error: 0
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe580:Undefined error: 0
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe590:Undefined error: 0
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe580:asio error
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe470:Undefined error: 0
[b10-auth] Server started.
SessionImpl::setResult called for 0x100504480, result_code at 0x7fff5fbfe480:asio error

The first "asio error" should be corresponding to the timeout.

The second "asio error" should be a cancel event triggered in SessionImpl::readData(). Note that this event is handled after b10-auth starts the main io_service loop.

Changed 9 years ago by jinmei

comment:12 Changed 9 years ago by jelte

r2716 contains an approach based on what we discussed in jabber, with some other cleanups as well.

comment:13 follow-up: Changed 9 years ago by jinmei

About r2699

  • I'd avoid using a macro. I'd define it as either a namespace or class constant, e.g.:
    class SessionImpl {
    private:
        static const size_t MSGQ_DEFAULT_TIMEOUT = 4000;
    ...
    };
    
  • isn't it possible to move setup for 'sess' to the fixture?
        Session sess(my_io_service);
    

About r2716

  • I have some suggested editorial changes (committed to the branch)
  • setResult() should better be in an anonymous namespace.
  • I'd remove the hardcoded value from:
        // timeout for blocking reads (in seconds, defaults to 4000)
    
  • is it okay to do io_service().reset()? According to the ASIO reference "this function must not be called while there are any unfinished calls". I'm afraid we cannot assume such a thing in general.
  • I don't understand why we need to check getTimeout() != 0 here:
                if (read_result && getTimeout() != 0) {
    
  • The following line isn't necessarily incorrect but looks a bit awkward:
            if (read_result->value() != 0) {
    
    why not this? This seems to be more consistent with the other usage of asio::error_code in this code:
            if (*read_result) {
    

Other changes look okay.

comment:14 Changed 9 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:15 in reply to: ↑ 13 ; follow-up: Changed 9 years ago by jelte

  • Owner changed from jelte to jinmei

Replying to jinmei:

About r2699

  • I'd avoid using a macro. I'd define it as either a namespace or class constant, e.g.:
    class SessionImpl {
    private:
        static const size_t MSGQ_DEFAULT_TIMEOUT = 4000;
    ...
    };
    

ack

  • isn't it possible to move setup for 'sess' to the fixture?
        Session sess(my_io_service);
    

yes, it 'feels' a bit weird though, since in essence we're now setting up the session before the 'host' testdomainsocket. But since establish() is called manually anyway this shouldn't be a problem. Updated.

About r2716

  • I have some suggested editorial changes (committed to the branch)

thanks

  • setResult() should better be in an anonymous namespace.

done

  • I'd remove the hardcoded value from:
        // timeout for blocking reads (in seconds, defaults to 4000)
    

done

  • is it okay to do io_service().reset()? According to the ASIO reference "this function must not be called while there are any unfinished calls". I'm afraid we cannot assume such a thing in general.

thing is, we can't assume we don't need it either; if io_service has decided it ran out of work, it'll stop doing anything on run_one(), and our while loop will run forever (this happens in one of the test cases here). The documentation also says that you need to call it before any second or later set of calls to run() or run_one().

  • I don't understand why we need to check getTimeout() != 0 here:
                if (read_result && getTimeout() != 0) {
    

If timeout is set to 0, the timer isn't started, and waiting for it to end would result in an eternal loop (after data has arrived). Added comment.

  • The following line isn't necessarily incorrect but looks a bit awkward:
            if (read_result->value() != 0) {
    
    why not this? This seems to be more consistent with the other usage of asio::error_code in this code:
            if (*read_result) {
    

Personal preference; I don't have a strong opinion, but it is more clear to me as a reader what we are checking for here (success result). Unless i'm wrong in reading the code, there's no asio::error_code::success or something similar, so this would have to do. If you think *read_result directly is more clear, we can change it.

Other changes look okay.

Ok, changes committed in r2735

comment:16 in reply to: ↑ 15 ; follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to jelte

Replying to jelte:

  • is it okay to do io_service().reset()? According to the ASIO reference "this function must not be called while there are any unfinished calls". I'm afraid we cannot assume such a thing in general.

thing is, we can't assume we don't need it either; if io_service has decided it ran out of work, it'll stop doing anything on run_one(), and our while loop will run forever (this happens in one of the test cases here). The documentation also says that you need to call it before any second or later set of calls to run() or run_one().

Hmm, I see, but I'm afraid reset() can acutally "be called while there are unfinished calls to run()" in practice. In fact, if I understand the code correctly, almost all calls to SessionImpl::readData() are performed inside a main io_service loop (i.e run()) in case of b10-auth (exceptions are the very first ones to establish sessions before starting a mail loop).

From a quick look at the implementation of reset() (which I think is in asio/detail/xxx_io_service.hpp), reset() actually doesn't trigger an exception or otherwise indicate an error even if the assumption isn't met. But I'm not so sure if this means this usage is actually safe or it actually breaks something fundamental inside io_serivce which could cause strange failures in rare cases.

One possible workaround would be

  • to set session timeout to 0 (meaning no timeout) in b10-auth just before starting the main loop, and
  • in readData(), skip the reset()->run_one() trick completely, if getTimeout() == 0

And we should eventually integrate it to the asio wrapper so that we can handle this in the same io_service().run context or make everything asynchronous.

  • I don't understand why we need to check getTimeout() != 0 here:
                if (read_result && getTimeout() != 0) {
    

If timeout is set to 0, the timer isn't started, and waiting for it to end would result in an eternal loop (after data has arrived). Added comment.

Hmm, but doesn't timer.cancel() triggers the setResult() event for the timer? Perhaps it's not the case when the timer hasn't started in the first place?

  • The following line isn't necessarily incorrect but looks a bit awkward:
            if (read_result->value() != 0) {
    
    why not this? This seems to be more consistent with the other usage of asio::error_code in this code:
            if (*read_result) {
    

Personal preference; I don't have a strong opinion, but it is more clear to me as a reader what we are checking for here (success result). Unless i'm wrong in reading the code, there's no asio::error_code::success or something similar, so this would have to do. If you think *read_result directly is more clear, we can change it.

To me, the problem of this code is that it implicitly assumes the relationship between the low level value() and the abstract level of asio::error_code, that is, it assumes if value() is non 0 it means an error (not a success).

This is true in the current implementation, and it's quite unlikely to change in future versions, but it's not guaranteed (the reference only says value() returns "the error value"). What if, again unlikely, a future version of ASIO introduces a new "success" error_category to separate different types of successful results? In that case value() would return non 0 values while it can still indicate a success via operator! or the type conversion operator.

I'm not sure if I fully understand "there's no asio::error_code::success or something similar", but if what you mean is a way to know if a particular error_code indicates a success or failure, you can use operator! or the type conversion operator (to bool), and the reference explicitly says so (unlike the semantics of value()):

  • operator unspecified_bool_type: Operator returns non-null if there is a non-success error code.
  • operator!: Operator to test if the error represents success.

In fact, this code explicitly tests if there's a non-success error code using "operator unspecified_bool_type":

        if (*read_result) {

Having said that, these are relatively pedantic argument, and it's not or probably won't be different from the current code usig value() in practice. So, if this is not convincing and you still think value() is a more intuitive representation, please move forward with it. (but please make a comment that the code assumes the implicit relationship between the raw value() and the success/failure semantics).

comment:17 in reply to: ↑ 16 ; follow-up: Changed 9 years ago by jelte

  • Owner changed from jelte to jinmei

Replying to jinmei:

From a quick look at the implementation of reset() (which I think is in asio/detail/xxx_io_service.hpp), reset() actually doesn't trigger an exception or otherwise indicate an error even if the assumption isn't met. But I'm not so sure if this means this usage is actually safe or it actually breaks something fundamental inside io_serivce which could cause strange failures in rare cases.

One possible workaround would be

  • to set session timeout to 0 (meaning no timeout) in b10-auth just before starting the main loop, and
  • in readData(), skip the reset()->run_one() trick completely, if getTimeout() == 0

No, then we wouldn't have timeouts and we'd be back where we started. Unless you are thinking of changing the timeout value for those times we need it (i.e. before every call to group_recvmsg(). But I think I have found another solution to the original problem; there's a 'work' service that simply tells run not to stop. Added this to the applications that do not have a continuous work object (like the tests), and reset() is not needed anymore. r2750.

  • I don't understand why we need to check getTimeout() != 0 here:
                if (read_result && getTimeout() != 0) {
    

If timeout is set to 0, the timer isn't started, and waiting for it to end would result in an eternal loop (after data has arrived). Added comment.

Hmm, but doesn't timer.cancel() triggers the setResult() event for the timer? Perhaps it's not the case when the timer hasn't started in the first place?

Nope, since we haven't added setResult it to the callbacks of the timeout, cancel() would be a nop.

  • operator unspecified_bool_type: Operator returns non-null if there is a non-success error code.
  • operator!: Operator to test if the error represents success.

Right. You have convinced me, and I updated it in r2752 (with a small comment).

Jelte

comment:18 in reply to: ↑ 17 Changed 9 years ago by jinmei

Replying to jelte:

I think the latest branch is okay to merge.

Some minor, non blocking comments follow. These can be ignored.

  • I made a trivial editorial fix in the branch (r2754)
  • As for the idea of the implication of not setting timeout after starting, I was aware of that. But I thought it could be an acceptable workaround in that we can at least notice bootstrap time failures quite soon. Once the whole BIND 10 service starts successfully and then a CC session stalls, I suspect we'll end up kill everything anyway (at least for now). In any event, I see the revised code is better than that.
  • To be very precise, the following comment in r2752...
            // asio::error_code evaluates to false if there was no error
    
    ...should be
            // asio::error_code evaluates to true if there was an error
    
    because the type conversion operator only says the latter. The former comment is also true with the understanding that the negation of "if a boolean value is not true it must be true" and "the negation of there is an error is there is no error", but one needs to run this logic to reach this conclusion (even if it's pretty trivial).

comment:19 Changed 9 years ago by jinmei

Oh, one final thing: I should have asked for a proposed ChangeLog? entry, but I don't think it's not worth the additional time at this stage. So please simply update ChangeLog? without going throw an additional review cycle for it.

comment:20 Changed 9 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:21 Changed 9 years ago by jelte

  • Resolution set to fixed
  • Status changed from reviewing to closed

Thanks for the review and discussion!

Merged to trunk in r2761, changelog updated in r2762, closing ticket.

Note: See TracTickets for help on using tickets.