Opened 9 years ago

Closed 9 years ago

#312 closed defect (fixed)

python msgq timeouts differ from c++ ones

Reported by: jelte Owned by: stephen
Priority: medium Milestone: y2 6 month milestone
Component: ~msgq (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no


In the python msgq/cc api, when a blocking read times out, it results in a (env, msg) = (None, None) result. But the c++ version currently raises an exception. The python API should probably do the same.

This also means catching it in the right places in the modules, if necessary (which is the reason it was left as it is in #296).


Change History (12)

comment:1 Changed 9 years ago by jelte

  • Owner set to jelte
  • Status changed from new to accepted

comment:2 Changed 9 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from accepted to reviewing

I think this is ready for review; branches/trac312

recvmsg(False) now raises an error when no data is read after the default timeout expires, instead of returning (None, None).

Modules and other libraries that call this are updated to correctly handle this change. play around a bit since it has one call that is supposed to block forever until data is read (before that call it sets the timeout to 0 and sets it back to the default if that call has returned)

There is one TODO, about logging. Since adding logging would be the subject of another ticket, I left this TODO in so we are reminded to add a log message at that location.

comment:3 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:4 Changed 9 years ago by vorner

  • Owner changed from vorner to jelte

In src/lib/python/isc/cc/, in the _receive_full_buffer, there are two small problems:

  • On line 138, if a short read happens (for example 2 bytes of the length came in one packet, another 2 in another packet that is still travelling somewhere on the network), it returns None, when it should probably loop and wait for the rest (and either get a timeout waiting for the rest or get the length part eventually).
  • The except: thing is odd. One thing is it catches even ctrl+c, in which case it just returns None and does not terminate. And when something like a real error happens, maybe some exception would be better than return None? However, this was here even before the changes introduced in this branch.

Anyway, if this is interrupted/terminated in a non-standard way, is it always considered unrecoverable for this Session object? Since if the whole message is not read and some of it is left in the socket, next read would consider first 4 bytes of the remaining part as length of the next one.

comment:5 Changed 9 years ago by jelte

  • Owner changed from jelte to vorner

Ok, this should be fixed now.

I've found a few more weird things in the code, and refactored it a bit; i've removed the class variables which weren't really used anyway. I've removed the catch-all, but did add a bit of special case handling for errors and 0 receives since this behaviour is (currently) what the calling modules expect.

See r2906

comment:6 Changed 9 years ago by vorner

  • Owner changed from vorner to jelte

I think it is correct, when it is in blocking mode. However, in nonblocking, I still think there might be a problem. If I'm wrong, please tell me, why this can not happen:

I'm assuming that if None is returned, it means „I do not have any messages from the bus right now, ask later“.

So, asume someone sent a message like this:
But because the network is slow, we got only this, and the rest is somewhere on the way, when we ask if there is something and we do not block:
So, now we read length, which returs whole 4 bytes. Then we read part of the data until the „a“. Because we still do not have enough, we loop again (in the _receive_bytes) and try to get more, but because the socket is nonblocking, it throws socket.error with EAGAIN (meaning no more data now). It is caught in _receive_full_buffer, which then returns None (no message right now), discarding the „length“ and „........da“ we already read.

So someone asks for a message a moment later, so it goes to read the length. The „ta.........“ is on the socket now, so we take part of it as a length, which is wrong. Similar thing would happen, if we asked at the time when we had only part of the length on the socket (which can happen when the two messages were sent together/there was long send buffer and it was in the same packet with preciding message).

I think we need to keep the data we already read and reuse it next time in case of EAGAIN. May I suggest a wrapper around recv that keeps a buffer of read data and when we get the whole message we clean it, when we get EAGAIN, we put position at the front and try getting data from the buffer next time before we start reading from the socket again? Or, if I'm not clear in what I mean, should I write the code and give the review to someone else instead?

There's other thing that seems little bit odd:

if len(new_data) == 0: # server closed connection

if nonblock:

return None


raise ProtocolError?("Read of 0 bytes: connection closed")

Why, when the server closes a connection, it is an error when we have blocking read and is „No message right now, ask later“ when we are nonblocking? (recv returns 0 bytes only when the connection is closed, if there are 0 bytes waiting and we are nonblocking, it throws EAGAIN).

comment:7 Changed 9 years ago by jelte

  • Owner changed from jelte to stephen

No you are right, I forgot about that one (perhaps because we are working on domainsockets here, but that may not be the case forever). I suspect this is what the original code had intended to do as well.

Reintroduced class variables; code now does read-length followed by read-data, and keeps read data until it successfully read everything. If there is an EAGAIN in the meantime, it will keep the data. r2916

Oh and the second case was a leftover from a bad test case, which i updated.

Currently, on other errors, it will drop whatever has been read (while there may be a case where we get a timeout, i think it is more likely that the rest of the data will never come)

Since you are gone this week, i'm reassigning this to Stephen.

comment:8 Changed 9 years ago by stephen

  • Owner changed from stephen to jelte

Branch taken at r2808
Version checked: r2916

Ignoring property changes, files changed were

All these are OK

OK, but two minor points:

a) Where an error message is constructed from the concatenation of two or more strings, e.g.

sys.stderr.write("[b10-xfrout] Error creating xfrout,"
    "is the configuration manager running?")

... there is sometimes a missing space at the join (e.g. the text output by the example above will have no space after the comma). The example listed is one of the changes made in this ticket, but there are a couple of other examples in the file.

b) Some error messages in the file are output via print(), others are output via sys.stderr.write().

comment:9 Changed 9 years ago by jelte

  • Owner changed from jelte to stephen

minor points addressed in r2933

comment:10 Changed 9 years ago by stephen

  • Owner changed from stephen to jelte

All changes look OK - please go ahead and merge.

comment:11 Changed 9 years ago by jelte

  • Owner changed from jelte to stephen

If found another intermodule-problem, sometimes ctrl-c caused the boss process to stop on an exception, without killing all the other modules (there are two that do not shutdown themselves if msgq disappears), r2936 makes it go into shutdown mode on that case. Could you take a look at that one too?

comment:12 Changed 9 years ago by jelte

  • Resolution set to fixed
  • Status changed from reviewing to closed

merged to trunk, r2941.

Note: See TracTickets for help on using tickets.