Opened 7 years ago

Closed 7 years ago

#2617 closed defect (fixed)

MSGQ_RECV_ERR errors on clean shutdown

Reported by: jreed Owned by: jinmei
Priority: medium Milestone: Sprint-20130219
Component: msgq (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Low
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: Discuss (4?) Add Hours to Ticket: 0
Total Hours: 13.36 Internal?: no

Description

I sent a SIGTERM to bind10. This resulted in three ERRORs:

2013-01-08 10:12:05.984 INFO  [b10-boss.boss/3373] BIND10_RECEIVED_SIGNAL received signal SIGTERM
2013-01-08 10:12:05.984 INFO  [b10-boss.boss/3373] BIND10_SHUTDOWN stopping the server
2013-01-08 10:12:05.985 INFO  [b10-boss.boss/3373] BIND10_CONFIGURATOR_STOP bind10 component configurator is shutting down
2013-01-08 10:12:05.985 INFO  [b10-boss.boss/3373] BIND10_COMPONENT_STOP component b10-auth is being stopped
2013-01-08 10:12:05.985 INFO  [b10-boss.boss/3373] BIND10_STOP_PROCESS asking b10-auth to shut down
2013-01-08 10:12:05.986 INFO  [b10-boss.boss/3373] BIND10_COMPONENT_STOP component Socket creator is being stopped
2013-01-08 10:12:05.986 INFO  [b10-boss.boss/3373] BIND10_SOCKCREATOR_TERMINATE terminating socket creator
2013-01-08 10:12:05.986 INFO  [b10-boss.boss/3373] BIND10_COMPONENT_STOP component msgq is being stopped
2013-01-08 10:12:05.986 INFO  [b10-boss.boss/3373] BIND10_COMPONENT_STOP component cfgmgr is being stopped
2013-01-08 10:12:05.986 INFO  [b10-boss.boss/3373] BIND10_STOP_PROCESS asking cfgmgr to shut down
2013-01-08 10:12:05.987 INFO  [b10-auth.auth/3377] AUTH_DATASRC_CLIENTS_BUILDER_STOPPED data source builder thread stopped
2013-01-08 10:12:05.987 ERROR [b10-msgq.msgq/3375] MSGQ_RECV_ERR Error reading from socket 8: EOF
2013-01-08 10:12:05.987 ERROR [b10-msgq.msgq/3375] MSGQ_RECV_ERR Error reading from socket 7: EOF
2013-01-08 10:12:05.991 ERROR [b10-msgq.msgq/3375] MSGQ_RECV_ERR Error reading from socket 5: <class 'socket.error'>
2013-01-08 10:12:06.986 INFO  [b10-boss.boss/3373] BIND10_PROCESS_ENDED process 3374 of Socket creator ended with status 0
2013-01-08 10:12:06.986 INFO  [b10-boss.boss/3373] BIND10_PROCESS_ENDED process 3376 of cfgmgr ended with status 0
2013-01-08 10:12:06.986 INFO  [b10-boss.boss/3373] BIND10_PROCESS_ENDED process 3377 of b10-auth ended with status 0
2013-01-08 10:12:06.986 INFO  [b10-boss.boss/3373] BIND10_SEND_SIGTERM sending SIGTERM to msgq (PID 3375)
2013-01-08 10:12:07.086 INFO  [b10-boss.boss/3373] BIND10_PROCESS_ENDED process 3375 of msgq ended with status 0
2013-01-08 10:12:07.086 INFO  [b10-boss.boss/3373] BIND10_SHUTDOWN_COMPLETE all processes ended, shutdown complete

I also received the errors when doing a "Boss shutdown":

2013-01-08 10:25:41.879 ERROR [b10-msgq.msgq/3704] MSGQ_RECV_ERR Error reading from socket 5: <class 'socket.error'>
2013-01-08 10:25:41.883 ERROR [b10-msgq.msgq/3704] MSGQ_RECV_ERR Error reading from socket 7: EOF
2013-01-08 10:25:41.885 ERROR [b10-msgq.msgq/3704] MSGQ_RECV_ERR Error reading from socket 8: EOF
2013-01-08 10:25:41.885 ERROR [b10-msgq.msgq/3704] MSGQ_RECV_ERR Error reading from socket 9: EOF

The server should not log ERRORs during a clean shutdown; this prompts an administrator to investigate whether something is wrong or corrupted.

Subtickets

Attachments (1)

msgq.diff (588 bytes) - added by jinmei 7 years ago.


Change History (27)

comment:1 Changed 7 years ago by shane

  • Component changed from Unclassified to msgq
  • Defect Severity changed from N/A to Low
  • Milestone changed from New Tasks to Next-Sprint-Proposed

I agree, EOF should not be an error. Bad things happen to good processes. :)

IMHO, this ticket should be about changing the ERROR to DEBUG.

comment:2 Changed 7 years ago by vorner

We can't just change the log level; this log message could potentially report other errors too, and we need to distinguish them somehow.

comment:3 Changed 7 years ago by shane

Okay, makes sense. We can check whether it is EOF and create a new debug message, MSGQ_RECV_EOF, for that case.

We should also print out the details of the socket.error we get, such as the errno and description, when that happens. (I suspect we will also need to log some errnos, such as ECONNRESET or the like, at a debug level.)
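For illustration, a minimal sketch of this idea (not the actual msgq code; the helper and message names here are just placeholders):

import errno
import logging
import socket

logger = logging.getLogger("msgq-sketch")

def read_packet(sock):
    '''Read available data, separating normal closes from real errors.'''
    try:
        data = sock.recv(4096)
    except socket.error as err:
        if err.errno == errno.ECONNRESET:
            # peer reset the connection: report quietly (MSGQ_RECV_EOF-style)
            logger.debug("connection reset on fd %d: [Errno %d] %s",
                         sock.fileno(), err.errno, err.strerror)
            return None
        # other errors keep the ERROR level, now with errno and description
        logger.error("error reading from fd %d: [Errno %d] %s",
                     sock.fileno(), err.errno, err.strerror)
        raise
    if len(data) == 0:
        # 0-length read: the peer closed the socket normally
        logger.debug("EOF on fd %d", sock.fileno())
        return None
    return data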

comment:4 Changed 7 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130205

comment:5 Changed 7 years ago by jinmei

First off, I cannot reproduce it. In some cases I see these messages
on shutdown:

2013-01-29 14:57:23.926 ERROR [b10-msgq.msgq/25245] MSGQ_SEND_ERR Error while sending to socket 9: EPIPE
2013-01-29 14:57:23.926 ERROR [b10-msgq.msgq/25245] MSGQ_READ_UNKNOWN_FD Got read on strange socket 9

but I have never been able to reproduce the errors reported in this ticket.

In any case, according to the code, I suspect these are in fact
"erroneous" events, in that something unexpected is happening within
the system, such as msgq encountering EPIPE or EOF in the middle of
handling a message. It's true that these are things msgq cannot
always control, but they are still a kind of error within the entire
BIND 10 system. I guess what's happening is something like this: one
process sends an "I'm quitting" message to some other process, but
the other process terminates before accepting it. So the right fix
would be to clarify the system-shutdown process and implement it
correctly (terminate the processes in the expected order, guaranteeing
the necessary synchronization). If it doesn't work that way, it is
reasonable for msgq to report such events as an ERROR.

But such a higher-level fix is beyond the scope of this task.
So my suggestion is to keep the error level but clarify these things
in the detailed version of the log descriptions. Even if people don't
read them, that seems a better way to handle the matter than
pretending there's no issue by lowering the log level.
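For what it's worth, a toy sketch (purely illustrative, not BIND 10 code)
of the kind of ordered, synchronized shutdown described above:

def shutdown_components(components, msgq, timeout=5.0):
    # 'components' is assumed to be ordered so that clients of the message
    # bus are stopped, and confirmed gone, before the bus itself.
    for component in components:
        component.request_shutdown()       # e.g. send the "quit" command
        component.wait_for_exit(timeout)   # synchronize before moving on
    msgq.stop()                            # only now stop msgq itself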

comment:6 Changed 7 years ago by jinmei

Jeremy: if you can reproduce your problem, could you apply this patch
and show the error message again? At the very least, this one loses
the information about the cause of the error and is useless:

2013-01-08 10:12:05.991 ERROR [b10-msgq.msgq/3375] MSGQ_RECV_ERR Error reading from socket 5: <class 'socket.error'>

Changed 7 years ago by jinmei

comment:7 Changed 7 years ago by jelte

  • Priority changed from low to medium

comment:8 Changed 7 years ago by jreed

I tested once before I tried the patch (using the same old code from 2013-01-08) and received:

2013-01-30 09:02:23.535 ERROR [b10-msgq.msgq/17635] MSGQ_SEND_ERR Error while sending to socket 8: EPIPE
2013-01-30 09:02:23.535 ERROR [b10-msgq.msgq/17635] MSGQ_SEND_ERR Error while sending to socket 7: EPIPE
2013-01-30 09:02:23.536 ERROR [b10-msgq.msgq/17635] MSGQ_READ_UNKNOWN_FD Got read on strange socket 7
2013-01-30 09:02:23.536 ERROR [b10-msgq.msgq/17635] MSGQ_READ_UNKNOWN_FD Got read on strange socket 8
2013-01-30 09:02:23.536 ERROR [b10-msgq.msgq/17635] MSGQ_RECV_ERR Error reading from socket 11: EOF
2013-01-30 09:02:23.537 ERROR [b10-msgq.msgq/17635] MSGQ_RECV_ERR Error reading from socket 12: EOF
2013-01-30 09:02:23.537 ERROR [b10-msgq.msgq/17635] MSGQ_SEND_ERR Error while sending to socket 9: EPIPE
2013-01-30 09:02:23.537 ERROR [b10-msgq.msgq/17635] MSGQ_READ_UNKNOWN_FD Got read on strange socket 9
2013-01-30 09:02:23.538 ERROR [b10-msgq.msgq/17635] MSGQ_RECV_ERR Error reading from socket 13: EOF
2013-01-30 09:02:23.538 ERROR [b10-msgq.msgq/17635] MSGQ_RECV_ERR Error reading from socket 14: EOF
2013-01-30 09:02:23.538 ERROR [b10-msgq.msgq/17635] MSGQ_SEND_ERR Error while sending to socket 5: EPIPE
2013-01-30 09:02:23.539 ERROR [b10-msgq.msgq/17635] MSGQ_RECV_ERR Error reading from socket 10: EOF

So I manually applied the change (note this is older code, so the patch didn't apply as-is):

2013-01-30 09:08:43.171 INFO  [b10-boss.boss/17666] BIND10_COMPONENT_STOP component msgq is being stopped
...
2013-01-30 09:08:43.173 ERROR [b10-msgq.msgq/17668] MSGQ_RECV_ERR Error reading from socket 11: EOF
2013-01-30 09:08:43.173 ERROR [b10-msgq.msgq/17668] MSGQ_SEND_ERR Error while sending to socket 7: EPIPE
2013-01-30 09:08:43.174 ERROR [b10-msgq.msgq/17668] MSGQ_READ_UNKNOWN_FD Got read on strange socket 7
2013-01-30 09:08:43.174 ERROR [b10-msgq.msgq/17668] MSGQ_SEND_ERR Error while sending to socket 8: EPIPE
2013-01-30 09:08:43.174 ERROR [b10-msgq.msgq/17668] MSGQ_READ_UNKNOWN_FD Got read on strange socket 8
2013-01-30 09:08:43.175 ERROR [b10-msgq.msgq/17668] MSGQ_RECV_ERR Error reading from socket 12: EOF
2013-01-30 09:08:43.175 ERROR [b10-msgq.msgq/17668] MSGQ_SEND_ERR Error while sending to socket 9: EPIPE
2013-01-30 09:08:43.175 ERROR [b10-msgq.msgq/17668] MSGQ_READ_UNKNOWN_FD Got read on strange socket 9
2013-01-30 09:08:43.175 ERROR [b10-msgq.msgq/17668] MSGQ_RECV_ERR Error reading from socket 13: EOF
2013-01-30 09:08:43.176 ERROR [b10-msgq.msgq/17668] MSGQ_RECV_ERR Error reading from socket 14: EOF
2013-01-30 09:08:43.176 ERROR [b10-msgq.msgq/17668] MSGQ_RECV_ERR Error reading from socket 10: EOF
2013-01-30 09:08:43.176 ERROR [b10-msgq.msgq/17668] MSGQ_RECV_ERR Error reading from socket 5: [Errno 104] Connection reset by peer

This is on the CentOS 5.3 system bind10-testing1.lab.

I am able to reproduce the ERRORs every time when sending a TERM signal to boss.

I don't get the same output from the patched code every time, but it is often like:

2013-01-30 09:18:44.010 ERROR [b10-msgq.msgq/17715] MSGQ_RECV_ERR Error reading from socket 5: [Errno 104] Connection reset by peer

comment:9 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:10 Changed 7 years ago by jreed

I cannot reproduce this with the latest master.

comment:11 Changed 7 years ago by jreed

http://git.bind10.isc.org/~jreed/trac2617.1.txt is my debug 99 output using the patch from trac2617 backported to master as of 2013-01-08.

Note this problem only happens when I use --verbose to start bind10; then I can reproduce it every time. If I don't use --verbose, I don't get any ERRORs.

comment:12 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

trac2617 is ready for review.

The main changes are from 73ca6b0 to ac23914, where I handled
"connection-close" type events separately, reducing the log level when
the event is deemed to be okay. Also, 4df7916 and 2a1e415 fix the
"UNKNOWN_FD" situation, which was a kind of real bug (though
relatively harmless except for the noisy log).

Other changes are generally necessary log enhancements and some
cleanups. The added logs are something I found useful when trying to
understand the issue, but I also thought they were useful in general,
so I've kept them in the branch. They can be dropped if they look
excessive or controversial.

I don't plan to add a changelog entry for this task as it's mostly an
internal cleanup.
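To illustrate the approach (a rough sketch only; the helper names are
hypothetical, and the exception corresponds to the MsgQCloseOnReceive
class quoted later in this ticket):

def process_packet(self, fd, sock):
    try:
        msg = self.read_packet(fd, sock)
    except MsgQCloseOnReceive as err:
        if err.partial_read:
            # connection dropped in the middle of a message: still an ERROR
            logger.error("connection closed mid-message on fd %d: %s", fd, err)
        else:
            # clean close between messages, e.g. on shutdown: only DEBUG
            logger.debug("socket %d disconnected: %s", fd, err)
        self.kill_socket(fd, sock)
        return
    self.dispatch(msg)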

comment:13 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:14 follow-ups: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

I have a few small notes. First off, is ECONNRESET really harmless? It means the other side was either Windows (which couldn't have happened with a local-domain socket) or the other side died a painful death without closing the socket. It also means they could fail to read all the data in the socket. So I'm not sure it should be handled the same as an ordinary EOF.

Why is a socket that is writable treated as not readable? Couldn't it be both written and read at the same time?

This exception:

class MsgQCloseOnReceive(Exception):
    '''Exception raised when reading data from a socket results in "shutdown.

    This can be either getting 0-length data or via ECONNRESET socket.error
    exception.  This class holds whether it happens in the middle of reading
    (i.e. after reading some) via partial_read parameter, which is set to True
    if and only if so.  This will be used by an upper layer cathing the
    exception to distinguish severity of the event.

    "'''
    def __init__(self, reason, partial_read):
        self.partial_read = partial_read
        self.__reason = reason
    def __str__(self):
        return self.__reason

I think the double-quotes in the docstring are somewhat confused. Also, I thought we usually put empty lines between method definitions.

comment:15 in reply to: ↑ 14 Changed 7 years ago by jinmei

Replying to vorner:

Why is a socket that is writable treated as not readable? Couldn't it be both written and read at the same time?

What are you referring to? Is it about a change in this branch?

comment:16 in reply to: ↑ 14 ; follow-up: Changed 7 years ago by jinmei

Thanks for the review.

Replying to vorner:

I have a few small notes. First off, is ECONNRESET really harmless? It means the other side was either Windows (which couldn't have happened with a local-domain socket) or the other side died a painful death without closing the socket. It also means they could fail to read all the data in the socket. So I'm not sure it should be handled the same as an ordinary EOF.

Hmm, based on this comment I realized I naively assumed ECONNRESET can
be handled like 0 being returned from recv(). After trying to figure
out how exactly that could happen, I found it quite difficult to
reproduce in some simplified experiments. According to logs from
BIND 10, it generally seems to happen on connections with xfrin,
xfrout, and cmdctl. I suspected the use of threads might be related,
but experiments with threads still didn't reproduce it.

At this point I gave up figuring it out - I guess it's beyond the
scope of this ticket. Since we (I) don't know how that happens and
(therefore) what kind of bad things can happen with it, it wouldn't be
a good idea to pretend it cannot be serious. So I decided to err on
the side of caution by logging all socket.error at the error level.
I clarified that it's probably less harmful if it happens on shutdown
in the detailed version of log message, even though I know no one will
read it.

Why is a socket that is writable treated as not readable? Couldn't it be both written and read at the same time?

As I asked in my previous quick response, I'm not sure which part of
the code you're referring to. Maybe this one?

                        writable = event & select.POLLOUT
                        readable = not writable and (event & select.POLLIN)

if so, actually, I don't know:-) I wondered the same thing, but in
this task I chose to preserve the original code logic:

                        if event & select.POLLOUT:
                            self.__process_write(fd)
                        elif event & select.POLLIN:
                            self.process_socket(fd)

we can probably loosen it, but I didn't want to break it accidentally
as it's outside the scope of this task.
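For concreteness, the loosened variant being discussed would presumably
look like this (not adopted in this branch; see comment:22 below):

                        if event & select.POLLOUT:
                            self.__process_write(fd)
                        if event & select.POLLIN:    # no longer exclusive
                            self.process_socket(fd)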

This exception:

class MsgQCloseOnReceive(Exception):
    '''Exception raised when reading data from a socket results in "shutdown.

    This can be either getting 0-length data or via ECONNRESET socket.error
    exception.  This class holds whether it happens in the middle of reading
    (i.e. after reading some) via partial_read parameter, which is set to True
    if and only if so.  This will be used by an upper layer cathing the
    exception to distinguish severity of the event.

    "'''
    def __init__(self, reason, partial_read):
        self.partial_read = partial_read
        self.__reason = reason
    def __str__(self):
        return self.__reason

I think the double-quotes in the docstring are somewhat confused. Also, I thought we usually put empty lines between method definitions.

Hmm, from a quick browse of the Python standard libs it in fact looks
like the convention is not to use double quotes in docstrings. I
changed it to single quotes, fixed the incomplete quotation, and also
added an empty line.
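For reference, with those fixes the class would presumably read roughly
like this (a sketch of the described changes, not necessarily the exact
committed text):

class MsgQCloseOnReceive(Exception):
    '''Exception raised when reading data from a socket results in 'shutdown'.

    This can be either getting 0-length data or via an ECONNRESET socket.error
    exception.  This class holds whether it happens in the middle of reading
    (i.e. after reading some data) via the partial_read parameter, which is
    set to True if and only if so.  This will be used by an upper layer
    catching the exception to distinguish the severity of the event.

    '''
    def __init__(self, reason, partial_read):
        self.partial_read = partial_read
        self.__reason = reason

    def __str__(self):
        return self.__reason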

comment:17 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:18 in reply to: ↑ 16 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

At this point I gave up figuring it out - I guess it's beyond the
scope of this ticket. Since we (I) don't know how that happens and
(therefore) what kind of bad things can happen with it, it wouldn't be
a good idea to pretend it cannot be serious. So I decided to err on
the side of caution by logging all socket.error at the error level.
I clarified that it's probably less harmful if it happens on shutdown
in the detailed version of log message, even though I know no one will
read it.

My guess is the socket is opened in a daemon thread and the thread is terminated the hard way. The main thread exits and the daemon threads are just killed, so the sockets in them are just auto-closed by the OS when the application exits.

Why is a socket that is writable treated as not readable? Couldn't it be both written and read at the same time?

As I asked in my previous quick response, I'm not sure which part of
the code you're referring to. Maybe this one?

                        writable = event & select.POLLOUT
                        readable = not writable and (event & select.POLLIN)

if so, actually, I don't know:-) I wondered the same thing, but in
this task I chose to preserve the original code logic:

                        if event & select.POLLOUT:
                            self.__process_write(fd)
                        elif event & select.POLLIN:
                            self.process_socket(fd)

we can probably loosen it, but I didn't want to break it accidentally
as it's outside the scope of this task.

Yes, I meant this. There's similar code in the kqueue implementation. I don't think there's a reason for it; maybe the original author thought the events were exclusive. I don't think anything would break if it were changed.

I think the double-quotes in the docstring are somewhat confused. Also, I thought we usually put empty lines between method definitions.

There's one double quote at the end of the docstring and it is unpaired:

"'''

comment:19 in reply to: ↑ 18 ; follow-up: Changed 7 years ago by jinmei

Replying to vorner:

At this point I gave up figuring it out - I guess it's beyond the
scope of this ticket. Since we (I) don't know how that happens and
(therefore) what kind of bad things can happen with it, it wouldn't be
a good idea to pretend it cannot be serious. So I decided to err on
the side of caution by logging all socket.error at the error level.
I clarified that it's probably less harmful if it happens on shutdown
in the detailed version of log message, even though I know no one will
read it.

My guess is the socket is opened in a daemon thread and the thread is terminated the hard way. The main thread exits and the daemon threads are just killed, so the sockets in them are just auto-closed by the OS when the application exits.

I tried a situation like that in my experiments, but still couldn't
reproduce the exception on the receiver side.

Why is a socket that is writable treated as not readable? Couldn't it be both written and read at the same time?

[...]

we can probably loosen it, but I didn't want to break it accidentally
as it's outside the scope of this task.

Yes, I meant this. There's similar code in the kqueue implementation. I don't think there's a reason for it; maybe the original author thought the events were exclusive. I don't think anything would break if it were changed.

So are you suggesting we change it in this task?

I think the double-quotes in the docstring are somewhat confused. Also, I thought we usually put empty lines between method definitions.

There's one double quote at the end of the docstring and it is unpaired:

"'''

Ah, right, fixed.

comment:20 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:21 in reply to: ↑ 19 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Yes, I meant this. There's similar code in the kqueue implementation. I don't think there's a reason for it; maybe the original author thought the events were exclusive. I don't think anything would break if it were changed.

So are you suggesting we change it in this task?

Yes, that would be nice. If there really is a reason why this wouldn't work, we'll at least know the reason and can add a note in a comment. If it works, it will look cleaner.

comment:22 in reply to: ↑ 21 ; follow-up: Changed 7 years ago by jinmei

Replying to vorner:

Yes, I meant this. There's similar code in the kqueue implementation. I don't think there's a reason for it; maybe the original author thought the events were exclusive. I don't think anything would break if it were changed.

So are you suggesting we change it in this task?

Yes, that would be nice. If there really is a reason why this wouldn't work, we'll at least know the reason and can add a note in a comment. If it works, it will look cleaner.

I've tried that for the kqueue version, and found that it causes a
lettuce test failure. I've not fully looked into it, but I suspect the
test scenario (perhaps implicitly) expects msgq not to notice one
closing socket until operations on all other sockets are completed.
This may not necessarily be a defect of msgq but rather a matter of
the test setup; in any event I'd say it's beyond the scope of this
ticket (note, again, this branch doesn't introduce new behavior on
this point). Likewise, I'd hold off updating the "poller" version.

For now I've added some more comments about the code. For the longer
term, I'd suggest we first add more unit tests around the code here so
we can be more confident about the safety of such changes, then unify
the poller/kqueue versions to simplify the code (maybe so it'll use
select; we don't need scalability here and select is probably the most
portable), and, on top of that, we can clean up this point.

So I propose we close this ticket without changing the behavior on
this point.
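As a rough illustration of that longer-term direction (purely a sketch;
the attribute names here are made up), a single select()-based loop could
replace the separate poll/kqueue paths:

import select

class SelectLoopSketch:
    '''Illustrative only: a select()-based main loop for msgq.'''
    def run(self):
        while self.running:
            reads = list(self.registered_fds)
            writes = [fd for fd in reads if self.pending_output.get(fd)]
            readable, writable, _ = select.select(reads, writes, [])
            for fd in writable:
                self.process_write(fd)      # flush queued outgoing data
            for fd in readable:
                self.process_socket(fd)     # read and route incoming messages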

comment:23 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:24 in reply to: ↑ 22 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei
  • Total Hours changed from 0 to 2.11

Hello

Replying to jinmei:

I've tried that for the kqueue version, and found that it causes a
lettuce test failure. I've not fully looked into it, but I suspect the
test scenario (perhaps implicitly) expects msgq not to notice one
closing socket until operations on all other sockets are completed.
This may not necessarily be a defect of msgq but rather a matter of
the test setup; in any event I'd say it's beyond the scope of this
ticket (note, again, this branch doesn't introduce new behavior on
this point). Likewise, I'd hold off updating the "poller" version.

For now I've added some more comments about the code. For the longer
term, I'd suggest we first add more unit tests around the code here so
we can be more confident about the safety of such changes, then unify
the poller/kqueue versions to simplify the code (maybe so it'll use
select; we don't need scalability here and select is probably the most
portable), and, on top of that, we can clean up this point.

So I propose we close this ticket without changing the behavior on
this point.

OK, that seems reasonable. There's one small typo in the comment (okayb):

# Note: it may be okayb to read data if available

Please fix it and merge.

comment:25 in reply to: ↑ 24 Changed 7 years ago by jinmei

Replying to vorner:

So I propose we close this ticket without changing the behavior on
this point.

OK, that seems reasonable. There's one small typo in the comment (okayb):

# Note: it may be okayb to read data if available

Please fix it and merge.

Thanks for catching the error, fixed it, merge done, closing.

comment:26 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 2.11 to 13.36