Opened 7 years ago

Closed 7 years ago

#2398 closed defect (fixed)

msgq crash

Reported by: jreed Owned by: jelte
Priority: medium Milestone: Sprint-20121204
Component: ~msgq (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: High
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 4 Add Hours to Ticket: 0
Total Hours: 2.98 Internal?: no

Description

msgq crashed.

On an unrelated bug, b10-auth was hanging already. I tried to do Auth shutdown and that timed out and failed. I sent a kill signall to the process. Then msqg crashed:

2012-10-23 15:50:18.551 DEBUG [b10-cmdctl.cmdctl] CMDCTL_SEND_COMMAND sending command shutdown to module Auth
2012-10-23 15:50:18.552 DEBUG [b10-cmdctl.cmdctl] CMDCTL_COMMAND_SENT command 'shutdown' to module 'Auth' was sent
2012-10-23 15:50:22.559 ERROR [b10-cmdctl.cmdctl] CMDCTL_COMMAND_ERROR error in 
command shutdown to module Auth: Module 'Auth' not responding
2012-10-23 15:50:25.892 DEBUG [b10-boss.boss] BIND10_RECEIVED_COMMAND received command: show_processes
2012-10-23 15:50:25.896 DEBUG [b10-stats.stats] STATS_SEND_STATISTICS_REQUEST requesting Xfrout to send statistics
2012-10-23 15:50:25.896 DEBUG [b10-stats.stats] STATS_SEND_STATISTICS_REQUEST requesting Auth to send statistics
2012-10-23 15:50:25.896 DEBUG [b10-stats.stats] STATS_SEND_STATISTICS_REQUEST requesting Boss to send statistics
2012-10-23 15:50:25.897 DEBUG [b10-xfrout.xfrout] XFROUT_RECEIVED_GETSTATS_COMMAND received command to get statistics data
2012-10-23 15:50:25.898 DEBUG [b10-boss.boss] BIND10_RECEIVED_COMMAND received command: getstats
2012-10-23 15:50:27.258 INFO  [b10-boss.boss] BIND10_LOST_SOCKET_CONSUMER consumer 16 of sockets disconnected, considering all its sockets closed
[b10-msgq] Closing socket fd 9
[b10-msgq] Closing socket fd 11
[b10-msgq] Closing socket fd 5
2012-10-23 15:50:27.260 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 19663
 of b10-auth ended with status 15
2012-10-23 15:50:27.261 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component 
b10-auth (pid 19663) failed: process terminated with exit status 15 (killed by signal 15: SIGTERM)
2012-10-23 15:50:27.261 INFO  [b10-boss.boss] BIND10_COMPONENT_START component b10-auth is starting
2012-10-23 15:50:27.261 INFO  [b10-boss.boss] BIND10_STARTING_PROCESS starting process b10-auth
Traceback (most recent call last):
  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 552, in <module>
    msgq.run()
  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 463, in run    self.run_kqueue()
  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 495, in run_kqueue
    self.__process_write(event.ident)
  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 394, in __process_write
    amount_sent = self.__send_data(sock, msg)
  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 346, in __send_data
    raise e
  File "/home/reed/opt/bind10/libexec/bind10-devel/b10-msgq", line 341, in __send_data
    return sock.send(data)
socket.error: [Errno 32] Broken pipe
2012-10-23 15:50:27.270 DEBUG [b10-boss.boss] BIND10_STARTED_PROCESS_PID started b10-auth (PID 9955)
2012-10-23 15:50:27.283 ERROR [b10-resolver.cc] CC_ASYNC_READ_FAILED asynchronous read failed (error code = 2)
2012-10-23 15:50:27.283 FATAL [b10-boss.boss] BIND10_MSGQ_DISAPPEARED msgq channel disappeared

This ticket is not about the auth problem. It is about msgq.

Subtickets

Change History (13)

comment:1 Changed 7 years ago by shane

Basically msgq is brittle in the face of socket errors.

See also ticket #1706 and #681.

Luckily there is only one place where data is written. Sadly neither of the places that invoke the send() seem to be set up to handle errors, so this involves non-trivial coding.

comment:2 Changed 7 years ago by muks

  • Summary changed from msqg crash to msgq crash

comment:3 Changed 7 years ago by jelte

  • Defect Severity changed from N/A to High
  • Milestone changed from New Tasks to Sprint-20121120

comment:4 Changed 7 years ago by jelte

  • Owner set to jelte
  • Status changed from new to assigned

comment:5 follow-up: Changed 7 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from assigned to reviewing

Ready for review.

As far as msgq itself is concerned, I only addressed the SIGPIPE, which did not turn out to be much of a problem (there is already code to handle remote sockets that are suddenly closed, so it's really only a question of catching and handling the error).

It does *not* handle and recover any other exceptions (except the socket errors that were already handled). I don't think we should catchall here, as depending on the exception we should handle it differently (and perhaps simply crash, if it is a syntax error for example). Also, I kept it to send() calls only. If we want we can do something similar for recv() (which should now be a lot easier, as I think the tests could be extended to test that too).

As for the changes:

Most of the diff is testing code. I took a look at the existing fork() and run() code, but it is a bit brittle imo, and chances are that it leaves processes hanging and it is difficult to communicate errors back to check for them.

So I made a pretty simple and naive threading class that does nothing but call msgq.run() and store any exception it raises.

Secondly, I added a BadSocket? class which wraps an existing socket and optionally raises a given exception after a given number of bytes have been sent. I added tests for some existing handling (eagain and ewouldblock), and a few new ones.

comment:6 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:7 in reply to: ↑ 5 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jelte

Hello

Replying to jelte:

It does *not* handle and recover any other exceptions (except the socket errors that were already handled). I don't think we should catchall here, as depending on the exception we should handle it differently (and perhaps simply crash, if it is a syntax error for example). Also, I kept it to send() calls only. If we want we can do something similar for recv() (which should now be a lot easier, as I think the tests could be extended to test that too).

I don't know. For one, I think syntax errors are not descendants of Exceptions.
Even when leaving these out, looking at man send, I think there are more
interesting cases there:

  • ECONNRESET (it would probably be similar to EPIPE).
  • ENOBUFS (the other side probably does not read, or something, so the same).
  • EINTR (just retry like with EAGAIN).

Also, there are few points to the code:

  • What happens if the kill_socket is called (because of EPIPE), but the socket has some data to read and has the read even queued?
  • I think there's +-3 error in this comment:
            raise_on_send: integer. If send_exception is not None, it will be
                                    raised on this byte (i.e. 1 = on the first
                                    call to send(), 1 = on the 4th call to send)
                                    Note: if 0, send_exception will not be raised.
    
  • I don't like timeouts in testing code much. These are probably reasonable, because they time out only in case something in the test fail. But is 0.2 enough even on slow systems?
            # Give it a chance to stop, if it doesn't, no problem, it'll
            # die when the program does
            msgq_thread.join(0.2)
    
  • Also, the code directly below, is it safe? In case the timeout happens, the thread could still be running. And it could actually be modifying some of the variables that are being examined. I know this is mostly a theoretical concern and I know that if it was in the unsafe case, the test would be failing anyway, but maybe it would be better to assert first the thread is not alive.
            # Check the exception from the thread, if any
            self.assertEqual(expect_send_exception, msgq_thread.caught_exception)
    

comment:8 in reply to: ↑ 7 Changed 7 years ago by jelte

  • Owner changed from jelte to vorner

Replying to vorner:

Hello

Replying to jelte:

It does *not* handle and recover any other exceptions (except the socket errors that were already handled). I don't think we should catchall here, as depending on the exception we should handle it differently (and perhaps simply crash, if it is a syntax error for example). Also, I kept it to send() calls only. If we want we can do something similar for recv() (which should now be a lot easier, as I think the tests could be extended to test that too).

I don't know. For one, I think syntax errors are not descendants of Exceptions.

yeah they are; http://docs.python.org/2/library/exceptions.html#exception-hierarchy, though a few tend to be thrown from the python parser (like IndentationError?), and therefore not be caught normally, unless something like eval() is used, a number of them are (like NameError?). This has bitten me quite a few times already in parts of the code enclosed in catchalls that ignore errors.

Even when leaving these out, looking at man send, I think there are more
interesting cases there:

  • ECONNRESET (it would probably be similar to EPIPE).
  • ENOBUFS (the other side probably does not read, or something, so the same).
  • EINTR (just retry like with EAGAIN).

Added these (and moved each type into a test that loops to reduce code duplication). I wonder if we should maybe also kill it in the case of any other socket errors, but for now I think this is good enough.

Also, there are few points to the code:

  • What happens if the kill_socket is called (because of EPIPE), but the socket has some data to read and has the read even queued?

I suspect we'll need to do something similar for any problem that may arise while *reading*. Hopefully doing that would be a lot easier now that we at least have somewhat of a testing framework.

Or maybe we should just go async completely :p (which will of course come with all kinds of other fun problems)

  • I think there's +-3 error in this comment:
            raise_on_send: integer. If send_exception is not None, it will be
                                    raised on this byte (i.e. 1 = on the first
                                    call to send(), 1 = on the 4th call to send)
                                    Note: if 0, send_exception will not be raised.
    

That will teach me to change behaviour without rewriting the entire docstring. Simplified it.

  • I don't like timeouts in testing code much. These are probably reasonable, because they time out only in case something in the test fail. But is 0.2 enough even on slow systems?
            # Give it a chance to stop, if it doesn't, no problem, it'll
            # die when the program does
            msgq_thread.join(0.2)
    
  • Also, the code directly below, is it safe? In case the timeout happens, the thread could still be running. And it could actually be modifying some of the variables that are being examined. I know this is mostly a theoretical concern and I know that if it was in the unsafe case, the test would be failing anyway, but maybe it would be better to assert first the thread is not alive.
            # Check the exception from the thread, if any
            self.assertEqual(expect_send_exception, msgq_thread.caught_exception)
    

While playing around with it a bit more, I ran into more problems. The base problem was that in the case of *success* there was no way to actually stop msgq (apart from outright killing it).

So I caved in and did what i had originally tried not to; built in a mechanism to stop msgq. It is not a full command (yet), though when we hook msgq up to itself, the stopping mechanism at least is there already :) (oh and we can consider using it in boss)

This involved making the main poll loops not while True but while self.running, and adding a timeout to the calls to poll() and kqueue.control(). I chose 2 seconds for this, rather arbitrarily, but it's a choice between not looping too often and not having to wait too long for it to realize it should shut down.

That allowed the testing code to have much less timeouts, and timeouts that can be much higher; I still have 1 second for an initial join() in the cases where a kill_socket should occur (to give it a chance to handle existing events before killing it), but the main and final join now waits a full minute. If it hasn't closed down by then it is considered an error and the test fails.
I'll keep an eye on that 1 second if it causes problems on slower systems.

comment:9 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jelte

Hello

There are still few things:

  • Would it be worth it unifying send_prepared_msg and __process_write? They seem they share a lot of common code now.
  • Is it OK? Couldn't it lead to some kind of starvation of one socket (if there was a really hight amount of data being sent, there'd always be something to write to the socket and it wouldn't get to the read). Arguably, there must be some reads somewhere to have writes, but there still can be one socket that is unlucky.
  • I don't like the way the running is set to false and then there's waiting until the read times out. Shouldn't the poll be waken up somehow? Waiting 2 seconds seems like a really long time.

comment:10 in reply to: ↑ 9 Changed 7 years ago by jelte

  • Owner changed from jelte to vorner

Replying to vorner:

Hello

There are still few things:

  • Would it be worth it unifying send_prepared_msg and __process_write? They seem they share a lot of common code now.

Yeah the error checking can actually be done in send_data(), as long as it's allowed to return 'an integer or None' (None meaning 'socket seems dead. destroyed it'); Updated.

  • Is it OK? Couldn't it lead to some kind of starvation of one socket (if there was a really hight amount of data being sent, there'd always be something to write to the socket and it wouldn't get to the read). Arguably, there must be some reads somewhere to have writes, but there still can be one socket that is unlucky.

er, what exactly is 'it'? In general, as long as actual data is sent, and since it's not looping locally on EAGAIN-like codes (therefore either diminishing the amount of data to send or doing the main loop and hitting the reads as well), it should prefer sending as much data as the writers accept, and then move on to the readers. But maybe I am completely misunderstanding your comment :)

  • I don't like the way the running is set to false and then there's waiting until the read times out. Shouldn't the poll be waken up somehow? Waiting 2 seconds seems like a really long time.

Doh, I realized last night the timeout isn't necessary at all; we can just add a second socket pair to control msgq. Done so (and removed the timeout in msgq itself again); now the only timeout that is left is the backup one to make sure the test doesn't get stuck for eternity.

comment:11 Changed 7 years ago by jelte

while closing my editor I saw a small error in an error message that should in theory not be triggered, but I updated and fixed it anyway (it said [XX] UNKNOWN ERROR which wasn't really useful)

comment:12 Changed 7 years ago by vorner

  • Owner changed from vorner to jelte
  • Total Hours changed from 0 to 2.98

Hello

It looks good to merge.

Talking about merging, it might require a changelog.

comment:13 Changed 7 years ago by jelte

  • Resolution set to fixed
  • Status changed from reviewing to closed

Thanks!

Added a changelog entry as well

Merged, closing ticket

Note: See TracTickets for help on using tickets.