Opened 9 years ago

Closed 9 years ago

#420 closed defect (fixed)

Unresponsive process can block msgq

Reported by: shane Owned by: vorner
Priority: medium Milestone: A-Team-Sprint-20110126
Component: ~msgq (obsolete) Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 13.0 Add Hours to Ticket: 0
Total Hours: 0.5 Internal?: no

Description

A process that does not read it's communication channel can cause the msgq to block. This is bad for the entire system.

The recent example is:

send(10, "\0\0\0\227\0\206{\"instance\": \"*\", \"group\": \"Xfrout\", \"seq\": 172, \"from\": \"4cf7bb38_6@madras\", \"to\": \"4cf7bb38_7@madras\", \"reply\": 336, \"type\": \"send\"}{\"result\": [0]}"..., 155, 0

We should change the msgq to send in non-blocking mode.

What happens when msgq attempts to send and it is not able to is an open question. One possibility is to simply close the connection. Another possibility is to buffer the data, although we need to eventually give up and close the connection.

Subtickets

Change History (16)

comment:1 Changed 9 years ago by shane

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Total Hours changed from 0.0 to 0.5

comment:2 Changed 9 years ago by stephen

  • Milestone set to A-Team-Sprint-20110126

comment:3 Changed 9 years ago by stephen

  • Estimated Difficulty changed from 0.0 to 13

comment:4 Changed 9 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

comment:5 Changed 9 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

The fix itself is probably straightforward ‒ buffer data that do not fit into the socket (in non-blocking way) and if the socket does not eat anything in 0.1 seconds, just drop the connection. This fixes the fact that sock.send() might send less data then actually passed. Maybe we might want that timeout configurable in future and maybe something more should be done in case of the timeout (like telling it to boss or something). Should we create a ticket for it?

There are two kinds of tests now. Two tests hammer it with data that nobody reads and expects the socket to get closed and therefore raise an exception on the next write. The other two try pinging (I added ping command for that reason) the msgq for some time to see the answers are returning and that the msgq will not close it if the connection works.

I fixed a place where None was put into dictionary instead of deleting the entry, which could potentially lead to resource leak, when I saw it in the code.

There are two kinds of selecting sockets in msgq ‒ the default is poll, but it has kqueue as fallback. Is there anyone with the system that actually needs the fallback? Could such person test it works with it as well?

I didn't reproduce the original bug (the freeze of system), but if anyone can test it in reality, it would be nice as well.

comment:6 Changed 9 years ago by zzchen_pku

  • Owner changed from UnAssigned to zzchen_pku

comment:7 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to vorner

Initial review:

  • def add_kqueue_socket(self, socket, add_filter = 0):
    
    0 doesn't make sense to me, maybe we should use enum value instead(select.KQ_FILTER_READ?)
  • I noticed there are some commented-out code lines haven't been removed from source.
  • There are some duplicate code between send_prepared_msg() and process_write(),is it possible to eliminate them?

comment:8 in reply to: ↑ 7 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zzchen_pku

Replying to zzchen_pku:

  • def add_kqueue_socket(self, socket, add_filter = 0):
    
    0 doesn't make sense to me, maybe we should use enum value instead(select.KQ_FILTER_READ?)

Well, I tried that and python complains. It seems the default parameter values are looked up before the select module is properly imported.

However, 0 doesn't mean the KQ_FILTER_READ, it means no additional filters. I modified the variable name so it is more obvious what it means. Bit or of no flags is 0 (you see that in C all the time, 0 means default).

I could add a bool parameter instead to say if we want writing or not, if it helps.

  • I noticed there are some commented-out code lines haven't been removed from source.

But I didn't introduce them. They seem to be notes for future, mostly about logging and stuff like that, waiting for something to be implemented. Should I remove them? I wasn't sure I wouldn't delete a note that would be useful in future (they would be preserved in the history, but who looks there if not searching for something specific?).

  • There are some duplicate code between send_prepared_msg() and process_write(),is it possible to eliminate them?

I eliminated something, I can't find any more duplicates except for single line ones (which is not shortened by putting them into a function).

And I forgot to mention a changelog entry last time:

[bug]
One frozen process no longer freezes the whole b10-msgq. It caused the whole system
to stop working.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to vorner

Replying to vorner:

Well, I tried that and python complains. It seems the default parameter values are looked up before the select module is properly imported.

However, 0 doesn't mean the KQ_FILTER_READ, it means no additional filters. I modified the variable name so it is more obvious what it means. Bit or of no flags is 0 (you see that in C all the time, 0 means default).

I could add a bool parameter instead to say if we want writing or not, if it helps.

Okay.

But I didn't introduce them. They seem to be notes for future, mostly about logging and stuff like that, waiting for something to be implemented. Should I remove them? I wasn't sure I wouldn't delete a note that would be useful in future (they would be preserved in the history, but who looks there if not searching for something specific?).

For this purpose, I am okay with it.

  • There are some duplicate code between send_prepared_msg() and process_write(),is it possible to eliminate them?

I eliminated something, I can't find any more duplicates except for single line ones (which is not shortened by putting them into a function).

And I forgot to mention a changelog entry last time:

[bug]
One frozen process no longer freezes the whole b10-msgq. It caused the whole system
to stop working.

I see it.

Another problem:
lines 348~350 from send_prepared_msg()

         buff += msg[amount_sent:]
     else:
         buff = msg

Maybe I misread something, shouldn't it be

         buff += msg
     else:
         buff = msg[amount_sent:]

?

Last edited 9 years ago by zzchen_pku (previous) (diff)

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by vorner

However, 0 doesn't mean the KQ_FILTER_READ, it means no additional filters. I modified the variable name so it is more obvious what it means. Bit or of no flags is 0 (you see that in C all the time, 0 means default).

I could add a bool parameter instead to say if we want writing or not, if it helps.

Okay.

Okay to the current code or to the bool version?

Another problem:
lines 348~350 from send_prepared_msg()

Maybe I misread something, shouldn't it be

Yes, it should, thanks. But I need to write a test that actually checks the data returned, not only reads them and throws away.

comment:11 in reply to: ↑ 10 Changed 9 years ago by zzchen_pku

Replying to vorner:

However, 0 doesn't mean the KQ_FILTER_READ, it means no additional filters. I modified the variable name so it is more obvious what it means. Bit or of no flags is 0 (you see that in C all the time, 0 means default).

I could add a bool parameter instead to say if we want writing or not, if it helps.

Okay.

Okay to the current code or to the bool version?

Sorry, forget to clarify it, I think the bool version is more clear.

comment:12 Changed 9 years ago by zzchen_pku

Another problem:
msgq_test.py

os.fork() and socket.sendall() may throw exceptions.

Review done.
PS: A litter busy recently, sorry for discontinuous review.

comment:13 Changed 9 years ago by vorner

  • Owner changed from vorner to zzchen_pku

I changed the parameter to bool and added check that the data returned from MsgQ are the ones expected, so if anything is changed/lost/added somewhere should be detected. It is in the last two commits.

But what is the concern about fork or sendall throwing exception? The exceptions mean either a bug (the remote side of sendall is no longer there) or reaching some OS limit. In both cases, the test would fail (either by the exception itself or the sub-process dieing), which isn't a problem in the rare occasion of broken system. If it wasn't test, it would make sense to catch them and change them to a better error message, but is it really needed in test? Or do you see some bigger problem there than the test would simply fail or crash?

comment:14 Changed 9 years ago by vorner

I noticed one of the sub-processes could outlive the tests and be around forever, I added another commit that prevents it.

comment:15 Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to vorner

It seems you have replied the question:-)
The latest code looks okay, please go ahead and merge.

comment:16 Changed 9 years ago by vorner

  • Resolution set to fixed
  • Status changed from reviewing to closed

Thanks, closing.

Note: See TracTickets for help on using tickets.