Opened 9 years ago

Closed 9 years ago

#400 closed defect (fixed)

Race condition in zonemgr

Reported by: vorner Owned by: vorner
Priority: medium Milestone:
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

The auto-builder failures on FreeBSD8-i386 might be related to a race condition in zone manager. It is possible for the code inside the while loop not to run at all. I propose this simple patch to try and fix it. I think it is too small to be included in changelog. Do you think it is OK?

Subtickets

Attachments (2)

race.patch (634 bytes) - added by vorner 9 years ago.
race2.patch (1.3 KB) - added by jinmei 9 years ago.

Download all attachments as: .zip

Change History (8)

Changed 9 years ago by vorner

comment:1 Changed 9 years ago by vorner

  • Owner set to UnAssigned
  • Status changed from new to reviewing

comment:2 in reply to: ↑ description Changed 9 years ago by jinmei

Replying to vorner:

The auto-builder failures on FreeBSD8-i386 might be related to a race condition in zone manager. It is possible for the code inside the while loop not to run at all. I propose this simple patch to try and fix it. I think it is too small to be included in changelog. Do you think it is OK?

I agree your proposed change will likely fix the problem, but I'm afraid that it's difficult to understand the real intent of the new code. That is, what this code does is to set start_event only once, on the first iteration of the while loop, but only when self._running is True. Now that we've known this bug we may be able to guess this is somehow related to a race between _running and start_event, but I suspect it's unclear to people who read this code without having such background.

One possible way to address this concern is to add comments about why we do this, but I personally would like to offer a "simpler" solution: run the while loop unconditionally and always return from this method once we receive a message from the "shutdown socket" (= self._read_sock). I'm attaching a proposed diff to the ticket.

According to the code comment _running seems to be used to detect and reject a possible "false alarm" delivered to _read_sock, but IMO it's much better not to rely on a shared variable between threads than worrying about such a very rare error case.

What do you think?

Changed 9 years ago by jinmei

comment:3 Changed 9 years ago by jinmei

  • Owner changed from UnAssigned to vorner

comment:4 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Well, I'm little bit nervous about leaving such false alarm there, as it could be hard to hunt down if it happens and could harm the zonemanager in severe manner (if I remember the code correctly, it would do something like half-shutdown, part of the functionality would stop working). Anyway, we are trying to solve a problem that happens only in testing conditions (the thread is started and shut down immediately and we check it did something), while the false alarm could happen in real life.

Anyway, this solution has another problem. The socket stays readable forever. If it is restarted, it will terminate immediately.

May I suggest at last try reading data from the socket in non-blocking way? That way if it receives spurious wakeup, it detects it and does not terminate and if it receives data, it reads them off the socket. Or, a simple solution, check self._running at the end of the cycle body?

comment:5 in reply to: ↑ 4 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

Replying to vorner:

Well, I'm little bit nervous about leaving such false alarm there, as it could be hard to hunt down if it happens and could harm the zonemanager in severe manner (if I remember the code correctly, it would do something like half-shutdown, part of the functionality would stop working). Anyway, we are trying to solve a problem that happens only in testing conditions (the thread is started and shut down immediately and we check it did something), while the false alarm could happen in real life.

In the review of #299 I realized the spurious wakeup can happen in Linux due to a kernel bug (sigh, every OS has its own deviant behavior) and my proposal won't work.

So, I'm now okay with your original patch. Any other solution would look equally tricky, so I don't see a benefit to tweak the code further. Please go commit your patch, with comment about why we need to do that, and also about how the "false alarm" can happen.

comment:6 Changed 9 years ago by vorner

  • Resolution set to fixed
  • Status changed from reviewing to closed

I added the comments and commited. Closing.

Note: See TracTickets for help on using tickets.