Opened 7 years ago

Closed 7 years ago

#2790 closed defect (fixed)

Lettuce tests timing & missed messages

Reported by: vorner Owned by: jelte
Priority: medium Milestone: Sprint-20130402
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: Core Feature Depending on Ticket:
Estimated Difficulty: 9 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

The lettuce tests fail quite often, usually while waiting for some message to appear. However, the message does seem to be present in the log output file. Also, it didn't seem to have any garbage around it, as suggested elsewhere.

My wild guess of what might be happening:

  • A lot of messages are output at once.
  • Lettuce checks at intervals, so more than one message has accumulated by the next check.
  • The kernel provides only something like 4k of data per read, splitting one of the messages at the end of the block.
  • Lettuce gets confused by the partial message.

But looking at it again, this seems like an unlikely situation. Could it be that the write of a single message is not atomic, and the lettuce check happens at the exact moment when only part of the message has been output?

Anyway, it is some kind of race condition, as it obviously happens only sometimes.
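
As a purely illustrative sketch of how a log-watching test could defend against the partial-message hypothesis above (this is not the actual lettuce helper; the file name, message ID and function name are made up): only lines terminated by a newline are treated as complete messages, and any trailing fragment is buffered until the next poll.

    import time

    def wait_for_message(path, message_id, timeout=10.0, poll=0.1):
        """Tail 'path' and return True once a complete line contains message_id."""
        deadline = time.time() + timeout
        partial = ''
        with open(path, 'r') as f:
            while time.time() < deadline:
                chunk = f.read()              # may end in the middle of a line
                if chunk:
                    partial += chunk
                    lines = partial.split('\n')
                    partial = lines.pop()     # keep the unterminated tail for later
                    if any(message_id in line for line in lines):
                        return True
                else:
                    time.sleep(poll)
        return False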

Subtickets

Change History (18)

comment:1 Changed 7 years ago by jreed

Also see #1629.

comment:2 Changed 7 years ago by vorner

Yes, but as I said, my output had no garbage in it, yet it still didn't work. So this might be a different problem.

comment:3 Changed 7 years ago by jinmei

I also suggest reverting the increased wait time of 60 seconds. It didn't seem to help, and it can now even be annoying for local tests run by developers.

comment:4 in reply to: ↑ description Changed 7 years ago by muks

Replying to vorner:

The lettuce tests fail quite often, usually while waiting for some message to appear. However, the message does seem to be present in the log output file. Also, it didn't seem to have any garbage around it, as suggested elsewhere.

I don't know what this particular case is. But many lettuce features have statements that wait for a "next" log message (after that point in time). The message ID may have been logged before, but it waits for a new one.
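
A minimal sketch of the "wait for the next occurrence" idea described above (a hypothetical helper, not lettuce's actual API): remember the current end of the log file and only search text appended after that offset.

    import os, time

    def wait_for_new_message(path, message_id, timeout=10.0, poll=0.1):
        """Return True only if message_id is logged *after* this call is made."""
        start_offset = os.path.getsize(path)   # ignore everything logged so far
        deadline = time.time() + timeout
        while time.time() < deadline:
            with open(path, 'r') as f:
                f.seek(start_offset)
                if message_id in f.read():
                    return True
            time.sleep(poll)
        return False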

comment:5 Changed 7 years ago by jelte

  • Milestone changed from New Tasks to Sprint-20130319

comment:6 Changed 7 years ago by jelte

Some additional results from experiments and discussion:

It would appear that the non-buffered stderr is the culprit here. When I repeatedly run one specific test, it fails every 10-20 runs or so; when I changed the logging for that test to stdout instead of stderr, it ran successfully for 500+ runs.

Looking into the log4cplus code, it does a number of writes per log message (depending on the format parts).

I tried setting stderr to line-buffering, but this did not change anything; either the call is ignored or one of the libraries resets it.

What did help was changing log4cplus to first format the message into a string and then output the whole string in one go. This may or may not be a good idea anyway (it results in an extra stack allocation and an extra copy of the data, but fewer system calls).
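
A rough Python analogue of that change (the real patch is in log4cplus' C++ appender code, so this is only a sketch of the idea): build the complete, formatted line first and emit it with a single write() system call, so concurrent writers cannot interleave fragments of one message.

    import os, sys, time

    def log_line(level, message):
        # Format everything up front...
        line = "%s %s %s\n" % (time.strftime("%Y-%m-%d %H:%M:%S"), level, message)
        # ...then hand the whole message to the kernel in one write() call.
        os.write(sys.stderr.fileno(), line.encode())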

But this also suggests that the interprocess file locking may not be working completely; one of the 'bad' lines in my output had data from two different processes. We have so far not been able to figure out why this happens.

So there are two workarounds for now:

  • change output (either in tests, by default, or both) to use the line-buffered stdout instead of stderr
  • update log4cplus to do one write per log message (it's a very small patch)

Doing the second will, however, take some time to get into releases and distributions, etc.

And of course we need to figure out why the file locking doesn't appear to work in this situation (any attempt to debug it so far has resulted in the symptom disappearing).

comment:7 Changed 7 years ago by jelte

Update 2: the log4cplus change makes it only slightly better; the problem occurs less often, but now entire messages seemingly disappear.

comment:8 follow-up: Changed 7 years ago by jelte

Update 3: we have come to the conclusion that it is the threading that is the culprit here; the interprocess sync lock is not thread-safe. I have experimented with one of the branches of #2198, and that one appears to fully solve the issue.

I therefore propose to finish up #2198, and close this one.

comment:9 in reply to: ↑ 8 Changed 7 years ago by jinmei

Replying to jelte:

Update 3: we have come to the conclusion that it is the threading that is the culprit here; the interprocess sync lock is not thread-safe. I have experimented with one of the branches of #2198, and that one appears to fully solve the issue.

I therefore propose to finish up #2198, and close this one.

If that (= we'll need to take care of both inter-process and inter-thread synchronization issues regarding logging ourselves) is our conclusion, to which I'm not necessarily opposed unless log4cplus itself can solve it or we can find a better logging library that meets our requirements, then I'd suggest outsourcing the low-level tasks as much as possible. A major concern of mine for #2198 is that we are going to do too many low-level things ourselves (handling flock(2) ourselves, and now going to deal with pthread locks ourselves) while partially outsourcing the feature (logging, in this case).

I'd like to do either

  1. fully control everything ourselves, not just for synchronization issues but also the underlying logging itself, or
  2. out-source as many things that are not of our main concern as possible

Option 1 is probably not realistic, especially at this stage, so in practice what I'd suggest is to consider using a third-party tool for the synchronization problems. From a quick look, boost::interprocess::named_mutex could be a candidate; it can serve as a unified synchronization mechanism for a multi-process and multi-thread environment.
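
To illustrate the "both inter-process and inter-thread" requirement in Python terms (this is neither the BIND 10 lock code nor a boost example, just a sketch with made-up names): pair an in-process threading.Lock with an fcntl file lock, so that two threads in one process are serialized as well as two separate processes.

    import fcntl, threading

    class LoggingSyncLock:
        """Sketch of a lock that is safe across both threads and processes."""
        def __init__(self, lockfile_path):
            self._tlock = threading.Lock()             # serializes threads in this process
            self._lockfile = open(lockfile_path, 'w')  # shared lock file for all processes

        def __enter__(self):
            self._tlock.acquire()                       # first exclude other threads...
            fcntl.flock(self._lockfile, fcntl.LOCK_EX)  # ...then other processes
            return self

        def __exit__(self, exc_type, exc_value, traceback):
            fcntl.flock(self._lockfile, fcntl.LOCK_UN)
            self._tlock.release()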

comment:10 follow-up: Changed 7 years ago by jinmei

I figured out one of the common failure cases, where it appears as if xfrin had hung. Actually, what happened is that one of the zonemgr threads blocked on a lock and failed to send the refresh command to xfrin.

The blocking thread is running ZonemgrRefresh. On receiving a
NOTIFY, this thread is eventually invoked and tries to send the
refresh command to xfrin:

    def _send_command(self, module_name, command_name, params):
        """Send command between modules."""
        try:
            self._mccs.rpc_call(command_name, module_name, params=params)

rpc_call eventually reaches Session.sendmsg(), where the thread
tries to acquire a lock:

    def sendmsg(self, env, msg=None):
        with self._lock:

On the other hand, the main thread normally blocks on the same socket (Session.recvmsg via ModuleCCSession.check_command) while holding the lock:

    def recvmsg(self, nonblock = True, seq = None):
        with self._lock:
...
            data = self._receive_full_buffer(nonblock) # <= this can block

So, if the main thread blocks before the refresh thread completes the command exchange, zonemgr falls into a semi-deadlock (not a complete deadlock, because the lock is released once a new command arrives). And that's what actually happens when the lettuce test fails this way.

At the very least, a thread shouldn't block while holding a lock. We should probably also clarify which attributes _lock is supposed to protect and try to narrow the critical section (if possible).

BTW, in case it's not clear: this is a real bug that can happen in a deployed environment, not just a test-only regression.
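
A sketch of the "don't block while holding the lock" point, modeled loosely on the Session methods quoted above (the socket handling is simplified and the class is illustrative, not the real Session): wait for readability outside the critical section, and take the lock only around a short, non-blocking receive.

    import select, threading

    class Session:
        """Sketch only; not the real Session class."""
        def __init__(self, sock):
            self._socket = sock
            self._lock = threading.Lock()

        def sendmsg(self, data):
            with self._lock:                  # short critical section, no long blocking
                self._socket.sendall(data)

        def recvmsg(self, nonblock=True):
            if not nonblock:
                # Block OUTSIDE the lock, so another thread's sendmsg() (e.g. the
                # zonemgr refresh thread sending its command) is never stuck
                # waiting on self._lock while we sit in a blocking read.
                select.select([self._socket], [], [])
            with self._lock:
                self._socket.setblocking(False)
                try:
                    return self._socket.recv(65535)
                except BlockingIOError:
                    return None               # nothing to read right now
                finally:
                    self._socket.setblocking(True)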

comment:11 in reply to: ↑ 10 Changed 7 years ago by vorner

Hello

Replying to jinmei:

The blocking thread is running ZonemgrRefresh. On receiving a
NOTIFY, this thread is eventually invoked and tries to send the
refresh command to xfrin:

    def _send_command(self, module_name, command_name, params):
        """Send command between modules."""
        try:
            self._mccs.rpc_call(command_name, module_name, params=params)

:-(. The threads there and the use of the CC are such a mess. (And yes, I admit, I probably caused this bug by unifying the mccs used across the threads; it might have used a separate connection to msgq before, but that felt wrong: why would one program connect to msgq multiple times?)

I think there are two things that should be done:

  • Clarify whether accessing the CC from multiple threads is allowed, how it is allowed, and whether receiving a message should work after we enter the command main loop.
  • Get rid of the threads in the zone manager. There does not seem to be a need for them, except to cause bugs and brain damage (we don't really need performance in the zone manager).

But the question now is that both of these will probably take some non-trivial time (though the first one could be a reasonably short-term goal). Do we want to do some workaround in the meantime, like re-introducing a separate CC session?

comment:12 Changed 7 years ago by jelte

While I am all for getting rid of the threads in zonemgr (preferably including the main thread ;)), I think a reasonable intermediate solution is to do something similar to, for instance, ddns: a select() followed by a nonblocking call to check_command().

comment:13 follow-up: Changed 7 years ago by jelte

e.g. as a very quick suggestion (no doubt I'm missing 15 error cases here):

diff --git a/src/bin/zonemgr/zonemgr.py.in b/src/bin/zonemgr/zonemgr.py.in
index 59900c4..b85b60e 100755
--- a/src/bin/zonemgr/zonemgr.py.in
+++ b/src/bin/zonemgr/zonemgr.py.in
@@ -667,7 +667,14 @@ class Zonemgr:
         self.running = True
         try:
             while not self._shutdown_event.is_set():
-                self._module_cc.check_command(False)
+                fileno = self._module_cc.get_socket().fileno()
+                try:
+                    (reads, _, _) = select.select([fileno], [], [])
+                except select.error as se:
+                    if se.args[0] != errno.EINTR:
+                        raise
+                if fileno in reads:
+                    self._module_cc.check_command(True)
         finally:
             self._module_cc.send_stopping()

With that, trac2198_5 passes all lettuce tests on my single-core Debian VM.
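
For reference, a standalone sketch of the same loop with the signal-interrupt case closed off (one small difference from the quoted patch: if select() is interrupted, the loop simply retries instead of falling through with a stale or unassigned reads). The module_cc and shutdown_event objects follow the names used in the patch and are assumed here, not defined.

    import errno, select

    def run_main_loop(module_cc, shutdown_event):
        """select() on the CC socket, then do a single non-blocking check_command()."""
        while not shutdown_event.is_set():
            fileno = module_cc.get_socket().fileno()
            try:
                reads, _, _ = select.select([fileno], [], [])
            except select.error as se:
                if se.args[0] != errno.EINTR:
                    raise
                continue                  # interrupted by a signal: retry the select
            if fileno in reads:
                module_cc.check_command(True)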

comment:14 in reply to: ↑ 13 Changed 7 years ago by jinmei

Replying to jelte:

e.g. as a very quick suggestion (no doubt I'm missing 15 error cases here):

diff --git a/src/bin/zonemgr/zonemgr.py.in b/src/bin/zonemgr/zonemgr.py.in
index 59900c4..b85b60e 100755
--- a/src/bin/zonemgr/zonemgr.py.in
+++ b/src/bin/zonemgr/zonemgr.py.in
@@ -667,7 +667,14 @@ class Zonemgr:
         self.running = True
         try:
             while not self._shutdown_event.is_set():
-                self._module_cc.check_command(False)
+                fileno = self._module_cc.get_socket().fileno()
+                try:
+                    (reads, _, _) = select.select([fileno], [], [])
+                except select.error as se:
+                    if se.args[0] != errno.EINTR:
+                        raise
+                if fileno in reads:
+                    self._module_cc.check_command(True)
         finally:
             self._module_cc.send_stopping()

As you probably know, technically this is not a complete solution
because exiting from select does not guarantee all data are available.

Also, I suspect we should still prevent Session.recvmsg() (and sendmsg, too, although it is less critical in practice) from blocking while holding the lock. Otherwise the same type of locking issue can happen in other modules.

But that doesn't seem to be an easy task (it probably requires fundamental changes to the Session class), and that's probably why you suggested this point-fix patch. If so, I think we can apply it for now, but I'd like to add some comments about its limitations.

comment:15 Changed 7 years ago by jelte

Yes, this proposed patch is a short-term workaround, to at least get things working better than they currently do :)

FYI, I applied it to yet another 2198 branch (based on 2198_5), and it passed the buildbot lettuce test.

comment:16 Changed 7 years ago by jelte

  • Owner set to jelte
  • Status changed from new to assigned

comment:17 Changed 7 years ago by jinmei

This may or may not be related, but I noticed that even with the proposed patch, the revised lettuce tests in #2252 failed in xfrin_notify_handling.feature, just like we discussed in this ticket. Maybe you could play with it, too.

comment:18 Changed 7 years ago by jelte

  • Resolution set to fixed
  • Status changed from assigned to closed

OK, I added some comments as to why this is done and applied the short-term workaround for now; the other issues should be addressed by #2198.
