Opened 7 years ago

Closed 7 years ago

#2670 closed defect (wontfix)

"A deadlock might be detected" failure on NetBSD

Reported by: naokikambe
Owned by: naokikambe
Priority: medium
Milestone: Sprint-20130205
Component: statistics
Version:
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: N/A
Sub-Project: DNS
Feature Depending on Ticket:
Estimated Difficulty: 0
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

Fix this build farm failure.
http://git.bind10.isc.org/~tester/builder//BIND10/20130129033301-NetBSD4-i386-GCC/logs/unittests.out

Judging from the portion of the stack traces below, the deadlock seems to occur between setting up msgq and starting msgq. (A sketch of the timeout guard that produces the "A deadlock might be detected" failure follows the traces.)

======================================================================
ERROR: test_openclose_mccs (__main__.TestStatsHttpd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/b10-stats-httpd_test.py", line 604, in setUp
    self.base = BaseModules()
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 534, in __init__
    self.msgq.run()
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 88, in run
    self.server._thread.start()
  File "/usr/pkg/lib/python3.1/threading.py", line 481, in start
    self._started.wait()
  File "/usr/pkg/lib/python3.1/threading.py", line 389, in wait
    self._cond.wait(timeout)
  File "/usr/pkg/lib/python3.1/threading.py", line 229, in wait
    waiter.acquire()
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 52, in sig_handler
    self.fail_handler("A deadlock might be detected")
AssertionError: A deadlock might be detected

======================================================================
ERROR: test_running (__main__.TestStatsHttpd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/b10-stats-httpd_test.py", line 604, in setUp
    self.base = BaseModules()
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 533, in __init__
    self.msgq = ThreadingServerManager(MockMsgq)
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 81, in __init__
    self.server = server(*args, **kwargs)
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 110, in __init__
    result = self.msgq.setup()
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/msgq/msgq.py", line 292, in setup
    self.setup_listener()
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/msgq/msgq.py", line 263, in setup_listener
    raise e
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/msgq/msgq.py", line 254, in setup_listener
    self.listen_socket.bind(self.socket_file)
  File "/home/jreed/builder/work/BIND10/20130129033301-NetBSD4-i386-GCC/build/src/bin/stats/tests/test_utils.py", line 52, in sig_handler
    self.fail_handler("A deadlock might be detected")
AssertionError: A deadlock might be detected
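For context on where the "A deadlock might be detected" message comes from: the test helper installs a signal handler (sig_handler in test_utils.py) that fails the test when setup blocks for too long. The following is only a minimal sketch of that kind of watchdog, assuming a SIGALRM-based timeout and illustrative names (DeadlockGuard, the 5-second timeout); it is not the actual test_utils.py code.

    import signal
    import unittest


    class DeadlockGuard:
        """Fail the current test if the guarded block runs longer than `timeout` seconds."""

        def __init__(self, fail_handler, timeout=20):
            self.fail_handler = fail_handler
            self.timeout = timeout

        def sig_handler(self, signum, frame):
            # Fires while the main thread is stuck in a blocking call such as
            # waiter.acquire() or listen_socket.bind(), which is why those
            # frames appear in the tracebacks above, followed by this handler.
            self.fail_handler("A deadlock might be detected")

        def __enter__(self):
            signal.signal(signal.SIGALRM, self.sig_handler)
            signal.alarm(self.timeout)
            return self

        def __exit__(self, *exc_info):
            signal.alarm(0)  # cancel the watchdog once the block finished in time
            return False


    class ExampleTest(unittest.TestCase):
        def test_setup_does_not_hang(self):
            with DeadlockGuard(self.fail, timeout=5):
                pass  # the real setUp() constructs BaseModules() here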

Change History (8)

comment:1 Changed 7 years ago by naokikambe

  • Milestone changed from New Tasks to Sprint-20130205

It was okayed on the list to push this into the current sprint.
https://lists.isc.org/pipermail/bind10-dev/2013-January/004319.html

comment:2 Changed 7 years ago by naokikambe

  • Owner set to jinmei
  • Status changed from new to reviewing

Replying to these comments from the list:
https://lists.isc.org/pipermail/bind10-dev/2013-January/004319.html

  • I don't understand how the "deadlock" happened and how this lock solves it. Please provide more detailed explanations and/or comments.

As I mentioned in the description above, as far as I can tell from the stack traces, the unit test seemed to be hanging between setting up msgq and starting msgq. So I suspected that one thread was setting up msgq while another thread was starting msgq just before the failure occurred, and that the second thread should wait while the first one is setting up msgq. That's why I changed the code to take a lock between setting up msgq and starting msgq, roughly as in the sketch below.
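Concretely, the change is roughly along the lines of the following sketch. This is not the exact code on the branch; the module-level lock and the method bodies are simplified, and only ThreadingServerManager and the setup/start split are taken from the traces above.

    import threading

    # One lock shared by all ThreadingServerManager instances, so a thread
    # cannot start a mock server while another thread is still setting one up.
    _setup_start_lock = threading.Lock()


    class ThreadingServerManager:
        def __init__(self, server, *args, **kwargs):
            with _setup_start_lock:
                # the server constructor performs the setup work,
                # e.g. binding the msgq listener socket
                self.server = server(*args, **kwargs)
            self.server._thread = threading.Thread(target=self.server.run)

        def run(self):
            with _setup_start_lock:
                self.server._thread.start()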

  • Same comments are repeated. I think these should be unified:

Looking at the above stack traces, the deadlock doesn't actually seem to occur when shutting down msgq, so we may not need the latter lock. Nevertheless, I've revised the two comments.

  • I don't understand why we need to call isc.log.resetUnitTestRootLogger() from multiple places. Isn't it enough to call it from the test main? If not, please explain.

Sorry, I cannot explain well why the messages are reduced when it's inserted there; I only observed it at runtime. If resetUnitTestRootLogger() is called after creating an object of each mock module, the messages seem to be reduced.
Anyway, this change isn't directly related to the deadlock failure, so the issue should be handled in a separate ticket.

So I've pushed 'trac2670' as a new branch instead of the temporary branch 'fix_stats_tests'. I'll delete the old one. Please review the new one.

Last edited 7 years ago by naokikambe

comment:3 follow-up: Changed 7 years ago by jinmei

First off, the problem still seems to happen with this branch:
http://git.bind10.isc.org/~tester/builder//BIND10/20130130180303-NetBSD4-i386-GCC/logs/unittests.out

Secondly, I still don't understand how the (seeming) deadlock happened
and how the lock would solve it; normally adding a lock solves issues
like a race condition, and (although it's possible to solve a deadlock
happening due to a race condition) an added lock could even cause a
new deadlock, not solve it.
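To illustrate this last point with a hypothetical example (not the ticket's code): if the main thread takes a newly added lock and then waits for a worker thread, while the worker needs that same lock before it can signal readiness, the added lock is itself what deadlocks.

    import threading

    lock = threading.Lock()
    started = threading.Event()


    def worker():
        with lock:          # never acquired here: the main thread holds `lock`
            started.set()


    t = threading.Thread(target=worker)
    with lock:
        t.start()
        if not started.wait(timeout=1):   # without the timeout this would hang forever
            print("deadlock: the worker is blocked on the lock we are holding")
    t.join()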

I also didn't understand how multiple threads could instantiate
BaseModules at the same time.

Since I don't understand these basic points it's not surprising there
are some other unclear points, but I'm also not sure why that's
specific to msgq. From a quick look at the code, the same thing
(whether it's a race or deadlock) seems to be able to occur for the
rest of BaseModules.__init__().

And, in general: I don't like to make an arbitrary change simply
because it *might* help, without understanding these "how"s.

For the same reason, I don't like to silence the log message without
understanding how this happened. It's especially so because it's not
directly related to the main issue of this ticket.

comment:4 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:5 in reply to: ↑ 3 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Replying to jinmei:

  First off, the problem still seems to happen with this branch:
  http://git.bind10.isc.org/~tester/builder//BIND10/20130130180303-NetBSD4-i386-GCC/logs/unittests.out

Oops, so the problem isn't resolved and the change I added makes no sense. :-|

  Secondly, I still don't understand how the (seeming) deadlock happened
  and how the lock would solve it; normally adding a lock solves issues
  like a race condition, and (although it's possible to solve a deadlock
  happening due to a race condition) an added lock could even cause a
  new deadlock, not solve it.

  I also didn't understand how multiple threads could instantiate
  BaseModules at the same time.

  Since I don't understand these basic points it's not surprising there
  are some other unclear points, but I'm also not sure why that's
  specific to msgq. From a quick look at the code, the same thing
  (whether it's a race or deadlock) seems to be able to occur for the
  rest of BaseModules.__init__().

  And, in general: I don't like to make an arbitrary change simply
  because it *might* help, without understanding these "how"s.

  For the same reason, I don't like to silence the log message without
  understanding how this happened. It's especially so because it's not
  directly related to the main issue of this ticket.

I thought that locking made sense for this problem because, from a look at the stack traces, multiple tests appeared to be hanging in parallel. But that turned out to be wrong.

I understand your opinion, but right now I don't have much information to answer what you are asking, and I don't understand the mechanism by which such a deadlock happens only on NetBSD either. I think we would need to investigate more deeply how the process behaves, e.g. at the system-call level, to understand the mechanism clearly. In my opinion, a deadlock problem is very complicated in general.

I'd like to stop working on this by guessing randomly, and I won't have any better ideas soon. So should we pull this ticket out of the current sprint? If so, I'll remove the branch from the repository.

Regards,

comment:6 in reply to: ↑ 5 ; follow-up: Changed 7 years ago by jinmei

Replying to naokikambe:

  I'd like to stop working on this by guessing randomly, and I won't have any better ideas soon. So should we pull this ticket out of the current sprint? If so, I'll remove the branch from the repository.

At this point I think we should close this ticket as "wontfix", and
create a separate new ticket for a complete solution.

BTW, even though it may be quite specific to NetBSD (although I
sometimes see a similar failure on my OSX laptop), for a "complete"
solution I'd like to suggest getting rid of threads. I believe we
should be able to introduce proper mock classes for most, if not
all, of the tests without requiring inter-process communication and
blocking operations; see the sketch below. Threads are too tricky
for primitive things like unit tests, and a perfect recipe for
disaster.
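As an illustration of that thread-free style (all names below are hypothetical, not the actual BIND 10 test classes): the test double records calls and returns canned answers in-process, so nothing binds sockets, starts threads, or blocks.

    class FakeSession:
        """Stands in for a msgq session: no socket, no thread, no blocking."""

        def __init__(self, answers=None):
            self.sent = []                 # every (msg, group) the code under test sends
            self._answers = answers or {}

        def group_sendmsg(self, msg, group):
            self.sent.append((msg, group))

        def group_recvmsg(self, nonblock=True, seq=None):
            # return a canned answer immediately instead of blocking on a socket
            return self._answers.get(seq, ({}, None))


    def test_records_sent_command():
        session = FakeSession()
        # the component under test would be constructed with `session`
        session.group_sendmsg({"command": ["show"]}, "Stats")
        assert session.sent == [({"command": ["show"]}, "Stats")]


    if __name__ == "__main__":
        test_records_sent_command()
        print("ok")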

comment:7 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:8 in reply to: ↑ 6 Changed 7 years ago by naokikambe

  • Resolution set to wontfix
  • Status changed from reviewing to closed

Thanks for the suggestion; I'm closing this as wontfix. I assume such a fundamental solution will be worked on in #1668 or other tickets.
