Opened 9 years ago

Closed 9 years ago

#335 closed defect (fixed)

b10-xfrout not exiting

Reported by: jreed
Owned by: vorner
Priority: medium
Milestone:
Component: xfrout
Version:
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity:
Sub-Project:
Feature Depending on Ticket:
Estimated Difficulty: 0.0
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

b10-xfrout doesn't exit when it receives SIGTERM (at least not quickly enough).

b10-zonemgr may have a related issue, but that one appears to be fixed/worked around using the patch from trac ticket #330.

Subtickets

Change History (25)

comment:1 Changed 9 years ago by jreed

May be related to this also:

$ echo Xfrout shutdown | DESTDIR/bin/bindctl
["login success "] login as root
> send the command to cmd-ctrld
received reply: {"error": "unknown error"}
> $

BIND 10 shows:

[b10-cmdctl] Begin send command 'shutdown' to module 'Xfrout'
13-Sep-2010 14:37:59.656 Xfrout: INFO: Received shutdown command.
[b10-cmdctl] Finish send command 'shutdown' to module 'Xfrout'

I would expect bind10 to show that it died and restart it.

It is still running.

So I run:

$ DESTDIR/bin/bindctl
["login success "] login as root
> Xfrout shutdown
send the command to cmd-ctrld
received reply: {"error": "unknown error"}
> quit

Now output from cmdctl is:

[b10-cmdctl] Begin send command 'shutdown' to module 'Xfrout'
[b10-cmdctl] Finish send command 'shutdown' to module 'Xfrout'

So this time the Xfrout INFO log message is missing.

ps shows xfrout is still running. It is in a normal interruptible sleep state.

comment:2 Changed 9 years ago by zhanglikun

I am working on the problem that xfrout/zonemgr can't be killed by SIGINT; that work will also cover the problem mentioned in this ticket.

comment:3 Changed 9 years ago by zhanglikun

  • Owner set to jreed
  • Status changed from new to reviewing

The code for this ticket is finished and ready for review. Jeremy, could you give it a test? I am not sure who would be the best person to review it.

comment:4 Changed 9 years ago by zhanglikun

One note: the changed code is in branch #335 now.

comment:5 Changed 9 years ago by jreed

  • Owner changed from jreed to UnAssigned

I didn't review the code, but the new behavior works as expected.

comment:6 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

I don't mind reading code, so I'll have a look at it.

comment:7 follow-up: Changed 9 years ago by vorner

I'm having problems with make check failing here:

...............F..........F........
======================================================================
FAIL: test_run_timer (__main__.TestZonemgrRefresh)
This case will run timer in daemon thread.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vorner/work/bind10/src/bin/zonemgr/tests/zonemgr_test.py", line 399, in test_run_timer
    self.assertFalse(listener.is_alive())
AssertionError: True is not False

======================================================================
FAIL: test_shutdown (__main__.TestZonemgrRefresh)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/vorner/work/bind10/src/bin/zonemgr/tests/zonemgr_test.py", line 417, in test_shutdown
    self.assertFalse(listener.is_alive())
AssertionError: True is not False

If I look at the test:

    def test_shutdown(self):
        self.zone_refresh._check_sock = self.zone_refresh._master_socket
        listener = threading.Thread(target=self.zone_refresh.run_timer)
        listener.start()
        self.assertTrue(listener.is_alive())
        self.zone_refresh.shutdown()
        self.assertFalse(listener.is_alive())

It calls shutdown in one thread and checks that the other thread has already exited. But that thread does not have enough time to do so; is_alive() is called too early (there's a race condition: running it again produced only one failed test, not two). You do wait for the thread event, but then there is still the thread teardown (exiting the function and so on), and a context switch can happen right at the event set (set() is called, the thread that called shutdown is woken up, and the one that was shutting down does not get time to exit the function).
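
A minimal, standalone sketch of one way to close that race, with a DummyRefresh class standing in for the real ZonemgrRefresh: join the worker (with a timeout) before asserting, so the check waits for the teardown instead of racing it.

    import threading
    import time
    import unittest

    class DummyRefresh:
        """Stand-in for the real ZonemgrRefresh, just enough for the test."""
        def __init__(self):
            self._runnable = True
        def run_timer(self):
            while self._runnable:
                time.sleep(0.01)
        def shutdown(self):
            self._runnable = False

    class TestShutdown(unittest.TestCase):
        def test_shutdown(self):
            refresh = DummyRefresh()
            listener = threading.Thread(target=refresh.run_timer)
            listener.start()
            self.assertTrue(listener.is_alive())
            refresh.shutdown()
            # Wait (bounded) for the worker's teardown instead of racing it.
            listener.join(timeout=5)
            self.assertFalse(listener.is_alive())

    if __name__ == "__main__":
        unittest.main()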

Furthermore, I do not like this (xfrin.py.in, somewhere after line 500):

            try:
                self._send_cc_session.group_sendmsg(msg, XFROUT_MODULE_NAME)
                self._send_cc_session.group_sendmsg(msg, ZONE_MANAGER_MODULE_NAME)
            except: 
                pass

Note that we ignore all exceptions without doing anything with them; this also catches things like KeyboardInterrupt or SystemExit, which are not ordinary exceptions. We should at least catch Exception, but, best of all, catch only the one that a killed msgq generates (and probably log a warning).
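
A standalone sketch of the narrower handling suggested here; SessionError and notify_shutdown() are hypothetical stand-ins for the real CC-session API, and only the pattern matters: catch one specific error, log it, and never swallow KeyboardInterrupt or SystemExit.

    import logging

    class SessionError(Exception):
        """Stand-in for the error raised when msgq has gone away."""

    def notify_shutdown(session, msg):
        try:
            session.group_sendmsg(msg, "Xfrout")
            session.group_sendmsg(msg, "Zonemgr")
        except SessionError as err:
            # Log and carry on; a bare `except: pass` would also hide
            # KeyboardInterrupt and SystemExit.
            logging.warning("could not notify modules of shutdown: %s", err)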

As well as this test (zonemgr_test.py):

        # test select.error by using bad file descriptor 
        bad_file_descriptor = self.zone_refresh._master_socket.fileno()
        self.zone_refresh._check_sock = bad_file_descriptor  
        self.zone_refresh._master_socket.close()
        self.assertRaises(None, self.zone_refresh.run_timer()) 

I think it shouldn't be None but the exception that really is thrown, because the code could fail for some other reason and we shouldn't mask that.

            if self._read_sock in rlist: # awaken by shutdown socket 
                break

As the Linux man page says in the Bugs section: „Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks." If such a spurious wakeup happens on _read_sock, we should not exit. Note that continue instead of break would solve this, since self._runnable would be False when the real shutdown comes.
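
A standalone sketch of the continue-based loop, assuming a simplified shape of run_timer(); the names (_read_sock, _runnable) follow the ticket, the rest is illustrative.

    import select
    import socket

    class TimerLoop:
        def __init__(self):
            self._read_sock, self._write_sock = socket.socketpair()
            self._runnable = True

        def shutdown(self):
            self._runnable = False              # a real shutdown clears the flag first
            self._write_sock.send(b'shutdown')  # then wakes up the select() below

        def run_timer(self):
            while self._runnable:
                rlist, _, _ = select.select([self._read_sock], [], [], 1.0)
                if self._read_sock in rlist:
                    # On a spurious wakeup _runnable is still True, so we just
                    # go around the loop again instead of breaking out.
                    continue
                # ... fire any zone-refresh timers that are due ...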

comment:8 Changed 9 years ago by vorner

  • Owner changed from vorner to zhanglikun

comment:9 in reply to: ↑ 7 ; follow-up: Changed 9 years ago by zzchen_pku

Replying to vorner:

It calls shutdown in one thread and checks that the other thread has already exited. But that thread does not have enough time to do so; is_alive() is called too early (there's a race condition: running it again produced only one failed test, not two). You do wait for the thread event, but then there is still the thread teardown (exiting the function and so on), and a context switch can happen right at the event set (set() is called, the thread that called shutdown is woken up, and the one that was shutting down does not get time to exit the function).

Good catch. The main thread will now sleep for a while to ensure that the timer thread has enough time to exit.

Note that we ignore all exceptions without doing anything with them; this also catches things like KeyboardInterrupt or SystemExit, which are not ordinary exceptions. We should at least catch Exception, but, best of all, catch only the one that a killed msgq generates (and probably log a warning).

Likun will handle it.

I think it shouldn't be None but the exception that really is thrown, because the code could fail for some other reason and we shouldn't mask that.

Okay.

            if self._read_sock in rlist: # awaken by shutdown socket 
                break

As the Linux man page says in the Bugs section: „Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks." If such a spurious wakeup happens on _read_sock, we should not exit. Note that continue instead of break would solve this, since self._runnable would be False when the real shutdown comes.

I am not sure what would trigger that effect, but it works for me; I updated the code according to your suggestions.
Thank you for your comments.

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by zhanglikun

  • Owner changed from zhanglikun to vorner

Note that we ignore all exceptions without doing anything with them; this also catches things like KeyboardInterrupt or SystemExit, which are not ordinary exceptions. We should at least catch Exception, but, best of all, catch only the one that a killed msgq generates (and probably log a warning).

Likun will handle it.

OK, the code has been committed in r3151. Please have a look again. :)

comment:11 in reply to: ↑ 10 Changed 9 years ago by zhanglikun

Note: the latest changes are r3141 and r3151.

comment:12 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zhanglikun

The exceptions are OK.

The spurious wakeup does not usually happen; as I gather, it occurs only under extreme conditions (high load, a bad checksum on an incoming packet) and is considered a bug. But since it should not break a working system, I think it is OK to provide a workaround. Thanks for including it.

About the race condition: the sleep(1) does help sometimes, but not always; I was still able to reproduce it. And I found another one. If run_timers is run in one thread and shutdown is called soon after, shutdown sets _is_runnable = False first and then the other thread sets _is_runnable = True. May I suggest using a wrapper that runs run_timers in another thread, stores the thread object and calls .join() on it, instead of having the event object? It seems cleaner and would remove both problems, because _is_runnable could be set before the thread is started.
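
A standalone sketch of that wrapper idea, assuming a simplified refresh class; the method names follow the ticket and the rest is illustrative. Storing the thread and joining it makes shutdown() block until the worker has really finished, and setting the flag before start() removes the second race.

    import threading
    import time

    class ZoneRefresh:
        def __init__(self):
            self._is_runnable = False
            self._thread = None

        def start_timers(self):
            # Set the flag before starting the thread so a quick shutdown()
            # cannot be overwritten by the worker afterwards.
            self._is_runnable = True
            self._thread = threading.Thread(target=self._run_timers)
            self._thread.start()

        def _run_timers(self):
            while self._is_runnable:
                time.sleep(0.01)    # ... fire due zone-refresh timers here ...

        def shutdown(self):
            self._is_runnable = False
            # Joining the stored thread guarantees the worker has fully exited
            # before shutdown() returns, so is_alive() checks cannot race.
            self._thread.join()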

And I noticed one more thing. We shouldn't call run_timers, then shutdown, then run_timers again, because the shutdown pipe would still contain b'shutdown' and would still be readable, and we would get a busy loop. Maybe it should be mentioned in the documentation that one should not call run_timers multiple times on the same object. Or clean the pipe (or create a new one and throw the old one away) on each run_timers start.

comment:13 in reply to: ↑ 12 Changed 9 years ago by zzchen_pku

  • Owner changed from zhanglikun to vorner

Replying to vorner:

The spurious wakeup does not usually happen; as I gather, it occurs only under extreme conditions (high load, a bad checksum on an incoming packet) and is considered a bug. But since it should not break a working system, I think it is OK to provide a workaround. Thanks for including it.

Got it, thanks.

About the race condition: the sleep(1) does help sometimes, but not always; I was still able to reproduce it. And I found another one. If run_timers is run in one thread and shutdown is called soon after, shutdown sets _is_runnable = False first and then the other thread sets _is_runnable = True. May I suggest using a wrapper that runs run_timers in another thread, stores the thread object and calls .join() on it, instead of having the event object? It seems cleaner and would remove both problems, because _is_runnable could be set before the thread is started.

Actually, Zonemgr calls _start_zone_refresh_timer() to run run_timers and calls .join() on shutdown; you can find both in the Zonemgr class.
I have updated the test case to call .join() when shutting down the timer thread. This blocks the main thread until the timer has terminated.

And I noticed one more thing. We shouldn't call run_timers, then shutdown, then run_timers again, because the shutdown pipe would still contain b'shutdown' and would still be readable, and we would get a busy loop. Maybe it should be mentioned in the documentation that one should not call run_timers multiple times on the same object. Or clean the pipe (or create a new one and throw the old one away) on each run_timers start.

The Zonemgr process only calls run_timers once, on startup, but since the unit tests may call it more than once, the pipe is cleaned on each run_timers shutdown to avoid the busy loop.

diff : svn diff -r 3152

comment:14 Changed 9 years ago by vorner

  • Owner changed from vorner to zhanglikun

With the changes, you reintroduced the spurious-wakeup problem (I just found it untouched in notify_out as well). Actually, this is a worse version: the first one exited and boss restarted the process; this version will block forever if there's a spurious wakeup, because the recv() is blocking.

And the notify_out test just failed for me again, because it has exactly the same problem that zonemanager had at the beginning.

And I do not think we test anything at all by calling join explicitly in the test and then checking whether the thread is alive. The thread is, by definition, not alive after join.

Would you mind if I had a look at it and fixed the problems I see, and then you reviewed them (i.e., we exchange roles for these problems)?

comment:15 Changed 9 years ago by zhanglikun

  • Owner changed from zhanglikun to vorner

I would appreciate that; please go ahead, thanks.

comment:16 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zhanglikun

I have hopefully fixed the problems with the race conditions and the spurious wakeup. I discovered (and fixed) one more: when the process (either zonemanager or xfrout) was idle (no zones to manage), it had a special case that used sleep, not select. But then it did not recognize that someone was trying to wake it up via the shutdown socket and kept sleeping (and boss usually killed it before it woke up).

I modified the changelog entry as well, to include a small change in public methods of the classes.

Could you have a look at my changes to check that I didn't break anything else? My changes are the last three commits on the branch (r3191, r3200, r3201).

Thank you

comment:17 in reply to: ↑ 16 Changed 9 years ago by zzchen_pku

Replying to vorner:
I reviewed r3191 and it looks OK. I updated some comments; please find them in r3204.

comment:18 follow-up: Changed 9 years ago by zhanglikun

This is the reply to the change in r3200.

/trac335/src/bin/xfrout/xfrout.py.in:

  1. I don't know why you want to remove the following lines of code. There will be 3 or more threads when xfrout is running: the main thread, the notify-out thread and the unix-socket-server thread, plus other threads doing zone transfers. But I do like that you wrapped creating the dispatcher thread into one function.

           main_thread = threading.currentThread()
           # close the thread which's doing zone transfer.
           for th in threading.enumerate():
               if th is main_thread:
                   continue
               th.join()

  2. Some docstrings don't follow the standard way. The old one was

  """ This the comments line1
  this is the comments line2
  """

and it has been changed to

  """
  This the comments line1
  this is the comments line2
  """

branches/trac335/src/lib/python/isc/notify/notify_out.py:

  1. I would like to move the following line of code into NotifyOut.__init__(), since the variable self._started_event is used by another function, self._dispatcher():

self._started_event = threading.Event()

  2. Do we really need to do the cleanup in shutdown()?

                  # Clean up
                  self._write_sock = None
                  self._read_sock = None
                  self._thread = None

  3. Why do we create only the variable self._read_sock in NotifyOut.__init__()? I am just confused about why self._write_sock is not created there as well. I would keep the old style:

self._read_sock, self._write_sock = socket.socketpair()

You mentioned that self._read_sock may be empty in some test, but I can't find that in the test case.

  4. Can we simplify the following code to "if min_timeout:"?

      if min_timeout is not None:

comment:19 Changed 9 years ago by zhanglikun

  • Owner changed from zhanglikun to vorner

comment:20 in reply to: ↑ 18 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zhanglikun

Replying to zhanglikun:

This is the reply to the change in r3200.

/trac335/src/bin/xfrout/xfrout.py.in :

  1. I don't know why you want to remove the following lines of code. There will be 3 or more threads when xfrout is running: the main thread, the notify-out thread and the unix-socket-server thread, plus other threads doing zone transfers.

Because they are unnecessary, as far as I'm aware. .join() just waits for the thread to finish; it does not terminate it. So the code just waited for everything to terminate. But since the threads are not daemon threads, the Python interpreter will wait for them anyway. Or did I miss something?

  2. Some docstrings don't follow the standard way. The old one was

  """ This the comments line1
  this is the comments line2
  """

and it has been changed to

  """
  This the comments line1
  this is the comments line2
  """

Is that the standard one? I read that the terminating triple quote should be on a separate line, and it looked more symmetric/aesthetic to have the first one on a separate line as well. I do not mind changing it back if the other way is more usual.

the variable self._started_event is used by another function, self._dispatcher().

self._started_event = threading.Event()

I need a new one every time I start the thread (some tests start it multiple times). Furthermore, I'm not sure how expensive .set() is; it might involve some synchronisation, locking, etc. So I remove the event when it is no longer needed, and the loop ignores it.

Another way would be to set it on each iteration and .clear() it before starting the thread again. Do you think that way is better?

  2. Do we really need to do the cleanup in shutdown()?

Yes, we need to clean the socket. We write b'shutdown' to it and never read it, we just terminate. If we started it again (like from the tests), we would get a busy loop: the socket would be readable all the time, but self._serving would still be true. So removing the socket and throwing it away with the data inside is needed (we could do the work and read it in a non-blocking way, but that would be longer). Furthermore, freeing some resources (two file descriptors, to start with) is a nice side effect.

The thread is cleaned up only for consistency; it will not be needed any more anyway.

You mentioned that self._read_sock may be empty in some test, but I can't find that in the test case.

Some tests run some internal functions (_wait_for_notify_reply) directly, without calling dispatcher() (and starting a thread). Therefore _read_sock is left at its default None value. It is just a side effect of running only part of the machinery, not something the test sets on purpose.

      if min_timeout is not None:

Because min_timeout might legally be 0 (and 0 is considered False), in which case we would replace it with tmp_timeout, which might be higher than 0.
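
A tiny illustration of the difference, with made-up values: with "if min_timeout:" a legal 0 would be treated like a missing value and replaced.

    min_timeout = 0     # a legal, immediate timeout
    tmp_timeout = 5

    # Truthiness test: 0 is falsy, so the legal 0 gets discarded.
    timeout = min_timeout if min_timeout else tmp_timeout              # -> 5 (wrong)

    # Explicit None check: only a genuinely missing value falls back.
    timeout = min_timeout if min_timeout is not None else tmp_timeout  # -> 0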

Do you think I should modify anything else besides the comment strings (I'm getting to those soon)? Should I put some of these explanations into comments? Or do you think I'm wrong about any of them?

Thank you

comment:21 in reply to: ↑ 20 ; follow-up: Changed 9 years ago by zhanglikun

  • Owner changed from zhanglikun to vorner

Replying to vorner:

The others are OK; I just have a different opinion on three points. I am not saying you are wrong.

Replying to zhanglikun:

This is the reply to the change in r3200.

/trac335/src/bin/xfrout/xfrout.py.in :

  1. I don't know why you want to remove the following lines of code. There will be 3 or more threads when xfrout is running: the main thread, the notify-out thread and the unix-socket-server thread, plus other threads doing zone transfers.

Because they are unnecessary, as far as I'm aware. .join() just waits for the thread to finish; it does not terminate it. So the code just waited for everything to terminate. But since the threads are not daemon threads, the Python interpreter will wait for them anyway. Or did I miss something?

Yes, vorner, what you joined is just the notify-out thread, but you are changing xfrout's code; there may be some threads doing zone transfers out, and they are not joined.

the variable self._started_event is used by another function, self._dispatcher().

self._started_event = threading.Event()

I need a new one every time I start the thread (some tests start it multiple times). Furthermore, I'm not sure how expensive .set() is; it might involve some synchronisation, locking, etc. So I remove the event when it is no longer needed, and the loop ignores it.

Another way would be to set it on each iteration and .clear() it before starting the thread again. Do you think that way is better?

Hi vorner, you know I am a C++ man, not a professional Python man, so I like to declare a variable in __init__() if it is shared by more than one of the class's functions.

Now, self._started_event is defined in dispatcher() and used in _dispatcher(), so when I read _dispatcher() I need to search for where self._started_event is defined; and if I just call _dispatcher() directly, there will be an error because self._started_event has not been defined. So, from my point of view, there are two choices:

  1. define self._started_event in self.__init__().
  2. pass self._started_event to _dispatcher() as a parameter, like _dispatcher(event=_started_event).

Again, that's just my opinion. If you don't agree, please go ahead and merge.

  2. Do we really need to do the cleanup in shutdown()?

Yes, we need to clean the socket. We write b'shutdown' to it and never read it, we just terminate. If we started it again (like from the tests), we would get a busy loop: the socket would be readable all the time, but self._serving would still be true. So removing the socket and throwing it away with the data inside is needed (we could do the work and read it in a non-blocking way, but that would be longer). Furthermore, freeing some resources (two file descriptors, to start with) is a nice side effect.

I should read the data to check whether the data in the socket is 'shutdown', not just check that it is readable; see the fix in r3245. Test cases should be independent of each other, and the test case writer should guarantee this.

The thread is cleaned up only for consistency; it will not be needed any more anyway.

You mentioned that self._read_sock may be empty in some test, but I can't find that in the test case.

Some tests run some internal functions (_wait_for_notify_reply) directly, without calling dispatcher() (and starting a thread). Therefore _read_sock is left at its default None value. It is just a side effect of running only part of the machinery, not something the test sets on purpose.

Again, I am a C++ man; when I read the code, the definition of self._read_sock on its own feels a little tricky to me. Since it is part of a socketpair, why not define self._read_sock and self._write_sock together?

comment:22 in reply to: ↑ 21 Changed 9 years ago by vorner

Replying to zhanglikun:

Because they are unnecessary, as far as I'm aware. .join() just waits for the thread to finish; it does not terminate it. So the code just waited for everything to terminate. But since the threads are not daemon threads, the Python interpreter will wait for them anyway. Or did I miss something?

Yes, vorner, what you joined is just the notify-out thread, but you are changing xfrout's code; there may be some threads doing zone transfers out, and they are not joined.

Let's wait and see what others on the mailing list think, then.

  2. pass self._started_event to _dispatcher() as a parameter, like _dispatcher(event=_started_event).

I implemented this one, as it seems cleaner than the original.
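
A minimal sketch of what passing the event into the worker looks like; the names are illustrative rather than the actual notify_out code.

    import threading

    def _dispatcher(started_event):
        started_event.set()    # tell the caller the thread is really running
        # ... main notify dispatch loop ...

    def dispatcher():
        started = threading.Event()
        thread = threading.Thread(target=_dispatcher, args=(started,))
        thread.start()
        started.wait()         # do not return before the worker has started
        return thread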

I should read the data to check whether the data in the socket is 'shutdown', not just check that it is readable; see the fix in r3245. Test cases should be independent of each other, and the test case writer should guarantee this.

This is not a problem of the tests; it is a problem of the interface. Without the cleanup (and without reading it), you couldn't start, shut down and then start again, even though the interface looks like that should be possible. Not that anyone does it anywhere in our code (except for the tests).

I changed the data to just b's' to ensure it fits into a single call to .recv(), and added a workaround for the Linux select bug. There is no need to create a new socket now, as this one is cleaned by the read.

I changed the code to an assert since, as we discussed on Jabber, no one is really expected to write anything else there and this is only a sanity check.
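
A standalone sketch of that shutdown-socket handling, assuming a simplified serving loop; only the pattern described above (single-byte token, drain it with recv(), assert its content) is taken from the ticket.

    import select
    import socket

    class Server:
        def __init__(self):
            self._read_sock, self._write_sock = socket.socketpair()
            self._serving = True

        def shutdown(self):
            self._serving = False
            self._write_sock.send(b's')   # one byte always fits a single recv()

        def serve(self):
            while self._serving:
                rlist, _, _ = select.select([self._read_sock], [], [], 1.0)
                if self._read_sock in rlist:
                    # Draining the token keeps the socket reusable (no busy
                    # loop if serve() is started again); the assert is only a
                    # sanity check, nothing else is ever written here.
                    data = self._read_sock.recv(1)
                    assert data == b's'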

again, I a c++ man, when I read the code, I just feel a little tricky about the definition self._read_sock, since it's part of socketpair, so why not define self._read_sock and self._write_sock toghther.

It is created for the whole lifetime now, as part of the change above.

I found a failing test (it put the wrong data into the socket) and an attempt to shut down the zonemanager twice when the shutdown was done by command rather than Ctrl+C.

These changes are r3258-r3261. Let's hold off on the review until the thread-join question is settled.

comment:23 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zhanglikun

The thread joins have returned, in r3269.

What about now, do you think it is OK?

comment:24 in reply to: ↑ 23 Changed 9 years ago by zhanglikun

  • Owner changed from zhanglikun to vorner

Replying to vorner:

The thread joins have returned, in r3269.

What about now, do you think it is OK?

It's OK now, vorner, but I also made one minor fix; see r3271. If you agree with it, this ticket can be merged now.

comment:25 Changed 9 years ago by vorner

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged, thanks.
