Opened 9 years ago

Closed 9 years ago

#299 closed defect (fixed)

AXFR fails half the time

Reported by: zzchen_pku
Owned by: zzchen_pku
Priority: medium
Milestone:
Component: xfrout
Version:
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity:
Sub-Project:
Feature Depending on Ticket:
Estimated Difficulty: 0.0
Add Hours to Ticket: 0
Total Hours: 1.5
Internal?: no

Description

AXFR fails half the time because of connection problems.
Error message: [b10-auth] Error in handling XFR request: Fail to send the socket file descriptor to xfrout module.

Subtickets

Attachments (2)

auth.diff (1.1 KB) - added by zzchen_pku 9 years ago.
xfroutfix.diff (1.9 KB) - added by jinmei 9 years ago.


Change History (42)

comment:1 Changed 9 years ago by zzchen_pku

  • Component changed from Inter-module communication to xfrout

comment:2 follow-up: Changed 9 years ago by zzchen_pku

It happens because the Xfrout server uses a long-lived TCP connection while the xfrout client uses short-lived TCP connections.

comment:3 in reply to: ↑ 2 ; follow-up: Changed 9 years ago by jinmei

Replying to zzchen_pku:

It happens because the Xfrout server uses a long-lived TCP connection while the xfrout client uses short-lived TCP connections.

I'm afraid I don't understand. Could you provide some more specific details?

comment:4 in reply to: ↑ 3 ; follow-up: Changed 9 years ago by zzchen_pku

I'm afraid I don't understand. Could you provide some more specific details?

The Xfrout server will close the connection after a request has been completed, but the client won't reconnect to the server each time. The client will disconnect from the server only when there is an exception raised by sendXfroutRequestInfo(), then reconnect the next time, so AXFR fails half the time.
Please find the relevant code below (auth_srv.cc):

    try {
        if (!xfrout_connected_) {
            xfrout_client_.connect();
            xfrout_connected_ = true;
        }
        xfrout_client_.sendXfroutRequestInfo(
            io_message.getSocket().getNative(),
            io_message.getData(),
            io_message.getDataSize());
    } catch (const XfroutError& err) {
        if (xfrout_connected_) {
            // disconnect() may trigger an exception, but since we try it
            // only if we've successfully opened it, it shouldn't happen in
            // normal condition.  Should this occur, we'll propagate it to the
            // upper layer.
            xfrout_client_.disconnect();
            xfrout_connected_ = false;
        }

comment:5 in reply to: ↑ 4 Changed 9 years ago by jinmei

Replying to zzchen_pku:

I'm afraid I don't understand. Could you provide some more specific details?

The Xfrout server will close the connection after a request has been completed, but the client won't reconnect to the server each time. The client will disconnect from the server only when there is an exception raised by sendXfroutRequestInfo(), then reconnect the next time, so AXFR fails half the time.

Ah, I see, thanks.

Hmm, why does the xfrout server always close the connection? Doesn't it make sense to keep it open?

comment:6 Changed 9 years ago by jelte

Right, so either the client needs to simply make a new connection every time (and not only the first time or after an error occurred), or the server needs to not close the connection.

(And if we go for the latter, perhaps the client could retry a connect() + send() after a disconnect() due to an error.)
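The second option (server keeps the connection, client retries once after an error) could be sketched roughly like this; both the client class and the function are made up to illustrate the idea, not taken from the BIND 10 code:

```python
class StaleConnectionClient:
    """Hypothetical client whose cached connection has silently gone
    stale (the server closed its end)."""
    def __init__(self):
        self.connected = True     # we still think we are connected...
        self.alive = False        # ...but the server already closed

    def connect(self):
        self.connected = self.alive = True

    def disconnect(self):
        self.connected = False

    def send(self, data):
        if not self.alive:
            raise OSError("connection reset by peer")


def send_with_retry(client, data):
    """Try the cached connection first; on failure disconnect, reconnect
    once, and resend (sketch of the retry idea; the API is made up)."""
    for _attempt in (1, 2):
        try:
            if not client.connected:
                client.connect()
            client.send(data)
            return True
        except OSError:
            if client.connected:
                client.disconnect()
    return False


print(send_with_retry(StaleConnectionClient(), b"axfr request"))   # True
```

The first attempt fails on the stale connection, the retry reconnects and succeeds, so a rare server-side close no longer costs the request.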

Changed 9 years ago by zzchen_pku

comment:7 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei
  • Status changed from new to reviewing

Jinmei, I have uploaded a quick-fix patch; can you help review it and commit it to trunk? Thanks.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 9 years ago by jinmei

Replying to zzchen_pku:

Jinmei, I have uploaded a quick-fix patch; can you help review it and commit it to trunk? Thanks.

This patch doesn't seem to handle the case where disconnect() fails correctly.

I still don't understand why the xfrout server cannot keep the connection. Although the overhead may be marginal in our usage, it just seems to me an obvious waste.

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by zzchen_pku

I still don't understand why the xfrout server cannot keep the connection. Although the overhead may be marginal in our usage, it just seems to me an obvious waste.

Because the Xfrout server is a subclass of SocketServer, and SocketServer implements logic for a single request/response per connection. Maybe we can override the SocketServer behavior to implement a long-lived TCP connection, but for this release I think we need a quick fix.
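The suggested override can be sketched with the standard socketserver module: instead of the default one-request-per-connection handle(), the handler loops until the peer closes. This demo uses a TCP socket on localhost for portability (the names and the echo protocol are made up; the real code uses a Unix domain socket):

```python
import socket
import socketserver
import threading

class LongConnectionHandler(socketserver.StreamRequestHandler):
    # Unlike socketserver's default one-request-per-connection pattern,
    # loop until the peer closes (sketch of the suggested override).
    def handle(self):
        while True:
            data = self.request.recv(1024)
            if not data:          # recv() returning b'' means the peer closed
                break
            self.request.sendall(b"ok:" + data)

server = socketserver.ThreadingTCPServer(("127.0.0.1", 0), LongConnectionHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

replies = []
with socket.create_connection(server.server_address) as conn:
    for msg in (b"req1", b"req2"):        # two requests over one connection
        conn.sendall(msg)
        replies.append(conn.recv(1024))
server.shutdown()
server.server_close()
print(replies)
```

Both requests are answered over the same connection, which is the "long connection" behavior being discussed; the quick fix avoids this restructuring for the release.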

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by jinmei

Replying to zzchen_pku:

I still don't understand why the xfrout server cannot keep the connection. Although the overhead may be marginal in our usage, it just seems to me an obvious waste.

Because the Xfrout server is a subclass of SocketServer, and SocketServer implements logic for a single request/response per connection. Maybe we can override the SocketServer behavior to implement a long-lived TCP connection, but for this release I think we need a quick fix.

Okay, and I agree we need a quick fix.

Please make a separate ticket for a better solution. If it's too hard to fix xfrout and we can be sure the overhead is negligible, that's fine, but we should at least consider a more lightweight solution in the separate ticket.

As for the proposed diff, I think we still need some cleanups.

  • we need to adjust tests
  • as noted in my previous comment, we should handle the case where disconnect() (ever) fails more carefully. basically this shouldn't happen in our usage so it's not a good idea to silently catch it.
  • we should keep the consistency between xfrout_connected_ and whether xfrout_client_ is actually connected. if we really don't need to modify the state variable we should rather remove it.

I'm attaching a counter-proposal patch, where I keep xfrout_connected_ (assuming in a complete fix we'll need it again). Please check it, and if it's okay, go ahead and commit it. Please also don't forget to add a changelog entry.

Changed 9 years ago by jinmei

comment:11 Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

comment:12 in reply to: ↑ 10 Changed 9 years ago by zzchen_pku

Please make a separate ticket for a better solution. If it's too hard to fix xfrout and we can be sure the overhead is negligible, that's fine, but we should at least consider a more lightweight solution in the separate ticket.

OK, I will.

I'm attaching a counter-proposal patch, where I keep xfrout_connected_ (assuming in a complete fix we'll need it again). Please check it, and if it's okay, go ahead and commit it. Please also don't forget to add a changelog entry.

That's fine. I have committed it.

comment:13 Changed 9 years ago by jinmei

  • Owner changed from zzchen_pku to jinmei
  • Status changed from reviewing to accepted

I believe this should be removed from the review queue.

If we open a new ticket for the better solution, this one should be closed. If we keep using this ticket for that purpose, we'll simply move it from the queue and handle it as an open ticket.

For now I'll move it from the queue and give it back to Jerry.

comment:14 Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku
  • Status changed from accepted to assigned

comment:15 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei
  • Status changed from assigned to reviewing

Modified the Xfrout server logic; the Xfrout server and client will now communicate over a long-lived TCP connection. The client needs to make a new connection only the first time or if an error occurred.
I created a new branch (trac299) for review.

comment:16 in reply to: ↑ 15 Changed 9 years ago by jinmei

Replying to zzchen_pku:

Modified the Xfrout server logic; the Xfrout server and client will now communicate over a long-lived TCP connection. The client needs to make a new connection only the first time or if an error occurred.
I created a new branch (trac299) for review.

This branch seems to be empty:

% svn log --stop-on-copy svn+ssh://bind10.isc.org/svn/bind10/branches/trac299              
------------------------------------------------------------------------
r2965 | chenzhengzhang | 2010-09-17 17:53:25 +0900 (Fri, 17 Sep 2010) | 2 lines

create new branch for review

% svn diff -r2965 svn+ssh://bind10.isc.org/svn/bind10/branches/trac299
(no output)

comment:17 Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

comment:18 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Updated; please try again.

comment:19 in reply to: ↑ 18 ; follow-up: Changed 9 years ago by jinmei

  • Add Hours to Ticket changed from 0.0 to 1.0
  • Owner changed from jinmei to zzchen_pku
  • Total Hours changed from 0.0 to 1.0

Replying to zzchen_pku:

Updated; please try again.

I suspect with this approach the xfrout thread cannot catch shutdown.

On a related note, polling every 0.5 sec is ugly. We should revisit this design.

Another minor point: please don't leave commented-out code without any explanation:

                #self._log.log_message("error", "Failed to receive the file descriptor for XFR connection")

If we really don't need it (but I wouldn't understand why in this case), just remove it; if we need it but temporarily comment it out for some reason, leave a comment about it. I believe I pointed out this kind of thing several times.

Giving the ticket back to Jerry.

comment:20 in reply to: ↑ 19 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Replying to jinmei:

I suspect with this approach the xfrout thread cannot catch shutdown.

Do you mean the xfrout server can't catch the event of the xfrout client shutting down? Since the Xfrout socket is in blocking mode, when recv() returns 0 bytes it means the xfrout client has closed the connection; at that point we can break the handling loop and close this request.

On a related note, polling every 0.5 sec is ugly. We should revisit this design.

Yeah, polling reduces the Xfrout server's responsiveness to a shutdown request and wastes CPU at all other times; I'll create a new ticket for it.

Another minor point: please don't leave commented-out code without any explanation:

     #self._log.log_message("error", "Failed to receive the file descriptor for XFR connection")

Sorry, I forgot to add a comment for it. The log message would be displayed each time the xfrout client shuts down. I have updated the recv_fd() function to return a different value to distinguish the shutdown event from error handling; or is there any other good suggestion?

comment:21 in reply to: ↑ 20 ; follow-up: Changed 9 years ago by jinmei

  • Add Hours to Ticket changed from 0.0 to 0.5
  • Owner changed from jinmei to zzchen_pku
  • Total Hours changed from 1.0 to 1.5

Replying to zzchen_pku:

Replying to jinmei:

I suspect with this approach the xfrout thread cannot catch shutdown.

Do you mean the xfrout server can't catch the event of the xfrout client shutting down? Since the Xfrout socket is in blocking mode, when recv() returns 0 bytes it means the xfrout client has closed the connection; at that point we can break the handling loop and close this request.

I meant the shutdown command via the command handler or a signal, etc. Then XfroutServer.shutdown() is called, where the shared "shutdown_event" is set. It causes self._unix_socket_server.shutdown(), but if I understand the code correctly the serve_forever() loop doesn't stop because the worker thread is in a deeper loop (i.e. the big while loop in XfroutSession.handle()).

#self._log.log_message("error", "Failed to receive the file descriptor for XFR connection")


Sorry, I forgot to add a comment for it. The log message would be displayed each time the xfrout client shuts down. I have updated the recv_fd() function to return a different value to distinguish the shutdown event from error handling; or is there any other good suggestion?

The "xfrout client" here means the auth server in effect, right? Then I don't understand why we'd worry about the shutdown because it should be a rare event with the new code. Nevertheless I generally think separating errors and normal shutdown is a good idea.

comment:22 in reply to: ↑ 21 Changed 9 years ago by zzchen_pku

Replying to jinmei:

I meant the shutdown command via the command handler or a signal, etc. Then XfroutServer.shutdown() is called, where the shared "shutdown_event" is set. It causes self._unix_socket_server.shutdown(), but if I understand the code correctly the serve_forever() loop doesn't stop because the worker thread is in a deeper loop (i.e. the big while loop in XfroutSession.handle()).

I see, thank you for the detailed explanation. I have added a stop flag for the request handling loop.
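The essence of such a stop flag can be sketched with threading.Event (names and timing here are hypothetical): unless the worker's inner request loop also checks the shared event, calling server.shutdown() alone never reaches that thread.

```python
import threading
import time

shutdown_event = threading.Event()
handled = []

def request_loop():
    # The worker's inner loop must also observe the shared shutdown event;
    # otherwise the thread keeps looping after server.shutdown() (sketch).
    while not shutdown_event.is_set():
        handled.append("request")      # stand-in for serving one request
        time.sleep(0.01)

worker = threading.Thread(target=request_loop)
worker.start()
time.sleep(0.05)
shutdown_event.set()                   # the "stop flag"
worker.join(timeout=1)
print(worker.is_alive())               # False: the loop observed the flag
```

With the event checked at the top of the loop, setting it causes the worker to exit at the next iteration boundary rather than blocking forever.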

The "xfrout client" here means the auth server in effect, right?

Yes.
Since there is an issue with xfrout shutdown, this ticket will be ready for review after #335 has been committed.

comment:23 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Please find the new change in r3301. I'd appreciate your comments.

comment:24 in reply to: ↑ 23 ; follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

Replying to zzchen_pku:

Please find the new change in r3301. I'd appreciate your comments.

Hmm, the more familiar I become with the code, the more questions arise :-)

But I'll focus on the diff itself.

  • test doesn't pass for me.
    ERROR: test_check_xfrout_available (__main__.TestXfroutSession)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/Users/jinmei/src/isc/bind10/branches/trac299/src/bin/xfrout/tests/xfrout_test.py", line 83, in setUp
        self.xfrsess = MyXfroutSession(request, None, None, self.log)
    TypeError: __init__() takes exactly 6 positional arguments (5 given)
    
    (many more errors follow)
    

did you test it yourself?

  • there seems to be a race condition about SESSION_RUNNABLE. See the discussion for trac #352. But even if we fixed the race, I wouldn't be particularly happy, because the use of this variable is scattered across the source and it's very difficult to trace how exactly it works, much less be sure it's correct. I see the need for more fundamental refactoring here. (see also below).
  • I would avoid using hardcoded value like "-2":
                    if fd == -2:
                        self._log.log_message("error", "Failed to receive the file descriptor for XFR connection")
    
    I'd define a constant in the fd_share library (both C++ and python binding) and use that constant in the rest of the code.
  • there are many redundant white spaces at EOLs or in blank lines, which seem to be common in patches from CNNIC developers. You might want to check your editor/IDE setting. (Or I suggest you write a checker script and run it before you commit anything)
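The point about hardcoded values can be sketched as follows; the constant name here is hypothetical, standing in for whatever the fd_share library (and its Python binding) would actually export:

```python
# Hypothetical module-level constant standing in for the one the review
# suggests adding to the fd_share library; the real name may differ.
XFR_FD_RECEIVE_FAIL = -2

def check_received_fd(fd, errors):
    """Classify the value returned by recv_fd() (sketch): a named
    constant instead of a magic "-2" scattered through the code."""
    if fd == XFR_FD_RECEIVE_FAIL:
        errors.append("failed to receive the file descriptor")
        return None
    return fd

errors = []
print(check_received_fd(XFR_FD_RECEIVE_FAIL, errors))   # None
print(check_received_fd(7, errors))                     # 7
```

A single named constant keeps the sentinel value defined in one place, so callers and the library cannot drift apart.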

I've not reviewed all of the code yet because I have a more fundamental issue (which is not directly due to this particular patch).

First of all, are we intentionally allowing multiple clients to connect to the xfrout server via the unix domain socket? The current implementation allows that because it inherits from ThreadingUnixStreamServer. This behavior might be good (e.g. if we use a multi-process model for b10-auth in the future, each sub process may want to make a separate connection to xfrout), but I'm not sure about the intent from the code. Actually, it rather seems to assume a single client.

Second, since the actual xfrout session works in the blocking mode, we cannot handle multiple xfrout sessions concurrently. This is not good if we have multiple large zones and when xfrout sessions run for these zones at the same time. And, of course, the shutdown procedure cannot (always) be as quick as it should be.

Depending on the answer to the first point, we might rather avoid using the socketserver framework. And, for the second part we need more sophisticated operation for xfrout sessions.

Finally, despite the pretty complicated structure of the code, xfrout is quite poorly documented. We need to provide a fully detailed design architecture and more detailed descriptions for the classes and major methods.

comment:25 in reply to: ↑ 24 Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Replying to jinmei:

  • test doesn't pass for me.

Sorry, I have updated the unittest.

  • there seems to be a race condition about SESSION_RUNNABLE. See the discussion for trac #352. But even if we fixed the race, I wouldn't be particularly happy, because the use of this variable is scattered across the source and it's very difficult to trace how exactly it works, much less be sure it's correct. I see the need for more fundamental refactoring here. (see also below).

Use shutdown_event instead of SESSION_RUNNABLE to eliminate the race condition.

  • I would avoid using hardcoded value like "-2":
                    if fd == -2:
                        self._log.log_message("error", "Failed to receive the file descriptor for XFR connection")
    
    I'd define a constant in the fd_share library (both C++ and python binding) and use that constant in the rest of the code.

Done.

  • there are many redundant white spaces at EOLs or in blank lines, which seem to be common in patches from CNNIC developers. You might want to check your editor/IDE setting. (Or I suggest you write a checker script and run it before you commit anything)

I hadn't paid attention to my editor settings before :-) Thank you for the reminder; I have removed the redundant whitespace.

I've not reviewed all of the code yet because I have a more fundamental issue (which is not directly due to this particular patch).

First of all, are we intentionally allowing multiple clients to connect to the xfrout server via the unix domain socket? The current implementation allows that because it inherits from ThreadingUnixStreamServer. This behavior might be good (e.g. if we use a multi-process model for b10-auth in the future, each sub process may want to make a separate connection to xfrout), but I'm not sure about the intent from the code. Actually, it rather seems to assume a single client.

Second, since the actual xfrout session works in the blocking mode, we cannot handle multiple xfrout sessions concurrently. This is not good if we have multiple large zones and when xfrout sessions run for these zones at the same time. And, of course, the shutdown procedure cannot (always) be as quick as it should be.

Depending on the answer to the first point, we might rather avoid using the socketserver framework. And, for the second part we need more sophisticated operation for xfrout sessions.

Finally, despite the pretty complicated structure of the code, xfrout is quite poorly documented. We need to provide a fully detailed design architecture and more detailed descriptions for the classes and major methods.

I think we need to create a new ticket for this :-), or this ticket will become more complex. I'll create one if you agree.

Please find the new change in r3395, thank you.

comment:26 follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

AuthSrv unittests failed.

auth_srv_unittest.cc:485: Failure
Value of: xfrout.isConnected()
  Actual: true
Expected: false
[  FAILED  ] AuthSrvTest.AXFRSuccess (1 ms)
...
auth_srv_unittest.cc:508: Failure
Value of: xfrout.isConnected()
  Actual: true
Expected: false
[  FAILED  ] AuthSrvTest.AXFRSendFail (0 ms)

If you could make sure all tests pass before requesting review, the review process would go much more smoothly.

comment:27 follow-up: Changed 9 years ago by zhanglikun

First of all, are we intentionally allowing multiple clients to connect to the xfrout server via the unix domain socket? The current implementation allows that because it inherits from ThreadingUnixStreamServer. This behavior might be good (e.g. if we use a multi-process model for b10-auth in the future, each sub process may want to make a separate connection to xfrout), but I'm not sure about the intent from the code. Actually, it rather seems to assume a single client.

Yes, currently, with the long connection between Xfrout and Auth, it can only handle one Auth client. I can't find a better solution to support multiple auth processes in the future.

Second, since the actual xfrout session works in the blocking mode, we cannot handle multiple xfrout sessions concurrently. This is not good if we have multiple large zones and when xfrout sessions run for these zones at the same time. And, of course, the shutdown procedure cannot (always) be as quick as it should be.

No, the current module supports concurrent transfer-out. Each xfrout session runs in its own thread; we can have up to max-transfers-out zones transferring out concurrently. But currently our naive data source API (Python sqlite3) doesn't support concurrent read operations.

comment:28 in reply to: ↑ 27 ; follow-up: Changed 9 years ago by jinmei

Replying to zhanglikun:

First of all, are we intentionally allowing multiple clients to connect to the xfrout server via the unix domain socket? The current implementation allows that because it inherits from ThreadingUnixStreamServer. This behavior might be good (e.g. if we use a multi-process model for b10-auth in the future, each sub process may want to make a separate connection to xfrout), but I'm not sure about the intent from the code. Actually, it rather seems to assume a single client.

Yes, currently, with the long connection between Xfrout and Auth, It can only handle one Auth client. I can't find a better solution to support multiple auth in the future.

Actually, as noted in my comment above, I suspect the current code allows that. But the point is that it's not clear whether it's by design or a side effect.

Second, since the actual xfrout session works in the blocking mode, we cannot handle multiple xfrout sessions concurrently. This is not good if we have multiple large zones and when xfrout sessions run for these zones at the same time. And, of course, the shutdown procedure cannot (always) be as quick as it should be.

No, the current module supports concurrent transfer-out. Each xfrout session runs in its own thread; we can have up to max-transfers-out zones transferring out concurrently. But currently our naive data source API (Python sqlite3) doesn't support concurrent read operations.

By multiple xfrout sessions, I meant handling multiple xfrout requests concurrently. I don't think it is possible with the current code (of trac #299), exactly for the reason we discussed in the first paragraph: "with the long connection between Xfrout and Auth", a single XfroutSession instance handles all incoming requests sequentially.

comment:29 in reply to: ↑ 28 ; follow-up: Changed 9 years ago by zhanglikun

Replying to jinmei:

By multiple xfrout sessions, I meant handling multiple xfrout requests concurrently. I don't think it is possible with the current code (of trac #299), exactly for the reason we discussed in the first paragraph: "with the long connection between Xfrout and Auth", a single XfroutSession instance handles all incoming requests sequentially.

Yes, the transfer-in request has to be processed sequentially. From my point of view, I don't think it's a serious problem compared with the time of a zone transfer. What's your opinion?

comment:30 in reply to: ↑ 29 ; follow-up: Changed 9 years ago by jinmei

Replying to zhanglikun:

Replying to jinmei:

By multiple xfrout sessions, I meant handling multiple xfrout requests concurrently. I don't think it is possible with the current code (of trac #299), exactly for the reason we discussed in the first paragraph: "with the long connection between Xfrout and Auth", a single XfroutSession instance handles all incoming requests sequentially.

Yes, the transfer-in request has to be processed sequentially. From my point of view, I don't think it's a serious problem compared with the time of a zone transfer. What's your opinion?

I'm not sure if we are talking about the same thing. What do you mean by "transfer-in"? I've been exclusively talking about "transfer-out", where a secondary server sends an AXFR request to a BIND 10 primary server and BIND 10 replies to it via the xfrout process.

Now, consider BIND 10 manages two large zones and the following sequence of events:

0sec: client A sends an AXFR request for zone X.
0.1sec: b10-xfrout starts handling the request. it takes 10 minutes.
0.2sec: client B sends an AXFR request for zone Y. It won't get any response soon because b10-xfrout is busy for client A.
10min 0.1sec: b10-xfrout completes the transfer for client A and starts handling client B (but client B may have given up the transfer by then).

Are we on the same page?

comment:31 in reply to: ↑ 30 ; follow-up: Changed 9 years ago by zhanglikun

Replying to jinmei:

I'm not sure if we are talking about the same thing. What do you mean by "transfer-in"? I've been exclusively talking about "transfer-out", where a secondary server sends an AXFR request to a BIND 10 primary server and BIND 10 replies to it via the xfrout process.

Sorry, my mistake; I wrote the wrong word. It should be "transfer-out"; we were and are talking about the same thing.

Now, consider BIND 10 manages two large zones and the following sequence of events:

0sec: client A sends an AXFR request for zone X.
0.1sec: b10-xfrout starts handling the request. it takes 10 minutes.
0.2sec: client B sends an AXFR request for zone Y. It won't get any response soon because b10-xfrout is busy for client A.
10min 0.1sec: b10-xfrout completes the transfer for client A and starts handling client B (but client B may have given up the transfer by then).

the events should go like this:

=0sec: client A sends an AXFR request for zone X.
=0.1sec: b10-xfrout starts handling the request by creating one thread, then b10-xfrout starts to listen for the next AXFR request. It takes 10 minutes for zone X's transfer-out, then the created transfer-out thread terminates.
=0.2sec: client B sends an AXFR request for zone Y; b10-xfrout creates another thread for zone Y's transfer-out. Now there are two transfer-out threads.

=0.2+sec: transfer-out for client B finishes and its thread terminates. There is only one transfer-out thread now (for client A).

=10min 0.1sec: transfer-out for client A finishes; no transfer-out thread now.

Are we on the same page?

Yes, we are.

comment:32 in reply to: ↑ 31 Changed 9 years ago by jinmei

Replying to zhanglikun:

the events should go like this:

I don't think this is correct in trac #299:

=0.2sec: client B sends an AXFR request for zone Y; b10-xfrout creates another thread for zone Y's transfer-out. Now there are two transfer-out threads.

The AXFR from client B should be transferred from b10-auth to b10-xfrout via "the long connection between Xfrout and Auth". And a single XfroutSession instance (corresponding to the same single thread) handles all incoming requests sequentially.

Or am I misreading the trac299 code?

comment:33 in reply to: ↑ 26 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Replying to jinmei:

AuthSrv unittests failed.

Oops... I forgot to enable gtest while configuring. Sorry to waste your valuable time; I'll be more careful later.
The unittests have been updated.

comment:34 in reply to: ↑ 33 ; follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

Replying to zzchen_pku:

unittests have been updated.

okay, confirmed. I've noticed there are obsolete comments (due to the change of this branch) in the test code. I've directly made a change on the branch to remove them (r3435).

Now going back to the main patch:

xfrout.py

  • I don't understand why we need to check _shutdown_event in XfroutSession.handle():
            while not self.server._shutdown_event.is_set(): # Check if xfrout is shutdown
    
    when shutdown happens, shouldn't the XfroutSession be able to detect that via the _shutdown_sock?
  • when EINTR happens, recv_fd() will be called even if it may not be readable.
  • why can't this be a log message rather than stderr?
                        sys.stderr.write("[b10-xfrout] Error with select(): %s\n" %e)
    
  • (this is not specific to this patch but anyway) I'm not sure if it's guaranteed all data is received here:
                msgdata = self.request.recv(msg_len)
    
    isn't it possible (at least in theory) that it's a partial read?
  • naively assuming IPv4 is not good:
                sock = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    
    although as far as I know the second parameter of fromfd will be simply ignored.
  • I suspect the naming of "_{master,slave}_sock" is confusing:
            self._master_sock, self._slave_sock = socket.socketpair()
    
    because it may look like related to the notion of DNS master (primary) and slave (secondary) while it is not.
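Several of these review points (waking the session via a socketpair, not calling recv_fd() after a mere EINTR, and neutral names instead of master/slave) can be combined in one sketch. Everything here is hypothetical illustration, not the BIND 10 code; note also that Python 3.5+ retries EINTR automatically (PEP 475), so the explicit handling only demonstrates the point being raised:

```python
import errno
import select
import socket

# One socketpair end stays with the server ("notify"); the session watches
# the other ("wait") -- neutral names instead of master/slave (sketch).
notify_sock, wait_sock = socket.socketpair()

def session_loop(request_sock):
    """Block on either a request or a shutdown notification (hypothetical)."""
    while True:
        try:
            readable, _, _ = select.select([request_sock, wait_sock], [], [])
        except OSError as e:
            if e.errno == errno.EINTR:
                continue          # interrupted by a signal: retry select(),
            raise                 # don't fall through to recv_fd()
        if wait_sock in readable:
            return "shutdown"     # the server signalled us to stop
        data = request_sock.recv(1024)
        if not data:              # peer closed the connection
            return "peer closed"
        # ... a real session would dispatch the request here ...

peer_sock, request_sock = socket.socketpair()
notify_sock.send(b"x")            # the server signals shutdown
print(session_loop(request_sock))
```

Because the session blocks in select() on both sockets, a shutdown notification wakes it immediately instead of waiting for the next request.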

fd_share_python

  • I was not sure what "REV" in "XFR_FD_REV_FAIL" means (receive?). I'd suggest not to be afraid of using relatively long variable name if it clarifies the intent.
  • using static this way is not C++-ish:
    static PyObject *XFR_FD_REV_FAIL;
    
    as far as I remember this usage is actually even "deprecated". (although this type of cleanup may better be deferred to a different ticket)
  • also please add documentation of this constant (at least in C++, and if possible in the python wrapper)
  • error cases should be handled properly here:
        XFR_FD_REV_FAIL = Py_BuildValue("i", isc::xfr::XFR_FD_REV_FAIL);
        PyModule_AddObject(mod, "XFR_FD_REV_FAIL", XFR_FD_REV_FAIL);
    

Other high level points

I think we need to create a new ticket for this :-), or this ticket will become more complex. I'll create one if you agree.

That's fine.

For that matter, I've noticed test coverage of xfrout is not really good (my test showed only 57% of the code is covered). We'll also need a separate task (ticket) to improve test coverage.

comment:35 in reply to: ↑ 34 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Replying to jinmei:

okay, confirmed. I've noticed there are obsolete comments (due to the change of this branch) in the test code. I've directly made a change on the branch to remove them (r3435).

Thanks, Jinmei.

  • I don't understand why we need to check _shutdown_event in XfroutSession.handle():
            while not self.server._shutdown_event.is_set(): # Check if xfrout is shutdown
    
    when shutdown happens, shouldn't the XfroutSession be able to detect that via the _shutdown_sock?

I remember vorner mentioned before:

As linux man page says in the Bugs section: „Under Linux, select() may report a 
socket file descriptor as "ready for reading", while nevertheless a subsequent 
read blocks.“ If such spurious wakeup happens on the _read_sock, we should not 
exit. Note that continue instead of break would solve this, since self._runnable 
would be False if the real shutdown comes.

The spurious wakeup does not happen usually, as I gather it happens only under 
extreme conditions (high load, bad checksum on incoming packet) and it is 
considered a bug.

Just to ensure the real shutdown comes.

  • when EINTR happens, recv_fd() will be called even if it may not be readable.
  • why can't this be a log message rather than stderr?
                        sys.stderr.write("[b10-xfrout] Error with select(): %s\n" %e)
    

Done.

  • (this is not specific to this patch but anyway) I'm not sure if it's guaranteed all data is received here:
                msgdata = self.request.recv(msg_len)
    
    isn't it possible (at least in theory) that it's a partial read?

The xfrout request message shouldn't be too large, but in order to avoid partial reads I have updated the recv logic.
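Such a partial-read-safe receive loop typically looks like the following; the function name is made up and this is only a sketch of the idea, not the actual updated code:

```python
import socket

def recv_all(sock, length):
    """Read exactly `length` bytes, looping over partial reads (sketch;
    a single recv() may legally return fewer bytes than requested)."""
    chunks = []
    received = 0
    while received < length:
        chunk = sock.recv(length - received)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)

a, b = socket.socketpair()
a.sendall(b"12345")
a.sendall(b"67890")        # the message may arrive split across recv() calls
print(recv_all(b, 10))     # b'1234567890'
```

Looping until the requested length arrives makes the caller independent of how the kernel happens to segment the stream.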

  • naively assuming IPv4 is not good:
                sock = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
    
    although as far as I know the second parameter of fromfd will be simply ignored.
  • I suspect the naming of "_{master,slave}_sock" is confusing:
            self._master_sock, self._slave_sock = socket.socketpair()
    
    because it may look like related to the notion of DNS master (primary) and slave (secondary) while it is not.

fd_shar_python

  • I was not sure what "REV" in "XFR_FD_REV_FAIL" means (receive?). I'd suggest not to be afraid of using relatively long variable name if it clarifies the intent.
  • using static this way is not C++-ish:
    static PyObject *XFR_FD_REV_FAIL;
    
as far as I remember this usage is actually even "deprecated". (although this type of cleanup may better be deferred to a different ticket)
  • also please add documentation of this constant (at least in C++, and if possible in the python wrapper)
  • error cases should be handled properly here:
        XFR_FD_REV_FAIL = Py_BuildValue("i", isc::xfr::XFR_FD_REV_FAIL);
        PyModule_AddObject(mod, "XFR_FD_REV_FAIL", XFR_FD_REV_FAIL);
    

Updated.

I think we need to create a new ticket for this:-), or this ticket will become more complex. I'll create one if you agree.

For that matter, I've noticed test coverage of xfrout is not really good (my test showed only 57% of the code is covered). We'll also need a separate task (ticket) to improve test coverage.

Please refer to #406 and #407.

comment:36 in reply to: ↑ 35 ; follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

Replying to zzchen_pku:

when shutdown happens, shouldn't the XfroutSession be able to detect that via the _shutdown_sock?

I remember vorner mentioned before:

[snip]

Just to ensure the real shutdown comes.

Ah, okay. But the intent is way too difficult to understand just from the code. Please add comments about why we need to do that way.

  • when EINTR happens, recv_fd() will be called even if it may not be readable.

With the new code, when errno.EINTR is received the while loop is terminated. Is that the intent? Also, this notation is not readable to me:

                if not e.args[0] == errno.EINTR:

I'd say either

                if e.args[0] is not errno.EINTR:

or just use the plain old style

                if e.args[0] != errno.EINTR:

which IMO is even more readable than "not A == B".

  • (this is not specific to this patch but anyway) I'm not sure if it's guaranteed all data is received here:
                msgdata = self.request.recv(msg_len)
    
    isn't it possible (at least in theory) that it's a partial read?

The xfrout request message shouldn't be too large, but in order to avoid a partial read, I have updated the recv logic.

fd_shar_python

static PyObject *XFR_FD_REV_FAIL;

as far as I remember this usage is actually even "deprecated". (although this type of cleanup may better be deferred to a different ticket)

  • also please add documentation of this constant (at least in C++, and if possible in the python wrapper)

As for doc, please make it in the doxygen style.

  • error cases should be handled properly here:
        XFR_FD_REV_FAIL = Py_BuildValue("i", isc::xfr::XFR_FD_REV_FAIL);
        PyModule_AddObject(mod, "XFR_FD_REV_FAIL", XFR_FD_REV_FAIL);
    

Updated.

Mostly okay, but there's one minor style issue:

PyObject *XFR_FD_RECEIVE_FAIL ...

should be

PyObject* XFR_FD_RECEIVE_FAIL ...

Finally, a new issue with the latest change in xfrout_test.py. The following comment is not true any more now that it also overrides _send_data():

# We subclass the Session class we're testing here, only
# to override the __init__() method, which wants a socket,
class MyXfroutSession(XfroutSession):

and, in any event, this comment seems to be incomplete (ending with a comma).

comment:37 in reply to: ↑ 36 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Replying to jinmei:

when shutdown happens, shouldn't the XfroutSession be able to detect that via the _shutdown_sock?

[snip]

Ah, okay. But the intent is way too difficult to understand just from the code. Please add comments about why we need to do that way.

Okay, done.

  • when EINTR happens, recv_fd() will be called even if it may not be readable.

With the new code, when errno.EINTR is received the while loop is terminated. Is that the intent? Also, this notation is not readable to me:

                if not e.args[0] == errno.EINTR:

I'd say either

                if e.args[0] is not errno.EINTR:

or just use the plain old style

                if e.args[0] != errno.EINTR:

which IMO is even more readable than "not A == B".

Updated: the code now clears (rlist, wlist, xlist) and continues the loop when errno.EINTR is received.
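The agreed EINTR handling (retry the select() rather than treating the interruption as an error) can be sketched as follows. This is a generic wrapper, not the xfrout code; note also that Python 3.5+ retries on EINTR automatically (PEP 475), so this pattern mainly matters for the Python versions of that era:

```python
import errno
import select

def select_retry_eintr(rfds, wfds, xfds, timeout=None):
    """select() wrapper that retries on EINTR instead of aborting."""
    while True:
        try:
            return select.select(rfds, wfds, xfds, timeout)
        except OSError as e:
            # Interrupted by a signal: retry the call.  Any other error
            # is propagated to the caller.
            if e.args[0] != errno.EINTR:
                raise
```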

The xfrout request message shouldn't be too large, but in order to avoid partial read, have updated the recv logic.

as far as I remember this usage is actually even "deprecated". (although this type of cleanup may better be deferred to a different ticket)

  • also please add documentation of this constant (at least in C++, and if possible in the python wrapper)

As for doc, please make it in the doxygen style.

Done.

Mostly okay, but there's one minor style issue:

PyObject *XFR_FD_RECEIVE_FAIL ...

should be

PyObject* XFR_FD_RECEIVE_FAIL ...

I found both styles in bind10 code, not sure which one is better, and http://bind10.isc.org/wiki/BIND9CodingGuidelines uses the former one.

Finally, a new issue with the latest change in xfrout_test.py. The following comment is not true any more now that it also overrides _send_data():

# We subclass the Session class we're testing here, only
# to override the __init__() method, which wants a socket,
class MyXfroutSession(XfroutSession):

and, in any event, this comment seems to be incomplete (ending with a comma).

Updated, thank you for your careful review.
Please find the new change in r3462

comment:38 in reply to: ↑ 37 ; follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

Replying to zzchen_pku:

Ah, okay. But the intent is way too difficult to understand just from the code. Please add comments about why we need to do that way.

Okay, done.

I'm afraid the "why" is not still clear. I'd suggest something like this:

        # check self.server._shutdown_event to ensure the real shutdown comes.
        # Linux could trigger a spurious readable event on the _shutdown_sock
        # due to a bug, so we need to perform a double check.

Mostly okay, but there's one minor style issues:

PyObject *XFR_FD_RECEIVE_FAIL ...

should be

PyObject* XFR_FD_RECEIVE_FAIL ...

I found both styles in bind10 code, not sure which one is better, and http://bind10.isc.org/wiki/BIND9CodingGuidelines uses the former one.

We use a different style for C++ than BIND 9 (which is written in C). BIND10 code guidelines are documented here:
http://bind10.isc.org/wiki/CodingGuidelines

Like many other style issues, 100% consistency is difficult to achieve, so it's not surprising you found both styles. But the agreed style is the latter (btw: it's not about better or worse. It's simply a matter of consistency).

With fixing the comment and the style, I'm okay with the branch.

comment:39 in reply to: ↑ 38 Changed 9 years ago by zzchen_pku

Replying to jinmei:

We use a different style for C++ than BIND 9 (which is written in C). BIND10 code guidelines are documented here:
http://bind10.isc.org/wiki/CodingGuidelines

Like many other style issues, 100% consistency is difficult to achieve, so it's not surprising you found both styles. But the agreed style is the latter (btw: it's not about better or worse. It's simply a matter of consistency).

Got it, thanks.

With fixing the comment and the style, I'm okay with the branch.

Done.

Thank you for your review. Proposed changelog entry is as follows:

116.  [bug]       jerry
  src/bin/xfrout: Xfrout and Auth now communicate over a long-lived
  TCP connection; Auth makes a new connection only on first use or
  after an error occurred.
  (Trac #299, svn rTBD)
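The fix summarized in the changelog (keep one long-lived connection, connect lazily, and reconnect only after a failure) follows a common client pattern. A Python sketch of that pattern; the class and method names are illustrative, not the actual auth/xfrout code:

```python
class ReusableConnectionClient:
    """Illustrative client that reuses one connection across requests."""

    def __init__(self, connect_fn):
        self._connect_fn = connect_fn  # factory returning a connected socket
        self._sock = None

    def send_request(self, data):
        # Connect lazily: only on first use or after a previous failure.
        if self._sock is None:
            self._sock = self._connect_fn()
        try:
            self._sock.sendall(data)
        except OSError:
            # Drop the broken connection so the next call reconnects,
            # then re-raise for the caller to handle.
            self._sock.close()
            self._sock = None
            raise
```

The key point matching the ticket: the server must also keep its side of the connection open between requests, otherwise every other request from a reusing client fails, which is exactly the "fails half the time" symptom reported here.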

comment:40 Changed 9 years ago by zzchen_pku

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged into trunk, closing it.

Note: See TracTickets for help on using tickets.