Opened 7 years ago

Closed 7 years ago

#2934 closed defect (fixed)

xfrout session can be broken due to EAGAIN

Reported by: jinmei Owned by: jinmei
Priority: medium Milestone: Sprint-20130528
Component: xfrout Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 2 Add Hours to Ticket: 0
Total Hours: 13.21 Internal?: no

Description

I noticed xfrout-ing a large zone from b10-xfrout can be abruptly
terminated if I dump the transferred record to a terminal using 'dig
axfr'. On a closer look it seems XfroutSession._send_data() raises
(an exception due to) EAGAIN:

        while total_count < size:
            count = os.write(sock_fd, data[total_count:])
            total_count += count

(It should be reproducible even more easily by, e.g., starting axfr
with dig and suspend it before it completes).

It might be system dependent, but on my system sock_fd is non
blocking (probably derived from the original TCP socket with which
b10-auth received the AXFR query), which is the reason for the error.

While this might be relatively minor, it should easily happen in real
world, due to a slow link or packet loss, etc, too. So I think we
should fix it sooner.

A cleanest solution would be to do the asynchronous write correctly,
communicating with the parent thread so it can gracefully terminate on
shutdown. But, assuming we'll redesign xfr* fundamentally, an easier
workaround is sufficient: making the FD (socket) non blocking. I'm
attaching a patch to do this. I confirmed it solved the problem.

Subtickets

Attachments (2)

xfrout.diff (922 bytes) - added by jinmei 7 years ago.
run_python-tool.sh.diff (418 bytes) - added by jinmei 7 years ago.

Download all attachments as: .zip

Change History (26)

Changed 7 years ago by jinmei

comment:1 Changed 7 years ago by vorner

I might have met this problem and disregarded it as problem with my setup, it seems. It strangely failed with xfr-out on IPv6, but worked on IPv4 (possibly my IPv4 is faster or due to the larger IPv6 headers, it the buffers fill up with IPv6 but not with IPv4).

I'm going to check if it still happens with this patch.

comment:2 follow-up: Changed 7 years ago by vorner

Yes, this really helped with my problem. It wouldn't cross my mind it could be this.

Should we just merge such a small patch? Or do we need the full review overhead for these 4 lines?

comment:3 in reply to: ↑ 2 Changed 7 years ago by jinmei

Replying to vorner:

Yes, this really helped with my problem. It wouldn't cross my mind it could be this.

Should we just merge such a small patch? Or do we need the full review overhead for these 4 lines?

We'll at least need a test. I also wonder whether we might rather do
asynchronous write, because otherwise an evil or hangup client could
hold the session unnecessarily long. With blocking write we cannot be
responsive to shutdown event either. It may be okay to defer these
things to the planned complete rewrite, but I think we should at least
discuss them here.

comment:4 Changed 7 years ago by muks

  • Estimated Difficulty changed from 0 to 2

comment:5 Changed 7 years ago by muks

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130528

comment:6 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:7 Changed 7 years ago by jinmei

trac2943 is ready for review.

The first commit is the actual fix and unit tests. For the purpose
of the test I changed the place of making it nonblocking, but it's
essentially the same as the originally proposed patch.

I believe non-blocking I/O with asynchronous API (such as select) is a
better choice, but xfrout is already so broken in many ways, and with
the expected plan of rewrite I didn't go that far.

The second commit is not directly related, but fixes a small oddity
I found while I tried to provide a system test of this problem.
The change is small so I think it's okay to include it.

The third commit is the largest one: lettuce tests of xfrout including
the scenario of this bug. I needed to provide some supplemental tools
such as a loadzone terrain and a custom axfr client (emulating slower
receiver), so this part of the diff is actually the largest. But
these should be basically straightforward, and I believe these tools
are useful for other test cases, so worth adding at this point.

Proposed changelog:

615.	[bug]		jinmei
	b10-xfrout now uses blocking send for xfr response messages
	to prevent abrupt termination of the stream due to a slower
	client or narrower network bandwidth.
	(Trac #2934, git TBD)

comment:8 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:9 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:10 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

The lettuce tests don't pass for me, for some kind of permission error:

    Load 100 records for zone example.org to DB file data/xfrout.sqlite3                        # features/terrain/load    Load 100 records for zone example.org to DB file data/xfrout.sqlite3                        # features/terrain/loadzone.py:58
    Traceback (most recent call last):
      File "/home/vorner/.local/lib64/python2.7/site-packages/lettuce/core.py", line 117, in __call__
        ret = self.function(self.step, *args, **kw)
      File "/var/tmp/bind10/bind10-3/tests/lettuce/features/terrain/loadzone.py", line 86, in load_zone_rr_to_dbfile
        run_loadzone(zone, f.name, db_file)
      File "/var/tmp/bind10/bind10-3/tests/lettuce/features/terrain/loadzone.py", line 35, in run_loadzone
        subprocess.PIPE, subprocess.PIPE)
      File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
        errread, errwrite)
      File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
        raise child_exception
    OSError: [Errno 13] Permission denied
    Given I have bind10 running with configuration xfrout_master.conf                           # features/terrain/bind    Given I have bind10 running with configuration xfrout_master.conf                           # features/terrain/bind10_control.py:107
    And wait for bind10 stderr message BIND10_STARTED_CC                                        # features/terrain/step    And wait for bind10 stderr message BIND10_STARTED_CC                                        # features/terrain/steps.py:34
    And wait for bind10 stderr message CMDCTL_STARTED                                           # features/terrain/step    And wait for bind10 stderr message CMDCTL_STARTED                                           # features/terrain/steps.py:34
    And wait for bind10 stderr message AUTH_SERVER_STARTED                                      # features/terrain/step    And wait for bind10 stderr message AUTH_SERVER_STARTED                                      # features/terrain/steps.py:34
    And wait for bind10 stderr message XFROUT_STARTED                                           # features/terrain/step    And wait for bind10 stderr message XFROUT_STARTED                                           # features/terrain/steps.py:34
    And wait for bind10 stderr message ZONEMGR_STARTED                                          # features/terrain/step    And wait for bind10 stderr message ZONEMGR_STARTED                                          # features/terrain/steps.py:34
    When I do a customized AXFR transfer of example.org                                         # features/terrain/tran    When I do a customized AXFR transfer of example.org                                         # features/terrain/transfer.py:88
    Then transfer result should have 101 rrs                                                    # features/terrain/tran    Then transfer result should have 101 rrs                                                    # features/terrain/transfer.py:164

  Scenario: transfer a large zone                                                               # features/xfrout_bind10.feature:22
    Load 50000 records for zone example.org to DB file data/xfrout.sqlite3                      # features/terrain/load    Load 50000 records for zone example.org to DB file data/xfrout.sqlite3                      # features/terrain/loadzone.py:58
    Traceback (most recent call last):
      File "/home/vorner/.local/lib64/python2.7/site-packages/lettuce/core.py", line 117, in __call__
        ret = self.function(self.step, *args, **kw)
      File "/var/tmp/bind10/bind10-3/tests/lettuce/features/terrain/loadzone.py", line 86, in load_zone_rr_to_dbfile
        run_loadzone(zone, f.name, db_file)
      File "/var/tmp/bind10/bind10-3/tests/lettuce/features/terrain/loadzone.py", line 35, in run_loadzone
        subprocess.PIPE, subprocess.PIPE)
      File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
        errread, errwrite)
      File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
        raise child_exception
    OSError: [Errno 13] Permission denied
    Given I have bind10 running with configuration xfrout_master.conf 

Also, the 5 seconds of timeout seems like a really long time to me. I guess some kind of wait is needed here, but single second would be more than enough to fill whatever buffers the OS has (or flush the whole zone into the buffers if they are large enough).

When I do a customized AXFR transfer of example.org from [::1]:47806 with pose of 5 seconds

Does anything still use the database_file option? Didn't we change the xfrin/out to use the client lists?

    "Auth": {
        "database_file": "data/xfrout.sqlite3",
        "listen_on": [ {
            "address": "::1",
            "port": 47806
        } ]
    },

Then there are two things in comments.

This should be „with“, not „without“:

# examine how it's called, so we replace it without our mock.

I believe this is not a file, but socket. Files are always blocking, at least in linux (I don't know about other unices).

    """A helper function to change blocking mode of the given file.

    It sets the mode of blocking I/O for the file associated with filenum,
    according to parameter 'on': if it's True the file will be made blocking;
    otherwise it will be made non-blocking.

    filenum(int): file number (descriptor) of the file to update.
    on(bool): whether enable (True) or disable (False) blocking I/O.

    """

comment:11 in reply to: ↑ 10 ; follow-up: Changed 7 years ago by jinmei

Thanks for the review.

Replying to vorner:

The lettuce tests don't pass for me, for some kind of permission error:

      File "/var/tmp/bind10/bind10-3/tests/lettuce/features/terrain/loadzone.py", line 35, in run_loadzone
        subprocess.PIPE, subprocess.PIPE)
      File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
        errread, errwrite)
      File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
        raise child_exception
    OSError: [Errno 13] Permission denied

Hmm, I have no idea about the reason for that. Other terrain script
should also use installed bindctl, and both cases are invoked in the
same way. So if other tests such as bindctl_commands_feature works
I don't see why it doesn't work for loadzone. Does your test
environment have b10-loadzone installed somewhere in the PATH with
executable permission?

Also, the 5 seconds of timeout seems like a really long time to me. I guess some kind of wait is needed here, but single second would be more than enough to fill whatever buffers the OS has (or flush the whole zone into the buffers if they are large enough).

When I do a customized AXFR transfer of example.org from [::1]:47806 with pose of 5 seconds

I actually didn't try to lower the value, but I found I'd still need
at least 4 seconds to reproduce the original problem with the test.
Of course, it highly depends on local environment, so 5 seconds may be
too long for some, and might even be too short for others. In any
case the problem shouldn't happen how long we delay it, and in that
sense any value is okay as long as it catches any future regression
like that. For now, I've kept the same amount of delay with some
explanatory comments.

Does anything still use the database_file option? Didn't we change the xfrin/out to use the client lists?

    "Auth": {
        "database_file": "data/xfrout.sqlite3",
        "listen_on": [ {
            "address": "::1",
            "port": 47806
        } ]
    },

xfrout still needs to refer to it (one of the things we should fix).

Then there are two things in comments.

This should be „with“, not „without“:

# examine how it's called, so we replace it without our mock.

Ah, right, corrected.

I believe this is not a file, but socket. Files are always blocking, at least in linux (I don't know about other unices).

    """A helper function to change blocking mode of the given file.

    It sets the mode of blocking I/O for the file associated with filenum,
    according to parameter 'on': if it's True the file will be made blocking;
    otherwise it will be made non-blocking.

    filenum(int): file number (descriptor) of the file to update.
    on(bool): whether enable (True) or disable (False) blocking I/O.

    """

I intentionally chose the word file because the API used in the
function only relies on file abstraction and the descriptor it takes
is named "fileno" in the Python API. But that's a minor point, and in
fact the given descriptor must always be for a socket for this helper
anyway, so I revised the docstring replacing 'file' with 'socket'.

comment:12 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:13 in reply to: ↑ 11 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Thanks for the review.

Replying to vorner:

The lettuce tests don't pass for me, for some kind of permission error:

      File "/var/tmp/bind10/bind10-3/tests/lettuce/features/terrain/loadzone.py", line 35, in run_loadzone
        subprocess.PIPE, subprocess.PIPE)
      File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
        errread, errwrite)
      File "/usr/lib64/python2.7/subprocess.py", line 1308, in _execute_child
        raise child_exception
    OSError: [Errno 13] Permission denied

Hmm, I have no idea about the reason for that. Other terrain script
should also use installed bindctl, and both cases are invoked in the
same way. So if other tests such as bindctl_commands_feature works
I don't see why it doesn't work for loadzone. Does your test
environment have b10-loadzone installed somewhere in the PATH with
executable permission?

I don't know why it fails. I do have bind10 installed, but to a strange location. Neither b10-loadzone nor bindctl is in path. But I run it through the run_lettuce.sh script, it might be related. And maybe it might be related to this in setup_intree_bind10.sh:

PATH=/home/vorner/work/bind10/src/bin/bind10:/home/vorner/work/bind10/src/bin/bindctl:/home/vorner/work/bind10/src/bin/msgq:/home/vorner/work/bind10/src/bin/auth:/home/vorner/work/bind10/src/bin/resolver:/home/vorner/work/bind10/src/bin/cfgmgr:/home/vorner/work/bind10/src/bin/cmdctl:/home/vorner/work/bind10/src/bin/stats:/home/vorner/work/bind10/src/bin/xfrin:/home/vorner/work/bind10/src/bin/xfrout:/home/vorner/work/bind10/src/bin/zonemgr:/home/vorner/work/bind10/src/bin/ddns:/home/vorner/work/bind10/src/bin/dhcp6:/home/vorner/work/bind10/src/bin/sockcreator:$PATH
export PATH

I actually didn't try to lower the value, but I found I'd still need
at least 4 seconds to reproduce the original problem with the test.
Of course, it highly depends on local environment, so 5 seconds may be
too long for some, and might even be too short for others. In any
case the problem shouldn't happen how long we delay it, and in that
sense any value is okay as long as it catches any future regression
like that. For now, I've kept the same amount of delay with some
explanatory comments.

-
Is it so slow it actually keeps running and generating messages to the buffer for the whole 4 seconds?

comment:14 in reply to: ↑ 13 Changed 7 years ago by jinmei

Replying to vorner:

Hmm, I have no idea about the reason for that. Other terrain script
should also use installed bindctl, and both cases are invoked in the
same way. So if other tests such as bindctl_commands_feature works
I don't see why it doesn't work for loadzone. Does your test
environment have b10-loadzone installed somewhere in the PATH with
executable permission?

I don't know why it fails. I do have bind10 installed, but to a strange location. Neither b10-loadzone nor bindctl is in path. But I run it through the run_lettuce.sh script, it might be related. And maybe it might be related to this in setup_intree_bind10.sh:

PATH=/home/vorner/work/bind10/src/bin/bind10:/home/vorner/work/bind10/src/bin/bindctl:/home/vorner/work/bind10/src/bin/msgq:/home/vorner/work/bind10/src/bin/auth:/home/vorner/work/bind10/src/bin/resolver:/home/vorner/work/bind10/src/bin/cfgmgr:/home/vorner/work/bind10/src/bin/cmdctl:/home/vorner/work/bind10/src/bin/stats:/home/vorner/work/bind10/src/bin/xfrin:/home/vorner/work/bind10/src/bin/xfrout:/home/vorner/work/bind10/src/bin/zonemgr:/home/vorner/work/bind10/src/bin/ddns:/home/vorner/work/bind10/src/bin/dhcp6:/home/vorner/work/bind10/src/bin/sockcreator:$PATH
export PATH

Ah, that's probably it. I'm still not sure why it resulted in
"permission denied", but I confirmed it failed locally if I moved the
installed b10-loadzone (the error was "no such file", which made more
sense to me). I've added a path to loadzone in the PATH setting.

I actually didn't try to lower the value, but I found I'd still need
at least 4 seconds to reproduce the original problem with the test.
Of course, it highly depends on local environment, so 5 seconds may be
too long for some, and might even be too short for others. In any
case the problem shouldn't happen how long we delay it, and in that
sense any value is okay as long as it catches any future regression
like that. For now, I've kept the same amount of delay with some
explanatory comments.

-
Is it so slow it actually keeps running and generating messages to the buffer for the whole 4 seconds?

Yes. This is a log:

2013-05-20 09:23:09.657 INFO  [b10-xfrout.xfrout/24060] XFROUT_XFR_TRANSFER_STARTED AXFR client [::1]:56977: transfer of zone example.org/IN has started
2013-05-20 09:23:13.171 ERROR [b10-xfrout.xfrout/24060] XFROUT_XFR_TRANSFER_ERROR AXFR client [::1]:56977: error transferring zone example.org/IN: [Errno 35] Resource temporarily unavailable

That's probably partially because xfrout in the test is also quite
slow, dumping log messages at the highest debug level. And, probably
for the same reason, 5 seconds of wait time are not that big a
bottleneck in the entire test scenario: with the 5 seconds delay, it
took about 14 seconds; if I changed it to 1 second, it was just
reduced to 12.

comment:15 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:16 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Is it possible you forgot to add the setup_intree_bind10.sh file? I see it added to a makefile, but not to git and lettuce now fails with:

    When I do a customized AXFR transfer of example.org                                         # features/terrain/tran    When I do a customized AXFR transfer of example.org                                         # features/terrain/transfer.py:88
    Traceback (most recent call last):
      File "/home/vorner/.local/lib64/python2.7/site-packages/lettuce/core.py", line 117, in __call__
        ret = self.function(self.step, *args, **kw)
      File "/var/tmp/bind10/bind10-1/tests/lettuce/features/terrain/transfer.py", line 129, in perform_custom_axfr
        "stderr:\n" + str(stderr)
    AssertionError: xfr-client exit code: 2
    stdout:
    stderr:
    run_python-tool.sh: 22: .: setup_intree_bind10.sh: not found
    
    Then transfer result should have 101 rrs    

comment:17 in reply to: ↑ 16 ; follow-up: Changed 7 years ago by jinmei

Replying to vorner:

Is it possible you forgot to add the setup_intree_bind10.sh file? I see it added to a makefile, but not to git and lettuce now fails with:

The .sh file should be auto-generated by configure.ac (and now also
automake) from .sh.in, so shouldn't be in the repository. Did you
run autoreconf and make again?

comment:18 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:19 in reply to: ↑ 17 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Is it possible you forgot to add the setup_intree_bind10.sh file? I see it added to a makefile, but not to git and lettuce now fails with:

The .sh file should be auto-generated by configure.ac (and now also
automake) from .sh.in, so shouldn't be in the repository. Did you
run autoreconf and make again?

Yes, I did. I always run the build & test from completely clean checkout.

Anyway, I checked and the file is there. But still, the tools does not find it. Is it possible the run_python-tool.sh is run from different directory?

Why is the script even needed? Because if you run lettuce through the run_lettuce.sh, the PATH should already be set up and propagated through.

comment:20 in reply to: ↑ 19 Changed 7 years ago by jinmei

Replying to vorner:

Anyway, I checked and the file is there. But still, the tools does not find it. Is it possible the run_python-tool.sh is run from different directory?

I don't think so; there seem to be other places that assume where
run_lettuce.sh is run, so if it's run under different directory other
things will be broken too.

Comparing these two run_ scripts, I wonder we might have to clarify
the (relative) path for the setup script, depending on details of
particular /bin/sh implementations. Does the attached diff
solve your issue?

Why is the script even needed? Because if you run lettuce through the run_lettuce.sh, the PATH should already be set up and propagated through.

It'll be needed when we test installed set of BIND 10 (specifying -I
for run_lettuce.sh).

Changed 7 years ago by jinmei

comment:21 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:22 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei
  • Total Hours changed from 0 to 1.96

Yes, the attached patch helps (for no reason apparent to me, as . should not use $PATH or anything).

So, please, apply it and merge.

comment:23 in reply to: ↑ 22 Changed 7 years ago by jinmei

Replying to vorner:

Yes, the attached patch helps (for no reason apparent to me, as . should not use $PATH or anything).

So, please, apply it and merge.

done, closing.

comment:24 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 1.96 to 13.21
Note: See TracTickets for help on using tickets.