Opened 7 years ago

Closed 7 years ago

#2252 closed enhancement (fixed)

Implement counters into Xfrin (1/3)

Reported by: naokikambe Owned by: naokikambe
Priority: medium Milestone: Sprint-20130514
Component: xfrin Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 5 Add Hours to Ticket: 0
Total Hours: 270.46 Internal?: no

Description (last modified by naokikambe)

According to StatisticsItems, implement the following statistics items to Xfrin in the same way as Xfrout (#2158, #2222, and #2225).

ModuleItemin BIND 9 Statistics XML (tree representation under isc/bind/statistics)Description from BIND 9 ManualImplemented in BIND 10?
Xfrin
xfrin.[zonename].soaoutv4server/zsstat/SOAOutv4Number of IPv4 SOA queries sent from Xfrinno
xfrin.[zonename].soaoutv6server/zsstat/SOAOutv6Number of IPv6 SOA queries sent from Xfrinno
xfrin.[zonename].axfrreqv4server/zsstat/AXFRReqv4Number if IPv4 AXFR requests sent from Xfrinno
xfrin.[zonename].axfrreqv6server/zsstat/AXFRReqv6Number if IPv6 AXFR requests sent from Xfrinno
xfrin.[zonename].ixfrreqv4server/zsstat/IXFRReqv4Number if IPv4 IXFR requests sent from Xfrinno
xfrin.[zonename].ixfrreqv6server/zsstat/IXFRReqv6Number if IPv6 IXFR requests sent from Xfrinno
xfrin.[zonename].xfrsuccessserver/zsstat/XfrSuccessNumber of zone transfer requests succeededno
xfrin.[zonename].xfrfailserver/zsstat/XfrFailNumber of zone transfer requests failedno
xfrin.[zonename].time_to_ixfr(none)Elapsed time of the last IXFRno
xfrin.[zonename].time_to_axfr(none)Elapsed time of the last AXFRno

The class XfrinConnection in Xfrin should be revised for counting the above items. Besides a counter class for Xfrin should be introduced by the same way as XfroutCounter. The related unittest and the lettuce test should be also revised. This ticket depends on #2225 because the same statistics library as Xfrout can be used.

BTW most of per-zone items for Xfrin are in the above list. Regarding the rest of per-zone items it isn't clear whether something like notifyinv4, notifyrej, time_to_reload, etc. can be collected by Xfrin. Regarding notifyinv4 current Xfrin doesn't seem to handle an incoming notifier.

Subtickets

Change History (45)

comment:1 Changed 7 years ago by naokikambe

  • Description modified (diff)
  • Owner set to naokikambe
  • Status changed from new to accepted

comment:2 Changed 7 years ago by naokikambe

A proposed ChangeLog entry:

nnn.    [func]          naokikambe
        Added Xfrin statistics items: soaoutv4, soaoutv6, axfrreqv4, axfrreqv6,
        ixfrreqv4, ixfrreqv6, xfrsuccess, xfrfail, time_to_ixfr, and
        time_to_axfr.  These are per-zone type counters.  Their values can be
        obtained with zone names by invoking "Stats show Xfrin" via bindctl
        while Xfrin is running.
        (Trac #2252, git TBD)

comment:3 Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to UnAssigned
  • Status changed from accepted to reviewing

The branch can be reviewed.

The new Xfrin counters are implemented in the statistics library which was introduced in #2225.

comment:4 Changed 7 years ago by shane

  • Milestone changed from New Tasks to StatsRedesign

comment:5 Changed 7 years ago by shane

  • Milestone changed from StatsRedesign to Next-Sprint-Proposed

comment:6 Changed 7 years ago by naokikambe

  • Owner changed from UnAssigned to naokikambe
  • Status changed from reviewing to accepted

In #2252, we decided to complete a statistics library at first. That includes changes in this ticket. That is, the branch trac2252 needs to be updated. That will be started after completion of #2225. So this ticket's status is being reverted to being accepted.

comment:7 Changed 7 years ago by naokikambe

  • Milestone set to Next-Sprint-Proposed

comment:8 Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to UnAssigned
  • Status changed from accepted to reviewing

trac2252 (fc40317) is ready for reviewing.

I've applied changes added on #2225 into the branch.

comment:9 Changed 7 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130319

comment:10 Changed 7 years ago by jinmei

  • Owner changed from UnAssigned to jinmei

comment:11 follow-up: Changed 7 years ago by jinmei

general

  • Don't we need a changelog entry?
  • after struggling with figuring out how XfrinConnection statistics work I realized it relied on a global (singleton) statistics object for the module. I don't know what kind of prior discussion brought that decision, but I still think that's a bad practice. Singletons make the code more untestable and more fragile (as it often builds hidden dependency between separated modules; when one module modifies a singleton it implicitly affects the other), trading its superficial convenience. IMO it should be avoided unless there's something inherently impossible without using singletone. Is there any such strong reason in this case?

b10-xfrin.xml

  • This looks awkward to me in various points:
                        <term>time_to_ixfr</term>
                        <listitem><simpara>
                          Elapsed time in second to do the last IXFR
                        </simpara></listitem>
    
    First off, I don't understand the description. Second, with my understanding what this counter is from the code, "time_to_ixfr" doesn't seem to represent what it means. My understanding of this counter is "Elapsed time since the start of currently running IXFR, or, if it's not currently running, the duration of the previous IXFR" (correct?). If so, I'd rename it, e.g., "latest_ixfr_duration". Same for AXFR. And also update the description.
  • as for time_to_ixfr, what if there's never been an xfr for the zone?

xfrin.py.in

  • get_ipver_str: I suggest making it "protected" by renaming _get_ipver_str. I'd also clarify it's mostly private except for tests (which is my understanding).
  • get_ipver_str: what if family is neither INET6 nor INET?
  • _check_soa_serial: shouldn't this be distinguished per name and class? both class IN and CH can have "example.com", for example.
            # count soaoutv4 or soaoutv6 requests
            self._counters.inc('zones', self._zone_name.to_text(),
                               'soaout' + self.get_ipver_str())
    
    This comment applies to all other counters and timers.
  • do_xfrin: not directly related to this branch, but using this opportunity I suggest simplifying this part:
                # Right now RRType.[IA]XFR().to_text() is 'TYPExxx', so we need
                # to hardcode here.
                req_str = 'IXFR' if request_type == RRType.IXFR else 'AXFR'
    
    as follows:
                req_str = request_type.to_text()
    
    we should now be able to do that.
  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.
  • do_xfrin: maybe a matter of taste, but the final clause could be simplified:
                counter_dict = {XFRIN_OK: 'xfrsuccess', XFRIN_FAIL, 'xfrfail'}
                self._counters.inc('zones', self._zone_name.to_text(),
                                   counter_dict[ret])
    
  • command_handler: why do we bother to log it here?
                    logger.debug(DBG_XFRIN_TRACE, XFRIN_RECEIVED_GETSTATS_COMMAND,
                                 str(answer))
    
    I'm not necessarily opposed to logging this event per se, but it's awkward to see only this command is logged especially if it's at a debug level.

xfrin_test.py

  • in general, I see many similar patters repeated like this:
            self.assertEqual(1, self.conn._counters.get('zones', TEST_ZONE_NAME_STR,
                                                        'soaoutv4'))
    
    I'd like to unify them.
  • test_ipver_str: do we have to reset self.conn.socket? I guess it's re-constructed for every test case.
  • test_soacheck doesn't confirm soaoutv6 is incremented.
  • test_do_xfrin: likewise. it's redundant, and yet incomplete.
  • isn't this mostly trivial?
            self.assertGreaterEqual(self.conn._counters.get('zones',
                                                            TEST_ZONE_NAME_STR,
                                                            'time_to_axfr'),
                                    0.0)
    
    (the only other case is failure with exception and a negative value). This seems to suggest we should revisit how we test these in the first place. Checking the resulting counter values via get() is indirect and less reliable. It's probably better to steal inc() etc and confirm it's called at the expected timing on expected counter(s). It's not only for this case, but in general.

b10-xfrin.xml

I made some suggested editorial changes directly. I also suggest
updating time_to_axfr and time_to_ixfr (see above). Same for
xfrin.spec.

lettuce tests

  • repeating this pattern is noisy and less efficient (we need to issue bindctl command for each line:
        Then the statistics counter notifyoutv4 for the zone _SERVER_ should be 0
        Then the statistics counter notifyoutv6 for the zone _SERVER_ should be 0
        ...
    
    I suggest unifying them in a single bindctl command as much as possible. see queries.feature.
  • also, overall, (it appears) only a subset of counters are actually checked to be incremented. is it okay?
  • can we always assume this?
        Then the statistics counter time_to_axfr for the zone _SERVER_ should be greater than 0.0
        Then the statistics counter time_to_axfr for the zone example.org. should be greater than 0.0
    
    What if xfr is completed super fast?
  • should we repeat the initial statistics again?
        #
        # Test6 for Xfrin statistics
        #
        # check initial statistics
        #
    
    If so, I'd consider unifying them somehow.
  • why "greater than 0"? can't we be more specific?
        Then the statistics counter soaoutv6 for the zone _SERVER_ should be greater than 0
        Then the statistics counter soaoutv6 for the zone example.org. should be greater than 0
    
  • is this correct?
    -    # Test1 for Xfrout statistics
    +    # Test9 for Xfrout statistics
    
    This is in a new scenario.

comment:12 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:13 in reply to: ↑ 11 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Hello, thank you for reviewing!

Replying to jinmei:

general

  • Don't we need a changelog entry?

Of course, we need. Please see this.

  • after struggling with figuring out how XfrinConnection statistics work I realized it relied on a global (singleton) statistics object for the module. I don't know what kind of prior discussion brought that decision, but I still think that's a bad practice. Singletons make the code more untestable and more fragile (as it often builds hidden dependency between separated modules; when one module modifies a singleton it implicitly affects the other), trading its superficial convenience. IMO it should be avoided unless there's something inherently impossible without using singletone. Is there any such strong reason in this case?

Sorry, but I can't understand well whether there are such harmful things from the current code. However, regarding tests of the statistics library, they work well so far on the build farm as long as I see. Can we consider about that if some problem occurs which you worry about in the future?

b10-xfrin.xml

  • This looks awkward to me in various points:
                        <term>time_to_ixfr</term>
                        <listitem><simpara>
                          Elapsed time in second to do the last IXFR
                        </simpara></listitem>
    
    First off, I don't understand the description. Second, with my understanding what this counter is from the code, "time_to_ixfr" doesn't seem to represent what it means. My understanding of this counter is "Elapsed time since the start of currently running IXFR, or, if it's not currently running, the duration of the previous IXFR" (correct?). If so, I'd rename it, e.g., "latest_ixfr_duration". Same for AXFR. And also update the description.

The latter one is correct. I revised the names as you pointed out. Could you see the following commit?

75445a1 [2252] rename the item name and update descriptions
  • as for time_to_ixfr, what if there's never been an xfr for the zone?

It shows still the default value, 0.0.

xfrin.py.in

  • get_ipver_str: I suggest making it "protected" by renaming _get_ipver_str. I'd also clarify it's mostly private except for tests (which is my understanding).

I made it protected as you mentioned. Could you see the following commit?

a1b3cfd [2252] rename the helper method name
  • get_ipver_str: what if family is neither INET6 nor INET?

I considered the case where None value is returned. Could you see the following commit?

4de51f8 [2252] care for the socket family which is neither INET6 nor INET
  • _check_soa_serial: shouldn't this be distinguished per name and class? both class IN and CH can have "example.com", for example.
            # count soaoutv4 or soaoutv6 requests
            self._counters.inc('zones', self._zone_name.to_text(),
                               'soaout' + self.get_ipver_str())
    
    This comment applies to all other counters and timers.

Probably should be considered. As an example, we might need to introduce a classes layer over the current zones layer in the statistics data structure. But this is too big to handle in this ticket. Could we consider in other ticket?

  • do_xfrin: not directly related to this branch, but using this opportunity I suggest simplifying this part:
                # Right now RRType.[IA]XFR().to_text() is 'TYPExxx', so we need
                # to hardcode here.
                req_str = 'IXFR' if request_type == RRType.IXFR else 'AXFR'
    
    as follows:
                req_str = request_type.to_text()
    
    we should now be able to do that.

I applied it. Could you see the following commit?

01b2a2f [2252] simplify complicated if-else lines (unrelated to #2252)
  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.

I think it is okay that the timer count would be discarded if exception is raised.

  • do_xfrin: maybe a matter of taste, but the final clause could be simplified:
                counter_dict = {XFRIN_OK: 'xfrsuccess', XFRIN_FAIL, 'xfrfail'}
                self._counters.inc('zones', self._zone_name.to_text(),
                                   counter_dict[ret])
    

I revised the code as you proposed. Could you see the following commit?

2cfdda6 [2252] simplify complicated if-elif lines using a dictionary
  • command_handler: why do we bother to log it here?
                    logger.debug(DBG_XFRIN_TRACE, XFRIN_RECEIVED_GETSTATS_COMMAND,
                                 str(answer))
    
    I'm not necessarily opposed to logging this event per se, but it's awkward to see only this command is logged especially if it's at a debug level.

The message is used in the lettuce tests. but it's too annoying, so I removed { poll-interval: 1} from stats config in lettuce. Could you see the following commit?

4d94e85 [2252] remove poll-interval from stats config

xfrin_test.py

  • in general, I see many similar patters repeated like this:
            self.assertEqual(1, self.conn._counters.get('zones', TEST_ZONE_NAME_STR,
                                                        'soaoutv4'))
    
    I'd like to unify them.

I unified them. Could you see the following commit?

936b3ff [2252] unify redundant assertion lines
  • test_ipver_str: do we have to reset self.conn.socket? I guess it's re-constructed for every test case.

I revised the code to reset the socket object within the test case. Could you see the following commit?

0ef3b7a [2252] test _get_ipver_str() by creating MockXfrinConnection object
  • test_soacheck doesn't confirm soaoutv6 is incremented.

I added test cases for IPv6. Could you see the following commit?

7fe200c [2252] add tests for counters of XfrinConnection in IPv6 address
  • test_do_xfrin: likewise. it's redundant, and yet incomplete.

I included in the above commits.

  • isn't this mostly trivial?
            self.assertGreaterEqual(self.conn._counters.get('zones',
                                                            TEST_ZONE_NAME_STR,
                                                            'time_to_axfr'),
                                    0.0)
    
    (the only other case is failure with exception and a negative value). This seems to suggest we should revisit how we test these in the first place. Checking the resulting counter values via get() is indirect and less reliable. It's probably better to steal inc() etc and confirm it's called at the expected timing on expected counter(s). It's not only for this case, but in general.

Thank you for implying. but regarding this part, I changed it to a bit more accurate check by using assertAlmostEqual. Could you see the following commit?

fcf8b9d [2252] replace assertGreaterEqual with assertAlmostEqual

b10-xfrin.xml

I made some suggested editorial changes directly. I also suggest
updating time_to_axfr and time_to_ixfr (see above). Same for
xfrin.spec.

Thank you for revising. I also updated the man page. The changes were included in the above commit.

lettuce tests

  • repeating this pattern is noisy and less efficient (we need to issue bindctl command for each line:
        Then the statistics counter notifyoutv4 for the zone _SERVER_ should be 0
        Then the statistics counter notifyoutv6 for the zone _SERVER_ should be 0
        ...
    
    I suggest unifying them in a single bindctl command as much as possible. see queries.feature.

I consolidated such same repeated patterns by following the way of queries.feature. I update the step function to adopt this. Could you see the following commits?

85aef71 [2252] add evaluating a minimum or maxmimum value
55a49c6 [2252] unify repeating similar patterns
aca3456 [2252] remove an obsoleted statistics step method
  • also, overall, (it appears) only a subset of counters are actually checked to be incremented. is it okay?

I added some tests for IPv4. Could you see the following commits?

be67ff2 [2252] add a test scenario for handling incoming notify only in IPv4
d6d0432 [2252] add a test scenario for Xfr request rejected in IPv4
  • can we always assume this?
        Then the statistics counter time_to_axfr for the zone _SERVER_ should be greater than 0.0
        Then the statistics counter time_to_axfr for the zone example.org. should be greater than 0.0
    
    What if xfr is completed super fast?

I considered the case that the transferring would be completed in 0.0 second. This change was included in the above commits.

  • should we repeat the initial statistics again?
        #
        # Test6 for Xfrin statistics
        #
        # check initial statistics
        #
    
    If so, I'd consider unifying them somehow.

I unified them. Could you see the following commit?

d69fb64 [2252] unify multiple similar checks of initial statistics into a step method
  • why "greater than 0"? can't we be more specific?
        Then the statistics counter soaoutv6 for the zone _SERVER_ should be greater than 0
        Then the statistics counter soaoutv6 for the zone example.org. should be greater than 0
    

I reconsidered such expected values for ranges which the counter values can be. Could you see the following commit?

655179c [2252] reconsider expected values of counters for more specific values
  • is this correct?
    -    # Test1 for Xfrout statistics
    +    # Test9 for Xfrout statistics
    
    This is in a new scenario.

I reset the numbers of tests. Could you see the following commit?

cb5eb7d [2252] reset numbers in the scenario tests

Regards,

comment:14 in reply to: ↑ 13 ; follow-up: Changed 7 years ago by jinmei

Replying to naokikambe:

general

  • Don't we need a changelog entry?

Of course, we need. Please see this.

Oops, sorry, I overlooked it. On the proposed text, I suspect
enumerating all counters is a bit too verbose. I'd simplify it to,
e.g.:

nnn.    [func]          naokikambe
        Added Xfrin statistics items such as the number of successful
	transfers.  These are per-zone type counters.  Their values can be
        obtained with zone names by invoking "Stats show Xfrin" via bindctl
        while Xfrin is running.
        (Trac #2252, git TBD)
  • after struggling with figuring out how XfrinConnection statistics work I realized it relied on a global (singleton) statistics object for the module. I don't know what kind of prior discussion brought that decision, but I still think that's a bad practice. [...]

Sorry, but I can't understand well whether there are such harmful things from the current code. However, regarding tests of the statistics library, they work well so far on the build farm as long as I see. Can we consider about that if some problem occurs which you worry about in the future?

It's more about design matter so the harm may not be so obvious or
even a subjective matter. I'm okay with excluding it for this ticket,
but I'd argue this should be at least discussed (and if we agree)
resolved before using the current design further. So I'd suggest
creating a ticket to discuss and fix it, and do it before the
subsequent counter tickets for xfrin or others.

b10-xfrin.xml

  • about latest_*xfr_duration: I missed the currently ongoing one doesn't show up in the statistics. In that case "last_*xfr_duration" would be better than latest.
  • as for time_to_ixfr, what if there's never been an xfr for the zone?

It shows still the default value, 0.0.

It seems to be suboptimal in that using a magic number (especially in
the valid range - 0.0 should still be possible if everything is done
super quickly) to present a kind of "None" is always a source of
trouble. I'd consider something that can explicitly mean "no
completed xfrs". And, especially if we use a magic value, document it
somewhere.

xfrin.py.in

  • get_ipver_str: I suggest making it "protected" by renaming _get_ipver_str. I'd also clarify it's mostly private except for tests (which is my understanding).

I made it protected as you mentioned. Could you see the following commit?

a1b3cfd [2252] rename the helper method name

By "clarify" I meant update the docstring too.

  • get_ipver_str: what if family is neither INET6 nor INET?

I considered the case where None value is returned. Could you see the following commit?

4de51f8 [2252] care for the socket family which is neither INET6 nor INET

It's mostly obvious that None is returned, but the point is whether we
should expect it to happen or treat it as an error. I suspect our
implementation should have other cases, so it seems to me more
reasonable to throw an exception in that case.

  • _check_soa_serial: shouldn't this be distinguished per name and class? both class IN and CH can have "example.com", for example.
            # count soaoutv4 or soaoutv6 requests
            self._counters.inc('zones', self._zone_name.to_text(),
                               'soaout' + self.get_ipver_str())
    
    This comment applies to all other counters and timers.

Probably should be considered. As an example, we might need to introduce a classes layer over the current zones layer in the statistics data structure. But this is too big to handle in this ticket. Could we consider in other ticket?

I'm okay with differing it, but it seems to be a quite fundamental
design error. So I'd argue that should be fixed before we rely on the
current format further. The longer we postpone it, the more painful
it is to fix.

  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.

I think it is okay that the timer count would be discarded if exception is raised.

Is it really discarded? Shouldn't it be left until next time we set
the timer for the same zone? If it's left, that doesn't seem to be
clear. Also, if the intent is to only count the duration of
successfully completed transfers (even excluding the case of
XfrinZoneUptodate), I believe that should be documented.

  • do_xfrin: maybe a matter of taste, but the final clause could be simplified:
                counter_dict = {XFRIN_OK: 'xfrsuccess', XFRIN_FAIL, 'xfrfail'}
                self._counters.inc('zones', self._zone_name.to_text(),
                                   counter_dict[ret])
    

I revised the code as you proposed. Could you see the following commit?

2cfdda6 [2252] simplify complicated if-elif lines using a dictionary

Why does it check if ret is in counter_dict?

            counter_dict = {XFRIN_OK: 'xfrsuccess', XFRIN_FAIL: 'xfrfail'}
            if ret in counter_dict:
                self._counters.inc('zones', self._zone_name.to_text(),
                                   counter_dict[ret])

Is there a case where this if doesn't hold?

  • command_handler: why do we bother to log it here?

[...]

I'm not necessarily opposed to logging this event per se, but it's
awkward to see only this command is logged especially if it's at a
debug level.

The message is used in the lettuce tests. but it's too annoying, so I removed { poll-interval: 1} from stats config in lettuce. Could you see the following commit?

4d94e85 [2252] remove poll-interval from stats config

It doesn't really answer my point. Imagine someone reviews this
code. It'd be quite awkward only this case of command has a debug
log. I'd rather generalize it with a command line parameter or add
log messages to all command separately (former is probably better
because the latter approach isn't very much scalable).

xfrin_test.py

  • in general, I see many similar patters repeated like this:
  • test_soacheck doesn't confirm soaoutv6 is incremented.
  • test_do_xfrin: likewise. it's redundant, and yet incomplete.

Regarding these, I guess we can even more unified them, and there are
still missing cases that are not tested: at least ixfrreqv4
and ixfrreqv6. If I were to implement the test, I'd introduce a
separate test case, not extending an existing do_xfrin test, and only
concentrate on statistics changes. So, I'd, for example, steal
_send_query or _handle_xfrin_responses (because what it does
doesn't matter in the statistics tests), and only checks statistics
counter values after calling the tweaked do_xfrin. I'd parameterize
everything so it's clearer all counters are actually tested without
much duplicates. (am I clear?).

I included in the above commits.

  • isn't this mostly trivial?
            self.assertGreaterEqual(self.conn._counters.get('zones',
                                                            TEST_ZONE_NAME_STR,
                                                            'time_to_axfr'),
                                    0.0)
    

[...]

Thank you for implying. but regarding this part, I changed it to a bit more accurate check by using assertAlmostEqual. Could you see the following commit?

fcf8b9d [2252] replace assertGreaterEqual with assertAlmostEqual

It doesn't seem to be improved much. If we think a complete tests are
too much for this task, I'd rather remove this counter and complete
the implementation and tests separately. Otherwise, I'll do a better
test covering various cases such as not counting it on exception, etc.
What I don't like to do is to allow poorly test code to be merged with
the excuse of "testing it is too heavy".

lettuce tests

  • First off, test failed. I thought it might be related to the issue of #2790, but at least this one doesn't seem to be related to it:
        When I query statistics zones of bind10 module Xfrin with cmdctl                                                # features/terrain/bind10_control.py:367
        The statistics counters are 0 in category .Xfrin.zones except for the follow    The statistics counters are 0 in category .Xfrin.zones except for the following items                           # features/terrain/bind10_control.py:383
          | item_name                         | item_value | min_value |
          | _SERVER_.soaoutv6                 | 1          |           |
          | _SERVER_.axfrreqv6                | 1          |           |
          | _SERVER_.xfrsuccess               | 1          |           |
          | _SERVER_.latest_axfr_duration     |            | 0.0       |
          | example.org..soaoutv6             | 1          |           |
          | example.org..axfrreqv6            | 1          |           |
          | example.org..xfrsuccess           | 1          |           |
          | example.org..latest_axfr_duration |            | 0.0       |
        Traceback (most recent call last):
          File "/Library/Python/2.7/site-packages/lettuce/core.py", line 125, in __call__
            ret = self.function(self.step, *args, **kw)
          File "/Users/jinmei/src/isc/git/bind10-2252/tests/lettuce/features/terrain/bind10_control.py", line 419, in check_statistics_items
            (name, found, value)
        AssertionError: Statistics item .Xfrin.zones._SERVER_.soaoutv6 has unexpected value 2 (expect 1)
    
    and some others still fail even if I applied the patch Jelte mentioned.
  • check_statistics_items: it seems that the docstring has to be updated.
  • check_init_statistics: if it's xfr specific, it seems it should be named accordingly. Also, it seems to have been somewhere other than bind10_control.py (from a quick look it's for more generic modules; xfrs are too specific in that sense).
  • check_init_statistics: I'm not sure if these are 100% correct:
            step.given(check_str)
    ...
            step.given(check_str)
    
    check_str would contain a trailing '%s' at this point, correct? if so, is that intentional?
  • xfrin_notify_handling.feature: I don't understand the purpose of this test:
        check initial statistics for Xfrout
        When I query statistics socket of bind10 module Xfrout with cmdctl port 47804
        The statistics counters are 0 in category .Xfrout.socket.unixdomain except for the following items
          | item_name | min_value | max_value |
          | open      |         0 |         1 |
    
    why xfrout test in an "xfrin feature"? why not a specific expected value but min-max? (I then noticed there are other xfrout tests here...why in xfrin tests?)
  • it's not obvious why these are specified with "min". please add comments on why.
        The statistics counters are 0 in category .Xfrout.zones except for the following items
          | item_name                | item_value | min_value |
          | _SERVER_.notifyoutv4     |          5 |           |
          | _SERVER_.xfrrej          |            |         1 |
          | example.org..notifyoutv4 |          5 |           |
          | example.org..xfrrej      |            |         1 |
    
  • same for this:
          | item_name              | min_value | max_value |
          | _SERVER_.soaoutv4      |         1 |         3 |
          | _SERVER_.axfrreqv4     |         1 |         3 |
    ...
    

comment:15 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:16 in reply to: ↑ 14 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Replying to jinmei:

Oops, sorry, I overlooked it. On the proposed text, I suspect
enumerating all counters is a bit too verbose. I'd simplify it to,
e.g.:

Thank you! That sounds good to me.

It's more about design matter so the harm may not be so obvious or
even a subjective matter. I'm okay with excluding it for this ticket,
but I'd argue this should be at least discussed (and if we agree)
resolved before using the current design further. So I'd suggest
creating a ticket to discuss and fix it, and do it before the
subsequent counter tickets for xfrin or others.

I'm okay with redesigning if the current library has some problem. But what ticket should I create? What contents should I write in the ticket? Sorry, I couldn't understand well about problems. Could you create it yourself?

b10-xfrin.xml

  • about latest_*xfr_duration: I missed the currently ongoing one doesn't show up in the statistics. In that case "last_*xfr_duration" would be better than latest.
  • as for time_to_ixfr, what if there's never been an xfr for the zone?

It shows still the default value, 0.0.

It seems to be suboptimal in that using a magic number (especially in
the valid range - 0.0 should still be possible if everything is done
super quickly) to present a kind of "None" is always a source of
trouble. I'd consider something that can explicitly mean "no
completed xfrs". And, especially if we use a magic value, document it
somewhere.

I updated descriptions about the default value 0.0. Could you see the following commit?

26a9ee9 [2252] update descriptions of latest_ixfr_duration and latest_axfr_duration about their default values

xfrin.py.in

  • get_ipver_str: I suggest making it "protected" by renaming _get_ipver_str. I'd also clarify it's mostly private except for tests (which is my understanding).

I made it protected as you mentioned. Could you see the following commit?

a1b3cfd [2252] rename the helper method name

By "clarify" I meant update the docstring too.

I updated the docstring. Could you see this?

ad8e7ea [2252] update docstring of _get_ipver_str() about accessing
  • get_ipver_str: what if family is neither INET6 nor INET?

I considered the case where None value is returned. Could you see the following commit?

4de51f8 [2252] care for the socket family which is neither INET6 nor INET

It's mostly obvious that None is returned, but the point is whether we
should expect it to happen or treat it as an error. I suspect our
implementation should have other cases, so it seems to me more
reasonable to throw an exception in that case.

I updated the method and the related test to raise an exception. Could see the following commit for details?

d1d37e1 [2252] _get_ipver_str() raises a ValueError exception on address families other than AF_INET or AF_INET6
  • _check_soa_serial: shouldn't this be distinguished per name and class? both class IN and CH can have "example.com", for example.
            # count soaoutv4 or soaoutv6 requests
            self._counters.inc('zones', self._zone_name.to_text(),
                               'soaout' + self.get_ipver_str())
    
    This comment applies to all other counters and timers.

Probably should be considered. As an example, we might need to introduce a classes layer over the current zones layer in the statistics data structure. But this is too big to handle in this ticket. Could we consider in other ticket?

I'm okay with differing it, but it seems to be a quite fundamental
design error. So I'd argue that should be fixed before we rely on the
current format further. The longer we postpone it, the more painful
it is to fix.

The current data structure is already implemented in Xfrout. I think we should repair altogether at a time in future.

  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.

I think it is okay that the timer count would be discarded if exception is raised.

Is it really discarded? Shouldn't it be left until next time we set
the timer for the same zone? If it's left, that doesn't seem to be
clear. Also, if the intent is to only count the duration of
successfully completed transfers (even excluding the case of
XfrinZoneUptodate), I believe that should be documented.

Strictly speaking, that would be discarded when the next start time is set. I guess the timer wouldn't probably started yet in case of XfrinZoneUptodate. I described about discarding at the following commit.

dc2960e [2252] add notes about the case that the timer is already started but not yet stopped
  • do_xfrin: maybe a matter of taste, but the final clause could be simplified:
                counter_dict = {XFRIN_OK: 'xfrsuccess', XFRIN_FAIL, 'xfrfail'}
                self._counters.inc('zones', self._zone_name.to_text(),
                                   counter_dict[ret])
    

I revised the code as you proposed. Could you see the following commit?

2cfdda6 [2252] simplify complicated if-elif lines using a dictionary

Why does it check if ret is in counter_dict?

            counter_dict = {XFRIN_OK: 'xfrsuccess', XFRIN_FAIL: 'xfrfail'}
            if ret in counter_dict:
                self._counters.inc('zones', self._zone_name.to_text(),
                                   counter_dict[ret])

I realized that we don't need to take care of any other status than XFRIN_OK and XFRIN_FAIL. I misread the code. I revised at the following commit.

5ac7de7 [2252] remove the if condition because we don't need to consider any other status than XFRIN_OK and XFRIN_FAIL

Is there a case where this if doesn't hold?

  • command_handler: why do we bother to log it here?

[...]

I'm not necessarily opposed to logging this event per se, but it's
awkward to see only this command is logged especially if it's at a
debug level.

The message is used in the lettuce tests. but it's too annoying, so I removed { poll-interval: 1} from stats config in lettuce. Could you see the following commit?

4d94e85 [2252] remove poll-interval from stats config

It doesn't really answer my point. Imagine someone reviews this
code. It'd be quite awkward only this case of command has a debug
log. I'd rather generalize it with a command line parameter or add
log messages to all command separately (former is probably better
because the latter approach isn't very much scalable).

I generalized the log id not to be specific to the getstats command. Could you see the following commit?

b0f2aca [2252] replace the log message XFRIN_RECEIVED_COMMAND to the more general id XFRIN_RECEIVED_COMMAND with the com

xfrin_test.py

  • in general, I see many similar patters repeated like this:
  • test_soacheck doesn't confirm soaoutv6 is incremented.
  • test_do_xfrin: likewise. it's redundant, and yet incomplete.

Regarding these, I guess we can even more unified them, and there are
still missing cases that are not tested: at least ixfrreqv4
and ixfrreqv6. If I were to implement the test, I'd introduce a
separate test case, not extending an existing do_xfrin test, and only
concentrate on statistics changes. So, I'd, for example, steal
_send_query or _handle_xfrin_responses (because what it does
doesn't matter in the statistics tests), and only checks statistics
counter values after calling the tweaked do_xfrin. I'd parameterize
everything so it's clearer all counters are actually tested without
much duplicates. (am I clear?).

I removed once all statistics related tests from the existing Xfrin tests. And I added them again as new tests. They are based on TestXfrinConnection. So the change becomes big. Could you see the following commits?

bae4e49 [2252] remove checks of statistics counters from the existing test classes unrelated to statistics tests
208c318 [2252] add tests dedicated to statistics

I included in the above commits.

  • isn't this mostly trivial?
            self.assertGreaterEqual(self.conn._counters.get('zones',
                                                            TEST_ZONE_NAME_STR,
                                                            'time_to_axfr'),
                                    0.0)
    

[...]

Thank you for implying. but regarding this part, I changed it to a bit more accurate check by using assertAlmostEqual. Could you see the following commit?

fcf8b9d [2252] replace assertGreaterEqual with assertAlmostEqual

It doesn't seem to be improved much. If we think a complete tests are
too much for this task, I'd rather remove this counter and complete
the implementation and tests separately. Otherwise, I'll do a better
test covering various cases such as not counting it on exception, etc.
What I don't like to do is to allow poorly test code to be merged with
the excuse of "testing it is too heavy".

I tried to change the above assertion to more accurate one by using assertGreaterEqual. It compares at 3 decimal places. This change is included in the above commits.

lettuce tests

  • First off, test failed. I thought it might be related to the issue of #2790, but at least this one doesn't seem to be related to it:
        When I query statistics zones of bind10 module Xfrin with cmdctl                                                # features/terrain/bind10_control.py:367
        The statistics counters are 0 in category .Xfrin.zones except for the follow    The statistics counters are 0 in category .Xfrin.zones except for the following items                           # features/terrain/bind10_control.py:383
          | item_name                         | item_value | min_value |
          | _SERVER_.soaoutv6                 | 1          |           |
          | _SERVER_.axfrreqv6                | 1          |           |
          | _SERVER_.xfrsuccess               | 1          |           |
          | _SERVER_.latest_axfr_duration     |            | 0.0       |
          | example.org..soaoutv6             | 1          |           |
          | example.org..axfrreqv6            | 1          |           |
          | example.org..xfrsuccess           | 1          |           |
          | example.org..latest_axfr_duration |            | 0.0       |
        Traceback (most recent call last):
          File "/Library/Python/2.7/site-packages/lettuce/core.py", line 125, in __call__
            ret = self.function(self.step, *args, **kw)
          File "/Users/jinmei/src/isc/git/bind10-2252/tests/lettuce/features/terrain/bind10_control.py", line 419, in check_statistics_items
            (name, found, value)
        AssertionError: Statistics item .Xfrin.zones._SERVER_.soaoutv6 has unexpected value 2 (expect 1)
    
    and some others still fail even if I applied the patch Jelte mentioned.

Even if I didn't apply the Jelte's patch, the related tests passed on my environment. And I haven't changed these thresholds since the first time the codes could be reviewed. Did that always fail on your environment? Can I see the full logs of lettuce tests and stderr? Depending on the results, we might need to reconsider the thresholds.

  • check_statistics_items: it seems that the docstring has to be updated.

I updated the docstring. Could you see the following commit?

be426ec [2252] update description of check_statistics_items() about min_value and max_value
  • check_init_statistics: if it's xfr specific, it seems it should be named accordingly. Also, it seems to have been somewhere other than bind10_control.py (from a quick look it's for more generic modules; xfrs are too specific in that sense).

I generalized it not to be specific to Xfrin nor Xfrout. Could you see the following commits?

c37fd56 [2252] generalize the module name as an argument of check_init_statistics()
0eec3b5 [2252] add a port number of cmdctl as an option of check_init_statistics()
  • check_init_statistics: I'm not sure if these are 100% correct:
            step.given(check_str)
    ...
            step.given(check_str)
    
    check_str would contain a trailing '%s' at this point, correct? if so, is that intentional?

That was wrong tests. I updated at the above commits. Thank you for suggestion.

  • xfrin_notify_handling.feature: I don't understand the purpose of this test:
        check initial statistics for Xfrout
        When I query statistics socket of bind10 module Xfrout with cmdctl port 47804
        The statistics counters are 0 in category .Xfrout.socket.unixdomain except for the following items
          | item_name | min_value | max_value |
          | open      |         0 |         1 |
    

I realized that we don't really need to test them. Initial statistics are already checked just before the test. I removed them at the following commit. Thank you for suggestion.

c11f993 [2252] remove several initial checks for unixdomain socket counters of Xfrout

why xfrout test in an "xfrin feature"? why not a specific expected
value but min-max? (I then noticed there are other xfrout tests
here...why in xfrin tests?)

It might be a feature test specific to Xfrin before, but now it includes various tests including statistics counter tests. So I updated the description in the header of xfrin_notify_handling.feature file. Could you see the following commit?

b448dab [2252] update description of xfrin_notify_handling.feature
  • it's not obvious why these are specified with "min". please add comments on why.
        The statistics counters are 0 in category .Xfrout.zones except for the following items
          | item_name                | item_value | min_value |
          | _SERVER_.notifyoutv4     |          5 |           |
          | _SERVER_.xfrrej          |            |         1 |
          | example.org..notifyoutv4 |          5 |           |
          | example.org..xfrrej      |            |         1 |
    

We cannot expected these rejection counters to be specific values. I add notes about the reason. Please see the following commit?

9344c41 [2252] set the expected max value of rejection counters to 3
  • same for this:
          | item_name              | min_value | max_value |
          | _SERVER_.soaoutv4      |         1 |         3 |
          | _SERVER_.axfrreqv4     |         1 |         3 |
    ...
    

On my environment, they cannot be expected to a specific value. They seem to depend on timing and environment. I added notes about this. Please see the following commit?

14ee53d [2252] add notes about counters of Xfrin: soaoutv4, axfrreqv4, soaoutv6, axfrreqv6, and xfrfail

Regards,

Changed 7 years ago by jinmei

comment:17 in reply to: ↑ 16 ; follow-up: Changed 7 years ago by jinmei

Replying to naokikambe:

It's more about design matter so the harm may not be so obvious or
even a subjective matter. I'm okay with excluding it for this ticket,
but I'd argue this should be at least discussed (and if we agree)
resolved before using the current design further. So I'd suggest
creating a ticket to discuss and fix it, and do it before the
subsequent counter tickets for xfrin or others.

I'm okay with redesigning if the current library has some problem. But what ticket should I create? What contents should I write in the ticket? Sorry, I couldn't understand well about problems. Could you create it yourself?

I've created one: #2883. We might also discuss it in the next team
call.

b10-xfrin.xml

  • about latest_*xfr_duration: I missed the currently ongoing one doesn't show up in the statistics. In that case "last_*xfr_duration" would be better than latest.

Note: I had agreed last_*xfr_duration would be better, but they are
not changed.

  • as for time_to_ixfr, what if there's never been an xfr for the zone?

It shows still the default value, 0.0.

It seems to be suboptimal in that using a magic number (especially in
the valid range - 0.0 should still be possible if everything is done
super quickly) to present a kind of "None" is always a source of
trouble. I'd consider something that can explicitly mean "no
completed xfrs". And, especially if we use a magic value, document it
somewhere.

I updated descriptions about the default value 0.0. Could you see the following commit?

26a9ee9 [2252] update descriptions of latest_ixfr_duration and latest_axfr_duration about their default values

In that case this description isn't accurate. It should be something
like:

                    <listitem><simpara>
                      Duration of the latest IXFR in seconds. 0.0 means it was
                      shorter than 0.05 seconds or there was no
                      successful IXFR.
                    </simpara></listitem>

and, as this description suggests, it's generally a bad idea to
overload a specific value that way; such ambiguity will eventually
bite us(ers). Better approach is to actually revise the code to
eliminate the ambiguity. Do you still think the overloading is the
way to go?

xfrin.py.in

  • get_ipver_str: I suggest making it "protected" by renaming

By "clarify" I meant update the docstring too.

I updated the docstring. Could you see this?

ad8e7ea [2252] update docstring of _get_ipver_str() about accessing

I further clarified it can also be used in tests.

  • _check_soa_serial: shouldn't this be distinguished per name and class?

[...]

I'm okay with differing it, but it seems to be a quite fundamental
design error. So I'd argue that should be fixed before we rely on the
current format further. The longer we postpone it, the more painful
it is to fix.

The current data structure is already implemented in Xfrout. I think we should repair altogether at a time in future.

These should be all fixed, yes, but if you mean by "in future", we
still keep using it, I'd be opposed to it. IMO this level of design
defect should be corrected sooner than later. Just as the fact of
xfrout indicates, if we rely on the flawed design more, fixing them
will simply be more and more painful. For this particular ticket, I
wouldn't block due to this, but I don't like to pick up any other
counter tickets until this issue is fixed.

I created a ticket: #2884.

  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.

I think it is okay that the timer count would be discarded if exception is raised.

Is it really discarded? Shouldn't it be left until next time we set
the timer for the same zone? If it's left, that doesn't seem to be
clear. Also, if the intent is to only count the duration of
successfully completed transfers (even excluding the case of
XfrinZoneUptodate), I believe that should be documented.

Strictly speaking, that would be discarded when the next start time is set. I guess the timer wouldn't probably started yet in case of XfrinZoneUptodate. I described about discarding at the following commit.

dc2960e [2252] add notes about the case that the timer is already started but not yet stopped

By "document", I (probably ... not quite remember after some period
since last comment cycle) meant the man page or something, not (only)
the code comment.

I'm not sure if I don't understand this: "I guess the timer wouldn't
probably started yet in case of XfrinZoneUptodate". You mean it
should have come from _check_soa_serial? But that can also happen
in IXFR.

As for dc2960e, shouldn't this be "not yet stopped"?

+        support multi-threaded use. If the specified timer is already
+        started but not yet started, the last start time is
+        overwritten."""
  • command_handler: why do we bother to log it here?

[...]

I generalized the log id not to be specific to the getstats command. Could you see the following commit?

b0f2aca [2252] replace the log message XFRIN_RECEIVED_COMMAND to the more general id XFRIN_RECEIVED_COMMAND with the com

This change itself looks good. But please reorder the log messages
(use the tool under bind10/tools).

But, while working on #1938 I (happen to) noticed it had a more
fundamental problem in the context of using it in lettuce. The
corresponding lettuce test reads:

    When I query statistics zones of bind10 module Xfrin with cmdctl
    wait for new bind10 stderr message STATS_SEND_STATISTICS_REQUEST
    wait for new bind10 stderr message XFRIN_RECEIVED_COMMAND
    last bindctl output should not contain "error"

    When I query statistics zones of bind10 module Xfrin with cmdctl

This is not reliable, because just seeing XFRIN_RECEIVED_COMMAND
doesn't guarantee the stats daemon received the answer (on the other
hand, checking STATS_SEND_... should be unnecessary because if it
didn't happen we wouldn't see XFRIN_RECEIVED_COMMAND; further, I
suspect it could even cause failure, because log message order from
different processes is not predictable).

What we actually need is something like
"STATS_STATISTICS_REQUEST_DONE" and check that.

Note also that #1938 has been merged, and the meaning of "wait for
new" was clarified in a possibly non compatible way.

xfrin_test.py

I've reviewed the entire change in the branch from the branch point.
The latest version looks much better, but it still seems to have some
issues:

  • the tests for the durations do not always succeed (for me). in general, it's not really good to rely on the real world clock in unit tests because it's too fragile. (see also the next point)
  • AXFR+xfrfail case doesn't seem to be tested. Combining the first and this point, I think it's easier to test if we mock many of the methods used in do_xfrin, and for durations only check start_timer/stop_timer are called at the correct point with the correct parameters (not checking the resulting values).
  • some class attributes (ipver, name_to_counter, master_addrinfo), are introduced while they are essentially instance attributes. using class attributes when they don't have to be so is generally not a good practice because change in one test instance could affect another test. (This is also related to the issue that led to #2883). I suggest making them instance attributes.
  • the intent of each test (such as test_do_soacheck_uptodate) is not so obvious. Please add docstring or comment.
  • the case of test_do_soacheck_uptodate makes me wonder why xfrsuccess is counted even if there's no xfr transaction. Is that okay? (What BIND 9 does in this case?) If that's by design, I suspect its description has to be updated:
                        <listitem><simpara>
                          Number of zone transfer requests succeeded
                        </simpara></listitem>
    
    (since there's actually not even a transfer request in this case).

lettuce tests

  • Just wonder: are the changes in test_spec3.spec necessary?
  • First off, test failed. I thought it might be related to the issue of #2790, but at least this one doesn't seem to be related to it:

[...]

Even if I didn't apply the Jelte's patch, the related tests passed on my environment.

That one is not relevant to Jelte's patch.

And I haven't changed these thresholds since the first time the codes could be reviewed. Did that always fail on your environment? Can I see the full logs of lettuce tests and stderr? Depending on the results, we might need to reconsider the thresholds.

I've attached the stderr log files and the output from lettuce when it
failed. Note that there are two issues:

  • statistics values are not the expected ones. I suspect it's related to the "STATS_STATISTICS_REQUEST_DONE" issue I explained above. And this is not related to Jelte's patch.
  • this synchronization failed
        Then wait for new bind10 stderr message XFRIN_XFR_TRANSFER_STARTED
    
    even with Jelte's patch. On master, I've never seen this happen if that patch is applied.
  • check_statistics_items: it seems that the docstring has to be updated.

I updated the docstring. Could you see the following commit?

be426ec [2252] update description of check_statistics_items() about min_value and max_value

It says 5 columns but only mention 4.

    the scenario. The multiline part has at most 5 columns: item_name,
    item_value, min_value, and max_value. item_name is a relative name
    to category. item_value is an expected value for
  • check_init_statistics: if it's xfr specific, it seems it should be named accordingly. Also, it seems to have been somewhere other than bind10_control.py (from a quick look it's for more generic modules; xfrs are too specific in that sense).

I generalized it not to be specific to Xfrin nor Xfrout. Could you see the following commits?

c37fd56 [2252] generalize the module name as an argument of check_init_statistics()
0eec3b5 [2252] add a port number of cmdctl as an option of check_init_statistics()

This part still seems to be DNS specific:

    step.given(notcontain_str % 'example.org.')

It should probably be parameterized.

  • it's not obvious why these are specified with "min". please add comments on why.
  • same for this:
          | item_name              | min_value | max_value |
          | _SERVER_.soaoutv4      |         1 |         3 |
          | _SERVER_.axfrreqv4     |         1 |         3 |
    ...
    

On my environment, they cannot be expected to a specific value. They seem to depend on timing and environment. I added notes about this. Please see the following commit?

14ee53d [2252] add notes about counters of Xfrin: soaoutv4, axfrreqv4, soaoutv6, axfrreqv6, and xfrfail

I suspect this is due to #2879. Please confirm that from logs, and
try modifying isc.notify.notify_out._MAX_NOTIFY_TRY_NUM to 1 to see
if the number is now predictable (I believe it should be exactly 1).
If these are confirmed, please make more accurate comment in the
lettuce feature and add comments to #2879 so when it's done we should
be able to clean up this test.

comment:18 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:19 in reply to: ↑ 17 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Hello, sorry for my late response.

Replying to jinmei:

Replying to naokikambe:

It's more about design matter so the harm may not be so obvious or
even a subjective matter. I'm okay with excluding it for this ticket,
but I'd argue this should be at least discussed (and if we agree)
resolved before using the current design further. So I'd suggest
creating a ticket to discuss and fix it, and do it before the
subsequent counter tickets for xfrin or others.

I'm okay with redesigning if the current library has some problem. But what ticket should I create? What contents should I write in the ticket? Sorry, I couldn't understand well about problems. Could you create it yourself?

I've created one: #2883. We might also discuss it in the next team
call.

Thank you for creating. Yes, I'd like to understand well what problem is and how we should improve it somewhere. I think it might be appropriate for a topic of the face-to-face meeting to discuss.

b10-xfrin.xml

  • about latest_*xfr_duration: I missed the currently ongoing one doesn't show up in the statistics. In that case "last_*xfr_duration" would be better than latest.

Note: I had agreed last_*xfr_duration would be better, but they are
not changed.

Sorry, I missed to read it before. I've revised. Please see this.

678965b [2252] change to better names; s/lastest/last/i
  • as for time_to_ixfr, what if there's never been an xfr for the zone?

It shows still the default value, 0.0.

It seems to be suboptimal in that using a magic number (especially in
the valid range - 0.0 should still be possible if everything is done
super quickly) to present a kind of "None" is always a source of
trouble. I'd consider something that can explicitly mean "no
completed xfrs". And, especially if we use a magic value, document it
somewhere.

I updated descriptions about the default value 0.0. Could you see the following commit?

26a9ee9 [2252] update descriptions of latest_ixfr_duration and latest_axfr_duration about their default values

In that case this description isn't accurate. It should be something
like:

                    <listitem><simpara>
                      Duration of the latest IXFR in seconds. 0.0 means it was
                      shorter than 0.05 seconds or there was no
                      successful IXFR.
                    </simpara></listitem>

and, as this description suggests, it's generally a bad idea to
overload a specific value that way; such ambiguity will eventually
bite us(ers). Better approach is to actually revise the code to
eliminate the ambiguity. Do you still think the overloading is the
way to go?

I believe the timer has microsecond accuracy on the most platforms. So I think it is not 0.05 but 0.000001. I've update description at the following commit according to this.

ff8d95f [2252] update descriptions

However I don't think this could be so confusing for users. Because I think users could see whether a real XFR is started from other counters like axfrreqv4 or ixfrreqv4.

xfrin.py.in

  • get_ipver_str: I suggest making it "protected" by renaming

By "clarify" I meant update the docstring too.

I updated the docstring. Could you see this?

ad8e7ea [2252] update docstring of _get_ipver_str() about accessing

I further clarified it can also be used in tests.

Thank you. It looks good.

  • _check_soa_serial: shouldn't this be distinguished per name and class?

[...]

I'm okay with differing it, but it seems to be a quite fundamental
design error. So I'd argue that should be fixed before we rely on the
current format further. The longer we postpone it, the more painful
it is to fix.

The current data structure is already implemented in Xfrout. I think we should repair altogether at a time in future.

These should be all fixed, yes, but if you mean by "in future", we
still keep using it, I'd be opposed to it. IMO this level of design
defect should be corrected sooner than later. Just as the fact of
xfrout indicates, if we rely on the flawed design more, fixing them
will simply be more and more painful. For this particular ticket, I
wouldn't block due to this, but I don't like to pick up any other
counter tickets until this issue is fixed.

I created a ticket: #2884.

Thank you for creating the ticket. I meant that "in future" was "in near future" at that time. Sorry for confusion. I also think this issue is serious. So shall we start #2884 after this ticket? Anyway, according to this, I've updated description for avoiding confusion from users. Could you see the following commit?

20df0b0 [2252] update descriptions of zonename
  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.

I think it is okay that the timer count would be discarded if exception is raised.

Is it really discarded? Shouldn't it be left until next time we set
the timer for the same zone? If it's left, that doesn't seem to be
clear. Also, if the intent is to only count the duration of
successfully completed transfers (even excluding the case of
XfrinZoneUptodate), I believe that should be documented.

Strictly speaking, that would be discarded when the next start time is set. I guess the timer wouldn't probably started yet in case of XfrinZoneUptodate. I described about discarding at the following commit.

dc2960e [2252] add notes about the case that the timer is already started but not yet stopped

By "document", I (probably ... not quite remember after some period
since last comment cycle) meant the man page or something, not (only)
the code comment.

I also wrote a comment on the man page. Could you see the commit?

4d5c20e [2252] update descriptions of last_*xfr_duration

I'm not sure if I don't understand this: "I guess the timer wouldn't
probably started yet in case of XfrinZoneUptodate". You mean it
should have come from _check_soa_serial? But that can also happen
in IXFR.

Yes, I misread the code. The condition can be changed in case of IXFR as you mentioned.

As for dc2960e, shouldn't this be "not yet stopped"?

+        support multi-threaded use. If the specified timer is already
+        started but not yet started, the last start time is
+        overwritten."""

Yes, thank you for correcting. I've updated.

2e97db1 [2252] correct a wrong word
  • command_handler: why do we bother to log it here?

[...]

I generalized the log id not to be specific to the getstats command. Could you see the following commit?

b0f2aca [2252] replace the log message XFRIN_RECEIVED_COMMAND to the more general id XFRIN_RECEIVED_COMMAND with the com

This change itself looks good. But please reorder the log messages
(use the tool under bind10/tools).

Thank you. I've reordered the messages. Please see.

307e3c7 [2252] reorder by tools/reorder_message_file.py

But, while working on #1938 I (happen to) noticed it had a more
fundamental problem in the context of using it in lettuce. The
corresponding lettuce test reads:

    When I query statistics zones of bind10 module Xfrin with cmdctl
    wait for new bind10 stderr message STATS_SEND_STATISTICS_REQUEST
    wait for new bind10 stderr message XFRIN_RECEIVED_COMMAND
    last bindctl output should not contain "error"

    When I query statistics zones of bind10 module Xfrin with cmdctl

This is not reliable, because just seeing XFRIN_RECEIVED_COMMAND
doesn't guarantee the stats daemon received the answer (on the other
hand, checking STATS_SEND_... should be unnecessary because if it
didn't happen we wouldn't see XFRIN_RECEIVED_COMMAND; further, I
suspect it could even cause failure, because log message order from
different processes is not predictable).

What we actually need is something like
"STATS_STATISTICS_REQUEST_DONE" and check that.

Note also that #1938 has been merged, and the meaning of "wait for
new" was clarified in a possibly non compatible way.

That's right. I've removed waiting for 'STATS_SEND_STATISTICS_REQUEST'. But I didn't introduce 'STATS_STATISTICS_REQUEST_DONE' as you mentioned. Because even if we introduced it, we cannot identify whether b10-stats received statistics from b10-xfrin. b10-stats also asks other modules like b10-init and b10-auth. I've already pulled the newer master onto trac2252 at 836fabb. That must include the changes which you pushed on #1938.

xfrin_test.py

I've reviewed the entire change in the branch from the branch point.
The latest version looks much better, but it still seems to have some
issues:

Thank you. Anyway first of all,

  • the case of test_do_soacheck_uptodate makes me wonder why xfrsuccess is counted even if there's no xfr transaction. Is that okay? (What BIND 9 does in this case?) If that's by design, I suspect its description has to be updated:
                        <listitem><simpara>
                          Number of zone transfer requests succeeded
                        </simpara></listitem>
    
    (since there's actually not even a transfer request in this case).

I found I misunderstood these counters by your comment. We should count axfrreqv4/6 or ixfrreqv4/6 before sending out an XFR request. After that we should count up xfrsuccess or xfrfail depending on the result of the query. I've applied this correction to the code. Please see this for details.

5c0c1cb [2252] change positions for counters to be counted
4a8316d [2252] update the test cases due to the previous change
  • the tests for the durations do not always succeed (for me). in general, it's not really good to rely on the real world clock in unit tests because it's too fragile. (see also the next point)

That's right. I've updated the code to check a duration time by a fixed value. Please see this.

e346959 [2252] update datetime to always return fixed seconds
  • AXFR+xfrfail case doesn't seem to be tested. Combining the first and this point, I think it's easier to test if we mock many of the methods used in do_xfrin, and for durations only check start_timer/stop_timer are called at the correct point with the correct parameters (not checking the resulting values).

As for this, I've included the above commits. I think a case for both AXFR and xfrfail is included in the updated code.

  • some class attributes (ipver, name_to_counter, master_addrinfo), are introduced while they are essentially instance attributes. using class attributes when they don't have to be so is generally not a good practice because change in one test instance could affect another test. (This is also related to the issue that led to #2883). I suggest making them instance attributes.

Hmm..., sorry, I cannot still understand what you worry about so much. For example, if we intentionally change the class attribute 'ipver' to 'v7' inside of the test case like the following:

  • src/bin/xfrin/tests/xfrin_test.py

    diff --git a/src/bin/xfrin/tests/xfrin_test.py b/src/bin/xfrin/tests/xfrin_test.py
    index ecd2bf5..c17741f 100644
    a b class TestStatisticsXfrinAXFRv4(TestStatisticsXfrinConn): 
    21862186    def test_soaout(self):
    21872187        '''tests that an soaoutv4 or soaoutv6 counter is incremented
    21882188        when an soa query succeeds'''
     2189        self.ipver = 'v7'
    21892190        self.conn.response_generator = self._create_soa_response_data
    21902191        self._check_init_statistics()
    21912192        self.assertEqual(self.conn._check_soa_serial(), XFRIN_OK)

Then could this change unexpectedly affect to other test cases? In this case I can see it affects only the two related cases: TestStatisticsXfrinAXFRv4.test_soaout() and TestStatisticsXfrinAXFRv6.test_soaout(). But other cases definitely refer to the changed attribute. Can you crash a test code by changing a attribute of the test case class? Maybe I have less ability to code python. Please correct me if I misunderstand this.

  • the intent of each test (such as test_do_soacheck_uptodate) is not so obvious. Please add docstring or comment.

That's right, the case names were inappropriate. I've revised them and their docstring. Please see this.

a4b1c8d [2252] update case names and their docstrings according to the tested counters

lettuce tests

  • Just wonder: are the changes in test_spec3.spec necessary?

Not necessary. Keeping same content makes no sense, but I leave a change I added before.

  • First off, test failed. I thought it might be related to the issue of #2790, but at least this one doesn't seem to be related to it:

[...]

Even if I didn't apply the Jelte's patch, the related tests passed on my environment.

That one is not relevant to Jelte's patch.

And I haven't changed these thresholds since the first time the codes could be reviewed. Did that always fail on your environment? Can I see the full logs of lettuce tests and stderr? Depending on the results, we might need to reconsider the thresholds.

I've attached the stderr log files and the output from lettuce when it
failed. Note that there are two issues:

Thank you. By checking that, I've adjusted soaoutv4 and soaoutv6. There seems to be a problem related to #2879 as you mentioned below.

  • statistics values are not the expected ones. I suspect it's related to the "STATS_STATISTICS_REQUEST_DONE" issue I explained above. And this is not related to Jelte's patch.

I didn't introduced it as I mentioned above.

  • this synchronization failed
        Then wait for new bind10 stderr message XFRIN_XFR_TRANSFER_STARTED
    
    even with Jelte's patch. On master, I've never seen this happen if that patch is applied.

Hmm..., sorry, I'm not sure about this. I have merged the newer master on to the branch. Could you try lettuce test again? If you still have same problem, please let me know.

  • check_statistics_items: it seems that the docstring has to be updated.

I updated the docstring. Could you see the following commit?

be426ec [2252] update description of check_statistics_items() about min_value and max_value

It says 5 columns but only mention 4.

    the scenario. The multiline part has at most 5 columns: item_name,
    item_value, min_value, and max_value. item_name is a relative name
    to category. item_value is an expected value for

Thank you for correcting. I've revised the wrong number.

4f63e23 [2252] correct the wrong number of columns to be mentioned
  • check_init_statistics: if it's xfr specific, it seems it should be named accordingly. Also, it seems to have been somewhere other than bind10_control.py (from a quick look it's for more generic modules; xfrs are too specific in that sense).

I generalized it not to be specific to Xfrin nor Xfrout. Could you see the following commits?

c37fd56 [2252] generalize the module name as an argument of check_init_statistics()
0eec3b5 [2252] add a port number of cmdctl as an option of check_init_statistics()

This part still seems to be DNS specific:

    step.given(notcontain_str % 'example.org.')

It should probably be parameterized.

I've parametrized it. Please see this.

1f578c9 [2252] parameterize for the domain name not to be contained in queried statistics
  • it's not obvious why these are specified with "min". please add comments on why.
  • same for this:
          | item_name              | min_value | max_value |
          | _SERVER_.soaoutv4      |         1 |         3 |
          | _SERVER_.axfrreqv4     |         1 |         3 |
    ...
    

On my environment, they cannot be expected to a specific value. They seem to depend on timing and environment. I added notes about this. Please see the following commit?

14ee53d [2252] add notes about counters of Xfrin: soaoutv4, axfrreqv4, soaoutv6, axfrreqv6, and xfrfail

I suspect this is due to #2879. Please confirm that from logs, and
try modifying isc.notify.notify_out._MAX_NOTIFY_TRY_NUM to 1 to see
if the number is now predictable (I believe it should be exactly 1).
If these are confirmed, please make more accurate comment in the
lettuce feature and add comments to #2879 so when it's done we should
be able to clean up this test.

I think so too. If I set _MAX_NOTIFY_TRY_NUM to 1, then we can expect these counters to be exactly 1. I suspect that's why checking soaoutv6 failed on your environment. According to your uploaded log(lettuce.log), soaoutv6 was expected to count 1 but actually count 2. I've added notes on both the code and #2879. For the notes, please see this.

9d00667 [2252] update the note to more accurate one due to bug #2879

BTW on my environment, checking initial statistics of Xfrout sometimes failed. I suspect this is because a socket of Xfrout is already opened before b10-stats asks to b10-xfrout. So I added exceptional cases to checking. Could you see this change?

d7885ec [2252] add an exceptional case when checking initial statistics of Xfrout

Regards,

comment:20 Changed 7 years ago by jinmei

Please read tests/lettuce/configurations/DO_NOT_USE_127.0.0.1:47807
and adjusted the configuration. That's seemingly the reason for the
failure I've seen.

comment:21 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:22 in reply to: ↑ 19 ; follow-up: Changed 7 years ago by jinmei

Replying to naokikambe:

  • as for time_to_ixfr, what if there's never been an xfr for the zone?

It shows still the default value, 0.0.

It seems to be suboptimal in that using a magic number (especially in
the valid range - [...]

I updated descriptions about the default value 0.0. Could you see the following commit?

26a9ee9 [2252] update descriptions of latest_ixfr_duration and latest_axfr_duration about their default values

In that case this description isn't accurate. It should be something
like:

                    <listitem><simpara>
                      Duration of the latest IXFR in seconds. 0.0 means it was
                      shorter than 0.05 seconds or there was no
                      successful IXFR.
                    </simpara></listitem>

and, as this description suggests, it's generally a bad idea to
overload a specific value that way; such ambiguity will eventually
bite us(ers). Better approach is to actually revise the code to
eliminate the ambiguity. Do you still think the overloading is the
way to go?

I believe the timer has microsecond accuracy on the most platforms. So I think it is not 0.05 but 0.000001. I've update description at the following commit according to this.

Hmm, I don't remember why I referred to 0.05. Maybe from the
representation of "0.0" I assumed that it would be represented as
%.1f.

ff8d95f [2252] update descriptions

However I don't think this could be so confusing for users. Because I think users could see whether a real XFR is started from other counters like axfrreqv4 or ixfrreqv4.

They *could*, but it's inconvenient and error prone. That's the whole
point. I can easily imagine a naive user writes (buggy) monitoring
code like this:

def check_xfr_status():
    # retrieve last_axfr_duration using cmdctl interface or stats_httpd
    last_duration = get_last_axfr_duration()
    if last_duration == 0: # BUG: it should also check axfrreqv6 and axfrreqv4
        # statistics insist there has been no xfr-in successful attempt; we know
        # the primary server has responded to xfr request.  that should mean
        # an evil attack.  We need to retaliate immediately.
        launch_nuclear_missiles()

Sure, it's the user's fault that he/she doesn't read the documentation
carefully and triggers a nuclear war as a result. But we know people
don't read document, right (see my previous ticket comment, btw)?

So, IMO, overloading like this simply means we are causing troubles
for users due to our laziness.

In any case, if this is still not convincing and you think overloading
is better, the latest description doesn't parse well to me:

                      Duration of the last IXFR. 0.0 means no successful IXFR done
                      in greater than or equal to a microsecond. If a started
                      timer is never stopped because of failure, start time of
                      duration will be reset next time.

Also, it's awkward to see concepts like a "timer" here because it's
just an implementation detail.

xfrin.py.in

Thank you for creating the ticket. I meant that "in future" was "in near future" at that time. Sorry for confusion. I also think this issue is serious. So shall we start #2884 after this ticket? Anyway, according to this, I've updated description for avoiding confusion from users. Could you see the following commit?

20df0b0 [2252] update descriptions of zonename

I'd say "e.g." before (IN, CH, HS).

  • do_xfrin: stop_timer wouldn't be called if exception is raised during the process.

[...]

By "document", I (probably ... not quite remember after some period
since last comment cycle) meant the man page or something, not (only)
the code comment.

I also wrote a comment on the man page. Could you see the commit?

4d5c20e [2252] update descriptions of last_*xfr_duration

See above. We need to explain it without using implementation
specific terms.

I'm not sure if I don't understand this: "I guess the timer wouldn't
probably started yet in case of XfrinZoneUptodate". You mean it
should have come from _check_soa_serial? But that can also happen
in IXFR.

Yes, I misread the code. The condition can be changed in case of IXFR as you mentioned.

I'm not sure if that's reasonable. See below.

  • some class attributes (ipver, name_to_counter, master_addrinfo), are introduced while they are essentially instance attributes. using class attributes when they don't have to be so is generally not a good practice because change in one test instance could affect another test. (This is also related to the issue that led to #2883). I suggest making them instance attributes.

Hmm..., sorry, I cannot still understand what you worry about so much. For example, if we intentionally change the class attribute 'ipver' to 'v7' inside of the test case like the following:

First, this should be considered a principle matter. I hope we agree
that global objects/variable, etc, are bad in general. singleton or
class attributes are just modernized form of global, so, their use
should generally be considered bad, and must be justified with a very
strong reason if that's inevitable.

In this specific case, completely overriding ipver actually creates
an instance variable so the problem wouldn't be apparent. updating
name_to_counter dict is probably a better example. But, again, this
is basically a principle matter; the fact that it uses global without
a reason is itself a problem.

  • this synchronization failed
        Then wait for new bind10 stderr message XFRIN_XFR_TRANSFER_STARTED
    
    even with Jelte's patch. On master, I've never seen this happen if that patch is applied.

Hmm..., sorry, I'm not sure about this. I have merged the newer master on to the branch. Could you try lettuce test again? If you still have same problem, please let me know.

I think I figured it out. Please see the previous comment.

BTW on my environment, checking initial statistics of Xfrout sometimes failed. I suspect this is because a socket of Xfrout is already opened before b10-stats asks to b10-xfrout. So I added exceptional cases to checking. Could you see this change?

d7885ec [2252] add an exceptional case when checking initial statistics of Xfrout

I'm not sure about this as it didn't happen for me. But, in general,
I'd try to figure out how that actually happened, and then address the
issue either by fixing the real issue or (if it's not easily
avoidable) introducing a work around like this.

Specific code comments:

xfrin_test.py

  • I'd define these as "private" by using double-underscore: _orig_datetime, _orig_start_timer, _const_sec.
  • I'd use camel convention for fakedatatime. I also suspect 'data' should be 'date'. So it would be:
            class FakeDateTime:
                @classmethod
                def now(cls): return time2
    
  • test_axfrreq_xfrfail: it only covers a subset of failure cases. since we don't have to reproduce the exact xfr processing in this context (which is done in tests for the protocol itself), I'd simplify the test by stealing a method and broaden the coverage:
        def test_axfrreq_xfrfail(self):
            """DON'T FORGET DOCUMENT IT"""
            self._check_init_statistics()
            count = 0
            for ex in [XfrinZoneError, XfrinProtocolError, XfrinException,
                       Exception]:
                def exception_raiser():
                    raise ex()
                self.conn._handle_xfrin_responses = exception_raiser
                self.assertEqual(self.conn.do_xfrin(False), XFRIN_FAIL)
                count += 1
                self._check_updated_statistics({'axfrreq' + self.ipver: count,
                                                'xfrfail': count})
    
  • test_ixfrreq_xfrfail: first, same comment as axfrreq_xfrfail applies. secondly (while it may make it moot), it's not obvious why xfr is expected to fail in this case from the test code. Third, note that ixfr can result in XfrinZoneUptodate exception. Is it okay to consider it a failure (see also below)?

xfrin.py.in

  • In this revised code, if IXFR results in XfrinZoneUptodate xfrfail is counted. Is it intentional, and if so, is it reasonable? From a quick look, BIND 9 seems to consider it a success. In any case, this should be clarified in the documentation: what exactly "succeed/fail" mean.

comment:23 in reply to: ↑ 22 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Hello,

Replying to jinmei:

Please read tests/lettuce/configurations/DO_NOT_USE_127.0.0.1:47807
and adjusted the configuration. That's seemingly the reason for the
failure I've seen.

Sorry, I missed the notes. I changed the port number to 47809. Could you try the lettuce test again? The commit is:

71e0b2b [2252] use 127.0.0.1:47809 as address and port of the primary server in lettuce

Replying to jinmei:

ff8d95f [2252] update descriptions

However I don't think this could be so confusing for users. Because I think users could see whether a real XFR is started from other counters like axfrreqv4 or ixfrreqv4.

They *could*, but it's inconvenient and error prone. That's the whole
point. I can easily imagine a naive user writes (buggy) monitoring
code like this:

def check_xfr_status():
    # retrieve last_axfr_duration using cmdctl interface or stats_httpd
    last_duration = get_last_axfr_duration()
    if last_duration == 0: # BUG: it should also check axfrreqv6 and axfrreqv4
        # statistics insist there has been no xfr-in successful attempt; we know
        # the primary server has responded to xfr request.  that should mean
        # an evil attack.  We need to retaliate immediately.
        launch_nuclear_missiles()

Sure, it's the user's fault that he/she doesn't read the documentation
carefully and triggers a nuclear war as a result. But we know people
don't read document, right (see my previous ticket comment, btw)?

So, IMO, overloading like this simply means we are causing troubles
for users due to our laziness.

(I would be so happy if such a user used statistics counters much as above. But in fact, ..)
I think the user probably needs another trigger like axfrreqv4 or axfrreqv6 or bind10 log or the primary server's log or something. Because he/she must realize that the script would not work at first. last_axfr_duration must keep being 0.0 until the first xfr is done. Also in a real case, it would be a very rare that a real xfr is done in less than a microsecond. Can we fix the issue on another ticket after this ticket? Anyway ..

In any case, if this is still not convincing and you think overloading
is better, the latest description doesn't parse well to me:

                      Duration of the last IXFR. 0.0 means no successful IXFR done
                      in greater than or equal to a microsecond. If a started
                      timer is never stopped because of failure, start time of
                      duration will be reset next time.

Also, it's awkward to see concepts like a "timer" here because it's
just an implementation detail.

I tried to revise the descriptions. Could you take a look at the commit?

0732633 [2252] update descriptions of last_ixfr_duration and last_axfr_duration

xfrin.py.in

Thank you for creating the ticket. I meant that "in future" was "in near future" at that time. Sorry for confusion. I also think this issue is serious. So shall we start #2884 after this ticket? Anyway, according to this, I've updated description for avoiding confusion from users. Could you see the following commit?

20df0b0 [2252] update descriptions of zonename

I'd say "e.g." before (IN, CH, HS).

That's right. I've revised at:

a50e0a4 [2252] revise description of zonename (add 'e.g.' before the zone classes)
  • some class attributes (ipver, name_to_counter, master_addrinfo), are introduced while they are essentially instance attributes. using class attributes when they don't have to be so is generally not a good practice because change in one test instance could affect another test. (This is also related to the issue that led to #2883). I suggest making them instance attributes.

Hmm..., sorry, I cannot still understand what you worry about so much. For example, if we intentionally change the class attribute 'ipver' to 'v7' inside of the test case like the following:

First, this should be considered a principle matter. I hope we agree
that global objects/variable, etc, are bad in general. singleton or
class attributes are just modernized form of global, so, their use
should generally be considered bad, and must be justified with a very
strong reason if that's inevitable.

In this specific case, completely overriding ipver actually creates
an instance variable so the problem wouldn't be apparent. updating
name_to_counter dict is probably a better example. But, again, this
is basically a principle matter; the fact that it uses global without
a reason is itself a problem.

We cannot cause such a problem as well by updating either name_to_counter or master_addrinfo in that case. Because both of them were defined as a tuple. I'd not like to disagree with using a global variable. But I'm still not sure about difference between using a pure global variable and using a class-scope reference.

Anyway I've removed them and changed to instance objects or getter properties. A getter property cannot be changed to another reference without a setter. They must be treated as read-only. The commit is:

084f1f5 [2252] remove some constant definitions from the class scope

BTW on my environment, checking initial statistics of Xfrout sometimes failed. I suspect this is because a socket of Xfrout is already opened before b10-stats asks to b10-xfrout. So I added exceptional cases to checking. Could you see this change?

d7885ec [2252] add an exceptional case when checking initial statistics of Xfrout

I'm not sure about this as it didn't happen for me. But, in general,
I'd try to figure out how that actually happened, and then address the
issue either by fixing the real issue or (if it's not easily
avoidable) introducing a work around like this.

I'm not really sure but in fact I couldn't reproduce this issue. I couldn't investigate further. So I've reverted such exceptional cases. If the issue happens again, I would try to figure out the reason. The commit is:

61ec72f [2252] remove exceptional cases when checking initial statistics of Xfrout

Specific code comments:

xfrin_test.py

  • I'd define these as "private" by using double-underscore: _orig_datetime, _orig_start_timer, _const_sec.

I couldn't do only for _const_sec because I realized it was referred from another child class. The commit is:

04e248b [2252] make the references class-local: _orig_datetime, _orig_start_timer
  • I'd use camel convention for fakedatatime. I also suspect 'data' should be 'date'. So it would be:
            class FakeDateTime:
                @classmethod
                def now(cls): return time2
    

I've revised as you suggested. Thanks. The commit is:

1cef659 [2252] update a fake class name to camel convention including a typo fix
  • test_axfrreq_xfrfail: it only covers a subset of failure cases. since we don't have to reproduce the exact xfr processing in this context (which is done in tests for the protocol itself), I'd simplify the test by stealing a method and broaden the coverage:
        def test_axfrreq_xfrfail(self):
            """DON'T FORGET DOCUMENT IT"""
            self._check_init_statistics()
            count = 0
            for ex in [XfrinZoneError, XfrinProtocolError, XfrinException,
                       Exception]:
                def exception_raiser():
                    raise ex()
                self.conn._handle_xfrin_responses = exception_raiser
                self.assertEqual(self.conn.do_xfrin(False), XFRIN_FAIL)
                count += 1
                self._check_updated_statistics({'axfrreq' + self.ipver: count,
                                                'xfrfail': count})
    
  • test_ixfrreq_xfrfail: first, same comment as axfrreq_xfrfail applies. secondly (while it may make it moot), it's not obvious why xfr is expected to fail in this case from the test code. Third, note that ixfr can result in XfrinZoneUptodate exception. Is it okay to consider it a failure (see also below)?

I've revised the two test cases by following your instruction. Thanks. I've also added comments. Could you check that? The commit is:

5db9144 [2252] check an xfrfail incremented with multiple exceptions

xfrin.py.in

  • In this revised code, if IXFR results in XfrinZoneUptodate xfrfail is counted. Is it intentional, and if so, is it reasonable? From a quick look, BIND 9 seems to consider it a success. In any case, this should be clarified in the documentation: what exactly "succeed/fail" mean.

I revised the codes to treat an xfr request as success in a case of raised XfrinZoneUptodate. For consistency between BIND 9 and 10, I think we should define this case as success. In fact, do_xfrin() already defines so. The commit is:

8544c8e [2252] treat an xfr request as success with an XfrinZoneUptodate raised

Regards,

comment:24 in reply to: ↑ 23 ; follow-up: Changed 7 years ago by jinmei

Replying to naokikambe:

(I would be so happy if such a user used statistics counters much as above. But in fact, ..)
I think the user probably needs another trigger like axfrreqv4 or axfrreqv6 or bind10 log or the primary server's log or something. Because he/she must realize that the script would not work at first. last_axfr_duration must keep being 0.0 until the first xfr is done. Also in a real case, it would be a very rare that a real xfr is done in less than a microsecond. Can we fix the issue on another ticket after this ticket? Anyway ..

This is fine.

In this specific case, completely overriding ipver actually creates
an instance variable so the problem wouldn't be apparent. updating
name_to_counter dict is probably a better example. But, again, this
is basically a principle matter; the fact that it uses global without
a reason is itself a problem.

We cannot cause such a problem as well by updating either

name_to_counter or master_addrinfo in that case. Because both
of them were defined as a tuple. I'd not like to disagree with

Ah, okay, that was my misunderstanding. Perhaps I was too worried
about the seemingly easy use of global instances including class
attributes. In any case, the revised code looks good.

xfrin.py.in

  • In this revised code, if IXFR results in XfrinZoneUptodate xfrfail is counted. Is it intentional, and if so, is it reasonable? From a quick look, BIND 9 seems to consider it a success. In any case, this should be clarified in the documentation: what exactly "succeed/fail" mean.

I revised the codes to treat an xfr request as success in a case of raised XfrinZoneUptodate. For consistency between BIND 9 and 10, I think we should define this case as success. In fact, do_xfrin() already defines so. The commit is:

8544c8e [2252] treat an xfr request as success with an XfrinZoneUptodate raised

Documentation update is missing.

And one minor point: this comment doesn't parse well to me:

                # Note: Catching an exception XfrinZoneUptodate
                # defines that IXFR succeeded.

And, one last non trivial point. I'd hesitate to say this after the
long cycle of review process, but do_xfrin() now looks overly
complicated to me due to the series of statistics related changes.
The organization of the nested try-re-raise-catch is particularly
unreadable to me. I suggest refactoring this method so we can
understand the entire flow is more easily. And, I wouldn't like to
defer it to a separate ticket; it'll quite likely to be buried in
other tickets.

comment:25 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:26 in reply to: ↑ 24 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Hello,

Replying to jinmei:

xfrin.py.in

  • In this revised code, if IXFR results in XfrinZoneUptodate xfrfail is counted. Is it intentional, and if so, is it reasonable? From a quick look, BIND 9 seems to consider it a success. In any case, this should be clarified in the documentation: what exactly "succeed/fail" mean.

I revised the codes to treat an xfr request as success in a case of raised XfrinZoneUptodate. For consistency between BIND 9 and 10, I think we should define this case as success. In fact, do_xfrin() already defines so. The commit is:

8544c8e [2252] treat an xfr request as success with an XfrinZoneUptodate raised

Documentation update is missing.

Documentation was not same but I've updated a bit. The commit is:

4b79543 [2252] update documentation for the test cases

And one minor point: this comment doesn't parse well to me:

                # Note: Catching an exception XfrinZoneUptodate
                # defines that IXFR succeeded.

In the below change I've removed the inner try-except statement. So the note was unnecessary.

And, one last non trivial point. I'd hesitate to say this after the
long cycle of review process, but do_xfrin() now looks overly
complicated to me due to the series of statistics related changes.
The organization of the nested try-re-raise-catch is particularly
unreadable to me. I suggest refactoring this method so we can
understand the entire flow is more easily. And, I wouldn't like to
defer it to a separate ticket; it'll quite likely to be buried in
other tickets.

I've tried to refactor it so that it's easy to read. The commit is:

31f138b [2252] simplify complicated nested try-except statements

Regards,

comment:27 in reply to: ↑ 26 Changed 7 years ago by jinmei

Replying to naokikambe:

I revised the codes to treat an xfr request as success in a case of raised XfrinZoneUptodate. For consistency between BIND 9 and 10, I think we should define this case as success. In fact, do_xfrin() already defines so. The commit is:

8544c8e [2252] treat an xfr request as success with an XfrinZoneUptodate raised

Documentation update is missing.

Documentation was not same but I've updated a bit. The commit is:

4b79543 [2252] update documentation for the test cases

By documentation I meant the description of the man page. I've pushed
my suggested text.

Other changes look okay. If the man page update is okay, please feel
free to merge.

comment:28 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe
  • Total Hours changed from 0 to 16.75

comment:29 follow-up: Changed 7 years ago by jinmei

One last additional note: #2879 has been merged to master, so you'll
probably have to (or simply can) adjust the lettuce test on merge.

comment:30 in reply to: ↑ 29 ; follow-up: Changed 7 years ago by naokikambe

Replying to jinmei:

I revised the codes to treat an xfr request as success in a case of raised XfrinZoneUptodate. For consistency between BIND 9 and 10, I think we should define this case as success. In fact, do_xfrin() already defines so. The commit is:

8544c8e [2252] treat an xfr request as success with an XfrinZoneUptodate raised

Documentation update is missing.

Documentation was not same but I've updated a bit. The commit is:

4b79543 [2252] update documentation for the test cases

By documentation I meant the description of the man page. I've pushed
my suggested text.

Sorry for my misunderstanding. The proposed text looks good. I corrected XML tags.

6b3b78f [2252] replace ending tags

Replying to jinmei:

One last additional note: #2879 has been merged to master, so you'll
probably have to (or simply can) adjust the lettuce test on merge.

I removed cares for #2879 from xfrin_notify_handling.feature. And I merged the master including the changes in #2879 into the branch.

199beea [2252] Merge branch 'master' of git://git.bind10.isc.org/bind10 into trac2252
0a2277b [2252] adjust lettuce tests; xfrin_notify_handling.feature

BTW the branch doesn't pass lettuce test on the build farm: http://git.bind10.isc.org/~tester/builder//BIND10-lettuce/20130426125701-DebianLinux-x86_64-GCC/logs/lettuce.out

I've seen it before on my environment, which I commented at comment:19. I'm going to investigate it, and I'd fix the error or add a workaround. So I continue to work on this ticket.

Regards,

comment:31 in reply to: ↑ 30 Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Hello,

Replying to naokikambe:

BTW the branch doesn't pass lettuce test on the build farm: http://git.bind10.isc.org/~tester/builder//BIND10-lettuce/20130426125701-DebianLinux-x86_64-GCC/logs/lettuce.out

I've seen it before on my environment, which I commented at comment:19. I'm going to investigate it, and I'd fix the error or add a workaround. So I continue to work on this ticket.

For this, I've added a change:

c0acd59 [2252] add exceptional cases for checking initial statistics of Xfrout

As a result, the lettuce test all seems to pass except for the scenario which I didn't work on this ticket for: http://git.bind10.isc.org/~tester/builder//BIND10-lettuce/20130430135253-DebianLinux-x86_64-GCC/logs/lettuce.out

Could you review again? Otherwise can I go merging?

Regards,

comment:32 follow-up: Changed 7 years ago by jinmei

I'm not really sure if I understand the commit comment:

[2252] add exceptional cases for checking initial statistics of Xfrout

.Xfrout.socket.unixdomain.open should be 1 immediately after Xfrout started.
But Stats could not have the latest statistics of Xfrout if it requests until
Xfrout is ready for accepting a statistics request from Stats.  Thus in that
case, Stats would show that .Xfrout.socket.unixdomain.open is 1.  For checking
exactly that .Xfrout.socket.unixdomain.open is 1 here, it waits for the second
time until Stats requests all modules which are Auth, Init, and Xfrout.  The
Stats configuration "poll-interval" is changed to one second so that timeout in
the lettuce test is avoided.

Does this mean that the problem is as follows?

  • if b10-stats doesn't start up soon enough and Xfrout is ready at the time the initial polling from b10-stats, .Xfrout.socket.unixdomain.open is 1, so this check will fail:
        check initial statistics not containing example.org for Xfrout
        with cmdctl port 47804
    
  • on the other hand, if b10-stats starts up sooner and collects initial statistics before xfrout is ready, .Xfrout.socket.unixdomain.open is 0. And when we perform this b10-stats simply returns the remembered result (because poll-interval is the default):
        check initial statistics not containing example.org for Xfrout with cmdctl port 47804
    
    So we cannot simply assert the counter is 1 (or 0).

And the "fix" is to set poll-interval to 1 (then in the second case
"check initial statistics" triggers another polling and b10-stats will
get the latest value) and assert the counter is 1?

If so, I think the previous workaround is less ugly, which, if I
remember it, is to assert the counter is either 0 or 1 (with
commenting why). To me the current fix has the following issues:

  • it's not clear why we need to use that specific value of poll-interval.
  • smaller poll-interval values will cause noisy log output. it's inconvenient when we need to examine the log output to debug a test failure.

comment:33 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe

comment:34 in reply to: ↑ 32 Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to jinmei

Hello, sorry for my late response.

Replying to jinmei:

And the "fix" is to set poll-interval to 1 (then in the second case
"check initial statistics" triggers another polling and b10-stats will
get the latest value) and assert the counter is 1?

If so, I think the previous workaround is less ugly, which, if I
remember it, is to assert the counter is either 0 or 1 (with
commenting why). To me the current fix has the following issues:

  • it's not clear why we need to use that specific value of poll-interval.
  • smaller poll-interval values will cause noisy log output. it's inconvenient when we need to examine the log output to debug a test failure.

The config change is for avoiding timeout of lettuce test. But I reverted it to the default(60). And I updated the lettuce tests to expect the value to be 0 or 1. And I also added notes. If the notes are still hard for you to understand, I'll revise them. Please review again.

0475216 [2252] .Xfrout.socket.unixdomain.open is either expected to be 0 or 1, and add notes
8f4a354 [2252] revert poll-interval to the default value

Regards,

comment:35 Changed 7 years ago by jinmei

I suggest a few wording cleanups and avoiding repeating the same
lengthy notes for all cases. I'm attaching my proposed patch.
But I'd leave it to you. Whether with or without it, please feel
free to merge.

Changed 7 years ago by jinmei

comment:36 Changed 7 years ago by jinmei

  • Owner changed from jinmei to naokikambe
  • Total Hours changed from 16.75 to 17.33

comment:37 Changed 7 years ago by naokikambe

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 17.33 to 270.46

Merging was done, the ticket is being closed.
Thank you for reviewing.

Note: See TracTickets for help on using tickets.