Opened 7 years ago

Closed 7 years ago

Last modified 7 years ago

#2136 closed enhancement (fixed)

update the stats daemon according to the new statistics model

Reported by: naokikambe Owned by: naokikambe
Priority: medium Milestone: Sprint-20120904
Component: statistics Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 8? (depends on #2135) Add Hours to Ticket: 0
Total Hours: 77 Internal?: no

Description

update the stats daemon according to the new model of #2135

Subtickets

Change History (23)

comment:1 Changed 7 years ago by jelte

  • Milestone changed from New Tasks to Sprint-20120731

comment:2 Changed 7 years ago by jelte

  • Estimated Difficulty changed from 0 to 8? (depends on #2135)

comment:3 Changed 7 years ago by naokikambe

  • Owner set to naokikambe
  • Status changed from new to accepted

comment:4 Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to UnAssigned
  • Status changed from accepted to reviewing

The branch is ready for reviewing. The documentation task #2135 isn't reviewed yet but I implemented according to that document prior to completion. If changes of the document affect this implement, the codes should be changed.

BTW, This is a proposed ChangeLog entry. It includes changes in #2137 and #2138 since they depends on this ticket. So I'd like to add it after all of them are finished.

nnn.    [func]*         naokikambe, someone
        b10-stats polls bind10 and b10-auth for statistics data.  The
        "poll-interval" parameter in b10-stats is for configuring the polling
        interval. All statistics data collected once are preserved while
        b10-stats is running. The "sendstats" command was removed from bind10
        and b10-auth. The "statistics-interval" configuration item was removed
        from b10-auth.
        (Trac #2136, #2137 and #2138, git TBD)

comment:5 follow-up: Changed 7 years ago by jinmei

In my environment where ulimit -n = 256, some tests failed (perhaps
partially) due to 'Too many open files':

======================================================================
ERROR: test_xsl_handler (__main__.TestStatsHttpd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/jinmei/src/isc/git/bind10-2136/src/bin/stats/tests/b10-stats-httpd_test.py", line 651, in setUp
  File "/Users/jinmei/src/isc/git/bind10-2136/src/bin/stats/tests/test_utils.py", line 467, in __init__
  File "/Users/jinmei/src/isc/git/bind10-2136/src/bin/stats/tests/test_utils.py", line 64, in __init__
  File "/Users/jinmei/src/isc/git/bind10-2136/src/bin/stats/tests/test_utils.py", line 104, in __init__
  File "/Users/jinmei/src/isc/git/bind10-2136/src/bin/msgq/msgq.py", line 195, in setup
  File "/Users/jinmei/src/isc/git/bind10-2136/src/bin/msgq/msgq.py", line 168, in setup_listener
  File "/usr/local/Cellar/python3/3.2.3/lib/python3.2/socket.py", line 94, in __init__
socket.error: [Errno 24] Too many open files

Increasing the ulimit worked as a work around, but I guess there are
actually FD leaks in the test or stat related code. They should
better be plugged.

comment:6 follow-up: Changed 7 years ago by jinmei

Some small things:

  • I've made a few small wording changes to documentation and pushed it.
  • This sounds awkward to me:
    +      and so on. <command>b10-stats</command> periodically requests statistics
    +      data to each module and receives. The interval time can be configured
    
    Maybe something like "b10-stats periodically requests each module to return statistic data"?

On the main changes: from a quick look it makes quite a lot of
substantial changes and doesn't seem to be easy to understand the
overall idea. I'm afraid I myself have enough time to complete the
review due to my absence next week, so if the actual reviewer is okay
with that, that's fine, but if I were a reviewer, I'd like to see some
guideline on how to review it. Ideally, it's better if the entire
diff can be staged, so the review can be done step-by-step.

comment:7 in reply to: ↑ 5 ; follow-up: Changed 7 years ago by naokikambe

Replying to jinmei:

In my environment where ulimit -n = 256, some tests failed (perhaps
partially) due to 'Too many open files':

Thank you for catching up. I found some of mock auth objects were unfinished. I fixed it and upload the codes. The new codes seem to be OK for my environment where ulimit -n is set to 256. Could you try again in your environment?

comment:8 in reply to: ↑ 6 Changed 7 years ago by naokikambe

Replying to jinmei:

  • This sounds awkward to me:
    +      and so on. <command>b10-stats</command> periodically requests statistics
    +      data to each module and receives. The interval time can be configured
    
    Maybe something like "b10-stats periodically requests each module to return statistic data"?

That is OK for me. Thank you.

On the main changes: from a quick look it makes quite a lot of
substantial changes and doesn't seem to be easy to understand the
overall idea. I'm afraid I myself have enough time to complete the
review due to my absence next week, so if the actual reviewer is okay
with that, that's fine, but if I were a reviewer, I'd like to see some
guideline on how to review it. Ideally, it's better if the entire
diff can be staged, so the review can be done step-by-step.

Thank you for the suggestion and sorry for so huge diffs.
I categorized these diffs into 6 groups. Most of changes are for refactoring or for applying test codes for the new specification. I'd like to ask a reviewer to review the changes rather than these changes, especially top 3 groups. And then I'd like to ask to review partially test codes corresponding to the changes of the main code if it's allowed. This can be helpful for the reviewer?

  • Adding a new config 'poll-interval'
    a6b0955 [2136] introduced a new configration item 'poll-interval'
    97a53cf [2136] added config setup
    397d4b7 [2136] added update of 'poll-interval'
    
  • Implementing polling of the stats module for collecting statistics data
    613c10d [2136] updated stats.py and added a related testcase
    7f94e12 [2136] revised the debug message of request
    80ac056 [2136] implemented polling to each modules for collecting statistics data
    
  • Implementing preservation of statistics data from an inavtive instance
    89d79f4 [2136] preserve all statistics data including the inactive instances' one while stats module is working
    
  • Removing the obsolete command
    1c195bb [2136] removed 'set' command of stats module due to its unnecessity
    
  • Documentation
    e7739c1 [2136] updated bind10-guide.* due to removing set command from stats module
    7e6d71c [2136] some small suggested wording fixes to doc.
    
  • Refactoring or applying test codes for new specification
    d2d7cbe [2136] removed unnecessary waits
    4b27d27 [2136] added a test dbfile of b10-config for testing stats with the Mock boss
    ef4095e [2136] revised test_util.py miscellaneously
    5980e50 [2136] removed the obsoleted command 'sendstats' from the mock modules
    4df681f [2136] renamed 'pid' to 'mid' and change the default value -1 to None on the second argument of the the 'update_statistics_data' method
    4a50ef0 [2136] added deeply checking statistics values of each of module and instance
    e206cb2 [2136] added a new message id for getting invalid statistics data from the module
    6d44728 [2136] do group_sendmsg too all modules then do group_recvmsg in do_polling for efficiency
    6c0bc0b [2136] kept consistency in unittest due to the previous changes
    4b300f5 [2136] moved creation of a stats.Stats object in each test case
    8f7ae09 [2136] Updated not to calculate next timing based on current timestamp in every polling
    2b1cbf9 [2136] Refactoring unittest and update to evaluate the next_polltime value as a integer
    be0d496 [2136] Removed 'trees' argument from 'getstats' command
    6d20444 [2136] Fixed wrong checking: The variable key was assumed to be constant.
    c46ccb2 [2136] Fixed unfinished MockAuth objects
    

comment:9 in reply to: ↑ 7 Changed 7 years ago by jinmei

Replying to naokikambe:

Replying to jinmei:

In my environment where ulimit -n = 256, some tests failed (perhaps
partially) due to 'Too many open files':

Thank you for catching up. I found some of mock auth objects were unfinished. I fixed it and upload the codes. The new codes seem to be OK for my environment where ulimit -n is set to 256. Could you try again in your environment?

It now worked, but it took way too long (nearly 4 minutes) to
complete. The master version it takes about 12 seconds. Even the
current duration is pretty long for unit tests, and I'm afraid several
minutes are far beyond the acceptable level. Is it only me?

comment:10 follow-up: Changed 7 years ago by naokikambe

Taking so long time was due to my wrong fix. But if I reverted the last change, tests for "b10-stats-httpd" need open files more than 256. In that test, many open files are used. Because http server and http client objects are created many times and some other IO objects are used at same time. So the number of mock auth instances was reduced. Could you try again? And please refer to the following log for the detail?:

0472e39 [2136] Revised the test scripts and the config db file

comment:11 in reply to: ↑ 10 ; follow-up: Changed 7 years ago by jinmei

Replying to naokikambe:

Taking so long time was due to my wrong fix. But if I reverted the last change, tests for "b10-stats-httpd" need open files more than 256. In that test, many open files are used. Because http server and http client objects are created many times and some other IO objects are used at same time. So the number of mock auth instances was reduced. Could you try again? And please refer to the following log for the detail?:

I didn't look into details in the code, but the tests now pass with
ulimit -n == 256 and reasonably quickly.

comment:12 in reply to: ↑ 11 Changed 7 years ago by naokikambe

Thank you for rechecking. Things related to b10-stats-httpd are out of scope in this ticket. I should have taken care about open files.
BTW I'd like to progress regularly this ticket soon.

comment:13 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:14 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to UnAssigned

If there's so much reviewing on the run already, why is it marked as Unassigned? Whose turn is it now?

comment:15 in reply to: ↑ 14 Changed 7 years ago by naokikambe

Jinmei-san checked only the running of "make check" so far, but nobody reviews the source codes yet.

BTW I've updated some codes due to the changes of the wiki page StatsModule.
(#2135 is a documentation ticket for the wiki page.)

My changes to the codes are:

  • Supporting show_processes of the boss module for Stats to count instances of the target modules
    3f922de [2136] Updated the do_polling method
    58b6810 [2136] Updated the "show_processes" to provide address of each module
    99b6098 [2136] Introduced an address() method for the Component classes
    
  • Updating to invoke polling at the time when the show command of Stats is invoked
    082d6a9 [2136] Updated the 'show' command handler
    
  • Refactoring for efficiency
    f8f9de2 [2136] Separated the updating process from the receiving process
    
Last edited 7 years ago by naokikambe (previous) (diff)

comment:16 Changed 7 years ago by muks

  • Owner changed from UnAssigned to muks

Picking for review.

comment:17 Changed 7 years ago by muks

  • Owner changed from muks to naokikambe

Hi Kambe-san

Here are my comments.

  • It's better to put src/bin/stats/tests/b10-config_test.db in src/bin/stats/tests/testdata/ sub-directory as we follow that convention for all test related data files. You may want to consider naming it just b10-config.db too.
  • This config file (b10-config_test.db) should be multi-line and indented so that everything is not in a single line. Any patching will modify the entire file and it will take time to check what changed.
  • The branch preserves all statistics data including the data for processes which no longer exist. What was the reason for this change?
  • What do we do about this FIXME in stats.py.in:
                # FIXME: A issue might happen when consolidating
                # statistics of the multiple instances. If they have
                # different statistics data which are not for adding
                # each other, this might happen: If these are integer
                # or float, these are added each other, otherwise
                # these are overwritten into one of them.
  • I have pushed some changes to the branch, where discussion was not necessary. Please check these commits on the branch.
  • Is the following code what you intended to write (in tests)? t is unused and the loop checks the same assertion many times.
            for s in stat.statistics_data_bymid['Auth'].values():
                for t in s.values():
                    self.assertEqual(
                        s, {'queries.perzone': auth.queries_per_zone,
                            'queries.tcp': auth.queries_tcp,
                            'queries.udp': auth.queries_udp})
  • Will the show command now force a polling if (current_time-lasttime_poll) < 1) ? Why not simply show what data is available and let the poll-interval be used for updating?
  • I am not sure if using message bus address is a correct way of finding multiple instances of the same module. In any case, the new code in the following (do_polling()) uses int indexes and types, instead of looking for keys. Is it possible to rewrite it to look for particular keys? Are such keys available?
-            if rcode == 0 and 'components' in value:
-                modules = [ c['special'].capitalize() \
-                                for c in value['components'].values() \
-                                if 'special' in c ]
+            if rcode == 0 and type(value) is list:
+                modules = [ v[2] if type(v) is list and len(v) > 2 \
+                                else None for v in value ]
  • In some commits, unrelated changes are mixed. For example, in commit 1c195bb85881b91fae165ef5b8334da8b6611a00, the set command is removed, but poll-interval is also documented. Such mixed commits are difficult to follow when reviewing. In this case, the poll-interval manpage change should go into the commit introducing that config setting, but in case we forgot it, it should become a separate commit just adding the manpage change. Another example is commit 4a50ef0287747928c8dbd4d045df9780df3c3840 where the changes to test_utils.py do not belong in that commit.

I have reviewed the commits one-by-one, and that the lettuce and system tests pass. Please assign it back to me when you are done with the above changes and I'll re-review the code change as a whole once.

comment:18 Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to muks

Hello Mukund-san,

Thank you for reviewing!

Replying to muks:

  • It's better to put src/bin/stats/tests/b10-config_test.db in src/bin/stats/tests/testdata/ sub-directory as we follow that convention for all test related data files. You may want to consider naming it just b10-config.db too.
  • This config file (b10-config_test.db) should be multi-line and indented so that everything is not in a single line. Any patching will modify the entire file and it will take time to check what changed.


I moved it under stats/tests/testdata and rename it to b10-config.db. Please refer to:

e0a964e [2136] Changes regarding the test config file
  • The branch preserves all statistics data including the data for processes which no longer exist. What was the reason for this change?

I described so in StatsModule when working on #2135. But originally I intended that it is for fixing the defect #1941(stats lossage (multiple auth servers)). For discarding statistics data of an inactive instance, PID is required. Because PID is used for identifying whether such process is running or dead. A list of PIDs is obtained from the result of the Boss command show_processes.
At this branch, PID isn't used for identifying instances, so the stats module cannot discard statistics data of an inactive instance and doesn't need to do so. IMO the stats module might not have to consolidate statistics data by each instance if we also consider the below issue.

  • What do we do about this FIXME in stats.py.in:
                # FIXME: A issue might happen when consolidating
                # statistics of the multiple instances. If they have
                # different statistics data which are not for adding
                # each other, this might happen: If these are integer
                # or float, these are added each other, otherwise
                # these are overwritten into one of them.

I cannot find a reasonable solution for it.:-( Only summable statistics type, e.g integer or real, should be defined in the statistics spec of a module which might run multiply. I added a minor workaround for this: If two values are string type, they are consolidated into bigger one. If one of them is None type, they are consolidated into another one.
Regarding this workaround, please refer to:

003d027 [2136] revised the internal method _accum()
  • I have pushed some changes to the branch, where discussion was not necessary. Please check these commits on the branch.

These make sense for me. Thank you very much!

  • Is the following code what you intended to write (in tests)? t is unused and the loop checks the same assertion many times.
            for s in stat.statistics_data_bymid['Auth'].values():
                for t in s.values():
                    self.assertEqual(
                        s, {'queries.perzone': auth.queries_per_zone,
                            'queries.tcp': auth.queries_tcp,
                            'queries.udp': auth.queries_udp})

That's right! I revised it at:

ec6716e [2136] removed an unnecessary loop
  • Will the show command now force a polling if (current_time-lasttime_poll) < 1) ? Why not simply show what data is available and let the poll-interval be used for updating?

I at first imagined that a command like "force-poll" might be required. But I got the following Jinmei-san's comment when I worked on #2135. I finally agreed and followed it.
https://lists.isc.org/pipermail/bind10-dev/2012-July/003698.html

What I envisioned is a revision of the "show" command; the new version
basically immediately invokes the "send statistics" request from stats
to the module(s).  We probably want to rate-limit this, e.g., at most
1 request per second though.  The stats daemon wait for the
response(s), and sends the result back to cmdctl.
  • I am not sure if using message bus address is a correct way of finding multiple instances of the same module. In any case, the new code in the following (do_polling()) uses int indexes and types, instead of looking for keys. Is it possible to rewrite it to look for particular keys? Are such keys available?
-            if rcode == 0 and 'components' in value:
-                modules = [ c['special'].capitalize() \
-                                for c in value['components'].values() \
-                                if 'special' in c ]
+            if rcode == 0 and type(value) is list:
+                modules = [ v[2] if type(v) is list and len(v) > 2 \
+                                else None for v in value ]

Neither am I. I think it might be possible. But I am thinking that this is just a workaround for releasing the statistics feature by the end of Sept.. IMO this part should be revised in the near future. I added a note related to above at:

2fc9bd6 [2136] added a note about a value from the Boss command show_processes
  • In some commits, unrelated changes are mixed. For example, in commit 1c195bb85881b91fae165ef5b8334da8b6611a00, the set command is removed, but poll-interval is also documented. Such mixed commits are difficult to follow when reviewing. In this case, the poll-interval manpage change should go into the commit introducing that config setting, but in case we forgot it, it should become a separate commit just adding the manpage change. Another example is commit 4a50ef0287747928c8dbd4d045df9780df3c3840 where the changes to test_utils.py do not belong in that commit.

I've revised these incorrect messages, so the git ids are changed to:

94d9320 [2136] removed 'set' command and introduced a 'poll-interval' config item
9d21102 [2136] added some misc changes

Old commits were changed. Could you recheckout and review again?

I have reviewed the commits one-by-one, and that the lettuce and system tests pass. Please assign it back to me when you are done with the above changes and I'll re-review the code change as a whole once.

I give you. Thank you for continuing to review.

BTW I'm sorry. I added more changes not related to above:

07a1ba0 [2136] moved invoking update_modules() to outside of the method
4b19f8e [2136] corrected the improper test
37ce3a0 [2136] removed the do_polling method from the __init__ method of stats.Stats()

I think these issues should be fixed in this ticket. Could you also review them?

comment:19 follow-up: Changed 7 years ago by muks

  • Owner changed from muks to naokikambe

Hi Kambe-san

I apologise for the delay in reviewing this bug. I was caught up in fixing #2165 which took much longer than expected.

  • The changes made for the review comments seem ok. make check and Lettuce tests pass. I have not re-run the system tests as many of these are commented out as they need implementation in other components.
  • In Stats.command_show(), should the call to update_statistics_data() pass owner instead of self.module_name?
  • Does this require a ChangeLog? entry? I think adding one in any case would be nice as it changes the basic design of statistics gathering.

comment:20 in reply to: ↑ 19 ; follow-up: Changed 7 years ago by naokikambe

  • Owner changed from naokikambe to muks

Hello Mukund-san,

No worries, it's too big change for a ticket.

Replying to muks:

  • The changes made for the review comments seem ok. make check and Lettuce tests pass. I have not re-run the system tests as many of these are commented out as they need implementation in other components.

We need to uncomment them after merging trac2138.

  • In Stats.command_show(), should the call to update_statistics_data() pass owner instead of self.module_name?

Yes. self.module_name is Stats and owner is a possible module name, e.g. Stats, Auth, Xfrout and so on. If it updates statistics data of Stats, owner is set to Stats. In other case, if it updates statistics data of Auth, owner is set to Auth.

  • Does this require a ChangeLog? entry? I think adding one in any case would be nice as it changes the basic design of statistics gathering.

Yes, it does. I pasted it above. Did you read it?

Thanks,

comment:21 in reply to: ↑ 20 ; follow-up: Changed 7 years ago by muks

  • Owner changed from muks to naokikambe

Hi Kambe-san

Replying to naokikambe:

  • In Stats.command_show(), should the call to update_statistics_data() pass owner instead of self.module_name?

Yes. self.module_name is Stats and owner is a possible module name, e.g. Stats, Auth, Xfrout and so on. If it updates statistics data of Stats, owner is set to Stats. In other case, if it updates statistics data of Auth, owner is set to Auth.

I follow.. this looks fine then as it's updating Stat's keys.

  • Does this require a ChangeLog? entry? I think adding one in any case would be nice as it changes the basic design of statistics gathering.

Yes, it does. I pasted it above. Did you read it?

I missed reading it. :) It looks fine, so you can commit the ChangeLog? entry too when merging this ticket #2136, and update it with other ticket numbers when you merge the other tickets.

I think you can go ahead and merge now. :)

comment:22 in reply to: ↑ 21 Changed 7 years ago by naokikambe

  • Resolution set to fixed
  • Status changed from reviewing to closed

Hello Mukund-san,

Thank you very much for reviewing for long time.
I've merged it, so I'm closing the ticket.
Thanks,

comment:23 Changed 7 years ago by naokikambe

  • Total Hours changed from 0 to 77
Note: See TracTickets for help on using tickets.