Opened 9 years ago

Closed 9 years ago

#684 closed defect (fixed)

b10-xfrout uses 100% of CPU

Reported by: shane Owned by: zzchen_pku
Priority: high Milestone: Sprint-20110517
Component: xfrout Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 4.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

The b10-xfrout process is using 100% of CPU on my server.

I see several errors:

09-Mar-2011 14:37:59.552 Xfrout: INFO: sending notify to 85.17.60.232#53
[b10-zonemgr] Received notify command for zone (time-travellers.org., IN).
09-Mar-2011 14:37:59.554 Xfrout: INFO: notify to 85.17.60.232#53: retried exceeded
[b10-xfrin] transfer of 'time-travellers.org.': AXFR succeeded
09-Mar-2011 14:38:00.327 Xfrout: INFO: zone 'time-travellers.org./IN': receive notify others command
09-Mar-2011 14:38:31.552 Xfrout: ERROR: send notify to 2001:610:719:1::3#53 failed: [Errno -9] Address family for hostname not supported
09-Mar-2011 14:38:33.550 Xfrout: INFO: notify retry to 2001:610:719:1::3#53
09-Mar-2011 14:38:33.553 Xfrout: ERROR: send notify to 2001:610:719:1::3#53 failed: [Errno -9] Address family for hostname not supported
09-Mar-2011 14:38:37.550 Xfrout: INFO: notify retry to 2001:610:719:1::3#53
09-Mar-2011 14:38:37.553 Xfrout: ERROR: send notify to 2001:610:719:1::3#53 failed: [Errno -9] Address family for hostname not supported
09-Mar-2011 14:38:45.551 Xfrout: INFO: notify retry to 2001:610:719:1::3#53
09-Mar-2011 14:38:45.554 Xfrout: ERROR: send notify to 2001:610:719:1::3#53 failed: [Errno -9] Address family for hostname not supported

The machine is using IPv4 only, although the zones have IPv6 servers configured.

Also, some of the slaves will not accept NOTIFY from this server.

I am guessing one of these is the source of the problem.

Subtickets

Attachments (1)

notifyout.diff (710 bytes) - added by zzchen_pku 9 years ago.

Download all attachments as: .zip

Change History (17)

comment:1 Changed 9 years ago by stephen

  • Milestone A-Team-Task-Backlog deleted

Milestone A-Team-Task-Backlog deleted

comment:2 Changed 9 years ago by shane

This is still a problem in the latest release (2011-03).

comment:3 Changed 9 years ago by shane

  • Priority changed from major to critical

I had a look at this today, as it is affecting my server. strace on the offending thread reveals:

select(14, [10 12 13], [], [], {0, 0})  = 0 (Timeout)
gettimeofday({1301656680, 878097}, NULL) = 0
gettimeofday({1301656680, 878149}, NULL) = 0
gettimeofday({1301656680, 878221}, NULL) = 0
select(14, [10 12 13], [], [], {0, 0})  = 0 (Timeout)
gettimeofday({1301656680, 878369}, NULL) = 0
gettimeofday({1301656680, 878421}, NULL) = 0
   .
   .
   .

I looked through the code and the only select() call that seems to match is this one from the notify code:

        (block_timeout, valid_socks, notifying_zones) = \
            self._prepare_select_info()
        # This is None only during some tests
        if self._read_sock is not None:
            valid_socks.append(self._read_sock)
        try:
            r_fds, w, e = select.select(valid_socks, [], [], block_timeout)
        except select.error as err:
            if err.args[0] != EINTR:
                return {}, {}

The block_timeout is getting set to 0 for some reason. This may have to do with errors sending because it is trying to use IPv6 as notify targets.

comment:4 Changed 9 years ago by shane

  • Milestone set to Sprint-20110419

comment:5 Changed 9 years ago by zhanglikun

  • Owner set to zhanglikun
  • Status changed from new to assigned

I will look at this problem

comment:6 Changed 9 years ago by zhanglikun

  • Owner changed from zhanglikun to UnAssigned

according the scrum, one person can only focus on one ticket, so, let me release this ticket first, only focus on ticket 598.

comment:7 Changed 9 years ago by zzchen_pku

  • Defect Severity set to N/A
  • Owner changed from UnAssigned to zzchen_pku
  • Sub-Project set to DNS

Changed 9 years ago by zzchen_pku

comment:8 Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to shane

Hi, Shane,
After looking into the code, I found when we call send_notify() for a zone whose notify_slaves is empty(which means the zone only has one apex ns record - the primary master name server), 'zone_notify_info.get_current_notify_target()' always returns None, so 'zone_notify_info.notify_timeout' shall never be updated and the zone will never be removed from '_notifying_zones' list, so block_timeout's value keeps 0 all the time, which will cause 100% CPU.

tgt = zone_notify_info.get_current_notify_target()
if tgt:
    zone_notify_info.notify_try_num += 1
    if zone_notify_info.notify_try_num > _MAX_NOTIFY_TRY_NUM:
        self._log_msg('info', 'notify to %s: retried exceeded' % addr_to_str(tgt))
        self._notify_next_target(zone_notify_info)
    else:
        retry_timeout = _NOTIFY_TIMEOUT * pow(2, zone_notify_info.notify_try_num)
        # set exponential backoff according rfc1996 section 3.6
        zone_notify_info.notify_timeout = time.time() + retry_timeout
        self._send_notify_message_udp(zone_notify_info, tgt)

The attached patch should resolve this bug.
But I am not sure if it is the real reason of your issue, can you please try it and give me some feedback?

Thanks.

Last edited 9 years ago by zzchen_pku (previous) (diff)

comment:9 Changed 9 years ago by shane

Okay great. I have applied this patch and will wait a few days and see if the problem happens again. I cannot reproduce it on demand, so we cannot be 100% sure, but it is better than nothing. :)

comment:10 Changed 9 years ago by stephen

  • Estimated Difficulty changed from 0.0 to 4

comment:11 Changed 9 years ago by shane

  • Resolution set to fixed
  • Status changed from assigned to closed

I've been running with this patch for a few days now, and have not seen the problem reappear. I'll close the ticket for now, and if we see the problem re-appear I'll reopen it.

comment:12 follow-up: Changed 9 years ago by shane

  • Resolution fixed deleted
  • Status changed from closed to reopened

Whoops, I thought this was applied to master already.

I'm changing the owner back to Jerry. I think we need a test added to duplicate this problem (probably in test_send_notify?), and then it can go into review.

comment:13 Changed 9 years ago by shane

  • Owner changed from shane to zzchen_pku
  • Status changed from reopened to assigned

comment:14 in reply to: ↑ 12 Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to shane
  • Status changed from assigned to reviewing

Replying to shane:

Whoops, I thought this was applied to master already.

I'm changing the owner back to Jerry. I think we need a test added to duplicate this problem (probably in test_send_notify?), and then it can go into review.

trac684 is ready for review now.
new change : git diff -r fe8babb0 9c3195a5

comment:15 Changed 9 years ago by shane

  • Owner changed from shane to zzchen_pku

Okay, I think this looks good. Please go ahead and merge!

comment:16 Changed 9 years ago by zzchen_pku

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged. Thanks.

Note: See TracTickets for help on using tickets.