Opened 9 years ago

Closed 9 years ago

#387 closed defect (fixed)

xfrin not checking for new copy of zone

Reported by: shane Owned by: shane
Priority: low Milestone: Sprint-20110517
Component: xfrin Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 2.0 Internal?: no

Description

My BIND 10 instance is not getting the latest version of my zone.

I have it acting as a secondary for 2 zones:

time-travellers.org
time-travellers.nl.eu.org

However, after the zone were re-signed, no new versions have arrived. I restarted the BIND 10 process, and it still has not transfered a new versions of the zones.

shane@shane-asus-laptop:~$ dig @madras.curryboys.net +noall +answer -t soa time-travellers.org
time-travellers.org.	3600	IN	SOA	madras.curryboys.net. shane_kerr.yahoo.com. 2010102001 900 1800 864000 600
shane@shane-asus-laptop:~$ dig @b10-ns.time-travellers.org +noall +answer -t soa time-travellers.org
time-travellers.org.	3600	IN	SOA	madras.curryboys.net. shane_kerr.yahoo.com. 2010101206 900 1800 864000 600
shane@shane-asus-laptop:~$ dig @madras.curryboys.net +noall +answer -t soa time-travellers.nl.eu.org
time-travellers.nl.eu.org. 3600	IN	SOA	madras.curryboys.net. shane_kerr.yahoo.com. 2010102001 86400 7200 604800 3600
shane@shane-asus-laptop:~$ dig @b10-ns.time-travellers.org +noall +answer -t soa time-travellers.nl.eu.org
time-travellers.nl.eu.org. 3600	IN	SOA	madras.curryboys.net. shane_kerr.yahoo.com. 2010101212 86400 7200 604800 3600

Do we not act on a received NOTIFY now? And also, do we not check for the zone on startup?

Assigning to Jerry, but please re-assign if necessary.

Subtickets

Attachments (1)

zonemgr.diff (13.2 KB) - added by zzchen_pku 9 years ago.

Download all attachments as: .zip

Change History (25)

comment:1 Changed 50 years ago by zzchen_pku

  • Add Hours to Ticket changed from 0.0 to 2.0
  • Total Hours changed from 0.0 to 2.0

comment:1 Changed 9 years ago by jreed

I see they have new SOAs now. Did you do anything?

comment:2 in reply to: ↑ description Changed 9 years ago by zzchen_pku

  • Status changed from new to accepted

Replying to shane:

Do we not act on a received NOTIFY now? And also, do we not check for the zone on startup?

I tried it again, it seems Xfrin will act on a received NOTIFY, have you configured the slave address in also-notify?
Currently, zonemgr won't check for the zones on startup, I'll fix it.

comment:3 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to UnAssigned
  • Status changed from accepted to reviewing

Since it's quite trivial, I just attached a patch to fix it.
Zonemgr will check for the zones on startup.

comment:4 in reply to: ↑ 3 ; follow-up: Changed 9 years ago by jinmei

Replying to zzchen_pku:

Since it's quite trivial, I just attached a patch to fix it.
Zonemgr will check for the zones on startup.

The change is simple, but I couldn't understand what was the problem of the original code and how the change solves it, just from the diff.

Maybe Shane (the reporter) can understand it immediately, and if so, it should be assigned to him. Otherwise, please provide more information.

comment:5 Changed 9 years ago by jinmei

  • Owner changed from UnAssigned to zzchen_pku

comment:6 in reply to: ↑ 4 ; follow-up: Changed 9 years ago by zzchen_pku

Replying to jinmei:

The change is simple, but I couldn't understand what was the problem of the original code and how the change solves it, just from the diff.

Originally, zonemgr will load zones and set refresh timers for them at startup, each zone has to wait (soa_refresh - jitter) sec to check zone update for the first time(if not received notify during this time).
With this patch, the initial zone refresh timeout will be set 0, so xfrin will check for zone status as soon as zonemgr start.

comment:7 in reply to: ↑ 6 ; follow-up: Changed 9 years ago by jinmei

Replying to zzchen_pku:

Replying to jinmei:

The change is simple, but I couldn't understand what was the problem of the original code and how the change solves it, just from the diff.

Originally, zonemgr will load zones and set refresh timers for them at startup, each zone has to wait (soa_refresh - jitter) sec to check zone update for the first time(if not received notify during this time).
With this patch, the initial zone refresh timeout will be set 0, so xfrin will check for zone status as soon as zonemgr start.

Okay, but shouldn't we also include a jitter in this case? Not sure if this patch changes something on this point.

BTW, we need proposed text for the changelog entry.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

Replying to jinmei:

Okay, but shouldn't we also include a jitter in this case? Not sure if this patch changes something on this point.

Sure,we need a jitter for this case, I have updated the patch and added a new config parameter for reload jitter, please have a look.

BTW, we need proposed text for the changelog entry.

Please find the proposed changelog entry below:

src/bin/zonemgr: Check for zone status after loading a zone into zonemgr. Imposes some random jitters to avoid many zones need to do 
refresh at the same time.(Trac #387, svn rTBD)

comment:9 in reply to: ↑ 8 ; follow-up: Changed 9 years ago by jinmei

  • Owner changed from jinmei to zzchen_pku

Replying to zzchen_pku:

Replying to jinmei:

Okay, but shouldn't we also include a jitter in this case? Not sure if this patch changes something on this point.

Sure, need a jitter for this case, I have updated the patch and added a new config parameter for reload jitter, please have a look.

Hmm...now I have even more questions...why do we have to do something special for the jitter just by changing the base parameter? From a quick look at the code, is it because only a minus jitter (and only when the result is a future time) is introduced? If so, why only minus? (Is that what BIND 9 does?) And, from a quick look at the revised patch, we basically wait for the SOA retry time. Is that okay? What if retry is big, such as 30 minutes? (Again, is this what BIND 9 does?)

Plus...especially now that you've introduced quite a lot of changes including new config parameters, I guess we need tests for these.

Another minor point: "set_zone_reload_timer" sounds counter intuitive because it sounds like a timer for reloading (i.e., we reload something when the timer expires).

BTW, we need proposed text for the changelog entry.

Please find the proposed changelog entry below:

src/bin/zonemgr: Check for zone status after loading a zone into zonemgr. Imposes some random jitters to avoid many zones need to do 
refresh at the same time.(Trac #387, svn rTBD)

This doesn't specify a change category (I suspect it's a "bug"). And it doesn't explain what was wrong (just like I asked for the change). Also, if you change existing parameters, it will break backward compatibility, so there should be a "*" mark.

Changed 9 years ago by zzchen_pku

comment:10 in reply to: ↑ 9 Changed 9 years ago by zzchen_pku

Replying to jinmei:

Hmm...now I have even more questions...why do we have to do something special for the jitter just by changing the base parameter? From a quick look at the code, is it because only a minus jitter (and only when the result is a future time) is introduced? If so, why only minus? (Is that what BIND 9 does?) And, from a quick look at the revised patch, we basically wait for the SOA retry time. Is that okay? What if retry is big, such as 30 minutes? (Again, is this what BIND 9 does?)

I think we have discussed the topic in bind10-dev before, I listed some of them below. Maybe we haven't reached a consensus on it, appreciate your suggestions.

> >  - about timeout: BIND 9 imposes some random jitters for refresh and
> >    retry (and perhaps some other) timers.  we probably want to do the
> >    same thing.
> 
> Thanks for your hint, we will use it in bind10.
> I checked the code of bind9, random jitters are only used for refresh
time:
> = When bind9 starts, jitter for refresh time is  rand() % (retry * 3 / 4)
> = Refresh fails, jitter for next refresh time is   rand() % (retry / 4)
> = Refresh successes, jitter for next refresh time is rand() % 
> (refresh/4)
> > I see there as being two alternatives.  If "r" is the refresh time and "j"
> the jitter value we could set it that the zone refreshes:
> > a) any time between r - j and r
> > or
> > b) any time between r - j and r + j
> > My own thought is that the latter is the most natural definition, as 
> > the term jitter tends to imply a deviation about a mean.
> Yeah, from the mean of jitter, the latter one is more natural, but my 
> concern is, it will delay zone refresh which may not user wants,
> 
> Like a meeting or check in time of company, we can arrive ahead or in 
> time, if you are late, what the boss will say?


Plus...especially now that you've introduced quite a lot of changes including new config parameters, I guess we need tests for these.

Yes,that is necessary,I have updated the unittest.

Another minor point: "set_zone_reload_timer" sounds counter intuitive because it sounds like a timer for reloading (i.e., we reload something when the timer expires).

Done.

This doesn't specify a change category (I suspect it's a "bug"). And it doesn't explain what was wrong (just like I asked for the change). Also, if you change existing parameters, it will break backward compatibility, so there should be a "*" mark.

121.  [bug]*      jerry
src/bin/zonemgr: Fix a bug that xfrin not checking for new copy of zone on startup. 
Imposes some random jitters to avoid many zones need to do refresh at the same time.
(Trac #387, svn rTBD)

comment:11 follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to jinmei

comment:12 in reply to: ↑ 11 Changed 9 years ago by jinmei

  • Owner changed from jinmei to UnAssigned

Replying to zzchen_pku:

This ticket now seems to go beyond the "quite trivial" level, and it doesn't seem reasonable to take on it as a side business. Let's keep it open and consider it as an individual bug fix task for a future (next?) sprint.

comment:13 Changed 9 years ago by shane

  • Milestone set to A-Team-Task-Backlog
  • Priority changed from major to minor

The xfrin does seem to be getting copies of the zone now, although IIRC we are using lots of CPU for this. Anyway, keeping this ticket open, and assiging it to the A-Team backlog.

comment:14 Changed 9 years ago by stephen

  • Milestone A-Team-Task-Backlog deleted

Milestone A-Team-Task-Backlog deleted

comment:15 Changed 9 years ago by shane

  • Milestone set to Sprint-20110405

Needs to be looked at by someone.

comment:16 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:17 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zzchen_pku

Hello

The code seems OK. This should solve the problem it does not check them after restart. But why no new zone arrived at the beginning? Was it a different problem or misunderstanding and there's no problem?

And, what if someone have really a lot of zones? In that case most of them will try to refresh near the startup (not directly at once because of the jitter), and the retry time should probably be short (a lot shorter than usual refresh). Is it possible to limit number of concurrent requests or something? (I know this is out of scope of this ticket).

Anyway, it seems good enough to merge now.

comment:18 in reply to: ↑ 17 ; follow-up: Changed 9 years ago by zzchen_pku

  • Owner changed from zzchen_pku to vorner

Replying to vorner:

Hello
This should solve the problem it does not check them after restart. But why no new zone arrived at the beginning? Was it a different problem or misunderstanding and there's no problem?

In the original implementation, zonemgr manage zones in sqlite db, so no new zone arrives at the beginning.

And, what if someone have really a lot of zones? In that case most of them will try to refresh near the startup (not directly at once because of the jitter), and the retry time should probably be short (a lot shorter than usual refresh). Is it possible to limit number of concurrent requests or something? (I know this is out of scope of this ticket).

I think xfrin should do this work, xfrin has its own _max_transfers_in, when the concurrent requests exceeds _max_transfers_in, it can send a command with ZONE_XFRIN_FAILED flag to zonemgr, then zonemgr will retry to do refresh for the specified zone.

I will merge the patch, thanks.

comment:19 in reply to: ↑ 18 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to zzchen_pku

Hello

Replying to zzchen_pku:

Replying to vorner:

This should solve the problem it does not check them after restart. But why no new zone arrived at the beginning? Was it a different problem or misunderstanding and there's no problem?

In the original implementation, zonemgr manage zones in sqlite db, so no new zone arrives at the beginning.

No, we misunderstood. Shane did have the zone there, then he updated the zone on master. And this happened:

  1. No new version of the zone was arriving for a long time.
  2. He tried to solve it the usual way, by turning the thing off and on again.
  3. Still no new version of the zone arrived.

I agree this solves 3., there's nothing to solve about 2. (definitely not by a programmer), but my question was about 1. ‒ why the new version wasn't getting in for a long time? Is it known?

Maybe we should ask Shane, how long he did wait and if there were notifies, etc. Could you maybe pass him the ticket after you merge it to master, instead of closing?

Thanks

comment:20 in reply to: ↑ 19 Changed 9 years ago by zzchen_pku

Replying to vorner:

This should solve the problem it does not check them after restart. But why no new zone arrived at the beginning? Was it a different problem or misunderstanding and there's no problem?

In the original implementation, zonemgr manage zones in sqlite db, so no new zone arrives at the beginning.

No, we misunderstood. Shane did have the zone there, then he updated the zone on master. And this happened:

  1. No new version of the zone was arriving for a long time.
  2. He tried to solve it the usual way, by turning the thing off and on again.
  3. Still no new version of the zone arrived.

I agree this solves 3., there's nothing to solve about 2. (definitely not by a programmer), but my question was about 1. ‒ why the new version wasn't getting in for a long time? Is it known?

Yeah, I understood. I wonder if it's a config problem or cased by other issues, since the test success in my environment.

Maybe we should ask Shane, how long he did wait and if there were notifies, etc. Could you maybe pass him the ticket after you merge it to master, instead of closing?

Okay, I will.

comment:21 Changed 9 years ago by zzchen_pku

  • Defect Severity set to N/A
  • Owner changed from zzchen_pku to shane
  • Sub-Project set to DNS

Merged, leave it open and reassign to Shane.
Please refer to comment:19

comment:22 Changed 9 years ago by zzchen_pku

  • Add Hours to Ticket changed from 0 to 2

comment:23 Changed 9 years ago by shane

  • Resolution set to fixed
  • Status changed from reviewing to closed

Okay, I have retested this, and my new copy of the zone arrived without problem. I'll close it, and we can re-open if it happens again.

Note: See TracTickets for help on using tickets.