Opened 9 years ago

Closed 9 years ago

#626 closed defect (fixed)

unexpected response from bind10 (SERVFAIL from auth server) (hotcache issue)

Reported by: shane Owned by: jinmei
Priority: high Milestone: A-Team-Sprint-20110316
Component: b10-auth Version:
Keywords: Cc: zhangcuiling@…
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 5.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

Cuiling reports:


hi,

i found something strange with bind10. in certain situation, bind10 replies with SERVFAIL for a while.

preparation


[root@devel sbin]# ./bind10 --version
bind10 20101129 (BIND 10 20110120)

[root@devel bin]# cat demo.example.com.txt 
$TTL 43200
@       IN      SOA     NS1.example.com.        root.example.com. (
                2009042700      ; serial
                3600    ; refresh
                900     ; retry
                604801  ; expire
                7200    ; minimum
                )
        IN      NS      NS1
        IN      NS      NS2
        IN      NS      NS3

$ORIGIN demo.example.com.
ns1     IN      A       10.10.1.1
ns2     IN      A       10.10.1.2
ns3     IN      A       10.10.1.3

[root@devel bin]# ./b10-loadzone -o demo.example.com demo.example.com.txt
Using SQLite3 database file /home/zcl/software/bind10-devel/20110120/var/bind10-devel/zone.sqlite3
Zone name is demo.example.com.
Loading file "demo.example.com.txt"
8 RR(s) loaded in 0.13 second(s) (100.00% of demo.example.com.txt)              
Done.

experiment


[root@devel sbin]# ./bind10 -p 30001
[root@devel sbin]# dig @localhost -p 30001 demo.example.com soa

; <<>> DiG 9.7.2-P2 <<>> @localhost -p 30001 demo.example.com soa
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 63538
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 3, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;demo.example.com.              IN      SOA

;; ANSWER SECTION:
demo.example.com.       43200   IN      SOA     NS1.example.com. root.example.com. 2009042700 3600 900 604801 7200

;; AUTHORITY SECTION:
demo.example.com.       43200   IN      NS      NS1.demo.example.com.
demo.example.com.       43200   IN      NS      NS2.demo.example.com.
demo.example.com.       43200   IN      NS      NS3.demo.example.com.

;; ADDITIONAL SECTION:
NS1.demo.example.com.   43200   IN      A       10.10.1.1
NS2.demo.example.com.   43200   IN      A       10.10.1.2
NS3.demo.example.com.   43200   IN      A       10.10.1.3

;; Query time: 1 msec
;; SERVER: ::1#30001(::1)
;; WHEN: Thu Feb 24 14:59:20 2011
;; MSG SIZE  rcvd: 181

[root@devel sbin]# dig @localhost -p 30001 nxdomain.demo.example.com a

; <<>> DiG 9.7.2-P2 <<>> @localhost -p 30001 nxdomain.demo.example.com a
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 38494
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;nxdomain.demo.example.com.     IN      A

;; Query time: 0 msec
;; SERVER: ::1#30001(::1)
;; WHEN: Thu Feb 24 14:59:32 2011
;; MSG SIZE  rcvd: 43

after a while ( about 30 seconds ):

; <<>> DiG 9.7.2-P2 <<>> @localhost -p 30001 nxdomain.demo.example.com a
; (2 servers found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12707
;; flags: qr aa rd; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;nxdomain.demo.example.com.     IN      A

;; AUTHORITY SECTION:
demo.example.com.       43200   IN      SOA     NS1.example.com. root.example.com. 2009042700 3600 900 604801 7200

;; Query time: 0 msec
;; SERVER: ::1#30001(::1)
;; WHEN: Thu Feb 24 15:00:01 2011
;; MSG SIZE  rcvd: 88

queries in the following sequence can get correct responses:

./bind10 -p 30001
dig @localhost -p 30001 nxdomain.demo.example.com a

./bind10 -p 30001
dig @localhost -p 30001 demo.example.com a
dig @localhost -p 30001 nxdomain.demo.example.com a

Subtickets

Change History (17)

comment:1 Changed 9 years ago by jreed

I'd be interested in seeing if b10-auth says anything about this when ran with verbose mode. (bind10 --verbose)

comment:2 Changed 9 years ago by jreed

I could repeat this.

Verbose says:

[b10-auth] Internal error, returning SERVFAIL: SOA RR not found in demo.example.com./IN

I recall we discussed this problem long ago.

comment:3 Changed 9 years ago by jreed

I had same problem last July and Shane had problem last month.

The problem goes away when I disable hotspot cache: bind10 -n

comment:4 Changed 9 years ago by jreed

  • Priority changed from major to critical

comment:5 Changed 9 years ago by shane

  • Summary changed from unexpected response from bind10 (SERVFAIL from auth server) to unexpected response from bind10 (SERVFAIL from auth server) (hotcache issue)

comment:6 Changed 9 years ago by stephen

  • Milestone changed from A-Team-Task-Backlog to A-Team-Sprint-20110316

comment:7 Changed 9 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:8 Changed 9 years ago by jinmei

  • Estimated Difficulty changed from 0.0 to 5.0

My gut feeling is this is a bit bigger than a "size 3" task.

So, adjusing the usual underestimation of engineers, I'd give it 5.

comment:9 Changed 9 years ago by jinmei

Branch trac626 is ready for review.

As commented I really didn't like this solution, but my understanding
is that we'll revisit the relationship between the query logic and
data source (+ probably hot spot cache) pretty soon, so I'd go with
this workable hack.

The patch is quite small, so I've piggy-backed one irrelevant cleanup
to the test: replace the local headerCheck() to the one in
libtestutils (commit db1c1a4). This refactoring should preserve the
existing test behavior.

This is the proposed changelog entry:

  197.?	[bug]		jinmei
	b10-auth, src/lib/datasrc: fixed a bug where hot spot cache failed
	to reuse cached SOA for negative responses.  Due to this bug
	b10-auth returned SERVFAIL when it was expected to return a
	negative response immediately after a specific SOA query for
	the zone.
	(Trac #626, git TBD)

comment:10 Changed 9 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:11 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:12 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

I confirm the test throws an exception before the fix and works afterwards (which would be cause for the SERVFAIL problem). Also, the code seem to be clean.

But I have a problem understanding what exactly was the problem and how this can possibly fix it. How is referral related to cached SOA? What did I overlook? Could you explain it little bit, please?

Btw, I fixed a compilation problem here, and pushed (one line in makefile).

And, as this was originaly reported by someone external to the project, should we wait for his confirmation as well?

Thanks.

comment:13 in reply to: ↑ 12 ; follow-up: Changed 9 years ago by jinmei

Replying to vorner:

Hello

I confirm the test throws an exception before the fix and works afterwards (which would be cause for the SERVFAIL problem). Also, the code seem to be clean.

But I have a problem understanding what exactly was the problem and how this can possibly fix it. How is referral related to cached SOA? What did I overlook? Could you explain it little bit, please?

The problem is this: when SOA is explicitly queried the data source API
returns "REFERRAL" flag (which means the owner name has an NS, which
is always the case for the owner name (=zone apex) of SOA as long as
the zone is validly configured). In this case hasDelegation() checks
if the NS is really a delegation, excluding the zone apex, so it's not
confused.

On the other hand, in order to find SOA for a negative response,
addSOA() performs "simple query", and rejects any result if it has a
flag (including "REFERRAL"). The "simple query" implementation of
sqlite3 data source (and test data source that largely emulates the
sqlite3 behavior) cancels the REFERRAL flag internally, so it normally
won't confuse addSOA() either.

However, when SOA is first explicitly queried and hot spot cache is
enabled, the result is cached with the "REFERRAL" flag. And, in the
original code cache.retrieve() returns the cached RRset intact, with
the "REFERRAL" flag being on even for the "simple query" mode, which
confused addSOA(), having it discard the result.

My patch adds a post processing after cache.retrieve() so that it
provides the same behavior as the non cache case.

Btw, I fixed a compilation problem here, and pushed (one line in makefile).

Ah, okay, thanks. Hmm, we may not want to rely on libasiolink, but
that would be a separate issue.

And, as this was originaly reported by someone external to the
project, should we wait for his confirmation as well?

Holding merge until we get confirmation? In this case since we can
reproduce the problem and should be quite confident about the cause
and fix, I don't think we have to wait. But we should notify the
original reporter about the fix when we merge the fix to master.

comment:14 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:15 in reply to: ↑ 13 ; follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

Replying to vorner:

But I have a problem understanding what exactly was the problem and how this can possibly fix it. How is referral related to cached SOA? What did I overlook? Could you explain it little bit, please?

The problem is this: when SOA is explicitly queried the data source API
returns "REFERRAL" flag (which means the owner name has an NS, which
is always the case for the owner name (=zone apex) of SOA as long as
the zone is validly configured). In this case hasDelegation() checks
if the NS is really a delegation, excluding the zone apex, so it's not
confused.

OK, it seems to make sense. I tend to agree this part of code probably needs some attention in future.

And, as this was originaly reported by someone external to the
project, should we wait for his confirmation as well?

Holding merge until we get confirmation? In this case since we can
reproduce the problem and should be quite confident about the cause
and fix, I don't think we have to wait. But we should notify the
original reporter about the fix when we merge the fix to master.

No, I don't want to wait with merge. I only asked if we want to wait with closing the ticket (which is different matter than merging). But we might as well close it and let it be reopened in case it doesn't work.

Anyway, as I was saying, this can be merged.

Thanks

comment:16 in reply to: ↑ 15 Changed 9 years ago by jinmei

Replying to vorner:

Holding merge until we get confirmation? In this case since we can
reproduce the problem and should be quite confident about the cause
and fix, I don't think we have to wait. But we should notify the
original reporter about the fix when we merge the fix to master.

No, I don't want to wait with merge. I only asked if we want to wait with closing the ticket (which is different matter than merging). But we might as well close it and let it be reopened in case it doesn't work.

Anyway, as I was saying, this can be merged.

Okay, thanks for review.

I don't have a strong opinion about close or keep open, but I'd be
inclined to close it in terms of task size estimation/evaluation (i.e,
if we required confirmation from the original reporter the duration
may be unnecessarily longer due to a reason we cannot control). If it
turns out that the initial fix doesn't really solve the original
problem I'd open a new ticket as a separate task (or reopen the
existing one with a revised estimation).

So, at the moment, I'll close the ticket (merge already done).

comment:17 Changed 9 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
Note: See TracTickets for help on using tickets.