Opened 9 years ago

Closed 9 years ago

#851 closed defect (fixed)

b10-auth (+ hotspot cache) can crash in handling query when DB is busy

Reported by: jinmei Owned by: jinmei
Priority: high Milestone: Sprint-20110503
Component: data source Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 3.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

When b10-auth uses the sqlite3 data source with the hot spot cache, a
substantial difference can arise between the cache content and the
actual database. In the worst case this can crash b10-auth (and it
apparently did on my personal "production" server).

Specifically, DataSrc::doQuery() can reach the following point after
retrieving the answer from the cache successfully:

        const Name* const zonename = zoneinfo.getEnclosingZone();
        if ((task->flags & REFERRAL) != 0 &&
            (zonename->getLabelCount() == task->qname.getLabelCount() ||

But if the zone itself has been deleted from the DB by the time
doQuery() is called, getEnclosingZone() can return NULL, which (when
the task has the REFERRAL flag on) subsequently causes a crash due to
a NULL pointer dereference.
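The guard that prevents this crash can be sketched minimally as
follows. All names here (Name, ZoneInfo, the Rcode mapping) are
hypothetical stand-ins for illustration, not the actual BIND 10 API:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the BIND 10 types involved.
struct Name {
    size_t labels;
    size_t getLabelCount() const { return labels; }
};

struct ZoneInfo {
    const Name* zone;  // nullptr if the zone vanished from the DB
    const Name* getEnclosingZone() const { return zone; }
};

enum Rcode { NOERROR, SERVFAIL };

// Instead of dereferencing unconditionally, treat a missing enclosing
// zone as a cache/DB inconsistency and fail the query safely.
Rcode checkReferral(const ZoneInfo& zoneinfo, const Name& qname) {
    const Name* const zonename = zoneinfo.getEnclosingZone();
    if (zonename == nullptr) {
        return SERVFAIL;  // previously: NULL dereference -> crash
    }
    if (zonename->getLabelCount() == qname.getLabelCount()) {
        // ... the original referral handling would continue here ...
    }
    return NOERROR;
}
```

The design choice is simply to map the inconsistency to SERVFAIL
rather than trying to resolve it during query processing.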

Such inconsistency between the cache and DB can always happen under
normal operation, but there's another scenario that can cause the
same effect more realistically. sqlite3_step() seems to fail
immediately with a return value of SQLITE_BUSY when the DB is locked,
but our sqlite3 data source code treats every unsuccessful return
code as if no record were found (the case where SQLITE_DONE is
returned). So, for example, if b10-auth handles a query while xfrin
is updating the DB, a call to sqlite3_step() inside the sqlite3 data
source can fail with SQLITE_BUSY and produce the same effect as the
previous scenario.
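The proposed separation of return codes could look like the sketch
below. The sqlite3 constants are stubbed here so the sketch is
self-contained (the values match those in sqlite3.h, which a real
build would include instead); classifyStep and StepOutcome are
hypothetical names:

```cpp
#include <cassert>

// sqlite3 return codes stubbed for self-containment; the values match
// <sqlite3.h>, which a real build would include instead.
enum { SQLITE_BUSY = 5, SQLITE_ROW = 100, SQLITE_DONE = 101 };

enum class StepOutcome { Found, NotFound, Error };

// Only SQLITE_DONE means "no record found"; SQLITE_BUSY and any other
// failure become a distinct error, which query processing would map
// to SERVFAIL.
StepOutcome classifyStep(int rc) {
    if (rc == SQLITE_ROW) {
        return StepOutcome::Found;     // a matching record is available
    }
    if (rc == SQLITE_DONE) {
        return StepOutcome::NotFound;  // genuinely no record
    }
    return StepOutcome::Error;         // SQLITE_BUSY, I/O error, etc.
}
```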

For an initial fix, we should at least stop assuming
getEnclosingZone() always returns a valid pointer (even after a
successful initial lookup). If we detect an inconsistency, it should
result in SERVFAIL for now.

We should then separate SQLITE_DONE from other failures of
sqlite3_step(). In the latter case, I suspect we should also return
SERVFAIL for now.

For a longer-term solution, we should ensure (if possible at all)
that queries don't result in SERVFAIL merely because a zone is being
updated after xfrin, etc. (I believe SERVFAIL is okay in rare
operational cases, such as when the zone is really deleted while some
RRs of the zone still remain in the cache). But this will be
difficult to implement. One simple approach is to impose a short
timeout for sqlite3_step() or a retry after SQLITE_BUSY, but that's
not 100% reliable. Also, we probably don't want to introduce a
blocking operation in query processing, even a small one. So this
should be deferred to a separate future task.
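The retry idea above can be sketched as a bounded retry loop. Here
sqlite3_step() is replaced by a stub so the sketch runs on its own;
in real code the call would go to the sqlite3 API, or one could
simply rely on sqlite3_busy_timeout(). All function names are
illustrative:

```cpp
#include <cassert>

// Return-code values match <sqlite3.h>; stubbed for self-containment.
enum { SQLITE_BUSY = 5, SQLITE_DONE = 101 };

// Stub standing in for sqlite3_step(): the DB reports busy for the
// first two calls, then completes.
static int g_calls = 0;
static int fakeStep() { return (++g_calls <= 2) ? SQLITE_BUSY : SQLITE_DONE; }

// Bounded retry after SQLITE_BUSY: the "simple but not 100% reliable"
// idea discussed above.  Real code would likely also back off briefly
// between attempts, which is exactly the blocking behavior we'd rather
// avoid in query processing.
int stepWithRetry(int maxRetries) {
    int rc = fakeStep();
    for (int i = 0; rc == SQLITE_BUSY && i < maxRetries; ++i) {
        rc = fakeStep();
    }
    return rc;
}
```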

Subtickets

Change History (13)

comment:1 Changed 9 years ago by jinmei

p.s. there's a workaround for the crash bug: disable the hot spot
cache. I'm now doing so on my server.

comment:2 Changed 9 years ago by jinmei

After looking at the code closely I found the immediate problem
(i.e. server crash) can be fixed by a quite simple patch. So I did
it. It's in the trac581 branch and is ready for review.

This fix is a bit suboptimal in terms of performance, because it
always tries to identify the best matching zone in the data sources,
even if the desired answer is cached and subsequent processing
doesn't require the zone information. But IMO this data source code
is already too complicated to identify the exact cases that require
the zone information (and IMO it requires substantial
cleanup/refactoring anyway). Also, even with the hot spot cache it's
still not that fast. Besides, in a quick local benchmark using
auth/benchmarks/query_bench, I've seen no visible performance drop
due to this patch.

I'd leave the fix to the sqlite3 data source mentioned in the problem
description (handling error cases other than SQLITE_DONE separately)
to a different ticket.

The proposed changelog entry:

219.?	[bug]		jinmei
	b10-auth, src/lib/datasrc: inconsistency between the hot spot
	cache and actual data source could cause a crash while query
	processing.  The crash could happen, e.g., when an sqlite3 DB file
	is being updated after a zone transfer while b10-auth handles a
	query using the corresponding sqlite3 data source.
	(Trac #851, git TBD)

comment:3 Changed 9 years ago by jinmei

  • Owner set to UnAssigned
  • Status changed from new to reviewing

comment:4 Changed 9 years ago by stephen

  • Milestone changed from Year 3 Task Backlog to Sprint-20110503

comment:5 Changed 9 years ago by vorner

  • Defect Severity set to N/A
  • Owner changed from UnAssigned to vorner
  • Sub-Project set to DNS

comment:6 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Looking at the surrounding code, it is indeed quite hairy. If the data really lives in the cache, why did it touch the database in the first place? But that's not the point here, I guess.

I have one question: is the whole thing wrapped in some kind of transaction? If yes, then I believe the code can be merged. If not, then this only makes the probability of occurrence smaller, because the following code still expects that if it got there, it can get the data from the DB and be happy.

Thanks

comment:7 in reply to: ↑ 6 Changed 9 years ago by jinmei

Replying to vorner:

Looking at the surrounding code, it is indeed quite hairy. If the data really lives in the cache, why did it touch the database in the first place? But that's not the point here, I guess.

Do you mean, for example, that zoneinfo (or the part of it that is
necessary for query processing) should also be cached? If so, I tend
to agree, but as you probably understand, that's a separate issue. We
could solve this symptom along with that point, but I'm afraid it
would require more fundamental and bigger changes and longer
development/review cycles. I also believe the current architecture of
query processing and its interaction with the hot spot cache are
already too complicated, and I'm reluctant to introduce more bandaids
without revisiting the architecture (which we've been calling
"refactoring"). Hopefully we'll have time for this architecture-level
work this year.

I have one question: is the whole thing wrapped in some kind of transaction? If yes, then I believe the code can be merged. If not, then this only makes the probability of occurrence smaller, because the following code still expects that if it got there, it can get the data from the DB and be happy.

I'm not sure I understand your concern. First, there's no transaction
(in any sense) around this logic. But I don't think the revised code
has any possibility of triggering the same bug, even with a smaller
probability. Any query process begins with doQuery(), which always
calls doQueryTask(). If the timing issue that triggered this crash
bug happens in our first attempt at getting the zoneinfo in
doQueryTask(), it will result in 'task.flags' having the NO_SUCH_ZONE
flag. Then in doQueryTask() any further processing is canceled
(resulting either in a REFUSED answer or in a missing part of the
response):

        if (task->flags == NO_SUCH_ZONE) {
            if (task->state == QueryTask::GETANSWER) {
                m.setRcode(Rcode::REFUSED());
                return;
            }
            continue;
        }

If you think this observation is incorrect, or you meant something
different, could you explain your concern with reference to the code?

comment:8 Changed 9 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:9 follow-up: Changed 9 years ago by vorner

  • Owner changed from vorner to jinmei

Good morning

Well, I didn't really read the code in depth. But as I understand it, it worked like this before the change:

  • Someone asked a query, the answer got cached.
  • Some time passes.
  • The DB becomes busy (by xfrin maybe).
  • Someone asks the same query.
  • The cache says it's valid, so we continue processing.
  • We get part of the answer from cache.
  • We ask the DB for something more (e.g. the zoneinfo.getEnclosingZone() on line 947); it fails because the DB is busy and returns NULL. We don't expect NULL here, so we crash.

What I think can happen now is this:

  • Someone asked a query, the answer got cached.
  • Some time passes.
  • Someone asks the same query.
  • We first check the DB (in doQueryTask) if it is valid. The DB confirms.
  • We get part of the answer from cache.
  • The DB becomes busy right now (because of xfrin). It's a tiny time window, but it can happen; we didn't lock the DB in a transaction.
  • We ask the DB for something more (again, the zoneinfo.getEnclosingZone() on line 947); it fails, and we crash.

Does that make sense?

Or is that result kept somewhere between the calls? Should I read the code more thoroughly?

Anyway, asking the DB several times without a transaction is IMO wrong, because we can get inconsistent information (someone can modify the DB between our queries to it).

With regards

comment:10 in reply to: ↑ 9 ; follow-up: Changed 9 years ago by jinmei

Replying to vorner:

Well, I didn't really read the code in depth. But as I understand it, it worked like this before the change:

  • Someone asked a query, the answer got cached.
  • Some time passes.
  • The DB becomes busy (by xfrin maybe).
  • Someone asks the same query.
  • The cache says it's valid, so we continue processing.
  • We get part of the answer from cache.
  • We ask the DB for something more (e.g. the zoneinfo.getEnclosingZone() on line 947); it fails because the DB is busy and returns NULL. We don't expect NULL here, so we crash.

This is correct (or at least matches my understanding).

What I think can happen now is this:

  • Someone asked a query, the answer got cached.
  • Some time passes.
  • Someone asks the same query.
  • We first check the DB (in doQueryTask) if it is valid. The DB confirms.

And the 'zoneinfo' variable stores the positive result. From this
point, zoneinfo.getDataSource() and zoneinfo.getEnclosingZone() won't
require a DB search, and both always return non-NULL.

  • We get part of the answer from cache.
  • The DB becomes busy right now (because of xfrin). It's a tiny time window, but it can happen; we didn't lock the DB in a transaction.
  • We ask the DB for something more (again, the zoneinfo.getEnclosingZone() on line 947); it fails, and we crash.

This doesn't happen, because zoneinfo remembers the positive result.

And, from my check before submitting this patch, there shouldn't be
any other form of unintentional DB access within the same while loop
of DataSrc::doQuery().
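The "remember the positive result" pattern described above can be
illustrated with a small hypothetical sketch. FakeDb and these method
names are inventions for illustration, not the BIND 10 code:

```cpp
#include <cassert>
#include <string>

// Stand-in for the backing database; counts how often it is consulted.
struct FakeDb {
    int lookups = 0;
    bool findZone(const std::string& qname, std::string* zone) {
        ++lookups;               // counts actual DB accesses
        *zone = "example.org";   // pretend the enclosing zone exists
        return true;
    }
};

class ZoneInfo {
public:
    ZoneInfo(FakeDb& db, const std::string& qname) {
        found_ = db.findZone(qname, &zone_);  // single DB access, up front
    }
    // Uses only the stored result: a DB that becomes busy after
    // construction cannot turn this into NULL mid-query.
    const std::string* getEnclosingZone() const {
        return found_ ? &zone_ : nullptr;
    }
private:
    bool found_ = false;
    std::string zone_;
};
```

Because the accessor never goes back to the DB, the tiny time window
in the scenario above closes: the query either fails up front (and is
canceled via NO_SUCH_ZONE) or proceeds with a consistent snapshot.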

Anyway, asking the DB several times without having a transaction is IMO wrong, because we can get inconsistent information (someone can modify the DB between our queries to it).

Whether or not we solve it with a transaction, yes, it's bad that we
can have (and subsequently return) inconsistent data in terms of 100%
accuracy. This argument applies even without the hot spot cache, and
IMO it is far beyond the scope of this bug fix ticket. So, in any
case, I'd propose separating out this issue and concentrating on
fixing the crash bug by making the code as good/bad as the code
without the hot spot cache.

Having noted that, as food for thought for that separate discussion:
we previously discussed how accurate we want to be about this type of
timing issue. BIND 9 tries to be very accurate, so that, e.g., all
records from the same zone in a single response belong to the same
version even if the zone is being modified by a dynamic update
running in parallel with the query processing. But to ensure this
level of accuracy its code became super complicated. We discussed
whether we should provide the same level of accuracy at all cost and
complexity in BIND 10, and, although I don't think there was a clear
consensus, we generally thought there should be a better middle
ground (i.e., preferring implementation simplicity while sacrificing
some level of accuracy).

I have no problem with revisiting this discussion. But please trigger
it in a different thread.

comment:11 in reply to: ↑ 10 ; follow-up: Changed 9 years ago by vorner

Hello

Replying to jinmei:

And the 'zoneinfo' variable stores the positive result. From this
point, zoneinfo.getDataSource() and zoneinfo.getEnclosingZone() won't
require a DB search, and both always return non-NULL.

OK, then I was too paranoid about the code and my worry doesn't apply. Therefore I think this can be merged.

accuracy. This argument applies even without the hot spot cache, and
IMO is far beyond the scope of this bug fix ticket. So, in any case,

Yes, of course. I just noticed it while looking at the ticket, so I commented on it; it doesn't mean it must be resolved now.

I have no problem with revisiting this discussion. But please trigger
it in a different thread.

OK. I think I'll bring it up once we go about merging the data source APIs, at which time we will be going through the processing logic anyway.

Thanks

comment:12 in reply to: ↑ 11 Changed 9 years ago by jinmei

Replying to vorner:

accuracy. This argument applies even without the hot spot cache, and
IMO is far beyond the scope of this bug fix ticket. So, in any case,

Yes, of course. I just noticed it while looking at the ticket, so I commented on it; it doesn't mean it must be resolved now.

Okay, merge done. I've also created the follow-up ticket I mentioned
earlier (before review) in this ticket (#860).

Now closing the ticket. Thanks for the review.

comment:13 Changed 9 years ago by jinmei

  • Estimated Difficulty changed from 0.0 to 3.0
  • Resolution set to fixed
  • Status changed from reviewing to closed