Opened 7 years ago

Closed 5 years ago

#2609 closed defect (wontfix)

SERVFAIL on all queries while b10-loadzone is running

Reported by: vorner
Owned by:
Priority: medium
Milestone: Remaining BIND10 tickets
Component: Unclassified
Version: bind10-old
Keywords:
Cc:
CVSS Scoring:
Parent Tickets:
Sensitive: no
Defect Severity: N/A
Sub-Project: DNS
Feature Depending on Ticket:
Estimated Difficulty: 0
Add Hours to Ticket: 0
Total Hours: 0
Internal?: no

Description

Configure a large zone in SQLite (without the in-memory cache). When it is
queried, it answers the query. But when b10-loadzone is run to update the zone,
auth returns SERVFAIL to all queries, because the database is locked:

2013-01-07 13:05:55.158 ERROR [b10-auth.auth/19124] AUTH_PROCESS_FAIL message processing failure: Unexpected failure in sqlite3_step: database is locked

Updating a zone shouldn't cause a disruption (with the cz. zone, it takes about
5 minutes to load the new zone into the DB, which is not good).

I know this may not be fixable due to SQLite limitations. But we should probably at least warn when a zone that is too large is loaded there.

Subtickets

Change History (12)

comment:1 follow-ups: Changed 7 years ago by jinmei

Does that happen with the old b10-loadzone, too?

I haven't looked into the implementation, but I thought this scenario
shouldn't cause such a disruption, because it's not a write-write
conflict.

comment:2 in reply to: ↑ 1 ; follow-up: Changed 7 years ago by jinmei

Replying to jinmei:

Does that happen with the old b10-loadzone, too?

I haven't looked into the implementation, but I thought this scenario
shouldn't cause such a disruption, because it's not a write-write
conflict.

Hmm, actually, SQLite3 doesn't even allow reads if there's a
transaction in which some write operations have been performed:
http://www.sqlite.org/faq.html#q5
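
The effect can be reproduced with a few lines of Python's stdlib sqlite3 module (a toy demonstration, not BIND 10 code; `BEGIN EXCLUSIVE` stands in for the exclusive lock that a large rollback-journal-mode write transaction eventually escalates to once its page cache spills to disk):

```python
import os
import sqlite3
import tempfile

# Two connections to the same database file: one playing the role of
# b10-loadzone's long write, one playing the role of b10-auth's reads.
path = os.path.join(tempfile.mkdtemp(), "zone.sqlite3")
writer = sqlite3.connect(path, timeout=0, isolation_level=None)
reader = sqlite3.connect(path, timeout=0, isolation_level=None)

writer.execute("CREATE TABLE records (name TEXT, rdata TEXT)")

# Take an exclusive lock, as a stand-in for a long-running zone load.
writer.execute("BEGIN EXCLUSIVE")
writer.execute("INSERT INTO records VALUES ('www.example.org.', '192.0.2.1')")

error = None
try:
    reader.execute("SELECT rdata FROM records").fetchall()
except sqlite3.OperationalError as exc:
    error = exc  # "database is locked" -- the failure b10-auth reports
print(error)

writer.execute("COMMIT")  # once the writer commits, reads work again
rows = reader.execute("SELECT rdata FROM records").fetchall()
```

(`timeout=0` makes the reader fail immediately instead of retrying for the default 5 seconds.)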

To solve this situation, I guess we need some tricky operation within
sqlite_accessor:

  • if it's for replacing the entire zone, don't start a transaction, but assign a new zone ID in the "zones" table (preventing it from matching actual queries accidentally).
  • build the new zone using the new zone ID (without a transaction)
  • on completion, start a transaction and swap the old and new zone IDs; also update the zone ID in the diff table to the new one; commit the transaction.
  • remove records for the old version of the zone (if it's reasonably fast, in a transaction; otherwise do it without making a transaction)

This is tricky in various ways: we would now need to do the rollback
operation ourselves; updating the diff table may also be tricky; if
loadzone and xfrin happen at the same time (though quite unlikely in
practice), the result would be a mess; etc.
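
The swap steps above might look roughly like this, sketched against a hypothetical simplified schema (`zones(id, name, active)` and `records(zone_id, name, rdata)` are invented for illustration; the real sqlite_accessor schema and the diff-table handling are more involved):

```python
import sqlite3

def replace_zone(conn, zone_name, new_records):
    """Load a new copy of zone_name under a fresh zone ID, then swap IDs."""
    # Phase 1: build the new copy under a fresh, inactive zone ID, so that
    # queries never match the half-loaded zone.  (In the real scheme this
    # bulk load would avoid one long transaction, committing in batches.)
    with conn:
        cur = conn.execute(
            "INSERT INTO zones (name, active) VALUES (?, 0)", (zone_name,))
        new_id = cur.lastrowid
    with conn:
        conn.executemany(
            "INSERT INTO records (zone_id, name, rdata) VALUES (?, ?, ?)",
            [(new_id, name, rdata) for name, rdata in new_records])

    # Phase 2: one short transaction that flips which copy is live.
    row = conn.execute(
        "SELECT id FROM zones WHERE name = ? AND active = 1",
        (zone_name,)).fetchone()
    old_id = row[0] if row else None
    with conn:
        if old_id is not None:
            conn.execute("UPDATE zones SET active = 0 WHERE id = ?", (old_id,))
        conn.execute("UPDATE zones SET active = 1 WHERE id = ?", (new_id,))

    # Phase 3: garbage-collect the superseded copy outside the critical
    # section (the manual "rollback" burden mentioned above: a crash here
    # leaves orphaned rows that someone has to clean up).
    if old_id is not None:
        with conn:
            conn.execute("DELETE FROM records WHERE zone_id = ?", (old_id,))
            conn.execute("DELETE FROM zones WHERE id = ?", (old_id,))
    return new_id
```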

So, a higher level alternative is to declare that the SQLite3 data
source shouldn't be used for a huge zone. In that case, we should add
support for another data source (using a database that has more
fine-grained locking) very soon, though.

comment:3 in reply to: ↑ 2 Changed 7 years ago by vorner

Hello

Replying to jinmei:

Hmm, actually, SQLite3 doesn't even allow read if there's a
transaction in which some write operations have been performed:
http://www.sqlite.org/faq.html#q5

That is, indeed, stupid.

To solve this situation, I guess we need some tricky operation within
sqlite_accessor:

  • if it's for replacing the entire zone, don't start a transaction, but assign a new zone ID in the "zones" table (preventing it from matching actual queries accidentally).

I think this would not only be much more complicated, but also slower. The
sqlite3 library would sync() after each write (without a transaction, every
statement commits on its own), and the poor disk could jump out of its case
because of that.

And it would lower the chance of a collision a bit, but one could probably
still happen, if the write of some piece of data coincides with a request
to read.

Another option could be to use a separate database file for each zone,
construct the new version into a new file and then rename the file. But that
has obvious drawbacks too.

So, a higher level alternative is to declare that the SQLite3 data
source shouldn't be used for a huge zone. In that case, we should add
support for another data source (using a database that has more
fine-grained locking) very soon, though.

That sounds like less work and more useful than the workaround.

comment:4 follow-up: Changed 7 years ago by shane

  • Milestone New Tasks deleted

Okay, should we close this ticket then, with the understanding that we need to support a different SQL backend?

comment:5 in reply to: ↑ 1 Changed 7 years ago by jreed

Replying to jinmei:

Does that happen with the old b10-loadzone, too?

Yes, this happens with the bind10-20120816-release and bind10-20121115-release versions:

AUTH_PROCESS_FAIL message processing failure: Unexpected failure in sqlite3_step: database is locked

comment:6 in reply to: ↑ 4 Changed 7 years ago by vorner

Replying to shane:

Okay, should we close this ticket then, with the understanding that we need to support a different SQL backend?

I think we should at least document that the SQLite3 backend should not be used
this way. We may even try to warn when such a large zone is loaded by XFRIN or
b10-loadzone. So I think there's still something for this ticket.

comment:7 follow-up: Changed 7 years ago by jreed

I confirmed that the server continues to serve fine from another data source while the sqlite3 one is locked. So maybe it could be extended to support multiple sqlite3 data sources? Then it could just dynamically create a new one, load into it while serving from the previous one, and remove the old one when done.

comment:8 in reply to: ↑ 7 ; follow-up: Changed 7 years ago by jinmei

Replying to jreed:

I confirmed that the server continues to serve fine from another data source while the sqlite3 one is locked. So maybe it could be extended to support multiple sqlite3 data sources? Then it could just dynamically create a new one, load into it while serving from the previous one, and remove the old one when done.

What if there are multiple zones in it?

comment:9 in reply to: ↑ 8 ; follow-up: Changed 7 years ago by jreed

Replying to jinmei:

Replying to jreed:

I confirmed that the server continues to serve fine from another data source while the sqlite3 one is locked. So maybe it could be extended to support multiple sqlite3 data sources? Then it could just dynamically create a new one, load into it while serving from the previous one, and remove the old one when done.

What if there are multiple zones in it?

With different zones, the database is still locked and it serves SERVFAIL for both (the previously working zone and the new, different zone). (I just checked again.)

If I kill a loadzone in progress, the database is unlocked immediately and working answers are returned again from the original data.

comment:10 in reply to: ↑ 9 Changed 7 years ago by jinmei

Replying to jreed:

I confirmed that the server continues to serve fine from another data source while the sqlite3 one is locked. So maybe it could be extended to support multiple sqlite3 data sources? Then it could just dynamically create a new one, load into it while serving from the previous one, and remove the old one when done.

What if there are multiple zones in it?

With different zones, the database is still locked and it serves SERVFAIL for both (the previously working zone and the new, different zone). (I just checked again.)

I know :-) My question (or point) was that using multiple SQLite3 DB files
is not an easy task, because in general a DB file would contain multiple
zones, so the new version cannot simply go into a fresh DB file. We would
need to maintain two complete sets of DB files containing all zones, one
for writing only and the other for reading only; or we would need to have
separate DB files for different zones and maintain them consistently; or
we would need to do something tricky anyway.

It's not impossible, but the question is whether we want to support
such a scenario with SQLite3.

comment:11 Changed 6 years ago by tomek

  • Milestone set to Remaining BIND10 tickets

comment:12 Changed 5 years ago by tomek

  • Resolution set to wontfix
  • Status changed from new to closed
  • Version set to old-bind10

This issue is related to bind10 code that is no longer part of Kea.

If you are interested in BIND10/Bundy framework or its DNS components,
please check http://bundy-dns.de.

Closing ticket.
