Opened 9 years ago

Closed 8 years ago

#326 closed defect (fixed)

Race condition in first startup (when database file does not exist)

Reported by: jelte Owned by: jelte
Priority: medium Milestone: Sprint-20110830
Component: data source Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Medium
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 8 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

When I start up bind10 for the first time (or after I have removed zones.sqlite3), xfrout almost always throws an error. This seems to be because the database isn't fully initialized yet when a second module that uses it.

The error differs; it is one of the following, depending on how far the first module that creates it has gotten:

isc.datasrc.sqlite3_ds.Sqlite3DSError: Bad database schema version
sqlite3.OperationalError?: no such table: zones
sqlite3.OperationalError?: table schema_version already exists

Subtickets

Change History (15)

comment:1 Changed 8 years ago by shane

  • Defect Severity set to Medium
  • Resolution set to fixed
  • Status changed from new to closed
  • Sub-Project set to DNS

We think this has been fixed.

comment:2 Changed 8 years ago by shane

  • Resolution fixed deleted
  • Status changed from closed to reopened

Jeremy added more information, in what I think is a duplicate ticket, #630:

I think this may be caused by having multiple bind10's running at same time. Need to research how this happens:

Traceback (most recent call last):
  File "/home/jreed/tmp/bind10-devel-20110224/src/lib/python/isc/datasrc/sqlite3_ds.py", line 81, in open
    cur.execute("SELECT version FROM schema_version")
sqlite3.OperationalError: no such table: schema_version

and in xfrout:

  File "/home/jreed/tmp/bind10-devel-20110224/src/lib/python/isc/datasrc/sqlite3_ds.py", line 36, in create
    cur.execute("CREATE TABLE schema_version (version INTEGER NOT NULL)")
sqlite3.OperationalError: table schema_version already exists

comment:3 Changed 8 years ago by shane

Here's a full trace:

Traceback (most recent call last):
  File "/opt/bind10/lib/python3.2/site-packages/isc/datasrc/sqlite3_ds.py", line 81, in open
    cur.execute("SELECT version FROM schema_version")
sqlite3.OperationalError: no such table: schema_version

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/bind10/libexec/bind10-devel/b10-xfrout", line 682, in <module>
    xfrout_server = XfroutServer()
  File "/opt/bind10/libexec/bind10-devel/b10-xfrout", line 577, in __init__
    self._start_notifier()
  File "/opt/bind10/libexec/bind10-devel/b10-xfrout", line 589, in _start_notifier  
    self._notifier = notify_out.NotifyOut(datasrc)
  File "/opt/bind10/lib/python3.2/site-packages/isc/notify/notify_out.py", line 120, in __init__
    self._init_notify_out(datasrc_file)
  File "/opt/bind10/lib/python3.2/site-packages/isc/notify/notify_out.py", line 132, in _init_notify_out
    for zone_name, zone_class in sqlite3_ds.get_zones_info(datasrc_file):
  File "/opt/bind10/lib/python3.2/site-packages/isc/datasrc/sqlite3_ds.py", line 160, in get_zones_info
    conn, cur = open(dbfile)
  File "/opt/bind10/lib/python3.2/site-packages/isc/datasrc/sqlite3_ds.py", line 84, in open
    create(cur)
  File "/opt/bind10/lib/python3.2/site-packages/isc/datasrc/sqlite3_ds.py", line 36, in create
    cur.execute("CREATE TABLE schema_version (version INTEGER NOT NULL)")
sqlite3.OperationalError: table schema_version already exists

comment:4 Changed 8 years ago by shane

  • Milestone set to Next-Sprint-Proposed

The problem here arises in sqlite3_ds.py:

    # Does the database exist yet?  If not, create it.
    try:
        cur.execute("SELECT version FROM schema_version")
        row = cur.fetchone()
    except:
        create(cur)
        conn.commit()
        row = [1]

This code causes problems if multiple programs execute the create() function at the same time.

Three changes should probably be made:

  1. We should not catch all exceptions, but rather only the sqlite3.OperationalError? exception, and we should verify that the exception is "no such table".
  2. We should not perform our table creation in an exception handler, rather we should set a flag and create the table outside of the exception handler.
  3. If we have to create the tables, then we need to:
    1. start with a BEGIN EXCLUSIVE TRANSACTION statement (http://www.sqlite.org/lang_transaction.html)
    2. check again that the table does not exist
    3. either do nothing and ROLLBACK (if the table does exist) or create the tables then COMMIT (if the tables still do not exist)

I'm not 100% sure, but this seems to be the case.

comment:5 Changed 8 years ago by stephen

  • Milestone changed from Next-Sprint-Proposed to Sprint-20110816

comment:6 Changed 8 years ago by jinmei

Is this worth solving?

I've not fully examined the problem and the follow up discussion,
but from a quick glance it seems to be possibly solved when we
complete the data source refactoring (or the refactoring would
require different solution to the problem).

comment:7 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0.0 to 8

comment:8 Changed 8 years ago by jelte

As far as the current code in the refactoring is concerned, the problem wouldn't be fixed. It's still the race condition between the schema_version check and the creation of the table(s), which has so far been copied into the new code.

I think that an exclusive lock (as proposed) would indeed fix it. Then there is just the question of how to handle not getting it :) I have some experimental code that would support this assumption (though technically with a race condition you never know, I put in some sleeps to increase the race window and it does indeed work with those)

I propose a fixed number of delayed retries until it gives up (the db could be locked forever).

comment:9 Changed 8 years ago by jelte

  • Owner set to jelte
  • Status changed from reopened to assigned

comment:10 Changed 8 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from assigned to reviewing

Did a bit of refactoring the initialization as i was fixing this;

What it does now is that it checks the schema version in the database (in a normal connection). If the result indicates the table does not appear to exist, it tries to get an exclusive lock. Once it has that, it checks the version *again*. If it has not been created by now, it does so now.

oh and the c++ version now actually checks the schema version :)

I've implemented this algorithm in the python datasrc module, and both the old and the new c++ code (those two are quite similar, the setup code got almost completely copied, save for some var names). I only added tests for the new code, as i suspect we'll drop the old implementation soonish.

comment:11 Changed 8 years ago by stephen

  • Owner changed from UnAssigned to stephen

comment:12 follow-up: Changed 8 years ago by stephen

  • Owner changed from stephen to jelte

src/lib/datasrc/sqlite3_accessor.cc
src/lib/datasrc/sqlite3_datasrc.cc

check_schema_version()
The check

} else if (rc != SQLITE_BUSY || i == 50) {

... inside the loop is wrong. The "i == 50" condition will never be true as the loop termination condition is "i < 50".

create_database()
Same issue as above with the "i == 50" check.

"int rc" can be declared inside the loop and initialized with the call to sqlite3_exec().

If the schema exists, the code returns the version but does not cancel the transaction started by executing "BEGIN EXCLUSIVE TRANSACTION". It should execute a rollback (or commit) before returning.

checkAndSetupSchema()
If the schema does exist but the version is not equal to that expected, the code will attempt to create the database. It should only create the database if the returned value from check_schema_version() is -1 (table does not exist); in all other cases it should report a version mismatch and terminate the program.

general
The wait period (number of iterations of loops) should be symbolic constants.

Not part of this ticket, but indenting the continuation lines of each statement in SCHEMA_LIST would make it more readable.

src/lib/python/isc/datasrc/sqlite3_ds.py
Would it be possible to encapsulate the creation of the database into a Python-callable C++ module so that the creation code (also present in sqlite3_xxx.cc) is located in just one place? At the moment the possibility exists of divergence between the Python and C++ creation functions.

comment:13 in reply to: ↑ 12 Changed 8 years ago by jelte

  • Owner changed from jelte to stephen

Replying to stephen:

src/lib/datasrc/sqlite3_accessor.cc
src/lib/datasrc/sqlite3_datasrc.cc

check_schema_version()
The check

} else if (rc != SQLITE_BUSY || i == 50) {

... inside the loop is wrong. The "i == 50" condition will never be true as the loop termination condition is "i < 50".

ah, doh.

I have a quote configured as an entry message in a chat room;

"There are two hard problems in computer science: cache invalidation, naming things, and off-by-one errors."

...

fixed, made it loop from 1 while <=, instead of < though, so the == WAIT_TIME is more recognizable (as opposed to == WAIT_TIME - 1)

"int rc" can be declared inside the loop and initialized with the call to sqlite3_exec().

ack

If the schema exists, the code returns the version but does not cancel the transaction started by executing "BEGIN EXCLUSIVE TRANSACTION". It should execute a rollback (or commit) before returning.

ack. Also added it in case the creation fails for whatever reason (it may not matter much in such a case but still, not nice to leave it open).

checkAndSetupSchema()
If the schema does exist but the version is not equal to that expected, the code will attempt to create the database. It should only create the database if the returned value from check_schema_version() is -1 (table does not exist); in all other cases it should report a version mismatch and terminate the program.

crap, of course. Fixed

general
The wait period (number of iterations of loops) should be symbolic constants.

sure

Not part of this ticket, but indenting the continuation lines of each statement in SCHEMA_LIST would make it more readable.

sure

src/lib/python/isc/datasrc/sqlite3_ds.py
Would it be possible to encapsulate the creation of the database into a Python-callable C++ module so that the creation code (also present in sqlite3_xxx.cc) is located in just one place? At the moment the possibility exists of divergence between the Python and C++ creation functions.

Yes, not to mention that we right now have 3 versions :p

But I think that is the ultimate goal for after the datasource refactoring (doing everything through datasource wrapper), but right now that doesn't help us get this fixed :) I don't see much advantage of having just a subset like the creation wrapped, tbh.

comment:14 Changed 8 years ago by stephen

  • Owner changed from stephen to jelte

Looks OK, please merge.

comment:15 Changed 8 years ago by jelte

  • Resolution set to fixed
  • Status changed from reviewing to closed

Thanks, merged, closing ticket.

Note: See TracTickets for help on using tickets.