Opened 7 years ago

Closed 7 years ago

#2459 closed defect (fixed)

call to getCachedZoneWriter must be protected by mutex

Reported by: jinmei Owned by: jinmei
Priority: medium Milestone: Sprint-20121120
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: High
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 3 Add Hours to Ticket: 0
Total Hours: 1.5 Internal?: no

Description

I just noticed one big issue in the thread based zone reloading:
there can be an inter-thread race for the same DB connection (if
I understand it correctly).

Consider an sqlite3 data source that doesn't use in-memory "cache".
If b10-auth receives a loadzone command for one of the zones
of the data source (either by manually, or more likely from xfrin
or DDNS), the separate builder thread calls getCachedZoneWriter().
This method internally uses the SQLite3 connection to identify the
zone (since this data source doesn't have an in-memory cache). But
the main thread also uses the same connection for normal query
processing, and both threads run in parallel in this case.

SQLite3 doesn't ensure such an operation succeeds:
http://www.sqlite.org/faq.html#q6
(btw same kind of restriction seems to apply to PostgreSQL and MySQL).

This problem seems deep (will create a separate ticket), but for now I
suggest an urgent care fix: just protect the call to
getCachedZoneWriter() using the mutex.

I also suggest one piggy back fix: don't treat this case as an error:

    case datasrc::ConfigurableClientList::ZONE_NOT_CACHED:
        isc_throw(InternalCommandError, "failed to load zone " << origin
                  << "/" << rrclass << ": not served from memory");

because it can happen in the scenario described above, and if it's
from xfrin or DDNS it's not really an error. My suggestion is to
just log it at the DEBUG level and return a NULL pointer. In
doLoadZone() we check the return value and if it's NULL just return.

Subtickets

Attachments (1)

0001-2459-Update-comments-to-indicate-where-the-main-thre.patch (2.7 KB) - added by muks 7 years ago.
Patch to update comments about where the main thread acquires lock

Download all attachments as: .zip

Change History (14)

comment:1 Changed 7 years ago by jelte

  • Estimated Difficulty changed from 0 to 3

comment:2 Changed 7 years ago by jelte

  • Defect Severity changed from N/A to High
  • Milestone changed from Next-Sprint-Proposed to Sprint-20121120

comment:3 Changed 7 years ago by jinmei

  • Priority changed from very high to medium

comment:4 Changed 7 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:5 Changed 7 years ago by jinmei

trac2459 is ready for review.

The branch is exactly what's proposed in the ticket's description
(and it's small).

I also confirmed my theory was (seemingly) correct by the following
setup:

  • run b10-auth with an sqlite3-only data source
  • keep sending queries to b10-auth at a high volume using queryperf(-like tool)
  • at the same time, repeat "Auth loadzone" from bindctl in a shell loop

Eventually b10-auth crashed due to segmentation fault (normally within
a minute or so). It died somewhere around sqlite3 API, so (while I
didn't looked into it any further) it's quite likely to be a race I
suspected. After applying this change, it didn't happen.

I don't think we need a changelog for this fix as it's an intermediate
regression.

comment:6 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:7 Changed 7 years ago by muks

  • Owner changed from UnAssigned to muks

Picking for review.

comment:8 follow-up: Changed 7 years ago by muks

  • Owner changed from muks to jinmei

The code change itself looks fine to me, and make check (with the updated lock counts) passes. Lettuce also passes (though I suppose these don't test the specific problem).

I see that doLoadZone() grabs a lock on this mutex before calling getCachedZoneWriter(). Where does the main thread lock this mutex?

comment:9 in reply to: ↑ 8 Changed 7 years ago by jinmei

Thanks for the review.

Replying to muks:

The code change itself looks fine to me, and make check (with the updated lock counts) passes. Lettuce also passes (though I suppose these don't test the specific problem).

Right. I suspect (reliably) reproducing the problem itself in
unit/system tests is almost impossible.

I see that doLoadZone() grabs a lock on this mutex before calling getCachedZoneWriter(). Where does the main thread lock this mutex?

Via DataSrcClientsMgrBase::Holder. And, in practice, the only place
in the main thread that acquires the lock is
AuthSrvImpl::processNormalQuery.

comment:10 Changed 7 years ago by jinmei

  • Owner changed from jinmei to muks

Changed 7 years ago by muks

Patch to update comments about where the main thread acquires lock

comment:11 follow-up: Changed 7 years ago by muks

  • Owner changed from muks to jinmei

Good to merge now. If you are ok, please commit the attached patch too. Though the location where the lock is acquired in the main thread can change in the future (from AuthSrvImpl::processNormalQuery()), it still is better to document it as it was not immediately obvious to find it, starting from the loader lock. If you feel the extra comments are too much, then please go ahead and merge as-is.

comment:12 in reply to: ↑ 11 Changed 7 years ago by jinmei

Replying to muks:

Good to merge now. If you are ok, please commit the attached patch too. Though the location where the lock is acquired in the main thread can change in the future (from AuthSrvImpl::processNormalQuery()), it still is better to document it as it was not immediately obvious to find it, starting from the loader lock. If you feel the extra comments are too much, then please go ahead and merge as-is.

Thanks for the suggestion. I've applied the suggested comment to
auth_srv.cc. I've skipped the one for datasrc_clients_mgr.h because
at least architecturally the clients manager/builder should be
independent from AuthSrv, so mentioning specific details of
AuthSrv in this context seemed a bit awkward.

With this change I'll merge the branch and close the ticket.

comment:13 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 1.5
Note: See TracTickets for help on using tickets.