#5477 closed enhancement (fixed)

Database reconnect

Reported by: tomek Owned by: tmark
Priority: medium Milestone: Kea1.4
Component: database-all Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

As of 1.3, Kea detects database connection failure and terminates. This was the simpler of two choices. Since Kea can't do anything useful without its database, the decision to terminate was simpler to implement and it easy to detect, so external mechanisms can be put in place (like start the server again).

However, the better solution is to attempt to reconnect. This is what this ticket is about. This ticket covers MySQL and Postgres. It does not cover Cassandra, as Cassandra has its own way of dealing with failures (you specify contact points and some of them could go offline).

However, before this code is put into review, it must be tested that it didn't break Cassandra if the changes are generic.

Note: we previously tried to address the issue and failed (see #3639). Note that MySQL has auto-reconnect ability, but it's misleading. The connection itself is re-established, but the compiled statements are no longer valid. As far as I'm aware, there is no way to recompile the statements automatically.

Subtickets

Change History (10)

comment:1 Changed 17 months ago by tmark

  • Owner set to tmark
  • Status changed from new to assigned

comment:2 Changed 16 months ago by tmark

  • Owner changed from tmark to UnAssigned
  • Status changed from assigned to reviewing

Ticket is ready for review.

The trick here was find a way to propagate the connection error out of the depths of backend-land so the "application" layer can decide what to do about it. Long story short, when the application creates a backend manager it passes in a callback to invoke if connectivity is lost.

In addition, I added parameters to lease and host db parsers, so the behavior is configurable.

dhcpsrv

Added a callback that DatabaseConnection? derivations should invoke
when they lose connectivity.

Added an optional callback parameter from CfgDbAccess::createManagers()
all the way down to DatabaseConnection? ctor.

pgsql_connection.cc

PgSqlConnection::~PgSqlConnection?() - Added logic to close the
connection only when the connect state is still OK.
Otherwise it likes to core dump.

PgSqlConnection::checkStatementError() - Modified to invoke the
connectivity lost callback on "fatal" errors

pgsql_lease_mgr_unittest.cc
pgsql_host_data_source_unittest.cc

Added tests to verify that the lost callback is NOT invoked on an
open failure

kea-dhcp4
kea-dhcp6

Added support for max-reconnect-tries and reconnect-wait-time
to lease and host db parsers

Added a callback for when DB backends detect loss of connectivity

Added a self-rescheduling method to attempt to reconnect to the
backends if retries are enabled

Added appropriate log messages

ChangeLog?:

13xx.   [func]      tmark
    kea-dhp4 and kea-dhcp6 can now be configured to attempt to
    reconnect to Postgresql backends if connectivity is lost.
    (Trac #5477, git TDB)

I have created #5556 to cover adding this to MySQL.

Note, unit testing lost connectivity is more-or-less impossible. We
can't very well start/stop DBs on users' machines. We should be able
to create Forge tests though.

comment:3 Changed 16 months ago by tmark

One thing to note, if you do system testing, it is easiest to start and stop Posgresql rather then the interface for non local DBs. This is due to the high TCP timeout it takes for psql lib calls to fail when the network is out.

comment:4 follow-up: Changed 16 months ago by marcin

  • Owner changed from UnAssigned to tmark

I reviewed commit 78b8edb6dbc4b2bac72f4d37e1f5cbacb5b63dcd.

I updated some copyright dates and corrected some typos. So you need to pull.

src/bin/dhcp4/ctrl_dhcp4_srv.h
src/bin/dhcp6/ctrl_dhcp6_srv.h
The dbLostCallback's sole argument should be described.

Also, both dbReconnect and dbLostCallback deserve some brief description what they do.

dbReconnect: I'd suggest to pass db_reconnect_ctl to this function as a parameter instead of assigning it to the class member. It is going to be automatically destroyed when not needed anymore. Also, no need to initialize the class member which may be error prone. However, it is a matter ot taste, so I don't insist on this change.

src/bin/dhcp4/dhcp4_messages.mes
src/bin/dhcp6/dhcp6_messages.mes
I fixed several typos and minor issues in those files. Please pull.

src/lib/dhcpsrv/database_connection.h
Shouldn't this commentary be removed:

/// typedef boost::function<bool (const ParameterMap& parameters)> Callback;

?

The description of the callback type and its name suggests that it is a generic callback for propagating various events upward. But, it includes a ReconnectCtlPtr argument in it which makes it specific to reconnect event. I guess we should make it reconnect specific callback type and rename it to something like ReconnectCallback.

Another suggestion I'd have is to use "std::function" template instead of "boost::function". Now, this might be a little contoversial because we use boost::function all over the place. My point on using std::function in new code is that it allows for using lambdas instead of boost::bind which often makes the code more readable and easier to unit test (I am not sure if implicit conversion of lambdas to boost::function always works). On the other hand, we don't have any coding guidelines with respect to the use of C++11 language features over boost features.

src/lib/dhcpsrv/tests/database_connection_unittest.cc
The new tests should have short descriptions, so as the test fixture class.

comment:5 in reply to: ↑ 4 Changed 16 months ago by tmark

  • Owner changed from tmark to marcin

Replying to marcin:

I reviewed commit 78b8edb6dbc4b2bac72f4d37e1f5cbacb5b63dcd.

I updated some copyright dates and corrected some typos. So you need to pull.

src/bin/dhcp4/ctrl_dhcp4_srv.h
src/bin/dhcp6/ctrl_dhcp6_srv.h

Thx

The dbLostCallback's sole argument should be described.

Done

Also, both dbReconnect and dbLostCallback deserve some brief description what they do.

Done

dbReconnect: I'd suggest to pass db_reconnect_ctl to this function as a parameter instead of assigning it to the class member. It is going to be automatically destroyed when not needed anymore. Also, no need to initialize the class member which may be error prone. However, it is a matter ot taste, so I don't insist on this change.

Done

src/bin/dhcp4/dhcp4_messages.mes
src/bin/dhcp6/dhcp6_messages.mes
I fixed several typos and minor issues in those files. Please pull.

More thx.

src/lib/dhcpsrv/database_connection.h
Shouldn't this commentary be removed:

/// typedef boost::function<bool (const ParameterMap& parameters)> Callback;

?

Oops. Gone.

The description of the callback type and its name suggests that it is a generic callback for propagating various events upward. But, it includes a ReconnectCtlPtr argument in it which makes it specific to reconnect event. I guess we should make it reconnect specific callback type and rename it to something like ReconnectCallback.

I renamed it.

Another suggestion I'd have is to use "std::function" template instead of "boost::function". Now, this might be a little controversial because we use boost::function all over the place. My point on using std::function in new code is that it allows for using lambdas instead of boost::bind which often makes the code more readable and easier to unit test (I am not sure if implicit conversion of lambdas to boost::function always works). On the other hand, we don't have any coding guidelines with respect to the use of C++11 language features over boost features.

I passed on this but will do so in the future.

src/lib/dhcpsrv/tests/database_connection_unittest.cc
The new tests should have short descriptions, so as the test fixture class.

Done

Please re-review.

comment:6 Changed 16 months ago by marcin

  • Owner changed from marcin to tmark

I reviewed commit 8f5ce7060067456b68a6df39d3d20f122b96596e

I am getting the following compilation warning:

database_connection_unittest.cc:71:5: warning: variable 'ret' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
    ASSERT_NO_THROW(ret = datasrc.invokeDbLostCallback());

followed by:

database_connection_unittest.cc:70:13: note: initialize the variable 'ret' to silence this warning
    bool ret;

I recommend to follow that suggestion.

I also corrected some weird spelling, so please pull the branch.

The ticket can be merged after correcting the warning above. I don't need to see this again.

comment:7 Changed 16 months ago by tmark

  • Resolution set to complete
  • Status changed from reviewing to closed

Addressed the above comment, and merged with git 8e62a058382b2245d418cfbf829776934c638e5e
Added ChangeLog? entry 1372.

ticket is complete.

comment:8 Changed 16 months ago by fdupont

  • Resolution complete deleted
  • Status changed from closed to reopened

I know the ticket was merged but I am afraid the doc still needs to be updated?

comment:9 Changed 16 months ago by tmark

I updated the doc under 5556 which is ready for review.

comment:10 Changed 16 months ago by tmark

  • Resolution set to fixed
  • Status changed from reopened to closed
Note: See TracTickets for help on using tickets.