#5580 closed defect (fixed)

HA: fixes in DHCPv4 state machine post #5470 merge.

Reported by: marcin Owned by: tmark
Priority: high Milestone: Kea1.4
Component: high-availability Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

I started system testing the HA implementation and found a major (but easy to fix) issue which must be corrected. When the server is in the partner down state it will try to send updates to the partner. The partner is down so it doesn't receive lease updates and the active server will drop to he packets. In the partner-down state the local server must not send lease updates to the partner. It simply sends heartbeats to detect when the partner wakes up.

Subtickets

Change History (5)

comment:1 Changed 15 months ago by marcin

  • Owner set to UnAssigned
  • Status changed from new to reviewing

I fixed serveral issues in the HA state machine. The most critical one was that the server refused to respond to DHCP queries in the partner-down state because it tried to send lease updates to the partner. The server must not send lease updates in the partner down state unless it is sending them to a backup server, for which the communication error doesn't cause DHCP query to be dropped.

While testing the HA on the CentOS 7 with GCC 7 and boost 1.66 I experienced crashes caused by a bug (again my bug) in the unix domain socket implementation. It was mistakenly using std::bind instead of boost::bind.

After those changes I was able to successfully test a HA setup running in the load balancing mode and hot standby mode with a third server being a backup.

Proposed ChangeLog entries:

premium:

XX.	[bug]		marcin
	Corrected several bugs in the high availability state machine
	of which the most critical one was that the DHCP server refused
	to respond to the DHCP clients in the partner down state.
	(Trac #5580, git cafe)

main repo:

13XX.	[bug]		marcin
	Corrected a bug in the libkea-asiolink library which caused
	the DHCP servers to crash while processing commands over
	the unix domain socket on some systems.
	(Trac #5580, git 93e2b2198c3163afb81d51fdf5ec547602a12415)

comment:2 Changed 15 months ago by tmark

  • Owner changed from UnAssigned to tmark

comment:3 follow-up: Changed 15 months ago by tmark

  • Owner changed from tmark to marcin

Main Repo


Changes are fine but I am curious as to what is different about boost::bind and std::bind that causes a run time error? This is perhaps a lesson the rest of the team would benefit from understanding.

Premium Repo:


src/hooks/dhcp/high_availability/ha_service.*

+ / @return Number of scheduled lease updates. This value is used to
+
/ determine if the DHCP query should be parked while waiting for the
+ / lease update to complete.
+ size_t asyncSendLeaseUpdates(const dhcp::Pkt4Ptr& query,
+ const dhcp::Lease4CollectionPtr& leases,
+ const dhcp::Lease4CollectionPtr& deleted_leases,
+

The return description is could be misleading, depending upon what one counts
as a "lease update". To me a "lease update" is information for a single lease.

What it returns is really the number of peers to whom updates have been
scheduled to be sent.

One could also argue that asyncSendLeaseUpdates() makes an assumption that it
will only be called if there are in fact, altered leases. If it were called
with both leases and deleted_leases empty, it would send nothing to its backup
peers but would return a count of servers contacted > 0. This is probably a
moot point because it appears to be called when there are changed leases.


This still blows up:

host_cmds_unittest.cc: In destructor ‘virtual {anonymous}::LibLoadTest::~LibLoadTest?()’:
host_cmds_unittest.cc:352:29: error: ‘class isc::dhcp::HostMgr?’ has no member named ‘setTestHostDataSource’

HostMgr::instance().setTestHostDataSource(HostDataSourcePtr?());

but I'm assuming 5580 branch predates this getting fixed.


Beyond that, your changes are fine.

comment:4 in reply to: ↑ 3 Changed 15 months ago by marcin

  • Owner changed from marcin to tmark

Replying to tmark:

Main Repo


Changes are fine but I am curious as to what is different about boost::bind and std::bind that causes a run time error? This is perhaps a lesson the rest of the team would benefit from understanding.

I don't know the internals but reading https://stackoverflow.com/questions/10555566/difference-between-c11-stdbind-and-boostbind I can tell that there are sufficiently many differences to cause problems.

Premium Repo:


src/hooks/dhcp/high_availability/ha_service.*

+ / @return Number of scheduled lease updates. This value is used to
+
/ determine if the DHCP query should be parked while waiting for the
+ / lease update to complete.
+ size_t asyncSendLeaseUpdates(const dhcp::Pkt4Ptr& query,
+ const dhcp::Lease4CollectionPtr& leases,
+ const dhcp::Lease4CollectionPtr& deleted_leases,
+

The return description is could be misleading, depending upon what one counts
as a "lease update". To me a "lease update" is information for a single lease.

What it returns is really the number of peers to whom updates have been
scheduled to be sent.

One could also argue that asyncSendLeaseUpdates() makes an assumption that it
will only be called if there are in fact, altered leases. If it were called
with both leases and deleted_leases empty, it would send nothing to its backup
peers but would return a count of servers contacted > 0. This is probably a
moot point because it appears to be called when there are changed leases.

I updated the description. I was considering checking whether the containers are both empty within the function but then I thought that it doesn't make a lot of sense because I should also check whether the pointers aren't null etc. Isn't it really a waste of CPU cycles while you already check that elsewhere?


This still blows up:

host_cmds_unittest.cc: In destructor ‘virtual {anonymous}::LibLoadTest::~LibLoadTest?()’:
host_cmds_unittest.cc:352:29: error: ‘class isc::dhcp::HostMgr?’ has no member named ‘setTestHostDataSource’

HostMgr::instance().setTestHostDataSource(HostDataSourcePtr?());

but I'm assuming 5580 branch predates this getting fixed.

Yeah, that belongs to a different ticket.


Beyond that, your changes are fine.

Thanks for review.

comment:5 Changed 15 months ago by marcin

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged into both main and premium repo with commits cb5276a24436a9e9ce4d1ab4630e7193a4c2d803 and bf29e14b056822216a03d9789f6b00486559b930.

Note: See TracTickets for help on using tickets.