Opened 2 years ago

Closed 21 months ago

#5476 closed enhancement (complete)

ha-scopes command

Reported by: tomek Owned by: marcin
Priority: medium Milestone: Kea1.4
Component: high-availability Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

The HADesign defines mechanism for assigning specific scopes to servers. To manipulate that, it proposed ha-service-scopes command. See here: HADesign.

Note: The design proposed to ha-service-scopes as a name, but ha-scopes seems simpler. Please consider renaming it.

This covers both v4 and v6.

Subtickets

Change History (7)

comment:1 Changed 21 months ago by marcin

  • Owner set to marcin
  • Status changed from new to accepted

comment:2 Changed 21 months ago by marcin

  • Owner changed from marcin to UnAssigned
  • Status changed from accepted to reviewing

This ticket has been now implemented. It must be compiled with the trac5466 branch of the Kea main repository.

Proposed ChangeLog entry:

XX.	[func]		marcin
	Implemented ha-scopes command processing in the libkea-ha
	hooks library.
	(Trac #5476, git cafe)

comment:3 Changed 21 months ago by tmark

  • Owner changed from UnAssigned to tmark

comment:4 follow-up: Changed 21 months ago by tmark

  • Owner changed from tmark to marcin

First, let me say that you've done a remarkable job with this complex effort. Congratulations. It was quite a bit to digest in a single review and there are probably small nits here and there that I missed.

I found it relatively awkward to try and sort of comments per ticket so I've just included them on this ticket. I reviewed everything from 69d95cbf70cf5ce0f9ffee061fc46ec500210fdcg upward, which
appears to cover #5459, #5463, #5464, #5466, #5474 and #5476.

Hopefully we can avoid such clustering in the future, of course I realize this was beyond your control.

=======================================================================
query_filter.*

QueryFilter::

I prefer amServingScope() over isScopeServed(), and inScope() over
shouldProcess().

=======================================================================
ha_impl.cc

HAImpl::scopesHandler()

This is more a general suggestion.

Maybe we need a method parseCommandWithArguments() or some such, as
this basic step of parseCommand() followed by :

+        // Arguments must be present.
+        if (!args) {
+            isc_throw(BadValue, "arguments not found in the 'ha-scopes' command");
+        }
+
+        // Arguments must be a map.
+        if (args->getType() != Element::map) {
+            isc_throw(BadValue, "arguments in the 'ha-scopes' command are not a map");
+        }

is done all over the place. Maybe parseCommand should simply return a
map element pointer whether the args were present or otherwise. If they
were missing return an empty map. Each processor would still error out
on any required arguments.

Food for thought.

=======================================================================

ha_messages.mes

+% HA_SCOPES_HANDLER_FAILED ha-scopes command failed: %1
+This error message is issued to indicate that the ha-scopes command handler
+failed while processing the command. The argument provides the reason for
+failure.

should be "reason for the failure" ;)

=======================================================================
query_filter.cc

+ assumptions about the availability of the servers' confiurations

confiurations -> configurations

=======================================================================
query_filter.*

QueryFilter::serveDefaultScopesOnly()
QueryFilter::serveFailoverScopesOnly()

I think you should drop the "Only" from the function name. It's almost
confusing and I don't think really adds anyting.

=======================================================================
query_filter.cc

QueryFilter::serveDefaultScopesOnly() {

The text below takes reading it twice:

    // If I am primary, I always serve my scope. If I am secondary, it means
    // that we're doing load balancing, so I am also responsible for my
    // own scope. If I am standby, I am not responsible for any scope until
    // the failover event.

how about this?

    // If I am primary or secondary, then I am only responsible for my own
    // scope.  If I am standby, I am not responsible for any scope.

=======================================================================
query_filter.cc

QueryFilter::serveFailoverScopesOnly() {

Here too, I think you should drop the "Only" from the function name.

        // doing load balancing, the scope of the secondary server also
        // have to be served. Regardless if I am primary or secondary,

"have to be served" -> "has to be served"

=======================================================================
query_filter_unittest.cc

TEST_F(QueryFilterTest?, loadBalancingThisPrimary)
TEST_F(QueryFilterTest?, loadBalancingThisSecondary)
TEST_F(QueryFilterTest?, loadBalancingThisBackup).

Once QueryFilter::loadBalance() can select more than the primary
I'm assuming you'll expand these tests to verify packets that
in both scopes.

Maybe you should mark them with @todos.

=======================================================================
ha_service.h

In the commentary for processSynchronize() you describe it as
processing the ha-sync command "sent by the administrator" but
the command it is also generated automatically. I think you can
should either drop "sent by the administrator" or amend it to
include mention of its other origins.

+ / @brief Processes ha-sync command and returns a response.
+
/
+ / This method processes ha-sync command sent by the administrator. It
+
/ instructs the server to disable the DHCP service on the HA peer,
+ / fetch all leases from the peer and update the local lease database.

=======================================================================

General comment:

Have we done any testing with huge result sets?

If you have 2M leases, then lease4-get-all reponse would be single
JSON string containing all of them right? Have we tried this?

Not sure if QA is planning ahead for this, but it is something
they should start working on having system tests for asap. I would
hate to have us roll out and find out that large lease count syncs
choke for some reason.

Hopefully we'll have some time before final to do some benchmarking.
It would be interesting to compare sync/startup of ISC DHCP
failover pair with KEa failover (persisted memfile) pair, with
large lease files, and a MySQL pair.

=======================================================================

ha_server.cc

HAService::asyncSyncLeases()

Why test existing lease twice?

if (existing_lease && (existing_lease->cltt_ < lease->cltt_)) {
    LeaseMgrFactory::instance().updateLease4(lease);
} else if (!existing_lease) {
    // There is no such lease, so let's add it.
    LeaseMgrFactory::instance().addLease(lease);
}

You could do this, which tests it once and also let's us
log (if we wanted) the disregard of an obsolete update
(which ISC DHCP does):

if (!existing_lease)
    // There is no such lease, so let's add it.
    LeaseMgrFactory::instance().addLease(lease);
        LeaseMgrFactory::instance().updateLease4(lease);
} else if (existing_lease->cltt_ < lease->cltt_)) {
    // Ours is stale, update it
    LeaseMgrFactory::instance().updateLease4(lease);
} else {
    log_debug(WHY ARE YOU SENDING ME OBSOLETE INFO!)
}

=======================================================================
ha_messages.mes

+% HA_LEASE4_SYNC_FAILED synchrnonization failed for lease: %1, reason: %2

Misspelled synchronization (too many n's)

and below "failed" should be "fails" because "creating" and "updating" are
present tense.

+This warning message is issued when creating or updating a lease in the
+local lease database failed.

should be:

+This warning message is issued when creating or updating a lease in the
+local lease database fails.

HA_LEASES4_SYNC_COMMUNICATIONS_FAILED
HA_LEASES4_SYNC_FAILED

In both of these "reason for error" should be "reason for the error"

=======================================================================
communication_state.cc

CommunicationState4::analyseMessage ()

There is a possible gotcha in using secs. Some clients (mostly
Windows, iirc) send it in the wrong byte order. ISC DHCP has an
option to enable "detection" of this, see #45364

 The check-secs-byte-order statement

         check-secs-byte-order flag;

         When check-secs-byte-order is enabled,  the  server  will  check  for
         DHCPv4  clients  that  do  the byte ordering on the secs field incor‐
         rectly. This field should be in network byte order but  some  clients
         get  it wrong. When this parameter is enabled the server will examine
         the secs field and if it looks wrong (high byte non zero and low byte
         zero)  swap  the  bytes.   The default is disabled. This parameter is
         only useful when doing load  balancing  within  failover.

=======================================================================
communication_state.cc

CommunicationState4::analyseMessage ()

I am little concerned about the sequential search for client-id,
although in reality it should pretty hard to for a single MAC to
amass very many, unless it was a malicious attempt. The other
server would have to not be responding so we don't poke, and
someone would have to know then to flood us with one mac, ever
increasing client-ids, and a large secs value... AND it would
require max-unacked-clients to have been set to a large value.

Probably nothing to do here, other than think about it.

=======================================================================

communication_state_unittest.cc

+    /// @return Pointer to a function being test heartbeat implementation.

Uhmm...what?

Maybe this...

    Pointer to heartbeat implementation function under test

================================================================
ha_service_unittest.cc

void testSendLeaseUpdates(std::function<void()> unpark_handler,

const bool should_pass) {

Not that it makes a difference now, but shouldn't you probably
change the order below and replace the HAService state before
using any of its services? For that matter why not either pass it
into or create the NakedCommunicationState? inside TestHAService's
constructor?

    // Create HA service and schedule lease updates.
    TestHAService service(io_service_, config_storage);
    ASSERT_NO_THROW(service.asyncSendLeaseUpdates(query, leases4, deleted_leases4,
                                                      parking_lot_handle));
    service.communication_state_ = state;

Code might change in the future which need the state and figuring out why
the test fell down might not be obvious.

================================================================
ha_test.cc

We should consider adding a wrapper in asiolink for reset():

+ io_service_->get_io_service().reset();

There are 2 occurrences of get_io_service() in HA and they're
both in ha_test.cc.

================================================================
ha_service.cc

+ / hook library and the DHCP server. All asynchronous tasks are ran by

change "ran" to "run"

+ / @param deleted_leases Pointer to a collection og the released leases.

change "og" to "of"

=========================================================
ha_service.cc

HAService::asyncSendLeaseUpdate() -

Each time we have success we're potentially going to poke.
So that's at least once per V4 lease, twice if a prior lease
was deleted.

Should we concern ourselves about how often we will be
resetting the heartbeat timer? Think about high traffic
times... Looking at CommunicationState::startHeartbeatInternal()
,which gets called by poke(), I'm not sure this stuff we want
to do constantly. IOService as to do whatever stuff it does
to destroy the old timer and insert a new one.

I'm wondering if we don't need some sort of gating mechanism.
If I poked it 20 ms ago, why poke it again? You could use
stopwatch to track how long ago you poked and only poke every
so often. Some percentage of the interval? Just seems like
we would be thrashing this timer when the system is busy.

=========================================================
ha_service.cc

HAService::asyncSendLeaseUpdate() -

What controls attempting to send lease updates to a peer?
If a peer is known to be offline, then where do we avoid
attempting to send it updates? If we keep trying and keep
failing, won't we just keep dropping the packets out of
the parking lot?

If this is logic yet to come, that's fine.

=========================================================
ha_service.cc

HAService::verifyAsyncResponse(const HttpResponsePtr?& response)

Maybe you could include the value of rcode in the throw text on
an unsuccessful result.

=========================================================
Dhcpv4Srv::processPacket()

When you park a packet, you then reset() rsp.

I did not see where HooksManager::park() does anything
directly with rsp (like make a copy). What keesp the
actual response alive? Does passing it as a param to
when binding function copy the pointer?

On the surface it looks like you would destroying it before
it unparking it. What am I missing?

=========================================================
HAService::processSynchronize()

After calling the asyncSend, you run the client's
IOservice. What guards against us getting stuck in
this? Is there a timer inside the client?

=========================================================
communication_state.*

You're spelling analyze with an "s":

CommunicationState::analyseMessage()

I believe we use US spellings by convention.

=========================================================

HAService::processSynchronize()

After calling the asyncSend, you run the client's
IOservice. What guards against us getting stuck in
this? Is there a timer inside the client?

=========================================================
ha_service_unittest.cc

HAServiceTest::testSendLeaseUpdates()

Looking at the code below:

        // Create HA service and schedule lease updates.
        TestHAService service(io_service_, config_storage);
        ASSERT_NO_THROW(service.asyncSendLeaseUpdates(query, leases4, deleted_leases4,
                                                      parking_lot_handle));

        service.communication_state_ = state;

Shouldn't you replace the state immediately after constructing
the service, prior to invoking any service functions? I realize
it doesn't currently matter now, but in the future it might and you
could end up with a subtle bug.

Maybe it would better to simply create the replacement state
inside TestHAService's constructor? That way all tests have the
mutable state and the order is ensured.

Last edited 21 months ago by tmark (previous) (diff)

comment:5 in reply to: ↑ 4 Changed 21 months ago by marcin

  • Owner changed from marcin to tmark

Replying to tmark:

First, let me say that you've done a remarkable job with this complex effort. Congratulations. It was quite a bit to digest in a single review and there are probably small nits here and there that I missed.

I found it relatively awkward to try and sort of comments per ticket so I've just included them on this ticket. I reviewed everything from 69d95cbf70cf5ce0f9ffee061fc46ec500210fdcg upward, which
appears to cover #5459, #5463, #5464, #5466, #5474 and #5476.

Hopefully we can avoid such clustering in the future, of course I realize this was beyond your control.

=======================================================================
query_filter.*

QueryFilter::

I prefer amServingScope() over isScopeServed(), and inScope() over
shouldProcess().

I modified the code per you preference.

=======================================================================
ha_impl.cc

HAImpl::scopesHandler()

This is more a general suggestion.

Maybe we need a method parseCommandWithArguments() or some such, as
this basic step of parseCommand() followed by :

+        // Arguments must be present.
+        if (!args) {
+            isc_throw(BadValue, "arguments not found in the 'ha-scopes' command");
+        }
+
+        // Arguments must be a map.
+        if (args->getType() != Element::map) {
+            isc_throw(BadValue, "arguments in the 'ha-scopes' command are not a map");
+        }

is done all over the place. Maybe parseCommand should simply return a
map element pointer whether the args were present or otherwise. If they
were missing return an empty map. Each processor would still error out
on any required arguments.

Food for thought.

Possibly. But, I decided to not do it here. Created ticket #5561.

=======================================================================

ha_messages.mes

+% HA_SCOPES_HANDLER_FAILED ha-scopes command failed: %1
+This error message is issued to indicate that the ha-scopes command handler
+failed while processing the command. The argument provides the reason for
+failure.

should be "reason for the failure" ;)

Corrected.

=======================================================================
query_filter.cc

+ assumptions about the availability of the servers' confiurations

confiurations -> configurations

Corrected.

=======================================================================
query_filter.*

QueryFilter::serveDefaultScopesOnly()
QueryFilter::serveFailoverScopesOnly()

I think you should drop the "Only" from the function name. It's almost
confusing and I don't think really adds anyting.

Renamed as suggested.

=======================================================================
query_filter.cc

QueryFilter::serveDefaultScopesOnly() {

The text below takes reading it twice:

    // If I am primary, I always serve my scope. If I am secondary, it means
    // that we're doing load balancing, so I am also responsible for my
    // own scope. If I am standby, I am not responsible for any scope until
    // the failover event.

how about this?

    // If I am primary or secondary, then I am only responsible for my own
    // scope.  If I am standby, I am not responsible for any scope.

Replaced the comment.

=======================================================================
query_filter.cc

QueryFilter::serveFailoverScopesOnly() {

Here too, I think you should drop the "Only" from the function name.

        // doing load balancing, the scope of the secondary server also
        // have to be served. Regardless if I am primary or secondary,

"have to be served" -> "has to be served"

Corrected.

=======================================================================
query_filter_unittest.cc

TEST_F(QueryFilterTest?, loadBalancingThisPrimary)
TEST_F(QueryFilterTest?, loadBalancingThisSecondary)
TEST_F(QueryFilterTest?, loadBalancingThisBackup).

Once QueryFilter::loadBalance() can select more than the primary
I'm assuming you'll expand these tests to verify packets that
in both scopes.

Maybe you should mark them with @todos.

Added todos as suggested.

=======================================================================
ha_service.h

In the commentary for processSynchronize() you describe it as
processing the ha-sync command "sent by the administrator" but
the command it is also generated automatically. I think you can
should either drop "sent by the administrator" or amend it to
include mention of its other origins.

+ / @brief Processes ha-sync command and returns a response.
+
/
+ / This method processes ha-sync command sent by the administrator. It
+
/ instructs the server to disable the DHCP service on the HA peer,
+ / fetch all leases from the peer and update the local lease database.

Dropped the "administrator" thing.

=======================================================================

General comment:

Have we done any testing with huge result sets?

If you have 2M leases, then lease4-get-all reponse would be single
JSON string containing all of them right? Have we tried this?

Not sure if QA is planning ahead for this, but it is something
they should start working on having system tests for asap. I would
hate to have us roll out and find out that large lease count syncs
choke for some reason.

Hopefully we'll have some time before final to do some benchmarking.
It would be interesting to compare sync/startup of ISC DHCP
failover pair with KEa failover (persisted memfile) pair, with
large lease files, and a MySQL pair.

We don't but QA is supposed to work on this. I'll be poking Wlodek.

=======================================================================

ha_server.cc

HAService::asyncSyncLeases()

Why test existing lease twice?

if (existing_lease && (existing_lease->cltt_ < lease->cltt_)) {
    LeaseMgrFactory::instance().updateLease4(lease);
} else if (!existing_lease) {
    // There is no such lease, so let's add it.
    LeaseMgrFactory::instance().addLease(lease);
}

You could do this, which tests it once and also let's us
log (if we wanted) the disregard of an obsolete update
(which ISC DHCP does):

if (!existing_lease)
    // There is no such lease, so let's add it.
    LeaseMgrFactory::instance().addLease(lease);
        LeaseMgrFactory::instance().updateLease4(lease);
} else if (existing_lease->cltt_ < lease->cltt_)) {
    // Ours is stale, update it
    LeaseMgrFactory::instance().updateLease4(lease);
} else {
    log_debug(WHY ARE YOU SENDING ME OBSOLETE INFO!)
}

Re-organized this section as suggested.

=======================================================================
ha_messages.mes

+% HA_LEASE4_SYNC_FAILED synchrnonization failed for lease: %1, reason: %2

Misspelled synchronization (too many n's)

and below "failed" should be "fails" because "creating" and "updating" are
present tense.

+This warning message is issued when creating or updating a lease in the
+local lease database failed.

should be:

+This warning message is issued when creating or updating a lease in the
+local lease database fails.

HA_LEASES4_SYNC_COMMUNICATIONS_FAILED
HA_LEASES4_SYNC_FAILED

In both of these "reason for error" should be "reason for the error"

Corrected all these things.

=======================================================================
communication_state.cc

CommunicationState4::analyseMessage ()

There is a possible gotcha in using secs. Some clients (mostly
Windows, iirc) send it in the wrong byte order. ISC DHCP has an
option to enable "detection" of this, see #45364

 The check-secs-byte-order statement

         check-secs-byte-order flag;

         When check-secs-byte-order is enabled,  the  server  will  check  for
         DHCPv4  clients  that  do  the byte ordering on the secs field incor‐
         rectly. This field should be in network byte order but  some  clients
         get  it wrong. When this parameter is enabled the server will examine
         the secs field and if it looks wrong (high byte non zero and low byte
         zero)  swap  the  bytes.   The default is disabled. This parameter is
         only useful when doing load  balancing  within  failover.

Created ticket #5558.

=======================================================================
communication_state.cc

CommunicationState4::analyseMessage ()

I am little concerned about the sequential search for client-id,
although in reality it should pretty hard to for a single MAC to
amass very many, unless it was a malicious attempt. The other
server would have to not be responding so we don't poke, and
someone would have to know then to flood us with one mac, ever
increasing client-ids, and a large secs value... AND it would
require max-unacked-clients to have been set to a large value.

Probably nothing to do here, other than think about it.

There is a room for improvement here but I think it is good enough for now. There are hurdles for the malicious attacker here as you're pointing out.

=======================================================================

communication_state_unittest.cc

+    /// @return Pointer to a function being test heartbeat implementation.

Uhmm...what?

Maybe this...

    Pointer to heartbeat implementation function under test

Modified as suggested.

================================================================
ha_service_unittest.cc

void testSendLeaseUpdates(std::function<void()> unpark_handler,

const bool should_pass) {

Not that it makes a difference now, but shouldn't you probably
change the order below and replace the HAService state before
using any of its services? For that matter why not either pass it
into or create the NakedCommunicationState? inside TestHAService's
constructor?

    // Create HA service and schedule lease updates.
    TestHAService service(io_service_, config_storage);
    ASSERT_NO_THROW(service.asyncSendLeaseUpdates(query, leases4, deleted_leases4,
                                                      parking_lot_handle));
    service.communication_state_ = state;

Code might change in the future which need the state and figuring out why
the test fell down might not be obvious.

I changed the order of statements, but I decided to not pass it via constructor, because there are currently many more cases where replacing a state is not required. I am revisit that further down the road.

================================================================
ha_test.cc

We should consider adding a wrapper in asiolink for reset():

+ io_service_->get_io_service().reset();

There are 2 occurrences of get_io_service() in HA and they're
both in ha_test.cc.

Created #5559.

================================================================
ha_service.cc

+ / hook library and the DHCP server. All asynchronous tasks are ran by

change "ran" to "run"

+ / @param deleted_leases Pointer to a collection og the released leases.

change "og" to "of"

Changed all occurrences of "ran" and fixed the typo.

=========================================================
ha_service.cc

HAService::asyncSendLeaseUpdate() -

Each time we have success we're potentially going to poke.
So that's at least once per V4 lease, twice if a prior lease
was deleted.

Should we concern ourselves about how often we will be
resetting the heartbeat timer? Think about high traffic
times... Looking at CommunicationState::startHeartbeatInternal()
,which gets called by poke(), I'm not sure this stuff we want
to do constantly. IOService as to do whatever stuff it does
to destroy the old timer and insert a new one.

I'm wondering if we don't need some sort of gating mechanism.
If I poked it 20 ms ago, why poke it again? You could use
stopwatch to track how long ago you poked and only poke every
so often. Some percentage of the interval? Just seems like
we would be thrashing this timer when the system is busy.

Per our jabber discussion, I updated the code to not poke more often than every 1 second. The heartbeat timer is set in seconds, so that matches the resolution of the heartbeat timer for lease updates case. We may later consider optimizations if that turns out to have performance implications.

=========================================================
ha_service.cc

HAService::asyncSendLeaseUpdate() -

What controls attempting to send lease updates to a peer?
If a peer is known to be offline, then where do we avoid
attempting to send it updates? If we keep trying and keep
failing, won't we just keep dropping the packets out of
the parking lot?

If this is logic yet to come, that's fine.

This is a kind of issue to be resolved when we implement state machines.

=========================================================
ha_service.cc

HAService::verifyAsyncResponse(const HttpResponsePtr?& response)

Maybe you could include the value of rcode in the throw text on
an unsuccessful result.

I include the rcode now.

=========================================================
Dhcpv4Srv::processPacket()

When you park a packet, you then reset() rsp.

I did not see where HooksManager::park() does anything
directly with rsp (like make a copy). What keesp the
actual response alive? Does passing it as a param to
when binding function copy the pointer?

On the surface it looks like you would destroying it before
it unparking it. What am I missing?

This is in src/lib/hooks/parking_lots.h:

        void update(const boost::any& parked_object,
                    std::function<void()> callback) {
            parked_object_ = parked_object;
            unpark_callback_ = callback;
        }

As you can see the parked_object (which is Pkt4Ptr in this case) is copied.

=========================================================
HAService::processSynchronize()

After calling the asyncSend, you run the client's
IOservice. What guards against us getting stuck in
this? Is there a timer inside the client?

Yes, the HTTP client has a timer which is signals a timeout (which is treated as error).

=========================================================
communication_state.*

You're spelling analyze with an "s":

CommunicationState::analyseMessage()

I believe we use US spellings by convention.

Ooops. That was Polglish, actually. Fixed that.

=========================================================

HAService::processSynchronize()

After calling the asyncSend, you run the client's
IOservice. What guards against us getting stuck in
this? Is there a timer inside the client?

See above.

=========================================================
ha_service_unittest.cc

HAServiceTest::testSendLeaseUpdates()

Looking at the code below:

        // Create HA service and schedule lease updates.
        TestHAService service(io_service_, config_storage);
        ASSERT_NO_THROW(service.asyncSendLeaseUpdates(query, leases4, deleted_leases4,
                                                      parking_lot_handle));

        service.communication_state_ = state;

Shouldn't you replace the state immediately after constructing
the service, prior to invoking any service functions? I realize
it doesn't currently matter now, but in the future it might and you
could end up with a subtle bug.

Maybe it would better to simply create the replacement state
inside TestHAService's constructor? That way all tests have the
mutable state and the order is ensured.

Corrected the order.

comment:6 Changed 21 months ago by tmark

  • Owner changed from tmark to marcin

Changes are fine.

comment:7 Changed 21 months ago by marcin

  • Resolution set to complete
  • Status changed from reviewing to closed

Merged with commit 64c9408be1ac50a96aa878bf22d43667767d9a59.

Note: See TracTickets for help on using tickets.