#5638 closed defect (fixed)

Alloc engine gives up on allocating a lease because of the wrong status in the CalloutHandle

Reported by: marcin Owned by: marcin
Priority: high Milestone: Kea1.4-final
Component: dhcp Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

While testing High Availability it was found that the servers occasionally fail to allocate a lease for the clients. The allocation engine log says that the server failed to allocate the lease after 0 attempts. The number of attempts is driven by the pool size, so making 0 attempts seems to be caused by the code leaving the loop too early. In the course of testing it was later found that the following loop:

        for (uint64_t i = 0; i < max_attempts; ++i) {
            IOAddress candidate = allocator->pickAddress(subnet,
                                                         ctx.query_->getClasses(),
                                                         client_id,
                                                         ctx.requested_address_);
            // If address is not reserved for another client, try to allocate it.
            if (!addressReserved(candidate, ctx)) {
                // The call below will return the non-NULL pointer if we
                // successfully allocate this lease. This means that the
                // address is not in use by another client.
                new_lease = allocateOrReuseLease4(candidate, ctx);
                if (new_lease) {
                    return (new_lease);
                } else if (ctx.callout_handle_ &&
                           (ctx.callout_handle_->getStatus() !=
                            CalloutHandle::NEXT_STEP_CONTINUE)) {
                    // Don't retry when the callout status is not continue.
                    subnet.reset();
                    break;
                }
            }
            ++total_attempts;
        }

is left early, because of the wrong CallloutHandle status being set. The confirmed solution for this issue is to explicitly set the next step status to CONTINUE prior to calling allocateOrReuseLease4. Otherwise, some junk status may be present, left by some previous callouts' invocations.

The v6 code should also be examined to see if similar issue occurs.

Subtickets

Change History (8)

comment:1 Changed 13 months ago by marcin

  • Milestone changed from Kea-proposed to Kea1.4-final

Moving to 1.4-final per my agreement with Tomek.

comment:2 Changed 13 months ago by marcin

  • Owner set to marcin
  • Status changed from new to accepted

comment:3 Changed 13 months ago by marcin

  • Owner changed from marcin to UnAssigned
  • Status changed from accepted to reviewing

I corrected the bug in the allocation engine for both v6 and v4 case. I also created some unit tests to exercise those scenarios. However, the v6 test is not really getting as far as it should because the allocation engine guards against the situations when the createLease6 fails by checking whether a lease exists for a given address. This makes the probability of the original issue occurring in the v6 case very low.

Proposed ChangeLog entry:

14XX.	[bug]		marcin
	Corrected bug in the allocation engine which caused occasional
	lease allocation failures when a loaded hooks library set the
	callout status to non default value, e.g. "skip" rather than
	"continue". In such cases, the server reported that it failed
	to allocate a lease "after 0 attempts".
	(Trac #5638, git cafe)

comment:4 Changed 13 months ago by tmark

  • Owner changed from UnAssigned to tmark

comment:5 follow-up: Changed 13 months ago by tmark

  • Owner changed from tmark to marcin

General comment:

In a way, it strikes me that there is a larger failing. What you've encountered is some prior call out has set the status to SKIP and, presumably that was checked on the return side of the call out. It is then left set that way. One wonders if we should be resetting at that point after which we have tested it, making it operate more like a one-shot.

Alternatively, when we check if there is a call out present, maybe we should be resetting it to CONTINUE when there are none.

It also seems as if, perhaps, since the server has decided to continue and attempt to hand out a lease, it should be resetting this much higher up, like at the top of discoverLease4() and the ilk,
rather than inside allocateUnreservedLeaseX().

However, your solution certainly works for this code path and this far into the release schedule that is probably the safest approach.

Were you able to determine what call out was setting it to skip and when? I would like to understand the particular case causing the issue. I am assuming you were able to replicate this.


Please add some commentary to TEST_F(AllocEngine4Test, simpleAlloc4) that explains what is being done. It is pretty subtle that you are testing against a "left over skip" value. Might warrant it's own test for clarity.

Same goes for @@ AllocEngine6Test::simpleAlloc6Test(), at least add some explanation.

Builds and unit tests pass under Centos 7.

comment:6 in reply to: ↑ 5 Changed 13 months ago by marcin

  • Owner changed from marcin to tmark

Replying to tmark:

General comment:

In a way, it strikes me that there is a larger failing. What you've encountered is some prior call out has set the status to SKIP and, presumably that was checked on the return side of the call out. It is then left set that way. One wonders if we should be resetting at that point after which we have tested it, making it operate more like a one-shot.

Alternatively, when we check if there is a call out present, maybe we should be resetting it to CONTINUE when there are none.

There is a typical pattern how we execute callouts for a hook point:

        if (HooksManager::calloutsPresent(Hooks.hook_index_buffer4_send_)) {

            // Delete previously set arguments
            callout_handle->deleteAllArguments();

            callout_handle->setArgument(...);

            // Call callouts
            HooksManager::callCallouts(Hooks.hook_index_buffer4_send_,
                                       *callout_handle);

            // Callouts decided to skip the next processing step. The next
            // processing step would to parse the packet, so skip at this
            // stage means drop.
            if (callout_handle->getStatus() == CalloutHandle::NEXT_STEP_SKIP) {
                ...
            }
}

We follow that pattern in most cases and it works fine for two reasons:

  • HooksManager::callCallouts appear to reset the status to CONTINUE
  • returned status is examined within the if condition, so only if the callout is present and right after the callout returns.

The code in the allocation engine that we're fixing is an exception from this rule. It checks the returned status outside of the "if" condition, therefore it is possible that there is some garbage there. I don't think we want to now rewrite existing hook points to follow some more robust approach. In my opinion, it is sufficient to check the callout status right after the callout returns. In the exceptional case like the allocation engine, if it is checked outside of callout invocation section it has to be explicitly reset to make sure there is no garbage there. If you think it is worth it, I can modify the lease4_select hook point to check the status and set some argument (passed by reference) to allocateOrReuse function which could be later examined in place where we currently check the status gathered from the CalloutHandle. I am just not sure it is worth that effort.

It also seems as if, perhaps, since the server has decided to continue and attempt to hand out a lease, it should be resetting this much higher up, like at the top of discoverLease4() and the ilk,
rather than inside allocateUnreservedLeaseX().

However, your solution certainly works for this code path and this far into the release schedule that is probably the safest approach.

Were you able to determine what call out was setting it to skip and when? I would like to understand the particular case causing the issue. I am assuming you were able to replicate this.

The HA hooks library implements buffer4_receive callout which performs load balancing. It has to look into the packet so it calls unpack() and then sets NEXT_STEP_SKIP to prevent the server from unpacking the packet again. I am sure that this is the status that gets propagated to the allocation engine and causes the loop to break without attempting to allocate an address.


Please add some commentary to TEST_F(AllocEngine4Test, simpleAlloc4) that explains what is being done. It is pretty subtle that you are testing against a "left over skip" value. Might warrant it's own test for clarity.

Same goes for @@ AllocEngine6Test::simpleAlloc6Test(), at least add some explanation.

Added commentary.

Builds and unit tests pass under Centos 7.

comment:7 Changed 13 months ago by tmark

  • Owner changed from tmark to marcin

Your explanation is acceptable. Somehow it still feels clunky but it is a bit like using C functions that set errno. If you need to check its value, clear it before calling funcs that set it.

Commentary is good, thanks.

Please merge.

comment:8 Changed 13 months ago by marcin

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged with commit f2e9b686ae52e1b06f660e1b522588b1440e2620

Note: See TracTickets for help on using tickets.