#5605 closed defect (fixed)

RADIUS testing fixes

Reported by: tomek Owned by: fdupont
Priority: high Milestone: Kea1.4
Component: hook-radius Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

This emergency ticket is created to address last moment issues discovered with RADIUS.

Subtickets

Attachments (1)

bt (4.6 KB) - added by fdupont 19 months ago.
segmentation fault at exit (lldb bt)

Download all attachments as: .zip

Change History (19)

comment:1 Changed 19 months ago by fdupont

Fixes pushed to the branch. Testing with #5538 config reaches DHCP4_STARTED (and unloads).

comment:2 Changed 19 months ago by fdupont

  • Owner set to fdupont
  • Status changed from new to accepted

Making Kea and premium new trac5605a.

Changed 19 months ago by fdupont

segmentation fault at exit (lldb bt)

comment:3 Changed 19 months ago by fdupont

I attached the call stack for the seg fault at exit: IMHO it is an order issue between host cache or radius backend destruction and host manager (a static).

comment:4 Changed 19 months ago by fdupont

Go further (perhaps at the end of what I can do without a dedicated testbed) and got a DHCPOFFER via the radius server.
Using:

  • example JSON config (I updated the trac5605a Kea branch for it)
  • ISC DHCP client with this config:
    option AOP code 250 = text;
    send AOP "chambertin";
    
  • FreeRADIUS server with at the top of authorize file (cf Getting Started):
    chambertin Cleartext-Password := "French-wine"
                        Framed-IP-Address := 192.0.2.10
    

comment:5 Changed 19 months ago by fdupont

Shall try delBackend in hook host backends' destructors to see if it solves the host manager static crash on exit.

comment:6 Changed 19 months ago by fdupont

Because of this code at the end of acceptDirectRequest() (`dhcp4_srv.cc lines 2999-3004):

    bool drop = false;
    bool result = (!pkt->getLocalAddr().isV4Bcast() || selectSubnet(pkt, drop));
    if (drop) {
        // The packet must be dropped.
        return (false);
    }
    return (result);

selectSubnet() can be called and its callout subnet4_select before pkt4_receive
which does not make sense. I added a sanity_check flag to skip the callout which
will be called as expected at the beginning of process<packet-type> methods.

With this change system tests are successful for reservation and class/pool.

I got an exit crash caused by the macOS fix (destruction order is compiler dependent).
I'll commit a fix which should work for all cases.

comment:7 Changed 19 months ago by fdupont

  • Owner changed from fdupont to UnAssigned
  • Status changed from accepted to reviewing

Changed the exit crash fix so now it works with any destroy order...
Checked with success reservation and pool, DORA / renew, access / accounting on CentOS 7.
Doc still need updates:

  • attributes (i.e. #5608 so not in this ticket)
  • make clear that when flex-id is used client-id-printable must be true
  • restrictions, i.e., no shared network and no host database backend

Ready for review.

comment:8 Changed 19 months ago by fdupont

  • Owner changed from UnAssigned to fdupont
  • Status changed from reviewing to accepted

Cherry-picked doc changes from #5608. Taking the ticket to add server config logs.

comment:9 Changed 19 months ago by fdupont

  • Owner changed from fdupont to UnAssigned
  • Status changed from accepted to reviewing

comment:10 Changed 19 months ago by fdupont

Fixed misc details:

  • added config.h includes in access and accounting files (without it accounting default to sync vs async)
  • revamped extract_duid (accounting failed to pop leading 0 in User-Name)
  • fixed tests/Makefile.am (reorder to do request sooner, missed backend)

Still to be reviewed.

comment:11 Changed 19 months ago by tomek

  • Owner changed from UnAssigned to tomek

comment:12 follow-up: Changed 19 months ago by tomek

  • Owner changed from tomek to fdupont

KEA CODE REVIEW

I have reviewed changes on trac5605a in Kea repo. The code looks
ok, but I added extra comments to explain the sanity_only flag.
Please pull trac5605a and review.

This branch includes .gitmodules (again) and premium/ dir. Before you
commit anything, please review the files (and their content). I use
git gui to inspect them. I suggest you do the same.

Before you merge anything, make sure you remove .gitmodules and
premium/ dir from the repo.

Once you address those, the code is ready for merge. The change
is small enough, so it doesn't require a Changelog.

PREMIUM CODE REVIEW

Makefile.am
Why did you change the order of files in radius_unittests_SOURCES?
They were in alphabetic order, they are not anymore. Why?

radius.cc
In my opinion the RadiusImpl::getHostCache() is a very bad place
to add host cache backend for at least two reasons. First, it should
be done in the host cache code. Second, the method is called getHostCache(),
it should not add anything (this comment applies to type=radius as well).

radius_accounting.cc
Added some comments. Please pull and review.

radius_callout.cc
subnet4_select and subnet6_select: I think the reasons why the subnet4_select
callout is doing nothing should be logged. Added appropriate messages. Please
pull and review.

radius_utils.cc
extractDuid should also report a failure (set extracted to false), unless there
are very good reasons why not to do it.


With the exception of radius.cc, the code looks good and could be merged.
However, I have experienced several segmentation faults when running unit-tests
on Mac OS. Out of maybe 10 tries, I've experienced one segfault in
AccountingTest?.lease6Expire, AccountingTest?.lease6Decline, AccountingTest?.lease6Release.
AccountingTest?.lease6Renew. I tried to investigate (enabled core dump, run, got
core, loaded it using lldb, but then backtrace doesn't show anything useful:

(lldb) bt
libkea-asiolink.5.dylib was compiled with optimization - stepping may behave oddly; variables may not be available.
* thread #1, stop reason = signal SIGSTOP
  * frame #0: 0x0000000110556ba2 libkea-asiolink.5.dylib`isc::asiolink::IOServiceImpl::poll() [inlined] boost::asio::io_service::poll(this=0x0000000000000000) at io_service.ipp:85 [opt]
    frame #1: 0x0000000110556b92 libkea-asiolink.5.dylib`isc::asiolink::IOServiceImpl::poll(this=0x0000000000000000) at io_service.cc:72 [opt]
    frame #2: 0x000000010fa81359 radius_unittests`(anonymous namespace)::AccountingTest_lease6Release_Test::TestBody(this=0x00007fbc514a3150) at accounting_unittests.cc:1885 [opt]
    frame #3: 0x000000010faecd43 radius_unittests`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) [inlined] void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(method=<unavailable>, location=<unavailable>)(), char const*) at gtest.cc:2402 [opt]
    frame #4: 0x000000010faecd34 radius_unittests`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=<unavailable>, method=<unavailable>, location="the test body")(), char const*) at gtest.cc:2438 [opt]
    frame #5: 0x000000010faecc3f radius_unittests`testing::Test::Run(this=0x00007fbc514a3150) at gtest.cc:2474 [opt]
    frame #6: 0x000000010faedf2e radius_unittests`testing::TestInfo::Run(this=0x00007fbc51416820) at gtest.cc:2656 [opt]
    frame #7: 0x000000010faee8a7 radius_unittests`testing::TestCase::Run(this=0x00007fbc51410b80) at gtest.cc:2774 [opt]
    frame #8: 0x000000010faf65c7 radius_unittests`testing::internal::UnitTestImpl::RunAllTests(this=0x00007fbc51406690) at gtest.cc:4649 [opt]
    frame #9: 0x000000010faf60c3 radius_unittests`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) [inlined] bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(method=<unavailable>, location=<unavailable>)(), char const*) at gtest.cc:2402 [opt]
    frame #10: 0x000000010faf60b4 radius_unittests`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=<unavailable>, method=<unavailable>, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2438 [opt]
    frame #11: 0x000000010faf5fe7 radius_unittests`testing::UnitTest::Run(this=0x000000010fbb7820) at gtest.cc:4257 [opt]
    frame #12: 0x000000010f8f3dd0 radius_unittests`main [inlined] RUN_ALL_TESTS() at gtest.h:2233 [opt]
    frame #13: 0x000000010f8f3dc3 radius_unittests`main(argc=1, argv=<unavailable>) at run_unittests.cc:14 [opt]
    frame #14: 0x00007fff99968235 libdyld.dylib`start + 1

The code has been compiled with Francis fixes for FreeRADIUS-client.

I was able to replicate this issue on master, so it most likely
has nothing to do with the changes. Unless you know some easy and fast
way to fix this, I suggest to move forward with this ticket (review
my changes, address the radius.cc comment and then merge it), then
open separate ticket for this segfault problem.

comment:13 in reply to: ↑ 12 ; follow-up: Changed 19 months ago by fdupont

  • Owner changed from fdupont to tomek

Replying to tomek:

KEA CODE REVIEW

I have reviewed changes on trac5605a in Kea repo. The code looks
ok, but I added extra comments to explain the sanity_only flag.
Please pull trac5605a and review.

=> can't pull (usually it means you didn't push).

This branch includes .gitmodules (again) and premium/ dir. Before you
commit anything, please review the files (and their content).

=> it comes back at each git commit -a and cherry-pick
requires the -a flag...

PREMIUM CODE REVIEW

Makefile.am
Why did you change the order of files in radius_unittests_SOURCES?
They were in alphabetic order, they are not anymore. Why?

=> asynchronous accounting messed the request tests so
I put them in the logical order (and I found that backend test
was forgotten during merge).

radius.cc
In my opinion the RadiusImpl::getHostCache() is a very bad place
to add host cache backend for at least two reasons. First, it should
be done in the host cache code. Second, the method is called getHostCache(),
it should not add anything (this comment applies to type=radius as well).

=> the problem is the place (it can't be called before because at load time
the host manager is not initialized) or the name (checkHostBackends should
be better) or both? Note for the first point the #5562 should really improve
things.

radius_callout.cc
subnet4_select and subnet6_select: I think the reasons why the subnet4_select
callout is doing nothing should be logged. Added appropriate messages. Please
pull and review.

=> I disagree for the first case: to have no auth means that only the
accounting service was wanted. For the second case (failed to get
host cache) the error message should be sent by checkHostBackends.

radius_utils.cc
extractDuid should also report a failure (set extracted to false), unless there
are very good reasons why not to do it.

=> I disagree: to have nothing to extract is *not* an error condition.
And BTW the extract duid flag default to true.

With the exception of radius.cc, the code looks good and could be merged.

=> this means another round of review, doesn't it?

However, I have experienced several segmentation faults when running unit-tests

=> ASIO bug. IMHO it comes from BOOST_ASIO_DISABLE_THREADS which is still set to 1
when obviously it should not. Usually I get this crash in the thread memory allocation.
I agree to open a Kea 1.4 ticket even it comes from the use of threads in unit tests
(required for synchronous operation).

comment:14 in reply to: ↑ 13 ; follow-up: Changed 19 months ago by tomek

  • Owner changed from tomek to fdupont

Replying to fdupont:

Replying to tomek:

PREMIUM CODE REVIEW

Makefile.am
Why did you change the order of files in radius_unittests_SOURCES?
They were in alphabetic order, they are not anymore. Why?

=> asynchronous accounting messed the request tests so
I put them in the logical order (and I found that backend test
was forgotten during merge).

Fair enough. Please add a comment in Makefile.am that the tests are to be
listed in logical order.

radius.cc
In my opinion the RadiusImpl::getHostCache() is a very bad place
to add host cache backend for at least two reasons. First, it should
be done in the host cache code. Second, the method is called getHostCache(),
it should not add anything (this comment applies to type=radius as well).

=> the problem is the place (it can't be called before because at load time
the host manager is not initialized) or the name (checkHostBackends should
be better) or both? Note for the first point the #5562 should really improve
things.

Ok, let me rephrase that. There are three problems:

  1. the command is called getHostCache(), so it looks like simple getter, but

it actually sets various things. The change you did (rename to checkHostBackends)
solves the problem.

  1. the other issue I had is that the Radius library code adds a backend

that is of type cache. This should be done not here, but in the host-cache lib.
Each library should set up and clean up after itself.

  1. The checkHostBackends() adds two backends, but then cleanup deletes all

backends (note HostMgr::delAllBackends() in line 106) If there were more
backends configured, the cleanup operation would delete more than the
checkHostBackends added.

radius_callout.cc
subnet4_select and subnet6_select: I think the reasons why the subnet4_select
callout is doing nothing should be logged. Added appropriate messages. Please
pull and review.

=> I disagree for the first case: to have no auth means that only the
accounting service was wanted. For the second case (failed to get
host cache) the error message should be sent by checkHostBackends.

ok.

radius_utils.cc
extractDuid should also report a failure (set extracted to false), unless there
are very good reasons why not to do it.

=> I disagree: to have nothing to extract is *not* an error condition.
And BTW the extract duid flag default to true.

Ok, maybe failure was not the best word to describe it. I think the code
should set the boolean to false to say the extraction did not take place.
Nevertheless it should be reported.

However, I have experienced several segmentation faults when running unit-tests

=> ASIO bug. IMHO it comes from BOOST_ASIO_DISABLE_THREADS which is still set to 1
when obviously it should not. Usually I get this crash in the thread memory allocation.
I agree to open a Kea 1.4 ticket even it comes from the use of threads in unit tests
(required for synchronous operation).

Ok. Please open it.

comment:15 in reply to: ↑ 14 Changed 19 months ago by fdupont

  • Owner changed from fdupont to tomek

Replying to tomek:

Ok, let me rephrase that. There are three problems:

  1. the command is called getHostCache(), so it looks like simple getter, but

it actually sets various things. The change you did (rename to checkHostBackends)
solves the problem.

=> so it is done.

  1. the other issue I had is that the Radius library code adds a backend

that is of type cache. This should be done not here, but in the host-cache lib.
Each library should set up and clean up after itself.

=> it can't be done in the host cache library because the host cache library
does not register a callout and it must be done in a callout. One possibility
should be to use the server configured callout but in fact it is far better
to move addBackend's to Kea core, i.e., #5562.

  1. The checkHostBackends() adds two backends, but then cleanup deletes all

backends (note HostMgr::delAllBackends() in line 106) If there were more
backends configured, the cleanup operation would delete more than the
checkHostBackends added.

=> in fact no because only the two backends are added. If you look at log
for the change history:

  • without cleanup: crash at exit on macOS
  • with two individual cleanups: fixed in macOS but crash at exit on CentOS

The problem is at exit time global statics are destroyed in a compiler dependent
order. So I changed to the backend list clear which is immune.

Can we agree for more comments to explain these choices?

radius_utils.cc
extractDuid should also report a failure (set extracted to false), unless there
are very good reasons why not to do it.

=> I disagree: to have nothing to extract is *not* an error condition.
And BTW the extract duid flag default to true.

Ok, maybe failure was not the best word to describe it. I think the code
should set the boolean to false to say the extraction did not take place.

=> OK for setting the flag to false (vs assuming it is called with this).

Nevertheless it should be reported.

=> Still disagree: no utility function should report something.

Created #5615 for the asio vs thread issue.

Need a decision about #5612 soon.

comment:16 Changed 19 months ago by fdupont

A short note explaining why to move the addBackend("type=cache"); to host cache code is a bad idea:

  • at first view it seems fine because it avoids to require host cache users to do it.
  • the fact it can't be called at load time is not a problem because we have the server configured callout point which is at the very end of controlled server processConfig
  • the issue is at exit time: currently host manager structures are destroyed in a compiler dependent order so it puts a constraint on host cache users which voids the benefit.

In conclusion this should be postpone to #5542, i.e. further discussion should be in that ticket...

comment:17 Changed 19 months ago by tomek

  • Owner changed from tomek to fdupont

Ok, your explanation makes sense. Go ahead and merge this.

comment:18 Changed 19 months ago by fdupont

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged. Closing.

Note: See TracTickets for help on using tickets.