Opened 4 years ago

Closed 3 years ago

#4494 closed defect (fixed)

Segmentation fault after dhcpsrv unit tests exit

Reported by: stephen Owned by: stephen
Priority: high Milestone: Kea1.1
Component: Unclassified Version: git
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DHCP Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0.5
Total Hours: 7.5 Internal?: no

Description

Running the Kea unit tests on commit f116221, on Ubuntu 14.04 (on a virtual machine) with g++ 4.8.4 and log4cplus 1.0.2rc3, the dhcpsrv unit tests ended with a segmentation fault after reporting all tests had passed. The top of the stack looked like:

Program received signal SIGSEGV, Segmentation fault.
__memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:36
36	../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S: No such file or directory.
(gdb) bt
#0  __memcpy_sse2_unaligned () at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:36
#1  0x00007ffff5b7be30 in std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long) ()
   from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#2  0x00007ffff5b7c50c in std::string::assign(std::string const&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3  0x00007ffff50ae177 in operator= (__str=..., this=0xa28968) at /usr/include/c++/4.8/bits/basic_string.h:547
#4  log4cplus::pattern::BasicPatternConverter::convert (this=<optimised out>, result=" ", event=...)
    at src/patternlayout.cxx:420
#5  0x00007ffff50ae58a in log4cplus::pattern::PatternConverter::formatAndAppend (this=0xa94190, output=..., event=...)
    at src/patternlayout.cxx:359
#6  0x00007ffff50ae852 in log4cplus::PatternLayout::formatAndAppend (this=0xaa1840, output=..., event=...)
    at src/patternlayout.cxx:1122
#7  0x00007ffff5097378 in log4cplus::FileAppenderBase::append (this=0xa9d690, event=...) at src/fileappender.cxx:353
#8  0x00007ffff5085a9f in log4cplus::Appender::doAppend (this=0xa9d690, event=...) at src/appender.cxx:287
#9  0x00007ffff50845fe in log4cplus::helpers::AppenderAttachableImpl::appendLoopOnAppenders (this=this@entry=0xa86590, 
    event=...) at src/appenderattachableimpl.cxx:173
#10 0x00007ffff50a7714 in log4cplus::spi::LoggerImpl::callAppenders (this=0xa89930, event=...) at src/loggerimpl.cxx:60
#11 0x00007ffff799af05 in isc::log::LoggerImpl::outputRaw (this=0xa88f80, severity=<optimised out>, message=...)
    at logger_impl.cc:165
#12 0x00000000005dcfd0 in isc::log::Formatter<isc::log::Logger>::~Formatter (this=0x7fffffffdbe0, __in_chrg=<optimised out>)
    at ../../../../src/lib/log/log_formatter.h:160
#13 0x00007ffff77254d2 in isc::dhcp::TimerMgr::unregisterTimers (this=this@entry=0xa89290) at timer_mgr.cc:653
#14 0x00007ffff7725778 in isc::dhcp::TimerMgr::~TimerMgr (this=0xa89290, __in_chrg=<optimised out>) at timer_mgr.cc:621

The problem lies in the logging call made in TimerMgr::unregisterTimers(). This is a LOG_DEBUG call - but debug logging was not enabled. So as well as the problem deep in log4cplus, the implication is that the Logger::isDebugEnabled call gave the wrong return value as well. As none of the Kea logger objects had been destroyed at this point (the code was edited to write a message to stdout in the Logger's destructor - no messages appeared), and Logger::isXxxEnabled is a shell around a call to a similar method in log4cplus, the implication seems to be that objects in log4cplus have been destroyed so causing the problem.

The issuing of messages when static objects are being destroyed has been a problem elsewhere: see #4492.

Suggested solution
Every program, be it a Kea binary of a unit test, initializes the logging subsystem as one of the first things it does, an action that sets a static boolean "logging initialized" flag. Prior to that, any call to a logger function will cause an exception to be thrown. This flag can be used to address the issue:

  1. The logging initialization function should install an exit handler (using the atexit() call)L that exit handler will run before the destructors of static objects. If that handler clears the "logging initialized" flag, the flag will be clear when the static object destructors are executed. As the flag is boolean, no destructor will be run on it, so it will not change during the execution of the those functions.
  1. The "logging initialized" flag protects against all logging operations before log4cplus has been initialized. However, during the rundown of static objects, the only use of the logging system is to log a message. All Kea messages are logged via the LOG_XXX macros, which call Logger::isXxxEnabled: if that returns true, the message is logged. At present, these isXxxEnabled methods directly call a log4cplus method. The proposal is to modify them so that they check the Kea "logging initialized" flag first, and only call the log4cplus method if logging is initialized: if the flag is clear, no log4cplus functions will be called.

The net effect of this is to suppress the issuing of logging messages after the "return" from main() has been executed. The change can be achieved by modifying the logging module only: no other changes are needed to Kea code.

Subtickets

Change History (20)

comment:1 Changed 4 years ago by hschempf

  • Milestone changed from Kea-proposed to Kea1.1

per 5/5 team meeting, accept for 1.1 as bug fixing. no estimate.

comment:2 Changed 4 years ago by stephen

  • Owner set to stephen
  • Status changed from new to accepted

comment:3 Changed 4 years ago by stephen

  • Owner changed from stephen to UnAssigned
  • Status changed from accepted to reviewing

Ready for review. Suggested ChangeLog entry:

Modify logging to ensure that calls to log messages after program exit
(from static object destructors) are no-ops.  This reduces the risk of
crashes by preventing access to potentially-destroyed objects.

comment:4 Changed 4 years ago by fdupont

  • Owner changed from UnAssigned to fdupont

comment:5 Changed 4 years ago by fdupont

  • Owner changed from fdupont to UnAssigned

src/lib/log/log_messages.mes: in LOG_EXIT_HANDLER_DECLARE_FAIL %2 -> %1

src/lib/log/logger_manager.h: double space in "for example, if those objects"

src/lib/log/tests/terminate_logger_test.cc: main(int, char**) -> main() (this is the standard way in C++ for a program taking no arguments)

Reading the code it seems nice. I'll be busy tomorrow so I release the ticket in case someone wants to (deeper) review it.

comment:6 Changed 4 years ago by stephen

  • Total Hours changed from 0 to 4

comment:7 Changed 4 years ago by fdupont

  • Owner changed from UnAssigned to stephen

Giving the ticket to Stephen even my review is not really finished (or more exactly I am not sure now I really finished it, at least I didn't run the code for lack of time).

comment:8 Changed 4 years ago by stephen

  • Owner changed from stephen to fdupont

I've made the changes. Ready for re-review.

comment:9 Changed 4 years ago by stephen

  • Owner changed from fdupont to UnAssigned

comment:10 Changed 4 years ago by marcin

  • Owner changed from UnAssigned to marcin

comment:11 follow-up: Changed 4 years ago by marcin

  • Owner changed from marcin to UnAssigned

I wasn't able to reproduce this problem. Possibly, I am using different configuration flags, or lack some required logger configuration that triggers this issue. In the jenkins history I don't see any failures on Ubuntu 14 because of this. So, I can just review the ticket without reproducing the problem or pass this over to someone who can actually reproduce it for a review.

comment:12 in reply to: ↑ 11 Changed 4 years ago by fdupont

Replying to marcin:

I wasn't able to reproduce this problem. Possibly, I am using different configuration flags, or lack some required logger configuration that triggers this issue. In the jenkins history I don't see any failures on Ubuntu 14 because of this. So, I can just review the ticket without reproducing the problem or pass this over to someone who can actually reproduce it for a review.

=> one of the nasty properties of the global allocation/destroy order issue is to be very hard to reproduce... So IMHO you should not try to reproduce it.

comment:13 Changed 4 years ago by marcin

  • Owner changed from UnAssigned to marcin

comment:14 Changed 4 years ago by marcin

  • Add Hours to Ticket changed from 0 to 3
  • Owner changed from marcin to stephen
  • Total Hours changed from 4 to 7

Reviewed commit 328cac314b81fdac141ff619600e0f42beea1e09

In general, nicely done.

I tried to reproduce the issue with static initialization order fiasco but with no luck. I also haven't seen any signs of this problem on Ubuntu 14 machine in Jenkins. So, perhaps it is specific to your particular setup, but I agree it may come up anywhere, anytime. The point is that I couldn't test if the fix really works, but I believe you had tested it.

The unit tests and the distcheck passed for me on OS-X and FreeBSD10.3.

I have some nits to mention here. Nothing major. After addressing the nits you're ok to merge.

src/lib/log/log_messages.cc
It seems that after updating the message file you did not regenerate the "log_messages.cc". As a result the LOG_EXIT_HANDLER_DECLARE_FAIL message still contains "%2" instead of "%1".

src/lib/log/tests/terminate_logger_test.cc
Typo "messages§" in:

// Ensure that the messages§ are loaded into the dictionary at program
// startup.  Putting the MessageInitializer in main() ensures that it is
// deleted before the LogAfterExit destructor is run.

src/lib/log/tests/terminate_logger_test.sh.in
There are copy paste errors in the file header. Specifically, the copyright dates are incorrect and the file description that starts with "Checks that the logger will limit the output of messages..." is not relevant to the test.

comment:15 Changed 4 years ago by fdupont

  • Priority changed from medium to high

Fedora 22 VMs are still unstable on Jenkins because of this issue. I propose:

  • boost the priority
  • check it solves the Jenkins VM issue.

comment:16 Changed 4 years ago by fdupont

Not conclusive: master gives an error and a warning, trac4494 gives 2 warnings: libdhcpsrv_unittests still crashed on exit and a dhcp6 server test showed a similar problem...
I have a Fedora 22 VM, I'll see if I can reproduce the problem on it (note I keep my VMs up to date so it will be only an attempt).

comment:17 Changed 4 years ago by fdupont

I put a fprintf() in the log exist handler to check and it is not called before libdhcpsrv_unittests crashes. Conclusion: the atexit() solution doesn't work well... We can either create our own runtime library with main, exit and atexit, or add explicit log disabling in all executables. Note we already have isc::log::initLogger(); so why not add the symmetrical?

comment:18 Changed 3 years ago by stephen

  • Owner changed from stephen to marcin

Changes are in trac4494_new

My original solution, to disable logging at program shutdown, was treating the symptoms rather than the causes. And, as Francis pointed out, it didn't work.

The problem seems to due to objects declared static within functions. Consider the code snipper below:

Example& getA() {
    static Example A();
    return (A);
}

std::string& Example::getFoo() {
    static std::string foo("bar");
    return (foo);
}

AnotherExample& getB() {
    static AnotherExample B();
    return (B);
}

Functions/methods such as this are used to avoid the "static initialisation fiasco", where initialisation of a static object can succeed of fail depending on when other objects are initialised. Encapsulating the object within a function means that the object is created and initialised when the function is first called: this is after the program has properly started and is when (presumably) dependent objects have been initialised.

What appears to be happening is that at the point that the object is initialised, an entry is made in the exit handler list to destroy that object. That can lead to the following sequence:

  1. getA() is called, so creating A, an object of type Example.
  2. getB() is called, so creating B, an object of type AnotherExample.
  3. Something calls A::getFoo(), so the static string object "foo" is created.

At this point, the exit handler list is:

  1. Destructor for "foo"
  2. Destructor for B
  3. Destructor for A

(Destructors are executed in LIFO order.)

If the program now exits and the destructor for B calls A::getFoo(), although A is still in existence, "foo" has been destroyed. Depending on compiler and underlying memory management system, the contents of "foo" may or may not be accessible, which is why the crash is seen on some systems and not on others.

From trial and error, I found that a logging message could be successfully output in the HooksManager destructor, but caused a crash in the CfgMgr destructor. Using gdb, I managed to list the contents of the exit handler list at the point of program exit. The relevant portion is below:

{flavor = 4, func = {at = 0x7499d4406b2eba3d, on = {fn = 0x7499d4406b2eba3d, arg = 0x7ffff7941c60 <isc::dhcp::CfgMgr::instance()::cfg_mgr>}, cxa = {fn = 0x7499d4406b2eba3d, arg = 0x7ffff7941c60 <isc::dhcp::CfgMgr::instance()::cfg_mgr>, dso_handle = 0x7ffff79404e0}}},
{flavor = 4, func = {at = 0x7499d44c5112ba3d, on = {fn = 0x7499d44c5112ba3d, arg = 0x7ffff699fef0 <isc::dhcp::IfaceMgr::instancePtr()::iface_mgr>}, cxa = {fn = 0x7499d44c5112ba3d, arg = 0x7ffff699fef0 <isc::dhcp::IfaceMgr::instancePtr()::iface_mgr>, dso_handle = 0x7ffff699fc40}}},
{flavor = 4, func = {at = 0x8b663a412be6ba3d, on = {fn = 0x8b663a412be6ba3d, arg = 0x7ffff79425f0 <isc::dhcp::TimerMgr::instance()::timer_mgr>}, cxa = {fn = 0x8b663a412be6ba3d, arg = 0x7ffff79425f0 <isc::dhcp::TimerMgr::instance()::timer_mgr>, dso_handle = 0x7ffff79404e0}}},
{flavor = 4, func = {at = 0x8b663a283daaba3d, on = {fn = 0x8b663a283daaba3d, arg = 0x7ffff7b9a3d8 <isc::log::MessageDictionary::getText(std::string const&) const::empty>}, cxa = {fn = 0x8b663a283daaba3d, arg = 0x7ffff7b9a3d8 <isc::log::MessageDictionary::getText(std::string const&) const::empty>, dso_handle = 0x7ffff7b99fc0}}},
{flavor = 4, func = {at = 0x8b663a283daaba3d, on = {fn = 0x8b663a283daaba3d, arg = 0x7ffff7b9a280 <isc::log::LoggerLevelImpl::logLevelToString(int)::debug_string>}, cxa = {fn = 0x8b663a283daaba3d, arg = 0x7ffff7b9a280 <isc::log::LoggerLevelImpl::logLevelToString(int)::debug_string>, dso_handle = 0x7ffff7b99fc0}}},
{flavor = 4, func = {at = 0x8b663a283daaba3d, on = {fn = 0x8b663a283daaba3d, arg = 0x7ffff7b9a288 <isc::log::LoggerLevelImpl::logLevelToString(int)::empty_string>}, cxa = {fn = 0x8b663a283daaba3d, arg = 0x7ffff7b9a288 <isc::log::LoggerLevelImpl::logLevelToString(int)::empty_string>, dso_handle = 0x7ffff7b99fc0}}},
{flavor = 4, func = {at = 0x7499d705b7e2ba3d, on = {fn = 0x7499d705b7e2ba3d, arg = 0x7ffff6e35400 <isc::stats::StatsMgr::instance()::stats_mgr>}, cxa = {fn = 0x7499d705b7e2ba3d, arg = 0x7ffff6e35400 <isc::stats::StatsMgr::instance()::stats_mgr>, dso_handle = 0x7ffff6e353a8}}},
{flavor = 4, func = {at = 0x7499d4462fc2ba3d, on = {fn = 0x7499d4462fc2ba3d, arg = 0x7ffff7942320 <isc::dhcp::LeaseMgrFactory::getLeaseMgrPtr()::leaseMgrPtr>}, cxa = {fn = 0x7499d4462fc2ba3d, arg = 0x7ffff7942320 <isc::dhcp::LeaseMgrFactory::getLeaseMgrPtr()::leaseMgrPtr>, dso_handle = 0x7ffff79404e0}}},
{flavor = 4, func = {at = 0x7499d6a26d02ba3d, on = {fn = 0x7499d6a26d02ba3d, arg = 0x7ffff6330080 <isc::hooks::HooksManager::getHooksManager()::manager>}, cxa = {fn = 0x7499d6a26d02ba3d, arg = 0x7ffff6330080 <isc::hooks::HooksManager::getHooksManager()::manager>, dso_handle = 0x7ffff632fd80}}},

The important part are the lines referencing the LoggerLevelImpl and the MessageDictionary, which seem to indicate that objects in the logging system are being destroyed before the CfgMgr destructor is called. In fact, it the object debug_string in LoggerLevelImpl::logLevelToString which is at fault.

The fix contains patches for both LoggerLevelImpl and MessageDictionary. In the former case, a method that returns a reference to a const static object declared within the method has been changed to return an object by value. In the latter case, the const static variable within the method has been moved to be a const member within the enclosing object. (In the example above, the equivalent of making "foo" a member of object A. B can now make use of "foo" because A is not destroyed until after B.)

I've had a look at the other occurrences of this style of object access within the logging code, and the rest seem to be OK - they appear to be used during logging initialisation, which occurs very early in all programs, before other global static objects are created.

Suggested ChangeLog entry is:

Changes to logging to avoid premature destruction of some objects during
program termination.

comment:19 Changed 3 years ago by marcin

  • Add Hours to Ticket changed from 3 to 0.5
  • Owner changed from marcin to stephen
  • Total Hours changed from 7 to 7.5

I reviewed commit 4f200756e540e2fa295cf80b33b03cb93cdd03d5

Thanks for the detailed explanation of the underlying issue. As I said in my previous review, I had been unable to reproduce the problem so I am not able to verify whether the fix works or not. But, I am sure you have made this verification.

Changes are straightforward and I don't see any issue with them.

I wonder if your findings shouldn't be somehow included in the Kea Coding Guidelines, so as we avoid introducing such bugs in the future?

I think the change log should also include some sentence explaining that this fixes a bug which caused crashes observed on some (what?) systems. As it stands now, it sounds like you have applied some changes that are not really user visible, as they modify the internals of the logging system. This in fact fixes a crash of the system, which makes it much more interesting to users.

comment:20 Changed 3 years ago by stephen

  • Resolution set to fixed
  • Status changed from reviewing to closed

Merged in commit 023c2fc3688bcdbccb4b286bea873c8f9c06cd5f.

Note: See TracTickets for help on using tickets.