Opened 7 years ago

Closed 7 years ago

#2903 closed defect (fixed)

need more asio error handling in asiodns

Reported by: jinmei Owned by: jinmei
Priority: medium Milestone: Sprint-20130514
Component: b10-auth Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 4 Add Hours to Ticket: 0
Total Hours: 14.82 Internal?: no

Description

asiodns::TCPServer::operator() propagates one possible exception
from the asio level:

        peer_.reset(new TCPEndpoint(socket_->remote_endpoint()));

this internally calls getpeername(2), which could fail if the
connection is unexpectedly reset (e.g., by receiving RST) immediately
after accepting a new connection. If that happens this version of
remote_endpoint() throws an exception, which would be propagated to
the top level of the application and terminate it. that is seemingly
happening on our AS112 server.

One easy way to fix this is to use non-throw version of
remote_endpoint() and handle the error code (see the attached patch).

I also suggest checking other calls to ASIO methods/functions that can
throw and handle the errors appropriately.

Subtickets

Attachments (3)

tcp.diff (786 bytes) - added by jinmei 7 years ago.
asiodns.diff (10.0 KB) - added by jinmei 7 years ago.
asiodns-2.diff (9.9 KB) - added by jinmei 7 years ago.

Download all attachments as: .zip

Change History (24)

Changed 7 years ago by jinmei

comment:1 Changed 7 years ago by jinmei

  • Priority changed from medium to very high

comment:2 follow-up: Changed 7 years ago by jreed

I am running this patch on the as112 server since 2013-04-05 18:53 UTC.

comment:3 in reply to: ↑ 2 Changed 7 years ago by jreed

And no crashes since then.

root    5155 55.2  0.1 50736 11476   0  I    Fri06PM 4595:27.89 b10-auth
root    5152 53.8  0.1 50736 11388   0  I    Fri06PM 4603:20.78 b10-auth
root    5156 48.0  0.1 50736 11540   0  I    Fri06PM 4609:40.85 b10-auth
root    5157 47.0  0.1 50736 11496   0  I    Fri06PM 4607:53.54 b10-auth

comment:4 Changed 7 years ago by jreed

It crashed today. The kernel output:

pid 5157 (b10-auth), uid 0: exited on signal 6 (core dumped)

The bind10 output (that was not sent to logger):

terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what():  Connection reset by peer

The time stamp for that saved output is: Apr 19 13:10

Here are my logs:

2013-04-19 12:55:10.565 ERROR [b10-auth.cc/5156] CC_TIMEOUT timeout reading data from command channel
2013-04-19 12:55:10.566 ERROR [b10-auth.cc/5152] CC_TIMEOUT timeout reading data from command channel
2013-04-19 12:55:10.567 ERROR [b10-auth.auth/5156] AUTH_ZONEMGR_COMMS error communicating with zone manager: Timeout while reading data from cc session
2013-04-19 12:55:10.568 ERROR [b10-auth.cc/5156] CC_LENGTH_NOT_READY length not ready
2013-04-19 12:55:10.568 ERROR [b10-auth.auth/5152] AUTH_ZONEMGR_COMMS error communicating with zone manager: Timeout while reading data from cc session
2013-04-19 12:55:10.568 ERROR [b10-auth.auth/5156] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-04-19 12:55:10.569 ERROR [b10-auth.cc/5152] CC_LENGTH_NOT_READY length not ready
2013-04-19 12:55:10.570 ERROR [b10-auth.auth/5152] AUTH_ZONEMGR_COMMS error communicating with zone manager: ASIO read: data length is not ready
2013-04-19 13:10:20.426 INFO  [b10-init.init/5145] BIND10_LOST_SOCKET_CONSUMER consumer 31 of sockets disconnected, considering all its sockets closed
2013-04-19 13:10:20.541 INFO  [b10-init.init/5145] BIND10_PROCESS_ENDED process 5157 of b10-auth-4 ended with status 134
2013-04-19 13:10:20.615 ERROR [b10-init.init/5145] BIND10_COMPONENT_FAILED component b10-auth-4 (pid 5157) failed: process dumped core with exit status 134 (killed by signal 6: SIGABRT)
2013-04-19 13:10:20.629 INFO  [b10-init.init/5145] BIND10_COMPONENT_START component b10-auth-4 is starting
2013-04-19 13:10:20.633 INFO  [b10-init.init/5145] BIND10_STARTING_PROCESS starting process b10-auth
2013-04-19 13:10:22.070 INFO  [b10-auth.auth/14839] AUTH_SERVER_CREATED server created
2013-04-19 13:10:22.070 INFO  [b10-auth.auth/14839] AUTH_DATASRC_CLIENTS_BUILDER_STARTED data source builder thread started
2013-04-19 13:10:22.173 INFO  [b10-auth.auth/14839] AUTH_DATASRC_CLIENTS_BUILDER_RECONFIGURE_STARTED data source reconfiguration started
2013-04-19 13:10:22.174 INFO  [b10-auth.auth/14839] AUTH_SERVER_STARTED server started
2013-04-19 13:10:22.384 WARN  [b10-auth.datasrc_memory/14839] DATASRC_MEMORY_CHECK_WARNING BIND./CH: zone BIND/CH: NS has no address records (A or AAAA)
2013-04-19 13:10:22.445 INFO  [b10-auth.auth/14839] AUTH_DATASRC_CLIENTS_BUILDER_RECONFIGURE_SUCCESS data source reconfiguration completed successfully

Note this is using the tcp.diff patch.

Backtraces for the core file:

> thread apply all bt
Thread 2 (Thread 8032041c0 (LWP 101012)):
#0  0x000000080287503c in thr_kill () from /lib/libc.so.7
#1  0x00000008029111cb in abort () from /lib/libc.so.7
#2  0x0000000802462fa4 in __gnu_cxx::__verbose_terminate_handler ()
   from /usr/lib/libstdc++.so.6
#3  0x00000008024674a3 in std::set_unexpected () from /usr/lib/libstdc++.so.6
#4  0x00000008024674e3 in std::terminate () from /usr/lib/libstdc++.so.6
#5  0x000000080246744a in __cxa_throw () from /usr/lib/libstdc++.so.6
#6  0x0000000800f972c1 in boost::throw_exception<asio::system_error> (
    e=@0x7fffffffd300) at throw_exception.hpp:61
#7  0x0000000800f97453 in asio::detail::do_throw_error (err=@0x7fffffffd370)
    at throw_error.ipp:32
#8  0x0000000800f974a8 in asio::detail::throw_error (err=@0x7fffffffd370)
    at throw_error.hpp:34
#9  0x0000000801238e96 in asio::basic_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >::close (this=0x803337760) at basic_socket.hpp:264
#10 0x00000008012239de in isc::asiodns::TCPServer::operator() (
    this=0x7fffffffe008, ec=
      {value_ = 0, category_ = asio::error::system_category}, length=81)
    at tcp_server.cc:252
#11 0x0000000801236d13 in asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer>::operator() (this=0x7fffffffdfb0, ec=@0x7fffffffe108, bytes_transferred=81, start=0)
    at write.hpp:148
#12 0x0000000801236f02 in asio::detail::binder2<asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer>, asio::error_code, unsigned long>::operator() (this=0x7fffffffdfb0)
    at bind_handler.hpp:96
#13 0x0000000801236fac in asio::asio_handler_invoke<asio::detail::binder2<asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer>, asio::error_code, unsigned long> > (
    function=@0x7fffffffdfb0) at handler_invoke_hook.hpp:63
#14 0x0000000801237000 in asio_handler_invoke_helpers::invoke<asio::detail::binder2<asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer>, asio::error_code, unsigned long>, isc::asiodns::TCPServer> (function=@0x7fffffffe1b0, context=@0x7fffffffe208)
    at handler_invoke_helpers.hpp:39
#15 0x0000000801237071 in asio::detail::asio_handler_invoke<asio::detail::binder2<asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer>, asio::error_code, unsigned long>,asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer> (function=@0x7fffffffe1b0, this_handler=0x7fffffffe1b0)
    at write.hpp:289
#16 0x00000008012370a5 in asio_handler_invoke_helpers::invoke<asio::detail::binder2<asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer>, asio::error_code, unsigned long>, asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer> > (function=@0x7fffffffe1b0, 
    context=@0x7fffffffe1b0) at handler_invoke_helpers.hpp:39
#17 0x000000080123717f in asio::detail::reactive_socket_send_op<asio::detail::consuming_buffers<asio::const_buffer, boost::array<asio::const_buffer, 2l> >, asio::detail::write_op<asio::basic_stream_socket<asio::ip::tcp, asio::stream_socket_service<asio::ip::tcp> >, boost::array<asio::const_buffer, 2l>, asio::detail::transfer_all_t, isc::asiodns::TCPServer> >::do_complete (owner=0x8032271d0, 
    base=0x803308600) at reactive_socket_send_op.hpp:102
#18 0x0000000800f87495 in asio::detail::task_io_service_operation::complete (
    this=0x803308600, owner=@0x8032271d0) at task_io_service_operation.hpp:34
#19 0x0000000800f96d67 in asio::detail::task_io_service::do_one (
    this=0x8032271d0, lock=@0x7fffffffe480, this_idle_thread=0x7fffffffe490)
    at task_io_service.ipp:277
#20 0x00000008013bf56a in asio::detail::task_io_service::run (
    this=0x8032271d0, ec=@0x7fffffffe4f0) at task_io_service.ipp:129
#21 0x00000008013bf816 in asio::io_service::run (this=0x80320d1e0)
    at io_service.ipp:56
#22 0x00000008013bf845 in isc::asiolink::IOServiceImpl::run (this=0x80320d1e0)
    at io_service.cc:45
#23 0x00000008013be398 in isc::asiolink::IOService::run (this=0x803204380)
    at io_service.cc:81
#24 0x00000000004463d4 in main (argc=1, argv=0x7fffffffea30) at main.cc:250

Thread 1 (Thread 8032551c0 (LWP 101574)):
#0  0x000000080214f2bc in __error () from /lib/libthr.so.3
#1  0x000000080214d3b5 in pthread_cond_signal () from /lib/libthr.so.3
#2  0x0000000801973fa9 in isc::util::thread::CondVar::wait (this=0x803204418, 
    mutex=@0x803204420) at sync.cc:239
#3  0x000000000043d53e in isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar>::run (
    this=0x803204440) at datasrc_clients_mgr.h:484
#4  0x0000000000433be2 in boost::_mfi::mf0<void, isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar> >::operator() (this=0x80322cac0, p=0x803204440)
    at mem_fn_template.hpp:49
#5  0x0000000000433fa1 in boost::_bi::list1<boost::_bi::value<isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar>*> >::operator()<boost::_mfi::mf0<void, isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar> >, boost::_bi::list0> (this=0x80322cad0, f=@0x80322cac0, 
    a=@0x7fffffbfeeff) at bind.hpp:253
#6  0x0000000000433fe6 in boost::_bi::bind_t<void, boost::_mfi::mf0<void, isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar> >, boost::_bi::list1<boost::_bi::value<isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar>*> > >::operator() (this=0x80322cac0)
    at bind_template.hpp:20
#7  0x000000000043400d in boost::detail::function::void_function_obj_invoker0<boost::_bi::bind_t<void, boost::_mfi::mf0<void, isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar> >, boost::_bi::list1<boost::_bi::value<isc::auth::datasrc_clientmgr_internal::DataSrcClientsBuilderBase<isc::util::thread::Mutex, isc::util::thread::CondVar>*> > >, void>::invoke (function_obj_ptr=@0x80322cac0)
    at function_template.hpp:153
#8  0x0000000800f98f91 in boost::function0<void>::operator() (this=0x80322cab8)
    at function_template.hpp:1013
#9  0x0000000801978342 in isc::util::thread::Thread::Impl::run (
    impl_raw=0x80322cab0) at thread.cc:71
#10 0x0000000802145511 in pthread_getprio () from /lib/libthr.so.3
#11 0x0000000000000000 in ?? ()
Cannot access memory at address 0x7fffffbff000
#0  0x000000080287503c in thr_kill () from /lib/libc.so.7
Last edited 7 years ago by jreed (previous) (diff)

comment:5 Changed 7 years ago by muks

  • Estimated Difficulty changed from 0 to 4

comment:6 Changed 7 years ago by jinmei

moving it to the current sprint and picking it up, as I'm basically supposed to
work on defects only this week and we run out of any open defect tickets (or
open review tasks assigned to me). I believe we'll agree to working to this ticket
in the coming sprint anyway.

comment:7 Changed 7 years ago by jinmei

  • Milestone changed from Next-Sprint-Proposed to Sprint-20130423
  • Owner set to jinmei
  • Priority changed from very high to medium
  • Status changed from new to accepted

Changed 7 years ago by jinmei

Changed 7 years ago by jinmei

comment:8 Changed 7 years ago by jreed

The as112 server is now running with the asiodns-2.diff patch.

comment:9 Changed 7 years ago by jinmei

trac2903 is ready for review.

Running part of this branch seems to suggest it certainly fixes
the originally reported issue.

In changes up to 65bbffe I've examined all calls to asio:: methods and
functions, and handled error cases (in some cases I decided to let any
exception be propagated as they really seem to be rare and shouldn't
be triggered by remote clients).

Also, on a closer look I believed we should actually keep handling
incoming connections/UDP packets even if async_accept or
async_receive_from fail; at least silently stopping receiving
shouldn't be good; we should either throw an exception or (as this
branch does) keep going. At least in some cases the latter is better,
and I think unless we know there are cases we should rather throw (and
terminate) through the added logs we should keep this behavior.

Finally, I used this opportunity to introduce one unrelated set of
cleanups. It's mainly suggested in #1764. Amount of code change
isn't small, but I believe they are quite easy to understand (although
updates to tests are a bit tricky). This should make the code much
cleaner, and should actually make it a little bit faster: in my quick
experiment with a root server set up, max QPS is about 5% better than
the branch point version. But if this part looks too unrelated and/or
too big, I'm okay to separate it.

Suggested changelog:

608.?	[bug]		jinmei
	Handled some rare error cases in DNS server classes correctly.
	This fix specifically solves occasional crash of b10-auth due to
	errors caused by TCP DNS clients.  Also, as a result of cleanups
	with the fix, b10-auth should now be a little bit faster in
	handling UDP queries: in some local experiments it ran about 5%
	faster.
	(Trac #2903, git TBD)

comment:10 Changed 7 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:11 Changed 7 years ago by jinmei

Additional notes:

  • assuming no one started reviewing it, I've made a few more changes (4933a1c)
  • The change from asio::buffer() to asio::mutable/const_buffers_1() may need explanation. See commit log for 4ee02ef and 4933a1c. The redundant layer of wrapper calls may be optimized out, but at least these changes shouldn't do any harm.

comment:12 Changed 7 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:13 follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

I'm not strictly against unrelated cleanups, but this amount seems like 50% of the branch. If already written, I don't think it is worth throwing it out, or trying to rebuild the history for two branches, but the history would be cleaner if the works were separated. So, include it now, but let's be more careful about the amount of unrelated work in future. Anyway, I think there might be something wrong with out work process, if we can't push these cleanups in some fast way without cluttering other branches.

Also, a test in distcheck is failing for me. I don't know if it might be related, though. Should I try to find the cause of it?

...F.......
======================================================================
FAIL: test_bad_data (__main__.TestUserMgr)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/bind10/bind10-2/bind10-20130221/_build/../src/bin/usermgr/tests/b10-cmdctl-usermgr_test.py", line 477, in test_bad_data
    'add', 'user1', 'pass1'
  File "/var/tmp/bind10/bind10-2/bind10-20130221/_build/../src/bin/usermgr/tests/b10-cmdctl-usermgr_test.py", line 141, in run_check
    self.assertEqual(expected_stdout, stdout.decode())
AssertionError: 'Using accounts file: test_users.csv\nError parsing csv file: newline inside str [truncated]... != 'Using accounts file: test_users.csv\n'
  Using accounts file: test_users.csv
- Error parsing csv file: newline inside string

Can you explain in which cases this can throw? Allocation errors? Something else? If so, would it be some internal asio error?

-    io_.post(*this);
+    io_.post(*this);  // this can throw, but can be considered fatal.

Many YIELD return statements are replaced just by return? Is it safe? Was it YIELD really unneeded? Any idea why was it there before?

Also, in many error cases, should we log? At least DEBUG?

What was the use of checkin_callback? Is it used anywhere? Why the other server classes need it and the sync one does not? Should there be more explanation in the doxygen, than just „unused“?

Are these really unused? If so, why not remove them completely? Or do I understand the comment wrong?

-    // Objects to hold the query message and the answer
+    // Objects to hold the query message and the answer: these are not used
     isc::dns::MessagePtr query_, answer_;

In the last commit, you use object variable instead of local one. But, as all the local occurrences in other servers are ec, should this one be ec_ instead of ecode_, for consistency?

comment:14 in reply to: ↑ 13 ; follow-up: Changed 7 years ago by jinmei

Thanks for the review.

Replying to vorner:

I'm not strictly against unrelated cleanups, but this amount seems like 50% of the branch. If already written, I don't think it is worth throwing it out, or trying to rebuild the history for two branches, but the history would be cleaner if the works were separated. So, include it now, but let's be more careful about the amount of unrelated work in future. Anyway, I think there might be something wrong with out work process, if we can't push these cleanups in some fast way without cluttering other branches.

I'm sorry for the excessive amount of changes. I admit I tend to be a
bad piggy-backer. For this particular case, in retrospect I should at
least have limited the changes to sync_udp_server internal. As a
general workflow, it's probably wise to not do any side work in
general, or at least discuss it at the beginning of the review
process, before showing the changes.

Also, a test in distcheck is failing for me. I don't know if it might be related, though. Should I try to find the cause of it?

...F.......
======================================================================
FAIL: test_bad_data (__main__.TestUserMgr)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/bind10/bind10-2/bind10-20130221/_build/../src/bin/usermgr/tests/b10-cmdctl-usermgr_test.py", line 477, in test_bad_data
    'add', 'user1', 'pass1'
  File "/var/tmp/bind10/bind10-2/bind10-20130221/_build/../src/bin/usermgr/tests/b10-cmdctl-usermgr_test.py", line 141, in run_check
    self.assertEqual(expected_stdout, stdout.decode())
AssertionError: 'Using accounts file: test_users.csv\nError parsing csv file: newline inside str [truncated]... != 'Using accounts file: test_users.csv\n'
  Using accounts file: test_users.csv
- Error parsing csv file: newline inside string

I have no idea about this one, and I couldn't reproduce it myself.
It also looks quite unrelated to the changes in this branch...does
that happen if you run the distcheck on a clean tree (after distclean,
or even on a re-cloned repository)?

Can you explain in which cases this can throw? Allocation errors? Something else? If so, would it be some internal asio error?

-    io_.post(*this);
+    io_.post(*this);  // this can throw, but can be considered fatal.

According to the ASIO documentation
http://think-async.com/Asio/asio-1.4.8/doc/asio/reference/CompletionHandler.html
it can throw at least for memory allocation failure, which should be
okay to be considered fatal. I have to admit I've not fully checked
other possible reasons, but I believe it should be quite safe to say
it wouldn't be triggered by external message (at least not directly),
so I think it okay to let any possible exceptions be propagated.

Many YIELD return statements are replaced just by return? Is it safe? Was it YIELD really unneeded? Any idea why was it there before?

If I understand it correctly, CORO_YIELD only makes sense if the
function can be called again. So, in the context of ASIO event
handler, it should simply be no-op in effect unless it's followed by
something that registers itself as the handler of new event, like
async_xxx.

I don't know exactly why there were patterns of CORO_YIELD return,
but I suspect it began with a simple typo (redundant, though
harmless), and has been naively copy-pasted.

Also, in many error cases, should we log? At least DEBUG?

Hmm, maybe we should. In some cases I chose ERROR as I believe they
are really rare.

What was the use of checkin_callback? Is it used anywhere? Why the other server classes need it and the sync one does not? Should there be more explanation in the doxygen, than just „unused“?

I created a ticket (#2935), also trying to answer these questions, and
referred to the ticket from the documentation of SyncUDPServer.

Are these really unused? If so, why not remove them completely? Or do I understand the comment wrong?

-    // Objects to hold the query message and the answer
+    // Objects to hold the query message and the answer: these are not used
     isc::dns::MessagePtr query_, answer_;

Ah, actually, query_ is used. I've updated the comment with more
explanation on answer_.

In the last commit, you use object variable instead of local one. But, as all the local occurrences in other servers are ec, should this one be ec_ instead of ecode_, for consistency?

Actually, I somewhat intentionally avoid ec_ because the
operator() callback also has a parameter named ec and I thought
one could confuse the other. But it's not a strong opinion or
preference (and as long as one is defined as a const reference it's
less likely to be misused), so I changed it.

comment:15 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:16 in reply to: ↑ 14 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

I'm not strictly against unrelated cleanups, but this amount seems like 50% of the branch. If already written, I don't think it is worth throwing it out, or trying to rebuild the history for two branches, but the history would be cleaner if the works were separated. So, include it now, but let's be more careful about the amount of unrelated work in future. Anyway, I think there might be something wrong with out work process, if we can't push these cleanups in some fast way without cluttering other branches.

I'm sorry for the excessive amount of changes. I admit I tend to be a
bad piggy-backer. For this particular case, in retrospect I should at
least have limited the changes to sync_udp_server internal. As a
general workflow, it's probably wise to not do any side work in
general, or at least discuss it at the beginning of the review
process, before showing the changes.

Well, my point about the workflow was about the fact that it seems wrong there's a motivation for the piggy-backing. It shows that it is too hard to push in the small changes that are noticed into master, so they are piggy-backed with the bigger tickets as a result. I don't know if there's a solution, but it would be nice if it was easy enough to fix these things so we don't feel like piggy-backing. Currently, if you want to fix it properly, it is either 2-line change or creating a ticket and waiting for a long time to get picked up.

Also, a test in distcheck is failing for me. I don't know if it might be related, though. Should I try to find the cause of it?

...F.......
======================================================================
FAIL: test_bad_data (__main__.TestUserMgr)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/var/tmp/bind10/bind10-2/bind10-20130221/_build/../src/bin/usermgr/tests/b10-cmdctl-usermgr_test.py", line 477, in test_bad_data
    'add', 'user1', 'pass1'
  File "/var/tmp/bind10/bind10-2/bind10-20130221/_build/../src/bin/usermgr/tests/b10-cmdctl-usermgr_test.py", line 141, in run_check
    self.assertEqual(expected_stdout, stdout.decode())
AssertionError: 'Using accounts file: test_users.csv\nError parsing csv file: newline inside str [truncated]... != 'Using accounts file: test_users.csv\n'
  Using accounts file: test_users.csv
- Error parsing csv file: newline inside string

I have no idea about this one, and I couldn't reproduce it myself.
It also looks quite unrelated to the changes in this branch...does
that happen if you run the distcheck on a clean tree (after distclean,
or even on a re-cloned repository)?

I always run the builds and tests on fresh clone of the branch, on a separate container (something like lightweight virtual machine) without any other network services.

But I tried on master and it happens too. Maybe I upgraded something and the test started to fail. After removing this test, it went through correctly. I'm going to create a separate ticket for it.

About the changes:

% ASIODNS_SYNC_UDP_CLOSE_SOCKET_FAIL_ON_STOP failed to close a
This is the same to ASIODNS_UDP_CLOSE_SOCKET_FAIL_ON_STOP but happens
on the "synchronous UDP server", mainly used for the authoritative DNS
server daemon.

Looking at that, it seems quite a long ID. Even longer than the human-readable message. But that might be related to the fact the human-readable message looks incomplete.

Many messages speak about closing when the server is stopping itself. I believe we close the sockets under other circumstances too, for example when reconfiguring the listening addresses.

When did we see a failure on close? Under what circumstances can that happen? And do we leak a file descriptor in that case, or are we trying to close something that is not open?

From cppcheck (I don't know if they were present before, but it seems related):

src/lib/asiodns/sync_udp_server.cc:41: check_fail: Member variable 'SyncUDPServer::resume_called_' is not initialized in the constructor. (warning,uninitMemberVar)
src/lib/asiodns/sync_udp_server.cc:41: check_fail: Member variable 'SyncUDPServer::done_' is not initialized in the constructor. (warning,uninitMemberVar)

comment:17 in reply to: ↑ 16 ; follow-up: Changed 7 years ago by jinmei

Replying to vorner:

I'm sorry for the excessive amount of changes. [...]

Well, my point about the workflow was about the fact that it seems
wrong there's a motivation for the piggy-backing. It shows that it

I understood it, but simply thought your main point was about this
particular piggyback and focused on it in my response. Regarding the
workflow, I agree there's a room to improve in terms of how to handle
very small tasks. But at the same time I do not necessarily think the
higher barrier of incorporating them is a bad practice. Any such work
can be a distraction shifting our focus from the main feature tasks,
and if we consider ensuring quality, including tests and
documentation, it's not uncommon that even a few lines of patch is not
really that small as it's originally deemed.

We have some operational workaround in our practice: we already allow
making very trivial changes such as fixing typo in comments directly,
sometimes even skipping any review. We also allow skipping the formal
review process for urgent fixes if they are simple enough. And I
think reasonable piggy-back is actually not that bad as workaround:
by definition of piggy-back, both the developer and reviewer should
already be working in the context closer to the piggy-backed code, so
the overhead of distraction can be reduced. And yet we can (probably)
make more changes than we would if we strictly follow the formal
workflow.

As admitted already, I agree the current practice would still not be
ideal in terms of the balance of quality, speed, and focus. But
that's far beyond of the scope of this single ticket. More of a
subject of a team call, or ideally we could have discussed it in the
f2f meeting.

About the changes:

% ASIODNS_SYNC_UDP_CLOSE_SOCKET_FAIL_ON_STOP failed to close a
This is the same to ASIODNS_UDP_CLOSE_SOCKET_FAIL_ON_STOP but happens
on the "synchronous UDP server", mainly used for the authoritative DNS
server daemon.

Looking at that, it seems quite a long ID. Even longer than the human-readable message. But that might be related to the fact the human-readable message looks incomplete.

I admit they are quite long and probably look awkward. It's sometimes
difficult to come up with reasonable ID names due to the constraint of
uniqueness and if we still want to make them human understandable (not
something like "ASIODNS_LOG_1137"). Anyway, I tried to make them a bit
more concise. The incomplete message was a clear error, which was
also fixed.

Many messages speak about closing when the server is stopping itself. I believe we close the sockets under other circumstances too, for example when reconfiguring the listening addresses.

I tried to clarify that with the above change.

When did we see a failure on close? Under what circumstances can that happen? And do we leak a file descriptor in that case, or are we trying to close something that is not open?

To be honest, it's a sort of mystery to me. But it certainly happened
on our AS112 server (this ticket even has the log and stack trace of
that case). I tried to provide much more detailed explanation for
that log, including information about these questions, to the extent
of my best knowledge, but I still couldn't make it definitive.

From cppcheck (I don't know if they were present before, but it seems related):

src/lib/asiodns/sync_udp_server.cc:41: check_fail: Member variable 'SyncUDPServer::resume_called_' is not initialized in the constructor. (warning,uninitMemberVar)
src/lib/asiodns/sync_udp_server.cc:41: check_fail: Member variable 'SyncUDPServer::done_' is not initialized in the constructor. (warning,uninitMemberVar)

They were in the original code, and shouldn't have caused a real
problem as they are always set before read, but explicit
initialization is a generally good practice anyway, if only to silence
cppcheck. So I fixed it in the branch.

comment:18 Changed 7 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:19 in reply to: ↑ 17 ; follow-up: Changed 7 years ago by vorner

  • Owner changed from vorner to jinmei
  • Total Hours changed from 0 to 3.72

Hello

Replying to jinmei:

To be honest, it's a sort of mystery to me. But it certainly happened
on our AS112 server (this ticket even has the log and stack trace of
that case). I tried to provide much more detailed explanation for
that log, including information about these questions, to the extent
of my best knowledge, but I still couldn't make it definitive.

OK. I think there's nothing else we can do about that.

I believe it's now good to be merged.

comment:20 in reply to: ↑ 19 Changed 7 years ago by jinmei

Replying to vorner:

OK. I think there's nothing else we can do about that.

I believe it's now good to be merged.

Okay, thanks. Merge done, closing.

comment:21 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed
  • Total Hours changed from 3.72 to 14.82
Note: See TracTickets for help on using tickets.