Opened 8 years ago

Closed 7 years ago

#1762 closed defect (fixed)

loading huge zone crashes all

Reported by: jreed Owned by:
Priority: low Milestone:
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

I tried to load 100,000,003 records in a single zone. After six minutes bind10 crashed.

2012-03-05 21:59:37.971 INFO  [b10-auth.auth] AUTH_SERVER_CREATED server created
2012-03-05 21:59:38.164 INFO  [b10-stats.stats] STATS_STARTING starting
2012-03-05 21:59:38.177 INFO  [b10-xfrout.xfrout] XFROUT_NEW_CONFIG Update xfrout configuration
2012-03-05 21:59:38.177 INFO  [b10-xfrout.xfrout] XFROUT_NEW_CONFIG_DONE Update xfrout configuration done
2012-03-05 21:59:38.237 INFO  [b10-stats-httpd.stats-httpd] STATHTTPD_STARTED listening on 127.0.0.1#8000
Traceback (most recent call last):
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-cfgmgr", line 107, in <module>
Traceback (most recent call last):
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-xfrout", line 1041, in <module>
Traceback (most recent call last):
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-stats-httpd", line 829, in <module>
2012-03-05 22:05:52.114 2012-03-05 22:05:52.114INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 16839 of Socket creator ended with status 9
 ERROR [b10-xfrin.xfrin] XFRIN_UNKNOWN_ERROR unknown error: Read of 0 bytes: connection closed
2012-03-05 22:06:00.128    sys.exit(main())
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-cfgmgr", line 95, in main
2012-03-05 22:06:00.697    cm.run()
  ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component Socket creator (pid 16839) failed with 9 exit status
ERROR [  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/config/cfgmgr.py", line 526, in run
b10-zonemgr.config] CONFIG_SESSION_STOPPING_FAILED error sending stopping message: [Errno 32] Broken pipe
    xfrout_server.run()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-xfrout", line 1014, in run
    stats_httpd.start()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-stats-httpd", line 292, in start
    self.mccs.check_command(nonblock=False)
    self._cc.check_command(False)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/config/ccsession.py", line 258, in check_command
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/config/ccsession.py", line 258, in check_command
    msg, env = self._session.group_recvmsg(nonblock)
    msg, env = self._session.group_recvmsg(nonblock)
2012-03-05 22:06:03.862  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 266, in group_recvmsg
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 266, in group_recvmsg
 FATAL [b10-boss.boss] BIND10_COMPONENT_UNSATISFIED component Socket creator is required to run and failed
Traceback (most recent call last):
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-zonemgr", line 693, in <module>
    env, msg  = self.recvmsg(nonblock, seq)
    msg, env = self.cc.group_recvmsg(False)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 266, in group_recvmsg
    env, msg  = self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 121, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 203, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 151, in _receive_bytes
    env, msg  = self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 121, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 203, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 151, in _receive_bytes
    raise ProtocolError("Read of 0 bytes: connection closed")
    raise ProtocolError("Read of 0 bytes: connection closed")
isc.cc.session.ProtocolError: Read of 0 bytes: connection closed
2012-03-05 22:06:07.267 INFO  [b10-boss.boss] BIND10_PROCESS_ENDED process 16840 of msgq ended with status 9
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 133, in recvmsg
    return self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/installlib/python3.1/site-packages/isc/cc/session.py", line 121, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 203, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 151, in _receive_bytes
    raise ProtocolError("Read of 0 bytes: connection closed")
isc.cc.session.ProtocolError: Read of 0 bytes: connection closed
isc.cc.session.ProtocolError: Read of 0 bytes: connection closed
2012-03-05 22:06:07.267 ERROR [b10-stats.config] CONFIG_SESSION_STOPPING_FAILED error sending stopping message: [Errno 32] Broken pipe
    zonemgrd.run()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-zonemgr", line 663, in run
2012-03-05 22:06:13.788 FATAL [b10-boss.boss] BIND10_MSGQ_DISAPPEARED msgq channel disappeared
    self._module_cc.check_command(False)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/config/ccsession.py", line 258, in check_command
    msg, env = self._session.group_recvmsg(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 266, in group_recvmsg
    env, msg  = self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 121, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 203, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 151, in _receive_bytes
    raise ProtocolError("Read of 0 bytes: connection closed")
Traceback (most recent call last):
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-stats", line 406, in <module>
isc.cc.session.ProtocolError: Read of 0 bytes: connection closed
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/jreed/pkg/lib/python3.1/threading.py", line 509, in _bootstrap_inner
    self.run()
  File "/home/jreed/pkg/lib/python3.1/threading.py", line 462, in run
    self._target(*self._args, **self._kwargs)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-cmdctl", line 351, in _handle_msg_from_msgq
    self._module_cc.check_command(False)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/config/ccsession.py", line 258, in check_command
    msg, env = self._session.group_recvmsg(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 266, in group_recvmsg
    env, msg  = self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 121, in recvmsg
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 203, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 151, in _receive_bytes
    raise ProtocolError("Read of 0 bytes: connection closed")
isc.cc.session.ProtocolError: Read of 0 bytes: connection closed

2012-03-05 22:06:17.8552012-03-05 22:06:17.855  ERROR [b10-xfrin.config] CONFIG_SESSION_STOPPING_FAILED error sending stopping message: [Errno 32] Broken pipe
INFO  [b10-boss.boss] BIND10_SHUTDOWN stopping the server
    stats.start()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/libexec/bind10-devel/b10-stats", line 189, in start
    self.mccs.check_command(False)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/config/ccsession.py", line 258, in check_command
2012-03-05 22:06:31.553 ERROR [b10-boss.config] CONFIG_SESSION_STOPPING_FAILED error sending stopping message: [Errno 32] Broken pipe
2012-03-05 22:06:34.286 INFO  [b10-boss.boss] BIND10_CONFIGURATOR_STOP bind10 component configurator is shutting down
    msg, env = self._session.group_recvmsg(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 266, in group_recvmsg
    env, msg  = self.recvmsg(nonblock, seq)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 121, in recvmsg
2012-03-05 22:06:40.777 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component msgq is being stopped
2012-03-05 22:06:46.091 INFO  [b10-boss.boss] BIND10_COMPONENT_STOP component b10-xfrin is being stopped
    data = self._receive_full_buffer(nonblock)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 203, in _receive_full_buffer
    self._receive_len_data()
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 163, in _receive_len_data
    new_data = self._receive_bytes(self._recv_len_size)
  File "/home/jreed/dnsbenchsuite/work/master-2012-02-15/20120222215946/install/lib/python3.1/site-packages/isc/cc/session.py", line 151, in _receive_bytes
    raise ProtocolError("Read of 0 bytes: connection closed")
2012-03-05 22:06:50.005isc.cc.session.ProtocolError: Read of 0 bytes: connection closed
 INFO  [b10-boss.boss] BIND10_STOP_PROCESS asking b10-xfrin to shut down
2012-03-05 22:07:00.415 ERROR [b10-boss.boss] BIND10_CONFIGURATOR_PLAN_INTERRUPTED configurator plan interrupted, only 1 of 9 done
2012-03-05 22:07:13.574 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to cfgmgr (PID 16841)
2012-03-05 22:07:18.566 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM to b10-zonemgr (PID 16842)
2012-03-05 22:07:18.567 INFO  [b10-boss.boss] BIND10_SEND_SIGTERM sending SIGTERM
...
2012-03-05 22:07:19.350 INFO  [b10-boss.boss] BIND10_SHUTDOWN_COMPLETE all processes ended, shutdown complete

(Even though that is code from February I reproduced with more recent code from March 1 too.)

The long wait was during the datasrc doing the "DATASRC_MEM_ADD_RRSET adding RRset" into the zone.

The files auth_xfrout_conn and msgq_socket were left around.

I have done this with --verbose debugging but don't yet see any specific error. It crashed after loading about 15% of the master file. Maybe caused by communication timeout (busy while loading zone?) or hit some memory limit.

2012-03-05 23:31:37.406 DEBUG [b10-auth (7232).datasrc] DATASRC_MEM_ADD_RRSET adding RRset 'ns2.4104824.example./A' into zone 'example.'
2012-03-05 23:31:41.226 ERROR [b10-boss.boss] BIND10_COMPONENT_FAILED component Socket creator (pid 7223) failed with 9 exit status
2012-03-05 23:31:42.381 FATAL [b10-boss.boss] BIND10_COMPONENT_UNSATISFIED component Socket creator is required to run and failed
2012-03-05 23:31:44.381 DEBUG [b10-auth (7232).datasrc] DATASRC_MEM_ADD_RRSET adding RRset '4104825.example./NS' into zone 'example.'

It continued to load many more DATASRC_MEM_ADD_RRSET after that even after the rest was shutting down, until a SIGTERM was sent to auth.

Subtickets

Change History (5)

comment:1 follow-up: Changed 8 years ago by jreed

The Linux kernel logged:

Mar  5 22:05:56 bind10-testing1 kernel: Total swap = 6094840kB
Mar  5 22:05:57 bind10-testing1 kernel: Free swap:            0kB
Mar  5 22:05:57 bind10-testing1 kernel: 1310720 pages of RAM
Mar  5 22:05:57 bind10-testing1 kernel: 301752 reserved pages
Mar  5 22:05:57 bind10-testing1 kernel: 336 pages shared
Mar  5 22:05:57 bind10-testing1 kernel: 0 pages swap cached
Mar  5 22:05:57 bind10-testing1 kernel: Out of memory: Killed process 16839 (b10-sockcreator).
Mar  5 22:05:57 bind10-testing1 kernel: b10-auth invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0
Mar  5 22:05:57 bind10-testing1 kernel: 
Mar  5 22:05:57 bind10-testing1 kernel: Call Trace:
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff800c3bda>] out_of_memory+0x8e
/0x2f3
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff8000f2ea>] __alloc_pages+0x24
5/0x2ce
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff80012a77>] __do_page_cache_re
adahead+0xa2/0x1e6
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff80063a05>] __wait_on_bit_lock
+0x5b/0x66
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff880fad4d>] :dm_mod:dm_any_con
gested+0x38/0x3f
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff80013406>] filemap_nopage+0x1
48/0x322
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff8000886a>] __handle_mm_fault+
0x1f8/0xe5c
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff80066b9f>] do_page_fault+0x4c
b/0x830
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff80028b27>] do_brk+0x1cf/0x2b2
Mar  5 22:05:58 bind10-testing1 kernel:  [<ffffffff8005dde9>] error_exit+0x0/0x8
4
Mar  5 22:05:58 bind10-testing1 kernel: 
Mar  5 22:05:58 bind10-testing1 kernel: Mem-info:

We should have the components recognize when they can't allocate memory and then shutdown nicely versus having confusing backtraces.

By the way, with old BIND 9.6.3, it just hung after using 3.7 virtual memory. I killed named manually at about 1.5 hours.

comment:2 in reply to: ↑ 1 ; follow-up: Changed 8 years ago by jinmei

Replying to jreed:

We should have the components recognize when they can't allocate memory and then shutdown nicely versus having confusing backtraces.

I agree this is the desired behavior, but I'm not sure if it's
feasible in a python program. Even handling a memory allocation
failure may require some more memory and there may be no realistic
exit than a crash.

comment:3 in reply to: ↑ 2 Changed 8 years ago by shane

  • Milestone New Tasks deleted
  • Priority changed from medium to low

Replying to jinmei:

Replying to jreed:

We should have the components recognize when they can't allocate memory and then shutdown nicely versus having confusing backtraces.

I agree this is the desired behavior, but I'm not sure if it's
feasible in a python program. Even handling a memory allocation
failure may require some more memory and there may be no realistic
exit than a crash.

I also agree that it is not always feasible, but I note that a lot of these come from our message library, which could perhaps handle it a bit more gracefully. But yes, basically when we are out of memory in the general case our only option is to say "out of memory" and exit. We might be able to drop some cache data or something like that in special cases, but basically we have to exit.

Note that in Linux you may not have any notice that you are out of memory, as the kernel overcommits and then kills processes to make more memory available:

http://lwn.net/Articles/317814/

I'm not sure what specific action we should take here, but I'll leave the ticket, although at a lower priority.

comment:4 Changed 8 years ago by vorner

I guess the out of memory is actually a secondary problem. As it kept adding the RRsets even when shutting down, I guess the problem is the config manager sent a config update, auth started loading the huge zone and stopped responding. After something timed out (the config manager, probably), a chain reaction happened (the manager got confused, did strange things to message queue, and that's when it all went wrong). But it's just a guess.

comment:5 Changed 7 years ago by jinmei

  • Resolution set to fixed
  • Status changed from new to closed

this problem is essentially a crash of overmemory condition.
I believe we can reasonably say it's resolved with the memory-efficient
version of in-memory data source, although we don't have a test
machine that has sufficient memory to load 100 million RRs and for
this specific data set the same crash would still happen.

Note: See TracTickets for help on using tickets.