Opened 8 years ago

Closed 8 years ago

Last modified 8 years ago

#1028 closed defect (fixed)

Large memory footprint for b10-xfrin

Reported by: shane Owned by: jinmei
Priority: medium Milestone: Sprint-20111108
Component: xfrin Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Medium
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 9 Add Hours to Ticket:
Total Hours: 14:49(+) Internal?: no

Description

On one of my production boxes:

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
  626 root      20   0  431m 256m 3728 S  0.0 51.7   1:10.13 b10-xfrin          
  630 root      20   0  174m  13m 3216 S  0.0  2.8   0:00.10 b10-cmdctl         
12341 root      20   0  201m  13m 3588 S  0.0  2.7   2:18.70 b10-xfrout         
  629 root      20   0  148m  13m 2632 S  0.0  2.6   3:19.01 b10-stats-httpd    
  627 root      20   0  151m  10m 2688 S  0.0  2.1   0:17.06 b10-zonemgr        
  628 root      20   0  143m  10m 2324 S  0.0  2.0   0:07.21 b10-stats          
  623 root      20   0  178m 5236 2936 S  0.0  1.0   0:02.45 b10-cfgmgr         
12091 root      20   0 67676 4000 2924 S  0.0  0.8   0:12.52 b10-auth           
  622 root      20   0  143m 3896 2288 S  0.0  0.8   0:27.36 b10-msgq           
  621 root      20   0  143m 2672 2316 S  0.0  0.5   0:00.09 bind10            

b10-xfrin is using 20x as much memory as anything else. This seems like it is probably a bug.

Subtickets

Change History (18)

comment:1 Changed 8 years ago by shane

  • Milestone New Tasks deleted

comment:2 Changed 8 years ago by vorner

Some observations that are probably related are in #1322 (it grows during transfer).

comment:3 Changed 8 years ago by jinmei

  • Milestone set to Next-Sprint-Proposed

comment:4 Changed 8 years ago by jelte

  • Add Hours to Ticket changed from 0 to 9

comment:5 Changed 8 years ago by jelte

  • Add Hours to Ticket 9 deleted
  • Estimated Difficulty changed from 0.0 to 9

comment:6 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20111108

comment:7 Changed 8 years ago by jinmei

  • Owner set to jinmei
  • Status changed from new to accepted

comment:8 Changed 8 years ago by jinmei

I identified the leak and fixed it. It was in the isc.dns wrapper.

The fix is in branch trac1028, which is ready for review.

Proposed changelog entry:

304.	[bug]		jinmei
	Python isc.dns, xfrin, xfrout: fixed reference leak in
	Message.get_question(), Message.get_section(), and
	RRset.get_rdata().  The leak caused severe memory leak in
	b10-xfrin, and (although no one reported it) should have caused
	less visible leak in b10-xfrout.
	(Trac #1028, git TBD)

This is due to the error-prone Python C-API, and I wouldn't be surprised
there are still some similar leaks. I grepped for PyList_Append and
PyList_New, and fixed all leaks found from the result, but if the reviewer
can identify something more, it'll probably make sense to fix them
within this ticket.

comment:9 Changed 8 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:10 Changed 8 years ago by jelte

  • Owner changed from UnAssigned to jelte

comment:11 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to jinmei

One comment on the code, there is a TODO comment there, and it is now above the wrong code (it talks about whether the get_section should really return an iterator or not). We should either remove it or move it back to the function itself (not opposed to just removing the suggestion entirely)

Code looks good. I did however find that when I repeatedly send retransfer commands, it still looks like b10_ixfr keeps growing in size...

Have you just been going through code or did you use some form of memory tracking system?

comment:12 in reply to: ↑ 11 ; follow-up: Changed 8 years ago by jinmei

Replying to jelte:

Thanks for the review.

One comment on the code, there is a TODO comment there, and it is now above the wrong code (it talks about whether the get_section should really return an iterator or not). We should either remove it or move it back to the function itself (not opposed to just removing the suggestion entirely)

It was originally just above get_question('s wrapper) and seemed to intend
to cover both get_question and get_section. The added code was also for
these two functions, so I moved the TODO comment above the whole group of
the helper template and functions.

Anyway, I have no objection to placing the TODO back to the original point.
So I made the change.

Code looks good. I did however find that when I repeatedly send retransfer commands, it still looks like b10_ixfr keeps growing in size...

You mean b10-xfrin? Hmm.

Have you just been going through code or did you use some form of memory tracking system?

The former (but I also actually check the memory footprint by transferring
a large size of zone).

comment:13 Changed 8 years ago by jinmei

  • Owner changed from jinmei to jelte

comment:14 in reply to: ↑ 12 Changed 8 years ago by jinmei

Replying to jinmei:

Code looks good. I did however find that when I repeatedly send retransfer commands, it still looks like b10_ixfr keeps growing in size...

You mean b10-xfrin? Hmm.

Okay, I believe I've found other leaks.

First one is in DataSourceClient?.get_updater(). See commit
65bd895. The fix is trivial, although it was difficult to figure it
out because it was indirect from the visible symptom.

The other one is a circular (self) reference within XfrinConnection?,
which is fixed in commit 1fc79b9.

Both of these somehow prevent XfrinConnection? from being released
(I was not really sure how exactly it happened though - simply because
having a self reference or composing an object with a non zero
reference doesn't always seem to cause leak. It may be specific to
threaded cases).

And, while fixing the second leak, I've noticed there are other
(though less likely to happen) possibilities of similar leak in
process_xfrin. So I also fixed it in commit 738b11d. This also
addresses some part of the concern described in #1292 (with this
fix it will be at least logged, and the session "lock" will be
released - although it still doesn't help much for #1292 because xfr
won't succeed anyway unless the fundamental issue of dlopen is
solved).

Finally, I made a small, unrelated cleanup: commit 1e9bb55.

I've been running the fixed code while repeating retransfer, and
I don't see significant growth of memory. Actually, I've still seen a
gradual increase of memory footprint - right now I'm not sure if
there's still leak or it's system level leak such as the one due to
memory fragmentation. But even if it's real remaining leak in our
code, I believe the current set of fixes is worth merging.

Another question, related to commit 65bd895 but not related to the
main topic of this ticket: I've moved Py_INCREF in
createZoneUpdaterObject() inside the first if block; the reference
seems to leak otherwise if tp_alloc fails and returns NULL. If I'm
correct here, we'll need the same change to createZoneIteratorObject()
and createZoneFinderObject(), but I've not touched them because
they are not really relevant to the topic of the ticket (and this
failure mode would be unlikely to happen in practice). Also, is there
a valid case where base_obj is NULL? If not, we should probably
rather throw an exception, or maybe we could pass base_obj by
reference if it can never be NULL.

This is the updated changelog entry:

305.?	[bug]		jinmei
	Python isc.dns, isc.datasrc, xfrin, xfrout: fixed reference leak
	in Message.get_question(), Message.get_section(),
	RRset.get_rdata(), and DataSourceClient.get_updater().
	The leak caused severe memory leak in b10-xfrin, and (although no
	one reported it) should have caused less visible leak in
	b10-xfrout.  b10-xfrin had its own leak, which was also fixed.
	(Trac #1028, git TBD)

comment:15 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to jinmei

There is another circular reference; XfrinState.__state it doesn't really appear to cause any problems, but perhaps we should probably unset it in the *End() handlers.

I've fixed one trivial typo in a comment.

Code looks good.

Yes, we should move the other increfs too. I do not think that in their current uses, there are any cases where parent can be NULL, though IIRC I originally wanted to put said functionality into util/python somewhere as well (and there could be very valid cases for some other types). I have no problems with either suggestion.

comment:16 in reply to: ↑ 15 Changed 8 years ago by jinmei

Replying to jelte:

There is another circular reference; XfrinState.__state it doesn't really appear to cause any problems, but perhaps we should probably unset it in the *End() handlers.

Do you mean XfrinConnection.__state? If so, I don't think there's
a circular reference around it - this attribute is only set to an
XfrinState object, which is "stateless" (so there shouldn't be
a reference back to the connection object from it).

I've fixed one trivial typo in a comment.

Ack, thanks.

Code looks good.

Yes, we should move the other increfs too. I do not think that in their current uses, there are any cases where parent can be NULL, though IIRC I originally wanted to put said functionality into util/python somewhere as well (and there could be very valid cases for some other types). I have no problems with either suggestion.

Okay, I simply moved the incref in other cases and merged the code.
Right now I don't have a strong opinion the NULL thing, and in any case
it's totally offtopic of this ticket, so I left it intact.

I'm now closing the ticket.

comment:17 Changed 8 years ago by jinmei

  • Resolution set to fixed
  • Status changed from reviewing to closed

comment:18 Changed 8 years ago by jinmei

  • Total Hours changed from 0 to 14:49(+)
Note: See TracTickets for help on using tickets.