Opened 9 years ago

Closed 9 years ago

#536 closed task (complete)

Make OutputBuffer more lightweight

Reported by: vorner Owned by: vorner
Priority: medium Milestone: Sprint-20110419
Component: libdns++ Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 7.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

This is one of the performance related tasks.

The isc::dns::OutputBuffer::operator[] function takes 25% of our running time. It is quite surprising, as a simple function like that should inline. The task is to make it more friendly for inlining, which includes:

  • The return value of the operator[] should be uint8_t instead of const uint8_t &, because the single byte is smaller to copy/pass around than actual pointer and saves overhead of taking the address and dereferencing it.
  • The data inside should be kept if something simpler than vector (as this is the probable reason why it isn't inlined). An array should be enough. It is more work to do in the modifying functions, and care must be taken to correctly allocate/deallocate (and create a copy constructor, if one is needed anywhere), but the possible performance gain should be large.

Subtickets

Change History (14)

comment:1 Changed 9 years ago by vorner

  • Owner changed from jinmei to UnAssigned
  • Status changed from new to assigned

Trac seems to assign tickets to people by Component.

comment:2 Changed 9 years ago by vorner

  • Owner changed from UnAssigned to vorner
  • Status changed from assigned to accepted

comment:3 Changed 9 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

The changes are ready in trac536. But the performance gain is smaller than I hoped for. I measured it and the QPS changed for me from 24155.668743 to 24919.778250, which looks like ~4%. I checked and the function really got inlined (or, it doesn't show on the profile list, therefore I guess it got into something).

There's a condition checking bounds inside the operator. If I tried removing it, it changed to 27046.200271. But that isn't a good way, removing a check.

comment:4 Changed 9 years ago by vorner

Ups, sorry, I looked wrong, I compiled it without --enable-static, so I was examining only the functions inside bin/auth. So, correction.

It does not inline unless the bounds check is removed (even if told to explicitly, the compilers have no respect and do whatever they like). The numbers I mentioned above are valid, only when compiled without --with-lcov it runs slightly faster. So, the commited code had some 26kQPS, the one without bounds check 32kQPS here.

The profile doesn't change much with the bounds check, without it changes to this:

47446    23.7183  isc::dns::Name::compare(isc::dns::Name const&) const
30205    15.0995  isc::dns::MessageRenderer::writeName(isc::dns::Name const&, bool)
26195    13.0949  boost::detail::shared_count::~shared_count()
15388     7.6925  asiolink::UDPServer::UDPServer(asiolink::UDPServer const&)
6560      3.2793  asiolink::UDPServer::~UDPServer()

Most of it probably got inlined into compare, but the writeName got faster as well, rendering some next problems more significant (eg. the shared pointers).

About the bounds check, what is the price of trading? If the check is not there and there's a bug, user will see a SEGFAULT instead of exception (I guess a bounds check exception shouldn't be caught anyway). Is the performance worth it? Should I bring it to the dev list? Or, do we have any rules about it?

comment:5 Changed 9 years ago by stephen

  • Estimated Difficulty changed from 0.0 to 7

comment:6 Changed 9 years ago by stephen

  • Milestone A-Team-Task-Backlog deleted

Milestone A-Team-Task-Backlog deleted

comment:7 Changed 9 years ago by shane

  • Milestone set to Sprint-20110405

comment:8 Changed 9 years ago by shane

  • Owner changed from UnAssigned to shane
  • Status changed from reviewing to accepted

I had a quick look through this.

One actual point:

We check for buffer size of 0 in ensureAllocated(), but not in our constructor(s). malloc() may return NULL in some cases if we try to allocate 0 bytes, but not always (it's up to the system). I think we should either explicitly disallow zero-sized buffers or we should check for 0-bytes and set our buffer to NULL rather than try malloc(). (The realloc() command always works, since we always insure at least 1024 bytes.)

We may want to experiment with reference counts and copy-on-write rather than actual memory allocation in our assignment operators if we need further optimizations here in the future. :)

I know this code hasn't changed, but I wonder if this:

        buffer_[size_ ++] = static_cast<uint8_t>((data & 0xff00U) >> 8);
        buffer_[size_ ++] = static_cast<uint8_t>(data & 0x00ffU);

Might be more efficient if written as array access somehow rather than bit shifting. Hopefully modern compilers optimize that away, but maybe some micro-benchmarking would reveal more.

comment:9 Changed 9 years ago by shane

  • Owner changed from shane to vorner
  • Status changed from accepted to reviewing

Okay, I checked compilation and tests, and it seems okay. Please merge! :)

comment:10 Changed 9 years ago by shane

Oh, as for the bounds check, I think maybe we can make a separate ticket for that, and consider it as future work. We can remove it if it really buys us 50% performance boost!

Please feel free to bring up on the dev list after merging and making a separate ticket. Maybe others feel strongly about this.

Last edited 9 years ago by shane (previous) (diff)

comment:11 Changed 9 years ago by vorner

  • Owner changed from vorner to shane

Hello

I pushed fix for the 0-length bug. Is it OK?

About the bit shifts, for one, bit shifts are really fast. For another, this is a place where even the ancient compilers were able to optimise, because there are no sideeffects and everything is just local finding of instructions doing the same (I've seen gcc do things like replacing strcpy(string, "hello"); with two MOVs). So I would just stay with whatever is the most readable, but if you like, we might try read the generated assembler if it ends up the same ;-).

I'll create the ticket, but I don't know if it really is full 50%. It looks more like 25% to me ;-).

Thanks

comment:12 Changed 9 years ago by shane

Sorry, sorry. I should not have approved a merge! My mistake.

I had a look at the fix, and it looks like there is still one malloc() that can fail, in the assignment operator. I added a check here, based on the other two, and pushed the change. Assuming this looks good to you then please merge.

comment:13 Changed 9 years ago by shane

  • Owner changed from shane to vorner

comment:14 Changed 9 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed

OK, thanks, closing (and I'll create the ticket)

Note: See TracTickets for help on using tickets.