Opened 8 years ago

Closed 8 years ago

#1603 closed task (complete)

replace name compression with more efficient one

Reported by: jinmei Owned by: jinmei
Priority: medium Milestone: Sprint-20120320
Component: libdns++ Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: N/A
Sub-Project: DNS Feature Depending on Ticket: auth performance
Estimated Difficulty: 7 Add Hours to Ticket: 0
Total Hours: 21.66 Internal?: no

Description

This is a subtask for "Replace the name compression logic"
described in
https://lists.isc.org/pipermail/bind10-dev/2012-January/002985.html

It depends on #1602.

We rewrite the internal implementation of dns::MessageRenderer:

  • MessageRendererImpl now internally maintains a hash table. Its key is a LabelSequence and the value is the corresponding compression offset.
  • The general algorithm of writeName is the same as in the current implementation, but it uses this backend.
  • It should keep resources reserved for some reasonable number of hash entries so that it can be efficient when reused.
  • Since a LabelSequence needs the underlying Name object to stay alive, we should also keep a copy of the name if the sequence is stored in the hash.
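A minimal sketch of the proposed design (hypothetical names; the real implementation keys the table on LabelSequence, but a lowercased textual key keeps this example self-contained):

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-in for the hash table inside
// MessageRendererImpl: maps a case-normalized name to its compression offset.
class CompressTable {
public:
    // Returns the stored offset for the name, or -1 if not found.
    int find(const std::string& name) const {
        const auto it = table_.find(toLower(name));
        return (it == table_.end()) ? -1 : static_cast<int>(it->second);
    }
    // Records the offset at which the name was rendered.  A copy of the
    // (normalized) name is kept as the key, mirroring the ticket's note
    // about keeping the underlying name data alive.
    void add(const std::string& name, uint16_t offset) {
        table_[toLower(name)] = offset;
    }
    // Drops entries; with most implementations the bucket array is kept
    // allocated, which helps when the renderer is reused.
    void clear() { table_.clear(); }
private:
    static std::string toLower(const std::string& s) {
        std::string r(s);
        std::transform(r.begin(), r.end(), r.begin(),
                       [](unsigned char c) { return std::tolower(c); });
        return r;
    }
    std::unordered_map<std::string, uint16_t> table_;
};
```

DNS name compression is case-insensitive, which is why the key is normalized before lookup and insertion.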

Subtickets

Change History (22)

comment:1 Changed 8 years ago by jelte

  • Milestone changed from Year 3 Task Backlog to Next-Sprint-Proposed

comment:2 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 7

comment:3 Changed 8 years ago by jinmei

  • Milestone changed from Year 3 Task Backlog to Sprint-20120306
  • Owner changed from UnAssigned to jinmei
  • Status changed from new to accepted

As we're running out of open tickets, I'm now pushing this one to the current sprint...

comment:4 Changed 8 years ago by jinmei

(updated with some corrections and additions...)

trac1603 is ready for review.

I needed to update the LabelSequence class (to fix small bugs
and add some missing features), and the branch has become larger than
I expected (once again...).

The changes to LabelSequence are independent, so if the branch
looks too big, we can separate the review into the updates to
LabelSequence and the MessageRenderer changes.

  • first, I've copied the original MessageRenderer implementation to a separate file, changing the class name (dc0963a), so I can compare the performance. My plan is to remove this file and the corresponding code in the benchmark once the review is completed.
  • changes from bfb21ab to ba84010 are the necessary updates to LabelSequence. I've fixed the reversed-meaning issue of #1754 within this branch (commit 681c927).
  • Commits f0b94ec and 1296fe2 are the main changes. Basically, they update the internals of MessageRendererImpl and the writeName() method. Other parts are basically intact except for the things changed in 1296fe2.
  • beed5d7 is a newly added microbenchmark test. It's a standalone test (test data are hardcoded), so anyone should be able to run it in their environment to compare the old and new versions.
  • finally, 7eda3c0 is a proposed additional change to LabelSequence. It's not strictly needed for the purpose of this branch (so if it's controversial I'm okay with skipping it for now), but my benchmark indicated the performance improvement is significant (as logged, it made the code about 17% faster). It also happens to address the portability concern about strncasecmp() I mentioned in #1754.

This is a result of this test on my laptop:

Benchmark for old MessageRenderer (positive response)
Performed 4200000 iterations in 10.491190s (400335.90ips)
Benchmark for dumb MessageRenderer (positive response)
Performed 4200000 iterations in 0.168659s (24902317.69ips)
Benchmark for new MessageRenderer (positive response)
Performed 4200000 iterations in 1.353617s (3102797.91ips)
Benchmark for old MessageRenderer (NXDOMAIN response)
Performed 400000 iterations in 0.678503s (589533.13ips)
Benchmark for dumb MessageRenderer (NXDOMAIN response)
Performed 400000 iterations in 0.015943s (25089380.92ips)
Benchmark for new MessageRenderer (NXDOMAIN response)
Performed 400000 iterations in 0.217929s (1835460.17ips)
Benchmark for old MessageRenderer (SERVFAIL response)
Performed 100000 iterations in 0.068628s (1457131.20ips)
Benchmark for dumb MessageRenderer (SERVFAIL response)
Performed 100000 iterations in 0.004277s (23380874.44ips)
Benchmark for new MessageRenderer (SERVFAIL response)
Performed 100000 iterations in 0.081888s (1221180.15ips)

Note: in the "SERVFAIL" scenario (where there's only one name),
the new version is actually slower than the old version. If we want
to avoid that, we could have the auth server use the "dumb" renderer
when it returns an error message. The dumb implementation is mostly
empty, so the additional implementation/maintenance overhead should be
minimal, and it's faster than the other versions by an order of
magnitude.

One more thing: to make a fairer comparison with the old version, I
made another branch on top of the main branch, trac1603bench (pushed
to the public repo). It includes the optimization made in #1568 (use
assert instead of throw in the buffer class). The microbenchmark
shows the new version is still 2 to 5 times faster (except in the
"SERVFAIL" case).

Finally, while I originally didn't think we needed a changelog entry
for this task, it turned out to include one minor backward-incompatible
change, so I propose adding this one:

395.	[func]*		jinmei
	libdns++: updated the internal implementation of the
	MessageRenderer class.  This is mostly a transparent change, but
	the new version now doesn't allow changing compression mode in the
	middle of rendering (which shouldn't be an issue in practice).
	On the other hand, name compression performance was significantly
	improved: depending on the number of names, micro benchmark tests
	showed the new version is 3 to 7 times faster than the previous
	version.
	(Trac #1603, git TBD)
Last edited 8 years ago by jinmei (previous) (diff)

comment:5 Changed 8 years ago by jinmei

  • Owner changed from jinmei to UnAssigned
  • Status changed from accepted to reviewing

comment:6 Changed 8 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:7 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

First, on my system the new one is also slower on NXDOMAINs. I fear it's because the clearing or initialization is quite heavyweight. Maybe we could just call clear() on each vector that is not empty? Or, could the renderer know how many names there will be, so it could create a large enough hash table and use open addressing instead of separate chaining? We could get rid of so many vectors.

Speaking about the hash table and hashing, I have a few questions:

  • Why does the hash function take only 16 characters? What is the significance of 16? Why not hash the whole thing?
  • Why do we create such a „high“ hash table, with 64 buckets and 16 slots in each bucket? If we expect that many collisions, shouldn't we create more buckets instead?
  • Why do we use 64 buckets, when 64 is a nice round number? AFAIK there are fewer collisions with prime-number sizes.
  • Isn't there a hash table in the boost library we could just reuse?

What does the swap do here? Why is the thing copied and a new vector created?

    for (size_t i = 0; i < MessageRendererImpl::BUCKETS; ++i) {
        if (impl_->table_[i].size() > MessageRendererImpl::RESERVED_ITEMS) {
            impl_->table_[i].reserve(MessageRendererImpl::RESERVED_ITEMS);
            vector<MessageRendererImpl::OffsetItem>(impl_->table_[i].begin(),
                                                    impl_->table_[i].end()).
                swap(impl_->table_[i]);
        }
    }

About the map used for lowercasing ‒ the whole thing with the private header doesn't seem very nice. For one, why is the variable declared in both name_internal.h and name.h? And could we make it a private static member of Name, so others really could not get to it, but LabelSequence (being a friend) could?

About the LabelSequence changes, I have a few suggestions:

  • Currently, a default-constructed label sequence is kind of broken. And the pointer there looks hairy too. Could it be initialized with Name::ROOT_NAME() instead? That way it would kind of correspond to a default-constructed Name.
  • As the hash of a label sequence is such a lightweight type (size_t), would it make sense to cache it for future use? For example, have two size_t variables (an array) and two bools, one pair each for case-sensitive and case-insensitive hashing; the bool would say whether the size_t is filled in, and when the hash is computed it would be stored there. Whenever the sequence is stripped at either end, the flags would be reset to false. This would avoid computing the hash again (which is AFAIK happening ‒ it is first computed for the lookup in the hash table, then computed again when adding to the table). It could also speed up comparison in the negative case ‒ if both hashes are computed and they differ, the sequences cannot be equal; only if that isn't the case would the real comparison happen. This could speed up lookups in case of collision.
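A rough sketch of this caching idea (hypothetical class and method names, not the actual LabelSequence API):

```cpp
#include <cstddef>
#include <string>

// Hypothetical sequence-like object that memoizes its case-sensitive and
// case-insensitive hashes, invalidating both whenever it is mutated.
class CachedHashSeq {
public:
    explicit CachedHashSeq(const std::string& data) : data_(data) {
        valid_[0] = valid_[1] = false;
    }
    // Computes the hash lazily and caches it per case-sensitivity mode.
    size_t getHash(bool case_sensitive) {
        const int idx = case_sensitive ? 0 : 1;
        if (!valid_[idx]) {
            cached_[idx] = computeHash(case_sensitive);
            valid_[idx] = true;
        }
        return cached_[idx];
    }
    // Stripping data at the end invalidates both cached hashes.
    void stripRight(size_t n) {
        data_.erase(data_.size() - n);
        valid_[0] = valid_[1] = false;
    }
private:
    size_t computeHash(bool case_sensitive) const {
        size_t h = 0;
        for (size_t i = 0; i < data_.size(); ++i) {
            unsigned char c = data_[i];
            if (!case_sensitive && c >= 'A' && c <= 'Z') {
                c += ('a' - 'A');
            }
            h = h * 31 + c;   // simple polynomial hash for illustration
        }
        return h;
    }
    std::string data_;
    size_t cached_[2];
    bool valid_[2];
};
```

The two-slot cache means a sequence used for both a case-insensitive table lookup and a later case-sensitive comparison pays for each hash at most once between mutations.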

Also, about the old renderer: would it make sense to preserve it so future benchmarks still work? Perhaps put it under the benchmark directory, so we don't poison the namespace of libdns++?

comment:8 follow-up: Changed 8 years ago by vorner

And I forgot to note: the systest fails for me in ixfr/in-2 (it produces a really large diff of missing entries; it looks like xfrin is not starting up for some reason). Does it fail for you too?

comment:9 in reply to: ↑ 8 Changed 8 years ago by jinmei

Replying to vorner:

First, about this one:

And I forgot to note: the systest fails for me in ixfr/in-2 (it produces a really large diff of missing entries; it looks like xfrin is not starting up for some reason). Does it fail for you too?

Good catch, there was a serious bug in the code (btw, what happened was
that xfrout died due to a bug in the renderer): if we tried to render
many names, the placeholder vector for names would need to be extended,
and when that happens some resources of the originally stored names may
be released (as a result of the copy). LabelSequences that refer to
those resources could then trigger a crash.

On thinking about it, I ended up somewhat reverting to the
original design: still using the renderer's internal buffer as the
placeholder, but using a hash table for the performance improvement.
Unfortunately that required a heavy rewrite. The internal
implementation of messagerenderer.cc was almost fully revised, so
could you review it again as a whole rather than checking the diff?
(Sorry for the duplicated effort...)

One note: the revised version heavily uses
OutputBuffer::operator[] again. So (if this version is finally
adopted) we'll need #1568 in the end. The trac1603bench branch
contains both the trac1603 and trac1568 changes, and its benchmark test
shows the final improvement we'd get from both.

This redesign affects other points you raised in the previous comment.
I'm going to answer them (in a separate ticket comment) on top of this.

comment:10 in reply to: ↑ 7 ; follow-ups: Changed 8 years ago by jinmei

Replying to vorner:

First, on my system the new one is also slower on NXDOMAINs. I fear it's because the clearing or initialization is quite heavyweight. Maybe we could just call clear() on each vector that is not empty? Or, could the renderer know how many names there will be, so it could create a large enough hash table and use open addressing instead of separate chaining? We could get rid of so many vectors.

I suspect one reason for the poorer performance with fewer names
was that we needed to make copies of names. The new version
eliminates that need, so I think it's better now. At least in my
environment the new version is more than 3 times faster in the
NXDOMAIN case.

Speaking about the hash table and hashing, I have a few questions:

  • Why does the hash function take only 16 characters? What is the significance of 16? Why not hash the whole thing?
  • Why do we create such a „high“ hash table, with 64 buckets and 16 slots in each bucket? If we expect that many collisions, shouldn't we create more buckets instead?
  • Why do we use 64 buckets, when 64 is a nice round number? AFAIK there are fewer collisions with prime-number sizes.
  • Isn't there a hash table in the boost library we could just reuse?

These are all good questions. Regarding the parameters, the short
answer is that they were derived from the BIND 9 implementation (I also
noted that in the doxygen comment). That does not necessarily mean
these are the best choices, of course, and I don't really know how
these parameters were chosen in BIND 9. But if we don't have a strong
reason for a particular choice, I think it's generally a good starting point.

Regarding a prime number vs 64: again, I don't know why the BIND 9
developer chose it, but in this case I actually thought about it
myself. Here 64 should be as good/bad as a prime or any other
number close to it, because we don't really know the distribution of the
hash results. Choosing a prime number is good practice when we know
the input has a particular pattern of bias (such as if we used one
character of a domain name as a hash key directly, when most of the
input would be between 'a' and 'z' and '0' and '9' while the input space
is the whole 8 bits), but since our input is the result of a hash
function, we cannot simply know (we might, after analyzing the hash
function with knowledge of the typical bias of inputs, but that's not
obvious). So I didn't bother to tweak it.

What does the swap do here? Why is the thing copied and a new vector created?

    for (size_t i = 0; i < MessageRendererImpl::BUCKETS; ++i) {
        if (impl_->table_[i].size() > MessageRendererImpl::RESERVED_ITEMS) {
            impl_->table_[i].reserve(MessageRendererImpl::RESERVED_ITEMS);
            vector<MessageRendererImpl::OffsetItem>(impl_->table_[i].begin(),
                                                    impl_->table_[i].end()).
                swap(impl_->table_[i]);
        }
    }

It's for trimming excess capacity. I added some more comments
about it.
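For illustration, the copy-and-swap idiom in the quoted snippet reduces to this standalone form (this is the standard pre-C++11 trick, from before vector::shrink_to_fit() was available):

```cpp
#include <vector>

// Copying a vector into a temporary and swapping releases excess capacity,
// which clear() or resize() alone would not do: the temporary is sized to
// the live elements, and after the swap the oversized buffer is destroyed.
void shrinkToFit(std::vector<int>& v) {
    std::vector<int>(v.begin(), v.end()).swap(v);
}
```

In the renderer's case this keeps per-bucket memory bounded when a rendering pass has temporarily grown a bucket far beyond the reserved size.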

About the map used for lowercasing ‒ the whole thing with the private header doesn't seem very nice. For one, why is the variable declared in both name_internal.h and name.h? And could we make it a private static member of Name, so others really could not get to it, but LabelSequence (being a friend) could?

Yeah, I know, and I didn't like it either. But in the end, the new
version still keeps it. First, the garbage in name.h was just a
leftover; I removed it. Second, as a result of the revision,
MessageRenderer also uses it (otherwise it would suffer somewhat
poorer performance), so it cannot be hidden in a Name object. But if
you have a better idea for achieving the goal of sharing this between
Name, LabelSequence and MessageRenderer, I'm happy to hear it and adopt it.

About the LabelSequence changes, I'd have few suggestions:

  • Currently, a default-constructed label sequence is kind of broken. And the pointer there looks hairy too. Could it be initialized with Name::ROOT_NAME() instead? That way it would kind of correspond to a default-constructed Name.

As a result of the revision, at least this branch doesn't need the default
constructor any more. I suspect we'll have a similar need in the future,
but for now I reverted this part to the original implementation.

  • As the hash of a label sequence is such a lightweight type (size_t), would it make sense to cache it for future use? For example, have two size_t variables (an array) and two bools, one pair each for case-sensitive and case-insensitive hashing; the bool would say whether the size_t is filled in, and when the hash is computed it would be stored there. Whenever the sequence is stripped at either end, the flags would be reset to false. This would avoid computing the hash again (which is AFAIK happening ‒ it is first computed for the lookup in the hash table, then computed again when adding to the table). It could also speed up comparison in the negative case ‒ if both hashes are computed and they differ, the sequences cannot be equal; only if that isn't the case would the real comparison happen. This could speed up lookups in case of collision.

That seemed to be a good idea, so I implemented it, although not in
the LabelSequence class but in the logic of the MessageRenderer
implementation. I compared the resulting performance; it was not
super significant, but I saw some non-negligible improvement, so I
kept it.

Also, about the old renderer: would it make sense to preserve it so future benchmarks still work? Perhaps put it under the benchmark directory, so we don't poison the namespace of libdns++?

Okay. I've moved it to the benchmark directory with a note.

comment:11 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:12 in reply to: ↑ 10 Changed 8 years ago by jinmei

Replying to jinmei:

  • Isn't there a hash table in the boost library we could just reuse?

I just realized I didn't answer this point. As far as I can see,
boost's hash table is similar to std::set in that we cannot reserve
capacity for some number of entries. We want to retain a reasonable
amount of resources so the renderer can be reused efficiently, so I
think we need to do some lower-level housekeeping ourselves.

comment:13 in reply to: ↑ 10 ; follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

Hello

Replying to jinmei:

I suspect one reason for the poorer performance with fewer names
was that we needed to make copies of names. The new version
eliminates that need, so I think it's better now. At least in my
environment the new version is more than 3 times faster in the
NXDOMAIN case.

I found the reason. I use -O0 for faster compilation O:-). Anyway, with -O2, the new one is several times faster now, which is good.

These are all good questions. Regarding the parameters, the short
answer is that they were derived from the BIND 9 implementation (I also
noted that in the doxygen comment). That does not necessarily mean
these are the best choices, of course, and I don't really know how
these parameters were chosen in BIND 9. But if we don't have a strong
reason for a particular choice, I think it's generally a good starting point.

Well, I fear they might have been chosen randomly at some point in history, and now everybody copies them „because that's how it was done all the time“. But it's probably OK to keep it this way; it might not matter much.

Regarding a prime number vs 64: again, I don't know why the BIND 9
developer chose it, but in this case I actually thought about it
myself. Here 64 should be as good/bad as a prime or any other
number close to it, because we don't really know the distribution of the
hash results. Choosing a prime number is good practice when we know
the input has a particular pattern of bias (such as if we used one
character of a domain name as a hash key directly, when most of the
input would be between 'a' and 'z' and '0' and '9' while the input space
is the whole 8 bits), but since our input is the result of a hash
function, we cannot simply know (we might, after analyzing the hash
function with knowledge of the typical bias of inputs, but that's not
obvious). So I didn't bother to tweak it.

Well, the hash function might have a pattern. A prime table size should help no matter what the pattern length is (the chance of the length dividing a prime is lower), but it might not be that important anyway.

About the map used for lowercasing ‒ the whole thing with the private header doesn't seem very nice. For one, why is the variable declared in both name_internal.h and name.h? And could we make it a private static member of Name, so others really could not get to it, but LabelSequence (being a friend) could?

Yeah, I know, and I didn't like it either. But in the end, the new
version still keeps it. First, the garbage in name.h was just a
leftover; I removed it. Second, as a result of the revision,
MessageRenderer also uses it (otherwise it would suffer somewhat
poorer performance), so it cannot be hidden in a Name object. But if
you have a better idea for achieving the goal of sharing this between
Name, LabelSequence and MessageRenderer, I'm happy to hear it and adopt it.

Maybe, as it is getting more generally used, it would make sense to make it a generally available utility ‒ make it public instead.

  • As the hash of a label sequence is such a lightweight type (size_t), would it make sense to cache it for future use? For example, have two size_t variables (an array) and two bools, one pair each for case-sensitive and case-insensitive hashing; the bool would say whether the size_t is filled in, and when the hash is computed it would be stored there. Whenever the sequence is stripped at either end, the flags would be reset to false. This would avoid computing the hash again (which is AFAIK happening ‒ it is first computed for the lookup in the hash table, then computed again when adding to the table). It could also speed up comparison in the negative case ‒ if both hashes are computed and they differ, the sequences cannot be equal; only if that isn't the case would the real comparison happen. This could speed up lookups in case of collision.

That seemed to be a good idea, so I implemented it, although not in
the LabelSequence class but in the logic of the MessageRenderer
implementation. I compared the resulting performance; it was not
super significant, but I saw some non-negligible improvement, so I
kept it.

Hmm, still, I wanted to avoid some comparisons in case of a bucket collision. Maybe store the hash in the OffsetItem as well and look at it first before doing the real comparison?
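A small sketch of what this suggestion might look like (hypothetical OffsetEntry/findInBucket names, not the actual branch code):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-entry record: each stored entry remembers its full hash,
// so a bucket search can reject non-matching entries with a cheap integer
// comparison before the (more expensive) full name comparison.
struct OffsetEntry {
    size_t hash;
    std::string name;   // stands in for the rendered name data
    size_t offset;
};

// Returns the stored offset, or -1 if the name is not in the bucket.
int findInBucket(const std::vector<OffsetEntry>& bucket,
                 size_t hash, const std::string& name) {
    for (size_t i = 0; i < bucket.size(); ++i) {
        if (bucket[i].hash != hash) {
            continue;                    // cheap rejection on hash mismatch
        }
        if (bucket[i].name == name) {    // full comparison only on hash match
            return static_cast<int>(bucket[i].offset);
        }
    }
    return -1;
}
```

The trade-off discussed later in the ticket applies: the extra stored hash and comparison only pay off when buckets actually contain colliding entries.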

Also, a typo here in the comment (thrower should be slower):

    /// \brief A common helper to throw an exception on invalid operation.
    ///
    /// Experiments showed that throwing from each method makes the buffer
    /// operation thrower, so we consolidate it here, and let the methods
    /// call this.

comment:14 in reply to: ↑ 13 Changed 8 years ago by jinmei

Replying to vorner:

Hello

I suspect one reason for the poorer performance with fewer names
was because we needed to make copies of names. [...]

I found the reason. I use -O0 for faster compilation O:-). Anyway, with -O2, the new one is several times faster now, which is good.

Okay, good to know that.

These are all good questions, and regarding the parameters, the short
answer is they were derived from the BIND 9 implementation (I also
noted that in the doxygen comment). [...]

Well, I fear they might be chosen randomly at some point in history and now everybody copies them „because that's how it was done all the time“. But it is probably OK to keep it this way, it might not matter much.

I have to admit you're probably right that these were arbitrary
choices. But unless we can suggest something better with a reason, I'm
afraid all we can do is note that they were derived from the prior
implementation.

BTW, I realized I didn't answer one specific question: "why up to 16".
I still don't know (why BIND 9 does this), but I guess it's to avoid
spending too much time calculating an ideal hash for unusually
long names. I think the motivation is reasonable for the intended
purpose here, although it may be a marginal optimization in practice.

Regarding a prime number vs 64: again, I don't know why the BIND 9
developer chose it, but in this case I actually thought about it
myself. Here 64 should be as good/bad as a prime or any other
number close to it [...]

Well, the hash function might have a pattern. A prime table size should help no matter what the pattern length is (the chance of the length dividing a prime is lower),

I don't think so; when we use a bucket size of n and represent the
resulting values from the hash function in the form nX + Y (0 <= Y
< n), the probability of collisions depends on how many values are
produced for each Y by the combination of the original input and the
algorithm of the hash function. It all depends on the details of the
function and the bias in the original data, and just choosing a prime
number for n doesn't help in itself (consider, e.g., a stupid "hash"
function f(x) = 31 * x, ignoring overflow for brevity: if we choose 31,
which is a prime number, for the hash bucket size, we get 100%
collisions).
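The argument can be checked numerically; this sketch uses the contrived f(x) = 31 * x "hash" from the comment above:

```cpp
#include <cstddef>
#include <set>

// Contrived "hash" from the discussion: f(x) = 31 * x.  With a prime bucket
// count of 31, every input lands in bucket 0 (100% collisions), while a
// power-of-two count of 64 spreads the same inputs across all buckets
// (31 and 64 are coprime, so 31*x mod 64 cycles through every residue).
size_t bucketOf(unsigned x, size_t buckets) {
    return (31u * x) % buckets;
}
```

So whether a prime bucket count helps depends entirely on how the hash output interacts with the modulus, not on primality alone.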

That said, it doesn't mean 64 is better than a prime number. As I
said, both are just equally bad. So,

but it might not be that important anyway.

I don't mind changing it to a prime number. I just pointed out it's
not an improvement.

About the map used for lower case ‒ [...]

Yeah I know, and I didn't like it either. But in the end, the new
version still keeps it: [...]

Maybe as it is getting to be more generally used, it would make sense to make it generally available utility ‒ make it public instead.

For now, I'd still keep it semi-private. After all, normal
applications can (and should) just use std::tolower() for this
purpose. If we see more cases where this level of micro-optimization
is required, we can consider making it more public.
I noted this point in the header file.

  • As the hash of label sequence is such a lightweight type (size_t), would it make sense to cache it for future use? [...]

That seemed to be a good idea, so I implemented it, although not in
the LabelSequence class but in the logic of the MessageRenderer
implementation. I compared the resulting performance; it was not
super significant, but I saw some non-negligible improvement, so I
kept it.

Hmm, still, I wanted to avoid some comparisons in case of a bucket collision. Maybe store the hash in the OffsetItem as well and look at it first before doing the real comparison?

Ah, okay, I didn't fully understand what you meant. I've done it in the
latest version. As you probably realized, however, this would be
additional overhead rather than an optimization when we have no or very
few collisions (and we'd probably expect that in the usual cases; when we
need to render so many names, as in AXFR-out, that many collisions can
occur, other overhead due to the larger number of RRs, the message
size, and the use of TCP would make this optimization moot). The
benchmark showed that storing and comparing the full hash indeed makes it a
bit slower, but at the microbenchmark level it was not so significant
either. So for now I've kept the code in the branch. I'm okay either
with or without it.

Also, a typo here in the comment (thrower should be slower):

    /// \brief A common helper to throw an exception on invalid operation.
    ///
    /// Experiments showed that throwing from each method makes the buffer
    /// operation thrower, so we consolidate it here, and let the methods
    /// call this.

Ah, thanks for catching it. Fixed.

The branch was revised with the suggested hash update and some other
minor cleanups.

comment:15 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:16 Changed 8 years ago by jinmei

Note: I've made a few more minor updates since the previous comment.
As of this writing the latest commit hash is
4dbcf3b666bf27ae2c3018007e163235e5c326e2

(and I don't plan to make any more changes)

comment:17 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

I don't get what the advantage of the commit 0b55e11e55f2422e2aa1f920f48debac92d826af is. Otherwise, I think it is OK to merge.

comment:18 in reply to: ↑ 17 Changed 8 years ago by jinmei

Replying to vorner:

I don't get what the advantage of the commit 0b55e11e55f2422e2aa1f920f48debac92d826af is. Otherwise, I think it is OK to merge.

It was an attempt to keep the amount of data on the stack to a minimum,
hopefully getting better cache utilization (because normally only a few
entries of this array will be used). The use of boost::array is for
better safety. At least the former may be premature, so I'm okay
with canceling it if you want; I think the latter is worthwhile.

comment:19 Changed 8 years ago by jinmei

  • Owner changed from jinmei to vorner

comment:20 follow-up: Changed 8 years ago by vorner

  • Owner changed from vorner to jinmei

I guess this would have the exact opposite effect, as the stack area is likely to be in the cache already, while the memory where the data live likely won't be. But that will be marginal, and the safety might be a good thing.

So, please merge.

Thank you

comment:21 in reply to: ↑ 20 Changed 8 years ago by jinmei

Replying to vorner:

I guess this would have the exact opposite effect, as the stack area is likely to be in the cache already, while the memory where the data live likely won't be. But that will be marginal, and the safety might be a good thing.

So, please merge.

Okay, I've added a few more comments noting that we might change the
placement of the array. When all major performance tasks are completed
we can profile for bottlenecks again, and if we need to do something
here for overall performance we can revisit it.

Branch merged, closing ticket.

comment:22 Changed 8 years ago by jinmei

  • Resolution set to complete
  • Status changed from reviewing to closed
  • Total Hours changed from 0 to 21.66
Note: See TracTickets for help on using tickets.