Opened 9 years ago

Closed 9 years ago

#408 closed task (complete)

General control logic of NSASthe

Reported by: vorner Owned by: vorner
Priority: medium Milestone:
Component: recurser Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity:
Sub-Project: Feature Depending on Ticket:
Estimated Difficulty: 0.0 Add Hours to Ticket: 0
Total Hours: 0 Internal?: no

Description

This is the framework of the processing described in NameserverAddressStoreDesign.

Subtickets

Change History (12)

comment:1 Changed 9 years ago by vorner

  • Owner set to vorner
  • Status changed from new to accepted

comment:2 Changed 9 years ago by vorner

  • Owner changed from vorner to UnAssigned
  • Status changed from accepted to reviewing

I'm not saying this is done. It isn't, but I'd like a review anyway, as it got to a place where it resembles what I think should be final behaviour sometime, because:

  • I guessed some future interfaces with rest of the world. I'm not sure they are correct.
  • It is getting big.
  • I'm not sure I know what I'm doing at all places there.
  • If it really is what it should be, it might be used in the current state by the rest of developing code, even when it is not really done inside.

It has some major problems currently:

  • It completely ignores TTL.
  • There's no way to tell it only A or AAAA address is requested. It just pretends that both kinds are OK.
  • It does not use additional section, therefore ignores glue.

There are lots of TODOs and FIXMEs over the code as well.

Could someone have a look at it and tell me if it is completely wrong or just needs more work and completing?

It is in the branches/trac408 branch, branchpoint is r3468, current head r3540.

Thank you

comment:3 Changed 9 years ago by stephen

  • Owner changed from UnAssigned to stephen

comment:4 Changed 9 years ago by stephen

  • Owner changed from stephen to vorner

I've reviewed the source code but have only skimmed the tests - these need to be reviewed properly when this task is finished.

src/lib/nsas/hash_table.h
get() now locks the table entry with a scoped_lock. The original code used a sharable_lock to allow concurrent gets.

Comment: There needs to be a restriction on the generator function that it must return in a very short time, since the relevant slot in the hash table is locked while it is running.

src/lib/nsas/zone_entry.h
src/lib/nsas/zone_entry.cc
Would prefer a more descriptive name such as LocalLock instead of LLock. The same goes for Lock::Impl - something like "Lock::LockList" or "Lock::LockCollection".

It would be more efficient to sort the nameservers once when they are added to the zone entry than to sort them every time a lock is requested.

When locking the nameservers, the internal mutex of the nameserver entry is directly accessed (via ns->mutex_). This breaks encapsulation. Would it not be better to add a "getLock()" to the NameserverEntry object and to lock them through that interface?

Is there actually a need to lock all nameservers within a zone at once?

src/lib/nsas/tests/fetchable_unittest.cc
The accessMethods test ends with the creation of a new Fetchable object, but no associated test for it.

src/lib/nsas/tests/nameserver_address_store_unittest.cc
There are some non-ASCII characters in the file (the quotes around the word 'unreachable' in the zoneWithoutNameservers test).

src/lib/nsas/resolver_interface.h
Shouldn't the arguments to resolve() be passed by reference so as to save a copy?

src/lib/nsas/nameserver_address_store.cc
src/lib/nsas/nameserver_address_store.h
Constructor. The default value of 3001 is the first prime number above 3000 (header comment is wrong, sorry!)

lookup(). Further to a comment made in another review, should we be passing RRClass objects instead of uint16_t?

processZone(). This accesses both the zones and the nameservers in them, which breaks the layering of NameserverAddressStore -> ZoneEntry -> NameserverEntry. I would suggest that this method should be part of the zone entry object.

The same goes for newNS and newZone; pass the arguments to the ZoneEntry object and have it create the internal NameserverEntry objects.

Comment: with respect to TTLs, we need to be aware of the possibility that records may have zero TTL. This means that during an access to new records, the records may become expired.

Comment: chooseAddress(). This will have to be tied into the RTT associated with an address.

src/lib/nsas/nameserver_entry.h
src/lib/nsas/nameserver_entry.cc
The idea regarding the callbacks was that when the zone_entry receives a query, it adds the callback related to the query to its list, then queries the nameservers. If the nameserver entry object does not have the information, a callback is added to its callback list and a query initiated to the resolver. When the resolver returns with an answer, the nameserver entry executes all its callbacks. These callbacks in turn execute all the callbacks associated with the zone, which cause the result to be returned to the caller. If a second query for the zone comes in during this processing, the callback is added to the zone entry's callback queue; the fact that there is already a callback present in the queue means that there is no need to do any tests or checks - all these were done when the first callback was added.

In askIP() then, is there a need check that a callback for a given zone exists? askIP() should never be called if there is already an address fetch outstanding on a nameserver (because there will be something in the callback queue). Besides, if an additional callback for a particular zone is mistakenly added, when it is executed there are no callbacks in the zone entry's callback queue, so the effect is a no-op.

In askIP, I suggest that we should always issue queries for both A and AAAA records.

comment:5 Changed 9 years ago by vorner

  • Owner changed from vorner to stephen

Hello

I somehow refactored (or rewrote) the original code. It should be more correct, since I had better idea what I'm trying to accomplish.

There are some TODO notes and the code would use some more tests, so it is not ready to be included in trunk yet. However, the changes tend to get large and I'm afraid me and Ocean are diverging with #408 and #356 branches. So, what I would like to do is put this trough review and merge into #356. Then I would either branch again (I want to merge, not only sync, so Ocean has up to date code as well, I saw some commits in code that mostly disappeared in my branch) or, if he agrees, work on the same branch as him and add the tests and update documentation on wiki.

I also propose to split these tasks off:

  • Optimise to run only 2 parallel queries at once, not query all zones at once (this should be easy one).
  • Add a flag to the ResolverInterface? to ask it for cache data only (something like NO_REMOTE/CACHE_ONLY) and use it to request whatever there's in the cache before going to the 2 at a time.
  • Remove the LRU list from nameserver entry hash table, use weak pointers there and drop them when no zone references them.

Furthermore, I'd like if anyone had an idea how to dispatch callbacks more safely. Because the callback might call functions of the class that dispatches them, its mutex must not be locked. But if I take all the callbacks out of the vector (so I do not touch unlocked vector), unlock and dispatch them one by one, if there's an exception, I lose the rest of them. I had two ideas how to solve it, but neither of them look really good:

  • If I catch an exception (or, well, from some guard object desctructor) and there are still some callbacks to go, lock again, put them back and then let the exception fly. This does not lose them, but there's a chance they will never be called again and that the locking and putting back might raise another exception, which is a problem (eg. uncatchable crash).
  • Modify the callback base class. When it is destroyed, it would call the failure() method. However, that is problematic, since it is called from destructor and the subclass destructor was already called at this point. It could be wrapped somehow (one class holding the callback and calling its failure() and then releasing it). But that sounds little bit complicated and there's the problem of a lot of unknown code running from a destructor, bringing the two exceptions at once problem.

The branches/trac408 is currently at r3734. I'd like to have a review before I can merge it back to #356. And I think it might be easier to review the whole branch or the code instead of the changes, because a lot of the code was changed or deleted. However, I might add some tests in the meantime.

comment:6 follow-ups: Changed 9 years ago by stephen

  • Owner changed from stephen to vorner

I have reviewed the differences between my last review (r3540) and the revision I checked out (r3758).

src/lib/nsas/hash_table.h
HashTable::getOrAdd()
Comment: the suggestion in the "to do" comment about doing part of the lookup using a shareable lock seems a good idea. But I would use an upgradeable lock, allowing the conversion between a read lock to an exclusive lock without releasing it. That would make the code simpler: acquire read lock and look up entry, if not found upgrade to exclusive lock and create entry.

src/lib/nsas/zone_entry.h
src/lib/nsas/zone_entry.cc

newNS()
As this function is purely local to this module, suggest that it is declared "static"

class ZoneEntry::ResolverCallback
Needs a header describing the purpose of the class and full header documentation of its methods.

ZoneEntry::ResolverCallback::success
Needs better documentation. It's manipulating pointers to nameservers, but I'm not clear what is the difference between "old" and "new" nameservers, what manipulation is taking place, or why.

Is the five minutes mentioned in this method a reference to caching server failures? If so, the five minutes is defined in RFC 2308 (section 7).

ZoneEntry::randIndex()
When this branch is merged with #356, check the random number generation code there. It also might be worth splitting random number generating functions off into a separate library module.

class ZoneEntry::ProcessGuard
Needs full header documentation.

class ZoneEntry::NameserverCallback
Needs full header documentation.

ZoneEntry::dispatchFailures()
Can't the arguments be passed by reference to save a copy operation?

ZoneEntry::process()
The header says that this function "processes the nameservers". What processing is done?

Can't the arguments be passed by reference to save a copy operation?

Could we have a bit more documentation on the locks. Under what circumstances should the caller pass a lock and under what circumstances should the zone entry's own lock be used? Also, the header says that if an exception is generated the lock is not released; what if the lock is the internal one?

Re whether it is unsafe to lock in the middle: should process() release a lock if it is passed in externally? Would it not be better to require the caller to unlock if process completes successfully?

Is there really a need for the process guard? If there are multiple calls on the stack, as each trigger a callback only those callbacks queued will be executed. Is this a bad thing? (If this is being called in response to the completion of some external fetch operation - the documentation is not clear here - do we want to wait until all outstanding fetches have completed? Some may take a long time and some may not result in an answer being returned. In this case, it is better to call all callbacks with the information that arrives first.)

src/lib/nsas/fetchable.h
Setting a default value of NOT_ASKED for the constructor argument would have saved the need to define a default constructor.

src/lib/nsas/TODO
Comment on cache notifying NSAS of changes: the only changes the NSAS should know about is (a) when a set of nameservers changes (this is most likely if the delegation RRset in the parent is different from that in the zone itself) and (b) if zone information is explicitly deleted from the cache (in which case the zone entry should be deleted).

Comment on removal of LRU from nameserver entries: using a weak pointer in the LRU list is a good idea. The nameserver entry already contains a pointer into the LRU list; so a simple solution is to add something to the destructor that removes the entry from the list. However, it's not clear how to prevent a concurrent accessor looking up that nameserver entry in the hash table while the destruction is in progress. The LRU list approach works; setting the size to about 3-4x that of the zone entry LRU list should avoid nameservers being deleted while their zones are active; few zones have more than four nameservers and many have less.

Comment on dispatching callbacks: either remove and dispatch one at a time, or catch all exceptions and not if any were raised; after all callbacks have been dispatched if an exception was caught, raise a new one. But as you say, what happens if an exception is raised in a callback - will the program continue?

src/lib/nsas/nameserver_entry.h
src/lib/nsas/nameserver_entry.cc
Is there any reason why the header commentsfor askIP() are delimited by /*...*/? All other comments in the .h file use the Doxygen "/" form.

NameserverEntry::getAddresses()
(Header comments) The TTL of the nameserver entry expiring while an address fetch is in progress should be a rare occurrence; but whatever happens to the nameserver entry, it should remain in place at least until the fetch returns. Besides, the expiration of the nameserver entry is only of interest to the encompassing zone entry, not to the nameserver entry itself.

(Code) When checking for ... expiration_ <= now && expiration_ , it is marginally more efficient to reverse the order of the comparisons; if expiration_ is zero only one test will be peformed instead of both.

(Code) In the switch (getState()) check, under what circumstances would the state be IN_PROGRESS but we do not expect the address? (Am I right in guessing that's if there is no I/O in progress for our address family but there is for another?)

class NameserverEntry::ResolverCallback
Methods in this class need full documentation. For example, what is the success() processing doing? (And why is the RTT being incremented if the current RTT is -1 (in the call to entries.push_back().)

addressSelection()
This could be declared static.

src/lib/nsas/nameserver_address_store.cc
src/lib/nsas/nameserver_address_store.h
Comment: might be a good idea to encapsulate both the hash table and LRU list for each of the classes (ZoneEntry and NameserverEntry) in one class. (If nothing else, it should simplify the constructor definition.) This could also hide the sizing of the LRU list with respect to the hash table size.

src/lib/nsas/tests/zone_entry_unittest.cc
class ZoneEntryTest
Needs better documentation, e.g. what does checkInjected() actually check?

Test ChangedNS
The test name suggests that this checks for a changed nameserver, but the comments indicate something different.

The header comment suggests that this does three tests. Perhaps it should be split?

General
It's sometimes confusing to know exactly what is being tested. Some better comments would help. (E.g. "Check it answers callbacks when we give it addresses" - I thought zone entries didn't access addresses directly.)

src/lib/nsas/tests/nsas_test.h
The comments in class TestResolver are delimited by /*...*/ but the rest of the file uses the Doxygen "/".

src/lib/nsas/tests/nameserver_entry_unittest.cc
class NameserverEntryTest
The fillSet() method requires some documentation.

Documentation
It would be worth re-visiting the NSAS Design and check that the description of the data structures and processing match what is now in place.

comment:7 in reply to: ↑ 6 Changed 9 years ago by vorner

(This is a partial response, how I get trough the comments. The ones not mentioned are still to be done.)

Replying to stephen:

src/lib/nsas/hash_table.h
HashTable::getOrAdd()
Comment: the suggestion in the "to do" comment about doing part of the lookup using a shareable lock seems a good idea. But I would use an upgradeable lock, allowing the conversion between a read lock to an exclusive lock without releasing it. That would make the code simpler: acquire read lock and look up entry, if not found upgrade to exclusive lock and create entry.

Reading documentation, it seems it would not work. It suggests only one can have upgradable lock at the time, which is the same as exclusive in our case.

newNS()
As this function is purely local to this module, suggest that it is declared "static"

It is in an anonymous namespace, which has equivalent effect to being static. It is unreachable from other modules and compiler knows it is not called from outside, allowing it to optimise more.

class ZoneEntry::ResolverCallback
Needs a header describing the purpose of the class and full header documentation of its methods.

Done.

ZoneEntry::ResolverCallback::success
Needs better documentation. It's manipulating pointers to nameservers, but I'm not clear what is the difference between "old" and "new" nameservers, what manipulation is taking place, or why.

Done

ZoneEntry::randIndex()
When this branch is merged with #356, check the random number generation code there. It also might be worth splitting random number generating functions off into a separate library module.

This one will not be needed at all after it is merged properly, it does the choosing of address Ocean already wrote.

class ZoneEntry::ProcessGuard
Needs full header documentation.

Done

class ZoneEntry::NameserverCallback
Needs full header documentation.

Done

ZoneEntry::dispatchFailures()
Can't the arguments be passed by reference to save a copy operation?

With family, there's no gain (it's an enum, which might be even smaller than the reference/pointer). There might be with the shared pointer,
but the function is private, so the compiler can (and as it is small, probably will) inline it and eliminate all the stuff there. So I think this does not really matter at all.

comment:8 in reply to: ↑ 6 Changed 9 years ago by vorner

Replying to stephen:

ZoneEntry::process()
The header says that this function "processes the nameservers". What processing is done?

I hope this explanation is better.

Could we have a bit more documentation on the locks. Under what circumstances should the caller pass a lock and under what circumstances should the zone entry's own lock be used? Also, the header says that if an exception is generated the lock is not released; what if the lock is the internal one?

The working of locks was probably too complicated. I found out that the problem/complications may be solved by using a recursive lock, so I changed that one and removed the parameter. So we will have less documentation on the locks O:-).

Is there really a need for the process guard? If there are multiple calls on the stack, as each trigger a callback only those callbacks queued will be executed. Is this a bad thing? (If this is being called in response to the completion of some external fetch operation - the documentation is not clear here - do we want to wait until all outstanding fetches have completed? Some may take a long time and some may not result in an answer being returned. In this case, it is better to call all callbacks with the information that arrives first.)

There is. It waits only for the nameservers already in cache, not for the ones from external fetches. I changed the comment explaining why it is done.

src/lib/nsas/fetchable.h
Setting a default value of NOT_ASKED for the constructor argument would have saved the need to define a default constructor.

Thanks, good catch.

src/lib/nsas/TODO
Comment on cache notifying NSAS of changes: the only changes the NSAS should know about is (a) when a set of nameservers changes (this is most likely if the delegation RRset in the parent is different from that in the zone itself) and (b) if zone information is explicitly deleted from the cache (in which case the zone entry should be deleted).

OK

Comment on removal of LRU from nameserver entries: using a weak pointer in the LRU list is a good idea. The nameserver entry already contains a pointer into the LRU list; so a simple solution is to add something to the destructor that removes the entry from the list. However, it's not clear how to prevent a concurrent accessor looking up that nameserver entry in the hash table while the destruction is in progress. The LRU list approach works; setting the size to about 3-4x that of the zone entry LRU list should avoid nameservers being deleted while their zones are active; few zones have more than four nameservers and many have less.

Actually, I do not want a LRU list with nameservers at all. I want only the hash table with weak pointers. The thing keeping them alive would be the ZoneEntry? holding the nameserver.

The with weak pointers there would delete them safely (shared_ptr promisses that). The gotcha there is the weak pointer is left inside the hash table and throws an exception upon access. So we would need to catch that exception when we access that hash slot and delete the dead weak pointer (which points nowhere at that time).

And the LRU list approach probably does not work, because an active zone entry never touches the hash table again if the nameserver list stays the same. It keeps the original pointers, avoiding lookups. Therefore we will drop a nameserver of the oldest zone entry, but that one might still be active.

Comment on dispatching callbacks: either remove and dispatch one at a time, or catch all exceptions and not if any were raised; after all callbacks have been dispatched if an exception was caught, raise a new one. But as you say, what happens if an exception is raised in a callback - will the program continue?

Catching all exceptions does not work in C++. catch (exception &) catches only the ones inheriting from exception, while anything in C++ can be thrown. And catch (...) does not give the thing we caught and so we have nothing to throw again. And there's still the problem of two concurrent exceptions. Probably taking out one at a time with the recursive locks will work better.

src/lib/nsas/nameserver_entry.h
src/lib/nsas/nameserver_entry.cc
Is there any reason why the header commentsfor askIP() are delimited by /*...*/? All other comments in the .h file use the Doxygen "/" form.

Well, yes, there is.

For one, it is / ... */, which is a Doxygen comment as well as /. It is just multiline one. I tend to use multiline comment form with multiline comments, for one, it is clear that it is a single comment and not multiple short comments one after another. Second, vim does not seem to support / comments very well. The block alignment (gq) puts / in the middle and it puts only two slashes after pressing enter. I know this looks little bit inconsistent with the other comments, but using a single-line comments for one multiline seems strange and modifying all the old, already written ones, seems wrong too.

comment:9 in reply to: ↑ 6 Changed 9 years ago by vorner

  • Owner changed from vorner to stephen

Replying to stephen:

Is the five minutes mentioned in this method a reference to caching server failures? If so, the five minutes is defined in RFC 2308 (section 7).

I didn't find it there. I think this is a slightly different problem ‒ the server itself might not be dead. I'm just unable to obtain its address (maybe because the nameserver that knows it is down, or because no nameserver I know address of knows them or because my network cable is suddenly eaten by a mouse...).

NameserverEntry::getAddresses()
(Header comments) The TTL of the nameserver entry expiring while an address fetch is in progress should be a rare occurrence; but whatever happens to the nameserver entry, it should remain in place at least until the fetch returns. Besides, the expiration of the nameserver entry is only of interest to the encompassing zone entry, not to the nameserver entry itself.

I'm not sure I really understand this comment. What is the problem with expiring while there's a fetch in progress?

(Code) When checking for ... expiration_ <= now && expiration_ , it is marginally more efficient to reverse the order of the comparisons; if expiration_ is zero only one test will be peformed instead of both.

Maybe, I'm not sure, it depends on what happens more often. But this one is both below any interesting speed improvement and a thing that today's compilers are able to understand better than humans anyway.

Anyway, I switched it, as it seems more readable the other way.

(Code) In the switch (getState()) check, under what circumstances would the state be IN_PROGRESS but we do not expect the address? (Am I right in guessing that's if there is no I/O in progress for our address family but there is for another?)

I added a comment into the code. In short, yes, the entry might already received (or failed to) the IP address of this family while still waiting for the other one. In that case, we just provide them.

src/lib/nsas/nameserver_address_store.cc
src/lib/nsas/nameserver_address_store.h
Comment: might be a good idea to encapsulate both the hash table and LRU list for each of the classes (ZoneEntry and NameserverEntry) in one class. (If nothing else, it should simplify the constructor definition.) This could also hide the sizing of the LRU list with respect to the hash table size.

I put that into a TODO list, this should go together with making multiple LRU lists for one hash table and with dropping LRU list of the NameserverEntry?.

Test ChangedNS
The header comment suggests that this does three tests. Perhaps it should be split?

They test how the entry reacts to series of events, so, no, they need to go one after another (and there's a part that needs to happen before this test, so it is tested together with it as well).

It's sometimes confusing to know exactly what is being tested. Some better comments would help. (E.g. "Check it answers callbacks when we give it addresses" - I thought zone entries didn't access addresses directly.)

They pull them out of the NameserverEntries?. The resolver provides them to the NameserverEntry?, but from the external/API point of view, it encapsulates handling them.

Documentation
It would be worth re-visiting the NSAS Design and check that the description of the data structures and processing match what is now in place.

Yes, I want to do it some time around merging. But IMO keeping documentation on wiki seems bad, because while the code has multiple versions (with possibly multiple different ways how it does something), the wiki shows only the latest one.

Anything not mentioned above should already be done now.

And I synced with #356 and did some modifications to the code merged from there. The biggest one is moving the address selection logic from NameserverEntry? to ZoneEntry? since we want to choose between nameservers as well as addresses of single one (RRT of two addresses of the same nameserver should be close to each other, while different nameservers might differ a lot).

comment:10 follow-up: Changed 9 years ago by stephen

  • Owner changed from stephen to vorner
  • Summary changed from General control logic of NSAS to General control logic of NSASthe

Comment: the suggestion in the "to do" comment about doing part of the lookup using a shareable lock
seems a good idea. But I would use an upgradeable lock, allowing the conversion between a read lock
to an exclusive lock without releasing it. That would make the code simpler: acquire read lock and
look up entry, if not found upgrade to exclusive lock and create entry.

Reading documentation, it seems it would not work. It suggests only one can have upgradable lock at the time, which is the same as exclusive in our case.

Ah yes, you are right. Although there seem to be three states:

  • shared ownership - allow other threads to shared ownership or upgrtade ownership
  • upgrade ownership - allow other threads to have shared ownsership
  • exclusive ownership - do not allow any other form of ownership

So we could acquire shared ownership then, if we want to modify the table, upgrade to exclusive ownership in two steps.

newNS()
As this function is purely local to this module, suggest that it is declared "static"

It is in an anonymous namespace, which has equivalent effect to being static. It is unreachable from other
modules and compiler knows it is not called from outside, allowing it to optimise more.

Yes, you are right, ignore my comment.

ZoneEntry::dispatchFailures()
Can't the arguments be passed by reference to save a copy operation?

With family, there's no gain (it's an enum, which might be even smaller than the reference/pointer). There
might be with the shared pointer, but the function is private, so the compiler can (and as it is small,
probably will) inline it and eliminate all the stuff there. So I think this does not really matter at all.

Probably not. But if in the future the code is extended and moved to a .cc file, it will matter. The suggestion is just a bit of future-proofing.

ZoneEntry::process()

The header says that this function "processes the nameservers". What processing is done?

I hope this explanation is better.

It is - thanks.

Could we have a bit more documentation on the locks. Under what circumstances should the
caller pass a lock and under what circumstances should the zone entry's own lock be used?
Also, the header says that if an exception is generated the lock is not released; what if
the lock is the internal one?

The working of locks was probably too complicated. I found out that the problem/complications
may be solved by using a recursive lock, so I changed that one and removed the parameter.
So we will have less documentation on the locks O:-).

OK

Is there really a need for the process guard? If there are multiple calls on the stack...

:

There is. It waits only for the nameservers already in cache, not for the ones from external fetches. I
changed the comment explaining why it is done.

OK

Comment on removal of LRU from nameserver entries...
:

Actually, I do not want a LRU list with nameservers at all. I want only the hash table with weak pointers.
The thing keeping them alive would be the ZoneEntry holding the nameserver.

The with weak pointers there would delete them safely (shared_ptr promisses that). The gotcha there is
the weak pointer is left inside the hash table and throws an exception upon access. So we would need to
catch that exception when we access that hash slot and delete the dead weak pointer (which points nowhere
at that time).

Will an access always throw an exception? The pointer points to a block of memory which could easily be allocated to another nameserver object.

And the LRU list approach probably does not work, because an active zone entry never touches the
hash table again if the nameserver list stays the same. It keeps the original pointers, avoiding lookups.
Therefore we will drop a nameserver of the oldest zone entry, but that one might still be active.

Hmm... on thinking about it, you are right. Even though a zone entry is deleted, the associated nameserver entries only get deleted when they reach the end of the LRU list. So if we have one zone active for a long time and flood the table with short-lived zones (with different nameservers), even though the short-lived zones might be explicitly deleted, the presence of their nameservers would force those associated with the long-lived zone out of the table.

src/lib/nsas/nameserver_entry.h
src/lib/nsas/nameserver_entry.cc
Is there any reason why the header commentsfor askIP() are delimited by /*...*/? All other comments in
the .h file use the Doxygen "/" form.

Well, yes, there is.

For one, it is / ... */ ...

The point was not what style of comments are used, it's just that there is a mixture of styles in the file which looks messy. In general, I think changes to a file should be in the same style as the existing code.

Is the five minutes mentioned in this method a reference to caching server failures? If so, the five minutes
is defined in RFC 2308 (section 7).

I didn't find it there. I think this is a slightly different problem ‒ the server itself might not be dead.
I'm just unable to obtain its address (maybe because the nameserver that knows it is down, or because no
nameserver I know address of knows them or because my network cable is suddenly eaten by a mouse...).

Is that really different? You can't reach the server, so as far as you are concerned, it is down.

NameserverEntry::getAddresses()
:

I'm not sure I really understand this comment. What is the problem with expiring while there's a fetch in
progress?

Pointers might become invalid because something referred to by the fecth expires (and is deleted) before the fetch returns.

(Code) In the switch (getState()) check, under what circumstances would the state be IN_PROGRESS but we do
:

I added a comment into the code. In short, yes, the entry might already received (or failed to) the IP
address of this family while still waiting for the other one. In that case, we just provide them.

OK

Test ChangedNS
The header comment suggests that this does three tests. Perhaps it should be split?

They test how the entry reacts to series of events, so, no, they need to go one after another (and there's a
part that needs to happen before this test, so it is tested together with it as well).

OK

Documentation
It would be worth re-visiting the NSAS Design and check that the description of the data structures and
processing match what is now in place.

Yes, I want to do it some time around merging. But IMO keeping documentation on wiki seems bad,
because while the code has multiple versions (with possibly multiple different ways how it does
something), the wiki shows only the latest one.

Something for the F2F? Perhaps we should keep this overview documentation with the code, perhaps in the form of an OpenOffice and/or PDF file (to allow diagrams to be more easily included)?

However, all points have been addressed, please go ahead and merge with #356.

comment:11 in reply to: ↑ 10 Changed 9 years ago by vorner

Replying to stephen:

  • shared ownership - allow other threads to shared ownership or upgrtade ownership
  • upgrade ownership - allow other threads to have shared ownsership
  • exclusive ownership - do not allow any other form of ownership

So we could acquire shared ownership then, if we want to modify the table, upgrade to exclusive ownership in two steps.

It makes no sense to provide any way for shared lock to upgrade to exclusive (either directly or trough other state). It would lead to unpreventable deadlocks.

If there are two threads holding shared lock and both decide they want exclusive, neither of them can get it, because the other one has shared. But none of them is willing to give up its shared one until it reaches exclusive and finishes the work.

This is the reason for existence of upgrade ownership. It is like shared one, but only one has a guarantee it can decide to want exclusive lock.

Anyway, I believe it is not a problem now, since we have much more hash cells than threads and the chance of two colliding is small. And if the thread only looks up the entry (does not create one), it holds the lock really short time. I think we can leave this be unless we measure it is a bottleneck.

The with weak pointers there would delete them safely (shared_ptr promisses that). The gotcha there is
the weak pointer is left inside the hash table and throws an exception upon access. So we would need to
catch that exception when we access that hash slot and delete the dead weak pointer (which points nowhere
at that time).

Will an access always throw an exception? The pointer points to a block of memory which could easily be allocated to another nameserver object.

According to documentation, it does.

What I partly guess from the documentation and how the shared_ptr looks in debugger, it holds two pointers. One for the real data and one for a small piece of memory where it holds the reference counts. So it can hold two counts ‒ one of the shared_ptrs there and one of total weak_ptrs+shared_ptrs. If shared goes to zero, the real object is destroyed, but the small piece of memory not. So the weak_ptr looks there and finds that the shared_ptr count is zero, therefore it was deleted and throws. The piece of memory is destroyed when the last weak_ptr stops pointing to it.

It is probably little bit simplified, but it shows it can be done easily. The fancy thing there will be doing it thread-safe and fast, but for modelling of the behaviour this is probably enough.

Is the five minutes mentioned in this method a reference to caching server failures? If so, the five minutes
is defined in RFC 2308 (section 7).

I didn't find it there. I think this is a slightly different problem ‒ the server itself might not be dead.
I'm just unable to obtain its address (maybe because the nameserver that knows it is down, or because no
nameserver I know address of knows them or because my network cable is suddenly eaten by a mouse...).

Is that really different? You can't reach the server, so as far as you are concerned, it is down.

Well, it is. A dead server is one I know where it exists and it doesn't answer. This is a case when I do not know who to ask so there's nobody I could call dead. The difference is also whose fault it is. I think I should leave the FIXME there.

NameserverEntry::getAddresses()
:

I'm not sure I really understand this comment. What is the problem with expiring while there's a fetch in
progress?

Pointers might become invalid because something referred to by the fecth expires (and is deleted) before the fetch returns.

Ah, right. Unless I have a bug in the code somewhere, this can't happen. The only pointers that exist there are to NameserverEntry? and ZoneEntry?. And the callbacks (from NameserverEntry? or resolver) contain shared pointers, so they survive until the answer is provided and the callback destroyed.

Documentation
It would be worth re-visiting the NSAS Design and check that the description of the data structures and
processing match what is now in place.

Yes, I want to do it some time around merging. But IMO keeping documentation on wiki seems bad,
because while the code has multiple versions (with possibly multiple different ways how it does
something), the wiki shows only the latest one.

Something for the F2F? Perhaps we should keep this overview documentation with the code, perhaps in the form of an OpenOffice and/or PDF file (to allow diagrams to be more easily included)?

I was thinking about doxygen. We already have API documentation there and it allows creating separate pages, including images, etc. So having a docs subdirectory as well as tests one might make sense, we would have everything under one system, we could see diffs (it is hard with a PDF) and we could generate the documentation for the whole thing.

But you are right it might make sense to bring this up on F2F. I'm going to write a mail about it.

However, all points have been addressed, please go ahead and merge with #356.

I have merged it. I'll close the ticket when I update the wiki documentation and it is merged to trunk.

comment:12 Changed 9 years ago by vorner

  • Resolution set to complete
  • Status changed from reviewing to closed

The documentation is updated. Closing.

Note: See TracTickets for help on using tickets.