Opened 8 years ago

Closed 8 years ago

#1491 closed defect (fixed)

Boss/components config option got set to empty

Reported by: jelte Owned by: jelte
Priority: medium Milestone: Sprint-20120417
Component: Unclassified Version:
Keywords: Cc:
CVSS Scoring: Parent Tickets:
Sensitive: no Defect Severity: Medium
Sub-Project: DNS Feature Depending on Ticket:
Estimated Difficulty: 6 Add Hours to Ticket: 0
Total Hours: 0.52 Internal?: no

Description

I was playing around a bit with the 'kind' option in Boss/components, and changing it results in an error (as this is currently not supported; see #1413), removing it and readding it also leads to the same error (without intermediate commit, that is), and after that, the components set had been set to empty. This meant that no non-core processes were running anymore, and, most importantly cmdctl had stopped (so only manual fixing of the config file worked).

This looks like an error in cfgmgr (where it doesn't recover from update checks correctly).

The symptoms for this specific case will disappear if #1413 gets done, and for when a thing like this happens (and there may be more cases where we get 'bad' config file, even cases out of our control), we need to do #1443, and/or allow offline-configuration (for instance so that bindctl can work directly on the b10-config.db file, if it can't connect to cmdctl)

Subtickets

Change History (15)

comment:1 Changed 8 years ago by shane

  • Defect Severity changed from N/A to Medium
  • Milestone changed from New Tasks to Next-Sprint-Proposed

comment:2 Changed 8 years ago by jelte

  • Estimated Difficulty changed from 0 to 6

comment:3 Changed 8 years ago by jreed

Same as #1489?

comment:4 Changed 8 years ago by jelte

  • Milestone changed from Year 3 Task Backlog to Next-Sprint-Proposed

comment:5 Changed 8 years ago by jelte

  • Milestone changed from Next-Sprint-Proposed to Sprint-20120403

comment:6 Changed 8 years ago by jelte

  • Owner set to jelte
  • Status changed from new to assigned

comment:7 Changed 8 years ago by jelte

  • Owner changed from jelte to UnAssigned
  • Status changed from assigned to reviewing

Turned out the problem wasn't in cfgmgr, which was working fine, but like lists, we have no 'difference' concept for named sets; so if the set is currently a default value, and you try to set a child element, the whole set needs to be copied into 'local' data in bindctl.

comment:8 Changed 8 years ago by muks

  • Owner changed from UnAssigned to muks

Picking for review

comment:9 Changed 8 years ago by muks

  • Owner changed from muks to UnAssigned

Returning to pool as I think my own review bugs are going to keep me busy.

comment:10 Changed 8 years ago by jinmei

I've looked at it for a while, but I'm afraid there's a gap between
the problem description and the code changes. Could you explain what
exactly happened in the problematic scenario along with the affected
data structures (and perhaps control flow)?

Or, if someone else can easily understand these points and complete
the review, of curse that's fine.

comment:11 Changed 8 years ago by jinmei

  • Owner changed from UnAssigned to jelte

comment:12 follow-up: Changed 8 years ago by jelte

  • Owner changed from jelte to UnAssigned

So what happens is, bindctl keeps 'local' data, which is really just the changes from the 'current' data as present in cfgmgr. For most basic types, changing a value just means giving it a value in local data, and when it is committed it is overwritten (after validation etc.). But we don't really have a representation of 'changes'; so for lists and sets, if we add or remove an element, we need to copy the entire list or set to 'local data', then remove or add the element (since we have no representation of difference, the entire list or set will be overwritten when it is committed).

This was already implemented, but we missed at least one scenario; if a child element of a named set is modified, the local data only stores that specific element of the set (it is copied if you add or remove an element in a named_set directly).

You can reproduce the problem like this; (warning; this will result in all modules being removed, including b10-cmdctl)

Doing config show Boss/components between each step may show more info on what happens; the error after the first commit is unrelated (but boss/components is our only use of named_set afaik)

bindctl
> config set Boss/components/b10-stats kind needed
> config commit
Error: Changing configuration of a running component is not yet supported. Remove and re-add b10-stats to get the same effect
Configuration not committed
> config remove Boss/components b10-stats
> config commit
> config show Boss
Failed to send request, the connection is closed

(then retry this in the branch)

So when the first set is done, the 'new named set' contains only b10-stats, as it is not directly changed itself, and is not copied to 'local'. Of course in this specific example the commit fails, and you do 'config remove' on b10-stats, to re-add it later. But after the remove the set is empty. If you commit it then, boss will stop all configurable modules...

The immediate solution is to copy named_sets to local data completely if any part of it is changed. The real solution is in the api; we need to have a way to communicate 'differences' for specific types (additions and deletions, for lists and maps, and 'something has changed in a child, but leave the rest alone', etc.). I have no doubt that there are more issues like this (but this one is the most dangerous one I know of and wanted fixed right now, as you can lock yourself out of the system by accident)

comment:13 Changed 8 years ago by vorner

  • Owner changed from UnAssigned to vorner

comment:14 in reply to: ↑ 12 Changed 8 years ago by vorner

  • Owner changed from vorner to jelte
  • Total Hours changed from 0 to 0.52

Hello

Replying to jelte:

The immediate solution is to copy named_sets to local data completely if any part of it is changed. The real solution is in the api; we need to have a way to communicate 'differences' for specific types (additions and deletions, for lists and maps, and 'something has changed in a child, but leave the rest alone', etc.). I have no doubt that there are more issues like this (but this one is the most dangerous one I know of and wanted fixed right now, as you can lock yourself out of the system by accident)

The branch seems like it is doing what it should. And I agree we should do something with the design soon.

Anyway, the changes are OK and it is ready to merge. I'm just missing a changelog entry, this one would definitely deserve one.

comment:15 Changed 8 years ago by jelte

  • Resolution set to fixed
  • Status changed from reviewing to closed

thanks, merged, closing ticket

Note: See TracTickets for help on using tickets.