Operations Web > OasisWhitelistSigning (16 Jul 2014, RobQ)

Root cause analysis: OASIS whitelist signature out of date

Root Cause

  • Human error. Operations believed the whitelist had been re-signed.

The Event

  • RSV Status changed to WARNING at Fri Jul 04 2014 08:16:01 GMT-0400 (EDT)
  • Operations checked the status of all services early on the morning of Saturday, July 5 and saw the WARNING status
    • Status display is available here
  • Operations logged in to a worker node at the GOC to verify that content was being delivered correctly
    • /cvmfs/oasis.opensciencegrid.org was not mounted on the node. It is likely it had not been mounted for over a month.
    • Changing the working directory to /cvmfs/oasis.opensciencegrid.org caused automount to mount the filesystem
    • ls (ell ess) showed the expected content.
  • Operations decided that the inability to update content over a holiday weekend did not constitute a Critical or High priority state and that repair could wait until working hours. These terms are defined in the SLA.
  • Sunday July 6: Dave Dykstra created ticket 21659
  • ~08:00 GMT-0400 (EDT) Monday July 7:
    • Master key copied from thumb drive to proper location on oasis.grid.iu.edu
    • Whitelist re-signing script invoked
    • Master key removed from system
  • Status changed to OK at Mon Jul 07 2014 08:31:01 GMT-0400 (EDT) without further intervention
    • Status is defined by the test logic that generates this status page.
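
The Monday re-signing steps (copy the master key into place, invoke the re-signing script, remove the key) could be wrapped so that the key is removed even when the script fails. The following is only a sketch under stated assumptions: the function names, the key paths, and the signing command are hypothetical placeholders, not the actual GOC procedure or script.

```python
import shutil
import subprocess
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporary_master_key(source: Path, destination: Path):
    """Copy the master key into place and always remove it afterwards.

    Both paths are supplied by the caller; no real locations are assumed.
    """
    shutil.copy2(source, destination)
    try:
        yield destination
    finally:
        # Runs on success and on failure, so the key never lingers on the server.
        destination.unlink(missing_ok=True)


def resign_whitelist(key_source: Path, key_destination: Path, sign_command: list) -> None:
    """Run a (hypothetical) re-signing command while the key is present."""
    with temporary_master_key(key_source, key_destination):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(sign_command, check=True)
```

A caller would pass the thumb-drive path, the expected key location on the stratum 0 server, and the site's own re-signing command as a list of arguments. The context manager is used so that key removal does not depend on the operator remembering the final step.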

User Impact

The GOC received no reports of problems from OASIS users during the event.

Detection Method

Standard GOC monitoring

Repair Method

Manual re-signing of the whitelist

Factors contributing to the failure

  • Re-signing of the whitelist cannot be automated because security policy requires that the master key not routinely reside on the stratum 0 server.
  • Whitelist age is not specifically monitored by the GOC monitoring infrastructure.

Remediation

  • Add logic to the status page to change the status to WARNING if the whitelist signature age exceeds 20 days, and to CRITICAL if it exceeds 25 days.
    • An alarm is generated any time the status reported is not "OK"
  • Implement routine re-signing during production maintenance windows.
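
The age thresholds in the first remediation item can be illustrated with a short sketch. It assumes the time of the last signing is already known as a timezone-aware timestamp; how the status page actually obtains that time is not described here, and the function and constant names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the remediation item above.
WARNING_AFTER = timedelta(days=20)
CRITICAL_AFTER = timedelta(days=25)


def whitelist_status(signed_at: datetime, now: datetime) -> str:
    """Map the whitelist signature age to a status string."""
    age = now - signed_at
    if age > CRITICAL_AFTER:
        return "CRITICAL"
    if age > WARNING_AFTER:
        return "WARNING"
    return "OK"


def needs_alarm(status: str) -> bool:
    """An alarm is raised for any status other than "OK"."""
    return status != "OK"
```

For example, `whitelist_status(signed_at, datetime.now(timezone.utc))` could be evaluated on each monitoring pass. The comparisons are strict, so a signature that is exactly 20 days old still reports "OK"; whether the boundary should be inclusive is a policy choice the text leaves open.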

-- ScottTeige - 07 Jul 2014

Topic revision: r5 - 16 Jul 2014 - 20:16:30 - RobQ