- Human error. Operations believed the whitelist had been resigned.
- RSV Status changed to WARNING at Fri Jul 04 2014 08:16:01 GMT-0400 (EDT)
- Operations checked the status of all services early on the morning of Saturday, July 5, and saw the warning status
- Status display is available here
- Logged into a worker node at the GOC to verify content was being delivered correctly
- /cvmfs/oasis.opensciencegrid.org was not mounted on the node. It is likely it had not been mounted for over a month.
- Changing the working directory to /cvmfs/oasis.opensciencegrid.org caused automount to mount the filesystem
- ls (ell ess) showed the expected content.
- Operations decided that the inability to update content over a holiday weekend did not constitute a Critical or High priority state and that repair could wait until working hours. These terms are defined in the SLA.
- Sunday July 6: Dave Dykstra created ticket 21659
- ~08:00 GMT-0400 (EDT) Monday July 7:
- Master key copied from thumb drive to proper location on oasis.grid.iu.edu
- Whitelist resigning script invoked
- Master key removed from system
- Status changed to OK at Mon Jul 07 2014 08:31:01 GMT-0400 (EDT) without further intervention
- Status is defined by the test logic that generates this status page.
The GOC received no reports of problems from OASIS users during the event.
Standard GOC monitoring
Manual resigning of whitelist
Factors contributing to the failure
- Resigning of the whitelist cannot be automated because security policy requires that the master key not routinely reside on the stratum 0 server.
- Whitelist age is not specifically monitored by the GOC monitoring infrastructure.
- Add logic to the status page to change status to warning if the whitelist signature age exceeds 20 days, critical if it exceeds 25 days.
- An alarm is generated any time the status reported is not "OK"
- Implement routine resigning during production maintenance windows.
- 07 Jul 2014
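The age thresholds proposed for the status page (warning above 20 days, critical above 25 days) could be expressed roughly as follows. This is a minimal sketch, not the GOC's actual test logic: the function name `whitelist_status` and the way the signing time is obtained are assumptions made for illustration, and the dates in the usage lines are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Thresholds taken from the corrective action above.
WARNING_AGE = timedelta(days=20)
CRITICAL_AGE = timedelta(days=25)


def whitelist_status(signed_at, now=None):
    """Map the whitelist signature age to a status string.

    signed_at: timezone-aware datetime of the last whitelist signing
               (how this value is read from the server is an assumption
               left to the caller).
    now:       timezone-aware datetime to compare against; defaults to
               the current UTC time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    age = now - signed_at
    if age > CRITICAL_AGE:
        return "CRITICAL"
    if age > WARNING_AGE:
        return "WARNING"
    return "OK"


# Hypothetical dates: a signature that is 22 days old.
signed = datetime(2014, 6, 1, tzinfo=timezone.utc)
checked = datetime(2014, 6, 23, tzinfo=timezone.utc)
print(whitelist_status(signed, checked))  # prints "WARNING"
```

Because any status other than "OK" already generates an alarm, a check of this shape would raise the first alarm several days before the signature becomes critical, leaving time to schedule a manual resigning.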
Topic revision: r5 - 16 Jul 2014 - 20:16:30 - RobQ