Firewall Issue of 2014-07-14
On 2014-07-14 several systems at the OSG Operations Center suffered firewall issues that denied access to important ports, including on osg-xsede.grid.iu.edu, osg-flock.grid.iu.edu, ce.grid.iu.edu, reports.grid.iu.edu, and swamp-ticket.grid.iu.edu.
The root causes were:
- a number of errors in the firewall implementation at Operations that had built up over years
- a more recent error in the firewall implementation that had been inadvertently introduced during the transition to IPv6, causing some systems to run unfirewalled
- a hasty decision made in an attempt to fix the issue
- For years Operations, finding the RHEL firewall configuration system inadequate for its purposes, has used a system of global and local firewall scripts. The global ones are installed via RPM on the base system images and maintained via Puppet; the local ones are installed via scripts when a service is installed.
- However, there is one file, /etc/iptables.d/50-local-rules, which we'll call F in this explanation, that was installed via the RPM and not controlled by Puppet but locally maintained.
- File F was originally created as an easy way to activate firewall rules: it contained a number of commented-out rules that could be uncommented. In practice this mechanism turned out to be rarely used.
- Instead, systems have their own local firewall rules files, specific to the service(s) they run, separate from F.
- However, over time, staff (and others) setting up systems have occasionally relied on F, uncommenting its rules or actually adding more rules to it.
- This led to F being a vaguely-defined file of uncertain status with no centralized control.
- Additionally, over the past few months the firewall script and rules have been changed to move toward IPv6 capability. During this work an Operations staff member made an error that caused some systems to report an unsuccessful return value when stopping the firewall, even though the firewall did actually stop correctly. He noticed this, but it did not seem to be a big problem, just an inaccurate return value. What he did not realize immediately was that whenever Puppet restarted the firewall, it would detect the "failure" and not start the firewall again, because Puppet runs the stop and start separately instead of running a restart.
- Operations had been considering removing file F from RPM control and had changed the RPM spec file to omit it, intending to test this on test VMs.
- The staff member then noticed the problem caused by the inaccurate return value and realized that, as a result, any number of systems at Operations could be running completely unfirewalled. He should not have panicked, but he did. He made a change that fixed the problem and pushed it out immediately. On its own this would not have caused a problem, but the untested change to file F inadvertently went along with it. This occurred at approximately 16:30 (all times UTC).
- This RPM was pushed to all servers, causing F to be renamed to F.rpmsave, which causes it to be ignored whenever the firewall is restarted. Unfortunately, updating this RPM itself triggers a firewall restart.
- This meant that the services that relied on rules in F in order to operate correctly no longer had them. This included osg-xsede, osg-flock, and others.
- At 17:15 RSV started sending alerts about osg-flock, but it was not immediately clear what was causing them.
- Around 20:30 Operations was made aware that osg-flock was down, precipitating further investigation.
- Around 20:50 Operations fixed the problem on osg-xsede and osg-flock (by moving the needed rules in F into other files or renaming F.rpmsave to something other than F and making sure its permissions were correct, then restarting the firewall again).
- We estimate that 75-100 kHrs of user time was displaced from osg-xsede users to users submitting via a different mechanism.
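The interaction between the inaccurate return value and Puppet's separate stop and start, described in the timeline above, can be illustrated with a small Python model. This is only a sketch: the names `Firewall`, `buggy_stop`, and `stop_then_start` are hypothetical and do not correspond to the actual firewall script or to Puppet's code. It assumes only what the report states, namely that a stop reporting failure prevents the following start.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    """Toy stand-in for the firewall service (hypothetical, not the real script)."""
    active: bool = True
    buggy_stop: bool = True  # models the inaccurate return value

    def stop(self) -> int:
        self.active = False  # the firewall really does stop
        # The buggy version reports failure anyway.
        return 1 if self.buggy_stop else 0

    def start(self) -> int:
        self.active = True
        return 0


def stop_then_start(fw: Firewall) -> bool:
    """Restart modeled as two separate steps, as the report describes for Puppet."""
    if fw.stop() != 0:
        return False  # stop "failed", so start is never attempted
    return fw.start() == 0


buggy = Firewall(buggy_stop=True)
print(stop_then_start(buggy), buggy.active)  # False False: host left unfirewalled

fixed = Firewall(buggy_stop=False)
print(stop_then_start(fixed), fixed.active)  # True True: firewall back up
```

In the buggy case the service ends up stopped even though nothing actually went wrong during the stop, which matches the unfirewalled state described above.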
Standard GOC monitoring was sending alerts, but their significance did not become clear until Operations was notified by email that osg-flock was inaccessible.
Manual correction of firewall configuration
Factors contributing to the failure
- Confusion from the start about the purpose of the firewall config file called "F" in this document
- Changes to firewall script due to IPv6 transition
- OSG Operations didn't follow procedure: the Operations Center Lead and Operations Coordinator should have been notified and consulted about what to do before any unilateral action was taken. Had that happened, perhaps the impact on file "F" would have been discovered before the change went out (or perhaps not).
- The firewall script's error has already been fixed
- Firewall files on servers known to be affected have already been fixed
- A thorough, methodical search must be conducted for other ramifications of this problem (F.rpmsave files containing uncommented rules)
- The file "F" should probably just be removed so as to eliminate the confusion in its status
- Operations will no longer push any change, no matter how small or how critical, to production servers without obtaining approval
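The search for F.rpmsave files containing uncommented rules, called for in the list above, could be partly automated. The following is a minimal sketch, not the procedure Operations used. It assumes that rules files live in /etc/iptables.d (the directory of file F), that comment lines begin with `#`, and that any other non-blank line is an active rule. The function names are hypothetical.

```python
from pathlib import Path


def active_lines(text: str) -> list[str]:
    """Return non-blank lines that are not comments (assumes '#' starts a comment)."""
    stripped = (line.strip() for line in text.splitlines())
    return [line for line in stripped if line and not line.startswith("#")]


def find_rpmsave_with_rules(rules_dir: str) -> dict[str, list[str]]:
    """Map each *.rpmsave file in rules_dir to its uncommented lines, if it has any."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(rules_dir).glob("*.rpmsave")):
        lines = active_lines(path.read_text(errors="replace"))
        if lines:
            hits[str(path)] = lines
    return hits


if __name__ == "__main__":
    # Directory taken from the path of file F; adjust if rules live elsewhere.
    for name, lines in find_rpmsave_with_rules("/etc/iptables.d").items():
        print(f"{name}: {len(lines)} uncommented line(s)")
        for line in lines:
            print(f"    {line}")
```

A hit only means that a file needs human review; whether a given line is a rule that a service still depends on has to be decided per host.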
- 15 Jul 2014
Topic revision: r3 - 16 Jul 2014 - 19:56:42 - RobQ