Thursday, 15 September 2016

When your firewall dies, and you need something, FAST.

For a number of years, I've been mentioning to colleagues, managers, interested members of staff and random strangers in the supermarket (well, maybe not the last one) that I don't like single points of failure in enterprise ICT infrastructure.

I occasionally picture my network layout in my head, and think about the single points of failure with dread.
"One of these days, that single fancy firewall is going to die. And we're not going to be happy about it."
Said firewall "died" last week...

I needed to get something up and running quickly that would enable at least basic functionality: access to GAFE (Google Apps for Education) and other "vital" services.

Fortunately, I happened to have a Mikrotik Cloud Core Router lying about the place, and a fairly long history of using Mikrotik (since RouterOS 3.something), so it doesn't take me long to hack together a passable edge network on one. These make excellent "Swiss Army Knives" for networking people; the Cloud Cores tend to have a little more oomph and some nice port options (like 10GbE and SFPs) compared to the more basic (but still rather capable) RouterBoards. This one could easily pass >400Mb/s through to the internet, so it's no slouch.

I fairly quickly (in less than an hour) established the required routing, NAT and a basic firewall ruleset that allowed the "core" teaching and learning functionality (i.e. GAFE, Wikipedia, etc.) to work. But trying to manage a school's internet requirements in 2016 solely by whitelisting IP addresses, protocols and ports is an exercise in frustration; you might be able to hack something together with L7 filters and perhaps regexes on DNS lookups, but that takes more engineering effort and time than you have in an emergency of this sort.
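To make the frustration with address-based whitelisting concrete, here is a minimal Python sketch of a default-deny allowlist keyed on destination network, protocol and port. It is an illustration only, not RouterOS syntax and not the ruleset I actually used; the `Allow` class, the `permitted` function and every address in it (taken from the reserved documentation ranges) are made up for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Allow:
    network: str   # destination network in CIDR notation
    protocol: str  # "tcp" or "udp"
    port: int      # destination port


# Hypothetical allowlist. These are documentation address ranges,
# not the real addresses of any service.
ALLOWLIST = [
    Allow("203.0.113.0/24", "tcp", 443),
    Allow("198.51.100.10/32", "tcp", 80),
]


def permitted(dst: str, protocol: str, port: int) -> bool:
    """Default deny: permit only traffic that matches an allowlist entry."""
    addr = ip_address(dst)
    return any(
        addr in ip_network(rule.network)
        and protocol == rule.protocol
        and port == rule.port
        for rule in ALLOWLIST
    )
```

A service that sits on a fixed network is easy to allow this way. A service fronted by a CDN may answer from an address that is not on the list at all, in which case `permitted` simply says no, and you are back to chasing DNS answers by hand.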

After I got the most "critical" things working, I think I spent about 3 hours (unsuccessfully) trying to get Turnitin to work, which Matric students needed that day to submit their final Life Orientation portfolio assignments. Some services use ridiculous CDN topologies, and Turnitin is one of those (the number of DNS hoops you jump through is crazy). I resorted to allowing blanket outgoing traffic from a few monitored machines to get it done, and then closed them up again once the assessments were submitted. Staff machines behind locked doors got rather less restricted access quite early on.

Then the next morning, in the shower, I suddenly thought "OpenDNS". I seem to have a fairly large number of brainwaves in morning showers. Possibly it's because my brain has spent the night subconsciously processing things, and the not particularly alert, but not particularly distracted, brain state then allows ideas to pop into being. Or maybe it's just that a few hours of distance from a problem gives you space to think of slightly less immediate ideas and solutions.

In a pinch, you can basically use OpenDNS to keep kids off the "worst" of the internet very easily by using their "family shield" IP addresses as your internal DNS server's forwarders - and, vitally, firewalling off all other DNS servers from clients (because children will engineer around that quicker than you might imagine). You should be blocking client DNS at your border anyway. If you have some actual money to throw at the problem, their commercial Umbrella products look very interesting (in particular, because you can effectively provision your firewalling in the cloud, and on other people's networks - no matter where your users end up - and without VPNs). Educational pricing seems quite attractive, but would be considerably more than we presently spend on our NGFW - possibly because we only have 50% of the NGFW we ought to have...
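The DNS policy described above boils down to one decision at the border: the internal DNS server may send queries to its Family Shield forwarders, and nobody else may send DNS anywhere. A minimal Python sketch of that decision follows. It is not router configuration, and the internal server address `10.0.0.53` and the function name are assumptions for illustration; only the two Family Shield resolver addresses are the ones OpenDNS publishes.

```python
from ipaddress import ip_address

# OpenDNS Family Shield resolver addresses, as published by OpenDNS.
FAMILY_SHIELD = frozenset(
    ip_address(a) for a in ("208.67.222.123", "208.67.220.123")
)

# Hypothetical address of the internal DNS server (assumption for this sketch).
INTERNAL_DNS = ip_address("10.0.0.53")


def dns_query_allowed(src: str, dst: str) -> bool:
    """Decide whether an outbound DNS query (port 53) may leave the border.

    Only the internal DNS server may query the outside world, and only
    via its Family Shield forwarders. Clients that point themselves at
    any other resolver are refused.
    """
    return ip_address(src) == INTERNAL_DNS and ip_address(dst) in FAMILY_SHIELD
```

For example, `dns_query_allowed("10.0.1.20", "8.8.8.8")` is refused, which is the case that matters: a pupil who has changed their DNS settings gets nowhere.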

There are likely alternatives out there (Untangle, the somewhat defunct DansGuardian, Squid as a "transparent" proxy and SquidGuard (which can also work with pfSense), etc.), but they lack the immediacy of a platform I know really well and simply changing some DNS server addresses. The latter is something that even very inexperienced techs will be able to sort out rapidly (not all schools have the luxury of a sysadmin, whether in title or in spirit). Schools mostly need to be able to whitelist domain names to effectively meet their needs; if you can't do that, you'll really struggle, and having a subscription service that does so with categories is very useful. But I like free things, particularly for temporary fixes.

So, with a Cloud Core Router and OpenDNS keeping kids away from the more horrid things, we had working Internet again. Far too much Internet (we block a lot of things for "because kids shouldn't..." or "because bandwidth" reasons), but that is better than not enough. Of course, once I rebuilt the firewall from scratch, the kids were less pleased, after last weekend's smorgasbord of content they usually can't have. (Protip: document your ruleset somewhere in a format that isn't simply a backup of a config file, particularly when that file is 45,000 lines long and not all that human-readable... Yeah, lesson learnt.) I still haven't been stabbed in a corridor, so that's a plus.

One pupil sent me an email kindly informing me of the sudden cornucopia of Internet wonders, so there are decent kids out there in the world (I more often get messages from rather entitled brats demanding things - and I got a few of those messages recently...).

Once that was done, I set to investigating why the Fortigate was just not working properly.
The Internet just... stopped for all clients, servers etc. If you were trying to get through the Fortigate, your packets just... wouldn't... flow....  You could see the rulesets working (packets being correctly filtered, logged, allowed, denied, etc.) but the traffic was just... stalled somewhere. From the firewall itself, you could happily tracert/ping or whatnot, but from either side? Dead loss.
It was using >99% CPU, yet diag sys top wasn't showing any processes going mad. On a few previous occasions when this sort of behaviour has ensued, rebooting or power cycling has restored connectivity. Not this time...
Googling suggested that some cheaper firewalls in the range frequently had issues (with similar symptoms) because their internal drives don't cope well with the heavy writes that logging to disk causes. This thing had been logging millions of packets a day for several years, so an unhappy disk seemed plausible. Friday, Saturday and Sunday involved running the HQIP diagnostic image a couple of times and lots of disk tests, with no obvious problem that could be consistently reproduced.
My suspicion is that, in the 2.5 years since the firewall was initially set up, the underlying firmware just didn't like some of the things we were asking it to do.
The thing was "nuke 'n paved" and reinstalled. Restoring the same configuration (or earlier backup configurations) would almost immediately show the same symptoms. So nuke 'n pave again, and rebuild from a vanilla "factory default" setting, at the request of Fortigate's helpdesk. Apparently, when the configuration was restored into a VM by Fortigate support, it was also problematic.

What's really odd is that things usually show a more obvious pattern: you make a change (firmware version, configuration change, change in load), and then the thing misbehaves. In this case, it just sort of crept up out of nowhere and sulked. We last did firmware updates weeks ago (I compulsively read Release Notes on software/firmware before doing updates), and traffic loading amped up a while back, after the holidays. Not fun.

It's mostly working again now. It was working again before I got a semi-formal quote from OpenDNS for Umbrella, which was extremely expensive for something I no longer need... unlike Family Shield, which is free. So check out Umbrella (they have a two-week free trial), but keep Family Shield (and the more fully featured home options) tucked into your box of tricks to help out relatives, friends... and random strangers in supermarkets who approach you asking about Computers, because they know you work in IT (in the same way I feel doctors probably get exposed to strange rashes in the canned goods aisle)... And get a Mikrotik or three. They're wonderful.

A more redundant firewall infrastructure has moved up my list of priorities (and I might also ruminate some more on the potential of Umbrella with BYOD, and on the "benefits beyond boundaries" that moving NGFWing into the Cloud might bring)...

P.S. Monitoring is your friend. Here is a year-long trace of CPU usage on the firewall in question. Note the 100% CPU usage in late August, which is related to the problem that prompted the need for the workarounds above...
Observium tells us there was a problem...
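A graph only helps if somebody looks at it, so it is worth having something shout when CPU stays pinned. As a rough sketch (plain Python over a list of polled CPU percentages, not an Observium feature or API; the function name, threshold and run length are all assumptions), sustained high usage can be told apart from a one-off spike like this:

```python
def sustained_high(samples, threshold=90.0, min_run=3):
    """Return the index where the first run of at least `min_run`
    consecutive samples at or above `threshold` begins, or None if
    there is no such run. A single spike does not count."""
    run = 0
    for i, value in enumerate(samples):
        # Extend the current run, or reset it when usage drops.
        run = run + 1 if value >= threshold else 0
        if run >= min_run:
            return i - min_run + 1
    return None
```

With samples polled every few minutes, a `min_run` of three ignores the odd busy moment but would have flagged a firewall sitting at >99% long before anyone phoned to say the Internet was down.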


  1. Incidentally, Fortigate's help on this amounted to "nuke and pave it and build from scratch" and "we think you're doing too much logging".

  2. I've since had CPU use creep inexorably upwards once again.
    I've really cut down on logging to the absolute bare essentials, and disabled all attempts at certificate/ssl inspection - that seems to tax it too much (argh).

    When things need deeper investigation, I up the logging level on the relevant rules.

    1. Disabling SSL certificate inspection is a BAD idea. Stops all interception of dubious materials when accessed over SSL. So much for having a tick box about IP classification...