MANRS is a great idea and initiative. I fully support the aims, and accept that there are (occasionally) going to be teething issues. We've however run into some "interesting" problems where RPKI hasn't worked terribly well. Some time last night, we ran into another one.
Somehow, the router we peer with at one of our upstream ISPs lost its sessions with their RPKI Origin Validation (OV) server. This meant that as it learnt and evaluated our routes, they were being marked as "not found". Of course, that ISP also had routes to our prefixes via our other ISP that were still marked as "valid", so that became the preferred path.
Of course, our edge routers were still merrily sending traffic via that ISP for about 50% of the destinations on the Internet, and somewhere between the ISP's Provider Edge router and the rest of the internet, our packets vanished into a black hole.
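For anyone who hasn't stared at RPKI OV before: a route ends up "valid", "invalid" or "not found" depending on the ROAs the validator feeds the router. Here's a minimal Python sketch of that RFC 6811 classification; the `Roa` class, the `rpki_validate` helper and the documentation prefixes/ASNs are mine, purely for illustration, not anything out of our or the ISP's real tables.

```python
# Minimal sketch of RFC 6811 origin validation. The ROAs, prefixes and ASNs
# here are documentation/example values, not real ones.
import ipaddress
from dataclasses import dataclass

@dataclass
class Roa:
    prefix: str      # e.g. "192.0.2.0/24"
    max_length: int  # longest prefix length the ROA authorises
    asn: int         # authorised origin AS

def rpki_validate(announced_prefix: str, origin_asn: int, roas: list[Roa]) -> str:
    announced = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True  # at least one ROA covers this prefix
        if announced.prefixlen <= roa.max_length and origin_asn == roa.asn:
            return "valid"
    # Covered but nothing matched -> invalid; no covering ROA at all -> not-found.
    return "invalid" if covered else "not-found"

roas = [Roa("192.0.2.0/24", 24, 64500)]
print(rpki_validate("192.0.2.0/24", 64500, roas))  # valid
# If the router's validator sessions die and its ROA data goes away,
# everything it evaluates degrades to "not-found":
print(rpki_validate("192.0.2.0/24", 64500, []))    # not-found
```

Typical policy then prefers "valid" over "not found" (and usually drops "invalid" outright), which is why their router swung over to the path via our other ISP.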
We saw the following things that looked really odd:
- ping testing to the peer router failed
- traceroute past the peer router failed (it revealed the ISP's immediately peered router, but nothing further)
- we could re-establish BGP sessions and prefixes would happily be exchanged, but still the "internet was broken".
So after some troubleshooting, it turned out RPKI OV at that ISP had failed again. Which led me to wonder why, if we were delivering packets to them, those packets weren't actually going anywhere else. There were valid routes onwards, and ultimately packets will normally get through somehow, hopping across as many routers as is reasonably necessary. My strong suspicion is that, having received packets over a path whose route back to us was marked RPKI "not found" (and therefore not installed as the active, best route), uRPF looks at that and goes "my dude, packets from that prefix are being received over a path that is not the best path back to that origin network. This path over here, via the customer's other ISP, is the path, because it's marked RPKI valid. These here packets MUST BE FRAUDS. Drop them!".
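To make that suspicion concrete, here's a toy model of what strict-mode uRPF (RFC 3704) does, as I understand it. The FIB and interface names below are invented for illustration and are obviously nothing like the ISP's real table.

```python
# Toy model of strict-mode uRPF: only accept a packet if the best route back
# to its source address points out of the interface the packet arrived on.
# The FIB entries and interface names are invented for illustration.
import ipaddress

FIB = {
    "192.0.2.0/24": "to-other-isp",    # our prefix: best path is the RPKI "valid" one via the other ISP
    "198.51.100.0/24": "to-customer",  # some other customer whose best path really is this link
}

def strict_urpf_accept(src_ip: str, in_interface: str) -> bool:
    src = ipaddress.ip_address(src_ip)
    best = None
    for prefix, ifname in FIB.items():
        net = ipaddress.ip_network(prefix)
        if src in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, ifname)          # longest-prefix match wins
    return best is not None and best[1] == in_interface

# Our packets arrive on the direct customer link, but the best route back to us
# points at the other ISP, so strict mode bins them:
print(strict_urpf_accept("192.0.2.1", "to-customer"))    # False -> dropped
print(strict_urpf_accept("198.51.100.7", "to-customer")) # True  -> accepted
```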
So, buggy RPKI OV plus strict uRPF is a packet dropping perfect storm for people who are multi-homed.
MANRS suggests one ought to be doing both uRPF and RPKI OV, and we have an ISP that likes to do things properly (with a routing equipment supplier that has some glitchy implementations of some of the features needed).
Normally, when our link to the ISP fails, it's because somewhere along the light path from our secondary datacentre to their PoP nearly 1,000 km away, someone's fired a shotgun at the fibre, or stolen a section hoping it's copper, or run a backhoe through the cable, or a veldt fire has burnt through the fibre, or strong wind and rain have caused the massive long-distance electricity pylons it's mounted on to fall over. (Yeah, we get quite a lot of faults. No, none of them are covered by the SLA. Yes, the root cause analysis sometimes brings some amusement.) But every so often, it's something else.
Descent Into Madness: Starting the Day off Just Wrong.
In this case, we actually still had a link to the ISP, but traffic failed hard. Unfortunately, it took rather longer to diagnose the problem than might be ideal, or normal.
The perfect storm ran something along the lines of me getting an early morning message from my early rising colleague saying the link to that ISP was down. Having recently shown them the steps I take to diagnose that connection, I assumed that it really was down. And, well, fine, that doesn't cause us any kind of issues, other than higher risk for an outage or slower connectivity if the other ISP loses some links. I'll drop the down ISP a quick mail from my phone before I even get out of bed asking them to check for an outage on the fibre, and look into it in the office.
Got into work and various people were saying "DNS was broken" and "what is bootstrap?" (a couple of the random DNS servers we check to see whether we have working DNS recursion happen to be called "bootstrapcdn"). Let's pull out dig +trace. A LOT of DNS was broken. Eventually, I thought "right, this is widespread enough that it looks like a routing/connectivity problem, not a DNS problem". Either that, or there is a MASSIVE DDoS happening somewhere really important. Please, sinus headache, leave me alone so I can think of the commands I need. Also, coffee. Lots of coffee.
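In hindsight, that "is it DNS, or is it the network?" triage is simple enough to sketch. The test names below are placeholders, not our actual monitoring targets.

```python
# Rough "is it DNS, or is it the network?" triage: try resolving several
# unrelated names through the local recursive resolver. The names below are
# placeholders, not our actual monitoring targets.
import socket

TEST_NAMES = ["www.example.com", "www.example.net", "www.example.org"]

def resolution_failures(names):
    failed = []
    for name in names:
        try:
            socket.getaddrinfo(name, 443)   # goes via the local recursive resolver
        except socket.gaierror:
            failed.append(name)
    return failed

failed = resolution_failures(TEST_NAMES)
if len(failed) == len(TEST_NAMES):
    print("everything is failing -> smells like routing/connectivity, not DNS")
elif failed:
    print("some names fail, some work:", failed)
else:
    print("recursion looks fine")
```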
I then did the basic connectivity check any network engineer does - traceroutes to problematic destinations.
Hang on a minute.
What the hell?
How are traceroutes getting to an unpingable peer router and stopping there? Why are they even attempting that route if the fibre is down?
Maybe it's not as unreachable as people said. Maybe, just maybe, it's not as "down" as we'd all so clearly assumed it was.
Let's go look at the BGP summary on the edge routers.
What? The sessions are up and established to that router our monitoring system can't ping? Hmm. Let's clear the peering sessions. What? They re-establish, and prefixes are exchanged and accepted? OK. That's weird.
Must be some kind of problem with the upstream ISP.
Let's kill the peering session.
Oh look, working Internet.
Hi, ISP, so, sorry, we think it's not a fibre problem, but some kind of weird routing glitch or filtering issue at your end, as we can establish sessions and exchange prefixes - please can you check? (Back of my Mind: "I wonder if it's the revenge of the buggy RPKI again").
A session re-establishment later, and they have some info to look through.
"Oh yeah, sorry, it's was broken RPKI OV again, try now".
Yeah, that's working.
Facepalm.
Hey network engineering colleague, let me go over troubleshooting connectivity again.
This step *here* is really important. Ask your routers what they see. Be very suspicious of an un-pingable router that has established BGP sessions!
BGP is, in some ways, really dumb. It assumes if you can reach the other router, it's going to deliver your packets. That's not necessarily the case.
Pinging a remote router is a pretty basic test that doesn't tell you everything you need to know, and, well, it can make an ASS out of U and ME. It tells you something might be wrong, but not necessarily what. A failed ping test is a failed ping test, no more, no less. It's an invitation to dig (or traceroute, or any number of other commands :) ) deeper.
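If you want to take that advice literally, here's a sketch of the cross-check I mean. It assumes a Cisco-style edge router reachable with netmiko; the hostname, credentials and peer address are placeholders, and you'd match the show command and the ping flags to your own kit and OS.

```python
# Cross-check the data plane (can I ping the peer?) against the control plane
# (what does my edge router say about the BGP session?). Hostname, credentials
# and peer address are placeholders; the ping flags are Linux-style.
import subprocess
from netmiko import ConnectHandler

PEER_IP = "192.0.2.254"            # placeholder peer address
EDGE_ROUTER = {
    "device_type": "cisco_ios",    # assumption: a Cisco-style box
    "host": "edge1.example.net",   # placeholder
    "username": "netops",          # placeholder
    "password": "REDACTED",        # placeholder
}

def peer_answers_ping(ip: str) -> bool:
    result = subprocess.run(["ping", "-c", "3", "-W", "2", ip], capture_output=True)
    return result.returncode == 0

def bgp_summary() -> str:
    with ConnectHandler(**EDGE_ROUTER) as conn:
        return conn.send_command("show ip bgp summary")

if not peer_answers_ping(PEER_IP):
    print("peer does not answer ping; here's what the router thinks of the session:")
    print(bgp_summary())
    print("if that session shows Established, be suspicious: the control plane is up "
          "but the data plane may be eating your packets")
```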
Oh, p.s. here's the documentation on massaging routes in our systems. Yeah, it's in the wiki. Yeah, I also read the documentation I write when running commands I rarely run. Particularly when I can't brain, because it's too early, my sinuses are bastards and there is no coffee.
p.p.s traceroute. It is your friend. It will show you weird broken crap really fast.
Postscript: And then the connection DID go down, and now my colleague has seen both error conditions. Nice.
And yes, they do uRPF in strict mode and are considering going to "loose" whilst they work with Cisco to fix a number of lingering bugs.
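For what it's worth, the difference is easy to see next to the strict-mode sketch earlier: loose mode only asks whether *any* route back to the source exists, regardless of which interface the packet arrived on. Again a toy model, not the ISP's config.

```python
# Toy model of loose-mode uRPF: accept a packet as long as some route back to
# its source exists at all, regardless of the receiving interface. Same kind
# of invented FIB as in the strict-mode sketch above.
import ipaddress

FIB = {
    "192.0.2.0/24": "to-other-isp",
    "198.51.100.0/24": "to-customer",
}

def loose_urpf_accept(src_ip: str) -> bool:
    src = ipaddress.ip_address(src_ip)
    return any(src in ipaddress.ip_network(prefix) for prefix in FIB)

# The multi-homed case that strict mode dropped sails through loose mode,
# while genuinely unrouted (likely spoofed) sources are still rejected:
print(loose_urpf_accept("192.0.2.1"))    # True  -> accepted
print(loose_urpf_accept("203.0.113.9"))  # False -> dropped
```

The trade-off, of course, is that loose mode catches far less spoofing, which is presumably why they'd rather Cisco just fixed the bugs.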