05 Third-Party Measurement of Network Outages in Hurricane Sandy - John Heidemann

>> JOHN HEIDEMANN: Okay. My name is John. These are my colleagues at USC, and this work is sponsored by the Department of Homeland Security. So we started with this question of what can pings tell us about damage. So these are the ways we use to diagnose problems on the Internet, what happens to the infrastructure. But we actually have a broader goal, which is not just to track outages here but in all the world. We want to know about natural disasters, but we also want to know about, unfortunately, human disasters: Egypt, and Libya, and Syria. We want to know about the shape of outages, long-term outages or things that may affect people for a shorter time. And not just this core routing infrastructure, but also home networks and end systems, because we actually believe that those end systems are where a lot of the outages happen. So as some background, what pings do is tell us if an IP address is active. You have the ping command and, over time, you get back your replies. Those are green dots. You can interpret this, so it tells you how that address is being used. You can also get negative replies. So if we send a ping and get no reply, that's a black dot. And so by combining these you can tell sort of what's going on in the network. So pings tell you something about the network, because if you get a bunch of positive replies, that says that part of the network is up, or at least something in that part of the network is up. That's pretty simple. And you would like to think that if you get negative replies, that says that part of the network is down, but unfortunately, they don't quite tell you enough, because there's a bunch of ambiguity in negative replies. It could be the network's down, it could be that somebody shut their laptop, or it could be that the addresses get reassigned dynamically and change, or there's a firewall, or all this other stuff. Okay?
So the challenge is how do we disambiguate negative replies, to tell when a hurricane struck from when somebody closed his laptop, because those have different importance, of course. And so this ambiguity is the challenge that we deal with. And to deal with this, we look at not just one address, which would be very difficult to disambiguate, but at a bunch of addresses that are adjacent to each other. And so if you probe a bunch of addresses you'll get a bunch of responses, but if, over time, a bunch respond and then a bunch stop responding, that's probably a good indication that the network, which used to be healthy and happy and up, is no longer up. That's what we call an outage. That's how we detect outages. So we do this in blocks around the Internet. So this bar here represents 256 addresses in one /24, that's a block of consecutive addresses on the Internet. And you can see a bunch of green and black dots. This is real data, and you can see an outage in the middle of there, that little black line. And if we do this to a whole bunch of blocks, you can see sometimes those outages are correlated: adjacent blocks have an outage at the same time, possibly caused by the same thing. Sometimes we see outages in other blocks at different times. So we map the raw ping data to outages. We map each block into a single line and then we -- and then we cluster those lines, and you can see patterns in here -- well, I can see patterns in here, of lines that show up next to each other, and they represent different outages. And just to give you a better -- so this -- we see a couple of different outages in data that we took in the past. And what I'm going to show you is plots like this, where we have time going across the X-axis, we have blocks going along the Y-axis, and you see colored splotches in the middle. Those are outages, and in fact, this data here is our data on Hurricane Sandy. So I'll come back to that in a little bit.
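[Editor's illustration] The block-level idea described above can be sketched in a few lines of Python. This is a minimal sketch, not the speaker's actual algorithm: the function names, the `drop_ratio` threshold, and the use of the median as a baseline are all assumptions made for illustration. Each probing round is a list of booleans, one per probed address in a block (True for a reply, False for no reply).

```python
from statistics import median

def block_availability(rounds):
    """Fraction of probed addresses that replied in each round.

    rounds: list of non-empty lists of bool, one inner list per probing round.
    """
    return [sum(r) / len(r) for r in rounds]

def detect_outages(rounds, drop_ratio=0.1):
    """Return (start, end) round-index pairs, end exclusive.

    A round counts as "down" when the block's availability falls below
    drop_ratio times its typical (median) availability. Both the
    threshold and the median baseline are hypothetical choices.
    """
    avail = block_availability(rounds)
    baseline = median(avail)
    if baseline == 0:
        return []  # block never responds, so nothing can be inferred
    down = [a < drop_ratio * baseline for a in avail]

    outages = []
    start = None
    for i, is_down in enumerate(down):
        if is_down and start is None:
            start = i                      # outage begins
        elif not is_down and start is not None:
            outages.append((start, i))     # outage ends
            start = None
    if start is not None:
        outages.append((start, len(down)))  # still down at end of data
    return outages
```

Looking at the whole block is what separates the two cases from the talk: one address going quiet (a closed laptop) barely moves the block's availability, while every address going quiet at once pushes it below the threshold.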
I don't want to get ahead of myself, but -- so for the Hurricane Sandy data, we reanalyzed an existing data set that we've been collecting. We've been collecting data in this way since 2006. So we probe a random sample of 41,000 /24 blocks in the Internet for two weeks, and the data is available. And the technical details about how we collect it are also available on this slide. So, getting very specific about Sandy, we took one of our data sets, 41,000 blocks, and we looked at about 12,000 of those in the U.S. Of those 12,000, about 4,000 are ones we can analyze, and this picture on the right shows where we geolocate those blocks to. Some of the blocks -- this big blob in the Atlantic is not Sandy. These are blocks where we can't tell. This is our data and this is Sandy. Three days before, you see a pretty calm network. Then when Sandy makes landfall, which was very conveniently at midnight UTC, you can see a bunch of blocks go out. Okay? A bunch of networks go out in our sample. So, the next step you want to take is to quantify this. It's easy to guess, yes, of course, networks went down. We look at the marginal distribution here. If you sum up the number of colored dots in each column, that's the number that appears in this thing here. It represents the percentage of the world, the percentage of the U.S., that's down at any given instant. This plots that marginal distribution. The blue line is the median per day. Each red X is a measurement over 11 minutes. And what you can see is, first, the Internet's a big place. Some of it is always down. That's a fact of life. Right? And our rate, which is reasonably steady, is about two-tenths of a percent. And you can see that very clearly in the three days before landfall. After landfall we see that rate doubles. So if you just look at the U.S., there was twice as much outage the day after Sandy hit. Okay?
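[Editor's illustration] The marginal distribution described above (sum the outage dots in each column, then take a median per day) can be written out as a short Python sketch. The function names and the `rounds_per_day` parameter are assumptions for illustration. The talk says each point covers 11 minutes, but the exact number of rounds per day is not stated, so it is left as a parameter. `outage_matrix[b][t]` is True when block `b` is judged to be out in round `t`.

```python
from statistics import median

def outage_fraction_per_round(outage_matrix):
    """Column sums of the outage matrix, divided by the number of blocks.

    Returns one value per round: the fraction of sampled blocks that
    are out in that round (one "red X" per round in the plot).
    """
    n_blocks = len(outage_matrix)
    n_rounds = len(outage_matrix[0])
    return [
        sum(outage_matrix[b][t] for b in range(n_blocks)) / n_blocks
        for t in range(n_rounds)
    ]

def daily_median(fractions, rounds_per_day):
    """Median outage fraction for each consecutive day of rounds
    (the "blue line" in the plot)."""
    return [
        median(fractions[i:i + rounds_per_day])
        for i in range(0, len(fractions), rounds_per_day)
    ]
```

Comparing the daily medians before and after an event is one way to express a statement like "the rate doubles" as a number, for example by dividing the median for the day after landfall by the median over the preceding days.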
And then the final thing you can see in our data is that after four days it pretty much came back to the baseline. So, the neat thing is, by doing these stupid things, these simple measurements, and analyzing them in the right way, you can get quite a bit of information and understanding about what's going on on the ground. So, the next step is, you know, is this really Sandy, or is this just noise, because we do see all kinds of stuff in our data. So to isolate -- to demonstrate that this was actually correlated with Sandy, we geolocated all these IP addresses, all these blocks, I'm sorry, and you can see the colored bars in the middle. The light colored bars are New York and New Jersey. So the big uptick is correlated with outages in New York and New Jersey, which is pretty compelling that what we are seeing here is the effects of Sandy. And we actually plotted and geolocated them on the map, and so you can see three days before, not much in the Northeast; three days, four days after you can see -- I should stop waving my hand. You can see a lot of stuff in the New York, New Jersey area. And then it tapers off. So we find this is pretty compelling evidence of Sandy, and a quantification of how much damage was seen, at least in our sample. So, this is Sandy. Just as evidence of the generality of this approach, we actually have data from a number of other major events. This is the Japanese earthquake in March 2011. This is the Egyptian revolution; of course, we started our survey just after they shut off the network, but we can see it come back on. We can see the big world events. The neat thing is that we see less publicized events, too: there is also a pretty big outage in Australia. It didn't make the news because there was no Australian revolution and because they have a much bigger footprint than Egypt. And so most of Australia was up.
But it's important to know these smaller events as well, particularly if we try to improve the resiliency of the infrastructure over time. This is a two-week sample where you see outages in America. So we can start to get a handle on those as well. Two American carriers in one number. Okay. So the bottom line is our goal is not just big world events, but to try to understand the resiliency of the Internet as a whole, because once we can measure something we can do a better job of improving it. So we're actually on our way to trying to accomplish this task. And the first step to understanding the U.S. infrastructure as a whole is to track all of the IP address space, not just a random sample. The challenge here is that our current approach is quite traffic-intensive. And the neat thing is, we think we can get our traffic rates down to about 20 probes an hour, which is less than 1% of the background radiation. It would be a tiny increase in the amount of traffic that you would see just by being on the Internet at all. So when we get to this point, a single machine can track outages in the entire IP address space, and we hope to show results on that very shortly. So just to summarize, we showed that with pings you can track the effects of natural disasters on critical infrastructure like the Internet. Details about this are in our technical work, the data is available, and I'd love your feedback. So, you want to hold the questions to the end. So I think I'll hand off to our next speaker.