15 October 2019 at 4 p.m.:
LESLIE CARR: As everyone is sitting down, I just want to remind everyone that this conference is put on by the community, and part of the community is the Programme Committee. The Programme Committee are the people who get some of these fine speakers to tell us how everything in the Internet is about to break and explode. If you want to help with this very important part of putting on the conference, email email@example.com with a short bio, and tomorrow at 4:00 we will have all the candidates stand up here and give a very quick speech. I want to encourage people who are new to the community to apply as well, because what counts is that you are willing to put in the time and effort, and not necessarily that you have been doing this for 20 years and standing at the microphone all the time.
As well, it was a little crowded last time, so if everyone could please squish in, we'd really appreciate that, so, you know, get to know your neighbours a little better and just take the seat right next to them.
And, I will give everyone one more minute to get in and get seated down.
I will give everyone two minutes to get seated down so we can enjoy the awesome next talk, because, yeah, we all know people whose networks have been DDOSSed and how painful that is. All right, and just a reminder, as everyone is sitting down, we have got to be friendly with our neighbours.
All right. And on that note, let us start up the next session. We are going to ‑‑ everyone please quiet down. Don't have to be an elementary school‑teacher here. Really, shush. All right. Thank you. And let's please welcome Steinhor from ARBOR to tell us about DDOS.
STEINHOR BJARNASON: So, hi everyone. I am Icelandic, I live in Norway, I work for a US company and I am presenting in Amsterdam. My role in ARBOR is to look at new attacks and attack trends, try to understand where things are going, and try to come up with new technologies and approaches to stop those things; to understand what the bad guys are up to and what they could potentially be doing, and that is a lot of scary stuff.
So, this is what I will be covering today. I do not have a section on DNS type of attacks, and that is because they are so common today, it just happens all the time. And the talk earlier about DNS over TLS, over HTTP and so on, that really scares me, but that is a discussion for tonight.
So please come tonight.
So, first of all, DDOS trends. Everything I will be saying is actually in this report, so you can just download it and take a look at it. But basically, DDOS attacks are on the rise and the attacks are increasing in size: if you compare the first half of 2017 to the first half of 2018, the attack size has increased by 174%, gone from 622 gigabits up to 1.7 terabits, and that is rather big. The interesting thing is, the attack frequency has gone down, but the attack size and the global attack volume have gone up. That basically translates into attacks that are harder hitting: the bad guys don't need to launch multiple attacks to take websites down, they just take the big sledgehammer and take them off‑line, easy. And also, these tools are being weaponised, meaning they are put on websites where you can log on and, by just paying a fraction of a Bitcoin or five dollars, launch a five‑minute 300 gig DDOS attack against anyone with zero technical skills. And the interesting thing with memcached, the attack that was launched so big, is that it took one week before that attack was available on the booters. Basically the guys who came up with the initial attack did the research, looked at how to utilise this attack and so on; they did all the hard work. Then they made it into Python scripts and so on, and one week after they did the big thing, anyone who can use a web browser can launch a similar type of attack. It takes a lot of effort to build the initial gun, but after that any idiot can point it and shoot someone with it. And that's what is happening.
If we take a look at what has been happening in Europe in the first half of 2018, you see it's basically the same thing. We in ARBOR see about one‑third to 50% of the attack volume on the Internet, so if you want to see the real numbers, multiply by 2 point something. What we managed to see in 2018 was 560,000 inbound attacks to Europe, and the attack size was 0.75 gigs. Comparing it to last year, the attack size went from 0.44 up to 0.75, basically confirming what I was just saying. And the attacks are bigger: there were 54 attacks greater than 100 gigs, but last year we had 25, so more than double the number of large attacks compared to the year before. And it seems like this is just continuing.
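The growth figures quoted in this part of the talk can be checked with a few lines of arithmetic. This is only a sketch: it assumes the peak attack size went from 622 Gbps to roughly 1.7 Tbps (the value consistent with the quoted 174% increase), and the function name is made up for illustration.

```python
def percent_increase(before, after):
    """Relative growth from `before` to `after`, in percent."""
    return (after - before) / before * 100.0

# Peak attack size, first half of 2017 vs first half of 2018, in Gbps.
# 1700 Gbps (1.7 Tbps) is an assumed, rounded value.
peak_growth = percent_increase(622, 1700)

# Attack size quoted for Europe, in Gbps.
europe_growth = percent_increase(0.44, 0.75)

# Number of attacks above 100 gigs: 54 this year against 25 last year.
large_attack_ratio = 54 / 25

print(f"peak size growth:      {peak_growth:.0f}%")
print(f"Europe size growth:    {europe_growth:.0f}%")
print(f"large attacks, ratio:  {large_attack_ratio:.2f}x")
```

With the rounded 1.7 Tbps figure the peak growth comes out just under the quoted 174%, and the ratio of large attacks is a little over two, matching "more than double".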
Okay. Interesting things. Carpet‑bombing. We saw this early this year, and, well, it's not new, but only the really clever guys had been using it earlier. What happened in the beginning of this year is that this got weaponised, basically put on to a web portal where you could go in and use this as an attack. So the idea is basically, instead of focusing on one specific guy like him (he is not even looking at me), where I would usually launch a big attack that would be noticed immediately, I attack everyone on this side of the room. That attack will probably fill up the links and so on, and it will be extremely difficult to mitigate because it is spread all over. And if the defenders become clever, then maybe I will move it: I will begin attacking this part of the room and then the back; I basically rotate around the CIDR blocks, meaning when you have stopped the attack the guy basically changes it. And the time for the attackers to change an attack vector is less than 5 minutes, and I know this because when I went to NANOG about two weeks ago, I was actually helping to mitigate one of these attacks. They were attacking the customer, we stopped it, life was good. They changed the attack vector. We stopped it. They changed it and we stopped it. It took about six attempts before the guy gave up. Which is irritating and basically ruined my night.
Okay. So what does this look like? It's basically a normal UDP reflection type of attack, and we are seeing from 10 gigs up to 600 gigs attacking a service provider. The interesting thing is the 600 gig one: because they managed to spread it out so much, the detection systems didn't pick it up, because the detection systems are usually focused on specific IPs, /32s. In this case it was spread across the whole network, went under the radar, and they didn't detect it until the customer started to scream at them, and that is too late. When the customer starts screaming, well, you are not doing your job, basically.
So, like I said, UDP, that is DNS, TCP, memcached and so on, and it will be very difficult. One thing that is maybe even more fun is IP fragments, and to give you a quick review of what they are: if you have a large packet, like a 4K packet, then that packet will be split up into fragments, and only the first fragment will contain the headers themselves, the packet headers with the source port, the destination port and so on. All the other fragments have nothing, which means that if you are being bombarded by packets, only a small fraction of the packets coming in will have something which you can tie them to. The others are just fragments. And then it's rather difficult to determine whether you should be blocking them or not.
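To make the fragmentation point concrete, here is a minimal sketch of how a large UDP datagram is cut into IP fragments. It assumes a 1500‑byte MTU and a 20‑byte IPv4 header without options; the function and field names are illustrative only.

```python
def fragment(ip_payload_len, mtu=1500, ip_header_len=20):
    """Split an IP payload (UDP header + data) into IPv4 fragments.

    Every fragment except the last must carry a multiple of 8 bytes,
    because the fragment offset field counts in 8-byte units.
    """
    max_data = (mtu - ip_header_len) // 8 * 8   # 1480 bytes for a 1500-byte MTU
    fragments = []
    offset = 0
    while offset < ip_payload_len:
        size = min(max_data, ip_payload_len - offset)
        fragments.append({
            "offset": offset,
            "size": size,
            "more_fragments": offset + size < ip_payload_len,
            # The UDP header (and so the ports) sits at offset 0 only.
            "has_udp_header": offset == 0,
        })
        offset += size
    return fragments

# A "4K packet": 8-byte UDP header plus 4096 bytes of data.
frags = fragment(8 + 4096)
for f in frags:
    print(f)
```

Only the first fragment carries the source and destination ports, so a port‑based filter has nothing to match on for the others.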
So, how do you detect this? You cannot focus on specific destination IPs any more. You need to look at the subnets and the traffic going across links and so on, and use that as an additional warning mechanism. The problem is, of course, that you need to know what is normal: what is the normal load on a router, what is the normal load on these specific links and so on? If you don't know what is normal, you cannot detect what is abnormal.
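The idea of looking at subnets against a known baseline, rather than at single /32s, can be sketched as follows. The flow record format, the /24 aggregation, the baseline figures and the alert factor are all assumptions made for this example, not something specified in the talk.

```python
import ipaddress
from collections import defaultdict

def per_prefix_bps(flows, prefix_len=24):
    """Sum traffic (bits per second) per destination prefix.

    `flows` is an iterable of (destination_ip, bps) pairs.
    """
    totals = defaultdict(float)
    for dst_ip, bps in flows:
        net = ipaddress.ip_network(f"{dst_ip}/{prefix_len}", strict=False)
        totals[net] += bps
    return totals

def abnormal_prefixes(flows, baseline_bps, factor=5.0, prefix_len=24):
    """Return prefixes whose traffic exceeds `factor` times their known baseline."""
    alerts = []
    for net, bps in per_prefix_bps(flows, prefix_len).items():
        normal = baseline_bps.get(net)
        if normal is not None and bps > factor * normal:
            alerts.append(net)
    return sorted(alerts)

# Carpet bombing: 40 Mbps to each of 256 hosts stays far below a
# hypothetical per-host threshold of 1 Gbps, yet adds up to over 10 Gbps.
flows = [(f"192.0.2.{host}", 40e6) for host in range(256)]
baseline = {ipaddress.ip_network("192.0.2.0/24"): 500e6}
alerts = abnormal_prefixes(flows, baseline)
print(alerts)
```

Without a baseline entry for a prefix the sketch raises no alert at all, which mirrors the point above: if you don't know what is normal, you cannot detect what is abnormal.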
So, how to mitigate this? If they are not using fragmented packets, it's like any other reflection type of attack, and there you will usually have a fixed source port, like 1900 for SSDP, 11211 for memcached and so on. When you have detected the attack, you can usually just put an access list in place, or use Flow Spec, which is the quickest way, or ACLs, whatever you want, and you basically block it. Usually there is no reason to get hundreds of gigabits of SSDP traffic coming towards your network. But the fragments, which I explained earlier, are a little bit tricky. What you can do, if you have to receive fragments from someone else (and DNS is actually one good example, especially the EDNS packets), is say: these packets going towards your recursive resolvers should be allowed. Fragmented DNS packets going to subscribers, rate limit them down to 1%. If you do that, and a number of our customers have been doing this, you can easily mitigate one‑third to 50% of this attack immediately, because usually you should not be getting large UDP fragments going towards your networks. There are exceptions; if there are, document them, put them in place, and rate limit the rest. That will really help.
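The policy described here can be written down as a small decision function. This is a sketch of the logic only, not router configuration; the resolver address is a hypothetical placeholder, and how "rate‑limit" is enforced (for example, 1% of link capacity) is left to the platform.

```python
import ipaddress

# Fixed reflection source ports named in the talk.
REFLECTION_SOURCE_PORTS = {1900: "SSDP", 11211: "memcached"}

# Hypothetical: resolvers that legitimately receive fragmented EDNS replies.
RECURSIVE_RESOLVERS = {ipaddress.ip_address("192.0.2.53")}

def classify_packet(dst_ip, src_port, is_fragment):
    """Return 'drop', 'allow' or 'rate-limit' for an inbound UDP packet.

    `src_port` is None for non-first fragments, which carry no UDP header.
    """
    if src_port in REFLECTION_SOURCE_PORTS:
        return "drop"                       # no reason to accept this from outside
    if is_fragment:
        if ipaddress.ip_address(dst_ip) in RECURSIVE_RESOLVERS:
            return "allow"                  # documented exception
        return "rate-limit"                 # fragments towards subscribers
    return "allow"

print(classify_packet("198.51.100.7", 11211, False))
print(classify_packet("192.0.2.53", None, True))
print(classify_packet("198.51.100.7", None, True))
```

Keeping the exceptions in one explicit set matches the advice to document them and rate limit everything else.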
Okay. But like I said, there are exceptions to this fixed source port number, and SSDP is a good example of that. There was a certain company out there saying this is the end of the Internet and so on, and they came up with an article, but actually this has been around since 2015 and the Internet is still up, so, yeah, kind of jumping the gun.
The thing is, SSDP usually comes from source port 1900, so in this case, you see it like this: packets come in, and the reply comes from 1900. The weird thing is we were seeing packets coming from random source ports, so we surveyed the Internet, took a look, and found that 55% of the clients were sending replies from random source ports. The reason is, there is a bug in ‑‑ and this is commonly used on small CPE devices ‑‑ which basically means that when they receive these requests, they send a reply back with a random source port. And basically this is the problem: the guys installing the software on those CPEs do not bother to change anything; they just take the standard config, implement it, with bugs and everything, and they sell it to you guys for huge amounts of money.
Okay. Which means you cannot use the source port for detection or mitigation, and that means you need to use some of the other tricks which I mentioned earlier. In addition to this, an interesting bug: we discovered that on close to 2% of these boxes, a bug allows you to punch a hole through the CPE from the outside. UPnP is usually designed to allow devices on the inside to get access to the Internet and allow external connections to come in, so they punch a hole dynamically. This bug allows anyone on the Internet to open up a hole into your network through the CPE, which is taking user‑friendliness to the extreme, basically. So we did a small test. This is what we did: we mapped the Internet, and these are the results, basically. There are a lot of vulnerable devices in certain areas of the world. I am told this is because, like up here, the home users buy their own CPEs, and when you go out to the store and want to buy a router or something, what do you usually select? You buy the cheapest one, because it's cheap. Why do I need to pay an extra €10 or €50 for something more? But the problem is the cheapest ones still have the most security vulnerabilities, and this is the result.
So, continuing. I have a lot of material to go through and you can read the slides afterwards. So memcached, that is the reason we have got the 1.75 terabit attack. This is showing you how to implement a memcached attack in your own lab, and the interesting thing is that by sending one packet, if you do things properly, you can force the server to send 536,302 packets as a reply. This translates into an amplification factor of one to 500,000. That's totally crazy, basically. One packet, yeah: 6.2 gigs. If you send two packets you get 12 gigs, and then you continue and continue. In the end, of course, you will fill up the link on that server, and then you use another server and so on. It's extremely powerful. If you think this is too complex, well, here is the toolkit to do this for you.
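The amplification numbers above can be reproduced as plain arithmetic. The reply packet size is not given in the talk; the 1,450 bytes used below is purely an assumption, chosen because it lands close to the quoted 6.2 gigs.

```python
REQUEST_PACKETS = 1
REPLY_PACKETS = 536_302          # figure quoted in the talk

# ASSUMPTION: average reply packet size in bytes (not stated in the talk).
ASSUMED_REPLY_BYTES = 1450

packet_amplification = REPLY_PACKETS / REQUEST_PACKETS
reply_gigabits = REPLY_PACKETS * ASSUMED_REPLY_BYTES * 8 / 1e9

print(f"packet amplification: 1 to {packet_amplification:,.0f}")
print(f"reply volume per request: {reply_gigabits:.1f} gigabits")
```

Under that assumed packet size, a second request doubles the reply volume, which is the "two packets, 12 gigs" step in the talk.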
So how do you detect this? Relatively easy: it's coming from source port 11211. Why would you want memcached packets from the outside coming to your network? You don't; block it, throw it away. A lot more info here. Here is an example from Job Snijders, this is how he does it, and this basically solves those things very quickly. This is configuration which you should have on your network all the time because it will save your day. It's as easy as that. You don't need these packets coming into your network. So follow his guidance.
One interesting question: memcached has two interesting commands. One is called flush, flushing the keys which the attacker is using, and the other is a shutdown command, which takes down the server. So one very quick question for you guys, raise your hand if you agree: if you are being attacked with a large memcached attack, should you go out and use those commands to limit the attack itself? Raise your hands if you think that is the right way. One. Okay, that's good. I want to talk to you afterwards. One ‑‑ okay, 1.5. No. Of course not. The thing is, these systems belong to someone who is innocent; someone's servers are being abused to launch an attack against you, so they are victims in this case. If you go in and flush all the keys, you are flushing the cache of the server, you are taking down a system which belongs to someone else; you are basically becoming as evil as the bad guys. And it's probably illegal; you could get put into jail. So don't do it. Just one thing, the statement at the bottom: traditional DDOS defences work pretty well. Don't go down this path, because otherwise we will start an escalating war on the Internet: the attackers launch something and we react and then they do something, so it will be a battle of the robots on the Net and things will go to hell.
So just one thing: this will continue. All new kinds of attacks will come up. It's a very exciting world which I live in, I am always seeing something new, and sometimes I think living in a cabin at the top of a mountain would be a pretty cool thing, but we need to understand better what they are doing. Because the thing is, the bad guys continue to innovate and come up with new stuff. Since 2016, we have seen at least five different major families of attack tools, which are more and more powerful. So what can we do? We need to see through this fog and understand what they are doing.
And one of the things we are starting to experiment with is infiltrating the botnets. These are actually examples of them launching attacks against someone, and we are seeing those attacks happening as they are being sent out, the attack commands, and we are pretending to be reflectors, so we can understand how they are doing things. So it's trying to gather intelligence, and the idea is we should share this intelligence with the community to help us understand what the vulnerabilities are and what we can do better. If we set up something that looks like a CPE and allow attackers to go into it and inject the malware, that is wonderful, because we get the latest version. We are trying to get into the cycle of what these bad guys are doing so that we can understand them and hopefully know about the attacks before they happen, so we can take the right action before they hit your networks.
Okay. So, it's nice living in a castle, but we need visibility; that is basically what I am trying to say. And also one thing about CPE devices: make extremely strict requirements of your vendors. There was one service provider in, I am not going to say the country, but they recently bought 10,000 CPEs from a certain vendor, installed them and deployed them, and one week later the entire access network went down, because all of them had been hijacked to launch DDOS attacks because of a bug in the CPEs. So make requirements of those guys. Don't buy the cheapest stuff available.
So questions, thoughts?
BRIAN NISBET: So, thank you very much, very interesting. There are some people at the mics.
AUDIENCE SPEAKER: Artem Gavrichenkov from Qrator Labs. Do I get it right that you suggest blocking the source IP addresses of UDP amplifiers, is that true?
STEINHOR BJARNASON: That is one of the angles we are looking into. One of the things, the bad guys are scanning the Internet daily to ‑‑ we want to validate that they are being used to launch attacks before we classify them as evil. So yeah.
AUDIENCE SPEAKER: As a victim, should I block source IP addresses of UDP amplifiers or not?
STEINHOR BJARNASON: Yes.
AUDIENCE SPEAKER: Automatically?
STEINHOR BJARNASON: In almost all cases that will help you.
AUDIENCE SPEAKER: I have actually submitted a lightning talk, let's see if it goes through the PC, but I mean, what you are suggesting is actually, I can't stress it more, totally dangerous. It might be used to conduct a DDOS itself in an obvious way.
STEINHOR BJARNASON: Only do it if you know they are being used maliciously to launch attacks. But you are right in many ways. And I am right as well.
BRIAN NISBET: Everybody is right, it's great. Consensus, meeting over.
GERT DORING: Hello. I want to challenge the bit about the memcached servers, the "they are just innocent victims" bit. I am not advocating shutting them down, I am with you there, but "they are just victims, it's not their fault"? They are running a high bandwidth machine that is killing other people. These people need to be addressed, talked to and educated as well, so just telling them you are a victim, it's not your fault ‑‑
STEINHOR BJARNASON: I understand where you are coming from.
GERT DORING: We need to have discussion on running unsecure systems on the Internet and how to make people liable for that, basically. You can't just run around and not care.
STEINHOR BJARNASON: The answer to that is education: educate people about what should be done, and do good security analysis before you deploy anything on the Internet. There was one guy who actually went out and said those IoT devices, CPEs, are vulnerable and I am going to brick them, and he took down 100,000 of these. That is extreme. I understand where you are coming from. If you have got defences, that will normally help you, like the rate limiting for memcached attacks and so on. It's a grey area.
AUDIENCE SPEAKER: We run an ISP with eyeballs and have been doing the filtering on external ports on our networks for the last four years. I think we found ‑‑ there was a presentation there, and a document was put out at that time. And it works incredibly well for most of the DDOS, and when one goes through, it's probably a new one that has popped up and you need to add it to your rules. If it fills the transit, okay, sure, but that is it. Okay.
RANDY BUSH: IIJ and Arrcus. I am an idiot like nobody else here and I get caught by these; memcached was a surprise. I believe the scanners and notifiers that come to us, "hey, you are vulnerable", are very helpful and make a significant difference. What can we as a community do to encourage and support these activities?
STEINHOR BJARNASON: Good point. Like I said, we are in the early stages of trying to use this approach, but one of the major things and like I said to you earlier is education; before putting stuff on the Internet at least attempt to make it secure, please, otherwise ‑‑
RANDY BUSH: But that target moves.
STEINHOR BJARNASON: Memcached wasn't there a year ago. In one year's time there will be at least 3 or 4 new attack vectors. We have a huge industry of people trying to find new vulnerabilities to use against us. There is no good answer to this, but let's see what we can do. Hopefully I will be able to come back with more realistic things which we as a community can do to get rid of this stuff, or at least limit it.
BRIAN NISBET: Thank you very much.
BRIAN NISBET: I will just remind people there are still seats in the room. We are pretty full, 881 people registered for this meeting with over 500 already on site, so if you could move in a little, and don't use chairs to keep your bag half a metre above the floor or whatever else. There are seats enough for everybody at the moment, so please do facilitate people going in. So, okay. Cool. Next up, Giovane Moura from SIDN Labs on "When the Dike Breaks: Dissecting DNS Defenses During DDoS".
GIOVANE MOURA: Good afternoon. Before I start, I apologise to people who were here yesterday at OARC and saw this presentation, and also at the previous IETF. Let's get started. This work we have actually done together with some people from a university in Holland and also ISI at the University of Southern California.
And the slides are from a paper we are going to present in two weeks' time at IMC, the conference, so if you are interested in the paper you can download it here, too. I don't think I have to introduce DDOS much now, because the previous presenter covered pretty much what you need to know about DDOS, but I work for a DNS operator as a researcher, and of course we are also scared of denial of service attacks. Now that we have botnets, you don't even need reflection: you can have DDOS at terabit scale with packets from many sources. That is what happened in the very well known attack on Dyn, which is a big DNS provider; it was the first attack at terabit level, caused by the Mirai botnet, and it brought down some parts of the networks of Dyn, and some people suffered. So the work is to investigate how much you can rely on the parts of the DNS, and to dissect it: how much protection does it deliver against attacks.
One of them was the root attacks of 2015. This is a graph from DNSMON, provided by RIPE Atlas; you see in red the connectivity problems, because the root servers were the targets. But interestingly enough, even though some of them were down, there are no known reports of errors by users. And then the 2016 terabit‑level attack on Dyn caused more impact: some users could not reach popular websites; I think Twitter was one of those, Netflix, there is a bunch of those. It's very interesting: you see there are two different attacks but they have very different outcomes, so from the point of view of operators and researchers we want to understand why that is the case.
As everybody knows, DNS is pretty much this: a user sends a query and somehow gets an answer. But in reality there are many more components in between. In this figure you see there is a stub resolver here at the bottom, and that would be a user, and the user wants to get the IP address of, say, a .nl domain. The servers who can give that answer are the authoritative name servers; these are the ones who can answer for this particular domain. But the user has to rely on DNS resolvers ‑‑ we just saw the presentation from Sarah, she talked about them ‑‑ and there can be many layers of recursives: you can have two on your network and forward to any quad‑whatever out there, and they can have different caches, fragmented caches. Anyway, when a user asks this particular query (you see here the command for that), the user gets an answer, and in this answer, if the record exists, we have a TTL value, which here is one hour. That's the authoritative server saying: this answer given to you, the recursives, the red ones, you can store it for up to one hour in this particular case. But operators are free to set TTLs for their records. In this figure, the recursive resolvers in the middle have these boxes with caches, and they can be fragmented or not and hosted in different places. We wanted to know, when the authoritative servers are under attack, what happens with these users, in this case the stub resolver: how much can it trust in the cache of the recursive resolvers in the middle, this layer in the middle here. Because those caches are actually designed to help users when authoritatives are under attack, but also to improve performance: you don't have to fetch the same query all the time, you just put it in the cache and it is valid for as long as the TTL set by the authoritative.
And I think the way to understand caches, the TTL value in this case, the best analogy I could come up with is the star: if you play Mario, when you get a star you are good to go. When there is a denial of service attack on the authoritative name servers, you can rely only upon the cache to get an answer, if you are lucky and it's in a cache. And the countdown for the cache is the TTL, which is the star if you play Mario. Anyway, so after this introduction, let's evaluate the ‑‑ types of notification, or whatever but legitimate traffic as well. We broke this down into three different parts. One, we evaluate the user experience under normal operations, when there is no DDOS, to understand how caching actually works in the wild. Then we confirm that with production ‑‑ but also from the roots, but I am just going to include some of it, the rest is in the paper. And then we emulate DDOS attacks on DNS servers and show what happens to the users' experience.
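The TTL countdown described with the Mario star analogy can be sketched as a tiny cache. This is a toy model, not how any particular resolver is implemented; the class name, the method names and the use of explicit timestamps in seconds are assumptions for the example.

```python
class TtlCache:
    """Toy DNS cache: an answer is usable until its TTL has counted down."""

    def __init__(self):
        self._store = {}   # name -> (answer, expiry time in seconds)

    def put(self, name, answer, ttl, now):
        self._store[name] = (answer, now + ttl)

    def get(self, name, now):
        entry = self._store.get(name)
        if entry is None:
            return None                # never cached
        answer, expires = entry
        if now >= expires:
            del self._store[name]      # the star has run out
            return None
        return answer

cache = TtlCache()
cache.put("example.nl", "2001:db8::1", ttl=3600, now=0)
print(cache.get("example.nl", now=1800))   # still within the one-hour TTL
print(cache.get("example.nl", now=3600))   # expired
```

While the entry is valid the resolver can answer without reaching the authoritative servers at all, which is why the cache matters during an attack.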
So how can we measure caching in the wild? What we did: we registered a new domain. I chose one that was never registered and therefore of course would be very unpopular. I set up two authoritative name servers, that would be the servers on top of the previous figure that can actually answer for that domain, and I hosted them on EC2. And I need a lot of vantage points, because I could run a local resolver, but I wanted to get the largest number of resolvers out there that I can get, with different software versions and flavours and whatever. We used RIPE Atlas, about 10,000 probes, but some have more than one resolver; each vantage point is a probe with a local resolver, and I configure each probe to send a unique query so they don't interfere with each other. So they would ask for a AAAA record, which is just IPv6; for probe number 500 we ask for the domain 500.cachetest ‑‑ and I have encoded in the answer a serial value (it's all in the paper), and this value allows me later to tell if the answer contains a serial value which was exactly as ‑‑ I keep changing this value ‑‑ or if it was actually answered by a cache. So just by looking at the serial value I can tell whether it was answered from a cache or not. And then we probe every 20 minutes for a bunch of values of TTL, so I have different scenarios: we change the TTL and probe and see what happens, how the TTL influences how much gets cached. In this particular measurement we only control the authoritative name servers that we run, the top layer, and the bottom layer, the vantage points; we have no idea about the recursives: RIPE Atlas probes can use whatever they want and whatever the users have configured in their own networks. How effective is caching in the wild? This is the figure we got. What we see here on the Y axis is the number of queries, and the X axis is the different experiments. There are a bunch of colours here, but what really matters: focus on the yellow. The yellow ones are cache misses.
These are queries that, in theory, could have been answered by caches, because these particular resolvers should have known the answers, but they did not. What matters is that roughly 30% of the queries are cache misses ‑‑ they could have been answered by a cache but were not; that is the yellow colour you see here ‑‑ and that is regardless of the TTL. That is specific to our non‑popular domain, a domain we only use for 15,000 vantage points, but the good news is that 80% of the time it works well. Then we went and investigated why there are these cache misses. We don't have much knowledge of the resolver infrastructure, but we looked at who was actually answering the queries and so forth, and it turns out that Google Public DNS was one of those, answering roughly half of the queries from RIPE Atlas; they were the middleman there. And if you understand something about anycast recursive resolvers, they are massive and have many sites and multiple servers per site and multi‑level caches, caches with different priorities. Cache misses happen because of complex caches or caches that get flushed, but now we have a baseline for how much you can expect to be cached in normal operations. And we went out and confirmed that with .nl data, and just to move quickly here: we run our name servers and set the TTL to one hour, and we looked into every query in that hour to the name servers for the A records, and computed the time in between two consecutive queries from the same resolvers, and we found roughly 20% of them do not respect our TTL. So it confirms our set‑up that we showed for the test domain that we ran with RIPE Atlas. If you are interested in that, look at the paper; we also looked at the roots, and the figures are similar.
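One way to picture the serial‑value trick is the sketch below. It is an illustration under assumptions, not the method from the paper: it supposes the authoritative servers put a new serial into the answer each probing round, and that the vantage point remembers when it last received a fresh answer. All names and category labels are hypothetical.

```python
def classify_answer(answer_serial, current_serial, last_fresh_time, now, ttl):
    """Classify one answer seen at a vantage point.

    answer_serial    serial encoded in the answer that came back
    current_serial   serial the authoritative servers hand out right now
    last_fresh_time  when this vantage point last got a fresh answer (or None)
    """
    if answer_serial != current_serial:
        return "cache hit"        # an older serial can only come from a cache
    if last_fresh_time is not None and now - last_fresh_time < ttl:
        return "cache miss"       # fetched again although the TTL had not expired
    return "expected fresh"       # nothing valid could have been cached

# TTL of one hour, probing every 20 minutes, serial bumped every round.
print(classify_answer(1, 2, last_fresh_time=0, now=1200, ttl=3600))
print(classify_answer(2, 2, last_fresh_time=0, now=1200, ttl=3600))
print(classify_answer(4, 4, last_fresh_time=0, now=3600, ttl=3600))
```

The "cache miss" category corresponds to the yellow bars described above: queries that could have been answered from a cache but went to the authoritative servers anyway.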
So okay. We know so far how caching works in the wild, for both our test domain and .nl. We are interested in the user experience during DDOS. So, to emulate a denial of service on DNS, we did not DDOS anybody. It would be problematic. What we have done: we use a similar set‑up with RIPE Atlas that I showed before, and since we run the authoritative name servers for our cachetest domain, I start to randomly drop queries at certain rates at the authoritative name servers, and you are going to see how the clients react to that. We want to understand why some denial of service attacks seem to have more impact than others; that is the idea of doing the experiments. I cover a couple of them here. Let's talk about the doomsday scenario, that is when all our servers go down. In this case, if your authoritative name servers can't answer any more queries, what is going to happen is that clients are only going to get answers for the zone if it's in the cache: if it has been queried before and is still valid in the cache. In scenario A we set the TTL of the records for our zone to one hour, we probe every ten minutes, and at time equals ten we drop all the packets. So this is a time series, the graph for this particular DDOS. At time equal to zero ‑‑ the good colour is the blue colour, that is the one you should focus on; the others are pretty bad: not getting an answer, getting ServFail ‑‑ and you see the arrow going down, that is the start of the emulation of the DDOS attack. You see a lot of people still getting answers even though the authoritatives are down: the answers are in cache and they can get them from the local resolvers or whatever is in the recursives. And remember that the TTL for this scenario is one hour, so roughly after one hour, at 60 minutes, we see a drop in the number of people, the blue colour, that get an answer, and after that it's pretty bad.
When the TTL expires, it means the cache would have expired and nobody can get an answer. There are some that still get an answer, and this is serving stale content, which one of the IETF drafts is working on. I like it because that draft says: if you can't reach the authoritative name server, answer your clients with the latest answer you know. And there are some parameters in that. You see after the cache expires, there is a small blue bar here ‑‑ those are people who get an answer with that, 0.2% in this case. What is also interesting here is that during the cache‑only phase, 35 to 70% of the people get an answer, so that shows how caching works.
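The serve‑stale behaviour mentioned here (answer with the last known data when the authoritative servers cannot be reached) can be sketched like this. It is a simplified model, not a rendering of the draft's actual parameters: the class, the `max_stale` limit and the use of `TimeoutError` to signal an unreachable server are assumptions for the example.

```python
class ServeStaleResolver:
    """Toy recursive resolver that falls back to expired answers on failure."""

    def __init__(self, fetch, max_stale=86400):
        self.fetch = fetch          # callable(name) -> (answer, ttl); raises TimeoutError
        self.max_stale = max_stale  # hypothetical cap on how long stale data may be used
        self.cache = {}             # name -> (answer, expiry time)

    def resolve(self, name, now):
        entry = self.cache.get(name)
        if entry and now < entry[1]:
            return entry[0], "cache"
        try:
            answer, ttl = self.fetch(name)
        except TimeoutError:
            # Authoritative unreachable: serve the expired answer if we still may.
            if entry and now < entry[1] + self.max_stale:
                return entry[0], "stale"
            return None, "servfail"
        self.cache[name] = (answer, now + ttl)
        return answer, "authoritative"
```

A resolver without the fallback branch would return a failure as soon as the TTL expired, which is the drop after 60 minutes described in the doomsday scenario.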
Right, so let's repeat the same scenario. In scenario B, we started the DDOS not at time equals 10 minutes but at time equals 59 minutes. That means I am changing the cache freshness: some caches were about to expire, because it's a one hour TTL, so let's see what happens. We see that the cache is much less effective when the DDOS attack starts, when the arrow goes down. You see the colour blue drops far more quickly than before, even though the TTL is the same. Interestingly enough, fragmented caches help somewhat: they might have been filled at another, later time, so some can still get an answer. And when the arrow goes up, at time equals 219 here, the users are very quickly able to get answers from the authoritatives again. Then we changed the TTL to half an hour instead of 60 minutes ‑‑ failure at 60 minutes again ‑‑ and if you reduce the TTL, this graph shows, if you compare to the previous ones, results are going to be worse. So the TTL influences how much gets cached and how many users get an answer in the worst case scenario. So caching is partially successful during a complete DDOS, meaning it can serve the users. Operators of DNS authoritatives cannot expect protection for clients for a period as long as the TTL; it depends exactly on the state of the cache of the users when the attack starts. Serving stale content, answers that were already expired, provides a last resort against the doomsday scenario. Some operators seem to do it, but it's not widespread yet. And TTLs matter: the shorter you set them, the less the cache protects your users.
But it turns out that most DDOS attacks on DNS servers don't lead to complete failure but to partial failure; that was the case with the roots in 2015. Dyn was not totally down, some people could still get an answer. So instead of dropping 100 percent, let's drop different rates of packets, and that gets more interesting. If your server is dropping 50% of the packets, with a TTL of half an hour, and in this case we started at time equals 59 minutes, then even though your server is losing 50% of the packets, most clients are getting an answer. The only thing they are going to notice is in the graph below, which shows the latency; focus on the green line. It takes a little longer to get an answer, but it's still very good, because most clients get an answer and it just takes a little longer. So with 50% packet loss, one in two packets makes it, so they get an answer. Now let's get a little more aggressive and drop 90% of the packets, so 9 in 10 get dropped, and let's see what happens with a TTL of half an hour. In this graph, when the arrow goes down the emulation of the DDOS attack starts. Focus on the blue; blue are the people getting an answer. 90% of the queries are being dropped, and it turns out most clients at this particular rate are getting served as well. In this case it takes a little longer; you can see in the latency graph below some increase in how long it takes to get an answer, but my interpretation is that 60% are getting an answer, and for me this is good engineering: your server is dropping nine in ten packets, and people are still getting an answer, because of the TTL as well.
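As a back‑of‑the‑envelope check of why retries help under partial loss, the following sketch computes the chance that at least one of several attempts gets through. It assumes each attempt is lost independently with the same probability, which is a simplification; the function name and the attempt counts are illustrative and are not taken from the measurements.

```python
def answer_probability(loss_rate, attempts):
    """Probability that at least one of `attempts` independent queries is not dropped."""
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if attempts < 0:
        raise ValueError("attempts must be non-negative")
    # All attempts are lost with probability loss_rate ** attempts.
    return 1.0 - loss_rate ** attempts


if __name__ == "__main__":
    for loss in (0.5, 0.9):
        for tries in (1, 2, 4, 8):
            p = answer_probability(loss, tries)
            print(f"loss={loss:.0%} attempts={tries}: answered with probability {p:.2f}")
```

Under this model a single attempt at 90% loss succeeds one time in ten, while a client or resolver that keeps retrying pushes that figure up considerably, at the cost of extra latency.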
So let's now try to undermine caching for these measurements. In this case we probe every 10 minutes, but let's set the TTL of the record to one minute: if I probe at time equals zero and get an answer, it is cached for only one minute, so the next time I send the same query, it should not be in the cache any more, because more than the TTL has passed. And in the experiment here, with caching pretty much disabled, we see that when the DDOS starts, when the arrow goes down here, 27% of people are still getting an answer even though nine out of ten packets are being dropped. That was mind‑boggling: how can this even be possible? So, let's see why. Part of the DNS resilience is that the recursives that I showed earlier ‑‑ they keep on trying, they are going to actually hammer the hell out of the authoritative name servers, and I think they are doing their job because they are trying to get an answer. And ‑‑ let me first explain the graph here. It shows the number of queries we are getting every ten minutes, and at time equals 60 we start dropping nine out of ten, but I still count them and capture all of them. What do we see here in this shaded area, at the start of the DDOS? We didn't get any DDOS traffic, we were just dropping queries ourselves, and we started seeing way more queries coming in. I call this friendly fire. I think there is an old RFC that says, if you run an authoritative, overprovision by ten times, but in this case you are going to get eight times more traffic as friendly fire for this scenario, so I think maybe it's a good idea to revise this number 10 here. It shows you are going to get eight times friendly fire, and usually this is not noticed during a DDOS. So the implication for the operators of authoritative name servers is that you should be ready for friendly fire in case of DDOS attacks on your servers. So, to wrap up here, caching and retries for DNS work really well. That is how resolvers help users when authoritatives are having really big issues.
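The friendly fire effect can be illustrated with a very rough model: if a resolver retries until it gets an answer or gives up, the number of queries it sends per lookup grows as the loss rate grows. The sketch below assumes independent losses and a fixed retry budget, both of which are assumptions made for illustration; it is not meant to reproduce the eight‑times figure from the measurements, which also reflects several layers of resolvers and clients retrying.

```python
def expected_queries_per_lookup(loss_rate, max_attempts):
    """Expected number of queries sent for one lookup with up to `max_attempts` tries.

    Attempt k (counting from 0) is only sent if the previous k attempts were all lost,
    which happens with probability loss_rate ** k under the independence assumption.
    """
    if not 0.0 <= loss_rate <= 1.0:
        raise ValueError("loss_rate must be between 0 and 1")
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return sum(loss_rate ** k for k in range(max_attempts))


if __name__ == "__main__":
    for loss in (0.0, 0.5, 0.9):
        load = expected_queries_per_lookup(loss, max_attempts=10)
        print(f"loss={loss:.0%}: about {load:.1f} queries arrive per lookup")
```

With no loss every lookup costs one query; as the loss rate approaches 100% the cost approaches the full retry budget, which is the extra load the authoritative has to absorb on top of the attack itself.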
But it works well provided some authoritative name servers stay partially up, because being able to get an answer refreshes your caches, and provided as well that the cache lasts longer than the DDOS attack. If the DDOS attack lasts for one day, trying to bring the infrastructure down for one day, and your TTL is one hour, caching won't be able to save you. And these results here, I kind of like them because they can explain why, with the attack on the roots back in 2015, not many people noticed: I think the roots have a TTL of two days for the DNS record ‑‑ I think one week for AAAA ‑‑ so it would be in the cache even if those attacks last one hour, and for the case of Dyn or some other clients the TTLs were very low, so records very quickly expired from the cache and resolvers would have to query again. So there is a clear trade‑off here between TTL and DNS resilience; operators have the power to tell how long their records are cached and ‑‑ there are a lot of CDNs as well that use very short TTLs to handle DDOS attacks. They have their own infrastructure behind it, do some IGMP stuff and they actually use outage shortage for that. So CDNs are a kind of corner case for that, but our results explain the pains of Dyn customers and the users' perceptions.
This is the first study to dissect DNS resilience from a user's perspective; that is what we were caring about here.
We avoid the design choices of various vendors. Caches and retries play an important role in DNS resilience ‑‑ the developers of many authoritative and recursive resolver software ‑‑ and we are actually able to show when caching works and when it doesn't. It's consistent with recent outages, and the community should be aware of this trade‑off between TTLs and robustness; choose wisely. That is the direction I am moving in right now in the next research: now that we understand the role of TTLs, I want to understand a bit better how we can more carefully set those values. Maybe this is the first study to show that it works, like it's the only hope in many cases, and yeah, I think it's time for questions. That is a dyke near Holland; when a dyke breaks, if it does, we are below, we will all be swimming, and that is not a good idea. So yeah, thanks.
LESLIE CARR: All right. Thank you. You have a few minutes for questions. Remember, to state your name and your affiliation.
AUDIENCE SPEAKER: It's Dave from Oracle + Dyn. Thank you very much for this work; as a co‑author, I am very happy to see it's supportive of that work. There are a lot more interesting details there about how to responsibly use stale data in this case of last resort. I think it's a little funny to call CDNs a corner case, just given how much traffic they are responsible for on the Internet. And apologies to you for this question, because you heard it from me yesterday, but I am doing it for the benefit of the room. This is a proxy for the user experience, is that correct? Because you didn't measure what different applications would do in the face of an answer taking 6 and a half seconds to get back to them.
GIOVANE MOURA: That is absolutely correct. We used RIPE Atlas, and since we use it we are bound to the measurements that we can run, but it would also be interesting research to understand how that would actually impact applications. So we were trying to understand at the DNS level how that would be felt by users, but applications would be a different story. Yeah. Probably worse.
LESLIE CARR: Well, thank you very much.
LESLIE CARR: And now it's time for our lightning talks. Our lightning talks are short presentations submitted sort of at the last minute, so a reminder that if you have any thoughts for presentations, we still have some slots for Friday, so please submit via the submission system. And now Emile from RIPE is going to tell us about BGP routes.
EMILE ABEN: I am going to tell you something about zombie routes, and that is a name that we came up with for stuck routes. So what we are looking into is when a BGP route is withdrawn by the origin but the route is still visible in other places on the Internet, even after path hunting. So the good thing is it doesn't happen too often, but we were actually surprised, and I will show you later, how often it happens. And the bad is, it does happen, so even if you withdraw routes, sometimes stuff just keeps lingering. And the ugly is it's hard to debug; we have long trouble tickets with people who are actually trying to debug this, so these zombies will not eat your brains but will keep them busy. So let me first give you an example. The BGP neon cat, I don't know if people know this piece; it was one of our community members doing this. What was actually done here was using BGP and RIPEstat to just draw a cat picture, and on the Internet cat pictures are important, so this is an important picture. But if you look carefully, it has a red eye, and I talked with Job and this was not intentional: while he was creating this on the Internet, his canvas was actually misbehaving. So he withdrew the route and in our system there were still some lingering routes there for hours, a couple of days even, I think. So this may be funny, but the serious back story here is that if even very experienced BGP operators see these types of effects, something is going on, maybe.
Another one: we had a trouble ticket, somebody saying we withdrew this route months ago and it's still visible in your systems, and, yeah, eventually, three months after this route was withdrawn, one of our RIS ‑‑ our route collector system had to manually get into their systems to remove it, so that is three months. And this is confusing: if you want to know whether something is routed publicly, people look at RIS, and maybe the very experienced will know, ah, if there is a couple lingering, that is probably noise, but what is that noise actually? And our internal ‑‑ internal processes sometimes treat this as something that is still routed. So what we actually did was look at BGP beacons, and this is an interesting thing we do in RIS, our Routing Information System: we have a couple of beacons, like a dozen or so, where a prefix is announced for two hours, withdrawn for two hours and announced for two hours, so you have a stable signal there. And this is a picture, so these are ‑‑ well, the route collectors, all the ‑‑ this is time, so you will see this nice pattern: up two hours, down two hours, up two hours. But you will also see this: sometimes, here, here, here, this route was not withdrawn, and actually it doesn't show up really well, but you can see that there is path hunting going on, so there was some BGP activity going on but still the path continued being there. You also see this phenomenon of longer‑term routes still being announced while they were withdrawn at the origin.
So, if you just do statistics, what is the likelihood of this? We have a bunch of these peers in RIS, like 160 or so, we have a dozen of these prefixes that we bring up and down; how often are they just stuck? I was really surprised to actually see numbers this high, so I'm ‑‑ and it's both Roman, who I really trust, and myself, who I don't trust as much, who calculated these numbers, and we come up with these: over 1% of routes are getting stuck. Again, this is in RIS, and I also added a CDF there because it looks nice. What is nice about these zombie paths is that they are longer than normal routes. So if this really causes you operational trouble, you might actually have a mitigation by just re‑announcing and withdrawing, because your announced path will probably be shorter and you can then withdraw, but that is really a hack. So, well, now ‑‑ we call them zombie routes so that I could use zombies in my talk. We did not see any sign of this increasing or decreasing, but we cannot extrapolate. Will things get worse? We don't know. If there are more prefixes and updates, what is going to happen? Is this like a leaky faucet that will explode? What happens if you have a massive route leak and then things start settling down, will that have a lasting effect if there is stuff lingering around? So these are all open things that we are thinking about. We don't have answers here, of course. And we want to know what causes these. Are they software bugs? We asked around a little bit, in routers for instance, whether BGP implementations have some bugs, and we heard people talk about BGP optimizers causing stuff like this. Is it route reflector set‑ups that are leaky, or is it our route collector system itself? We are pretty confident it's not the route collector sessions, because if we bounce sessions these zombies come back. It's in all the other systems that you see these types of things.
What would be interesting to see is whether it's correlated to update rates, so that might be another signal that update rates are actually a problem.
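To make the idea of counting stuck routes concrete, here is a minimal sketch of how one might flag zombie peers for a single beacon withdrawal. It is not the analysis behind the numbers in the talk: the `Update` record, the function names and the 1.5‑hour grace period are all assumptions for illustration, chosen only so that the check falls inside the two‑hour withdrawn window of a beacon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    timestamp: int  # seconds since some epoch
    peer: str       # identifier of the peer that sent the update
    kind: str       # "A" for announcement, "W" for withdrawal


def find_zombie_peers(updates, withdrawn_at, grace_seconds=5400):
    """Peers whose latest update by withdrawn_at + grace is still an announcement."""
    deadline = withdrawn_at + grace_seconds
    last_kind = {}
    for update in sorted(updates, key=lambda u: u.timestamp):
        if update.timestamp <= deadline:
            last_kind[update.peer] = update.kind
    return sorted(peer for peer, kind in last_kind.items() if kind == "A")


def stuck_fraction(updates, withdrawn_at, grace_seconds=5400):
    """Share of peers seen in `updates` that still hold the route after the grace period."""
    peers = {update.peer for update in updates}
    if not peers:
        return 0.0
    return len(find_zombie_peers(updates, withdrawn_at, grace_seconds)) / len(peers)
```

Updates that arrive after the deadline are ignored on purpose: a peer that only withdraws the route many hours later still counts as having held a zombie during the window being checked.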
So, we want to figure out more about this. What you typically see in BGPlay, if people know the tool, is that when there is a withdrawal you will see a couple of routes stuck, lingering. It might be nice if we make BGPlay zombie‑aware, but this needs us to find these things in the wild; up to now we have looked at these RIS beacons and we don't know how representative they are of things in the wild. Would people want to be alerted about this? Does this increase your table sizes? And do you want to know if this happens in your network? We want to create awareness, because if I talk to researchers they are typically not aware this is a thing, whereas network operators probably do know about stuck routes.
And I would also like to have some operator feedback. Does this cause trouble for people? Do people know what causes this? Does this cause trouble for you that you cannot solve by yourself but need support trouble tickets with other people for? So, that's it from me. Is there time for questions?
BRIAN NISBET: There are. This is going to be fun because there is a whole two minutes and that's it.
WOLFGANG TREMMEL: The first time I have seen this was 20 years ago and I didn't realise it's still a thing.
RANDY BUSH: Wedgies? Are any of them wedgies? B) what is the cause?
EMILE ABEN: I don't think these are wedgies because it's withdrawals, but...
AUDIENCE SPEAKER: Will van Gulik, Swiss IXP ‑‑ I heard recently that something like that with a zombie prefix happened on the SwissIX, I think on the route servers, and I don't think ‑‑ so it exists and it's not necessarily related to your infrastructure, because we see that somewhere else. So I will try to get some information back, because maybe that might be helpful for everyone, and I cannot remember all the details on that.
EMILE ABEN: Thank you, that would be very helpful.
PETER HESSLER: With Hostserver. We have seen this in the wild, we have seen this through our transit provider and through other sources so it does exist, it's really annoying and how can we detect it other than oops, the route no longer works or the operator told us they turned off that router? How can we detect this?
EMILE ABEN: What we see is for the RIS peers it's like 160 peers going to a low level but we don't know if that means somebody did some weird traffic engineering or if it's actually a zombie. That is the short version.
BRIAN NISBET: We only have ten minutes for each one of those slots, that is the arrangement; if the speaker speaks longer there is less time for questions. So, our next lightning talk is on tracing cross‑border web tracking and that is from Costas Iordanou.
COSTAS IORDANOU: I am a fourth‑year PhD student at the Berlin, so today I am going to talk about exciting work I have been doing with colleagues from University of Berlin, University of Madrid and Data Transparency Lab. I will present a methodology on how we can trace cross‑border web tracking. This work has been accepted at the Internet conference in Boston, USA, and will be presented next month. So let's start.
To give some information on why geolocation of tracking domains is important, let's provide some detail on the new European Union General Data Protection Regulation, GDPR, which offers protection to European citizens across a wide range of privacy threats, including tracking on sensitive categories, and this was the biggest change with respect to privacy regulations on the web in the last few years. The implementation date of GDPR across the European Union was May 25th 2018. In general, the new legislation tries to regulate how users' data are collected, processed and stored, and whether they include any sensitive information about the user. The implementation of the legislation is left to each European Member State's data protection authority. The national DPA is responsible for handling any complaints of citizens or legal entities, so it is important to know how many tracking flows cross national borders and where the tracking servers are physically located. We can use this information as a starting point for investigations.
Now, how can we identify the physical location of such servers? To answer this question, we develop the following methodology:
As the first step, we use real users, using a browser extension, mainly for two reasons. First, they provide good geolocation diversity for each third‑party domain that the users come across while surfing the web. The second reason is that real users can interact with the visited websites and use logins, maximising the third‑party domains we observe. At the same time, and within the browser extension, we also map the IP addresses that we observe for each third‑party domain. The second step involves a filtering stage where we use different filter lists to identify tracking domains out of all the third‑party requests that we observe in our dataset. The last step is the geolocation, and in the following slides I will only talk about that.
So, here we have a diagram where we present the results from users located within the EU 28 countries, on the left side of the diagram, and the corresponding destination continents of the tracking flows that we observed from those users, on the right side of the diagram. Note that in this case we use the MaxMind geolocation service. So we observe that most of the tracking flows, almost 66%, are travelling to North America and only 33% stay within EU 28. This is an error in the analysis, and it is due to the fact that the IP addresses of the infrastructure servers in the MaxMind database are assigned to the physical location of the legal entity using those IPs. So to solve this problem we use the IP map tool, but first we need to make sure that we are not making a mistake again. So to validate the accuracy of the IP map tool we use the publicly available information of Amazon and Microsoft as our ground truth. For all the IPs belonging to the two Cloud services that were active and that we detected in our system, we compared the geolocation that we collected against the one published by each Cloud service. At the country level we observed 99.6% accuracy and 100% accuracy at the ‑‑ for each IP geolocation request that we sent to the IP map tool we receive about 100 RIPE Atlas responses and we use majority voting to geolocate the IP. The small error rate that we observe in our results at the country level is usually due to misassignments between neighbouring countries.
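The majority‑voting step can be sketched in a few lines. This is only an illustration of the idea, assuming each probe response has already been reduced to a country code; the function name `majority_country` and the input format are hypothetical and do not reflect how the IP map tool exposes its results.

```python
from collections import Counter


def majority_country(probe_votes):
    """Return (country, share_of_votes) for the most common country among probe votes.

    `probe_votes` is a list of country codes, one per measurement response.
    Returns (None, 0.0) when there are no votes. Ties are broken in favour of
    the country that appears first in the input.
    """
    if not probe_votes:
        return None, 0.0
    country, count = Counter(probe_votes).most_common(1)[0]
    return country, count / len(probe_votes)


if __name__ == "__main__":
    votes = ["DE"] * 70 + ["NL"] * 25 + ["BE"] * 5
    country, share = majority_country(votes)
    print(f"{country} wins with {share:.0%} of {len(votes)} responses")
```

Keeping the winning share alongside the country makes it easy to spot weak majorities, which is where confusion between neighbouring countries would be expected to show up.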
So now let's see the results again. Now we observe, in the right figure, which utilises the IP map tool, that we have almost 80% of tracking flows ending in EU 28, and only ‑‑ almost 11 percent going to North America. So ‑‑ to confirm the results of this type of study, we used millions of tracking flows from four daily snapshots from four European ISPs. Each table represents one ISP and is divided into the four daily snapshots. We have the German ones, one in Poland and one in Hungary. Each row ‑‑ the percentage for the corresponding continent in the first column. At the bottom, we also plot the time‑line of the tracking flows for the EU 28 countries; the time‑line is on the X axis and the tracking flow termination percentage is on the Y axis. We also mark the time‑line with the GDPR activation date. So the German ISP is here, the blue dots are the German mobile ISP, and the ones coloured yellow correspond to Poland and Hungary. We have the results from the first snapshot on November 8th 2017. Next, we have the results from the second snapshot on April 4th, the results from the third snapshot on May 16th 2018, and finally, the results on June 20, after the GDPR activation date. We observed that the confinement levels within EU 28 are stable over time, before and after the GDPR activation date. The results from the ISPs show comparable confinement levels to those computed with the real user data that we presented in the previous slides. Based on our results we conclude that most of the tracking domains already had a presence in EU 28 countries before the GDPR activation date, and this is to be expected, since serving tracking advertisements is time‑constrained, so most ad‑related domains and their tracking collaborators had a physical presence in Europe to minimise these kinds of delays when serving tracking ‑‑
So in the paper we have more details related to the methodology, confinement improvement suggestions and detailed results.
And with that, I'm ready to take questions.
AUDIENCE SPEAKER: Dave from Oracle + Dyn. Did you identify who the non‑confineers were? Was it only a couple of organisations that were responsible for most of it or was it kind of scattered all over?
COSTAS IORDANOU: Actually, we have some plots in the paper where we show the percentage of confinement across the third‑party domains that we identified. So I suggest you have a look at the paper to see the results there.
LESLIE CARR: Thank you very much.
LESLIE CARR: Next up is Daniel from DE‑CIX to tell us all about Booter services.
DANIEL KOPP: I am with the research department at DE‑CIX and I am glad to show you some results of a side project we have done lately. So it's about DDOS attacks and especially about Booter services. They grow and grow, and we have seen 1.7 terabits, so this also frightens some people, and the question is: how dangerous can a Booter service really be? We were approached by a federal agency to help them record Booter services and get more insight into that, so we built this dedicated system to record DDOS attacks up to 10 gigabits. In case you haven't seen it, this is the website of a Booter service. It's easy to find, you can just Google it and you find tons of services, and you don't need any technical experience to use them. They provide different service plans that you can buy; they start at five dollars and go up to 200 dollars. Usually they have two different service levels, they call one basic and the other VIP service, and they provide much more bandwidth if you use the VIP service. And it goes like a flat rate for DDOS attacks, basically: with a basic service you can attack around 50 times a day, and the service runs for 30 days for just 5 dollars, and yeah, you can run one or multiple attacks at the same time. They provide different types of attacks; for example, on the right you can see the box where you actually order the DDOS attack: you type in a URL or IP address and order whatever ‑‑ UDP attacks, DNS, NTP, and there are also a lot of attacks that target applications, and they claim to offer like five to gigabits. So the payment ‑‑ yeah, there also exist fake services; we bought some services that just didn't happen, the money was gone, basically. We didn't buy them, it was the government agency. So the payment is usually done with crypto currency, done through a broker, so they probably hide behind them, and it takes time to activate them, and as I said, like 20 to 200 euros you can pay for them.
We built a dedicated system to measure the attacks because, yeah, the goal was to keep the impact as minimal as possible and also to be able to record as failure‑free as possible, so you don't impact other systems on other networks. The goal was also to have really good connectivity so you can see where the DDOS is coming from.
So, there were some pitfalls in building that. Most important was to have a dedicated network card that you can mirror to record the traffic. You also need a SAS drive so that you can write really fast to your storage, and a dedicated RAID controller. Also very difficult was the system set‑up, because it was just one single system deployed at the IXP and we used the software BIRD for that, and you have to be really careful that you don't flood the whole peering network. And then for Internet connectivity we had a 10 gigabit peering port and transit port, our own ASN and our own IPv4 space. The measurement limits for the set‑up: if you use tcpdump, of course we have a 10 gigabit card, so it's just 10. If you use sFlow it's on the switch port, so again it's 10 gigabits, but if you use IPFIX we can basically see the whole attack coming in because it's ingress in the ISP, so we can see up to the whole IXP bandwidth, basically. So this was the most powerful Booter service that we could find, and these are the services and the bandwidth you can get. The biggest one we saw is 20 gigabits with this VIP service, with an NTP reflection attack, and it had 4 million packets per second, and after that this is the free service, which was around 5 gigabits, and this was just 20 dollars actually for a whole month. And then there were a lot of non‑amplified attacks; they said it's direct but we didn't have a deeper look into that yet, they were just around 70 megabits. So, the DDOS NTP reflection attack came from 930 source IP addresses and from 350 source ASNs, and interesting was that the top three networks sending the traffic were from China, Taiwan and Hungary, and they accounted for 23% of the whole traffic. Also interesting: 80% was received via the transit and only 20% was from the IXP.
Also the memcached attack: we thought that would be the biggest one, and with the whole set‑up we were frightened about what would happen, because if you say there are 1.7 terabit attacks, you don't really know, so we all had backup plans in case something went wrong. Luckily the memcached attack was ‑‑ it used 300 reflectors and 150 source ASNs. What was interesting was that the NTP reflectors were usually located in Asia and the memcached ones were mostly in Europe.
So it was really just an early side project and we have just started with that, so maybe there will be more in the future, and if you are interested or if you have good ideas about what to do with the stuff we already recorded, feel free to ask me, I will be around the whole week. For us, we will try to build better strategies; we have already tested some mitigation strategies at an ISP with that service, and we want to pinpoint the problems: where is it coming from, how can we solve it? Yeah. So I am open for questions.
BRIAN NISBET: Thank you very much.
AUDIENCE SPEAKER: Hello. I have a somewhat special question. Can you go back to the payment screen, the fourth screen or something like that. Yeah. There is an invoice number. Did you actually get an invoice for that...?
DANIEL KOPP: I mean, they say they are legit services and say that they are just providing a stress test of your networks. I mean, I can see that the service can be interesting if you want to test DDOS mitigation strategies or something else, but of course they are abusing reflectors on the Internet.
AUDIENCE SPEAKER: But did you get an actual invoice?
DANIEL KOPP: I didn't really pay attention to the invoice.
AUDIENCE SPEAKER: But you paid?
DANIEL KOPP: Yeah.
BRIAN NISBET: Clearly the German State does not have good procurement rules.
ERIK BAIS: I did some presentations on a similar topic. We actually created a rating system for peers on Internet Exchanges, so you can actually see where the traffic might come from, and which networks actually have amplification services or devices that are vulnerable to amplification attacks in their network. So I am more than happy to share the API with you so that you can have a look, because the actual problem here is to fix the vulnerable servers and not to hunt down the Booter servers.
DANIEL KOPP: I think that is also what the federal agency was trying to do. As far as I know, they approached all the networks with reflectors, that is what I understood.
BRIAN NISBET: Okay. Any other questions? No. Okay. Thank you very much.
AUDIENCE SPEAKER: Hello. I want to comment on the last comment, that we need to fix the vulnerable systems. On one hand that is not always possible, because DNS servers are meant to answer queries, and on the other hand, by hunting down the Booter servers we can maybe also hunt down the persons who are using the Booter services, for prosecution, so hunting down Booter servers does actually have a real meaning. Thank you.
DANIEL KOPP: Interesting, the services even offered you to scan the Internet for open resolvers, for example.
BRIAN NISBET: Okay. Now, thank you very much.
BRIAN NISBET: So that concludes our second plenary session today. There are a number of things, including the BCOP task force BoF this evening. There is the welcome reception; all the information is in your packs. Please rate the talks, and if you are interested in joining the PC, please send a bio and a picture to PC at RIPE dot net. Again, the information is on the website. Thank you all very much.
LIVE CAPTIONING BY AOIFE DOWNES RPR