Plenary
Tuesday, 16th October 2018
9 a.m.
ONDREJ SURY: Good morning. I know it's early but it's time to start. I am Ondrej, this is Osama, we are the Programme Committee members and we will be chairing this session. There is still time to nominate yourself to the Programme Committee if you want to help organising the RIPE events, and the time is, the cut‑off date is half past three today. The nominees will have a chance to present themselves at the start of the session that starts at 4:00 and, at the same time, the voting will begin.
Also, don't forget to rate the presentations you see; it will greatly help the Programme Committee to make the RIPE meetings even better next time.
So our first presentation will be done by Henrik. Please.
HENRIK KRAMSHOEJ: Thank you. Good morning. I am a security consultant from Copenhagen, Denmark, working in infrastructure, and one of the things I did this year was an audit of a VXLAN‑enabled network. So I have this presentation, VXLAN security or injection; it's on GitHub and you are free to use it, so use it all you want. We should talk about VXLAN because VXLAN is something that is being pushed by a lot of different vendors in our industry. We have all the big vendors and all the smaller vendors and all the virtualisation platforms; they all seem to want to use VXLAN for different purposes and reasons. Some of them maybe don't want to talk to their network department, maybe there are problems internally in the organisation, and it's a very easy way to transport a lot of data around in the network. So the reason I'm doing this talk is that we are seeing these networks with VXLAN being used across the Internet; we have live networks with production traffic which are insecure today, and we need to fix this. We have a lot of vendors that have the speed of implementation, we have hardware support and wire speed, we have all of the good stuff, but they never seem to really care about security. At least they don't want to talk much about security, because then the other vendors that don't talk as much about it may seem more interesting to the users, and that is a pattern we have seen a lot in our industry: people buy the stuff that seems easy and not the stuff that seems complicated. I need some help in designing network patterns, which is why I think RIPE is a very good place to have this talk; we need network patterns for how to do this in a more appropriate way. In smaller networks it might be quite easy: a VLAN‑enabled trunk, and we can secure that, basically. Then it gets harder as we move to different data centres that we connect with fibre, and if we try to do it across the Internet everything seems to fall apart security‑wise, and I will be happy to discuss this with all of you.
We need to increase visibility into VXLAN attacks. VXLAN traffic is much like the other traffic we have, but a lot of our devices, firewalls, intrusion detection systems and the tools we use for monitoring our networks are not VXLAN‑aware, which makes it a kind of hidden, ships‑in‑the‑night approach where attackers can just send data and attacks across our networks without us seeing them.
I think we should stop repeating the same mistakes. We have seen a lot of tunnelling being done by attackers, and if you use VXLAN across data centres you have a complex problem on your hands which is not getting much focus right now.
To recap for those of you who don't know VXLAN: it's a tunnelling protocol. We have these routers in different data centres, router 1 and router 2, connected in some way through the Internet, and we have a host on the inside, 10.0.0.10, which needs to connect to 10.0.0.20, and it's the same network at both ends. We want to connect those because that is very easy and we can move servers and services from one data centre to another. The top part of the diagram shows an IP packet which has the routers' outside, external addresses as source and destination; inside it is a UDP packet with port 4789, and inside that the data is actually just a packet, but it's a Layer 2 packet which includes the Ethernet header and the IP header, source, destination, and it can contain anything that you can put into packets.
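As a rough illustration of the encapsulation just described, the sketch below packs and unpacks the 8‑byte VXLAN header that sits between the outer UDP header (port 4789) and the inner Ethernet frame. The field layout is taken from RFC 7348, which the talk does not quote; the function names and the VNI value 100 are assumptions made for this example.

```python
import struct

VXLAN_PORT = 4789       # UDP destination port mentioned in the talk
VXLAN_FLAG_VNI = 0x08   # "I" flag from RFC 7348: the VNI field is valid


def pack_vxlan_header(vni: int) -> bytes:
    """Return the 8-byte VXLAN header for a 24-bit VNI."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: flags (8 bits) followed by 24 reserved bits.
    # Word 1: VNI (24 bits) followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_VNI << 24, vni << 8)


def unpack_vxlan_header(data: bytes) -> int:
    """Return the VNI from a VXLAN header, or raise if the I flag is unset."""
    word0, word1 = struct.unpack("!II", data[:8])
    if not (word0 >> 24) & VXLAN_FLAG_VNI:
        raise ValueError("VNI flag not set")
    return word1 >> 8


# Hypothetical VNI, chosen only for the example.
header = pack_vxlan_header(100)
```

Nothing in this header identifies or authenticates the sender, which is the property the rest of the talk builds on.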
Most often we also see a lot of VLANs involved with this, and I am not going to go much into that this time. It's quite easy to get going using either OpenBSD or Linux and connect two devices, so it's quite easy to get working if you know your way around Linux. Not a lot of security information is available, and by itself VXLAN does not provide any security at all; there is nothing. So if you configure it, it will go right through your network and you will immediately have problems like spoofed packets. If we take the two routers that I showed you, it's very easy to take one of the IP addresses on the outside and then spoof packets coming from another place.
If you want to secure VXLAN you have to isolate it, but you also want to have the reachability, the availability, so it's really hard to get this right. When I look at some of the vendor documents, from the provider that I am using, they have a VXLAN security document; it's quite large, it doesn't cover all the use cases and it doesn't really tell you how to do this in a good manner. I have talked to some of my peers in the networking world and we seem to find the same: the vendor documents are very large and tell you how to implement it and get it working, and then on page 8 of the PDF it says, remember to secure your network. And it's like, yeah, sure, but how? It's not detailed as part of the regular how‑to for setting up VXLAN. In some cases I see a lot of blog posts about people doing VXLAN through some kind of cloud provider, so you have a VPS somewhere and use VXLAN across the Internet with no confidentiality, no encryption and no authenticity of the packets, nothing. Using IPsec would perhaps be the best, but it creates a lot of other problems: if you have a broadcast storm on your Layer 2 network, that will create a lot of packets that each need to be transported across IPsec, so you need a good CPU, and latency and all sorts of other problems might come from that. Currently we have a huge gap in understanding these issues and a lot of missing security tool coverage, and we have seen hackers in the early days, years ago, use stuff like ‑‑ IPv6 to exfiltrate data, so I am pretty sure they will use VXLAN too. So what I'm saying is that you can produce VXLAN packets and send them across the Internet with a spoofed source IP address; they will get accepted by the devices, in hardware, line‑speed fast, and then they will get injected on to the Layer 2 which is behind the firewall, on the inside of your network. And actually, the one producing the packets decides which VLAN the packets will end up on, so it's really a bad situation.
I tried to do this drawing, which is the earlier one with an attacker on top. So we have an attacker on 185.27.115.666; that is my attack server. From my attack server I can send spoofed packets that claim to come from 192.0.2.10. They get sent across the Internet because they have a destination IP address, and that is what the Internet does. The packet arrives at the destination, which looks at the source and says, oh, that is from my peer in the other data centre, let's just decapsulate this packet and put it on to the right VLAN. And that is something you can do very easily, because it's just UDP packets and the Internet can transport UDP packets fine; I love UDP, it's fine and great and no problems. So, what happens next is that with this method it's possible to inject all sorts of traffic, so a lot of the old use cases and all of the old security problems we have seen on Layer 2 networks and VLAN networks get amplified. Now we can send ARP spoofing across the Internet, it's really that fun: you can do ARP spoofing inside a network and send traffic behind the firewall. Any kind of DDoS device will not see this traffic, so it will go directly to the servers on the inside, and it will actually look like it's coming from the other side. Depending on the source address you put inside the VXLAN packet, it will look like it's from a server which is not connected to the network; it might be an RFC 1918 address. The servers will also send packets back to that address, so you will create more packets inside the network which will get VXLAN‑encapsulated and sent to the other peer, and it's really hard to monitor this in a good way with the current tools we have.
You can also do some other fun stuff, which is something I have played a little bit with: you can inject UDP packets which seem to be sourced from the inside of the network, and if you send them through the firewall they will get sent outside. So what is the point of sending injected traffic from inside the network out through the firewall, because you are just sending it to yourself then? The thing is, you can create state in the firewall, which has some repercussions later on. These were some of the first attacks I realised and tried to get implemented. I did it with a ‑‑ packet creation library with Python and it only took a few hours to get the first working examples, so it is really easy to get these working, and I am going to try a lot of other attacks across VXLAN and try to break both implementations and switches and stuff like that. We have the problem of making this visible, so that the firewalls and the IDSes and so on can see it. Currently I am brainstorming a list of attacks so we can create a large test plan. I am also trying to do various tool enhancements, and if you are a C programmer we should talk, because I have some patches which are not very clean and pretty to look at, so I would like some help with extending tools and trying to push knowledge about secure VXLAN deployment. So, one of the scenarios that I was working with is sending a UDP DNS request to an inside server. In this case, the method listed on the slide now is that I want to do a lookup on an internal server. This internal server might not even have a public IP. If you have an internal DNS server which is not reachable from the Internet, it might not be configured with a lot of restrictions.
Let's say it was an Active Directory server, so you have that DNS server on the inside. By creating these VXLAN packets you can just send them to this destination through the Internet; the packet gets decapsulated and sent to the DNS server, and that does what a DNS server does: it responds to the IP address listed as the source, sending the response out through the default gateway. In many cases UDP responses back out through the firewall are allowed; the firewalls in this network in particular don't do a lot of UDP DNS inspection, so they just allow the UDP to come out.
So, that is something you can do to an inside server on an RFC 1918 private address, and it was tested using Cloud firewalls. Doing this with Scapy is ‑‑ you define a variable saying you want an Ethernet packet carrying a UDP packet and a VXLAN header, which is supported. Then, at the bottom of the slide, you actually create the packet: you have the VXLAN header in front and then just add the Ethernet packet. So this is just a couple of lines creating a DNS request that gets sent across the network on to this network.
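The slide's Scapy code is not reproduced in the transcript. As a stand‑in, here is a minimal sketch of the same layering using only the Python standard library: a DNS query inside UDP, inside IPv4, inside an Ethernet frame, with the VXLAN header in front. It only builds bytes and sends nothing; the outer IP and UDP headers are left out. All addresses, the VNI, the source MAC and the helper names are assumptions for illustration, and the VXLAN header layout is taken from RFC 7348.

```python
import socket
import struct

BROADCAST_MAC = b"\xff" * 6                # broadcast destination MAC
SRC_MAC = bytes.fromhex("020000000001")    # hypothetical locally administered MAC


def ipv4_checksum(header: bytes) -> int:
    """Standard Internet checksum over a header of even length."""
    total = sum(struct.unpack("!%dH" % (len(header) // 2), header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in name.split("."))
    return header + qname + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def inner_frame(src_ip: str, dst_ip: str, sport: int, payload: bytes) -> bytes:
    """Ethernet / IPv4 / UDP frame addressed to port 53."""
    # UDP checksum 0 means "no checksum", which is permitted over IPv4.
    udp = struct.pack("!HHHH", sport, 53, 8 + len(payload), 0) + payload
    ip = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20 + len(udp), 0, 0, 64, 17, 0,
                     socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    ip = ip[:10] + struct.pack("!H", ipv4_checksum(ip)) + ip[12:]
    eth = BROADCAST_MAC + SRC_MAC + struct.pack("!H", 0x0800)
    return eth + ip + udp


def vxlan_payload(vni: int, frame: bytes) -> bytes:
    """Prefix the inner frame with the 8-byte VXLAN header (I flag set)."""
    return struct.pack("!II", 0x08 << 24, vni << 8) + frame


# Hypothetical values: documentation source address, internal target, VNI 100.
payload = vxlan_payload(100, inner_frame("198.51.100.7", "10.0.0.10", 40001,
                                         dns_query("example.com")))
```

The result is what would travel as the UDP payload to port 4789; a packet library such as Scapy expresses the same stack in a couple of lines.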
And a fun fact is that you might think you need a lot of information to do these attacks, but I found through some of my experiments that you don't have to guess the Ethernet MAC address of the internal server. You would think that would be a requirement, but if you send a UDP packet with a DNS request in an Ethernet packet with the broadcast MAC as the destination, my OpenBSD with Unbound takes that UDP packet anyway and responds to it. So some of the restrictions that you might think were in place are not in place. And in any case, in some of these cases you could just generate a lot of packets: even though I'm showing it with one packet, I also have a tool called hping3 which I extended so it can produce VXLAN headers, and that can do millions of packets per second, so I could just scan the whole RFC 1918 space with DNS requests.
The thing to recognise is that when you do these tricks and send all of these packets, they get sent into the network and the responses go out through the regular path, which, if the server has a private address, will typically involve some kind of NATing. So when you get the responses you don't get them from 10.0.0.10, you get them from the NATed IP address. So this is just a little bit of Python that looks at the outside network, in this case 192.0.2.0/24, so that is where the response packets will be coming from.
But the source port that you used in the DNS request will now be the destination port so at least the port will be the same so you can recognise the traffic when it comes out again.
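The Python shown on the slide is not part of the transcript. A minimal sketch of the matching logic described here, assuming responses are recognised by their NATed source network and by the port originally used as the request's source port, could look like this. The function name and the port numbers are hypothetical.

```python
import ipaddress

# Outside (NATed) network named in the talk.
OUTSIDE_NET = ipaddress.ip_network("192.0.2.0/24")


def is_expected_response(src_ip: str, dst_port: int, sent_ports: set) -> bool:
    """True if a packet looks like a reply to one of our injected requests.

    The reply arrives from the NATed outside address rather than the internal
    server, and its destination port equals the source port of the request.
    """
    return ipaddress.ip_address(src_ip) in OUTSIDE_NET and dst_port in sent_ports


# Hypothetical source ports used for the injected requests.
sent_ports = {40001, 40002}
```

For example, `is_expected_response("192.0.2.1", 40001, sent_ports)` is true, while a packet from 10.0.0.10 or to an unrelated port is not matched.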
Moving on to a more advanced scenario: I was talking about sending UDP traffic into the network which gets out through the firewall. In this case we are again attacking this internal server, but now we are using its IP address as the source of the packets we are injecting. So if I inject packets with the source 10.0.0.10 and send them towards the firewall, through the VXLAN, they get sent out through the firewall, but they create state in the firewall, and when you create state in the firewall, it expects replies to come back that way. If you force it open with UDP packets going out through the firewall, you can afterwards just do normal requests across this tunnel that you have now created, or that UDP connection that you now have. It's not a connection, but it has state in the firewall. So, sending to the firewall from the inside is allowed, and it sends the packet out to your server. Your server, as we saw before, knows which port to expect it on, and can then take the IP address and just start doing regular DNS requests. So if you use this method to open the firewall, you would afterwards be able to use zone transfers or all of the other stuff from the outside against this server, not really with TCP, but you can do a lot of DNS requests across this UDP connection.
And again I tested that this works on some of the open source operating systems that I have available, and the Python is not very difficult: it's very easy to get a packet, extract the information and then start sending requests. I'm thinking we could use this to abuse even more protocols. Say there is an FTP application layer gateway on the firewall: we could spoof FTP packets and use that to open a connection, so we could use TCP through this method as well.
Some people might say you need a lot of information to do this, and it is true that, in my case, I knew the network, I had access to the network, I could debug it and monitor and Wireshark all the parts of it. But the thing is that we can generate a lot of packets; we can generate ten million packets per second from regular platforms easily. So in some of these cases we might be able to brute‑force it by sending lots of packets and waiting for a response: when we match the parameters, the IP addresses, the VLANs and so on, we will get a response out through this VXLAN network. And also, the information that is needed to carry out this attack is something that you would find in blog posts, in technical support requests, in forums on the Internet; any former employee or consultant would know some of it. It is not information we would consider secret. So, I guess a lot of you are using 10/8 for your internal network. Sure. A lot of company networks will use 192.168 private addresses, so it's quite easy to narrow down the parameters that you want to test, and you can generate a lot of these packets. Also, when I do testing I have seen a lot of devices, also in banks, that have SNMP open, and that, as a lot of you will know already, will have all this information. So by using SNMP you can find the ARP tables and MAC tables, take a lot of that information and all of the VLAN information, insert it into your Python programme, and then it will work pretty easily. So do you think some of this is possible in real networks? Can we see some hands? Are people awake? It is possible, thank you. Because I think we need to do something about this.
What I'm currently doing to help with this is a lot of extensions to tools. I'm doing patches for Bro, the Bro security monitor that was renamed, and for Suricata, which is a great IDS engine. I have done patches for these two; it seems that very few lines of code are needed, and I am going to submit them this week. If somebody wants to help me by looking through my patch and seeing whether it works or whether we should change anything, that would be great; you don't have to, but it would be great. Also, hping3, which is a packet generator tool, is a very fast tool and I have made some changes to that. In some cases you might also be able to create a VXLAN interface on your Linux box and route packets in through that, and they will get sent away with a VXLAN header.
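The Bro and Suricata patches themselves are not shown in the talk. To illustrate what "VXLAN‑aware" means for a monitoring tool, the sketch below peels the VXLAN header off a UDP payload and reports the VNI and the inner addresses, so that the inner traffic can be inspected like any other. It is an illustrative decoder under assumptions, not the speaker's patch: the header layout is from RFC 7348, inner 802.1Q tags are not handled, and the sample bytes are made up.

```python
import socket
import struct


def decode_vxlan(udp_payload: bytes) -> dict:
    """Decode the VXLAN header and the inner Ethernet/IPv4 addresses."""
    if len(udp_payload) < 8 + 14:
        raise ValueError("too short for VXLAN plus Ethernet headers")
    word0, word1 = struct.unpack("!II", udp_payload[:8])
    if not (word0 >> 24) & 0x08:
        raise ValueError("VNI flag not set")
    inner = udp_payload[8:]
    dst_mac, src_mac, ethertype = struct.unpack("!6s6sH", inner[:14])
    info = {
        "vni": word1 >> 8,
        "src_mac": src_mac.hex(":"),
        "dst_mac": dst_mac.hex(":"),
        "ethertype": ethertype,
    }
    # Inner IPv4 header starts after the 14-byte Ethernet header;
    # source and destination addresses sit at offsets 12 and 16 within it.
    if ethertype == 0x0800 and len(inner) >= 34:
        info["src_ip"] = socket.inet_ntoa(inner[26:30])
        info["dst_ip"] = socket.inet_ntoa(inner[30:34])
    return info


# Made-up sample: VNI 42, broadcast destination MAC, truncated IPv4 header.
sample = (struct.pack("!II", 0x08 << 24, 42 << 8)
          + b"\xff" * 6 + bytes.fromhex("020000000001") + struct.pack("!H", 0x0800)
          + bytes(12) + socket.inet_aton("10.0.0.10") + socket.inet_aton("10.0.0.20"))
info = decode_vxlan(sample)
```

A sensor that performs this step can apply its normal rules to the inner source and destination, instead of seeing only UDP between two tunnel end points.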
All of these things are on GitHub, on my account, which is ‑‑ I did a ‑‑ the hping tool is not really being maintained any more, but we can add to it anyway. I also have some example scripts, and if some of you want to see those, the people that I trust, and I trust all of you at RIPE of course, you can see them; it's not magical in any way.
The lessons I learned from this: VXLAN is a fun protocol to work with, and it's very easy to get started. You should perhaps always use TLS and encryption, even on your secure local server LANs. It's very difficult to secure this, because you need to protect against external packets coming into your network with spoofed IPs, and you need to make sure that you don't allow spoofed packets in from partners and other internal networks. You even have the problem of virtualised servers in your environment: if they can send these packets with spoofed addresses they can do VLAN hopping, jumping from the VLAN where they are sending packets to another VLAN, going around the firewalls, bypassing the access control lists and so on. So there are a lot of issues with this, and to stop using VXLAN is probably not a possibility for a lot of you, so we need a lot of help and I think we should help each other. That is my presentation for today, and I think we have time for questions?
OSAMA I AL‑DOSARY: Yes, we do.
AUDIENCE SPEAKER: Ignas Bagdonas, Equinix. This is both a comment and a question. Thank you for raising this; it's important to have this discussion. What I hear is two topics. One is what is in the title, and that seems to be the perceived focus, but it is probably not the large problem to solve and not that important: VXLAN is eventually going away. So in answer to your question, the right answer is to use a proper encapsulation, and there will be a lightning talk about that later. The far more important topic you are trying to touch on is proper network design, or education about how to build things. So, paraphrasing what you were saying just recently: if somebody from outside your environment sends tens of millions of packets per second and affects your internal infrastructure, it means you have far bigger problems than VXLAN in your network. For everything you are describing to work, you simply have to not follow proper operational hygiene. In that case, somebody getting inside your infrastructure and affecting some protocol mechanics will certainly happen, and this is, I would say, normal and expected. So the important thing to do is to educate the community that one protocol or the other is not the deciding factor. What needs to happen is to decrease reliance on vendors saying, buy our new shiny box and it will solve your problems. No. It is solved by proper network design and not necessarily by the shiny boxes.
HENRIK KRAMSHOEJ: Unfortunately it's not usually the network people, it's more the server people in my experience. Thank you.
GERT DORING: Gert Doring, SpaceNet. I did a bit of VXLAN testing earlier this year and stumbled into the same thing: how do you secure this? We tested with Arista and they have this knob where you can filter on the source addresses of the VXLAN encapsulation. So you tell it who is allowed to send packets into this VNI, and then it comes back to proper network design: if all your VTEPs are in the same /24, you permit that /24, and everybody else can send packets to their heart's content and the box will just ignore them. The interesting aspect is that the online help on the box tells you about the switch and the docs don't, so we are back to asking the vendors to make this more prominent and put it into their best current practices documents. If somebody naively deploys this they will run into this head‑first, and we need to raise awareness with everybody. Thanks for the presentation. And by the way, I don't think this one is better; at least on the beamer it's coming out a bit weird.
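The filtering idea described here can be modelled in a few lines. This is a sketch of the logic only, not Arista configuration syntax: the permitted /24 and the function name are assumptions for illustration. Note that the check sees only the outer source address, so a packet spoofed from inside the permitted range still passes; it narrows who can inject, it does not authenticate the sender.

```python
import ipaddress

# Hypothetical range in which all legitimate VTEPs live.
ALLOWED_VTEPS = [ipaddress.ip_network("192.0.2.0/24")]


def accept_vxlan(outer_src: str) -> bool:
    """Accept a VXLAN packet only if its outer source is a permitted VTEP."""
    addr = ipaddress.ip_address(outer_src)
    return any(addr in net for net in ALLOWED_VTEPS)
```

With this list, `accept_vxlan("192.0.2.10")` is true and `accept_vxlan("203.0.113.5")` is false.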
AUDIENCE SPEAKER: James Bensley, professional maniac. Thanks for a really good talk. We have actually had the same problem, or a very similar problem, for ages with BGP‑signalled VPLS and VPNs. The draft states somewhere near the end that no PE should accept labelled packets if it hasn't explicitly advertised that label, but if you send a PE any label it will process it and switch it on. So, going back to your first or second slide, there is no security in VXLAN at all; it's a blind transport, the same as MPLS. But at least in the MPLS RFC there is this tiny smidge of security where they say that PEs shouldn't accept labels that they haven't explicitly advertised, although no vendor has advertised that. Do you know if there is anything in the VXLAN RFCs that says anything about a similar mechanism for only accepting things that have been explicitly advertised?
HENRIK KRAMSHOEJ: As the previous speaker said, there are some knobs you can turn on, and they try to make sure you only accept VXLAN packets if they are coming from a peer which you have configured, but the thing is that we are coming from that peer, because we are spoofing packets. And a lot of people still, today, don't realise how easy it is to spoof packets from the inside, from the outside, from virtual machines and so on. Since these are UDP packets, I am also thinking a lot of user‑space hacks and attacks will be able to relay packets on to VXLAN, so it might not come from the hacker's server, it might come from any server on the Internet, more or less.
AUDIENCE SPEAKER: A quick follow‑up. Do you see many vendors implementing those knobs?
HENRIK KRAMSHOEJ: I have mostly researched Cisco and Arista, and I don't think they are doing a good job of documenting it at least, as was also mentioned.
AUDIENCE SPEAKER: Thanks. Good talk.
AUDIENCE SPEAKER: Blake, with Eyebrows. Thank you for putting this together. Any ammunition I can get to show my enterprise customers that they should think about this is good stuff: gardens, castles, all that. Have you seen anybody using MACsec over ‑‑ in the wild?
HENRIK KRAMSHOEJ: No.
AUDIENCE SPEAKER: It was really not what they expected, it didn't work very well.
HENRIK KRAMSHOEJ: How many people really use MACsec? A few, okay. Good. Five, ten people, I think. I don't use MACsec in the networks that I configure, unfortunately, but maybe we should.
AUDIENCE SPEAKER: To respond to James's question, in the latest versions, in the last six months or a year ‑‑ VPN that like actually verifies that I have label table entries for the labels that I am receiving, it doesn't just blindly process the MPLS packets which is cool.
OSAMA I AL‑DOSARY: Thank you, Henrik.
(Applause)
OSAMA I AL‑DOSARY: So our next presenter is Attilla de Groot.
ATTILLA DE GROOT: I work for Cumulus Networks. I will be presenting on EVPN to the host, and yesterday I got a response like, oh, not another vendor presenting something about EVPN. Well, I do think a lot of vendors are talking about EVPN because it makes issues like this easier to implement. I am involved in a lot of network designs together with customers, and I come across quite a few use cases where they ask me: you have EVPN sorted out, you are building your overlay networks, but how do I integrate it with my hosts? What I would like to do is go over two of the use cases that I see, the problems with them, and how I think you can solve them with EVPN on hosts.
Now, what we have just heard as well is that VXLAN is implemented mostly by server people, and it's done so that they can ignore the network; they are just tunnelling everything over it. And as a vendor I should say: you should secure your network now. One of the typical deployments you see is virtualisation environments. Very typically you have an MLAG connection to your servers with one or more VLANs, and then they do VXLAN between the hosts themselves. That could even be a double overlay, because you have the network guys doing EVPN VXLAN and the server guys, who don't talk to each other, and maybe even a container environment inside the VMs that is doing exactly the same thing. What you usually see in OpenStack or VMware is that you have dedicated network nodes, so all the tenant traffic has to flow through those, and that means there is a bottleneck for your traffic.
Now, that does have a few issues. The first of them is MLAG: a lot of people don't like it, it is not a standard, and a lot of network people are used to routing, which is something they would rather use. As I already said, with VXLAN on the host you are basically ignoring your network, tools and such. You have no idea what is going on in the overlay network from a network perspective. And you could also have things like traffic tromboning: if your exit node or network node is somewhere else in your network, and the network guys have no idea where it is, then traffic can go back and forth. There are also orchestration issues with integrating the network and host overlays, so what you see is that vendors have to implement something like OVSDB for OpenStack if someone wants to include a bare metal host in the same Layer 2 domain.
Now, another use case that comes across a lot is container environments. Luckily they already do a lot of BGP to the hosts, although there are still environments that do MLAG and then have some container orchestration on top of that. What you typically see is that you advertise your host routes directly from your container host, and possibly even your container IP addresses, directly into your IP fabric. We use Docker a lot, but in some cases you see that there is a container overlay as well.
Now, what are the issues with this? Well, basically, multi‑tenancy. If you have a software‑as‑a‑service solution then it doesn't really matter, because the customer only cares about its application. But what if you want to provide a container infrastructure to a customer that doesn't want to run its own infrastructure? Then you basically have to dedicate a host to a customer or you have to ‑‑ yeah, somehow implement multi‑tenancy for those customers. That means that you can't have overlapping IP addresses and you have to manage a lot of ACLs between the tenants, and that is something that, yeah, is a lot of work.
Now, over the past few years there has been a lot of development in standardised Linux tools: the VLAN‑aware bridge, for example, which is much easier to configure and more scalable, also on merchant silicon. We have the Linux kernel; as you have seen, VXLAN is natively supported on a Linux box and it's quite easy to configure. We have Free Range Routing, a routing suite that has support for EVPN with VXLAN, and network tools like ifupdown2 and iproute2. So we have all the components to use EVPN on standardised Linux hosts as well.
Now, what can you actually do with that? One of the first problems I addressed is MLAG towards hosts. If you run EVPN on hosts, you can basically break down your MLAG and just have routed sessions towards your container host or hypervisor. Free Range Routing has something called BGP unnumbered; it is not really BGP unnumbered, but you are using the IPv6 link‑local addresses, and you detect your neighbours based on router advertisements. Now, what you typically do to implement EVPN on hosts, and that can be quite a simple configuration, is that you advertise your loopback addresses, which will be your virtual tunnel end points for VXLAN, and you only have to enable the EVPN address family. That will make sure that everything gets advertised.
Now, if you look at what you'd like to have in an EVPN environment, one of the things is Layer 2 tenancy, so Layer 2 availability across multiple hypervisors. You could argue about whether that's a good thing or not, but in many cases it is simply a customer requirement. Now, EVPN ‑‑ or the implementations ‑‑ have the possibility to do ARP and ND suppression, so you do have a stretched Layer 2 domain, but with more control over ARP and neighbour discovery. So what you basically do is build Layer 2 VNIs, the VXLAN tunnels between the hypervisors, and the MAC addresses or MAC/IP combinations are advertised using EVPN type 2 messages. If you look at how that is configured on the hypervisor: you configure a VLAN in the VLAN‑aware bridge, you configure a VNI and attach them to each other, and with the configuration that I showed you earlier, that VNI is advertised to the other hosts, so they know where the tunnel end points are, and the MAC addresses are advertised.
Now, a general issue or standard question with EVPN is where to start routing. I think most of the vendors have written entire blogs on it: where do you want to put your gateway, do you want it centralised or distributed? EVPN allows for both. I personally like the distributed way, because you can start routing quite early. If you configure an SVI in that VLAN on your Linux host, the SVI addresses are not advertised, so basically you are creating an anycast set‑up. Every packet destined beyond the local domain will be routed locally, so you don't have a network node any more, so no bottleneck for traffic, and, yeah, that makes it quite clear how your traffic flows are going.
Now, another possibility that EVPN has is Layer 3 tenancy, and specifically for VMs I think that's quite a useful solution, because then you can have multiple tenants on the same box, each having one or more VMs that are completely separated from each other. To do that you use VRFs on the host, natively implemented; you can see how or ‑‑ yeah, download all the details and have a look at it. EVPN has the possibility to do prefix advertisement with EVPN type 5, and basically you build Layer 3 VNIs. If you are not familiar with the concept, it's pretty much the same as a Layer 3 VPN with MPLS; in this case the encapsulation is VXLAN. From a configuration standpoint it is the same as what I explained earlier; the only difference is that you create an SVI on the host and assign it to a specific VRF.
Now, what can you do in container environments then? Pretty much the same as what we have for VMs. You can redistribute your container IPs directly into a VRF, which means that if you have multiple tenants, you can also separate them inside a host. That also means that you can have overlapping prefixes and IP addresses. You could say, okay, people shouldn't care about their IP addresses, but as was shown earlier, everyone cares about 10/8. The other thing is that you don't need ACLs for tenant segregation any more.
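To make the point about overlapping prefixes concrete, here is a small conceptual model, assuming one routing table per VRF and longest‑prefix matching within it. The tenant names, prefixes and next‑hop labels are invented for the example; real VRFs live in the Linux kernel and are populated by the routing suite, not by a Python dictionary.

```python
import ipaddress

# One routing table per VRF; the same 10.0.0.0/24 appears in both tenants.
vrfs = {
    "tenant-a": {
        ipaddress.ip_network("10.0.0.0/24"): "host-a",
        ipaddress.ip_network("10.0.0.5/32"): "container-a1",
    },
    "tenant-b": {
        ipaddress.ip_network("10.0.0.0/24"): "host-b",
    },
}


def lookup(vrf: str, dst: str):
    """Longest-prefix match for dst inside a single VRF's table."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in vrfs[vrf].items() if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]
```

The same destination, 10.0.0.5, resolves to `container-a1` in `tenant-a` and to `host-b` in `tenant-b`, and neither tenant can reach the other's routes, which is why no ACLs are needed to keep them apart.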
Now, another benefit is the bare metal integration, so as I said earlier, right now vendors have to support something like OVSDB and given that EVPN VXLAN is just an RFC, you can terminate host tunnels with the same configuration on a top of rack switch and include a bare metal host in the same Layer 2 domain. You can also do the layer 3 tenancy and that means that you have your distributed routing again, so your configuration is exactly the same, you know how your traffic flows are going.
Now, I ‑‑ before I presented this, I wanted to build it all myself and I built a demo. There is a link to the GitHub repository, and you can have a look. But it does mean that there is no commercial support. I did this during the summer and I had to use the 4.17 kernel, Free Range Routing only has EVPN or all the features in the master branch, and not all the tools are very stable, so, yeah, this is ‑‑ at the moment more for a demo environment, but, yeah, feel free to try it out of course.
Now, what is there still to be done to implement something like this? We will be running into orchestration issues, because things like OpenStack Neutron don't have any plug‑in to manage EVPNs on hosts. Same thing for Swarm, they have no idea of VRFs on the hosts, so there are still implementations that someone has to make. Also decisions in an organisation: what you see a lot is that you have server guys and network guys, and they don't talk to each other, so the question is, if you are integrating both, who is going to manage which part? Also, if you look at the EVPN implementation in VXLAN on merchant silicon, there are some caveats you have to keep in mind. In most implementations the broadcast traffic, if you can't do ARP suppression, is unicasted, and that means you do have a limitation in how many VTEPs you can have in a single domain. There is a possibility to use multicast replication; that isn't implemented in Free Range Routing yet but is on the road map, but that means you also have to implement a multicast protocol on the host itself. Now, at this moment, EVPN in Free Range Routing doesn't have support for route leaking, so that means if you want to have route distribution between different tenants, you need to go through a router ‑‑ a firewall, etc. If you want to have services like DNS, etc., you have to create a tenant, and that could pose problems. I have understood that the idea is to implement this for Free Range Routing as well.
Another few possibilities that I have been thinking about for how you can improve a set‑up like this: one is implementing micro segmentation, which is where you have your security based on every application and every segment that you have in your network. Now, while I don't like managing ACLs, what you can do is filtering inside a tenant, so it wouldn't be for tenant segregation but specifically inside a tenant.
Now, if you look at the development of the Linux kernel, there has been a lot of development on BPF and the performance is quite good, and I have also been thinking about using Flow Spec for distributing those ACLs, including BPF, so that you have the performance and micro segmentation all with open tools and RFCs.
Now, that was my last slide and I think I have some time for questions.
ONDREJ SURY: Okay, so do we have any questions? Please come to the mic and state your affiliation.
AUDIENCE SPEAKER: My name is Cyrill from infrastructure service based in Russia. I had two questions but one question you have answered during the presentation, which is: have you seen this in the wild? The answer is no. As far as I know, as far as I have seen, layer 3 on a host is only running on ‑‑ cluster, but no OpenStack or VMware solutions can be provided even with layer 3. And the second question: aren't you afraid of putting EVPN into the hands of server guys? These men, as we know from previous presentations, are able to run VXLANs through the public Internet.
ATILLA DE GROOT: I will start with your first question. Speaking from a vendor perspective, I think we have a few customers that are doing routing to the host with OpenStack, and the only thing that they are doing for that is to announce their loopbacks. But that does give them problems because the consultants or the OpenStack providers have no idea what routing is, so it gives them issues when they have to troubleshoot; they have to explain what they are doing, etc., etc. Yeah, the issue that server guys and network guys are ignoring each other, and giving them access to EVPN, that is definitely an issue that you have to solve, but I'm not sure if that's a technical issue. I think that we should solve that in organisations.
AUDIENCE SPEAKER: It's a mix of technical and organisation issues because server guys often think that ah, we are wonderful server, it's connected to the network and some magic is done and ‑‑ to connect to another server.
ATILLA DE GROOT: Well, the only thing I can say is the server guys should be educated that there is more to it than just configuring one interface and hoping that your packet arrives at the other end.
AUDIENCE SPEAKER: Thank you.
ONDREJ SURY: We still have a couple of minutes for questions. Then, thank you.
ATILLA DE GROOT: Thank you.
(Applause).
ONDREJ SURY: Next presentation should be done by Flemming Heino, hopefully he is here. And topic is also EVPN.
FLEMMING HEINO: I am here to talk about a project we have recently done at LINX, a reasonably major project, which is deploying a disaggregated solution on one of our two London exchanges. As background, as we have kind of hinted, LINX runs two exchanges in London, more or less independently. One we refer to as LON1, with a traditional router vendor, Juniper in this case. And LON2, at the start of this project, was still using a traditional Layer 2 solution using ring protection, not quite spanning tree, but that generation of technology. We attempted to move to a routed solution in the past but we hadn't been successful for various mini chip problems. We were not treating that as a crisis up until 2015, when suddenly 100 gig became a thing and we had a tidal wave of orders coming through and realised we were going to outgrow our chassis, we were going to need to invest quite a bit in our core, and if we did it on technology that we didn't think had a long‑term future there would be a lot of capital that we would have to depreciate in a hurry later. So, what did we do?
Well, we didn't immediately just select a vendor; we really wanted to think about what as a network we wanted to offer and what would be the best options. So we looked not at vendors but at strategies. At one end of the spectrum is something like the LON1 network; the other is what we picked, open networking, something that was definitely on the leading edge side.
So we didn't do a vendor selection really, we did a strategy selection, and we picked the best vendor for each fit. So the various strategies we looked at were kind of: what is the best way to power an Internet Exchange? One is just to pick another gold plated traditional router vendor; they do everything we want to do but don't do it cheaply. We could have gone for another low cost Layer 2 solution, just with a platform that scaled better, but we would have been stuck in what we felt was a previous generation of technology. We could have gone for one of the emerging switch vendors; they were promising, but they really didn't have that much focus on the IXP market and we also had a fear that, with the amount of dollars in the hyperscale, if we had an important feature we might not get the mind share. We could have just picked the same vendor as LON1 and that would have given us a lot of savings, but we have an approach historically of having two different networks, two different vendors, and we wanted to see if there was the appetite for just merging the two. Or, as we picked, the disaggregated. So we looked for the best strategy with the RFCs and tested, but the key bit in addition is, being a membership based organisation, we sat down with quite a large sample of our members, both small, large, relatively technical, more "just as long as it works", and across the geography. We found out how much appetite for risk there was, what the things were that they really cared about, and the answer we got was, actually, because you have two LANs you have a unique opportunity to do something bold; now don't do something foolish, but do something bold.
So, this really decided open networking. As part of the preparations I mentioned on the strategies, we kind of honed the strategies and did costing with one vendor, so this is the beginning of our open networking path. We found a hardware partner, Edgecore Networks, who are actually part of Accton and have been doing open networking solutions for a very long time as OEM/ODM. Chances are a good proportion of you have bought equipment they have manufactured; it just has someone else's badge on the front of it. We sat down and ‑‑ some of the exchange‑specific features that are required, and that was, let's call it an embarrassing failure; being brutally honest, the NOS was the wrong one. Halfway through it was, okay, let's jump back on a plane, no point in finishing this. But fortunately we couldn't catch the plane until the next day, so we were stuck in an office trying to be polite. Instead of turning acrimonious, we asked how we got the requirement capture wrong, and we sat down figuring out probably a better requirement capture than was done at the RFC, and what happened? Edgecore went to all the NOS vendors in the market at the time and said the one we should have picked is IP Infusion. So, a bit of background on that: those of you who are familiar with Zebra or Quagga, they are the original developers, who then moved on to a commercial company and have been stack vendors; again there is a reasonable chance that a good number of you have deployed their stacks but sold by someone else. So they worked with Edgecore to give a demo. It wasn't quite the bar of the path, which is a risk we took mainly because they started a little bit late because we had the aborted first attempt. And also we didn't know IP Infusion, so we talked to a number of their customers and said, how are they really, do they really deliver, how good is their code, and got a bit of a feeling.
But what really sold us on daring this is that we had two parties who were willing to invest the time and the effort to make it happen because, as much as open networking is a thing in the hyperscale data centre, the Internet Exchanges are different and there were a number of customisations and adaptations that were needed. So highlighting a few of those differences really sets the scene for some of the challenges.
So, first thing, something that an exchange has is a partner port. If you are in the same data centre it's really easy: cross‑connect, dedicated port, simple configuration, and that plugs into the broadcast domain. But if you are remote, we tend to work with partners to allow the member to connect to us, and they will deliver on a single port or a LAG but they will deliver each member as a VLAN, so now all the features that used to be designed to function on a port need to function on a VLAN and forwarding ‑‑ that is a totally different implementation, and we need to do reasonably formal policing and shaping so that one member cannot burst or be DDoSed and affect the other members on that partner. But the biggest challenge, remembering this is a Layer 2 DNA, is the many‑to‑one VLAN mapping. If you talk to a layer 3 vendor and say I want VLAN 52201 and 517 to be mapped into this routing instance, this VRF, they go fine. You go to ‑‑ you say that in Layer 2, I want VLAN 512, 206 and 152 to be mapped to VLAN 100, and they go, sorry? So that is a total change for any Layer 2‑based implementation, really to move it to a layer 3 implementation at the access port.
The other big challenge that we have is we have quite a large range of port speeds. Best practice is you don't put large ports, multiple times 100 gig, and GigE ports in the same broadcast domain; they are going to create micro bursts, that is not how you design networks. As an exchange, if we want to offer that range of ports to our members, we can't put them on separate Layer 2 domains; we just have to put multiple stages in the correct network design to actually allow the traffic to be accelerated and decelerated, and put them on the right switch. But there are a number of design parameters and instrumentation that we needed that the hyperscale people solved by being allowed to do better network design.
The other big thing with a range of speeds is flooding and broadcast traffic. Generally, background flooding is a significant issue to small members. Something like 100 meg: if you have got a gig port that is a lot of flooding. If you have got 200 gig you don't detect that. As we can't separate it, that is a challenge. And the next one historically always has been an issue with an IXP and that is MAC security. Data centre, there is a cross‑connect, what do they do? They put a loopback. If you are running a Layer 2 fabric and haven't got MAC, you haven't got it. All those features are bread and butter for people in the IXP market. And the last one is the demarcation at the port. If it's one organisation and you have a strange frame, hopefully you can contact your server guy and say, what on earth are you sending me? Here it's not our organisation. And the source of the traffic might not be the IXP who is sending it, so we really need the instrumentation and the ability to detect what is happening, because we can't just pick up the phone to figure out where that strange traffic is coming from.
So, going on to the project and the challenges we picked up, we really started by finding a target solution. First thing, we really wanted EVPN; we kind of see it as the only way to deal effectively with flooded traffic. It's turning what we have been doing for 20 years now, which is BGP for IP forwarding, to MAC forwarding. And the really great function is that, now that you know that every address table is synchronised, you don't need that default route equivalent. In IP routing, if you don't have a default route and there is a packet that is not in your RIB, you drop it. In MAC, because it's on a learning basis, you are never 100 percent sure the table is correct, so, well, you forward it just in case, and that turns into flooding. With EVPN, everything is synchronised; if one router doesn't know about it, it means it's not programmed anywhere, so don't send it to the other ones in case they know about it, they won't. So you can be a lot more aggressive on cutting out flooded traffic. The other thing compared to a Layer 2 network: Layer 2 networks, when they change topology, need to relearn and do a MAC flush and a full reconvergence, and even though with a tester it might be of the order of half a second, you can actually see that some flows are disrupted for a few seconds as the whole MAC tables reconverge. So it's an order of magnitude improvement in convergence.
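The many-to-one VLAN mapping on a partner port can be sketched as a lookup keyed on the port and the partner's VLAN tag, assuming a toy table rather than any real switch API. The VLAN numbers 512, 206, 152 and 100 are the ones mentioned in the talk; the function names and the port name `partner-lag1` are hypothetical.

```python
def build_vlan_map(port, member_vlans, exchange_vlan):
    """Map several partner-delivered VLANs on one port to one exchange VLAN."""
    return {(port, vlan): exchange_vlan for vlan in member_vlans}


def classify(vlan_map, port, vlan):
    """Return the exchange VLAN for a frame, or None if the tag is not provisioned."""
    return vlan_map.get((port, vlan))


# Three members arrive as separate VLANs on one partner LAG and all join
# the same peering broadcast domain.
vlan_map = build_vlan_map("partner-lag1", [512, 206, 152], 100)
```

Keying on the (port, VLAN) pair is also what lets per-member features such as policers and MAC ACLs attach to the VLAN rather than to the physical port.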
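The difference in handling an unknown destination MAC can be sketched as two small forwarding functions, assuming a simplified single-switch model. With flood-and-learn the frame goes out of every other port just in case; with a synchronised EVPN table an unknown MAC is treated like an IP packet with no matching route and is dropped. The function names and port labels are hypothetical.

```python
def forward_flood_and_learn(mac_table, dst_mac, all_ports, in_port):
    """Classic Layer 2 behaviour: unknown unicast is flooded to all other ports."""
    if dst_mac in mac_table:
        return [mac_table[dst_mac]]
    return [port for port in all_ports if port != in_port]


def forward_evpn(mac_table, dst_mac):
    """EVPN-style behaviour: the table is authoritative, so unknown means drop."""
    port = mac_table.get(dst_mac)
    return [port] if port is not None else []


ports = ["p1", "p2", "p3", "p4"]
table = {"aa:aa:aa:aa:aa:01": "p2"}
```

In this model a frame to an unknown MAC arriving on `p1` is copied to three ports in the first function and to none in the second, which is the reduction in background traffic the talk refers to.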
And then we are hoping to add multi‑homing so that is an open door ‑‑ we haven't done it yet ‑‑ but an open door on how we can build on the technology to move forwards.
So we built all the exchange features, so we can explain the MAC ACLs, the many‑to‑one mapping, including actually the allowance of genuine flooded traffic on a shared port. You don't want to have "you can send this many ARPs on the whole port", so that if one member has a flood event they get the whole allocation and other members don't get access to it.
We also, to help with ARP, have implemented proxy ND to reduce background traffic, but as all of those mechanisms are new mechanisms, we made sure in the software that if we ever had an interoperability bug it could be switched off. So far, touch wood here, we haven't had to do it, but all the features we have implemented we have implemented with an option to fall back.
Questions sometimes get asked, why did we not go for a controller? I am not going to read the whole slide, but basically that is not who we were. We know as an organisation we had to move much more towards system‑based DevOps, but trying to do everything in that methodology was going to overstretch our team and we wanted ‑‑ not on building mission critical controllers, replacing engineers who are very good at doing that. So here is the start of the real world. We had got to there. Once we had actually selected the vendor, it was still quite a long time, so why did it take so long? Well, reality came in. When we actually started testing and fine‑tuning the solution, we found out a lot of things that we didn't expect to find. One clear example is the state memory on the Broadcom ASIC. If you use a policer, it uses TCAM memory on that ASIC, and potentially on a VLAN that is 4 policers. The MAC ACLs use TCAM entries as well. The Tomahawk has 1024 across 32 ports. That doesn't sound too bad, apart from the fact that by default they are split into buckets of 256, so now suddenly the number of connections on partner ports on the switches, which is where you are going to end up if you want to deliver partner services at 100 gig, starts running out of those resources. So we needed to really go and rewrite the allocation algorithm, or rather IP Infusion rewrote the algorithm, so that any protocol that had some earmarked TCAM that we weren't going to use released it and put it into a shared pool instead of having it in pre‑defined buckets, and then also made sure that it behaved gracefully at the limit. The key message is, if you are deploying on anything with TCAM and those kinds of functions, they are worth paying attention to. You might be using different ASICs, but do pay attention to what you can do with those.
The other thing that we had was that we really had to put dynamic learning as a last resort. In the implementation of most Broadcom chips, which are fixed pipeline, I am not 100 percent sure why, but it's a historical decision they have chosen not to reverse, they do learning before the MAC ACL, and learning consumes forwarding path. So if you have a loop you will consume forwarding capacity to learn a packet, then you are going to discard it and unlearn. So even with a MAC ACL, if you have learning, you really have to be careful you don't get stuck in that rut or you'll suddenly have your forwarding capacity losing one or two orders of magnitude.
The other thing that's specific to this StrataXGS: we were initially trying to go MPLS, we kind of knew the protocol and had some concerns about VXLAN, but it can only remove a certain number of labels in one go. For one, ECMP and the entropy label take two labels, so before you have looked at what service it is you have lost two labels. So that one was immediately off the table. And because it was designed to run VPLS, they hadn't designed the tunnels to be unidirectional; it was still designed in the pseudowire context where the MPLS was bi‑directional, so Broadcom basically hadn't tested that implementation, and we would have had to spend a lot of time on there. The other thing is that every LSP that goes through a core actually creates an entry in a table, and because it's N squared and we are doing multiple labels for load sharing, all of a sudden we did forecasts and could see that if the network grew the way we wanted it to we would run out of interface entries in our core in the middle of 2020, which wouldn't have been a good thing to plan to do.
A valid question is, why not StrataDNX? That is the other family; the one we are using is Trident to Tomahawk. When we started the project they weren't quite dense enough. Yes, one advantage they had was much bigger buffers. We did a fair bit of numerical analysis on the buffers and the indication was they were good enough, and we found out through the migration that that was indeed the case. We were cautious, and we did the migration in two hits: we skipped the GigE members in the first part so that if we had got the maths wrong we wouldn't have to abort the migration. And then as a second part we did a couple of test GEs, and built them up progressively. It was a concern but so far, again, touch wood, we were right.
The other thing that DNX has is external TCAM, and it's not necessarily a good thing, in particular if you have small forwarding tables. So first thing, it uses more power, and in these environmentally aware days that is something you don't want to just ignore. Also, on that side, if you are looking up a small number of entries repeatedly on an external TCAM ‑‑ experience we have had with other vendors shows that that's something that can bite you at unexpected times.
But the key message is it wouldn't have been a bad choice; it was just that we started on one avenue, put a whole load of effort into that avenue, and changing horse mid‑stream seemed a bit bold.
So the new target solution was VXLAN; they had already started working on it. To address the security concerns on VXLAN: first thing, we have a closed network. We did actually try, as a test, sending VXLAN packets across an access port to our routers to see if they routed them. We do sFlow sampling, so we tested to see if a packet could be sent up to the CPU and be routed back into the data plane, and in all of those cases, no, we couldn't inject packets. Plus, putting ‑‑ removing the tinfoil hat, we run the core of the Internet. If one manages to inject packets, that is a very hard way to achieve something that is reasonably easy. But still we looked at it; we wanted to make sure the solution didn't allow injected packets.
Having gone through the limitations, this is quite an important slide. I have talked about the limitations of the chip set and this is knowledge sharing. The tone I really hope for is not that they are horrible or terrible chips. No, they are fantastic chips. They have revolutionised the market and changed the economics, but like anything, there are design compromises. They were very helpful in trying to work around the compromises with us, but because we are trying to use something that is ‑‑ that is really economical for one purpose, we can't complain they are not ideally suited for our purpose, and they are good chips. Please don't take this ‑‑ this is Broadcom at the end, they really had our back, so please don't take it as anything other than enthusiasm for what they have done. I am just saying, if you are using them for anything other than what they are primarily designed for, pay attention to the detail.
One bit that I wanted to kind of pick up on in this presentation, among the things we have done, was the architecture, and one big advantage that we liked with these being one‑U switches is the leaf and spine architecture. I am sure some of you know it, but this is the IP fabric design of choice for the large scale data centres, and even for us the easy scalability was quite useful. The previous approach we had was the big chassis. Hopefully you can see all the slots: you have infrastructure line cards, member facing line cards, fabric cards, central, and it follows a scale‑up model; once you run out of slots, you have to rip out a card and replace it with a higher density one, and if you haven't depreciated the previous cards or the previous chassis, tough. And the maintenance and the planning and the number of different SKUs actually can make this a relatively slow scaling model. Whereas leaf and spine scales out. The line cards become switches in their own right, and the fabric cards become essentially spine switches. If you want more leaf switches, as long as you have room in your rack and the rack next door, put in another leaf, and if you want more spine, upgrade them one at a time and/or add a third or fourth or sixth, so the scaling is a lot more elastic; that is why it's referred to as a scale‑out model. We can have a lot more spare inventory because it's not 20 different models that we have to buy and invest capital in just in case, so we can have more inventory deployed but also a lot more inventory just sitting ready for deployment. What that really translates to is when somebody orders 300 gigs in a site no one has ordered in for the last year. We used to be reasonably bad: if we had a totally out of the blue order it would quite frequently be a "sorry, we need to do a build out, give us a few months". Now we expect the turnaround to be a lot faster.
And coming back to the strategy, back to what we are trying to achieve: as a member‑based organisation we couldn't deploy a solution that was really good for half of our members but delivered a lot less for the others. So an important point is, we are delivering benefits across the scale ‑‑ across the range of our membership. For the smaller members, as I explained before, the reduction in flooding, the ability to say no, you are only going to get a tiny amount of background traffic and broadcast and multicast, is a significant benefit if you are port or CPU constrained. For the faster ones, because they suddenly realise they need significant capacity and they tend to be the ones that give the large unforecast orders, it is the fact that we can be a lot more responsive. And the other thing, it's a lot more economical as a solution, and cost savings hopefully are of benefit for everyone.
So just as a recap, I want to go through how we got there. Really, a large part of that project was prototyping, and as you saw from the slides and everything we discovered, it was very much try one thing, find out, push it to a limit, scale, fix and patch, retest, so there was a large period where we were finding out what was possible. By November '17, it was like, okay, good enough. Because we had been quite experimental, there probably were a few modules that weren't written the way they would have been written from the beginning and a few modules that needed refactoring, so there was a fair bit of time hardening. We also found a few failure types we hadn't picked up, and there weren't any automated tests to catch them, so we had to write the test module and then fix the bug and retest it. It was a little bit longer than we were hoping, but it was quite confidence‑building that now we have the methodology for future releases to do a relatively fast type approval and a relatively fast migration, about two months, and it's now live. We did one software upgrade, in which we found a few things from packet types we hadn't catered for, but that basically was just making temporary fixes permanent and removing some workarounds; otherwise we are quite happy with it. So are there any questions?
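The TCAM allocation change described above can be sketched by comparing fixed per-feature buckets with a single shared pool, assuming a toy counter model. The figures 1024 and 256 come from the talk; the class names and the feature names (`policer`, `mac_acl`, `unused_a`, `unused_b`) are hypothetical and do not reflect the actual ASIC or NOS implementation.

```python
class FixedBuckets:
    """Each feature owns a pre-defined slice of the TCAM, used or not."""

    def __init__(self, features, bucket_size=256):
        self.free = {feature: bucket_size for feature in features}

    def allocate(self, feature, count):
        """Fail without raising when the feature's own bucket is too small."""
        if self.free.get(feature, 0) < count:
            return False
        self.free[feature] -= count
        return True


class SharedPool:
    """All entries sit in one pool; unused earmarks are available to anyone."""

    def __init__(self, total=1024):
        self.free = total

    def allocate(self, feature, count):
        """Fail without raising at the limit, so callers can degrade gracefully."""
        if self.free < count:
            return False
        self.free -= count
        return True


fixed = FixedBuckets(["policer", "mac_acl", "unused_a", "unused_b"])
shared = SharedPool()
```

With the fixed layout a request for 300 policer entries is refused even though 1024 entries are free in total, while the shared pool accepts it and only refuses once the whole TCAM is genuinely exhausted.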
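The N squared growth mentioned above can be made concrete with a small calculation, assuming a full mesh of unidirectional LSPs between edge switches and a fixed number of parallel LSPs per pair for load sharing. The function name and the example figures are hypothetical and are not LINX's forecast numbers.

```python
def lsp_count(edge_switches, lsps_per_pair):
    """Unidirectional LSPs in a full mesh: one set per ordered pair of edges."""
    return edge_switches * (edge_switches - 1) * lsps_per_pair


# Doubling the number of edge switches roughly quadruples the state that
# transit (core) switches may have to hold.
small = lsp_count(10, 4)
large = lsp_count(20, 4)
```

This is a worst-case view in which every LSP crosses a given core switch; in a real topology each core sees only a share of them, but the quadratic trend is the same.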
AUDIENCE SPEAKER: This is Andreas Polyrakis from GRNET ‑‑ thank you, Flemming, for a very good presentation. I have two questions actually. The first one is that ‑‑ quite shallow buffer, our concern, because we are on similar products, if this is a limitation in an IXP environment especially when you mix customers of different speeds, have you investigated this problem at all or seen this as a problem? Do you mix speeds?
FLEMMING HEINO: So were we concerned? Yes. We did the maths, we did the analysis, and we do ‑‑ we don't go straight down from 100s, we tend to go from 100 to 40, and the main thing we are worried about is mixing tens and 100s in the wrong proportion on the same switch. Having a switch with a small number of tens, our mathematics say it will probably work. And the second thing, as I said, during the migration we did it in two phases: first migrated everything but the GigE members, kept them on their own, on their old switches, made sure that worked, then migrated a few, checked, kept an eye on the buffers, that we weren't getting buffer drops. When a member congests you need a reasonable amount of analysis, so we went through and looked at those members, and migrated more, and one site, and now we have migrated everyone and so far it looks like our analysis was right. But yes, it was a valid concern and a number of people did challenge us on it. The maths said yes, and experience seems to show that we got it right.
AUDIENCE SPEAKER: Very useful feedback. The second question is, I didn't see anything in your slides about any of your plans about automation. I would like to ask if you have any such plans and how well these fit with the software from IP Infusion?
FLEMMING HEINO: So the reason is that it's a work in progress. Part of it again is we are growing a software team. While the software was changing, we really focused on automating LON1, built a lot of methodologies, and that project meant we didn't release our software team quite in time, so they are about six months out of step. So we have got, and it's on GitHub, something that can actually push and pull configs through NETCONF, and we are 90% through templating. There are a few ‑‑ we are trying to do a kind of NAPALM single push as opposed to stateful, and there are a few changes we might need to fine tune: if you change too many variables at once you can create race conditions, so we might need to do one or two tweaks as we test NETCONF. Either change the way we do pushes or, far better, if we can tell IP Infusion, actually we have a race condition, can you fix it. That's where we are. We wanted to launch with it live but our LON1 automation overran. We are 90 to 98% on automation, and it will be on GitHub and we are ‑‑ are on GitHub and are willing to share it when it's in production, or it's available now, or support other people when it's in production.
AUDIENCE SPEAKER: Okay. Thank you.
PETER HESSLER: From Hostserver. So in the migration from the old traditional not quite spanning tree to EVPN, have you seen a change in your observability of what the network is doing, and has it got easier or different, are things now possible that weren't?
FLEMMING HEINO: Yeah, tunnel statistics are absolutely fantastic. It was a real pain to try to figure out and to subtract and to add, whereas now, because it's an overlay network, it's not quite as good as MPLS statistics ‑‑ with MPLS you can get transit LSP statistics, which you can't get on VXLAN ‑‑ but still night and day better compared to the previous one.
AUDIENCE SPEAKER: Thank you.
AUDIENCE SPEAKER: Cyrill. You have mentioned data plane issues regarding Broadcom ASICs but haven't mentioned any control plane issues about ‑‑ with the implementation of EVPN technology. You might have a lot of control plane issues. How do you deal with them?
FLEMMING HEINO: Mainly kind of through testing, though we really didn't ‑‑ so first thing, we are not ‑‑ we are not ‑‑ you can get EVPN, you can scale it hugely, but remember, we only run EVPN on the switch, so the VTEPs are on the switch, we are not talking to servers, so that makes the scalability and control plane issues ‑‑ basically we removed them in testing, we haven't seen them.
AUDIENCE SPEAKER: How do you deal with a lot of issues on the control plane, a lot of maybe flapping, damping and so on?
FLEMMING HEINO: So okay, being an Internet Exchange, as I hinted, we actually switch off MAC learning. We ‑‑ to implement MAC security we need to know the MAC address of our member, so we actually switch off learning, allow one, maybe two if the member is doing a migration; then if the member has worked with an exchange before they will tell us, we are swapping from that MAC address to that one, and we will temporarily allow both. The withdrawals are relatively ‑‑ the amount of BGP updates for the withdrawals is relatively low, if it exists, one or two MAC addresses, so the scaling challenges if you are running a VTEP to a hypervisor versus running as an overlay for a router are the opposite. You actually can have issues where you don't have enough entropy, not that you have too much.
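The MAC security policy described in this answer, with learning switched off, one permitted address per member and a second allowed temporarily during a migration, can be sketched as a small allow-list, assuming a simplified per-port model. The class name `MemberPort`, its methods and the example addresses are hypothetical.

```python
class MemberPort:
    """Static MAC allow-list for one member port; nothing is learned dynamically."""

    MAX_MACS = 2  # one in normal operation, two only while a member migrates

    def __init__(self, mac):
        self.allowed = [mac.lower()]

    def begin_migration(self, new_mac):
        """Temporarily permit the member's new router alongside the old one."""
        if len(self.allowed) >= self.MAX_MACS:
            raise ValueError("port already has the maximum number of MAC addresses")
        self.allowed.append(new_mac.lower())

    def finish_migration(self, old_mac):
        """Withdraw the old address once the member has moved over."""
        self.allowed.remove(old_mac.lower())

    def permits(self, src_mac):
        """Frames from any other source address are dropped at the edge."""
        return src_mac.lower() in self.allowed


port = MemberPort("00:11:22:33:44:55")
port.begin_migration("00:11:22:33:44:66")
```

Because each port contributes at most one or two statically configured addresses, the number of EVPN MAC routes and withdrawals stays small, which matches the point about low control plane churn.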
OSAMA I AL‑DOSARY: Any other questions? Do you have any questions online? All right. So we are early, I actually have a question, maybe it's a new question. What's the issue with mixing speeds on the same switch? What is the problem there? I didn't get it.
FLEMMING HEINO: It's just ‑‑ so the buffers are dynamic buffers and the algorithm really is ‑‑ works better with the right balance, and if you have the wrong balance then the algorithm for managing the dynamic buffer pool can temporarily give too much to one speed and then not have enough, because the ‑‑ even though GigE ports don't tend to fill their buffers very fast or drain very fast, they can sometimes end up needing to be elastic. So, yeah, and it's really that going from 40 to 1 gig was already stretching the buffers; if the existing buffers are then being consumed by 10 gig on the most difficult task, if the algorithm went wrong, that's where we would have suffered.
OSAMA I AL‑DOSARY: Okay. Well thank you very much, thank you.
(Applause)
OSAMA I AL‑DOSARY: All right. So, this concludes this section for today. We have a couple of reminders. First, please vote for PC elections, we have the voting open until about 3:30 p.m. and then at 4 p.m. we are going to have the candidates come up here so that you can do your voting. So please ‑‑ sorry, not vote now, I am sorry, please send your nominations and then the voting will start at 4 p.m. Also, please rate the talks, so what you have seen today, that really helps us, and we are going to be back here at 11:00. Thank you everyone.
LIVE CAPTIONING BY AOIFE DOWNES RPR