RIPE 89. Side room. Wednesday, 30th October 2024.
9am.
DNS.
WILLEM TOOROP: Good morning everybody. Welcome to the DNS working group. My name is Willem Toorop. Me and Doris, who isn't feeling too well at the moment, and Moritz Muller, who couldn't make it in person, he is with us in spirit and alive as you can see there on the screen.
We are the DNS working group chairs and we will be chairing this session. Good to see you all.
Last weekend, there was in this same hotel the DNS-OARC workshop, the 43rd, with yet again, as usual, an excellent selection of outstanding DNS talks.
So if you weren't there, then I can recommend you to use your search engine of choice, look for OARC 43 and look at the programme. Notably they had two plenary sessions, one about a recent series of resource depletion attacks that have been hampering resolvers, together with the researchers who found them... it was very interesting. There was also a talk about how to deal with those attacks in general and potentially also with future resource depletion attacks. That was on the Saturday. And on Sunday there was another plenary session, also very interesting, about denial of service and hijack attacks. I am not sure when the videos will be online but perhaps I can tell you later. Definitely worth checking out.
So speaking of plenary sessions, tomorrow from 6 to 7 in the main room we will have a plenary session comparing the resolver recommendations that have been developed here in RIPE with the operational practice of some prominent resolvers. In it, Yury will be presenting the Dutch universities' resolver, Dave Knight the UltraDNS public resolver, Babak Quad9, and... will be presenting their resolver, and we will get an insight into the search for a DNS resolver. The whole session will be chaired by Shenka. It's similar to what we had last time in Krakow but now with way more time, which is really good.
We have a very densely packed programme, so I won't hold you any longer, but first there is another exciting announcement to be made by Vesna. So Vesna, go ahead. (APPLAUSE.)
VESNA MANOJLOVIC: Good morning, I am Vesna, community builder for the RIPE NCC, and I am here representing a group of people who are organising the DNS Hackathon. You can reach us at this email address and then we will all read and reply.
So our next event is in Stockholm in March. It's going to be during the weekend, on the 15th and 16th of March, because the week following that, on the Tuesday, there will be a Netnod meeting. This Hackathon is organised by three organisations: Netnod, who is hosting us, DNS-OARC and the RIPE NCC. We have published an article on RIPE Labs that describes all of this in a lot of detail.
We are also helped by people from other organisations that are powerhouses of DNS, so we have their first names there on screen, Arife, Samaneh and Christian on the programme committee. They will be helping out in choosing the projects and later on giving the symbolic awards for the best team work, or the craziest idea, or the most completed project. And we also have the organising committee of Johanna, Denesh and myself, who are working behind the scenes to arrange all the logistics.
This is not our first Hackathon; last year we already organised a similar event together, just before the RIPE meeting that happened in Rotterdam, and a long time ago the RIPE NCC organised a DNS measurement Hackathon focusing mostly on RIPE Atlas data. How many of you have been to a Hackathon before? Can I see a show of hands? OK, not even half.
Let me explain very briefly in the last minute that I have here, when I say Hackathon, I mean people getting together to combine their creativity, coming from all the different sides of operating the internet, so both the network operators and graphic designers, software developers and researchers, students and actual hackers, what I do not mean is breaking into other people's computers. So the hackers have to reclaim the pride of that word and not be mistaken with other usage of it.
So the Hackathons are very intense but very short; most of the work actually happens outside of those two days. For example, we have also published on RIPE Labs a follow-up on one of the projects that lived on after the event finished in Rotterdam, and the IETF is also having these events every time before their meetings.
So please register. Registration is already open. And thank you for your attention. (APPLAUSE.)
WILLEM TOOROP: Thank you, Vesna.
VESNA MANOJLOVIC: Questions? I have 20 seconds left. No, OK, you know where to find me.
WILLEM TOOROP: OK. The first speaker will be Raffaele from the University of Twente who will be talking about dark DNS, revisiting the value of rapid zone updates.
RAFFAELE SOMMESE: Good morning everyone, I am Raffaele from the University of Twente and today I am going to present to you work that was accepted to this year's edition of IMC. This work revisits an idea from almost 20 years ago: rapid zone updates. So, probably all of you in this room know the ICANN programme called CZDS, the Centralized Zone Data Service. This is a programme that ICANN introduced for zone sharing, and it has enabled a lot of research from us and also from the industry into DNS resilience, infrastructure analysis and abuse prevention. But there is a problem with this programme: it's just one snapshot a day. And the problem with one snapshot a day is that you take a picture, imagine always at the same time on each day. These are two pictures, one was taken on the Sunday at 6pm and one on the Monday at 6pm; the only thing that you can say here is that it was cloudy on both days, they are mostly the same. What you cannot see is that, at night, a SpaceX launch happened. And it was so transient that no one was able to perceive it by just looking at snapshots taken on different days.
So why do we need more granular data? We want to quickly detect newly registered domains for DNS abuse detection, trademark protection and so on. A colleague of mine did an intensive study of hijacking detection on granular domain data from the DNS system, and, for example, we want to look at live DNS infrastructural changes to see how a DNS provider reacts in the face of an attack.
And regarding the registration of domain names, I'm citing this quote that says most registered domain names lead a very brutal and short life, because most of them die young. And again, 20 years ago this kind of service was out there from Verisign, and unfortunately it was shut down because it was abused by mail spammers for sending spam; unfortunately, because it was something really useful for the community for providing prevention against abuse.
So what are we left with? Of course there were alternatives to this service, and one was using passive DNS data. One of the most famous feeds of passive DNS data is the one provided by DomainTools, the SIE NOD feed, the newly observed domains feed, but it has some limitations: you only see actively queried domains, so sometimes you are able to detect abuse only when the abuse happens; it's a commercial data source with limited availability. So what are we left with? Now, a year ago, we wrote this paper, which I also presented at the last RIPE meeting, where we demonstrated that we can learn half of the second level domains of a TLD using CT log data. And CT log data is a nice thing, it is live data. In fact we looked at the distribution of how quickly we detect newly registered domain names, and initially we thought that we were detecting newly registered domain names within a day, but the truth was we were detecting newly registered domains in much less than a day. So we developed infrastructure to quickly discover newly registered domain names. We use the join between the domains that we can learn from certificate transparency logs and the domains that we have from ICANN CZDS or the zones that we have from... If something is not in the zone of .com or .net or whatever, it's likely to be a newly registered domain name; we then do a lot of measurements on these domains, and these are publicly available on the internet at this URL.
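To make the join Raffaele describes a bit more concrete, here is a minimal sketch in Python. It is an illustration only, with made-up inputs, not the infrastructure from the paper; it assumes the latest CZDS snapshot has already been parsed into per-TLD sets of second-level domains, and a real implementation would use the Public Suffix List rather than the naive label split shown here.

```python
# Minimal sketch of the CT-log / zone-snapshot join described above.
# Assumptions (not from the talk): the input is an iterable of SAN names taken
# from CT log entries, and "snapshot" maps each TLD to the set of second-level
# domains present in yesterday's CZDS zone file.

from typing import Dict, Iterable, Set

def registered_domain(name: str) -> str:
    """Naive second-level-domain extraction; real code needs the Public Suffix List."""
    labels = name.lower().rstrip(".").lstrip("*.").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else name

def candidate_new_domains(ct_names: Iterable[str],
                          snapshot: Dict[str, Set[str]]) -> Iterable[str]:
    """Yield domains seen in CT logs that are absent from the latest zone snapshot."""
    seen: Set[str] = set()
    for san in ct_names:
        sld = registered_domain(san)
        tld = sld.rsplit(".", 1)[-1]
        if sld in seen or tld not in snapshot:
            continue                     # unknown TLD or already reported
        seen.add(sld)
        if sld not in snapshot[tld]:
            yield sld                    # likely newly registered (or transient)

# Toy usage: only the name absent from the snapshot is reported.
snapshot = {"com": {"example.com"}}
print(list(candidate_new_domains(["www.brand-new-shop.com", "example.com"], snapshot)))
```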
So what did we learn out of this? We learned that we are able to detect 42% of newly registered domain names before they appear in the CZDS snapshot; that's around one domain per second, 76K domains per day. But the most interesting thing that we learned is that 1% of these domains never show up in a CZDS snapshot: some are so short lived that they never make it to the next snapshot. How quickly are we detecting these domain names? The interesting thing is that it takes us less than one hour: within 45 minutes we have half of the newly registered domain names, and 30% within 15 minutes.
Now, let's focus back on the part where I said there are domains that show up in these logs but never show up in CZDS: it's 1%, around 0.01 domains per second that never show up in CZDS. There are two main possible reasons. The first is certificates issued for domains that have expired; this is possible, this is allowed. It's a security nightmare, but let's not talk about that.
And the second one is domains lasting less than a day, the SpaceX launch I told you about.
And we call these transient domains. We use RDAP data to distinguish the two cases, and in the previous three months we found 42,000 transient domains. These were completely invisible to researchers and security operators; so far no one was able to catch them.
So we discovered that, of these domains, half of them died within the first six hours of their existence, a really short life, and there are very few legitimate reasons why these domains should have such a short life span; the reasons for early removal include, for example, abuse, credit card fraud and so on. And also blocklists are not able to detect these domains. We think it's a missed opportunity for registries and registrars to share which are the domains that should be taken down by others.
The interesting thing is that they really look like long lived domains: basically half of these are on Cloudflare and a third are also on the Cloudflare CDN. And the thing that we also asked ourselves: this is just a part of the visibility, but what is the complete picture, how much visibility do we miss? We looked at that by comparing with the EPP transaction log and with the newly observed domains feed from DomainTools. The interesting thing is that in the three months, for .nl, we found 334 domains that were registered and deleted such that they never made it to a zone snapshot, and we identified only 99, one third. So even with this data source we still have a huge blind spot. Comparing this to the DomainTools feed, it covers roughly 5% more of the domain names, and the overlap between the two data sources was just 60%. So you basically need both to get more visibility.
What's next.
So the existence of these domains could represent a measure of success, of the fact that registries and registrars are able to tackle abuse before it can do damage, but we don't know. The problem here is that each registrar has to independently relearn the same signals about whether a domain is malicious or not, and in the meantime it's completely invisible to researchers like us, because we don't have access to the kind of feeds that the registrars have. Our call is to promote the resurrection of rapid zone updates, expanding beyond .com. The goal of rapid zone updates was to promote security and stability by providing a useful tool to online security companies and ISPs; we think this is still true. And the fact that nowadays we have uncoordinated action against abuse really calls for expanding transparency inside the DNS ecosystem. From CZDS, we learned there are ways to mitigate the abuse of such a service, by creating a system where only some people can access it, with an approval system. And this is not just me calling for that; the Security and Stability Advisory Committee of ICANN, in SSAC 125, also really calls for a collaborative approach to fighting DNS abuse.
So my call for you today is: help us to enable transparency inside the DNS. We know there are security and privacy concerns in sharing more frequent zone data and registration data, and we fully understand them. The problem here is that we need to create a framework where we can share this kind of data and really make the fight against abuse a collaborative effort, where multiple stakeholders can collaborate with each other and share data with each other in a trusted and secure environment, with different levels of trust and data access. I don't say that everyone should have access to all the kinds of data possible, but at least we can collaborate and finally fight the problem of abuse better and find better solutions to identify abuse in the domain ecosystem.
I thank you for your attention. You can go to the website and play with our data and see the live stream of data. If you want to reach us, these are the mail addresses and this is the QR code of the paper. Thanks a lot.
(APPLAUSE.)
WILLEM TOOROP: Thank you. You still have four minutes, also for questions, so if there are questions for Raffaele.
AUDIENCE SPEAKER: Michael Richardson. I understand the six hour withdrawal is because the registrars have figured out it was an abusive thing. Do these domains, I guess you don't know, do they then show up through another registrar very quickly?
RAFFAELE SOMMESE: Sometimes it happens, but it's not that frequent. We just have three months of data, so I don't have enough of an overview to tell you whether it happens frequently; we will look into that in a more longitudinal fashion. But I think the problem is they try to register the domain on .com and on .net, and that's more difficult to watch; you need a global view to monitor all the TLDs.
AUDIENCE SPEAKER: The data sharing issues are privacy or commercial concerns?
RAFFAELE SOMMESE: This is a tough question. I would say both. In the European Union, privacy is the biggest one, because some people think that domain names are personally identifiable: if you have your name and surname in the domain name, it is considered personal data under the GDPR. Some TLDs of course also use the commercial advantage of knowing the list of domain names, for providing services for example for trademark protection or for new branding or stuff like that. So both are a concern; however, if we stood up a framework where people can really collaborate with each other, we could say for instance that this data should not be used for commercial purposes.
WILLEM TOOROP: Sorry, did you state your name and affiliation?
AUDIENCE SPEAKER: It was Michael Richardson, Sandelman software works.
WILLEM TOOROP: Excellent. Any more questions?
AUDIENCE SPEAKER: Pavel. I have a question: what do you actually see as the way forward, what would be the practical step to make it happen? And also I can offer an anecdotal piece of data: for the phishing domains that we see, for a data set of less than a year, CZDS contributes around 7 to 8% as a signal for detection of the malicious domains that we are handling, compared with certificate transparency, which is way over 80%. That also kind of shows you how those sources differ, and the delay I think is the main contributor here.
RAFFAELE SOMMESE: Yes, I agree, again the problem is that CZDS is too late. I mean, one snapshot a day is too late to detect abuse. One thing we are trying to do, and that was my call here and the reason I present at RIPE, is to get a couple of ccTLDs and gTLDs that are interested to participate in such an experiment, where we can set up agreements for data sharing, demonstrate the effectiveness of the sharing, show how with data sharing we can manage to fight abuse better, and maybe try to bring this to bigger players like ICANN and try to convince ICANN to implement this at the gTLD level.
AUDIENCE SPEAKER: Thank you Raffaele.
RAFFAELE SOMMESE: Thanks a lot.
(APPLAUSE.)
WILLEM TOOROP: The next speaker is Tomas from Packet Clearing House, talking about the update of the Hungarian registry.
TOMAS CILLAG: Yes. So first, to start, it's not just me; there is a small team behind this work, I just helped to redesign it in a newer way. Also, yes, my employer is PCH, however this was project work with the Hungarian ccTLD. That's that, yes, and you can find my email address here.
And yeah, what were the primary goals with the hidden primary, because that's what we are basically talking about: operate in both a secure and resilient manner, and use some kind of cluster, so you can think of it like a plane with more than one engine; if something fails, maybe not everything should fail all at once.
Trying to follow the best practices and trying to keep it simple, yeah. In case we see or detect some issues, we should break the process, which means broken zones should not be loaded, and this is one of the main things I want to highlight here.
Yes. The previous system was based on OpenDNSSEC, and it served us well since 2015, which is a good thing. There are newer practices and better tools nowadays, so we try to move in that direction. And yeah, I also want to share some things which we learned during the process. There is a lot of information here on the slides; they will be online, I will just try to highlight a few things here.
So yes, the previous system was signing via OpenDNSSEC and the zone distribution was by BIND. This was a hot/cold cluster kind of setup, which means that there was a shared IP address and the operator basically decided where the shared IP address lived; there were a few scripts facilitating this. Some of this is in the past, some of it is still true, I will explain this a little bit later.
So the KSKs were on a hardware token, a USB hardware token, a Nitrokey HSM, and the ZSKs were on file, SoftHSM, so soft signing. One of the hosts was running OpenDNSSEC actively, the other side was quietly waiting. Right.
This is a very simple diagram. What I want to highlight here is that on the secondaries, in general, only one source, one primary, was configured, because that was the shared IP address. If someone wanted to switch sides to the other host, then basically there was a script that needed to be run to remove the IP address and another one to put it on the other side. It's great because there's some kind of redundancy, but later I will tell you what we found and how we can do this better.
There are two locations. They are five, six kilometres apart. Both in Budapest. Yeah.
There were safety mechanisms, and this is one of the parts I want to talk about. There were scripts, and they were hooked in via the OpenDNSSEC notify command facility; this is basically used in general to tell BIND that OK, the zone is generated and signed and everything, ready to be loaded. Here, additional checks were implemented; it was Perl and shell, many use similar setups. There were also pre-generated signatures for the DNSKEY RRsets, in case something broke; it didn't happen, but it was there. So if there had been some issue with the key, I mean the KSK, we would have had something to fly on a little while until we figured out how exactly to recover.
Right. OK. Yeah. So there were multiple scenarios, or multiple ways we tried to handle things. One of them is check and prevention, which means that if we find some kind of anomaly which is clearly detectable, it should be logged and the last known good state should be used; basically the new zone should not be loaded.
One of these checks is the so-called golden records, which means that we have a list of notable institutions, like big banks, universities, you name it; we expect them not to go away, and they were registered a long, long time ago. So if these would go away, that might be some kind of issue with the generation of the zone. So this is one of the cases where we would stop the zone and one of the operators would come and see what's going on.
Another check is for significant changes, if the number of delegations changed significantly: we generate the zone every ten minutes, so if the change between two generations is more than 1%, which is currently around 9,000 delegations, that is something which also warrants an investigation, so the zone is not loaded. There are syntax checks obviously, and there is a new check which was recently developed with the new signer system, and that is basically checking the DNSSEC delegation from the parent, so checking that the DS record at the parent validates properly against the authoritative data. And I will tell you how this can easily be hooked in using standard DNS tooling, instead of reinventing things and doing checks on files; I will get to that point.
Yes. There is also another case where there should only be an alert and the zone should still be loaded. That is the case when the RRSIGs are about to expire: they should have been refreshed but were not. The zone should still be loaded because there is nothing to gain here; the last loaded zone would have the same issue, so not loading doesn't really help.
However, there is still an alert sent out to the operational team to check and figure out what's going on; maybe there's some kind of operational issue, maybe there is some kind of software bug, anything can happen, you know software, so...
Yeah, the new signer system is basically using Knot DNS as a signer and NSD as the distribution and verification point. In this case this is a hot/warm two node setup, and it is shared-nothing: there is no shared IP address, basically nothing is shared between the two sides. Both nodes' IP addresses are configured on the secondaries, but I will get to that on the next slide.
We do a trick with the serials, which means that one of the hosts is generating odd serials, the other one is generating even serials, and there's some kind of delay or staggering between them; if you look at them and convert them, it seems like the other one is about 19 minutes behind. This is to give an advantage to one of the servers, one of the nodes, to always be primary, to always win in normal situations. I said that the zone is generated every ten minutes, so this means that if two new zones are generated on one and not on the other, the preferred primary would somehow not have generated a new zone, and that means there would be a...
This is just what I talked about: there's no IP address being reconfigured between the sides, both IP addresses are configured on each secondary, and each secondary decides based on the serial, as per normal DNS behaviour.
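As an illustration of the odd/even serial staggering Tomas describes, here is a small Python sketch. The unixtime base, the exact 19-minute offset and which node gets which parity are assumptions for the example, not the registry's actual implementation.

```python
# Toy sketch of the odd/even serial staggering described above. The unixtime
# base, the 19-minute head start and the parity assignment are illustrative.
import time
from typing import Optional

STAGGER = 19 * 60  # head start for the preferred node, in seconds

def zone_serial(node_is_preferred: bool, now: Optional[int] = None) -> int:
    now = int(time.time()) if now is None else now
    if node_is_preferred:
        return now | 1                 # preferred node always emits odd serials
    return (now - STAGGER) & ~1        # standby lags behind and emits even serials

# In normal operation the preferred node's serial is always ahead, so the
# secondaries transfer from it; if it misses two 10-minute generation cycles,
# the standby's serial overtakes it and the secondaries fail over.
print(zone_serial(True, 1_700_000_000), zone_serial(False, 1_700_000_000))
```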
NSD verification: this is a relatively recent feature, though still a few years old, as I realised while doing my slides. Basically, you can write a script that reads the zone on standard input, which is not really what we wanted; what we want, and generally use, is standard DNS queries to look at the zone, and that way we can use Net::DNS in the scripts, for example. We can use much shorter and simpler scripts; we wrote simple little scripts to check each criterion.
If something is wrong, the script should exit with a non-zero code, and that signals NSD to keep serving the old zone.
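A minimal sketch of what such a verifier hook could look like, written in Python with dnspython rather than the Perl and Net::DNS used here: it queries the candidate zone over DNS, checks a golden-records list, and exits non-zero so NSD keeps the old zone on failure. The address, port and golden names are illustrative; in a real setup they would come from the NSD verify configuration.

```python
#!/usr/bin/env python3
# Illustrative NSD verifier hook: exit 0 to accept the new zone, non-zero to
# keep serving the last known good zone. Assumes dnspython is installed and
# that NSD serves the candidate zone on the address/port given as arguments.
import sys

import dns.message
import dns.name
import dns.query
import dns.rdatatype

GOLDEN_RECORDS = ["example-bank.hu.", "example-university.hu."]  # illustrative list

def has_delegation(name: str, server: str, port: int) -> bool:
    """Check that the candidate zone still contains an NS RRset for this name."""
    query = dns.message.make_query(name, dns.rdatatype.NS)
    response = dns.query.tcp(query, server, port=port, timeout=5)
    target = dns.name.from_text(name)
    for rrset in response.answer + response.authority:
        if rrset.name == target and rrset.rdtype == dns.rdatatype.NS:
            return True
    return False

def main() -> int:
    server, port = sys.argv[1], int(sys.argv[2])
    for name in GOLDEN_RECORDS:
        if not has_delegation(name, server, port):
            print(f"golden record {name} missing, rejecting zone", file=sys.stderr)
            return 1          # non-zero: NSD keeps the old zone
    return 0                  # zero: NSD loads the new zone

if __name__ == "__main__":
    sys.exit(main())
```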
Yeah. One thing which I want to highlight here is that we are using the offline KSK functionality of Knot, because while OpenDNSSEC had the ability to use both a soft-signing backend and PKCS#11 for the KSK, that's not possible with Knot. However, we can have different instances of Knot, where one of them is just handling the KSK and the other one the ZSKs. Basically, we were also able to replace our custom scripts that were doing the pre-generation of the DNSKEY RRset and the signatures.
That's what I wanted to highlight here, yes.
NSD verification, this is how it looks if you run nsd-control status. This is the exact moment when the verification is done, so you can see that the new zone is being committed, there is a new version in NSD, but the verification is still running. Yes, you can see the slides later on.
We experienced an issue with IXFR, so basically we needed to backport a newer version to Debian 12; the newer version already had a fix, so this one is good.
There was also another incident: for some reason NSD stopped in its tracks during verification. The verification process properly completed and, as the log shows, finished successfully with a zero return code; however, somehow it got stuck, probably some kind of IPC communication issue. The other node, the non-preferred node, took over in 20 minutes, just as it was designed to. So even though the operators were being paged and whatnot, and they knew about the issue even during the night, they didn't need to do anything, and we were able to get together with the whole team the next day to figure out what happened, because no manual intervention was needed.
And I think that's it. I have some extra slides that you can check later on, because they are uploaded, but this is what I wanted to show. Yeah.
WILLEM TOOROP: Thank you Tomas. (APPLAUSE.).
Yes, we have very limited time so I think it's best to that you take the questions in the hallway.
TOMAS CILLAG: OK, thank you.
WILLEM TOOROP: Thanks, excellent, next speaker would be Yevheniya from the University of Grenoble, she has been doing research looking into extended DNS errors.
YEVHENIYA NOSYK: Good morning everyone. Can you hear me?
Today I will be speaking about extended DNS errors. So every time we receive a DNS response, we can see one of those response codes come back; for a non-existent domain, for example, we are going to get NXDOMAIN. Some of them were defined in the original DNS RFC, it's a pretty small list, but there are a couple of ambiguities: for example, the code 9 has two different meanings, not authoritative and not authorised, and the RCODE 16 was assigned twice by mistake.
But in any case, for DNSSEC validation failures or many other different problems, we most commonly see the SERVFAIL response code, and unfortunately it's very generic and does not provide us any clue as to what exactly went wrong.
So the solution to this is so‑called extended DNS errors.
What I am talking about here is this proposed standard that appeared four years ago. The original idea was to explain SERVFAILs, but more generally it's quite a generic mechanism to add context to DNS messages, even if they do not necessarily indicate a failure. And also, to make it clear, extended error codes exist completely independently from response codes.
So extended errors rely on EDNS0, and this is how the option looks. This is option code 15, and the two main contributions of the document are the INFO-CODE and EXTRA-TEXT fields.
So before extended errors, we would get this kind of Servfail response and we would have no idea what exactly happened.
Now, thanks to extended errors, we have quite some additional info. First of all, we can see the info code 7, which stands for Signature Expired, and this by itself is already pretty informative, but we can also go and check what we have in the extra text field which follows. What we can see here is, for example, the ID of the DNSKEY that generated the signatures, but also the expiration timestamp of the signature. An important note here is that the extra text field is implementation dependent: some software will be very detailed, others will return less info, and some can leave it completely empty.
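To show what reading these fields looks like in practice, here is a small sketch that sends a query with EDNS and DO set and prints any EDE options it finds, assuming dnspython 2.1 or later (which exposes EDE as dns.edns.EDEOption). The resolver address and query name are placeholders, not the ones used in this study.

```python
# Small sketch: query a validating resolver and print Extended DNS Error options.
# Assumes dnspython >= 2.1; RESOLVER and QNAME are placeholders.
import dns.edns
import dns.message
import dns.query
import dns.rcode

RESOLVER = "1.1.1.1"          # any validating public resolver
QNAME = "expired.example"     # placeholder for a deliberately broken name

query = dns.message.make_query(QNAME, "A", use_edns=0, want_dnssec=True)
response = dns.query.udp(query, RESOLVER, timeout=5)

print("RCODE:", dns.rcode.to_text(response.rcode()))
for opt in response.options:                      # EDNS options in the OPT record
    if opt.otype == dns.edns.OptionType.EDE:
        # opt.code is the INFO-CODE (e.g. 7 = Signature Expired),
        # opt.text is the free-form, implementation-dependent EXTRA-TEXT.
        print("EDE", int(opt.code), opt.text)
```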
Then, for the info codes themselves, there are 31 of them as of now; there is an IANA registry for those. 25 were defined in the original extended errors RFC, and the remaining ones were added later on. These errors are not really subdivided into any groups, but we can see that some of them deal with DNSSEC failures, others with cache and software operation, and many other different aspects of DNS.
And one reason I wanted to talk about extended errors today is because it's part of the DNS resolver recommendations published this year, which is why we were wondering whether this RFC is deployed by public resolvers and in DNS software.
To do so, we have chosen nine different systems to test: four open source software vendors and five big public resolvers. Then, as test cases, what we did was create 53 different subdomains under our extended DNS errors domain; some of them were completely misconfigured, while others present various interesting corner cases. I will not detail every single domain name right now, but feel free to check the website.
Then, for methodology, it's very simple: we send a DNS A request and we analyse the response.
Now, one problem we encountered is that OpenDNS does not work in France any more; every single response I received contained this extended error 16, which stands for Censored. In order for the measurement to make sense, I had to run it from a different VPS in a different country.
As a result, we tested nine different systems and we declared 53 different subdomains. Once again, I am not going through all of them; I will just discuss a couple of high-level trends.
So the first thing we were checking is how many subdomains trigger the exact same extended error code among all the tested systems.
So in this highlighted example, we see a domain name where the RRSIG RRset was removed, which is why we are going to get extended error code 10, which stands for RRSIGs Missing. We had nine more cases similar to this one, which means that in ten out of 53 cases the extended error codes were the same across all the tested systems. For the rest, we were wondering why the results actually differ.
So, for example, we have this subdomain which is not compliant with RFC 9276 because it has 151 additional NSEC3 iterations; this is an example of a domain name that was treated differently by different resolvers.
So, for example, in the first case, what Cloudflare did was return EDE 27, and they also notified us that the validation resulted in a bogus state. For Google DNS we got DNSSEC Indeterminate, and for OpenDNS, NSEC Missing. The first case is interesting by itself, because first of all we can see that we can have more than one extended error code in a packet, so two or more is perfectly normal; and also, this EDE code 27 is one of the recent ones, because it was added after the original RFC, and we can see that it is already used in the wild.
That's another reason for inconsistencies: some extended errors do not reflect the misconfiguration but rather resolver capabilities. So, for example, this domain name is signed with an algorithm that Cloudflare does not support, which is why it returns EDE 1, which stands for Unsupported DNSKEY Algorithm.
And finally, we had a bunch of DNSSEC-related misconfigurations, and in those cases we were getting the rather generic EDE code 6, DNSSEC Bogus; sometimes it's more precise, like Signature Expired, for example.
So the reason why we need to talk about inconsistencies is because of this recent proposed standard on DNS error reporting: it can happen that different resolvers report the very same problem using different extended error codes, and in that case it's important for name server operators to correctly interpret those reports.
And the second thing I want to discuss: I was wondering whether we could use extended errors to locate misconfigurations at scale. The methodology is still quite similar here: we send a request and analyse the response, and this time we were scanning more than 300 million registered domain names.
OK. So overall, more than 19 million domain names triggered at least one extended error code from Cloudflare: 19 unique errors and 240 combinations of two or more of those. The most common one was No Reachable Authority. As the description goes, the resolver did not manage to reach any of the name servers, and this problem affected more than 14 million domain names in our data set. To make it more precise, we also have the Network Error EDE, and here the extra text field becomes super useful: we can see the IP address of the name server that returned a REFUSED response code to Cloudflare.
And actually, those two extended error codes were the most common combination that we saw. In this case, for example, it means that EDE 22 says we could not reach any of the name servers, and EDE 23 tells us precisely what exactly went wrong.
We also got a bunch of Not Authoritative EDEs, which you would expect to be generated by a resolver, but what happens here is that it is the name servers who returned this extended error to the resolver, and the resolver simply forwarded it to us. So extended errors can also be forwarded.
We also got a bunch of Prohibited EDEs, which were generated on the name server side and simply forwarded to us. In this case, we are probably dealing with a software bug, because I checked a couple of cases manually and nothing seemed to have been prohibited: we were still getting the response, NOERROR, but the name servers still included the Prohibited EDE.
And this is probably the most interesting case we saw, because the name of the error suggests we are dealing with some DNSSEC validation problem, but the validation does not seem to fail. If you look at the extra text field, we can see that Cloudflare points us to the keys of the TLDs. So what we did next was to check those domain names' DNSKEYs, and those keys do not generate any signatures, which is why we get the RRSIGs Missing. In this case we contacted the operator of one of these TLDs, and they told us this DNSKEY is a so-called stand-by key, so it is not used to sign the contents of the zone file but may be used in the case of an emergency key rollover. We found 15 TLDs in total with stand-by key signing keys; that's what triggers this EDE from Cloudflare.
We also contacted Cloudflare, asking whether it was expected behaviour or not, and they told us yes, but they also... the text now says that if there is a key rollover in progress or a stand-by key in the zone file, we can expect to see this kind of extended DNS error.
So there are many more interesting cases that I will probably not have time to talk about. We also saw, of course, a bunch of DNSSEC-related errors, for example unsupported algorithms, expired signatures and many different ones, but it looks like delegation problems are the most prevalent ones that we see.
So to conclude: extended DNS errors is a proposed standard that appeared four years ago, and it is already implemented by major DNS resolver software vendors and public resolvers. It is extremely efficient for looking for misconfigurations at scale, and it's a promising technique for DNS troubleshooting.
Thank you. (APPLAUSE.)
WILLEM TOOROP: Thank you. We have room for questions.
AUDIENCE SPEAKER: Hi, I was one of the co-authors on the draft and I want to say thank you very much for this research; it's fantastic to see it was quickly and widely adopted in many situations. And I wanted to remind people in the room who might be seeing this for the first time, or those who may have forgotten: these error codes are unsecured, they are part of the EDNS record and they are not signed, and so they are very useful and informative for debugging and reporting; however, you shouldn't rely on them for making any protocol decisions. So, thank you.
WILLEM TOOROP: Any more questions? We do have time. Also for the online participant, please use the question and answer section in Meetecho.
AUDIENCE SPEAKER: Shane Kerr from IBM. Do you expect that differences in EDNS results, EDE results rather, would cause problems for resolver operators running multiple implementations? If you are running both Knot and BIND, you are apparently in some cases going to get different results; do you think that can cause operational problems for the actual users?
YEVHENIYA NOSYK: For me it was a bit confusing; it took time to dig into the differences and check why exactly we saw those. It's important to say that different does not mean incorrect: for example, with the DNSSEC-related ones, some of them were more precise, others were more generic, but in the end they do point to the same root cause, right. I think it's more going to be confusing for those who implement DNS error reporting, because if you start getting many reports from different resolvers with different EDEs, then it can get confusing and...
AUDIENCE SPEAKER: Yes, thank you.
AUDIENCE SPEAKER: Hi, my name is Roy Arends from ICANN. One minor correction: the error code you mentioned at the beginning, it wasn't assigned by mistake, they were made in different contexts, that's why they are both error code 16. Love the work. I have been a little bit involved in all of this stuff, so I was very curious about your presentation, very interested as well.
You mentioned, as a response to the last questioner, that it may be good for different implementations to standardise on the same error code in the context of error reporting. However, in the context of error reporting, it might actually be interesting to see that this type of resolver will say this and the other type of resolver will say that, and combined it actually makes sense of what's going on, instead of having standardised on one single error code. Two different error codes, I say R Code, I mean error codes, two different error codes might make sense as they amplify each other.
YEVHENIYA NOSYK: If we are going to receive those reports, we need to be careful when interpreting those.
AUDIENCE SPEAKER: Absolutely. Out of curiosity, have you come across any DNS error reporting implementations?
YEVHENIYA NOSYK: I haven't, yet. Thank you.
AUDIENCE SPEAKER: Good morning. So in your table, I think it was slide 19, quite an interesting thing happens where the Quad9 column is a mix of the Unbound and PowerDNS columns, and that is because Quad9 runs both and you might get a response from either; they have, in fact, requested us to see if we can be made more consistent with what Unbound does.
YEVHENIYA NOSYK: OK. Thank you.
AUDIENCE SPEAKER: Thank you, good talk.
YEVHENIYA NOSYK: I do have a question, whose telephone is this? OK!
WILLEM TOOROP: It's mine! Thank you. (APPLAUSE.).
Next up is Roy Arends with a peculiar incident he noticed.
ROY ARENDS: Hi everyone, my name is Roy Arends, I work for ICANN. This is a story about an incident, a global incident, that we noticed about six months ago. But in order to understand it, I want to give you a little bit of a back story.
So this is about reserved top level domains, RFC 2606, which was published in '99, and it talks about things like .test. The reason why these domain names need to be reserved is to give a little bit of guidance, right, because otherwise some local additional top-level domains used for testing might actually end up being used in the real world when they get delegated later on.
Of course, in '99 no one could foresee that we would have many more top level domains, because in 1994 it seemed extremely unlikely there would be more than the currently assigned ones. However, here we are, we have a little bit more. Before 2012 we already had 13 additional new top level domains, and in 2012, of course, about a thousand new top level domains were added, and we currently have more than 1,500.
Now, the reason I am mentioning 2012 is that when that avalanche of new top-level domains came about, the Security and Stability Advisory Committee, a committee within ICANN that advises the ICANN board about security and stability, noticed that there is a potential for collisions, and they wrote these three different reports; they have done much more work since then on this. There is basically an advisory on internal name certificates: if you remember, you could get certificates for a domain that ended in .mail even though it wasn't assigned by ICANN at the time. And there's the mitigation of name collision risk, SAC062, which was about things like host or corp; you can imagine that if host or corp gets assigned as a top level domain, there will be many collisions, many more name collisions. There was a URL; at the end of the presentation I have assembled the URLs I am using in this presentation.
Meanwhile, ICANN decided, sorry, it was decided to initiate the Name Collision Analysis Project, and that's how I came across this: ICANN organises the DNS Symposium, we had one in Bangkok in 2019, and it was Alexander from .at who presented about DNS magnitude. I think it was based on an idea from Sebastian, but he implemented it, and it was about domain popularity. Most of us are all counting queries, how many queries does that domain get, and he basically had the idea that instead of counting queries, you count individual unique hosts that query for a domain name. And that's literally what DNS magnitude is; it's a fancy title, simple solution.
Right, so why is it more interesting? You can imagine, if this entire room, let's say we have about, I don't know, 200 people in this room, if the entire room asks for Weissbier, right, that's far more popular, two hundred people asking for Weissbier, than just me asking two hundred times for Weissbier. It's a different kind of popularity, and I think if two hundred people ask for it, it's a little bit more interesting than if one person asks two hundred times. So there it is. You count hosts, not queries.
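As a rough illustration of that idea, here is a tiny Python sketch that counts unique sources per name and normalises them on a log scale. The exact formula is in the nic.at paper Roy refers to later, so treat this scoring as an approximation of my own rather than the published definition.

```python
# Rough sketch of "count hosts, not queries". The log-normalised score below
# is an approximation of the DNS magnitude idea, not the published formula.
import math
from collections import defaultdict

def magnitudes(query_log):
    """query_log: iterable of (source_ip, queried_name) pairs."""
    sources_per_name = defaultdict(set)
    all_sources = set()
    for src, name in query_log:
        sources_per_name[name].add(src)
        all_sources.add(src)
    total = len(all_sources)
    return {name: 10 * math.log(len(srcs)) / math.log(total)
            for name, srcs in sources_per_name.items()}

# 200 people asking once each outranks one person asking 200 times.
log = [(f"10.0.0.{i}", "weissbier") for i in range(200)] + [("10.0.1.1", "pils")] * 200
print(magnitudes(log))   # weissbier scores close to 10, pils scores 0
```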
So we had this, we asked him to work with us and gave him access to L-root data, and there's a fantastic paper published around this, there's a URL later in the presentation for this as well. Since then, we have decided to apply this to L-root data, and this is a project of mine that I am running, and I need to use my glasses for this, I am getting older.
This is basically the magnitude statistics. It's measured per UTC calendar day, and there's a delay of six days; I'd like to say this is because of security, we don't want people messing with this, running scripts and immediately seeing results. It also comes in handy when there are mistakes being made in collecting the data, in the statistical calculation, etc. And keep in mind I am not a developer, at least not a very good one; I do the research and the development myself, and so I check the stuff every day. Let me go through this table really quickly. You see com, net, arpa, org, etc. If you look at the weekly and monthly and quarterly rank, it's extremely stable; the only change you see here is that within a week .cn and .xyz exchange places, and you can see that the number of unique sources, the hosts basically, 241,000 compared to 230,000, is relatively close. You can imagine it also depends on whether it's a weekend or a weekday; they swap places sometimes.
Anyway, I measure a few things: sources, the number of unique sources, and I also count queries, so it's not just hosts that I show but also queries, and you can see why it's sometimes different. If you look at, let's say, .local, which is a special use top-level domain, it has far more queries than the higher ranked .de, but .de has more unique hosts asking for it. Anyway, let's continue. Like I said, I developed this and I am not a very good developer, so I look at this stuff every day. And around May I saw this jump for .scloud. Normally at the top are all things that are either delegated or special use, and this one at the bottom, I hope you can see this, it says scloud. I had never heard of scloud before, I had no idea what it was, and it was ranked at 17 and it came all the way from 635. What I didn't tell you yet is that the ranking is based on a log scale; it's an exponential scale, so it's extremely hard to go from 600 to 500, or 500 to 100, or from 100 to 17. It's extremely hard, something must have happened. If I just remove, and you see the click thing on the top, hide this and hide that, if I remove all the delegated and special use ones, you can see the rest.
Now, .local, we know this is very popular, and .lan and .internal, they are far below .scloud. Do any of you have an idea of what .scloud was? I hope you haven't seen the presentation from Dan over the weekend. I didn't know either, and I typed it into Google and Google said, you mean cloud, and I said no, no, scloud, do you really mean scloud, and it suggested something along the lines of SoundCloud; no, I don't mean that. So I looked a little bit further, and it turns out, if you look at packet capture data, there are millions of queries for collector.azure.microsoft.scloud; that was basically 90% of the traffic that ended in .scloud. Microsoft release versions are around April and May. If you want to compare, you need a baseline, and so I took .internal as a baseline, and you can see, if you look at the rank, right, one being most popular and a thousand being very unpopular, that the rank is pretty stable for .internal, and that's what you see in the rank graph. The middle graph is the number of unique hosts. Now, you see a few dips there and that's my own bias, that's not actually bias in the traffic; like I said, I am not a very good developer, sometimes these things fall over, and sometimes DNS engineering have more important stuff to do than tell me when they do scheduled changes. They do announce it, I just don't always pay attention; that's why the dips are there.
But overall, pretty stable for .internal, except when you look at the number of queries. As you probably know by now, Andrew gave a presentation about .internal; it's now a domain that will never ever be assigned by ICANN, it's a private domain, and that is probably related to the dipping queries, because operators might now suppress .internal queries.
All right, if I superimpose .scloud, sorry, if I superimpose it over this graph, you get this, and you see this massive rise; keep in mind again this is a log scale, so this rise is absolutely crazy, all of a sudden you get this massive increase in traffic. So in April and May the rank all of a sudden became higher for .scloud than for .internal, and you can see it in the hosts as well, in the second graph. The reason .scloud doesn't show up in the queries graph is because of the massive amount of .internal queries that we had; keep in mind that .scloud had more hosts than .internal, but fewer queries.
If I then zoom in on the specific months of April and May, you see a little bit of a weekend pattern: in a weekend you see fewer hosts than during the week, you see it specifically in the middle graph, and you can see how .scloud became more popular during these two months.
Now, specific things happened on the 11th of April and the 16th of May. And it was so specific that I decided to look at where this actually came from, and it turns out there is a thing called Microsoft Teams; I don't have to give an introduction, you should have heard about Microsoft Teams before.
And there are 36 variations of Teams on the Microsoft website: there's an implementation for iOS, for Android, there are government-related implementations for Teams. So 36 variations of it. And exactly on April 11th, one was deployed. And since this is iOS, and not necessarily macOS, a lot of people have this and they automate their updates, right, the app updates when updates are available; that's why we saw the massive rise of hosts and queries.
And on May 16th, yet another new version was deployed, and that's where we see the large increase in queries on that date.
Continuing. First I asked Duane from Verisign, he has a similar interest to mine, looking at root server traffic to see if we can find fun stuff, and he hadn't seen this before. He was also saying that this is an interesting issue, we should investigate, so we did.
So on May 13th, I sent an email, I went on the DNS-OARC Mattermost by the way, and I reached out to an old friend of mine, he is not old, I have just known him for a long time, and he worked at Microsoft at the time, he still works at Microsoft, and I asked if he could do something about it and if he understood what was going on. I sent traffic samples and graphs, and eventually they came back and said yes, this was a bug and it will be fixed in subsequent releases.
So let's have a look. Right. This is what happened since they responded. And by the way I understand they were already working on a fix when they got my alert.
So since then, looking at current data, you can see that... so these are new releases of the Teams app, and you can see the ranking going down; it's still not where it was before the incident, but it's going down nevertheless. You can also see the number of hosts querying for it going down, and the number of queries for it going down. Now, of course, a better solution would be to not use .scloud at all, because you are, not hijacking, but squatting a space that is not assigned to you, and we now have .internal. So if anyone from Microsoft is here, please tell your peers.
Now, I understand from my friend that the techies actually agree with this, but it's not always the techies who get to make the choice. So this is the story about .scloud, and you see the massive rise because of that. Anyway, that was my story, here are the references I used. If there are any questions, my email address is below. Thanks.
(APPLAUSE.)
WILLEM TOOROP: There are four minutes for questions, so if there are any questions:
AUDIENCE SPEAKER: Jim Reid, interesting talk, well done. I want to talk about this use of .scloud, not specifically that domain. We have seen a trend of big vendors plucking domain names out of the air and using them for their own internal management purposes, and it's potentially wreaking havoc on the DNS systems, and the root servers in particular. We were shocked to find there were ten to twelve character strings coming from every instance of Chrome to try to see if there was any domain rewriting going on. I wonder perhaps if it's time for SSAC to come up with a stronger recommendation that says, particularly to big vendors, don't do this, please don't do this. I am not sure .internal is necessarily the answer, but we really have to get this message home: you should not be picking names out of the ether and using them for your own purposes, especially for large applications with huge numbers of users. I think that's a very dangerous thing to do. Thanks.
ROY ARENDS: Thank you, I agree with you.
JIM REID: You can buy me a beer later!
AUDIENCE SPEAKER: Cathy Almond from ISC. We didn't get any reports of this, or I haven't heard of any; I am assuming that while this was going on, resolver operators potentially would have seen it as a storm of queries that were responded to with NXDOMAIN?
ROY ARENDS: The amount of queries was actually, I mean, it wasn't a super large amount of queries, it was just a large amount of hosts, and for individual resolver operators it would be hard, because they would be one of those hosts; it would be hard to distinguish that. There are some reports, and someone said oh, I see these queries, but from what I have seen it was completely under the radar of everyone, basically. So it would be good if we could do something about this. But I don't think that resolver operators would have the ability to check this internally in their own systems, because the amount of queries wouldn't be that high. And if you think about it, azure.microsoft.scloud, if you have Microsoft Teams inside your organisation, these queries look benign, right, so.
AUDIENCE SPEAKER: Thank you.
ROY ARENDS: Thank you.
WILLEM TOOROP: Thank you, there's also an online question, I will read it out. Sebastian asks: Usually not one to defend Microsoft, but could this just be a typo, while Microsoft is migrating their services to Microsoft cloud?
ROY ARENDS: I have no answer for that. It could be, but I think these guys, at least on the technical side, know what they are doing; I don't think it's a typo, and I understand it's a joke, but all joking aside, the main point is don't squat, right, don't use .scloud, use .internal.
AUDIENCE SPEAKER: Suzanne Woolf. There has been some hallway conversation, because there was a related talk, about a document that goes through all of these things, that goes through .internal, .alt, all of these things about where these names come from, and sort of advising people where you want to put your ad hoc names. One of the things I really want this document to say is that appropriating single-label names, top-level names, is never a good idea, and if you think you have to do that, you probably need to look closely at what you are doing and why. But in any case, making this document an RFC will only really reach people that read RFCs, but sometimes these things break out, and there are other places to publish something. It does seem like we need something along those lines to introduce people to the whole zoo of semi-different naming conventions we are using these days. I am going to assume everybody who agreed with you is going to be willing to contribute text and think about a home for that document.
ROY ARENDS: Yes.
AUDIENCE SPEAKER: Very short. Michael Richardson. 1994, yeah, I sold a firewall that did split horizon DNS by default, and the IETF never defined what that was. I am guessing .scloud resolved within Microsoft somehow, to something useful, so that would have been a split horizon. I think split horizon is a disaster and we should reckon with it and tell people to use internal.example.com instead. And, as Suzanne said, we need that document, if only as a hammer for a junior engineer to tell his manager, no, this is a bad idea. Thanks.
ROY ARENDS: I agree. Thank you .
WILLEM TOOROP: Thank you, Roy. (APPLAUSE.).
And now I am happy to say that we once again have Anand presenting the traditional update from the RIPE NCC.
ANAND BUDDHDEV: Good morning. I am Anand Buddhdev, I work at the RIPE NCC, in the team that operates the RIPE NCC's DNS infrastructure, and I am here to give a short update on what we have been doing since the last RIPE meeting.
So we proposed the retirement of a service of the RIPE NCC. This service has been running for many years: we provided secondary DNS for LIRs, the members of the RIPE NCC, and we would allow them to have secondary DNS on our server if they had large enough reverse DNS zones.
So sorry slide was wrong. So in May this year, we proposed to shut down the service and we published a RIPE Labs article. This article is still available so you can go and refer to it. We engaged with our community, we had discussion on the DNS working group mailing list and then my colleague Martin did a presentation at RIPE 88 with all the details of how we would execute this.
Our community supported this decision, and so in June we started implementation, and it's still ongoing. So this is the process. On the 18th of June, we changed the configuration and stopped accepting requests for secondary DNS with this service. On the 19th of June, we sent out emails to all the users of the service: we sent emails to the zone contacts, the tech contacts and the admin contacts of all the domain objects in the RIPE database where the name server was referenced. And immediately users started to do updates, and many users managed to do the updates without any trouble.
A few people were confused about a couple of things, they sent tickets and these were also resolved.
We have been sending monthly reminders about this to folks to migrate away from our DNS server and yeah, we do that at the start of each month.
Here is a chart showing the progress. The blue line shows the number of domain objects remaining in the database that still reference our name server, and the orange line shows you the trend, so it looks quite good; the rate of withdrawal from the service seems to be approximately in line with what we had expected, but yeah, there is still a long way to go.
We do have stragglers and we did wonder why people were not moving away so quickly and we decided to broaden our reach so we looked at the enclosing address space or matching address space for these DNS zones and we extracted more contact information from the address space objects that we have in the RIPE database because we thought perhaps the contacts in the domain objects might not be the right ones or might be outdated so that resulted in some more updates.
The other thing is that we had some large operators with a large number of objects who had still not migrated away, and we reached out to them by phone, and they said oh yeah, yeah, we didn't realise we had to take action, or we were just waiting for, you know, some more time or engineering resources. But phone calls do help, and so they quickly took action and there were more updates. But we still have over 1,000 domain objects in the RIPE database to go, so that's over a thousand reverse DNS zones that still need to be migrated away.
The deadline for this is the 31st December. Anyone still using this service is required to migrate away from our name server by the 31st December and what we will then do is somewhere in January, probably the middle of January, we will update all the domain objects in the RIPE database that are referring to our name server and we will delete that name server.
This could have the unfortunate consequence of leaving some reverse DNS zones with just one name server; currently they reference the RIPE NCC name server and one of their own name servers. At the last count there were 66 such reverse DNS zones. This is not encouraged by the RFCs; however, the DNS protocol does allow for it, and resolution should work if that single name server is still answering and active. The users have been sent several emails, and hopefully they will take action soon and migrate away.
I also looked at the latest stats and since 24th October, there have been no new updates, so anyone present here who is still using this service or who knows anyone who is using this service, please go and prod them.
And finally, I come to an update on the authoritative DNS and K-root hosted DNS services that the RIPE NCC is providing. So we operate the K-root name server, and we operate the AuthDNS service, where we carry ripe.net and the zones of the RIPE NCC, as well as the reverse DNS zones of the other RIRs, the forward domain names of all the other RIRs, and also some ccTLDs that we provide secondary DNS service for.
So we have 19 instances of the authoritative DNS service and 120 instances of K-root. There is a lot of redundancy and resilience in the K-root service, and we are seeking more hosts for the authoritative DNS service, so if you are interested in hosting one of these, please speak to us; we have a website where you can find more information. And just briefly, in order to host one of these, you need to provide a server, a Dell server or a virtual server, with enough memory, disk space and network interfaces; you contact us, we take you through the process, and we install and configure the server and instantiate the service. AuthDNS benefits everybody: reverse DNS may not seem so important, but mail services and some services like SSH make heavy use of it. And then of course all the RIRs and their forward domains are in there.
So please come talk to us; we would appreciate it if you would host one of these instances for us.
And with that, I thank you for listening to me. And if you have any questions or comments, please let me know.
(APPLAUSE.)
WILLEM TOOROP: We have 30 seconds.
AUDIENCE SPEAKER: Have you detected any vanity names being used for the name server that you are retiring, someone who is reusing your IP addresses for queries that you haven't seen?
ANAND BUDDHDEV: Not that we are aware of. We did look at all the zones that are using our service, and they just reference ns.ripe.net.
JIM REID: Jim Reid, troublemaker again. Nice talk, thank you very much. I want to say thanks to you and your colleagues; you do a great job running this DNS infrastructure and I think we all take it for granted, so a little bit of thanks now and again is long overdue. I have one question. You can buy me a drink later too!
The question I had was about the AuthDNS instances. When you were talking about putting out extra instances of the anycast K-root service, if I remember correctly the policy was that you were trying to favour locations inside the RIPE service region; is that still a constraint on what you are going to do with these AuthDNS instances, or are you taking a much more open-minded approach?
ANAND BUDDHDEV: No, we are open to hosting these instances anywhere. And we try to look for regions that are not so well served, so yeah, we welcome applications from everywhere.
JIM REID: Thank you.
AUDIENCE SPEAKER: The service that you are retiring, is that only for reverse zones, or for forward zones as well?
ANAND BUDDHDEV: This is only for reverse DNS zones. For LIRs that had a large amount of IP addresses, if they had a /16-sized reverse DNS zone or a /32-sized IPv6 zone, they could get the service from us, and that's the one we are retiring.
AUDIENCE SPEAKER: OK, thank you.
WILLEM TOOROP: Thank you, I have a small question myself. I can't help noticing that you look even more amazing than you usually do!
ANAND BUDDHDEV: Thank you, well I dressed up festively because of the Hindu festival of Diwali which started yesterday and finishes Saturday.
WILLEM TOOROP: Happy Diwali and thank you! (APPLAUSE.)
Don't forget to rate the talks. It helps the chairs to assemble a nice programme for you. Also don't forget, tomorrow at 6 o'clock in the room next door, the main room, is the resolver recommendations in practice session. Thanks to the scribe, to audio, video, Meetecho, many thanks to the stenography, great job, you are amazing, and thanks to Andrew for the additional references in the chat. And yeah, see you hopefully tomorrow at the plenary session and otherwise in Lisbon, thank you. Cheers.
(Coffee break)