Listen in on Jane Street’s Ron Minsky as he has conversations with engineers working on everything from clock synchronization to reliable multicast, build systems to reconfigurable hardware. Get a peek at how Jane Street approaches problems, and how those ideas relate to tech more broadly.
Clock synchronization, keeping all of the clocks on your network set to the “correct” time, sounds straightforward: our smartphones sure don’t seem to have trouble with it. Now keep them all accurate to within 100 microseconds, and prove that you did, and things start to get tricky. In this episode, Ron talks with Chris Perl, a systems engineer at Jane Street, about the fundamental difficulty of solving this problem at scale and how we solved it.
Welcome to Signals And Threads, in-depth conversations about every layer of the tech stack from Jane Street. I’m Ron Minsky.
Today we’re going to talk about a deceptively simple topic: clock synchronization. I think there’s nothing like trying to write computer programs to manipulate time to convince you that time is an incredibly complicated thing, and it’s complicated in like 16 different ways, from time zones to leap seconds to all sorts of other crazy things, but one of the really interesting corners of this world is how do you get all of the clocks on your big computer network to roughly agree with each other? In other words, clock synchronization.
So we’re going to talk about that with Chris Perl, who’s a sysadmin who’s worked at Jane Street since 2012. Chris is better than anyone I have ever worked with at diving into the horrible details of complex systems and understanding how they work and how they can be made to work better. He’s done a lot of work here specifically on clock synchronization, and has, in the course of that, redone our entire system for doing it, so he’s had an opportunity to really learn a lot about the topic. Chris, to get started, can you give us just a quick overview of how computer clocks work in the first place?
So, I guess the rough gist is something like this: you have some oscillator, a little crystal effectively, that’s inside the computer and oscillating at some frequency, and that’s driving an interrupt that the operating system is going to handle at some level – there’s probably lots of details here that I’m just skipping over – but that’s driving an interrupt that’s going to happen in the operating system. And the operating system is using that to derive its notion of time, so if you have a really high-quality oscillator, and those timer interrupts happen at the right rate so that you’re tracking real time, that might just happen: if your oscillator’s very good and very stable, you could actually be pretty close to the correct time just by virtue of that. But the truth is that most computers come with fairly bad oscillators, and they change their frequencies for various reasons, like heat. So if you are using your computer to compile the Linux kernel or something like that, that could change the heat profile, change the frequency of the oscillator, and actually change how well you’re doing at keeping real time.
When we naively think of clock synchronization as people, we think of it as like, “I’m going to go set my clock”. I’m going to look at what time it is and adjust my clock to match whatever real-time is, but you’re actually talking about a different thing here. You’re talking not just about setting what the right time is right now but keeping that time correct, essentially keeping the rate at which time is going forward in sync.
Correct. You’d love it if you could get like a really, really high-quality oscillator for super cheap in all your computers and then you wouldn’t need a lot of adjustment to get them to keep the correct time, but that would be really expensive. You can buy such things, they just cost a lot of money.
So, you say that heat and various other things that are going on in the computer will cause this rate at which time appears to march forward inside of your computer to drift around. How accurate are these clocks? Can you give me a numerical sense of how far these things drift away?
The stuff that we run, we capture some of these statistics, and we see machines that have a frequency correction applied to them of, say, 50 parts per million – which is like microseconds per second – so that works out to roughly a couple seconds per day of drift. But I’m sure that if you had a super old desktop under your desk that you stole from your parents or something, and you were trying to rebuild it into a Linux box, you might have worse numbers than that. For a relatively current generation server from a well-known vendor, you’re talking somewhere around 50 to 100 microseconds per second that they can sort of walk off out of alignment.
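To make those numbers concrete, here’s a quick back-of-the-envelope calculation (the 50 ppm figure is from the conversation; the helper name is just for illustration):

```python
# Clock drift implied by a frequency error expressed in parts per million (ppm).
# 1 ppm of frequency error means the clock gains or loses 1 microsecond per second.
SECONDS_PER_DAY = 86_400

def drift_per_day_seconds(ppm: float) -> float:
    """Seconds of drift accumulated over one day at a given ppm frequency error."""
    return ppm * 1e-6 * SECONDS_PER_DAY

# An uncorrected 50 ppm error works out to a bit over 4 seconds per day.
print(drift_per_day_seconds(50))
```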
Okay, so clock synchronization is the process of trying to get all of those clocks that you have across your whole data center and across multiple data centers to be in sync with each other. Is that the right way of thinking about it?
I think so. “In sync”, is an interesting sort of thing to say, right? You kind of would like that if you were able to instantaneously ask two random servers on your network, what time it was at the same exact point in time, if you could somehow magically do that, that they would agree to some relatively small margin of error, and I think that that’s kind of what we mean by clock synchronization. That if you could somehow magically freeze time and go ask every single computer on your network, “Hey. What time do you think it is?” that they would all roughly agree to within some error bound that you can define.
Right. And this basic model actually assumes that there is a well-defined notion of what it means to be instantaneously at the same time, which isn’t exactly true because of relativity and stuff like that, but we’re going to mostly ignore that. So, I guess one property that you’re highlighting here is having the clocks agree with each other, and that’s part of it, but there’s another piece, right, which is having the clocks agree with some external reference. There’s some notion of like, what does the world think the time is? So, where does that external reference come from?
I’m not an expert on this stuff, but I’ll give you the sort of 10,000-foot view. You have various physics laboratories all over the world, like NPL in the UK, and other places across the globe. They all have measurements of what they think time is, using things like hydrogen masers and very accurate atomic methods. They contribute all of that stuff to a single source, who averages it, or does some sort of weighting, to come up with what the correct time is, and then you kind of magic that over to the Air Force, who sends it up to the GPS constellation. And GPS has a mechanism for getting time from the GPS satellites down to GPS receivers. So say you’re a person who runs a computer network and you’re interested in synchronizing your clocks to a relatively high degree of accuracy with something like UTC, which is effectively Greenwich Mean Time: the current time without time zones applied.
If you’re interested in doing that, what you can do is you can just go out to a vendor and you can buy a thing called a GPS appliance, which can hook up to a little antenna that goes onto the roof. It can receive this signal from the GPS constellation and basically gives you out time, and the accuracy there is something like maybe 100 nanoseconds or so. So you’ve got the sort of atomic measurements being fed up to a GPS constellation, down to GPS receivers that you, as an operator of a computer network, can buy.
And for the purposes of this conversation, we’re going to treat those GPS receivers as the received wisdom as to what time it is, and our job is to figure out how, inside of a computer network, you make all of the different devices agree with each other and agree with that external reference.
Why is it important? What does having synchronized clocks help you do?
If you put yourself in the shoes of a financial regulatory authority, you have all these different participants out there doing stuff with computer systems, and something weird happens, and you’d like to come up with a total ordering of events of what led to this crazy thing – or what led to this good thing, who knows – but you want a total ordering of events. If people don’t have good clock synchronization to some external source, you can’t compare the timestamp from participant A to the timestamp from participant B. But if you were to decree that everybody must have time that is within some error bound, then you know: if two timestamps are within that error bound, well, I can’t be sure about the ordering, but if they’re farther apart than that, I can be sure about the ordering. I can know which one came first and which one came second, and that can be very useful.
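A tiny sketch of that reasoning (the function and numbers are mine, not from the episode; note that if each participant’s clock is individually within the error bound, two timestamps of the same instant can disagree by up to twice that bound):

```python
def ordering_is_provable(ts_a: float, ts_b: float, max_error: float) -> bool:
    """True if the order of two events is provable from their timestamps.

    Each participant's clock is assumed accurate to within +/- max_error
    of UTC, so in the worst case two timestamps of the *same* instant can
    differ by 2 * max_error. Only gaps larger than that prove the order.
    """
    return abs(ts_a - ts_b) > 2 * max_error

# With a 100-microsecond bound, events 500 us apart are provably ordered;
# events 150 us apart are not.
BOUND = 100e-6
print(ordering_is_provable(1.000000, 1.000500, BOUND))  # True
print(ordering_is_provable(1.000000, 1.000150, BOUND))  # False
```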
So that’s the motivation that’s very specific to our industry, but don’t people in other industries care a lot about clock synchronization, too? I would have thought that there are other reasons that would drive you to want to synchronize the machines on the network.
Oh, sure. There’s lots of different things. I mean, just like a general sysadmin topic, a lot of times you want to gather logs from all the systems on your computer network, and you want to analyze them for various reasons. Maybe it’s because you’re concerned about intruders. Or maybe it’s because you’re just trying to understand the way things are functioning, and if your clocks aren’t synchronized it’s very hard to kind of understand things that might have happened on system B and how they relate to system A because the two timestamps are just not – you just can’t compare them if they’re not synchronized.
And I suppose there are also some distributed systems, algorithmic reasons to want clocks. Certainly, some kinds of distributed algorithms end up using clocks as ways of breaking ties between systems, and so that requires at least some reasonable level of synchronization.
For sure. There’s also other network protocols that are widely used that require clock synchronization, but much less precise levels of clock synchronization. Kerberos is a widely used authentication protocol, and that requires that the clocks be synchronized to within five minutes, and the idea there is to thwart replay attacks, and stuff like that, making sure that somebody can’t obtain your credentials from a couple days ago and use them again, that kind of thing. So there, it’s like the error bars are very wide but there’s still some sort of synchronization necessary.
Right. And I guess that’s a general theme when thinking about synchronization: different applications require different levels of synchronization, but more closely synchronized never hurts.
There’s definitely tradeoffs as you start to get into the lower bounds, but yeah. If they were all free, sure, I’d like to have them exactly the same.
How do people normally approach this problem of clock synchronization? What are the standard tools out there?
Most people just kind of run whatever their distribution shipped as an NTP daemon. NTP stands for the Network Time Protocol, and it is a protocol that, up until not that long ago, I just kind of assumed used some kind of magic. It knows how to talk to some servers on the Internet, or some local servers that you probably then have talking to servers on the Internet, and it synchronizes your clocks with those servers. It exchanges some packets, maybe it takes a little while, maybe a few minutes, maybe longer. You probably don’t understand exactly why, but eventually your clocks are relatively in sync, to within maybe tens of milliseconds or so.
Can you give us a tour of how NTP actually works?
Like I said, for a long time, I just kind of assumed it was magic and didn’t really think too hard about it, and then at some point, I got tasked, within Jane Street, to actually look at some of this stuff and try and meet some requirements that were a little bit harder than the sort of standard tens of milliseconds synchronization. So I actually went and just was like, “Okay. Well. How does NTP do this from first principles?” right? Like, let’s go read some of the papers from David Mills. Let’s just go see if we can actually reason this out ourselves. At the end of the day, it’s just four timestamps. There’s a lot more complicated stuff around it, but the sort of core of it is these four timestamps.
Let’s say I’m the client, and you’re the server. First, I send you a packet, but before I do I write down a timestamp. When you receive that packet, you write down a timestamp. Then, you send me a reply, and before you do you write down a timestamp. Finally, when I receive that reply, I write down a timestamp.
It may not seem that groundbreaking, but with just those four timestamps I can compute two important numbers: the offset and the delay. The offset is how far my clock is off from yours, so if you think it’s 12 pm and I think it’s 12:05 pm, then the offset would be five minutes. The delay is how long it took those packets to traverse the network. To compute those numbers you basically solve a system of equations, and for me an important step was actually writing those equations down with a piece of paper and a pencil and solving them myself. What that made clear is that there’s a sort of huge assumption in this protocol: that the delay for the first packet, where I timestamped and then you did, and the delay for the second packet, where you timestamped and then I did, are the same. If they’re not the same, they introduce what’s called error. That assumption is made so that you can actually solve those equations to get the offset and the delay.
Can you maybe explain what it is about the calculation that depends on the symmetry between the send time and the received time?
Those delays are kind of what tie it together, right? You know that if the clocks were in sync, the timestamp that you took minus the timestamp that I took should be equal to the delay for the packet to get to you, right? And vice versa: my timestamp, from the second packet that I received, minus your timestamp should be equal to the delay it took for the packet to get from you to me. And you’re like, “Well, what do I do with this information?” And you say, “What if I just assume that those two delays are equal?” If I assume that those two delays are equal, then I can start rearranging the various pieces of the equations, and that’s how you can actually solve for the delay and the offset.
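Written out, the calculation being described looks like this (a minimal sketch; the names t1 through t4 are mine, in the order the timestamps are taken, and the symmetric-delay assumption is exactly the one flagged above):

```python
def ntp_offset_and_delay(t1: float, t2: float, t3: float, t4: float):
    """Compute (offset, delay) from the four NTP timestamps.

    t1: client sends request     (client clock)
    t2: server receives request  (server clock)
    t3: server sends reply       (server clock)
    t4: client receives reply    (client clock)

    Assumes the forward and return network delays are equal; any
    asymmetry between them shows up directly as error in the offset.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2  # how far the client clock trails the server
    delay = (t4 - t1) - (t3 - t2)         # round-trip time actually spent on the network
    return offset, delay

# Example: client clock 5 minutes (300 s) behind the server, a symmetric
# 10 ms one-way network delay, and 1 ms of server processing time.
offset, delay = ntp_offset_and_delay(0.0, 300.010, 300.011, 0.021)
# offset comes out to about 300.0 s, delay to about 0.020 s
```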
What’s the role of the two timestamps on the server side? So, if you ask me what time it is, I write down when I receive your request, and then I write down the time when I send the reply back. You could imagine a simpler protocol with just three timestamps, where you just assume that the one time I wrote down happened in the middle of that interval – the interval between the time you sent the first message and received the second message.
How do you know when the middle is, right? There are lots of vagaries that happen with operating systems. If you timestamp on either end – as soon as you receive the packet you timestamp it, then maybe you have to do some work, and then right before you send it back you timestamp it again – that’s how you get closest to having those differences I mentioned represent the actual network delay from one to the other.
And I guess an extra assumption that you’re making here is that, in the period between the first timestamp and the second timestamp, you had better assume that the rate at which the clock is going forward is about right. I think that throws another error term into the equation, but it’s typically extremely small, right? It certainly seems like something you can, in practice, ignore, because if you just look at the number of parts per million of drift there is in a real computer clock, that error is, in fact, pretty tiny.
Right, well, you’ve got the correction being applied by the time daemon that’s running on the computer, which is keeping the clock in sync. Presumably the server side of this communication is also taking time from somewhere, either a reference clock or some sort of higher upstream stratum in NTP (clocks that are better than it, something like a GPS receiver), and it has applied a correction to the operating system to say, “Hey, I currently believe that the frequency is off by this much. Please correct that when you hand me back time.” So, to your point about being able to ignore it in practice, your biggest concern would be if, in between those two timestamps, something massive changed – like the temperature rose or dropped by many, many degrees – such that that frequency correction was now just wildly incorrect.
Okay, so we have now a network protocol: write a timestamp, send a message, another timestamp, another timestamp, you get it back. Now the computer that started all this has some estimate for how much its clock is off. What does it do then?
In the simple world, you could just set your time. You could just, you could just sort of say like, “And the time should be X,” but that’s not generally how most Network Time Protocol daemons work. What they’ll do is they’ll take a number of samples from a single server, but many times you have multiple servers configured so you’ll take many samples from multiple servers, and you’ll sort of apply different criteria to these things to decide if you should even consider them. I think the reference implementation of NTPD has this notion of a “popcorn spike,” where if your offset, you know, if you’ve gotten back 30 samples and they all kind of look about the same, but then you get one that’s wildly off, you just kind of say like, “I’m gonna ignore that one, because likely that was just due to some crazy queueing somewhere in the network or something like that.”
You can sort of think of this as a kind of voting algorithm: you have a bunch of different oracles that are telling you things about what the time is, you bring them all together, and you have some way of computing an aggregate estimate of what the time currently is that tries to be robust to errors and drop outliers and the like.
Yeah, I think that’s right. You try to pick out the people who are lying to you, right? Some of those servers you might be talking to upstream might just be telling you incorrect things. They’re generally referred to in sort of NTP parlance as falsetickers, and the ones who are not falsetickers are truechimers.
Oh, that’s awesome.
I’m not sure why exactly these are the names, but these are some of the names you might see if you’re looking around the internet. So you try and pick out the ones that are telling you the truth. You apply some other various heuristics to them to try and figure out which ones you think are the best, right? Which ones maybe have the smallest error bars – even though you might think that these are decent sources to use some of them might have wider error bars than others, right, like your samples may represent a sort of wider range than the other ones – so you try and figure out which ones are the best and then you use that to sort of tell your operating system to effectively speed up or slow down its frequency correction for how off it is, and try and sort of remove that error over time. You don’t just abruptly adjust the time that the system thinks it is. Most time daemons will not aggressively step the clock. The reason for that is that most applications do not enjoy when the time just changes drastically, especially not when it changes drastically in the negative direction.
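As a toy illustration of that outlier-dropping idea, here is a sketch using a simple median-deviation test. This is my own illustration, not the actual selection and clustering logic in NTPD or chrony, which is considerably more elaborate:

```python
import statistics

def filter_popcorn_spikes(offsets: list[float], k: float = 3.0) -> list[float]:
    """Toy outlier filter in the spirit of NTPD's "popcorn spike" check.

    Drops any offset sample that deviates from the median by more than
    k times the median absolute deviation. Illustration only; real time
    daemons apply more sophisticated sample selection.
    """
    med = statistics.median(offsets)
    mad = statistics.median(abs(x - med) for x in offsets)
    if mad == 0:
        # All (or nearly all) samples identical; keep only the exact median.
        return [x for x in offsets if x == med]
    return [x for x in offsets if abs(x - med) <= k * mad]

# Five samples around 2 ms, plus one wild spike from network queueing.
samples = [0.0021, 0.0019, 0.0020, 0.0022, 0.0500, 0.0020]
print(filter_popcorn_spikes(samples))  # the 0.0500 spike is dropped
```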
This highlights another property you want out of your clocks, which we haven’t touched on yet, which is: we said we want our clocks to be accurate. Your criterion for what it means for them to be right is you go to them and ask them what time it is, and they give numbers pretty close to each other. But there’s another property you want, which is you want the clocks to, in a micro sense, advance about a second per second and you especially want it to never go backwards, because there are lots of algorithms on a computer, which are making the assumption implicitly and you know, naively reasonably, that the clocks only go forward, and lots of things can get screwed up when you allow clocks to jump backwards.
Right. So, a way that you can maintain that property you just mentioned, while still correcting, is to effectively tell the operating system, “Hey, slow down. I want time to slow down effectively, such that this error gets removed, but I don’t have to actually step time backwards and make applications sad.”
And I actually remember this personally from many years ago. The one place where I really intersected with clock synchronization in a serious way was when I was asked to look at Jane Street’s clock synchronization on its Windows machines. I wrote a small program that sent little NTP packets to the Windows machines, which knew the NTP protocol and responded appropriately. The responses had the four timestamps on them, and instead of trying to compute the average, I actually computed upper and lower bounds on how far the clock sync was off and generated a little graph to see what was going on. And I remember being quite surprised to discover that if you graphed how far off the clocks were, you’d see this weird sawtooth behavior where the clocks would go out of sync, and then, bang, they would be in sync again, and then slowly out of sync and then in sync again. And that’s because the Windows NTP daemon we were running was configured to just smash the clock to the correct value and not do any of that adjusting of the rate, which, if I remember correctly, is called slewing – a term I think I’ve heard in no other context.
Yeah, that is correct.
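The upper and lower bounds Ron mentions fall out of the same four timestamps without the symmetric-delay assumption; the only thing you need is that network delays can’t be negative. A sketch (names are mine):

```python
def ntp_offset_bounds(t1: float, t2: float, t3: float, t4: float):
    """Hard bounds on the server-minus-client clock offset from one NTP exchange.

    t1: client sends request     (client clock)
    t2: server receives request  (server clock)
    t3: server sends reply       (server clock)
    t4: client receives reply    (client clock)

    Since t2 = t1 + offset + forward_delay with forward_delay >= 0, and
    t4 = t3 - offset + return_delay with return_delay >= 0, the true
    offset must lie in [t3 - t4, t2 - t1]. The width of that interval is
    exactly the round-trip network delay.
    """
    return t3 - t4, t2 - t1

# Example: a true offset of 300 s always lands inside the interval.
lower, upper = ntp_offset_bounds(0.0, 300.010, 300.011, 0.021)
```

Graphing these bounds over time is enough to expose behavior like the sawtooth Ron describes, without needing the time daemon’s cooperation.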
Okay, so NTP lets you, in fact, do the slewing in an appropriate way so you can keep the rates pretty close to the real-time rates, but still, over time, slowly converge things back if they are far apart. In practice, how quickly can NTP bring clocks that are desynchronized back together?
At least with some of the newer time daemons… so I don’t know what the default is for the reference implementation. I know with some of the newer daemons like chrony, the default is that it takes 12 seconds to remove one second of error. So depending on how far away you were you can sort of like work it out, right, 12 seconds to remove one second. So if you were a day, it’d be 86,400 times 12, which is a lot of seconds.
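Working that arithmetic out (the 12-to-1 ratio is chrony’s default correction rate as quoted above; the helper name is mine):

```python
SECONDS_PER_DAY = 86_400
SLEW_RATIO = 12  # chrony default: ~12 seconds of slewing per second of error

def seconds_to_slew_away(error_seconds: float) -> float:
    """Wall-clock seconds needed to slew away a given amount of clock error."""
    return error_seconds * SLEW_RATIO

# Removing a full day of error takes 86,400 * 12 = 1,036,800 seconds of
# slewing, i.e. 12 days.
print(seconds_to_slew_away(SECONDS_PER_DAY) / SECONDS_PER_DAY)  # 12.0
```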
So that’s actually quite fast, which means the rate at which the clock moves forward can be off by on the order of 10%, which is pretty aggressively pushing on the gas there.
And these knobs are adjustable. If you really want to you can sort of change the max rate at which they will attempt to make these adjustments.
So, we had clock synchronization working on our systems before you started in 2012, and yet you needed to redo it. What were we doing, and why did we have to redo it?
So, we did what I’m sure lots of people do. We discussed GPS appliances before – so we have some GPS appliances, which are bringing us accurate-ish time, and then we pointed a bunch of time servers at those GPS appliances using NTP, and we pointed a bunch of clients at those time servers, and we sort of dusted our hands off and said, “Ah, done.” There were no real requirements around what the maximum error is. Are we trying to maintain anything? If you look at any given time, can you tell us how far off any given system is from, say, UTC? And so that served us fine for a while. The main motivation for some of the work that was done was a bunch of different financial regulations that happened in Europe, and one of them specifically had to do with time synchronization. What it said was that you have to be able to show that your clocks are in sync with UTC to within 100 microseconds. So the 100 microsecond number was the big change. When we first heard this requirement, it was like, “Well, maybe we’re good. We don’t actually know at the moment; maybe we’re just fine.”
Okay, and so we looked at it, and were we just fine?
No, definitely not. I think I said it before, but most systems were a couple hundred microseconds out. The real problem, or one of the real problems, was that they sort of would bounce all over the place. Sometimes they could be relatively tight, say 150 microseconds, but various things would happen on the system that could disturb them and knock them, say, 400 or 500 microseconds out of alignment. If a bunch of processes all start on a given computer at the same time, and they all start using the processors very aggressively, that’ll fundamentally change the heat profile of the system. As the heat profile changes, the frequency will change, and then the time daemon might have a harder time keeping the correct time, because the frequency is no longer what it was before, and it has to sort of figure that out as it goes.
So, I started sort of investigating, “Okay, how can we solve this problem? Like what do we have to do,” and sort of just started looking into various different things. I didn’t know, at the beginning of all this, can we solve this problem with NTP? Is NTP capable of solving this problem or do we have to use some different, newer, better protocol? Because NTP has been around for a long time.
What did you find?
I definitely did the dumb thing, right? I went to Google and I said, “How do you meet MiFID II time compliance regulations,” or something along those lines, and probably many different combinations of those words to try and find all the good articles. If you just do that, what you find out is that everyone tells you you should be using PTP, which is the Precision Time Protocol. It’s a completely different protocol. And if you go read on the internet, you’ll see that it is capable of doing “better time synchronization” than NTP, but nobody really tends to give you reasons. Lots of people will say things like “NTP is good to milliseconds. PTP is good to microseconds,” but without any sort of information backing that. So if you just do that, you’re like, “Well, we should clearly just run PTP. No problem. Let’s just do that.”
So I did a bunch of research trying to figure out is that a good idea? So the first thing I also wanted to understand was what is magic about PTP? What makes it so much better than NTP, such that you can say these things like NTP is good to milliseconds, PTP is good to microseconds.
Where does the precision of the Precision Time Protocol come from?
Exactly. And what I found actually surprised me to some extent. The protocol is a little different. The sort of who sends what messages when is a little bit different. It involves multicast, which is different. But at the end of the day, it’s those same four timestamps that are being used to do the calculation, which I found a bit surprising. I was sort of like, “Now, if it’s the same four timestamps, more or less, what is it about PTP that makes it much more accurate?” And what I was able to find is, I think it’s basically three things.
One is that many, many hardware vendors support hardware timestamping with PTP. With your actual network cards – we sort of talked about the packet showing up at the network, it having to raise an interrupt, the CPU having to get involved to schedule the application, right, you do all this stuff and then eventually a timestamp gets taken – with PTP and hardware timestamping, as soon as that packet arrives at the network interface card, the card can record a timestamp for it, and then when it hands the packet up to the application it says, “Here’s your packet. Oh, and by the way, here’s the timestamp that came with it.” We were talking before about trying to move those timestamps as close as you could, such that the difference of them actually represented the delay from the client to the server and from the server to the client; if you push them down the stack to the hardware, it means that you’re going to have much more accurate timestamps, and you’ll have a much better chance that those delays are actually symmetric, meaning you’re getting good time. It also removes a lot of the other uncertainty in taking those timestamps, such as scheduling delay, interrupt delay, other processes competing for CPU on the box, stuff like that. So, you have hardware timestamping as a PTP thing.
Another thing is the frequency of updates. So I think by default PTP sends its sync messages to clients once every second, whereas, at least for the reference implementation of NTPD, I believe the lowest you can turn that knob for how often should I query my server is once every eight seconds, so you have the hardware timestamping, you have the frequency of updates.
And then the other bit of it is the fact that lots of switches can participate. I think PTP was basically designed with the idea that you’d have the whole network contributing to your time distribution, so all of your switches can also get involved and help you move time across the network while understanding the delays that they themselves are adding to it. They can kind of remove their own delays and help you move time accurately across the network. At least, that’s kind of the intent of PTP.
The idea is, I guess, you can do, in some sense, the moral equivalent of what NTP does with the two middle timestamps. There are two timestamps in NTP that come from the server that’s reporting time – when it receives and when it sends out – and you get to subtract the gap between those two timestamps out of the added noise. And the idea is you can do this over and over again across the network, so delays and noise introduced by, for example, queueing on the switch would go away. You would essentially know how much those delays were, and as a result, you could potentially be much more accurate.
Yeah. I think that’s roughly the conclusion I came to, that that’s what makes PTP more accurate than NTP, which was surprising to me. And then I did a bunch of research and was talking to various people in the industry, at various conferences and stuff, and there was some agreement that you can make NTP also very accurate, you just have to control some of these things. In addition to being able to do hardware timestamping with PTP packets, some cards these days support the ability to hardware timestamp all packets, and if your machine is just acting as an NTP server and most of the packets it receives are NTP packets, well, then you’re effectively timestamping NTP packets. Some cards also will timestamp just NTP packets; they can recognize them and timestamp only those. So it was sort of like, “Okay, if we have the right hardware, we can get the timestamping bit of it.” That’s kind of an interesting thing. With the different NTPD implementation (chrony being the other implementation I’m talking about, as opposed to the reference one), you can turn that knob for how frequently you should poll your server up to, I think, as much as 16 times a second. There’s a bit of diminishing returns there, it’s not always better to go lower; point being, you can tune it to at least match PTP’s default of once a second.
And the more I dug, and the more I talked to people, the more people told me, “Hey, you definitely do not want to involve your switches in your time distribution. If you can figure out a way to leave them out of it, you should do so.” I was happy to hear that, in some ways, because right now the reliability, or sort of the responsibility, of the time distribution lies with one group, and that’s fine. When you have this responsibility shared across multiple groups, it becomes a lot more complicated. Every switch upgrade, suddenly, you’re concerned: “Is it possible that this new version of the firmware you’re putting on that particular switch has a bug related to this PTP stuff and is causing problems?”
Given all of that, I started to believe that it was possible we could solve this problem of getting within 100 microseconds using NTP, and I set out to see if I could actually do that.
It seems like, in some sense, the design of PTP, where it relies for its extra accuracy on these switches, violates this old end-to-end property that people talk about in the design of the internet, of trying to get as much of the functionality as you can around the edge of the system. And I think that is motivated by a lot of the kind of concerns you’re talking about: you have more control over the endpoints, and you want to rely on the fabric for as little as possible. I guess the other reason you don’t want to rely on the fabric is it’s not just that there are different groups, and like, “Oh, it’s hard to work with people in different areas and coordinate.” It’s also, in various ways, fundamentally harder to manage networks than it is to manage various other kinds of software. The reality is, in many organizations, in many contexts, a lot of getting the network right involves an extremely good, extremely well trained, extraordinarily careful human being just going in and changing the configs and not getting it wrong. It’s kind of a terrifying system, and the less complexity you can ask the people who have to go in and do this terrifying thing of modifying the network to take care of, the better.
I mean, that’s a very, very true point. And another aspect of it is having fewer black-box systems involved. So, chrony is an open-source project; we can inspect the code and see what it’s doing and understand how it behaves. The GPS appliances are not, and the more we can minimize the number of black-box systems where it’s, “Hey, that’s really strange. Why did we see this spike? We have absolutely no idea,” the better.
Right. The primary currency of the sysadmin is inspectability.
You want to be able to go in and figure out what the hell is happening.
Yes. Huge proponent of things you can inspect and debug.
You talked a bunch about hardware timestamping, and I have a kind of dumb and mechanical question about all this, which is you talked about essentially software processes keeping clocks up to date. You have this NTP daemon that knows how to adjust the rates of things and stuff, but then you talked about the NIC going in and timestamping packets. So does the NIC have a hardware clock and the motherboard have a hardware clock, or the CPU? How are these clocks related to each other? What’s actually going on when you’re doing hardware timestamping?
Yeah, the NIC also has a hardware clock.
Is it a different time? Do you have to run NTP between the NIC and the host system?
I think that would be challenging, but yes, you can use a thing from the Linux PTP project to move time from a network card to the system. It’s called phc2sys. That’s just a thing you can do: you have time on your network cards, you can move that time to the system, you can move it from the system to another network card, you can kind of shift this time around in various ways. But yes, the cards themselves do also have a clock on them that you’re also keeping in sync.
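For the curious, moving time off the NIC’s hardware clock looks roughly like this with the Linux PTP project’s tools. The interface name is invented, and the flags are as documented in the phc2sys man page:

```
# Discipline the system clock (CLOCK_REALTIME) from eth0's hardware clock.
# -s names the source clock; -w waits for ptp4l to supply the UTC-TAI
# offset rather than specifying it by hand with -O.
phc2sys -s eth0 -c CLOCK_REALTIME -w
```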
So another thing you mentioned about PTP is that it uses multicast. So, I’ve had the chance to sit down and talk at length with Brian Nigito, in a previous episode of this podcast, about multicast, and I’m curious what role multicast plays here in the context of PTP?
The whole idea is that at the root of this time distribution tree you have what’s known as a grandmaster, and you can have multiple grandmasters. A grandmaster is just something that doesn’t need to be getting its time from PTP – it’s, you know, the GPS reference or something else. There’s a thing called the best master clock algorithm for the participants of PTP to determine which of them is the best one to act as the grandmaster at any given time, and then the idea is that you multicast out these packets to say, “Here’s my first timestamp,” and it just makes it easier on the network. As a PTP client, you just have to come on and start listening for these multicast messages and then you can start receiving time, as opposed to having to actually reach out and be configured to go talk to the server. You can have less configuration and just start receiving these time updates from the grandmaster.
Got it. So you think it’s mostly a kind of zero-configuration kind of story.
It also makes it easier for the grandmaster. You don’t have to maintain these connections. You don’t have to have all these sockets open. You just sort of have like one socket there kind of multicasting out. It’s not 100% true, because there’s a delay request and a delay response message that’s involved in all this too.
And it’s also actually kind of strange… I think this was recently changed in the most recent version of PTP, but technically the way it works is the grandmaster sends this multicast message that is a synchronization message which contains one of those timestamps. When the client receives it, it actually sends a multicast message back that says, “Hey, here’s my delay request,” and then when the grandmaster receives that it sends out another multicast message that says, “Here’s the delay response,” which is kind of insane when you think about it, right, because you’re involving all of these other potential peers that are listening in on the network with your exchange. And you can configure some of these open-source projects – like the Linux PTP project, which uses a daemon called ptp4l – you can configure it to do a hybrid model where it receives the sync message as a multicast message, but then since it knows where that message came from it just does a unicast delay request and then delay response, which makes a lot more sense.
Yeah, the base behavior you’re describing sounds pathological, right? Essentially, quadratic. You get….
Everyone sends a message to everyone. That is not usually a recipe for good algorithmic complexity.
I’m not sure why it was designed that way. It could be that the original people were sort of thinking that you’d have these smaller domains where you have these boundary clocks. Or sort of you’re multicasting… you’re sort of limited to how many people you’re talking to. But I kind of agree the default behavior seems a little crazy to me, and that’s why in our case, where we are using PTP (we’re using it in a small area of the world) we have it configured to do that hybrid thing where the actual sync message comes in multicast, but the delay request and the delay response wind up being unicast.
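The arithmetic behind that sync, delay-request, delay-response exchange is worth writing down. Here is a minimal sketch in Python (function and variable names are my own), using the same four-timestamp algebra that NTP uses, under the assumption that the path delay is symmetric:

```python
def ptp_offset_and_delay(t1, t2, t3, t4):
    """Compute clock offset and one-way delay from a two-step exchange.

    t1: master's timestamp when the sync message left
    t2: client's timestamp when the sync message arrived
    t3: client's timestamp when its delay request left
    t4: master's timestamp when the delay request arrived
        (reported back to the client in the delay response)

    Assumes the path is symmetric; any real asymmetry shows up
    directly as error in the computed offset.
    """
    client_ahead_by = ((t2 - t1) - (t4 - t3)) / 2
    one_way_delay = ((t2 - t1) + (t4 - t3)) / 2
    return client_ahead_by, one_way_delay

# If the client runs 10 units fast and the path takes 5 units each way:
# a sync sent at master time 0 arrives at client time 15, and a delay
# request sent at client time 100 arrives at master time 95.
offset, delay = ptp_offset_and_delay(0, 15, 100, 95)
```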
There’s a major thing that I haven’t touched on here, which is that NTP, as I mentioned before, has multiple servers, and you kind of have this built-in notion of redundancy, right, where you’re comparing all the times from the different servers, and you’re trying to figure out which of them are falsetickers, and so if any of them misbehave, the protocol kind of has this built-in notion of trying to call them out and ignore them. With PTP, we’re talking about the single grandmaster, which would be a GPS appliance, and unfortunately, we have found black-box GPS appliances to be less than ideal. It would be fine if you’re just talking about a straight failure scenario, right? We have a GPS appliance, maybe we have two of them, they have agreed amongst themselves who is the grandmaster. One of them goes offline, the other one notices that, it picks up, starts broadcasting. That would be a perfectly fine world and I wouldn’t be too concerned about it. But the thing that we’ve seen happen is, we want to perform maintenance on a GPS appliance because its compact flash card is running out of life and we need to replace it, and when you go to shut it down, it happens to send out a PTP packet that is like complete crazy pants, just absolutely bonkers.
It makes no sense whatsoever. The timestamp is off the charts. And we’ve had GPS appliances do things like that, and so part of my thinking through this was, you know, “Geez, at the end of the day I really don’t want to be pulled back to a single GPS appliance that is providing time to potentially large swathes of the network,” because if it goes crazy, there are no real provisions in PTP for finding the crazy person; everybody will just follow those crazy timestamps wherever they lead.
At least for a while. It sounds like there’s a way of eventually deciding someone else is the right one to pay attention to, but it means for short periods of time you may end up just listening to the wrong guy.
I’m not an expert in exactly what’s involved in the best master clock algorithm, but my understanding is that it’s simply about how good your clock is, and so if you were sitting there saying, “I have the best clock. It’s fantastic,” but then telling people the completely wrong time because you had some kind of a bug or misconfiguration, you would continue to operate in that mode indefinitely.
That’s fascinating. What it sounds like is despite the fact that PTP is newer, and, in some ways, shinier, and in some ways having fundamentally better capabilities for some aspect of what it’s doing, it also just threw out a bunch of the careful engineering that had gone into NTP over a long period, because NTP has significantly more robust ways of combining clocks than what you’re describing for PTP.
Yes, that was my kind of interpretation of looking at all this stuff: it feels like we threw out a lot of the safety, and that makes me super nervous based on my experience with these devices.
So here we are, we have an NTP solution that’s not working well enough, and a PTP solution that’s kind of terrifying. So, where’d you go from there?
So, we tried to build a proof of concept. At the end of the day, we figured, “All right, we have these GPS appliances.” We talked before about hardware timestamping on the GPS appliances, and how they can’t hardware timestamp the NTP packets, so that’s problematic. We thought, “How can we move time from the GPS appliances off into the rest of the network?” And so we decided that we could use PTP to move time from the GPS appliances to a set of Linux machines, and then on those Linux machines we could leverage things like hardware timestamping and the NTP interleaved mode to move the time from those machines onto machines further downstream.
The NTP interleaved mode, just to give a short overview of what that means: when you send a packet, if it gets hardware timestamped on transmission, the way you use that hardware timestamp is that it gets looped back to you as an application. So I transmit a packet, and I get that hardware timestamp after the packet’s already gone out the door. That’s not super useful from an NTP point of view, because really you wanted the other side to receive that hardware timestamp. So the interleaved mode is a special way in which you can run NTP such that when you transmit your next NTP packet, you send the hardware timestamp that you got for the previous transmission, and then each side can use those. I don’t want to get into too much of the details of how that works, but it allows you to get more accuracy and to leverage those hardware timestamps on transmission.
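A toy sketch of that idea, not the real NTP wire format: the precise transmit timestamp of one response rides along in the next one. The `nic` object and its methods here are invented stand-ins for the driver’s timestamping interface:

```python
class InterleavedServer:
    """Sketch of the idea behind NTP's interleaved mode: the hardware
    TX timestamp of a response is only known after the packet has left,
    so it is carried in the *next* response instead."""

    def __init__(self, nic):
        self.nic = nic
        self.prev_hw_tx = None  # hardware TX time of the last response

    def handle_request(self, request):
        response = {
            "rx_time": self.nic.hw_rx_timestamp(request),
            # Precise transmit time of the previous response; the client
            # pairs it with the receive timestamp it recorded back then.
            "prev_tx_time": self.prev_hw_tx,
        }
        # The hardware timestamp loops back only after transmission,
        # so stash it for the next exchange.
        self.prev_hw_tx = self.nic.send(response)
        return response


class FakeNic:
    """Stand-in for the driver interface; real hardware would return
    timestamps from the NIC's own clock."""

    def __init__(self):
        self.clock = 0

    def hw_rx_timestamp(self, pkt):
        self.clock += 1
        return self.clock

    def send(self, pkt):
        self.clock += 1
        return self.clock  # TX timestamp, known only after the send
```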
I see. And this was already built into existing NTP clients, this ability to take advantage of hardware timestamps by moving them to the next interaction. That’s not a thing you had to invent.
Nope, it’s existed for a while. I think the reference NTP implementation can leverage timestamps taken at the driver level to do something similar, but chrony adds the ability to actually leverage hardware timestamps in the same fashion, sending them in the next message so that each side can calculate a more accurate difference.
Because hardware timestamps are a relatively new invention in all of this, right? When NTP was designed, I don’t think there were any devices that did hardware timestamping.
I think that is true, and as I was saying before, when this all first came to fruition the things that supported hardware timestamping were PTP specific.
Okay, so now you have an infrastructure where there’s a bunch of GPS devices, a layer of Linux servers, which get time via PTP from those GPS devices, and then act as NTP servers for the rest of the network.
So maybe I missed it: why does that first layer have to use PTP rather than NTP?
The major reason is that the GPS appliances – it’s apropos to what we were just talking about – the GPS appliances will hardware timestamp their PTP packets, because they have dedicated cards for it, but they don’t hardware timestamp their NTP. So the quality of time that you’re getting off of the GPSes if you’re talking NTP to them – like, if you just remove the time servers and you have the clients talk directly to the GPS appliances, for example – is just going to be a lot lower. And to be honest, I don’t know if they support the interleaved mode of NTP; it’s not something I ever really dug into. It goes back to that black-box thing of, “Well, we can configure this thing in such a way that it spits out hardware timestamped PTP and be relatively confident that it’s doing that job.” But anything more esoteric gets a little dicey.
Got it. And you solve the false ticker problem by basically having each Linux server that’s acting as a kind of NTP… marrying each one of those to an individual GPS device. So if that GPS device is crazy then the Linux server will say crazy things, but then things internally on the network are going to, in a fault-tolerant way, combine together multiple of these Linux servers and be able to use the high accuracy way of getting data from those servers.
That’s exactly right. We constrain the servers that any given client can pick using various knobs within chrony, because we want to meet certain requirements, and so we would like to ensure that any given client is going to talk to its local NTP server as opposed to one that is, say, 600 microseconds away, because as soon as you go to talk to that one that’s 600 microseconds away, you introduce a lot of potential error. And so what we do is we force the NTP clients to talk to their local servers, and then we also configure them to talk to a bunch of other servers, which are too far away to give very accurate time, but we use them just as you described, to call out and figure out if either of the two local ones has gone crazy. If both of the two local ones have gone crazy, well, we’re kind of out of luck.
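A hypothetical chrony.conf sketch of that client setup. Hostnames are invented; `xleave` enables interleaved mode and `prefer` biases source selection toward the nearby servers, though the real configuration would involve more knobs than this:

```
# Two nearby servers: hardware-timestamped, interleaved-mode path.
server ntp-local-1 iburst xleave prefer
server ntp-local-2 iburst xleave prefer

# Distant servers: too far away to give precise time, but they vote in
# the source selection and can expose a local server as a falseticker.
server ntp-far-1 iburst
server ntp-far-2 iburst
server ntp-far-3 iburst
```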
How well did this all work in practice?
It worked surprisingly well. Designing the system, coming up with a system that can do this stuff and remain fault-tolerant and all that, is one thing, but then there’s also the other thing of show me, right? Like, show me that you’re within 100 microseconds of UTC. So that required understanding, what are the errors? And that comes back to the asymmetry question, and understanding things like: if the NTP daemon is not accepting updates from its server for whatever reason – because maybe it thinks the server is crazy, or because it thinks the samples it just took are incorrect, like maybe you had a popcorn spike or something like that – it’ll basically increment a number that represents its uncertainty about how much it might be drifting while not getting good updates from its server. And so you kind of have to add together all these different errors.
You have that one, you have the known error introduced by the actual time daemon – what it knows about how far off it is – and then you have that round-trip time divided by two that I mentioned. So you take all that, added together, then you have to do a similar thing for the PTP segment I mentioned, and then you have to add on the 100 nanoseconds for the GPS that I mentioned. If you add all that together, for most of our servers we can show that we have no worse error than about 35 microseconds, most of the time, assuming no extenuating circumstances. Now, a design choice we faced in this whole thing: your best bet for getting good time to clients is to have a dedicated network for it. Dedicated NIC, dedicated network, have it be quiet, nice, you know, no interference. But that’s expensive and annoying, and nobody really wants to do that.
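The bound just quoted comes from summing the individual worst cases enumerated above. A sketch of that arithmetic, with invented figures that happen to land near the 35-microsecond mark:

```python
def worst_case_error_us(round_trip_us, daemon_offset_us, dispersion_us,
                        ptp_segment_us, gps_error_us=0.1):
    """Pessimistic bound on offset from UTC, in microseconds.

    Each term is an independent worst case, so they simply add:
      - round_trip_us / 2: the unknowable path-asymmetry bound
      - daemon_offset_us: the daemon's own estimated offset
      - dispersion_us: accumulated drift while updates were rejected
      - ptp_segment_us: error on the GPS-appliance-to-time-server hop
      - gps_error_us: the GPS receiver itself (~100 ns)
    """
    return (round_trip_us / 2 + daemon_offset_us + dispersion_us
            + ptp_segment_us + gps_error_us)

# Invented example: a 40 us round trip, 5 us daemon offset,
# 5 us of dispersion, and 4 us on the PTP segment.
bound = worst_case_error_us(40, 5, 5, 4)
```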
It’s expensive in a few ways. It’s expensive in the physical hardware, you have to provision, but it’s also just expensive in the net complexity of the network, right? I think there’s a lot of reasons why we want to try and keep the network simple and having a whole separate network just sounds like a disaster from a management perspective.
Agreed, right. So I was like, “I really don’t want to go down that road.” So we said, “Well, let’s see what happens.” As I was just saying, most of the time we can attest that we are better than 35 microseconds – 35 mics of error worst case – but there are situations where you can cause that to fall over. For example, we have some clients that don’t support hardware timestamping. They’re just older generation servers; they don’t have NICs with hardware timestamping. If, on those things, you just saturate the NIC for five minutes solid, you’re probably going to get your error bounds outside of 100 mics; it’s just going to happen. But on newer machines that do support hardware timestamping, you can do things like that and still stay within 50 mics of UTC, which is pretty cool.
Some of this is built upon the fact that we know we have a very smart networking team. We’re confident in the ship that they run and the way our networks are built, and that kind of stuff lends something to not wanting to build a dedicated time network; we think we can get by without it. And so that’s where we ended up: around 35 mics – I want to say it’s 35 to 40 mics – for systems that don’t have hardware timestamping on the client side, and closer to 20 mics for systems that do have hardware timestamping on the client side. And as I mentioned, the systems that do have hardware timestamping on the client side are more robust to problems, to just things that people might do. You know, maybe somebody’s debugging something and they want to pull a 10-gigabyte core dump off of a machine. They’re not thinking about the timestamping on the machine right now; they’re focused on their job, trying to actually figure out what happened with that system.
So the other aspect of all this was reporting on it and showing it. How do we surveil to show that we are actually in compliance? And so for that, we took what we think is a relatively reasonable approach, which is: we sample. And there’s kind of no sampling interval small enough if you want to be absolutely sure that you were never, at any time, out of compliance, right? You could say, “Well, what’s a reasonable sample? Every five seconds? No, that’s definitely too much. Okay, every one second? Maybe that’s fine. Every hundred milliseconds?” Right, so where do you stop? So we decided that for machines that go out of compliance, it is likely that if we sampled every 10 seconds, we would pick them up, because it’s not like there are these crazy perverse spikes that jump in there and then disappear. It is more like somebody’s SCPing a large file across the network, or something is misconfigured somewhere, and therefore it is a persistent state that sticks around for a while. So we sample every 10 seconds, pulling out these various stats I mentioned about what represents the total error, and then we pump that into effectively a big database all day long, and at various times throughout the day, we surveil that database and look for any systems that are not meeting their time obligations. We hold different systems to different levels of accuracy.
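The surveillance step reduces to a query over those sampled error bounds. A toy stand-in for it in Python, with invented host names and per-host limits:

```python
def surveil(samples, limits, default_limit_us=100.0):
    """Flag hosts that breached their accuracy requirement.

    samples: iterable of (host, worst_case_error_us) rows, as pulled
             from the database of 10-second samples.
    limits:  per-host accuracy requirement in microseconds; hosts not
             listed are held to the default (e.g. the MiFID II bound).
    Returns a sorted list of hosts that exceeded their limit in any
    sample.
    """
    breached = set()
    for host, error_us in samples:
        if error_us > limits.get(host, default_limit_us):
            breached.add(host)
    return sorted(breached)

# Invented example: host-c is held to a tighter 5-microsecond limit.
rows = [("host-a", 20.0), ("host-b", 120.0), ("host-c", 8.0)]
offenders = surveil(rows, {"host-c": 5.0})
```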
So after all of this, not that I want to call this into existence, but imagine that there’s a new version of European regulations, MiFID III comes out and says, “Now you have to be within 10 microseconds.” Assuming that technology is still as it is now, what would you have to do to get the next order of magnitude in precision?
Not this. So I think probably you’d want to go to something like PTP, but probably not just PTP directly. There’s a thing called White Rabbit, which is kind of like some PTP extensions, basically. I think it might actually be completely formalized in the most recent PTP specification. But that is a combination of roughly PTP with Synchronous Ethernet. Synchronous Ethernet allows you to get syntonization across the network, so you can make sure that the frequencies are the same.
Can I ask you what the word syntonization means?
It just basically means that the frequencies are in sync. So it doesn’t mean that we have the same time, but it means that we are advancing at the same rate.
I see. So there are techniques, essentially, that let you get the rate the same without necessarily getting the clocks in sync first.
Correct. And it is my understanding that White Rabbit uses this idea that you can have the rates the same, with PTP, to work out some additional constraints that they can solve to get sub-nanosecond time synchronization. I think we would have to put a lot more thought into the reliability and redundancy story. I discounted PTP because it didn’t necessarily have the best reliability and redundancy story. That’s not to say we couldn’t have figured out a way to make it work for us. We almost certainly could’ve. You could have two grandmasters, one sitting there as the primary doing its normal operation, one sitting there as a standby, and if the primary one goes crazy for some reason, you could have some automated tooling, or something that an operator could use, to take it out of service and bring the secondary into service, and only have maybe a minor service disruption. I can imagine us doing that work, but given the problem we were trying to solve, it seemed not necessary. We can solve this problem using this existing technology, but I do think if we had to go much lower – like you said, an order of magnitude lower – we’d have to start looking at something else.
Well, thank you so much. This has been really fun. I’ve really enjoyed learning about how the whole wild world of clock synchronization is knit together.
Well, thank you very much. It was a pleasure being here, and a pleasure talking about these things. It’s fun to try and solve these interesting, challenging problems.
You can find links related to the topics that Chris and I discussed, as well as a full transcript and glossary at signalsandthreads.com. Thanks for joining us, and see you next week.
A process for determining the best time source available, which PTP then uses as its grandmaster.
A more customizable implementation of NTP. (More)
The process of coordinating clocks across multiple systems. The goal is to have all clocks report approximately the same time if they were to all be queried instantaneously.
An NTP time reference determined to be reporting inaccurate times. (More)
(Global Positioning System) A network of satellites that provide location and time information to military and civilian systems.
The clock used by PTP as its time reference.
A request for the computer's operating system to respond to an event, e.g., time has advanced and the OS should increment the clock. (Wikipedia)
A set of European financial regulations that required, among a number of other changes, that some market participants, including Jane Street, be able to show that their clocks were synchronized, within 100 microseconds, to UTC.
The network hardware that routes packets throughout the network.
(Network Interface Controller) The hardware that connects a computer to a network.
(Network Time Protocol) A distributed algorithm, developed by Dr. David Mills, that synchronizes a system's clock with a reference source of time. (Wikipedia)
(Network Time Protocol daemon) A daemon (background process) that uses NTP to update the system clock. (More)
A circuit that produces consistent electrical pulses that can be used to drive operating system interrupts. (Wikipedia)
An NTP response that is determined to be an outlier. Popcorn spikes are usually caused by network delays. (More)
(Precision Time Protocol) A successor to NTP that uses multicast, hardware timestamping, networking switches, and faster default update rates to achieve better synchronization. (Wikipedia)
An implementation of a protocol or other technical standard provided by the specification authors.
Adjusting the rate at which time advances to reconcile a system's time with a reference. By changing the rate at which time advances, systems avoid abruptly stepping time forward or backward.
A standard that allows clock signals to be transferred over Ethernet hardware.
A process of making systems match frequencies. This can be used to ensure that all clocks in a system are advancing at the same rate, when they do not necessarily agree on the current time.
An NTP time reference determined to be reporting accurate times. (More)
(Coordinated Universal Time) The time at 0 degrees longitude with no timezone or daylight savings adjustments applied. (Wikipedia)