Andrew Gault, CEO of ZeroTier, on DIY Cyber Guy

#95 – Are Internet Companies Becoming Too Big To ‘Fail’

About Andrew Gault

Andrew Gault is the CEO of ZeroTier, where he is building the next generation of secure network infrastructure: software defined, resilient, and globally scalable. ZeroTier connects millions of devices across more than 230 countries and territories, giving organizations the ability to create private networks as easily as spinning up a virtual machine. For investors focused on foundational infrastructure, ZeroTier represents a breakout platform with real global traction.

Andrew is also a Founding Partner at 7percent Ventures, where he has deployed more than 70 million dollars into over 100 companies. He was early capital into Oculus VR, acquired by Facebook, and Magic Pony, acquired by Twitter, and is most active with deeply technical teams working in AI, video, and networking.

In addition, he serves as Executive Chair at PlotBox, a private equity backed SaaS company modernizing operations in the death care industry.

Andrew has spent his career at the intersection of product, go-to-market, and infrastructure, operating on both sides of the table as founder and investor. He moved to the West Coast after co-founding Gaikai in 2008, raising 45 million dollars from Benchmark, Rustic Canyon, NEA, and Intel, and leading the company through its 380 million dollar acquisition by Sony in 2012, where the technology became PlayStation Now.

Andrew’s Links

Andrew’s Company: https://www.zerotier.com/
Andrew’s LinkedIn: https://www.linkedin.com/in/andrewgault/

SUMMARY:

The interview examined systemic risks from centralized internet infrastructure using the recent Cloudflare outage as a focal example. Participants reviewed the incident timeline and technical cause: a Cloudflare configuration file overflow under heavy demand produced DNS and route resolution failures that prevented users from reaching many services, and the failure was attributed to software/configuration limits rather than a malicious attack.

The discussion broadened to architecturally driven risks as traffic and services concentrate around Cloudflare, AWS, Azure, and Google Cloud, creating choke points that produce cascading failures across vendors and customers. Guest Andrew Gault described engineering patterns to reduce lateral failure impact, including overlay networks that present LAN-like abstractions and zero-trust approaches to keep teams operational when a provider has issues; he referenced ZeroTier as a product example. Participants also explored agentic and multi-agent AI applied to network security and operations—covering uses such as real-time detection and remediation and automation of junior developer tasks—and debated, without resolution, whether AI could route around or otherwise mitigate such centralization choke points.

SHOW NOTES:

Hair on fire 3 out of 5

For: Internet Users

Late last year a configuration bug at Cloudflare, the backbone for roughly 20% of the internet, knocked major platforms offline, including powerhouses such as OpenAI’s ChatGPT and X.

Cloudflare said it saw a “spike in unusual traffic” that caused its network to serve errors across the web, illustrating just how fragile centralized digital infrastructure can be.

Essentially, a master list of rules that the Cloudflare system uses to route and protect traffic kept getting bigger over time. The file eventually grew larger than the software was built to handle. That caused servers to start returning error messages instead of loading websites. It was not a hack — it was a technical limit that unexpectedly broke under heavy demand.
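To make the failure mode concrete, here is a minimal sketch of a guard that checks a rules file against a hard limit before it is put into service, and falls back to the last known good version if the new one is too big. This is not Cloudflare's code: the limit `MAX_RULES`, the function names, and the fallback behavior are all assumptions chosen for illustration.

```python
# Hypothetical hard limit built into the software that consumes the rules
# (an assumption for illustration, not a real Cloudflare value).
MAX_RULES = 200


class ConfigTooLargeError(ValueError):
    """Raised when a rules file exceeds what the software was built to handle."""


def validate_rules(rules: list[str], limit: int = MAX_RULES) -> list[str]:
    """Return the rules unchanged if they fit within the limit, else raise."""
    if len(rules) > limit:
        raise ConfigTooLargeError(
            f"{len(rules)} rules exceeds the limit of {limit}"
        )
    return rules


def load_rules(
    new_rules: list[str],
    last_known_good: list[str],
    limit: int = MAX_RULES,
) -> list[str]:
    """Use the new rules if they are valid; otherwise keep the previous set."""
    try:
        return validate_rules(new_rules, limit)
    except ConfigTooLargeError:
        # Keep serving traffic with the old rules instead of returning errors.
        return last_known_good
```

The design choice in this sketch is to reject an oversized file at load time and keep running on the previous rules, so a file that grows past the limit degrades freshness rather than availability.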

If a single configuration file can cripple vital services, then the network design choices in your own environment deserve scrutiny. Traditional, centralized routing and security depend on a few critical choke points — and when those fail, the impact ripples outward. That is why resilient architectures and identity-centric overlays matter.

The internet did not break because of an attack — it broke because our dependencies broke first. That is the lesson every IT leader must take seriously.

Here with me to discuss this today is Andrew Gault, CEO of ZeroTier, a former partner at 7percent Ventures, early investor in Oculus and Magic Pony, and co-founder of Gaikai (pronounced “Guy Kai”), acquired by Sony for $380M.

Q: What does this issue say about the architecture of the internet?

TRANSCRIPT

0:00 – David W. Schropfer

Welcome back, everybody, to DIY Cyber Guy. This is episode 95: are internet companies becoming too big to fail? Now, this is a hair on fire three out of five, and it’s for, frankly, everybody that uses the internet. I’m guessing that every man, woman, and maybe child who is listening to this podcast now uses the internet in some way, including listening to this podcast now. So I wanted to start with the word fail in the title, because this is important. Most of us have lived through the banking crisis, for example, where banks were considered too big to fail, and that’s why they were bailed out, and what happened to the economy after that happened to the economy. In this case, we’re talking about whether or not the internet, which we depend on for many, many more things than we probably should, the internet itself can fail. If you can’t reach it, if it’s not returning the results you’re asking for, if you can’t connect with the information that you need, then it’s failing, then it’s not operating like it should. So if that happens, and you just simply can’t use the internet, the internet by definition has failed, and that’s because of a shrinking number of companies. In other words, if a very small number of companies actually have a serious internal failure where they stop working, the internet itself could effectively fail, which is the complete opposite of how it’s designed, but that’s a story for another day. So here’s what I’m talking about. Specifically, for those of you who haven’t guessed, it was the Cloudflare outage late last year. So Cloudflare, for those of you who don’t know, is the backbone of roughly 20% of the internet. And when it went down late last year, it knocked out major platforms like OpenAI’s ChatGPT, it knocked down Twitter, now X, it took down huge swaths of the internet, and, you know, even things like making airline reservations, travel reservations, connecting to many, multiple websites was impossible. It just didn’t work. 
You got an error message instead of reaching the site that you wanted to. So what Cloudflare said later is that it saw a spike in unusual traffic that caused its network to serve errors across the web, illustrating just how fragile a centralized digital infrastructure can be. So essentially what happened is this: this was a configuration error, meaning that there’s a master list of rules in Cloudflare’s system and in almost every electronic or every computer system that there is. There’s a master set of rules that governs how the Cloudflare systems route traffic to keep its customers safe. So the file itself got bigger and bigger and bigger over time. There are more and more rules and exceptions, et cetera, governed by this one file. And the file eventually grew larger than the software was built to handle. That caused the Cloudflare servers to start returning simple error messages instead of loading websites. That’s what I mean: when you tried to go to a website, or a user tried to go to a website, they would get a, you know, 404 error or a website not found or some sort of error, anything except the website that they were actually seeking. And this wasn’t a hack, and this is incredibly important to emphasize. Nobody did anything malicious. This was not an intentional error. It was just a technical limit that unexpectedly broke under heavy demand. So think of it this way. If a single configuration file can cripple vital services and huge swaths of the internet, then the network design choices in your own environment deserve a lot more scrutiny. Traditionally, centralized routing and security depend on just a few critical choke points. And when those choke points fail, the impact ripples outward. And that’s why a resilient architecture and identity-centric overlays really matter for the architecture of the internet and maybe the architecture of the system that you personally are working on or in charge of. So here’s the salient point of all that. 
In this case, the internet did not break because of an attack. It broke because our dependencies broke first. And that’s a lesson that every IT leader and everybody who’s in charge of any size network has to take very seriously. So here with me to discuss this today is Andrew Gault, CEO of ZeroTier and former partner at 7percent Ventures. Andrew is also an early investor in Oculus and Magic Pony, and co-founder of Gaikai, acquired by Sony for $380 million. Andrew, welcome to the show.

5:20 – Andrew Gault
Thank you so much. Pleasure to be here.

5:23 – David W. Schropfer
Glad to have you on. And I’m really interested in what you have to say about this. So question number one, what does this issue, this Cloudflare issue that I’ve been talking about, say about the architecture of the internet?

5:34 – Andrew Gault
I think it just highlights how much it’s changed. So if you think way back when, when the internet was starting, it was an interconnection of networks. Local networks connected together. One campus connected to another campus, connecting to another campus, and so on until it wrapped the world. And it was always there to be distributed. I mean, it was there to survive a nuclear attack, right? And it worked. And that’s why it took off. And the internet will route around something that’s down at the IP layer. But I think as the decades have gone on, and we’ve all started to use it and, you know, the economics started coming into this, it has just got more and more and more centralized for what we do every day. You have obviously Cloudflare, great company. I don’t mean to be mean about them. You should probably use them. But anyone could have a configuration update. And then you have Amazon, you have Microsoft Azure, Google Cloud. There are just certain hyperscalers or certain backbone technology companies, which have really become that central point where we all, whether we like it or not, depend on them. And even if we try not to use them, one of our vendors will use them, or maybe one of our vendors’ vendors will use them. And it becomes almost impossible to truly avoid being dependent on these central choke points.

7:02 – David W. Schropfer
Right. One of the things that was most fascinating to me about the Cloudflare outage was the fact that Cloudflare has multiple nodes. They have lots of different data centers around the country and around the world. The idea being if any one of those data centers literally disappeared off the face of the earth, the others would pick up the load. So it’s so interesting that just one central config file failing had a ripple effect across their entire system, and they couldn’t route around it. So, you mentioned Azure and AWS, Amazon Web Services. Should we be concerned that even in those truly backbone technologies, there’s a configuration problem that could happen and affect all of their sites, physical distribution sites, physical data centers, all at the same time, and not just one of those data centers?

8:02 – Andrew Gault
I think, so in this case, the configuration file needed to be pushed, I think, to all the servers in their network. And of course, that’s automated. So once, you know, you make your file, you test it works in your test environment, you hit deploy, out it goes. And then you get something very, very common in these kinds of larger deployments, a cascading failure, where one server goes down and increases the load on the other servers, which then trips one of them over, and so on. And it just keeps cascading through the whole chain of servers. So yeah, I would think of it less in terms of locations for this one and more in terms of services.
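[Editor’s illustration] The cascading failure described above can be sketched as a small simulation. This is a toy model under stated assumptions, not a description of Cloudflare’s systems: the function name `simulate_cascade`, the even split of load across surviving servers, and the fixed per-server capacities are all hypothetical.

```python
def simulate_cascade(
    capacities: list[float], total_load: float, first_failure: int
) -> list[int]:
    """Return the indices of servers that end up failed.

    Toy model: load is split evenly across surviving servers, and any
    server whose share exceeds its capacity fails in the next round.
    """
    all_servers = set(range(len(capacities)))
    alive = set(all_servers)
    alive.discard(first_failure)  # the initial outage

    while alive:
        share = total_load / len(alive)
        overloaded = {i for i in alive if share > capacities[i]}
        if not overloaded:
            break  # survivors can absorb the redistributed load
        alive -= overloaded  # these trip over, pushing load onto the rest

    return sorted(all_servers - alive)


# With little headroom, one failure takes every server down.
print(simulate_cascade([30, 30, 30, 30], total_load=100, first_failure=0))
# With more headroom, the failure stays contained to the first server.
print(simulate_cascade([40, 40, 40, 40], total_load=100, first_failure=0))
```

In this model the only difference between a contained failure and a full cascade is how much spare capacity each surviving server has when the load is redistributed.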

8:46 – Andrew Gault
Do you really care where the service is hosted? I just care that I need, I think in this case it was DNS or resolution.

8:54 – David W. Schropfer
What happens if DNS goes down?

8:56 – Andrew Gault
What happens if in the organization your email goes down? Pick the service, right?

9:03 – David W. Schropfer
It’s what is that dependency chain?

9:06 – Andrew Gault
And how, yeah, how can I engineer around that? Or maybe what would the process be? Because I think, looking back to my answer to the first question there, you’re never going to engineer around it. There’s always going to be a gotcha. There are a lot of very smart people at Cloudflare that spend a lot of time making sure that this wouldn’t happen, building test environments, and it happened, right? And so I think the same applies with security.

9:38 – Andrew Gault
My company works mostly in zero trust, and there it’s about don’t only trust the perimeter, trust only each server so you can’t move laterally. So something might get hacked in your system. Some server may go down, some servers may go off, but we don’t want that to spread through the organization and your services.

9:59 – David W. Schropfer
And in the case of ZeroTier, and for my listeners, ZeroTier is not a sponsor. As I say almost every episode, this is not a pay-to-play podcast. But I’m curious, in the case of ZeroTier, if you had a client that, for example, came to you after listening to this podcast and said, wow, Andrew, I’m really concerned if something happens at Cloudflare again, or, you know, I’m very dependent on AWS, what is ZeroTier going to do to give that CEO some level of comfort that a major event at one of those providers isn’t going to necessarily cripple their business for a long period of time?

10:44 – Andrew Gault
I think, happy to answer that in generic terms: we are an overlay network, meaning we run on top of a physical network in an abstracted way. So with an overlay network based on software, you can install a bit of software at both points, the server and your phone or your laptop, and it will abstract away all the physical links and make it appear just like a regular LAN to all the software running on your laptop, on the server, on your phone. And first of all, the underlying IP traffic goes direct, so it doesn’t hairpin through a cloud provider. And I think a lot of the classical VPN solutions, SASE is a term that comes up a lot now, it’s kind of the buzzword, all of these rely on a central firewall or a central loop through the cloud, which just builds in that central node that could fail. But using an overlay network, of course, your devices can go direct. So if Google Cloud goes down, if Amazon goes down, well, maybe I can’t access my server from my laptop, but my laptop can still talk to my phone, right? They are now on the same isolated network. And you get many security benefits with that, as well as pretty much just resilience on the connection between your devices.

12:15 – David W. Schropfer
So in this example of a CEO whose hair is on fire, or who wants to prevent it from being on fire, you can tell that CEO, your employees will be able to keep doing their jobs.

12:28 – Andrew Gault
Yes, exactly. It’s not that everything just goes down instantly. It’s the one service that went down that cannot be reached. And I think, I’m using lateral movement, I guess more of a security term, but in just your business, you don’t want one system going down to bring the others down. The printer went down. Why does that also mean my email went down, right? That’s what we’re building, right? We’re just building it where they’re all a big mesh of interdependencies on some cloud provider. And, you know, if you have a big enough network with enough systems, a hacker will get onto one of those systems. So you have to architect security assuming that. And I think what a lot of us forget is the same applies to our business services, right? One of them will go down at some point. And can we, you know, can we isolate them so that that then doesn’t take down the ability for everyone to work?

13:24 – David W. Schropfer
Excellent.

13:25 – Unidentified Speaker
So even if employees had to do without one particular system for a period of time, they would still be able to function, you know, hopefully with the remaining systems and services that they would need to consume as employees who are just trying to do their job.

13:43 – Andrew Gault
Right.

13:44 – David W. Schropfer
And I could see that would give a huge measure of relief to a lot of CEOs because, you know, CEOs are also responsible for payroll. So when you’re wrestling with payroll every week or every month or whatever the period is, you want to make sure you are getting the most out of that payroll, which means your people are working.

14:02 – Unidentified Speaker
Right.

14:02 – David W. Schropfer
Because if they’re not, you’re still paying the money for those salaries, but they’re not getting the job done because your systems were offline. So it sounds like a great investment to make to ensure that a big disaster is confined to the one product that might be affected without affecting all the products across the board. 100%.

14:26 – Andrew Gault
Yeah. It’s that mindset that no matter, obviously you should strive for things not to go down, you should strive to architect it, you know, to be resilient, but time will find a way, right? And something’s going to go down, right? You know, a memory chip may go bad. It might not be a logical error, right? And literally anything could happen. And you just want to be resilient to that, for sure.

14:53 – David W. Schropfer
Exactly. I’m a bit of a history buff. And I’m curious, if the founders of the internet were sitting around, and, well, Vint Cerf is still around, and some of those other icons who figured out, you know, what DNS was and why DNS would make the internet work and how to distribute between the universities, etc., which grew into the internet that we have today. If they were here on this podcast, what do you think they would say about how the architecture of the internet itself has evolved, for better or worse?

15:30 – Andrew Gault
I can think of a good example in either way. So for better, back then, because of linking local networks together, security was very much an afterthought, right? If I’m in my office with my 20 employees, and obviously I trust them all, do I really need to log in, right? It’s like, so a lot of these early protocols literally had no authentication, right? It wasn’t an afterthought, it just wasn’t a thought.

15:58 – David W. Schropfer
They were so happy that it worked at all. Like, hey, look, we can move data packets around between universities. This is great.

16:04 – Andrew Gault
Exactly. Of course, as you grow the population online, you get troublemakers, people trying to make mischief, or people just trying to be evil. And then you kind of learn the hard way that, okay, I can’t actually trust everyone. And it amazes me, if you read these RFCs, how the internet has developed. Imagine design by committee, but designed by hundreds of thousands of people, and they come to a conclusion on a change in a protocol or a design.

16:38 – Unidentified Speaker
It’s so hard to keep it backwards compatible, but add on, but they have done it.

16:43 – Andrew Gault
Almost every service we use online, even DNS, which is one of the last ones, there are now secure variants, and authentication and encryption are part of it. And I think that’s definitely for the better. I don’t personally know any of the original inventors of the internet, but I imagine if you sat them down, they would say, yeah, with hindsight, it was a mistake not to have thought about that at the beginning. It would have been much easier. It would have solved a lot of problems we had, you know, over the decades. I think on the worse side, I imagine that they would have philosophical issues with the centralization. And I think it wouldn’t just be about the network, meaning Cloudflare, the hyperscalers, Amazon, Azure, whoever. Because that’s just honestly simple economics, right? It just is cheaper for all of us for that to start happening.

17:36 – David W. Schropfer
And so you can’t really fight economics.

17:39 – Andrew Gault
And I’m not sure you want to, because we’d all be paying a lot more if it hadn’t gone like that. I think the philosophical issue is probably that everything now is a SaaS that silos information. And I think if you think about why we wanted to link all these networks together back then, it’s because we wanted to share information. We wanted to, HTTP, the protocol I’m sure we all know to download a webpage, was designed for pulling some information off your computer directly, pulling some information off your colleague’s computer directly. It was to access and share information amongst ourselves. And the internet now, or I guess the business models that have developed on the internet, are the complete inverse, right? It’s like, if you’re any kind of social network, upload something to it, and I’m not picking on anyone here, I mean literally all of them, they become the owner of the copyright of what you uploaded. They own the information and they then can keep that information and monetize it. I think that’s more philosophical than maybe what we’re talking about today, but I think that might hurt them because it’s the exact opposite of the original intent, which is all information should be free and shared. And now it’s all, we see it in the markets now, system of record, right? The SaaS apocalypse, with market valuations going down for SaaS companies, because we think AI might just simply be able to rebuild them. But the reason is, what is that company at its core? It is a database of information, a system of record, which is gatekept. So I don’t know, maybe AI will undo some of this. Can’t predict the future. But yeah, I think Ranbo is saying, I think that the philosophy of where information is, it kind of would be where they’d say it’s gone for the worse.

19:37 – David W. Schropfer
Okay, and I can edit out this next question if it’s way too off base, but I was gonna, well, let me just ask it and see where it goes. If it’s so far off base, then we can take it from there. Okay, well, what do you think the founders would say about the architecture itself, not just the flow of information, but the fact that we’ve got all of these very real choke points that we were talking about in the Cloudflare incident, that can actually have a material impact on how information continues to flow around? I mean, we were talking earlier about how the internet is designed to move around a problem if a problem exists. If there’s an outage in, you know, University of Massachusetts, well, then it bounces off Boston College or some other university. And now I’m obviously talking back in the day when it was basically universities using this. So what do you say about how we actually built this and how we built those choke points?

20:41 – Andrew Gault
Again, the economics drives that. I think in the Cloudflare example that we’re talking about, Cloudflare provides, they’re not hoarding information. They are kind of providing connectivity services, right? For your website, so that it loads faster and these kinds of things. I think for sure, again, it’s not in Cloudflare’s interest, but for sure the web browser could have been evolved in a different way, so it would know to look in different places, right? And to route around. But again, it wasn’t in anyone’s interest, either the web browser’s interest or the owner of the information and the data, to evolve like that. Yeah, I think, like I say, if I was fetching data only off your machine and not off a website, would I even need a Cloudflare, right? I think there are good reasons that it’s developed like this, but yeah, it was definitely not what the original intent was, right?

21:47 – David W. Schropfer
Gotcha.

21:50 – David W. Schropfer
I want to dive into the AI piece of it. Just in your opinion, do you think AI as it evolves is going to create, I’m going to use the word pathways for lack of a better term, better pathways for moving this information around and avoid some of the issues, avoid some of the choke points, avoid some of the downed areas, avoid some of the things like config files that have gotten too big and broke the internet, like the Cloudflare incident? Do you think AI actually has a role in that? Or are we really back to just the architects themselves?

22:30 – Andrew Gault
No, I absolutely think so. I mean, if you haven’t played around with AI, it’s getting so good, especially the last six months, it really stepped up.

22:42 – Andrew Gault
Especially if you’re using kind of a multi-agent development process, it can really replace junior developers now. We use it internally for the security angle. So obviously, if you’re monitoring traffic online, and I guess even in the cloud, for example, of monitoring a rollout of a configuration change, humans, we have certain processing speeds. We’re pretty quick, we’re pretty sharp, but we can’t detect an error, roll it back, write a fix, and then redeploy it in two milliseconds, right? And maybe AI can’t do that today, but I do think in the next year or two, it can. And that’s what we’re leaning into. So part of our solution is a distributed firewall, almost, down on all your devices: what ports, what packets, who can connect to who, that kind of stuff. We’re determined and RNC is agentic in NetOps, which is basically just a fancy way of saying AI can help write those rules, AI can look at the traffic flowing on your networks, it can, in theory at least, detect malicious traffic long before a human could even detect or know what’s going on. And it can, in real time, update that firewall rule. So I do think AI will level up network security and the resilience and the redundancy in networking, for sure, over the coming months and years.

24:19 – Unidentified Speaker
Excellent.

24:19 – David W. Schropfer
Well, we’ll have you back on the podcast sometime when you start to release some of those things, we can talk about how that’s going. That’s great. Well, it’s been great having you on the podcast. Andrew, where can people find out more about what you do?

24:33 – Andrew Gault
You can find us online at zerotier.com. You’ll see there, you can learn a lot more about overlay networks and how we work and the use cases. The beauty of an overlay network where it’s just abstracting a complicated network down into just simple, plain, vanilla LAN networks is the number of use cases. You really can use it almost anywhere and your security will level up.

25:02 – Andrew Gault
to your duty will get easier, it will be easier to reason about, and security will absolutely level up. So I think if you want to learn more, absolutely, zerotier.com is the place to go. That’s fantastic.

25:12 – David W. Schropfer
And for those of you who didn’t catch that, just go to diycyberguy.com, search for episode 95, and you’ll find the link that Andrew was just speaking about. So that’s absolutely terrific. Thanks for being on the show and sharing your opinions, Andrew.

25:28 – Andrew Gault
Thank you, it’s a pleasure to be here.