The Infrastructure behind OpenAI
Intro
As there has been a lot of talk around ChatGPT, Whisper and DALL·E 2, I haven’t been wondering how to apply for my universal basic income, but rather what’s running behind all that magic.
While researching across various sources, I noticed that some things are safe to say, while others are assumptions due to a lack of proof.
Collaboration
Google Workspaces (Mail, Docs etc.)
Anyone who has ever set up Google Workspace/Google Apps will recognize that they’re using Gmail and most likely Google Drive, Docs etc.
# MX records show that they're using Google Workspaces
dig MX openai.com +short
1 aspmx.l.google.com.
5 alt1.aspmx.l.google.com.
10 alt3.aspmx.l.google.com.
5 alt2.aspmx.l.google.com.
10 alt4.aspmx.l.google.com.
Slack
It looks like they use Slack for messaging. (This was just a wild guess.)
https://openai.slack.com/ redirects to https://openai.enterprise.slack.com/ and shows “Sign in with AzureAD”.
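You can check the redirect yourself with a plain HTTP request; the exact target may change over time, so treat the output as a snapshot rather than proof.
# Print where the workspace URL redirects to (target may vary over time)
curl -s -o /dev/null -w '%{redirect_url}\n' https://openai.slack.com/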
Cloud Infrastructure
Cloud provider / Hyperscaler
It seems safe to say that the majority of OpenAI’s workloads run on Microsoft Azure, after Microsoft stated that OpenAI would migrate its services to Azure and that the two companies would collaborate on tailored infrastructure.
In 2019, it was announced that OpenAI would form an “exclusive computing partnership with Microsoft to build new Azure AI supercomputing technologies” [1]:
- Microsoft and OpenAI will jointly build new Azure AI supercomputing technologies
- OpenAI will port its services to run on Microsoft Azure, which it will use to create new AI technologies and deliver on the promise of artificial general intelligence
- Microsoft will become OpenAI’s preferred partner for commercializing new AI technologies
OpenAI uses MS Azure Name Servers
dig NS openai.com +short
ns3-02.azure-dns.org.
ns4-02.azure-dns.info.
ns1-02.azure-dns.com.
ns2-02.azure-dns.net.
OpenAI API resolves to an Azure IP address
dig api.openai.com +short
52.152.96.252 # Microsoft (Azure) IP
Their main website is running behind Microsoft Azure IPs
dig openai.com +short
13.107.238.44
13.107.237.44
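If you want to double-check who owns those IP addresses, a whois lookup should point at Microsoft (the exact field names differ between regional registries):
# Look up the registrant of the IPs behind api.openai.com and openai.com
whois 52.152.96.252 | grep -iE 'orgname|org-name|netname'
whois 13.107.238.44 | grep -iE 'orgname|org-name|netname'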
As every cloud vendor has its strengths and weaknesses, it’s also likely that, to a smaller extent, Amazon Web Services (AWS) and Google Cloud Platform (GCP) are being used as well. Both are also mentioned in job posts.
Infrastructure Orchestration
Kubernetes
OpenAI has shared in a number of blog posts how they moved from “bare metal Cloud VMs” to Kubernetes. Building large models like GPT-3, CLIP, and DALL·E requires infrastructure that can scale quickly and to a large fleet (up to 7,500 nodes). A single machine learning job may span hundreds of Kubernetes Pods and occupies most of the hardware resources on each node; the remainder goes to Prometheus exporters (node exporter) and the Kubelet (the Kubernetes node agent). The largest clusters run 5 Kubernetes API server nodes and 5 etcd nodes on their own dedicated machines, to minimize the impact if one were ever to go down. It’s not clearly stated whether OpenAI maintains their own Kubernetes clusters or runs them on Azure Kubernetes Service (AKS), but we can assume that they maintain their own K8s fleet. [2]
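To get a feel for that split between job Pods and node-level daemons on any cluster, you can compare a node’s raw capacity with what the kubelet actually exposes to Pods. The node name is a placeholder; this is a generic check, not anything OpenAI-specific.
# Compare raw capacity with what is allocatable to Pods after the kubelet
# and system daemons (e.g. node exporter) take their share
kubectl describe node <node-name> | grep -A 6 -E 'Capacity|Allocatable'

# List everything scheduled on one node: the training job's Pods plus DaemonSets
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name> -o wide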
Observability
For their observability tooling, OpenAI uses Grafana, Prometheus and Prometheus exporters. Previously they also used Datadog, which could still be the case. [3]
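As a minimal illustration of that stack, you can run the Prometheus node exporter locally and scrape its metrics endpoint yourself; the image and port below are the upstream defaults, not anything taken from OpenAI.
# Run the official node exporter (default port 9100) and peek at its metrics
docker run -d --name node-exporter -p 9100:9100 quay.io/prometheus/node-exporter:latest
curl -s http://localhost:9100/metrics | grep '^node_cpu_seconds_total' | head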
DDoS mitigation & CDN
Generally, most of OpenAI’s endpoints resolve to Azure-related hosts, but ChatGPT (chat.openai.com) sits behind Cloudflare. It appears that Cloudflare is mostly used for keeping bots out (anti-bot captcha) and for its CDN capabilities.
dig NS chat.openai.com +short
chat.openai.com.cdn.cloudflare.net.
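The response headers tell the same story: Cloudflare-proxied responses carry a cf-ray header and report “server: cloudflare”. Depending on the anti-bot rules you may get a challenge page instead of the app, but the headers still show up.
# Check for Cloudflare's proxy headers on the ChatGPT frontend
curl -sI https://chat.openai.com/ | grep -iE '^(server|cf-ray)'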
Infrastructure as Code (IaC)
Terraform
In multiple job posts, OpenAI mentions that their Infrastructure Engineers need to know their way around Terraform. There is no proof that they’re using it, but as it’s more or less standard infrastructure tooling, it would make sense that they describe their infrastructure as Terraform code. [4]
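Purely as a sketch of what that could look like, here is a minimal Terraform setup against Azure. The provider choice fits the Azure findings above, but the resource and its name are made up for illustration; OpenAI’s real infrastructure code is not public.
# Hypothetical minimal Terraform config for Azure (names are made up)
cat > main.tf <<'EOF'
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

# Illustrative resource only
resource "azurerm_resource_group" "ml_cluster" {
  name     = "rg-ml-cluster-example"
  location = "South Central US"
}
EOF

terraform init
terraform plan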
Chef
Also, a few infrastructure-related job posts mention that candidates should know Chef, which has at least been used in the past, according to code in their public GitHub organization. [5]
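For completeness, a Chef recipe for something like the node exporter mentioned above could look roughly like this. It is entirely hypothetical (package and service names vary by distribution) and not taken from OpenAI’s cookbooks.
# Hypothetical single-file recipe, applied locally with chef-apply
cat > node_exporter.rb <<'EOF'
package 'prometheus-node-exporter'

service 'prometheus-node-exporter' do
  action [:enable, :start]
end
EOF

chef-apply node_exporter.rb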
CI/CD
Besides the obvious use of GitHub Actions, which can be seen in OpenAI’s GitHub organization (used for Whisper and other code), a single job post mentions Buildkite, so Buildkite could also be used internally for code testing and deployment pipelines. [6]
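GitHub Actions workflows like the ones in the Whisper repository follow the usual pattern; the generic test workflow below is a plain example of that pattern, not copied from any OpenAI repository.
# Generic GitHub Actions workflow, stored at .github/workflows/test.yml
mkdir -p .github/workflows
cat > .github/workflows/test.yml <<'EOF'
name: tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install pytest
      - run: pytest
EOF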
APIs
Generally, the APIs seem to be powered by Flask and OpenAPI, with React frontends, according to their job postings. [7]
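From the outside, the only part of that stack you can actually touch is the public API itself, which is a plain JSON-over-HTTPS interface; you need your own API key in the OPENAI_API_KEY environment variable.
# List available models via the public REST API
curl -s https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY" | head -n 20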
Conclusion
From the outside, most of OpenAI’s infrastructure tooling doesn’t look like there is much secret sauce. They do post some impressive Kubernetes scaling numbers, though, and the hardware advancements in collaboration with Microsoft are very interesting.