Add OpenAI Case Study (kubernetes#9224)
* remove duplicate crowdfire link

* add OpenAI Case Study

* fix spacing issue
alexcontini authored and k8s-ci-robot committed Jun 25, 2018
1 parent a29a924 commit 0f6fbe4
Showing 7 changed files with 108 additions and 7 deletions.
16 changes: 9 additions & 7 deletions content/en/case-studies/_index.html
@@ -12,6 +12,14 @@
<div class="content">
<div class="case-studies">

<div class="case-study">
<img src="/images/case_studies/openai_feature.png" alt="openAI">
<p class="quote">"Research teams can now take advantage of the frameworks we’ve built on top of Kubernetes, which make it easy to launch experiments, scale them by 10x or 50x, and take little effort to manage."</p>
<!--<p class="attrib">— Christopher Berner, Head of Infrastructure for OpenAI</p>-->
<a href="/case-studies/openai/">Read about OpenAI</a>
</div>


<div class="case-study">
<img src="/images/case_studies/newyorktimes_feature.png" alt="the new york times">
<p class="quote">"I think once you get over the initial hump, things get a lot easier and actually a lot faster."</p>
@@ -33,13 +41,6 @@
<a href="/case-studies/squarespace/">Read about Squarespace</a>
</div>

<div class="case-study">
<img src="/images/case_studies/crowdfire_feature.png" alt="Crowdfire">
<p class="quote">"In the 15 months that we’ve been using Kubernetes, it has been amazing for us. It enabled us to iterate quickly, increase development speed, and continuously deliver new features and bug fixes to our users, while keeping our operational costs and infrastructure management overhead under control."</p>
<!--<p class="attrib">—Amanpreet Singh, Software Engineer at Crowdfire</p>-->
<a href="/case-studies/crowdfire/">Read about Crowdfire</a>
</div>


</div>
</div>
@@ -62,6 +63,7 @@ <h4><i>"Kubernetes has the opportunity to be the new cloud platform. The amount
<main>
<h3>Kubernetes Users</h3>
<div id="usersGrid">
<a target="_blank" href="/case-studies/openai/"><img src="/images/case_studies/openai_feature.png" alt="OpenAI"></a>
<a target="_blank" href="/case-studies/newyorktimes/"><img src="/images/case_studies/newyorktimes_feature.png" alt="The New York Times"></a>
<a target="_blank" href="/case-studies/nordstrom/"><img src="/images/case_studies/nordstrom_feature.png" alt="Nordstrom"></a>
<a target="_blank" href="/case-studies/crowdfire/"><img src="/images/case_studies/crowdfire_feature.png" alt="Crowdfire"></a>
99 changes: 99 additions & 0 deletions content/en/case-studies/openAI.html
@@ -0,0 +1,99 @@
---
title: OpenAI Case Study
case_study_styles: true
cid: caseStudies
css: /css/style_case_studies.css
---

<div class="banner1 desktop" style="background-image: url('/images/CaseStudy_openAI_banner1.jpg')">
<h1> CASE STUDY:<img src="/images/openAI_logo.png" style="margin-bottom:-1%" class="header_logo"><br> <div class="subhead">Launching and Scaling Up Experiments, Made Simple

</div></h1>

</div>

<div class="details">
Company &nbsp;<b>OpenAI</b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Location &nbsp;<b>San Francisco, California</b>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Industry &nbsp;<b>Artificial Intelligence Research</b>
</div>

<hr>
<section class="section1">
<div class="cols">
<div class="col1">
<h2>Challenge</h2>
An artificial intelligence research lab, OpenAI needed infrastructure for deep learning that would allow experiments to be run either in the cloud or in its own data center, and to easily scale. Portability, speed, and cost were the main drivers.
<br>
<h2>Solution</h2>
OpenAI began running Kubernetes on top of AWS in 2016, and in early 2017 migrated to Azure. OpenAI runs key experiments in fields including robotics and gaming both in Azure and in its own data centers, depending on which cluster has free capacity. "We use Kubernetes mainly as a batch scheduling system and rely on our <a href="https://github.com/openai/kubernetes-ec2-autoscaler">autoscaler</a> to dynamically scale up and down our cluster," says Christopher Berner, Head of Infrastructure. "This lets us significantly reduce costs for idle nodes, while still providing low latency and rapid iteration."
</div>

<div class="col2">

<h2>Impact</h2>
The company has benefited from greater portability: "Because Kubernetes provides a consistent API, we can move our research experiments very easily between clusters," says Berner. Being able to use its own data centers when appropriate is "lowering costs and providing us access to hardware that we wouldn’t necessarily have access to in the cloud," he adds. "As long as the utilization is high, the costs are much lower there." Launching experiments also takes far less time: "One of our researchers who is working on a new distributed training system has been able to get his experiment running in two or three days. In a week or two he scaled it out to hundreds of GPUs. Previously, that would have easily been a couple of months of work."


</div>

</div>
</section>
<div class="banner2">
<div class="banner2text">

<div class="video">
<iframe width="560" height="315" src="https://www.youtube.com/embed/v4N3Krzb8Eg" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe><br>
<span style="font-size:21px;line-height:0.5em !important;width:60%;">Check out "Building the Infrastructure that Powers the Future of AI" presented by Vicki Cheung, Member of Technical Staff & Jonas Schneider, Member of Technical Staff at OpenAI from KubeCon/CloudNativeCon Europe 2017.</span>
</div>
</div>
</div>
<section class="section2">
<div class="fullcol">
<h2>From experiments in robotics to old-school video game play research, OpenAI’s work in artificial intelligence technology is meant to be shared.</h2>
With a mission to ensure powerful AI systems are safe, OpenAI cares deeply about open source—both benefiting from it and contributing safety technology into it. "The research that we do, we want to spread it as widely as possible so everyone can benefit," says OpenAI’s Head of Infrastructure Christopher Berner. The lab’s philosophy—as well as its particular needs—lent itself to embracing an open source, cloud native strategy for its deep learning infrastructure.<br><br>
OpenAI started running Kubernetes on top of AWS in 2016, and a year later, migrated the Kubernetes clusters to Azure. "We probably use Kubernetes differently from a lot of people," says Berner. "We use it for batch scheduling and as a workload manager for the cluster. It’s a way of coordinating a large number of containers that are all connected together. We rely on our <a href="https://github.com/openai/kubernetes-ec2-autoscaler">autoscaler</a> to dynamically scale up and down our cluster. This lets us significantly reduce costs for idle nodes, while still providing low latency and rapid iteration." <br><br>
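As a rough illustration of the batch-scheduling pattern Berner describes (a generic sketch, not OpenAI's internal tooling), an experiment can be submitted as a Kubernetes Job through the official kubernetes Python client. The image name, namespace, and resource figures below are placeholders:

```python
# Minimal sketch: submitting an experiment as a Kubernetes batch Job.
# Requires the official client (pip install kubernetes) and a reachable cluster.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run in-cluster

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="train-experiment-001"),
    spec=client.V1JobSpec(
        backoff_limit=0,  # batch experiments are relaunched deliberately, not retried
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(
                    name="trainer",
                    image="registry.example.com/experiments/train:latest",  # placeholder
                    command=["python", "train.py"],
                    resources=client.V1ResourceRequirements(
                        requests={"cpu": "4", "memory": "16Gi"},  # illustrative figures
                    ),
                )],
            )
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="experiments", body=job)
```

When Jobs queue up faster than existing nodes can absorb them, an autoscaler such as the one linked above can add nodes and remove them again as the queue drains, which is where the idle-node savings come from.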
In the past year, Berner has overseen the launch of several Kubernetes clusters in OpenAI’s own data centers. "We run them in a hybrid model where the control planes—the Kubernetes API servers, <a href="https://github.com/coreos/etcd">etcd</a> and everything—are all in Azure, and then all of the Kubernetes nodes are in our own data center," says Berner. "The cloud is really convenient for managing etcd and all of the masters, and having backups and spinning up new nodes if anything breaks. This model allows us to take advantage of lower costs and have the availability of more specialized hardware in our own data center."
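The "consistent API" point is concrete: with credentials for both clusters in a kubeconfig, the same client code can target either one just by switching contexts. A minimal sketch, assuming hypothetical context names:

```python
# Sketch: one codebase, two clusters. Context names are hypothetical.
from kubernetes import client, config

for ctx in ("azure-cluster", "onprem-cluster"):
    config.load_kube_config(context=ctx)  # select the target cluster
    nodes = client.CoreV1Api().list_node().items
    print(f"{ctx}: {len(nodes)} nodes registered")
```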


</div>
</section>
<div class="banner3" style="background-image: url('/images/CaseStudy_openAI_banner3.jpg')">
<div class="banner3text">
OpenAI’s experiments take advantage of Kubernetes’ benefits, including portability. "Because Kubernetes provides a consistent API, we can move our research experiments very easily between clusters..."

</div>
</div>
<section class="section3">
<div class="fullcol">
Different teams at OpenAI currently run a couple dozen projects. While the largest-scale workloads manage bare cloud VMs directly, most of OpenAI’s experiments take advantage of Kubernetes’ benefits, including portability. "Because Kubernetes provides a consistent API, we can move our research experiments very easily between clusters," says Berner. The on-prem clusters are generally "used for workloads where you need lots of GPUs, something like training an ImageNet model. Anything that’s CPU heavy, that’s run in the cloud. But we also have a number of teams that run their experiments both in Azure and in our own data centers, just depending on which cluster has free capacity, and that’s hugely valuable."<br><br>
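Steering GPU-heavy work to the right nodes uses standard pod-level knobs. In a sketch like the following, nvidia.com/gpu is the conventional NVIDIA device-plugin resource name, while the node label is a hypothetical example rather than anything OpenAI documents:

```python
# Sketch: requesting GPUs and pinning a pod to GPU nodes.
# nvidia.com/gpu is the standard device-plugin resource name; GPU counts
# must be set as limits. The node label below is a hypothetical example.
from kubernetes import client

gpu_pod_spec = client.V1PodSpec(
    restart_policy="Never",
    node_selector={"accelerator": "nvidia-v100"},  # hypothetical label
    containers=[client.V1Container(
        name="imagenet-trainer",
        image="registry.example.com/experiments/imagenet:latest",  # placeholder
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "8"},
        ),
    )],
)
```

A CPU-bound variant of the same spec simply drops the GPU limit and the selector, which is what makes "run it wherever there is free capacity" cheap to express.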
Berner has made the Kubernetes clusters available to all OpenAI teams to use if it’s a good fit. "I’ve worked a lot with our games team, which at the moment is doing research on classic console games," he says. "They had been running a bunch of their experiments on our dev servers, and they had been trying out Google Cloud, managing their own VMs. We got them to try out our first on-prem Kubernetes cluster, and that was really successful. They’ve now moved over completely to it, and it has allowed them to scale up their experiments by 10x, and do that without needing to invest significant engineering time to figure out how to manage more machines. A lot of people are now following the same path."

</div>
</section>
<div class="banner4" style="background-image: url('/images/CaseStudy_openAI_banner4.jpg')">
<div class="banner4text">
"One of our researchers who is working on a new distributed training system has been able to get his experiment running in two or three days," says Berner. "In a week or two he scaled it out to hundreds of GPUs. Previously, that would have easily been a couple of months of work."
</div>
</div>

<section class="section5" style="padding:0px !important;">
<div class="fullcol">
That path has been simplified by frameworks and tools that two of OpenAI’s teams have developed to handle interaction with Kubernetes. "You can just write some Python code, fill out a bit of configuration with exactly how many machines you need and which types, and then it will prepare all of those specifications and send it to the Kube cluster so that it gets launched there," says Berner. "And it also provides a bit of extra monitoring and better tooling that’s designed specifically for these machine learning projects."<br><br>
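OpenAI has not published that launcher, but the shape Berner describes (a little Python, a little configuration) is easy to imagine. The names below, ExperimentConfig and launch, are invented for illustration and are not OpenAI's API:

```python
# Hypothetical sketch of a launcher in the spirit Berner describes: a
# researcher fills in a small config and the tool renders and submits a Job.
from dataclasses import dataclass
from kubernetes import client, config

@dataclass
class ExperimentConfig:
    name: str
    image: str
    replicas: int = 1          # "how many machines"
    gpus_per_replica: int = 0  # "which types"

def launch(cfg: ExperimentConfig) -> None:
    config.load_kube_config()
    limits = {"nvidia.com/gpu": str(cfg.gpus_per_replica)} if cfg.gpus_per_replica else {}
    job = client.V1Job(
        metadata=client.V1ObjectMeta(name=cfg.name),
        spec=client.V1JobSpec(
            completions=cfg.replicas,
            parallelism=cfg.replicas,  # run all replicas at once
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="worker",
                        image=cfg.image,
                        resources=client.V1ResourceRequirements(limits=limits),
                    )],
                )
            ),
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="experiments", body=job)

launch(ExperimentConfig(name="dist-train-demo",
                        image="registry.example.com/train:latest",  # placeholder
                        replicas=16, gpus_per_replica=1))
```

Scaling an experiment "by 10x or 50x" then amounts to changing one integer in the config rather than provisioning and wiring up machines by hand.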
The impact that Kubernetes has had at OpenAI is impressive. With Kubernetes and its frameworks and tooling, including the autoscaler, in place, launching experiments takes far less time. "One of our researchers who is working on a new distributed training system has been able to get his experiment running in two or three days," says Berner. "In a week or two he scaled it out to hundreds of GPUs. Previously, that would have easily been a couple of months of work." <br><br>
Plus, the flexibility they now have to use their on-prem Kubernetes cluster when appropriate is "lowering costs and providing us access to hardware that we wouldn’t necessarily have access to in the cloud," he says. "As long as the utilization is high, the costs are much lower in our data center. To an extent, you can also customize your hardware to exactly what you need."


</div>

<div class="banner5">
<div class="banner5text">
"Research teams can now take advantage of the frameworks we’ve built on top of Kubernetes, which make it easy to launch experiments, scale them by 10x or 50x, and take little effort to manage." <br><span style="font-size:14px;letter-spacing:0.12em;padding-top:20px">— CHRISTOPHER BERNER, HEAD OF INFRASTRUCTURE FOR OPENAI</span>
</div>
</div>

<div class="fullcol">

OpenAI is also benefiting from other technologies in the CNCF cloud-native ecosystem. <a href="https://grpc.io/">gRPC</a> is used by many of its systems for communications between different services, and <a href="https://prometheus.io/">Prometheus</a> is in place "as a debugging tool if things go wrong," says Berner. "We actually haven’t had any real problems in our Kubernetes clusters recently, so I don’t think anyone has looked at our Prometheus monitoring in a while. If something breaks, it will be there."<br><br>
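The Prometheus setup Berner alludes to is standard; for reference, instrumenting a Python service for scraping takes only a few lines with the official prometheus-client library. The metric names here are illustrative, not OpenAI's:

```python
# Sketch: exposing metrics for Prometheus to scrape (pip install prometheus-client).
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

LAUNCHES = Counter("experiments_launched_total", "Experiments submitted")
LATENCY = Histogram("experiment_launch_seconds", "Time to submit an experiment")

start_http_server(8000)  # metrics served at http://localhost:8000/metrics

while True:
    with LATENCY.time():
        time.sleep(random.uniform(0.1, 0.5))  # stand-in for real submission work
    LAUNCHES.inc()
```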
One of the things Berner continues to focus on is Kubernetes’ ability to scale, which is essential to deep learning experiments. OpenAI has been able to push one of its Kubernetes clusters on Azure up to more than <a href="https://blog.openai.com/scaling-kubernetes-to-2500-nodes/">2,500 nodes</a>. "I think we’ll probably hit the 5,000-machine number that Kubernetes has been tested at before too long," says Berner, adding, "We’re definitely <a href="https://jobs.lever.co/openai/f163bf64-278e-417b-ad2e-5e508a29eb71">hiring</a> if you’re excited about working on these things!"
</div>

</section>
Binary file added static/images/CaseStudy_openAI_banner1.jpg
Binary file added static/images/CaseStudy_openAI_banner3.jpg
Binary file added static/images/CaseStudy_openAI_banner4.jpg
Binary file added static/images/case_studies/openai_feature.png
Binary file added static/images/openAI_logo.png
