Facebook releases design for souped-up AI server, 'Big Sur'
Facebook uses the server, code-named Big Sur, to run its machine learning programs, a type of AI software that "learns" and gets better at tasks over time. It's contributing Big Sur to the Open Compute Project, which it set up to let companies share designs for new hardware.
One common use for machine learning is image recognition, where a software program studies a photo or video to identify the objects in the frame. But it's being applied to all kinds of large data sets, to spot things like email spam and credit card fraud.
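To make that idea concrete, here's a minimal sketch of such a "learning" program, written with the Python scikit-learn library rather than anything Facebook has released; the handful of example messages and labels are invented purely for illustration.

```python
# Minimal sketch of a machine-learning task like spam detection, using
# scikit-learn. The tiny dataset here is invented for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

messages = [
    "Win a free prize now, click here",      # spam
    "Limited offer, claim your reward",      # spam
    "Meeting moved to 3pm tomorrow",         # not spam
    "Here are the notes from today's call",  # not spam
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = not spam

# Turn raw text into word counts, then fit a simple Naive Bayes model.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)

# The trained model generalizes, roughly, to messages it hasn't seen.
print(model.predict(["Claim your free reward today"]))   # likely [1]
print(model.predict(["Notes from tomorrow's meeting"]))  # likely [0]
```

Real systems train on millions of examples rather than four, which is where hardware like Big Sur comes in.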
Facebook, Google and Microsoft are all pushing hard at AI, which helps them build smarter online services. Facebook has released some open-source AI software in the past, but this is the first time it's released AI hardware.
Big Sur relies heavily on GPUs, which are often more efficient than CPUs for machine learning tasks. The server can have as many as eight high-performance GPUs that each consume up to 300 watts, and can be configured in a variety of ways via PCIe.
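Frameworks and hardware have moved on since Big Sur was designed, but the short Python sketch below, written with the PyTorch library rather than Facebook's own software, illustrates why the GPUs matter: it times the same large matrix multiplication, the core operation in neural network training, on the CPU and on a GPU if one is present.

```python
# Rough illustration (not Facebook's code): time the same matrix
# multiplication on the CPU and, if available, on a GPU. On workloads
# this shape, the GPU is typically far faster.
import time
import torch

def time_matmul(device: str, size: int = 4096) -> float:
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    if device == "cuda":
        torch.cuda.synchronize()  # make sure setup has finished before timing
    start = time.time()
    result = a @ b
    if device == "cuda":
        torch.cuda.synchronize()  # wait for the GPU kernel to finish
    return time.time() - start

print(f"CPU: {time_matmul('cpu'):.3f} s")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.3f} s")
```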
Facebook said in a blog post Thursday that the GPU-based system is twice as fast as its previous generation of hardware, and that spreading training across eight GPUs doubles the size and speed of the networks it can train.
One notable thing about Big Sur is that it doesn't require special cooling or other "unique infrastructure," Facebook said. High performance computers generate a lot of heat, and keeping them cool can be costly. Some are even immersed in exotic liquids to stop them overheating.
Big Sur doesn't need any of that, according to Facebook. It hasn't released the hardware specs yet, but images show a large airflow unit inside the server that presumably contains fans that blow cool air across the components. Facebook says it can use the servers in its air-cooled data centers, which avoid industrial cooling systems to keep costs down.
Like a lot of other Open Compute hardware, it's designed to be as simple as possible. OCP members are fond of talking about the "gratuitous differentiation" that server vendors put in their products, which can drive up costs and make it harder to manage equipment from different vendors.
"We've removed the components that don't get used very much, and components that fail relatively frequently — such as hard drives and DIMMs — can now be removed and replaced in a few seconds," Facebook said. All the handles and levers that technicians are supposed to touch are colored green, so the machines can be serviced quickly, and even the motherboard can be removed within a minute. "In fact, Big Sur is almost entirely tool-less --the CPU heat sinks are the only things you need a screwdriver for" Facebook says.
It's not sharing the design to be altruistic: Facebook hopes others will try out the hardware and suggest improvements. And if other big companies ask server makers to build their own Big Sur systems, the economies of scale should help drive costs down for Facebook.
Machine learning has come to the fore lately for a couple of reasons. One is that large data sets used to train the systems have become publicly available. The other is that powerful computers have gotten affordable enough to do some impressive AI work.
Facebook pointed to software it has already developed that can read stories, answer questions about an image, play games, and learn tasks by observing examples. "But we realized that truly tackling these problems at scale would require us to design our own systems," it said.
Big Sur, named after a stretch of picturesque California coastline, uses Nvidia GPUs based on the company's Tesla Accelerated Computing Platform.
Facebook said it will triple its investment in GPUs so that it can bring machine learning to more of its services.
"Big Sur is twice as fast as our previous generation, which means we can train twice as fast and explore networks twice as large," it said. "And distributing training across eight GPUs allows us to scale the size and speed of our networks by another factor of two."
Google is also rolling out machine learning across more of its services. "Machine learning is a core, transformative way by which we’re rethinking everything we’re doing," Google CEO Sundar Pichai said in October.
Facebook didn't say when it would release the specifications for Big Sur. The next OCP Summit in the U.S. takes place in March, so the company may say more about the system then.