Microsoft is working on a new Artificial Intelligence infrastructure service codenamed ‘Singularity’ that will provide data scientists and AI researchers a way to build, scale, experiment, and iterate on their models on a Microsoft Azure Cloud service built specifically for AI.
A group of researchers, mostly of Indian origin, have published a paper that provides technical details about the ‘Singularity’ project.
Lowering costs by driving high utilisation across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft’s globally distributed scheduling service for highly-efficient and reliable execution of deep learning training and inference workloads, – the researchers wrote.
‘Singularity’ is a fully managed, globally distributed infrastructure service for AI workloads at Microsoft, with support for diverse hardware accelerators.
‘Singularity’ is designed from the ground up to scale across a global fleet of hundreds of thousands of GPUs and other AI accelerators.
“Singularity is built with one key goal: driving down the cost of AI by maximizing the aggregate useful throughput on a given fixed pool of capacity of accelerators at planet scale,” said the researchers.
In 2019, Microsoft invested $1 billion in OpenAI. Last year, it announced that ‘Azure OpenAI Service’ will give customers access to OpenAI’s powerful models in addition to the security, reliability, compliance, data privacy, and other enterprise-grade capabilities built into Microsoft Azure.
According to the researchers, ‘Singularity’ achieves a significant breakthrough in scheduling deep learning workloads, converting niche features such as elasticity into mainstream, always-on features that the scheduler can rely on for implementing stringent service level agreements (SLAs).
“Singularity enables unprecedented levels of workload fungibility, making it possible for jobs to take advantage of spare capacity anywhere in the globally-distributed fleet, while still preserving the SLAs,” they added.
Microsoft in 2020 unveiled a new powerful supercomputer in collaboration with OpenAI, making new infrastructure available in Azure to train extremely large AI models.
The supercomputer developed for OpenAI is a single system with more than 285,000 CPU cores, 10,000 GPUs, and 400 gigabits per second of network connectivity for each GPU server.