Solving Model Training Constraints With Serverless Edge Computing

Avi Barliya

March 26, 2020 · 8 min read

Artificial intelligence and machine learning applications are breaking new ground in every industry, but the cutting edge of these technologies may be advancing faster than the processing power required to support model training. Constraints on processing power limit the speed and efficiency with which data scientists and other tech professionals can train their models, and that limits the value they can create. Delays like these mean longer gaps between phases of product development, and sometimes even the most innovative applications of AI reach their end before achieving commercial viability due to time and resource constraints. These constraints are forcing AI professionals to look for novel solutions, which they are finding in multi-cloud environments and serverless edge computing.

Data scientists are constantly seeking better ways to process the mountains of data necessary to train AI and ML models — and many of them have found exactly what they’re looking for with Dis.co. Our user-friendly platform makes parallelizing heavy compute demands easy. By enabling complex jobs to run faster and more efficiently than ever, Dis.co is helping the AI/ML industry outrun the constraints that are holding back the otherwise limitless potential of this technology.

Model Training Compute Constraints

AI breakthroughs are revolutionizing healthcare, communications, engineering, and commerce, but this progress is coming at a cost. The models supporting these advanced applications demand massive amounts of computing power to train for real-world applications, and the industry is facing a potential shortage of this valuable resource.

The compute power necessary to train the largest AI models is increasing exponentially. Research from OpenAI has tracked this growth, finding that the compute demands for training the largest AI models have doubled approximately every 100 days since 2012. Thanks to the magic of compounding, that pace means the computing power needed to train large AI models is growing roughly seven times faster than earlier estimates based on 2018 data suggested. This raises real concerns about a compute bottleneck that could choke off further innovation in fields that demand real-time processing or big data management.
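To make that compounding concrete, here is a quick back-of-the-envelope calculation in Python. The 3.4-month figure is OpenAI's reported doubling period (roughly the 100 days cited above); the two-year figure is an assumed stand-in for the earlier, Moore's-law-paced era.

```python
# Back-of-the-envelope arithmetic behind the growth claim above.
doubling_months = 3.4    # OpenAI's reported post-2012 doubling period (~100 days)
moore_months = 24.0      # stand-in for the earlier, roughly two-year doubling era

per_year = 2 ** (12 / doubling_months)        # ~11.5x more compute each year
per_year_moore = 2 ** (12 / moore_months)     # ~1.4x per year at the old pace

print(f"Post-2012 pace: ~{per_year:.1f}x per year")
print(f"Pre-2012 pace:  ~{per_year_moore:.1f}x per year")
```

A training compute budget that grows elevenfold every year quickly outruns anything a fixed pool of hardware can provide, which is the bottleneck the rest of this post is about.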

AI and ML models are growing more sophisticated, and this increasing complexity is driving global demand for computing power. That demand is raising costs and reducing the availability of model training resources, particularly for startups and academic teams working with limited budgets.

Models Are Demanding More

As developers build upon the innovations of their predecessors, AI applications are becoming increasingly sophisticated. Programmers have created algorithms that teach themselves complex tasks, from manipulating a Rubik’s Cube with a robotic hand to flying drones autonomously. AI is beating humans at games, recognizing voices and images, and tackling some of our most pressing concerns in healthcare and scientific research. However, pushing the terabytes of data these algorithms require to learn their intended task through training is enough to melt your average PC into a puddle.

The algorithm that taught a robotic hand to manipulate a Rubik’s Cube broke new ground in the field, but doing so required over 1,000 desktop computers and about a dozen specialized GPU machines running at full tilt for several months. These resources are estimated to have consumed around 2.8 gigawatt-hours of electricity, roughly the output of three utility-scale power plants running for a full hour.

The advancement of AI and ML is incredibly exciting, and it carries the potential to make our world a happier, healthier, and safer place. But the challenges AI applications take on are growing ever more complicated, and the models these innovations rely upon are becoming ever more demanding: they consume more compute power, use more data, and train for longer. Fortunately, improved computing solutions are rising to meet the model training needs of data scientists facing processing constraints, freeing up the marketplace for greater innovation.

The Future of Model Training is Distributed

For advanced AI/ML applications, standard model training just isn’t cutting it anymore. Demand for computing power in this space has grown so much that the burdens of model training must be distributed across multiple machines, and across networks that aggregate every bit of available computing power for its best use. When it comes to distributing model training, then, it’s not a question of whether it will occur, but how.

Model training is a very compute-intensive process; it can take many weeks to train a single model even when developers take additional measures to expedite things. Training runs faster on higher-end machines or GPUs, but amassing that costly hardware can be prohibitively expensive. For many, the best way to speed up a compute process like training is therefore to parallelize the workload by breaking the problem into smaller chunks. Dis.co enables this kind of parallelization across hybrid cloud environments, offering substantial benefits beyond what is possible on local machines.
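As an illustration of the pattern (not Dis.co’s actual API, which isn’t shown here), the sketch below splits a heavy job into chunks and fans them out across local CPU cores using Python’s standard library; train_on_chunk is a hypothetical stand-in for the real per-chunk work.

```python
# A minimal sketch of data-parallel chunking with the standard library.
# This is the general split/compute/combine pattern a platform like
# Dis.co automates at cloud scale, not its API.
from concurrent.futures import ProcessPoolExecutor

def train_on_chunk(chunk):
    # Hypothetical per-chunk work, e.g. fitting a model on one data shard.
    return sum(x * x for x in chunk)  # stand-in for a heavy compute step

def split(data, n_chunks):
    # Break the dataset into roughly equal shards.
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = split(data, n_chunks=8)
    with ProcessPoolExecutor() as pool:   # one worker per CPU core by default
        partials = list(pool.map(train_on_chunk, chunks))
    print(sum(partials))                  # combine the partial results
```

The same shape carries over when the workers are cloud instances instead of local processes; only the transport changes.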

Data scientists and tech innovators have developed several strategies for multiplying their model training efficiency, but on-prem solutions are constrained by the number and processing power of the GPUs or CPUs available to a job. To increase the compute resources available for training their models, many data scientists are moving their jobs from local machines to the cloud. As this occurs, technologies enabling cloud-based distributed computing are allowing developers to train models with greater speed and efficiency.

AI professionals already use multi-node architectures to parallelize model training across multiple GPUs. As compute problems become more complex, however, an increasing number of nodes will be necessary, raising novel issues around effective and efficient communication between nodes in a distributed training architecture. One of the simplest ways to keep that communication efficient is to locate data processing as close to the source of data creation as possible. In the age of the IoT, more and more training data is generated at the edges of compute networks, so as innovators in distributed compute technology continue to find new ways to support ML model training jobs, the nodes running these instances will move farther and farther out toward the edges of supporting networks.
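For a concrete sense of what multi-node parallel training looks like, here is a minimal PyTorch DistributedDataParallel sketch. It assumes a launcher such as torchrun has set the usual environment variables (MASTER_ADDR, RANK, LOCAL_RANK) and that each node has a GPU; the tiny linear model and random batch are placeholders for a real workload.

```python
# Minimal multi-node data parallelism with PyTorch DDP (launch via torchrun).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")      # NCCL handles GPU-to-GPU comms
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(128, 10).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])  # gradients sync across all nodes

    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    x = torch.randn(32, 128).cuda(local_rank)          # placeholder batch
    y = torch.randint(0, 10, (32,)).cuda(local_rank)   # placeholder labels

    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()   # the all-reduce of gradients between nodes happens here
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Every gradient step triggers cross-node communication, which is exactly why node placement, and eventually edge placement, matters as node counts grow.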

Model Training on the Edge

Standard machine learning models rely on centralized training data, typically stored on a single machine or cloud. Under this approach, data is collected from wherever it is siloed and brought to a central location for a particular purpose. Once assembled, the data is run through an increasingly large number of model parameters. This process is time-consuming and resource-inefficient, and it’s simply unworkable for many data scientists working with complex models.
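A minimal sketch of that centralized pattern, with NumPy standing in for a real pipeline: all the siloed data is gathered to one place first, and only then run through the model’s parameters. The data sources and the simple linear model are illustrative assumptions.

```python
# Centralized training: gather everything, then train in one place.
import numpy as np

def gather_all_data(sources):
    # Stand-in for pulling siloed data onto a single machine or cloud store.
    X = np.vstack([s["X"] for s in sources])
    y = np.concatenate([s["y"] for s in sources])
    return X, y

rng = np.random.default_rng(0)
sources = [{"X": rng.normal(size=(500, 8)), "y": rng.normal(size=500)}
           for _ in range(4)]                  # four illustrative data silos

X, y = gather_all_data(sources)                # all data lands in one location
w = np.zeros(X.shape[1])
for _ in range(100):                           # plain gradient descent, full dataset
    w -= 0.1 * X.T @ (X @ w - y) / len(y)
```

Every byte has to travel to the center before a single parameter update can happen, which is the inefficiency the distributed approaches below avoid.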

Distributed compute has the potential to address many of the model training constraints facing AI/ML developers. By distributing the workload for AI/ML model training, serverless edge computing solutions are able to optimize the performance of large jobs. This is possible because distributed computing solutions address both the big data and model complexity constraints burdening centralized approaches to model training, opening up an entirely new world of possibilities for complex AI/ML models.

Because it harnesses both cloud-based and distributed resources, model training on the edge can proceed either top-down or bottom-up. In a top-down approach, the cloud learns model parameters and sends them to serverless edge computing resources; each edge site then adapts the model to suit the constraints of its unique environment or data set. A bottom-up approach, on the other hand, has each serverless edge computing site learn relevant model parameters from locally stored data; from there, the distributed resource sends its parameters to the cloud, where they can be combined into a whole.
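Here is a hedged sketch of the bottom-up pattern, in the spirit of federated averaging: each edge site fits parameters on its own local data, and the cloud combines the results, here by simple averaging. The sites, data, and combination rule are illustrative assumptions, not a prescribed algorithm.

```python
# Bottom-up edge training: local fits at each site, aggregation in the cloud.
import numpy as np

def local_train(X, y, steps=100, lr=0.1):
    # Each edge site runs gradient descent on locally stored data only.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(1)
edge_sites = [(rng.normal(size=(300, 8)), rng.normal(size=300))
              for _ in range(5)]               # five illustrative edge sites

local_models = [local_train(X, y) for X, y in edge_sites]  # data stays at the edge
global_model = np.mean(local_models, axis=0)               # cloud combines parameters
print(global_model.round(3))
```

Note what moves over the network: a handful of parameters per site rather than every raw data point, which is the core efficiency argument for training at the edge.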

Limited compute resources are stifling the potential of AI/ML applications, and model training has already decentralized to some degree out of necessity. As 5G and other edge-enabling technologies evolve, AI/ML applications will grow increasingly distributed. Until the infrastructure exists to support fully distributed model training, however, parallelization across hybrid cloud environments remains the best approach for optimizing complex model training jobs. If your organization is running AI/ML model training jobs or other demanding compute instances, speak with an expert to learn how Dis.co can improve speed and efficiency with easy parallelization of heavy-compute jobs.
