Machine Learning with Disco

Yuval Greenfield

February 24, 2020 · 5 min read

In recent years, hardware hasn't kept up with the rate at which data is growing. More precisely, a single machine can't keep up with the size of many datasets. Sharing the load across local and cloud machines can reduce hour-long tasks to minutes, but it requires development effort and expertise. You don't have to build that infrastructure yourself. Dis.co provides an easy and cost-effective way to use multiple machines to make compute jobs faster. Whether you need to run a simulation, process footage, or train a model, Dis.co simplifies distributing compute tasks across your own hardware, the cloud, or both. You can launch jobs from the command line interface, the Python SDK, or the web interface. This post explains how Disco can be used to accelerate machine learning jobs. For a more in-depth intro to Disco, see this post on how Disco works.

Where does Disco fit in an ML workflow?

Disco shines where there is an opportunity to parallelize. Commonly these tasks are:

  • Data processing (or preprocessing)
  • Hyperparameter optimization
  • Testing heavy models on massive datasets

The following is an example of hyperparameter optimization; you can clone the repo and follow along after you set up your account. Please star and follow our examples repo to get notified when we publish more scenarios, or reach out and we can build an example for your use case.

Multi-multiprocessing on a cloud

Spinning up a machine might be worth it for a single function that takes over an hour to complete, and it's especially useful when you have many such functions to run. The trade-offs to consider when using the cloud are speed, cost, and setup complexity. The following snippet uses discomp, which provides a drop-in replacement for multiprocessing.Pool(). Each iteration of the map might run on a different machine, while our main machine only waits for the results.

import discomp

def pow3(x):
    print(x**3)
    return x**3

with discomp.Pool() as po:
    results = po.map(pow3, range(10))
    print(results)
    # prints [0, 1, 8, 27, 64, 125, 216, 343, 512, 729]

Hyperparameter optimization on a cloud

The above example is simple. For a more complete scenario, you can see how I compared the performance of different hyperparameter configurations on a credit card fraud dataset using XGBoost. Testing 30 sets of parameters took 83 minutes locally, but by leveraging cloud machines, I was able to run the tests 9 times faster.
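
To make the scenario concrete, here is a minimal sketch of what a per-configuration training function could look like, assuming an XGBoost classifier scored with cross-validated AUC. The function name test_fit matches the snippet later in this post, but the body is illustrative; the examples repo loads the actual fraud dataset and may score differently.

# Hypothetical sketch of a per-configuration evaluation function;
# the repo's real test_fit loads the credit card fraud dataset instead
# of the synthetic stand-in used below.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def test_fit(params):
    # Synthetic, imbalanced binary-classification data as a stand-in.
    X, y = make_classification(n_samples=5_000, n_features=20,
                               weights=[0.98, 0.02], random_state=0)
    model = XGBClassifier(**params)
    # Mean cross-validated AUC for this hyperparameter configuration.
    return cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()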

In Python, you can use a for loop to iterate over a sequence and execute code for every item. Another option is the built-in map function, which applies a function to every item of a sequence. Note that in Python 3, map returns a lazy iterator, so you wrap it in list() when you want the calls to run immediately.

for item in sequence:
    my_function(item)

# roughly equivalent to the loop above; map is lazy in Python 3,
# so wrapping it in list() forces my_function to run for every item
list(map(my_function, sequence))

The nice thing about map is that it's easier to distribute the workload. You can call multiprocessing.Pool().map to distribute the load amongst local processes, or discomp.Pool().map to send the function to your cloud or cluster. In my benchmark, the XGBoost optimization job took 83 minutes using a plain map call, 35 minutes using multiprocessing.Pool().map, and 9 minutes using discomp.Pool().map.
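
Here is a hedged side-by-side sketch of the three variants; my_function and sequence are placeholders rather than code from the repo, and the discomp call assumes the package is installed and your Dis.co account is configured.

import multiprocessing

import discomp

def my_function(item):
    return item * 2

if __name__ == "__main__":
    sequence = range(10)

    # 1. Built-in map: runs serially in this process.
    serial_results = list(map(my_function, sequence))

    # 2. multiprocessing.Pool().map: fans out across local CPU cores.
    with multiprocessing.Pool() as pool:
        local_results = pool.map(my_function, sequence)

    # 3. discomp.Pool().map: same call, but each task can run on a
    #    separate cloud or cluster machine.
    with discomp.Pool() as pool:
        remote_results = pool.map(my_function, sequence)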

The following part of the code is what made the discomp case faster.

import numpy as np
import discomp

# test_fit, param_options, and param_test_count are defined earlier in the script
with discomp.Pool() as pool:
    results = pool.map(test_fit, param_options)
    best_index = np.argmax(results)
    print(f"Ran {param_test_count} tests")
    print(f"Best result {results[best_index]:0.2f}")
    print(f"Best params: {param_options[best_index]}")

You can see a discomp.Pool() is instantiated, and we use the pool.map function, which sends the test_fit function to multiple machines to run in parallel. The exact number of machines depends on the number of tasks, and the number of tasks equals the number of items in the param_options list.
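
For context, param_options is just a list of hyperparameter dictionaries, one per task. The repo builds its own grid; the following is only an illustrative way to produce 30 configurations with XGBoost-style parameter names.

# Illustrative construction of param_options; the actual grid in the
# examples repo may differ. This produces 30 configurations (5 x 3 x 2).
from itertools import product

max_depths = [3, 4, 5, 6, 7]
learning_rates = [0.01, 0.1, 0.3]
n_estimators = [100, 300]

param_options = [
    {"max_depth": d, "learning_rate": lr, "n_estimators": n}
    for d, lr, n in product(max_depths, learning_rates, n_estimators)
]
param_test_count = len(param_options)  # 30 tasks, so up to 30 machines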

Data locality

The larger the datasets, the more important data locality becomes. Make sure the execution environment is as close as possible to the data itself to avoid extra costs and slowdowns. Your data should not leave the datacenter, and in cloud-provider terms that usually means your data shouldn't leave its “region”. For example, on AWS, moving 10 GB costs $1.50 each time it leaves the S3 bucket's region.
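
If your data lives in S3, it can be worth confirming the bucket's region before launching a job. Below is a minimal sketch using boto3, assuming AWS credentials are configured; the bucket name is a placeholder.

# Check which region an S3 bucket lives in before launching a job.
# "my-training-data" is a hypothetical bucket name.
import boto3

s3 = boto3.client("s3")
location = s3.get_bucket_location(Bucket="my-training-data")
# get_bucket_location returns None for buckets in us-east-1.
region = location["LocationConstraint"] or "us-east-1"
print(f"Bucket region: {region}")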

Disco supports a variety of deployment scenarios. “Disco Cloud” is the simplest, where we provide the entire solution and bill users directly for compute hours. When you use the Disco Cloud, make sure your data is on AWS us-west-2.

If you deploy Disco on your own cloud, you'll choose which region to deploy to. In this scenario, make sure you deploy Disco in the same region where your data lives.

Want to go fast?

If you'd like to scale your ML workflow to the cloud, Dis.co makes it easy and efficient. To see how Dis.co can accelerate AI/ML model training for your company, please visit try.dis.co/aiml.
