Parallelize your Python Scripts in the Cloud with Dis.co and Custom Docker Images

Raymond Lo

March 19, 2020 · 6 min read

We have all faced the limitations of running complex Python scripts locally that take hours or days to complete. But what if we could take the same script, modify it a little, and distribute it across 5, 10, or even 100 computers to fully parallelize the workload? This is exactly what Dis.co helps you do, and in this post, I’ll show you how to use our Python SDK to run your scripts in a custom Docker image.


Dis.co’s Python SDK

Dis.co’s Python SDK includes important functions such as job creation, job handling, and job downloading. These functions let you seamlessly connect your Python workflow with Dis.co. To illustrate how this works, we’ll walk through a sample job that instructs the machines to download images from given URLs and then retrieves the results back to your local machine. All of the code is written in Python and is portable across any platform that can run it.

Let’s start with the actual code itself. You can download the source code from GitHub or check it out with the following command:
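A typical checkout looks something like this (the repository URL below is only a placeholder; substitute the actual link from the GitHub page):

    git clone https://github.com/<your-account>/disco-docker-example.git   # placeholder URL
    cd disco-docker-example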

This example is broken up into five main steps, which we’ll walk through in order to see how this feature works.


Step 1. Set Up A Custom Docker Image for Dis.co

Let’s get started by opening the disco.dockerfile. Once you open it, you will see a list of commands. In this particular example, we prepared our Docker image as a bare-bones setup with Python 3.7 and the other essential libraries pre-installed. For example, we added the wget tool to the system with apt-get to demonstrate how you can support additional libraries and tools. We also installed the Python libraries requests, pathlib, and datetime to support additional functions in our example. Developers who would like to learn more about the advanced features should definitely read Best practices for writing Dockerfiles.
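As a rough sketch (the exact base image and package list live in disco.dockerfile; this is only an approximation of what it contains):

    # Bare-bones Python 3.7 base image (sketch; disco.dockerfile is the reference)
    FROM python:3.7-slim

    # System tools added with apt-get (wget as the example)
    RUN apt-get update && apt-get install -y wget && rm -rf /var/lib/apt/lists/*

    # Python libraries used by the example scripts
    # (pathlib and datetime also ship with Python 3.7's standard library)
    RUN pip install requests pathlib datetime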

By using the apt-get install and pip install commands, you can install any required libraries, such as OpenCV, PyTorch, or TensorFlow, and then set up the environment for deployment. Once everything is configured, it’s time to build the image by running the following command:
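Assuming the file is named disco.dockerfile and using a placeholder image name, the build command looks something like this:

    # Replace "docker_id" with your Docker Hub ID; the image name and tag are placeholders
    docker build -f disco.dockerfile -t docker_id/disco-python37:latest .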

Make sure you add your Docker ID to the command. After the build completes, you can verify the image with the “docker images” command.

Also, you can now test the image by running it interactively with this command in Terminal:
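For example, with the same placeholder image name as above:

    # Start an interactive Python shell inside the image
    docker run -it docker_id/disco-python37:latest python3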

Next, you can verify that the required libraries are installed correctly by typing in the following Python code:
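A quick check along these lines confirms the imports work:

    >>> import requests, pathlib, datetime
    >>> print(datetime.datetime.now())   # prints the current date and time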

This should show the current date and time of your machine. After you validate that everything is working correctly, you can push the image to your own Docker Hub repository with the docker push command, replacing “docker_id” with your own Docker ID.
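With the same placeholder image name as before:

    # Replace "docker_id" with your own Docker ID
    docker push docker_id/disco-python37:latest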

To verify the push, you can log in to Docker Hub and you will see the image inside your repositories.

Finally, you can connect the image to your Dis.co account by following these instructions.


Step 2. Write the Python SDK Code

Next, let’s look at the sample code that uses Dis.co’s Python SDK. You will start in the main.py script, where you first need to import “disco” in order to access Dis.co’s job management features. Additionally, you will use the DockerImage class to select different Docker images at runtime.

Next, you select the custom Docker image you created in the first step, as shown in the sketch below.

With the correct Docker image ID, you can create a job that runs on that specific image. This is important because it lets different jobs run in different environments with varying hardware and software configurations. To enable this, simply define the docker_image_id in the “create” function.
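As a minimal sketch of this flow: only the names called out above (disco, DockerImage, the “create” function, and docker_image_id) come from this post; every other call, argument, and file name below is an assumption, and main.py in the sample repository is the authoritative version.

    # Sketch only: calls other than create(docker_image_id=...) are assumptions
    import disco
    from disco import DockerImage            # assumed import path for the class

    # Select the custom Docker image you connected to Dis.co in Step 1
    docker_image = DockerImage("your-custom-image-id")        # placeholder ID

    # Create a job pinned to that image via docker_image_id
    job = disco.job.create(
        script_file="download_image.py",                      # placeholder server script
        input_files=["task1.txt", "task2.txt"],               # per-task input files
        docker_image_id=docker_image.id,
    )

    # Job handling and result downloading (names are assumptions;
    # see main.py for the real calls)
    job.start()
    job.wait_for_finish()
    job.download_results(".")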

If you follow the code’s inline comments, you will notice that Dis.co requires some additional files: a server script and task input files. In the next two steps, you will create these files.


Step 3. Create the Server Script

The server script defines what each distributed server will execute in parallel at runtime. Each instance of the script that runs on a server requires a unique input file, which you define prior to job execution.

Any files copied to the ./run-result folder are automatically picked up by the dis.co agent, and you can download them later as dis.co results from any of the client libraries or the Web UI. The contents of run-result are downloaded back as a zip file.
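A minimal sketch of such a server script, assuming the per-task input file path arrives as the first command-line argument (the sample repository’s script is the reference):

    # Sketch: downloads the URL listed in the task's input file into ./run-result
    import sys
    import pathlib
    import requests

    def main():
        input_path = pathlib.Path(sys.argv[1])       # unique input file for this task
        url = input_path.read_text().strip()         # e.g. an image URL from task1.txt

        result_dir = pathlib.Path("./run-result")    # picked up by the dis.co agent
        result_dir.mkdir(exist_ok=True)

        response = requests.get(url, timeout=60)
        response.raise_for_status()
        (result_dir / url.split("/")[-1]).write_bytes(response.content)

    if __name__ == "__main__":
        main()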


Step 4. Create the Input Files

Input files in Dis.co define what each script instance will do at runtime. You can think of them as the parameters passed to each Python script, which runs independently and concurrently. In this example, the input is a URL, which we place inside the task1.txt file.

In this case, you simply pass the URL of the image you would like to retrieve, so each machine takes on its own task of downloading a different image. If you change the algorithm, you can instead have each machine crawl the web and perform data processing such as semantic analysis. This is a very powerful pattern: you can write a heavily parallelized system with only a few lines of code and scale it very well.
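For instance, each task file could hold a single line with the URL to fetch (the URLs below are placeholders):

    task1.txt:  https://example.com/images/cat.jpg
    task2.txt:  https://example.com/images/dog.jpg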


Step 5. Run and Examine the Results

Lastly, run the Python script. This will upload the job and then download the results back locally. If you encounter any errors, please double-check the installation instructions.
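Assuming the entry point is the main.py described in Step 2:

    # Uploads the job to Dis.co and downloads the results once it finishes
    python3 main.py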

Now, in your local directory, you will find two JPEG images. Open them and you should see a cat and a dog.

Imagine: if you add an image recognition algorithm to the server script, you can crawl the web and build your own cat-and-dog database on the fly. Pretty neat! And that’s how you can easily create a customizable, heavily parallelized, serverless engine in minutes with Dis.co – the easiest and most efficient way to scale your compute. Speak with an expert to learn how Dis.co enables easy parallelization of heavy-compute jobs.
