Web crawling and scraping are common techniques you can use to collect large amounts of data quickly. Crawling can be used to create archives of whole websites that can be processed more easily at a later time. Scraping specific information on these pages can also help improve data analytics or business intelligence applications.
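To make the crawling half concrete, here is a minimal sketch of link discovery using only the Python standard library. It is not the code used for this project; the `LinkExtractor` class, the `same_domain_links` helper, and the `nutrition.example` URLs are all hypothetical names chosen for illustration, and the HTML is an inline sample so the snippet runs offline.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, value))


def same_domain_links(html, base_url):
    """Return de-duplicated links that stay on the base URL's domain."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    domain = urlparse(base_url).netloc
    kept = []
    for link in parser.links:
        if urlparse(link).netloc == domain and link not in kept:
            kept.append(link)
    return kept


# Inline sample page (hypothetical): one internal link, one external, one duplicate.
sample = (
    '<a href="/food/1">Apple</a> '
    '<a href="https://other.example/x">Ad</a> '
    '<a href="/food/1">Apple again</a>'
)
links = same_domain_links(sample, "https://nutrition.example/index")
print(links)
```

A crawler would feed each returned link back into a queue of pages to visit, while a scraper would pull specific fields out of each page instead of links.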
Data scientists may need some preliminary test data before architecting a model, and scraping publicly available data is an easy way to get started. It’s also now legal, with limitations. Unfortunately, depending on the amount of data you need to collect and process, this can be very time-consuming. In this article, we’ll discuss some techniques for web crawling and scraping with Python and Dis.co, a serverless computing platform that uses cloud parallelization to improve the speed and efficiency of heavy compute jobs.
As with any complex job, the computer you run the process on becomes dedicated to that sole task. While you can increase performance by taking advantage of multiple processors and threads, you are still limited by the performance of the computer running your job. If you are a data scientist looking to collect a large amount of publicly available data for initial training, you wouldn’t want to run this on your main computer – it would take up too much of your local compute resource.
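For reference, local parallelism with threads might look like the sketch below. It assumes a hypothetical `fetch_all` helper and uses a stand-in `fake_fetch` function in place of a real HTTP request so it runs offline; however many workers you add, all of them still share one machine's CPU, memory, and network connection.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Run fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real HTTP request so the sketch runs without a network.
    return f"<html>{url}</html>"


# Hypothetical URLs, for illustration only.
urls = [f"https://nutrition.example/food/{i}" for i in range(5)]
pages = fetch_all(urls, fake_fetch)
print(len(pages))
```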
While trying to collect data on my own experiments, I ran into a similar problem. Cloud parallelization offers an immediate means of solving the time and resource constraints associated with data crawling for model training, so I decided to try it for myself with Dis.co.
The Goal: Crawl Model Training Data
I’m laying down the foundational work to help train a nutritional model. But in order to do this, I needed a large data set of foods. I had to commit local resources to proper cleaning, organization, and tagging of production data. So, I started like most people would, by writing a simple Jupyter notebook that crawled publicly available nutritional data sites.
The job ran on my laptop, and it took a little less than a day to crawl about 200,000 URLs from a single domain. Completing this task in a day seemed like a meaningful achievement – that is, until I discovered how much deeper our rabbit hole goes.
The nutrition database I chose to crawl had 28 independent domains, each containing links to the URLs I was crawling for. Running this on a notebook was not going to work, and I really didn’t want to bother setting up a cluster in GCP or AWS. At this point, it became apparent that I needed a more efficient way to scale this up.
So, Dis.co to the rescue!
Dis.co is designed to be a raw set of APIs that you can build on top of to help abstract away the complexity of parallelizing your workload over multiple computers. I started with a relatively simple test to assess how Dis.co works with web crawling and scraping tasks.
Once I had proven that my idea worked on its own, I was able to scale it using Dis.co in three straightforward steps:
1. Building a Custom Docker Image
Whenever we have to run scraping scripts across multiple computers with a specific set of Python packages, creating a Docker image is typically the best way to go. Dis.co has recently added support for custom Docker images and the ability to use pip install to load an environment from a requirements.txt.
There are easier ways to get a Docker image into a registry, but I like automation. So, after reading this post by Jim Crist, I created a new GitHub repo and a Docker Hub repo, and configured automated builds in Docker Hub.
2. Connecting A Cloud Provider
Now that we have a sample crawl job running, it’s time to unleash all 28 jobs with 4 web crawler tasks each. These tasks could take anywhere from 10 minutes to 24 hours per instance on a local machine. Running them simultaneously on multiple cloud computers would greatly improve the speed and efficiency of the jobs.
Since Dis.co supports AWS and GCP in addition to its own discoCloud, we can run our instances using any cloud platform we want. Integrating AWS, GCP, Packet, or Azure is as simple as selecting “Add Cloud” from the “Cloud” tab in the Settings menu of the Dis.co dashboard:
From there, you just click on “Cloud Type” to select the service you want to use and fill in the required information describing the job:
After I configured my project to use AWS, I needed to make sure that my AssumeRole was enabled in my AWS account. To onboard AWS Cloud to Dis.co, we merely select “AWS” from the “Cloud Details” page. From there, Dis.co provides a link to follow to the AWS CloudFormation console. Cloud creation starts after we enable the integration and select “Create Stack.” Altogether, setup took about two minutes.
After configuring AWS as a backend for my Dis.co, I was ready to start running my jobs!
3. Running the jobs
Finally, with everything configured, we can run our 28 separate jobs, each with 4 configuration files (one for each part of the site to be crawled). I spaced these tasks out over the course of a few days so that we do not cause any issues with the sites we are crawling.
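One way to generate that 28 × 4 grid of task configurations is sketched below. The domain names, section names, and config fields are placeholders I made up for illustration; they are not the real sites or the format Dis.co expects.

```python
import itertools
import json

# Placeholder values: 28 made-up domains and 4 made-up site sections.
DOMAINS = [f"site{i:02d}.example" for i in range(1, 29)]
SECTIONS = ["part-a", "part-b", "part-c", "part-d"]


def build_configs(domains, sections):
    """Return one config dict per (domain, section) pair."""
    return [
        {
            "job": domain,
            "task": f"{domain}-{section}",
            "start_url": f"https://{domain}/{section}/",
        }
        for domain, section in itertools.product(domains, sections)
    ]


configs = build_configs(DOMAINS, SECTIONS)
print(len(configs))
print(json.dumps(configs[0], indent=2))
```

Each dict could then be written to its own file and handed to a single crawler task, which keeps the crawler script itself identical across all tasks.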
The output of each job was stored in .csv format. To do this, I used print to write the rows into the stdout files. The program’s diagnostics were written using the logging module and went to the stderr files. I used a Jupyter notebook to load the 112 CSV files, do some sanity checks, perform some post-processing and cleaning, and load the results to a Google Drive folder.
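The stdout/stderr split can be sketched as follows. Note one swap: this version writes rows with the standard `csv` module pointed at `sys.stdout` rather than bare `print` calls, so that quoting of commas inside fields is handled automatically. The `emit_rows` helper, the column names, and the sample rows are made up for illustration.

```python
import csv
import logging
import sys

# Diagnostics go to stderr so they never mix with the CSV data on stdout.
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger("crawler")


def emit_rows(rows, fieldnames, out=sys.stdout):
    """Write rows as CSV to `out` (stdout by default) and log a summary."""
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    log.info("wrote %d rows", len(rows))


# Made-up sample rows, not real crawl results.
rows = [
    {"food": "sample food A", "calories": 100},
    {"food": "sample food B", "calories": 250},
]
emit_rows(rows, ["food", "calories"])
```

Because the data stream stays clean, each task's stdout file can be loaded directly as a CSV during post-processing.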
Overall, I was able to collect roughly 650k nutritional data points from 28 sites with an AWS cost of less than $30.
Running a complex web crawl job on Dis.co was quick and easy. We got the results we were looking for without having to move data from our cloud or tie up local machines. But there’s always room for improvement; here are a few things I suggest you try when doing this on your own:
- Make jobs re-entrant. It’s very sad when a job runs for hours, encounters a problem, and doesn’t save any of its results. A common Dis.co pattern should be 1) attempt to load a checkpoint at initialization and resume from there and 2) periodically save this data structure to S3 or GCS. The bucket object location could be specified by hashing the contents of the script file and input file.
- Automate more of the boring stuff. I used a combination of the Dis.co Python module from Jupyter notebooks and the web UI. I found myself doing more manual data processing in the web UI than I would prefer. Next time, I would commit to managing all jobs and handling all tasks via Dis.co’s Python SDK.
- Use scrapy. I considered using scrapy when I started the crawling task, but it seemed a bit like overkill. After rolling my own web crawling code, I found myself solving problems that are already handled out of the box in scrapy. So I’d only suggest writing your own crawler for very specialized use cases or in situations where you need to maintain control of the service from end to end.
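The re-entrant pattern from the first suggestion could look roughly like this. It is a sketch under several assumptions: the checkpoint is written to a local temporary directory as a stand-in for an S3 or GCS bucket, the checkpoint is a simple JSON list of finished URLs, and the helper names (`checkpoint_key`, `load_checkpoint`, `save_checkpoint`, `crawl`) are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def checkpoint_key(script_bytes, input_bytes):
    """Derive the checkpoint location from the script and input contents."""
    digest = hashlib.sha256()
    digest.update(script_bytes)
    digest.update(input_bytes)
    return f"checkpoints/{digest.hexdigest()}.json"


def load_checkpoint(path):
    """Return saved state, or an empty state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"done": []}


def save_checkpoint(path, state):
    """Write to a temp file first, then rename, so a crash cannot leave a half-written file."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def crawl(urls, path, process, save_every=100):
    """Process each URL once, resuming from the checkpoint at `path`."""
    state = load_checkpoint(path)
    done = set(state["done"])
    handled = 0
    for url in urls:
        if url in done:
            continue  # finished in an earlier run
        process(url)
        done.add(url)
        state["done"].append(url)
        handled += 1
        if handled % save_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state


# Local directory standing in for a bucket; script and input bytes are placeholders.
root = tempfile.mkdtemp()
path = os.path.join(root, checkpoint_key(b"print('crawl')", b"url-list"))
seen = []
crawl(["u1", "u2"], path, seen.append)        # first run handles u1 and u2
crawl(["u1", "u2", "u3"], path, seen.append)  # rerun skips them and handles only u3
print(seen)
```

Because the key is a hash of both the script and its input, changing either one starts from a fresh checkpoint, while an unchanged rerun picks up where the last run stopped.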
Outside of these few things I would modify on my next scraping project, Dis.co’s serverless cloud computing platform was a great way to speed up test data collection. The platform is also useful in data processing, AI/ML model training, video composition, 3D rendering, and any other application with hefty compute demands. If you are interested in learning more about Dis.co, I suggest checking out the product demo.
Hands-down, Dis.co is the easiest and most efficient way to scale your compute. Speak with an expert to learn how Dis.co enables easy parallelization of heavy-compute jobs.