When you hear the phrase “batch processing,” what comes to mind? One of your first thoughts could be, “Oh, that’s when a bunch of data gets processed after hours when there are more compute resources available.” Others will think of credit card statement processing – when charges for a specific period are accumulated and combined into a single credit card billing statement. If you are a scientist or work in research, hours, days, or weeks of video image processing or genome sequencing may be top of mind. And, just like any other business process or workflow, batch processing comes with challenges.
The term batch processing comes from the early days of computing when users entered programs on punch cards. A system operator would feed the programmed cards into a computer in batches.
In this blog post, we’ll cover simple and complex types of batch processing jobs. Then we’ll address how to improve compute-intensive processes by using modern cloud computing models and approaches.
What is Batch Processing?
Batch processing is a method of running high-volume, repetitive data jobs with only a minimal amount of human intervention – if any at all.
Batch jobs can be queued during business hours and then triggered to start on their own as the necessary resources become available. Processing data when more compute resources are available – like after business hours – is more efficient. All input data is preselected through scripts and command-line parameters, so once a job is triggered it runs without further input.
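As a rough illustration of a job whose input is preselected through command-line parameters, here is a minimal Python sketch. The flag names (`--input`, `--min-amount`), the `amount` column, and the summing logic are all assumptions made for the example, not part of any particular product.

```python
import argparse
import csv

def build_parser():
    # Hypothetical flags: the scheduler passes these when it triggers the job.
    parser = argparse.ArgumentParser(description="Illustrative nightly batch job")
    parser.add_argument("--input", required=True, help="path to the input CSV file")
    parser.add_argument("--min-amount", type=float, default=0.0,
                        help="skip records below this amount")
    return parser

def run_batch(rows, min_amount):
    """Process every preselected record; no prompts, no human interaction."""
    processed, total = 0, 0.0
    for row in rows:
        amount = float(row["amount"])
        if amount >= min_amount:
            processed += 1
            total += amount
    return processed, total

def main(argv=None):
    args = build_parser().parse_args(argv)
    with open(args.input, newline="") as f:
        processed, total = run_batch(csv.DictReader(f), args.min_amount)
    print(f"processed={processed} total={total:.2f}")
```

A scheduler could then call something like `main(["--input", "charges.csv", "--min-amount", "5"])` after hours, with no one at the keyboard.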
Batch processing is particularly useful for operations that require a compute resource or peripheral device to be available for an extended period. Running such a job after hours allows it to continue until it completes or an error occurs.
Typical batch processing jobs include credit card statement preparation, inventory replenishment, warehouse management, and payroll runs. When it comes to processing large volumes of data to extract detailed insights, batch processing is also the first choice.
Next, let’s look at other common types of batch processing jobs as well as more sophisticated forms of batch computing that require significant compute resources.
Batch Computing Use Cases
Financial Services – Beyond the typical batch computing process of generating credit card statements, companies automate other billing processes and trigger them to run based on date, time, or other factors, such as the volume of transactions.
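To make the idea of time-based and volume-based triggers concrete, here is a small Python sketch. The 10 p.m. cutoff and the 10,000-transaction threshold are hypothetical values chosen for illustration.

```python
from datetime import datetime, time

def should_run_batch(now, pending_transactions,
                     run_after=time(22, 0), volume_threshold=10_000):
    """Return True when either the time trigger or the volume trigger fires.

    run_after and volume_threshold are illustrative defaults, not real policy.
    """
    time_trigger = now.time() >= run_after
    volume_trigger = pending_transactions >= volume_threshold
    return time_trigger or volume_trigger
```

With these defaults, a billing run would start at 11 p.m. even with few pending transactions, and would start at 9 a.m. only if the backlog had already reached the volume threshold.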
Consider another example we can all relate to: returning a product we paid for with a credit card. Depending on the processing frequency, it may take from one to several days before the credit for the return shows up in our credit or debit card transaction history.
Retail – The processing of online purchases initiates a fulfillment process that includes pulling products, boxing for shipping, selecting a shipping provider, and producing shipping labels specific to the provider and delivery speed. Orders are batched throughout the day for each step in the process, according to set parameters.
Inventory management is critical for warehouse fulfillment centers for Amazon, big-box retailers like home improvement stores Lowe’s and Home Depot, or membership warehouse clubs such as Costco and Sam’s Club. Daily, weekly, and monthly batch runs help identify product replenishment needs for specific locations and the overall business.
Healthcare – A common practice is to collect a certain number of routine lab tests and then process them together to determine the results. Collecting lab tests or samples for several hours, or even an entire day, and running them as one batch is more efficient than processing each one as it arrives.
Compute-Intensive Batch Processing
Many types of batch processing require significant compute power to reduce the time needed to complete the job. One solution is to split the processing into multiple batches that run simultaneously across many machines. Another is to run certain types of jobs, such as neural network training, video rendering, and crypto mining, on programmable graphics processing units (GPUs). These highly parallel jobs typically run faster on a GPU than on a CPU.
Astronomy – Thousands upon thousands of images taken daily to study the solar system require running algorithms to analyze the photos in detail. This type of image processing can take days or weeks, depending on the volume of data.
Biotech – Genomic sequencing determines the order of the DNA bases in a genome, which is the complete set of DNA, including all genes. Analyzing the resulting genomic data can help find out what kind of diseases are likely to get passed to the next generation. This type of study requires matching the patterns of genomic data from thousands of patients.
Media and Entertainment – Image and video rendering for movies and animations converts an ‘original sketch’ into a ‘high-resolution image,’ for example when lip-syncing animated characters. These compute-intensive jobs may take weeks or months to complete. Another example is video encoding, which converts original videos to the format YouTube requires.
Big Data Processing – Financial services companies need to process billions of financial transactions to perform financial modeling. Data scientists use batch processing to prepare, clean, and structure the data they work with, a process known as Extract, Transform, and Load (ETL).
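The three ETL steps can be sketched in a few lines of Python. The sample records, the column names, and the in-memory dictionary standing in for a data warehouse are all invented for illustration.

```python
import csv
import io

# Hypothetical raw export with stray whitespace, a missing amount,
# and inconsistent currency codes.
RAW = """id,amount,currency
1, 19.99 ,usd
2,,usd
3,5.00,USD
"""

def extract(text):
    """Extract: read the raw records."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop incomplete records and normalize types and codes."""
    clean = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue  # skip records with a missing amount
        clean.append({
            "id": int(row["id"]),
            "amount": float(amount),
            "currency": row["currency"].strip().upper(),
        })
    return clean

def load(rows, store):
    """Load: write the clean records to the target store, keyed by id."""
    for row in rows:
        store[row["id"]] = row
    return store

warehouse = load(transform(extract(RAW)), {})
```

In this sketch, record 2 is dropped during the transform step, and the remaining records arrive in the store with numeric amounts and upper-case currency codes.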
Predictive Maintenance – Another form of batch processing involves working with large amounts of data to extract real-time analytics. These insights reveal issues that call for quick, complex decisions about which actions to take immediately and how to prioritize them.
Here’s one example. Algorithms process usage data extracted from sensors on equipment to produce the analytics. They compare the actual wear and tear of a piece of equipment with historical data. These analytics translate into predictions for when workers must perform maintenance or replace parts.
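One very simple way to turn such a comparison into a prediction is linear extrapolation, sketched below in Python. This is an assumption made for illustration, not a description of any production algorithm; the reading format `(operating_hours, wear_percent)` and the function names are hypothetical.

```python
from statistics import mean

def wear_rate(readings):
    """Average wear added per operating hour between the first and last reading."""
    (h0, w0), (h1, w1) = readings[0], readings[-1]
    return (w1 - w0) / (h1 - h0)

def hours_until_maintenance(readings, historical_failure_wear):
    """Estimate operating hours left before wear reaches the level at which
    similar equipment historically needed service (assumes linear wear)."""
    threshold = mean(historical_failure_wear)
    _, current_wear = readings[-1]
    rate = wear_rate(readings)
    if rate <= 0:
        return None  # no measurable wear, so no prediction
    return max(threshold - current_wear, 0.0) / rate
```

For example, equipment that went from 10% to 20% wear over 100 operating hours, compared against historical service points of 78% and 82%, would be estimated to have roughly 600 operating hours left.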
These are just a few of the many examples of batch processing. Next, we’ll take a look at the benefits of batch processing.
Benefits of Batch Compute
Now that we’ve covered specific types of batch processing, let’s review the benefits:
- Batch jobs run on most servers anywhere, anytime. They don’t require specialized hardware or system support to input data.
- Little to no user interaction is required. Automation platforms schedule and initialize batch jobs to run on idle resources, which saves money.
While these benefits are valuable, especially to large organizations, challenges exist.
Challenges that Exist
Resource Requirements – As mentioned earlier, many types of batch processing are constrained by the amount of compute power and time it takes to complete a batch job. These constraints increase costs and delay the completion of critical work.
Bottlenecks – Batch jobs get scheduled according to their expected duration. When there are problems with data, errors, or some other interruption, those jobs exceed the scheduled length of time. Jobs that do not finish running “after hours” prevent or delay access to these resources by other users during regular business hours. These delays hinder efficiency and increase the costs of doing business.
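The arithmetic behind such an overrun is simple. The Python sketch below assumes jobs run back to back in a single after-hours window; the start time, window end, and job durations are made-up numbers.

```python
from datetime import datetime, timedelta

def finish_time(start, job_hours):
    """Jobs run one after another; return when the last one finishes."""
    return start + timedelta(hours=sum(job_hours))

def overruns_window(start, job_hours, window_end):
    """True if the batch is still running when business hours resume."""
    return finish_time(start, job_hours) > window_end

start = datetime(2019, 10, 29, 22, 0)      # hypothetical 10 p.m. start
window_end = datetime(2019, 10, 30, 6, 0)  # hypothetical 6 a.m. cutoff

expected = [2.0, 3.0, 2.5]           # 7.5 hours in total
with_rerun = [2.0, 3.0, 2.5, 1.0]    # one failed job adds a 1-hour rerun
```

With the expected durations the batch finishes at 5:30 a.m., inside the window. A single one-hour rerun pushes the finish to 6:30 a.m., which is exactly the kind of spillover into business hours described above.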
Delays in Realizing the Value of Data – When it comes to the discovery of breakthroughs in pharmaceuticals or medical equipment, the extra days or weeks required for processing substantial amounts of data can be a matter of life or death. In retail, without access to real-time inventory numbers, products sell out, and companies lose sales.
Streamlining and Optimizing Batch Processing
There are several compute models – such as parallel, distributed, and serverless computing – that, when combined, provide access to substantial power and deliver faster results. First, we’ll look at these models individually. Then we’ll describe the impact of combining these three models to optimize batch processing.
Parallel Computing – This computing approach separates a job into many smaller pieces and runs them simultaneously – on a single server or multiple machines. These jobs may be computation-based problems that get divided into smaller computations. Alternatively, parallel computing may involve running many processes that do not depend on the completion of one another. Once the subprocesses complete, their results are collected and returned to the process that launched them.
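The split, process, and collect pattern can be sketched with Python's standard `concurrent.futures` module. The chunking scheme and the sum-of-squares workload are placeholders for a real computation. The sketch defaults to a thread pool for portability; for CPU-bound work in CPython, passing `ProcessPoolExecutor` as `executor_cls` is the usual way to get true parallelism.

```python
from concurrent.futures import ThreadPoolExecutor

def split(items, parts):
    """Split items into at most `parts` contiguous chunks of near-equal size."""
    if not items:
        return []
    size = -(-len(items) // parts)  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    """Stand-in for the real per-chunk computation."""
    return sum(x * x for x in chunk)

def run_parallel(items, workers=4, executor_cls=ThreadPoolExecutor):
    """Run the chunks concurrently, then combine the partial results."""
    chunks = split(items, workers)
    with executor_cls(max_workers=workers) as pool:
        partials = list(pool.map(process_chunk, chunks))
    return sum(partials)
```

Because each chunk is independent of the others, the partial results can be computed in any order and combined at the end.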
Distributed Computing – This computing model disperses batch processing jobs across multiple resources that can include on-premises or private cloud, public, hybrid, or multicloud solutions. Components of the software system, such as applications and data, are shared among multiple computers to improve efficiency and performance.
Serverless Computing – This model of cloud computing transfers the responsibility for allocating and provisioning servers from IT teams or developers to a cloud services provider. This approach takes maximum advantage of available resources and removes the need to maintain dedicated servers.
What if there were a way to combine these three models of computing for maximum optimization? There is. The Dis.co solution parallelizes batch processing by seamlessly distributing workloads across any available CPU or GPU.
The Dis.co Scheduling Agent distributes the batch processing jobs across available resources to use the optimal resource configuration and compute models. Available resources may be on-premises, a private cloud, or a public, hybrid, or multicloud environment. Dis.co also has a cloud computing platform that taps into underutilized personal devices. Think “Batch-Computing-as-a-Service.” This approach accelerates time to results, reduces costs, and improves the customer experience.
In each of the following examples, how and where jobs run is transparent to the client:
- A media-production company now uses Dis.co to speed up the rendering of videos, improving customer experience. They split the videos into many smaller files and submit the files with a line of code. Dis.co does the rest.
- Researchers at an academic institution are searching for new planets by continuously recording video of the solar system. It previously took weeks to analyze the high volume of images. Using Dis.co reduced the image processing time from weeks to hours.
- A biotech company was having difficulty training AI models across public clouds. Dis.co assembled a serverless HPC cluster to accelerate model training across multiple clouds.
Batch processing has evolved from its early days of routine and repetitive tasks. It is now a vital form of computing that has many uses across industries and sectors, the Internet of Things, and research. New computing models accelerate the realization of value, especially when it comes to areas such as healthcare research and predictive maintenance.
If your organization is searching for a solution to run and complete batch processing jobs faster, contact us for a demo and the opportunity to test-drive your batch processing job with a free account.