Batch Processing vs. Stream Processing

If you've read DevRel Katy Farmer's stellar post, Kapacitor and Continuous Queries: How to Decide Which Tool You Need, then you know that when our community talks, we listen. So, in alignment with that view and in honor of our very own Kapacitor Koala, let's tackle another common community issue that has come to our attention: when should we use batch processing versus stream processing in our Kapacitor tasks?

[Image: Our famous Kapacitor Koala.]

Now, if you have no idea what Kapacitor is, I recommend doing a little light reading on it here and here to get up to speed. Kapacitor, the final component of our TICK Stack, offers capabilities such as data transformation, downsampling, and alerting. It uses its own DSL, TICKscript, in which you define tasks that Kapacitor then executes on your data. Essentially, it processes your data for you.

Here's where it gets tricky, though: how do you choose whether to process your data as a batch task or a stream task?

Batch Tasks

Let's discuss batch tasks first. A batch is a collection of data points that have been grouped together within a specific time interval. Another term often used for this is a window of data. When running a batch task, Kapacitor queries InfluxDB periodically, thereby avoiding having to buffer much of your data in RAM. There are several cases where batch processing is the way to go:

Stream Tasks

On the other side, we have stream tasks. Stream tasks create subscriptions to InfluxDB so that every data point written to InfluxDB is also written to Kapacitor. Note, though, that stream tasks use a high percentage of available memory, so memory availability is a key factor to consider. Here's where stream processing is the better fit:

Another advantage some might see in stream tasks is ease of use: you define the task using only Kapacitor's TICKscript, without having to write queries for InfluxDB. If you are comfortable writing both, however, it's probably in your best interest to go with batch processing most of the time, since it uses a lot less memory. An additional factor to consider is that Kapacitor is not limited to working with InfluxDB. For example, if you want to send data straight from Telegraf to Kapacitor, that has to be done as a stream task.
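To make the contrast concrete, here is a minimal sketch of the same alert written both ways. The database and retention policy (telegraf.autogen), the cpu measurement, the usage_idle field, the threshold, and the log paths are all assumptions chosen for illustration, not values from any particular setup. The batch version carries its own InfluxQL query, which Kapacitor runs against InfluxDB on a schedule:

```
// Batch task (illustrative sketch): every 5 minutes, query the last
// 5 minutes of data from InfluxDB and alert on the 1-minute means.
// Database, measurement, field, threshold, and log path are assumed.
batch
    |query('SELECT mean("usage_idle") FROM "telegraf"."autogen"."cpu"')
        .period(5m)
        .every(5m)
        .groupBy(time(1m))
    |alert()
        .crit(lambda: "mean" < 10)
        .log('/tmp/batch_alerts.log')
```

The stream version needs no query. It names a measurement and evaluates every point as it arrives through the subscription:

```
// Stream task (illustrative sketch): check each incoming point
// against the same assumed threshold as it is written.
stream
    |from()
        .measurement('cpu')
    |alert()
        .crit(lambda: "usage_idle" < 10)
        .log('/tmp/stream_alerts.log')
```

With the Kapacitor CLI, the task type is typically declared when the task is defined, along the lines of `kapacitor define cpu_alert -type batch -tick cpu_alert.tick -dbrp telegraf.autogen` (the task and file names here are hypothetical).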

Key Takeaways

When our community talks, we listen. We'd love to hear how your batch and stream tasks are going! Send us your comments, questions, issues, and blog ideas on our community site and feel free to reach out to us on Twitter: @InfluxDB or @mschae16.

