What’s the big deal with streaming ETL?
Streaming ETL (also known as stream processing or real-time ETL) has been generating a lot of buzz recently. Many large-scale businesses are investing heavily in building and managing their own platforms, with the likes of Netflix, Uber, Grab, DoorDash, Shopify, and Airbnb leading the pack. So what is the big deal? Why are advanced data teams no longer satisfied with just scheduling data models? As seems to be the custom these days, let’s start with a screenshot from ChatGPT.
Why is Streaming ETL popular?
Conventional data pipelines are geared toward reporting and designed with dashboards and humans as consumers, yet data systems tend to end up feeding production systems. After all, the purpose of data teams is to make organizations more efficient and to uncover opportunities. But if a dashboard is truly worth looking at every day, aren't the actions it drives worth automating? This may sound rhetorical, but the underlying mismatch (designing data systems for reporting even though the eventual action is driven by a system) is the very reason Streaming ETL pipelines have surged. Streaming ETL has bred a new generation of data teams that are far more agile and versatile, and whose work is designed from the start to feed operations efficiently.
One of the most common reactions to the mention of Streaming ETL and real-time processing is “we don’t need faster dashboards”. It has become a running joke among real-time adopters, and a good sign that the broader data community still needs some education on what Streaming ETL is really for. The first thing to understand is that going real-time isn’t about dashboards; it’s about operations. Faster dashboards are a by-product of an efficient data layer, whose true objective is operational excellence. In a data pipeline, the slowest component drives the SLA, and by the time you realize a consumer has operation-grade requirements, it's too late to retrofit them. With Streaming ETL, whoever the consumer is, the SLA already allows them to take data into production systems.
What is Streaming ETL?
In the past, when we wanted to analyze data, we would typically collect and store it all in a database or data lake (assuming there's any difference these days) and then run batch-processing logic at regular intervals to identify patterns and trends. This worked well for many applications, but it had its limitations. For starters, processing had to be triggered, which made it difficult to gain insights and derive actions quickly: we would only know something of interest had happened after we had chosen to look for it. Additionally, it required a significant amount of storage and processing resources, which could be expensive to maintain and scaled very poorly as we tried to shorten trigger cycles.
Streaming ETL addresses that. Instead of collecting and storing data in a data store and then processing it in batches, Streaming ETL allows us to incrementally process data as it is being generated, in real time. This means that we can instantly identify patterns and trends, derive actions, and inform consumers without interrupting the motion of data. These capabilities bring about a number of important benefits.
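To make the contrast concrete, here is a minimal Python sketch. It is not taken from any particular streaming framework: the event shape, the names `batch_totals`, `IncrementalTotals`, and `compare`, and the idea of counting "rows touched" as a stand-in for processing cost are all illustrative assumptions. It compares rerunning a batch aggregation on every trigger with updating a running aggregate once per event.

```python
from collections import defaultdict

# Hypothetical event stream: (customer_id, amount) pairs, in arrival order.
EVENTS = [("a", 10), ("b", 5), ("a", 7), ("c", 2), ("b", 1), ("a", 3)]


def batch_totals(stored_events):
    """Batch style: rescan everything stored so far on each trigger."""
    totals = defaultdict(int)
    for customer, amount in stored_events:
        totals[customer] += amount
    return dict(totals), len(stored_events)  # result, rows touched


class IncrementalTotals:
    """Streaming style: keep state and update it once per incoming event."""

    def __init__(self):
        self.totals = defaultdict(int)
        self.rows_touched = 0

    def on_event(self, customer, amount):
        self.totals[customer] += amount
        self.rows_touched += 1
        # The fresh value is available as soon as the event is seen.
        return customer, self.totals[customer]


def compare(events, trigger_every):
    """Run both styles over the same events and count the work each one does."""
    stored, batch_rows, batch_result = [], 0, {}
    stream = IncrementalTotals()
    for i, (customer, amount) in enumerate(events, start=1):
        stored.append((customer, amount))
        stream.on_event(customer, amount)
        if i % trigger_every == 0:  # the batch job only runs on its schedule
            batch_result, touched = batch_totals(stored)
            batch_rows += touched
    return batch_result, batch_rows, dict(stream.totals), stream.rows_touched


if __name__ == "__main__":
    for every in (3, 1):
        _, b_rows, _, s_rows = compare(EVENTS, every)
        print(f"trigger every {every}: batch rows={b_rows}, stream rows={s_rows}")
```

In this toy setup, shortening the batch trigger from every three events to every event raises the rows rescanned from 9 to 21, while the incremental version touches 6 rows either way. Real batch jobs can be smarter than a full rescan, so treat this as an illustration of the trend rather than a benchmark.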
Benefits of Streaming ETL
There are a number of benefits associated with the ability to incrementally compute insights on the fly, the core ones being:
- Real-time insights: data is analyzed as it is generated rather than after it has been stored in a database, so insights can be identified and acted on in real time rather than after the fact. As a result, we can make better decisions and respond to changing conditions as they happen. This matters in today's fast-paced business environment, where the ability to analyze and act on data quickly can give companies a competitive edge and improve efficiency and productivity across many industries and applications.
- Scalability: Streaming ETL systems are designed to handle large amounts of data and can be scaled up or down as needed. This makes them well suited to applications with large or volatile data volumes, such as transactional and telemetry systems.
- Cost-effectiveness: Because data is processed incrementally as it arrives, there is less need to store it and repeatedly reprocess it in bulk. This can reduce storage and processing costs compared to traditional batch methods.
- Fault tolerance: Streaming ETL systems are typically designed to be fault-tolerant, meaning they can continue to operate even if some of their components fail. This helps keep the system available and protects against data loss.
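The fault-tolerance point can be illustrated with a small, self-contained Python sketch. It assumes a replayable source (here just a list) and a checkpoint that stores the read offset together with the running state. The names `Checkpoint`, `SimulatedCrash`, and `process`, and the `crash_at` parameter, are hypothetical and only serve the illustration; real systems persist checkpoints to durable storage and handle far more failure modes.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """State saved together: the next offset to read and the running total."""
    offset: int = 0
    total: int = 0


class SimulatedCrash(Exception):
    """Stands in for a worker failure in this sketch."""


def process(source, checkpoint, checkpoint_every=2, crash_at=None):
    """Consume `source` from the checkpointed offset, saving progress periodically.

    Returns the latest saved checkpoint. Work done after the last checkpoint
    is discarded on a crash and replayed from the source on restart.
    """
    offset, total = checkpoint.offset, checkpoint.total
    saved = checkpoint
    try:
        while offset < len(source):
            if crash_at is not None and offset == crash_at:
                raise SimulatedCrash(f"worker died at offset {offset}")
            total += source[offset]
            offset += 1
            # Save offset and state together so they can never disagree.
            if offset % checkpoint_every == 0 or offset == len(source):
                saved = Checkpoint(offset=offset, total=total)
    except SimulatedCrash:
        pass  # in-memory progress since the last checkpoint is lost
    return saved


if __name__ == "__main__":
    source = [1, 2, 3, 4, 5]
    after_crash = process(source, Checkpoint(), crash_at=3)
    print("recovered from:", after_crash)
    final = process(source, after_crash)  # restart from the checkpoint
    print("final:", final)
```

Because the offset and the total are saved as one unit, the restarted run replays only the events after the last checkpoint and still ends with the same total as an uninterrupted run. This relies on the source being replayable, which is an assumption of the sketch.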
Conclusion
Streaming ETL is taking the data processing world by storm, and for good reason: it's faster, copes better with high speed and volume, doesn't require complex schedule management, and integrates easily with existing systems. At the organizational level, it helps improve efficiency, productivity, and decision-making. If you're not familiar with Streaming ETL, it's worth experimenting with at the very least, as it could have a big impact on your team and organization.
Want to try a modern Streaming ETL solution? Reach out at ben@popsink.com or visit Popsink.