A Brief History of Apache Storm

In this series of blog posts, we will provide an in-depth look at select features introduced with the release of Apache Storm (Storm) 1.0. To kick off the series, we’ll take a look at how Storm has evolved over the years, from its beginnings as an open source project up to the 1.0 milestone release.

“The Hadoop of Real-Time”

Storm was originally created by Nathan Marz while he was at BackType (later acquired by Twitter) working on analytics products based on historical and real-time analysis of the Twitter firehose. Nathan envisioned Storm as a replacement for that real-time component, which was built on a cumbersome and brittle system of distributed queues and workers. Storm introduced the concept of the “stream” as a distributed abstraction for data in motion, as well as a fault tolerance and reliability model that was difficult, if not impossible, to achieve with a traditional queues-and-workers architecture.

Nathan open-sourced Storm on GitHub on September 19, 2011, during his talk at Strange Loop, and it quickly became the most watched JVM project on GitHub. Production deployments soon followed, and the Storm development community rapidly expanded.

At the time Storm was introduced, Big Data analytics largely meant batch processing with MapReduce on Apache Hadoop, or with one of the higher-level abstractions such as Apache Pig and Cascading. The introduction of Storm helped drive a change in the way people thought about large-scale analytics, spurring the rise of stream processing and real-time analytics.

Early versions of Storm introduced the now-familiar stream abstraction and the corresponding Spout/Bolt/Topology API, which allowed developers to easily reason about streaming computations. Another feature Storm introduced, and one that to this day remains unique to Storm, is the concept of Distributed Remote Procedure Calls (DRPC), in which the inherent parallelism and scalability of Storm can be leveraged in a synchronous, request-response paradigm.
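To give a flavor of that API, here is a minimal word-count sketch. It assumes the org.apache.storm package names used from the 1.0 release onward (earlier versions lived under backtype.storm) and uses the test spout and bolt bundled with Storm purely for illustration; a real topology would supply its own spout and bolt implementations.

    import org.apache.storm.Config;
    import org.apache.storm.LocalCluster;
    import org.apache.storm.testing.TestWordCounter;
    import org.apache.storm.testing.TestWordSpout;
    import org.apache.storm.topology.TopologyBuilder;
    import org.apache.storm.tuple.Fields;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            TopologyBuilder builder = new TopologyBuilder();

            // A spout is a source of streams; this test spout emits random words
            // on a field named "word".
            builder.setSpout("words", new TestWordSpout(), 2);

            // A bolt consumes one or more streams and may emit new ones.
            // fieldsGrouping routes tuples with the same "word" value to the same
            // bolt task, so per-word counts stay consistent.
            builder.setBolt("counter", new TestWordCounter(), 2)
                   .fieldsGrouping("words", new Fields("word"));

            // Run the topology in-process for a few seconds, then shut down.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Thread.sleep(10000);
            cluster.shutdown();
        }
    }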

The Storm 0.8.x line of releases introduced the Trident API, which added support for exactly-once processing semantics, micro-batching for increased throughput, stateful processing, and a high-level API for joins, aggregations, groupings, functions, and filters. Other improvements at the time included pluggable schedulers, the introduction of the LMAX Disruptor for higher throughput, tick tuples, and improvements to the Storm UI.
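As a rough illustration of the Trident style, the canonical word-count topology looks something like the sketch below (again written against the 1.0 package names; MemoryMapState is an in-memory state implementation meant for testing, so a production topology would plug in a durable state backend instead):

    import org.apache.storm.generated.StormTopology;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.BaseFunction;
    import org.apache.storm.trident.operation.TridentCollector;
    import org.apache.storm.trident.operation.builtin.Count;
    import org.apache.storm.trident.testing.FixedBatchSpout;
    import org.apache.storm.trident.testing.MemoryMapState;
    import org.apache.storm.trident.tuple.TridentTuple;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class TridentWordCount {

        // A Trident "function": splits each sentence into individual words.
        public static class Split extends BaseFunction {
            @Override
            public void execute(TridentTuple tuple, TridentCollector collector) {
                for (String word : tuple.getString(0).split(" ")) {
                    collector.emit(new Values(word));
                }
            }
        }

        public static StormTopology buildTopology() {
            // A test spout that replays small batches of sentences forever.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("sentence"), 3,
                    new Values("the cow jumped over the moon"),
                    new Values("four score and seven years ago"));
            spout.setCycle(true);

            TridentTopology topology = new TridentTopology();
            topology.newStream("sentences", spout)
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    // persistentAggregate maintains per-word counts in state with
                    // Trident's exactly-once update semantics.
                    .persistentAggregate(new MemoryMapState.Factory(), new Count(),
                            new Fields("count"));
            return topology.build();
        }
    }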

Storm Moves to Apache

With encouragement from Andy Feng at Yahoo!, Nathan decided to propose moving Storm to Apache, and the project officially entered the Apache Incubator on September 18, 2013. This move marked the beginning of a fundamental shift in the Storm community away from a model where a single individual leads a project, to the consensus-driven Apache development model. The move to Apache ensured that the project would be governed by a sustainable community, and that no single individual would represent a bottleneck in terms of decision making.

During Storm’s time in the Apache Incubator, the 0.9.x line of releases was introduced. In 0.9.x, Storm’s underlying transport layer, previously based on ZeroMQ (0mq), was replaced with an implementation based on Netty. Not only was the new Netty-based transport almost twice as fast as the previous implementation, but the fact that Netty is a pure Java framework freed users from the requirement of difficult-to-install, platform-specific binaries. Finally, the Netty transport set the stage for authorization and authentication between worker processes.
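The transport layer remained pluggable and was selected via a class name in storm.yaml; a minimal sketch of the relevant settings looks like this (the buffer and thread values below are illustrative, not tuning recommendations):

    # storm.yaml -- selecting the Netty transport (1.0 class name shown;
    # pre-1.0 releases used backtype.storm.messaging.netty.Context)
    storm.messaging.transport: "org.apache.storm.messaging.netty.Context"
    storm.messaging.netty.buffer_size: 5242880
    storm.messaging.netty.server_worker_threads: 1
    storm.messaging.netty.client_worker_threads: 1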

The 0.9.x line of releases also introduced Apache Kafka integration as a first-class component in the Apache Storm distribution. Prior to this point, Kafka integration was a separate project that had been forked many times, and it was difficult for users to understand which versions were compatible with which versions of Storm and Kafka. Bringing Kafka integration into the Apache distribution ensured that compatibility was maintained with each Storm release. Similarly, the Storm community added support for HDFS and Apache HBase integration as well.

Other notable improvements from this time included performance improvements to the Netty transport, pluggable serialization for multi-lang components, topology visualization, a logviewer, and support for Microsoft Windows deployments.

Apache Storm became a top-level Apache project on September 17, 2014. In the time since first entering the Apache Incubator with a group of 7 initial committers, the Apache Storm PMC has grown to 28 members, with contributions from 342 individuals.

Enterprise Readiness

Before graduating from the Apache Incubator, the Storm community was already working on the next major iteration of the platform: Version 0.10, which focused primarily on security and enterprise readiness. Much like the early days of Apache Hadoop, Apache Storm originally evolved in an environment where security was not a high-priority concern. Rather, it was assumed that Storm would be deployed to environments suitably cordoned off from security threats. While a large number of users were comfortable setting up their own security measures for Storm, this proved a hindrance to broader adoption among larger enterprises where security policies prohibited deployment without specific safeguards.

The implementation of enterprise-grade security in Apache Storm was a momentous effort that involved active collaboration between Yahoo!, Hortonworks, Symantec, and the broader Apache Storm community. Some of the highlights of Storm’s security features include:

Another important feature of Storm 0.10 was the introduction of Flux, a framework and set of utilities that make defining and deploying Storm topologies less developer-intensive. Prior to Flux, a common complaint from users was that the definition of a topology DAG was often tied up in Java code, and any change required recompilation and repackaging of the topology. Flux addressed that problem by providing a YAML DSL for defining topologies in a simple text file, decoupling the DAG definition from Java code.
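As a rough sketch, a Flux topology definition looks something like the following (the spout and bolt class names here are hypothetical placeholders):

    # topology.yaml -- a minimal Flux definition; component classes are placeholders
    name: "word-count-topology"

    config:
      topology.workers: 1

    spouts:
      - id: "sentence-spout"
        className: "com.example.SentenceSpout"   # hypothetical spout implementation
        parallelism: 1

    bolts:
      - id: "count-bolt"
        className: "com.example.WordCountBolt"   # hypothetical bolt implementation
        parallelism: 2

    streams:
      - name: "sentence-spout --> count-bolt"
        from: "sentence-spout"
        to: "count-bolt"
        grouping:
          type: FIELDS
          args: ["word"]

A definition like this can then be run without recompiling any Java code, for example with the Flux runner: storm jar mytopology.jar org.apache.storm.flux.Flux --local topology.yaml (the jar and file names here are placeholders).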

Some of the key features of Flux include:

Development of the 0.10 line of releases also saw a proliferation of additional integration components, including JDBC/RDBMS integration, streaming ingest into Apache Hive, Microsoft Azure Event Hubs integration, and Redis support. Other important features included rolling upgrade support, logging improvements, and an implementation of partial key groupings.

1.0 Milestone

On April 12, 2016, the Apache Storm community announced the release of Apache Storm 1.0, representing yet another major milestone in the evolution of the project. Version 1.0 includes a tremendous number of new features as well as usability, management, and performance improvements.

In the coming weeks we will continue this blog series with more in-depth articles covering the important new features included in 1.0, including:

For a sneak preview of these features, please see the Apache Storm 1.0 release announcement and stay tuned for more in-depth detail.
