Real-time and streaming analytics were all the rage at Strata + Hadoop World 2017. Since we previously reviewed other streaming solutions, let’s now dig into Amazon Kinesis. Amazon Kinesis is a platform for streaming data on the Amazon Web Services (AWS) cloud. It simplifies loading and querying streaming data for real-time analytics.
Introducing Amazon Kinesis
Amazon Kinesis is a fully managed service for real-time processing of streaming data at massive scale, which makes it a good fit for Internet of Things (IoT) use cases. It can collect and process hundreds of terabytes of data per hour from hundreds of thousands of sources, so you can write applications that process information in real time from sources such as website clickstreams, Raspberry Pi gadgets and other devices, social media, operational logs, metering data and more.
With Amazon Kinesis, you can build real-time dashboards, capture exceptions, execute algorithms, and generate alerts. With point-and-click menus, you can ingest data, query it and then send output to a variety of destinations including but not limited to Amazon S3, Amazon EMR, Amazon DynamoDB, or Amazon Redshift.
There are three areas of Amazon Kinesis that can be used together.
- Streams
- Firehose
- Stream Analytics
Streams is used for custom processing of each incoming record, with sub-second processing latency and a choice of stream processing frameworks. Firehose suits use cases that can tolerate data latency of 60 seconds or more. Stream Analytics is a service that lets you analyze streaming data with standard SQL for time series analytics, real-time dashboards, or real-time metrics in monitoring apps. In my opinion, Stream Analytics is the really fun part of Kinesis. That’s where you can see the data flowing through and query it live.
To get started with Kinesis, you can sign in to the Kinesis Analytics console and create a new stream processing application. Alternatively, you could use the AWS CLI or the AWS SDKs.
Amazon Kinesis automatically manages the big data infrastructure, storage, networking, and configuration needed to collect and process streaming data. Ingested data is organized into Streams. You do not need to provision, deploy, or maintain hardware with Amazon Kinesis. It also synchronously replicates stream data across facilities within an AWS region for high availability and data durability.
Configuring a stream is quite easy. You specify how much throughput capacity you want in terms of shards. Shards are a bit like partitions in databases. Each Kinesis shard can ingest up to 1 megabyte of data per second at up to 1,000 records per second. Amazon Kinesis apps can read data from each shard at up to 2 megabytes per second. Notably, if you need to increase or decrease shard capacity, you can do that while your stream keeps on flowing. It is pretty slick.
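To make the shard arithmetic concrete, here is a minimal sizing sketch. It assumes the per-shard figures quoted above (1 MB per second in, 2 MB per second out) and treats 1,000 records per second as the per-shard write ceiling. The function name `estimate_shards` and the idea that every consuming app reads the whole stream are illustrative assumptions, not part of any AWS API.

```python
import math

# Per-shard limits taken from the paragraph above (assumed ceilings).
WRITE_MB_PER_SEC = 1.0         # ingest bandwidth per shard
WRITE_RECORDS_PER_SEC = 1000   # ingest record rate per shard
READ_MB_PER_SEC = 2.0          # read bandwidth per shard


def estimate_shards(ingest_mb_per_sec, records_per_sec, consumers=1):
    """Return the smallest shard count that satisfies all three limits.

    Hypothetical helper: `consumers` is the number of apps that each
    read the full stream, so read demand grows with it.
    """
    by_write_bandwidth = math.ceil(ingest_mb_per_sec / WRITE_MB_PER_SEC)
    by_record_rate = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bandwidth = math.ceil(
        ingest_mb_per_sec * consumers / READ_MB_PER_SEC
    )
    # A stream always has at least one shard.
    return max(1, by_write_bandwidth, by_record_rate, by_read_bandwidth)
```

Whichever limit is tightest decides the answer: many tiny records are bounded by the record rate, a few large records by write bandwidth, and several consuming apps by read bandwidth.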
Amazon Kinesis Firehose is a fully managed service for delivering real-time streaming data to destinations. With Firehose, you do not need to write any delivery code or manage any resources. You merely configure your data producers to send data to Firehose. Firehose buffers and automatically delivers streaming data to the chosen destinations.
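On the producer side, sending a record can be as small as the sketch below. It assumes the boto3 Firehose client, whose `put_record` call takes `DeliveryStreamName` and a `Record` dictionary with a `Data` payload. The helper name `send_event`, the delivery stream name, and the newline-delimited JSON encoding are illustrative choices, not requirements of the service.

```python
import json


def send_event(firehose_client, delivery_stream_name, event):
    """Serialize one event as newline-terminated JSON and pass it to Firehose.

    `firehose_client` is expected to behave like boto3.client("firehose");
    it is passed in so the function can be exercised without AWS access.
    """
    # The trailing newline keeps records separable once Firehose
    # concatenates them into a delivered object (an assumed convention).
    payload = (json.dumps(event) + "\n").encode("utf-8")
    return firehose_client.put_record(
        DeliveryStreamName=delivery_stream_name,
        Record={"Data": payload},
    )


# Example usage (requires boto3, AWS credentials, and an existing
# delivery stream; "demo-delivery-stream" is a hypothetical name):
#
#   import boto3
#   client = boto3.client("firehose")
#   send_event(client, "demo-delivery-stream", {"ticker": "AMZN", "price": 901.5})
```

Passing the client in as a parameter is a design choice that keeps the serialization logic easy to test with a stand-in object.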
With Stream Analytics, you can process, compute and aggregate streaming data using standard SQL in time window queries. The service simplifies streaming time series analytics, feeding data to real-time dashboards, and updating real-time metrics for monitoring apps. The following diagram illustrates a typical application architecture.
To create a streaming analytics app, you pick and optionally partition an input streaming source, define a SQL query, and select an output destination. SQL statements process stream data input and produce output. For more information, see Limits and Configuring Application Input.
Amazon Kinesis provides a timestamp column called ROWTIME in each in-application stream (see Timestamps and the ROWTIME Column in the documentation). You can use this column in time-based windowed queries. If you want to add a lookup table, you can configure a reference data source to enrich your input data stream within Stream Analytics. You can also have multiple streams and use JOIN queries to correlate data arriving on different streams.
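Inside Kinesis the enrichment described above is written as a SQL JOIN against the reference table. The plain-Python sketch below only imitates the idea, so you can see what the join does to each record. The field names (`ticker`, `company`) and the helper `enrich` are hypothetical.

```python
def enrich(records, reference, key="ticker"):
    """Yield each record merged with its matching reference row.

    Behaves like a LEFT JOIN: records with no match pass through
    unchanged. On a field-name clash the reference value wins.
    """
    for record in records:
        extra = reference.get(record.get(key), {})
        yield {**record, **extra}


# Hypothetical lookup table keyed by ticker symbol.
reference_table = {
    "AMZN": {"company": "Amazon"},
    "MSFT": {"company": "Microsoft"},
}

stream = [
    {"ticker": "AMZN", "price": 901.5},
    {"ticker": "XYZ", "price": 12.0},   # no reference row for this one
]

enriched = list(enrich(stream, reference_table))
```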
Time-boxed SQL queries that execute continuously over streams are called windowed queries. To get result sets from continuously updating input, you typically bound each query with a window defined in terms of time or rows. The two common window types are tumbling and sliding windows.
Tumbling windows are aggregations that use a GROUP BY clause and do not overlap.
Sliding windows define a window based on a time interval or a row count, so consecutive windows might contain overlapping rows.
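In Kinesis these windows are expressed in SQL. The plain-Python sketch below only simulates the two behaviors on a small list of (timestamp in seconds, key) pairs, to show how the same events are grouped differently. The function names and the choice of a half-open interval for the sliding window are assumptions made for illustration.

```python
from collections import defaultdict, deque


def tumbling_counts(events, width):
    """Count events per key in fixed, non-overlapping windows.

    Each event falls into exactly one window, identified by its start
    time. Returns {(window_start, key): count}.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // width) * width
        counts[(window_start, key)] += 1
    return dict(counts)


def sliding_counts(events, width):
    """Emit one row per event: the count of same-key events in the
    interval (ts - width, ts], so neighboring windows overlap.
    """
    recent = defaultdict(deque)   # per-key timestamps still in the window
    rows = []
    for ts, key in sorted(events):
        window = recent[key]
        window.append(ts)
        # Drop timestamps that have slid out of the window.
        while window[0] <= ts - width:
            window.popleft()
        rows.append((ts, key, len(window)))
    return rows


events = [(0, "A"), (10, "A"), (59, "A"), (60, "A"), (61, "B"), (125, "A")]
tumbling = tumbling_counts(events, 60)
sliding = sliding_counts(events, 60)
```

The tumbling version produces one result per window and key, while the sliding version produces one result per incoming event. That is why the event at second 60 starts a fresh tumbling window but still counts its recent neighbors in the sliding one.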
One of the really nice Stream Analytics features is the library of available window query templates. Notice the green commented code, which is helpful for novices and experts alike.
Streaming analytics platforms process data continuously before data is stored. In the big data and IoT world where many gadgets generate streaming data all the time with bursts to millions of events per hour, cloud stream analytics offerings can be wonderful assets for building real-time and near real-time dashboards and monitoring apps.
In this article, we stepped through the basics of Amazon Kinesis. To further explore and evaluate Amazon Kinesis, check out the online documentation and the available stock pricing demo, or join me in my next O’Reilly Amazon analytics class on June 15, 2017. I have included an entire section on cloud IoT analytics to walk through Amazon Athena, Kinesis, QuickSight and more.