Amazon Redshift is a petabyte-scale cloud data warehouse service that was first launched by the industry leading cloud provider in late 2012. Today in a growing market of cloud data warehouse entrants, Amazon Redshift faces strong competition from Snowflake, Azure Data Warehouse, IBM dashDB, Oracle, and other players. The following is a quick look at Amazon Redshift, how to get started with it and how to visualize the sample data warehouse with a self-service BI tool.
Amazon Web Services (AWS)
As I continue working on Project Tropical Garden 2.0, advising groups on industry trends, learning new skills and searching for my next great challenge, I am seeing increased demand for Amazon cloud and big data analytics skills. Having spent time at Microsoft, I am already familiar with Azure’s offerings. Now it is time to play with Amazon’s sea of web services. Since I worked with Amazon Redshift in the past, it seemed like a perfect place to start.
Amazon is the #1 cloud vendor in the market. Recent findings from Synergy Research Group estimate that Amazon Web Services (AWS) has a worldwide market share of over 31%. Despite intensifying cloud wars, high competitor cloud sales growth and extreme levels of cloud sales staffing, Amazon AWS also continues to grow rapidly.
In the 2015 Magic Quadrant for Cloud Infrastructure as a Service (IaaS), Gartner placed Amazon in the Leaders quadrant and rated AWS as having both the furthest completeness of vision and the highest ability to execute. The full report is available on Amazon’s web site.
AWS is a broad collection of global compute, storage, database, analytics and application services. If you have never evaluated or worked with Amazon before, you can get hands-on experience with it using a Free Tier Account for one year.
Amazon Redshift is located in the Database section of AWS services. It integrates nicely with other Amazon services and a wide variety of business intelligence partner offerings.
Amazon Redshift is relatively easy to learn, easy to load, and performs excellently. The technical architecture includes a combination of massively parallel processing (MPP) engines, columnar data storage, and optimized data compression encoding schemes that deliver rapid analytical queries. The cloud nature of Amazon Redshift enables quick, elastic scale, affordable data warehouse solutions that are accessible to small or large organizations. In the past, it was challenging and cost prohibitive to evaluate and purchase multi-million dollar data warehouse hardware that was almost immediately outdated. Even individual data enthusiasts can spin up truly powerful cloud data warehouses today in minutes as I did in this exercise.
To get started, I followed the tutorial provided in Amazon’s Redshift online docs.
Step 1: Set Up Prerequisites
Step 2: Create an IAM Role
Step 3: Launch a Sample Amazon Redshift Cluster
Step 4: Authorize Access to the Cluster
Step 5: Connect to the Sample Cluster
Step 6: Load Sample Data from Amazon S3
While setting up Prerequisites, I was not able to get the SQL Workbench/J query tool functional quickly. I had a much better experience installing and using the Aginity Workbench for Redshift. Aginity Workbench includes fantastic query and database development functionality along with Amazon specific functionality for Redshift.
Amazon provides a sample data warehouse and the scripts to load the data from public S3 storage. The process was fast and straight forward. It could have been made even easier if Amazon would add a menu option when you create a new Redshift cluster. Creating the Amazon Redshift sample cloud data warehouse with standard SQL scripts and a simple COPY command took several minutes from start to finish.
After I created and loaded the sample data warehouse, I began reviewing the available tools, settings, monitoring and management options. On an initial glance, there seemed to be excellent alerting throughout Amazon’s settings to help customers avoid cloud bill surprises. I did miss pause along with granular compute and storage functionality that I do appreciate in other cloud data warehouse offerings.
Connecting and Querying Amazon Redshift
To connect, query and visualize Amazon Redshift data, you might need to configure the ODBC/JDBC connection settings on your computer if your query or BI tool does not use optimized native drivers.
For my evaluation, I tested connecting Tableau 9.3 Desktop to my Amazon Redshift environment.
Once connected I chose a Live connection to evaluate hybrid cloud to on-premises query performance. I also wanted to avoid copying all of the Amazon Redshift sample data down to my local laptop. Then I added two fact tables and all of the sample data warehouse dimensions. Tableau automatically joined the related tables based on column names. From there, I started visualizing the Amazon Redshift sample data warehouse content by dragging-and-dropping dimensions and measures onto the canvas.
Much like the other cloud data warehouse solutions that I have tested, there was an initial query delay on a few of the lower level detail queries with Amazon Redshift and Tableau. Unlike my review of other cloud data sources and BI tools in hybrid BI use cases, none of the remote Amazon Redshift live queries to Tableau timed out. When dealing with cloud BI and hybrid BI scenarios, data gravity is an incredibly important concept to understand. Be sure to load test during BI tool evaluations to ensure a hybrid BI solution will function and be usable with larger data sets.
- Video Introduction: https://www.youtube.com/watch?v=UhQjSzdlO_g
- Technical Overview: http://docs.aws.amazon.com/redshift/latest/dg/c_redshift_system_overview.html
- Managing Clusters: http://docs.aws.amazon.com/redshift/latest/mgmt/managing-clusters-vpc.html
- Utilities: https://github.com/awslabs/amazon-redshift-utils
- Loading Data: http://docs.aws.amazon.com/redshift/latest/dg/t_Loading_tables_with_the_COPY_command.html
- Querying Data: http://docs.aws.amazon.com/redshift/latest/dg/c_designing-queries-best-practices.html
- Best Practices: http://docs.aws.amazon.com/redshift/latest/dg/best-practices.html
- FAQ’s: https://aws.amazon.com/redshift/faqs/
- Knowledge Center: https://aws.amazon.com/premiumsupport/knowledge-center/#Amazon_Redshift
My initial getting started with Amazon Redshift review does not cover many other critical areas for cloud data warehouse evaluation. I highly recommend looking through the online documentation and reaching out to Amazon directly for a deeper dive into available cloud data warehouse functionality. There is far more awesome detail for you to enjoy than I mentioned here.
In my testing, I found Amazon Redshift to have good cloud data warehousing capabilities but it might not have all of the best in class features available in the market today. Notably Amazon Redshift has significant processing and storage management differences than competing vendors Snowflake and Azure Data Warehouse use. One desired feature that Amazon Redshift did not provide was a pause functionality. Cloud data warehouse pausing might not be a show stopper for many groups that can use the available snapshot work-around.
The bottom line: Amazon Redshift is an affordable, industry leading, cloud data warehouse that can be easily and instantly deployed to serve small or large scale analytics workloads.
As the industry leading cloud provider, Amazon Redshift has offers a solid cloud data warehouse option that integrates nicely with other Amazon services and a wide variety of business intelligence offerings. Amazon Redshift is relatively easy to learn, easy to load, and performs excellently.