Ever since I first heard of the Amazon Athena announcement at AWS re:Invent 2016, I have wanted to dig into that solution. Amazon Athena is an interactive query service that makes it easy to analyze large-scale data directly in Amazon Simple Storage Service (S3) using standard SQL for big data analytics. You can also query it from Amazon QuickSight, third-party solutions such as open source R, Looker and more.
Today early adopters of Amazon Athena are using it for big data analytics pipeline projects along with Kinesis streaming data and other Amazon data sources.
Athena is serverless parallel query pay-per-use service. There is no infrastructure to set up or manage. It scales automatically and can handle large datasets or complex distributed queries.
Getting Started with Amazon Athena
Setting up Athena is simple. In the AWS Management Console, you assign permissions via Amazon’s Identity and Access Management (IAM) console and attach or enable the Amazon Athena policy. Then you point the Athena service at data stored in S3. From there, you can run Hive DDL to create external tables – a popular concept in cloud realm – and begin using standard SQL with a built-in query editor to run schema on read, ad-hoc queries.
No ETL is needed thus Amazon Athena expedites big data analytics.
Athena supports a wide variety of source data formats such as CSV, JSON, ORC, or Parquet. A JDBC driver is key for connectivity between Athena to a plethora of applications, data visualization and BI tools.
Notably Athena uses database catalog, table concepts and terminology but those are merely metadata definitions – Athena is not physical relational database. This virtual/logical data warehouse design pattern with metadata is becoming common in a hybrid world where data resides everywhere.
Then you merely use standard ANSI SQL syntax to query your source data via the external table metadata definition. Athena supports large joins, window functions, and arrays. Note stored procedures and user defined functions are not available at this time.
You can either store results in an S3 bucket, view them or download them as a CSV file. You should also consider compressing your results for optimal pricing. Today Snappy, Zlib, and GZIP are the supported compression formats with Amazon Athena.
You can and should also think about partitioning your data to reduce the amount of data scanned by each query, improving performance and reducing cost. Athena leverages Hive for partitioning data. You can partition your data by any key. Usually partitioning is based on a time or date field.
Amazon Athena query performance further improves if you convert your data into open source columnar formats such as Apache Parquet or ORC. You may need to take existing Amazon S3 data sources by creating a cluster in Amazon EMR and converting it using Hive. The following example using the AWS CLI shows you how to do this with a script and data stored in Amazon S3. More detail on that topic is available in the online docs.
For More Information
To further explore Amazon Athena, I recommend watching the re:Invent session below and reading the docs online. At the end of March 2017, I will be giving a live class on this topic and Amazon QuickSight with O’Reilly Media. Stay tuned or contact me for course information. I will share more details as soon as I get them.