November 2017 was another peak month for analytics industry event news. I am thrilled that it is finally over. Here are the collected highlights from IBM Watson Data Science for All, AWS re:Invent, Salesforce Dreamforce, Tableau, TIBCO, and many other top vendors. December should be a much quieter month filled with the usual trends, predictions, awards, and fun holiday surprises.
IBM Watson Data Platform Updates
Last year I covered the launch of IBM Data Science Experience and Watson Data Platform. In my hands-on review, I mentioned the user experience was exceptionally well done. Apparently I am not the only one who liked their design. Fast-forward one year and what do I see? In my HOW Design magazine this month, IBM’s design team was showcased with a Best of Show award for the IBM Cloud Carbon Design System. Kudos to them.
This year I attended IBM’s Data Science event in New York City where a series of IBM Watson Data Platform enhancements were shown. The new offerings include a Data Catalog, a data prep tool called Data Refinery, and a combined Apache Spark/Apache Hadoop service called Analytics Engine.
IBM Data Catalog
It seems every big tech vendor, data integration vendor, and a myriad of niche vendors now offer data catalogs, a “one-stop-shop” for enterprise metadata. IBM’s Data Catalog is now in beta with ~30 connectors in the cloud and on premises. It can scan metadata from IBM DB2, Cloudant, Cloud Object Storage, Oracle, Microsoft SQL Server, Microsoft Azure, Amazon S3, Salesforce.com, Hortonworks HDFS, and Sybase, and it covers asset types ranging from structured (row/column) data and semi-structured content (social, memos, etc.) to CSV and Excel files, Jupyter ipynb documents, and even images. Like most other data catalogs, IBM uses a crawler-type approach to populate a metadata repository and also provides connectivity to the data catalog for data prep and reporting users.
Note: If you are considering a data catalog, check out my earlier article on this hot industry topic: Why you need a Data Catalog and How to Select One.
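To make the crawler-type approach concrete, here is a minimal, hypothetical sketch in Python of how a catalog crawler harvests metadata rather than the data itself. It is illustrative only, not IBM's implementation; the `crawl` function and its output shape are my own invention.

```python
import csv
import tempfile
from pathlib import Path

def crawl(root):
    """Walk a directory tree and record lightweight metadata per file,
    the way a crawler-based catalog harvests metadata rather than data."""
    catalog = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        entry = {
            "name": path.name,
            "format": path.suffix.lstrip(".") or "unknown",
            "size_bytes": path.stat().st_size,
        }
        # For CSV assets, also capture the column names (a simple schema).
        if path.suffix == ".csv":
            with path.open(newline="") as f:
                entry["columns"] = next(csv.reader(f), [])
        catalog.append(entry)
    return catalog

# Demo: crawl a throwaway directory containing one CSV "asset".
root = tempfile.mkdtemp()
Path(root, "sales.csv").write_text("id,amount\n1,10\n")
catalog = crawl(root)
```

A real enterprise crawler does the same thing at scale across databases and object stores, adding lineage, tags, and access policies on top of this basic inventory.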
IBM Data Refinery
IBM Data Refinery is a cloud data prep tool with a browser-based, point-and-click interface that can visually apply R scripts to select, combine, and transform data. Data Refinery uses Apache Spark as an execution engine to scale data prep to data sets too large to fit in memory.
Data Refinery shows both profiling metrics and visualizations. A Profile tab contains descriptive statistics and a Visualization tab allows you to select a combination of fields to build a chart. It automatically suggests appropriate plots (bar charts, heat maps, and so on), based on the data types.
Currently Data Refinery supports ~25 data source types and also can be used with IBM Data Catalog. To learn more about Data Refinery, you can sign up for the open beta.
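The descriptive statistics in a Profile tab are easy to picture in code. Here is a small, stdlib-only Python sketch of per-column profiling, purely illustrative and not Data Refinery's engine; the `profile` function and sample data are my own.

```python
import statistics

def profile(rows):
    """Compute simple descriptive statistics for each column of a
    list-of-dicts table, like a data prep tool's profile pane."""
    stats = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        if all(isinstance(v, (int, float)) for v in values):
            stats[col] = {
                "min": min(values),
                "max": max(values),
                "mean": statistics.mean(values),
            }
        else:
            # Non-numeric columns get a distinct-value count instead.
            stats[col] = {"distinct": len(set(values))}
    return stats

rows = [{"region": "east", "sales": 10},
        {"region": "west", "sales": 30},
        {"region": "east", "sales": 20}]
print(profile(rows))
```

Profiling like this, plus data-type detection, is also what lets tools suggest appropriate chart types automatically.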
AWS re:Invent 2017
As expected, there was a tsunami of announcements from AWS re:Invent 2017 that started trickling in one week before the event kicked off. This year over 43,000 folks went to Las Vegas to attend the event in-person and more than 60,000 signed up for live stream coverage. If you want to learn about AWS and watch the free event session videos, check out the AWS YouTube channel.
Unlike other cloud vendors, AWS caters primarily to technical app developers. Andy Jassy, CEO of Amazon Web Services, reiterated that message throughout his keynote that featured launches of Amazon Elastic Container Service for Kubernetes (EKS), AWS Fargate, Aurora Multi-Master, Aurora Serverless, DynamoDB Global Tables, Amazon Neptune, S3 Select, Amazon SageMaker, AWS DeepLens, Amazon Rekognition Video, Amazon Kinesis Video Streams, Amazon Transcribe, Amazon Translate, Amazon Comprehend, AWS IoT 1-Click, AWS IoT Device Management, AWS IoT Device Defender, AWS IoT Analytics, Amazon FreeRTOS, and Greengrass ML Inference. Did you catch all that?!? No worries if not. I’ll highlight the relevant analytics services in a moment.
AWS now has an $18B run rate with millions of active customers. As the cloud market leader, AWS touts a 44.1% cloud market segment share. That metric has grown from 39% last year. Today AWS offers more than 100 cloud services. Even more staggering – AWS anticipates over 1,700 cloud service enhancement releases this coming year. How do cloud customers keep up with this extreme pace of innovation?
Cloud development speed is light speed.
In his entertaining keynote, Jassy teased Oracle and other cloud competitors. One of his slides showcased competitive advantages using charts with no axes. Anyone in my audience recognize that classic technique? Someone in my Twitter following was quick to point out that fun tidbit.
The re:Invent 2017 keynote was split into three sections: Instances, Containers, and Serverless computing. These sections covered new Elastic GPUs along with a new service called AWS Fargate that lets managed container apps be deployed without having to manage servers or clusters. Serverless computing is changing the development landscape; according to Jassy, usage of the popular AWS Lambda grew more than 300%.
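What makes Lambda "serverless" is the programming model: you write only a handler and the platform handles provisioning and scaling. Lambda's Python handler contract looks like the sketch below; the event payload and response shape here are a made-up illustration, invoked locally rather than in AWS.

```python
import json

def handler(event, context):
    """A minimal Lambda-style handler: receive an event dict and
    return a response, with no server or cluster to manage."""
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# Locally we can invoke it directly; in AWS, the Lambda runtime calls it
# in response to an API Gateway request, an S3 event, a schedule, etc.
response = handler({"name": "re:Invent"}, None)
```

Because the unit of deployment is a function rather than a server, billing and scaling happen per invocation, which is exactly why usage growth numbers like Lambda's are possible.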
AWS Analytics News
AWS had awesome analytics and machine learning tracks. All of these sessions should now be available on the YouTube channel. Amazon claimed S3 is the world’s most popular choice for data lakes. Earlier this year, Amazon released Glue for S3 data lakes, and during the event we finally got more insight into Glue’s Python-based scripting service and Amazon’s flavor of a data catalog. We also heard about the new S3 Select, which lets you retrieve subsets of data within S3 objects using standard SQL expressions, up to 4x faster than fetching whole objects.
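The win with S3 Select is pushing the filter to the storage layer so only matching records cross the wire (in practice via `boto3`'s `select_object_content` call, which requires live AWS credentials). As a toy, local stand-in, here is the same idea applied to a CSV payload; the `select_csv` helper and sample data are mine, not an AWS API.

```python
import csv
import io

def select_csv(payload, column, predicate):
    """Toy stand-in for S3 Select: return only the rows whose `column`
    satisfies `predicate`, instead of shipping the whole object.
    The real service accepts a SQL expression instead, e.g.:
      SELECT s.* FROM s3object s WHERE CAST(s.amount AS INT) > 100
    """
    reader = csv.DictReader(io.StringIO(payload))
    return [row for row in reader if predicate(row[column])]

payload = "id,amount\n1,50\n2,150\n3,300\n"
big_orders = select_csv(payload, "amount", lambda v: int(v) > 100)
```

For a multi-gigabyte object where only a few rows match, scanning server-side and returning just those rows is where the claimed speedup comes from.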
The Amazon QuickSight product team was kind enough to personally brief me before this event to share highlights. In the latest update, a preview for on-premises data source connectivity is probably the most significant enhancement. Private VPC Access is a new feature that allows secure connections to data within VPCs or on-premises without the need for public endpoints.
Other new features include:
- Geospatial Visualization – You can now create map visuals.
- Flat Table Support – In addition to pivot tables, you can now use flat tables for tabular reporting. To learn more, read about Using Tabular Reports.
- Calculated SPICE Fields – You can now perform run-time calculations on SPICE data as part of your analysis. Read Adding a Calculated Field to an Analysis for more information.
- Wide Table Support – You can now use tables with up to 1000 columns.
- Other Buckets – You can summarize the long tail of high-cardinality data into buckets, as described in Working with Visual Types in Amazon QuickSight.
- HIPAA Compliance – You can now run HIPAA-compliant workloads on QuickSight.
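Bucketing the long tail of a high-cardinality field is a generic technique, not unique to QuickSight: keep the top-N categories and collapse everything else into an "Other" bucket. A quick Python sketch of the idea (the function name and sample data are mine):

```python
from collections import Counter

def top_n_with_other(values, n=3):
    """Keep the n most frequent categories; lump the rest into 'Other'."""
    counts = Counter(values)
    top = counts.most_common(n)
    other = sum(counts.values()) - sum(c for _, c in top)
    result = dict(top)
    if other:
        result["Other"] = other
    return result

pages = ["home"] * 5 + ["pricing"] * 3 + ["docs"] * 2 + ["faq", "blog", "jobs"]
print(top_n_with_other(pages))  # → {'home': 5, 'pricing': 3, 'docs': 2, 'Other': 3}
```

Without this, a bar chart over thousands of distinct values becomes unreadable; with it, the chart shows the categories that matter plus one honest bucket for the rest.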
In 2017 Amazon QuickSight rolled out a total of 46 new features. Notably that is not a rapid development pace for cloud BI apps. I suspect Amazon QuickSight, a business user app, might be a bit neglected due to the AWS developer focus. Google Cloud also seems to overlook front-end cloud BI apps that help seal premium cloud data source deals.
This year AWS announced five new machine learning services and a deep learning-enabled wireless video camera for developers called DeepLens on the stage. New services include Amazon SageMaker for machine learning development; Amazon Transcribe for converting speech to text; Amazon Translate for translating text between languages; Amazon Comprehend for understanding natural language; and Amazon Rekognition Video, a new computer vision service for analyzing videos in batches and in real-time.
The most exciting new AWS machine learning service to me was Amazon SageMaker, a fully managed, end-to-end machine learning service that enables data scientists, developers, and machine learning experts to quickly build, train, and host models at scale.
There are 3 main components of Amazon SageMaker:
- Authoring: Zero-setup hosted Jupyter notebook IDEs and a library of pre-built Jupyter notebooks for common use cases for data exploration, cleaning, and preprocessing. You can run these on general instance types or GPU powered instances.
- Model Training: A distributed model building, training, and validation service. You can use built-in common supervised and unsupervised learning algorithms and frameworks or create your own training with Docker containers. The training can scale to tens of instances to support faster model building. Training data is read from S3 and model artifacts are put into S3. The model artifacts are the data dependent model parameters, not the code that allows you to make inferences from your model. This separation of concerns makes it easy to deploy Amazon SageMaker trained models to other platforms like IoT devices.
- Model Hosting: A model hosting service with HTTPS endpoints for invoking your models to get real-time inferences. These endpoints can scale to support traffic and allow you to A/B test multiple models simultaneously. Again, you can construct these endpoints using the built-in SDK or provide your own configurations with Docker images.
Each of these components can be used in isolation or together. You can get started with Amazon SageMaker for free: for the first two months, you’re provided 250 hours of t2.medium notebook usage, 50 hours of m4.xlarge training usage, and 125 hours of m4.xlarge hosting usage each month. Beyond the free tier, pricing differs by region but is billed per second of instance usage, per GB of storage, and per GB of data transfer into and out of the service. If you are interested in learning more about Amazon SageMaker, check out the AWS re:Invent 2017 session. To learn more about AWS’s other machine learning services, visit https://aws.amazon.com/machine-learning.
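That separation between model artifacts (learned parameters stored in S3) and the inference code is a useful pattern well beyond SageMaker. Here is a minimal, framework-free Python sketch of the idea, not the SageMaker SDK; the `train`/`host` names and the JSON artifact format are my own illustration.

```python
import json

def train(points):
    """'Training': fit y = a*x + b by ordinary least squares and return
    the learned parameters as a JSON string, i.e. the 'model artifact'."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return json.dumps({"a": a, "b": b})  # in SageMaker this lands in S3

def host(artifact):
    """'Hosting': load the artifact and return an inference function.
    The hosting code is independent of where the model was trained."""
    params = json.loads(artifact)
    return lambda x: params["a"] * x + params["b"]

artifact = train([(0, 1), (1, 3), (2, 5)])  # data on the line y = 2x + 1
predict = host(artifact)
```

Because the artifact is just parameters, not code, the same trained model can be served from an HTTPS endpoint, embedded in an app, or pushed to an IoT device, which is exactly the portability the SageMaker design calls out.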
During the keynote, NFL Next Gen Stats were showcased: real-time player stats combined with historical analytics that are provided to announcers and other folks during football games. It was a pretty cool case study for IoT, streaming, and real-time analytics. AWS IoT Device Management, AWS IoT Analytics, and Amazon FreeRTOS, an operating system for microcontroller-based gadgets that supports disconnected edge analytics with AWS Greengrass, were also unveiled.
Salesforce Dreamforce
Over 170,000 attendees saw how Salesforce delivers simplified advanced analytics for the masses at Dreamforce. Today Salesforce Einstein delivers more than 475 million predictions every day. In this year’s keynote, Salesforce announced new Einstein-powered features that check for predictive data quality, allow non-technical users to build AI-powered apps, and create intelligent app experiences. For analytics and data science pros, Salesforce Einstein can be an inspirational example of how to embed artificial intelligence, machine learning, and advanced analytics seamlessly into line-of-business apps.
In the analytics keynote called Complete Analytics Powered by Einstein, a cool new prediction builder called myEinstein was showcased along with related Einstein bots, deeper integration of the acquired BeyondCore technology, analytics apps, natural language search, improved drag-and-drop reporting, and many other new features. Pricing for all of these new services has not yet been released.
The new Einstein Prediction Builder is a simple point-and-click wizard that automates predictive model development. Business users can walk through a five-step process to create and automatically deploy predictions for a chosen Salesforce object – custom objects included. The solution works with structured and unstructured fields. Natural language innovation powers predictions from free-text fields. Models automatically learn and improve as they’re used, delivering accurate, personalized recommendations and predictions in the context of business.
Following creation of a prediction, business users can assign actions to it. In the demonstrated example, Salesforce showed a case being created for students with predicted high drop-out rates. All of this can be done by a business user – no data scientist or app developer is needed to embed the algorithm.
The research team showcased rich text analytics capabilities and futures in that space. My favorite quote from this event is soooooo true of just about anything technical today.
The bottleneck is no longer access to information; it’s our ability to keep up.
For end-to-end analytics, data loading, intelligent data prep, and reporting have certainly improved. These areas of Salesforce now include intelligent data transformation suggestions, recommended visualizations, and more throughout the user experience.
Salesforce also announced a key partnership with Google to deliver four new, turnkey integrations between Google Analytics 360, Salesforce Sales Cloud and Salesforce Marketing Cloud:
- Sales data from Sales Cloud will be available in Analytics 360 for use in attribution, bid optimization and audience creation
- Data from Analytics 360 will be visible in the Marketing Cloud reporting UI for a more complete understanding of campaign performance
- Audiences created in Analytics 360 will be available in Marketing Cloud for activation via direct marketing channels, including email and SMS
- Customer interactions from Marketing Cloud will be available in Analytics 360 for use in creating audience lists
To get a deeper overview of what is coming in Winter 2018 from Salesforce analytics, check out the live demos shown in Einstein Analytics – Release Readiness Live Winter ’18 or the online tutorials available at https://trailhead.salesforce.com/trails/get_smart_einstein.
TIBCO Keeps on Surprising
TIBCO announced the acquisition of Alpine Data, an innovative, cloud-based data science and social collaboration platform. This addition will bolster TIBCO’s leading data science technology and analytics portfolio, complementing solutions like TIBCO Statistica™, TIBCO Spotfire®, and TIBCO StreamBase® to help companies accelerate their time to data-driven insight and action. Several years ago I did a Solution Review of Alpine Data.
TIBCO also released Spotfire 7.11, and its new data virtualization acquisition helped make it a Leader in The Forrester Wave™: Enterprise Data Virtualization, Q4 2017 by Forrester Research.
Tableau Maestro Visual Data Prep
New visual data prep capabilities for Tableau finally entered beta. With Project Maestro, you can visually explore data as you prepare it for visualization in three coordinated views. In the beta, you can add new steps to create a flow, review data distributions in a profile pane, or manipulate row-level data in the data grid. That news should be exciting for Tableau fans even though licensing and pricing plans are still unknown.
Notably the Project Maestro data prep experience looks similar to Trifacta and Paxata. I look forward to testing it and sharing my feedback with you soon.
Additional November Announcements
Here is a summary of other highlighted analytics announcements from November. If you have news to share, please contact me to let me know.
- Logi Analytics 12.5 Released
- New Datawatch Monarch Swarm Enterprise Data Preparation Platform Launched
- Birst Extends ROLAP and ETL for Business Users
- DataRobot Adds Multi-Class Classification and Anomaly Detection
- Attunity Launches Modern Data Integration Platform on AWS Marketplace
- AtScale adds Universal Semantic Layer to AWS Cloud
- Snowflake Introduces Snowpipe – Continuous, Automated Data Loading
- 1010data version 11 Released
- Dataiku version 4.1 Released
- Powered By Looker Embedded Analytics Unveiled
That’s all I have for this month. Time to start planning for 2018… expect more candid hands-on solution reviews, theme articles, deep dives and industry updates.