Call it self-service data prep, data wrangling, data mashup, combining data, VLOOKUPs or ETL: this area of the market is heating up despite a lack of consistent terminology. A fascinating collision is happening in this space right now. Enterprise BI, ETL and data catalog tools, cloud BI and big data BI vendors, self-service data visualization players, data science offerings and even workflow apps are all expanding into data prep while new players are still entering.
On August 25, Gartner released a new market guide briefly covering 36 self-service data prep vendors and current trends. I have also been digging deeply, hands-on, into a few of these vendors, including Paxata, Alteryx, Trifacta, Datawatch, Talend, Informatica REV, Tamr, IBM Dataworks, SAP Agile Data and SAS Data Loader. Of all the data prep players I have tested so far, Paxata is the most innovative, with a “future of data prep” user experience. Here is my quick, unbiased take on them today.
Paxata Adaptive Data Preparation
Paxata is a stand-alone self-service data prep solution designed for non-technical business analysts. A lot of competing solutions make the same claim, but not many actually live up to it in product testing with Excel users. I have been evaluating software for years, and I love doing it. (If you see a solution that needs user experience (UX), usability or design improvement, please send it my way.)
Having said that, Paxata’s design is truly easy and intuitive, provides immediate data prep gratification, and is FUN! It feels like Tableau for data prep. The product team also included lovely contextual help that can optionally be hidden.
My personal favorite Paxata features are the visual, interactive filters and live updates. I love that I can rapidly understand what data I have to cleanse, pivot or join, and what my data prep transformations will do before I commit to them. I also appreciate the ability to preview changes and to reorder and edit steps without losing my work.
Another really nice Paxata capability is annotations and versioning, which let me roll back changes or revisit a step to remember what I did and why.
The intelligent data prep process is further enhanced with automation and machine learning. Paxata can automatically suggest joins with Intellifusion and find similar values with Cluster + Edit, using natural language processing (NLP) technology. Other common tasks such as splitting columns, concatenating, de-duplication and repairing errors, nulls and whitespace can be done without any scripting or technical skills.
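To make the fuzzy-matching idea concrete: Paxata’s actual Cluster + Edit implementation is its own, but the general technique of grouping near-duplicate values so a user can fix them in one edit can be sketched in a few lines of plain Python. The `cluster_similar` helper and the 0.8 similarity threshold below are my own illustrative choices, not anything from Paxata.

```python
from difflib import SequenceMatcher

def cluster_similar(values, threshold=0.8):
    """Group near-duplicate strings into clusters.

    Conceptual sketch only -- NOT Paxata's actual Cluster + Edit
    algorithm. Each cluster keeps a normalized representative key
    and collects the raw values that match it closely enough.
    """
    clusters = []  # list of (normalized_key, [raw members])
    for value in values:
        key = value.strip().lower()
        for cluster in clusters:
            if SequenceMatcher(None, key, cluster[0]).ratio() >= threshold:
                cluster[1].append(value)
                break
        else:
            clusters.append((key, [value]))
    return [members for _, members in clusters]

cities = ["New York", "new york ", "NewYork", "Boston", "boston"]
print(cluster_similar(cities))
# [['New York', 'new york ', 'NewYork'], ['Boston', 'boston']]
```

In a tool like Paxata the analogous result is shown visually, and the user picks one canonical spelling per cluster instead of writing any of this code.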
For BI, reporting and self-service data visualization users, Paxata projects can be easily embedded and accessed via a URL feature called ClicktoPrep™ that allows dynamic linking between dashboards for viewing lineage, annotations, transformations or versions.
In my hands-on evaluation, I found the Spark-powered platform to be really fast and snappy. The speed, and the ability to handle and schedule incremental loads of large data sets, is awesome. There are also big data optimized export destinations such as Apache AVRO, Hive and S3. Speaking of output, right now both the output and input connection options are fairly limited (CSV, Hive, Amazon S3, MySQL, SQL Server, Oracle, Postgres and SFTP), but they do have granular permission settings. Connector options are an area that I assume will continue to improve over time.
Update 9/1/2016: Paxata’s trial does not provide all supported connectors. In addition to the above connector options, Paxata also supports the following:
- import from CSV, Excel, Fixed Width, XML, JSON, AVRO, and Parquet
- import from compressed formats GZip, BZip2, Deflate, Snappy, and LZ4
- import connectivity to Amazon Redshift and Salesforce
- import via DataDirect Cloud (enabling secure access to dozens of big data, relational, and cloud sources for customers using this service)
- import/export to any JDBC-compliant data source, HDFS (Kerberized and Non) for CDH, HDP, and MapR
- Paxata’s Connector SDK and extensible REST API connector framework allow creation of custom connectors
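As a rough illustration of what “import/export to any JDBC-compliant data source” means in practice: a generic import step talks to the source through a standard driver interface rather than source-specific code. Python’s DB-API plays roughly the role JDBC plays in the Java world, so here is a minimal sketch using sqlite3 as a stand-in source; the table, the cleanup steps and all names are hypothetical, not Paxata’s.

```python
import sqlite3

# sqlite3 stands in for any SQL source reachable through a standard
# driver interface (DB-API here, JDBC in the Java world).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(1, " Alice "), (2, "Bob"), (2, "Bob")])

# A minimal "import" step: pull rows, trim whitespace, de-duplicate.
rows = conn.execute("SELECT id, name FROM customers").fetchall()
cleaned = sorted({(i, n.strip()) for i, n in rows})
print(cleaned)  # [(1, 'Alice'), (2, 'Bob')]
```

The point of a JDBC-style connector layer is exactly this substitutability: swap the connection string and the same import logic works against Oracle, Postgres or any other compliant source.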
Other weaknesses I noted: the calculations do not yet have rich IntelliSense or debugging; there are no in-database transformations at the source (data must be imported first); there is no automatic detection of data tables in files or web pages; there is no easy way to remove top or nested rows/columns in multi-level spreadsheets; and there is no unstructured data extraction of the kind I saw in other reviewed data prep tools. Although I found the product design innovative, intuitive and unique with cheery bright colors, I wished that I could change the auto-assigned brown color on my projects. I do not like the color brown; color is an emotional thing for me.
Paxata Technical Architecture
For my technical audience, Paxata is built on the amazing Apache Spark technology, meaning that it is big data ready. The modern, parallel, in-memory, pipelined data prep engine powers the immediate visual gratification that I loved. It can scale on commodity hardware to handle billions of rows of data and can be deployed on premises (Hadoop or non-Hadoop), as a cloud multi-tenant SaaS app on Amazon AWS, or in hybrid mode.
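Paxata’s Spark engine is obviously far more sophisticated, but the pipelined idea (each transformation step lazily consumes the previous step’s output in memory, so previews can appear immediately without materializing the whole data set) can be sketched with plain Python generators. The step names and sample data below are my own, purely for illustration.

```python
# Conceptual sketch of a pipelined, in-memory transformation chain;
# each stage is a generator, so rows flow through lazily.
def parse(lines):
    for line in lines:
        yield line.strip().split(",")

def drop_incomplete(rows):
    for row in rows:
        if all(cell != "" for cell in row):
            yield row

def uppercase_column(rows, idx):
    for row in rows:
        row = list(row)
        row[idx] = row[idx].upper()
        yield row

raw = ["1,east", "2,", "3,west"]
result = list(uppercase_column(drop_incomplete(parse(raw)), 1))
print(result)  # [['1', 'EAST'], ['3', 'WEST']]
```

Because nothing runs until the final `list(...)` pulls rows through, the same chain of steps can preview the first few rows instantly and then be replayed over the full data set; a distributed engine like Spark applies the same principle across a cluster.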
Keep an Eye on Paxata
All in all, Paxata’s design and user experience is fantastic, innovative and really enjoyable. I don’t usually like data prep, but I did with Paxata because I could contextually visualize what I was doing instantly. I could see the data I was wrangling, live, as I worked, thanks to Spark. The powerful machine learning features (Cluster + Edit for fuzzy matching, Intellifusion joins, and pattern identification) were also quite useful.
Right now Paxata lacks depth and breadth when compared to more mature data prep platform offerings. However, Paxata’s user experience alone sets the bar high for next generation data prep. It has great market potential that is likely limited only by missing data sources and destinations, a gap that should close over time. (Hint: Now I know why a successful, respected former Forrester industry analyst made her recent leap of faith to join the Paxata team.)
For more information on this super cool, niche data prep tool, check out the following resources: