One of the best kept secrets in the database market is EXASOL . Built in Germany from the ground-up, EXASOL achieved the best TPC-H performance ranking. TPC-H is an industry standard benchmark for analytical databases. That exercise secured EXASOL the coveted title of being the world’s fastest database for analytics.

Exasol TPCH

How did I not hear of them? Honestly, I must have been in a Microsoft bubble for too long. That can happen while you are attempting to keep up with Microsoft’s weekly cloud changes. As a former SQL Server BI product manager, I guess you could also say that I had “database blinders” on when it came to the ~300+ other databases in the market. Since leaving Microsoft, I have been expanding once again into the broader market. I love evaluating all types of analytics solutions.

EXASOL Excitement

I learned about EXASOL from a dear friend at Tableau that also tests solutions in his spare time. A few weeks later I noticed a brilliant Tableau Zen, Mark Jackson, blogging about his work with EXASOL . Slowly, more and more peers in my analytics network have been buzzing about using EXASOL to cure Tableau TDE blues. Apparently it is easy swap out Tableau TDEs to EXASOL. Plus, there is a new free EXASOL Small Business Edition that can be used commercially.

A “fear of missing out” motivated me to learn about it. After several conversations with EXASOL, I was invited to lead a customer research study. Those customer interviews only further intrigued me. I can confidently say EXASOL has something special; it is more than just the fastest database in the world.

What is EXASOL?

EXASOL is a low cost, high performance massively parallel processing (MPP) database that also has an extendable data analytics framework. An MPP database is partitioned across multiple servers or nodes with each server/node having memory/processors to process data locally. All communication is via a network interconnect meaning there is no disk-level sharing or contention. This is referred to as a ‘shared-nothing’ architecture.

Exasol architecture

Highlighted features include:

  • High performance MPP in-memory database for analytics
  • ACID compliant, fully transactional with an analytics design focus
  • Intelligent in-memory column store engine
  • Runs on standard x86 hardware
  • Complies with Standard ANSI SQL
  • ODBC, JDBC, .NET as well as a JSON-based web socket API
  • Parallelized analytics with R, Python, Java and more
  • Geospatial analytics
  • Data virtualization framework with virtual schemas
  • External table support forHadoop
  • Modern data integration framework
  • Advanced automation
  • Extremely scalable, supports high concurrency
  • Deployment choice software-only, appliance or cloud
  • Minimal administration overhead, self-tuning capabilities

Admittedly I was skeptical of the last bullet. Tuning indexes and queries is assumed and expected. However, in every conversation in the EXASOL customer study the hands-free, self-tuning came up as being a pleasant surprise that was highly valuable. One of the case study accounts has a 200TB, 64 node EXASOL cluster with no DBA to take care it. They said to me that EXASOL takes care of itself.

Extendable Data Analytics Framework

In EXASOL version 6, the product team added an extendable analytics framework that can integrate any programming language for in-database analytics using User Defined Functions (UDFs). This is the area that I am personally most interested in diving deeper into for operationalizing data science.

EXASOL ’s UDF analytics are available for R, Python, Java, Lua and any universal/pluggable language of choice such as Scala, Julia, or C++. Open analytics frameworks published to GitHub continue to add more functionality. In-database MPP execution of algorithms near the data offers optimal performance and scale. UDFs are simple to query making embedding or using predictive algorithms in applications or reports easy. For Tableau, TIBCO Spotfire, Qlik, Excel, RStudio or Anaconda Python fans, you can call EXASOL UDF functions from those environments too.

UDFs

EXASOL executes pre-defined or user-defined (UDF) analytical functions fully in parallel and distributed across the database cluster. The performance and scalability of EXASOL combined with capabilities in R, such as clustering, exponential smoothing, predictive modeling, and machine learning, make this a popular combination.

Organizations today do use EXASOL with R for customer segmentation, market basket analysis, real-time offers, predictive asset maintenance, and other applications. It is not uncommon to have a situation where hundreds of instances of a script are running in parallel on an EXASOL cluster. While execution is tightly integrated with EXASOL SQL processing, the actual program code run is encapsulated using container technology. EXASOL also decides which resource limits are applied for each script container, in such a way that a misbehaving script container is not able to block all available resources. The isolation of processes allows for secure script programming.

Beyond the popular analytics framework used for data science, EXASOL supports an ad-hoc, experimental “data science way of working” with large-scale data. In order to move more and more computation to the data, EXASOL is constantly adding new features for in database analytical programming to the core system. If this topic interests you, EXASOL has a white paper called Big Data Science that is a good read and delves into deeper detail.

EXASOL R SDK

For R fans, EXASOL R SDK offers functionality to interact with the EXASOL database out of R programs. It extends the ORDBC and offers fast data transfer between EXASOL and R (exa.readData() and exa.writeData()) and it makes it convenient to run parts of your R code in parallel in EXASOL ution using R UDF functions behind the scenes (exa.createScript())

A collection of R prediction scripts can be downloaded and explored from the EXASOL ’s website.

R SDK

For R users, the easiest way to get started with EXASOL in-database analytical programming is to use the provided exa.createScript()packages to run code directly from the command line. Here is an example.

> require(exasol)

First, create a connection to the database:

> C <- odbcConnect(„your Data Source Name“)

Then, if we want a function from our R session to be available in EXASOL, we simply deploy it in EXASOL with the exa.createScript command.

testscript <- exa.createScript(

C,

„test.mymean“,

function(data) {

data$next_row(NA); # read all values

from this group

data$emit(mean(data$val))

},

inArgs = c(„groupid INT“, „val DOUBLE“),

outArgs = c(„groupid INT“, „mean DOUBLE“))

When exa.createScript is called it does two things:

  • It deploys the supplied anonymous function (of course this function could also be named) in EXASOL using the name test.mymean.
  • It returns a new R-function (which gets assigned the name testscript) which transparently causes the execution of test.mymean in EXASOL on the supplied data and data grouping specification. In order to execute this function in EXASOL, we simply type:
testscript(„groupid“, „val“, table =“test. 
twogroups“, groupBy = „groupid“)

val and groupid are the names of the columns that are to be fed into test.mymean. The table parameter defines the data that is input into test.mymean.table and can be defined by arbitrary complex SQL statements. Finally, as  test.mymean defines an aggregate function, we use the  groupBy parameter to define the SQL groups that we would like to aggregate.

For R developers, this should be straightforward for you and open up a lot of new possibilities for large-scale analytics. EXASOL provides a high quality, high performance in-database approach that usually comes with an expensive price tag.

EXAPowerlytics

Another analytics functionality within EXASOL is EXAPowerlytics. EXAPowerlytics implements generic functions for complex windowing over time series queries, cumulative sums, moving averages etc. Without analytic functions these queries would require correlated subqueries or self joins that are inefficiently and complicated to code. Examples of EXAPowerlytics can be found in online documentation.

For More Information

To learn more about EXASOL V6, check out the free virtual machine download or cloud demo and their massive library of white papers including one on data science.