As promised, this is the next article in Getting Started with Tableau 8.1 & R. If you have not yet installed and configured R and Rserve, I suggest that you read the first article before continuing. If you are new to R, there are many free sources to get you up and running with the basics. One fairly quick and easy free course that even has some simulations in it is Try R from O’Reilly and Code School.

In Tableau 8.1, a connection to Rserve was added under Help > Settings and Performance > Manage R Connection, along with four new Calculated Field functions, SCRIPT_STR, SCRIPT_REAL, SCRIPT_BOOL, and SCRIPT_INT, that simplify calling R functions from Tableau. To oversimplify how this works, a Tableau Calculated Field script passes dimension, measure or parameter values from the Tableau viz to R. R then computes a result that is returned to Tableau as the value of that Calculated Field. You can use R Calculated Fields for a wide variety of analytic use cases including, but not limited to, advanced forecasts, prediction scores, outlier detection, association, clusters or classifications.

In the first article I covered the programming classic, Hello World, and introduced parameters and R arguments for passing values. Those were not very exciting examples but they did establish a foundation of understanding. After reading Bora Beran’s awesome blog, I liked the Clustering example. It is easy to understand and a popular data mining / predictive analytics use case. I wanted to try R Clustering in Tableau with the same classic Microsoft Bike Buyer data mining demo data set that comes with the Data Mining Add-In for Excel and SQL Server. In past blogs, I developed a Clustering example with this exact same data set using Analysis Services DMX queries. I posted that live demo on Tableau Public with instructions on how I did it. This time I am going to use the new Tableau R Calculated Field functions instead of an Analysis Services DMX query.

There are a few Clustering functions available in R by default, such as kmeans in the stats package. The R CRAN community library has more Clustering packages that you can search and install, such as Density Based Clustering, which is less sensitive to noisy data than kmeans. Clustering algorithms apply space/vector quantization to identify related groups of data points. k-means clustering partitions data points into the number of clusters, k, that you define. Each cluster is represented by its centroid, the mean of the points in that cluster, and each data point (vector) is assigned to the cluster whose centroid is closest to it. The bottom line concept is that related data should reside close to each other in a mathematical space. Now that you have a basic understanding of Clustering, let’s use it in a Tableau visualization to visually explore Bike Buyer variables and the Cluster patterns.
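To make the centroid idea concrete, here is a minimal sketch in R. The six points are made up for illustration and have nothing to do with the Bike Buyer data; the check at the end simply confirms that every point ended up with the centroid nearest to it.

```r
# Toy data: two obvious groups of 2-D points (made-up values).
pts <- data.frame(
  x = c(1.0, 1.2, 0.8, 8.0, 8.3, 7.9),
  y = c(1.1, 0.9, 1.0, 8.1, 7.8, 8.2)
)

set.seed(1)                       # kmeans starts from randomly chosen points
fit <- kmeans(pts, centers = 2)   # ask for k = 2 clusters

fit$cluster   # one cluster number per row
fit$centers   # the centroid (mean point) of each cluster

# For every point, find the centroid with the smallest squared distance.
nearest <- apply(pts, 1, function(p) {
  which.min(colSums((t(fit$centers) - p)^2))
})

# Compare the nearest centroid with the cluster kmeans assigned.
all(nearest == fit$cluster)
```

The cluster numbers themselves (1 or 2) are arbitrary labels; what matters is which points share a label.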

I think it is easier to test, debug and run your R functions in RStudio or another R GUI tool before using them in a Tableau Calculated Field. To get help on R syntax in RStudio, use help(function name). In this Clustering example it would be:

> help(kmeans)

From there within RStudio, I imported my Bike Buyer data set into a data.frame (R version of a database table) using:

> BikeBuyers <- read.csv("~/BikeBuyers.csv")

Now I can experiment with R Cluster algorithm settings, such as changing the number of Clusters or passing in different data.frame columns (referenced with $) as variables. I ended up choosing the following snippet that I will now use in Tableau.

> kmeans(data.frame(BikeBuyers$Income, BikeBuyers$Cars, BikeBuyers$Age, BikeBuyers$Children), 4)$cluster

Here the 4 does not refer to the 4 columns being passed in. The 4 means 4 Clusters. These two numbers do not have to match, and it is only a coincidence that they do in this demo.
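The kind of experimenting I mean can be sketched roughly as follows. Since the CSV is not included here, this sketch builds a hypothetical stand-in for BikeBuyers with randomly generated values; only the four column names come from the snippet above. The nstart and iter.max settings are my own choices for the sketch, not something the demo requires.

```r
# Hypothetical stand-in for BikeBuyers.csv: randomly generated values,
# NOT the real Bike Buyer data.
set.seed(42)
n <- 200
BikeBuyers <- data.frame(
  Income   = round(runif(n, 10000, 170000), -4),
  Cars     = sample(0:4, n, replace = TRUE),
  Age      = sample(25:80, n, replace = TRUE),
  Children = sample(0:5, n, replace = TRUE)
)

features <- BikeBuyers[, c("Income", "Cars", "Age", "Children")]

# Try several cluster counts and record the total within-cluster sum of squares.
# Smaller means tighter clusters; it generally shrinks as k grows,
# so look for the point of diminishing returns.
ks <- 2:6
wss <- sapply(ks, function(k) {
  kmeans(features, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})
data.frame(k = ks, tot.withinss = wss)

# Number of rows in each cluster for the 4-cluster setting used in this article.
sizes <- table(kmeans(features, centers = 4, nstart = 10, iter.max = 50)$cluster)
sizes
```

With your own data, replace the generated data.frame with the read.csv call shown earlier.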

Back in Tableau 8.1, I connect to my BikeBuyers.csv file and create a Tableau visualization with the data point values that I’d like to pass to R to see what Cluster gets assigned. I can see the assigned Clusters by creating a new Tableau Calculated Field with my R script. The cluster component of the R kmeans result is a vector of Integer values, so I used SCRIPT_INT in the Tableau Calculated Field that I am naming RCluster.

SCRIPT_INT('kmeans(data.frame(.arg1,.arg2,.arg3,.arg4),4)$cluster;', SUM([Age]), SUM([Cars]), SUM([Children]), SUM([Income]))

To pass Tableau data point values from the Tableau viz to the Calculated Field, you map the Tableau fields to R arguments (parameters). In this example, .arg1, .arg2, .arg3 and .arg4 receive the values of SUM([Age]), SUM([Cars]), SUM([Children]) and SUM([Income]), in that order.
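If you want to test the exact script string outside Tableau, one option is to create the .arg variables yourself in RStudio. This is only a sketch of how I understand the hand-off: I am assuming each argument arrives in R as a vector with one element per data point on the viz, and all of the values below are made up.

```r
# Stand-ins for the vectors Tableau would send (made-up values).
.arg1 <- c(34, 45, 29, 61, 52, 38, 47, 70)        # SUM([Age])
.arg2 <- c(0, 2, 1, 3, 2, 0, 1, 4)                # SUM([Cars])
.arg3 <- c(0, 3, 1, 2, 4, 0, 2, 5)                # SUM([Children])
.arg4 <- c(40000, 80000, 30000, 120000,
           90000, 50000, 70000, 160000)           # SUM([Income])

set.seed(7)  # only so this sketch behaves the same on every run

# The same expression that sits inside the SCRIPT_INT string.
result <- kmeans(data.frame(.arg1, .arg2, .arg3, .arg4), 4)$cluster

result                             # one cluster number per data point
is.integer(result)                 # integer vector, hence SCRIPT_INT
length(result) == length(.arg1)    # same length as the inputs
```

The result has to line up one-to-one with the inputs, which is why the script returns the whole $cluster vector rather than a single number.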

To see the computed Cluster results, I placed that R Calculated Field onto Color on the Marks Card. I could alternatively place it on Label to see the R Cluster numbers. The resulting Cluster numbers are dynamically computed when the Calculated Field is placed or when dimensions or measures change on the Tableau visualization.
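One thing to be aware of with dynamically computed results: kmeans picks its starting centroids at random, so the Cluster numbers (and therefore the colors) can change from one recomputation to the next. Here is a minimal sketch, using made-up values, of how fixing R's random seed makes the result repeatable. The seed value 42 is arbitrary.

```r
# Made-up values standing in for the data points on a viz.
df <- data.frame(
  Age    = c(34, 45, 29, 61, 52, 38, 47, 70, 33, 58),
  Income = c(40000, 80000, 30000, 120000, 90000,
             50000, 70000, 160000, 60000, 110000)
)

# Setting the same seed before each call gives the same random starting
# centroids, and so the same cluster numbers.
set.seed(42); first  <- kmeans(df, 4)$cluster
set.seed(42); second <- kmeans(df, 4)$cluster

identical(first, second)
```

In a Tableau Calculated Field, the same idea should work by starting the script string with a set.seed call followed by a semicolon, though I have not shown that variant in this demo.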

A couple of tips: You may want to change how the results are displayed from Continuous to Discrete by right-clicking the R Calculated Field on Color. The other issue that I am seeing come up on the Tableau and R FAQ is the message “Error in sample.int(m, k) : cannot take a sample larger than the population when 'replace = FALSE'”. R raises this error when kmeans receives fewer data points than the number of Clusters requested (4 in this example), which happens when the data points on the Tableau viz are not enough or don’t make sense to put into R Clusters. If you run into this issue, try placing different fields on the Tableau viz so that more data points are passed to R. When it is working correctly, you will see different Cluster colors in your visualizations. This allows you to visually explore a wide variety of variables looking for patterns.
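You can reproduce that error in RStudio by asking for more Clusters than there are rows. The sketch below uses three made-up rows and also shows one possible guard; safe_clusters is a hypothetical helper of my own, not part of R or Tableau.

```r
# Three data points but four clusters requested (made-up values).
too_few <- data.frame(Age = c(34, 45, 29), Income = c(40000, 80000, 30000))

# Capture the error text instead of stopping the script.
msg <- tryCatch(
  {
    kmeans(too_few, 4)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
msg

# Hypothetical guard: only cluster when there are at least k rows,
# otherwise return NA for every row. This simple check does not cover
# the case of enough rows that are all duplicates of each other.
safe_clusters <- function(df, k) {
  if (nrow(df) < k) return(rep(NA_integer_, nrow(df)))
  kmeans(df, k)$cluster
}

safe_clusters(too_few, 4)
```

Returning NA keeps the result the same length as the input, so a viz with too few data points would show empty values instead of an error.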

Clustering can be useful in the real world for narrowing in on important variables and understanding which variables differ between Clusters. For example, in a marketing use case, you could identify natural Clusters of good, OK and not so good customers and then target initiatives that influence the variables that shift an OK customer to a good customer. You could also use Clustering to filter out low value prospects that are not likely to become good customers.

I hope you have found this article helpful in furthering your Tableau and R skills. For more resources, check out the fantastic, free Tableau video library.