In this next Oceans of Data series article, I will share how easy it is to collect, ingest, and visually analyze unstructured data, going from raw data to actionable intelligence. Unstructured analytics is ideal for monitoring customer sentiment, gathering market intelligence, and many other types of analysis. Historically, this type of analytics work has required complicated scraping code, industry-specialized platforms like Digital Reasoning, or a custom services engagement to set up an IBM Watson implementation. Recently I found a lovely new solution called Stratifyd. It seems like a “Tableau for unstructured data”. Let’s take a look at how it works.
Ingest. Analyze. Visualize.
Stratifyd is the result of postdoctoral work at computer science and data visualization research centers. The founders were conducting government-funded research on how artificial intelligence could be used to ingest, analyze, and visualize unstructured data. From there, a startup was born. Today numerous reputable organizations such as Lenovo, Kimberly-Clark, and Etsy are using Stratifyd for unstructured analytics.
To get started, you select data sources to examine. Stratifyd has more than 80 data connectors, many of which are built on public APIs. It can load data from CSV, JSON, and Excel files, MySQL, Microsoft SQL Server, PostgreSQL, MongoDB, Hadoop, IBM DB2, Cassandra, and JDBC connections, and it can transcribe audio files to text. For my testing, I selected Consumer Affairs data. I wanted to analyze Frontier Communications, one of my current utility vendors.
After connecting to the Consumer Affairs data, I was provided a preview of the results. Although there are options to “tune” the search, such as removing terms and adding filters, I opted to initially view the full set of data.
Next, Stratifyd displayed an automatically created dashboard of semantic topics. Visualizations included a topic wheel, a bigram word cloud, a map, and a temporal trend graph.
Notably, Stratifyd has over 5,000 visualization types. I could have gotten incredibly creative with this one if I had more free time to play with it. The interactive visualizations can be filtered and also provide drill-down to see the underlying details. As I reviewed the results, I was impressed by how quickly I was able to find problem areas and see the actual customer comments. I chose Frontier Communications as my subject because our family has had issues with them here in Tampa. Sure enough, Stratifyd’s results accurately captured known issues in our region.
How it Works
Stratifyd extracts, cleans, and normalizes the unstructured data and performs automated feature engineering on it. They use natural language processing and an unsupervised clustering algorithm that characterizes each document as a mixture of topics, with each topic consisting of a small set of bigrams that frequently occur together. A topic wheel displays the bigrams grouped by topic number, while the word cloud/list ranks bigrams by either count or buzzword pointwise mutual information (“PMI”) score. Topics are indexed by the Stratifyd Significance Percentage, colored by sentiment, and sized and ranked by their statistical relevance. The topic model visual can be viewed as a pie chart, network graph, or tree map.
Buzzword PMI is a computational linguistics measure of association and collocation between words. Stratifyd uses PMI to select the most relevant buzzwords and then ranks them using machine learned heuristics. On the backend, they use a buzzword significance score that is similar to the topic percentage.
PMI compares how frequently two words occur together in a corpus with how frequently each word occurs individually. From these counts, the probability of co-occurrence and the probabilities of individual occurrence can be approximated. A higher PMI score means the two words occur together (as a bigram) more often than their individual (unigram) frequencies alone would predict. As a result, pairs built from common words – such as “the”, “is”, “be”, “to”, “in”, etc. – tend to have low PMI scores, while bigrams with high PMI scores tend to be more distinctive. You can learn more in Stratifyd’s help docs. For now, let’s move on.
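To make the PMI idea concrete, here is a minimal Python sketch of the standard textbook definition, PMI(x, y) = log2( p(x, y) / (p(x) · p(y)) ), with probabilities estimated from raw counts. This is not Stratifyd's implementation, which is not public; the function name `bigram_pmi`, the `min_count` parameter, and the sample reviews are all made up for illustration.

```python
import math
from collections import Counter


def bigram_pmi(documents, min_count=1):
    """Score adjacent word pairs by pointwise mutual information.

    Hypothetical helper for illustration only. Tokenization is a plain
    lowercase whitespace split, and bigrams never span two documents.
    """
    unigrams = Counter()
    bigrams = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Pairs seen only once get inflated PMI, so allow a count floor.
        if count < min_count:
            continue
        p_xy = count / n_bi          # estimated probability of the pair
        p_x = unigrams[w1] / n_uni   # estimated probability of word 1
        p_y = unigrams[w2] / n_uni   # estimated probability of word 2
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Made-up review snippets, not real Consumer Affairs data.
reviews = [
    "the internet outage lasted all day",
    "the internet outage was never fixed",
    "the bill is wrong and the service is slow",
]

scores = bigram_pmi(reviews, min_count=2)
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(" ".join(pair), round(score, 2))
```

In this toy corpus, “internet outage” scores higher than “the internet” even though both pairs occur twice, because “the” is frequent on its own. The `min_count` floor is there because raw PMI tends to favor pairs that appear only once, which is one reason a production system would likely combine PMI with other ranking signals.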
As I hovered over topics, the wheel automatically filtered related visualizations, allowing me to focus on the problem areas. I also experimented by changing visualization types to see tree map and network graph views. One of my favorite features was being able to search and read the raw customer notes. I could see that many other Tampa residents had experienced the same problems with Frontier Communications that we had noticed after an acquisition last year.
Diving deeper into Stratifyd’s capabilities, I discovered there was much more that I could do with this solution to optimize searches, filters, and stopwords and to eliminate noise. When it comes to natural language text solutions, problems with understanding context and limiting false positives will frequently come up. Although I did not deeply test all of Stratifyd’s options, they appear to be robust for overcoming common unstructured analytics challenges.
Stratifyd is available as a scalable, containerized cloud or on-premises platform.
Currently they support 25 languages: Arabic, Chinese, Dutch, English, Filipino, French, German, Greek, Hausa, Hindi, Indonesian, Irish, Italian, Japanese, Korean, Persian, Polish, Portuguese, Russian, Shona, Spanish, Swahili, Swedish, Turkish, and Welsh. Impressive for a newcomer!
For More Information
Stratifyd provides an effortless way for the business to get unstructured data insights and proactively monitor the gold mine of public web data sources. Most data sources in the world today do not have pre-defined data models or schema – files, email, logs, social media, websites, machine-generated data and so on. Extracting value from unstructured sources of data can provide a crucial competitive advantage. If you’d like to learn more about Stratifyd and see a demo of it, please visit https://www.stratifyd.com.