In this Oceans of Data series article, I will share a tip on creating reports from unstructured data. Unstructured data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. It makes up by far the majority of data in our glorious world: email, invoices, inventory documents, government forms, saved report files, and the list goes on and on.
Recently, while reviewing Datawatch Monarch data prep, I stumbled on an Equifax case study where they automated data extraction from unstructured documents to save time and money. Automating that previously tedious and “un-fun” work also improved data quality. Here is a link to the Equifax story that inspired me to test unstructured data extraction.
Monarch Auto Define
Although there are a variety of ways to extract unstructured data from files, one tried-and-true, fast and simple approach is to use Datawatch Monarch. Years ago I used this tool when building Department of Defense digital contract reporting projects. At that time, the process to define data regions and extract unstructured data required a bit of field mapping experimentation. With the latest version of Monarch Auto Define, that process is intelligently automated today.
Monarch’s Auto Define feature is essentially the easy button for what used to be a challenging task.
If you are copying and pasting values from unstructured documents into Excel for reporting, here is a better approach. You can automatically extract data from an Adobe Acrobat PDF file or another type of file for reporting in a few clicks.
Extracting Unstructured File Data
To get started, download a free trial of Datawatch Monarch. After installing Monarch, look in the file directory for the Invoices example file located at C:\Users\Public\Documents\Datawatch Monarch\Reports\Classic.pdf.
- Launch Monarch and click the Data Prep Studio icon at the top.
- Exit the Tutorial pop-up and choose Open Data. Then select PDF Report. Navigate to C:\Users\Public\Documents\Datawatch Monarch\Reports (Windows displays this folder as Public Documents), select Classic.pdf, and then click Open. You are then brought to the Report Discovery window. Scroll through the window and notice that there are many invoices in that single file, not just one. Essentially, you can use this approach for one-off or bulk data processing from unstructured documents. If you have many unstructured files to process, you can automate or schedule these steps. For more information on that option, check out Datawatch Server Automator.
- Now click the Auto Define button on the toolbar and watch how Monarch automagically, intelligently finds, defines, maps and extracts the unstructured data in the document for you. Alternatively, you can define each column individually by double-clicking on a field in the top window. Data Prep Studio will create a new column for each field that you define, and populate that column with similar field values.
- Select Open in Data Prep Studio to complete this step and then click Preview Data. Here you can optionally make changes, blend other data sources, and so on. For my test, I merely wanted to export the data for reporting.
- To export data, click the Load Selected Tables button and then Export Data.
- Now pick your desired data export destination and you are good to go. You can analyze the automatically extracted data to your heart's content in your BI tool of choice. Here is a peek at my exports to Excel and Tableau TDE.
- Last but not least, I built a lovely Tableau report from this previously unstructured, unusable, dark data in a matter of two minutes.
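For readers curious what this kind of extraction involves when done by hand, here is a minimal Python sketch of the general idea: find the header line of each invoice, find the detail lines under it, and flatten both into report-ready rows. It is not how Monarch's Auto Define works internally, which this article does not cover. The sample text is made up (it is not the contents of Classic.pdf), and the two patterns and the function names `extract_rows` and `rows_to_csv` are hypothetical, chosen only for illustration.

```python
import csv
import io
import re

# Made-up report text for illustration only; NOT the contents of Classic.pdf.
SAMPLE_REPORT = """\
INVOICE 1001   CUSTOMER: Acme Music
  2  Blue Vinyl Record      9.50
  1  Turntable Belt         4.25
INVOICE 1002   CUSTOMER: Bolt Audio
  3  Speaker Cable         12.00
"""

# Hypothetical patterns: one recognizes an invoice header line,
# the other recognizes an indented detail line (quantity, item, price).
HEADER = re.compile(r"^INVOICE\s+(?P<invoice>\d+)\s+CUSTOMER:\s+(?P<customer>.+?)\s*$")
DETAIL = re.compile(r"^\s+(?P<qty>\d+)\s+(?P<item>.+?)\s+(?P<price>\d+\.\d{2})\s*$")

FIELDS = ["invoice", "customer", "qty", "item", "price"]


def extract_rows(report_text):
    """Return one dict per detail line, carrying its invoice header fields."""
    rows = []
    current = None  # fields of the most recently seen invoice header
    for line in report_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.groupdict()
            continue
        detail = DETAIL.match(line)
        if detail and current is not None:
            rows.append({
                "invoice": current["invoice"],
                "customer": current["customer"],
                "qty": int(detail["qty"]),
                "item": detail["item"],
                "price": float(detail["price"]),
            })
    return rows


def rows_to_csv(rows):
    """Serialize the extracted rows as CSV text that a BI tool could load."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(extract_rows(SAMPLE_REPORT)))
```

Writing and maintaining patterns like these for every document layout is exactly the tedious field mapping that a feature such as Auto Define is meant to spare you.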
Life is good in the modern analytics world.