In my previous articles Predictive Model Data Prep: An Art and Science and Data Prep Essentials for Automated Machine Learning, I shared foundational data preparation tips to help you successfully get started with predictive analytics. I also covered the basics in a data prep white paper and related webinar. In this article, I want to share additional insights learned from the trenches of working with time series models.
Time Series Modeling
Time series is one of the most popular, profitable, and powerful types of predictions used today. These models predict future values of data based on history and trends. You can use statistical or machine learning methods to analyze time series to identify patterns such as seasonality, unusual events, and relationships with input variables. Common use cases for time series models include forecasts for sales, product demand by SKU, predictive maintenance, staffing, inventory, and many other applications.
Time series analysis assumes that there are signals in the data that can be at least partially accounted for by a change in time or other independent variables. Example independent variables include season, weather, weekends, holidays, planned events, work schedules, or macroeconomic factors such as GDP, the unemployment rate, or stock market valuations.
Time series modeling is one of the more complex types of machine learning. You should start with simple models and build in more complexity over time. Regularly spaced time intervals such as minutes, days, weeks, or months may behave quite differently for different scenarios, products, and so on. You will likely also need to account for lagged variables due to the cause-and-effect patterns of the real world.
Your past data may not look like your future data. For example, unless your model has seen a stock market crash like the one in 2008, it cannot predict that there will be another one. This is a classic limitation of machine learning. In time series, however, the local dynamics being estimated generally change much faster than in other types of models.
Structuring Time Series Input Data
Time series projects use date/time partitioning. Unlike other types of machine learning projects, time series projects produce models that forecast multiple future values instead of an individual prediction for each row. Your input framework may consist of a Forecast Point (the time at which a prediction is made), a Feature Derivation Window (a rolling window of history used to create features), and a Forecast Window (a rolling window of future values to predict).
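A minimal sketch of that framing in plain Python, assuming a daily series; all dates, window lengths, and values below are hypothetical:

```python
from datetime import date, timedelta

# Hypothetical daily history keyed by date (values are made up).
history = {date(2023, 1, 1) + timedelta(days=i): 100 + i for i in range(30)}

forecast_point = date(2023, 1, 21)   # the time at which the prediction is made
derivation_window = 7                # days of history used to build features
forecast_window = 3                  # days ahead the model must predict

# Feature Derivation Window: rolling slice of past values ending at the forecast point.
fdw = [history[forecast_point - timedelta(days=d)]
       for d in range(derivation_window, 0, -1)]

# Forecast Window: the future dates whose values the model must predict.
fw_dates = [forecast_point + timedelta(days=d)
            for d in range(1, forecast_window + 1)]
```

Sliding the forecast point forward and repeating this slicing is what turns one series into many training examples.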
During data prep, derived time series features such as lags and rolling statistics will be used as input features to train the models. Depending on your tools, you might need to create these manually, or your automated machine learning platform might generate them for you, creating two hundred or more potential time features. Automated time series partitions are derived features created from discovered patterns that span rows. For my analytics audience, this concept is similar to creating dimensional time calendars for reporting. However, time series features might be spans of time driven by patterns rather than by your calendar.
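For example, lag and rolling-statistic features like these can be derived with a few lines of pandas; the column names and values here are illustrative assumptions, not from any particular platform:

```python
import pandas as pd

# Toy daily sales series (values invented for illustration).
sales = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=10, freq="D"),
    "units": [5, 7, 6, 8, 9, 7, 10, 12, 11, 13],
})

# Lag features: yesterday's value and the value one week ago.
sales["lag_1"] = sales["units"].shift(1)
sales["lag_7"] = sales["units"].shift(7)

# Rolling statistic over the trailing 3 days, shifted by one so the
# current day's value never leaks into its own feature.
sales["roll_mean_3"] = sales["units"].shift(1).rolling(3).mean()
```

Note the `shift(1)` before the rolling mean: without it, each row's feature would include the very value the model is trying to predict.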
Not Too Much Data, Not Too Little Data
Unlike other machine learning modeling techniques, more data doesn’t mean better performance for time series models. If you use data from too long ago, your model might learn trends that are no longer relevant. Using more recent data is often better than using more data, because it avoids diluting new patterns. At the same time, keep in mind that you need enough data. Reliably predicting Black Friday sales, for instance, will require more than two weeks of prior data; you’d probably want to feed in two or three prior years.
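In practice, trimming to a recent window can be a one-liner; a pandas sketch where the two-year cutoff is an arbitrary assumption for illustration, not a recommendation:

```python
import pandas as pd

# Toy history (dates and values invented for illustration).
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=1500, freq="D"),
    "y": range(1500),
})

# Keep only the most recent two years before training, so stale
# patterns don't dilute recent behavior.
cutoff = df["date"].max() - pd.DateOffset(years=2)
recent = df[df["date"] > cutoff]
```

Treat the window length itself as a tunable choice: backtest a few candidates rather than assuming longer history is always better.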
Split into Multiple Projects
Another technique that improves model accuracy is building multiple data prep and machine learning projects based on unique patterns of behavior found in your data. You can usually use a visualization tool to find obvious groups and splits over time.
In a retail use case, you might initially review sales over several years for all departments to find seasonal patterns with expected sales peaks during holiday months. Then you’ll review differences between seasonal departments such as TVs, videos, and toys and non-seasonal departments such as grocery staples (dairy, cheese, and eggs) or snack foods. Getting even more granular, you might opt to build SKU-level models to maximize accuracy.
Unlike seasonal departments in our analysis, grocery items did not decline in year-over-year sales. Those non-seasonal items show steady sales over time, and they do not seem to be influenced by promotions. Thus, stop wasting advertising budget on those necessities.
Delving a little deeper, you can visually see two natural groups of departments: seasonal and weekly. To improve forecast accuracy with a time series model, you would create at least two different time series projects, one for each group of departments. In other cases, you might end up creating different data prep projects for new items versus existing items, promotional items, discontinued items, and so on. If you used one dataset to predict all the items, ignoring the pattern differences, you’d probably get an unreliable forecast.
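Mechanically, the split can be as simple as a groupby; a sketch where the department names and group labels are invented for illustration:

```python
import pandas as pd

# Toy dataset with a pattern label assigned per department
# (labels would come from your visual or statistical analysis).
df = pd.DataFrame({
    "dept":    ["toys", "toys", "dairy", "dairy", "tv", "tv"],
    "pattern": ["seasonal", "seasonal", "weekly", "weekly", "seasonal", "seasonal"],
    "sales":   [100, 300, 50, 52, 80, 240],
})

# One modeling project (dataset) per behavior group.
projects = {name: group.drop(columns="pattern")
            for name, group in df.groupby("pattern")}
```

Each resulting dataset then gets its own data prep and its own time series project, rather than forcing one model to reconcile conflicting patterns.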
Measuring Model Performance
To examine time series model performance, you’ll use backtesting techniques and measure forecast improvements over baseline models.
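A rolling-origin backtest can be sketched in plain Python. The "model" below is a trivial moving-average forecaster compared against a naive last-value baseline; all numbers, the fold schedule, and the choice of MAE are assumptions for illustration:

```python
# Rolling-origin backtest: at each fold, train on everything before the
# origin, forecast the next horizon, and score against a naive baseline.
series = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20]
horizon = 2

def mae(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

model_errors, baseline_errors = [], []
for origin in range(6, len(series) - horizon + 1, horizon):
    train = series[:origin]
    actual = series[origin:origin + horizon]
    model_fc = [sum(train[-3:]) / 3] * horizon   # mean of last 3 points
    naive_fc = [train[-1]] * horizon             # baseline: repeat last value
    model_errors.append(mae(actual, model_fc))
    baseline_errors.append(mae(actual, naive_fc))

# Average error across all backtest folds, for model vs. baseline.
avg_model = sum(model_errors) / len(model_errors)
avg_baseline = sum(baseline_errors) / len(baseline_errors)
```

A candidate model only earns its keep when its backtest error beats the naive baseline across folds, not just on a single lucky split.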
Since seasonal purchases are highly variable, time sensitive, and a top revenue generator in our retail example, demand forecast improvement for stock planning decisions can make a massive positive impact on the bottom line.
In contrast, non-seasonal forecasts will have less impact due to their stability over time.
Enhancing Input Data with External Data Sources
One of the most important things in time series is looking at how your predictions perform over time and continuing to enhance your input data with new features and external data. By seeing where your model makes mistakes, you might find a fascinating pattern from a business process or event that was omitted in your initially prepared data.
A fun real-world example of this tip was a towing company learning that hometown football game schedules needed to be included to reliably predict required towing staff. Apparently, beer-drinking football fans were smart enough to find another ride home. The next morning, they called towing companies. That significant variable was found by reviewing where the largest forecast errors were happening. This is where the beauty of the human mind shines brightly. Only a human could decipher local football games as the missing data prep ingredient.
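That kind of error review can start as simply as ranking residuals by date; a pandas sketch with invented numbers and column names:

```python
import pandas as pd

# Toy forecast results (all values invented for illustration).
results = pd.DataFrame({
    "date":     pd.date_range("2023-09-01", periods=6, freq="D"),
    "actual":   [40, 42, 95, 41, 43, 98],
    "forecast": [41, 40, 45, 42, 41, 50],
})

# Rank days by absolute forecast error; the worst days are where an
# omitted driver (e.g. a local event) may explain the misses.
results["abs_error"] = (results["actual"] - results["forecast"]).abs()
worst = results.nlargest(2, "abs_error")
```

Cross-referencing the worst dates against local calendars, promotions, or business events is the human step no platform automates.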
Until Next Time
That wraps up my time series data prep tips from actual projects. As I continue to learn more, I will share my knowledge with you. Now I’ve got to get back to work prepping for the annual February Gartner Magic Quadrant announcements, webinars, and the start of spring in-person events. This spring I’ll be speaking at Gartner Australia, providing a theatre demo at Gartner Orlando, attending INFORMS in Austin, and speaking at AI Live in New Orleans. I know there will be more events … each week new ones pop up on my radar. If you are attending any of these events, please stop by and say hi.