Data preparation is critical for any analytics, business intelligence or machine learning effort. Although automated machine learning provides safeguards to prevent common mistakes and is robust enough to handle imperfect data, you’ll still want to properly prepare your data to get optimal results. Unlike other analytical techniques, machine learning algorithms rely on carefully curated data sources. You’ll need to organize your data within one wide analytical row of input variables and outcome metrics that describe an entire lifetime of events.

In this article, white paper and related webinar, I will review how to amalgamate data in a machine learning-friendly format that accurately reflects business processes and outcomes. I will share basic guidelines, practical tips and additional resources to help get you started mastering the art of automated machine learning model data preparation.


Thinking Differently

Data preparation for machine learning requires business domain expertise, bias awareness and an experimental thought process. Before preparing your data, you’ll first define a business problem to solve. During that exercise, you’ll select an outcome metric and brainstorm potential input variables that influence it from many varied perspectives. From there, you will begin identifying, collecting, cleaning, shaping and sampling data to run through automated machine learning model processes.

Note that it is also not unusual for relevant machine learning input data to occur outside of existing transactional processes. If that is the case, you can still start creating a first-generation machine learning model with existing data and continue to build new model versions over time as supplementary data is acquired.

Machine Learning Input Data Sources

Machine learning algorithms ingest single tables, views, or comma-separated (.csv) flat files. If your data is stored in a dimensional data warehouse or in a normalized transactional database, you will need to join fields from multiple tables to create a single unified, flattened machine learning “view”.
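As a rough illustration of that flattening step, here is a minimal sketch in plain Python. The tables, field names and the `flatten` helper are all hypothetical, invented for this example; in practice you would more likely do the join in SQL or a dataframe library.

```python
# Hypothetical warehouse tables, invented for illustration only.
customers = {  # dimension table keyed by customer_id
    1: {"region": "West", "segment": "SMB"},
    2: {"region": "East", "segment": "Enterprise"},
}
products = {  # dimension table keyed by product_id
    "A": {"category": "Hardware"},
    "B": {"category": "Software"},
}
orders = [  # fact table: one row per order, "returned" is the outcome metric
    {"order_id": 10, "customer_id": 1, "product_id": "A", "amount": 120.0, "returned": 0},
    {"order_id": 11, "customer_id": 2, "product_id": "B", "amount": 50.0, "returned": 1},
]


def flatten(orders, customers, products):
    """Return one wide row per order with the dimension fields joined in."""
    view = []
    for order in orders:
        row = dict(order)  # copy so the source table is left untouched
        row.update(customers.get(order["customer_id"], {}))
        row.update(products.get(order["product_id"], {}))
        view.append(row)
    return view


flat_view = flatten(orders, customers, products)
```

Each row of `flat_view` now carries the outcome metric and the candidate input variables side by side, which is the shape a single-table algorithm expects.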


Machine learning “views” contain outcome metrics along with input predictor variables, collected at a level of analytical granularity on which you can make actionable decisions. Be careful not to over-aggregate the data or overcomplicate variable design. Pick a level of analytical detail that is both understandable and practical for operationalizing your model.
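To make the idea of granularity concrete, the sketch below rolls order-level rows up to one row per customer, assuming the decision you want to act on is made per customer. The data and the `to_customer_grain` helper are hypothetical and only meant to show the shape of the result.

```python
from collections import defaultdict

# Hypothetical order-level rows, invented for illustration only.
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 2, "amount": 50.0},
]


def to_customer_grain(orders):
    """Roll order rows up to one row per customer."""
    grouped = defaultdict(list)
    for order in orders:
        grouped[order["customer_id"]].append(order["amount"])
    return [
        {
            "customer_id": cid,
            "order_count": len(amounts),
            "total_amount": sum(amounts),
            "max_amount": max(amounts),
        }
        for cid, amounts in sorted(grouped.items())
    ]


customer_rows = to_customer_grain(orders)
```

If the decision were instead made per order, you would keep the order-level rows and skip this roll-up entirely.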

My Top 10 Data Prep Tips

Even if more data cleansing and feature engineering tasks are automated in the future, business subject matter expertise and data preparation creativity will likely remain key model performance differentiators. Since the quality of automated machine learning model output depends on the quality of input, here are a few of my favorite data preparation tips to help you build better models.

  1. Choose a level of granularity for your outcome metric that supports actionable decision-making with the predictive output.
  2. Predictive algorithms assume that each record is independent and unrelated. If relationships do exist between records, create a new derived variable called a feature to capture data relationships.
  3. When selecting predictor variables, keep in mind that you want to gather a maximum amount of information from a minimum number of variables. This helps you avoid the curse of dimensionality without overfitting or underfitting.
  4. Decide how to deal with outliers. Some algorithms, such as regression, are sensitive to them because outliers affect the standard deviation used in statistical significance calculations. Confirm whether the outlying data is relevant and real. If you expect similar values to occur again, do not remove those points. Alternatively, consider reducing outlier influence by using transformations.
  5. For missing values, decide whether to delete the affected records or impute a likely or expected value. If you impute a mean, you may reduce the standard deviation, so a distribution-based imputation approach is often more reliable. As you deal with missing values, do not lose the initial context. A common approach is to add a column to the row that flags that the value was missing.
  6. Machine learning algorithms assume input information is correct. Treat incorrect values as missing if there are only a few. If there are a lot of inaccurate values, try to determine what happened to repair them.
  7. Where possible, reduce variable skew, typically by applying a transformation function that has a disproportionate effect on the tails of the distribution, such as a logarithm.
  8. Avoid using high-cardinality fields that contain a very large number of distinct values.
  9. To avoid collinearity issues, do not use duplicate, redundant or other highly correlated variables that carry the same information or live in the same hierarchy.
  10. Creating features from several combined variables and ratios often improves model accuracy more than any single-variable transformation because of the information gain associated with these interactions. Learn to love ratios.
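A few of these tips are easier to see in code. The sketch below touches tips 5, 7 and 10 using plain Python. The rows, column names and the `prepare` helper are hypothetical. The median fill is a simplifying assumption on my part: like mean imputation it shrinks the spread, so a distribution-based approach may still be preferable. The sketch also assumes incomes are positive.

```python
import math
import statistics

# Hypothetical rows, invented for illustration; None marks a missing value.
rows = [
    {"income": 40000.0, "debt": 10000.0},
    {"income": None, "debt": 5000.0},
    {"income": 60000.0, "debt": 30000.0},
    {"income": 1000000.0, "debt": 20000.0},
]


def prepare(rows):
    """Apply tips 5, 7 and 10 and return new rows (inputs are not modified)."""
    known = [r["income"] for r in rows if r["income"] is not None]
    fill = statistics.median(known)  # assumption: median fill, a simplification
    prepared = []
    for r in rows:
        missing = r["income"] is None
        income = fill if missing else r["income"]
        prepared.append({
            "income": income,
            "income_missing": int(missing),        # tip 5: keep the missing-value context
            "log_income": math.log1p(income),      # tip 7: compress the long right tail
            "debt_to_income": r["debt"] / income,  # tip 10: ratio feature (income assumed > 0)
        })
    return prepared


prepared = prepare(rows)
```

The `income_missing` flag lets a model learn from the fact that a value was absent, while `debt_to_income` combines two raw variables into one feature that neither carries on its own.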