In Getting Started with Python [Part 1], you were introduced to the popular Python language for data analysis. Then in Getting Started with Python [Part 2], we covered pandas, Python's data analysis library. In this article, we will explore predictive analytics with Python using scikit-learn.

Introducing scikit-learn

The scikit-learn project started roughly ten years ago as a Google Summer of Code project and has since grown to more than 30 active contributors, with paid sponsorships from INRIA, Google, Tinyclues and the Python Software Foundation. It provides simple and efficient tools for supervised and unsupervised learning. Extensions or modules for SciPy are conventionally called SciKits, which is how scikit-learn got its name. The scikit-learn interface is in Python, though C libraries are used under the hood to optimize performance.


Scikit-learn is built on NumPy, SciPy (Scientific Python) and matplotlib, and sits within the broader scientific Python stack:

  • NumPy: Base n-dimensional array package
  • SciPy: Fundamental library for scientific computing
  • Matplotlib: Comprehensive 2D/3D plotting
  • IPython: Enhanced interactive console
  • Sympy: Symbolic mathematics
  • Pandas: Data structures and analysis

The predictive modeling options provided by scikit-learn span the entire CRISP-DM lifecycle. Here are a few highlighted features.

  • Algorithms: Naive Bayes, linear models, neural networks, support vector machines, clustering and decision trees
  • Cross validation: for estimating the performance of supervised models on unseen data (see the short sketch after this list)
  • Datasets: for test datasets and for generating datasets with specific properties for investigating model behavior
  • Dimensionality reduction: for reducing the number of attributes in data for summarization, visualization and feature selection, e.g. principal component analysis (PCA)
  • Ensemble methods: for combining the predictions of multiple supervised models
  • Feature extraction: for defining attributes in image and text data
  • Feature selection: for identifying meaningful attributes from which to create supervised models
  • Parameter tuning: for getting the most out of supervised models
  • Manifold learning: for summarizing and depicting complex multi-dimensional data
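
To make the cross-validation item concrete, here is a minimal sketch; the choice of a logistic regression classifier on the Iris data and five folds is my own illustration, not from the original:

    from sklearn import datasets
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Load a small labeled dataset to score against
    iris = datasets.load_iris()

    # 5-fold cross validation: train on 4 folds, score on the held-out fold
    model = LogisticRegression(max_iter=200)
    scores = cross_val_score(model, iris.data, iris.target, cv=5)
    print("Fold accuracies:", scores)
    print("Mean accuracy: %.3f" % scores.mean())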

Let’s dive in and see what it can do.

Simple scikit-learn example

To better understand how to start using scikit-learn, let's walk through a simple PCA example using Iris, a dataset popular in data science tutorials. The Iris data contains sepal and petal measurements for three different iris species: Setosa, Versicolour and Virginica. It is stored as a 150×4 numpy.ndarray (150 flowers by 4 measurements).

Let's launch a Jupyter notebook and copy in the code from the simple PCA example to run it. In the code snippet, note that we import matplotlib for rendering chart output. We then import sklearn, the library we are using for this exercise, along with its datasets module to get access to the Iris data.

For our algorithm we will use principal component analysis (PCA), a technique used to emphasize variation and bring out strong patterns in a dataset. It is often used to reduce dimensionality in data mining exercises. A super cool PCA explanation and example can be found at http://setosa.io/ev/principal-component-analysis/.

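Here is a minimal sketch of that setup, assuming a current scikit-learn installation (the variable names X and y are my own):

    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.decomposition import PCA

    # Load the Iris data that ships with scikit-learn
    iris = datasets.load_iris()
    X = iris.data    # 150x4 array of sepal/petal measurements
    y = iris.target  # species labels encoded as 0, 1, 2
    print(X.shape)   # (150, 4)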

After importing the sample data, we'll explore it visually using a plot.

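A plausible version of that exploratory plot, scattering the first two feature columns (sepal length and width) and coloring the points by species:

    # Continuing from the snippet above (X, y and plt already defined)
    plt.figure(figsize=(8, 6))
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor="k")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.title("Iris measurements by species")
    plt.show()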

Now let's run the PCA algorithm with n_components=3 and see what it returns to us in a plot.

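One way to write that step, projecting the four measurements down to three principal components and plotting the result in 3D (again continuing from the snippets above):

    from mpl_toolkits.mplot3d import Axes3D  # registers matplotlib's 3D projection

    # Reduce the four measurements to three principal components
    pca = PCA(n_components=3)
    X_reduced = pca.fit_transform(X)

    fig = plt.figure(figsize=(8, 6))
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(X_reduced[:, 0], X_reduced[:, 1], X_reduced[:, 2],
               c=y, cmap=plt.cm.Set1, edgecolor="k", s=40)
    ax.set_title("Iris after PCA: first three principal components")
    ax.set_xlabel("1st component")
    ax.set_ylabel("2nd component")
    ax.set_zlabel("3rd component")
    plt.show()

    # Fraction of total variance captured by each component
    print(pca.explained_variance_ratio_)

On Iris, the first component alone typically explains the large majority of the variance, which is why the species separate so cleanly in the projected space.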

Note that the principal components are the eigenvectors of the data's covariance matrix, so don't be confused if you see that term. There are countless other examples we could walk through, but I'll save those for future posts. In the meantime, enjoy exploring the resources I have provided below; they also contain many scikit-learn code samples to get you started.

For Further Learning

Here are a few of my favorite resources for learning more about scikit-learn and predictive modeling with Python.