In Getting Started with Python [Part 1], you were introduced to the popular Python language for data analysis. Now we are going to play with pandas, Python's data analysis library. pandas is a Python package that provides fast, flexible, and easy-to-read data structures, along with computational functions for data analysis. Since most of my readers have experience with the SQL query language, I am going to relate pandas programming to SQL statements.
The pandas package supports:
- Data loading and saving to and from files, databases, or PyTables/HDF5 formats
- Data structures with labeled axes
- Data correction, cleansing, aggregation and transformation
- Data joining, appending, removing, filtering, and querying
- Arithmetic, linear regression and statistical computations
Data structures in pandas are flexible containers for dimensional data. Two of pandas' primary data structures are the Series, a single-typed array, and the DataFrame. DataFrames, conceptually similar to database tables and R data frames, can contain more than one data type and act as containers for Series. Panels, in turn, are containers for DataFrame objects.
A DataFrame is a tabular data structure that has a set of ordered columns and rows. It can alternatively be imagined as a group of Series objects that share an index (column names). A DataFrame can be created from different data structures such as a list of dictionaries, a dictionary of Series, or a record array.
To create a DataFrame, we can import or read data into it. The pandas read_csv function offers over 40 parameters for controlling the data loading process, from setting the delimiter type and assigning a header to skipping rows and handling errors.
- sep: Delimiter type between columns, by default is set to comma (csv)
- dtype: Data type for data or columns
- header: Sets row numbers to use as the column names
- skiprows: Number of lines (or a list of line numbers) to skip at the start of the file
- error_bad_lines: Raises an exception on malformed rows by default; if set to False, bad lines are skipped instead
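Here is a minimal sketch of several of those parameters in action. The pipe-delimited data is an inline stand-in for a real file on disk, and the column names are illustrative only.

```python
import io
import pandas as pd

# Inline sample standing in for a file on disk.
raw = "# comment line\nID|Age|Gender\n1|34|Male\n2|41|Female\n"

df = pd.read_csv(
    io.StringIO(raw),
    sep="|",                 # pipe-delimited instead of the default comma
    skiprows=1,              # skip the leading comment line
    header=0,                # first remaining row holds the column names
    dtype={"Age": "int64"},  # force a column's data type
)
print(df.shape)  # (2, 3)
```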
pandas also has support for reading and writing a DataFrame directly from or to a database; the read_frame and write_frame functions from older pandas releases have since been replaced by read_sql and to_sql.
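A round trip against an in-memory SQLite database sketches the idea, using the current read_sql/to_sql names (the table and column names here are made up for illustration):

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")

# Write a DataFrame to a database table, then read it back with a query.
pd.DataFrame({"ID": [1, 2], "Age": [34, 41]}).to_sql("buyers", conn, index=False)
buyers = pd.read_sql("SELECT * FROM buyers WHERE Age > 35", conn)
print(len(buyers))  # 1
```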
To get started using pandas, launch a Jupyter (formerly IPython) notebook.
In the first cell, import the pandas and numpy libraries.
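A minimal first cell, using the conventional aliases:

```python
# Conventional aliases used throughout the pandas and numpy communities.
import pandas as pd
import numpy as np

print(pd.__version__)
```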
Just like we did last time, let's read and view the BikeBuyer CSV file.
The head and tail functions can be used to view a small sample of data. By default, both return five rows, but you can pass a different number.
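A sketch of loading and sampling the data. An inline stand-in is used here in place of the BikeBuyer file, and its column names are assumptions, not the file's real schema:

```python
import io
import pandas as pd

# Inline stand-in for BikeBuyer.csv; the columns are illustrative.
raw = "ID,Age,Gender,Education\n" + "\n".join(
    f"{i},{20 + i},Male,Bachelors" for i in range(10)
)
bikes_df = pd.read_csv(io.StringIO(raw))

print(bikes_df.head())   # first five rows by default
print(bikes_df.tail(3))  # last three rows
```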
In SQL, you would use a SELECT statement to view and limit records.
In pandas, you use the following syntax.
Calling the DataFrame without a list of column names displays all columns, just like SELECT * FROM tablename in SQL. Filtering in SQL is done via a WHERE clause, for example WHERE Gender = 'Male'.
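The column-selection syntax can be sketched like this, on a toy DataFrame with made-up columns:

```python
import pandas as pd

bikes_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [34, 41, 28],
    "Education": ["Bachelors", "Graduate", "High School"],
})

# SELECT Gender, Age FROM bikes_df  ->  pass a list of column names
subset = bikes_df[["Gender", "Age"]]

# SELECT * FROM bikes_df LIMIT 2   ->  head(2)
first_two = bikes_df.head(2)
print(subset.columns.tolist())  # ['Gender', 'Age']
```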
In pandas, DataFrames can be filtered with boolean indexing.
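For example, the WHERE Gender = 'Male' filter becomes a boolean mask used as an index (toy data again):

```python
import pandas as pd

bikes_df = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "Age": [34, 41, 28]})

# SELECT * FROM bikes_df WHERE Gender = 'Male'
males = bikes_df[bikes_df["Gender"] == "Male"]
print(len(males))  # 2
```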
Multiple conditions can be passed in both SQL and pandas DataFrame using OR and AND.
To handle missing data or NULL values as you would in SQL, pandas provides the isnull() and notnull() functions, which return boolean masks rather than removing anything. The function that actually removes data is dropna(): by default, it drops any row containing a missing value, so be careful when using it. In data science, missing or null values can be crucial to your analysis or algorithm. A better way to control missing values is to use the supported parameters of these functions, or to use the fillna() function to substitute a custom value for missing values. For example, bikes_df.fillna(-1).
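The three approaches side by side, on a toy DataFrame with one missing income:

```python
import numpy as np
import pandas as pd

bikes_df = pd.DataFrame({"ID": [1, 2, 3], "Income": [52000.0, np.nan, 78000.0]})

mask = bikes_df["Income"].isnull()  # boolean mask, like SQL's IS NULL
dropped = bikes_df.dropna()         # removes rows containing any missing value
filled = bikes_df.fillna(-1)        # substitutes a sentinel value instead
print(int(mask.sum()), len(dropped))  # 1 2
```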
Aggregate counts, produced in SQL with GROUP BY and COUNT(*), are produced in pandas with the groupby() method.
Note that in the pandas code the size() function was used, not count(). pandas count() applies to each column separately, returning the number of non-null records in each. We could have applied count() to an individual column and it would work as expected. Grouping by more than one column is done by passing a list of columns to the groupby() method.
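Both forms can be sketched on toy data:

```python
import pandas as pd

bikes_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "Education": ["Bachelors", "Graduate", "Bachelors", "Graduate"],
})

# SELECT Gender, COUNT(*) FROM bikes_df GROUP BY Gender
counts = bikes_df.groupby("Gender").size()

# Grouping by more than one column: pass a list to groupby()
pair_counts = bikes_df.groupby(["Gender", "Education"]).size()
print(counts["Male"])  # 3
```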
pandas has a variety of statistical functions for aggregating data into summarized information, such as mean, sum, or quantile, and supports a large number of methods to compute them.
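A few of those statistics, both on a whole column and per group via agg():

```python
import pandas as pd

ages = pd.DataFrame({"Gender": ["Male", "Female", "Male"], "Age": [30, 40, 50]})

mean_age = ages["Age"].mean()          # column-wide average
median_age = ages["Age"].quantile(0.5) # 50th percentile (the median)

# Several aggregates per group, like SQL's AVG() and SUM() with GROUP BY
stats = ages.groupby("Gender")["Age"].agg(["mean", "sum"])
print(mean_age, median_age)  # 40.0 40.0
```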
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices. Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or the columns to join on (column names or indices).
Here are several generic samples from the docs for joining data.
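A generic merge() sketch along those lines, covering three of the join types:

```python
import pandas as pd

left = pd.DataFrame({"key": ["A", "B", "C"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["B", "C", "D"], "rval": [4, 5, 6]})

inner = pd.merge(left, right, on="key", how="inner")      # INNER JOIN
left_join = pd.merge(left, right, on="key", how="left")   # LEFT OUTER JOIN
full = pd.merge(left, right, on="key", how="outer")       # FULL OUTER JOIN
print(len(inner), len(left_join), len(full))  # 2 3 4
```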
UNION ALL can be performed using concat(). SQL's UNION is similar, but it also removes duplicate rows, which a plain concat() does not. To reproduce UNION in pandas, use concat() in conjunction with drop_duplicates().
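Both variants on toy data:

```python
import pandas as pd

df1 = pd.DataFrame({"City": ["Chicago", "Boston"]})
df2 = pd.DataFrame({"City": ["Boston", "Denver"]})

union_all = pd.concat([df1, df2])                # UNION ALL: duplicates kept
union = pd.concat([df1, df2]).drop_duplicates()  # UNION: duplicates removed
print(len(union_all), len(union))  # 4 3
```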
Additional pandas query functions, such as ranking, top-N, sorting, and ORDER BY equivalents, are covered in the docs.
Also don’t forget to use the TAB key to find available functions.
In the final sample for today, I’ll query my pandas DataFrame for all records where Age is under 50 and sort my results by Education. Now I’d better get back to my billable work. This is all the free time I have for fun stuff this week.
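That final query can be sketched as follows, with a toy DataFrame standing in for the BikeBuyer data:

```python
import pandas as pd

bikes_df = pd.DataFrame({
    "Age": [55, 34, 41, 62],
    "Education": ["Graduate", "Bachelors", "High School", "Bachelors"],
})

# SELECT * FROM bikes_df WHERE Age < 50 ORDER BY Education
result = bikes_df[bikes_df["Age"] < 50].sort_values("Education")
print(result["Education"].tolist())  # ['Bachelors', 'High School']
```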
For Further Learning
In the next Getting Started with Python series article, I'll dig into predictive analytics with Python. One of the most common questions I get is: why Python or R? I must tell you, when I talk to big data talent they do seem to prefer Python or Scala with Apache Spark over R. If you do like R, H2O.ai is certainly a compelling option that I covered in my Big Data Analytics session last December.
To learn more about Python pandas for data analysis, check out the following resources.
Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython
Python Pandas Tutorials
Pandas Cookbook Tutorials and Data Sets