Python is a splendid, flexible, open source language that is easy to learn, easy to use, and has powerful libraries for data analysis and data science. I have been meaning to write this article for over a year now. It is looooooong overdue. Time flies when you’re having fun exploring the constantly evolving world of analytics. Along with the weeds in my garden, there seems to be an endless list of neglected article topics piling up. Python is one of them! Without any further delay, let’s get started.
Earlier this year, I wrote about top programming languages to learn. Python consistently ranked highly. In analytics and big data realms, it is one of the most popular programming languages in the world. Python is a general-purpose programming language with rich libraries such as scikit-learn for analytical and quantitative computing. It is used in scientific computing and highly quantitative domains such as finance, oil and gas, and physics.
Python can be downloaded and installed from the official Python website https://www.python.org. There are several IDEs and distributions for it, including PyCharm, Spyder, Python Tools for Visual Studio, and Continuum Analytics Anaconda.
Although I have tried PyCharm and liked it, I currently use the free version of Anaconda by Continuum Analytics. Anaconda Navigator’s UX makes it simple to manage analytic environments (even RStudio), launch interactive Jupyter (IPython) notebooks, and find samples, training material, and community events.
It is also simple to set up, install (conda install x), or update (conda update x) commonly used analytics packages such as these:
- Pandas makes it easy to work with data in tables called DataFrames. If you have worked with R or Spark in the past, data frames in Python are similar.
- NumPy is used for scientific computing. It is fast but not as easy as Pandas.
- SciPy has statistics functions. It is also used for mathematics, science, and engineering functions.
- Statsmodels is for statisticians. It has functions for exploring data and performing descriptive statistics.
- Matplotlib is a plotting library for the Python programming language.
- Seaborn is used with Matplotlib for better-looking visualizations.
- Scikit-learn is for data science. It includes functions for preprocessing, supervised and unsupervised machine learning algorithms, model selection, and more.
- Bokeh is a Python interactive visualization library.
- Anaconda datashader is for big data visualizations. This totally awesome library can be used in conjunction with Bokeh. It is amazing! With datashader, you can visualize millions or billions of points with no downsampling required. Stay tuned: datashader will get a dedicated blog post soon. Check out the white paper by Dr. James Bednar.
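To make the difference between the first two libraries in this list a bit more concrete, here is a minimal sketch, assuming NumPy and pandas are installed. The names and ages are made-up values for illustration only.

```python
import numpy as np
import pandas as pd

# NumPy: a fast, fixed-type array with vectorized math.
ages = np.array([34, 45, 29, 52])  # hypothetical values
print(ages.mean())   # average of the array
print(ages * 2)      # element-wise arithmetic, no loop needed

# pandas: the same numbers in a labeled table (a DataFrame).
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cara", "Dev"],  # hypothetical labels
    "age": ages,
})

# Filter rows with a readable condition, then summarize a column by name.
over_30 = people[people["age"] > 30]
print(over_30)
print(people["age"].mean())
```

NumPy gives you raw speed on arrays, while pandas adds column names and row filtering on top of it, which is why pandas tends to feel easier for day-to-day data work.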
Many getting-started articles walk through the Python 2 versus Python 3 decision. I’ll let you look that up and decide for yourself; I have been staying far away from comparison articles these days.
Other good resources that I found when getting started with Python include:
- Python online docs and tutorials: https://docs.python.org/3.6/tutorial/ (Official tutorial)
- How to use IPython interactive notebooks (Official tutorial)
- Pandas tutorials and data sets (Excellent!)
- NumPy tutorials
- O’Reilly Python for Data Analysis by Wes McKinney and related material on GitHub.
- O’Reilly Python Data Science Handbook by Jake VanderPlas and the free related ebook called A Whirlwind Tour of Python with interactive online notebooks on GitHub.
- Python Machine Learning tutorials
- Other nice IPython notebook examples
Python in Action
To see Python working, you can run commands in a command prompt, an IDE, or a Jupyter notebook. I personally enjoy using the IPython notebooks in Jupyter. Let’s start with the programming classic, “Hello World”.
In Anaconda Navigator, click Launch on the Jupyter option. Then navigate to Files and choose New > Python 3 kernel.
When your first notebook is displayed, I highly recommend navigating to Help > User Interface Tour first to see how to interact with the notebook, run and cancel commands, get help, etc.
Now go ahead and type in print("Hello World") and press Ctrl-Enter or click the arrow button to run the code in that cell. The result of your code or an error message will be displayed below.
You could continue to type in more Python code snippets or open example notebooks (*.ipynb files) and run them with File > Open. I find running samples
Since “Hello World” is not thrilling for my data-savvy analytics audience, let’s see a few analytics-related examples. In future articles in this series, I’ll cover Python analytics libraries in more depth.
Here is an example of loading a CSV file into a Python data frame and exploring it. If you’d like to follow along, I put the file on a public Amazon S3 share for you to download.
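A minimal sketch of that step, assuming pandas is installed. The downloadable file is not reproduced here, so this block reads a small made-up CSV from an in-memory string. Apart from Age, the column names and all values are hypothetical stand-ins. With a real file, you would pass its path or URL to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the downloadable CSV file.
csv_text = """Name,Age,City
Ann,34,Tampa
Bob,45,Austin
Cara,29,Denver
Dev,52,Boston
"""

# read_csv accepts a path, a URL, or a file-like object and returns a DataFrame.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows of the table
print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # column types inferred by pandas
df.info()          # column names, non-null counts, memory usage
```

head(), shape, dtypes, and info() are usually the first things worth checking after a load, since they reveal missing values and wrongly inferred types early.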
Here are descriptive statistics for that data frame using describe(), along with the average of a specific data frame column, Age.
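As a rough sketch of what that looks like, assuming pandas is installed: the data frame below is built inline with made-up values, and only the Age column name comes from this article.

```python
import pandas as pd

# Hypothetical data; only the "Age" column name is taken from the article.
df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev"],
    "Age": [34, 45, 29, 52],
})

# describe() summarizes numeric columns: count, mean, std, min, quartiles, max.
summary = df.describe()
print(summary)

# Select one column by name and compute its average.
average_age = df["Age"].mean()
print(average_age)
```

By default describe() skips non-numeric columns such as Name, so the summary here covers Age only.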
Here are several data visualization examples using the Python data frame.
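One possible version of such plots, sketched with pandas and Matplotlib (both assumed to be installed). The Age and Income values and the output file name age_plots.png are hypothetical. In a Jupyter notebook you would normally drop the backend line and let the figure render inline.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the script runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Age": [34, 45, 29, 52, 41, 38, 60, 25],
    "Income": [52000, 64000, 48000, 81000, 59000, 55000, 90000, 40000],
})

# Two side-by-side charts drawn from the same data frame.
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram of one column.
df["Age"].plot.hist(bins=5, ax=ax_hist)
ax_hist.set_title("Age distribution")

# Scatter plot of two columns against each other.
df.plot.scatter(x="Age", y="Income", ax=ax_scatter)
ax_scatter.set_title("Age vs. Income")

fig.tight_layout()
fig.savefig("age_plots.png")  # hypothetical output file name
```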
Last but not least, here is a cool data science sample using scikit-learn.
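The original sample is not reproduced here, so the following is a stand-in sketch rather than the exact example: a k-nearest neighbors classifier trained on the iris data set that ships with scikit-learn. The choice of model, the 25% test split, and the random seed are my assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Load the bundled iris data set: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to test the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Fit a simple supervised model (assumed choice: 5 nearest neighbors).
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Predict species for the held-out rows and score the result.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2f}")
```

The same load, split, fit, predict, score pattern applies to most scikit-learn estimators, so swapping in a different model usually means changing only the import and the constructor line.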
I hope that was helpful and inspires you to learn Python, get it installed, and at least get it functional for now. In Part 2 of this series, I will cover the Pandas library in more detail.