What is Pandas?
Files associated with this lesson:
utils.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def apply_theme():
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_colwidth', 1000)
    pd.set_option('display.float_format', '{:,.3f}'.format)

    flatui = ["#2e86de", "#ff4757", "#feca57", "#2ed573", "#ff7f50", "#00cec9", "#fd79a8", "#a4b0be"]
    flatui_palette = sns.color_palette(flatui)
    sns.palplot(flatui_palette)
    sns.set_palette(flatui_palette)

    sns.set_style("darkgrid", {
        'axes.edgecolor': '#2b2b2b',
        'axes.facecolor': '#2b2b2b',
        'axes.labelcolor': '#919191',
        'figure.facecolor': '#2b2b2b',
        'grid.color': '#545454',
        'patch.edgecolor': '#2b2b2b',
        'text.color': '#bababa',
        'xtick.color': '#bababa',
        'ytick.color': '#bababa'
    })
Lecture.ipynb
What is Pandas?¶
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
The pandas package is probably the most important tool for Data Scientists and Analysts working with Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data-related projects.
Fun fact 🎁: pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia
pandas' popularity has grown exponentially in recent years. Here's an image from The Atlas showing the popularity of data science tools on Stack Overflow, where we can see that pandas has become the dominant tool among Python data scientists.
What is pandas used for?¶
If you're thinking about data science as a career, then it is imperative that one of the first things you do is learn pandas.
This tool will help you get, clean, transform and analyze your data.
For example, say you want to explore a dataset stored in a CSV on your computer. The first step is to use pandas to extract the data from that CSV into a DataFrame (a table-like data structure; we'll see more about it later). Then we proceed with the routine data analysis tasks (a minimal sketch of this workflow follows the list below):
- Quick Exploratory Data Analysis (EDA);
- Calculating statistics such as the average, median, max, or min of each column;
- Creating visualizations: plotting bars, lines, histograms, bubbles, and more;
- Cleaning the data, for example by removing missing values or filtering rows and columns by some criteria;
- Building machine learning models to make predictions or classifications;
- Storing the cleaned, transformed data back into a CSV, another file format, or a database.
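Here's that sketch: a rough, hedged outline of such a workflow, where the file name and column names are hypothetical and purely for illustration.
import pandas as pd
# load the CSV into a DataFrame (hypothetical file and column names)
df = pd.read_csv('sales.csv')
# quick exploratory look at the data
df.head()
df.describe()
# statistics on a single column
df['price'].mean()
df['price'].median()
# basic cleaning: drop missing values and filter rows by a criterion
clean = df.dropna()
clean = clean[clean['price'] > 0]
# store the cleaned, transformed data back into a CSV
clean.to_csv('sales_clean.csv', index=False)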
Why not just use Excel?¶
Excel is one of the most popular and widely used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts to sales VPs to CEOs, professionals use Excel for everything from quick stats and accounting to serious data crunching.
Using pandas with Microsoft Excel can give you the best of both worlds and optimize your workflow.
Pandas works on data stored in Python, which lets you manipulate and analyze it programmatically. Unlike Excel, Python is completely free to download and use.
Pandas operates right on top of Python. As a result, it is extremely fast and efficient, with methods that let you automate data processing tasks far better than Excel can, including processing Excel files themselves.
In Excel, once you exceed 50K rows, it starts to slow down considerably. Pandas, on the other hand, has no real limit and handles millions of data points seamlessly. In terms of pure space, Excel caps a single spreadsheet at 1,048,576 rows exactly. At that point, your calculations would take forever to compute. More likely, Excel would just crash. A million rows may seem like a lot of data, but for data scientists, this is but a drop in the bucket.
Pandas, however, has no limit on the number of data points you can have in a DataFrame (its version of a data set); it's limited only by the amount of memory (RAM) of the computer it's running on.
It is also easier to create and apply complex equations and calculations to your data. You can apply hundreds of computations to millions of data points instantly with pandas. And since Python is open source, there are already hundreds of libraries available that can cut down the time your calculations take.
Hands on!¶
We'll just import pandas along with other useful libraries such as numpy, matplotlib and seaborn to work with.
Note that to import pandas and numpy we use the aliases pd and np. This is just a convention, which means it's not strictly necessary, but it is recommended.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from utils import apply_theme
%matplotlib inline
apply_theme()
NumPy and pandas¶
Pandas is built on top of the NumPy package, which means that the efficient structures and functions we saw for NumPy in previous lessons also apply to pandas.
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed to work with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical (possibly multidimensional) arrays.
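For instance, here's a small illustrative snippet (not part of the lecture's dataset) showing that a DataFrame keeps one dtype per column, while a NumPy array built from the same mixed values is forced into a single dtype:
import numpy as np
import pandas as pd
# a DataFrame keeps a separate dtype per column: int64, object and float64 here
mixed = pd.DataFrame({'id': [1, 2, 3],
                      'name': ['a', 'b', 'c'],
                      'score': [0.5, 0.7, np.nan]})
mixed.dtypes
# the same values in a NumPy array are upcast to a single dtype (object)
np.array([[1, 'a', 0.5], [2, 'b', 0.7], [3, 'c', np.nan]]).dtype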
Overview Data Structures - Series and Dataframe¶
To get started with pandas, you will need to get comfortable with its two main data structures: Series and DataFrames.
A Series is essentially used for column data, and a DataFrame is a multi-dimensional table made up of a collection of Series. Pandas relies on NumPy arrays to store this data, which means it also uses NumPy's data types.
DataFrames and Series are quite similar in that many operations you can do with one you can also do with the other, such as filling in null values and calculating the mean.
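For example (a quick sketch with made-up numbers, using the pd and np aliases imported above), both structures expose the same methods for exactly those two tasks:
s = pd.Series([1.0, np.nan, 3.0])
small_df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
# filling in null values works the same way on both
s.fillna(0)
small_df.fillna(0)
# and so does calculating the mean (per column on the DataFrame)
s.mean()
small_df.mean()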
Let's define some data within Python lists:
names = ['Avery Bradley', 'John Holland', 'Jonas Jerebko',
'Jordan Mickey', 'Terry Rozier', 'Jared Sullinger', 'Evan Turner']
teams = ['Boston Celtics', 'Boston Celtics', 'Boston Celtics',
'Boston Celtics', 'Boston Celtics', 'Boston Celtics', 'Boston Celtics']
numbers = [0, 30, 8, np.nan, 12, 7, 11]
names
teams
numbers
Series creation¶
my_series = pd.Series(names, name='Name')
my_series.to_frame()
Each value can be accessed by its key/index position on the Series, either with plain indexing or with .loc:
my_series[3]
my_series.loc[3]
DataFrame creation¶
There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.
data = {
    'Name': names,
    'Team': teams,
    'Number': numbers
}
my_df = pd.DataFrame(data)
my_df
Each value can be accessed using its column name and index position on a DataFrame:
my_df['Name']
my_df['Name'][3]
my_df.loc[3, 'Name']
In future lectures we'll see more on locating and extracting data from a DataFrame, so don't worry if you don't get it all right now.
Let's move on to some quick methods for creating DataFrames from various other sources.
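Before turning to external files, note that a DataFrame can also be built from other in-memory structures; here are a couple of quick, hedged examples (made-up values, reusing names from the lists above):
# from a list of dicts, one dict per row
pd.DataFrame([{'Name': 'Avery Bradley', 'Number': 0},
              {'Name': 'John Holland', 'Number': 30}])
# from a NumPy array, supplying the column labels explicitly
pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])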
Reading external data¶
pandas allows us to read different types of external data files, such as CSV, TXT and XLS.
With CSV files all you need is a single line to load in the data:
df = pd.read_csv('bitcoin_data.csv')
df.head()
Also, there are many options when loading data. For example, CSVs don't have an index like our DataFrames do, so we'll designate the index_col when reading (parsing it as dates with parse_dates) and then keep only the columns from Open through Close:
df = pd.read_csv(
    'bitcoin_data.csv',
    index_col=0,
    parse_dates=True
).loc[:, 'Open':'Close']
df.head()
fig, ax = plt.subplots(figsize=(16, 6))
df.plot(ax=ax)
plt.title("Bitcoin price (USD)", fontsize=16, fontweight='bold', color='white')
Plotting example: Bollinger bands¶
As a sneak peek of what we'll see in upcoming lectures, let's make some basic plots using pandas.
Bollinger Bands are a technical trading tool created by John Bollinger in the early 1980s. They arose from the need for adaptive trading bands and the observation that volatility was dynamic, not static as was widely believed at the time.
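As a reference (this is the standard textbook definition, not something specific to this lecture), the bands around an $N$-period rolling mean are:
$$\text{Bollinger High} = \text{MA}_N + k\,\sigma_N \qquad \text{Bollinger Low} = \text{MA}_N - k\,\sigma_N$$
where $\text{MA}_N$ is the rolling mean over the last $N$ periods, $\sigma_N$ is the rolling standard deviation over the same window, and $k$ is the chosen number of standard deviations; below we use $N = 30$ and $k = 1.5$.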
Calculate Bollinger bands¶
To demonstrate the strategy we will use a 30-period rolling mean window, and 1.5 standard deviations for each of the bands. This might not be the optimal configuration for this dataset, but we will talk more about optimizing these two arguments later.
# set the rolling lookback window (number of periods) and the number of
# standard deviations to use for the Bollinger band calculation
window = 30
no_of_std = 1.5
# calculate rolling mean and standard deviation
rolling_mean = df['Close'].rolling(window).mean()
rolling_std = df['Close'].rolling(window).std()
# create new DataFrame columns to hold the rolling mean and the upper and lower Bollinger bands
df['Rolling Mean'] = rolling_mean
df['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
df['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)
df.tail()
fig, ax = plt.subplots(figsize=(16, 6))
df[['Close','Bollinger High','Bollinger Low']].plot(ax=ax)
plt.title("Bitcoin - Bollinger bands (USD)", fontsize=16, fontweight='bold', color='white')
Check out the blog post we wrote about Bollinger bands here!
bitcoin_data.csv
 | Timestamp | Open | High | Low | Close | Volume (BTC) | Volume (Currency) | Weighted Price
---|---|---|---|---|---|---|---|---
0 | 1/1/17 0:00 | 966.34 | 1005.00 | 960.53 | 997.75 | 6850.59 | 6764742.06 | 987.47 |
1 | 1/2/17 0:00 | 997.75 | 1032.00 | 990.01 | 1012.54 | 8167.38 | 8273576.99 | 1013.00 |
2 | 1/3/17 0:00 | 1011.44 | 1039.00 | 999.99 | 1035.24 | 9089.66 | 9276500.31 | 1020.56 |
3 | 1/4/17 0:00 | 1035.51 | 1139.89 | 1028.56 | 1114.92 | 21562.46 | 23469644.96 | 1088.45 |
4 | 1/5/17 0:00 | 1114.38 | 1136.72 | 885.41 | 1004.74 | 36018.86 | 36211399.53 | 1005.35 |
5 | 1/6/17 0:00 | 1004.73 | 1026.99 | 871.00 | 893.89 | 27916.70 | 25523261.28 | 914.26 |
6 | 1/7/17 0:00 | 894.02 | 907.05 | 812.28 | 906.20 | 20401.11 | 17624310.02 | 863.89 |
7 | 1/8/17 0:00 | 906.20 | 941.81 | 881.30 | 909.75 | 8937.49 | 8168170.35 | 913.92 |
8 | 1/9/17 0:00 | 909.80 | 912.87 | 875.00 | 896.23 | 8716.18 | 7780059.06 | 892.60 |
9 | 1/10/17 0:00 | 896.09 | 912.47 | 889.41 | 905.05 | 8535.52 | 7704271.20 | 902.61 |