What is Pandas?
Files associated with this lesson:
utils.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
def apply_theme():
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)
    pd.set_option('display.max_colwidth', 1000)
    pd.set_option('display.float_format', '{:,.3f}'.format)

    flatui = ["#2e86de", "#ff4757", "#feca57", "#2ed573", "#ff7f50", "#00cec9", "#fd79a8", "#a4b0be"]
    flatui_palette = sns.color_palette(flatui)
    sns.palplot(flatui_palette)
    sns.set_palette(flatui_palette)

    sns.set_style("darkgrid", {
        'axes.edgecolor': '#2b2b2b',
        'axes.facecolor': '#2b2b2b',
        'axes.labelcolor': '#919191',
        'figure.facecolor': '#2b2b2b',
        'grid.color': '#545454',
        'patch.edgecolor': '#2b2b2b',
        'text.color': '#bababa',
        'xtick.color': '#bababa',
        'ytick.color': '#bababa'
    })
Lecture.ipynb
What is Pandas?¶
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
The pandas package is probably the most important tool for Data Scientists and Analysts working with Python today. The powerful machine learning and glamorous visualization tools may get all the attention, but pandas is the backbone of most data-related projects.
Fun fact 🎁: pandas is derived from the term "panel data", an econometrics term for data sets that include observations over multiple time periods for the same individuals. — Wikipedia
pandas' popularity has grown exponentially in recent years. Here's an image from The Atlas showing the popularity of data science tools on Stack Overflow, where we can see that pandas has become the dominant tool among Python data scientists.
What is pandas used for?¶
If you're thinking about data science as a career, then it is imperative that one of the first things you do is learn pandas.
This tool will help you get, clean, transform and analyze your data.
For example, say you want to explore a dataset stored in a CSV on your computer. The first step is to use pandas to extract the data from that CSV into a DataFrame (a table-like data structure; we'll see more about it later). Then we proceed with the routine data analysis tasks (a minimal sketch of this workflow follows the list below):
- Quick Exploratory Data Analysis (EDA);
- Calculating statistics such as the average, median, max, or min of each column;
- Creating visualizations: plotting bars, lines, histograms, bubbles, and more;
- Cleaning the data, for example by removing missing values or filtering rows and columns by some criteria;
- Building machine learning models to make predictions or classifications;
- Storing the cleaned, transformed data back into a CSV, another file format, or a database.
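Here's that sketch: a rough, hedged outline of such a workflow, where the file name and column names are hypothetical and purely for illustration.
import pandas as pd
# load the CSV into a DataFrame (hypothetical file and column names)
df = pd.read_csv('sales.csv')
# quick exploratory look at the data
df.head()
df.describe()
# statistics on a single column
df['price'].mean()
df['price'].median()
# basic cleaning: drop missing values and filter rows by a criterion
clean = df.dropna()
clean = clean[clean['price'] > 0]
# store the cleaned, transformed data back into a CSV
clean.to_csv('sales_clean.csv', index=False)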
Why not just use Excel?¶
Excel is one of the most popular and widely used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts to sales VPs to CEOs, professionals use Excel for everything from quick stats and accounting to serious data crunching.
Using pandas with Microsoft Excel can give you the best of both worlds and optimize your workflow.
Pandas works on data stored in Python, which lets you manipulate and analyze it programmatically. Unlike Excel, Python is completely free to download and use.
Pandas operates right on top of Python. As a result, it is extremely fast and efficient, with methods that let you automate data processing tasks far better than Excel can, including processing Excel files themselves.
In Excel, once you exceed 50K rows, it starts to slow down considerably. Pandas, on the other hand, has no real limit and handles millions of data points seamlessly. In terms of pure space, Excel caps a single spreadsheet at 1,048,576 rows exactly. At that point, your calculations would take forever to compute. More likely, Excel would just crash. A million rows may seem like a lot of data, but for data scientists, this is but a drop in the bucket.
Pandas, however, has no limit on the number of data points you can have in a DataFrame (its version of a data set); it's limited only by the amount of memory (RAM) of the computer it's running on.
It is also easier to create and apply complex equations and calculations to your data. You can apply hundreds of computations to millions of data points instantly with pandas. And since Python is open source, there are already hundreds of libraries available that can cut down the time your calculations take.
Hands on!¶
We'll just import pandas along with other useful libraries such as numpy, matplotlib and seaborn to work with.
Note that to import pandas and numpy we use the aliases pd and np. This is just a convention, which means it's not strictly necessary, but it is recommended.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from utils import apply_theme
%matplotlib inline
apply_theme()
NumPy and pandas¶
Pandas is built on top of the NumPy package, which means that the efficient structures and functions we saw for NumPy in previous lessons also apply to pandas.
While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed to work with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical (possibly multidimensional) arrays.
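For instance, here's a small illustrative snippet (not part of the lecture's dataset) showing that a DataFrame keeps one dtype per column, while a NumPy array built from the same mixed values is forced into a single dtype:
import numpy as np
import pandas as pd
# a DataFrame keeps a separate dtype per column: int64, object and float64 here
mixed = pd.DataFrame({'id': [1, 2, 3],
                      'name': ['a', 'b', 'c'],
                      'score': [0.5, 0.7, np.nan]})
mixed.dtypes
# the same values in a NumPy array are upcast to a single dtype (object)
np.array([[1, 'a', 0.5], [2, 'b', 0.7], [3, 'c', np.nan]]).dtype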
Overview Data Structures - Series and Dataframe¶
To get started with pandas, you will need to get comfortable with its two main data structures: Series and DataFrames.
A Series is essentially used for column data, and a DataFrame is a multi-dimensional table made up of a collection of Series. Pandas relies on NumPy arrays to store this data, which means it also uses NumPy's data types.
DataFrames and Series are quite similar in that many operations you can do with one you can also do with the other, such as filling in null values and calculating the mean.
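For example (a quick sketch with made-up numbers, using the pd and np aliases imported above), both structures expose the same methods for exactly those two tasks:
s = pd.Series([1.0, np.nan, 3.0])
small_df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, np.nan]})
# filling in null values works the same way on both
s.fillna(0)
small_df.fillna(0)
# and so does calculating the mean (per column on the DataFrame)
s.mean()
small_df.mean()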
Let's define some data within Python lists:
names = ['Avery Bradley', 'John Holland', 'Jonas Jerebko',
'Jordan Mickey', 'Terry Rozier', 'Jared Sullinger', 'Evan Turner']
teams = ['Boston Celtics', 'Boston Celtics', 'Boston Celtics',
'Boston Celtics', 'Boston Celtics', 'Boston Celtics', 'Boston Celtics']
numbers = [0, 30, 8, np.nan, 12, 7, 11]
names
teams
numbers
Series creation¶
my_series = pd.Series(names, name='Name')
my_series.to_frame()
Each value can be accessed by its key/index position on the Series, either with plain indexing or with .loc:
my_series[3]
my_series.loc[3]
DataFrame creation¶
There are many ways to create a DataFrame from scratch, but a great option is to just use a simple dict.
data = {
    'Name': names,
    'Team': teams,
    'Number': numbers
}
my_df = pd.DataFrame(data)
my_df
Each value can be accessed using its column name and index position on a DataFrame:
my_df['Name']
my_df['Name'][3]
my_df.loc[3, 'Name']
In future lectures we'll see more on locating and extracting data from a DataFrame, so don't worry if you don't get it all right now.
Let's move on to some quick methods for creating DataFrames from various other sources.
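Before turning to external files, note that a DataFrame can also be built from other in-memory structures; here are a couple of quick, hedged examples (made-up values, reusing names from the lists above):
# from a list of dicts, one dict per row
pd.DataFrame([{'Name': 'Avery Bradley', 'Number': 0},
              {'Name': 'John Holland', 'Number': 30}])
# from a NumPy array, supplying the column labels explicitly
pd.DataFrame(np.arange(6).reshape(3, 2), columns=['x', 'y'])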
Reading external data¶
pandas allows us to read different types of external data files, such as CSV, TXT and XLS.
With CSV files all you need is a single line to load in the data:
df = pd.read_csv('bitcoin_data.csv')
df.head()
Also, there are many options when loading data. For example, CSVs don't have an index like our DataFrames do, so we'll designate the index_col when reading (parsing it as dates with parse_dates) and then keep only the columns from Open through Close:
df = pd.read_csv(
    'bitcoin_data.csv',
    index_col=0,
    parse_dates=True
).loc[:, 'Open':'Close']
df.head()
fig, ax = plt.subplots(figsize=(16, 6))
df.plot(ax=ax)
plt.title("Bitcoin price (USD)", fontsize=16, fontweight='bold', color='white')
Plotting example: Bollinger bands¶
As a sneak peek of what we'll see in upcoming lectures, let's make some basic plots using pandas.
Bollinger Bands are a technical trading tool created by John Bollinger in the early 1980s. They arose from the need for adaptive trading bands and the observation that volatility was dynamic, not static as was widely believed at the time.
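As a reference (this is the standard textbook definition, not something specific to this lecture), the bands around an $N$-period rolling mean are:
$$\text{Bollinger High} = \text{MA}_N + k\,\sigma_N \qquad \text{Bollinger Low} = \text{MA}_N - k\,\sigma_N$$
where $\text{MA}_N$ is the rolling mean over the last $N$ periods, $\sigma_N$ is the rolling standard deviation over the same window, and $k$ is the chosen number of standard deviations; below we use $N = 30$ and $k = 1.5$.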
Calculate Bollinger bands¶
To demonstrate the strategy we will use a 30-period rolling mean window, and 1.5 standard deviations for each of the bands. This might not be the optimal configuration for this dataset, but we will talk more about optimizing these two arguments later.
# set the rolling lookback window (number of periods) and the number of
# standard deviations to use for the Bollinger band calculation
window = 30
no_of_std = 1.5
# calculate rolling mean and standard deviation
rolling_mean = df['Close'].rolling(window).mean()
rolling_std = df['Close'].rolling(window).std()
# create new DataFrame columns to hold the rolling mean and the upper and lower Bollinger bands
df['Rolling Mean'] = rolling_mean
df['Bollinger High'] = rolling_mean + (rolling_std * no_of_std)
df['Bollinger Low'] = rolling_mean - (rolling_std * no_of_std)
df.tail()
fig, ax = plt.subplots(figsize=(16, 6))
df[['Close','Bollinger High','Bollinger Low']].plot(ax=ax)
plt.title("Bitcoin - Bollinger bands (USD)", fontsize=16, fontweight='bold', color='white')
Check out the blog post we wrote about Bollinger bands here!
bitcoin_data.csv
 | Timestamp | Open | High | Low | Close | Volume (BTC) | Volume (Currency) | Weighted Price
---|---|---|---|---|---|---|---|---
0 | 1/1/17 0:00 | 966.34 | 1005.00 | 960.53 | 997.75 | 6850.59 | 6764742.06 | 987.47 |
1 | 1/2/17 0:00 | 997.75 | 1032.00 | 990.01 | 1012.54 | 8167.38 | 8273576.99 | 1013.00 |
2 | 1/3/17 0:00 | 1011.44 | 1039.00 | 999.99 | 1035.24 | 9089.66 | 9276500.31 | 1020.56 |
3 | 1/4/17 0:00 | 1035.51 | 1139.89 | 1028.56 | 1114.92 | 21562.46 | 23469644.96 | 1088.45 |
4 | 1/5/17 0:00 | 1114.38 | 1136.72 | 885.41 | 1004.74 | 36018.86 | 36211399.53 | 1005.35 |
5 | 1/6/17 0:00 | 1004.73 | 1026.99 | 871.00 | 893.89 | 27916.70 | 25523261.28 | 914.26 |
6 | 1/7/17 0:00 | 894.02 | 907.05 | 812.28 | 906.20 | 20401.11 | 17624310.02 | 863.89 |
7 | 1/8/17 0:00 | 906.20 | 941.81 | 881.30 | 909.75 | 8937.49 | 8168170.35 | 913.92 |
8 | 1/9/17 0:00 | 909.80 | 912.87 | 875.00 | 896.23 | 8716.18 | 7780059.06 | 892.60 |
9 | 1/10/17 0:00 | 896.09 | 912.47 | 889.41 | 905.05 | 8535.52 | 7704271.20 | 902.61 |