Reading CSV and TXT files
Files associated with this lesson:
Lecture.ipynb
Rather than creating Series or DataFrame structures from scratch, or even from Python core sequences or ndarrays, the most typical use of pandas is based on loading information from files or other sources of information for further exploration, transformation and analysis.
In this lecture we'll learn how to read comma-separated values files (.csv) and raw text files (.txt) into pandas DataFrames.
Hands on!¶
import pandas as pd
Reading data with Python¶
As we saw in previous courses, we can read data simply using Python.
When you want to work with a file, the first thing to do is to open it. This is done by invoking the open() built-in function.
open() has a single required argument, the path to the file, and a single return value, the file object.
The with statement automatically takes care of closing the file once it leaves the with block, even in cases of error.
filepath = 'btc-market-price.csv'
with open(filepath, 'r') as reader:
    print(reader)
Once the file is opened, we can read its content as follows:
filepath = 'btc-market-price.csv'
with open(filepath, 'r') as reader:
    for index, line in enumerate(reader.readlines()):
        # read just the first 10 lines
        if index < 10:
            print(index, line)
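Splitting each line by hand quickly gets tedious. As a quick sketch, Python's built-in csv module can split the fields for us before we even bring pandas in:
import csv

filepath = 'btc-market-price.csv'
with open(filepath, 'r') as reader:
    # csv.reader splits each line into a list of fields
    for index, row in enumerate(csv.reader(reader)):
        if index < 10:
            print(index, row)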
Reading data with Pandas¶
Reading data from files is probably one of the most recurrent tasks in data analysis: public data sources, logs, historical information tables, exports from databases. So the pandas library offers us functions to read and write files in multiple formats like CSV, JSON, XML and Excel's XLSX, all of them creating a DataFrame with the information read from the file.
We'll learn how to read different types of data, including:
- CSV files (.csv)
- Raw text files (.txt)
- JSON data from a file and from an API
- Data from a SQL query over a database
There are many other reading functions available, such as read_json, read_excel, read_html and read_sql, all following the same pattern.
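As a quick sketch of their call signatures (the file names and the connection object below are hypothetical):
# each call returns a DataFrame built from the given source
pd.read_json('data.json')    # JSON file (hypothetical path)
pd.read_excel('data.xlsx')   # Excel XLSX file (hypothetical path)
pd.read_html('page.html')    # returns a list of DataFrames, one per HTML table
pd.read_sql('SELECT * FROM prices', conn)  # conn: a hypothetical DB connection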
The read_csv method¶
The first method we'll learn is read_csv, which lets us read comma-separated values (CSV) files and raw text (TXT) files into a DataFrame.
The read_csv function is extremely powerful: you can specify a very broad set of parameters at import time that allow us to accurately configure how the data will be read and parsed, by specifying the correct structure, encoding and other details. The most common parameters are as follows:
- filepath: Path of the file to be read.
- sep: Character(s) used as the field separator in the file.
- header: Index of the row containing the names of the columns (None if none).
- index_col: Index of the column, or sequence of indexes, that should be used as the row index of the data.
- names: Sequence containing the names of the columns (used together with header=None).
- skiprows: Number of rows, or sequence of row indexes, to ignore in the load.
- na_values: Sequence of values that, if found in the file, should be treated as NaN.
- dtype: Dictionary in which the keys are column names and the values are NumPy types to which their content must be converted.
- parse_dates: Flag that indicates if Python should try to parse columns with a date-like format as dates. You can also pass a list of the columns that must be combined and parsed as a date.
- date_parser: Function to use to parse dates.
- nrows: Number of rows to read from the beginning of the file.
- skipfooter: Number of rows to ignore at the end of the file.
- encoding: Encoding expected of the file read.
- squeeze: Flag that indicates that if the data read contains only one column, the result should be a Series instead of a DataFrame.
- thousands: Character to use to detect the thousands separator.
- decimal: Character to use to detect the decimal separator.
- skip_blank_lines: Flag that indicates whether blank lines should be ignored.
The full read_csv documentation can be found here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html.
In this case we'll try to read our btc-market-price.csv file using different parameters to parse it correctly. This file contains records of the mean price of Bitcoin per date.
Reading our first CSV file¶
Every time we call the read_csv method, we'll need to pass an explicit filepath parameter indicating the path where our CSV file is.
Any valid string path is acceptable, and the string could also be a URL. Valid URL schemes include HTTP, FTP, S3, and file. For file URLs, a host is expected; a local file could be: file://localhost/path/to/table.csv.
For example, we can use the read_csv method to load data directly from a URL:
csv_url = "https://raw.githubusercontent.com/datasets/gdp/master/data/gdp.csv"
pd.read_csv(csv_url).head()
Or just use a local file:
df = pd.read_csv('btc-market-price.csv')
df.head()
In this case we let pandas infer everything related to our data, but in most cases we'll need to explicitly tell pandas how we want our data to be loaded. To do that we use parameters. Let's see how these parameters work.
First row behavior with the header parameter¶
The CSV file we're reading has only two columns: Timestamp and Price. It doesn't have a header row, so pandas automatically assigned the first row of data as the header, which is incorrect. We can override this behavior with the header parameter.
df = pd.read_csv('btc-market-price.csv',
                 header=None)
df.head()
Missing values with the na_values parameter¶
We can define a na_values parameter with the values we want to be recognized as NA/NaN. In this case the empty string '', '?' and '-' will be recognized as null values.
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'])
df.head()
Column names using the names parameter¶
We'll add the column names using the names parameter.
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'])
df.head()
Column types using the dtype parameter¶
Without the dtype parameter, pandas will try to figure out the type of each column automatically. We can use the dtype parameter to force pandas to use a certain type.
In this case we'll force the Price column to be float.
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'})
df.head()
df.dtypes
The Timestamp column was interpreted as a regular string (object in pandas notation). We can parse it manually using a vectorized operation, as we saw in previous courses.
We'll parse the Timestamp column into Datetime objects using the to_datetime method:
pd.to_datetime(df['Timestamp']).head()
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df.head()
df.dtypes
Date parsing using the parse_dates parameter¶
Another way of dealing with Datetime objects is using the parse_dates parameter with the positions of the columns containing dates.
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'},
                 parse_dates=[0])
df.head()
df.head()
df.dtypes
Adding an index to our data using the index_col parameter¶
By default, pandas will automatically assign a numeric auto-incremental index, or row label, starting with zero. You may want to leave the default index as such if your data doesn't have a column with unique values that could serve as a better index. In case there is such a column, you can override the default behavior by setting the index_col parameter to a column. It takes a numeric value representing the index position, or a string with the column name, for setting a single column as index; or a list of numeric values or strings for creating a multi-index.
In our data, we are choosing the first column, Timestamp, as the index (index=0) by passing zero to the index_col argument.
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'},
                 parse_dates=[0],
                 index_col=[0])
df.head()
df.head()
df.dtypes
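As mentioned above, index_col also accepts column names. A sketch of the same read, referencing the column we named instead of its position:
df = pd.read_csv('btc-market-price.csv',
                 header=None,
                 na_values=['', '?', '-'],
                 names=['Timestamp', 'Price'],
                 dtype={'Price': 'float'},
                 parse_dates=[0],
                 index_col='Timestamp')  # by name instead of position
df.head()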
A more challenging parsing¶
Now we'll read another CSV file. This file has the following columns:
- first_name
- last_name
- age
- math_score
- french_score
- next_test_date
Let's read it and see what it looks like.
exam_df = pd.read_csv('exam_review.csv')
exam_df
Custom data delimiters using the sep parameter¶
We can define which delimiter to use with the sep parameter. If we don't set it, pandas will use a comma (,) by default.
Most CSV files use a comma as the separator, so the default just works. But we can find files with other separators like semicolons (;), tabs (\t, especially in TSV files), whitespace or any other special character.
In this case the separator is a > character.
exam_df = pd.read_csv('exam_review.csv',
                      sep='>')
exam_df
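As another sketch, reading a hypothetical tab-separated version of the same data would only require changing the separator:
# exam_review.tsv is a hypothetical tab-separated file
pd.read_csv('exam_review.tsv', sep='\t')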
Custom data encoding¶
Files are stored using different "encodings". You've probably heard about ASCII, UTF-8, latin1, etc.
While reading data, a custom encoding can be defined with the encoding parameter.
- encoding='UTF-8': will be used if data is UTF-8 encoded.
- encoding='iso-8859-1': will be used if data is ISO/IEC 8859-1 ("extended ASCII") encoded.
In our case we don't need a custom encoding, as the data is properly loaded.
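If it weren't, the call would look like this (a sketch, assuming a hypothetical latin-1 encoded file):
pd.read_csv('exam_review.csv',
            sep='>',
            encoding='iso-8859-1')  # only needed if the file were latin-1 encoded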
Custom numeric decimal and thousands characters¶
The decimal and thousands characters can change between datasets. If a column uses a comma (,) to indicate the decimal or thousands place, that column would be considered a string and not numeric.
exam_df = pd.read_csv('exam_review.csv',
                      sep='>')
exam_df
exam_df[['math_score', 'french_score']].dtypes
To solve that, ensuring such columns are interpreted as numeric values, we'll need to use the decimal and/or thousands parameters to indicate the correct decimal and/or thousands indicators.
exam_df = pd.read_csv('exam_review.csv',
                      sep='>',
                      decimal=',')
exam_df
exam_df[['math_score', 'french_score']].dtypes
Let's see what happens with the thousands parameter:
pd.read_csv('exam_review.csv',
            sep='>',
            thousands=',')
Excluding specific rows¶
We can use the skiprows parameter to:
- Exclude reading a specified number of rows from the beginning of a file, by passing an integer argument. This removes the header too.
- Skip reading specific row indices from a file, by passing a list containing row indices to skip.
exam_df = pd.read_csv('exam_review.csv',
                      sep='>',
                      decimal=',')
exam_df
To skip reading the first 2 rows from this file, we can use skiprows=2:
pd.read_csv('exam_review.csv',
            sep='>',
            skiprows=2)
As the header is considered the first row, to skip reading data rows 1 and 3 we can use skiprows=[1,3]:
exam_df = pd.read_csv('exam_review.csv',
                      sep='>',
                      decimal=',',
                      skiprows=[1,3])
exam_df
Get rid of blank lines¶
The skip_blank_lines parameter is set to True by default, so blank lines are skipped while we read files.
If we set this parameter to False, then every blank line will be loaded with NaN values into the DataFrame.
pd.read_csv('exam_review.csv',
            sep='>',
            skip_blank_lines=False)
Loading specific columns¶
We can use the usecols parameter when we want to load just specific columns and not all of them.
Performance-wise it is better because, instead of loading an entire DataFrame into memory and then deleting the unneeded columns, we can select the columns we need while loading the dataset.
As a parameter to usecols, you can pass either a list of strings corresponding to the column names or a list of integers corresponding to the column indexes.
pd.read_csv('exam_review.csv',
            usecols=['first_name', 'last_name', 'age'],
            sep='>')
Or using just the column positions:
pd.read_csv('exam_review.csv',
            usecols=[0, 1, 2],
            sep='>')
Using a Series instead of a DataFrame¶
If the parsed data contains only one column, we can return a Series by setting the squeeze parameter to True.
exam_test_1 = pd.read_csv('exam_review.csv',
                          sep='>',
                          usecols=['last_name'])
type(exam_test_1)
exam_test_2 = pd.read_csv('exam_review.csv',
                          sep='>',
                          usecols=['last_name'],
                          squeeze=True)
type(exam_test_2)
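Note that in recent pandas versions the squeeze parameter was deprecated and eventually removed; there, an equivalent is to squeeze the single-column DataFrame after reading it:
# same result on newer pandas: squeeze the one-column DataFrame into a Series
exam_test_3 = pd.read_csv('exam_review.csv',
                          sep='>',
                          usecols=['last_name']).squeeze('columns')
type(exam_test_3)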
Save to CSV file¶
Finally, we can also save our DataFrame as a CSV file.
exam_df
We can simply generate a CSV string from our DataFrame:
exam_df.to_csv()
Or specify a file path where we want the generated CSV to be saved:
exam_df.to_csv('out.csv')
pd.read_csv('out.csv')
Note that the index was saved in the file as an unnamed first column. To leave it out, we can set the index parameter to None (or False):
exam_df.to_csv('out.csv',
               index=None)
pd.read_csv('out.csv')
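to_csv accepts many of the same formatting parameters as read_csv. For instance, a sketch that writes the file back with the original > separator and no index:
# write using '>' as the field separator, without the index column
exam_df.to_csv('out.csv',
               sep='>',
               index=False)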
btc-market-price.csv (raw contents of the first rows; note there is no header row and missing prices appear as ? or -):
2/4/17 0:00,1099.169125
3/4/17 0:00,1141.813
4/4/17 0:00,?
5/4/17 0:00,1133.079314
6/4/17 0:00,-
7/4/17 0:00,-
8/4/17 0:00,1181.149838
9/4/17 0:00,1208.8005
10/4/17 0:00,1207.744875
11/4/17 0:00,1226.617038
12/4/17 0:00,1218.92205