Pragmatic filter with DataFrame.query
Files associated with this lesson:
Lecture.ipynb
Pragmatic filtering with DataFrame.query()
¶
We'll explore another way of filtering data using the query
method of DataFrame
s.
IMPORTANT: The query
method is only functional on Pandas >= 0.25. Check out your Pandas version:
import pandas as pd
pd.__version__
Hands on!¶
import numpy as np
The first thing we'll do is create again the DataFrame
from our previous lecture:
df = pd.DataFrame({
'Population': [35.467, 63.951, 80.94 , 60.665, 127.061, 64.511, 318.523],
'GDP': [1785387, 2833687, 3874437, 2167744, 4602367, 2950039, 17348075],
'Surface Area': [9984670, 640679, 357114, 301336, 377930, 242495, 9525067],
'HDI': [0.913, 0.888, 0.916, 0.873, 0.891, 0.907, 0.915],
'Continent': ['America', 'Europe', 'Europe', 'Europe',
'Asia', 'Europe', 'America']
})
df.columns = ['Population', 'GDP', 'Surface Area', 'HDI', 'Continent']
df.index = ['Canada', 'France', 'Germany', 'Italy',
'Japan', 'United Kingdom', 'United States']
df
The query()
method¶
DataFrame.query allows us to write queries and filters with a more "human readable" approach than regular boolean arrays. It's still necessary (and important) to understand Boolean Arrays, but query
will make some filters easier to perform.
In its simplest form, query
receives just one parameter: the expression to query for:
df.query('Population > 70')
We can access column names directly within the expression, and literals (numbers, strings, etc) are still literals; just make sure to surround strings with the correct set of quotes:
df.query('Continent == "Asia"')
Accessing your local environment¶
The query()
method can also access variables defined in the local scope; just prepend them with the @
symbol:
max_population = 70
df.query('Population > @max_population')
desired_continent = 'Asia'
df.query('Continent == @desired_continent')
Columns names that contain white spaces¶
If a column name contain white spaces, just surround it with backticks:
`Column Name`
df.query('`Surface Area` >= 500_000')
Boolean operators¶
If a column name contain white spaces, just surround it with backticks:
`Column Name`
df.query('`Surface Area` >= 500_000 & Continent == "Europe"')
Parentheses can be optionally applied for readability purposes (or to reduce ambiguity):
df.query('(`Surface Area` >= 500_000) & (Continent == "Europe")')
And there's full support for and
, or
and not
operators:
df.query('(`Surface Area` >= 500_000) & not (Continent == "Europe")')