Often we want to select a subset of our data using specific criteria. For example, we may need to look at results for a specific range of years or for a given set of cities.
To select data from the year 2002:
surveys_df[surveys_df['year'] == 2002]
Which produces the following output:
record_id month day year district city tenure hindfoot_length income
33320 33321 1 12 2002 1 DM M 38 44
33321 33322 1 12 2002 1 DO M 37 58
33322 33323 1 12 2002 1 PB M 28 45
33323 33324 1 12 2002 1 AB NaN NaN NaN
33324 33325 1 12 2002 1 DO M 35 29
...
35544 35545 12 31 2002 15 AH NaN NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN NaN
35546 35547 12 31 2002 10 RM F 15 14
35547 35548 12 31 2002 7 DO M 36 51
35548 35549 12 31 2002 5 NaN NaN NaN NaN
[2229 rows x 9 columns]
Or we can select all rows that do not contain the year 2002:
surveys_df[surveys_df.year != 2002]
We can define sets of criteria too:
surveys_df[(surveys_df['year'] >= 1980) & (surveys_df['year'] <= 1985)]
You can use the syntax below when querying data from a DataFrame. Experiment with selecting various subsets of the “surveys” data.

== (equal)
!= (not equal)
> or < (greater than, less than)
>= (greater than or equal to)
<= (less than or equal to)
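As an illustration, the sketch below combines two comparisons with | (OR), the counterpart to the & (AND) operator used above; the years chosen here are arbitrary:

# select rows from either 1999 or 2002
surveys_df[(surveys_df['year'] == 1999) | (surveys_df['year'] == 2002)]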
Challenge: Select a subset of rows in the surveys_df DataFrame that contain data from the year 1999 and that contain income values less than or equal to 20000. How many rows did you end up with? What did your neighbor get?
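One possible solution (a sketch using the year and income columns shown in the output above):

# rows from 1999 with income at or below 20000
surveys_df[(surveys_df['year'] == 1999) & (surveys_df['income'] <= 20000)]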
The isin command in Python can be used to query a DataFrame based upon a list of values as follows:
surveys_df[surveys_df['city'].isin([listGoesHere])]
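For example, using two of the city codes that appear in the output above ('DM' and 'DO'):

# select rows whose city code is either DM or DO
surveys_df[surveys_df['city'].isin(['DM', 'DO'])]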
Challenge: Use the isin function to find all records that contain a particular city in the surveys DataFrame. How many records contain these values?

The ~ symbol in Python can be used to return the OPPOSITE of the selection that you specify. It is equivalent to "is not in". Write a query that selects all rows with a tenure value NOT equal to 'R' or 'O' in the surveys data.

Answer:
surveys_df[~surveys_df['tenure'].isin(['R','O'])]
Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.
For instance, in household surveys, missing data values are often defined as -9999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However, it is good practice to get in the habit of intentionally marking cells that have no data with a no-data value! That way there are no questions in the future when you (or someone else) explores your data.
Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02, we can figure out how many rows contain NaN values for income. We can also create a new subset from our data that only contains rows with income values > 0 (i.e., select only meaningful income values):
# how many rows have missing income values?
len(surveys_df[pd.isnull(surveys_df.income)])

# how many rows have income values greater than 0?
len(surveys_df[surveys_df.income > 0])
We can replace all NaN values with zeroes using the .fillna() method (after making a copy of the data so we don’t lose our work):
df1 = surveys_df.copy()
# fill all NaN values with 0
df1['income'] = df1['income'].fillna(0)
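We can check that the replacement worked; after fillna(0) the income column should contain no NaN values:

# count remaining NaN values in income (should be 0)
df1['income'].isnull().sum()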
However, NaN and 0 yield different analysis results. The mean value when NaN values are replaced with 0 is different from when NaN values are simply thrown out or ignored.
df1['income'].mean()
38.751976145601844
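Compare this with the mean of the original data, where pandas ignores NaN values by default:

# NaN values are skipped when computing the mean
surveys_df['income'].mean()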
We can fill NaN values with any value that we choose. The code below fills all NaN values with the mean of all income values.
df1['income'] = surveys_df['income'].fillna(surveys_df['income'].mean())
We could also choose to create a subset of our data, keeping only rows that do not contain NaN values, as shown below.
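A minimal sketch using dropna (the df_na name is just illustrative); by default, dropna removes any row that contains at least one NaN value:

# keep only rows where every column has a value
df_na = surveys_df.dropna()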
The point is to make conscious decisions about how to manage missing data. This is where we think about how our data will be used and how these values will impact the scientific conclusions made from the data.
Python gives us all of the tools that we need to account for these issues. We just need to be cautious about how the decisions that we make impact scientific results.