Python and Pandas - Part 2
In this article I will follow on from the previous article on Pandas, and we will take a deeper look into how to use this library for further data analysis. We will look at identifying duplicates and tidying up our data to create useful insights later.
We will continue to use the dataset of Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City between 2011 and 2014.
Accessing Columns
We saw in the previous article that we can access DataFrame columns through the loc and iloc indexers, providing either the label or the position of the column.
But there is an alternative way too: you can use the dot (.) operator on the DataFrame and reference the name of the column directly. This returns a Series containing the selected column:
>>> baby_names.BRTH_YR.head()
0 2011
1 2011
2 2011
3 2011
4 2011
Name: BRTH_YR, dtype: int64
It is perfectly fine to stick with either approach. Sometimes I use direct attribute access, other times I go through loc-based slicing.
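For reference, the following forms all select the same column and each returns a Series (a quick sketch, assuming the birth year is the first column, as it is in our dataset):
>>> baby_names.BRTH_YR            # attribute (dot) access
>>> baby_names['BRTH_YR']         # bracket access, also works for names that are not valid identifiers
>>> baby_names.loc[:, 'BRTH_YR']  # label-based access
>>> baby_names.iloc[:, 0]         # position-based access
Keep in mind that dot access only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute or method.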
Typos and Other Entry Errors
This is a hard problem to solve in general, because typos are small mistakes that are easy to miss and hard to fix automatically. However, sometimes you can spot problems in the dataset which are easy to solve, like different spellings of the same value. If you have taken a look at the baby_names.csv file you have probably noticed that the ethnicity comes in different forms:
>>> import numpy as np
>>> np.unique(baby_names.iloc[:, 2])
array(['ASIAN AND PACI', 'ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISP',
'BLACK NON HISPANIC', 'HISPANIC', 'WHITE NON HISP',
'WHITE NON HISPANIC'], dtype=object)
Longer values have been trimmed in some cases. To avoid ambiguities later we will bring every entry to its longer form:
>>> baby_names.loc[baby_names.ETHCTY == 'ASIAN AND PACI', 'ETHCTY'] = 'ASIAN AND PACIFIC ISLANDER'
>>> baby_names.loc[baby_names.ETHCTY == 'BLACK NON HISP', 'ETHCTY'] = 'BLACK NON HISPANIC'
>>> baby_names.loc[baby_names.ETHCTY == 'WHITE NON HISP', 'ETHCTY'] = 'WHITE NON HISPANIC'
>>> np.unique(baby_names.iloc[:, 2])
array(['ASIAN AND PACIFIC ISLANDER', 'BLACK NON HISPANIC', 'HISPANIC',
'WHITE NON HISPANIC'], dtype=object)
To do this, for each of the three values that appear in two flavors we create a selector based on the ethnicity, pass it to the DataFrame's loc, and assign the new value. This replaces the old values with the new ones in place. At the end we verify that the values really were updated and that the short forms are gone from the dataset.
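As a side note, the same normalization can be expressed more compactly with the replace method and a mapping dictionary; a minimal sketch, assuming the same column and values as above:
>>> mapping = {'ASIAN AND PACI': 'ASIAN AND PACIFIC ISLANDER',
...            'BLACK NON HISP': 'BLACK NON HISPANIC',
...            'WHITE NON HISP': 'WHITE NON HISPANIC'}
>>> baby_names.ETHCTY = baby_names.ETHCTY.replace(mapping)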
Another problem is that the names are in all capital letters. We can use the knowledge from the example above to further improve the dataset:
>>> baby_names.NM.head()
0 GERALDINE
1 GIA
2 GIANNA
3 GISELLE
4 GRACE
>>> baby_names.NM = baby_names.NM.str.capitalize()
>>> baby_names.NM.head()
0 Geraldine
1 Gia
2 Gianna
3 Giselle
4 Grace
Name: NM, dtype: object
You can see that we have overwritten the values in the column with their capitalized representations. This helps later on, both to eliminate duplicates and when displaying the data.
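The str accessor offers more vectorized string methods that are handy for this kind of clean-up. For example (a sketch for illustration, not strictly needed for our dataset):
>>> baby_names.NM = baby_names.NM.str.strip()   # drop surrounding whitespace
>>> baby_names.NM = baby_names.NM.str.title()   # capitalize each word, useful for multi-word names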
Identifying Duplicate Data and Clearing the Dataset
This use case is really useful. Every time you get a new dataset you should keep an eye out for duplicated rows, and if you find some, ask your customer what to do with the duplicated information. Most of the time the answer will be to tidy up the dataset and then re-run your analysis.
Duplicates can be identified with the duplicated() method of the DataFrame class. This method returns a Series of True and False values that you can use to display duplicated rows:
>>> baby_names.duplicated().head()
0 False
1 False
2 False
3 False
4 False
dtype: bool
>>> baby_names[baby_names.duplicated()].head()
BRTH_YR GNDR ETHCTY NM CNT RNK
1963 2011 FEMALE WHITE NON HISPANIC Scarlett 31 60
1964 2011 FEMALE WHITE NON HISPANIC Serena 16 75
1965 2011 FEMALE WHITE NON HISPANIC Shaina 16 75
1966 2011 FEMALE WHITE NON HISPANIC Shaindel 15 76
1967 2011 FEMALE WHITE NON HISPANIC Shaindy 48 44
You can see that when we pass the Series containing the identified duplicates to the original DataFrame as a slice argument, we get back all the rows (a subset of the DataFrame) which are duplicated. The excerpt does not show anything interesting; at first glance the values do not look like duplicates. But if we sort the results, we will see that there is a problem:
>>> baby_names[baby_names.duplicated()].sort_values(by='NM').head()
BRTH_YR GNDR ETHCTY NM CNT RNK
2582 2011 FEMALE BLACK NON HISPANIC Aaliyah 69 5
4082 2011 FEMALE BLACK NON HISPANIC Aaliyah 69 5
2766 2011 FEMALE HISPANIC Aaliyah 63 30
4266 2011 FEMALE HISPANIC Aaliyah 63 30
6045 2011 FEMALE BLACK NON HISPANIC Aaliyah 69 5
We can see that there are duplicated rows in the dataset. If we dig a bit deeper we find that all the duplicates come from the 2011 data:
>>> baby_names[baby_names.duplicated()].describe()
BRTH_YR CNT RNK
count 5889.0 5889.000000 5889.000000
mean 2011.0 34.602140 56.494142
std 0.0 40.513511 25.012470
min 2011.0 10.000000 1.000000
25% 2011.0 13.000000 37.000000
50% 2011.0 20.000000 60.000000
75% 2011.0 36.000000 77.000000
max 2011.0 426.000000 97.000000
Now it's time to tidy our list. Fortunately Pandas thought about this too: DataFrames have a method called drop_duplicates() which, as its name suggests, drops duplicates and returns the tidied DataFrame as the result. There are some arguments you can pass to fine-tune the tidying, for example to tidy the data in place, but it is good practice to keep the original dataset around while you are exploring your data, because sometimes you need the original version and it won't be easy to reproduce if you have already applied a lot of transformations.
>>> clean_baby_names = baby_names.drop_duplicates()
>>> clean_baby_names.describe()
BRTH_YR CNT RNK
count 8073.000000 8073.000000 8073.000000
mean 2012.522482 34.480243 57.272761
std 1.118043 39.931078 25.609985
min 2011.000000 10.000000 1.000000
25% 2012.000000 13.000000 38.000000
50% 2013.000000 20.000000 59.000000
75% 2014.000000 36.000000 78.000000
max 2014.000000 426.000000 102.000000
>>> clean_baby_names[clean_baby_names.NM == 'Aaliyah']
BRTH_YR GNDR ETHCTY NM CNT RNK
1110 2011 FEMALE BLACK NON HISPANIC Aaliyah 69 5
1294 2011 FEMALE HISPANIC Aaliyah 63 30
8031 2012 FEMALE BLACK NON HISPANIC Aaliyah 55 10
8202 2012 FEMALE HISPANIC Aaliyah 68 26
10031 2013 FEMALE BLACK NON HISPANIC Aaliyah 73 3
10212 2013 FEMALE HISPANIC Aaliyah 56 33
12072 2014 FEMALE BLACK NON HISPANIC Aaliyah 67 4
12259 2014 FEMALE HISPANIC Aaliyah 55 36
I have assigned the cleaned data to a new variable. Now we see that the data seems to be clean, because the mean of the BRTH_YR column lies between 2012 and 2013, which is the middle of our dataset.
The drop_duplicates() method treats rows as duplicates only if all of their columns have the same values. Therefore we can rest assured that we did not lose any information in the process.
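For completeness, the fine-tuning arguments mentioned earlier look roughly like this (a sketch for illustration only; keep and inplace are standard drop_duplicates parameters, and we deliberately did not use inplace above):
>>> last_kept = baby_names.drop_duplicates(keep='last')   # keep the last occurrence of each duplicate instead of the first
>>> baby_names.drop_duplicates(inplace=True)              # modifies baby_names itself, losing the original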
Alternatively, we can select specific columns to identify duplicates. For example, imagine that a name is entered twice for the same year and ethnicity but with different counts and ranks. In that case these rows would remain in the dataset, but we do not want that because it would distort our analysis. To identify such entries, we can fine-tune our query:
>>> clean_baby_names.duplicated().sum()
0
>>> clean_baby_names.duplicated(['NM', 'BRTH_YR','ETHCTY']).sum()
65
Wow, it seems we have found duplicated data! Let's verify that they are really duplicates before we delete them:
>>> clean_baby_names[clean_baby_names.duplicated(['NM', 'BRTH_YR','ETHCTY'])].sort_values('NM').head()
BRTH_YR GNDR ETHCTY NM CNT RNK
13344 2014 MALE HISPANIC Alexis 40 68
1132 2011 FEMALE BLACK NON HISPANIC Angel 16 39
9551 2012 MALE WHITE NON HISPANIC Ariel 20 84
9268 2012 MALE HISPANIC Ariel 18 92
1340 2011 FEMALE HISPANIC Ariel 17 71
>>> clean_baby_names[(clean_baby_names.NM == 'Alexis') & (clean_baby_names.ETHCTY == 'HISPANIC') & (clean_baby_names.BRTH_YR == 2014)]
BRTH_YR GNDR ETHCTY NM CNT RNK
12272 2014 FEMALE HISPANIC Alexis 13 70
13344 2014 MALE HISPANIC Alexis 40 68
After applying this filter we get only one result for the name Alexis. This is suspicious, so we query all the entries for Alexis where the year is 2014 and the ethnicity is Hispanic. Now we see that there are two results: one for females and one for males. So these are not real duplicates. We could continue checking the remaining names in our list to verify that they are indeed split between MALE and FEMALE, but it is simpler to add the gender column to our duplicate filter:
>>> clean_baby_names.duplicated(['NM', 'BRTH_YR','ETHCTY','GNDR']).sum()
0
OK, no more duplicates. Naturally, if you have good domain knowledge you can identify names which can only be male or female and filter or aggregate them accordingly.
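Had the subset-based check turned up real duplicates, we could remove them the same way by handing the column list to drop_duplicates via its subset parameter; a minimal sketch, assuming our cleaned DataFrame:
>>> deduped = clean_baby_names.drop_duplicates(subset=['NM', 'BRTH_YR', 'ETHCTY', 'GNDR'])
By default keep='first', so the first occurrence of each combination survives.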
Handling NaN and N/A
Sometimes you have to deal with datasets where some fields are not filled in, but you need a value there because your calculations depend on it. Machine Learning is the obvious example, but we do not have to go that far: aggregation functions need values too. To solve this problem Pandas provides the fillna method.
As an example, let's make some values in a copy of our cleaned DataFrame missing:
>>> import numpy as np
>>> na_baby_names = clean_baby_names.copy()
>>> na_baby_names.loc[na_baby_names.NM == 'Riley', 'CNT'] = np.nan
>>> na_baby_names.loc[na_baby_names.NM == 'Avery', 'RNK'] = ''
>>> na_baby_names.describe()
BRTH_YR CNT
count 8073.000000 8049.000000
mean 2012.522482 34.527519
std 1.118043 39.978346
min 2011.000000 10.000000
25% 2012.000000 13.000000
50% 2013.000000 20.000000
75% 2014.000000 36.000000
max 2014.000000 426.000000
>>> na_baby_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR 8073 non-null int64
GNDR 8073 non-null object
ETHCTY 8073 non-null object
NM 8073 non-null object
CNT 8049 non-null float64
RNK 8073 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 441.5+ KB
I have created a copy of the cleaned dataset to keep the original data intact. Now if we take a look at the describe() results we see that the RNK column is missing, because it now holds string values and cannot be used for statistical analysis, and that the count of the CNT column is smaller than the count of the BRTH_YR column because there are some NaN values. Calling the info() method gives us a more detailed description.
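A quick way to count the missing values per column is to combine isnull() with sum(). Note that this only catches real NaN values, so the empty strings we put into RNK do not show up here; a short sketch:
>>> na_baby_names.isnull().sum()   # number of NaN values in each column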
>>> na_baby_names[na_baby_names.CNT.isnull()].head()
BRTH_YR GNDR ETHCTY NM CNT RNK
156 2011 FEMALE HISPANIC Riley NaN 76
473 2011 FEMALE WHITE NON HISPANIC Riley NaN 63
1250 2011 FEMALE BLACK NON HISPANIC Riley NaN 40
1608 2011 MALE HISPANIC Riley NaN 86
1889 2011 MALE WHITE NON HISPANIC Riley NaN 93
To fix this issue we will fill in the missing values with the mean count of each year's names. I know this is not the best approach, but it is better than guessing a value or leaving the fields as NaN:
>>> for y in np.unique(na_baby_names.loc[(na_baby_names.CNT.isnull()), 'BRTH_YR']):
...     na_baby_names.loc[(na_baby_names.CNT.isnull()) & (na_baby_names.BRTH_YR == y), 'CNT'] = na_baby_names.loc[(na_baby_names.BRTH_YR == y), 'CNT'].mean().astype(int)
...
>>> na_baby_names.describe()
BRTH_YR CNT
count 8073.000000 8073.000000
mean 2012.522482 34.525579
std 1.118043 39.918905
min 2011.000000 10.000000
25% 2012.000000 13.000000
50% 2013.000000 20.000000
75% 2014.000000 36.000000
max 2014.000000 426.000000
The solution iterates over the unique years and sets the missing values to the integer result (truncating the decimal part) of the mean of the other counts from the same year.
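The same per-year imputation can be written without an explicit loop by combining groupby, transform and the fillna method mentioned at the beginning of this section; a loop-free sketch that should be equivalent up to the integer truncation:
>>> year_means = na_baby_names.groupby('BRTH_YR')['CNT'].transform('mean')
>>> na_baby_names.CNT = na_baby_names.CNT.fillna(year_means.astype(int))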
Now it is time to handle the empty values in the RNK column. Again there are different approaches depending on your dataset. Here we know that the column should contain only numeric (or, to be more specific, integer) values. As a first attempt we can replace the empty fields with np.nan:
>>> na_baby_names.loc[(na_baby_names.RNK == ''), 'RNK'] = np.nan
>>> na_baby_names.loc[(na_baby_names.RNK == ''), 'RNK']
Series([], Name: RNK, dtype: object)
>>> na_baby_names.loc[(na_baby_names.RNK.isnull()), 'RNK']
249 NaN
723 NaN
936 NaN
7888 NaN
8255 NaN
8552 NaN
9079 NaN
9273 NaN
9559 NaN
9890 NaN
10068 NaN
10273 NaN
10567 NaN
10915 NaN
11085 NaN
11285 NaN
11579 NaN
11926 NaN
12107 NaN
12323 NaN
12611 NaN
13170 NaN
13359 NaN
Name: RNK, dtype: object
Well, this is a very basic approach and only works with the empty string. If there are other non-numeric strings in the column too, we would need to find all of those values and set them to NaN individually, and that is a no-go. Fortunately Pandas has a solution for this:
>>> import pandas as pd
>>> na_baby_names.loc[na_baby_names.NM == 'Avery', 'RNK'] = ''
>>> na_baby_names.RNK = pd.to_numeric(na_baby_names.RNK, errors='coerce')
>>> na_baby_names.loc[(na_baby_names.NM == 'Avery'), 'RNK'].head()
249 NaN
723 NaN
936 NaN
7888 NaN
8255 NaN
Name: RNK, dtype: float64
The to_numeric function converts the values of a Series (or other one-dimensional input) to numeric values. With errors='coerce' it does not raise an exception when a non-numeric value is encountered, like the empty string in our example; such values are simply converted to NaN.
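Now that the column is numeric and the missing entries are real NaN values, we can finally put the fillna method mentioned at the start of this section to use. What to fill in for a missing rank is a judgment call; as a sketch, we could either use a sentinel value or reuse the per-year idea from the CNT column:
>>> na_baby_names.RNK = na_baby_names.RNK.fillna(0)   # 0 as a sentinel for 'rank unknown'
>>> # or: fill with the worst (largest) rank seen in the same year
>>> na_baby_names.RNK = na_baby_names.RNK.fillna(na_baby_names.groupby('BRTH_YR')['RNK'].transform('max'))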
Information on Datasets
We already know how to access basic information such as the shape or the number of elements of a DataFrame. But Pandas offers more insight through the info() method. It also gives us memory usage data, which helps to identify large datasets and to decide, for example, whether to filter in place or purge unnecessary data.
>>> clean_baby_names.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR 8073 non-null int64
GNDR 8073 non-null object
ETHCTY 8073 non-null object
NM 8073 non-null object
CNT 8073 non-null int64
RNK 8073 non-null int64
dtypes: int64(3), object(3)
memory usage: 441.5+ KB
The + symbol indicates that the true memory usage can be higher because Pandas does not count the memory used by values in columns with dtype=object. But if we are keen enough we can tell Pandas to give us the real memory usage. For this we have to provide the memory_usage argument with the value of 'deep' to the info() method:
>>> clean_baby_names.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8073 entries, 0 to 13961
Data columns (total 6 columns):
BRTH_YR 8073 non-null int64
GNDR 8073 non-null object
ETHCTY 8073 non-null object
NM 8073 non-null object
CNT 8073 non-null int64
RNK 8073 non-null int64
dtypes: int64(3), object(3)
memory usage: 1.9 MB
Now we see that our dataset eats up almost 2 MB of space. This is not so bad, but remember, we are working with an example dataset. In real-life projects you can get significantly larger data, and in such cases you have to consider the memory usage of your applications.
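One common way to reduce that footprint is to convert low-cardinality string columns such as GNDR and ETHCTY to the category dtype; a sketch of the idea, assuming the cleaned DataFrame from above:
>>> compact = clean_baby_names.copy()
>>> compact['GNDR'] = compact['GNDR'].astype('category')
>>> compact['ETHCTY'] = compact['ETHCTY'].astype('category')
>>> compact.info(memory_usage='deep')   # should report a noticeably smaller footprint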
Conclusion
We have seen that real datasets come with unintentional flaws. We need to take care and validate the dataset before jumping in and doing analysis. To that end, we have seen how to identify duplicates and clean up missing data.
In the next article we will continue our journey and learn how to aggregate our data. We will use this knowledge to identify and remove outliers in the dataset which could skew our analysis.
Be sure to use DiscoverSDK’s search to find all the best development tools in one place.