Introduction to Pandas - Data Analysis in Python
This article continues our series on data analysis in Python, in which we previously introduced NumPy. Now we will take a look at Pandas, the de facto standard for data handling in Python.
We ran into some limitations while using NumPy; for instance, loading data from a CSV file forced every column's contents to be read as strings if even one column contained a non-numeric entry. Now we will take a step further and see how Pandas solves this problem and what we can do with it.
Some words about Pandas
Pandas aims to integrate the functionality of NumPy and matplotlib to give you a convenient tool for data analytics and visualization. Beyond integrating them, it also makes them more convenient to use.
To install Pandas simply execute the following command on your command line:
pip install pandas
This will download and install Pandas along with its dependencies, such as NumPy. Once it's done, we can start our exploration.
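A quick way to confirm the installation succeeded is to import the package and print its version (the exact string depends on the release you installed):

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version varies by release
print(pd.__version__)
```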
The data structures in Pandas are capable of holding elements of any type: Series, DataFrame and Panel. The common factor is that they are all labeled.
We will use DataFrames most of the time throughout this tutorial series, but here is a brief introduction to each of them:
- Series -- a one-dimensional array-like labeled object, which is capable of holding any type of object.
- DataFrame -- a two-dimensional labeled data structure where columns can be of different types.
- Panel -- a three-dimensional data structure. You can think of one as a dictionary of DataFrames. (Panels have since been deprecated and removed in newer Pandas versions.)
Basic functionality
Let's get started with exploring the basic functionality of Pandas.
>>> import pandas as pd
>>> import numpy as np
>>> series = pd.Series([1,2,3,4,5, np.nan, "a string", 6])
>>> series
0 1
1 2
2 3
3 4
4 5
5 NaN
6 a string
7 6
dtype: object
In the example above we created a Series object containing several different values. One interesting thing to note is that we can mark missing numeric entries with NumPy's nan symbol, a special floating-point value that stands for "not a number". The dtype of the series is object, which reflects its mixed contents, since a string is included.
If we use only numeric values, the series gets a plain NumPy dtype, in this case float64:
>>> series = pd.Series([1,2,np.nan, 4])
>>> series
0 1.0
1 2.0
2 NaN
3 4.0
dtype: float64
It does not matter that the data includes a nan: the series is still treated as a series of numbers.
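Since a Series is labeled, you can also supply your own index and then look values up by label; a short sketch, including a count of the missing entries:

```python
import numpy as np
import pandas as pd

# A Series with custom labels instead of the default 0..n-1 index
s = pd.Series([1.0, 2.0, np.nan, 4.0], index=['a', 'b', 'c', 'd'])

print(s['b'])          # access an element by its label
print(s.isna().sum())  # count the missing (NaN) entries
```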
However, it's DataFrames rather than Series that we are really interested in, because a DataFrame resembles a 2D array, similar to a CSV file or a relational database table:
>>> df = pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3))
>>> df
0 1 2
0 1 2 3
1 4 5 6
>>> df.dtypes
0 int32
1 int32
2 int32
dtype: object
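The int32 dtype above is platform-dependent; on other systems you may see int64. To see that columns really can carry different dtypes within one DataFrame, a quick sketch:

```python
import pandas as pd

# Each column of a DataFrame carries its own dtype
df = pd.DataFrame({'A': [1, 2], 'B': [1.5, 2.5], 'C': ['x', 'y']})
print(df.dtypes)  # A is an integer dtype, B is float64, C is object
```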
By default the rows and columns get numeric indexes, but these can be changed to make the data more meaningful:
>>> df = pd.DataFrame(np.array([1,2,3,4,5,6]).reshape(2,3), columns=list('ABC'), index=list('XY'))
>>> df
A B C
X 1 2 3
Y 4 5 6
As you can see, the index argument provides the labels to use for the rows, while the list passed as the columns argument sets the column labels.
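With labels in place, single rows and columns can be pulled out by name; both come back as Series:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(1, 7).reshape(2, 3),
                  columns=list('ABC'), index=list('XY'))

print(df['B'])      # column 'B' as a Series, indexed by 'X' and 'Y'
print(df.loc['X'])  # row 'X' as a Series, indexed by 'A', 'B', 'C'
```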
For bigger datasets, the head method is useful too. It displays the first n rows, where n is passed as an argument; if you do not provide one, the default of 5 is used:
>>> df2 = pd.DataFrame(np.arange(1, 7501).reshape(500,15))
>>> df2.head(2)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
>>> df2.head()
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
2 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
3 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
4 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
The opposite is the tail method. It shows the last n rows of the DataFrame; if you omit the argument, the default of 5 is used:
>>> df2.tail()
0 1 2 3 4 5 6 7 8 9 10 11 \
495 7426 7427 7428 7429 7430 7431 7432 7433 7434 7435 7436 7437
496 7441 7442 7443 7444 7445 7446 7447 7448 7449 7450 7451 7452
497 7456 7457 7458 7459 7460 7461 7462 7463 7464 7465 7466 7467
498 7471 7472 7473 7474 7475 7476 7477 7478 7479 7480 7481 7482
499 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497
12 13 14
495 7438 7439 7440
496 7453 7454 7455
497 7468 7469 7470
498 7483 7484 7485
499 7498 7499 7500
>>> df2.tail(1)
0 1 2 3 4 5 6 7 8 9 10 11 \
499 7486 7487 7488 7489 7490 7491 7492 7493 7494 7495 7496 7497
12 13 14
499 7498 7499 7500
As you can see, the columns are wrapped across multiple lines if the output gets too wide, but this won't bother us later because we won't print our data to the console; instead we will use visualizations to show what we can gather out of it.
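If the line-wrapping bothers you while exploring interactively, the console width Pandas assumes can be raised through its options mechanism; a small sketch (200 characters is an arbitrary choice):

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(1, 7501).reshape(500, 15))

# Widen the assumed console so wide frames are less likely to wrap
pd.set_option('display.width', 200)
print(df2.tail(1))
```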
Describing data
This is functionality we will use very often when we have a new dataset to analyze:
>>> df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33,25))
>>> df3.describe()
0 1 2 3 4 5 \
count 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000
mean 49.000000 49.120000 49.240000 49.360000 49.480000 49.600000
std 29.008619 29.008619 29.008619 29.008619 29.008619 29.008619
min 1.000000 1.120000 1.240000 1.360000 1.480000 1.600000
25% 25.000000 25.120000 25.240000 25.360000 25.480000 25.600000
50% 49.000000 49.120000 49.240000 49.360000 49.480000 49.600000
75% 73.000000 73.120000 73.240000 73.360000 73.480000 73.600000
max 97.000000 97.120000 97.240000 97.360000 97.480000 97.600000
6 7 8 9 ... 15 \
count 33.000000 33.000000 33.000000 33.000000 ... 33.000000
mean 49.720000 49.840000 49.960000 50.080000 ... 50.800000
std 29.008619 29.008619 29.008619 29.008619 ... 29.008619
min 1.720000 1.840000 1.960000 2.080000 ... 2.800000
25% 25.720000 25.840000 25.960000 26.080000 ... 26.800000
50% 49.720000 49.840000 49.960000 50.080000 ... 50.800000
75% 73.720000 73.840000 73.960000 74.080000 ... 74.800000
max 97.720000 97.840000 97.960000 98.080000 ... 98.800000
16 17 18 19 20 21 \
count 33.000000 33.000000 33.000000 33.000000 33.000000 33.000000
mean 50.920000 51.040000 51.160000 51.280000 51.400000 51.520000
std 29.008619 29.008619 29.008619 29.008619 29.008619 29.008619
min 2.920000 3.040000 3.160000 3.280000 3.400000 3.520000
25% 26.920000 27.040000 27.160000 27.280000 27.400000 27.520000
50% 50.920000 51.040000 51.160000 51.280000 51.400000 51.520000
75% 74.920000 75.040000 75.160000 75.280000 75.400000 75.520000
max 98.920000 99.040000 99.160000 99.280000 99.400000 99.520000
22 23 24
count 33.000000 33.000000 33.000000
mean 51.640000 51.760000 51.880000
std 29.008619 29.008619 29.008619
min 3.640000 3.760000 3.880000
25% 27.640000 27.760000 27.880000
50% 51.640000 51.760000 51.880000
75% 75.640000 75.760000 75.880000
max 99.640000 99.760000 99.880000
[8 rows x 25 columns]
As you can see, calling the describe method on a DataFrame prints a brief statistical summary of each column in the dataset: the number of elements (count), the mean, the standard deviation (std), the minimum and maximum values, and the quartiles in between.
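The figures describe() reports can be cross-checked against the corresponding Series reductions; a quick sanity check on a single column:

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33, 25))
summary = df3.describe()

# describe() rows agree with the per-column reductions
col = df3[0]
print(summary.loc['mean', 0], col.mean())   # the same value, about 49.0
print(summary.loc['50%', 0], col.median())  # the 50% row is the median
```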
Indexing and Slicing
Positional slicing of DataFrames works through the iloc attribute:
>>> df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33,25))
>>> df3.iloc[:5,:10]
0 1 2 3 4 5 6 7 8 9
0 1.0 1.12 1.24 1.36 1.48 1.6 1.72 1.84 1.96 2.08
1 4.0 4.12 4.24 4.36 4.48 4.6 4.72 4.84 4.96 5.08
2 7.0 7.12 7.24 7.36 7.48 7.6 7.72 7.84 7.96 8.08
3 10.0 10.12 10.24 10.36 10.48 10.6 10.72 10.84 10.96 11.08
4 13.0 13.12 13.24 13.36 13.48 13.6 13.72 13.84 13.96 14.08
In the example above we selected the first 5 rows and the first 10 columns. We can replicate the head() and tail() methods, with their default row count of 5, using iloc too:
>>> df3.iloc[-5:] # df3.tail(5)
0 1 2 3 4 5 6 7 8 9 ... \
28 85.0 85.12 85.24 85.36 85.48 85.6 85.72 85.84 85.96 86.08 ...
29 88.0 88.12 88.24 88.36 88.48 88.6 88.72 88.84 88.96 89.08 ...
30 91.0 91.12 91.24 91.36 91.48 91.6 91.72 91.84 91.96 92.08 ...
31 94.0 94.12 94.24 94.36 94.48 94.6 94.72 94.84 94.96 95.08 ...
32 97.0 97.12 97.24 97.36 97.48 97.6 97.72 97.84 97.96 98.08 ...
15 16 17 18 19 20 21 22 23 24
28 86.8 86.92 87.04 87.16 87.28 87.4 87.52 87.64 87.76 87.88
29 89.8 89.92 90.04 90.16 90.28 90.4 90.52 90.64 90.76 90.88
30 92.8 92.92 93.04 93.16 93.28 93.4 93.52 93.64 93.76 93.88
31 95.8 95.92 96.04 96.16 96.28 96.4 96.52 96.64 96.76 96.88
32 98.8 98.92 99.04 99.16 99.28 99.4 99.52 99.64 99.76 99.88
[5 rows x 25 columns]
>>> df3.iloc[:5] # df3.head(5)
0 1 2 3 4 5 6 7 8 9 ... \
0 1.0 1.12 1.24 1.36 1.48 1.6 1.72 1.84 1.96 2.08 ...
1 4.0 4.12 4.24 4.36 4.48 4.6 4.72 4.84 4.96 5.08 ...
2 7.0 7.12 7.24 7.36 7.48 7.6 7.72 7.84 7.96 8.08 ...
3 10.0 10.12 10.24 10.36 10.48 10.6 10.72 10.84 10.96 11.08 ...
4 13.0 13.12 13.24 13.36 13.48 13.6 13.72 13.84 13.96 14.08 ...
15 16 17 18 19 20 21 22 23 24
0 2.8 2.92 3.04 3.16 3.28 3.4 3.52 3.64 3.76 3.88
1 5.8 5.92 6.04 6.16 6.28 6.4 6.52 6.64 6.76 6.88
2 8.8 8.92 9.04 9.16 9.28 9.4 9.52 9.64 9.76 9.88
3 11.8 11.92 12.04 12.16 12.28 12.4 12.52 12.64 12.76 12.88
4 14.8 14.92 15.04 15.16 15.28 15.4 15.52 15.64 15.76 15.88
[5 rows x 25 columns]
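The equivalences noted in the comments above can be checked directly: DataFrame.equals compares two frames element-wise, so we can confirm that the iloc slices and head/tail return the same data.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33, 25))

# head()/tail() with the default of 5 rows match the iloc slices
print(df3.iloc[:5].equals(df3.head()))   # True
print(df3.iloc[-5:].equals(df3.tail()))  # True
```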
But Pandas goes a step further and lets us access data in DataFrames through labels. In this example it's not very spectacular, because the labels of the dataset match the positions. However, with the rename method we can easily rename the columns of the DataFrame and then use those labels to access values:
>>> df4 = df3.rename(columns=lambda c: chr(65+c))
>>> df4.loc[:5, 'A':'D']
A B C D
0 1.0 1.12 1.24 1.36
1 4.0 4.12 4.24 4.36
2 7.0 7.12 7.24 7.36
3 10.0 10.12 10.24 10.36
4 13.0 13.12 13.24 13.36
5 16.0 16.12 16.24 16.36
>>> df4.loc[:5, ('A','D')]
A D
0 1.0 1.36
1 4.0 4.36
2 7.0 7.36
3 10.0 10.36
4 13.0 13.36
5 16.0 16.36
The most interesting part is the lambda function provided for the column names. The rename method expects either a dictionary-like mapping or a function that transforms the labels. Creating a dictionary would bloat the example, so a simple anonymous function is ideal. The chr() function returns the character for the given code point: chr(65) is 'A'. The parameter c is the current label on the axis (in this case a column index), a number from 0 to 24. Also note that, unlike iloc, slicing with loc is inclusive of the end label, which is why row 5 appears in the output above.
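As mentioned, a dictionary-like argument works too; when only a handful of labels need new names it can even be the simpler choice. A minimal sketch (labels not in the mapping are left untouched):

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(1, 100, 0.12).reshape(33, 25))

# Rename just the first two columns; the rest keep their numeric labels
df4 = df3.rename(columns={0: 'A', 1: 'B'})
print(df4.columns[:3].tolist())  # ['A', 'B', 2]
```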
This will come in handy later when we deal with data from CSV files.
Reading CSV files
Now it is time for a more real-world example. We will use the same dataset as in the last NumPy article. If you did not read it, or just do not want to look it up, here is the file available to download: Most Popular Baby Names by Sex and Mother's Ethnic Group, New York City. I saved it as baby_names.csv.
It contains names given to babies in New York City, broken down by gender and ethnicity. Let's start by reading in the file:
>>> import numpy as np
>>> import pandas as pd
>>> baby_names = pd.read_csv('baby_names.csv')
>>> baby_names.head()
BRTH_YR GNDR ETHCTY NM CNT RNK
0 2011 FEMALE HISPANIC GERALDINE 13 75
1 2011 FEMALE HISPANIC GIA 21 67
2 2011 FEMALE HISPANIC GIANNA 49 42
3 2011 FEMALE HISPANIC GISELLE 38 51
4 2011 FEMALE HISPANIC GRACE 36 53
>>> baby_names.tail()
BRTH_YR GNDR ETHCTY NM CNT RNK
13957 2014 MALE WHITE NON HISPANIC Yousef 18 94
13958 2014 MALE WHITE NON HISPANIC Youssef 24 88
13959 2014 MALE WHITE NON HISPANIC Yusuf 16 96
13960 2014 MALE WHITE NON HISPANIC Zachary 90 39
13961 2014 MALE WHITE NON HISPANIC Zev 49 65
As you can see, reading a CSV file is done with the read_csv function. The result is a DataFrame, and unlike with NumPy we did not have to tell Pandas to read all the data as strings. This means columns containing numeric values arrive as numeric types in the DataFrame. We can verify this by looking at the description of the DataFrame:
>>> baby_names.describe()
BRTH_YR CNT RNK
count 13962.000000 13962.000000 13962.000000
mean 2011.880318 34.531657 56.944349
std 1.134940 40.176370 25.361691
min 2011.000000 10.000000 1.000000
25% 2011.000000 13.000000 38.000000
50% 2011.000000 20.000000 59.000000
75% 2013.000000 36.000000 78.000000
max 2014.000000 426.000000 102.000000
You can see that describe() only selects the columns that contain numeric data. Naturally some of the statistics make little sense, like those for the RNK column (only the count is useful) or the mean and std of BRTH_YR.
You may have noticed that the columns in the DataFrame carry labels identical to the headers in the CSV file. This is a nice Pandas feature: you can access a column by its header name without having to know its position.
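Since label-based access works on any DataFrame, here is a self-contained sketch using a small in-memory CSV whose headers mirror the baby-names file (the rows are just the first two from the output above):

```python
import io
import pandas as pd

csv_data = """BRTH_YR,GNDR,ETHCTY,NM,CNT,RNK
2011,FEMALE,HISPANIC,GERALDINE,13,75
2011,FEMALE,HISPANIC,GIA,21,67
"""

df = pd.read_csv(io.StringIO(csv_data))
print(df['NM'].tolist())  # access the name column by its header label
print(df['CNT'].sum())    # numeric columns arrive as numbers, so math works
```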
DataFrames can be sorted along columns or rows, and the shape is maintained after the sort. Now that we have a real dataset, let's try sorting. Let's find the five names with the lowest and the highest counts in the whole dataset (remember: we have the years 2011 to 2014, so the sort finds minima and maxima within each year's data rather than summing the same names across the years):
>>> baby_names.head()
BRTH_YR GNDR ETHCTY NM CNT RNK
0 2011 FEMALE HISPANIC GERALDINE 13 75
1 2011 FEMALE HISPANIC GIA 21 67
2 2011 FEMALE HISPANIC GIANNA 49 42
3 2011 FEMALE HISPANIC GISELLE 38 51
4 2011 FEMALE HISPANIC GRACE 36 53
>>> baby_names.sort_values(by='CNT').head()
BRTH_YR GNDR ETHCTY NM CNT RNK
8744 2012 FEMALE WHITE NON HISP MAE 10 83
6746 2011 FEMALE WHITE NON HISPANIC LEILA 10 81
2389 2011 MALE HISPANIC ALLAN 10 94
11009 2013 MALE ASIAN AND PACIFIC ISLANDER Martin 10 57
11013 2013 MALE ASIAN AND PACIFIC ISLANDER Maximilian 10 57
>>> baby_names.sort_values(by='CNT', ascending=False).head()
BRTH_YR GNDR ETHCTY NM CNT RNK
1504 2011 MALE HISPANIC JAYDEN 426 1
5430 2011 MALE HISPANIC JAYDEN 426 1
7393 2011 MALE HISPANIC JAYDEN 426 1
3505 2011 MALE HISPANIC JAYDEN 426 1
9385 2012 MALE HISPANIC JAYDEN 364 1
As you can see, we have identified a problem in the dataset: there are entries with the same content. This makes the data less usable as it is, and we will have to tidy it up to gain any real insight. But that is part of data science, and we will look at an approach for filtering out the duplicates.
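As a preview of that clean-up, Pandas ships a drop_duplicates method that removes fully identical rows; a minimal sketch on synthetic rows shaped like the JAYDEN entries above:

```python
import pandas as pd

# Synthetic rows mimicking the duplicated JAYDEN entries
df = pd.DataFrame({
    'BRTH_YR': [2011, 2011, 2011, 2012],
    'NM': ['JAYDEN', 'JAYDEN', 'JAYDEN', 'JAYDEN'],
    'CNT': [426, 426, 426, 364],
})

deduped = df.drop_duplicates()
print(len(deduped))  # 2: the three identical 2011 rows collapse to one
```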
Conclusion
We have seen that Pandas is the next step in data analysis with Python after NumPy, because it lets us handle data in a more convenient way. However, we cannot disregard NumPy, because Pandas builds on top of NumPy and matplotlib to give us a single place to do data analysis and visualization. Pandas really shows its strength when we import CSV files with mixed content: we do not have to worry about the conversions ourselves.
In the next article we will rid our data of duplicates and do some data munging, so we can visualize our dataset with matplotlib and provide insights for the user.