Guide: Pandas DataFrames for Data Analysis
Table of Contents
- What Is Pandas and Why Should I Use It?
- What Is a Pandas DataFrame?
- Creating and Viewing a Pandas DataFrame
- Manipulating Data in Your DataFrame
- Conclusion
“Data scientist” is one of the hottest jobs in tech, and Python is the lingua franca of data science. Python’s easy-to-learn syntax, open ecosystem, and strong community have made it one of the fastest-growing languages in recent years.
In this post, we’ll learn about Pandas, a high-performance open-source package for doing data analysis in Python.
We’ll cover:
- What Pandas is and why you should use it.
- What a Pandas DataFrame is.
- Creating and viewing a DataFrame.
- Manipulating data in a DataFrame.
Let’s get started.
What Is Pandas and Why Should I Use It?
Pandas is an open-source library for performing data analysis with Python. It was created by Wes McKinney when he was working for AQR Capital, an investment firm. Wes and AQR Capital open-sourced the project, and its popularity has exploded in the Python community.
A big portion of a data scientist’s time is spent cleaning data, and this is where Pandas really shines. Pandas helps you quickly and efficiently operate on large tables of data.
As an example, imagine you have a large, two-dimensional data set, comparable to an Excel spreadsheet. Your data set has many columns and rows.
You could use Pandas for tasks like the following (a quick sketch appears after the list):
- Setting default values for rows with missing values.
- Merging (or “joining,” in SQL parlance) two separate data sets.
- Filtering your data set based on the values in a particular column.
- Viewing summary statistics, such as mean, standard deviation and percentiles.
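Here’s a minimal sketch of what a few of these operations look like. The table contents and the names (id, region, price, manager) are made up purely for illustration:
>>> import pandas as pd
>>> import numpy as np
>>> # A tiny, made-up table with one missing price
>>> df = pd.DataFrame({'id': [1, 2, 3],
...                    'region': ['North', 'South', 'North'],
...                    'price': [10.0, np.nan, 12.5]})
>>> # Set a default value for rows with missing prices
>>> df['price'] = df['price'].fillna(0)
>>> # Merge ("join") with a second data set on a shared key
>>> managers = pd.DataFrame({'id': [1, 2, 3], 'manager': ['Ann', 'Bo', 'Cy']})
>>> df = df.merge(managers, on='id')
>>> # Filter the data set based on the values in a column
>>> df[df['region'] == 'North']
>>> # View summary statistics such as mean and percentiles
>>> df.describe()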
These operations can save you a lot of time and let you get to the important work of extracting value from your data.
Now that we know what Pandas is and why we would use it, let’s learn about the key data structure of Pandas.
What Is a Pandas DataFrame?
The core data structure in Pandas is a DataFrame. A DataFrame is a two-dimensional data structure made up of columns and rows.
If you have a background in the statistical programming language R, a DataFrame is modeled after the data.frame object in R.
The Pandas DataFrame structure gives you the speed of low-level languages combined with the ease and expressiveness of high-level languages.
Each row in a DataFrame makes up an individual record—think of a user for a SaaS application or the summary of a single day of stock transactions for a particular stock symbol.
Each column in a DataFrame holds one observed value for each row in the DataFrame. DataFrames can have many columns, each of which has a defined type.
For example, if you have a DataFrame that contains daily transaction summaries for a stock symbol, you might have one column of type float that indicates the closing price while another column of type int that indicates the total volume traded that day.
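As a quick illustration (the numbers are borrowed from the stock data used later in this post), you can see the per-column types through a DataFrame’s dtypes attribute:
>>> import pandas as pd
>>> prices = pd.DataFrame({
...     'close': [1671.73, 1696.17],   # floating-point closing prices
...     'volume': [4974877, 6167358],  # integer share counts
... })
>>> prices.dtypes
close float64
volume int64
dtype: object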
DataFrames are built on top of NumPy, a blazing-fast library that uses C/C++ and Fortran for fast, efficient computation of data.
Now that we understand the basics behind a DataFrame, let’s play around with creating and viewing a DataFrame.
Creating and Viewing a Pandas DataFrame
In this section, we’re going to create and view a Pandas DataFrame. We’ll use some summary stock data to learn the basic Pandas operations.
Installing Pandas can be tricky due to its dependencies on numerical computing libraries like NumPy, which include tools for integrating with Fortran and other low-level languages.
If you’re not a Python expert, the easiest way to get started with Pandas is to install the Anaconda distribution of Python. Check the Pandas installation docs to see all of your options.
First, let’s create a Pandas DataFrame. There are multiple ways to create a DataFrame—from a single Python dictionary, from a list of dictionaries, from a list of lists, and many more.
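For instance, here’s a rough sketch of building a small table from a dictionary of columns and from a list of per-row dictionaries (the values are borrowed from the stock data below):
>>> import pandas as pd
>>> # From a dictionary mapping column names to lists of values
>>> pd.DataFrame({'symbol': ['AMZN', 'AAPL'], 'close': [1671.73, 174.97]})
symbol close
0 AMZN 1671.73
1 AAPL 174.97
>>> # From a list of dictionaries, one per row
>>> pd.DataFrame([{'symbol': 'AMZN', 'close': 1671.73},
...               {'symbol': 'AAPL', 'close': 174.97}])
symbol close
0 AMZN 1671.73
1 AAPL 174.97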
One of the more common ways to create a DataFrame is from a CSV file using the read_csv() function. Pandas even makes it easy to read a CSV over HTTP by allowing you to pass a URL into the read_csv() function.
Let’s do that here. We’ll use this URL, which contains a CSV that I’ve assembled. It includes summary stock data for three stocks (AMZN, AAPL, and GOOG) over five trading days.
>>> import pandas as pd
>>> url = 'https://gist.githubusercontent.com/alexdebrie/b3f40efc3dd7664df5a20f5eee85e854/raw/ee3e6feccba2464cbbc2e185fb17961c53d2a7f5/stocks.csv'
>>> df = pd.read_csv(url)
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
>>> print(df)
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017
5 2019-03-01 AAPL 174.28 175.15 172.89 174.97 25886167
6 2019-03-04 AAPL 175.69 177.75 173.97 175.85 27436203
7 2019-03-05 AAPL 175.94 176.00 174.54 175.53 19737419
8 2019-03-06 AAPL 174.67 175.49 173.94 174.52 20810384
9 2019-03-07 AAPL 173.87 174.44 172.02 172.50 24796374
10 2019-03-01 GOOG 1124.90 1142.97 1124.75 1140.99 1450316
11 2019-03-04 GOOG 1146.99 1158.28 1130.69 1147.80 1446047
12 2019-03-05 GOOG 1150.06 1169.61 1146.19 1162.03 1443174
13 2019-03-06 GOOG 1162.49 1167.57 1155.49 1157.86 1099289
14 2019-03-07 GOOG 1155.72 1156.76 1134.91 1143.30 1166559
In the example above, we imported Pandas and aliased it to pd, as is common when working with Pandas. Then we used the read_csv() function to create a DataFrame from our CSV file. You can see that the returned object is of type pandas.core.frame.DataFrame. Further, printing the object shows us the entire DataFrame.
Viewing a DataFrame With Head and Tail
When you first assemble a new DataFrame, you often want to look at your data to see what you’re working with. If your DataFrame is large, it can be overwhelming to print the entire DataFrame if you just want to see a few rows.
You can use the head() and tail() methods on your DataFrame object to print a few rows.
>>> df.head()
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017
>>> df.tail()
date symbol open high low close volume
10 2019-03-01 GOOG 1124.90 1142.97 1124.75 1140.99 1450316
11 2019-03-04 GOOG 1146.99 1158.28 1130.69 1147.80 1446047
12 2019-03-05 GOOG 1150.06 1169.61 1146.19 1162.03 1443174
13 2019-03-06 GOOG 1162.49 1167.57 1155.49 1157.86 1099289
14 2019-03-07 GOOG 1155.72 1156.76 1134.91 1143.30 1166559
head() will print the first few rows of your DataFrame, and tail() will print the last few rows.
Both methods show five rows by default, but also accept an optional argument specifying how many rows to display.
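For example, to view just the first three rows:
>>> df.head(3)
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522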
Understanding Your DataFrame With Info and Describe
In addition to seeing a few example rows, you may want to get a feel for your DataFrame as a whole. Pandas has some useful methods here as well.
The info() method provides information about your DataFrame’s structure, including the number of rows, the names and types of your columns, and the DataFrame’s memory usage.
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 7 columns):
date 15 non-null object
symbol 15 non-null object
open 15 non-null float64
high 15 non-null float64
low 15 non-null float64
close 15 non-null float64
volume 15 non-null int64
dtypes: float64(4), int64(1), object(2)
memory usage: 920.0+ bytes
The describe() method includes summary statistics for your numeric columns, such as the mean, standard deviation, and percentiles:
>>> df.describe()
open high low close volume
count 15.000000 15.000000 15.000000 15.000000 1.500000e+01
mean 1001.402000 1008.867333 990.836667 998.705333 9.936580e+06
std 645.683362 650.102605 637.842739 642.198730 1.036832e+07
min 173.870000 174.440000 172.020000 172.500000 1.099289e+06
25% 175.815000 176.875000 174.255000 175.690000 1.448182e+06
50% 1150.060000 1158.280000 1134.910000 1147.800000 4.957017e+06
75% 1661.250000 1672.005000 1635.755000 1647.450000 2.027390e+07
max 1702.950000 1709.430000 1689.010000 1696.170000 2.743620e+07
These methods can save you a lot of time as you familiarize yourself with your data.
Manipulating Data in Your DataFrame
Now, let’s learn how to manipulate data in our DataFrame. Pandas makes it easy to clean and munge your data before doing machine learning or other analyses.
In this section, we’ll continue to use our DataFrame from the previous section. We’ll see how to select particular rows, how to select particular columns, and how to add a new column.
Select Rows in a DataFrame
Often you want to select rows in your DataFrame that meet a particular condition. For example, imagine we want to operate on rows for a single stock symbol.
We can use loc[] with an equality condition to filter for rows where the symbol is equal to AMZN. This will return all trading data for Amazon.
>>> df.loc[df['symbol'] == 'AMZN']
date symbol open high low close volume
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017
You can combine multiple conditions inside loc[]. In the code below, we’re selecting all rows where the symbol is AMZN and the stock opened above 1690.
>>> df.loc[(df['symbol'] == 'AMZN') & (df['open'] > 1690)]
date symbol open high low close volume
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001
Select Columns in a DataFrame
You can select a specific column in your DataFrame using dot notation on your DataFrame object—just reference the name of the column you’d like to retrieve.
>>> df.open
0 1655.13
1 1685.00
2 1702.95
3 1695.97
4 1667.37
5 174.28
6 175.69
7 175.94
8 174.67
9 173.87
10 1124.90
11 1146.99
12 1150.06
13 1162.49
14 1155.72
Name: open, dtype: float64
In the example above, we’re selecting the open column for all rows.
This makes it hard to see additional details about each row. Often, you’ll want to select multiple columns for better context. You can use bracket syntax and pass a list of column names to select multiple columns.
>>> df[['date', 'symbol', 'open']]
date symbol open
0 2019-03-01 AMZN 1655.13
1 2019-03-04 AMZN 1685.00
2 2019-03-05 AMZN 1702.95
3 2019-03-06 AMZN 1695.97
4 2019-03-07 AMZN 1667.37
5 2019-03-01 AAPL 174.28
6 2019-03-04 AAPL 175.69
7 2019-03-05 AAPL 175.94
8 2019-03-06 AAPL 174.67
9 2019-03-07 AAPL 173.87
10 2019-03-01 GOOG 1124.90
11 2019-03-04 GOOG 1146.99
12 2019-03-05 GOOG 1150.06
13 2019-03-06 GOOG 1162.49
14 2019-03-07 GOOG 1155.72
There you have it! We just retrieved the date, symbol, and open columns for each row.
Adding a New Column to a Pandas DataFrame
For this last example, let’s see how to change our DataFrame. We’ll do this by adding an entirely new column.
Imagine you want a column that shows the change in price for each stock on each trading day. You can create a new column using bracket syntax, just like adding a new key to a Python dictionary.
>>> df['change'] = df['close'] - df['open']
>>> df
date symbol open high low close volume change
0 2019-03-01 AMZN 1655.13 1674.26 1651.00 1671.73 4974877 16.60
1 2019-03-04 AMZN 1685.00 1709.43 1674.36 1696.17 6167358 11.17
2 2019-03-05 AMZN 1702.95 1707.80 1689.01 1692.43 3681522 -10.52
3 2019-03-06 AMZN 1695.97 1697.75 1668.28 1668.95 3996001 -27.02
4 2019-03-07 AMZN 1667.37 1669.75 1620.51 1625.95 4957017 -41.42
5 2019-03-01 AAPL 174.28 175.15 172.89 174.97 25886167 0.69
6 2019-03-04 AAPL 175.69 177.75 173.97 175.85 27436203 0.16
7 2019-03-05 AAPL 175.94 176.00 174.54 175.53 19737419 -0.41
8 2019-03-06 AAPL 174.67 175.49 173.94 174.52 20810384 -0.15
9 2019-03-07 AAPL 173.87 174.44 172.02 172.50 24796374 -1.37
10 2019-03-01 GOOG 1124.90 1142.97 1124.75 1140.99 1450316 16.09
11 2019-03-04 GOOG 1146.99 1158.28 1130.69 1147.80 1446047 0.81
12 2019-03-05 GOOG 1150.06 1169.61 1146.19 1162.03 1443174 11.97
13 2019-03-06 GOOG 1162.49 1167.57 1155.49 1157.86 1099289 -4.63
14 2019-03-07 GOOG 1155.72 1156.76 1134.91 1143.30 1166559 -12.42
In the example above, we created a new column, change, that shows the difference between the close and open values for each day.
Notice how we were able to assign it by operating on entire columns at once. When adding columns like this, Pandas uses the values from each row to compute that row’s new value. This makes it easy to add new columns computed from existing values in your DataFrame.
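As another sketch of the same idea (the column name pct_change is made up for illustration), you could express each day’s move as a percentage of the opening price:
>>> # Percentage change from open to close, computed row by row
>>> df['pct_change'] = 100 * (df['close'] - df['open']) / df['open']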
Conclusion
In this article, we learned the basics of Pandas, the widely popular data analysis and manipulation library for Python.
After reviewing the background of Pandas, we learned about the DataFrame, the core data structure in Pandas. We saw how to create and view a Pandas DataFrame, as well as how to manipulate a DataFrame to prepare your data for further use.
Just like Pandas supercharges your workflow for processing data, the Kite plugin supercharges your workflow for writing Python across dozens of libraries. It uses deep learning to provide the most intelligent autocompletion for Python you can find out there. Download Kite for free to give it a try.
This post is a part of Kite’s new series on Python. You can check out the code from this and other posts on our GitHub repository.