Learn, create and have fun with Python

Python Pandas quick reference

This is a working sheet for the Python Pandas library which is commonly used for data manipulation and analysis.
Dataset used is FrogID dataset 2.0

Main data structures in Pandas

numpy.ndarray is the bedrock data structure on which the Pandas data structures are based.
Series – 1D (NumPy) array coupled with an array of labels.
DataFrame – mutable 2-dimensional labelled array, which can contain columns of heterogeneous types. Similar to structured arrays in NumPy (ndarray). It is conceptually analogous to a table or spreadsheet. A DataFrame column is a Series. Every axis in a DataFrame has an index. A DataFrame structure can be treated as if it is a dictionary of Series objects.
Panel is a 3D array mainly used for 3D data. Axis names: items (axis 0) each item corresponds to DataFrame (DF), major_axis (1) each item corresponds to the rows if the DF structure, minor_axis (2) each item corresponds to the columns of each DF structure. Less widely used.

Code examples for dataframe

Pandas DataFrame

import pandas as pd
df = pd.read_csv(path_to_csv_file,  parse_dates=True)  # create DataFrame from .csv
df = pd.DataFrame(df, columns=['scientificName', 'stateProvince', 'decimalLatitude', 'decimalLongitude', 'eventDate'])  # create DataFrame with certain columns
df.index  # access index labels => RangeIndex(start=0, stop=126169, step=1
df.columns  # access column labels => Index(['scientificName', stateProvince', 'decimalLatitude', 'decimalLongitude', 'eventDate'], dtype='object')
df["stateProvince"]  # get column as Series structure
df["new_column"] = True  # add new column with default value
del df["new_column"]  # delete column
new_column = df.pop("new_column")  # pop new column as Series
df.insert(2, "new_column", df["scientificName"] == "Litoria rothii")  # insert column (position (default at the end), name, value)

Indexing Pandas DataFrame

Extracting subset based on column names, range of columns, etc.

df[["eventDate", "scientificName"]]  # select all column values with columns in particular order. KeyError exception
df.get("wrong_column_name")  # None
series_ds = df["scientificName"]  # select column as Series data structure
# Rows cannot be selected with [] in DataFrame, but can be in Series as in
series_ds[100]  # where 100 is an index
df.scientificName  # column as an attribute. Only valid Python identifiers, SyntaxError otherwise
df.columns=['scientific_Name', 'state_Province', 'decimal_Latitude', 'decimal_Longitude', 'event_Date']  # rename columns
df[:2]  # select by slicing (class DataFrame). Standard Python slicing apply. [::2] every second row, [::-1] reverse order
# df.loc[startrow:endrow, startcolumn:endcolumn]
df.loc[4:10, "scientificName":"eventDate"]  # subset as DataFrame
df.loc[3]  # where 3 is index. Returns Series
df.loc[[2, 12]]  # DataFrame. Individual rows by index
df.loc[:1]  # DataFrame. Slice by index
df.loc[:, "stateProvince"]  # extract column
df.loc[147, "stateProvince"]  # access row/column value. Class str. The same as:
df.loc[[0, 15], ["scientificName", "stateProvince"]]  # for row 0, 5 select columns "scientificName", "stateProvince"
df["stateProvince"][147]  # as [] cannot be used to select rows. First select column for a Series
df.loc[df["stateProvince"] == "TAS", :]  # select records for TAS
df.loc[df["decimalLatitude"] >= -28.00, :]  # select records that are above of -28.00 Latitude (North of Surfers Paradise - Yallabatharra line)
# .iloc integer-location based indexing for selection by position.
df.iloc[[1, 100], [0, 3]]  # for rows 1 and 100 select columns 0 and 3
df.iloc[lambda x: x.index % 2 == 0]  # rows with even index
df.iloc[:, lambda df: [0, 4]]  #  callable function that expects the Series or DataFrame
# # boolean mask the same length as the index
df1 = df.loc[:3]
df1.iloc[[True, False, True, True]]  # select True rows. But df1.iloc[[True, True, False*2]] - doesn't work (duplicates first True rows)
df.iloc[:2, [True, True, False, False, True, False]]  # boolean array of columns' length. First Two rows and "scientificName", "stateProvince", "eventDate" columns
df.iloc[:3, :2]  # first three rows, first two columns
# at iat
df.at[4, 'scientificName']  #  get a single value in a DataFrame or Series
df.at[4, 'scientificName'] = "Litoria rothii"  # set a single value in a DataFrame or Series
df.iat[1, 0]  # get value
df.loc[0].iat[1]  # get value within a series
# Cross section .xs (xs can not be used to set values)
df.xs('VIC')  # get values at specific index
df.xs(('VIC', "Limnodynastes dumerilii"))  # get values at several indexes
df.xs("Limnodynastes dumerilii", level=1)  # get values at index and specific level or
df.xs("Limnodynastes dumerilii", level='scientificName')
df.xs(("TAS", "Limnodynastes dumerilii"), level=[0, "scientificName"])  # several indexes and levels
df.xs("eventDate", axis=1)  # get values at column and axis

Cast string to timestamp and filter

from datetime import datetime
df['eventDate'] = df["eventDate"].astype('datetime64[ns]')  # cast str to timestamp (time is set to 0:00:0000)
df.loc[df["eventDate"] > datetime.strptime("2019-01-20", '%Y-%m-%d'), :]
# or for date format
df['eventDate'] = df["eventDate"].astype('datetime64[ns]').dt.date  # to access the date component
df.loc[df["eventDate"] > datetime.strptime("2019-01-20", '%Y-%m-%d').date(), :]  # use .date() on datetime obj


df.set_index(["stateProvince", "scientificName"])  # use drop=True to delete columns to be used as the new index.
df.set_index(["stateProvince", "scientificName"], inplace=True)  # does not create new object.
df_index = df.index  # index as MultiIndex of tuples
df_index.get_level_values(1)  # list of index labels for selected level. 
df.reorder_levels(["scientificName", "stateProvince"], axis=0)  # swap indexes. df[:7].reorder_levels(["scientificName", "stateProvince"], axis=0))

Accessing pandas dataframe columns, rows, and cells
Select Rows & Columns by Name or Index in DataFrame using loc & iloc
How to convert string to datetime format in Pandas
Pandas DataFrame.iloc

Leave a Reply

Back to top