Author | Nejat Hakan | nejat.hakan@outlook.de
PayPal Me | https://paypal.me/nejathakan
Data Wrangling & Analysis Power Tools - Pandas & NumPy
Introduction
Welcome to the core of modern data analysis in Python. This section delves into two indispensable libraries: NumPy and Pandas. NumPy (Numerical Python) provides the foundation for numerical operations, especially on large, multi-dimensional arrays and matrices. Pandas, built atop NumPy, offers high-level data structures—Series and DataFrames—and powerful tools specifically designed for data manipulation, cleaning, analysis, and visualization. Mastering these libraries is crucial for anyone serious about data science, machine learning, financial analysis, or any field involving structured data.
We will embark on a journey starting with the fundamental building blocks in the Basic section, covering array creation, manipulation in NumPy, and the introduction of Pandas Series and DataFrames. The Intermediate section will equip you with practical skills for real-world data tasks, including loading data from various sources, handling ubiquitous issues like missing values, and transforming data into desired formats. Finally, the Advanced section will explore sophisticated techniques such as hierarchical indexing, complex data selection, powerful grouping and aggregation methods, and specialized tools for time series analysis.
Each theoretical sub-section is followed by a hands-on Workshop. These workshops are designed as mini-projects, providing step-by-step guidance to apply the concepts you've just learned to practical, often real-world-inspired scenarios. This blend of theory and practice aims to solidify your understanding and build your confidence in using NumPy and Pandas effectively. Prepare to transform raw data into actionable insights!
Basic Level
This initial level focuses on establishing a solid understanding of the core data structures and fundamental operations provided by NumPy and Pandas. We begin with NumPy's `ndarray`, the cornerstone of numerical computing in Python, and then transition to Pandas' `Series` and `DataFrame`, which are essential for handling tabular data.
1. NumPy Fundamentals
NumPy is the foundational package for scientific computing in Python. Its primary contribution is the powerful N-dimensional array object (`ndarray`). Understanding NumPy arrays is the first step towards efficient data manipulation and analysis.
Understanding NumPy Arrays
At its heart, NumPy provides an efficient way to store and manipulate dense numerical data. Unlike Python's built-in lists, which can contain elements of different types and are stored inefficiently in memory for numerical operations, NumPy arrays are homogeneous (all elements must be of the same data type) and stored in a contiguous block of memory. This homogeneity and memory layout allow NumPy to leverage optimized, pre-compiled C code for mathematical operations, resulting in significantly faster execution speeds compared to equivalent operations on Python lists.
Key Characteristics of NumPy Arrays (`ndarray`):
- Homogeneous Type: All elements must be of the same data type (e.g., `int32`, `float64`). This allows for optimized storage and computation.
- Fixed Size: The size of a NumPy array is fixed upon creation. Changing the size requires creating a new array.
- Efficient Memory Usage: Contiguous memory storage enables faster access and manipulation.
- Vectorized Operations: NumPy allows you to apply operations on entire arrays without explicit Python loops, leading to concise and fast code (see the short example after this list).
- Dimensions (Axes): Arrays can have multiple dimensions. A 1-D array is like a vector, a 2-D array is like a matrix, and arrays can have 3 or more dimensions. The number of dimensions is called the rank, and the size of each dimension is called the `shape`.
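To make the benefit of vectorization concrete, here is a minimal illustrative comparison between a plain Python loop and the equivalent NumPy expression; the array size and values are arbitrary and chosen only for demonstration:

import numpy as np

values = list(range(1_000_000))
arr = np.arange(1_000_000)

# Pure Python: an explicit loop (here a comprehension) over every element
squared_list = [v * v for v in values]

# NumPy: one vectorized expression, executed in optimized compiled code
squared_arr = arr * arr

# Both produce the same numbers; the NumPy version is typically much faster
print(squared_list[:5], squared_arr[:5])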
Creating NumPy Arrays:
The most common way to create a NumPy array is using the `np.array()` function, passing a Python list or tuple.
import numpy as np
# Creating a 1-D array from a list
list1 = [1, 2, 3, 4, 5]
arr1d = np.array(list1)
print("1D Array:")
print(arr1d)
print("Shape:", arr1d.shape) # Output: (5,) - A tuple indicating size in each dimension
print("Data Type:", arr1d.dtype) # Output: int64 (depends on your system)
print("Number of Dimensions:", arr1d.ndim) # Output: 1
print("Size (Total Elements):", arr1d.size) # Output: 5
print("-" * 20)
# Creating a 2-D array from a list of lists
list2 = [[1, 2, 3], [4, 5, 6]]
arr2d = np.array(list2)
print("2D Array:")
print(arr2d)
print("Shape:", arr2d.shape) # Output: (2, 3) - 2 rows, 3 columns
print("Data Type:", arr2d.dtype) # Output: int64
print("Number of Dimensions:", arr2d.ndim) # Output: 2
print("Size (Total Elements):", arr2d.size) # Output: 6
# Specifying the data type during creation
arr_float = np.array([1.0, 2.5, 3.14], dtype=np.float32)
print("\nFloat Array:")
print(arr_float)
print("Data Type:", arr_float.dtype) # Output: float32
Other Array Creation Methods:
NumPy provides convenient functions for creating arrays with initial placeholder content:
- `np.zeros(shape)`: Creates an array filled with zeros.
- `np.ones(shape)`: Creates an array filled with ones.
- `np.full(shape, fill_value)`: Creates an array filled with a specific value.
- `np.arange(start, stop, step)`: Similar to Python's `range()`, but returns a NumPy array.
- `np.linspace(start, stop, num)`: Creates an array with a specified number (`num`) of evenly spaced values between `start` and `stop`.
- `np.eye(N)` or `np.identity(N)`: Creates an N x N identity matrix (1s on the diagonal, 0s elsewhere).
- `np.random.rand(d0, d1, ..., dn)`: Creates an array of the given shape with random samples from a uniform distribution over [0, 1).
- `np.random.randn(d0, d1, ..., dn)`: Creates an array of the given shape with random samples from a standard normal distribution (mean 0, variance 1).
- `np.random.randint(low, high, size)`: Creates an array of the given size with random integers from `low` (inclusive) to `high` (exclusive).
# Example usage of other creation methods
zeros_arr = np.zeros((2, 4)) # 2 rows, 4 columns
print("\nZeros Array:\n", zeros_arr)
ones_arr = np.ones((3, 3), dtype=np.int16) # Specify integer type
print("\nOnes Array:\n", ones_arr)
full_arr = np.full((2, 2), 99)
print("\nFull Array:\n", full_arr)
arange_arr = np.arange(0, 10, 2) # Start=0, Stop=10 (exclusive), Step=2
print("\nArange Array:", arange_arr)
linspace_arr = np.linspace(0, 1, 5) # 5 evenly spaced points between 0 and 1 (inclusive)
print("\nLinspace Array:", linspace_arr)
identity_matrix = np.eye(3)
print("\nIdentity Matrix:\n", identity_matrix)
rand_arr = np.random.rand(2, 2)
print("\nRandom (Uniform) Array:\n", rand_arr)
randn_arr = np.random.randn(3, 2)
print("\nRandom (Normal) Array:\n", randn_arr)
randint_arr = np.random.randint(10, 20, size=(2, 5)) # Random integers between 10 and 19
print("\nRandom Integer Array:\n", randint_arr)
Understanding these creation methods and the fundamental properties of `ndarray` objects is essential before moving on to manipulating and analyzing data with them.
Workshop NumPy Array Basics
Goal: Create various NumPy arrays representing hypothetical experimental data and inspect their properties.
Scenario: You are assisting a physics lab that measures temperature (`float`) and particle counts (`int`) at different time points. You need to create NumPy arrays to store this preliminary data.
Steps:
- Import NumPy: Always start by importing the NumPy library, conventionally aliased as `np`.
- Temperature Data (1D Float Array): Measurements were taken at five consecutive time points: 23.5, 24.1, 24.8, 24.5, 23.9 degrees Celsius. Create a 1D NumPy array to store these values. Specify the data type as `np.float32` for memory efficiency.
- Inspect Temperature Array: Print the shape, data type (`dtype`), number of dimensions (`ndim`), and total number of elements (`size`) of the `temperature_data` array. Self-Check: Does the output match your expectations for a 1D array of 5 float elements?
- Particle Counts (2D Integer Array): Two sets of particle count experiments were run simultaneously over four time intervals.
  - Set A counts: 105, 112, 109, 115
  - Set B counts: 98, 101, 100, 104

  Create a 2D NumPy array where each row represents a set of experiments. Use the default integer type.
- Inspect Particle Count Array: Print the shape, data type, dimensions, and size of the `particle_counts` array.

  print("\nParticle Count Array Properties:")
  print(f"  Shape: {particle_counts.shape}")
  print(f"  Data Type: {particle_counts.dtype}")
  print(f"  Dimensions: {particle_counts.ndim}")
  print(f"  Size: {particle_counts.size}")

  Self-Check: Does the shape `(2, 4)` correctly represent 2 experiments (rows) and 4 time intervals (columns)? Is the data type an integer type?
- Create Placeholder Arrays: The lab plans to run another experiment with 3 trials, each measuring 5 voltage readings. Create:
  - An array `voltage_readings` of shape (3, 5) initialized with zeros.
  - An array `error_margins` of the same shape, initialized with the value 0.1.
- Create Time Sequence: Generate a sequence of time points representing 10 measurements taken every 0.5 seconds, starting from time 0. Use `np.arange` or `np.linspace`.

  # Using np.arange
  time_points_arange = np.arange(0, 5, 0.5)  # Stop = 10 * 0.5 = 5.0
  print("\nTime Points (arange):", time_points_arange)
  print(f"  Length: {len(time_points_arange)}")
  # Using np.linspace
  time_points_linspace = np.linspace(0, 4.5, 10)  # Start=0, Stop=4.5, Num=10
  print("Time Points (linspace):", time_points_linspace)
  print(f"  Length: {len(time_points_linspace)}")

  Self-Check: Did both methods produce an array of length 10 with the correct time values? `np.linspace` is often preferred when you know the number of points required and want to include the endpoint. `np.arange` is better when the step size is more critical.

A sketch of the steps above that do not show code appears after this list.
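For reference, here is a minimal solution sketch for the steps that show no code above; the variable names and values are the ones given in the workshop scenario:

import numpy as np

# Temperatures as a 1D float32 array
temperature_data = np.array([23.5, 24.1, 24.8, 24.5, 23.9], dtype=np.float32)

# Inspect its key attributes
print("Shape:", temperature_data.shape)      # (5,)
print("Dtype:", temperature_data.dtype)      # float32
print("Dimensions:", temperature_data.ndim)  # 1
print("Size:", temperature_data.size)        # 5

# Particle counts as a 2D integer array (one row per experiment set)
particle_counts = np.array([[105, 112, 109, 115],
                            [98, 101, 100, 104]])

# Placeholder arrays for the planned voltage experiment
voltage_readings = np.zeros((3, 5))
error_margins = np.full((3, 5), 0.1)
print("Voltage placeholders:\n", voltage_readings)
print("Error margins:\n", error_margins)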
This workshop introduced creating fundamental NumPy arrays using various methods and inspecting their key attributes (`shape`, `dtype`, `ndim`, `size`). This forms the basis for performing operations on these arrays.
Basic Array Operations
NumPy shines when performing operations on arrays. These operations are typically vectorized, meaning they operate on entire arrays element-wise without requiring explicit Python loops. This leads to concise code and significant performance gains.
Element-wise Arithmetic Operations:
Standard arithmetic operators (`+`, `-`, `*`, `/`, `**` for exponentiation) work directly on NumPy arrays, applying the operation element by element. For these operations to work between two arrays, the arrays generally need to have compatible shapes (either the same shape or conformable according to broadcasting rules, discussed later; a small broadcasting sketch follows the example below).
import numpy as np
arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([10, 20, 30, 40])
# Element-wise addition
sum_arr = arr1 + arr2
print("Sum:", sum_arr) # Output: [11 22 33 44]
# Element-wise subtraction
diff_arr = arr2 - arr1
print("Difference:", diff_arr) # Output: [ 9 18 27 36]
# Element-wise multiplication
prod_arr = arr1 * arr2
print("Product:", prod_arr) # Output: [ 10 40 90 160]
# Element-wise division
div_arr = arr2 / arr1
print("Division:", div_arr) # Output: [10. 10. 10. 10.]
# Element-wise exponentiation
pow_arr = arr1 ** 2
print("Squared:", pow_arr) # Output: [ 1 4 9 16]
# Operations with scalars
scalar_prod = arr1 * 5
print("Scalar Product:", scalar_prod) # Output: [ 5 10 15 20]
scalar_add = arr1 + 100
print("Scalar Addition:", scalar_add) # Output: [101 102 103 104]
Important Note on Division by Zero: NumPy handles division by zero according to IEEE 754 standards:
- `1 / 0` results in `inf` (infinity).
- `-1 / 0` results in `-inf` (negative infinity).
- `0 / 0` results in `nan` (Not a Number).

NumPy usually issues a runtime warning in these cases but doesn't raise an error, allowing calculations to proceed (a sketch showing how to silence these warnings follows the example below).
arr_zeros = np.array([1, 2, 0, 4])
print("\nDivision involving zero:")
# The following line will likely produce a RuntimeWarning
print("1 / arr_zeros:", 1 / arr_zeros) # Output: [ 1. 0.5 inf 0.25]
print("arr_zeros / arr_zeros:", arr_zeros / arr_zeros) # Output: [ 1. 1. nan 1.]
Basic Indexing and Slicing:
Accessing elements in NumPy arrays is similar to Python lists, but with more powerful capabilities, especially for multi-dimensional arrays.
- 1-D Arrays: Use square brackets `[]` with the index (0-based). Slicing uses the `start:stop:step` notation.

  arr1d = np.arange(10)  # [0 1 2 3 4 5 6 7 8 9]
  print("\n1D Array:", arr1d)
  # Accessing single element
  print("Element at index 3:", arr1d[3])  # Output: 3
  # Slicing: elements from index 2 up to (not including) index 7
  print("Slice [2:7]:", arr1d[2:7])  # Output: [2 3 4 5 6]
  # Slicing with step
  print("Slice [1:9:2]:", arr1d[1:9:2])  # Output: [1 3 5 7]
  # Slicing from the beginning or to the end
  print("Slice [:5]:", arr1d[:5])  # Output: [0 1 2 3 4]
  print("Slice [5:]:", arr1d[5:])  # Output: [5 6 7 8 9]
  # Assigning a value to a slice
  arr1d_copy = arr1d.copy()  # Important: Create a copy to avoid modifying original
  arr1d_copy[0:3] = 99
  print("Modified Slice:", arr1d_copy)  # Output: [99 99 99  3  4  5  6  7  8  9]

  Crucial Point: Slices on NumPy arrays return views, not copies. Modifying a slice modifies the original array. If you need a copy, use the `.copy()` method.

- Multi-Dimensional Arrays: Use comma-separated indices or slices within the square brackets, one for each dimension (axis): `arr[row_index, column_index]` or `arr[row_slice, column_slice]`.

  arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
  print("\n2D Array:\n", arr2d)
  # Accessing single element (row 1, column 2)
  print("Element at [1, 2]:", arr2d[1, 2])  # Output: 6
  # Accessing a specific row (returns a 1D array)
  print("Row at index 0:", arr2d[0])  # Output: [1 2 3]
  print("Row at index 0 (explicit):", arr2d[0, :])  # Equivalent, clearer for >1D
  # Accessing a specific column (returns a 1D array)
  print("Column at index 1:", arr2d[:, 1])  # Output: [2 5 8]
  # Slicing a sub-matrix (rows 0 and 1, columns 1 and 2)
  print("Sub-matrix arr2d[0:2, 1:3]:\n", arr2d[0:2, 1:3])
  # Output:
  # [[2 3]
  #  [5 6]]
  # Assigning a value to a 2D slice
  arr2d_copy = arr2d.copy()
  arr2d_copy[1:3, 0:2] = 0
  print("Modified 2D Slice:\n", arr2d_copy)
  # Output:
  # [[1 2 3]
  #  [0 0 6]
  #  [0 0 9]]
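Beyond positional slicing, NumPy also supports boolean and integer-array ("fancy") indexing, which a later workshop relies on. A brief illustrative sketch with arbitrary values:

import numpy as np

data = np.array([5, 12, 7, 20, 3])

# Boolean indexing: keep only elements that satisfy a condition
mask = data > 6
print("Mask:", mask)            # [False  True  True  True False]
print("Filtered:", data[mask])  # [12  7 20]

# Fancy indexing: select elements by an explicit list of positions
print("Positions 0 and 3:", data[[0, 3]])  # [ 5 20]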
Mastering these basic arithmetic operations and indexing/slicing techniques is fundamental for almost any task involving NumPy arrays.
Workshop Basic Array Operations
Goal: Perform calculations and data extraction on the experimental data arrays created in the previous workshop.
Scenario: Using the `temperature_data` and `particle_counts` arrays, you need to convert temperatures to Fahrenheit, calculate the average particle count for each experiment set, and extract specific data points.
Steps:
- Import NumPy and Reload Data (if needed): Make sure NumPy is imported and you have the arrays from the previous workshop. If not, recreate them.

  import numpy as np
  # Recreate if needed:
  temperature_data = np.array([23.5, 24.1, 24.8, 24.5, 23.9], dtype=np.float32)
  particle_counts = np.array([[105, 112, 109, 115], [98, 101, 100, 104]])
  print("Arrays ready:")
  print("Temperatures (C):", temperature_data)
  print("Particle Counts:\n", particle_counts)

- Temperature Conversion: Convert the `temperature_data` from Celsius to Fahrenheit using the formula `F = C * 9/5 + 32`. Perform this using vectorized operations. Self-Check: Verify a value manually. E.g., 23.5 * 9/5 + 32 = 42.3 + 32 = 74.3. Does the output match?
- Calculate Average Particle Counts: Calculate the mean (average) count for each experiment set (each row) in the `particle_counts` array. NumPy provides aggregation functions like `mean()`, `sum()`, `min()`, `max()`. These can operate along specific axes (`axis=0` for column-wise, `axis=1` for row-wise). Self-Check: Calculate the average for the first row manually: (105 + 112 + 109 + 115) / 4 = 441 / 4 = 110.25. Does the output match?
- Calculate Overall Average Count: Calculate the average of all particle counts in the `particle_counts` array.
- Extract Specific Data:
  - Get the temperature at the 3rd time point (index 2).
  - Get the particle counts for the first experiment set (Set A, row 0).
  - Get the particle count for Set B (row 1) at the 2nd time interval (column 1).
  - Get the particle counts for both sets but only for the last two time intervals (columns 2 and 3).

  temp_at_t2 = temperature_data[2]
  print(f"\nTemperature at 3rd time point: {temp_at_t2}")
  counts_set_a = particle_counts[0, :]  # Or simply particle_counts[0]
  print("Counts for Set A:", counts_set_a)
  count_setb_t1 = particle_counts[1, 1]
  print(f"Count for Set B at 2nd interval: {count_setb_t1}")
  counts_last_two_intervals = particle_counts[:, 2:4]  # All rows (:), columns 2 and 3 (2:4)
  print("Counts for last two intervals:\n", counts_last_two_intervals)
  print("Shape:", counts_last_two_intervals.shape)  # Should be (2, 2)

- Modify Data (Using a View): Create a slice representing the first two temperatures. Add 0.5 degrees to these measurements directly on the slice. Print the original `temperature_data` array to observe the change (because slices are views).

  temp_slice = temperature_data[0:2]
  print("\nOriginal first two temps:", temp_slice)
  temp_slice += 0.5  # Modify the slice directly
  print("First two temps after adding 0.5:", temp_slice)
  print("Original temperature_data array NOW:", temperature_data)

  Self-Check: Did the first two elements of the original `temperature_data` array change? This demonstrates the concept of views. If you wanted to avoid modifying the original, you should have used `.copy()`.

A sketch of the conversion and averaging steps, which show no code above, follows this list.
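For the conversion and averaging steps, here is a minimal sketch; the arrays are re-created so the snippet stands alone, with the same values as in step 1:

import numpy as np

temperature_data = np.array([23.5, 24.1, 24.8, 24.5, 23.9], dtype=np.float32)
particle_counts = np.array([[105, 112, 109, 115], [98, 101, 100, 104]])

# Celsius to Fahrenheit, fully vectorized
temperature_f = temperature_data * 9 / 5 + 32
print("Temperatures (F):", temperature_f)  # first value should be 74.3

# Mean count per experiment set: one value per row -> axis=1
set_means = particle_counts.mean(axis=1)
print("Mean per set:", set_means)          # [110.25 100.75]

# Mean over every count in the array
print("Overall mean:", particle_counts.mean())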
This workshop applied basic arithmetic operations vectorially and demonstrated how to extract specific data points or sub-arrays using indexing and slicing, including the important distinction between views and copies.
2. Pandas Foundations
While NumPy provides the low-level array structures and numerical operations, Pandas provides higher-level data structures and functions tailored for practical data analysis. Pandas is indispensable for tasks like reading data from files, cleaning messy datasets, handling missing values, and preparing data for modeling or visualization.
Introduction to Pandas Series
A Pandas `Series` is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a NumPy array but with an important addition: an index. The index provides labels for the data points, allowing for more flexible and intuitive data access compared to using only integer indices. Think of it like a combination of a NumPy array and a Python dictionary.
Key Characteristics of Pandas Series:
- 1-Dimensional: Holds a sequence of values.
- Homogeneous Type (Usually): While a Series can hold mixed types (stored as `object` dtype), performance is best when data is homogeneous (like `int64`, `float64`, `bool`, `datetime64[ns]`, `string`).
- Indexed: Each data point has an associated label (the index). If not specified, Pandas creates a default integer index (0, 1, 2...).
- Vectorized Operations: Like NumPy arrays, Series support vectorized operations, making computations fast and code concise.
- Handling Missing Data: Pandas uses `NaN` (Not a Number) to represent missing data and provides tools to detect and handle it.
Creating Pandas Series:
The primary way to create a Series is using the `pd.Series()` constructor.
import pandas as pd
import numpy as np # Often used alongside Pandas
# Creating a Series from a Python list (default integer index)
data_list = [10, 20, 30, 40, 50]
s1 = pd.Series(data_list)
print("Series from list (default index):")
print(s1)
# Output:
# 0 10
# 1 20
# 2 30
# 3 40
# 4 50
# dtype: int64
print("\nValues:", s1.values) # Access the underlying NumPy array
print("Index:", s1.index) # Access the index object
print("-" * 20)
# Creating a Series from a NumPy array with a custom index
data_np = np.array([1.1, 2.2, 3.3, 4.4])
custom_index = ['a', 'b', 'c', 'd']
s2 = pd.Series(data_np, index=custom_index)
print("Series from NumPy array (custom index):")
print(s2)
# Output:
# a 1.1
# b 2.2
# c 3.3
# d 4.4
# dtype: float64
print("\nValues:", s2.values)
print("Index:", s2.index)
print("-" * 20)
# Creating a Series from a Python dictionary
data_dict = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
s3 = pd.Series(data_dict)
print("Series from dictionary:")
print(s3)
# Output: (Keys become index, values become data; order might vary pre-Python 3.7)
# Ohio 35000
# Texas 71000
# Oregon 16000
# Utah 5000
# dtype: int64
# You can also specify an index when creating from a dict, aligning data
states = ['California', 'Ohio', 'Oregon', 'Texas']
s4 = pd.Series(data_dict, index=states)
print("\nSeries from dictionary with specified index:")
print(s4)
# Output: (Values aligned by index labels, missing values become NaN)
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# dtype: float64
# Giving a name to the Series and its index
s4.name = 'Population'
s4.index.name = 'State'
print("\nSeries with names:")
print(s4)
# Output:
# State
# California NaN
# Ohio 35000.0
# Oregon 16000.0
# Texas 71000.0
# Name: Population, dtype: float64
Accessing Data in a Series:
You can access data using the index labels or the integer positions.
- Using Index Labels: Similar to dictionary key access.
- Using Integer Positions: Similar to list/array indexing, primarily via the `.iloc` attribute for explicit position-based access. Using direct square brackets `[]` can be ambiguous if the index itself contains integers.
# Using s2: index = ['a', 'b', 'c', 'd']
print("\nAccessing data in s2:")
print("Element with label 'b':", s2['b']) # Output: 2.2
# Using .loc for explicit label-based access (preferred)
print("Element with label 'c' using .loc:", s2.loc['c']) # Output: 3.3
# Accessing multiple elements by label
print("Elements 'a' and 'd':\n", s2[['a', 'd']]) # Pass a list of labels
# Output:
# a 1.1
# d 4.4
# dtype: float64
# Using integer positions with .iloc (0-based)
print("Element at position 1 using .iloc:", s2.iloc[1]) # Output: 2.2 (corresponds to 'b')
# Slicing by integer position with .iloc
print("Elements from position 1 to 3 using .iloc:\n", s2.iloc[1:4]) # Excludes index 4
# Output:
# b 2.2
# c 3.3
# d 4.4
# dtype: float64
# Slicing by label with .loc (inclusive of the end label)
print("Elements from 'b' to 'd' using .loc:\n", s2.loc['b':'d'])
# Output:
# b 2.2
# c 3.3
# d 4.4
# dtype: float64
Vectorized Operations and Alignment:
Arithmetic operations between Series align data based on index labels. If an index label exists in one Series but not the other, the result for that label will be `NaN`.
s5 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s6 = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'a'])
print("\ns5:\n", s5)
print("s6:\n", s6)
# Addition aligns on index labels 'x' and 'y'
print("\ns5 + s6:\n", s5 + s6)
# Output:
# a NaN
# w NaN
# x 21.0 (1 + 20)
# y 33.0 (2 + 30)
# z NaN
# dtype: float64
# NumPy universal functions (ufuncs) also work on Series
print("\nSquare root of s2:\n", np.sqrt(s2))
# Output:
# a 1.048809
# b 1.483240
# c 1.816590
# d 2.097618
# dtype: float64
Pandas Series form the building blocks for DataFrames and are essential for representing single columns or rows of data with meaningful labels.
Workshop Creating and Manipulating Series
Goal: Create and manipulate Pandas Series to represent and analyze product inventory and sales data.
Scenario: You manage inventory for a small online store. You need to track the number of items in stock for different products and their corresponding sales figures for the last month.
Steps:
- Import Pandas: Start by importing the Pandas library.
- Create Inventory Series: Create a Pandas Series named `inventory` to store the current stock count for the following products: 'Laptop', 'Keyboard', 'Mouse', 'Webcam', 'Monitor'. The stock counts are 15, 75, 120, 30, and 25 respectively. Use the product names as the index.
- Inspect the Series: Print the `index`, `values`, `dtype`, and `name` of the `inventory` Series.
- Create Sales Series: Create another Series named `monthly_sales` representing the number of units sold last month. Data: Laptop: 5, Mouse: 45, Keyboard: 30, Monitor: 8, HDMI Cable: 50. Notice 'Webcam' is missing and 'HDMI Cable' is extra compared to the inventory list.
- Access Inventory Data:
  - Retrieve the stock count for 'Mouse'.
  - Retrieve the stock counts for 'Keyboard' and 'Monitor' simultaneously.
  - Retrieve the stock count for the first two products using integer-based indexing (`.iloc`).

  mouse_stock = inventory['Mouse']  # or inventory.loc['Mouse']
  print(f"\nStock count for Mouse: {mouse_stock}")
  keyboard_monitor_stock = inventory[['Keyboard', 'Monitor']]  # or inventory.loc[['Keyboard', 'Monitor']]
  print("\nStock for Keyboard and Monitor:")
  print(keyboard_monitor_stock)
  first_two_stock_iloc = inventory.iloc[0:2]
  print("\nStock for first two products (iloc):")
  print(first_two_stock_iloc)

- Check Stock Levels: Use boolean indexing to find which products have an inventory level below 50.
- Calculate Remaining Stock: Calculate the approximate remaining stock after last month's sales. Subtract `monthly_sales` from `inventory`. Observe how Pandas handles the alignment and missing items.

  remaining_stock = inventory - monthly_sales
  print("\nEstimated Remaining Stock (Inventory - Sales):")
  print(remaining_stock)

  Self-Check: Why are 'Webcam' and 'HDMI Cable' resulting in `NaN` (Not a Number)? Because 'Webcam' exists in `inventory` but not `monthly_sales`, and 'HDMI Cable' exists in `monthly_sales` but not `inventory`. Arithmetic operations require matching index labels for non-NaN results unless specific methods with `fill_value` are used (which we'll see later).
- Update Inventory: Suppose a new shipment arrived: 10 Laptops and 20 Webcams were added. Update the `inventory` Series accordingly.

  print("\nOriginal Inventory for Laptop and Webcam:")
  print(inventory.loc[['Laptop', 'Webcam']])
  # Method 1: Direct assignment (if index exists)
  inventory['Laptop'] += 10
  inventory['Webcam'] += 20
  # Method 2: Using .loc for potentially multiple updates
  # inventory.loc[['Laptop', 'Webcam']] = inventory.loc[['Laptop', 'Webcam']] + [10, 20]
  print("\nUpdated Inventory for Laptop and Webcam:")
  print(inventory.loc[['Laptop', 'Webcam']])
  print("\nFull Inventory after update:")
  print(inventory)

A sketch of the creation, inspection, and stock-level steps, which show no code above, follows this list.
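A minimal sketch for the steps that show no code above; the values come from the scenario, and giving the Series a name (here 'StockCount') is an assumption so that `.name` has something to show:

import pandas as pd

# Create the inventory Series with product names as the index
inventory = pd.Series([15, 75, 120, 30, 25],
                      index=['Laptop', 'Keyboard', 'Mouse', 'Webcam', 'Monitor'],
                      name='StockCount')

# Inspect its main attributes
print(inventory.index)
print(inventory.values)
print(inventory.dtype)
print(inventory.name)

# Last month's sales (the label mismatch with inventory is intentional)
monthly_sales = pd.Series({'Laptop': 5, 'Mouse': 45, 'Keyboard': 30,
                           'Monitor': 8, 'HDMI Cable': 50})

# Boolean indexing: products with fewer than 50 items in stock
low_stock = inventory[inventory < 50]
print("Products below 50 units:\n", low_stock)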
This workshop demonstrated creating Series with custom indices, accessing data using labels and positions, performing vectorized operations (including alignment behavior), filtering using boolean conditions, and modifying Series values.
Introduction to Pandas DataFrames
The `DataFrame` is the most central and commonly used data structure in Pandas. It represents a rectangular table of data, conceptually similar to a spreadsheet, a SQL table, or a dictionary of Series objects. It's designed to handle heterogeneous data types across columns.
Key Characteristics of Pandas DataFrames:
- 2-Dimensional: Data is organized in rows and columns.
- Labeled Axes: Both rows (index) and columns have labels.
- Heterogeneous Columns: Different columns can have different data types (e.g., `int`, `float`, `string`, `datetime`).
- Mutable Size (Potentially): Columns can generally be inserted and deleted. Inserting/deleting rows is possible but often less efficient than column operations.
- Built on NumPy: Under the hood, the data is often stored in one or more NumPy arrays, ensuring efficiency for many operations.
- Handles Missing Data: Uses `NaN` for missing values.
Creating Pandas DataFrames:
DataFrames can be created from various inputs:
- Dictionary of lists or NumPy arrays: Keys become column names, values (lists/arrays) become column data. Lengths must match.
- Dictionary of Series: Similar to above, but Series indices are aligned; the union of all indices becomes the DataFrame index.
- List of Dictionaries: Each dictionary represents a row. Keys become column names. Pandas infers column names and fills missing values with `NaN`.
- 2D NumPy array: With optional row and column labels.
- Another DataFrame: Creating a copy or subset.
- Reading from files: (CSV, Excel, SQL, etc.) - A very common method, covered later.
import pandas as pd
import numpy as np
# 1. From a Dictionary of Lists
data_dict_lists = {
'StudentID': [101, 102, 103, 104],
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Score': [85.5, 92.0, 78.5, 88.0]
}
df1 = pd.DataFrame(data_dict_lists)
print("DataFrame from Dict of Lists:")
print(df1)
# Output:
# StudentID Name Score
# 0 101 Alice 85.5
# 1 102 Bob 92.0
# 2 103 Charlie 78.5
# 3 104 David 88.0
print("\nDataFrame Info:")
df1.info() # Provides a concise summary (index type, columns, non-null counts, dtype, memory usage)
print("-" * 30)
# 2. From a Dictionary of Series (index alignment)
s_pop = pd.Series([35000, 71000, 16000], index=['Ohio', 'Texas', 'Oregon'])
s_area = pd.Series([44825, 268596, 98381], index=['Ohio', 'Texas', 'Oregon'])
s_capital = pd.Series(['Columbus', 'Austin', 'Salem', 'Sacramento'], index=['Ohio', 'Texas', 'Oregon', 'California'])
df2 = pd.DataFrame({'Population': s_pop, 'Area': s_area, 'Capital': s_capital})
print("DataFrame from Dict of Series:")
print(df2)
# Output: (Index is union of Series indices, columns align, NaN where missing)
# Population Area Capital
# California NaN NaN Sacramento
# Ohio 35000.0 44825.0 Columbus
# Oregon 16000.0 98381.0 Salem
# Texas 71000.0 268596.0 Austin
print("\nColumns:", df2.columns)
print("Index:", df2.index)
print("-" * 30)
# 3. From a List of Dictionaries
data_list_dicts = [
{'colA': 1, 'colB': 'x'},
{'colA': 2, 'colB': 'y', 'colC': True},
{'colA': 3, 'colB': 'z'}
]
df3 = pd.DataFrame(data_list_dicts)
print("DataFrame from List of Dicts:")
print(df3)
# Output: (Columns inferred, NaN for missing keys)
# colA colB colC
# 0 1 x NaN
# 1 2 y True
# 2 3 z NaN
print("-" * 30)
# 4. From a 2D NumPy Array
data_np_2d = np.random.randn(3, 4) # 3 rows, 4 columns
row_labels = ['R1', 'R2', 'R3']
col_labels = ['C1', 'C2', 'C3', 'C4']
df4 = pd.DataFrame(data_np_2d, index=row_labels, columns=col_labels)
print("DataFrame from 2D NumPy Array:")
print(df4)
# Output: (Example random values)
# C1 C2 C3 C4
# R1 -0.595167 0.185749 -0.664021 -0.811483
# R2 0.562083 -0.101249 0.276538 -0.239427
# R3 0.484972 0.121686 0.095458 -0.593214
Basic DataFrame Inspection:
Once you have a DataFrame, several attributes and methods are essential for understanding its structure and content:
- `df.head(n=5)`: Returns the first `n` rows.
- `df.tail(n=5)`: Returns the last `n` rows.
- `df.shape`: Returns a tuple representing the dimensionality (rows, columns).
- `df.index`: Access the row index.
- `df.columns`: Access the column labels.
- `df.dtypes`: Returns the data type of each column.
- `df.info()`: Provides a concise summary including index dtype, column dtypes, non-null values, and memory usage.
- `df.describe()`: Generates descriptive statistics (count, mean, std, min, max, percentiles) for numerical columns. Can be adapted for object/categorical columns too.
print("\nInspecting df1:")
print("Head:\n", df1.head(2))
print("\nShape:", df1.shape)
print("\nData Types:\n", df1.dtypes)
print("\nDescriptive Statistics:\n", df1.describe())
# For df2 which has NaNs and mixed types
print("\nInspecting df2:")
df2.info()
print("\nDescriptive Statistics (df2):\n", df2.describe()) # Only for numeric columns
print("\nDescribe all columns (df2):\n", df2.describe(include='all')) # Include non-numeric
DataFrames are the workhorse of Pandas, providing a flexible and powerful way to manage and prepare structured data for analysis.
Workshop DataFrame Creation and Basic Inspection
Goal: Create a Pandas DataFrame to store student information and perform basic inspection tasks.
Scenario: You are a teaching assistant and need to create a digital grade book. You have student IDs, names, and scores from two different assignments (Assignment 1 and Assignment 2).
Steps:
- Import Pandas: Ensure Pandas is imported.
- Prepare Data: Create Python lists or NumPy arrays for the student data:
  - `student_ids`: [S101, S102, S103, S104, S105]
  - `student_names`: ['Aditya', 'Bhavna', 'Chetan', 'Divya', 'Elena']
  - `assignment1_scores`: [8, 10, 7, 9, 8]
  - `assignment2_scores`: [9, 7, 8, 10, np.nan] (Elena missed the second assignment)
- Create DataFrame: Create a DataFrame named `gradebook_df` from a dictionary where keys are the desired column names ('ID', 'Name', 'Assignment1', 'Assignment2') and values are the lists created above.
- Set Student ID as Index: Often, a unique identifier like Student ID makes a good index. Set the 'ID' column as the DataFrame's index. Use the `set_index()` method. Make sure the change persists in `gradebook_df` (use `inplace=True` or reassign the result). Self-Check: Is 'ID' no longer a regular column but now appears as the index label?
- Basic Inspection: Perform the following inspections on `gradebook_df`:
  - Display the first 3 rows.
  - Display the last 2 rows.
  - Show the dimensions (shape).
  - List the column names.
  - Show the data types of each column.

  print("\n--- DataFrame Inspection ---")
  print("First 3 rows (head):\n", gradebook_df.head(3))
  print("\nLast 2 rows (tail):\n", gradebook_df.tail(2))
  print("\nShape:", gradebook_df.shape)
  print("\nColumn Names:", gradebook_df.columns)
  print("\nIndex:", gradebook_df.index)  # Also check the index
  print("\nData Types:\n", gradebook_df.dtypes)

- Use `info()`: Get a concise summary using the `.info()` method. Pay attention to the Non-Null Count for 'Assignment2'. Self-Check: Does `info()` show 5 entries for the index, 'Name', and 'Assignment1', but only 4 non-null entries for 'Assignment2'? This confirms the presence of the `NaN` value.
- Use `describe()`: Generate descriptive statistics for the numerical columns ('Assignment1', 'Assignment2'). Self-Check: Notice that the `count` for 'Assignment2' is 4, reflecting the missing value which is excluded from calculations like `mean`, `std`, etc., by default.

A sketch of the data-preparation, creation, and indexing steps, which show no code above, follows this list.
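A minimal sketch for the steps that show no code above, using the column names and values from the scenario (the student IDs are written as strings here):

import pandas as pd
import numpy as np

# Prepare the raw data
student_ids = ['S101', 'S102', 'S103', 'S104', 'S105']
student_names = ['Aditya', 'Bhavna', 'Chetan', 'Divya', 'Elena']
assignment1_scores = [8, 10, 7, 9, 8]
assignment2_scores = [9, 7, 8, 10, np.nan]  # Elena missed assignment 2

# Build the DataFrame from a dictionary of columns
gradebook_df = pd.DataFrame({
    'ID': student_ids,
    'Name': student_names,
    'Assignment1': assignment1_scores,
    'Assignment2': assignment2_scores,
})

# Make the student ID the index (reassigning instead of using inplace=True)
gradebook_df = gradebook_df.set_index('ID')

# Summaries used in the later inspection steps
gradebook_df.info()
print(gradebook_df.describe())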
This workshop covered creating a DataFrame from related data collections, setting a meaningful index, and using fundamental methods (`head`, `tail`, `shape`, `columns`, `index`, `dtypes`, `info`, `describe`) to understand its structure and basic statistical properties.
Intermediate Level
Now that you have a grasp of the basic structures, this level focuses on common data wrangling tasks: getting data into Pandas, cleaning it up, and transforming it into a more usable format. These are critical steps in nearly every data analysis workflow.
3. Data Loading and Saving with Pandas
Real-world data rarely starts inside your Python script. It usually resides in external files like CSVs, Excel spreadsheets, JSON files, or databases. Pandas provides efficient and easy-to-use functions to read data from these sources into DataFrames and write DataFrames back out.
Reading from Different File Formats (CSV, Excel, JSON)
Pandas offers a suite of `read_*` functions. We'll focus on the most common ones:
- `pd.read_csv(filepath_or_buffer, ...)`: Reads data from Comma Separated Value (.csv) files. This is perhaps the most frequently used data loading function. It has many optional parameters to handle various CSV complexities (different delimiters, headers, specific columns, data types, date parsing, etc.).
- `pd.read_excel(io, sheet_name=0, ...)`: Reads data from Microsoft Excel files (.xls, .xlsx, .xlsm, .xlsb, .odf). It requires an additional library (`openpyxl` for `.xlsx` or `xlrd` for `.xls`). You can specify which sheet to read by name or index.
- `pd.read_json(path_or_buf, orient='records', ...)`: Reads data from JSON (JavaScript Object Notation) strings or files. JSON data can have various structures, and the `orient` parameter helps Pandas interpret it correctly (e.g., `'records'` for a list of JSON objects, `'columns'` for a dict where keys are columns, `'index'` for a dict where keys are index labels).
Common Parameters for `read_*` functions:
- `filepath_or_buffer` / `io` / `path_or_buf`: The path to the file (string), a URL, or any object with a `read()` method (like a file handle).
- `sep` or `delimiter` (for `read_csv`): The character used to separate values (e.g., `','`, `'\t'` for tab, `';'`). Pandas usually infers this well but explicit setting helps.
- `header`: Row number(s) to use as the column names. Default is `0` (first row). Use `None` if the file has no header.
- `index_col`: Column number or name to use as the DataFrame's row index.
- `usecols`: A list of column names or indices to read, useful for large files where you only need a subset.
- `dtype`: A dictionary mapping column names to specific data types (e.g., `{'col_name': np.float64}`). Helps prevent incorrect type inference and saves memory.
- `parse_dates`: A list of column names or indices that should be parsed as dates/times.
- `na_values`: A list of strings or a dictionary mapping column names to strings that should be treated as `NaN` (missing values).

A combined example using several of these options appears right after this list.
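To show how several of these parameters work together in one call, here is an illustrative sketch; the simulated file contents and the column names (OrderID, Date, Region, Amount) are made up for the example:

import io
import pandas as pd

# Simulated ';'-separated file with a date column and some missing markers
raw = """OrderID;Date;Region;Amount
1001;2023-01-05;North;199.99
1002;2023-01-06;South;N/A
1003;2023-01-07;;54.10"""

df_sales = pd.read_csv(
    io.StringIO(raw),
    sep=';',                              # values separated by semicolons
    index_col='OrderID',                  # use OrderID as the row index
    usecols=['OrderID', 'Date', 'Region', 'Amount'],
    dtype={'Region': 'string'},           # force a specific dtype for one column
    parse_dates=['Date'],                 # parse Date into datetime64
    na_values=['', 'N/A'],                # treat these strings as missing
)
print(df_sales)
print(df_sales.dtypes)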
Example Reading CSV:
Assume we have a file named `students.csv`:
StudentID,FullName,Major,GPA
S201,Alice Smith,Physics,3.7
S202,Bob Johnson,CompSci,3.9
S203,Charlie Brown,CompSci,3.5
S204,Diana Ross,Math,3.8
S205,,Physics,
import pandas as pd
import io # To simulate a file from a string
# Simulate reading from a file (replace with 'students.csv' if you create the file)
csv_data = """StudentID,FullName,Major,GPA
S201,Alice Smith,Physics,3.7
S202,Bob Johnson,CompSci,3.9
S203,Charlie Brown,CompSci,3.5
S204,Diana Ross,Math,3.8
S205,,Physics,""" # Note the missing name and GPA
# Basic read
# df_students = pd.read_csv('students.csv')
df_students = pd.read_csv(io.StringIO(csv_data))
print("--- Basic CSV Read ---")
print(df_students)
df_students.info() # Observe inferred types and non-null counts
print("\n--- CSV Read with Options ---")
# Simulate reading again, using StudentID as index and handling missing values explicitly
# df_students_indexed = pd.read_csv('students.csv', index_col='StudentID', na_values=[''])
df_students_indexed = pd.read_csv(io.StringIO(csv_data),
index_col='StudentID',
na_values=[''] # Treat empty strings as NaN
)
print(df_students_indexed)
df_students_indexed.info() # Note StudentID is now index, FullName/GPA have NaNs
Example Reading Excel:
Assume an Excel file `grades.xlsx` with a sheet named `Sheet1`:

| StudentID | Assignment1 | Assignment2 | FinalExam |
|---|---|---|---|
| S201 | 90 | 85 | 92 |
| S202 | 95 | 98 | 96 |
| S203 | 80 | 82 | 85 |

(You would need `openpyxl` installed: `pip install openpyxl`)
# Assuming 'grades.xlsx' exists in the same directory
# Requires openpyxl: pip install openpyxl
try:
# df_grades = pd.read_excel('grades.xlsx', sheet_name='Sheet1', index_col='StudentID')
# print("\n--- Excel Read ---")
# print(df_grades)
# df_grades.info()
print("\n--- Excel Read (Skipped - requires file/openpyxl) ---")
# Placeholder DataFrame for demonstration
df_grades_placeholder = pd.DataFrame({
'Assignment1': [90, 95, 80],
'Assignment2': [85, 98, 82],
'FinalExam': [92, 96, 85]
}, index=pd.Index(['S201', 'S202', 'S203'], name='StudentID'))
print(df_grades_placeholder)
df_grades_placeholder.info()
except ImportError:
print("\nError: 'openpyxl' library not found. Please install it to read .xlsx files.")
except FileNotFoundError:
print("\nError: 'grades.xlsx' not found. Cannot demonstrate Excel reading.")
Example Reading JSON:
Assume a file `products.json` with record-oriented data:
[
{"ProductID": "P1001", "Name": "Laptop", "Price": 1200.00, "InStock": true},
{"ProductID": "P1002", "Name": "Keyboard", "Price": 75.50, "InStock": true},
{"ProductID": "P1003", "Name": "Mouse", "Price": 25.99, "InStock": false}
]
# Simulate reading from 'products.json'
json_data = """
[
{"ProductID": "P1001", "Name": "Laptop", "Price": 1200.00, "InStock": true},
{"ProductID": "P1002", "Name": "Keyboard", "Price": 75.50, "InStock": true},
{"ProductID": "P1003", "Name": "Mouse", "Price": 25.99, "InStock": false}
]
"""
# df_products = pd.read_json('products.json', orient='records')
df_products = pd.read_json(io.StringIO(json_data), orient='records')
print("\n--- JSON Read ('records' orient) ---")
print(df_products)
df_products.info() # Note Price is float, InStock is bool
# Example with 'columns' orient JSON
json_data_columns = """
{
"ProductID": {"0": "P1001", "1": "P1002", "2": "P1003"},
"Name": {"0": "Laptop", "1": "Keyboard", "2": "Mouse"},
"Price": {"0": 1200.00, "1": 75.50, "2": 25.99},
"InStock": {"0": true, "1": true, "2": false}
}
"""
df_products_cols = pd.read_json(io.StringIO(json_data_columns), orient='columns')
print("\n--- JSON Read ('columns' orient) ---")
print(df_products_cols) # Index is string '0', '1', '2' here
Mastering these reading functions and their common parameters is crucial for starting any data analysis project. Always inspect the loaded DataFrame (`head()`, `info()`, `describe()`) to ensure the data was read correctly.
Workshop Loading and Initial Exploration of a Real Dataset
Goal: Load a real-world dataset (e.g., Titanic passenger data) from a CSV file and perform initial inspection.
Dataset: We'll use the well-known Titanic dataset, often available online. If you don't have it locally, you can often load it directly from a URL. A common source is from the Seaborn library's GitHub repository.
Steps:
- Import Pandas:
- Define Dataset Location: Specify the path or URL to the dataset. We'll use a direct URL for convenience.
- Load the CSV Data: Use `pd.read_csv()` to load the data from the URL into a DataFrame called `titanic_df`.

  try:
      titanic_df = pd.read_csv(titanic_url)
      print("\nTitanic dataset loaded successfully!")
  except Exception as e:
      print(f"\nError loading dataset: {e}")
      print("Please check the URL or your internet connection.")
      # As a fallback, create a minimal placeholder DataFrame
      titanic_df = pd.DataFrame({
          'survived': [0, 1, 1],
          'pclass': [3, 1, 3],
          'sex': ['male', 'female', 'female'],
          'age': [22.0, 38.0, 26.0],
          'fare': [7.25, 71.2833, 7.925],
          'embarked': ['S', 'C', 'S']
      })
      print("\nUsing a minimal placeholder DataFrame for the workshop.")

- Initial Inspection - Size and Shape: How many rows and columns does the dataset have?
- Initial Inspection - Head and Tail: Look at the first few and last few rows to get a feel for the data and column names.

  if 'titanic_df' in locals():
      print("\n--- First 5 Rows (head) ---")
      print(titanic_df.head())
      print("\n--- Last 5 Rows (tail) ---")
      print(titanic_df.tail())

  Self-Check: Identify the column names. What kind of data seems to be in each column (numeric, text, categorical)? Are there any obvious missing values (often represented as `NaN`)?
- Initial Inspection - Data Types and Missing Values: Use `info()` to get a summary of column data types and non-null counts. This is crucial for identifying missing data. Self-Check: Which columns have missing values (where Non-Null Count is less than the total number of entries/rows)? What are the data types inferred by Pandas? Do they seem appropriate (e.g., 'age' as float, 'sex' as object/string)?
- Initial Inspection - Numerical Summary: Use `describe()` to get summary statistics for the numerical columns.

  if 'titanic_df' in locals():
      print("\n--- Descriptive Statistics (Numerical Columns) ---")
      print(titanic_df.describe())

  Self-Check: Look at the `mean`, `min`, `max`, and standard deviation (`std`) for columns like 'age', 'fare'. Does anything look surprising (e.g., minimum age, maximum fare)? The `count` here also confirms the number of non-missing values for these columns.
- Initial Inspection - Categorical Summary: Use `describe(include=['object', 'category'])` to get summary statistics for non-numerical columns (like 'sex', 'embarked').

  if 'titanic_df' in locals():
      print("\n--- Descriptive Statistics (Object/Categorical Columns) ---")
      # Include 'category' if you convert types later
      print(titanic_df.describe(include=['object']))

  Self-Check: What are the `unique` values in columns like 'sex' or 'embarked'? What is the most frequent value (`top`) and its frequency (`freq`)?

A sketch of the import, URL-definition, and shape-check steps, which show no code above, follows this list.
This workshop provided hands-on experience loading a standard dataset and performing the essential first steps of data exploration: checking dimensions, previewing data, understanding data types and missing values, and getting basic statistical summaries. This initial analysis informs the subsequent data cleaning and preparation steps.
Writing Data to Files
After processing or analyzing your data, you often need to save the resulting DataFrame to a file for later use, sharing, or input into another system. Pandas provides corresponding `to_*` methods.
- `df.to_csv(path_or_buf, sep=',', index=True, header=True, ...)`: Writes the DataFrame to a CSV file.
  - `index=True` (default): Writes the DataFrame index as a column. Set to `False` if you don't want to save the index.
  - `header=True` (default): Writes the column names as the header row. Set to `False` for no header.
  - `na_rep=''` (default): String representation for `NaN` values (missing data). You might set this to something specific if needed, e.g., `'NULL'`.
  - `float_format='%.2f'`: Format string for floating point numbers (e.g., limit to 2 decimal places).
  - `columns`: List of column names to write (if you want to save only a subset).
- `df.to_excel(excel_writer, sheet_name='Sheet1', index=True, header=True, ...)`: Writes the DataFrame to an Excel file. Requires `openpyxl`.
  - `excel_writer`: Path to the file or an `ExcelWriter` object (for writing multiple sheets to the same file).
  - `sheet_name`: Name of the sheet to write to.
- `df.to_json(path_or_buf, orient='records', lines=False, ...)`: Writes the DataFrame to a JSON file.
  - `orient`: Controls the JSON structure (e.g., `'records'`, `'columns'`, `'index'`, `'values'`, `'table'`). `'records'` is common for row-based data.
  - `lines=True`: Writes each row as a separate JSON object on a new line (JSON Lines format). Useful for streaming or large files.
  - `date_format`: Format for date/time objects (e.g., `'iso'` for ISO 8601).
Example Writing Data:
Let's use the `df_students_indexed` DataFrame from the CSV reading example.
import pandas as pd
import io
import numpy as np # for NaN example later
# Recreate a sample DataFrame
csv_data = """StudentID,FullName,Major,GPA
S201,Alice Smith,Physics,3.7
S202,Bob Johnson,CompSci,3.9
S203,Charlie Brown,CompSci,3.5
S204,Diana Ross,Math,3.8
S205,,Physics,"""
df_students_indexed = pd.read_csv(io.StringIO(csv_data),
index_col='StudentID',
na_values=['']
)
# Add a hypothetical column
df_students_indexed['Status'] = ['Active', 'Active', 'Graduated', 'Active', 'Active']
print("--- DataFrame to Save ---")
print(df_students_indexed)
# --- Writing to CSV ---
# Prepare a buffer to simulate writing to a file
csv_output_buffer = io.StringIO()
df_students_indexed.to_csv(csv_output_buffer, index=True) # Include index
print("\n--- CSV Output (with index) ---")
print(csv_output_buffer.getvalue())
csv_output_buffer_noindex = io.StringIO()
df_students_indexed.to_csv(csv_output_buffer_noindex, index=False) # Exclude index
print("\n--- CSV Output (without index) ---")
print(csv_output_buffer_noindex.getvalue())
csv_output_buffer_custom = io.StringIO()
df_students_indexed.to_csv(csv_output_buffer_custom, sep='\t', na_rep='MISSING') # Tab separated, custom NaN
print("\n--- CSV Output (tab separated, custom NaN) ---")
print(csv_output_buffer_custom.getvalue())
# --- Writing to Excel ---
# Requires openpyxl: pip install openpyxl
excel_output_path = 'output_students.xlsx' # Specify a file path
try:
df_students_indexed.to_excel(excel_output_path, sheet_name='Student Data', index=False)
print(f"\n--- Excel Output ---")
print(f"DataFrame successfully written to '{excel_output_path}' (index excluded).")
# You would typically open the Excel file to verify
except ImportError:
print("\nError: 'openpyxl' not found. Skipping Excel writing.")
# --- Writing to JSON ---
json_output_buffer_records = io.StringIO()
df_students_indexed.to_json(json_output_buffer_records, orient='records', indent=4) # Pretty-printed records
print("\n--- JSON Output (orient='records', indented) ---")
print(json_output_buffer_records.getvalue())
json_output_buffer_index = io.StringIO()
# Reset index to make StudentID a column for better index-oriented JSON
df_students_indexed.reset_index().to_json(json_output_buffer_index, orient='index', indent=4)
print("\n--- JSON Output (orient='index', indented, after reset_index) ---")
print(json_output_buffer_index.getvalue())
json_output_buffer_lines = io.StringIO()
df_students_indexed.to_json(json_output_buffer_lines, orient='records', lines=True) # One JSON object per line
print("\n--- JSON Output (orient='records', lines=True) ---")
print(json_output_buffer_lines.getvalue())
# Clean up the created Excel file if it exists (optional)
import os
if os.path.exists(excel_output_path):
try:
os.remove(excel_output_path)
print(f"\nCleaned up temporary file: '{excel_output_path}'")
except PermissionError:
print(f"\nWarning: Could not remove temporary file '{excel_output_path}'. It might be open.")
Choosing the right output format and parameters depends on how the data will be used next. CSV is universal but simple, Excel is good for human readability and basic analysis, while JSON is flexible and common in web applications and APIs.
Workshop Saving Processed Data
Goal: Load the Titanic dataset, perform a simple modification (e.g., create a new column), and save the modified DataFrame to both CSV and JSON formats.
Scenario: After loading the Titanic dataset, you want to create a new column indicating whether a passenger was a child (e.g., age < 18). You then need to save this enhanced dataset for future analysis.
Steps:
- Import Pandas and Load Data: Import Pandas and reload the Titanic dataset (using the URL or a local copy). Handle potential loading errors as before.

  import pandas as pd
  import numpy as np  # Needed for checking NaN
  print("Pandas and NumPy imported.")

  # URL or local path
  titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
  # titanic_url = 'titanic.csv'  # If you have it locally

  try:
      titanic_df = pd.read_csv(titanic_url)
      print("\nTitanic dataset loaded successfully!")
      print(f"Original shape: {titanic_df.shape}")
  except Exception as e:
      print(f"\nError loading dataset: {e}")
      # Create a minimal placeholder if loading fails
      titanic_df = pd.DataFrame({
          'survived': [0, 1, 1, 0, 1],
          'pclass': [3, 1, 3, 1, 3],
          'sex': ['male', 'female', 'female', 'male', 'female'],
          'age': [22.0, 38.0, 26.0, 54.0, 4.0],  # Added a child age
          'fare': [7.25, 71.2833, 7.925, 51.8625, 16.7],
          'embarked': ['S', 'C', 'S', 'S', 'S']
      })
      print("\nUsing a minimal placeholder DataFrame.")
      print(f"Placeholder shape: {titanic_df.shape}")

- Perform Data Modification: Create a new boolean column named `is_child`. This column should be `True` if the passenger's 'age' is less than 18, and `False` otherwise. Handle potential `NaN` values in the 'age' column (a child cannot be identified if the age is unknown, so these should probably be `False` or perhaps kept as `NaN` depending on requirements; let's make them `False` for this exercise).

  if 'titanic_df' in locals():
      # Method 1: Using boolean condition (will be NaN where age is NaN)
      # titanic_df['is_child_temp'] = titanic_df['age'] < 18
      # Method 2: Using np.where for explicit True/False/Handle NaN
      # np.where(condition, value_if_true, value_if_false)
      # titanic_df['is_child'] = np.where(titanic_df['age'] < 18, True, False)
      # Method 3: Using .apply with a lambda function (handles NaN implicitly if needed)
      # This is more verbose but flexible. Let's stick to boolean indexing + fillna
      titanic_df['is_child'] = titanic_df['age'] < 18
      # Now, decide how to handle NaN ages. Let's fill NaN in 'is_child' with False.
      # Note: This assumes NaN age means not a child for this specific purpose.
      # Assign the result back instead of a chained fillna(..., inplace=True),
      # which can warn or silently fail on newer Pandas versions.
      titanic_df['is_child'] = titanic_df['is_child'].fillna(False)
      # Convert to boolean type explicitly if desired (fillna might make it object)
      titanic_df['is_child'] = titanic_df['is_child'].astype(bool)

      print("\n--- DataFrame with 'is_child' column added ---")
      print(titanic_df[['age', 'is_child']].head(10))  # Show relevant columns
      print(f"\nNew shape: {titanic_df.shape}")
      print("\nValue counts for 'is_child':")
      print(titanic_df['is_child'].value_counts())

- Save to CSV: Save the modified `titanic_df` to a CSV file named `titanic_enhanced.csv`. Do not include the default Pandas integer index in the output file.

  output_csv_path = 'titanic_enhanced.csv'
  if 'titanic_df' in locals():
      try:
          titanic_df.to_csv(output_csv_path, index=False)
          print(f"\nDataFrame successfully saved to '{output_csv_path}' (index excluded).")
      except Exception as e:
          print(f"\nError saving to CSV: {e}")

  Self-Check: If you open `titanic_enhanced.csv` (e.g., in a text editor or spreadsheet program), does it contain the `is_child` column? Is the Pandas integer index (0, 1, 2...) absent?
- Save to JSON: Save the modified `titanic_df` to a JSON file named `titanic_enhanced.json`. Use the `records` orientation and add indentation for readability.

  output_json_path = 'titanic_enhanced.json'
  if 'titanic_df' in locals():
      try:
          titanic_df.to_json(output_json_path, orient='records', indent=4)
          print(f"\nDataFrame successfully saved to '{output_json_path}' (orient='records', indented).")
      except Exception as e:
          print(f"\nError saving to JSON: {e}")

  Self-Check: Open `titanic_enhanced.json`. Is it a list `[...]` where each element is a JSON object `{...}` representing a passenger? Does each object contain the `is_child` key-value pair?
- (Optional) Clean up: Remove the generated files if you don't need them.

  import os
  if os.path.exists(output_csv_path):
      try:
          os.remove(output_csv_path); print(f"\nRemoved '{output_csv_path}'")
      except Exception as e:
          print(f"Could not remove {output_csv_path}: {e}")
  if os.path.exists(output_json_path):
      try:
          os.remove(output_json_path); print(f"Removed '{output_json_path}'")
      except Exception as e:
          print(f"Could not remove {output_json_path}: {e}")
This workshop demonstrated a common workflow: load data, perform a transformation (creating a new feature), and save the result in different standard formats using Pandas' `to_csv` and `to_json` methods with relevant options.
4. Data Cleaning and Preparation
Raw data is often messy. It might contain missing values, incorrect data types, duplicate entries, or inconsistent formatting. Data cleaning (also called data cleansing or data scrubbing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It's a crucial and often time-consuming step in the data analysis pipeline. Pandas provides excellent tools for this.
Handling Missing Data (NaN)
Missing data is represented in Pandas primarily by the special floating-point value `NaN` (Not a Number). Python's built-in `None` object is also treated as missing data in many contexts, especially in columns with `object` dtype. Identifying and handling `NaN`s is fundamental.
Detecting Missing Data:
- `isnull()` or `isna()`: These methods return a boolean DataFrame (or Series) of the same shape as the original, where `True` indicates a missing value (`NaN` or `None`) and `False` indicates a non-missing value.
- `notnull()` or `notna()`: The opposite of `isnull()`, returning `True` for non-missing values.
- Combining with `.sum()`: Calling `.sum()` on the boolean DataFrame/Series returned by `isnull()` counts the number of missing values per column (or per row if `axis=1`).
import pandas as pd
import numpy as np
data = {'col1': [1, 2, np.nan, 4, 5],
'col2': ['A', 'B', 'C', np.nan, 'E'],
'col3': [True, False, True, True, np.nan],
'col4': [np.nan, np.nan, np.nan, np.nan, np.nan],
'col5': [10, 20, 30, 40, 50]}
df_missing = pd.DataFrame(data)
print("DataFrame with Missing Data:")
print(df_missing)
print("\nDetecting NaN (isnull()):")
print(df_missing.isnull())
print("\nDetecting Non-NaN (notnull()):")
print(df_missing.notnull())
print("\nCounting NaN per column:")
print(df_missing.isnull().sum()) # Default axis=0
print("\nCounting NaN per row:")
print(df_missing.isnull().sum(axis=1))
print("\nTotal number of NaN values:")
print(df_missing.isnull().sum().sum())
Handling Missing Data:
There are two main strategies for dealing with missing data:
- Dropping: Remove rows or columns containing `NaN` values.
  - `dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)`: Drops rows (default `axis=0`) or columns (`axis=1`) containing missing values.
    - `how='any'` (default): Drop the row/column if any `NaN` values are present.
    - `how='all'`: Drop the row/column only if all values are `NaN`.
    - `thresh=N`: Keep rows/columns that have at least `N` non-`NaN` values.
    - `subset=[col_list]`: Only consider `NaN`s in specific columns when deciding whether to drop rows.
- Filling (Imputation): Replace `NaN` values with some other value.
  - `fillna(value, method=None, axis=None, inplace=False, limit=None)`: Fills `NaN` values.
    - `value`: Scalar value (e.g., 0, 'Unknown'), dictionary/Series (to fill different values per column), or DataFrame.
    - `method='ffill'` (forward fill): Propagate the last valid observation forward to fill the gap.
    - `method='bfill'` (backward fill): Use the next valid observation to fill the gap.
    - `limit`: If using `ffill` or `bfill`, the maximum number of consecutive `NaN`s to fill.
Example Dropping:
print("\n--- Dropping NaN ---")
print("Original DataFrame:\n", df_missing)
# Drop rows with ANY NaN
df_dropped_rows_any = df_missing.dropna() # axis=0, how='any' are defaults
print("\nDataFrame after dropping rows with ANY NaN:\n", df_dropped_rows_any)
# Drop columns with ANY NaN
df_dropped_cols_any = df_missing.dropna(axis=1)
print("\nDataFrame after dropping columns with ANY NaN:\n", df_dropped_cols_any)
# Drop columns only if ALL values are NaN
df_dropped_cols_all = df_missing.dropna(axis=1, how='all')
print("\nDataFrame after dropping columns with ALL NaN:\n", df_dropped_cols_all)
# Drop rows where NaN appears specifically in 'col2' or 'col3'
df_dropped_subset = df_missing.dropna(subset=['col2', 'col3'])
print("\nDataFrame after dropping rows with NaN in 'col2' or 'col3':\n", df_dropped_subset)
# Keep rows that have at least 4 non-NaN values
df_dropped_thresh = df_missing.dropna(thresh=4)
print("\nDataFrame after keeping rows with at least 4 non-NaNs:\n", df_dropped_thresh)
Example Filling (Imputation):
print("\n--- Filling NaN ---")
print("Original DataFrame:\n", df_missing)
# Fill all NaN with 0 (might not make sense for all columns!)
df_filled_zero = df_missing.fillna(0)
print("\nDataFrame after filling all NaN with 0:\n", df_filled_zero)
# Note: col2 (string) and col3 (bool) are now mixed types (object)
# Fill NaN with different values per column
fill_values = {'col1': df_missing['col1'].mean(), # Fill numeric with mean
'col2': 'Unknown', # Fill string with placeholder
'col3': False, # Fill boolean with False
'col4': 0} # Fill numeric with 0
df_filled_specific = df_missing.fillna(value=fill_values)
print("\nDataFrame after filling NaN with specific values per column:\n", df_filled_specific)
print(df_filled_specific.dtypes) # Check types
# Forward fill (propagate last valid observation)
df_ffill = df_missing.fillna(method='ffill')
print("\nDataFrame after forward fill (ffill):\n", df_ffill)
# Backward fill (propagate next valid observation)
df_bfill = df_missing.fillna(method='bfill')
print("\nDataFrame after backward fill (bfill):\n", df_bfill)
# Forward fill with a limit
df_ffill_limit = df_missing.fillna(method='ffill', limit=1) # Fill max 1 consecutive NaN
print("\nDataFrame after forward fill (limit=1):\n", df_ffill_limit)
The choice between dropping and filling depends heavily on the context:
- How much data is missing? Dropping rows might discard too much information if many rows have sporadic NaNs. Dropping columns might remove important features.
- Why is the data missing? (Missing Completely At Random, Missing At Random, Missing Not At Random). This influences whether simple imputation methods (like mean/median/mode) are appropriate.
- What is the goal of the analysis? Some models handle NaNs, while others require complete data.
Imputation is often preferred to preserve data, but the imputation method must be chosen carefully (mean, median, mode, regression imputation, etc.).
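To make the trade-off concrete, here is a minimal sketch (with made-up numbers) of how an outlier distorts a mean-based fill while a median fill stays plausible, and how the mode plays the same role for categorical data:
import pandas as pd
import numpy as np
# Hypothetical skewed numeric column: one missing value, one extreme outlier
income = pd.Series([30_000, 32_000, 35_000, np.nan, 1_000_000])
print("Mean fill:  ", income.fillna(income.mean()).iloc[3])    # 274250.0 - pulled up by the outlier
print("Median fill:", income.fillna(income.median()).iloc[3])  # 33500.0 - robust to the outlier
# For a categorical column, the mode (most frequent value) is the usual analogue
colors = pd.Series(['red', 'blue', 'red', np.nan])
print("Mode fill:  ", colors.fillna(colors.mode()[0]).iloc[3]) # 'red'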
Workshop Identifying and Handling Missing Values
Goal: Identify and handle missing values in the 'age' and 'embarked' columns of the Titanic dataset.
Scenario: You've loaded the Titanic dataset and noticed missing values using .info()
. You need to decide on and implement strategies to handle these missing values before further analysis.
Steps:
- Import Pandas and Load Data: Reload the Titanic dataset.
import pandas as pd import numpy as np print("Pandas and NumPy imported.") titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv' try: titanic_df = pd.read_csv(titanic_url) print("\nTitanic dataset loaded successfully!") print(f"Original shape: {titanic_df.shape}") except Exception as e: print(f"\nError loading dataset: {e}") # Placeholder with NaNs titanic_df = pd.DataFrame({ 'survived': [0, 1, 1, 0, 1, 0], 'pclass': [3, 1, 3, 1, 3, 3], 'sex': ['male', 'female', 'female', 'male', 'female', 'male'], 'age': [22.0, 38.0, np.nan, 54.0, 4.0, np.nan], 'fare': [7.25, 71.2833, 7.925, 51.8625, 16.7, 8.4583], 'embarked': ['S', 'C', 'S', 'S', 'S', 'Q'], 'deck': [np.nan, 'C', np.nan, 'E', np.nan, np.nan] # Add another column with many NaNs }) print("\nUsing a placeholder DataFrame with NaNs.") print(f"Placeholder shape: {titanic_df.shape}")
- Identify Missing Values:
- Use
.isnull().sum()
to count missing values in each column. - Calculate the percentage of missing values for each column.
Self-Check: Identify the columns with the most missing values (e.g., 'deck', 'age') and those with just a few ('embarked').
if 'titanic_df' in locals(): print("\n--- Missing Value Counts ---") missing_counts = titanic_df.isnull().sum() print(missing_counts[missing_counts > 0]) # Show only columns with missing values print("\n--- Missing Value Percentages ---") total_rows = len(titanic_df) missing_percentages = (missing_counts / total_rows) * 100 print(missing_percentages[missing_percentages > 0].round(2))
- Use
- Strategy for 'deck': The 'deck' column has a very high percentage of missing values. For many analyses, imputing this might be unreliable or introduce noise. A common strategy is to drop such columns. Drop the 'deck' column.
if 'titanic_df' in locals() and 'deck' in titanic_df.columns: titanic_df.drop('deck', axis=1, inplace=True) print("\n'deck' column dropped.") print(f"New shape: {titanic_df.shape}") # Verify it's gone print("\nMissing counts after dropping 'deck':") print(titanic_df.isnull().sum()[titanic_df.isnull().sum() > 0]) elif 'titanic_df' in locals(): print("\n'deck' column not found or already dropped.")
- Strategy for 'age': The 'age' column has a significant number of missing values (around 20%), but age is likely an important predictor (e.g., for survival). Dropping rows would lose too much data. Imputation is a better choice. A common simple strategy is to fill with the median age (less sensitive to outliers than the mean).
- Calculate the median age.
- Fill the
NaN
values in the 'age' column with the calculated median.if 'titanic_df' in locals() and 'age' in titanic_df.columns: median_age = titanic_df['age'].median() print(f"\nMedian age: {median_age:.2f}") titanic_df['age'].fillna(median_age, inplace=True) print("'age' column NaNs filled with median value.") # Verify NaNs are gone in 'age' print("\nMissing counts after filling 'age':") print(titanic_df.isnull().sum()[titanic_df.isnull().sum() > 0]) print(f"Age column now has {titanic_df['age'].isnull().sum()} NaN values.") elif 'titanic_df' in locals(): print("\n'age' column not found.")
- Strategy for 'embarked': The 'embarked' column has very few missing values. This is a categorical feature. Dropping the few rows might be acceptable, but imputation with the mode (most frequent value) is also common and preserves the data.
- Calculate the mode of the 'embarked' column. (Note:
.mode()
returns a Series, as there might be multiple modes; usually, we take the first one using.iloc[0]
). - Fill the
NaN
values in 'embarked' with the calculated mode.Self-Check: Doesif 'titanic_df' in locals() and 'embarked' in titanic_df.columns: embarked_mode = titanic_df['embarked'].mode()[0] # Get the first mode print(f"\nMost frequent embarkation port (mode): {embarked_mode}") titanic_df['embarked'].fillna(embarked_mode, inplace=True) print("'embarked' column NaNs filled with mode value.") # Verify NaNs are gone print("\nMissing counts after filling 'embarked':") final_missing = titanic_df.isnull().sum() print(final_missing[final_missing > 0]) if final_missing.sum() == 0: print("All handled NaNs have been addressed.") else: print("Some NaNs might remain in other columns.") elif 'titanic_df' in locals(): print("\n'embarked' column not found.") # Final check with info if 'titanic_df' in locals(): print("\n--- Final DataFrame Info ---") titanic_df.info()
.info()
now show that 'age' and 'embarked' have the full count of non-null values? Was 'deck' successfully removed?
- Calculate the mode of the 'embarked' column. (Note:
This workshop guided you through identifying missing data proportions and applying common handling strategies: dropping a column with too many NaNs, imputing a numerical column with the median, and imputing a categorical column with the mode. These are essential data cleaning techniques.
Data Transformation (Mapping, Replacing, Applying Functions)
Often, data isn't in the right format or requires modification for analysis. Pandas provides several methods for transforming data within Series or DataFrames.
map(arg, na_action=None)
(Series method): Substitutes each value in a Series with another value. Primarily used with a dictionary or a function. Values not found in the mapping dictionary becomeNaN
unless the original value was alreadyNaN
.replace(to_replace, value, inplace=False, limit=None, regex=False, method='pad')
(Series/DataFrame method): Replaces specified values (to_replace
) with other values (value
). More flexible thanmap
as it can replace multiple different values simultaneously and doesn't automatically turn non-replaced values intoNaN
. Can work with lists, dicts, regex.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
(DataFrame/Series method): Applies a function along an axis of the DataFrame (row-wise or column-wise) or element-wise on a Series. Very powerful for custom transformations.applymap(func, na_action=None)
(DataFrame method - Use with caution, often slower): Applies a function element-wise to every element of a DataFrame. Often, vectorized operations orapply
are more efficient. Being deprecated/subsumed by other methods in newer Pandas. Use.map
on columns or structuredapply
instead.
Example using map
:
Convert numerical codes to meaningful labels in a Series.
import pandas as pd
import numpy as np
s = pd.Series([1, 2, 1, 3, 2, 1, 4])
print("Original Series:\n", s)
# Mapping dictionary
status_map = {1: 'Active', 2: 'Inactive', 3: 'Pending'}
# Apply map - values not in map (e.g., 4) become NaN
s_mapped = s.map(status_map)
print("\nSeries after mapping (value 4 becomes NaN):\n", s_mapped)
# Map with a function (e.g., add prefix)
s_mapped_func = s.map(lambda x: f"Code_{x}")
print("\nSeries after mapping with lambda function:\n", s_mapped_func)
Example using replace
:
Clean up inconsistent category names or specific values.
df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B_extra', 'C', 'D', 'a'],
'Value': [10, 20, 15, 12, 25, 18, 30, 9]})
print("\nOriginal DataFrame:\n", df)
# Replace a single value
df_replaced_single = df.replace('B_extra', 'B')
print("\nReplace 'B_extra' with 'B':\n", df_replaced_single)
# Replace multiple values with a single value
df_replaced_multiple = df.replace(['A', 'a'], 'Type_A')
print("\nReplace 'A' and 'a' with 'Type_A':\n", df_replaced_multiple)
# Replace using a dictionary (different replacements for different values)
replace_dict = {'A': 'Group Alpha', 'a': 'Group Alpha', 'B': 'Group Beta', 'C': 'Group Gamma'}
df_replaced_dict = df.replace({'Category': replace_dict}) # Specify column for dict keys
print("\nReplace using dictionary mapping:\n", df_replaced_dict)
# Replace numeric value
df_replaced_num = df.replace({'Value': {10: 100}}) # Replace 10 with 100 in 'Value' col
print("\nReplace numeric value 10 with 100 in 'Value':\n", df_replaced_num)
Example using apply
:
Perform row-wise or column-wise calculations.
df_scores = pd.DataFrame({'Quiz1': [8, 7, 9, 6],
'Quiz2': [10, 8, 7, 9],
'Homework': [18, 20, 15, 19]})
print("\nScores DataFrame:\n", df_scores)
# Apply function to each column (axis=0, default) - e.g., calculate range
def range_fn(x): # x is a Series (a column in this case)
return x.max() - x.min()
col_ranges = df_scores.apply(range_fn) # or df_scores.apply(range_fn, axis=0)
print("\nRange (max-min) for each column:\n", col_ranges)
# Apply function to each row (axis=1) - e.g., calculate total score
def total_score(row): # row is a Series (a row in this case)
return row['Quiz1'] + row['Quiz2'] + row['Homework']
df_scores['Total'] = df_scores.apply(total_score, axis=1)
print("\nDataFrame with Total score calculated row-wise:\n", df_scores)
# Apply with lambda for simpler operations (e.g., normalize homework score out of 20)
df_scores['Homework_Perc'] = df_scores['Homework'].apply(lambda x: (x / 20) * 100)
print("\nDataFrame with Homework Percentage:\n", df_scores)
These transformation methods (map
, replace
, apply
) are fundamental tools for reshaping your data into the form needed for analysis or modeling. Choose the method that best suits the specific transformation task.
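As a rough illustration of the note above about preferring vectorized operations where possible, the following sketch (the 1.19 tax factor is made up) computes the same derived column twice: once row-wise with apply and once as a single vectorized expression. The results are identical, but the vectorized form avoids per-row Python overhead on large DataFrames.
import pandas as pd
df = pd.DataFrame({'price': [10.0, 20.0, 30.0], 'qty': [1, 2, 3]})
# Row-wise apply with a Python lambda: flexible, but slow on large data
df['total_apply'] = df.apply(lambda row: row['price'] * row['qty'] * 1.19, axis=1)
# Equivalent vectorized expression: runs in optimized compiled code
df['total_vectorized'] = df['price'] * df['qty'] * 1.19
print(df)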
Workshop Cleaning and Transforming Customer Data
Goal: Clean and transform a hypothetical customer dataset, including standardizing categorical values and creating new features using transformation functions.
Scenario: You have a small dataset of customer information with inconsistencies in gender notation and want to calculate the age from the date of birth.
Steps:
- Import Pandas and Create Data:
import pandas as pd from datetime import datetime # To calculate age # Create sample customer data data = { 'CustomerID': [1001, 1002, 1003, 1004, 1005, 1006], 'Gender': ['M', 'Female', 'male', 'F', 'FEM', 'Male'], 'DateOfBirth': ['1990-05-15', '1985-11-22', '1998-02-01', '2001-07-30', '1985-03-10', '1995-09-05'], 'City': ['New York', 'London', 'Paris', 'London', 'New York', 'Tokyo'] } customer_df = pd.DataFrame(data) print("Original Customer DataFrame:") print(customer_df) customer_df.info() # Note DateOfBirth is object type initially
- Standardize 'Gender' Column: The 'Gender' column has inconsistent values ('M', 'Female', 'male', etc.). Standardize them to 'Male' and 'Female'.
- Method 1: Using
replace
with a dictionary: Create a mapping for all variations. - Method 2: Using
map
after converting to a consistent case (e.g., lower): Convert all to lowercase, then map 'm'/'male' and 'f'/'female'. (Let's usereplace
for simplicity here).Self-Check: Are all values in the 'Gender' column now either 'Male' or 'Female'?gender_replace_map = { 'M': 'Male', 'male': 'Male', 'F': 'Female', 'Female': 'Female', # Explicitly map correct ones too, or use more targeted replace 'FEM': 'Female' } # Apply replace to the 'Gender' column customer_df['Gender'] = customer_df['Gender'].replace(gender_replace_map) print("\nDataFrame after standardizing Gender:") print(customer_df) print("\nValue Counts for standardized Gender:") print(customer_df['Gender'].value_counts())
- Method 1: Using
- Convert 'DateOfBirth' to Datetime: The 'DateOfBirth' column is currently stored as strings (object). Convert it to proper datetime objects using
pd.to_datetime()
for calculations. - Calculate Age: Create a new 'Age' column by calculating the difference between a reference date (e.g., today) and the 'DateOfBirth'.
- Get today's date.
- Define a function that takes a birthdate and returns the age in years.
- Use
.apply()
on the 'DateOfBirth' column with this function.Self-Check: Does the 'Age' column contain plausible integer ages based on the 'DateOfBirth' values?today = pd.to_datetime('today') # Use pandas' to_datetime for consistency # Or: today = datetime.now() def calculate_age(born): # Calculate age based on year difference, adjust if birthday hasn't passed this year age = today.year - born.year - ((today.month, today.day) < (born.month, born.day)) return age # Apply the function to the 'DateOfBirth' column customer_df['Age'] = customer_df['DateOfBirth'].apply(calculate_age) print("\nDataFrame with Age column calculated:") print(customer_df)
- Categorize City (using
map
): Create a 'Region' column based on the 'City'. Map 'New York' to 'North America', 'London' and 'Paris' to 'Europe', and 'Tokyo' to 'Asia'. Self-Check: Does the 'Region' column correctly reflect the mapping based on the 'City'?
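The 'Convert DateOfBirth' and 'Categorize City' steps above are not shown in code in this workshop, so here is one minimal sketch of both, assuming the customer_df created in the first step:
# Sketch for the datetime conversion step: parse the string column into datetime64
customer_df['DateOfBirth'] = pd.to_datetime(customer_df['DateOfBirth'])
print(customer_df.dtypes)  # 'DateOfBirth' should now show datetime64[ns]
# Sketch for the Region mapping step: cities not present in the dictionary would become NaN
city_to_region = {'New York': 'North America',
                  'London': 'Europe',
                  'Paris': 'Europe',
                  'Tokyo': 'Asia'}
customer_df['Region'] = customer_df['City'].map(city_to_region)
print(customer_df[['City', 'Region']])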
This workshop involved common data cleaning and transformation tasks: standardizing categorical data using replace
, converting data types using pd.to_datetime
, creating a new feature (Age) using a custom function with apply
, and mapping values to create categories using map
.
Dealing with Duplicates
Duplicate rows can skew analysis results (e.g., inflating counts or averages). Pandas provides easy ways to identify and remove duplicate rows.
duplicated(subset=None, keep='first')
(DataFrame/Series method): Returns a boolean Series indicating which rows are duplicates.subset
: List of column labels to consider when identifying duplicates. By default, uses all columns.keep='first'
(default): Marks duplicates asTrue
except for the first occurrence.keep='last'
: Marks duplicates asTrue
except for the last occurrence.keep=False
: Marks all duplicates asTrue
.
drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
(DataFrame/Series method): Returns a DataFrame/Series with duplicate rows removed.- Parameters
subset
andkeep
work the same way as induplicated()
. inplace=False
(default): Returns a new DataFrame with duplicates removed. Set toTrue
to modify the original DataFrame.ignore_index=False
(default): IfTrue
, the resulting index will be reset to 0, 1, ..., n-1.
- Parameters
Example:
import pandas as pd
data = {'colA': ['X', 'Y', 'X', 'Y', 'Z', 'X'],
'colB': [1, 2, 1, 2, 3, 1],
'colC': [100, 200, 100, 300, 400, 100]}
df_duplicates = pd.DataFrame(data)
print("Original DataFrame with Duplicates:")
print(df_duplicates)
# Identify duplicate rows (considering all columns)
print("\nIdentifying duplicate rows (keep='first'):")
print(df_duplicates.duplicated())
# Output: Rows 2 and 5 are duplicates of row 0
print("\nIdentifying duplicate rows (keep='last'):")
print(df_duplicates.duplicated(keep='last'))
# Output: Rows 0 and 2 are duplicates of row 5
print("\nIdentifying all duplicate rows (keep=False):")
print(df_duplicates.duplicated(keep=False))
# Output: Rows 0, 2, 5 are marked as part of a duplicate set
# Identify duplicates based on a subset of columns ('colA', 'colB')
print("\nIdentifying duplicates based on 'colA', 'colB' (keep='first'):")
print(df_duplicates.duplicated(subset=['colA', 'colB']))
# Output: Rows 2 and 5 are duplicates of row 0 based on ('X', 1)
# Row 3 is a duplicate of row 1 based on ('Y', 2)
print("\n--- Removing Duplicates ---")
# Drop duplicate rows (keeping the first occurrence)
df_no_duplicates_first = df_duplicates.drop_duplicates() # keep='first' is default
print("\nDataFrame after dropping duplicates (keep='first'):")
print(df_no_duplicates_first)
# Drop duplicate rows (keeping the last occurrence)
df_no_duplicates_last = df_duplicates.drop_duplicates(keep='last')
print("\nDataFrame after dropping duplicates (keep='last'):")
print(df_no_duplicates_last)
# Drop duplicates based on subset ('colA', 'colB'), keep first
df_no_duplicates_subset = df_duplicates.drop_duplicates(subset=['colA', 'colB'], keep='first')
print("\nDataFrame after dropping duplicates based on 'colA','colB' (keep='first'):")
print(df_no_duplicates_subset)
Deciding which columns define a "duplicate" (subset
) and which occurrence to keep (keep
) depends on the specific meaning of your data and the goals of your analysis. Always check the result carefully after dropping duplicates.
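For example, if each duplicate group should keep its most recent record, one common pattern (a sketch, assuming a timestamp column like the one in the workshop below) is to sort before dropping:
import pandas as pd
df = pd.DataFrame({'CustomerID': ['A', 'A', 'B'],
                   'Product': ['Widget', 'Widget', 'Gadget'],
                   'Timestamp': pd.to_datetime(['2023-10-26 10:00', '2023-10-26 12:00', '2023-10-26 11:00'])})
# Sort so the latest record in each (CustomerID, Product) group comes last,
# then keep='last' retains exactly that latest record
latest = df.sort_values('Timestamp').drop_duplicates(subset=['CustomerID', 'Product'], keep='last')
print(latest)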
Workshop Finding and Removing Duplicate Records
Goal: Identify and remove duplicate entries from a simulated order dataset.
Scenario: You have a dataset representing online orders, but due to a system glitch, some orders were recorded multiple times. You need to clean the dataset by removing these exact duplicates. You also want to identify orders that might be duplicates based only on customer and product, ignoring the timestamp.
Steps:
- Import Pandas and Create Data:
import pandas as pd # Create sample order data with duplicates order_data = { 'OrderID': [101, 102, 103, 101, 104, 105, 102, 106, 101], 'CustomerID': ['CUST-A', 'CUST-B', 'CUST-A', 'CUST-A', 'CUST-C', 'CUST-B', 'CUST-B', 'CUST-D', 'CUST-A'], 'Product': ['Widget', 'Gadget', 'Widget', 'Widget', 'Thingamajig', 'Gadget', 'Gadget', 'Widget', 'Widget'], 'Quantity': [2, 1, 3, 2, 5, 2, 1, 1, 2], 'Timestamp': pd.to_datetime(['2023-10-26 10:00:00', '2023-10-26 10:05:00', '2023-10-26 10:08:00', '2023-10-26 10:00:00', # Exact duplicate of row 0 '2023-10-26 10:15:00', '2023-10-26 11:00:00', '2023-10-26 10:05:00', # Exact duplicate of row 1 '2023-10-26 11:30:00', '2023-10-26 10:00:00']) # Exact duplicate of row 0 & 3 } orders_df = pd.DataFrame(order_data) print("Original Orders DataFrame:") print(orders_df) print(f"\nOriginal number of rows: {len(orders_df)}")
- Identify Exact Duplicates: Find rows that are completely identical across all columns. Use
duplicated()
with default settings.Self-Check: Rows 3, 6, and 8 should be identified asexact_duplicates_bool = orders_df.duplicated() print("\nBoolean Series indicating exact duplicates (keeping first):") print(exact_duplicates_bool) # Show the actual duplicate rows exact_duplicate_rows = orders_df[exact_duplicates_bool] print("\nThe exact duplicate rows are:") print(exact_duplicate_rows)
True
(duplicates of rows 0, 1, and 0 respectively). - Remove Exact Duplicates: Create a new DataFrame
orders_cleaned_df
by dropping the identified exact duplicates, keeping the first occurrence.Self-Check: Doesorders_cleaned_df = orders_df.drop_duplicates(keep='first') print("\nDataFrame after removing exact duplicates (keeping first):") print(orders_cleaned_df) print(f"\nNumber of rows after removing exact duplicates: {len(orders_cleaned_df)}")
orders_cleaned_df
have 6 rows (original 9 - 3 duplicates)? Are rows with index 3, 6, and 8 gone? - Identify Duplicates Based on Subset: Now, identify rows that are duplicates based only on 'CustomerID' and 'Product', ignoring 'OrderID', 'Quantity', and 'Timestamp'. This might indicate multiple orders of the same product by the same customer, which may or may not be errors depending on context.
Self-Check: Rows where ('CUST-A', 'Widget') appears (indices 0, 2, 3, 8) and where ('CUST-B', 'Gadget') appears (indices 1, 5, 6) should all be marked
subset_duplicates_bool = orders_df.duplicated(subset=['CustomerID', 'Product'], keep=False) # Keep=False marks ALL occurrences print("\nBoolean Series indicating duplicates based on CustomerID and Product (marking all):") print(subset_duplicates_bool) # Show all rows involved in a subset duplication subset_duplicate_rows = orders_df[subset_duplicates_bool] print("\nRows involved in duplication based on CustomerID and Product:") # Sort for easier comparison print(subset_duplicate_rows.sort_values(by=['CustomerID', 'Product']))
True
. - Remove Subset Duplicates (Optional/Contextual): If the goal was to keep only the first order instance of a specific product by a customer, you could use
drop_duplicates
with the subset.Self-Check: Does this result have only one row for ('CUST-A', 'Widget') and one for ('CUST-B', 'Gadget')? The specific row kept depends on the# Example: Keep only the first time each customer ordered each product orders_first_instance_df = orders_df.drop_duplicates(subset=['CustomerID', 'Product'], keep='first') print("\nDataFrame keeping only the first instance (based on CustomerID, Product):") print(orders_first_instance_df) print(f"\nNumber of rows after keeping first instance: {len(orders_first_instance_df)}")
keep
parameter and original order.
This workshop demonstrated how to use duplicated()
and drop_duplicates()
both for finding exact row duplicates and for finding duplicates based on a subset of columns, allowing for different data cleaning objectives.
Advanced Level
This level delves into more sophisticated Pandas and NumPy functionalities, enabling complex data manipulations, aggregations, and analyses crucial for deeper insights and specialized tasks like time series analysis.
5. Advanced Data Selection and Indexing
Beyond basic slicing and label/integer-based access ([]
, .loc
, .iloc
), Pandas offers more advanced indexing techniques, including hierarchical indexing (MultiIndex) and powerful boolean masking, allowing for intricate data selection and manipulation.
Hierarchical Indexing (MultiIndex)
Hierarchical indexing, or MultiIndex, allows you to have multiple index levels on an axis (rows or columns). It's a way to represent higher-dimensional data in a lower-dimensional structure like a Series (1D) or DataFrame (2D). Think of it as creating groups within groups.
Creating a MultiIndex:
You can create a MultiIndex explicitly using pd.MultiIndex.from_arrays
, pd.MultiIndex.from_tuples
, or pd.MultiIndex.from_product
, or implicitly by providing a list of arrays/tuples to the index
or columns
argument when creating a Series or DataFrame.
import pandas as pd
import numpy as np
# Method 1: Using from_arrays
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
multi_index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
print("MultiIndex from arrays:\n", multi_index)
# Create a Series with this MultiIndex
s_multi = pd.Series(np.random.randn(8), index=multi_index)
print("\nSeries with MultiIndex:")
print(s_multi)
print("-" * 30)
# Method 2: Implicitly during DataFrame creation
data = np.random.randn(4, 3)
df_multi_cols = pd.DataFrame(data,
columns=[['GroupA', 'GroupA', 'GroupB'],
['Metric1', 'Metric2', 'Metric1']],
index=['Row1', 'Row2', 'Row3', 'Row4'])
df_multi_cols.columns.names = ['BroadGroup', 'SpecificMetric']
print("\nDataFrame with MultiIndex Columns:")
print(df_multi_cols)
print("-" * 30)
# Method 3: Implicitly with index during DataFrame creation
df_multi_rows = pd.DataFrame(np.random.randn(6, 2),
index=[['Type1', 'Type1', 'Type1', 'Type2', 'Type2', 'Type2'],
['SubA', 'SubB', 'SubC', 'SubA', 'SubB', 'SubC']],
columns=['Value1', 'Value2'])
df_multi_rows.index.names = ['MainType', 'SubType']
print("\nDataFrame with MultiIndex Rows:")
print(df_multi_rows)
Indexing with MultiIndex:
Accessing data in objects with MultiIndex requires careful use of .loc
, .iloc
, and potentially slicing.
- Partial Indexing: You can select data by specifying only the outer level(s) of the index.
- Tuple-based Indexing: Provide a tuple to
.loc
to select based on multiple levels. - Slicing: Slicing works, but slicing across levels requires careful syntax, often using
slice(None)
to select all items within intermediate levels. Thepd.IndexSlice
object provides a more readable way to slice MultiIndexes.
print("\n--- Indexing Series with MultiIndex ---")
print("Original Series (s_multi):\n", s_multi)
# Select outer level 'bar'
print("\nSelect outer level 'bar':\n", s_multi['bar'])
# Equivalent using .loc
# print("\nSelect outer level 'bar' (.loc):\n", s_multi.loc['bar'])
# Select specific combination using tuple with .loc
print("\nSelect ('baz', 'two'):", s_multi.loc[('baz', 'two')])
# Select multiple outer levels
print("\nSelect outer levels 'bar' and 'foo':\n", s_multi.loc[['bar', 'foo']])
# Slicing the outer level (works like regular index)
print("\nSlice outer level from 'baz' to 'foo':\n", s_multi['baz':'foo'])
# Slicing inner level for a specific outer level
print("\nSlice inner level from 'one' for outer 'bar':\n", s_multi.loc[('bar', 'one'):]) # Be careful with interpretation
print("\nSlice inner level == 'one' across applicable outer levels:\n", s_multi.loc[:, 'one']) # Slice notation: select all outer, inner 'one'
print("\n--- Indexing DataFrame with MultiIndex Rows ---")
print("Original DataFrame (df_multi_rows):\n", df_multi_rows)
# Select outer index level 'Type1'
print("\nSelect outer index 'Type1':\n", df_multi_rows.loc['Type1'])
# Select specific row ('Type2', 'SubA')
print("\nSelect row ('Type2', 'SubA'):\n", df_multi_rows.loc[('Type2', 'SubA')])
# Select specific cell ('Type1', 'SubB'), 'Value2'
print("\nSelect cell [('Type1', 'SubB'), 'Value2']:", df_multi_rows.loc[('Type1', 'SubB'), 'Value2'])
# Slice using IndexSlice
idx = pd.IndexSlice
# Select all 'SubA' subtypes across all MainTypes for 'Value1'
print("\nSelect all 'SubA' for 'Value1' using IndexSlice:\n", df_multi_rows.loc[idx[:, 'SubA'], 'Value1'])
# Select 'Type1' MainType, 'SubB' through 'SubC' SubTypes, for all columns
print("\nSelect 'Type1', 'SubB':'SubC' using IndexSlice:\n", df_multi_rows.loc[idx['Type1', 'SubB':'SubC'], :])
print("\n--- Indexing DataFrame with MultiIndex Columns ---")
print("Original DataFrame (df_multi_cols):\n", df_multi_cols)
# Select outer column level 'GroupA'
print("\nSelect outer column 'GroupA':\n", df_multi_cols['GroupA'])
# Select specific column ('GroupA', 'Metric2')
print("\nSelect column ('GroupA', 'Metric2'):\n", df_multi_cols[('GroupA', 'Metric2')]) # Direct [] works for columns
# Select cell 'Row2', ('GroupB', 'Metric1') using .loc
print("\nSelect cell ['Row2', ('GroupB', 'Metric1')]:", df_multi_cols.loc['Row2', ('GroupB', 'Metric1')])
# Select using IndexSlice for columns
# Select all rows for 'Metric1' across all BroadGroups
print("\nSelect all rows for 'Metric1' using IndexSlice:\n", df_multi_cols.loc[:, idx[:, 'Metric1']])
Stacking and Unstacking:
MultiIndex objects allow for easy reshaping between "long" and "wide" formats:
stack(level=-1, dropna=True)
: Moves a level from the column index to become the innermost level of the row index, potentially creating a Series if only one column level remains. Produces a "longer" format.unstack(level=-1, fill_value=None)
: Moves a level from the row index to become the innermost level of the column index. Produces a "wider" format.fill_value
can be used to replace NaNs introduced during unstacking.
print("\n--- Stacking and Unstacking ---")
print("Original DataFrame (df_multi_rows):\n", df_multi_rows)
# Unstack the inner row index level ('SubType')
df_unstacked = df_multi_rows.unstack()
print("\nDataFrame after unstacking 'SubType':\n", df_unstacked) # SubType becomes inner col level
# Unstack the outer row index level ('MainType')
df_unstacked_outer = df_multi_rows.unstack(level=0) # or level='MainType'
print("\nDataFrame after unstacking 'MainType':\n", df_unstacked_outer)
# Now, stack the innermost column level back to the row index
df_stacked_back = df_unstacked.stack()
print("\nDataFrame after stacking back the innermost column level:\n", df_stacked_back) # Should resemble original
# Stack based on DataFrame with multi-level columns (df_multi_cols)
print("\nOriginal DataFrame (df_multi_cols):\n", df_multi_cols)
df_stacked_cols = df_multi_cols.stack() # Stacks innermost col level ('SpecificMetric')
print("\nDataFrame after stacking 'SpecificMetric':\n", df_stacked_cols)
df_stacked_cols_outer = df_multi_cols.stack(level=0) # Stacks outer col level ('BroadGroup')
print("\nDataFrame after stacking 'BroadGroup':\n", df_stacked_cols_outer)
MultiIndex is a powerful feature for working with complex, structured datasets within the familiar DataFrame format. Mastering its creation, indexing, and reshaping capabilities is key for advanced data manipulation.
Workshop Working with MultiIndex DataFrames
Goal: Create and manipulate a MultiIndex DataFrame representing quarterly sales data for different products across regions.
Scenario: You have sales data broken down by region, product category, and quarter. You need to structure this data using a MultiIndex and then practice selecting and summarizing parts of it.
Steps:
- Import Libraries and Create Data:
Self-Check: Does the DataFrame have a two-level row index ('Region', 'Product') and a single-level column index ('Q1', 'Q2', 'Q3', 'Q4')?
import pandas as pd import numpy as np # Create data components regions = ['North', 'South', 'East', 'West'] products = ['Electronics', 'Apparel', 'Home Goods'] quarters = ['Q1', 'Q2', 'Q3', 'Q4'] # Create a MultiIndex using from_product multi_index = pd.MultiIndex.from_product([regions, products], names=['Region', 'Product']) # Create random sales data (e.g., in thousands of dollars) np.random.seed(42) # for reproducibility sales_data = np.random.randint(50, 500, size=(len(multi_index), len(quarters))) # rows = region*product, cols = quarters # Create the DataFrame sales_df = pd.DataFrame(sales_data, index=multi_index, columns=quarters) print("Quarterly Sales Data with MultiIndex:") print(sales_df)
- Select Data Using
.loc
:- Select all data for the 'North' region.
- Select data for ('South', 'Apparel').
- Select Q1 sales for all products in the 'East' region.
- Select Q3 and Q4 sales for 'Electronics' across all regions. Use
pd.IndexSlice
.print("\n--- Data Selection ---") # Data for 'North' region north_sales = sales_df.loc['North'] print("\nSales for North Region:\n", north_sales) # Data for ('South', 'Apparel') south_apparel_sales = sales_df.loc[('South', 'Apparel')] print("\nSales for South Region, Apparel Product:\n", south_apparel_sales) # This will be a Series # Q1 sales for 'East' region east_q1_sales = sales_df.loc['East', 'Q1'] print("\nQ1 Sales for East Region:\n", east_q1_sales) # This will be a Series (Product -> Sales) # Q3/Q4 sales for 'Electronics' across all regions idx = pd.IndexSlice electronics_q3_q4 = sales_df.loc[idx[:, 'Electronics'], ['Q3', 'Q4']] print("\nQ3 & Q4 Sales for Electronics (All Regions):\n", electronics_q3_q4)
- Calculate Summary Statistics:
- Calculate the total annual sales for each (Region, Product) combination (sum across columns).
- Calculate the average sales per quarter for each Region (average across Products within a Region, then across columns).
- Calculate the total sales for each Product category across all regions and quarters.
print("\n--- Summary Statistics ---") # Total annual sales per (Region, Product) sales_df['Total'] = sales_df.sum(axis=1) # Sum across columns (Quarters) print("\nDataFrame with Total Annual Sales:\n", sales_df) # Average quarterly sales per Region (Group by Region level, then mean) # First, calculate mean quarterly sales for each row # Then, group by region and calculate the mean of those means # OR, more directly: group by level 0 and take the mean across all values in the group avg_quarterly_sales_per_region = sales_df[['Q1','Q2','Q3','Q4']].mean(axis=1).groupby(level='Region').mean() print("\nAverage Quarterly Sales per Region (approx):\n", avg_quarterly_sales_per_region) # Alternative: Calculate total sales per region per quarter, then average total_sales_per_region_quarter = sales_df[['Q1','Q2','Q3','Q4']].groupby(level='Region').sum() avg_sales_per_region = total_sales_per_region_quarter.mean(axis=1) print("\nAverage Total Sales per Region (across quarters):\n", avg_sales_per_region) # Total sales per Product category (group by Product level, sum everything) # We need to sum the 'Total' column grouped by 'Product' total_sales_per_product = sales_df['Total'].groupby(level='Product').sum() # Alternatively, sum all quarterly data grouped by product # total_sales_per_product = sales_df[['Q1','Q2','Q3','Q4']].sum(axis=1).groupby(level='Product').sum() print("\nTotal Sales per Product Category (All Regions, All Quarters):\n", total_sales_per_product) # Remove the 'Total' column if we want to continue with original shape sales_df = sales_df.drop('Total', axis=1)
- Reshape using
unstack()
: Make 'Product' a column level to easily compare products within each region side-by-side. Self-Check: Does the resulting DataFrame have 'Region' as the row index and a MultiIndex for columns ('Product', 'Quarter')? - Reshape using
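The code for this unstack step is not shown in the workshop; here is a minimal sketch that produces the sales_unstacked_product frame referenced in the next step (the swaplevel/sort_index calls are one assumption about how the column levels were meant to be ordered, so that 'Quarter' ends up innermost):
# Move 'Product' from the row index into the columns, then reorder the column
# levels to (Product, Quarter) so products sit side-by-side within each region
sales_unstacked_product = (sales_df.unstack(level='Product')   # columns become (Quarter, Product)
                                   .swaplevel(axis=1)          # -> (Product, Quarter)
                                   .sort_index(axis=1))        # group the columns by product
print(sales_unstacked_product)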
stack()
: Take thesales_unstacked_product
DataFrame and stack the 'Quarter' level back into the row index, creating a Series with a three-level MultiIndex (Region, Quarter, Product).Self-Check: Does the resulting Series/DataFrame have three levels in its index? Is the data now in a "long" format where each row represents a single observation (Region, Product, Quarter -> Sales)?# Stack the innermost column level ('Quarter') from the unstacked DF sales_fully_stacked = sales_unstacked_product.stack(level=-1) # Innermost is Quarter level # Or stack the original df by quarter # sales_fully_stacked = sales_df.stack() # This makes quarter the innermost row index level # sales_fully_stacked.index.names = ['Region', 'Product', 'Quarter'] # Name the new level print("\nDataFrame stacked by Quarter (from unstacked state):") print(sales_fully_stacked.head(12)) # Show first few rows of the long format # If stacking the original DF: sales_stacked_original = sales_df.stack() sales_stacked_original.index.names = ['Region', 'Product', 'Quarter'] print("\nOriginal DataFrame stacked by Quarter:") print(sales_stacked_original.head(12)) # Show first few rows
This workshop provided practice in creating a MultiIndex DataFrame, selecting data using various .loc
techniques (including pd.IndexSlice
), calculating grouped summaries using groupby(level=...)
, and reshaping the data between wide and long formats using unstack()
and stack()
.
Boolean Indexing and .loc
/.iloc
Boolean indexing is a powerful technique where you filter data based on logical conditions. It allows for complex selections that go beyond simple label or positional indexing.
How it Works:
- You create a boolean Series (or array) whose index aligns with the DataFrame's index (or whose length matches the number of rows/columns). This boolean Series contains
True
for rows/columns you want to keep andFalse
for those you want to discard. - You pass this boolean Series inside the
.loc[]
accessor (or sometimes directly[]
, though.loc
is safer and more explicit).
Creating Boolean Conditions:
You typically create boolean Series using comparison operators (==
, !=
, >
, <
, >=
, <=
) on DataFrame columns or Series. Multiple conditions can be combined using logical operators:
&
(AND): Both conditions must be true.|
(OR): At least one condition must be true.~
(NOT): Inverts the boolean value.
Important: When combining multiple conditions, each condition must be enclosed in parentheses ()
due to Python's operator precedence rules.
Examples:
Let's use the cleaned Titanic dataset from a previous workshop.
import pandas as pd
import numpy as np
# Recreate or load the cleaned Titanic data (age/embarked filled, deck dropped)
titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
try:
titanic_df = pd.read_csv(titanic_url)
# Apply cleaning steps from previous workshop
if 'deck' in titanic_df.columns: titanic_df.drop('deck', axis=1, inplace=True)
if 'age' in titanic_df.columns: titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())
if 'embarked' in titanic_df.columns: titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])
print("Cleaned Titanic dataset loaded/recreated.")
print(f"Shape: {titanic_df.shape}")
print("NaN counts:\n", titanic_df.isnull().sum())
except Exception as e:
print(f"Error loading/cleaning dataset: {e}")
# Fallback placeholder
titanic_df = pd.DataFrame({
'survived': [0, 1, 1, 0, 1, 0, 0, 1],
'pclass': [3, 1, 3, 1, 3, 3, 1, 3],
'sex': ['male', 'female', 'female', 'male', 'female', 'male', 'male', 'female'],
'age': [22.0, 38.0, 26.0, 54.0, 4.0, 28.0, 30.0, 14.0], # Median filled approx 28
'fare': [7.25, 71.28, 7.92, 51.86, 16.7, 8.45, 50.0, 11.24],
'embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S'] # Mode filled S
})
print("Using placeholder DataFrame.")
# Condition 1: Select passengers older than 60
cond_age_gt_60 = titanic_df['age'] > 60
print("\nBoolean Series for age > 60:\n", cond_age_gt_60.head())
# Apply the condition using .loc
passengers_over_60 = titanic_df.loc[cond_age_gt_60]
print("\nPassengers older than 60:\n", passengers_over_60[['age', 'pclass', 'sex']].head())
# Condition 2: Select female passengers in 1st class
cond_female = titanic_df['sex'] == 'female'
cond_pclass1 = titanic_df['pclass'] == 1
# Combine conditions using & (AND) - note parentheses!
cond_female_pclass1 = (cond_female) & (cond_pclass1)
print("\nBoolean Series for female AND pclass=1:\n", cond_female_pclass1.head())
female_pclass1_passengers = titanic_df.loc[cond_female_pclass1]
print("\nFemale passengers in 1st class:\n", female_pclass1_passengers[['sex', 'pclass', 'age', 'fare']].head())
# Condition 3: Select passengers who either paid more than $100 OR embarked at 'C'
cond_fare_gt_100 = titanic_df['fare'] > 100
cond_embarked_c = titanic_df['embarked'] == 'C'
# Combine conditions using | (OR) - note parentheses!
cond_highfare_or_cherbourg = (cond_fare_gt_100) | (cond_embarked_c)
highfare_or_cherbourg_passengers = titanic_df.loc[cond_highfare_or_cherbourg]
print(f"\nNumber of passengers with fare > 100 OR embarked at C: {len(highfare_or_cherbourg_passengers)}")
print("Sample:\n", highfare_or_cherbourg_passengers[['fare', 'embarked', 'pclass', 'age']].head())
# Condition 4: Select passengers NOT in 3rd class
cond_not_pclass3 = ~(titanic_df['pclass'] == 3) # Using ~ (NOT)
# Alternatively: cond_not_pclass3 = titanic_df['pclass'] != 3
passengers_not_pclass3 = titanic_df.loc[cond_not_pclass3]
print(f"\nNumber of passengers not in 3rd class: {len(passengers_not_pclass3)}")
print("Pclass value counts for this selection:\n", passengers_not_pclass3['pclass'].value_counts())
# Using boolean indexing to select specific COLUMNS with .loc
# Select 'age' and 'fare' columns for rows where 'survived' is 1
survived_passengers_age_fare = titanic_df.loc[titanic_df['survived'] == 1, ['age', 'fare']]
print("\nAge and Fare for survivors:\n", survived_passengers_age_fare.head())
# Using boolean indexing with .iloc (Less common, requires boolean array of correct length)
# Select rows 0, 2, 4 using a boolean mask
mask = np.array([True, False, True, False, True] + [False]*(len(titanic_df)-5)) # Create boolean array matching length
selected_rows_iloc = titanic_df.iloc[mask]
print("\nSelecting rows 0, 2, 4 using boolean mask with .iloc:\n", selected_rows_iloc)
Key Advantages of Boolean Indexing with .loc
:
- Readability: Conditions often express the filtering logic clearly.
- Flexibility: Allows for complex, multi-condition selections based on data values.
- Efficiency: Vectorized boolean operations are generally fast.
- Label Alignment:
.loc
ensures alignment based on index labels, preventing subtle errors.
While .iloc
accepts boolean arrays, it's primarily designed for integer position-based indexing. Using boolean Series with .loc
is the standard and generally preferred method for value-based filtering in Pandas.
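Two convenience methods not used in the examples above, but which produce exactly the same kind of boolean Series and combine with & and | in the same way, are isin() for membership tests and between() for range tests:
# Passengers who embarked at either 'C' or 'Q'
embarked_c_or_q = titanic_df.loc[titanic_df['embarked'].isin(['C', 'Q'])]
# Passengers aged 0 through 12 (between() is inclusive on both ends by default)
young_children = titanic_df.loc[titanic_df['age'].between(0, 12)]
print(f"Embarked at C or Q: {len(embarked_c_or_q)}, aged 0-12: {len(young_children)}")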
Workshop Complex Data Filtering and Subset Selection
Goal: Apply complex boolean indexing criteria to the Titanic dataset to isolate specific groups of passengers and analyze their characteristics.
Scenario: You need to identify and analyze several specific passenger cohorts from the Titanic dataset:
- Children (age < 18) who were in 3rd class.
- Adult males (age >= 18, sex = 'male') who survived.
- Passengers who embarked at 'S' but were not in 3rd class.
- Passengers whose fare was unusually high (e.g., greater than the 95th percentile) or unusually low (e.g., less than the 5th percentile).
Steps:
- Import Pandas and Load/Prepare Data: Ensure the cleaned Titanic dataset is loaded into
titanic_df
.import pandas as pd import numpy as np # Assuming titanic_df is loaded and cleaned from the previous section # If not, reload and clean it here. try: # Verify data is loaded and cleaned print("Using pre-loaded and cleaned Titanic DataFrame.") print(titanic_df.info()) if titanic_df.isnull().sum().sum() != 0: print("Warning: DataFrame still contains NaNs. Please ensure cleaning steps were run.") except NameError: print("Error: titanic_df not found. Loading and cleaning...") # Paste loading and cleaning code here if needed titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv' try: titanic_df = pd.read_csv(titanic_url) if 'deck' in titanic_df.columns: titanic_df.drop('deck', axis=1, inplace=True) if 'age' in titanic_df.columns: titanic_df['age'].fillna(titanic_df['age'].median(), inplace=True) if 'embarked' in titanic_df.columns: titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0], inplace=True) print("Cleaned Titanic dataset loaded/recreated.") if titanic_df.isnull().sum().sum() != 0: print("Warning: NaNs remain after cleaning.") except Exception as e: print(f"Fatal Error: Could not load/clean data - {e}") # Cannot proceed without data exit() # Or raise an error
- Filter 1: Third-Class Children:
- Create boolean conditions for
age < 18
andpclass == 3
. - Combine them using
&
. - Select the rows using
.loc
. - Display the shape and maybe the survival rate of this group.
print("\n--- Filter 1: Third-Class Children ---") cond_child = titanic_df['age'] < 18 cond_pclass3 = titanic_df['pclass'] == 3 third_class_children = titanic_df.loc[cond_child & cond_pclass3] print(f"Number of third-class children: {third_class_children.shape[0]}") if not third_class_children.empty: survival_rate_3rd_child = third_class_children['survived'].mean() print(f"Survival rate for third-class children: {survival_rate_3rd_child:.2%}") print(third_class_children[['age', 'sex', 'survived']].head())
- Create boolean conditions for
- Filter 2: Surviving Adult Males:
- Create boolean conditions for
age >= 18
,sex == 'male'
, andsurvived == 1
. - Combine them using
&
. - Select the rows using
.loc
. - Display the shape and maybe the class distribution of this group.
print("\n--- Filter 2: Surviving Adult Males ---") cond_adult = titanic_df['age'] >= 18 cond_male = titanic_df['sex'] == 'male' cond_survived = titanic_df['survived'] == 1 surviving_adult_males = titanic_df.loc[cond_adult & cond_male & cond_survived] print(f"Number of surviving adult males: {surviving_adult_males.shape[0]}") if not surviving_adult_males.empty: print("Class distribution for surviving adult males:") print(surviving_adult_males['pclass'].value_counts(normalize=True).round(3)) print(surviving_adult_males[['age', 'pclass', 'fare']].head())
- Create boolean conditions for
- Filter 3: Non-Third-Class Passengers from Southampton:
- Create boolean conditions for
embarked == 'S'
andpclass != 3
. - Combine them using
&
. - Select the rows using
.loc
. - Display the shape and class distribution.
print("\n--- Filter 3: Non-Third-Class from Southampton ---") cond_embarked_s = titanic_df['embarked'] == 'S' cond_not_pclass3 = titanic_df['pclass'] != 3 # Or ~(titanic_df['pclass'] == 3) non_3rd_southampton = titanic_df.loc[cond_embarked_s & cond_not_pclass3] print(f"Number of non-third-class passengers from Southampton: {non_3rd_southampton.shape[0]}") if not non_3rd_southampton.empty: print("Class distribution for non-third-class from Southampton:") print(non_3rd_southampton['pclass'].value_counts()) print(non_3rd_southampton[['pclass', 'age', 'fare', 'embarked']].head())
- Create boolean conditions for
- Filter 4: Passengers with Extreme Fares:
- Calculate the 5th and 95th percentiles of the 'fare' column using
.quantile()
. - Create boolean conditions for
fare < 5th_percentile
andfare > 95th_percentile
. - Combine them using
|
(OR). - Select the rows using
.loc
. - Display the shape and descriptive statistics for the fares in this group.
print("\n--- Filter 4: Passengers with Extreme Fares ---") fare_5th_percentile = titanic_df['fare'].quantile(0.05) fare_95th_percentile = titanic_df['fare'].quantile(0.95) print(f"5th percentile fare: {fare_5th_percentile:.2f}") print(f"95th percentile fare: {fare_95th_percentile:.2f}") cond_low_fare = titanic_df['fare'] < fare_5th_percentile cond_high_fare = titanic_df['fare'] > fare_95th_percentile extreme_fare_passengers = titanic_df.loc[cond_low_fare | cond_high_fare] print(f"\nNumber of passengers with extreme fares: {extreme_fare_passengers.shape[0]}") if not extreme_fare_passengers.empty: print("Fare statistics for passengers with extreme fares:") print(extreme_fare_passengers['fare'].describe()) print(extreme_fare_passengers[['fare', 'pclass', 'age', 'survived']].head())
- Calculate the 5th and 95th percentiles of the 'fare' column using
This workshop exercised the creation of multiple boolean conditions based on different columns and data types, combining them using logical operators (&
, |
), and using the resulting boolean Series within .loc
to perform complex data filtering and subset selection for targeted analysis.
6. Grouping, Aggregating, and Reshaping Data
Grouping and aggregation are core components of data analysis. They allow you to split your data into groups based on some criteria, apply calculations (like sum, mean, count) independently to each group, and then combine the results back into a meaningful structure. Pandas' groupby()
mechanism is exceptionally powerful for this. Reshaping tools like pivot tables further help in summarizing and presenting data.
The GroupBy Mechanism
The groupby()
method allows for a "split-apply-combine" process on DataFrames or Series:
- Split: The data is split into groups based on one or more keys. The keys can be column names, index levels, arrays, or functions applied to the index/columns.
- Apply: A function (e.g., aggregation, transformation, filtering) is applied independently to each group.
- Combine: The results of the function application are combined back into a resulting object (e.g., a Series or DataFrame).
Basic GroupBy Operation:
import pandas as pd
import numpy as np
# Sample data: Sales by Region and Product
data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'East', 'East'],
'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 250, 120, 180, 300, 350],
'Quantity': [10, 15, 8, 10, 12, 18, 20, 22]}
df_sales = pd.DataFrame(data)
print("Original Sales DataFrame:")
print(df_sales)
# Group by 'Region'
grouped_by_region = df_sales.groupby('Region')
print("\nGroupBy object (Region):", grouped_by_region) # Shows an object, not the data yet
# Apply an aggregation function (e.g., sum)
region_sum = grouped_by_region.sum(numeric_only=True) # numeric_only so the string 'Product' column isn't summed/concatenated
print("\nSum of Sales and Quantity by Region:")
print(region_sum)
# Apply another aggregation (e.g., mean)
region_mean = grouped_by_region.mean(numeric_only=True) # .mean() on the non-numeric 'Product' column would raise an error in recent pandas
print("\nMean Sales and Quantity by Region:")
print(region_mean)
# Get the size of each group
region_size = grouped_by_region.size()
print("\nSize of each Region group:")
print(region_size) # Returns a Series
# Get the count of non-null values in each column per group
region_count = grouped_by_region.count()
print("\nCount of non-null values per column by Region:")
print(region_count)
# Grouping by multiple keys ('Region' and 'Product')
grouped_by_region_product = df_sales.groupby(['Region', 'Product'])
print("\nGroupBy object (Region, Product):", grouped_by_region_product)
# Calculate mean sales for each combination
region_product_mean_sales = grouped_by_region_product['Sales'].mean() # Select 'Sales' then aggregate
print("\nMean Sales by Region and Product:")
print(region_product_mean_sales) # Result has a MultiIndex
# Unstack the result for a wider view
print("\nMean Sales Unstacked:")
print(region_product_mean_sales.unstack())
# Iterating through groups (less common for aggregation, useful for inspection/custom logic)
print("\nIterating through Region groups:")
for name, group_df in grouped_by_region:
print(f"--- Group: {name} ---")
print(group_df)
# You could perform custom calculations on group_df here
# Selecting a specific group
print("\nGetting the 'North' group:")
north_group = grouped_by_region.get_group('North')
print(north_group)
Key Points:
- Calling
groupby()
itself doesn't perform calculations; it creates aGroupBy
object that holds information about the groups. - Aggregation functions (
sum()
,mean()
,median()
,min()
,max()
,std()
,var()
,count()
,size()
,first()
,last()
, etc.) are then applied to this object. - You can group by one or multiple column names, index levels, or external arrays/Series of the same length.
- Selecting a column before aggregation (
grouped['Sales'].sum()
) is generally more efficient than aggregating the whole DataFrame and then selecting (grouped.sum()['Sales']
). - The result of an aggregation typically has the group keys as its index (potentially a MultiIndex).
The GroupBy mechanism is fundamental for summarizing data based on categorical variables or other grouping criteria.
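The "apply" step mentioned above is not limited to aggregation. As a brief sketch using the df_sales frame from the example, the two other common group-wise operations are transform (returns a result aligned with the original rows) and filter (keeps or drops whole groups); both are standard GroupBy methods:
# transform: each row's share of its region's total sales (same length as df_sales)
region_share = df_sales['Sales'] / df_sales.groupby('Region')['Sales'].transform('sum')
print(region_share)
# filter: keep only the rows belonging to regions whose total sales exceed 500
big_regions = df_sales.groupby('Region').filter(lambda g: g['Sales'].sum() > 500)
print(big_regions)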
Workshop Analyzing Sales Data by Region and Product
Goal: Use groupby()
to analyze the sample sales data, calculating total and average sales per region and per product.
Scenario: Using the df_sales
DataFrame created in the previous section, you need to answer specific business questions by grouping and aggregating the data:
- What are the total sales for each region?
- What are the total sales for each product across all regions?
- Which region-product combination had the highest average quantity sold per transaction?
Steps:
- Import Pandas and Prepare Data: Ensure the
df_sales
DataFrame is available.import pandas as pd import numpy as np # Recreate data if needed data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'East', 'East'], 'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'], 'Sales': [100, 150, 200, 250, 120, 180, 300, 350], 'Quantity': [10, 15, 8, 10, 12, 18, 20, 22]} df_sales = pd.DataFrame(data) print("Original Sales DataFrame:") print(df_sales)
- Question 1: Total Sales per Region:
- Group
df_sales
by the 'Region' column. - Select the 'Sales' column from the grouped object.
- Apply the
sum()
aggregation. - Print the resulting Series. Self-Check: Manually sum the sales for 'North' (100 + 200 + 120 = 420). Does the result match?
- Group
- Question 2: Total Sales per Product:
- Group
df_sales
by the 'Product' column. - Select the 'Sales' column.
- Apply the
sum()
aggregation. - Print the result. Self-Check: Sum sales for 'A' (100 + 150 + 120 + 180 = 550). Does the result match?
- Group
- Question 3: Highest Average Quantity per Region-Product:
- Group
df_sales
by both 'Region' and 'Product'. - Select the 'Quantity' column.
- Apply the
mean()
aggregation. - Find the index (Region, Product combination) corresponding to the maximum value in the resulting Series using
.idxmax()
. - Retrieve the maximum average quantity value using
.max()
.Self-Check: Look at the average quantities calculated. Does ('East', 'B') indeed have the highest average ( (20+22)/2 = 21 )?print("\n--- Q3: Highest Average Quantity per Region-Product ---") avg_quantity_by_region_product = df_sales.groupby(['Region', 'Product'])['Quantity'].mean() print("Average Quantity per (Region, Product):\n", avg_quantity_by_region_product) # Find the combination with the highest average quantity highest_avg_qty_combination = avg_quantity_by_region_product.idxmax() highest_avg_qty_value = avg_quantity_by_region_product.max() print(f"\nThe combination with the highest average quantity sold is: {highest_avg_qty_combination}") print(f"Highest average quantity: {highest_avg_qty_value:.2f}")
- Group
This workshop focused on using the basic groupby()
and aggregation workflow (selecting columns and applying functions like sum()
and mean()
) to answer specific analytical questions about summarized data. It also introduced finding the maximum value and its corresponding index label(s) in the aggregated result.
Aggregation Functions
While simple aggregations like sum()
, mean()
, count()
are common, Pandas GroupBy objects support more complex aggregation scenarios using the .agg()
(or .aggregate()
) method. This method allows you to:
- Apply multiple aggregation functions at once.
- Apply different aggregation functions to different columns.
- Use custom-defined aggregation functions.
Using .agg()
:
- Multiple Functions on All Columns: Pass a list of function names (as strings) or actual function objects.
- Different Functions per Column: Pass a dictionary where keys are column names and values are the function(s) to apply to that column (can be a single function name/object or a list of them).
- Custom Functions: Define your own function that takes a Series (representing the data within a group for a specific column) and returns a single scalar value. Pass this function object to
.agg()
. - Renaming Aggregated Columns: When passing a dictionary, you can provide tuples
(column, function)
as keys, or use named aggregation syntax for clearer output column names.
Named Aggregation (Preferred for Clarity):
The modern way, especially when applying multiple aggregations or needing specific output names, is named aggregation within .agg()
:
grouped_object.agg(
output_col_name1 = pd.NamedAgg(column='input_col_name1', aggfunc='sum'),
output_col_name2 = pd.NamedAgg(column='input_col_name2', aggfunc='mean'),
output_col_name3 = pd.NamedAgg(column='input_col_name1', aggfunc=np.std),
# ...
)
Examples:
Using the df_sales
DataFrame:
import pandas as pd
import numpy as np
# Recreate data if needed
data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'East', 'East'],
'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
'Sales': [100, 150, 200, 250, 120, 180, 300, 350],
'Quantity': [10, 15, 8, 10, 12, 18, 20, 22]}
df_sales = pd.DataFrame(data)
grouped_region_product = df_sales.groupby(['Region', 'Product'])
print("Original Sales DataFrame:")
print(df_sales)
print("\n--- Aggregation Examples ---")
# 1. Multiple functions on selected columns
agg_multi_func = grouped_region_product[['Sales', 'Quantity']].agg(['sum', 'mean', 'count'])
print("\nApplying 'sum', 'mean', 'count' to Sales & Quantity:")
print(agg_multi_func)
# Note the MultiIndex columns in the result: (Column, AggFunction)
# 2. Different functions per column using a dictionary
agg_dict_func = grouped_region_product.agg({
'Sales': ['sum', 'mean'], # Sum and Mean for Sales
'Quantity': 'sum' # Only Sum for Quantity
})
print("\nApplying different functions per column (dict):")
print(agg_dict_func)
# 3. Custom aggregation function (e.g., range)
def sales_range(x):
return x.max() - x.min()
agg_custom_func = grouped_region_product.agg({
'Sales': ['sum', sales_range], # Mix built-in and custom
'Quantity': 'mean'
})
print("\nApplying custom 'sales_range' function:")
print(agg_custom_func)
# 4. Using Named Aggregation (Preferred)
agg_named = grouped_region_product.agg(
TotalSales = pd.NamedAgg(column='Sales', aggfunc='sum'),
AverageSales = pd.NamedAgg(column='Sales', aggfunc='mean'),
TotalQuantity = pd.NamedAgg(column='Quantity', aggfunc='sum'),
SalesRange = pd.NamedAgg(column='Sales', aggfunc=sales_range),
TransactionCount = pd.NamedAgg(column='Sales', aggfunc='count') # Count transactions
)
print("\nUsing Named Aggregation:")
print(agg_named)
# Note the clean, user-defined column names in the result.
# Applying aggregation without grouping (on the whole DataFrame)
# Note: Often simpler functions exist, e.g., df.sum(), df.mean()
# But useful if mixing functions or applying to specific columns
overall_agg = df_sales.agg({
'Sales': ['sum', 'mean'],
'Quantity': ['min', 'max']
})
print("\nAggregation on the whole DataFrame:")
print(overall_agg)
The .agg()
method provides immense flexibility for summarizing grouped data exactly how you need it, allowing for complex, multi-faceted aggregations in a concise manner, especially with the named aggregation syntax.
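The pd.NamedAgg spelling shown above can also be shortened: named aggregation accepts plain 2-tuples of (column, aggfunc). A minimal sketch, reusing the grouped_region_product object defined earlier in this subsection:

# Equivalent named aggregation written with (column, aggfunc) tuples
agg_named_tuples = grouped_region_product.agg(
    TotalSales=('Sales', 'sum'),
    AverageSales=('Sales', 'mean'),
    TotalQuantity=('Quantity', 'sum'),
)
print(agg_named_tuples)

The output is identical to the pd.NamedAgg version; which spelling to use is purely a matter of taste.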
Workshop Calculating Summary Statistics for Groups
Goal: Use the .agg()
method on the cleaned Titanic dataset to calculate various summary statistics for different passenger groups.
Scenario: You want to gain deeper insights into the Titanic passengers by calculating multiple statistics simultaneously for groups based on 'Pclass' and 'Sex'. Specifically, you want to find:
- For each combination of 'Pclass' and 'Sex':
- The total number of passengers (count).
- The number of survivors.
- The survival rate.
- The average age.
- The median fare.
- The range of fares (max - min).
Steps:
- Import Libraries and Prepare Data: Ensure the cleaned
titanic_df
is loaded.

import pandas as pd
import numpy as np

# Assuming titanic_df is loaded and cleaned
try:
    print("Using pre-loaded and cleaned Titanic DataFrame.")
    titanic_df.info(memory_usage=False)  # Concise info
    if titanic_df.isnull().sum().sum() != 0:
        print("Warning: DataFrame still contains NaNs.")
except NameError:
    print("Error: titanic_df not found. Loading and cleaning...")
    # Add loading/cleaning code if necessary
    exit()
- Define Custom Function (Fare Range): Create a simple function to calculate the range (max - min).
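The code for this step and the next is not shown; a minimal sketch, using the names fare_range and grouped_class_sex that the aggregation code below expects (column names follow the surrounding workshop text):

# Custom aggregation: range of a numeric Series (max - min)
def fare_range(x):
    return x.max() - x.min()

# Group the cleaned Titanic data by passenger class and sex
grouped_class_sex = titanic_df.groupby(['Pclass', 'Sex'])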
- Group Data: Group the
titanic_df
by both 'Pclass' and 'Sex'.
- Perform Aggregation using Named Aggregation: Use the
.agg()
method withpd.NamedAgg
to calculate all the required statistics with clear output names. Note that 'survived' is coded as 0 or 1, sosum()
gives the count of survivors andmean()
gives the survival rate.

print("\n--- Calculating Aggregated Statistics ---")
summary_stats = grouped_class_sex.agg(
    PassengerCount = pd.NamedAgg(column='survived', aggfunc='count'),  # Count any non-null column for total
    Survivors = pd.NamedAgg(column='survived', aggfunc='sum'),
    SurvivalRate = pd.NamedAgg(column='survived', aggfunc='mean'),
    AverageAge = pd.NamedAgg(column='age', aggfunc='mean'),
    MedianFare = pd.NamedAgg(column='fare', aggfunc='median'),
    FareRange = pd.NamedAgg(column='fare', aggfunc=fare_range)  # Use custom function
)

# Format the SurvivalRate as percentage for better readability (optional)
summary_stats['SurvivalRate'] = summary_stats['SurvivalRate'].map('{:.2%}'.format)
print("\nSummary Statistics by Pclass and Sex:")
print(summary_stats)
-
Analyze Results: Examine the output DataFrame
summary_stats
.
- Compare survival rates between sexes within each class.
- Compare survival rates across classes for the same sex.
- Look at average ages and median fares. Do they align with expectations for different classes?
- Check the fare ranges – are there large differences within certain groups?
Self-Check:
- Does the 'PassengerCount' match the number of individuals in each group?
- Is the 'SurvivalRate' significantly higher for females than males within each class?
- Does the 'SurvivalRate' generally decrease as 'Pclass' increases (from 1st to 3rd)?
- Does 'AverageAge' tend to decrease slightly with class?
- Is 'MedianFare' drastically different across classes?
- Are fare ranges wider in 1st class compared to 3rd?
This workshop demonstrated the power and clarity of using .agg()
with named aggregation (pd.NamedAgg
) to compute multiple, potentially complex, summary statistics for grouped data, including the use of custom aggregation functions. This provides a rich, tabular summary suitable for analysis and reporting.
Pivoting and Reshaping DataFrames
Sometimes, your data is in a "long" format (many rows, few columns, one observation per row) and you need it in a "wide" format (fewer rows, more columns, information spread across columns), or vice-versa. MultiIndex stack()
/unstack()
are one way to do this. pivot()
and pivot_table()
are other powerful tools, especially for transitioning from long to wide format based on data values.
pivot(index=None, columns=None, values=None)
(DataFrame method): Reshapes data (produces a "wide" format) based on unique values from specified index and columns. It's suitable when the combination ofindex
andcolumns
uniquely identifies avalues
entry. If there are duplicate combinations,pivot
will raise an error.-
pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')
(Pandas function or DataFrame method): Creates a spreadsheet-style pivot table. It's more flexible thanpivot
because it can handle duplicate entries for a given index/column combination by aggregating them using the specifiedaggfunc
(default is 'mean'). It's essentially a combination ofgroupby
and reshaping.aggfunc
: Function used for aggregation (e.g., 'sum', 'mean', 'count', or a list/dict of functions).fill_value
: Value to replace missing values (NaNs) in the resulting table.margins=True
: Adds row and column totals (subtotals and grand totals).
-
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
(Pandas function): The inverse of pivoting; transforms a DataFrame from "wide" format to "long" format by unpivoting columns into rows.id_vars
are columns to keep as identifier variables, whilevalue_vars
are columns to unpivot.
Example pivot
vs pivot_table
:
import pandas as pd
# Data suitable for pivot (unique index/column combo)
data_long_unique = pd.DataFrame({
'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
'Sensor': ['Temp', 'Humidity', 'Temp', 'Humidity'],
'Reading': [25.5, 60.1, 26.0, 62.3]
})
print("Long format data (unique combos):")
print(data_long_unique)
# Use pivot to make Sensors the columns
df_pivoted = data_long_unique.pivot(index='Date', columns='Sensor', values='Reading')
print("\nPivoted data (wide format):")
print(df_pivoted)
# Data with duplicate index/column combos (needs aggregation)
data_long_duplicates = pd.DataFrame({
'Student': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
'Assignment': ['HW1', 'HW1', 'HW2', 'HW1', 'HW2', 'HW1'],
'Score': [90, 85, 95, 88, 92, 89] # Alice HW1 appears twice
})
print("\nLong format data (duplicate combos):")
print(data_long_duplicates)
# Trying pivot fails due to duplicate ('Alice', 'HW1')
try:
data_long_duplicates.pivot(index='Student', columns='Assignment', values='Score')
except ValueError as e:
print("\nPivot failed as expected:", e)
# Use pivot_table with an aggregation function (e.g., mean for Alice's HW1)
df_pivot_table = data_long_duplicates.pivot_table(index='Student', columns='Assignment',
values='Score', aggfunc='mean') # Default is mean
print("\nPivot Table (aggfunc='mean'):")
print(df_pivot_table)
# Pivot table with sum and margins
df_pivot_table_sum = data_long_duplicates.pivot_table(index='Student', columns='Assignment',
values='Score', aggfunc='sum',
fill_value=0, # Fill missing scores with 0
margins=True, margins_name='TotalScore')
print("\nPivot Table (aggfunc='sum', margins=True, fill_value=0):")
print(df_pivot_table_sum)
Example melt
:
# Start with the wide format df_pivoted
print("\nWide format data (df_pivoted):")
print(df_pivoted)
# Melt it back to long format
# Reset index first to make 'Date' a column to use as id_vars
df_pivoted_reset = df_pivoted.reset_index()
print("\nWide format with Date as column:")
print(df_pivoted_reset)
df_melted = pd.melt(df_pivoted_reset,
id_vars=['Date'], # Column(s) to keep fixed
value_vars=['Humidity', 'Temp'], # Columns to unpivot
var_name='SensorType', # Name for the new variable column
value_name='SensorReading')# Name for the new value column
print("\nMelted data (long format):")
print(df_melted)
pivot_table
is extremely useful for summarizing data in a familiar spreadsheet-like format, while melt
is essential for converting data into the "tidy" long format often required by plotting libraries or statistical modeling functions.
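For comparison, the stack()/unstack() route mentioned at the start of this subsection reshapes along index levels rather than by data values. A small sketch, reusing the df_pivoted DataFrame from the example above:

# unstack()/stack() move levels between the index and the columns
stacked = df_pivoted.stack()       # wide -> long: Sensor columns become an inner index level
print("\nStacked (Series with MultiIndex):")
print(stacked)

unstacked = stacked.unstack()      # long -> wide: recover the original Date x Sensor layout
print("\nUnstacked (back to wide):")
print(unstacked)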
Workshop Reshaping Time Series Data
Goal: Use pivot_table
to summarize monthly sales data from a long format and melt
to convert summarized data back.
Scenario: You have a dataset containing daily sales records for different product categories over several months. You need to create a pivot table showing total monthly sales for each category, and then demonstrate how to melt
this summary table back into a long format.
Steps:
- Import Libraries and Create Sample Data: Create a DataFrame simulating daily sales data.
import pandas as pd
import numpy as np

# Create sample daily sales data
date_rng = pd.date_range(start='2023-01-01', end='2023-03-31', freq='D')
num_days = len(date_rng)
categories = ['Electronics', 'Clothing', 'Groceries']
np.random.seed(10)
data = {
    'Date': np.random.choice(date_rng, size=num_days * 2),  # More entries than days
    'Category': np.random.choice(categories, size=num_days * 2),
    'Sales': np.random.randint(50, 1000, size=num_days * 2)
}
daily_sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)

# Add Month and Year columns for easy grouping/pivoting
daily_sales_df['Year'] = daily_sales_df['Date'].dt.year
daily_sales_df['Month'] = daily_sales_df['Date'].dt.month

print("Sample Daily Sales Data (Long Format):")
print(daily_sales_df.head())
print(f"\nTotal records: {len(daily_sales_df)}")
- Create Pivot Table: Use
pivot_table
to create a summary table where:- The index represents the 'Month'.
- The columns represent the 'Category'.
- The values represent the total 'Sales' (
aggfunc='sum'
). - Handle potential missing combinations with
fill_value=0
. Self-Check: Does the table show Months (1, 2, 3) as rows and Categories ('Clothing', 'Electronics', 'Groceries') as columns? Do the values represent the sum of sales for that month-category combination?
- Add Margins (Totals): Recreate the pivot table, but this time add row and column margins (totals) using
margins=True
.

print("\n--- Creating Pivot Table with Margins ---")
monthly_sales_pivot_margins = daily_sales_df.pivot_table(
    values='Sales',
    index='Month',
    columns='Category',
    aggfunc='sum',
    fill_value=0,
    margins=True,         # Add row/column totals
    margins_name='Total'  # Name for the margins row/column
)
print("\nMonthly Sales Pivot Table with Totals:")
print(monthly_sales_pivot_margins)

Self-Check: Does the table now include a 'Total' row showing the sum across months for each category, and a 'Total' column showing the sum across categories for each month? Is there a grand total at the bottom right?
- Melt the Pivot Table: Take the original pivot table (without margins) and use
pd.melt
to convert it back into a long format.
- First,
reset_index()
to make 'Month' a regular column. - Then, use
melt
with 'Month' as theid_vars
.Self-Check: Does theprint("\n--- Melting the Pivot Table (without margins) ---") # Reset index to make 'Month' a column monthly_sales_pivot_reset = monthly_sales_pivot.reset_index() print("Pivot Table with Month as column:") print(monthly_sales_pivot_reset) # Melt the DataFrame monthly_sales_long = pd.melt( monthly_sales_pivot_reset, id_vars=['Month'], # Column(s) to keep as identifiers value_vars=['Clothing', 'Electronics', 'Groceries'], # Columns to unpivot var_name='ProductCategory', # Name for the new column holding old column names value_name='TotalMonthlySales' # Name for the new column holding the values ) print("\nMelted Monthly Sales Data (Long Format):") print(monthly_sales_long.sort_values(by=['Month', 'ProductCategory']))
monthly_sales_long
DataFrame have columns 'Month', 'ProductCategory', and 'TotalMonthlySales'? Does each row represent the total sales for a specific category in a specific month, effectively reversing the pivot operation?
- First,
This workshop demonstrated how pivot_table
can aggregate and reshape data from a long format into a wide summary table, which is excellent for reporting. It also showed how melt
performs the inverse operation, converting a wide table back into a long format suitable for other types of analysis or visualization.
7. Time Series Analysis with Pandas
Pandas was originally developed with financial data analysis in mind, and thus has exceptionally strong capabilities for working with time series data. This includes specialized data structures for dates and times, methods for time-based indexing and selection, frequency conversion, resampling, rolling window calculations, and more.
Date and Time Data Types and Tools
Pandas provides several tools for handling time-based data:
Timestamp
: Represents a single point in time. It's Pandas' equivalent of Python'sdatetime.datetime
object but is based on NumPy'sdatetime64
dtype, offering more efficiency and functionality.DatetimeIndex
: An index composed ofTimestamp
objects. This allows for powerful time-based indexing and selection in Series and DataFrames.Period
: Represents a time span or interval, like a specific month (e.g., '2023-01'), year, or quarter.PeriodIndex
: An index composed ofPeriod
objects. Useful for data that represents fixed frequency intervals.Timedelta
: Represents a duration or difference between two dates/times.pd.to_datetime()
: The primary function for converting strings, integers, or other inputs intoTimestamp
objects orDatetimeIndex
. It's very flexible in parsing various date/time formats.pd.date_range()
: Generates aDatetimeIndex
with a fixed frequency (e.g., daily, hourly, monthly).
Working with DatetimeIndex:
When a DataFrame's index is a DatetimeIndex
, Pandas enables convenient time-based slicing and selection.
import pandas as pd
import numpy as np
# --- Creating Timestamps and DatetimeIndex ---
ts = pd.Timestamp('2023-10-26 10:30:00')
print("Timestamp:", ts)
print("Year:", ts.year)
print("Month:", ts.month_name())
print("Day of week:", ts.day_name())
# Create DatetimeIndex using date_range
date_idx = pd.date_range(start='2023-11-01', periods=7, freq='D') # 7 days starting Nov 1st
print("\nDatetimeIndex (Daily):")
print(date_idx)
date_idx_hourly = pd.date_range(start='2023-11-01', end='2023-11-01 05:00:00', freq='H') # Hourly
print("\nDatetimeIndex (Hourly):")
print(date_idx_hourly)
# --- Creating a Time Series DataFrame ---
# Use date_idx as the index
np.random.seed(0)
ts_data = np.random.randn(len(date_idx))
time_series = pd.Series(ts_data, index=date_idx)
print("\nTime Series (Series with DatetimeIndex):")
print(time_series)
df_ts = pd.DataFrame({'Value': ts_data, 'Value2': ts_data*2}, index=date_idx)
print("\nTime Series DataFrame:")
print(df_ts)
# --- Time-Based Indexing and Selection ---
print("\n--- Time-Based Selection ---")
# Select a specific date
print("\nData for 2023-11-03:")
print(df_ts.loc['2023-11-03'])
# Select a range of dates (slicing)
print("\nData from 2023-11-02 to 2023-11-05:")
print(df_ts['2023-11-02':'2023-11-05']) # .loc is implied and preferred: df_ts.loc[...]
# Select data for a specific year or month (partial string indexing)
# Create longer series for demonstration
date_idx_long = pd.date_range(start='2022-12-01', periods=60, freq='D') # Approx 2 months
df_ts_long = pd.DataFrame({'Value': np.random.randn(len(date_idx_long))}, index=date_idx_long)
print("\nLonger Time Series DataFrame (Head):")
print(df_ts_long.head())
print("\nData for the year 2023:")
print(df_ts_long.loc['2023'].head()) # Selects all entries in 2023 (use .loc for partial-string row selection on a DataFrame)
print("\nData for January 2023:")
print(df_ts_long.loc['2023-01'].head()) # Selects all entries in Jan 2023
# --- Using Timedelta ---
delta = pd.Timedelta(days=1, hours=5)
print("\nTimedelta:", delta)
print("Timestamp + Timedelta:", ts + delta)
# --- Using Period ---
p_month = pd.Period('2023-01', freq='M') # Represents the entire month of Jan 2023
print("\nPeriod (Month):", p_month)
print("Period start time:", p_month.start_time)
print("Period end time:", p_month.end_time)
# Convert timestamp to period
print("Timestamp converted to Monthly Period:", ts.to_period('M'))
# Create PeriodIndex
period_idx = pd.period_range(start='2023-01', periods=5, freq='M')
print("\nPeriodIndex (Monthly):")
print(period_idx)
# Series with PeriodIndex
s_period = pd.Series(np.random.randn(5), index=period_idx)
print("\nSeries with PeriodIndex:\n", s_period)
Mastering to_datetime
, date_range
, and the behavior of DatetimeIndex
(especially time-based slicing) is crucial for preparing time series data for analysis in Pandas.
Workshop Analyzing Stock Prices Over Time
Goal: Load historical stock price data (e.g., Apple AAPL), convert the date column to a DatetimeIndex, and perform basic time-based selections and analysis.
Dataset: We'll use historical stock data, which is widely available. For instance, from Yahoo Finance via libraries like yfinance
or by downloading a CSV. For simplicity here, we'll simulate loading from a CSV structure.
Steps:
- Import Libraries:
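The imports for this step are not listed; a minimal set, based on what the following steps actually call (io is needed for io.StringIO):

import io            # io.StringIO lets pd.read_csv read the simulated CSV text
import pandas as pd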
- Prepare Simulated Stock Data: Create a string simulating a CSV file with AAPL stock data. In a real scenario, you'd use
pd.read_csv('your_stock_data.csv')
or a library likeyfinance
.python # Simulate CSV data for AAPL stock # Columns: Date, Open, High, Low, Close, Adj Close, Volume stock_csv_data = """Date,Open,High,Low,Close,Adj Close,Volume 2023-01-03,129.27,130.89,124.17,125.07,124.87,112117500 2023-01-04,126.89,128.66,125.08,126.36,126.16,89113600 2023-01-05,127.12,127.77,124.76,125.02,124.82,80962700 2023-01-06,126.01,130.29,124.89,129.62,129.41,87754700 2023-01-09,130.47,133.41,129.89,130.15,129.94,70790800 2023-01-10,130.26,131.26,128.12,130.73,130.52,63896200 2023-01-11,131.25,133.51,130.46,133.49,133.27,69458900 2023-01-12,133.88,134.26,131.41,133.41,133.19,71379600 2023-01-13,132.03,134.92,131.66,134.76,134.54,57809700 2023-02-01,143.97,146.61,141.32,145.43,145.19,77663600 2023-02-02,148.90,151.18,148.17,150.82,150.57,118339000 2023-02-03,148.03,157.38,147.83,154.50,154.24,154279900 """
- Load Data and Set DatetimeIndex:
- Load the data using
pd.read_csv
. - Crucially, use the
parse_dates=['Date']
argument to tell Pandas to interpret the 'Date' column as dates. - Use the
index_col='Date'
argument to set the 'Date' column as the DataFrame index immediately upon loading.Self-Check: Is the index of# Load the data, parsing dates and setting index try: aapl_df = pd.read_csv(io.StringIO(stock_csv_data), parse_dates=['Date'], index_col='Date') print("AAPL stock data loaded successfully with DatetimeIndex:") print(aapl_df.head()) print("\nIndex type:", type(aapl_df.index)) print("Index dtype:", aapl_df.index.dtype) except Exception as e: print(f"Error loading data: {e}") # Create placeholder if error aapl_df = pd.DataFrame(columns=['Open','High','Low','Close','Adj Close','Volume'])
aapl_df
aDatetimeIndex
? Are the dates parsed correctly?
- Load the data using
- Time-Based Selection:
- Select the data for a specific date, e.g., '2023-01-05'.
- Select the data for the week starting '2023-01-09'.
- Select all data for January 2023 using partial string indexing.
- Select all data for February 2023.
Self-Check: Did each selection return the expected rows based on the dates?
if not aapl_df.empty: print("\n--- Time-Based Selections ---") # Specific date try: data_jan_5 = aapl_df.loc['2023-01-05'] print("\nData for 2023-01-05:\n", data_jan_5) except KeyError: print("\nData for 2023-01-05 not found.") # Week starting Jan 9 (until Jan 13 in this data) data_week_jan_9 = aapl_df.loc['2023-01-09':'2023-01-13'] print("\nData for week starting 2023-01-09:\n", data_week_jan_9) # All data for January 2023 data_jan_2023 = aapl_df.loc['2023-01'] print("\nAll data for January 2023:\n", data_jan_2023) # All data for February 2023 data_feb_2023 = aapl_df.loc['2023-02'] print("\nAll data for February 2023:\n", data_feb_2023) else: print("\nSkipping selections as DataFrame is empty.")
- Basic Analysis:
- Calculate the average closing price for January 2023.
- Find the highest price ('High') reached in February 2023.
if not aapl_df.empty: print("\n--- Basic Analysis ---") try: avg_close_jan = aapl_df.loc['2023-01', 'Close'].mean() print(f"\nAverage Closing Price in Jan 2023: {avg_close_jan:.2f}") except KeyError: print("\nCould not calculate average closing price for Jan 2023 (data missing).") try: max_high_feb = aapl_df.loc['2023-02', 'High'].max() print(f"Highest Price (High) in Feb 2023: {max_high_feb:.2f}") except KeyError: print("\nCould not find maximum high for Feb 2023 (data missing).") except ValueError: # Handle case where Feb 2023 data exists but is empty print("\nNo data found for Feb 2023 to calculate max high.") else: print("\nSkipping analysis as DataFrame is empty.")
This workshop demonstrated the workflow for loading time series data, ensuring the date column becomes a DatetimeIndex
using parse_dates
and index_col
in read_csv
, and leveraging the power of the DatetimeIndex
for intuitive time-based slicing and selection to perform basic analysis.
Resampling and Frequency Conversion
Resampling involves changing the frequency of your time series data. This is a common operation:
- Downsampling: Aggregating data from a higher frequency to a lower frequency (e.g., converting daily data to monthly data). This typically involves an aggregation function (like
mean
,sum
,ohlc
- Open, High, Low, Close). - Upsampling: Converting data from a lower frequency to a higher frequency (e.g., monthly data to daily data). This usually involves filling or interpolating the missing data points created during the conversion.
The primary method for resampling in Pandas is .resample()
, used on a Series or DataFrame with a DatetimeIndex or PeriodIndex.
.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, on=None, level=None, origin='start_day', offset=None)
:
rule
: The target frequency (e.g.,'D'
for daily,'W'
for weekly,'M'
for month-end,'MS'
for month-start,'Q'
for quarter-end,'QS'
for quarter-start,'A'
or'Y'
for year-end,'AS'
or'YS'
for year-start,'H'
for hourly,'T'
or'min'
for minutely,'S'
for secondly). You can combine these with multipliers (e.g.,'2H'
for 2 hours).closed
: Which side of the interval is closed ('left' or 'right'). Relevant for frequencies like daily when aggregating intraday data.label
: Which bin edge label to use ('left' or 'right').convention
: For Period resampling ('start' or 'end').kind
: Resample to 'timestamp' (default) or 'period'.
After calling .resample()
, you typically chain an aggregation method (like .sum()
, .mean()
, .ohlc()
, .agg()
) for downsampling, or an interpolation/filling method (like .asfreq()
, .ffill()
, .bfill()
, .interpolate()
) for upsampling.
Downsampling Example:
import pandas as pd
import numpy as np
# Use the longer time series DataFrame df_ts_long from previous section
date_idx_long = pd.date_range(start='2022-12-01', periods=60, freq='D') # Dec 2022, Jan 2023
np.random.seed(1)
df_ts_long = pd.DataFrame({'Value': np.random.randint(1, 100, size=len(date_idx_long))}, index=date_idx_long)
print("Original Daily Time Series (Head):")
print(df_ts_long.head())
# Downsample daily data to monthly sum
monthly_sum = df_ts_long['Value'].resample('M').sum() # 'M' = Month End frequency
print("\nMonthly Sum (Resampled):")
print(monthly_sum)
# Downsample daily data to monthly mean
monthly_mean = df_ts_long['Value'].resample('M').mean()
print("\nMonthly Mean (Resampled):")
print(monthly_mean)
# Downsample daily data to weekly OHLC (Open, High, Low, Close)
# Requires multiple columns or a specific structure, let's fake it
df_ts_ohlc = pd.DataFrame({
'Open': np.random.rand(60) + 10,
'High': np.random.rand(60) + 11,
'Low': np.random.rand(60) + 9,
'Close': np.random.rand(60) + 10
}, index=date_idx_long)
# Ensure High >= Open/Close and Low <= Open/Close (simplified logic)
df_ts_ohlc['High'] = df_ts_ohlc[['High', 'Open', 'Close']].max(axis=1)
df_ts_ohlc['Low'] = df_ts_ohlc[['Low', 'Open', 'Close']].min(axis=1)
print("\nOriginal Daily OHLC Data (Fake):")
print(df_ts_ohlc.head())
weekly_ohlc = df_ts_ohlc.resample('W').agg({
'Open': 'first',
'High': 'max',
'Low': 'min',
'Close': 'last'
})
# Alternative simpler syntax for OHLC: .resample('W').ohlc()
# weekly_ohlc_simple = df_ts_ohlc.resample('W').ohlc()
print("\nWeekly OHLC (Resampled):")
print(weekly_ohlc)
# print("\nWeekly OHLC (Simple):") # Output structure is slightly different (MultiIndex cols)
# print(weekly_ohlc_simple)
Upsampling Example:
# Start with the monthly sum data
print("\nMonthly Sum Data:")
print(monthly_sum)
# Upsample monthly data to daily frequency (will create NaNs)
daily_from_monthly_asfreq = monthly_sum.resample('D').asfreq()
print("\nUpsampled to Daily (asfreq - creates NaNs):")
print(daily_from_monthly_asfreq.head(10)) # Show NaNs
# Upsample using forward fill
daily_from_monthly_ffill = monthly_sum.resample('D').ffill()
print("\nUpsampled to Daily (ffill):")
print(daily_from_monthly_ffill.head(10)) # Shows value propagated forward
# Upsample using backward fill
daily_from_monthly_bfill = monthly_sum.resample('D').bfill()
print("\nUpsampled to Daily (bfill):")
print(daily_from_monthly_bfill.tail(10)) # Shows value propagated backward
# Upsampling with interpolation (linear)
# Need numeric data, let's use monthly_mean
print("\nMonthly Mean Data:")
print(monthly_mean)
daily_from_monthly_interp = monthly_mean.resample('D').interpolate(method='linear')
print("\nUpsampled to Daily (linear interpolation):")
# Display around a month transition to see interpolation
print(daily_from_monthly_interp['2022-12-25':'2023-01-05'])
Resampling is essential for aligning time series with different frequencies or for summarizing data over specific time periods (weeks, months, quarters). Choose the aggregation method (sum
, mean
, ohlc
, etc.) carefully for downsampling and the filling/interpolation method (ffill
, bfill
, interpolate
) for upsampling based on the nature of your data and analysis goals.
Workshop Aggregating Time Series Data to Different Frequencies
Goal: Use resampling on the AAPL stock price data to calculate weekly and monthly averages.
Scenario: You have the daily AAPL stock price data (aapl_df
) and want to analyze its trends on a weekly and monthly basis by calculating the average closing price for each period.
Steps:
- Import Libraries and Prepare Data: Ensure the
aapl_df
DataFrame with the DatetimeIndex is available from the previous workshop.

import pandas as pd
import numpy as np
import io

# Assuming aapl_df is loaded from the previous workshop
# If not, recreate/reload it here.
stock_csv_data = """Date,Open,High,Low,Close,Adj Close,Volume
2023-01-03,129.27,130.89,124.17,125.07,124.87,112117500
2023-01-04,126.89,128.66,125.08,126.36,126.16,89113600
2023-01-05,127.12,127.77,124.76,125.02,124.82,80962700
2023-01-06,126.01,130.29,124.89,129.62,129.41,87754700
2023-01-09,130.47,133.41,129.89,130.15,129.94,70790800
2023-01-10,130.26,131.26,128.12,130.73,130.52,63896200
2023-01-11,131.25,133.51,130.46,133.49,133.27,69458900
2023-01-12,133.88,134.26,131.41,133.41,133.19,71379600
2023-01-13,132.03,134.92,131.66,134.76,134.54,57809700
2023-01-17,134.83,137.29,134.13,135.94,135.72,63646600
2023-01-18,136.82,138.61,135.03,135.21,134.99,79572200
2023-01-19,134.08,136.25,133.77,135.27,135.05,58280400
2023-01-20,135.28,138.02,134.22,137.87,137.64,80223600
2023-01-23,138.12,143.32,137.90,141.11,140.87,81760300
2023-01-24,140.31,143.16,140.30,142.53,142.29,66435100
2023-01-25,140.89,142.43,138.81,141.86,141.62,65799300
2023-01-26,143.17,144.25,141.90,143.96,143.72,54115500
2023-01-27,143.16,147.23,143.08,145.93,145.69,70555800
2023-01-30,144.96,145.55,142.85,143.00,142.76,64015300
2023-01-31,142.70,144.34,142.28,144.29,144.05,65874500
2023-02-01,143.97,146.61,141.32,145.43,145.19,77663600
2023-02-02,148.90,151.18,148.17,150.82,150.57,118339000
2023-02-03,148.03,157.38,147.83,154.50,154.24,154279900
"""
try:
    aapl_df = pd.read_csv(io.StringIO(stock_csv_data), parse_dates=['Date'], index_col='Date')
    print("AAPL stock data loaded successfully.")
    print(aapl_df.head())
except Exception as e:
    print(f"Error loading data: {e}")
    aapl_df = pd.DataFrame(columns=['Open','High','Low','Close','Adj Close','Volume'])
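The resampling code itself is not included in the steps that follow; the sketch below collects one possible implementation of the weekly average, monthly average, and monthly OHLCV aggregations they describe (variable names such as weekly_avg_close are illustrative):

if not aapl_df.empty:
    # Step 2: weekly average closing price (weeks end on Sunday by default)
    weekly_avg_close = aapl_df['Close'].resample('W').mean()
    print("\nWeekly Average Closing Price:\n", weekly_avg_close)

    # Step 3: monthly average closing price ('M' = month-end labels)
    monthly_avg_close = aapl_df['Close'].resample('M').mean()
    print("\nMonthly Average Closing Price:\n", monthly_avg_close)

    # Step 4: monthly OHLCV bars aggregated from the daily data
    monthly_ohlcv = aapl_df.resample('M').agg({
        'Open': 'first',
        'High': 'max',
        'Low': 'min',
        'Close': 'last',
        'Volume': 'sum'
    })
    print("\nMonthly OHLCV:\n", monthly_ohlcv)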
- Weekly Resampling:
- Select the 'Close' price Series.
- Use
.resample('W')
to group by week (ends on Sunday by default). - Apply the
.mean()
aggregation function to calculate the average closing price for each week. - Print the resulting weekly average closing prices. Self-Check: Do the index dates represent week-ending dates (Sundays)? Are the values plausible averages of the daily closing prices within those weeks?
- Monthly Resampling:
- Select the 'Close' price Series again.
- Use
.resample('M')
to group by month (ends on the last day of the month by default). Or use'MS'
for month-start frequency. Let's use'M'
. - Apply the
.mean()
aggregation function. - Print the resulting monthly average closing prices. Self-Check: Does the index represent month-end dates? Are there fewer data points than the weekly results?
- Monthly OHLC: Resample the entire DataFrame to get monthly OHLC data.
- Use
.resample('M')
on the whole DataFrame. - Apply the
.agg()
method with appropriate functions for 'Open' ('first'), 'High' ('max'), 'Low' ('min'), 'Close' ('last'), and potentially 'Volume' ('sum'). Self-Check: Does the resulting DataFrame have month-end dates as the index? Do the 'Open' values correspond to the first trading day's open of the month, 'High' to the maximum high during the month, etc.? Is the 'Volume' the sum of daily volumes for the month?
- Use
This workshop provided practical experience in using .resample()
to downsample time series data. You calculated weekly and monthly average closing prices and also aggregated daily OHLCV data into monthly OHLCV bars, common tasks in financial data analysis.
Conclusion
Throughout this journey into Pandas and NumPy, we have explored the essential tools for data wrangling and analysis in Python. Starting from the Basic level, we grasped the fundamentals of NumPy arrays and the core Pandas structures: Series and DataFrames, learning how to create, inspect, and perform elementary operations on them.
Transitioning to the Intermediate level, we tackled practical, real-world challenges. We learned to efficiently load data from various file formats like CSV, Excel, and JSON, and how to save our processed results. Crucially, we addressed the ubiquitous problem of messy data, developing strategies for identifying and handling missing values (NaN
), transforming data using mapping and function application, and detecting and removing duplicate entries.
Finally, in the Advanced level, we delved into more sophisticated techniques. We explored the power of hierarchical indexing (MultiIndex) for managing complex datasets, mastered advanced boolean indexing for intricate data selection, and unlocked the full potential of the GroupBy mechanism for sophisticated aggregation using .agg()
. We also learned how to reshape data effectively using pivot_table
and melt
, and gained specialized skills for time series analysis, including date/time handling, time-based indexing, and resampling data to different frequencies.
The included Workshops provided hands-on practice, guiding you through applying these concepts to tangible scenarios, from analyzing experimental data and managing inventories to cleaning customer records, exploring the Titanic dataset, and summarizing sales and stock price data.
Mastering NumPy and Pandas is not just about learning library functions; it's about adopting a data-centric way of thinking. It involves understanding data structures, recognizing patterns, formulating questions, and efficiently manipulating data to find answers. The skills acquired here form the bedrock for virtually all subsequent data science tasks, including visualization, machine learning, and statistical modeling. Continue practicing, explore different datasets, and challenge yourself with complex data problems – the fluency you develop with Pandas and NumPy will be invaluable in your academic and professional journey.