Author Nejat Hakan
eMail nejat.hakan@outlook.de
PayPal Me https://paypal.me/nejathakan


Data Wrangling & Analysis Power Tools - Pandas & NumPy

Introduction

Welcome to the core of modern data analysis in Python. This section delves into two indispensable libraries: NumPy and Pandas. NumPy (Numerical Python) provides the foundation for numerical operations, especially on large, multi-dimensional arrays and matrices. Pandas, built atop NumPy, offers high-level data structures—Series and DataFrames—and powerful tools specifically designed for data manipulation, cleaning, analysis, and visualization. Mastering these libraries is crucial for anyone serious about data science, machine learning, financial analysis, or any field involving structured data.

We will embark on a journey starting with the fundamental building blocks in the Basic section, covering array creation, manipulation in NumPy, and the introduction of Pandas Series and DataFrames. The Intermediate section will equip you with practical skills for real-world data tasks, including loading data from various sources, handling ubiquitous issues like missing values, and transforming data into desired formats. Finally, the Advanced section will explore sophisticated techniques such as hierarchical indexing, complex data selection, powerful grouping and aggregation methods, and specialized tools for time series analysis.

Each theoretical sub-section is followed by a hands-on Workshop. These workshops are designed as mini-projects, providing step-by-step guidance to apply the concepts you've just learned to practical, often real-world-inspired scenarios. This blend of theory and practice aims to solidify your understanding and build your confidence in using NumPy and Pandas effectively. Prepare to transform raw data into actionable insights!


Basic Level

This initial level focuses on establishing a solid understanding of the core data structures and fundamental operations provided by NumPy and Pandas. We begin with NumPy's ndarray, the cornerstone of numerical computing in Python, and then transition to Pandas' Series and DataFrame, which are essential for handling tabular data.

1. NumPy Fundamentals

NumPy is the foundational package for scientific computing in Python. Its primary contribution is the powerful N-dimensional array object (ndarray). Understanding NumPy arrays is the first step towards efficient data manipulation and analysis.

Understanding NumPy Arrays

At its heart, NumPy provides an efficient way to store and manipulate dense numerical data. Unlike Python's built-in lists, which can contain elements of different types and are stored inefficiently in memory for numerical operations, NumPy arrays are homogeneous (all elements must be of the same data type) and stored in a contiguous block of memory. This homogeneity and memory layout allow NumPy to leverage optimized, pre-compiled C code for mathematical operations, resulting in significantly faster execution speeds compared to equivalent operations on Python lists.
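To see this difference yourself, here is a quick, unscientific timing sketch comparing a plain Python list comprehension with a vectorized NumPy multiplication. The exact numbers depend entirely on your machine, but the NumPy version is typically faster by an order of magnitude or more.

import time
import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

# Squaring every element with an explicit Python loop
start = time.perf_counter()
list_squared = [x * x for x in py_list]
list_time = time.perf_counter() - start

# Squaring every element with a vectorized operation (runs in compiled C)
start = time.perf_counter()
arr_squared = np_arr * np_arr
arr_time = time.perf_counter() - start

print(f"List comprehension: {list_time:.4f} s")
print(f"NumPy vectorized:   {arr_time:.4f} s")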

Key Characteristics of NumPy Arrays (ndarray):

  • Homogeneous Type: All elements must be of the same data type (e.g., int32, float64). This allows for optimized storage and computation.
  • Fixed Size: The size of a NumPy array is fixed upon creation. Changing the size requires creating a new array.
  • Efficient Memory Usage: Contiguous memory storage enables faster access and manipulation.
  • Vectorized Operations: NumPy allows you to apply operations on entire arrays without explicit Python loops, leading to concise and fast code.
  • Dimensions (Axes): Arrays can have multiple dimensions. A 1-D array is like a vector, a 2-D array is like a matrix, and arrays can have 3 or more dimensions. The number of dimensions is sometimes called the rank (available as the ndim attribute), and the sizes along each dimension together form the shape (a tuple).

Creating NumPy Arrays:

The most common way to create a NumPy array is using the np.array() function, passing a Python list or tuple.

import numpy as np

# Creating a 1-D array from a list
list1 = [1, 2, 3, 4, 5]
arr1d = np.array(list1)
print("1D Array:")
print(arr1d)
print("Shape:", arr1d.shape) # Output: (5,) - A tuple indicating size in each dimension
print("Data Type:", arr1d.dtype) # Output: int64 (depends on your system)
print("Number of Dimensions:", arr1d.ndim) # Output: 1
print("Size (Total Elements):", arr1d.size) # Output: 5

print("-" * 20)

# Creating a 2-D array from a list of lists
list2 = [[1, 2, 3], [4, 5, 6]]
arr2d = np.array(list2)
print("2D Array:")
print(arr2d)
print("Shape:", arr2d.shape) # Output: (2, 3) - 2 rows, 3 columns
print("Data Type:", arr2d.dtype) # Output: int64
print("Number of Dimensions:", arr2d.ndim) # Output: 2
print("Size (Total Elements):", arr2d.size) # Output: 6

# Specifying the data type during creation
arr_float = np.array([1.0, 2.5, 3.14], dtype=np.float32)
print("\nFloat Array:")
print(arr_float)
print("Data Type:", arr_float.dtype) # Output: float32

Other Array Creation Methods:

NumPy provides convenient functions for creating arrays with initial placeholder content:

  • np.zeros(shape): Creates an array filled with zeros.
  • np.ones(shape): Creates an array filled with ones.
  • np.full(shape, fill_value): Creates an array filled with a specific value.
  • np.arange(start, stop, step): Similar to Python's range(), but returns a NumPy array.
  • np.linspace(start, stop, num): Creates an array with a specified number (num) of evenly spaced values between start and stop.
  • np.eye(N) or np.identity(N): Creates an N x N identity matrix (1s on the diagonal, 0s elsewhere).
  • np.random.rand(d0, d1, ..., dn): Creates an array of the given shape with random samples from a uniform distribution over [0, 1).
  • np.random.randn(d0, d1, ..., dn): Creates an array of the given shape with random samples from a standard normal distribution (mean 0, variance 1).
  • np.random.randint(low, high, size): Creates an array of the given size with random integers from low (inclusive) to high (exclusive).
# Example usage of other creation methods
zeros_arr = np.zeros((2, 4)) # 2 rows, 4 columns
print("\nZeros Array:\n", zeros_arr)

ones_arr = np.ones((3, 3), dtype=np.int16) # Specify integer type
print("\nOnes Array:\n", ones_arr)

full_arr = np.full((2, 2), 99)
print("\nFull Array:\n", full_arr)

arange_arr = np.arange(0, 10, 2) # Start=0, Stop=10 (exclusive), Step=2
print("\nArange Array:", arange_arr)

linspace_arr = np.linspace(0, 1, 5) # 5 evenly spaced points between 0 and 1 (inclusive)
print("\nLinspace Array:", linspace_arr)

identity_matrix = np.eye(3)
print("\nIdentity Matrix:\n", identity_matrix)

rand_arr = np.random.rand(2, 2)
print("\nRandom (Uniform) Array:\n", rand_arr)

randn_arr = np.random.randn(3, 2)
print("\nRandom (Normal) Array:\n", randn_arr)

randint_arr = np.random.randint(10, 20, size=(2, 5)) # Random integers between 10 and 19
print("\nRandom Integer Array:\n", randint_arr)

Understanding these creation methods and the fundamental properties of ndarray objects is essential before moving on to manipulating and analyzing data with them.

Workshop NumPy Array Basics

Goal: Create various NumPy arrays representing hypothetical experimental data and inspect their properties.

Scenario: You are assisting a physics lab that measures temperature (float) and particle counts (int) at different time points. You need to create NumPy arrays to store this preliminary data.

Steps:

  1. Import NumPy: Always start by importing the NumPy library, conventionally aliased as np.
    import numpy as np
    print("NumPy imported successfully!")
    
  2. Temperature Data (1D Float Array): Measurements were taken at five consecutive time points: 23.5, 24.1, 24.8, 24.5, 23.9 degrees Celsius. Create a 1D NumPy array to store these values. Specify the data type as np.float32 for memory efficiency.
    temperatures_list = [23.5, 24.1, 24.8, 24.5, 23.9]
    temperature_data = np.array(temperatures_list, dtype=np.float32)
    
    print("\nTemperature Data Array:")
    print(temperature_data)
    
  3. Inspect Temperature Array: Print the shape, data type (dtype), number of dimensions (ndim), and total number of elements (size) of the temperature_data array.
    print("\nTemperature Array Properties:")
    print(f"    Shape: {temperature_data.shape}")
    print(f"    Data Type: {temperature_data.dtype}")
    print(f"    Dimensions: {temperature_data.ndim}")
    print(f"    Size: {temperature_data.size}")
    
    Self-Check: Does the output match your expectations for a 1D array of 5 float elements?
  4. Particle Counts (2D Integer Array): Two sets of particle count experiments were run simultaneously over four time intervals.
    • Set A counts: 105, 112, 109, 115
    • Set B counts: 98, 101, 100, 104
    Create a 2D NumPy array where each row represents a set of experiments. Use the default integer type.
      counts_list = [[105, 112, 109, 115], [98, 101, 100, 104]]
      particle_counts = np.array(counts_list)
      
      print("\nParticle Count Array:")
      print(particle_counts)
      
  5. Inspect Particle Count Array: Print the shape, data type, dimensions, and size of the particle_counts array.
    print("\nParticle Count Array Properties:")
    print(f"    Shape: {particle_counts.shape}")
    print(f"    Data Type: {particle_counts.dtype}")
    print(f"    Dimensions: {particle_counts.ndim}")
    print(f"    Size: {particle_counts.size}")
    
    Self-Check: Does the shape (2, 4) correctly represent 2 experiments (rows) and 4 time intervals (columns)? Is the data type an integer type?
  6. Create Placeholder Arrays: The lab plans to run another experiment with 3 trials, each measuring 5 voltage readings. Create:
    • An array voltage_readings of shape (3, 5) initialized with zeros.
    • An array error_margins of the same shape, initialized with the value 0.1.
      voltage_readings = np.zeros((3, 5))
      error_margins = np.full((3, 5), 0.1)
      
      print("\nPlaceholder Voltage Readings (Zeros):")
      print(voltage_readings)
      print("\nPlaceholder Error Margins (0.1):")
      print(error_margins)
      
  7. Create Time Sequence: Generate a sequence of time points representing 10 measurements taken every 0.5 seconds, starting from time 0. Use np.arange or np.linspace.
    # Using np.arange
    time_points_arange = np.arange(0, 5, 0.5) # Stop = 10 * 0.5 = 5.0
    print("\nTime Points (arange):", time_points_arange)
    print(f"    Length: {len(time_points_arange)}")
    
    # Using np.linspace
    time_points_linspace = np.linspace(0, 4.5, 10) # Start=0, Stop=4.5, Num=10
    print("Time Points (linspace):", time_points_linspace)
    print(f"    Length: {len(time_points_linspace)}")
    
    Self-Check: Did both methods produce an array of length 10 with the correct time values? np.linspace is often preferred when you know how many points you need and want to include the endpoint; np.arange is better when the step size itself is what matters. Note that np.arange with a floating-point step can occasionally yield one point more or fewer than expected due to rounding, which is another reason to favour np.linspace in such cases.

This workshop introduced creating fundamental NumPy arrays using various methods and inspecting their key attributes (shape, dtype, ndim, size). This forms the basis for performing operations on these arrays.

Basic Array Operations

NumPy shines when performing operations on arrays. These operations are typically vectorized, meaning they operate on entire arrays element-wise without requiring explicit Python loops. This leads to concise code and significant performance gains.

Element-wise Arithmetic Operations:

Standard arithmetic operators (+, -, *, /, ** for exponentiation) work directly on NumPy arrays, applying the operation element by element. For these operations to work between two arrays, the arrays generally need to have compatible shapes (either the same shape or conformable according to broadcasting rules, discussed later).

import numpy as np

arr1 = np.array([1, 2, 3, 4])
arr2 = np.array([10, 20, 30, 40])

# Element-wise addition
sum_arr = arr1 + arr2
print("Sum:", sum_arr) # Output: [11 22 33 44]

# Element-wise subtraction
diff_arr = arr2 - arr1
print("Difference:", diff_arr) # Output: [ 9 18 27 36]

# Element-wise multiplication
prod_arr = arr1 * arr2
print("Product:", prod_arr) # Output: [ 10  40  90 160]

# Element-wise division
div_arr = arr2 / arr1
print("Division:", div_arr) # Output: [10. 10. 10. 10.]

# Element-wise exponentiation
pow_arr = arr1 ** 2
print("Squared:", pow_arr) # Output: [ 1  4  9 16]

# Operations with scalars
scalar_prod = arr1 * 5
print("Scalar Product:", scalar_prod) # Output: [ 5 10 15 20]

scalar_add = arr1 + 100
print("Scalar Addition:", scalar_add) # Output: [101 102 103 104]

Important Note on Division by Zero: NumPy handles division by zero according to IEEE 754 standards:

  • 1 / 0 results in inf (infinity).
  • -1 / 0 results in -inf (negative infinity).
  • 0 / 0 results in nan (Not a Number).

NumPy usually issues a runtime warning in these cases but doesn't raise an error, allowing calculations to proceed.

arr_zeros = np.array([1, 2, 0, 4])
print("\nDivision involving zero:")
# The following line will likely produce a RuntimeWarning
print("1 / arr_zeros:", 1 / arr_zeros) # Output: [ 1.   0.5  inf  0.25]
print("arr_zeros / arr_zeros:", arr_zeros / arr_zeros) # Output: [  1.   1.  nan   1.]

Basic Indexing and Slicing:

Accessing elements in NumPy arrays is similar to Python lists, but with more powerful capabilities, especially for multi-dimensional arrays.

  • 1-D Arrays: Use square brackets [] with the index (0-based). Slicing uses the start:stop:step notation.

    arr1d = np.arange(10) # [0 1 2 3 4 5 6 7 8 9]
    print("\n1D Array:", arr1d)
    
    # Accessing single element
    print("Element at index 3:", arr1d[3]) # Output: 3
    
    # Slicing: elements from index 2 up to (not including) index 7
    print("Slice [2:7]:", arr1d[2:7]) # Output: [2 3 4 5 6]
    
    # Slicing with step
    print("Slice [1:9:2]:", arr1d[1:9:2]) # Output: [1 3 5 7]
    
    # Slicing from the beginning or to the end
    print("Slice [:5]:", arr1d[:5]) # Output: [0 1 2 3 4]
    print("Slice [5:]:", arr1d[5:]) # Output: [5 6 7 8 9]
    
    # Assigning a value to a slice
    arr1d_copy = arr1d.copy() # Important: Create a copy to avoid modifying original
    arr1d_copy[0:3] = 99
    print("Modified Slice:", arr1d_copy) # Output: [99 99 99  3  4  5  6  7  8  9]
    
    Crucial Point: Slices on NumPy arrays return views, not copies. Modifying a slice modifies the original array. If you need an independent copy, use the .copy() method (see the sketch after these examples).

  • Multi-Dimensional Arrays: Use comma-separated indices or slices within the square brackets, one for each dimension (axis). arr[row_index, column_index] or arr[row_slice, column_slice].

    arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    print("\n2D Array:\n", arr2d)
    
    # Accessing single element (row 1, column 2)
    print("Element at [1, 2]:", arr2d[1, 2]) # Output: 6
    
    # Accessing a specific row (returns a 1D array)
    print("Row at index 0:", arr2d[0]) # Output: [1 2 3]
    print("Row at index 0 (explicit):", arr2d[0, :]) # Equivalent, clearer for >1D
    
    # Accessing a specific column (returns a 1D array)
    print("Column at index 1:", arr2d[:, 1]) # Output: [2 5 8]
    
    # Slicing a sub-matrix (rows 0 and 1, columns 1 and 2)
    print("Sub-matrix arr2d[0:2, 1:3]:\n", arr2d[0:2, 1:3])
    # Output:
    # [[2 3]
    #  [5 6]]
    
    # Assigning a value to a 2D slice
    arr2d_copy = arr2d.copy()
    arr2d_copy[1:3, 0:2] = 0
    print("Modified 2D Slice:\n", arr2d_copy)
    # Output:
    # [[1 2 3]
    #  [0 0 6]
    #  [0 0 9]]
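
To make the view-versus-copy distinction above concrete, here is a minimal sketch: modifying a slice changes the original array, while modifying a .copy() does not.

import numpy as np

original = np.arange(5)             # [0 1 2 3 4]

view = original[1:4]                # a view: shares memory with the original
view[0] = 99
print(original)                     # [ 0 99  2  3  4] - the original changed

independent = original[1:4].copy()  # an independent copy
independent[0] = -1
print(original)                     # [ 0 99  2  3  4] - the original is unchanged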
    

Mastering these basic arithmetic operations and indexing/slicing techniques is fundamental for almost any task involving NumPy arrays.

Workshop Basic Array Operations

Goal: Perform calculations and data extraction on the experimental data arrays created in the previous workshop.

Scenario: Using the temperature_data and particle_counts arrays, you need to convert temperatures to Fahrenheit, calculate the average particle count for each experiment set, and extract specific data points.

Steps:

  1. Import NumPy and Reload Data (if needed): Make sure NumPy is imported and you have the arrays from the previous workshop. If not, recreate them.
    import numpy as np
    
    # Recreate if needed:
    temperature_data = np.array([23.5, 24.1, 24.8, 24.5, 23.9], dtype=np.float32)
    particle_counts = np.array([[105, 112, 109, 115], [98, 101, 100, 104]])
    
    print("Arrays ready:")
    print("Temperatures (C):", temperature_data)
    print("Particle Counts:\n", particle_counts)
    
  2. Temperature Conversion: Convert the temperature_data from Celsius to Fahrenheit using the formula: F = C * 9/5 + 32. Perform this using vectorized operations.
    temperature_fahrenheit = temperature_data * (9/5) + 32
    
    print("\nTemperatures (F):")
    print(temperature_fahrenheit)
    print("Data Type:", temperature_fahrenheit.dtype) # Note: Result is likely float64 due to calculation
    
    Self-Check: Verify a value manually. E.g., 23.5 * 9/5 + 32 = 42.3 + 32 = 74.3. Does the output match?
  3. Calculate Average Particle Counts: Calculate the mean (average) count for each experiment set (each row) in the particle_counts array. NumPy provides aggregation functions like mean(), sum(), min(), max(). These can operate along specific axes (axis=0 for column-wise, axis=1 for row-wise).
    # Calculate the mean along axis 1 (across columns for each row)
    average_counts_per_set = particle_counts.mean(axis=1)
    
    print("\nAverage Particle Counts per Set:")
    print(average_counts_per_set)
    print("Shape:", average_counts_per_set.shape) # Should be (2,) - one average per row
    
    Self-Check: Calculate the average for the first row manually: (105 + 112 + 109 + 115) / 4 = 441 / 4 = 110.25. Does the output match?
  4. Calculate Overall Average Count: Calculate the average of all particle counts in the particle_counts array.
    overall_average_count = particle_counts.mean() # No axis specified means aggregate over all elements
    
    print(f"\nOverall Average Particle Count: {overall_average_count:.2f}")
    
  5. Extract Specific Data:
    • Get the temperature at the 3rd time point (index 2).
    • Get the particle counts for the first experiment set (Set A, row 0).
    • Get the particle count for Set B (row 1) at the 2nd time interval (column 1).
    • Get the particle counts for both sets but only for the last two time intervals (columns 2 and 3).
      temp_at_t2 = temperature_data[2]
      print(f"\nTemperature at 3rd time point: {temp_at_t2}")
      
      counts_set_a = particle_counts[0, :] # Or simply particle_counts[0]
      print("Counts for Set A:", counts_set_a)
      
      count_setb_t1 = particle_counts[1, 1]
      print(f"Count for Set B at 2nd interval: {count_setb_t1}")
      
      counts_last_two_intervals = particle_counts[:, 2:4] # All rows (:), columns 2 and 3 (2:4)
      print("Counts for last two intervals:\n", counts_last_two_intervals)
      print("Shape:", counts_last_two_intervals.shape) # Should be (2, 2)
      
  6. Modify Data (Using a View): Create a slice representing the first two temperatures. Add 0.5 degrees to these measurements directly on the slice. Print the original temperature_data array to observe the change (because slices are views).
    temp_slice = temperature_data[0:2]
    print("\nOriginal first two temps:", temp_slice)
    temp_slice += 0.5 # Modify the slice directly
    print("First two temps after adding 0.5:", temp_slice)
    print("Original temperature_data array NOW:", temperature_data)
    
    Self-Check: Did the first two elements of the original temperature_data array change? This demonstrates the concept of views. If you wanted to avoid modifying the original, you should have used .copy().

This workshop applied basic arithmetic operations vectorially and demonstrated how to extract specific data points or sub-arrays using indexing and slicing, including the important distinction between views and copies.

2. Pandas Foundations

While NumPy provides the low-level array structures and numerical operations, Pandas provides higher-level data structures and functions tailored for practical data analysis. Pandas is indispensable for tasks like reading data from files, cleaning messy datasets, handling missing values, and preparing data for modeling or visualization.

Introduction to Pandas Series

A Pandas Series is a one-dimensional labeled array capable of holding data of any type (integers, strings, floating-point numbers, Python objects, etc.). It's similar to a NumPy array but with an important addition: an index. The index provides labels for the data points, allowing for more flexible and intuitive data access compared to using only integer indices. Think of it like a combination of a NumPy array and a Python dictionary.

Key Characteristics of Pandas Series:

  • 1-Dimensional: Holds a sequence of values.
  • Homogeneous Type (Usually): While a Series can hold mixed types (stored as object dtype), performance is best when data is homogeneous (like int64, float64, bool, datetime64[ns], string).
  • Indexed: Each data point has an associated label (the index). If not specified, Pandas creates a default integer index (0, 1, 2...).
  • Vectorized Operations: Like NumPy arrays, Series support vectorized operations, making computations fast and code concise.
  • Handling Missing Data: Pandas uses NaN (Not a Number) to represent missing data and provides tools to detect and handle it.
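
As a small preview of the missing-data tools mentioned above (they are covered in depth in the Intermediate level), isna() detects NaN values and fillna() replaces them; a minimal sketch:

import pandas as pd
import numpy as np

s_missing = pd.Series([1.0, np.nan, 3.0])
print(s_missing.isna())       # False, True, False - flags the missing value
print(s_missing.fillna(0.0))  # 1.0, 0.0, 3.0 - one simple way to handle it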

Creating Pandas Series:

The primary way to create a Series is using the pd.Series() constructor.

import pandas as pd
import numpy as np # Often used alongside Pandas

# Creating a Series from a Python list (default integer index)
data_list = [10, 20, 30, 40, 50]
s1 = pd.Series(data_list)
print("Series from list (default index):")
print(s1)
# Output:
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64

print("\nValues:", s1.values) # Access the underlying NumPy array
print("Index:", s1.index)   # Access the index object

print("-" * 20)

# Creating a Series from a NumPy array with a custom index
data_np = np.array([1.1, 2.2, 3.3, 4.4])
custom_index = ['a', 'b', 'c', 'd']
s2 = pd.Series(data_np, index=custom_index)
print("Series from NumPy array (custom index):")
print(s2)
# Output:
# a    1.1
# b    2.2
# c    3.3
# d    4.4
# dtype: float64

print("\nValues:", s2.values)
print("Index:", s2.index)

print("-" * 20)

# Creating a Series from a Python dictionary
data_dict = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
s3 = pd.Series(data_dict)
print("Series from dictionary:")
print(s3)
# Output: (Keys become index, values become data; order might vary pre-Python 3.7)
# Ohio      35000
# Texas     71000
# Oregon    16000
# Utah       5000
# dtype: int64

# You can also specify an index when creating from a dict, aligning data
states = ['California', 'Ohio', 'Oregon', 'Texas']
s4 = pd.Series(data_dict, index=states)
print("\nSeries from dictionary with specified index:")
print(s4)
# Output: (Values aligned by index labels, missing values become NaN)
# California        NaN
# Ohio          35000.0
# Oregon        16000.0
# Texas         71000.0
# dtype: float64

# Giving a name to the Series and its index
s4.name = 'Population'
s4.index.name = 'State'
print("\nSeries with names:")
print(s4)
# Output:
# State
# California        NaN
# Ohio          35000.0
# Oregon        16000.0
# Texas         71000.0
# Name: Population, dtype: float64

Accessing Data in a Series:

You can access data using the index labels or the integer positions.

  • Using Index Labels: Similar to dictionary key access.
  • Using Integer Positions: Similar to list/array indexing, primarily via the .iloc attribute for explicit position-based access. Using direct square brackets [] can be ambiguous if the index itself contains integers.
# Using s2: index = ['a', 'b', 'c', 'd']
print("\nAccessing data in s2:")
print("Element with label 'b':", s2['b']) # Output: 2.2

# Using .loc for explicit label-based access (preferred)
print("Element with label 'c' using .loc:", s2.loc['c']) # Output: 3.3

# Accessing multiple elements by label
print("Elements 'a' and 'd':\n", s2[['a', 'd']]) # Pass a list of labels
# Output:
# a    1.1
# d    4.4
# dtype: float64

# Using integer positions with .iloc (0-based)
print("Element at position 1 using .iloc:", s2.iloc[1]) # Output: 2.2 (corresponds to 'b')

# Slicing by integer position with .iloc
print("Elements from position 1 to 3 using .iloc:\n", s2.iloc[1:4]) # Excludes index 4
# Output:
# b    2.2
# c    3.3
# d    4.4
# dtype: float64

# Slicing by label with .loc (inclusive of the end label)
print("Elements from 'b' to 'd' using .loc:\n", s2.loc['b':'d'])
# Output:
# b    2.2
# c    3.3
# d    4.4
# dtype: float64
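
The ambiguity mentioned above becomes concrete when the index itself contains integers; a minimal sketch showing why the explicit .loc and .iloc accessors are preferred:

import pandas as pd

s_int_idx = pd.Series(['a', 'b', 'c'], index=[3, 2, 1])

print(s_int_idx.loc[1])   # 'c' - the value whose *label* is 1
print(s_int_idx.iloc[1])  # 'b' - the value at *position* 1
# Plain s_int_idx[1] uses the label here, which is easy to misread,
# so being explicit avoids surprises.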

Vectorized Operations and Alignment:

Arithmetic operations between Series align data based on index labels. If an index label exists in one Series but not the other, the result for that label will be NaN.

s5 = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
s6 = pd.Series([10, 20, 30, 40], index=['w', 'x', 'y', 'a'])

print("\ns5:\n", s5)
print("s6:\n", s6)

# Addition aligns on index labels 'x' and 'y'
print("\ns5 + s6:\n", s5 + s6)
# Output:
# a     NaN
# w     NaN
# x    21.0  (1 + 20)
# y    33.0  (2 + 30)
# z     NaN
# dtype: float64

# NumPy universal functions (ufuncs) also work on Series
print("\nSquare root of s2:\n", np.sqrt(s2))
# Output:
# a    1.048809
# b    1.483240
# c    1.816590
# d    2.097618
# dtype: float64

Pandas Series form the building blocks for DataFrames and are essential for representing single columns or rows of data with meaningful labels.

Workshop Creating and Manipulating Series

Goal: Create and manipulate Pandas Series to represent and analyze product inventory and sales data.

Scenario: You manage inventory for a small online store. You need to track the number of items in stock for different products and their corresponding sales figures for the last month.

Steps:

  1. Import Pandas: Start by importing the Pandas library.
    import pandas as pd
    import numpy as np # We might need NaN later
    print("Pandas imported successfully!")
    
  2. Create Inventory Series: Create a Pandas Series named inventory to store the current stock count for the following products: 'Laptop', 'Keyboard', 'Mouse', 'Webcam', 'Monitor'. The stock counts are 15, 75, 120, 30, and 25 respectively. Use the product names as the index.
    products = ['Laptop', 'Keyboard', 'Mouse', 'Webcam', 'Monitor']
    stock_counts = [15, 75, 120, 30, 25]
    
    inventory = pd.Series(stock_counts, index=products, name='InventoryCount')
    
    print("Product Inventory Series:")
    print(inventory)
    
  3. Inspect the Series: Print the index, values, dtype, and name of the inventory Series.
    print("\nInventory Series Details:")
    print(f"    Index: {inventory.index}")
    print(f"    Values: {inventory.values}")
    print(f"    Data Type: {inventory.dtype}")
    print(f"    Name: {inventory.name}")
    
  4. Create Sales Series: Create another Series named monthly_sales representing the number of units sold last month. Data: Laptop: 5, Mouse: 45, Keyboard: 30, Monitor: 8, HDMI Cable: 50. Notice 'Webcam' is missing and 'HDMI Cable' is extra compared to the inventory list.
    sales_data = {'Laptop': 5, 'Mouse': 45, 'Keyboard': 30, 'Monitor': 8, 'HDMI Cable': 50}
    monthly_sales = pd.Series(sales_data, name='MonthlySales')
    
    print("\nMonthly Sales Series:")
    print(monthly_sales)
    
  5. Access Inventory Data:
    • Retrieve the stock count for 'Mouse'.
    • Retrieve the stock counts for 'Keyboard' and 'Monitor' simultaneously.
    • Retrieve the stock count for the first two products using integer-based indexing (.iloc).
      mouse_stock = inventory['Mouse'] # or inventory.loc['Mouse']
      print(f"\nStock count for Mouse: {mouse_stock}")
      
      keyboard_monitor_stock = inventory[['Keyboard', 'Monitor']] # or inventory.loc[['Keyboard', 'Monitor']]
      print("\nStock for Keyboard and Monitor:")
      print(keyboard_monitor_stock)
      
      first_two_stock_iloc = inventory.iloc[0:2]
      print("\nStock for first two products (iloc):")
      print(first_two_stock_iloc)
      
  6. Check Stock Levels: Use boolean indexing to find which products have an inventory level below 50.
    low_stock_products = inventory[inventory < 50]
    print("\nProducts with stock < 50:")
    print(low_stock_products)
    
  7. Calculate Remaining Stock: Calculate the approximate remaining stock after last month's sales. Subtract monthly_sales from inventory. Observe how Pandas handles the alignment and missing items.
    remaining_stock = inventory - monthly_sales
    
    print("\nEstimated Remaining Stock (Inventory - Sales):")
    print(remaining_stock)
    
    Self-Check: Why are 'Webcam' and 'HDMI Cable' resulting in NaN (Not a Number)? Because 'Webcam' exists in inventory but not monthly_sales, and 'HDMI Cable' exists in monthly_sales but not inventory. Arithmetic operations require matching index labels for non-NaN results unless specific methods with fill_value are used (which we'll see later).
  8. Update Inventory: Suppose a new shipment arrived: 10 Laptops and 20 Webcams were added. Update the inventory Series accordingly.
    print("\nOriginal Inventory for Laptop and Webcam:")
    print(inventory.loc[['Laptop', 'Webcam']])
    
    # Method 1: Direct assignment (if index exists)
    inventory['Laptop'] += 10
    inventory['Webcam'] += 20
    
    # Method 2: Using .loc for potentially multiple updates
    # inventory.loc[['Laptop', 'Webcam']] = inventory.loc[['Laptop', 'Webcam']] + [10, 20]
    
    print("\nUpdated Inventory for Laptop and Webcam:")
    print(inventory.loc[['Laptop', 'Webcam']])
    
    print("\nFull Inventory after update:")
    print(inventory)
    

This workshop demonstrated creating Series with custom indices, accessing data using labels and positions, performing vectorized operations (including alignment behavior), filtering using boolean conditions, and modifying Series values.
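
As a brief preview of the fill_value behaviour mentioned in step 7 (covered more fully later), arithmetic methods such as Series.sub accept a fill_value that substitutes a default for labels missing on either side; a minimal sketch with made-up numbers:

import pandas as pd

inventory_mini = pd.Series({'Laptop': 15, 'Webcam': 30})
sales_mini = pd.Series({'Laptop': 5, 'HDMI Cable': 50})

# Missing labels are treated as 0 instead of producing NaN
print(inventory_mini.sub(sales_mini, fill_value=0))
# HDMI Cable   -50.0
# Laptop        10.0
# Webcam        30.0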

Introduction to Pandas DataFrames

The DataFrame is the most central and commonly used data structure in Pandas. It represents a rectangular table of data, conceptually similar to a spreadsheet, a SQL table, or a dictionary of Series objects. It's designed to handle heterogeneous data types across columns.

Key Characteristics of Pandas DataFrames:

  • 2-Dimensional: Data is organized in rows and columns.
  • Labeled Axes: Both rows (index) and columns have labels.
  • Heterogeneous Columns: Different columns can have different data types (e.g., int, float, string, datetime).
  • Mutable Size (Potentially): Columns can generally be inserted and deleted. Inserting/deleting rows is possible but often less efficient than column operations.
  • Built on NumPy: Under the hood, the data is often stored in one or more NumPy arrays, ensuring efficiency for many operations.
  • Handles Missing Data: Uses NaN for missing values.

Creating Pandas DataFrames:

DataFrames can be created from various inputs:

  • Dictionary of lists or NumPy arrays: Keys become column names, values (lists/arrays) become column data. Lengths must match.
  • Dictionary of Series: Similar to above, but Series indices are aligned; the union of all indices becomes the DataFrame index.
  • List of Dictionaries: Each dictionary represents a row. Keys become column names. Pandas infers column names and fills missing values with NaN.
  • 2D NumPy array: With optional row and column labels.
  • Another DataFrame: Creating a copy or subset.
  • Reading from files: (CSV, Excel, SQL, etc.) - A very common method, covered later.
import pandas as pd
import numpy as np

# 1. From a Dictionary of Lists
data_dict_lists = {
    'StudentID': [101, 102, 103, 104],
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Score': [85.5, 92.0, 78.5, 88.0]
}
df1 = pd.DataFrame(data_dict_lists)
print("DataFrame from Dict of Lists:")
print(df1)
# Output:
#    StudentID     Name  Score
# 0        101    Alice   85.5
# 1        102      Bob   92.0
# 2        103  Charlie   78.5
# 3        104    David   88.0

print("\nDataFrame Info:")
df1.info() # Provides a concise summary (index type, columns, non-null counts, dtype, memory usage)

print("-" * 30)

# 2. From a Dictionary of Series (index alignment)
s_pop = pd.Series([35000, 71000, 16000], index=['Ohio', 'Texas', 'Oregon'])
s_area = pd.Series([44825, 268596, 98381], index=['Ohio', 'Texas', 'Oregon'])
s_capital = pd.Series(['Columbus', 'Austin', 'Salem', 'Sacramento'], index=['Ohio', 'Texas', 'Oregon', 'California'])

df2 = pd.DataFrame({'Population': s_pop, 'Area': s_area, 'Capital': s_capital})
print("DataFrame from Dict of Series:")
print(df2)
# Output: (Index is union of Series indices, columns align, NaN where missing)
#             Population    Area     Capital
# California         NaN     NaN  Sacramento
# Ohio           35000.0   44825.0    Columbus
# Oregon         16000.0   98381.0       Salem
# Texas          71000.0  268596.0      Austin

print("\nColumns:", df2.columns)
print("Index:", df2.index)

print("-" * 30)

# 3. From a List of Dictionaries
data_list_dicts = [
    {'colA': 1, 'colB': 'x'},
    {'colA': 2, 'colB': 'y', 'colC': True},
    {'colA': 3, 'colB': 'z'}
]
df3 = pd.DataFrame(data_list_dicts)
print("DataFrame from List of Dicts:")
print(df3)
# Output: (Columns inferred, NaN for missing keys)
#    colA colB   colC
# 0     1    x    NaN
# 1     2    y   True
# 2     3    z    NaN

print("-" * 30)

# 4. From a 2D NumPy Array
data_np_2d = np.random.randn(3, 4) # 3 rows, 4 columns
row_labels = ['R1', 'R2', 'R3']
col_labels = ['C1', 'C2', 'C3', 'C4']
df4 = pd.DataFrame(data_np_2d, index=row_labels, columns=col_labels)
print("DataFrame from 2D NumPy Array:")
print(df4)
# Output: (Example random values)
#           C1        C2        C3        C4
# R1 -0.595167  0.185749 -0.664021 -0.811483
# R2  0.562083 -0.101249  0.276538 -0.239427
# R3  0.484972  0.121686  0.095458 -0.593214
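
As noted in the DataFrame characteristics above, columns can be inserted and deleted after a DataFrame is created; a minimal, self-contained sketch:

import pandas as pd

df_demo = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85.5, 92.0]})

# Insert a new column derived from an existing one
df_demo['Passed'] = df_demo['Score'] >= 80
print(df_demo)

# Drop a column (returns a new DataFrame; the original is unchanged)
df_no_score = df_demo.drop(columns=['Score'])
print(df_no_score)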

Basic DataFrame Inspection:

Once you have a DataFrame, several attributes and methods are essential for understanding its structure and content:

  • df.head(n=5): Returns the first n rows.
  • df.tail(n=5): Returns the last n rows.
  • df.shape: Returns a tuple representing the dimensionality (rows, columns).
  • df.index: Access the row index.
  • df.columns: Access the column labels.
  • df.dtypes: Returns the data type of each column.
  • df.info(): Provides a concise summary including index dtype, column dtypes, non-null values, and memory usage.
  • df.describe(): Generates descriptive statistics (count, mean, std, min, max, percentiles) for numerical columns. Can be adapted for object/categorical columns too.
print("\nInspecting df1:")
print("Head:\n", df1.head(2))
print("\nShape:", df1.shape)
print("\nData Types:\n", df1.dtypes)
print("\nDescriptive Statistics:\n", df1.describe())
# For df2 which has NaNs and mixed types
print("\nInspecting df2:")
df2.info()
print("\nDescriptive Statistics (df2):\n", df2.describe()) # Only for numeric columns
print("\nDescribe all columns (df2):\n", df2.describe(include='all')) # Include non-numeric

DataFrames are the workhorse of Pandas, providing a flexible and powerful way to manage and prepare structured data for analysis.

Workshop DataFrame Creation and Basic Inspection

Goal: Create a Pandas DataFrame to store student information and perform basic inspection tasks.

Scenario: You are a teaching assistant and need to create a digital grade book. You have student IDs, names, and scores from two different assignments (Assignment 1 and Assignment 2).

Steps:

  1. Import Pandas: Ensure Pandas is imported.
    import pandas as pd
    import numpy as np
    print("Pandas imported successfully!")
    
  2. Prepare Data: Create Python lists or NumPy arrays for the student data:
    • student_ids: [S101, S102, S103, S104, S105]
    • student_names: ['Aditya', 'Bhavna', 'Chetan', 'Divya', 'Elena']
    • assignment1_scores: [8, 10, 7, 9, 8]
    • assignment2_scores: [9, 7, 8, 10, np.nan] (Elena missed the second assignment)
      student_ids = ['S101', 'S102', 'S103', 'S104', 'S105']
      student_names = ['Aditya', 'Bhavna', 'Chetan', 'Divya', 'Elena']
      assignment1_scores = [8, 10, 7, 9, 8]
      assignment2_scores = [9, 7, 8, 10, np.nan] # Using np.nan for the missing score
      
  3. Create DataFrame: Create a DataFrame named gradebook_df from a dictionary where keys are the desired column names ('ID', 'Name', 'Assignment1', 'Assignment2') and values are the lists created above.
    data_for_df = {
        'ID': student_ids,
        'Name': student_names,
        'Assignment1': assignment1_scores,
        'Assignment2': assignment2_scores
    }
    gradebook_df = pd.DataFrame(data_for_df)
    
    print("Initial Gradebook DataFrame:")
    print(gradebook_df)
    
  4. Set Student ID as Index: Often, a unique identifier like Student ID makes a good index. Set the 'ID' column as the DataFrame's index. Use the set_index() method. Make sure the change persists in gradebook_df (use inplace=True or reassign the result).
    # Option 1: Reassignment (often safer)
    # gradebook_df = gradebook_df.set_index('ID')
    
    # Option 2: Inplace modification
    gradebook_df.set_index('ID', inplace=True)
    
    print("\nGradebook DataFrame with ID as Index:")
    print(gradebook_df)
    
    Self-Check: Is 'ID' no longer a regular column but now appears as the index label?
  5. Basic Inspection: Perform the following inspections on gradebook_df:
    • Display the first 3 rows.
    • Display the last 2 rows.
    • Show the dimensions (shape).
    • List the column names.
    • Show the data types of each column.
      print("\n--- DataFrame Inspection ---")
      print("First 3 rows (head):\n", gradebook_df.head(3))
      print("\nLast 2 rows (tail):\n", gradebook_df.tail(2))
      print("\nShape:", gradebook_df.shape)
      print("\nColumn Names:", gradebook_df.columns)
      print("\nIndex:", gradebook_df.index) # Also check the index
      print("\nData Types:\n", gradebook_df.dtypes)
      
  6. Use info(): Get a concise summary using the .info() method. Pay attention to the Non-Null Count for 'Assignment2'.
    print("\nDataFrame Info Summary:")
    gradebook_df.info()
    
    Self-Check: Does info() show 5 entries for the index, 'Name', and 'Assignment1', but only 4 non-null entries for 'Assignment2'? This confirms the presence of the NaN value.
  7. Use describe(): Generate descriptive statistics for the numerical columns ('Assignment1', 'Assignment2').
    print("\nDescriptive Statistics for Scores:")
    print(gradebook_df.describe())
    
    Self-Check: Notice that the count for 'Assignment2' is 4, reflecting the missing value which is excluded from calculations like mean, std, etc., by default.

This workshop covered creating a DataFrame from related data collections, setting a meaningful index, and using fundamental methods (head, tail, shape, columns, index, dtypes, info, describe) to understand its structure and basic statistical properties.


Intermediate Level

Now that you have a grasp of the basic structures, this level focuses on common data wrangling tasks: getting data into Pandas, cleaning it up, and transforming it into a more usable format. These are critical steps in nearly every data analysis workflow.

3. Data Loading and Saving with Pandas

Real-world data rarely starts inside your Python script. It usually resides in external files like CSVs, Excel spreadsheets, JSON files, or databases. Pandas provides efficient and easy-to-use functions to read data from these sources into DataFrames and write DataFrames back out.

Reading from Different File Formats (CSV, Excel, JSON)

Pandas offers a suite of read_* functions. We'll focus on the most common ones:

  • pd.read_csv(filepath_or_buffer, ...): Reads data from Comma Separated Value (.csv) files. This is perhaps the most frequently used data loading function. It has many optional parameters to handle various CSV complexities (different delimiters, headers, specific columns, data types, date parsing, etc.).
  • pd.read_excel(io, sheet_name=0, ...): Reads data from Microsoft Excel files (.xls, .xlsx, .xlsm, .xlsb, .odf). It requires an additional library (openpyxl for .xlsx or xlrd for .xls). You can specify which sheet to read by name or index.
  • pd.read_json(path_or_buf, orient='records', ...): Reads data from JSON (JavaScript Object Notation) strings or files. JSON data can have various structures, and the orient parameter helps Pandas interpret it correctly (e.g., 'records' for a list of JSON objects, 'columns' for a dict where keys are columns, 'index' for a dict where keys are index labels).

Common Parameters for read_* functions:

  • filepath_or_buffer / io / path_or_buf: The path to the file (string), a URL, or any object with a read() method (like a file handle).
  • sep or delimiter (for read_csv): The character used to separate values (e.g., ',', '\t' for tab, ';'). Pandas usually infers this well but explicit setting helps.
  • header: Row number(s) to use as the column names. Default is 0 (first row). Use None if the file has no header.
  • index_col: Column number or name to use as the DataFrame's row index.
  • usecols: A list of column names or indices to read, useful for large files where you only need a subset.
  • dtype: A dictionary mapping column names to specific data types (e.g., {'col_name': np.float64}). Helps prevent incorrect type inference and saves memory.
  • parse_dates: A list of column names or indices that should be parsed as dates/times.
  • na_values: A list of strings or a dictionary mapping column names to strings that should be treated as NaN (missing values).

Example Reading CSV:

Assume we have a file named students.csv:

StudentID,FullName,Major,GPA
S201,Alice Smith,Physics,3.7
S202,Bob Johnson,CompSci,3.9
S203,Charlie Brown,CompSci,3.5
S204,Diana Ross,Math,3.8
S205,,Physics,
import pandas as pd
import io # To simulate a file from a string

# Simulate reading from a file (replace with 'students.csv' if you create the file)
csv_data = """StudentID,FullName,Major,GPA
S201,Alice Smith,Physics,3.7
S202,Bob Johnson,CompSci,3.9
S203,Charlie Brown,CompSci,3.5
S204,Diana Ross,Math,3.8
S205,,Physics,""" # Note the missing name and GPA

# Basic read
# df_students = pd.read_csv('students.csv')
df_students = pd.read_csv(io.StringIO(csv_data))

print("--- Basic CSV Read ---")
print(df_students)
df_students.info() # Observe inferred types and non-null counts

print("\n--- CSV Read with Options ---")
# Simulate reading again, using StudentID as index and handling missing values explicitly
# df_students_indexed = pd.read_csv('students.csv', index_col='StudentID', na_values=[''])
df_students_indexed = pd.read_csv(io.StringIO(csv_data),
                                  index_col='StudentID',
                                  na_values=[''] # Treat empty strings as NaN
                                 )
print(df_students_indexed)
df_students_indexed.info() # Note StudentID is now index, FullName/GPA have NaNs
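
A few of the other parameters listed above (usecols, dtype, parse_dates) can be sketched the same way; the file contents and column names below are made up purely for illustration:

import io
import pandas as pd

mini_csv = """StudentID,FullName,GPA,EnrolledOn
S301,Eve Adams,3.6,2023-09-01
S302,Frank Zhang,3.9,2023-09-15"""

df_mini = pd.read_csv(io.StringIO(mini_csv),
                      usecols=['StudentID', 'GPA', 'EnrolledOn'],  # read only these columns
                      dtype={'GPA': 'float64'},                    # force a specific dtype
                      parse_dates=['EnrolledOn'])                  # parse as datetime64[ns]
print(df_mini)
print(df_mini.dtypes)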

Example Reading Excel:

Assume an Excel file grades.xlsx with a sheet named Sheet1:

StudentID  Assignment1  Assignment2  FinalExam
S201       90           85           92
S202       95           98           96
S203       80           82           85

(You would need openpyxl installed: pip install openpyxl)

# Assuming 'grades.xlsx' exists in the same directory
# Requires openpyxl: pip install openpyxl
try:
    # df_grades = pd.read_excel('grades.xlsx', sheet_name='Sheet1', index_col='StudentID')
    # print("\n--- Excel Read ---")
    # print(df_grades)
    # df_grades.info()
    print("\n--- Excel Read (Skipped - requires file/openpyxl) ---")
    # Placeholder DataFrame for demonstration
    df_grades_placeholder = pd.DataFrame({
        'Assignment1': [90, 95, 80],
        'Assignment2': [85, 98, 82],
        'FinalExam': [92, 96, 85]
    }, index=pd.Index(['S201', 'S202', 'S203'], name='StudentID'))
    print(df_grades_placeholder)
    df_grades_placeholder.info()

except ImportError:
    print("\nError: 'openpyxl' library not found. Please install it to read .xlsx files.")
except FileNotFoundError:
    print("\nError: 'grades.xlsx' not found. Cannot demonstrate Excel reading.")

Example Reading JSON:

Assume a file products.json with record-oriented data:

[
  {"ProductID": "P1001", "Name": "Laptop", "Price": 1200.00, "InStock": true},
  {"ProductID": "P1002", "Name": "Keyboard", "Price": 75.50, "InStock": true},
  {"ProductID": "P1003", "Name": "Mouse", "Price": 25.99, "InStock": false}
]
# Simulate reading from 'products.json'
json_data = """
[
  {"ProductID": "P1001", "Name": "Laptop", "Price": 1200.00, "InStock": true},
  {"ProductID": "P1002", "Name": "Keyboard", "Price": 75.50, "InStock": true},
  {"ProductID": "P1003", "Name": "Mouse", "Price": 25.99, "InStock": false}
]
"""
# df_products = pd.read_json('products.json', orient='records')
df_products = pd.read_json(io.StringIO(json_data), orient='records')

print("\n--- JSON Read ('records' orient) ---")
print(df_products)
df_products.info() # Note Price is float, InStock is bool

# Example with 'columns' orient JSON
json_data_columns = """
{
  "ProductID": {"0": "P1001", "1": "P1002", "2": "P1003"},
  "Name": {"0": "Laptop", "1": "Keyboard", "2": "Mouse"},
  "Price": {"0": 1200.00, "1": 75.50, "2": 25.99},
  "InStock": {"0": true, "1": true, "2": false}
}
"""
df_products_cols = pd.read_json(io.StringIO(json_data_columns), orient='columns')
print("\n--- JSON Read ('columns' orient) ---")
print(df_products_cols) # Outer JSON keys become columns; inner keys ('0', '1', '2') form the index

Mastering these reading functions and their common parameters is crucial for starting any data analysis project. Always inspect the loaded DataFrame (head(), info(), describe()) to ensure the data was read correctly.

Workshop Loading and Initial Exploration of a Real Dataset

Goal: Load a real-world dataset (e.g., Titanic passenger data) from a CSV file and perform initial inspection.

Dataset: We'll use the well-known Titanic dataset, often available online. If you don't have it locally, you can often load it directly from a URL. A common source is from the Seaborn library's GitHub repository.

Steps:

  1. Import Pandas:
    import pandas as pd
    print("Pandas imported.")
    
  2. Define Dataset Location: Specify the path or URL to the dataset. We'll use a direct URL for convenience.
    # URL for the Titanic dataset (CSV format) from Seaborn's GitHub repo
    # (Check if this URL is still valid or find an alternative source if needed)
    titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
    print(f"Attempting to load data from: {titanic_url}")
    
  3. Load the CSV Data: Use pd.read_csv() to load the data from the URL into a DataFrame called titanic_df.
    try:
        titanic_df = pd.read_csv(titanic_url)
        print("\nTitanic dataset loaded successfully!")
    except Exception as e:
        print(f"\nError loading dataset: {e}")
        print("Please check the URL or your internet connection.")
        # As a fallback, create a minimal placeholder DataFrame
        titanic_df = pd.DataFrame({
            'survived': [0, 1, 1],
            'pclass': [3, 1, 3],
            'sex': ['male', 'female', 'female'],
            'age': [22.0, 38.0, 26.0],
            'fare': [7.25, 71.2833, 7.925],
            'embarked': ['S', 'C', 'S']
        })
        print("\nUsing a minimal placeholder DataFrame for the workshop.")
    
  4. Initial Inspection - Size and Shape: How many rows and columns does the dataset have?
    if 'titanic_df' in locals(): # Check if DataFrame exists
        print("\n--- Dataset Shape ---")
        print(f"Dimensions (rows, columns): {titanic_df.shape}")
        print(f"Total number of passengers (rows): {titanic_df.shape[0]}")
        print(f"Total number of features (columns): {titanic_df.shape[1]}")
    
  5. Initial Inspection - Head and Tail: Look at the first few and last few rows to get a feel for the data and column names.
    if 'titanic_df' in locals():
        print("\n--- First 5 Rows (head) ---")
        print(titanic_df.head())
    
        print("\n--- Last 5 Rows (tail) ---")
        print(titanic_df.tail())
    
    Self-Check: Identify the column names. What kind of data seems to be in each column (numeric, text, categorical)? Are there any obvious missing values (often represented as NaN)?
  6. Initial Inspection - Data Types and Missing Values: Use info() to get a summary of column data types and non-null counts. This is crucial for identifying missing data.
    if 'titanic_df' in locals():
        print("\n--- DataFrame Info (Data Types and Non-Null Counts) ---")
        titanic_df.info()
    
    Self-Check: Which columns have missing values (where Non-Null Count is less than the total number of entries/rows)? What are the data types inferred by Pandas? Do they seem appropriate (e.g., 'age' as float, 'sex' as object/string)?
  7. Initial Inspection - Numerical Summary: Use describe() to get summary statistics for the numerical columns.
    if 'titanic_df' in locals():
        print("\n--- Descriptive Statistics (Numerical Columns) ---")
        print(titanic_df.describe())
    
    Self-Check: Look at the mean, min, max, and standard deviation (std) for columns like 'age', 'fare'. Does anything look surprising (e.g., minimum age, maximum fare)? The count here also confirms the number of non-missing values for these columns.
  8. Initial Inspection - Categorical Summary: Use describe(include=['object', 'category']) to get summary statistics for non-numerical columns (like 'sex', 'embarked').
    if 'titanic_df' in locals():
        print("\n--- Descriptive Statistics (Object/Categorical Columns) ---")
        # Include 'category' if you convert types later
        print(titanic_df.describe(include=['object']))
    
    Self-Check: What are the unique values in columns like 'sex' or 'embarked'? What is the most frequent value (top) and its frequency (freq)?

This workshop provided hands-on experience loading a standard dataset and performing the essential first steps of data exploration: checking dimensions, previewing data, understanding data types and missing values, and getting basic statistical summaries. This initial analysis informs the subsequent data cleaning and preparation steps.
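
To answer the self-check questions about unique values more directly, value_counts() tallies each distinct value in a column; a minimal sketch (assuming titanic_df was loaded as above):

if 'titanic_df' in locals():
    print("\nValue counts for 'sex':")
    print(titanic_df['sex'].value_counts())
    print("\nValue counts for 'embarked' (including missing values):")
    print(titanic_df['embarked'].value_counts(dropna=False))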

Writing Data to Files

After processing or analyzing your data, you often need to save the resulting DataFrame to a file for later use, sharing, or input into another system. Pandas provides corresponding to_* methods.

  • df.to_csv(path_or_buf, sep=',', index=True, header=True, ...): Writes the DataFrame to a CSV file.
    • index=True (default): Writes the DataFrame index as a column. Set to False if you don't want to save the index.
    • header=True (default): Writes the column names as the header row. Set to False for no header.
    • na_rep='' (default): String representation for NaN values (missing data). You might set this to something specific if needed, e.g., 'NULL'.
    • float_format='%.2f': Format string for floating point numbers (e.g., limit to 2 decimal places).
    • columns: List of column names to write (if you want to save only a subset).
  • df.to_excel(excel_writer, sheet_name='Sheet1', index=True, header=True, ...): Writes the DataFrame to an Excel file. Requires openpyxl.
    • excel_writer: Path to the file or an ExcelWriter object (for writing multiple sheets to the same file).
    • sheet_name: Name of the sheet to write to.
  • df.to_json(path_or_buf, orient='records', lines=False, ...): Writes the DataFrame to a JSON file.
    • orient: Controls the JSON structure (e.g., 'records', 'columns', 'index', 'values', 'table'). 'records' is common for row-based data.
    • lines=True: Writes each row as a separate JSON object on a new line (JSON Lines format). Useful for streaming or large files.
    • date_format: Format for date/time objects (e.g., 'iso' for ISO 8601).

Example Writing Data:

Let's use the df_students_indexed DataFrame from the CSV reading example.

import pandas as pd
import io
import numpy as np # for NaN example later

# Recreate a sample DataFrame
csv_data = """StudentID,FullName,Major,GPA
S201,Alice Smith,Physics,3.7
S202,Bob Johnson,CompSci,3.9
S203,Charlie Brown,CompSci,3.5
S204,Diana Ross,Math,3.8
S205,,Physics,"""
df_students_indexed = pd.read_csv(io.StringIO(csv_data),
                                  index_col='StudentID',
                                  na_values=['']
                                 )
# Add a hypothetical column
df_students_indexed['Status'] = ['Active', 'Active', 'Graduated', 'Active', 'Active']
print("--- DataFrame to Save ---")
print(df_students_indexed)

# --- Writing to CSV ---
# Prepare a buffer to simulate writing to a file
csv_output_buffer = io.StringIO()
df_students_indexed.to_csv(csv_output_buffer, index=True) # Include index
print("\n--- CSV Output (with index) ---")
print(csv_output_buffer.getvalue())

csv_output_buffer_noindex = io.StringIO()
df_students_indexed.to_csv(csv_output_buffer_noindex, index=False) # Exclude index
print("\n--- CSV Output (without index) ---")
print(csv_output_buffer_noindex.getvalue())

csv_output_buffer_custom = io.StringIO()
df_students_indexed.to_csv(csv_output_buffer_custom, sep='\t', na_rep='MISSING') # Tab separated, custom NaN
print("\n--- CSV Output (tab separated, custom NaN) ---")
print(csv_output_buffer_custom.getvalue())

# --- Writing to Excel ---
# Requires openpyxl: pip install openpyxl
excel_output_path = 'output_students.xlsx' # Specify a file path
try:
    df_students_indexed.to_excel(excel_output_path, sheet_name='Student Data', index=False)
    print(f"\n--- Excel Output ---")
    print(f"DataFrame successfully written to '{excel_output_path}' (index excluded).")
    # You would typically open the Excel file to verify
except ImportError:
    print("\nError: 'openpyxl' not found. Skipping Excel writing.")

# --- Writing to JSON ---
json_output_buffer_records = io.StringIO()
df_students_indexed.to_json(json_output_buffer_records, orient='records', indent=4) # Pretty-printed records
print("\n--- JSON Output (orient='records', indented) ---")
print(json_output_buffer_records.getvalue())

json_output_buffer_index = io.StringIO()
# Reset index to make StudentID a column for better index-oriented JSON
df_students_indexed.reset_index().to_json(json_output_buffer_index, orient='index', indent=4)
print("\n--- JSON Output (orient='index', indented, after reset_index) ---")
print(json_output_buffer_index.getvalue())

json_output_buffer_lines = io.StringIO()
df_students_indexed.to_json(json_output_buffer_lines, orient='records', lines=True) # One JSON object per line
print("\n--- JSON Output (orient='records', lines=True) ---")
print(json_output_buffer_lines.getvalue())

# Clean up the created Excel file if it exists (optional)
import os
if os.path.exists(excel_output_path):
    try:
        os.remove(excel_output_path)
        print(f"\nCleaned up temporary file: '{excel_output_path}'")
    except PermissionError:
         print(f"\nWarning: Could not remove temporary file '{excel_output_path}'. It might be open.")

Choosing the right output format and parameters depends on how the data will be used next. CSV is universal but simple, Excel is good for human readability and basic analysis, while JSON is flexible and common in web applications and APIs.

Workshop Saving Processed Data

Goal: Load the Titanic dataset, perform a simple modification (e.g., create a new column), and save the modified DataFrame to both CSV and JSON formats.

Scenario: After loading the Titanic dataset, you want to create a new column indicating whether a passenger was a child (e.g., age < 18). You then need to save this enhanced dataset for future analysis.

Steps:

  1. Import Pandas and Load Data: Import Pandas and reload the Titanic dataset (using the URL or a local copy). Handle potential loading errors as before.
    import pandas as pd
    import numpy as np # Needed for checking NaN
    
    print("Pandas and NumPy imported.")
    
    # URL or local path
    titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
    # titanic_url = 'titanic.csv' # If you have it locally
    
    try:
        titanic_df = pd.read_csv(titanic_url)
        print("\nTitanic dataset loaded successfully!")
        print(f"Original shape: {titanic_df.shape}")
    except Exception as e:
        print(f"\nError loading dataset: {e}")
        # Create a minimal placeholder if loading fails
        titanic_df = pd.DataFrame({
            'survived': [0, 1, 1, 0, 1],
            'pclass': [3, 1, 3, 1, 3],
            'sex': ['male', 'female', 'female', 'male', 'female'],
            'age': [22.0, 38.0, 26.0, 54.0, 4.0], # Added a child age
            'fare': [7.25, 71.2833, 7.925, 51.8625, 16.7],
            'embarked': ['S', 'C', 'S', 'S', 'S']
        })
        print("\nUsing a minimal placeholder DataFrame.")
        print(f"Placeholder shape: {titanic_df.shape}")
    
  2. Perform Data Modification: Create a new boolean column named is_child. This column should be True if the passenger's 'age' is less than 18, and False otherwise. Decide how to handle NaN values in the 'age' column: a child cannot be identified when the age is unknown, so treat those passengers as False for this exercise (conveniently, a comparison against NaN already evaluates to False).
    if 'titanic_df' in locals():
        # Method 1: Vectorized boolean comparison. Comparing NaN with 18 yields False,
        # so passengers with an unknown age are automatically treated as not children.
        titanic_df['is_child'] = titanic_df['age'] < 18
    
        # Method 2: np.where(condition, value_if_true, value_if_false) makes the
        # True/False choice explicit and scales better to more complex logic.
        # titanic_df['is_child'] = np.where(titanic_df['age'] < 18, True, False)
    
        # An .apply-based variant is also possible, but for a simple threshold the
        # vectorized comparison above is the cleanest and fastest option.
    
        # Ensure the dtype is boolean (the comparison already returns bool; this is
        # just a cheap safeguard).
        titanic_df['is_child'] = titanic_df['is_child'].astype(bool)
    
    
        print("\n--- DataFrame with 'is_child' column added ---")
        print(titanic_df[['age', 'is_child']].head(10)) # Show relevant columns
        print(f"\nNew shape: {titanic_df.shape}")
        print("\nValue counts for 'is_child':")
        print(titanic_df['is_child'].value_counts())
    
  3. Save to CSV: Save the modified titanic_df to a CSV file named titanic_enhanced.csv. Do not include the default Pandas integer index in the output file.
    output_csv_path = 'titanic_enhanced.csv'
    if 'titanic_df' in locals():
        try:
            titanic_df.to_csv(output_csv_path, index=False)
            print(f"\nDataFrame successfully saved to '{output_csv_path}' (index excluded).")
        except Exception as e:
            print(f"\nError saving to CSV: {e}")
    
    Self-Check: If you open titanic_enhanced.csv (e.g., in a text editor or spreadsheet program), does it contain the is_child column? Is the Pandas integer index (0, 1, 2...) absent?
  4. Save to JSON: Save the modified titanic_df to a JSON file named titanic_enhanced.json. Use the records orientation and add indentation for readability.
    output_json_path = 'titanic_enhanced.json'
    if 'titanic_df' in locals():
        try:
            titanic_df.to_json(output_json_path, orient='records', indent=4)
            print(f"\nDataFrame successfully saved to '{output_json_path}' (orient='records', indented).")
        except Exception as e:
            print(f"\nError saving to JSON: {e}")
    
    Self-Check: Open titanic_enhanced.json. Is it a list [...] where each element is a JSON object {...} representing a passenger? Does each object contain the is_child key-value pair?
  5. (Optional) Clean up: Remove the generated files if you don't need them.
    import os
    if os.path.exists(output_csv_path):
        try: os.remove(output_csv_path); print(f"\nRemoved '{output_csv_path}'")
        except Exception as e: print(f"Could not remove {output_csv_path}: {e}")
    if os.path.exists(output_json_path):
        try: os.remove(output_json_path); print(f"Removed '{output_json_path}'")
        except Exception as e: print(f"Could not remove {output_json_path}: {e}")
    

This workshop demonstrated a common workflow: load data, perform a transformation (creating a new feature), and save the result in different standard formats using Pandas' to_csv and to_json methods with relevant options.

4. Data Cleaning and Preparation

Raw data is often messy. It might contain missing values, incorrect data types, duplicate entries, or inconsistent formatting. Data cleaning (also called data cleansing or data scrubbing) is the process of detecting and correcting (or removing) corrupt or inaccurate records from a dataset. It's a crucial and often time-consuming step in the data analysis pipeline. Pandas provides excellent tools for this.

Handling Missing Data (NaN)

Missing data is represented in Pandas primarily by the special floating-point value NaN (Not a Number). Python's built-in None object is also treated as missing data in many contexts, especially in columns with object dtype. Identifying and handling NaNs is fundamental.
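A small illustration of that point: placing None into a numeric Series upcasts it to float64 and stores NaN, while an object-dtype Series keeps None as-is; both show up as missing. This is a minimal sketch with made-up values.

import pandas as pd

s_numeric = pd.Series([1, None, 3])     # None is converted to NaN; dtype becomes float64
s_object = pd.Series(['a', None, 'c'])  # object dtype keeps None as the stored value

print(s_numeric.dtype, s_object.dtype)  # float64 object
print(s_numeric.isna())                 # True at position 1
print(s_object.isna())                  # True at position 1 as well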

Detecting Missing Data:

  • isnull() or isna(): These methods return a boolean DataFrame (or Series) of the same shape as the original, where True indicates a missing value (NaN or None) and False indicates a non-missing value.
  • notnull() or notna(): The opposite of isnull(), returning True for non-missing values.
  • Combining with .sum(): Calling .sum() on the boolean DataFrame/Series returned by isnull() counts the number of missing values per column (or per row if axis=1).
import pandas as pd
import numpy as np

data = {'col1': [1, 2, np.nan, 4, 5],
        'col2': ['A', 'B', 'C', np.nan, 'E'],
        'col3': [True, False, True, True, np.nan],
        'col4': [np.nan, np.nan, np.nan, np.nan, np.nan],
        'col5': [10, 20, 30, 40, 50]}
df_missing = pd.DataFrame(data)

print("DataFrame with Missing Data:")
print(df_missing)

print("\nDetecting NaN (isnull()):")
print(df_missing.isnull())

print("\nDetecting Non-NaN (notnull()):")
print(df_missing.notnull())

print("\nCounting NaN per column:")
print(df_missing.isnull().sum()) # Default axis=0

print("\nCounting NaN per row:")
print(df_missing.isnull().sum(axis=1))

print("\nTotal number of NaN values:")
print(df_missing.isnull().sum().sum())

Handling Missing Data:

There are two main strategies for dealing with missing data:

  1. Dropping: Remove rows or columns containing NaN values.

    • dropna(axis=0, how='any', thresh=None, subset=None, inplace=False): Drops rows (default axis=0) or columns (axis=1) containing missing values.
      • how='any' (default): Drop the row/column if any NaN values are present.
      • how='all': Drop the row/column only if all values are NaN.
      • thresh=N: Keep rows/columns that have at least N non-NaN values.
      • subset=[col_list]: Only consider NaNs in specific columns when deciding whether to drop rows.
  2. Filling (Imputation): Replace NaN values with some other value.

    • fillna(value, method=None, axis=None, inplace=False, limit=None): Fills NaN values.
      • value: Scalar value (e.g., 0, 'Unknown'), dictionary/Series (to fill different values per column), or DataFrame.
      • method='ffill' (forward fill): Propagate the last valid observation forward to fill the gap.
      • method='bfill' (backward fill): Use the next valid observation to fill the gap.
      • limit: If using ffill or bfill, the maximum number of consecutive NaNs to fill.
      • Note: recent Pandas versions deprecate fillna(method=...) in favor of the dedicated .ffill() and .bfill() methods, which accept the same limit argument.

Example Dropping:

print("\n--- Dropping NaN ---")
print("Original DataFrame:\n", df_missing)

# Drop rows with ANY NaN
df_dropped_rows_any = df_missing.dropna() # axis=0, how='any' are defaults
print("\nDataFrame after dropping rows with ANY NaN:\n", df_dropped_rows_any)

# Drop columns with ANY NaN
df_dropped_cols_any = df_missing.dropna(axis=1)
print("\nDataFrame after dropping columns with ANY NaN:\n", df_dropped_cols_any)

# Drop columns only if ALL values are NaN
df_dropped_cols_all = df_missing.dropna(axis=1, how='all')
print("\nDataFrame after dropping columns with ALL NaN:\n", df_dropped_cols_all)

# Drop rows where NaN appears specifically in 'col2' or 'col3'
df_dropped_subset = df_missing.dropna(subset=['col2', 'col3'])
print("\nDataFrame after dropping rows with NaN in 'col2' or 'col3':\n", df_dropped_subset)

# Keep rows that have at least 4 non-NaN values
df_dropped_thresh = df_missing.dropna(thresh=4)
print("\nDataFrame after keeping rows with at least 4 non-NaNs:\n", df_dropped_thresh)

Example Filling (Imputation):

print("\n--- Filling NaN ---")
print("Original DataFrame:\n", df_missing)

# Fill all NaN with 0 (might not make sense for all columns!)
df_filled_zero = df_missing.fillna(0)
print("\nDataFrame after filling all NaN with 0:\n", df_filled_zero)
# Note: col2 (string) and col3 (bool) are now mixed types (object)

# Fill NaN with different values per column
fill_values = {'col1': df_missing['col1'].mean(), # Fill numeric with mean
               'col2': 'Unknown',                # Fill string with placeholder
               'col3': False,                   # Fill boolean with False
               'col4': 0}                       # Fill numeric with 0
df_filled_specific = df_missing.fillna(value=fill_values)
print("\nDataFrame after filling NaN with specific values per column:\n", df_filled_specific)
print(df_filled_specific.dtypes) # Check types

# Forward fill (propagate last valid observation)
df_ffill = df_missing.fillna(method='ffill')
print("\nDataFrame after forward fill (ffill):\n", df_ffill)

# Backward fill (propagate next valid observation)
df_bfill = df_missing.fillna(method='bfill')
print("\nDataFrame after backward fill (bfill):\n", df_bfill)

# Forward fill with a limit
df_ffill_limit = df_missing.fillna(method='ffill', limit=1) # Fill max 1 consecutive NaN
print("\nDataFrame after forward fill (limit=1):\n", df_ffill_limit)

The choice between dropping and filling depends heavily on the context:

  • How much data is missing? Dropping rows might discard too much information if many rows have sporadic NaNs. Dropping columns might remove important features.
  • Why is the data missing? (Missing Completely At Random, Missing At Random, Missing Not At Random). This influences whether simple imputation methods (like mean/median/mode) are appropriate.
  • What is the goal of the analysis? Some models handle NaNs, while others require complete data.

Imputation is often preferred to preserve data, but the imputation method must be chosen carefully (mean, median, mode, regression imputation, etc.).
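For example, rather than filling with one global statistic, the fill value can be computed per group (such as the median within each category), which often preserves more of the data's structure. A minimal, self-contained sketch with made-up values:

import pandas as pd
import numpy as np

df_grp = pd.DataFrame({'group': ['A', 'A', 'A', 'B', 'B', 'B'],
                       'value': [10.0, np.nan, 14.0, 100.0, 120.0, np.nan]})

# Fill each missing 'value' with the median of its own group (A -> 12.0, B -> 110.0)
group_medians = df_grp.groupby('group')['value'].transform('median')
df_grp['value_imputed'] = df_grp['value'].fillna(group_medians)
print(df_grp)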

Workshop Identifying and Handling Missing Values

Goal: Identify and handle missing values in the 'age' and 'embarked' columns of the Titanic dataset.

Scenario: You've loaded the Titanic dataset and noticed missing values using .info(). You need to decide on and implement strategies to handle these missing values before further analysis.

Steps:

  1. Import Pandas and Load Data: Reload the Titanic dataset.
    import pandas as pd
    import numpy as np
    
    print("Pandas and NumPy imported.")
    
    titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
    try:
        titanic_df = pd.read_csv(titanic_url)
        print("\nTitanic dataset loaded successfully!")
        print(f"Original shape: {titanic_df.shape}")
    except Exception as e:
        print(f"\nError loading dataset: {e}")
        # Placeholder with NaNs
        titanic_df = pd.DataFrame({
            'survived': [0, 1, 1, 0, 1, 0],
            'pclass': [3, 1, 3, 1, 3, 3],
            'sex': ['male', 'female', 'female', 'male', 'female', 'male'],
            'age': [22.0, 38.0, np.nan, 54.0, 4.0, np.nan],
            'fare': [7.25, 71.2833, 7.925, 51.8625, 16.7, 8.4583],
            'embarked': ['S', 'C', 'S', 'S', 'S', 'Q'],
            'deck': [np.nan, 'C', np.nan, 'E', np.nan, np.nan] # Add another column with many NaNs
        })
        print("\nUsing a placeholder DataFrame with NaNs.")
        print(f"Placeholder shape: {titanic_df.shape}")
    
  2. Identify Missing Values:
    • Use .isnull().sum() to count missing values in each column.
    • Calculate the percentage of missing values for each column.
      if 'titanic_df' in locals():
          print("\n--- Missing Value Counts ---")
          missing_counts = titanic_df.isnull().sum()
          print(missing_counts[missing_counts > 0]) # Show only columns with missing values
      
          print("\n--- Missing Value Percentages ---")
          total_rows = len(titanic_df)
          missing_percentages = (missing_counts / total_rows) * 100
          print(missing_percentages[missing_percentages > 0].round(2))
      
      Self-Check: Identify the columns with the most missing values (e.g., 'deck', 'age') and those with just a few ('embarked').
  3. Strategy for 'deck': The 'deck' column has a very high percentage of missing values. For many analyses, imputing this might be unreliable or introduce noise. A common strategy is to drop such columns. Drop the 'deck' column.
    if 'titanic_df' in locals() and 'deck' in titanic_df.columns:
        titanic_df.drop('deck', axis=1, inplace=True)
        print("\n'deck' column dropped.")
        print(f"New shape: {titanic_df.shape}")
        # Verify it's gone
        print("\nMissing counts after dropping 'deck':")
        print(titanic_df.isnull().sum()[titanic_df.isnull().sum() > 0])
    elif 'titanic_df' in locals():
         print("\n'deck' column not found or already dropped.")
    
  4. Strategy for 'age': The 'age' column has a significant number of missing values (around 20%), but age is likely an important predictor (e.g., for survival). Dropping rows would lose too much data. Imputation is a better choice. A common simple strategy is to fill with the median age (less sensitive to outliers than the mean).
    • Calculate the median age.
    • Fill the NaN values in the 'age' column with the calculated median.
      if 'titanic_df' in locals() and 'age' in titanic_df.columns:
          median_age = titanic_df['age'].median()
          print(f"\nMedian age: {median_age:.2f}")
      
          # Assign the result back instead of chained fillna(..., inplace=True),
          # which newer Pandas versions warn about
          titanic_df['age'] = titanic_df['age'].fillna(median_age)
          print("'age' column NaNs filled with median value.")
      
          # Verify NaNs are gone in 'age'
          print("\nMissing counts after filling 'age':")
          print(titanic_df.isnull().sum()[titanic_df.isnull().sum() > 0])
          print(f"Age column now has {titanic_df['age'].isnull().sum()} NaN values.")
      elif 'titanic_df' in locals():
           print("\n'age' column not found.")
      
  5. Strategy for 'embarked': The 'embarked' column has very few missing values. This is a categorical feature. Dropping the few rows might be acceptable, but imputation with the mode (most frequent value) is also common and preserves the data.
    • Calculate the mode of the 'embarked' column. (Note: .mode() returns a Series, as there might be multiple modes; usually, we take the first one using .iloc[0]).
    • Fill the NaN values in 'embarked' with the calculated mode.
      if 'titanic_df' in locals() and 'embarked' in titanic_df.columns:
          embarked_mode = titanic_df['embarked'].mode()[0] # Get the first mode
          print(f"\nMost frequent embarkation port (mode): {embarked_mode}")
      
          titanic_df['embarked'] = titanic_df['embarked'].fillna(embarked_mode)
          print("'embarked' column NaNs filled with mode value.")
      
          # Verify NaNs are gone
          print("\nMissing counts after filling 'embarked':")
          final_missing = titanic_df.isnull().sum()
          print(final_missing[final_missing > 0])
          if final_missing.sum() == 0:
              print("All handled NaNs have been addressed.")
          else:
              print("Some NaNs might remain in other columns.")
      
      elif 'titanic_df' in locals():
          print("\n'embarked' column not found.")
      
      # Final check with info
      if 'titanic_df' in locals():
          print("\n--- Final DataFrame Info ---")
          titanic_df.info()
      
      Self-Check: Does .info() now show that 'age' and 'embarked' have the full count of non-null values? Was 'deck' successfully removed?

This workshop guided you through identifying missing data proportions and applying common handling strategies: dropping a column with too many NaNs, imputing a numerical column with the median, and imputing a categorical column with the mode. These are essential data cleaning techniques.

Data Transformation (Mapping, Replacing, Applying Functions)

Often, data isn't in the right format or requires modification for analysis. Pandas provides several methods for transforming data within Series or DataFrames.

  • map(arg, na_action=None) (Series method): Substitutes each value in a Series with another value. Primarily used with a dictionary or a function. When a dictionary is used, any value not found in it is mapped to NaN; na_action='ignore' makes map pass NaN inputs through without applying a mapping function.
  • replace(to_replace, value, inplace=False, limit=None, regex=False, method='pad') (Series/DataFrame method): Replaces specified values (to_replace) with other values (value). More flexible than map as it can replace multiple different values simultaneously and doesn't automatically turn non-replaced values into NaN. Can work with lists, dicts, regex.
  • apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs) (DataFrame/Series method): Applies a function along an axis of the DataFrame (row-wise or column-wise) or element-wise on a Series. Very powerful for custom transformations.
  • applymap(func, na_action=None) (DataFrame method - use with caution, often slower): Applies a function element-wise to every element of a DataFrame. Vectorized operations or apply are usually more efficient. In Pandas 2.1+ it is deprecated in favor of DataFrame.map; on older versions, prefer Series.map on individual columns or a well-structured apply.
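To illustrate the last point, element-wise transformation of an entire DataFrame can be written with applymap on older Pandas versions or with DataFrame.map on Pandas 2.1 and later. This is a minimal sketch; the hasattr check is only a defensive way to support both versions.

import pandas as pd

df_numbers = pd.DataFrame({'a': [1.234, 5.678], 'b': [9.1011, 2.345]})

# Element-wise formatting of every cell
if hasattr(df_numbers, 'map'):           # Pandas >= 2.1
    df_formatted = df_numbers.map(lambda x: f"{x:.2f}")
else:                                    # older Pandas
    df_formatted = df_numbers.applymap(lambda x: f"{x:.2f}")
print(df_formatted)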

Example using map:

Convert numerical codes to meaningful labels in a Series.

import pandas as pd
import numpy as np

s = pd.Series([1, 2, 1, 3, 2, 1, 4])
print("Original Series:\n", s)

# Mapping dictionary
status_map = {1: 'Active', 2: 'Inactive', 3: 'Pending'}

# Apply map - values not in map (e.g., 4) become NaN
s_mapped = s.map(status_map)
print("\nSeries after mapping (value 4 becomes NaN):\n", s_mapped)

# Map with a function (e.g., add prefix)
s_mapped_func = s.map(lambda x: f"Code_{x}")
print("\nSeries after mapping with lambda function:\n", s_mapped_func)

Example using replace:

Clean up inconsistent category names or specific values.

df = pd.DataFrame({'Category': ['A', 'B', 'C', 'A', 'B_extra', 'C', 'D', 'a'],
                   'Value': [10, 20, 15, 12, 25, 18, 30, 9]})
print("\nOriginal DataFrame:\n", df)

# Replace a single value
df_replaced_single = df.replace('B_extra', 'B')
print("\nReplace 'B_extra' with 'B':\n", df_replaced_single)

# Replace multiple values with a single value
df_replaced_multiple = df.replace(['A', 'a'], 'Type_A')
print("\nReplace 'A' and 'a' with 'Type_A':\n", df_replaced_multiple)

# Replace using a dictionary (different replacements for different values)
replace_dict = {'A': 'Group Alpha', 'a': 'Group Alpha', 'B': 'Group Beta', 'C': 'Group Gamma'}
df_replaced_dict = df.replace({'Category': replace_dict}) # Specify column for dict keys
print("\nReplace using dictionary mapping:\n", df_replaced_dict)

# Replace numeric value
df_replaced_num = df.replace({'Value': {10: 100}}) # Replace 10 with 100 in 'Value' col
print("\nReplace numeric value 10 with 100 in 'Value':\n", df_replaced_num)

Example using apply:

Perform row-wise or column-wise calculations.

df_scores = pd.DataFrame({'Quiz1': [8, 7, 9, 6],
                          'Quiz2': [10, 8, 7, 9],
                          'Homework': [18, 20, 15, 19]})
print("\nScores DataFrame:\n", df_scores)

# Apply function to each column (axis=0, default) - e.g., calculate range
def range_fn(x): # x is a Series (a column in this case)
    return x.max() - x.min()

col_ranges = df_scores.apply(range_fn) # or df_scores.apply(range_fn, axis=0)
print("\nRange (max-min) for each column:\n", col_ranges)

# Apply function to each row (axis=1) - e.g., calculate total score
def total_score(row): # row is a Series (a row in this case)
    return row['Quiz1'] + row['Quiz2'] + row['Homework']

df_scores['Total'] = df_scores.apply(total_score, axis=1)
print("\nDataFrame with Total score calculated row-wise:\n", df_scores)

# Apply with lambda for simpler operations (e.g., normalize homework score out of 20)
df_scores['Homework_Perc'] = df_scores['Homework'].apply(lambda x: (x / 20) * 100)
print("\nDataFrame with Homework Percentage:\n", df_scores)

These transformation methods (map, replace, apply) are fundamental tools for reshaping your data into the form needed for analysis or modeling. Choose the method that best suits the specific transformation task.

Workshop Cleaning and Transforming Customer Data

Goal: Clean and transform a hypothetical customer dataset, including standardizing categorical values and creating new features using transformation functions.

Scenario: You have a small dataset of customer information with inconsistencies in gender notation and want to calculate the age from the date of birth.

Steps:

  1. Import Pandas and Create Data:
    import pandas as pd
    from datetime import datetime # To calculate age
    
    # Create sample customer data
    data = {
        'CustomerID': [1001, 1002, 1003, 1004, 1005, 1006],
        'Gender': ['M', 'Female', 'male', 'F', 'FEM', 'Male'],
        'DateOfBirth': ['1990-05-15', '1985-11-22', '1998-02-01', '2001-07-30', '1985-03-10', '1995-09-05'],
        'City': ['New York', 'London', 'Paris', 'London', 'New York', 'Tokyo']
    }
    customer_df = pd.DataFrame(data)
    
    print("Original Customer DataFrame:")
    print(customer_df)
    customer_df.info() # Note DateOfBirth is object type initially
    
  2. Standardize 'Gender' Column: The 'Gender' column has inconsistent values ('M', 'Female', 'male', etc.). Standardize them to 'Male' and 'Female'.
    • Method 1: Using replace with a dictionary: Create a mapping for all variations.
    • Method 2: Using map after converting to a consistent case (e.g., lower): Convert all to lowercase, then map 'm'/'male' and 'f'/'female'. (Let's use replace for simplicity here; a sketch of the lowercase-then-map alternative appears after the workshop summary below.)
      gender_replace_map = {
          'M': 'Male',
          'male': 'Male',
          'F': 'Female',
          'Female': 'Female', # Explicitly map correct ones too, or use more targeted replace
          'FEM': 'Female'
      }
      
      # Apply replace to the 'Gender' column
      customer_df['Gender'] = customer_df['Gender'].replace(gender_replace_map)
      
      print("\nDataFrame after standardizing Gender:")
      print(customer_df)
      print("\nValue Counts for standardized Gender:")
      print(customer_df['Gender'].value_counts())
      
      Self-Check: Are all values in the 'Gender' column now either 'Male' or 'Female'?
  3. Convert 'DateOfBirth' to Datetime: The 'DateOfBirth' column is currently stored as strings (object). Convert it to proper datetime objects using pd.to_datetime() for calculations.
    customer_df['DateOfBirth'] = pd.to_datetime(customer_df['DateOfBirth'])
    
    print("\nDataFrame after converting DateOfBirth to datetime:")
    customer_df.info() # Verify DateOfBirth dtype is now datetime64[ns]
    print(customer_df.head())
    
  4. Calculate Age: Create a new 'Age' column by calculating the difference between a reference date (e.g., today) and the 'DateOfBirth'.
    • Get today's date.
    • Define a function that takes a birthdate and returns the age in years.
    • Use .apply() on the 'DateOfBirth' column with this function.
      today = pd.to_datetime('today') # Use pandas' to_datetime for consistency
      # Or: today = datetime.now()
      
      def calculate_age(born):
          # Calculate age based on year difference, adjust if birthday hasn't passed this year
          age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
          return age
      
      # Apply the function to the 'DateOfBirth' column
      customer_df['Age'] = customer_df['DateOfBirth'].apply(calculate_age)
      
      print("\nDataFrame with Age column calculated:")
      print(customer_df)
      
      Self-Check: Does the 'Age' column contain plausible integer ages based on the 'DateOfBirth' values?
  5. Categorize City (using map): Create a 'Region' column based on the 'City'. Map 'New York' to 'North America', 'London' and 'Paris' to 'Europe', and 'Tokyo' to 'Asia'.
    city_to_region_map = {
        'New York': 'North America',
        'London': 'Europe',
        'Paris': 'Europe',
        'Tokyo': 'Asia'
    }
    
    customer_df['Region'] = customer_df['City'].map(city_to_region_map)
    
    print("\nDataFrame with Region column mapped from City:")
    print(customer_df)
    
    Self-Check: Does the 'Region' column correctly reflect the mapping based on the 'City'?

This workshop involved common data cleaning and transformation tasks: standardizing categorical data using replace, converting data types using pd.to_datetime, creating a new feature (Age) using a custom function with apply, and mapping values to create categories using map.
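As noted in step 2, the same standardization can be achieved by lowercasing the raw values first and then mapping the normalized codes to canonical labels. A minimal sketch of that alternative, using a fresh copy of the raw values so the already-cleaned column is left untouched:

import pandas as pd

raw_gender = pd.Series(['M', 'Female', 'male', 'F', 'FEM', 'Male'])
gender_map = {'m': 'Male', 'male': 'Male', 'f': 'Female', 'female': 'Female', 'fem': 'Female'}

# Normalize case first, then map the normalized codes to canonical labels
standardized_gender = raw_gender.str.lower().map(gender_map)
print(standardized_gender.value_counts())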

Dealing with Duplicates

Duplicate rows can skew analysis results (e.g., inflating counts or averages). Pandas provides easy ways to identify and remove duplicate rows.

  • duplicated(subset=None, keep='first') (DataFrame/Series method): Returns a boolean Series indicating which rows are duplicates.
    • subset: List of column labels to consider when identifying duplicates. By default, uses all columns.
    • keep='first' (default): Marks duplicates as True except for the first occurrence.
    • keep='last': Marks duplicates as True except for the last occurrence.
    • keep=False: Marks all duplicates as True.
  • drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False) (DataFrame/Series method): Returns a DataFrame/Series with duplicate rows removed.
    • Parameters subset and keep work the same way as in duplicated().
    • inplace=False (default): Returns a new DataFrame with duplicates removed. Set to True to modify the original DataFrame.
    • ignore_index=False (default): If True, the resulting index will be reset to 0, 1, ..., n-1.

Example:

import pandas as pd

data = {'colA': ['X', 'Y', 'X', 'Y', 'Z', 'X'],
        'colB': [1, 2, 1, 2, 3, 1],
        'colC': [100, 200, 100, 300, 400, 100]}
df_duplicates = pd.DataFrame(data)
print("Original DataFrame with Duplicates:")
print(df_duplicates)

# Identify duplicate rows (considering all columns)
print("\nIdentifying duplicate rows (keep='first'):")
print(df_duplicates.duplicated())
# Output: Rows 2 and 5 are duplicates of row 0

print("\nIdentifying duplicate rows (keep='last'):")
print(df_duplicates.duplicated(keep='last'))
# Output: Rows 0 and 2 are duplicates of row 5

print("\nIdentifying all duplicate rows (keep=False):")
print(df_duplicates.duplicated(keep=False))
# Output: Rows 0, 2, 5 are marked as part of a duplicate set

# Identify duplicates based on a subset of columns ('colA', 'colB')
print("\nIdentifying duplicates based on 'colA', 'colB' (keep='first'):")
print(df_duplicates.duplicated(subset=['colA', 'colB']))
# Output: Rows 2 and 5 are duplicates of row 0 based on ('X', 1)
#         Row 3 is a duplicate of row 1 based on ('Y', 2)

print("\n--- Removing Duplicates ---")

# Drop duplicate rows (keeping the first occurrence)
df_no_duplicates_first = df_duplicates.drop_duplicates() # keep='first' is default
print("\nDataFrame after dropping duplicates (keep='first'):")
print(df_no_duplicates_first)

# Drop duplicate rows (keeping the last occurrence)
df_no_duplicates_last = df_duplicates.drop_duplicates(keep='last')
print("\nDataFrame after dropping duplicates (keep='last'):")
print(df_no_duplicates_last)

# Drop duplicates based on subset ('colA', 'colB'), keep first
df_no_duplicates_subset = df_duplicates.drop_duplicates(subset=['colA', 'colB'], keep='first')
print("\nDataFrame after dropping duplicates based on 'colA','colB' (keep='first'):")
print(df_no_duplicates_subset)

Deciding which columns define a "duplicate" (subset) and which occurrence to keep (keep) depends on the specific meaning of your data and the goals of your analysis. Always check the result carefully after dropping duplicates.
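A common pattern that follows from this choice: when each key can legitimately appear several times and only its most recent record should be kept, sort by the timestamp first and then drop duplicates keeping the last occurrence. A minimal, self-contained sketch with made-up data:

import pandas as pd

events = pd.DataFrame({'user': ['u1', 'u2', 'u1', 'u2'],
                       'status': ['new', 'new', 'active', 'churned'],
                       'updated_at': pd.to_datetime(['2023-01-01', '2023-01-02',
                                                     '2023-02-01', '2023-03-05'])})

# Keep only the latest status per user
latest_per_user = (events.sort_values('updated_at')
                         .drop_duplicates(subset='user', keep='last'))
print(latest_per_user)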

Workshop Finding and Removing Duplicate Records

Goal: Identify and remove duplicate entries from a simulated order dataset.

Scenario: You have a dataset representing online orders, but due to a system glitch, some orders were recorded multiple times. You need to clean the dataset by removing these exact duplicates. You also want to identify orders that might be duplicates based only on customer and product, ignoring the timestamp.

Steps:

  1. Import Pandas and Create Data:
    import pandas as pd
    
    # Create sample order data with duplicates
    order_data = {
        'OrderID': [101, 102, 103, 101, 104, 105, 102, 106, 101],
        'CustomerID': ['CUST-A', 'CUST-B', 'CUST-A', 'CUST-A', 'CUST-C', 'CUST-B', 'CUST-B', 'CUST-D', 'CUST-A'],
        'Product': ['Widget', 'Gadget', 'Widget', 'Widget', 'Thingamajig', 'Gadget', 'Gadget', 'Widget', 'Widget'],
        'Quantity': [2, 1, 3, 2, 5, 2, 1, 1, 2],
        'Timestamp': pd.to_datetime(['2023-10-26 10:00:00', '2023-10-26 10:05:00',
                                     '2023-10-26 10:08:00', '2023-10-26 10:00:00', # Exact duplicate of row 0
                                     '2023-10-26 10:15:00', '2023-10-26 11:00:00',
                                     '2023-10-26 10:05:00', # Exact duplicate of row 1
                                     '2023-10-26 11:30:00',
                                     '2023-10-26 10:00:00']) # Exact duplicate of row 0 & 3
    }
    orders_df = pd.DataFrame(order_data)
    
    print("Original Orders DataFrame:")
    print(orders_df)
    print(f"\nOriginal number of rows: {len(orders_df)}")
    
  2. Identify Exact Duplicates: Find rows that are completely identical across all columns. Use duplicated() with default settings.
    exact_duplicates_bool = orders_df.duplicated()
    print("\nBoolean Series indicating exact duplicates (keeping first):")
    print(exact_duplicates_bool)
    
    # Show the actual duplicate rows
    exact_duplicate_rows = orders_df[exact_duplicates_bool]
    print("\nThe exact duplicate rows are:")
    print(exact_duplicate_rows)
    
    Self-Check: Rows 3, 6, and 8 should be identified as True (duplicates of rows 0, 1, and 0 respectively).
  3. Remove Exact Duplicates: Create a new DataFrame orders_cleaned_df by dropping the identified exact duplicates, keeping the first occurrence.
    orders_cleaned_df = orders_df.drop_duplicates(keep='first')
    
    print("\nDataFrame after removing exact duplicates (keeping first):")
    print(orders_cleaned_df)
    print(f"\nNumber of rows after removing exact duplicates: {len(orders_cleaned_df)}")
    
    Self-Check: Does orders_cleaned_df have 6 rows (original 9 - 3 duplicates)? Are rows with index 3, 6, and 8 gone?
  4. Identify Duplicates Based on Subset: Now, identify rows that are duplicates based only on 'CustomerID' and 'Product', ignoring 'OrderID', 'Quantity', and 'Timestamp'. This might indicate multiple orders of the same product by the same customer, which may or may not be errors depending on context.
    subset_duplicates_bool = orders_df.duplicated(subset=['CustomerID', 'Product'], keep=False) # Keep=False marks ALL occurrences
    print("\nBoolean Series indicating duplicates based on CustomerID and Product (marking all):")
    print(subset_duplicates_bool)
    
    # Show all rows involved in a subset duplication
    subset_duplicate_rows = orders_df[subset_duplicates_bool]
    print("\nRows involved in duplication based on CustomerID and Product:")
    # Sort for easier comparison
    print(subset_duplicate_rows.sort_values(by=['CustomerID', 'Product']))
    
    Self-Check: Rows where ('CUST-A', 'Widget') appears (indices 0, 2, 3, 8) and where ('CUST-B', 'Gadget') appears (indices 1, 5, 6) should all be marked True.
  5. Remove Subset Duplicates (Optional/Contextual): If the goal was to keep only the first order instance of a specific product by a customer, you could use drop_duplicates with the subset.
    # Example: Keep only the first time each customer ordered each product
    orders_first_instance_df = orders_df.drop_duplicates(subset=['CustomerID', 'Product'], keep='first')
    
    print("\nDataFrame keeping only the first instance (based on CustomerID, Product):")
    print(orders_first_instance_df)
    print(f"\nNumber of rows after keeping first instance: {len(orders_first_instance_df)}")
    
    Self-Check: Does this result have only one row for ('CUST-A', 'Widget') and one for ('CUST-B', 'Gadget')? The specific row kept depends on the keep parameter and original order.

This workshop demonstrated how to use duplicated() and drop_duplicates() both for finding exact row duplicates and for finding duplicates based on a subset of columns, allowing for different data cleaning objectives.


Advanced Level

This level delves into more sophisticated Pandas and NumPy functionalities, enabling complex data manipulations, aggregations, and analyses crucial for deeper insights and specialized tasks like time series analysis.

5. Advanced Data Selection and Indexing

Beyond basic slicing and label/integer-based access ([], .loc, .iloc), Pandas offers more advanced indexing techniques, including hierarchical indexing (MultiIndex) and powerful boolean masking, allowing for intricate data selection and manipulation.

Hierarchical Indexing (MultiIndex)

Hierarchical indexing, or MultiIndex, allows you to have multiple index levels on an axis (rows or columns). It's a way to represent higher-dimensional data in a lower-dimensional structure like a Series (1D) or DataFrame (2D). Think of it as creating groups within groups.

Creating a MultiIndex:

You can create a MultiIndex explicitly using pd.MultiIndex.from_arrays, pd.MultiIndex.from_tuples, or pd.MultiIndex.from_product, or implicitly by providing a list of arrays/tuples to the index or columns argument when creating a Series or DataFrame.

import pandas as pd
import numpy as np

# Method 1: Using from_arrays
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
multi_index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
print("MultiIndex from arrays:\n", multi_index)

# Create a Series with this MultiIndex
s_multi = pd.Series(np.random.randn(8), index=multi_index)
print("\nSeries with MultiIndex:")
print(s_multi)

print("-" * 30)

# Method 2: Implicitly during DataFrame creation
data = np.random.randn(4, 3)
df_multi_cols = pd.DataFrame(data,
                           columns=[['GroupA', 'GroupA', 'GroupB'],
                                    ['Metric1', 'Metric2', 'Metric1']],
                           index=['Row1', 'Row2', 'Row3', 'Row4'])
df_multi_cols.columns.names = ['BroadGroup', 'SpecificMetric']
print("\nDataFrame with MultiIndex Columns:")
print(df_multi_cols)

print("-" * 30)

# Method 3: Implicitly with index during DataFrame creation
df_multi_rows = pd.DataFrame(np.random.randn(6, 2),
                           index=[['Type1', 'Type1', 'Type1', 'Type2', 'Type2', 'Type2'],
                                  ['SubA', 'SubB', 'SubC', 'SubA', 'SubB', 'SubC']],
                           columns=['Value1', 'Value2'])
df_multi_rows.index.names = ['MainType', 'SubType']
print("\nDataFrame with MultiIndex Rows:")
print(df_multi_rows)

Indexing with MultiIndex:

Accessing data in objects with MultiIndex requires careful use of .loc, .iloc, and potentially slicing.

  • Partial Indexing: You can select data by specifying only the outer level(s) of the index.
  • Tuple-based Indexing: Provide a tuple to .loc to select based on multiple levels.
  • Slicing: Slicing works, but slicing across levels requires careful syntax, often using slice(None) to select all items within intermediate levels. The pd.IndexSlice object provides a more readable way to slice MultiIndexes.
print("\n--- Indexing Series with MultiIndex ---")
print("Original Series (s_multi):\n", s_multi)

# Select outer level 'bar'
print("\nSelect outer level 'bar':\n", s_multi['bar'])
# Equivalent using .loc
# print("\nSelect outer level 'bar' (.loc):\n", s_multi.loc['bar'])

# Select specific combination using tuple with .loc
print("\nSelect ('baz', 'two'):", s_multi.loc[('baz', 'two')])

# Select multiple outer levels
print("\nSelect outer levels 'bar' and 'foo':\n", s_multi.loc[['bar', 'foo']])

# Slicing the outer level (works like regular index)
print("\nSlice outer level from 'baz' to 'foo':\n", s_multi['baz':'foo'])

# Slicing inner level for a specific outer level
print("\nSlice inner level from 'one' for outer 'bar':\n", s_multi.loc[('bar', 'one'):]) # Be careful with interpretation
print("\nSlice inner level == 'one' across applicable outer levels:\n", s_multi.loc[:, 'one']) # Slice notation: select all outer, inner 'one'

print("\n--- Indexing DataFrame with MultiIndex Rows ---")
print("Original DataFrame (df_multi_rows):\n", df_multi_rows)

# Select outer index level 'Type1'
print("\nSelect outer index 'Type1':\n", df_multi_rows.loc['Type1'])

# Select specific row ('Type2', 'SubA')
print("\nSelect row ('Type2', 'SubA'):\n", df_multi_rows.loc[('Type2', 'SubA')])

# Select specific cell ('Type1', 'SubB'), 'Value2'
print("\nSelect cell [('Type1', 'SubB'), 'Value2']:", df_multi_rows.loc[('Type1', 'SubB'), 'Value2'])

# Slice using IndexSlice
idx = pd.IndexSlice
# Select all 'SubA' subtypes across all MainTypes for 'Value1'
print("\nSelect all 'SubA' for 'Value1' using IndexSlice:\n", df_multi_rows.loc[idx[:, 'SubA'], 'Value1'])
# Select 'Type1' MainType, 'SubB' through 'SubC' SubTypes, for all columns
print("\nSelect 'Type1', 'SubB':'SubC' using IndexSlice:\n", df_multi_rows.loc[idx['Type1', 'SubB':'SubC'], :])

print("\n--- Indexing DataFrame with MultiIndex Columns ---")
print("Original DataFrame (df_multi_cols):\n", df_multi_cols)

# Select outer column level 'GroupA'
print("\nSelect outer column 'GroupA':\n", df_multi_cols['GroupA'])

# Select specific column ('GroupA', 'Metric2')
print("\nSelect column ('GroupA', 'Metric2'):\n", df_multi_cols[('GroupA', 'Metric2')]) # Direct [] works for columns

# Select cell 'Row2', ('GroupB', 'Metric1') using .loc
print("\nSelect cell ['Row2', ('GroupB', 'Metric1')]:", df_multi_cols.loc['Row2', ('GroupB', 'Metric1')])

# Select using IndexSlice for columns
# Select all rows for 'Metric1' across all BroadGroups
print("\nSelect all rows for 'Metric1' using IndexSlice:\n", df_multi_cols.loc[:, idx[:, 'Metric1']])

Stacking and Unstacking:

MultiIndex objects allow for easy reshaping between "long" and "wide" formats:

  • stack(level=-1, dropna=True): Moves a level from the column index to become the innermost level of the row index, potentially creating a Series if only one column level remains. Produces a "longer" format.
  • unstack(level=-1, fill_value=None): Moves a level from the row index to become the innermost level of the column index. Produces a "wider" format. fill_value can be used to replace NaNs introduced during unstacking.
print("\n--- Stacking and Unstacking ---")
print("Original DataFrame (df_multi_rows):\n", df_multi_rows)

# Unstack the inner row index level ('SubType')
df_unstacked = df_multi_rows.unstack()
print("\nDataFrame after unstacking 'SubType':\n", df_unstacked) # SubType becomes inner col level

# Unstack the outer row index level ('MainType')
df_unstacked_outer = df_multi_rows.unstack(level=0) # or level='MainType'
print("\nDataFrame after unstacking 'MainType':\n", df_unstacked_outer)

# Now, stack the innermost column level back to the row index
df_stacked_back = df_unstacked.stack()
print("\nDataFrame after stacking back the innermost column level:\n", df_stacked_back) # Should resemble original

# Stack based on DataFrame with multi-level columns (df_multi_cols)
print("\nOriginal DataFrame (df_multi_cols):\n", df_multi_cols)
df_stacked_cols = df_multi_cols.stack() # Stacks innermost col level ('SpecificMetric')
print("\nDataFrame after stacking 'SpecificMetric':\n", df_stacked_cols)
df_stacked_cols_outer = df_multi_cols.stack(level=0) # Stacks outer col level ('BroadGroup')
print("\nDataFrame after stacking 'BroadGroup':\n", df_stacked_cols_outer)

MultiIndex is a powerful feature for working with complex, structured datasets within the familiar DataFrame format. Mastering its creation, indexing, and reshaping capabilities is key for advanced data manipulation.
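One practical caveat, offered as a hedged aside: label-based slicing on a MultiIndex is only reliable when the index is lexically sorted, and the order of levels can be changed with swaplevel. A minimal sketch using the df_multi_rows frame from above:

# Sort the MultiIndex so label slices like .loc['Type1':'Type2'] behave predictably
df_sorted = df_multi_rows.sort_index()

# Swap the 'MainType' and 'SubType' levels, then re-sort for the new level order
df_swapped = df_multi_rows.swaplevel('MainType', 'SubType').sort_index()
print(df_swapped)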

Workshop Working with MultiIndex DataFrames

Goal: Create and manipulate a MultiIndex DataFrame representing quarterly sales data for different products across regions.

Scenario: You have sales data broken down by region, product category, and quarter. You need to structure this data using a MultiIndex and then practice selecting and summarizing parts of it.

Steps:

  1. Import Libraries and Create Data:
    import pandas as pd
    import numpy as np
    
    # Create data components
    regions = ['North', 'South', 'East', 'West']
    products = ['Electronics', 'Apparel', 'Home Goods']
    quarters = ['Q1', 'Q2', 'Q3', 'Q4']
    
    # Create a MultiIndex using from_product
    multi_index = pd.MultiIndex.from_product([regions, products], names=['Region', 'Product'])
    
    # Create random sales data (e.g., in thousands of dollars)
    np.random.seed(42) # for reproducibility
    sales_data = np.random.randint(50, 500, size=(len(multi_index), len(quarters))) # rows = region*product, cols = quarters
    
    # Create the DataFrame
    sales_df = pd.DataFrame(sales_data, index=multi_index, columns=quarters)
    
    print("Quarterly Sales Data with MultiIndex:")
    print(sales_df)
    
    Self-Check: Does the DataFrame have a two-level row index ('Region', 'Product') and a single-level column index ('Q1', 'Q2', 'Q3', 'Q4')?
  2. Select Data Using .loc:
    • Select all data for the 'North' region.
    • Select data for ('South', 'Apparel').
    • Select Q1 sales for all products in the 'East' region.
    • Select Q3 and Q4 sales for 'Electronics' across all regions. Use pd.IndexSlice.
      print("\n--- Data Selection ---")
      
      # Data for 'North' region
      north_sales = sales_df.loc['North']
      print("\nSales for North Region:\n", north_sales)
      
      # Data for ('South', 'Apparel')
      south_apparel_sales = sales_df.loc[('South', 'Apparel')]
      print("\nSales for South Region, Apparel Product:\n", south_apparel_sales) # This will be a Series
      
      # Q1 sales for 'East' region
      east_q1_sales = sales_df.loc['East', 'Q1']
      print("\nQ1 Sales for East Region:\n", east_q1_sales) # This will be a Series (Product -> Sales)
      
      # Q3/Q4 sales for 'Electronics' across all regions
      idx = pd.IndexSlice
      electronics_q3_q4 = sales_df.loc[idx[:, 'Electronics'], ['Q3', 'Q4']]
      print("\nQ3 & Q4 Sales for Electronics (All Regions):\n", electronics_q3_q4)
      
  3. Calculate Summary Statistics:
    • Calculate the total annual sales for each (Region, Product) combination (sum across columns).
    • Calculate the average sales per quarter for each Region (average across Products within a Region, then across columns).
    • Calculate the total sales for each Product category across all regions and quarters.
      print("\n--- Summary Statistics ---")
      
      # Total annual sales per (Region, Product)
      sales_df['Total'] = sales_df.sum(axis=1) # Sum across columns (Quarters)
      print("\nDataFrame with Total Annual Sales:\n", sales_df)
      
      # Average quarterly sales per Region (Group by Region level, then mean)
      # First, calculate mean quarterly sales for each row
      # Then, group by region and calculate the mean of those means
      # OR, more directly: group by level 0 and take the mean across all values in the group
      avg_quarterly_sales_per_region = sales_df[['Q1','Q2','Q3','Q4']].mean(axis=1).groupby(level='Region').mean()
      print("\nAverage Quarterly Sales per Region (approx):\n", avg_quarterly_sales_per_region)
      # Alternative: Calculate total sales per region per quarter, then average
      total_sales_per_region_quarter = sales_df[['Q1','Q2','Q3','Q4']].groupby(level='Region').sum()
      avg_sales_per_region = total_sales_per_region_quarter.mean(axis=1)
      print("\nAverage Total Sales per Region (across quarters):\n", avg_sales_per_region)
      
      
      # Total sales per Product category (group by Product level, sum everything)
      # We need to sum the 'Total' column grouped by 'Product'
      total_sales_per_product = sales_df['Total'].groupby(level='Product').sum()
      # Alternatively, sum all quarterly data grouped by product
      # total_sales_per_product = sales_df[['Q1','Q2','Q3','Q4']].sum(axis=1).groupby(level='Product').sum()
      print("\nTotal Sales per Product Category (All Regions, All Quarters):\n", total_sales_per_product)
      
      # Remove the 'Total' column if we want to continue with original shape
      sales_df = sales_df.drop('Total', axis=1)
      
  4. Reshape using unstack(): Make 'Product' a column level to easily compare products within each region side-by-side.
    print("\n--- Reshaping ---")
    sales_unstacked_product = sales_df.unstack(level='Product')
    print("\nDataFrame Unstacked by Product:\n", sales_unstacked_product)
    
    Self-Check: Does the resulting DataFrame have 'Region' as the row index and a MultiIndex for columns ('Product', 'Quarter')?
  5. Reshape using stack(): Take the sales_unstacked_product DataFrame and stack its column levels back into the row index, creating a Series with a three-level MultiIndex (Region, Quarter, Product).
    # After unstack(level='Product'), the column MultiIndex is (Quarter, Product),
    # so stacking both column levels moves them back into the row index
    sales_fully_stacked = sales_unstacked_product.stack(level=[0, 1])
    sales_fully_stacked.index.names = ['Region', 'Quarter', 'Product'] # Name all three levels
    
    print("\nSeries stacked back from the unstacked DataFrame (Region, Quarter, Product):")
    print(sales_fully_stacked.head(12)) # Show the first few rows of the long format
    
    # If stacking the original DF:
    sales_stacked_original = sales_df.stack()
    sales_stacked_original.index.names = ['Region', 'Product', 'Quarter']
    print("\nOriginal DataFrame stacked by Quarter:")
    print(sales_stacked_original.head(12)) # Show first few rows
    
    Self-Check: Does the resulting Series/DataFrame have three levels in its index? Is the data now in a "long" format where each row represents a single observation (Region, Product, Quarter -> Sales)?

This workshop provided practice in creating a MultiIndex DataFrame, selecting data using various .loc techniques (including pd.IndexSlice), calculating grouped summaries using groupby(level=...), and reshaping the data between wide and long formats using unstack() and stack().

Boolean Indexing and .loc/.iloc

Boolean indexing is a powerful technique where you filter data based on logical conditions. It allows for complex selections that go beyond simple label or positional indexing.

How it Works:

  1. You create a boolean Series (or array) whose index aligns with the DataFrame's index (or whose length matches the number of rows/columns). This boolean Series contains True for rows/columns you want to keep and False for those you want to discard.
  2. You pass this boolean Series inside the .loc[] accessor (or sometimes directly [], though .loc is safer and more explicit).

Creating Boolean Conditions:

You typically create boolean Series using comparison operators (==, !=, >, <, >=, <=) on DataFrame columns or Series. Multiple conditions can be combined using logical operators:

  • & (AND): Both conditions must be true.
  • | (OR): At least one condition must be true.
  • ~ (NOT): Inverts the boolean value.

Important: When combining multiple conditions, each condition must be enclosed in parentheses () due to Python's operator precedence rules.
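To see why the parentheses matter: & and | bind more tightly than comparison operators in Python, so an unparenthesized expression is grouped in a surprising way and typically fails. A tiny sketch with made-up data:

import pandas as pd

df_tiny = pd.DataFrame({'a': [1, 5, 10], 'b': [3, 3, 3]})

# Correct: each comparison wrapped in parentheses
print(df_tiny[(df_tiny['a'] > 2) & (df_tiny['b'] == 3)])

# Incorrect: df_tiny['a'] > 2 & df_tiny['b'] == 3 is parsed as
# df_tiny['a'] > (2 & df_tiny['b']) == 3 and raises
# "ValueError: The truth value of a Series is ambiguous."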

Examples:

Let's use the cleaned Titanic dataset from a previous workshop.

import pandas as pd
import numpy as np

# Recreate or load the cleaned Titanic data (age/embarked filled, deck dropped)
titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
try:
    titanic_df = pd.read_csv(titanic_url)
    # Apply cleaning steps from previous workshop
    if 'deck' in titanic_df.columns: titanic_df.drop('deck', axis=1, inplace=True)
    if 'age' in titanic_df.columns: titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())
    if 'embarked' in titanic_df.columns: titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])
    print("Cleaned Titanic dataset loaded/recreated.")
    print(f"Shape: {titanic_df.shape}")
    print("NaN counts:\n", titanic_df.isnull().sum())
except Exception as e:
    print(f"Error loading/cleaning dataset: {e}")
    # Fallback placeholder
    titanic_df = pd.DataFrame({
        'survived': [0, 1, 1, 0, 1, 0, 0, 1],
        'pclass': [3, 1, 3, 1, 3, 3, 1, 3],
        'sex': ['male', 'female', 'female', 'male', 'female', 'male', 'male', 'female'],
        'age': [22.0, 38.0, 26.0, 54.0, 4.0, 28.0, 30.0, 14.0], # Median filled approx 28
        'fare': [7.25, 71.28, 7.92, 51.86, 16.7, 8.45, 50.0, 11.24],
        'embarked': ['S', 'C', 'S', 'S', 'S', 'Q', 'S', 'S'] # Mode filled S
    })
    print("Using placeholder DataFrame.")

# Condition 1: Select passengers older than 60
cond_age_gt_60 = titanic_df['age'] > 60
print("\nBoolean Series for age > 60:\n", cond_age_gt_60.head())

# Apply the condition using .loc
passengers_over_60 = titanic_df.loc[cond_age_gt_60]
print("\nPassengers older than 60:\n", passengers_over_60[['age', 'pclass', 'sex']].head())

# Condition 2: Select female passengers in 1st class
cond_female = titanic_df['sex'] == 'female'
cond_pclass1 = titanic_df['pclass'] == 1

# Combine conditions using & (AND) - note parentheses!
cond_female_pclass1 = (cond_female) & (cond_pclass1)
print("\nBoolean Series for female AND pclass=1:\n", cond_female_pclass1.head())

female_pclass1_passengers = titanic_df.loc[cond_female_pclass1]
print("\nFemale passengers in 1st class:\n", female_pclass1_passengers[['sex', 'pclass', 'age', 'fare']].head())

# Condition 3: Select passengers who either paid more than $100 OR embarked at 'C'
cond_fare_gt_100 = titanic_df['fare'] > 100
cond_embarked_c = titanic_df['embarked'] == 'C'

# Combine conditions using | (OR) - note parentheses!
cond_highfare_or_cherbourg = (cond_fare_gt_100) | (cond_embarked_c)

highfare_or_cherbourg_passengers = titanic_df.loc[cond_highfare_or_cherbourg]
print(f"\nNumber of passengers with fare > 100 OR embarked at C: {len(highfare_or_cherbourg_passengers)}")
print("Sample:\n", highfare_or_cherbourg_passengers[['fare', 'embarked', 'pclass', 'age']].head())

# Condition 4: Select passengers NOT in 3rd class
cond_not_pclass3 = ~(titanic_df['pclass'] == 3) # Using ~ (NOT)
# Alternatively: cond_not_pclass3 = titanic_df['pclass'] != 3

passengers_not_pclass3 = titanic_df.loc[cond_not_pclass3]
print(f"\nNumber of passengers not in 3rd class: {len(passengers_not_pclass3)}")
print("Pclass value counts for this selection:\n", passengers_not_pclass3['pclass'].value_counts())

# Using boolean indexing to select specific COLUMNS with .loc
# Select 'age' and 'fare' columns for rows where 'survived' is 1
survived_passengers_age_fare = titanic_df.loc[titanic_df['survived'] == 1, ['age', 'fare']]
print("\nAge and Fare for survivors:\n", survived_passengers_age_fare.head())

# Using boolean indexing with .iloc (Less common, requires boolean array of correct length)
# Select rows 0, 2, 4 using a boolean mask
mask = np.array([True, False, True, False, True] + [False]*(len(titanic_df)-5)) # Create boolean array matching length
selected_rows_iloc = titanic_df.iloc[mask]
print("\nSelecting rows 0, 2, 4 using boolean mask with .iloc:\n", selected_rows_iloc)

Key Advantages of Boolean Indexing with .loc:

  • Readability: Conditions often express the filtering logic clearly.
  • Flexibility: Allows for complex, multi-condition selections based on data values.
  • Efficiency: Vectorized boolean operations are generally fast.
  • Label Alignment: .loc ensures alignment based on index labels, preventing subtle errors.

While .iloc accepts boolean arrays, it's primarily designed for integer position-based indexing. Using boolean Series with .loc is the standard and generally preferred method for value-based filtering in Pandas.

Workshop Complex Data Filtering and Subset Selection

Goal: Apply complex boolean indexing criteria to the Titanic dataset to isolate specific groups of passengers and analyze their characteristics.

Scenario: You need to identify and analyze several specific passenger cohorts from the Titanic dataset:

  1. Children (age < 18) who were in 3rd class.
  2. Adult males (age >= 18, sex = 'male') who survived.
  3. Passengers who embarked at 'S' but were not in 3rd class.
  4. Passengers whose fare was unusually high (e.g., greater than the 95th percentile) or unusually low (e.g., less than the 5th percentile).

Steps:

  1. Import Pandas and Load/Prepare Data: Ensure the cleaned Titanic dataset is loaded into titanic_df.
    import pandas as pd
    import numpy as np
    
    # Assuming titanic_df is loaded and cleaned from the previous section
    # If not, reload and clean it here.
    try:
        # Verify data is loaded and cleaned
        print("Using pre-loaded and cleaned Titanic DataFrame.")
        print(titanic_df.info())
        if titanic_df.isnull().sum().sum() != 0:
             print("Warning: DataFrame still contains NaNs. Please ensure cleaning steps were run.")
    except NameError:
        print("Error: titanic_df not found. Loading and cleaning...")
        # Paste loading and cleaning code here if needed
        titanic_url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv'
        try:
            titanic_df = pd.read_csv(titanic_url)
            if 'deck' in titanic_df.columns: titanic_df.drop('deck', axis=1, inplace=True)
            if 'age' in titanic_df.columns: titanic_df['age'] = titanic_df['age'].fillna(titanic_df['age'].median())
            if 'embarked' in titanic_df.columns: titanic_df['embarked'] = titanic_df['embarked'].fillna(titanic_df['embarked'].mode()[0])
            print("Cleaned Titanic dataset loaded/recreated.")
            if titanic_df.isnull().sum().sum() != 0: print("Warning: NaNs remain after cleaning.")
        except Exception as e:
            print(f"Fatal Error: Could not load/clean data - {e}")
            # Cannot proceed without data
            raise SystemExit("Cannot proceed without the Titanic data.")  # Stop here rather than continuing without data
    
  2. Filter 1: Third-Class Children:
    • Create boolean conditions for age < 18 and pclass == 3.
    • Combine them using &.
    • Select the rows using .loc.
    • Display the shape and maybe the survival rate of this group.
      print("\n--- Filter 1: Third-Class Children ---")
      cond_child = titanic_df['age'] < 18
      cond_pclass3 = titanic_df['pclass'] == 3
      third_class_children = titanic_df.loc[cond_child & cond_pclass3]
      
      print(f"Number of third-class children: {third_class_children.shape[0]}")
      if not third_class_children.empty:
          survival_rate_3rd_child = third_class_children['survived'].mean()
          print(f"Survival rate for third-class children: {survival_rate_3rd_child:.2%}")
      print(third_class_children[['age', 'sex', 'survived']].head())
      
  3. Filter 2: Surviving Adult Males:
    • Create boolean conditions for age >= 18, sex == 'male', and survived == 1.
    • Combine them using &.
    • Select the rows using .loc.
    • Display the shape and maybe the class distribution of this group.
      print("\n--- Filter 2: Surviving Adult Males ---")
      cond_adult = titanic_df['age'] >= 18
      cond_male = titanic_df['sex'] == 'male'
      cond_survived = titanic_df['survived'] == 1
      surviving_adult_males = titanic_df.loc[cond_adult & cond_male & cond_survived]
      
      print(f"Number of surviving adult males: {surviving_adult_males.shape[0]}")
      if not surviving_adult_males.empty:
          print("Class distribution for surviving adult males:")
          print(surviving_adult_males['pclass'].value_counts(normalize=True).round(3))
      print(surviving_adult_males[['age', 'pclass', 'fare']].head())
      
  4. Filter 3: Non-Third-Class Passengers from Southampton:
    • Create boolean conditions for embarked == 'S' and pclass != 3.
    • Combine them using &.
    • Select the rows using .loc.
    • Display the shape and class distribution.
      print("\n--- Filter 3: Non-Third-Class from Southampton ---")
      cond_embarked_s = titanic_df['embarked'] == 'S'
      cond_not_pclass3 = titanic_df['pclass'] != 3 # Or ~(titanic_df['pclass'] == 3)
      non_3rd_southampton = titanic_df.loc[cond_embarked_s & cond_not_pclass3]
      
      print(f"Number of non-third-class passengers from Southampton: {non_3rd_southampton.shape[0]}")
      if not non_3rd_southampton.empty:
          print("Class distribution for non-third-class from Southampton:")
          print(non_3rd_southampton['pclass'].value_counts())
      print(non_3rd_southampton[['pclass', 'age', 'fare', 'embarked']].head())
      
  5. Filter 4: Passengers with Extreme Fares:
    • Calculate the 5th and 95th percentiles of the 'fare' column using .quantile().
    • Create boolean conditions for fare < 5th_percentile and fare > 95th_percentile.
    • Combine them using | (OR).
    • Select the rows using .loc.
    • Display the shape and descriptive statistics for the fares in this group.
      print("\n--- Filter 4: Passengers with Extreme Fares ---")
      fare_5th_percentile = titanic_df['fare'].quantile(0.05)
      fare_95th_percentile = titanic_df['fare'].quantile(0.95)
      print(f"5th percentile fare: {fare_5th_percentile:.2f}")
      print(f"95th percentile fare: {fare_95th_percentile:.2f}")
      
      cond_low_fare = titanic_df['fare'] < fare_5th_percentile
      cond_high_fare = titanic_df['fare'] > fare_95th_percentile
      extreme_fare_passengers = titanic_df.loc[cond_low_fare | cond_high_fare]
      
      print(f"\nNumber of passengers with extreme fares: {extreme_fare_passengers.shape[0]}")
      if not extreme_fare_passengers.empty:
          print("Fare statistics for passengers with extreme fares:")
          print(extreme_fare_passengers['fare'].describe())
      print(extreme_fare_passengers[['fare', 'pclass', 'age', 'survived']].head())
      

This workshop exercised the creation of multiple boolean conditions based on different columns and data types, combining them using logical operators (&, |), and using the resulting boolean Series within .loc to perform complex data filtering and subset selection for targeted analysis.
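
As an aside (not required for the workshop), the same kind of compound filters can also be written as expression strings with DataFrame.query(). The sketch below assumes the cleaned titanic_df from above is still in scope and mirrors Filters 1 and 4; it is an alternative formulation, not part of the original exercise.

# Alternative formulation of Filter 1 and Filter 4 using DataFrame.query()
# (a sketch; assumes the cleaned titanic_df from this workshop is available)
third_class_children_q = titanic_df.query("age < 18 and pclass == 3")

fare_lo = titanic_df['fare'].quantile(0.05)
fare_hi = titanic_df['fare'].quantile(0.95)
# Variables from the surrounding Python scope are referenced with '@'
extreme_fares_q = titanic_df.query("fare < @fare_lo or fare > @fare_hi")

print(third_class_children_q.shape, extreme_fares_q.shape)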

6. Grouping, Aggregating, and Reshaping Data

Grouping and aggregation are core components of data analysis. They allow you to split your data into groups based on some criteria, apply calculations (like sum, mean, count) independently to each group, and then combine the results back into a meaningful structure. Pandas' groupby() mechanism is exceptionally powerful for this. Reshaping tools like pivot tables further help in summarizing and presenting data.

The GroupBy Mechanism

The groupby() method allows for a "split-apply-combine" process on DataFrames or Series:

  1. Split: The data is split into groups based on one or more keys. The keys can be column names, index levels, arrays, or functions applied to the index/columns.
  2. Apply: A function (e.g., aggregation, transformation, filtering) is applied independently to each group.
  3. Combine: The results of the function application are combined back into a resulting object (e.g., a Series or DataFrame).

Basic GroupBy Operation:

import pandas as pd
import numpy as np

# Sample data: Sales by Region and Product
data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'East', 'East'],
        'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Sales': [100, 150, 200, 250, 120, 180, 300, 350],
        'Quantity': [10, 15, 8, 10, 12, 18, 20, 22]}
df_sales = pd.DataFrame(data)
print("Original Sales DataFrame:")
print(df_sales)

# Group by 'Region'
grouped_by_region = df_sales.groupby('Region')
print("\nGroupBy object (Region):", grouped_by_region) # Shows an object, not the data yet

# Apply an aggregation function (e.g., sum) to the numeric columns
# (numeric_only=True skips the non-numeric 'Product' column, which would
#  otherwise be concatenated by sum() and raise an error for mean())
region_sum = grouped_by_region.sum(numeric_only=True)
print("\nSum of Sales and Quantity by Region:")
print(region_sum)

# Apply another aggregation (e.g., mean)
region_mean = grouped_by_region.mean(numeric_only=True)
print("\nMean Sales and Quantity by Region:")
print(region_mean)

# Get the size of each group
region_size = grouped_by_region.size()
print("\nSize of each Region group:")
print(region_size) # Returns a Series

# Get the count of non-null values in each column per group
region_count = grouped_by_region.count()
print("\nCount of non-null values per column by Region:")
print(region_count)

# Grouping by multiple keys ('Region' and 'Product')
grouped_by_region_product = df_sales.groupby(['Region', 'Product'])
print("\nGroupBy object (Region, Product):", grouped_by_region_product)

# Calculate mean sales for each combination
region_product_mean_sales = grouped_by_region_product['Sales'].mean() # Select 'Sales' then aggregate
print("\nMean Sales by Region and Product:")
print(region_product_mean_sales) # Result has a MultiIndex

# Unstack the result for a wider view
print("\nMean Sales Unstacked:")
print(region_product_mean_sales.unstack())

# Iterating through groups (less common for aggregation, useful for inspection/custom logic)
print("\nIterating through Region groups:")
for name, group_df in grouped_by_region:
    print(f"--- Group: {name} ---")
    print(group_df)
    # You could perform custom calculations on group_df here

# Selecting a specific group
print("\nGetting the 'North' group:")
north_group = grouped_by_region.get_group('North')
print(north_group)

Key Points:

  • Calling groupby() itself doesn't perform calculations; it creates a GroupBy object that holds information about the groups.
  • Aggregation functions (sum(), mean(), median(), min(), max(), std(), var(), count(), size(), first(), last(), etc.) are then applied to this object.
  • You can group by one or multiple column names, index levels, or external arrays/Series of the same length.
  • Selecting a column before aggregation (grouped['Sales'].sum()) is generally more efficient than aggregating the whole DataFrame and then selecting (grouped.sum()['Sales']).
  • The result of an aggregation typically has the group keys as its index (potentially a MultiIndex).

The GroupBy mechanism is fundamental for summarizing data based on categorical variables or other grouping criteria.
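
To illustrate the point above that group keys need not be column names, here is a minimal sketch using the df_sales DataFrame from the example. The region-to-zone mapping is purely illustrative; grouping by an aligned Series or by a function of the index label both follow the same split-apply-combine pattern.

# Grouping by an external Series of the same length (hypothetical zone mapping)
zone = df_sales['Region'].map({'North': 'N/E', 'East': 'N/E', 'South': 'S'})
print(df_sales.groupby(zone)['Sales'].sum())

# Grouping by a function applied to the index labels (here: even vs. odd row positions)
print(df_sales.groupby(lambda idx: idx % 2)['Sales'].sum())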

Workshop Analyzing Sales Data by Region and Product

Goal: Use groupby() to analyze the sample sales data, calculating total and average sales per region and per product.

Scenario: Using the df_sales DataFrame created in the previous section, you need to answer specific business questions by grouping and aggregating the data:

  1. What are the total sales for each region?
  2. What are the total sales for each product across all regions?
  3. Which region-product combination had the highest average quantity sold per transaction?

Steps:

  1. Import Pandas and Prepare Data: Ensure the df_sales DataFrame is available.
    import pandas as pd
    import numpy as np
    
    # Recreate data if needed
    data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'East', 'East'],
            'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
            'Sales': [100, 150, 200, 250, 120, 180, 300, 350],
            'Quantity': [10, 15, 8, 10, 12, 18, 20, 22]}
    df_sales = pd.DataFrame(data)
    print("Original Sales DataFrame:")
    print(df_sales)
    
  2. Question 1: Total Sales per Region:
    • Group df_sales by the 'Region' column.
    • Select the 'Sales' column from the grouped object.
    • Apply the sum() aggregation.
    • Print the resulting Series.
      print("\n--- Q1: Total Sales per Region ---")
      total_sales_by_region = df_sales.groupby('Region')['Sales'].sum()
      print(total_sales_by_region)
      # Optional: Sort for better presentation
      print("\nSorted by Total Sales:")
      print(total_sales_by_region.sort_values(ascending=False))
      
      Self-Check: Manually sum the sales for 'North' (100 + 200 + 120 = 420). Does the result match?
  3. Question 2: Total Sales per Product:
    • Group df_sales by the 'Product' column.
    • Select the 'Sales' column.
    • Apply the sum() aggregation.
    • Print the result.
      print("\n--- Q2: Total Sales per Product ---")
      total_sales_by_product = df_sales.groupby('Product')['Sales'].sum()
      print(total_sales_by_product)
      
      Self-Check: Sum sales for 'A' (100 + 150 + 120 + 180 = 550). Does the result match?
  4. Question 3: Highest Average Quantity per Region-Product:
    • Group df_sales by both 'Region' and 'Product'.
    • Select the 'Quantity' column.
    • Apply the mean() aggregation.
    • Find the index (Region, Product combination) corresponding to the maximum value in the resulting Series using .idxmax().
    • Retrieve the maximum average quantity value using .max().
      print("\n--- Q3: Highest Average Quantity per Region-Product ---")
      avg_quantity_by_region_product = df_sales.groupby(['Region', 'Product'])['Quantity'].mean()
      print("Average Quantity per (Region, Product):\n", avg_quantity_by_region_product)
      
      # Find the combination with the highest average quantity
      highest_avg_qty_combination = avg_quantity_by_region_product.idxmax()
      highest_avg_qty_value = avg_quantity_by_region_product.max()
      
      print(f"\nThe combination with the highest average quantity sold is: {highest_avg_qty_combination}")
      print(f"Highest average quantity: {highest_avg_qty_value:.2f}")
      
      Self-Check: Look at the average quantities calculated. Does ('East', 'B') indeed have the highest average ( (20+22)/2 = 21 )?

This workshop focused on using the basic groupby() and aggregation workflow (selecting columns and applying functions like sum() and mean()) to answer specific analytical questions about summarized data. It also introduced finding the maximum value and its corresponding index label(s) in the aggregated result.

Aggregation Functions

While simple aggregations like sum(), mean(), count() are common, Pandas GroupBy objects support more complex aggregation scenarios using the .agg() (or .aggregate()) method. This method allows you to:

  1. Apply multiple aggregation functions at once.
  2. Apply different aggregation functions to different columns.
  3. Use custom-defined aggregation functions.

Using .agg():

  • Multiple Functions on All Columns: Pass a list of function names (as strings) or actual function objects.
  • Different Functions per Column: Pass a dictionary where keys are column names and values are the function(s) to apply to that column (can be a single function name/object or a list of them).
  • Custom Functions: Define your own function that takes a Series (representing the data within a group for a specific column) and returns a single scalar value. Pass this function object to .agg().
  • Renaming Aggregated Columns: For clearer output column names, use the named aggregation syntax, in which each keyword argument names an output column and maps it to an input column plus aggregation function (via pd.NamedAgg or a plain (column, aggfunc) tuple).

Named Aggregation (Preferred for Clarity):

The modern way, especially when applying multiple aggregations or needing specific output names, is named aggregation within .agg():

grouped_object.agg(
    output_col_name1 = pd.NamedAgg(column='input_col_name1', aggfunc='sum'),
    output_col_name2 = pd.NamedAgg(column='input_col_name2', aggfunc='mean'),
    output_col_name3 = pd.NamedAgg(column='input_col_name1', aggfunc=np.std),
    # ...
)
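
Pandas also accepts a plain two-element (column, aggfunc) tuple as shorthand for pd.NamedAgg. A short concrete sketch using the df_sales DataFrame from the earlier examples:

# Shorthand named aggregation: keyword = (column, aggfunc) tuple
summary = df_sales.groupby('Region').agg(
    TotalSales=('Sales', 'sum'),
    AvgQuantity=('Quantity', 'mean'),
)
print(summary)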

Examples:

Using the df_sales DataFrame:

import pandas as pd
import numpy as np

# Recreate data if needed
data = {'Region': ['North', 'South', 'North', 'South', 'North', 'South', 'East', 'East'],
        'Product': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Sales': [100, 150, 200, 250, 120, 180, 300, 350],
        'Quantity': [10, 15, 8, 10, 12, 18, 20, 22]}
df_sales = pd.DataFrame(data)
grouped_region_product = df_sales.groupby(['Region', 'Product'])

print("Original Sales DataFrame:")
print(df_sales)

print("\n--- Aggregation Examples ---")

# 1. Multiple functions on selected columns
agg_multi_func = grouped_region_product[['Sales', 'Quantity']].agg(['sum', 'mean', 'count'])
print("\nApplying 'sum', 'mean', 'count' to Sales & Quantity:")
print(agg_multi_func)
# Note the MultiIndex columns in the result: (Column, AggFunction)

# 2. Different functions per column using a dictionary
agg_dict_func = grouped_region_product.agg({
    'Sales': ['sum', 'mean'],    # Sum and Mean for Sales
    'Quantity': 'sum'           # Only Sum for Quantity
})
print("\nApplying different functions per column (dict):")
print(agg_dict_func)

# 3. Custom aggregation function (e.g., range)
def sales_range(x):
    return x.max() - x.min()

agg_custom_func = grouped_region_product.agg({
    'Sales': ['sum', sales_range], # Mix built-in and custom
    'Quantity': 'mean'
})
print("\nApplying custom 'sales_range' function:")
print(agg_custom_func)

# 4. Using Named Aggregation (Preferred)
agg_named = grouped_region_product.agg(
    TotalSales = pd.NamedAgg(column='Sales', aggfunc='sum'),
    AverageSales = pd.NamedAgg(column='Sales', aggfunc='mean'),
    TotalQuantity = pd.NamedAgg(column='Quantity', aggfunc='sum'),
    SalesRange = pd.NamedAgg(column='Sales', aggfunc=sales_range),
    TransactionCount = pd.NamedAgg(column='Sales', aggfunc='count') # Count transactions
)
print("\nUsing Named Aggregation:")
print(agg_named)
# Note the clean, user-defined column names in the result.

# Applying aggregation without grouping (on the whole DataFrame)
# Note: Often simpler functions exist, e.g., df.sum(), df.mean()
# But useful if mixing functions or applying to specific columns
overall_agg = df_sales.agg({
    'Sales': ['sum', 'mean'],
    'Quantity': ['min', 'max']
})
print("\nAggregation on the whole DataFrame:")
print(overall_agg)

The .agg() method provides immense flexibility for summarizing grouped data exactly how you need it, allowing for complex, multi-faceted aggregations in a concise manner, especially with the named aggregation syntax.
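
Beyond aggregation, the "apply" step of split-apply-combine can also be a transformation (returning a result with the same length as each group) or a filtration (keeping or dropping whole groups). A minimal sketch using df_sales; the 500 threshold is purely illustrative:

# Transformation: broadcast a per-group statistic back to the original rows
df_sales['RegionMeanSales'] = df_sales.groupby('Region')['Sales'].transform('mean')

# Each row's sales as a share of its region's total
df_sales['SalesShare'] = df_sales['Sales'] / df_sales.groupby('Region')['Sales'].transform('sum')

# Filtration: keep only regions whose total sales exceed a threshold
big_regions = df_sales.groupby('Region').filter(lambda g: g['Sales'].sum() > 500)

print(df_sales.head())
print(big_regions['Region'].unique())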

Workshop Calculating Summary Statistics for Groups

Goal: Use the .agg() method on the cleaned Titanic dataset to calculate various summary statistics for different passenger groups.

Scenario: You want to gain deeper insights into the Titanic passengers by calculating multiple statistics simultaneously for groups based on 'Pclass' and 'Sex'. Specifically, you want to find:

  1. For each combination of 'Pclass' and 'Sex':
    • The total number of passengers (count).
    • The number of survivors.
    • The survival rate.
    • The average age.
    • The median fare.
    • The range of fares (max - min).

Steps:

  1. Import Libraries and Prepare Data: Ensure the cleaned titanic_df is loaded.
    import pandas as pd
    import numpy as np
    
    # Assuming titanic_df is loaded and cleaned
    try:
        print("Using pre-loaded and cleaned Titanic DataFrame.")
        titanic_df.info(memory_usage=False) # Concise info
        if titanic_df.isnull().sum().sum() != 0:
             print("Warning: DataFrame still contains NaNs.")
    except NameError:
        print("Error: titanic_df not found. Loading and cleaning...")
        # Add loading/cleaning code if necessary
        exit()
    
  2. Define Custom Function (Fare Range): Create a simple function to calculate the range (max - min).
    def fare_range(x):
        """Calculates the range (max - min) of a Series."""
        return x.max() - x.min()
    
  3. Group Data: Group the titanic_df by both 'Pclass' and 'Sex'.
    grouped_class_sex = titanic_df.groupby(['pclass', 'sex'])
    print("\nGroupBy object created for ('pclass', 'sex').")
    
  4. Perform Aggregation using Named Aggregation: Use the .agg() method with pd.NamedAgg to calculate all the required statistics with clear output names. Note that 'survived' is coded as 0 or 1, so sum() gives the count of survivors and mean() gives the survival rate.
    print("\n--- Calculating Aggregated Statistics ---")
    summary_stats = grouped_class_sex.agg(
        PassengerCount = pd.NamedAgg(column='survived', aggfunc='count'), # Count any non-null column for total
        Survivors = pd.NamedAgg(column='survived', aggfunc='sum'),
        SurvivalRate = pd.NamedAgg(column='survived', aggfunc='mean'),
        AverageAge = pd.NamedAgg(column='age', aggfunc='mean'),
        MedianFare = pd.NamedAgg(column='fare', aggfunc='median'),
        FareRange = pd.NamedAgg(column='fare', aggfunc=fare_range) # Use custom function
    )
    
    # Format the SurvivalRate as percentage for better readability (optional)
    summary_stats['SurvivalRate'] = summary_stats['SurvivalRate'].map('{:.2%}'.format)
    
    print("\nSummary Statistics by Pclass and Sex:")
    print(summary_stats)
    
  5. Analyze Results: Examine the output DataFrame summary_stats.

    • Compare survival rates between sexes within each class.
    • Compare survival rates across classes for the same sex.
    • Look at average ages and median fares. Do they align with expectations for different classes?
    • Check the fare ranges – are there large differences within certain groups?

    Self-Check:

    • Does the 'PassengerCount' match the number of individuals in each group?
    • Is the 'SurvivalRate' significantly higher for females than males within each class?
    • Does the 'SurvivalRate' generally decrease as 'Pclass' increases (from 1st to 3rd)?
    • Does 'AverageAge' tend to decrease slightly from 1st class to 3rd class?
    • Is 'MedianFare' drastically different across classes?
    • Are fare ranges wider in 1st class compared to 3rd?

This workshop demonstrated the power and clarity of using .agg() with named aggregation (pd.NamedAgg) to compute multiple, potentially complex, summary statistics for grouped data, including the use of custom aggregation functions. This provides a rich, tabular summary suitable for analysis and reporting.

Pivoting and Reshaping DataFrames

Sometimes your data is in a "long" format (many rows, few columns, one observation per row) and you need it in a "wide" format (fewer rows, more columns, information spread across columns), or vice versa. The stack()/unstack() methods seen with MultiIndexes are one way to do this; pivot() and pivot_table() are other powerful tools, especially for transitioning from long to wide format based on data values.

  • pivot(index=None, columns=None, values=None) (DataFrame method): Reshapes data (produces a "wide" format) based on unique values from specified index and columns. It's suitable when the combination of index and columns uniquely identifies a values entry. If there are duplicate combinations, pivot will raise an error.
  • pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All') (Pandas function or DataFrame method): Creates a spreadsheet-style pivot table. It's more flexible than pivot because it can handle duplicate entries for a given index/column combination by aggregating them using the specified aggfunc (default is 'mean'). It's essentially a combination of groupby and reshaping.

    • aggfunc: Function used for aggregation (e.g., 'sum', 'mean', 'count', or a list/dict of functions).
    • fill_value: Value to replace missing values (NaNs) in the resulting table.
    • margins=True: Adds row and column totals (subtotals and grand totals).
  • pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None) (Pandas function): The inverse of pivoting; transforms a DataFrame from "wide" format to "long" format by unpivoting columns into rows. id_vars are columns to keep as identifier variables, while value_vars are columns to unpivot.

Example pivot vs pivot_table:

import pandas as pd

# Data suitable for pivot (unique index/column combo)
data_long_unique = pd.DataFrame({
    'Date': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02']),
    'Sensor': ['Temp', 'Humidity', 'Temp', 'Humidity'],
    'Reading': [25.5, 60.1, 26.0, 62.3]
})
print("Long format data (unique combos):")
print(data_long_unique)

# Use pivot to make Sensors the columns
df_pivoted = data_long_unique.pivot(index='Date', columns='Sensor', values='Reading')
print("\nPivoted data (wide format):")
print(df_pivoted)

# Data with duplicate index/column combos (needs aggregation)
data_long_duplicates = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Alice', 'Charlie', 'Bob', 'Alice'],
    'Assignment': ['HW1', 'HW1', 'HW2', 'HW1', 'HW2', 'HW1'],
    'Score': [90, 85, 95, 88, 92, 89] # Alice HW1 appears twice
})
print("\nLong format data (duplicate combos):")
print(data_long_duplicates)

# Trying pivot fails due to duplicate ('Alice', 'HW1')
try:
    data_long_duplicates.pivot(index='Student', columns='Assignment', values='Score')
except ValueError as e:
    print("\nPivot failed as expected:", e)

# Use pivot_table with an aggregation function (e.g., mean for Alice's HW1)
df_pivot_table = data_long_duplicates.pivot_table(index='Student', columns='Assignment',
                                                values='Score', aggfunc='mean') # Default is mean
print("\nPivot Table (aggfunc='mean'):")
print(df_pivot_table)

# Pivot table with sum and margins
df_pivot_table_sum = data_long_duplicates.pivot_table(index='Student', columns='Assignment',
                                                     values='Score', aggfunc='sum',
                                                     fill_value=0, # Fill missing scores with 0
                                                     margins=True, margins_name='TotalScore')
print("\nPivot Table (aggfunc='sum', margins=True, fill_value=0):")
print(df_pivot_table_sum)

Example melt:

# Start with the wide format df_pivoted
print("\nWide format data (df_pivoted):")
print(df_pivoted)

# Melt it back to long format
# Reset index first to make 'Date' a column to use as id_vars
df_pivoted_reset = df_pivoted.reset_index()
print("\nWide format with Date as column:")
print(df_pivoted_reset)

df_melted = pd.melt(df_pivoted_reset,
                    id_vars=['Date'],          # Column(s) to keep fixed
                    value_vars=['Humidity', 'Temp'], # Columns to unpivot
                    var_name='SensorType',     # Name for the new variable column
                    value_name='SensorReading')# Name for the new value column
print("\nMelted data (long format):")
print(df_melted)

pivot_table is extremely useful for summarizing data in a familiar spreadsheet-like format, while melt is essential for converting data into the "tidy" long format often required by plotting libraries or statistical modeling functions.
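
For completeness, the stack()/unstack() route mentioned above achieves a similar long/wide conversion by moving levels between the row index and the columns. A brief sketch using the df_pivoted table from this example:

# stack()/unstack() move levels between the columns and the row index
# df_pivoted has 'Date' as the index and 'Sensor' as the columns (from the example above)
stacked = df_pivoted.stack()        # back to long: Series with MultiIndex (Date, Sensor)
print(stacked)

unstacked_again = stacked.unstack() # wide again: Sensor values become columns
print(unstacked_again)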

Workshop Reshaping Time Series Data

Goal: Use pivot_table to summarize monthly sales data from a long format and melt to convert summarized data back.

Scenario: You have a dataset containing daily sales records for different product categories over several months. You need to create a pivot table showing total monthly sales for each category, and then demonstrate how to melt this summary table back into a long format.

Steps:

  1. Import Libraries and Create Sample Data: Create a DataFrame simulating daily sales data.
    import pandas as pd
    import numpy as np
    
    # Create sample daily sales data
    date_rng = pd.date_range(start='2023-01-01', end='2023-03-31', freq='D')
    num_days = len(date_rng)
    categories = ['Electronics', 'Clothing', 'Groceries']
    
    np.random.seed(10)
    data = {
        'Date': np.random.choice(date_rng, size=num_days * 2), # More entries than days
        'Category': np.random.choice(categories, size=num_days * 2),
        'Sales': np.random.randint(50, 1000, size=num_days * 2)
    }
    daily_sales_df = pd.DataFrame(data).sort_values(by='Date').reset_index(drop=True)
    
    # Add Month and Year columns for easy grouping/pivoting
    daily_sales_df['Year'] = daily_sales_df['Date'].dt.year
    daily_sales_df['Month'] = daily_sales_df['Date'].dt.month
    
    print("Sample Daily Sales Data (Long Format):")
    print(daily_sales_df.head())
    print(f"\nTotal records: {len(daily_sales_df)}")
    
  2. Create Pivot Table: Use pivot_table to create a summary table where:
    • The index represents the 'Month'.
    • The columns represent the 'Category'.
    • The values represent the total 'Sales' (aggfunc='sum').
    • Handle potential missing combinations with fill_value=0.
      print("\n--- Creating Pivot Table ---")
      monthly_sales_pivot = daily_sales_df.pivot_table(
          values='Sales',
          index='Month',
          columns='Category',
          aggfunc='sum',
          fill_value=0 # Fill months/categories with no sales with 0
      )
      
      print("\nMonthly Sales Pivot Table (Wide Format):")
      print(monthly_sales_pivot)
      
      Self-Check: Does the table show Months (1, 2, 3) as rows and Categories ('Clothing', 'Electronics', 'Groceries') as columns? Do the values represent the sum of sales for that month-category combination?
  3. Add Margins (Totals): Recreate the pivot table, but this time add row and column margins (totals) using margins=True.
    print("\n--- Creating Pivot Table with Margins ---")
    monthly_sales_pivot_margins = daily_sales_df.pivot_table(
        values='Sales',
        index='Month',
        columns='Category',
        aggfunc='sum',
        fill_value=0,
        margins=True, # Add row/column totals
        margins_name='Total' # Name for the margins row/column
    )
    
    print("\nMonthly Sales Pivot Table with Totals:")
    print(monthly_sales_pivot_margins)
    
    Self-Check: Does the table now include a 'Total' row showing the sum across months for each category, and a 'Total' column showing the sum across categories for each month? Is there a grand total at the bottom right?
  4. Melt the Pivot Table: Take the original pivot table (without margins) and use pd.melt to convert it back into a long format.
    • First, reset_index() to make 'Month' a regular column.
    • Then, use melt with 'Month' as the id_vars.
      print("\n--- Melting the Pivot Table (without margins) ---")
      # Reset index to make 'Month' a column
      monthly_sales_pivot_reset = monthly_sales_pivot.reset_index()
      print("Pivot Table with Month as column:")
      print(monthly_sales_pivot_reset)
      
      # Melt the DataFrame
      monthly_sales_long = pd.melt(
          monthly_sales_pivot_reset,
          id_vars=['Month'],                # Column(s) to keep as identifiers
          value_vars=['Clothing', 'Electronics', 'Groceries'], # Columns to unpivot
          var_name='ProductCategory',       # Name for the new column holding old column names
          value_name='TotalMonthlySales'    # Name for the new column holding the values
      )
      
      print("\nMelted Monthly Sales Data (Long Format):")
      print(monthly_sales_long.sort_values(by=['Month', 'ProductCategory']))
      
      Self-Check: Does the monthly_sales_long DataFrame have columns 'Month', 'ProductCategory', and 'TotalMonthlySales'? Does each row represent the total sales for a specific category in a specific month, effectively reversing the pivot operation?

This workshop demonstrated how pivot_table can aggregate and reshape data from a long format into a wide summary table, which is excellent for reporting. It also showed how melt performs the inverse operation, converting a wide table back into a long format suitable for other types of analysis or visualization.

7. Time Series Analysis with Pandas

Pandas was originally developed with financial data analysis in mind, and thus has exceptionally strong capabilities for working with time series data. This includes specialized data structures for dates and times, methods for time-based indexing and selection, frequency conversion, resampling, rolling window calculations, and more.

Date and Time Data Types and Tools

Pandas provides several tools for handling time-based data:

  • Timestamp: Represents a single point in time. It's Pandas' equivalent of Python's datetime.datetime object but is based on NumPy's datetime64 dtype, offering more efficiency and functionality.
  • DatetimeIndex: An index composed of Timestamp objects. This allows for powerful time-based indexing and selection in Series and DataFrames.
  • Period: Represents a time span or interval, like a specific month (e.g., '2023-01'), year, or quarter.
  • PeriodIndex: An index composed of Period objects. Useful for data that represents fixed frequency intervals.
  • Timedelta: Represents a duration or difference between two dates/times.
  • pd.to_datetime(): The primary function for converting strings, integers, or other inputs into Timestamp objects or DatetimeIndex. It's very flexible in parsing various date/time formats.
  • pd.date_range(): Generates a DatetimeIndex with a fixed frequency (e.g., daily, hourly, monthly).

Working with DatetimeIndex:

When a DataFrame's index is a DatetimeIndex, Pandas enables convenient time-based slicing and selection.

import pandas as pd
import numpy as np  # needed below for np.random

# --- Creating Timestamps and DatetimeIndex ---
ts = pd.Timestamp('2023-10-26 10:30:00')
print("Timestamp:", ts)
print("Year:", ts.year)
print("Month:", ts.month_name())
print("Day of week:", ts.day_name())

# Create DatetimeIndex using date_range
date_idx = pd.date_range(start='2023-11-01', periods=7, freq='D') # 7 days starting Nov 1st
print("\nDatetimeIndex (Daily):")
print(date_idx)

date_idx_hourly = pd.date_range(start='2023-11-01', end='2023-11-01 05:00:00', freq='H') # Hourly
print("\nDatetimeIndex (Hourly):")
print(date_idx_hourly)

# --- Creating a Time Series DataFrame ---
# Use date_idx as the index
np.random.seed(0)
ts_data = np.random.randn(len(date_idx))
time_series = pd.Series(ts_data, index=date_idx)
print("\nTime Series (Series with DatetimeIndex):")
print(time_series)

df_ts = pd.DataFrame({'Value': ts_data, 'Value2': ts_data*2}, index=date_idx)
print("\nTime Series DataFrame:")
print(df_ts)

# --- Time-Based Indexing and Selection ---
print("\n--- Time-Based Selection ---")

# Select a specific date
print("\nData for 2023-11-03:")
print(df_ts.loc['2023-11-03'])

# Select a range of dates (slicing)
print("\nData from 2023-11-02 to 2023-11-05:")
print(df_ts['2023-11-02':'2023-11-05']) # .loc is implied and preferred: df_ts.loc[...]

# Select data for a specific year or month (partial string indexing)
# Create longer series for demonstration
date_idx_long = pd.date_range(start='2022-12-01', periods=60, freq='D') # Approx 2 months
df_ts_long = pd.DataFrame({'Value': np.random.randn(len(date_idx_long))}, index=date_idx_long)
print("\nLonger Time Series DataFrame (Head):")
print(df_ts_long.head())

print("\nData for the year 2023:")
print(df_ts_long['2023'].head()) # Selects all entries in 2023

print("\nData for January 2023:")
print(df_ts_long['2023-01'].head()) # Selects all entries in Jan 2023

# --- Using Timedelta ---
delta = pd.Timedelta(days=1, hours=5)
print("\nTimedelta:", delta)
print("Timestamp + Timedelta:", ts + delta)

# --- Using Period ---
p_month = pd.Period('2023-01', freq='M') # Represents the entire month of Jan 2023
print("\nPeriod (Month):", p_month)
print("Period start time:", p_month.start_time)
print("Period end time:", p_month.end_time)
# Convert timestamp to period
print("Timestamp converted to Monthly Period:", ts.to_period('M'))

# Create PeriodIndex
period_idx = pd.period_range(start='2023-01', periods=5, freq='M')
print("\nPeriodIndex (Monthly):")
print(period_idx)
# Series with PeriodIndex
s_period = pd.Series(np.random.randn(5), index=period_idx)
print("\nSeries with PeriodIndex:\n", s_period)

Mastering to_datetime, date_range, and the behavior of DatetimeIndex (especially time-based slicing) is crucial for preparing time series data for analysis in Pandas.
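
Since pd.to_datetime itself was only mentioned above, here is a minimal sketch of converting a single string and a string column into Timestamps; the date strings and column name are illustrative.

import pandas as pd

# pd.to_datetime parses a single string or a whole column into Timestamps
print(pd.to_datetime('2023-11-01 14:30'))

# An explicit format string is unambiguous (and faster) for large columns
df_dates = pd.DataFrame({'raw_date': ['01-11-2023', '02-11-2023']})
df_dates['date'] = pd.to_datetime(df_dates['raw_date'], format='%d-%m-%Y')
print(df_dates['date'])
print(df_dates['date'].dt.day_name())  # .dt accessor exposes Timestamp attributes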

Workshop Analyzing Stock Prices Over Time

Goal: Load historical stock price data (e.g., Apple AAPL), convert the date column to a DatetimeIndex, and perform basic time-based selections and analysis.

Dataset: We'll use historical stock data, which is widely available. For instance, from Yahoo Finance via libraries like yfinance or by downloading a CSV. For simplicity here, we'll simulate loading from a CSV structure.

Steps:

  1. Import Libraries:
    import pandas as pd
    import numpy as np
    import io # To simulate CSV loading
    
  2. Prepare Simulated Stock Data: Create a string simulating a CSV file with AAPL stock data. In a real scenario, you'd use pd.read_csv('your_stock_data.csv') or a library like yfinance.
    # Simulate CSV data for AAPL stock
    # Columns: Date, Open, High, Low, Close, Adj Close, Volume
    stock_csv_data = """Date,Open,High,Low,Close,Adj Close,Volume
2023-01-03,129.27,130.89,124.17,125.07,124.87,112117500
2023-01-04,126.89,128.66,125.08,126.36,126.16,89113600
2023-01-05,127.12,127.77,124.76,125.02,124.82,80962700
2023-01-06,126.01,130.29,124.89,129.62,129.41,87754700
2023-01-09,130.47,133.41,129.89,130.15,129.94,70790800
2023-01-10,130.26,131.26,128.12,130.73,130.52,63896200
2023-01-11,131.25,133.51,130.46,133.49,133.27,69458900
2023-01-12,133.88,134.26,131.41,133.41,133.19,71379600
2023-01-13,132.03,134.92,131.66,134.76,134.54,57809700
2023-02-01,143.97,146.61,141.32,145.43,145.19,77663600
2023-02-02,148.90,151.18,148.17,150.82,150.57,118339000
2023-02-03,148.03,157.38,147.83,154.50,154.24,154279900
"""
  3. Load Data and Set DatetimeIndex:
    • Load the data using pd.read_csv.
    • Crucially, use the parse_dates=['Date'] argument to tell Pandas to interpret the 'Date' column as dates.
    • Use the index_col='Date' argument to set the 'Date' column as the DataFrame index immediately upon loading.
      # Load the data, parsing dates and setting index
      try:
          aapl_df = pd.read_csv(io.StringIO(stock_csv_data),
                                parse_dates=['Date'],
                                index_col='Date')
          print("AAPL stock data loaded successfully with DatetimeIndex:")
          print(aapl_df.head())
          print("\nIndex type:", type(aapl_df.index))
          print("Index dtype:", aapl_df.index.dtype)
      except Exception as e:
          print(f"Error loading data: {e}")
          # Create placeholder if error
          aapl_df = pd.DataFrame(columns=['Open','High','Low','Close','Adj Close','Volume'])
      
      Self-Check: Is the index of aapl_df a DatetimeIndex? Are the dates parsed correctly?
  4. Time-Based Selection:
    • Select the data for a specific date, e.g., '2023-01-05'.
    • Select the data for the week starting '2023-01-09'.
    • Select all data for January 2023 using partial string indexing.
    • Select all data for February 2023.
      if not aapl_df.empty:
          print("\n--- Time-Based Selections ---")
      
          # Specific date
          try:
              data_jan_5 = aapl_df.loc['2023-01-05']
              print("\nData for 2023-01-05:\n", data_jan_5)
          except KeyError:
              print("\nData for 2023-01-05 not found.")
      
      
          # Week starting Jan 9 (until Jan 13 in this data)
          data_week_jan_9 = aapl_df.loc['2023-01-09':'2023-01-13']
          print("\nData for week starting 2023-01-09:\n", data_week_jan_9)
      
          # All data for January 2023
          data_jan_2023 = aapl_df.loc['2023-01']
          print("\nAll data for January 2023:\n", data_jan_2023)
      
          # All data for February 2023
          data_feb_2023 = aapl_df.loc['2023-02']
          print("\nAll data for February 2023:\n", data_feb_2023)
      
      else:
          print("\nSkipping selections as DataFrame is empty.")
      
      Self-Check: Did each selection return the expected rows based on the dates?
  5. Basic Analysis:
    • Calculate the average closing price for January 2023.
    • Find the highest price ('High') reached in February 2023.
      if not aapl_df.empty:
          print("\n--- Basic Analysis ---")
          try:
              avg_close_jan = aapl_df.loc['2023-01', 'Close'].mean()
              print(f"\nAverage Closing Price in Jan 2023: {avg_close_jan:.2f}")
          except KeyError:
               print("\nCould not calculate average closing price for Jan 2023 (data missing).")
      
          try:
              max_high_feb = aapl_df.loc['2023-02', 'High'].max()
              print(f"Highest Price (High) in Feb 2023: {max_high_feb:.2f}")
          except KeyError:
               print("\nCould not find maximum high for Feb 2023 (data missing).")
          except ValueError: # Handle case where Feb 2023 data exists but is empty
               print("\nNo data found for Feb 2023 to calculate max high.")
      
      else:
           print("\nSkipping analysis as DataFrame is empty.")
      

This workshop demonstrated the workflow for loading time series data, ensuring the date column becomes a DatetimeIndex using parse_dates and index_col in read_csv, and leveraging the power of the DatetimeIndex for intuitive time-based slicing and selection to perform basic analysis.
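
If the date column was not parsed at load time, the same DatetimeIndex can be built afterwards. A short sketch, assuming the io module and the stock_csv_data string from step 2 are still in scope:

# Convert a plain string 'Date' column after loading, then set it as the index
raw_df = pd.read_csv(io.StringIO(stock_csv_data))   # no parse_dates / index_col here
raw_df['Date'] = pd.to_datetime(raw_df['Date'])
raw_df = raw_df.set_index('Date').sort_index()
print(type(raw_df.index))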

Resampling and Frequency Conversion

Resampling involves changing the frequency of your time series data. This is a common operation:

  • Downsampling: Aggregating data from a higher frequency to a lower frequency (e.g., converting daily data to monthly data). This typically involves an aggregation function (like mean, sum, ohlc - Open, High, Low, Close).
  • Upsampling: Converting data from a lower frequency to a higher frequency (e.g., monthly data to daily data). This usually involves filling or interpolating the missing data points created during the conversion.

The primary method for resampling in Pandas is .resample(), used on a Series or DataFrame with a DatetimeIndex or PeriodIndex.

.resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, on=None, level=None, origin='start_day', offset=None):

  • rule: The target frequency (e.g., 'D' for daily, 'W' for weekly, 'M' for month-end, 'MS' for month-start, 'Q' for quarter-end, 'QS' for quarter-start, 'A' or 'Y' for year-end, 'AS' or 'YS' for year-start, 'H' for hourly, 'T' or 'min' for minutely, 'S' for secondly). You can combine these with multipliers (e.g., '2H' for 2 hours). Note that recent Pandas releases (2.2+) prefer 'ME', 'QE', and 'YE' for month-, quarter-, and year-end and lowercase 'h', 'min', 's' for sub-daily frequencies; the older aliases used here still work but emit deprecation warnings.
  • closed: Which side of the interval is closed ('left' or 'right'). Relevant for frequencies like daily when aggregating intraday data.
  • label: Which bin edge label to use ('left' or 'right').
  • convention: For Period resampling ('start' or 'end').
  • kind: Resample to 'timestamp' (default) or 'period'.

After calling .resample(), you typically chain an aggregation method (like .sum(), .mean(), .ohlc(), .agg()) for downsampling, or an interpolation/filling method (like .asfreq(), .ffill(), .bfill(), .interpolate()) for upsampling.

Downsampling Example:

import pandas as pd
import numpy as np

# Use the longer time series DataFrame df_ts_long from previous section
date_idx_long = pd.date_range(start='2022-12-01', periods=60, freq='D') # Dec 2022, Jan 2023
np.random.seed(1)
df_ts_long = pd.DataFrame({'Value': np.random.randint(1, 100, size=len(date_idx_long))}, index=date_idx_long)
print("Original Daily Time Series (Head):")
print(df_ts_long.head())

# Downsample daily data to monthly sum
monthly_sum = df_ts_long['Value'].resample('M').sum() # 'M' = Month End frequency
print("\nMonthly Sum (Resampled):")
print(monthly_sum)

# Downsample daily data to monthly mean
monthly_mean = df_ts_long['Value'].resample('M').mean()
print("\nMonthly Mean (Resampled):")
print(monthly_mean)

# Downsample daily data to weekly OHLC (Open, High, Low, Close)
# Requires multiple columns or a specific structure, let's fake it
df_ts_ohlc = pd.DataFrame({
    'Open': np.random.rand(60) + 10,
    'High': np.random.rand(60) + 11,
    'Low': np.random.rand(60) + 9,
    'Close': np.random.rand(60) + 10
}, index=date_idx_long)
# Ensure High >= Open/Close and Low <= Open/Close (simplified logic)
df_ts_ohlc['High'] = df_ts_ohlc[['High', 'Open', 'Close']].max(axis=1)
df_ts_ohlc['Low'] = df_ts_ohlc[['Low', 'Open', 'Close']].min(axis=1)

print("\nOriginal Daily OHLC Data (Fake):")
print(df_ts_ohlc.head())

weekly_ohlc = df_ts_ohlc.resample('W').agg({
    'Open': 'first',
    'High': 'max',
    'Low': 'min',
    'Close': 'last'
})
# Alternative simpler syntax for OHLC: .resample('W').ohlc()
# weekly_ohlc_simple = df_ts_ohlc.resample('W').ohlc()

print("\nWeekly OHLC (Resampled):")
print(weekly_ohlc)
# print("\nWeekly OHLC (Simple):") # Output structure is slightly different (MultiIndex cols)
# print(weekly_ohlc_simple)

Upsampling Example:

# Start with the monthly sum data
print("\nMonthly Sum Data:")
print(monthly_sum)

# Upsample monthly data to daily frequency (will create NaNs)
daily_from_monthly_asfreq = monthly_sum.resample('D').asfreq()
print("\nUpsampled to Daily (asfreq - creates NaNs):")
print(daily_from_monthly_asfreq.head(10)) # Show NaNs

# Upsample using forward fill
daily_from_monthly_ffill = monthly_sum.resample('D').ffill()
print("\nUpsampled to Daily (ffill):")
print(daily_from_monthly_ffill.head(10)) # Shows value propagated forward

# Upsample using backward fill
daily_from_monthly_bfill = monthly_sum.resample('D').bfill()
print("\nUpsampled to Daily (bfill):")
print(daily_from_monthly_bfill.tail(10)) # Shows value propagated backward

# Upsampling with interpolation (linear)
# Need numeric data, let's use monthly_mean
print("\nMonthly Mean Data:")
print(monthly_mean)
daily_from_monthly_interp = monthly_mean.resample('D').interpolate(method='linear')
print("\nUpsampled to Daily (linear interpolation):")
# Display around a month transition to see interpolation
print(daily_from_monthly_interp['2022-12-25':'2023-01-05'])

Resampling is essential for aligning time series with different frequencies or for summarizing data over specific time periods (weeks, months, quarters). Choose the aggregation method (sum, mean, ohlc, etc.) carefully for downsampling and the filling/interpolation method (ffill, bfill, interpolate) for upsampling based on the nature of your data and analysis goals.

Workshop Aggregating Time Series Data to Different Frequencies

Goal: Use resampling on the AAPL stock price data to calculate weekly and monthly averages.

Scenario: You have the daily AAPL stock price data (aapl_df) and want to analyze its trends on a weekly and monthly basis by calculating the average closing price for each period.

Steps:

  1. Import Libraries and Prepare Data: Ensure the aapl_df DataFrame with the DatetimeIndex is available from the previous workshop.
    import pandas as pd
    import numpy as np
    import io
    
    # Assuming aapl_df is loaded from the previous workshop
    # If not, recreate/reload it here.
    stock_csv_data = """Date,Open,High,Low,Close,Adj Close,Volume
    2023-01-03,129.27,130.89,124.17,125.07,124.87,112117500
    2023-01-04,126.89,128.66,125.08,126.36,126.16,89113600
    2023-01-05,127.12,127.77,124.76,125.02,124.82,80962700
    2023-01-06,126.01,130.29,124.89,129.62,129.41,87754700
    2023-01-09,130.47,133.41,129.89,130.15,129.94,70790800
    2023-01-10,130.26,131.26,128.12,130.73,130.52,63896200
    2023-01-11,131.25,133.51,130.46,133.49,133.27,69458900
    2023-01-12,133.88,134.26,131.41,133.41,133.19,71379600
    2023-01-13,132.03,134.92,131.66,134.76,134.54,57809700
    2023-01-17,134.83,137.29,134.13,135.94,135.72,63646600
    2023-01-18,136.82,138.61,135.03,135.21,134.99,79572200
    2023-01-19,134.08,136.25,133.77,135.27,135.05,58280400
    2023-01-20,135.28,138.02,134.22,137.87,137.64,80223600
    2023-01-23,138.12,143.32,137.90,141.11,140.87,81760300
    2023-01-24,140.31,143.16,140.30,142.53,142.29,66435100
    2023-01-25,140.89,142.43,138.81,141.86,141.62,65799300
    2023-01-26,143.17,144.25,141.90,143.96,143.72,54115500
    2023-01-27,143.16,147.23,143.08,145.93,145.69,70555800
    2023-01-30,144.96,145.55,142.85,143.00,142.76,64015300
    2023-01-31,142.70,144.34,142.28,144.29,144.05,65874500
    2023-02-01,143.97,146.61,141.32,145.43,145.19,77663600
    2023-02-02,148.90,151.18,148.17,150.82,150.57,118339000
    2023-02-03,148.03,157.38,147.83,154.50,154.24,154279900
    """
    try:
        aapl_df = pd.read_csv(io.StringIO(stock_csv_data),
                              parse_dates=['Date'],
                              index_col='Date')
        print("AAPL stock data loaded successfully.")
        print(aapl_df.head())
    except Exception as e:
        print(f"Error loading data: {e}")
        aapl_df = pd.DataFrame(columns=['Open','High','Low','Close','Adj Close','Volume'])
    
  2. Weekly Resampling:
    • Select the 'Close' price Series.
    • Use .resample('W') to group by week (ends on Sunday by default).
    • Apply the .mean() aggregation function to calculate the average closing price for each week.
    • Print the resulting weekly average closing prices.
      if not aapl_df.empty:
          print("\n--- Weekly Resampling (Average Close) ---")
          weekly_avg_close = aapl_df['Close'].resample('W').mean()
          print(weekly_avg_close)
      else:
          print("\nSkipping weekly resampling as DataFrame is empty.")
      
      Self-Check: Do the index dates represent week-ending dates (Sundays)? Are the values plausible averages of the daily closing prices within those weeks?
  3. Monthly Resampling:
    • Select the 'Close' price Series again.
    • Use .resample('M') to group by month (ends on the last day of the month by default). Or use 'MS' for month-start frequency. Let's use 'M'.
    • Apply the .mean() aggregation function.
    • Print the resulting monthly average closing prices.
      if not aapl_df.empty:
          print("\n--- Monthly Resampling (Average Close) ---")
          monthly_avg_close = aapl_df['Close'].resample('M').mean()
          print(monthly_avg_close)
      else:
          print("\nSkipping monthly resampling as DataFrame is empty.")
      
      Self-Check: Does the index represent month-end dates? Are there fewer data points than the weekly results?
  4. Monthly OHLC: Resample the entire DataFrame to get monthly OHLC data.
    • Use .resample('M') on the whole DataFrame.
    • Apply the .agg() method with appropriate functions for 'Open' ('first'), 'High' ('max'), 'Low' ('min'), 'Close' ('last'), and potentially 'Volume' ('sum').
      if not aapl_df.empty:
          print("\n--- Monthly Resampling (OHLCV) ---")
          monthly_ohlcv = aapl_df.resample('M').agg({
              'Open': 'first',
              'High': 'max',
              'Low': 'min',
              'Close': 'last',
              'Volume': 'sum'
          })
          print(monthly_ohlcv)
      else:
          print("\nSkipping monthly OHLCV resampling as DataFrame is empty.")
      
      Self-Check: Does the resulting DataFrame have month-end dates as the index? Do the 'Open' values correspond to the first trading day's open of the month, 'High' to the maximum high during the month, etc.? Is the 'Volume' the sum of daily volumes for the month?

This workshop provided practical experience in using .resample() to downsample time series data. You calculated weekly and monthly average closing prices and also aggregated daily OHLCV data into monthly OHLCV bars, common tasks in financial data analysis.
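
Though not exercised in this workshop, the rolling-window calculations mentioned at the start of this section follow a very similar chained pattern. A minimal sketch on the daily closing prices; the window length of 5 trading days is illustrative:

if not aapl_df.empty:
    # 5-day rolling (moving) average of the closing price
    rolling_close = aapl_df['Close'].rolling(window=5).mean()
    print(rolling_close.tail())

    # min_periods lets early windows produce values before 5 observations exist
    rolling_close_partial = aapl_df['Close'].rolling(window=5, min_periods=1).mean()
    print(rolling_close_partial.head())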


Conclusion

Throughout this journey into Pandas and NumPy, we have explored the essential tools for data wrangling and analysis in Python. Starting from the Basic level, we grasped the fundamentals of NumPy arrays and the core Pandas structures: Series and DataFrames, learning how to create, inspect, and perform elementary operations on them.

Transitioning to the Intermediate level, we tackled practical, real-world challenges. We learned to efficiently load data from various file formats like CSV, Excel, and JSON, and how to save our processed results. Crucially, we addressed the ubiquitous problem of messy data, developing strategies for identifying and handling missing values (NaN), transforming data using mapping and function application, and detecting and removing duplicate entries.

Finally, in the Advanced level, we delved into more sophisticated techniques. We explored the power of hierarchical indexing (MultiIndex) for managing complex datasets, mastered advanced boolean indexing for intricate data selection, and unlocked the full potential of the GroupBy mechanism for sophisticated aggregation using .agg(). We also learned how to reshape data effectively using pivot_table and melt, and gained specialized skills for time series analysis, including date/time handling, time-based indexing, and resampling data to different frequencies.

The included Workshops provided hands-on practice, guiding you through applying these concepts to tangible scenarios, from analyzing experimental data and managing inventories to cleaning customer records, exploring the Titanic dataset, and summarizing sales and stock price data.

Mastering NumPy and Pandas is not just about learning library functions; it's about adopting a data-centric way of thinking. It involves understanding data structures, recognizing patterns, formulating questions, and efficiently manipulating data to find answers. The skills acquired here form the bedrock for virtually all subsequent data science tasks, including visualization, machine learning, and statistical modeling. Continue practicing, explore different datasets, and challenge yourself with complex data problems – the fluency you develop with Pandas and NumPy will be invaluable in your academic and professional journey.