Author | Nejat Hakan |
License | CC BY-SA 4.0 |
nejat.hakan@outlook.de | |
Data Visualization Storytelling with Matplotlib & Seaborn
Introduction
Welcome to the world of data visualization and storytelling using Python's powerful libraries, Matplotlib and Seaborn, specifically tailored for a Linux environment. In today's data-driven world, simply having data is not enough; the ability to effectively explore, understand, and communicate insights hidden within that data is paramount. Data visualization transforms raw numbers into intuitive graphical representations, making complex information accessible and understandable.
But visualization alone isn't the end goal. True impact comes from Data Storytelling – the art and science of weaving data, visuals, and narrative into a compelling story that drives understanding and action. Think of data as the evidence, visualization as the means of presenting that evidence, and the narrative as the argument or insight you want to convey.
Why Matplotlib and Seaborn?
- Matplotlib: It is the foundational data visualization library in Python. It provides a low-level interface for creating a vast array of static, animated, and interactive plots. Its strength lies in its flexibility and control over virtually every aspect of a figure. Mastering Matplotlib gives you the power to create highly customized visualizations.
- Seaborn: Built on top of Matplotlib, Seaborn provides a high-level interface specifically designed for creating attractive and informative statistical graphics. It excels at visualizing complex datasets, revealing patterns and relationships through sophisticated plot types with less code. It integrates seamlessly with Pandas DataFrames, a standard data structure in data analysis.
Our Goal:
This guide aims to equip you, as university students, with the knowledge and practical skills to not only create technically correct visualizations with Matplotlib and Seaborn but also to use them effectively to tell compelling stories with data. We will progress from basic plotting concepts to advanced customization and statistical visualization techniques, culminating in the ability to craft narratives that resonate with your audience. We assume you have a working Python environment set up on your Linux system and are familiar with basic Python syntax and data structures like lists and dictionaries. Familiarity with NumPy and Pandas will be highly beneficial, especially for the intermediate and advanced sections.
Setup in Linux:
Before we begin, ensure you have the necessary libraries installed. Open your Linux terminal and use pip (Python's package installer):
pip install matplotlib seaborn pandas numpy
If you are using Anaconda/Miniconda, you can use conda:
conda install matplotlib seaborn pandas numpy
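Once the installation has finished, a quick import check can confirm that everything is in place. The following is a minimal sketch, assuming the four libraries used throughout this guide (Matplotlib, NumPy, Pandas, Seaborn); the `versions` dictionary is just an illustrative name.

```python
# Sanity check: import each library used in this guide and report its version.
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns

versions = {
    "matplotlib": matplotlib.__version__,
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "seaborn": sns.__version__,
}

for name, version in versions.items():
    print(f"{name:<12}{version}")
```

If any import fails with a ModuleNotFoundError, the corresponding package is missing from the active Python environment.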
Now, let's embark on our journey to becoming effective data visualization storytellers!
1. Foundations of Matplotlib
Matplotlib is the cornerstone of the Python visualization landscape. Understanding its fundamental concepts is crucial before building more complex plots or using higher-level libraries like Seaborn. It offers fine-grained control, allowing you to tailor every element of your visualization.
Core Components Anatomy of a Plot
To effectively use Matplotlib, you need to understand its main components. Think of it like learning the anatomy of a drawing canvas:
- Figure: The outermost container for everything. It's the overall window or page that everything is drawn on. You can have multiple independent Figures. A Figure can contain one or more Axes.
- Axes: This is what you typically think of as 'a plot'. It's the region of the Figure where data is plotted with x-axis, y-axis (or other coordinates), labels, ticks, etc. A Figure can contain multiple Axes objects, arranged in grids or placed freely. Don't confuse Axes (the plotting area) with Axis (the number-line-like objects).
- Axis: These are the number-line-like objects that determine the graph limits. They handle the data limits (which can be controlled via set_xlim() and set_ylim()) and generate the ticks and tick labels. An Axes object typically has an x-axis and a y-axis.
- Ticks: These are the markers denoting specific points on an Axis. There are major ticks and minor ticks.
- Tick Labels: The string labels associated with the ticks (e.g., '0', '5', '10').
- Labels: Descriptive text for the x-axis (xlabel) and y-axis (ylabel).
- Title: A descriptive title for the Axes (the plot).
- Legend: A guide that explains the mapping of visual properties (like color or marker style) to data series. Essential when plotting multiple datasets on the same Axes.
- Artist: Essentially, everything you see on the Figure is an Artist object. This includes Text objects, Line2D objects, Collection objects, Patch objects, etc. Most plotting functions return Artist objects. When you use plt.plot(), it creates Line2D artists within the current Axes.
Understanding this hierarchy (Figure contains Axes, Axes contain Axis, and various Artists like lines, text, etc.) is key to customizing plots effectively using Matplotlib's object-oriented approach, which we'll explore later.
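The hierarchy can be inspected directly in code. The sketch below is illustrative only: it uses plt.subplots() to obtain a Figure and an Axes explicitly, then checks how the pieces relate. The data values and labels are made up for the demonstration.

```python
import matplotlib.pyplot as plt
from matplotlib.axis import XAxis
from matplotlib.lines import Line2D

# One Figure containing one Axes
fig, ax = plt.subplots()

# ax.plot() returns a list of Line2D artists; unpack the single line
(line,) = ax.plot([1, 2, 3], [2, 4, 3], label="demo")

# Data limits and labels belong to the Axes and its Axis objects
ax.set_xlim(0, 4)
ax.set_xlabel("x")
ax.set_title("Anatomy demo")

print("Figure contains this Axes:", ax in fig.axes)
print("Axes has an x-axis object:", isinstance(ax.xaxis, XAxis))
print("plot() created a Line2D:", isinstance(line, Line2D))
print("Line is a child Artist of the Axes:", line in ax.get_children())
```

Working with the fig and ax objects like this is the object-oriented style; the plt.* functions used in the next examples act on the "current" Figure and Axes behind the scenes.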
Your First Plot Line Plots
The most basic and common plot is the line plot, typically used to show trends over a continuous interval or sequence, like time. Matplotlib's pyplot module provides a simple interface for creating plots quickly.
Let's create a simple line plot showing hypothetical temperature changes over a week.
import matplotlib.pyplot as plt
import numpy as np # We often use NumPy for numerical data
# Sample data: Days and corresponding temperatures
days = np.arange(1, 8) # Days 1 through 7
temperatures_celsius = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
# Create the plot
plt.plot(days, temperatures_celsius)
# Add basic labels and title for context
plt.xlabel("Day of the Week")
plt.ylabel("Temperature (°C)")
plt.title("Weekly Temperature Trend")
# Display the plot
plt.show()
Explanation:
- import matplotlib.pyplot as plt: Imports the pyplot module, conventionally aliased as plt. This provides functions for creating figures, axes, and plotting data.
- import numpy as np: Imports NumPy for easy creation of numerical sequences (np.arange).
- plt.plot(days, temperatures_celsius): This is the core plotting command. It takes x-values (days) and y-values (temperatures_celsius) and plots them as points connected by lines. By default, it uses a solid blue line.
- plt.xlabel(...), plt.ylabel(...), plt.title(...): These functions add descriptive text to the plot, making it understandable.
- plt.show(): This function displays the plot window. In some environments like Jupyter notebooks, plots might render automatically, but plt.show() is generally needed in scripts.
Basic Customization:
You can easily customize the appearance:
import matplotlib.pyplot as plt
import numpy as np
days = np.arange(1, 8)
temperatures_celsius = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
# Customize color, linestyle, and add markers
plt.plot(days, temperatures_celsius,
color='red', # Set line color
linestyle='--', # Use a dashed line ('-', '--', '-.', ':')
marker='o') # Add circular markers ('o', 's', '^', 'x', '*')
plt.xlabel("Day of the Week")
plt.ylabel("Temperature (°C)")
plt.title("Weekly Temperature Trend (Customized)")
plt.grid(True) # Add a grid for easier reading
plt.show()
Here, we added arguments to plt.plot() to change the color, line style, and add markers at each data point. plt.grid(True) adds a background grid.
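Matplotlib also accepts a compact format string that combines color, marker, and line style in one argument. The sketch below reuses the temperature data and is meant as an equivalent shorthand for the keyword arguments above, not a replacement for them.

```python
import matplotlib.pyplot as plt
import numpy as np

days = np.arange(1, 8)
temperatures_celsius = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]

fig, ax = plt.subplots()

# "r--o" means: red ('r'), dashed line ('--'), circular markers ('o')
(line,) = ax.plot(days, temperatures_celsius, "r--o")

ax.set_xlabel("Day of the Week")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Weekly Temperature Trend (Format String)")
ax.grid(True)

# Inspect what the format string was translated into
print(line.get_linestyle(), line.get_marker())
```

The explicit keyword form (color=..., linestyle=..., marker=...) is usually easier to read in longer scripts; the format string is handy for quick exploration.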
Common Plot Types Scatter Plots and Bar Charts
Beyond line plots, Matplotlib supports many other fundamental chart types.
Scatter Plots (plt.scatter()):
Used to visualize the relationship or correlation between two numerical variables. Each point represents an observation.
import matplotlib.pyplot as plt
import numpy as np
# Sample data: Study hours and corresponding exam scores
study_hours = np.array([2, 3, 5, 1, 6, 4, 7, 3.5])
exam_scores = np.array([65, 70, 85, 50, 90, 75, 95, 72])
plt.scatter(study_hours, exam_scores, color='green', marker='^')
plt.xlabel("Hours Studied")
plt.ylabel("Exam Score")
plt.title("Relationship between Study Hours and Exam Scores")
plt.grid(True, linestyle=':', alpha=0.7) # Customize grid
plt.show()
plt.scatter() plots individual points. We can see a potential positive correlation – more study hours tend to correspond to higher scores. We also customized the grid to be dotted and slightly transparent (alpha).
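The visual impression of a positive relationship can be backed up with a number. As a small sketch, the Pearson correlation coefficient for the same sample data can be computed with NumPy's np.corrcoef(); the variable name r is an illustrative choice.

```python
import numpy as np

# Same sample data as in the scatter plot
study_hours = np.array([2, 3, 5, 1, 6, 4, 7, 3.5])
exam_scores = np.array([65, 70, 85, 50, 90, 75, 95, 72])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is the correlation between the two variables
r = np.corrcoef(study_hours, exam_scores)[0, 1]

print(f"Pearson correlation: {r:.3f}")
```

A value close to +1 indicates a strong positive linear relationship, which matches what the scatter plot suggests. With only eight hypothetical observations, this is an illustration rather than evidence.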
Bar Charts (plt.bar(), plt.barh()):
Used to compare quantities across different categories. plt.bar() creates vertical bars, and plt.barh() creates horizontal bars.
import matplotlib.pyplot as plt
# Sample data: Programming language popularity
languages = ['Python', 'JavaScript', 'Java', 'C#', 'C++']
popularity = [31.5, 28.0, 16.8, 7.5, 6.2] # Hypothetical percentages
plt.figure(figsize=(8, 5)) # Control the figure size (width, height in inches)
plt.bar(languages, popularity, color=['blue', 'orange', 'green', 'red', 'purple'])
plt.xlabel("Programming Language")
plt.ylabel("Popularity (%)")
plt.title("Programming Language Popularity Survey")
plt.ylim(0, 35) # Set y-axis limits for better perspective
plt.xticks(rotation=45, ha='right') # Rotate x-axis labels for readability
plt.tight_layout() # Adjust layout to prevent labels overlapping
plt.show()
# --- Horizontal Bar Chart ---
plt.figure(figsize=(8, 5))
plt.barh(languages, popularity, color='skyblue') # Horizontal bars
plt.xlabel("Popularity (%)")
plt.ylabel("Programming Language") # Note the axis label change
plt.title("Programming Language Popularity Survey (Horizontal)")
plt.xlim(0, 35)
plt.gca().invert_yaxis() # Optional: Display highest popularity at the top
plt.tight_layout()
plt.show()
Key points:
- plt.figure(figsize=(...)): Creates a new Figure and allows specifying its size.
- plt.bar() takes categories (here, languages) and corresponding values (popularity).
- We can pass a list of colors to the color argument to color each bar individually.
- plt.ylim() / plt.xlim(): Control the range of the axes.
- plt.xticks(rotation=..., ha=...): Useful for long category names to prevent overlap. ha controls horizontal alignment.
- plt.tight_layout(): Automatically adjusts subplot parameters for a tight layout.
- plt.barh() works similarly but swaps the role of x and y. plt.gca().invert_yaxis() is often used with horizontal bars to put the "first" category at the top.
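Bar charts are often easier to read when each bar carries its value. The following sketch assumes Matplotlib 3.4 or newer, where Axes.bar_label() is available; it reuses the hypothetical popularity figures from above.

```python
import matplotlib.pyplot as plt

languages = ['Python', 'JavaScript', 'Java', 'C#', 'C++']
popularity = [31.5, 28.0, 16.8, 7.5, 6.2]  # Hypothetical percentages

fig, ax = plt.subplots(figsize=(8, 5))

# ax.bar() returns a container holding the bar rectangles
bars = ax.bar(languages, popularity, color='steelblue')

# Write each value above its bar; padding is the gap in points
labels = ax.bar_label(bars, fmt='%.1f%%', padding=3)

ax.set_xlabel("Programming Language")
ax.set_ylabel("Popularity (%)")
ax.set_title("Programming Language Popularity Survey (Labeled Bars)")
ax.set_ylim(0, 35)  # Leave headroom for the labels
fig.tight_layout()
```

On older Matplotlib versions the same effect needs a loop over the bars with ax.text(), so the version assumption matters here.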
Customizing Plots Labels Titles and Legends
Clear labels, titles, and legends are essential for making plots self-explanatory. We've already used xlabel, ylabel, and title. Let's look at adding a legend when plotting multiple lines.
import matplotlib.pyplot as plt
import numpy as np
# Sample data for two cities
days = np.arange(1, 8)
temp_city_a = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
temp_city_b = [12.1, 13.5, 13.0, 14.8, 16.2, 15.9, 15.5]
# Plot data for both cities, adding a 'label' for each plot
plt.plot(days, temp_city_a, marker='o', linestyle='-', label='City A')
plt.plot(days, temp_city_b, marker='s', linestyle='--', label='City B')
# Add labels and title
plt.xlabel("Day")
plt.ylabel("Temperature (°C)")
plt.title("Temperature Comparison: City A vs City B")
# Add a legend - Matplotlib uses the 'label' arguments from plot()
plt.legend()
# You can customize legend location: plt.legend(loc='upper left')
plt.grid(True)
plt.show()
The key is the label='...' argument within each plt.plot() call. Then, plt.legend() automatically creates the legend using these labels.
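The legend itself can be adjusted through a few common keyword arguments. The sketch below uses the object-oriented interface with the same two-city data; the legend title "Location" is an illustrative choice, not part of the original example.

```python
import matplotlib.pyplot as plt
import numpy as np

days = np.arange(1, 8)
temp_city_a = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
temp_city_b = [12.1, 13.5, 13.0, 14.8, 16.2, 15.9, 15.5]

fig, ax = plt.subplots()
ax.plot(days, temp_city_a, marker='o', linestyle='-', label='City A')
ax.plot(days, temp_city_b, marker='s', linestyle='--', label='City B')

# loc: where the legend sits; title: heading inside the legend box;
# frameon=False: drop the surrounding frame for a lighter look
legend = ax.legend(loc='upper left', title='Location', frameon=False)

ax.set_xlabel("Day")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Temperature Comparison: City A vs City B")
ax.grid(True)
```

Lines plotted without a label are left out of the legend, which is a simple way to keep helper lines from cluttering it.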
Saving Plots
Once you've created a plot, you'll often want to save it to a file (e.g., for inclusion in reports or presentations).
import matplotlib.pyplot as plt
import numpy as np
days = np.arange(1, 8)
temperatures_celsius = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
plt.plot(days, temperatures_celsius, marker='o')
plt.xlabel("Day of the Week")
plt.ylabel("Temperature (°C)")
plt.title("Weekly Temperature Trend")
plt.grid(True)
# Save the figure before showing it
# Specify the path (relative or absolute in Linux) and format
# Common formats: png, jpg, svg, pdf
plt.savefig('weekly_temperature_trend.png', dpi=300) # Save as PNG with high resolution
# plt.savefig('/home/user/Documents/plots/weekly_temp.pdf') # Example absolute path
# You can still show the plot after saving if needed
# plt.show()
Key points for plt.savefig():
- Call savefig() before plt.show(). In many backends, plt.show() clears the figure after displaying it, so a savefig() call made afterwards may write an empty image.
- The file format is determined by the extension (e.g., .png, .pdf, .svg).
- dpi (dots per inch) controls the resolution for raster formats like PNG and JPG. Higher values (e.g., 300 or 600) are better for print quality.
- Vector formats like SVG and PDF are resolution-independent and often preferred for publications as they scale perfectly.
- Provide a valid Linux file path (relative or absolute).
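To illustrate these points, the same figure can be written to several formats in one go. This is a minimal sketch: the output directory is a temporary folder created with tempfile (a stand-in for wherever you keep your plots), and bbox_inches="tight" is an optional argument that trims surrounding whitespace.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

days = np.arange(1, 8)
temperatures_celsius = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]

fig, ax = plt.subplots()
ax.plot(days, temperatures_celsius, marker='o')
ax.set_xlabel("Day of the Week")
ax.set_ylabel("Temperature (°C)")
ax.set_title("Weekly Temperature Trend")
ax.grid(True)

# Hypothetical output location: a fresh temporary directory
out_dir = Path(tempfile.mkdtemp())

saved_files = []
for extension in ("png", "pdf", "svg"):
    path = out_dir / f"weekly_temperature_trend.{extension}"
    # The extension selects the format; dpi only matters for the PNG here
    fig.savefig(path, dpi=300, bbox_inches="tight")
    saved_files.append(path)

plt.close(fig)  # Free the figure once it has been written

for path in saved_files:
    print(path, path.stat().st_size, "bytes")
```

Calling fig.savefig() on the Figure object avoids any ambiguity about which figure is "current" when a script creates several.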
Workshop Basic Plotting Exploration
Goal: Create and customize basic Matplotlib plots using a small example dataset: average monthly rainfall in a city.
Dataset: We'll use hypothetical average monthly rainfall data for London, UK (in mm).
# Data for the workshop
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
avg_rainfall_mm = [55.2, 40.9, 41.6, 43.7, 49.4, 45.1, 44.5, 49.5, 49.1, 68.5, 59.0, 55.2]
Steps:
1. Setup:
    - Create a new Python file (e.g., london_rainfall.py) in your preferred directory on your Linux system.
    - Import matplotlib.pyplot as plt.
    - Define the months and avg_rainfall_mm lists as shown above.
2. Create a Line Plot:
    - Use plt.plot() to visualize the average rainfall throughout the year.
    - Add appropriate xlabel ("Month"), ylabel ("Average Rainfall (mm)"), and title ("Average Monthly Rainfall in London").
    - Add markers (e.g., 'x') to the line plot.
    - Add a grid for better readability.
    - Use plt.xticks(rotation=45) to make the month labels clearer.
    - Use plt.tight_layout() to adjust spacing.
    - Display the plot using plt.show().
# Step 1: Setup
import matplotlib.pyplot as plt
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
avg_rainfall_mm = [55.2, 40.9, 41.6, 43.7, 49.4, 45.1, 44.5, 49.5, 49.1, 68.5, 59.0, 55.2]
# Step 2: Create and Customize Line Plot
plt.figure(figsize=(10, 6)) # Make figure a bit larger
plt.plot(months, avg_rainfall_mm, marker='x', color='dodgerblue', linestyle='-')
plt.xlabel("Month")
plt.ylabel("Average Rainfall (mm)")
plt.title("Average Monthly Rainfall in London (Line Plot)")
plt.grid(True, linestyle='--', alpha=0.6)
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()
3. Create a Bar Chart:
    - Now, represent the same data using a vertical bar chart (plt.bar()).
    - Create a new figure using plt.figure(figsize=(...)) to avoid drawing over the previous plot if running interactively.
    - Use plt.bar() with months and avg_rainfall_mm.
    - Assign a color (e.g., 'lightblue').
    - Add the same labels and title as before (adjusting the title slightly, e.g., "Average Monthly Rainfall in London (Bar Chart)").
    - Rotate x-axis labels as needed.
    - Set appropriate y-axis limits using plt.ylim() (e.g., from 0 to slightly above the maximum rainfall) to provide context.
    - Use plt.tight_layout().
    - Display the plot.
# Step 3: Create and Customize Bar Chart
plt.figure(figsize=(10, 6))
plt.bar(months, avg_rainfall_mm, color='lightblue')
plt.xlabel("Month")
plt.ylabel("Average Rainfall (mm)")
plt.title("Average Monthly Rainfall in London (Bar Chart)")
plt.ylim(0, max(avg_rainfall_mm) + 10) # Set ylim from 0 to max+10
plt.xticks(rotation=45, ha='right')
plt.grid(True, axis='y', linestyle=':', alpha=0.7) # Grid lines only on y-axis
plt.tight_layout()
plt.show()
4. Save the Bar Chart:
    - Before the plt.show() command for the bar chart, add a line to save the figure as a PDF file named london_rainfall_bar.pdf. No special quality setting is needed: PDF is a vector format, so dpi only affects rasterized elements, and Matplotlib embeds the fonts in the file by default.
# Step 3 (continued): Create and Customize Bar Chart
plt.figure(figsize=(10, 6))
plt.bar(months, avg_rainfall_mm, color='lightblue')
plt.xlabel("Month")
plt.ylabel("Average Rainfall (mm)")
plt.title("Average Monthly Rainfall in London (Bar Chart)")
plt.ylim(0, max(avg_rainfall_mm) + 10)
plt.xticks(rotation=45, ha='right')
plt.grid(True, axis='y', linestyle=':', alpha=0.7)
plt.tight_layout()
# Step 4: Save the Bar Chart
plt.savefig('london_rainfall_bar.pdf')
print("Bar chart saved as london_rainfall_bar.pdf") # Optional confirmation
plt.show()
5. Run the Script:
    - Open your Linux terminal, navigate to the directory where you saved london_rainfall.py, and run it: python london_rainfall.py (use python3 on distributions that do not provide a python command).
    - You should see two plot windows appear sequentially, and a PDF file london_rainfall_bar.pdf should be created in the same directory.
This workshop provides hands-on practice with creating basic line and bar plots, customizing their appearance with labels, titles, colors, markers, and grids, and saving the results – fundamental skills for any data visualization task.
2. Introduction to Seaborn Simplifying Visualization
While Matplotlib provides ultimate control, it can sometimes be verbose for creating common statistical plots. Seaborn enters the picture as a high-level library built on top of Matplotlib. Its primary goal is to make creating informative and attractive statistical graphics easier and more intuitive, especially when working with Pandas DataFrames.
Think of Seaborn as a specialist chef who uses Matplotlib's kitchen (tools and infrastructure) to quickly prepare beautiful and standardized dishes (statistical plots).
Relationship with Matplotlib:
Seaborn functions often call Matplotlib functions internally. This means:
- You can use Matplotlib commands to customize Seaborn plots after they are created.
- Seaborn plots are ultimately drawn onto Matplotlib Axes, fitting into the Figure/Axes structure.
- Knowledge of Matplotlib basics helps in understanding and fine-tuning Seaborn plots.
Seaborn's Strengths High Level Interface and Aesthetics
Seaborn shines in several areas:
- High-Level Functions: Provides functions for specific statistical plot types (like distribution plots, categorical plots, regression plots) that might require many lines of Matplotlib code.
- Pandas DataFrame Integration: Designed to work seamlessly with Pandas DataFrames. You often just specify the DataFrame and the column names for x, y, hue, etc.
- Statistical Estimation: Many Seaborn plots automatically perform necessary statistical aggregation or estimation (e.g., calculating means and confidence intervals for bar plots, fitting regression lines).
- Attractive Default Styles and Palettes: Comes with several built-in themes and color palettes that significantly improve the default appearance of plots compared to base Matplotlib.
Let's see how to apply a Seaborn theme:
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd # Seaborn works best with Pandas DataFrames
# Apply a Seaborn theme (affects subsequent Matplotlib and Seaborn plots)
sns.set_theme(style="darkgrid", palette="viridis") # Examples: "whitegrid", "dark", "ticks"
# Palettes: "rocket", "magma", "deep", "muted"
# Recreate the temperature plot from before - notice the style difference
days = np.arange(1, 8)
temperatures_celsius = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
plt.plot(days, temperatures_celsius, marker='o') # Still using Matplotlib
plt.xlabel("Day of the Week")
plt.ylabel("Temperature (°C)")
plt.title("Weekly Temperature Trend (Seaborn Theme Applied)")
plt.show()
# Let's use a Seaborn function directly with a DataFrame
data = {'Day': days, 'Temperature': temperatures_celsius}
df = pd.DataFrame(data)
# Use Seaborn's lineplot
sns.lineplot(x='Day', y='Temperature', data=df, marker='o')
plt.title("Weekly Temperature Trend (Seaborn lineplot)")
plt.show()
# Reset to default Matplotlib styles if needed later
# sns.reset_defaults()
Notice how sns.set_theme() instantly changes the look (background grid, font, default colors). The sns.lineplot() function achieves a similar result to plt.plot() but is designed to work directly with DataFrame columns. When there are multiple observations per x-value, it aggregates them (using the mean by default) and draws a confidence interval band around the line.
Creating Statistical Plots with Seaborn
Seaborn excels at quickly generating insightful statistical visualizations. Let's explore some common categories using a built-in dataset. Seaborn comes with several sample datasets; the 'tips' dataset is a classic example, recording tips given in a restaurant.
import matplotlib.pyplot as plt
import seaborn as sns
# Load a built-in dataset
tips = sns.load_dataset("tips")
# Display the first few rows to understand the data
print("Tips Dataset Head:")
print(tips.head())
# Columns: total_bill, tip, sex, smoker, day, time, size
# --- Relational Plots ---
# Scatter plot to see relationship between total bill and tip
plt.figure(figsize=(8, 6))
sns.scatterplot(x="total_bill", y="tip", data=tips)
plt.title("Total Bill vs Tip Amount")
plt.show()
# Add semantics using 'hue' (color based on a category)
plt.figure(figsize=(8, 6))
sns.scatterplot(x="total_bill", y="tip", hue="smoker", data=tips)
plt.title("Total Bill vs Tip Amount (Color by Smoker Status)")
plt.show()
# Use 'size' semantic for another variable (less common, can get cluttered)
plt.figure(figsize=(8, 6))
sns.scatterplot(x="total_bill", y="tip", hue="time", size="size", data=tips, sizes=(20, 200))
plt.title("Total Bill vs Tip (Hue=Time, Size=Party Size)")
plt.show()
# --- Distribution Plots ---
# Histogram of total bills
plt.figure(figsize=(8, 6))
sns.histplot(data=tips, x="total_bill", bins=20, kde=True) # Add Kernel Density Estimate
plt.title("Distribution of Total Bills")
plt.show()
# Kernel Density Estimate plot
plt.figure(figsize=(8, 6))
sns.kdeplot(data=tips, x="tip", fill=True) # Shaded KDE
plt.title("Distribution of Tip Amounts")
plt.show()
# Box plot to compare tip distributions by day
plt.figure(figsize=(8, 6))
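# Note: seaborn 0.13+ warns when palette is used without hue; there, add hue="day", legend=False
# (the same applies to the barplot call further down)
sns.boxplot(x="day", y="tip", data=tips, palette="pastel")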
plt.title("Tip Distribution by Day of the Week (Box Plot)")
plt.show()
# Violin plot (combines box plot and KDE)
plt.figure(figsize=(8, 6))
sns.violinplot(x="day", y="tip", data=tips, hue="sex", split=True, palette="muted")
plt.title("Tip Distribution by Day and Sex (Violin Plot)")
plt.show()
# --- Categorical Plots ---
# Bar plot showing average total bill per day (default is mean)
plt.figure(figsize=(8, 6))
# Note: Seaborn barplot automatically calculates mean and shows confidence interval
sns.barplot(x="day", y="total_bill", data=tips, palette="bright", errorbar="sd") # Show standard deviation instead of CI
plt.title("Average Total Bill by Day")
plt.show()
# Count plot showing number of observations per category
plt.figure(figsize=(8, 6))
sns.countplot(x="day", data=tips, hue="time", palette="Set2")
plt.title("Count of Visits per Day (Split by Time)")
plt.show()
# Strip plot (scatter plot for categorical data)
plt.figure(figsize=(8, 6))
sns.stripplot(x="day", y="tip", data=tips, jitter=True, alpha=0.7) # Jitter avoids overlap
plt.title("Individual Tips by Day (Strip Plot)")
plt.show()
# Swarm plot (similar to strip plot, avoids overlap better but doesn't scale to large N)
plt.figure(figsize=(8, 6))
sns.swarmplot(x="day", y="tip", data=tips, hue="smoker", dodge=True, size=4) # dodge separates hues
plt.title("Individual Tips by Day (Swarm Plot, Colored by Smoker)")
plt.show()
Key Takeaways:
- Seaborn functions often take data (a DataFrame) and x, y, hue, style, size arguments referring to column names.
- hue is extremely useful for comparing distributions or relationships across categories.
- Distribution plots (histplot, kdeplot, boxplot, violinplot) help understand the spread and shape of data.
- Categorical plots (barplot, countplot, stripplot, swarmplot) are designed for visualizing data grouped by categories.
- Many Seaborn plots automatically handle statistical calculations (e.g., mean, confidence intervals in barplot).
Basic Customization in Seaborn
While Seaborn provides great defaults, you can customize plots further:
- Arguments within Seaborn functions: Most functions have parameters for color, palette, marker, linestyle, etc. Explore the documentation for specific functions.
- Using Matplotlib: Since Seaborn plots on Matplotlib Axes, you can get the Axes object and use Matplotlib functions for fine-tuning (titles, labels, limits, annotations).
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")
# Example: Customize a boxplot using Seaborn arguments and Matplotlib
plt.figure(figsize=(9, 6))
# Use Seaborn function with specific palette and line width
ax = sns.boxplot(x="day", y="tip", data=tips,
palette="coolwarm", # Change color palette (seaborn 0.13+: also pass hue="day", legend=False)
linewidth=1.5, # Make lines thicker
order=['Thur', 'Fri', 'Sat', 'Sun']) # Control category order
# Use Matplotlib functions on the returned Axes (ax)
ax.set_title("Customized Tip Distribution by Day", fontsize=16)
ax.set_xlabel("Day of the Week", fontsize=12)
ax.set_ylabel("Tip Amount ($)", fontsize=12)
ax.set_ylim(0, 11) # Adjust y-axis limits
ax.grid(axis='y', linestyle='--', alpha=0.7) # Add horizontal grid lines
plt.show()
Here, sns.boxplot() returns the Matplotlib Axes object (ax). We then use ax.set_title(), ax.set_xlabel(), etc., just like we would with a plot created directly with Matplotlib. This combination gives both ease-of-use and fine control.
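The link also works in the other direction: axes-level Seaborn functions accept an ax argument, so you can create the Axes with Matplotlib first and tell Seaborn where to draw. The sketch below uses a small hypothetical DataFrame of tips to place two Seaborn plots side by side.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical tip amounts, four per day
df = pd.DataFrame({
    "day": ["Thur"] * 4 + ["Fri"] * 4 + ["Sat"] * 4,
    "tip": [2.0, 3.0, 2.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 5.0, 4.0, 6.5],
})

# Create the Figure and two Axes with Matplotlib
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))

# Pass each Axes to a Seaborn function via ax=...
returned_ax = sns.boxplot(data=df, x="day", y="tip", ax=ax_left)
sns.stripplot(data=df, x="day", y="tip", ax=ax_right)

# Fine-tune each Axes with ordinary Matplotlib methods
ax_left.set_title("Box Plot")
ax_right.set_title("Strip Plot")
ax_right.set_ylabel("")  # Shared meaning, so label only the left plot
fig.tight_layout()

print("Seaborn returned the Axes we passed in:", returned_ax is ax_left)
```

This pattern is useful whenever a Seaborn plot needs to sit inside a layout that Matplotlib controls.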
Workshop Visualizing Dataset Distributions
Goal: Use Seaborn to explore the distributions and relationships within the built-in 'penguins' dataset.
Dataset: The 'penguins' dataset contains measurements for different penguin species.
# Data for the workshop - load the dataset
import seaborn as sns
import matplotlib.pyplot as plt
penguins = sns.load_dataset("penguins")
# Explore the data
print("Penguins Dataset Info:")
penguins.info()
print("\nPenguins Dataset Head:")
print(penguins.head())
# Columns: species, island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex
Steps:
1. Setup:
    - Create a new Python file (e.g., penguin_viz.py).
    - Import seaborn as sns and matplotlib.pyplot as plt.
    - Load the penguins dataset using sns.load_dataset("penguins").
    - Print the .info() and .head() of the DataFrame to understand its structure and potential missing values.
    - Set a Seaborn theme you like (e.g., sns.set_theme(style="ticks", palette="muted")).
2. Visualize Single Variable Distributions:
    - Create a histplot showing the distribution of flipper_length_mm. Add a KDE overlay (kde=True). Add an informative title.
    - Create a kdeplot showing the distribution of body_mass_g, separated by species using the hue parameter. Use fill=True for shaded densities. Add an informative title.
# Step 1: Setup
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd # Good practice to import pandas
penguins = sns.load_dataset("penguins")
print("Penguins Dataset Info:")
penguins.info() # Note potential missing values in 'sex'
print("\nPenguins Dataset Head:")
print(penguins.head())
# Drop rows with missing values for simplicity in this workshop
penguins = penguins.dropna()
print("\nPenguins Dataset Info after dropna():")
penguins.info() # Verify missing values are handled
sns.set_theme(style="ticks", palette="muted")
# Step 2: Single Variable Distributions
plt.figure(figsize=(8, 5))
sns.histplot(data=penguins, x="flipper_length_mm", kde=True)
plt.title("Distribution of Penguin Flipper Lengths")
plt.show()
plt.figure(figsize=(8, 5))
sns.kdeplot(data=penguins, x="body_mass_g", hue="species", fill=True)
plt.title("Distribution of Penguin Body Mass by Species")
plt.show()
3. Visualize Relationships Between Variables:
    - Create a scatterplot to explore the relationship between bill_length_mm and bill_depth_mm. Color the points by species using the hue parameter. Add an informative title.
    - Can you observe different clusters for different species?
4. Visualize Categorical Data:
    - Create a boxplot comparing the flipper_length_mm across the different species. Add an informative title.
    - Create a countplot showing the number of penguins observed on each island. Use hue="species" to see the species distribution per island. Add an informative title.
# Step 4: Categorical Data Visualization
plt.figure(figsize=(8, 6))
sns.boxplot(data=penguins, x="species", y="flipper_length_mm")
plt.title("Flipper Length Distribution by Species")
plt.show()
plt.figure(figsize=(8, 6))
sns.countplot(data=penguins, x="island", hue="species")
plt.title("Penguin Count per Island by Species")
plt.show()
5. Save a Plot:
    - Choose one of the plots you created (e.g., the scatter plot from Step 3).
    - Before its plt.show() command, add plt.savefig('penguin_bill_dimensions.png', dpi=200).
# Step 3 (modified to include saving)
plt.figure(figsize=(9, 6))
sns.scatterplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")
plt.title("Bill Length vs Bill Depth by Species")
plt.grid(True, linestyle=':', alpha=0.5)
# Step 5: Save the Plot
plt.savefig('penguin_bill_dimensions.png', dpi=200)
print("Scatter plot saved as penguin_bill_dimensions.png")
plt.show()
6. Run the Script:
    - Execute your penguin_viz.py script from the Linux terminal: python penguin_viz.py.
    - Observe the generated plots and the saved PNG file. Analyze what each plot tells you about the penguin dataset.
This workshop demonstrates how Seaborn simplifies the creation of common statistical plots, allowing you to quickly explore distributions, relationships, and categorical comparisons within a dataset, often with just a single line of code per plot.
3. Enhancing Visualizations for Clarity
Creating a basic plot is often just the first step. To make visualizations truly effective and communicate insights clearly, we need to enhance them. This involves careful customization of aesthetics like colors and styles, thoughtful arrangement of multiple plots, and adding context through annotations. It also requires choosing the most appropriate plot type for the data and the message you want to convey.
Advanced Customization Colors Styles and Annotations
Beyond basic color and linestyle arguments, Matplotlib and Seaborn offer extensive customization options.
Color Palettes and Colormaps:
- Seaborn Palettes: Seaborn makes using well-designed color palettes easy.
    - Qualitative palettes: For distinct categories (e.g., Set1, Pastel1, tab10).
    - Sequential palettes: For numerical data where order matters, showing progression (e.g., Blues, Greens, viridis, magma).
    - Diverging palettes: For numerical data where the midpoint is meaningful, highlighting deviations in two directions (e.g., coolwarm, RdBu, PiYG).
    - Use sns.color_palette("palette_name", n_colors=...) to get a list of colors. Many Seaborn functions accept a palette argument directly.
- Matplotlib Colormaps: Matplotlib has a wide range of colormaps accessible via plt.get_cmap("cmap_name"). These are often used in plots like heatmaps or contour plots, or to manually color elements.
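It helps to see what these two calls actually return. As a minimal sketch: sns.color_palette() gives a list of RGB tuples, while a Matplotlib colormap is a callable that maps a number between 0 and 1 to an RGBA tuple. The bar categories and values below are made up for the demonstration.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# A Seaborn palette is a list of (r, g, b) tuples
qualitative = sns.color_palette("Set1", n_colors=4)
sequential = sns.color_palette("viridis", n_colors=4)
print("First 'Set1' color (RGB):", qualitative[0])

# A Matplotlib colormap maps values in [0, 1] to (r, g, b, a) tuples
cmap = plt.get_cmap("coolwarm")
midpoint_rgba = cmap(0.5)
print("Midpoint of 'coolwarm' (RGBA):", midpoint_rgba)

# Manually color elements by sampling the colormap at chosen positions
bar_colors = [cmap(position) for position in (0.0, 0.33, 0.66, 1.0)]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(["A", "B", "C", "D"], [3, 5, 2, 4], color=bar_colors)
ax.set_title("Bars Colored by Sampling a Colormap")
```

Sampling a diverging colormap like this only makes sense when the positions carry meaning (for example, values below and above a midpoint); for unordered categories a qualitative palette is the safer choice.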
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
# Example using Seaborn palettes
tips = sns.load_dataset("tips")
plt.figure(figsize=(10, 6))
sns.stripplot(x="day", y="total_bill", data=tips, hue="sex", palette="Set1", dodge=True)
plt.title("Total Bill by Day (Seaborn 'Set1' Palette)")
plt.show()
# Example generating colors from a palette
num_categories = len(tips['day'].unique())
custom_palette = sns.color_palette("viridis", n_colors=num_categories)
plt.figure(figsize=(10, 6))
# Note: seaborn 0.13+ warns when palette is used without hue; there, add hue="day", legend=False
sns.boxplot(x="day", y="tip", data=tips, palette=custom_palette)
plt.title("Tip by Day (Seaborn 'viridis' Palette)")
plt.show()
Styles:
- Seaborn Themes: We saw sns.set_theme(style=...) earlier (e.g., "darkgrid", "whitegrid", "ticks", "white", "dark"). These control the overall background, grid, and spine appearance.
- Matplotlib Stylesheets: Matplotlib has predefined style sheets you can apply globally using plt.style.use('style_name'). Examples include 'ggplot', 'seaborn-v0_8-darkgrid' (to mimic Seaborn), 'fivethirtyeight', 'bmh'. These affect colors, line widths, fonts, etc.
import matplotlib.pyplot as plt
import numpy as np
# Apply a Matplotlib style
plt.style.use('fivethirtyeight')
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
plt.figure(figsize=(8, 5)) # An explicit figsize overrides any default the style sheet may set
plt.plot(x, y1, label='Sine')
plt.plot(x, y2, label='Cosine')
plt.title("Sine and Cosine Waves ('fivethirtyeight' Style)")
plt.xlabel("X value")
plt.ylabel("Y value")
plt.legend()
plt.show()
# Revert to default style if needed
# import matplotlib as mpl
# mpl.rcParams.update(mpl.rcParamsDefault)
# Or just plt.style.use('default')
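If a style should apply to one figure only, a context manager avoids having to reset anything afterwards. The following is a minimal sketch using plt.style.context(); the data and the facecolor_* names are placeholders chosen for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
facecolor_before = plt.rcParams["axes.facecolor"]

# The style applies only to figures created inside the with-block
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(x, np.sin(x), label="Sine")
    ax.set_title("Styled with 'ggplot' inside the context only")
    ax.legend()

# Outside the block the previous settings are active again
facecolor_after = plt.rcParams["axes.facecolor"]
print(facecolor_before == facecolor_after)
```

The names of all installed style sheets are listed in plt.style.available.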
Annotations:
Adding text or arrows to highlight specific points or regions in a plot is crucial for storytelling.
- plt.text(x, y, "text"): Adds text at the specified data coordinates (x, y).
- ax.annotate("text", xy=(x_point, y_point), xytext=(x_text, y_text), arrowprops=dict(...)): A more versatile function. It places text at the xytext coordinates and can draw an arrow pointing from the text to the data point xy. arrowprops controls the arrow style.
import matplotlib.pyplot as plt
import numpy as np
days = np.arange(1, 8)
temperatures = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]
max_temp_day = days[np.argmax(temperatures)]
max_temp = max(temperatures)
plt.style.use('default') # Reset style
plt.figure(figsize=(9, 5))
plt.plot(days, temperatures, marker='o', label='Temperature')
plt.xlabel("Day")
plt.ylabel("Temperature (°C)")
plt.title("Weekly Temperature with Annotation")
plt.grid(True, linestyle=':')
# Simple text annotation
plt.text(days[2] + 0.1, temperatures[2] - 0.5, 'Dip')
# More complex annotation with arrow
plt.annotate(f'Peak: {max_temp}°C',
             xy=(max_temp_day, max_temp), # Point to annotate
             xytext=(max_temp_day - 1.5, max_temp + 1), # Text position
             arrowprops=dict(facecolor='black', shrink=0.05, width=1, headwidth=8),
             fontsize=10,
             bbox=dict(boxstyle="round,pad=0.3", fc="yellow", alpha=0.3)) # Optional text box
plt.legend()
plt.ylim(13, 21) # Adjust limits to make space for annotation
plt.show()
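The example above positions the label in data coordinates. As a variation (a sketch, reusing the same made-up temperatures), annotate() can also place the text at a fixed offset from the point via textcoords="offset points", which keeps the label close to its marker even when the axis limits change.

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5, 6, 7]
temperatures = [15.2, 16.8, 14.5, 17.0, 19.1, 18.5, 17.8]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, temperatures, marker="o")

# xy is in data coordinates; xytext is an offset from xy measured in points,
# so the label keeps its distance from the marker if the axis limits change.
peak_index = temperatures.index(max(temperatures))
note = ax.annotate(
    "Peak",
    xy=(days[peak_index], temperatures[peak_index]),
    xytext=(-40, 15),
    textcoords="offset points",
    arrowprops=dict(arrowstyle="->", color="black"),
)
```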
Working with Multiple Subplots
Often, you need to display multiple related plots together in a single figure. Matplotlib provides excellent tools for this.
plt.subplots():
The most common and recommended way to create a grid of subplots. It returns a Figure object and an array of Axes objects.
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 2 * np.pi, 100)
y_sin = np.sin(x)
y_cos = np.cos(x)
y_tan = np.tan(x)
# Create a figure with 2 rows and 2 columns of subplots
# sharex=True makes all subplots share one x-axis (use sharex='col' to share per column only)
# sharey=True would likewise share the y-axis (sharey='row' shares per row); it is not used here
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 8), sharex=True)
# axes is a 2D numpy array: [[ax1, ax2], [ax3, ax4]]
# Access individual axes using indexing: axes[row, col]
# Plot on the first subplot (top-left)
axes[0, 0].plot(x, y_sin, color='blue')
axes[0, 0].set_title('Sine Wave')
axes[0, 0].grid(True)
axes[0, 0].set_ylabel('Amplitude') # y-axes are not shared here, so each subplot keeps its own y ticks
# Plot on the second subplot (top-right)
axes[0, 1].plot(x, y_cos, color='red')
axes[0, 1].set_title('Cosine Wave')
axes[0, 1].grid(True)
# Plot on the third subplot (bottom-left)
axes[1, 0].plot(x, y_tan, color='green')
axes[1, 0].set_title('Tangent Wave')
axes[1, 0].set_ylim(-5, 5) # Tangent goes to infinity, limit y-axis
axes[1, 0].grid(True)
axes[1, 0].set_xlabel('Radians') # Only need x-label on the bottom row due to sharex
axes[1, 0].set_ylabel('Amplitude')
# Fourth subplot (bottom-right) - can be left empty or used for something else
axes[1, 1].plot(x, y_sin * y_cos, color='purple')
axes[1, 1].set_title('Sine * Cosine')
axes[1, 1].grid(True)
axes[1, 1].set_xlabel('Radians')
# Add an overall title to the figure
fig.suptitle('Trigonometric Functions', fontsize=16, y=1.02)
# Adjust layout to prevent titles/labels overlapping
plt.tight_layout(rect=[0, 0.03, 1, 0.98]) # rect adjusts space for suptitle
plt.show()
subplots():
- Returns fig and axes. If nrows=1 and ncols=1, axes is a single Axes object. If only one of nrows or ncols is 1, axes is a 1D array. Otherwise, it's a 2D array.
- Use axes[i, j] (or axes[i]) to access and plot on specific subplots.
- sharex=True / sharey=True is very useful for comparing plots, as it links the axes and hides redundant tick labels on inner subplots.
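The return shapes described above can be checked directly. This is a small sketch; squeeze=False is an additional plt.subplots() option, not used elsewhere in this guide, that always yields a 2D array.

```python
import matplotlib.pyplot as plt

fig1, ax_single = plt.subplots()                   # one Axes object, not an array
fig2, axes_row = plt.subplots(1, 3)                # 1D array of 3 Axes
fig3, axes_grid = plt.subplots(2, 2)               # 2D array, index as axes_grid[row, col]
fig4, axes_2d = plt.subplots(1, 3, squeeze=False)  # always 2D: shape (1, 3)

print(type(ax_single).__name__, axes_row.shape, axes_grid.shape, axes_2d.shape)

# .flat iterates over every Axes regardless of the grid shape
for i, ax in enumerate(axes_grid.flat):
    ax.set_title(f"Panel {i}")
```

Using squeeze=False can simplify code that loops over subplots when the grid size is not known in advance.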
Seaborn's Figure-Level Functions:
Seaborn has "figure-level" functions that automatically create figures with multiple subplots based on data structure. Most of them are built on Seaborn's own FacetGrid class, while pairplot() and jointplot() are built on PairGrid and JointGrid respectively.
- relplot(): Figure-level interface for relational plots (scatterplot, lineplot).
- displot(): Figure-level interface for distribution plots (histplot, kdeplot, ecdfplot).
- catplot(): Figure-level interface for categorical plots (stripplot, swarmplot, boxplot, violinplot, barplot, pointplot).
- pairplot(): Creates a matrix of scatterplots for pairwise relationships and histograms/KDEs for diagonal distributions.
- jointplot(): Creates a scatterplot with marginal distributions on the axes.
relplot(), displot(), and catplot() use arguments like row, col, and hue to structure the grid and differentiate data subsets; pairplot() and jointplot() accept hue only.
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")
# Example: relplot() to show total_bill vs tip, separated by 'time' (col) and 'smoker' (row)
g = sns.relplot(
    data=tips,
    x="total_bill", y="tip",
    col="time",     # Creates columns for different times (Lunch, Dinner)
    row="smoker",   # Creates rows for smoker status (Yes, No)
    hue="sex",      # Colors points by sex within each subplot
    kind="scatter"  # Specifies the underlying plot type
)
g.fig.suptitle("Tip vs Total Bill by Time, Smoker Status, and Sex", y=1.03)
plt.show()
# Example: pairplot() to visualize pairwise relationships in the penguins dataset
penguins = sns.load_dataset("penguins").dropna()
sns.pairplot(penguins, hue="species", diag_kind="kde") # Use kde on diagonal
plt.suptitle("Pairwise Relationships in Penguin Dataset", y=1.02)
plt.show()
# Example: jointplot() showing relationship and marginal distributions
sns.jointplot(data=penguins, x="bill_length_mm", y="flipper_length_mm", hue="species", kind="scatter") # kind can be 'kde', 'hist', 'reg'
plt.suptitle("Bill Length vs Flipper Length with Marginal Distributions", y=1.02)
plt.show()
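To make the difference from axes-level functions concrete, the sketch below inspects what a figure-level call returns. It uses a small synthetic DataFrame with made-up column names (x, y, group, panel) so that it runs without downloading a dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Small synthetic table (made-up values) so the sketch runs without a download
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.normal(size=80),
    "y": rng.normal(size=80),
    "group": np.tile(["a", "b"], 40),
    "panel": np.repeat(["left", "right"], 40),
})

# Figure-level: creates its own figure and returns a FacetGrid
g = sns.relplot(data=df, x="x", y="y", col="panel", hue="group", height=3)
print(type(g).__name__, g.axes.shape)

# Each facet is an ordinary Matplotlib Axes and can be customized as usual
for ax in g.axes.flat:
    ax.axhline(0, color="grey", linewidth=0.5)
```

Because the grid owns its figure, figure-level functions do not take an ax= argument; customization happens through the returned object.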
Choosing the Right Plot for Your Data and Story
Selecting the appropriate visualization is crucial for effective communication. The choice depends on:
- What you want to show:
  - Comparison: Comparing values across categories (Bar chart, Point plot, Box plot).
  - Relationship/Correlation: Investigating the link between two or more numerical variables (Scatter plot, Line plot (for trends), Heatmap, Regression plot).
  - Distribution: Understanding how a single numerical variable is spread (Histogram, KDE plot, Box plot, Violin plot, ECDF plot).
  - Composition: Showing parts of a whole (Stacked bar chart, Pie chart (use with caution!), Treemap, which usually requires another library).
  - Trend over Time: Showing how data changes over a continuous interval (Line plot, Area chart).
- The type of data you have:
  - Categorical: Data representing groups or labels (e.g., 'species', 'day', 'sex').
  - Numerical (Continuous): Data that can take any value within a range (e.g., 'temperature', 'bill_length').
  - Numerical (Discrete): Data that can only take specific numerical values (e.g., 'number of children', 'party size').
  - Time Series: Data points indexed in time order.
Common Pitfalls:
- Using Pie Charts for too many categories or precise comparisons: Pie charts are generally poor for comparing similar segment sizes and become unreadable with more than a few slices. Bar charts are usually better.
- Misleading Axes: Not starting a bar chart's quantitative axis at zero can exaggerate differences. Using inappropriate scales (e.g., linear vs. log) can obscure patterns.
- Overplotting: Too many data points on a scatter plot can create an uninterpretable blob. Solutions include using transparency (alpha), smaller markers, sampling, or density plots (kdeplot, histplot).
- Chart Junk: Adding unnecessary visual elements (heavy grid lines, 3D effects, excessive labels, background images) that distract from the data itself (more on this in the next section).
- Choosing Complexity over Clarity: A visually stunning but incomprehensible plot fails its purpose. Simplicity is often key.
Always ask: "What is the key message I want my audience to take away from this visual?" and choose the plot type that conveys that message most clearly and accurately.
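The overplotting pitfall is easiest to judge side by side. The sketch below uses synthetic random data and compares opaque markers, transparent markers, and a hexbin density plot (a Matplotlib alternative to the Seaborn density plots named above).

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, correlated point cloud (made-up data)
rng = np.random.default_rng(42)
x = rng.normal(size=20_000)
y = 0.5 * x + rng.normal(size=20_000)

fig, (ax_raw, ax_alpha, ax_hex) = plt.subplots(1, 3, figsize=(13, 4))

ax_raw.scatter(x, y, s=5)
ax_raw.set_title("Opaque markers")

ax_alpha.scatter(x, y, s=5, alpha=0.05)
ax_alpha.set_title("alpha=0.05")

# hexbin counts the points falling into each hexagonal cell
hb = ax_hex.hexbin(x, y, gridsize=40, cmap="viridis")
ax_hex.set_title("Hexbin counts")
fig.colorbar(hb, ax=ax_hex, label="Points per bin")
fig.tight_layout()
```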
Workshop Comparative Analysis with Subplots
Goal: Use subplots (via Matplotlib or Seaborn's figure-level functions) to compare different aspects of the 'titanic' dataset.
Dataset: The 'titanic' dataset contains information about passengers on the Titanic, including survival status.
# Data for the workshop - load the dataset
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
titanic = sns.load_dataset("titanic")
# Explore the data
print("Titanic Dataset Info:")
titanic.info()
# Note missing values in 'age', 'embarked', 'deck', 'embark_town'
print("\nTitanic Dataset Head:")
print(titanic.head())
# Columns: survived (0=No, 1=Yes), pclass (Ticket class), sex, age, sibsp (# siblings/spouses aboard),
# parch (# parents/children aboard), fare, embarked, class, who, adult_male, deck, embark_town, alive, alone
Steps:
- Setup:
  - Create a new Python file (e.g., titanic_analysis.py).
  - Import seaborn, matplotlib.pyplot, and pandas.
  - Load the titanic dataset. Print .info() and .head().
  - For simplicity in this workshop, fill missing 'age' values with the median age, fill 'embarked' and 'embark_town' with their modes, and drop 'deck' because it has too many missing values.
  - Set a suitable Seaborn theme.
# Step 1: Setup
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
titanic = sns.load_dataset("titanic")
print("Titanic Dataset Info (Before Handling NaNs):")
titanic.info()
# Handle Missing Values (Simple Strategy)
# Assign the result back instead of calling fillna(..., inplace=True) on a column
# selection, which recent pandas versions warn about and may not apply.
median_age = titanic['age'].median()
titanic['age'] = titanic['age'].fillna(median_age)
mode_embarked = titanic['embarked'].mode()[0] # mode() returns a Series
titanic['embarked'] = titanic['embarked'].fillna(mode_embarked)
mode_embark_town = titanic['embark_town'].mode()[0]
titanic['embark_town'] = titanic['embark_town'].fillna(mode_embark_town)
titanic = titanic.drop(columns=['deck']) # Drop column with too many NaNs
print("\nTitanic Dataset Info (After Handling NaNs):")
titanic.info() # Verify NaNs are handled for relevant columns
sns.set_theme(style="whitegrid", palette="pastel")
- Create Subplots using Matplotlib subplots():
  - Create a figure with 1 row and 2 columns (plt.subplots(1, 2, ...)).
  - Left Subplot: Create a countplot showing the distribution of pclass (passenger class). Use the ax= argument in sns.countplot to specify the left Axes object. Add a title like "Passenger Class Distribution".
  - Right Subplot: Create a histplot showing the distribution of passenger age. Use the ax= argument to specify the right Axes object. Add a title like "Passenger Age Distribution".
  - Adjust layout using plt.tight_layout() and display the figure.
# Step 2: Using Matplotlib subplots()
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Left subplot: Passenger Class Distribution
sns.countplot(data=titanic, x='pclass', ax=axes[0], palette='coolwarm')
axes[0].set_title('Passenger Class Distribution')
axes[0].set_xlabel('Passenger Class')
axes[0].set_ylabel('Count')
# Right subplot: Passenger Age Distribution
sns.histplot(data=titanic, x='age', bins=20, kde=True, ax=axes[1], color='skyblue')
axes[1].set_title('Passenger Age Distribution')
axes[1].set_xlabel('Age')
axes[1].set_ylabel('Frequency')
fig.suptitle('Basic Passenger Demographics', fontsize=16, y=1.03)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
plt.show()
- Create Subplots using Seaborn catplot():
  - Use sns.catplot() to compare survival counts across different categories.
  - Create a plot showing the survived count (use kind='count') split by sex (use col='sex'). This will automatically create two subplots.
  - Add an informative title using g.fig.suptitle(...) where g is the object returned by catplot.
# Step 3: Using Seaborn catplot() for comparison
g = sns.catplot(
    data=titanic,
    x='survived',      # 0 = No, 1 = Yes
    col='sex',         # Creates columns for male/female
    kind='count',      # Specifies a count plot
    palette='viridis',
    height=5,          # Height of each facet
    aspect=0.8         # Aspect ratio of each facet
)
# The returned FacetGrid offers helper methods for axis labels and per-facet titles
g.set_axis_labels("Survived (0=No, 1=Yes)", "Count")
g.set_titles("Sex = {col_name}") # Template for subplot titles
g.fig.suptitle('Survival Count by Sex', fontsize=16, y=1.03)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
plt.show()
- Combine Different Plot Types using FacetGrid (Optional Advanced):
  - Let's analyze the survival rate (the mean of survived) by pclass and sex. A bar plot is suitable here.
  - Use sns.catplot() again, setting x='pclass', y='survived', hue='sex', and kind='bar'. Optionally set col to embarked to see whether the embarkation point mattered. Note: The default barplot in Seaborn shows the mean of y and a confidence interval. Since survived is 0 or 1, the mean is the survival rate.
# Step 4: Survival Rate Analysis with catplot()
g = sns.catplot(
    data=titanic,
    x='pclass',
    y='survived',      # Mean of survived = survival rate
    hue='sex',
    col='embarked',    # Compare across embarkation points
    kind='bar',
    palette='Spectral',
    height=5,
    aspect=0.7,
    errorbar=None      # Optionally turn off error bars for cleaner look
)
g.set_axis_labels("Passenger Class", "Survival Rate")
g.set_titles("Embarked = {col_name}")
# Adjust legend position if needed
# g.legend.set_title("Sex")
# g.fig.subplots_adjust(top=0.9) # Adjust space for suptitle
g.fig.suptitle('Survival Rate by Class, Sex, and Embarkation Point', fontsize=16, y=1.03)
plt.tight_layout(rect=[0, 0.03, 1, 0.98])
plt.show()
- Run and Interpret:
  - Run your titanic_analysis.py script.
  - Examine the generated figures. What insights can you draw from comparing distributions and survival rates across different passenger segments? For example, how did class and sex influence survival? Did the embarkation point seem to matter significantly after accounting for class and sex?
This workshop shows how arranging plots side-by-side using Matplotlib's subplots or Seaborn's figure-level functions (catplot) allows for direct comparison and deeper analysis of different facets of a dataset.
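The statement that the mean of a 0/1 column equals a rate can be verified numerically. The sketch below uses a tiny made-up table that borrows the titanic column names; the values are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Tiny made-up table with the same column names as the titanic dataset
toy = pd.DataFrame({
    "pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "sex":      ["female", "female", "male", "male"] * 2,
    "survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Mean of a 0/1 column = share of ones = survival rate per group
rates = toy.groupby(["pclass", "sex"])["survived"].mean()
print(rates)
```

Running the same groupby on the real titanic DataFrame gives the numbers behind the bar heights, which is a useful cross-check before interpreting a chart.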
4. Introduction to Data Storytelling Principles
Having mastered the technical skills to create various plots, we now shift focus to the art of data storytelling. A technically perfect visualization is useless if it doesn't communicate a clear message or insight. Data storytelling involves weaving together data, visuals, and narrative to engage your audience and drive understanding. It's about transforming data from observations into meaningful narratives.
What Makes a Good Data Story
A compelling data story typically possesses several key characteristics:
- Clear Message/Insight: It focuses on conveying a specific finding, trend, or conclusion derived from the data. Avoid overwhelming the audience with too much information at once. What is the single most important thing you want them to remember?
- Context: Data rarely speaks for itself. Provide background information, define terms, explain the significance of the findings, and set the scene. Why should the audience care about this data? What benchmarks or comparisons are relevant?
- Relevant Visualizations: Use charts that accurately and effectively support the message. The choice of plot type, colors, and annotations should reinforce the narrative, not distract from it.
- Narrative Arc: Like any good story, a data story often has a structure – perhaps starting with a hook or a question, presenting evidence (the data and visuals), building towards a climax (the key insight), and ending with a conclusion or call to action.
- Audience Awareness: Tailor the story to your audience's level of expertise, interests, and needs. A presentation for executives might focus on high-level summaries and implications, while a report for technical peers might delve into methodological details and nuances.
- Simplicity and Clarity: Avoid jargon where possible. Ensure visuals are easy to interpret. Focus attention on the most important elements.
Essentially, a good data story answers the "So what?" question about your data analysis.
Decluttering Visualizations Maximizing the Data Ink Ratio
A crucial principle, popularized by Edward Tufte, is maximizing the "data-ink ratio." This means ensuring that the ink (or pixels) used in a graphic is primarily dedicated to displaying the data itself, minimizing non-data elements ("chart junk").
How to Declutter:
- Remove Redundant Information: If information is present in text (like a title stating the units), it might not need to be repeated on the axis label if space is tight or context allows. However, clarity is paramount, so don't remove essential labels.
- Eliminate Unnecessary Grid Lines: Heavy, dark grid lines can obscure data. If needed, use light, thin, non-intrusive lines (often just horizontal or vertical, not both). Sometimes, no grid is necessary.
- Mute Background Elements: Use subtle colors for axes, backgrounds, and grids. The data should stand out.
- Avoid 3D Effects: Pseudo-3D effects (like 3D bars or pies) distort perception and add no informational value. Stick to 2D.
- Minimize Chart Borders and Background Fills: Often, these are unnecessary and just add visual noise.
- Use Direct Labeling: Instead of relying solely on a legend, consider labeling data series directly on the plot if it doesn't cause clutter. This reduces the cognitive load of looking back and forth.
Example: Decluttering a Bar Chart
Let's take a standard bar chart and apply decluttering principles.
import matplotlib.pyplot as plt
import numpy as np
# Data
languages = ['Python', 'JavaScript', 'Java', 'C#', 'C++']
popularity = [31.5, 28.0, 16.8, 7.5, 6.2]
# --- Plot 1: Default Cluttered Look ---
plt.style.use('default') # Start with default
plt.figure(figsize=(7, 5))
plt.bar(languages, popularity, color='grey')
plt.ylabel("Popularity (%)")
plt.xlabel("Programming Language")
plt.title("Programming Language Popularity (Cluttered)")
plt.grid(True, axis='y', color='black', linestyle='-', linewidth=1) # Heavy grid
plt.box(True) # Explicitly draw frame
plt.show()
# --- Plot 2: Decluttered Version ---
plt.figure(figsize=(7, 5))
# Use light colors, remove redundant elements
bars = plt.bar(languages, popularity, color='lightsteelblue')
# Remove top and right spines (axis lines)
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_color('grey') # Mute left spine
plt.gca().spines['bottom'].set_color('grey') # Mute bottom spine
# Use subtle grid lines if needed, or remove them
plt.grid(True, axis='y', color='lightgrey', linestyle='--', linewidth=0.5)
# Or plt.grid(False)
# Remove axis labels if title/context is sufficient (use judgment!)
# plt.xlabel("") # Or keep if needed
# plt.ylabel("") # Or keep if needed
# Add data labels directly (optional, can replace y-axis)
# plt.tick_params(axis='y', which='both', left=False, labelleft=False) # Hide y-axis ticks/labels
for bar in bars:
    yval = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2.0, yval + 0.5, f'{yval:.1f}%',
             va='bottom', ha='center', color='dimgray', fontsize=9) # Place text above bar
plt.xticks(color='dimgray') # Mute tick labels
plt.yticks(color='dimgray')
plt.title("Python leads Programming Language Popularity", loc='left', fontsize=12, fontweight='bold') # Informative title
plt.suptitle("Hypothetical Survey Results (%)", y=0.92, x=0.125, color='grey', fontsize=9, ha='left') # Subtitle for context
plt.ylim(0, max(popularity) * 1.15) # Add padding for labels
plt.tight_layout()
plt.show()
The decluttered version focuses attention on the data (the bar heights and labels) by removing or muting non-essential elements like heavy grids, borders, and redundant labels (if context allows). The title is made more narrative.
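The manual labeling loop can also be written more compactly. The sketch below assumes Matplotlib 3.4 or newer, where Axes.bar_label() is available, and reuses the same hypothetical survey values.

```python
import matplotlib.pyplot as plt

languages = ['Python', 'JavaScript', 'Java', 'C#', 'C++']
popularity = [31.5, 28.0, 16.8, 7.5, 6.2]  # same hypothetical values as above

fig, ax = plt.subplots(figsize=(7, 5))
bars = ax.bar(languages, popularity, color='lightsteelblue')

# bar_label() places one text label per bar; fmt uses %-style formatting
labels = ax.bar_label(bars, fmt='%.1f%%', padding=3, color='dimgray', fontsize=9)

# Remove the top and right spines, as in the decluttered version
for side in ('top', 'right'):
    ax.spines[side].set_visible(False)
ax.set_ylim(0, max(popularity) * 1.15)  # headroom for the labels
```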
Using Visual Cues Effectively Color Size and Position
Our brains are wired to quickly process certain visual properties, known as preattentive attributes. We can leverage these to guide the viewer's eye and emphasize important information without them having to consciously search for it. Key attributes include:
- Color: Use color strategically.
- Highlighting: Use a distinct, bright, or saturated color for the key data points or series you want to emphasize, while keeping other elements muted (e.g., grey).
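- Categorization: Use distinct qualitative colors for different categories and make sure they remain distinguishable for colorblind viewers (e.g., Seaborn's colorblind palette or a ColorBrewer set flagged as colorblind-safe). Perceptually uniform maps like viridis and magma are colorblind-friendly too, but they are sequential and better suited to ordered data.
- Sequence/Magnitude: Use sequential color palettes (light-to-dark or vice-versa) to represent numerical magnitude.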
- Divergence: Use diverging palettes (e.g., blue-white-red) to show deviations from a central point.
- Consistency: Use color consistently across multiple related charts.
- Size: Varying the size of markers (in scatter plots) or lines can draw attention or encode an additional variable. Be mindful that perception of area/size can be non-linear.
- Position: Where elements are placed matters. We naturally read top-to-bottom, left-to-right (in Western cultures). Placing key information prominently (e.g., top-left) can increase its impact. The relative position of points (e.g., high vs. low on a y-axis) is fundamental to how we interpret charts.
- Added Marks: Bold text, enclosure (circling an area), or annotations (arrows, labels) explicitly direct attention.
Example: Using Color for Emphasis
import matplotlib.pyplot as plt
import numpy as np
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
rainfall = [55, 41, 42, 44, 49, 45, 44, 50, 49, 68, 59, 55] # Simplified London rainfall
plt.figure(figsize=(10, 5))
# Default color for all bars
colors = ['lightgrey'] * len(months)
# Highlight the month with maximum rainfall (October)
max_rain_index = np.argmax(rainfall)
colors[max_rain_index] = 'dodgerblue'
plt.bar(months, rainfall, color=colors)
# Minimalist styling
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)
plt.gca().spines['left'].set_visible(False)
plt.tick_params(axis='y', which='both', left=False, labelleft=False) # Hide y-axis
plt.xticks(fontsize=10, color='dimgray')
plt.grid(False)
# Add title and direct labels
plt.title("October is the Wettest Month in London", loc='left', fontsize=13)
plt.suptitle("Average Monthly Rainfall (mm)", y=0.9, x=0.125, color='grey', fontsize=10, ha='left')
for i, val in enumerate(rainfall):
    plt.text(i, val + 1, f'{val}', ha='center', va='bottom',
             color='dodgerblue' if i == max_rain_index else 'dimgray',
             fontsize=9, fontweight='bold' if i == max_rain_index else 'normal')
plt.ylim(0, max(rainfall) * 1.15)
plt.tight_layout()
plt.show()
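Size works similarly to color but needs more care. In the sketch below (with made-up values), note that the s argument of scatter() is the marker area in points squared, so scaling s linearly with a value keeps the area, not the diameter, proportional to it.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values to encode as marker size
values = np.array([10, 40, 90])
x = np.array([1, 2, 3])
y = np.array([1, 1, 1])

fig, ax = plt.subplots(figsize=(6, 3))

# scatter's s is the marker AREA in points^2: scaling s linearly with the value
# keeps area proportional to it (the diameter grows with the square root).
sizes = 20 * values
points = ax.scatter(x, y, s=sizes, color="dodgerblue", alpha=0.7)
ax.set_xlim(0, 4)
```

Scaling the diameter linearly instead would make large values look disproportionately big, which is one way size encodings become misleading.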
Crafting a Narrative with a Sequence of Plots
Often, a single plot isn't enough to tell the whole story. You might need a sequence of visualizations to:
- Introduce Context: Start with a broad overview (e.g., overall trend, distribution).
- Break Down the Data: Explore different segments or categories (e.g., using small multiples or subsequent plots focusing on specific groups).
- Highlight Relationships: Show correlations or comparisons between variables.
- Build to a Conclusion: Use annotations and emphasis on later plots to pinpoint the key insight.
The sequence guides the audience through your analysis process, making the final conclusion more convincing. Each plot should logically follow the previous one, building the narrative step by step. Think about how you would explain your findings verbally – the sequence of plots should mirror that explanation.
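One possible way to lay out such a sequence is shown below: overview first, then a breakdown, then an emphasized conclusion. The regions and monthly figures are invented for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

months = np.arange(1, 13)
# Made-up monthly figures for two hypothetical regions
north = np.array([20, 21, 23, 24, 26, 27, 29, 30, 32, 33, 35, 36])
south = np.array([22, 22, 21, 21, 20, 20, 19, 19, 18, 18, 17, 17])

fig, axes = plt.subplots(1, 3, figsize=(14, 4), sharex=True)

# 1. Context: the overall trend
axes[0].plot(months, north + south, color="dimgray")
axes[0].set_title("1. Total sales grow slowly")

# 2. Breakdown: the same data split by region
axes[1].plot(months, north, color="grey", label="North")
axes[1].plot(months, south, color="silver", label="South")
axes[1].set_title("2. The regions move apart")
axes[1].legend(frameon=False)

# 3. Conclusion: emphasize the series that carries the message
axes[2].plot(months, north, color="lightgrey")
axes[2].plot(months, south, color="crimson", linewidth=2)
axes[2].annotate("South is declining", xy=(12, south[-1]), xytext=(5, 25),
                 arrowprops=dict(arrowstyle="->", color="crimson"))
axes[2].set_title("3. South needs attention")
fig.tight_layout()
```

In a presentation, the three panels would typically appear one after another rather than in a single figure; the grid here only keeps the sketch compact.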
Workshop Refining a Visualization for Storytelling
Goal: Take a basic visualization (e.g., from a previous workshop) and apply storytelling principles (decluttering, emphasis, annotations) to communicate a specific message.
Scenario: We'll use the penguins dataset scatter plot (bill_length_mm vs bill_depth_mm colored by species) created earlier. Our goal is to refine it to clearly communicate that Adelie penguins have distinctly different bill dimensions compared to Gentoo and Chinstrap penguins.
Original Plot Code (for reference):
# Assuming penguins DataFrame is loaded and cleaned as before
# import seaborn as sns
# import matplotlib.pyplot as plt
# sns.set_theme(style="ticks", palette="muted") # Or any theme
# plt.figure(figsize=(9, 6))
# sns.scatterplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")
# plt.title("Bill Length vs Bill Depth by Species")
# plt.grid(True, linestyle=':', alpha=0.5)
# plt.show()
Steps:
- Setup:
  - Create a new Python file (e.g., penguin_story.py).
  - Import seaborn, matplotlib.pyplot, pandas.
  - Load and clean the penguins dataset (dropna()).
- Identify the Core Message: We want to highlight the separation of Adelie penguins based on bill dimensions.
- Choose Emphasis Strategy: We will use color and annotations.
  - Mute the colors for Gentoo and Chinstrap.
  - Use a distinct, brighter color for Adelie.
  - Add text annotations to label the groups and state the key message.
  - Declutter the plot (remove unnecessary grid lines, potentially simplify axes/spines).
- Implement the Refined Plot:
  - Define a custom color palette where Adelie stands out.
  - Create the scatter plot using this palette.
  - Get the Axes object returned by sns.scatterplot.
  - Remove distracting elements (e.g., top/right spines, maybe grid).
  - Add annotations using ax.text() or ax.annotate() to label the clusters and explicitly state the finding.
  - Craft an effective title that conveys the main message.
# Step 4: Implement Refined Plot
# Assumes the penguins DataFrame was loaded and cleaned in Step 1
plt.style.use('default') # Start fresh
plt.figure(figsize=(10, 7))
# Define custom palette: Highlight Adelie
palette_colors = { # Map species to colors
    'Adelie': 'darkorange',
    'Chinstrap': 'lightgrey',
    'Gentoo': 'darkgrey'
}
# Create the scatter plot
ax = sns.scatterplot(
    data=penguins,
    x="bill_length_mm",
    y="bill_depth_mm",
    hue="species",
    palette=palette_colors,
    s=60,         # Adjust marker size if needed
    alpha=0.8,    # Adjust transparency
    legend='full' # Can control legend ('brief', 'full', False)
)
# --- Decluttering ---
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(False) # Remove grid lines for cleaner look
ax.set_xlabel("Bill Length (mm)", fontsize=11, color='dimgray')
ax.set_ylabel("Bill Depth (mm)", fontsize=11, color='dimgray')
ax.tick_params(colors='dimgray')
# --- Emphasis and Annotations ---
# Modify legend (optional, but can help focus)
# Optional: remove legend if using direct labels: ax.get_legend().remove()
ax.legend(title='Penguin Species', title_fontsize=11, loc='upper left', frameon=False)
# Add annotation to highlight Adelie cluster
ax.text(35, 21.5, 'Adelie penguins have shorter,\ndeeper bills compared to others.',
        fontsize=11, color='black', style='italic',
        bbox=dict(boxstyle="round,pad=0.4", fc="white", ec="darkorange", alpha=0.8))
# Add labels for other clusters (optional)
ax.text(55, 16, 'Gentoo\n(Longer, shallower bills)', ha='center', fontsize=9, color='darkgrey')
ax.text(50, 20.5, 'Chinstrap', ha='center', fontsize=9, color='dimgray')
# --- Narrative Title ---
plt.title("Adelie Bills Stand Apart", fontsize=16, fontweight='bold', loc='left')
plt.suptitle("Bill Dimensions of Three Penguin Species", fontsize=11, color='grey', y=0.92, x=0.125, ha='left')
plt.tight_layout(rect=[0, 0, 1, 0.9]) # Adjust layout
plt.savefig("penguin_bill_story.png", dpi=300)
plt.show()
- Review and Iterate:
  - Run penguin_story.py.
  - Look at the generated penguin_bill_story.png. Does it clearly communicate the intended message? Is it visually appealing and easy to understand?
  - Could the annotations be clearer? Is the color choice effective? Is it too cluttered or too sparse? (Self-correction: Maybe the direct labels for Gentoo/Chinstrap add clutter? Could remove them or make them fainter. Is the annotation box too large?) Adjust the code based on your assessment.
This workshop focused on transforming a standard exploratory plot into a piece of communication. By applying decluttering techniques and using visual cues like color and annotations strategically, we directed the audience's attention to the specific insight we wanted to convey about Adelie penguins.
5. Advanced Matplotlib Techniques
While Seaborn simplifies many common tasks, mastering Matplotlib's deeper features unlocks unparalleled customization and control, essential for complex layouts, fine-grained aesthetics, and non-standard visualizations. This section delves into the object-oriented interface, intricate element control, and advanced layout tools.
Object Oriented Interface vs Pyplot
So far, we've primarily used the matplotlib.pyplot interface (e.g., plt.plot(), plt.title()). This is a state-based interface that implicitly keeps track of the "current" Figure and Axes, making simple plots quick to generate.
However, for more complex scenarios involving multiple figures, multiple axes, or fine-grained control, the Object-Oriented (OO) interface is generally preferred and considered best practice.
The OO Approach:
- Explicitly create Figure and Axes objects. The most common way is fig, ax = plt.subplots() (or fig, axes = plt.subplots(...) for multiple axes).
- Call methods directly on these objects (e.g., ax.plot(), ax.set_title(), ax.set_xlabel(), ax.legend(), fig.suptitle()).
Comparison:
Feature | pyplot Interface (plt.*) | Object-Oriented Interface (ax.*) |
---|---|---|
Simplicity | High for simple, single plots | Slightly more verbose initially |
Control | Less explicit, relies on state | Explicit control over objects |
Complexity | Can become confusing with multiple plots/figures | Scales better to complex figures |
Readability | Good for simple scripts | More explicit and often clearer for complex logic |
Best Practice | Good for quick exploration, simple plots | Preferred for reusable code, complex figures, libraries |
Example: OO vs. Pyplot
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
y = np.sin(x)
# --- Pyplot Approach ---
plt.figure() # Creates a new current figure; plt.plot() below adds the Axes implicitly
plt.plot(x, y, label='Sine')
plt.title('Sine Wave (Pyplot)')
plt.xlabel('X')
plt.ylabel('Amplitude')
plt.legend()
plt.grid(True)
# plt.show() # Show would display this plot
# --- Object-Oriented Approach ---
fig, ax = plt.subplots() # Explicitly create figure and axes
ax.plot(x, y, label='Sine')
ax.set_title('Sine Wave (Object-Oriented)')
ax.set_xlabel('X')
ax.set_ylabel('Amplitude')
ax.legend()
ax.grid(True)
# fig.savefig('sine_oo.png') # Can save the figure object
plt.show() # Displays all open figures (here both the pyplot one and the OO one)
Both produce similar plots, but the OO approach gives you explicit handles (fig, ax) to work with. This becomes essential when managing multiple subplots, as seen previously with fig, axes = plt.subplots(2, 2). You'd then use axes[0, 0].plot(...), axes[0, 1].set_title(...), etc.
Recommendation: Get comfortable using the OO interface, especially when your plots become more than just a single, simple visualization.
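A common pattern that follows from the OO interface is a plotting function that receives the Axes to draw on. The helper name plot_wave below is hypothetical; the sketch only illustrates how explicit handles make plotting code reusable across subplots.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_wave(ax, x, y, title):
    """Draw one titled line on the given Axes (hypothetical helper)."""
    ax.plot(x, y)
    ax.set_title(title)
    ax.set_xlabel("X")
    ax.grid(True)
    return ax

x = np.linspace(0, 10, 100)
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4), sharey=True)

# The same function fills any Axes it is handed
plot_wave(ax_left, x, np.sin(x), "Sine")
plot_wave(ax_right, x, np.cos(x), "Cosine")
ax_left.set_ylabel("Amplitude")
```

Because the function never touches the implicit "current" Axes, it behaves the same no matter how many figures are open.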
Fine Grained Control Ticks Grids and Spines
Matplotlib offers deep control over the appearance of axis elements.
Ticks:
The Axes object has xaxis and yaxis attributes, which are Axis objects. These objects manage ticks and labels. You can use the ticker module for sophisticated tick formatting and locating.
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.ticker as mticker # Import the ticker module
x = np.linspace(0, 2 * np.pi, 500)
y = 100 * np.sin(x) # Scale y-values
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y)
ax.set_title("Customizing Ticks")
# --- Customizing Ticks ---
# 1. Set major tick locations
ax.xaxis.set_major_locator(mticker.MultipleLocator(np.pi / 2)) # Ticks at multiples of pi/2
ax.yaxis.set_major_locator(mticker.FixedLocator([-100, 0, 100])) # Fixed tick locations
# 2. Format major tick labels
def pi_formatter(x, pos):
    """Formats radians in terms of pi"""
    if np.isclose(x, 0): return "0"
    multiple = x / np.pi
    if np.isclose(multiple, 1): return r"$\pi$"
    if np.isclose(multiple, 2): return r"$2\pi$"
    return fr"${multiple:.1f}\pi$" # Use f-string with LaTeX
ax.xaxis.set_major_formatter(mticker.FuncFormatter(pi_formatter))
ax.yaxis.set_major_formatter(mticker.FormatStrFormatter('%d%%')) # Format as percentage
# 3. Minor ticks (often automatic, but can be controlled)
ax.xaxis.set_minor_locator(mticker.AutoMinorLocator(2)) # Auto minor ticks between majors
ax.yaxis.set_minor_locator(mticker.MultipleLocator(25)) # Minor ticks every 25 units
# 4. Tick parameters (appearance)
ax.tick_params(axis='x', which='major', length=6, width=1, rotation=0, colors='blue', labelsize=10)
ax.tick_params(axis='y', which='minor', length=3, color='red') # tick_params does not accept a linestyle keyword
ax.tick_params(axis='both', which='major', direction='in', top=True, right=True) # Ticks inside, also on top/right
plt.show()
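The same locator/formatter split applies to other tickers in matplotlib.ticker. The following is a small sketch, assuming the data are fractions between 0 and 1 that should be shown as percentages; the data values are made up for illustration.

```python
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import numpy as np

# Illustrative data: fractions between 0 and 1 (e.g., a rate per category)
x = np.arange(5)
rates = np.array([0.10, 0.35, 0.50, 0.80, 0.95])

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, rates, marker="o")
ax.set_ylim(0, 1)

# The locator decides WHERE ticks go; the formatter decides HOW they are labeled
ax.yaxis.set_major_locator(mticker.MultipleLocator(0.25))
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1.0))  # 1.0 corresponds to 100%

fig.canvas.draw()  # force the tick labels to be computed
labels = [t.get_text() for t in ax.get_yticklabels()]
print(labels)
```

Keeping the underlying data as fractions and converting only at the formatter level means the plotted values stay unchanged while the labels read as percentages.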
Grids:
Control grid appearance using ax.grid().
# ... (previous setup) ...
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, 'g-')
ax.set_title("Customizing Grids")
# Customize grid
ax.grid(True, which='major', axis='both', color='grey', linestyle='--', linewidth=0.5, alpha=0.7)
ax.grid(True, which='minor', axis='y', color='lightgrey', linestyle=':', linewidth=0.5, alpha=0.5)
# Set tick locations for grid alignment
ax.xaxis.set_major_locator(mticker.MultipleLocator(np.pi))
ax.yaxis.set_major_locator(mticker.MultipleLocator(50))
ax.xaxis.set_minor_locator(mticker.MultipleLocator(np.pi / 4))
ax.yaxis.set_minor_locator(mticker.MultipleLocator(10))
plt.show()
Spines:
Spines are the lines connecting the axis tick marks, delineating the plot area.
# ... (previous setup) ...
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(x, y, 'm-')
ax.set_title("Customizing Spines")
# Hide top and right spines (common for cleaner look)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
# Change color and linewidth of remaining spines
ax.spines['left'].set_color('purple')
ax.spines['left'].set_linewidth(1.5)
ax.spines['bottom'].set_position(('outward', 10)) # Move bottom spine outward by 10 points
# Show ticks only on the left and bottom spines
ax.yaxis.tick_left()
ax.xaxis.tick_bottom()
plt.show()
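Spine positions can also be expressed in data terms. The sketch below, assuming a curve that crosses the origin, moves the left and bottom spines onto x=0 and y=0 to get a classic "math textbook" look; the sine data is only a placeholder.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(-2 * np.pi, 2 * np.pi, 400)

fig, ax = plt.subplots(figsize=(7, 4))
ax.plot(x, np.sin(x))

# Move the left and bottom spines onto the data origin (x=0, y=0)
ax.spines["left"].set_position("zero")
ax.spines["bottom"].set_position("zero")

# Hide the two spines that are no longer needed
ax.spines["top"].set_visible(False)
ax.spines["right"].set_visible(False)
```

This layout suits functions that are symmetric around zero; for data far from the origin, the 'outward' offset shown earlier is usually the better choice.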
Creating Complex Layouts Gridspec and Inset Axes
For layouts beyond simple uniform grids (plt.subplots), Matplotlib provides more advanced tools.
GridSpec:
Allows creating grids where subplots can span multiple rows or columns.
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np
fig = plt.figure(figsize=(10, 7))
# Create a 3x3 grid specification
gs = gridspec.GridSpec(3, 3, figure=fig, hspace=0.4, wspace=0.3)
# Add subplots occupying different grid cells
ax1 = fig.add_subplot(gs[0, :]) # Top row, spans all 3 columns
ax1.set_title('Top Row (Full Width)')
ax1.plot(np.random.rand(10))
ax1.set_xticks([]) # Remove ticks for cleaner look
ax2 = fig.add_subplot(gs[1, 0:2]) # Middle row, first 2 columns
ax2.set_title('Middle Row (Cols 0-1)')
ax2.plot(np.random.rand(10), 'r')
ax2.set_xticks([])
ax3 = fig.add_subplot(gs[1:, 2]) # Spans rows 1 and 2, column 2
ax3.set_title('Right Column (Rows 1-2)')
ax3.plot(np.random.rand(10), 'g')
ax3.set_yticks([])
ax4 = fig.add_subplot(gs[2, 0]) # Bottom row, first column
ax4.set_title('Bottom Left')
ax4.plot(np.random.rand(10), 'k')
ax5 = fig.add_subplot(gs[2, 1]) # Bottom row, second column
ax5.set_title('Bottom Middle')
ax5.plot(np.random.rand(10), 'm')
ax5.set_yticks([])
fig.suptitle('Complex Layout with GridSpec', fontsize=16)
# plt.tight_layout(rect=[0, 0.03, 1, 0.95]) # tight_layout might conflict with GridSpec spacing
plt.show()
GridSpec provides flexibility by defining a grid structure first and then assigning Axes to specific slices of that grid.
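Besides spanning cells, GridSpec can give columns or rows unequal sizes through its width_ratios and height_ratios arguments. The following is a minimal sketch with synthetic data: a wide line plot next to a narrow marginal histogram. The 3:1 ratio and the random data are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
data = rng.normal(size=500)

fig = plt.figure(figsize=(9, 4))
# One row, two columns; the left column is three times as wide as the right
gs = gridspec.GridSpec(1, 2, figure=fig, width_ratios=[3, 1], wspace=0.05)

ax_line = fig.add_subplot(gs[0, 0])
ax_hist = fig.add_subplot(gs[0, 1], sharey=ax_line)  # share the value axis

ax_line.plot(data, linewidth=0.8)
ax_line.set_title("Series")
ax_hist.hist(data, bins=30, orientation="horizontal")
ax_hist.set_title("Distribution")
ax_hist.tick_params(axis="y", labelleft=False)  # avoid duplicate tick labels

# Ratio of the two Axes widths in figure coordinates
width_ratio = ax_line.get_position().width / ax_hist.get_position().width
```

Unequal ratios are useful whenever one panel carries the main message and the others only provide supporting context.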
Inset Axes (ax.inset_axes()):
Place one plot inside another, often used for magnifying a specific region or adding context.
import matplotlib.pyplot as plt
import numpy as np
# Main plot data
x = np.linspace(0, 10, 100)
y = np.sin(x) * np.exp(-x / 5)
# Region to zoom in on
x_zoom = np.linspace(2, 4, 50)
y_zoom = np.sin(x_zoom) * np.exp(-x_zoom / 5)
fig, ax_main = plt.subplots(figsize=(9, 6))
# Plot main data
ax_main.plot(x, y, label='Damped Sine Wave')
ax_main.set_title('Main Plot with Inset Axes')
ax_main.set_xlabel('X')
ax_main.set_ylabel('Y')
ax_main.grid(True, linestyle=':')
# Define the position and size of the inset axes
# [x, y, width, height] in axes coordinates (0-1 relative to parent axes)
inset_pos = [0.55, 0.55, 0.4, 0.4]
ax_inset = ax_main.inset_axes(inset_pos)
# Plot zoomed data on the inset axes
ax_inset.plot(x_zoom, y_zoom, color='red', label='Zoomed Region')
ax_inset.set_title('Zoomed Area', fontsize=10)
ax_inset.set_xlabel('X (Zoom)', fontsize=9)
ax_inset.set_ylabel('Y (Zoom)', fontsize=9)
ax_inset.tick_params(axis='both', which='major', labelsize=8)
ax_inset.grid(True, linestyle=':', alpha=0.5)
# Optional: Mark the zoomed region on the main plot
ax_main.axvspan(2, 4, color='grey', alpha=0.2, label='Zoomed Region Marked')
ax_main.legend(loc='lower left')
plt.show()
inset_axes is powerful for detailed views within a broader context.
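Instead of shading the zoomed region by hand, Matplotlib can draw the marker rectangle and connector lines itself via Axes.indicate_inset_zoom. The sketch below reuses the damped sine wave from the example above; the inset limits are chosen by eye and are an assumption, not a rule.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 500)
y = np.sin(x) * np.exp(-x / 5)

fig, ax_main = plt.subplots(figsize=(9, 6))
ax_main.plot(x, y)

# Inset placed in the upper-right corner, in axes-fraction coordinates
ax_inset = ax_main.inset_axes([0.55, 0.55, 0.4, 0.4])
ax_inset.plot(x, y, color="red")  # same data; the axis limits do the zooming

# Restrict the inset to the region of interest
ax_inset.set_xlim(2, 4)
ax_inset.set_ylim(-0.4, 0.7)

# Draw a rectangle around that region on the main axes plus connector lines
ax_main.indicate_inset_zoom(ax_inset, edgecolor="black")
```

Plotting the full data in the inset and narrowing it with set_xlim/set_ylim avoids maintaining a second, pre-sliced copy of the data.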
Interactive Visualizations (Brief Mention)
While libraries like Plotly and Bokeh are specialists in interactive web-based visualizations, Matplotlib offers some basic interactivity, primarily through its different backends.
- Built-in Interactivity: Most Matplotlib GUI backends (like Qt, Tk, Wx, macOS) provide built-in tools for zooming, panning, and saving the figure interactively.
- Event Handling: Matplotlib has an event handling system (mpl_connect) allowing you to write Python code that responds to mouse clicks, key presses, etc., on the plot. This can be used to create custom interactive behaviors (e.g., clicking a point to display its details).
- Widgets: Libraries like ipywidgets (in Jupyter) can be combined with Matplotlib to create interactive controls (sliders, dropdowns) that update plots.
- 3D Plotting: The mpl_toolkits.mplot3d toolkit provides functions for creating 3D scatter, surface, wireframe, and contour plots (ax.plot_surface, ax.scatter, etc.). These often allow interactive rotation in GUI backends.
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D # Import 3D plotting tools
import numpy as np
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d') # Create a 3D Axes
# Make data
u = np.linspace(0, 2 * np.pi, 100)
v = np.linspace(0, np.pi, 100)
x = 10 * np.outer(np.cos(u), np.sin(v))
y = 10 * np.outer(np.sin(u), np.sin(v))
z = 10 * np.outer(np.ones(np.size(u)), np.cos(v))
# Plot the surface
ax.plot_surface(x, y, z, cmap='viridis') # Use a colormap
ax.set_title("3D Surface Plot (Interact with Mouse)")
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.show() # In a suitable backend, you can rotate this plot
Matplotlib's interactivity is capable, but for complex, web-friendly interactive plots developers often turn to libraries designed specifically for that purpose.
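For simple cases, the event handling mentioned in the list above is often enough. The following is a minimal sketch of a click handler that reports the nearest data point; the handler name on_click and the clicked list are hypothetical names introduced for this example, and the nearest-point search in data units is a deliberately naive choice.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
y = np.sin(x)

fig, ax = plt.subplots()
ax.plot(x, y, "o-")

clicked = []  # hypothetical store for the indices picked so far

def on_click(event):
    """Record and report the data point nearest to a mouse click."""
    if event.inaxes is not ax:  # ignore clicks outside the plot area
        return
    # Naive nearest-point search in data coordinates
    idx = int(np.argmin(np.hypot(x - event.xdata, y - event.ydata)))
    clicked.append(idx)
    print(f"Nearest point: x={x[idx]:.2f}, y={y[idx]:.2f}")

# mpl_connect returns a connection id that can be passed to mpl_disconnect later
cid = fig.canvas.mpl_connect("button_press_event", on_click)
```

With an interactive backend, call plt.show() and click on the plot to trigger the handler. When the x and y axes have very different scales, measuring distance in pixel coordinates instead of data units would likely give more intuitive results.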
Workshop Building a Complex Dashboard Style Layout
Goal: Create a multi-panel figure using GridSpec to display different facets of the 'tips' dataset in a dashboard-like arrangement.
Scenario: We want a figure that shows:
- Overall distribution of total_bill (top, full width).
- Relationship between total_bill and tip (middle left).
- Distribution of tip amounts (middle right).
- Average tip amount by day (bottom left).
- Count of visits by time (bottom right).
Steps:
- Setup:
  - Create a new Python file (e.g., tips_dashboard.py).
  - Import matplotlib.pyplot, seaborn, pandas, matplotlib.gridspec.
  - Load the tips dataset.
  - Set a Seaborn theme if desired.
- Define GridSpec Layout:
  - Create a Matplotlib figure.
  - Define a GridSpec object (e.g., 3 rows, 2 columns). Adjust hspace and wspace for spacing.
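The layout step can be sketched as follows. The figure size and the hspace/wspace values are assumptions to be tuned by eye, not values prescribed by the workshop. The setup step is only indicated in a comment here, because loading the dataset with sns.load_dataset("tips") follows the same pattern as the earlier examples in this guide.

```python
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec

# Step 1 (not repeated here): import seaborn as sns, then tips = sns.load_dataset("tips"),
# as in the earlier examples of this guide.

# Step 2: Define GridSpec Layout
fig = plt.figure(figsize=(12, 12))  # assumed size; adjust to taste
# 3 rows x 2 columns; hspace/wspace are fractions of the average Axes height/width
gs = gridspec.GridSpec(3, 2, figure=fig, hspace=0.5, wspace=0.3)
```

The names fig and gs are the ones the subsequent steps refer to when they call fig.add_subplot(gs[...]).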
- Create Subplots and Assign Axes:
  - Use fig.add_subplot() with GridSpec slicing to create the five required Axes objects according to the layout described above.
# Step 3: Create Subplots and Assign Axes
ax_hist_total = fig.add_subplot(gs[0, :]) # Top row, full width
ax_scatter = fig.add_subplot(gs[1, 0]) # Middle row, left column
ax_hist_tip = fig.add_subplot(gs[1, 1]) # Middle row, right column
ax_bar_tip_day = fig.add_subplot(gs[2, 0]) # Bottom row, left column
ax_count_time = fig.add_subplot(gs[2, 1]) # Bottom row, right column
- Populate Each Subplot:
  - Use Seaborn (or Matplotlib) functions to create the desired plots on each corresponding Axes object using the ax= argument.
  - Add appropriate titles and labels to each subplot. Customize as needed.
# Step 4: Populate Each Subplot
# Plot 1: Distribution of Total Bill
sns.histplot(data=tips, x='total_bill', kde=True, ax=ax_hist_total, color='skyblue')
ax_hist_total.set_title('Distribution of Total Bill Amount', fontsize=13)
ax_hist_total.set_xlabel('Total Bill ($)')
ax_hist_total.set_ylabel('Frequency')
# Plot 2: Total Bill vs Tip
sns.scatterplot(data=tips, x='total_bill', y='tip', hue='smoker', alpha=0.7, ax=ax_scatter)
ax_scatter.set_title('Total Bill vs Tip Amount', fontsize=13)
ax_scatter.set_xlabel('Total Bill ($)')
ax_scatter.set_ylabel('Tip Amount ($)')
ax_scatter.legend(title='Smoker', fontsize=9, title_fontsize=10)
# Plot 3: Distribution of Tip Amount
sns.histplot(data=tips, x='tip', kde=True, ax=ax_hist_tip, color='lightcoral')
ax_hist_tip.set_title('Distribution of Tip Amount', fontsize=13)
ax_hist_tip.set_xlabel('Tip Amount ($)')
ax_hist_tip.set_ylabel('Frequency')
# Plot 4: Average Tip by Day
# errorbar=None replaces the deprecated ci=None (Seaborn 0.12 and later)
sns.barplot(data=tips, x='day', y='tip', ax=ax_bar_tip_day, palette='pastel',
            errorbar=None, order=['Thur', 'Fri', 'Sat', 'Sun'])
ax_bar_tip_day.set_title('Average Tip by Day', fontsize=13)
ax_bar_tip_day.set_xlabel('Day of the Week')
ax_bar_tip_day.set_ylabel('Average Tip ($)')
ax_bar_tip_day.tick_params(axis='x', rotation=45)
# Plot 5: Visit Count by Time
sns.countplot(data=tips, x='time', ax=ax_count_time, palette='bright')
ax_count_time.set_title('Visit Count by Time', fontsize=13)
ax_count_time.set_xlabel('Time of Day')
ax_count_time.set_ylabel('Number of Visits')
- Add Overall Title and Display:
  - Add a main title to the entire figure using fig.suptitle().
  - Use plt.show() to display the dashboard. You might need to adjust tight_layout or the GridSpec spacing parameters (hspace, wspace) iteratively to get the desired look.
# Step 5: Add Overall Title and Display
fig.suptitle('Restaurant Tips Dashboard', fontsize=18, fontweight='bold', y=0.98)
# tight_layout often needs careful adjustment with GridSpec, may need manual tweaks or skip
# fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.savefig('tips_dashboard.png', dpi=300)
print("Dashboard saved as tips_dashboard.png")
plt.show()
- Run and Refine:
  - Execute tips_dashboard.py.
  - Examine the output figure tips_dashboard.png. Does the layout effectively present the different pieces of information? Are the plots clear and well-labeled? Adjust spacing, titles, or plot types as needed for clarity and aesthetic appeal.
This workshop demonstrates how GridSpec enables the creation of sophisticated, non-uniform layouts, allowing you to combine multiple related visualizations into a cohesive dashboard-style figure, providing a comprehensive overview of your data.
6. Advanced Seaborn and Statistical Visualization
Seaborn's capabilities extend beyond basic plots. It offers powerful tools for visualizing statistical models, complex relationships, and matrix data, often integrating statistical computations directly into the visualization process. This section explores some of these advanced features and how to seamlessly blend Seaborn with Matplotlib for maximum flexibility.
Advanced Statistical Plots Regression Plots Heatmaps and Clustermaps
Seaborn makes visualizing statistical relationships and patterns relatively straightforward.
Regression Plots (regplot, lmplot):
These functions draw a scatter plot of two variables (x, y) and then fit and plot a linear regression model relating them, along with a confidence interval band for the regression line.
- sns.regplot(): Plots data onto a specific Matplotlib Axes (axes-level function).
- sns.lmplot(): Creates a full figure, potentially with subplots based on hue, col, or row (figure-level function, uses regplot internally within a FacetGrid).
import matplotlib.pyplot as plt
import seaborn as sns
tips = sns.load_dataset("tips")
penguins = sns.load_dataset("penguins").dropna()
# --- regplot (Axes-level) ---
fig, ax = plt.subplots(figsize=(8, 6))
sns.regplot(x="total_bill", y="tip", data=tips,
scatter_kws={'alpha':0.5, 's':50}, # Customize scatter points
line_kws={'color':'red', 'linewidth':2}, # Customize regression line
ax=ax)
ax.set_title('Regression Plot: Tip vs Total Bill')
plt.show()
# --- lmplot (Figure-level) ---
# Shows regression for different smoker groups with hue and separate columns
g = sns.lmplot(x="total_bill", y="tip", data=tips,
hue="smoker", # Color by smoker status
col="time", # Separate plots for Lunch/Dinner
height=5, aspect=0.8,
palette='Set1',
scatter_kws={'alpha':0.6})
g.fig.suptitle('Linear Model Plot: Tip vs Total Bill (by Smoker/Time)', y=1.03)
plt.show()
# Can fit higher-order polynomial regression
plt.figure(figsize=(8, 6))
sns.regplot(data=penguins, x="bill_length_mm", y="flipper_length_mm",
order=2, # Fit a 2nd order polynomial
line_kws={'color':'orange'})
plt.title('Polynomial Regression (2nd Order): Flipper Length vs Bill Length')
plt.show()
lmplot is particularly powerful for quickly comparing regression lines across different subsets of the data.
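Note that regplot and lmplot draw the fitted line but do not return its coefficients. When a story needs the actual numbers (for an annotation, say), one option is to compute them separately. The sketch below uses NumPy on synthetic data with a known slope and intercept; the data and variable names are assumptions for illustration, and the fit is an ordinary least-squares line, which may differ from what Seaborn draws if options such as robust or order are used.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data standing in for two numeric columns (true slope 2.0, intercept 1.0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

# Degree-1 least-squares fit; coefficients are returned highest power first
slope, intercept = np.polyfit(x, y, deg=1)

# R^2 as the squared Pearson correlation (valid for simple linear regression)
r_squared = np.corrcoef(x, y)[0, 1] ** 2

print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={r_squared:.3f}")
```

The resulting values can then be placed on the plot with ax.text or ax.annotate.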
Heatmaps (heatmap):
Visualize matrix data where values are represented by color intensity. Excellent for showing correlation matrices or feature interactions.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Example 1: Correlation Matrix
penguins = sns.load_dataset("penguins").dropna() # Load here so this example runs on its own
penguins_numeric = penguins.select_dtypes(include=np.number) # Select only numerical columns
correlation_matrix = penguins_numeric.corr()
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix,
annot=True, # Show values in cells
cmap='coolwarm', # Choose a diverging colormap
fmt=".2f", # Format annotations to 2 decimal places
linewidths=.5) # Add lines between cells
plt.title('Correlation Matrix of Penguin Measurements')
plt.show()
# Example 2: Generic matrix data (e.g., flight passenger counts)
flights = sns.load_dataset("flights")
flights_pivot = flights.pivot(index="month", columns="year", values="passengers") # Pivot for matrix format
plt.figure(figsize=(10, 8))
sns.heatmap(flights_pivot,
annot=True, fmt="d", # Annotate with integer format
cmap="viridis", # Sequential colormap
linewidths=.5,
linecolor='lightgrey',
cbar_kws={'label': 'Number of Passengers'}) # Add label to color bar
plt.title('Monthly Flight Passengers (1949-1960)')
plt.xlabel('Year')
plt.ylabel('Month')
plt.xticks(rotation=45)
plt.yticks(rotation=0)
plt.show()
Key heatmap customizations include the colormap (cmap), annotations (annot, fmt), and cell separation (linewidths, linecolor).
Clustermaps (clustermap):
A clustermap takes a matrix, performs hierarchical clustering on its rows and/or columns, and displays the matrix reordered according to the clustering, alongside dendrograms showing the cluster hierarchy. Useful for finding groups of similar rows/columns.
import matplotlib.pyplot as plt
import seaborn as sns
# Using the flights pivot table from before
flights = sns.load_dataset("flights")
flights_pivot = flights.pivot(index="month", columns="year", values="passengers")
# Create a clustermap
g = sns.clustermap(flights_pivot,
cmap="magma", # Colormap
standard_scale=1, # Scale rows or columns (0=rows, 1=columns)
linewidths=.5,
figsize=(10, 10))
g.fig.suptitle('Clustermap of Flight Passengers (Columns Scaled)', y=1.02)
plt.show()
# Example with correlation matrix - find groups of correlated variables
iris = sns.load_dataset("iris")
species = iris.pop("species") # Remove species column for correlation
iris_corr = iris.corr()
sns.clustermap(iris_corr,
cmap="vlag", # Diverging colormap good for correlations
annot=True, fmt=".2f",
linewidths=1,
figsize=(7, 7))
plt.suptitle('Clustermap of Iris Feature Correlations', y=1.02)
plt.show()
clustermap automatically reorders rows and columns based on similarity, revealing structures that might be hidden in the original matrix order. standard_scale or z_score can normalize data before clustering.
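To make the two normalization options concrete, here is a small NumPy sketch of what they amount to when applied per column. It assumes that standard_scale corresponds to min-max scaling and z_score to subtracting the mean and dividing by the standard deviation, as described in Seaborn's documentation; the use of the sample standard deviation (ddof=1) is an assumption of this sketch, and the code is not Seaborn's implementation.

```python
import numpy as np

# Small illustrative matrix: 3 rows (observations) x 2 columns (features)
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Min-max scaling per column: each column ends up in the range [0, 1]
col_min = data.min(axis=0)
col_max = data.max(axis=0)
min_max_scaled = (data - col_min) / (col_max - col_min)

# Z-scoring per column: each column gets mean 0 and sample standard deviation 1
z_scored = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(min_max_scaled)
print(z_scored)
```

Either way, columns measured on very different scales become comparable, so that clustering is not dominated by the column with the largest raw values.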
Integrating Matplotlib and Seaborn Seamlessly
Because Seaborn builds on Matplotlib, they work together naturally.
- Seaborn on Matplotlib Axes: Most Seaborn plotting functions (except figure-level ones like lmplot, catplot, etc.) have an ax= parameter. You can create a Matplotlib figure and axes layout (e.g., using plt.subplots or GridSpec) and then tell specific Seaborn functions exactly which axes to draw on. This was demonstrated in the GridSpec workshop (tips_dashboard.py).
- Customizing Seaborn Plots with Matplotlib: After a Seaborn plot is created (even figure-level ones), you can access the underlying Matplotlib Figure and Axes objects to apply further customizations (titles, labels, annotations, ticks, spines, limits) using standard Matplotlib methods. For figure-level functions, the returned object (often a FacetGrid or PairGrid) usually has .fig and .ax or .axes attributes.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Example: Seaborn plot on specific Matplotlib Axes + Matplotlib customization
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5
fig, ax = plt.subplots(figsize=(8, 6))
# Use Seaborn for the statistical plot
sns.regplot(x=x, y=y, ax=ax, color='purple', scatter_kws={'alpha': 0.5})
# Use Matplotlib for fine-tuning
ax.set_title("Seaborn Regression on Matplotlib Axes", fontsize=15)
ax.set_xlabel("Independent Variable (X)", fontsize=12)
ax.set_ylabel("Dependent Variable (Y)", fontsize=12)
ax.grid(True, linestyle=':', alpha=0.6)
ax.axhline(0, color='grey', linewidth=0.8, linestyle='--') # Add horizontal line at y=0
ax.axvline(0, color='grey', linewidth=0.8, linestyle='--') # Add vertical line at x=0
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text(0.05, 0.95, 'Custom annotation', transform=ax.transAxes, # Text relative to axes size
fontsize=10, verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
plt.show()
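The second point above, reaching the Matplotlib objects behind a figure-level plot, can be sketched as follows. The DataFrame is synthetic stand-in data with hypothetical column names so that no download is needed. The sketch uses the g.figure attribute, which recent Seaborn releases provide; older releases expose the same object as g.fig.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in data (hypothetical columns) so the sketch needs no download
rng = np.random.default_rng(42)
n = 120
df = pd.DataFrame({
    "x": rng.normal(size=n),
    "group": np.repeat(["A", "B"], n // 2),
})
df["y"] = 1.5 * df["x"] + rng.normal(scale=0.5, size=n)

# Figure-level function: returns a FacetGrid instead of drawing on a given Axes
g = sns.lmplot(data=df, x="x", y="y", col="group", height=4)

# Reach through to the Matplotlib Axes for further customization
for ax in g.axes.flat:
    ax.axhline(0, color="grey", linewidth=0.8, linestyle="--")
    ax.spines["left"].set_linewidth(1.5)

g.set_axis_labels("Independent variable", "Dependent variable")
g.figure.suptitle("Figure-level plot customized through Matplotlib", y=1.03)
```

Iterating over g.axes.flat works regardless of how many facets the grid contains, which keeps the customization code independent of the col/row layout.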
Customizing Seaborn Plot Aesthetics Beyond Defaults
While sns.set_theme() provides global styling, you can customize themes and contexts more finely or temporarily.
- Styling Functions:
  - sns.set_style("style_name"): Sets the aesthetic style (like "darkgrid", "whitegrid", "ticks"). Controls background, grid, spines.
  - sns.set_palette("palette_name"): Sets the default color palette.
  - sns.set_context("context_name"): Scales plot elements (lines, fonts, markers) for different contexts like "paper", "notebook" (default), "talk", "poster".
- Temporary Styling: Use a with statement for temporary style changes:
with sns.axes_style("darkgrid"):
    # Plots inside this block use the darkgrid style
    plt.figure()
    sns.histplot(data=tips, x='tip')
    plt.title("Histogram with Temporary Darkgrid Style")
# Style reverts outside the 'with' block
plt.figure()
sns.histplot(data=tips, x='total_bill')
plt.title("Histogram with Default Style (After 'with')")
plt.show()
- Customizing Theme Parameters: You can pass a dictionary of parameters to set_theme or axes_style to override specific Matplotlib rcParams.
custom_params = {"axes.spines.right": False, "axes.spines.top": False, "grid.color": ".8", "grid.linestyle": ":"}
sns.set_theme(style="white", rc=custom_params) # Apply white style with custom overrides
plt.figure()
sns.boxplot(data=tips, x='day', y='tip')
plt.title("Box Plot with Custom Theme Parameters")
plt.show()
sns.reset_defaults() # Reset to default settings
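The same rcParams can be overridden temporarily with plain Matplotlib through plt.rc_context, which is handy when Seaborn is not involved. A minimal sketch, assuming Matplotlib's default settings are otherwise in effect:

```python
import matplotlib.pyplot as plt

overrides = {"axes.spines.top": False, "axes.spines.right": False}

# rcParams are changed only inside the block and restored afterwards
with plt.rc_context(overrides):
    fig_clean, ax_clean = plt.subplots()
    ax_clean.plot([0, 1, 2], [0, 1, 4])
    ax_clean.set_title("Created inside rc_context")

fig_default, ax_default = plt.subplots()  # created after the block
ax_default.plot([0, 1, 2], [0, 1, 4])
ax_default.set_title("Created outside rc_context")

print(ax_clean.spines["top"].get_visible(), ax_default.spines["top"].get_visible())
```

As with sns.axes_style, the Axes must be created inside the with block for the spine settings to apply, because these parameters are read at Axes creation time.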
Visualizing Uncertainty Confidence Intervals and Error Bars
Many real-world analyses involve uncertainty (due to sampling, measurement error, etc.). Visualizing this uncertainty is crucial for honest data representation.
- Seaborn's Built-in Uncertainty: Many Seaborn functions automatically calculate and display uncertainty:
  - lineplot: Shows confidence interval (default 95%) around the estimated trend, especially if multiple y-values exist for each x.
  - barplot, pointplot: Show confidence intervals (default bootstrapped 95% CI) for the estimated mean (or other estimator) in each category.
  - regplot, lmplot: Show confidence interval around the regression line.
  - For lineplot, barplot, and pointplot you can control this with the errorbar parameter (e.g., errorbar='sd' for standard deviation, errorbar=None to disable). regplot and lmplot use the ci parameter instead (e.g., ci=None to disable).
- Custom Error Bars with Matplotlib: If you have pre-calculated errors (like standard deviations, standard errors, or confidence intervals), you can use Matplotlib's ax.errorbar() function.
import matplotlib.pyplot as plt
import numpy as np
# Example data with errors
categories = ['A', 'B', 'C', 'D']
means = np.array([20, 35, 30, 27])
std_errors = np.array([2, 3, 2.5, 2.2]) # Example standard errors
fig, ax = plt.subplots(figsize=(7, 5))
# Plot means as points and add error bars
ax.errorbar(categories, means, yerr=std_errors,
fmt='o', # Format for the points ('o', 's', '-', etc.)
color='dodgerblue',
ecolor='lightcoral', # Color of the error bars
elinewidth=2, # Linewidth of error bars
capsize=5, # Size of the caps on error bars
label='Mean +/- SE')
ax.set_ylabel('Measured Value')
ax.set_title('Data with Custom Error Bars')
ax.set_ylim(0, 45)
ax.grid(True, axis='y', linestyle=':', alpha=0.6)
ax.legend()
plt.show()
ax.errorbar gives full control over how uncertainty is displayed when you provide the error values directly.
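In the example above the standard errors are hard-coded. A sketch of how such values might be derived from raw measurements follows; the samples are randomly generated placeholders, and the 95% interval uses the normal approximation (1.96 times the standard error), which is an assumption that becomes less accurate for small samples.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical raw measurements: 30 samples for each of four categories
samples = {
    "A": rng.normal(20, 6, size=30),
    "B": rng.normal(35, 8, size=30),
    "C": rng.normal(30, 7, size=30),
    "D": rng.normal(27, 6, size=30),
}

categories = list(samples)
means = np.array([v.mean() for v in samples.values()])
# Standard error of the mean: sample standard deviation / sqrt(n)
std_errors = np.array([v.std(ddof=1) / np.sqrt(v.size) for v in samples.values()])
# Approximate 95% CI half-width under a normal approximation
ci95 = 1.96 * std_errors

fig, ax = plt.subplots(figsize=(7, 5))
ax.errorbar(categories, means, yerr=ci95, fmt="o", capsize=5,
            label="Mean with approx. 95% CI")
ax.set_ylabel("Measured Value")
ax.legend()
```

Whichever measure you plot (standard deviation, standard error, or a confidence interval), state it in the legend or caption, since the three can differ considerably in size.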
Workshop Advanced Statistical Analysis Visualization
Goal: Use advanced Seaborn plots (lmplot, heatmap) to analyze relationships and correlations within the 'penguins' dataset, focusing on differences between species.
Dataset: The 'penguins' dataset, cleaned of missing values.
Steps:
- Setup:
  - Create a new Python file (e.g., penguin_advanced.py).
  - Import seaborn, matplotlib.pyplot, pandas, numpy.
  - Load the penguins dataset and drop rows with missing values (dropna()).
- Analyze Relationship with lmplot:
  - Investigate the relationship between bill_length_mm and flipper_length_mm.
  - Use sns.lmplot() to fit a separate regression line for each species (using hue='species').
  - Customize the plot appearance (e.g., height, aspect, palette, scatter_kws).
  - Add an appropriate overall title. Interpret the results: Does the relationship differ significantly across species?
Interpretation: Observe if the slopes or intercepts of the regression lines vary noticeably between Adelie, Chinstrap, and Gentoo penguins, suggesting different scaling relationships between bill and flipper length for each species.
# Step 2: Analyze Relationship with lmplot
print("Generating lmplot...")
g = sns.lmplot(
    data=penguins,
    x="bill_length_mm",
    y="flipper_length_mm",
    hue="species",
    height=6,
    aspect=1.1,
    palette="viridis", # Choose a suitable palette
    scatter_kws={'alpha': 0.6, 's': 40},
    line_kws={'linewidth': 2}
)
g.set_axis_labels("Bill Length (mm)", "Flipper Length (mm)")
g.fig.suptitle('Relationship between Bill Length and Flipper Length by Species', y=1.03, fontsize=14)
# Optional: Adjust legend position/title
# g.legend.set_title("Penguin Species")
plt.tight_layout(rect=[0, 0.03, 1, 0.96])
# bbox_inches='tight' keeps the title (placed at y=1.03) inside the saved image
plt.savefig('penguin_lmplot_species.png', dpi=300, bbox_inches='tight')
print("lmplot saved as penguin_lmplot_species.png")
plt.show()
- Analyze Correlations with heatmap:
  - Calculate the correlation matrix for the numerical features within each species. This requires filtering (or grouping) the data by species first.
  - Create a figure with subplots (e.g., 1 row, 3 columns using plt.subplots) to display one heatmap per species.
  - Iterate through each species:
    - Filter the DataFrame for that species.
    - Select numerical columns.
    - Calculate the correlation matrix (.corr()).
    - Use sns.heatmap() to plot the matrix on the corresponding subplot Axes. Customize with annot=True, cmap, fmt, etc. Add a title to each subplot indicating the species.
  - Add an overall figure title. Adjust layout. Save the figure.
  - Interpret: Are the correlation patterns similar or different across the species?
Interpretation: Compare the correlation values (e.g., between bill length and bill depth, or flipper length and body mass) across the three heatmaps. Strong differences might indicate distinct morphological strategies or adaptations among the species. For instance, is the correlation between bill length and depth stronger in one species than others?
# Step 3: Analyze Correlations with heatmap per species
print("Generating heatmaps...")
numerical_cols = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']
species_list = penguins['species'].unique()
fig, axes = plt.subplots(1, len(species_list), figsize=(18, 5), sharey=True) # Share y-axis for consistent feature order
fig.suptitle('Feature Correlation Matrices by Penguin Species', fontsize=16, y=1.03)
for i, species_name in enumerate(species_list):
    ax = axes[i]
    # Filter data for the current species
    species_data = penguins[penguins['species'] == species_name][numerical_cols]
    # Calculate correlation matrix
    corr_matrix = species_data.corr()
    # Plot heatmap; vmin/vmax fix the color scale so the single colorbar is valid for every panel
    sns.heatmap(corr_matrix, annot=True, cmap='vlag', fmt=".2f", linewidths=.5, ax=ax,
                vmin=-1, vmax=1,
                cbar=(i == len(species_list) - 1), # Only show colorbar for the last plot
                cbar_kws={'label': 'Correlation Coefficient'} if (i == len(species_list) - 1) else {})
    ax.set_title(f'{species_name} Penguins', fontsize=12)
    ax.tick_params(axis='x', rotation=45)
    if i > 0: # Remove y-labels for inner plots if sharing y-axis
        ax.set_ylabel('')
        ax.tick_params(axis='y', length=0)
plt.tight_layout(rect=[0, 0.03, 1, 0.95]) # Adjust layout
# bbox_inches='tight' keeps the title (placed at y=1.03) inside the saved image
plt.savefig('penguin_heatmaps_species.png', dpi=300, bbox_inches='tight')
print("Heatmaps saved as penguin_heatmaps_species.png")
plt.show()
- Run and Interpret:
  - Execute penguin_advanced.py.
  - Carefully examine the lmplot and the species-specific heatmaps.
  - Synthesize the findings: What do these advanced statistical visualizations reveal about the differences and similarities between the penguin species based on their physical measurements?
This workshop applied more sophisticated Seaborn functions (lmplot, heatmap) combined with data manipulation (grouping) and Matplotlib layout control (subplots) to perform a deeper comparative analysis, revealing subtle statistical patterns and relationships within different segments of the data.
7. Telling Compelling Stories with Data
We've journeyed through the technical aspects of creating visualizations with Matplotlib and Seaborn, from basic plots to advanced statistical graphics and customization. Now, we bring everything together to focus on the ultimate goal: telling compelling stories that inform, persuade, and drive action using data. This involves structuring your narrative, choosing appropriate visual metaphors, using annotations strategically, and presenting your findings effectively.
Structuring Your Data Narrative
A data story isn't just a collection of charts; it needs structure to guide the audience logically from context to conclusion. A common and effective structure follows these steps:
- The Hook / The Question: Start by grabbing your audience's attention. What problem are you addressing? What question are you trying to answer? Why is this important? Example: "Passenger survival rates on the Titanic famously varied. But how much did factors like class and gender independently influence someone's chances?"
- Provide Context: Set the scene. Describe the dataset, define key terms, explain any relevant background information or benchmarks. Example: Briefly introduce the Titanic dataset, explain what 'pclass' means, and mention the overall survival rate.
- The Rising Action / Present Findings: Introduce the data and visualizations sequentially. Start with broader overviews and gradually focus on more specific insights. Each visual should build upon the last, supporting the central narrative.
- Visual 1: Show overall survival count/rate (e.g., countplot or bar chart).
- Visual 2: Break down survival by gender (e.g., catplot count by sex). Narrative: "Gender was a major factor..."
- Visual 3: Break down survival by class (e.g., catplot count by pclass). Narrative: "...but passenger class also played a crucial role."
- Visual 4: Combine factors (e.g., barplot of survival rate by class, hue by sex). Narrative: "Looking deeper, we see the interplay: females in all classes fared better than males, but first-class passengers had higher survival rates overall, especially women."
- The Climax / The Insight: Clearly present the main takeaway message, often highlighted in a final, focused visualization with clear annotations. Example: A refined bar chart explicitly comparing survival rates for specific groups (e.g., 1st class female vs. 3rd class male) with annotations emphasizing the disparity.
- The Conclusion / The Resolution: Summarize the key findings. What are the implications? What are the limitations? What questions remain? What actions should be taken based on these insights? Example: "While 'women and children first' was a factor, social class heavily dictated survival chances, particularly among men. This highlights the stark social stratification aboard the ship."
This structure provides a clear path for your audience, making complex information digestible and memorable.
Choosing the Right Visual Metaphor
The type of chart you choose acts as a visual metaphor for the data relationship you want to emphasize. Selecting the wrong metaphor can confuse or mislead your audience. Revisit the guidelines from Section 3 ("Choosing the Right Plot"):
- Change over Time: Line charts, area charts. Metaphor: A journey or flow.
- Comparison across Categories: Bar charts (vertical or horizontal), point plots. Metaphor: Comparing heights or lengths.
- Part-to-Whole Composition: Stacked bar charts, treemaps (use pie charts sparingly). Metaphor: Slices of a whole or segments of a total.
- Relationship/Correlation: Scatter plots, regression plots, connected scatter plots (for time evolution of two variables). Metaphor: A pattern of points or a trend line.
- Distribution: Histograms, density plots (KDE), box plots, violin plots. Metaphor: The shape or spread of the data.
- Geospatial Data: Choropleth maps, point maps (often require libraries like GeoPandas). Metaphor: Location and spatial patterns.
Think about the primary relationship in your data and select the chart type that best represents that relationship visually.
Annotation and Emphasis Guiding Your Audience
Annotations are your tools for turning a chart from a passive display into an active part of your narrative. They bridge the gap between the visual and the story.
Effective Annotation Techniques:
- Highlighting Key Points: Use color, size, or added marks (arrows, circles) to draw immediate attention to the most important data points, trends, or differences, as demonstrated in the Section 4 workshop.
- Explaining Significance: Add text directly on the chart (using ax.text or ax.annotate) to explain what the highlighted element means or why it's important. Don't assume the audience will automatically understand the implication.
- Labeling Directly: When possible and not too cluttered, label data series or significant points directly instead of relying solely on a legend.
- Titles and Subtitles: Use narrative titles that state the main finding (e.g., "Survival Rate Plummets for Third-Class Men") rather than just describing the chart axes (e.g., "Survival Rate vs. Class and Gender"). Use subtitles for context or data sources.
- Reference Lines/Regions: Add lines (e.g., ax.axhline, ax.axvline) or shaded regions (ax.axvspan, ax.axhspan) to indicate targets, thresholds, averages, or specific periods, providing context for the plotted data.
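As a compact illustration of several techniques from the list above, the following sketch combines a reference line, a shaded region, direct labels, and a narrative title on one chart. The monthly values and the target are made-up placeholders, not real data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Illustrative values only (not real measurements)
months = np.arange(1, 13)
sales = np.array([12, 14, 13, 17, 21, 24, 23, 22, 18, 16, 15, 19], dtype=float)
target = 20.0  # hypothetical threshold

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(months, sales, color="tab:blue", linewidth=2)

# Reference line for the target, shaded region for the period of interest
ax.axhline(target, color="grey", linestyle="--", linewidth=1)
ax.axvspan(5, 8, color="orange", alpha=0.15)

# Direct labels instead of a legend
ax.text(12.1, sales[-1], "Sales", color="tab:blue", va="center")
ax.text(1, target + 0.4, "Target", color="grey")

# Annotation that explains the highlighted point
peak_idx = int(sales.argmax())
ax.annotate("Peak falls inside the shaded period",
            xy=(months[peak_idx], sales[peak_idx]), xytext=(8.5, 25.5),
            arrowprops=dict(arrowstyle="->", color="black"), fontsize=9)

# Narrative title that states the finding
ax.set_title("Sales Exceeded the Target Only in Mid-Year",
             loc="left", fontweight="bold")
ax.set_xlabel("Month")
ax.set_ylabel("Units sold")
ax.set_xlim(1, 13)
ax.set_ylim(10, 27)
# Call fig.savefig(...) or plt.show() to view the result
```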
Example: Adding Narrative Annotations
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import matplotlib.ticker as mticker # Needed for PercentFormatter below
# Load and prepare Titanic data (as in Section 3 Workshop)
titanic = sns.load_dataset("titanic")
median_age = titanic['age'].median()
# Assign the result back; chained fillna(..., inplace=True) is deprecated in pandas 2.x
titanic['age'] = titanic['age'].fillna(median_age)
mode_embarked = titanic['embarked'].mode()[0]
titanic['embarked'] = titanic['embarked'].fillna(mode_embarked)
mode_embark_town = titanic['embark_town'].mode()[0]
titanic['embark_town'] = titanic['embark_town'].fillna(mode_embark_town)
titanic.drop(columns=['deck'], inplace=True)
# --- Create the base plot: Survival Rate by Class and Sex ---
plt.style.use('seaborn-v0_8-whitegrid') # Use a clean style
fig, ax = plt.subplots(figsize=(9, 6))
sns.barplot(data=titanic, x='pclass', y='survived', hue='sex',
palette={'male': 'lightblue', 'female': 'lightcoral'},
errorbar=None, ax=ax) # errorbar=None simplifies for annotation
# --- Add Annotations and Emphasis ---
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylabel("Survival Rate", fontsize=11)
ax.set_xlabel("Passenger Class", fontsize=11)
ax.tick_params(axis='both', which='major', labelsize=10)
ax.set_ylim(0, 1.05) # Ensure space for annotations
# Format y-axis as percentage
ax.yaxis.set_major_formatter(mticker.PercentFormatter(xmax=1.0))
# Narrative Title
ax.set_title("Wealth and Gender Were Key Determinants of Titanic Survival", fontsize=14, loc='left', fontweight='bold')
fig.suptitle("Survival rates varied dramatically across passenger groups", fontsize=11, y=0.92, x=0.125, ha='left', color='grey')
# Annotations highlighting key findings
# Compute the two rates once so labels and callouts always match the data
rate_f1 = titanic[(titanic.pclass == 1) & (titanic.sex == 'female')].survived.mean()
rate_m3 = titanic[(titanic.pclass == 3) & (titanic.sex == 'male')].survived.mean()
# With two hue levels the bars sit at x - 0.2 (male) and x + 0.2 (female),
# assuming seaborn's default dodge and the hue order of this dataset
# Highlight high survival for 1st class females
ax.text(0 + 0.2, rate_f1 + 0.02, f"{rate_f1:.0%}",
ha='center', color='darkred', fontweight='bold')
ax.annotate(f'Highest Survival:\n~{rate_f1:.0%} for 1st Class Females', xy=(0 + 0.2, rate_f1), xytext=(-0.4, 0.6), # Adjust text position
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.2", color='darkred'),
fontsize=9, color='darkred')
# Highlight low survival for 3rd class males
ax.text(2 - 0.2, rate_m3 + 0.02, f"{rate_m3:.0%}",
ha='center', color='darkblue', fontweight='bold')
ax.annotate(f'Lowest Survival:\n~{rate_m3:.0%} for 3rd Class Males', xy=(2 - 0.2, rate_m3), xytext=(1.5, 0.25), # Adjust text position
arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=-.2", color='darkblue'),
fontsize=9, color='darkblue')
# Customize legend
ax.legend(title='Sex', frameon=False, loc='upper center', bbox_to_anchor=(0.5, -0.15), ncol=2)
plt.tight_layout(rect=[0, 0.1, 1, 0.9]) # Adjust rect for legend/titles
plt.savefig("titanic_survival_story.png", dpi=300)
plt.show()
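The example above positions its labels with hand-written coordinates, which need rechecking whenever the data changes. As an alternative sketch, labels can be derived from the bars themselves with Axes.bar_label (available in Matplotlib 3.4 and later). The group names and rates below are placeholders, not Titanic figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt

groups = ["Group A", "Group B", "Group C"]
rates = [0.25, 0.5, 0.75]  # illustrative values only

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(groups, rates, color="lightgrey")

# Percent labels are generated from the data, so they cannot drift out of sync
labels = ax.bar_label(bars, labels=[f"{r:.0%}" for r in rates], padding=3)

# Emphasize only the bar that carries the message (here: the highest rate)
top = max(range(len(rates)), key=rates.__getitem__)
bars[top].set_color("darkred")
labels[top].set_color("darkred")
labels[top].set_fontweight("bold")

ax.set_ylim(0, 1.05)
ax.set_ylabel("Rate")
ax.set_title("Group C Has the Highest Rate", loc="left", fontweight="bold")
```

Muting every bar except the one under discussion follows the same emphasis principle as the arrows in the Titanic example, with fewer coordinates to maintain.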
Presenting Your Visualizations Effectively
How you present your data story matters almost as much as the content itself. Consider the medium and audience:
- Reports/Documents: Ensure high-resolution plots (use dpi=300 or vector formats like PDF/SVG for savefig). Integrate plots smoothly with surrounding text. Use captions to explain the plot and its relevance to the text. Ensure consistent styling across all visuals.
- Presentations (Slides): Keep plots simple and focused on one message per slide. Use large fonts and clear visuals. Use animations or progressive reveals (builds) to introduce elements sequentially, guiding the audience's focus. Minimize text on slides; use visuals as the primary communication tool, supplemented by your verbal explanation.
- Interactive Dashboards/Web: (Beyond Matplotlib/Seaborn basics, often using Plotly Dash, Streamlit, Bokeh). Design for user interaction. Allow exploration (filtering, zooming) but guide users towards key insights. Ensure responsiveness across different screen sizes.
- Accessibility: Use colorblind-friendly palettes (like viridis, magma, cividis, or ColorBrewer diverging/qualitative sets). Ensure sufficient contrast. Use clear fonts. Provide text alternatives or detailed descriptions for complex visuals where appropriate.
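To make the accessibility advice concrete, here is a minimal sketch that samples line colors from the cividis colormap and also varies the line style, so that color is not the only channel distinguishing the series. The four series are synthetic curves used purely for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 50)
# Synthetic series for illustration only
series = {
    "A": np.sin(x),
    "B": np.cos(x),
    "C": np.sin(x + 1.0),
    "D": np.cos(x + 1.0),
}

# Evenly spaced colors from cividis, skipping its lightest end for contrast
colors = plt.get_cmap("cividis")(np.linspace(0.0, 0.85, len(series)))
linestyles = ["-", "--", "-.", ":"]

fig, ax = plt.subplots(figsize=(8, 4))
for (name, y), color, ls in zip(series.items(), colors, linestyles):
    # Redundant encoding: each series differs in both color and line style
    ax.plot(x, y, color=color, linestyle=ls, linewidth=2, label=name)

ax.legend(title="Series", frameon=False)
ax.set_title("Color Plus Line Style Keeps Series Distinguishable", loc="left")
```

For categorical data in Seaborn, sns.color_palette("colorblind") is another option worth trying.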
Key Presentation Tips:
- Know Your Audience: Tailor complexity and focus.
- One Key Message per Visual: Avoid overwhelming the audience.
- Label Everything Clearly: Axes, titles, legends, annotations.
- Use Consistent Style: Colors, fonts, layout across related visuals.
- Practice Your Narrative: Rehearse how you will explain the visuals and connect them to the overall story.
- Seek Feedback: Ask others if your story and visuals are clear and compelling.
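For the Reports/Documents advice above, the sketch below saves one figure as a 300 dpi PNG and as SVG and PDF vector files. The temporary output directory and the file names are assumptions chosen for the example; substitute your own project paths.

```python
import tempfile
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line for interactive use
import matplotlib.pyplot as plt

# Temporary directory so the sketch leaves no files in your project
out_dir = Path(tempfile.mkdtemp())

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot([1, 2, 3], [2, 4, 3])  # placeholder data
ax.set_title("Placeholder figure")
fig.tight_layout()

png_path = out_dir / "figure.png"
svg_path = out_dir / "figure.svg"
pdf_path = out_dir / "figure.pdf"

fig.savefig(png_path, dpi=300)  # raster: 6x4 inches at 300 dpi
fig.savefig(svg_path)           # vector: scales cleanly on the web
fig.savefig(pdf_path)           # vector: suits LaTeX and print documents
plt.close(fig)

print("Saved:", png_path.name, svg_path.name, pdf_path.name)
```

savefig infers the format from the file extension, so the same figure object can feed a report, a slide deck, and a web page without being redrawn.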
Workshop From Analysis to Narrative A Complete Data Story
Goal: Take a dataset, perform exploratory analysis to find an interesting insight, and build a short, compelling data story (2-3 visualizations) using Matplotlib/Seaborn, focusing on narrative structure, clear visuals, and annotations.
Dataset: We'll use a dataset related to CO2 emissions. We can fetch historical CO2 emissions data per capita for a few selected countries using available libraries or a prepared CSV file. (For simplicity, let's assume we have a CSV co2_data_subset.csv with columns: Year, Country, CO2_per_capita).
Scenario: Explore how CO2 emissions per capita have changed over time for a few major economies (e.g., USA, China, Germany, India) and tell a story about their differing trajectories.
Example co2_data_subset.csv structure:
Year,Country,CO2_per_capita
1960,USA,15.99
1960,China,1.20
1960,Germany,9.85
1960,India,0.26
...
2018,USA,15.24
2018,China,7.38
2018,Germany,8.88
2018,India,1.84
...
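Before writing any plotting code, it can help to check that a file in this layout loads as expected. The sketch below inlines only the eight sample rows shown above (the elided rows are left out) and reshapes them to compare 1960 with 2018; the variable names are chosen for this example.

```python
import io

import pandas as pd

# The eight sample rows from the structure above, inlined for a quick check
sample_csv = io.StringIO("""Year,Country,CO2_per_capita
1960,USA,15.99
1960,China,1.20
1960,Germany,9.85
1960,India,0.26
2018,USA,15.24
2018,China,7.38
2018,Germany,8.88
2018,India,1.84
""")

co2_sample = pd.read_csv(sample_csv)

# Basic sanity checks: column types, countries, and time range
print(co2_sample.dtypes)
print(sorted(co2_sample["Country"].unique()))
print(co2_sample["Year"].min(), "-", co2_sample["Year"].max())

# Long -> wide: one row per country, one column per year
wide = co2_sample.pivot(index="Country", columns="Year", values="CO2_per_capita")

# Absolute change between the two sample years, largest increase first
change = (wide[2018] - wide[1960]).sort_values(ascending=False)
print(change)
```

With the full file, the same pivot gives a quick table of changes that can hint at which country's trajectory deserves the narrative focus.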
Steps:
- Setup and Exploration:
  - Create a Python file (e.g., co2_story.py).
  - Import pandas, matplotlib.pyplot, seaborn, matplotlib.ticker.
  - Load the co2_data_subset.csv into a Pandas DataFrame.
  - Explore the data: check data types, time range, countries included. Perform basic plotting (e.g., individual line plots per country) to understand trends.
  - Identify the Narrative: Based on exploration, a likely story is the dramatic rise of China's per capita emissions compared to the stabilization/slight decline in the US and Germany, and the lower but rising level in India. Our story: "Shifting Landscapes: China's Rapid Rise in Per Capita CO2 Emissions Compared to Established Economies."

# Step 1: Setup and Exploration
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mticker
import numpy as np # For potential calculations

# Assume 'co2_data_subset.csv' exists in the same directory
try:
    co2_df = pd.read_csv('co2_data_subset.csv')
except FileNotFoundError:
    print("Error: co2_data_subset.csv not found. Please create this file.")
    # Create dummy data if file not found for demonstration
    years = np.arange(1960, 2021)
    data = []
    countries = ['USA', 'China', 'Germany', 'India']
    # Simplified trend simulation - replace with real data!
    trends = {
        'USA': 16 + 4 * np.sin(np.pi * (years - 1960) / 60) - (years - 1960) * 0.05,
        'China': 1.2 + 0.005 * (years - 1960)**2,
        'Germany': 10 + 2 * np.sin(np.pi * (years - 1960) / 50) - (years - 1990) * 0.1 * (years > 1990),
        'India': 0.25 + 0.0005 * (years - 1960)**2
    }
    for year in years:
        for country in countries:
            if year <= 2018: # Limit dummy data for consistency
                val = trends[country][year-1960] * (1 + np.random.randn()*0.05) # Add noise
                data.append({'Year': year, 'Country': country, 'CO2_per_capita': max(0, val)}) # Ensure non-negative
    co2_df = pd.DataFrame(data)
    co2_df.to_csv('co2_data_subset.csv', index=False) # Save dummy data
    print("Dummy co2_data_subset.csv created.")

co2_df['Year'] = pd.to_datetime(co2_df['Year'], format='%Y') # Convert Year to datetime
print("CO2 Data Loaded:")
print(co2_df.head())
print("\nCountries:", co2_df['Country'].unique())
print("Time Range:", co2_df['Year'].min().year, "-", co2_df['Year'].max().year)

# Initial exploratory plot (optional, not part of final story visuals)
# plt.figure(figsize=(10,6))
# sns.lineplot(data=co2_df, x='Year', y='CO2_per_capita', hue='Country')
# plt.title('Exploratory Plot: CO2 Emissions Per Capita')
# plt.show()

# Narrative Identified: Focus on China's rise vs others.
- Visual 1: The Overall Trend:
  - Create a line plot showing the CO2_per_capita trend over Year for all four countries.
  - Use color to distinguish countries.
  - Use a clean style. Add clear labels and a narrative title introducing the topic.
  - This sets the context and shows the general picture.

# Step 2: Visual 1 - Overall Trend Context
plt.style.use('seaborn-v0_8-whitegrid')
fig1, ax1 = plt.subplots(figsize=(10, 6))
sns.lineplot(data=co2_df, x='Year', y='CO2_per_capita', hue='Country',
             palette='tab10', linewidth=2, ax=ax1)
ax1.set_title('Diverging Paths: Per Capita CO2 Emissions (1960-2018)', fontsize=14, loc='left', fontweight='bold')
ax1.set_ylabel('Metric Tons of CO2 per Capita', fontsize=11)
ax1.set_xlabel('Year', fontsize=11)
ax1.legend(title='Country', frameon=False)
ax1.spines['top'].set_visible(False)
ax1.spines['right'].set_visible(False)
ax1.tick_params(axis='both', which='major', labelsize=10)
ax1.grid(True, axis='y', linestyle=':', alpha=0.7)
plt.tight_layout()
plt.savefig('co2_story_visual_1.png', dpi=300)
print("Visual 1 saved.")
plt.show()
- Visual 2: Highlighting China's Trajectory:
  - Re-plot the data, but this time use emphasis to highlight China's line.
  - Make China's line thicker and/or a more vibrant color. Make other lines thinner and grey/muted.
  - Add annotations pointing out the rapid increase in China's emissions, especially post-2000, and perhaps the peak/plateau of US/German emissions.
  - Adjust the title to reflect the focus on China.

# Step 3: Visual 2 - Highlighting China's Trajectory
fig2, ax2 = plt.subplots(figsize=(10, 6))
countries_to_plot = co2_df['Country'].unique()
highlight_country = 'China'
colors = {country: ('red' if country == highlight_country else 'lightgrey') for country in countries_to_plot}
linewidths = {country: (3 if country == highlight_country else 1.5) for country in countries_to_plot}
alphas = {country: (1.0 if country == highlight_country else 0.7) for country in countries_to_plot}
for country in countries_to_plot:
    subset = co2_df[co2_df['Country'] == country]
    sns.lineplot(data=subset, x='Year', y='CO2_per_capita',
                 color=colors[country], linewidth=linewidths[country], alpha=alphas[country],
                 label=country if country == highlight_country else None, # Only label highlighted
                 ax=ax2)

# Add labels for muted lines manually if needed (can get cluttered)
for country in countries_to_plot:
    if country != highlight_country:
        last_point = co2_df[(co2_df['Country'] == country) & (co2_df['Year'] == co2_df['Year'].max())]
        if not last_point.empty:
            ax2.text(last_point['Year'].iloc[0] + pd.Timedelta(days=100), last_point['CO2_per_capita'].iloc[0],
                     country, color='grey', fontsize=9, va='center')

# Annotations
# China's rise
china_2000 = co2_df[(co2_df['Country'] == 'China') & (co2_df['Year'].dt.year == 2000)]['CO2_per_capita'].iloc[0]
china_2018 = co2_df[(co2_df['Country'] == 'China') & (co2_df['Year'].dt.year == 2018)]['CO2_per_capita'].iloc[0]
ax2.annotate(f'China\'s emissions surged\npost-2000 ({china_2000:.1f} to {china_2018:.1f} tons)',
             xy=(pd.Timestamp('2009-01-01'), china_2018 * 0.8), # Adjust position
             xytext=(pd.Timestamp('1975-01-01'), china_2018 * 1.1), # Adjust text position
             arrowprops=dict(arrowstyle="->", connectionstyle="arc3,rad=.3", color='red'),
             fontsize=10, color='red',
             bbox=dict(boxstyle="round,pad=0.3", fc="white", ec="red", alpha=0.8))

# US/Germany plateau (optional annotation)
us_peak_year = co2_df[co2_df.Country=='USA']['CO2_per_capita'].idxmax() # Find index of max value
us_peak_val = co2_df.loc[us_peak_year]['CO2_per_capita']
us_peak_year_dt = co2_df.loc[us_peak_year]['Year']
ax2.text(us_peak_year_dt - pd.Timedelta(days=4000), us_peak_val * 1.1,
         'USA/Germany emissions peaked\n and started declining',
         fontsize=9, color='dimgray', ha='center')

# Styling
ax2.set_title('China Became a Major CO2 Emitter Per Capita After 2000', fontsize=14, loc='left', fontweight='bold')
ax2.set_ylabel('Metric Tons of CO2 per Capita', fontsize=11)
ax2.set_xlabel('Year', fontsize=11)
ax2.legend(title=highlight_country, loc='upper left', frameon=False) # Legend only for highlighted
ax2.spines['top'].set_visible(False)
ax2.spines['right'].set_visible(False)
ax2.tick_params(axis='both', which='major', labelsize=10)
ax2.grid(True, axis='y', linestyle=':', alpha=0.7)
plt.tight_layout()
plt.savefig('co2_story_visual_2.png', dpi=300)
print("Visual 2 saved.")
plt.show()
- Visual 3 (Optional): Comparative Snapshot:
  - Create a bar chart comparing the CO2_per_capita values for the selected countries in two specific years (e.g., 1990 and 2018) to starkly show the change.
  - This provides a clear before-and-after comparison reinforcing the narrative.

# Step 4: Visual 3 - Comparative Snapshot (Bar Chart)
compare_years = [1990, 2018]
compare_df = co2_df[co2_df['Year'].dt.year.isin(compare_years)].pivot(index='Country', columns='Year', values='CO2_per_capita')
compare_df.columns = [str(col.year) for col in compare_df.columns] # Rename columns to strings
fig3, ax3 = plt.subplots(figsize=(8, 5))
compare_df.plot(kind='bar', ax=ax3, colormap='Pastel2', width=0.8)

# Styling and Annotations
ax3.set_title(f'CO2 Emissions Per Capita Shift ({compare_years[0]} vs {compare_years[1]})', fontsize=13, loc='left', fontweight='bold')
ax3.set_ylabel('Metric Tons of CO2 per Capita', fontsize=10)
ax3.set_xlabel('Country', fontsize=10)
ax3.tick_params(axis='x', rotation=0, labelsize=10)
ax3.tick_params(axis='y', labelsize=9)
ax3.legend(title='Year', frameon=False)
ax3.spines['top'].set_visible(False)
ax3.spines['right'].set_visible(False)
ax3.grid(True, axis='y', linestyle=':', alpha=0.5)

# Add value labels (optional)
for container in ax3.containers:
    ax3.bar_label(container, fmt='%.1f', label_type='edge', fontsize=8, padding=2)
ax3.margins(y=0.1) # Add margin for labels
plt.tight_layout()
plt.savefig('co2_story_visual_3.png', dpi=300)
print("Visual 3 saved.")
plt.show()
- Synthesize the Story:
  - Review the three visuals (co2_story_visual_1.png, co2_story_visual_2.png, co2_story_visual_3.png).
  - Write a short narrative (1-3 paragraphs) that uses these visuals to tell the story identified in Step 1. Start with the overall context (Visual 1), then focus on China's dramatic change using the emphasis and annotations in Visual 2, and potentially use Visual 3 to provide a stark numerical comparison confirming the shift. Conclude with the implications (e.g., shifting global emissions landscape, challenges for climate policy).

Example Narrative Snippet: "Historical data reveals distinct trajectories in per capita CO2 emissions among major economies since 1960 (see Visual 1). While the USA and Germany, early industrializers, showed high levels that eventually stabilized or declined, India's emissions remained low but grew steadily. The most dramatic story, however, is China's (highlighted in Visual 2). Following its rapid economic expansion, particularly after 2000, China's per capita emissions surged, overtaking Germany and significantly closing the gap with the US. Annotations on Visual 2 pinpoint this rapid acceleration. A direct comparison between 1990 and 2018 (Visual 3) underscores this transformation, showing China's per capita emissions multiplying while others saw more modest changes or reductions. This shift highlights the evolving global landscape of CO2 emissions and the critical role of developing economies in future climate trends."
This workshop walked through the process of finding a narrative within data, creating a sequence of visualizations with increasing focus and emphasis, using annotations to guide the audience, and structuring the visuals to tell a coherent and compelling data story.
Conclusion
Throughout this guide, we have explored the essential tools and techniques for data visualization and storytelling using Matplotlib and Seaborn in a Linux environment. We started with the fundamentals of Matplotlib, understanding its core components and creating basic plot types like line, scatter, and bar charts. We saw how crucial customization – labels, titles, legends, colors, and saving plots – is for initial clarity.
We then introduced Seaborn as a high-level interface built upon Matplotlib, streamlining the creation of sophisticated statistical plots. We learned how Seaborn simplifies visualizing distributions, categorical data, and relationships, often with built-in statistical intelligence and aesthetically pleasing defaults. The workshops provided hands-on practice with real-world datasets like 'tips', 'penguins', and 'titanic', reinforcing these concepts.
Moving to intermediate techniques, we focused on enhancing visualizations through advanced customization of colors and styles, managing multiple subplots effectively using Matplotlib's subplots and Seaborn's figure-level functions (catplot, relplot, pairplot), and the critical skill of choosing the right plot for the data and the intended message.
Crucially, we transitioned from merely creating plots to crafting narratives. We introduced the core principles of data storytelling: identifying a clear message, providing context, decluttering visuals to maximize the data-ink ratio, using visual cues like color and annotation strategically, and structuring a sequence of plots to build a compelling argument.
In the advanced sections, we delved deeper into Matplotlib's object-oriented interface, gaining fine-grained control over ticks, grids, and spines, and mastering complex layouts with GridSpec and inset axes. We explored advanced Seaborn capabilities, including regression plots (lmplot), heatmaps, and clustermaps, understanding how to visualize statistical models and matrix data effectively. We emphasized the seamless integration between Matplotlib and Seaborn and the importance of visualizing uncertainty.
The final workshop encapsulated the entire process, guiding you from raw data (CO2 emissions) through exploratory analysis to identifying a narrative and constructing a multi-visual data story complete with emphasis and annotations.
Key Takeaways:
- Foundation First: Matplotlib provides the fundamental building blocks and ultimate control.
- Seaborn for Speed and Stats: Seaborn excels at rapidly creating attractive statistical graphics from DataFrames.
- OO is Powerful: Matplotlib's object-oriented interface is key for complex figures and customization.
- Storytelling Matters: Data needs context, narrative, and clear visuals focused on insight.
- Declutter and Emphasize: Remove noise, use visual cues (color, annotation) to guide the audience.
- Choose Wisely: Select plot types and structures that best serve your message.
- Practice is Crucial: The best way to master data visualization and storytelling is through consistent practice with diverse datasets.
The journey of data visualization is iterative. You will often create a plot, analyze it, refine it, add context, and perhaps even choose a different approach as your understanding deepens. Embrace this process.
While we focused on Matplotlib and Seaborn, the Python ecosystem offers other powerful visualization libraries worth exploring as your needs evolve, such as Plotly and Bokeh for interactive web-based visualizations, and Altair for its declarative approach.
Armed with the knowledge and skills from this guide, you are now well-equipped to not only create informative visualizations but also to weave them into compelling data stories that illuminate insights and communicate effectively in your academic and future professional endeavors. Happy plotting!