Author: Nejat Hakan
Email: nejat.hakan@outlook.de
PayPal Me: https://paypal.me/nejathakan
Database Intermediate
Introduction
Welcome to the intermediate exploration of databases within the Linux environment. Having mastered the fundamentals of SQL (Structured Query Language) and basic database concepts (tables, rows, columns, primary keys), we now delve deeper into the mechanics, design, and optimization techniques that distinguish a novice user from a proficient database practitioner. This section assumes you have a working relational database system installed on your Linux machine (like PostgreSQL or MySQL/MariaDB) and are comfortable executing basic SQL commands.
In this part, we will move beyond simple data retrieval and manipulation. We will explore:
- Advanced SQL Querying: Techniques for formulating complex questions to extract intricate insights from your data, including sophisticated joins, subqueries, Common Table Expressions (CTEs), and powerful window functions.
- Relational Database Design Principles: Moving beyond simple table creation to understand the theory and practice of designing robust, efficient, and maintainable database schemas using normalization and effective indexing strategies.
- Transaction Management and Concurrency Control: Understanding how databases ensure data integrity even when multiple users or processes access and modify data simultaneously, focusing on ACID properties, isolation levels, and locking.
- Database Performance Tuning Basics: Learning how to identify and address performance bottlenecks in your queries and database structure using tools like EXPLAIN and applying index tuning techniques.
- Introduction to Server-Side Database Programming: Extending database functionality using stored procedures, functions, and triggers to encapsulate logic, enforce complex business rules, and automate tasks directly within the database server.
- Database Security Fundamentals: Implementing basic security measures by managing users, roles, and privileges to control access to sensitive data.
Our approach will be highly practical. Each theoretical subsection is followed by a hands-on "Workshop" where you will apply the learned concepts in a step-by-step manner using a sample database project within your Linux environment. Get ready to significantly enhance your database skills!
1. Advanced SQL Querying
While basic SELECT, INSERT, UPDATE, and DELETE statements are the foundation, real-world data analysis often requires more sophisticated querying techniques. Intermediate SQL involves combining data from multiple tables in complex ways, nesting queries, and performing calculations across sets of rows.
Complex Joins Beyond INNER JOIN
You are likely familiar with INNER JOIN, which returns rows only when there is a match in both tables being joined. However, other join types are essential for different scenarios:
- LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table (the first table mentioned) and the matched rows from the right table. If there is no match in the right table, NULL values are returned for columns from the right table.
- Use Case: Finding all customers and any orders they might have placed. Even customers with no orders will be listed.
- RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table (the second table mentioned) and the matched rows from the left table. If there is no match in the left table, NULL values are returned for columns from the left table.
- Use Case: Finding all orders and the customers who placed them. Even if an order somehow existed without a valid customer (which shouldn't happen in a well-designed DB, but illustrates the point), it would be listed. It's often less intuitive than LEFT JOIN, and many queries can be rewritten using LEFT JOIN by simply swapping the table order.
- FULL OUTER JOIN: Returns all rows when there is a match in either the left or the right table. It essentially combines the results of a LEFT JOIN and a RIGHT JOIN. If there's no match for a row from the left table in the right table, the right table's columns are NULL. Conversely, if there's no match for a row from the right table in the left table, the left table's columns are NULL.
- Use Case: Comparing two lists to see all items from both, highlighting matches and mismatches. For example, listing all registered students and all enrolled students in a specific course, showing who is registered but not enrolled, who is enrolled but not registered (perhaps an error), and who is both.
- CROSS JOIN: Returns the Cartesian product of the two tables – every row from the first table is combined with every row from the second table. A CROSS JOIN has no join condition; it is rarely needed in practice, but it can be useful for generating combinations. Be very careful, as it can produce massive result sets.
- Use Case: Generating all possible combinations of T-shirt sizes and colors.
Understanding when to use each join type is crucial for correctly retrieving the desired dataset. Visualizing these joins using Venn diagrams can be very helpful.
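To make the CROSS JOIN case concrete, here is a small illustrative sketch; the tshirt_sizes and tshirt_colors tables are hypothetical and exist only for this example:

-- Hypothetical tables used only for illustration
CREATE TABLE tshirt_sizes (size_code VARCHAR(5) PRIMARY KEY);
CREATE TABLE tshirt_colors (color_name VARCHAR(20) PRIMARY KEY);

INSERT INTO tshirt_sizes VALUES ('S'), ('M'), ('L');
INSERT INTO tshirt_colors VALUES ('Red'), ('Blue');

-- CROSS JOIN: every size combined with every color (3 x 2 = 6 rows)
SELECT s.size_code, c.color_name
FROM tshirt_sizes s
CROSS JOIN tshirt_colors c;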
Subqueries and Common Table Expressions CTEs
A subquery (also known as an inner query or nested query) is a query embedded inside another SQL query. Subqueries can appear in various clauses: SELECT, FROM, WHERE, and HAVING.
- Subqueries in WHERE Clause: Often used for filtering based on the result of another query.
  - Operators: IN, NOT IN, ANY, ALL, EXISTS, NOT EXISTS.
  - Example: Find all students enrolled in courses taught by a specific professor.
- Correlated Subqueries: A subquery that references columns from the outer query. The subquery is re-evaluated for each row processed by the outer query. These can be powerful but potentially slow if not written carefully.
- Example: Find all courses where the enrollment count is above the average enrollment count for their respective department.
SELECT c.course_name, c.department_id
FROM courses c
WHERE (SELECT COUNT(*)
       FROM enrollments e
       WHERE e.course_id = c.course_id) >
      (SELECT AVG(enroll_count)
       FROM (SELECT COUNT(*) as enroll_count
             FROM enrollments e2
             JOIN courses c2 ON e2.course_id = c2.course_id
             WHERE c2.department_id = c.department_id -- Correlation
             GROUP BY e2.course_id) as dept_avg);
- Subqueries in FROM Clause (Derived Tables): The subquery's result set is treated as a temporary table that the outer query can select from.
  - Example: Calculate the average number of courses per student (a sketch follows below).
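That derived-table example could be written as in the following sketch, using the students and enrollments tables defined in the workshop later in this section:

-- Average number of course enrollments per student (a sketch)
SELECT AVG(course_count) AS avg_courses_per_student
FROM (SELECT student_id, COUNT(*) AS course_count
      FROM enrollments
      GROUP BY student_id) AS per_student;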
Common Table Expressions (CTEs) provide a way to define temporary, named result sets that you can reference within a single SQL statement (SELECT, INSERT, UPDATE, or DELETE). They are defined using the WITH clause.
- Benefits:
- Readability: Break down complex queries into logical, named blocks.
- Reusability: Reference the same CTE multiple times within the main query.
- Recursion: CTEs can reference themselves, enabling recursive queries (e.g., organizational hierarchies, bill of materials).
- Syntax: WITH cte_name AS (subquery) [, another_cte AS (subquery) ...] followed by the main query, which can reference each CTE by name.
- Example (Rewriting the correlated subquery example using CTEs):
WITH CourseEnrollmentCounts AS (
    SELECT course_id, COUNT(*) as enroll_count
    FROM enrollments
    GROUP BY course_id
),
DepartmentAvgEnrollment AS (
    SELECT c.department_id, AVG(cec.enroll_count) as avg_dept_enroll
    FROM CourseEnrollmentCounts cec
    JOIN courses c ON cec.course_id = c.course_id
    GROUP BY c.department_id
)
SELECT c.course_name, c.department_id
FROM courses c
JOIN CourseEnrollmentCounts cec ON c.course_id = cec.course_id
JOIN DepartmentAvgEnrollment dae ON c.department_id = dae.department_id
WHERE cec.enroll_count > dae.avg_dept_enroll;
Notice how the CTE approach separates the calculation of individual course counts and department averages into distinct, understandable steps.
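The recursion benefit mentioned above deserves a quick illustration. The sketch below assumes a hypothetical employees table with a self-referencing manager_id column (not part of this section's sample schema); it walks an organizational hierarchy down from the top-level manager:

WITH RECURSIVE org_chart AS (
    -- Anchor member: start from the top-level manager
    SELECT employee_id, employee_name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: add direct reports of rows already found
    SELECT e.employee_id, e.employee_name, e.manager_id, oc.depth + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart ORDER BY depth, employee_name;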
Window Functions
Window functions perform calculations across a set of table rows that are somehow related to the current row. Unlike aggregate functions (like SUM(), AVG(), COUNT()) used with GROUP BY, window functions do not collapse the rows into a single output row; they return a value for each row based on a "window" of related rows.
- Key Concepts:
  - OVER() Clause: This clause is mandatory and specifies that the function is a window function.
  - PARTITION BY (Optional): Divides the rows into partitions (groups). The window function is applied independently to each partition. Similar to GROUP BY, but doesn't collapse rows.
  - ORDER BY (Optional within OVER()): Orders rows within each partition. This is crucial for ranking and sequence-related functions (ROW_NUMBER, RANK, LAG, LEAD).
  - Frame Clause (Optional within OVER()): Specifies the precise set of rows (the "window frame") within the partition relative to the current row (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
- Common Window Functions:
  - Ranking: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n)
    - ROW_NUMBER(): Assigns a unique sequential integer to each row within its partition.
    - RANK(): Assigns ranks based on the ORDER BY clause within the partition. Skips ranks after ties (e.g., 1, 2, 2, 4).
    - DENSE_RANK(): Assigns ranks without gaps after ties (e.g., 1, 2, 2, 3).
  - Aggregate: SUM(), AVG(), COUNT(), MIN(), MAX() used with OVER().
    - Example: Calculate the running total of order amounts for each customer (see the sketch after this list).
  - Value: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTH_VALUE()
    - LAG(column, offset, default): Access data from a previous row within the partition.
    - LEAD(column, offset, default): Access data from a subsequent row within the partition.
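The running-total example mentioned in the Aggregate bullet could look like the following sketch; the orders table (customer_id, order_date, amount) is hypothetical and not part of this section's sample schema:

SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id
                         ORDER BY order_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM orders;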
- Example: Rank students within each department based on their GPA.

    SELECT student_name,
           department_id,
           gpa,
           RANK() OVER (PARTITION BY department_id ORDER BY gpa DESC) as department_rank,
           DENSE_RANK() OVER (PARTITION BY department_id ORDER BY gpa DESC) as dense_department_rank,
           ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY gpa DESC) as row_num_in_department
    FROM students;
- Example: Show each employee's salary and the average salary of their department.
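A possible formulation of that last example, assuming a hypothetical employees table with salary and department_id columns (not part of this section's sample schema):

SELECT employee_name,
       department_id,
       salary,
       AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary
FROM employees;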
Mastering these advanced SQL techniques allows for sophisticated data analysis and reporting directly within the database, often more efficiently than processing raw data in an application layer.
Workshop Advanced SQL Querying
Goal: Practice using advanced joins, subqueries, CTEs, and window functions in a university database scenario.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the psql command-line client or another database tool.
- A database created for this workshop (e.g., university_db).
Setup:
- Connect to your PostgreSQL instance:
psql -U your_username -d postgres
- Create the database:
CREATE DATABASE university_db;
- Connect to the new database:
\c university_db
- Create the necessary tables and insert sample data:
CREATE TABLE departments (
    department_id VARCHAR(10) PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    faculty_count INTEGER
);

CREATE TABLE students (
    student_id VARCHAR(10) PRIMARY KEY,
    student_name VARCHAR(100) NOT NULL,
    department_id VARCHAR(10) REFERENCES departments(department_id),
    gpa NUMERIC(3, 2) CHECK (gpa >= 0.0 AND gpa <= 4.0)
);

CREATE TABLE courses (
    course_id VARCHAR(10) PRIMARY KEY,
    course_name VARCHAR(100) NOT NULL,
    department_id VARCHAR(10) REFERENCES departments(department_id),
    credits INTEGER CHECK (credits > 0)
);

CREATE TABLE enrollments (
    enrollment_id SERIAL PRIMARY KEY,
    student_id VARCHAR(10) REFERENCES students(student_id),
    course_id VARCHAR(10) REFERENCES courses(course_id),
    grade CHAR(1), -- A, B, C, D, F or NULL if not graded yet
    UNIQUE (student_id, course_id) -- A student can enroll in a course only once
);

-- Insert Sample Data
INSERT INTO departments (department_id, department_name, faculty_count) VALUES
('CS', 'Computer Science', 25),
('EE', 'Electrical Engineering', 20),
('MATH', 'Mathematics', 15),
('PHYS', 'Physics', 18);

INSERT INTO students (student_id, student_name, department_id, gpa) VALUES
('S1001', 'Alice Wonderland', 'CS', 3.8),
('S1002', 'Bob The Builder', 'CS', 3.5),
('S1003', 'Charlie Chaplin', 'EE', 3.9),
('S1004', 'Diana Prince', 'EE', 3.7),
('S1005', 'Ethan Hunt', 'MATH', 3.2),
('S1006', 'Fiona Glenanne', NULL, 3.6); -- Student not yet assigned to a dept

INSERT INTO courses (course_id, course_name, department_id, credits) VALUES
('CS101', 'Intro to Programming', 'CS', 4),
('CS305', 'Databases', 'CS', 3),
('EE201', 'Circuit Theory', 'EE', 4),
('MATH211', 'Calculus III', 'MATH', 4),
('PHYS101', 'General Physics I', 'PHYS', 5),
('CS550', 'Advanced Algorithms', 'CS', 3); -- An advanced course maybe not taken by many

INSERT INTO enrollments (student_id, course_id, grade) VALUES
('S1001', 'CS101', 'A'),
('S1001', 'CS305', 'A'),
('S1002', 'CS101', 'B'),
('S1003', 'EE201', 'A'),
('S1004', 'EE201', 'B'),
('S1004', 'CS101', 'C'), -- Cross-department enrollment
('S1005', 'MATH211', 'B'),
('S1001', 'MATH211', 'A'); -- Cross-department enrollment
Tasks:
- LEFT JOIN: List all students and the names of the courses they are enrolled in. Include students who are not enrolled in any course (though our sample data doesn't have such a case, the query structure matters). Also, list the student's department name.
  - Hint: You'll need multiple joins. Start with students and LEFT JOIN enrollments. Then join courses and departments (a possible solution is sketched below).
  - Observe: All students should be listed. If a student had no enrollments, their course_name would be NULL. Notice Fiona doesn't have a department name.
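One possible solution for this task (a sketch; other join orders work too). A LEFT JOIN to departments keeps Fiona, who has no department:

SELECT s.student_name, d.department_name, c.course_name
FROM students s
LEFT JOIN enrollments e ON s.student_id = e.student_id
LEFT JOIN courses c ON e.course_id = c.course_id
LEFT JOIN departments d ON s.department_id = d.department_id
ORDER BY s.student_name, c.course_name;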
- RIGHT JOIN (Conceptual): List all courses and the names of students enrolled in them. Include courses that have no students enrolled.
  - Hint: Use courses as the left table and enrollments as the right table in a LEFT JOIN for easier reading, or swap and use RIGHT JOIN.

    -- Using LEFT JOIN (more common)
    SELECT c.course_name, c.course_id, s.student_name
    FROM courses c
    LEFT JOIN enrollments e ON c.course_id = e.course_id
    LEFT JOIN students s ON e.student_id = s.student_id
    ORDER BY c.course_name, s.student_name;

    -- Equivalent using RIGHT JOIN
    SELECT c.course_name, c.course_id, s.student_name
    FROM enrollments e
    RIGHT JOIN courses c ON e.course_id = c.course_id
    LEFT JOIN students s ON e.student_id = s.student_id
    ORDER BY c.course_name, s.student_name;

  - Observe: Courses CS550 and PHYS101 appear with NULL student names because no one is enrolled in them in our sample data.
- Subquery in WHERE: Find the names of students enrolled in the 'Databases' course (CS305).

    SELECT student_name
    FROM students
    WHERE student_id IN (SELECT student_id
                         FROM enrollments
                         WHERE course_id = 'CS305');
- Result: Should show 'Alice Wonderland'.
- Subquery in FROM (Derived Table): Find the department(s) with the highest average GPA among their students.
  - Hint: First, calculate the average GPA per department. Then find the maximum of these averages. Finally, select the department(s) matching that maximum average.

    SELECT department_id, avg_gpa
    FROM (SELECT department_id, AVG(gpa) as avg_gpa
          FROM students
          WHERE department_id IS NOT NULL -- Exclude students without a department
          GROUP BY department_id) as dept_gpa
    WHERE avg_gpa = (SELECT MAX(avg_gpa)
                     FROM (SELECT department_id, AVG(gpa) as avg_gpa
                           FROM students
                           WHERE department_id IS NOT NULL
                           GROUP BY department_id) as max_dept_gpa);

  - Observe: This shows the department ID and its average GPA for the department(s) with the top average.
- CTE: Calculate the number of students enrolled in each department. List department name and student count.
  - Hint: Create a CTE to count students per department_id from the students table. Then join this CTE with the departments table.

    WITH DepartmentStudentCounts AS (
        SELECT department_id, COUNT(student_id) as num_students
        FROM students
        WHERE department_id IS NOT NULL
        GROUP BY department_id
    )
    SELECT d.department_name, dsc.num_students
    FROM departments d
    JOIN DepartmentStudentCounts dsc ON d.department_id = dsc.department_id
    ORDER BY d.department_name;

  - Observe: This gives a clean count per department.
- Window Function (Ranking): Rank students within each department based on GPA (highest GPA gets rank 1). Show student name, department name, GPA, and their rank.

    SELECT s.student_name,
           d.department_name,
           s.gpa,
           RANK() OVER (PARTITION BY s.department_id ORDER BY s.gpa DESC) as department_rank
    FROM students s
    JOIN departments d ON s.department_id = d.department_id -- Use INNER JOIN to only rank students in departments
    ORDER BY d.department_name, department_rank;

  - Observe: Students are grouped by department, and within each department, ranked by GPA. If there were ties, RANK() would show gaps. Try DENSE_RANK() to see the difference.
- Window Function (Aggregate): For each student, show their name, GPA, and the average GPA of their department.

    SELECT s.student_name,
           s.gpa,
           d.department_name,
           AVG(s.gpa) OVER (PARTITION BY s.department_id) as avg_dept_gpa
    FROM students s
    JOIN departments d ON s.department_id = d.department_id
    ORDER BY d.department_name, s.student_name;

  - Observe: Each student row now includes the average GPA calculated across all students within their specific department. The AVG() function here doesn't collapse rows because of the OVER() clause.
Cleanup (Optional):
-- Disconnect from the database first (\q in psql)
-- Then connect to another database like 'postgres'
-- psql -U your_username -d postgres
DROP DATABASE university_db;
This workshop provides hands-on experience with advanced SQL constructs, enabling you to tackle more complex data retrieval and analysis tasks.
2. Relational Database Design Principles
Designing a database schema is more than just creating tables; it's about structuring data logically and efficiently to ensure data integrity, minimize redundancy, and facilitate querying. Two core concepts in relational database design are normalization and indexing.
Normalization In Depth
Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and improve data integrity. It involves decomposing larger tables into smaller, well-structured tables and defining relationships between them. The primary goals are:
- Eliminate Redundancy: Storing the same piece of data multiple times wastes space and increases the risk of inconsistencies.
- Avoid Update Anomalies: Issues that arise when inserting, updating, or deleting data.
- Insertion Anomaly: Inability to add data about one entity without adding data about another unrelated entity. (e.g., Cannot add a new course unless a student enrolls in it, if course details are stored only in an enrollment table).
- Update Anomaly: Changing redundant data in one place requires changing it everywhere it appears. Failure to do so leads to inconsistencies. (e.g., If a professor's office number is stored with every course they teach, changing their office requires updating multiple rows).
- Deletion Anomaly: Deleting data about one entity unintentionally deletes data about another entity. (e.g., If student and course details are in the same table, deleting the last student enrolled in a course might delete the course information itself).
Normalization is typically achieved by following a series of rules called Normal Forms (NF):
- First Normal Form (1NF):
- Rule: Each column must contain atomic (indivisible) values, and each column must have a unique name within the table. There should be no repeating groups of columns.
- Violation Example: A phone_numbers column containing a comma-separated list ('555-1234, 555-5678'). A set of columns like child1_name, child2_name, child3_name.
- Fix: Create a separate related table (e.g., a StudentPhoneNumbers table with student_id and phone_number columns, allowing multiple rows per student). For repeating groups, normalize into a separate table (see the sketch below).
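A sketch of such a fix; the table and column names are illustrative, not part of the workshop schema:

-- One row per student per phone number instead of a comma-separated list
CREATE TABLE student_phone_numbers (
    student_id   VARCHAR(10) NOT NULL,
    phone_number VARCHAR(20) NOT NULL,
    PRIMARY KEY (student_id, phone_number)
);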
- Second Normal Form (2NF):
- Rule: The table must be in 1NF, and all non-key attributes must be fully functionally dependent on the entire primary key. This rule applies primarily to tables with composite primary keys (keys made up of multiple columns).
- Violation Example: A table Enrollments(StudentID, CourseID, StudentName, CourseName, Grade). The primary key is (StudentID, CourseID).
  - Grade depends on the full key (StudentID, CourseID).
  - StudentName depends only on StudentID (partial dependency).
  - CourseName depends only on CourseID (partial dependency).
- Fix: Decompose into multiple tables:
Students(StudentID, StudentName, ...)
Courses(CourseID, CourseName, ...)
Enrollments(StudentID, CourseID, Grade)
- Third Normal Form (3NF):
- Rule: The table must be in 2NF, and there should be no transitive dependencies. A transitive dependency exists when a non-key attribute depends on another non-key attribute, which in turn depends on the primary key. (A -> B and B -> C implies A -> C transitively).
- Violation Example: A table Courses(CourseID, CourseName, ProfessorID, ProfessorOffice). Primary key is CourseID.
  - CourseName depends on CourseID.
  - ProfessorID depends on CourseID (assuming one professor per course).
  - ProfessorOffice depends on ProfessorID (a non-key attribute).
  - Therefore, CourseID -> ProfessorID -> ProfessorOffice creates a transitive dependency (CourseID -> ProfessorOffice).
- Fix: Decompose into multiple tables:
Courses(CourseID, CourseName, ProfessorID)
Professors(ProfessorID, ProfessorName, ProfessorOffice)
- Boyce-Codd Normal Form (BCNF): A slightly stricter version of 3NF. A table is in BCNF if, for every non-trivial functional dependency X -> Y, X is a superkey (i.e., X determines all other attributes in the table). Most 3NF tables are also in BCNF. Violations are rare and often involve tables with multiple candidate keys that overlap.
- Higher Normal Forms (4NF, 5NF, 6NF): Address more complex dependencies like multi-valued dependencies and join dependencies. They are less commonly applied in typical business application design but are important in specific complex scenarios.
Normalization Trade-offs: While normalization reduces redundancy and improves integrity, it often increases the number of tables. This can lead to more complex queries involving multiple joins, which might impact performance. Sometimes, denormalization (intentionally introducing some redundancy by relaxing normal forms, e.g., moving from 3NF back to 2NF for specific columns) is performed selectively to optimize read performance for frequently executed queries. This should be done cautiously and typically only after identifying performance bottlenecks.
Effective Indexing Strategies
An index is a special lookup table that the database search engine can use to speed up data retrieval operations. Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book.
- How Indexes Work (Typically B-Trees): Most relational databases use a B-tree (or variant like B+ tree) structure for indexes. A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time (O(log n)). The index stores the indexed column value(s) and a pointer (e.g., row ID or physical address) to the actual table row containing that value. When you query based on an indexed column (e.g., in a WHERE clause or JOIN condition), the database can use the B-tree to quickly locate the relevant row pointers instead of scanning the entire table (a "full table scan").
- Types of Indexes:
- Primary Key Index: Automatically created when you define a primary key. Ensures uniqueness and provides fast lookup by the primary key value. Often clustered.
- Unique Index: Ensures that all values in the indexed column(s) are unique (allows one NULL value usually, though behavior varies by RDBMS). Created automatically for unique constraints.
- Clustered Index (Specific RDBMS, e.g., SQL Server, MySQL's InnoDB): Determines the physical order of data rows in the table. A table can have only one clustered index. If defined, accessing rows sequentially based on the clustered index key is very fast. Primary keys are often implemented as clustered indexes by default.
- Non-Clustered Index (Heap Tables in some RDBMS): The index structure is separate from the data rows. The index contains pointers to the actual rows, which may be stored in a different order (often a heap). A table can have multiple non-clustered indexes. This is the more common index type in PostgreSQL.
- Composite Index (Multi-column Index): An index created on two or more columns. Useful for queries that filter or sort on multiple columns simultaneously. The order of columns in the composite index matters significantly. An index on (colA, colB) can efficiently support queries filtering on colA alone, or on both colA and colB, but usually not efficiently on colB alone.
- Partial Index (PostgreSQL, SQLite): An index built on a subset of a table's rows, defined by a WHERE clause in the CREATE INDEX statement. Useful for indexing only frequently queried subsets (e.g., index only active orders).
- Expression Index / Function-Based Index (PostgreSQL, Oracle): An index built on the result of a function or expression involving one or more columns. Useful for queries that filter based on the result of a function (e.g., WHERE LOWER(username) = 'admin'). A sketch of both is shown below.
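In PostgreSQL, the last two index types might be created roughly like this; the orders and users tables and their columns are illustrative assumptions:

-- Partial index: only index rows that are still 'active'
CREATE INDEX idx_orders_active ON orders (order_date) WHERE status = 'active';

-- Expression index: support case-insensitive lookups on username
CREATE INDEX idx_users_lower_username ON users (LOWER(username));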
- When to Create Indexes:
  - Columns frequently used in WHERE clauses.
  - Columns frequently used in JOIN conditions (foreign keys are prime candidates).
  - Columns frequently used in ORDER BY or GROUP BY clauses.
- When Not to Create Indexes (or to be Cautious):
  - Small Tables: The overhead of using an index might outweigh the benefit of a full table scan.
  - Columns with Low Cardinality (Few Unique Values): Indexing a column like gender ('Male', 'Female', 'Other') is often not helpful, as the index doesn't significantly narrow down the search space. The database might still prefer a full table scan. Index selectivity is key – indexes work best on columns where values are highly unique or selective.
  - Tables with Frequent, Heavy Write Operations (INSERT, UPDATE, DELETE): Every index needs to be updated whenever the underlying table data changes. Too many indexes can significantly slow down write operations.
  - Columns Rarely Queried: Unused indexes waste storage space and add overhead to write operations.
- Indexing Strategy:
  - Identify Critical Queries: Focus on the most frequent or performance-sensitive queries.
  - Analyze Query Predicates: Examine the WHERE, JOIN, ORDER BY, and GROUP BY clauses.
  - Index Foreign Keys: Usually a good starting point.
  - Consider Composite Indexes: For queries filtering/sorting on multiple columns. Place the most selective column(s) first in the index definition.
  - Use EXPLAIN: Analyze query execution plans (covered later) to see if indexes are being used effectively and identify missing indexes.
  - Monitor and Refine: Regularly review index usage and performance. Drop unused indexes and potentially add new ones as query patterns evolve.
Effective database design through normalization and strategic indexing is fundamental to building scalable and performant database applications.
Workshop Relational Database Design Principles
Goal: Analyze a poorly designed table, normalize it to 3NF, and add appropriate indexes based on anticipated queries.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the psql command-line client or another database tool.
- A database created for this workshop (e.g., design_workshop_db).
Setup:
- Connect to your PostgreSQL instance:
psql -U your_username -d postgres
- Create the database:
CREATE DATABASE design_workshop_db;
- Connect to the new database:
\c design_workshop_db
- Create the initial unnormalized table representing project assignments:
-- !! This is a deliberately poorly designed table for demonstration !!
CREATE TABLE project_assignments_bad (
    assignment_id INT PRIMARY KEY,
    project_name VARCHAR(100),
    project_location VARCHAR(50),
    employee_id VARCHAR(10),
    employee_name VARCHAR(100),
    employee_email VARCHAR(100), -- Potential 1NF violation if multiple emails stored
    department_id VARCHAR(10),
    department_name VARCHAR(100),
    assignment_date DATE,
    hours_billed NUMERIC(5, 2)
);

-- Insert some sample data (notice the redundancy)
INSERT INTO project_assignments_bad VALUES
(1, 'Alpha Launch', 'New York', 'E101', 'Alice Smith', 'alice.s@example.com', 'D01', 'Technology', '2023-10-01', 40.0),
(2, 'Alpha Launch', 'New York', 'E102', 'Bob Johnson', 'bob.j@example.com', 'D01', 'Technology', '2023-10-01', 35.5),
(3, 'Beta Test', 'London', 'E101', 'Alice Smith', 'alice.s@example.com', 'D01', 'Technology', '2023-11-15', 50.0),
(4, 'Gamma Initiative', 'New York', 'E205', 'Charlie Brown', 'charlie.b@example.com', 'D02', 'Marketing', '2023-10-05', 60.0),
(5, 'Alpha Launch', 'New York', 'E205', 'Charlie Brown', 'charlie.b@example.com', 'D02', 'Marketing', '2023-11-01', 25.0);
Tasks:
- Analyze the project_assignments_bad Table:
  - Identify violations of Normal Forms (1NF, 2NF, 3NF).
  - List the potential insertion, update, and deletion anomalies.
  - Think:
    - 1NF: Is employee_email potentially storing multiple values? (Let's assume for now it's intended to be one, but it's a common place for 1NF issues). Are there repeating groups? (No obvious ones here).
    - 2NF: The primary key is assignment_id (a single column). Since the PK is not composite, 2NF is technically satisfied (as there are no partial dependencies on a single-column key). However, the structure feels wrong, leading us to 3NF.
    - 3NF: Are there transitive dependencies? assignment_id -> project_name, project_location (Project details depend on the assignment? Unlikely. Seems project details are repeated). assignment_id -> employee_id -> employee_name, employee_email, department_id -> department_name. These look like transitive dependencies! Employee details depend on employee_id, not directly on the assignment_id. Department name depends on department_id.
  - Anomalies:
    - Update: If Alice Smith changes her email, you have to update multiple rows (rows 1 and 3). If the 'Technology' department name changes, you have to update multiple rows. If the 'Alpha Launch' project location changes (unlikely, but possible), multiple rows need updating.
    - Insertion: Cannot add a new employee unless they are assigned to a project. Cannot add a new project unless an employee is assigned. Cannot add a new department unless an employee exists in it.
    - Deletion: If assignment 5 is deleted (Charlie Brown's only work on Alpha Launch), we lose the fact that Charlie Brown was ever associated with Alpha Launch. If assignments 4 and 5 are deleted, we lose information about Charlie Brown entirely (if he's not elsewhere) and potentially the 'Marketing' department if no other employees are listed.
- Normalize the Schema to 3NF:
- Design a set of new tables that adhere to 3NF.
- Define appropriate primary keys and foreign keys to link the tables.
- Think: We need separate tables for entities like Projects, Employees, and Departments, and a table to link them (Assignments).
Departments(department_id PK, department_name)
Employees(employee_id PK, employee_name, employee_email, department_id FK)
Projects(project_id PK, project_name, project_location) - Need a new PK for projects.
Assignments(assignment_id PK, project_id FK, employee_id FK, assignment_date, hours_billed)
- Implementation (SQL):
-- 1. Departments Table
CREATE TABLE departments (
    department_id VARCHAR(10) PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL UNIQUE -- Department names should probably be unique
);

-- 2. Employees Table
CREATE TABLE employees (
    employee_id VARCHAR(10) PRIMARY KEY,
    employee_name VARCHAR(100) NOT NULL,
    employee_email VARCHAR(100) UNIQUE, -- Emails should be unique
    department_id VARCHAR(10) REFERENCES departments(department_id) -- Foreign Key
);

-- 3. Projects Table
-- Let's create a surrogate key for projects
CREATE TABLE projects (
    project_id SERIAL PRIMARY KEY, -- Use SERIAL for auto-incrementing integer PK
    project_name VARCHAR(100) NOT NULL,
    project_location VARCHAR(50),
    UNIQUE (project_name, project_location) -- Prevent duplicate project definitions
);

-- 4. Assignments Table (Linking Table)
CREATE TABLE assignments (
    assignment_id SERIAL PRIMARY KEY, -- Keep a unique ID for each specific assignment instance
    project_id INTEGER REFERENCES projects(project_id) ON DELETE CASCADE, -- If project deleted, remove assignments
    employee_id VARCHAR(10) REFERENCES employees(employee_id) ON DELETE RESTRICT, -- Prevent deleting employee if assigned
    assignment_date DATE NOT NULL,
    hours_billed NUMERIC(5, 2) CHECK (hours_billed >= 0)
);
- Migrate Data (Conceptual / Optional Manual Insertion):
  - Extract distinct data from the project_assignments_bad table and insert it into the new normalized tables. This requires careful INSERT INTO ... SELECT DISTINCT ... statements or manual insertion.
  - Example Inserts (Manual):

    -- Populate Departments
    INSERT INTO departments (department_id, department_name) VALUES
    ('D01', 'Technology'),
    ('D02', 'Marketing')
    ON CONFLICT (department_id) DO NOTHING; -- Avoid errors if run multiple times

    -- Populate Employees
    INSERT INTO employees (employee_id, employee_name, employee_email, department_id) VALUES
    ('E101', 'Alice Smith', 'alice.s@example.com', 'D01'),
    ('E102', 'Bob Johnson', 'bob.j@example.com', 'D01'),
    ('E205', 'Charlie Brown', 'charlie.b@example.com', 'D02')
    ON CONFLICT (employee_id) DO NOTHING;

    -- Populate Projects (Need to map names to new project_id)
    INSERT INTO projects (project_name, project_location) VALUES
    ('Alpha Launch', 'New York'),
    ('Beta Test', 'London'),
    ('Gamma Initiative', 'New York')
    ON CONFLICT (project_name, project_location) DO NOTHING;

    -- Populate Assignments (Requires looking up the new project_ids)
    -- Find project IDs first:
    -- SELECT project_id, project_name FROM projects; (Assume Alpha=1, Beta=2, Gamma=3)
    INSERT INTO assignments (project_id, employee_id, assignment_date, hours_billed) VALUES
    (1, 'E101', '2023-10-01', 40.0),
    (1, 'E102', '2023-10-01', 35.5),
    (2, 'E101', '2023-11-15', 50.0),
    (3, 'E205', '2023-10-05', 60.0),
    (1, 'E205', '2023-11-01', 25.0);
- Define Necessary Indexes:
  - Based on the normalized schema, identify columns that would benefit from indexing, assuming common query patterns.
  - Common Queries Might Be:
    - Find all assignments for a specific employee.
    - Find all employees working on a specific project.
    - Find all assignments within a date range.
    - Find employee details by employee ID (already covered by PK).
    - Find project details by project ID (already covered by PK).
    - Find employees in a specific department.
  - Think:
    - Primary Keys (departments.department_id, employees.employee_id, projects.project_id, assignments.assignment_id) are automatically indexed.
    - Foreign Keys are excellent candidates for indexing to speed up joins.
    - Other frequently filtered or sorted columns.
  - Implementation (SQL):

    -- Index on Foreign Keys in 'employees' table
    CREATE INDEX idx_employees_department_id ON employees(department_id);

    -- Indexes on Foreign Keys in 'assignments' table
    CREATE INDEX idx_assignments_project_id ON assignments(project_id);
    CREATE INDEX idx_assignments_employee_id ON assignments(employee_id);

    -- Index for searching assignments by date (if common)
    CREATE INDEX idx_assignments_assignment_date ON assignments(assignment_date);

    -- Maybe an index on employee name if searching by name is frequent
    -- Be cautious if names are not very unique or if table is large and writes frequent
    -- CREATE INDEX idx_employees_employee_name ON employees(employee_name);
    -- PostgreSQL supports trigram indexes for faster LIKE/ILIKE searches on text:
    -- CREATE EXTENSION IF NOT EXISTS pg_trgm;
    -- CREATE INDEX idx_employees_employee_name_trgm ON employees USING gin (employee_name gin_trgm_ops);

    -- Maybe an index on project name if searching by name is frequent
    -- CREATE INDEX idx_projects_project_name ON projects(project_name);
- Verify the Design:
  - Mentally run through the anomaly scenarios identified in step 1. How does the normalized design prevent them?
    - Update: Change Alice's email in the employees table once. Change the 'Technology' department name in departments once.
    - Insertion: Can add a new employee via the employees table without an assignment. Can add a new project via the projects table. Can add a new department via departments.
    - Deletion: Deleting an assignment from assignments doesn't delete employee or project info. Deleting an employee might be restricted if they have assignments (due to ON DELETE RESTRICT) or could cascade delete assignments if ON DELETE CASCADE was used on the FK in assignments. Deleting a project could cascade delete related assignments. The core entity data remains intact unless explicitly deleted from its own table.
Cleanup (Optional):
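The cleanup commands are not listed here; mirroring the other workshops, they would look like:

-- Disconnect from the database first (\q in psql), then connect to another database like 'postgres'
DROP DATABASE design_workshop_db;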
This workshop demonstrated the practical process of identifying design flaws in a flat table structure, applying normalization principles to create a robust 3NF schema, and adding strategic indexes to support anticipated query workloads. This foundation is crucial for building maintainable and efficient database systems.
3. Transaction Management and Concurrency Control
In real-world applications, multiple users or processes often need to access and modify data in the database simultaneously. Concurrency Control mechanisms are essential to ensure that these concurrent operations do not interfere with each other in ways that compromise data integrity. Transaction Management provides the framework for grouping database operations together such that they are treated as a single, indivisible unit of work.
Understanding ACID Properties
A transaction is a sequence of database operations (e.g., one or more SQL statements like SELECT, INSERT, UPDATE, DELETE) performed as a single logical unit of work. To ensure data integrity, especially in concurrent environments, database systems guarantee ACID properties for transactions:
- Atomicity: Ensures that all operations within a transaction are completed successfully as a single unit. If any part of the transaction fails, the entire transaction is rolled back, and the database is left unchanged, as if the transaction never started. It's an "all or nothing" principle.
- Analogy: Transferring money between bank accounts. You debit one account and credit another. Both actions must succeed; if the credit fails after the debit, the debit must be undone (rolled back).
- Consistency: Ensures that a transaction brings the database from one valid state to another valid state. It preserves database invariants (e.g., constraints, triggers, cascades). If a transaction violates database consistency rules (like a unique constraint or a check constraint), it will be rolled back.
- Analogy: In the bank transfer, consistency ensures the total amount of money across both accounts remains the same before and after the transaction (ignoring fees for simplicity). It also ensures account balances don't violate rules like dropping below a minimum required balance if such a constraint exists.
- Isolation: Ensures that concurrent transactions operate independently of one another. The effect of executing multiple transactions concurrently should be the same as if they were executed sequentially (one after another). This prevents intermediate, uncommitted states of one transaction from being visible to other transactions.
- Analogy: Two people trying to book the last seat on a flight simultaneously. Isolation ensures that only one person successfully books the seat, and the other person sees the seat as unavailable, preventing a double-booking. The level of isolation can be configured (see Isolation Levels).
- Durability: Ensures that once a transaction has been successfully committed, its changes are permanent and will survive any subsequent system failures (e.g., power outages, crashes). Committed data is typically written to non-volatile storage (like disk) before the commit is acknowledged.
- Analogy: Once the bank confirms your deposit transaction is complete, the money is safely in your account, even if the bank's computer system crashes immediately afterward. The record of the deposit persists.
Databases use mechanisms like logging (Write-Ahead Logging - WAL), locking, and multi-version concurrency control (MVCC) to implement these ACID properties.
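As a minimal sketch of what a transaction looks like in SQL (using an accounts table like the one created in the workshop below), the bank-transfer example is:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A101'; -- debit
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B202'; -- credit
COMMIT; -- both changes become permanent together; ROLLBACK instead would undo both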
Isolation Levels Explained
While perfect isolation (as if transactions run sequentially) is the ideal, it can impose significant performance overhead due to locking or versioning mechanisms. Therefore, SQL defines standard isolation levels that allow developers to make trade-offs between consistency and performance by relaxing the isolation rules slightly. Lower isolation levels generally offer better performance but increase the risk of concurrency phenomena:
- Concurrency Phenomena:
- Dirty Read: Transaction T1 reads data that has been modified by another transaction T2, but T2 has not yet committed. If T2 rolls back, T1 has read data that never officially existed.
- Non-Repeatable Read: Transaction T1 reads a row. Transaction T2 then modifies or deletes that row and commits. If T1 re-reads the same row, it gets different data or finds the row is gone.
- Phantom Read: Transaction T1 executes a query with a WHERE clause that retrieves a set of rows. Transaction T2 then inserts a new row that satisfies T1's WHERE clause and commits. If T1 re-executes its query, it sees the "phantom" newly inserted row.
- SQL Standard Isolation Levels (from lowest to highest):
- Read Uncommitted:
- Allows Dirty Reads, Non-Repeatable Reads, and Phantom Reads.
- Offers the highest concurrency but the lowest level of data consistency. Transactions can see uncommitted changes made by other transactions.
- Usefulness: Rarely used in practice for general applications due to inconsistency risks. Might be acceptable for approximate counts or monitoring where absolute precision isn't critical. (Note: PostgreSQL implements this level the same as Read Committed).
- Read Committed:
- Prevents Dirty Reads. Allows Non-Repeatable Reads and Phantom Reads.
- Guarantees that a transaction only reads data that has been committed. However, within the same transaction, two identical queries might return different results if another committed transaction modified the data in between.
- Implementation Detail (Common): Uses short-term locks or MVCC. Each statement sees a snapshot of the database as it existed when the statement began.
- Default Level: Often the default isolation level in many databases (including PostgreSQL, Oracle, SQL Server). Good balance for many web applications.
- Repeatable Read:
- Prevents Dirty Reads and Non-Repeatable Reads. Allows Phantom Reads (in some implementations, though PostgreSQL's implementation prevents phantoms too).
- Guarantees that if a transaction reads a row multiple times, it will see the same data for that row each time. Other transactions cannot modify data being read by the current transaction until it commits or rolls back. However, other transactions can insert new rows that might match the query criteria (phantoms).
- Implementation Detail (Common): Uses longer-term read locks or MVCC. The transaction sees a snapshot of the database as it existed when the transaction began.
- Usefulness: Needed when a transaction requires a consistent view of specific rows throughout its duration, e.g., performing calculations based on data read at the start of the transaction. Might result in serialization errors if concurrent updates conflict.
- Serializable:
- Prevents Dirty Reads, Non-Repeatable Reads, and Phantom Reads.
- Provides the highest level of isolation. Guarantees that the effect of concurrently executing transactions is identical to executing them serially in some order.
- Implementation Detail (Common): Uses extensive locking (e.g., range locks) or advanced MVCC techniques like Serializable Snapshot Isolation (SSI) in PostgreSQL.
- Usefulness: Required for applications with complex transactions where absolute consistency is paramount and any concurrency anomaly is unacceptable. Can significantly reduce concurrency and may lead to more transaction failures (serialization errors) that need to be retried by the application.
- Read Uncommitted:
- Setting Isolation Level: Typically done per-transaction using a command like:
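For example, in PostgreSQL (a sketch; the setting applies only to the current transaction):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- ... statements in this transaction now run at REPEATABLE READ ...
COMMIT;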
Choosing the appropriate isolation level is a critical design decision based on the application's tolerance for concurrency phenomena versus its need for performance and throughput.
Locking Mechanisms
Databases use locks to manage concurrent access to data resources (like tables, rows, or pages) and enforce isolation levels. When a transaction needs to access a resource in a way that could conflict with other transactions, it acquires a lock on that resource.
- Lock Granularity: Locks can be acquired at different levels:
- Table Locks: Lock the entire table. Simple but offers very low concurrency.
- Page Locks: Lock a block of disk storage containing multiple rows. A compromise between table and row locks.
- Row Locks: Lock individual rows. Offers the highest concurrency but can have higher management overhead. Most modern RDBMS primarily use row-level locking for user data.
- Key-Range Locks (Used for Serializable): Lock ranges of values within an index to prevent phantom reads.
- Lock Modes: Indicate the type of access allowed:
  - Shared Locks (S-Locks / Read Locks): Multiple transactions can hold a shared lock on the same resource simultaneously. Allows concurrent reading but prevents modification by transactions holding exclusive locks. Typically acquired for SELECT statements (depending on isolation level).
  - Exclusive Locks (X-Locks / Write Locks): Only one transaction can hold an exclusive lock on a resource at a time. Prevents any other transaction (read or write) from acquiring any lock on the resource. Typically acquired for INSERT, UPDATE, DELETE operations.
  - Update Locks (U-Locks): Some systems use these as an intermediate lock during updates to prevent certain types of deadlocks. An update lock indicates an intent to update and is compatible with shared locks but not other update or exclusive locks. It is later promoted to an exclusive lock when the actual modification occurs.
- Locking Phenomena:
- Blocking: When a transaction tries to acquire a lock on a resource that is already held by another incompatible transaction, it must wait (is blocked) until the first transaction releases the lock (by committing or rolling back).
- Deadlock: Occurs when two (or more) transactions are waiting for each other to release locks. For example, T1 locks resource A and waits for resource B, while T2 locks resource B and waits for resource A. Neither transaction can proceed. Databases have deadlock detection mechanisms and typically resolve them by aborting one of the transactions (forcing it to roll back) and allowing the other to proceed. The aborted transaction usually needs to be retried by the application.
- Pessimistic vs. Optimistic Concurrency Control:
- Pessimistic Concurrency Control (Locking): Assumes conflicts are likely and uses locks to prevent them from happening before accessing data. This is the traditional approach described above.
- Optimistic Concurrency Control (e.g., using Versioning/MVCC): Assumes conflicts are rare. Transactions read data without acquiring strong locks initially. When a transaction tries to commit, the system checks if the data it read has been modified by another committed transaction since it was read. If a conflict is detected, the committing transaction is rolled back and must be retried. Multi-Version Concurrency Control (MVCC), used by PostgreSQL and Oracle, is a common implementation. MVCC maintains multiple versions of rows, allowing readers not to block writers and vice versa, improving concurrency, especially for read-heavy workloads. Readers typically see a consistent snapshot of the data. Conflicts (write-write conflicts) are detected at commit time for higher isolation levels like Repeatable Read or Serializable.
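A common application-level variant of optimistic concurrency uses a version column. The following is a sketch only; the documents table and its version column are hypothetical:

-- Read the row and remember its version
SELECT doc_id, content, version FROM documents WHERE doc_id = 42;

-- Later, update only if nobody else changed the row in the meantime
UPDATE documents
SET content = 'new text', version = version + 1
WHERE doc_id = 42 AND version = 7; -- 7 = the version read earlier

-- If this UPDATE affects 0 rows, another transaction won the race: re-read and retry.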
Understanding transactions, ACID properties, isolation levels, and locking/MVCC is crucial for developing reliable applications that handle concurrent data access correctly.
Workshop Transaction Management and Concurrency Control
Goal: Observe the effects of different isolation levels and understand basic transaction handling (COMMIT, ROLLBACK) and potential deadlocks.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the psql command-line client. You will need two separate terminal windows connected to the same database simultaneously to simulate concurrency.
- A database created for this workshop (e.g., concurrency_db).
Setup:
- Terminal 1 & 2: Connect to your PostgreSQL instance:
psql -U your_username -d postgres
- Terminal 1: Create the database:
CREATE DATABASE concurrency_db;
- Terminal 1 & 2: Connect to the new database:
\c concurrency_db
- Terminal 1: Create a simple bank accounts table:
CREATE TABLE accounts (
    account_id VARCHAR(10) PRIMARY KEY,
    account_holder VARCHAR(100) NOT NULL,
    balance NUMERIC(10, 2) CHECK (balance >= 0) -- Ensure non-negative balance
);

INSERT INTO accounts (account_id, account_holder, balance) VALUES
('A101', 'Alice', 1000.00),
('B202', 'Bob', 500.00);

-- Verify initial state in both terminals
SELECT * FROM accounts;
Tasks:
- Atomicity (COMMIT and ROLLBACK): Simulate a successful and a failed bank transfer.
  - Terminal 1:

    BEGIN; -- Start transaction
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A101';
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B202';
    -- Check intermediate state (only visible within this transaction)
    SELECT * FROM accounts;
    COMMIT; -- Make changes permanent
    -- Check final state (now visible everywhere)
    SELECT * FROM accounts;

  - Terminal 2: (e.g., run SELECT * FROM accounts; while Terminal 1's transaction is still open and again after the COMMIT – the changes only become visible here once committed.)
  - Terminal 1 (Simulate Failure):

    BEGIN;
    UPDATE accounts SET balance = balance - 50 WHERE account_id = 'B202';
    -- Oh no! Simulate an error or decision to cancel before updating Alice
    SELECT * FROM accounts; -- See intermediate change for Bob
    ROLLBACK; -- Abort the transaction
    -- Check state after rollback
    SELECT * FROM accounts; -- Should show balances reverted to state before BEGIN

  - Observe: COMMIT makes changes durable. ROLLBACK undoes all changes within the transaction, demonstrating atomicity.
- Isolation Level - Read Committed (Default): Observe Non-Repeatable Read.
- Terminal 1:
- Terminal 2:
- Terminal 1:
- Observe: Terminal 1's second SELECT reads a different value than the first SELECT within the same transaction. This is a Non-Repeatable Read, allowed by Read Committed. Each statement sees committed data as of the time the statement starts. (A possible command sequence is sketched below.)
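The individual terminal steps are not spelled out above; one possible sequence, a sketch assuming the accounts table from Setup with Bob's balance starting at 500.00, is:

-- Terminal 1
BEGIN; -- Read Committed is the default level
SELECT balance FROM accounts WHERE account_id = 'B202'; -- shows 500.00

-- Terminal 2
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'B202';
COMMIT;

-- Terminal 1 (same transaction as before)
SELECT balance FROM accounts WHERE account_id = 'B202'; -- now shows 400.00
COMMIT;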
- Isolation Level - Repeatable Read: Prevent Non-Repeatable Read.
- Reset Balances (Terminal 1):
- Terminal 1:
- Terminal 2:

    BEGIN;
    -- Try to update Bob's balance. This might wait (block) or succeed depending on timing/MVCC.
    -- Let's assume it succeeds and commits.
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 'B202';
    COMMIT;
    -- Verify Bob's balance from T2's perspective
    SELECT balance FROM accounts WHERE account_id = 'B202'; -- Shows 400.00
- Terminal 1:
- Observe: Terminal 1's second SELECT should still show the same balance (500.00) as the first SELECT, even though Terminal 2 committed a change. Repeatable Read ensures that data read within a transaction remains consistent for the duration of that transaction based on the snapshot taken when the transaction began. (The Terminal 1 steps are sketched below.)
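The steps that are not shown above might look like this sketch (resetting Bob's balance to 500.00 first):

-- Reset Balances (Terminal 1)
UPDATE accounts SET balance = 1000.00 WHERE account_id = 'A101';
UPDATE accounts SET balance = 500.00  WHERE account_id = 'B202';

-- Terminal 1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE account_id = 'B202'; -- shows 500.00

-- (now run Terminal 2's UPDATE and COMMIT as shown above)

-- Terminal 1 (same transaction)
SELECT balance FROM accounts WHERE account_id = 'B202'; -- still shows 500.00
COMMIT;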
- Isolation Level - Repeatable Read (Potential Write Conflict):
- Reset Balances (Terminal 1):
- Terminal 1:
- Terminal 2:
- Terminal 1:

    -- Now, Alice tries to update Bob's balance based on the old value read (500)
    -- E.g., calculate interest based on the 500 value.
    -- Try to update Bob's balance (conflicts with T2's update)
    UPDATE accounts SET balance = balance + 50 WHERE account_id = 'B202'; -- Tries to make it 550
    -- This command will likely FAIL with a serialization error!
    -- psql might output: ERROR: could not serialize access due to concurrent update
    -- You MUST issue a ROLLBACK after such an error.
    ROLLBACK;
- Observe: Because T1's view of the database was fixed when it started, and T2 modified the row T1 intended to update, T1's update would violate the Repeatable Read guarantee. The database detects this conflict (a write-write conflict based on T1's read snapshot) and forces T1 to roll back. The application would typically need to retry T1.
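The earlier steps of this task are not shown either; a possible sequence (a sketch) is:

-- Reset Balances (Terminal 1)
UPDATE accounts SET balance = 1000.00 WHERE account_id = 'A101';
UPDATE accounts SET balance = 500.00  WHERE account_id = 'B202';

-- Terminal 1: start a REPEATABLE READ transaction and read Bob's balance
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE account_id = 'B202'; -- shows 500.00

-- Terminal 2: modify the same row and commit
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'B202';
COMMIT;

-- Terminal 1 then attempts the conflicting UPDATE shown above and receives the serialization error.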
- Deadlock Simulation:
- Reset Balances (Terminal 1):
- Terminal 1:

    BEGIN;
    -- Lock Alice's account by updating it
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'A101'; -- Acquire lock on A101
    -- Wait for a moment to let Terminal 2 proceed
    \prompt 'Press Enter after starting T2 update...' dummy -- Pauses script (\prompt needs a variable name)
    -- Try to lock Bob's account
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'B202'; -- Waits for T2 to release lock on B202
    -- If deadlock occurs, this transaction might be chosen as the victim
    COMMIT; -- Or ROLLBACK if deadlock error occurs
- Terminal 2:

    BEGIN;
    -- Lock Bob's account by updating it
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'B202'; -- Acquire lock on B202
    -- Wait for a moment (less than T1 wait if possible)
    -- Try to lock Alice's account
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'A101'; -- Waits for T1 to release lock on A101
    -- Deadlock situation: T1 holds A101 waits for B202, T2 holds B202 waits for A101
    COMMIT; -- Or ROLLBACK if deadlock error occurs
- Observe: After both terminals attempt the second update, PostgreSQL's deadlock detector should intervene within a short time (usually a second or two). One of the terminals will receive an error message similar to ERROR: deadlock detected. The message will detail the locks involved. The transaction that received the error has been automatically rolled back. The other transaction can then proceed and commit successfully (if it wasn't trying to do anything else). Applications need to be prepared to catch deadlock errors and retry the failed transaction.
Cleanup (Optional):
-- In one terminal:
DROP TABLE accounts;
-- Disconnect (\q) then connect to postgres db
DROP DATABASE concurrency_db;
This workshop provides a tangible feel for how transactions isolate operations, how different isolation levels affect visibility and consistency, and how databases handle conflicts like serialization failures and deadlocks. Writing robust concurrent applications requires understanding these behaviors.
4. Database Performance Tuning Basics
As datasets grow and query complexity increases, database performance can degrade. Performance tuning involves identifying bottlenecks and optimizing the database structure and queries to improve response times and resource utilization.
Analyzing Query Performance with EXPLAIN
The most crucial tool for understanding how a database executes a specific SQL query is the EXPLAIN command (or its variants like EXPLAIN ANALYZE). It shows the query execution plan chosen by the database optimizer. The optimizer's goal is to find the most efficient way to execute a query, considering available indexes, table statistics, and possible execution strategies.
-
Usage:
EXPLAIN <Your SQL Query>;
- Shows the estimated plan without actually running the query. Useful for quick checks.
EXPLAIN ANALYZE <Your SQL Query>;
(Recommended for deeper analysis)- Actually executes the query, measures real timings and row counts, and then displays the plan along with the actual execution statistics. Provides much more accurate insights but incurs the cost of running the query. Caution: Use
EXPLAIN ANALYZE
with care onUPDATE
,DELETE
,INSERT
statements in production, as it will execute the modification. You might wrap it in a transaction and roll back:BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK;
- Actually executes the query, measures real timings and row counts, and then displays the plan along with the actual execution statistics. Provides much more accurate insights but incurs the cost of running the query. Caution: Use
-
Interpreting the Output: The output is typically a tree structure representing the steps the database takes. Key elements to look for:
- Scan Methods: How the database reads data from tables:
- Sequential Scan (Seq Scan): Reads the entire table block by block. Often indicates a missing index or that the query retrieves a large portion of the table, making an index less efficient. Can be slow on large tables.
- Index Scan: Uses an index to locate specific rows. Much faster than a Seq Scan if the index is selective (narrows down the search significantly). Can involve reading both the index and the table data (unless it's an Index-Only Scan).
- Index-Only Scan: Uses an index that contains all the data needed for the query (including `WHERE` clause and `SELECT` list columns). Avoids accessing the main table data ("heap fetches"), making it very fast. Requires the index to cover all necessary columns.
- Bitmap Heap Scan: A two-phase scan. First, an Index Scan (or multiple index scans) gathers the locations (TIDs) of relevant rows into a bitmap in memory. Second, it fetches the actual rows from the table ("heap") in physical disk order, which can be more efficient than random access for larger result sets. Often seen when combining results from multiple indexes or when an index scan returns many rows.
- Join Methods: How the database combines data from multiple tables:
- Nested Loop Join: Iterates through rows of the outer table. For each outer row, it scans the inner table (potentially using an index) to find matching rows. Efficient if the outer table is small and there's a good index on the join column of the inner table. Can be very slow if both tables are large and indexes are poor.
- Hash Join: Builds an in-memory hash table on the smaller of the two tables (based on the join key). Then, scans the larger table; for each row, it probes the hash table to find matches. Efficient for large tables when equi-joins are involved and enough memory is available. Does not require indexes on join keys but benefits from them for the initial table scans.
- Merge Join: Requires both input tables to be sorted on the join key. It then reads through both sorted tables simultaneously, merging the matching rows. Efficient if the inputs are already sorted (e.g., due to an index scan or a previous `ORDER BY`) or if the datasets are very large and don't fit in memory for a hash join. Often requires explicit sort steps if data isn't pre-sorted, which can be costly.
- Costs (Estimates in basic `EXPLAIN`):
  - `cost=startup_cost..total_cost`: The optimizer's estimate of the work required. The first number is the cost before the first row can be returned; the second is the total cost to return all rows. Units are arbitrary but useful for comparing different plans for the same query. Lower cost is generally better.
  - `rows`: Estimated number of rows output by this step.
  - `width`: Estimated average width (in bytes) of the rows output by this step.
- Actual Timings and Rows (`EXPLAIN ANALYZE`):
  - `actual time=startup_time..total_time`: Actual time (in milliseconds) spent. `startup_time` is the time to get the first row; `total_time` is the time to get all rows from this node and its children. This is the most valuable metric for identifying slow parts of the query.
  - `rows`: Actual number of rows returned by the node. Comparing this to the estimated `rows` can reveal if the database statistics are inaccurate, leading the optimizer to choose a suboptimal plan.
  - `loops`: Number of times this node was executed.
-
Analysis Process:
- Run `EXPLAIN ANALYZE` on your slow query.
- Start reading the plan from the innermost, most indented operations upwards/outwards.
- Look for high `actual time` values to identify the slowest steps.
- Examine scan methods: Are there Seq Scans on large tables for queries that should be selective? This suggests a missing or unused index.
- Examine join methods: Is an inefficient Nested Loop being used on large tables? Maybe a Hash Join or Merge Join would be better (which might require different indexing or more memory).
- Compare estimated `rows` vs. actual `rows`. Large discrepancies suggest outdated statistics. Run `ANALYZE <table_name>;` or `VACUUM ANALYZE <table_name>;` to update them.
- Look for explicit `Sort` operations, especially costly ones. Can they be avoided with an appropriate index that provides the required order?
- Check `Filter` conditions applied after a scan. If a Seq Scan is followed by a very selective filter, an index on the filter column could drastically improve performance.
Using `EXPLAIN ANALYZE` is an iterative process: analyze the plan, hypothesize an optimization (e.g., add an index, rewrite the query, update statistics), apply the change, and run `EXPLAIN ANALYZE` again to see if the plan and performance improved.
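As a quick illustration of that loop (the table and column names here are purely illustrative, not from the workshops), a read-only query can be analyzed directly, while a data-modifying statement can be analyzed inside a transaction that is rolled back:
```sql
-- Safe to run as-is: SELECT does not modify data
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- For DML, wrap the analysis in a transaction and roll it back
BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = 'archived' WHERE order_date < '2020-01-01';
ROLLBACK;
```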
Index Tuning Techniques
Based on the analysis of query plans using `EXPLAIN`, you can apply various index tuning techniques:
-
Adding Missing Indexes:
- If `EXPLAIN ANALYZE` shows a costly Seq Scan on a large table where the `WHERE` clause is selective (i.e., filters out most rows), an index on the column(s) in the `WHERE` clause is likely needed.
- If `EXPLAIN ANALYZE` shows a costly Nested Loop join where the inner table is scanned repeatedly, an index on the join column of the inner table is usually beneficial. Foreign key columns are prime candidates.
- If `EXPLAIN ANALYZE` shows a costly `Sort` operation because of an `ORDER BY` or `GROUP BY` clause, an index on the relevant column(s) might allow the database to retrieve rows in the desired order directly from the index, avoiding the sort step.
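For instance (hypothetical table and column names, not part of the workshop schema), each of the three cases above can be addressed with a plain `CREATE INDEX`:
```sql
-- Selective WHERE clause
CREATE INDEX idx_orders_status ON orders (status);
-- Join / foreign key column on the inner table of a Nested Loop
CREATE INDEX idx_order_items_order_id ON order_items (order_id);
-- Column used in ORDER BY, to avoid an explicit Sort step
CREATE INDEX idx_orders_created_at ON orders (created_at);
```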
-
Creating Composite Indexes:
- If queries frequently filter on multiple columns (e.g., `WHERE colA = ? AND colB = ?`), a composite index on `(colA, colB)` can be much more efficient than separate indexes on `colA` and `colB`.
- The order of columns in a composite index matters. Place columns with higher selectivity (more unique values) or columns used with equality predicates (`=`) earlier in the index definition. For an index on `(colA, colB)`, queries filtering only on `colA` can use the index effectively. Queries filtering only on `colB` generally cannot.
- Consider `ORDER BY` clauses as well. An index on `(colA, colB)` can satisfy `WHERE colA = ? ORDER BY colB`.
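A concrete sketch of these rules, again with illustrative names:
```sql
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Can use the index:
--   WHERE customer_id = 42 AND order_date = '2024-05-01'
--   WHERE customer_id = 42
--   WHERE customer_id = 42 ORDER BY order_date
-- Generally cannot use it efficiently:
--   WHERE order_date = '2024-05-01'   (non-leading column on its own)
```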
-
Creating Covering Indexes:
- If a query retrieves columns that are all present within an existing index (including `WHERE`, `SELECT`, `JOIN` clauses), the database might perform an Index-Only Scan.
- You can sometimes create indexes specifically to "cover" all columns needed by a frequent, critical query to enable Index-Only Scans. This involves including columns from the `SELECT` list in the index definition (e.g., using the `INCLUDE` clause in PostgreSQL's `CREATE INDEX` if the columns are not needed for filtering/sorting but just for retrieval). This trades write performance and storage for faster read performance for that specific query.
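A sketch using PostgreSQL's `INCLUDE` clause (available since PostgreSQL 11; names are illustrative):
```sql
CREATE INDEX idx_orders_cust_date_covering
    ON orders (customer_id, order_date)
    INCLUDE (total_amount);

-- This query can potentially be answered by an Index-Only Scan,
-- because every referenced column lives in the index:
SELECT order_date, total_amount FROM orders WHERE customer_id = 42;
```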
-
Using Expression Indexes:
- If queries frequently filter based on an expression or function applied to a column (e.g., `WHERE LOWER(email) = ?` or `WHERE date_trunc('month', created_at) = ?`), a regular index on the column (`email` or `created_at`) won't be used directly.
- Create an index on the expression itself, as sketched below.
- The query must use the exact same expression for the index to be considered.
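A minimal expression-index sketch (the `users` table is illustrative):
```sql
CREATE INDEX idx_users_lower_email ON users (LOWER(email));

-- Will be considered by the planner (same expression as the index):
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Will NOT benefit from this index (different expression):
SELECT * FROM users WHERE email = 'alice@example.com';
```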
-
Using Partial Indexes:
- If you frequently query only a specific subset of rows in a large table (e.g., `WHERE status = 'active'`), creating a full index might be inefficient if 'active' rows are a small percentage.
- A partial index includes only the rows matching the condition, as sketched below.
- This index will be smaller and faster to scan for queries that include `WHERE status = 'active'`.
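For example (illustrative names):
```sql
CREATE INDEX idx_orders_active_date
    ON orders (order_date)
    WHERE status = 'active';

-- Eligible to use the partial index (repeats the index condition):
SELECT * FROM orders WHERE status = 'active' AND order_date >= '2024-01-01';
```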
-
Removing Unused or Redundant Indexes:
- Indexes consume storage space and slow down `INSERT`, `UPDATE`, and `DELETE` operations. Regularly identify and remove indexes that are never used by queries. Database systems often provide tools or views to track index usage statistics (e.g., `pg_stat_user_indexes` in PostgreSQL).
- Remove redundant indexes. For example, if you have an index on `(colA, colB)` and another index only on `(colA)`, the second index is usually redundant because the composite index can also serve queries filtering only on `colA`. (There can be exceptions related to index size or specific query patterns, but often it's redundant).
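One way to spot never-used indexes in PostgreSQL is to query the statistics view mentioned above:
```sql
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;  -- idx_scan = 0 means the index has not been used since stats were last reset
```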
-
Maintaining Table Statistics:
- Accurate statistics about table data distribution are crucial for the query optimizer to choose the best plan. Outdated statistics can lead to poor choices like using a Seq Scan when an Index Scan would be better, or vice-versa.
- Regularly run `ANALYZE <table_name>;` or `VACUUM ANALYZE <table_name>;` (especially after large data modifications) to keep statistics up-to-date. Most databases also have an auto-vacuum/auto-analyze daemon that handles this automatically to some extent, but manual analysis can be necessary.
Performance tuning is an ongoing process. Monitor your database performance, analyze slow queries, apply appropriate indexing and other optimization techniques, and measure the impact of your changes.
Workshop Database Performance Tuning Basics
Goal: Use `EXPLAIN ANALYZE` to identify a performance bottleneck and improve query speed by adding an appropriate index.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the `psql` command-line client.
- A database created for this workshop (e.g., `tuning_db`).
Setup:
- Terminal: Connect to PostgreSQL:
psql -U your_username -d postgres
- Terminal: Create the database:
CREATE DATABASE tuning_db;
- Terminal: Connect to the new database:
\c tuning_db
-
Terminal: Create a table and populate it with a significant amount of data. We'll use `generate_series` to create many rows quickly.
```sql
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50),
    launch_date DATE,
    price NUMERIC(10, 2),
    -- Add a column with less distinct values for demonstration
    status VARCHAR(20) DEFAULT 'active'
);

-- Insert a large number of rows (e.g., 1 million)
-- This might take a minute or two
INSERT INTO products (product_name, category, launch_date, price, status)
SELECT
    'Product ' || g,
    'Category ' || (g % 100),                            -- 100 distinct categories
    '2020-01-01'::date + (g % 1000) * interval '1 day',  -- ~3 years of dates
    (random() * 500 + 10)::numeric(10,2),                -- Random price between 10 and 510
    CASE (g % 10)                                        -- Cycle through statuses
        WHEN 0 THEN 'discontinued'
        WHEN 1 THEN 'beta'
        ELSE 'active'                                    -- 80% are active
    END
FROM generate_series(1, 1000000) AS g;

-- Ensure statistics are up-to-date after the large insert
ANALYZE products;
```
Tasks:
-
Analyze a Slow Query without Index:
- Let's query for products in a specific category launched on a particular date. This combination should be relatively selective.
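The exact query isn't critical; any category/date combination that exists in the generated data will do. For example (the specific values are illustrative):
```sql
EXPLAIN ANALYZE
SELECT *
FROM products
WHERE category = 'Category 50'
  AND launch_date = '2021-03-15';
```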
- Run
EXPLAIN ANALYZE
on the query: - Observe the Output:
- Pay close attention to the Scan Method on the
products
table. It will almost certainly be aSeq Scan
(Sequential Scan). - Note the Execution Time. This is your baseline performance.
- Look at the
Filter:
line associated with the Seq Scan. It shows the conditions being applied (category = ...
andlaunch_date = ...
). - Observe the number of
rows removed by Filter
compared to the total rows scanned (which should be ~1 million). If the filter is highly selective (removes most rows), a Seq Scan is inefficient.
- Pay close attention to the Scan Method on the
-
Hypothesize and Add an Index:
- The query filters on
category
andlaunch_date
. A Seq Scan is slow because the database has to read the entire table and check every row. - Hypothesis: A composite index on
(category, launch_date)
should significantly speed up this query. The database could use the index to directly locate rows matching both criteria. - Create the index:
- The query filters on
-
Analyze the Query WITH Index:
- Run the exact same
EXPLAIN ANALYZE
command again: - Observe the New Output:
- The Scan Method should now be different. You'll likely see an
Index Scan
or possibly aBitmap Heap Scan
usingidx_products_category_launch
. - Compare the Execution Time to the previous run. It should be drastically lower (likely milliseconds instead of seconds).
- If it's an Index Scan, notice the
Index Cond:
line shows the conditions being used directly against the index. - If it's a Bitmap Heap Scan, notice the
Bitmap Index Scan
node using the index, followed by theBitmap Heap Scan
node fetching the rows from the table.
- The Scan Method should now be different. You'll likely see an
- Run the exact same
-
Analyze a Query on a Low Cardinality Column:
- Let's query for all 'discontinued' products. The
status
column has low cardinality (only a few distinct values: 'active', 'beta', 'discontinued'). - Run
EXPLAIN ANALYZE
: - Observe: You will likely still see a
Seq Scan
. Even though we are filtering, 'discontinued' represents about 10% of the table (100,000 rows). The optimizer likely determines that reading 10% of the table via an index (which involves jumping back and forth between the index and the table data) is more expensive than just reading the whole table sequentially.
- Let's query for all 'discontinued' products. The
-
Attempt Indexing Low Cardinality (and see if it's used):
- Create an index on
status
: - Re-run the
EXPLAIN ANALYZE
for the 'discontinued' query: - Observe: The planner might still choose a
Seq Scan
or possibly switch to aBitmap Heap Scan
. Even with the index, accessing 100,000 rows might be deemed faster via a sequential read or bitmap approach than through a standard index scan. This demonstrates that indexes are not always beneficial, especially on low-selectivity columns when retrieving a large fraction of the table.
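The statements for this step are not reproduced above; a straightforward version (the index name is illustrative) would be:
```sql
CREATE INDEX idx_products_status ON products (status);

EXPLAIN ANALYZE
SELECT * FROM products WHERE status = 'discontinued';
```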
- Create an index on
-
Consider a Partial Index:
- Imagine we very frequently query for 'beta' status products, which are only 10% of the data (100k rows). A full index wasn't great. But what if 'beta' was much rarer, say 0.1%? Let's simulate that scenario with a query targeting a specific, less common category.
- Query for a specific category, which is more selective (1% of data). (This might use the composite index if it exists, or Seq Scan if not indexed)
- Now, imagine we only ever care about 'Category 99' products that are also 'active'. Create a partial index:
- Analyze a query matching the partial index criteria:
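Putting the three sub-steps together (the column indexed by the partial index is a guess; the index name matches the one referenced in the observation below):
```sql
-- A more selective query on a single category (~1% of rows)
EXPLAIN ANALYZE
SELECT * FROM products WHERE category = 'Category 99';

-- Partial index covering only the rows we care about
CREATE INDEX idx_products_cat99_active ON products (launch_date)
WHERE category = 'Category 99' AND status = 'active';

-- A query matching the partial index condition
EXPLAIN ANALYZE
SELECT * FROM products
WHERE category = 'Category 99' AND status = 'active';
```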
- Observe: This query should now efficiently use the smaller
idx_products_cat99_active
partial index (likely via an Index Scan or Bitmap Scan on that index) because theWHERE
clause matches the index's condition.
Cleanup (Optional):
-- Drop the table (and implicitly its indexes)
DROP TABLE products;
-- Disconnect (\q) then connect to postgres db
DROP DATABASE tuning_db;
This workshop illustrated the power of `EXPLAIN ANALYZE` in diagnosing performance issues (identifying Seq Scans) and the effectiveness of adding appropriate indexes (like composite indexes) to resolve them. It also showed that indexing isn't a silver bullet, especially for low-selectivity columns, and introduced the concept of partial indexes for optimizing queries on specific data subsets.
5. Introduction to Server Side Database Programming
While SQL is powerful for querying and manipulating data, sometimes you need to encapsulate business logic, enforce complex rules, or perform actions automatically within the database itself. This is achieved using server-side programming features like Stored Procedures, Functions, and Triggers. These are typically written in a procedural language extension of SQL (like PL/pgSQL for PostgreSQL, T-SQL for SQL Server, PL/SQL for Oracle) or sometimes in other languages like Python, Perl, Tcl if supported by the RDBMS.
Stored Procedures and Functions
Stored Procedures and Stored Functions are blocks of code stored directly in the database catalog and executed on the server. They allow you to group multiple SQL statements and procedural logic (loops, conditionals, variables) into reusable units.
-
Key Differences:
- Functions: Primarily designed to compute and return a single value (scalar or composite type) or a set of rows (table function). They are typically callable within SQL queries (e.g., in `SELECT` lists or `WHERE` clauses). In PostgreSQL, functions are the primary mechanism and can perform actions, although procedures are now also supported more distinctly.
- Stored Procedures: Designed primarily to perform an action or a set of actions (e.g., complex data modifications, batch jobs). They may or may not return values, but typically do so via output parameters rather than a direct return value usable in SQL expressions. They are generally invoked using a specific command (e.g., `CALL` in the SQL standard and PostgreSQL).
-
Benefits:
- Encapsulation: Hide complex logic behind a simple interface. Application developers just call the procedure/function without needing to know the internal details.
- Reusability: Write the logic once and call it from multiple applications or different parts of the same application.
- Performance:
- Reduced Network Traffic: Instead of sending multiple individual SQL statements from the client to the server, you send a single call. The logic executes entirely on the server, reducing round trips.
- Pre-compilation/Caching: The database often parses and caches the execution plan for stored procedures/functions, potentially speeding up subsequent calls.
- Security: Grant `EXECUTE` permission on a procedure/function without granting permissions on the underlying tables. The procedure/function runs with the privileges of its definer (or invoker, depending on definition), allowing controlled access to data.
- Consistency: Ensure that complex operations involving multiple steps are always performed the same way, enforcing business rules consistently.
-
Example (PostgreSQL using PL/pgSQL): Creating a Function to Calculate Average GPA for a Department
```sql
-- Assumes the 'students' and 'departments' tables from previous workshops exist.
CREATE OR REPLACE FUNCTION get_department_avg_gpa(dept_id VARCHAR)
RETURNS NUMERIC(3, 2) -- Specifies the return type
AS $$ -- Start of function body block (using dollar-quoting)
DECLARE
    avg_gpa NUMERIC(3, 2); -- Declare a local variable
BEGIN
    -- Check if department exists
    IF NOT EXISTS (SELECT 1 FROM departments WHERE department_id = dept_id) THEN
        RAISE EXCEPTION 'Department % does not exist', dept_id;
    END IF;

    -- Calculate the average GPA for the given department ID
    SELECT AVG(s.gpa)
    INTO avg_gpa -- Store result in the variable
    FROM students s
    WHERE s.department_id = dept_id;

    -- Handle case where department has no students (AVG returns NULL)
    IF avg_gpa IS NULL THEN
        RETURN 0.00; -- Or perhaps raise an exception or return NULL as appropriate
    END IF;

    RETURN avg_gpa; -- Return the calculated value
END;
$$ LANGUAGE plpgsql; -- Specify the language used

-- How to call the function:
SELECT department_name, get_department_avg_gpa(department_id) AS average_gpa
FROM departments;

SELECT get_department_avg_gpa('CS');
```
-
Example (PostgreSQL using PL/pgSQL): Creating a Procedure to Transfer Students
```sql
-- Assumes 'students' and 'departments' tables exist.
CREATE OR REPLACE PROCEDURE transfer_student(
    s_id VARCHAR,        -- Input parameter: student ID
    new_dept_id VARCHAR  -- Input parameter: new department ID
)
AS $$
DECLARE
    old_dept_id VARCHAR;
BEGIN
    -- Validate inputs (basic example)
    IF NOT EXISTS (SELECT 1 FROM students WHERE student_id = s_id) THEN
        RAISE EXCEPTION 'Student % not found', s_id;
    END IF;
    IF NOT EXISTS (SELECT 1 FROM departments WHERE department_id = new_dept_id) THEN
        RAISE EXCEPTION 'Target Department % not found', new_dept_id;
    END IF;

    -- Get the old department ID for logging or other logic
    SELECT department_id INTO old_dept_id FROM students WHERE student_id = s_id;

    -- Perform the update
    UPDATE students SET department_id = new_dept_id WHERE student_id = s_id;

    -- Optional: Log the transfer action (could insert into an audit table)
    RAISE NOTICE 'Student % transferred from Dept % to Dept %', s_id, old_dept_id, new_dept_id;

    -- Procedures don't explicitly return a value like functions.
    -- COMMIT/ROLLBACK is generally handled by the calling transaction,
    -- unless the procedure manages its own transactions (less common).
END;
$$ LANGUAGE plpgsql;

-- How to call the procedure:
CALL transfer_student('S1006', 'PHYS'); -- Transfer Fiona to Physics

-- Verify the change
SELECT * FROM students WHERE student_id = 'S1006';
```
Triggers
A trigger is a special type of stored procedure that is automatically executed (fired) by the database server in response to certain events on a particular table or view. These events are typically `INSERT`, `UPDATE`, or `DELETE` operations.
-
Purpose:
- Enforcing Complex Business Rules: Implement constraints that are difficult or impossible to define using standard check or foreign key constraints.
- Auditing/Logging: Automatically record changes made to data into an audit trail table.
- Maintaining Data Consistency/Denormalization: Automatically update summary tables or derived data when underlying data changes (use with caution, can be complex).
- Preventing Invalid Operations: Intercept and prevent certain DML operations based on specific conditions.
-
Key Concepts:
- Trigger Event: The DML operation that causes the trigger to fire (`INSERT`, `UPDATE`, `DELETE`). PostgreSQL also supports `TRUNCATE`.
- Trigger Timing: When the trigger fires relative to the event:
  - `BEFORE`: Executes before the DML operation attempts to modify the row(s). Useful for validating or modifying data before it's written. Can prevent the operation by returning NULL or raising an exception (in row-level triggers).
  - `AFTER`: Executes after the DML operation has successfully completed (but typically still within the same transaction). Useful for actions that depend on the change having occurred, like logging or updating related tables.
  - `INSTEAD OF`: Used primarily for views. Executes instead of the DML operation on the view, allowing you to define how updates to complex views should translate into updates on the underlying base tables.
- Trigger Level:
  - `FOR EACH ROW`: The trigger function executes once for each individual row affected by the DML statement. Inside the trigger function, special variables `NEW` (representing the row after the change for INSERT/UPDATE) and `OLD` (representing the row before the change for UPDATE/DELETE) are available.
  - `FOR EACH STATEMENT`: The trigger function executes only once per DML statement, regardless of how many rows were affected. `NEW` and `OLD` are not available in statement-level triggers. Useful for actions that don't depend on individual row data.
-
Example (PostgreSQL using PL/pgSQL): Auditing Changes to Student GPAs
-
Create an Audit Table:
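The audit table's definition isn't preserved in this text. A minimal version that matches the columns used by the trigger function and the test query below might look like this (column types are assumptions):
```sql
CREATE TABLE student_gpa_audit (
    audit_id SERIAL PRIMARY KEY,
    student_id VARCHAR(10) NOT NULL,
    old_gpa NUMERIC(3, 2),
    new_gpa NUMERIC(3, 2),
    change_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
```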
-
Create the Trigger Function: This function contains the logic to be executed.
```sql
CREATE OR REPLACE FUNCTION log_gpa_change()
RETURNS TRIGGER -- Must return TRIGGER type
AS $$
BEGIN
    -- Only log if the GPA actually changed.
    -- TG_OP is a special variable indicating the operation (INSERT, UPDATE, DELETE)
    IF (TG_OP = 'UPDATE' AND OLD.gpa IS DISTINCT FROM NEW.gpa) THEN
        INSERT INTO student_gpa_audit (student_id, old_gpa, new_gpa)
        VALUES (OLD.student_id, OLD.gpa, NEW.gpa); -- Use OLD and NEW records
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO student_gpa_audit (student_id, old_gpa, new_gpa)
        VALUES (NEW.student_id, NULL, NEW.gpa); -- No OLD gpa on insert
    -- Could add ELSIF (TG_OP = 'DELETE') to log deletions if needed
    --     INSERT INTO student_gpa_audit (student_id, old_gpa, new_gpa)
    --     VALUES (OLD.student_id, OLD.gpa, NULL);
    END IF;

    -- For AFTER triggers, the return value is ignored, but returning NEW is common practice.
    -- For BEFORE triggers, returning NEW allows the operation to proceed (possibly modified),
    -- returning NULL skips/cancels the operation for that row.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```
-
Create the Trigger: Attach the function to the table and specify the event/timing.
```sql
CREATE TRIGGER students_gpa_audit_trigger
AFTER INSERT OR UPDATE ON students   -- Event and Timing
FOR EACH ROW                         -- Trigger Level
WHEN (pg_trigger_depth() = 0)        -- Optional: Prevent trigger cascades from self-calls
EXECUTE FUNCTION log_gpa_change();   -- The function to execute
```
(The `WHEN (pg_trigger_depth() = 0)` clause is a common PostgreSQL idiom to prevent infinite loops if the trigger function itself causes another event on the same table that would fire the same trigger).
Test the Trigger:
```sql
-- Update a student's GPA
UPDATE students SET gpa = 3.9 WHERE student_id = 'S1002';

-- Insert a new student
INSERT INTO students (student_id, student_name, department_id, gpa)
VALUES ('S2001', 'Test Student', 'CS', 3.0);

-- Check the audit table
SELECT * FROM student_gpa_audit ORDER BY change_timestamp DESC;
```
You should see entries in `student_gpa_audit` corresponding to the `UPDATE` and `INSERT` operations.
-
-
Potential Pitfalls:
- Complexity: Overuse of triggers can make application logic hard to follow and debug, as actions happen "magically" in the background.
- Performance: Poorly written triggers, especially row-level triggers on high-traffic tables, can significantly impact performance.
- Cascading Effects: Triggers can fire other triggers, potentially leading to complex and hard-to-predict interactions or even infinite loops if not carefully designed.
Server-side programming provides powerful tools for enhancing database functionality, but should be used judiciously, balancing the benefits against potential complexity and performance implications.
Workshop Introduction to Server Side Database Programming
Goal: Create and test a stored function and a trigger in the university database scenario.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the `psql` command-line client.
- The `university_db` database created in the "Advanced SQL Querying" workshop, with its tables (`departments`, `students`, `courses`, `enrollments`) and sample data. If you dropped it, recreate it using the setup steps from that workshop.
Setup:
- Terminal: Connect to your PostgreSQL instance and the `university_db` (e.g., `psql -U your_username -d university_db`).
- Verify Tables: Ensure the tables exist (e.g., with `\dt` inside `psql`).
Tasks:
-
Create a Stored Function: Create a function
get_student_credits(student_id)
that calculates the total credits a given student is currently enrolled in.- Hint: You'll need to join
enrollments
andcourses
tables, filter by the inputstudent_id
, and sum thecredits
column. Handle the case where the student ID is invalid or the student has no enrollments. - Implementation (SQL/PLpgSQL):
```sql
CREATE OR REPLACE FUNCTION get_student_credits(s_id VARCHAR)
RETURNS INTEGER
AS $$
DECLARE
    total_credits INTEGER;
BEGIN
    -- Check if student exists
    IF NOT EXISTS (SELECT 1 FROM students WHERE student_id = s_id) THEN
        RAISE EXCEPTION 'Student ID % not found.', s_id;
    END IF;

    -- Calculate total credits
    SELECT COALESCE(SUM(c.credits), 0) -- Use COALESCE to return 0 if SUM is NULL (no enrollments)
    INTO total_credits
    FROM enrollments e
    JOIN courses c ON e.course_id = c.course_id
    WHERE e.student_id = s_id;

    RETURN total_credits;
END;
$$ LANGUAGE plpgsql;
```
- Test the Function:
```sql
-- Test with Alice (CS101=4, CS305=3, MATH211=4 -> Total 11)
SELECT get_student_credits('S1001');

-- Test with Bob (CS101=4 -> Total 4)
SELECT get_student_credits('S1002');

-- Test with a student who has no enrollments (-> Total 0)
INSERT INTO students (student_id, student_name, gpa) VALUES ('S1007', 'Test NoEnroll', 3.0);
SELECT get_student_credits('S1007');

-- Test with an invalid student ID
-- SELECT get_student_credits('S9999'); -- This should raise an exception
```
-
Create an Audit Table for Enrollments: Create a table to log when enrollments are added or removed.
```sql
CREATE TABLE enrollment_audit (
    audit_id SERIAL PRIMARY KEY,
    action VARCHAR(10) NOT NULL,      -- e.g., 'INSERT', 'DELETE'
    enrollment_id INTEGER,            -- Can be NULL if action is 'DELETE' and we don't capture it before deletion
    student_id VARCHAR(10) NOT NULL,
    course_id VARCHAR(10) NOT NULL,
    action_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    db_user_name TEXT DEFAULT current_user
);
```
-
Create a Trigger Function for Enrollment Audit: This function will insert records into `enrollment_audit`.
```sql
CREATE OR REPLACE FUNCTION log_enrollment_change()
RETURNS TRIGGER
AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO enrollment_audit (action, enrollment_id, student_id, course_id)
        VALUES ('INSERT', NEW.enrollment_id, NEW.student_id, NEW.course_id);
        RETURN NEW; -- Allow INSERT to proceed
    ELSIF (TG_OP = 'DELETE') THEN
        INSERT INTO enrollment_audit (action, enrollment_id, student_id, course_id)
        VALUES ('DELETE', OLD.enrollment_id, OLD.student_id, OLD.course_id);
        RETURN OLD; -- Allow DELETE to proceed (return value is used but content often irrelevant for AFTER DELETE)
    END IF;

    -- We are not auditing UPDATEs in this example
    RETURN NULL; -- Should not happen for INSERT/DELETE, but good practice
END;
$$ LANGUAGE plpgsql;
```
-
Create the Trigger: Attach the function to the `enrollments` table to fire after inserts or deletes.
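The trigger definition itself isn't reproduced above; judging from the name used in the cleanup commands at the end of this workshop, it was presumably created along these lines:
```sql
CREATE TRIGGER enrollments_audit_trigger
AFTER INSERT OR DELETE ON enrollments
FOR EACH ROW
EXECUTE FUNCTION log_enrollment_change();
```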
Test the Trigger:
- Enroll Bob in Databases (CS305):
- Check Audit Table:
- Check Bob's Credits using the function:
- Remove Alice from Math (MATH211):
- Check Audit Table:
- Check Alice's Credits:
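The test statements are not preserved in this text. Assuming the enrollments table from the earlier workshop (columns `student_id`, `course_id`, `grade`), the steps above could be exercised like this:
```sql
-- Enroll Bob in Databases (CS305)
INSERT INTO enrollments (student_id, course_id, grade) VALUES ('S1002', 'CS305', NULL);

-- Check the audit table
SELECT * FROM enrollment_audit ORDER BY action_timestamp DESC;

-- Check Bob's credits using the function
SELECT get_student_credits('S1002');

-- Remove Alice from Math (MATH211)
DELETE FROM enrollments WHERE student_id = 'S1001' AND course_id = 'MATH211';

-- Check the audit table again
SELECT * FROM enrollment_audit ORDER BY action_timestamp DESC;

-- Check Alice's credits
SELECT get_student_credits('S1001');
```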
Cleanup (Optional):
-- Drop trigger, function, and audit table
DROP TRIGGER IF EXISTS enrollments_audit_trigger ON enrollments;
DROP FUNCTION IF EXISTS log_enrollment_change();
DROP FUNCTION IF EXISTS get_student_credits(VARCHAR);
DROP TABLE IF EXISTS enrollment_audit;
-- To fully clean up, drop the main tables and database if desired
-- DROP TABLE enrollments; DROP TABLE courses; DROP TABLE students; DROP TABLE departments;
-- \c postgres
-- DROP DATABASE university_db;
This workshop provided practical experience in creating a stored function to encapsulate logic (calculating credits) and using triggers to automate actions (auditing enrollment changes). These techniques are valuable for building more sophisticated and robust database applications.
6. Database Security Fundamentals
Protecting data stored within the database is paramount. Database security involves controlling who can access the database and what actions they are permitted to perform on specific objects. This is typically managed through user accounts, roles, and privileges.
User and Role Management
Databases maintain their own list of users, distinct from operating system users. To connect to a database, a client must typically provide valid database user credentials.
- Users (or Login Roles): Represent individual entities (people or applications) that need to connect to the database. Users have connection privileges and potentially other specific privileges granted directly to them.
- Creating Users (PostgreSQL Example):
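The example statement isn't preserved here; a typical form, using the `registrar_app` user referenced later in this section (the password is a placeholder), would be:
```sql
CREATE USER registrar_app WITH LOGIN PASSWORD 'a_strong_password_here';
```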
- Roles: Represent a set or "group" of privileges. Instead of granting the same set of privileges to multiple users individually, you can grant the privileges to a role, and then grant membership in that role to the users. This simplifies privilege management significantly. If a user needs more permissions, you add them to another role; if permissions need to change for a group, you modify the role, and all members inherit the change.
- In PostgreSQL, the concept of "User" and "Role" is largely unified. `CREATE USER` is essentially an alias for `CREATE ROLE` with the `LOGIN` privilege assumed by default. Roles created with `CREATE ROLE` do not have login privileges unless explicitly granted (`ALTER ROLE myrole WITH LOGIN;`). We often think of roles with `LOGIN` as "users" and roles without `LOGIN` as "groups".
- Creating Roles (PostgreSQL Example):
- Granting Role Membership: Assigning users to roles (or roles to other roles, creating hierarchies).
- Granting Membership (PostgreSQL Example):
```sql
-- Make the 'registrar_app' user a member of the 'data_entry_clerk' role
GRANT data_entry_clerk TO registrar_app;

-- Create another user and grant them read-only access
CREATE USER analyst_user WITH LOGIN PASSWORD 'another_secret';
GRANT read_only_access TO analyst_user;

-- Roles can inherit from other roles
GRANT read_only_access TO data_entry_clerk; -- Now data_entry_clerk also has read_only privileges
```
Using roles is the recommended practice for managing permissions for anything beyond the simplest setups.
Granting and Revoking Privileges
Privileges define the specific actions that a user or role is allowed to perform on database objects.
-
Common Object Privileges:
- `SELECT`: Allows reading data from a table, view, or specific columns.
- `INSERT`: Allows adding new rows to a table or view. Can be granted on specific columns.
- `UPDATE`: Allows modifying data in existing rows of a table or view. Can be granted on specific columns.
- `DELETE`: Allows removing rows from a table or view.
- `TRUNCATE`: Allows quickly removing all rows from a table (usually much faster than `DELETE` but often requires higher privileges).
- `REFERENCES`: Allows creating a foreign key constraint that references a table.
- `TRIGGER`: Allows creating triggers on a table.
- `USAGE`: Allows using schemas, sequences, data types, domains, foreign data wrappers, foreign servers. Required to access objects within a schema.
- `EXECUTE`: Allows calling a function or procedure.
-
Administrative Privileges:
- `CREATE`: Allows creating database objects (e.g., `CREATE TABLE`, `CREATE DATABASE`, `CREATE SCHEMA`, `CREATE ROLE`).
- `CONNECT`: Allows a user to connect to a specific database (checked after initial authentication).
-
`GRANT` Statement: Used to give privileges to roles or users.
  - Syntax (Simplified): `GRANT <privileges> ON <object> TO <role_or_user> [WITH GRANT OPTION];`
  - `PUBLIC`: A special pseudo-role meaning "all roles/users". Granting to PUBLIC should be done sparingly.
  - `WITH GRANT OPTION`: Allows the grantee to grant the same privilege to others. Use with extreme caution.
  - Examples (PostgreSQL):
```sql
-- Grant basic read access on the students table to the read_only_access role
GRANT SELECT ON TABLE students TO read_only_access;
GRANT SELECT ON TABLE courses TO read_only_access;
GRANT SELECT ON TABLE enrollments TO read_only_access;
GRANT SELECT ON TABLE departments TO read_only_access;

-- Grant data entry privileges on enrollments to the data_entry_clerk role
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE enrollments TO data_entry_clerk;

-- Also needs ability to SELECT from students and courses to verify FKs
GRANT SELECT ON TABLE students, courses TO data_entry_clerk;

-- Needs USAGE on the schema containing the tables (if not public)
-- Assuming tables are in the 'public' schema, USAGE is usually granted by default.
-- If in a different schema 'university_schema':
-- GRANT USAGE ON SCHEMA university_schema TO read_only_access;
-- GRANT USAGE ON SCHEMA university_schema TO data_entry_clerk;

-- Allow the data_entry_clerk role to use the get_student_credits function
GRANT EXECUTE ON FUNCTION get_student_credits(VARCHAR) TO data_entry_clerk;
```
-
`REVOKE` Statement: Used to remove previously granted privileges.
  - Syntax (Simplified): `REVOKE <privileges> ON <object> FROM <role_or_user> [CASCADE | RESTRICT];`
  - `GRANT OPTION FOR`: Revokes only the grant option, not the privilege itself.
  - `CASCADE`: Revokes the privilege and also revokes it from anyone who received it from the user/role being revoked from (directly or indirectly via grant option).
  - `RESTRICT`: (Often the default) Prevents revoking if dependent privileges exist (e.g., if the user granted the privilege to someone else).
  - Examples (PostgreSQL):
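The original examples are not preserved; statements in this spirit, reusing the roles from the GRANT examples, would be:
```sql
-- Take data-modification rights on enrollments away from data_entry_clerk
REVOKE INSERT, UPDATE, DELETE ON TABLE enrollments FROM data_entry_clerk;

-- Remove any read access on students that was granted to everyone
REVOKE SELECT ON TABLE students FROM PUBLIC;
```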
-
Principle of Least Privilege: A fundamental security concept. Users and applications should only be granted the absolute minimum privileges necessary to perform their intended functions. Avoid granting broad privileges like superuser status or `ALL PRIVILEGES` unless absolutely required for administrative tasks. Use specific roles with narrowly defined permissions.
Effective security management through users, roles, and carefully assigned privileges is crucial for protecting data integrity and confidentiality.
Workshop Database Security Fundamentals
Goal: Create users and roles, grant specific privileges, and test access restrictions.
Prerequisites:
- A running PostgreSQL instance on your Linux machine where you have administrative privileges (like the default `postgres` user).
- Access to the `psql` command-line client.
- The `university_db` database created in previous workshops, with its tables (`departments`, `students`, `courses`, `enrollments`). Assume these tables exist and contain data.
Setup:
- Terminal: Connect to PostgreSQL as an administrator, e.g. `psql -U postgres -d university_db -h localhost` (replace `postgres` if you use a different admin user).
- Verify Tables: Ensure the tables exist (e.g., with `\dt`).
Tasks:
-
Create Roles: Create two roles:
student_viewer
(for read-only access to course/enrollment info) andcourse_admin
(for managing course data). -
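Plain role definitions are enough here (sketch):
```sql
CREATE ROLE student_viewer;
CREATE ROLE course_admin;
```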
Create Users: Create two users:
alice_advisor
(who will advise students) andbob_registrar
(who will manage courses). Assign passwords. -
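For example (the passwords are placeholders):
```sql
CREATE USER alice_advisor WITH LOGIN PASSWORD 'alice_password_here';
CREATE USER bob_registrar WITH LOGIN PASSWORD 'bob_password_here';
```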
Grant Role Membership: Assign the users to the roles.
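A sketch of the membership grants:
```sql
GRANT student_viewer TO alice_advisor;
GRANT course_admin TO bob_registrar;
```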
-
Grant Privileges to Roles: Assign appropriate permissions to the roles based on the Principle of Least Privilege.
student_viewer
needs:- SELECT on
students
,courses
,enrollments
,departments
.
- SELECT on
course_admin
needs:- SELECT, INSERT, UPDATE, DELETE on
courses
. - (Implicitly needs USAGE on the schema, usually
public
by default).-- Privileges for student_viewer role GRANT SELECT ON TABLE students, courses, enrollments, departments TO student_viewer; -- Privileges for course_admin role GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE courses TO course_admin; -- Ensure USAGE on schema (often default for public, but good practice) GRANT USAGE ON SCHEMA public TO student_viewer; GRANT USAGE ON SCHEMA public TO course_admin;
- SELECT, INSERT, UPDATE, DELETE on
-
Test
alice_advisor
Permissions:- Open a NEW Terminal Window. Do not close the admin connection yet.
- New Terminal: Connect as
alice_advisor
: - New Terminal (as alice_advisor): Try allowed and disallowed actions:
-- Allowed actions (SELECT) SELECT student_name, gpa FROM students WHERE student_id = 'S1001'; -- OK SELECT course_name, credits FROM courses WHERE department_id = 'CS'; -- OK SELECT * FROM enrollments WHERE student_id = 'S1001'; -- OK -- Disallowed actions (INSERT, UPDATE, DELETE) INSERT INTO courses (course_id, course_name, department_id, credits) VALUES ('ART101', 'Art History', 'PHYS', 3); -- Should FAIL (permission denied for table courses) UPDATE students SET gpa = 4.0 WHERE student_id = 'S1001'; -- Should FAIL (permission denied for table students) DELETE FROM enrollments WHERE student_id = 'S1001'; -- Should FAIL (permission denied for table enrollments) -- Try creating a table (should fail - no CREATE privilege) -- CREATE TABLE my_test_table (id int); -- Should FAIL -- Exit this connection \q
-
Test
bob_registrar
Permissions:- Open another NEW Terminal Window (or reuse the previous non-admin one).
- New Terminal: Connect as
bob_registrar
: - New Terminal (as bob_registrar): Try allowed and disallowed actions:
-- Allowed actions (SELECT, INSERT, UPDATE, DELETE on courses) SELECT * FROM courses WHERE department_id = 'CS'; -- OK INSERT INTO courses (course_id, course_name, department_id, credits) VALUES ('ART101', 'Art History', 'PHYS', 3); -- OK (Assuming PHYS dept exists) UPDATE courses SET credits = 4 WHERE course_id = 'ART101'; -- OK DELETE FROM courses WHERE course_id = 'ART101'; -- OK -- Disallowed actions (accessing other tables like students) SELECT student_name FROM students WHERE student_id = 'S1001'; -- Should FAIL (permission denied for table students) INSERT INTO enrollments (student_id, course_id, grade) VALUES ('S1001', 'CS101', 'B'); -- Should FAIL (permission denied for table enrollments) -- Exit this connection \q
-
Modify Privileges (as Admin): Let's say
course_admin
should not be allowed to delete courses, only manage them otherwise.- Go back to the FIRST Terminal (connected as admin).
- Revoke the
DELETE
privilege:
-
Retest
bob_registrar
Delete Permission:- Connect again as
bob_registrar
in the second terminal. - Try to delete a course (first insert one to delete):
```sql
-- Connect as bob_registrar again
psql -U bob_registrar -d university_db -h localhost

-- Insert a course to delete
INSERT INTO courses (course_id, course_name, department_id, credits)
VALUES ('TEMP101', 'Temp Course', 'CS', 1); -- OK

-- Attempt to delete it
DELETE FROM courses WHERE course_id = 'TEMP101'; -- Should now FAIL (permission denied)

-- Clean up the temp course using UPDATE (which is still allowed)
-- Or just leave it. It shows the revoke worked.
-- UPDATE courses SET course_name = 'Old Temp Course' WHERE course_id = 'TEMP101'; -- This would be OK

\q
```
- Connect again as
Cleanup (Optional):
-- In the admin terminal:
DROP USER IF EXISTS alice_advisor;
DROP USER IF EXISTS bob_registrar;
DROP ROLE IF EXISTS student_viewer;
DROP ROLE IF EXISTS course_admin;
-- To fully clean up database:
-- \c postgres
-- DROP DATABASE university_db;
This workshop demonstrated the fundamental process of creating roles and users, granting specific privileges based on job function, and verifying that security restrictions are enforced correctly. This granular control is essential for protecting database assets.
Conclusion
Moving beyond the basics, this section equipped you with intermediate database skills crucial for real-world development and administration on Linux. You've learned to craft complex SQL queries using advanced joins, subqueries, CTEs, and window functions to extract deeper insights from data. We explored the critical principles of relational database design, focusing on normalization to ensure data integrity and minimize redundancy, and strategic indexing to optimize query performance.
Understanding how databases handle concurrent access through ACID properties, isolation levels, and locking mechanisms is vital for building reliable multi-user applications. You gained practical experience observing these concepts in action. Furthermore, you learned how to diagnose performance bottlenecks using `EXPLAIN ANALYZE` and apply index tuning techniques to speed up slow queries.
We introduced server-side programming with stored functions, procedures, and triggers, enabling you to encapsulate logic, enforce complex rules, and automate tasks directly within the database. Finally, we covered database security fundamentals, emphasizing the importance of user/role management and the Principle of Least Privilege to control access and protect your data.
The workshops provided hands-on practice reinforcing these concepts within a PostgreSQL environment on Linux. These intermediate skills form a strong foundation for tackling more complex database challenges, delving into advanced administration topics like replication and high availability, exploring different database paradigms like NoSQL, or specializing in database performance optimization.