Author: Nejat Hakan
Email: nejat.hakan@outlook.de
PayPal Me: https://paypal.me/nejathakan
Database Intermediate
Introduction
Welcome to the intermediate exploration of databases within the Linux environment. Having mastered the fundamentals of SQL (Structured Query Language) and basic database concepts (tables, rows, columns, primary keys), we now delve deeper into the mechanics, design, and optimization techniques that distinguish a novice user from a proficient database practitioner. This section assumes you have a working relational database system installed on your Linux machine (like PostgreSQL or MySQL/MariaDB) and are comfortable executing basic SQL commands.
In this part, we will move beyond simple data retrieval and manipulation. We will explore:
- Advanced SQL Querying: Techniques for formulating complex questions to extract intricate insights from your data, including sophisticated joins, subqueries, Common Table Expressions (CTEs), and powerful window functions.
- Relational Database Design Principles: Moving beyond simple table creation to understand the theory and practice of designing robust, efficient, and maintainable database schemas using normalization and effective indexing strategies.
- Transaction Management and Concurrency Control: Understanding how databases ensure data integrity even when multiple users or processes access and modify data simultaneously, focusing on ACID properties, isolation levels, and locking.
- Database Performance Tuning Basics: Learning how to identify and address performance bottlenecks in your queries and database structure using tools like EXPLAIN and applying index tuning techniques.
- Introduction to Server-Side Database Programming: Extending database functionality using stored procedures, functions, and triggers to encapsulate logic, enforce complex business rules, and automate tasks directly within the database server.
- Database Security Fundamentals: Implementing basic security measures by managing users, roles, and privileges to control access to sensitive data.
Our approach will be highly practical. Each theoretical subsection is followed by a hands-on "Workshop" where you will apply the learned concepts in a step-by-step manner using a sample database project within your Linux environment. Get ready to significantly enhance your database skills!
1. Advanced SQL Querying
While basic SELECT, INSERT, UPDATE, and DELETE statements are the foundation, real-world data analysis often requires more sophisticated querying techniques. Intermediate SQL involves combining data from multiple tables in complex ways, nesting queries, and performing calculations across sets of rows.
Complex Joins Beyond INNER JOIN
You are likely familiar with INNER JOIN, which returns rows only when there is a match in both tables being joined. However, other join types are essential for different scenarios:
- LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table (the first table mentioned) and the matched rows from the right table. If there is no match in the right table, NULL values are returned for columns from the right table.
- Use Case: Finding all customers and any orders they might have placed. Even customers with no orders will be listed.
- RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table (the second table mentioned) and the matched rows from the left table. If there is no match in the left table, NULL values are returned for columns from the left table.
- Use Case: Finding all orders and the customers who placed them. Even if an order somehow existed without a valid customer (which shouldn't happen in a well-designed DB, but illustrates the point), it would be listed. It's often less intuitive than LEFT JOIN, and many queries can be rewritten using LEFT JOIN by simply swapping the table order.
- FULL OUTER JOIN: Returns all rows when there is a match in either the left or the right table. It essentially combines the results of a LEFT JOIN and a RIGHT JOIN. If there's no match for a row from the left table in the right table, the right table's columns are NULL. Conversely, if there's no match for a row from the right table in the left table, the left table's columns are NULL.
- Use Case: Comparing two lists to see all items from both, highlighting matches and mismatches. For example, listing all registered students and all enrolled students in a specific course, showing who is registered but not enrolled, who is enrolled but not registered (perhaps an error), and who is both.
- CROSS JOIN: Returns the Cartesian product of the two tables – every row from the first table is combined with every row from the second table. A CROSS JOIN has no join condition; it is rarely needed in practice, but it can be useful for generating combinations. Be very careful, as it can produce massive result sets.
- Use Case: Generating all possible combinations of T-shirt sizes and colors.
Understanding when to use each join type is crucial for correctly retrieving the desired dataset. Visualizing these joins using Venn diagrams can be very helpful.
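To make the CROSS JOIN case concrete, here is a small illustrative sketch; the tshirt_sizes and tshirt_colors tables are hypothetical and exist only for this example:

-- Hypothetical tables used only for illustration
CREATE TABLE tshirt_sizes (size_code VARCHAR(5) PRIMARY KEY);
CREATE TABLE tshirt_colors (color_name VARCHAR(20) PRIMARY KEY);

INSERT INTO tshirt_sizes VALUES ('S'), ('M'), ('L');
INSERT INTO tshirt_colors VALUES ('Red'), ('Blue');

-- CROSS JOIN: every size combined with every color (3 x 2 = 6 rows)
SELECT s.size_code, c.color_name
FROM tshirt_sizes s
CROSS JOIN tshirt_colors c;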
Subqueries and Common Table Expressions CTEs
A subquery (also known as an inner query or nested query) is a query embedded inside another SQL query. Subqueries can appear in various clauses: SELECT, FROM, WHERE, and HAVING.
- Subqueries in WHERE Clause: Often used for filtering based on the result of another query.
  - Operators: IN, NOT IN, ANY, ALL, EXISTS, NOT EXISTS.
  - Example: Find all students enrolled in courses taught by a specific professor.
- Correlated Subqueries: A subquery that references columns from the outer query. The subquery is re-evaluated for each row processed by the outer query. These can be powerful but potentially slow if not written carefully.
- Example: Find all courses where the enrollment count is above the average enrollment count for their respective department.
SELECT c.course_name, c.department_id
FROM courses c
WHERE (SELECT COUNT(*)
       FROM enrollments e
       WHERE e.course_id = c.course_id) >
      (SELECT AVG(enroll_count)
       FROM (SELECT COUNT(*) as enroll_count
             FROM enrollments e2
             JOIN courses c2 ON e2.course_id = c2.course_id
             WHERE c2.department_id = c.department_id -- Correlation
             GROUP BY e2.course_id) as dept_avg);
- Subqueries in FROM Clause (Derived Tables): The subquery's result set is treated as a temporary table that the outer query can select from.
  - Example: Calculate the average number of courses per student (a sketch follows below).
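That derived-table example could be written as in the following sketch, using the students and enrollments tables defined in the workshop later in this section:

-- Average number of course enrollments per student (a sketch)
SELECT AVG(course_count) AS avg_courses_per_student
FROM (SELECT student_id, COUNT(*) AS course_count
      FROM enrollments
      GROUP BY student_id) AS per_student;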
Common Table Expressions (CTEs) provide a way to define temporary, named result sets that you can reference within a single SQL statement (SELECT, INSERT, UPDATE, or DELETE). They are defined using the WITH clause.
- Benefits:
- Readability: Break down complex queries into logical, named blocks.
- Reusability: Reference the same CTE multiple times within the main query.
- Recursion: CTEs can reference themselves, enabling recursive queries (e.g., organizational hierarchies, bill of materials).
- Syntax: WITH cte_name AS (subquery) [, another_cte AS (subquery) ...] followed by the main query, which can reference each CTE by name.
- Example (Rewriting the correlated subquery example using CTEs):
WITH CourseEnrollmentCounts AS (
    SELECT course_id, COUNT(*) as enroll_count
    FROM enrollments
    GROUP BY course_id
),
DepartmentAvgEnrollment AS (
    SELECT c.department_id, AVG(cec.enroll_count) as avg_dept_enroll
    FROM CourseEnrollmentCounts cec
    JOIN courses c ON cec.course_id = c.course_id
    GROUP BY c.department_id
)
SELECT c.course_name, c.department_id
FROM courses c
JOIN CourseEnrollmentCounts cec ON c.course_id = cec.course_id
JOIN DepartmentAvgEnrollment dae ON c.department_id = dae.department_id
WHERE cec.enroll_count > dae.avg_dept_enroll;
Notice how the CTE approach separates the calculation of individual course counts and department averages into distinct, understandable steps.
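The recursion benefit mentioned above deserves a quick illustration. The sketch below assumes a hypothetical employees table with a self-referencing manager_id column (not part of this section's sample schema); it walks an organizational hierarchy down from the top-level manager:

WITH RECURSIVE org_chart AS (
    -- Anchor member: start from the top-level manager
    SELECT employee_id, employee_name, manager_id, 1 AS depth
    FROM employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive member: add direct reports of rows already found
    SELECT e.employee_id, e.employee_name, e.manager_id, oc.depth + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart ORDER BY depth, employee_name;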
Window Functions
Window functions perform calculations across a set of table rows that are somehow related to the current row. Unlike aggregate functions (like SUM(), AVG(), COUNT()) used with GROUP BY, window functions do not collapse the rows into a single output row; they return a value for each row based on a "window" of related rows.
- Key Concepts:
  - OVER() Clause: This clause is mandatory and specifies that the function is a window function.
  - PARTITION BY (Optional): Divides the rows into partitions (groups). The window function is applied independently to each partition. Similar to GROUP BY, but doesn't collapse rows.
  - ORDER BY (Optional within OVER()): Orders rows within each partition. This is crucial for ranking and sequence-related functions (ROW_NUMBER, RANK, LAG, LEAD).
  - Frame Clause (Optional within OVER()): Specifies the precise set of rows (the "window frame") within the partition relative to the current row (e.g., ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW).
- Common Window Functions:
  - Ranking: ROW_NUMBER(), RANK(), DENSE_RANK(), NTILE(n)
    - ROW_NUMBER(): Assigns a unique sequential integer to each row within its partition.
    - RANK(): Assigns ranks based on the ORDER BY clause within the partition. Skips ranks after ties (e.g., 1, 2, 2, 4).
    - DENSE_RANK(): Assigns ranks without gaps after ties (e.g., 1, 2, 2, 3).
  - Aggregate: SUM(), AVG(), COUNT(), MIN(), MAX() used with OVER().
    - Example: Calculate the running total of order amounts for each customer (see the sketch after this list).
  - Value: LAG(), LEAD(), FIRST_VALUE(), LAST_VALUE(), NTH_VALUE()
    - LAG(column, offset, default): Access data from a previous row within the partition.
    - LEAD(column, offset, default): Access data from a subsequent row within the partition.
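The running-total example mentioned in the Aggregate bullet could look like the following sketch; the orders table (customer_id, order_date, amount) is hypothetical and not part of this section's sample schema:

SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id
                         ORDER BY order_date
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM orders;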
- Example: Rank students within each department based on their GPA.

    SELECT student_name,
           department_id,
           gpa,
           RANK() OVER (PARTITION BY department_id ORDER BY gpa DESC) as department_rank,
           DENSE_RANK() OVER (PARTITION BY department_id ORDER BY gpa DESC) as dense_department_rank,
           ROW_NUMBER() OVER (PARTITION BY department_id ORDER BY gpa DESC) as row_num_in_department
    FROM students;
- Example: Show each employee's salary and the average salary of their department.
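A possible formulation of that last example, assuming a hypothetical employees table with salary and department_id columns (not part of this section's sample schema):

SELECT employee_name,
       department_id,
       salary,
       AVG(salary) OVER (PARTITION BY department_id) AS avg_dept_salary
FROM employees;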
Mastering these advanced SQL techniques allows for sophisticated data analysis and reporting directly within the database, often more efficiently than processing raw data in an application layer.
Workshop Advanced SQL Querying
Goal: Practice using advanced joins, subqueries, CTEs, and window functions in a university database scenario.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the psql command-line client or another database tool.
- A database created for this workshop (e.g., university_db).
Setup:
- Connect to your PostgreSQL instance:
psql -U your_username -d postgres
- Create the database:
CREATE DATABASE university_db;
- Connect to the new database:
\c university_db
- Create the necessary tables and insert sample data:
CREATE TABLE departments (
    department_id VARCHAR(10) PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL,
    faculty_count INTEGER
);

CREATE TABLE students (
    student_id VARCHAR(10) PRIMARY KEY,
    student_name VARCHAR(100) NOT NULL,
    department_id VARCHAR(10) REFERENCES departments(department_id),
    gpa NUMERIC(3, 2) CHECK (gpa >= 0.0 AND gpa <= 4.0)
);

CREATE TABLE courses (
    course_id VARCHAR(10) PRIMARY KEY,
    course_name VARCHAR(100) NOT NULL,
    department_id VARCHAR(10) REFERENCES departments(department_id),
    credits INTEGER CHECK (credits > 0)
);

CREATE TABLE enrollments (
    enrollment_id SERIAL PRIMARY KEY,
    student_id VARCHAR(10) REFERENCES students(student_id),
    course_id VARCHAR(10) REFERENCES courses(course_id),
    grade CHAR(1), -- A, B, C, D, F or NULL if not graded yet
    UNIQUE (student_id, course_id) -- A student can enroll in a course only once
);

-- Insert Sample Data
INSERT INTO departments (department_id, department_name, faculty_count) VALUES
('CS', 'Computer Science', 25),
('EE', 'Electrical Engineering', 20),
('MATH', 'Mathematics', 15),
('PHYS', 'Physics', 18);

INSERT INTO students (student_id, student_name, department_id, gpa) VALUES
('S1001', 'Alice Wonderland', 'CS', 3.8),
('S1002', 'Bob The Builder', 'CS', 3.5),
('S1003', 'Charlie Chaplin', 'EE', 3.9),
('S1004', 'Diana Prince', 'EE', 3.7),
('S1005', 'Ethan Hunt', 'MATH', 3.2),
('S1006', 'Fiona Glenanne', NULL, 3.6); -- Student not yet assigned to a dept

INSERT INTO courses (course_id, course_name, department_id, credits) VALUES
('CS101', 'Intro to Programming', 'CS', 4),
('CS305', 'Databases', 'CS', 3),
('EE201', 'Circuit Theory', 'EE', 4),
('MATH211', 'Calculus III', 'MATH', 4),
('PHYS101', 'General Physics I', 'PHYS', 5),
('CS550', 'Advanced Algorithms', 'CS', 3); -- An advanced course maybe not taken by many

INSERT INTO enrollments (student_id, course_id, grade) VALUES
('S1001', 'CS101', 'A'),
('S1001', 'CS305', 'A'),
('S1002', 'CS101', 'B'),
('S1003', 'EE201', 'A'),
('S1004', 'EE201', 'B'),
('S1004', 'CS101', 'C'), -- Cross-department enrollment
('S1005', 'MATH211', 'B'),
('S1001', 'MATH211', 'A'); -- Cross-department enrollment
Tasks:
- LEFT JOIN: List all students and the names of the courses they are enrolled in. Include students who are not enrolled in any course (though our sample data doesn't have such a case, the query structure matters). Also, list the student's department name.
  - Hint: You'll need multiple joins. Start with students and LEFT JOIN enrollments. Then join courses and departments (a possible solution is sketched below).
  - Observe: All students should be listed. If a student had no enrollments, their course_name would be NULL. Notice Fiona doesn't have a department name.
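One possible solution for this task (a sketch; other join orders work too). A LEFT JOIN to departments keeps Fiona, who has no department:

SELECT s.student_name, d.department_name, c.course_name
FROM students s
LEFT JOIN enrollments e ON s.student_id = e.student_id
LEFT JOIN courses c ON e.course_id = c.course_id
LEFT JOIN departments d ON s.department_id = d.department_id
ORDER BY s.student_name, c.course_name;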
- RIGHT JOIN (Conceptual): List all courses and the names of students enrolled in them. Include courses that have no students enrolled.
  - Hint: Use courses as the left table and enrollments as the right table in a LEFT JOIN for easier reading, or swap and use RIGHT JOIN.

    -- Using LEFT JOIN (more common)
    SELECT c.course_name, c.course_id, s.student_name
    FROM courses c
    LEFT JOIN enrollments e ON c.course_id = e.course_id
    LEFT JOIN students s ON e.student_id = s.student_id
    ORDER BY c.course_name, s.student_name;

    -- Equivalent using RIGHT JOIN
    SELECT c.course_name, c.course_id, s.student_name
    FROM enrollments e
    RIGHT JOIN courses c ON e.course_id = c.course_id
    LEFT JOIN students s ON e.student_id = s.student_id
    ORDER BY c.course_name, s.student_name;

  - Observe: Courses CS550 and PHYS101 appear with NULL student names because no one is enrolled in them in our sample data.
- Subquery in WHERE: Find the names of students enrolled in the 'Databases' course (CS305).

    SELECT student_name
    FROM students
    WHERE student_id IN (SELECT student_id
                         FROM enrollments
                         WHERE course_id = 'CS305');
- Result: Should show 'Alice Wonderland'.
- Subquery in FROM (Derived Table): Find the department(s) with the highest average GPA among their students.
  - Hint: First, calculate the average GPA per department. Then find the maximum of these averages. Finally, select the department(s) matching that maximum average.

    SELECT department_id, avg_gpa
    FROM (SELECT department_id, AVG(gpa) as avg_gpa
          FROM students
          WHERE department_id IS NOT NULL -- Exclude students without a department
          GROUP BY department_id) as dept_gpa
    WHERE avg_gpa = (SELECT MAX(avg_gpa)
                     FROM (SELECT department_id, AVG(gpa) as avg_gpa
                           FROM students
                           WHERE department_id IS NOT NULL
                           GROUP BY department_id) as max_dept_gpa);

  - Observe: This shows the department ID and its average GPA for the department(s) with the top average.
- CTE: Calculate the number of students enrolled in each department. List department name and student count.
  - Hint: Create a CTE to count students per department_id from the students table. Then join this CTE with the departments table.

    WITH DepartmentStudentCounts AS (
        SELECT department_id, COUNT(student_id) as num_students
        FROM students
        WHERE department_id IS NOT NULL
        GROUP BY department_id
    )
    SELECT d.department_name, dsc.num_students
    FROM departments d
    JOIN DepartmentStudentCounts dsc ON d.department_id = dsc.department_id
    ORDER BY d.department_name;

  - Observe: This gives a clean count per department.
- Window Function (Ranking): Rank students within each department based on GPA (highest GPA gets rank 1). Show student name, department name, GPA, and their rank.

    SELECT s.student_name,
           d.department_name,
           s.gpa,
           RANK() OVER (PARTITION BY s.department_id ORDER BY s.gpa DESC) as department_rank
    FROM students s
    JOIN departments d ON s.department_id = d.department_id -- Use INNER JOIN to only rank students in departments
    ORDER BY d.department_name, department_rank;

  - Observe: Students are grouped by department, and within each department, ranked by GPA. If there were ties, RANK() would show gaps. Try DENSE_RANK() to see the difference.
- Window Function (Aggregate): For each student, show their name, GPA, and the average GPA of their department.

    SELECT s.student_name,
           s.gpa,
           d.department_name,
           AVG(s.gpa) OVER (PARTITION BY s.department_id) as avg_dept_gpa
    FROM students s
    JOIN departments d ON s.department_id = d.department_id
    ORDER BY d.department_name, s.student_name;

  - Observe: Each student row now includes the average GPA calculated across all students within their specific department. The AVG() function here doesn't collapse rows because of the OVER() clause.
Cleanup (Optional):
-- Disconnect from the database first (\q in psql)
-- Then connect to another database like 'postgres'
-- psql -U your_username -d postgres
DROP DATABASE university_db;
This workshop provides hands-on experience with advanced SQL constructs, enabling you to tackle more complex data retrieval and analysis tasks.
2. Relational Database Design Principles
Designing a database schema is more than just creating tables; it's about structuring data logically and efficiently to ensure data integrity, minimize redundancy, and facilitate querying. Two core concepts in relational database design are normalization and indexing.
Normalization In Depth
Normalization is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy and improve data integrity. It involves decomposing larger tables into smaller, well-structured tables and defining relationships between them. The primary goals are:
- Eliminate Redundancy: Storing the same piece of data multiple times wastes space and increases the risk of inconsistencies.
- Avoid Update Anomalies: Issues that arise when inserting, updating, or deleting data.
- Insertion Anomaly: Inability to add data about one entity without adding data about another unrelated entity. (e.g., Cannot add a new course unless a student enrolls in it, if course details are stored only in an enrollment table).
- Update Anomaly: Changing redundant data in one place requires changing it everywhere it appears. Failure to do so leads to inconsistencies. (e.g., If a professor's office number is stored with every course they teach, changing their office requires updating multiple rows).
- Deletion Anomaly: Deleting data about one entity unintentionally deletes data about another entity. (e.g., If student and course details are in the same table, deleting the last student enrolled in a course might delete the course information itself).
Normalization is typically achieved by following a series of rules called Normal Forms (NF):
- First Normal Form (1NF):
- Rule: Each column must contain atomic (indivisible) values, and each column must have a unique name within the table. There should be no repeating groups of columns.
- Violation Example: A phone_numbers column containing a comma-separated list ('555-1234, 555-5678'). A set of columns like child1_name, child2_name, child3_name.
- Fix: Create a separate related table (e.g., a StudentPhoneNumbers table with student_id and phone_number columns, allowing multiple rows per student). For repeating groups, normalize into a separate table (see the sketch below).
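A sketch of such a fix; the table and column names are illustrative, not part of the workshop schema:

-- One row per student per phone number instead of a comma-separated list
CREATE TABLE student_phone_numbers (
    student_id   VARCHAR(10) NOT NULL,
    phone_number VARCHAR(20) NOT NULL,
    PRIMARY KEY (student_id, phone_number)
);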
- Second Normal Form (2NF):
- Rule: The table must be in 1NF, and all non-key attributes must be fully functionally dependent on the entire primary key. This rule applies primarily to tables with composite primary keys (keys made up of multiple columns).
- Violation Example: A table Enrollments(StudentID, CourseID, StudentName, CourseName, Grade). The primary key is (StudentID, CourseID).
  - Grade depends on the full key (StudentID, CourseID).
  - StudentName depends only on StudentID (partial dependency).
  - CourseName depends only on CourseID (partial dependency).
- Fix: Decompose into multiple tables:
Students(StudentID, StudentName, ...)
Courses(CourseID, CourseName, ...)
Enrollments(StudentID, CourseID, Grade)
- Third Normal Form (3NF):
- Rule: The table must be in 2NF, and there should be no transitive dependencies. A transitive dependency exists when a non-key attribute depends on another non-key attribute, which in turn depends on the primary key. (A -> B and B -> C implies A -> C transitively).
- Violation Example: A table Courses(CourseID, CourseName, ProfessorID, ProfessorOffice). Primary key is CourseID.
  - CourseName depends on CourseID.
  - ProfessorID depends on CourseID (assuming one professor per course).
  - ProfessorOffice depends on ProfessorID (a non-key attribute).
  - Therefore, CourseID -> ProfessorID -> ProfessorOffice creates a transitive dependency (CourseID -> ProfessorOffice).
- Fix: Decompose into multiple tables:
Courses(CourseID, CourseName, ProfessorID)
Professors(ProfessorID, ProfessorName, ProfessorOffice)
- Boyce-Codd Normal Form (BCNF): A slightly stricter version of 3NF. A table is in BCNF if, for every non-trivial functional dependency X -> Y, X is a superkey (i.e., X determines all other attributes in the table). Most 3NF tables are also in BCNF. Violations are rare and often involve tables with multiple candidate keys that overlap.
- Higher Normal Forms (4NF, 5NF, 6NF): Address more complex dependencies like multi-valued dependencies and join dependencies. They are less commonly applied in typical business application design but are important in specific complex scenarios.
Normalization Trade-offs: While normalization reduces redundancy and improves integrity, it often increases the number of tables. This can lead to more complex queries involving multiple joins, which might impact performance. Sometimes, denormalization (intentionally introducing some redundancy by relaxing normal forms, e.g., moving from 3NF back to 2NF for specific columns) is performed selectively to optimize read performance for frequently executed queries. This should be done cautiously and typically only after identifying performance bottlenecks.
Effective Indexing Strategies
An index is a special lookup table that the database search engine can use to speed up data retrieval operations. Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book.
- How Indexes Work (Typically B-Trees): Most relational databases use a B-tree (or variant like B+ tree) structure for indexes. A B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time (O(log n)). The index stores the indexed column value(s) and a pointer (e.g., row ID or physical address) to the actual table row containing that value. When you query based on an indexed column (e.g., in a WHERE clause or JOIN condition), the database can use the B-tree to quickly locate the relevant row pointers instead of scanning the entire table (a "full table scan").
- Types of Indexes:
- Primary Key Index: Automatically created when you define a primary key. Ensures uniqueness and provides fast lookup by the primary key value. Often clustered.
- Unique Index: Ensures that all values in the indexed column(s) are unique (allows one NULL value usually, though behavior varies by RDBMS). Created automatically for unique constraints.
- Clustered Index (Specific RDBMS, e.g., SQL Server, MySQL's InnoDB): Determines the physical order of data rows in the table. A table can have only one clustered index. If defined, accessing rows sequentially based on the clustered index key is very fast. Primary keys are often implemented as clustered indexes by default.
- Non-Clustered Index (Heap Tables in some RDBMS): The index structure is separate from the data rows. The index contains pointers to the actual rows, which may be stored in a different order (often a heap). A table can have multiple non-clustered indexes. This is the more common index type in PostgreSQL.
- Composite Index (Multi-column Index): An index created on two or more columns. Useful for queries that filter or sort on multiple columns simultaneously. The order of columns in the composite index matters significantly. An index on (colA, colB) can efficiently support queries filtering on colA alone, or on both colA and colB, but usually not efficiently on colB alone.
- Partial Index (PostgreSQL, SQLite): An index built on a subset of a table's rows, defined by a WHERE clause in the CREATE INDEX statement. Useful for indexing only frequently queried subsets (e.g., index only active orders).
- Expression Index / Function-Based Index (PostgreSQL, Oracle): An index built on the result of a function or expression involving one or more columns. Useful for queries that filter based on the result of a function (e.g., WHERE LOWER(username) = 'admin'). A sketch of both is shown below.
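In PostgreSQL, the last two index types might be created roughly like this; the orders and users tables and their columns are illustrative assumptions:

-- Partial index: only index rows that are still 'active'
CREATE INDEX idx_orders_active ON orders (order_date) WHERE status = 'active';

-- Expression index: support case-insensitive lookups on username
CREATE INDEX idx_users_lower_username ON users (LOWER(username));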
- When to Create Indexes:
  - Columns frequently used in WHERE clauses.
  - Columns frequently used in JOIN conditions (foreign keys are prime candidates).
  - Columns frequently used in ORDER BY or GROUP BY clauses.
- When Not to Create Indexes (or to be Cautious):
  - Small Tables: The overhead of using an index might outweigh the benefit of a full table scan.
  - Columns with Low Cardinality (Few Unique Values): Indexing a column like gender ('Male', 'Female', 'Other') is often not helpful, as the index doesn't significantly narrow down the search space. The database might still prefer a full table scan. Index selectivity is key – indexes work best on columns where values are highly unique or selective.
  - Tables with Frequent, Heavy Write Operations (INSERT, UPDATE, DELETE): Every index needs to be updated whenever the underlying table data changes. Too many indexes can significantly slow down write operations.
  - Columns Rarely Queried: Unused indexes waste storage space and add overhead to write operations.
- Indexing Strategy:
  - Identify Critical Queries: Focus on the most frequent or performance-sensitive queries.
  - Analyze Query Predicates: Examine the WHERE, JOIN, ORDER BY, and GROUP BY clauses.
  - Index Foreign Keys: Usually a good starting point.
  - Consider Composite Indexes: For queries filtering/sorting on multiple columns. Place the most selective column(s) first in the index definition.
  - Use EXPLAIN: Analyze query execution plans (covered later) to see if indexes are being used effectively and identify missing indexes.
  - Monitor and Refine: Regularly review index usage and performance. Drop unused indexes and potentially add new ones as query patterns evolve.
Effective database design through normalization and strategic indexing is fundamental to building scalable and performant database applications.
Workshop Relational Database Design Principles
Goal: Analyze a poorly designed table, normalize it to 3NF, and add appropriate indexes based on anticipated queries.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the psql command-line client or another database tool.
- A database created for this workshop (e.g., design_workshop_db).
Setup:
- Connect to your PostgreSQL instance:
psql -U your_username -d postgres
- Create the database:
CREATE DATABASE design_workshop_db;
- Connect to the new database:
\c design_workshop_db
- Create the initial unnormalized table representing project assignments:
-- !! This is a deliberately poorly designed table for demonstration !!
CREATE TABLE project_assignments_bad (
    assignment_id INT PRIMARY KEY,
    project_name VARCHAR(100),
    project_location VARCHAR(50),
    employee_id VARCHAR(10),
    employee_name VARCHAR(100),
    employee_email VARCHAR(100), -- Potential 1NF violation if multiple emails stored
    department_id VARCHAR(10),
    department_name VARCHAR(100),
    assignment_date DATE,
    hours_billed NUMERIC(5, 2)
);

-- Insert some sample data (notice the redundancy)
INSERT INTO project_assignments_bad VALUES
(1, 'Alpha Launch', 'New York', 'E101', 'Alice Smith', 'alice.s@example.com', 'D01', 'Technology', '2023-10-01', 40.0),
(2, 'Alpha Launch', 'New York', 'E102', 'Bob Johnson', 'bob.j@example.com', 'D01', 'Technology', '2023-10-01', 35.5),
(3, 'Beta Test', 'London', 'E101', 'Alice Smith', 'alice.s@example.com', 'D01', 'Technology', '2023-11-15', 50.0),
(4, 'Gamma Initiative', 'New York', 'E205', 'Charlie Brown', 'charlie.b@example.com', 'D02', 'Marketing', '2023-10-05', 60.0),
(5, 'Alpha Launch', 'New York', 'E205', 'Charlie Brown', 'charlie.b@example.com', 'D02', 'Marketing', '2023-11-01', 25.0);
Tasks:
- Analyze the project_assignments_bad Table:
  - Identify violations of Normal Forms (1NF, 2NF, 3NF).
  - List the potential insertion, update, and deletion anomalies.
  - Think:
    - 1NF: Is employee_email potentially storing multiple values? (Let's assume for now it's intended to be one, but it's a common place for 1NF issues). Are there repeating groups? (No obvious ones here).
    - 2NF: The primary key is assignment_id (a single column). Since the PK is not composite, 2NF is technically satisfied (as there are no partial dependencies on a single-column key). However, the structure feels wrong, leading us to 3NF.
    - 3NF: Are there transitive dependencies? assignment_id -> project_name, project_location (Project details depend on the assignment? Unlikely. Seems project details are repeated). assignment_id -> employee_id -> employee_name, employee_email, department_id -> department_name. These look like transitive dependencies! Employee details depend on employee_id, not directly on the assignment_id. Department name depends on department_id.
  - Anomalies:
    - Update: If Alice Smith changes her email, you have to update multiple rows (rows 1 and 3). If the 'Technology' department name changes, you have to update multiple rows. If the 'Alpha Launch' project location changes (unlikely, but possible), multiple rows need updating.
    - Insertion: Cannot add a new employee unless they are assigned to a project. Cannot add a new project unless an employee is assigned. Cannot add a new department unless an employee exists in it.
    - Deletion: If assignment 5 is deleted (Charlie Brown's only work on Alpha Launch), we lose the fact that Charlie Brown was ever associated with Alpha Launch. If assignments 4 and 5 are deleted, we lose information about Charlie Brown entirely (if he's not elsewhere) and potentially the 'Marketing' department if no other employees are listed.
- Normalize the Schema to 3NF:
- Design a set of new tables that adhere to 3NF.
- Define appropriate primary keys and foreign keys to link the tables.
- Think: We need separate tables for entities like Projects, Employees, and Departments, and a table to link them (Assignments).
Departments(department_id PK, department_name)
Employees(employee_id PK, employee_name, employee_email, department_id FK)
Projects(project_id PK, project_name, project_location) - Need a new PK for projects.
Assignments(assignment_id PK, project_id FK, employee_id FK, assignment_date, hours_billed)
- Implementation (SQL):
-- 1. Departments Table
CREATE TABLE departments (
    department_id VARCHAR(10) PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL UNIQUE -- Department names should probably be unique
);

-- 2. Employees Table
CREATE TABLE employees (
    employee_id VARCHAR(10) PRIMARY KEY,
    employee_name VARCHAR(100) NOT NULL,
    employee_email VARCHAR(100) UNIQUE, -- Emails should be unique
    department_id VARCHAR(10) REFERENCES departments(department_id) -- Foreign Key
);

-- 3. Projects Table
-- Let's create a surrogate key for projects
CREATE TABLE projects (
    project_id SERIAL PRIMARY KEY, -- Use SERIAL for auto-incrementing integer PK
    project_name VARCHAR(100) NOT NULL,
    project_location VARCHAR(50),
    UNIQUE (project_name, project_location) -- Prevent duplicate project definitions
);

-- 4. Assignments Table (Linking Table)
CREATE TABLE assignments (
    assignment_id SERIAL PRIMARY KEY, -- Keep a unique ID for each specific assignment instance
    project_id INTEGER REFERENCES projects(project_id) ON DELETE CASCADE, -- If project deleted, remove assignments
    employee_id VARCHAR(10) REFERENCES employees(employee_id) ON DELETE RESTRICT, -- Prevent deleting employee if assigned
    assignment_date DATE NOT NULL,
    hours_billed NUMERIC(5, 2) CHECK (hours_billed >= 0)
);
- Migrate Data (Conceptual / Optional Manual Insertion):
  - Extract distinct data from the project_assignments_bad table and insert it into the new normalized tables. This requires careful INSERT INTO ... SELECT DISTINCT ... statements or manual insertion.
  - Example Inserts (Manual):

    -- Populate Departments
    INSERT INTO departments (department_id, department_name) VALUES
    ('D01', 'Technology'),
    ('D02', 'Marketing')
    ON CONFLICT (department_id) DO NOTHING; -- Avoid errors if run multiple times

    -- Populate Employees
    INSERT INTO employees (employee_id, employee_name, employee_email, department_id) VALUES
    ('E101', 'Alice Smith', 'alice.s@example.com', 'D01'),
    ('E102', 'Bob Johnson', 'bob.j@example.com', 'D01'),
    ('E205', 'Charlie Brown', 'charlie.b@example.com', 'D02')
    ON CONFLICT (employee_id) DO NOTHING;

    -- Populate Projects (Need to map names to new project_id)
    INSERT INTO projects (project_name, project_location) VALUES
    ('Alpha Launch', 'New York'),
    ('Beta Test', 'London'),
    ('Gamma Initiative', 'New York')
    ON CONFLICT (project_name, project_location) DO NOTHING;

    -- Populate Assignments (Requires looking up the new project_ids)
    -- Find project IDs first:
    -- SELECT project_id, project_name FROM projects; (Assume Alpha=1, Beta=2, Gamma=3)
    INSERT INTO assignments (project_id, employee_id, assignment_date, hours_billed) VALUES
    (1, 'E101', '2023-10-01', 40.0),
    (1, 'E102', '2023-10-01', 35.5),
    (2, 'E101', '2023-11-15', 50.0),
    (3, 'E205', '2023-10-05', 60.0),
    (1, 'E205', '2023-11-01', 25.0);
- Define Necessary Indexes:
  - Based on the normalized schema, identify columns that would benefit from indexing, assuming common query patterns.
  - Common Queries Might Be:
    - Find all assignments for a specific employee.
    - Find all employees working on a specific project.
    - Find all assignments within a date range.
    - Find employee details by employee ID (already covered by PK).
    - Find project details by project ID (already covered by PK).
    - Find employees in a specific department.
  - Think:
    - Primary Keys (departments.department_id, employees.employee_id, projects.project_id, assignments.assignment_id) are automatically indexed.
    - Foreign Keys are excellent candidates for indexing to speed up joins.
    - Other frequently filtered or sorted columns.
  - Implementation (SQL):

    -- Index on Foreign Keys in 'employees' table
    CREATE INDEX idx_employees_department_id ON employees(department_id);

    -- Indexes on Foreign Keys in 'assignments' table
    CREATE INDEX idx_assignments_project_id ON assignments(project_id);
    CREATE INDEX idx_assignments_employee_id ON assignments(employee_id);

    -- Index for searching assignments by date (if common)
    CREATE INDEX idx_assignments_assignment_date ON assignments(assignment_date);

    -- Maybe an index on employee name if searching by name is frequent
    -- Be cautious if names are not very unique or if table is large and writes frequent
    -- CREATE INDEX idx_employees_employee_name ON employees(employee_name);
    -- PostgreSQL supports trigram indexes for faster LIKE/ILIKE searches on text:
    -- CREATE EXTENSION IF NOT EXISTS pg_trgm;
    -- CREATE INDEX idx_employees_employee_name_trgm ON employees USING gin (employee_name gin_trgm_ops);

    -- Maybe an index on project name if searching by name is frequent
    -- CREATE INDEX idx_projects_project_name ON projects(project_name);
- Verify the Design:
  - Mentally run through the anomaly scenarios identified in step 1. How does the normalized design prevent them?
    - Update: Change Alice's email in the employees table once. Change the 'Technology' department name in departments once.
    - Insertion: Can add a new employee via the employees table without an assignment. Can add a new project via the projects table. Can add a new department via departments.
    - Deletion: Deleting an assignment from assignments doesn't delete employee or project info. Deleting an employee might be restricted if they have assignments (due to ON DELETE RESTRICT) or could cascade delete assignments if ON DELETE CASCADE was used on the FK in assignments. Deleting a project could cascade delete related assignments. The core entity data remains intact unless explicitly deleted from its own table.
Cleanup (Optional):
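The cleanup commands are not listed here; mirroring the other workshops, they would look like:

-- Disconnect from the database first (\q in psql), then connect to another database like 'postgres'
DROP DATABASE design_workshop_db;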
This workshop demonstrated the practical process of identifying design flaws in a flat table structure, applying normalization principles to create a robust 3NF schema, and adding strategic indexes to support anticipated query workloads. This foundation is crucial for building maintainable and efficient database systems.
3. Transaction Management and Concurrency Control
In real-world applications, multiple users or processes often need to access and modify data in the database simultaneously. Concurrency Control mechanisms are essential to ensure that these concurrent operations do not interfere with each other in ways that compromise data integrity. Transaction Management provides the framework for grouping database operations together such that they are treated as a single, indivisible unit of work.
Understanding ACID Properties
A transaction is a sequence of database operations (e.g., one or more SQL statements like SELECT, INSERT, UPDATE, DELETE) performed as a single logical unit of work. To ensure data integrity, especially in concurrent environments, database systems guarantee ACID properties for transactions:
- Atomicity: Ensures that all operations within a transaction are completed successfully as a single unit. If any part of the transaction fails, the entire transaction is rolled back, and the database is left unchanged, as if the transaction never started. It's an "all or nothing" principle.
- Analogy: Transferring money between bank accounts. You debit one account and credit another. Both actions must succeed; if the credit fails after the debit, the debit must be undone (rolled back).
- Consistency: Ensures that a transaction brings the database from one valid state to another valid state. It preserves database invariants (e.g., constraints, triggers, cascades). If a transaction violates database consistency rules (like a unique constraint or a check constraint), it will be rolled back.
- Analogy: In the bank transfer, consistency ensures the total amount of money across both accounts remains the same before and after the transaction (ignoring fees for simplicity). It also ensures account balances don't violate rules like dropping below a minimum required balance if such a constraint exists.
- Isolation: Ensures that concurrent transactions operate independently of one another. The effect of executing multiple transactions concurrently should be the same as if they were executed sequentially (one after another). This prevents intermediate, uncommitted states of one transaction from being visible to other transactions.
- Analogy: Two people trying to book the last seat on a flight simultaneously. Isolation ensures that only one person successfully books the seat, and the other person sees the seat as unavailable, preventing a double-booking. The level of isolation can be configured (see Isolation Levels).
- Durability: Ensures that once a transaction has been successfully committed, its changes are permanent and will survive any subsequent system failures (e.g., power outages, crashes). Committed data is typically written to non-volatile storage (like disk) before the commit is acknowledged.
- Analogy: Once the bank confirms your deposit transaction is complete, the money is safely in your account, even if the bank's computer system crashes immediately afterward. The record of the deposit persists.
Databases use mechanisms like logging (Write-Ahead Logging - WAL), locking, and multi-version concurrency control (MVCC) to implement these ACID properties.
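As a minimal sketch of what a transaction looks like in SQL (using an accounts table like the one created in the workshop below), the bank-transfer example is:

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A101'; -- debit
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B202'; -- credit
COMMIT; -- both changes become permanent together; ROLLBACK instead would undo both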
Isolation Levels Explained
While perfect isolation (as if transactions run sequentially) is the ideal, it can impose significant performance overhead due to locking or versioning mechanisms. Therefore, SQL defines standard isolation levels that allow developers to make trade-offs between consistency and performance by relaxing the isolation rules slightly. Lower isolation levels generally offer better performance but increase the risk of concurrency phenomena:
- Concurrency Phenomena:
- Dirty Read: Transaction T1 reads data that has been modified by another transaction T2, but T2 has not yet committed. If T2 rolls back, T1 has read data that never officially existed.
- Non-Repeatable Read: Transaction T1 reads a row. Transaction T2 then modifies or deletes that row and commits. If T1 re-reads the same row, it gets different data or finds the row is gone.
- Phantom Read: Transaction T1 executes a query with a WHERE clause that retrieves a set of rows. Transaction T2 then inserts a new row that satisfies T1's WHERE clause and commits. If T1 re-executes its query, it sees the "phantom" newly inserted row.
- SQL Standard Isolation Levels (from lowest to highest):
- Read Uncommitted:
- Allows Dirty Reads, Non-Repeatable Reads, and Phantom Reads.
- Offers the highest concurrency but the lowest level of data consistency. Transactions can see uncommitted changes made by other transactions.
- Usefulness: Rarely used in practice for general applications due to inconsistency risks. Might be acceptable for approximate counts or monitoring where absolute precision isn't critical. (Note: PostgreSQL implements this level the same as Read Committed).
- Read Committed:
- Prevents Dirty Reads. Allows Non-Repeatable Reads and Phantom Reads.
- Guarantees that a transaction only reads data that has been committed. However, within the same transaction, two identical queries might return different results if another committed transaction modified the data in between.
- Implementation Detail (Common): Uses short-term locks or MVCC. Each statement sees a snapshot of the database as it existed when the statement began.
- Default Level: Often the default isolation level in many databases (including PostgreSQL, Oracle, SQL Server). Good balance for many web applications.
- Repeatable Read:
- Prevents Dirty Reads and Non-Repeatable Reads. Allows Phantom Reads (in some implementations, though PostgreSQL's implementation prevents phantoms too).
- Guarantees that if a transaction reads a row multiple times, it will see the same data for that row each time. Other transactions cannot modify data being read by the current transaction until it commits or rolls back. However, other transactions can insert new rows that might match the query criteria (phantoms).
- Implementation Detail (Common): Uses longer-term read locks or MVCC. The transaction sees a snapshot of the database as it existed when the transaction began.
- Usefulness: Needed when a transaction requires a consistent view of specific rows throughout its duration, e.g., performing calculations based on data read at the start of the transaction. Might result in serialization errors if concurrent updates conflict.
- Serializable:
- Prevents Dirty Reads, Non-Repeatable Reads, and Phantom Reads.
- Provides the highest level of isolation. Guarantees that the effect of concurrently executing transactions is identical to executing them serially in some order.
- Implementation Detail (Common): Uses extensive locking (e.g., range locks) or advanced MVCC techniques like Serializable Snapshot Isolation (SSI) in PostgreSQL.
- Usefulness: Required for applications with complex transactions where absolute consistency is paramount and any concurrency anomaly is unacceptable. Can significantly reduce concurrency and may lead to more transaction failures (serialization errors) that need to be retried by the application.
- Read Uncommitted:
- Setting Isolation Level: Typically done per-transaction using a command like:
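For example, in PostgreSQL (a sketch; the setting applies only to the current transaction):

BEGIN;
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
-- ... statements in this transaction now run at REPEATABLE READ ...
COMMIT;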
Choosing the appropriate isolation level is a critical design decision based on the application's tolerance for concurrency phenomena versus its need for performance and throughput.
Locking Mechanisms
Databases use locks to manage concurrent access to data resources (like tables, rows, or pages) and enforce isolation levels. When a transaction needs to access a resource in a way that could conflict with other transactions, it acquires a lock on that resource.
- Lock Granularity: Locks can be acquired at different levels:
- Table Locks: Lock the entire table. Simple but offers very low concurrency.
- Page Locks: Lock a block of disk storage containing multiple rows. A compromise between table and row locks.
- Row Locks: Lock individual rows. Offers the highest concurrency but can have higher management overhead. Most modern RDBMS primarily use row-level locking for user data.
- Key-Range Locks (Used for Serializable): Lock ranges of values within an index to prevent phantom reads.
- Lock Modes: Indicate the type of access allowed:
  - Shared Locks (S-Locks / Read Locks): Multiple transactions can hold a shared lock on the same resource simultaneously. Allows concurrent reading but prevents modification by transactions holding exclusive locks. Typically acquired for SELECT statements (depending on isolation level).
  - Exclusive Locks (X-Locks / Write Locks): Only one transaction can hold an exclusive lock on a resource at a time. Prevents any other transaction (read or write) from acquiring any lock on the resource. Typically acquired for INSERT, UPDATE, DELETE operations.
  - Update Locks (U-Locks): Some systems use these as an intermediate lock during updates to prevent certain types of deadlocks. An update lock indicates an intent to update and is compatible with shared locks but not other update or exclusive locks. It is later promoted to an exclusive lock when the actual modification occurs.
- Locking Phenomena:
- Blocking: When a transaction tries to acquire a lock on a resource that is already held by another incompatible transaction, it must wait (is blocked) until the first transaction releases the lock (by committing or rolling back).
- Deadlock: Occurs when two (or more) transactions are waiting for each other to release locks. For example, T1 locks resource A and waits for resource B, while T2 locks resource B and waits for resource A. Neither transaction can proceed. Databases have deadlock detection mechanisms and typically resolve them by aborting one of the transactions (forcing it to roll back) and allowing the other to proceed. The aborted transaction usually needs to be retried by the application.
- Pessimistic vs. Optimistic Concurrency Control:
- Pessimistic Concurrency Control (Locking): Assumes conflicts are likely and uses locks to prevent them from happening before accessing data. This is the traditional approach described above.
- Optimistic Concurrency Control (e.g., using Versioning/MVCC): Assumes conflicts are rare. Transactions read data without acquiring strong locks initially. When a transaction tries to commit, the system checks if the data it read has been modified by another committed transaction since it was read. If a conflict is detected, the committing transaction is rolled back and must be retried. Multi-Version Concurrency Control (MVCC), used by PostgreSQL and Oracle, is a common implementation. MVCC maintains multiple versions of rows, allowing readers not to block writers and vice versa, improving concurrency, especially for read-heavy workloads. Readers typically see a consistent snapshot of the data. Conflicts (write-write conflicts) are detected at commit time for higher isolation levels like Repeatable Read or Serializable.
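A common application-level variant of optimistic concurrency uses a version column. The following is a sketch only; the documents table and its version column are hypothetical:

-- Read the row and remember its version
SELECT doc_id, content, version FROM documents WHERE doc_id = 42;

-- Later, update only if nobody else changed the row in the meantime
UPDATE documents
SET content = 'new text', version = version + 1
WHERE doc_id = 42 AND version = 7; -- 7 = the version read earlier

-- If this UPDATE affects 0 rows, another transaction won the race: re-read and retry.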
Understanding transactions, ACID properties, isolation levels, and locking/MVCC is crucial for developing reliable applications that handle concurrent data access correctly.
Workshop Transaction Management and Concurrency Control
Goal: Observe the effects of different isolation levels and understand basic transaction handling (COMMIT, ROLLBACK) and potential deadlocks.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the psql command-line client. You will need two separate terminal windows connected to the same database simultaneously to simulate concurrency.
- A database created for this workshop (e.g., concurrency_db).
Setup:
- Terminal 1 & 2: Connect to your PostgreSQL instance:
psql -U your_username -d postgres
- Terminal 1: Create the database:
CREATE DATABASE concurrency_db;
- Terminal 1 & 2: Connect to the new database:
\c concurrency_db
- Terminal 1: Create a simple bank accounts table:
CREATE TABLE accounts (
    account_id VARCHAR(10) PRIMARY KEY,
    account_holder VARCHAR(100) NOT NULL,
    balance NUMERIC(10, 2) CHECK (balance >= 0) -- Ensure non-negative balance
);

INSERT INTO accounts (account_id, account_holder, balance) VALUES
('A101', 'Alice', 1000.00),
('B202', 'Bob', 500.00);

-- Verify initial state in both terminals
SELECT * FROM accounts;
Tasks:
- Atomicity (COMMIT and ROLLBACK): Simulate a successful and a failed bank transfer.
  - Terminal 1:

    BEGIN; -- Start transaction
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A101';
    UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B202';
    -- Check intermediate state (only visible within this transaction)
    SELECT * FROM accounts;
    COMMIT; -- Make changes permanent
    -- Check final state (now visible everywhere)
    SELECT * FROM accounts;

  - Terminal 2: (e.g., run SELECT * FROM accounts; while Terminal 1's transaction is still open and again after the COMMIT – the changes only become visible here once committed.)
  - Terminal 1 (Simulate Failure):

    BEGIN;
    UPDATE accounts SET balance = balance - 50 WHERE account_id = 'B202';
    -- Oh no! Simulate an error or decision to cancel before updating Alice
    SELECT * FROM accounts; -- See intermediate change for Bob
    ROLLBACK; -- Abort the transaction
    -- Check state after rollback
    SELECT * FROM accounts; -- Should show balances reverted to state before BEGIN

  - Observe: COMMIT makes changes durable. ROLLBACK undoes all changes within the transaction, demonstrating atomicity.
- Isolation Level - Read Committed (Default): Observe Non-Repeatable Read.
- Terminal 1:
- Terminal 2:
- Terminal 1:
- Observe: Terminal 1's second SELECT reads a different value than the first SELECT within the same transaction. This is a Non-Repeatable Read, allowed by Read Committed. Each statement sees committed data as of the time the statement starts. (A possible command sequence is sketched below.)
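The individual terminal steps are not spelled out above; one possible sequence, a sketch assuming the accounts table from Setup with Bob's balance starting at 500.00, is:

-- Terminal 1
BEGIN; -- Read Committed is the default level
SELECT balance FROM accounts WHERE account_id = 'B202'; -- shows 500.00

-- Terminal 2
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'B202';
COMMIT;

-- Terminal 1 (same transaction as before)
SELECT balance FROM accounts WHERE account_id = 'B202'; -- now shows 400.00
COMMIT;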
- Isolation Level - Repeatable Read: Prevent Non-Repeatable Read.
- Reset Balances (Terminal 1):
- Terminal 1:
- Terminal 2:

    BEGIN;
    -- Try to update Bob's balance. This might wait (block) or succeed depending on timing/MVCC.
    -- Let's assume it succeeds and commits.
    UPDATE accounts SET balance = balance - 100 WHERE account_id = 'B202';
    COMMIT;
    -- Verify Bob's balance from T2's perspective
    SELECT balance FROM accounts WHERE account_id = 'B202'; -- Shows 400.00
- Terminal 1:
- Observe: Terminal 1's second SELECT should still show the same balance (500.00) as the first SELECT, even though Terminal 2 committed a change. Repeatable Read ensures that data read within a transaction remains consistent for the duration of that transaction based on the snapshot taken when the transaction began. (The Terminal 1 steps are sketched below.)
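The steps that are not shown above might look like this sketch (resetting Bob's balance to 500.00 first):

-- Reset Balances (Terminal 1)
UPDATE accounts SET balance = 1000.00 WHERE account_id = 'A101';
UPDATE accounts SET balance = 500.00  WHERE account_id = 'B202';

-- Terminal 1
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE account_id = 'B202'; -- shows 500.00

-- (now run Terminal 2's UPDATE and COMMIT as shown above)

-- Terminal 1 (same transaction)
SELECT balance FROM accounts WHERE account_id = 'B202'; -- still shows 500.00
COMMIT;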
- Isolation Level - Repeatable Read (Potential Write Conflict):
- Reset Balances (Terminal 1):
- Terminal 1:
- Terminal 2:
- Terminal 1:

    -- Now, Alice tries to update Bob's balance based on the old value read (500)
    -- E.g., calculate interest based on the 500 value.
    -- Try to update Bob's balance (conflicts with T2's update)
    UPDATE accounts SET balance = balance + 50 WHERE account_id = 'B202'; -- Tries to make it 550
    -- This command will likely FAIL with a serialization error!
    -- psql might output: ERROR: could not serialize access due to concurrent update
    -- You MUST issue a ROLLBACK after such an error.
    ROLLBACK;
- Observe: Because T1's view of the database was fixed when it started, and T2 modified the row T1 intended to update, T1's update would violate the Repeatable Read guarantee. The database detects this conflict (a write-write conflict based on T1's read snapshot) and forces T1 to roll back. The application would typically need to retry T1.
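The earlier steps of this task are not shown either; a possible sequence (a sketch) is:

-- Reset Balances (Terminal 1)
UPDATE accounts SET balance = 1000.00 WHERE account_id = 'A101';
UPDATE accounts SET balance = 500.00  WHERE account_id = 'B202';

-- Terminal 1: start a REPEATABLE READ transaction and read Bob's balance
BEGIN TRANSACTION ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE account_id = 'B202'; -- shows 500.00

-- Terminal 2: modify the same row and commit
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'B202';
COMMIT;

-- Terminal 1 then attempts the conflicting UPDATE shown above and receives the serialization error.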
- Deadlock Simulation:
- Reset Balances (Terminal 1):
- Terminal 1:

    BEGIN;
    -- Lock Alice's account by updating it
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'A101'; -- Acquire lock on A101
    -- Wait for a moment to let Terminal 2 proceed
    \prompt 'Press Enter after starting T2 update...' dummy -- Pauses script (\prompt needs a variable name)
    -- Try to lock Bob's account
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'B202'; -- Waits for T2 to release lock on B202
    -- If deadlock occurs, this transaction might be chosen as the victim
    COMMIT; -- Or ROLLBACK if deadlock error occurs
- Terminal 2:

    BEGIN;
    -- Lock Bob's account by updating it
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'B202'; -- Acquire lock on B202
    -- Wait for a moment (less than T1 wait if possible)
    -- Try to lock Alice's account
    UPDATE accounts SET balance = balance + 0 WHERE account_id = 'A101'; -- Waits for T1 to release lock on A101
    -- Deadlock situation: T1 holds A101 waits for B202, T2 holds B202 waits for A101
    COMMIT; -- Or ROLLBACK if deadlock error occurs
- Observe: After both terminals attempt the second update, PostgreSQL's deadlock detector should intervene within a short time (usually a second or two). One of the terminals will receive an error message similar to ERROR: deadlock detected. The message will detail the locks involved. The transaction that received the error has been automatically rolled back. The other transaction can then proceed and commit successfully (if it wasn't trying to do anything else). Applications need to be prepared to catch deadlock errors and retry the failed transaction.
Cleanup (Optional):
-- In one terminal:
DROP TABLE accounts;
-- Disconnect (\q) then connect to postgres db
DROP DATABASE concurrency_db;
This workshop provides a tangible feel for how transactions isolate operations, how different isolation levels affect visibility and consistency, and how databases handle conflicts like serialization failures and deadlocks. Writing robust concurrent applications requires understanding these behaviors.
4. Database Performance Tuning Basics
As datasets grow and query complexity increases, database performance can degrade. Performance tuning involves identifying bottlenecks and optimizing the database structure and queries to improve response times and resource utilization.
Analyzing Query Performance with EXPLAIN
The most crucial tool for understanding how a database executes a specific SQL query is the EXPLAIN command (or its variants like EXPLAIN ANALYZE). It shows the query execution plan chosen by the database optimizer. The optimizer's goal is to find the most efficient way to execute a query, considering available indexes, table statistics, and possible execution strategies.
-
Usage:
EXPLAIN <Your SQL Query>;
- Shows the estimated plan without actually running the query. Useful for quick checks.
EXPLAIN ANALYZE <Your SQL Query>;
(Recommended for deeper analysis)- Actually executes the query, measures real timings and row counts, and then displays the plan along with the actual execution statistics. Provides much more accurate insights but incurs the cost of running the query. Caution: Use
EXPLAIN ANALYZE
with care onUPDATE
,DELETE
,INSERT
statements in production, as it will execute the modification. You might wrap it in a transaction and roll back:BEGIN; EXPLAIN ANALYZE UPDATE ...; ROLLBACK;
- Actually executes the query, measures real timings and row counts, and then displays the plan along with the actual execution statistics. Provides much more accurate insights but incurs the cost of running the query. Caution: Use
-
Interpreting the Output: The output is typically a tree structure representing the steps the database takes. Key elements to look for:
- Scan Methods: How the database reads data from tables:
- Sequential Scan (Seq Scan): Reads the entire table block by block. Often indicates a missing index or that the query retrieves a large portion of the table, making an index less efficient. Can be slow on large tables.
- Index Scan: Uses an index to locate specific rows. Much faster than a Seq Scan if the index is selective (narrows down the search significantly). Can involve reading both the index and the table data (unless it's an Index-Only Scan).
- Index-Only Scan: Uses an index that contains all the data needed for the query (including `WHERE` clause and `SELECT` list columns). Avoids accessing the main table data ("heap fetches"), making it very fast. Requires the index to cover all necessary columns.
- Bitmap Heap Scan: A two-phase scan. First, an Index Scan (or multiple index scans) gathers the locations (TIDs) of relevant rows into a bitmap in memory. Second, it fetches the actual rows from the table ("heap") in physical disk order, which can be more efficient than random access for larger result sets. Often seen when combining results from multiple indexes or when an index scan returns many rows.
- Join Methods: How the database combines data from multiple tables:
- Nested Loop Join: Iterates through rows of the outer table. For each outer row, it scans the inner table (potentially using an index) to find matching rows. Efficient if the outer table is small and there's a good index on the join column of the inner table. Can be very slow if both tables are large and indexes are poor.
- Hash Join: Builds an in-memory hash table on the smaller of the two tables (based on the join key). Then, scans the larger table; for each row, it probes the hash table to find matches. Efficient for large tables when equi-joins are involved and enough memory is available. Does not require indexes on join keys but benefits from them for the initial table scans.
- Merge Join: Requires both input tables to be sorted on the join key. It then reads through both sorted tables simultaneously, merging the matching rows. Efficient if the inputs are already sorted (e.g., due to an index scan or a previous `ORDER BY`) or if the datasets are very large and don't fit in memory for a hash join. Often requires explicit sort steps if data isn't pre-sorted, which can be costly.
- Costs (Estimates in basic `EXPLAIN`):
  - `cost=startup_cost..total_cost`: The optimizer's estimate of the work required. The first number is the cost before the first row can be returned; the second is the total cost to return all rows. Units are arbitrary but useful for comparing different plans for the same query. Lower cost is generally better.
  - `rows`: Estimated number of rows output by this step.
  - `width`: Estimated average width (in bytes) of the rows output by this step.
- Actual Timings and Rows (`EXPLAIN ANALYZE`):
  - `actual time=startup_time..total_time`: Actual time (in milliseconds) spent. `startup_time` is the time to get the first row; `total_time` is the time to get all rows from this node and its children. This is the most valuable metric for identifying slow parts of the query.
  - `rows`: Actual number of rows returned by the node. Comparing this to the estimated `rows` can reveal if the database statistics are inaccurate, leading the optimizer to choose a suboptimal plan.
  - `loops`: Number of times this node was executed.
-
Analysis Process:
- Run `EXPLAIN ANALYZE` on your slow query.
- Start reading the plan from the innermost, most indented operations upwards/outwards.
- Look for high `actual time` values to identify the slowest steps.
- Examine scan methods: Are there Seq Scans on large tables for queries that should be selective? This suggests a missing or unused index.
- Examine join methods: Is an inefficient Nested Loop being used on large tables? Maybe a Hash Join or Merge Join would be better (which might require different indexing or more memory).
- Compare estimated `rows` vs. actual `rows`. Large discrepancies suggest outdated statistics. Run `ANALYZE <table_name>;` or `VACUUM ANALYZE <table_name>;` to update them.
- Look for explicit `Sort` operations, especially costly ones. Can they be avoided with an appropriate index that provides the required order?
- Check `Filter` conditions applied after a scan. If a Seq Scan is followed by a very selective filter, an index on the filter column could drastically improve performance.
Using `EXPLAIN ANALYZE` is an iterative process: analyze the plan, hypothesize an optimization (e.g., add an index, rewrite the query, update statistics), apply the change, and run `EXPLAIN ANALYZE` again to see if the plan and performance improved.
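As a quick illustration of that loop (the table and column names here are purely illustrative, not from the workshops), a read-only query can be analyzed directly, while a data-modifying statement can be analyzed inside a transaction that is rolled back:
```sql
-- Safe to run as-is: SELECT does not modify data
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;

-- For DML, wrap the analysis in a transaction and roll it back
BEGIN;
EXPLAIN ANALYZE UPDATE orders SET status = 'archived' WHERE order_date < '2020-01-01';
ROLLBACK;
```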
Index Tuning Techniques
Based on the analysis of query plans using `EXPLAIN`, you can apply various index tuning techniques:
-
Adding Missing Indexes:
- If `EXPLAIN ANALYZE` shows a costly Seq Scan on a large table where the `WHERE` clause is selective (i.e., filters out most rows), an index on the column(s) in the `WHERE` clause is likely needed.
- If `EXPLAIN ANALYZE` shows a costly Nested Loop join where the inner table is scanned repeatedly, an index on the join column of the inner table is usually beneficial. Foreign key columns are prime candidates.
- If `EXPLAIN ANALYZE` shows a costly `Sort` operation because of an `ORDER BY` or `GROUP BY` clause, an index on the relevant column(s) might allow the database to retrieve rows in the desired order directly from the index, avoiding the sort step.
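For instance (hypothetical table and column names, not part of the workshop schema), each of the three cases above can be addressed with a plain `CREATE INDEX`:
```sql
-- Selective WHERE clause
CREATE INDEX idx_orders_status ON orders (status);
-- Join / foreign key column on the inner table of a Nested Loop
CREATE INDEX idx_order_items_order_id ON order_items (order_id);
-- Column used in ORDER BY, to avoid an explicit Sort step
CREATE INDEX idx_orders_created_at ON orders (created_at);
```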
-
Creating Composite Indexes:
- If queries frequently filter on multiple columns (e.g., `WHERE colA = ? AND colB = ?`), a composite index on `(colA, colB)` can be much more efficient than separate indexes on `colA` and `colB`.
- The order of columns in a composite index matters. Place columns with higher selectivity (more unique values) or columns used with equality predicates (`=`) earlier in the index definition. For an index on `(colA, colB)`, queries filtering only on `colA` can use the index effectively. Queries filtering only on `colB` generally cannot.
- Consider `ORDER BY` clauses as well. An index on `(colA, colB)` can satisfy `WHERE colA = ? ORDER BY colB`.
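A concrete sketch of these rules, again with illustrative names:
```sql
CREATE INDEX idx_orders_cust_date ON orders (customer_id, order_date);

-- Can use the index:
--   WHERE customer_id = 42 AND order_date = '2024-05-01'
--   WHERE customer_id = 42
--   WHERE customer_id = 42 ORDER BY order_date
-- Generally cannot use it efficiently:
--   WHERE order_date = '2024-05-01'   (non-leading column on its own)
```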
-
Creating Covering Indexes:
- If a query retrieves columns that are all present within an existing index (including `WHERE`, `SELECT`, `JOIN` clauses), the database might perform an Index-Only Scan.
- You can sometimes create indexes specifically to "cover" all columns needed by a frequent, critical query to enable Index-Only Scans. This involves including columns from the `SELECT` list in the index definition (e.g., using the `INCLUDE` clause in PostgreSQL's `CREATE INDEX` if the columns are not needed for filtering/sorting but just for retrieval). This trades write performance and storage for faster read performance for that specific query.
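A sketch using PostgreSQL's `INCLUDE` clause (available since PostgreSQL 11; names are illustrative):
```sql
CREATE INDEX idx_orders_cust_date_covering
    ON orders (customer_id, order_date)
    INCLUDE (total_amount);

-- This query can potentially be answered by an Index-Only Scan,
-- because every referenced column lives in the index:
SELECT order_date, total_amount FROM orders WHERE customer_id = 42;
```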
-
Using Expression Indexes:
- If queries frequently filter based on an expression or function applied to a column (e.g., `WHERE LOWER(email) = ?` or `WHERE date_trunc('month', created_at) = ?`), a regular index on the column (`email` or `created_at`) won't be used directly.
- Create an index on the expression itself, as sketched below.
- The query must use the exact same expression for the index to be considered.
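A minimal expression-index sketch (the `users` table is illustrative):
```sql
CREATE INDEX idx_users_lower_email ON users (LOWER(email));

-- Will be considered by the planner (same expression as the index):
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
-- Will NOT benefit from this index (different expression):
SELECT * FROM users WHERE email = 'alice@example.com';
```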
-
Using Partial Indexes:
- If you frequently query only a specific subset of rows in a large table (e.g., `WHERE status = 'active'`), creating a full index might be inefficient if 'active' rows are a small percentage.
- A partial index includes only the rows matching the condition, as sketched below.
- This index will be smaller and faster to scan for queries that include `WHERE status = 'active'`.
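For example (illustrative names):
```sql
CREATE INDEX idx_orders_active_date
    ON orders (order_date)
    WHERE status = 'active';

-- Eligible to use the partial index (repeats the index condition):
SELECT * FROM orders WHERE status = 'active' AND order_date >= '2024-01-01';
```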
-
Removing Unused or Redundant Indexes:
- Indexes consume storage space and slow down `INSERT`, `UPDATE`, and `DELETE` operations. Regularly identify and remove indexes that are never used by queries. Database systems often provide tools or views to track index usage statistics (e.g., `pg_stat_user_indexes` in PostgreSQL).
- Remove redundant indexes. For example, if you have an index on `(colA, colB)` and another index only on `(colA)`, the second index is usually redundant because the composite index can also serve queries filtering only on `colA`. (There can be exceptions related to index size or specific query patterns, but often it's redundant).
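One way to spot never-used indexes in PostgreSQL is to query the statistics view mentioned above:
```sql
SELECT schemaname, relname AS table_name, indexrelname AS index_name, idx_scan
FROM pg_stat_user_indexes
ORDER BY idx_scan ASC;  -- idx_scan = 0 means the index has not been used since stats were last reset
```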
-
Maintaining Table Statistics:
- Accurate statistics about table data distribution are crucial for the query optimizer to choose the best plan. Outdated statistics can lead to poor choices like using a Seq Scan when an Index Scan would be better, or vice-versa.
- Regularly run `ANALYZE <table_name>;` or `VACUUM ANALYZE <table_name>;` (especially after large data modifications) to keep statistics up-to-date. Most databases also have an auto-vacuum/auto-analyze daemon that handles this automatically to some extent, but manual analysis can be necessary.
Performance tuning is an ongoing process. Monitor your database performance, analyze slow queries, apply appropriate indexing and other optimization techniques, and measure the impact of your changes.
Workshop Database Performance Tuning Basics
Goal: Use `EXPLAIN ANALYZE` to identify a performance bottleneck and improve query speed by adding an appropriate index.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the `psql` command-line client.
- A database created for this workshop (e.g., `tuning_db`).
Setup:
- Terminal: Connect to PostgreSQL:
psql -U your_username -d postgres
- Terminal: Create the database:
CREATE DATABASE tuning_db;
- Terminal: Connect to the new database:
\c tuning_db
-
Terminal: Create a table and populate it with a significant amount of data. We'll use `generate_series` to create many rows quickly.
```sql
CREATE TABLE products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50),
    launch_date DATE,
    price NUMERIC(10, 2),
    -- Add a column with less distinct values for demonstration
    status VARCHAR(20) DEFAULT 'active'
);

-- Insert a large number of rows (e.g., 1 million)
-- This might take a minute or two
INSERT INTO products (product_name, category, launch_date, price, status)
SELECT
    'Product ' || g,
    'Category ' || (g % 100),                            -- 100 distinct categories
    '2020-01-01'::date + (g % 1000) * interval '1 day',  -- ~3 years of dates
    (random() * 500 + 10)::numeric(10,2),                -- Random price between 10 and 510
    CASE (g % 10)                                        -- Cycle through statuses
        WHEN 0 THEN 'discontinued'
        WHEN 1 THEN 'beta'
        ELSE 'active'                                    -- 80% are active
    END
FROM generate_series(1, 1000000) AS g;

-- Ensure statistics are up-to-date after the large insert
ANALYZE products;
```
Tasks:
-
Analyze a Slow Query without Index:
- Let's query for products in a specific category launched on a particular date. This combination should be relatively selective.
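The exact query isn't critical; any category/date combination that exists in the generated data will do. For example (the specific values are illustrative):
```sql
EXPLAIN ANALYZE
SELECT *
FROM products
WHERE category = 'Category 50'
  AND launch_date = '2021-03-15';
```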
- Run
EXPLAIN ANALYZE
on the query: - Observe the Output:
- Pay close attention to the Scan Method on the
products
table. It will almost certainly be aSeq Scan
(Sequential Scan). - Note the Execution Time. This is your baseline performance.
- Look at the
Filter:
line associated with the Seq Scan. It shows the conditions being applied (category = ...
andlaunch_date = ...
). - Observe the number of
rows removed by Filter
compared to the total rows scanned (which should be ~1 million). If the filter is highly selective (removes most rows), a Seq Scan is inefficient.
- Pay close attention to the Scan Method on the
-
Hypothesize and Add an Index:
- The query filters on
category
andlaunch_date
. A Seq Scan is slow because the database has to read the entire table and check every row. - Hypothesis: A composite index on
(category, launch_date)
should significantly speed up this query. The database could use the index to directly locate rows matching both criteria. - Create the index:
- The query filters on
-
Analyze the Query WITH Index:
- Run the exact same
EXPLAIN ANALYZE
command again: - Observe the New Output:
- The Scan Method should now be different. You'll likely see an
Index Scan
or possibly aBitmap Heap Scan
usingidx_products_category_launch
. - Compare the Execution Time to the previous run. It should be drastically lower (likely milliseconds instead of seconds).
- If it's an Index Scan, notice the
Index Cond:
line shows the conditions being used directly against the index. - If it's a Bitmap Heap Scan, notice the
Bitmap Index Scan
node using the index, followed by theBitmap Heap Scan
node fetching the rows from the table.
- The Scan Method should now be different. You'll likely see an
- Run the exact same
-
Analyze a Query on a Low Cardinality Column:
- Let's query for all 'discontinued' products. The
status
column has low cardinality (only a few distinct values: 'active', 'beta', 'discontinued'). - Run
EXPLAIN ANALYZE
: - Observe: You will likely still see a
Seq Scan
. Even though we are filtering, 'discontinued' represents about 10% of the table (100,000 rows). The optimizer likely determines that reading 10% of the table via an index (which involves jumping back and forth between the index and the table data) is more expensive than just reading the whole table sequentially.
- Let's query for all 'discontinued' products. The
-
Attempt Indexing Low Cardinality (and see if it's used):
- Create an index on
status
: - Re-run the
EXPLAIN ANALYZE
for the 'discontinued' query: - Observe: The planner might still choose a
Seq Scan
or possibly switch to aBitmap Heap Scan
. Even with the index, accessing 100,000 rows might be deemed faster via a sequential read or bitmap approach than through a standard index scan. This demonstrates that indexes are not always beneficial, especially on low-selectivity columns when retrieving a large fraction of the table.
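The statements for this step are not reproduced above; a straightforward version (the index name is illustrative) would be:
```sql
CREATE INDEX idx_products_status ON products (status);

EXPLAIN ANALYZE
SELECT * FROM products WHERE status = 'discontinued';
```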
- Create an index on
-
Consider a Partial Index:
- Imagine we very frequently query for 'beta' status products, which are only 10% of the data (100k rows). A full index wasn't great. But what if 'beta' was much rarer, say 0.1%? Let's simulate that scenario with a query targeting a specific, less common category.
- Query for a specific category, which is more selective (1% of data). (This might use the composite index if it exists, or Seq Scan if not indexed)
- Now, imagine we only ever care about 'Category 99' products that are also 'active'. Create a partial index:
- Analyze a query matching the partial index criteria:
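Putting the three sub-steps together (the column indexed by the partial index is a guess; the index name matches the one referenced in the observation below):
```sql
-- A more selective query on a single category (~1% of rows)
EXPLAIN ANALYZE
SELECT * FROM products WHERE category = 'Category 99';

-- Partial index covering only the rows we care about
CREATE INDEX idx_products_cat99_active ON products (launch_date)
WHERE category = 'Category 99' AND status = 'active';

-- A query matching the partial index condition
EXPLAIN ANALYZE
SELECT * FROM products
WHERE category = 'Category 99' AND status = 'active';
```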
- Observe: This query should now efficiently use the smaller
idx_products_cat99_active
partial index (likely via an Index Scan or Bitmap Scan on that index) because theWHERE
clause matches the index's condition.
Cleanup (Optional):
-- Drop the table (and implicitly its indexes)
DROP TABLE products;
-- Disconnect (\q) then connect to postgres db
DROP DATABASE tuning_db;
This workshop illustrated the power of `EXPLAIN ANALYZE` in diagnosing performance issues (identifying Seq Scans) and the effectiveness of adding appropriate indexes (like composite indexes) to resolve them. It also showed that indexing isn't a silver bullet, especially for low-selectivity columns, and introduced the concept of partial indexes for optimizing queries on specific data subsets.
5. Introduction to Server Side Database Programming
While SQL is powerful for querying and manipulating data, sometimes you need to encapsulate business logic, enforce complex rules, or perform actions automatically within the database itself. This is achieved using server-side programming features like Stored Procedures, Functions, and Triggers. These are typically written in a procedural language extension of SQL (like PL/pgSQL for PostgreSQL, T-SQL for SQL Server, PL/SQL for Oracle) or sometimes in other languages like Python, Perl, Tcl if supported by the RDBMS.
Stored Procedures and Functions
Stored Procedures and Stored Functions are blocks of code stored directly in the database catalog and executed on the server. They allow you to group multiple SQL statements and procedural logic (loops, conditionals, variables) into reusable units.
-
Key Differences:
- Functions: Primarily designed to compute and return a single value (scalar or composite type) or a set of rows (table function). They are typically callable within SQL queries (e.g., in `SELECT` lists or `WHERE` clauses). In PostgreSQL, functions are the primary mechanism and can perform actions, although procedures are now also supported more distinctly.
- Stored Procedures: Designed primarily to perform an action or a set of actions (e.g., complex data modifications, batch jobs). They may or may not return values, but typically do so via output parameters rather than a direct return value usable in SQL expressions. They are generally invoked using a specific command (e.g., `CALL` in the SQL standard and PostgreSQL).
-
Benefits:
- Encapsulation: Hide complex logic behind a simple interface. Application developers just call the procedure/function without needing to know the internal details.
- Reusability: Write the logic once and call it from multiple applications or different parts of the same application.
- Performance:
- Reduced Network Traffic: Instead of sending multiple individual SQL statements from the client to the server, you send a single call. The logic executes entirely on the server, reducing round trips.
- Pre-compilation/Caching: The database often parses and caches the execution plan for stored procedures/functions, potentially speeding up subsequent calls.
- Security: Grant `EXECUTE` permission on a procedure/function without granting permissions on the underlying tables. The procedure/function runs with the privileges of its definer (or invoker, depending on definition), allowing controlled access to data.
- Consistency: Ensure that complex operations involving multiple steps are always performed the same way, enforcing business rules consistently.
-
Example (PostgreSQL using PL/pgSQL): Creating a Function to Calculate Average GPA for a Department
```sql
-- Assumes the 'students' and 'departments' tables from previous workshops exist.
CREATE OR REPLACE FUNCTION get_department_avg_gpa(dept_id VARCHAR)
RETURNS NUMERIC(3, 2) -- Specifies the return type
AS $$ -- Start of function body block (using dollar-quoting)
DECLARE
    avg_gpa NUMERIC(3, 2); -- Declare a local variable
BEGIN
    -- Check if department exists
    IF NOT EXISTS (SELECT 1 FROM departments WHERE department_id = dept_id) THEN
        RAISE EXCEPTION 'Department % does not exist', dept_id;
    END IF;

    -- Calculate the average GPA for the given department ID
    SELECT AVG(s.gpa)
    INTO avg_gpa -- Store result in the variable
    FROM students s
    WHERE s.department_id = dept_id;

    -- Handle case where department has no students (AVG returns NULL)
    IF avg_gpa IS NULL THEN
        RETURN 0.00; -- Or perhaps raise an exception or return NULL as appropriate
    END IF;

    RETURN avg_gpa; -- Return the calculated value
END;
$$ LANGUAGE plpgsql; -- Specify the language used

-- How to call the function:
SELECT department_name, get_department_avg_gpa(department_id) AS average_gpa
FROM departments;

SELECT get_department_avg_gpa('CS');
```
-
Example (PostgreSQL using PL/pgSQL): Creating a Procedure to Transfer Students
```sql
-- Assumes 'students' and 'departments' tables exist.
CREATE OR REPLACE PROCEDURE transfer_student(
    s_id VARCHAR,        -- Input parameter: student ID
    new_dept_id VARCHAR  -- Input parameter: new department ID
)
AS $$
DECLARE
    old_dept_id VARCHAR;
BEGIN
    -- Validate inputs (basic example)
    IF NOT EXISTS (SELECT 1 FROM students WHERE student_id = s_id) THEN
        RAISE EXCEPTION 'Student % not found', s_id;
    END IF;
    IF NOT EXISTS (SELECT 1 FROM departments WHERE department_id = new_dept_id) THEN
        RAISE EXCEPTION 'Target Department % not found', new_dept_id;
    END IF;

    -- Get the old department ID for logging or other logic
    SELECT department_id INTO old_dept_id FROM students WHERE student_id = s_id;

    -- Perform the update
    UPDATE students SET department_id = new_dept_id WHERE student_id = s_id;

    -- Optional: Log the transfer action (could insert into an audit table)
    RAISE NOTICE 'Student % transferred from Dept % to Dept %', s_id, old_dept_id, new_dept_id;

    -- Procedures don't explicitly return a value like functions.
    -- COMMIT/ROLLBACK is generally handled by the calling transaction,
    -- unless the procedure manages its own transactions (less common).
END;
$$ LANGUAGE plpgsql;

-- How to call the procedure:
CALL transfer_student('S1006', 'PHYS'); -- Transfer Fiona to Physics

-- Verify the change
SELECT * FROM students WHERE student_id = 'S1006';
```
Triggers
A trigger is a special type of stored procedure that is automatically executed (fired) by the database server in response to certain events on a particular table or view. These events are typically `INSERT`, `UPDATE`, or `DELETE` operations.
-
Purpose:
- Enforcing Complex Business Rules: Implement constraints that are difficult or impossible to define using standard check or foreign key constraints.
- Auditing/Logging: Automatically record changes made to data into an audit trail table.
- Maintaining Data Consistency/Denormalization: Automatically update summary tables or derived data when underlying data changes (use with caution, can be complex).
- Preventing Invalid Operations: Intercept and prevent certain DML operations based on specific conditions.
-
Key Concepts:
- Trigger Event: The DML operation that causes the trigger to fire (`INSERT`, `UPDATE`, `DELETE`). PostgreSQL also supports `TRUNCATE`.
- Trigger Timing: When the trigger fires relative to the event:
  - `BEFORE`: Executes before the DML operation attempts to modify the row(s). Useful for validating or modifying data before it's written. Can prevent the operation by returning NULL or raising an exception (in row-level triggers).
  - `AFTER`: Executes after the DML operation has successfully completed (but typically still within the same transaction). Useful for actions that depend on the change having occurred, like logging or updating related tables.
  - `INSTEAD OF`: Used primarily for views. Executes instead of the DML operation on the view, allowing you to define how updates to complex views should translate into updates on the underlying base tables.
- Trigger Level:
  - `FOR EACH ROW`: The trigger function executes once for each individual row affected by the DML statement. Inside the trigger function, special variables `NEW` (representing the row after the change for INSERT/UPDATE) and `OLD` (representing the row before the change for UPDATE/DELETE) are available.
  - `FOR EACH STATEMENT`: The trigger function executes only once per DML statement, regardless of how many rows were affected. `NEW` and `OLD` are not available in statement-level triggers. Useful for actions that don't depend on individual row data.
-
Example (PostgreSQL using PL/pgSQL): Auditing Changes to Student GPAs
-
Create an Audit Table:
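The audit table's definition isn't preserved in this text. A minimal version that matches the columns used by the trigger function and the test query below might look like this (column types are assumptions):
```sql
CREATE TABLE student_gpa_audit (
    audit_id SERIAL PRIMARY KEY,
    student_id VARCHAR(10) NOT NULL,
    old_gpa NUMERIC(3, 2),
    new_gpa NUMERIC(3, 2),
    change_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);
```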
-
Create the Trigger Function: This function contains the logic to be executed.
```sql
CREATE OR REPLACE FUNCTION log_gpa_change()
RETURNS TRIGGER -- Must return TRIGGER type
AS $$
BEGIN
    -- Only log if the GPA actually changed.
    -- TG_OP is a special variable indicating the operation (INSERT, UPDATE, DELETE)
    IF (TG_OP = 'UPDATE' AND OLD.gpa IS DISTINCT FROM NEW.gpa) THEN
        INSERT INTO student_gpa_audit (student_id, old_gpa, new_gpa)
        VALUES (OLD.student_id, OLD.gpa, NEW.gpa); -- Use OLD and NEW records
    ELSIF (TG_OP = 'INSERT') THEN
        INSERT INTO student_gpa_audit (student_id, old_gpa, new_gpa)
        VALUES (NEW.student_id, NULL, NEW.gpa); -- No OLD gpa on insert
    -- Could add ELSIF (TG_OP = 'DELETE') to log deletions if needed
    --     INSERT INTO student_gpa_audit (student_id, old_gpa, new_gpa)
    --     VALUES (OLD.student_id, OLD.gpa, NULL);
    END IF;

    -- For AFTER triggers, the return value is ignored, but returning NEW is common practice.
    -- For BEFORE triggers, returning NEW allows the operation to proceed (possibly modified),
    -- returning NULL skips/cancels the operation for that row.
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```
-
Create the Trigger: Attach the function to the table and specify the event/timing.
```sql
CREATE TRIGGER students_gpa_audit_trigger
AFTER INSERT OR UPDATE ON students   -- Event and Timing
FOR EACH ROW                         -- Trigger Level
WHEN (pg_trigger_depth() = 0)        -- Optional: Prevent trigger cascades from self-calls
EXECUTE FUNCTION log_gpa_change();   -- The function to execute
```
(The `WHEN (pg_trigger_depth() = 0)` clause is a common PostgreSQL idiom to prevent infinite loops if the trigger function itself causes another event on the same table that would fire the same trigger).
Test the Trigger:
```sql
-- Update a student's GPA
UPDATE students SET gpa = 3.9 WHERE student_id = 'S1002';

-- Insert a new student
INSERT INTO students (student_id, student_name, department_id, gpa)
VALUES ('S2001', 'Test Student', 'CS', 3.0);

-- Check the audit table
SELECT * FROM student_gpa_audit ORDER BY change_timestamp DESC;
```
You should see entries in `student_gpa_audit` corresponding to the `UPDATE` and `INSERT` operations.
-
-
Potential Pitfalls:
- Complexity: Overuse of triggers can make application logic hard to follow and debug, as actions happen "magically" in the background.
- Performance: Poorly written triggers, especially row-level triggers on high-traffic tables, can significantly impact performance.
- Cascading Effects: Triggers can fire other triggers, potentially leading to complex and hard-to-predict interactions or even infinite loops if not carefully designed.
Server-side programming provides powerful tools for enhancing database functionality, but should be used judiciously, balancing the benefits against potential complexity and performance implications.
Workshop Introduction to Server Side Database Programming
Goal: Create and test a stored function and a trigger in the university database scenario.
Prerequisites:
- A running PostgreSQL instance on your Linux machine.
- Access to the `psql` command-line client.
- The `university_db` database created in the "Advanced SQL Querying" workshop, with its tables (`departments`, `students`, `courses`, `enrollments`) and sample data. If you dropped it, recreate it using the setup steps from that workshop.
Setup:
- Terminal: Connect to your PostgreSQL instance and the `university_db` (e.g., `psql -U your_username -d university_db`).
- Verify Tables: Ensure the tables exist (e.g., with `\dt` inside `psql`).
Tasks:
-
Create a Stored Function: Create a function
get_student_credits(student_id)
that calculates the total credits a given student is currently enrolled in.- Hint: You'll need to join
enrollments
andcourses
tables, filter by the inputstudent_id
, and sum thecredits
column. Handle the case where the student ID is invalid or the student has no enrollments. - Implementation (SQL/PLpgSQL):
```sql
CREATE OR REPLACE FUNCTION get_student_credits(s_id VARCHAR)
RETURNS INTEGER
AS $$
DECLARE
    total_credits INTEGER;
BEGIN
    -- Check if student exists
    IF NOT EXISTS (SELECT 1 FROM students WHERE student_id = s_id) THEN
        RAISE EXCEPTION 'Student ID % not found.', s_id;
    END IF;

    -- Calculate total credits
    SELECT COALESCE(SUM(c.credits), 0) -- Use COALESCE to return 0 if SUM is NULL (no enrollments)
    INTO total_credits
    FROM enrollments e
    JOIN courses c ON e.course_id = c.course_id
    WHERE e.student_id = s_id;

    RETURN total_credits;
END;
$$ LANGUAGE plpgsql;
```
- Test the Function:
```sql
-- Test with Alice (CS101=4, CS305=3, MATH211=4 -> Total 11)
SELECT get_student_credits('S1001');

-- Test with Bob (CS101=4 -> Total 4)
SELECT get_student_credits('S1002');

-- Test with a student who has no enrollments (-> Total 0)
INSERT INTO students (student_id, student_name, gpa) VALUES ('S1007', 'Test NoEnroll', 3.0);
SELECT get_student_credits('S1007');

-- Test with an invalid student ID
-- SELECT get_student_credits('S9999'); -- This should raise an exception
```
-
Create an Audit Table for Enrollments: Create a table to log when enrollments are added or removed.
```sql
CREATE TABLE enrollment_audit (
    audit_id SERIAL PRIMARY KEY,
    action VARCHAR(10) NOT NULL,      -- e.g., 'INSERT', 'DELETE'
    enrollment_id INTEGER,            -- Can be NULL if action is 'DELETE' and we don't capture it before deletion
    student_id VARCHAR(10) NOT NULL,
    course_id VARCHAR(10) NOT NULL,
    action_timestamp TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP,
    db_user_name TEXT DEFAULT current_user
);
```
-
Create a Trigger Function for Enrollment Audit: This function will insert records into `enrollment_audit`.
```sql
CREATE OR REPLACE FUNCTION log_enrollment_change()
RETURNS TRIGGER
AS $$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO enrollment_audit (action, enrollment_id, student_id, course_id)
        VALUES ('INSERT', NEW.enrollment_id, NEW.student_id, NEW.course_id);
        RETURN NEW; -- Allow INSERT to proceed
    ELSIF (TG_OP = 'DELETE') THEN
        INSERT INTO enrollment_audit (action, enrollment_id, student_id, course_id)
        VALUES ('DELETE', OLD.enrollment_id, OLD.student_id, OLD.course_id);
        RETURN OLD; -- Allow DELETE to proceed (return value is used but content often irrelevant for AFTER DELETE)
    END IF;

    -- We are not auditing UPDATEs in this example
    RETURN NULL; -- Should not happen for INSERT/DELETE, but good practice
END;
$$ LANGUAGE plpgsql;
```
-
Create the Trigger: Attach the function to the `enrollments` table to fire after inserts or deletes.
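The trigger definition itself isn't reproduced above; judging from the name used in the cleanup commands at the end of this workshop, it was presumably created along these lines:
```sql
CREATE TRIGGER enrollments_audit_trigger
AFTER INSERT OR DELETE ON enrollments
FOR EACH ROW
EXECUTE FUNCTION log_enrollment_change();
```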
Test the Trigger:
- Enroll Bob in Databases (CS305):
- Check Audit Table:
- Check Bob's Credits using the function:
- Remove Alice from Math (MATH211):
- Check Audit Table:
- Check Alice's Credits:
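The test statements are not preserved in this text. Assuming the enrollments table from the earlier workshop (columns `student_id`, `course_id`, `grade`), the steps above could be exercised like this:
```sql
-- Enroll Bob in Databases (CS305)
INSERT INTO enrollments (student_id, course_id, grade) VALUES ('S1002', 'CS305', NULL);

-- Check the audit table
SELECT * FROM enrollment_audit ORDER BY action_timestamp DESC;

-- Check Bob's credits using the function
SELECT get_student_credits('S1002');

-- Remove Alice from Math (MATH211)
DELETE FROM enrollments WHERE student_id = 'S1001' AND course_id = 'MATH211';

-- Check the audit table again
SELECT * FROM enrollment_audit ORDER BY action_timestamp DESC;

-- Check Alice's credits
SELECT get_student_credits('S1001');
```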
Cleanup (Optional):
-- Drop trigger, function, and audit table
DROP TRIGGER IF EXISTS enrollments_audit_trigger ON enrollments;
DROP FUNCTION IF EXISTS log_enrollment_change();
DROP FUNCTION IF EXISTS get_student_credits(VARCHAR);
DROP TABLE IF EXISTS enrollment_audit;
-- To fully clean up, drop the main tables and database if desired
-- DROP TABLE enrollments; DROP TABLE courses; DROP TABLE students; DROP TABLE departments;
-- \c postgres
-- DROP DATABASE university_db;
This workshop provided practical experience in creating a stored function to encapsulate logic (calculating credits) and using triggers to automate actions (auditing enrollment changes). These techniques are valuable for building more sophisticated and robust database applications.
6. Database Security Fundamentals
Protecting data stored within the database is paramount. Database security involves controlling who can access the database and what actions they are permitted to perform on specific objects. This is typically managed through user accounts, roles, and privileges.
User and Role Management
Databases maintain their own list of users, distinct from operating system users. To connect to a database, a client must typically provide valid database user credentials.
- Users (or Login Roles): Represent individual entities (people or applications) that need to connect to the database. Users have connection privileges and potentially other specific privileges granted directly to them.
- Creating Users (PostgreSQL Example):
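The example statement isn't preserved here; a typical form, using the `registrar_app` user referenced later in this section (the password is a placeholder), would be:
```sql
CREATE USER registrar_app WITH LOGIN PASSWORD 'a_strong_password_here';
```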
- Roles: Represent a set or "group" of privileges. Instead of granting the same set of privileges to multiple users individually, you can grant the privileges to a role, and then grant membership in that role to the users. This simplifies privilege management significantly. If a user needs more permissions, you add them to another role; if permissions need to change for a group, you modify the role, and all members inherit the change.
- In PostgreSQL, the concept of "User" and "Role" is largely unified. `CREATE USER` is essentially an alias for `CREATE ROLE` with the `LOGIN` privilege assumed by default. Roles created with `CREATE ROLE` do not have login privileges unless explicitly granted (`ALTER ROLE myrole WITH LOGIN;`). We often think of roles with `LOGIN` as "users" and roles without `LOGIN` as "groups".
- Creating Roles (PostgreSQL Example):
- Granting Role Membership: Assigning users to roles (or roles to other roles, creating hierarchies).
- Granting Membership (PostgreSQL Example):
```sql
-- Make the 'registrar_app' user a member of the 'data_entry_clerk' role
GRANT data_entry_clerk TO registrar_app;

-- Create another user and grant them read-only access
CREATE USER analyst_user WITH LOGIN PASSWORD 'another_secret';
GRANT read_only_access TO analyst_user;

-- Roles can inherit from other roles
GRANT read_only_access TO data_entry_clerk; -- Now data_entry_clerk also has read_only privileges
```
Using roles is the recommended practice for managing permissions for anything beyond the simplest setups.
Granting and Revoking Privileges
Privileges define the specific actions that a user or role is allowed to perform on database objects.
-
Common Object Privileges:
- `SELECT`: Allows reading data from a table, view, or specific columns.
- `INSERT`: Allows adding new rows to a table or view. Can be granted on specific columns.
- `UPDATE`: Allows modifying data in existing rows of a table or view. Can be granted on specific columns.
- `DELETE`: Allows removing rows from a table or view.
- `TRUNCATE`: Allows quickly removing all rows from a table (usually much faster than `DELETE` but often requires higher privileges).
- `REFERENCES`: Allows creating a foreign key constraint that references a table.
- `TRIGGER`: Allows creating triggers on a table.
- `USAGE`: Allows using schemas, sequences, data types, domains, foreign data wrappers, foreign servers. Required to access objects within a schema.
- `EXECUTE`: Allows calling a function or procedure.
-
Administrative Privileges:
- `CREATE`: Allows creating database objects (e.g., `CREATE TABLE`, `CREATE DATABASE`, `CREATE SCHEMA`, `CREATE ROLE`).
- `CONNECT`: Allows a user to connect to a specific database (checked after initial authentication).
-
`GRANT` Statement: Used to give privileges to roles or users.
  - Syntax (Simplified): `GRANT <privileges> ON <object> TO <role_or_user> [WITH GRANT OPTION];`
  - `PUBLIC`: A special pseudo-role meaning "all roles/users". Granting to PUBLIC should be done sparingly.
  - `WITH GRANT OPTION`: Allows the grantee to grant the same privilege to others. Use with extreme caution.
  - Examples (PostgreSQL):
```sql
-- Grant basic read access on the students table to the read_only_access role
GRANT SELECT ON TABLE students TO read_only_access;
GRANT SELECT ON TABLE courses TO read_only_access;
GRANT SELECT ON TABLE enrollments TO read_only_access;
GRANT SELECT ON TABLE departments TO read_only_access;

-- Grant data entry privileges on enrollments to the data_entry_clerk role
GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE enrollments TO data_entry_clerk;

-- Also needs ability to SELECT from students and courses to verify FKs
GRANT SELECT ON TABLE students, courses TO data_entry_clerk;

-- Needs USAGE on the schema containing the tables (if not public)
-- Assuming tables are in the 'public' schema, USAGE is usually granted by default.
-- If in a different schema 'university_schema':
-- GRANT USAGE ON SCHEMA university_schema TO read_only_access;
-- GRANT USAGE ON SCHEMA university_schema TO data_entry_clerk;

-- Allow the data_entry_clerk role to use the get_student_credits function
GRANT EXECUTE ON FUNCTION get_student_credits(VARCHAR) TO data_entry_clerk;
```
-
`REVOKE` Statement: Used to remove previously granted privileges.
  - Syntax (Simplified): `REVOKE <privileges> ON <object> FROM <role_or_user> [CASCADE | RESTRICT];`
  - `GRANT OPTION FOR`: Revokes only the grant option, not the privilege itself.
  - `CASCADE`: Revokes the privilege and also revokes it from anyone who received it from the user/role being revoked from (directly or indirectly via grant option).
  - `RESTRICT`: (Often the default) Prevents revoking if dependent privileges exist (e.g., if the user granted the privilege to someone else).
  - Examples (PostgreSQL):
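The original examples are not preserved; statements in this spirit, reusing the roles from the GRANT examples, would be:
```sql
-- Take data-modification rights on enrollments away from data_entry_clerk
REVOKE INSERT, UPDATE, DELETE ON TABLE enrollments FROM data_entry_clerk;

-- Remove any read access on students that was granted to everyone
REVOKE SELECT ON TABLE students FROM PUBLIC;
```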
-
Principle of Least Privilege: A fundamental security concept. Users and applications should only be granted the absolute minimum privileges necessary to perform their intended functions. Avoid granting broad privileges like superuser status or `ALL PRIVILEGES` unless absolutely required for administrative tasks. Use specific roles with narrowly defined permissions.
Effective security management through users, roles, and carefully assigned privileges is crucial for protecting data integrity and confidentiality.
Workshop Database Security Fundamentals
Goal: Create users and roles, grant specific privileges, and test access restrictions.
Prerequisites:
- A running PostgreSQL instance on your Linux machine where you have administrative privileges (like the default `postgres` user).
- Access to the `psql` command-line client.
- The `university_db` database created in previous workshops, with its tables (`departments`, `students`, `courses`, `enrollments`). Assume these tables exist and contain data.
Setup:
- Terminal: Connect to PostgreSQL as an administrator, e.g. `psql -U postgres -d university_db -h localhost` (replace `postgres` if you use a different admin user).
- Verify Tables: Ensure the tables exist (e.g., with `\dt`).
Tasks:
-
Create Roles: Create two roles:
student_viewer
(for read-only access to course/enrollment info) andcourse_admin
(for managing course data). -
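Plain role definitions are enough here (sketch):
```sql
CREATE ROLE student_viewer;
CREATE ROLE course_admin;
```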
Create Users: Create two users:
alice_advisor
(who will advise students) andbob_registrar
(who will manage courses). Assign passwords. -
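For example (the passwords are placeholders):
```sql
CREATE USER alice_advisor WITH LOGIN PASSWORD 'alice_password_here';
CREATE USER bob_registrar WITH LOGIN PASSWORD 'bob_password_here';
```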
Grant Role Membership: Assign the users to the roles.
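A sketch of the membership grants:
```sql
GRANT student_viewer TO alice_advisor;
GRANT course_admin TO bob_registrar;
```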
-
Grant Privileges to Roles: Assign appropriate permissions to the roles based on the Principle of Least Privilege.
student_viewer
needs:- SELECT on
students
,courses
,enrollments
,departments
.
- SELECT on
course_admin
needs:- SELECT, INSERT, UPDATE, DELETE on
courses
. - (Implicitly needs USAGE on the schema, usually
public
by default).-- Privileges for student_viewer role GRANT SELECT ON TABLE students, courses, enrollments, departments TO student_viewer; -- Privileges for course_admin role GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE courses TO course_admin; -- Ensure USAGE on schema (often default for public, but good practice) GRANT USAGE ON SCHEMA public TO student_viewer; GRANT USAGE ON SCHEMA public TO course_admin;
- SELECT, INSERT, UPDATE, DELETE on
-
Test
alice_advisor
Permissions:- Open a NEW Terminal Window. Do not close the admin connection yet.
- New Terminal: Connect as
alice_advisor
: - New Terminal (as alice_advisor): Try allowed and disallowed actions:
-- Allowed actions (SELECT) SELECT student_name, gpa FROM students WHERE student_id = 'S1001'; -- OK SELECT course_name, credits FROM courses WHERE department_id = 'CS'; -- OK SELECT * FROM enrollments WHERE student_id = 'S1001'; -- OK -- Disallowed actions (INSERT, UPDATE, DELETE) INSERT INTO courses (course_id, course_name, department_id, credits) VALUES ('ART101', 'Art History', 'PHYS', 3); -- Should FAIL (permission denied for table courses) UPDATE students SET gpa = 4.0 WHERE student_id = 'S1001'; -- Should FAIL (permission denied for table students) DELETE FROM enrollments WHERE student_id = 'S1001'; -- Should FAIL (permission denied for table enrollments) -- Try creating a table (should fail - no CREATE privilege) -- CREATE TABLE my_test_table (id int); -- Should FAIL -- Exit this connection \q
-
Test
bob_registrar
Permissions:- Open another NEW Terminal Window (or reuse the previous non-admin one).
- New Terminal: Connect as
bob_registrar
: - New Terminal (as bob_registrar): Try allowed and disallowed actions:
-- Allowed actions (SELECT, INSERT, UPDATE, DELETE on courses) SELECT * FROM courses WHERE department_id = 'CS'; -- OK INSERT INTO courses (course_id, course_name, department_id, credits) VALUES ('ART101', 'Art History', 'PHYS', 3); -- OK (Assuming PHYS dept exists) UPDATE courses SET credits = 4 WHERE course_id = 'ART101'; -- OK DELETE FROM courses WHERE course_id = 'ART101'; -- OK -- Disallowed actions (accessing other tables like students) SELECT student_name FROM students WHERE student_id = 'S1001'; -- Should FAIL (permission denied for table students) INSERT INTO enrollments (student_id, course_id, grade) VALUES ('S1001', 'CS101', 'B'); -- Should FAIL (permission denied for table enrollments) -- Exit this connection \q
-
Modify Privileges (as Admin): Let's say
course_admin
should not be allowed to delete courses, only manage them otherwise.- Go back to the FIRST Terminal (connected as admin).
- Revoke the
DELETE
privilege:
-
Retest
bob_registrar
Delete Permission:- Connect again as
bob_registrar
in the second terminal. - Try to delete a course (first insert one to delete):
```sql
-- Connect as bob_registrar again
psql -U bob_registrar -d university_db -h localhost

-- Insert a course to delete
INSERT INTO courses (course_id, course_name, department_id, credits)
VALUES ('TEMP101', 'Temp Course', 'CS', 1); -- OK

-- Attempt to delete it
DELETE FROM courses WHERE course_id = 'TEMP101'; -- Should now FAIL (permission denied)

-- Clean up the temp course using UPDATE (which is still allowed)
-- Or just leave it. It shows the revoke worked.
-- UPDATE courses SET course_name = 'Old Temp Course' WHERE course_id = 'TEMP101'; -- This would be OK

\q
```
- Connect again as
Cleanup (Optional):
-- In the admin terminal:
DROP USER IF EXISTS alice_advisor;
DROP USER IF EXISTS bob_registrar;
DROP ROLE IF EXISTS student_viewer;
DROP ROLE IF EXISTS course_admin;
-- To fully clean up database:
-- \c postgres
-- DROP DATABASE university_db;
This workshop demonstrated the fundamental process of creating roles and users, granting specific privileges based on job function, and verifying that security restrictions are enforced correctly. This granular control is essential for protecting database assets.
Conclusion
Moving beyond the basics, this section equipped you with intermediate database skills crucial for real-world development and administration on Linux. You've learned to craft complex SQL queries using advanced joins, subqueries, CTEs, and window functions to extract deeper insights from data. We explored the critical principles of relational database design, focusing on normalization to ensure data integrity and minimize redundancy, and strategic indexing to optimize query performance.
Understanding how databases handle concurrent access through ACID properties, isolation levels, and locking mechanisms is vital for building reliable multi-user applications. You gained practical experience observing these concepts in action. Furthermore, you learned how to diagnose performance bottlenecks using `EXPLAIN ANALYZE` and apply index tuning techniques to speed up slow queries.
We introduced server-side programming with stored functions, procedures, and triggers, enabling you to encapsulate logic, enforce complex rules, and automate tasks directly within the database. Finally, we covered database security fundamentals, emphasizing the importance of user/role management and the Principle of Least Privilege to control access and protect your data.
The workshops provided hands-on practice reinforcing these concepts within a PostgreSQL environment on Linux. These intermediate skills form a strong foundation for tackling more complex database challenges, delving into advanced administration topics like replication and high availability, exploring different database paradigms like NoSQL, or specializing in database performance optimization.