Author: Nejat Hakan (nejat.hakan@outlook.de)
PayPal Me: https://paypal.me/nejathakan
Database Basics
Introduction What is a Database
Welcome to the world of databases! In today's digital age, data is everywhere. From social media posts and online shopping transactions to scientific research and business analytics, vast amounts of information are constantly being generated, processed, and stored. But how do we manage this deluge of data effectively? How do we ensure it's organized, accessible, secure, and consistent? The answer lies in databases and the systems designed to manage them.
At its core, a database is simply an organized collection of structured information, or data, typically stored electronically in a computer system. Think of it as a highly efficient digital filing cabinet. Instead of rummaging through physical folders, you use specialized software to store, retrieve, update, and manage the data.
The software used to interact with databases – to create them, manage them, query them, and control access – is called a Database Management System (DBMS). When we talk about databases like PostgreSQL, MySQL, MongoDB, or SQLite, we are often referring to the DBMS, although the term "database" is frequently used to encompass both the data itself and the system managing it.
Why are Databases Necessary?
Imagine trying to manage customer information for a large online store using simple text files or spreadsheets.
- Scalability: As the number of customers and orders grows, searching, sorting, and updating information in flat files becomes incredibly slow and inefficient. Databases are designed to handle massive amounts of data and scale effectively.
- Data Integrity: How do you ensure that every customer has a unique ID, or that an order always refers to an existing product? Databases enforce rules (constraints) to maintain the accuracy, consistency, and reliability of data.
- Concurrency: Multiple users or applications often need to access and modify data simultaneously (e.g., multiple customers placing orders at the same time). A DBMS handles concurrent access, preventing conflicts and ensuring data remains consistent. Simple file systems typically lock the entire file, which is highly inefficient.
- Data Security: Databases provide mechanisms to control who can access what data and what operations they can perform (e.g., read-only access for analysts, full access for administrators).
- Data Retrieval: Databases offer powerful query languages (like SQL) that allow you to retrieve specific subsets of data based on complex criteria far more efficiently than searching through unstructured files.
- Backup and Recovery: A DBMS provides tools and mechanisms for backing up data regularly and recovering it in case of hardware failure, software bugs, or human error.
A Brief Evolution
Early data management involved primitive file systems. Then came hierarchical and network models in the 1960s and 70s, which organized data in tree-like or graph-like structures. While an improvement, they were often complex and inflexible.
The major breakthrough came in the 1970s with Dr. E.F. Codd's invention of the relational model, which organizes data into simple tables (relations) with rows and columns. This model, combined with the development of Structured Query Language (SQL), revolutionized data management and remains the dominant paradigm today for many applications.
More recently, the rise of big data, the internet, and highly distributed applications led to the development of NoSQL (Not Only SQL) databases. These offer different data models (key-value, document, column-family, graph) designed to handle specific challenges like massive scale, high availability, and unstructured or semi-structured data, sometimes sacrificing some of the strict consistency guarantees of relational databases.
Types of Databases (Brief Overview)
- Relational Databases (SQL): Organize data in tables with predefined schemas. Examples: PostgreSQL, MySQL, MariaDB, SQLite, Oracle Database, SQL Server. They use SQL for data definition and manipulation.
- NoSQL Databases: A broad category encompassing various non-relational models.
    - Key-Value Stores: Simple pairs of keys and values. Examples: Redis, Memcached.
    - Document Databases: Store data in flexible, JSON-like documents. Examples: MongoDB, Couchbase.
    - Column-Family Stores: Store data in columns rather than rows, optimized for large datasets and aggregations. Examples: Cassandra, HBase.
    - Graph Databases: Focus on relationships between data points (nodes and edges). Examples: Neo4j, ArangoDB.
In this section, we will delve deeper into these concepts, starting with the foundational relational model and SQL, and then exploring the diverse world of NoSQL. We will focus on how these databases operate within the Linux environment, providing practical skills through hands-on workshops.
1. Relational Databases and SQL
The relational model, introduced by E.F. Codd, is the foundation for the most widely used type of database system. Its elegance and power stem from its simple, intuitive structure based on mathematical set theory and relational algebra. In a relational database, data is organized into tables, which are formally called relations.
Core Concepts of the Relational Model
- Tables (Relations): A table represents a collection of related data entries organized in a two-dimensional grid of columns and rows. Each table typically represents a specific type of entity, like "Customers", "Products", or "Orders".
    - Example: A `Students` table might store information about university students.
- Rows (Tuples or Records): Each row in a table represents a single instance or record of the entity type the table describes.
    - Example: In the `Students` table, each row would represent one specific student.
- Columns (Attributes or Fields): Each column represents a specific characteristic or attribute of the entity type. All entries in a given column are of the same data type (e.g., integer, text, date).
    - Example: The `Students` table might have columns like `student_id` (integer), `first_name` (text), `last_name` (text), `date_of_birth` (date), `major` (text).
- Schema: The schema defines the structure of the database. It specifies the tables, the columns within each table, the data types for each column, and the relationships between tables (using keys and constraints). Defining the schema is a critical first step in designing a relational database.
- Keys: Keys are special columns (or sets of columns) used to uniquely identify rows and establish relationships between tables.
    - Primary Key (PK): A column (or combination of columns) whose value uniquely identifies each row in a table. Primary keys cannot contain NULL values and must be unique. Every table in a well-designed relational database should have a primary key.
        - Example: `student_id` in the `Students` table would likely be the primary key.
    - Foreign Key (FK): A column (or combination of columns) in one table that refers to the primary key of another table. Foreign keys are used to link related tables and enforce referential integrity (ensuring that a row in one table cannot reference a non-existent row in another table).
        - Example: An `Enrollments` table might have a `student_id` column that is a foreign key referencing the `student_id` primary key in the `Students` table. This ensures that every enrollment record corresponds to a valid student.
- Data Types: Each column must be assigned a specific data type, which dictates the kind of data it can store (e.g., `INTEGER`, `VARCHAR(255)` for variable-length text up to 255 characters, `TEXT` for longer text, `DATE`, `TIMESTAMP`, `BOOLEAN`, `NUMERIC(10, 2)` for numbers with fixed precision). Using appropriate data types helps ensure data integrity and optimize storage and performance.
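These concepts map directly onto SQL. As a minimal, illustrative sketch (PostgreSQL syntax; the columns simply echo the examples above, not a finished design):

```sql
-- A table (relation) whose columns (attributes) each have a data type,
-- with student_id serving as the primary key.
CREATE TABLE Students (
    student_id    INTEGER PRIMARY KEY,  -- uniquely identifies each row
    first_name    VARCHAR(100),
    last_name     VARCHAR(100),
    date_of_birth DATE,
    major         TEXT
);
```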
Structured Query Language (SQL)
SQL is the standard language used to communicate with relational databases. It's a declarative language, meaning you specify what data you want, and the DBMS figures out how to retrieve or manipulate it. SQL commands are generally categorized into several sub-languages:
- Data Definition Language (DDL): Used to define and modify the database structure (the schema).
    - `CREATE`: Creates database objects (e.g., `CREATE DATABASE`, `CREATE TABLE`).
    - `ALTER`: Modifies existing database objects (e.g., `ALTER TABLE ... ADD COLUMN`).
    - `DROP`: Deletes database objects (e.g., `DROP TABLE`).
- Data Manipulation Language (DML): Used to insert, retrieve, update, and delete data within the tables.
    - `SELECT`: Retrieves data from one or more tables. This is the most frequently used SQL command.
    - `INSERT`: Adds new rows of data into a table.
    - `UPDATE`: Modifies existing data in a table.
    - `DELETE`: Removes rows of data from a table.
- Data Control Language (DCL): Used to manage user permissions and access control.
    - `GRANT`: Gives specific permissions (e.g., `SELECT`, `INSERT`) on database objects to users or roles.
    - `REVOKE`: Removes previously granted permissions.
- Transaction Control Language (TCL): Used to manage transactions, ensuring data consistency.
    - `BEGIN TRANSACTION` (or `START TRANSACTION`): Marks the beginning of a transaction.
    - `COMMIT`: Saves all changes made during the current transaction permanently.
    - `ROLLBACK`: Discards all changes made during the current transaction, restoring the database to its state before the transaction began.
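The four sub-languages are easiest to tell apart side by side. A hedged sketch (the table name `demo` and the role `some_user` are invented for illustration):

```sql
-- DDL: define structure
CREATE TABLE demo (id INTEGER PRIMARY KEY, name TEXT);
ALTER TABLE demo ADD COLUMN created DATE;

-- DML: work with the data itself
INSERT INTO demo (id, name) VALUES (1, 'example');
SELECT id, name FROM demo;

-- DCL: control access (assumes a role named some_user already exists)
GRANT SELECT ON demo TO some_user;

-- TCL: group changes into an all-or-nothing unit
BEGIN;
UPDATE demo SET name = 'changed' WHERE id = 1;
COMMIT;
```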
ACID Properties
Relational databases are renowned for adhering to the ACID properties, which guarantee the reliability of database transactions:
- Atomicity: Ensures that each transaction is treated as a single, indivisible unit. Either all operations within the transaction are completed successfully, or none of them are. If any part of the transaction fails, the entire transaction is rolled back. This prevents partial updates that could leave the database in an inconsistent state.
    - Example: Transferring money involves debiting one account and crediting another. Atomicity ensures that both actions happen or neither happens; you never end up with money deducted but not credited.
- Consistency: Guarantees that a transaction brings the database from one valid state to another. Any data written to the database must be valid according to all defined rules, including constraints, cascades, and triggers. If a transaction would violate these rules, it is rolled back.
    - Example: If a column has a `UNIQUE` constraint, a transaction attempting to insert a duplicate value will fail, maintaining consistency.
- Isolation: Ensures that concurrent transactions execute independently without interfering with each other. The intermediate state of one transaction should not be visible to other transactions until it is committed. This prevents issues like dirty reads (reading uncommitted data) or lost updates. Databases achieve this through locking mechanisms or multi-version concurrency control (MVCC).
    - Example: If two users try to book the last available seat on a flight simultaneously, isolation ensures that only one transaction succeeds, while the other fails or waits.
- Durability: Guarantees that once a transaction has been committed, its changes are permanent and will survive subsequent system failures (like power outages or crashes). This is typically achieved by writing changes to transaction logs stored on non-volatile storage before acknowledging the commit.
    - Example: Once you receive confirmation that your online order is placed (transaction committed), you can be confident the order data is saved permanently, even if the server crashes immediately afterward.
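The money-transfer example can be sketched in SQL. This is an illustration only; the `accounts` table and its columns are invented for the example:

```sql
-- Assumes: accounts(account_id INTEGER PRIMARY KEY, balance NUMERIC(10, 2))
BEGIN;

UPDATE accounts SET balance = balance - 100.00 WHERE account_id = 1;  -- debit
UPDATE accounts SET balance = balance + 100.00 WHERE account_id = 2;  -- credit

-- If both updates succeeded, make them permanent:
COMMIT;
-- Had anything gone wrong, issuing ROLLBACK instead would undo both
-- updates, so money is never deducted without being credited.
```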
Popular Relational Databases on Linux
Linux is a very popular platform for hosting relational databases. Some common choices include:
- PostgreSQL: A powerful, open-source object-relational database system known for its standards compliance, robustness, extensibility, and strong feature set. It's often favored for complex applications requiring high data integrity.
- MySQL: The world's most popular open-source relational database, known for its ease of use, speed, and large community support. It's widely used for web applications (often part of the LAMP stack - Linux, Apache, MySQL, PHP/Python/Perl).
- MariaDB: A community-developed fork of MySQL, created by the original developers of MySQL. It aims to maintain high compatibility with MySQL while adding new features and improvements, and remaining fully open-source.
- SQLite: An embedded relational database library. Instead of a server process, SQLite stores the entire database in a single file on the host machine. It's incredibly lightweight and commonly used in applications (like web browsers, mobile apps) that need a local, self-contained database without the overhead of a full client-server system.
Understanding the relational model and SQL is fundamental for anyone working with structured data, regardless of the specific RDBMS used.
Workshop Installing and Connecting to PostgreSQL
This workshop guides you through installing the PostgreSQL database server on a Linux system, performing initial configuration, and connecting to it using the `psql` command-line client. We will use PostgreSQL because of its robustness and SQL standards compliance, making it an excellent choice for learning.
Prerequisites:
- A Linux system (e.g., Ubuntu 20.04/22.04, Debian 11/12, Fedora 36+, CentOS/RHEL 8+).
- Access to the terminal with `sudo` privileges.
- Internet connection to download packages.
Steps:
- Update Package Lists: Always start by ensuring your system's package list is up-to-date.
    - On Debian/Ubuntu: `sudo apt update`
    - On Fedora/CentOS/RHEL: `sudo dnf check-update`
- Install PostgreSQL Server and Client: Install the main PostgreSQL server package and the client utilities, which include `psql`.
    - On Debian/Ubuntu: install the `postgresql` and `postgresql-contrib` packages (for example, `sudo apt install postgresql postgresql-contrib -y`). The `postgresql-contrib` package includes useful extensions and utilities.
    - On Fedora: Fedora often uses modules; you might need to list available streams and enable one first, or simply install the default version.

        ```bash
        # Check available versions (optional)
        # sudo dnf module list postgresql

        # Install default version
        sudo dnf install postgresql-server postgresql-contrib -y

        # Initialize the database cluster (REQUIRED on Fedora/RHEL)
        sudo postgresql-setup --initdb

        # Enable and start the service (REQUIRED on Fedora/RHEL)
        sudo systemctl enable postgresql
        sudo systemctl start postgresql
        ```

    - On CentOS/RHEL: PostgreSQL might be available via AppStream or require enabling the official PostgreSQL repository. Using the official repo is often recommended for newer versions.
        - Option A (Using AppStream - may be older version):

            ```bash
            # Check available versions (optional)
            # sudo dnf module list postgresql

            # Enable a specific stream (e.g., 13)
            # sudo dnf module enable postgresql:13 -y

            sudo dnf install postgresql-server postgresql-contrib -y
            sudo postgresql-setup --initdb
            sudo systemctl enable postgresql
            sudo systemctl start postgresql
            ```

        - Option B (Using Official PostgreSQL Repo - Recommended for latest versions): Visit https://www.postgresql.org/download/linux/redhat/ for instructions on adding the repository, then install using `dnf`.
- Verify Installation and Service Status: Check if the PostgreSQL service is running: `sudo systemctl status postgresql`. You should see output indicating the service is "active (running)". If not, try starting it: `sudo systemctl start postgresql`. On Debian/Ubuntu, the service usually starts automatically after installation.
- Understand PostgreSQL User Authentication: By default, PostgreSQL uses an authentication method called `peer` authentication for local connections. This means it checks the operating system username of the client connecting via a Unix domain socket and allows access if it matches a valid PostgreSQL username. During installation, a PostgreSQL superuser named `postgres` is created, associated with the system's `postgres` user.
- Connect as the `postgres` Superuser: To perform administrative tasks, you need to connect as the `postgres` user. Use `sudo` to switch to the system `postgres` user and then run `psql`: `sudo -u postgres psql`. You should now see the `psql` prompt, which looks like `postgres=#`. This indicates you are connected to the default database (also named `postgres`) as the `postgres` user.
    - To exit `psql`, type `\q` and press Enter.
- Set a Password for the `postgres` User (Recommended): While connected as `postgres` in `psql`, set a strong password. This is crucial if you plan to allow password-based connections later (e.g., over the network). Run `ALTER ROLE postgres WITH PASSWORD 'YourStrongPassword';`, replacing `'YourStrongPassword'` with a secure password. You should see `ALTER ROLE` as confirmation.
- Create a New Database User: Let's create a dedicated user for your own projects instead of always using the superuser. We'll create a user named `university_admin`. While still in `psql` as the `postgres` user, run `CREATE USER university_admin WITH PASSWORD 'AnotherSecurePassword';`, replacing `'AnotherSecurePassword'` with a strong password. You should see `CREATE ROLE`.
- Create a New Database: Let's create a database named `university_db` and make `university_admin` the owner: `CREATE DATABASE university_db OWNER university_admin;`. You should see `CREATE DATABASE`.
- Grant Privileges (Optional but Good Practice): Although `university_admin` owns the database, explicit grants are often clearer. Let's grant all privileges on the new database to the new user: `GRANT ALL PRIVILEGES ON DATABASE university_db TO university_admin;`. You should see `GRANT`.
    - Now, exit `psql`: `\q`
- Connect as the New User: You can now connect directly to the new database using the user you created: `psql -U university_admin -d university_db -h localhost -W`
    - `-U university_admin`: Specifies the PostgreSQL username.
    - `-d university_db`: Specifies the database name to connect to.
    - `-h localhost`: Specifies the host (use `localhost` for local connections). If you omit `-h`, `psql` might try `peer` authentication via sockets first. Using `-h localhost` often forces password authentication (if configured).
    - `-W`: Prompts for the password you set for `university_admin`.

    Enter the password (`AnotherSecurePassword`) when prompted. Your prompt should now look like `university_db=>`, indicating you are connected to `university_db` as `university_admin`.
. -
Explore
psql
Commands: Insidepsql
, you can use backslash commands:\l
: List all databases.\du
: List all users (roles).\dt
: List tables in the current database (will be empty now).\d <table_name>
: Describe a specific table (once you create one).\conninfo
: Show information about the current connection.\q
: Quitpsql
.
Congratulations! You have successfully installed PostgreSQL, created a dedicated user and database, and learned how to connect using `psql`. You are now ready to start designing and interacting with your relational database on Linux. Remember to exit `psql` (`\q`) when you are finished.
2. Data Modeling Fundamentals
Before you start creating tables and inserting data, it's crucial to design the structure of your database. This process is called data modeling. A well-designed data model ensures data integrity, reduces redundancy, improves query performance, and makes the database easier to understand and maintain. The goal is to accurately represent the real-world entities and their relationships within the constraints of the relational model.
Entities Attributes and Relationships
- Entities: An entity represents a distinct object, person, place, concept, or event about which you want to store information. Entities typically become tables in your database.
    - Examples: `Student`, `Course`, `Professor`, `Department`, `Product`, `Order`.
- Attributes: An attribute is a property or characteristic of an entity. Attributes become the columns in your database table. Each attribute has a specific data type.
    - Examples: For the `Student` entity, attributes might be `student_id`, `first_name`, `last_name`, `email`, `date_of_birth`, `major`. For the `Course` entity, attributes might be `course_code`, `course_title`, `credits`, `department_id`.
- Relationships: Relationships describe how different entities are associated with each other. Understanding relationships is key to linking tables correctly using foreign keys. There are three main types of relationships:
    - One-to-One (1:1): Each instance of one entity is related to exactly one instance of another entity, and vice versa. This is less common. It might be used to split a table with many columns or for security reasons.
        - Example: A `User` entity might have a 1:1 relationship with a `UserProfile` entity containing optional details. One user has one profile, and one profile belongs to one user.
    - One-to-Many (1:N): One instance of an entity can be related to many instances of another entity, but each instance of the second entity is related to only one instance of the first. This is very common.
        - Example: One `Department` can have many `Professors`, but each `Professor` belongs to only one `Department`. The foreign key (`department_id`) would typically be placed in the `Professors` table (the "many" side).
    - Many-to-Many (M:N): One instance of an entity can be related to many instances of a second entity, and one instance of the second entity can also be related to many instances of the first.
        - Example: A `Student` can enroll in many `Courses`, and a `Course` can have many `Students` enrolled. To implement M:N relationships in a relational database, you need an associative table (also called a junction or linking table). This table sits between the two entities and contains foreign keys referencing the primary keys of both related tables.
        - Implementation: Create an `Enrollments` table with columns like `enrollment_id` (PK), `student_id` (FK referencing `Students`), `course_code` (FK referencing `Courses`), and potentially other attributes like `enrollment_date` or `grade`.
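The 1:N case is worth seeing in DDL, because the placement of the foreign key is the whole trick. A hedged sketch (table and column names follow the Department/Professor example above; the exact types are assumptions):

```sql
-- The "one" side
CREATE TABLE Departments (
    department_id   SERIAL PRIMARY KEY,
    department_name VARCHAR(100) NOT NULL
);

-- The "many" side: each professor row points at exactly one department
CREATE TABLE Professors (
    professor_id  SERIAL PRIMARY KEY,
    name          VARCHAR(100) NOT NULL,
    department_id INTEGER NOT NULL REFERENCES Departments(department_id)
);
```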
Keys and Constraints Revisited
- Primary Key (PK): Essential for uniquely identifying each row. Can be a single column (`student_id`) or multiple columns (a composite key, e.g., `(order_id, product_id)` in an `OrderItems` table). Often, an auto-incrementing integer (`SERIAL` or `IDENTITY` column) is used as a surrogate key when no natural unique identifier exists or when the natural key is complex.
- Foreign Key (FK): Enforces referential integrity, ensuring relationships between tables are valid. When defining a foreign key, you usually specify what should happen if the referenced primary key is updated or deleted (e.g., `ON DELETE CASCADE` - delete related rows, `ON DELETE SET NULL` - set the FK to NULL, `ON DELETE RESTRICT` - prevent deletion if referenced).
- Constraints: Rules enforced by the database to ensure data accuracy and consistency.
    - `NOT NULL`: Ensures a column cannot contain NULL values.
    - `UNIQUE`: Ensures all values in a column (or set of columns) are unique across the table. NULL values are typically still allowed, since NULLs are not considered equal to each other; PostgreSQL, for example, permits multiple NULLs in a `UNIQUE` column. Primary keys are implicitly `UNIQUE` and `NOT NULL`.
    - `CHECK`: Specifies a condition that must be true for any value entered into a column.
        - Example: `CHECK (credits > 0)` on a `Courses` table.
    - `DEFAULT`: Provides a default value for a column if none is specified during insertion.
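A compact sketch tying these constraints together (illustrative names and types; PostgreSQL syntax):

```sql
CREATE TABLE Courses (
    course_id   SERIAL PRIMARY KEY,           -- surrogate key
    course_code VARCHAR(10) UNIQUE NOT NULL,  -- natural unique identifier
    credits     INTEGER CHECK (credits > 0),  -- CHECK constraint
    active      BOOLEAN DEFAULT TRUE          -- DEFAULT value
);

-- A referencing table with an explicit delete rule:
CREATE TABLE Reviews (
    review_id INTEGER PRIMARY KEY,
    course_id INTEGER REFERENCES Courses(course_id) ON DELETE CASCADE
);
```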
Normalization
Normalization is the process of organizing the columns and tables of a relational database to minimize data redundancy and improve data integrity. It involves decomposing tables into smaller, well-structured tables and defining relationships between them. The goal is to isolate data so that additions, deletions, and modifications of an attribute can be made in just one table. Redundancy wastes space and can lead to inconsistencies (update anomalies).
Normalization is typically described in terms of normal forms (NFs):
- First Normal Form (1NF):
    - Rule: Each column must contain atomic (indivisible) values, and each column must have a unique name within the table. There should be no repeating groups of columns. Each row must be unique (enforced by a primary key).
    - Violation Example: A `StudentCourses` column containing a comma-separated list of course codes (`'CS101, MATH202'`).
    - Fix: Separate courses into individual rows, likely in a separate `Enrollments` table linked to the `Students` and `Courses` tables.
- Second Normal Form (2NF):
    - Prerequisite: Must be in 1NF.
    - Rule: All non-key attributes must be fully functionally dependent on the entire primary key. This rule applies mainly when a table has a composite primary key (a PK made of multiple columns). It means no non-key column should depend on only part of the composite key.
    - Violation Example: Table `OrderItems (OrderID, ProductID, Quantity, OrderDate, ProductName)`. Assume `(OrderID, ProductID)` is the composite PK. Here, `ProductName` depends only on `ProductID` (part of the PK), not the whole key. `OrderDate` depends only on `OrderID` (part of the PK). `Quantity` depends on the full key.
    - Fix: Decompose into separate tables: `Orders(OrderID (PK), OrderDate)`, `Products(ProductID (PK), ProductName)`, and `OrderItems(OrderID (FK), ProductID (FK), Quantity)`. The primary key for `OrderItems` would be `(OrderID, ProductID)`.
- Third Normal Form (3NF):
    - Prerequisite: Must be in 2NF.
    - Rule: No non-key attribute should be transitively dependent on the primary key. This means a non-key attribute should not depend on another non-key attribute.
    - Violation Example: Table `Students (StudentID (PK), Name, Major, Department, DepartmentHead)`. Here, `DepartmentHead` depends on `Department`, which in turn depends on `StudentID`. This is a transitive dependency. If the head of a department changes, you'd have to update multiple student records.
    - Fix: Decompose into `Students (StudentID (PK), Name, Major, DepartmentID (FK))` and `Departments (DepartmentID (PK), DepartmentName, DepartmentHead)`.
Higher normal forms (BCNF, 4NF, 5NF) exist but are less commonly applied in practice. Achieving 3NF is often considered a good balance between reducing redundancy and maintaining practical usability. Sometimes, denormalization (intentionally violating normalization rules) is done selectively to improve read performance, but it should be done cautiously as it reintroduces redundancy and potential update anomalies.
Entity-Relationship Diagrams (ERDs)
An ERD is a visual representation of the entities, attributes, and relationships in a database model. It's a blueprint used during the design phase. Common notations include Crow's Foot notation.
- Entities: Represented by rectangles.
- Attributes: Listed inside the entity rectangles (PKs and FKs often indicated).
- Relationships: Represented by lines connecting related entities. Symbols on the ends of the lines (like a crow's foot for "many" and a single stroke for "one") indicate the cardinality (1:1, 1:N, M:N).
Drawing an ERD helps clarify the structure and identify potential issues before writing any `CREATE TABLE` statements.
Workshop Designing a Simple University Database Schema
In this workshop, we will apply data modeling concepts to design a simple schema for our `university_db`. We'll define entities, attributes, keys, and relationships, and then translate this design into SQL `CREATE TABLE` statements using PostgreSQL syntax.
Scenario:
We need to store information about Students, Courses, and which student is enrolled in which course.
Entities:
- `Students`
- `Courses`
- `Enrollments` (associative entity for the M:N relationship between Students and Courses)
Attributes and Keys:
- `Students` Table:
    - `student_id`: Primary Key (PK), uniquely identifies a student. Let's use a serial integer type that auto-increments.
    - `first_name`: Text, cannot be null.
    - `last_name`: Text, cannot be null.
    - `email`: Text, must be unique, cannot be null.
    - `enrollment_date`: Date, default to the current date.
- `Courses` Table:
    - `course_id`: Primary Key (PK), uniquely identifies a course. Let's use a serial integer type.
    - `course_code`: Text, unique identifier like 'CS101', must be unique, cannot be null.
    - `course_title`: Text, cannot be null.
    - `credits`: Integer, must be positive.
- `Enrollments` Table:
    - `enrollment_id`: Primary Key (PK), uniquely identifies an enrollment record. Serial integer type.
    - `student_id`: Foreign Key (FK) referencing `Students(student_id)`. Cannot be null. If a student is deleted, their enrollments should also be deleted (CASCADE).
    - `course_id`: Foreign Key (FK) referencing `Courses(course_id)`. Cannot be null. If a course is deleted, related enrollments should also be deleted (CASCADE).
    - `enrollment_timestamp`: Timestamp with time zone, default to the current timestamp.
    - Constraint: A student should not be able to enroll in the same course more than once. We need a UNIQUE constraint on the combination of `student_id` and `course_id`.
Entity-Relationship Diagram (Text-Based Representation):
```
+---------------+       +------------------+       +--------------+
|   Students    |       |   Enrollments    |       |   Courses    |
+---------------+       +------------------+       +--------------+
| PK student_id |---|<--| FK student_id    |       | PK course_id |
| first_name    |       | FK course_id     |-->|---| course_code  |
| last_name     |       | PK enrollment_id |       | course_title |
| email (UQ)    |       | enrollment_ts    |       | credits      |
| enrollment_dt |       | UQ (student_id,  |       +--------------+
+---------------+       |     course_id)   |
                        +------------------+
```

- `PK`: Primary Key
- `FK`: Foreign Key
- `UQ`: Unique Constraint
- `---|<`: Represents a one-to-many relationship (one Student can have many Enrollments; one Course can have many Enrollments).
Steps:
- Connect to Your Database: If you are not already connected, connect to `university_db` as the `university_admin` user you created in the previous workshop: `psql -U university_admin -d university_db -h localhost -W`. Enter the password when prompted.
Create the
Students
Table: Execute the following SQL command inpsql
. Note the use ofSERIAL
for auto-incrementing PK,VARCHAR
for text with length limits (adjust as needed),UNIQUE
,NOT NULL
, andDEFAULT
.If successful, you should seeCREATE TABLE Students ( student_id SERIAL PRIMARY KEY, first_name VARCHAR(100) NOT NULL, last_name VARCHAR(100) NOT NULL, email VARCHAR(255) UNIQUE NOT NULL, enrollment_date DATE DEFAULT CURRENT_DATE );
CREATE TABLE
. You can verify by typing\dt
(shows the table) or\d Students
(shows table details). -
Create the
Courses
Table: Execute the following SQL command. Note theCHECK
constraint.You should seeCREATE TABLE Courses ( course_id SERIAL PRIMARY KEY, course_code VARCHAR(10) UNIQUE NOT NULL, course_title VARCHAR(200) NOT NULL, credits INTEGER CHECK (credits > 0) );
CREATE TABLE
. Verify with\dt
or\d Courses
. -
Create the
Enrollments
Table: Execute the following SQL command. This table linksStudents
andCourses
. Note theFOREIGN KEY
definitions withON DELETE CASCADE
and the compositeUNIQUE
constraint.You should seeCREATE TABLE Enrollments ( enrollment_id SERIAL PRIMARY KEY, student_id INTEGER NOT NULL, course_id INTEGER NOT NULL, enrollment_timestamp TIMESTAMPTZ DEFAULT CURRENT_TIMESTAMP, CONSTRAINT fk_student FOREIGN KEY(student_id) REFERENCES Students(student_id) ON DELETE CASCADE, CONSTRAINT fk_course FOREIGN KEY(course_id) REFERENCES Courses(course_id) ON DELETE CASCADE, -- Ensure a student cannot enroll in the same course multiple times CONSTRAINT unique_student_course UNIQUE (student_id, course_id) );
CREATE TABLE
. Verify with\dt
or\d Enrollments
. Notice how\d Enrollments
shows the primary key, foreign key constraints, and the unique constraint. -
Review the Schema: Use
psql
commands to review the tables you created:\dt
: List all tables (courses
,enrollments
,students
).\d Students
: Describe theStudents
table structure, including columns, types, constraints, and indexes (like the primary key and unique email index).\d Courses
: Describe theCourses
table.\d Enrollments
: Describe theEnrollments
table, paying attention to the foreign key constraints and the unique constraint.
You have now successfully designed a basic relational database schema and implemented it using SQL DDL commands in PostgreSQL. This structure minimizes redundancy (student/course details are stored only once) and enforces integrity (valid student/course references, unique enrollments). You are now ready to populate these tables with data. Remember to exit `psql` (`\q`) when finished.
3. Basic SQL Operations SELECT INSERT UPDATE DELETE
Now that we have designed and created the structure (schema) for our `university_db`, the next step is to interact with the data within the tables. The core SQL commands for this belong to the Data Manipulation Language (DML): `INSERT`, `SELECT`, `UPDATE`, and `DELETE`.
1. INSERT Adding Data
The `INSERT` statement is used to add new rows (records) into a table.
- Syntax (Specifying Columns): This is the recommended form, as it is independent of column order.
- Syntax (Implicit Columns): Requires a value for every column, in table order. Less robust if the table structure changes.
- Inserting Multiple Rows: A single `INSERT` can add several rows at once by listing multiple value tuples.
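As an illustration, here are hedged sketches of the three forms against the `Students` table from our schema. The sample names and emails are invented, and the implicit form assumes the column order from the earlier workshop (`student_id`, `first_name`, `last_name`, `email`, `enrollment_date`):

```sql
-- Specifying columns (recommended): order-independent; omitted columns get defaults
INSERT INTO Students (first_name, last_name, email)
VALUES ('Jane', 'Doe', 'jane.doe@example.edu');

-- Implicit columns: a value for every column, in table order
-- (DEFAULT fills the generated id and default date)
INSERT INTO Students
VALUES (DEFAULT, 'John', 'Doe', 'john.doe@example.edu', DEFAULT);

-- Multiple rows in one statement
INSERT INTO Students (first_name, last_name, email)
VALUES ('Ann', 'Lee', 'ann.lee@example.edu'),
       ('Tom', 'Ray', 'tom.ray@example.edu');
```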
- Important Notes:
  - Values must match the data types of the corresponding columns.
  - Text (string) values must be enclosed in single quotes (e.g., `'John Doe'`).
  - Numeric values are not quoted (e.g., `123`, `99.9`).
  - Date/timestamp values are usually quoted (e.g., `'2023-10-27'`, `'2023-10-27 10:30:00'`). The exact format can depend on the database configuration, but ISO 8601 format (`YYYY-MM-DD`) is generally standard.
  - You don't need to specify values for columns with `DEFAULT` values (like our `enrollment_date` or `enrollment_timestamp`) or for `SERIAL`/`IDENTITY` columns (like our primary keys); the database generates them automatically.
  - `INSERT` statements must satisfy all constraints (`NOT NULL`, `UNIQUE`, `CHECK`, `FOREIGN KEY`). If a constraint is violated, the `INSERT` fails.
2. SELECT Retrieving Data
The `SELECT` statement is arguably the most powerful and frequently used SQL command. It retrieves data from one or more tables based on specified criteria.
- Basic Syntax: name the columns you want, then the table.
- Selecting All Columns: Use `*` (asterisk) as a shortcut, but explicitly listing columns is often better for clarity and performance, especially in application code.
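For example, sketched against the tables from our schema:

```sql
-- Basic form: list only the columns you need
SELECT first_name, last_name FROM Students;

-- Shortcut: all columns (fine interactively, avoid in application code)
SELECT * FROM Courses;
```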
- Filtering Data with `WHERE`: The `WHERE` clause filters rows based on a condition.
  ```sql
  SELECT column1, column2 FROM table_name WHERE condition;
  -- e.g., WHERE credits > 3 or WHERE first_name = 'Alice'
  ```
  - Comparison Operators: `=`, `!=` or `<>`, `>`, `<`, `>=`, `<=`
  - Logical Operators: `AND`, `OR`, `NOT`
  - Other Operators: `BETWEEN`, `IN` (matches values in a list), `LIKE` (pattern matching, with the `%` wildcard for multiple characters and `_` for a single character), `IS NULL`, `IS NOT NULL`.
- Sorting Results with `ORDER BY`: Sorts the retrieved rows based on one or more columns, ascending (`ASC`, the default) or descending (`DESC`).
- Limiting Results with `LIMIT`: Restricts the number of rows returned (useful for pagination or finding the top N results).
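A short sketch of both clauses, using the tables from our schema:

```sql
-- Sort students by last name (add DESC to reverse the order)
SELECT first_name, last_name FROM Students ORDER BY last_name ASC;

-- The two highest-credit courses
SELECT course_code, credits FROM Courses ORDER BY credits DESC LIMIT 2;
```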
- Aggregating Data with `GROUP BY` and Aggregate Functions: Groups rows that share the same values in specified columns and calculates aggregate statistics for each group.
  - Common Aggregate Functions: `COUNT()` (counts rows), `SUM()`, `AVG()` (average), `MAX()`, `MIN()`.
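For instance, counting enrollments per course in our schema:

```sql
-- One row per course_id, with the number of enrollment records for each
SELECT course_id, COUNT(*) AS enrollment_count
FROM Enrollments
GROUP BY course_id;
```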
- Filtering Groups with `HAVING`: Similar to `WHERE`, but filters groups after aggregation, based on the results of aggregate functions; `WHERE` filters rows before aggregation.
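Extending the aggregation example above, `HAVING` keeps only the groups that pass a condition on the aggregate:

```sql
-- Only courses with more than one enrollment
SELECT course_id, COUNT(*) AS enrollment_count
FROM Enrollments
GROUP BY course_id
HAVING COUNT(*) > 1;
```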
- Joining Tables: `SELECT` becomes truly powerful when combining data from multiple related tables using `JOIN` clauses (covered in more advanced SQL sections). The basic idea is to link tables based on matching foreign key and primary key values.
3. UPDATE Modifying Data
The `UPDATE` statement is used to modify existing data in a table.
- Syntax: set one or more columns to new values for the rows matched by the `WHERE` clause.
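A generic skeleton of the statement:

```sql
UPDATE table_name
SET column1 = value1,
    column2 = value2
WHERE condition;  -- only the matching rows are changed
```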
- CRITICAL WARNING: If you omit the `WHERE` clause in an `UPDATE` statement, ALL ROWS in the table will be updated! Always double-check your `WHERE` clause before executing an `UPDATE`. It is often wise to run a `SELECT` statement with the same `WHERE` clause first to verify which rows will be affected.
- Example: Change the email address for a specific student.
  ```sql
  -- First, verify the student exists and you have the right ID
  SELECT student_id, first_name, last_name, email
  FROM Students
  WHERE email = 'old.email@example.com';

  -- If correct, perform the update
  UPDATE Students
  SET email = 'new.email@example.com'
  WHERE student_id = 123; -- Assuming 123 is the ID found above
  ```
4. DELETE Removing Data
The `DELETE` statement is used to remove one or more rows from a table.
- Syntax: remove the rows matched by the `WHERE` clause.
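A generic skeleton of the statement:

```sql
DELETE FROM table_name
WHERE condition;  -- without WHERE, every row in the table is removed
```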
- CRITICAL WARNING: Similar to `UPDATE`, if you omit the `WHERE` clause in a `DELETE` statement, ALL ROWS in the table will be deleted! Always be extremely careful and verify your `WHERE` clause. Consider using transactions (`BEGIN`, `COMMIT`, `ROLLBACK`) for potentially risky delete operations.
- Example: Remove a specific enrollment record.
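A hedged sketch against our `Enrollments` table (the specific IDs are illustrative):

```sql
-- First, verify exactly which row will be removed
SELECT enrollment_id, student_id, course_id
FROM Enrollments
WHERE student_id = 1 AND course_id = 2;

-- Then delete that specific enrollment
DELETE FROM Enrollments
WHERE student_id = 1 AND course_id = 2;
```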
- Note on Foreign Keys: If other tables have foreign key constraints referencing the row you are trying to delete, the `DELETE` operation may fail unless the constraint was defined with `ON DELETE CASCADE` (which automatically deletes the referencing rows) or `ON DELETE SET NULL` (which sets the foreign key column in the referencing rows to NULL). Our `Enrollments` table uses `ON DELETE CASCADE`, so deleting a student automatically deletes their enrollments.
Mastering these four DML commands (`INSERT`, `SELECT`, `UPDATE`, `DELETE`) is essential for effectively managing data within a relational database. Always pay close attention to the `WHERE` clauses in `UPDATE` and `DELETE` statements to avoid unintended data modification or loss.
Workshop Populating and Querying the University Database
In this workshop, we will use the DML commands (`INSERT`, `SELECT`, `UPDATE`, `DELETE`) learned in the previous section to populate our `university_db` tables (`Students`, `Courses`, `Enrollments`) with sample data, and then practice retrieving and manipulating that data.
Prerequisites:
- PostgreSQL server running.
- The `university_db` database and tables (`Students`, `Courses`, `Enrollments`) created as per the previous workshop.
- Connected to `university_db` as `university_admin` using `psql`.
Steps:
- Insert Data into `Students` Table: Let's add a few students. Notice we don't specify `student_id` (it's `SERIAL`) or `enrollment_date` (it uses `DEFAULT CURRENT_DATE`).
  ```sql
  INSERT INTO Students (first_name, last_name, email) VALUES
  ('Alice', 'Smith', 'alice.s@example.edu'),
  ('Bob', 'Johnson', 'bob.j@example.com'),
  ('Charlie', 'Williams', 'charlie.w@example.edu'),
  ('David', 'Brown', 'david.b@example.net');
  ```
  You should see `INSERT 0 4`, indicating 4 rows were inserted.
- Verify Student Data: Use `SELECT` to see the data you just inserted. Note how `student_id` and `enrollment_date` were automatically populated. (Expected output: 4 rows with IDs 1-4, the names/emails you entered, and the current date.)
- Insert Data into `Courses` Table: Add some courses. We don't specify `course_id` (it's `SERIAL`).
  ```sql
  INSERT INTO Courses (course_code, course_title, credits) VALUES
  ('CS101', 'Introduction to Computer Science', 3),
  ('MATH202', 'Calculus II', 4),
  ('PHYS101', 'General Physics I', 4),
  ('ENG101', 'Composition I', 3);
  ```
  You should see `INSERT 0 4`.
- Verify Course Data: (Expected output: 4 rows with IDs 1-4 and the course details, matching the input.)
- Insert Data into `Enrollments` Table: Now, let's enroll students in courses. We need the `student_id` and `course_id` values. Assuming the IDs were generated sequentially starting from 1:
  - Alice (ID 1) enrolls in CS101 (ID 1) and MATH202 (ID 2).
  - Bob (ID 2) enrolls in CS101 (ID 1).
  - Charlie (ID 3) enrolls in MATH202 (ID 2) and PHYS101 (ID 3).
  - David (ID 4) enrolls in ENG101 (ID 4).
  ```sql
  INSERT INTO Enrollments (student_id, course_id) VALUES
  (1, 1), -- Alice in CS101
  (1, 2), -- Alice in MATH202
  (2, 1), -- Bob in CS101
  (3, 2), -- Charlie in MATH202
  (3, 3), -- Charlie in PHYS101
  (4, 4); -- David in ENG101
  ```
  You should see `INSERT 0 6`.
- Verify Enrollment Data: (Expected output: 6 rows with enrollment IDs 1-6, the student/course ID pairs, and timestamps.)
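The verification steps above can be run with simple `SELECT` statements, for example:

```sql
SELECT * FROM Students;     -- 4 rows; IDs and enrollment_date auto-filled
SELECT * FROM Courses;      -- 4 rows with the course details
SELECT * FROM Enrollments;  -- 6 rows with timestamps
```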
- Practice `SELECT` Queries: Now, let's retrieve specific information.
  - List all student names (first and last).
  - Find all courses worth 4 credits.
  - Find students whose email ends with '.edu'.
  - List students ordered by last name.
  - Find the course IDs Alice Smith (student_id 1) is enrolled in.
  - Count how many enrollment records exist in total.
  - Count students per course (using `GROUP BY`).
  - (Slightly more advanced) List the names of students enrolled in CS101 (course_id 1). This requires a JOIN, which links tables. (Expected output: Alice Smith, Bob Johnson.)
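One possible set of queries matching the bullets above, sketched against this workshop's schema and IDs:

```sql
-- Student names
SELECT first_name, last_name FROM Students;

-- Courses worth 4 credits
SELECT * FROM Courses WHERE credits = 4;

-- Emails ending in '.edu'
SELECT first_name, last_name, email FROM Students WHERE email LIKE '%.edu';

-- Students ordered by last name
SELECT first_name, last_name FROM Students ORDER BY last_name;

-- Courses Alice (student_id 1) is enrolled in
SELECT course_id FROM Enrollments WHERE student_id = 1;

-- Total number of enrollment records
SELECT COUNT(*) FROM Enrollments;

-- Enrollment counts per course
SELECT course_id, COUNT(*) AS student_count FROM Enrollments GROUP BY course_id;

-- Names of students enrolled in CS101 (course_id 1)
SELECT s.first_name, s.last_name
FROM Students s
JOIN Enrollments e ON e.student_id = s.student_id
WHERE e.course_id = 1;
```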
- Practice `UPDATE`: Let's assume Bob Johnson changed his email. His `student_id` is 2.
  - First, verify the current email. (Expected: bob.j@example.com)
  - Now, update it. You should see `UPDATE 1`.
  - Verify the change. (Expected: robert.johnson@example.com)
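The three sub-steps above can be run as:

```sql
SELECT email FROM Students WHERE student_id = 2;  -- verify the current value

UPDATE Students
SET email = 'robert.johnson@example.com'
WHERE student_id = 2;                             -- expect: UPDATE 1

SELECT email FROM Students WHERE student_id = 2;  -- confirm the change
```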
- Practice `DELETE`: Let's say Alice Smith (student_id 1) drops MATH202 (course_id 2).
  - First, verify the enrollment exists. (Expected: one row showing this enrollment.)
  - Now, delete it. You should see `DELETE 1`.
  - Verify it's gone. (Expected: only the enrollment row for course_id 1 should remain for student_id 1.)
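The three sub-steps above can be run as:

```sql
SELECT * FROM Enrollments WHERE student_id = 1 AND course_id = 2;  -- verify it exists

DELETE FROM Enrollments WHERE student_id = 1 AND course_id = 2;    -- expect: DELETE 1

SELECT * FROM Enrollments WHERE student_id = 1;                    -- only course_id 1 remains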
- (Optional) Test `ON DELETE CASCADE`: Let's delete David Brown (student_id 4). Because of `ON DELETE CASCADE` on the `Enrollments` table, his enrollment record(s) should also be deleted automatically.
  - Verify David's enrollment. (Expected: one row for course_id 4.)
  - Delete David from the Students table. You should see `DELETE 1`.
  - Verify David is gone from Students. (Expected: 0 rows.)
  - Verify David's enrollment is also gone. (Expected: 0 rows.)
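The cascade test can be run as:

```sql
SELECT * FROM Enrollments WHERE student_id = 4;  -- one row (course_id 4)

DELETE FROM Students WHERE student_id = 4;       -- expect: DELETE 1

SELECT * FROM Students WHERE student_id = 4;     -- 0 rows
SELECT * FROM Enrollments WHERE student_id = 4;  -- 0 rows: the cascade removed it
```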
This workshop provided hands-on practice with the fundamental DML operations. You learned how to insert data while respecting constraints, query data using various `SELECT` clauses (`WHERE`, `ORDER BY`, `GROUP BY`), modify data safely using `UPDATE` with `WHERE`, and remove data using `DELETE` with `WHERE`, while observing the effect of foreign key constraints like `ON DELETE CASCADE`. Feel free to experiment further with different queries. Exit `psql` with `\q` when you are done.
4. Introduction to NoSQL Databases
While relational databases (using SQL) are powerful and suitable for a vast range of applications involving structured data and requiring strong consistency (ACID properties), they aren't always the best fit for every scenario. The rise of the internet, social media, mobile applications, and the Internet of Things (IoT) generated unprecedented volumes, velocities, and varieties of data (often called "Big Data"). Handling this scale and diversity exposed some limitations of traditional relational databases, leading to the development of NoSQL databases.
NoSQL stands for "Not Only SQL". It's an umbrella term for database management systems that differ from classic relational databases in significant ways, particularly in their data models, scaling strategies, and consistency guarantees. They emerged to address needs like:
- Massive Scalability: Handling terabytes or petabytes of data distributed across many servers (horizontal scaling). Relational databases traditionally scale vertically (bigger server), which has limits.
- High Availability: Ensuring the database remains operational even if some servers fail.
- Flexible Data Models: Managing unstructured or semi-structured data (like user profiles with varying fields, sensor readings, social media posts) without a rigid, predefined schema.
- High Performance: Achieving very high read/write throughput for specific access patterns.
The CAP Theorem
Understanding NoSQL often involves understanding the CAP theorem, a fundamental concept in distributed systems formulated by Eric Brewer. It states that a distributed data store can only provide two of the following three guarantees simultaneously:
- Consistency (C): Every read receives the most recent write or an error. All nodes in the system see the same data at the same time. (This is different from, but related to, the 'C' in ACID.)
- Availability (A): Every request receives a (non-error) response, without guaranteeing it contains the most recent write. The system remains operational even if some nodes fail.
- Partition Tolerance (P): The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes (i.e., the network partitions).
In a distributed system, network partitions (P) are a reality that must be tolerated. Therefore, the CAP theorem essentially forces a trade-off between Consistency (C) and Availability (A) when a partition occurs.
- CP Systems (Consistency + Partition Tolerance): These systems prioritize consistency. If a partition occurs, they might become unavailable for some requests (e.g., writes, or reads from the affected partition) to avoid returning potentially stale or inconsistent data. Many relational databases configured for replication fall here, as well as some NoSQL systems like HBase.
- AP Systems (Availability + Partition Tolerance): These systems prioritize availability. Even during a partition, all nodes remain available for reads and writes. However, this means some reads might return stale data until the partition heals and data synchronization catches up. This is often referred to as eventual consistency. Many NoSQL databases (like Cassandra, DynamoDB, Riak, Couchbase) are designed as AP systems.
BASE Properties
In contrast to the strict ACID guarantees of relational databases, many NoSQL databases (especially AP systems) adhere to the BASE philosophy:
- Basically Available: The system guarantees availability (as per the CAP theorem).
- Soft State: The state of the system may change over time, even without input, due to eventual consistency. Data might be inconsistent temporarily across nodes.
- Eventually Consistent: If no new updates are made to a given data item, eventually all accesses to that item will return the last updated value. Data gradually propagates through the distributed system.
This trade-off (sacrificing immediate consistency for availability and scalability) is acceptable or even desirable for many modern web applications where temporary inconsistencies (e.g., seeing an old profile picture for a few seconds) are less critical than the system being always available and scalable.
Types of NoSQL Databases
NoSQL is not a single technology but a category. The main types differ significantly in their data models and use cases:
-
Key-Value Stores:
- Model: Simplest model. Stores data as a collection of key-value pairs, like a dictionary or hash map. The 'value' can be anything (a string, number, JSON object, image, video). The database operations are typically limited to
GET(key)
,PUT(key, value)
,DELETE(key)
. - Pros: Extremely fast for simple lookups, highly scalable, simple model.
- Cons: Querying by value is often inefficient or impossible; relationships between data items aren't inherently managed.
- Examples: Redis, Memcached, Riak KV, Amazon DynamoDB (also a document store).
- Use Cases: Caching, session management, user profiles, real-time leaderboards.
-
Document Databases:
- Model: Stores data in documents, typically using formats like JSON (JavaScript Object Notation) or BSON (Binary JSON) or XML. Documents are self-contained units of data, and unlike rows in relational tables, documents within the same collection can have different structures (schema flexibility). Documents can contain nested structures (arrays and sub-documents).
- Pros: Flexible schema, natural mapping to object-oriented programming structures, querying based on document content is possible and often indexed. Good general-purpose NoSQL choice.
- Cons: Can lead to data redundancy if not modeled carefully; complex queries involving joins across different document types can be less efficient than in SQL.
- Examples: MongoDB, Couchbase, ArangoDB (multi-model), Elasticsearch (also search engine).
- Use Cases: Content management systems, user profiles, product catalogs, blogging platforms, mobile app data.
-
Column-Family (or Wide-Column) Stores:
- Model: Stores data in tables with rows and columns, but optimized for column-based access. Data for a particular column is stored contiguously on disk. Rows can have different columns, and columns can be grouped into "column families". Think of it as a two-dimensional key-value store (row key -> column family -> column -> value/timestamp).
- Pros: Extremely scalable for huge datasets (petabytes), high write performance, efficient querying over specific columns across many rows. Good for aggregation queries.
- Cons: Can be more complex to model and manage than other types. Read performance can vary depending on access patterns. Eventual consistency is common.
- Examples: Apache Cassandra, Apache HBase, Google Bigtable.
- Use Cases: Big data analytics, logging, time-series data, recommendation engines, IoT data streams.
-
Graph Databases:
- Model: Designed specifically to store and navigate relationships. Data is represented as nodes (entities) and edges (relationships connecting nodes). Both nodes and edges can have properties (attributes).
- Pros: Excellent performance for traversing complex relationships (e.g., "find friends of friends"). Intuitive model for highly interconnected data.
- Cons: Can be less efficient for queries involving aggregation over the entire dataset compared to other models. Scaling strategies can differ.
- Examples: Neo4j, ArangoDB (multi-model), JanusGraph, Amazon Neptune.
- Use Cases: Social networks, recommendation engines (e.g., "people who bought this also bought..."), fraud detection, network/IT operations mapping, knowledge graphs.
Choosing Between SQL and NoSQL
The choice isn't always SQL vs. NoSQL; sometimes the best solution involves using both (polyglot persistence).
-
Choose SQL/Relational when:
- Data is structured and schema is well-defined and relatively stable.
- Strong consistency (ACID) is paramount (e.g., financial transactions).
- You need complex querying capabilities involving multiple tables (JOINs).
- Mature technology and wide tooling support are important.
-
Choose NoSQL when:
- Dealing with massive amounts of data requiring horizontal scaling.
- High availability is more critical than immediate consistency.
- Data is unstructured or semi-structured, or the schema evolves rapidly.
- You need extremely high read/write throughput for specific (often simpler) query patterns.
- Your data model naturally fits one of the NoSQL types (key-value, document, graph, column-family).
Understanding the different NoSQL paradigms opens up new possibilities for handling diverse data challenges, especially in the context of modern, large-scale applications running on distributed Linux environments.
Workshop Setting Up and Using Redis as a Key Value Store
This workshop introduces you to Redis, a popular open-source, in-memory data structure store often used as a high-performance key-value database, cache, and message broker. We will install Redis on Linux, connect using its command-line interface (`redis-cli`), and perform basic key-value operations.
Prerequisites:
- A Linux system (e.g., Ubuntu, Debian, Fedora, CentOS/RHEL).
- Access to the terminal with `sudo` privileges.
- Internet connection.
Steps:
- Update Package Lists:
  - On Debian/Ubuntu: use `apt`.
  - On Fedora/CentOS/RHEL: use `dnf`.
- Install Redis Server:
  - On Debian/Ubuntu: install the `redis-server` package.
  - On Fedora: install the `redis` package.
  - On CentOS/RHEL (using the EPEL repository): the Extra Packages for Enterprise Linux (EPEL) repository often provides Redis.
- Start and Enable Redis Service: On most systems using `systemd`, you need to start the service and enable it to launch on boot. Debian/Ubuntu often starts it automatically post-installation.
- Verify Redis Service Status: Look for "active (running)" in the service status output.
- Connect Using `redis-cli`: The Redis installation includes a command-line client. You should see a prompt like `127.0.0.1:6379>`, indicating you're connected to the Redis server running on the default host (`127.0.0.1`) and port (`6379`). You can test the connection with the `PING` command; Redis should reply with `PONG`.
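The steps above can be sketched for Debian/Ubuntu as follows (package and service names differ slightly per distribution; on Fedora the package and service are both `redis`):

```shell
# Update package lists and install Redis (Debian/Ubuntu)
sudo apt update
sudo apt install -y redis-server

# Start now and enable at boot (systemd)
sudo systemctl start redis-server
sudo systemctl enable redis-server
sudo systemctl status redis-server   # look for "active (running)"

# Connect and test
redis-cli ping                       # should print PONG
redis-cli                            # opens the 127.0.0.1:6379> prompt
```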
- Basic Key-Value Operations:
  - `SET key value`: Stores a string value associated with a key. If the key already exists, it is overwritten. (Expected output: `OK`.)
  - `GET key`: Retrieves the string value associated with a key. Returns `(nil)` if the key doesn't exist.
  - `KEYS pattern`: Lists keys matching a pattern (`*` matches any sequence, `?` matches a single character). Warning: use `KEYS` with caution on production databases with many keys, as it can be slow and block other operations; it's fine for learning/debugging.
  - `EXISTS key`: Checks if a key exists. Returns `(integer) 1` if it exists, `(integer) 0` otherwise.
  - `DEL key [key ...]`: Deletes one or more keys and their associated values. Returns the number of keys deleted.
  - `INCR key` / `DECR key`: Atomically increments or decrements the integer value stored at a key. If the key doesn't exist, it is created with value 0 first. Useful for counters.
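A sample `redis-cli` session exercising the commands above. The `user:100:username` and `app:config:api_key` keys and their values follow this workshop's examples; `page:views` is a hypothetical counter key:

```
127.0.0.1:6379> SET user:100:username alice_wonder
OK
127.0.0.1:6379> SET app:config:api_key xyz789abc
OK
127.0.0.1:6379> GET user:100:username
"alice_wonder"
127.0.0.1:6379> KEYS user:*
1) "user:100:username"
127.0.0.1:6379> EXISTS user:100:username
(integer) 1
127.0.0.1:6379> DEL app:config:api_key
(integer) 1
127.0.0.1:6379> INCR page:views
(integer) 1
```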
- Setting Key Expiration: Redis is often used for caching, where data should automatically expire.
  - `EXPIRE key seconds`: Sets a timeout (in seconds) on an existing key. Returns `(integer) 1` when the expiration is set. Check the remaining time with `TTL key`; once the timeout elapses, `GET` returns `(nil)`.
  - `SETEX key seconds value`: Atomically sets a key to a value with an expiration time.
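A sketch of an expiration session; the key names are hypothetical and the `TTL` values shown are illustrative (they count down in real time):

```
127.0.0.1:6379> SET session:abc123 active
OK
127.0.0.1:6379> EXPIRE session:abc123 60
(integer) 1
127.0.0.1:6379> TTL session:abc123
(integer) 58
127.0.0.1:6379> SETEX cache:front_page 300 "cached page body"
OK
127.0.0.1:6379> TTL cache:front_page
(integer) 300
```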
- Other Data Structures (Brief Mention): Redis is more than just key-value; it supports other structures like Lists (`LPUSH`, `RPUSH`, `LRANGE`), Hashes (`HSET`, `HGET`, `HGETALL`, good for storing object fields), Sets (`SADD`, `SMEMBERS`, `SINTER`), and Sorted Sets (`ZADD`, `ZRANGE`, useful for leaderboards). Explore these with `HELP command_name` inside `redis-cli`.
- Exit `redis-cli`: Type `exit` or press `Ctrl+D`.
You have successfully installed Redis, connected to it, and performed fundamental key-value operations, including setting expirations. This demonstrates the simplicity and speed of key-value stores like Redis, commonly used in Linux environments for caching, session management, and other high-performance tasks.
5. Database Administration Basics on Linux
Managing databases involves more than just designing schemas and running queries. Database Administration (DBA) encompasses tasks related to maintaining the health, security, performance, and reliability of database systems. While DBA is a deep field, understanding the basics is crucial for anyone running database-dependent applications on Linux. We'll focus on common tasks, primarily using PostgreSQL as an example, but the concepts apply broadly.
User and Permission Management
Controlling who can access the database and what they can do is fundamental for security. Databases use roles (often called users) and privileges.
- Roles/Users: Represent entities that can connect to the database and own objects. Roles can have attributes like login permission (`LOGIN`), superuser status (`SUPERUSER`), the ability to create databases (`CREATEDB`), or the ability to create roles (`CREATEROLE`).
  - Creating a User (PostgreSQL)
  - Creating a Role (Group): Roles can also act like groups to manage permissions collectively.
  - Assigning a User to a Role
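In PostgreSQL these three steps might look as follows (`myapp_user` and `readonly_group` are the example names used later in this section; the password is a placeholder):

```sql
-- Create a login-capable user
CREATE ROLE myapp_user WITH LOGIN PASSWORD 'change_me';

-- Create a group role (no login)
CREATE ROLE readonly_group;

-- Add the user to the group
GRANT readonly_group TO myapp_user;
```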
- Privileges: Permissions granted on specific database objects (databases, schemas, tables, columns, sequences, functions). Common privileges include `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `TRUNCATE`, `REFERENCES`, `TRIGGER`, `CREATE`, `CONNECT`, `TEMPORARY`, `EXECUTE`, and `USAGE`.
  - Granting Privileges (PostgreSQL):
    ```sql
    -- Grant connect access to a database
    GRANT CONNECT ON DATABASE mydatabase TO myapp_user;

    -- Grant usage access to a schema (necessary before granting table privileges)
    GRANT USAGE ON SCHEMA public TO myapp_user;
    GRANT USAGE ON SCHEMA public TO readonly_group; -- Grant to role

    -- Grant specific privileges on a table
    GRANT SELECT, INSERT, UPDATE, DELETE ON TABLE products TO myapp_user;
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO readonly_group; -- Grant SELECT on all current tables

    -- Set default privileges for future tables (important!)
    ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO readonly_group;
    ```
  - Revoking Privileges (PostgreSQL)
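Revocation mirrors `GRANT`; for example, against the objects and roles used above:

```sql
-- Take back write access on a table
REVOKE INSERT, UPDATE, DELETE ON TABLE products FROM myapp_user;

-- Remove all database-level privileges from a role
REVOKE ALL PRIVILEGES ON DATABASE mydatabase FROM myapp_user;
```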
- Linux Interaction: Database user management is distinct from Linux system users, although authentication methods like `peer` (used by default for `postgres` locally) link them. Network connections typically use password methods (`md5` or `scram-sha-256` in PostgreSQL) defined in the `pg_hba.conf` file.
Basic Backup and Restore Strategies
Data loss can be catastrophic. Regular backups are essential. Strategies vary, but common types include:
- Logical Backups: Export the database structure (DDL) and data (DML `INSERT` statements or copy commands) into a human-readable (or compressed) file. Platform-independent, but can be slower to restore for very large databases.
  - PostgreSQL (`pg_dump`, `pg_dumpall`):
    ```shell
    # Dump a single database (common)
    pg_dump -U db_user -h localhost -W -F c -f mydatabase_backup.dump mydatabase
    # -U: DB user, -h: host, -W: prompt for password
    # -F c: custom format (compressed, flexible)
    # -f: output file
    # mydatabase: database name

    # Dump all databases (roles, tablespaces) - needs superuser
    pg_dumpall -U postgres -h localhost -W -f alldatabases.sql
    ```
  - MySQL/MariaDB (`mysqldump`)
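The MySQL/MariaDB equivalent might look like this (user and database names are placeholders; `-p` prompts for the password):

```shell
# Dump a single database to a plain SQL file
mysqldump -u db_user -p mydatabase > mydatabase_backup.sql

# Dump all databases (needs a privileged account)
mysqldump -u root -p --all-databases > alldatabases.sql
```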
- Physical Backups: Copy the actual data files that make up the database. Faster restore times, especially for large databases, but often platform/version dependent and more complex to set up correctly (requires a database shutdown, or specific tools like `pg_basebackup` for online backups).
Restoring Data:
- PostgreSQL (`psql`, `pg_restore`):
  ```shell
  # Restore from a plain SQL dump
  psql -U db_user -h localhost -W -d mydatabase -f mydatabase_backup.sql

  # Restore from a custom-format dump (use pg_restore)
  # Assumes database 'mydatabase' exists and is empty
  pg_restore -U db_user -h localhost -W -d mydatabase -v mydatabase_backup.dump
  # -d: database to restore into
  # -v: verbose output
  ```
- MySQL/MariaDB (`mysql` client)
Backup Frequency and Retention: Depends on how critical the data is and how often it changes. Common strategies involve daily full backups, possibly supplemented by more frequent incremental backups or transaction log archiving (Point-in-Time Recovery - PITR). Store backups securely, preferably off-site or in a different location. Test your restore process regularly!
Monitoring Database Performance
Keeping an eye on performance helps identify bottlenecks and resource issues.
- Linux System Tools: Standard tools are invaluable.
  - `top` / `htop`: Monitor CPU and memory usage of database processes (e.g., `postgres`, `mysqld`).
  - `iostat`: Check disk I/O activity. High `iowait` can indicate disk bottlenecks.
  - `vmstat` / `free -h`: Monitor memory usage and swapping. Excessive swapping kills database performance.
  - `netstat` / `ss`: Check network connections to the database.
- Database Logs: Contain vital information about errors, slow queries, connection attempts, checkpoints, etc. The location depends on the distribution and configuration (e.g., `/var/log/postgresql/`, `/var/log/mysql/`). Configure logging levels appropriately (e.g., log slow queries above a certain threshold).
- Database-Specific Tools/Queries:
  - PostgreSQL:
    - `SELECT * FROM pg_stat_activity;`: Shows current connections and the queries being executed.
    - `EXPLAIN ANALYZE <your_query>;`: Shows the query plan and actual execution time; crucial for optimizing slow queries.
    - The `pg_stat_statements` extension: Tracks execution statistics for all queries.
    - Monitoring tools: pgAdmin, pganalyze, Datadog, etc.
  - MySQL/MariaDB:
    - `SHOW PROCESSLIST;`: Shows current connections/queries.
    - `EXPLAIN <your_query>;`: Shows the query execution plan.
    - The Performance Schema / `sys` schema: Provides detailed performance metrics.
    - Monitoring tools: MySQL Workbench, Percona Monitoring and Management (PMM), etc.
Configuration Files
Database behavior is controlled by configuration files. Editing them requires understanding the parameters, and changes usually require a database service restart to take effect. Always back up configuration files before editing!
- PostgreSQL:
  - `postgresql.conf`: Main server configuration (memory allocation such as `shared_buffers` and `work_mem`; logging; network settings; autovacuum settings). Often located within the data directory (e.g., `/var/lib/postgresql/<version>/main/`) or in `/etc/postgresql/<version>/main/`.
  - `pg_hba.conf`: Client authentication control (Host-Based Authentication). Defines which users/networks can connect using which methods (peer, md5, scram-sha-256, etc.). Crucial for security.
  - `pg_ident.conf`: Maps system usernames to database usernames (used for `ident` authentication).
- MySQL/MariaDB:
  - `my.cnf` or `my.ini`: Main configuration file (locations vary: `/etc/my.cnf`, `/etc/mysql/my.cnf`, `/etc/mysql/mysql.conf.d/`). Contains settings for storage engines (e.g., the InnoDB buffer pool size), logging, network ports, etc.
Basic Security Considerations
- Network Access: Use `pg_hba.conf` (PostgreSQL) or firewall rules (`ufw`, `firewalld`) together with bind-address settings (in `postgresql.conf` or `my.cnf`) to restrict connections to trusted hosts/networks only. Avoid exposing databases directly to the public internet unless absolutely necessary and properly secured.
- Strong Passwords: Enforce strong passwords for all database users. Use secure authentication methods (`scram-sha-256` is preferred in PostgreSQL).
- Principle of Least Privilege: Grant users only the permissions they absolutely need to perform their tasks. Avoid using superuser accounts for applications.
- Regular Updates: Keep the database server and the underlying Linux OS patched against known vulnerabilities.
- Encryption: Consider encrypting data at rest (at the filesystem level, or via Transparent Data Encryption if supported) and in transit (using SSL/TLS connections).
- Auditing: Configure logging to track important actions (logins, DDL changes, etc.).
Basic Linux DBA skills are essential for ensuring your databases run smoothly, securely, and reliably. Start with user management, backups, basic monitoring, and understanding key configuration files.
Workshop Backing Up and Restoring the PostgreSQL University Database
This workshop provides practical experience with fundamental backup and restore operations for your PostgreSQL `university_db`, using the standard `pg_dump` and `psql`/`pg_restore` command-line tools on Linux.
Prerequisites:
- PostgreSQL server running, with the `university_db` database populated from the previous workshop.
- Access to the Linux terminal.
- The password for the `university_admin` user (or the `postgres` superuser if needed).
- The `pg_dump` and `pg_restore`/`psql` command-line tools installed (usually included with the `postgresql-client` or main `postgresql` package).
Scenario:
We will create a backup of the `university_db`, simulate data loss by dropping the database, and then restore it from the backup.
Steps:
- Create a Backup Directory: It's good practice to store backups in a dedicated location.
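For example (the `~/db_backups` path is the one used in the following steps):

```shell
mkdir -p ~/db_backups
cd ~/db_backups
```
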
- Perform the Backup using `pg_dump`: We'll use the custom format (`-F c`), which is compressed and generally recommended, to back up `university_db` as the `university_admin` user.
  - `-U university_admin`: Connect as this PostgreSQL user.
  - `-h localhost`: Connect to the server on the local machine.
  - `-W`: Prompt for the `university_admin` password.
  - `-F c`: Use the custom archive format (compressed).
  - `-f university_db_backup.dump`: Write the output to this file.
  - `university_db`: The name of the database to dump.
  Enter the password for `university_admin` when prompted. If successful, the command completes without errors and you'll have a file named `university_db_backup.dump` in your `~/db_backups` directory.
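Assembled from the flags listed above, the full command is:

```shell
cd ~/db_backups
pg_dump -U university_admin -h localhost -W -F c -f university_db_backup.dump university_db
```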
3. Examine the Backup File (Optional): Since it's a custom format, you can't view it directly as SQL text. However, `pg_restore` can list its contents: the items archived in the dump file (schema definitions, table data, constraints, etc.).
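For instance, `pg_restore`'s `-l` (list) mode prints the archive's table of contents without restoring anything:

```shell
# List the items stored in the custom-format dump
pg_restore -l university_db_backup.dump
```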
4. Simulate Data Loss (Drop the Database): WARNING: This step permanently deletes the database. Ensure your backup completed successfully in Step 2 before proceeding. To drop a database, you usually need higher privileges (like the `postgres` superuser) or to be the database owner (`university_admin` should work here, but sometimes superuser access is needed if there are active connections). You also cannot be connected to the database you intend to drop, so connect to the default `postgres` database first, using the `postgres` Linux user. Now, inside `psql`, connected as `postgres` to the `postgres` database, drop `university_db`. You might get an error if other users are connected; if so, you may need to terminate connections first (an advanced topic) or ensure no other `psql` sessions or applications are using `university_db`. If successful, you'll see `DROP DATABASE`. Verify the database is gone by listing databases with `\l`; `university_db` should no longer be listed. Then exit `psql` with `\q`.
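One way to run the whole sequence, assuming the common Linux setup where the `postgres` OS user can connect via peer authentication (adjust the `sudo` usage to your environment):

```shell
# Connect to the default 'postgres' database as the postgres superuser,
# drop university_db, then list the remaining databases
sudo -u postgres psql -d postgres <<'SQL'
DROP DATABASE university_db;
\l
SQL
```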
5. Recreate the Empty Database: Before restoring, the target database needs to exist. We'll recreate it, ensuring the original owner (`university_admin`) is set; again, use the `postgres` superuser for this. Inside `psql`, issue the `CREATE DATABASE` command; you should see `CREATE DATABASE` echoed on success. Verify its existence with `\l`; `university_db` should be listed again, owned by `university_admin`. Then exit `psql` with `\q`.
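A sketch of the recreation step under the same peer-authentication assumption:

```shell
# Recreate an empty university_db owned by university_admin, then verify
sudo -u postgres psql -d postgres <<'SQL'
CREATE DATABASE university_db OWNER university_admin;
\l
SQL
```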
6. Restore the Database using `pg_restore`: Now, use `pg_restore` to load the data from your backup file into the newly created, empty database. We'll connect as `university_admin`, since that user owns the database and should have permission to create the objects defined in the dump file.

```shell
cd ~/db_backups  # Make sure you are in the directory with the dump file
pg_restore -U university_admin -h localhost -W -d university_db -v university_db_backup.dump
```

   - `-U university_admin`: Connect as this user.
   - `-h localhost`: Connect to the local server.
   - `-W`: Prompt for the password.
   - `-d university_db`: Restore into this database.
   - `-v`: Verbose mode (shows progress).
   - `university_db_backup.dump`: The backup file to restore from.

Enter the `university_admin` password. You should see verbose output as `pg_restore` processes the dump file, recreating tables and constraints and loading data; look for completion messages without errors.
7. Verify the Restoration: Connect to the database again and check whether the tables and data are back. Enter the password, then, inside `psql`:
   - List tables (expected: `courses`, `enrollments`, and `students` should be listed).
   - Check data in one of the tables (expected: you should see the student data that existed before you dropped the database, likely Alice, Bob, and Charlie, as David was deleted in a previous workshop unless you re-added him).
   - Check enrollments (expected: you should see the enrollment data corresponding to the restored students and courses).
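The checks above can be scripted in one go; `-W` still prompts for the `university_admin` password, and the table names follow the earlier workshops:

```shell
# Reconnect as university_admin and spot-check the restored schema and data
psql -U university_admin -h localhost -W -d university_db <<'SQL'
\dt
SELECT * FROM students;
SELECT * FROM enrollments;
SQL
```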
8. Exit `psql` with `\q`.
Congratulations! You have successfully backed up a PostgreSQL database using `pg_dump`, simulated a failure by dropping the database, and restored it using `pg_restore`. This is a critical skill for any database administrator or developer working with databases on Linux. Remember to schedule regular backups as part of any production deployment.
Conclusion
This journey through Database Basics has equipped you with a foundational understanding of how data is organized, managed, and manipulated, particularly within the Linux environment. We started with the core concept of what a database is and why Database Management Systems (DBMS) are essential for handling data efficiently, reliably, and securely.
We delved into the ubiquitous Relational Model, exploring its core components: tables, rows, columns, schemas, and the crucial role of primary and foreign keys in maintaining data integrity and defining relationships. We learned about SQL (Structured Query Language), the standard language for interacting with relational databases, covering its main sub-languages (DDL, DML, DCL, TCL) and focusing on the fundamental DML commands: `INSERT`, `SELECT`, `UPDATE`, and `DELETE`. The importance of the ACID properties (Atomicity, Consistency, Isolation, Durability) in guaranteeing reliable transactions within relational systems was also highlighted. Our practical workshops focused on PostgreSQL, guiding you through installation, connection, schema design using data modeling principles like normalization, and manipulating data within a sample university database.
Recognizing that the relational model isn't the only solution, we explored the diverse world of NoSQL databases. We discussed the motivations behind their development – handling Big Data challenges like scale, availability, and flexible data structures. We touched upon the CAP theorem and the BASE properties, understanding the trade-offs often made between consistency and availability in distributed NoSQL systems. We surveyed the major NoSQL categories – Key-Value Stores, Document Databases, Column-Family Stores, and Graph Databases – understanding their unique data models and common use cases. The workshop provided a taste of NoSQL using Redis, demonstrating the simplicity and speed of key-value operations.
Finally, we covered essential Database Administration Basics relevant to Linux environments. This included managing users and privileges, implementing fundamental backup and restore strategies using tools like `pg_dump` and `pg_restore`, performing basic performance monitoring using Linux system tools and database-specific commands, understanding the importance of configuration files (like `postgresql.conf` and `pg_hba.conf`), and applying key security considerations.
The key takeaway is that choosing the right database technology – whether SQL or a specific type of NoSQL – depends heavily on the specific requirements of your application regarding data structure, consistency needs, scalability demands, and query patterns. Often, modern applications employ a polyglot persistence approach, using multiple database types for different tasks.
The knowledge and practical skills gained here provide a solid platform for further exploration. Potential next steps include diving deeper into advanced SQL (joins, subqueries, window functions), learning about database indexing and performance optimization, exploring specific database technologies in more detail (e.g., advanced PostgreSQL features, MongoDB query language, Cassandra data modeling), or investigating database replication and high-availability setups. Databases are a cornerstone of modern software development and data science, and continued learning in this area is highly valuable.