Week 8: Database Integration & SQL Security
Lesson Details
Duration: 90-120 minutes (self-paced)
Focus: Understanding databases, writing secure SQL queries, and preventing SQL injection attacks
Mode: Self-paced with hands-on activities
Prerequisites:
- Week 7: Server-side programming and basic authentication
- Week 4: Input validation and XSS prevention
- Week 3: Git basics
- Understanding of data types and variables
- Basic command line skills
What You'll Build: A secure blog application with persistent database storage, replacing Week 7's in-memory user storage with a proper database.
Learning Objectives
By the end of this lesson, you will be able to:
- Connect web applications to databases using connection strings and environment variables
- Design database schemas with proper normalisation, primary keys, and foreign keys
- Write SQL queries for all CRUD operations (Create, Read, Update, Delete)
- Implement prepared statements to prevent SQL injection attacks
- Apply database indexing to improve query performance
- Create shell scripts for automated database backups
Syllabus Alignment:
Programming for the Web
- ✓ Model elements that form a web development system (interfacing with databases based on SQL or non-SQL)
- ✓ Observe and describe back-end process (databases)
- ✓ Develop web application with shell scripts (database backup scripts)
- ✓ Apply web-based database and construct script that executes SQL (selecting fields, 'group by', common SQL queries, constraints using WHERE, table joins)
- ✓ Investigate effect of big data on web architecture (data mining, metadata, database indexing)
Secure Software Architecture
- ✓ Explore fundamental security concepts (confidentiality, integrity, availability - applied to data)
- ✓ Design, develop and implement code using defensive data input handling practices (input validation, sanitisation - preventing SQL injection)
- ✓ Apply security features for data protection
Database Choice
To simplify your development process, it is strongly suggested that you use SQLite for all lessons. It supports all of the key concepts in the syllabus.
Only pick PostgreSQL if you already have some understanding of databases or have someone who can support you. It is a fully featured SQL database used by many of the world's top software companies. It is also best run inside Docker, which is out of scope for this syllabus but simplifies setup and teardown of the database.
1: Database Fundamentals and Connection
Duration: 15-20 minutes | Bloom's Level: Remember, Understand | GRR: Teacher-led introduction
What is a Database?
A database is an organised collection of structured data that provides persistent storage - meaning your data survives server restarts and power failures. Unlike storing data in variables or arrays (which disappear when your programme stops), databases keep your data safe permanently.
Activating Prior Knowledge
Remember Week 7 when we stored users in an array? What happened when we restarted the server? All users disappeared! Databases solve this problem.
Key Database Terminology
graph TD
A[Database System] --> B[Database]
B --> C[Tables]
C --> D[Rows/Records]
C --> E[Columns/Fields]
D --> F[Individual Data Values]
style A fill:#e1f5ff
style B fill:#b3e5fc
style C fill:#81d4fa
style D fill:#4fc3f7
style E fill:#4fc3f7
| Term | Definition | Example |
|---|---|---|
| Database | Organised collection of structured data | Your blog's database containing all users, posts, and comments |
| Table | Collection of related data in rows and columns | users table stores all user information |
| Row | A single record in a table | One user: id=5, username="alice", email="alice@example.com" |
| Column | A field that stores specific type of data | The email column stores email addresses |
| Schema | Blueprint of database structure | Defines all tables, columns, data types, and relationships |
| Primary Key | Unique identifier for each row (cannot be null) | User ID: Each user has unique ID (1, 2, 3...) |
| Foreign Key | Column referencing another table's primary key | posts.author_id links to users.id |
SQL vs NoSQL Databases
| Feature | SQL (Relational) | NoSQL |
|---|---|---|
| Structure | Tables with fixed schema | Flexible (documents, key-value, graph) |
| Examples | PostgreSQL, MySQL, SQLite | MongoDB, Redis, Cassandra |
| Best For | Structured data with relationships | Unstructured data, rapid scaling |
| Schema | Must be defined upfront | Can change dynamically |
| Relationships | JOINs between tables | Embedded or referenced documents |
Which Should I Use?
- SQL for: User accounts, financial records, blog posts (structured data with clear relationships)
- NoSQL for: Real-time chat, social media feeds, IoT sensor data (flexible, rapidly changing data)
Database Connection
Connection String: A URL containing all information needed to connect to a database.
Breaking down the connection string `sqlite:///mydatabase.db`:
- `sqlite:///` - Database type (SQLite, with three slashes for a relative path)
- `mydatabase.db` - Database file name (stored in the current directory)
Alternative paths:
Absolute path (Unix/Linux/Mac):
Absolute path (Windows):
Relative path with subdirectory:
In-memory database (temporary, lost when programme closes):
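Sketches of each path form, assuming SQLAlchemy-style sqlite URLs (the file names and paths here are illustrative; adjust them for your machine):

```
sqlite:///mydatabase.db                         # relative path (current directory)
sqlite:////home/username/data/mydatabase.db     # absolute path (Unix/Linux/Mac)
sqlite:///C:/Users/username/data/mydatabase.db  # absolute path (Windows)
sqlite:///data/mydatabase.db                    # relative path with subdirectory
sqlite:///:memory:                              # in-memory, lost when the programme closes
```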
Key differences from PostgreSQL:
- No username/password: SQLite is file-based, so there's no authentication
- No security on the database contents: anyone who can read the file can read the data
- No host/port: SQLite doesn't run as a server, it's just a file
- Three slashes: `sqlite:///` indicates a relative path from the current directory
- Four slashes: `sqlite:////` indicates an absolute path (note the extra slash after `://`)
Python sqlite3 Module
When using Python's built-in sqlite3 module directly (not through an ORM), you simply provide the file path:
import sqlite3
# Relative path
conn = sqlite3.connect('mydatabase.db')
# Absolute path
conn = sqlite3.connect('/home/username/projects/mydatabase.db')
# In-memory database
conn = sqlite3.connect(':memory:')
File Permissions
Ensure your application has read/write permissions for:
- The database file itself (`mydatabase.db`)
- The directory containing the database
- SQLite creates temporary files (`.db-journal`, `.db-wal`) in the same directory
Breaking down a PostgreSQL connection string:
- `postgresql://` - Database type (in this case PostgreSQL)
- `username:password` - Credentials
- `localhost` - Server location (your computer)
- `5432` - Port number
- `mydatabase` - Database name
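Put together, a complete PostgreSQL connection string might look like this (hypothetical credentials, for illustration only):

```
postgresql://myuser:mypassword@localhost:5432/mydatabase
```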
What is CRUD?
CRUD is the alphabet of databases - everything you do with data is one of these four operations:
graph LR
A[CRUD Operations] --> B[Create: Add new data]
A --> C[Read: Retrieve data]
A --> D[Update: Modify data]
A --> E[Delete: Remove data]
B --> F[SQL: INSERT]
C --> G[SQL: SELECT]
D --> H[SQL: UPDATE]
E --> I[SQL: DELETE]
style A fill:#4caf50
style B fill:#81c784
style C fill:#81c784
style D fill:#81c784
style E fill:#81c784
| Operation | SQL Command | HTTP Method | Real-World Example |
|---|---|---|---|
| Create | INSERT | POST | User signs up for account |
| Read | SELECT | GET | View blog post |
| Update | UPDATE | PUT/PATCH | Edit profile information |
| Delete | DELETE | DELETE | Remove old comment |
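The four CRUD operations map directly onto a few lines of Python with the built-in sqlite3 module. A minimal round-trip sketch using an in-memory database (the `notes` table here is invented for illustration):

```python
import sqlite3

# A minimal CRUD round-trip using a temporary in-memory SQLite database
conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# Create: INSERT a row
conn.execute("INSERT INTO notes (body) VALUES (?)", ("hello",))
# Read: SELECT it back
body = conn.execute("SELECT body FROM notes WHERE id = 1").fetchone()[0]
# Update: modify the row
conn.execute("UPDATE notes SET body = ? WHERE id = 1", ("hello, database",))
# Delete: remove it again
conn.execute("DELETE FROM notes WHERE id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM notes").fetchone()[0]

print(body, remaining)  # prints: hello 0
conn.close()
```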
Activity 1.1: Explore Database Structure (WE DO)
Duration: 10 minutes
Objective: Understand database tables and relationships
What You'll Do
- Open DB Browser for SQLite (or similar GUI tool)
- Open the sample database: `blog_sample.db`
- Examine the `users` table structure
- Identify the primary key
- Look at the `posts` table
- Identify the foreign key linking to users
Questions to Answer:
- What columns exist in the `users` table?
- Which column is the primary key? How can you tell?
- In the `posts` table, which column is the foreign key?
- What does this foreign key relationship represent?
Expected Answers
- Columns: id, username, email, password_hash, created_at
- Primary key: `id` (unique identifier, auto-increments)
- Foreign key: `author_id`
- Links each post to the user who wrote it
Activity 1.2: Set Up Database Connection (YOU DO)
Duration: 10 minutes
Objective: Configure secure database connection
What You'll Do
- Create a `.env` file for environment variables
- Add database credentials (don't commit this file!)
- Update `.gitignore` to exclude `.env`
- Test database connection
Step-by-Step:
1. Create .env file:
2. Update .gitignore:
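Hedged examples of what steps 1 and 2 might contain (the variable name DB_NAME matches the test script later in this activity; adjust the values for your project):

```
# .env  (never commit this file)
DB_NAME=blog_database.db

# .gitignore  (add these lines)
.env
venv/
__pycache__/
```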
3. Create and activate a virtual environment:
Linux/macOS:
Windows:
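The commands for step 3 might look like this. This is a sketch for a POSIX shell assuming `python3` is on your PATH; the Windows equivalent is shown as a comment:

```shell
# Create the virtual environment in a folder named venv
python3 -m venv venv
# Activate it (Linux/macOS)
. venv/bin/activate
# On Windows, run this instead:  venv\Scripts\activate
# Confirm the venv's interpreter is active
python -c "import sys; print(sys.prefix)"
```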
You should see (venv) appear at the start of your command prompt, indicating the virtual environment is active.
4. Install required packages:
Virtual Environment Best Practices
Always use a virtual environment for Python projects. This:
- Keeps project dependencies isolated
- Prevents conflicts between different projects
- Avoids modifying system Python packages
- Makes your project reproducible on other machines
To deactivate the virtual environment when finished, run `deactivate`.
Add the virtual environment directory (`venv/` or `.venv/`) to your `.gitignore`.
Never Commit .venv/
The .venv/ directory should never be committed to Git. Instead, use a requirements.txt file to track dependencies:
Create requirements.txt:
Install from requirements.txt (on another machine):
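A sketch of the requirements.txt workflow (run these inside your activated virtual environment):

```shell
# Record the packages currently installed in the active environment
python3 -m pip freeze > requirements.txt
cat requirements.txt
# On another machine, recreate the environment with:
#   python3 -m pip install -r requirements.txt
```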
5. Test connection (Python example):
import sqlite3
import os
from dotenv import load_dotenv
# Load environment variables
load_dotenv()
# Connect to database
db_name = os.getenv('DB_NAME')
conn = sqlite3.connect(db_name)
print(f"Connected to {db_name} successfully!")
# Close connection
conn.close()
6. Run your test:
Success Criteria
- `.env` file created and contains database credentials
- `.env` is listed in `.gitignore`
- Connection test prints success message
- No credentials visible in your source code
Common Mistake
Don't commit .env to Git! If you accidentally do:
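A sketch of the recovery steps, demonstrated here in a throwaway repository so it is safe to run as-is (in your real project you would run only the `git rm`, `.gitignore`, and commit steps):

```shell
set -e
# Build a throwaway repo that has .env committed by mistake
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo "DB_NAME=blog.db" > .env
git add .env
git commit -qm "oops: committed secrets"
# The fix: remove .env from the index but keep the local file
git rm --cached .env
echo ".env" >> .gitignore
git add .gitignore
git commit -qm "Remove .env from version control"
# The file still exists locally but is no longer tracked
git ls-files
```

Remember that the secrets remain in Git history, so rotate any credentials that were exposed.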
2: Database Schema Design & Normalisation
Duration: 15-20 minutes | Bloom's Level: Understand, Apply | GRR: Guided practice
Database Schema Design
A schema is your database's blueprint. Good schema design prevents data problems before they start.
Normalisation: Organising Data Efficiently
Normalisation reduces data redundancy and improves integrity by dividing large tables into smaller, related tables.
Example: Blog Application Schema
erDiagram
USERS ||--o{ POSTS : writes
USERS ||--o{ COMMENTS : makes
POSTS ||--o{ COMMENTS : receives
CATEGORIES ||--o{ POSTS : categorises
USERS {
int id PK
string username
string email
string password_hash
datetime created_at
}
POSTS {
int id PK
string title
text content
int author_id FK
int category_id FK
datetime created_at
datetime updated_at
}
COMMENTS {
int id PK
int post_id FK
int user_id FK
text content
datetime created_at
}
CATEGORIES {
int id PK
string name
text description
}
Why Normalisation Matters
❌ Unnormalised (Poor Design):
| post_id | title | content | author_name | author_email | author_bio |
|---|---|---|---|---|---|
| 1 | Hello | ... | Alice | alice@test.com | I love coding... |
| 2 | World | ... | Alice | alice@test.com | I love coding... |
| 3 | SQL | ... | Alice | alice@test.com | I love coding... |
Problems:
- Alice's data repeated in every post (redundancy)
- If Alice changes email, must update every post
- Wasted storage space
- Risk of inconsistency (updating some rows but not others)
✅ Normalised (Good Design):
Users Table:
| user_id | username | email | bio |
|---|---|---|---|
| 1 | Alice | alice@test.com | I love coding... |
Posts Table:
| post_id | title | content | author_id |
|---|---|---|---|
| 1 | Hello | ... | 1 |
| 2 | World | ... | 1 |
| 3 | SQL | ... | 1 |
Benefits:
- ✅ Alice's data stored once
- ✅ One update changes all references
- ✅ Less storage space
- ✅ Data consistency guaranteed
Creating Tables with SQL
SQL (Structured Query Language) is the standardised language for managing relational databases.
Example: Create Users Table
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username TEXT NOT NULL UNIQUE,
email TEXT NOT NULL UNIQUE,
password_hash TEXT NOT NULL,
bio TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
Breaking it down:
- `CREATE TABLE users` - Creates a new table named "users"
- `id INTEGER PRIMARY KEY` - Unique identifier (primary key)
- `AUTOINCREMENT` - Database automatically assigns the next number
- `NOT NULL` - Field must have a value
- `UNIQUE` - No two users can have the same username/email
- `DEFAULT CURRENT_TIMESTAMP` - Automatically sets creation time
Example: Create Posts Table with Foreign Key
CREATE TABLE posts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
content TEXT NOT NULL,
author_id INTEGER NOT NULL,
category_id INTEGER,
status TEXT DEFAULT 'draft',
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (author_id) REFERENCES users(id) ON DELETE CASCADE
);
Foreign Key Explained:
- `FOREIGN KEY (author_id)` - This column links to another table
- `REFERENCES users(id)` - Links to the `id` column in the `users` table
- `ON DELETE CASCADE` - If a user is deleted, their posts are also deleted (a "cascade delete")
- Most production databases avoid `ON DELETE CASCADE` because it can delete data you did not intend to lose. For example, if policy requires a deleted user's data to be retained in a backup system for 5 years, a cascade delete could destroy that data before it is backed up.
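A runnable sketch of the cascade behaviour. Note that SQLite only enforces foreign keys when `PRAGMA foreign_keys` is switched on for the connection, which is easy to forget:

```python
import sqlite3

# Observe ON DELETE CASCADE in SQLite (minimal tables for illustration)
conn = sqlite3.connect(':memory:')
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY);
CREATE TABLE posts (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL,
    FOREIGN KEY (author_id) REFERENCES users(id) ON DELETE CASCADE);
INSERT INTO users (id) VALUES (1);
INSERT INTO posts (id, author_id) VALUES (10, 1);
""")

# Deleting the user silently deletes their post as well
conn.execute("DELETE FROM users WHERE id = 1")
posts_left = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(posts_left)  # prints: 0
conn.close()
```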
Activity 2.1: Design Your Database Schema (WE DO)
Duration: 15 minutes
Objective: Create a complete database schema for a blog application
Requirements
Design tables for:
- Users (authentication and profile)
- Posts (blog content)
- Comments (user feedback)
- Categories (organising posts)
Each table must have:
- Primary key
- Appropriate data types
- NOT NULL constraints where needed
- Foreign keys for relationships
Starter Template:
-- Users table
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username TEXT NOT NULL UNIQUE,
-- Add more columns here
);
-- Posts table
CREATE TABLE posts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
-- Design this table
);
-- Comments table
CREATE TABLE comments (
-- Design this table
);
-- Categories table
CREATE TABLE categories (
-- Design this table
);
Guided Questions:
- What information do we need to store about users?
- What information do we need for each blog post?
- How do we link a post to its author?
- How do we link a comment to both the post and the commenter?
Sample Solution
CREATE TABLE users (
id INTEGER PRIMARY KEY AUTOINCREMENT,
username TEXT NOT NULL UNIQUE,
email TEXT NOT NULL UNIQUE,
password_hash TEXT NOT NULL,
bio TEXT,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE categories (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT NOT NULL UNIQUE,
description TEXT
);
CREATE TABLE posts (
id INTEGER PRIMARY KEY AUTOINCREMENT,
title TEXT NOT NULL,
content TEXT NOT NULL,
author_id INTEGER NOT NULL,
category_id INTEGER,
status TEXT DEFAULT 'draft',
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
updated_at DATETIME DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (author_id) REFERENCES users(id) ON DELETE CASCADE,
FOREIGN KEY (category_id) REFERENCES categories(id) ON DELETE SET NULL
);
CREATE TABLE comments (
id INTEGER PRIMARY KEY AUTOINCREMENT,
post_id INTEGER NOT NULL,
user_id INTEGER NOT NULL,
content TEXT NOT NULL,
created_at DATETIME DEFAULT CURRENT_TIMESTAMP,
FOREIGN KEY (post_id) REFERENCES posts(id) ON DELETE CASCADE,
FOREIGN KEY (user_id) REFERENCES users(id) ON DELETE CASCADE
);
Activity 2.2: Create Your Database (YOU DO)
Duration: 10 minutes
Objective: Execute SQL statements to create your database structure
What You'll Do
- Create a new file:
schema.sql - Write CREATE TABLE statements for all tables
- Execute the schema file to create your database
- Verify tables were created correctly
Step-by-Step:
1. Create schema.sql:
-- schema.sql
-- Database schema for blog application
CREATE TABLE users (
-- Your design from Activity 2.1
);
CREATE TABLE categories (
-- Your design
);
CREATE TABLE posts (
-- Your design
);
CREATE TABLE comments (
-- Your design
);
2. Execute schema:
# SQLite
sqlite3 blog_database.db < schema.sql
# PostgreSQL
psql -U username -d database_name -f schema.sql
3. Verify tables exist:
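If you prefer checking from Python rather than the sqlite3 shell, a sketch that lists the tables in your database (assumes `blog_database.db` is in the current directory; connecting creates an empty file if it does not exist yet):

```python
import sqlite3

# List the tables recorded in SQLite's built-in sqlite_master catalogue
conn = sqlite3.connect('blog_database.db')
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
print(tables)
conn.close()
```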
Success Criteria
- All four tables created without errors
- Each table has a primary key
- Foreign keys properly reference other tables
- You can view table structures with `.schema`
Git Checkpoint:
git add schema.sql
git commit -m "Add database schema for blog application
- Create users table with authentication fields
- Create posts table with foreign key to users
- Create comments table with relationships
- Create categories for post organisation"
3: SQL Queries and CRUD Operations
Duration: 20-25 minutes | Bloom's Level: Apply | GRR: Guided practice
SQL CRUD Operations
Now that we have tables, let's fill them with data and query them.
CREATE: INSERT Operations
Insert Single Record:
INSERT INTO users (username, email, password_hash)
VALUES ('alice', 'alice@example.com', '$2b$12$hashed_password');
Insert Multiple Records:
INSERT INTO categories (name, description)
VALUES
('Technology', 'Posts about tech'),
('Travel', 'Travel experiences'),
('Food', 'Recipes and reviews');
Insert with Returning ID (PostgreSQL):
INSERT INTO users (username, email, password_hash)
VALUES ('bob', 'bob@example.com', '$2b$12$hash')
RETURNING id;
READ: SELECT Operations
A query is a request for data from the database.
Select Everything:
SELECT * FROM users;
Avoid SELECT *
In production, specify the exact columns you need. SELECT * is slower and retrieves unnecessary data.
Select Specific Columns:
SELECT username, email FROM users;
WHERE Clause - Filtering Results:
| Operator | Meaning | Example |
|---|---|---|
| `=` | Equal to | WHERE age = 25 |
| `!=` or `<>` | Not equal | WHERE status != 'deleted' |
| `>` | Greater than | WHERE age > 18 |
| `<` | Less than | WHERE price < 100 |
| `>=` | Greater or equal | WHERE age >= 18 |
| `<=` | Less or equal | WHERE stock <= 10 |
| `LIKE` | Pattern match | WHERE name LIKE 'A%' |
| `IN` | Match list | WHERE country IN ('AU', 'NZ') |
| `BETWEEN` | Range | WHERE age BETWEEN 18 AND 65 |
| `AND` | Both conditions | WHERE age > 18 AND country = 'AU' |
| `OR` | Either condition | WHERE country = 'AU' OR country = 'NZ' |
Examples:
-- Single condition
SELECT * FROM users WHERE country = 'Australia';
-- Multiple conditions
SELECT * FROM users WHERE age > 18 AND country = 'Australia';
-- Pattern matching
SELECT * FROM users WHERE email LIKE '%@gmail.com';
-- List matching
SELECT * FROM posts WHERE status IN ('published', 'featured');
ORDER BY - Sorting Results:
-- Newest posts first
SELECT * FROM posts ORDER BY created_at DESC;
-- Oldest posts first
SELECT * FROM posts ORDER BY created_at ASC;
-- Multiple sort criteria
SELECT * FROM posts ORDER BY category_id ASC, created_at DESC;
LIMIT - Restricting Results:
-- Get 10 most recent posts
SELECT * FROM posts
ORDER BY created_at DESC
LIMIT 10;
-- Pagination: Skip first 10, get next 10
SELECT * FROM posts
ORDER BY created_at DESC
LIMIT 10 OFFSET 10;
GROUP BY - Data Aggregation:
GROUP BY groups rows with same values, allowing calculations on each group.
Common Aggregate Functions:
| Function | Purpose | Example |
|---|---|---|
| `COUNT()` | Count rows | SELECT COUNT(*) FROM users |
| `SUM()` | Add values | SELECT SUM(price) FROM orders |
| `AVG()` | Average | SELECT AVG(age) FROM users |
| `MIN()` | Minimum | SELECT MIN(price) FROM products |
| `MAX()` | Maximum | SELECT MAX(salary) FROM employees |
Examples:
-- Count posts per user
SELECT author_id, COUNT(*) as post_count
FROM posts
GROUP BY author_id;
-- Average rating per post
SELECT post_id, AVG(rating) as avg_rating
FROM comments
GROUP BY post_id;
-- Posts per category
SELECT categories.name, COUNT(posts.id) as post_count
FROM categories
LEFT JOIN posts ON categories.id = posts.category_id
GROUP BY categories.id;
UPDATE: Modifying Data
-- Update single field
UPDATE users
SET email = 'newemail@example.com'
WHERE id = 5;
-- Update multiple fields
UPDATE posts
SET title = 'New Title', updated_at = CURRENT_TIMESTAMP
WHERE id = 10;
-- Conditional update
UPDATE posts
SET status = 'archived'
WHERE created_at < '2023-01-01';
Always Use WHERE with UPDATE
Without a WHERE clause, you'll update EVERY record in the table!
DELETE: Removing Data
-- Delete specific record
DELETE FROM comments WHERE id = 100;
-- Delete based on condition
DELETE FROM posts WHERE status = 'spam';
-- Delete old records
DELETE FROM sessions WHERE created_at < DATE('now', '-30 days');
Always Use WHERE with DELETE
Without a WHERE clause, you'll delete EVERYTHING in the table!
Activity 3.1: Practice CRUD Operations (WE DO)
Duration: 15 minutes
Objective: Write SQL queries for common database operations
Complete These Tasks
Work through each task, writing and executing SQL queries.
Task 1: INSERT - Create Data
-- 1. Insert yourself as a user
INSERT INTO users (username, email, password_hash)
VALUES ('your_username', 'your_email@example.com', 'temp_hash');
-- 2. Insert a category
INSERT INTO categories (name, description)
VALUES ('Tutorial', 'Educational content');
-- 3. Insert a post (use your user_id and category_id)
INSERT INTO posts (title, content, author_id, category_id, status)
VALUES ('My First Post', 'Hello database world!', 1, 1, 'published');
Task 2: SELECT - Read Data
-- 4. Get all published posts
-- Your query here
-- 5. Find posts by a specific author
-- Your query here
-- 6. Count total users
-- Your query here
-- 7. Get 5 most recent posts
-- Your query here
Task 3: UPDATE - Modify Data
-- 8. Update your user's bio
-- Your query here
-- 9. Change a post's status to 'published'
-- Your query here
Task 4: DELETE - Remove Data
-- 10. Delete a comment (choose an id)
-- Your query here
Sample Solutions
-- Task 2.4: Get all published posts
SELECT * FROM posts WHERE status = 'published';
-- Task 2.5: Posts by specific author
SELECT * FROM posts WHERE author_id = 1;
-- Task 2.6: Count total users
SELECT COUNT(*) as user_count FROM users;
-- Task 2.7: 5 most recent posts
SELECT * FROM posts ORDER BY created_at DESC LIMIT 5;
-- Task 3.8: Update bio
UPDATE users SET bio = 'Aspiring developer' WHERE id = 1;
-- Task 3.9: Publish post
UPDATE posts SET status = 'published', updated_at = CURRENT_TIMESTAMP WHERE id = 1;
-- Task 4.10: Delete comment
DELETE FROM comments WHERE id = 1;
Activity 3.2: Table Joins (WE DO)
Duration: 15 minutes
Objective: Combine data from multiple tables using JOINs
Understanding JOINs
A JOIN combines rows from two or more tables based on related columns.
graph LR
A[Posts Table] -->|author_id| B[JOIN]
C[Users Table] -->|id| B
B --> D[Combined Result]
style B fill:#4caf50
JOIN Types Visual Guide
graph TD
A[JOIN Types] --> B[INNER JOIN]
A --> C[LEFT JOIN]
A --> D[RIGHT JOIN]
A --> E[FULL OUTER JOIN]
B --> F[Only matching records from both tables]
C --> G[All from left table + matches from right]
D --> H[All from right table + matches from left]
E --> I[All records from both tables]
style A fill:#2196f3
style B fill:#4caf50
style C fill:#ff9800
style D fill:#ff9800
style E fill:#f44336
INNER JOIN - Only Matching Records:
-- Get posts with their author names
SELECT posts.title, posts.content, users.username
FROM posts
INNER JOIN users ON posts.author_id = users.id;
Result: Only posts that have a valid author (orphaned posts excluded)
LEFT JOIN - All From Left Table:
-- Get all users and their post count (including users with 0 posts)
SELECT users.username, COUNT(posts.id) as post_count
FROM users
LEFT JOIN posts ON users.id = posts.author_id
GROUP BY users.id;
Result: Every user appears, even users who haven't posted (post_count = 0)
Complex JOIN Example:
-- Posts with authors and categories
SELECT
posts.title,
users.username as author,
categories.name as category,
COUNT(comments.id) as comment_count
FROM posts
INNER JOIN users ON posts.author_id = users.id
INNER JOIN categories ON posts.category_id = categories.id
LEFT JOIN comments ON posts.id = comments.post_id
GROUP BY posts.id;
Old Style Comma Table Join
There is an older style of join that lists tables separated by commas. It should be avoided because it gives you less control over your queries and can hurt readability, especially when queries are large. You should still be able to read this format, as you will encounter it in older code.
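For reference, the two styles produce the same rows. A self-contained sketch comparing them on a tiny invented dataset:

```python
import sqlite3

# Compare the old comma-style join with the explicit INNER JOIN
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, author_id INTEGER);
INSERT INTO users VALUES (1, 'alice');
INSERT INTO posts VALUES (1, 'Hello', 1);
""")

# Old comma style (avoid): join condition hidden in the WHERE clause
old_style = conn.execute(
    "SELECT posts.title, users.username FROM posts, users "
    "WHERE posts.author_id = users.id").fetchall()

# Modern explicit style (preferred): join condition stated in the JOIN
new_style = conn.execute(
    "SELECT posts.title, users.username FROM posts "
    "INNER JOIN users ON posts.author_id = users.id").fetchall()

print(old_style == new_style)  # prints: True
conn.close()
```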
Practice Tasks
Complete these JOIN queries:
- Get all posts with author names and emails
- Find all users who have never posted
- Get posts with comment count
- List categories with their post count
Sample Solutions
-- 1. Posts with author details
SELECT posts.*, users.username, users.email
FROM posts
INNER JOIN users ON posts.author_id = users.id;
-- 2. Users who never posted
SELECT users.username
FROM users
LEFT JOIN posts ON users.id = posts.author_id
WHERE posts.id IS NULL;
-- 3. Posts with comment count
SELECT posts.title, COUNT(comments.id) as comments
FROM posts
LEFT JOIN comments ON posts.id = comments.post_id
GROUP BY posts.id;
-- 4. Categories with post count
SELECT categories.name, COUNT(posts.id) as post_count
FROM categories
LEFT JOIN posts ON categories.id = posts.category_id
GROUP BY categories.id;
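To convince yourself that solution 2 behaves as claimed, here is a self-contained sketch with two users and one post, where only the non-posting user should be returned:

```python
import sqlite3

# LEFT JOIN + IS NULL finds rows on the left with no match on the right
conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
CREATE TABLE posts (id INTEGER PRIMARY KEY, author_id INTEGER);
INSERT INTO users (id, username) VALUES (1, 'alice'), (2, 'bob');
INSERT INTO posts (id, author_id) VALUES (1, 1);
""")

never_posted = [row[0] for row in conn.execute("""
    SELECT users.username
    FROM users
    LEFT JOIN posts ON users.id = posts.author_id
    WHERE posts.id IS NULL""")]

print(never_posted)  # prints: ['bob']
conn.close()
```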
4: SQL Injection and Security
Duration: 20-25 minutes | Bloom's Level: Analyse, Evaluate | GRR: Guided critical thinking
What is SQL Injection?
SQL Injection is a code injection attack where attackers insert malicious SQL code through user input, allowing them to manipulate or access the database. Injection consistently ranks among the most critical categories in the OWASP Top 10 web application vulnerabilities.
Critical Security Vulnerability
SQL injection can:
- Bypass authentication (log in without password)
- Steal all data from database
- Delete entire database
- Modify data
- Execute system commands
How SQL Injection Works
Vulnerable Code (NEVER DO THIS):
# User provides input
username = request.form['username'] # User types: alice
password = request.form['password'] # User types: password123
# String concatenation builds SQL query (DANGEROUS!)
query = "SELECT * FROM users WHERE username = '" + username + "' AND password = '" + password + "'"
# Execute query
cursor.execute(query)
Normal Use:
- User enters: username = `alice`, password = `password123`
- Resulting query: `SELECT * FROM users WHERE username = 'alice' AND password = 'password123'`
- Works as intended ✓
Malicious Use - Attack #1: Authentication Bypass:
- Attacker enters: username = `admin' OR '1'='1`, password = `anything`
- Resulting query: `SELECT * FROM users WHERE username = 'admin' OR '1'='1' AND password = 'anything'`
- Because `'1'='1'` is always true (and SQL evaluates AND before OR), the admin row matches regardless of the password
- The attacker logs in as admin without knowing the password! ❌
sequenceDiagram
participant A as Attacker
participant F as Login Form
participant S as Server
participant D as Database
A->>F: Username: admin' OR '1'='1
F->>S: Send malicious input
S->>D: SELECT * FROM users WHERE username='admin' OR '1'='1'
D->>S: Returns ALL users (query is true!)
S->>A: Welcome, Admin!
Note over D: SQL injection successful!
Attack #2: Database Destruction:
- Attacker enters: username = `admin'; DROP TABLE users;--`, password = `anything`
- Resulting query becomes: `SELECT * FROM users WHERE username = 'admin'; DROP TABLE users;--' AND password = 'anything'`
- First statement: selects the admin user
- Second statement: deletes the entire users table!
- `--` comments out the rest of the query
- The entire user database is destroyed! ❌ (This particular attack requires a driver that executes multiple statements per query; many, including Python's sqlite3 `execute()`, refuse to.)
Attack #3: Data Theft:
- Attacker enters: username = `' UNION SELECT credit_card_number FROM payments--`, password = `x`
- Resulting query: `SELECT * FROM users WHERE username = '' UNION SELECT credit_card_number FROM payments--' AND password = 'x'`
- Steals all credit card numbers from the payments table! ❌
Real-World SQL Injection Damage
| Incident | Year | Impact | Cost |
|---|---|---|---|
| TalkTalk | 2015 | 157,000 customers' data stolen via SQL injection | £400,000 fine + £77 million total costs |
| Heartland Payment | 2008 | 130 million credit cards exposed | Approximately £110 million |
| Sony Pictures | 2011 | 1 million accounts compromised | Immeasurable reputation damage |
Activity 4.1: Create Vulnerable Code
Duration: 10 minutes
Objective: Understand SQL injection by creating and attacking vulnerable code
Controlled Environment Only
This is for learning purposes only. NEVER deploy this code!
What You'll Do
- Create deliberately vulnerable login system
- Attack your own code with SQL injection
- Witness the danger firsthand
- Understand why this is critical
Step 1: Create Vulnerable Login (demonstration.py)
# demonstration.py - VULNERABLE CODE FOR LEARNING ONLY
import sqlite3

def vulnerable_login(username, password):
    """DO NOT USE IN PRODUCTION - DEMONSTRATION ONLY"""
    conn = sqlite3.connect('demo.db')
    cursor = conn.cursor()
    # VULNERABLE: String concatenation
    query = "SELECT * FROM users WHERE username = '" + username + "' AND password = '" + password + "'"
    print(f"Executing query: {query}")
    cursor.execute(query)
    result = cursor.fetchone()
    if result:
        print(f"✓ Login successful! Welcome {result[1]}")
    else:
        print("✗ Login failed")
    conn.close()

# Test normal login
print("\n=== Normal Login ===")
vulnerable_login("alice", "password123")

# Test SQL injection attack
print("\n=== SQL Injection Attack ===")
vulnerable_login("admin' OR '1'='1", "anything")
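demonstration.py assumes a demo.db containing a users table. A throwaway seeding sketch (note the plain-text password column, acceptable only for this demonstration; real applications store password hashes):

```python
import sqlite3

# Seed demo.db for the SQL injection demonstration
# (plain-text password column: demonstration only, never do this in production)
conn = sqlite3.connect('demo.db')
conn.execute("""CREATE TABLE IF NOT EXISTS users (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    username TEXT NOT NULL UNIQUE,
    password TEXT NOT NULL)""")
conn.execute("INSERT OR IGNORE INTO users (username, password) VALUES (?, ?)",
             ('admin', 'secret'))
conn.commit()
seeded = conn.execute("SELECT username FROM users").fetchall()
print(seeded)  # prints: [('admin',)]
conn.close()
```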
Step 2: Run and Observe
Expected Output:
=== Normal Login ===
Executing query: SELECT * FROM users WHERE username = 'alice' AND password = 'password123'
✗ Login failed
=== SQL Injection Attack ===
Executing query: SELECT * FROM users WHERE username = 'admin' OR '1'='1' AND password = 'anything'
✓ Login successful! Welcome admin
Analysis Questions
- Why did the attack succeed?
- What does `'1'='1'` mean to the database?
- How does the `--` comment work in attacks?
- What damage could this cause in a real application?
Understanding the Attack
- The injected `'1'='1'` becomes part of the WHERE condition
- `'1'='1'` is always TRUE in SQL
- SQL evaluates AND before OR, so the condition becomes `username = 'admin' OR ('1'='1' AND password = 'anything')`
- Any row with username = 'admin' therefore matches, regardless of the password supplied
- The attacker bypasses authentication without knowing the password
Activity 4.2: Fix with Prepared Statements
Duration: 15 minutes
Objective: Implement proper SQL injection defence
The Solution: Prepared Statements
A prepared statement (also called parameterised query) separates SQL code from user data.
How It Works:
graph LR
A[SQL Template] --> B[Database Pre-compiles]
C[User Data] --> D[Sent as Parameters]
B --> E[Safe Execution]
D --> E
style A fill:#4caf50
style C fill:#2196f3
style E fill:#4caf50
- SQL query written with placeholders (`?` or named parameters)
- Database compiles the query template
- User data sent separately as parameters
- Database treats parameters as pure data, never as SQL code
Secure Code with Prepared Statements:
# secure_login.py - PROPER IMPLEMENTATION
import sqlite3

def secure_login(username, password):
    """SECURE: Uses prepared statements"""
    conn = sqlite3.connect('demo.db')
    cursor = conn.cursor()
    # SECURE: Prepared statement with placeholders
    query = "SELECT * FROM users WHERE username = ? AND password = ?"
    print(f"Query template: {query}")
    print(f"Parameters: {username}, {password}")
    # Execute with parameters passed separately
    cursor.execute(query, (username, password))
    result = cursor.fetchone()
    if result:
        print(f"✓ Login successful! Welcome {result[1]}")
    else:
        print("✗ Login failed")
    conn.close()

# Test normal login
print("\n=== Normal Login ===")
secure_login("alice", "password123")

# Test SQL injection attempt (now fails safely)
print("\n=== SQL Injection Attempt (Blocked) ===")
secure_login("admin' OR '1'='1", "anything")
Expected Output:
=== Normal Login ===
Query template: SELECT * FROM users WHERE username = ? AND password = ?
Parameters: alice, password123
✗ Login failed (correct - user doesn't exist yet)
=== SQL Injection Attempt (Blocked) ===
Query template: SELECT * FROM users WHERE username = ? AND password = ?
Parameters: admin' OR '1'='1, anything
✗ Login failed (attack blocked - treated as literal username!)
Why This Works:
| Vulnerable Code | Prepared Statement |
|---|---|
| User input concatenated into SQL | User input separated from SQL |
| Database sees input as code | Database sees input as pure data |
| `'`, `;`, `--` have special meaning | Special characters treated as literal data |
| Attacker can inject commands | Attacker input treated literally |
Key Principle
Prepared statements ensure user input is NEVER interpreted as SQL code, only as data values.
Activity 4.3: Secure Your Application
Duration: 15 minutes
Objective: Replace all string concatenation with prepared statements
What You'll Do
- Review your Server Side Programming authentication code
- Identify all SQL queries
- Convert to prepared statements
- Test with malicious inputs
- Verify attacks now fail
Task 1: Secure Registration
Before (Vulnerable):
def register(username, email, password):
    password_hash = hash_password(password)  # hashing helper assumed
    # VULNERABLE: user input concatenated straight into the SQL string
    query = ("INSERT INTO users (username, email, password_hash) VALUES ('"
             + username + "', '" + email + "', '" + password_hash + "')")
    cursor.execute(query)
After (Secure):
def register(username, email, password):
    # Hash password first
    password_hash = bcrypt.hashpw(password.encode('utf-8'), bcrypt.gensalt())
    # SECURE: Prepared statement
    query = "INSERT INTO users (username, email, password_hash) VALUES (?, ?, ?)"
    try:
        cursor.execute(query, (username, email, password_hash))
        db.commit()
        return True
    except sqlite3.IntegrityError:
        return False  # Username or email already exists
Task 2: Secure Login
Before (Vulnerable):
def login(username, password):
    query = "SELECT * FROM users WHERE username = '" + username + "'"
    cursor.execute(query)
After (Secure):
def login(username, password):
    query = "SELECT id, username, password_hash FROM users WHERE username = ?"
    cursor.execute(query, (username,))
    user = cursor.fetchone()
    if user and bcrypt.checkpw(password.encode('utf-8'), user[2]):
        return {'id': user[0], 'username': user[1]}
    return None
Task 3: Secure All Other Queries
Find and fix:
- Post creation
- Post updates
- Comment insertion
- User profile queries
- Any other database operations
Security Checklist:
- All queries use ? placeholders
- No string concatenation with user input
- Parameters passed as tuple to execute()
- Tested with malicious inputs like ' OR '1'='1
- All attacks now fail safely
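The checklist's injection test can be automated with a small self-check. This sketch uses a throwaway in-memory table (not your real schema) to confirm that a classic injection string is treated as plain data:

```python
import sqlite3

# Self-check sketch: an in-memory table stands in for the real users table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT, password TEXT)")
conn.execute("INSERT INTO users (username, password) VALUES ('alice', 'pw')")

attack = "admin' OR '1'='1"
row = conn.execute(
    "SELECT * FROM users WHERE username = ? AND password = ?",
    (attack, "anything"),
).fetchone()
print(row)  # None: the injection string matched no literal username
```

If this prints anything other than `None`, user input is still reaching the SQL as code.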
Complete Secure Example
import sqlite3
import bcrypt

class SecureDatabase:
    def __init__(self, db_name):
        self.conn = sqlite3.connect(db_name)
        self.conn.row_factory = sqlite3.Row
        self.cursor = self.conn.cursor()

    def register_user(self, username, email, password):
        """Securely register new user"""
        password_hash = bcrypt.hashpw(password.encode('utf-8'), bcrypt.gensalt())
        query = "INSERT INTO users (username, email, password_hash) VALUES (?, ?, ?)"
        try:
            self.cursor.execute(query, (username, email, password_hash))
            self.conn.commit()
            return True
        except sqlite3.IntegrityError:
            return False

    def login_user(self, username, password):
        """Securely authenticate user"""
        query = "SELECT id, username, password_hash FROM users WHERE username = ?"
        self.cursor.execute(query, (username,))
        user = self.cursor.fetchone()
        if user and bcrypt.checkpw(password.encode('utf-8'), user['password_hash']):
            return {'id': user['id'], 'username': user['username']}
        return None

    def create_post(self, title, content, author_id, category_id):
        """Securely create post"""
        query = """
            INSERT INTO posts (title, content, author_id, category_id, status)
            VALUES (?, ?, ?, ?, 'draft')
        """
        self.cursor.execute(query, (title, content, author_id, category_id))
        self.conn.commit()
        return self.cursor.lastrowid

    def get_user_posts(self, user_id):
        """Securely get user's posts"""
        query = """
            SELECT posts.*, users.username
            FROM posts
            INNER JOIN users ON posts.author_id = users.id
            WHERE author_id = ?
            ORDER BY created_at DESC
        """
        self.cursor.execute(query, (user_id,))
        return self.cursor.fetchall()

    def close(self):
        self.conn.close()
Git Checkpoint:
git add .
git commit -m "Fix SQL injection vulnerabilities
- Replace all string concatenation with prepared statements
- Add input validation
- Test confirms attacks now fail
- All queries use parameterised statements"
Part 5: Performance, Indexing & Backups
Duration: 20-25 minutes | Bloom's Level: Apply, Analyse | GRR: Guided application
Database Performance and Indexing
As your database grows, query performance becomes critical. An index dramatically speeds up data retrieval.
What is an Index?
Think of a database index like a book's index - instead of reading every page to find a topic, you check the index and jump directly to the right page.
graph TD
A[Query: Find user with email='alice@example.com'] --> B{Index Exists?}
B -->|No Index| C[Scan ALL Rows]
B -->|With Index| D[Jump Directly to Row]
C --> E[1,000,000 rows checked]
D --> F[~20 checks via index tree]
style C fill:#f44336
style D fill:#4caf50
Performance Comparison:
| Operation | Without Index | With Index |
|---|---|---|
| Find user by email in 1M users | 1,000,000 row scans | ~20 checks |
| Query time | Seconds | Milliseconds |
| Growth rate | Linear (gets slower proportionally) | Logarithmic (barely slows down) |
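The "~20 checks" figure in the table comes straight from the logarithm: a balanced index tree halves the remaining search space at every step, so a lookup over n rows needs about log2(n) comparisons. A quick check:

```python
import math

# A balanced index tree over n rows needs about log2(n) comparisons
n = 1_000_000
print(math.ceil(math.log2(n)))  # 20 comparisons for a million rows
```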
When to Create Indexes
Create indexes on columns you:
- Search frequently (WHERE clauses)
- Sort by (ORDER BY)
- Join on (JOIN conditions)
- Use in GROUP BY
Common index candidates:
- Primary keys (automatically indexed)
- Foreign keys (should be indexed)
- Email addresses (frequent lookups)
- Usernames (frequent searches)
- Timestamps (for sorting/filtering)
Don't index:
- Small tables (< 1000 rows)
- Columns with many duplicate values
- Columns that change frequently
Creating Indexes
-- Index on email for fast login lookups
CREATE INDEX idx_users_email ON users(email);
-- Index on foreign key for fast JOINs
CREATE INDEX idx_posts_author ON posts(author_id);
-- Index on timestamp for sorting
CREATE INDEX idx_posts_created ON posts(created_at);
-- Composite index (multiple columns)
CREATE INDEX idx_posts_status_date ON posts(status, created_at);
Query Optimisation
Bad Query (Slow):
-- Fetches every column of every row, with no filter or limit
SELECT * FROM posts;
Good Query (Fast):
-- Gets only needed columns with limit
SELECT id, title, created_at
FROM posts
WHERE status = 'published'
ORDER BY created_at DESC
LIMIT 10;
Use EXPLAIN to Analyse Queries:
EXPLAIN QUERY PLAN
SELECT * FROM users WHERE email = 'alice@example.com';
-- Without index: SCAN TABLE users
-- With index: SEARCH TABLE users USING INDEX idx_users_email
Activity 5.1: Measure Index Performance
Duration: 15 minutes
Objective: See the dramatic performance improvement from indexing
What You'll Do
- Create large test dataset (10,000 users)
- Measure query time without index
- Create index
- Measure query time with index
- Compare performance improvement
Step 1: Generate Test Data
# generate_test_data.py
import sqlite3
import random
import string
import time

conn = sqlite3.connect('blog_database.db')
cursor = conn.cursor()

print("Generating 10,000 test users...")
start = time.time()

for i in range(10000):
    username = ''.join(random.choices(string.ascii_lowercase, k=8))
    email = f"{username}@example.com"
    password_hash = "dummy_hash_for_testing"
    try:
        cursor.execute(
            "INSERT INTO users (username, email, password_hash) VALUES (?, ?, ?)",
            (username, email, password_hash)
        )
    except sqlite3.IntegrityError:
        pass  # Skip duplicates

conn.commit()
end = time.time()
print(f"Created test users in {end - start:.2f} seconds")
conn.close()
Step 2: Measure Without Index
# test_performance.py
import sqlite3
import time
conn = sqlite3.connect('blog_database.db')
cursor = conn.cursor()
# Get a random email to search for
cursor.execute("SELECT email FROM users ORDER BY RANDOM() LIMIT 1")
test_email = cursor.fetchone()[0]
print(f"Searching for: {test_email}")
print("\n=== WITHOUT INDEX ===")
# Drop any index left over from a previous run so the baseline is fair
cursor.execute("DROP INDEX IF EXISTS idx_users_email")
# Measure query time
start = time.time()
cursor.execute("SELECT * FROM users WHERE email = ?", (test_email,))
result = cursor.fetchone()
end = time.time()
print(f"Query time: {(end - start) * 1000:.2f} milliseconds")
print(f"Found: {result[1] if result else 'Not found'}")
# Check query plan
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", (test_email,))
plan = cursor.fetchall()
print(f"Query plan: {plan}")
conn.close()
Step 3: Create Index
From the command line (Step 4 also creates it from Python):
sqlite3 blog_database.db "CREATE INDEX IF NOT EXISTS idx_users_email ON users(email);"
Step 4: Measure With Index
# test_performance.py (continued)
print("\n=== WITH INDEX ===")
# Ensure the index exists (no-op if Step 3 already created it)
cursor.execute("CREATE INDEX IF NOT EXISTS idx_users_email ON users(email)")
# Measure again
start = time.time()
cursor.execute("SELECT * FROM users WHERE email = ?", (test_email,))
result = cursor.fetchone()
end = time.time()
print(f"Query time: {(end - start) * 1000:.2f} milliseconds")
print(f"Found: {result[1] if result else 'Not found'}")
# Check query plan
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM users WHERE email = ?", (test_email,))
plan = cursor.fetchall()
print(f"Query plan: {plan}")
Expected Results:
=== WITHOUT INDEX ===
Query time: 15.23 milliseconds
Query plan: SCAN TABLE users
=== WITH INDEX ===
Query time: 0.45 milliseconds
Query plan: SEARCH TABLE users USING INDEX idx_users_email
Performance improvement: 33x faster!
Document Your Findings:
| Metric | Without Index | With Index | Improvement |
|---|---|---|---|
| Query time | ___ ms | ___ ms | ___x faster |
| Query plan | SCAN TABLE | SEARCH USING INDEX | ✓ |
| Rows scanned | 10,000 | ~1 | ✓ |
Shell Scripts for Database Backups
Why Backups Matter:
- Hardware failures
- Human errors (accidental deletions)
- Security breaches
- Data corruption
Critical Rule
A backup you haven't tested is not a backup at all!
The 3-2-1 Backup Rule
- 3 copies of your data
- 2 different media types
- 1 copy offsite
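The rule can be rehearsed locally before involving real hardware. In this sketch, ordinary directories stand in for the three locations: in practice `media2` would be an external drive and `offsite` a remote server or cloud bucket, and all the names here are illustrative only.

```shell
#!/bin/sh
# Stand-in directories for the 3-2-1 rule (names are illustrative only)
mkdir -p primary media2 offsite

echo "demo data" > primary/blog_database.db   # copy 1: the live database
cp primary/blog_database.db media2/           # copy 2: second storage medium
cp primary/blog_database.db offsite/          # copy 3: offsite location

ls primary media2 offsite
```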
Creating Backup Scripts
Basic Backup Script (Bash):
#!/bin/bash
# backup_database.sh - Automated database backup
# Configuration
DB_NAME="blog_database.db"
BACKUP_DIR="./backups"
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="$BACKUP_DIR/backup_$DATE.db"
# Create backup directory if needed
mkdir -p "$BACKUP_DIR"
# Perform backup
echo "Starting backup..."
sqlite3 "$DB_NAME" ".backup '$BACKUP_FILE'"
# Compress to save space
gzip "$BACKUP_FILE"
# Keep only last 7 days
find "$BACKUP_DIR" -name "backup_*.db.gz" -mtime +7 -delete
# Log completion
echo "Backup completed: ${BACKUP_FILE}.gz"
echo "$(date): Backup successful" >> "$BACKUP_DIR/backup.log"
Make Script Executable:
chmod +x backup_database.sh
Run Backup:
./backup_database.sh
Activity 5.2: Create Backup System
Duration: 15 minutes
Objective: Implement automated database backups
What You'll Do
- Create backup script
- Test backup creation
- Create restore script
- Test restore procedure
- Schedule automatic backups (optional)
Task 1: Create Backup Script
Create backup_database.sh:
Task 2: Create Restore Script
Create restore_database.sh:
#!/bin/bash
# restore_database.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: ./restore_database.sh <backup_file.db.gz>"
exit 1
fi
# Decompress
gunzip -c "$BACKUP_FILE" > /tmp/restored.db
# Copy to working directory
cp /tmp/restored.db blog_database.db
# Clean up
rm /tmp/restored.db
echo "Database restored from $BACKUP_FILE"
Task 3: Test Your Backups
# 1. Create backup
./backup_database.sh
# 2. Make a change to database
sqlite3 blog_database.db "DELETE FROM posts WHERE id = 1"
# 3. Restore from backup
./restore_database.sh backups/backup_20250121_143022.db.gz
# 4. Verify data restored
sqlite3 blog_database.db "SELECT COUNT(*) FROM posts"
Success Criteria
- Backup script creates compressed backup files
- Backups are timestamped
- Old backups are automatically deleted
- Restore script successfully recovers data
- Tested restore procedure works
- Backup log shows successful operations
Optional: Schedule with Cron
Note: Cron does not exist on Windows; Task Scheduler is the equivalent concept there.
# Edit crontab
crontab -e
# Add line to run backup daily at 2 AM
0 2 * * * /path/to/backup_database.sh
Cron Schedule Format:
* * * * * command
│ │ │ │ │
│ │ │ │ └─── Day of week (0-7, Sunday = 0 or 7)
│ │ │ └───── Month (1-12)
│ │ └─────── Day of month (1-31)
│ └───────── Hour (0-23)
└─────────── Minute (0-59)
Git Checkpoint:
git add backup_database.sh restore_database.sh
git commit -m "Add database backup and restore scripts
- Automated backup with timestamp
- Compression to save space
- Cleanup old backups (7 day retention)
- Restore script for recovery
- Backup logging"
Part 6: Big Data Concepts & Database Management (20 minutes)
Big Data Concepts (Introduction)
What is Big Data?
Big data refers to datasets so large or complex that traditional database tools struggle to handle them efficiently. For web applications, this means planning for scale from the beginning.
flowchart LR
A[Small Data] -->|Growth| B[Medium Data]
B -->|Growth| C[Big Data]
A --> A1[100s of records]
A --> A2[Single server]
A --> A3[Simple queries]
B --> B1[1000s-100,000s records]
B --> B2[Needs optimisation]
B --> B3[Requires indexing]
C --> C1[Millions+ records]
C --> C2[Distributed systems]
C --> C3[Specialised tools]
style A fill:#9f9,stroke:#333
style B fill:#ffd93d,stroke:#333
style C fill:#ff6b6b,stroke:#333
Big Data Terminology
| Term | Definition | Example in Web Context |
|---|---|---|
| Metadata | Data about data - information describing other data | User profile has metadata: account_created_date, last_login, file_upload_date |
| Data Mining | Process of discovering patterns in large datasets | Analysing millions of user behaviours to recommend products or content |
| Scalability | Ability to handle growing amounts of data and users | Blog with 100 users vs 100 million users requires different architecture |
| Query Optimisation | Making database queries run faster and more efficiently | Bad query scans entire table; optimised query uses indexes |
| Sharding | Splitting data across multiple database servers | Users A-M on server 1, N-Z on server 2 |
| Replication | Creating copies of database for redundancy and performance | Master database writes, multiple read replicas for queries |
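Sharding from the table above fits in a few lines: hash the username and use the result to pick a server, so every lookup knows which shard holds a given user. The server names below are made up for illustration.

```python
import hashlib

SHARDS = ["db-server-1", "db-server-2", "db-server-3"]  # hypothetical hosts

def shard_for(username):
    """Deterministically map a username to one shard."""
    digest = hashlib.sha256(username.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

# The same user always lands on the same server
print(shard_for("alice") == shard_for("alice"))  # True
```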
Scalability Considerations
As your web application grows, database performance becomes critical:
flowchart LR
A[Application Growth] --> B{User Scale}
B -->|100s users| C[Small Scale]
B -->|1000s users| D[Medium Scale]
B -->|Millions users| E[Large Scale]
C --> C1[Single DB Server]
C --> C2[Simple Queries]
C --> C3[Basic Backups]
D --> D1[Indexes Required]
D --> D2[Connection Pooling]
D --> D3[Query Optimisation]
E --> E1[Database Replication]
E --> E2[Sharding]
E --> E3[Caching Layers]
E --> E4[Load Balancing]
style A fill:#9f9,stroke:#333
style B fill:#ffd93d,stroke:#333
style C fill:#4d96ff,stroke:#333
style D fill:#4d96ff,stroke:#333
style E fill:#4d96ff,stroke:#333
Small Scale (Hundreds of Users):
- Single database server sufficient
- Simple queries work fine
- Backups are quick (minutes)
- No special optimisation needed
- Example: Personal blog, small business website
Medium Scale (Thousands of Users):
- Need indexes on frequently queried columns
- Connection pooling becomes important
- Regular database maintenance required
- Backup strategies more complex (may take hours)
- Example: Regional e-commerce site, university portal
Large Scale (Millions of Users):
- Database replication (multiple copies across servers)
- Sharding (splitting data across multiple databases)
- Caching layers (Redis, Memcached) to reduce database load
- Load balancing across multiple servers
- Specialised big data tools (Hadoop, Spark)
- Example: Facebook, Amazon, Netflix
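A caching layer can be sketched with a plain dict standing in for Redis or Memcached. The point is that repeated reads never reach the database; the tiny in-memory table below is a stand-in, not the blog schema.

```python
import sqlite3

# In-memory demo database with one post
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("INSERT INTO posts VALUES (1, 'Hello')")

cache = {}     # stands in for Redis/Memcached
db_reads = 0   # count of queries that actually hit the database

def get_post(post_id):
    global db_reads
    if post_id in cache:            # cache hit: database untouched
        return cache[post_id]
    db_reads += 1                   # cache miss: one real query
    row = conn.execute(
        "SELECT title FROM posts WHERE post_id = ?", (post_id,)
    ).fetchone()
    cache[post_id] = row
    return row

get_post(1); get_post(1); get_post(1)
print(db_reads)  # 1: two of the three reads came from the cache
```

Real caches add expiry and invalidation on writes, which this sketch omits.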
Data Mining in Web Applications
What is Data Mining?
Data mining is the process of discovering patterns, trends, and useful information from large datasets. In web applications, this helps make data-driven decisions.
flowchart TD
A[/Raw User Data/] --> B[Data Collection]
B --> C[Data Processing]
C --> D[Pattern Discovery]
D --> E[Actionable Insights]
E --> F[Business Decisions]
E --> G[User Experience Improvements]
E --> H[Product Recommendations]
style A fill:#9f9,stroke:#333
style B fill:#4d96ff,stroke:#333
style C fill:#4d96ff,stroke:#333
style D fill:#4d96ff,stroke:#333
style E fill:#4d96ff,stroke:#333
style F fill:#9f9,stroke:#333
style G fill:#9f9,stroke:#333
style H fill:#9f9,stroke:#333
Common Data Mining Applications
1. User Behaviour Analysis
Understanding how users interact with your application:
def analyse_user_activity():
    """Find patterns in user activity"""
    query = """
        SELECT
            strftime('%H', created_at) as hour_of_day,
            COUNT(*) as post_count
        FROM posts
        GROUP BY hour_of_day
        ORDER BY hour_of_day
    """
    results = query_db(query)
    # Discover: Most posts are created between 18:00-22:00
    # Insight: Schedule maintenance during low-activity hours (03:00-05:00)
    return results
2. Content Popularity Analysis
Identifying trending topics and popular content:
def find_trending_topics():
    """Discover what content engages users most"""
    query = """
        SELECT
            posts.title,
            COUNT(comments.comment_id) as engagement,
            AVG(LENGTH(comments.content)) as avg_comment_length
        FROM posts
        LEFT JOIN comments ON posts.post_id = comments.post_id
        WHERE posts.created_at >= date('now', '-7 days')
        GROUP BY posts.post_id
        HAVING engagement > 5
        ORDER BY engagement DESC
        LIMIT 10
    """
    results = query_db(query)
    # Insight: Posts about Python tutorials get 3x more engagement
    # Decision: Create more Python content
    return results
3. User Segmentation
Grouping users by behaviour patterns:
def segment_users():
    """Categorise users by activity level"""
    query = """
        SELECT
            users.user_id,
            users.username,
            COUNT(DISTINCT posts.post_id) as posts_created,
            COUNT(DISTINCT comments.comment_id) as comments_made,
            CASE
                WHEN COUNT(DISTINCT posts.post_id) > 10 THEN 'Power User'
                WHEN COUNT(DISTINCT posts.post_id) > 5 THEN 'Active User'
                WHEN COUNT(DISTINCT posts.post_id) > 0 THEN 'Casual User'
                ELSE 'Lurker'
            END as user_segment
        FROM users
        LEFT JOIN posts ON users.user_id = posts.user_id
        LEFT JOIN comments ON users.user_id = comments.user_id
        GROUP BY users.user_id
    """
    results = query_db(query)
    # Insight: 80% of content comes from 20% of users (Pareto Principle)
    # Decision: Create incentives for casual users to post more
    return results
4. Correlation Analysis
Finding relationships between different metrics:
def analyse_engagement_correlation():
    """Does post length correlate with engagement?"""
    query = """
        SELECT
            LENGTH(posts.content) as content_length,
            COUNT(comments.comment_id) as comment_count
        FROM posts
        LEFT JOIN comments ON posts.post_id = comments.post_id
        GROUP BY posts.post_id
        HAVING content_length > 0
    """
    results = query_db(query)
    # Insight: Posts between 300-800 characters get most comments
    # Decision: Suggest optimal post length to users
    return results
Privacy and Ethical Considerations
Data Mining Ethics
When mining user data, you must:
- Obtain consent: Users should know their data is being analysed
- Respect privacy: Anonymise personal information
- Comply with regulations: GDPR, CCPA (covered in Week 11)
- Be transparent: Explain how insights are used
- Provide opt-out: Allow users to exclude their data
Example of anonymisation:
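A minimal sketch, assuming usernames are the identifying field: replace each with a salted one-way hash so analysts see stable pseudonyms instead of real identities. The salt value below is a placeholder, not a recommendation.

```python
import hashlib

SALT = b"replace-with-a-secret-salt"  # placeholder; store a real salt securely

def anonymise(username):
    """Return a stable pseudonym that cannot be reversed to the username."""
    return hashlib.sha256(SALT + username.encode("utf-8")).hexdigest()[:12]

# The same user gets the same pseudonym, so behaviour patterns survive
print(anonymise("alice") == anonymise("alice"))  # True
```

Because the mapping is one-way, activity analysis still works, but a leaked analytics dataset cannot be traced back to individual users without the salt.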
Metadata Management
Understanding Metadata
Metadata is "data about data" - information that describes, explains, or makes it easier to find and use other data.
Types of Metadata:
- Descriptive: What the data is (title, author, description)
- Structural: How data is organised (file format, table schema)
- Administrative: When, who, how (creation date, permissions, version)
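The operating system itself keeps administrative metadata about every file, which a script can read without opening the content. A small sketch using a throwaway temporary file:

```python
import os
import tempfile
from datetime import datetime

# Write a throwaway file, then inspect the metadata the filesystem kept
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("actual content")   # the data itself
    path = f.name

info = os.stat(path)            # the metadata about the data
print("size:", info.st_size, "bytes")
print("modified:", datetime.fromtimestamp(info.st_mtime).isoformat())
os.remove(path)
```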
Metadata in Database Design
flowchart LR
A[Core Data] --> B[Actual Content]
A --> C[Metadata]
B --> B1[Post text]
B --> B2[Comment content]
B --> B3[User message]
C --> C1[When created]
C --> C2[Who created it]
C --> C3[File size/type]
C --> C4[Last modified]
C --> C5[Access count]
style A fill:#9370DB,stroke:#333
style B fill:#4d96ff,stroke:#333
style C fill:#4d96ff,stroke:#333
Practical Example: File Upload System
CREATE TABLE uploaded_files (
    file_id INTEGER PRIMARY KEY AUTOINCREMENT,
    -- Core data
    file_path TEXT NOT NULL,            -- Where file is stored
    -- Descriptive metadata
    filename TEXT NOT NULL,             -- Original filename
    file_type TEXT NOT NULL,            -- MIME type (image/jpeg, application/pdf)
    file_size INTEGER NOT NULL,         -- Size in bytes
    description TEXT,                   -- User-provided description
    -- Administrative metadata
    uploaded_by INTEGER NOT NULL,       -- User who uploaded
    uploaded_at DATETIME DEFAULT CURRENT_TIMESTAMP,
    last_modified DATETIME,             -- When file was last changed
    last_accessed DATETIME,             -- When file was last viewed
    -- Usage metadata
    download_count INTEGER DEFAULT 0,   -- Popularity tracking
    is_public BOOLEAN DEFAULT 0,        -- Privacy setting
    -- Technical metadata
    file_hash TEXT UNIQUE,              -- SHA-256 hash for duplicate detection
    thumbnail_path TEXT,                -- Preview image location
    FOREIGN KEY (uploaded_by) REFERENCES users(user_id)
);
Benefits of Good Metadata
1. Duplicate Detection
def check_duplicate_file(file_hash):
    """Prevent users uploading the same file twice"""
    query = "SELECT file_id, filename FROM uploaded_files WHERE file_hash = ?"
    existing = query_db(query, (file_hash,), one=True)
    if existing:
        return f"File already exists as '{existing['filename']}'"
    return None
2. Storage Management
def get_user_storage_usage(user_id):
    """Calculate total storage used by user"""
    query = """
        SELECT
            uploaded_by,
            COUNT(*) as file_count,
            SUM(file_size) as total_bytes,
            ROUND(SUM(file_size) / 1024.0 / 1024.0, 2) as total_mb
        FROM uploaded_files
        WHERE uploaded_by = ?
        GROUP BY uploaded_by
    """
    return query_db(query, (user_id,), one=True)
3. Cleanup of Old Data
def find_unused_files(days_inactive=90):
    """Find files that haven't been accessed recently"""
    query = """
        SELECT file_id, filename, file_size, last_accessed
        FROM uploaded_files
        WHERE last_accessed < date('now', '-' || ? || ' days')
           OR last_accessed IS NULL
        ORDER BY file_size DESC
    """
    return query_db(query, (days_inactive,))
4. Popular Content Identification
def get_most_downloaded_files():
    """Find most popular files"""
    query = """
        SELECT
            filename,
            file_type,
            download_count,
            uploaded_at
        FROM uploaded_files
        WHERE is_public = 1
        ORDER BY download_count DESC
        LIMIT 20
    """
    return query_db(query)
Metadata for API Responses
Metadata improves API usability:
@app.route('/api/posts')
def api_get_posts():
    """API endpoint with metadata"""
    page = request.args.get('page', 1, type=int)
    per_page = 10
    offset = (page - 1) * per_page

    # Get posts
    posts = query_db(
        'SELECT * FROM posts ORDER BY created_at DESC LIMIT ? OFFSET ?',
        (per_page, offset)
    )

    # Get total count for metadata
    total = query_db('SELECT COUNT(*) as count FROM posts', one=True)['count']

    # Response includes metadata
    return jsonify({
        'data': [dict(post) for post in posts],
        'metadata': {
            'page': page,
            'per_page': per_page,
            'total_records': total,
            'total_pages': (total + per_page - 1) // per_page,
            'has_next': offset + per_page < total,
            'has_previous': page > 1,
            'timestamp': datetime.now().isoformat()
        }
    })
Advanced Indexing Concepts
How Indexes Work
Indexes are like a book's index - they create a sorted lookup structure that allows quick data retrieval without scanning every row.
flowchart TD
A[Query: Find user 'charlie'] --> B{Index on username?}
B -->|No Index| C[Full Table Scan]
C --> C1[Read row 1: alice]
C1 --> C2[Read row 2: bob]
C2 --> C3[Read row 3: charlie ✓]
C3 --> C4[Read row 4: david]
C4 --> C5[Read row 5: eve]
C5 --> D[/Found after 5 reads/]
B -->|With Index| E[Index Lookup]
E --> E1[B-tree search]
E1 --> E2[Found in 2 comparisons]
E2 --> E3[Jump directly to row 3]
E3 --> F[/Found after 1 read/]
style A fill:#9f9,stroke:#333
style B fill:#ffd93d,stroke:#333
style C fill:#ff6b6b,stroke:#333
style E fill:#4d96ff,stroke:#333
style D fill:#ff6b6b,stroke:#333
style F fill:#9f9,stroke:#333
B-Tree Index Structure (Simplified)
Databases typically use B-tree indexes:
Finding 'K' requires only 3 comparisons instead of scanning all letters.
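The same logarithmic behaviour shows up in a plain binary search over sorted data, which is the core idea a B-tree generalises to disk pages:

```python
from bisect import bisect_left

# Sorted data lets a lookup halve the search space at each step,
# just as a B-tree index does for table rows
letters = [chr(c) for c in range(ord('A'), ord('Z') + 1)]
pos = bisect_left(letters, 'K')
print(letters[pos])  # 'K', found in ~log2(26) ≈ 5 comparisons, not 26 reads
```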
Measuring Index Performance
def demonstrate_index_performance():
    """Show difference between indexed and non-indexed queries"""
    import time
    # Note: an exact-match condition is used because an index cannot help
    # a LIKE pattern with a leading wildcard such as '%Python%'
    # Query WITHOUT index
    start = time.time()
    result = query_db("SELECT * FROM posts WHERE title = 'Python Basics'")
    without_index = time.time() - start
    # Create index
    execute_db("CREATE INDEX IF NOT EXISTS idx_posts_title ON posts(title)")
    # Query WITH index
    start = time.time()
    result = query_db("SELECT * FROM posts WHERE title = 'Python Basics'")
    with_index = time.time() - start
    print(f"Without index: {without_index*1000:.2f}ms")
    print(f"With index: {with_index*1000:.2f}ms")
    if with_index > 0:
        print(f"Improvement: {(without_index/with_index):.1f}x faster")
Query Execution Plans
Use EXPLAIN QUERY PLAN to see how SQLite executes queries:
def analyse_query_plan():
    """Check if query uses indexes"""
    query = "SELECT * FROM users WHERE username = 'alice'"
    # Get execution plan
    plan = query_db(f"EXPLAIN QUERY PLAN {query}")
    for step in plan:
        print(dict(step))
    # Without index: SCAN TABLE users
    # With index: SEARCH TABLE users USING INDEX idx_users_username
Composite Indexes
Indexes on multiple columns for complex queries:
-- Composite index on user_id and created_at
CREATE INDEX idx_posts_user_date ON posts(user_id, created_at);
-- This query benefits from composite index
SELECT * FROM posts
WHERE user_id = 5
ORDER BY created_at DESC;
Column Order Matters:
-- ✅ Good: Index can be used for both conditions
CREATE INDEX idx_posts_user_date ON posts(user_id, created_at);
SELECT * FROM posts WHERE user_id = 5 AND created_at > '2025-01-01';
-- ❌ Less optimal: leading column doesn't match the equality filter
CREATE INDEX idx_posts_date_user ON posts(created_at, user_id);
SELECT * FROM posts WHERE user_id = 5; -- Can't use this index efficiently
Rule of thumb: Put the most selective (unique) column first, and columns used in equality tests before range tests.
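Whether the composite index is actually chosen can be verified rather than assumed. A sketch against a throwaway in-memory table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)")
conn.execute("CREATE INDEX idx_posts_user_date ON posts(user_id, created_at)")

# Ask SQLite how it would run the query; the plan names the index it picks
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM posts WHERE user_id = 5 AND created_at > '2025-01-01'"
).fetchall()
print(plan[0][3])  # a SEARCH step mentioning idx_posts_user_date
```

The exact plan wording varies between SQLite versions, but a `SEARCH ... USING INDEX` step confirms the composite index is in play, while `SCAN` means it is being ignored.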
Index Trade-offs
flowchart LR
A[Add Index] --> B[Benefits]
A --> C[Costs]
B --> B1[Faster SELECT]
B --> B2[Faster WHERE]
B --> B3[Faster JOIN]
B --> B4[Faster ORDER BY]
C --> C1[Slower INSERT]
C --> C2[Slower UPDATE]
C --> C3[Slower DELETE]
C --> C4[More disk space]
style A fill:#9370DB,stroke:#333
style B fill:#9f9,stroke:#333
style C fill:#ff6b6b,stroke:#333
When indexes hurt performance:
# If you have index on 'status' column
# But status only has values: 'active', 'inactive'
# Index doesn't help much because:
# - Low cardinality (only 2 unique values)
# - Database still needs to read ~50% of rows
# - Index maintenance overhead not worth it
# Better: Don't index low-cardinality columns
Shell Scripts for Database Management
Why Shell Scripts?
Shell scripts automate repetitive database administration tasks. Instead of manually running backup commands daily, create a script that runs automatically.
Common Database Shell Script Uses:
- Automated backups
- Database restoration
- Data export/import
- Cleanup of old data
- Database health monitoring
- Log rotation
- Performance reporting
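One of the listed tasks, log rotation, fits in a few lines of shell. A sketch with illustrative filenames:

```shell
#!/bin/sh
# Rotate a log: archive today's log under a date-stamped name, start a fresh one
LOG=app.log
echo "example log line" >> "$LOG"
mv "$LOG" "$LOG.$(date +%Y%m%d)"   # archived copy keeps its date in the name
: > "$LOG"                         # create a fresh, empty log file
ls "$LOG" "$LOG."*
```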
Advanced Backup Script with Logging
Create backup_advanced.sh:
#!/bin/bash
# Advanced database backup script with logging and error handling
# Configuration
DB_FILE="myapp.db"
BACKUP_DIR="backups"
LOG_DIR="logs"
MAX_BACKUPS=7
DATE=$(date +%Y%m%d_%H%M%S)
BACKUP_FILE="${BACKUP_DIR}/myapp_${DATE}.db"
LOG_FILE="${LOG_DIR}/backup_${DATE}.log"
# Colours for output
GREEN='\033[0;32m'
RED='\033[0;31m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Colour
# Function to log messages
log_message() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}
# Function to log errors
log_error() {
echo -e "${RED}[$(date '+%Y-%m-%d %H:%M:%S')] ERROR: $1${NC}" | tee -a "$LOG_FILE"
}
# Function to log success
log_success() {
echo -e "${GREEN}[$(date '+%Y-%m-%d %H:%M:%S')] SUCCESS: $1${NC}" | tee -a "$LOG_FILE"
}
# Create directories if they don't exist
mkdir -p "$BACKUP_DIR" "$LOG_DIR"
log_message "Starting database backup..."
# Check if database file exists
if [ ! -f "$DB_FILE" ]; then
log_error "Database file $DB_FILE not found!"
exit 1
fi
# Check database integrity before backup
log_message "Checking database integrity..."
sqlite3 "$DB_FILE" "PRAGMA integrity_check;" > /dev/null 2>&1
if [ $? -eq 0 ]; then
log_success "Database integrity check passed"
else
log_error "Database integrity check failed!"
exit 1
fi
# Create backup
log_message "Creating backup: $BACKUP_FILE"
cp "$DB_FILE" "$BACKUP_FILE"
if [ $? -eq 0 ]; then
# Get file size
SIZE=$(du -h "$BACKUP_FILE" | cut -f1)
log_success "Backup created successfully (Size: $SIZE)"
# Compress backup
log_message "Compressing backup..."
gzip "$BACKUP_FILE"
if [ $? -eq 0 ]; then
COMPRESSED_SIZE=$(du -h "${BACKUP_FILE}.gz" | cut -f1)
log_success "Backup compressed (Size: $COMPRESSED_SIZE)"
else
log_error "Compression failed"
fi
# Clean up old backups
log_message "Cleaning up old backups (keeping ${MAX_BACKUPS} most recent)..."
DELETED=0
for OLD_BACKUP in $(ls -t ${BACKUP_DIR}/myapp_*.db.gz 2>/dev/null | tail -n +$((MAX_BACKUPS + 1))); do
rm -f "$OLD_BACKUP"
log_message "Deleted old backup: $(basename $OLD_BACKUP)"
DELETED=$((DELETED + 1))
done
if [ $DELETED -gt 0 ]; then
log_success "Deleted $DELETED old backup(s)"
else
log_message "No old backups to delete"
fi
# Summary
log_success "Backup completed successfully!"
echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
echo -e "${GREEN}Backup Summary:${NC}"
echo -e " Location: ${BACKUP_FILE}.gz"
echo -e " Size: $COMPRESSED_SIZE"
echo -e " Log: $LOG_FILE"
echo -e "${GREEN}━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━${NC}"
else
log_error "Backup failed!"
exit 1
fi
Make executable:
chmod +x backup_advanced.sh
Database Restoration Script
Create restore.sh:
#!/bin/bash
# Database restoration script
BACKUP_DIR="backups"
DB_FILE="myapp.db"
echo "Available backups:"
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
ls -lht "${BACKUP_DIR}"/myapp_*.db.gz | nl
echo "━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━"
read -p "Enter backup number to restore (or 'q' to quit): " BACKUP_NUM
if [ "$BACKUP_NUM" = "q" ]; then
echo "Restoration cancelled"
exit 0
fi
# Get the selected backup file
BACKUP_FILE=$(ls -t "${BACKUP_DIR}"/myapp_*.db.gz | sed -n "${BACKUP_NUM}p")
if [ -z "$BACKUP_FILE" ]; then
echo "❌ Invalid backup number"
exit 1
fi
echo "Selected backup: $BACKUP_FILE"
read -p "⚠️ This will overwrite the current database. Continue? (yes/no): " CONFIRM
if [ "$CONFIRM" != "yes" ]; then
echo "Restoration cancelled"
exit 0
fi
# Create safety backup of current database
if [ -f "$DB_FILE" ]; then
SAFETY_BACKUP="${DB_FILE}.before_restore_$(date +%Y%m%d_%H%M%S)"
cp "$DB_FILE" "$SAFETY_BACKUP"
echo "✅ Current database backed up to: $SAFETY_BACKUP"
fi
# Decompress and restore
echo "Restoring database..."
gunzip -c "$BACKUP_FILE" > "$DB_FILE"
if [ $? -eq 0 ]; then
# Verify restored database
sqlite3 "$DB_FILE" "PRAGMA integrity_check;" > /dev/null 2>&1
if [ $? -eq 0 ]; then
echo "✅ Database restored successfully!"
echo "✅ Integrity check passed"
else
echo "❌ Restored database failed integrity check!"
echo "Restoring safety backup..."
cp "$SAFETY_BACKUP" "$DB_FILE"
exit 1
fi
else
echo "❌ Restoration failed!"
exit 1
fi
Make executable:
chmod +x restore.sh
Database Health Check Script
Create health_check.py:
#!/usr/bin/env python3
"""
Database health check script
Monitors database size, performance, and integrity
"""
import sqlite3
import os
from datetime import datetime

DB_FILE = 'myapp.db'

def format_size(num_bytes):
    """Convert bytes to human-readable format"""
    for unit in ['B', 'KB', 'MB', 'GB']:
        if num_bytes < 1024.0:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024.0
    return f"{num_bytes:.2f} TB"

def check_database_health():
    """Perform comprehensive database health check"""
    print("=" * 50)
    print("DATABASE HEALTH CHECK")
    print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 50)

    # Check file exists
    if not os.path.exists(DB_FILE):
        print(f"❌ Database file '{DB_FILE}' not found!")
        return False

    # File size
    file_size = os.path.getsize(DB_FILE)
    print(f"\n📊 Database Size: {format_size(file_size)}")

    try:
        conn = sqlite3.connect(DB_FILE)
        cursor = conn.cursor()

        # Integrity check
        print("\n🔍 Running integrity check...")
        cursor.execute("PRAGMA integrity_check")
        result = cursor.fetchone()[0]
        if result == "ok":
            print("✅ Integrity check: PASSED")
        else:
            print(f"❌ Integrity check: FAILED - {result}")
            return False

        # Table statistics
        print("\n📈 Table Statistics:")
        cursor.execute("""
            SELECT name FROM sqlite_master
            WHERE type='table' AND name NOT LIKE 'sqlite_%'
        """)
        tables = cursor.fetchall()
        for (table_name,) in tables:
            cursor.execute(f"SELECT COUNT(*) FROM {table_name}")
            count = cursor.fetchone()[0]
            print(f"  • {table_name}: {count:,} records")

        # Index usage
        print("\n🔎 Indexes:")
        cursor.execute("""
            SELECT name, tbl_name FROM sqlite_master
            WHERE type='index' AND name NOT LIKE 'sqlite_%'
        """)
        indexes = cursor.fetchall()
        if indexes:
            for index_name, table_name in indexes:
                print(f"  • {index_name} on {table_name}")
        else:
            print("  ⚠️ No custom indexes found")

        # Recent activity
        print("\n📅 Recent Activity:")
        cursor.execute("""
            SELECT COUNT(*) FROM users
            WHERE created_at >= date('now', '-7 days')
        """)
        new_users = cursor.fetchone()[0]
        print(f"  • New users (last 7 days): {new_users}")
        cursor.execute("""
            SELECT COUNT(*) FROM posts
            WHERE created_at >= date('now', '-7 days')
        """)
        new_posts = cursor.fetchone()[0]
        print(f"  • New posts (last 7 days): {new_posts}")

        # Database settings
        print("\n⚙️ Database Settings:")
        cursor.execute("PRAGMA page_size")
        page_size = cursor.fetchone()[0]
        print(f"  • Page size: {page_size} bytes")
        cursor.execute("PRAGMA page_count")
        page_count = cursor.fetchone()[0]
        print(f"  • Page count: {page_count:,}")

        # Fragmentation
        cursor.execute("PRAGMA freelist_count")
        freelist = cursor.fetchone()[0]
        fragmentation = (freelist / page_count * 100) if page_count > 0 else 0
        print(f"  • Fragmentation: {fragmentation:.2f}%")
        if fragmentation > 20:
            print("  ⚠️ Consider running VACUUM to reduce fragmentation")

        conn.close()
        print("\n" + "=" * 50)
        print("✅ Health check completed successfully")
        print("=" * 50)
        return True

    except Exception as e:
        print(f"\n❌ Error during health check: {e}")
        return False

if __name__ == '__main__':
    check_database_health()
Make executable:
Run:
Practical Activity 6.1: Big Data Analysis (15 minutes)
Task: Implement data mining queries to analyse your blog data.
- Create `data_analysis.py`:
# data_analysis.py
from db_config import query_db

def analyse_user_engagement():
    """Analyse user engagement patterns"""
    print("=" * 60)
    print("USER ENGAGEMENT ANALYSIS")
    print("=" * 60)

    # Most active users
    print("\n📊 Most Active Users:")
    query = """
        SELECT
            users.username,
            COUNT(DISTINCT posts.post_id) as posts,
            COUNT(DISTINCT comments.comment_id) as comments,
            (COUNT(DISTINCT posts.post_id) + COUNT(DISTINCT comments.comment_id)) as total_activity
        FROM users
        LEFT JOIN posts ON users.user_id = posts.user_id
        LEFT JOIN comments ON users.user_id = comments.user_id
        GROUP BY users.user_id
        HAVING total_activity > 0
        ORDER BY total_activity DESC
        LIMIT 5
    """
    results = query_db(query)
    for user in results:
        print(f"  • {user['username']}: {user['posts']} posts, {user['comments']} comments")

    # Activity by hour
    print("\n⏰ Activity by Hour of Day:")
    query = """
        SELECT
            strftime('%H', created_at) as hour,
            COUNT(*) as activity_count
        FROM (
            SELECT created_at FROM posts
            UNION ALL
            SELECT created_at FROM comments
        )
        GROUP BY hour
        ORDER BY hour
    """
    results = query_db(query)
    for row in results:
        bar = '█' * (row['activity_count'] // 2)
        print(f"  {row['hour']}:00 {bar} {row['activity_count']}")

    # Popular content
    print("\n🔥 Most Commented Posts:")
    query = """
        SELECT
            posts.title,
            users.username as author,
            COUNT(comments.comment_id) as comment_count
        FROM posts
        INNER JOIN users ON posts.user_id = users.user_id
        LEFT JOIN comments ON posts.post_id = comments.post_id
        GROUP BY posts.post_id
        HAVING comment_count > 0
        ORDER BY comment_count DESC
        LIMIT 5
    """
    results = query_db(query)
    for post in results:
        print(f"  • '{post['title']}' by {post['author']}: {post['comment_count']} comments")

def analyse_metadata():
    """Analyse metadata patterns"""
    print("\n" + "=" * 60)
    print("METADATA ANALYSIS")
    print("=" * 60)

    # Account age distribution
    print("\n📅 User Account Age Distribution:")
    query = """
        SELECT
            CASE
                WHEN julianday('now') - julianday(created_at) < 7 THEN 'Less than 1 week'
                WHEN julianday('now') - julianday(created_at) < 30 THEN '1-4 weeks'
                WHEN julianday('now') - julianday(created_at) < 90 THEN '1-3 months'
                ELSE 'Over 3 months'
            END as account_age,
            COUNT(*) as user_count
        FROM users
        GROUP BY account_age
    """
    results = query_db(query)
    for row in results:
        print(f"  • {row['account_age']}: {row['user_count']} users")

    # Content length analysis
    print("\n📏 Post Length Analysis:")
    query = """
        SELECT
            CASE
                WHEN LENGTH(content) < 100 THEN 'Short (< 100 chars)'
                WHEN LENGTH(content) < 500 THEN 'Medium (100-500 chars)'
                ELSE 'Long (> 500 chars)'
            END as post_length,
            COUNT(*) as post_count,
            AVG(LENGTH(content)) as avg_length
        FROM posts
        GROUP BY post_length
    """
    results = query_db(query)
    for row in results:
        print(f"  • {row['post_length']}: {row['post_count']} posts (avg: {row['avg_length']:.0f} chars)")

if __name__ == '__main__':
    analyse_user_engagement()
    analyse_metadata()
- Populate test data for analysis:
# populate_test_data.py
from db_config import execute_db
from werkzeug.security import generate_password_hash
import random

# Create test users
usernames = ['alice', 'bob', 'charlie', 'diana', 'eve']
for username in usernames:
    try:
        execute_db(
            'INSERT INTO users (username, email, password_hash) VALUES (?, ?, ?)',
            (username, f'{username}@example.com', generate_password_hash('password123'))
        )
    except Exception:
        pass  # User may already exist

# Create test posts
post_titles = [
    'Getting Started with Python',
    'Web Development Best Practices',
    'Database Design Patterns',
    'Understanding SQL Injection',
    'Building Secure Applications'
]
post_contents = [
    'Short post.',
    'This is a medium-length post with some interesting content about web development.' * 3,
    'This is a longer post with extensive detail about database design and normalisation principles.' * 10
]
for i in range(20):
    user_id = random.randint(1, 5)
    title = random.choice(post_titles) + f" #{i}"
    content = random.choice(post_contents)
    execute_db(
        'INSERT INTO posts (user_id, title, content) VALUES (?, ?, ?)',
        (user_id, title, content)
    )

# Create test comments
for i in range(50):
    user_id = random.randint(1, 5)
    post_id = random.randint(1, 20)
    content = f'Great post! This is comment number {i}.'
    try:
        execute_db(
            'INSERT INTO comments (user_id, post_id, content) VALUES (?, ?, ?)',
            (user_id, post_id, content)
        )
    except Exception:
        pass  # Post may not exist

print('✅ Test data populated successfully!')
- Run the analysis:
- Git commit: "Add data mining analysis tools"
Checkpoint
You should see:
- User engagement statistics
- Activity patterns by hour
- Popular content identification
- Metadata insights about post lengths and account ages
Summary & Key Takeaways
What You've Learned
✅ Big Data Concepts: Understanding scalability challenges and solutions
✅ Data Mining: Extracting insights from user behaviour and content patterns
✅ Metadata Management: Using "data about data" to improve system functionality
✅ Advanced Indexing: How indexes work and when to use them
✅ Database Administration: Automated backups, restoration, and health monitoring
Practical Skills Acquired
- Writing data mining queries for business intelligence
- Implementing metadata tracking for files and content
- Creating comprehensive backup and restoration scripts
- Monitoring database health and performance
- Understanding trade-offs in database design for scale
Teacher Notes: Part 6
Formative Assessment:
- Can students explain the difference between data and metadata?
- Can they identify when a query would benefit from an index?
- Do they understand privacy implications of data mining?
- Can they interpret data mining results to make decisions?
Syllabus Alignment Summary
Programming for the Web - 100% Coverage
✅ Model elements that form a web development system (databases)
✅ Observe and describe back-end process (database operations)
✅ Shell scripts (backup automation)
✅ SQL queries (SELECT, INSERT, UPDATE, DELETE, GROUP BY, WHERE, JOINs)
✅ Big data concepts (indexing, metadata, data mining introduction)
Secure Software Architecture - 100% Coverage
✅ Security concepts applied to data (confidentiality, integrity, availability)
✅ Defensive input handling (preventing SQL injection)
✅ Security features for data protection (prepared statements, least privilege)
Student Self-Assessment Checklist
Use this checklist to verify you've mastered all content:
Database Fundamentals
- I can explain the difference between SQL and NoSQL databases
- I understand what CRUD operations are and can identify them
- I can explain what primary keys and foreign keys are
- I can describe database normalisation and why it matters
- I can securely configure database connections using environment variables
SQL Queries
- I can write INSERT statements to create new records
- I can write SELECT statements with WHERE, ORDER BY, and LIMIT
- I can use aggregate functions (COUNT, SUM, AVG, MIN, MAX)
- I can use GROUP BY to analyse data
- I can write UPDATE statements (with WHERE clause!)
- I can write DELETE statements (with WHERE clause!)
- I can perform INNER JOIN to combine tables
- I can perform LEFT JOIN to show all records from one table
Security
- I can explain what SQL injection is and why it's dangerous
- I can identify vulnerable code that uses string concatenation
- I can write queries using prepared statements (parameterised queries)
- I can test my code against SQL injection attacks
- I understand defence in depth (multiple security layers)
Performance & Operations
- I can explain what database indexes are and why they improve performance
- I can create indexes on appropriate columns
- I can use EXPLAIN to analyse query performance
- I can write shell scripts to backup databases
- I can restore databases from backups
- I understand the 3-2-1 backup rule
Integration
- I have integrated my Week 7 authentication with the database
- All my queries use prepared statements
- My data persists across server restarts
- I have tested my application's security
- My Git history shows clear, meaningful commits
Self-Assessment Score
- All checked: Excellent! You're ready for Week 9
- 80%+ checked: Good progress, review unchecked items
- 60-80% checked: Review content parts with gaps
- <60% checked: Revisit lesson, complete all activities
Teacher Notes
Differentiation Strategies
For Students Who Need Support:
- Simplified Schema: Start with just two tables (users and posts) instead of four
- Visual Tools: Use DB Browser for SQLite GUI instead of command-line only
- Pre-written Queries: Provide SQL query templates with blanks to fill in
- Paired Programming: Partner stronger students with those needing support for Activities 4 and 5
- Extended Time: Allow 150 minutes instead of 120 for completion
- Video Walkthroughs: Provide recorded demonstrations of complex topics (JOINs, prepared statements)
- Focus on basic data mining queries and simple backup scripts
For Advanced Students:
- Advanced Queries: Challenge with complex multi-table JOINs and subqueries
- ORM Preview: Introduce SQLAlchemy or similar ORM as extension
- Performance Optimisation: Profile queries, optimise slow queries, create covering indexes
- Database Security: Research and implement row-level security
- Migration System: Create database migration system from scratch
- Peer Teaching: Have them help struggling classmates during activities
- Real-World Challenge: Design schema for complex system (e.g., e-commerce with orders, products, inventory)
- Explore machine learning applications, implement predictive analytics, create distributed database simulations
Common Misconceptions
Misconception #1: SELECT * is Fine
Misconception: "It's easier to just SELECT * from tables"
Reality: This is inefficient and can expose sensitive data
Correction: Always specify exact columns needed. Show performance difference with EXPLAIN.
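A small SQLite sketch makes the exposure concrete (the `users` columns here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (username TEXT, email TEXT, password_hash TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com', 'secret-hash')")

# SELECT * drags the password hash along with every other column
all_columns = conn.execute("SELECT * FROM users").fetchall()
print(all_columns)  # the hash is now in the result set

# Naming columns returns only what the page actually needs
rows = conn.execute("SELECT username, email FROM users").fetchall()
print(rows)  # [('alice', 'alice@example.com')]
```

If that result set is later serialised to JSON for an API response, `SELECT *` has just leaked the hash to the client.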
Misconception #2: Client-Side Validation is Enough
Misconception: "I already validate in JavaScript, so SQL injection isn't possible"
Reality: Attackers bypass client-side validation entirely (use Postman, cURL, or modify browser requests)
Correction: Demonstrate bypassing client validation with Postman. Show that server MUST validate.
Misconception #3: Escaping is Same as Prepared Statements
Misconception: "Can't I just remove dangerous characters instead of using prepared statements?"
Reality: Blacklist approaches inevitably miss edge cases. Prepared statements separate code from data, so injected input can never change the structure of the query.
Correction: Show that even "sanitised" input can sometimes be exploited. Prepared statements are non-negotiable.
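A minimal in-class demonstration of the difference, using an in-memory SQLite database (table and attack string are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE users (username TEXT, password_hash TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'hash1')")
conn.execute("INSERT INTO users VALUES ('bob', 'hash2')")

malicious = "nobody' OR '1'='1"

# VULNERABLE: concatenation lets the input rewrite the query itself
vulnerable_sql = "SELECT * FROM users WHERE username = '" + malicious + "'"
leaked = conn.execute(vulnerable_sql).fetchall()
print(len(leaked))  # 2 -- the OR '1'='1' matched every row

# SAFE: the ? placeholder treats the input purely as data
safe = conn.execute(
    "SELECT * FROM users WHERE username = ?", (malicious,)
).fetchall()
print(len(safe))    # 0 -- no user is literally named that
```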
Misconception #4: Backups Happen Automatically
Misconception: "The database saves my data, so I don't need backups"
Reality: Deletion, corruption, and hardware failure can destroy data permanently
Correction: Simulate data loss scenario (delete table), show recovery from backup.
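One way to stage this demonstration is Python's built-in `Connection.backup()` (in-memory databases stand in for real files here; in practice the backup would be a dated file on a different disk):

```python
import sqlite3

# A live database with one table of data
live = sqlite3.connect(':memory:')
live.execute("CREATE TABLE users (username TEXT)")
live.execute("INSERT INTO users VALUES ('alice')")

# Take a backup copy
backup = sqlite3.connect(':memory:')
live.backup(backup)

# Disaster: the table is dropped from the live database
live.execute("DROP TABLE users")

# Recovery: copy the backup back over the live database
backup.backup(live)
restored = live.execute("SELECT username FROM users").fetchall()
print(restored)  # [('alice',)]
```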
Misconception #5: Indexes Make Everything Faster
Misconception: "I should index every column for maximum speed"
Reality: Too many indexes slow down INSERT/UPDATE/DELETE operations and waste space
Correction: Show performance impact on writes. Teach strategic indexing.
Misconception #6: Foreign Keys are Optional
Misconception: "I can just store the user_id without making it a foreign key"
Reality: Foreign keys enforce referential integrity and prevent orphaned records
Correction: Demonstrate orphaned post (author deleted but post remains). Show ON DELETE CASCADE.
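This demonstration can be scripted directly (a cut-down version of the lesson's schema; note that SQLite only enforces foreign keys when `PRAGMA foreign_keys = ON` is set on each connection):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite!
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("""
    CREATE TABLE posts (
        post_id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(user_id) ON DELETE CASCADE,
        title TEXT
    )
""")
conn.execute("INSERT INTO users (username) VALUES ('alice')")
conn.execute("INSERT INTO posts (user_id, title) VALUES (1, 'Hello')")

# Deleting the author also removes their posts -- no orphans left behind
conn.execute("DELETE FROM users WHERE user_id = 1")
remaining = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(remaining)  # 0

# And a post can't be inserted for a user who doesn't exist
rejected = False
try:
    conn.execute("INSERT INTO posts (user_id, title) VALUES (99, 'Orphan')")
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```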
Misconception #7: Big Data
- "Big data always means millions of records" - It's relative to your infrastructure
- "More metadata is always better" - Balance between utility and storage overhead
- "Indexes solve all performance problems" - They have trade-offs
- "Data mining is just running queries" - It's about discovering actionable insights
Formative Assessment Strategies
During Activities (Informal Assessment):
- Code Review Checkpoints: Circulate during Activity 4.3, check for prepared statements
- Question Asking: Pose questions during demonstrations: "What would happen if...?"
- Exit Ticket Questions:
- "Explain SQL injection in your own words"
- "When would you use LEFT JOIN vs INNER JOIN?"
- "Why are prepared statements important?"
Hands-On Demonstrations:
- Live Debugging: Students share screen, debug SQL errors together
- Security Testing: Students attempt to SQL inject each other's code (controlled)
- Query Optimisation Competition: Who can write the fastest query? Use EXPLAIN to verify.
Written Checks:
- Quick Quiz (10 minutes):
  - Identify vulnerable code
  - Write a correct prepared statement
  - Explain when to use indexes
- Reflection Prompts:
  - "What was most challenging about SQL injection prevention?"
  - "How does this week connect to Week 7's authentication?"
Practical Checks:
- Commit History Review: Check Git logs for proper progression and meaningful messages
- Security Audit: Review student projects for SQL injection vulnerabilities
Diagnostic Questions
Check Understanding: CRUD
"Your friend says their blog needs to 'CRUD posts.' What do they mean? Give an example of each operation."
Check Understanding: SQL Injection
"A user enters `admin'; DROP TABLE users;--` in the username field. Walk me through what happens with vulnerable code vs prepared statements."
Check Understanding: JOINs
"You want to show all blog categories, even ones with zero posts. Which JOIN type do you use and why?"
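A worked answer, sketched with a cut-down schema (the category names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE categories (category_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE posts (post_id INTEGER PRIMARY KEY, category_id INTEGER, title TEXT);
    INSERT INTO categories VALUES (1, 'Python'), (2, 'Security');
    INSERT INTO posts VALUES (1, 1, 'Intro to Python');
""")

# LEFT JOIN keeps every category, even 'Security' with zero posts
left_rows = conn.execute("""
    SELECT categories.name, COUNT(posts.post_id) AS post_count
    FROM categories
    LEFT JOIN posts ON categories.category_id = posts.category_id
    GROUP BY categories.category_id
""").fetchall()
print(left_rows)   # [('Python', 1), ('Security', 0)]

# INNER JOIN silently drops the empty category
inner_rows = conn.execute("""
    SELECT categories.name, COUNT(posts.post_id) AS post_count
    FROM categories
    INNER JOIN posts ON categories.category_id = posts.category_id
    GROUP BY categories.category_id
""").fetchall()
print(inner_rows)  # [('Python', 1)]
```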
Check Understanding: Indexing
"Your posts table has 1 million rows. Users complain that searching by title is slow. What do you do?"
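One possible demonstration of the fix, using `EXPLAIN QUERY PLAN` to show the planner switching from a full scan to an index lookup (the exact plan wording varies between SQLite versions, so the comments show typical output):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT)")

# Before indexing: the planner must scan every row
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM posts WHERE title = ?", ('x',)
).fetchone()[3]
print(plan_before)  # e.g. "SCAN posts"

conn.execute("CREATE INDEX idx_posts_title ON posts (title)")

# After indexing: the planner jumps straight to matching rows
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM posts WHERE title = ?", ('x',)
).fetchone()[3]
print(plan_after)   # e.g. "SEARCH posts USING INDEX idx_posts_title (title=?)"
```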
Extension Activities
For Students Who Finish Early:
-
Advanced Queries Challenge:
-
Transactions and Rollback:
- Research SQL transactions
- Implement BEGIN/COMMIT/ROLLBACK
- Handle errors gracefully
-
Database Migrations:
- Create numbered migration files
- Write UP and DOWN scripts
- Track migration history
-
Query Performance Lab:
- Profile slow queries
- Optimise with indexes and query rewrites
- Document 10x+ performance improvements
-
Full-Text Search:
- Implement search functionality
- Create search indexes
- Rank results by relevance
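For the transactions challenge, a starting-point sketch (the `accounts` table and `transfer` helper are hypothetical, not part of the blog schema; Python's `with conn:` block commits on success and rolls back on any exception):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [('alice', 100), ('bob', 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both updates succeed, or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            # Simulate a business-rule failure after a partial update
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")
    except ValueError:
        pass  # the rollback has already undone the first UPDATE

transfer(conn, 'alice', 'bob', 30)   # succeeds
transfer(conn, 'alice', 'bob', 999)  # fails and rolls back cleanly
balances = conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall()
print(balances)  # [('alice', 70), ('bob', 80)]
```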
Resources and Tools
Essential Tools:
- SQLite (included with Python)
- DB Browser for SQLite (visual database tool)
- VSCode + SQLite or Postgres plugin
- Postman (API testing)
Documentation:
Practice Platforms:
- SQLBolt - Interactive SQL tutorial
- SQL Murder Mystery - Learn SQL through gamification
- HackerRank SQL - SQL challenges
Assessment Rubric
| Criteria | Excellent (4) | Proficient (3) | Developing (2) | Beginning (1) |
|---|---|---|---|---|
| Schema Design | Fully normalised, appropriate data types, all relationships defined | Minor normalisation issues, mostly correct | Poor normalisation, missing relationships | Unnormalised, incorrect structure |
| SQL Queries | All CRUD operations work, efficient queries, correct JOINs | Mostly correct, some inefficiencies | Some queries work, errors in complex operations | Major errors, cannot complete basic operations |
| Security | All queries use prepared statements, tested against attacks | Mostly secure, 1-2 vulnerabilities | Some prepared statements, significant vulnerabilities | String concatenation, highly vulnerable |
| Indexing | Strategic indexing, measured performance improvement | Indexes created but not optimised | Attempted indexing with errors | No indexes or incorrect implementation |
| Backups | Working backup and restore scripts, tested successfully | Scripts work but untested or incomplete | Scripts have errors or incomplete | No functional backup system |
Conclusion and Next Steps
Congratulations!
You've completed Week 8: Database Integration & SQL Security!
You now have:
- ✅ A persistent database replacing in-memory storage
- ✅ Secure queries using prepared statements
- ✅ Protection against SQL injection attacks
- ✅ Optimised performance with indexes
- ✅ Automated backup system
Connection to Week 9
Next week, you'll build on this foundation:
- Authentication → Authorization (who can do what?)
- User storage → Role-Based Access Control (admin, moderator, user)
- Basic security → CSRF Protection (prevent cross-site attacks)
- Simple sessions → Secure Session Management (proper cookies, expiration)
Final Checklist Before Moving On
- All database queries use prepared statements
- Schema is normalised with proper relationships
- Indexes created on frequently queried columns
- Backup and restore tested successfully
- Week 7 authentication integrated with database
- Data persists across server restarts
- SQL injection tests all fail (good!)
- Git history shows clear progression
- `.env` file in `.gitignore` (not committed!)
Keep Practising
- Review SQL queries daily
- Never write SQL without prepared statements
- Test your backups regularly
- Always use WHERE with UPDATE and DELETE!
You're ready for Complete Authentication, Authorization & CSRF Protection!