Overview: You are tasked with designing a small ETL pipeline and writing SQL queries that will help an organization like Samaritan improve its data operations. The data you're working with relates to a fictional system used to track Members, their goals, and the rewards they receive for achieving those goals.
--------------------------------
You have three CSV files that represent different tables in the database:
A) members.csv: Information about the Members (e.g., people experiencing homelessness).
member_id, name, email, joined_at, city
1, John Doe, [email protected], 2023-01-15, Seattle
2, Jane Smith, [email protected], 2023-02-20, New York
3, Carlos Lee, [email protected], 2023-03-10, Los Angeles
B) goals.csv: Information about the goals that Members are working towards.
goal_id, member_id, goal_type, target_date, status
101, 1, housing, 2023-06-01, Completed
102, 2, employment, 2023-05-15, In Progress
103, 3, healthcare, 2023-04-20, Completed
C) rewards.csv: Information about rewards issued to Members for completing goals.
reward_id, goal_id, reward_type, amount, date_issued
201, 101, Voucher, 50.00, 2023-06-05
202, 103, Voucher, 30.00, 2023-04-25
- Schema Design: Based on the CSVs above, create a data model that will support storing and querying this information in a relational database (e.g., PostgreSQL). An illustrative DDL sketch follows this list.
- Include Tables: Members, Goals, Rewards
- Describe the relationships between the tables.
- Indicate any indexes or optimizations you would apply for performance (e.g., for frequent queries related to Member progress or rewards issuance).
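For illustration only, a minimal PostgreSQL DDL sketch derived from the sample CSVs; the column types, constraints, and index choices here are assumptions, not the required answer:

CREATE TABLE members (
    member_id  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    email      TEXT,
    joined_at  DATE,
    city       TEXT
);

CREATE TABLE goals (
    goal_id     INTEGER PRIMARY KEY,
    member_id   INTEGER NOT NULL REFERENCES members (member_id),
    goal_type   TEXT NOT NULL,
    target_date DATE,
    status      TEXT NOT NULL
);

CREATE TABLE rewards (
    reward_id   INTEGER PRIMARY KEY,
    goal_id     INTEGER NOT NULL REFERENCES goals (goal_id),
    reward_type TEXT,
    amount      NUMERIC(10, 2),
    date_issued DATE
);

-- Foreign-key and filter-column indexes, assuming Member-progress and
-- rewards-issuance lookups dominate the workload:
CREATE INDEX idx_goals_member_id ON goals (member_id);
CREATE INDEX idx_goals_status    ON goals (status);
CREATE INDEX idx_rewards_goal_id ON rewards (goal_id);

The foreign keys encode the one-to-many relationships: one Member has many Goals, and one Goal can yield many Rewards.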
- ETL Pipeline: Design an ETL pipeline to load these CSV files into your database. An illustrative Python sketch follows this list.
- Assume the data will be available as CSV files in an AWS S3 bucket (you can simulate this locally by treating the CSVs as input files).
- Write a Python script that would:
- Download these files from S3 (assume the CSVs are stored in the samaritan-data bucket).
- Process the CSVs and load the data into your PostgreSQL database.
- Apply basic data validation (e.g., no missing member IDs, valid email format, target date in the future).
- Bonus: If you have AWS experience, suggest how you might scale this pipeline for larger datasets, and consider error handling or retry logic.
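A minimal sketch of that script, assuming boto3 and psycopg2-binary are installed (pip install boto3 psycopg2-binary), the tables from the schema sketch above exist, and the connection string is supplied via a DATABASE_URL environment variable; all of these are assumptions for illustration, not the expected implementation:

import csv
import io
import os
import re

import boto3
import psycopg2

BUCKET = "samaritan-data"
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def download_rows(key):
    """Fetch one CSV from the S3 bucket and return its rows as dicts."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(body), skipinitialspace=True))

def valid_member(row):
    """Basic validation: member_id present and email roughly well-formed."""
    return bool(row.get("member_id")) and EMAIL_RE.match(row.get("email", "")) is not None

def load():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var
    with conn, conn.cursor() as cur:
        for row in download_rows("members.csv"):
            if not valid_member(row):
                continue  # in practice: log and quarantine rejected rows
            cur.execute(
                "INSERT INTO members (member_id, name, email, joined_at, city) "
                "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (member_id) DO NOTHING",
                (row["member_id"], row["name"], row["email"],
                 row["joined_at"], row["city"]),
            )
        # goals.csv and rewards.csv would follow the same pattern, each with
        # its own checks (e.g., target_date parses and lies in the future,
        # amount is numeric, foreign keys resolve).
    conn.close()

if __name__ == "__main__":
    load()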
Now that your data is in the database, write the following SQL queries:
- Query 1: Find the number of Members by city who have completed at least one goal (one possible query is sketched below).
- Output: city, number_of_members
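One way the query could read, assuming the schema sketched earlier; COUNT(DISTINCT ...) keeps a Member from being counted twice if they completed several goals:

SELECT m.city, COUNT(DISTINCT m.member_id) AS number_of_members
FROM members m
JOIN goals g ON g.member_id = m.member_id
WHERE g.status = 'Completed'
GROUP BY m.city;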
- Query 2: For each Member, calculate the total reward amount they've received for completed goals (see the sketch below).
- Output: member_id, name, total_rewards
- Only include rewards for goals marked as Completed.
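A possible shape, again assuming the sketched schema; the LEFT JOINs plus COALESCE keep Members with no completed-goal rewards in the result at 0 (switch to inner joins if the intent is rewarded Members only):

SELECT m.member_id, m.name, COALESCE(SUM(r.amount), 0) AS total_rewards
FROM members m
LEFT JOIN goals g   ON g.member_id = m.member_id AND g.status = 'Completed'
LEFT JOIN rewards r ON r.goal_id = g.goal_id
GROUP BY m.member_id, m.name;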
- Query 3: Write a query that returns all Members who have not yet completed their Healthcare goal (a possible query follows).
- Output: member_id, name, goal_status
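A sketch under the assumption that "not yet completed" means a healthcare goal exists with any status other than Completed; note the sample data stores goal_type in lowercase:

SELECT m.member_id, m.name, g.status AS goal_status
FROM members m
JOIN goals g ON g.member_id = m.member_id
WHERE g.goal_type = 'healthcare'
  AND g.status <> 'Completed';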
- Query 4: Identify the top 3 goal types (by goal_type) with the highest total reward amount issued, and show the total reward for each (sketched below).
- Output: goal_type, total_rewards
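One possible query, assuming the sketched schema:

SELECT g.goal_type, SUM(r.amount) AS total_rewards
FROM goals g
JOIN rewards r ON r.goal_id = g.goal_id
GROUP BY g.goal_type
ORDER BY total_rewards DESC
LIMIT 3;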
- Bonus Query: Optimize the following query, which runs frequently on your database and has performance issues. Given that rewards has many records and the goal is to join with goals and members, rewrite the query to improve its performance. Consider indexing, query optimization, or restructuring the query for faster execution (one possible indexing approach is sketched after the query).
SELECT m.name, g.goal_type, r.amount
FROM members m
JOIN goals g ON m.member_id = g.member_id
JOIN rewards r ON g.goal_id = r.goal_id
WHERE g.status = 'Completed' AND r.date_issued > '2023-01-01'
ORDER BY r.amount DESC;
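For reference, one direction an answer could take, assuming a PostgreSQL workload where the status filter and the join keys dominate; this is a hint, not the expected solution:

-- Composite indexes covering the filter and join columns; verify the effect
-- with EXPLAIN ANALYZE before and after.
CREATE INDEX idx_goals_status_member ON goals (status, member_id, goal_id);
CREATE INDEX idx_rewards_goal_date   ON rewards (goal_id, date_issued, amount);

Restructuring is also fair game, e.g., pre-filtering rewards by date_issued in a subquery before joining.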
Submission Guidelines: Please submit the following:
- A SQL file that contains the queries you’ve written.
- A Python script (or pseudocode if you prefer) that implements the ETL pipeline (including CSV parsing, database connection, and data insertion). If you use any external libraries (e.g., psycopg2 for PostgreSQL), please include the installation instructions.
- A brief data model diagram (e.g., ERD) showing the structure of the database and relationships between the tables.
- Any assumptions you made while designing the system or solving the queries.