Overview: You are tasked with designing a small ETL pipeline and writing SQL queries that will help an organization like Samaritan improve its data operations. The data you're working with relates to a fictional system used to track Members, their goals, and the rewards they receive for achieving those goals.
--------------------------------
You have three CSV files that represent different tables in the database:
A) members.csv: Information about the Members (e.g., people experiencing homelessness).
member_id, name, email, joined_at, city
1, John Doe, [email protected], 2023-01-15, Seattle
2, Jane Smith, [email protected], 2023-02-20, New York
3, Carlos Lee, [email protected], 2023-03-10, Los Angeles
B) goals.csv: Information about the goals that Members are working towards.
goal_id, member_id, goal_type, target_date, status
101, 1, housing, 2023-06-01, Completed
102, 2, employment, 2023-05-15, In Progress
103, 3, healthcare, 2023-04-20, Completed
C) rewards.csv: Information about rewards issued to Members for completing goals.
reward_id, goal_id, reward_type, amount, date_issued
201, 101, Voucher, 50.00, 2023-06-05
202, 103, Voucher, 30.00, 2023-04-25
- Schema Design: Based on the CSVs above, create a data model that will support storing and querying this information in a relational database (e.g., PostgreSQL). An illustrative DDL sketch follows this list.
- Include Tables: Members, Goals, Rewards
- Describe the relationships between the tables.
- Indicate any indexes or optimizations you would apply for performance (e.g., for frequent queries related to Member progress or rewards issuance).
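For illustration only, a minimal PostgreSQL DDL sketch derived from the sample CSVs; the column types, constraints, and index choices here are assumptions, not the required answer:

CREATE TABLE members (
    member_id  INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    email      TEXT,
    joined_at  DATE,
    city       TEXT
);

CREATE TABLE goals (
    goal_id     INTEGER PRIMARY KEY,
    member_id   INTEGER NOT NULL REFERENCES members (member_id),
    goal_type   TEXT NOT NULL,
    target_date DATE,
    status      TEXT NOT NULL
);

CREATE TABLE rewards (
    reward_id   INTEGER PRIMARY KEY,
    goal_id     INTEGER NOT NULL REFERENCES goals (goal_id),
    reward_type TEXT,
    amount      NUMERIC(10, 2),
    date_issued DATE
);

-- Foreign-key and filter-column indexes, assuming Member-progress and
-- rewards-issuance lookups dominate the workload:
CREATE INDEX idx_goals_member_id ON goals (member_id);
CREATE INDEX idx_goals_status    ON goals (status);
CREATE INDEX idx_rewards_goal_id ON rewards (goal_id);

The foreign keys encode the one-to-many relationships: one Member has many Goals, and one Goal can yield many Rewards.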
- ETL Pipeline: Design an ETL pipeline to load these CSV files into your database. An illustrative Python sketch follows this list.
- Assume the data will be available as CSV files in an AWS S3 bucket (you can simulate this locally by treating the CSVs as input files).
- Write a Python script that would:
- Download these files from S3 (assume the CSVs are stored in the samaritan-data bucket).
- Process the CSVs and load the data into your PostgreSQL database.
- Apply basic data validation (e.g., no missing member IDs, valid email format, target date in the future).
- Bonus: If you have AWS experience, suggest how you might scale this pipeline for larger datasets, and consider error handling or retry logic.
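A minimal sketch of that script, assuming boto3 and psycopg2-binary are installed (pip install boto3 psycopg2-binary), the tables from the schema sketch above exist, and the connection string is supplied via a DATABASE_URL environment variable; all of these are assumptions for illustration, not the expected implementation:

import csv
import io
import os
import re

import boto3
import psycopg2

BUCKET = "samaritan-data"
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def download_rows(key):
    """Fetch one CSV from the S3 bucket and return its rows as dicts."""
    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(body), skipinitialspace=True))

def valid_member(row):
    """Basic validation: member_id present and email roughly well-formed."""
    return bool(row.get("member_id")) and EMAIL_RE.match(row.get("email", "")) is not None

def load():
    conn = psycopg2.connect(os.environ["DATABASE_URL"])  # assumed env var
    with conn, conn.cursor() as cur:
        for row in download_rows("members.csv"):
            if not valid_member(row):
                continue  # in practice: log and quarantine rejected rows
            cur.execute(
                "INSERT INTO members (member_id, name, email, joined_at, city) "
                "VALUES (%s, %s, %s, %s, %s) ON CONFLICT (member_id) DO NOTHING",
                (row["member_id"], row["name"], row["email"],
                 row["joined_at"], row["city"]),
            )
        # goals.csv and rewards.csv would follow the same pattern, each with
        # its own checks (e.g., target_date parses and lies in the future,
        # amount is numeric, foreign keys resolve).
    conn.close()

if __name__ == "__main__":
    load()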
Now that your data is in the database, write the following SQL queries:
- Query 1: Find the number of Members by city who have completed at least one goal (one possible query is sketched below).
- Output: city, number_of_members
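One way the query could read, assuming the schema sketched earlier; COUNT(DISTINCT ...) keeps a Member from being counted twice if they completed several goals:

SELECT m.city, COUNT(DISTINCT m.member_id) AS number_of_members
FROM members m
JOIN goals g ON g.member_id = m.member_id
WHERE g.status = 'Completed'
GROUP BY m.city;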
- Query 2: For each Member, calculate the total reward amount they've received for completed goals (see the sketch below).
- Output: member_id, name, total_rewards
- Only include rewards for goals marked as Completed.
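A possible shape, again assuming the sketched schema; the LEFT JOINs plus COALESCE keep Members with no completed-goal rewards in the result at 0 (switch to inner joins if the intent is rewarded Members only):

SELECT m.member_id, m.name, COALESCE(SUM(r.amount), 0) AS total_rewards
FROM members m
LEFT JOIN goals g   ON g.member_id = m.member_id AND g.status = 'Completed'
LEFT JOIN rewards r ON r.goal_id = g.goal_id
GROUP BY m.member_id, m.name;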
- Query 3: Write a query that returns all Members who have not yet completed their Healthcare goal (a possible query follows).
- Output: member_id, name, goal_status
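A sketch under the assumption that "not yet completed" means a healthcare goal exists with any status other than Completed; note the sample data stores goal_type in lowercase:

SELECT m.member_id, m.name, g.status AS goal_status
FROM members m
JOIN goals g ON g.member_id = m.member_id
WHERE g.goal_type = 'healthcare'
  AND g.status <> 'Completed';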
- Query 4: Identify the top 3 goal types (by goal_type) with the highest total reward amount issued, and show the total reward for each (sketched below).
- Output: goal_type, total_rewards
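One possible query, assuming the sketched schema:

SELECT g.goal_type, SUM(r.amount) AS total_rewards
FROM goals g
JOIN rewards r ON r.goal_id = g.goal_id
GROUP BY g.goal_type
ORDER BY total_rewards DESC
LIMIT 3;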
- Bonus Query: Optimize the following query, which runs frequently on your database and has performance issues. Given that rewards has many records and the goal is to join with goals and members, rewrite the query to improve its performance. Consider indexing, query optimization, or restructuring the query for faster execution (one possible indexing approach is sketched after the query).
SELECT m.name, g.goal_type, r.amount
FROM members m
JOIN goals g ON m.member_id = g.member_id
JOIN rewards r ON g.goal_id = r.goal_id
WHERE g.status = 'Completed' AND r.date_issued > '2023-01-01'
ORDER BY r.amount DESC;
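For reference, one direction an answer could take, assuming a PostgreSQL workload where the status filter and the join keys dominate; this is a hint, not the expected solution:

-- Composite indexes covering the filter and join columns; verify the effect
-- with EXPLAIN ANALYZE before and after.
CREATE INDEX idx_goals_status_member ON goals (status, member_id, goal_id);
CREATE INDEX idx_rewards_goal_date   ON rewards (goal_id, date_issued, amount);

Restructuring is also fair game, e.g., pre-filtering rewards by date_issued in a subquery before joining.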
Submission Guidelines: Please submit the following:
- A SQL file that contains the queries you’ve written.
- A Python script (or pseudocode if you prefer) that implements the ETL pipeline (including CSV parsing, database connection, and data insertion). If you use any external libraries (e.g., psycopg2 for PostgreSQL), please include the installation instructions.
- A brief data model diagram (e.g., ERD) showing the structure of the database and relationships between the tables.
- Any assumptions you made while designing the system or solving the queries.