@dannysperry
Last active February 6, 2025 19:20

Samaritan Data Engineer Take-Home Assignment

Overview: You are tasked with designing a small ETL pipeline and building some SQL queries that will help an organization like Samaritan improve its data operations. The data you're working with relates to a fictional system used to track Members, their goals, and the rewards they receive for achieving those goals.

----------------------------------

Dataset Overview:

You have three CSV files that represent different tables in the database:

A) members.csv: Information about the Members (e.g., people experiencing homelessness).

member_id, name, email, joined_at, city
1, John Doe, [email protected], 2023-01-15, Seattle
2, Jane Smith, [email protected], 2023-02-20, New York
3, Carlos Lee, [email protected], 2023-03-10, Los Angeles

B) goals.csv: Information about the goals that Members are working towards.

goal_id, member_id, goal_type, target_date, status
101, 1, housing, 2023-06-01, Completed
102, 2, employment, 2023-05-15, In Progress
103, 3, healthcare, 2023-04-20, Completed

C) rewards.csv: Information about rewards issued to Members for completing goals.

reward_id, goal_id, reward_type, amount, date_issued
201, 101, Voucher, 50.00, 2023-06-05
202, 103, Voucher, 30.00, 2023-04-25

Part 1: Data Modeling & ETL Pipeline

  • Schema Design: Based on the CSVs above, create a data model that will support storing and querying this information in a relational database (e.g., PostgreSQL).
    • Include Tables: Members, Goals, Rewards
    • Describe the relationships between the tables.
    • Indicate any indexes or optimizations you would apply for performance (e.g., for frequent queries related to Member progress or rewards issuance).
  • ETL Pipeline: Design an ETL pipeline to load these CSV files into your database.
    • Assume that the data will be available as CSV files in an AWS S3 bucket (you can simulate this process by imagining the CSVs as input files).
    • Write a Python script that would:
      • Download these files from S3 (assume the CSVs are stored in the samaritan-data bucket).
      • Process the CSVs and load the data into your PostgreSQL database.
      • Apply basic data validation (e.g., no missing member IDs, valid email format, target date in the future).
    • Bonus: If you have AWS experience, suggest how you might scale this pipeline for larger datasets, and consider error handling or retry logic.

Part 2: SQL Queries

Now that your data is in the database, write the following SQL queries:

  • Query 1: Find the number of Members by city who have completed at least one goal.
    • Output: city, number_of_members
  • Query 2: For each Member, calculate the total reward amount they’ve received for completed goals.
    • Output: member_id, name, total_rewards
    • Only include rewards for goals marked as Completed.
  • Query 3: Write a query that returns all Members who have not yet completed their Healthcare goal.
    • Output: member_id, name, goal_status
  • Query 4: Identify the top 3 goals (by goal_type) that have the highest total reward amount issued, and show the total reward for each goal.
    • Output: goal_type, total_rewards
  • Bonus Query: Optimize the following query, which runs frequently on your database and has performance issues. Given that rewards has many records and must be joined with goals and members, rewrite the query to improve its performance; consider indexing, query optimization, or restructuring the query for faster execution.
SELECT m.name, g.goal_type, r.amount
FROM members m
JOIN goals g ON m.member_id = g.member_id
JOIN rewards r ON g.goal_id = r.goal_id
WHERE g.status = 'Completed' AND r.date_issued > '2023-01-01'
ORDER BY r.amount DESC;
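Before running against PostgreSQL, the queries can be prototyped on an in-memory SQLite database seeded with the three sample rows from the overview. A sketch showing one way Query 1 might look (the schema here is trimmed to just the columns this query needs):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE members (member_id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE goals   (goal_id INTEGER PRIMARY KEY,
                      member_id INTEGER REFERENCES members,
                      goal_type TEXT, status TEXT);
""")
conn.executemany("INSERT INTO members VALUES (?, ?, ?)", [
    (1, "John Doe", "Seattle"),
    (2, "Jane Smith", "New York"),
    (3, "Carlos Lee", "Los Angeles"),
])
conn.executemany("INSERT INTO goals VALUES (?, ?, ?, ?)", [
    (101, 1, "housing", "Completed"),
    (102, 2, "employment", "In Progress"),
    (103, 3, "healthcare", "Completed"),
])

# Query 1: Members per city with at least one Completed goal.
rows = conn.execute("""
    SELECT m.city, COUNT(DISTINCT m.member_id) AS number_of_members
    FROM members m
    JOIN goals g ON g.member_id = m.member_id
    WHERE g.status = 'Completed'
    GROUP BY m.city
    ORDER BY m.city
""").fetchall()
print(rows)  # [('Los Angeles', 1), ('Seattle', 1)]
```

SQLite's dialect is close enough to PostgreSQL's for these aggregate queries, so the same query text should run on both.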

Submission Guidelines: Please submit the following:

  • A SQL file that contains the queries you’ve written.
  • A Python script (or pseudocode if you prefer) that implements the ETL pipeline (including CSV parsing, database connection, and data insertion). If you use any external libraries (e.g., psycopg2 for PostgreSQL), please include the installation instructions.
  • A brief data model diagram (e.g., ERD) showing the structure of the database and relationships between the tables.
  • Any assumptions you made while designing the system or solving the queries.
Appendix: Full Sample Data

members.csv:

member_id, name, email, joined_at, city
1, Jane Doe, [email protected], 6/29/2023, Seattle
2, John Doe, [email protected], 7/7/2023, Seattle
3, Mary Sue, [email protected], 7/15/2023, New York
4, Larry Sue, [email protected], 7/23/2023, New York
5, Gary Berry, [email protected], 7/31/2023, Los Angeles
6, John Berry, [email protected], 8/8/2023, Los Angeles
7, Don Atello, [email protected], 8/16/2023, Seattle
8, Mike Elangello, [email protected], 8/24/2023, N.Y.
9, Raphael Turtle, [email protected], 9/1/2023, L.A.
10, Cindy Lu, [email protected], 9/9/2023, Los Angeles
goals.csv:

goal_id, member_id, goal_type, target_date, status
1, 1, housing, 7/15/2023, Completed
2, 2, employment, 7/22/2023, In Progress
3, 3, healthcare, 7/29/2023, Completed
4, 4, housing, 8/5/2023, Completed
5, 1, employment, 8/12/2023, In Progress
6, 5, healthcare, 8/19/2023, Completed
7, 2, housing, 8/26/2023, In Progress
8, 6, employment, 9/2/2023, Completed
9, 7, healthcare, 9/9/2023, Completed
10, 8, housing, 9/16/2023, Completed
11, 3, employment, 9/23/2023, Completed
12, 1, healthcare, 9/30/2023, In Progress
13, 5, housing, 10/7/2023, Completed
14, 10, employment, 10/14/2023, Cancelled
15, 2, healthcare, 10/21/2023, Completed
16, 3, housing, 10/28/2023, Completed
17, 7, employment, 11/4/2023, Completed
18, 6, healthcare, 11/11/2023, In Progress
19, 10, housing, 11/18/2023, Completed
20, 4, employment, 11/25/2023, Completed
21, 9, healthcare, 12/2/2023, In Progress
rewards.csv:

reward_id, goal_id, reward_type, amount, date_issued
1, 1, Voucher, 50, 7/29/2023
2, 2, Voucher, 30, 8/5/2023
3, 2, Voucher, 10, 8/12/2023
4, 3, Voucher, 25, 8/19/2023
5, 4, Voucher, 15, 8/26/2023
6, 5, Voucher, 12, 9/2/2023
7, 5, Voucher, 22, 9/9/2023
8, 6, Voucher, 34, 9/16/2023
9, 7, Voucher, 50, 9/23/2023
10, 8, Voucher, 5, 9/30/2023
11, 9, Voucher, 15, 10/7/2023
12, 10, Voucher, 25, 10/14/2023
13, 11, Voucher, 20, 10/21/2023
14, 12, Voucher, 20, 10/28/2023
15, 13, Voucher, 20, 11/4/2023
16, 15, Voucher, 12, 11/11/2023
17, 16, Voucher, 40, 11/18/2023
18, 17, Voucher, 10, 11/25/2023
19, 18, Voucher, 33, 12/2/2023
20, 19, Voucher, 20, 12/9/2023
21, 20, Voucher, 20, 12/16/2023
22, 21, Voucher, 50, 12/23/2023
23, 20, Voucher, 5, 12/30/2023