@fxadecimal
Last active April 28, 2025 11:46
Software Engineer to Data Engineer with MLOps in 2025

Topics

  1. PostgreSQL
  2. DBT
  3. Prefect
  4. Snowflake
  5. Automation and Orchestration
  6. MLOps
  7. Other topics

Overview

This curriculum assumes you're reasonably proficient in a popular language such as Python, Go, or TypeScript, and have at least a working knowledge of *nix, shell, Git, and Docker.

1. PostgreSQL

1.1 Easy

  • Basic SQL Syntax: SELECT, FROM, WHERE
  • Data Types (Integer, Text, Boolean, Date/Time)
  • Creating Tables (CREATE TABLE)
  • Inserting Data (INSERT INTO)
  • Updating Data (UPDATE)
  • Deleting Data (DELETE FROM)
  • Simple Filtering (=, >, <, LIKE, IN)
  • Ordering Results (ORDER BY)
  • Limiting Results (LIMIT)
  • Basic Joins (INNER JOIN)

1.2 Medium

  • More Join Types (LEFT JOIN, RIGHT JOIN, FULL OUTER JOIN)
  • Aggregate Functions (COUNT, SUM, AVG, MIN, MAX)
  • Grouping Data (GROUP BY)
  • Filtering Groups (HAVING)
  • Subqueries (in SELECT, FROM, WHERE)
  • Basic Indexing (CREATE INDEX)
  • Understanding Primary Keys and Foreign Keys
  • Constraints (NOT NULL, UNIQUE, CHECK)
  • Views (CREATE VIEW)
  • Basic Data Import/Export (COPY)

1.3 Hard

  • Window Functions (ROW_NUMBER, RANK, LAG, LEAD, aggregate window functions)
  • Common Table Expressions (CTEs) (WITH)
  • Stored Procedures and Functions (CREATE FUNCTION, CREATE PROCEDURE, basic PL/pgSQL)
  • Triggers (CREATE TRIGGER)
  • Transactions (ACID properties, BEGIN, COMMIT, ROLLBACK)
  • Understanding EXPLAIN and basic query optimization
  • More complex data types (JSON/JSONB, Arrays)
  • User Defined Types (CREATE TYPE)
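
As a rough illustration of the window function and CTE topics above, here is a minimal sketch. It assumes Python's built-in sqlite3 module (SQLite 3.25 or newer) as a stand-in for PostgreSQL so it runs without a server; the orders table and its columns are made up for the example. The query itself uses only syntax that PostgreSQL also accepts.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, order_day INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("ann", 1, 10), ("ann", 2, 30), ("bob", 1, 20), ("bob", 3, 5)],
)

# The CTE (WITH ...) names an intermediate result; the window functions
# compute per-customer values without collapsing rows the way GROUP BY would.
QUERY = """
WITH ranked AS (
    SELECT
        customer,
        amount,
        ROW_NUMBER() OVER (PARTITION BY customer ORDER BY order_day) AS order_seq,
        LAG(amount)  OVER (PARTITION BY customer ORDER BY order_day) AS prev_amount,
        SUM(amount)  OVER (PARTITION BY customer ORDER BY order_day) AS running_total
    FROM orders
)
SELECT customer, order_seq, amount, prev_amount, running_total
FROM ranked
ORDER BY customer, order_seq
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Each output row keeps the original order and adds its sequence number, the previous order's amount (NULL for a customer's first order), and a running total per customer.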

1.4 Advanced

  • Advanced Indexing (GIN, GiST, BRIN, Covering Indexes)
  • Table Partitioning (Declarative Partitioning)
  • Replication (Streaming Replication, Logical Replication)
  • Connection Pooling (e.g., PgBouncer)
  • Security Management (Roles, Permissions, Row-Level Security)
  • Backup and Recovery Strategies (pg_dump, Point-in-Time Recovery)
  • Advanced Performance Tuning and Monitoring
  • Concurrency Control (MVCC, Locking)
  • Foreign Data Wrappers (CREATE FOREIGN TABLE)
  • Advanced PL/pgSQL programming

2. DBT

2.1 Easy

  • What is DBT? (Overview and Use Cases)
  • Installing DBT and Setting Up a Project
  • DBT Project Structure and Files
  • Writing Basic Models (.sql files)
  • Running and Building Models (dbt run)
  • Using Seeds (dbt seed)
  • Simple Jinja Usage in SQL

2.2 Medium

  • Sources and Refactoring with ref() and source()
  • Using Variables and Macros
  • Testing Data with Built-in Tests (unique, not_null, etc.)
  • Documentation (dbt docs generate, dbt docs serve)
  • Using Snapshots
  • Incremental Models
  • Configuring Model Materializations

2.3 Hard

  • Writing Custom Tests and Macros
  • Advanced Jinja and Control Structures
  • Using Hooks and Operations
  • Advanced Model Configurations (tags, ephemeral, etc.)
  • Source Freshness and Auditing
  • Deployment Best Practices
  • Debugging and Logging

2.4 Advanced

  • DBT in Production (CI/CD, Scheduling)
  • DBT Cloud vs DBT Core
  • Managing Large Projects (Packages, Modularization)
  • Advanced Performance Optimization
  • Integrating DBT with Data Orchestration Tools (Airflow, Prefect)
  • Writing and Publishing DBT Packages
  • Security and Access Control in DBT

3. Prefect

3.1 Easy

  • What is Prefect? (Overview and Use Cases)
  • Installing Prefect and Basic Setup
  • Writing Your First Flow
  • Tasks and Flows Basics
  • Running Flows Locally
  • Using Prefect CLI

3.2 Medium

  • Parameters and Context in Flows
  • Scheduling Flows
  • Using State Change Hooks (called State Handlers in Prefect 1)
  • Mapping and Dynamic Task Generation
  • Logging and Monitoring Basics
  • Using Blocks and Storage Options
  • Working with Prefect Cloud UI

3.3 Hard

  • Deployments and Infrastructure Blocks
  • Advanced Error Handling and Retries
  • Using Collections and Integrations (e.g., S3, GCS, Databases)
  • Orchestrating Flows with Subflows
  • Using Secrets and Environment Variables
  • Custom Task and Flow Classes

3.4 Advanced

  • Prefect Workers and Work Pools (Agents and Work Queues in older releases)
  • Scaling and High Availability
  • Custom Infrastructure and Execution Environments
  • Advanced Monitoring and Alerting
  • CI/CD Integration for Prefect Deployments
  • Security Best Practices and Access Control
  • Extending Prefect with Plugins and Custom Collections

4. Snowflake

4.1 Easy

  • What is Snowflake? (Overview and Architecture)
  • Setting Up a Snowflake Account and UI Tour
  • Understanding Warehouses, Databases, and Schemas
  • Creating and Querying Tables
  • Basic SQL in Snowflake (SELECT, INSERT, UPDATE, DELETE)
  • Loading Data with Web UI and Worksheets

4.2 Medium

  • Using Snowflake Stages (Internal/External)
  • Bulk Loading Data (COPY INTO)
  • Working with File Formats
  • Time Travel and Data Retention
  • Cloning Databases, Schemas, and Tables
  • Working with Views and Secure Views
  • Using Snowflake Functions and Sequences
  • Query Performance Basics

4.3 Hard

  • Streams and Tasks (Change Data Capture, Automation)
  • Materialized Views
  • Semi-structured Data (VARIANT, JSON, XML, PARSE_JSON, FLATTEN)
  • User-defined Functions (UDFs) and Procedures
  • Data Sharing and Data Marketplace
  • Resource Monitors and Usage Tracking
  • Query Profiling and Optimization

4.4 Advanced

  • Snowflake Security (Roles, Policies, Masking, Row Access)
  • Data Governance and Compliance Features
  • Advanced Performance Tuning (Clustering, Result Caching)
  • Snowpipe (Continuous Data Ingestion)
  • External Tables and Data Lake Integration
  • Working with Snowpark (Python, Java, Scala)
  • Automation and Orchestration with Third-party Tools
  • Multi-cloud and Cross-region Features

5. Automation and Orchestration

5.1 Easy

  • What is Automation and Orchestration? (Overview and Use Cases)
  • Introduction to Scheduling (Cron, Task Schedulers)
  • Introduction to Docker (Containers vs VMs, Use Cases)
  • Installing Docker and Running Your First Container
  • Writing Simple Dockerfiles
  • Basic Docker Commands (build, run, ps, stop, rm)
  • Introduction to Prefect and Airflow (Concepts Only)
  • Simple Shell Scripting for Automation

5.2 Medium

  • Docker Compose for Multi-Container Applications
  • Building and Managing Custom Docker Images
  • Environment Variables and Volumes in Docker
  • Scheduling Workflows with Prefect or Airflow
  • Parameterizing and Triggering Workflows
  • Monitoring and Logging Automated Tasks
  • Using Makefiles for Automation
  • Automating Data Pipelines with Python Scripts

5.3 Hard

  • Advanced Docker Networking and Security
  • Orchestrating Containers with Kubernetes (Concepts and Basics)
  • Building Modular and Reusable Workflow DAGs (Airflow/Prefect)
  • Error Handling and Retry Strategies in Orchestration Tools
  • Integrating CI/CD Pipelines (GitHub Actions, GitLab CI)
  • Dynamic Workflow Generation
  • Managing Secrets and Credentials Securely
  • Automated Testing of Data Pipelines
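
The retry strategies listed above can be sketched in plain Python. This is a minimal illustration of retry with exponential backoff, not how Airflow or Prefect implement it; the names retry and flaky_extract and the injectable sleep parameter are hypothetical, chosen so the example can run without real waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on any exception with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            # Catching Exception broadly keeps the sketch short; real
            # pipelines usually retry only on known transient errors.
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base, 2*base, 4*base, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Hypothetical task that fails twice before succeeding.
calls = {"count": 0}


def flaky_extract() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "ok"


delays = []  # record requested delays instead of actually sleeping
result = retry(flaky_extract, attempts=5, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Passing sleep as a parameter is a design choice for testability: the backoff schedule can be checked without slowing the test suite down.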

5.4 Advanced

  • Scaling Workflows and Infrastructure (Kubernetes, Cloud Runners)
  • Custom Operators/Sensors in Airflow or Custom Blocks in Prefect
  • Distributed Task Execution and Parallelism
  • Monitoring, Alerting, and Observability for Automated Workflows
  • Advanced Docker Topics (Swarm, Multi-stage Builds, Image Optimization)
  • Infrastructure as Code (Terraform, CloudFormation) for Automation
  • End-to-End Data Pipeline Automation (from Ingestion to Reporting)
  • Security, Compliance, and Auditing in Automated Workflows

6. MLOps

6.1 Easy

  • What is MLOps? (Overview and Use Cases)
  • Introduction to Machine Learning Lifecycle
  • Version Control for Code and Data (Git, DVC)
  • Basics of Model Training and Evaluation
  • Introduction to Model Serialization (Pickle, Joblib, ONNX)
  • Manual Model Deployment (Flask/FastAPI)
  • Tracking Experiments with Spreadsheets or Simple Tools
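
To make the model serialization topic above concrete, here is a small sketch using the standard-library pickle module. The "model" is just a hypothetical dict of linear-model parameters and predict is an illustrative helper; a real project would typically pickle a fitted estimator object instead. Note that unpickling data from an untrusted source can execute arbitrary code, so only load files you produced yourself.

```python
import pickle

# Hypothetical "trained model": parameters of a linear model.
model = {"kind": "linear", "weights": [0.5, -1.0], "bias": 2.0}


def predict(params, features):
    """Dot product of weights and features, plus the bias term."""
    return sum(w * x for w, x in zip(params["weights"], features)) + params["bias"]


# Serialize to bytes (what would be written to disk or object storage) ...
blob = pickle.dumps(model)

# ... and restore it later, for example inside a serving process.
restored = pickle.loads(blob)

print(predict(restored, [4.0, 1.0]))
```

The restored object is an equal but separate copy, which is the property manual deployment relies on: train in one process, load and serve in another.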

6.2 Medium

  • Automated Model Training Pipelines (with Prefect, Airflow, or similar)
  • Model Tracking with MLflow or Weights & Biases
  • Data Validation and Data Drift Detection
  • Model Registry Concepts
  • Containerizing ML Models with Docker
  • Batch and Real-time Inference Basics
  • Monitoring Model Performance (Basic Metrics)
  • Feature Store Concepts

6.3 Hard

  • CI/CD for ML (Automated Testing, Linting, and Deployment)
  • Advanced Model Monitoring (Drift, Outliers, Data Quality)
  • Automated Retraining and Model Versioning
  • Model Serving at Scale (Kubernetes, Seldon, KServe, formerly KFServing)
  • Advanced Feature Store Usage
  • Secure Model Deployment (API Keys, Auth, RBAC)
  • A/B Testing and Canary Releases for Models

6.4 Advanced

  • End-to-End ML Pipeline Automation (from Data Ingestion to Monitoring)
  • Multi-cloud and Hybrid MLOps Architectures
  • Infrastructure as Code for MLOps (Terraform, CloudFormation)
  • Advanced Model Governance and Compliance
  • Custom ML Platform Development
  • Cost Optimization and Resource Management for ML Workloads
  • Integrating MLOps with DataOps and DevOps
  • Advanced Security and Auditability in ML Systems

7. Other topics

7.1 Easy

  • Introduction to Data Warehousing Concepts
  • Basics of Data Modeling (Star, Snowflake Schemas)
  • Introduction to ETL/ELT Concepts
  • Data Quality Fundamentals
  • Introduction to Cloud Platforms (AWS, GCP, Azure) for Data
  • Basic Data Visualization (Tableau, Power BI, Looker)
  • Introduction to APIs and REST

7.2 Medium

  • Data Lake Concepts and Architecture
  • Data Catalogs and Metadata Management
  • Data Lineage and Provenance
  • Data Privacy Basics (GDPR, HIPAA Overview)
  • Working with NoSQL Databases (MongoDB, Cassandra)
  • Streaming Data Basics (Kafka, Kinesis)
  • Data Serialization Formats (Parquet, Avro, ORC)
  • Scheduling and Automation with Cloud Services (Cloud Composer, AWS Step Functions)

7.3 Hard

  • Data Governance Frameworks
  • Master Data Management (MDM)
  • Advanced Data Modeling (Slowly Changing Dimensions, Factless Fact Tables)
  • Real-time Data Processing Architectures
  • Data Mesh and Data Fabric Concepts
  • Advanced Data Privacy and Anonymization Techniques
  • Data API Design and Management (GraphQL, gRPC)
  • Data Migration Strategies (On-prem to Cloud, Cloud to Cloud)
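
Slowly Changing Dimensions, listed above, are easier to grasp with a worked example. The sketch below shows the Type 2 variant (keep history by closing the old row and appending a new one) on an in-memory list of dicts. The column names customer_id, city, valid_from, and valid_to and the function apply_scd2 are assumptions for illustration; in a warehouse this is usually done with a MERGE statement or a dbt snapshot.

```python
def apply_scd2(dim, customer_id, new_city, change_date):
    """Apply a Type 2 change: close the current row, then append a new version."""
    # The current version of a customer is the row with no end date.
    current = next(
        (r for r in dim if r["customer_id"] == customer_id and r["valid_to"] is None),
        None,
    )
    if current is not None:
        if current["city"] == new_city:
            return dim  # attribute unchanged: nothing to record
        current["valid_to"] = change_date  # close the old version
    dim.append(
        {
            "customer_id": customer_id,
            "city": new_city,
            "valid_from": change_date,
            "valid_to": None,  # None marks the current version
        }
    )
    return dim


dim = []
apply_scd2(dim, 1, "Leeds", "2024-01-01")  # first sighting: insert
apply_scd2(dim, 1, "Leeds", "2024-06-01")  # same value: no new row
apply_scd2(dim, 1, "York", "2025-01-01")   # change: close old row, add new one

for row in dim:
    print(row)
```

After these three calls the dimension holds two rows for customer 1: the Leeds row closed on 2025-01-01 and an open York row, so queries can reconstruct the customer's city at any past date.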

7.4 Advanced

  • Data Architecture for Large-scale Systems
  • Multi-cloud and Hybrid Data Architectures
  • Data Monetization and Data-as-a-Service
  • Advanced Data Security (Encryption at Rest/In Transit, Key Management)
  • Data Ethics and Responsible AI
  • Building Custom Data Platforms
  • DataOps Best Practices and Tooling
  • Advanced Data Sharing and Collaboration (Data Clean Rooms, Secure Data Exchange)