https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-user-delegation-sas-create-cli
- designed for fault tolerance, virtually unlimited scalability, and high-throughput ingestion of variable-sized data
- used for data exploration, analytics, ML
- could act as a data source for a data warehouse
- raw data ingested into the data lake -> transformed in place into a structured, queryable format (ELT pipeline - data is ingested first, then transformed inside the target store)
- source data that is already relational can go directly into the data warehouse, skipping the data lake
- often used in event streaming or IoT scenarios, because a data lake can persist large amounts of relational and non-relational data without transformations or schema definitions
- handle high volumes of small writes at low latency and are optimized for massive throughput
- Storing raw data is helpful when you don't yet know what insights are available from the data
- More flexible than a data warehouse, since data can be stored in heterogeneous formats
- Lack of schema or metadata - makes data hard to query
- Lack of semantic consistency across data - hard to perform analysis
- Hard to guarantee data quality of ingested data
- Might not be the best way to integrate data that is already relational
- By itself, a data lake does not provide integrated or holistic views across the organization
- Could become bloated with data never actually analyzed or mined for insight
- Be careful with data governance
ELT | ETL |
---|---|
transformation occurs in target data store | transformation takes place in a separate specialized engine |
target data store can transform the data itself, so no separate transformation engine is needed in the pipeline | target cannot transform the data itself, so a separate transformation engine is needed |
ELT only performs well when the target can transform the data efficiently | ETL phases are often run in parallel (transformation can start on data that has already been extracted) |
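- a minimal sketch of the "transform in the target" half of ELT, assuming the raw data has already been landed in a staging table; the server, database, table, and credential values below are placeholders:

```powershell
# ELT: raw data is already in the target database (stg.RawSales, a hypothetical
# staging table), so the transform runs inside the target's own SQL engine
# instead of a separate transformation service.
$transformQuery = @"
INSERT INTO dbo.Sales (SaleId, CustomerId, Amount, SaleDate)
SELECT CAST(sale_id AS INT),
       CAST(customer_id AS INT),
       CAST(amount AS DECIMAL(18, 2)),
       CAST(sale_date AS DATE)
FROM   stg.RawSales
WHERE  TRY_CAST(amount AS DECIMAL(18, 2)) IS NOT NULL;  -- drop rows that fail conversion
"@

# Requires the SqlServer PowerShell module (Install-Module SqlServer).
Invoke-Sqlcmd -ServerInstance "mydw.database.windows.net" `
              -Database "SalesDw" `
              -Username "loader" `
              -Password "<password>" `
              -Query $transformQuery
```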
Data Lake | Data Warehouse |
---|---|
holds data in raw, untransformed format | transforms and processes the data at ingestion |
optimized for scaling terabytes and petabytes of data | |
source data can be structured, semi, or unstructured | source data must be in a homogeneous format |
- docs
- Azure Data Lake Storage Gen2 is built on top of blob storage (check the "Enable hierarchical namespace" box when creating the storage account - see the PowerShell sketch below)
- combines file system semantics, directory- and file-level security, and scale with low-cost tiered storage and high availability / disaster recovery capabilities
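- a rough PowerShell equivalent of checking that box - the setting is the hierarchical namespace flag on a StorageV2 account (the resource group is assumed to exist; account name and location are placeholders):

```powershell
# Create a StorageV2 account with the hierarchical namespace enabled,
# which is what turns a regular blob storage account into Data Lake Storage Gen2.
New-AzStorageAccount -ResourceGroupName "rg-datalake" `
                     -Name "mydatalakeacct" `
                     -Location "eastus" `
                     -SkuName "Standard_LRS" `
                     -Kind "StorageV2" `
                     -EnableHierarchicalNamespace $true
```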
- Azure Data Factory is a cloud-based ETL and data integration service that lets you create data-driven workflows for orchestrating data movement and transforming data at scale
- https://docs.microsoft.com/en-us/azure/data-factory/load-azure-sql-data-warehouse
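- a hedged sketch of a minimal pipeline with a single copy activity deployed from PowerShell; the factory, its linked services, and the two datasets (`BlobInputDataset` and `SqlDwOutputDataset` - hypothetical names) are assumed to already exist:

```powershell
# Minimal pipeline definition: one Copy activity moving delimited text from a
# blob dataset into a dedicated SQL pool dataset (dataset names are placeholders).
$pipelineJson = @'
{
  "name": "CopyBlobToSqlDw",
  "properties": {
    "activities": [
      {
        "name": "CopyRawFiles",
        "type": "Copy",
        "inputs":  [ { "referenceName": "BlobInputDataset",   "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlDwOutputDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink":   { "type": "SqlDWSink", "allowPolyBase": true }
        }
      }
    ]
  }
}
'@
Set-Content -Path ".\CopyBlobToSqlDw.json" -Value $pipelineJson

Set-AzDataFactoryV2Pipeline -ResourceGroupName "rg-datalake" `
                            -DataFactoryName "my-data-factory" `
                            -Name "CopyBlobToSqlDw" `
                            -DefinitionFile ".\CopyBlobToSqlDw.json"
```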
- Remmina - remote desktop client (used here to connect to the on-prem VM)
- Need to download and install the integration runtime (IR) on the on-prem VM so that Data Factory can set up a connection between the VM's private network and the Azure cloud
- Necessary for running a copy activity between a cloud data store and a data store in a private network
- See this guide for more details on copying data from on-prem SQL to data factory.
- See this guide for more details on copying multiple tables over to data factory.
- make sure to uncheck the box that only shows one table result
- Create a self-hosted integration runtime within an Azure Data Factory using PowerShell (see the sketch after this list)
- Create a linked service for the on-prem data store by specifying the self-hosted IR instance (also shown in the sketch)
- The IR node encrypts the credentials and saves the credentials locally.
- The Data Factory service talks to the IR to schedule and manage jobs via a control channel that uses a shared Azure Service Bus. When a job needs to be run, Data Factory queues the request and credentials.
- The IR copies data from the on-prem store to cloud storage, or vice versa.
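- a sketch of the first two steps above in PowerShell (resource group, factory, IR, and connection-string values are placeholders):

```powershell
# 1. Create the self-hosted integration runtime inside the data factory.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "rg-datalake" `
                                      -DataFactoryName "my-data-factory" `
                                      -Name "OnPremSelfHostedIR" `
                                      -Type SelfHosted `
                                      -Description "IR running on the on-prem VM"

# Get an authentication key; paste it into the IR installer on the on-prem VM
# to register that node with the factory.
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "rg-datalake" `
                                         -DataFactoryName "my-data-factory" `
                                         -Name "OnPremSelfHostedIR"

# 2. Create a linked service for the on-prem SQL Server that routes through the
#    self-hosted IR via "connectVia" (connection-string values are placeholders).
$linkedServiceJson = @'
{
  "name": "OnPremSqlServerLinkedService",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=onprem-sql;Database=SourceDb;User ID=loader;Password=<password>"
    },
    "connectVia": {
      "referenceName": "OnPremSelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
'@
Set-Content -Path ".\OnPremSqlServerLinkedService.json" -Value $linkedServiceJson

Set-AzDataFactoryV2LinkedService -ResourceGroupName "rg-datalake" `
                                 -DataFactoryName "my-data-factory" `
                                 -Name "OnPremSqlServerLinkedService" `
                                 -DefinitionFile ".\OnPremSqlServerLinkedService.json"
```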