https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-user-delegation-sas-create-cli
- designed for fault tolerance, virtually unlimited scalability, and high-throughput ingestion of variable-sized data
- used for data exploration, analytics, ML
- could act as a data source for a data warehouse
- raw data ingested into the data lake -> transformed in place into a structured, queryable format (ELT pipeline - data is ingested first, then transformed inside the target store)
- source data that is already relational can go directly into the data warehouse, skipping the data lake
- often used in event streaming or IoT scenarios, because a data lake can persist large amounts of relational and non-relational data without transformations or schema definitions
- handle high volumes of small writes at low latency and are optimized for massive throughput
- Storing raw data is helpful when you don't yet know what insights are available from the data
- More flexible than a data warehouse, since data can be stored in heterogeneous formats
- Lack of schema or metadata - makes data hard to query
- Lack of semantic consistency across data - hard to perform analysis
- Hard to guarantee data quality of ingested data
- Might not be the best way to integrate data that is already relational
- By itself, a data lake does not provide integrated or holistic views across the organization
- Could become bloated with data never actually analyzed or mined for insight
- Be careful with data governance
ELT | ETL |
---|---|
transformation occurs in target data store | transformation takes place in a separate specialized engine |
target data store can transform the data itself, so no separate transformation engine is needed in the pipeline | target cannot transform the data itself, so a separate transformation engine is needed |
ELT only performs well when the target can transform the data efficiently | ETL phases are often run in parallel (transformation can start on data that has already been extracted) |
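- a minimal sketch of the "transform in the target" half of ELT, assuming the raw data has already been landed in a staging table; the server, database, table, and credential values below are placeholders:

```powershell
# ELT: raw data is already in the target database (stg.RawSales, a hypothetical
# staging table), so the transform runs inside the target's own SQL engine
# instead of a separate transformation service.
$transformQuery = @"
INSERT INTO dbo.Sales (SaleId, CustomerId, Amount, SaleDate)
SELECT CAST(sale_id AS INT),
       CAST(customer_id AS INT),
       CAST(amount AS DECIMAL(18, 2)),
       CAST(sale_date AS DATE)
FROM   stg.RawSales
WHERE  TRY_CAST(amount AS DECIMAL(18, 2)) IS NOT NULL;  -- drop rows that fail conversion
"@

# Requires the SqlServer PowerShell module (Install-Module SqlServer).
Invoke-Sqlcmd -ServerInstance "mydw.database.windows.net" `
              -Database "SalesDw" `
              -Username "loader" `
              -Password "<password>" `
              -Query $transformQuery
```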
Data Lake | Data Warehouse |
---|---|
holds data in raw, untransformed format | transforms and processes the data at ingestion |
optimized for scaling terabytes and petabytes of data | |
source data can be structured, semi, or unstructured | source data must be in a homogeneous format |
- docs
- Azure Data Lake Storage Gen2 is built on top of blob storage (check the "Enable hierarchical namespace" box when creating the storage account - see the PowerShell sketch below)
- combines file system semantics, directory- and file-level security, and scale with low-cost tiered storage and high availability / disaster recovery capabilities
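- a rough PowerShell equivalent of checking that box - the setting is the hierarchical namespace flag on a StorageV2 account (the resource group is assumed to exist; account name and location are placeholders):

```powershell
# Create a StorageV2 account with the hierarchical namespace enabled,
# which is what turns a regular blob storage account into Data Lake Storage Gen2.
New-AzStorageAccount -ResourceGroupName "rg-datalake" `
                     -Name "mydatalakeacct" `
                     -Location "eastus" `
                     -SkuName "Standard_LRS" `
                     -Kind "StorageV2" `
                     -EnableHierarchicalNamespace $true
```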
- Azure Data Factory is a cloud-based ETL and data integration service that lets you create data-driven workflows for orchestrating data movement and transforming data at scale
- https://docs.microsoft.com/en-us/azure/data-factory/load-azure-sql-data-warehouse
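- a hedged sketch of a minimal pipeline with a single copy activity deployed from PowerShell; the factory, its linked services, and the two datasets (`BlobInputDataset` and `SqlDwOutputDataset` - hypothetical names) are assumed to already exist:

```powershell
# Minimal pipeline definition: one Copy activity moving delimited text from a
# blob dataset into a dedicated SQL pool dataset (dataset names are placeholders).
$pipelineJson = @'
{
  "name": "CopyBlobToSqlDw",
  "properties": {
    "activities": [
      {
        "name": "CopyRawFiles",
        "type": "Copy",
        "inputs":  [ { "referenceName": "BlobInputDataset",   "type": "DatasetReference" } ],
        "outputs": [ { "referenceName": "SqlDwOutputDataset", "type": "DatasetReference" } ],
        "typeProperties": {
          "source": { "type": "DelimitedTextSource" },
          "sink":   { "type": "SqlDWSink", "allowPolyBase": true }
        }
      }
    ]
  }
}
'@
Set-Content -Path ".\CopyBlobToSqlDw.json" -Value $pipelineJson

Set-AzDataFactoryV2Pipeline -ResourceGroupName "rg-datalake" `
                            -DataFactoryName "my-data-factory" `
                            -Name "CopyBlobToSqlDw" `
                            -DefinitionFile ".\CopyBlobToSqlDw.json"
```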
- Remmina - remote desktop client (used here to connect to the on-prem VM)
- Need to download and install the integration runtime (IR) on the on-prem VM so that Data Factory can set up a connection between the VM's private network and the Azure cloud
- Necessary for running a copy activity between a cloud data store and a data store in a private network
- See this guide for more details on copying data from on-prem SQL to data factory.
- See this guide for more details on copying multiple tables over to data factory.
- make sure to uncheck the box that only shows one table result
- Create a self-hosted integration runtime within an Azure Data Factory using PowerShell (see the sketch after this list)
- Create a linked service for the on-prem data store by specifying the self-hosted IR instance (also shown in the sketch)
- The IR node encrypts the credentials and saves the credentials locally.
- The Data Factory service talks to the IR to schedule and manage jobs via a control channel that uses a shared Azure Service Bus. When a job needs to be run, Data Factory queues the request and credentials.
- The IR copies data from the on-prem store to cloud storage, or vice versa.
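- a sketch of the first two steps above in PowerShell (resource group, factory, IR, and connection-string values are placeholders):

```powershell
# 1. Create the self-hosted integration runtime inside the data factory.
Set-AzDataFactoryV2IntegrationRuntime -ResourceGroupName "rg-datalake" `
                                      -DataFactoryName "my-data-factory" `
                                      -Name "OnPremSelfHostedIR" `
                                      -Type SelfHosted `
                                      -Description "IR running on the on-prem VM"

# Get an authentication key; paste it into the IR installer on the on-prem VM
# to register that node with the factory.
Get-AzDataFactoryV2IntegrationRuntimeKey -ResourceGroupName "rg-datalake" `
                                         -DataFactoryName "my-data-factory" `
                                         -Name "OnPremSelfHostedIR"

# 2. Create a linked service for the on-prem SQL Server that routes through the
#    self-hosted IR via "connectVia" (connection-string values are placeholders).
$linkedServiceJson = @'
{
  "name": "OnPremSqlServerLinkedService",
  "properties": {
    "type": "SqlServer",
    "typeProperties": {
      "connectionString": "Server=onprem-sql;Database=SourceDb;User ID=loader;Password=<password>"
    },
    "connectVia": {
      "referenceName": "OnPremSelfHostedIR",
      "type": "IntegrationRuntimeReference"
    }
  }
}
'@
Set-Content -Path ".\OnPremSqlServerLinkedService.json" -Value $linkedServiceJson

Set-AzDataFactoryV2LinkedService -ResourceGroupName "rg-datalake" `
                                 -DataFactoryName "my-data-factory" `
                                 -Name "OnPremSqlServerLinkedService" `
                                 -DefinitionFile ".\OnPremSqlServerLinkedService.json"
```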