
https://docs.microsoft.com/en-us/azure/storage/blobs/storage-blob-user-delegation-sas-create-cli

Data Lake

  • designed for fault tolerance, infinite scalability, and high-throughput ingestion of variable-sized data
  • used for data exploration, analytics, ML
  • could act as a data source for a data warehouse
    • raw data is ingested into the data lake, then transformed (via an ELT pipeline, where data is ingested and transformed in place) into a structured, queryable format
    • source data that is already relational can go directly into the data warehouse, skipping the data lake
  • often used in event streaming or IoT scenarios, because a data lake can persist large amounts of relational and non-relational data without transformations or schema definitions
  • handles high volumes of small writes at low latency and is optimized for massive throughput

Advantages of Data Lake

  • Storing raw data is helpful when you don't yet know what insights are available from the data
  • More flexible than a data warehouse, since data can be stored in heterogeneous formats

Disadvantages of Data Lake

  • Lack of schema or metadata - makes data hard to query
  • Lack of semantic consistency across data - hard to perform analysis
  • Hard to guarantee data quality of ingested data
  • Might not be the best way to integrate data that is already relational
  • By itself, a data lake does not provide integrated or holistic views across the organization
  • Could become bloated with data never actually analyzed or mined for insight
  • Requires careful data governance

ELT vs ETL

| ELT | ETL |
| --- | --- |
| transformation occurs in the target data store | transformation takes place in a separate, specialized engine |
| target data store is capable of transforming the data, so no additional transformation engine is needed in the pipeline | target is not capable of transforming the data and needs a separate transformation engine |
| performs well only when the target can transform the data efficiently | ETL phases are often run in parallel |
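
As a concrete illustration of the ELT column, here is a minimal Python sketch that uses sqlite3 as a stand-in for the target data store: raw rows are loaded untransformed, and the transformation then runs inside the target itself as SQL. The table and column names are invented for the example; with a real warehouse target only the connection and SQL dialect would change.

```python
import sqlite3

# Stand-in target data store; in practice this would be a warehouse
# engine that can transform data in place.
conn = sqlite3.connect(":memory:")

# "Load" step: raw events land in the target untransformed.
conn.execute("CREATE TABLE raw_events (payload TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [("click", "10.5", "US"), ("click", "3.0", "US"), ("view", "1.2", "DE")],
)

# "Transform" step: executed inside the target store itself (ELT),
# producing a structured, queryable table from the raw data.
conn.execute(
    """
    CREATE TABLE sales_by_country AS
    SELECT country, SUM(CAST(amount AS REAL)) AS total_amount
    FROM raw_events
    WHERE payload = 'click'
    GROUP BY country
    """
)

print(conn.execute("SELECT * FROM sales_by_country").fetchall())
```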

Data Lake vs Data Warehouse

| Data Lake | Data Warehouse |
| --- | --- |
| holds data in its raw, untransformed format | transforms and processes the data at ingestion |
| optimized for scaling to terabytes and petabytes of data | |
| source data can be structured, semi-structured, or unstructured | source data must be in a homogeneous format |

Azure Data Lake Storage Gen2

  • docs
  • built on top of blob storage (need to check the enable box when creating the resource)
  • file system semantics, directory- and file-level security, and scale are combined with low-cost tiered storage and high availability / disaster recovery capabilities
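
A minimal sketch of writing a file into ADLS Gen2 with the azure-storage-file-datalake Python SDK, assuming an account with the hierarchical namespace enabled; the account name, key, filesystem, and paths below are placeholders.

```python
from azure.storage.filedatalake import DataLakeServiceClient

# Placeholder account details -- replace with your own storage account
# (the account must have the hierarchical namespace enabled, i.e. ADLS Gen2).
ACCOUNT_NAME = "mydatalakeaccount"
ACCOUNT_KEY = "<storage-account-key>"

service_client = DataLakeServiceClient(
    account_url=f"https://{ACCOUNT_NAME}.dfs.core.windows.net",
    credential=ACCOUNT_KEY,
)

# Create a filesystem (container) for raw data.
file_system_client = service_client.create_file_system(file_system="raw")

# Directories and files behave like a hierarchical file system.
directory_client = file_system_client.create_directory("iot/2019/10/29")
file_client = directory_client.create_file("events.csv")

data = b"device_id,temperature\n42,21.5\n"
file_client.upload_data(data, overwrite=True)
```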

Data Factory

RDP Client for Linux

  • Remmina - Remote desktop

Integration Runtime

  • Need to download and install the self-hosted IR on the on-prem VM so that Data Factory can set up a connection between the VM's private network and the Azure cloud
  • Necessary for running copy activities between cloud data stores and a data store in a private network (see the sketch below)
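
A sketch, assuming the azure-mgmt-datafactory Python SDK, of registering a self-hosted IR and fetching the auth key used to register the on-prem node (the subscription, resource group, factory, and IR names are placeholders).

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
)

# Placeholder identifiers -- substitute your own.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "my-rg"
FACTORY_NAME = "my-data-factory"
IR_NAME = "MySelfHostedIR"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Register a self-hosted integration runtime in the data factory.
adf_client.integration_runtimes.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    IR_NAME,
    IntegrationRuntimeResource(
        properties=SelfHostedIntegrationRuntime(description="on-prem to cloud copies")
    ),
)

# The auth key is entered into the IR installer on the on-prem VM
# to register that node with this runtime.
keys = adf_client.integration_runtimes.list_auth_keys(
    RESOURCE_GROUP, FACTORY_NAME, IR_NAME
)
print(keys.auth_key1)
```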

Moving Data between On-Prem and the Cloud


  • See this guide for more details on copying data from on-prem SQL to data factory.
  • See this guide for more details on copying multiple tables over to data factory.
    • make sure to uncheck the box that only shows one table result
  1. Create a self-hosted integration runtime within an Azure Data Factory using PowerShell.
  2. Create a linked service for an on-prem data store by specifying the self-hosted IR instance (see the sketch after these steps).
  3. The IR node encrypts the credentials and saves the credentials locally.
  4. The Data Factory service talks to the IR to schedule and manage jobs via a control channel that uses a shared Azure Service Bus. When a job needs to be run, Data Factory queues the request and credentials.
  5. The IR copies data from on-prem store to cloud storage, or vice versa.
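
Continuing the sketch above with the same SDK, step 2 might look roughly like this: a linked service for an on-prem SQL Server routed through the self-hosted IR via connect_via. The connection string and names are placeholders, and adf_client plus the name constants come from the previous sketch.

```python
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeReference,
    LinkedServiceResource,
    SecureString,
    SqlServerLinkedService,
)

# adf_client, RESOURCE_GROUP, FACTORY_NAME, and IR_NAME are defined
# in the integration runtime sketch above.

# Placeholder connection string for the on-prem SQL Server.
conn_str = (
    "Server=onprem-sql01;Database=SalesDb;"
    "User ID=adfuser;Password=<password>;"
)

# connect_via points the linked service at the self-hosted IR, so copy
# activities against this store run through the on-prem node.
sql_ls = SqlServerLinkedService(
    connection_string=SecureString(value=conn_str),
    connect_via=IntegrationRuntimeReference(reference_name=IR_NAME),
)

adf_client.linked_services.create_or_update(
    RESOURCE_GROUP,
    FACTORY_NAME,
    "OnPremSqlServerLinkedService",
    LinkedServiceResource(properties=sql_ls),
)
```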

Transform Data with Databricks

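A minimal PySpark sketch of the kind of in-place transformation a Databricks notebook might run over raw data in the lake; the abfss:// paths, container names, and column names are hypothetical, and the cluster is assumed to already have access to the storage account.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a SparkSession already exists as `spark`;
# getOrCreate() keeps the sketch runnable elsewhere too.
spark = SparkSession.builder.getOrCreate()

# Hypothetical raw and curated zones in ADLS Gen2.
raw_path = "abfss://raw@mydatalakeaccount.dfs.core.windows.net/iot/2019/10/29/"
curated_path = "abfss://curated@mydatalakeaccount.dfs.core.windows.net/device_temperatures/"

# Read the raw, untransformed CSV files.
raw_df = spark.read.option("header", "true").csv(raw_path)

# Transform in place in the lake: cast types, drop bad rows, aggregate.
curated_df = (
    raw_df.withColumn("temperature", F.col("temperature").cast("double"))
    .dropna(subset=["device_id", "temperature"])
    .groupBy("device_id")
    .agg(F.avg("temperature").alias("avg_temperature"))
)

# Write the structured, queryable result back to a curated zone as Parquet.
curated_df.write.mode("overwrite").parquet(curated_path)
```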
