César Reyes creyesp


Data Engineering Challenge - Data integration with the NYTimes API

Objective

The goal of this challenge is to build a small automated data pipeline that extracts information from an external source (the New York Times API), stores it in an analytical database (BigQuery), and makes it efficiently queryable.

What will you build?

You will develop a Python script that connects to the NYTimes news API and extracts recent articles according to certain parameters. That information must be stored in a table in Google BigQuery for later analysis.
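One possible sketch of such a pipeline, using the NYTimes Archive API endpoint and assuming a pre-created BigQuery table (function names and the table layout are illustrative, not part of the challenge statement):

```python
import json
from urllib.request import urlopen
from urllib.parse import urlencode

# NYTimes Archive API: one JSON payload per year/month.
NYT_ARCHIVE_URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"


def build_archive_url(year, month, api_key):
    """Build the Archive API request URL for a given year/month."""
    query = urlencode({"api-key": api_key})
    return NYT_ARCHIVE_URL.format(year=year, month=month) + "?" + query


def fetch_articles(year, month, api_key):
    """Call the Archive API and return the list of article documents."""
    with urlopen(build_archive_url(year, month, api_key)) as resp:
        payload = json.load(resp)
    return payload["response"]["docs"]


def load_to_bigquery(rows, table_id):
    """Stream JSON rows into an existing BigQuery table."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client()
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"BigQuery insert errors: {errors}")
```

In a real run you would schedule `fetch_articles` + `load_to_bigquery` (e.g. with cron or Airflow) and select only the article fields your table schema needs before inserting.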

def unify_victoria_secret(df):
    """
    We want all brands that are related to Victoria's Secret to have
    `victoria's secret` as their brand instead of what they currently have.
    """
    df = df.copy()
    new_string = "victoria's secret"
    variants = ["Victorias-Secret", "Victoria's Secret", "Victoria's Secret Pink"]
    df.loc[df["brand_name"].isin(variants), "brand_name"] = new_string
    return df
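A quick usage example on a toy DataFrame (assuming pandas is installed; the function is restated so the snippet is self-contained):

```python
import pandas as pd  # assumes pandas is installed


def unify_victoria_secret(df):
    """Map all Victoria's Secret brand variants to one canonical name."""
    df = df.copy()
    variants = ["Victorias-Secret", "Victoria's Secret", "Victoria's Secret Pink"]
    df.loc[df["brand_name"].isin(variants), "brand_name"] = "victoria's secret"
    return df


toy = pd.DataFrame({"brand_name": ["Victorias-Secret", "Victoria's Secret Pink", "Nike"]})
print(unify_victoria_secret(toy)["brand_name"].tolist())
# The two Victoria's Secret variants collapse to "victoria's secret"; "Nike" is unchanged.
```

Note that `df.copy()` keeps the caller's DataFrame untouched, which avoids pandas' `SettingWithCopyWarning` surprises.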
creyesp / pycharm_env_config.md (created May 22, 2022 23:18)

Pycharm env config
  1. Install the ProjectEnv plugin.
  2. Install direnv.
  3. Create a .env file and configure the environment variables in it.
  4. Create a .envrc file and write `dotenv` in it.
  5. In PyCharm, go to Settings > Build ... > ProjectEnv and add the .env file.
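For example, the two files from steps 3 and 4 could look like this (the variable names and values are illustrative):

```shell
# .env — project environment variables (example values)
DATABASE_URL=postgres://localhost:5432/mydb
API_KEY=changeme

# .envrc — tells direnv to load the .env file above
dotenv
```

With this in place, direnv exports the variables whenever you `cd` into the project, and the ProjectEnv plugin makes the same `.env` values visible to PyCharm run configurations.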

The table below shows some examples of heuristic benchmarks used to judge the performance of a machine learning model when no previous solution exists. The original version of the table can be found in the Machine Learning Design Patterns book (pattern 28).

| Scenario | Heuristic benchmark | Example task | Implementation for example task |
| --- | --- | --- | --- |
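As a concrete illustration of one common heuristic benchmark, predicting the training-set mean for a regression task (all numbers below are made up):

```python
# Toy regression task: compare a model against the
# "always predict the training mean" heuristic benchmark.
train_y = [10, 12, 8, 14, 11]
test_y = [9, 13, 12]
model_preds = [10, 12, 11]  # hypothetical model predictions

mean_baseline = sum(train_y) / len(train_y)  # 11.0


def mae(preds, actual):
    """Mean absolute error between predictions and actual values."""
    return sum(abs(p - a) for p, a in zip(preds, actual)) / len(actual)


baseline_mae = mae([mean_baseline] * len(test_y), test_y)  # ~1.67
model_mae = mae(model_preds, test_y)  # 1.0
print(f"baseline MAE: {baseline_mae:.2f}, model MAE: {model_mae:.2f}")
```

A model is only worth deploying if it clearly beats this trivial baseline; otherwise the heuristic is cheaper and just as good.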

from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Spanish 3-label sentiment analysis model from the Hugging Face Hub
model_name = "tr3cks/3LabelsSentimentAnalysisSpanish"
tokenizer_sent_esp = AutoTokenizer.from_pretrained(model_name)
model_sent_esp = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize a sample phrase; WordPiece splits unseen words into subwords.
# The output is ['ja', '##ja', '##ja', 'que', 'risa', 'me', 'da']
tokenizer_sent_esp.tokenize('jajaja que risa me da')
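To turn the classifier's raw logits into one of the three labels, you would typically apply a softmax and take the argmax. A minimal sketch with made-up logits and an assumed label order (the real order should be checked in the model's config):

```python
import math

# Assumed label order for a 3-label sentiment model; verify against
# the model's id2label config before relying on it.
labels = ["negative", "neutral", "positive"]
logits = [0.2, -1.1, 2.4]  # hypothetical output logits for one sentence

# Softmax: exponentiate and normalize so the scores sum to 1.
exps = [math.exp(x) for x in logits]
total = sum(exps)
probs = [e / total for e in exps]

predicted = labels[probs.index(max(probs))]
print(predicted)  # "positive" for these made-up logits
```

With the real model, the same logic applies to `model_sent_esp(**tokenizer_sent_esp(text, return_tensors="pt")).logits` for each input sentence.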