For this challenge, we will use the Brazilian E-commerce Public Dataset by Olist from Kaggle. The dataset contains multiple tables, each with specific attributes related to e-commerce activities.
-
Customers Table
customer_id
: Unique identifier for each customercustomer_unique_id
: Unique identifier for each unique customercustomer_zip_code_prefix
: Zip code prefix of the customercustomer_city
: City of the customercustomer_state
: State of the customer
-
Geolocation Table
geolocation_zip_code_prefix
: Zip code prefix of the locationgeolocation_lat
: Latitude coordinate of the locationgeolocation_lng
: Longitude coordinate of the locationgeolocation_city
: City of the locationgeolocation_state
: State of the location
-
Order Items Table
order_id
: Unique identifier for each orderorder_item_id
: Identifier for each item in the orderproduct_id
: Unique identifier for each productseller_id
: Unique identifier for each sellershipping_limit_date
: Last date for shipping the itemprice
: Price of the itemfreight_value
: Freight value of the item
-
Order Payments Table
order_id
: Unique identifier for each orderpayment_sequential
: Sequential number of the paymentpayment_type
: Type of paymentpayment_installments
: Number of installments for paymentpayment_value
: Value of the payment
-
Order Reviews Table
review_id
: Unique identifier for each revieworder_id
: Unique identifier for each orderreview_score
: Score given in the reviewreview_comment_title
: Title of the review commentreview_comment_message
: Message of the review commentreview_creation_date
: Creation date of the reviewreview_answer_timestamp
: Timestamp of the review answer
-
Orders Table
order_id
: Unique identifier for each ordercustomer_id
: Unique identifier for each customerorder_status
: Status of the orderorder_purchase_timestamp
: Timestamp of the order purchaseorder_approved_at
: Timestamp when the order was approvedorder_delivered_carrier_date
: Date when the order was delivered to the carrierorder_delivered_customer_date
: Date when the order was delivered to the customerorder_estimated_delivery_date
: Estimated delivery date of the order
-
Products Table
product_id
: Unique identifier for each productproduct_category_name
: Name of the product categoryproduct_name_lenght
: Length of the product nameproduct_description_lenght
: Length of the product descriptionproduct_photos_qty
: Number of photos of the productproduct_weight_g
: Weight of the product in gramsproduct_length_cm
: Length of the product in centimetersproduct_height_cm
: Height of the product in centimetersproduct_width_cm
: Width of the product in centimeters
-
Sellers Table
seller_id
: Unique identifier for each sellerseller_zip_code_prefix
: Zip code prefix of the sellerseller_city
: City of the sellerseller_state
: State of the seller
-
Product Category Translation Table
product_category_name
: Name of the product categoryproduct_category_name_english
: Name of the product category in English
-
Data Exploration
- Load the dataset and explore the structure and relationships between tables.
- Provide a summary of each table and its attributes.
-
Customer Analysis
- Analyze the distribution of customers by city and state.
- Identify the top 10 cities with the highest number of customers.
-
Sales Analysis
- Determine the total sales value for each product category.
- Identify the top 10 best-selling products based on the number of items sold.
-
Order Analysis
- Analyze the order status distribution.
- Calculate the average delivery time for orders.
-
Review Analysis
- Analyze the distribution of review scores.
- Identify common themes in review comments using text analysis.
-
Payment Analysis
- Analyze the distribution of payment types.
- Calculate the average payment value for different payment types.
-
Geolocation Analysis
- Visualize the geolocation data to identify regions with high order density.
- Analyze the correlation between geolocation and delivery times.