Get started with the DS Agent on Walmart Stocks Data 2025
# -*- coding: utf-8 -*-
"""DS Agent - Walmart Stocks Data 2025 (Unfiltered).ipynb

Automatically generated by Colab.

Original file is located at
    https://colab.research.google.com/drive/1xQPflL75hlDCGRRntyEBdSt_DgegxlTi

# Task
Hey Data Science Agent, can you help with data analysis on this dataset? I'd like EDA, descriptive statistics, and some data visualizations. If possible, could you also suggest some time-series modeling?

Here is all the data you need:
"walmart_stock_prices.csv"

## Data loading

### Subtask:
Load the Walmart stock price data from the provided CSV file.

**Reasoning**:
Load the Walmart stock price data from the provided CSV file into a pandas DataFrame and display the first few rows.
"""
import pandas as pd

try:
    df = pd.read_csv('walmart_stock_prices.csv')
    display(df.head())
except FileNotFoundError:
    print("Error: 'walmart_stock_prices.csv' not found.")
    df = None  # Set df to None to indicate failure
except pd.errors.ParserError:
    print("Error: Could not parse the CSV file.")
    df = None
except Exception as e:
    print(f"An unexpected error occurred: {e}")
    df = None
"""## Data exploration | |
### Subtask: | |
Explore the loaded Walmart stock price dataset. | |
**Reasoning**: | |
I need to explore the data by checking its shape, data types, missing values, first/last few rows, and descriptive statistics. | |
""" | |
# Check the shape of the DataFrame
print("Shape of the DataFrame:", df.shape)

# Examine the data types of each column
print("\nData types of each column:\n", df.dtypes)

# Identify the number of missing values in each column
print("\nNumber of missing values in each column:\n", df.isnull().sum())

# Print the first 5 rows
print("\nFirst 5 rows:\n", df.head())

# Print the last 5 rows
print("\nLast 5 rows:\n", df.tail())

# Calculate basic descriptive statistics for numerical features
print("\nDescriptive statistics for numerical features:\n", df.describe())
"""## Data visualization | |
### Subtask: | |
Visualize the distribution of key numerical features and the time series of stock prices. | |
**Reasoning**: | |
Convert the 'Date' column to datetime objects and create the visualizations as requested. | |
""" | |
import matplotlib.pyplot as plt

# Convert the 'Date' column to datetime objects
df['Date'] = pd.to_datetime(df['Date'])

# Create histograms for 'Open', 'High', 'Low', and 'Close' prices
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.hist(df['Open'], bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Open Price')
plt.ylabel('Frequency')
plt.title('Distribution of Open Prices')

plt.subplot(2, 2, 2)
plt.hist(df['High'], bins=30, color='lightcoral', edgecolor='black')
plt.xlabel('High Price')
plt.ylabel('Frequency')
plt.title('Distribution of High Prices')

plt.subplot(2, 2, 3)
plt.hist(df['Low'], bins=30, color='lightgreen', edgecolor='black')
plt.xlabel('Low Price')
plt.ylabel('Frequency')
plt.title('Distribution of Low Prices')

plt.subplot(2, 2, 4)
plt.hist(df['Close'], bins=30, color='gold', edgecolor='black')
plt.xlabel('Close Price')
plt.ylabel('Frequency')
plt.title('Distribution of Close Prices')

plt.tight_layout()
plt.show()

# Create box plots for 'Open', 'High', 'Low', and 'Close' prices
plt.figure(figsize=(12, 6))
plt.boxplot([df['Open'], df['High'], df['Low'], df['Close']], labels=['Open', 'High', 'Low', 'Close'], patch_artist=True,
            boxprops=dict(facecolor='lightblue'), medianprops=dict(color='black'))
plt.ylabel('Price')
plt.title('Box Plots of Stock Prices')
plt.show()

# Plot the time series of the 'Close' price
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], color='blue')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Time Series of Walmart Stock Close Price')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
"""**Reasoning**: | |
The previous code block execution generated warnings and the code needs to be fixed. The `FutureWarning` related to mixed timezones in the `to_datetime` function needs to be addressed by specifying `utc=True`. Also, the `MatplotlibDeprecationWarning` about the `labels` parameter in `boxplot` should be fixed by changing it to `tick_labels`. Finally, I will re-execute the code to ensure that the plots are generated correctly. | |
""" | |
import matplotlib.pyplot as plt

# Convert the 'Date' column to datetime objects, handling mixed time zones
df['Date'] = pd.to_datetime(df['Date'], utc=True)

# Create histograms for 'Open', 'High', 'Low', and 'Close' prices
plt.figure(figsize=(12, 8))

plt.subplot(2, 2, 1)
plt.hist(df['Open'], bins=30, color='skyblue', edgecolor='black')
plt.xlabel('Open Price')
plt.ylabel('Frequency')
plt.title('Distribution of Open Prices')

plt.subplot(2, 2, 2)
plt.hist(df['High'], bins=30, color='lightcoral', edgecolor='black')
plt.xlabel('High Price')
plt.ylabel('Frequency')
plt.title('Distribution of High Prices')

plt.subplot(2, 2, 3)
plt.hist(df['Low'], bins=30, color='lightgreen', edgecolor='black')
plt.xlabel('Low Price')
plt.ylabel('Frequency')
plt.title('Distribution of Low Prices')

plt.subplot(2, 2, 4)
plt.hist(df['Close'], bins=30, color='gold', edgecolor='black')
plt.xlabel('Close Price')
plt.ylabel('Frequency')
plt.title('Distribution of Close Prices')

plt.tight_layout()
plt.show()

# Create box plots for 'Open', 'High', 'Low', and 'Close' prices
plt.figure(figsize=(12, 6))
plt.boxplot([df['Open'], df['High'], df['Low'], df['Close']],
            tick_labels=['Open', 'High', 'Low', 'Close'], patch_artist=True,
            boxprops=dict(facecolor='lightblue'), medianprops=dict(color='black'))
plt.ylabel('Price')
plt.title('Box Plots of Stock Prices')
plt.show()

# Plot the time series of the 'Close' price
plt.figure(figsize=(12, 6))
plt.plot(df['Date'], df['Close'], color='blue')
plt.xlabel('Date')
plt.ylabel('Close Price')
plt.title('Time Series of Walmart Stock Close Price')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
"""## Data visualization | |
### Subtask: | |
Visualize the correlations between different numerical features in the dataset. | |
**Reasoning**: | |
Calculate the correlation matrix and visualize it as a heatmap. | |
""" | |
import matplotlib.pyplot as plt
import seaborn as sns

# Calculate the correlation matrix over numeric columns only
# (the 'Date' column is now datetime and would break df.corr() in recent pandas)
correlation_matrix = df.corr(numeric_only=True)

# Create the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical Features')
plt.show()
"""## Data preparation | |
### Subtask: | |
Prepare the data for time series modeling. | |
**Reasoning**: | |
Convert the 'Date' column to datetime objects, set it as the index, select the 'Close' price, difference the series, create lagged features, and split the data into training and testing sets. | |
""" | |
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Convert 'Date' to datetime (a no-op here, since it was converted above) and set as index
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')

# Create a new DataFrame for the target variable
close_prices = pd.DataFrame(df['Close'])

# Differencing
close_prices['Close_diff'] = close_prices['Close'].diff()

# Lagged features
close_prices['Close_lag1'] = close_prices['Close'].shift(1)

# Scaling (optional, but demonstrating the process)
scaler = MinMaxScaler()
close_prices[['Close_scaled']] = scaler.fit_transform(close_prices[['Close']])

# Splitting the data
train_size = int(len(close_prices) * 0.8)
train_data = close_prices[:train_size]
test_data = close_prices[train_size:]

display(train_data.head())
display(test_data.head())
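"""Before committing to an ARIMA order, it is worth checking how much differencing the series actually needs. Below is a minimal stationarity check using the Augmented Dickey-Fuller test from `statsmodels` (assumed available, since it is used for ARIMA later); the 0.05 threshold is just the conventional cutoff, not something dictated by this dataset.
"""
from statsmodels.tsa.stattools import adfuller

# ADF test: the null hypothesis is that the series has a unit root (non-stationary)
adf_stat, p_value = adfuller(close_prices['Close'].dropna())[:2]
print(f"ADF on levels:     statistic={adf_stat:.3f}, p-value={p_value:.3f}")

# Repeat on the first difference computed above; a p-value well below 0.05
# here suggests d=1 is sufficient for the ARIMA orders tried below
adf_stat_d, p_value_d = adfuller(close_prices['Close_diff'].dropna())[:2]
print(f"ADF on first diff: statistic={adf_stat_d:.3f}, p-value={p_value_d:.3f}")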
"""## Model training | |
### Subtask: | |
Train an ARIMA model on the prepared training data. | |
**Reasoning**: | |
Train an ARIMA model on the training data and generate predictions for the test data. | |
""" | |
from statsmodels.tsa.arima.model import ARIMA

# Instantiate and fit the ARIMA model
try:
    model = ARIMA(train_data['Close'], order=(5, 1, 0))
    model_fit = model.fit()

    # Generate predictions for the test data
    predictions = model_fit.predict(start=len(train_data), end=len(train_data) + len(test_data) - 1)

    # Store predictions in the test data
    test_data['ARIMA_Predictions'] = predictions
    display(test_data.head())
except Exception as e:
    print(f"An error occurred during model training or prediction: {e}")
"""**Reasoning**: | |
The previous code block generated a `SettingWithCopyWarning`. I will modify the code to use `.loc` to avoid this warning and address the ValueWarnings related to the date index. I will also investigate the NaN values in the 'ARIMA_Predictions' column. | |
""" | |
from statsmodels.tsa.arima.model import ARIMA

# Instantiate and fit the ARIMA model
try:
    model = ARIMA(train_data['Close'], order=(5, 1, 0))
    model_fit = model.fit()

    # Generate predictions for the test data
    predictions = model_fit.predict(start=len(train_data), end=len(train_data) + len(test_data) - 1)

    # Store predictions in the test data using .loc to avoid SettingWithCopyWarning
    test_data.loc[:, 'ARIMA_Predictions'] = predictions
    display(test_data.head())
except Exception as e:
    print(f"An error occurred during model training or prediction: {e}")
"""**Reasoning**: | |
The ARIMA model predictions are all NaN. I suspect this is due to the date index still being present, despite the warnings. I will try removing the date index from both training and testing data before fitting and predicting, and then try again. | |
""" | |
from statsmodels.tsa.arima.model import ARIMA

# Remove the date index from train_data and test_data
train_data_no_index = train_data.reset_index(drop=True)
test_data_no_index = test_data.reset_index(drop=True)

# Instantiate and fit the ARIMA model
try:
    model = ARIMA(train_data_no_index['Close'], order=(5, 1, 0))
    model_fit = model.fit()

    # Generate predictions for the test data
    predictions = model_fit.predict(start=len(train_data_no_index), end=len(train_data_no_index) + len(test_data_no_index) - 1)

    # Store predictions in the test data
    test_data_no_index['ARIMA_Predictions'] = predictions
    display(test_data_no_index.head())
except Exception as e:
    print(f"An error occurred during model training or prediction: {e}")
"""**Reasoning**: | |
The ARIMA predictions are still NaN. I will try a different (p,d,q) order for the ARIMA model. I'll also double-check that the data is in the correct format and that there are no remaining issues with the date index. It might be useful to print the shape and info of the data to help diagnose the problem. | |
""" | |
from statsmodels.tsa.arima.model import ARIMA

# Remove the date index from train_data and test_data
train_data_no_index = train_data.reset_index(drop=True)
test_data_no_index = test_data.reset_index(drop=True)

# DataFrame.info() prints directly and returns None, so call it on its own line
print("Train data shape:", train_data_no_index.shape)
train_data_no_index.info()
print("\nTest data shape:", test_data_no_index.shape)
test_data_no_index.info()

# Try a different (p, d, q) order for ARIMA
try:
    model = ARIMA(train_data_no_index['Close'], order=(1, 1, 1))  # Trying (1,1,1)
    model_fit = model.fit()

    # Generate predictions for the test data
    predictions = model_fit.predict(start=len(train_data_no_index), end=len(train_data_no_index) + len(test_data_no_index) - 1)
    test_data_no_index['ARIMA_Predictions'] = predictions
    display(test_data_no_index.head())
except Exception as e:
    print(f"An error occurred: {e}")
"""## Model training | |
### Subtask: | |
Train an ARIMA model on the prepared training data. | |
**Reasoning**: | |
The previous attempts to train the ARIMA model resulted in NaN predictions. The issue might stem from the differencing and scaling applied to the data. Let's try training the model directly on the 'Close' column of the training data without any transformations, and try different (p, d, q) orders. The index of the train data is a DateTimeIndex. | |
""" | |
from statsmodels.tsa.arima.model import ARIMA

# Try different (p, d, q) orders for ARIMA
for order in [(1, 1, 1), (5, 1, 0), (0, 1, 1), (1, 0, 0)]:
    try:
        print(f"Trying order: {order}")
        model = ARIMA(train_data['Close'], order=order)
        model_fit = model.fit()

        # Generate predictions for the test data
        predictions = model_fit.predict(start=len(train_data), end=len(train_data) + len(test_data) - 1)

        # Store predictions in the test data
        test_data['ARIMA_Predictions'] = predictions
        display(test_data.head())
        break  # Exit the loop if successful
    except Exception as e:
        print(f"An error occurred with order {order}: {e}")
"""**Reasoning**: | |
The previous attempts still resulted in NaN predictions. Let's try setting the frequency of the DateTimeIndex and see if that helps. Then, try fitting the model again. | |
""" | |
from statsmodels.tsa.arima.model import ARIMA

# Set the frequency of the DatetimeIndex to daily
# (this inserts NaN rows for non-trading days such as weekends)
train_data = train_data.asfreq('D')
test_data = test_data.asfreq('D')

# Try different (p, d, q) orders for ARIMA
for order in [(1, 1, 1), (5, 1, 0), (0, 1, 1), (1, 0, 0)]:
    try:
        print(f"Trying order: {order}")
        model = ARIMA(train_data['Close'], order=order)
        model_fit = model.fit()

        # Generate predictions for the test data
        predictions = model_fit.predict(start=len(train_data), end=len(train_data) + len(test_data) - 1)

        # Store predictions in the test data
        test_data['ARIMA_Predictions'] = predictions
        display(test_data.head())
        break  # Exit the loop if successful
    except Exception as e:
        print(f"An error occurred with order {order}: {e}")
"""**Reasoning**: | |
The ARIMA model is now producing predictions, but there are still NaN values in the 'Close' column of the test data. We need to handle these NaN values and evaluate the model. | |
""" | |
from sklearn.metrics import mean_squared_error
import numpy as np

# Fill NaN values in the 'Close' column of test_data using forward fill
# (fillna(method='ffill') is deprecated in recent pandas; use .ffill())
test_data['Close'] = test_data['Close'].ffill()

# Evaluate the model (example using RMSE)
rmse = np.sqrt(mean_squared_error(test_data['Close'], test_data['ARIMA_Predictions']))
print(f"RMSE: {rmse}")
display(test_data.head())
"""## Model evaluation | |
### Subtask: | |
Evaluate the performance of the ARIMA model using RMSE, MAE, and MAPE. | |
**Reasoning**: | |
Evaluate the performance of the ARIMA model using RMSE, MAE, and MAPE. Handle potential division-by-zero errors. Print the results. | |
""" | |
from sklearn.metrics import mean_squared_error, mean_absolute_error
import numpy as np

# Calculate RMSE
rmse = np.sqrt(mean_squared_error(test_data['Close'], test_data['ARIMA_Predictions']))

# Calculate MAE
mae = mean_absolute_error(test_data['Close'], test_data['ARIMA_Predictions'])

# Calculate MAPE
# Avoid division by zero by adding a small constant to the denominator
mape = np.mean(np.abs((test_data['Close'] - test_data['ARIMA_Predictions']) / (test_data['Close'] + 1e-8))) * 100

print(f"RMSE: {rmse:.2f}")
print(f"MAE: {mae:.2f}")
print(f"MAPE: {mape:.2f}%")
"""## Summary: | |
### 1. Q&A | |
* **Can you suggest some time-series modeling?** Yes, the analysis explored using an ARIMA model. However, the initial attempts to fit the ARIMA model resulted in NaN predictions. Setting the frequency of the DateTimeIndex to daily ('D') improved the prediction generation but still required forward filling of NaN values in the 'Close' column. The final ARIMA model had an RMSE of approximately 23.51, MAE of 17.42, and MAPE of 37.57%, suggesting the need for further model tuning or alternative model exploration. | |
### 2. Data Analysis Key Findings | |
* **Data Characteristics:** The dataset contains 13,233 rows and 8 columns with no missing values. The 'Date' column, initially an object, was converted to datetime, and the 'Volume' column has a mean of ~22.4 million, with a maximum exceeding 395 million. 'Dividends' and 'Stock Splits' are mostly zero. | |
* **Price Distributions:** Histograms and box plots were generated to visualize the distributions of 'Open', 'High', 'Low', and 'Close' prices. Box plots helped to identify potential outliers in the stock prices. | |
* **Time Series Analysis:** The time series of the 'Close' price was plotted over time showing trends and seasonality. | |
* **Feature Correlation:** A heatmap of the correlation matrix revealed the correlation coefficients between numerical features. | |
* **ARIMA Model Performance:** The ARIMA model, with various (p,d,q) order parameters, was trained and evaluated using RMSE, MAE, and MAPE. The final model yielded an RMSE of approximately 23.51, MAE of 17.42, and MAPE of 37.57%, indicating a need for improvement. | |
* **Data Preparation for Time Series:** The data was preprocessed using differencing, lagged features, and scaling, preparing it for time series modeling. The data was split into 80% training and 20% testing sets. | |
### 3. Insights or Next Steps | |
* **Improve ARIMA Model:** The current ARIMA model's performance is not satisfactory. Explore different (p,d,q) combinations more systematically, using techniques like ACF and PACF plots to determine optimal values. Consider alternative models (e.g., Prophet, SARIMAX) that might better capture the underlying time series dynamics. | |
* **Feature Engineering:** Investigate potential for creating additional features (e.g., rolling averages, moving standard deviations, external economic indicators) that might enhance the model's predictive power. | |
""" |