The Error Code

The document outlines a data analysis process involving histograms, average price calculations, and regression modeling based on product pricing data. It includes visualizations for the distribution of prices ending in specific digits, compares average prices for products based on their last digits, and fits regression models to analyze the impact of these digits on log-transformed prices. Additionally, it details the handling of product data merging and frequency analysis to identify top products.

Uploaded by Protik Deb
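The script below relies on derived columns (`rightmost_digit`, `two_rightmost_digits`, `ends_in_9`, `ends_in_99`) that are created in an earlier part of the notebook not shown here. As a point of reference, a minimal sketch of how such columns might be derived from `PRICE` follows; the construction via whole cents is an assumption, not the notebook's own code.

```python
import pandas as pd

# Hypothetical construction of the digit-ending columns the analysis relies on;
# the toy merged_data here stands in for the real merged dataset.
merged_data = pd.DataFrame({'PRICE': [1.99, 2.49, 3.09, 4.00]})

# Price in whole cents, rounded to avoid floating-point drift
cents = (merged_data['PRICE'] * 100).round().astype(int)

merged_data['rightmost_digit'] = cents % 10         # last digit of the price
merged_data['two_rightmost_digits'] = cents % 100   # cents component (0-99)
merged_data['ends_in_9'] = (merged_data['rightmost_digit'] == 9).astype(int)
merged_data['ends_in_99'] = (merged_data['two_rightmost_digits'] == 99).astype(int)

print(merged_data)
```

Working in integer cents sidesteps the floating-point issues that `PRICE % 0.10`-style arithmetic would introduce.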

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Plot the distribution of the two rightmost digits of prices
plt.hist(merged_data['two_rightmost_digits'], bins=100, edgecolor='black', alpha=0.7)
plt.title("Histogram of Two Rightmost Digits of Prices")
plt.xlabel("Two Rightmost Digits")
plt.ylabel("Frequency")

# Set x-axis ticks at intervals of 10 for readability
plt.xticks(range(0, 100, 10))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

# Create subplots for each store, showing the histogram of rightmost digits
stores = merged_data['STORE'].unique()
plt.figure(figsize=(15, 20))  # Adjust figure size for better visibility

for idx, store in enumerate(stores, start=1):
    plt.subplot(5, 4, idx)  # Arrange in a grid of 5 rows and 4 columns
    store_data = merged_data[merged_data['STORE'] == store]
    plt.hist(store_data['rightmost_digit'], bins=10, edgecolor='black', alpha=0.7)
    plt.title(f"Store {store}")
    plt.xlabel("Rightmost Digit")
    plt.ylabel("Frequency")
    plt.xticks(range(10))  # Ensure digits 0-9 are displayed

plt.tight_layout()
plt.show()

# Calculate average prices for products ending in 9 and not ending in 9
avg_price_ends_in_9 = merged_data[merged_data['ends_in_9'] == 1]['PRICE'].mean()
avg_price_not_ends_in_9 = merged_data[merged_data['ends_in_9'] == 0]['PRICE'].mean()

# Create a bar chart
plt.figure(figsize=(8, 6))
plt.bar(['Ends in 9', 'Does not end in 9'],
        [avg_price_ends_in_9, avg_price_not_ends_in_9],
        color=['blue', 'orange'])
plt.title("Average Price: Products Ending in 9 vs Not Ending in 9")
plt.ylabel("Average Price ($)")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

avg_price_ends_in_9, avg_price_not_ends_in_9

# Calculate average prices for products ending in 99 and not ending in 99
avg_price_ends_in_99 = merged_data[merged_data['ends_in_99'] == 1]['PRICE'].mean()
avg_price_not_ends_in_99 = merged_data[merged_data['ends_in_99'] == 0]['PRICE'].mean()

# Create a bar chart
plt.figure(figsize=(8, 6))
plt.bar(['Ends in 99', 'Does not end in 99'],
        [avg_price_ends_in_99, avg_price_not_ends_in_99],
        color=['green', 'red'])
plt.title("Average Price: Products Ending in 99 vs Not Ending in 99")
plt.ylabel("Average Price ($)")
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()

avg_price_ends_in_99, avg_price_not_ends_in_99

# Calculate average prices for products ending in 9 vs not ending in 9 for each store
store_averages = []

for store in merged_data['STORE'].unique():
    store_data = merged_data[merged_data['STORE'] == store]
    avg_ends_in_9 = store_data[store_data['ends_in_9'] == 1]['PRICE'].mean()
    avg_not_ends_in_9 = store_data[store_data['ends_in_9'] == 0]['PRICE'].mean()
    store_averages.append({
        'Store': store,
        'Ends in 9': avg_ends_in_9,
        'Does not end in 9': avg_not_ends_in_9
    })

store_averages_df = pd.DataFrame(store_averages)

# Plot bar charts for each store
fig, axes = plt.subplots(nrows=5, ncols=3, figsize=(15, 20))
axes = axes.flatten()

for idx, row in store_averages_df.iterrows():
    ax = axes[idx]
    ax.bar(['Ends in 9', 'Does not end in 9'],
           [row['Ends in 9'], row['Does not end in 9']],
           color=['blue', 'orange'])
    ax.set_title(f"Store {int(row['Store'])}")
    ax.set_ylabel("Average Price ($)")
    ax.grid(axis='y', linestyle='--', alpha=0.7)

# Remove empty subplots
for ax in axes[len(store_averages_df):]:
    ax.axis('off')

plt.tight_layout()
plt.show()

# Ensure the 'log_PRICE' variable is created (log of prices)
if 'log_PRICE' not in merged_data.columns:
    merged_data['log_PRICE'] = np.log(merged_data['PRICE'])

# Define the independent variable (dummy for prices ending in 9)
X = sm.add_constant(merged_data['ends_in_9'])  # Add constant for intercept
y = merged_data['log_PRICE']  # Dependent variable (log of price)

# Fit the regression model
model = sm.OLS(y, X).fit()

# Display the regression summary
print("Regression results for log(PRICE) ~ ends_in_9:")
print(model.summary())

# Extract and interpret key results
coefficient = model.params['ends_in_9']
std_error = model.bse['ends_in_9']
p_value = model.pvalues['ends_in_9']

# Report key metrics
print(f"\nCoefficient for 'ends_in_9': {coefficient}")
print(f"Standard Error: {std_error}")
print(f"p-value: {p_value}")

if p_value < 0.05:
    print("The price difference between products ending in 9 and those not ending in 9 is statistically significant.")
else:
    print("The price difference between products ending in 9 and those not ending in 9 is NOT statistically significant.")
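Because the outcome is log(PRICE), the dummy coefficient is a log-point difference, not a percentage. A small sketch of the standard conversion, using the `ends_in_9` coefficient of 0.1697 reported in the regression output later in this document (the helper name is illustrative, not from the notebook):

```python
import math

# A coefficient b on a dummy in a log-outcome regression implies the dummy=1
# group is about (exp(b) - 1) * 100 percent higher on average.
def log_coeff_to_pct(b):
    return (math.exp(b) - 1) * 100

pct = log_coeff_to_pct(0.1697)
print(f"{pct:.1f}%")  # roughly 18.5%: prices ending in 9 are ~18.5% higher
```

For small coefficients the raw value times 100 is a close approximation, but at 0.17 log points the exact conversion already differs noticeably (18.5% vs 17%).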

# Ensure the necessary variables exist in the dataset
if 'log_PRICE' not in merged_data.columns:
    merged_data['log_PRICE'] = np.log(merged_data['PRICE'])

# Define the independent variable (dummy for prices ending in 99)
X_99 = sm.add_constant(merged_data['ends_in_99'])  # Add constant for intercept
y_log_price = merged_data['log_PRICE']  # Dependent variable (log of price)

# Fit the regression model
model_99 = sm.OLS(y_log_price, X_99).fit()

# Display the regression summary
model_99_summary = model_99.summary()
model_99_summary

# Create an empty list to store the results
store_results_log_price = []

# Loop through each store to calculate the regression for log(PRICE) ~ ends_in_9
for store in merged_data['STORE'].unique():
    store_data = merged_data[merged_data['STORE'] == store]
    X_store = sm.add_constant(store_data['ends_in_9'])
    y_store = np.log(store_data['PRICE'])
    model_store = sm.OLS(y_store, X_store).fit()

    # Save the results
    store_results_log_price.append({
        "STORE": store,
        "Coeff (ends_in_9)": model_store.params['ends_in_9'],
        "Std. Error": model_store.bse['ends_in_9'],
        "p-value": model_store.pvalues['ends_in_9']
    })

# Convert results to a DataFrame
store_results_df = pd.DataFrame(store_results_log_price)

# Display the results table
store_results_df

merged_data['log_price'] = np.log(merged_data['PRICE'])
merged_data['log_move'] = np.log(merged_data['MOVE'])

# Create an empty list to store the results for question 22 regression
store_results_log_price_ends_in_99 = []

# Loop through each store to calculate the regression for log(PRICE) ~ ends_in_99
for store in merged_data['STORE'].unique():
    store_data = merged_data[merged_data['STORE'] == store]
    X_store = sm.add_constant(store_data['ends_in_99'])
    y_store = np.log(store_data['PRICE'])
    model_store = sm.OLS(y_store, X_store).fit()

    # Save the results
    store_results_log_price_ends_in_99.append({
        "STORE": store,
        "Coeff (ends_in_99)": model_store.params['ends_in_99'],
        "Std. Error": model_store.bse['ends_in_99'],
        "p-value": model_store.pvalues['ends_in_99']
    })

# Convert results to a DataFrame
store_results_ends_in_99_df = pd.DataFrame(store_results_log_price_ends_in_99)

# Display the results table
store_results_ends_in_99_df

# Check data types and convert to numeric
merged_data['log_PRICE'] = pd.to_numeric(merged_data['log_PRICE'], errors='coerce')
merged_data['log_MOVE'] = pd.to_numeric(merged_data['log_MOVE'], errors='coerce')
merged_data['ends_in_9'] = pd.to_numeric(merged_data['ends_in_9'], errors='coerce')

# Create store dummies and ensure all columns are numeric
store_dummies = pd.get_dummies(merged_data['STORE'], prefix='STORE', dtype=float)
X = pd.concat([merged_data[['log_PRICE', 'ends_in_9']], store_dummies], axis=1)
y = merged_data['log_MOVE']

# Drop rows with NaN values (resulting from non-numeric conversions)
X = X.dropna()
y = y.loc[X.index]

# Fit the regression model
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()

# Extract coefficients and p-values
log_price_coeff = model.params['log_PRICE']
log_price_pvalue = model.pvalues['log_PRICE']
ends_in_9_coeff = model.params['ends_in_9']
ends_in_9_pvalue = model.pvalues['ends_in_9']

# Display regression summary
print(model.summary())

# Interpret results
print(f"Coefficient for log(PRICE): {log_price_coeff}, p-value: {log_price_pvalue}")
print(f"Coefficient for ends_in_9: {ends_in_9_coeff}, p-value: {ends_in_9_pvalue}")

# Check data types and convert to numeric
merged_data['log_PRICE'] = pd.to_numeric(merged_data['log_PRICE'], errors='coerce')
merged_data['log_MOVE'] = pd.to_numeric(merged_data['log_MOVE'], errors='coerce')
merged_data['ends_in_99'] = pd.to_numeric(merged_data['ends_in_99'], errors='coerce')

# Create store dummies and ensure all columns are numeric
store_dummies = pd.get_dummies(merged_data['STORE'], prefix='STORE', dtype=float)
X = pd.concat([merged_data[['log_PRICE', 'ends_in_99']], store_dummies], axis=1)
y = merged_data['log_MOVE']

# Drop rows with NaN values (resulting from non-numeric conversions)
X = X.dropna()
y = y.loc[X.index]

# Fit the regression model
X = sm.add_constant(X)  # Add constant for intercept
model = sm.OLS(y, X).fit()

# Extract coefficients and p-values
log_price_coeff = model.params['log_PRICE']
log_price_pvalue = model.pvalues['log_PRICE']
ends_in_99_coeff = model.params['ends_in_99']
ends_in_99_pvalue = model.pvalues['ends_in_99']

# Display regression summary
print(model.summary())

# Interpret results
print(f"\nCoefficient for log(PRICE): {log_price_coeff}, p-value: {log_price_pvalue}")
print(f"Coefficient for ends_in_99: {ends_in_99_coeff}, p-value: {ends_in_99_pvalue}")

# Interpretation
if log_price_pvalue < 0.05:
    print("The log of the price is statistically significant.")
else:
    print("The log of the price is NOT statistically significant.")

if ends_in_99_pvalue < 0.05:
    print("The dummy variable for prices ending in 99 is statistically significant.")
else:
    print("The dummy variable for prices ending in 99 is NOT statistically significant.")
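In the log(MOVE) ~ log(PRICE) specification used above, the `log_PRICE` coefficient is the price elasticity of demand. A small sketch of what such an elasticity implies for quantities, using a hypothetical value of -2.0 (the elasticity and helper name are illustrative, not results from this data):

```python
# With constant elasticity e, multiplying price by a factor k multiplies
# expected quantity by k ** e.
def quantity_multiplier(elasticity, price_ratio):
    return price_ratio ** elasticity

# Hypothetical example: elasticity -2.0, price raised by 10%
m = quantity_multiplier(-2.0, 1.10)
print(f"quantity multiplier: {m:.3f}")  # about 0.826, i.e. roughly 17% fewer units
```

This is why the sign and magnitude of the `log_PRICE` coefficient, rather than just its p-value, carry the economic content of the regression.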

# Prepare a list to store results for each store
store_results = []

# Loop through each store and run the regression separately
for store in merged_data['STORE'].unique():
    # Subset data for the current store
    store_data = merged_data[merged_data['STORE'] == store]

    # Define independent and dependent variables
    X_store = store_data[['log_PRICE', 'ends_in_9']]
    y_store = store_data['log_MOVE']

    # Add a constant to the independent variables
    X_store = sm.add_constant(X_store)

    # Fit the regression model
    model_store = sm.OLS(y_store, X_store).fit()

    # Extract coefficients, standard errors, and p-values
    price_coeff = model_store.params['log_PRICE']
    price_se = model_store.bse['log_PRICE']
    price_pvalue = model_store.pvalues['log_PRICE']
    ends_in_9_coeff = model_store.params['ends_in_9']
    ends_in_9_se = model_store.bse['ends_in_9']
    ends_in_9_pvalue = model_store.pvalues['ends_in_9']

    # Append results to the list
    store_results.append({
        'Store': store,
        'Price Coeff': price_coeff,
        'Price SE': price_se,
        'Price p-value': price_pvalue,
        'Ends_in_9 Coeff': ends_in_9_coeff,
        'Ends_in_9 SE': ends_in_9_se,
        'Ends_in_9 p-value': ends_in_9_pvalue
    })

# Convert results into a DataFrame
store_results_df = pd.DataFrame(store_results)

# Display the results
store_results_df

# Prepare a list to store results for each store
store_results_ends_in_99 = []

# Loop through each store and run the regression separately
for store in merged_data['STORE'].unique():
    # Subset data for the current store
    store_data = merged_data[merged_data['STORE'] == store]

    # Define independent and dependent variables
    X_store = store_data[['log_PRICE', 'ends_in_99']]
    y_store = store_data['log_MOVE']

    # Add a constant to the independent variables
    X_store = sm.add_constant(X_store)

    # Fit the regression model
    model_store = sm.OLS(y_store, X_store).fit()

    # Extract coefficients, standard errors, and p-values
    price_coeff = model_store.params['log_PRICE']
    price_se = model_store.bse['log_PRICE']
    price_pvalue = model_store.pvalues['log_PRICE']
    ends_in_99_coeff = model_store.params['ends_in_99']
    ends_in_99_se = model_store.bse['ends_in_99']
    ends_in_99_pvalue = model_store.pvalues['ends_in_99']

    # Append results to the list
    store_results_ends_in_99.append({
        'Store': store,
        'Price Coeff': price_coeff,
        'Price SE': price_se,
        'Price p-value': price_pvalue,
        'Ends_in_99 Coeff': ends_in_99_coeff,
        'Ends_in_99 SE': ends_in_99_se,
        'Ends_in_99 p-value': ends_in_99_pvalue
    })

# Convert results into a DataFrame (note: built from the list, not from itself)
store_results_ends_in_99_df = pd.DataFrame(store_results_ends_in_99)

# Display the results
store_results_ends_in_99_df

# Load the product information database
product_data_path = r"C:\Users\Hila\Python\Final_task\upcsdr.xlsx"
product_data = pd.read_excel(product_data_path)

# Check the column names
print("Product Data Columns:", product_data.columns)
print("Sales Data Columns:", sales_data.columns)

# Merge datasets
merged_with_products = pd.merge(sales_data, product_data, on="UPC", how="inner")

# Check merged dataset columns
print("Merged Dataset Columns:", merged_with_products.columns)

# Verify if 'DESCRIP' exists
if 'DESCRIP' not in merged_with_products.columns:
    print("The 'DESCRIP' column is missing. Please verify column names in the product_data file.")
else:
    # Calculate frequency of products
    product_counts = merged_with_products['DESCRIP'].value_counts()

    # Display the top 3-4 most frequent products
    top_products = product_counts.head(4).reset_index()
    top_products.columns = ['DESCRIP', 'Frequency']

    # Display the results
    print("Top Products by Frequency:")
    print(top_products)
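One caveat with the inner merge above is that sales rows whose UPC has no match in the product file are silently dropped. A sketch of a coverage check using `pd.merge` with `indicator=True` on toy data (the toy frames and values stand in for the real `sales_data` and `product_data`):

```python
import pandas as pd

# Toy stand-ins for sales_data and product_data
sales = pd.DataFrame({'UPC': [111, 222, 333], 'PRICE': [1.99, 2.49, 0.99]})
products = pd.DataFrame({'UPC': [111, 222, 444], 'DESCRIP': ['Cola', 'Chips', 'Soap']})

# An outer merge with indicator=True counts rows that fail to match,
# which an inner merge would drop without warning.
check = pd.merge(sales, products, on='UPC', how='outer', indicator=True)
print(check['_merge'].value_counts())
```

If `left_only` is large relative to `both`, the frequency and price statistics computed after the inner merge would rest on a biased subset of sales.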

# Calculate the frequency of each product in the merged dataset
product_counts = merged_data.groupby(['DESCRIP', 'UPC', 'SIZE']).size().reset_index(name='Frequency')

# Identify the top products by frequency
# Adjust the number as needed (e.g., top 4 products)
top_frequent_products = product_counts.nlargest(4, 'Frequency')

# Calculate the average price for each product
average_price = merged_data.groupby(['DESCRIP', 'UPC', 'SIZE'])['PRICE'].mean().reset_index(name='Average Price')

# Merge to include the average price
top_frequent_products = pd.merge(top_frequent_products, average_price, on=['DESCRIP', 'UPC', 'SIZE'])

# Display the table with required details
print(top_frequent_products)

Loading datasets...

Important

Figures are displayed in the Plots pane by default. To make them also appear inline in the console, you need to uncheck "Mute inline plotting" under the options menu of Plots.

C:\Users\הילה\AppData\Local\Temp\ipykernel_24960\2575356717.py:45: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  summary_statistics.rename(columns={'index': 'Variable'}, inplace=True)

C:\Users\הילה\AppData\Local\Temp\ipykernel_24960\2575356717.py:67: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_summary_statistics.rename(columns={'index': 'Variable'}, inplace=True)

C:\Users\הילה\AppData\Local\Temp\ipykernel_24960\2575356717.py:89: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  final_summary_statistics.rename(columns={'index': 'Variable'}, inplace=True)

Regression results for log(PRICE) ~ ends_in_9:
                            OLS Regression Results
==============================================================================
Dep. Variable:              log_PRICE   R-squared:                       0.009
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     91.44
Date:                Wed, 12 Mar 2025   Prob (F-statistic):           1.42e-21
Time:                        23:10:32   Log-Likelihood:                -11630.
No. Observations:               10000   AIC:                         2.326e+04
Df Residuals:                    9998   BIC:                         2.328e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5212      0.009     58.071      0.000       0.504       0.539
ends_in_9      0.1697      0.018      9.562      0.000       0.135       0.205
==============================================================================
Omnibus:                       16.285   Durbin-Watson:                   1.999
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               16.326
Skew:                          -0.095   Prob(JB):                     0.000285
Kurtosis:                       2.941   Cond. No.                         2.47
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Coefficient for 'ends_in_9': 0.16972185095165745
Standard Error: 0.017748725476650982
p-value: 1.4216469146422434e-21
The price difference between products ending in 9 and those not ending in 9 is statistically significant.

Traceback (most recent call last):

  File C:\Apps\pyton\Lib\site-packages\pandas\core\indexes\base.py:3805 in get_loc
    return self._engine.get_loc(casted_key)
  File index.pyx:167 in pandas._libs.index.IndexEngine.get_loc
  File index.pyx:196 in pandas._libs.index.IndexEngine.get_loc
  File pandas\_libs\hashtable_class_helper.pxi:7081 in pandas._libs.hashtable.PyObjectHashTable.get_item
  File pandas\_libs\hashtable_class_helper.pxi:7089 in pandas._libs.hashtable.PyObjectHashTable.get_item

KeyError: 'log_MOVE'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):

  Cell In[1], line 510
    merged_data['log_MOVE'], errors='coerce')
  File C:\Apps\pyton\Lib\site-packages\pandas\core\frame.py:4102 in __getitem__
    indexer = self.columns.get_loc(key)
  File C:\Apps\pyton\Lib\site-packages\pandas\core\indexes\base.py:3812 in get_loc
    raise KeyError(key) from err

KeyError: 'log_MOVE'
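The traceback points at the likely root cause: earlier in the script the columns were created with lowercase names (`log_price`, `log_move`), while the conversion step looks up uppercase `log_MOVE`. A defensive sketch that creates the column from `MOVE` before converting, on a toy stand-in for `merged_data` (column names are taken from the rest of the script; the toy values are illustrative):

```python
import numpy as np
import pandas as pd

# Minimal stand-in for merged_data; the real frame has many more columns.
merged_data = pd.DataFrame({'PRICE': [1.99, 2.49], 'MOVE': [10, 20]})

# Create log_MOVE from MOVE if it does not already exist, instead of assuming
# it does; this avoids the KeyError: 'log_MOVE'.
if 'log_MOVE' not in merged_data.columns:
    merged_data['log_MOVE'] = np.log(merged_data['MOVE'])

merged_data['log_MOVE'] = pd.to_numeric(merged_data['log_MOVE'], errors='coerce')
print(merged_data['log_MOVE'].round(3).tolist())
```

Alternatively, renaming the earlier lowercase columns once (e.g. to `log_PRICE` and `log_MOVE`) and using one casing consistently throughout would remove the mismatch at its source.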
