#_ Automation With Python & Excel [ Use Cases ]
1. Introduction
Excel is widely used for presenting and analyzing data, but repetitive
tasks in it can be time-consuming. That's where Python comes into play:
a script can carry out the same steps automatically and save a great
deal of time.
2. Background
When automating Excel with Python, the main library used is openpyxl.
It reads and writes Excel workbooks in the .xlsx format.
How does it work? At a high level, when you're working with Excel via
openpyxl, you're actually interacting with objects in memory. For
instance, a "Workbook" object represents an Excel file, while a
"Worksheet" object represents an individual sheet.
3. Setting Up
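1. First, install the necessary libraries with pip. The core examples
need only openpyxl; the advanced examples also use pandas, requests,
statsmodels, scikit-learn, and matplotlib:
pip install openpyxl
pip install pandas requests statsmodels scikit-learn matplotlib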
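Once openpyxl is installed, a quick way to confirm it works is to build a workbook purely in memory. The sketch below also shows the Workbook and Worksheet objects described in the Background section. The sheet title and cell contents are made up for illustration, and nothing is written to disk unless you call wb.save().

```python
from openpyxl import Workbook

# A Workbook object represents a whole Excel file, held in memory.
wb = Workbook()

# A Worksheet object represents one sheet; a new workbook starts with one.
ws = wb.active
ws.title = "Sales"  # hypothetical sheet name

# Cells can be addressed by coordinate or by 1-based row/column numbers.
ws["A1"] = "Month"
ws.cell(row=1, column=2, value="Sales")
ws.append(["Jan", 1200])  # append() writes to the next empty row

print(wb.sheetnames)
print(ws.max_row, ws["B2"].value)
```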
4. Thinking About Automation
Identify repetitive tasks: Automation starts by identifying a
repetitive task. Example: You may have to format new data the same way
every week.
Break tasks into steps: Understand the step-by-step process you'd
normally do manually.
By: Waleed Mousa
Translate to code: Once you've identified the manual steps, you'll
convert these into Python code.
5. Real-World Example: Summarizing Monthly Sales
Scenario: You get a monthly Excel sheet with sales data. You want to
calculate the total sales and average sales for the month, then add
this info to the sheet.
Manual steps:
1. Open the file.
2. Identify the range of sales data.
3. Calculate the total and average.
4. Write the total and average at the end of the column.
Python Automation:
import openpyxl
# Step 1: Open the file
wb = openpyxl.load_workbook('monthly_sales.xlsx')
sheet = wb.active
# Step 2: Identify the range of sales data
last_row = sheet.max_row
sales_data = [sheet.cell(row=i, column=2).value
              for i in range(2, last_row + 1)]
# Step 3: Calculate the total and average
total_sales = sum(sales_data)
avg_sales = total_sales / len(sales_data)
# Step 4: Write the total and average at the end of the column
sheet.cell(row=last_row + 1, column=1, value="Total Sales:")
sheet.cell(row=last_row + 1, column=2, value=total_sales)
sheet.cell(row=last_row + 2, column=1, value="Average Sales:")
sheet.cell(row=last_row + 2, column=2, value=avg_sales)
# Save changes
wb.save('monthly_sales_summary.xlsx')
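The script above passes the cell values straight to sum(), which raises a TypeError if the column contains a blank cell (read as None) or a text entry. A minimal sketch of a more forgiving calculation follows; the helper name summarize_sales and the sample column are hypothetical, and skipping non-numeric cells is only one possible policy.

```python
def summarize_sales(values):
    """Return (total, average) over the numeric entries of a column."""
    # bool is a subclass of int, so exclude it explicitly.
    numbers = [v for v in values
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if not numbers:
        return 0, None  # nothing to average
    total = sum(numbers)
    return total, total / len(numbers)

# Hypothetical column contents: one blank cell and one text cell.
column = [1200, 950.5, None, "n/a", 1849.5]
total_sales, avg_sales = summarize_sales(column)
print(total_sales, avg_sales)
```

In the script, sales_data could be passed to such a helper in place of the direct sum() and len() calls.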
Advanced Python Automation Using Excel
1. Creating Multiple Worksheets Based on Categories
Scenario: Imagine you have a main worksheet with a list of customers,
their purchases, and the category of items they bought. You want to
create separate worksheets for each category and list the respective
customers there.
import openpyxl
# Load Workbook and active sheet
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active
# Create a dictionary to hold data by category
category_data = {}
# Assuming column 1: Customers, column 2: Purchase Amount, column 3: Category
for row in range(2, sheet.max_row + 1):
    category = sheet.cell(row=row, column=3).value
    if category not in category_data:
        category_data[category] = []
    category_data[category].append((sheet.cell(row=row, column=1).value,
                                    sheet.cell(row=row, column=2).value))
# Create separate worksheets for each category
for category, data in category_data.items():
    new_sheet = wb.create_sheet(title=category)
    for idx, (customer, purchase) in enumerate(data, 1):
        new_sheet.cell(row=idx, column=1, value=customer)
        new_sheet.cell(row=idx, column=2, value=purchase)
wb.save('sales_data_by_category.xlsx')
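The call wb.create_sheet(title=category) uses the category text as the sheet name, but Excel limits sheet names to 31 characters and forbids the characters \ / ? * [ ] and :. A minimal sketch of a cleaning step is shown below; the helper name safe_sheet_title and the fallback title are hypothetical choices.

```python
import re

INVALID_CHARS = re.compile(r'[\\/*?:\[\]]')
MAX_TITLE_LENGTH = 31  # Excel's limit for worksheet names

def safe_sheet_title(category, fallback="Uncategorized"):
    """Turn an arbitrary cell value into a usable worksheet title."""
    text = "" if category is None else str(category).strip()
    text = INVALID_CHARS.sub("_", text)
    return text[:MAX_TITLE_LENGTH] or fallback

print(safe_sheet_title("Home/Garden"))
print(safe_sheet_title(None))
```

Note that two different categories can map to the same cleaned title, so a real script may need to check for collisions.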
2. Conditional Formatting
Scenario: You want to highlight sales greater than a certain value,
e.g., $5000.
import openpyxl
from openpyxl.styles import PatternFill
# Load Workbook and sheet
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active
# Highlight sales greater than 5000
highlight_fill = PatternFill(start_color="FFFF00", end_color="FFFF00",
                             fill_type="solid")
for row in range(2, sheet.max_row + 1):
    cell = sheet.cell(row=row, column=2)
    # Skip blank or non-numeric cells, which cannot be compared with a number
    if isinstance(cell.value, (int, float)) and cell.value > 5000:
        cell.fill = highlight_fill
wb.save('highlighted_sales_data.xlsx')
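The loop above applies a fixed fill at the time the script runs, so the highlight does not change if someone later edits a value in Excel. As an alternative, openpyxl can attach a real conditional formatting rule that Excel evaluates itself. The sketch below assumes a small made-up table built in memory; the customer names and amounts are illustrative only.

```python
from openpyxl import Workbook
from openpyxl.formatting.rule import CellIsRule
from openpyxl.styles import PatternFill

wb = Workbook()
ws = wb.active
ws.append(["Customer", "Sales"])
for name, amount in [("Ann", 4200), ("Bob", 7300), ("Cy", 5100)]:
    ws.append([name, amount])

yellow = PatternFill(start_color="FFFF00", end_color="FFFF00",
                     fill_type="solid")
# Excel evaluates this rule itself, so the highlight follows later edits.
rule = CellIsRule(operator="greaterThan", formula=["5000"], fill=yellow)
ws.conditional_formatting.add(f"B2:B{ws.max_row}", rule)
```

Saving the workbook with wb.save() stores the rule in the file along with the data.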
3. Integrating Pandas for Data Analysis
Scenario: Compute and append month-over-month growth for a series of
monthly sales data.
import pandas as pd
# Read data into a DataFrame
df = pd.read_excel('monthly_sales.xlsx')
# Calculate month-over-month growth
df['MoM Growth'] = df['Sales'].pct_change()
# Save the DataFrame back to Excel
df.to_excel('sales_with_growth.xlsx', index=False)
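To see what pct_change() computes, here is a plain-Python sketch of the same arithmetic. The function name and sales figures are made up; pandas marks the first row as NaN where this sketch uses None, and the sketch assumes no month has zero sales.

```python
def pct_change(values):
    """Month-over-month growth: (current - previous) / previous."""
    if not values:
        return []
    growth = [None]  # the first month has no previous month
    for previous, current in zip(values, values[1:]):
        growth.append((current - previous) / previous)
    return growth

monthly_sales = [1000, 1100, 990]  # made-up figures
print(pct_change(monthly_sales))
```

The result is a fraction, so 0.1 means 10% growth; multiply by 100 if the sheet should show percentages.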
4. Pivot Tables and Data Summarization
Scenario: You have data on products sold, their categories, and the
sales figures. You want to summarize sales by category.
import pandas as pd
# Read data into a DataFrame
df = pd.read_excel('product_sales.xlsx')
# Create a pivot table
pivot = df.pivot_table(index='Category', values='Sales', aggfunc='sum')
# Save the pivot table to a new worksheet
with pd.ExcelWriter('product_sales_summary.xlsx') as writer:
    pivot.to_excel(writer, sheet_name="Summary")
    df.to_excel(writer, sheet_name="Detailed Data")
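For readers new to pivot tables, the summary produced by pivot_table(index='Category', values='Sales', aggfunc='sum') is a sum per category. A rough plain-Python equivalent, using made-up records, looks like this:

```python
from collections import defaultdict

rows = [  # made-up (product, category, sales) records
    ("Desk", "Furniture", 300),
    ("Pen", "Stationery", 5),
    ("Chair", "Furniture", 150),
    ("Notebook", "Stationery", 12),
]

totals = defaultdict(int)
for _product, category, sales in rows:
    totals[category] += sales

# pivot_table sorts its index by default, so sort the categories too.
summary = dict(sorted(totals.items()))
print(summary)
```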
5. Merging Multiple Excel Files
Scenario: You have multiple monthly sales Excel files and you want to
merge them into a yearly file.
import pandas as pd
import glob
# Gather all Excel files in the directory, sorted by name for a stable order
all_files = sorted(glob.glob('sales_*.xlsx'))
# Read and concatenate all files into a single DataFrame
all_data = pd.concat([pd.read_excel(file) for file in all_files],
                     ignore_index=True)
# Save the concatenated data to a new file
all_data.to_excel('yearly_sales_data.xlsx', index=False)
6. Automating Charts and Graphs
Scenario: You have monthly sales figures, and you want to generate a
line chart for visual representation.
import openpyxl
from openpyxl.chart import LineChart, Reference
wb = openpyxl.load_workbook('monthly_sales.xlsx')
sheet = wb.active
# Create a new line chart object
chart = LineChart()
chart.title = "Monthly Sales"
chart.style = 13 # Use a pre-defined style
chart.x_axis.title = 'Month'
chart.y_axis.title = 'Sales ($)'
chart.y_axis.majorGridlines = None
# Set data and categories for the chart
data = Reference(sheet, min_col=2, min_row=1, max_col=2,
                 max_row=sheet.max_row)
categories = Reference(sheet, min_col=1, min_row=2, max_row=sheet.max_row)
chart.add_data(data, titles_from_data=True)
chart.set_categories(categories)
# Add the chart to the sheet and position it
sheet.add_chart(chart, "D5")
wb.save("sales_chart.xlsx")
7. Handling Excel Filters
Scenario: You want to automatically apply filters to a range of data for
easier manual review.
import openpyxl
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active
# Apply filter to entire data range
sheet.auto_filter.ref = sheet.dimensions
wb.save('filtered_sales_data.xlsx')
8. Data Validation
Scenario: You're preparing a template for sales input and you want to
ensure that only valid data is entered (e.g., sales figures between 1
and 10,000).
import openpyxl
from openpyxl.worksheet.datavalidation import DataValidation
wb = openpyxl.Workbook()
sheet = wb.active
# Create a data validation rule
validation = DataValidation(type="whole", operator="between", formula1=1,
                            formula2=10000)
validation.errorTitle = "Invalid entry"
validation.error = "Sales figure should be between 1 and 10,000."
# Apply the validation to a range
validation.add('B2:B1000')
sheet.add_data_validation(validation)
wb.save('sales_template.xlsx')
9. Conditional Styling Based on Cell Values
Scenario: You want to change the background color of cells based on
their values (e.g., sales over 10,000 get a green background).
import openpyxl
from openpyxl.styles import PatternFill
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active
green_fill = PatternFill(start_color="00FF00", end_color="00FF00",
                         fill_type="solid")
for row in range(2, sheet.max_row + 1):
    cell = sheet.cell(row=row, column=2)
    # Skip blank or non-numeric cells, which cannot be compared with a number
    if isinstance(cell.value, (int, float)) and cell.value > 10000:
        cell.fill = green_fill
wb.save('color_coded_sales.xlsx')
10. Integrating External APIs
Scenario: You have a list of addresses, and you want to retrieve
latitude and longitude using a geocoding service and store the values
in the Excel file.
import openpyxl
import requests
wb = openpyxl.load_workbook('addresses.xlsx')
sheet = wb.active
API_ENDPOINT = "https://geocode.search.hereapi.com/v1/geocode"
API_KEY = "YOUR_API_KEY" # Replace with your actual API key
for row in range(2, sheet.max_row + 1):
    address = sheet.cell(row=row, column=1).value
    response = requests.get(API_ENDPOINT,
                            params={"q": address, "apiKey": API_KEY},
                            timeout=10).json()
    # Skip addresses for which the API returned no match
    items = response.get('items', [])
    if not items:
        continue
    lat = items[0]['position']['lat']
    lon = items[0]['position']['lng']
    sheet.cell(row=row, column=2, value=lat)
    sheet.cell(row=row, column=3, value=lon)
wb.save('addresses_with_lat_lon.xlsx')
Note: Ensure you handle possible exceptions and rate-limiting when
dealing with external APIs.
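One common way to follow the note above is to retry a failed request after a growing delay. The sketch below is a generic helper, not part of requests or any geocoding service; the names call_with_retries and flaky_lookup, the retry count, and the delays are all illustrative assumptions.

```python
import time

def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on an exception, wait and retry with a doubling delay."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Usage with a stand-in for the geocoding request that fails twice.
calls = {"count": 0}

def flaky_lookup():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return {"lat": 52.5, "lng": 13.4}

delays = []  # record the waits instead of actually sleeping
result = call_with_retries(flaky_lookup, sleep=delays.append)
print(result, delays)
```

In the geocoding loop, the requests.get(...) call could be wrapped in a small function and passed to such a helper. A fixed pause between rows is a simple additional way to stay under a rate limit.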
11. Time Series Forecasting
Scenario: Predicting future sales based on past data.
You can utilize libraries like statsmodels to automate the creation of
time series forecasts, and then save the forecasted results in Excel.
import openpyxl
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing
# Read sales data into a DataFrame
df = pd.read_excel('sales_data.xlsx', index_col='Date', parse_dates=True)
# Train a time series model and forecast the next 12 months
model = ExponentialSmoothing(df['Sales'], trend='add', seasonal='add',
                             seasonal_periods=12)
fit = model.fit()
forecast = fit.forecast(12)
# Add forecast to Excel
wb = openpyxl.load_workbook('sales_data.xlsx')
sheet = wb.active
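first_new_row = sheet.max_row + 1
# forecast is a pandas Series: its index holds the forecast periods
for offset, (period, value) in enumerate(forecast.items()):
    sheet.cell(row=first_new_row + offset, column=1, value=period)
    sheet.cell(row=first_new_row + offset, column=2, value=float(value))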
wb.save('sales_forecast.xlsx')
12. Automating Descriptive Statistics
Scenario: For each column of data in an Excel file, compute and save
descriptive statistics (mean, median, standard deviation).
import openpyxl
import pandas as pd
df = pd.read_excel('data.xlsx')
desc_stats = df.describe()
# Save to Excel
with pd.ExcelWriter('data_summary.xlsx') as writer:
    df.to_excel(writer, sheet_name='Original Data')
    desc_stats.to_excel(writer, sheet_name='Descriptive Statistics')
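As a cross-check on what describe() reports, the three statistics named in the scenario can be computed with Python's standard library. The sample values are made up. Note that describe() lists the median as the "50%" row and that its "std" row is the sample standard deviation.

```python
import statistics

values = [12, 15, 11, 19, 13]  # made-up column
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),  # the "50%" row in describe()
    "std": statistics.stdev(values),      # sample standard deviation
}
print(summary)
```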
13. Data Normalization and Standardization
Scenario: Normalize and standardize numerical columns for further
analysis.
import openpyxl
import pandas as pd
df = pd.read_excel('data.xlsx')
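# Keep only numeric columns; text columns cannot be scaled
numeric = df.select_dtypes(include='number')
# Normalize data (0-1 scaling)
df_normalized = (numeric - numeric.min()) / (numeric.max() - numeric.min())
# Standardize data (z-score scaling)
df_standardized = (numeric - numeric.mean()) / numeric.std()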
# Save both to Excel
with pd.ExcelWriter('processed_data.xlsx') as writer:
    df_normalized.to_excel(writer, sheet_name='Normalized Data')
    df_standardized.to_excel(writer, sheet_name='Standardized Data')
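The two scaling formulas are easy to check by hand on a single column. The sketch below uses made-up values and hypothetical function names. It assumes the column is not constant, since a constant column makes both denominators zero.

```python
import statistics

def min_max_scale(values):
    """Rescale so the smallest value becomes 0 and the largest 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

def z_score_scale(values):
    """Rescale to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample std, matching pandas' default
    return [(v - mean) / std for v in values]

values = [10, 20, 30, 40, 50]
print(min_max_scale(values))
print(z_score_scale(values))
```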
14. Principal Component Analysis (PCA) for Dimension Reduction
Scenario: Reduce the dimensions of a dataset for visualization or
further analysis.
Using sklearn, you can automate PCA and save the reduced data to Excel.
import openpyxl
import pandas as pd
from sklearn.decomposition import PCA
df = pd.read_excel('high_dim_data.xlsx')
pca = PCA(n_components=2) # Reduce to 2 dimensions for simplicity
principal_components = pca.fit_transform(df)
df_pca = pd.DataFrame(data=principal_components, columns=['PC1', 'PC2'])
df_pca.to_excel('reduced_data.xlsx', index=False)
15. Clustering for Data Segmentation
Scenario: Group data points into clusters based on similarities.
Use sklearn to automate K-means clustering and save cluster labels to
Excel.
import openpyxl
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_excel('data_for_clustering.xlsx')
kmeans = KMeans(n_clusters=3) # Assuming 3 clusters for this example
df['Cluster'] = kmeans.fit_predict(df)
df.to_excel('clustered_data.xlsx', index=False)
16. Automated Outlier Detection
Scenario: Detect outliers in a dataset based on the Z-score method.
import openpyxl
import pandas as pd
df = pd.read_excel('data.xlsx')
df['Z-Score'] = ((df['Column_Name'] - df['Column_Name'].mean())
                 / df['Column_Name'].std())
# Outliers are typically defined as values more than 3 standard deviations
# from the mean
df['Is_Outlier'] = df['Z-Score'].abs() > 3
df.to_excel('data_with_outliers.xlsx', index=False)
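One caveat with the 3-standard-deviation rule: with the sample standard deviation, the largest possible absolute z-score among n values is (n - 1) / sqrt(n), which is below 3 whenever n is 10 or fewer. On very small sheets the rule can therefore never flag anything. The sketch below illustrates this with made-up data and a hypothetical helper name.

```python
import statistics

def z_score_outliers(values, threshold=3.0):
    """Flag values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [abs(v - mean) / std > threshold for v in values]

ten = [10] * 9 + [1000]      # 10 values, one extreme
eleven = [10] * 10 + [1000]  # 11 values, one extreme
print(any(z_score_outliers(ten)))
print(any(z_score_outliers(eleven)))
```

For short columns, a lower threshold or a different method may be more suitable.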
17. Feature Engineering
Scenario: Generate polynomial features for regression analysis.
import openpyxl
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
df = pd.read_excel('data_for_regression.xlsx')
poly = PolynomialFeatures(degree=2)
polynomial_features = poly.fit_transform(df)
# get_feature_names() was removed in scikit-learn 1.2; use the _out variant
feature_names = poly.get_feature_names_out(df.columns)
df_poly = pd.DataFrame(polynomial_features, columns=feature_names)
df_poly.to_excel('polynomial_features.xlsx', index=False)
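To make the output columns less mysterious, here is a plain-Python sketch of what degree-2 polynomial features contain for one row: a constant 1, the original values, and every pairwise product including squares. The function name is hypothetical. For two inputs a and b, the order 1, a, b, a^2, ab, b^2 matches the one PolynomialFeatures uses.

```python
from itertools import combinations_with_replacement

def degree2_features(row):
    """Return [1, x1..xn, all products xi*xj with i <= j] for one row."""
    features = [1] + list(row)
    features += [a * b for a, b in combinations_with_replacement(row, 2)]
    return features

print(degree2_features([2, 3]))
```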
18. Data Imputation
Scenario: Fill missing values in a dataset.
import openpyxl
import pandas as pd
from sklearn.impute import SimpleImputer
df = pd.read_excel('data_with_missing_values.xlsx')
# Use mean imputation for simplicity
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
df_imputed.to_excel('data_without_missing_values.xlsx', index=False)
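Mean imputation replaces each missing entry with the average of the values that are present in the same column. A minimal plain-Python sketch for one column, with made-up data and a hypothetical function name, assuming at least one value is present:

```python
import math
import statistics

def impute_mean(values):
    """Replace None/NaN entries with the mean of the observed values."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    observed = [v for v in values if not is_missing(v)]
    mean = statistics.mean(observed)
    return [mean if is_missing(v) else v for v in values]

print(impute_mean([4.0, None, 8.0, float("nan")]))
```

Keep in mind that the mean strategy only applies to numeric columns.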
19. Text Data Preprocessing
Scenario: Clean and preprocess a column containing text data.
import openpyxl
import pandas as pd
import re
df = pd.read_excel('text_data.xlsx')
# A simple preprocessing function to clean text
def clean_text(text):
    text = text.lower() # Convert to lowercase
    # Remove non-alphabetic characters first, so that any gaps they leave
    # are collapsed by the whitespace step below
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text) # Replace runs of whitespace with one space
    return text.strip()
df['Cleaned_Text'] = df['Text_Column'].apply(clean_text)
df.to_excel('cleaned_text_data.xlsx', index=False)
20. Encoding Categorical Variables
Scenario: Convert categorical variables into numerical format.
import openpyxl
import pandas as pd
df = pd.read_excel('data_with_categories.xlsx')
# Convert categorical column to numerical using one-hot encoding
df_encoded = pd.get_dummies(df, columns=['Category_Column'], drop_first=True)
df_encoded.to_excel('encoded_data.xlsx', index=False)
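The effect of drop_first=True is easier to see on a tiny example: one level is left out because it is implied when all the other indicator columns are 0. The sketch below is a plain-Python illustration with made-up labels and a hypothetical function name. pandas additionally prefixes the new column names with the original column name.

```python
def one_hot(values, drop_first=True):
    """One-hot encode a list of labels into a list of dicts."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]  # the dropped level is implied by all zeros
    return [{level: int(v == level) for level in levels} for v in values]

encoded = one_hot(["red", "green", "blue", "green"])
print(encoded)
```

Here "blue" is the dropped level, so the row for "blue" contains only zeros.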
21. Automating Data Visualization
Scenario: Generate histograms for numerical columns.
import openpyxl
import pandas as pd
from openpyxl.drawing.image import Image
df = pd.read_excel('data.xlsx')
ax = df.hist(bins=50) # Requires matplotlib
# Save the plots as images and then insert them into Excel
fig = ax[0][0].get_figure()
fig.savefig('histograms.png')
wb = openpyxl.load_workbook('data.xlsx')
sheet = wb.active
img = Image('histograms.png') # Requires Pillow
sheet.add_image(img, 'D5') # Place the image at cell D5
wb.save('data_with_histograms.xlsx')
22. Correlation Analysis
Scenario: Calculate correlations between variables and save the matrix
to Excel.
import openpyxl
import pandas as pd
df = pd.read_excel('data.xlsx')
# Ignore non-numeric columns, which raise an error in recent pandas versions
correlation_matrix = df.corr(numeric_only=True)
correlation_matrix.to_excel('correlation_matrix.xlsx', index=True)
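By default, corr() computes the Pearson correlation coefficient for each pair of columns. The sketch below spells out that formula for two columns in plain Python. The column names and values are made up, and the function assumes neither column is constant.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

ads = [1, 2, 3, 4]
sales = [2, 4, 6, 8]    # rises with ads
returns = [8, 6, 4, 2]  # falls as ads rise
print(pearson(ads, sales), pearson(ads, returns))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate a weak one.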
23. Automating Data Splitting
Scenario: Split data into training and test sets for model validation.
import openpyxl
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_excel('data_for_modeling.xlsx')
train, test = train_test_split(df, test_size=0.2)
with pd.ExcelWriter('split_data.xlsx') as writer:
    train.to_excel(writer, sheet_name='Training Data', index=False)
    test.to_excel(writer, sheet_name='Test Data', index=False)
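Conceptually, a train/test split shuffles the rows and cuts the shuffled list in two. The sketch below shows the idea with the standard library only. The function name and seed are arbitrary, and it is not scikit-learn's implementation. Fixing the seed makes the split repeatable, which is what the random_state argument of train_test_split is for.

```python
import math
import random

def split_rows(rows, test_size=0.2, seed=42):
    """Shuffle rows reproducibly and return (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = math.ceil(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_rows(range(10))
print(len(train), len(test))
```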
Using Python with Excel for data science tasks provides a bridge
between traditional spreadsheet-driven analysis and more advanced,
automated analysis. For analysts familiar with Excel but new to
programming, this combination can serve as an excellent transition to
the world of data science and machine learning.