Python CSBS Bhavya Lab Manual
Program on data wrangling: Combining and merging data sets, reshaping and pivoting
1. Merging DataFrames
import pandas as pd
# Sample DataFrames (illustrative data)
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'ID': [2, 3, 4],
    'Score': [85, 90, 95]
})
# Merge on the common 'ID' column (inner join by default)
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
2. Joining DataFrames
# join() aligns on the index, so set 'ID' as the index first
df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
joined_df = df1.join(df2, how='inner')
print(joined_df)
3. Concatenating DataFrames
df3 = pd.DataFrame({
    'ID': [4, 5],
    'Name': ['David', 'Eva']
}).set_index('ID')
# Stack the rows of df1 and df3 on top of each other
concatenated_df = pd.concat([df1, df3])
print(concatenated_df)
4. Comparing DataFrames
df4 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
# equals() returns True only if shape, labels and values all match
comparison = df1.equals(df4.set_index('ID'))
print(comparison)
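Besides the boolean check with equals(), pandas (version 1.1 and later) also provides DataFrame.compare(), which reports only the cells that differ. A minimal sketch with two small made-up frames:

```python
import pandas as pd

a = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 90]})
b = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 95]})

# compare() returns only the differing cells, labelled 'self' and 'other'
diff = a.compare(b)
print(diff)
```

Here only Bob's score differs, so the result has a single row showing 90 (self) against 95 (other).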
5. Reshaping Data
Reshaping data typically involves changing the layout of a DataFrame. This can be done using the melt()
function, which transforms a wide format DataFrame into a long format.
import pandas as pd
# Sample DataFrame in wide format
data = {
    'Name': ['Alice', 'Bob'],
    'Math': [85, 90],
    'Science': [92, 88]
}
df = pd.DataFrame(data)
# Melt into long format: one row per (Name, Subject) pair
melted_df = pd.melt(df, id_vars=['Name'],
                    var_name='Subject', value_name='Score')
print(melted_df)
6. Pivoting Data
Using pivot()
# Sample DataFrame
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'],
    'Product': ['A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 250]
}
df = pd.DataFrame(data)
# pivot() requires unique index/column pairs
pivoted_df = df.pivot(index='Date', columns='Product', values='Sales')
print(pivoted_df)
Using pivot_table()
The pivot_table() function is more versatile than pivot(), as it allows for aggregation of data. This is
particularly useful when you have duplicate entries for the index/column pairs. You can specify an
aggregation function to summarize the data.
Example:
Let’s modify our previous example to include multiple sales entries for the same product on the same
date:
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-01', '2023-01-02'],
    'Product': ['A', 'A', 'B', 'A'],
    'Sales': [100, 120, 150, 200]
}
df = pd.DataFrame(data)
# Using pivot_table: duplicate index/column pairs are aggregated with aggfunc
pivot_table_df = pd.pivot_table(df, index='Date', columns='Product',
                                values='Sales', aggfunc='sum')
print(pivot_table_df)
String Manipulation
Python provides a rich set of built-in methods for string manipulation. Here are some common
operations:
1. Changing Case: You can easily change the case of strings using methods like upper(), lower(), and
title().
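A quick sketch:

```python
text = "hello world"
print(text.upper())   # HELLO WORLD
print(text.lower())   # hello world
print(text.title())   # Hello World
```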
2. Trimming Whitespace: The strip(), lstrip(), and rstrip() methods are useful for removing
unwanted whitespace.
text = " Hello World "
print(text.strip())    # 'Hello World'
print(text.lstrip())   # 'Hello World '
print(text.rstrip())   # ' Hello World'
3. Replacing Substrings: The replace() method allows you to replace occurrences of a substring with
another substring.
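For example (note that replace() is case-sensitive):

```python
text = "I like cats. cats purr."
print(text.replace("cats", "dogs"))  # I like dogs. dogs purr.
```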
4. Splitting and Joining Strings: You can split a string into a list of substrings using split(), and join a
list of strings into a single string using join().
text = "apple,banana,cherry"
fruits = text.split(",")       # ['apple', 'banana', 'cherry']
joined = " and ".join(fruits)  # 'apple and banana and cherry'
print(fruits)
print(joined)
Regular Expressions
Regular expressions are a powerful tool for searching and manipulating strings based on patterns. The re
module in Python provides functions to work with regex.
1. Searching for Patterns: The search() function checks if a pattern exists in a string.
import re
text = "Contact us at support@example.com"
match = re.search(r'[\w\.-]+@[\w\.-]+', text)
if match:
    print("Found:", match.group())
2. Finding All Matches: The findall() function returns all occurrences of a pattern in a string.
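For example, extracting every run of digits from a string:

```python
import re

text = "Order 66 shipped 12 items in 3 boxes"
numbers = re.findall(r'\d+', text)
print(numbers)  # ['66', '12', '3']
```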
3. Replacing Patterns: The sub() function allows you to replace occurrences of a pattern with a
specified string.
text = "The rain in Spain"
new_text = re.sub(r'ai', 'oo', text)
print(new_text)  # The roon in Spoon
4. Validating Input: Regular expressions can also be used to validate input formats, such as
checking if a string is a valid email address.
def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("not-an-email"))      # False
3. Program on Time series: GroupBy Mechanics to display in data vector, multivariate time series
and forecasting formats
Time series analysis is a powerful technique used in various fields such as finance, economics, and
environmental science. In Python, the pandas library provides robust tools for handling time series data,
including the groupby functionality, which allows for efficient data aggregation and transformation.
The groupby method in pandas is essential for aggregating data based on specific time intervals. For
instance, you can group data by day, month, or year, which is particularly useful for summarizing trends
over time.
Here’s a simple example to illustrate how to use groupby with a time series dataset:
import pandas as pd
import numpy as np

# Sample time series: hourly readings over 30 days
dates = pd.date_range('2023-01-01', periods=720, freq='h')
df = pd.DataFrame({'value': np.random.randn(720)}, index=dates)
print(df)
# Resample by calendar day and take the mean of each group
daily_mean = df.resample('D').mean()
print("\nDaily Mean:")
print(daily_mean)
In this example, we create a DataFrame with random data indexed by dates. We then use the resample
method to group the data by day and calculate the mean for each day.
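The same mechanics scale to coarser intervals. As a minimal sketch with made-up daily data, grouping by calendar month via groupby on the month period:

```python
import pandas as pd
import numpy as np

# Three months of made-up daily readings
dates = pd.date_range('2023-01-01', periods=90, freq='D')
df = pd.DataFrame({'value': np.arange(90.0)}, index=dates)

# Group each row by its month and average within each group
monthly_mean = df.groupby(df.index.to_period('M')).mean()
print(monthly_mean)
```

This yields one row per month (January through March here), each holding the mean of that month's daily values.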
When dealing with multivariate time series, you may have multiple variables that you want to analyze
simultaneously. The same groupby mechanics can be applied, but you can also visualize the relationships
between these variables.
data = {
    'temperature': np.random.uniform(15, 30, 720),
    'humidity': np.random.uniform(40, 90, 720)
}
df_multivariate = pd.DataFrame(data, index=dates)
print(df_multivariate)
daily_mean_multivariate = df_multivariate.resample('D').mean()
print(daily_mean_multivariate)
In this code snippet, we create a multivariate DataFrame with temperature and humidity data. We then
apply the same resample method to calculate daily means for both variables.
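To examine how the variables move together, a correlation matrix is a simple first step. A sketch with made-up readings in which humidity falls as temperature rises:

```python
import pandas as pd
import numpy as np

# Made-up multivariate readings with a built-in negative relationship
rng = np.random.default_rng(42)
temp = rng.uniform(15, 30, 100)
humidity = 100 - temp + rng.normal(0, 1, 100)
df_mv = pd.DataFrame({'temperature': temp, 'humidity': humidity})

# Pairwise Pearson correlations between the columns
print(df_mv.corr())
```

A strong negative coefficient between temperature and humidity confirms the inverse relationship built into the data.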
Forecasting Formats
Forecasting in time series can be approached using various models, such as ARIMA, Exponential
Smoothing, or machine learning techniques. The statsmodels library provides tools for implementing
these models.
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
# Fit an ARIMA(1, 1, 1) model to the daily mean temperature
model = ARIMA(daily_mean_multivariate['temperature'], order=(1, 1, 1))
model_fit = model.fit()
forecast = model_fit.forecast(steps=5)

plt.figure(figsize=(10, 5))
plt.plot(daily_mean_multivariate['temperature'], label='Observed')
plt.plot(forecast, label='Forecast')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()
In this example, we fit an ARIMA model to the time series data and forecast the next five days. The
results are then visualized using matplotlib.
4. Program to measure central tendency and measures of dispersion: Mean, median, mode,
standard deviation, variance, mean deviation and quartile deviation for a frequency
distribution/data.
In statistics, understanding the central tendency and measures of dispersion is crucial for analyzing data.
Central tendency provides a summary measure that represents the entire dataset, while measures of
dispersion indicate the spread or variability of the data. Below, we will explore how to compute these
statistics using Python.
Required Libraries
To perform these calculations, we will utilize the numpy and scipy libraries. If you haven't installed these
libraries yet, you can do so using pip:

pip install numpy scipy
Let's assume we have a frequency distribution represented as a list of tuples, where each tuple contains
a value and its corresponding frequency. For example:
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
Mean: The mean is calculated as the sum of all values multiplied by their frequencies divided by the
total frequency.
Median: The median is the middle value when the data is sorted. If the number of observations is even,
it is the average of the two middle values.
Mode: The mode is the value that appears most frequently in the dataset.
Variance: Variance measures how far a set of numbers is spread out from their average value.
Standard Deviation: The standard deviation is the square root of the variance, providing a measure of
the average distance from the mean.
Mean Deviation: This is the average of the absolute deviations from the mean.
Quartile Deviation: This is half the difference between the third quartile (Q3) and the first quartile (Q1), i.e. (Q3 - Q1) / 2.
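As a quick sanity check of the mean formula on the sample distribution above:

```python
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

total_fx = sum(value * freq for value, freq in data)  # 5 + 20 + 45 + 80 + 50 = 200
total_f = sum(freq for _, freq in data)               # 5 + 10 + 15 + 20 + 10 = 60
mean = total_fx / total_f
print(round(mean, 3))  # 3.333
```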
Implementation
import numpy as np
from scipy import stats

data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

# Expand the frequency distribution into individual observations
expanded_data = []
for value, frequency in data:
    expanded_data.extend([value] * frequency)
expanded_data = np.array(expanded_data)

# Central Tendency
mean = np.mean(expanded_data)
median = np.median(expanded_data)
mode = stats.mode(expanded_data, keepdims=True)[0][0]

# Measures of Dispersion
variance = np.var(expanded_data)
std_deviation = np.std(expanded_data)
mean_deviation = np.mean(np.abs(expanded_data - mean))

# Quartiles
Q1 = np.percentile(expanded_data, 25)
Q3 = np.percentile(expanded_data, 75)
quartile_deviation = (Q3 - Q1) / 2

print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
print(f"Mean Deviation: {mean_deviation}")
print(f"Quartile Deviation: {quartile_deviation}")
Conclusion
This program effectively calculates the central tendency and measures of dispersion for a frequency
distribution. By utilizing Python's powerful libraries, we can easily perform statistical analysis, making it a
valuable tool for data scientists and analysts. Understanding these measures allows for better insights
into the data, guiding informed decision-making.
5. Program to perform cross validation for a given dataset to measure Root Mean Squared Error
(RMSE), Mean Absolute Error (MAE) and R2 Error using validation set, Leave-one-out cross-
validation (LOOCV) and k-fold cross-validation approaches.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_predict, train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def report(name, y_true, y_pred):
    print(f"{name} Metrics:")
    print("  RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
    print("  MAE:", mean_absolute_error(y_true, y_pred))
    print("  R2:", r2_score(y_true, y_pred))

# Sample regression dataset
rng = np.random.default_rng(0)
X = rng.random((100, 1))
y = 3 * X.ravel() + rng.normal(0, 0.1, 100)
# Initialize model
model = LinearRegression()

# Validation set approach (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
report("Validation Set", y_test, model.predict(X_test))

# K-Fold and LOOCV: score the pooled out-of-fold predictions
report("K-Fold", y, cross_val_predict(model, X, y, cv=KFold(n_splits=5)))
report("LOOCV", y, cross_val_predict(model, X, y, cv=LeaveOneOut()))