1. Program on data wrangling: Combining and merging data sets, reshaping and pivoting
pandas provides various methods for combining and comparing Series or DataFrame objects.
1. Merging DataFrames
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 22]
})
# Merging DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
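With how='inner', only IDs present in both frames (1 and 2) survive. For illustration, here is a sketch of the same merge with how='outer', which keeps every ID and fills missing cells with NaN:
# Outer merge keeps all IDs; unmatched values become NaN
outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_df)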
2. Joining DataFrames
# Setting 'ID' as the index for df1
df1.set_index('ID', inplace=True)
# Joining df1 and df2
joined_df = df1.join(df2.set_index('ID'), how='inner')
print(joined_df)
3. Concatenating DataFrames
# Creating two DataFrames
df3 = pd.DataFrame({
    'ID': [5, 6],
    'Name': ['David', 'Eva']
})
# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1.reset_index(), df3], ignore_index=True)
print(concatenated_df)
4. Comparing DataFrames
# Creating another DataFrame for comparison
df4 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
# Comparing DataFrames
comparison = df1.equals(df4.set_index('ID'))
print(f"Are the DataFrames equal? {comparison}")
5. Reshaping Data
Reshaping data typically involves changing the layout of a DataFrame. This can be done using the melt()
function, which transforms a wide format DataFrame into a long format.
Example of Reshaping with melt()
import pandas as pd
# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, 95],
    'Science': [80, 85, 90]
}
df = pd.DataFrame(data)
# Reshaping the DataFrame
melted_df = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
                    var_name='Subject', value_name='Score')
print(melted_df)
6. Pivoting Data
Using pivot()
The pivot() function reshapes long-format data into wide format; each index/column pair must be unique.
# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [85, 90, 80, 85]
}
df = pd.DataFrame(data)
# Pivoting the DataFrame
pivoted_df = df.pivot(index='Name', columns='Subject', values='Score')
print(pivoted_df)
Using pivot_table()
The pivot_table() function is more versatile than pivot(), as it allows for aggregation of data. This is
particularly useful when you have duplicate entries for the index/column pairs. You can specify an
aggregation function to summarize the data.
Example:
Consider sales data that includes multiple entries for the same product on the same date:
# Sample data with duplicates
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 250, 300]
}
df = pd.DataFrame(data)
# Using pivot_table
pivot_table_df = df.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_table_df)
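Reshaping can also be done with stack() and unstack(), which move data between the row index and the columns. A minimal sketch using pivoted_df from the pivot() example above:
# stack(): wide -> long, columns become the inner level of a MultiIndex
stacked = pivoted_df.stack()
print(stacked)
# unstack(): long -> wide, the inner index level becomes columns again
print(stacked.unstack())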
2. Program on Data transformations: String manipulation and Regular expressions
String Manipulation
Python provides a rich set of built-in methods for string manipulation. Here are some common
operations:
1. Changing Case: You can easily change the case of strings using methods like upper(), lower(), and
title().
text = "hello world"
print(text.upper()) # Output: HELLO WORLD
print(text.title()) # Output: Hello World
2. Trimming Whitespace: The strip(), lstrip(), and rstrip() methods are useful for removing
unwanted whitespace.
text = " Hello World "
print(text.strip()) # Output: Hello World
3. Replacing Substrings: The replace() method allows you to replace occurrences of a substring with
another substring.
text = "I love Python"
new_text = text.replace("Python", "programming")
print(new_text) # Output: I love programming
4. Splitting and Joining Strings: You can split a string into a list of substrings using split(), and join a
list of strings into a single string using join().
text = "apple,banana,cherry"
fruits = text.split(",")
print(fruits) # Output: ['apple', 'banana', 'cherry']
new_text = " and ".join(fruits)
print(new_text) # Output: apple and banana and cherry
Regular Expressions
Regular expressions are a powerful tool for searching and manipulating strings based on patterns. The re
module in Python provides functions to work with regex.
1. Searching for Patterns: The search() function checks if a pattern exists in a string.
import re
text = "Please contact support@example.com for help"
match = re.search(r'\S+@\S+', text)
if match:
    print("Found email:", match.group())  # Output: support@example.com
2. Finding All Matches: The findall() function returns all occurrences of a pattern in a string.
text = "Write to alice@example.com or bob@example.com"
emails = re.findall(r'\S+@\S+', text)
print("Found emails:", emails)  # Output: ['alice@example.com', 'bob@example.com']
3. Replacing Patterns: The sub() function allows you to replace occurrences of a pattern with a
specified string.
text = "My phone number is 123-456-7890"
new_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)
print(new_text)
4. Validating Input: Regular expressions can also be used to validate input formats, such as
checking whether a string is a valid email address.
def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None
print(is_valid_email("user@example.com"))  # Output: True
print(is_valid_email("invalid-email"))     # Output: False
3. Program on Time series: GroupBy Mechanics to display in data vector, multivariate time series
and forecasting formats
Time series analysis is a powerful technique used in various fields such as finance, economics, and
environmental science. In Python, the pandas library provides robust tools for handling time series data,
including the groupby functionality, which allows for efficient data aggregation and transformation.
GroupBy Mechanics in Time Series
The groupby method in pandas is essential for aggregating data based on specific time intervals. For
instance, you can group data by day, month, or year, which is particularly useful for summarizing trends
over time.
Here’s a simple example to illustrate how to use groupby with a time series dataset:
import pandas as pd
import numpy as np
# Create a sample time series DataFrame
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = np.random.randint(0, 100, size=(len(date_rng), 2))
df = pd.DataFrame(data, columns=['A', 'B'], index=date_rng)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Group by day and calculate the mean
daily_mean = df.resample('D').mean()
print("\nDaily Mean:")
print(daily_mean)
In this example, we create a DataFrame of random values indexed by dates. We then use the resample
method, pandas' time-based groupby, to group the data by day and calculate the mean for each day.
(Since the sample data is already daily, the values are unchanged here; with hourly or finer data,
resample('D') would aggregate within each day.)
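The same aggregation can be written with groupby directly using pd.Grouper, which groups a DatetimeIndex by a calendar frequency (daily here; monthly or yearly grouping works the same way); a minimal sketch:
# Equivalent to df.resample('D'): group the DatetimeIndex by calendar day
daily_mean_gb = df.groupby(pd.Grouper(freq='D')).mean()
print(daily_mean_gb)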
Multivariate Time Series
When dealing with multivariate time series, you may have multiple variables that you want to analyze
simultaneously. The same groupby mechanics can be applied, but you can also visualize the relationships
between these variables.
Here’s how you can handle multivariate time series data:
# Create a multivariate time series DataFrame
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = {
    'Temperature': np.random.randint(20, 30, size=len(date_rng)),
    'Humidity': np.random.randint(30, 70, size=len(date_rng))
}
df_multivariate = pd.DataFrame(data, index=date_rng)
# Display the multivariate DataFrame
print("\nMultivariate DataFrame:")
print(df_multivariate)
# Group by day and calculate the mean for each variable
daily_mean_multivariate = df_multivariate.resample('D').mean()
print("\nDaily Mean for Multivariate DataFrame:")
print(daily_mean_multivariate)
In this code snippet, we create a multivariate DataFrame with temperature and humidity data. We then
apply the same resample method to calculate daily means for both variables.
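As a quick sketch of the visualization mentioned above (it assumes matplotlib is installed), both variables can be drawn on a shared time axis:
import matplotlib.pyplot as plt

# Plot temperature and humidity against the date index
df_multivariate.plot(title='Temperature and Humidity over Time')
plt.xlabel('Date')
plt.show()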
Forecasting Formats
Forecasting in time series can be approached using various models, such as ARIMA, Exponential
Smoothing, or machine learning techniques. The statsmodels library provides tools for implementing
these models.
Here’s a basic example of how to implement an ARIMA model for forecasting:
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
# Fit an ARIMA model
model = ARIMA(df['A'], order=(1, 1, 1))
model_fit = model.fit()
# Forecast the next 5 days
forecast = model_fit.forecast(steps=5)
print("\nForecast for the next 5 days:")
print(forecast)
# Plot the results
plt.figure(figsize=(10, 5))
plt.plot(df['A'], label='Historical Data')
plt.plot(pd.date_range(start=df.index[-1] + pd.Timedelta(days=1), periods=5),
         forecast, label='Forecast', color='red')
plt.title('Time Series Forecasting')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()
In this example, we fit an ARIMA model to the time series data and forecast the next five days. The
results are then visualized using matplotlib.
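Exponential Smoothing, also named above, is available from the same library. A minimal sketch fitting statsmodels' Holt-Winters ExponentialSmoothing to the same series (the additive trend choice is an assumption, not part of the original program):
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit Holt's linear trend model and forecast the next 5 days
es_model = ExponentialSmoothing(df['A'], trend='add').fit()
print("\nExponential Smoothing forecast:")
print(es_model.forecast(steps=5))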
4. Program to measure central tendency and measures of dispersion: Mean, median, mode,
standard deviation, variance, mean deviation and quartile deviation for a frequency
distribution/data.
In statistics, understanding the central tendency and measures of dispersion is crucial for analyzing data.
Central tendency provides a summary measure that represents the entire dataset, while measures of
dispersion indicate the spread or variability of the data. Below, we will explore how to compute these
statistics using Python.
Required Libraries
To perform these calculations, we will utilize the numpy and scipy libraries. If you haven't installed these
libraries yet, you can do so using pip:
pip install numpy scipy
Sample Data
Let's assume we have a frequency distribution represented as a list of tuples, where each tuple contains
a value and its corresponding frequency. For example:
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
Calculating Measures of Central Tendency
Mean: The mean is the sum of each value multiplied by its frequency, divided by the total frequency.
Median: The median is the middle value when the data is sorted. If the number of observations is even,
it is the average of the two middle values.
Mode: The mode is the value that appears most frequently in the dataset.
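For the sample data above, for instance, the mean works out to (1×5 + 2×10 + 3×15 + 4×20 + 5×10) / (5 + 10 + 15 + 20 + 10) = 200 / 60 ≈ 3.33.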
Calculating Measures of Dispersion
Variance: Variance measures how far a set of numbers is spread out from their average value.
Standard Deviation: The standard deviation is the square root of the variance, providing a measure of
the average distance from the mean.
Mean Deviation: This is the average of the absolute deviations from the mean.
Quartile Deviation: This is half the difference between the third quartile (Q3) and the first quartile (Q1), i.e. (Q3 - Q1) / 2.
Implementation
import numpy as np
from scipy import stats
# Sample frequency distribution
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
# Expanding the data based on frequency
expanded_data = []
for value, frequency in data:
    expanded_data.extend([value] * frequency)
# Convert to numpy array for calculations
expanded_data = np.array(expanded_data)
# Central Tendency
mean = np.mean(expanded_data)
median = np.median(expanded_data)
mode = stats.mode(expanded_data, keepdims=False).mode  # scalar mode (SciPy >= 1.9)
# Measures of Dispersion (np.var/np.std use the population formula, ddof=0)
variance = np.var(expanded_data)
std_deviation = np.std(expanded_data)
mean_deviation = np.mean(np.abs(expanded_data - mean))
# Quartiles
Q1 = np.percentile(expanded_data, 25)
Q3 = np.percentile(expanded_data, 75)
quartile_deviation = (Q3 - Q1) / 2
# Displaying the results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
print(f"Mean Deviation: {mean_deviation}")
print(f"Quartile Deviation: {quartile_deviation}")
Conclusion
This program effectively calculates the central tendency and measures of dispersion for a frequency
distribution. By utilizing Python's powerful libraries, we can easily perform statistical analysis, making it a
valuable tool for data scientists and analysts. Understanding these measures allows for better insights
into the data, guiding informed decision-making.
5. Program to perform cross validation for a given dataset to measure Root Mean Squared Error
(RMSE), Mean Absolute Error (MAE) and R2 Error using validation set, Leave one out cross-
validation(LOOCV) and k-fold cross-validation approaches.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate synthetic regression data (random_state makes runs reproducible)
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Initialize model
model = LinearRegression()
# K-Fold Cross Validation: average the metrics over all folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_scores, mae_scores, r2_scores = [], [], []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, predictions)))
    mae_scores.append(mean_absolute_error(y_test, predictions))
    r2_scores.append(r2_score(y_test, predictions))
print("K-Fold Metrics (averaged over folds):")
print("RMSE:", np.mean(rmse_scores))
print("MAE:", np.mean(mae_scores))
print("R-squared:", np.mean(r2_scores))
# Leave-One-Out Cross Validation: each test set holds a single sample,
# so collect all held-out predictions and score them together
# (R-squared is not defined for a single observation)
loo = LeaveOneOut()
y_true, y_pred = [], []
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    y_true.append(y_test[0])
    y_pred.append(model.predict(X_test)[0])
print("LOOCV Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R-squared:", r2_score(y_true, y_pred))