Python CSBS Bhavya Lab Manual

The document provides a comprehensive guide on data wrangling, transformations, time series analysis, central tendency, and cross-validation using Python's pandas and other libraries. It includes practical examples for merging, joining, concatenating DataFrames, reshaping data, string manipulation, and implementing statistical measures. Additionally, it covers time series forecasting with ARIMA and evaluates model performance through cross-validation techniques.


1. Program on data wrangling: Combining and merging data sets, reshaping and pivoting

pandas provides various methods for combining and comparing Series or DataFrame objects.

1. Merging DataFrames

import pandas as pd

# Creating two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 22]
})

# Merging DataFrames on 'ID' (an inner join keeps only IDs present in both)
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)

2. Joining DataFrames

# Setting 'ID' as the index for df1
df1.set_index('ID', inplace=True)

# Joining df1 and df2 on their shared 'ID' index
joined_df = df1.join(df2.set_index('ID'), how='inner')
print(joined_df)

3. Concatenating DataFrames

# Creating a DataFrame with new rows
df3 = pd.DataFrame({
    'ID': [5, 6],
    'Name': ['David', 'Eva']
})

# Concatenating DataFrames vertically (reset_index restores 'ID' as a column of df1)
concatenated_df = pd.concat([df1.reset_index(), df3], ignore_index=True)
print(concatenated_df)

4. Comparing DataFrames

# Creating another DataFrame for comparison
df4 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})

# Comparing DataFrames (df1 is already indexed by 'ID')
comparison = df1.equals(df4.set_index('ID'))
print(f"Are the DataFrames equal? {comparison}")

5. Reshaping Data

Reshaping data typically involves changing the layout of a DataFrame. This can be done using the melt()
function, which transforms a wide format DataFrame into a long format.

Example of Reshaping with melt()

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, 95],
    'Science': [80, 85, 90]
}
df = pd.DataFrame(data)

# Reshaping the DataFrame from wide to long format
melted_df = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
                    var_name='Subject', value_name='Score')
print(melted_df)

6. Pivoting Data

Using pivot()

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [85, 90, 80, 85]
}
df = pd.DataFrame(data)

# Pivoting the DataFrame: one row per Name, one column per Subject
pivoted_df = df.pivot(index='Name', columns='Subject', values='Score')
print(pivoted_df)

Using pivot_table()

The pivot_table() function is more versatile than pivot(), as it allows for aggregation of data. This is
particularly useful when you have duplicate entries for the index/column pairs. You can specify an
aggregation function to summarize the data.
Example:

Consider a dataset with multiple sales entries for the same product on the same date:

# Sample data with duplicate Date/Product pairs
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 250, 300]
}
df = pd.DataFrame(data)

# Using pivot_table to sum the duplicate entries
pivot_table_df = df.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_table_df)

2. Program on Data transformations: String manipulation and Regular expressions

String Manipulation

Python provides a rich set of built-in methods for string manipulation. Here are some common
operations:

1. Changing Case: You can easily change the case of strings using methods like upper(), lower(), and
title().

text = "hello world"

print(text.upper()) # Output: HELLO WORLD

print(text.title()) # Output: Hello World

2. Trimming Whitespace: The strip(), lstrip(), and rstrip() methods are useful for removing
unwanted whitespace.
text = " Hello World "

print(text.strip()) # Output: Hello World

3. Replacing Substrings: The replace() method allows you to replace occurrences of a substring with
another substring.

text = "I love Python"

new_text = text.replace("Python", "programming")

print(new_text) # Output: I love programming

4. Splitting and Joining Strings: You can split a string into a list of substrings using split(), and join a
list of strings into a single string using join().

text = "apple,banana,cherry"

fruits = text.split(",")

print(fruits) # Output: ['apple', 'banana', 'cherry']

new_text = " and ".join(fruits)

print(new_text) # Output: apple and banana and cherry

Regular Expressions

Regular expressions are a powerful tool for searching and manipulating strings based on patterns. The re
module in Python provides functions to work with regex.

1. Searching for Patterns: The search() function checks if a pattern exists in a string.
import re

# The address below is a placeholder (the original was obscured)
text = "My email is user@example.com"
match = re.search(r'\S+@\S+', text)
if match:
    print("Found email:", match.group())

2. Finding All Matches: The findall() function returns all occurrences of a pattern in a string.

text = "Contact us at [email protected] or [email protected]"

emails = re.findall(r'\S+@\S+', text)

print("Found emails:", emails)

3. Replacing Patterns: The sub() function allows you to replace occurrences of a pattern with a
specified string.

text = "My phone number is 123-456-7890"

new_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)

print(new_text)

4. Validating Input: Regular expressions can also be used to validate input formats, such as
checking whether a string is a valid email address.

def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None

# The address below is a placeholder (the original was obscured)
print(is_valid_email("user@example.com"))  # Output: True
print(is_valid_email("invalid-email"))     # Output: False

3. Program on Time series: GroupBy Mechanics to display in data vector, multivariate time series
and forecasting formats

Time series analysis is a powerful technique used in various fields such as finance, economics, and
environmental science. In Python, the pandas library provides robust tools for handling time series data,
including the groupby functionality, which allows for efficient data aggregation and transformation.

GroupBy Mechanics in Time Series

The groupby method in pandas is essential for aggregating data based on specific time intervals. For
instance, you can group data by day, month, or year, which is particularly useful for summarizing trends
over time.

Here’s a simple example using resample, pandas’ groupby-style operation for grouping observations by time interval:

import pandas as pd
import numpy as np

# Create a sample time series DataFrame with a daily DatetimeIndex
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = np.random.randint(0, 100, size=(len(date_rng), 2))
df = pd.DataFrame(data, columns=['A', 'B'], index=date_rng)

# Display the DataFrame
print("Original DataFrame:")
print(df)

# Group by day and calculate the mean
daily_mean = df.resample('D').mean()
print("\nDaily Mean:")
print(daily_mean)

In this example, we create a DataFrame with random data indexed by dates. We then use the resample
method to group the data by day and calculate the mean for each day.
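
Since the sample data is already daily, resampling by day returns it essentially unchanged; grouping by a coarser interval shows the aggregation at work. A minimal sketch, reusing the df defined above:

# Group the daily observations by month and average them
# ('M' means month-end; newer pandas (>= 2.2) prefers the alias 'ME')
monthly_mean = df.resample('M').mean()
print(monthly_mean)

# An equivalent groupby formulation using the index's period
monthly_mean_gb = df.groupby(df.index.to_period('M')).mean()
print(monthly_mean_gb)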

Multivariate Time Series

When dealing with multivariate time series, you may have multiple variables that you want to analyze
simultaneously. The same groupby mechanics can be applied, but you can also visualize the relationships
between these variables.

Here’s how you can handle multivariate time series data:

# Create a multivariate time series DataFrame
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = {
    'Temperature': np.random.randint(20, 30, size=len(date_rng)),
    'Humidity': np.random.randint(30, 70, size=len(date_rng))
}
df_multivariate = pd.DataFrame(data, index=date_rng)

# Display the multivariate DataFrame
print("\nMultivariate DataFrame:")
print(df_multivariate)

# Group by day and calculate the mean for each variable
daily_mean_multivariate = df_multivariate.resample('D').mean()
print("\nDaily Mean for Multivariate DataFrame:")
print(daily_mean_multivariate)

In this code snippet, we create a multivariate DataFrame with temperature and humidity data. We then
apply the same resample method to calculate daily means for both variables.

Forecasting Formats

Forecasting in time series can be approached using various models, such as ARIMA, Exponential
Smoothing, or machine learning techniques. The statsmodels library provides tools for implementing
these models.

Here’s a basic example of how to implement an ARIMA model for forecasting:

from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt

# Fit an ARIMA(1, 1, 1) model to column 'A' of the daily DataFrame
model = ARIMA(df['A'], order=(1, 1, 1))
model_fit = model.fit()

# Forecast the next 5 days
forecast = model_fit.forecast(steps=5)
print("\nForecast for the next 5 days:")
print(forecast)

# Plot the historical series and the forecast
plt.figure(figsize=(10, 5))
plt.plot(df['A'], label='Historical Data')
plt.plot(pd.date_range(start=df.index[-1] + pd.Timedelta(days=1), periods=5),
         forecast, label='Forecast', color='red')
plt.title('Time Series Forecasting')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()

In this example, we fit an ARIMA model to the time series data and forecast the next five days. The
results are then visualized using matplotlib.
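
Exponential Smoothing, mentioned above, is also available in statsmodels. A minimal sketch fitting an additive-trend Holt model to the same series; the trend choice is illustrative rather than tuned, and on a series this short the fit is purely for demonstration:

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit an additive-trend exponential smoothing model to column 'A'
es_model = ExponentialSmoothing(df['A'], trend='add').fit()

# Forecast the next 5 days
es_forecast = es_model.forecast(steps=5)
print(es_forecast)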

4. Program to measure central tendency and measures of dispersion: Mean, median, mode,
standard deviation, variance, mean deviation and quartile deviation for a frequency
distribution/data.

In statistics, understanding the central tendency and measures of dispersion is crucial for analyzing data.
Central tendency provides a summary measure that represents the entire dataset, while measures of
dispersion indicate the spread or variability of the data. Below, we will explore how to compute these
statistics using Python.

Required Libraries

To perform these calculations, we will utilize the numpy and scipy libraries. If you haven't installed these
libraries yet, you can do so using pip:

pip install numpy scipy


Sample Data

Let's assume we have a frequency distribution represented as a list of tuples, where each tuple contains
a value and its corresponding frequency. For example:

data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

Mean: The mean is calculated as the sum of all values multiplied by their frequencies divided by the
total frequency.

Median: The median is the middle value when the data is sorted. If the number of observations is even,
it is the average of the two middle values.

Mode: The mode is the value that appears most frequently in the dataset.

Calculating Measures of Dispersion

Variance: Variance measures how far a set of numbers is spread out from their average value.

Standard Deviation: The standard deviation is the square root of the variance, providing a measure of
the average distance from the mean.

Mean Deviation: This is the average of the absolute deviations from the mean.

Quartile Deviation: This is half the difference between the third quartile (Q3) and the first quartile (Q1), i.e. (Q3 - Q1) / 2.
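
The mean formula above can also be applied directly to the (value, frequency) pairs without expanding the data. A minimal sketch, using the same frequency distribution:

# Frequency distribution: (value, frequency) pairs
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

# Weighted mean: sum(value * frequency) / total frequency
total_freq = sum(freq for _, freq in data)
weighted_mean = sum(value * freq for value, freq in data) / total_freq
print(weighted_mean)  # matches np.mean of the expanded data below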

Implementation

import numpy as np
from scipy import stats

# Sample frequency distribution
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]

# Expanding the data based on frequency
expanded_data = []
for value, frequency in data:
    expanded_data.extend([value] * frequency)

# Convert to numpy array for calculations
expanded_data = np.array(expanded_data)

# Central Tendency
mean = np.mean(expanded_data)
median = np.median(expanded_data)
# keepdims=False returns a scalar mode (SciPy >= 1.9)
mode = stats.mode(expanded_data, keepdims=False).mode

# Measures of Dispersion (population variance and standard deviation)
variance = np.var(expanded_data)
std_deviation = np.std(expanded_data)
mean_deviation = np.mean(np.abs(expanded_data - mean))

# Quartiles
Q1 = np.percentile(expanded_data, 25)
Q3 = np.percentile(expanded_data, 75)
quartile_deviation = (Q3 - Q1) / 2

# Displaying the results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
print(f"Mean Deviation: {mean_deviation}")
print(f"Quartile Deviation: {quartile_deviation}")

Conclusion

This program effectively calculates the central tendency and measures of dispersion for a frequency
distribution. By utilizing Python's powerful libraries, we can easily perform statistical analysis, making it a
valuable tool for data scientists and analysts. Understanding these measures allows for better insights
into the data, guiding informed decision-making.

5. Program to perform cross validation for a given dataset to measure Root Mean Squared Error
(RMSE), Mean Absolute Error (MAE) and R2 Error using validation set, Leave-one-out
cross-validation (LOOCV) and k-fold cross-validation approaches.

import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# Generate synthetic data
X, y = make_regression(n_samples=100, n_features=1, noise=10)

# Initialize model
model = LinearRegression()
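
The program statement also calls for a plain validation-set (hold-out) approach, which the loops below don't cover. A minimal sketch using train_test_split; the 80/20 split and random_state are illustrative assumptions:

from sklearn.model_selection import train_test_split

# Hold-out validation: a single 80/20 train/validation split
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
val_predictions = model.predict(X_val)

print("Validation Set Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y_val, val_predictions)))
print("MAE:", mean_absolute_error(y_val, val_predictions))
print("R-squared:", r2_score(y_val, val_predictions))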
# K-Fold Cross Validation: report the metrics for each of the 5 folds
kf = KFold(n_splits=5)
for fold, (train_index, test_index) in enumerate(kf.split(X), start=1):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"K-Fold Metrics (fold {fold}):")
    print("RMSE:", np.sqrt(mean_squared_error(y_test, predictions)))
    print("MAE:", mean_absolute_error(y_test, predictions))
    print("R-squared:", r2_score(y_test, predictions))

# Leave-One-Out Cross Validation: each fold tests a single sample, so the
# metrics (R-squared in particular) are only meaningful when computed over
# all held-out predictions at once
loo = LeaveOneOut()
loo_predictions = np.zeros_like(y)
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    loo_predictions[test_index] = model.predict(X_test)

print("LOOCV Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y, loo_predictions)))
print("MAE:", mean_absolute_error(y, loo_predictions))
print("R-squared:", r2_score(y, loo_predictions))
