1. Program on data wrangling: Combining and merging data sets, reshaping and pivoting
pandas provides various methods for combining and comparing Series or DataFrame objects.
1. Merging DataFrames
import pandas as pd
# Creating two DataFrames
df1 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
df2 = pd.DataFrame({
    'ID': [1, 2, 4],
    'Age': [25, 30, 22]
})
# Merging DataFrames on 'ID'
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
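With how='inner', only IDs present in both frames (1 and 2) survive. For illustration, here is a sketch of the same merge with how='outer', which keeps every ID and fills missing cells with NaN:
# Outer merge keeps all IDs; unmatched values become NaN
outer_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_df)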
2. Joining DataFrames
# Setting 'ID' as the index for df1
df1.set_index('ID', inplace=True)
# Joining df1 and df2
joined_df = df1.join(df2.set_index('ID'), how='inner')
print(joined_df)
3. Concatenating DataFrames
# Creating two DataFrames
df3 = pd.DataFrame({
    'ID': [5, 6],
    'Name': ['David', 'Eva']
})
# Concatenating DataFrames vertically
concatenated_df = pd.concat([df1.reset_index(), df3], ignore_index=True)
print(concatenated_df)
4. Comparing DataFrames
# Creating another DataFrame for comparison
df4 = pd.DataFrame({
    'ID': [1, 2, 3],
    'Name': ['Alice', 'Bob', 'Charlie']
})
# Comparing DataFrames
comparison = df1.equals(df4.set_index('ID'))
print(f"Are the DataFrames equal? {comparison}")
5. Reshaping Data
Reshaping data typically involves changing the layout of a DataFrame. This can be done using the melt()
function, which transforms a wide format DataFrame into a long format.
Example of Reshaping with melt()
import pandas as pd
# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Math': [85, 90, 95],
    'Science': [80, 85, 90]
}
df = pd.DataFrame(data)
# Reshaping the DataFrame
melted_df = pd.melt(df, id_vars=['Name'], value_vars=['Math', 'Science'],
                    var_name='Subject', value_name='Score')
print(melted_df)
6. Pivoting Data
Using pivot()
The pivot() function reshapes long-format data into wide format; each index/column pair must be unique.
# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Subject': ['Math', 'Math', 'Science', 'Science'],
    'Score': [85, 90, 80, 85]
}
df = pd.DataFrame(data)
# Pivoting the DataFrame
pivoted_df = df.pivot(index='Name', columns='Subject', values='Score')
print(pivoted_df)
Using pivot_table()
The pivot_table() function is more versatile than pivot(), as it allows for aggregation of data. This is
particularly useful when you have duplicate entries for the index/column pairs. You can specify an
aggregation function to summarize the data.
Example:
Consider sales data that includes multiple entries for the same product on the same date:
# Sample data with duplicates
data = {
    'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02', '2023-01-01'],
    'Product': ['A', 'B', 'A', 'B', 'A'],
    'Sales': [100, 150, 200, 250, 300]
}
df = pd.DataFrame(data)
# Using pivot_table
pivot_table_df = df.pivot_table(index='Date', columns='Product', values='Sales', aggfunc='sum')
print(pivot_table_df)
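Reshaping can also be done with stack() and unstack(), which move data between the row index and the columns. A minimal sketch using pivoted_df from the pivot() example above:
# stack(): wide -> long, columns become the inner level of a MultiIndex
stacked = pivoted_df.stack()
print(stacked)
# unstack(): long -> wide, the inner index level becomes columns again
print(stacked.unstack())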
2. Program on Data transformations: String manipulation and Regular expressions
String Manipulation
Python provides a rich set of built-in methods for string manipulation. Here are some common
operations:
1. Changing Case: You can easily change the case of strings using methods like upper(), lower(), and
title().
text = "hello world"
print(text.upper()) # Output: HELLO WORLD
print(text.title()) # Output: Hello World
2. Trimming Whitespace: The strip(), lstrip(), and rstrip() methods are useful for removing
unwanted whitespace.
text = " Hello World "
print(text.strip()) # Output: Hello World
3. Replacing Substrings: The replace() method allows you to replace occurrences of a substring with
another substring.
text = "I love Python"
new_text = text.replace("Python", "programming")
print(new_text) # Output: I love programming
4. Splitting and Joining Strings: You can split a string into a list of substrings using split(), and join a
list of strings into a single string using join().
text = "apple,banana,cherry"
fruits = text.split(",")
print(fruits) # Output: ['apple', 'banana', 'cherry']
new_text = " and ".join(fruits)
print(new_text) # Output: apple and banana and cherry
Regular Expressions
Regular expressions are a powerful tool for searching and manipulating strings based on patterns. The re
module in Python provides functions to work with regex.
1. Searching for Patterns: The search() function checks if a pattern exists in a string.
import re
text = "Please contact support@example.com for help"
match = re.search(r'\S+@\S+', text)
if match:
    print("Found email:", match.group())  # Output: support@example.com
2. Finding All Matches: The findall() function returns all occurrences of a pattern in a string.
text = "Write to alice@example.com or bob@example.com"
emails = re.findall(r'\S+@\S+', text)
print("Found emails:", emails)  # Output: ['alice@example.com', 'bob@example.com']
3. Replacing Patterns: The sub() function allows you to replace occurrences of a pattern with a
specified string.
text = "My phone number is 123-456-7890"
new_text = re.sub(r'\d{3}-\d{3}-\d{4}', 'XXX-XXX-XXXX', text)
print(new_text)
4. Validating Input: Regular expressions can also be used to validate input formats, such as
checking whether a string is a valid email address.
def is_valid_email(email):
    pattern = r'^[\w\.-]+@[\w\.-]+\.\w+$'
    return re.match(pattern, email) is not None
print(is_valid_email("user@example.com"))  # Output: True
print(is_valid_email("invalid-email"))     # Output: False
3. Program on Time series: GroupBy Mechanics to display in data vector, multivariate time series
and forecasting formats
Time series analysis is a powerful technique used in various fields such as finance, economics, and
environmental science. In Python, the pandas library provides robust tools for handling time series data,
including the groupby functionality, which allows for efficient data aggregation and transformation.
GroupBy Mechanics in Time Series
The groupby method in pandas is essential for aggregating data based on specific time intervals. For
instance, you can group data by day, month, or year, which is particularly useful for summarizing trends
over time.
Here’s a simple example to illustrate how to use groupby with a time series dataset:
import pandas as pd
import numpy as np
# Create a sample time series DataFrame
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = np.random.randint(0, 100, size=(len(date_rng), 2))
df = pd.DataFrame(data, columns=['A', 'B'], index=date_rng)
# Display the DataFrame
print("Original DataFrame:")
print(df)
# Group by day and calculate the mean
daily_mean = df.resample('D').mean()
print("\nDaily Mean:")
print(daily_mean)
In this example, we create a DataFrame of random values indexed by dates. We then use the resample
method, pandas' time-based groupby, to group the data by day and calculate the mean for each day.
(Since the sample data is already daily, the values are unchanged here; with hourly or finer data,
resample('D') would aggregate within each day.)
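The same aggregation can be written with groupby directly using pd.Grouper, which groups a DatetimeIndex by a calendar frequency (daily here; monthly or yearly grouping works the same way); a minimal sketch:
# Equivalent to df.resample('D'): group the DatetimeIndex by calendar day
daily_mean_gb = df.groupby(pd.Grouper(freq='D')).mean()
print(daily_mean_gb)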
Multivariate Time Series
When dealing with multivariate time series, you may have multiple variables that you want to analyze
simultaneously. The same groupby mechanics can be applied, but you can also visualize the relationships
between these variables.
Here’s how you can handle multivariate time series data:
# Create a multivariate time series DataFrame
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
data = {
    'Temperature': np.random.randint(20, 30, size=len(date_rng)),
    'Humidity': np.random.randint(30, 70, size=len(date_rng))
}
df_multivariate = pd.DataFrame(data, index=date_rng)
# Display the multivariate DataFrame
print("\nMultivariate DataFrame:")
print(df_multivariate)
# Group by day and calculate the mean for each variable
daily_mean_multivariate = df_multivariate.resample('D').mean()
print("\nDaily Mean for Multivariate DataFrame:")
print(daily_mean_multivariate)
In this code snippet, we create a multivariate DataFrame with temperature and humidity data. We then
apply the same resample method to calculate daily means for both variables.
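As a quick sketch of the visualization mentioned above (it assumes matplotlib is installed), both variables can be drawn on a shared time axis:
import matplotlib.pyplot as plt

# Plot temperature and humidity against the date index
df_multivariate.plot(title='Temperature and Humidity over Time')
plt.xlabel('Date')
plt.show()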
Forecasting Formats
Forecasting in time series can be approached using various models, such as ARIMA, Exponential
Smoothing, or machine learning techniques. The statsmodels library provides tools for implementing
these models.
Here’s a basic example of how to implement an ARIMA model for forecasting:
from statsmodels.tsa.arima.model import ARIMA
import matplotlib.pyplot as plt
# Fit an ARIMA model
model = ARIMA(df['A'], order=(1, 1, 1))
model_fit = model.fit()
# Forecast the next 5 days
forecast = model_fit.forecast(steps=5)
print("\nForecast for the next 5 days:")
print(forecast)
# Plot the results
plt.figure(figsize=(10, 5))
plt.plot(df['A'], label='Historical Data')
plt.plot(pd.date_range(start=df.index[-1] + pd.Timedelta(days=1), periods=5),
         forecast, label='Forecast', color='red')
plt.title('Time Series Forecasting')
plt.xlabel('Date')
plt.ylabel('Values')
plt.legend()
plt.show()
In this example, we fit an ARIMA model to the time series data and forecast the next five days. The
results are then visualized using matplotlib.
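Exponential Smoothing, also named above, is available from the same library. A minimal sketch fitting statsmodels' Holt-Winters ExponentialSmoothing to the same series (the additive trend choice is an assumption, not part of the original program):
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Fit Holt's linear trend model and forecast the next 5 days
es_model = ExponentialSmoothing(df['A'], trend='add').fit()
print("\nExponential Smoothing forecast:")
print(es_model.forecast(steps=5))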
4. Program to measure central tendency and measures of dispersion: Mean, median, mode,
standard deviation, variance, mean deviation and quartile deviation for a frequency
distribution/data.
In statistics, understanding the central tendency and measures of dispersion is crucial for analyzing data.
Central tendency provides a summary measure that represents the entire dataset, while measures of
dispersion indicate the spread or variability of the data. Below, we will explore how to compute these
statistics using Python.
Required Libraries
To perform these calculations, we will utilize the numpy and scipy libraries. If you haven't installed these
libraries yet, you can do so using pip:
pip install numpy scipy
Sample Data
Let's assume we have a frequency distribution represented as a list of tuples, where each tuple contains
a value and its corresponding frequency. For example:
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
Calculating Measures of Central Tendency
Mean: The mean is the sum of each value multiplied by its frequency, divided by the total frequency.
Median: The median is the middle value when the data is sorted. If the number of observations is even,
it is the average of the two middle values.
Mode: The mode is the value that appears most frequently in the dataset.
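For the sample data above, for instance, the mean works out to (1×5 + 2×10 + 3×15 + 4×20 + 5×10) / (5 + 10 + 15 + 20 + 10) = 200 / 60 ≈ 3.33.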
Calculating Measures of Dispersion
Variance: Variance measures how far a set of numbers is spread out from their average value.
Standard Deviation: The standard deviation is the square root of the variance, providing a measure of
the average distance from the mean.
Mean Deviation: This is the average of the absolute deviations from the mean.
Quartile Deviation: This is half the difference between the third quartile (Q3) and the first quartile (Q1), i.e. (Q3 - Q1) / 2.
Implementation
import numpy as np
from scipy import stats
# Sample frequency distribution
data = [(1, 5), (2, 10), (3, 15), (4, 20), (5, 10)]
# Expanding the data based on frequency
expanded_data = []
for value, frequency in data:
    expanded_data.extend([value] * frequency)
# Convert to numpy array for calculations
expanded_data = np.array(expanded_data)
# Central Tendency
mean = np.mean(expanded_data)
median = np.median(expanded_data)
mode = stats.mode(expanded_data, keepdims=False).mode  # scalar mode (SciPy >= 1.9)
# Measures of Dispersion (np.var/np.std use the population formula, ddof=0)
variance = np.var(expanded_data)
std_deviation = np.std(expanded_data)
mean_deviation = np.mean(np.abs(expanded_data - mean))
# Quartiles
Q1 = np.percentile(expanded_data, 25)
Q3 = np.percentile(expanded_data, 75)
quartile_deviation = (Q3 - Q1) / 2
# Displaying the results
print(f"Mean: {mean}")
print(f"Median: {median}")
print(f"Mode: {mode}")
print(f"Variance: {variance}")
print(f"Standard Deviation: {std_deviation}")
print(f"Mean Deviation: {mean_deviation}")
print(f"Quartile Deviation: {quartile_deviation}")
Conclusion
This program effectively calculates the central tendency and measures of dispersion for a frequency
distribution. By utilizing Python's powerful libraries, we can easily perform statistical analysis, making it a
valuable tool for data scientists and analysts. Understanding these measures allows for better insights
into the data, guiding informed decision-making.
5. Program to perform cross validation for a given dataset to measure Root Mean Squared Error
(RMSE), Mean Absolute Error (MAE) and R2 Error using validation set, Leave one out cross-
validation(LOOCV) and k-fold cross-validation approaches.
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
# Generate synthetic regression data (random_state makes runs reproducible)
X, y = make_regression(n_samples=100, n_features=1, noise=10, random_state=42)
# Initialize model
model = LinearRegression()
# K-Fold Cross Validation: average the metrics over all folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rmse_scores, mae_scores, r2_scores = [], [], []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    rmse_scores.append(np.sqrt(mean_squared_error(y_test, predictions)))
    mae_scores.append(mean_absolute_error(y_test, predictions))
    r2_scores.append(r2_score(y_test, predictions))
print("K-Fold Metrics (averaged over folds):")
print("RMSE:", np.mean(rmse_scores))
print("MAE:", np.mean(mae_scores))
print("R-squared:", np.mean(r2_scores))
# Leave-One-Out Cross Validation: each test set holds a single sample,
# so collect all held-out predictions and score them together
# (R-squared is not defined for a single observation)
loo = LeaveOneOut()
y_true, y_pred = [], []
for train_index, test_index in loo.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    model.fit(X_train, y_train)
    y_true.append(y_test[0])
    y_pred.append(model.predict(X_test)[0])
print("LOOCV Metrics:")
print("RMSE:", np.sqrt(mean_squared_error(y_true, y_pred)))
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R-squared:", r2_score(y_true, y_pred))