
Data Science Using Python Lab Week8

This document provides a comprehensive overview of data science techniques using Python, including polynomial fitting, eigenvalue and eigenvector computation, matrix decomposition via SVD, random number generation, and solving linear equations. It also covers functionalities of the pandas library such as creating DataFrames, grouping, joining, and creating pivot tables. The examples illustrate key concepts in data analysis and numerical methods, highlighting their applications in various fields.

Uploaded by

v.rithwikaa
Copyright
© © All Rights Reserved

DATA SCIENCE USING PYTHON LAB WEEK8

9. Program to use Scipy.linalg / Numpy.linalg package:


import numpy as np
import scipy.linalg as spla
import matplotlib.pyplot as plt

# 1. Fitting to polynomials
print("1. Fitting to polynomials:")
# Generate some noisy data
x = np.linspace(0, 10, 100)
y = 3 * x**2 - 5 * x + 2 + np.random.normal(0, 10, 100)

# Fit a second-degree polynomial
coeffs = np.polyfit(x, y, 2)  # coefficients in descending powers of x
poly = np.poly1d(coeffs)      # callable polynomial built from the coefficients

print("Polynomial coefficients:", coeffs)
print("Polynomial function:", poly)

# Plot the results
plt.figure(figsize=(10, 6))
plt.scatter(x, y, label='Data')
plt.plot(x, poly(x), 'r-', label='Fitted polynomial')
plt.legend()
plt.title('Polynomial Fitting')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

This script demonstrates how to fit a polynomial to a set of noisy data points with NumPy and
plot the result with Matplotlib.

Explanation:

1. Import Libraries: numpy handles the numerical operations and matplotlib.pyplot the
visualization. scipy.linalg is imported as well, although this listing does not call it.
2. Generate Noisy Data: A set of x-values is created, evenly spaced between 0 and 10.
Corresponding y-values are calculated from the quadratic y = 3x^2 - 5x + 2,
with random noise added to simulate real-world data variability. The noise is sampled
from a normal distribution with a mean of 0 and standard deviation of 10.
3. Polynomial Fitting: The script fits a quadratic polynomial to the noisy data. The fitting
process uses NumPy's polyfit function to determine the least-squares coefficients of the
best-fit polynomial. The coefficients are returned in descending powers of x, and a poly1d
object is created to evaluate the polynomial as a function of x.
4. Output Results: The coefficients of the fitted polynomial are displayed along with its
mathematical representation.
5. Visualization: A scatter plot is generated to show the noisy data points. The fitted
polynomial is displayed as a smooth red curve over the data points. The plot includes a
title, labeled axes, and a legend for clarity.

Key Points:

 The polynomial coefficients represent the best-fit quadratic equation for the data.
 Adding random noise illustrates how polynomial fitting works despite imperfect data.
 Visualization confirms that the polynomial approximates the noisy data effectively.

This example demonstrates the basics of regression analysis and curve fitting, widely used in
data science and scientific computing to model relationships within data.
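
The quality of the fit can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the lab script: the seed value 0 and the names rng, y_fit and r_squared are assumptions introduced here so that the run is reproducible.

```python
import numpy as np

# Seeded generator (the seed 0 is an arbitrary, assumed choice) so the run is repeatable
rng = np.random.default_rng(0)

x = np.linspace(0, 10, 100)
y = 3 * x**2 - 5 * x + 2 + rng.normal(0, 10, 100)

# Coefficients come back in descending powers of x: [a2, a1, a0]
coeffs = np.polyfit(x, y, 2)
y_fit = np.polyval(coeffs, x)

# Coefficient of determination: 1 - SS_res / SS_tot
ss_res = np.sum((y - y_fit) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print("Fitted coefficients:", coeffs)
print("R^2:", r_squared)
```

The leading coefficient should land close to 3, and R^2 should be close to 1 because the noise is small relative to the spread of the quadratic over this range.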

# 2. Eigenvectors and Eigenvalues
print("\n2. Eigenvectors and Eigenvalues:")
A = np.array([[1, 2], [2, 3]])
eigenvalues, eigenvectors = np.linalg.eig(A)

print("Matrix A:")
print(A)
print("Eigenvalues:")
print(eigenvalues)
print("Eigenvectors:")
print(eigenvectors)

# Verify Av = λv (column i of `eigenvectors` belongs to eigenvalues[i])
for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    print(f"Verification for eigenpair {i + 1}:")
    print(f"A * v = {np.dot(A, v)}")
    print(f"λ * v = {eigenvalues[i] * v}")

This script demonstrates how to compute the eigenvalues and eigenvectors of a matrix and verify
the eigenvalue-eigenvector relationship using Python's NumPy library.

Purpose:

This code serves to:


 Demonstrate the computation of eigenvalues and eigenvectors of a matrix.
 Verify the defining property of eigenvalues and eigenvectors (Av = λv).
 Show how eigenvalues and eigenvectors are used to analyze linear transformations and their
effects on vectors.

This concept is a cornerstone in linear algebra and has applications in various fields, including
physics, engineering, and data science.
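
Because the matrix A used above is symmetric, np.linalg.eigh is also applicable. The sketch below is an illustration rather than part of the lab script; the names w, V, pairs_ok, orthonormal and expected are introduced here. It checks all eigenpairs at once and compares the eigenvalues with the roots of the characteristic polynomial λ^2 - 4λ - 1 = 0.

```python
import numpy as np

A = np.array([[1, 2], [2, 3]])

# eigh is intended for symmetric (Hermitian) matrices; eigenvalues come back in ascending order
w, V = np.linalg.eigh(A)

# Check every eigenpair in one step: A V = V diag(w)
pairs_ok = np.allclose(A @ V, V @ np.diag(w))

# For a symmetric matrix the eigenvectors returned by eigh are orthonormal: V^T V = I
orthonormal = np.allclose(V.T @ V, np.eye(2))

# Closed form for this 2x2 case: det(A - λI) = λ^2 - 4λ - 1, so λ = 2 ± sqrt(5)
expected = np.array([2 - np.sqrt(5), 2 + np.sqrt(5)])

print("Eigenvalues:", w)
print("A V == V diag(w):", pairs_ok)
print("V orthonormal:", orthonormal)
```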

# 3. Decomposing a matrix using SVD
print("\n3. Decomposing a matrix using SVD:")
B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
U, s, Vt = np.linalg.svd(B)

print("Matrix B:")
print(B)
print("U:")
print(U)
print("Singular values:")
print(s)
print("V transpose:")
print(Vt)

# Verify B = U * S * V^T
# Build S as a float matrix: np.zeros_like(B) would inherit B's integer dtype
# and truncate the singular values, so the reconstruction would not match B.
S = np.zeros(B.shape)
S[:s.shape[0], :s.shape[0]] = np.diag(s)
B_reconstructed = np.dot(U, np.dot(S, Vt))
print("Reconstructed B:")
print(B_reconstructed)
print("Is reconstruction close to original?", np.allclose(B, B_reconstructed))

This script demonstrates how to decompose a matrix using Singular Value Decomposition
(SVD) and verify the decomposition using Python's NumPy library.

Explanation:

Purpose:

This code serves to:

 Illustrate the process of Singular Value Decomposition.
 Demonstrate how a matrix can be broken down into its fundamental components (left singular
vectors, singular values, and right singular vectors).
 Verify the decomposition by reconstructing the original matrix.

Applications of SVD:

 Data compression (e.g., reducing dimensionality in image processing).
 Principal Component Analysis (PCA) in data science.
 Solving linear systems and matrix approximations.

SVD is a fundamental tool in numerical linear algebra and has a wide range of practical uses
across various fields.
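
The compression and approximation uses listed above rest on truncating the SVD to its largest singular values. As a rough sketch (the names rank, k and B_k are introduced here, and k = 2 is chosen to match the rank of this particular matrix), the matrix B from the script can be rebuilt from only two of its three singular triplets:

```python
import numpy as np

B = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# Reduced SVD; for a square matrix the shapes match the full SVD
U, s, Vt = np.linalg.svd(B, full_matrices=False)

# The rows of B are linearly dependent, so its numerical rank is 2
rank = np.linalg.matrix_rank(B)

# Keep only the k largest singular values and the matching singular vectors
k = 2
B_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("Singular values:", s)
print("Rank of B:", rank)
print("Rank-2 truncation reproduces B:", np.allclose(B, B_k))
```

The third singular value is zero up to rounding error, which is why dropping it loses nothing here; for a full-rank matrix the truncation would only approximate the original.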

# 4. Generating random numbers
print("\n4. Generating random numbers:")
# Generate random numbers from a normal distribution
normal_dist = np.random.normal(loc=0, scale=1, size=1000)

# Generate random numbers from a uniform distribution
uniform_dist = np.random.uniform(low=0, high=1, size=1000)

# Plot histograms
plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.hist(normal_dist, bins=30)
plt.title('Normal Distribution')
plt.subplot(122)
plt.hist(uniform_dist, bins=30)
plt.title('Uniform Distribution')
plt.tight_layout()
plt.show()

# Bonus: Solving a system of linear equations
print("\n5. Bonus: Solving a system of linear equations:")
C = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])

x = np.linalg.solve(C, b)
print("Matrix C:")
print(C)
print("Vector b:")
print(b)
print("Solution x:")
print(x)
print("Verification Cx = b:")
print(np.dot(C, x))

This script demonstrates two tasks: generating random numbers from different distributions and
solving a system of linear equations using Python.

Explanation:

4. Generating Random Numbers:

1. Random Number Generation:


o Normal Distribution:
 Random numbers are generated from a normal distribution with:
 Mean (loc) = 0.
 Standard deviation (scale) = 1.
 A total of 1000 samples.
o Uniform Distribution:
 Random numbers are uniformly distributed over the interval [0, 1], with 1000
samples.
2. Visualization:
o Two histograms are created to visualize the random numbers:
 Normal Distribution:
 Exhibits a bell-shaped curve, characteristic of a Gaussian distribution.
 Uniform Distribution:
 Shows a flat, even distribution across the range.
3. Purpose:
o Demonstrates how to generate and analyze random samples, which is critical in
probabilistic modeling, simulations, and data science.
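
Histograms aside, the samples can be compared with the theoretical mean and standard deviation of each distribution. The sketch below uses NumPy's Generator interface (np.random.default_rng) instead of the legacy np.random functions in the script; the seed 42 and the name same are assumptions added here for reproducibility.

```python
import numpy as np

# Generator API; the seed 42 is an arbitrary, assumed value
rng = np.random.default_rng(42)

normal_dist = rng.normal(loc=0, scale=1, size=1000)
uniform_dist = rng.uniform(low=0, high=1, size=1000)

# Sample statistics should sit near the theoretical values
print("normal  mean/std:", normal_dist.mean(), normal_dist.std())    # theory: 0 and 1
print("uniform mean/std:", uniform_dist.mean(), uniform_dist.std())  # theory: 0.5 and 1/sqrt(12)

# Two generators created with the same seed produce the same stream
same = np.array_equal(np.random.default_rng(42).normal(size=5),
                      np.random.default_rng(42).normal(size=5))
print("Same seed, same numbers:", same)
```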
Key Insights:

Random Number Generation:

 Demonstrates sampling from different distributions and understanding their characteristics
through visualization.
 Useful in various fields like statistics, machine learning, and stochastic modeling.

Solving Linear Equations:

 np.linalg.solve finds x in Cx = b directly, without forming the inverse of C. For the system
above the solution is x = [2, 3].
 Multiplying C by the computed x and comparing the product with b verifies the solution.

These concepts showcase foundational techniques in computational mathematics and data
analysis.
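
The script imports scipy.linalg but solves the system with NumPy only. As a hedged illustration of the SciPy side, the sketch below factors C once with scipy.linalg.lu_factor and reuses the factorization through scipy.linalg.lu_solve; the second right-hand side [4, 3] and the name x2 are hypothetical and not taken from the lab.

```python
import numpy as np
import scipy.linalg as spla

C = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([9.0, 8.0])

# Factor C once (LU decomposition with partial pivoting)
lu, piv = spla.lu_factor(C)

# Reuse the factorization for each right-hand side
x = spla.lu_solve((lu, piv), b)
x2 = spla.lu_solve((lu, piv), np.array([4.0, 3.0]))  # hypothetical second right-hand side

print("x  =", x)
print("x2 =", x2)
print("Residual C x - b:", C @ x - b)
```

Reusing the factorization pays off when the same coefficient matrix has to be solved against many right-hand sides; for a single 2x2 system np.linalg.solve is just as suitable.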
10. (a) Program to create pandas DataFrames, grouping and joining
DataFrames, pandas Series, creating pivot tables (b) Using
vectorized string functions with pandas DataFrames
import pandas as pd
import numpy as np

# 10. (a) Program to create pandas DataFrames, grouping and joining DataFrames,
# pandas Series, creating pivot tables

print("1. Creating pandas DataFrames:")

# Create a DataFrame from a dictionary
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3']
}, index=[0, 1, 2, 3])

print(df1)

# Create a DataFrame from a numpy array
dates = pd.date_range('20230101', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

print("\nDataFrame with date index:")
print(df2)

print("\n2. Grouping DataFrames:")
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': np.random.randn(8),
    'D': np.random.randn(8)
})

print(df)
print("\nGrouped by 'A', sum of 'C':")
print(df.groupby('A')['C'].sum())

print("\nGrouped by 'A' and 'B', mean of 'D':")
print(df.groupby(['A', 'B'])['D'].mean())

print("\n3. Joining DataFrames:")
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)

# Inner join on the shared 'key' column (the default for pd.merge)
merged = pd.merge(left, right, on='key')
print("\nMerged DataFrame:")
print(merged)

print("\n4. Pandas Series:")


s = pd.Series([1, 3, 5, np.nan, 6, 8])
print("Pandas Series:")
print(s)

print("\nDataFrame column as Series:")


print(df1['A'])

print("\n5. Creating Pivot Tables:")
df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]
})

print("Original DataFrame:")
print(df)

# The string 'sum' replaces np.sum, which recent pandas versions flag with a FutureWarning
pivot = pd.pivot_table(df, values='D',
                       index=['A', 'B'],
                       columns=['C'],
                       aggfunc='sum')
print("\nPivot Table:")
print(pivot)

Output:

1. Creating pandas DataFrames:
A B C D
0 A0 B0 C0 D0
1 A1 B1 C1 D1
2 A2 B2 C2 D2
3 A3 B3 C3 D3

DataFrame with date index:


A B C D
2023-01-01 0.764616 -0.822003 0.610633 1.016225
2023-01-02 -1.037983 1.048314 -0.486589 -1.456634
2023-01-03 0.593625 0.330537 2.025437 -0.060851
2023-01-04 0.742442 0.410289 -0.486331 -0.519475
2023-01-05 -1.197032 -0.699823 -0.272674 0.699354
2023-01-06 -0.689811 -0.588540 -1.482459 -1.002152

2. Grouping DataFrames:
A B C D
0 foo one -0.140529 -2.309286
1 bar one -1.356555 -0.198256
2 foo two -0.374576 0.716094
3 bar three -0.956765 1.453538
4 foo two 0.042537 -1.113060
5 bar two -0.532235 0.167942
6 foo one -0.073184 0.329067
7 foo three 0.239322 -1.239340

Grouped by 'A', sum of 'C':
A
bar   -2.845555
foo   -0.306430
Name: C, dtype: float64

Grouped by 'A' and 'B', mean of 'D':
A    B
bar  one     -0.198256
     three    1.453538
     two      0.167942
foo  one     -0.990110
     three   -1.239340
     two     -0.198483
Name: D, dtype: float64

3. Joining DataFrames:
Left DataFrame:
key A B
0 K0 A0 B0
1 K1 A1 B1
2 K2 A2 B2
3 K3 A3 B3

Right DataFrame:
key C D
0 K0 C0 D0
1 K1 C1 D1
2 K2 C2 D2
3 K3 C3 D3

Merged DataFrame:
key A B C D
0 K0 A0 B0 C0 D0
1 K1 A1 B1 C1 D1
2 K2 A2 B2 C2 D2
3 K3 A3 B3 C3 D3

4. Pandas Series:
Pandas Series:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64

DataFrame column as Series:


0 A0
1 A1
2 A2
3 A3
Name: A, dtype: object

5. Creating Pivot Tables:


Original DataFrame:
A B C D E
0 foo one small 1 2
1 foo one large 2 4
2 foo one large 2 5
3 foo two small 3 5
4 foo two small 3 6
5 bar one large 4 6
6 bar one small 5 8
7 bar two small 6 9
8 bar two large 7 9

Pivot Table:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
This script demonstrates various functionalities of the pandas library, such as creating
DataFrames, grouping and joining DataFrames, working with pandas Series, and creating pivot
tables.

Explanation:

1. Creating pandas DataFrames:

 From a Dictionary:
o A simple DataFrame is created using a dictionary where keys represent column names
and values are the data for those columns.
 From a NumPy Array:
o A DataFrame is created with random data generated using np.random.randn and
indexed by a range of dates created with pd.date_range.

2. Grouping DataFrames:

 Grouping and Aggregation:
o The DataFrame is grouped and then aggregated in two ways:
 Grouped by column A: the sum of column C for each group.
 Grouped by columns A and B: the mean of column D for each (A, B) pair.

Grouping helps analyze subsets of data based on categorical columns.
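
Several aggregations can be requested in one pass with named aggregation. The sketch below uses a small table with made-up, deterministic values (not the random data from the script) so that the result is easy to check by hand; the output column names c_sum, d_mean and n_rows are chosen here for illustration.

```python
import pandas as pd

# Small deterministic table (values are made up for illustration)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'two'],
    'C': [1.0, 2.0, 3.0, 4.0],
    'D': [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = df.groupby('A').agg(
    c_sum=('C', 'sum'),
    d_mean=('D', 'mean'),
    n_rows=('C', 'count'),
)
print(summary)
```

For group bar the sum of C is 2 + 4 = 6 and the mean of D is (20 + 40) / 2 = 30; for group foo they are 4 and 20.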

3. Joining DataFrames:

 Merging:
o Two DataFrames (left and right) are merged on a common key column (key).
o pd.merge performs an inner join by default, so the result contains only the rows whose key
appears in both DataFrames. Here all four keys match, so every row is kept.

This demonstrates relational operations similar to SQL joins, essential for integrating datasets.
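
The join type only becomes visible when the keys do not line up. The following sketch uses two hypothetical frames with partly overlapping keys (K0 to K2 on the left, K1 to K3 on the right) to contrast inner, outer and left joins; the how and indicator parameters are part of pd.merge.

```python
import pandas as pd

# Hypothetical frames whose keys overlap only partly
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'], 'A': ['A0', 'A1', 'A2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'], 'C': ['C1', 'C2', 'C3']})

# Inner join (default): only keys present in both frames
inner = pd.merge(left, right, on='key')

# Outer join: all keys; indicator=True adds a _merge column naming each row's origin
outer = pd.merge(left, right, on='key', how='outer', indicator=True)

# Left join: every row of `left`, with NaN where `right` has no match
left_join = pd.merge(left, right, on='key', how='left')

print(inner)
print(outer)
print(left_join)
```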

4. Pandas Series:

 Creating a Series:
o A pandas Series is created, which is a one-dimensional array-like object capable of
holding any data type.
 Extracting a Column as Series:
o A specific column from a DataFrame is accessed as a Series.
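
The Series in the script contains a missing value (NaN), which is why its dtype is float64. A short sketch of how such a Series behaves follows; the names n_missing, mean_skipna, filled and labelled, and the labels 'a', 'b', 'c', are introduced here for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

n_missing = s.isna().sum()    # number of NaN entries
mean_skipna = s.mean()        # NaN is skipped by default: (1 + 3 + 5 + 6 + 8) / 5
filled = s.fillna(s.mean())   # replace NaN with the mean of the observed values

# A Series can also carry an explicit label index (hypothetical labels)
labelled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

print("Missing values:", n_missing)
print("Mean ignoring NaN:", mean_skipna)
print(filled)
print("labelled['b'] =", labelled['b'])
```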
5. Creating Pivot Tables:

 Pivot Table:
o A pivot table is created to summarize data. It groups data by the specified indices (A and
B) and columns (C) and computes the sum of values in column D.
o The pivot table provides a multi-dimensional representation of data.
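
A pivot table is closely related to grouping: the same summary can be obtained with groupby followed by unstack. The sketch below, using the script's data without column E, illustrates that equivalence and the fill_value parameter; the names via_groupby and pivot_filled are introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar"],
    "B": ["one", "one", "one", "two", "two", "one", "one", "two", "two"],
    "C": ["small", "large", "large", "small", "small", "large", "small", "small", "large"],
    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

pivot = pd.pivot_table(df, values="D", index=["A", "B"], columns="C", aggfunc="sum")

# Same summary via groupby: aggregate, then move level 'C' into the columns
via_groupby = df.groupby(["A", "B", "C"])["D"].sum().unstack("C")

# fill_value replaces the empty (foo, two, large) cell with 0 instead of NaN
pivot_filled = pd.pivot_table(df, values="D", index=["A", "B"], columns="C",
                              aggfunc="sum", fill_value=0)

print(via_groupby)
print(pivot_filled)
```

The NaN in the original pivot table marks a combination (foo, two, large) that never occurs in the data, not a sum of zero; fill_value=0 hides that distinction, so it is a choice to make deliberately.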

Key Concepts:

1. DataFrame Creation:
o DataFrames are the core structure in pandas for handling tabular data.
o They can be created from dictionaries, NumPy arrays, or other data structures.
2. Grouping:
o Enables analysis of aggregated metrics for subsets of data.
3. Joining/Merging:
o Combines multiple datasets based on common keys, useful for integrating and
comparing data.
4. Series:
o Acts as a single column or 1D array, foundational for creating and manipulating
DataFrames.
5. Pivot Tables:
o A powerful way to summarize and organize complex datasets, commonly used in data
analysis.

This script highlights key pandas features that are critical for data preprocessing, analysis, and
summarization.
# 10. (b) Using vectorized string functions with pandas DataFrames

print("\n6. Using vectorized string functions:")
df = pd.DataFrame({'text': ['python', 'PANDAS', 'Data Science', 'ML & AI']})

print("Original DataFrame:")
print(df)

print("\nUppercase:")
print(df['text'].str.upper())

print("\nLowercase:")
print(df['text'].str.lower())

print("\nString length:")
print(df['text'].str.len())

print("\nSplit by space:")
print(df['text'].str.split())

print("\nReplace '&' with 'and':")
# regex=False treats '&' as a literal string on every pandas version
print(df['text'].str.replace('&', 'and', regex=False))

print("\nContains 'a' (case-insensitive):")
print(df['text'].str.contains('a', case=False))

print("\nExtract words starting with 'P':")
# Raw string so that \w reaches the regex engine without an invalid-escape warning
print(df['text'].str.extract(r'(P\w+)'))

# Bonus: Working with datetime in pandas
print("\n7. Bonus: Working with datetime in pandas:")
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15', '2023-03-01'],
    'value': [100, 200, 150, 250, 300]
})

df['date'] = pd.to_datetime(df['date'])
print(df)

print("\nExtract month:")
print(df['date'].dt.month)

print("\nExtract day of week:")
print(df['date'].dt.day_name())

print("\nAdd 7 days to each date:")
print(df['date'] + pd.Timedelta(days=7))

# Resample to monthly frequency
# Note: pandas 2.2+ prefers the alias 'ME' (month end); 'M' still runs there but emits a FutureWarning
monthly = df.set_index('date').resample('M')['value'].mean()
print("\nMonthly average:")
print(monthly)

Output:

6. Using vectorized string functions:


Original DataFrame:
text
0 python
1 PANDAS
2 Data Science
3 ML & AI

Uppercase:
0 PYTHON
1 PANDAS
2 DATA SCIENCE
3 ML & AI
Name: text, dtype: object
Lowercase:
0 python
1 pandas
2 data science
3 ml & ai
Name: text, dtype: object

String length:
0 6
1 6
2 12
3 7
Name: text, dtype: int64

Split by space:
0 [python]
1 [PANDAS]
2 [Data, Science]
3 [ML, &, AI]
Name: text, dtype: object

Replace '&' with 'and':


0 python
1 PANDAS
2 Data Science
3 ML and AI
Name: text, dtype: object

Contains 'a' (case-insensitive):


0 False
1 True
2 True
3 True
Name: text, dtype: bool

Extract words starting with 'P':


0
0 NaN
1 PANDAS
2 NaN
3 NaN

7. Bonus: Working with datetime in pandas:


date value
0 2023-01-01 100
1 2023-01-15 200
2 2023-02-01 150
3 2023-02-15 250
4 2023-03-01 300

Extract month:
0 1
1 1
2 2
3 2
4 3
Name: date, dtype: int32

Extract day of week:


0 Sunday
1 Sunday
2 Wednesday
3 Wednesday
4 Wednesday
Name: date, dtype: object

Add 7 days to each date:


0 2023-01-08
1 2023-01-22
2 2023-02-08
3 2023-02-22
4 2023-03-08
Name: date, dtype: datetime64[ns]

Monthly average:
date
2023-01-31 150.0
2023-02-28 200.0
2023-03-31 300.0
Freq: ME, Name: value, dtype: float64

This script demonstrates how to use vectorized string functions and work with datetime data in
pandas. These operations are essential for text preprocessing and time-series analysis in data
science workflows.

Explanation:

6. Using Vectorized String Functions:

1. Original DataFrame:
o Contains a column text with sample strings such as "python", "PANDAS", etc.
2. String Operations:
o Uppercase and Lowercase:
 Convert text to all uppercase or lowercase using str.upper() and
str.lower() respectively.
o String Length:
 Compute the length of each string using str.len().
o Splitting Strings:
 Split each string into a list of words based on spaces using str.split().
o Replace Substrings:
 Replace & with and in each string using str.replace().
o Contains Substring:
 Check if each string contains the letter "a" (case-insensitive) using
str.contains().
o Extract Pattern:
 Extract words starting with "P" using a regular expression in str.extract().

These operations highlight pandas' ability to efficiently perform string manipulations directly on
columns without the need for loops.
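
The .str methods can be chained, and they pass missing values through instead of raising an error. The sketch below extends the sample strings with one NaN entry (added here purely for illustration) and uses the names slug, has_a and first_word, which do not appear in the lab script.

```python
import numpy as np
import pandas as pd

# Same sample strings as above plus one missing value (added here for illustration)
text = pd.Series(['python', 'PANDAS', 'Data Science', 'ML & AI', np.nan])

# Chain methods: lowercase, replace '&' literally, then turn spaces into underscores
slug = (text.str.lower()
            .str.replace('&', 'and', regex=False)
            .str.replace(' ', '_', regex=False))

# Missing values propagate as NaN; na=False maps them to False for boolean masks
has_a = text.str.contains('a', case=False, na=False)

# A named group gives the extracted column a name instead of the default 0
first_word = text.str.extract(r'^(?P<first>\w+)')

print(slug)
print(has_a)
print(first_word)
```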

7. Bonus: Working with Datetime in Pandas:

1. Datetime Conversion:
o The date column is converted to datetime format using pd.to_datetime(), allowing
for datetime-specific operations.
2. Extracting Components:
o Month:
 Extract the month from each date using dt.month.
o Day of Week:
 Extract the name of the day using dt.day_name().
3. Date Arithmetic:
o Add 7 days to each date using pd.Timedelta(days=7).
4. Resampling:
o The data is resampled to a monthly frequency using resample('M'), calculating the
average value for each month.
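
The monthly average can also be computed without resample by grouping on the calendar month of each date. This is a sketch of an alternative, not the script's method: it relies on dt.to_period('M'), where 'M' is the monthly period alias, and it adds two illustrative columns (week_later, is_weekend) that are not in the original.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-01', '2023-01-15', '2023-02-01',
                            '2023-02-15', '2023-03-01']),
    'value': [100, 200, 150, 250, 300],
})

# Group rows by calendar month; the result is indexed by monthly Periods
monthly = df.groupby(df['date'].dt.to_period('M'))['value'].mean()

# Date arithmetic and component access work element-wise through .dt
df['week_later'] = df['date'] + pd.Timedelta(days=7)
df['is_weekend'] = df['date'].dt.dayofweek >= 5  # Monday=0 ... Sunday=6

print(monthly)
print(df)
```

Unlike resample, this grouping only returns months that actually occur in the data, so a gap in the dates does not produce an empty row.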

Applications:

1. Vectorized String Functions:


o Useful in cleaning, transforming, and analyzing text data.
o Often applied in natural language processing (NLP) tasks.
2. Datetime Operations:
o Essential for working with time-series data.
o Commonly used in financial analysis, forecasting, and trend analysis.

Key Insights:

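 Efficiency: Vectorized operations are more concise than looping over rows and are typically
much faster for numeric and datetime data; for the .str methods the main gains are conciseness
and consistent handling of missing values.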
 Flexibility: Pandas provides rich functionality for handling both textual and temporal data.
 Real-world Relevance: These operations are critical in preprocessing steps for data analysis and
machine learning pipelines.
