Data Science Using Python Lab Week8
# 1. Fitting to polynomials
import numpy as np

print("1. Fitting to polynomials:")
# Generate some noisy data: y = 3x^2 - 5x + 2 plus Gaussian noise (mean 0, std 10)
x = np.linspace(0, 10, 100)
y = 3 * x**2 - 5 * x + 2 + np.random.normal(0, 10, 100)
Explanation:
1. Import Libraries: numpy handles the numerical operations and matplotlib.pyplot the
visualization. scipy.linalg is imported as well, although this code does not use it.
2. Generate Noisy Data: A set of x-values is created, evenly spaced between 0 and 10.
Corresponding y-values are calculated using the quadratic equation y = 3x^2 - 5x + 2,
with random noise added to simulate real-world data variability. The noise is sampled
from a normal distribution with a mean of 0 and standard deviation of 10.
3. Polynomial Fitting: The script fits a quadratic polynomial to the noisy data. The fitting
process uses NumPy's polyfit function to determine the least-squares coefficients of the
best-fit polynomial. The coefficients are returned in descending powers of x, and a
polynomial object is created to evaluate the polynomial as a function of x.
4. Output Results: The coefficients of the fitted polynomial are displayed along with its
mathematical representation.
5. Visualization: A scatter plot is generated to show the noisy data points. The fitted
polynomial is displayed as a smooth red curve over the data points. The plot includes a
title, labeled axes, and a legend for clarity.
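The fitting step described above can be sketched as follows. This is a minimal illustration, not the lab's exact script: the fixed seed and the use of np.poly1d as the "polynomial object" are assumptions, and plotting is left out so the sketch runs without a display.

```python
import numpy as np

# Fixed seed so the sketch is reproducible (an assumption; the lab code sets no seed)
rng = np.random.default_rng(0)

# Noisy samples of y = 3x^2 - 5x + 2 with Gaussian noise (mean 0, std 10)
x = np.linspace(0, 10, 100)
y = 3 * x**2 - 5 * x + 2 + rng.normal(0, 10, 100)

# Least-squares fit of a degree-2 polynomial; coefficients come back highest power first
coeffs = np.polyfit(x, y, 2)
poly = np.poly1d(coeffs)

print("Fitted coefficients (x^2, x, constant):", coeffs)
print("Fitted value at x = 5:", poly(5.0))

# The root-mean-square residual should be close to the noise standard deviation
rms = np.sqrt(np.mean((y - poly(x)) ** 2))
print("RMS residual:", rms)
```

The leading coefficient should land near 3, while the constant term is usually the least accurate because the noise is large relative to it.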
Key Points:
The polynomial coefficients represent the best-fit quadratic equation for the data.
Adding random noise illustrates how polynomial fitting works despite imperfect data.
Visualization confirms that the polynomial approximates the noisy data effectively.
This example demonstrates the basics of regression analysis and curve fitting, widely used in
data science and scientific computing to model relationships within data.
print("Matrix A:")
print(A)
print("Eigenvalues:")
print(eigenvalues)
print("Eigenvectors:")
print(eigenvectors)
# Verify Av = λv
for i in range(len(eigenvalues)):
    print(f"Verification for eigenpair {i + 1}:")
    print(f"A * v = {np.dot(A, eigenvectors[:, i])}")
    print(f"λ * v = {eigenvalues[i] * eigenvectors[:, i]}")
This script demonstrates how to compute the eigenvalues and eigenvectors of a matrix and verify
the eigenvalue-eigenvector relationship using Python's NumPy library.
Purpose:
This concept is a cornerstone in linear algebra and has applications in various fields, including
physics, engineering, and data science.
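Since the definition of A is not shown above, here is a self-contained sketch of the same check. The matrix is a hypothetical example chosen for illustration, and np.linalg.eig is assumed to be the routine the lab uses.

```python
import numpy as np

# Hypothetical 2x2 matrix for illustration; the lab's own matrix A is not shown
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# Column i of `eigenvectors` pairs with eigenvalues[i]
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)

for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]
    # A @ v and lam * v should agree up to floating-point error
    print(f"Eigenpair {i + 1}: A v equals lambda v?", np.allclose(A @ v, lam * v))
```

Comparing with np.allclose rather than == avoids false mismatches caused by rounding in the last few digits.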
print("Matrix B:")
print(B)
print("U:")
print(U)
print("Singular values:")
print(s)
print("V transpose:")
print(Vt)
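# Verify B = U * S * V^T
# Build S as a float array: np.zeros_like(B) would copy B's dtype and
# truncate the singular values if B holds integers.
S = np.zeros(B.shape, dtype=float)
S[:s.shape[0], :s.shape[0]] = np.diag(s)
B_reconstructed = np.dot(U, np.dot(S, Vt))
print("Reconstructed B:")
print(B_reconstructed)
print("Is reconstruction close to original?", np.allclose(B, B_reconstructed))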
This script demonstrates how to decompose a matrix using Singular Value Decomposition
(SVD) and verify the decomposition using Python's NumPy library.
Explanation:
Purpose:
Applications of SVD:
SVD is a fundamental tool in numerical linear algebra and has a wide range of practical uses
across various fields.
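A compact version of the decomposition and its verification might look like the sketch below. The matrix B is a hypothetical example, and the reduced form (full_matrices=False) is used here as a simplification, which removes the need to pad S with zeros.

```python
import numpy as np

# Hypothetical 3x2 matrix for illustration; the lab's own B is not shown
B = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Reduced SVD: U is 3x2, s holds 2 singular values, Vt is 2x2
U, s, Vt = np.linalg.svd(B, full_matrices=False)

# Scaling U's columns by s is equivalent to U @ np.diag(s)
B_reconstructed = (U * s) @ Vt

print("Singular values:", s)
print("Reconstruction matches B?", np.allclose(B, B_reconstructed))
```

With the default full_matrices=True, U would be 3x3 and S would have to be built as a 3x2 array, as in the lab code.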
# Plot histograms
plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.hist(normal_dist, bins=30)
plt.title('Normal Distribution')
plt.subplot(122)
plt.hist(uniform_dist, bins=30)
plt.title('Uniform Distribution')
plt.tight_layout()
plt.show()
# Bonus: Solving a system of linear equations
print("\n5. Bonus: Solving a system of linear equations:")
C = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
x = np.linalg.solve(C, b)
print("Matrix C:")
print(C)
print("Vector b:")
print(b)
print("Solution x:")
print(x)
print("Verification Cx = b:")
print(np.dot(C, x))
This script demonstrates two tasks: generating random numbers from different distributions and
solving a system of linear equations using Python.
Explanation:
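The two tasks can be sketched together as follows. The lines that create normal_dist and uniform_dist are not shown above, so the sample size, seed, and distribution parameters here are assumptions; the linear system is the one from the lab code.

```python
import numpy as np

# Sample size, seed, and distribution parameters are assumptions for this sketch
rng = np.random.default_rng(42)
normal_dist = rng.normal(loc=0.0, scale=1.0, size=1000)   # mean 0, std 1
uniform_dist = rng.uniform(low=0.0, high=1.0, size=1000)  # values in [0, 1)

print("Normal sample mean/std:", normal_dist.mean(), normal_dist.std())
print("Uniform sample min/max:", uniform_dist.min(), uniform_dist.max())

# The 2x2 system from the lab code: 3x + y = 9 and x + 2y = 8
C = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])
solution = np.linalg.solve(C, b)
print("Solution:", solution)  # x = 2, y = 3 satisfies both equations
print("Check C @ solution:", C @ solution)
```

np.linalg.solve is generally preferable to computing the inverse of C explicitly, since it is faster and numerically more stable.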
print(df1)
print(df)
print("\nGrouped by 'A', sum of 'C':")
print(df.groupby('A')['C'].sum())
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
print("Original DataFrame:")
print(df)
# Sum column D for each (A, B) row and each value of C
pivot = pd.pivot_table(df, values='D', index=['A', 'B'],
                       columns=['C'], aggfunc='sum')
print("\nPivot Table:")
print(pivot)
1. Creating pandas DataFrames:
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2
3  A3  B3  C3  D3
2. Grouping DataFrames:
     A      B         C         D
0  foo    one -0.140529 -2.309286
1  bar    one -1.356555 -0.198256
2  foo    two -0.374576  0.716094
3  bar  three -0.956765  1.453538
4  foo    two  0.042537 -1.113060
5  bar    two -0.532235  0.167942
6  foo    one -0.073184  0.329067
7  foo  three  0.239322 -1.239340
3. Joining DataFrames:
Left DataFrame:
  key   A   B
0  K0  A0  B0
1  K1  A1  B1
2  K2  A2  B2
3  K3  A3  B3
Right DataFrame:
  key   C   D
0  K0  C0  D0
1  K1  C1  D1
2  K2  C2  D2
3  K3  C3  D3
Merged DataFrame:
  key   A   B   C   D
0  K0  A0  B0  C0  D0
1  K1  A1  B1  C1  D1
2  K2  A2  B2  C2  D2
3  K3  A3  B3  C3  D3
4. Pandas Series:
Pandas Series:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Pivot Table:
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0
This script demonstrates various functionalities of the pandas library, such as creating
DataFrames, grouping and joining DataFrames, working with pandas Series, and creating pivot
tables.
Explanation:
1. Creating pandas DataFrames:
From a Dictionary:
o A simple DataFrame is created using a dictionary where keys represent column names
and values are the data for those columns.
From a NumPy Array:
o A DataFrame is created with random data generated using np.random.randn and
indexed by a range of dates created with pd.date_range.
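Both construction routes can be sketched as below. The dictionary values mirror the sample output shown earlier; the start date, the number of rows, and the column names W to Z for the array-based frame are hypothetical choices.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names (values mirror the earlier sample output)
df1 = pd.DataFrame({
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3'],
})

# From a NumPy array: start date, 6 rows, and columns W-Z are hypothetical choices
dates = pd.date_range('2023-01-01', periods=6)
df2 = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('WXYZ'))

print(df1)
print(df2)
```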
2. Grouping DataFrames:
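The grouping call from the script, df.groupby('A')['C'].sum(), can be reproduced with the sketch below. The labels and numbers are copied from the sample output shown earlier; the lab presumably generates columns C and D randomly, so its own values will differ from run to run.

```python
import pandas as pd

# Labels and values copied from the earlier sample output (random in the lab itself)
df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [-0.140529, -1.356555, -0.374576, -0.956765,
          0.042537, -0.532235, -0.073184, 0.239322],
    'D': [-2.309286, -0.198256, 0.716094, 1.453538,
          -1.113060, 0.167942, 0.329067, -1.239340],
})

# Split the rows by the value in A, then sum column C within each group
sum_c = df.groupby('A')['C'].sum()
print(sum_c)
```

The result is a Series indexed by the distinct values of A, here 'bar' and 'foo'.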
3. Joining DataFrames:
Merging:
o Two DataFrames (left and right) are merged on a common key column (key).
o The result is a combined DataFrame containing all matching rows based on the key.
This demonstrates relational operations similar to SQL joins, essential for integrating datasets.
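A minimal sketch of the merge, using the left and right frames from the sample output. The call pd.merge(left, right, on='key') is assumed to be what the script uses; it performs an inner join by default.

```python
import pandas as pd

left = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'A': ['A0', 'A1', 'A2', 'A3'],
    'B': ['B0', 'B1', 'B2', 'B3'],
})
right = pd.DataFrame({
    'key': ['K0', 'K1', 'K2', 'K3'],
    'C': ['C0', 'C1', 'C2', 'C3'],
    'D': ['D0', 'D1', 'D2', 'D3'],
})

# Inner join on the shared 'key' column: only keys present in both frames survive
merged = pd.merge(left, right, on='key')
print(merged)
```

Passing how='left', how='right', or how='outer' would keep unmatched rows as well, filling the missing columns with NaN.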
4. Pandas Series:
Creating a Series:
o A pandas Series is created, which is a one-dimensional array-like object capable of
holding any data type.
Extracting a Column as Series:
o A specific column from a DataFrame is accessed as a Series.
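Both Series operations can be sketched as follows. The Series values match the sample output; the small DataFrame used for column extraction is a hypothetical stand-in, since the lab's own frame is not shown at this point.

```python
import numpy as np
import pandas as pd

# The NaN entry forces the whole Series to dtype float64
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)

# Hypothetical DataFrame used only to show column extraction
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4.0, 5.0, 6.0]})
col = df['A']  # selecting a single column returns a Series
print(type(col).__name__)
```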
5. Creating Pivot Tables:
Pivot Table:
o A pivot table is created to summarize data. It groups data by the specified indices (A and
B) and columns (C) and computes the sum of values in column D.
o The pivot table provides a multi-dimensional representation of data.
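The sketch below builds a pivot table of this shape. The input data is an assumption: it is a small example chosen because it is consistent with the pivot output shown earlier, not necessarily the lab's own frame.

```python
import pandas as pd

# Assumed input data, chosen to be consistent with the earlier pivot output
df = pd.DataFrame({
    'A': ['foo', 'foo', 'foo', 'foo', 'foo', 'bar', 'bar', 'bar', 'bar'],
    'B': ['one', 'one', 'one', 'two', 'two', 'one', 'one', 'two', 'two'],
    'C': ['small', 'large', 'large', 'small', 'small',
          'large', 'small', 'small', 'large'],
    'D': [1, 2, 2, 3, 3, 4, 5, 6, 7],
})

# Rows are (A, B) pairs, columns are the values of C, cells hold the sum of D
pivot = pd.pivot_table(df, values='D', index=['A', 'B'],
                       columns=['C'], aggfunc='sum')
print(pivot)
```

The (foo, two, large) cell is NaN because no input row has that combination; passing fill_value=0 would replace it with zero.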
Key Concepts:
1. DataFrame Creation:
o DataFrames are the core structure in pandas for handling tabular data.
o They can be created from dictionaries, NumPy arrays, or other data structures.
2. Grouping:
o Enables analysis of aggregated metrics for subsets of data.
3. Joining/Merging:
o Combines multiple datasets based on common keys, useful for integrating and
comparing data.
4. Series:
o Acts as a single column or 1D array, foundational for creating and manipulating
DataFrames.
5. Pivot Tables:
o A powerful way to summarize and organize complex datasets, commonly used in data
analysis.
This script highlights key pandas features that are critical for data preprocessing, analysis, and
summarization.
# 10. (b) Using vectorized string functions with Pandas data frames
print("Original DataFrame:")
print(df)
print("\nUppercase:")
print(df['text'].str.upper())
print("\nLowercase:")
print(df['text'].str.lower())
print("\nString length:")
print(df['text'].str.len())
print("\nSplit by space:")
print(df['text'].str.split())
df['date'] = pd.to_datetime(df['date'])
print(df)
print("\nExtract month:")
print(df['date'].dt.month)
Uppercase:
0 PYTHON
1 PANDAS
2 DATA SCIENCE
3 ML & AI
Name: text, dtype: object
Lowercase:
0 python
1 pandas
2 data science
3 ml & ai
Name: text, dtype: object
String length:
0 6
1 6
2 12
3 7
Name: text, dtype: int64
Split by space:
0 [python]
1 [PANDAS]
2 [Data, Science]
3 [ML, &, AI]
Name: text, dtype: object
Extract month:
0 1
1 1
2 2
3 2
4 3
Name: date, dtype: int32
Monthly average:
date
2023-01-31 150.0
2023-02-28 200.0
2023-03-31 300.0
Freq: ME, Name: value, dtype: float64
This script demonstrates how to use vectorized string functions and work with datetime data in
pandas. These operations are essential for text preprocessing and time-series analysis in data
science workflows.
Explanation:
1. Original DataFrame:
o Contains a column text with sample strings such as "python", "PANDAS", etc.
2. String Operations:
o Uppercase and Lowercase:
Convert text to all uppercase or lowercase using str.upper() and
str.lower() respectively.
o String Length:
Compute the length of each string using str.len().
o Splitting Strings:
Split each string into a list of words based on spaces using str.split().
o Replace Substrings:
Replace & with and in each string using str.replace().
o Contains Substring:
Check if each string contains the letter "a" (case-insensitive) using
str.contains().
o Extract Pattern:
Extract words starting with "P" using a regular expression in str.extract().
These operations highlight pandas' ability to efficiently perform string manipulations directly on
columns without the need for loops.
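The string operations listed above can be sketched on the same four sample strings. The exact arguments for replace, contains, and extract are not shown in the lab code, so the ones below (including the regular expression) are assumptions.

```python
import pandas as pd

# Sample strings taken from the output shown earlier
df = pd.DataFrame({'text': ['python', 'PANDAS', 'Data Science', 'ML & AI']})

upper = df['text'].str.upper()
lengths = df['text'].str.len()
words = df['text'].str.split()

# Assumed arguments: literal replacement, case-insensitive search, and a
# regex capturing a word that starts with an uppercase "P"
replaced = df['text'].str.replace('&', 'and', regex=False)
has_a = df['text'].str.contains('a', case=False)
starts_p = df['text'].str.extract(r'\b(P\w*)', expand=False)

print(replaced)
print(has_a)
print(starts_p)
```

Rows with no match in str.extract come back as NaN rather than raising an error, which keeps the result aligned with the original index.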
1. Datetime Conversion:
o The date column is converted to datetime format using pd.to_datetime(), allowing
for datetime-specific operations.
2. Extracting Components:
o Month:
Extract the month from each date using dt.month.
o Day of Week:
Extract the name of the day using dt.day_name().
3. Date Arithmetic:
o Add 7 days to each date using pd.Timedelta(days=7).
4. Resampling:
o The data is resampled to a monthly frequency using resample('M') (spelled
resample('ME') from pandas 2.2 onward, where the 'M' alias is deprecated), calculating
the average value for each month.
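The datetime steps can be sketched end to end as below. The dates and values are hypothetical, picked to be consistent with the month and monthly-average output shown earlier. The try/except around resample is an assumption about version handling: the month-end alias is 'ME' in pandas 2.2 and later and 'M' in older releases.

```python
import pandas as pd

# Hypothetical dates and values, consistent with the output shown earlier
df = pd.DataFrame({
    'date': ['2023-01-01', '2023-01-15', '2023-02-01', '2023-02-15', '2023-03-01'],
    'value': [100, 200, 150, 250, 300],
})
df['date'] = pd.to_datetime(df['date'])

months = df['date'].dt.month                   # month number of each date
day_names = df['date'].dt.day_name()           # weekday name of each date
plus_week = df['date'] + pd.Timedelta(days=7)  # shift every date by one week

# Resampling needs a datetime index
series = df.set_index('date')['value']
try:
    monthly_avg = series.resample('ME').mean()  # pandas 2.2 and later
except ValueError:
    monthly_avg = series.resample('M').mean()   # older pandas versions
print(monthly_avg)
```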
Applications:
Key Insights:
Efficiency: Vectorized operations are significantly faster and more concise than looping over
rows.
Flexibility: Pandas provides rich functionality for handling both textual and temporal data.
Real-world Relevance: These operations are critical in preprocessing steps for data analysis and
machine learning pipelines.