
JJKJK

Uploaded by

Akshaya R

Subject Code : CS3352-Foundations of Data science (2021R)

Semester & Year : Odd Semester (2024-25)


Unit–IV Python Libraries for Data Wrangling
Part-A

S.No Questions
1 Under what circumstances is pivot_table() in pandas used?
The `pivot_table()` in pandas is used to summarize and aggregate data by transforming it into a new
table format, where rows and columns represent unique values from the original dataset. It is especially
useful for creating summaries, such as totals, averages, or counts, across different categories or
combinations of categories.
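A minimal sketch of such a summary is given below. The `sales` table and its column names are hypothetical, chosen only to illustrate the call.

```python
import pandas as pd

# Hypothetical sales records (illustrative data, not from the question bank)
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "product": ["A", "B", "A", "B", "A", "A"],
    "amount":  [100, 150, 200, 250, 50, 300],
})

# Rows = region, columns = product, cells = total amount for each combination
summary = pd.pivot_table(
    sales, values="amount", index="region", columns="product", aggfunc="sum"
)
print(summary)
```

Changing `aggfunc` to "mean" or "count" gives averages or counts for the same row and column layout.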
2 Using appropriate data visualization modules, develop a Python code snippet that generates a simple sinusoidal wave on empty gridded axes.
import numpy as np
import matplotlib.pyplot as plt

# Generate x values (from 0 to 4π)
x = np.linspace(0, 4 * np.pi, 100)
# Generate y values (sinusoidal wave)
y = np.sin(x)

# Create a plot
fig, ax = plt.subplots()

# Plot the sinusoidal wave
ax.plot(x, y, label='sin(x)', color='blue')

# Add grid to the axes
ax.grid(True)

# Add labels and title
ax.set_xlabel('x')
ax.set_ylabel('sin(x)')
ax.set_title('Simple Sinusoidal Wave')

# Add the legend and display the plot
plt.legend()
plt.show()
3 What are the key properties of Pearson Correlation Coefficient?
The key properties of the Pearson Correlation Coefficient are:
1. Range: The coefficient ranges between -1 and 1.
+1 indicates a perfect positive linear relationship.
-1 indicates a perfect negative linear relationship.
0 indicates no linear relationship.
2. Symmetry: Symmetry in correlation means that the relationship between two variables X and Y is
the same no matter the order; that is, corr(X, Y) is equal to corr(Y, X)

3. No Units: It is a dimensionless measure, meaning it does not depend on the units of the variables.
This makes it easier to compare across different datasets.

4. Linear Relationship: The coefficient only measures the strength and direction of a linear
relationship between two variables. Non-linear relationships may not be accurately represented.

5. Sensitivity to Outliers: Pearson correlation is sensitive to outliers, which can significantly affect the
value of the coefficient.

6. Assumes Continuous Variables: It assumes that both variables are continuous and normally
distributed. However, it can still be used for non-normally distributed data, but interpretation should be
done cautiously.
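Several of these properties can be checked numerically. The sketch below uses `np.corrcoef` on small made-up arrays; the data values are assumptions for illustration only.

```python
import numpy as np

# Made-up, roughly linear data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_xy = np.corrcoef(x, y)[0, 1]
r_yx = np.corrcoef(y, x)[0, 1]                    # symmetry: same value
r_units = np.corrcoef(x * 100.0, y + 7.0)[0, 1]   # rescaling/shifting: same value

# A single outlier can change the coefficient drastically
y_outlier = y.copy()
y_outlier[-1] = -50.0
r_outlier = np.corrcoef(x, y_outlier)[0, 1]

# A perfect but non-linear (quadratic) relationship gives r close to 0
x_sym = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
r_quadratic = np.corrcoef(x_sym, x_sym ** 2)[0, 1]

print(r_xy, r_yx, r_units, r_outlier, r_quadratic)
```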
4 Summarize some built-in Pandas aggregations.
Pandas provides several built-in aggregation functions that are useful for summarizing data. Here are
some common ones:
1. sum() - Calculates the sum of values along the specified axis.
2. mean() - Computes the average of values.
3. median() - Finds the median (middle value) of the data.
4. min() / max() - Returns the minimum or maximum value.
5. count() - Counts the number of non-missing values.
6. std() - Calculates the standard deviation, showing how much the data deviates from the mean.
7. var() - Computes the variance, measuring the spread of data.
8. prod() - Returns the product of values.
9. mode() - Identifies the most common value(s) in the dataset.
10. describe() - Generates descriptive statistics, including count, mean, std, min, quartiles, and max.
11. agg() - Allows applying multiple aggregations to a DataFrame or Series using different functions.

These functions can be used directly on DataFrames or Series and are useful for data analysis,
exploration, and summarization.
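A short sketch of a few of these aggregations follows. The DataFrame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical scores (illustrative data)
df = pd.DataFrame({"key": ["A", "B", "A", "B"], "score": [10, 20, 30, 40]})

total = df["score"].sum()        # 100
average = df["score"].mean()     # 25.0
middle = df["score"].median()    # 25.0
n_values = df["score"].count()   # 4 non-missing values

# agg() applies several aggregations in one call
summary = df["score"].agg(["min", "max", "mean"])

# Aggregations are commonly combined with groupby()
per_key = df.groupby("key")["score"].agg(["sum", "mean"])

print(total, average, middle, n_values)
print(summary)
print(per_key)
```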
5 State the advantages of using NumPy arrays.
NumPy arrays offer several advantages:

1. Speed: Faster than Python lists due to efficient memory usage and C-based implementation.
2. Vectorization: Supports element-wise operations without loops, enabling quick computations.
3. Mathematical Functions: Provides many built-in functions for easy numerical operations.
4. Multi-Dimensional Support: Easily handles arrays of multiple dimensions (e.g., matrices).
5. Integration: Compatible with libraries like Pandas, SciPy, and scikit-learn.
6. Boolean Indexing: Efficiently filter and manipulate data based on conditions.
7. Cross-Platform: Works seamlessly across different systems.

These benefits make NumPy ideal for data analysis, scientific computing, and machine learning.
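The vectorization, multi-dimensional, and Boolean indexing points can be sketched in a few lines. The array values are arbitrary.

```python
import numpy as np

values = np.array([3, 8, 1, 9, 4])

# Vectorization: one expression instead of an explicit Python loop
doubled = values * 2

# Boolean indexing: keep only the elements that satisfy a condition
large = values[values > 3]

# Multi-dimensional support: view the first four elements as a 2x2 matrix
matrix = values[:4].reshape(2, 2)

print(doubled)   # [ 6 16  2 18  8]
print(large)     # [8 9 4]
print(matrix.shape)
```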
6 Outline the two types of NumPy's UFuncs.
NumPy's Universal Functions (UFuncs) are of two main types:
1. Unary UFuncs
Definition: These operate on a single input array element-wise.
Examples:
np.sqrt(): Computes the square root of each element.
np.exp(): Calculates the exponential (e^x) for each element.
np.sin(), np.cos(), etc.: Trigonometric functions applied to each element.
np.abs(): Returns the absolute value of each element.

2. Binary UFuncs
Definition: These operate on two input arrays element-wise.
Examples:
np.add(): Adds corresponding elements of two arrays.
np.subtract(): Subtracts elements of one array from another.
np.multiply(): Multiplies elements of two arrays.
np.maximum(), np.minimum(): Finds the maximum or minimum between corresponding
elements.

These UFuncs enable fast, element-wise operations on arrays, making computations efficient and
straightforward.
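A brief sketch of one unary and two binary UFuncs is shown below; the input arrays are arbitrary. The `nin` attribute of a ufunc reports how many inputs it takes.

```python
import numpy as np

a = np.array([1.0, 4.0, 9.0])
b = np.array([2.0, 2.0, 10.0])

roots = np.sqrt(a)         # unary: one input array
total = np.add(a, b)       # binary: two input arrays (same as a + b)
larger = np.maximum(a, b)  # binary: element-wise maximum

print(roots)    # [1. 2. 3.]
print(total)    # [ 3.  6. 19.]
print(larger)   # [ 2.  4. 10.]
print(np.sqrt.nin, np.add.nin)   # 1 2
```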
7 List the attributes of a NumPy array. Give an example of each.
ndarray.ndim - Returns the number of dimensions (axes) of the array.
Ex: import numpy as np
arr = np.array([[1, 2, 3], [4, 5, 6]])
print(arr.ndim) # Output: 2
ndarray.shape - Gives the shape of the array as a tuple, showing the size along each dimension.
Ex: print(arr.shape) # Output: (2, 3)
ndarray.size - Returns the total number of elements in the array.
Ex: print(arr.size) # Output: 6
ndarray.dtype - Displays the data type of the array elements.
Ex: print(arr.dtype) # Output: int64 (or similar, depending on the system)
8 Create a DataFrame with key and data pairs A-10, B-20, A-40, C-5, B-10, C-10. Find the sum for each
key and display the result grouped by key.
import pandas as pd
# Create the data
data = {'Key': ['A', 'B', 'A', 'C', 'B', 'C'],
'Value': [10, 20, 40, 5, 10, 10]}
# Create DataFrame
df = pd.DataFrame(data)
# Group by 'Key' and sum the values
grouped_sum = df.groupby('Key').sum()
# Display the result
print(grouped_sum)
9 Define Dictionary in python.
A dictionary in Python is a collection of key-value pairs, where each key is associated with a specific
value. It is defined using curly braces `{}`. Keys must be unique and hashable (typically immutable types
such as strings, numbers, or tuples), and values can be accessed, modified, or removed using their
corresponding keys.
Example:
student = {"name": "Alice", "age": 25}
print(student["name"])
# Output: Alice
10 What is Hierarchical data in a dataframe?
Hierarchical data in a DataFrame refers to data organized with multiple levels of indexing, known as
a MultiIndex. This allows for a nested structure where rows or columns are grouped by multiple
criteria (e.g., Year and Region). It is useful for representing and analyzing complex datasets.

Example:
import pandas as pd

data = {'Sales': [150, 200, 250, 300]}


index = pd.MultiIndex.from_tuples([('2022', 'North'), ('2022', 'South'), ('2023', 'North'), ('2023',
'South')])
df = pd.DataFrame(data, index=index)
print(df)

Output:
            Sales
2022 North    150
     South    200
2023 North    250
     South    300

11 Explain grouping in Python.


Grouping in Python, especially with pandas, involves combining data based on one or more keys to
perform aggregate operations. The `groupby()` function is used to split the data, apply functions like
`sum()`, `mean()`, or `count()`, and then combine the results. It helps in summarizing data efficiently.
Example:
import pandas as pd

data = {'Product': ['A', 'A', 'B', 'B'], 'Sales': [100, 150, 200, 250]}
df = pd.DataFrame(data)

grouped = df.groupby('Product').sum()
print(grouped)

Output:

         Sales
Product
A          250
B          450
12 What is data indexing?
Data indexing in Python, particularly with pandas, refers to the process of selecting, accessing, and
organizing data within a DataFrame or Series. An index acts as a label to identify rows or columns,
allowing you to retrieve data efficiently. Indexing helps in quick data lookup, slicing, and filtering.
Key Points:
- The row index is used to identify and access rows.
- The column index is used to identify and access columns.
- Indices can be single-level (simple) or multi-level (hierarchical).
Example:
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data, index=['ID1', 'ID2', 'ID3'])
# Accessing data using index
print(df.loc['ID2']) # Access row with index 'ID2'
Output:
Name    Bob
Age      30
Name: ID2, dtype: object
13 What is broadcasting? State the rules of broadcasting.
Broadcasting in Python, especially with NumPy, refers to the ability of NumPy to perform element-
wise operations on arrays of different shapes by "stretching" the smaller array to match the shape of
the larger one without actually copying data. This feature makes it easy to perform arithmetic
operations on arrays of varying dimensions.

Rules of Broadcasting:
1. Same Shape: If the two arrays have the same shape, they are compatible for element-wise
operations.
2. One of the Dimensions is 1: If the shapes differ in some dimension, NumPy can still operate
on the arrays when one of the two sizes in that dimension is `1`. The array with size `1` is "stretched"
(repeated) along that dimension to match the other array. If the sizes differ and neither is `1`,
NumPy raises an error.
3. Different Numbers of Dimensions: If the two arrays have different numbers of dimensions, the shape
of the array with fewer dimensions is padded with `1`s on its left side until both shapes have the same length.

Example:
import numpy as np
# Array A: shape (3, 1)
A = np.array([[1], [2], [3]])
# Array B: shape (3,), treated as (1, 3) during broadcasting
B = np.array([10, 20, 30])
# Broadcasting: resulting shape (3, 3)
result = A * B
print(result)
Output:
[[10 20 30]
[20 40 60]
[30 60 90]]

Here, `A` is stretched horizontally, and `B` is stretched vertically to perform element-wise
multiplication.
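The rules can also be checked on shapes alone. The sketch below is illustrative: it shows a one-dimensional array being padded and stretched, and an incompatible pair of shapes raising `ValueError`.

```python
import numpy as np

a = np.ones((3, 1))   # shape (3, 1)
b = np.arange(3)      # shape (3,), padded on the left to (1, 3)

result = a + b        # both arrays are stretched to shape (3, 3)
print(result.shape)   # (3, 3)

# Sizes 3 and 4 differ and neither is 1, so broadcasting fails
try:
    np.ones(3) + np.ones(4)
    compatible = True
except ValueError:
    compatible = False
print(compatible)     # False
```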
14 Enumerate the operations on missing data.
Operations on missing data in Python, especially with pandas, include methods to detect, handle, and
fill missing values. Common operations include:
1. Detection of Missing Data:
- `isnull()`: Returns `True` for missing values and `False` for non-missing values.
- `notnull()`: Returns `True` for non-missing values and `False` for missing values.
import pandas as pd
data = {'Name': ['Alice', 'Bob', None], 'Age': [25, None, 30]}
df = pd.DataFrame(data)
print(df.isnull())
2. Removing Missing Data:
- `dropna()`: Removes rows that contain missing values (the default, `axis=0`).
- `dropna(axis=1)`: Removes columns that contain missing values.
df_cleaned = df.dropna() # Removes rows with any missing values
3. Filling Missing Data:
- `fillna(value)`: Replaces missing values with a specified value.
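- `ffill()`: Forward fills missing values using the previous non-missing value (older code writes `fillna(method='ffill')`, which recent pandas versions deprecate).
- `bfill()`: Backward fills missing values using the next non-missing value (older code writes `fillna(method='bfill')`).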
df_filled = df.fillna(0) # Replaces missing values with 0
4. Replacing Missing Data with Statistical Values:
- `df.fillna(df.mean(numeric_only=True))`: Replaces missing values in each numeric column with that column's mean.
- Other functions like `median()`, `mode()`, or custom logic can also be used.
df_filled = df.fillna({'Age': df['Age'].mean()}) # Replace missing values in 'Age' with the column mean
These operations allow effective handling of missing data to ensure accurate analysis and
computations.
15 What is NumPy in Python used for? Write a Python program to create an array.
NumPy is a Python library used for working with arrays.
import numpy as np
np.array([1, 4, 2, 5, 3])
OUTPUT: array([1, 4, 2, 5, 3])

16 Write the output of the following numpy code


i. np.array([3.14, 4, 2, 3])
ii. np.array([1, 2, 3, 4], dtype='float32')
iii. np.array([range(i, i + 3) for i in [2, 4, 6]])
iv. np.zeros(10, dtype=int)
v. np.ones((3, 5), dtype=float)
vi. np.full((3, 5), 3.14)


vii. np.arange(0, 20, 2)
viii. np.linspace(0, 1, 5)
ix. np.random.random((3, 3))
x. np.random.normal(0, 1, (3, 3))
SOLUTION:
i. array([3.14, 4. , 2. , 3. ])
ii. array([1., 2., 3., 4.], dtype=float32)
iii. array([[2, 3, 4],
[4, 5, 6],
[6, 7, 8]])
iv. array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
v. array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])

vi. array([[3.14, 3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14, 3.14],
[3.14, 3.14, 3.14, 3.14, 3.14]])

vii. array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])


viii. array([0. , 0.25, 0.5 , 0.75, 1. ])
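ix. The values are random, so the output differs on each run. For example:
array([[0.15175932, 0.58606546, 0.68749167],
[0.75655189, 0.23198831, 0.94250833],
[0.95147981, 0.50405388, 0.22004745]])
x. The values are drawn from a standard normal distribution and also differ on each run. For example:
array([[-0.98220551, -0.27991827, -1.58428463],
[-0.99791504, 0.10710667, -1.15115236],
[ 0.76783606, -0.83683471, -0.07508393]])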
17 Define Series Object.
A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array.
Example:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0])
print(data)
The output will be displayed as

0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
18 What is a DataFrame?
A DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible
column names. It can also be thought of as a sequence of aligned Series objects.
Example:
# population and area are assumed to be Series indexed by state name
states = pd.DataFrame({'population': population, 'area': area})
print(states)
Output:

            population    area
California    38332521  423967
Florida       19552860  170312
Illinois      12882135  149995
New York      19651127  141297
Texas         26448193  695662
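The example above uses `population` and `area` without defining them. A self-contained sketch is given below, assuming both are Series indexed by state name and built from the values shown in the output.

```python
import pandas as pd

# Assumed definitions, reconstructed from the values in the output table
area = pd.Series({
    "California": 423967, "Florida": 170312, "Illinois": 149995,
    "New York": 141297, "Texas": 695662,
})
population = pd.Series({
    "California": 38332521, "Florida": 19552860, "Illinois": 12882135,
    "New York": 19651127, "Texas": 26448193,
})

# Each Series becomes one column; rows are aligned on the shared index
states = pd.DataFrame({"population": population, "area": area})
print(states)
```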

19 What are indexers?


Pandas provides special indexer attributes that explicitly expose indexing schemes. They are
loc and iloc; older versions also provided ix.
loc attribute - allows indexing and slicing that always references the explicit index.
iloc attribute - allows indexing and slicing that always references the implicit Python-style index.
ix - was a hybrid of the two, and for Series objects was equivalent to standard []-based indexing. It was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use loc or iloc.
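The difference between the explicit and implicit schemes is easiest to see on a Series with an integer index. The sketch below uses made-up values.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = data.loc[1]          # explicit index: label 1 -> 'a'
by_position = data.iloc[1]      # implicit index: position 1 -> 'b'

label_slice = data.loc[1:3]     # label slices include the end label
position_slice = data.iloc[1:3] # position slices exclude the end position

print(by_label, by_position)
print(label_slice.tolist())     # ['a', 'b']
print(position_slice.tolist())  # ['b', 'c']
```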

20 What is Data Wrangling?


Data Wrangling, also known as Data Munging, is the process of gathering and
transforming raw data into a more usable format for better understanding, decision-making, and
analysis. This process is crucial in data science projects as it helps in cleaning and organizing data,
making it ready for analysis and modelling.
21 Explain the functionalities of data wrangling in Python.
1. Data exploration: In this process, the data is studied, analyzed, and understood by visualizing
representations of data.
2. Dealing with missing values: Most large datasets contain missing values (NaN). They need to be
handled by replacing them with the mean, the mode (the most frequent value) of the column, or
simply by dropping the rows that contain a NaN value.
3. Reshaping data: In this process, data is manipulated according to the requirements, where new
data can be added or pre-existing data can be modified.
4. Filtering data: Sometimes datasets contain unwanted rows or columns, which need to be
removed or filtered out.
5. Other: After processing the raw dataset with the above functionalities, we get an efficient
dataset as per our requirements, which can then be used for purposes such as data
analysis, machine learning, data visualization, model training, etc.
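The steps above can be sketched on a tiny table. The DataFrame, its column names, and the age threshold are all hypothetical and serve only as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data with one missing age and one unwanted column
raw = pd.DataFrame({
    "name":  ["Alice", "Bob", "Charlie", "Dana"],
    "age":   [25, np.nan, 35, 40],
    "notes": ["x", "y", "z", "w"],
})

# 1. Data exploration: quick summary statistics
print(raw.describe())

# 2. Dealing with missing values: replace NaN with the column mean
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].mean())

# 3. Reshaping data: add a new derived column
clean = clean.assign(age_next_year=clean["age"] + 1)

# 4. Filtering data: keep rows with age above 30 and drop an unwanted column
filtered = clean.loc[clean["age"] > 30].drop(columns=["notes"])
print(filtered)
```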

22 How can a pandas DataFrame be constructed?
i) From a single Series object
ii) From a list of dicts
iii) From a dictionary of Series objects
iv) From a two-dimensional NumPy array

i) From a Single Series Object


A DataFrame can be created from a single `Series` by passing it to the `pd.DataFrame` constructor.
import pandas as pd

# Creating a Series
s = pd.Series([1, 2, 3, 4])

# Creating a DataFrame from the Series


df_from_series = pd.DataFrame(s, columns=['Values'])
print(df_from_series)

ii) From a List of Dicts


A DataFrame can be created from a list of dictionaries. Each dictionary represents a row, and keys
represent column names.
data = [
{'Name': 'Alice', 'Age': 25},
{'Name': 'Bob', 'Age': 30},
{'Name': 'Charlie', 'Age': 35}
]

df_from_list_of_dicts = pd.DataFrame(data)
print(df_from_list_of_dicts)
iii) From a Dictionary of Series Objects
Each `Series` in the dictionary corresponds to a column, with keys serving as column names.
data = {
'Name': pd.Series(['Alice', 'Bob', 'Charlie']),
'Age': pd.Series([25, 30, 35])
}

df_from_dict_of_series = pd.DataFrame(data)
print(df_from_dict_of_series)
iv) From a Two-Dimensional NumPy Array
A two-dimensional NumPy array can be passed to create a DataFrame, along with optional row and
column labels.
import numpy as np

# Creating a NumPy array


array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Creating a DataFrame from the NumPy array


df_from_numpy_array = pd.DataFrame(array, columns=['A', 'B', 'C'])
print(df_from_numpy_array)

v) From a NumPy Structured Array


If the NumPy array is structured, it can also be used to create a DataFrame.
# Creating a structured array
structured_array = np.array(
[('Alice', 25), ('Bob', 30), ('Charlie', 35)],
dtype=[('Name', 'U10'), ('Age', 'i4')]
)

# Creating a DataFrame from the structured array


df_from_structured_array = pd.DataFrame(structured_array)
print(df_from_structured_array)
