Data Toolkit Assignment


Data Toolkit Practice Questions

Questions
1. Demonstrate three different methods for creating identical 2D
arrays in NumPy. Provide the code for each method and the final
output after each method.

2. Using a NumPy function, generate an array of 100 evenly
spaced numbers between 1 and 10, then reshape that 1D array
into a 2D array.

3. Explain the following terms.


● The difference in np.array, np.asarray and np.asanyarray.
● The difference between Deep copy and shallow copy.

4. Generate a 3x3 array with random floating-point numbers
between 5 and 20. Then, round each number in
the array to 2 decimal places.

5. Create a NumPy array with random integers between 1 and 10 of
shape (5, 6). After creating the array,
perform the following operations:

a) Extract all even integers from the array.

b) Extract all odd integers from the array.

6. Create a 3D NumPy array of shape (3, 3, 3) containing random
integers between 1 and 10. Perform the
following operations:

a) Find the indices of the maximum values along each depth
level (third axis).

b) Perform element-wise multiplication between both arrays.


7. Clean and transform the 'Phone' column in the sample dataset to
remove non-numeric characters and
convert it to a numeric data type. Also display the table attributes
and data types of each column.

8. Perform the following tasks using the people dataset:

a) Read the 'data.csv' file using pandas, skipping the first 50
rows.

b) Only read the columns: 'Last Name', 'Gender', 'Email', 'Phone'
and 'Salary' from the file.

c) Display the first 10 rows of the filtered dataset.

d) Extract the 'Salary' column as a Series and display its last 5
values.

9. Filter and select rows from the People_Dataset, where the 'Last
Name' column contains the name 'Duke', the
'Gender' column contains the word 'Female' and 'Salary' is
less than 85000.

10. Create a 7x5 DataFrame in Pandas using a Series generated from
35 random integers between 1 and 6.

11. Create two different Series, each of length 50, with the following
criteria:

a) The first Series should contain random numbers ranging from 10
to 50.

b) The second Series should contain random numbers ranging
from 100 to 1000.

c) Create a DataFrame by joining these Series by column, and
change the names of the columns to 'col1', 'col2',
etc.

12. Perform the following operations using people data set:

a) Delete the 'Email', 'Phone', and 'Date of birth' columns from the
dataset.

b) Delete the rows containing any missing values.

d) Print the final output also.

13. Create two NumPy arrays, x and y, each containing 100 random
float values between 0 and 1. Perform the
following tasks using Matplotlib and NumPy:

a) Create a scatter plot using x and y, setting the color of the points
to red and the marker style to 'o'.

b) Add a horizontal line at y = 0.5 using a dashed line style and
label it as 'y = 0.5'.

c) Add a vertical line at x = 0.5 using a dotted line style and label it
as 'x = 0.5'.

d) Label the x-axis as 'X-axis' and the y-axis as 'Y-axis'.

e) Set the title of the plot as 'Advanced Scatter Plot of Random
Values'.

f) Display a legend for the scatter plot, the horizontal line, and the
vertical line.

14. Create a time-series dataset in a Pandas DataFrame with
columns: 'Date', 'Temperature', 'Humidity' and
perform the following tasks using Matplotlib:

a) Plot the 'Temperature' and 'Humidity' on the same plot with
different y-axes (left y-axis for 'Temperature' and
right y-axis for 'Humidity').

b) Label the x-axis as 'Date'.

c) Set the title of the plot as 'Temperature and Humidity Over Time'.

15. Create a NumPy array data containing 1000 samples from a
normal distribution. Perform the following
tasks using Matplotlib:

a) Plot a histogram of the data with 30 bins.

b) Overlay a line plot representing the normal distribution's
probability density function (PDF).

c) Label the x-axis as 'Value' and the y-axis as
'Frequency/Probability'.

d) Set the title of the plot as 'Histogram with PDF Overlay'.

16. Set the title of the plot as 'Histogram with PDF Overlay'.

17. Create a Seaborn scatter plot of two random arrays, color
points based on their position relative to the
origin (quadrants), add a legend, label the axes, and set the title as
'Quadrant-wise Scatter Plot'.

18. With Bokeh, plot a line chart of a sine wave function, add grid
lines, label the axes, and set the title as 'Sine
Wave Function'.
19. Using Bokeh, generate a bar chart of randomly generated
categorical data, color bars based on their
values, add hover tooltips to display exact values, label the axes,
and set the title as 'Random Categorical
Bar Chart'.
20. Using Plotly, create a basic line plot of a randomly generated
dataset, label the axes, and set the title as
'Simple Line Plot'.
21. Using Plotly, create an interactive pie chart of randomly
generated data, add labels and percentages, set
the title as 'Interactive Pie Chart'.

ANSWERS
1. Here are three different methods for creating identical 2D
arrays in NumPy:
● Using np.array() to create the array manually.

import numpy as np

# Method 1: Build the array directly from nested lists
array_1 = np.array([[1, 2, 3], [4, 5, 6]])

print("Method 1 Output:")
print(array_1)
● Using np.ones() to create an array of ones and then overwriting
the values.
import numpy as np

# Method 2: Create an array of ones and modify it
# dtype=int keeps the result identical to Method 1 (the default dtype is float)
array_2 = np.ones((2, 3), dtype=int)  # Creates a 2x3 array of ones
array_2[0] = [1, 2, 3]
array_2[1] = [4, 5, 6]

print("\nMethod 2 Output:")
print(array_2)
● Using np.zeros() and assigning values manually.
import numpy as np

# Method 3: Create an array of zeros and assign values
# dtype=int keeps the result identical to Method 1 (the default dtype is float)
array_3 = np.zeros((2, 3), dtype=int)  # Creates a 2x3 array of zeros
array_3[0] = [1, 2, 3]
array_3[1] = [4, 5, 6]

print("\nMethod 3 Output:")
print(array_3)
Output for all methods:
Method 1 Output:
[[1 2 3]
 [4 5 6]]

Method 2 Output:
[[1 2 3]
 [4 5 6]]

Method 3 Output:
[[1 2 3]
 [4 5 6]]

2.
import numpy as np

# Generate 100 evenly spaced numbers between 1 and 10


array_1d = np.linspace(1, 10, 100)

# Reshape the 1D array into a 2D array (e.g., 10 rows, 10 columns)


array_2d = array_1d.reshape(10, 10)

print(array_2d)

3.
Difference between np.array, np.asarray, and np.asanyarray:

● np.array():
○ This function is used to create an array. By default it creates
a new array (a copy), even if the input is already an array (unless
copy=False is specified).
○ It has additional options such as specifying the data type
(dtype) and controlling whether a copy is made.
● np.asarray():
○ Similar to np.array(), but if the input is already a NumPy array
with a compatible dtype, it does not make a copy; it returns the
original array.
○ Instances of ndarray subclasses (e.g., masked arrays or matrices)
are converted to a base-class ndarray.
○ It's faster and more memory-efficient when you're converting data
that might already be an array.
● np.asanyarray():
○ Works like np.asarray(), but it passes subclasses of np.ndarray
through unchanged instead of converting them to a base-class
ndarray. This is useful when working with special array types
(e.g., masked arrays or matrices).
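
The copy and subclass behavior described above can be checked directly. The following is a minimal sketch; the variable names (base, copied, same, masked, as_base, as_any) are chosen here for illustration only.

```python
import numpy as np

base = np.array([1, 2, 3])

copied = np.array(base)     # np.array copies by default
same = np.asarray(base)     # an existing ndarray is returned as-is, no copy

# A masked array is a subclass of np.ndarray
masked = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
as_base = np.asarray(masked)     # converted to a plain ndarray
as_any = np.asanyarray(masked)   # subclass passed through unchanged

print(copied is base)            # False
print(same is base)              # True
print(type(as_base).__name__)    # ndarray
print(type(as_any).__name__)     # MaskedArray
```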

Difference between Deep copy and Shallow copy:

● Shallow copy:
○ A shallow copy of an object is a copy of the object, but it
does not recursively copy the objects contained within it (if
they are mutable).
○ In the case of a shallow copy, modifications to mutable
objects inside the copy will affect the original object as they
share the same reference.
● Deep copy:
○ A deep copy copies everything, including all objects found within
the original object. The new copy is completely independent of the
original object.
○ Any changes to the copied object will not affect the original object.
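
As a small illustration of the difference, the sketch below uses Python's copy module on a nested list and, as a NumPy analogue, contrasts a view (which shares data with the original) with .copy(). The variable names are illustrative assumptions.

```python
import copy
import numpy as np

nested = [[1, 2], [3, 4]]
shallow = copy.copy(nested)     # new outer list, but the same inner lists
deep = copy.deepcopy(nested)    # inner lists are copied as well

shallow[0][0] = 99              # visible through `nested` too
deep[1][1] = -1                 # does not touch `nested`
print(nested)                   # [[99, 2], [3, 4]]

arr = np.array([1, 2, 3])
view = arr.view()               # shares the underlying data with `arr`
independent = arr.copy()        # owns its own data

view[0] = 100                   # changes `arr` as well
independent[1] = 200            # leaves `arr` untouched
print(arr.tolist())             # [100, 2, 3]
```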

4.

import numpy as np

# Generate a 3x3 array with random floating-point numbers between 5 and 20

random_array = np.random.uniform(5, 20, (3, 3))

# Round each element in the array to 2 decimal places

rounded_array = np.round(random_array, 2)

print(rounded_array)

5.
import numpy as np

# Create a NumPy array with random integers between 1 and 10, of shape (5, 6)

array = np.random.randint(1, 11, (5, 6))

# a) Extract all even integers from the array

even_integers = array[array % 2 == 0]

# b) Extract all odd integers from the array

odd_integers = array[array % 2 != 0]

# Display the array and results

print("Original Array:")

print(array)

print("\nEven Integers:")

print(even_integers)

print("\nOdd Integers:")

print(odd_integers)

6.

import numpy as np

# Create a 3D NumPy array of shape (3, 3, 3) with random integers between 1 and 10

array_3d = np.random.randint(1, 11, (3, 3, 3))

# a) Find the indices of the maximum values along each depth level (third axis)

max_indices = np.argmax(array_3d, axis=2)

# b) Perform element-wise multiplication between the array and itself

elementwise_multiplication = array_3d * array_3d

# Display results

print("Original 3D Array:")

print(array_3d)

print("\nIndices of Maximum Values along each depth level (third axis):")

print(max_indices)

print("\nElement-wise Multiplication:")

print(elementwise_multiplication)

Explanation:

np.random.randint(1, 11, (3, 3, 3)): Generates a 3D array
of shape (3, 3, 3) with random integers between 1 and 10.

np.argmax(array_3d, axis=2): Finds the indices of the maximum
values along the third axis (depth level).

array_3d * array_3d: Performs element-wise multiplication of the
array with itself.
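
Because the array above is random, the argmax result is hard to verify by eye. The sketch below repeats the same call on a small fixed array (values chosen here purely for illustration) and also recovers the maximum values from the indices with np.take_along_axis.

```python
import numpy as np

# Fixed values so the result can be checked by hand; shape is (2, 2, 3)
cube = np.array([
    [[1, 9, 3], [4, 5, 6]],
    [[7, 2, 8], [9, 9, 1]],
])

# Index of the maximum within each innermost row (the third axis)
idx = np.argmax(cube, axis=2)
print(idx.shape)       # (2, 2)
print(idx.tolist())    # [[1, 2], [2, 0]]  (ties return the first occurrence)

# Use the indices to pull out the maximum values themselves
max_vals = np.take_along_axis(cube, idx[..., np.newaxis], axis=2).squeeze(axis=2)
print(max_vals.tolist())   # [[9, 6], [8, 9]]
```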

7.

To clean and transform the 'Phone' column in a dataset to remove


non-numeric characters and convert it to a numeric data type, we can
use Pandas along with regular expressions. After cleaning, we can also
display the table's attributes and data types of each column .

import pandas as pd

# Sample dataset for demonstration

data = {
    'Name': ['John Doe', 'Jane Smith', 'Emily Davis'],
    'Phone': ['(123) 456-7890', '987-654-3210', '123.456.7890'],
    'Email': ['[email protected]', '[email protected]', '[email protected]']
}

# Create a DataFrame

df = pd.DataFrame(data)

# Clean the 'Phone' column to remove non-numeric characters and convert to numeric

df['Phone'] = df['Phone'].str.replace(r'\D', '', regex=True).astype('int64')

# Display the cleaned DataFrame

print("Cleaned DataFrame:")

print(df)

# Display the attributes and data types of each column

print("\nTable Attributes and Data Types:")


# df.info() prints its summary itself and returns None, so it is not wrapped in print()
df.info()

Expected output:

Cleaned DataFrame:

          Name       Phone             Email
0     John Doe  1234567890  [email protected]
1   Jane Smith  9876543210  [email protected]
2  Emily Davis  1234567890  [email protected]

Table Attributes and Data Types:

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 3 entries, 0 to 2

Data columns (total 3 columns):

# Column Non-Null Count Dtype

--- ------ -------------- -----

0 Name 3 non-null object

1 Phone 3 non-null int64

2 Email 3 non-null object

dtypes: int64(1), object(2)

memory usage: 200.0+ bytes

8.

import pandas as pd

# a) Read the 'data.csv' file using pandas, skipping the first 50 rows.
#    skiprows=range(1, 51) skips data rows 1-50 but keeps the header (line 0);
#    a plain skiprows=50 would discard the header line as well.
# b) Only read the columns: 'Last Name', 'Gender', 'Email', 'Phone', and 'Salary'

columns_to_read = ['Last Name', 'Gender', 'Email', 'Phone', 'Salary']

df_filtered = pd.read_csv('data.csv', skiprows=range(1, 51), usecols=columns_to_read)

# c) Display the first 10 rows of the filtered dataset

print("First 10 rows of the filtered dataset:")

print(df_filtered.head(10))

# d) Extract the 'Salary' column as a Series and display its last 5 values

salary_series = df_filtered['Salary']

print("\nLast 5 values of the 'Salary' column:")

print(salary_series.tail(5))

Explanation:

● pd.read_csv('data.csv', skiprows=range(1, 51),
usecols=columns_to_read): Reads the CSV file, skips the first 50
data rows while keeping the header, and loads only the
specified columns: 'Last Name', 'Gender', 'Email',
'Phone', and 'Salary'.
● df_filtered.head(10): Displays the first 10 rows of the
filtered DataFrame.
● df_filtered['Salary'].tail(5): Extracts the 'Salary'
column as a Series and shows the last 5 values.
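
One detail worth checking when skipping rows is how an integer skiprows treats the header line. The sketch below is a minimal illustration that assumes a tiny in-memory CSV (csv_text, a hypothetical stand-in for 'data.csv') and compares an integer skiprows with a range that starts at 1, combined with usecols.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'data.csv': a header plus 6 data rows
csv_text = "Index,Last Name,Salary\n" + "".join(
    f"{i},Name{i},{1000 * i}\n" for i in range(1, 7)
)

# An integer skips that many physical lines from the top, header included,
# so the next data row ends up being used as the header
naive = pd.read_csv(io.StringIO(csv_text), skiprows=2)
print(list(naive.columns))           # ['2', 'Name2', '2000']

# A range starting at 1 keeps line 0 (the header) and skips two data rows;
# usecols restricts which columns are parsed
kept = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 3),
                   usecols=["Last Name", "Salary"])
print(list(kept.columns))            # ['Last Name', 'Salary']
print(kept["Last Name"].tolist())    # ['Name3', 'Name4', 'Name5', 'Name6']
```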

9.

import pandas as pd

# Assuming the dataset is loaded into a DataFrame named 'df'


# Replace 'data.csv' with the actual path to your dataset

df = pd.read_csv('data.csv')

# Filter rows based on the given conditions

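# na=False treats missing values as non-matches so the boolean mask stays valid
filtered_df = df[(df['Last Name'].str.contains('Duke', case=False, na=False)) &
                 (df['Gender'].str.contains('Female', case=False, na=False)) &
                 (df['Salary'] < 85000)]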

# Display the filtered DataFrame

print("Filtered Rows:")

print(filtered_df)

10.

import pandas as pd

import numpy as np

# Generate a Series of 35 random integers between 1 and 6 (inclusive)

random_series = pd.Series(np.random.randint(1, 7, size=35))

# Create a DataFrame from the Series values, reshaping them into a 7x5 format

df = pd.DataFrame(random_series.to_numpy().reshape(7, 5),
                  columns=['Col1', 'Col2', 'Col3', 'Col4', 'Col5'])

# Display the DataFrame

print("7x5 DataFrame:")
print(df)

11.

import pandas as pd

import numpy as np

# a) Create the first Series with random numbers ranging from 10 to 50

series1 = pd.Series(np.random.randint(10, 51, size=50))

# b) Create the second Series with random numbers ranging from 100 to 1000

series2 = pd.Series(np.random.randint(100, 1001, size=50))

# c) Create a DataFrame by joining these Series by column, then rename the columns

df = pd.concat([series1, series2], axis=1)

df.columns = ['col1', 'col2']

# Display the DataFrame

print("Combined DataFrame:")

print(df)

Sample output (the values are random and will differ on each run):

Combined DataFrame:

col1 col2

0 32 384

1 14 534

2 23 251
3 41 728

4 36 910

5 15 436

6 47 677

7 35 123

8 21 580

9 50 761

...

12.

import pandas as pd

# Assuming the dataset is loaded into a DataFrame named 'df'

# Replace 'data.csv' with the actual path to your dataset

df = pd.read_csv('data.csv')

# a) Delete the 'Email', 'Phone', and 'Date of birth' columns from the dataset

df.drop(columns=['Email', 'Phone', 'Date of birth'], inplace=True)

# b) Delete the rows containing any missing values

df.dropna(inplace=True)

# d) Print the final output

print("Final DataFrame after deletions:")


print(df)

Illustrative output (the actual rows depend on the contents of data.csv):

Final DataFrame after deletions:

Last Name Gender Salary

0 Duke Female 60000

1 Smith Female 70000

2 Brown Male 80000

...

13.

import numpy as np

import matplotlib.pyplot as plt

# Create two NumPy arrays, x and y, each containing 100 random float values between 0 and 1

x = np.random.rand(100)

y = np.random.rand(100)

# a) Create a scatter plot

plt.scatter(x, y, color='red', marker='o', label='Data Points')

# b) Add a horizontal line at y = 0.5

plt.axhline(y=0.5, color='blue', linestyle='--', label='y = 0.5')

# c) Add a vertical line at x = 0.5


plt.axvline(x=0.5, color='green', linestyle=':', label='x = 0.5')

# d) Label the x-axis and y-axis

plt.xlabel('X-axis')

plt.ylabel('Y-axis')

# e) Set the title of the plot

plt.title('Advanced Scatter Plot of Random Values')

# f) Display a legend

plt.legend()

# Show the plot

plt.show()

14.

To create a time-series dataset in a Pandas DataFrame with columns


for 'Date', 'Temperature', and 'Humidity', and then plot them using
Matplotlib, you can follow the steps below:

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

# Create a date range

date_range = pd.date_range(start='2023-01-01', periods=100, freq='D')


# Generate random temperature and humidity data

np.random.seed(0) # For reproducibility

temperature = np.random.uniform(low=15, high=35, size=100)  # Temperatures between 15 and 35 degrees Celsius

humidity = np.random.uniform(low=30, high=90, size=100)  # Humidity between 30% and 90%

# Create the DataFrame

data = {
    'Date': date_range,
    'Temperature': temperature,
    'Humidity': humidity
}

df = pd.DataFrame(data)

# Plotting

fig, ax1 = plt.subplots()

# a) Plot the 'Temperature' and 'Humidity'

ax1.set_xlabel('Date')

ax1.set_ylabel('Temperature (°C)', color='tab:red')

ax1.plot(df['Date'], df['Temperature'], color='tab:red', label='Temperature')

ax1.tick_params(axis='y', labelcolor='tab:red')
# Create a second y-axis for Humidity

ax2 = ax1.twinx()

ax2.set_ylabel('Humidity (%)', color='tab:blue')

ax2.plot(df['Date'], df['Humidity'], color='tab:blue', label='Humidity')

ax2.tick_params(axis='y', labelcolor='tab:blue')

# b) The x-axis is labelled 'Date' by ax1.set_xlabel('Date') above

# c) Set the title of the plot

plt.title('Temperature and Humidity Over Time')

fig.tight_layout() # To make sure the labels fit nicely

plt.show()

15.

To create a NumPy array containing samples from a normal


distribution and then plot a histogram with an overlay of the probability
density function (PDF) using Matplotlib, you can follow these steps:

import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import norm

# Create a NumPy array data containing 1000 samples from a normal distribution

mu, sigma = 0, 1 # mean and standard deviation

data = np.random.normal(mu, sigma, 1000)


# a) Plot a histogram of the data with 30 bins

plt.hist(data, bins=30, density=True, alpha=0.6, color='g',


label='Histogram')

# b) Overlay a line plot representing the normal distribution's probability density function (PDF)

xmin, xmax = plt.xlim() # Get the current x limits

x = np.linspace(xmin, xmax, 100) # Generate values for x

p = norm.pdf(x, mu, sigma) # Calculate the PDF

plt.plot(x, p, 'k', linewidth=2, label='Normal PDF')

# c) Label the x-axis and the y-axis

plt.xlabel('Value')

plt.ylabel('Frequency/Probability')

# d) Set the title of the plot

plt.title('Histogram with PDF Overlay')

# Display legend

plt.legend()

# Show the plot

plt.show()

16.
import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import norm

# Create a NumPy array data containing 1000 samples from a normal distribution

mu, sigma = 0, 1 # mean and standard deviation

data = np.random.normal(mu, sigma, 1000)

# a) Plot a histogram of the data with 30 bins

plt.hist(data, bins=30, density=True, alpha=0.6, color='g',


label='Histogram')

# b) Overlay a line plot representing the normal distribution's probability density function (PDF)

xmin, xmax = plt.xlim() # Get the current x limits

x = np.linspace(xmin, xmax, 100) # Generate values for x

p = norm.pdf(x, mu, sigma) # Calculate the PDF

plt.plot(x, p, 'k', linewidth=2, label='Normal PDF')

# c) Label the x-axis and the y-axis

plt.xlabel('Value')

plt.ylabel('Frequency/Probability')

# d) Set the title of the plot


plt.title('Histogram with PDF Overlay') # Set the title as requested

# Display legend

plt.legend()

# Show the plot

plt.show()

17.

import numpy as np

import pandas as pd

import seaborn as sns

import matplotlib.pyplot as plt

# Create two random arrays

np.random.seed(0) # For reproducibility

x = np.random.rand(100) * 20 - 10 # Random values between -10 and 10

y = np.random.rand(100) * 20 - 10 # Random values between -10 and 10

# Create a DataFrame to hold the data

df = pd.DataFrame({'X': x, 'Y': y})

# Define a function to determine the quadrant

def get_quadrant(row):
    if row['X'] > 0 and row['Y'] > 0:
        return 'Quadrant 1'
    elif row['X'] < 0 and row['Y'] > 0:
        return 'Quadrant 2'
    elif row['X'] < 0 and row['Y'] < 0:
        return 'Quadrant 3'
    elif row['X'] > 0 and row['Y'] < 0:
        return 'Quadrant 4'
    else:
        return 'On axis' # Handles points on the axes, including the origin

# Apply the function to create a new column in the DataFrame

df['Quadrant'] = df.apply(get_quadrant, axis=1)

# Create a scatter plot

plt.figure(figsize=(10, 6))

sns.scatterplot(data=df, x='X', y='Y', hue='Quadrant', palette='Set1',


s=100)

# Label the axes

plt.xlabel('X-axis')

plt.ylabel('Y-axis')
# Set the title of the plot

plt.title('Quadrant-wise Scatter Plot')

# Display the legend

plt.legend(title='Quadrants')

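# Add a grid and dashed reference lines through the origin

plt.grid()

plt.axhline(0, color='black', linewidth=0.5, ls='--') # Horizontal line at y = 0 (the x-axis)

plt.axvline(0, color='black', linewidth=0.5, ls='--') # Vertical line at x = 0 (the y-axis)

# Show the plot

plt.show()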

18.

# Install Bokeh first if it is not already available (run in a shell, not in Python):
# pip install bokeh

import numpy as np

from bokeh.plotting import figure, show

from bokeh.io import output_notebook

# Prepare the output

output_notebook()

# Generate sine wave data

x = np.linspace(0, 2 * np.pi, 100) # 100 points from 0 to 2π

y = np.sin(x)
# Create a new plot with title and axis labels

p = figure(title="Sine Wave Function", x_axis_label='x',


y_axis_label='sin(x)', width=800, height=400)

# Add grid lines

p.grid.grid_line_alpha = 0.3 # Set grid line transparency

# Add the sine wave line

p.line(x, y, line_width=2, color='blue', legend_label='sin(x)')

# Show the plot

show(p)

19.

import numpy as np

import pandas as pd

from bokeh.plotting import figure, show

from bokeh.io import output_notebook

from bokeh.models import ColumnDataSource, HoverTool

# Enable output in notebook

output_notebook()

# Generate random categorical data

categories = [f'Category {i}' for i in range(10)]


values = np.random.randint(1, 100, size=len(categories))

# Define a color for each bar based on its value (larger values give a brighter red)

colors = ['#%02x%02x%02x' % (int(value * 2.55), 0, 0) for value in values]

# Create a DataFrame, including the colors as a column

data = pd.DataFrame({'categories': categories, 'values': values, 'colors': colors})

# Create a ColumnDataSource

source = ColumnDataSource(data=data)

# Create the bar chart

p = figure(x_range=categories, height=400, title="Random Categorical Bar Chart",
           toolbar_location=None, tools="")

# Add bars, taking each bar's color from the 'colors' column of the source
# (when source= is given, per-bar values must be column names, not a plain list)

p.vbar(x='categories', top='values', width=0.9, source=source, color='colors')

# Add hover tooltips

hover = HoverTool()

hover.tooltips = [("Category", "@categories"), ("Value", "@values")]


p.add_tools(hover)

# Label the axes

p.xaxis.axis_label = "Categories"

p.yaxis.axis_label = "Values"

p.xaxis.major_label_orientation = "vertical"

# Show the plot

show(p)

20.

import numpy as np

import plotly.graph_objects as go

# Generate random data

np.random.seed(0) # For reproducibility

x = np.linspace(0, 10, 100) # 100 points from 0 to 10

y = np.random.rand(100) # Random values for y

# Create a line plot

fig = go.Figure()

# Add a trace for the line

fig.add_trace(go.Scatter(x=x, y=y, mode='lines', name='Random Data'))


# Update layout

fig.update_layout(
    title='Simple Line Plot',
    xaxis_title='X Axis',
    yaxis_title='Y Axis'
)

# Show the plot

fig.show()

21.

import numpy as np

import plotly.graph_objects as go

# Generate random data

np.random.seed(0) # For reproducibility

labels = [f'Category {i}' for i in range(1, 6)] # Labels for the pie chart

values = np.random.randint(1, 100, size=len(labels)) # Random values

# Create a pie chart

fig = go.Figure(data=[go.Pie(labels=labels, values=values,
                             textinfo='label+percent',  # Show label and percentage
                             hole=0.3)])  # Add a hole for a donut chart

# Update layout

fig.update_layout(
    title='Interactive Pie Chart'
)

# Show the plot

fig.show()

Explanation:

1. Import Libraries: The necessary libraries (numpy and plotly) are
imported.
2. Generate Data: Random values are generated for five categories.
Labels are created accordingly.
3. Create a Pie Chart: A Plotly figure is created, and a pie chart is
added using the go.Pie trace.
○ The textinfo='label+percent' parameter shows both
the category labels and their corresponding percentages on
the chart.
○ The hole parameter is set to 0.3 to create a donut chart
style.
4. Update Layout: The title is set for the chart.
5. Show Plot: Finally, the pie chart is displayed.
