
Practice Questions 2

The document contains Python code that demonstrates data manipulation using pandas and numpy, including handling missing values, removing duplicates, and performing group operations. It also covers solving a system of equations, predicting house prices using linear regression, and filling missing values through various interpolation methods. Additionally, it includes creating a crosstab to count employees in different departments across work locations.

Uploaded by

Rishit Gandha

In [32]: import pandas as pd

import numpy as np

# Create a dataset with missing values, duplicates, and categorical data
data = {
    'Customer_ID': [101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 101, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Hannah', 'Ivy', 'Jack', 'Alice', 'Eva'],
    'Age': [25, 34, 29, 40, 29, np.nan, 32, np.nan, 28, 45, 25, 29],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago', 'San Francisco',
             'San Francisco', 'Los Angeles', 'New York', 'Chicago', 'New York', 'Chicago'],
    'Salary': [50000, 70000, 80000, np.nan, 65000, 72000, 81000, 62000, 77000, 50000, 50000, 65000],
    'Purchase_Amount': [200, 150, np.nan, 300, 250, 400, 350, np.nan, 500, 450, 200, 250]
}

# Create DataFrame
df = pd.DataFrame(data)

print(df)

# Remove duplicates
df = df.drop_duplicates()

Customer_ID Name Age City Salary Purchase_Amount


0 101 Alice 25.0 New York 50000.0 200.0
1 102 Bob 34.0 Los Angeles 70000.0 150.0
2 103 Charlie 29.0 Chicago 80000.0 NaN
3 104 David 40.0 New York NaN 300.0
4 105 Eva 29.0 Chicago 65000.0 250.0
5 106 Frank NaN San Francisco 72000.0 400.0
6 107 Grace 32.0 San Francisco 81000.0 350.0
7 108 Hannah NaN Los Angeles 62000.0 NaN
8 109 Ivy 28.0 New York 77000.0 500.0
9 110 Jack 45.0 Chicago 50000.0 450.0
10 101 Alice 25.0 New York 50000.0 200.0
11 105 Eva 29.0 Chicago 65000.0 250.0

Find the customer who has made the highest total purchase amount (after removing duplicates)
In [ ]: # 1. Find the customer who has made the highest total purchase amount
highest_purchase_customer = df.groupby('Customer_ID')['Purchase_Amount'].sum().idxmax()
print("1. Customer with highest total purchase amount:", highest_purchase_customer)
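Note that `idxmax` above returns only the winning `Customer_ID`. Since the prompt asks for the customer's row, here is a small sketch of recovering the full row(s) by boolean indexing, using a toy three-customer frame rather than the full dataset above:

```python
import numpy as np
import pandas as pd

# Toy deduplicated frame (a subset of the columns used above)
df = pd.DataFrame({
    "Customer_ID": [101, 102, 103],
    "Name": ["Alice", "Bob", "Charlie"],
    "Purchase_Amount": [200.0, 150.0, np.nan],
})

# idxmax on the grouped sums yields the ID; boolean indexing recovers the row(s)
top_id = df.groupby("Customer_ID")["Purchase_Amount"].sum().idxmax()
top_rows = df[df["Customer_ID"] == top_id]
print(top_rows)  # Alice's row: her 200.0 is the largest total (NaN sums to 0)
```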

Identify customers whose salary is within the top 10% of all salaries
In [ ]: # 2. Identify customers whose salary is within the top 10% of all salaries
salary_threshold = df['Salary'].quantile(0.9)
top_salary_customers = df[df['Salary'] >= salary_threshold]
print("2. Customers with top 10% salaries:\n", top_salary_customers)

Find the city with the most customers and calculate its average purchase amount
In [ ]: # 3. Find the city with the most customers and calculate its average purchase amount
most_common_city = df['City'].mode()[0]
avg_purchase_in_city = df[df['City'] == most_common_city]['Purchase_Amount'].mean()
print("3. City with most customers:", most_common_city, "Average Purchase Amount:", avg_purchase_in_city)

Create a new column indicating if the customer is a high spender (if purchase amount > median)
In [ ]: # 4. Create a new column indicating if the customer is a high spender (if purchase amount > median)
purchase_median = df['Purchase_Amount'].median()
df['High_Spender'] = df['Purchase_Amount'] > purchase_median
print("4. DataFrame with High Spender column:\n", df)

Find the age group (bins) with the highest average salary
In [ ]: # 5. Find the age group (bins) with the highest average salary
bins = [20, 30, 40, 50]
labels = ['20-30', '30-40', '40-50']
df['Age_Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)
highest_avg_salary_group = df.groupby('Age_Group', observed=False)['Salary'].mean().idxmax()
print("5. Age group with highest average salary:", highest_avg_salary_group)

Replace missing salary values using the median salary for that customer’s city
In [ ]: # 6. Replace missing salary values using the median salary for that customer’s city
df['Salary'] = df['Salary'].fillna(df.groupby('City')['Salary'].transform('median'))  # transform keeps the original index, so the assignment aligns
print("6. DataFrame after filling missing salaries:\n", df)

Solve a system of equations with multiple variables (3x3 system)

Problem: solve the following system using NumPy:

x + 2y + 3z = 14
4x + 5y + 6z = 32
7x + 8y + 10z = 50
In [ ]: # 7. Solve a system of equations with multiple variables (3x3 system)
A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
B = np.array([14, 32, 50])
solution = np.linalg.solve(A, B)
print("7. Solution to the system of equations:", solution)
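A quick sanity check on the solver's answer: multiplying `A` back by the solution should reproduce `B`. A minimal verification sketch, reusing the same system:

```python
import numpy as np

A = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 10]])
B = np.array([14, 32, 50])
solution = np.linalg.solve(A, B)

# The residual A @ solution - B should be zero up to floating-point error
assert np.allclose(A @ solution, B)
print(solution)  # x = -2, y = 8, z = 0
```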


You are given a dataset containing historical house price data. The price of a house depends on square footage, number of bedrooms, and age of the house. Your task is to predict house prices from these features using linear algebra techniques (np.linalg.solve and np.matmul).

Predict the price of a new house with 2200 sq ft, 3 bedrooms, and 5 years of age.

In [27]: data = {
    "Square_Feet": [1500, 1800, 2400, 3000],
    "Bedrooms": [3, 4, 4, 5],
    "Age_Years": [10, 5, 2, 1],
    "Price": [300000, 400000, 500000, 600000]
}
df = pd.DataFrame(data)
# Display the DataFrame
print(df)

Square_Feet Bedrooms Age_Years Price


0 1500 3 10 300000
1 1800 4 5 400000
2 2400 4 2 500000
3 3000 5 1 600000

In [ ]: import numpy as np
import pandas as pd

# Given dataset
data = {
    "Square_Feet": [1500, 1800, 2400, 3000],
    "Bedrooms": [3, 4, 4, 5],
    "Age_Years": [10, 5, 2, 1],
    "Price": [300000, 400000, 500000, 600000]
}

df = pd.DataFrame(data)

# Extract features (X) and target variable (y)
X = df[["Square_Feet", "Bedrooms", "Age_Years"]].values
y = df["Price"].values

# Display dataset
print(df)

# Add a bias column (intercept term) to X
X_b = np.c_[np.ones((X.shape[0], 1)), X]  # column of ones for the intercept

# Compute the coefficients using the normal equation
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)

print("Coefficients (theta):", theta)

# Define new house data (first entry is the bias term)
new_house = np.array([1, 2200, 3, 5])

# Predict price
predicted_price = np.matmul(new_house, theta)

print("Predicted price for the new house:", predicted_price)
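The normal equation works here because the 4x4 system is exactly determined, but forming `X_b.T @ X_b` squares the condition number. As an alternative sketch, `np.linalg.lstsq` fits the same coefficients directly and also handles over-determined data (more houses than parameters):

```python
import numpy as np

X = np.array([[1500, 3, 10], [1800, 4, 5], [2400, 4, 2], [3000, 5, 1]], dtype=float)
y = np.array([300000, 400000, 500000, 600000], dtype=float)
X_b = np.c_[np.ones((X.shape[0], 1)), X]  # bias column for the intercept

# lstsq minimizes ||X_b @ theta - y||^2 without explicitly forming X_b.T @ X_b
theta, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None)

new_house = np.array([1.0, 2200, 3, 5])
predicted = new_house @ theta
print(predicted)
```

With only four houses the fit is exact, so this reproduces the normal-equation prediction; the difference would matter on larger, noisier datasets.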


Fill NaN values using linear interpolation, time-based interpolation, and index-based interpolation (on IDs)
In [29]: import pandas as pd
import numpy as np

# Create dataset with missing values
data = {
    "ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04",
                            "2024-01-05", "2024-01-07", "2024-01-08",
                            "2024-01-10", "2024-01-11"]),
    "Sales": [500, np.nan, 700, np.nan, 850, np.nan, 920, 980],
    "Category": ["Electronics", "Furniture", "Electronics", "Clothing",
                 "Furniture", "Electronics", "Clothing", "Furniture"]
}

df = pd.DataFrame(data)

print("Original Dataset with Missing Values:")
df

Original Dataset with Missing Values:


Out[29]:    ID       Date  Sales     Category
0  101 2024-01-01  500.0  Electronics
1  102 2024-01-02    NaN    Furniture
2  103 2024-01-04  700.0  Electronics
3  104 2024-01-05    NaN     Clothing
4  105 2024-01-07  850.0    Furniture
5  106 2024-01-08    NaN  Electronics
6  107 2024-01-10  920.0     Clothing
7  108 2024-01-11  980.0    Furniture

In [ ]: df_linear = df.copy()
df_linear["Sales"] = df_linear["Sales"].interpolate(method="linear")

print("\nDataset after Linear Interpolation:")
print(df_linear)

In [ ]: df_time = df.copy()
df_time.set_index("Date", inplace=True) # Set Date as index
df_time["Sales"] = df_time["Sales"].interpolate(method="time") # Time-based interpolation
df_time.reset_index(inplace=True)

print("\nDataset after Time-Based Interpolation:")
print(df_time)

In [ ]: df_index = df.copy()
df_index.set_index("ID", inplace=True) # Set ID as index
df_index["Sales"] = df_index["Sales"].interpolate(method="index") # Index-based interpolation
df_index.reset_index(inplace=True)

print("\nDataset after Index-Based Interpolation:")
print(df_index)
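To see how the three methods actually differ, here is a side-by-side sketch on just the `Sales` series (same toy values as above). Linear and index-based fills agree here because the IDs are consecutive, while time-based interpolation weights by elapsed days:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04", "2024-01-05",
                        "2024-01-07", "2024-01-08", "2024-01-10", "2024-01-11"])
ids = [101, 102, 103, 104, 105, 106, 107, 108]
sales = pd.Series([500, np.nan, 700, np.nan, 850, np.nan, 920, 980])

linear = sales.interpolate(method="linear")
time_based = sales.set_axis(dates).interpolate(method="time")
index_based = sales.set_axis(pd.Index(ids)).interpolate(method="index")

# Linear treats rows as equally spaced, so the 500 -> 700 gap fills to 600.
# Time-based sees 2024-01-02 as one third of the way from 01-01 to 01-04,
# so the same gap fills to ~566.67 instead.
print(linear.iloc[1], time_based.iloc[1], index_based.iloc[1])
```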

In [ ]:

In [ ]:

You have a dataset containing employee details with the following columns:
Employee_ID: A unique identifier for each employee.

Department: The department where the employee works (e.g., HR, IT, Sales).

Work_Location: The office location of the employee (e.g., New York, San Francisco, Chicago).

Find the count of employees in each department across different work locations using a crosstab
In [31]: import pandas as pd

# Sample dataset
data = {
    "Employee_ID": range(1, 11),
    "Department": ["HR", "IT", "Sales", "IT", "HR", "Sales", "IT", "HR", "Sales", "IT"],
    "Work_Location": ["New York", "San Francisco", "Chicago", "New York", "Chicago",
                      "San Francisco", "Chicago", "New York", "San Francisco", "Chicago"]
}

df = pd.DataFrame(data)

print("Employee Dataset:")
print(df)

Employee Dataset:
Employee_ID Department Work_Location
0 1 HR New York
1 2 IT San Francisco
2 3 Sales Chicago
3 4 IT New York
4 5 HR Chicago
5 6 Sales San Francisco
6 7 IT Chicago
7 8 HR New York
8 9 Sales San Francisco
9 10 IT Chicago

In [ ]: # Create a crosstab of Department vs Work_Location


crosstab_result = pd.crosstab(df["Department"], df["Work_Location"])

print("\nCount of Employees in Each Department Across Work Locations:")


print(crosstab_result)
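A small extension sketch: passing `margins=True` to `pd.crosstab` appends an `All` row and column with totals, which is handy for checking that the counts sum to the number of employees (same sample data as above):

```python
import pandas as pd

df = pd.DataFrame({
    "Department": ["HR", "IT", "Sales", "IT", "HR", "Sales", "IT", "HR", "Sales", "IT"],
    "Work_Location": ["New York", "San Francisco", "Chicago", "New York", "Chicago",
                      "San Francisco", "Chicago", "New York", "San Francisco", "Chicago"],
})

# margins=True adds row/column totals; normalize="index" would instead give
# within-department proportions rather than raw counts
ct = pd.crosstab(df["Department"], df["Work_Location"], margins=True)
print(ct)
```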
