Assignment

The document outlines an assignment for a Data Science course requiring students to perform exploratory data analysis on a dataset named nyt1.csv. Students must categorize users into age groups, calculate click-through rates, and create visualizations based on user demographics. The assignment emphasizes the use of Python for data manipulation and analysis, with a submission deadline of May 20, 2023.

Uploaded by

Usama Ali

Department of Computer Science

CSC-487: Introduction to Data Science

Semester 1 (SPRING 2023)


NAME: NAEEM UR REHMAN
REG NO: 02-134202-053

Assignment No. 01
Submission Due Date: 20 May 2023
Marks: 05
Instructions (Any):

• You are expected to create data frames in order to perform exploratory data analysis.
• This assignment requires you to perform the tasks below on one dataset, nyt1.csv, available from the link given below.
• The assignment must be submitted in Python.
• Submit a soft copy on the LMS.

Question 1 CLO2-PLO3-BT level C3


There are 31 datasets named nyt1.csv, nyt2.csv, ..., nyt31.csv, which you can find
here: https://github.com/oreillymedia/doing_data_science.
Each one represents one (simulated) day's worth of ads shown and clicks recorded on the New York
Times home page in May 2012. Each row represents a single user. There are five columns: age, gender
(0=female, 1=male), number of impressions, number of clicks, and logged-in.

Once you have the data loaded, it’s time for some EDA:
1. Create a new variable, age_group, that categorizes users as "<18", "18-24",
"25-34", "35-44", "45-54", "55-64", and "65+".

For a single day:

Answer:

Code:

import pandas as pd

# Define the age group function
def get_age_group(age):
    if age < 18:
        return '<18'
    elif age < 25:
        return '18-24'
    elif age < 35:
        return '25-34'
    elif age < 45:
        return '35-44'
    elif age < 55:
        return '45-54'
    elif age < 65:
        return '55-64'
    else:
        return '65+'

# Read the CSV file into a DataFrame
data_frame = pd.read_csv('D:/Assignments/Alpha_Usama/assig1.csv')

# Add the age_group column to the DataFrame.
# Series.apply() works like Python's map(): it takes a function as input and
# calls it on every value in the column.
# (DataFrame.apply() on tabular data also takes an axis: 0 applies the function
# to each column, 1 applies it to each row.)
data_frame['age_group'] = data_frame['Age'].apply(get_age_group)

# Write the updated DataFrame back to the CSV file
data_frame.to_csv('D:/Assignments/Alpha_Usama/assig1.csv', index=False)

Screenshot:
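
The same binning can also be written without a row-by-row function. The sketch below is one possible alternative, not the approach used in the answer above. It assumes pandas' cut() with left-closed bins (right=False) so that the bin edges mirror the "age < 18", "age < 25", ... tests; the helper name add_age_group and the sample ages are made up for illustration.

```python
import pandas as pd

# Bin edges mirror the thresholds used in get_age_group (18, 25, 35, ...).
AGE_EDGES = [-float("inf"), 18, 25, 35, 45, 55, 65, float("inf")]
AGE_LABELS = ["<18", "18-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def add_age_group(frame, age_column="Age"):
    """Return a copy of frame with an 'age_group' column (hypothetical helper)."""
    result = frame.copy()
    # right=False makes every bin [low, high), matching the `age < high` tests.
    result["age_group"] = pd.cut(
        result[age_column], bins=AGE_EDGES, labels=AGE_LABELS, right=False
    )
    return result


# Made-up ages that sit on the bin boundaries, for illustration only.
sample = pd.DataFrame({"Age": [0, 17, 18, 24, 25, 64, 65, 90]})
grouped = add_age_group(sample)
print(grouped)
```

Checking a few boundary ages like this is a quick way to confirm that 18, 25, and 65 land in the intended groups.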
2. Plot the distributions of the number of impressions and the click-through rate (CTR = # clicks / # impressions)
for these six age categories. Define a new variable to segment or categorize users based on their
click behavior.
Answer:
Code:
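import pandas as pd
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('C:/Users/Alpha_Usama/Desktop/nyt1.csv')

# Calculate CTR (users with zero impressions get NaN, since 0 / 0 is undefined)
df['CTR'] = df['Clicks'] / df['Impressions']

# Create click_segment variable
median_ctr = df['CTR'].median()

def click_segment(row):
    # A NaN CTR compares as False, so those users fall into 'Low Clicker'.
    # If the median CTR is 0, every user with a defined CTR is a 'High Clicker'.
    if row['CTR'] >= median_ctr:
        return 'High Clicker'
    else:
        return 'Low Clicker'

df['click_segment'] = df.apply(click_segment, axis=1)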


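# Bin the numeric Age column into the age groups from Question 1
age_labels = ['<18', '18-24', '25-34', '35-44', '45-54', '55-64', '65+']
age_edges = [-float('inf'), 18, 25, 35, 45, 55, 65, float('inf')]
df['age_group'] = pd.cut(df['Age'], bins=age_edges, labels=age_labels, right=False)

# Plot histogram of Impressions for 18-24 age group
plt.hist(df[df['age_group'] == '18-24']['Impressions'])
plt.xlabel('Number of Impressions')
plt.ylabel('Frequency')
plt.title('Distribution of Impressions for 18-24 Age Group')
plt.show()

# Plot CTR distribution for each age category (rows with an undefined CTR are dropped)
for category in age_labels:
    plt.hist(df[df['age_group'] == category]['CTR'].dropna())
    plt.xlabel('CTR')
    plt.ylabel('Frequency')
    plt.title(f'Distribution of CTR for {category}')
    plt.show()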
Screenshot:
Note: As I was taking the continuous age data from the nyt1.csv file, it gave me a lot of graphs
according to age, but I have chosen 3 graphs for the result.
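
Alongside the histograms, a per-group summary of CTR can make the age categories easier to compare. The following is a minimal sketch under stated assumptions: it assumes an age_group column already exists as in Question 1, uses the Impressions and Clicks column names from the code above, and runs on a tiny made-up table rather than nyt1.csv. The helper name ctr_by_age_group is hypothetical. Users with zero impressions are given a missing CTR so that they do not affect the averages.

```python
import pandas as pd


def ctr_by_age_group(frame):
    """Summarize CTR per age group (hypothetical helper, for illustration)."""
    data = frame.copy()
    # CTR is undefined when a user saw no ads, so mask zero impressions to NaN.
    impressions = data["Impressions"].where(data["Impressions"] > 0)
    data["CTR"] = data["Clicks"] / impressions
    # count ignores NaN, so it reports how many users had a defined CTR.
    return data.groupby("age_group")["CTR"].agg(["count", "mean", "median"])


# Made-up rows standing in for one day of data.
sample = pd.DataFrame({
    "age_group": ["<18", "<18", "18-24", "18-24", "18-24"],
    "Impressions": [4, 0, 5, 10, 2],
    "Clicks": [1, 0, 0, 1, 1],
})
summary = ctr_by_age_group(sample)
print(summary)
```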
3. Explore the data and make visual and quantitative comparisons across user
segments/demographics (<18-year-old males versus <18-year-old females, or
logged-in versus not, for example).
Answer:
Code:
import pandas as pd
import matplotlib.pyplot as plt

# Read the csv file
data = pd.read_csv("C:/Users/NAEEM UR REHMAN/Desktop/nyt1.csv")

# Define the segments based on age and gender (Gender: 0 = female, 1 = male)
def age_gender_segment(row):
    age_label = "<18-year-old" if row["Age"] < 18 else ">=18-year-old"
    gender_label = "males" if row["Gender"] == 1 else "females"
    return f"{age_label} {gender_label}"

data["segments"] = data.apply(age_gender_segment, axis=1)
# Define the segments based on sign-in status
data["signed_in_segments"] = data["Signed_In"].apply(
    lambda value: "Signed-in users" if value == 1 else "Non-signed-in users"
)

# Group the data by the segments and total the impressions and clicks
grouped_data = data.groupby(["segments", "signed_in_segments"])[["Impressions", "Clicks"]].sum()

# Plot the data
grouped_data.plot(kind="bar")
plt.title("Impressions and clicks by user segments")
plt.xlabel("User segments")
plt.ylabel("Count")
plt.tight_layout()  # keep the long segment labels from being cut off
plt.show()
Screenshot:
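
For the quantitative side of the comparison, one option is an overall click-through rate per segment, computed as total clicks divided by total impressions. This is only a sketch of that idea, not part of the submitted answer: the helper name segment_ctr is hypothetical, the rows are made up, and the column names (Gender, Signed_In, Impressions, Clicks) are assumed to match the ones used above.

```python
import pandas as pd


def segment_ctr(frame, by):
    """Total impressions, total clicks and overall CTR per segment (hypothetical helper)."""
    totals = frame.groupby(by)[["Impressions", "Clicks"]].sum()
    # Mask segments with no impressions so the division yields NaN instead of inf.
    safe_impressions = totals["Impressions"].where(totals["Impressions"] > 0)
    totals["CTR"] = totals["Clicks"] / safe_impressions
    return totals


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Gender": [0, 0, 1, 1, 0],
    "Signed_In": [1, 1, 1, 1, 0],
    "Impressions": [5, 5, 4, 6, 10],
    "Clicks": [1, 0, 1, 1, 1],
})

by_login = segment_ctr(sample, "Signed_In")
by_gender_and_login = segment_ctr(sample, ["Gender", "Signed_In"])
print(by_login)
print(by_gender_and_login)
```

Dividing the totals weights every impression equally, whereas averaging each user's own CTR would weight every user equally; the two can give different rankings of segments.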
4. Create metrics/measurements/statistics that summarize the data, representing mean, median,
variance, and max; these can be calculated across the various user
segments.

Answer:

Code:

import pandas as pd

# Load the data from the CSV file
data_frame = pd.read_csv('C:/Users/NAEEM UR REHMAN/Desktop/nyt1.csv')

# Define the user segments first
user_segments = ['Age', 'Gender', 'Signed_In']

# Calculate the metrics for each user segment
metrics = {}
for segment in user_segments:
    # groupby() splits the rows into groups so that operations can be
    # computed on each group separately.
    group = data_frame.groupby(segment)

    # Calculate the metrics for Impressions and Clicks.
    # agg() (an alias of aggregate()) takes a function or a list of function
    # names and applies each of them to every group.
    metrics[segment] = {}
    metrics[segment]['Impressions'] = group['Impressions'].agg(['mean', 'median', 'var', 'max'])
    metrics[segment]['Clicks'] = group['Clicks'].agg(['mean', 'median', 'var', 'max'])

# Print the metrics
for segment in user_segments:
    print(f"Metrics for {segment}:")
    print(metrics[segment])

Screenshot:
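
The separate Impressions and Clicks results can also be produced as a single table per segment. The sketch below assumes the same column names as above and uses a few made-up rows; the helper name summarize is hypothetical. Note that pandas' 'var' is the sample variance (it divides by n - 1).

```python
import pandas as pd

STATS = ["mean", "median", "var", "max"]


def summarize(frame, segment):
    """One table of summary statistics for Impressions and Clicks (hypothetical helper)."""
    # Selecting both columns before agg() yields two-level column labels,
    # e.g. ("Impressions", "mean") and ("Clicks", "max").
    return frame.groupby(segment)[["Impressions", "Clicks"]].agg(STATS)


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Gender": [0, 0, 1, 1],
    "Impressions": [2, 4, 5, 7],
    "Clicks": [0, 1, 0, 2],
})
summary = summarize(sample, "Gender")
print(summary)
```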