Data Science

1. Explain the case study of RealDirect with an R program


A. RealDirect comprises licensed brokers who belong to various established realtor associations, but even so it has had its share of hate mail from realtors who don't appreciate its approach to cutting commission costs. In this sense, RealDirect is breaking directly into a guild. On the other hand, if a realtor refused to show houses because they are being sold on RealDirect, the potential buyers would see those listings elsewhere and complain.
Case Study: How Does RealDirect Make Money?
R Program:
# Author: Benjamin Reddy
require(gdata)   # read.xls()
require(plyr)    # count()
# Load the Brooklyn rolling-sales spreadsheet
bk <- read.xls("rollingsales_brooklyn.xls", pattern="BOROUGH")
head(bk)
summary(bk)
# Clean the sale price: strip non-digit characters and convert to numeric
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE))
count(is.na(bk$SALE.PRICE.N))
names(bk) <- tolower(names(bk))
# Clean the square footage and year built the same way
bk$gross.sqft <- as.numeric(gsub("[^[:digit:]]", "", bk$gross.square.feet))
bk$land.sqft <- as.numeric(gsub("[^[:digit:]]", "", bk$land.square.feet))
bk$sale.date <- as.Date(bk$sale.date)
bk$year.built <- as.numeric(as.character(bk$year.built))
# Exploratory histograms of the sale prices
attach(bk)
hist(sale.price.n)
hist(sale.price.n[sale.price.n > 0])
hist(gross.sqft[sale.price.n == 0])
detach(bk)
# Keep only actual sales (non-zero price)
bk.sale <- bk[bk$sale.price.n != 0, ]
plot(bk.sale$gross.sqft, bk.sale$sale.price.n)
plot(log(bk.sale$gross.sqft), log(bk.sale$sale.price.n))
## for now, let's look at 1-, 2-, and 3-family homes
bk.homes <- bk.sale[which(grepl("FAMILY", bk.sale$building.class.category)), ]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
# Inspect suspiciously cheap sales, sorted by price
bk.homes[which(bk.homes$sale.price.n < 100000), ][
  order(bk.homes[which(bk.homes$sale.price.n < 100000), ]$sale.price.n), ]
# Flag and drop outliers (log sale price <= 5), then re-plot
bk.homes$outliers <- (log(bk.homes$sale.price.n) <= 5) + 0
bk.homes <- bk.homes[which(bk.homes$outliers == 0), ]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
2. Explain Feature Generation and Selection Criteria
A. Feature Generation: When developing features for a project like Chasing Dragons, we go through a process called feature generation or feature extraction. This involves brainstorming and identifying candidate features, combining expert insight with imagination. Modern logging technology lets us record a large number of features easily, unlike surveys, where respondents can only answer a limited number of questions. The information we might want falls into three types (a sketch of turning a raw event log into features follows the list below).
Types of Information:
1. Relevant but Uncapturable:
• Information that is important but cannot be recorded, such as how much free time users have.
2. Relevant, Loggable, and Logged:
• Information that is important and is actually logged, such as the user actions identified during brainstorming. The feature selection process will later determine how relevant each one actually is.
3. Relevant, Loggable, but Not Logged:
• Important information that could have been logged but wasn't, such as whether users uploaded a profile photo, which might be predictive of their future engagement.
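For the "relevant, loggable, and logged" case, a minimal sketch of turning a raw event log into per-user features is shown below. The log, the user IDs, and the action names are simulated purely for illustration; they are not from the original case study.
# Simulated event log: one row per logged user action (hypothetical names)
set.seed(3)
event_log <- data.frame(
  user_id = sample(1:5, 50, replace=TRUE),
  action  = sample(c("visit", "earn_points", "upload_photo"), 50, replace=TRUE)
)
# Aggregate the log into one row per user: these become candidate features
features <- data.frame(
  visits        = tapply(event_log$action == "visit",        event_log$user_id, sum),
  points_events = tapply(event_log$action == "earn_points",  event_log$user_id, sum),
  has_photo     = tapply(event_log$action == "upload_photo", event_log$user_id, max)
)
features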

B. Selection Criteria:
*R-squared:*
- *Definition:* R-squared measures how well the model explains the variance in the data.
- *Interpretation:* It ranges from 0 to 1, where 1 means the model explains all the variance.

*P-values:*
- *Definition:* P-values assess the significance of the coefficients in a regression.
- *Interpretation:* A low p-value is strong evidence that the coefficient is not zero, implying the corresponding feature is likely important in predicting the outcome.

*AIC (Akaike Information Criterion):*
- *Definition:* AIC balances model complexity against goodness of fit.
- *Goal:* Minimize AIC to find a model that explains the data well without being overly complex.

*BIC (Bayesian Information Criterion):*
- *Definition:* BIC also balances model fit and complexity, with a heavier penalty for additional parameters.
- *Goal:* Minimize BIC to select a simpler model that still fits the data adequately.
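A minimal sketch of how these criteria are read off a fitted linear model in R, using simulated data (x1 is truly predictive, x2 is deliberately irrelevant):
set.seed(42)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 3 + 2*x1 + rnorm(n)                    # x2 has no real effect on y
fit <- lm(y ~ x1 + x2)
summary(fit)$r.squared                       # R-squared: share of variance explained
summary(fit)$coefficients[, "Pr(>|t|)"]      # p-values for each coefficient
AIC(fit)                                     # lower is better
BIC(fit)                                     # lower is better; heavier penalty on parameters
Here x1 should get a very small p-value while x2 typically will not, and a model without x2 will typically have lower AIC and BIC.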
3. Explain Decision Tree Algorithm
A. You need an algorithm to decide which attribute to split on, i.e., which node should be the next one to identify. You choose that attribute in order to maximize information gain, because you're getting the most bang for your buck that way. You keep going until all the points at the end are in the same class or you have no features left. In that case, you take the majority vote. People often "prune the tree" afterwards to avoid overfitting, which just means cutting it off below a certain depth.

# Classification Tree with rpart
library(rpart)
# grow tree
model1 <- rpart(Return ~ profile + num_dragons +
                num_friends_invited + gender + age + num_days,
                method="class", data=chasingdragons)
printcp(model1)   # display the results
plotcp(model1)    # visualize cross-validation results
summary(model1)   # detailed summary, including the thresholds picked to
                  # transform continuous variables into binary splits
# plot tree
plot(model1, uniform=TRUE,
     main="Classification Tree for Chasing Dragons")
text(model1, use.n=TRUE, all=TRUE, cex=.8)

Handling Continuous Variables in Decision Trees:


Packages that already implement decision trees can handle continuous variables for you. You can provide continuous features, and the package will determine an optimal threshold for turning each continuous variable into a binary predictor, as in the sketch below.
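A minimal toy sketch of this behaviour (the data and column names are made up for illustration): rpart is given a single continuous predictor and chooses the numeric cut point itself, which can be read from the splits component of the fitted tree.
library(rpart)
set.seed(1)
toy <- data.frame(hours_played = runif(200, 0, 10))
toy$returned <- factor(ifelse(toy$hours_played > 4 + rnorm(200), "yes", "no"))
toyTree <- rpart(returned ~ hours_played, data=toy, method="class")
toyTree$splits   # the "index" column holds the threshold rpart picked for hours_played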

4. Explain User Retention with example


A. User retention is the ability to keep users returning to a product or service over time.
Example:
Imagine you have an app called Chasing Dragons. Users pay a monthly subscription fee to use it. You notice that only 10% of new users continue using the app after the first month.
Steps to Analyze User Retention:
1. Track User Actions: Record every action users take in the app, such as visits, points earned, and profile updates.
2. Create a Dataset: Organize these actions into a table where each row is a user and each column is a different feature of their activity.
3. Identify Useful Features: Work with your team to brainstorm and list important features that might affect user retention.
Examples of Features: number of days the user visited in the first month, time until the user's second visit, and points earned each day for the first 30 days. (A sketch of such a dataset and a simple retention model follows.)
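A minimal sketch of steps 1-3 with simulated data (all column names here, such as days_visited_first_month, are hypothetical), followed by a simple logistic regression relating the features to whether the user returned:
set.seed(7)
n <- 1000
# One row per user, with engagement features from the first month
users <- data.frame(
  days_visited_first_month = rpois(n, 8),
  hours_to_second_visit    = rexp(n, 1/24),
  points_first_30_days     = rpois(n, 300)
)
# Simulate a retention outcome that depends on visits and points
p <- plogis(-3 + 0.3*users$days_visited_first_month +
            0.002*users$points_first_30_days)
users$returned <- rbinom(n, 1, p)
# Which features are associated with coming back?
retention_model <- glm(returned ~ ., data=users, family=binomial)
summary(retention_model)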
5. Explain Philosophy of EDA
A. EDA is a process in data science focused on exploring data to understand its main features before applying any advanced analysis. The goal is to uncover patterns, detect anomalies, and test assumptions.
Key Points of EDA:
1. Be Curious and Open-Minded: Approach the data without preconceived notions, ready to discover new insights.
2. Clean and Prepare Data: Ensure the data is accurate and complete.
3. Summarize Data: Use descriptive statistics to get a sense of the data's main characteristics.
4. Visualize Data: Create charts such as histograms, box plots, and scatter plots to see distributions and relationships between variables.
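A minimal sketch of the summarize and visualize steps, using R's built-in mtcars data:
data(mtcars)
summary(mtcars)                          # descriptive statistics for every column
hist(mtcars$mpg, main="Distribution of mpg")
boxplot(mpg ~ cyl, data=mtcars)          # distribution of mpg by number of cylinders
plot(mtcars$wt, mtcars$mpg)              # relationship between weight and mpg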

6. Random Forests Algorithm


A. Random forests generalize decision trees with bagging, otherwise known as bootstrap aggregating. We will explain bagging in more detail later, but the effect of using it is to make your models more accurate and more robust, at the cost of interpretability: random forests are notoriously difficult to understand. They are, conversely, easy to specify, with two hyperparameters: you just need to specify the number of trees you want in your forest, say N, as well as the number of features to randomly select for each tree, say F.
Now to the algorithm. To construct a random forest, you construct N decision trees as follows:
1. For each tree, take a bootstrap sample of your data, and at each node randomly select F features, say 5 out of the 100 total features.
2. Then use your entropy-information-gain engine, as described in the previous section, to decide which of those features to split the tree on at each stage.
The code for this would look like:
require(ggplot2)
data(diamonds)
head(diamonds)
# Look at the price distribution and mark the $12,000 cutoff
ggplot(diamonds) + geom_histogram(aes(x=price)) +
  geom_vline(xintercept=12000)
# Create the binary outcome and drop the raw price
diamonds$Expensive <- ifelse(diamonds$price >= 12000, 1, 0)
head(diamonds)
diamonds$price <- NULL
# Fit a penalized logistic regression (glmnet)
require(glmnet)
x <- model.matrix(~., diamonds[, -ncol(diamonds)])
y <- as.matrix(diamonds$Expensive)
system.time(modGlmnet <- glmnet(x=x, y=y, family="binomial"))
plot(modGlmnet, label=TRUE)
# Fix the random seed so results are reproducible
set.seed(48872)
sample(1:10)
# Fit a single classification tree
require(rpart)
modTree <- rpart(Expensive ~ ., data=diamonds)
require(adabag)   # loaded for later use
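Note that the code above only prepares the diamonds data and fits a glmnet model and a single rpart tree; it never fits the forest itself. A hedged sketch of the random forest step, using the randomForest package (an assumption, not loaded in the code above) with N given by ntree and F by mtry, might look like:
require(randomForest)
set.seed(48872)
modForest <- randomForest(as.factor(Expensive) ~ ., data=diamonds,
                          ntree=100,   # N: number of trees in the forest
                          mtry=3)      # F: features randomly tried at each split
print(modForest)   # out-of-bag error estimate and confusion matrix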
