Data Science

1. Explain the case study of RealDirect with an R program


A. RealDirect comprises licensed brokers who belong to various established realtor associations, but even so it has had its share of hate mail from realtors who don't appreciate its approach to cutting commission costs. In this sense, RealDirect is breaking directly into a guild. On the other hand, if a realtor refused to show houses because they are being sold on RealDirect, the potential buyers would see those listings elsewhere and complain.
Case Study: How Does RealDirect Make Money?
R Program:
# Author: Benjamin Reddy
require(gdata)   # read.xls()
require(plyr)    # count()
# Load the Brooklyn rolling-sales spreadsheet
bk <- read.xls("rollingsales_brooklyn.xls", pattern="BOROUGH")
head(bk)
summary(bk)
# Clean the sale price: strip non-digit characters and convert to numeric
bk$SALE.PRICE.N <- as.numeric(gsub("[^[:digit:]]", "", bk$SALE.PRICE))
count(is.na(bk$SALE.PRICE.N))
names(bk) <- tolower(names(bk))
# Clean the square footage and year built the same way
bk$gross.sqft <- as.numeric(gsub("[^[:digit:]]", "", bk$gross.square.feet))
bk$land.sqft <- as.numeric(gsub("[^[:digit:]]", "", bk$land.square.feet))
bk$sale.date <- as.Date(bk$sale.date)
bk$year.built <- as.numeric(as.character(bk$year.built))
# Exploratory histograms of the sale prices
attach(bk)
hist(sale.price.n)
hist(sale.price.n[sale.price.n > 0])
hist(gross.sqft[sale.price.n == 0])
detach(bk)
# Keep only actual sales (non-zero price)
bk.sale <- bk[bk$sale.price.n != 0, ]
plot(bk.sale$gross.sqft, bk.sale$sale.price.n)
plot(log(bk.sale$gross.sqft), log(bk.sale$sale.price.n))
## for now, let's look at 1-, 2-, and 3-family homes
bk.homes <- bk.sale[which(grepl("FAMILY", bk.sale$building.class.category)), ]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
# Inspect suspiciously cheap sales, sorted by price
bk.homes[which(bk.homes$sale.price.n < 100000), ][
  order(bk.homes[which(bk.homes$sale.price.n < 100000), ]$sale.price.n), ]
# Flag and drop outliers (log sale price <= 5), then re-plot
bk.homes$outliers <- (log(bk.homes$sale.price.n) <= 5) + 0
bk.homes <- bk.homes[which(bk.homes$outliers == 0), ]
plot(log(bk.homes$gross.sqft), log(bk.homes$sale.price.n))
2. Explain Feature Generation and Selection Criteria
A. Feature Generation: When developing features for a project like Chasing Dragons, we go through a process called feature generation or feature extraction. This involves brainstorming and identifying candidate features, combining expert insight with imagination. Modern logging technology lets us record a large number of features easily, unlike surveys, where respondents can only answer a limited number of questions. The information we might want falls into three types (a sketch of turning a raw event log into features follows the list below).
Types of Information:
1. Relevant but Uncapturable:
• Information that is important but cannot be recorded, such as how much free time users have.
2. Relevant, Loggable, and Logged:
• Information that is important and is actually logged, such as the user actions identified during brainstorming. The feature selection process will later determine how relevant each one actually is.
3. Relevant, Loggable, but Not Logged:
• Important information that could have been logged but wasn't, such as whether users uploaded a profile photo, which might be predictive of their future engagement.
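For the "relevant, loggable, and logged" case, a minimal sketch of turning a raw event log into per-user features is shown below. The log, the user IDs, and the action names are simulated purely for illustration; they are not from the original case study.
# Simulated event log: one row per logged user action (hypothetical names)
set.seed(3)
event_log <- data.frame(
  user_id = sample(1:5, 50, replace=TRUE),
  action  = sample(c("visit", "earn_points", "upload_photo"), 50, replace=TRUE)
)
# Aggregate the log into one row per user: these become candidate features
features <- data.frame(
  visits        = tapply(event_log$action == "visit",        event_log$user_id, sum),
  points_events = tapply(event_log$action == "earn_points",  event_log$user_id, sum),
  has_photo     = tapply(event_log$action == "upload_photo", event_log$user_id, max)
)
features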

B. Selection Criteria:
*R-squared:*
- *Definition:* R-squared measures how well the model explains the variance in the data.
- *Interpretation:* It ranges from 0 to 1, where 1 means the model explains all the variance.

*P-values:*
- *Definition:* P-values assess the significance of the coefficients in a regression.
- *Interpretation:* A low p-value is strong evidence that the coefficient is not zero, implying the corresponding feature is likely important in predicting the outcome.

*AIC (Akaike Information Criterion):*
- *Definition:* AIC balances model complexity against goodness of fit.
- *Goal:* Minimize AIC to find a model that explains the data well without being overly complex.

*BIC (Bayesian Information Criterion):*
- *Definition:* BIC also balances model fit and complexity, with a heavier penalty for additional parameters.
- *Goal:* Minimize BIC to select a simpler model that still fits the data adequately.
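A minimal sketch of how these criteria are read off a fitted linear model in R, using simulated data (x1 is truly predictive, x2 is deliberately irrelevant):
set.seed(42)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 3 + 2*x1 + rnorm(n)                    # x2 has no real effect on y
fit <- lm(y ~ x1 + x2)
summary(fit)$r.squared                       # R-squared: share of variance explained
summary(fit)$coefficients[, "Pr(>|t|)"]      # p-values for each coefficient
AIC(fit)                                     # lower is better
BIC(fit)                                     # lower is better; heavier penalty on parameters
Here x1 should get a very small p-value while x2 typically will not, and a model without x2 will typically have lower AIC and BIC.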
3. Explain Decision Tree Algorithm
A. You need an algorithm to decide which attribute to split on, i.e., which node should be the next one to identify. You choose that attribute in order to maximize information gain, because you're getting the most bang for your buck that way. You keep going until all the points at the end are in the same class or you have no features left. In that case, you take the majority vote. People often "prune the tree" afterwards to avoid overfitting, which just means cutting it off below a certain depth.

# Classification Tree with rpart
library(rpart)
# grow tree
model1 <- rpart(Return ~ profile + num_dragons +
                num_friends_invited + gender + age + num_days,
                method="class", data=chasingdragons)
printcp(model1)   # display the results
plotcp(model1)    # visualize cross-validation results
summary(model1)   # detailed summary, including the thresholds picked to
                  # transform continuous variables into binary splits
# plot tree
plot(model1, uniform=TRUE,
     main="Classification Tree for Chasing Dragons")
text(model1, use.n=TRUE, all=TRUE, cex=.8)

Handling Continuous Variables in Decision Trees:


Packages that already implement decision trees can handle continuous variables for you. You can provide continuous features, and the package will determine an optimal threshold for turning each continuous variable into a binary predictor, as in the sketch below.
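A minimal toy sketch of this behaviour (the data and column names are made up for illustration): rpart is given a single continuous predictor and chooses the numeric cut point itself, which can be read from the splits component of the fitted tree.
library(rpart)
set.seed(1)
toy <- data.frame(hours_played = runif(200, 0, 10))
toy$returned <- factor(ifelse(toy$hours_played > 4 + rnorm(200), "yes", "no"))
toyTree <- rpart(returned ~ hours_played, data=toy, method="class")
toyTree$splits   # the "index" column holds the threshold rpart picked for hours_played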

4. Explain User Retention with example


A. User retention is the ability to keep users returning to a product or service over time.
Example:
Imagine you have an app called Chasing Dragons. Users pay a monthly subscription fee to use it. You notice that only 10% of new users continue using the app after the first month.
Steps to Analyze User Retention:
1. Track User Actions: Record every action users take in the app, such as visits, points earned, and profile updates.
2. Create a Dataset: Organize these actions into a table where each row is a user and each column is a different feature of their activity.
3. Identify Useful Features: Work with your team to brainstorm and list important features that might affect user retention.
Examples of Features: number of days the user visited in the first month, time until the user's second visit, and points earned each day for the first 30 days. (A sketch of such a dataset and a simple retention model follows.)
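A minimal sketch of steps 1-3 with simulated data (all column names here, such as days_visited_first_month, are hypothetical), followed by a simple logistic regression relating the features to whether the user returned:
set.seed(7)
n <- 1000
# One row per user, with engagement features from the first month
users <- data.frame(
  days_visited_first_month = rpois(n, 8),
  hours_to_second_visit    = rexp(n, 1/24),
  points_first_30_days     = rpois(n, 300)
)
# Simulate a retention outcome that depends on visits and points
p <- plogis(-3 + 0.3*users$days_visited_first_month +
            0.002*users$points_first_30_days)
users$returned <- rbinom(n, 1, p)
# Which features are associated with coming back?
retention_model <- glm(returned ~ ., data=users, family=binomial)
summary(retention_model)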
5. Explain Philosophy of EDA
A. EDA is a process in data science focused on exploring data to understand its main features before applying any advanced analysis. The goal is to uncover patterns, detect anomalies, and test assumptions.
Key Points of EDA:
1. Be Curious and Open-Minded: Approach the data without preconceived notions, ready to discover new insights.
2. Clean and Prepare Data: Ensure the data is accurate and complete.
3. Summarize Data: Use descriptive statistics to get a sense of the data's main characteristics.
4. Visualize Data: Create charts such as histograms, box plots, and scatter plots to see distributions and relationships between variables.
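A minimal sketch of the summarize and visualize steps, using R's built-in mtcars data:
data(mtcars)
summary(mtcars)                          # descriptive statistics for every column
hist(mtcars$mpg, main="Distribution of mpg")
boxplot(mpg ~ cyl, data=mtcars)          # distribution of mpg by number of cylinders
plot(mtcars$wt, mtcars$mpg)              # relationship between weight and mpg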

6. Random Forests Algorithm


A. Random forests generalize decision trees with bagging, otherwise known as bootstrap aggregating. We will explain bagging in more detail later, but the effect of using it is to make your models more accurate and more robust, at the cost of interpretability: random forests are notoriously difficult to understand. They are, conversely, easy to specify, with two hyperparameters: you just need to specify the number of trees you want in your forest, say N, as well as the number of features to randomly select for each tree, say F.
Now to the algorithm. To construct a random forest, you construct N decision trees as follows:
1. For each tree, take a bootstrap sample of your data, and at each node randomly select F features, say 5 out of the 100 total features.
2. Then use your entropy-information-gain engine, as described in the previous section, to decide which of those features to split the tree on at each stage.
The code for this would look like:
require(ggplot2)
data(diamonds)
head(diamonds)
# Look at the price distribution and mark the $12,000 cutoff
ggplot(diamonds) + geom_histogram(aes(x=price)) +
  geom_vline(xintercept=12000)
# Create the binary outcome and drop the raw price
diamonds$Expensive <- ifelse(diamonds$price >= 12000, 1, 0)
head(diamonds)
diamonds$price <- NULL
# Fit a penalized logistic regression (glmnet)
require(glmnet)
x <- model.matrix(~., diamonds[, -ncol(diamonds)])
y <- as.matrix(diamonds$Expensive)
system.time(modGlmnet <- glmnet(x=x, y=y, family="binomial"))
plot(modGlmnet, label=TRUE)
# Fix the random seed so results are reproducible
set.seed(48872)
sample(1:10)
# Fit a single classification tree
require(rpart)
modTree <- rpart(Expensive ~ ., data=diamonds)
require(adabag)   # loaded for later use
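Note that the code above only prepares the diamonds data and fits a glmnet model and a single rpart tree; it never fits the forest itself. A hedged sketch of the random forest step, using the randomForest package (an assumption, not loaded in the code above) with N given by ntree and F by mtry, might look like:
require(randomForest)
set.seed(48872)
modForest <- randomForest(as.factor(Expensive) ~ ., data=diamonds,
                          ntree=100,   # N: number of trees in the forest
                          mtry=3)      # F: features randomly tried at each split
print(modForest)   # out-of-bag error estimate and confusion matrix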
