Week 1

The document outlines the first week of a Marketing Analytics course taught by Dr. Swagato Chatterjee, focusing on using R for data analysis. It covers software installation, RStudio interface, essential functions, data manipulation techniques, and basic statistical concepts relevant to hotel review data analysis. Key topics include creating and manipulating matrices and data frames, summarizing data with dplyr, and introducing visualization and regression analysis methods.

Uploaded by

dushyant1209garg

WEEK 1

🟦 Course & Software Overview


Course Title: Marketing Analytics
Instructor: Dr. Swagato Chatterjee, VGSoM, IIT Kharagpur
Software Used: Excel (small problems), R (large problems)

🟩 Why Use R in Marketing Analytics?


Excel is limited to ~1 million rows → not suitable for large datasets.
R is:
Open-source and free.
Has strong community support.
Good for research-oriented work.
Python:
More suitable for deployment/automation, less for research.
Not preferred in this course.

🟨 Installing R & RStudio


Install R from: https://cran.r-project.org
Current mentioned version: R 3.6.1 (may vary)
Install RStudio from: https://posit.co/download/rstudio-desktop
Current version mentioned: RStudio 1.2.5001
Choose RStudio Desktop (Free Version)
For 32-bit systems, use older compatible versions.

🟧 Why RStudio over Base R?


RStudio has a more user-friendly UI.
Easier for click-and-run, drag-and-drop operations.
Also open-source and free.

🟥 RStudio Interface (4 Quadrants)


1. Top-Left → Script Editor:
Write and save R code ( .R files)
2. Bottom-Left → Console:
Runs the code, shows output
3. Top-Right → Environment/History:
Stores variables, datasets
4. Bottom-Right → Files/Plots/Packages/Help:
File explorer, visualizations, install/manage packages

🔷 Getting Started in RStudio


Open RStudio → File → New File → R Script
Save the script using the floppy disk icon or Ctrl + S
File extension: .R
Run code from editor to console for output

🔶 Best Practices
Save script before running code
Type code manually → Helps you learn from mistakes
Avoid copy-paste; typing reinforces learning

✅ Important Functions in R
1. seq() Function
Used to generate sequences.
Syntax: seq(from, to, by)
Example: seq(1, 30, 2) gives 1, 3, 5, ..., 29
2. rep() Function
Used to repeat elements.
Syntax: rep(value, times)
Example: rep(2, 20) repeats 2 twenty times.
3. Help Options in R
help(function_name) or ?function_name shows syntax, arguments, and examples.
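
As a quick illustration of the functions above, the following sketch reproduces the two examples. The variable names odd_numbers and twos are made up for these notes.

```r
# Odd numbers from 1 up to (at most) 30, stepping by 2
odd_numbers <- seq(1, 30, 2)
print(odd_numbers)

# The value 2 repeated twenty times
twos <- rep(2, 20)
print(length(twos))

# Help pages (commented out so the script also runs non-interactively)
# ?seq
# help(rep)
```

Because 30 is not reachable from 1 in steps of 2, seq() stops at the last value that does not exceed the upper limit.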

✅ Subsetting in R
4. Indexing
Access elements using square brackets [ ] .
Example: a[5] returns the 5th element of vector a .
5. Multiple Indexing
Use c() to pass multiple indices.
Example: a[c(5, 7, 9)] gives elements at positions 5, 7, and 9.
6. Conditional Subsetting
Use logical conditions inside [ ] .
Example: a[a > 7] returns elements greater than 7.
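
The three kinds of subsetting can be tried on one small vector. This is a minimal sketch that assumes a is the odd-number sequence from the seq() example; the result names are illustrative only.

```r
a <- seq(1, 30, 2)            # 1, 3, 5, ..., 29

fifth <- a[5]                 # one position
picked <- a[c(5, 7, 9)]       # several positions at once
above_seven <- a[a > 7]       # only the elements where the condition is TRUE

print(fifth)
print(picked)
print(above_seven)
```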

✅ Logical Operators in R
7. Logical Conditions
> : greater than
< : less than
>= : greater than or equal to
<= : less than or equal to
== : equal to
!= : not equal to
8. Combining Conditions
| : OR operator
& : AND operator
Example: a[a > 15 | a < 8] returns values satisfying either condition.
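
A short sketch of how these operators behave, again assuming a is the odd-number vector built with seq(1, 30, 2). A comparison returns a TRUE/FALSE vector, and that vector is what selects elements inside [ ].

```r
a <- seq(1, 30, 2)              # 1, 3, 5, ..., 29

is_nine <- a == 9               # logical vector, TRUE only where a equals 9
either <- a[a > 15 | a < 8]     # OR: keep values above 15 or below 8
both <- a[a > 7 & a < 15]       # AND: keep values strictly between 7 and 15
not_nine <- a[a != 9]           # everything except 9

print(either)
print(both)
```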

✅ Best Practices
9. Online Help
Search for help using keywords like "how to repeat a number in R".
Useful websites include: StackOverflow, RDocumentation, etc.
10. Visualization Analogy
Vector indexing is like finding people on specific floors of a building.
Index = floor number; Vector name = building name.

✅ Session 3 Overview: Working with Matrices and Data Frames in R

🔧 Before Starting
Open the file W1S3.R in RStudio.
Clear the console using Ctrl + L
Clear the Global Environment by clicking the brush icon.
Close any other open files—only W1S3.R should be open.

🧮 Vectors Recap
a: Integer vector 1:10
b: Numeric sequence from 2 to 10 with 10 values using seq(2, 10, length.out = 10)
c: Character vector: first five values "Sachin" , next five "Saurav"
(R is case-sensitive, so the lowercase names match the code used below.)

🧱 Matrix in R
A matrix is a tabular structure with homogeneous data types.
Matrix data must all be numeric or character—not mixed.
Matrix cells are accessed via [row, column] format (e.g., [2, 3] is row 2, column 3).

🛠️ Creating Matrices
1. Column Bind ( cbind )

matrix1 <- cbind(a, b, c)


Joins vectors side-by-side (columns).
Converts all data to character if any one is character (due to coercion).
2. Row Bind ( rbind )

matrix2 <- rbind(a, b, c)


Joins vectors one below another (rows).
Different shape compared to cbind .
3. Direct Matrix Creation

matrix3 <- matrix(1:9, nrow = 3, byrow = TRUE)


Creates a 3x3 matrix with values from 1 to 9.
byrow = TRUE fills row-wise, FALSE (default) fills column-wise.

🧪 Matrix Functions
Use View(matrix_name) to open spreadsheet-like view.
Use t(matrix) for transpose (swap rows/columns).
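
The matrix operations above can be checked with a small sketch. It assumes the vectors a, b and c are defined as in the recap; how c is built with rep() is an assumption, since the notes only describe its contents.

```r
a <- 1:10
b <- seq(2, 10, length.out = 10)
c <- c(rep("Sachin", 5), rep("Saurav", 5))   # assumed construction of c

matrix1 <- cbind(a, b, c)    # vectors become columns: 10 rows x 3 columns
matrix2 <- rbind(a, b, c)    # vectors become rows:    3 rows x 10 columns
print(dim(matrix1))
print(dim(matrix2))
print(typeof(matrix1))       # "character": the numbers were coerced to text

matrix3 <- matrix(1:9, nrow = 3, byrow = TRUE)
print(matrix3[2, 3])         # row 2, column 3
print(t(matrix3)[2, 3])      # after transposing, this is the old [3, 2] cell
```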

🗃️ Data Frame in R
A data frame allows different types of data in different columns (unlike matrices).
Created using: data1 <- data.frame(gh = a, ij = b, kl = c)
gh , ij , kl become column names.
a , b , c are the actual vectors with data.

🔍 Key Differences

Feature | Matrix | Data Frame
Data Types | Homogeneous (all same) | Heterogeneous (mixed allowed)
Use Case | Numeric/character tables | Tabular data (like a spreadsheet)
Access | [row, column] | $column or [row, column]
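
To see the difference in practice, the sketch below builds both structures from the same three vectors. The stringsAsFactors = FALSE argument is an addition for these notes so that the text column stays character on R versions before 4.0.

```r
a <- 1:10
b <- seq(2, 10, length.out = 10)
c <- c(rep("Sachin", 5), rep("Saurav", 5))

# Data frame: each column keeps its own type
data1 <- data.frame(gh = a, ij = b, kl = c, stringsAsFactors = FALSE)
str(data1)

print(data1$gh)        # access a column by name
print(data1[2, 3])     # access by [row, column]

# Matrix from the same vectors: everything is coerced to character
as_matrix <- cbind(a, b, c)
print(typeof(as_matrix))
```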

This session focuses on working with a basic dataset in R, covering how to:

1. Create and view a dataset using variables like company , fy , revenue , and margin .
2. Add a new variable profit calculated from revenue * margin / 100 .
3. Use the dplyr library to:
Group data using group_by()
Modify data using mutate() to add new columns (e.g., highest and lowest margin)
Use summarise() to condense grouped data into summary statistics

💡 Key R Concepts Covered

Concept | Explanation
data.frame() | Combines vectors into a tabular structure.
$ operator | Used to access or create columns within a dataframe.
mutate() | Adds or modifies columns without reducing the dataset size.
summarise() | Condenses multiple rows into one per group.
group_by() | Used with dplyr to perform group-wise operations.
install.packages("dplyr") | Installs the dplyr package.
library(dplyr) | Loads the dplyr package into memory for use.
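
The concepts above can be combined in one short sketch. The company names and figures are invented for illustration, and grouping by company is an assumption (the notes do not say which column was used). The dplyr package must already be installed.

```r
library(dplyr)

# Hypothetical data: two made-up companies over two financial years
data <- data.frame(
  company = c("A", "A", "B", "B"),
  fy      = c(2018, 2019, 2018, 2019),
  revenue = c(100, 120, 200, 180),
  margin  = c(10, 12, 8, 15),          # margin in percent
  stringsAsFactors = FALSE
)

# New column created with the $ operator
data$profit <- data$revenue * data$margin / 100

# mutate() keeps every row and attaches the group-wise extremes to each one
with_extremes <- data %>%
  group_by(company) %>%
  mutate(highest_margin = max(margin), lowest_margin = min(margin)) %>%
  ungroup()

# summarise() collapses each group to a single row
per_company <- data %>%
  group_by(company) %>%
  summarise(total_profit = sum(profit), mean_margin = mean(margin))

print(with_extremes)
print(per_company)
```

The row counts show the difference: with_extremes still has four rows, while per_company has one row per company.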

You’ve shared a detailed walkthrough of data operations in R programming, covering topics like:
1. Summarization using group_by() and summarise()
2. Conditional logic using ifelse()
3. Looping with for loops
4. Subsetting data
5. Creating custom functions

Here’s a clean summary with relevant R code snippets that correspond to each major point you made:

🧮 1. Group-wise Minimum Cost


You used group_by() and summarise() to calculate the lowest cost for each year ( fy ).

library(dplyr)

# Lowest cost within each financial year
new_data <- data %>%
  group_by(as.factor(fy)) %>%
  summarise(lowest_cost = min(cost))

Make sure:
cost is spelled with a lowercase c
fy is the correct year column
You grouped by as.factor(fy) to treat fy as a categorical variable

🔁 2. Conditional Columns Using ifelse()


You created a new column to label margins:

data$margin_high_low <- ifelse(data$margin > 10, "High", "Low")

You also mentioned extending it with nested conditions, which can be done with case_when() from dplyr (preferred for multiple conditions). Conditions are checked in order, and the first one that is TRUE decides the label:

data$margin_level <- case_when(
  data$margin > 15 ~ "Very High",
  data$margin > 10 ~ "High",
  TRUE ~ "Low"
)

🔄 3. Filtering Data Using Subset


You filtered only PNG company rows:

data_png <- data[data$company == "PNG", ]

Make sure to use == (not = ) for logical equality.

📉 4. Calculating Growth Using for Loop


You added a growth-rate column ( gr ) holding the row-over-row change in revenue, expressed as a proportion (multiply by 100 for a percentage):

data_png$gr <- 0  # initialize the column; the first row has no previous year
for (i in 2:nrow(data_png)) {
  data_png$gr[i] <- (data_png$revenue[i] - data_png$revenue[i - 1]) /
    data_png$revenue[i - 1]
}

🧰 5. Defining Your Own Function


You referred to writing custom functions like f(x) in mathematics. Here’s a simple example:

growth_calc <- function(current, previous) {
  return((current - previous) / previous)
}

# Usage
growth_calc(15698, 14567)
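
The snippets in points 2 to 5 refer to a data object that is not shown in these notes. The sketch below strings them together on a small invented dataset (the company "ABC" and all figures are hypothetical) so the steps can be run from start to finish in base R.

```r
# Hypothetical yearly figures; only the column names follow the notes
data <- data.frame(
  company = c("PNG", "PNG", "PNG", "ABC", "ABC", "ABC"),
  fy      = c(2017, 2018, 2019, 2017, 2018, 2019),
  revenue = c(100, 110, 121, 200, 190, 209),
  margin  = c(9, 12, 16, 11, 8, 10),
  stringsAsFactors = FALSE
)

# Conditional column: a margin of exactly 10 is not above 10, so it is "Low"
data$margin_high_low <- ifelse(data$margin > 10, "High", "Low")

# Keep only the PNG rows (note the == for comparison)
data_png <- data[data$company == "PNG", ]

# Custom function for the proportional change between two values
growth_calc <- function(current, previous) {
  (current - previous) / previous
}

# Fill the growth column row by row, starting from the second row
data_png$gr <- 0
for (i in 2:nrow(data_png)) {
  data_png$gr[i] <- growth_calc(data_png$revenue[i], data_png$revenue[i - 1])
}

print(data_png)
```

The loop assumes the rows are already sorted by year and that there are at least two rows; with a single row, 2:nrow(data_png) would count backwards and fail.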

Week 1, Session 5: Handling Hotel Review Data in R

Key Concepts:

Data: sample hotel data.csv - contains hotel reviews (overall rating, date, reviewer type,
and 6 attribute ratings: value, location, sleep quality, rooms, cleanliness, service).
Objective: Basic data analytics for marketing insights (performance, areas for
improvement).
R Functions:
read.csv() : Reads the CSV data.
str() : Shows the structure of the data frame (rows, columns, data types).
names() : Gets column names.
head() : Shows the first few rows.
View() : Opens data in a spreadsheet view.
library(dplyr) : Loads the dplyr package for data manipulation.
group_by() : Groups data by a specific column (e.g., hotel_name_city ).
summarize() : Collapses each group into one row of summary values (e.g., using mean() ).
na.rm = TRUE : Argument in mean() to handle missing values.
as.data.frame() : Converts to a data frame.
[rows, columns] : Used for subsetting data frames.

Core Steps in Analysis:

1. Read Data: Load the sample hotel data.csv into R.


2. Explore Data: Use str() , names() , head() , View() to understand the data.
3. Summarize by Hotel: Use dplyr 's group_by() and summarize() to calculate mean
overall rating and mean attribute ratings for each hotel.
4. Compare Performance: The summarized data allows for comparison of overall and
attribute ratings between hotels.
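
The core steps can be sketched as follows. Since the course file is not included in these notes, the example writes a tiny invented dataset to a temporary CSV and reads it back. The hotel names, the values and the rating_service column name are assumptions; hotel_name_city and review_overall_rating are the names used in the notes.

```r
library(dplyr)

# Stand-in for the course file: a tiny hypothetical review table
toy_reviews <- data.frame(
  hotel_name_city       = c("Hotel A", "Hotel A", "Hotel B", "Hotel B"),
  review_overall_rating = c(5, 4, 3, NA),
  rating_service        = c(5, 3, NA, 2)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(toy_reviews, csv_path, row.names = FALSE)

# 1. Read data
hotel_data <- read.csv(csv_path)

# 2. Explore data
str(hotel_data)
print(names(hotel_data))
print(head(hotel_data))

# 3. Summarize by hotel; na.rm = TRUE skips missing ratings
hotel_summary <- hotel_data %>%
  group_by(hotel_name_city) %>%
  summarise(
    mean_overall = mean(review_overall_rating, na.rm = TRUE),
    mean_service = mean(rating_service, na.rm = TRUE)
  ) %>%
  as.data.frame()

# 4. Compare performance, for example by pulling out one hotel's row
print(hotel_summary[hotel_summary$hotel_name_city == "Hotel A", ])
```

Without na.rm = TRUE, the means for "Hotel B" would come back as NA because each of its columns contains a missing value.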

Potential MCQ Topics:

Purpose of different R functions ( read.csv , str , head , summarize , group_by ).


Understanding the structure of the sample hotel data.csv dataset (columns and their
meaning).
How to calculate basic summary statistics (like mean) in R, including handling missing data
( na.rm = TRUE ).
The role of dplyr in data manipulation.
How to group data and perform calculations within groups.
Basic steps in analyzing customer review data for marketing insights.
The meaning of overall rating and attribute ratings in the context of hotel reviews.
How to subset a data frame in R.

Not Covered in Detail (Less Likely for Basic MCQ):

Advanced R programming concepts beyond basic data frames and functions.


Specific details of regression analysis (mentioned as a future step).
In-depth text mining of review content.
Detailed strategies for resource allocation or service improvement.
The as.Date() function and date format conversions.

Focus on understanding the basic R commands used for data loading, exploration, and
summarization, and how these steps can provide initial marketing insights from
customer review data.

Here are the short and most important notes from Professor Chatterjee's lecture (Week 1,
Session 6):

Topic: Analyzing Hotel Review Data in R - Visualization and Regression

Key Objectives:

Visualization: Create a bar plot to compare overall and attribute ratings of two hotels.
Regression Introduction: Outline the steps involved in regression analysis to determine
the importance of different hotel aspects on overall rating.
Ordered Logistic Regression: Introduce an alternative method for analyzing ordered
categorical data (like the 1-5 star ratings).
Coding Familiarity: Get more comfortable with basic R coding for data analysis.
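
The bar plot objective can be sketched with barplot() and legend(). The summary table below is invented, and the hotel names, ratings and colors are placeholders. The object name summary_two follows the notes, and it is assumed to hold the hotel name in its first column.

```r
# Hypothetical mean ratings for two hotels
summary_two <- data.frame(
  hotel_name_city = c("Hotel A", "Hotel B"),
  overall  = c(4.2, 3.8),
  value    = c(4.0, 3.9),
  location = c(4.5, 3.5),
  service  = c(4.1, 3.6),
  stringsAsFactors = FALSE
)

# barplot() needs a numeric matrix, so drop the name column before converting
plot_matrix <- as.matrix(summary_two[, -1])
bar_colors <- c("steelblue", "orange")

bar_mids <- barplot(
  plot_matrix,
  beside    = TRUE,                      # one bar per hotel, side by side
  names.arg = colnames(plot_matrix),     # label each group with its column name
  xlab      = "Rating type",
  ylab      = "Mean rating",
  col       = bar_colors,
  ylim      = c(0, 6)                    # head-room so the legend does not cover bars
)

# Legend labels come from the first column of the summary table
legend("topright", legend = summary_two[, 1], fill = bar_colors)
```

Each row of the matrix becomes one colored series and each column becomes one group of bars, which is why the two hotels appear next to each other for every rating type.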

R Code and Concepts Covered:

1. Bar Plot Creation:


barplot() function.
Input needs to be a numeric vector or matrix; grouped bars (one set per hotel) need a matrix.
names.arg : Specifies labels for the bars (using column names).
xlab , ylab : Sets axis labels.
beside = TRUE : Displays bars for different groups side-by-side.
col : Sets the colors of the bars.
legend() : Adds a legend to the plot, specifying location ( x , y ), labels
( summary_two[, 1] ), and colors.
2. Steps for Regression Analysis (to find importance of aspects):
Missing Value Imputation: Replace missing values (using median imputation as the
method applied).
Loop through relevant columns (aspect ratings).
Use ifelse() and is.na() to identify missing values.
Replace NA with the median() of that column.
Outlier Removal: Identify and remove extreme values (using the Z-score method with a
threshold of +/- 3).
Calculate Z-scores ( scale() ).
Keep data points where the absolute Z-score is less than 3.
Correlation Check: Examine the correlation matrix of the independent variables
(aspect ratings) to avoid multicollinearity (using cor() ).
3. Normality Check (and its caveat):
Visual inspection using hist() for each variable.
Shapiro-Wilk test ( shapiro.test() ) for formal normality testing.
Important Note: The lecture acknowledges that the rating data is likely not truly normal
(being categorical), but linear regression might still be used for a general idea in some
marketing research.
4. Linear Regression:
lm() function: fit <- lm(review_overall_rating ~ . , data = DATA) (where .
represents all other columns as predictors).
summary(fit) : Displays the results of the linear regression (F-statistic, R-squared,
coefficients, p-values).
Interpretation: Coefficients indicate the impact of each aspect on the overall rating.
Service had the highest positive coefficient in this example.
5. Ordered Logistic Regression (for ordered categorical Y variable):
Requires the MASS library ( library(MASS) ).
Convert both dependent ( review_overall_rating ) and independent (aspect) variables
to factors using as.factor() .
polr() function: fit1 <- polr(factor(review_overall_rating) ~
factor(rating_value) + ..., data = data1, method = "logistic") .
summary(fit1) : Displays the results of the ordered logistic regression.
Interpretation: Coefficients show the change in the log-odds of moving to a higher rating
category (keeping other variables constant). For a numeric predictor this is per one-unit
increase; for a predictor entered as a factor, as here, it is relative to that factor's
reference level. The example showed the impact of increasing aspect ratings (e.g., value
for money) on the likelihood of higher overall ratings.
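
The preparation steps and the linear regression can be sketched on simulated data. Everything about the data is an assumption made for illustration: the sample size, the rating_service and rating_rooms column names, the injected missing values and outlier, and the coefficients used to generate the overall rating. Only the sequence of steps follows the notes.

```r
set.seed(42)
n <- 200
aspects <- c("rating_value", "rating_service", "rating_rooms")

# Simulated aspect ratings on a 1-5 scale
reviews <- data.frame(
  rating_value   = sample(1:5, n, replace = TRUE),
  rating_service = sample(1:5, n, replace = TRUE),
  rating_rooms   = sample(1:5, n, replace = TRUE)
)
# Simulated overall rating, driven mostly by service (by construction)
reviews$review_overall_rating <- 0.5 + 0.3 * reviews$rating_value +
  0.5 * reviews$rating_service + 0.1 * reviews$rating_rooms + rnorm(n, sd = 0.3)

reviews$rating_rooms[c(3, 17)] <- NA   # inject missing values
reviews$rating_value[5] <- 50          # inject one extreme value

# Missing value imputation: replace NA with the column median
for (col in aspects) {
  reviews[[col]] <- ifelse(is.na(reviews[[col]]),
                           median(reviews[[col]], na.rm = TRUE),
                           reviews[[col]])
}

# Outlier removal: keep rows where every aspect has |z| < 3
z <- scale(reviews[, aspects])
keep <- apply(abs(z) < 3, 1, all)
clean <- reviews[keep, ]

# Correlation check among the predictors
print(cor(clean[, aspects]))

# Normality check on the dependent variable
print(shapiro.test(clean$review_overall_rating))

# Linear regression: "." means all remaining columns are predictors
fit <- lm(review_overall_rating ~ ., data = clean)
print(summary(fit))
```

Service gets the largest coefficient here only because the simulation was set up that way; with the real review data, the ranking has to be read from summary(fit).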

Key Takeaways for MCQs:

How to create and interpret a basic bar plot in R for comparing groups.
The fundamental steps involved in preparing data for regression analysis (missing value
handling, outlier detection, correlation check).
The purpose and interpretation of linear regression results (coefficients, significance).
The concept of ordered logistic regression and why it might be suitable for ordered
categorical dependent variables.
Basic R functions used for these analyses ( barplot , ifelse , is.na , median , scale ,
cor , hist , shapiro.test , lm , polr , as.factor ).
The overall goal of using these analytical techniques in a marketing context (understanding
drivers of customer satisfaction).
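
Ordered logistic regression can be sketched with polr() from the MASS package. The data below is simulated, and the sample size, effect sizes and the rating_service column name are assumptions. Unlike the lecture example, the predictors are kept numeric to keep the output short, so each coefficient is the change in log-odds per one-point increase in that aspect rating.

```r
library(MASS)

set.seed(1)
n <- 500

# Simulated aspect ratings on a 1-5 scale
data1 <- data.frame(
  rating_value   = sample(1:5, n, replace = TRUE),
  rating_service = sample(1:5, n, replace = TRUE)
)

# Latent satisfaction score, cut into five ordered categories of equal size
latent <- 0.5 * data1$rating_value + 0.8 * data1$rating_service + rlogis(n)
data1$review_overall_rating <- cut(
  latent,
  breaks = quantile(latent, probs = seq(0, 1, 0.2)),
  labels = 1:5,
  include.lowest = TRUE,
  ordered_result = TRUE
)

# Hess = TRUE stores the Hessian so that summary() can report standard errors
fit1 <- polr(review_overall_rating ~ rating_value + rating_service,
             data = data1, method = "logistic", Hess = TRUE)
print(summary(fit1))

# Odds ratios are often easier to read than log-odds
print(exp(coef(fit1)))
```

The output also lists four intercepts (one per boundary between adjacent rating categories), which is what distinguishes this model from an ordinary linear regression on the 1-5 scale.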

Important Note for Future Sessions: The professor emphasizes the need to revise basic
statistics, marketing management, and introductory business analytics (especially regression
and basic machine learning concepts) as these will be heavily used in future weeks.
