This document provides an introduction to R, a language and environment for statistical computing and graphics, emphasizing its strengths in data analysis and visualization. It covers the RStudio interface, basic calculations, data structures like vectors and data frames, and how to manage packages and import data from CSV and Excel files. The document also highlights the importance of getting help and practicing with R to enhance data manipulation skills.
1. R Programming
Class 1: Introduction to R
Introduction to R and Data Basics
Md. Iftekhar Ahmed Khan
Machine Learning Engineer, Bondstein Technologies Limited

Welcome & Today's Goals
• Today you will learn to:
  • Understand what R is and why it's used.
  • Navigate the RStudio interface.
  • Perform basic calculations and use variables.
  • Understand and use fundamental data structures (Vectors, Data Frames).
  • Manage R packages (install/load).
  • Import data from common file types (CSV, Excel) and save results.
  • Know how to get help!

What is R?
• R is a language AND an environment for statistical computing and graphics.
• Strengths:
  • Specifically designed for data analysis and visualization.
  • Open Source: Free to use, modify, and distribute.
  • Huge Community: Active development, extensive documentation, lots of help online.
  • Packages: Thousands of add-ons for specialized tasks (more later!).
• Common Uses: Data cleaning, data exploration, statistical modeling, machine learning, report generation, creating plots and dashboards.

Why R for Data Science?
• Vast Package Ecosystem: CRAN (Comprehensive R Archive Network) hosts thousands of packages (like dplyr for manipulation, ggplot2 for plotting).
• Powerful Visualization: Tools like ggplot2 allow for creating complex and publication-quality graphics.
• Data Wrangling: Excellent tools (like the tidyverse) for cleaning, transforming, and preparing data.
• Reproducibility: Scripts make analyses repeatable and shareable.
• Interoperability: Connects well with databases, other languages (Python, SQL), and reporting tools (R Markdown).

RStudio
• RStudio: An Integrated Development Environment (IDE) for R. Think of it as a powerful dashboard for R.
• Code editor with syntax highlighting
• Console to run commands interactively
• Workspace browser to see your variables
• Plotting window
• Help and file browsers
• Package manager

Your First Commands (Use Console)
• R can be used as a powerful calculator.
• Type directly into the Console pane (after the > prompt) and press Enter.
# Basic Arithmetic
2 + 2
# [1] 4 <- This is the output R gives
5 * 10
# [1] 50
10 / 3
# [1] 3.333333

Logical Operations & Comparisons
• Used for asking TRUE/FALSE questions. Essential for filtering data later.
• == means "is equal to?" (Note: double equals!)
• != means "is not equal to?"
• >, <, >=, <= (Greater than, Less than, etc.)
# Comparisons
5 > 3
# [1] TRUE
10 == 10
# [1] TRUE
10 == 5
# [1] FALSE
5 != 6
# [1] TRUE

Variables (Objects) in R
• Store values or results using variables (R often calls them objects).
• Use the assignment operator <- (less than sign, hyphen). Think of it as an arrow pointing from the value to the variable name.
• Variable names:
  • Must start with a letter (or with a period that is not followed by a digit).
  • Can contain letters, numbers, underscores (_), and periods (.).
  • Are case-sensitive (myVar is different from myvar).
  • Avoid using names of existing functions (like c, mean, data).

Using Built-in Functions
• Functions perform specific tasks. You provide arguments (inputs) inside parentheses ().
• R has many built-in functions.
some_numbers <- c(2, 8, 3, 7, 5)
# Use functions on the data
sum(some_numbers) # Calculates the sum
# [1] 25
mean(some_numbers) # Calculates the average (mean)
# [1] 5

Getting Help!
• Essential skill! Don't try to memorize everything.
• Use ? followed by the function name (no parentheses needed), for example ?mean.
• Use help("function_name"), for example help("mean").
• Use ?? to search documentation for keywords (use quotes), for example ??"regression".

Packages: Extending R's Power
• Packages are collections of functions, data, and documentation that add specific capabilities to R.
• Thousands are available from CRAN (Comprehensive R Archive Network) and other places (like GitHub, Bioconductor).
• Examples: dplyr for data manipulation, ggplot2 for plotting, readxl for reading Excel files.
• Two Steps:
  • Install: Download the package to your computer (only needed ONCE per R installation). Use install.packages("package_name").
  • Load: Make the package's functions available in your current R session (needed EVERY TIME you start a new R session and want to use it). Use library(package_name).

Data Structures: Organizing Your Data
• Variables store single values. Data structures store collections of values.
• R has several fundamental data structures:
  • Vectors: Ordered sequence of elements of the same basic type. (MOST FUNDAMENTAL)
  • Data Frames: Rectangular table (like a spreadsheet), columns can be different types. (MOST IMPORTANT FOR TABULAR DATA)
  • Lists: Ordered, flexible collection, elements can be of different types/structures.
  • Matrices: 2-dimensional array, all elements must be the same type.
  • Factors: Special type of vector for representing categorical data (groups/levels).

Data Structure 1: Vectors
• The basic building block. Use the c() function (combine or concatenate).
• All elements MUST be the same type (numeric, character, logical). If you mix types, R will coerce them to a common type (often character).
# Numeric vector
ages <- c(25, 30, 22, 45)
ages
# [1] 25 30 22 45
class(ages)
# [1] "numeric"

Data Structure 2: Data Frames
• The go-to structure for datasets (rows = observations, columns = variables).
• Think spreadsheet: rectangular.
• Columns are typically vectors.
• Columns can be different data types (numeric, character, etc.).
• All columns MUST have the same length (same number of rows).
# Creating a data frame
employee_data <- data.frame(
  EmployeeID = c(101, 102, 103, 104),
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("Sales", "IT", "Sales", "HR"),
  Salary = c(50000, 65000, 52000, 58000)
)
# Print the data frame
employee_data

Accessing Data Frame Elements
• Use $ to access columns by name (most common).
• Use [[ ]] to access columns by name or index.
• Use [row, column] indexing.
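The [row, column] form in the last bullet has no example on the slide, so here is a minimal sketch of it. It rebuilds the employee_data data frame from the previous slide so that it runs on its own, and it assumes R 4.0 or later, where data.frame() keeps text columns as character rather than converting them to factors. The name sales_staff is only illustrative.

```r
# Recreate the data frame from the slide so this block runs on its own
employee_data <- data.frame(
  EmployeeID = c(101, 102, 103, 104),
  Name = c("Alice", "Bob", "Charlie", "David"),
  Department = c("Sales", "IT", "Sales", "HR"),
  Salary = c(50000, 65000, 52000, 58000)
)

# One cell: row 2 of the "Name" column
employee_data[2, "Name"]
# [1] "Bob"   (assumes R >= 4.0, where Name is a character column)

# A whole row: leave the column position empty
employee_data[1, ]

# Several rows and several columns at once
employee_data[1:2, c("Name", "Salary")]

# Rows that satisfy a condition: put a logical comparison in the row position
sales_staff <- employee_data[employee_data$Department == "Sales", ]
sales_staff$Name   # the names of the two Sales employees, Alice and Charlie
```

Leaving the row or column position empty means "all of them", which is why the comma is still needed when selecting a whole row.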
# Access the 'Name' column
employee_data$Name
# [1] "Alice" "Bob" "Charlie" "David"
# Access the 'Salary' column
employee_data[["Salary"]]
# [1] 50000 65000 52000 58000
# Access the 3rd column (Department)
employee_data[[3]]
# [1] "Sales" "IT" "Sales" "HR"

Data Structure 3: Lists
• Flexible containers. Can hold vectors, data frames, other lists, mixed types.
my_list <- list(name = "Alice", age = 30, scores = c(85, 92, 78), employed = TRUE)
my_list # Print the list
my_list$scores # Access list elements by name using $
my_list[[3]] # Access list elements by index using [[ ]]

Working Directory & RStudio Projects
• When reading/writing files, R looks in the working directory.
• getwd(): Get Working Directory (see where R is looking).
• setwd("path/to/your/directory"): Set Working Directory (use / not \). Can be fragile!
• BETTER WAY: RStudio Projects!
  • Go to File -> New Project... -> New Directory (or Existing Directory).
  • Create a folder for your course/project.
  • RStudio automatically sets the working directory to the project folder when you open the .Rproj file.
  • Keeps scripts, data, and output organized together! Highly recommended.

Importing Data: CSV Files
• CSV = Comma Separated Values. Very common plain text format.
• Use read.csv() (base R) or read_csv() (from the readr package, part of tidyverse - often faster and smarter).
• Make sure the CSV file is in your RStudio Project folder (or working directory).
# Assume 'employee_data.csv' exists in your project directory
# Using base R:
my_data_csv <- read.csv("employee_data.csv")
head(my_data_csv)
str(my_data_csv)

Importing Data: Excel Files
• Requires the readxl package (install and load it first!).
• read_excel() function is the main tool.
• Can specify sheet name or number.
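Choosing a sheet by name or by number, as the last bullet mentions, might look like the sketch below. So that it runs without a file of your own, it assumes the readxl package is installed and uses datasets.xlsx, an example workbook that ships with readxl and is located with readxl_example(). The variable names are only illustrative.

```r
library(readxl)

# Path to an example workbook bundled with readxl (not a course file)
example_path <- readxl_example("datasets.xlsx")

# List the sheet names in the workbook before reading anything
sheet_names <- excel_sheets(example_path)
sheet_names

# Read a sheet by name...
by_name <- read_excel(example_path, sheet = "mtcars")

# ...or by position (in this workbook, "mtcars" is the second sheet)
by_number <- read_excel(example_path, sheet = 2)

head(by_name)
```

For your own workbook, replace example_path with the file name, for instance "employee_data.xlsx" in your project directory.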
# Make sure readxl is loaded:
library(readxl)
# Assume 'employee_data.xlsx' exists in your project directory
# Read the first sheet by default
my_data_excel <- read_excel("employee_data.xlsx")
head(my_data_excel)
str(my_data_excel)

Class 1 Summary & Recap
• R is a powerful language for data analysis. RStudio is the best way to use it.
• You can do calculations, use variables (<-), and call functions ().
• Key Data Structures:
  • Vectors: c(), same data type, access with [].
  • Data Frames: data.frame(), columns ($, [[]], [,]), rows ([,]).
• Packages extend R: install.packages(), library().
• Use RStudio Projects for organization.
• Import/Export: read.csv(), read_excel(), write.csv(), write_xlsx() (the last one comes from the writexl package).
• Getting Help: ?, ??.

Practice & Next Class
• Practice:
  • Create different types of vectors.
  • Create a simple data frame.
  • Practice accessing elements/columns.
  • Try importing a sample CSV or Excel file (find one online or create one).
• Next Class:
  • Data Manipulation! We'll learn how to filter, select, rearrange, and summarize data using the powerful dplyr package.
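The recap lists write.csv() for saving results, but none of the slides show it in use. As a starting point for practice, here is a minimal sketch that saves a small data frame and reads it back. The output path is a temporary file chosen for illustration (an assumption, not a course file), so nothing in your project folder is overwritten.

```r
# Small example data frame (values borrowed from the slides)
employee_data <- data.frame(
  Name = c("Alice", "Bob"),
  Salary = c(50000, 65000)
)

# Hypothetical output location: a temporary file
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE stops R from writing the row numbers as an extra column
write.csv(employee_data, out_file, row.names = FALSE)

# Read the file back to check what was saved
saved_data <- read.csv(out_file)
str(saved_data)
```

In an RStudio Project you would normally pass a plain file name such as "results.csv" instead of tempfile(), and the file would be written to the project folder.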