
Unit1-Data Science

Here are the key points about column binding in R:
•cbind() binds/combines two objects (vectors, matrices, data frames) by columns.
•The number of rows must be the same in the objects being combined.
•It appends the columns of the second object to the right of the first object.
•The column names do not need to be unique.
•It is used to add additional variables/columns to an existing data frame.
In summary, cbind() combines objects by columns, requiring matching row numbers, to expand the number of variables in a data frame or other object.

18CSE396T-DATA SCIENCE

TOPICS COVERED

•What is Data science?
•Evolution of Data Science
•Key Roles of Data Science Project
•Data Science Applications
What is Data science?
Data science is the field of study that combines domain
expertise, programming skills, and knowledge of
mathematics and statistics to extract meaningful insights
from data.
Need for Data Science

•Better decision making
•Predictive analysis
•Pattern discovery
E.g. self-driving cars, airlines, logistics companies
What is Data science
Recap: what is Data Science
Steps in Data science
Skill sets of Data Scientist (In general)

• Quantitative skill
• Technical aptitude
• Skeptical mind-set and critical thinking
• Curious and creative
• Communicative and collaborative
Skill sets of Data Scientist (Technical aspect)
• Statistics
• Programming Languages
• Data Extraction & Processing
• Data Wrangling & Exploration
• Machine learning & Deep learning
• Big data processing
• Data Visualization
Key Roles for a Successful
Analytics Project

❖Business User
❖Project Sponsor
❖Project Manager
❖Business Intelligence Analyst
❖Database Administrator (DBA)
❖Data Engineer
❖Data Scientist
Business User

❖Someone who understands the domain area and usually benefits
from the results. This person can consult and advise the project
team on the context of the project, the value of the results, and
how the outputs will be operationalized.
❖Usually a business analyst, line manager, or deep subject
matter expert in the project domain fulfils this role.
Project Sponsor
❖Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines
the core business problem.
❖ Generally provides the funding and gauges the degree
of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the
desired outputs.
Project Manager
❖Ensures that key milestones and objectives are met on time
and at the expected quality.
Business Intelligence Analyst
❖Provides business domain expertise based on a deep
understanding of the data, key performance indicators
(KPIs), key metrics, and business intelligence from a
reporting perspective. Business Intelligence Analysts
generally create dashboards and reports and have
knowledge of the data feeds and sources.
Database Administrator (DBA)
Role:
The database administrator ensures that the database is accessible
to all relevant users, that it is performing correctly, and that it
is kept safe from hacking.
Languages:
SQL, Java, C#, and Python
Data Engineer
Role:
The data engineer works with large amounts of data and
develops, constructs, tests, and maintains architectures
such as large-scale processing systems and databases.

Languages:
SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and Perl
Data Scientist
Role:
A Data Scientist is a professional who works with enormous
amounts of data to produce compelling business insights
by using various tools, techniques, methodologies,
algorithms, etc.
Languages:
R, SAS, Python, SQL, Hive, Matlab, Pig, Spark
TOOLS FOR DATA SCIENCE
DATA SCIENCE PROCESS
Data Structures
Structured data

Structured data is data whose elements are addressable for
effective analysis. It has been organized into a formatted
repository that is typically a database. It covers all data that
can be stored in an SQL database in tables with rows and columns.
Semi-Structured data
Semi-structured data is information that does not reside in a
relational database but that has some organizational properties
that make it easier to analyze. With some processing, it can be
stored in a relational database (this can be very hard for some
kinds of semi-structured data), but the semi-structured form exists
to save space. Example: XML data.
Quasi Structured data
This type of data consists of textual content with
erratic data formats; it can be formatted with effort,
software tools, and time. An example of
quasi-structured data is the data about webpages a
user visited and in what order.
Unstructured Data
Unstructured data is data that is not organized in a
pre-defined manner or does not have a pre-defined data model.
E.g. weblogs, multimedia content, email, text files.
State of the Practice in Analytics
BI VS DATA SCIENCE
Business Intelligence
Business Intelligence is a process of collecting, integrating, analyzing
and presenting the data. With Business Intelligence, executives and
managers can have a better understanding of decision-making. This
process is carried out through software services and tools.

Business analytics
Business analytics deals with the why’s of what happened in the past.
It breaks down contributing factors and causality. It also uses these
why’s to make predictions of what will happen in the future.
Conclusion

“BI is needed to run the business while Business Analytics are
needed to change the business.”
RECAP

❖DATA STRUCTURES
Structured, semi-structured, quasi-structured and unstructured
❖STATE OF PRACTICE IN ANALYTICS
❖BI VS DATA SCIENCE


RDBMS
Data is stored in the form of rows and columns in an RDBMS. The
relations among tables are also stored in the form of tables. SQL
(Structured Query Language) is a programming language used to
perform tasks such as updating data in a database or retrieving
data from a database. Eg: Oracle, Sybase, Microsoft SQL Server, Access DB

NoSQL
NoSQL is commonly referred to as “Not Only SQL”. With NoSQL,
unstructured, schema-less data can be stored in multiple collections
and nodes. It does not require fixed table schemas, it supports
limited join queries, and it is scaled horizontally. Eg: MongoDB,
CouchDB, HBase, Cassandra
Staging and Curating the data
DATA STAGING
❖A data staging area (DSA) is a temporary storage area between the
data sources and a data warehouse.
❖The staging area is mainly used to quickly extract data from its data
sources, minimizing the impact on the sources.
DATA CURATION
Data curation is the management of data throughout its lifecycle, from
creation and initial storage to the time when it is archived for posterity or
becomes obsolete and is deleted.
R DATA STRUCTURES

❖Vectors
❖Lists
❖Matrices
❖Arrays
❖Factors
❖Data Frame
VECTOR
A vector is an object used to store multiple values of the same
data type. A vector cannot hold a combination of integer and
character values; if mixed values are supplied, R coerces them
to a single common type.
marks <- c(88,65,90,40,65)
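As a small illustration of the same-type rule, the sketch below shows what happens when numbers and strings are mixed in c(). The name mixed and its values are made up for this example; the point is only that R converts every element to character rather than raising an error.

```r
marks <- c(88, 65, 90, 40, 65)     # numeric vector
mixed <- c(88, "absent", 90)       # hypothetical mixed input: numbers become strings

print(class(marks))   # "numeric"
print(class(mixed))   # "character"
print(mean(marks))    # 69.6
```

Because mixed is a character vector, arithmetic such as mean(mixed) would no longer give a numeric result, so it is worth checking class() after building a vector from raw input.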
Factor
A factor is a data type used to represent qualitative
(categorical) values such as colors, good & bad, or course
or movie ratings. Factors are useful in statistical modeling.
fac <- factor(c("good", "bad"))
LIST
A list is a data structure having components of mixed
data types.
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3) # x contains copies of n, s, b
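A short sketch of how the components of such a list might be read back. It rebuilds the same list and contrasts double brackets, which return the component itself, with single brackets, which return a shorter list. The names second and sub are introduced only for this illustration.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

second <- x[[2]]    # the character vector s itself
sub    <- x[2]      # a list of length 1 that contains s

print(second[1])    # "aa"
print(class(sub))   # "list"
print(length(x))    # 4
```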
MATRIX
A matrix is used to store values of the same data type. However, unlike
vectors, matrices hold two-dimensional information.

SYNTAX: M <- matrix(vector, nrow=r, ncol=c, byrow=FALSE,
dimnames=list(char_vector_rownames, char_vector_colnames))
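The syntax above can be tried on a small case. This is a minimal sketch with made-up row and column names; byrow = TRUE is chosen here so that the values fill the matrix row by row instead of the default column by column.

```r
M <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))

print(M)
print(M["r2", "c1"])   # 4: first value of the second row
print(dim(M))          # 2 3
```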
ARRAY
Arrays are one of the most useful data objects in R that allow the
user to store data in multiple dimensions. For example, if we
create an array with dimensions (4, 4, 2), it creates 2 rectangular
matrices with 4 rows and 4 columns each.
Array_NAME <- array(data, dim = c(row_Size, column_Size,
matrices), dimnames = dimnames)
row.names <- c("Row1", "Row2", "Row3")
column.names <- c("Col1", "Col2", "Col3", "Col4")
matrix.names <- c("Matrix1", "Matrix2")
B <- array(1:24, dim = c(3, 4, 2), dimnames = list(row.names,
column.names, matrix.names))
print(B)
DATA FRAME
A data frame in R programming is a 2-dimensional array-like
structure that also resembles a table, in which each column contains
values of one variable and each row contains one set of values from
each column.
A data frame has the following characteristics:
1. The column names of a data frame should not be empty.
2. Row names should be unique.
3. Data stored in a data frame can be numeric, factor or character type.
4. Each column should contain the same number of data items.
empid <- c(1:4)
empname <- c("Sam","Rob","Max","John")
empdept <- c("Sales","Marketing","HR","R & D")
emp.data <- data.frame(empid,empname,empdept)
print(emp.data)
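Once a data frame exists, a few base R calls are enough to inspect it. The sketch below rebuilds the same employee table and shows one possible way to look at its structure, pick a column, and filter rows. stringsAsFactors = FALSE is set explicitly as an assumption so that the text columns stay character vectors on older R versions too; hr is a name used only here.

```r
empid <- c(1:4)
empname <- c("Sam", "Rob", "Max", "John")
empdept <- c("Sales", "Marketing", "HR", "R & D")
emp.data <- data.frame(empid, empname, empdept, stringsAsFactors = FALSE)

str(emp.data)                  # column names and types at a glance
print(nrow(emp.data))          # 4
print(emp.data$empname[2])     # "Rob"

# Keep only the rows whose department is HR
hr <- emp.data[emp.data$empdept == "HR", ]
print(hr$empname)              # "Max"
```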
CBIND
Column bind: cbind() in R appends or combines vectors, matrices
or data frames by columns.
EXAMPLE
df1 = data.frame(name = c("Rahul","joe","Adam","Brendon"),
married_year = c(2016,2015,2016,2008))
df2 = data.frame(Birth_place = c("Delhi","Seattle","London","Moscow"),
Birth_year = c(1988,1990,1989,1984))

cbinded_df = cbind(df1,df2)

Rule: The number of rows in the two data frames needs to be the same
for the cbind() function.
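The row-count rule can be checked directly. The following sketch uses two small, made-up data frames (df_a, df_b) with 4 and 3 rows and wraps cbind() in tryCatch() so the script reports the failure instead of stopping; a third data frame with matching rows binds without trouble.

```r
df_a <- data.frame(id = 1:4)
df_b <- data.frame(score = c(10, 20, 30))        # only 3 rows
df_c <- data.frame(score = c(10, 20, 30, 40))    # 4 rows, matches df_a

# tryCatch() returns "error" if cbind() fails, "bound" otherwise
outcome <- tryCatch({
  cbind(df_a, df_b)
  "bound"
}, error = function(e) "error")
print(outcome)      # "error"

ok <- cbind(df_a, df_c)
print(dim(ok))      # 4 2
```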
RBIND
rbind() in R row-binds data frames: it is a simple joining or
concatenation of two or more data frames (tables) row-wise. In other words,
rbind() appends or combines vectors, matrices or data frames by rows.
For data frames the column names must match, so df1 is combined here with
df3, an illustrative data frame that has the same columns as df1.

df3 = data.frame(name = c("Sam","Kate"), married_year = c(2012,2019))
rbinded_df = rbind(df1,df3)
OPERATORS
ARITHMETIC OPERATORS

a <- 9.8
b <- 2

print ( a+b ) #addition
print ( a-b ) #subtraction
print ( a*b ) #multiplication
print ( a/b ) #division
print ( a%%b ) #remainder
print ( a%/%b ) #integer quotient
print ( a^b ) #power
RELATIONAL OPERATOR
LOGICAL OPERATORS
Miscellaneous Operators

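a = 23:31 # colon operator: the sequence 23, 24, ..., 31
print ( a )

a = c(25, 27, 76)
b = 27
print ( b %in% a ) # %in% tests whether b is an element of a

M = matrix(c(1,2,3,4), 2, 2, TRUE)
print ( M %*% t(M) ) # %*% multiplies M by its transpose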
CONDITIONAL STATEMENT

x <- 5
if(x > 0)
{ print("Positive number") }

x <- -5
if(x >= 0)
{ print("Non-negative number")
} else
{ print("Negative number") }

x <- 0
if (x < 0) {
print("Negative number")
} else if (x > 0) {
print("Positive number")
} else print("Zero")
readinteger <- function()
{ n <- readline(prompt="Enter an integer: ")
n<-as.integer(n)
if(n >= 0)
{
print("This is Non-negative number")
}
else
{
print("This is Negative number")
}
return(n)
}
print(readinteger())

readline() lets the user enter a one-line string at the terminal.
The prompt argument is printed in front of the user input. It usually ends with ": ".
The as.integer() function converts the string into an integer.
APPLYING IF IN DATAFRAMES

library(dplyr) # mutate() comes from the dplyr package

df1 = data.frame(Name = c('George','Andrea','Micheal','Maggie','Ravi','Xien','Jalpa'),
Grade_score=c(4,6,2,9,5,7,8),
Mathematics_score=c(45,78,44,89,66,49,72),
Science_score=c(56,52,45,88,33,90,47))

mutate(df1, Result = ifelse(Mathematics_score >= 50 & Science_score >= 50, "Pass", "Fail"))
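The same vectorized test can also be written in base R, without any extra package. This is a sketch on a smaller, made-up table (scores) rather than the data above: ifelse() evaluates the condition for every row and the result is assigned to a new column.

```r
scores <- data.frame(Name = c("A", "B", "C"),
                     Mathematics_score = c(45, 78, 66),
                     Science_score = c(56, 52, 33))

# Pass only when both subjects reach 50
scores$Result <- ifelse(scores$Mathematics_score >= 50 &
                        scores$Science_score >= 50, "Pass", "Fail")
print(scores$Result)   # "Fail" "Pass" "Fail"
```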
For Loop
for(i in 1:n)
{
statement
}

for(i in 1:5)
{
print (i^2)
}

NESTED FOR LOOP
for(i in 1:5)
{
for(j in 1:2)
{
print(i*j);
}
}
# 12 values (10 to 21) fill the 6 x 2 matrix exactly
mat <- matrix(data = seq(10, 21, by=1),
nrow = 6, ncol =2)
for (r in 1:nrow(mat))
for (c in 1:ncol(mat)) print(paste("Row", r,
"and column",c, "have values of", mat[r,c]))
WHILE LOOP
i <- 1
while (i <=6) {
print(i*i)
i = i+1
}
BREAK STATEMENT
A break statement is used inside a loop (for, while) to stop the iterations and
transfer control to the code after the loop.
In a nested looping situation, where there is a loop inside another loop, this
statement exits from the innermost loop that is being evaluated.

x <- 1:5
for (val in x)
{
if (val == 3)
{
break
}
print(val)
}
NEXT STATEMENT
A next statement is useful when we want to skip the current iteration of a
loop without terminating it. On encountering next, R skips further
evaluation of the current iteration and starts the next iteration of the loop.

x <- 1:5
for (val in x)
{
if (val == 3)
{
next
}
print(val)
}


APPLY FUNCTIONS
Apply functions are a family of functions in base R which allow you to repetitively
perform an action on multiple chunks of data. An apply function is essentially a
loop, but it often requires less code and can run faster than an explicit loop.
The apply functions that this chapter will address are apply, lapply, sapply,
vapply, tapply, and mapply. There are so many different apply functions
because they are meant to operate on different types of data.

Syntax: apply(X, MARGIN, FUN)
•X is an array or matrix (the data that the function is applied to)
•MARGIN specifies whether you want to apply the function across rows (1) or
columns (2)
•FUN is the function you want to use

my.matrx <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)
apply(my.matrx, 1, sum) # MARGIN = 1: sum of each row
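To see the effect of MARGIN side by side, the sketch below applies sum to a small 2 x 3 matrix (m, a made-up example) first by rows and then by columns.

```r
m <- matrix(1:6, nrow = 2)    # filled column by column: rows are 1 3 5 and 2 4 6

row_totals <- apply(m, 1, sum)   # MARGIN = 1: one result per row
col_totals <- apply(m, 2, sum)   # MARGIN = 2: one result per column

print(row_totals)   # 9 12
print(col_totals)   # 3 7 11
```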
ratings <- c(4.2, 4.4, 3.4, 3.9, 5, 4.1, 3.2, 3.9, 4.6, 4.8,
5, 4, 4.5, 3.9, 4.7, 3.6)
employee.mat <- matrix(ratings, byrow=TRUE, nrow=4,
dimnames = list(c("Quarter1","Quarter2","Quarter3","Quarter4"),
c("Hari","Shri","John","Albert")))
# user defined function: keep only the ratings above 4.2
check <- function(x){
return(x[x>4.2])
}
result <- apply(employee.mat,2,check)
lapply()
lapply() always returns a list; the ‘l’ in lapply() refers to ‘list’. lapply() accepts
lists and data frames as input. The MARGIN argument is not required here: the
specified function is applied to each element of the list (for a data frame,
to each column).
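A minimal sketch of lapply() on a named list, next to sapply() for comparison. The list marks_by_subject and its values are invented for the example; sapply() runs the same computation but simplifies the result to a named vector where it can.

```r
marks_by_subject <- list(maths = c(40, 50, 60), science = c(70, 80, 90))

as_list <- lapply(marks_by_subject, mean)   # result is a list
as_vec  <- sapply(marks_by_subject, mean)   # result is a named numeric vector

print(class(as_list))    # "list"
print(as_list$maths)     # 50
print(as_vec)            # maths 50, science 80
```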
tapply()
tapply() is helpful while dealing with categorical variables, it applies a function to
numeric data distributed across various categories. The simplest form of tapply() can
be understood as
tapply(column 1, column 2, FUN)
where column 1 is the numeric column on which function is applied, column 2 is a
factor object and FUN is for the function to be performed.
salary <- c(21000,29000,32000,34000,45000)
designation <- c("Programmer","Senior Programmer","Senior Programmer",
"Senior Programmer","Manager")
gender <- c("M","F","F","M","M")
result <- tapply(salary,designation,mean)
Using summary statistics to spot problems

The summary command on a data frame reports a variety of summary statistics
on the numerical columns of the data frame, and count statistics on any
categorical columns.
You can also ask for summary statistics on specific numerical columns by using
the commands mean, var, median, min, max, and quantile (which will return
the quartiles of the data by default).

summary(dataset)

PROBLEMS REVEALED BY DATA SUMMARIES
1. MISSING VALUES
2. INVALID VALUES AND OUTLIERS
3. DATA RANGE
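The three kinds of problems listed above can be seen on a tiny, deliberately flawed table. The data frame customers below is invented for this sketch: it contains a missing value in each column, a negative age, and one income far larger than the rest.

```r
customers <- data.frame(
  age    = c(25, 41, -3, 38, NA),              # -3 is invalid, NA is missing
  income = c(42000, 51000, 39000, NA, 615000)  # 615000 is a possible outlier
)

print(summary(customers))                    # min, quartiles, max and NA counts per column
print(colSums(is.na(customers)))             # one missing value in each column
print(range(customers$age, na.rm = TRUE))    # -3 41
```

The minimum of age exposes the invalid value, the NA counts expose the missing values, and the gap between the median and the maximum of income hints at the data range problem.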
Spotting problems using graphics and visualization

The use of graphics to examine data is called visualization. Details of
specific plots aside, the key points of Cleveland’s philosophy are these:
•A graphic should display as much information as it can, with the lowest
possible cognitive strain to the viewer.
•Strive for clarity. Make the data stand out. Specific tips for increasing clarity
include:
– Avoid too many superimposed elements, such as too many curves in the
same graphing space.
– Find the right aspect ratio and scaling to properly bring out the details of the
data.
– Avoid having the data all skewed to one side or the other of your graph.
•Visualization is an iterative process. Its purpose is to answer questions
about the data.
ABOUT GGPLOT2
The theme of this section is how to use visualization to explore your data, not how
to use ggplot2. We chose ggplot2 because it excels at combining multiple
graphical elements together, but its syntax can take some getting used to.
HISTOGRAMS
A basic histogram bins a variable into fixed-width buckets and returns the number
of data points that falls into each bucket.
For example, you could group your customers by age range, in intervals of five
years: 20–25, 25–30, 30–35, and so on.

ggplot(custdata) +
geom_histogram(aes(x=age),
binwidth=5, fill="gray")

The binwidth parameter tells the geom_histogram call how to make bins of
five-year intervals (default is datarange/30). The fill parameter specifies the
color of the histogram bars (default:black).
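The binning step itself can be inspected without drawing anything. The sketch below uses base R's hist() with plot = FALSE on a made-up vector of ages, so it is an illustration of fixed-width buckets rather than of what geom_histogram does internally.

```r
ages <- c(21, 23, 27, 28, 29, 31, 34, 36, 38, 39)   # hypothetical customer ages

# Five-year buckets from 20 to 40; plot = FALSE returns the counts only
bins <- hist(ages, breaks = seq(20, 40, by = 5), plot = FALSE)

print(bins$breaks)   # 20 25 30 35 40
print(bins$counts)   # 2 3 2 3
```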
DENSITY PLOTS
A density plot is a “continuous histogram” of a variable, except the area
under the density plot is equal to 1. A point on a density plot
corresponds to the fraction of data (or the percentage of data, divided
by 100) that takes on a particular value. This fraction is usually very
small. When you look at a density plot, you’re more interested in the
overall shape of the curve than in the actual values on the y-axis.
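A rough numerical check of the two claims above, using base R's density() on simulated incomes. The simulated data, the seed, and the simple rectangle-sum approximation of the area are all assumptions of this sketch; the area should come out close to 1 and the individual density values should be very small.

```r
set.seed(1)
income <- rlnorm(500, meanlog = 10, sdlog = 0.5)   # simulated, not real customer data

d <- density(income)            # kernel density estimate on an evenly spaced grid
step <- d$x[2] - d$x[1]         # width of one grid cell
area <- sum(d$y) * step         # rectangle-sum approximation of the area under the curve

print(round(area, 2))           # approximately 1
print(max(d$y))                 # a very small number, since incomes span tens of thousands
```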

library(scales)
ggplot(custdata) + geom_density(aes(x=income)) +
scale_x_continuous(labels=dollar)

ggplot(custdata) + geom_density(aes(x=income)) +
scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
annotation_logticks(sides="bt")
BAR CHARTS

A bar chart is a histogram for discrete data: it records the frequency of every
value of a categorical variable.
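The frequencies that a bar chart displays can be computed directly with base R's table(). The vector marital below is a made-up stand-in for a categorical column such as marital status.

```r
marital <- c("Married", "Never Married", "Married",
             "Widowed", "Married", "Never Married")

counts <- table(marital)   # one count per distinct value
print(counts)              # Married 3, Never Married 2, Widowed 1
```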
Vertical bar chart

ggplot(custdata) + geom_bar(aes(x=marital.stat), fill="gray")

Horizontal Bar chart

ggplot(custdata) +
geom_bar(aes(x=state.of.res), fill="gray") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.8)))

Stacked bar chart

ggplot(custdata) + geom_bar(aes(x=marital.stat,
fill=health.ins))

Side-by-side bar chart

ggplot(custdata) + geom_bar(aes(x=marital.stat,
fill=health.ins),
position="dodge")

Filled bar chart

ggplot(custdata) + geom_bar(aes(x=marital.stat,
fill=health.ins),
position="fill")
LINE PLOTS
Line plots work best when the relationship between two variables is relatively clean:
each x value has a unique (or nearly unique) y value

x <- runif(100)
y <- x^2 + 0.2*x
ggplot(data.frame(x=x,y=y), aes(x=x,y=y)) + geom_line()
SCATTER PLOTS

Scatterplots show many points plotted in the Cartesian plane. Each point represents
the values of two variables. One variable is chosen in the horizontal axis and another
in the vertical axis.

Correlation between two variables


custdata2 <- subset(custdata,
(custdata$age > 0 & custdata$age < 100
& custdata$income > 0))
cor(custdata2$age, custdata2$income)
[1] -0.02240845

Scatterplot
ggplot(custdata2, aes(x=age, y=income)) +
geom_point()

A hexbin plot is like a two-dimensional histogram. The data is divided into
bins, and the number of data points in each bin is represented by color or
shading.

library(hexbin)
ggplot(custdata2, aes(x=age, y=income)) +
geom_hex(binwidth=c(5, 10000)) +
geom_smooth(color="white", se=F) +
ylim(0,200000)
