Unit1-Data Science

Key points about column binding in R:
• cbind() combines two objects (vectors, matrices, data frames) by columns.
• The objects being combined must have the same number of rows.
• The columns of the second object are appended to the right of the first object.
• The column names do not need to be unique.
• It is used to add variables (columns) to an existing data frame.

Uploaded by

DIVYANSH GAUR (RA2011027010090)

18CSE396T-DATA SCIENCE

TOPICS COVERED

•What is Data science?

•Evolution of Data Science
•Key Roles of Data Science Project
•Data Science Applications
What is Data science?
Data science is the field of study that combines domain
expertise, programming skills, and knowledge of
mathematics and statistics to extract meaningful insights
from data.
Need for Data Science

•Better decision making

•Predictive analysis
•Pattern discovery
E.g., self-driving cars, airlines, logistics companies
What is Data science
Recap: what is Data Science
Steps in Data science
Skill sets of Data Scientist (In general)

• Quantitative skill
• Technical aptitude
• Skeptical mind-set and critical thinking
• Curious and creative
• Communicative and collaborative
Skill sets of Data Scientist (Technical aspect)
• Statistics
• Programming Languages
• Data Extraction & Processing
• Data Wrangling & Exploration
• Machine learning & Deep learning
• Big data processing
• Data Visualization
Key Roles for a Successful Analytics Project

❖Business User
❖Project Sponsor
❖Project Manager
❖Business Intelligence Analyst
❖Database Administrator (DBA)
❖Data Engineer
❖Data Scientist
Business User

❖Someone who understands the domain area and usually benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized.
❖Usually a business analyst, line manager, or deep subject matter expert in the project domain fulfils this role.
Project Sponsor
❖Responsible for the genesis of the project. Provides the
impetus and requirements for the project and defines
the core business problem.
❖ Generally provides the funding and gauges the degree
of value from the final outputs of the working team. This
person sets the priorities for the project and clarifies the
desired outputs.
Project Manager: Ensures that key milestones and objectives
are met on time and at the expected quality.
Business Intelligence Analyst
❖Provides business domain expertise based on a deep
understanding of the data, key performance indicators
(KPIs), key metrics, and business intelligence from a
reporting perspective. Business Intelligence Analysts
generally create dashboards and reports and have
knowledge of the data feeds and sources.
Database Administrator (DBA)

Role:
The database administrator ensures that the database is accessible to all relevant users, that it performs correctly, and that it is kept safe from hacking.
Languages:
SQL, Java, C#, and Python
Data Engineer
Role:
A data engineer works with large amounts of data, and develops, constructs, tests, and maintains architectures such as large-scale processing systems and databases.

Languages:
SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++, and Perl
Data Scientist
Role:
A Data Scientist is a professional who manages enormous
amounts of data to come up with compelling business
visions by using various tools, techniques, methodologies,
algorithms, etc.
Languages:
R, SAS, Python, SQL, Hive, Matlab, Pig, Spark
TOOLS FOR DATA SCIENCE
DATA SCIENCE PROCESS
Data Structures
Structured data

Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository, typically a database. It covers all data that can be stored in an SQL database in tables with rows and columns.
Semi-Structured data
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing it can be stored in a relational database (this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save space. Example: XML data.
Quasi Structured data
This type of data consists of textual content with erratic data formats; it can be formatted with effort, software tools, and time. An example of quasi-structured data is the data about the webpages a user visited and in what order.
Unstructured Data
Unstructured data is data that is not organized in a pre-defined manner or does not have a pre-defined data model.
E.g., weblogs, multimedia content, email, text files.
State of the Practice in Analytics
BI VS DATA SCIENCE
Business Intelligence
Business Intelligence is a process of collecting, integrating, analyzing and presenting the data. With Business Intelligence, executives and managers can have a better understanding of decision-making. This process is carried out through software services and tools.

Business Analytics
Business analytics deals with the why’s of what happened in the past. It breaks down contributing factors and causality. It also uses these why’s to make predictions of what will happen in the future.
Conclusion

“BI is needed to run the business while Business Analytics are needed to change the business.”
RECAP

❖DATA STRUCTURES
Structured, Semi structured, Quasi structured and unstructured
❖STATE OF PRACTICE IN ANALYTICS

❖BI VS DATA SCIENCE

RDBMS
Data is stored in the form of rows and columns in an RDBMS. The relations among tables are also stored in the form of tables. SQL (Structured Query Language) is a programming language used to perform tasks such as updating data in a database or retrieving data from a database. E.g., Oracle, Sybase, Microsoft SQL Server, Access DB.

NoSQL
NoSQL is commonly referred to as “Not Only SQL”. With NoSQL, unstructured, schema-less data can be stored in multiple collections and nodes. It does not require fixed table schemas, it supports limited join queries, and it scales horizontally. E.g., MongoDB, CouchDB, HBase, Cassandra.
Staging and Curating the data
DATA STAGING
❖A data staging area (DSA) is a temporary storage area between the
data sources and a data warehouse.
❖The staging area is mainly used to quickly extract data from its data sources, minimizing the impact on the sources.
DATA CURATION
Data curation is the management of data throughout its lifecycle, from
creation and initial storage to the time when it is archived for posterity or
becomes obsolete and is deleted.
R DATA STRUCTURES

❖Vectors
❖Lists
❖Matrices
❖Arrays
❖Factors
❖Data Frame
VECTOR
A vector is an object that stores multiple values of the same data type. A vector cannot hold a mix of types such as integer and character; if mixed values are supplied, R coerces them all to one common type.
marks <- c(88,65,90,40,65)
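A minimal sketch of what "same data type" means in practice: combining values of different types with c() does not raise an error, because R coerces every element to a common type. The variable name mixed is illustrative only.

```r
# All elements of an atomic vector share one type.
marks <- c(88, 65, 90, 40, 65)
class(marks)            # "numeric"

# Mixing numbers and text does not fail: every element becomes character.
mixed <- c(88, "absent", 90)
class(mixed)            # "character"
mixed[1]                # "88" (a string, not a number)

# Arithmetic on a numeric vector is applied element by element.
marks + 5
```

Checking class() after building a vector is a quick way to notice an accidental coercion before doing arithmetic on it.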
Factor
Factors are a data type used to represent categorical (qualitative) values such as colors, good & bad, or course and movie ratings. They are useful in statistical modeling.
fac <- factor(c("good", "bad"))
LIST
A list is a data structure whose components can be of mixed data types.
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3) # x contains copies of n, s, b
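As a rough illustration of how the components of such a list can be read back, the sketch below contrasts double and single brackets. The named list y and its component names are hypothetical additions, not part of the original slide.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

length(x)        # 4 components
x[[2]]           # double brackets return the component itself (the vector s)
x[2]             # single brackets return a list of length 1 that contains s
class(x[[1]])    # "numeric"
class(x[2])      # "list"

# Components can be named for more readable access (names are hypothetical).
y <- list(numbers = n, flags = b)
y$numbers[3]     # 5
```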
MATRIX
A matrix stores values of the same data type. Unlike a vector, however, a matrix holds two-dimensional information.

SYNTAX: M <- matrix(vector, nrow=r, ncol=c, byrow=FALSE, dimnames=list(char_vector_rownames, char_vector_colnames))
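The syntax above might be exercised as follows. This is a small sketch with made-up row and column labels, mainly to show the effect of byrow.

```r
# Fill a 2 x 3 matrix row by row and label both dimensions.
M <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
print(M)
dim(M)            # 2 3
M["r2", "c1"]     # 4: second row, first column
M[, "c3"]         # third column as a named vector: r1 = 3, r2 = 6

# With the default byrow = FALSE the same data fills column by column.
N <- matrix(1:6, nrow = 2, ncol = 3)
N[2, 1]           # 2
```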
ARRAY
Arrays are one of the most useful data objects in R; they allow the user to store data in more than two dimensions. For example, if we create an array with dimensions (4, 4, 2), it creates 2 rectangular matrices with 4 rows and 4 columns each.
Array_NAME <- array(data, dim = c(row_Size, column_Size, matrices), dimnames = dimnames)
row.names <- c("Row1", "Row2", "Row3")
column.names <- c("Col1", "Col2", "Col3", "Col4")
matrix.names <- c("Matrix1", "Matrix2")
B <- array(1:24, dim = c(3, 4, 2), dimnames = list(row.names, column.names, matrix.names))
print(B)
A data frame in R programming is a 2-dimensional array-like
structure that also resembles a table, in which each column contains
values of one variable and each row contains one set of values from
each column.
A data frame has the following characteristics:
1. The column names of a data frame should not be empty.
2. Row names should be unique.
3. Data stored in a data frame can be of numeric, factor or character type.
4. Each column should contain the same number of data items.
empid <- c(1:4)
empname <- c("Sam","Rob","Max","John")
empdept <- c("Sales","Marketing","HR","R & D")
emp.data <- data.frame(empid,empname,empdept)
print(emp.data)
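Once a data frame like emp.data exists, a few base R calls are usually enough to inspect it. The sketch below rebuilds the same data frame so that it runs on its own; the salary figures are hypothetical and only illustrate adding a column.

```r
empid <- 1:4
empname <- c("Sam", "Rob", "Max", "John")
empdept <- c("Sales", "Marketing", "HR", "R & D")
emp.data <- data.frame(empid, empname, empdept)

str(emp.data)                          # column types and a preview of the values
nrow(emp.data)                         # 4 rows
emp.data$empname                       # one column as a vector
emp.data[emp.data$empdept == "HR", ]   # rows matching a condition

# Adding a column: the new vector needs one value per row.
emp.data$salary <- c(500, 620, 480, 710)   # hypothetical figures
ncol(emp.data)                         # 4 columns after the addition
```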
Column Bind – cbind() in R appends or combines vectors, matrices or data frames by columns.
EXAMPLE
df1 = data.frame(name = c("Rahul","joe","Adam","Brendon"),
married_year = c(2016,2015,2016,2008))
df2 = data.frame(Birth_place =
c("Delhi","Seattle","London","Moscow"), Birth_year =
c(1988,1990,1989,1984))

cbinded_df = cbind(df1,df2)

Rule: The two data frames passed to cbind() must have the same number of rows.
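The row-count rule can be checked directly. In this sketch df_short is a hypothetical three-row data frame introduced only to trigger the failure, and tryCatch() is used so the script keeps running after the error.

```r
df1 <- data.frame(name = c("Rahul", "joe", "Adam", "Brendon"),
                  married_year = c(2016, 2015, 2016, 2008))
df2 <- data.frame(Birth_place = c("Delhi", "Seattle", "London", "Moscow"),
                  Birth_year = c(1988, 1990, 1989, 1984))

cbinded_df <- cbind(df1, df2)
dim(cbinded_df)        # 4 rows, 4 columns
names(cbinded_df)

# A data frame with a different number of rows cannot be column-bound.
df_short <- data.frame(city = c("Pune", "Oslo", "Lima"))   # hypothetical, 3 rows
result <- tryCatch(cbind(df1, df_short),
                   error = function(e) "cbind failed: row counts differ")
print(result)
```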
Row Bind – rbind() in R row-binds data frames: it is a simple concatenation of two or more data frames (tables) row-wise. In other words, rbind() appends or combines vectors, matrices or data frames by rows.

Rule: The data frames passed to rbind() must have the same column names. df1 and df2 above have different columns, so rbind(df1, df2) stops with an error. A data frame with matching columns (df3 below is a hypothetical example) binds correctly:

df3 = data.frame(name = c("Maya","Liam"),
married_year = c(2012,2019))
rbinded_df = rbind(df1,df3)
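Both binding functions also accept plain vectors, in which case the result is a matrix. A brief sketch with two made-up vectors:

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

rows <- rbind(a, b)    # 2 x 3 matrix, one row per vector
cols <- cbind(a, b)    # 3 x 2 matrix, one column per vector
dim(rows)
dim(cols)
rows["b", 2]           # 5: the row is named after the vector it came from
```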
OPERATORS
ARITHMETIC OPERATORS

a <- 9.8
b <- 2

print(a + b)     # addition
print(a - b)     # subtraction
print(a * b)     # multiplication
print(a / b)     # division
print(a %% b)    # remainder (modulus)
print(a %/% b)   # integer quotient
print(a ^ b)     # exponentiation
RELATIONAL OPERATOR
LOGICAL OPERATORS
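The relational and logical operators named in the two headings above can be sketched as follows, reusing the values of a and b from the arithmetic example. The vectors p and q are illustrative.

```r
a <- 9.8
b <- 2

# Relational operators compare values and return TRUE or FALSE.
print(a > b)     # TRUE
print(a < b)     # FALSE
print(a == b)    # FALSE
print(a != b)    # TRUE
print(a >= b)    # TRUE
print(a <= b)    # FALSE

# Element-wise logical operators work on whole vectors.
p <- c(TRUE, FALSE, TRUE)
q <- c(TRUE, TRUE, FALSE)
print(p & q)     # TRUE FALSE FALSE
print(p | q)     # TRUE TRUE TRUE
print(!p)        # FALSE TRUE FALSE

# && and || expect single values and short-circuit; they suit if() conditions.
print(a > 0 && b > 0)   # TRUE
```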
Miscellaneous Operators

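a = 23:31              # colon operator: the sequence 23, 24, ..., 31
print ( a )

a = c(25, 27, 76)
b = 27
print ( b %in% a )     # %in% tests membership: is b an element of a?

M = matrix(c(1,2,3,4), 2, 2, TRUE)
print ( M %*% t(M) )   # %*% is matrix multiplication; t() transposes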
CONDITIONAL STATEMENT

x <- 5
if (x > 0) {
  print("Positive number")
}

x <- -5
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}

x <- 0
if (x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else {
  print("Zero")
}
readinteger <- function() {
  n <- readline(prompt = "Enter an integer: ")
  n <- as.integer(n)   # non-numeric input gives NA, and the test below then fails
  if (n >= 0) {
    print("This is Non-negative number")
  } else {
    print("This is Negative number")
  }
  return(n)
}
print(readinteger())

readline() lets the user enter a one-line string at the terminal.
The prompt argument is printed in front of the user input. It usually ends with ": ".
as.integer() converts the string to an integer.
APPLYING IF IN DATAFRAMES

library(dplyr)   # provides mutate()

df1 = data.frame(Name = c('George','Andrea','Micheal','Maggie','Ravi','Xien','Jalpa'),
                 Grade_score = c(4,6,2,9,5,7,8),
                 Mathematics_score = c(45,78,44,89,66,49,72),
                 Science_score = c(56,52,45,88,33,90,47))

mutate(df1, Result = ifelse(Mathematics_score >= 50 & Science_score >= 50, "Pass", "Fail"))
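The same vectorised test can also be written in base R, without any add-on package. This is a sketch of an equivalent approach rather than a replacement for mutate(); it assigns the new column directly with $.

```r
df1 <- data.frame(Name = c("George", "Andrea", "Micheal", "Maggie", "Ravi", "Xien", "Jalpa"),
                  Mathematics_score = c(45, 78, 44, 89, 66, 49, 72),
                  Science_score = c(56, 52, 45, 88, 33, 90, 47))

# ifelse() is vectorised: it tests every row and returns one label per row.
df1$Result <- ifelse(df1$Mathematics_score >= 50 & df1$Science_score >= 50,
                     "Pass", "Fail")
print(df1[, c("Name", "Result")])
table(df1$Result)    # counts of "Fail" and "Pass"
```

A student passes only when both scores reach 50, because & requires both conditions to hold in the same row.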
For Loop
for(i in 1:n)
{
  statement
}

for(i in 1:5)
{
  print(i^2)
}

NESTED FOR LOOP
for(i in 1:5)
{
  for(j in 1:2)
  {
    print(i*j)
  }
}
mat <- matrix(data = seq(10, 21, by=1),
              nrow = 6, ncol = 2)   # 12 values fill the 6 x 2 matrix exactly
for (r in 1:nrow(mat))
  for (c in 1:ncol(mat))
    print(paste("Row", r, "and column", c, "have values of", mat[r,c]))
WHILE LOOP
i <- 1
while (i <=6) {
print(i*i)
i = i+1
}
BREAK STATEMENT
A break statement is used inside a loop (for, while) to stop the iterations and transfer control to the first statement after the loop.
In a nested loop, where there is a loop inside another loop, break exits only the innermost loop that is being evaluated.

x <- 1:5
for (val in x)
{
if (val == 3)
{
break
}
print(val)
}
next statement
A next statement is useful when we want to skip the current iteration of a loop without terminating it. On encountering next, R skips the rest of the loop body and starts the next iteration of the loop.

x <- 1:5
for (val in x) {
  if (val == 3) {
    next
  }
  print(val)
}

Apply functions are a family of functions in base R that let you repeatedly perform an action on multiple chunks of data. An apply function is essentially a loop, but it often requires less code and can run faster than an explicit loop.
The apply functions addressed here are apply, lapply, sapply, vapply, tapply, and mapply. There are so many different apply functions because they are meant to operate on different types of data.

Syntax: apply(X, MARGIN, FUN)

• X is an array or matrix (the data that the function is applied to)
• MARGIN specifies whether to apply the function across rows (1) or columns (2)
• FUN is the function you want to use

my.matrx <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)

apply(my.matrx, 1, sum)   # MARGIN = 1: sum each row
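To make the effect of MARGIN concrete, the sketch below applies sum to the same matrix along both margins. The names row_totals and col_totals are illustrative.

```r
my.matrx <- matrix(c(1:10, 11:20, 21:30), nrow = 10, ncol = 3)

row_totals <- apply(my.matrx, 1, sum)   # MARGIN = 1: one result per row (length 10)
col_totals <- apply(my.matrx, 2, sum)   # MARGIN = 2: one result per column (length 3)
print(col_totals)                       # 55 155 255

# For plain sums, rowSums() and colSums() give the same answers.
all(row_totals == rowSums(my.matrx))
```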
ratings <- c(4.2, 4.4, 3.4, 3.9, 5, 4.1, 3.2, 3.9, 4.6, 4.8,
             5, 4, 4.5, 3.9, 4.7, 3.6)
employee.mat <- matrix(ratings, byrow=TRUE, nrow=4,
                       dimnames = list(c("Quarter1","Quarter2","Quarter3","Quarter4"),
                                       c("Hari","Shri","John","Albert")))
# user-defined function: keep only the ratings above 4.2
check <- function(x){
  return(x[x>4.2])
}
result <- apply(employee.mat, 2, check)   # apply check to each column (employee)
lapply()
lapply() always returns a list; the ‘l’ in lapply() refers to ‘list’. It accepts lists and data frames as input. No MARGIN argument is required: for a data frame, the function is applied to each column.
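A short sketch of lapply() on a list and on a data frame, with sapply() shown alongside because it performs the same computation but simplifies the result to a vector. The data in scores and df is made up for illustration.

```r
scores <- list(maths = c(45, 78, 44), science = c(56, 52, 45))   # hypothetical data

means_list <- lapply(scores, mean)   # a list with one element per component
class(means_list)                    # "list"

means_vec <- sapply(scores, mean)    # same computation, simplified to a named vector
print(means_vec)

# A data frame is a list of columns, so lapply() visits each column in turn.
df <- data.frame(a = 1:3, b = c(10, 20, 30))
col_max <- lapply(df, max)
col_max$b                            # 30
```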
tapply()
tapply() is helpful while dealing with categorical variables, it applies a function to
numeric data distributed across various categories. The simplest form of tapply() can
be understood as
tapply(column 1, column 2, FUN)
where column 1 is the numeric column on which function is applied, column 2 is a
factor object and FUN is for the function to be performed.
salary <- c(21000,29000,32000,34000,45000)
designation <- c("Programmer","Senior Programmer","Senior Programmer",
                 "Senior Programmer","Manager")
gender <- c("M","F","F","M","M")
result <- tapply(salary,designation,mean)   # mean salary per designation
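The result of tapply() is a named vector with one entry per category, so individual groups can be read back by name. The sketch below repeats the data so that it is self-contained, and also groups by gender, which the example above defines but does not use.

```r
salary <- c(21000, 29000, 32000, 34000, 45000)
designation <- c("Programmer", "Senior Programmer", "Senior Programmer",
                 "Senior Programmer", "Manager")
gender <- c("M", "F", "F", "M", "M")

by_role <- tapply(salary, designation, mean)
print(by_role)             # one mean per designation
by_role[["Manager"]]       # 45000

by_gender <- tapply(salary, gender, mean)
by_gender[["F"]]           # 30500

# Any summary function can be used, e.g. group sizes via length().
tapply(salary, designation, length)
```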
Using summary statistics to spot problems

The summary command on a data frame reports a variety of summary statistics on the numerical columns of the data frame, and count statistics on any categorical columns.
You can also ask for summary statistics on specific numerical columns by using the commands mean, var, median, min, max, and quantile (which will return the quartiles of the data by default).

summary(dataset)

PROBLEMS REVEALED BY DATA SUMMARIES

1.MISSING VALUES
2.INVALID VALUES AND OUTLIERS
3.DATA RANGE
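The three kinds of problem listed above can be seen on a small example. The data frame cust and the validity thresholds below are hypothetical, chosen only so that each problem shows up.

```r
# Hypothetical customer data with one missing value and suspicious entries.
cust <- data.frame(age = c(34, 51, 0, 28, 147, 45),
                   income = c(52000, NA, 31000, 47000, 60000, -8000))

summary(cust)                     # min/max expose age 0 and 147; NA's are counted

colSums(is.na(cust))              # missing values per column
range(cust$age)                   # data range: 0 to 147
mean(cust$income)                 # NA, because one value is missing
mean(cust$income, na.rm = TRUE)   # mean of the observed values only

# Flag rows that fail simple validity checks (thresholds are illustrative).
# which() drops the row whose test is NA, so only definite violations remain.
bad <- cust[which(cust$age <= 0 | cust$age > 110 | cust$income < 0), ]
nrow(bad)
```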
Spotting problems using graphics and visualization

The use of graphics to examine data is called visualization. Details of specific plots aside, the key points of Cleveland’s philosophy are these:
• A graphic should display as much information as it can, with the lowest possible cognitive strain to the viewer.
• Strive for clarity. Make the data stand out. Specific tips for increasing clarity include:
– Avoid too many superimposed elements, such as too many curves in the same graphing space.
– Find the right aspect ratio and scaling to properly bring out the details of the data.
– Avoid having the data all skewed to one side or the other of your graph.
• Visualization is an iterative process. Its purpose is to answer questions about the data.
ABOUT GGPLOT2
The theme of this section is how to use visualization to explore your data, not how
to use ggplot2. We chose ggplot2 because it excels at combining multiple
graphical elements together, but its syntax can take some getting used to.
HISTOGRAMS
A basic histogram bins a variable into fixed-width buckets and returns the number
of data points that falls into each bucket.
For example, you could group your customers by age range, in intervals of five
years: 20–25, 25–30, 30–35, and so on.

library(ggplot2)
ggplot(custdata) +
  geom_histogram(aes(x=age),
                 binwidth=5, fill="gray")

The binwidth parameter tells the geom_histogram call to make bins of five-year intervals (by default ggplot2 uses 30 bins, that is, a width of datarange/30). The fill parameter specifies the color of the histogram bars (a dark gray by default).
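The counting step behind a histogram can be reproduced in base R with cut() and table(). This sketch uses a hypothetical vector of ages; the exact bin boundaries that ggplot2 chooses may differ from the ones set here.

```r
ages <- c(21, 23, 24, 27, 30, 31, 33, 34, 38, 44)   # hypothetical ages

# Five-year buckets [20,25), [25,30), ...; right = FALSE closes each bin on the left.
bins <- cut(ages, breaks = seq(20, 45, by = 5), right = FALSE)
counts <- table(bins)
print(counts)

sum(counts) == length(ages)   # every value falls into exactly one bucket
```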
DENSITY PLOTS
A density plot is a “continuous histogram” of a variable, except the area
under the density plot is equal to 1. A point on a density plot
corresponds to the fraction of data (or the percentage of data, divided
by 100) that takes on a particular value. This fraction is usually very
small. When you look at a density plot, you’re more interested in the
overall shape of the curve than in the actual values on the y-axis.

library(scales)
ggplot(custdata) + geom_density(aes(x=income)) +
  scale_x_continuous(labels=dollar)

ggplot(custdata) + geom_density(aes(x=income)) +
  scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
  annotation_logticks(sides="bt")
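The statement that the area under a density plot equals 1 can be checked numerically with base R's density(). The incomes below are simulated (a hypothetical stand-in for custdata$income), and the area is only approximated by a Riemann sum.

```r
set.seed(1)
income <- rlnorm(1000, meanlog = 10, sdlog = 0.8)   # hypothetical skewed incomes

d <- density(income)          # kernel density estimate on an evenly spaced grid
step <- d$x[2] - d$x[1]       # spacing between neighbouring grid points
area <- sum(d$y) * step       # Riemann-sum approximation of the area
print(area)                   # close to 1

max(d$y)                      # individual heights are very small on a dollar scale
```

This is why the y-axis values of a density plot are rarely read directly: the heights depend on the units of x, while the total area stays fixed.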
BAR CHARTS

A bar chart is a histogram for discrete data: it records the frequency of every
value of a categorical variable.
Vertical bar chart

ggplot(custdata) + geom_bar(aes(x=marital.stat), fill="gray")

Horizontal Bar chart

ggplot(custdata) +
geom_bar(aes(x=state.of.res), fill="gray") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.8)))
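# Stacked bar chart (the default position)
ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins))

# Side-by-side bar chart
ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins),
                            position="dodge")

# Filled bar chart: each bar is scaled to the same height to compare proportions
ggplot(custdata) + geom_bar(aes(x=marital.stat, fill=health.ins),
                            position="fill")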
LINE PLOTS
Line plots work best when the relationship between two variables is relatively clean:
each x value has a unique (or nearly unique) y value

x <- runif(100)
y <- x^2 + 0.2*x
ggplot(data.frame(x=x,y=y), aes(x=x,y=y)) + geom_line()
SCATTER PLOTS

Scatterplots show many points plotted in the Cartesian plane. Each point represents
the values of two variables. One variable is chosen in the horizontal axis and another
in the vertical axis.

Correlation between two variables

custdata2 <- subset(custdata,
(custdata$age > 0 & custdata$age < 100
& custdata$income > 0))
cor(custdata2$age, custdata2$income)
[1] -0.02240845
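A correlation close to zero, like the value above, indicates little linear association between the two variables; it does not rule out other kinds of relationship. The sketch below illustrates this with a small made-up vector x.

```r
x <- -5:5

cor(x, 2 * x + 1)    # 1: a perfect increasing linear relationship
cor(x, -3 * x)       # -1: a perfect decreasing linear relationship

# y depends entirely on x here, yet the linear correlation is 0
# because the relationship is symmetric rather than linear.
cor(x, x^2)
```

This is one reason to look at a scatterplot as well as the coefficient.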

Scatterplot
ggplot(custdata2, aes(x=age, y=income)) +
  geom_point()
A hexbin plot is like a two-dimensional histogram. The data is divided into
bins, and the number of data points in each bin is represented by color or
shading.

library(hexbin)
ggplot(custdata2, aes(x=age, y=income)) +
  geom_hex(binwidth=c(5, 10000)) +
  geom_smooth(color="white", se=FALSE) +
  ylim(0,200000)
