Unit1-Data Science
TOPICS COVERED
• Quantitative skill
• Technical aptitude
• Skeptical mind-set and critical thinking
• Curious and creative
• Communicative and collaborative
Skill sets of Data Scientist (Technical aspect)
• Statistics
• Programming Languages
• Data Extraction & Processing
• Data Wrangling & Exploration
• Machine learning & Deep learning
• Big data processing
• Data Visualization
Key Roles for a Successful Analytics Project
❖Business User
❖Project Sponsor
❖Project Manager
❖Business Intelligence Analyst
❖Database Administrator (DBA)
❖Data Engineer
❖Data Scientist
Database Administrator (DBA)
Role:
The database administrator ensures that the database is accessible
to all relevant users, that it performs correctly, and that it is
kept safe from hacking.
Languages:
SQL, Java, C#, and Python
Data Engineer
Role:
A data engineer works with large amounts of data. The data
engineer develops, constructs, tests, and maintains
architectures such as large-scale processing systems
and databases.
Languages:
SQL, Hive, R, SAS, Matlab, Python, Java, Ruby, C++,
and Perl
Data Scientist
Role:
A Data Scientist is a professional who works with enormous
amounts of data to produce compelling business insights,
using various tools, techniques, methodologies, and
algorithms.
Languages:
R, SAS, Python, SQL, Hive, Matlab, Pig, Spark
TOOLS FOR DATA SCIENCE
DATA SCIENCE PROCESS
Data Structures
Structured data
❖DATA STRUCTURES
Structured, semi-structured, quasi-structured, and unstructured
❖STATE OF PRACTICE IN ANALYTICS
❖Vectors
❖Lists
❖Matrices
❖Arrays
❖Factors
❖Data Frame
VECTOR
A vector is an object that stores multiple values of the same
data type. A vector cannot hold a combination of integer and
character values; if they are mixed, R coerces all elements to one type.
marks <- c(88,65,90,40,65)
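The same-type rule can be checked directly. The sketch below reuses the marks vector and adds a hypothetical vector named mixed (not part of the original slides) to show the coercion that happens when types are combined.

```r
# Numeric vector from the slide
marks <- c(88, 65, 90, 40, 65)
class(marks)        # "numeric"
avg <- mean(marks)  # arithmetic works because every element is a number

# Hypothetical example: combining an integer and a string
mixed <- c(1L, "a")
class(mixed)        # "character": the integer was coerced to the string "1"
```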
Factor
A factor is a data type for categorical (qualitative) values
such as colors, good and bad, or course and movie ratings.
Factors are useful in statistical modeling.
fac <- factor(c("good", "bad"))
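A minimal sketch of what a factor stores, assuming a slightly longer hypothetical set of ratings than the one above: the distinct categories become levels, and each value is kept internally as an integer code pointing at a level.

```r
# Hypothetical ratings, for illustration only
fac <- factor(c("good", "bad", "good"))

lv <- levels(fac)          # distinct categories, sorted alphabetically
counts <- table(fac)       # how often each level occurs
codes <- as.integer(fac)   # underlying integer codes, one per value
print(lv)
print(counts)
```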
LIST
A list is a data structure whose components can have
different data types.
n = c(2, 3, 5)
s = c("aa", "bb", "cc", "dd", "ee")
b = c(TRUE, FALSE, TRUE, FALSE, FALSE)
x = list(n, s, b, 3) # x contains copies of n, s, b
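Accessing list components is a common source of confusion, so here is a short sketch built on the same list. The variable names second, sub, and types are illustrative additions.

```r
n <- c(2, 3, 5)
s <- c("aa", "bb", "cc", "dd", "ee")
b <- c(TRUE, FALSE, TRUE, FALSE, FALSE)
x <- list(n, s, b, 3)

second <- x[[2]]           # [[ ]] extracts the component itself (a character vector)
sub <- x[2]                # [ ] returns a list of length 1 containing that component
types <- sapply(x, class)  # each component keeps its own type
print(types)
```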
MATRIX
A matrix stores values of a single data type. Unlike a vector, a matrix
holds them in two dimensions (rows and columns).
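# cbind() joins objects column-wise; rbind() stacks them row-wise.
# df1 and df2 are assumed to be existing matrices or data frames of compatible shape.
cbinded_df = cbind(df1,df2)
rbinded_df = rbind(df1,df2)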
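As a self-contained illustration of the two-dimensional layout, the sketch below builds a small matrix from assumed values (1 to 6) and then applies cbind() and rbind() to it. The names m, wide, and tall are hypothetical.

```r
# matrix() fills column by column unless byrow = TRUE is given
m <- matrix(1:6, nrow = 2, ncol = 3)
dim(m)        # number of rows, then number of columns
m[2, 3]       # element in row 2, column 3

wide <- cbind(m, m)   # same rows, columns placed side by side
tall <- rbind(m, m)   # same columns, rows stacked
```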
OPERATORS
ARITHMETIC OPERATORS
a <- 9.8
b <- 2
a = 23:31
print ( a )
M = matrix(c(1,2,3,4), 2, 2, TRUE)
print ( M %*% t(M) )
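A minimal sketch of R's arithmetic operators applied to the values 9.8 and 2 used above, followed by the same matrix product. The result names are illustrative.

```r
a <- 9.8
b <- 2

add <- a + b      # addition
sub <- a - b      # subtraction
mul <- a * b      # multiplication
div <- a / b      # division
pow <- a ^ b      # exponentiation
quo <- a %/% b    # integer division (quotient)
rem <- a %% b     # modulus (remainder)

# %*% is the matrix product; plain * would multiply element by element
M <- matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
P <- M %*% t(M)
print(P)
```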
CONDITIONAL STATEMENT
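# if statement
x <- 5
if (x > 0) {
  print("Positive number")
}
# if...else statement
x <- -5
if (x >= 0) {
  print("Non-negative number")
} else {
  print("Negative number")
}
# if...else if...else statement (else must follow the closing brace on the same line)
x <- 0
if (x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else print("Zero")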
# Read an integer from the console and report its sign.
readinteger <- function() {
  n <- readline(prompt = "Enter an integer: ")
  n <- as.integer(n)  # non-numeric input becomes NA
  if (n >= 0) {
    print("This is a non-negative number")
  } else {
    print("This is a negative number")
  }
  return(n)
}
print(readinteger())
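break statement
A break statement terminates a loop immediately; control passes to the
first statement after the loop.
x <- 1:5
for (val in x) {
  if (val == 3) {
    break   # leave the loop when val reaches 3, so only 1 and 2 are printed
  }
  print(val)
}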
next statement
A next statement is useful when we want to skip the current iteration of a
loop without terminating the loop. On encountering next, R skips the rest
of the loop body and starts the next iteration.
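A short sketch of next on a loop over 1 to 5, skipping the value 3. The vector kept is an illustrative addition that records which values were processed.

```r
x <- 1:5
kept <- integer(0)
for (val in x) {
  if (val == 3) {
    next            # skip the rest of this iteration and move on to 4
  }
  print(val)
  kept <- c(kept, val)
}
```

Every value except 3 is printed, and the loop still runs to the end.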
summary(dataset)
library(ggplot2)
# Histogram of customer age
ggplot(custdata) +
geom_histogram(aes(x=age),
binwidth=5, fill="gray")
The binwidth parameter tells the geom_histogram call to make bins of
five-year intervals (by default it uses 30 bins, so the width is roughly
datarange/30). The fill parameter specifies the color of the histogram
bars (default: dark gray).
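To make the binning concrete, here is a base R sketch that assigns simulated ages to five-year bins and counts them. The data is randomly generated for illustration (it is not custdata), and ggplot2 may align its bin edges differently from this simple floor-based scheme.

```r
set.seed(1)
age <- round(runif(200, min = 18, max = 90))   # simulated ages, not real data

binwidth <- 5
bin_start <- floor(age / binwidth) * binwidth  # left edge of each value's bin
counts <- table(bin_start)                     # number of values per bin
print(counts)
```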
DENSITY PLOTS
A density plot is a “continuous histogram” of a variable, except the area
under the density plot is equal to 1. A point on a density plot
corresponds to the fraction of data (or the percentage of data, divided
by 100) that takes on a particular value. This fraction is usually very
small. When you look at a density plot, you’re more interested in the
overall shape of the curve than in the actual values on the y-axis.
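The two properties described above (area equal to 1, very small y values) can be checked numerically. This is a sketch on simulated log-normal "income" values, which are an assumption for illustration; it uses base R's density() and a trapezoid-rule approximation of the area.

```r
set.seed(42)
income <- rlnorm(1000, meanlog = 10, sdlog = 1)   # simulated, not real data

d <- density(income)   # kernel density estimate on a grid: d$x, d$y

# Trapezoid rule: sum of (interval width) * (average height)
area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
peak <- max(d$y)

print(area)   # approximately 1
print(peak)   # a very small number, since income spans a wide range
```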
library(scales)
ggplot(custdata) + geom_density(aes(x=income)) +
scale_x_continuous(labels=dollar)
ggplot(custdata) + geom_density(aes(x=income)) +
scale_x_log10(breaks=c(100,1000,10000,100000), labels=dollar) +
annotation_logticks(sides="bt")
BAR CHARTS
A bar chart is a histogram for discrete data: it records the frequency of every
value of a categorical variable.
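The frequencies that a bar chart draws are simply counts per category. A minimal base R sketch, using a small hypothetical vector of marital status values:

```r
# Hypothetical categorical data, for illustration only
marital.stat <- c("Married", "Never Married", "Married",
                  "Divorced/Separated", "Married")

freq <- table(marital.stat)   # one count per distinct value
print(freq)
```

Each entry of freq corresponds to the height of one bar.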
Horizontal bar chart (coord_flip() turns the vertical bars on their side)
ggplot(custdata) +
geom_bar(aes(x=state.of.res), fill="gray") +
coord_flip() +
theme(axis.text.y=element_text(size=rel(0.8)))
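# Stacked bar chart (the default position)
ggplot(custdata) + geom_bar(aes(x=marital.stat,
fill=health.ins))
# Side-by-side bar chart
ggplot(custdata) + geom_bar(aes(x=marital.stat,
fill=health.ins),
position="dodge")
# Filled bar chart: every bar has height 1, showing proportions
ggplot(custdata) + geom_bar(aes(x=marital.stat,
fill=health.ins),
position="fill")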
LINE PLOTS
Line plots work best when the relationship between two variables is relatively clean:
each x value has a unique (or nearly unique) y value.
x <- runif(100)
y <- x^2 + 0.2*x
ggplot(data.frame(x=x,y=y), aes(x=x,y=y)) + geom_line()
SCATTER PLOTS
Scatterplots show many points plotted in the Cartesian plane. Each point represents
the values of two variables. One variable is chosen in the horizontal axis and another
in the vertical axis.
Scatterplot
ggplot(custdata2, aes(x=age, y=income)) +
geom_point()
A hexbin plot is like a two-dimensional histogram. The data is divided into
bins, and the number of data points in each bin is represented by color or
shading.
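The counting behind a two-dimensional histogram can be sketched in base R. The example below uses simulated age and income values (an assumption, not custdata2) and rectangular bins of width 5 and 10000; a hexbin plot does the same kind of counting with hexagonal cells.

```r
set.seed(7)
age <- runif(500, min = 20, max = 80)          # simulated values
income <- runif(500, min = 0, max = 200000)    # simulated values

# Assign each point to an age bin and an income bin
age_bin <- cut(age, breaks = seq(20, 80, by = 5), include.lowest = TRUE)
income_bin <- cut(income, breaks = seq(0, 200000, by = 10000), include.lowest = TRUE)

# Cross-tabulate: each cell holds the number of points in that 2D bin
counts2d <- table(age_bin, income_bin)
dim(counts2d)
```

The cell counts are the values that a plot would map to color or shading.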
library(hexbin)
ggplot(custdata2, aes(x=age, y=income)) +
geom_hex(binwidth=c(5, 10000)) +
geom_smooth(color="white", se=FALSE) +
ylim(0,200000)