
UNIT I:

Defining Data Science and Big Data, Benefits and Uses, Facets of Data, Data Science Process.

History and Overview of R, Getting Started with R, R Nuts and Bolts

Data science in a big data world:

1. Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process them using traditional data management techniques such as relational
database management systems (RDBMSs).
2. The widely adopted RDBMS has long been regarded as a one-size-fits-all solution, but the
demands of handling big data have shown otherwise. Data science involves using methods to
analyze massive amounts of data and extract the knowledge it contains.
3. You can think of the relationship between big data and data science as being like the
relationship between crude oil and an oil refinery.
4. Data science and big data evolved from statistics and traditional data management but are now
considered to be distinct disciplines.

The characteristics of big data are often summarized by three Vs, with veracity frequently added as a fourth:

■ Volume: how much data is there?

■ Variety: how diverse are the different types of data?

■ Velocity: at what speed is new data generated?

■ Veracity: how accurate is the data?

Together, these properties make big data different from the data found in traditional data management
tools. Consequently, the challenges they bring can be felt in almost every aspect: data capture, curation,
storage, search, sharing, transfer, and visualization. In addition, big data calls for specialized
techniques to extract the insights.

Data science is an evolutionary extension of statistics capable of dealing with the massive amounts of
data produced today.

Benefits and uses of data science and big data:

Data science and big data are used almost everywhere in both commercial and noncommercial settings.

The number of use cases is vast, and the examples we’ll provide throughout this book only scratch the
surface of the possibilities.

Commercial companies in almost every industry use data science and big data to gain insights into their
customers, processes, staff, competition, and products.

Many companies use data science to offer customers a better user experience, as well as to cross-sell,
up-sell, and personalize their offerings.

A good example of this is Google AdSense, which collects data from internet users so relevant
commercial messages can be matched to the person browsing the internet. MaxPoint
(http://maxpoint.com/us) is another example of real-time personalized advertising.

Governmental organizations are also aware of data’s value. Many governmental organizations not only
rely on internal data scientists to discover valuable information, but also share their data with the
public.

A data scientist in a governmental organization gets to work on diverse projects such as detecting fraud
and other criminal activity or optimizing project funding. A well-known example was provided by
Edward Snowden, who leaked internal documents of the American National Security Agency and the
British Government Communications Headquarters that show clearly how they used data science and
big data to monitor millions of individuals.

Those organizations collected 5 billion data records from widespread applications such as Google Maps,
Angry Birds, email, and text messages, among many other data sources. Then they applied data science
techniques to distill information.

Nongovernmental organizations (NGOs) are also no strangers to using data. They use it to raise money
and defend their causes.

The World Wildlife Fund (WWF), for instance, employs data scientists to increase the effectiveness of
their fundraising efforts. Many data scientists devote part of their time to helping NGOs, because NGOs
often lack the resources to collect data and employ data scientists. DataKind is one such group of data
scientists that devotes its time to the benefit of mankind.

Universities use data science in their research but also to enhance the study experience of their
students. The rise of massive open online courses (MOOCs) produces a lot of data, which allows
universities to study how these types of learning can complement traditional classes.

Facets of data:

In data science and big data you’ll come across many different types of data, and each of them tends to
require different tools and techniques. The main categories of data are these:

■ Structured

■ Unstructured

■ Natural language

■ Machine-generated

■ Graph-based

■ Audio, video, and images

■ Streaming

Let's explore all these interesting data types.

Structured data:

1. Structured data is data that depends on a data model and resides in a fixed field within a record.

2. As such, it’s often easy to store structured data in tables within databases or Excel files (figure 1.1).
SQL, or Structured Query Language, is the preferred way to manage and query data that resides in
databases.

Unstructured data:

Unstructured data is data that isn’t easy to fit into a data model because the content is context-specific
or varying. One example of unstructured data is your regular email (figure 1.2). Although email contains
structured elements such as the sender, title, and body text, it’s a challenge to find the number of
people who have written an email complaint about a specific employee because so many ways exist to
refer to a person, for example. The thousands of different languages and dialects out there further
complicate this.

Natural language:

A human-written email, as shown in figure 1.2, is also a perfect example of natural language data.

Natural language is a special type of unstructured data; it’s challenging to process because it requires
knowledge of specific data science techniques and linguistics.

The natural language processing community has had success in entity recognition, topic recognition,
summarization, text completion, and sentiment analysis, but models trained in one domain don't
generalize well to other domains. Even state-of-the-art techniques aren't able to decipher the meaning
of every piece of text.

Machine-generated data:

Machine-generated data is information that's automatically created by a computer, process, application,
or other machine without human intervention. Machine-generated data is becoming a major data
resource and will continue to do so.

The analysis of machine data relies on highly scalable tools, due to its high volume and speed. Examples
of machine data are web server logs, call detail records, network event logs, and telemetry.

Graph-based or network data:

“Graph data” can be a confusing term because any data can be shown in a graph. “Graph” in this case
points to mathematical graph theory. In graph theory, a graph is a mathematical structure to model
pair-wise relationships between objects. Graph or network data is, in short, data that focuses on the
relationship or adjacency of objects. The graph structures use nodes, edges, and properties to represent
and store graphical data. Graph-based data is a natural way to represent social networks, and its
structure allows you to calculate specific metrics such as the influence of a person and the shortest path
between two people.

Examples of graph-based data can be found on many social media websites.
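As a small illustration, the sketch below builds a toy friendship network in R and computes the two metrics just mentioned. It assumes the add-on igraph package has been installed; the people (Anna, Bob, Carol, Dave) and their relationships are made up purely for the example.

library(igraph)   ## assumes install.packages("igraph") has been run

## A tiny invented friendship network: each edge is a relationship
g <- graph_from_literal(Anna-Bob, Bob-Carol, Anna-Carol, Carol-Dave)

degree(g)                                             ## number of connections, a simple influence measure
shortest_paths(g, from = "Anna", to = "Dave")$vpath   ## the shortest path between two people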

Audio, image, and video:

Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks that are
trivial for humans, such as recognizing objects in pictures, turn out to be challenging for computers.

Streaming data:

While streaming data can take almost any of the previous forms, it has an extra property. The data flows
into the system when an event happens instead of being loaded into a data store in a batch. Although
this isn’t really a different type of data, we treat it here as such because you need to adapt your process
to deal with this type of information. Examples are the “What’s trending” on Twitter, live sporting or
music events, and the stock market.

The data science process:

The data science process typically consists of six steps, as you can see in the mind map in figure 1.5. We
will introduce them briefly here and handle them in more detail in chapter 2.

1. Setting the research goal:

Data science is mostly applied in the context of an organization. When the business asks you to
perform a data science project, you’ll first prepare a project charter. This charter contains
information such as what you’re going to research, how the company benefits from that, what data
and resources you need, a timetable, and deliverables.

2. Retrieving data:

The second step is to collect data. You’ve stated in the project charter which data you need and
where you can find it. In this step you ensure that you can use the data in your program, which
means checking the existence of, the quality of, and access to the data. Data can also be delivered by
third-party companies and takes many forms, ranging from Excel spreadsheets to different types of
databases.

3. Data preparation:

Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data integration
enriches data sources by combining information from multiple data sources, and data
transformation ensures that the data is in a suitable format for use in your models. A short R
sketch at the end of this list illustrates these subphases.

4. Data exploration:

Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether there
are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and simple
modeling. This step often goes by the abbreviation EDA, for Exploratory Data Analysis.

5. Data modeling or model building:

In this phase you use models, domain knowledge, and insights about the data you found in the
previous steps to answer the research question. You select a technique from the fields of statistics,
machine learning, operations research, and so on. Building a model is an iterative process that
involves selecting the variables for the model, executing the model, and model diagnostics.

6. Presentation and automation:

Finally, you present the results to your business. These results can take many forms, ranging from
presentations to research reports. Sometimes you’ll need to automate the execution of the process
because the business will want to use the insights you gained in another project or enable an
operational process to use the outcome from your model.
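As mentioned in step 3, here is a minimal base-R sketch of the three data-preparation subphases, with a quick taste of exploration (step 4). The data frames orders and customers, and every value in them, are made up purely for illustration.

## Hypothetical raw data, invented for this example
orders    <- data.frame(id = c(1, 2, 2, 3), amount = c(100, NA, 250, -40))
customers <- data.frame(id = 1:3, region = c("East", "West", "East"))

## Data cleansing: drop missing and impossible values
clean <- orders[!is.na(orders$amount) & orders$amount > 0, ]

## Data integration: enrich the orders with customer information
combined <- merge(clean, customers, by = "id")

## Data transformation: put the amounts on a log scale for modelling
combined$log_amount <- log(combined$amount)

## A first exploratory look (step 4): descriptive statistics and a simple plot
summary(combined)
hist(combined$log_amount)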

History and overview of R:

What is R? This is an easy question to answer: R is a dialect of S.

S is a language that was developed by John Chambers and others at the old Bell Telephone Laboratories,
originally part of AT&T Corp. S was initiated in 1976 as an internal statistical analysis environment,
originally implemented as Fortran libraries. Early versions of the language did not even contain functions
for statistical modeling.

The R language came into use quite a bit after S had been developed. One key limitation of the S language
was that it was only available in a commercial package, S-PLUS. In 1991, R was created by Ross Ihaka
and Robert Gentleman in the Department of Statistics at the University of Auckland. In 1993 the first
public announcement of R was made. Ross and Robert's experience developing R is documented in a
1996 paper in the Journal of Computational and Graphical Statistics:

Ross Ihaka and Robert Gentleman. R: A language for data analysis and graphics. Journal of
Computational and Graphical Statistics, 5(3):299–314, 1996

In 1995, Martin Mächler made an important contribution by convincing Ross and Robert to use the GNU
General Public License to make R free software. This was critical because it allowed for the source code
for the entire R system to be accessible to anyone who wanted to tinker with it (more on free software
later).

Basic Features of R:

As stated earlier, R is a programming language and software environment for statistical analysis, graphics
representation, and reporting. The following are the important features of R:

• R is a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions, and input and output facilities.

• R has an effective data handling and storage facility.

• R provides a suite of operators for calculations on arrays, lists, vectors and matrices.

• R provides a large, coherent and integrated collection of tools for data analysis.

• R provides graphical facilities for data analysis and display, either directly on the computer or for printing on paper.

In conclusion, R is one of the world's most widely used statistical programming languages. It is a top choice of
data scientists and is supported by a vibrant and talented community of contributors. R is taught in universities and
deployed in mission-critical business applications.

As a convention, we will start learning R programming by writing a "Hello, World!" program. Depending on your needs,
you can program either at the R command prompt or in an R script file. Let's check both one by one.
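A minimal version of both options follows; the file name hello.R is just an example.

At the R command prompt:

> print("Hello, World!")
[1] "Hello, World!"

Using a script file: save the same print("Hello, World!") line in a file called hello.R, then run it either with source("hello.R") from the R prompt or with Rscript hello.R from the operating-system command line.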

Free Software:

A major advantage that R has over many other statistical packages is that it's free in the sense of free
software.

According to the Free Software Foundation, with free software you are granted the following four
freedoms:

• The freedom to run the program, for any purpose (freedom 0).

• The freedom to study how the program works, and adapt it to your needs (freedom 1). Access to the
source code is a precondition for this.

• The freedom to redistribute copies so you can help your neighbor (freedom 2).

• The freedom to improve the program, and release your improvements to the public, so that the whole
community benefits (freedom 3). Access to the source code is a precondition for this.

Design of the R System

The primary R system is available from the Comprehensive R Archive Network, also known as CRAN.
CRAN also hosts many add-on packages that can be used to extend the functionality of R.

The R system is divided into two conceptual parts:

1. The "base" R system that you download from CRAN (builds are available for Linux, Windows, and Mac, along with the source code)

2. Everything else

R functionality is divided into a number of packages.

• The "base" R system contains, among other things, the base package which is required to run R and contains the most fundamental functions.

• The other packages contained in the "base" system include utils, stats, datasets, graphics, grDevices, grid, methods, tools, parallel, compiler, splines, tcltk, stats4.

• There are also "Recommended" packages: boot, class, cluster, codetools, foreign, KernSmooth, lattice, mgcv, nlme, rpart, survival, MASS, spatial, nnet, Matrix.

When you download a fresh installation of R from CRAN, you get all of the above, which represents a
substantial amount of functionality. However, there are many other packages available:

• There are over 4,000 packages on CRAN that have been developed by users and programmers around the world.

• There are also many packages associated with the Bioconductor project.

• People often make packages available on their personal websites; there is no reliable way to keep track of how many packages are available in this fashion.

• A number of packages are being developed on repositories like GitHub and BitBucket, but there is no reliable listing of all these packages.
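To show how an add-on package is used in practice (ggplot2 here is only an example of a popular CRAN package, not something required by these notes):

> install.packages("ggplot2")   ## downloads and installs the package from CRAN (done once)
> library(ggplot2)              ## loads the installed package into the current R session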

Limitations of R:

No programming language or statistical analysis system is perfect. R certainly has a number of
drawbacks. For starters, R is essentially based on almost 50-year-old technology, going back to the
original S system developed at Bell Labs. There was originally little built-in support for dynamic or 3-D
graphics (but things have improved greatly since the "old days").

Another commonly cited limitation of R is that objects must generally be stored in physical memory. This
is in part due to the scoping rules of the language, but R generally is more of a memory hog than other
statistical packages.

At a higher level one “limitation” of R is that its functionality is based on consumer demand and
(voluntary) user contributions. If no one feels like implementing your favorite method, then it’s your job
to implement it (or you need to pay someone to do it). The capabilities of the R system generally reflect
the interests of the R user community.

Getting Started with R:

Installation:

The first thing you need to do to get started with R is to install it on your computer. R works on pretty
much every platform available, including the widely available Windows, Mac OS X, and Linux systems.

If you want to watch a step-by-step tutorial on how to install R for Mac or Windows, you can watch
these videos:

• Installing R on Windows

• Installing R on the Mac

There is also an integrated development environment (IDE) available for R that is built by RStudio. I really
like this IDE: it has a nice editor with syntax highlighting, there is an R object viewer, and there are a
number of other nice features that are integrated. You can see how to install RStudio here:

• Installing RStudio

The RStudio IDE is available from RStudio's web site.

Getting started with the R interface:

After you install R you will need to launch it and start writing R code. Before we get to exactly how to
write R code, it's useful to get a sense of how the system is organized. In these two videos I talk about
where to write code and how to set your working directory, which lets R know where to find all of your
files (a small example of the relevant commands follows the links below).

• Writing code and setting your working directory on the Mac

• Writing code and setting your working directory on Windows
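A minimal sketch of the commands involved; the folder name below is made up, so point setwd() at a directory that actually exists on your machine.

> getwd()              ## prints the current working directory
> setwd("~/ds-unit1")  ## illustrative path: make R look in the folder holding your files
> list.files()         ## lists the files R can now see without typing a full path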

R Nuts and Bolts:

Entering Input:

At the R prompt we type expressions. The <- symbol is the assignment operator.

> x <- 1

> print(x)
[1] 1

> x
[1] 1

> msg <- "hello"

> x <-  ## Incomplete expression

The # character indicates a comment. Anything to the right of the # (including the # itself) is ignored.
This is the only comment character in R. Unlike some other languages, R does not support multi-line
comments or comment blocks.

Evaluation:

When a complete expression is entered at the prompt, it is evaluated and the result of the evaluated
expression is returned. The result may be auto-printed.

> x <- 5 ## nothing printed

> x          ## auto-printing occurs
[1] 5

> print(x)   ## explicit printing
[1] 5

> x <- 10:30

> x

 [1] 10 11 12 13 14 15 16 17 18 19 20 21

[13] 22 23 24 25 26 27 28 29 30

The numbers in square brackets are not part of the vector itself; they indicate the index of the first
element printed on each line (so the second line above starts with the 13th element, 22).

R Objects:

R has five basic or “atomic” classes of objects:

• character

• numeric (real numbers)

• integer

• complex

• logical (TRUE/FALSE)

The most basic type of R object is a vector. Empty vectors can be created with the vector() function.
There is really only one rule about vectors in R, which is that a vector can only contain objects of the
same class.

But of course, like any good rule, there is an exception, which is a list, which we will get to a bit later. A
list is represented as a vector but can contain objects of different classes. Indeed, that’s usually why we
use them.

There is also a class for “raw” objects, but they are not commonly used directly in data analysis and I
won’t cover them here.

Numbers:

Numbers in R are generally treated as numeric objects (i.e. double precision real numbers). This means
that even if you see a number like “1” or “2” in R, which you might think of as integers, they are likely
represented behind the scenes as numeric objects (so something like “1.00” or “2.00”).
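A quick illustration of this point, using the class() function to report how a value is stored:

> class(1)      ## a literal like 1 is stored as a numeric (double precision) object
[1] "numeric"
> class(1:3)    ## the colon operator, by contrast, produces integers
[1] "integer"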

Attributes:

R objects can have attributes, which are like metadata for the object. These metadata can be very useful
in that they help to describe the object. For example, column names on a data frame help to tell us what
data are contained in each of the columns. Some examples of R object attributes are

• names, dimnames

• dimensions (e.g. matrices, arrays)

• class (e.g. integer, numeric)

• length

• other user-defined attributes/metadata

Attributes of an object (if any) can be accessed using the attributes() function. Not all R objects contain
attributes, in which case the attributes() function returns NULL.
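For example, a plain vector carries no attributes, while a matrix carries a dim attribute:

> x <- 1:10
> attributes(x)
NULL
> m <- matrix(1:6, nrow = 2, ncol = 3)
> attributes(m)
$dim
[1] 2 3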

Creating Vectors:
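The "above example" referred to in the note below is not reproduced in these notes; a typical illustration of creating vectors with the c() function looks like this:

> x <- c(0.5, 0.6)       ## numeric
> x <- c(TRUE, FALSE)    ## logical
> x <- c(T, F)           ## logical
> x <- c("a", "b", "c")  ## character
> x <- 9:29              ## integer
> x <- c(1+0i, 2+4i)     ## complex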

Note that in the above example, T and F are short-hand ways to specify TRUE and FALSE. However, in
general one should try to use the explicit TRUE and FALSE values when indicating logical values. The T
and F values are primarily there for when you’re feeling lazy. You can also use the vector() function to
initialize vectors.

> x <- vector("numeric", length = 10)

> x

[1] 0 0 0 0 0 0 0 0 0 0

Mixing Objects:

There are occasions when different classes of R objects get mixed together. Sometimes this happens by
accident but it can also happen on purpose. So what happens with the following code?

> y <- c(1.7, "a") ## character

> y <- c(TRUE, 2) ## numeric

> y <- c("a", TRUE) ## character

In each case above, we are mixing objects of two different classes in a vector. But remember that the
only rule about vectors says this is not allowed. When different objects are mixed in a vector, coercion
occurs so that every element in the vector is of the same class.

In the example above, we see the effect of implicit coercion. What R tries to do is find a way to
represent all of the objects in the vector in a reasonable fashion. Sometimes this does exactly what you
want and…sometimes not. For example, combining a numeric object with a character object will create
a character vector, because numbers can usually be easily represented as strings.

Explicit Coercion:

Objects can be explicitly coerced from one class to another using the as.* functions, if available.
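For example, an integer vector can be coerced with the as.* functions; nonsensical coercions (such as turning letters into numbers) produce NA values with a warning:

> x <- 0:6
> class(x)
[1] "integer"
> as.numeric(x)
[1] 0 1 2 3 4 5 6
> as.logical(x)
[1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
> as.character(x)
[1] "0" "1" "2" "3" "4" "5" "6"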

Matrices:

Matrices are vectors with a dimension attribute. The dimension attribute is itself an integer vector of
length 2 (number of rows, number of columns).
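A short illustration; note that matrices are filled column-wise by default:

> m <- matrix(nrow = 2, ncol = 3)    ## an empty 2 x 3 matrix
> dim(m)
[1] 2 3
> m <- matrix(1:6, nrow = 2, ncol = 3)
> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x <- 1:10
> dim(x) <- c(2, 5)                  ## adding a dim attribute turns a vector into a matrix
> m <- cbind(1:3, 10:12)             ## matrices can also be built by column- or row-binding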

Lists:

Lists are a special type of vector that can contain elements of different classes. Lists are a very important
data type in R and you should get to know them well. Lists, in combination with the various “apply”
functions discussed later, make for a powerful combination.

Lists can be explicitly created using the list() function, which takes an arbitrary number of arguments.
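For example:

> x <- list(1, "a", TRUE, 1 + 4i)
> x
[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i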

Factor:

Factors are used to represent categorical data and can be unordered or ordered. One can think of a
factor as an integer vector where each integer has a label. Factors are important in statistical modeling
and are treated specially by modelling functions like lm() and glm().

Using factors with labels is better than using integers because factors are self-describing. Having a
variable that has values “Male” and “Female” is better than a variable that has values 1 and 2.

Factor objects can be created with the factor() function

> x <- factor(c("yes", "yes", "no", "yes", "no")) > x [1] yes yes no yes no Levels: no yes > table(x) x no yes
2 3 > ## See the underlying representation of factor > unclass(x) [1] 2 2 1 2 1 attr(,"levels") [1] "no" "yes

Often factors will be automatically created for you when you read a dataset in using a function like
read.table(). Those functions often default to creating factors when they encounter data that look like
characters or strings.

The order of the levels of a factor can be set using the levels argument to factor(). This can be important
in linear modelling because the first level is used as the baseline level.
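For example, forcing "yes" to be the baseline level (by default the levels are taken in alphabetical order, so "no" would come first):

> x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
> x
[1] yes yes no  yes no
Levels: yes no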

Missing Values:

Missing values are denoted by NA; NaN is used for undefined mathematical operations.

• is.na() is used to test objects to see if they are NA

• is.nan() is used to test for NaN

• NA values have a class also, so there are integer NA, character NA, etc.

• A NaN value is also NA, but the converse is not true
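For example:

> x <- c(1, 2, NA, 10, 3)
> is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
> is.nan(x)
[1] FALSE FALSE FALSE FALSE FALSE
> x <- c(1, 2, NaN, NA, 4)
> is.na(x)     ## a NaN also counts as NA
[1] FALSE FALSE  TRUE  TRUE FALSE
> is.nan(x)    ## but a plain NA is not NaN
[1] FALSE FALSE  TRUE FALSE FALSE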

Data Frames:

Data frames are used to store tabular data in R. They are an important type of object in R and are used
in a variety of statistical modeling applications. Hadley Wickham's package dplyr has an optimized set
of functions designed to work efficiently with data frames.

Data frames are represented as a special type of list where every element of the list has to have the
same length. Each element of the list can be thought of as a column and the length of each element of
the list is the number of rows.

Unlike matrices, data frames can store different classes of objects in each column. Matrices must have
every element be the same class (e.g. all integers or all numeric). In addition to column names,
indicating the names of the variables or predictors, data frames have a special attribute called
row.names which indicates information about each row of the data frame.

Data frames are usually created by reading in a dataset with the read.table() or read.csv() functions. However,
data frames can also be created explicitly with the data.frame() function or they can be coerced from
other types of objects like lists.

Data frames can be converted to a matrix by calling data.matrix(). While it might seem that the
as.matrix() function should be used to coerce a data frame to a matrix, almost always, what you want is
the result of data.matrix().
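For example:

> x <- data.frame(foo = 1:4, bar = c(TRUE, TRUE, FALSE, FALSE))
> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE
> nrow(x)
[1] 4
> ncol(x)
[1] 2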

Names:

R objects can have names, which is very useful for writing readable code and self-describing objects.
Here is an example of assigning names to an integer vector.
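The example referred to above is not reproduced in these notes; a typical version, with arbitrary city names as labels, looks like this:

> x <- 1:3
> names(x)
NULL
> names(x) <- c("New York", "Seattle", "Los Angeles")
> x
   New York     Seattle Los Angeles
          1           2           3
> names(x)
[1] "New York"    "Seattle"     "Los Angeles"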
