Unit 1

1.2 Facets of data


In data science and big data we’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories of data are these:

Structured

Unstructured

Natural language

Machine-generated

Graph-based

Audio, video, and images

Streaming
1.2.1 Structured data
Structured data is data that depends on a data model and resides in a fixed field within a
record. As such, it’s often easy to store structured data in tables within databases or Excel
files. SQL, or Structured Query Language, is the preferred way to manage and query data
that resides in databases.

Figure 1.1 An Excel table is an example of structured data.
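For concreteness, here is a minimal sketch (not taken from the text) that stores a small structured table in an in-memory SQLite database and queries it with SQL; the table name, columns, and values are invented for illustration.

import sqlite3

# In-memory SQLite stands in for any relational database holding structured data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", "Sales"), (2, "Bob", "Marketing")],
)
# SQL query against the fixed fields of each record.
for row in conn.execute("SELECT name FROM employees WHERE department = 'Sales'"):
    print(row)   # ('Alice',)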



1.2.2 Unstructured data


Unstructured data is data whose content is context-specific or varying. One example of
unstructured data is your regular email. Although email contains structured elements such
as the sender, title, and body text, it's a challenge to find, for example, the number of people
who have written an email complaint about a specific employee, because so many ways exist
to refer to a person.
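To see why this is hard, here is a hedged sketch: a naive keyword search over a few hypothetical email bodies, trying to cover several spellings of an invented employee's name. Real unstructured text needs far more robust techniques than this.

import re

# Hypothetical email bodies; the same person is referred to in several different ways.
emails = [
    "I want to complain about John Smith in support.",
    "Mr. Smith was very unhelpful yesterday.",
    "J. Smith ignored my ticket for a week.",
]

# A naive pattern covering a few spellings; it will still miss many variants.
pattern = re.compile(r"\b(john\s+smith|mr\.?\s+smith|j\.\s*smith)\b", re.IGNORECASE)
complaints = [e for e in emails if pattern.search(e)]
print(len(complaints), "emails mention that employee")   # 3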
1.2.3 Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it
requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis. Even state-of-the-art
techniques aren't able to decipher the meaning of every piece of text. The concept of
meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can vary
when coming from someone upset or joyous.
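As one illustrative possibility (assuming the NLTK library mentioned later in this unit, with its VADER sentiment lexicon downloaded), sentiment analysis on a single sentence can be as simple as:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the sentiment lexicon

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The support team handled my complaint quickly and politely."))
# Returns negative/neutral/positive/compound scores for the sentence.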

1.2.4 Machine-generated data


Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention. Machine-generated data is
becoming a major data resource and will continue to do so.
The analysis of machine data relies on highly scalable tools, due to its high volume and
speed. Examples of machine data are web server logs, call detail records, network event logs,
and telemetry.

Figure 1.3 Example of machine-generated data
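As a rough sketch, a web server log line (machine-generated data) can be parsed with a regular expression. The line below is a made-up example in the common "combined log" style; the exact format of real logs varies.

import re

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Pull out the IP address, timestamp, request, status code, and response size.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)'
)
match = pattern.match(line)
if match:
    print(match.group("ip"), match.group("status"), match.group("request"))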

1.2.5 Graph-based or network data


“Graph” in this case points to mathematical graph theory. In graph theory, a graph is a
mathematical structure to model pair-wise relationships between objects. Graph or network
data is, in short, data that focuses on the relationship or adjacency of objects. The graph
structures use nodes, edges, and properties to represent and store graphical data. Graph-
based data is a natural way to represent social networks, and its structure allows us to
calculate specific metrics such as the influence of a person and the shortest path between
two people.
Examples of graph-based data can be found on many social media websites (figure 1.4).
For instance, on LinkedIn you can see who you know at which company. Your follower list
on Twitter is another example of graph-based data.


Figure 1.4 Friends in a social network are an example of graph-based data.

Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
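As an illustrative sketch (assuming the third-party networkx library; the people and connections below are invented), we can build a tiny social graph and compute the kinds of metrics mentioned above:

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Lucy", "Elizabeth"), ("Elizabeth", "Liam"), ("Liam", "Kim"),
    ("Kim", "Carlos"), ("Lucy", "Guy"), ("Guy", "Carlos"),
])

print(nx.shortest_path(G, "Lucy", "Carlos"))   # shortest path between two people
print(nx.degree_centrality(G))                 # a simple proxy for a person's influence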

1.2.6 Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
Recently a company called DeepMind succeeded at creating an algorithm that’s capable
of learning how to play video games. This algorithm takes the video screen as input and
learns to interpret everything via a complex process of deep learning. It’s a remarkable feat
that prompted Google to buy the company for its own artificial intelligence (AI)
development plans.

1.2.7 Streaming data


While streaming data can take almost any of the forms described above, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a data
store in a batch.
Examples are "What's trending" on Twitter, live sporting or music events, and the stock
market.
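A minimal sketch of the difference: instead of reading a finished batch, the program consumes events one at a time as they arrive and keeps an aggregate up to date. The event source below is simulated.

import random
import time

def event_stream():
    """Simulate events arriving one by one rather than as a pre-loaded batch."""
    while True:
        yield {"timestamp": time.time(), "value": random.random()}
        time.sleep(0.1)

running_total = 0.0
for i, event in enumerate(event_stream()):
    running_total += event["value"]                      # update the aggregate per event
    print(f"event {i}: running mean = {running_total / (i + 1):.3f}")
    if i >= 4:                                           # stop the demo after five events
        break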

1.3 The data science process


The data science process typically consists of six steps.

1.3.1 Setting the research goal


Data science is mostly applied in the context of an organization, so we'll first prepare a project
charter. This charter contains information such as what you're going to research, how the
company benefits from that, what data and resources you need, a timetable, and deliverables.
The data science process:

1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation

Figure 1.5 The data science process


1.3.2 Retrieving data


The second step is to collect data. In this step you ensure that you can use the data in your
program, which means checking the existence of, the quality of, and access to the data. Data can also
be delivered by third-party companies and takes many forms ranging from Excel spreadsheets
to different types of databases.
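As a hedged sketch (assuming pandas; the file names are hypothetical), retrieving delivered data and checking its existence, quality, and accessibility might start like this:

import pandas as pd

customers = pd.read_excel("customers.xlsx")   # hypothetical third-party Excel delivery
orders = pd.read_csv("orders.csv")            # hypothetical flat-file export

# Quick checks: shape, column types, and missing values.
print(customers.shape, customers.dtypes)
print(orders.isna().sum())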

1.3.3 Data preparation


Data collection is an error-prone process; in this phase we enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data
integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
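A minimal pandas sketch of the three subphases, using invented data: cleansing (treating an impossible value as missing and dropping a duplicate record), integration (merging two sources on a shared key), and transformation (rescaling a column for modeling).

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 2, 3], "age": [34, -1, -1, 28]})
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})

# Data cleansing: drop the duplicate record and treat the impossible age -1 as missing.
customers = customers.drop_duplicates().replace({"age": {-1: pd.NA}})

# Data integration: enrich one source with information from another.
combined = customers.merge(orders, on="customer_id", how="left")

# Data transformation: put a variable in a shape a model can use (here, standardized).
combined["amount_scaled"] = (combined["amount"] - combined["amount"].mean()) / combined["amount"].std()
print(combined)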
1.3.4 Data exploration
Data exploration is concerned with how variables interact with each other, the distribution of the
data, and whether there are outliers. This step often goes by the abbreviation EDA, for
Exploratory Data Analysis.
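As a small illustration with invented data, a first pass at EDA in pandas could look at distributions, relationships between variables, and possible outliers:

import pandas as pd

df = pd.DataFrame({"income": [1200, 1500, 1100, 9800, 1400],
                   "age": [23, 31, 27, 45, 38]})

print(df.describe())                                    # distribution summaries per variable
print(df.corr())                                        # how the variables relate to each other
print(df[df["income"] > df["income"].quantile(0.95)])   # a crude check for outliers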

1.3.5 Data modeling or model building


In this phase we use models, domain knowledge, and the insights about the data gained in the
previous steps. We select a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.
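One possible sketch of this step (using scikit-learn and synthetic data, not a method prescribed by the text): select the input variables, execute the model, and run a simple diagnostic on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a target driven by two selected variables plus noise.
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + np.random.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)             # execute the model
print(mean_squared_error(y_test, model.predict(X_test)))     # model diagnostics on held-out data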

1.3.6 Presentation and automation


Finally, we present the results to the business. These results can take many forms, ranging
from presentations to research reports. The insights we gained may also be used in another
project or enable an operational process to use the outcome of our model.
AN ITERATIVE PROCESS The previous description of the data science process may give
the impression that we walk through it in a linear way. In reality it is an iterative process:
as we work through the steps we gain incremental insights, which can lead us to revisit earlier steps.

1.4 The big data ecosystem and data science


Currently many big data tools and frameworks exist, and it’s easy to get lost because new
technologies appear rapidly. It’s much easier once you realize that the big data ecosystem can
be grouped into technologies that have similar goals and functionalities, which we’ll discuss in
this section. Data scientists use many different technologies, but not all of them; we’ll dedicate
a separate chapter to the most important data science technology classes. The mind map in
figure 1.6 shows the components of the big data ecosystem and where the different
technologies belong.
Let’s look at the different groups of tools in this diagram and see what each does. We’ll
start with distributed file systems.

1.4.1 Distributed file systems


A distributed file system is similar to a normal file system, except that it runs on multiple
servers at once. Distributed file systems have significant advantages:

They can store files larger than any one computer disk.

Files get automatically replicated across multiple servers for redundancy or parallel
operations while hiding the complexity of doing so from the user.

The system scales easily: you’re no longer bound by the memory or storage restrictions
of a single server.
In the past, scale was increased by moving everything to a server with more memory, storage, and a better
CPU (vertical scaling). Nowadays you can simply add another small server (horizontal scaling). This
principle makes the scaling potential virtually limitless.
The best-known distributed file system at this moment is the Hadoop Distributed File System (HDFS).
Figure 1.6 Big data technologies can be classified into a few main components.
1.4.2 Distributed programming framework
One important aspect of working on a distributed hard disk is that we won't move our data to
our program; rather, we'll move our program to the data. We use a general-purpose
programming language such as C, Python, or Java. The open source community has developed
many frameworks to handle tasks such as restarting failed jobs and tracking results from different
subprocesses, and these frameworks give you a much better experience working with distributed
data and dealing with many of the challenges it carries.
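To make the idea concrete, here is a toy, single-machine sketch of the map-reduce pattern that such frameworks implement at scale. In a real cluster each "map" step would run on the server that already holds its piece of the data.

from collections import Counter
from functools import reduce

documents = [
    "big data needs distributed processing",
    "distributed processing moves the program to the data",
]

# Map step: each document is turned into partial word counts independently.
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce step: the partial results are merged into one final answer.
total_counts = reduce(lambda a, b: a + b, partial_counts)
print(total_counts.most_common(3))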

1.4.3 Data integration framework


We need to move data from one source to another, and this is where the data integration
frameworks such as Apache Sqoop and Apache Flume excel. The process is similar to an
extract, transform, and load process in a traditional data warehouse.

1.4.4 Machine learning frameworks


Before World War II everything needed to be calculated by hand, which severely limited the
possibilities of data analysis. After World War II computers and scientific computing were
developed. A single computer could do all the counting and calculations and a world of
opportunities opened. One of the biggest issues with the old algorithms is that they don't scale
well. With the amount of data we need to analyze today, this becomes problematic, and
specialized frameworks and libraries are required to deal with this amount of data.

PyBrain for neural networks—Neural networks are learning algorithms that mimic the
human brain in learning mechanics and complexity. Neural networks are often regarded
as advanced and black box.

NLTK or Natural Language Toolkit—As the name suggests, its focus is working with
natural language. It’s an extensive library that comes bundled with a number of text
corpuses to help you model your own data.

Pylearn2—Another machine learning toolbox but a bit less mature than Scikit-learn.

TensorFlow—A Python library for deep learning provided by Google.
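As a hedged illustration of one of these libraries, a tiny TensorFlow/Keras network trained on random data might look like the sketch below; the architecture, layer sizes, and data are arbitrary.

import numpy as np
import tensorflow as tf

# Random toy data: 100 samples with 4 features and a binary label.
X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)   # the deep learning step, here on toy data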
1.4.5 NoSQL databases
To store huge amounts of data, you require software that’s specialized in managing and
querying this data. Traditionally this has been the playing field of relational databases such as
Oracle SQL, MySQL, Sybase IQ, and others. New types of databases have emerged under the
grouping of NoSQL databases.
Many of the NoSQL databases have implemented a version of SQL. By solving several of the
problems of traditional databases, NoSQL databases allow for a virtually endless growth of
data.

Column databases—Data is stored in columns, which allows algorithms to perform much
faster queries. Newer technologies use cell-wise storage. Table-like structures are still
important.

Document stores—Document stores no longer use tables, but store every observation in a
document. This allows for a much more flexible data scheme.

Streaming data—Data is collected, transformed, and aggregated not in batches but in real
time. Although we’ve categorized it here as a database to help you in tool selection, it’s
more a particular type of problem that drove creation of technologies such as Storm.

Key-value stores—Data isn't stored in a table; rather you assign a key to every value,
such as org.marketing.sales.2015: 20000. This scales well but places almost all the
implementation effort on the developer (a small sketch follows after this list).

SQL on Hadoop—Batch queries on Hadoop are in a SQL-like language that uses the map-
reduce framework in the background.

New SQL—This class combines the scalability of NoSQL databases with the advantages
of relational databases. They all have a SQL interface and a relational data model.

Graph databases—Not every problem is best stored in a table. Particular problems are
more naturally translated into graph theory and stored in graph databases. A classic
example of this is a social network.
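As a conceptual sketch of the key-value idea mentioned above, a plain Python dictionary can stand in for a real key-value store: every value is reachable only through its key.

store = {}
store["org.marketing.sales.2015"] = 20000             # write a value under its key
print(store.get("org.marketing.sales.2015"))          # read it back: 20000
print(store.get("org.marketing.sales.2014", "n/a"))   # anything beyond key lookup is up to the developer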

1.4.6 Scheduling tools
Scheduling tools help you automate repetitive tasks and trigger jobs based on events such as
adding a new file to a folder. These are similar to tools such as CRON on Linux.

1.4.7 Benchmarking tools
This class of tools was developed to optimize your big data installation by providing
standardized profiling suites. A profiling suite is taken from a representative set of big data
jobs. Benchmarking and optimizing the big data infrastructure and configuration aren’t often
jobs for data scientists themselves but for a professional specialized in setting up IT
infrastructure.

1.4.8 System deployment
Setting up a big data infrastructure isn’t an easy task and assisting engineers in deploying new
applications into the big data cluster is where system deployment tools shine.

1.4.9 Service programming
We have no idea of the architecture or technology of everyone keen on using our predictions.
Service tools excel here by exposing big data applications to other applications as a service.
Data scientists sometimes need to expose their models through services. The best-known
example is the REST service; REST stands for representational state transfer.
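As a hedged sketch (assuming the Flask library and a hypothetical trained model object), exposing a model's predictions as a REST service might look like this:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # input features sent by the calling application
    # score = model.predict([payload["features"]])    # hypothetical trained model
    score = 1                                         # placeholder so the sketch runs on its own
    return jsonify({"prediction": score})

if __name__ == "__main__":
    app.run(port=5000)                                # other applications can now call POST /predict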

1.4.10 Security
We probably need to have fine-grained control over the access to data but don’t want to
manage this on an application-by-application basis. Big data security tools allow you to have
central and fine-grained control over access to the data.