Unit 1

1.2 Facets of data


In data science and big data we’ll come across many different types of data, and each of
them tends to require different tools and techniques. The main categories of data are these:

Structured

Unstructured

Natural language

Machine-generated

Graph-based

Audio, video, and images

Streaming
1.2.1 Structured data
Structured data is data that depends on a data model and resides in a fixed field within a
record. As such, it’s often easy to store structured data in tables within databases or Excel
files. SQL, or Structured Query Language, is the preferred way to manage and query data
that resides in databases.

Figure 1.1 An Excel table is an example of structured data.
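For concreteness, here is a minimal sketch (not taken from the text) that stores a small structured table in an in-memory SQLite database and queries it with SQL; the table name, columns, and values are invented for illustration.

import sqlite3

# In-memory SQLite stands in for any relational database holding structured data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Alice", "Sales"), (2, "Bob", "Marketing")],
)
# SQL query against the fixed fields of each record.
for row in conn.execute("SELECT name FROM employees WHERE department = 'Sales'"):
    print(row)   # ('Alice',)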



1.2.2 Unstructured data


Unstructured data is data whose content is context-specific or varying. One example of
unstructured data is your regular email. Although email contains structured elements such
as the sender, title, and body text, it's a challenge to find, for example, the number of people
who have written an email complaint about a specific employee, because so many ways exist
to refer to a person.
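To see why this is hard, here is a hedged sketch: a naive keyword search over a few hypothetical email bodies, trying to cover several spellings of an invented employee's name. Real unstructured text needs far more robust techniques than this.

import re

# Hypothetical email bodies; the same person is referred to in several different ways.
emails = [
    "I want to complain about John Smith in support.",
    "Mr. Smith was very unhelpful yesterday.",
    "J. Smith ignored my ticket for a week.",
]

# A naive pattern covering a few spellings; it will still miss many variants.
pattern = re.compile(r"\b(john\s+smith|mr\.?\s+smith|j\.\s*smith)\b", re.IGNORECASE)
complaints = [e for e in emails if pattern.search(e)]
print(len(complaints), "emails mention that employee")   # 3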
1.2.3 Natural language
Natural language is a special type of unstructured data; it’s challenging to process because it
requires knowledge of specific data science techniques and linguistics.
The natural language processing community has had success in entity recognition, topic
recognition, summarization, text completion, and sentiment analysis. Even state-of-the-art
techniques aren't able to decipher the meaning of every piece of text. The concept of
meaning itself is questionable here. Have two people listen to the same
conversation. Will they get the same meaning? The meaning of the same words can vary
when coming from someone upset or joyous.
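As one illustrative possibility (assuming the NLTK library mentioned later in this unit, with its VADER sentiment lexicon downloaded), sentiment analysis on a single sentence can be as simple as:

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the sentiment lexicon

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("The support team handled my complaint quickly and politely."))
# Returns negative/neutral/positive/compound scores for the sentence.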

1.2.4 Machine-generated data


Machine-generated data is information that’s automatically created by a computer, process,
application, or other machine without human intervention. Machine-generated data is
becoming a major data resource and will continue to do so.
The analysis of machine data relies on highly scalable tools, due to its high volume and
speed. Examples of machine data are web server logs, call detail records, network event logs,
and telemetry.

Figure 1.3 Example of machine-generated data
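As a rough sketch, a web server log line (machine-generated data) can be parsed with a regular expression. The line below is a made-up example in the common "combined log" style; the exact format of real logs varies.

import re

line = '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'

# Pull out the IP address, timestamp, request, status code, and response size.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+)'
)
match = pattern.match(line)
if match:
    print(match.group("ip"), match.group("status"), match.group("request"))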

1.2.5 Graph-based or network data


“Graph” in this case points to mathematical graph theory. In graph theory, a graph is a
mathematical structure to model pair-wise relationships between objects. Graph or network
data is, in short, data that focuses on the relationship or adjacency of objects. The graph
structures use nodes, edges, and properties to represent and store graphical data. Graph-
based data is a natural way to represent social networks, and its structure allows us to
calculate specific metrics such as the influence of a person and the shortest path between
two people.
Examples of graph-based data can be found on many social media websites (figure 1.4).
For instance, on LinkedIn you can see who you know at which company. Your follower list
on Twitter is another example of graph-based data.


Figure 1.4 Friends in a social network are an example of graph-based data.

Graph databases are used to store graph-based data and are queried with specialized query
languages such as SPARQL.
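As an illustrative sketch (assuming the third-party networkx library; the people and connections below are invented), we can build a tiny social graph and compute the kinds of metrics mentioned above:

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("Lucy", "Elizabeth"), ("Elizabeth", "Liam"), ("Liam", "Kim"),
    ("Kim", "Carlos"), ("Lucy", "Guy"), ("Guy", "Carlos"),
])

print(nx.shortest_path(G, "Lucy", "Carlos"))   # shortest path between two people
print(nx.degree_centrality(G))                 # a simple proxy for a person's influence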

1.2.6 Audio, image, and video


Audio, image, and video are data types that pose specific challenges to a data scientist. Tasks
that are trivial for humans, such as recognizing objects in pictures, turn out to be challenging
for computers.
Recently a company called DeepMind succeeded at creating an algorithm that’s capable
of learning how to play video games. This algorithm takes the video screen as input and
learns to interpret everything via a complex process of deep learning. It’s a remarkable feat
that prompted Google to buy the company for its own artificial intelligence (AI)
development plans.

1.2.7 Streaming data


While streaming data can take almost any of the forms described above, it has an extra property.
The data flows into the system when an event happens instead of being loaded into a data
store in a batch.
Examples are "What's trending" on Twitter, live sporting or music events, and the stock
market.
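A minimal sketch of the difference: instead of reading a finished batch, the program consumes events one at a time as they arrive and keeps an aggregate up to date. The event source below is simulated.

import random
import time

def event_stream():
    """Simulate events arriving one by one rather than as a pre-loaded batch."""
    while True:
        yield {"timestamp": time.time(), "value": random.random()}
        time.sleep(0.1)

running_total = 0.0
for i, event in enumerate(event_stream()):
    running_total += event["value"]                      # update the aggregate per event
    print(f"event {i}: running mean = {running_total / (i + 1):.3f}")
    if i >= 4:                                           # stop the demo after five events
        break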

1.3 The data science process


The data science process typically consists of six steps.

1.3.1 Setting the research goal


Data science is mostly applied in the context of an organization, so we'll first prepare a project
charter. This charter contains information such as what you're going to research, how the
company benefits from that, what data and resources you need, a timetable, and deliverables.
The data science process:

1. Setting the research goal
2. Retrieving data
3. Data preparation
4. Data exploration
5. Data modeling
6. Presentation and automation

Figure 1.5 The data science process


1.3.2 Retrieving data


The second step is to collect data. In this step you ensure that you can use the data in your
program, which means checking the existence of, the quality of, and access to the data. Data can also
be delivered by third-party companies and takes many forms ranging from Excel spreadsheets
to different types of databases.
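As a hedged sketch (assuming pandas; the file names are hypothetical), retrieving delivered data and checking its existence, quality, and accessibility might start like this:

import pandas as pd

customers = pd.read_excel("customers.xlsx")   # hypothetical third-party Excel delivery
orders = pd.read_csv("orders.csv")            # hypothetical flat-file export

# Quick checks: shape, column types, and missing values.
print(customers.shape, customers.dtypes)
print(orders.isna().sum())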

1.3.3 Data preparation


Data collection is an error-prone process; in this phase we enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data
integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
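A minimal pandas sketch of the three subphases, using invented data: cleansing (treating an impossible value as missing and dropping a duplicate record), integration (merging two sources on a shared key), and transformation (rescaling a column for modeling).

import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 2, 3], "age": [34, -1, -1, 28]})
orders = pd.DataFrame({"customer_id": [1, 2, 3], "amount": [100.0, 250.0, 80.0]})

# Data cleansing: drop the duplicate record and treat the impossible age -1 as missing.
customers = customers.drop_duplicates().replace({"age": {-1: pd.NA}})

# Data integration: enrich one source with information from another.
combined = customers.merge(orders, on="customer_id", how="left")

# Data transformation: put a variable in a shape a model can use (here, standardized).
combined["amount_scaled"] = (combined["amount"] - combined["amount"].mean()) / combined["amount"].std()
print(combined)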
1.3.4 Data exploration
Data exploration is concerned with how variables interact with each other, the distribution of the
data, and whether there are outliers. This step often goes by the abbreviation EDA, for
Exploratory Data Analysis.
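As a small illustration with invented data, a first pass at EDA in pandas could look at distributions, relationships between variables, and possible outliers:

import pandas as pd

df = pd.DataFrame({"income": [1200, 1500, 1100, 9800, 1400],
                   "age": [23, 31, 27, 45, 38]})

print(df.describe())                                    # distribution summaries per variable
print(df.corr())                                        # how the variables relate to each other
print(df[df["income"] > df["income"].quantile(0.95)])   # a crude check for outliers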

1.3.5 Data modeling or model building


In this phase we use models, domain knowledge, and the insights about the data gained in the
previous steps. We select a technique from the fields of statistics, machine learning, operations
research, and so on. Building a model is an iterative process that involves selecting the variables
for the model, executing the model, and model diagnostics.
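One possible sketch of this step (using scikit-learn and synthetic data, not a method prescribed by the text): select the input variables, execute the model, and run a simple diagnostic on held-out data.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic data: a target driven by two selected variables plus noise.
X = np.random.rand(100, 2)
y = 3 * X[:, 0] + 0.5 * X[:, 1] + np.random.normal(0, 0.1, 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)             # execute the model
print(mean_squared_error(y_test, model.predict(X_test)))     # model diagnostics on held-out data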

1.3.6 Presentation and automation


Finally, we present the results to the business. These results can take many forms, ranging
from presentations to research reports. The insights we gained may also be used in another
project or enable an operational process to use the outcome of our model.
AN ITERATIVE PROCESS The previous description of the data science process may give
the impression that we walk through it in a linear way. In reality it is an iterative process:
as we work through the steps we gain incremental insights, which can lead us to revisit earlier steps.

1.4 The big data ecosystem and data science


Currently many big data tools and frameworks exist, and it’s easy to get lost because new
technologies appear rapidly. It’s much easier once you realize that the big data ecosystem can
be grouped into technologies that have similar goals and functionalities, which we’ll discuss in
this section. Data scientists use many different technologies, but not all of them; we’ll dedicate
a separate chapter to the most important data science technology classes. The mind map in
figure 1.6 shows the components of the big data ecosystem and where the different
technologies belong.
Let’s look at the different groups of tools in this diagram and see what each does. We’ll
start with distributed file systems.

1.4.1 Distributed file systems


A distributed file system is similar to a normal file system, except that it runs on multiple
servers at once. Distributed file systems have significant advantages:

They can store files larger than any one computer disk.

Files get automatically replicated across multiple servers for redundancy or parallel
operations while hiding the complexity of doing so from the user.

The system scales easily: you’re no longer bound by the memory or storage restrictions
of a single server.
In the past, scale was increased by moving everything to a server with more memory, storage, and a better
CPU (vertical scaling). Nowadays you can simply add another small server (horizontal scaling). This
principle makes the scaling potential virtually limitless.
The best-known distributed file system at this moment is the Hadoop Distributed File System (HDFS).
Figure 1.6 Big data technologies can be classified into a few main components.
1.4.2 Distributed programming framework
One important aspect of working on a distributed hard disk is that we won't move our data to
our program; rather, we'll move our program to the data. We use a general-purpose
programming language such as C, Python, or Java. The open source community has developed
many frameworks to handle tasks such as restarting failed jobs and tracking results from different
subprocesses, and these frameworks give you a much better experience working with distributed
data and dealing with many of the challenges it carries.
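To make the idea concrete, here is a toy, single-machine sketch of the map-reduce pattern that such frameworks implement at scale. In a real cluster each "map" step would run on the server that already holds its piece of the data.

from collections import Counter
from functools import reduce

documents = [
    "big data needs distributed processing",
    "distributed processing moves the program to the data",
]

# Map step: each document is turned into partial word counts independently.
partial_counts = [Counter(doc.split()) for doc in documents]

# Reduce step: the partial results are merged into one final answer.
total_counts = reduce(lambda a, b: a + b, partial_counts)
print(total_counts.most_common(3))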

1.4.3 Data integration framework


We need to move data from one source to another, and this is where the data integration
frameworks such as Apache Sqoop and Apache Flume excel. The process is similar to an
extract, transform, and load process in a traditional data warehouse.

1.4.4 Machine learning frameworks


Before World War II everything needed to be calculated by hand, which severely limited the
possibilities of data analysis. After World War II computers and scientific computing were
developed. A single computer could do all the counting and calculations and a world of
opportunities opened. One of the biggest issues with the old algorithms is that they don't scale
well. With the amount of data we need to analyze today, this becomes problematic, and
specialized frameworks and libraries are required to deal with this amount of data.

PyBrain for neural networks—Neural networks are learning algorithms that mimic the
human brain in learning mechanics and complexity. Neural networks are often regarded
as advanced and black box.

NLTK or Natural Language Toolkit—As the name suggests, its focus is working with
natural language. It’s an extensive library that comes bundled with a number of text
corpuses to help you model your own data.

Pylearn2—Another machine learning toolbox but a bit less mature than Scikit-learn.

TensorFlow—A Python library for deep learning provided by Google.
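As a hedged illustration of one of these libraries, a tiny TensorFlow/Keras network trained on random data might look like the sketch below; the architecture, layer sizes, and data are arbitrary.

import numpy as np
import tensorflow as tf

# Random toy data: 100 samples with 4 features and a binary label.
X = np.random.rand(100, 4).astype("float32")
y = (X.sum(axis=1) > 2).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)   # the deep learning step, here on toy data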
1.4.5 NoSQL databases
To store huge amounts of data, you require software that’s specialized in managing and
querying this data. Traditionally this has been the playing field of relational databases such as
Oracle SQL, MySQL, Sybase IQ, and others. New types of databases have emerged under the
grouping of NoSQL databases.
Many of the NoSQL databases have implemented a version of SQL. By solving several of the
problems of traditional databases, NoSQL databases allow for a virtually endless growth of
data.

Column databases—Data is stored in columns, which allows algorithms to perform much
faster queries. Newer technologies use cell-wise storage. Table-like structures are still
important.

Document stores—Document stores no longer use tables, but store every observation in a
document. This allows for a much more flexible data scheme.

Streaming data—Data is collected, transformed, and aggregated not in batches but in real
time. Although we’ve categorized it here as a database to help you in tool selection, it’s
more a particular type of problem that drove creation of technologies such as Storm.

Key-value stores—Data isn't stored in a table; rather you assign a key to every value,
such as org.marketing.sales.2015: 20000. This scales well but places almost all the
implementation effort on the developer (a small sketch follows after this list).

SQL on Hadoop—Batch queries on Hadoop are in a SQL-like language that uses the map-
reduce framework in the background.

New SQL—This class combines the scalability of NoSQL databases with the advantages
of relational databases. They all have a SQL interface and a relational data model.

Graph databases—Not every problem is best stored in a table. Particular problems are
more naturally translated into graph theory and stored in graph databases. A classic
example of this is a social network.
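As a conceptual sketch of the key-value idea mentioned above, a plain Python dictionary can stand in for a real key-value store: every value is reachable only through its key.

store = {}
store["org.marketing.sales.2015"] = 20000             # write a value under its key
print(store.get("org.marketing.sales.2015"))          # read it back: 20000
print(store.get("org.marketing.sales.2014", "n/a"))   # anything beyond key lookup is up to the developer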

1.4.6 Scheduling tools
Scheduling tools help you automate repetitive tasks and trigger jobs based on events such as
adding a new file to a folder. These are similar to tools such as CRON on Linux.

1.4.7 Benchmarking tools
This class of tools was developed to optimize your big data installation by providing
standardized profiling suites. A profiling suite is taken from a representative set of big data
jobs. Benchmarking and optimizing the big data infrastructure and configuration aren’t often
jobs for data scientists themselves but for a professional specialized in setting up IT
infrastructure.

1.4.8 System deployment
Setting up a big data infrastructure isn’t an easy task and assisting engineers in deploying new
applications into the big data cluster is where system deployment tools shine.

1.4.9 Service programming
We have no idea of the architecture or technology of everyone keen on using our predictions.
Service tools excel here by exposing big data applications to other applications as a service.
Data scientists sometimes need to expose their models through services. The best-known
example is the REST service; REST stands for representational state transfer.
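As a hedged sketch (assuming the Flask library and a hypothetical trained model object), exposing a model's predictions as a REST service might look like this:

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # input features sent by the calling application
    # score = model.predict([payload["features"]])    # hypothetical trained model
    score = 1                                         # placeholder so the sketch runs on its own
    return jsonify({"prediction": score})

if __name__ == "__main__":
    app.run(port=5000)                                # other applications can now call POST /predict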

1.4.10 Security
We probably need to have fine-grained control over the access to data but don’t want to
manage this on an application-by-application basis. Big data security tools allow you to have
central and fine-grained control over access to the data.