
Introduction to Data Science Unit-3

UNIT – III
TOPICS:
NoSQL movement for handling big data: Distributing data storage and processing with the Hadoop
framework, case study on risk assessment for loan sanctioning, ACID principle of relational
databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL databases, case
study on disease diagnosis and profiling

3.1 Distributing data storage and processing with frameworks


New big data technologies like Hadoop and Spark make it much easier to work with and control
a cluster of computers. Hadoop can scale up to thousands of computers, creating a cluster with
petabytes of storage. This enables businesses to grasp the value of the massive amount of data
available.
3.1.1 Hadoop: a framework for storing and processing large data sets
• Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all
of the following things and more:
➢ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
➢ Fault tolerant—It detects faults and applies automatic recovery.
➢ Scalable—Data and its processing are distributed over clusters of computers (horizontal
scaling).
➢ Portable—Installable on all kinds of hardware and operating systems.
• The core framework is composed of a distributed file system, a resource manager, and a
system to run distributed programs.
• In practice it allows you to work with the distributed file system almost as easily as with the
local file system of your home computer. But in the background, the data can be scattered
among thousands of servers.
3.1.2 THE DIFFERENT COMPONENTS OF HADOOP
At the heart of Hadoop we have:
➢ A distributed file system (HDFS)
➢ A method to execute programs on a massive scale (MapReduce)
➢ A system to manage the cluster resources (YARN)


• On top of that, an ecosystem of applications exists, such as databases like Hive and
HBase and frameworks for machine learning such as Mahout.

Fig. 3.1 A sample from the ecosystem of applications that arose around the Hadoop Core
Framework
3.1.3 How Hadoop achieves Parallelism
• Hadoop uses a programming method called MapReduce to achieve parallelism.
• A MapReduce algorithm splits up the data, processes it in parallel, and then sorts, combines,
and aggregates the results back together.
• But the MapReduce algorithm is not suitable for interactive analysis or iterative programs
because it writes the data to disk between each computational step. This is expensive
when working with large data sets.
Example
• Consider a toy company in which every toy has two colors, and when a client orders a toy
from the web page, the web page puts an order file on Hadoop with the colors of the toy.
Our task is to find out how many color units we need to prepare. We will use a MapReduce-
style algorithm to count the colors.


Fig. 3.2 A simplified example of a MapReduce flow for counting the colors in input texts
This process can be divided into two big phases:
Mapping phase: The documents are split up into key-value pairs. Until we reduce, we can have
many duplicates.
Reduce phase: The different unique occurrences are grouped together, and depending on the
reducing function, a different result can be created. Here we wanted a count per color, so that’s
what the reduce function returns.

Fig. 3.3 An example of a MapReduce flow for counting the colors in input texts
The whole process is described in the following six steps and depicted in figure 3.3.
1. Reading the input files.
2. Passing each line to a mapper job.


3. The mapper job parses the colors (keys) out of the file and outputs a file for each color with
the number of times it has been encountered (value). Or more technically said, it maps a key
(the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one file per key
with the total number of occurrences for each color.
6. The keys are collected in an output file.
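The same flow can be sketched in a few lines of plain Python. This is only a toy, single-machine illustration of the MapReduce idea (the order files and colors are invented), not actual Hadoop code:

from itertools import groupby
from operator import itemgetter

# Toy input: each "order file" lists the colors of one ordered toy.
order_files = [["green", "blue"], ["blue", "red"], ["green", "green"]]

# Map phase: emit a (color, 1) key-value pair for every color encountered.
mapped = [(color, 1) for order in order_files for color in order]

# Shuffle/sort phase: bring identical keys (colors) together.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the occurrences per color.
counts = {color: sum(count for _, count in pairs)
          for color, pairs in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'blue': 2, 'green': 3, 'red': 1}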
3.1.4 Spark: replacing MapReduce for better performance
Data scientists often do interactive analysis and rely on algorithms that are inherently iterative; it
can take a while until an algorithm converges to a solution. Because this is a weak point of the
MapReduce framework, the Spark framework can be used instead to overcome it. Spark
improves the performance on such tasks by an order of magnitude.
What is Spark?
Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn’t handle
the storage of files on the (distributed) file system itself, nor does it handle the resource
management. For this it relies on systems such as the Hadoop File System, YARN, or Apache
Mesos. Hadoop and Spark are thus complementary systems. For testing and development, we
can even run Spark on our local system.
How does Spark solve the problems of MapReduce?
Oversimplifying things a bit for the sake of clarity: Spark creates a kind of shared RAM
memory between the computers of your cluster. This allows the different workers to share
variables (and their state) and thus eliminates the need to write the intermediate results to disk.
More technically and more correctly if you’re into that: Spark uses Resilient Distributed Datasets
(RDD), which are a distributed memory abstraction that lets programmers perform in-memory
computations on large clusters in a fault tolerant way. Because it’s an in-memory system, it
avoids costly disk operations.
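As a rough sketch of what this looks like in practice, the color count from the previous section can be written with Spark's RDD API. This assumes a local PySpark installation and uses an in-memory toy data set instead of order files on HDFS:

from pyspark import SparkContext

sc = SparkContext("local[*]", "color-count")

# Toy data: one string of colors per order file.
orders = sc.parallelize(["green blue", "blue red", "green green"])

counts = (orders
          .flatMap(lambda line: line.split())   # map: emit every color
          .map(lambda color: (color, 1))        # turn colors into key-value pairs
          .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per color

print(counts.collect())  # e.g. [('green', 3), ('blue', 2), ('red', 1)]
sc.stop()

Unlike the MapReduce version, the intermediate results stay in memory as RDDs instead of being written to disk between steps.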
The different components of the Spark ecosystem
Spark core provides a NoSQL environment well suited for interactive, exploratory analysis.
Spark can be run in batch and interactive mode and supports Python.
Spark has four other large components, as listed below and depicted in figure 3.4.
1. Spark streaming is a tool for real-time analysis.


2. Spark SQL provides a SQL interface to work with Spark.


3. MLlib is a tool for machine learning inside the Spark framework.
4. GraphX is a graph database for Spark.

Fig. 3.4 The Spark framework when used in combination with the Hadoop framework
3.2 Case study: Assessing risk when loaning money
We require the following:
1. The Horton Sandbox on a virtual machine. VirtualBox is a virtualization tool that allows us to
run another operating system inside our own operating system.
2. Python libraries: Pandas and pywebhdfs. They don't need to be installed in our local
virtual environment this time around; we need them directly on the Horton Sandbox.
Therefore, we need to fire up the Horton Sandbox (on VirtualBox, for instance) and make
a few preparations. There are several things we still need to do in the Sandbox command
line for this all to work, so connect to the command line. We can do this using a program
like PuTTY. PuTTY offers a command-line interface to servers and can be downloaded
freely at http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html. The
PuTTY login configuration is shown in figure 3.5.


Fig. 3.5 Connecting to Horton Sandbox using PuTTY


Once connected, issue the following commands:
➢ yum -y install python-pip—This installs pip, a Python package manager.
➢ pip install git+https://github.com/DavyCielen/pywebhdfs.git --upgrade—At the time of
writing there was a problem with the pywebhdfs library and we fixed that in this fork.
Hopefully this won't be required anymore by the time you read this; the problem has been
signaled and should be resolved by the maintainers of this package.
➢ pip install pandas—To install Pandas. This usually takes a while because of the dependencies.
An .ipynb file is available for you to open in Jupyter or (the older) IPython and follow along
with the code. Setup instructions for the Horton Sandbox are repeated there; make sure to run
the code directly on the Horton Sandbox.
Now, with the preparatory business out of the way, let's look at what we'll need to do. In this
exercise, we'll go through several more of the data science process steps:
Step 1: The research goal. This consists of two parts:
• Providing our manager with a dashboard
• Preparing data for other people to create their own dashboards
Step 2: Data retrieval
• Downloading the data from the lending club website
• Putting the data on the Hadoop File System of the Horton Sandbox (see the pywebhdfs sketch after this step list)
Step 3: Data preparation
• Transforming this data with Spark


• Storing the prepared data in Hive


Steps 4 & 6: Exploration and report creation
• Visualizing the data with Qlik Sense
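As an illustration of step 2, the sketch below uploads a downloaded lending club file to the Sandbox's HDFS with pywebhdfs. The host name, port, user name, directory, and file name are assumptions made for this example (the Sandbox's WebHDFS service typically listens on port 50070), so adapt them to your own setup:

from pywebhdfs.webhdfs import PyWebHdfsClient

# Assumed connection details for the Horton Sandbox; adapt as needed.
hdfs = PyWebHdfsClient(host='sandbox', port='50070', user_name='root')

# Create a working directory and upload the already downloaded CSV
# (both names are hypothetical).
hdfs.make_dir('loan_data')
with open('LoanStats.csv', 'rb') as f:
    hdfs.create_file('loan_data/LoanStats.csv', f.read())

print(hdfs.list_dir('loan_data'))  # verify the upload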


Part-2
ACID principle of relational databases, CAP theorem, BASE principle of NoSQL databases, types
of NoSQL databases, case study on disease diagnosis and profiling

3.3 Introduction
Traditional databases reside on a single computer or server. This used to be fine as long as the
data didn't outgrow the server, but that hasn't been the case for many companies for a long time
now. With the growth of the internet, companies such as Google and Amazon felt they were held
back by these single-node databases and looked for alternatives. Numerous companies use
single-node NoSQL databases such as MongoDB because they want the flexible schema or the
ability to hierarchically aggregate data. Here are several early examples:
➢ Google’s first NoSQL solution was Google BigTable, which marked the start of the
columnar databases.
➢ Amazon came up with Dynamo, a key-value store.
➢ Two more database types emerged in the quest for partitioning: the document store and
the graph database.
Note that, although size was an important factor, these databases didn’t originate solely from
the need to handle larger volumes of data. Every V of big data has influence (volume, variety,
velocity, and sometimes veracity).
Graph databases, for instance, can handle network data. Consider an example of dinner
preparation for which ingredients and recipes can be a part of a network. But recipes and
ingredients could also be stored in a relational database or a document store. Herein lies the
strength of NoSQL: it has the ability to look at a problem from a different angle, shaping the
data structure to the use case. As a data scientist, our job is to find the best answer to any
problem. Although sometimes this is still easier to attain using RDBMS, often a particular
NoSQL database offers a better approach.
Are relational databases doomed to disappear in companies with big data because of the need
for partitioning?
No, NewSQL platforms are the RDBMS answer to the need for cluster setup. NewSQL
databases follow the relational model but are capable of being divided into a distributed cluster
like NoSQL databases. It’s not the end of relational databases and certainly not the end of SQL,


as platforms like Hive translate SQL into MapReduce jobs for Hadoop. Besides, not every
company needs big data; many do fine with small databases and the traditional relational
databases are perfect for that.

Fig. 3.6 NoSQL and NewSQL databases


3.4 Introduction to NoSQL
The goal of NoSQL databases isn’t only to offer a way to partition databases successfully
over multiple nodes, but also to present fundamentally different ways to model the data at hand
to fit its structure to its use case and not to how a relational database requires it to be modeled.
To understand NoSQL, first we should look at the core ACID principles of single-server
relational databases and see how NoSQL databases rewrite them into BASE principles so they’ll
work far better in a distributed fashion.
3.5 ACID: the core principle of relational databases
The main aspects of a traditional relational database can be summarized by the concept ACID.


Atomicity—The “all or nothing” principle. If a record is put into a database, it’s put in
completely or not at all. If, for instance, a power failure occurs in the middle of a database write
action, you wouldn’t end up with half a record; it wouldn’t be there at all.
Consistency—This important principle maintains the integrity of the data. No entry that makes it
into the database will ever be in conflict with predefined rules, such as lacking a required field or
a field being numeric instead of text. The database should be consistent before and after any
operation; consistency refers to the correctness of the database.
Isolation—When something is changed in the database, nothing else can happen to the exact
same data at exactly the same moment. Instead, the actions happen in serial with other changes.
Isolation is a scale going from low isolation to high isolation. On this scale, traditional databases
are on the “high isolation” end. An example of low isolation would be Google Docs: Multiple
people can write to a document at the exact same time and see each other’s changes happening
instantly. A traditional Word document has high isolation; it’s locked for editing by the first user
to open it. The second person opening the document can view its last saved version but is unable
to see unsaved changes or edit the document without first saving it as a copy. So once someone
has it opened, the most up-to-date version is completely isolated from anyone but the editor who
locked the document.
Durability—If data has entered the database, it should survive permanently. Physical damage to
the hard discs will destroy records, but power outages and software crashes should not.

Fig. 3.7 ACID properties
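Atomicity and consistency can be illustrated with Python's built-in sqlite3 module: the with block commits the whole transaction or rolls it back entirely if any statement fails, so a violated NOT NULL rule leaves no half-written data behind. This is a toy, single-node sketch, not part of the original text:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT NOT NULL, qty INTEGER NOT NULL)")

try:
    # Atomicity: both inserts succeed together or neither is kept.
    with conn:
        conn.execute("INSERT INTO orders VALUES ('octopus table', 1)")
        conn.execute("INSERT INTO orders VALUES ('coffee mug', NULL)")  # breaks NOT NULL
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

# The failed transaction left no half-written record behind.
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0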


ACID applies to all relational databases and certain NoSQL databases, such as the graph
database Neo4j. For most other NoSQL databases another principle applies: BASE.
3.6 CAP Theorem: the problem with DBs on many nodes
• The CAP theorem describes the main problem with distributing databases across multiple
nodes and how ACID and BASE databases approach it.
• Once a database gets spread out over different servers, it’s difficult to follow the ACID
principle because of the consistency ACID promises; the CAP Theorem points out why this
becomes problematic.
• The CAP Theorem states that a database can be any two of the following things but never all
three:
➢ Partition tolerant—The database can handle a network partition or network failure.
➢ Available—As long as the node you’re connecting to is up and running and you can
connect to it, the node will respond, even if the connection between the different database
nodes is lost.
➢ Consistent—No matter which node you connect to, you’ll always see the exact same data.
For a single-node database it’s easy to see how it’s always available and consistent:
➢ Available—As long as the node is up, it’s available. That’s all the CAP availability
promises.
➢ Consistent—There’s no second node, so nothing can be inconsistent.
Things get interesting once the database gets partitioned. Then we need to make a choice
between availability and consistency, as shown in figure 3.8.

Fig. 3.8 CAP Theorem: when partitioning the database, we need to choose between availability
and consistency.


• Let’s take the example of an online shop with a server in Europe and a server in the United
States, with a single distribution center. A German named Fritz and an American named
Freddy are shopping at the same time on that same online shop. They see an item and only
one is still in stock: a bronze, octopus-shaped coffee table. Disaster strikes, and
communication between the two local servers is temporarily down. Now the owner of the
shop has two options:
➢ Availability—allow the servers to keep on serving customers, and sort out everything
afterward.
➢ Consistency—put all sales on hold until communication is reestablished.
• In the first case, Fritz and Freddy will both buy the octopus coffee table, because the last-
known stock number for both nodes is “one” and both nodes are allowed to sell it, as shown
in figure 3.9.

Fig. 3.9 CAP Theorem: if nodes get disconnected, you can choose to remain available, but
the data could become inconsistent.
• If the coffee table is hard to come by, you’ll have to inform either Fritz or Freddy that he
won’t receive his table on the promised delivery date or, even worse, he will never receive
it. As a good businessperson, you might compensate one of them with a discount coupon for
a later purchase, and everything might be okay after all.


Fig. 3.10 CAP Theorem: if nodes get disconnected, you can choose to remain consistent by
stopping access to the databases until connections are restored
• The second option (figure 3.10) involves putting the incoming requests on hold temporarily.
This might be fair to both Fritz and Freddy if after five minutes the web shop is open for
business again, but then you might lose both sales and probably many more.
• Web shops tend to choose availability over consistency, but it’s not the optimal choice in all
cases.
• Consider a workshop for which the maximum number of participants is 100. If node
communication fails during the online registration process, every node keeps on accepting
registrations, and you might end up with more than 100 registrations by the time
communication is reestablished. In such a case it might be wiser to go for consistency and
turn off the nodes temporarily.
3.7 The BASE principles of NoSQL databases
• RDBMS follows the ACID principles; NoSQL databases that don’t follow ACID, such as the
document stores and key-value stores, follow BASE.
• BASE is a set of much softer database promises:
➢ Basically available—Availability is guaranteed in the CAP sense. Taking the web shop
example, if a node is up and running, we can keep on shopping. Depending on how
things are set up, nodes can take over from other nodes. Elasticsearch, for example, is a


NoSQL document–type search engine that divides and replicates its data in such a way
that node failure doesn’t necessarily mean service failure, via the process of sharding.
Each shard can be seen as an individual database server instance, but is also capable of
communicating with the other shards to divide the workload as efficiently as possible
(figure 3.11). Several shards can be present on a single node. If each shard has a replica
on another node, node failure is easily remedied by re-dividing the work among the
remaining nodes.

Fig. 3.11 Sharding: each shard can function as a self-sufficient database, but they also work
together as a whole. The example represents two nodes, each containing four shards: two main
shards and two replicas. Failure of one node is backed up by the other.
➢ Soft state—The state of a system might change over time. This corresponds to the
eventual consistency principle: the system might have to change to make the data
consistent again. In one node the data might say “A” and in the other it might say “B”
because it was modified. Later, at conflict resolution when the network is back online,
it’s possible the “A” in the first node is replaced by “B.” Even though no one did
anything to explicitly change “A” into “B,” it will take on this value as it becomes
consistent with the other node.
➢ Eventual consistency—The database will become consistent over time. In the web
shop example if the table is sold twice it results in data inconsistency. Once the
connection between the individual nodes is reestablished, they’ll communicate and


decide how to resolve it. This conflict can be resolved on a first-come, first-served
basis or by preferring the customer who would incur the lowest transport cost.
Databases come with default behavior, but to make an actual business decision, this
behavior can be overwritten. Even if the connection is up and running, latencies might
cause nodes to become inconsistent. Often, products are kept in an online shopping
basket, but putting an item in a basket doesn’t lock it for other users. If Fritz first beats
the checkout button, there'll be a problem once Freddy goes to check out. It can then
easily be explained to the customer that he was too late.
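The first-come, first-served resolution mentioned above can be sketched in a few lines of Python. The order records and timestamps are invented; a real database applies such a rule inside its replication and conflict-resolution logic:

from datetime import datetime

# Two nodes each accepted an order for the last coffee table while the
# connection between them was down (timestamps are invented).
conflicting_orders = [
    {"node": "EU", "customer": "Fritz",  "at": datetime(2015, 5, 1, 10, 0, 3)},
    {"node": "US", "customer": "Freddy", "at": datetime(2015, 5, 1, 10, 0, 1)},
]

# First-come, first-served: the earliest order wins once the nodes reconnect.
winner = min(conflicting_orders, key=lambda order: order["at"])
print(winner["customer"], "gets the table; the other customer gets a coupon.")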
ACID versus BASE
The BASE principles are somewhat contrived to fit acid and base from chemistry: an acid is a
fluid with a low pH value. A base is the opposite and has a high pH value. Figure 3.12 shows a
mnemonic for those familiar with the chemistry equivalents of acid and base.

Fig. 3.12 ACID versus BASE: traditional relational databases versus most NoSQL databases.
The names are derived from the chemistry concept of the pH scale. A pH value below 7 is acidic;
higher than 7 is a base. On this scale, your average surface water fluctuates between 6.5 and 8.5.
3.8 Types of NoSQL databases
There are four big NoSQL types:
• key-value store
• document store
• column-oriented database
• graph database.


• Each type solves a problem that can’t be solved with relational databases. Actual
implementations are often combinations of these. OrientDB, for example, is a multi-model
database, combining NoSQL types. OrientDB is a graph database where each node is a
document.
• Relational databases generally strive toward normalization: making sure every piece of data
is stored only once. Normalization marks their structural setup.
• If, for instance, we want to store data about a person and their hobbies, we can do so with
two tables: one about the person and one about their hobbies. An additional table is
necessary to link hobbies to persons because of their many-to-many relationship: a person
can have multiple hobbies and a hobby can have many persons practicing it.
• A full-scale relational database can be made up of many entities and linking tables.

Fig. 3.13 Relational databases strive toward normalization (making sure every piece of data is
stored only once).
Each table has unique identifiers (primary keys) that are used to model the relationship between
the entities (tables), hence the term relational.
Column-Oriented Database
• Traditional relational databases are row-oriented, with each row having a row id and each
field within the row stored together in a table.


• For example, no extra data about hobbies is stored and we have only a single table to
describe people, as shown in figure 3.14.
• In this scenario there is a slight denormalization because hobbies could be repeated.

Fig. 3.14 Row-oriented database layout. Every entity (person) is represented by a single row,
spread over multiple columns.
Suppose we only want a list of birthdays in September. The database will scan the table
from top to bottom and left to right, as shown in figure 3.15, and then return the list of
birthdays.

Fig. 3.15 Row-oriented lookup: from top to bottom and for every entry, all columns are taken
into memory
• Indexing the data on certain columns can significantly improve lookup speed, but indexing
every column brings extra overhead and the database is still scanning all the columns.
• Column databases store each column separately, allowing for quicker scans when only a
small number of columns is involved.


Fig. 3.16 Column-oriented databases store each column separately with the related row numbers.
Every entity (person) is divided over multiple tables.
• This layout looks very similar to a row-oriented database with an index on every column.
• A database index is a data structure that allows for quick lookups on data at the cost of
storage space and additional writes (index update).
• An index maps the row number to the data, whereas a column database maps the data to the
row numbers; in that way counting becomes quicker, so it’s easy to see how many people
like archery, for instance.
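A toy Python sketch of the difference: a row store keeps whole records together, while a column store keeps, per column, a mapping from each value to the row numbers that contain it, so counting how many people like archery becomes a single lookup. The data is invented for illustration:

# Row-oriented layout: every row holds all columns of one person.
rows = [
    {"name": "Ann",  "birthday": "09-12", "hobby": "archery"},
    {"name": "Bob",  "birthday": "03-04", "hobby": "chess"},
    {"name": "Carl", "birthday": "09-30", "hobby": "archery"},
]
# Row-oriented lookup: scan every row and pull all its columns into memory.
september = [r["name"] for r in rows if r["birthday"].startswith("09")]

# Column-oriented layout: per column, map each value to the row numbers.
hobby_column = {"archery": [0, 2], "chess": [1]}
archery_fans = len(hobby_column["archery"])  # counting is a single lookup

print(september, archery_fans)  # ['Ann', 'Carl'] 2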
Row-oriented database vs column-oriented database
• In a column-oriented database it’s easy to add another column because none of the existing
columns are affected by it. But adding an entire record requires adapting all tables. This
makes the row-oriented database preferable over the column-oriented database for online
transaction processing (OLTP) where adding or changing records is done constantly.
• The column-oriented database is advantageous when performing analytics and reporting:
summing values and counting entries. A row-oriented database is often the operational
database of choice for actual transactions (such as sales).
• Overnight batch jobs bring the column-oriented database up to date, supporting lightning-
speed lookups and aggregations using MapReduce algorithms for reports. Examples of
column-family stores are Apache HBase, Facebook’s Cassandra, Hypertable, and Google
BigTable.
Key-Value Stores
• Key-value stores are the least complex of the NoSQL databases.
• They are collections of key-value pairs, as shown in figure 3.17.


• This simplicity makes them the most scalable of the NoSQL database types, capable of
storing huge amounts of data.

Fig. 3.17 Key-value stores store everything as a key and a value.


• The value in a key-value store can be anything: a string, a number, but also an entire new set
of key-value pairs encapsulated in an object.
• Examples of key-value stores are Redis, Voldemort, Riak, and Amazon’s Dynamo.
• Figure 3.18 shows a slightly more complex key-value structure.

Figure 3.18 Key-value nested structure
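A structure like the one in figure 3.18 can be sketched with the redis-py client; this assumes a Redis server running locally and a reasonably recent client version, and the keys and values are invented:

import redis

# Assumes a Redis server on localhost with the default port.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# A plain key-value pair: the value is just a string.
r.set("greeting", "hello world")

# A nested structure: the value of "user:1" is itself a set of
# key-value pairs, stored here as a Redis hash.
r.hset("user:1", mapping={"name": "Fritz", "country": "Germany"})

print(r.get("greeting"))    # hello world
print(r.hgetall("user:1"))  # {'name': 'Fritz', 'country': 'Germany'}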


DOCUMENT STORES
• A document store does assume a certain document structure that can be specified with a
schema.
• Document stores appear the most natural among the NoSQL database types because they’re
designed to store everyday documents.


• They allow for complex querying and calculations on this often already aggregated form of
data.
• In a relational database, everything should be stored only once and connected via foreign keys;
that is the point of normalization.
• Document stores care little about normalization as long as the data is in a structure that
makes sense.
• A relational data model doesn’t always fit well with certain business cases. Newspapers or
magazines, for example, contain articles. To store these in a relational database, you need to
chop them up first: the article text goes in one table, the author and all the information about
the author in another, and comments on the article when published on a website go in yet
another.
• In a document store, the article can be stored as a single entity.
• Examples of document stores are MongoDB and CouchDB.

Figure 3.19 Relational Database Approach


Figure 3.20 Document Store Approach
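A minimal pymongo sketch of the document store approach in figure 3.20: the article text, its author, and its comments live inside a single document. A local MongoDB instance is assumed, and the database, collection, and field names are invented:

from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; names are invented.
client = MongoClient("localhost", 27017)
articles = client["newspaper"]["articles"]

# The whole article is one document: text, author, and comments together,
# instead of being chopped up over several normalized tables.
articles.insert_one({
    "title": "Octopus coffee tables are back",
    "text": "Full article text goes here.",
    "author": {"name": "Freddy", "email": "freddy@example.com"},
    "comments": [{"user": "Fritz", "text": "I ordered mine first!"}],
})

print(articles.find_one({"author.name": "Freddy"})["title"])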


GRAPH DATABASES
• This is the last big NoSQL database type and the most complex one.
• It is mainly used to store relations between entities in an efficient manner.
• When the data is highly interconnected, such as for social networks, scientific paper
citations, or capital asset clusters, graph databases are used to store this information.


• Graph or network data has two main components:


➢ Node—The entities themselves. In a social network this could be people.
➢ Edge—The relationship between two entities. This relationship is represented by a
line and has its own properties. An edge can have a direction, for example, if the
arrow indicates who is whose boss.
• Graphs can become incredibly complex given enough relation and entity types.
• Figure 3.21 shows that complexity with only a limited number of entities.
• Graph databases like Neo4j also claim to uphold ACID, whereas document stores and key-
value stores adhere to BASE.

Fig. 3.21 Graph data example with four entity types (person, hobby, company, and furniture) and
their relations without extra edge or node information
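The kind of data in figure 3.21 can be sketched in Python with the networkx library (an in-memory graph toolkit, not a graph database such as Neo4j); the entities and relation labels below are invented:

import networkx as nx

# Nodes are entities, edges are typed relations between them.
g = nx.DiGraph()
g.add_node("Fritz", kind="person")
g.add_node("Freddy", kind="person")
g.add_node("archery", kind="hobby")
g.add_node("octopus table", kind="furniture")

g.add_edge("Fritz", "archery", relation="likes")
g.add_edge("Fritz", "Freddy", relation="friend of")
g.add_edge("Freddy", "octopus table", relation="bought")

# Traverse the outgoing relations of one entity.
for _, target, attrs in g.out_edges("Fritz", data=True):
    print("Fritz", attrs["relation"], target)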


Fig. 3.22 Top 15 databases ranked by popularity


3.9 Case study: What disease is that?
It has happened to many of us: you have sudden medical symptoms and the first thing
you do is Google what disease the symptoms might indicate; then you decide whether it’s worth
seeing a doctor. A web search engine is okay for this, but a more dedicated database would be
better. Databases like this exist and are fairly advanced. But they’re built upon well-protected
data and not all of it is accessible by the public. Also, although big pharmaceutical companies
and advanced hospitals have access to these virtual doctors, many general practitioners are still
stuck with only their books. This information and resource asymmetry is not only sad and
dangerous, it needn’t be there at all. If a simple, disease-specific search engine were used by all
general practitioners in the world, many medical mistakes could be avoided.
In this case study, we'll learn how to build such a search engine, although using only
a fraction of the medical data that is freely accessible. To tackle the problem, we’ll use a modern
NoSQL database called Elasticsearch to store the data, and the data science process to work with
the data and turn it into a resource that’s fast and easy to search. Here’s how we’ll apply the
process:
1. Setting the research goal.
➢ The primary goal is to set up a disease search engine that would help general practitioners
in diagnosing diseases.


➢ The secondary goal is to profile a disease: What keywords distinguish it from other
diseases?
This secondary goal is useful for educational purposes or as input to more advanced uses
such as detecting spreading epidemics by tapping into social media.
2. Data collection—We’ll get the data from Wikipedia. There are more sources out there, but
for demonstration purposes a single one will do.
3. Data preparation—The Wikipedia data might not be perfect in its current format. We’ll
apply a few techniques to change this.
4. Data exploration—Our use case is special in that step 4 of the data science process is also
the desired end result: we want our data to become easy to explore.
5. Data modeling—No real data modeling is applied. Document term matrices that are used for
search are often the starting point for advanced topic modeling.
6. Presenting results—To make the data searchable, we would need a user interface such as a
website where people can query and retrieve disease information. For our secondary goal,
profiling a disease category by its keywords, we'll reach this stage of the data science
process because we'll present it as a word cloud, such as the one in figure 3.23.

Fig. 3.23 A sample word cloud on non-weighted diabetes keywords
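A rough sketch of the core of such a search engine with the elasticsearch Python client: index one disease document and run a full-text query against its text. A local Elasticsearch node is assumed; the index name, field names, and example text are invented, and the exact call signatures differ between client versions:

from elasticsearch import Elasticsearch

# Assumes an Elasticsearch node on localhost; index and fields are invented.
es = Elasticsearch("http://localhost:9200")

# Index a single disease document (in practice, one per Wikipedia article).
es.index(index="diseases", id=1, body={
    "name": "Diabetes mellitus",
    "text": "Symptoms include frequent urination and increased thirst.",
})

# Full-text search on the symptom description.
hits = es.search(index="diseases", body={
    "query": {"match": {"text": "thirst urination"}},
})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])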
