
Introduction to Data Science Unit-3

UNIT – III
TOPICS:
NoSQL movement for handling big data: Distributing data storage and processing with the Hadoop
framework, case study on risk assessment for loan sanctioning, ACID principle of relational
databases, CAP theorem, BASE principle of NoSQL databases, types of NoSQL databases, case
study on disease diagnosis and profiling

3.1 Distributing data storage and processing with frameworks


New big data technologies like Hadoop and Spark make it much easier to work with and control
a cluster of computers. Hadoop can scale up to thousands of computers, creating a cluster with
petabytes of storage. This enables businesses to grasp the value of the massive amount of data
available.
3.1.1 Hadoop: a framework for storing and processing large data sets
• Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims to be all
of the following things and more:
➢ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
➢ Fault tolerant—It detects faults and applies automatic recovery.
➢ Scalable—Data and its processing are distributed over clusters of computers (horizontal
scaling).
➢ Portable—Installable on all kinds of hardware and operating systems.
• The core framework is composed of a distributed file system, a resource manager, and a
system to run distributed programs.
• In practice it allows you to work with the distributed file system almost as easily as with the
local file system of your home computer. But in the background, the data can be scattered
among thousands of servers.
3.1.2 THE DIFFERENT COMPONENTS OF HADOOP
At the heart of Hadoop we have:
➢ A distributed file system (HDFS)
➢ A method to execute programs on a massive scale (MapReduce)
➢ A system to manage the cluster resources (YARN)


• On top of that, an ecosystem of applications exists, such as databases like Hive and
HBase and frameworks for machine learning such as Mahout.

Fig. 3.1 A sample from the ecosystem of applications that arose around the Hadoop Core
Framework
3.1.3 How Hadoop achieves Parallelism
• Hadoop uses a programming method called MapReduce to achieve parallelism.
• A MapReduce algorithm splits up the data, processes it in parallel, and then sorts, combines,
and aggregates the results back together.
• But the MapReduce algorithm is not suitable for interactive analysis or iterative programs
because it writes the data to disk between each computational step. This is expensive
when working with large data sets.
Example
• Consider a toy company in which every toy has two colors, and when a client orders a toy
from the web page, the web page puts an order file on Hadoop with the colors of the toy.
Our task is to find out how many color units we need to prepare. We will use a MapReduce-
style algorithm to count the colors.


Fig. 3.2 A simplified example of a MapReduce flow for counting the colors in input texts
This process can be divided into two big phases:
Mapping phase: The documents are split up into key-value pairs. Until we reduce, we can have
many duplicates.
Reduce phase: The different unique occurrences are grouped together, and depending on the
reducing function, a different result can be created. Here we wanted a count per color, so that’s
what the reduce function returns.

Fig. 3.3 An example of a MapReduce flow for counting the colors in input texts
The whole process is described in the following six steps and depicted in figure 3.3.
1. Reading the input files.
2. Passing each line to a mapper job.


3. The mapper job parses the colors (keys) out of the file and outputs a file for each color with
the number of times it has been encountered (value). Or more technically said, it maps a key
(the color) to a value (the number of occurrences).
4. The keys get shuffled and sorted to facilitate the aggregation.
5. The reduce phase sums the number of occurrences per color and outputs one file per key
with the total number of occurrences for each color.
6. The keys are collected in an output file.
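The same flow can be sketched in a few lines of plain Python. This is only a toy, single-machine illustration of the MapReduce idea (the order files and colors are invented), not actual Hadoop code:

from itertools import groupby
from operator import itemgetter

# Toy input: each "order file" lists the colors of one ordered toy.
order_files = [["green", "blue"], ["blue", "red"], ["green", "green"]]

# Map phase: emit a (color, 1) key-value pair for every color encountered.
mapped = [(color, 1) for order in order_files for color in order]

# Shuffle/sort phase: bring identical keys (colors) together.
mapped.sort(key=itemgetter(0))

# Reduce phase: sum the occurrences per color.
counts = {color: sum(count for _, count in pairs)
          for color, pairs in groupby(mapped, key=itemgetter(0))}

print(counts)  # {'blue': 2, 'green': 3, 'red': 1}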
3.1.4 Spark: replacing MapReduce for better performance
Data scientists often do interactive analysis and rely on algorithms that are inherently iterative; it
can take a while until an algorithm converges to a solution. Because this is a weak point of the
MapReduce framework, the Spark framework can be used instead to overcome it. Spark
improves the performance on such tasks by an order of magnitude.
What is Spark?
Spark is a cluster computing framework similar to MapReduce. Spark, however, doesn’t handle
the storage of files on the (distributed) file system itself, nor does it handle the resource
management. For this it relies on systems such as the Hadoop File System, YARN, or Apache
Mesos. Hadoop and Spark are thus complementary systems. For testing and development, we
can even run Spark on our local system.
How does Spark solve the problems of MapReduce?
Oversimplifying things a bit for the sake of clarity: Spark creates a kind of shared RAM
memory between the computers of your cluster. This allows the different workers to share
variables (and their state) and thus eliminates the need to write the intermediate results to disk.
More technically and more correctly if you’re into that: Spark uses Resilient Distributed Datasets
(RDD), which are a distributed memory abstraction that lets programmers perform in-memory
computations on large clusters in a fault tolerant way. Because it’s an in-memory system, it
avoids costly disk operations.
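As a rough sketch of what this looks like in practice, the color count from the previous section can be written with Spark's RDD API. This assumes a local PySpark installation and uses an in-memory toy data set instead of order files on HDFS:

from pyspark import SparkContext

sc = SparkContext("local[*]", "color-count")

# Toy data: one string of colors per order file.
orders = sc.parallelize(["green blue", "blue red", "green green"])

counts = (orders
          .flatMap(lambda line: line.split())   # map: emit every color
          .map(lambda color: (color, 1))        # turn colors into key-value pairs
          .reduceByKey(lambda a, b: a + b))     # reduce: sum the counts per color

print(counts.collect())  # e.g. [('green', 3), ('blue', 2), ('red', 1)]
sc.stop()

Unlike the MapReduce version, the intermediate results stay in memory as RDDs instead of being written to disk between steps.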
The different components of the Spark ecosystem
Spark core provides a NoSQL environment well suited for interactive, exploratory analysis.
Spark can be run in batch and interactive mode and supports Python.
Spark has four other large components, as listed below and depicted in figure 3.4.
1. Spark streaming is a tool for real-time analysis.


2. Spark SQL provides a SQL interface to work with Spark.


3. MLlib is a tool for machine learning inside the Spark framework.
4. GraphX is a graph database for Spark.

Fig. 3.4 The Spark framework when used in combination with the Hadoop framework
3.2 Case study: Assessing risk when loaning money
We require the following:
1. The Horton Sandbox on a virtual machine. VirtualBox is a virtualization tool that allows us to
run another operating system inside our own operating system.
2. Python libraries: Pandas and pywebhdfs. They don't need to be installed in our local
virtual environment this time around; we need them directly on the Horton Sandbox.
Therefore, we need to fire up the Horton Sandbox (on VirtualBox, for instance) and make
a few preparations. There are several things we still need to do in the Sandbox command
line for this all to work, so connect to the command line. We can do this using a program
like PuTTY. PuTTY offers a command-line interface to servers and can be downloaded
freely at http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html. The
PuTTY login configuration is shown in figure 3.5.


Fig. 3.5 Connecting to Horton Sandbox using PuTTY


Once connected, issue the following commands:
➢ yum -y install python-pip—This installs pip, a Python package manager.
➢ pip install git+https://github.com/DavyCielen/pywebhdfs.git --upgrade—At the time of
writing there was a problem with the pywebhdfs library and we fixed that in this fork.
Hopefully this won't be required anymore by the time you read this; the problem has been
signaled and should be resolved by the maintainers of this package.
➢ pip install pandas—To install Pandas. This usually takes a while because of the dependencies.
An .ipynb file is available for you to open in Jupyter or (the older) IPython and follow along
with the code. Setup instructions for the Horton Sandbox are repeated there; make sure to run
the code directly on the Horton Sandbox.
Now, with the preparatory business out of the way, let's look at what we'll need to do. In this
exercise, we'll go through several more of the data science process steps:
Step 1: The research goal. This consists of two parts:
• Providing our manager with a dashboard
• Preparing data for other people to create their own dashboards
Step 2: Data retrieval
• Downloading the data from the lending club website
• Putting the data on the Hadoop File System of the Horton Sandbox (see the pywebhdfs sketch after this step list)
Step 3: Data preparation
• Transforming this data with Spark


• Storing the prepared data in Hive


Steps 4 & 6: Exploration and report creation
• Visualizing the data with Qlik Sense
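As an illustration of step 2, the sketch below uploads a downloaded lending club file to the Sandbox's HDFS with pywebhdfs. The host name, port, user name, directory, and file name are assumptions made for this example (the Sandbox's WebHDFS service typically listens on port 50070), so adapt them to your own setup:

from pywebhdfs.webhdfs import PyWebHdfsClient

# Assumed connection details for the Horton Sandbox; adapt as needed.
hdfs = PyWebHdfsClient(host='sandbox', port='50070', user_name='root')

# Create a working directory and upload the already downloaded CSV
# (both names are hypothetical).
hdfs.make_dir('loan_data')
with open('LoanStats.csv', 'rb') as f:
    hdfs.create_file('loan_data/LoanStats.csv', f.read())

print(hdfs.list_dir('loan_data'))  # verify the upload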


Part-2
ACID principle of relational databases, CAP theorem, BASE principle of NoSQL databases, types
of NoSQL databases, case study on disease diagnosis and profiling

3.3 Introduction
Traditional databases reside on a single computer or server. This used to be fine as long as the
data didn't outgrow the server, but that hasn't been the case for many companies for a long time
now. With the growth of the internet, companies such as Google and Amazon felt they were held
back by these single-node databases and looked for alternatives. Numerous companies use
single-node NoSQL databases such as MongoDB because they want the flexible schema or the
ability to hierarchically aggregate data. Here are several early examples:
➢ Google’s first NoSQL solution was Google BigTable, which marked the start of the
columnar databases.
➢ Amazon came up with Dynamo, a key-value store.
➢ Two more database types emerged in the quest for partitioning: the document store and
the graph database.
Note that, although size was an important factor, these databases didn’t originate solely from
the need to handle larger volumes of data. Every V of big data has influence (volume, variety,
velocity, and sometimes veracity).
Graph databases, for instance, can handle network data. Consider an example of dinner
preparation for which ingredients and recipes can be a part of a network. But recipes and
ingredients could also be stored in a relational database or a document store. Herein lies the
strength of NoSQL: it has the ability to look at a problem from a different angle, shaping the
data structure to the use case. As a data scientist, our job is to find the best answer to any
problem. Although sometimes this is still easier to attain using RDBMS, often a particular
NoSQL database offers a better approach.
Are relational databases doomed to disappear in companies with big data because of the need
for partitioning?
No, NewSQL platforms are the RDBMS answer to the need for cluster setup. NewSQL
databases follow the relational model but are capable of being divided into a distributed cluster
like NoSQL databases. It’s not the end of relational databases and certainly not the end of SQL,


as platforms like Hive translate SQL into MapReduce jobs for Hadoop. Besides, not every
company needs big data; many do fine with small databases and the traditional relational
databases are perfect for that.

Fig. 3.6 NoSQL and NewSQL databases


3.4 Introduction to NoSQL
The goal of NoSQL databases isn’t only to offer a way to partition databases successfully
over multiple nodes, but also to present fundamentally different ways to model the data at hand
to fit its structure to its use case and not to how a relational database requires it to be modeled.
To understand NoSQL, first we should look at the core ACID principles of single-server
relational databases and see how NoSQL databases rewrite them into BASE principles so they’ll
work far better in a distributed fashion.
3.5 ACID: the core principle of relational databases
The main aspects of a traditional relational database can be summarized by the concept ACID.


Atomicity—The “all or nothing” principle. If a record is put into a database, it’s put in
completely or not at all. If, for instance, a power failure occurs in the middle of a database write
action, you wouldn’t end up with half a record; it wouldn’t be there at all.
Consistency—This important principle maintains the integrity of the data. No entry that makes it
into the database will ever be in conflict with predefined rules, such as lacking a required field or
a field being numeric instead of text. The database should be consistent before and after any
operation; consistency refers to the correctness of the database.
Isolation—When something is changed in the database, nothing else can happen to the exact
same data at exactly the same moment. Instead, the actions happen in serial with other changes.
Isolation is a scale going from low isolation to high isolation. On this scale, traditional databases
are on the “high isolation” end. An example of low isolation would be Google Docs: Multiple
people can write to a document at the exact same time and see each other’s changes happening
instantly. A traditional Word document has high isolation; it’s locked for editing by the first user
to open it. The second person opening the document can view its last saved version but is unable
to see unsaved changes or edit the document without first saving it as a copy. So once someone
has it opened, the most up-to-date version is completely isolated from anyone but the editor who
locked the document.
Durability—If data has entered the database, it should survive permanently. Physical damage to
the hard discs will destroy records, but power outages and software crashes should not.

Fig. 3.7 ACID properties
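Atomicity and consistency can be illustrated with Python's built-in sqlite3 module: the with block commits the whole transaction or rolls it back entirely if any statement fails, so a violated NOT NULL rule leaves no half-written data behind. This is a toy, single-node sketch, not part of the original text:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (item TEXT NOT NULL, qty INTEGER NOT NULL)")

try:
    # Atomicity: both inserts succeed together or neither is kept.
    with conn:
        conn.execute("INSERT INTO orders VALUES ('octopus table', 1)")
        conn.execute("INSERT INTO orders VALUES ('coffee mug', NULL)")  # breaks NOT NULL
except sqlite3.IntegrityError:
    pass  # the whole transaction was rolled back

# The failed transaction left no half-written record behind.
print(conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 0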


ACID applies to all relational databases and certain NoSQL databases, such as the graph
database Neo4j. For most other NoSQL databases another principle applies: BASE.
3.6 CAP Theorem: the problem with DBs on many nodes
• The CAP theorem describes the main problem with distributing databases across multiple
nodes and how ACID and BASE databases approach it.
• Once a database gets spread out over different servers, it’s difficult to follow the ACID
principle because of the consistency ACID promises; the CAP Theorem points out why this
becomes problematic.
• The CAP Theorem states that a database can be any two of the following things but never all
three:
➢ Partition tolerant—The database can handle a network partition or network failure.
➢ Available—As long as the node you’re connecting to is up and running and you can
connect to it, the node will respond, even if the connection between the different database
nodes is lost.
➢ Consistent—No matter which node you connect to, you’ll always see the exact same data.
For a single-node database it’s easy to see how it’s always available and consistent:
➢ Available—As long as the node is up, it’s available. That’s all the CAP availability
promises.
➢ Consistent—There’s no second node, so nothing can be inconsistent.
Things get interesting once the database gets partitioned. Then we need to make a choice
between availability and consistency, as shown in figure 3.8.

Fig. 3.8 CAP Theorem: when partitioning the database, we need to choose between availability
and consistency.


• Let’s take the example of an online shop with a server in Europe and a server in the United
States, with a single distribution center. A German named Fritz and an American named
Freddy are shopping at the same time on that same online shop. They see an item and only
one is still in stock: a bronze, octopus-shaped coffee table. Disaster strikes, and
communication between the two local servers is temporarily down. Now the owner of the
shop has two options:
➢ Availability—allow the servers to keep on serving customers, and sort out everything
afterward.
➢ Consistency—put all sales on hold until communication is reestablished.
• In the first case, Fritz and Freddy will both buy the octopus coffee table, because the last-
known stock number for both nodes is “one” and both nodes are allowed to sell it, as shown
in figure 3.9.

Fig. 3.9 CAP Theorem: if nodes get disconnected, you can choose to remain available, but
the data could become inconsistent.
• If the coffee table is hard to come by, you’ll have to inform either Fritz or Freddy that he
won’t receive his table on the promised delivery date or, even worse, he will never receive
it. As a good businessperson, you might compensate one of them with a discount coupon for
a later purchase, and everything might be okay after all.


Fig. 3.10 CAP Theorem: if nodes get disconnected, you can choose to remain consistent by
stopping access to the databases until connections are restored
• The second option (figure 3.10) involves putting the incoming requests on hold temporarily.
This might be fair to both Fritz and Freddy if after five minutes the web shop is open for
business again, but then you might lose both sales and probably many more.
• Web shops tend to choose availability over consistency, but it’s not the optimal choice in all
cases.
• Consider a workshop for which the maximum number of participants is 100. If node
communication fails during the online registration process, every node keeps on accepting
registrations, and you might end up with more than 100 registrations by the time
communication is reestablished. In such a case it might be wiser to go for consistency and
turn off the nodes temporarily.
3.7 The BASE principles of NoSQL databases
• RDBMS follows the ACID principles; NoSQL databases that don’t follow ACID, such as the
document stores and key-value stores, follow BASE.
• BASE is a set of much softer database promises:
➢ Basically available—Availability is guaranteed in the CAP sense. Taking the web shop
example, if a node is up and running, we can keep on shopping. Depending on how
things are set up, nodes can take over from other nodes. Elasticsearch, for example, is a


NoSQL document–type search engine that divides and replicates its data in such a way
that node failure doesn’t necessarily mean service failure, via the process of sharding.
Each shard can be seen as an individual database server instance, but is also capable of
communicating with the other shards to divide the workload as efficiently as possible
(figure 3.11). Several shards can be present on a single node. If each shard has a replica
on another node, node failure is easily remedied by re-dividing the work among the
remaining nodes.

Fig. 3.11 Sharding: each shard can function as a self-sufficient database, but they also work
together as a whole. The example represents two nodes, each containing four shards: two main
shards and two replicas. Failure of one node is backed up by the other.
➢ Soft state—The state of a system might change over time. This corresponds to the
eventual consistency principle: the system might have to change to make the data
consistent again. In one node the data might say “A” and in the other it might say “B”
because it was modified. Later, at conflict resolution when the network is back online,
it’s possible the “A” in the first node is replaced by “B.” Even though no one did
anything to explicitly change “A” into “B,” it will take on this value as it becomes
consistent with the other node.
➢ Eventual consistency—The database will become consistent over time. In the web
shop example if the table is sold twice it results in data inconsistency. Once the
connection between the individual nodes is reestablished, they’ll communicate and


decide how to resolve it. This conflict can be resolved on a first-come, first-served
basis or by preferring the customer who would incur the lowest transport cost.
Databases come with default behavior, but to make an actual business decision, this
behavior can be overwritten. Even if the connection is up and running, latencies might
cause nodes to become inconsistent. Often, products are kept in an online shopping
basket, but putting an item in a basket doesn’t lock it for other users. If Fritz first beats
the checkout button, there'll be a problem once Freddy goes to check out. It can then
easily be explained to the customer that he was too late.
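The first-come, first-served resolution mentioned above can be sketched in a few lines of Python. The order records and timestamps are invented; a real database applies such a rule inside its replication and conflict-resolution logic:

from datetime import datetime

# Two nodes each accepted an order for the last coffee table while the
# connection between them was down (timestamps are invented).
conflicting_orders = [
    {"node": "EU", "customer": "Fritz",  "at": datetime(2015, 5, 1, 10, 0, 3)},
    {"node": "US", "customer": "Freddy", "at": datetime(2015, 5, 1, 10, 0, 1)},
]

# First-come, first-served: the earliest order wins once the nodes reconnect.
winner = min(conflicting_orders, key=lambda order: order["at"])
print(winner["customer"], "gets the table; the other customer gets a coupon.")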
ACID versus BASE
The BASE principles are somewhat contrived to fit acid and base from chemistry: an acid is a
fluid with a low pH value. A base is the opposite and has a high pH value. Figure 3.12 shows a
mnemonic for those familiar with the chemistry equivalents of acid and base.

Fig. 3.12 ACID versus BASE: traditional relational databases versus most NoSQL databases.
The names are derived from the chemistry concept of the pH scale. A pH value below 7 is acidic;
higher than 7 is a base. On this scale, your average surface water fluctuates between 6.5 and 8.5.
3.8 Types of NoSQL databases
There are four big NoSQL types:
• key-value store
• document store
• column-oriented database
• graph database.


• Each type solves a problem that can’t be solved with relational databases. Actual
implementations are often combinations of these. OrientDB, for example, is a multi-model
database, combining NoSQL types. OrientDB is a graph database where each node is a
document.
• Relational databases generally strive toward normalization: making sure every piece of data
is stored only once. Normalization marks their structural setup.
• If, for instance, we want to store data about a person and their hobbies, we can do so with
two tables: one about the person and one about their hobbies. An additional table is
necessary to link hobbies to persons because of their many-to-many relationship: a person
can have multiple hobbies and a hobby can have many persons practicing it.
• A full-scale relational database can be made up of many entities and linking tables.

Fig. 3.13 Relational databases strive toward normalization (making sure every piece of data is
stored only once).
Each table has unique identifiers (primary keys) that are used to model the relationship between
the entities (tables), hence the term relational.
Column-Oriented Database
• Traditional relational databases are row-oriented, with each row having a row id and each
field within the row stored together in a table.


• For example, no extra data about hobbies is stored and we have only a single table to
describe people, as shown in figure 3.14.
• In this scenario there is a slight denormalization because hobbies could be repeated.

Fig. 3.14 Row-oriented database layout. Every entity (person) is represented by a single row,
spread over multiple columns.
Suppose we only want a list of birthdays in September. The database will scan the table
from top to bottom and left to right, as shown in figure 3.15, and then return the list of
birthdays.

Fig. 3.15 Row-oriented lookup: from top to bottom and for every entry, all columns are taken
into memory
• Indexing the data on certain columns can significantly improve lookup speed, but indexing
every column brings extra overhead and the database is still scanning all the columns.
• Column databases store each column separately, allowing for quicker scans when only a
small number of columns is involved.


Fig. 3.16 Column-oriented databases store each column separately with the related row numbers.
Every entity (person) is divided over multiple tables.
• This layout looks very similar to a row-oriented database with an index on every column.
• A database index is a data structure that allows for quick lookups on data at the cost of
storage space and additional writes (index update).
• An index maps the row number to the data, whereas a column database maps the data to the
row numbers; in that way counting becomes quicker, so it’s easy to see how many people
like archery, for instance.
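A toy Python sketch of the difference: a row store keeps whole records together, while a column store keeps, per column, a mapping from each value to the row numbers that contain it, so counting how many people like archery becomes a single lookup. The data is invented for illustration:

# Row-oriented layout: every row holds all columns of one person.
rows = [
    {"name": "Ann",  "birthday": "09-12", "hobby": "archery"},
    {"name": "Bob",  "birthday": "03-04", "hobby": "chess"},
    {"name": "Carl", "birthday": "09-30", "hobby": "archery"},
]
# Row-oriented lookup: scan every row and pull all its columns into memory.
september = [r["name"] for r in rows if r["birthday"].startswith("09")]

# Column-oriented layout: per column, map each value to the row numbers.
hobby_column = {"archery": [0, 2], "chess": [1]}
archery_fans = len(hobby_column["archery"])  # counting is a single lookup

print(september, archery_fans)  # ['Ann', 'Carl'] 2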
Row-oriented database vs column-oriented database
• In a column-oriented database it’s easy to add another column because none of the existing
columns are affected by it. But adding an entire record requires adapting all tables. This
makes the row-oriented database preferable over the column-oriented database for online
transaction processing (OLTP) where adding or changing records is done constantly.
• The column-oriented database is advantageous when performing analytics and reporting:
summing values and counting entries. A row-oriented database is often the operational
database of choice for actual transactions (such as sales).
• Overnight batch jobs bring the column-oriented database up to date, supporting lightning-
speed lookups and aggregations using MapReduce algorithms for reports. Examples of
column-family stores are Apache HBase, Facebook’s Cassandra, Hypertable, and Google
BigTable.
Key-Value Stores
• Key-value stores are the least complex of the NoSQL databases.
• They are collections of key-value pairs, as shown in figure 3.17.


• This simplicity makes them the most scalable of the NoSQL database types, capable of
storing huge amounts of data.

Fig. 3.17 Key-value stores store everything as a key and a value.


• The value in a key-value store can be anything: a string, a number, but also an entire new set
of key-value pairs encapsulated in an object.
• Examples of key-value stores are Redis, Voldemort, Riak, and Amazon’s Dynamo.
• Figure 3.18 shows a slightly more complex key-value structure.

Figure 3.18 Key-value nested structure
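A structure like the one in figure 3.18 can be sketched with the redis-py client; this assumes a Redis server running locally and a reasonably recent client version, and the keys and values are invented:

import redis

# Assumes a Redis server on localhost with the default port.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# A plain key-value pair: the value is just a string.
r.set("greeting", "hello world")

# A nested structure: the value of "user:1" is itself a set of
# key-value pairs, stored here as a Redis hash.
r.hset("user:1", mapping={"name": "Fritz", "country": "Germany"})

print(r.get("greeting"))    # hello world
print(r.hgetall("user:1"))  # {'name': 'Fritz', 'country': 'Germany'}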


DOCUMENT STORES
• A document store does assume a certain document structure that can be specified with a
schema.
• Document stores appear the most natural among the NoSQL database types because they’re
designed to store everyday documents.


• They allow for complex querying and calculations on this often already aggregated form of
data.
• In a relational database, everything should be stored only once and connected via foreign keys;
that is the point of normalization.
• Document stores care little about normalization as long as the data is in a structure that
makes sense.
• A relational data model doesn’t always fit well with certain business cases. Newspapers or
magazines, for example, contain articles. To store these in a relational database, you need to
chop them up first: the article text goes in one table, the author and all the information about
the author in another, and comments on the article when published on a website go in yet
another.
• In a document store, the article can be stored as a single entity.
• Examples of document stores are MongoDB and CouchDB.

Figure 3.19 Relational Database Approach


Figure 3.20 Document Store Approach
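A minimal pymongo sketch of the document store approach in figure 3.20: the article text, its author, and its comments live inside a single document. A local MongoDB instance is assumed, and the database, collection, and field names are invented:

from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; names are invented.
client = MongoClient("localhost", 27017)
articles = client["newspaper"]["articles"]

# The whole article is one document: text, author, and comments together,
# instead of being chopped up over several normalized tables.
articles.insert_one({
    "title": "Octopus coffee tables are back",
    "text": "Full article text goes here.",
    "author": {"name": "Freddy", "email": "freddy@example.com"},
    "comments": [{"user": "Fritz", "text": "I ordered mine first!"}],
})

print(articles.find_one({"author.name": "Freddy"})["title"])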


GRAPH DATABASES
• This is the last big NoSQL database type and the most complex one.
• It is mainly used to store relations between entities in an efficient manner.
• When the data is highly interconnected, such as for social networks, scientific paper
citations, or capital asset clusters, graph databases are used to store this information.


• Graph or network data has two main components:


➢ Node—The entities themselves. In a social network this could be people.
➢ Edge—The relationship between two entities. This relationship is represented by a
line and has its own properties. An edge can have a direction, for example, if the
arrow indicates who is whose boss.
• Graphs can become incredibly complex given enough relation and entity types.
• Figure 3.21 shows that complexity with only a limited number of entities.
• Graph databases like Neo4j also claim to uphold ACID, whereas document stores and key-
value stores adhere to BASE.

Fig. 3.21 Graph data example with four entity types (person, hobby, company, and furniture) and
their relations without extra edge or node information
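The kind of data in figure 3.21 can be sketched in Python with the networkx library (an in-memory graph toolkit, not a graph database such as Neo4j); the entities and relation labels below are invented:

import networkx as nx

# Nodes are entities, edges are typed relations between them.
g = nx.DiGraph()
g.add_node("Fritz", kind="person")
g.add_node("Freddy", kind="person")
g.add_node("archery", kind="hobby")
g.add_node("octopus table", kind="furniture")

g.add_edge("Fritz", "archery", relation="likes")
g.add_edge("Fritz", "Freddy", relation="friend of")
g.add_edge("Freddy", "octopus table", relation="bought")

# Traverse the outgoing relations of one entity.
for _, target, attrs in g.out_edges("Fritz", data=True):
    print("Fritz", attrs["relation"], target)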


Fig. 3.22 Top 15 databases ranked by popularity


3.9 Case study: What disease is that?
It has happened to many of us: you have sudden medical symptoms and the first thing
you do is Google what disease the symptoms might indicate; then you decide whether it’s worth
seeing a doctor. A web search engine is okay for this, but a more dedicated database would be
better. Databases like this exist and are fairly advanced. But they’re built upon well-protected
data and not all of it is accessible by the public. Also, although big pharmaceutical companies
and advanced hospitals have access to these virtual doctors, many general practitioners are still
stuck with only their books. This information and resource asymmetry is not only sad and
dangerous, it needn’t be there at all. If a simple, disease-specific search engine were used by all
general practitioners in the world, many medical mistakes could be avoided.
In this case study, we'll learn how to build such a search engine, although using only
a fraction of the medical data that is freely accessible. To tackle the problem, we’ll use a modern
NoSQL database called Elasticsearch to store the data, and the data science process to work with
the data and turn it into a resource that’s fast and easy to search. Here’s how we’ll apply the
process:
1. Setting the research goal.
➢ The primary goal is to set up a disease search engine that would help general practitioners
in diagnosing diseases.


➢ The secondary goal is to profile a disease: What keywords distinguish it from other
diseases?
This secondary goal is useful for educational purposes or as input to more advanced uses
such as detecting spreading epidemics by tapping into social media.
2. Data collection—We’ll get the data from Wikipedia. There are more sources out there, but
for demonstration purposes a single one will do.
3. Data preparation—The Wikipedia data might not be perfect in its current format. We’ll
apply a few techniques to change this.
4. Data exploration—Our use case is special in that step 4 of the data science process is also
the desired end result: we want our data to become easy to explore.
5. Data modeling—No real data modeling is applied. Document term matrices that are used for
search are often the starting point for advanced topic modeling.
6. Presenting results—To make the data searchable, we would need a user interface such as a
website where people can query and retrieve disease information. For our secondary goal,
profiling a disease category by its keywords, we'll reach this stage of the data science
process because we'll present it as a word cloud, such as the one in figure 3.23.

Fig. 3.23 A sample word cloud on non-weighted diabetes keywords
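A rough sketch of the core of such a search engine with the elasticsearch Python client: index one disease document and run a full-text query against its text. A local Elasticsearch node is assumed; the index name, field names, and example text are invented, and the exact call signatures differ between client versions:

from elasticsearch import Elasticsearch

# Assumes an Elasticsearch node on localhost; index and fields are invented.
es = Elasticsearch("http://localhost:9200")

# Index a single disease document (in practice, one per Wikipedia article).
es.index(index="diseases", id=1, body={
    "name": "Diabetes mellitus",
    "text": "Symptoms include frequent urination and increased thirst.",
})

# Full-text search on the symptom description.
hits = es.search(index="diseases", body={
    "query": {"match": {"text": "thirst urination"}},
})
for hit in hits["hits"]["hits"]:
    print(hit["_source"]["name"], hit["_score"])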
