
Unit-3

The problems you face when handling large data


A large volume of data poses new challenges, such as overloaded memory and algorithms
that never stop running. It forces you to adapt and expand your repertoire of techniques. But
even when you can perform your analysis, you should take care of issues such as I/O
(input/output) and CPU starvation, because these can cause speed issues.
A computer only has a limited amount of RAM. When you try to squeeze more data into this
memory than actually fits, the OS will start swapping out memory blocks to disks, which is
far less efficient than having it all in memory. But only a few algorithms are designed to
handle large data sets; most of them load the whole data set into memory at once, which
causes the out-of-memory error. Other algorithms need to hold multiple copies of the data in
memory or store intermediate results. All of these aggravate the problem.
General techniques for handling large volumes of data
Even when you cure the memory issues, you may need to deal with another limited resource:
time. Although a computer may act as if you’ll live for millions of years, in reality you won’t
(unless you go into cryostasis until your PC is done). Certain algorithms don’t take time into
account; they’ll keep running practically forever. Other algorithms can’t finish in a reasonable
amount of time even when they only need to process a few megabytes of data.
New big data technologies such as Hadoop and Spark make it much easier to work with and
control a cluster of computers. Hadoop can scale up to thousands of computers, creating a
cluster with petabytes of storage. This enables businesses to grasp the value of the massive
amount of data available.
1. Choosing the right algorithm
Choosing the right algorithm can solve more problems than adding more or better
hardware. An algorithm that’s well suited for handling large data doesn’t need to load
the entire data set into memory to make predictions. Ideally, the algorithm also
supports parallelized calculations. In this section we’ll dig into three types of
algorithms that can do that: online algorithms, block algorithms, and MapReduce
algorithms.
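As a quick illustration of the online idea (not from the original text; it assumes scikit-learn and NumPy are installed, and the chunk generator is invented for the example), a model can learn from one chunk at a time with partial_fit, so the full data set never has to sit in memory:

```python
# A minimal sketch of online / block-wise learning, assuming scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import SGDClassifier

def chunks(n_chunks=100, chunk_size=1_000, n_features=10, seed=0):
    """Simulate a stream of labeled data arriving chunk by chunk (invented data)."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] + 0.1 * rng.randn(chunk_size) > 0).astype(int)
        yield X, y

model = SGDClassifier()  # a linear model trained with stochastic gradient descent
for X_chunk, y_chunk in chunks():
    # classes must be supplied when training incrementally
    model.partial_fit(X_chunk, y_chunk, classes=np.array([0, 1]))
```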
2. Choosing the right data structure
Algorithms can make or break your program, but the way you store your data is of
equal importance. Data structures have different storage requirements and also
influence the performance of CRUD (create, read, update, and delete) and other
operations on the data set. There are three types of data structures to consider:
 sparse data,
 tree data, and
 hash data.
SPARSE DATA
A sparse data set contains relatively little information compared to its number of entries
(observations): most of the values are zero or empty. Data like this might look wasteful,
but it is often what you get when converting textual data to binary (dummy-encoded) data.
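For instance, a minimal sketch of how such data can be stored compactly (assuming SciPy and NumPy are available; the matrix is invented for illustration):

```python
# Storing a mostly-zero matrix sparsely instead of densely.
import numpy as np
from scipy import sparse

dense = np.zeros((1000, 1000))
dense[0, 3] = 1.0          # only a handful of non-zero entries
dense[42, 17] = 5.0

sparse_matrix = sparse.csr_matrix(dense)   # keeps only the non-zero values
print(sparse_matrix.nnz)                   # 2 stored entries instead of 1,000,000
```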
TREE STRUCTURES
Trees are a class of data structure that allows you to retrieve information much faster
than scanning through a table. A tree always has a root value and subtrees of children,
each with its children, and so on.
HASH TABLES
Hash tables are data structures that calculate a hash for every key in your data and
place each value in the bucket that hash points to. This way you can quickly retrieve the
information by looking in the right bucket when you encounter the key. Dictionaries in
Python are a hash table implementation, and they’re a close relative of key-value stores.
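A tiny illustration (invented for this text) of a Python dict used as a key-value style lookup:

```python
# A dict is a hash table: lookups by key are fast regardless of how many entries it holds.
customer_carts = {}                          # key-value store: customer id -> cart contents
customer_carts["cust-001"] = ["book", "lamp"]
customer_carts["cust-002"] = ["phone"]

print(customer_carts["cust-001"])            # direct lookup, no scan over all customers
```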
3. Selecting the right tools
With the right class of algorithms and data structures in place, it’s time to choose the
right tool for the job. The right tool can be a Python library or at least a tool that’s
controlled from Python. The number of helpful tools available is enormous, so we’ll
look at only a handful of them.
PYTHON TOOLS
Python has a number of libraries that can help you deal with large data. They range
from smarter data structures and code optimizers to just-in-time compilers. The
following is a list of libraries we like to use when confronted with large data:
■ Cython—The closer you get to the actual hardware of a computer, the more vital it is
for the computer to know what types of data it has to process. For a computer, adding
1 + 1 is different from adding 1.00 + 1.00. The first example consists of integers and
the second consists of floats, and these calculations are performed by different parts of
the CPU. In Python you don’t have to specify what data types you’re using, so the
Python interpreter has to infer them. But inferring data types is a slow operation and is
partially why Python isn’t one of the fastest languages available. Cython, a superset of
Python, solves this problem by letting the programmer specify the data types while
developing the program. Once the compiler has this information, it runs programs
much faster.
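As a rough sketch of what those type declarations look like (illustrative only, assuming Cython is installed; this uses Cython’s “pure Python” typing mode, whereas a .pyx file would use cdef declarations instead):

```python
# Illustrative only: Cython's "pure Python" typing mode. Compiling this module
# with Cython uses the declared C types; running it as plain Python ignores them.
import cython

@cython.locals(n=cython.int, i=cython.int, total=cython.longlong)
def sum_of_squares(n):
    total = 0
    for i in range(n):
        total += i * i
    return total

print(sum_of_squares(10_000))
```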
■ Numexpr—Numexpr is at the core of many of the big data packages, as is NumPy
for in-memory packages. Numexpr is a numerical expression evaluator for NumPy
but can be many times faster than the original NumPy. To achieve this, it rewrites
your expression and uses an internal (just-in-time) compiler.
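A minimal sketch of how Numexpr is typically used (assuming NumPy and Numexpr are installed; the arrays are invented for illustration):

```python
import numpy as np
import numexpr as ne

a = np.arange(1_000_000, dtype=np.float64)
b = np.arange(1_000_000, dtype=np.float64)

# numexpr compiles the whole expression and evaluates it in chunks,
# often faster and with fewer temporary arrays than plain NumPy.
result = ne.evaluate("a * b + 2 * a")
```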
■ Numba—Numba helps you to achieve greater speed by compiling your code right
before you execute it, also known as just-in-time compiling. This gives you the
advantage of writing high-level code while achieving speeds similar to those of C code.
Using Numba is straightforward.
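For example, a minimal sketch (assuming Numba and NumPy are installed) of just-in-time compiling a simple loop:

```python
import numpy as np
from numba import jit

@jit(nopython=True)          # compiled to machine code on the first call
def total_sum(arr):
    total = 0.0
    for value in arr:
        total += value
    return total

print(total_sum(np.random.rand(1_000_000)))
```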
■ Bcolz—Bcolz helps you overcome the out-of-memory problem that can occur when
using NumPy. It can store and work with arrays in an optimal compressed form. It not
only slims down your memory needs but also uses Numexpr in the background to speed
up calculations on bcolz arrays.
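A rough sketch of the idea (the bcolz API here is assumed from its documentation, and the project is no longer actively maintained, so treat it as illustrative only):

```python
import numpy as np
import bcolz

a = bcolz.carray(np.arange(10_000_000))   # stored in compressed chunks
result = bcolz.eval("a * 2 + 1")          # evaluated via Numexpr under the hood
print(result[:5])
```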
■ Blaze—Blaze is ideal if you want to use the power of a database backend but like
the “Pythonic way” of working with data. Blaze will translate your Python code into
SQL but can handle many more data stores than relational databases, such as CSV files,
Spark, and others. Blaze delivers a unified way of working with many databases and
data libraries. Blaze is still in development, though, so many features aren’t
implemented yet.
■ Theano—Theano enables you to work directly with the graphics processing unit
(GPU) and do symbolic simplifications whenever possible, and it comes with an
excellent just-in-time compiler. On top of that it’s a great library for dealing with an
advanced but useful mathematical concept: tensors.
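A minimal sketch of the classic Theano workflow (Theano itself is no longer developed; the example is illustrative only): define a symbolic expression, then compile it into a function.

```python
import theano
import theano.tensor as T

x = T.dvector("x")            # symbolic vector of doubles
y = (x ** 2).sum()            # symbolic expression; Theano can simplify it
f = theano.function([x], y)   # compiled (optionally GPU-backed) function

print(f([1.0, 2.0, 3.0]))     # 14.0
```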
■ Dask—Dask enables you to optimize your flow of calculations and execute them
efficiently. It also enables you to distribute calculations.
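A minimal sketch (assuming Dask is installed; the array sizes are arbitrary) of how Dask splits work into chunks and only executes when asked:

```python
import dask.array as da

# The array is split into 1000 x 1000 chunks; operations build a lazy task graph.
x = da.random.random((20_000, 20_000), chunks=(1_000, 1_000))
result = (x + x.T).mean()

print(result.compute())   # executes the graph, potentially in parallel
```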
These libraries are mostly about using Python itself for data processing.
General programming tips for dealing with large data sets
The tricks that work in a general programming context still apply for data science.
Several might be worded slightly differently, but the principles are essentially the same for all
programmers. This section recapitulates those tricks that are important in a data science
context.
You can divide the general tricks into three parts:
■ Don’t reinvent the wheel. Use tools and libraries developed by others.
■ Get the most out of your hardware. Your machine is never used to its full potential;
with simple adaptations you can make it work harder.
■ Reduce the computing need. Slim down your memory and processing needs as much as
possible.
Distributing data storage and processing with frameworks
Hadoop: a framework for storing and processing large data sets
Apache Hadoop is a framework that simplifies working with a cluster of computers. It aims
to be all of the following things and more:
■ Reliable—By automatically creating multiple copies of the data and redeploying
processing logic in case of failure.
■ Fault tolerant —It detects faults and applies automatic recovery.
■ Scalable—Data and its processing are distributed over clusters of computers
(horizontal scaling).
■ Portable—Installable on all kinds of hardware and operating systems.
The core framework is composed of a distributed file system, a resource manager, and a
system to run distributed programs. In practice it allows you to work with the distributed file
system almost as easily as with the local file system of your home computer. But in the
background, the data can be scattered among thousands of servers.
THE DIFFERENT COMPONENTS OF HADOOP
At the heart of Hadoop we find
■ A distributed file system (HDFS)
■ A method to execute programs on a massive scale (MapReduce)
■ A system to manage the cluster resources (YARN)
On top of that, an ecosystem of applications arose, such as the databases
Hive and HBase and machine-learning frameworks such as Mahout.
Hive has a language based on the widely used SQL to interact with data
stored inside the database.

A sample from the ecosystem of applications that arose around the Hadoop Core Framework
It’s possible to use the popular tool Impala to query Hive data up to 100 times faster.
MAPREDUCE: HOW HADOOP ACHIEVES PARALLELISM
Hadoop uses a programming method called MapReduce to achieve parallelism. A
MapReduce algorithm splits up the data, processes it in parallel, and then sorts, combines,
and aggregates the results back together.
However, the MapReduce algorithm isn’t well suited for interactive analysis or iterative
programs because it writes the data to a disk in between each computational step. This is
expensive when working with large data sets.
As the name suggests, the process roughly boils down to two big phases:
■ Mapping phase—The documents are split up into key-value pairs. Until we
reduce, we can have many duplicates.
■ Reduce phase—It’s not unlike a SQL “group by.” The different unique occurrences are
grouped together, and depending on the reducing function, a different result can be created.
In reality it’s a bit more complicated than this though.
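To make the two phases concrete, here is a tiny word-count sketch in plain Python (invented for illustration; a real Hadoop job would distribute the same mapping and reducing logic over many machines, with a shuffle/sort step in between):

```python
from collections import defaultdict

documents = ["big data is big", "data science uses big data"]

# Mapping phase: emit a (key, value) pair per word; duplicates are expected.
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Reduce phase: group by key and aggregate, much like a SQL "group by".
counts = defaultdict(int)
for word, count in mapped:
    counts[word] += count

print(dict(counts))   # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'uses': 1}
```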
NOTE While Hadoop makes working with big data easy, setting up a good working cluster
still isn’t trivial, but cluster managers such as Apache Mesos do ease the burden.
In reality, many (mid-sized) companies lack the competence to maintain a healthy Hadoop
installation.
This is why we’ll work with the Hortonworks Sandbox, a pre-installed and configured
Hadoop ecosystem.
Spark: replacing MapReduce for better performance
Data scientists often do interactive analysis and rely on algorithms that are inherently
iterative; it can take a while until an algorithm converges to a solution. As this is a weak
point of the MapReduce framework, we’ll introduce the Spark framework to overcome
it. Spark improves the performance on such tasks by an order of magnitude.
WHAT IS SPARK?
Spark is a cluster computing framework similar to MapReduce. Spark, however,
doesn’t handle the storage of files on the (distributed) file system itself, nor does it handle the
resource management. For this it relies on systems such as the Hadoop File System, YARN,
or Apache Mesos. Hadoop and Spark are thus complementary systems.
For testing and development, you can even run Spark on your local system.
HOW DOES SPARK SOLVE THE PROBLEMS OF MAPREDUCE?
 While we oversimplify things a bit for the sake of clarity, Spark creates a kind of
shared RAM memory between the computers of your cluster.
 This allows the different workers to share variables (and their state) and thus
eliminates the need to write the intermediate results to disk.
 More technically and more correctly if you’re into that: Spark uses Resilient
Distributed Datasets (RDDs), which are a distributed memory abstraction that lets
programmers perform in-memory computations on large clusters in a fault-tolerant
way.
 Because it’s an in-memory system, it avoids costly disk operations.
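A minimal PySpark sketch (assuming a local Spark installation; the data is invented) showing an RDD kept in memory with cache() instead of being written to disk between steps:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

lines = sc.parallelize(["big data is big", "data science uses big data"])
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

counts.cache()            # keep the intermediate result in memory for reuse
print(counts.collect())   # triggers the computation
sc.stop()
```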
THE DIFFERENT COMPONENTS OF THE SPARK ECOSYSTEM
Spark core provides a NoSQL environment well suited for interactive, exploratory analysis.
Spark can be run in batch and interactive mode and supports Python.
Spark has four other large components, as listed below:
1 Spark Streaming is a tool for real-time analysis.
2 Spark SQL provides a SQL interface to work with Spark (see the sketch after this list).
3 MLlib is a tool for machine learning inside the Spark framework.
4 GraphX is a graph-processing framework on top of Spark.
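A small sketch of the Spark SQL component (assuming PySpark; the table and data are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)], ["name", "age"]
)
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```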
NOSQL DATABASES
A NoSQL database is a non-relational data management system that does not require a fixed
schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is
for distributed data stores with humongous data storage needs. NoSQL is used for big data
and real-time web apps.
For example, companies like Twitter, Facebook, and Google collect terabytes of user data
every single day.
NoSQL stands for “Not Only SQL” or “Not SQL.” Though a better term would be
“NoREL”, NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
A traditional RDBMS uses SQL syntax to store and retrieve data for further insights. In
contrast, a NoSQL database system encompasses a wide range of database technologies that
can store structured, semi-structured, unstructured, and polymorphic data.
The concept of NoSQL databases became popular with Internet giants like Google,
Facebook, and Amazon, which deal with huge volumes of data. System response time
becomes slow when you use an RDBMS for massive volumes of data.
To resolve this problem, we could “scale up” our systems by upgrading the existing
hardware. This process is expensive.
The alternative is to distribute the database load over multiple hosts whenever the
load increases. This method is known as “scaling out.”

NoSQL databases are non-relational, so they scale out better than relational databases; they
are designed with web applications in mind.
Features of NoSQL
1. Non-relational
 NoSQL databases never follow the relational model
 Never provide tables with flat fixed-column records
 Work with self-contained aggregates or BLOBs
 Don’t require object-relational mapping or data normalization
 No complex features like query languages, query planners, referential-integrity joins, or
ACID
2. Schema-free
 NoSQL databases are either schema-free or have relaxed schemas
 Do not require any sort of definition of the schema of the data
 Offers heterogeneous structures of data in the same domain

NoSQL is Schema-Free
3. Simple API
 Offer easy-to-use interfaces for storing and querying data
 APIs allow low-level data manipulation and selection methods
 Text-based protocols are mostly used, typically HTTP REST with JSON
 Mostly no standards-based NoSQL query language is used
 Web-enabled databases run as internet-facing services
4. Distributed
 Multiple NoSQL databases can be executed in a distributed fashion
 Offer auto-scaling and fail-over capabilities
 Often the ACID concept is sacrificed for scalability and throughput
 Mostly no synchronous replication between distributed nodes; instead asynchronous
multi-master replication, peer-to-peer, or HDFS-style replication is used
 Often only eventual consistency is provided
 Shared-nothing architecture, which enables less coordination and higher distribution

NoSQL is Shared Nothing.


Types of NoSQL Databases
NoSQL databases are mainly categorized into four types: key-value pair, column-oriented,
graph-based, and document-oriented. Every category has its unique attributes and limitations.
None of these types is best for solving every problem. Users should select the database based
on their product needs.
Types of NoSQL Databases:
 Key-value pair based
 Column-oriented
 Graph based
 Document-oriented

Key Value Pair Based


Data is stored in key/value pairs. This model is designed to handle lots of data and heavy
load.
Key-value pair storage databases store data as a hash table where each key is unique, and the
value can be a JSON document, a BLOB (Binary Large Object), a string, etc.
For example, a key-value pair may contain a key like “Website” associated with a value like
“Guru99”.

It is one of the most basic NoSQL database examples. This kind of NoSQL database is used
for collections, dictionaries, associative arrays, etc. Key-value stores help the developer to
store schema-less data. They work well for use cases such as shopping cart contents.
Redis, Amazon DynamoDB, and Riak are examples of key-value store databases; DynamoDB
and Riak draw on Amazon’s Dynamo paper.
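A minimal sketch using the redis-py client (it assumes a Redis server running locally; the key and value mirror the “Website”/“Guru99” example above):

```python
import redis

r = redis.Redis(host="localhost", port=6379)
r.set("Website", "Guru99")    # store a key-value pair
print(r.get("Website"))       # b'Guru99' -- values come back as bytes
```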
Column-based
Column-oriented databases work on columns and are based on Google’s BigTable paper.
Every column is treated separately. The values of a single column are stored
contiguously.
Column-based NoSQL database
They deliver high performance on aggregation queries like SUM, COUNT, AVG, and MIN,
as the data is readily available in a column.
Column-based NoSQL databases are widely used for data warehouses, business
intelligence, CRM, library card catalogs, and so on.
HBase, Cassandra, and Hypertable are examples of column-based databases.
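A rough sketch of an aggregation query against a column-family store (it assumes the DataStax cassandra-driver and a running Cassandra node; the keyspace and table names are invented):

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")   # hypothetical keyspace

# Aggregations over a single column benefit from columnar storage.
rows = session.execute("SELECT COUNT(*) FROM orders")
print(rows.one())
cluster.shutdown()
```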
Document-Oriented
A document-oriented NoSQL DB stores and retrieves data as a key-value pair, but the value
part is stored as a document. The document is stored in JSON or XML format. The value is
understood by the DB and can be queried.

Relational Vs. Document


In the diagram, on the left you can see rows and columns, and on the right a document
database with a structure similar to JSON. For the relational database, you have to know
what columns you have, and so on. For a document database, however, you store data as
JSON-like objects; you do not need to define the structure up front, which makes it flexible.
The document type is mostly used for CMS systems, blogging platforms, real-time analytics,
and e-commerce applications. It should not be used for complex transactions that require
multiple operations or for queries against varying aggregate structures.
Amazon SimpleDB, CouchDB, MongoDB, Riak, and Lotus Notes are popular
document-oriented DBMS systems.
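A minimal sketch using pymongo (it assumes a local MongoDB server; the database, collection, and fields are invented for illustration):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

db.posts.insert_one({"title": "Hello", "tags": ["nosql", "mongodb"], "views": 10})

# Query on any field of the document -- no fixed schema required.
for post in db.posts.find({"views": {"$gt": 5}}):
    print(post["title"])
```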
Graph-Based
A graph-type database stores entities as well as the relations among those entities. An entity
is stored as a node, with the relationships as edges. An edge gives a relationship between
nodes. Every node and edge has a unique identifier.

Compared to a relational database, where tables are loosely connected, a graph database is
multi-relational in nature. Traversing relationships is fast because they are already captured
in the DB, and there is no need to calculate them.
Graph-based databases are mostly used for social networks, logistics, and spatial data.
Neo4j, Infinite Graph, OrientDB, and FlockDB are some popular graph-based databases.
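A rough sketch using the official neo4j Python driver (it assumes a local Neo4j instance; the credentials and Cypher statements are invented for illustration):

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes and the relationship (edge) between them are stored explicitly.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(friend) RETURN friend.name",
        a="Alice",
    )
    print([record["friend.name"] for record in result])

driver.close()
```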
Query Mechanism tools for NoSQL
The most common data retrieval mechanism is the REST-based retrieval of a value based on
its key/ID with a GET on the resource.
Document store databases offer more sophisticated queries, as they understand the value in a
key-value pair. For example, CouchDB allows defining views with MapReduce.
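For the key/ID-based REST retrieval, a minimal sketch with the requests library (it assumes a local CouchDB instance; the database name and document id are invented):

```python
import requests

# Key/ID-based retrieval over REST: GET the resource directly.
response = requests.get("http://localhost:5984/products/item-001")
print(response.json())
```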
CAP Theorem
The CAP theorem is also called Brewer’s theorem. It states that in networked shared-data
systems or distributed systems, we can achieve at most two out of three guarantees for a
database:
1. Consistency,
2. Availability and
3. Partition Tolerance.
A distributed system is a network that stores data on more than one node (physical or virtual
machines) at the same time.
Consistency: means that all clients see the same data at the same time, no matter which node
they connect to in a distributed system.
To achieve consistency, whenever data is written to one node, it must be instantly forwarded
or replicated to all the other nodes in the system before the write is deemed successful.
Availability: means that every non-failing node returns a response for all read and write
requests in a reasonable amount of time, even if one or more nodes are down.
Another way to state this — all working nodes in the distributed system return a valid
response for any request, without failing or exception.
Partition Tolerance: means that the system continues to operate despite arbitrary message
loss or failure of part of the system. In other words, even if there is a network outage in the
data center and some of the computers are unreachable, still the system continues to perform.
Distributed systems guaranteeing partition tolerance can gracefully recover from partitions
once the partition heals.
The CAP theorem categorizes systems into three categories:
CP (Consistent and Partition Tolerant) database: A CP database delivers consistency and
partition tolerance at the expense of availability. When a partition occurs between any two
nodes, the system has to shut down the non-consistent node (i.e., make it unavailable) until
the partition is resolved.
Partition refers to a communication break between nodes within a distributed system.
Meaning, if a node cannot receive any messages from another node in the system, there is a
partition between the two nodes. Partition could have been because of network failure, server
crash, or any other reason.
AP (Available and Partition Tolerant) database: An AP database delivers availability and
partition tolerance at the expense of consistency. When a partition occurs, all nodes remain
available but those at the wrong end of a partition might return an older version of data than
others. When the partition is resolved, the AP databases typically resync the nodes to repair
all inconsistencies in the system.
CA (Consistent and Available) database: A CA database delivers consistency and availability
in the absence of any network partition. Single-node DB servers are often categorized as CA
systems, since they do not need to deal with partition tolerance.
In any networked shared-data systems or distributed systems partition tolerance is a must.
Network partitions and dropped messages are a fact of life and must be handled
appropriately. Consequently, system designers must choose between consistency and
availability.
Different databases can be classified as CP, AP, or CA systems according to the CAP
theorem.
Eventual Consistency
The term “eventual consistency” means keeping copies of data on multiple machines to get
high availability and scalability. Changes made to any data item on one machine have to be
propagated to the other replicas.
Data replication may not be instantaneous: some copies will be updated immediately, while
others are updated in due course of time. These copies may be mutually inconsistent for a
while, but in due course of time they become consistent. Hence the name eventual consistency.
BASE: Basically Available, Soft state, Eventual consistency
 Basically available means the DB is available all the time, in the sense of the CAP theorem
 Soft state means that even without input, the system state may change
 Eventual consistency means that the system will become consistent over time
ACID: the core principle of relational databases
The main aspects of a traditional relational database can be summarized by the concept
ACID:
■ Atomicity—The “all or nothing” principle. If a record is put into a database, it’s
put in completely or not at all. If, for instance, a power failure occurs in the
middle of a database write action, you wouldn’t end up with half a record; it
wouldn’t be there at all.
■ Consistency—This important principle maintains the integrity of the data. No
entry that makes it into the database will ever be in conflict with predefined
rules, such as lacking a required field or a field being numeric instead of text.
■ Isolation—When something is changed in the database, nothing can happen on
this exact same data at exactly the same moment. Instead, the actions happen
in serial with other changes. Isolation is a scale going from low isolation to high
isolation. On this scale, traditional databases are on the “high isolation” end.
An example of low isolation would be Google Docs: Multiple people can write
to a document at the exact same time and see each other’s changes happening
instantly. A traditional Word document, on the other end of the spectrum, has
high isolation; it’s locked for editing by the first user to open it. The second person
opening the document can view its last saved version but is unable to see
unsaved changes or edit the document without first saving it as a copy. So once
someone has it opened, the most up-to-date version is completely isolated from
anyone but the editor who locked the document.
■ Durability—If data has entered the database, it should survive permanently.
Physical damage to the hard discs will destroy records, but power outages and
software crashes should not.
ACID applies to all relational databases and certain NoSQL databases, such as the
graph database Neo4j.
The BASE principles of NoSQL databases
RDBMS follows the ACID principles; NoSQL databases that don’t follow ACID, such as
the document stores and key-value stores, follow BASE. BASE is a set of much softer
database promises:
■ Basically available—Availability is guaranteed in the CAP sense. Taking the web
shop example, if a node is up and running, you can keep on shopping. Depending
on how things are set up, nodes can take over from other nodes. Elasticsearch,
for example, is a NoSQL document–type search engine that divides
and replicates its data in such a way that node failure doesn’t necessarily mean
service failure, via the process of sharding. Each shard can be seen as an individual
database server instance, but is also capable of communicating with the
other shards to divide the workload as efficiently as possible. Several
shards can be present on a single node. If each shard has a replica on
another node, node failure is easily remedied by re-dividing the work among
the remaining nodes.
■ Soft state—The state of a system might change over time. This corresponds to the
eventual consistency principle: the system might have to change to make the data consistent
again.

In one node the data might say “A” and in the other it might
say “B” because it was adapted. Later, at conflict resolution when the network is
back online, it’s possible the “A” in the first node is replaced by “B.” Even
though no one did anything to explicitly change “A” into “B,” it will take on this
value as it becomes consistent with the other node.
■ Eventual consistency—The database will become consistent over time. In the web
shop example, the table is sold twice, which results in data inconsistency. Once
the connection between the individual nodes is reestablished, they’ll communicate
and decide how to resolve it. This conflict can be resolved, for example, on
a first-come, first-served basis or by preferring the customer who would incur
the lowest transport cost. Databases come with default behavior, but given that
there’s an actual business decision to make here, this behavior can be overridden.
Even if the connection is up and running, latencies might cause nodes to
become inconsistent
Advantages of NoSQL
 Can be used as a primary or analytic data source
 Big data capability
 No single point of failure
 Easy replication
 No need for a separate caching layer to store data
 Provides fast performance and horizontal scalability
 Can handle structured, semi-structured, and unstructured data with equal effect
 Supports object-oriented programming, which is easy to use and flexible
 NoSQL databases don’t need a dedicated high-performance server
 Support for key developer languages and platforms
 Simpler to implement than an RDBMS
 Can serve as the primary data source for online applications
 Handles big data, managing data velocity, variety, volume, and complexity
 Excels at distributed database and multi-data-center operations
 Offers a flexible schema design which can easily be altered without downtime or
service disruption
Disadvantages of NoSQL
 No standardization rules
 Limited query capabilities
 RDBMS databases and tools are comparatively mature
 Does not offer traditional database guarantees, such as consistency when multiple
transactions are performed simultaneously
 When the volume of data increases, it becomes difficult to maintain unique keys
 Doesn’t work as well with relational data
 The learning curve is steep for new developers
 Being open source, the options are not yet as popular with enterprises
