Big Data Analytics Notes

The document provides comprehensive notes on Big Data Analytics, covering definitions, types of data (structured, unstructured, semi-structured), characteristics of Big Data, and various analytics types. It also discusses databases, including relational and non-relational databases, their advantages, and the differences between file systems and database management systems. Additionally, it introduces Hadoop, its ecosystem, and components such as HDFS and YARN, along with concepts related to data processing and machine learning.


Chapter 1
Revision Questions
What is Big Data?
Give an example of structured data
Give an example of unstructured data
Give an example of semi-structured data
5 characteristics of Big Data
3 types of analytics
What is Big Data Analytics?
What is Big Data streaming?
Explain the volume characteristic of data

Types of data
Data
- Data is defined as information that is stored in or used by a computer.
- Data is information that has been translated into a form that is efficient for
movement, storage or processing by a computer.
- Humans generate data when they write a document, play a video game, send
emails, etc.
Big Data
Big data is a collection of huge amounts of information that is hard to analyze or process
with traditional data management tools.

Structured Data
It is quantitative, highly organized, and easy to analyze using analytics software. It is
formatted into systems that have a regular design, fitting into set rows, columns and tables.
Ex: Excel spreadsheet files
Unstructured Data
It is information that has no set organization and doesn't fit into a defined framework.
Examples of unstructured data are video, audio, images, and all manner of text: reports,
email, social media posts.
Semi-structured Data
It is a combination of structured and unstructured data and shares characteristics of both.
Examples of semi-structured data include JavaScript Object Notation (JSON) and XML.
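As a quick illustration, a semi-structured JSON record can be parsed with Python's standard json module (the field names and values below are invented for the example):

```python
import json

# A small JSON document: self-describing keys give it structure,
# while values like the free-form bio text stay unstructured.
record = '''
{
  "user": "alice",
  "followers": 1024,
  "bio": "Coffee lover. Occasional blogger."
}
'''

data = json.loads(record)  # parse the JSON text into a Python dict
print(data["user"], data["followers"])
```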

Characteristics of Big Data


The 5 Vs
Volume – The size and amount of data that companies manage and analyze.
Variety – The diversity and range of different data types.
Veracity – The "truth" or accuracy of data and information assets.
Value – The value of data, usually measured from the perspective of business benefit.
Velocity – The speed at which companies receive, store and manage data.

Data Analytics
Data Analytics is the scientific process of transforming data into insight for making better
decisions and offering new opportunities for a competitive advantage.

Types of Analytics
- Descriptive Analytics – Analyzes historical data to learn about what is happening in a
business, past and present. Ex: "just right" movie recommendations on Netflix.
- Predictive Analytics – Uses past and present data to forecast and create models,
allowing a business to make predictions about the future. Ex: forecasting next quarter's
sales from historical trends.
- Prescriptive Analytics – Uses data modeling and forecasting to test the likely
outcome of different actions based on data. Ex: airline ticket prices changing every hour.

What and Why Big Data Analytics


- Big data analytics is the use of advanced analytic techniques against very large, diverse
data sets that include structured, unstructured and semi-structured data types, at
sizes from petabytes onwards.
- Big data analytics helps organizations harness their data and use it to identify new
opportunities. That in turn leads to smarter business moves, more efficient operations,
higher profits and happier customers.
Big data analytics tools:
- Hadoop
Helps in storing and analyzing data
- Cassandra
A distributed database used to handle large sets of data.
- Spark
Used for real-time processing and analyzing large amounts of data.

Big Data Streaming


Big data streaming is a process in which big data is quickly processed in order to extract real-
time insights from it. The data being processed is called data in motion. Ex: location data, stock
prices, IT system monitoring.
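The "data in motion" idea can be sketched with a plain Python generator that keeps a running average over a stream of incoming stock prices (the prices are invented for the demo; real streaming engines like Spark work at far larger scale):

```python
def running_average(stream):
    """Yield the average of all values seen so far, one result per incoming value."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # an insight emitted while data is still arriving

# Simulated stream of stock prices arriving one at a time.
prices = [10.0, 12.0, 11.0, 13.0]
averages = list(running_average(prices))
print(averages)  # each entry is the average up to that point in the stream
```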

Chapter 2 & 14 Excluded


Chapter 3
Describe the definition of a database
What is the software used to manage collected data?
What is the usage of a DBMS?
State 2 types of databases available
What is a non-relational database?
Give 5 advantages of relational databases
State the name for each character of ACID
State 4 different types of NoSQL database systems
Give 5 advantages of non-relational databases
State the name for each character of BASE
What is a document database?
What is the difference between a key-value database and a relational database?

Database
It is an organized collection of data that can be stored and accessed from a computer system. It
stores and accesses data electronically, meaning the data is stored as a file or a set of files on
magnetic disk or tape, optical disc, or some other secondary storage device.
The purpose of storing the data this way is that it can be easily accessed, modified, protected and
analysed. Examples of applications using databases: WeChat, Facebook, Google Drive.

The software used to manage a database is called a Database Management System (DBMS).
DBMS
It enables users to enter commands in specific languages to perform various data-processing
operations on a database, including storage, retrieval, modification and deletion of data.
Examples of DBMSs include MySQL, MongoDB, PostgreSQL, Cassandra,
etc.

Types of Database:
Relational
It is a collection of information that organizes data in predefined relationships, where data is
stored in one or more tables of columns and rows. Structured data is stored in relational
databases.
The advantage of a relational database is that it easily categorizes and stores data.
Examples of RDBMSs:
Oracle, MySQL, SQLite, PostgreSQL
SQL, which stands for Structured Query Language, is the query language used to communicate
with a relational database.
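A minimal sketch of talking to a relational database with SQL, using Python's built-in sqlite3 module (the table, column names and data are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns.
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade REAL)")
cur.executemany("INSERT INTO students (name, grade) VALUES (?, ?)",
                [("Aman", 88.5), ("Leyla", 92.0)])

# SQL query: retrieve only the rows matching a condition.
cur.execute("SELECT name FROM students WHERE grade > 90")
top = [row[0] for row in cur.fetchall()]
print(top)
```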

5 Advantages of Relational Database


- Simple Model
- Data Accuracy
- Ease of Access to data
- Flexibility
- Speed
In a relational database, transactions must be Atomic, Consistent, Isolated and Durable, which
is known as ACID.
Atomicity – The entire transaction takes place at once or doesn't happen at all.
Consistency – The dataset must be consistent before and after the transaction.
Isolation – Multiple transactions occur independently without interference.
Durability – The changes made to the database by a successful transaction are saved, even
if a system failure occurs.
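Atomicity in particular can be demonstrated with sqlite3: if a transaction fails partway through, rolling it back leaves the table exactly as it was (the account names and balances are invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

try:
    # Transfer 30 from alice to bob: either both updates happen, or neither does.
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE name = 'alice'")
    # ...the matching credit to bob would go here, but the "system" crashes first:
    raise RuntimeError("simulated crash mid-transaction")
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # both balances are unchanged
```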

Non-relational
Non-relational databases differ from traditional relational databases in that they store their data
in a non-tabular form (they don't use the tabular scheme of rows and columns).
Structured, unstructured and semi-structured types of data can all be stored in a non-relational
database. Non-relational databases are sometimes referred to as NoSQL, which stands for Not Only SQL.

4 types (structures) of NoSQL databases


Document database
Graph database
Key-Value database
Wide Column Database

Advantages of Non-Relational Database


- High Scalability
- High Availability
- Big Data Capability
- Fast Performance
- Easy replication

To provide a flexible and fluid way to manipulate data, non-relational databases apply the
BASE model: Basically Available, Soft state and Eventually consistent.
- Basically Available – NoSQL databases ensure availability of data by spreading
and replicating it across the nodes of the database cluster.
- Soft State – Because they also deal with unstructured data whose values change over
time, the state of the system will also change over time.
- Eventually Consistent – The system will become consistent over time,
once inputs stop updating at some point.
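Eventual consistency can be sketched as a toy simulation: a write is acknowledged by one replica first, and a later sync step brings the other replicas up to date (the node names and the sync policy are invented; real NoSQL systems replicate far more carefully):

```python
# Three replicas of the same key-value store.
replicas = {"node1": {}, "node2": {}, "node3": {}}

def write(node, key, value):
    """A write is acknowledged as soon as one replica has it (basically available)."""
    replicas[node][key] = value

def sync():
    """Background replication: merge every replica's data into all the others."""
    merged = {}
    for store in replicas.values():
        merged.update(store)
    for store in replicas.values():
        store.update(merged)

write("node1", "x", 42)
stale = replicas["node2"].get("x")  # other replicas haven't seen the write yet
sync()                               # ...but the system becomes consistent eventually
fresh = replicas["node2"].get("x")
print(stale, fresh)
```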

Chapter 4
What is a file system?
Provide 1 difference between a file system and a DBMS
What is a distributed file system?
State 4 features of DFS
What is the usage of the Google File System?
What are the three main entities of the GFS architecture?
What is Hadoop?
What is the Hadoop ecosystem?
What are the three main components of Hadoop?
What is HDFS?
What is YARN?
What is MapReduce?
What are the 2 types of components available in YARN?
What type of architecture is followed by HDFS and YARN?

FILE SYSTEM
A file system is software that manages and organizes the files in a storage medium like a hard
disk, pen drive or DVD. It helps you to organize the data and allows easy retrieval of files when
they are required. It mostly consists of different types of files, like mp3, mp4 and txt, that are
grouped into directories. It handles the way data is read from and written to the storage.

Difference between File System and DBMS


File system:
- A file system doesn't have a crash recovery mechanism
- Does not provide support for complicated transactions
- A file system is software that controls how data is stored and retrieved
DBMS:
- A DBMS is software for accessing, creating and managing databases
- A DBMS provides a crash recovery mechanism
- Easy to implement complicated transactions

Distributed File System:


It is a file system that is distributed across multiple file servers or multiple locations, allowing
programs to access files from any networked computer. Ex: a collection of workstations
and mainframes connected by a LAN is a configuration of a DFS.

4 features of DFS
- Scalability – It can work across multiple servers and can scale out by adding more
machines.
- Data integrity – As multiple users frequently share a file system, the integrity of data
saved in a shared file must be guaranteed by the file system.
- Fault tolerance – It enables the system to continue operating in the event of
the failure of some of its servers or disks.
- High reliability – A file system should create backup copies of key files that can be
used if the originals are lost.

Google File System


GFS is a scalable distributed file system created by Google. It is used to accommodate Google's
expanding data processing requirements. It provides low cost, fault tolerance, reliability,
scalability, availability and performance to large networks and connected nodes.
It is made up of low-cost commodity hardware components.

How GFS works?


GFS provides users access to the basic file commands. These include commands like
open, create, read, write and close, along with special commands like append and
snapshot. Append allows clients to add information to an existing file without overwriting
previously written data. Snapshot is a command that creates a quick copy of a computer's
contents.

Problem:
Accessing and manipulating files would take up a lot of the network's bandwidth.

Solution:
GFS addresses this problem by breaking files up into chunks of 64 megabytes each.
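The chunking step can be sketched in a few lines of Python; a tiny 10-byte chunk size stands in for GFS's 64 MB:

```python
CHUNK_SIZE = 10  # GFS uses 64 MB; a tiny size keeps the demo readable

def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Break a file's contents into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]

payload = b"this file is split into fixed-size chunks"
chunks = split_into_chunks(payload)
print(len(chunks), [len(c) for c in chunks])
```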
In the GFS architecture, there are 3 main entities:
- Client – It can be a computer or application that makes file requests. Requests can range
from retrieving and manipulating existing files to creating new files on the system.
- Master Server (only one) – It is the coordinator for the cluster. It keeps track of
metadata, which is the information that describes the chunks. The metadata tells the master
server to which file each chunk belongs and where it fits within the overall file.
- Chunk Servers – They are the workhorses of GFS; they store the 64 MB file chunks.
The chunk servers don't send chunks through the master server; instead, they send the
requested chunks directly to the client. GFS copies every chunk multiple times and stores
the copies on different chunk servers; the default is 3 copies. Each copy is called a replica.
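The master server's replica bookkeeping can be sketched as a mapping from chunk ids to 3 chunk servers (the server names and the round-robin placement policy are invented; real GFS placement also considers rack layout and disk usage):

```python
import itertools

chunk_servers = ["cs1", "cs2", "cs3", "cs4", "cs5"]
REPLICATION = 3  # GFS default: 3 replicas per chunk

def place_replicas(chunk_ids, servers, copies=REPLICATION):
    """Round-robin each chunk onto `copies` servers (toy placement policy)."""
    ring = itertools.cycle(servers)
    placement = {}
    for cid in chunk_ids:
        placement[cid] = [next(ring) for _ in range(copies)]
    return placement

placement = place_replicas(["chunk-0", "chunk-1"], chunk_servers)
print(placement)
```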
Hadoop
Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes. Instead of using one large
computer to store and process the data, Hadoop allows clustering multiple computers to
analyze massive datasets in parallel more quickly.
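The split-then-merge idea behind Hadoop's parallelism can be sketched with a toy word count: each "node" counts words in its own slice, and the partial counts are merged (this is plain single-machine Python, a stand-in for what Hadoop does across a cluster):

```python
from collections import Counter

def map_count(text_slice):
    """Map step: each node counts words in its own slice of the data."""
    return Counter(text_slice.split())

def reduce_merge(partials):
    """Reduce step: merge the partial counts from every node."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# The dataset is split across "nodes"; each slice is processed independently.
slices = ["big data big", "data big analytics"]
totals = reduce_merge(map_count(s) for s in slices)
print(totals["big"], totals["data"], totals["analytics"])
```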

Hadoop ecosystem
The Hadoop ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems.

Hadoop Core Components


There are 2 Hadoop core components
- HDFS
- YARN

HDFS
HDFS is a distributed file system which lets users store different types of large data sets
(structured, unstructured and semi-structured).
It is able to handle large data sets running on commodity hardware. It follows a master/slave
architecture, where the cluster comprises a single NameNode (master) and all other nodes are
DataNodes (slaves).
1) Name Node
- It is the master node and it doesn't store the actual data
- It contains metadata, like a log file or a table of contents
- It requires less storage but high computational resources
2) Data Node
- It is a slave node and it stores the actual data
- It can be thought of as commodity hardware (like your laptops and desktops) in the distributed
environment
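The NameNode/DataNode split can be sketched as a toy in Python: the NameNode holds only metadata (which blocks make up each file and where they live), while the DataNodes hold the actual bytes (all names and contents are invented):

```python
# NameNode: metadata only -- which blocks make up each file, and where they live.
namenode = {"report.txt": [("blk_1", "datanode1"), ("blk_2", "datanode2")]}

# DataNodes: the actual block contents.
datanodes = {
    "datanode1": {"blk_1": b"first half "},
    "datanode2": {"blk_2": b"second half"},
}

def read_file(name):
    """Ask the NameNode where the blocks are, then fetch them from the DataNodes."""
    return b"".join(datanodes[node][blk] for blk, node in namenode[name])

content = read_file("report.txt")
print(content)
```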

YARN
YARN can be considered the brain of the Hadoop ecosystem. It performs all processing
activities by allocating resources and scheduling tasks.
It also allows the data stored in HDFS to be processed and run by various data processing
engines, such as batch processing, stream processing, interactive processing, graph processing
and many more.
It has two major components:
- Resource Manager
The master node of YARN.
It is used for job scheduling.
- Node Manager
It is a slave node.
It is used to monitor each container's resource usage and to report it to the
Resource Manager.
It takes care of each node in the cluster while managing the workflow, along with
user jobs on a particular node, and keeps the Resource Manager up to date.
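The Resource Manager's scheduling role can be sketched as a toy in Python: tasks go to whichever node manager reports the most free capacity (the node names, capacities and policy are invented; YARN's real schedulers are far more sophisticated):

```python
# Node managers report their free "slots" to the resource manager.
node_capacity = {"nm1": 2, "nm2": 1}

def schedule(tasks, capacity):
    """Toy scheduler: assign each task to the node with the most free slots."""
    assignment = {}
    free = dict(capacity)
    for task in tasks:
        node = max(free, key=free.get)  # pick the least-loaded node
        if free[node] == 0:
            raise RuntimeError("cluster is full")
        assignment[task] = node
        free[node] -= 1
    return assignment

plan = schedule(["map-1", "map-2", "reduce-1"], node_capacity)
print(plan)
```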

Chapter 5
MapReduce (big chapter)

Chapter 6
Apache Spark

Chapter 7
Resilient Distributed Dataset (RDD) – very small chapter

Chapter 8
Spark SQL and DataFrames
Small chapter

Chapter 9
Data Cleaning and Data Transformation

Chapter 10
Basics of Machine Learning
Basics only
Chapter 11
ML (part 2)
Only one page

Chapter 12
ML part 3
Not focused much

Chapter 13
Spark GraphFrames
- What are the functions?
- Advantages

Chapter 14 (Hadoop Ecosystem much deeper)

Chapter 15 Case Studies (big)


Data Duplication and Big Data Analytics Case Studies
