Big Data Analytics Notes
Chapter 1
Revision Questions
What is Big Data?
Give an example of structured data.
Give an example of unstructured data.
Give an example of semi-structured data.
What are the 5 characteristics of Big Data?
What are the 3 types of analytics?
What is Big Data Analytics?
What is Big Data Streaming?
Explain the volume characteristic of Big Data.
Types of data
Data
- Data is defined as information that is stored in or used by a computer.
- Data is information that has been translated into a form that is efficient for
movement, storage, or processing by a computer.
- Humans generate data when they write a document, play video games, send
emails, etc.
Big Data
Big data is a collection of huge amounts of information that is hard to analyze or process
with traditional data management tools.
Structured Data
It is quantitative, highly organized, and easy to analyze using analytics software. It is
formatted into systems that have a regular design, fitting into set rows, columns, and tables.
Ex: Excel spreadsheet files
Unstructured Data
It is information that has no set organization and doesn’t fit into a defined framework.
Examples of unstructured data are video, audio, images, and all manner of text: reports,
emails, social media posts.
Semi-structured Data
It is a combination of structured and unstructured data and shares characteristics of both.
Examples of semi-structured data include JavaScript Object Notation (JSON) and XML.
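As a rough illustration of the difference, the sketch below reads a structured table (CSV rows with fixed columns) and a semi-structured JSON list whose records do not all share the same fields. The sample records and field names are made up for illustration.

```python
import csv
import io
import json

# Structured: every row has the same columns (hypothetical sample data).
csv_text = "name,age\nAli,21\nMei,23\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: each record carries its own field names, and the
# fields may differ from one record to the next.
json_text = '[{"name": "Ali", "age": 21}, {"name": "Mei", "emails": ["mei@example.com"]}]'
records = json.loads(json_text)

# Every CSV row has exactly the columns named in the header.
structured_columns = [sorted(row.keys()) for row in rows]
# JSON records can have different keys from one another.
semi_structured_keys = [sorted(rec.keys()) for rec in records]

print(structured_columns)
print(semi_structured_keys)
```

Unstructured data (for example a raw video file) has neither fixed columns nor named fields, so it cannot be read this way without extra processing.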
Data Analytics
Data Analytics is the scientific process of transforming data into insight for making better
decisions and offering new opportunities for a competitive advantage.
Types of Analytics
- Descriptive Analytics – Analyzes historical data to learn about what has happened and
is happening in a business. Ex: Just-right movie recommendations on Netflix
- Predictive Analytics – Uses past and present data to create models and forecasts,
allowing businesses to make predictions about the future. Ex:
- Prescriptive Analytics – Uses data modeling and forecasting to test the likely
outcome of different actions based on data. Ex: Airline prices changing every hour.
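The three types can be contrasted on one tiny dataset. This is only a sketch: the sales figures are made up, the straight-line fit is one simple choice of predictive model, and the "uplift" factors for each action are assumptions rather than measured values.

```python
from statistics import mean

# Hypothetical monthly sales figures (made-up data for illustration).
months = [1, 2, 3, 4]
sales = [100.0, 110.0, 120.0, 130.0]

# Descriptive: summarize what has already happened.
average_sales = mean(sales)

# Predictive: fit a straight line by least squares and forecast month 5.
mx, my = mean(months), mean(sales)
slope = sum((x - mx) * (y - my) for x, y in zip(months, sales)) / sum(
    (x - mx) ** 2 for x in months
)
intercept = my - slope * mx
forecast_month_5 = intercept + slope * 5

# Prescriptive: compare candidate actions by their estimated outcome.
# The multipliers below are assumed effects, not real measurements.
actions = {"no_change": 1.00, "discount": 1.10, "price_increase": 0.95}
best_action = max(actions, key=lambda a: forecast_month_5 * actions[a])

print(average_sales, forecast_month_5, best_action)
```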
Database
It is an organized collection of data that can be stored in and accessed from a computer system. It
stores and accesses data electronically, meaning that the data is stored as a file or a set of files on
magnetic disk or tape, optical disc, or some other secondary storage device.
The purpose of storing the data is so that it can be easily accessed, modified, protected, and
analysed. Examples of applications that use a database: WeChat, Facebook, Google Drive.
The software used to manage a database is called a Database Management System (DBMS).
DBMS
It enables users to enter commands in specific languages to perform various data-processing
operations on a database, including storage, retrieval, modification, and deletion of data.
Examples of DBMS include MySQL, MongoDB, PostgreSQL, Cassandra, etc.
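The four operations can be tried out with Python's built-in sqlite3 module, used here only as a convenient stand-in DBMS because it needs no server. The table and column names are hypothetical.

```python
import sqlite3

# In-memory SQLite database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")

# Storage: insert rows.
conn.executemany(
    "INSERT INTO students (name, score) VALUES (?, ?)",
    [("Ali", 70), ("Mei", 85)],
)

# Modification: update an existing row.
conn.execute("UPDATE students SET score = ? WHERE name = ?", (75, "Ali"))

# Deletion: remove a row.
conn.execute("DELETE FROM students WHERE name = ?", ("Mei",))

# Retrieval: query what remains.
remaining = conn.execute("SELECT name, score FROM students ORDER BY id").fetchall()
print(remaining)
conn.close()
```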
Types of Database:
Relational
It is a collection of information that organizes data in predefined relationships, where data is
stored in one or more tables of columns and rows. Structured data is stored in a relational
database.
The advantage of a relational database is that it easily categorizes and stores data.
Examples of RDBMS:
Oracle, MySQL, SQLite, PostgreSQL
SQL, which stands for Structured Query Language, is the query language used to communicate
with a relational database.
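A "predefined relationship" between tables usually means that one table refers to rows of another by key. The sketch below, using SQLite through Python's sqlite3 module, links a hypothetical orders table to a hypothetical customers table and uses an SQL JOIN to combine them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two tables related through orders.customer_id -> customers.id
# (hypothetical schema and sample rows).
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(id),
        amount REAL
    );
    INSERT INTO customers (id, name) VALUES (1, 'Ali'), (2, 'Mei');
    INSERT INTO orders (customer_id, amount) VALUES (1, 20.0), (1, 5.0), (2, 12.5);
    """
)

# Total order amount per customer, following the relationship with a JOIN.
totals = conn.execute(
    """
    SELECT c.name, SUM(o.amount)
    FROM customers c JOIN orders o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY c.id
    """
).fetchall()
print(totals)
conn.close()
```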
Non-relational
It differs from a traditional relational database in that it stores data in a non-tabular
form (it doesn’t use a tabular schema of rows and columns).
Structured, unstructured, and semi-structured data can all be stored in a non-relational
database. Non-relational databases are sometimes referred to as NoSQL, which stands for Not only SQL.
Chapter 4
What is a file system?
Provide 1 difference between a file system and a DBMS.
What is a distributed file system (DFS)?
State 4 features of DFS.
What is the usage of Google File System (GFS)?
What are the three main entities of the GFS architecture?
What is Hadoop?
What is the Hadoop ecosystem?
What are the three main components of Hadoop?
What is HDFS?
What is YARN?
What is MapReduce?
What are the 2 types of components available in YARN?
What type of architecture is followed by HDFS and YARN?
FILE SYSTEM
A file system is software that organizes and manages the files in a storage medium such as a hard
disk, pen drive, or DVD. It helps you to organize the data and allows easy retrieval of files when
they are required. It mostly consists of different types of files, like mp3, mp4, and txt, that are
grouped into directories. It handles how data is read from and written to the storage.
4 features of DFS
- Scalability – It can work across multiple servers and can scale out by adding more
machines.
- Data integrity – As multiple users frequently share a file system, the integrity of data
saved in a shared file must be guaranteed by the file system.
- Fault tolerance – It enables a system to continue operating in the event of
the failure of some of its servers or disks.
- High reliability – A file system should create backup copies of key files that can be
used if the originals are lost.
Problem:
Accessing and manipulating such large files would take up a lot of the network’s bandwidth.
Solution:
GFS addresses this problem by breaking files up into chunks of 64 megabytes each.
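The chunking arithmetic can be sketched as follows. The 64 MB chunk size comes from the notes; the function name and the 200 MB file size are made up for illustration.

```python
from typing import List, Tuple

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated in the notes


def chunk_ranges(file_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets for each chunk; end is exclusive."""
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]


file_size = 200 * 1024 * 1024  # hypothetical 200 MB file
ranges = chunk_ranges(file_size)
chunk_sizes_mb = [(end - start) // (1024 * 1024) for start, end in ranges]
print(len(ranges), chunk_sizes_mb)
```

A 200 MB file therefore becomes three full 64 MB chunks plus one final 8 MB chunk.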
In the GFS architecture, there are 3 main entities:
- Client – It can be a computer or an application that makes file requests. Requests can range from
retrieving and manipulating existing files to creating new files on the system.
- Master Server (only one) – It is the coordinator for the cluster. It keeps track of
metadata, which is the information that describes chunks. The metadata tells the master
server which files the chunks belong to and where they fit within the overall file.
- Chunk Servers – They are the workhorses of GFS; they store the 64 MB file chunks.
The chunk servers don’t send chunks to the master server. Instead, they send
requested chunks directly to the client. GFS copies every chunk multiple times and
stores the copies on different chunk servers; the default is 3 copies. Each copy is
called a replica.
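The role of the master's metadata and of replicas can be sketched with a small simulation. The round-robin placement and the server names below are simplifying assumptions, not the real GFS placement policy; only the default of 3 replicas comes from the notes.

```python
REPLICAS = 3  # default replica count stated in the notes


def place_chunks(num_chunks, servers, replicas=REPLICAS):
    """Build master-side metadata: chunk id -> list of chunk servers.

    Uses simple round-robin placement (an assumption for illustration).
    """
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    n = len(servers)
    return {
        chunk_id: [servers[(chunk_id + r) % n] for r in range(replicas)]
        for chunk_id in range(num_chunks)
    }


def locate(metadata, chunk_id, failed=frozenset()):
    """Master-side lookup: which live servers can serve this chunk?"""
    return [s for s in metadata[chunk_id] if s not in failed]


servers = ["cs1", "cs2", "cs3", "cs4"]  # hypothetical chunk servers
metadata = place_chunks(4, servers)

# Even if cs1 fails, chunk 0 is still available from its other replicas.
live = locate(metadata, 0, failed={"cs1"})
print(metadata[0], live)
```

The master only answers "where is the chunk"; in this sketch the client would then fetch the data from one of the returned chunk servers.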
Hadoop
Apache Hadoop is an open-source framework that is used to efficiently store and process
large datasets ranging in size from gigabytes to petabytes of data. Instead of using one large
computer to store and process the data, Hadoop allows clustering multiple computers to
analyze massive datasets in parallel more quickly.
Hadoop ecosystem
The Hadoop ecosystem is neither a programming language nor a service; it is a platform or
framework which solves big data problems.
HDFS
HDFS is a distributed file system which allows users to store different types of large data sets
(structured, unstructured, and semi-structured data).
It is able to handle large data sets running on commodity hardware. It follows a master/slave
architecture, where a cluster comprises a single NameNode (master) and all other nodes are
DataNodes (slaves).
1) NameNode
- It is the master node and it doesn’t store the actual data.
- It contains metadata, much like a log file or a table of contents.
- It requires less storage and high computational resources.
2) DataNode
- It is a slave node and it stores the actual data.
- It can be thought of as commodity hardware (like your laptops and desktops) in the distributed
environment.
YARN
YARN can be considered the brain of the Hadoop ecosystem. It performs all your processing
activities by allocating resources and scheduling tasks.
It also allows the data stored in HDFS to be processed and run by various data processing
engines, such as batch processing, stream processing, interactive processing, graph processing,
and many more.
It has two major components:
- Resource Manager
It is the master node of YARN.
It is used for job scheduling.
- Node Manager
It is a slave node.
It is used to monitor the containers’ resource usage and to report it to the
Resource Manager.
It takes care of each node in the cluster while managing the workflow, along with
user jobs on a particular node.
It keeps its data in the Resource Manager up to date.
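The division of work between the two components can be sketched as a toy model. The class names mirror the component names above, but the methods, the memory-only resource model, and the "most free memory first" policy are assumptions for illustration; they are not YARN's real API or schedulers.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Slave node: tracks its own resources and running containers."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def report(self):
        """Report current free memory to the Resource Manager."""
        return self.free_memory_mb


class ResourceManager:
    """Master node: decides which node runs each task."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, task, memory_mb):
        """Place the task on the node reporting the most free memory.

        Returns the node name, or None if no node has enough memory.
        """
        node = max(self.nodes, key=lambda n: n.report())
        if node.report() < memory_mb:
            return None
        node.free_memory_mb -= memory_mb
        node.containers.append(task)
        return node.name


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
assignments = [rm.allocate(task, 2048) for task in ["t1", "t2", "t3", "t4"]]
print(assignments)
```

Once both nodes are full, the fourth task cannot be placed, which is why the Resource Manager depends on up-to-date reports from every Node Manager.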
Chapter 5
MapReduce (big chapter)
Chapter 6
Apache Spark
Chapter 7
Resilient Distributed Dataset (RDD) (very small chapter)
Chapter 8
Spark SQL and DataFrames
Small chapter
Chapter 9
Data Cleaning and Data Transformation
Chapter 10
Basics of Machine Learning
Basics only
Chapter 11
ML (part 2)
There is only one page
Chapter 12
ML part 3
Not focused much
Chapter 13
Spark GraphFrames
- What are the functions
- Advantages