Data Science
Data Engineer:-
design
build, arrange data
programming
System
Data Scientist:-
Test
Analyse, Translate data
analytics
Both:- curious, judgemental, debate WALSN
Statistics: inference about the whole population from a single sample
Data today
Structured:- enterprise and office data (numbers, dates, strings)
Unstructured:- videos, audio, emails; needs very large storage; high low data will make data
collection
HADOOP AND OOP BASIC
ENORMOUS DATA SET (100110000101010) -> cut into slices, each slice a unique individual data set
each slice is given to an individual computer
single program on each computer (mapper) -> primary results -> results are sorted -> secondary process:
reduce process (map-reduce)
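The slice -> map -> sort -> reduce flow above can be sketched in plain Python. This is only an illustration on one machine, not Hadoop code: the function names (slice_data, mapper, shuffle_and_sort, reducer) and the word-count task are assumptions chosen for the example.

```python
from collections import defaultdict

def slice_data(records, n_slices):
    """Cut the data set into n_slices slices, one per (imaginary) computer."""
    return [records[i::n_slices] for i in range(n_slices)]

def mapper(data_slice):
    """Primary results: emit a (word, 1) pair for every word in the slice."""
    return [(word, 1) for line in data_slice for word in line.split()]

def shuffle_and_sort(mapped_slices):
    """Group the primary results by key and sort the keys."""
    grouped = defaultdict(list)
    for pairs in mapped_slices:
        for key, value in pairs:
            grouped[key].append(value)
    return sorted(grouped.items())

def reducer(key, values):
    """Secondary process: combine all values that share one key."""
    return key, sum(values)

def map_reduce(records, n_slices=3):
    slices = slice_data(records, n_slices)
    mapped = [mapper(s) for s in slices]  # on a cluster, each call runs on its own node
    return dict(reducer(k, v) for k, v in shuffle_and_sort(mapped))

lines = ["big data big cluster", "data node", "big node"]
word_counts = map_reduce(lines)
print(word_counts)
```

On a real cluster the mapper calls run on different machines and the framework does the sorting and grouping; here everything runs in one process so the steps are easy to follow.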
SCRAPING
for data no
method
Python with Beautiful Soup
Hadoop core
HDFS:
1. Storage unit -> HDFS
space storage
Block A   Block B   Block C   Block D   Block E
128 MB    128 MB    128 MB    128 MB    88 MB
Total: 600 MB
Copies of each block are stored on multiple systems
Data is not lost when a single system fails
Data replication method
If one of the nodes is lost, its blocks can be recovered from the copies on other nodes
The data survives even if one data block crashes, making HDFS fault-tolerant
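The block split and the replication idea can be checked with a small Python sketch. The round-robin placement, the node names, and the helper functions are assumptions made for illustration; real HDFS placement is rack-aware and more involved. The replication factor of 3 is the usual HDFS default.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the block sizes for a file; the last block may be smaller."""
    full, rest = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full + ([rest] if rest else [])

def place_replicas(n_blocks, nodes, replication=3):
    """Toy round-robin placement: each block is copied to `replication` nodes."""
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(n_blocks)
    }

def surviving_copies(placement, failed_node):
    """Copies of every block that remain after one node is lost."""
    return {
        block: [n for n in holders if n != failed_node]
        for block, holders in placement.items()
    }

blocks = split_into_blocks(600)                   # the 600 MB file from the notes
nodes = ["node1", "node2", "node3", "node4"]      # hypothetical cluster
placement = place_replicas(len(blocks), nodes)
after_failure = surviving_copies(placement, "node2")

print(blocks)
print(all(after_failure.values()))  # True means every block still has a copy
```

With four nodes and three copies per block, losing any single node still leaves at least two copies of every block.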
2. MapReduce
Moves from the traditional (single-machine) processing method to parallel processing
Advantages
-load balancing
-reduced processing time
3. YARN (resource manager)
Metadata:- data about data
HDFS features
-distributed
-scalable
Scaling
horizontal scaling
vertical scaling
-cost effective
-fault-tolerant
-high throughput
Latency: time to get the first record
Throughput: number of records processed per unit of time
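The two definitions can be turned into a small calculation. The function name and the sample timings below are made up for illustration.

```python
def latency_and_throughput(start, completion_times):
    """completion_times: sorted times (in seconds) at which each record was ready."""
    latency = completion_times[0] - start                 # time to get the first record
    elapsed = completion_times[-1] - start
    throughput = len(completion_times) / elapsed          # records per second
    return latency, throughput

# Hypothetical job: starts at t=0, first record ready after 2 s, five records in 4 s.
latency, throughput = latency_and_throughput(0.0, [2.0, 2.5, 3.0, 3.5, 4.0])
print(latency, throughput)
```

A system can have high latency and still have high throughput, which is the trade-off HDFS makes for large batch reads.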
replication factor
each block in deposit
Block (160)
64
64
32
MapReduce execution program
Interview questions
What are the main features of HDFS?
Fault tolerance, high throughput, suitability for handling large data sets, etc.
What is a daemon?
A daemon is a process that runs in the background.
Components
-multi-tenancy
-cluster utilisation
-compatibility
-scalability
Master daemon - ResourceManager - maximises the cluster utilisation
Slave daemon - NodeManager
ApplicationMaster:- asks the RM for available containers to run the process
Scheduler:- fair scheduler
- capacity scheduler
knowledge
Is YARN a replacement for Hadoop MapReduce?
YARN is not a replacement for MapReduce. It is a more powerful and efficient resource
management layer that supports MapReduce and other distributed computing frameworks,
and it is also referred to as MapReduce version 2 (MRv2).
Why YARN?
With older versions of Hadoop, you were limited to executing MapReduce jobs only.
What are the YARN responsibilities?
Creating containers, monitoring containers that are running, etc.
What are the key components of YARN?
ResourceManager, NodeManager, ApplicationMaster, Container
What is ResourceManager in YARN?
Scheduler - responsible for allocating resources.
ApplicationsManager - accepts job submissions, starts execution of the application, etc.
What is ApplicationMaster in YARN?
It performs any application-specific work.
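How the pieces fit together can be sketched as a toy model in Python. None of this is the YARN API: the class names only mirror the roles in the notes, and the first-fit allocation is an assumption that stands in for the real fair and capacity schedulers.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    """Slave daemon: owns the resources of one machine."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

class ResourceManager:
    """Master daemon: its scheduler hands out containers across the cluster."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        # Toy first-fit policy: use the first node with enough free memory.
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.free_memory_mb -= memory_mb
                container = (app_id, node.name, memory_mb)
                node.containers.append(container)
                return container
        return None  # no capacity left; a real scheduler would queue the request

class ApplicationMaster:
    """Per-application process that asks the RM for containers."""
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def request_containers(self, count, memory_mb):
        granted = [self.rm.allocate(self.app_id, memory_mb) for _ in range(count)]
        return [c for c in granted if c is not None]

rm = ResourceManager([NodeManager("nm1", 2048), NodeManager("nm2", 1024)])
am = ApplicationMaster("app_1", rm)
granted = am.request_containers(count=4, memory_mb=1024)
print(granted)  # only three containers fit in this small cluster
```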
MapReduce
Map: you count shelf #1, I count shelf #2; the more people, the faster it goes
Key-value pair - intermediate output of map
(key, value)
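The shelf-counting picture can be written as a short Python sketch, with a thread pool standing in for the people. The shelf names and book counts are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical library: three shelves with different numbers of books.
shelves = {
    "shelf_1": ["book"] * 12,
    "shelf_2": ["book"] * 7,
    "shelf_3": ["book"] * 9,
}

def count_shelf(item):
    """Map: one person counts one shelf and emits a (key, value) pair."""
    shelf_name, books = item
    return shelf_name, len(books)

# Each worker counts a different shelf at the same time.
with ThreadPoolExecutor(max_workers=3) as pool:
    intermediate = list(pool.map(count_shelf, shelves.items()))

# Reduce: get together and add up the individual counts.
total_books = sum(count for _, count in intermediate)

print(intermediate)
print(total_books)
```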
knowledge test
What is Hadoop MapReduce?
Hadoop MapReduce is the framework used for processing large data sets in parallel
across a Hadoop cluster.
zahid
MAPREDUCE
SQL (query)
Standard language for communicating with a database
CREATE DATABASE
USE
Data types
Numeric data type
INT
FLOAT
DECIMAL
Non-numeric data type
CHAR
VARCHAR
ENUM
BOOLEAN
E.g.
-- ORDER is a reserved word, so the table is named orders
CREATE TABLE orders (
    id INT AUTO_INCREMENT PRIMARY KEY,
    product_id INT,
    customer_id INT,
    ordertime DATETIME,
    FOREIGN KEY (product_id) REFERENCES product(id),
    FOREIGN KEY (customer_id) REFERENCES customer(id)
);
QUERY
1. SELECT
2. SHOW
3. DROP TABLE, DELETE
4. ALTER - ADD, DROP, CHANGE, MODIFY
5. DESCRIBE
6. INSERT INTO table_name (column) VALUES ('value')
7. WHERE
8. AND/OR
9. BETWEEN
10. IS NULL
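Several of the statements in this list can be tried out with Python's built-in sqlite3 module. This is only a sketch: the product table and its rows are invented, and SQLite is used because it needs no server. SQLite does not support SHOW or DESCRIBE, so those are not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database

# CREATE TABLE with a hypothetical product table
conn.execute(
    "CREATE TABLE product ("
    " id INTEGER PRIMARY KEY,"
    " name VARCHAR(50),"
    " price DECIMAL(8,2),"
    " note VARCHAR(50))"
)

# INSERT INTO table_name (columns) VALUES (...)
conn.executemany(
    "INSERT INTO product (name, price, note) VALUES (?, ?, ?)",
    [("pen", 1.5, None), ("book", 12.0, "hardcover"),
     ("lamp", 30.0, None), ("desk", 120.0, "oak")],
)

# SELECT ... WHERE ... BETWEEN
between = conn.execute(
    "SELECT name FROM product WHERE price BETWEEN 10 AND 50 ORDER BY name"
).fetchall()

# WHERE ... IS NULL combined with AND
null_notes = conn.execute(
    "SELECT name FROM product WHERE note IS NULL AND price > 5"
).fetchall()

# WHERE ... OR
either = conn.execute(
    "SELECT name FROM product WHERE price < 2 OR note = 'oak' ORDER BY id"
).fetchall()

print(between, null_notes, either)
conn.close()
```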
Add foreign key
ALTER TABLE name
ADD CONSTRAINT FRKY_id
FOREIGN KEY (FRKY_id) REFERENCES name(id);
SQL JOINS
Inner join
Left join
Right join
Full outer join
Inner join
SELECT customers.customer_id, orders.order_id, orders.order_date
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id
ORDER BY customers.customer_id;
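The inner join can be run end to end with Python's built-in sqlite3 module. The sample customers and orders are invented, and the column types are an assumption since the notes do not give them. A left join is included for comparison, to show the customer that the inner join leaves out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date TEXT,
    FOREIGN KEY (customer_id) REFERENCES customers(customer_id)
);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Chen');
INSERT INTO orders VALUES
    (10, 1, '2024-01-05'), (11, 1, '2024-02-10'), (12, 3, '2024-03-01');
""")

# Inner join: only customers that have at least one order
inner_rows = conn.execute("""
    SELECT customers.customer_id, orders.order_id, orders.order_date
    FROM customers
    INNER JOIN orders
    ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
""").fetchall()

# Left join: every customer, with NULL (None) where no order exists
left_rows = conn.execute("""
    SELECT customers.customer_id, orders.order_id
    FROM customers
    LEFT JOIN orders
    ON customers.customer_id = orders.customer_id
    ORDER BY customers.customer_id, orders.order_id
""").fetchall()

print(inner_rows)
print(left_rows)
conn.close()
```

Customer 2 has no orders, so it is missing from the inner join result and appears with an empty order_id in the left join result.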
APACHE HIVE