
Data Science

The document discusses data science and AI topics like data engineering, data analysis, machine learning, Hadoop, HDFS, and SQL. It provides information on data types, MapReduce, YARN, and Hive and includes example queries and commands.

Uploaded by

ZAHID MOHD
Copyright
© All Rights Reserved

Data science & AI

Data engineer:-
Design
Build and arrange data
Programming
Systems
Data scientist:-
Test
Analyse and translate data
Analytics
Both:- curious, judgemental, debate WALSN
Statistics: inference about the whole from a single sample
Data today
Structured:- enterprise and office data (numbers, dates, strings)
Unstructured:- videos, audio, emails; very large storage; high low data will make data
collection
HADOOP AND OOP BASIC
An enormous data set (100110000101010) is sliced into unique individual data sets,
each given to an individual computer
Single program (mapper) -> primary results -> results are sorted -> secondary process
(reducer); together this is map-reduce
SCRAPING
For data with no other collection method
Python with Beautiful Soup

Hadoop core
HDFS:
1. Storage unit -> HDFS
Storage space is divided into blocks
Block A   Block B   Block C   Block D   Block E
128 MB    128 MB    128 MB    128 MB    88 MB
Total 600 MB
Copies of each block are stored on multiple systems
Data is not lost at any cost
Data replication method
If one of the nodes is lost, its blocks can be recovered from the copies on other nodes
Even if one data block crashes, the data survives, making HDFS fault-tolerant
2. MapReduce
Processing method that replaces the traditional processing method
Advantages
-load balancing
-reduced processing time
3. YARN (resource manager)
Metadata:- data about data
HDFS features
-Distributed
-Scalable
Scaling
horizontal scaling
vertical scaling
-Cost effective
-Fault-tolerant
-High throughput
Latency: time to get the first record
Throughput: number of records processed per unit of time
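The 600 MB example above can be checked with a short Python sketch. It assumes a 128 MB block size and a replication factor of 3; the function names are made up for illustration and are not part of any Hadoop API.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file would occupy."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever is left over.
        sizes.append(remainder)
    return sizes


def raw_storage_mb(file_size_mb, replication_factor=3):
    """Total cluster storage used once every block is replicated."""
    return file_size_mb * replication_factor


blocks = split_into_blocks(600)
print(blocks)               # [128, 128, 128, 128, 88]
print(raw_storage_mb(600))  # 1800
```

The last block is smaller than the block size because it holds only the remaining 88 MB.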

HDFS master/slave topology


Master node: NameNode
|
|
V

Slave node: DataNode


Each DataNode sends a signal (heartbeat) to the NameNode to show it is alive
DataNodes report all their block data to the NameNode; replication factor 3x

Secondary NameNode (buffer): Hadoop 1.x, 64 MB block size


Standby NameNode: Hadoop 2.x, 128 MB block size
Edit log and FsImage are used to reconstruct the metadata

Replication factor
number of copies kept of each block

Block (160 MB file, 64 MB block size)
64
64
32
MapReduce program execution

interview questions
What are the main features of HDFS?
Fault tolerance, high throughput, suitability for handling large data sets etc

What is meant by Data node?


A DataNode is the actual storage location; it serves read and write requests from clients.

What is daemon?
A daemon is a process that runs in the background.

What is meant by heartbeat in HDFS?


DataNodes send heartbeat signals to the NameNode periodically to inform it that they are
alive.
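A minimal Python sketch of the heartbeat idea, not the real NameNode code: the class name, the node names and the 30-second timeout are hypothetical values chosen only for the example.

```python
class HeartbeatMonitor:
    """Toy stand-in for the NameNode side of heartbeat tracking."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s  # hypothetical timeout, not a Hadoop default
        self.last_seen = {}         # node id -> time of last heartbeat

    def heartbeat(self, node_id, now_s):
        # Record that the node reported in at time now_s.
        self.last_seen[node_id] = now_s

    def dead_nodes(self, now_s):
        # A node counts as dead when its last heartbeat is older than the timeout.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now_s - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=30)
monitor.heartbeat("datanode-1", now_s=0)
monitor.heartbeat("datanode-2", now_s=0)
monitor.heartbeat("datanode-1", now_s=25)
print(monitor.dead_nodes(now_s=40))  # ['datanode-2']
```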

What is meant by 'block' in HDFS?


A block in HDFS is the minimum quantum of data for reading or writing.

Default block size?


Default block size is 64 MB in Hadoop 1.x and 128 MB in Hadoop 2.x and later

What type of data is processed by Hadoop?
Digital data

What is a rack in HDFS?


A rack is a group of DataNodes physically placed together in one location

Why is HDFS fault-tolerant?


HDFS is fault-tolerant because it replicates data on different DataNodes

Explain the architecture of HDFS?


Name node, data node, secondary name node

What is checkpointing in Hadoop?


Checkpointing is the process of combining the Edit Logs with the FsImage (File system
Image). It is performed by the Secondary NameNode.
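As a rough illustration of checkpointing, the Python sketch below replays an edit log onto a copy of the file system image. The dictionary layout and the "create"/"delete" operation names are assumptions made for the example; the real FsImage and edit log are binary files with many more operation types.

```python
def checkpoint(fsimage, edit_log):
    """Apply every logged edit to a copy of the image and return the new image."""
    new_image = dict(fsimage)
    for operation, path, blocks in edit_log:
        if operation == "create":
            new_image[path] = blocks
        elif operation == "delete":
            new_image.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return new_image


# Hypothetical metadata: file path -> list of block ids.
fsimage = {"/data/a.csv": ["blk_1", "blk_2"]}
edit_log = [
    ("create", "/data/b.csv", ["blk_3"]),
    ("delete", "/data/a.csv", None),
]

new_fsimage = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/data/b.csv': ['blk_3']}
```

After a checkpoint the edit log can start empty again, because everything it recorded is now in the new image.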

What is a NameNode in Hadoop?


The NameNode is the master node that manages all the DataNodes (slave nodes). It
records the metadata information regarding all the files stored in the cluster (on the
DataNodes), e.g. The location of blocks stored, the size of the files, permissions,
hierarchy, etc.

Difference between traditional RDBMS and Hadoop?


Data Types
Processing
Schema on Read Vs. Write
Read/Write Speed
Cost
Best Fit Use Case
YARN

Features
-multi-tenancy
-cluster utilisation
-compatibility
-scalability
Master daemon:- ResourceManager, maximises the cluster utilisation
Slave daemon:- NodeManager
Application manager:- asks the RM for available containers to run the process
Scheduler:- fair scheduler
- capacity scheduler
Knowledge test
Is YARN a replacement of Hadoop MapReduce?
YARN is not a replacement for MapReduce. It is a more powerful and efficient
resource-management layer that supports MapReduce and other distributed computing
frameworks, and it is also referred to as MapReduce version 2 (MRv2).
Why YARN?
With older versions of Hadoop, you were limited to executing MapReduce jobs only
What are the YARN responsibilities?
Creating containers, monitoring the containers that are running, etc.
What are the key components of YARN?
ResourceManager, NodeManager, ApplicationMaster, Container
What is ResourceManager in YARN?
Scheduler - The scheduler is responsible for allocating resources.
ApplicationManager - Accepting job-submissions, executing the application etc
What is ApplicationMaster in YARN?
Performs any application-specific work.

Map reduce
Map: you count shelf #1, I count shelf #2; the more people, the faster it goes
Key-value pair: intermediate map output
key value

Reducer: we all get together and add up our individual counts


Shuffle and sort
Splitting: input is split into 128 MB blocks
Mapping: just collecting key-value pairs
Shuffling and sorting: group by key
Mapper stage to reducer stage -> key-value pairs
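The map, shuffle and reduce stages above can be imitated on one machine with a word count. This is a plain-Python sketch of the idea, not Hadoop code; the three input strings stand in for input splits and are made up for the example.

```python
from collections import defaultdict


def map_phase(split):
    """Mapper: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]


def shuffle_and_sort(pairs):
    """Group all values by key and sort the keys, as the shuffle stage does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped):
    """Reducer: add up the individual counts for each key."""
    return {key: sum(values) for key, values in grouped.items()}


splits = ["deer bear river", "car car river", "deer car bear"]
intermediate = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle_and_sort(intermediate)
counts = reduce_phase(grouped)
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In a real cluster each split would be mapped on a different node and the grouped keys would be spread across several reducers.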

knowledge test
What is Hadoop Map Reduce?
For processing large data sets in parallel across a Hadoop cluster, Hadoop MapReduce
framework is used

How Hadoop MapReduce works?


Explain Map and Reduce Phases

Explain what is shuffling in MapReduce?


The process that transfers the map outputs to the reducers as inputs is known as the shuffle

Distributed Cache in MapReduce Framework?


When you want to share some files across all nodes in a Hadoop cluster, Distributed Cache
is used. The files could be executable JAR files or simple properties files.
Explain what is Speculative Execution?
If a particular node is taking a long time to complete a task, Hadoop will create a
duplicate task on another node.

Input Split and HDFS Block?


The logical division of data is known as Split while a physical division of data is known
as HDFS Block

What do you mean by data locality?


Moving the computation unit to the data rather than the data to the computation unit
Hadoop start
Format NameNode: hdfs namenode -format
Start Hadoop: start-all.sh
Stop Hadoop: stop-all.sh
Open Hadoop web UI: localhost:9001, localhost:9870, localhost:8088 (cluster)
Load sample data: hadoop fs -put <local dir> <HDFS dir> (or hdfs dfs -put <local dir> <HDFS dir>)
Exit safe mode: hdfs dfsadmin -safemode leave
Delete data from HDFS: hdfs dfs -rm -r <file>
Make dir: hdfs dfs -mkdir <dir>
Hadoop operation:
hdfs dfs -put /Users/mohammedzahidk/desktop/test.csv /user/zahi

captsl
zahid

MAPREDUCE

SQL (query)
Standard language for communicating with a database
CREATE DATABASE
USE

Data type
Numeric data type
-INT
-FLOAT
-DECIMAL
Non-numeric data type
-CHAR
-VARCHAR
-ENUM
-BOOLEAN

Date and time

-DATE (YYYY-MM-DD)
-DATETIME (YYYY-MM-DD HH:MM:SS)
-TIME (HH:MM:SS)
-YEAR (YYYY)
FOREIGN KEY
It is used to link two tables together
The table holding the primary key is called the referenced or parent table
The table holding the foreign key is called the child table

Eg
CREATE TABLE orders (
id INT AUTO_INCREMENT PRIMARY KEY,
product_id INT,
customer_id INT,
ordertime DATETIME,
FOREIGN KEY (product_id) REFERENCES product(id),
FOREIGN KEY (customer_id) REFERENCES Customer(id));
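The foreign key example above can be tried without a MySQL server by using Python's built-in sqlite3 module. This is a sketch under assumptions: SQLite syntax differs slightly from MySQL (AUTOINCREMENT instead of AUTO_INCREMENT, TEXT for the date column), the table is named orders because ORDER is a reserved word, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    product_id INTEGER,
    customer_id INTEGER,
    ordertime TEXT,
    FOREIGN KEY (product_id) REFERENCES product(id),
    FOREIGN KEY (customer_id) REFERENCES customer(id)
);
INSERT INTO product (id, name) VALUES (1, 'pen');
INSERT INTO customer (id, name) VALUES (1, 'Asha');
""")

# Valid child row: both parent rows exist.
conn.execute(
    "INSERT INTO orders (product_id, customer_id, ordertime) "
    "VALUES (1, 1, '2024-01-01 10:00:00')"
)

# Invalid child row: there is no product with id 99.
try:
    conn.execute(
        "INSERT INTO orders (product_id, customer_id, ordertime) "
        "VALUES (99, 1, '2024-01-01 11:00:00')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)  # True 1
```

The second insert fails because the child row points at a parent row that does not exist, which is exactly what the foreign key is meant to prevent.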

QUERY
1. SELECT
2. SHOW
3. DROP TABLE / DELETE
4. ALTER: ADD, DROP, CHANGE, MODIFY
5. DESCRIBE
6. INSERT INTO table_name (column1, column2) VALUES ('value1', 'value2')
7. WHERE
8. AND/OR
9. BETWEEN
10. IS NULL
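The filtering keywords in the list above (WHERE, AND/OR, BETWEEN, IS NULL) can be tried with Python's built-in sqlite3 module. The product table and its rows are hypothetical and exist only for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, price REAL, category TEXT)"
)
conn.executemany(
    "INSERT INTO product (name, price, category) VALUES (?, ?, ?)",
    [
        ("pen", 10.0, "stationery"),
        ("book", 150.0, "stationery"),
        ("lamp", 400.0, None),  # category left empty on purpose
        ("desk", 2500.0, "furniture"),
    ],
)


def names(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]


# WHERE with AND: both conditions must hold.
cheap_stationery = names(
    "SELECT name FROM product WHERE category = 'stationery' AND price < 100 ORDER BY id"
)
# BETWEEN includes both end values.
mid_price = names("SELECT name FROM product WHERE price BETWEEN 100 AND 500 ORDER BY id")
# NULL must be tested with IS NULL, not with = NULL.
uncategorised = names("SELECT name FROM product WHERE category IS NULL ORDER BY id")

print(cheap_stationery)  # ['pen']
print(mid_price)         # ['book', 'lamp']
print(uncategorised)     # ['lamp']
```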
Add foreign key
ALTER TABLE child_name
ADD CONSTRAINT FRKY_id
FOREIGN KEY (FRKY_id) REFERENCES parent_name(id);

SQL JOINS
-Inner join
-Left join
-Right join
-Full outer join
Inner join
SELECT customers.customer_id, orders.order_id, orders.order_date
FROM customers
INNER JOIN orders
ON customers.customer_id = orders.customer_id
ORDER BY customers.customer_id;
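To see what the inner join returns, the query can be run against a small in-memory database with Python's sqlite3 module. The customers and orders rows below are invented for the sketch; a left join is included only to show the contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meera');
INSERT INTO orders VALUES
    (101, 1, '2024-01-05'),
    (102, 1, '2024-02-10'),
    (103, 2, '2024-03-15');
""")

# Inner join: only customers that have at least one matching order.
inner_rows = conn.execute("""
SELECT customers.customer_id, orders.order_id, orders.order_date
FROM customers
INNER JOIN orders ON customers.customer_id = orders.customer_id
ORDER BY customers.customer_id, orders.order_id
""").fetchall()

# Left join: every customer, with NULL (None) where no order matches.
left_rows = conn.execute("""
SELECT customers.customer_id, orders.order_id
FROM customers
LEFT JOIN orders ON customers.customer_id = orders.customer_id
ORDER BY customers.customer_id, orders.order_id
""").fetchall()

print(inner_rows)
# [(1, 101, '2024-01-05'), (1, 102, '2024-02-10'), (2, 103, '2024-03-15')]
print(left_rows)
# [(1, 101), (1, 102), (2, 103), (3, None)]
```

Customer 3 has no orders, so it is missing from the inner join and appears with an empty order in the left join.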

APACHE HIVE
