Bigdata PPT
Course
• Beginners – Go through every video in the order presented
• Intermediate – Choose the sessions you want to learn and start from
there
• Always plan and have a realistic goal
• Ask your questions in the Q&A forum on Udemy. Respond to other students'
questions and share knowledge
• Practise, Practise, Practise
• Have good internet connectivity and good system configuration
• Provide feedback and Rating when requested
[Course overview diagram: Spark RDD, Spark SQL / DataFrame, Spark Structured Streaming, complex data processing with Spark]
What is Big Data
• Big data is a collection of data that is huge in volume and keeps growing
exponentially with time.
• Examples of big data: data from social media, sales data from big retailers
like Walmart, jet engine data (which can generate more than 10 terabytes in
30 minutes of flight time), etc.
Types of Big Data
• Structured . Eg : Tables
• Semi Structured . Eg : XML
• Unstructured
Processing Big Data
There are many tools and programs which help in processing big data. Some of them
are :
• Hive
• Spark
• Kafka
• NoSQL Db
• Presto
• Flink
• Hudi
• Druid …..
Hadoop 1.0 Architecture
Internal process of 1.0 architecture
Drawbacks of 1.0
• Name Node failure: if the Name Node fails, the current fsimage is lost, so
recent transactions are lost
• Name Node size increase: if the data in the Name Node increases,
scalability becomes an issue
• Block size: the block size was just 64 MB in Hadoop 1.0
Hadoop 2.0
• Name Node failure → High Availability
• Name Node filling up → Federation
• Block size → default increased from 64 MB to 128 MB
High Availability
Hadoop 3.0 New Features
• You need Java 8 to compile the Hadoop jars
• Default ports have been changed for multiple services
• Support for more than two Name Nodes, e.g. one active and two passive
Name Nodes
• Several other optimizations
• Support for Erasure Coding
Erasure Code
Erasure code is an error-correcting code that ensures
survival of data by breaking the data into many blocks and
then adding a parity block. Using the parity block, we can
reconstruct the original block if it is lost. A parity block
contains parity checks (parity bits). Parity bits are a simple
form of error code.
Erasure Coding
[Diagram: a file is split into blocks A1, A2, B1, B2, C1, C2, which are distributed across nodes together with parity blocks so that a lost block can be reconstructed]
Parity Block
Input: A = "1010", B = "0101"
Parity = A XOR B = 1111
If A is lost, it can be recovered as Parity XOR B = 1111 XOR 0101 = 1010 = A
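A minimal sketch of the XOR parity idea above, runnable in the Scala REPL (the function name xorBits is illustrative, not from the slides):
// XOR two equal-length bit strings position by position
def xorBits(a: String, b: String): String =
  a.zip(b).map { case (x, y) => if (x == y) '0' else '1' }.mkString

val a = "1010"
val b = "0101"
val parity = xorBits(a, b)          // "1111" -- the parity block
val recoveredA = xorBits(parity, b) // "1010" -- A reconstructed from parity and B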
Row Storage
Sample table:
ID | Name | Country
1 | John | India
2 | Kevin | Australia
3 | Michael | America
4 | Pooja | India
…
90000 | Balor | Ireland
Row storage keeps complete rows together, one after another, on the blocks:
1 John India | 2 Kevin Australia | 3 Michael America | 4 Pooja India | … | 90000 Balor Ireland
(spread across Block 1, Block 2, …)
Column Storage
Column storage keeps each column together on a block:
IDs: 1 2 3 4 … 90000 | Names: John Kevin Michael Pooja … Balor | Countries: India Australia America India … Ireland
Row store vs Column Store
Row Store:
• Keeps all the data for an object (row) on the same block
• Easy to read and manipulate one object at a time
• Easy to insert new data
• Slow to analyze large amounts of data
Column Store:
• Keeps an entire column on the same block
• Easy to analyze entire columns quickly
• Easy to compress the data
• Slow to insert new data or manipulate old data
Serialization
Serialization is the process of converting a data structure or an object
into a format that can be easily stored or transmitted over a network and
can be easily reconstructed later.
Data → (Serialization) → Stream of bytes → (De-serialization) → Data
Advantages of Serialization
• Serialized data is easy and fast to transmit over a network; reads and
writes are fast
• Some serialized file formats offer good compression and can be encrypted
• Deserialization also does not take much time
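As a hedged illustration of the definition above, here is a small Scala sketch that uses plain Java object serialization; the Person class and the file name person.ser are illustrative assumptions, not part of the course material:
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}

@SerialVersionUID(1L)
case class Person(id: Int, name: String) extends Serializable   // illustrative class

val out = new ObjectOutputStream(new FileOutputStream("person.ser"))
out.writeObject(Person(1, "John"))                    // object -> stream of bytes on disk
out.close()

val in = new ObjectInputStream(new FileInputStream("person.ser"))
val restored = in.readObject().asInstanceOf[Person]   // bytes -> object again
in.close()
println(restored)                                     // Person(1,John)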
Serialized File Formats in Big Data
• Sequence File Format
• RC File Format
• ORC File Format
• Avro File Format
• Parquet File Format
Avro Vs ORC Vs Parquet
• AVRO is a row format while ORC and Parquet are columnar formats. Because AVRO is a row format, it is
write-heavy, while ORC and Parquet are read-heavy.
• In all three file formats, the schema is stored along with the data. This means you can take these files
from one machine, load them on another machine, and it will know what the data is about and will be able to
process it.
• All these file formats can be split across multiple disks. Therefore scalability and parallel processing are
not an issue.
• ORC provides the maximum compression, followed by Parquet, followed by AVRO.
• If the schema keeps changing, AVRO is preferred, as AVRO supports superior schema evolution. ORC also
supports it to some extent, but AVRO is best.
• AVRO is commonly used in streaming apps like Kafka, Parquet is commonly used in Spark, and ORC in Hive.
Parquet is good at storing nested data, so in Spark, Parquet is usually used.
• Because ORC compresses the most, it is used for space efficiency. Query time will be a little higher as it has
to decompress the data. In the case of AVRO, time efficiency is good as it takes a little less time to
decompress. So if you want to save space, use ORC; if you want to save time, use Parquet.
AVRO File
Parquet File
ORC File
What is Sqoop
Sqoop is an integration tool between RDBMS and Hadoop. Using Sqoop you
can import data from an RDBMS into Hadoop and also export data
from Hadoop back to the RDBMS.
Features of Sqoop
• Sqoop can handle huge amounts of data
• Uses the multi-threading concept (multiple mappers) for parallelism
• Can import all of the data or only a portion of it
• Works very well with incremental data
• Can work with data that has changed (CDC)
• Can import into multiple file formats, including serialized file formats
• Data is imported directly into the Hadoop system; it is not stored on
the edge node
Sqoop Command Template - Import
sqoop import --connect jdbc:mysql://<hostname>:<port>/<dbname>
--username <username> --password <password> --m 1
--table <tablename> --target-dir <target-dir>
Sqoop Command Template - Import with Encrypted Password
sqoop import
-Dhadoop.security.credential.provider.path=jceks://hdfs/tmp/mypassword
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-alias encrytpassword
--table customers
-m 1
--target-dir /user/cloudera/data_import_encrypted
Sqoop import as ORC
sqoop import
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-file file:///home/cloudera/passfile
--table order
--m 1
--hcatalog-database test
--hcatalog-table order
--create-hcatalog-table
--hcatalog-storage-stanza "stored as orcfile";
Sqoop import All tables
sqoop import-all-tables
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-file file:///home/cloudera/passfile
--m 1
--warehouse-dir /user/cloudera/tables;
Sqoop Export
sqoop export
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-file file:///home/cloudera/passfile
--table test
--staging-table test_stg
--m 1
--export-dir /user/cloudera/test_null
Performance Tuning – Sqoop Imports
• Using mappers: increase the number of mappers (-m) to import in parallel
• Using the --direct option: if you give --direct, instead of connecting
through the JDBC driver, it will connect using the native library
• -Dsqoop.export.records.per.statement=100 batches multiple records into a
single insert statement (for exports)
Hive Features
• Hive is open source; we can use it for free
• HQL (Hive Query Language) is very similar to SQL
• Hive is schema on read
• Hive can be used as an ETL tool and can process huge amounts of data
• Hive supports partitioning and bucketing
• Hive is a warehouse tool designed for analytical purposes, not for
transactional purposes
• Can work with multiple file formats
• Can be plugged into BI tools for data visualization
Limitations of Hive
• Hive is not designed for OLTP operations; it is used for OLAP
• It has limited subquery support
• The latency of Hive is a little high
• Support for updates and deletes is very minimal
• Not used for real-time queries as it takes a bit of time to return
results
Hive Table Types
• Managed table: when the table is dropped, the backend directory associated
with the table is deleted as well. Use it for staging purposes
• External table: when the table is dropped, the backend directory associated
with the table still exists. Use it for the target system
Hive Partitions
Partition is a way of dividing the data in a table into related parts using
a partition column.
Types:
• Static partition: partitions are created when the user explicitly specifies them
• Dynamic partition: partitions are created dynamically based on the data
Hive Static Partition
• Static Load Partition : Here, we specify the partition name to which
the data needs to be loaded
• Static Insert Partition : Here, we first create a non partitioned table
and then insert the data from this non partitioned table into a
partitioned table
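A hedged sketch of the static insert flow (and, for contrast, a dynamic insert), issued as HQL through spark.sql from a Hive-enabled spark-shell; essentially the same statements can be run in the Hive shell. Table and column names are illustrative assumptions:
// Non-partitioned staging table and partitioned target table (illustrative names)
spark.sql("CREATE TABLE IF NOT EXISTS emp_stage (id INT, name STRING, country STRING)")
spark.sql("CREATE TABLE IF NOT EXISTS emp_part (id INT, name STRING) PARTITIONED BY (country STRING)")

// Static insert partition: the partition value is named explicitly
spark.sql("INSERT INTO TABLE emp_part PARTITION (country='India') SELECT id, name FROM emp_stage WHERE country='India'")

// Dynamic partition insert: partitions are created from the data itself
spark.sql("SET hive.exec.dynamic.partition=true")
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("INSERT INTO TABLE emp_part PARTITION (country) SELECT id, name, country FROM emp_stage")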
Hive Bucketing
Bucketing in hive is the concept of breaking data down
into ranges, which are known as buckets, to give extra
structure to the data so it may be used for more
efficient queries.
Hive Date Formats
• Default date Format : yyyy-MM-dd
• unix_timestamp : This will return the number of seconds from the unix time. Unix time is 1970-01-01
00:00:00 UTC. It uses the default time zone for conversion.
• from_unixtime : The result of unix_timestamp is fed into from_unixtime and it converts it back to the
desired date format
• TO_DATE : The TO_DATE function returns the date part of the timestamp in the format 'yyyy-MM-dd’
• YEAR( string date ), MONTH(String date), DAY(String date), HOUR(string date), MINUTE(String date),
SECOND(String date) Returns Year, Month, Day, Hour, Minute, Second part of the date
• DATEDIFF( string date1, string date2 ) : DATEDIFF function returns the number of days between the two
given dates
• DATE_ADD( string date, int days ) : DATE_ADD function adds the number of days to the specified date
• DATE_SUB( string date, int days ) : DATE_SUB function subtracts the number of days from the specified
date
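A hedged sketch of the functions above, run through spark.sql from a spark-shell session (they behave largely the same in the Hive shell); the literal dates are just sample values:
spark.sql("""
  SELECT
    unix_timestamp('2023-01-15 10:30:00')                 AS seconds_since_epoch,
    from_unixtime(unix_timestamp('2023-01-15 10:30:00'))  AS back_to_string,
    to_date('2023-01-15 10:30:00')                        AS date_part,
    year('2023-01-15') AS yr, month('2023-01-15') AS mon, day('2023-01-15') AS d,
    datediff('2023-01-15', '2023-01-01')                  AS diff_days,
    date_add('2023-01-15', 10)                            AS plus_10_days,
    date_sub('2023-01-15', 10)                            AS minus_10_days
""").show(false)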
Hive Joins
• Map Join
• Bucket Map Join
• Sort merge bucket join
• Skew join
Bucket Map Join
Properties to set:
set hive.optimize.bucketmapjoin = true
set hive.enforce.bucketing = true;
Query:
SELECT /*+ MAPJOIN(table2) */ table1.emp_id, table1.emp_name,
table2.job_title FROM table1 inner JOIN table2 ON table1.emp_id =
table2.emp_id;
Here, table2 is the smaller table
Sort merge bucket join
Below properties needs to be set:
• set hive.auto.convert.sortmerge.join=true;
• set hive.auto.convert.sortmerge.join.noconditionaltask=true;
What is data skew
Data skewness primarily refers to the non-uniform distribution of data.
Dno | Count
10 | 3
20 | 7800
30 | 300
EMPLOYEE table: dno 10 → 10,00,000 rows, dno 20 → 20 rows, dno 30 → 12 rows
DEPT table: dno 10 → 10 rows, dno 20 → 1 row, dno 30 → 1 row
Skew join properties to be set
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Hive Performance Tuning Techniques
• Partitions
• Bucketing
• Map Joins, Skew Joins
• Vectorization
set hive.vectorized.execution.enabled=true
• Hive Parallel execution
set hive.exec.parallel=true
SQL vs Hive
• Hive is schema on read while SQL is schema on write
• Hive is for analytics while SQL is for transactional workloads
• Hive is a data warehouse while SQL is a database
• SQL supports only structured data while Hive supports structured and
semi-structured data
Scala
• Scala is compatible with Java
• Semicolons are optional in scala
• Scala is statically typed
• Everything in Scala is an object
Scala – Conditional Statements
• If statement
• If-else statement
• Nested if-else statement
• If else if Ladder statement
If condition
if(condition)
{
// Statements to be executed
}
Scala If else
if(condition)
{
// If block statements to be executed
} else
{
// Else block statements to be executed
}
If Else If Ladder Statement
if (condition1)
{
//Code to be executed if condition1 is true
}
else if (condition2)
{
//Code to be executed if condition2 is true
}
...
else
{
//Code to be executed if all the conditions are false
}
Nested If else
if (condition_1)
{
if (condition_2)
{
// Executes when condition_2 is true
}
else
{
// Executes when condition_2 is false
}
}
else
{
//can have else statement or nested if else here
}
While loop
while(condition)
{
// Statements to be executed
}
For Loop
for( i <- range)
{
// statements to be executed
}
The Break statement
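As a hedged sketch (Scala has no built-in break keyword), scala.util.control.Breaks is typically used:
import scala.util.control.Breaks._

breakable {
  for (i <- 1 to 10) {
    if (i == 5) break()   // exits the surrounding breakable block
    println(i)            // prints 1, 2, 3, 4
  }
}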
[Diagram: multilevel inheritance — Grand Father → Father → Child]
Hierarchical Inheritance
[Diagram: one parent, many children — Father → Child1, Child2]
Multiple Inheritance
[Diagram: Class A and Class B → Class C]
Hybrid Inheritance
[Diagram: Class A extended through two traits into Class B and Class C, combined in Class D]
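Since a Scala class cannot extend more than one class, traits are what make the multiple and hybrid inheritance shapes above possible. A minimal sketch with illustrative names:
trait CanRun  { def run():  String = "running" }
trait CanSwim { def swim(): String = "swimming" }

class Animal
class Duck extends Animal with CanRun with CanSwim   // mixes in both traits

val d = new Duck
println(d.run() + " and " + d.swim())                // running and swimming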
Case class
• A case class cannot inherit another case class
• It supports pattern matching
• No need to use the new keyword to instantiate a case class
• By default, a case class creates a companion object
• All the arguments in a case class are vals
• Case classes are immutable; they are helpful for modelling immutable
data.
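A minimal case class sketch illustrating the points above (Employee is an illustrative name):
case class Employee(id: Int, name: String)

val e = Employee(1, "John")          // no `new` keyword needed
println(e.name)                      // arguments are vals (immutable)

e match {                            // pattern matching works out of the box
  case Employee(_, "John") => println("found John")
  case _                   => println("someone else")
}

val e2 = e.copy(name = "Kevin")      // copy returns a modified immutable copy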
Abstraction and Final
• Abstraction is the process of hiding implementation. You can define
an abstract class and abstract methods. Any class that extends the abstract
class must implement all its abstract methods, else the class that
extends it will itself become abstract.
• The final keyword is used to represent something that cannot be
changed. If you specify the final keyword before a variable, then that
variable's value cannot be changed. If you specify the final keyword in front
of a method, it means that the implementation of that method is final
and the method cannot be overridden. If you specify final in front of a
class, it means that the class cannot be extended.
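A hedged sketch of both keywords, with illustrative names:
abstract class Shape {
  def area(): Double              // abstract method: no implementation here
  final val unit: String = "cm"   // final value: cannot be overridden or changed
}

class Circle(r: Double) extends Shape {
  override def area(): Double = math.Pi * r * r   // must implement every abstract method
}

final class Square(s: Double) extends Shape {     // final class: cannot be extended further
  override def area(): Double = s * s
}

println(new Circle(2.0).area())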
Higher Order Functions
Higher-order functions are functions that work with other functions: we pass
another function as an argument or return a function as a result.
Eg : print(sf.addzero(sf.sub_reverse("Robin")))
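Since the sf.* helpers above come from the course's own code, here is a self-contained hedged sketch with illustrative names:
// A higher-order function: takes another function as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

val addOne = (n: Int) => n + 1
println(applyTwice(addOne, 5))          // 7

// A higher-order function can also return a function as a result
def multiplier(factor: Int): Int => Int = (n: Int) => n * factor
println(multiplier(3)(10))              // 30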
Lambda
Lambda functions are also called anonymous functions. An anonymous function
is a function which has no name but works as a function. Lambda expressions
are basically a shorthand notation for functions.
val square_1 = (x: Int) => x * x
Option Type
Option is a container type which indicates whether we have a value or not:
Some(value) when the element is present, None when it is absent.
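A minimal Option sketch (the map contents are illustrative):
val countries = Map("John" -> "India", "Kevin" -> "Australia")

val c1: Option[String] = countries.get("John")    // Some(India)
val c2: Option[String] = countries.get("Robin")   // None

println(c1.getOrElse("Unknown"))   // India
println(c2.getOrElse("Unknown"))   // Unknown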
Scala Pattern Matching
<expression> match {
case pattern=> logic
case pattern=>logic
……………..
case_ => default_logic
}
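A concrete, hedged example of the template above (describe is an illustrative name):
def describe(x: Any): String = x match {
  case 1       => "one"
  case "hello" => "a greeting"
  case n: Int  => s"some other int: $n"
  case _       => "something else"      // default case
}

println(describe(1))      // one
println(describe(42))     // some other int: 42
println(describe(3.14))   // something else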
Scala Collections
Scala collections are containers that hold a sequenced, linear set of
elements.
Scala collections can be mutable or immutable.
You can iterate through the elements using a for loop; a for comprehension (for/yield) returns a collection such as a List.
Queue: implements a data structure that allows inserting and retrieving
elements in a first-in-first-out (FIFO) manner.
In Scala, a Queue is implemented as a pair of lists: one is used to insert
the elements and the second holds the elements to be removed. Elements are
added to the first list and removed from the second list.
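A minimal sketch of the FIFO behaviour using the immutable Queue:
import scala.collection.immutable.Queue

val q1 = Queue(1, 2, 3)
val q2 = q1.enqueue(4)           // add at the back: Queue(1, 2, 3, 4)
val (front, rest) = q2.dequeue   // remove from the front: (1, Queue(2, 3, 4))
println(front)                   // 1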
Map: using map.keys we get the keys; using map.values we get the values
Tuple A tuple is a collection of elements in ordered form. If there is no element present, it is called
empty tuple. It can take any datatype
val t1=(1,2,"Robin",222.5)
flatMap: val fl = list.flatMap(x => x.split("~"))
filter: val fil = list.filter(x => x % 2 == 0)
count: count takes a filter condition and gives the count of elements that satisfy the
condition.
val cc = num.count(x => x % 2 == 0)
partition: val part_even = num.partition(x => x % 2 == 0)
It creates two lists: one with the even elements and the other with the odd (non-even) elements
reduce, reduceLeft, reduceRight: reduce methods are applied on a collection and apply a
binary operation, taking two elements of the collection at a time.
val num = List(1, 2, 3, 4)
Eg : num.reduceLeft(_ + _)
Eg : num.reduceRight(_ + _)
Collection Methods
Method Description
foldLeft, foldRight: foldLeft and foldRight do what reduceLeft and reduceRight do. The only
difference is that foldLeft and foldRight take an initial value.
val fold_left = name.foldLeft("robin")(_ + _)
scanLeft, scanRight: same as fold. The basic difference is that fold gives only the final
output, while scan gives the intermediate results as well as the output.
val scan_right = name.scanRight("ron")(_ + _)
groupBy and grouped
val ages=List(1,2,7,30,32,35)
val gr=ages.groupBy(age=>if(age>30) "Senior" else "Junior")
[Diagram: MapReduce word count — the input lines "Hadoop User Scala", "Scala Hadoop Spark", "User User Hadoop", "Scala Scala Scala" are split into words, mapped to (word, 1) pairs, shuffled, and reduced to (Hadoop,3), (Scala,5), (Spark,1), (User,3); intermediate results are written to disk between the stages]
Resilient Distributed Datasets (RDD)
RDD is an immutable collection of objects. It is a read-only, partitioned
collection of records. An RDD basically represents data distributed across
the nodes in the cluster.
Operations of RDD:
Transformations
Actions
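A hedged sketch of the word-count flow from the earlier diagram, written with RDD transformations and an action (a spark-shell session is assumed, so sc is already available):
val lines  = sc.parallelize(Seq("Hadoop User Scala", "Scala Hadoop Spark", "User User Hadoop", "Scala Scala Scala"))
val words  = lines.flatMap(_.split(" "))   // transformation: split lines into words
val pairs  = words.map(word => (word, 1))  // transformation: (word, 1) pairs
val counts = pairs.reduceByKey(_ + _)      // transformation: sum the 1s per word
counts.collect().foreach(println)          // action: (Hadoop,3), (Scala,5), (Spark,1), (User,3) -- order may vary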
Spark Vs MapReduce
Map Reduce:
• Follows linear (eager) evaluation
• Follows a top-to-bottom approach
• There are frequent hits to the hard disk
Spark:
• Follows lazy evaluation
• Follows a bottom-to-top approach (from the action back to the RDD)
• In-memory processing
Common Transformations
Transformation Description
map Takes a function and applies that function to all the elements in the collection
flatMap Does the same functionality as map except that flatMap will flatten the result
filter Filter will filter the records based on the condition
distinct Distinct is to remove the duplicates
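A hedged sketch of these transformations from a spark-shell session (sc assumed available; the sample data is illustrative):
val nums = sc.parallelize(Seq(1, 2, 2, 3, 4, 5))

nums.map(_ * 10).collect()            // Array(10, 20, 20, 30, 40, 50)
nums.filter(_ % 2 == 0).collect()     // Array(2, 2, 4)
nums.distinct().collect()             // Array(1, 2, 3, 4, 5) -- order may vary

val lines = sc.parallelize(Seq("a b", "c d"))
lines.flatMap(_.split(" ")).collect() // Array(a, b, c, d) -- flattened, unlike map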
Common transformations on 2 RDDs
Transformation Description
union Union combines two RDDs and returns all elements from these two RDDs
intersection Returns elements present in both the RDDs
subtract Returns an RDD with the contents of the other RDD removed
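A hedged sketch of the two-RDD transformations (spark-shell assumed; sample data illustrative):
val rdd1 = sc.parallelize(Seq(1, 2, 3, 4))
val rdd2 = sc.parallelize(Seq(3, 4, 5, 6))

rdd1.union(rdd2).collect()         // Array(1, 2, 3, 4, 3, 4, 5, 6)
rdd1.intersection(rdd2).collect()  // Array(3, 4) -- order may vary
rdd1.subtract(rdd2).collect()      // Array(1, 2) -- order may vary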
Common RDD Actions
Action Description
collect Returns all elements from RDD and stores in memory
count Returns the number of elements in the RDD
take(n) Returns ‘n’ number of elements from RDD
reduce The reduce() function takes the two elements as input from the RDD and then produces the
output of the same type as that of the input elements. The simple forms of such function are
an addition
foreach Apply function to each element in the RDD
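A hedged sketch of the actions above (spark-shell assumed; sample data illustrative):
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))

nums.collect()        // Array(1, 2, 3, 4, 5) -- brings all elements to the driver
nums.count()          // 5
nums.take(3)          // Array(1, 2, 3)
nums.reduce(_ + _)    // 15
nums.foreach(println) // runs on the executors; in local mode the output appears in the console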
Pair RDD Functions
Function Description
groupByKey groups all the values that belong to the same key as one
reduceByKey Returns a merged RDD by merging the values of each key.
mapValues mapValues takes a function and applies that function to only the values of the pairRDD
keys Returns the keys of a pair RDD
values Returns the values of a pair RDD
countByKey countByKey simply counts the number of elements per key in a pair RDD
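A hedged sketch of the pair-RDD functions (spark-shell assumed; sample data illustrative, result ordering may vary):
val sales = sc.parallelize(Seq(("India", 10), ("Australia", 5), ("India", 20)))

sales.reduceByKey(_ + _).collect()               // Array((India,30), (Australia,5))
sales.groupByKey().mapValues(_.toList).collect() // Array((India,List(10, 20)), (Australia,List(5)))
sales.mapValues(_ * 2).collect()                 // Array((India,20), (Australia,10), (India,40))
sales.keys.collect()                             // Array(India, Australia, India)
sales.values.collect()                           // Array(10, 5, 20)
sales.countByKey()                               // Map(India -> 2, Australia -> 1)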
Spark Dataframes
Data frames are distributed collections of data organized into named
columns.
For Writing:
df.write.format().option().save()
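A hedged read-and-write sketch from a spark-shell session; the CSV path, options and output path are illustrative assumptions:
val df = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/user/cloudera/customers.csv")

df.printSchema()
df.show(5)

df.write.format("parquet")
  .mode("overwrite")                        // one of: error (default), append, ignore, overwrite
  .save("/user/cloudera/customers_parquet")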
Spark XML Write Jars
Jar Jar location in MVN
commons-io-2.8.0.jar https://mvnrepository.com/artifact/commons-io/commons-io/2.8.0
txw2-2.3.3.jar https://mvnrepository.com/artifact/org.glassfish.jaxb/txw2/2.3.3
xmlschema-core-2.2.5.jar https://mvnrepository.com/artifact/org.apache.ws.xmlschema/xmlschema-core/2.2.5
Spark Write Modes
Write Mode Description
Error This is the default mode. If directory found, it will throw error
Append If directory found, append to that directory
Ignore If directory found, just ignore. Do not fail the job
Overwrite If directory found, overwrite it
Spark SQL – Working with Columns
Function Description
select Used to select the required columns
selectExpr Does what select does. In addition, it helps in applying sql transformation on the
columns.
withColumn Similar to selectExpr, it allows you to apply transformation on the selected column
while retaining all other columns in the dataframe
withColumnRenamed withColumnRenamed is used to rename a column
case when Acts like a case statement in sql , if then else in programming language
drop Drops the column from the dataframe
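A hedged sketch of these column operations from a spark-shell session (the DataFrame and column names are illustrative):
import org.apache.spark.sql.functions._
import spark.implicits._

val emp = Seq((1, "John", 1000), (2, "Kevin", 2000)).toDF("id", "name", "salary")

emp.select("id", "name").show()
emp.selectExpr("id", "salary * 12 as annual_salary").show()
emp.withColumn("bonus", col("salary") * 0.1).show()                             // keeps all other columns
emp.withColumnRenamed("name", "emp_name").show()
emp.withColumn("band", when(col("salary") > 1500, "A").otherwise("B")).show()   // case when
emp.drop("salary").show()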
String Functions
Function Description
concat_ws(sep: String, exprs: Column*) Concatenates multiple input string columns together into a single
string column, using the given separator
instr(str: Column, substring: String) Locate the position of the first occurrence of substr column in the
given string. Returns 0 if no match found
length(e: Column) Computes the character length of the given string
lower(e: Column) Converts a string to lower case
upper(e: Column) Converts a string to upper case
lpad(str: Column, len: Int, pad: String) Left-pad the string column with pad to a length of len. If the string
column is longer than len, the return value is shortened to len
characters
rpad(str: Column, len: Int, pad: String) Right-pad the string column with pad to a length of len. If the string
column is longer than len, the return value is shortened to len
characters
String Functions
Function Description
repeat(str: Column, n: Int) Repeats a string column n times, and returns it as a new string
column
ltrim(e: Column) Trim the spaces from left end for the specified string value
rtrim(e: Column) Trim the spaces from right end for the specified string value.
split(str: Column, regex: String) Splits str around matches of the given regex
substring(str: Column, pos: Int, len: Int) Substring starts at `pos` and is of length `len`
regexp_replace(e: Column, pattern: String, Replace all substrings of the specified string value that match
replacement: String) regexp with rep
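A hedged sketch of a few of the string functions above from a spark-shell session (sample data illustrative):
import org.apache.spark.sql.functions._
import spark.implicits._

val people = Seq(("john", "smith"), ("pooja", "rao")).toDF("first", "last")

people.select(
  concat_ws(" ", col("first"), col("last")).as("full_name"),
  upper(col("first")).as("upper_first"),
  length(col("last")).as("last_len"),
  lpad(col("first"), 8, "*").as("padded"),
  regexp_replace(col("last"), "s", "S").as("replaced")
).show(false)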
Working with Dates
Function Description
[Diagram: Kafka producers publish messages to topics (Topic1, Topic2); consumers subscribe to those topics using a Group-ID]
Consumption Model
• Earliest
• Latest
Kafka Streaming
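The slide body is a diagram; as a hedged illustration in this course's stack, here is Spark Structured Streaming reading from a Kafka topic (broker address and topic name are illustrative assumptions, and the spark-sql-kafka package must be on the classpath). The startingOffsets option corresponds to the earliest/latest consumption model above:
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // illustrative broker
  .option("subscribe", "Topic1")                         // illustrative topic
  .option("startingOffsets", "earliest")                 // or "latest"
  .load()

val messages = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

val query = messages.writeStream
  .format("console")
  .outputMode("append")
  .start()
// query.awaitTermination()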
Spark Performance Tuning Tips
• Improve the performance at Code Level
• Use the Right File Format
• Have the optimized configurations
• Spark Optimizations