
Bigdata PPT

The document outlines a comprehensive guide for becoming a proficient Big Data Developer, covering essential tools and concepts such as Hadoop, Spark, and Sqoop. It discusses the types of Big Data, processing techniques, and the differences between file formats like Avro, ORC, and Parquet. Additionally, it highlights Hive's features, limitations, and performance tuning strategies, alongside an introduction to Scala programming relevant to Big Data applications.


Make the best of the Big Data Developer Course
• Beginners – Go through every video in the order presented
• Intermediate – Choose the sessions you want to learn and start from there
• Always plan and have a realistic goal
• Ask your questions in the Q&A forum on Udemy. Respond to other students' questions and share knowledge
• Practise, Practise, Practise
• Have good internet connectivity and a good system configuration
• Provide feedback and a rating when requested
Course modules:
• Spark RDD
• Spark SQL
• Spark DataFrame
• Spark Structured Streaming
• Complex Data Processing
What is Big Data
• Big data is a collection of data that is huge in volume and grows exponentially with time.
• Examples of Big Data: data from social media, sales details from big retailers like Walmart, jet engine data which can generate more than 10 terabytes in 30 minutes of flight time, etc.
Types of Big Data
• Structured . Eg : Tables
• Semi Structured . Eg : XML
• Unstructured
Processing Big Data
There are many tools and programs which help in processing big data. Some of them are:
• Hive
• Spark
• Kafka
• NoSQL Db
• Presto
• Flink
• Hudi
• Druid …..
Hadoop 1.0 Architecture
Internal process of 1.0 architecture
Drawbacks of 1.0
• Name node failure: If the Name Node fails, the current fsimage is lost, so recent transactions are lost
• Name node size increase: If the metadata in the Name Node grows, scalability becomes an issue
• Block size: The block size was just 64 MB in Hadoop 1.0
Hadoop 2.0
• Name node failure → High Availability
• Name node filling up → Federation
• Block size increased from 64 MB to 128 MB
High Availability
Hadoop 3.0 New Features
• You need Java 8 to compile the Hadoop jars
• Default ports have been changed for multiple services
• Support for more than 2 Name Nodes → one active and 2 passive Name Nodes
• Several other optimizations
• Support for Erasure Coding
Erasure Code
Erasure code is an error-correcting code that ensures survival of data by breaking the data into many blocks and then adding a parity block. Using the parity block, we can reconstruct the original block if it is lost. The parity block contains parity checks (parity bits). Parity bits are a simple form of error code.
Erasure Coding

[Diagram: FILE = A1 + A2 + B1 + B2 + C1 + C2, with the blocks spread across Data Block 1, Data Block 2, Data Block 3 and a Parity Block]

XOR parity example:
Input: A = "1010", B = "0101"
Parity = A XOR B = 1111

To recover a lost block A:
Parity : 1 1 1 1
B      : 0 1 0 1
XOR    : 1 0 1 0 → A
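A tiny Scala sketch of the XOR idea above (illustrative only, not how HDFS actually implements erasure coding):

// XOR two equal-length bit strings, e.g. "1010" xor "0101" = "1111"
def xorBits(a: String, b: String): String =
  a.zip(b).map { case (x, y) => if (x == y) '0' else '1' }.mkString

val a = "1010"
val b = "0101"
val parity = xorBits(a, b)          // "1111"
val recoveredA = xorBits(parity, b) // "1010": the lost block A rebuilt from the parity block and B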
Row Based Storage

Example table:
ID     Name     Country
1      John     India
2      Kevin    Australia
3      Michael  America
4      Pooja    India
…
90000  Balor    Ireland

Row storage layout (complete rows are stored next to each other and split across blocks):
1 John India | 2 Kevin Australia | 3 Michael America | 4 Pooja India | ..... | 90000 Balor Ireland
[Diagram: Block 1 holds "1 John India | 2 Kevin Australia", Block 2 holds "3 Michael America | 4 Pooja India", … the last block holds "90000 Balor Ireland"]

Column Based Storage

Column storage layout (each column is stored together and split across blocks):
1 2 3 4 …… 90000 | John Kevin Michael Pooja … Balor | India Australia America India … Ireland
[Diagram: Block 1 holds the ID column, Block 2 holds the Name column, … the last block holds the Country column]
Row store vs Column Store
Row Store:
• Keeps the data for the objects on the same block
• Easy to read and manipulate one object at a time
• Easy to insert new data
• Slow to analyze large amounts of data

Column Store:
• Keeps the entire column on the same block
• Easy to analyze entire columns quickly
• Easy to compress the data
• Slow to insert new data or manipulate old data
Serialization
Serialization is the process of converting a data structure or an object
into a format that can be easily stored or transmitted over a network and
can be easily reconstructed later.

[Diagram: Data → Serialization → stream of bytes → De-serialization → Data]
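A minimal Scala sketch of the idea using plain Java object streams (just to illustrate the concept; Big Data formats such as Avro or Parquet serialize far more efficiently):

import java.io._

case class Employee(id: Int, name: String)   // case classes are Serializable by default

// Serialization: object -> stream of bytes
val bytesOut = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bytesOut)
oos.writeObject(Employee(1, "John"))
oos.close()
val bytes = bytesOut.toByteArray

// De-serialization: stream of bytes -> object
val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
val emp = ois.readObject().asInstanceOf[Employee]
ois.close()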
Advantages of Serialization
• Serialized data is easy and fast to transmit over a network. Reads and writes are fast
• Some serialized file formats offer good compression and they can be encrypted
• Deserialization also does not take much time
Serialized File Formats in Big Data
• Sequence File Format
• RC File Format
• ORC File Format
• Avro File Format
• Parquet File Format
Avro Vs ORC Vs Parquet
• AVRO is a row format while ORC and Parquet are columnar formats. Because AVRO is a row format, it is write heavy while ORC and Parquet are read heavy.
• In all 3 file formats, the schema is stored along with the data. This means you can take these files from one machine, load them on another machine, and it will know what the data is about and will be able to process it.
• All these file formats can be split across multiple disks. Therefore scalability and parallel processing are not an issue.
• ORC provides maximum compression, followed by Parquet, followed by AVRO.
• If the schema keeps changing then AVRO is preferred, as AVRO supports superior schema evolution. ORC also supports it to some extent, but AVRO is best.
• AVRO is commonly used in streaming apps like Kafka, Parquet is commonly used in Spark and ORC in Hive. Parquet is good at storing nested data, so in Spark, Parquet is usually used.
• Because the compression of ORC is higher, it is used for space efficiency. Query time will be a little higher as it has to decompress the data. In the case of AVRO, the time efficiency is good as it takes a little less time to decompress. So if you want to save space, use ORC; if you want to save time, use Parquet. (A small write example follows below.)
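As a rough illustration, writing the same Spark DataFrame in the three formats (the paths are placeholders, spark and df are assumed to already exist, and the avro format needs the external spark-avro package on the classpath):

// assuming an existing SparkSession `spark` and DataFrame `df`
df.write.mode("overwrite").format("parquet").save("/tmp/out_parquet")
df.write.mode("overwrite").format("orc").save("/tmp/out_orc")
df.write.mode("overwrite").format("avro").save("/tmp/out_avro")   // requires the spark-avro package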
AVRO File
Parquet File
ORC File
What is Sqoop
Sqoop is an integration tool between RDBMS and Hadoop. Using Sqoop you can import data from an RDBMS into Hadoop and also export data from Hadoop to an RDBMS.
Features of Sqoop
• Sqoop can handle huge amount of data
• Uses multi threading concept for parallelism
• Can import all of the data. It can also import a portion of data
• Can work with incremental data very well
• Can work with data that is changed (CDC)
• Can import into multiple file formats including serialized file formats
• Data is imported directly into the Hadoop system. It is not stored in
the edge node
Sqoop Command Template - Import
sqoop import --connect jdbc:mysql://<hostname>:<port>/<dbname>
--username <username> --password <password> --m 1
--table <tablename> --target-dir <target-dir>
Sqoop Command Template - Export

sqoop export --connect jdbc:mysql://<hostname>:<port>/<dbname>
--username <username> --password <password> --m 1
--table <tablename> --export-dir <target-dir>
Import portion of the data
• --where → We specify a condition and the data is filtered based on that condition, similar to WHERE in SQL
• --query → We specify the query to fetch the data from the table
Sqoop incremental append
sqoop import
--connect jdbc:mysql://localhost:3306/retail_db
--username root --password cloudera
--m 2 --table customers --split-by customer_id
--target-dir /user/cloudera/customer1
--incremental append
--check-column customer_id
--last-value 0
Password encryption
hadoop credential create encrytpassword -provider
jceks://hdfs/tmp/mypassword

hdfs dfs -cat /tmp/mypassword

sqoop import
-Dhadoop.security.credential.provider.path=jceks://hdfs/tmp/mypassword
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-alias encrytpassword
--table customers -m 1
--target-dir /user/cloudera/data_import_encrypted
Sqoop import as ORC
sqoop import --connect jdbc:mysql://localhost:3306/retail_db
--username root --password-file file:///home/cloudera/passfile
--table order
--m 1
--hcatalog-database test
--hcatalog-table order
--create-hcatalog-table
--hcatalog-storage-stanza "stored as orcfile";
Sqoop import All tables
sqoop import-all-tables
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-file file:///home/cloudera/passfile
--m 1
--warehouse-dir /user/cloudera/tables;
Sqoop Export
sqoop export
--connect jdbc:mysql://localhost:3306/retail_db
--username root
--password-file file:///home/cloudera/passfile
--table test
--staging-table test_stg
--m 1
--export-dir /user/cloudera/test_null
Performance Tuning – Sqoop Imports
• Using mappers
• Using the --direct option: if you give --direct, instead of connecting through the JDBC driver, it will connect using the database's native utilities

sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root
--password cloudera --m 1 --table customers --delete-target-dir
--target-dir /user/cloudera/data_import1 --direct

• --fetch-size: how many rows each mapper should fetch at a time. By default it is 1000 rows
Performance Tuning – Sqoop Export
• In Sqoop export, 10k rows will be inserted per sqoop statement by default. We can tune this using the below 2 properties:

sqoop export -Dsqoop.export.statements.per.transaction=100
-Dsqoop.export.records.per.statement=100
Hive Features
• Hive is open source. We can use it for free
• HQL (Hive Query Language) is very similar to SQL
• Hive is schema on read
• Hive can be used as an ETL tool and can process huge amounts of data
• Hive supports partitioning and bucketing
• Hive is a warehouse tool designed for analytical purposes, not for transactional purposes
• Can work with multiple file formats
• Can be plugged into BI tools for data visualization
Limitations of Hive
• Hive is not designed for OLTP operations. It is used for OLAP
• It has limited subquery support
• The latency of Hive is a little high
• Support for updates and deletion is very minimal
• Not used for real-time queries as it takes a bit of time to give the results
Hive Table Types
• Managed Table: When the table is dropped, the backend directory associated with the table is deleted as well. Use it for staging purposes
• External Table: When the table is dropped, the backend directory associated with the table still exists. Use it for the target system
Hive Partitions
Partition is a way of dividing the data in a table into related parts using
a partition column.

Types:
• Static Partition: Partitions are created explicitly as specified by the user
• Dynamic Partition: Partitions are created dynamically based on the data
Hive Static Partition
• Static Load Partition : Here, we specify the partition name to which
the data needs to be loaded
• Static Insert Partition : Here, we first create a non partitioned table
and then insert the data from this non partitioned table into a
partitioned table
Hive Bucketing
Bucketing in hive is the concept of breaking data down
into ranges, which are known as buckets, to give extra
structure to the data so it may be used for more
efficient queries.
Hive Date Formats
• Default date Format : yyyy-MM-dd
• unix_timestamp : This will return the number of seconds from the unix time. Unix time is 1970-01-01
00:00:00 UTC. It uses the default time zone for conversion.
• from_unixtime : The result of unix_timestamp is fed into from_unixtime and it converts it back to the
desired date format
• TO_DATE : The TO_DATE function returns the date part of the timestamp in the format 'yyyy-MM-dd’
• YEAR(string date), MONTH(string date), DAY(string date), HOUR(string date), MINUTE(string date), SECOND(string date) → return the year, month, day, hour, minute and second part of the date
• DATEDIFF(string date1, string date2): returns the number of days between the two given dates
• DATE_ADD(string date, int days): adds the number of days to the specified date
• DATE_SUB(string date, int days): subtracts the number of days from the specified date
Hive Joins
• Map Join
• Bucket Map Join
• Sort merge bucket join
• Skew join
Bucket Map Join
Properties to set:
set hive.optimize.bucketmapjoin = true
set hive.enforce.bucketing = true;

Query:
SELECT /*+ MAPJOIN(table2) */ table1.emp_id, table1.emp_name,
table2.job_title FROM table1 inner JOIN table2 ON table1.emp_id =
table2.emp_id;
Here, table2 is the smaller table
Bucket Map Join
Sort merge bucket join
The below properties need to be set:
• set hive.auto.convert.sortmerge.join=true;
• set hive.optimize.bucketmapjoin = true;
• set hive.optimize.bucketmapjoin.sortedmerge = true;
• set hive.auto.convert.sortmerge.join.noconditionaltask=true;
What is data skew
Data skewness primarily refers to the non-uniform distribution of data.
Dno    Count
10     3
20     7800
30     300

EMPLOYEE               DEPT
dno    count           dno    count
10     10,00,000       10     10
20     20              20     1
30     12              30     1
Skew join properties to be set
set hive.optimize.skewjoin=true;
set hive.skewjoin.key=500000;
set hive.skewjoin.mapjoin.map.tasks=10000;
set hive.skewjoin.mapjoin.min.split=33554432;
Hive Performance Tuning Techniques
• Partitions
• Bucketing
• Map Joins, Skew Joins
• Vectorization
set hive.vectorized.execution.enabled=true
• Hive Parallel execution
set hive.exec.parallel=true
SQL vs Hive
• Hive is schema on read while SQL is schema on write
• Hive is for analytics while SQL is for transactional workloads
• Hive is a data warehouse and SQL is a database
• SQL supports only structured data while Hive supports structured and semi-structured data
Scala
• Scala is compatible with Java
• Semicolons are optional in scala
• Scala is statically typed
• Everything in Scala is an object
Scala – Conditional Statements
• If statement
• If-else statement
• Nested if-else statement
• If else if Ladder statement
If condition

if(condition)
{
// Statements to be executed
}
Scala If else
if(condition)
{
// If block statements to be executed
} else
{
// Else block statements to be executed
}
If Else If Ladder Statement
if (condition1)
{
//Code to be executed if condition1 is true
}
else if (condition2)
{
//Code to be executed if condition2 is true
}
...
else
{
//Code to be executed if all the conditions are false
}
Nested If else
if (condition_1)
{
if (condition_2)
{
// Executes when condition_2 is true
}
else
{
// Executes when condition_2 is false
}
}
else
{
//can have else statement or nested if else here
}
While loop
while(condition)
{
// Statements to be executed
}
For Loop
for( i <- range)
{
// statements to be executed
}
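A small concrete example of the while and for loop templates above:

var i = 0
while (i < 3) {
  println("while iteration " + i)
  i += 1
}

for (j <- 1 to 3) {       // 1 to 3 is a Range
  println("for iteration " + j)
}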
The Break statement

Break is used to break out of a loop or stop program execution. It skips the remaining iterations. Inside an inner loop, it breaks the execution of the inner loop only.
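Scala has no built-in break keyword; a common approach is scala.util.control.Breaks, sketched below:

import scala.util.control.Breaks._

breakable {
  for (i <- 1 to 10) {
    if (i == 4) break()   // exits the enclosing breakable block
    println(i)            // prints 1, 2, 3
  }
}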
Class and Object
A class is a blueprint that defines the data and behaviour common to its objects.
An object is an instance of that class.
Eg: Vehicle is a class and a particular bus is an instance of that class.
When we define a class, we can then create new objects of that class.
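A minimal sketch of the Vehicle example (the wheels field and describe method are made up for illustration):

class Vehicle(val wheels: Int) {
  def describe(): String = s"Vehicle with $wheels wheels"
}

val bus = new Vehicle(6)    // bus is an object (instance) of the class Vehicle
println(bus.describe())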
What is a constructor
A constructor is a special method of a class or structure in object-
oriented programming that initializes a newly created object of that
type. Whenever an object is created, the constructor is called
automatically.
Auxiliary Constructor
• The constructor name should be 'this' and the signature of each auxiliary constructor must be different from the other constructors in the class
• Each auxiliary constructor must call one of the previously defined constructors of the class
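A small sketch of a primary and an auxiliary constructor (Person and its fields are made-up names):

class Person(val name: String, val age: Int) {    // primary constructor
  def this(name: String) = this(name, 0)          // auxiliary constructor calls the primary one
}

val p1 = new Person("Robin", 30)
val p2 = new Person("Robin")    // uses the auxiliary constructor, age defaults to 0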
Inheritance in Scala
• Single Inheritance
• Multilevel Inheritance
• Hierarchical Inheritance
• Multiple Inheritance
• Hybrid Inheritance
Single Inheritance
[Diagram: one Base Class / Parent Class extended by one Sub Class / Child Class]

MultiLevel Inheritance
[Diagram: Grand Father → Father → Child]

Hierarchical Inheritance
[Diagram: Father → Child1 and Child2]

Multiple Inheritance
[Diagram: Class A and Class B → Class C]

Hybrid Inheritance
[Diagram: Class A → two Traits → Class B and Class C → Class D]
Case class
• A case class cannot inherit another case class
• It supports pattern matching
• No need to use the new keyword to instantiate a case class
• By default, a case class creates a companion object
• All the arguments of a case class are val
• Case classes are immutable. They are helpful for modelling immutable data.
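A short sketch illustrating the points above:

case class Employee(id: Int, name: String)

val e1 = Employee(1, "John")        // no new keyword needed
// e1.id = 2                        // will not compile: the arguments are val (immutable)
val e2 = e1.copy(name = "Kevin")    // create a modified copy instead

e1 match {                          // pattern matching support
  case Employee(id, name) => println(s"$id -> $name")
}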
Abstraction and Final
• Abstraction is the process of hiding implementation. You can define an abstract class and abstract methods. Any class that extends the abstract class must implement all its abstract methods, else the class that extends it becomes abstract as well.
• The final keyword is used to represent something that cannot be changed. If you specify final before a variable, that variable's value cannot be changed. If you specify final in front of a method, the implementation of that method is final and the method cannot be overridden. If you specify final in front of an abstract class, that class cannot be extended, so its abstraction cannot be implemented.
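A brief sketch of abstract and final (Shape and Circle are made-up names):

abstract class Shape {
  def area(): Double                                     // abstract method: no implementation here
}

class Circle(r: Double) extends Shape {
  final override def area(): Double = math.Pi * r * r   // final: cannot be overridden in a subclass
}

println(new Circle(2.0).area())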
Higher Order Functions
Higher order functions are functions that take another function as an argument or return a function as a result.

Eg: print(sf.addzero(sf.sub_reverse("Robin")))
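For instance, a function that takes another function as an argument (applyTwice and addOne are made-up names for illustration):

// applyTwice is higher order: it takes the function f as an argument
def applyTwice(f: Int => Int, x: Int): Int = f(f(x))

def addOne(n: Int): Int = n + 1

println(applyTwice(addOne, 5))       // 7
println(applyTwice(n => n * 2, 5))   // 20, passing an anonymous function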
Lambda
Lambda functions are also called anonymous functions. An anonymous function is a function which has no name but works as a function. Lambda expressions are basically a shorthand notation for functions.
val square_1 = (x: Int) => x * x
Option Type
Option Type is a container which holds either zero or one element.
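A quick sketch of Option:

val found: Option[String]   = Some("Robin")
val missing: Option[String] = None

println(found.getOrElse("unknown"))     // Robin
println(missing.getOrElse("unknown"))   // unknown

// Options are also common as the result of lookups, e.g. Map#get
val ages = Map("Robin" -> 30)
println(ages.get("Kevin"))              // None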
Scala Pattern Matching

<expression> match {
  case pattern => logic
  case pattern => logic
  ……………..
  case _ => default_logic
}
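A concrete example of the template above:

def describe(x: Any): String = x match {
  case 0         => "zero"
  case n: Int    => s"an Int: $n"
  case s: String => s"a String: $s"
  case _         => "something else"     // default case
}

println(describe(10))        // an Int: 10
println(describe("Robin"))   // a String: Robin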
Scala Collections
Scala collections are containers that hold a sequenced, linear set of elements.
Scala collections can be mutable or immutable.

scala.collection.mutable contains all mutable collections. If you want to use the mutable collections, you must import this package.

scala.collection.immutable contains all immutable collections. Scala imports this package by default.
Collection: Description

Set: Stores unique elements. Does not maintain any order. Elements can be of different datatypes.
Eg: val games = Set("cricket", "Football", "Hockey")

Seq: Represents indexed sequences that are guaranteed to be immutable. You can iterate through the elements using a for loop. Seq() returns a List by default.
Eg: var seq: Seq[Int] = Seq(52, 85, 1, 8, 3, 2, 7)

List: Stores ordered elements. It can take a combination of different types.
Eg: val games = List("cricket", "football", "hockey")

Vector: A general-purpose, immutable data structure. It provides random access to elements. It is good for large collections of elements.
Eg: var vector2 = Vector(5, 2, 6, 3)

Queue: Implements a data structure that allows inserting and retrieving elements in a first-in-first-out (FIFO) manner. In Scala, Queue is implemented as a pair of lists: one is used to insert the elements and the second contains the deleted elements. Elements are added to the first list and removed from the second list.
Eg: var queue = Queue(1, 5, 6, 2, 3, 9, 5, 2, 5)


Collection: Description

Map: Stores elements in pairs of keys and values. In Scala, you can create a map in two ways: using comma-separated pairs or using the rocket operator.
var map = Map(("A","Apple"),("B","Ball"))
var map2 = Map("A"->"Apple","B"->"Ball")
Using map.keys we get the keys. Using map.values we get the values.

Tuple: A collection of elements in ordered form. If there is no element present, it is called an empty tuple. It can take any datatype.
val t1=(1,2,"Robin",222.5)
Access elements using a dot, eg: to get the first element, t1._1
To iterate through a tuple we use productIterator, so it will be t1.productIterator.foreach(println)

Array: A collection of elements of a similar datatype. Elements are accessed using an index.
var arr = Array(1,2,3,4,5)


Collection Methods
Method: Description

map: Takes a function and applies that function to every element in the collection.
Eg: val mul=num.map(x=>x*2)

flatMap: Does what map does and also flattens the elements.
Eg: val fl=list.flatMap(x=>x.split("~"))

filter: Filters the elements.
Eg: val fil=list.filter(x=>x%2==0)

count: Takes a filter condition and gives the count of elements that satisfy the filter condition.
Eg: val cc=num.count(x=>x%2==0)

exists: Returns true if a particular condition is met for at least one element.
Eg: val exist_even=num.exists(x=>x%2==0)
Collection Methods
Method: Description

foreach: Loops through the collection.
Eg: list.foreach(println)

partition: Groups the elements. You specify a condition; elements satisfying that condition are grouped into one partition and the remaining elements into another group.
Eg: val part_even=num.partition(x=>x%2==0)
It creates two lists: one with the even elements and the other with the non-even elements.

reduce / reduceLeft / reduceRight: Reduce methods are applied on a collection to perform binary operations. They take 2 elements of the collection at a time and apply the operation.
val num=List(1,2,3,4)
Eg: num.reduceLeft(_+_)
Eg: num.reduceRight(_+_)
Collection Methods
Method: Description

foldLeft / foldRight: Do what reduceLeft and reduceRight do. The only difference is that foldLeft and foldRight take an initial value.
Eg: val fold_left=name.foldLeft("robin")(_+_)

scanLeft / scanRight: Same as fold. The basic difference is that with fold we only get the final output, while with scan we get the intermediate results as well as the output.
Eg: val scan_right=name.scanRight("ron")(_+_)
groupBy and grouped
val ages=List(1,2,7,30,32,35)
val gr=ages.groupBy(age=>if(age>30) "Senior" else "Junior")

val grp=ages.grouped(2).foreach(x=>println(x+" "))


Spark
Apache Spark is an open source parallel processing framework for
running large-scale data analytics applications across clustered
computers. It can handle both batch and real-time analytics and data
processing workloads.
Spark Architecture
• Programming Languages: Scala, Python, Java, R
• Libraries: Spark SQL, Spark Streaming, Spark MLlib, GraphX
• Spark Core (Processing Engine)
• Resource Management: Standalone
Word Count Steps

Input lines:
Hadoop User Scala
Scala Hadoop Spark
User User Hadoop
Scala Scala Scala

Split into words and map each word to (word, 1):
(Hadoop,1) (User,1) (Scala,1) (Scala,1) (Hadoop,1) (Spark,1) (User,1) (User,1) (Hadoop,1) (Scala,1) (Scala,1) (Scala,1)

Reduce by key (sum the counts):
(Hadoop,3) (User,3) (Scala,5) (Spark,1)
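The same flow expressed as Spark RDD code (a sketch: the input path is a placeholder and an existing SparkContext sc is assumed):

// assuming an existing SparkContext `sc`
val counts = sc.textFile("/path/to/input.txt")   // read the lines
  .flatMap(line => line.split(" "))              // split each line into words
  .map(word => (word, 1))                        // emit (word, 1) pairs
  .reduceByKey(_ + _)                            // sum the 1s per word

counts.collect().foreach(println)                // e.g. (Hadoop,3), (User,3), (Scala,5), (Spark,1)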
Resilient Distributed Datasets (RDD)
RDD is an immutable collection of objects. It is a read-only, partitioned collection of records. An RDD basically represents the data spread across the nodes in the cluster.

Operations of RDD:
Transformations
Actions
Spark Vs MapReduce
Spark:
• Spark follows lazy evaluation
• Spark follows a bottom-to-top approach (action to RDD)
• In-memory processing

Map Reduce:
• Map Reduce follows linear evaluation
• Map Reduce follows a top-to-bottom approach
• Frequent hits to the hard disk
Common Transformations
Transformation Description
map Takes a function and applies that function to all the elements in the collection
flatMap Does the same functionality as map except that flatMap will flatten the result
filter Filter will filter the records based on the condition
distinct Distinct is to remove the duplicates
Common transformations on 2 RDDs
Transformation Description
union Union combines two RDDs and returns all elements from these two RDDs
intersection Returns elements present in both the RDDs
subtract Returns an RDD with the contents of the other RDD removed
Common RDD Actions
Action Description
collect Returns all elements from RDD and stores in memory
count Returns the number of elements in the RDD
take(n) Returns ‘n’ number of elements from RDD
reduce The reduce() function takes two elements from the RDD as input and produces an output of the same type as the input elements. A simple example of such a function is addition
foreach Apply function to each element in the RDD
Pair RDD Functions
Function Description
groupByKey groups all the values that belong to the same key as one
reduceByKey Returns a merged RDD by merging the values of each key.
mapValues mapValues takes a function and applies that function to only the values of the pairRDD
keys Returns the keys of a pair RDD
values Returns the values of a pair RDD
countByKey countByKey simply counts the number of elements per key in a pair RDD
Spark Dataframes
Data frames are distributed collections of data organized into named columns.

Process the data frame by:
1) Creating a temp view
2) Using the DSL
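A small sketch of both approaches (the column names are just examples and an existing SparkSession named spark is assumed):

// assuming an existing SparkSession `spark`
import spark.implicits._

val df = Seq((1, "John", "India"), (2, "Kevin", "Australia")).toDF("id", "name", "country")

// 1) Temp view + SQL
df.createOrReplaceTempView("people")
spark.sql("select country, count(*) as cnt from people group by country").show()

// 2) DSL
df.groupBy("country").count().show()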
Spark DataFrames
Using schema RDD:
Eg: rdd1.toDF()

Using row RDD:
val df = spark.createDataFrame(rowrdd, structschema)
Spark DataFrame Seamless
For Reading:
spark.read.format().option().load()

For Writing:
df.write.format().option().save()
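For example (the paths and options are placeholders):

val csvDf = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/data/input/customers.csv")

csvDf.write.format("parquet")
  .mode("overwrite")
  .save("/data/output/customers_parquet")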
Spark XML Write Jars
Jar Jar location in MVN
commons-io-2.8.0.jar: https://mvnrepository.com/artifact/commons-io/commons-io/2.8.0
txw2-2.3.3.jar: https://mvnrepository.com/artifact/org.glassfish.jaxb/txw2/2.3.3
xmlschema-core-2.2.5.jar: https://mvnrepository.com/artifact/org.apache.ws.xmlschema/xmlschema-core/2.2.5
Spark Write Modes
Write Mode Description
Error This is the default mode. If directory found, it will throw error
Append If directory found, append to that directory
Ignore If directory found, just ignore. Do not fail the job
Overwrite If directory found, overwrite it
Spark SQL – Working with Columns
Function Description
select Used to select the required columns
selectExpr Does what select does. In addition, it helps in applying sql transformation on the
columns.
withColumn Similar to selectExpr, it allows you to apply transformation on the selected column
while retaining all other columns in the dataframe
withColumnRenamed withColumnRenamed is used to rename a column
case when Acts like a case statement in sql , if then else in programming language
drop Drops the column from the dataframe
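A brief sketch of a few of these (df and its columns id, name and salary are hypothetical):

import org.apache.spark.sql.functions._

val out = df
  .select("id", "name", "salary")
  .selectExpr("id", "name", "salary * 12 as annual_salary")
  .withColumn("band", when(col("annual_salary") > 100000, "A").otherwise("B"))   // case when
  .withColumnRenamed("name", "emp_name")
  .drop("annual_salary")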
String Functions

concat_ws(sep: String, exprs: Column*): Concatenates multiple input string columns together into a single string column, using the given separator
instr(str: Column, substring: String): Locates the position of the first occurrence of the substring in the given string. Returns 0 if no match is found
length(e: Column): Computes the character length of the given string
lower(e: Column): Converts a string to lower case
upper(e: Column): Converts a string to upper case
lpad(str: Column, len: Int, pad: String): Left-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters
rpad(str: Column, len: Int, pad: String): Right-pads the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters
repeat(str: Column, n: Int): Repeats a string column n times and returns it as a new string column
ltrim(e: Column): Trims the spaces from the left end of the specified string value
rtrim(e: Column): Trims the spaces from the right end of the specified string value
split(str: Column, regex: String): Splits str around matches of the given regex
substring(str: Column, pos: Int, len: Int): Substring starts at pos and is of length len
regexp_replace(e: Column, pattern: String, replacement: String): Replaces all substrings of the specified string value that match the pattern with the replacement
Working with Dates

current_date(): Returns the current date as a date column
date_format(dateExpr: Column, format: String): Converts a date/timestamp/string to a string value in the format specified by the date format given as the second argument
to_date(e: Column): Converts the column into DateType by casting rules to DateType
to_date(e: Column, fmt: String): Converts the column into a DateType with the specified format
add_months(startDate: Column, numMonths: Int): Returns the date that is numMonths after startDate
date_add(start: Column, days: Int): Returns the date that is days days after start
datediff(end: Column, start: Column): Returns the number of days from start to end
months_between(end: Column, start: Column): Returns the number of months between dates start and end
next_day(date: Column, dayOfWeek: String): Returns the first date which is later than the value of the date column that is on the specified day of the week
trunc(date: Column, format: String): Returns the date truncated to the unit specified by the format. For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01
year(e: Column): Extracts the year as an integer from a given date/timestamp/string
quarter(e: Column): Extracts the quarter as an integer from a given date/timestamp/string
month(e: Column): Extracts the month as an integer from a given date/timestamp/string
dayofweek(e: Column): Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday
dayofmonth(e: Column): Extracts the day of the month as an integer from a given date/timestamp/string
dayofyear(e: Column): Extracts the day of the year as an integer from a given date/timestamp/string
weekofyear(e: Column): Extracts the week number as an integer from a given date/timestamp/string. A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601
last_day(e: Column): Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015
Join Strategy Types
• Broadcast hash join
val left_join = df1.join(df2, df1("dno") === df2("dno"), "left")
val left_join = df1.join(df2.hint("broadcast"), df1("dno") === df2("dno"), "left")

• Shuffle Hash Join
• Sort Merge Join


Use case – Bank Transactions
• In each location, identify the record with the highest CustAccountBalance
• Which location has the most transactions
• Which location has the highest sum of total_transaction_amount
(a DataFrame sketch follows below)
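One possible DataFrame sketch for these questions (txDf and the column names CustLocation and TransactionAmount are assumed; only CustAccountBalance appears above):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// highest CustAccountBalance per location
val w = Window.partitionBy("CustLocation").orderBy(col("CustAccountBalance").desc)
val topPerLocation = txDf.withColumn("rn", row_number().over(w)).filter(col("rn") === 1)

// location with the most transactions
txDf.groupBy("CustLocation").count().orderBy(col("count").desc).show(1)

// location with the highest total transaction amount
txDf.groupBy("CustLocation")
  .agg(sum("TransactionAmount").as("total_transaction_amount"))
  .orderBy(col("total_transaction_amount").desc)
  .show(1)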
Spark-submit example
spark-submit --master local[*] --class sparkPack.SparkObj
/home/cloudera/jarpath/SparkDeployment-0.0.1-SNAPSHOT.jar
Spark-submit
spark-submit --master yarn
--deploy-mode cluster
--executor-cores 5
--executor-memory 19g
--num-executors 15
--driver-memory 2g
--conf "spark.driver.extraClassPath=/user/cloudera/jarpath/*"
--class sparkPack.obj1 /home/cloudera/jarpath/SparkDeployment-0.0.1-SNAPSHOT.jar
What is NoSQL
NoSQL, also referred to as “not only SQL”, “non-SQL”, is an approach to
database design that enables the storage and querying of data outside
the traditional structures found in relational databases
Types of NoSQL Databases
• Key Value Store
• Document Oriented
• Graph Db
• Columnar Oriented
Features of NOSQL Database
• They have dynamic schema
• Auto sharding
• Replication
• No Joins
• Integrated caching
• Easily Scalable
• Highly distributable
Hbase Architecture
Hbase
• hadoop dfsadmin -safemode leave
• hadoop fs -rmr /hbase
• hadoop fs -mkdir -p /hbase/data
• sudo service hbase-master restart
• sudo service hbase-regionserver restart
Hbase Connectors
• hbase-client-1.1.2.2.6.2.0-205.jar
• hbase-common-1.1.2.2.6.2.0-205.jar
• shc-core-1.1.1-2.1-s_2.11.jar
• hbase-protocol-1.1.2.2.6.2.0-205.jar
• hbase-server-1.1.2.2.6.2.0-205.jar
Cassandra
[Diagram: Cassandra keyspace and tables]
Cassandra Placement Strategy
• Simple Strategy
• Network Strategy
Simple Strategy
Network Topology
Cassandra Query Limitations
• CQL does not support aggregation queries like max, min, avg
• CQL does not support group by, having queries.
• CQL does not support joins.
• CQL does not support OR queries.
• CQL does not support wildcard queries.
• CQL does not support Union, Intersection queries.
• Table columns cannot be filtered without creating the index.
• Greater than (>) and less than (<) queries are only supported on clustering columns.
• Cassandra Query Language is not suitable for analytics purposes because it has so many limitations.
Apache NIFI
• Processor: a component that does a specific task
• Processor Group: groups many processors together
• Controller Service
• Flow File: how the data propagates through the processors
What is Streaming
Streaming is consuming or processing data in real time, as it is produced.

Eg: YouTube, Netflix, Amazon Prime and WhatsApp stream data in real time to the user
Kafka

[Diagram: Producer → Topics (Topic1, Topic2) → Consumer (Group-ID)]
Consumption Model
• Earliest
• Latest

val kparams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "consumer_1_demo_topic",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean))


Kafka Delivery Guarantee
• Fire and Forget
• Async
• Sync
Spark Structured Streaming
• File Streaming : It will continuously read the files from a directory
• Socket Streaming : Keep listening to a port and read the data
• Kafka : Read and write into Kafka
• Kinesis : Integrates with AWS Kinesis
Structured Streaming with Kafka
The Kafka source is presented as an unbounded input table with the columns: key, value, topic, partition, offset, timestamp
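A minimal structured streaming sketch reading from Kafka (the topic name and servers are placeholders; the spark-sql-kafka package must be on the classpath and an existing SparkSession spark is assumed):

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "demo_topic")
  .option("startingOffsets", "latest")
  .load()    // columns: key, value, topic, partition, offset, timestamp

val query = kafkaDf.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .outputMode("append")
  .start()

query.awaitTermination()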
Spark Performance Tuning Tips
• Improve the performance at Code Level
• Use the Right File Format
• Have the optimized configurations
• Spark Optimizations
