PySpark
It offers the PySpark Shell, which connects the Python API to the Spark core and in turn initializes the SparkContext.
More on PySpark
SparkContext uses Py4J to launch a JVM and creates a JavaSparkContext.
By default, PySpark has SparkContext available as sc, so creating a new SparkContext won't work.
Py4J
PySpark is built on top of Spark's Java API.
Data is processed in Python and cached/shuffled in the JVM.
Py4J enables Python programs running in a Python interpreter to dynamically access Java objects in a Java Virtual Machine.
Methods are called as if the Java objects resided in the Python interpreter, and Java collections can be accessed through standard Python collection methods.
More on Py4J
In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
To establish local communication between the Python and Java SparkContext objects, Py4J is used on the driver.
Py4J ships with PySpark and is imported automatically.
Getting Started
We can enter Spark's Python environment by running the given command in the shell:

./bin/pyspark
This will start your PySpark shell.
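Inside the shell, the SparkContext is already available as sc and can be used directly. A minimal sketch (the sample data is an assumption):

rdd = sc.parallelize(range(10))   # distribute a small collection across the cluster
print(rdd.count())                # 10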
Resilient Distributed Datasets (RDDs)
Resilient Distributed Datasets (RDDs) are the main abstraction in Spark.
An RDD is a partitioned collection of objects spread across a cluster, and it can be persisted in memory or on disk.
Once created, RDDs are immutable.
RDDs can be created in two ways:
1. Parallelizing a collection in the driver program.
2. Referencing a dataset in an external storage system, such as a shared filesystem, HBase, HDFS, or any data source providing a Hadoop InputFormat.
Features Of RDDs
Resilient - tolerant to faults using the RDD lineage graph, and therefore able to recompute damaged or missing partitions caused by node failures.
Dataset - a set of partitioned data with primitive values or values of values, for example records or tuples.
Distributed - data resides on multiple nodes in a cluster.
Creating RDDs
Parallelizing a collection in the driver program.
E.g., here is how to create a parallelized collection holding the numbers 1 to 5:
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
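Once created, the parallelized collection can be operated on in parallel. For instance, a small sketch that sums its elements using the reduce action covered later in this section:

distData.reduce(lambda a, b: a + b)   # returns 15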
Creating RDDs
The textFile method takes a URI for the file (a local path on the machine, or an hdfs://, s3n://, etc. URI) and reads it as a collection of lines to produce the RDD.
distFile = sc.textFile("data.txt")
RDD Operations
RDDs support two types of operations: transformations, which create a new
dataset from an existing one, and actions, which return a value to the driver program
after running a computation on the dataset.
For example, map is a transformation that passes each dataset element through a
function and returns a new RDD representing the results.
Similarly, reduce is an action that aggregates all the RDD elements using some function and returns the final result to the driver program.
More On RDD Operations
As a recap to RDD basics, consider the simple program shown below:
lines = sc.textFile("data.txt")
lineLengths = lines.map(lambda s: len(s))
totalLength = lineLengths.reduce(lambda a, b: a + b)
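Because transformations are lazy, lineLengths is only computed when the reduce action runs. If we also wanted to reuse lineLengths later, we could persist it before the reduce, e.g.:

lineLengths.persist()   # keep lineLengths in memory after it is first computed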
Transformations
Transformations are functions that use an RDD as the input and return one or
more RDDs as the output.
randomSplit, cogroup, join, reduceByKey, filter, and map are examples of a few transformations.
Transformations do not change the input RDD, but always create one or more new RDDs by utilizing the computations they represent.
By using transformations, you incrementally create an RDD lineage with all the parent RDDs of the last RDD.
Transformations are lazy, i.e. they are not run immediately; they are computed on demand.
Examples Of Transformations
filter(func): Returns a new dataset (RDD) formed by choosing the elements of the source on which the function returns true.
map(func): Passes each element of the RDD via the supplied function.
union(): The new RDD contains elements from the source RDD and the argument RDD.
intersection(): The new RDD includes only the common elements from the source RDD and the argument RDD.
cartesian(): The new RDD is the cross product of all elements from the source RDD and the argument RDD.
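A minimal sketch of these transformations in the PySpark shell; the sample RDD contents are assumptions chosen for illustration:

a = sc.parallelize([1, 2, 3, 4])
b = sc.parallelize([3, 4, 5, 6])

evens = a.filter(lambda x: x % 2 == 0)   # keeps 2 and 4
squares = a.map(lambda x: x * x)         # 1, 4, 9, 16
both = a.union(b)                        # 1, 2, 3, 4, 3, 4, 5, 6
common = a.intersection(b)               # 3 and 4
pairs = a.cartesian(b)                   # (1, 3), (1, 4), ... all 16 pairs

print(common.collect())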
Actions
Actions return the final results of RDD computations.
Actions trigger execution using the lineage graph to load the data into the original RDD, carry out all intermediate transformations, and write the final results out to the file system or return them to the driver program.
count, collect, reduce, take, and first are a few actions in Spark.
Example of Actions
count(): Get the number of data elements in the RDD.
collect(): Get all the data elements in the RDD as an array.
reduce(func): Aggregate the data elements in the RDD using a function which takes two arguments and returns one.
take(n): Fetch the first n data elements of the RDD and return them to the driver program.
foreach(func): Execute the function for each data element in the RDD; usually used to update an accumulator or interact with external systems.
first(): Retrieve the first data element of the RDD. It is similar to take(1).
saveAsTextFile(path): Write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS.
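A short sketch exercising these actions on a small RDD; the sample data and output path are assumptions:

nums = sc.parallelize([5, 3, 1, 4, 2])

print(nums.count())                     # 5
print(nums.collect())                   # [5, 3, 1, 4, 2]
print(nums.reduce(lambda a, b: a + b))  # 15
print(nums.take(3))                     # [5, 3, 1]
print(nums.first())                     # 5
nums.saveAsTextFile("nums_output")      # writes the RDD as text files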
What is a DataFrame?
In general, a DataFrame can be defined as a data structure which is tabular in nature.
It represents rows, each of which consists of a number of observations.
Rows can have a variety of data formats (heterogeneous), whereas a column can have data of the same data type (homogeneous).
DataFrames mainly contain some metadata in addition to data, like column and row names.
Why DataFrames?
They can process structured as well as semi-structured data.
They have the ability to handle petabytes of data.
In conclusion, a DataFrame is data organized into named columns.
Features of DataFrame
Distributed
Lazy Evals
Immutable
Features Explained
DataFrames are distributed in nature, which makes them a fault-tolerant and highly available data structure.
Lazy evaluation is an evaluation strategy which holds the evaluation of an expression until its value is needed.
DataFrames are immutable in nature, which means they are objects whose state cannot be modified after they are created.
DataFrame Sources
For constructing a DataFrame, a wide range of sources are available, such as:
Structured data files
Tables in Hive
External databases
Existing RDDs
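As a sketch of the last source, an existing RDD of tuples can be converted into a DataFrame; the column names and data here are assumptions, and spark is an existing SparkSession:

rdd = sc.parallelize([(1, "Alice"), (2, "Bob")])   # an existing RDD of tuples
df = spark.createDataFrame(rdd, ["id", "name"])    # name the columns explicitly
df.show()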
Spark SQL
Spark introduces a programming module for structured data processing called Spark
SQL.
Spark SQL provides the main capabilities for using structured and semi-structured data.
For more details about Spark SQL, refer to the Fresco course Spark SQL.
pyspark.sql.GroupedData: Aggregation methods, returned by DataFrame.groupBy().
More On Classes
pyspark.sql.DataFrameNaFunctions: Methods for handling missing data (null values).
pyspark.sql.DataFrameStatFunctions: Methods for statistics functionality.
pyspark.sql.functions: List of built-in functions available for DataFrame.
pyspark.sql.types: List of data types available.
pyspark.sql.Window: For working with window functions.
More On Creation
from pyspark.sql import Row

Student = Row("firstName", "lastName", "age", "telephone")
s1 = Student('David', 'Julian', 22, 100000)
s2 = Student('Mark', 'Webb', 23, 658545)
StudentData=[s1,s2]
df=spark.createDataFrame(StudentData)
df.show()
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
passenger = Row("Name", "age", "source", "destination")
s1 = passenger('David', 22, 'London', 'Paris')
s2 = passenger('Steve', 22, 'New York', 'Sydney')
x = [s1,s2]
df1=spark.createDataFrame(x)
df1.show()
Result of show()
Once show() is executed, we can view the following result in the PySpark shell:
+---------+--------+---+---------+
|firstName|lastName|age|telephone|
+---------+--------+---+---------+
|    David|  Julian| 22|   100000|
|     Mark|    Webb| 23|   658545|
+---------+--------+---+---------+
Data Sources
Spark SQL supports operating on a variety of data sources through the DataFrame interface.
A DataFrame can be operated on using relational transformations and can also be registered as a temporary view, which allows running SQL queries over its data.
This chapter describes the general methods for loading and saving data using the Spark Data Sources.

df = spark.read.load("file path")
# Spark loads the data source from the defined file path

df.select("column name", "column name").write.save("file name")
# The DataFrame is saved in the defined format
# By default it is saved in the Spark Warehouse

The file path can be on the local machine as well as on HDFS.
Manually Specifying Options
You can also manually specify the data source that will be used, along with any extra options that you would like to pass to the data source.
Data sources are specified by their fully qualified name, but for built-in sources you can also use their short names (json, parquet, jdbc, orc, libsvm, csv, text).
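For example, a brief sketch of loading and saving with an explicitly specified source; the file names and column names are assumptions:

df = spark.read.load("people.json", format="json")
df.select("name", "age").write.save("namesAndAges.parquet", format="parquet")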
Apache Parquet
Spark SQL provides support for both reading and writing Parquet files.
Automatic conversion to nullable occurs when one tries to write Parquet files, for compatibility reasons.
Reading A Parquet File
Here we are loading a json file into a DataFrame, writing it out as Parquet, and querying it with Spark SQL.

spark = SparkSession.builder.getOrCreate()
df = spark.read.json("emp.json")
df.show()
df.write.parquet("Employees")
df.createOrReplaceTempView("data")
res = spark.sql("select age,name,stream from data where stream='JAVA'")
res.show()
res.write.parquet("JavaEmployees")
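The Parquet output written above can be read back into a DataFrame; a brief sketch:

emp = spark.read.parquet("Employees")   # read the Parquet files written earlier
emp.show()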
Parquet is the choice for Big Data because it serves both needs: efficiency and performance in both storage and processing.
What is a CSV file?
CSV is a file format which allows the user to store the data in tabular format.
CSV stands for comma-separated values.
Its data fields are most often separated, or delimited, by a comma.
CSV Loading
To load a CSV dataset, the user has to make use of the spark.read.csv method to load it into a DataFrame.
Here we are loading a football player dataset using the Spark csv reader.
inferSchema (default false): infers the input schema automatically from the data.
header (default false): uses the first line as column names.
To verify, we can run df.show(2).
The argument 2 will display the first two rows of the resulting DataFrame.
For every example from now onwards, we will be using the football player DataFrame.
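A sketch of the load described above; the file name is an assumption:

df = spark.read.csv("football_players.csv", inferSchema=True, header=True)
df.show(2)   # display the first two rows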
Schema of DataFrame
What is meant by schema? It's just the structure of the DataFrame.
To check the schema, one can make use of the printSchema method.
It results in the different columns in our DataFrame, along with the datatype and the nullable conditions.
df.printSchema()
root
 |-- ID: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Overall: integer (nullable = true)
 |-- Potential: integer (nullable = true)
 |-- Club: string (nullable = true)
 |-- Value: string (nullable = true)
 |-- Wage: string (nullable = true)
 |-- Special: integer (nullable = true)
df.columns
['ID', 'Name', 'Age', 'Nationality', 'Overall', 'Potential', 'Club', 'Value',
'Wage', 'Special']
Row count
df.count()
17981
Column count
len(df.columns)
10
Describing a Particular Column
To get the summary of any particular column, make use of the describe method.
This method gives the statistical summary of the given column; if not specified, it provides the statistical summary of the DataFrame.
df.describe('Name').show()
The result will be as shown below.
+-------+-------------+
|summary|         Name|
+-------+-------------+
|  count|        17981|
|   mean|         null|
| stddev|         null|
|    min|     A. Abbas|
|    max|Óscar Whalley|
+-------+-------------+
df.describe('Age').show()
+-------+------------------+
|summary|               Age|
+-------+------------------+
|  count|             17981|
|   mean|25.144541460430453|
| stddev| 4.614272345005111|
|    min|                16|
|    max|                47|
+-------+------------------+
Selecting Multiple Columns
For selecting particular columns from the DataFrame, one can use the select method.
The syntax for performing the selection operation is df.select("column name", "column name").show() (show() is optional).
One can load the result into another DataFrame by simply equating, i.e. dfnew = df.select(...).
Selection Operation
Selecting the columns ID and Name and loading the result to a new DataFrame:

dfnew=df.select('ID','Name')
dfnew.show(5)
+------+-----------------+
|    ID|             Name|
+------+-----------------+
| 20801|Cristiano Ronaldo|
|158023|         L. Messi|
|190871|           Neymar|
|176580|        L. Suárez|
|167495|         M. Neuer|
+------+-----------------+
only showing top 5 rows
Filtering Data
For filtering the data, the filter command is used.

df.filter(df.Club=='FC Barcelona').show(3)

The result will be as follows:
+------+----------+---+-----------+-------+---------+------------+------+-----+-------+
|    ID|      Name|Age|Nationality|Overall|Potential|        Club| Value| Wage|Special|
+------+----------+---+-----------+-------+---------+------------+------+-----+-------+
|158023|  L. Messi| 30|  Argentina|     93|       93|FC Barcelona| €105M|€565K|   2154|
|176580| L. Suárez| 30|    Uruguay|     92|       92|FC Barcelona|  €97M|€510K|   2291|
|168651|I. Rakitić| 29|    Croatia|     87|       87|FC Barcelona|€48.5M|€275K|   2129|
+------+----------+---+-----------+-------+---------+------------+------+-----+-------+
only showing top 3 rows, since we had given 3 as the argument to show().
Verify the same on your own.
To filter our data based on multiple conditions (AND or OR):
df.filter((df.Club=='FC Barcelona') & (df.Nationality=='Spain')).show(3)
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|    ID|           Name|Age|Nationality|Overall|Potential|        Club| Value| Wage|Special|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|152729|          Piqué| 30|      Spain|     87|       87|FC Barcelona|€37.5M|€240K|   1974|
|    41|        Iniesta| 33|      Spain|     87|       87|FC Barcelona|€29.5M|€260K|   2073|
|189511|Sergio Busquets| 28|      Spain|     86|       86|FC Barcelona|  €36M|€250K|   1998|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
only showing top 3 rows
Sorting
For sorting, the orderBy operation is used. The result of the first orderBy operation is shown in the following output.
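A sketch of an orderBy call consistent with the output below, sorting the filtered Spanish FC Barcelona players by Age in descending order (the exact call is an assumption):

df.filter((df.Club == 'FC Barcelona') & (df.Nationality == 'Spain')) \
  .orderBy(df.Age.desc()).show(5)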
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|    ID|           Name|Age|Nationality|Overall|Potential|        Club| Value| Wage|Special|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
|    41|        Iniesta| 33|      Spain|     87|       87|FC Barcelona|€29.5M|€260K|   2073|
|152729|          Piqué| 30|      Spain|     87|       87|FC Barcelona|€37.5M|€240K|   1974|
|189332|     Jordi Alba| 28|      Spain|     85|       85|FC Barcelona|€30.5M|€215K|   2206|
|189511|Sergio Busquets| 28|      Spain|     86|       86|FC Barcelona|  €36M|€250K|   1998|
|199564|  Sergi Roberto| 25|      Spain|     81|       86|FC Barcelona|€19.5M|€140K|   2071|
+------+---------------+---+-----------+-------+---------+------------+------+-----+-------+
only showing top 5 rows
Random Data Generation
Random data generation is useful when we want to test algorithms and to implement new ones.
In Spark, under sql.functions we have methods to generate random data, e.g., uniform (rand) and standard normal (randn).
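The id column shown below can be produced with the range method; a minimal sketch, assuming an existing SparkSession named spark:

df = spark.range(0, 4)   # DataFrame with a single `id` column holding 0..3
df.show()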
+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
+---+
By using the uniform distribution and the normal distribution, we generate two more columns.
df.select("id", rand(seed=10).alias("uniform"),
randn(seed=27).alias("normal")).show()
+---+-------------------+-------------------+
| id| uniform| normal|
+---+-------------------+-------------------+
| 0|0.41371264720975787| 0.5888539012978773|
| 1| 0.1982919638208397|0.06157382353970104|
In
| 2|0.12030715258495939| 1.0854146699817222|
ed
ed
| 3|0.44292918521277047|-0.4798519469521663|
+---+-------------------+-------------------+
nk
nk
Li
Li
Summary and Descriptive Statistics
The first operation to perform after importing data is to get some sense of what it looks like.
The function describe returns a DataFrame containing information such as the number of non-null entries (count), mean, standard deviation, and minimum and maximum value for each numerical column.
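A sketch of the describe call that produces a summary like the one below (column names as generated above):

df.describe('uniform', 'normal').show()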
+-------+-------------------+--------------------+
|summary|            uniform|              normal|
+-------+-------------------+--------------------+
|  count|                 10|                  10|
|   mean| 0.3841685645682706|-0.15825812884638607|
| stddev|0.31309395532409323|   0.963345903544872|
|    min|0.03650707717266999| -2.1591956435415334|
+-------+-------------------+--------------------+
Descriptive Statistics
In the same way, we can also make use of some standard statistical functions.
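For instance, a brief sketch using mean, min, and max from pyspark.sql.functions on the uniform column generated above:

from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()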
Sample Co-Variance and Correlation
In statistics, co-variance indicates how one random variable changes with respect to another.
A positive value indicates a trend of increase when the other increases.
A negative value indicates a trend of decrease when the other increases.
The sample co-variance of two columns of a DataFrame can be calculated as follows:
More On Co-Variance
from pyspark.sql.functions import rand
df = sqlContext.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))
df.stat.cov('rand1', 'rand2')
0.031109767020625314
From the above, we can infer that the co-variance of the two random columns is near zero.
Correlation
Correlation provides the statistical dependence of two random variables.
df.stat.corr('rand1', 'rand2')
0.30842745432650953
Cross Tabulation (Contingency Table)
Cross tabulation provides a frequency distribution table for a given set of variables.
It is one of the powerful tools in statistics to observe the statistical independence of variables.
Consider an example. A DataFrame with name and item columns can be built as follows (a sketch consistent with the crosstab output shown below):

names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = sqlContext.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])

For applying the cross tabulation, we can make use of the crosstab method.
df.stat.crosstab("name", "item").show()
+---------+------+-----+------+----+-------+
|name_item|apples|bread|butter|milk|oranges|
+---------+------+-----+------+----+-------+
|      Bob|     6|    7|     7|   6|      7|
|     Mike|     7|    6|     7|   7|      6|
|    Alice|     7|    7|     6|   7|      7|
+---------+------+-----+------+----+-------+
Putting it together - computing the co-variance and correlation of two random columns and writing the result as Parquet:

from pyspark.sql import Row
from pyspark.sql.functions import rand

df = sqlContext.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))
a = df.stat.cov('rand1', 'rand2')
b = df.stat.corr('rand1', 'rand2')
df1 = Row("Measure", "Value")   # assumed Row template for the result rows
s1 = df1("Co-variance", a)
s2 = df1("Correlation", b)
x = [s1, s2]
df2 = spark.createDataFrame(x)
df2.show()
df2.write.parquet("Result")
Spark SQL blurs the lines between RDDs and relational tables.
By integrating these powerful features, Spark makes it easy for developers to use SQL commands for querying external data with complex analytics, all within a single application.
Performing SQL Queries
We can also pass SQL queries directly to any DataFrame.
For that, we need to create a table from the DataFrame using the registerTempTable method.
After that, use sqlContext.sql() to pass the SQL queries.
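A minimal sketch; the view name and the query columns are assumptions based on the football player DataFrame used earlier:

df.registerTempTable("players")
sqlContext.sql("select Name, Age from players where Age > 30").show()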
Apache Hive
The Apache Hive data warehouse software allows reading, writing, and managing large datasets residing in distributed storage and querying them using SQL syntax.
Apache Hive is built on top of Apache Hadoop.
The features of Apache Hive are mentioned below.
Apache Hive has tools that allow easy and quick access to data using SQL, enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
Mechanisms for imposing structure on a variety of data formats.
Access to files stored either directly in Apache HDFS or in other data storage systems such as Apache HBase.
Query execution via Apache Tez, Apache Spark, or MapReduce.
A procedural language with HPL-SQL.
Sub-second query retrieval via Hive LLAP, Apache YARN, and Apache Slider.
What Hive Provides?
Apache Hive provides standard SQL functionality, which includes many of the later SQL:2003 and SQL:2011 features for analytics.
We can extend Hive's SQL with user code by using user-defined functions (UDFs), user-defined aggregates (UDAFs), and user-defined table functions (UDTFs).
Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet, Apache ORC, and other formats.
Spark SQL supports reading and writing data stored in Hive.
Connecting Hive From Spark
When working with Hive, one must instantiate SparkSession with Hive support, including connectivity to a persistent Hive metastore, support for Hive serdes, and Hive user-defined functions.
from os.path import abspath

# warehouse_location points to the default location for managed databases and tables
warehouse_location = abspath('spark-warehouse')
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL Hive integration example") \
    .config("spark.sql.warehouse.dir", warehouse_location) \
    .enableHiveSupport() \
    .getOrCreate()
Creating Hive Table From Spark
We can easily create a table in the hive warehouse programmatically from Spark.
The syntax for creating a table is as follows:
spark.sql("CREATE TABLE IF NOT EXISTS table_name(column_name_1
DataType,column_name_2 DataType,......,column_name_n DataType) USING hive")
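For instance, a brief sketch with hypothetical table and column names:

spark.sql("CREATE TABLE IF NOT EXISTS employees(id INT, name STRING, age INT) USING hive")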
External tables are used to store data outside the hive warehouse.
The data needs to remain in the underlying location even after the user drops the table.
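A hive external table can be created from Spark in a similar way by specifying a location outside the warehouse; a sketch with hypothetical names and path:

spark.sql("CREATE EXTERNAL TABLE IF NOT EXISTS emp_ext(id INT, name STRING) STORED AS PARQUET LOCATION '/user/hive/external/emp_ext'")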
Loading Data From Spark To The Hive Table
We can load data into a hive table from a DataFrame.
For doing so, the schema of both the hive table and the DataFrame should be equal.
Let us take a sample CSV file.
We can read the CSV file by making use of the Spark csv reader.
The schema of the DataFrame will be the same as the schema of the CSV file itself.
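A sketch of reading such a CSV file; the file name is an assumption:

df = spark.read.csv("employees.csv", header=True, inferSchema=True)
df.printSchema()   # should match the schema of the hive table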
Data Loading To External Table
For loading the data we have to save the dataframe in external hive table location.
df.write.mode('overwrite').format("format").save("location")
Since our hive external table is in parquet format, in place of format we have to mention 'parquet'.
The location should be the same as the hive external table location in the HDFS directory.
If the schema matches, the data will be loaded automatically into the hive table.
We can verify it by querying the hive table.
What is HBase ?
HBase is an Apache open source project whose goal is to provide storage for Hadoop Distributed Computing.
More On HBase
HBase features compression, in-memory operation, and Bloom filters on a per-column basis, as outlined in the original Bigtable paper.
Tables in HBase can serve as the input and output for MapReduce jobs run in Hadoop, and may be accessed through the Java API as well as through the REST, Avro, or Thrift gateway APIs.
It is a column-oriented key-value data store and has been widely adopted because of its lineage with Hadoop and HDFS.
HBase runs on top of HDFS and is well suited for fast read and write operations on large datasets with high throughput and low input/output latency.
How To Connect Spark and HBase
To connect, we require HDFS, Spark, and HBase installed on the local machine.
Make sure that your versions are compatible with each other.
Copy all the HBase jar files to the Spark lib folder.
Once done, set SPARK_CLASSPATH in spark-env.sh to the lib folder.
Real Time Pipeline using HDFS, Spark and HBase
Various Stages
It has 4 main stages, which include the transformation, cleaning, validation, and writing of the data received from the various sources.
Data Transformation
Data Cleaning
Writing
At last, the data passed from the previous three stages is passed on to the writing application, which simply writes this final set of data to HBase for further data analysis.
Spark In Real World
Uber – the online taxi company is an apt example for Spark. It gathers terabytes of event data from its various users.
It uses Kafka, Spark Streaming, and HDFS to build a continuous ETL pipeline.
It converts raw unstructured data into structured data as it is collected.
It uses this further for complex analytics and optimization of operations.
Pinterest – leverages Spark Streaming to gain immediate insight into how users all over the world are engaging with Pins in real time.
It can make more relevant recommendations as people navigate the site.
It recommends related Pins.
It determines which products to buy, or destinations to visit.
Conviva – uses Spark for reducing customer churn by managing live video traffic and optimizing video streams.
Spark In Real World
Capital One – makes use of Spark and data science algorithms for a better understanding of its customers.
from pyspark.sql import *
spark = SparkSession.builder.getOrCreate()
df = Row("ID","Name","Age","AreaofInterest")
s1 = df("1","Jack",22,"Data Science")
s2 = df("2","Leo",21,"Data Analytics")
s3 = df("3","Luke",24,"Micro Services")
s4 = df("4","Mark",21,"Data Analytics")
x = [s1,s2,s3,s4]
df1 = spark.createDataFrame(x)
df3 = df1.describe("Age")
df3.show()
df3.write.parquet("Age")
df1.createOrReplaceTempView("data")
df4 = spark.sql("select ID,Name,Age from data order by ID desc")
df4.show()
df4.write.parquet("NameSorted")