
Day 9: DataFrames

1. Download the store_locations.json file from Google Drive

2. Create a directory in HDFS

hadoopuser@hadoopuser-VirtualBox:~$ hdfs dfs -mkdir /sparkLabData

hadoopuser@hadoopuser-VirtualBox:~$ hdfs dfs -ls /sparkLabData

3. Copy the downloaded store_locations.json from the Desktop (local file system) into HDFS

hadoopuser@hadoopuser-VirtualBox:~$ hdfs dfs -copyFromLocal /home/hadoopuser/Desktop/store_locations.json /sparkLabData/

hadoopuser@hadoopuser-VirtualBox:~$ hdfs dfs -ls /sparkLabData

Found 1 items

-rw-r--r-- 1 hadoopuser supergroup 6053 2020-12-29 20:06 /sparkLabData/store_locations.json

hadoopuser@hadoopuser-VirtualBox:~$ hdfs dfs -cat /sparkLabData/store_locations.json

{"city": "Antioch", "state": "CA", "zip_code": 945097911}

{"city": "Woodland", "state": "CA", "zip_code": 957765409}

{"city": "San Jose", "state": "CA", "zip_code": 951311866}

{"city": "Victorville", "state": "CA", "zip_code": 923954216}

{"city": "Chico", "state": "CA", "zip_code": 959284422}

4. Start the Spark shell. spark-shell provides a ready-made SparkSession as spark and a SparkContext as sc, both of which are used below.

hadoopuser@hadoopuser-VirtualBox:~$ spark-shell

scala> sc.setLogLevel("ERROR")

5. Load the data into a DataFrame


scala> val storeDF =
spark.read.format("json").load("/sparkLabData/store_locations.json")

storeDF: org.apache.spark.sql.DataFrame = [city: string, state: string ... 1 more field]
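
Note: spark.read.json(path) is an equivalent shorthand for format("json").load(path); both infer the schema from the data:

scala> val storeDF = spark.read.json("/sparkLabData/store_locations.json")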

6. View the data in the DataFrame

scala> storeDF.collect

res1: Array[org.apache.spark.sql.Row] = Array([Antioch,CA,945097911],
[Woodland,CA,957765409], [San Jose,CA,951311866], [Victorville,CA,923954216],
[Chico,CA,959284422], [San Dimas,CA,917731725], [Visalia,CA,932779527],
[Manteca,CA,953366745], [Redwood City,CA,940632854], [Lakewood,CA,907122409],
[Hayward,CA,945455008], [Pacoima,CA,913312352], [San Marcos,CA,92069],
[Lodi,CA,95240], [Huntington Beach,CA,92647], [Westlake Village,CA,913624063],
[San Leandro,CA,945771209], [Woodland Hills,CA,913672227], [El
Centro,CA,922431323], [Tustin,CA,927828918], [Vista,CA,920814546],
[Eureka,CA,955012121], [Garden Grove,CA,928431206], [Simi
Valley,CA,930656207], [Santa Clara,CA,950503100], [Los Angeles,CA,900391502],
[SandCity,CA,939553051], [Vallejo,CA,945913702], [Redding,CA,960034071],
[Clovis,CA...

OR

scala> storeDF.show(5)

+-----------+-----+---------+

| city|state| zip_code|

+-----------+-----+---------+

| Antioch| CA|945097911|

| Woodland| CA|957765409|

| San Jose| CA|951311866|

|Victorville| CA|923954216|

| Chico| CA|959284422|

+-----------+-----+---------+
only showing top 5 rows
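
By default, show truncates long values to 20 characters; passing false as the second argument prints full values (the output is the same here because these values are short):

scala> storeDF.show(5, false)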

7. View the schema

scala> storeDF.schema

res3: org.apache.spark.sql.types.StructType =
StructType(StructField(city,StringType,true),
StructField(state,StringType,true), StructField(zip_code,LongType,true))
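
For a more readable tree view of the same schema, use printSchema:

scala> storeDF.printSchema
root
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)
 |-- zip_code: long (nullable = true)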

8. Manually define the schema

scala> import org.apache.spark.sql.types.Metadata

import org.apache.spark.sql.types.Metadata

scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

scala> val manualSchema = StructType(Array(StructField("city",StringType,true),
StructField("state",StringType,true), StructField("zip_code",LongType,true)))

manualSchema: org.apache.spark.sql.types.StructType =
StructType(StructField(city,StringType,true),
StructField(state,StringType,true), StructField(zip_code,LongType,true))

scala> val storeDF = spark.read.format("json").schema(manualSchema).load("/sparkLabData/store_locations.json")

storeDF: org.apache.spark.sql.DataFrame = [city: string, state: string ... 1 more field]

scala> storeDF.show(5)
+-----------+-----+---------+

| city|state| zip_code|

+-----------+-----+---------+

| Antioch| CA|945097911|

| Woodland| CA|957765409|

| San Jose| CA|951311866|

|Victorville| CA|923954216|

| Chico| CA|959284422|

+-----------+-----+---------+

only showing top 5 rows
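
Since Spark 2.3, the reader also accepts a DDL-formatted schema string, which avoids building the StructType by hand; this is an equivalent alternative to manualSchema above:

scala> val storeDF = spark.read.format("json").schema("city STRING, state STRING, zip_code LONG").load("/sparkLabData/store_locations.json")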

9. View the columns of the DataFrame

scala> storeDF.columns

res12: Array[String] = Array(city, state, zip_code)

scala> storeDF.col("city")

res16: org.apache.spark.sql.Column = city
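
A Column object is normally used inside a transformation such as select; for example, to project only the city column:

scala> storeDF.select(storeDF.col("city")).show(3)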

10. View rows in the DataFrame: take(n) fetches only the first n rows to the driver, while collect() fetches them all

scala> storeDF.take(5)

res21: Array[org.apache.spark.sql.Row] = Array([Antioch,CA,945097911],
[Woodland,CA,957765409], [San Jose,CA,951311866], [Victorville,CA,923954216],
[Chico,CA,959284422])

scala> storeDF.collect()

res17: Array[org.apache.spark.sql.Row] = Array([Antioch,CA,945097911],
[Woodland,CA,957765409], [San Jose,CA,951311866], [Victorville,CA,923954216],
[Chico,CA,959284422], [San Dimas,CA,917731725], [Visalia,CA,932779527],
[Manteca,CA,953366745], [Redwood City,CA,940632854], [Lakewood,CA,907122409],
[Hayward,CA,945455008], [Pacoima,CA,913312352], [San Marcos,CA,92069],
[Lodi,CA,95240], [Huntington Beach,CA,92647], [Westlake Village,CA,913624063],
[San Leandro,CA,945771209], [Woodland Hills,CA,913672227], [El
Centro,CA,922431323], [Tustin,CA,927828918], [Vista,CA,920814546],
[Eureka,CA,955012121], [Garden Grove,CA,928431206], [Simi
Valley,CA,930656207], [Santa Clara,CA,950503100], [Los Angeles,CA,900391502],
[SandCity,CA,939553051], [Vallejo,CA,945913702], [Redding,CA,960034071],
[Clovis,C...
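
Because collect() brings every row back to the driver, it can exhaust driver memory on large datasets; a safer pattern is to limit the DataFrame before collecting:

scala> storeDF.limit(5).collect()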

scala> storeDF.show()

+----------------+-----+---------+

| city|state| zip_code|

+----------------+-----+---------+

| Antioch| CA|945097911|

| Woodland| CA|957765409|

| San Jose| CA|951311866|

| Victorville| CA|923954216|

| Chico| CA|959284422|

| San Dimas| CA|917731725|

| Visalia| CA|932779527|

| Manteca| CA|953366745|

| Redwood City| CA|940632854|

| Lakewood| CA|907122409|

| Hayward| CA|945455008|

| Pacoima| CA|913312352|

| San Marcos| CA| 92069|

| Lodi| CA| 95240|

|Huntington Beach| CA| 92647|

|Westlake Village| CA|913624063|

| San Leandro| CA|945771209|

| Woodland Hills| CA|913672227|

|       El Centro|   CA|922431323|

|          Tustin|   CA|927828918|

+----------------+-----+---------+

only showing top 20 rows

scala> storeDF.first

res19: org.apache.spark.sql.Row = [Antioch,CA,945097911]

To see all rows, pass the total row count to show:

scala> storeDF.show(storeDF.count().toInt)

+-------------------+-----+---------+

| city|state| zip_code|

+-------------------+-----+---------+

| Antioch| CA|945097911|

| Woodland| CA|957765409|

| San Jose| CA|951311866|

| Victorville| CA|923954216|

...
