
SPARK SQL

DataFrames and DataSets


Working with structured data

■ Extends RDD to a "DataFrame" object


■ DataFrames:
– Contain Row objects
– Can run SQL queries
– Have a schema (leading to more efficient storage)
– Read from and write to JSON, Hive, and Parquet
– Communicate with JDBC/ODBC and Tableau
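A minimal sketch of those ideas (the names and data are invented for illustration, and a local SparkSession is assumed):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("DataFrameSketch").getOrCreate()

# A DataFrame is a distributed collection of Row objects with a schema
rows = [Row(name="Alice", age=30), Row(name="Bob", age=25)]
people = spark.createDataFrame(rows)

people.printSchema()   # the schema is what enables the more efficient storage
people.show()

# Structured formats such as Parquet round-trip through that schema
people.write.mode("overwrite").parquet("people.parquet")
peopleAgain = spark.read.parquet("people.parquet")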
Using SparkSQL in Python

■ from pyspark.sql import HiveContext, Row


■ hiveContext = HiveContext(sc)
■ inputData = hiveContext.read.json(dataFile)
■ inputData.createOrReplaceTempView("myStructuredStuff")
■ myResultDataFrame = hiveContext.sql("""SELECT foo FROM myStructuredStuff ORDER BY foobar""")
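The bullets above use the older HiveContext entry point; a self-contained sketch of the same flow with the Spark 2.0 SparkSession (recommended later in these slides) might look like this, where the file name and the foo/foobar column names are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()

# Load structured data straight into a DataFrame
inputData = spark.read.json("some_structured_data.json")   # placeholder file

# Register it as a temporary view so it can be queried with SQL
inputData.createOrReplaceTempView("myStructuredStuff")

myResultDataFrame = spark.sql(
    "SELECT foo FROM myStructuredStuff ORDER BY foobar")
myResultDataFrame.show()

spark.stop()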
Other stuff you can do with dataframes
■ myResultDataFrame.show()
■ myResultDataFrame.select("someFieldName")
■ myResultDataFrame.filter(myResultDataFrame["someFieldName"] > 200)
■ myResultDataFrame.groupBy(myResultDataFrame["someFieldName"]).mean()
■ myResultDataFrame.rdd.map(mapperFunction)
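A minimal runnable sketch of those calls (the DataFrame contents, the field name, and the mapper are invented for illustration):

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("DataFrameOps").getOrCreate()

myResultDataFrame = spark.createDataFrame(
    [Row(someFieldName=150), Row(someFieldName=250), Row(someFieldName=300)])

myResultDataFrame.show()
myResultDataFrame.select("someFieldName").show()
myResultDataFrame.filter(myResultDataFrame["someFieldName"] > 200).show()
myResultDataFrame.groupBy(myResultDataFrame["someFieldName"]).mean().show()

# Drop back down to the underlying RDD of Row objects
print(myResultDataFrame.rdd.map(lambda row: row.someFieldName * 2).collect())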
Datasets

■ In Spark 2.0, a DataFrame is really a DataSet of Row objects


■ DataSets can wrap known, typed data too. But this is mostly transparent to
you in Python, since Python is dynamically typed.
■ So – don’t sweat this too much with Python. But the Spark 2.0 way is to use
DataSets instead of DataFrames when you can.
Shell access

■ Spark SQL exposes a JDBC/ODBC server (if you built Spark with Hive support)
■ Start it with sbin/start-thriftserver.sh
■ Listens on port 10000 by default
■ Connect using bin/beeline -u jdbc:hive2://localhost:10000
■ Voilà, you have a SQL shell to Spark SQL
■ You can create new tables, or query existing ones that were cached using
hiveCtx.cacheTable("tableName")
User-defined functions (UDFs)

from pyspark.sql.types import IntegerType


hiveCtx.registerFunction("square", lambda x: x*x, IntegerType())
df = hiveCtx.sql("SELECT square(someNumericField) FROM tableName")
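For context, a self-contained sketch of the same UDF using the SparkSession API (the table name, column name, and data are invented; spark.udf.register is the newer counterpart of hiveCtx.registerFunction):

from pyspark.sql import SparkSession, Row
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("UDFExample").getOrCreate()

# Register a Python lambda as a SQL function named "square"
spark.udf.register("square", lambda x: x * x, IntegerType())

df = spark.createDataFrame([Row(someNumericField=n) for n in (1, 2, 3)])
df.createOrReplaceTempView("tableName")

spark.sql("SELECT someNumericField, square(someNumericField) AS squared "
          "FROM tableName").show()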
YOUR CHALLENGE
Filter out movies with very few ratings
The problem

■ Our examples of finding the lowest-rated movies were polluted with movies
only rated by one or two people.
■ Modify one or both of these scripts to only consider movies with at least ten
ratings.
Hints

■ RDDs have a filter() function you can use


– It takes a function as a parameter, which accepts the entire key/value
pair
■ So if you’re calling filter() on an RDD that contains (movieID, (sumOfRatings,
totalRatings)) – a lambda function that takes in “x” would refer to
totalRatings as x[1][1]. x[1] gives us the “value” (sumOfRatings, totalRatings)
and x[1][1] pulls out totalRatings.
– This function should be an expression that returns True if the row should
be kept, or False if it should be discarded
■ DataFrames also have a filter() function
– It’s easier – you just pass in a string expression for what you want to
filter on.
– For example: df.filter("count > 10") would only pass through rows where
the "count" column is greater than 10. Both approaches are sketched below.
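Putting the hints together, a minimal self-contained sketch of both approaches (the data and variable names are made up; your actual scripts will build these structures from the ratings data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("FilterHint").getOrCreate()
sc = spark.sparkContext

# RDD version: (movieID, (sumOfRatings, totalRatings)) tuples, as in the hint
ratingTotalsAndCount = sc.parallelize(
    [(1, (50.0, 12)), (2, (4.0, 2)), (3, (90.0, 25))])
popularTotalsAndCount = ratingTotalsAndCount.filter(lambda x: x[1][1] >= 10)
print(popularTotalsAndCount.collect())

# DataFrame version: pass a string expression to filter()
movieCounts = spark.createDataFrame(
    [(1, 12), (2, 2), (3, 25)], ["movieID", "count"])
movieCounts.filter("count >= 10").show()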
GOOD LUCK
