Data Frames
■ Spark SQL exposes a JDBC/ODBC server (if you built Spark with Hive support)
■ Start it with sbin/start-thriftserver.sh
■ Listens on port 10000 by default
■ Connect using bin/beeline -u jdbc:hive2://localhost:10000
■ Voilà, you have a SQL shell to Spark SQL
■ You can create new tables, or query existing ones that were cached using
hiveCtx.cacheTable("tableName")
User-defined functions (UDFs)
■ Our examples of finding the lowest-rated movies were polluted by movies
rated by only one or two people.
■ Modify one or both of these scripts to only consider movies with at least ten
ratings.
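The filtering idea can be sketched in plain Python, independent of Spark: count the ratings per movie, drop any movie below the threshold, then rank what remains by average rating. This is only an illustration of the logic, not the course's solution. The function name `lowest_rated`, its parameters, and the `(movie_id, rating)` input shape are all assumptions made for this sketch.

```python
from collections import defaultdict


def lowest_rated(ratings, min_ratings=10, top_n=10):
    """Return (movie_id, average, count) rows for the lowest-rated movies.

    ratings: iterable of (movie_id, rating) pairs (hypothetical input shape).
    Movies with fewer than min_ratings ratings are dropped before ranking.
    """
    totals = defaultdict(lambda: [0.0, 0])  # movie_id -> [rating sum, rating count]
    for movie_id, rating in ratings:
        totals[movie_id][0] += rating
        totals[movie_id][1] += 1

    # Keep only movies with enough ratings, and compute each average.
    averages = [
        (movie_id, total / count, count)
        for movie_id, (total, count) in totals.items()
        if count >= min_ratings
    ]
    # Sort ascending by average rating so the worst movies come first.
    averages.sort(key=lambda row: row[1])
    return averages[:top_n]
```

In a Spark script the same three steps (count, filter, sort) would typically be expressed with the framework's own aggregation and filter operations rather than a Python dictionary.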
Hints