Getting Started With Spark and Redis
CONTENTS
Executive Summary
Introduction
Setting up
Example Problem
Introduction
Redis Labs1 recently published a spark-redis package for general public consumption. It is, as the name may suggest, a Redis
connector for Apache Spark that provides read and write access to all of Redis’ core data structures as RDDs (Resilient
Distributed Datasets, in Spark terminology).
Since Spark was introduced, it has caught developers' attention as a fast and general engine for large-scale data processing,
easily surpassing alternative big data frameworks in the types of analytics that can be executed on a single platform.
Spark supports a cyclic data flow and in-memory computing, allowing programs to be run faster than Hadoop MapReduce.
With its ease of use and support for SQL, streaming and machine learning libraries, it has ignited early interest in a
wide developer community. Redis brings a shared in-memory infrastructure to Spark, allowing it to process data orders
of magnitude faster. Redis data structures simplify data access and processing, reducing code complexity and saving
on application network and bandwidth usage. The combination of Spark and Redis fast tracks your analytics, allowing
unprecedented real-time processing of really large datasets.
Setting up
There are a few prerequisites you need before you can actually use spark-redis, namely: Apache Spark, Scala, Jedis and
Redis. While the package specifically states version requirements for each piece, we actually used later versions with no
discernible ill effects (v1.5.2, v2.11.7, v2.8 and unstable respectively).
We start with setting up Spark on Ubuntu following this step by step guide, “Setting up a Standalone Apache Spark Cluster”
published by Tim Spann @PaaSDev at @DZone.
Once you’ve fulfilled all the requirements, you can just git clone https://github.com/RedisLabs/spark-redis and build it by
running sbt (install sbt first if you don’t already have it).
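Assuming git and sbt are already installed, the whole sequence boils down to something like:
git clone https://github.com/RedisLabs/spark-redis
cd spark-redis
sbt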
1. Redis Labs and the talented Sun He @sunheehnus of the Redis community
Example Problem
For the purposes of getting started, we will use the equivalent of the “Hello World” example in analytics land, the problem of
counting words. This simple problem will be used to illustrate how to use Spark and Redis together. Firing up the spark-shell against the standalone cluster greets us with the familiar prompt:
Using Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.7.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
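The first order of business is loading the Redis source files into an RDD. A minimal sketch of that step, assuming the Redis repository is cloned locally and using wtext as the name of the WholeTextFileRDD (the later snippets build on it), could look like this:
// read every Redis C source and header file into (file URL, file contents) pairs
val wtext = sc.wholeTextFiles("redis/src/*.[ch]")
// count the files that were picked up
wtext.count()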
The display shows there are exactly 100 Redis source files! Of course, doing ls -1 redis/src/*.[ch] | wc -l from the
shell prompt would have displayed the same thing, but this way we can actually see the stages of the job being done by the
standalone Spark cluster on the WholeTextFileRDD.
2. When scores are equal, items are sub-ordered by the lexicographic ordering of the members themselves.
3. While we generally use colons as name/namespace/data separators when operating with Redis data in this whitepaper, you can
feel free to use whatever character you like. Other users of Redis use a period “.”, semicolon “;”, and more. Picking some character
that doesn’t usually appear in your keys or data is a good idea.
Step 2: Transforming file contents
The next step is to transform the contents of the files into words (that can later be counted). Unlike the usual examples
that use the TextFileRDD, the WholeTextFilesRDD consists of file URLs and their contents, so we use the following snippet
to split and clean the data (the call to the cache() method is strictly optional, but is in keeping with best practices).
The variable names chosen are meant to be meaningful and short, e.g. wtext represents WholeTextFiles, fwds is FileWords
and so on.
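One way to express that transformation, assuming wtext is the WholeTextFileRDD from the previous step and that fwds should end up as an RDD of (filename, word) pairs, is sketched below:
val fwds = wtext.flatMap { case (path, contents) =>
  // keep just the base file name (e.g. scripting.c) and split the contents into words
  val fname = path.split("/").last
  contents.split("\\W+").filter(_.nonEmpty).map(word => (fname, word))
}.cache()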
Once the fwds RDD has clean filenames and all the words neatly split, we are ready for some serious counting. First,
we recreate the ubiquitous word counting example:
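A minimal version of that count over fwds, keeping the tallies as strings so the pairs can go straight into a Redis Sorted Set later on, could look like this:
val wcnts = fwds.
  map { case (_, word) => (word, 1) }.   // one (word, 1) pair per occurrence
  reduceByKey(_ + _).                    // sum the occurrences of each word
  map { case (word, count) => (word, count.toString) }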
Pasting the above into the spark-shell and following with take confirms success:
scala> wcnts.take(10)
res1: Array[(String, String)] = Array((requirepass,15), (mixdigest,2),
(propagte,1), (used_cpu_sys,1), (rioFdsetRead,2), (0x3e13,1),
(preventing,1), (been,12), (modifies,1), (geoArrayCreate,3))
scala> wcnts.count()
res2: Long = 12657
A note about the results: take isn’t supposed to be deterministic, but given that “requirepass” keeps surfacing these days, it
may well be fatalistic. Also, 12657 must have some meaning but it is yet to be found.
Step 3: Writing RDDs to Redis
This is where we get started with powerful Redis. We use Redis to save the results so they can be used in later
computations. Redis’ Sorted Sets are a perfect match for the word-count pairs and also allow querying the data by score. It
takes only one line of Scala code to do that (actually three lines, but the first two don’t count):
import com.redislabs.provider.redis._
val redisDB = ("127.0.0.1", 6379)
sc.toRedisZSET(wcnts, "all:words", redisDB)
Once the data is in a Redis Sorted Set, we can use redis-cli to read it like so:
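For example, we can check the score of a member we already saw in the take output above (any redis-cli command over all:words would do just as well):
127.0.0.1:6379> ZSCORE all:words requirepass
"15"
127.0.0.1:6379> ZCARD all:words
(integer) 12657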
What else can we keep in Redis? The filenames are also perfect candidates, so we make another RDD and store it in a
regular Set:
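A minimal sketch of that step, assuming a hypothetical key name of all:files, reusing redisDB from above and the connector's toRedisSET helper (the Set counterpart of the toRedisZSET call), could be:
// distinct file names from the (filename, word) pairs, stored in a plain Redis Set
val fnames = fwds.map { case (fname, _) => fname }.distinct
sc.toRedisSET(fnames, "all:files", redisDB)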
Despite being very useful for science purposes, the content of the fnames Set is pretty mundane... so, as a more interesting
example, you can store the word count for each file in its very own Sorted Set. We can do that with a few transformations/actions/RDDs:
fwds.
  groupByKey.
  collect.
  foreach { case (fname, contents) =>
    val zsetcontents = contents.
      groupBy(word => word).
      map { case (word, list) => (word, list.size.toString) }.
      toArray
    sc.toRedisZSET(sc.parallelize(zsetcontents), "file:" + fname, redisDB)
  }
Back to redis-cli:
127.0.0.1:6379> dbsize
(integer) 102
127.0.0.1:6379> ZREVRANGE file:scripting.c 0 4 WITHSCORES
1) "lua"
2) "366"
3) "the"
4) "341"
5) "if"
6) "227"
7) "1"
8) "217"
9) "0"
10) "197"
Then it's back to the spark-shell to test this code and get a grand total of all words (rwcnts holds the word counts read back from Redis):
scala> rwcnts.count()
res8: Long = 12657
scala> val total = rwcnts.aggregate(0)(
| (acc, value) => acc + value._2,
| (acc1, acc2) => acc1 + acc2)
total: Int = 272655
A Lua script can do a similar tally directly inside Redis by SCANning over the file:* keys; the skeleton of such a loop looks like this:
local cursor = 0
repeat
  local rep1 = redis.call('SCAN', cursor, 'MATCH', 'file:*')
  cursor = tonumber(rep1[1])
until cursor == 0
Closing notes
Back in the days when data was small, you could get away with counting words using a simple wc -w. As data grows, we
find new ways to abstract solutions and in return gain flexibility and scalability. Spark is an exciting tool to have and its core
is extremely useful. And that’s even without going into its integration with the Hadoop ecosystem and extensions for SQL,
streaming, graph processing and machine learning.
Redis quenches Spark’s thirst for data. spark-redis lets you marry RDDs and Redis core data structures with just a line of
Scala code. The spark-redis package already provides straightforward RDD-parallelized read/write access to all core data
structures and a polite (i.e. SCAN-based) way to fetch key names. Furthermore, the connector packs a considerable hidden
punch, as it is actually (Redis) cluster-aware and maps RDD partitions to hash slots to reduce inter-engine shuffling. The
package is open source and has many more enhancements planned that should make Redis a default choice for use with
Spark.
700 E El Camino Real, Suite 250
Mountain View, CA 94040
(415) 930-9666
redislabs.com