
Atelier 4-4: SPARK STREAMING

This workshop explores a file of New York City taxi trips, containing information such as the number of passengers and the make of each car: a Python script streams the rows over a TCP socket, and a Spark Streaming application counts passengers by vehicle make. The files used are:

https://itabacademy.com/bigdata/hadoop/Spark/taxistreams.py
https://itabacademy.com/bigdata/hadoop/Spark/ss-test.scala
https://itabacademy.com/bigdata/hadoop/Spark/nyctaxi100.csv

Write the Python script:

[cloudera@quickstart ~]$ cat > taxistreams.py

# coding: utf-8

import socket
import time

# Open a TCP server socket on localhost:7777
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("localhost", 7777))
s.listen(1)
print("Started...")

while True:
    # Wait for a client (the Spark Streaming receiver) to connect
    c, address = s.accept()
    # Send the CSV file line by line, one row every half second
    for row in open("nyctaxi100.csv"):
        print(row)
        c.send(row.encode())
        time.sleep(0.5)
    c.close()

Ctrl+D … to save and close the file.
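
Note that the script sends one row every half second (time.sleep(0.5)); combined with the one-second batch interval used below, each Spark micro-batch should receive roughly two rows. The TCP socket is simply a way to simulate a live stream from a static CSV file.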

Run the Python script:


[cloudera@quickstart ~]$ python taxistreams.py
Started...

Open another console, launch spark-shell, then run the processing:

import org.apache.log4j.Logger
import org.apache.log4j.Level
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
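// sc is the SparkContext provided by spark-shell; Seconds(1) sets the micro-batch interval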
val ssc = new StreamingContext(sc, Seconds(1))

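// Connect to the Python feeder listening on localhost:7777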
val lines = ssc.socketTextStream("localhost", 7777)

val pass = lines.map(_.split(",")).
  // field 15 is assumed to hold the vehicle make and field 7 the passenger count
  map(pass => (pass(15), pass(7).toInt)).
  reduceByKey(_ + _)

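// Print the first elements of each batch's result to the console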
pass.print()

ssc.start()
ssc.awaitTermination()
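
To make the mapping concrete, here is a minimal sketch of what happens to a single CSV row, runnable in any Scala REPL. The row below is invented purely for illustration; the layout of nyctaxi100.csv is assumed, with index 7 holding the passenger count and index 15 the make:

// Hypothetical row; only indices 7 (passenger count) and 15 (make) matter here
val row = "f0,f1,f2,f3,f4,f5,f6,2,f8,f9,f10,f11,f12,f13,f14,TOYOTA"
val fields = row.split(",")
val pair = (fields(15), fields(7).toInt)   // ("TOYOTA", 2)
// reduceByKey(_ + _) then sums the Int values per make within each batch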

---------------------------------------------------------------------------
[cloudera@quickstart ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/zookeeper/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/flume-ng/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/parquet/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/avro/avro-tools-1.7.6-cdh5.12.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.6.0
/_/

Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
Type in expressions to have them evaluated.
Type :help for more information.
22/11/07 10:22:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context available as sc (master = local[*], app id = local-1667845345596).
22/11/07 10:22:29 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
SQL context available as sqlContext.

scala> import org.apache.log4j.Logger
import org.apache.log4j.Logger

scala> import org.apache.log4j.Level
import org.apache.log4j.Level

scala> Logger.getLogger("org").setLevel(Level.OFF)

scala> Logger.getLogger("akka").setLevel(Level.OFF)

scala> import org.apache.spark._
import org.apache.spark._

scala> import org.apache.spark.streaming._
import org.apache.spark.streaming._

scala> import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.StreamingContext._

scala> val ssc = new StreamingContext(sc, Seconds(1))
ssc: org.apache.spark.streaming.StreamingContext = org.apache.spark.streaming.StreamingContext@11467ced

scala> val lines = ssc.socketTextStream("localhost", 7777)
lines: org.apache.spark.streaming.dstream.ReceiverInputDStream[String] = org.apache.spark.streaming.dstream.SocketInputDStream@75e011d9

scala> val pass = lines.map(_.split(",")).
     | map(pass=>(pass(15), pass(7).toInt)).
     | reduceByKey(_+_)
pass: org.apache.spark.streaming.dstream.DStream[(String, Int)] = org.apache.spark.streaming.dstream.ShuffledDStream@48b75e7f

scala> pass.print()

scala> ssc.start()

scala> ssc.awaitTermination()

Processing result:
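
The original screenshot of the result is not reproduced here. Each one-second batch printed by pass.print() follows Spark's standard DStream output format; the values below are placeholders, since the actual pairs depend on the contents of nyctaxi100.csv:

-------------------------------------------
Time: <batch timestamp> ms
-------------------------------------------
(<make>,<total passengers>)
...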
