Exploratory Data Analysis - Big Data

Working with Resilient Distributed Datasets (RDDs) and applying various transformations to the dataset.

Uploaded by

Pooja Pancholi

FIT5202 Data Processing for Big Data
Assignment 1 - Part A

Student Name: Pooja Vishal Pancholi
Student ID: 29984939
Tutorial Day and Time: Thursday 6 to 8 PM
Tutor Name: Huashun Li

Step 01: Import pyspark and initialize Spark

In [2]:

    # importing pyspark API libraries
    from pyspark import SparkContext, SparkConf  # Spark
    from pyspark.sql import SparkSession  # Spark SQL

    # getOrCreate() never returns None, so the configuration is passed to it directly
    conf = SparkConf().setAppName("Assignment1A Application").setMaster("local[*]")
    context = SparkContext.getOrCreate(conf=conf)

    # in order to check all configurations
    context.getConf().getAll()

    # import re
    import re

    # for the stopwords
    import nltk
    from nltk.corpus import stopwords
    # uncomment and download it before running step 5
    # nltk.download('stopwords')

    # for plotting
    # !pip install matplotlib
    # !pip install numpy
    import matplotlib.pyplot as plt
    import numpy as np
    %matplotlib inline
    %pylab inline

Output:

    Populating the interactive namespace from numpy and matplotlib

Step 02: Create Resilient Distributed Datasets (RDDs)

In [5]:

    # RDD for Scrum Handbook.txt file
    rddScrum = context.textFile("Scrum Handbook.txt")
    print("\n Number of total lines in 'Scrum Handbook.txt' file is: ", rddScrum.count())

    # RDD for Agile Processes in Software Engineering and Extreme Programming.txt file
    rddAgile = context.textFile("Agile Processes in Software Engineering and Extreme Programming.txt")
    print("\n Number of total lines in 'Agile Processes in Software Engineering and Extreme Programming.txt' file is: ",
          rddAgile.count())

Output:

    Number of total lines in 'Scrum Handbook.txt' file is: 4617
    Number of total lines in 'Agile Processes in Software Engineering and Extreme Programming.txt' file is: 21569

In [6]:

    # Function to clean and format the RDD data
    def formatRdd(rdd):
        # Replace every run of non-letters with a space, drop empty lines, lower-case the rest.
        # Digits are already removed by the first step, so the filter only has to drop empty lines.
        rddFile = rdd.map(lambda lines: re.sub('[^a-zA-Z]+', ' ', lines))\
                     .filter(lambda sublines: len(sublines) > 0)\
                     .map(lambda x: x.lower())
        return rddFile

    # Formatting the RDD data
    rddScrum = formatRdd(rddScrum)
    rddAgile = formatRdd(rddAgile)

    # Printing the top 5 results after formatting the RDDs
    index = 0
    print("The first five tuples in Scrum Handbook are: \n")
    for val in rddScrum.take(5):
        index = index + 1
        print(index, ": ", val)

    index = 0
    print("\nThe first five tuples in Agile Processes Book are: \n")
    for val in rddAgile.take(5):
        index = index + 1
        print(index, ": ", val)

Output:

    The first five tuples in Scrum Handbook are:

    jeff sutherland s scrum handbook everything you need to know

    The first five tuples in Agile Processes Book are:

    lnbip i helen sharp tracy hall eds agile processes in software engineering

Step 04: Transforming the Data/Counting the words

In [7]:

    # Function to transform RDD to (word, 1) form
    def filterRdd(rdd):
        rddFiltered = rdd.flatMap(lambda x: re.split(r'\s+', x))\
                         .filter(lambda empty: len(empty) > 0)\
                         .map(lambda finalWords: (finalWords, 1))
        return rddFiltered

    # Transforming RDDs to (word, 1) form
    rddScrumPair = filterRdd(rddScrum)
    rddAgilePair = filterRdd(rddAgile)

    # printing the top 5 results after transformations are done
    print("The transformed RDD for Scrum Handbook is:")
    for val in rddScrumPair.take(5):
        print(val)

    print("\nThe transformed RDD for Agile Processes Book is:")
    for val in rddAgilePair.take(5):
        print(val)

Output:

    The transformed RDD for Scrum Handbook is:
    ('jeff', 1)
    ('sutherland', 1)
    ('s', 1)
    ('scrum', 1)
    ('handbook', 1)

    The transformed RDD for Agile Processes Book is:
    ('lnbip', 1)
    ('i', 1)
    ('helen', 1)
    ('sharp', 1)
    ('tracy', 1)

In [8]:

    # function to reduce rdd according to word frequency
    def countFrequency(rdd):
        # Sum the 1s per word, swap to (count, word) to sort by count, then swap back
        count = rdd.reduceByKey(lambda val2, val1: val2 + val1)\
                   .map(lambda a: (a[1], a[0]))\
                   .sortByKey(ascending=False)\
                   .map(lambda a: (a[1], a[0]))
        return count

    # reducing RDDs according to word frequency
    scrumCount = countFrequency(rddScrumPair)
    agileCount = countFrequency(rddAgilePair)

    # printing the top 20 words with most frequencies in both the books
    print("The top 20 words with most frequency in Scrum Handbook are: \n")
    for val in scrumCount.take(20):
        print(val)

    print("\nThe top 20 words with most frequency in Agile Processes Book are: \n")
    for val in agileCount.take(20):
        print(val)

Output:

    The top 20 words with most frequency in Scrum Handbook are:

    ('the', 1238)
    ('of', 538)
    ('and', 534)
    ('to', 478)
    ('a', 454)
    ('scrum', 399)
    ('in', 363)
    ('is', 348)
    ('team', 273)
    ('product', 233)
    ('for', 195)
    ('that', 182)
    ('it', 172)
    ('on', 149)
    ('sprint', 147)
    ('this', 142)
    ('with', 132)
    ('as', 124)
    ('at', 119)
    ('are', 119)

    The top 20 words with most frequency in Agile Processes Book are:

    ('the', 8161)
    ('and', 3975)
    ('of', 3954)
    ('to', 3751)
    ('in', 3101)
    ('a', 2755)
    ('is', 1541)
    ('that', 1356)
    ('for', 1195)
    ('on', 1027)
    ('as', 1023)
    ('we', 980)
    ('with', 978)
    ('software', 931)
    ('this', 915)
    ('are', 785)
    ('agile', 784)
    ('it', 775)
    ('development', 748)
    ('was', 711)

Step 05: Removing Stop Words

In [9]:

    stopWordsList = set(stopwords.words('english'))  # getting the stopwords

    # Function to remove stop words from both RDDs
    def removeStopWords(rdd):
        removeStop = rdd.filter(lambda x: x[0] not in stopWordsList)
        return removeStop

    # Removing the stopwords from the RDDs
    scrumCount = removeStopWords(scrumCount)
    agileCount = removeStopWords(agileCount)

    # storing the count of words
    totalCountScrum = scrumCount.count()
    totalCountAgile = agileCount.count()

    # printing the count of unique words after removal of stopwords
    print("After the removal of stopwords: \n")
    print("There are \"", totalCountScrum, "\" unique words in Scrum Handbook. ")
    print("There are \"", totalCountAgile, "\" unique words in Agile Processes Book. ")

Output:

    After the removal of stopwords:

    There are " 2857 " unique words in Scrum Handbook.
    There are " 8962 " unique words in Agile Processes Book.

Step 06: Find the average occurrence of a word

In [10]:

    # Function to find the average occurrences for each book
    def averageOcc(rdd):
        # Note: the divisor is totalCountScrum for both books, so the Agile values
        # printed below are also relative to the Scrum total (e.g. 931 / 2857).
        average = rdd.map(lambda finalWords: (finalWords[0], finalWords[1] / totalCountScrum))
        return average

    # Getting the average occurrence for each book
    scrumCountAverage = averageOcc(scrumCount)
    agileCountAverage = averageOcc(agileCount)

    # Printing the average occurrences of each book
    print("The average occurrence of the top 5 words in Scrum Handbook are : \n")
    for val in scrumCountAverage.take(5):
        print(val)

    print("\nThe average occurrence of the top 5 words in Agile Processes Book are : \n")
    for val in agileCountAverage.take(5):
        print(val)

Output:

    The average occurrence of the top 5 words in Scrum Handbook are :

    ('scrum', 0.13965698284914246)
    ('team', 0.09555477773888695)
    ('product', 0.08155407770388519)
    ('sprint', 0.051452572628631434)
    ('development', 0.03430171508575429)

    The average occurrence of the top 5 words in Agile Processes Book are :

    ('software', 0.3258662933146657)
    ('agile', 0.2744137206860343)
    ('development', 0.26181309065453273)
    ('team', 0.2072103605180259)
    ('work', 0.16135806790339516)

Step 7: Exploratory data analysis

In [13]:

    index = np.arange(15)
    bar_width = 0.25

    # Fetching the first 15 book values
    scrumKey, scrumValue = zip(*scrumCount.take(15))
    agileKey, agileValue = zip(*agileCount.take(15))

    plt.rcParams['figure.figsize'] = (15, 9)  # setting the figure size
    fig, ax = plt.subplots()  # plotting the subplot

    # Setting the label and title formatting
    ax.set_xlabel('Words', size=14)
    ax.set_ylabel('Counts', size=14)
    ax.set_title('Scrum Handbook vs Agile Processes Book', size=18)

    # plotting the bars for each book
    scrumData = ax.bar(index, scrumValue, bar_width, label="Scrum Handbook")
    agileData = ax.bar(index + bar_width, agileValue, bar_width, label="Agile Processes Book")

    # setting the tick values for X as words
    ax.set_xticks(index, minor=False)
    ax.set_xticks(index + bar_width, minor=True)
    ax.set_xticklabels(scrumKey, rotation=90, minor=False, ha="center")
    ax.set_xticklabels(agileKey, rotation=90, minor=True, ha="center")
    ax.tick_params(axis='both', which='major', labelsize=13)
    ax.tick_params(axis='both', which='minor', labelsize=13)

    # Show the plot
    ax.legend()
    plt.show()

[Figure: grouped bar chart "Scrum Handbook vs Agile Processes Book"; x-axis: Words, y-axis: Counts]

End of Part A. I hope you like my work :)
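
The steps of the notebook (clean each line, split into words, count, drop stop words, divide each count by the number of unique words) can be tried on a small in-memory sample without a Spark installation. The following is a minimal pure-Python sketch and not the notebook's own code: the sample sentences, the tiny stop-word set and the names `clean_lines`, `word_counts` and `relative_frequency` are hypothetical, and it assumes each text is normalised by its own unique-word total.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word set; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is"}


def clean_lines(lines):
    """Replace runs of non-letters with a space, drop empty lines, lower-case."""
    cleaned = (re.sub(r"[^a-zA-Z]+", " ", line) for line in lines)
    return [line.lower() for line in cleaned if len(line) > 0]


def word_counts(lines):
    """Split on whitespace and count words (flatMap + (word, 1) + reduceByKey)."""
    words = (w for line in lines for w in re.split(r"\s+", line) if len(w) > 0)
    return Counter(words)


def relative_frequency(counts, stop_words=STOP_WORDS):
    """Drop stop words, then divide each count by this text's own unique-word total."""
    kept = {w: c for w, c in counts.items() if w not in stop_words}
    total_unique = len(kept)
    ranked = sorted(kept.items(), key=lambda kv: kv[1], reverse=True)
    return [(w, c / total_unique) for w, c in ranked]


# Hypothetical sample text, not taken from either book.
sample = [
    "The Scrum team plans the sprint.",
    "",
    "Scrum is a framework; the team owns it!",
]

counts = word_counts(clean_lines(sample))
ranking = relative_frequency(counts)

print(counts.most_common(3))
print(ranking[:2])
```

Passing each text's own total into the normalisation step, rather than reading it from a shared global, keeps the figures for two books comparable when the same function is applied to both.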
