0% found this document useful (0 votes)

84 views51 pages

BIG DATA ANALYTICS Lab Manual

Wjis shensjw hw whw ww

Uploaded by

stortemfkopo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

84 views51 pages

BIG DATA ANALYTICS Lab Manual

Wjis shensjw hw whw ww

Uploaded by

stortemfkopo

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 51

EXPERIMENT:1

Install, configure and run python, numpy and pandas.

AIM: To Installing and Running Applications On python, numpy and pandas.

How to Install Anaconda on Windows?

Anaconda is an open-source software that contains Jupyter, spyder, etc that are used for large data
processing, data analytics, heavy scientific computing. Anaconda works for R and python programming
language. Spyder(sub-application of Anaconda) is used for python. Opencv for python will work in spyder.
Package versions are managed by the package management system calledconda.
To begin working with Anaconda, one must get it installed first. Follow the below instructions to Download
and install Anaconda on your system:
Download and install Anaconda:
Headovertoanaconda.comandinstallthelatestversionofAnaconda.Makesuretodownloadthe
―Python3.7Version‖fortheappropriatearchitecture.

Begin with the installation process:

 Getting Started

GettingthroughtheLicenseAgreement:

SKC Lakshmi Narain College Of Technology,Indore

SelectInstallationType:SelectJustMeifyouwantthesoftwaretobeusedbyasingleUser
ChooseInstallationLocation:

SKC Lakshmi Narain College Of Technology,Indore

Advanced Installation Option:

GettingthroughtheInstallationProcess:

SKC Lakshmi Narain College Of Technology,Indore

Recommendationto Install Pycharm:

SKC Lakshmi Narain College Of Technology,Indore

WorkingwithAnaconda:
Oncetheinstallationprocessisdone,Anacondacanbeusedtoperformmultipleoperations.To

begin using Anaconda search for AnacondaNavigatorfromtheStartMenuinWindows

SKC Lakshmi Narain College Of Technology,Indore

#importpandasinjupyternotebook import
pandas

#loadingthedatasetwhichisexcelfile
dataset=pandas.read_csv("crime.csv")

#displayingthedata dataset

importpandasaspd
dataset1=pd.read_csv("crime.csv")
dataset1

SKC Lakshmi Narain College Of Technology,Indore

dataset1.head(10)

SKC Lakshmi Narain College Of Technology,Indore

dataset1.tail(10)

type(dataset1)
pandas.core.frame.DataFrame

#tofindanynullvaluesinthe last 5rows dataset1.isnull().tail()

#tomakesurethatnonullvaluesexists
dataset1.notnull().tail()

SKC Lakshmi Narain College Of Technology,Indore

InNumpydimensionsarecalledasaxes.
Numpyis fast,convenientand occupies less memorywhencomparedtopythonlist.

importnumpy
arr=numpy.array([1,2,3,4,5]) print(arr)

NumPyisusually importedunderthenpalias.

importnumpyasnp

NowtheNumPypackagecan bereferredtoasnp insteadofnumpy.

importnumpyasnp
arr=np.array([1,2,3,4,5])
print(arr)

CheckingNumPyVersion
Theversionstringisstored underversion attribute.

importnumpyasnp print(np.
version )

CreateaNumPyndarrayObject
NumPyisusedto workwitharrays.Thearrayobject inNumPyiscalled ndarray. We can
create a NumPy ndarrayobject byusing the array() function.

importnumpyasnp
arr=np.array([1,2,3,4,5])
print(arr)
print(type(arr))

SKC Lakshmi Narain College Of Technology,Indore

EXPERIMENT:2

Install, Configure and Run Hadoop and HDFS

AIM:To Installing and Running Applications On Hadoop and HDFS.
HADOOPINSTALATIONINWINDOWS
1. Prerequisites
HardwareRequirement
* RAM—Min.8GB,ifyouhave SSDin yoursystemthen4GBRAMwouldalsowork.
* CPU—Min.Quadcore,withatleast1.80GHz
2. JRE1.8—OfflineinstallerforJRE
3. JavaDevelopmentKit —1.8
4. ASoftwareforUn-Zippinglike7Zipor WinRar
* Iwillbeusinga64-bit windowsfortheprocess, pleasecheckanddownloadtheversionsupported by your
system x86 or x64 for all the software.
5. DownloadHadoopzip
* IamusingHadoop-2.9.2,you canuseanyother STABLEversion forhadoop.

Oncewe haveDownloadedalltheabovesoftware,wecanproceedwithnext stepsininstallingthe Hadoop.

2.UnzipandInstallHadoop
AfterDownloadingtheHadoop,weneedtoUnzipthehadoop-2.9.2.tar.gz file.

SKC Lakshmi Narain College Of Technology,Indore

Onceextracted, wewouldgetanewfilehadoop-2.9.2.tar.

SKC Lakshmi Narain College Of Technology,Indore

Now,onceagainweneed toextractthistarfile.

NowwecanorganizeourHadoopinstallation, wecancreateafolderand movethe finalextracted file in it. For

Eg. :-

Pleasenotewhilecreating folders,DONOTADDSPACESINBETWEENTHEFOLDER NAME.(it can cause issues

later)
IhaveplacedmyHadoop inD:driveyoucanuseC:oranyother drivealso.
2. SettingUpEnvironmentVariables
Anotherimportantstepinsettingupaworkenvironment isto set yourSystemsenvironment variable.
Toeditenvironmentvariables,gotoControlPanel>System>clickonthe―Advancedsystem settings‖ link
Alternatively,WecanRightclick onThisPC iconandclickonPropertiesandclickon the
―Advancedsystemsettings‖link
Or,easiestwayistosearchforEnvironmentVariableinsearchbarandthereyouGO…

SKC Lakshmi Narain College Of Technology,Indore

SKC Lakshmi Narain College Of Technology,Indore
SettingJAVA_HOME
OpenenvironmentVariableandclickon―New‖in―UserVariable‖

SKC Lakshmi Narain College Of Technology,Indore

Onclicking―New‖,wegetbelowscreen.

Nowasshown,addJAVA_HOME invariablenameandpathofJava(jdk) in VariableValue. Click OK and we

are half done with setting JAVA_HOME.

SKC Lakshmi Narain College Of Technology,Indore

SettingHADOOP_HOME
OpenenvironmentVariableandclickon―New‖in―UserVariable‖
Onclicking―New‖,wegetbelowscreen.

Nowasshown,addHADOOP_HOME invariable nameandpathofHadoopfolder inVariable Value.

ClickOKand wearehalfdonewithsettingHADOOP_HOME.
Note:-Ifyouwantthepathtobesetforallusersyouneedtoselect―New‖fromSystemVariables.
SettingPath Variable
Last stepinsettingEnvironmentvariableissettingPathinSystemVariable.

SKC Lakshmi Narain College Of Technology,Indore

SelectPathvariableinthesystemvariablesandclickon―Edit‖.

Nowweneedto addthese pathstoPathVariableonebyone:-

* %JAVA_HOME%\bin
* %HADOOP_HOME%\bin
* %HADOOP_HOME%\sbin
ClickOKandOK.&wearedonewithSetting EnvironmentVariables.
VerifythePaths
Nowweneedto verifythat what wehavedoneiscorrectandreflecting. Open a NEW
Command Window
Run following commands
echo %JAVA_HOME% echo
%HADOOP_HOME%
echo %PATH%
3. Editing Hadoopfiles
OncewehaveconfiguredtheenvironmentvariablesnextstepistoconfigureHadoop.Ithas3parts:-
CreatingFolders
Weneed tocreateafolder datainthehadoopdirectory,and 2subfoldersnamenodeanddatanode

SKC Lakshmi Narain College Of Technology,Indore

CreateDATAfolderintheHadoopdirectory

OnceDATAfolder iscreated,weneedtocreate2new foldersnamely,namenodeanddatanode inside the data

folder
Thesefoldersare importantbecausefilesonHDFSresides insidethedatanode.
EditingConfigurationFiles
Nowweneedto editthefollowingconfig filesinhadoopforconfiguring it :- (We can find
these files in Hadoop -> etc -> hadoop)
* core-site.xml
* hdfs-site.xml
* mapred-site.xml
* yarn-site.xml
* hadoop-env.cmd
Editingcore-site.xml
Right clickonthefile, selecteditandpastethefollowingcontentwithin<configuration>
</configuration>tags.
Note:-Belowpartalreadyhastheconfigurationtag,weneedtocopyonlythepart insideit.
<configuration>
<property>
<name>fs.defaultFS</name>

SKC Lakshmi Narain College Of Technology,Indore

<value>hdfs://localhost:9000</value>
</property>
</configuration>
Editinghdfs-site.xml

SKC Lakshmi Narain College Of Technology,Indore

Rightclick onthefile, selecteditandpastethefollowingcontentwithin
<configuration></configuration>tags.
Note:-Belowpartalreadyhastheconfigurationtag,weneedtocopyonlythepart insideit.
Also replacePATH~1andPATH~2withthepathofnamenodeanddatanodefolderthat wecreated recently(step
4.1).
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
<property>
<name>dfs.namenode.name.dir</name>
<value>C:\hadoop\data\namenode</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>C:\hadoop\data\datanode</value>
</property>
</configuration>
Editingmapred-site.xml
Right clickonthefile, selecteditandpastethefollowingcontentwithin<configuration>
</configuration>tags.
Note:-Below partalreadyhastheconfigurationtag,weneedtocopyonlythepart insideit.
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Editingyarn-site.xml
Rightclickonthefile, selecteditandpastethefollowingcontentwithin<configuration>
</configuration>tags.
Note:-Belowpartalreadyhastheconfigurationtag,weneedtocopyonlythepart insideit.
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.auxservices.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Verifyinghadoop-env.cmd
Rightclickonthefile,selecteditandcheckiftheJAVA_HOMEissetcorrectlyor not.
WecanreplacetheJAVA_HOMEvariable inthe filewithyouractualJAVA_HOMEthat we configured in the
System Variable.
setJAVA_HOME=%JAVA_HOME% OR
setJAVA_HOME="C:\ProgramFiles\Java\jdk1.8.0_221"
Replacingbin

SKC Lakshmi Narain College Of Technology,Indore

Last stepinconfiguringthehadoopistodownloadandreplacethebinfolder.

SKC Lakshmi Narain College Of Technology,Indore

* GotothisGitHubRepoand downloadthebinfolderasazip.
* Extract the zip and copyall the files present under bin folder to %HADOOP_HOME%\bin Note:-
Ifyouareusingdifferent versionofHadoopthenpleasesearchforitsrespectivebin folder and download it.
4. TestingSetup
Congratulation..!!!!!
WearedonewiththesettinguptheHadoopinourSystem. Now we
need to check if everything works smoothly…

Note:-Wecanverifyifallthedaemonsare upandrunning usingjpscommandinnewcmdwindow.

5. RunningHadoop(VerifyingWebUIs)
Namenode
Openlocalhost:50070inabrowsertabto verifynamenode health.

Resourcemanger
Openlocalhost:8088 inabrowsertabtocheckresourcemanagerdetails.

SKC Lakshmi Narain College Of Technology,Indore

Datanode
Openlocalhost:50075inabrowsertabtocheckoutdatanode.

SKC Lakshmi Narain College Of Technology,Indore

EXPERIMENT:3
Aim: Visualize Data Using Basic Plotting Techniques In Python.
TocreateanapplicationthattakestheVisualizeDataUsingBasicPlottingTechniques. import pandas as pb
import matplotlib.pyplot as plt
import seaborn as sns
crime=pb.read_csv('crime.csv') crime

plt.plot(crime.Murder,crime.Assault);

SKC Lakshmi Narain College Of Technology,Indore

plt.bar(crime_bar.index,crime_bar.Robbery);

sns.barplot('Robbery','Year',data=crime);

SKC Lakshmi Narain College Of Technology,Indore

importmatplotlib.pyplotasplt import
pandas as pd
import numpy as np
data=pd.read_csv('crime.csv')
x=data.Population
y=data.CarTheft plt.scatter(x,y)
plt.xlabel('Population')
plt.ylabel('CarTheft')
plt.title('PopulationVsCarTheft')
plt.show();

SKC Lakshmi Narain College Of Technology,Indore

EXPERIMENT-4

Implement nosql Database Operations: Crud Operations, Arrays Using MONGO DB.
AIM:ToCreateaoperations forcrudandarrays withoutnosqldatasbase.
TITLE:BasicCRUD operationsinMongoDB.
CRUDoperationsrefertothebasicInsert,Read,UpdateandDeleteoperations. Inserting a
document into a collection (Create)
➢ The command db.collection.insert()will perform an insert operation into a collection of a
document.➢Letusinsert adocument toastudent collection.Youmust beconnectedto adatabase for doing
any insert. It is done as follows:
db.student.insert({ re
gNo: "3014",
name:"TestStudent",
course:{courseName:"MCA",duration:"3Years"}, address: {
city: "Bangalore",
state:"KA",
country:"India"}})
Anentryhas beenmadeintothecollectioncalledstudent.

Queryingadocumentfromacollection(Read)
Toretrieve(Select) theinserteddocument,runthebelowcommand. The find()commandwill retrieve all the
documents of the given collection.
db.collection_name.find()
➢ Ifarecordisto beretrieved basedonsomecriteria,thefind() methodshouldbecalledpassing parameters,
then the record will be retrieved based on the attributes specified.
db.collection_name.find({"fieldname":"value"})
➢ ForExample:Let usretrievetherecordfromthestudent collectionwheretheattributeregNo is 3014and the
query for the same is as shown below:
db.students.find({"regNo":"3014"})
Updatingadocument inacollection(Update)Inordertoupdatespecific fieldvaluesofacollection in MongoDB,
run the below query. db.collection_name.update()
update()methodspecifiedabovewilltakethe fieldnameandthenew valueasargumenttoupdate a document.
Let usupdatetheattributenameofthecollectionstudentforthedocument with regNo 3014.
db.student.update({
"regNo":"3014"
},
$set:
{
"name":"Viraj"
})
Removinganentryfromthecollection (Delete)
➢ Let us now look into the deleting an entry froma collection. In order to delete an entry froma
collection,runthecommandasshownbelow:db.collection_name.remove({"fieldname":"value"})
➢ ForExample: db.student.remove({"regNo":"3014"})

Notethatafterrunningtheremove()method,theentryhasbeendeleted fromthestudentcollection.

Working with Array sin Mongo DB

1. Introduction
In a MongoDB database, data is stored in collections and a collection has documents. A document has
fieldsand values, like ina JSON. The field types include scalar types(string, number, date,etc.)
andcompositetypes(arraysandobjects). Inthisarticlewewilllookat anexampleofusingthearray field type.
The example is an application where users create blog posts and write comments for the posts. The
relationship betweenthepostsandcommentsisOne-to-Many;i.e.,apost canhave manycomments. We
willconsider a collection ofblog posts with their comments. That is a post document willalso store the
related comments. In MongoDB's document model, a 1:N relationship data can be stored within a
collection; this is a de-normalized form of data. The related data is stored together and can be accessed
(and updated) together. The comments are stored as an array; an array of comment objects.
Asampledocument oftheblogpostswithcomments:
{
"_id":ObjectId("5ec55af811ac5e2e2aafb2b9"),
"name" : "Working with Arrays",
"user":"DatabaseRebel",
"desc":"Maintaininganarrayofobjects inadocument",
"content" : "some content ...",
"created" : ISODate("2020-05-20T16:28:55.468Z"),
"updated":ISODate("2020-05-20T16:28:55.468Z"),
"tags":["mongodb","arrays"],
"comments" : [
{
"user": "DB Learner",
"content":"Nicepost.",
"updated": ISODate("2020-05-20T16:35:57.461Z")
}
]
}
Inanapplication,ablogpost iscreated,commentsareadded,queried, modifiedordeletedbyusers. In the
example, we willwrite code to create a blog post document, and do some CRUD operations with comments
for the post.

2. CreateandQueryaDocument
Let'screateablog post document.Wewilluseadatabase called as blogsand acollectioncalled as
posts.Thecodeiswritteninmongoshell(aninteractiveJavaScript interfacetoMongoDB).Mongo shell is started
fromthe command line and is connected to the MongoDB server. Fromthe shell: use blogs
NEW_POST=
{
name:"WorkingwithArrays", user:
"Database Rebel",
desc:"Maintaininganarrayofobjectsinadocument", content:
"some content...",
created: ISODate(),
updated:ISODate(),
tags:["mongodb", "arrays"]
}
db.posts.insertOne(NEW_POST)
Returns a result { "acknowledged" : true, "insertedId" : ObjectId("5ec55af811ac5e2e2aafb2b9") }
indicatingthat anewdocument iscreated.Thisisacommonacknowledgement whenyouperforma write
operation. When a document is inserted into a collection for the first time, the collection gets created (if it
doesn't exist already). The insertOne method inserts a document into the collection.
Now,let'squerythecollection:
db.posts.findOne()
{
"_id":ObjectId("5ec55af811ac5e2e2aafb2b9"),
"name" : "Working with Arrays",
"user":"DatabaseRebel",
"desc":"Maintaininganarrayofobjects inadocument",
"content" : "some content...",
"created" : ISODate("2020-05-20T16:28:55.468Z"),
"updated":ISODate("2020-05-20T16:28:55.468Z"),
"tags":[
"mongodb",
"arrays"
]
}
The findOne method retrieves one matching document fromthe collection. Notethe scalar fields
name(stringtype)andcreated(datetype),andthearrayfieldtags.Inthenewlyinserteddocument there are no
comments, yet.
EXPERIMENT:5
Implement Functions: Count –Sort–Limit–Skip–Aggregate Using MONGODB.

AIM: To create function operations for sort, limit, skipand aggregate.

COUNT
Howdo yougetthenumberofDebitandCredittransactions?Onewayto do it is by using count()
function as below
> db.transactions.count({cr_dr:"D"});
or
>db.transactions.find({cr_dr:"D"}).length();
But what if you do not know the possible values of cr_dr upfront. HereAggregation framework
comes to play. See the below Aggregate query.
> db.transactions.aggregate( [
{
$group:{
_id:'$cr_dr', //groupbytypeoftransaction
//Add1foreachdocumenttothecountforthistypeof
transaction
count:{$sum:1}
}
}
]
);
Andtheresultis
{
"_id":"C",
"count" : 3
}
{
"_id":"D",
"count" : 5
}

2.SORT
Definition
$sort
Sortsallinputdocumentsandreturnsthemtothepipelineinsortedorder.

The
$sort

SKC Lakshmi Narain College Of Technology,Indore

stagehasthefollowingprototypeform:

{$sort:{<field1>:<sortorder>,<field2>:<sortorder>...}}

$sort
takes a document that specifies the field(s) to sort by and the respective sort order. <sort
order> can have one of the following values:

Value
Description 1
Sortascending.
-1
Sortdescending.
{$meta:"textScore"}
SortbythecomputedtextScoremetadataindescendingorder.See Text Score Metadata
Sort
foranexample.
Ifsorting on multiple fields, sort order is evaluated from left to right. For example, in the
form above, documents are first sorted by <field1>. Then documents with the same <field1>
values are further sorted by <field2>.

Behavior
Limits
Youcansortonamaximumof32keys.

SortConsistency
MongoDB does not store documents in a collection in a particular order. When sorting on a
field which contains duplicate values, documents containing those values may be returned in
any order.

If consistent sort order is desired, include at least one field in your sort that contains unique
values. The easiest way to guarantee this isto include the _id field in your sort query.

Considerthefollowingrestaurantcollection:

db.restaurants.insertMany( [
{"_id":1,"name":"CentralParkCafe","borough":"Manhattan"},
{ "_id":2,"name":"Rock AFeller BarandGrill", "borough": "Queens"},
{"_id":3,"name":"EmpireStatePub","borough":"Brooklyn"},
{"_id":4,"name":"Stan'sPizzaria","borough":"Manhattan"},
{"_id":5,"name":"Jane'sDeli","borough":"Brooklyn"},
])

SKC Lakshmi Narain College Of Technology,Indore

Thefollowingcommandusesthe
$sort
stagetosortontheborough field:

db.restaurants.aggregate( [
{$sort:{ borough:1}}
]
)

In this example, sort order may be inconsistent, since the borough field contains duplicate
values for both Manhattan and Brooklyn. Documents are returned in alphabetical order by
borough, but the order of those documents with duplicate values for borough might not the
be the same across multiple executions of the same sort. For example, here are the results
from two different executions of the above command:

{"_id":3,"name":"EmpireStatePub","borough":"Brooklyn"}
{"_id":5,"name":"Jane'sDeli","borough":"Brooklyn"}
{"_id":1,"name":"CentralParkCafe","borough":"Manhattan"}
{"_id":4,"name":"Stan'sPizzaria","borough":"Manhattan"}
{"_id":2,"name":"RockAFellerBarandGrill","borough":"Queens"
}
{"_id":5,"name":"Jane'sDeli","borough":"Brooklyn"}
{"_id":3,"name":"EmpireStatePub","borough":"Brooklyn"}
{"_id":4,"name":"Stan'sPizzaria","borough":"Manhattan"}
{"_id":1,"name":"CentralParkCafe","borough":"Manhattan"}
{"_id":2,"name":"RockAFellerBarandGrill","borough":"Queens"
}
While the values for borough are still sorted in alphabetical order, the order of the
documents containing duplicate values for borough (i.e. Manhattan and Brooklyn) is not the
same.

Toachieve a consistent sort, add a fieldwhichcontains exclusively unique values to the sort.
The following command uses the
$sort
stagetosortonboththeboroughfieldandthe_idfield:

db.restaurants.aggregate( [
{$sort:{borough:1, _id:1}}
]
)

SKC Lakshmi Narain College Of Technology,Indore

Since the _id field is always guaranteed to contain exclusively unique values, the returned
sort order will always be the same across multiple executions of the same sort.

Examples Ascending/DescendingSort
For the field or fields to sort by, set the sort order to 1 or -1 to specifyan ascending or descending
sort respectively, as in the following example:

db.users.aggregate( [
{$sort:{ age :-1,posts:1} }
]
)

This operation sorts the documents in the users collection, in descending order according by
the age field and then in ascending order according to the value in the posts field.
2. LIMIT

$sort
Sortsallinputdocumentsandreturnsthemtothepipelineinsortedorder. The $sort stage has the
following prototype form:
{$sort:{<field1>:<sortorder>,<field2>:<sortorder>...}}
$sort takes a document that specifies the field(s) to sort by and the
respectivesortorder.<sortorder>canhaveoneofthefollowingvalues:
Value Description
1 Sortascending.
-1 Sortdescending.
{ $meta: Sort by the computed textScore metadata in descending order. See
"textScore" } Text Score Metadata Sort for an example.
If sorting on multiple fields, sort order is evaluated from left to right. For example, in the
form above, documents are first sorted by <field1>. Then documents with the same <field1>
values are further sorted by <field2>.

Behavior

Limits
Youcansortonamaximumof32keys.

SKC Lakshmi Narain College Of Technology,Indore

that contains unique values. The easiest way to guarantee this isto include the _id field in
your sort query.
Considerthefollowingrestaurantcollection:
db.restaurants.insertMany([
{"_id":1,"name":"CentralParkCafe","borough":"Manhattan"},
{"_id":2,"name":"RockAFellerBarandGrill", "borough":"Queens"},
{"_id":3,"name":"EmpireStatePub","borough":"Brooklyn"},
{"_id":4,"name":"Stan'sPizzaria","borough":"Manhattan"},
{"_id":5,"name":"Jane'sDeli","borough":"Brooklyn"},
])
Thefollowingcommandusesthe$sortstagetosortontheboroughfield:
db.restaurants.aggregate(
[
{$sort:{borough:1} }
]
)
In this example, sort order may be inconsistent, since the borough field contains duplicate
values for both Manhattan and Brooklyn. Documents are returned in alphabetical order by
borough, but the order of those documents with duplicate values for borough might not the
be the same across multiple executions of the same sort. For example, here are the results
from two different executions of the above command:
{"_id":3,"name":"EmpireStatePub","borough":"Brooklyn"}
{"_id":5,"name":"Jane'sDeli","borough":"Brooklyn"}
{"_id":1,"name":"CentralParkCafe","borough":"Manhattan"}
{"_id":4,"name":"Stan'sPizzaria","borough":"Manhattan"}
{"_id":2,"name":"RockAFellerBarandGrill","borough":"Queens"}
{"_id":5,"name":"Jane'sDeli","borough":"Brooklyn"}
{"_id":3,"name":"EmpireStatePub","borough":"Brooklyn"}
{"_id":4,"name":"Stan'sPizzaria","borough":"Manhattan"}
{"_id":1,"name":"CentralParkCafe","borough":"Manhattan"}
{"_id":2,"name":"RockAFellerBarandGrill","borough":"Queens"}
While the values for borough are still sorted in alphabetical order, the
orderofthedocumentscontainingduplicatevalues for borough (i.e. Manhattan and Brooklyn)
is not the same.
To achieve a consistent sort, add a field which containsexclusively unique values to the sort.
The following command uses the $sort stage to sort on both the borough field and the _id
field:
db.restaurants.aggregate(
[
{$sort:{borough:1,_id:1}}
]
)
Since the _id field is always guaranteed to contain exclusively unique
values,thereturnedsortorderwillalwaysbethesameacrossmultiple

SKC Lakshmi Narain College Of Technology,Indore

executionsofthesamesort.

Examples Ascending/DescendingSort
For the field or fields to sort by, set the sort order to 1or-1to specify an ascending or descending
sort respectively, as in the following example:
db.users.aggregate(
[
{$sort:{age :-1,posts:1}}
]
)

SKC Lakshmi Narain College Of Technology,Indore

EXPERIMENT:6

Implement Word Count/Frequency Programs Using Map Reduce.

import sys
#inputcomesfromSTDIN(standardinput) for
line in sys.stdin:
line=line.strip()#removeleadingandtrailingwhitespace words
= line.split()# split the line into words
#increasecounters
for word in words:
#writetheresultstoSTDOUT(standardoutput); #
what we output here will be the input for the#
Reduce step, i.e. the input for reducer.py
#tab-delimited;thetrivialwordcountis1 print
'%s\t%s' % (word, 1)
Reducerprogram
"""reducer.py"""
fromoperatorimportitemgetter
import sys
current_word=None
current_count = 0
word = None

#inputcomesfromSTDIN for
line in sys.stdin:
line=line.strip()#removeleadingandtrailingwhitespace #
parse the input we got from mapper.py
word,count=line.split('\t',1)
#convertcount(currentlyastring)toint try:
count=int(count)
exceptValueError:
#countwasnotanumber,sosilently #
ignore/discard this line
continue

#thisIF-switchonlyworksbecauseHadoopsortsmapoutput # by
key (here: word) before it is passed to the reducer
if current_word == word:
current_count+=count
else:
ifcurrent_word:
#writeresulttoSTDOUT
print'%s\t%s'%(current_word,current_count)
current_count = count
current_word=word

#donotforgettooutputthelastwordifneeded! if
current_word == word:
print'%s\t%s'%(current_word,current_count)

Testthecode(catdata |map |sort|reduce)

hduser@ubuntu:~$echo"foofooquuxlabsfoobarquux"|/home/hduser/mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1

hduser@ubuntu:~$echo"foofooquuxlabsfoobarquux"|/home/hduser/mapper.py|sort-k1,1|
/home/hduser/reducer.py
bar 1
foo 3
labs1
quux2

hduser@ubuntu:~$cat/tmp/gutenberg/20417-8.txt|/home/hduser/mapper.py
The 1
Project1
Gutenberg 1
EBook1
of 1
EXPERIMENT:7

Implementa Map Reduce Program that process dataset.

AIM: Tocreateprocessdatasetusingmapreducefunctions.

Thepythonprogramreadsthedatafromadataset(storedinthe filedata.csv- winequality). The data mapped

is stored in shuffled.pkl using mapper.py.
Thecontentsofshuffled.pklarereducedusingreducer.py
MapperProgram

Importpandasaspd
Importpickle

data=pd.read_csv('data.csv')

#SlicingData
slice1=data.iloc[0:399,:]
slice2=data.iloc[400:800,:]
slice3=data.iloc[801:1200,:]
slice4=data.iloc[1201:,:]

defmapper(data):
mapped=[]

forindex,rowindata.iterrows():
mapped.append((row['quality'],row['volatileacidity']))
Returnmapped

map1=mapper(slice1)
map2=mapper(slice2)
map3=mapper(slice3)
map4=mapper(slice4)

shuffled={
3.0:[],
4.0:[],
5.0:[],
6.0:[],
7.0:[],
8.0:[],
}
for iin[map1,map2,map3,map4]:
for jin i:
shuffled[j[0]].append(j[1])
file=open('shuffled.pkl','ab')
pickle.dump(shuffled,file)
file.close()

print("Datahasbeenmapped.Now,run reducer.pytoreducethecontentsin
shuffled.pkl file.")

ReducerProgram

Import
Pickle
file=open('shuffled.pkl','rb')
shuffled= pickle.load(file)
defreduce(shuffled_dict):
reduced={}

for iinshuffled_dict:

reduced[i]=sum(shuffled_dict[i])/len(shuffled_dict[i])

Returnreduced
final=reduce(shuffled)
print("Averagevolatileacidityindifferentclasses ofwine:")
foriinfinal:
print(i,':',final[i])
EXPERIMENT:8

ImplementClusteringTechniquesUsingSPARK.

AIM:Tocreatea clusteringusingSPARK.
#Loadsdata.
dataset=spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

#Trainsak-meansmodel.
kmeans=KMeans().setK(2).setSeed(1)
model = kmeans.fit(dataset)

#EvaluateclusteringbycomputingWithinSetSumof SquaredErrors.
wssse = model.computeCost(dataset)
print("WithinSetSumofSquaredErrors="+str(wssse))

#Showstheresult.
centers=model.clusterCenters()
print("Cluster Centers: ")
forcenterincenters:
print(center)
BIGDATAANALYTICSLAB 2023-2024
BIGDATAANALYTICSLAB 2023-2024
EXPERIMENT: 9

Implementan Application that Stores BigData in MONGO DB / PIG Using Hadoop/ R.

AIM: To design application to stores data in mong db using hadoop.

RShinyTutorial:HowtoMakeInteractiveWebApplicationsinR Introduction
In this modern technological era, various apps are available for all of us –from tracking our fitness level, sleep
to giving usthe latest informationaboutthe stockmarkets. Appslike Robinhood,Google Fit and Workit seem so
amazingly useful because they use real-time data and statistics. As R is a frontrunnerin thefield of statistical
computing and programming, developers need a system to useits power to build apps.
This is where R Shiny comes to save the day. In this, R Shiny tutorial, you will come to know the basics.
WhatisRShiny?
Shiny is an R package that was developed for building interactive web applications in R. Using this, you can
create web applications utilizing native HTML and CSS code along with R Shinycode. You can build
standalone web apps on a website that will make data visualization easy.These applications made through R
Shinycan seamlessly display R objects such as tables and plots.
Letuslookatsomeofthefeatures ofRShiny:
 Buildwebapplicationswithfewerlinesofcode,without JavaScript.
 Theseapplicationsareliveandareaccessibletouserslikespreadsheets.Theoutputsmay alter in real-time
if the users change the input.
 Developerswith littleknowledgeofwebtoolscanalsobuildappsusingR Shiny.
 Yougetin-builtwidgetstodisplaytables,outputsofRobjectsand plots.
 Youcanadd livevisualizations and reportstothewebapplicationusingthispackage.
 TheuserinterfacescanbecodedinRor canbeprepared usingHTML,CSSorJavaScript.
 Thedefaultuserinterface isbuiltusingBootstrap.
 ItcomeswithaWebSocketpackagethat enablesfast communicationbetweenthewebserver and R.

ComponentsofanRShiny app
A Shiny app has two primary components – a user interface object and a server function. These are
the arguments passed on to the shinyApp method. This method creates an application object using
the arguments.

LetusunderstandthebasicpartsofanRShinyappindetail:

Userinterfacefunction
This function defines the appearance of the web application. It makes the application interactive by
obtaining input from the user and displaying it on the screen. HTML and CSS tags can be used for
making the application look better. So, while building the ui.R file you create an HTML file with R
functions.

If you type fluidPage() in the R console, you will see that the method returns a tag <div
class=‖container-fluid‖></div>.

Thedifferentinputfunctionsare:
 selectInput() – This method is used for creating a dropdown HTML that has various choices
to select.
 numericInput()–Thismethodcreatesaninputareaforwritingtextornumbers.
 radioButtons()–Thisprovidesradiobuttonsfortheusertoselectan input.

Layoutmethods
ThevariouslayoutfeaturesavailableinBootstrapareimplemented byRShiny.Thecomponentsare:

Panels
Thesearemethodsthatgroupelementstogetherintoasinglepanel.Theseinclude:

 absolutePanel()
 inputPanel()
 conditionalPanel()
 headerPanel()
 fixedPanel()

Layoutfunctions
Theseorganizethepanels foraparticularlayout. These include:

 fluidRow()
 verticalLayout()
 flowLayout()
 splitLayout()
 sidebarLayout()

Outputmethods
ThesemethodsareusedfordisplayingRoutputcomponentsimages,tablesandplots.Theyare:

 tableOutput()–Thismethod isusedfordisplayinganRtable
 plotOutput()–This methodisusedfordisplayinganRplotobject

Serverfunction
After you have created the appearance of the application and the ways to take input values from the user, it
is time to set upthe server. The server functions help youto writethe server-side code forthe Shiny app. You
can create functions that map the user inputs to the corresponding outputs. This function is called bythe
web browser when the application is loaded.

It takes an input and output parameter, and return values are ignored. An optional session parameter is also
taken by this method.

RShinytutorial:HowtogetstartedwithRShiny?
Stepsto startworkingwiththeRShinypackage are asfollows:

 GototheRconsoleandtypeinthecommand–install.packages(―shiny‖)
 The package comes with 11 built-in application examples for you to understand how Shinyworks