
SE 7204 BIG Data Analysis Unit I Final

This document provides an overview of a course on big data analytics. The course objectives are to explore fundamental concepts of big data analytics, analyze big data using intelligent techniques, understand various search and visualization techniques, and learn techniques for mining data streams. The outcomes are that students will be able to work with big data platforms and analyze big data for business applications, select visualization tools, implement search and visualization techniques, and design algorithms for mining large volumes of data.

Uploaded by

Dr.A.R.Kavitha

SE 7204 BIG DATA ANALYTICS

COURSE OBJECTIVES
To explore the fundamental concepts of big data analytics.
To learn and analyze big data using intelligent techniques.
To understand the various search methods and visualization techniques.
To learn and use various techniques for mining data streams.
To understand the applications using MapReduce concepts.
OUTCOMES
The students will be able to:
Work with a big data platform and its analysis techniques.
Analyze big data for useful business applications.
Select visualization techniques and tools to analyze big data.
Implement search methods and visualization techniques.
Design efficient algorithms for mining data from large volumes.
Explore the technologies associated with big data analytics, such as NoSQL, Hadoop and MapReduce.
UNIT I INTRODUCTION TO BIG DATA
Introduction to Big Data Platform
Challenges of Conventional Systems
Intelligent data analysis (1-4) (David J. Hand, Imperial College, United Kingdom)
Nature of Data (8-12)
Analytic Processes and Tools
Analysis vs. Reporting
Modern Data Analytic Tools (12-14)
Statistical Concepts: Sampling Distributions (12-17) (Statistical concepts for intelligent data analytics, A. J. Feelders)
Re-Sampling (41-51)
Statistical Inference (17-29)
Prediction Error (30-41)
Introduction to Big Data

Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (Gartner)
Big data is relentless. It is continuously generated on a massive scale. It is generated by online interactions among people, by transactions between people and systems, and by sensor-enabled instrumentation.
Definition and Characteristics of Big Data
Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. (Gartner)
This definition was derived from:
While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult.
E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety.
... must compile a variety of approaches to have at their disposal for dealing with each.
What made Big Data needed?
Key Computing Resources for Big Data
Processing capability: CPU, processor, or node.
Memory
Storage
Network
Defining Big Data Via the Three Vs
Challenges of Conventional Systems
Intelligent data analysis
Data analysis is the most powerful tool to bring into your business. Employing the powers of analysis can be comparable to finding gold in your reports, which allows your business to increase profits and further develop.
Nature of Data:
Categories of 'Big Data'

'Big data' can be found in three forms:
Structured
Unstructured
Semi-structured
Structured Data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
Unstructured Data
Any data whose form or structure is unknown is classified as unstructured data.
Examples Of Unstructured Data
Semi-structured Data
Semi-structured data can contain both forms of data.
Semi-structured data appears structured in form, but that structure is not actually defined by a fixed schema.
An example of semi-structured data is data represented in an XML file.
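As a minimal sketch of why XML is treated as semi-structured, the Python snippet below parses a small, made-up XML document in which the records share tags but not the same set of fields. The element names (`employees`, `employee`, `name`, `dept`, `email`) and the values are illustrative assumptions, not taken from the course material.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the tags give the data some structure, but no schema
# forces every record to carry the same fields.
XML_DATA = """
<employees>
  <employee id="1"><name>Asha</name><dept>Finance</dept></employee>
  <employee id="2"><name>Ravi</name><email>ravi@example.com</email></employee>
</employees>
"""

def parse_employees(xml_text):
    """Return one dict per <employee>, keeping whatever fields are present."""
    root = ET.fromstring(xml_text)
    records = []
    for emp in root.findall("employee"):
        record = {"id": emp.get("id")}
        for child in emp:
            record[child.tag] = child.text
        records.append(record)
    return records

records = parse_employees(XML_DATA)
for record in records:
    print(record)
```

The two records come back with different keys, which is the practical difference from a fixed-format (structured) table where every row has the same columns.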

Nature of Data
Analytic Processes and Tools

Train a classifier
Preprocessing raw data
Converting data into training data for classifier
Converting classifiable data into vectors
Analytic Processes and Tools
HDFS: the Hadoop Distributed File System enables the storage of large files by distributing the data among a pool of data nodes. A file created in HDFS appears to be a single file, even though HDFS splits it into blocks ("chunks") that are stored on individual data nodes.
ZOOKEEPER: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HBASE: HBase is derived from Google's Bigtable and is a column-oriented data layout that, when layered on top of Hadoop, provides a fault-tolerant method for storing and manipulating large data tables.
HIVE: Hive is layered on top of the file system and execution framework for Hadoop and enables applications ...
PIG: the Pig environment allows developers to create new user-defined functions.
MAHOUT: Mahout is a project to provide a library of scalable implementations of machine learning algorithms on top of MapReduce and Hadoop.
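The HDFS idea described above (one logical file, stored as blocks spread over data nodes) can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: it does not use any Hadoop API, the function names and node names are hypothetical, blocks are placed round-robin, and nothing is replicated, whereas real HDFS uses much larger blocks and keeps several copies of each.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes):
    """Assign blocks to data nodes round-robin: {node: [(index, block), ...]}."""
    placement = {node: [] for node in nodes}
    for index, block in enumerate(blocks):
        node = nodes[index % len(nodes)]
        placement[node].append((index, block))
    return placement

def read_file(placement):
    """Reassemble the file by collecting blocks from all nodes in index order."""
    indexed = [pair for stored in placement.values() for pair in stored]
    return b"".join(block for _, block in sorted(indexed))

data = b"big data is continuously generated on a massive scale"
blocks = split_into_blocks(data, block_size=8)
placement = place_blocks(blocks, ["node-a", "node-b", "node-c"])  # hypothetical nodes
restored = read_file(placement)
print(len(blocks), "blocks; round trip ok:", restored == data)
```

The caller only ever sees `data` going in and `restored` coming out, which mirrors how a client sees a single file even though the pieces live on different nodes.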
Analysis vs Reporting
Reporting translates data into information, while analysis turns information into insights.
Reporting should enable users to ask "What?" questions about the information, whereas analysis should answer "Why?" and "What can we do about it?"
5 differences between reporting and analysis:
1. Purpose
2. Tasks
3. Outputs
4. Delivery
5. Value
Purpose: Before covering the differing roles of reporting and analysis, let's start with some high-level definitions of these two key areas of analytics.
Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
Outputs: There are three main types of reporting: canned reports, dashboards, and alerts.
Delivery: Reporting is more of a push model, where people can access reports through an analytics tool, Excel spreadsheet, or widget, or have them scheduled for delivery into their mailbox, mobile device, FTP site, etc.
Analysis is all about human beings using their superior reasoning and analytical skills to extract key insights from the data and form actionable recommendations for their organizations.
1. Purpose
Reporting has helped companies monitor their data since before digital technology boomed.
Various organizations have been dependent on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link cross-channel data, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.
2. Tasks
As reporting and analysis have a very fine line dividing them, it's sometimes easy to take a task labeled as analysis for the real thing when all it does is reporting. Hence, ensure that your analytics team has a healthy balance of both.
Here's a great differentiator to keep in mind when deciding whether what you're doing is reporting or analysis:
Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It's very similar to the activities mentioned above, like turning data into charts and graphs and linking data across multiple channels.
Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.
3. Outputs
Reporting and analysis have a push and pull effect on their users through their outputs. Reporting has a push approach, as it pushes information to users, and its outputs come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst draws information to probe further and to answer business questions. Its outputs can be in the form of ad hoc responses and analysis presentations. Analysis presentations comprise insights, recommended actions, and a forecast of their impact on the company, all in language that's easy to understand at the level of the user who'll be reading it and deciding on it.
This is important for organizations to truly realize the value of data: a standard report is not the same as a meaningful analysis.
4. Delivery
Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It's not surprising that the first things outsourced are data entry services, since outsourcing companies are perceived as data reporting experts.
Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal.
This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders or business executives make decisions about their businesses.
5. Value
This isn't about identifying which one brings more value, but rather understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.
This Path to Value diagram illustrates how data converts into value through reporting and analysis, such that one is not achievable without the other.
Data → Reporting → Analysis → Decision-making → Action → Value
Data alone is useless, and action without data is baseless. Both reporting and analysis are vital to bringing value to your data and operations.
Reference
https://blogs.adobe.com/digitalmarketing/analytics/reporting-vs-analysis-whats-the-difference/
http://www.infinitdatum.com/blog/5-differences-between-reporting-and-analysis/
Modern Data Analytic Tools
http://www.slideshare.net/GWOcon/great-wideopentalk
Statistical Concepts: Sampling Distributions
The sampling distribution is a distribution of a sample statistic. While the concept of a distribution of a set of numbers is intuitive for most students, the concept of a distribution of a set of statistics is not.
The sampling distribution is a model of a distribution of scores, like the population distribution, except that the scores are not raw scores, but statistics. It is a thought experiment: "What would the world be like if a person repeatedly took samples of size N from the population distribution and computed a particular statistic each time?" The resulting distribution of statistics is called the sampling distribution of that statistic.
For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again computed. If this process were repeated an infinite number of times, the distribution of the now infinite number of sample means would be called the sampling distribution of the mean.
Every statistic has a sampling distribution. For example, suppose that instead of the mean, medians were computed for each sample. The infinite number of medians would be called the sampling distribution of the median.
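The thought experiment can be approximated on a computer by replacing "an infinite number of times" with a large but finite number of repetitions. The sketch below uses only Python's standard library; the population (10,000 values drawn from a normal distribution with mean 50 and standard deviation 10), the 2,000 repetitions, and the function name are assumptions chosen for illustration.

```python
import random
import statistics

def sampling_distribution(population, n, statistic, repeats, rng):
    """Draw `repeats` samples of size n and return the statistic of each.

    rng.choices samples with replacement, which approximates independent
    draws from the population distribution.
    """
    return [statistic(rng.choices(population, k=n)) for _ in range(repeats)]

rng = random.Random(42)  # fixed seed so the run is repeatable
# Hypothetical population, made up for this example.
population = [rng.gauss(50, 10) for _ in range(10_000)]

means = sampling_distribution(population, 16, statistics.mean, 2_000, rng)
medians = sampling_distribution(population, 16, statistics.median, 2_000, rng)

pop_sd = statistics.pstdev(population)
print("mean of sample means:", round(statistics.mean(means), 2))
print("sd of sample means:  ", round(statistics.stdev(means), 2))
print("sigma / sqrt(16):    ", round(pop_sd / 16 ** 0.5, 2))
print("sd of sample medians:", round(statistics.stdev(medians), 2))
```

The spread of the simulated means should land close to the population standard deviation divided by the square root of 16, and the medians should vary somewhat more than the means for this roughly normal population.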
Re-Sampling
In statistics, resampling is any of a variety of methods for doing one of the following:
Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)
Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
Validating models by using random subsets (bootstrapping, cross-validation)
Common resampling techniques include bootstrapping, jackknifing and permutation tests.
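The three techniques named above can each be sketched in a few lines of standard-library Python. The function names and the two small data sets are hypothetical, and the permutation test shown here samples random relabelings (a Monte Carlo approximation) rather than enumerating every permutation.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_resamples, rng):
    """Standard error estimated by resampling the data with replacement."""
    estimates = [statistic(rng.choices(data, k=len(data)))
                 for _ in range(n_resamples)]
    return statistics.stdev(estimates)

def jackknife_se(data, statistic):
    """Standard error estimated from the n leave-one-out subsets."""
    n = len(data)
    loo = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    centre = statistics.mean(loo)
    return ((n - 1) / n * sum((x - centre) ** 2 for x in loo)) ** 0.5

def permutation_p_value(a, b, n_permutations, rng):
    """Two-sided p-value for a difference in means, by shuffling group labels."""
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:len(a)])
                   - statistics.mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)

rng = random.Random(7)  # fixed seed for repeatability
# Hypothetical measurements for two groups.
group_a = [12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.8, 12.6]
group_b = [14.9, 13.7, 15.2, 12.8, 16.1, 14.4, 15.5, 13.9]

se_boot = bootstrap_se(group_a, statistics.mean, 2_000, rng)
se_jack = jackknife_se(group_a, statistics.mean)
p_value = permutation_p_value(group_a, group_b, 2_000, rng)
print("bootstrap SE of mean:", round(se_boot, 3))
print("jackknife SE of mean:", round(se_jack, 3))
print("permutation p-value: ", round(p_value, 4))
```

For the sample mean, the jackknife estimate coincides with the textbook standard error (sample standard deviation divided by the square root of n), which makes it a convenient sanity check before applying the same functions to medians or percentiles.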
Statistical Inference
Statistical Inference, Model & Estimation
Recall that statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and the sample characteristics are statistics.
A statistical model is a representation of a complex phenomenon that generated the data.
It has mathematical formulations that describe relationships between random variables and parameters.
It makes assumptions about the random variables, and sometimes the parameters.
A general form: data = model + residuals
The model should explain most of the variation in the data.
Residuals are a representation of lack of fit, that is, of the portion of the data unexplained by the model.
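To make "data = model + residuals" concrete, the sketch below fits a straight line by ordinary least squares and splits each observation into a fitted value and a residual. A straight line is only one possible model, chosen here as an assumption for illustration; the data points and the function name are made up.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]  # hypothetical observations

intercept, slope = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]        # the "model" part
residuals = [y - f for y, f in zip(ys, fitted)]     # the "residuals" part

# Share of the variation in the data that the model explains (R squared).
y_bar = statistics.mean(ys)
ss_total = sum((y - y_bar) ** 2 for y in ys)
ss_resid = sum(r ** 2 for r in residuals)
r_squared = 1 - ss_resid / ss_total
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.4f}")
```

Each `ys[i]` equals `fitted[i] + residuals[i]`, and an R squared near 1 corresponds to a model that explains most of the variation in the data.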
Estimation represents the ways or process of learning and determining the population parameter based on the model fitted to the data.
Point estimation, interval estimation, and hypothesis testing are the three main ways of learning about the population parameter from the sample statistic.
An estimator is a particular example of a statistic, which becomes an estimate when the formula is replaced with actual observed sample values.
Point estimation = a single value that estimates the parameter. Point estimates are single values calculated from the sample.
Confidence intervals = a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence.
Hypothesis tests = tests for a specific value (or values) of the parameter.
In order to perform these inferential tasks, i.e., make inferences about the unknown population parameter from the sample statistic, we need to know the likely values of the sample statistic. What would happen if we sampled many times?
We need the sampling distribution of the statistic. It depends on the model assumptions about the population distribution, and/or on the sample size.
Standard error refers to the standard deviation of a sampling distribution.
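The three inferential tasks can be sketched for the simplest case, the mean of a single sample. The example assumes the sampling distribution of the mean is approximately normal (for a sample this small a t distribution would usually be preferred); the sample values and the hypothesized mean `mu0` are invented for illustration.

```python
import statistics
from statistics import NormalDist

# Hypothetical sample of 20 measurements.
sample = [51.2, 49.8, 53.1, 52.4, 50.9, 54.0, 48.7, 52.8, 51.5, 53.6,
          50.2, 52.1, 49.5, 53.3, 51.9, 52.7, 50.6, 54.2, 51.1, 52.3]
n = len(sample)

# Point estimation: a single value for the population mean.
point_estimate = statistics.mean(sample)

# Standard error: estimated standard deviation of the sampling distribution.
standard_error = statistics.stdev(sample) / n ** 0.5

# Interval estimation: 95% confidence interval (normal approximation).
z = NormalDist().inv_cdf(0.975)
ci = (point_estimate - z * standard_error,
      point_estimate + z * standard_error)

# Hypothesis test: H0 says the population mean equals mu0 (two-sided).
mu0 = 50.0
z_stat = (point_estimate - mu0) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z_stat)))

print(f"point estimate: {point_estimate:.3f}")
print(f"standard error: {standard_error:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"z = {z_stat:.2f}, p-value = {p_value:.4g}")
```

The interval and the test tell the same story here: `mu0` lies outside the 95% interval exactly when the two-sided p-value falls below 0.05.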
Prediction Error
Prediction error is a discontinuity attribute that removes the predictable image components and reveals the unpredictable.
To use prediction error as a discontinuity attribute (the original goal and starting point of my project), one has to devise a prediction-error computation that predicts and removes the plane-wave volumes of sedimentary layers but that is incapable of predicting the discontinuities.
Ref: http://sep.stanford.edu/public/docs/sep99/cohy_Fig/paper_html/node38.html

End of Unit I
