SE 7204 BIG Data Analysis Unit I Final

This document provides an overview of a course on big data analytics. The course objectives are to explore fundamental concepts of big data analytics, analyze big data using intelligent techniques, understand various search and visualization techniques, and learn techniques for mining data streams. The outcomes are that students will be able to work with big data platforms and analyze big data for business applications, select visualization tools, implement search and visualization techniques, and design algorithms for mining large volumes of data.

Uploaded by

Dr.A.R.Kavitha

SE 7204 BIG DATA ANALYTICS

COURSE OBJECTIVES
To explore the fundamental concepts of big data analytics.
To learn and analyze big data using intelligent techniques.
To understand the various search methods and visualization techniques.
To learn and use various techniques for mining data streams.
To understand the applications using MapReduce concepts.

OUTCOMES
The students will be able to:
Work with a big data platform and its analysis techniques.
Analyze big data for useful business applications.
Select visualization techniques and tools to analyze big data.
Implement search methods and visualization techniques.
Design efficient algorithms for mining data from large volumes.
Explore the technologies associated with big data analytics, such as NoSQL, Hadoop and MapReduce.
UNIT I: INTRODUCTION TO BIG DATA
Introduction to Big Data Platform
Challenges of Conventional Systems
Intelligent Data Analysis (1-4) (David J. Hand, Imperial College, United Kingdom)
Nature of Data (8-12)
Analytic Processes and Tools
Analysis vs. Reporting
Modern Data Analytic Tools (12-14)
Statistical Concepts: Sampling Distributions (12-17) (Statistical Concepts for Intelligent Data Analytics, A. J. Feelders)
Re-Sampling (41-51)
Statistical Inference (17-29)
Prediction Error (30-41)
Introduction to Big Data

"Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making." (Gartner)
Big data is relentless. It is continuously generated on a massive scale: by online interactions among people, by transactions between people and systems, and by sensor-enabled instrumentation.
Definition and Characteristics of Big Data
Gartner's definition (high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making) was derived from the following observations:
While enterprises struggle to consolidate systems and collapse redundant databases to enable greater operational, analytical, and collaborative consistencies, changing economic conditions have made this job more difficult.
E-commerce, in particular, has exploded data management challenges along three dimensions: volumes, velocity and variety.
Organizations must compile a variety of approaches to have at their disposal for dealing with each.
What Made Big Data Needed?
Key Computing Resources for Big Data
Processing capability: CPU, processor, or node
Memory
Storage
Network
Defining Big Data Via the Three Vs
Challenges of Conventional Systems
Intelligent Data Analysis
Data analysis is one of the most powerful tools to bring into your business. Employing analysis can be compared to finding gold in your reports: it allows your business to increase profits and develop further.
Nature of Data
Categories of 'Big Data'

'Big data' can be found in three forms:
Structured
Unstructured
Semi-structured
Structured Data
Any data that can be stored, accessed and processed in a fixed format is termed 'structured' data.
Unstructured Data
Any data whose form or structure is unknown is classified as unstructured data.
Semi-structured Data
Semi-structured data can contain both forms of data.
Semi-structured data looks structured in form, but its structure is not actually defined by a fixed schema.
An example of semi-structured data is data represented in an XML file.
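To make the XML example concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The `<employees>` document, its field names and the helper `parse_employees` are all invented for illustration; they do not come from the slides. The point is that the tags give each record a visible shape, yet nothing forces every record to carry the same fields.

```python
import xml.etree.ElementTree as ET

# Hypothetical data, invented for illustration only.
XML_TEXT = """
<employees>
  <employee id="1"><name>Asha</name><dept>Sales</dept></employee>
  <employee id="2"><name>Ravi</name></employee>
</employees>
"""

def parse_employees(xml_text):
    """Return one dict per <employee>; fields missing in the XML become None."""
    root = ET.fromstring(xml_text)
    rows = []
    for emp in root.findall("employee"):
        rows.append({
            "id": emp.get("id"),            # attribute on the tag
            "name": emp.findtext("name"),   # text of a child element
            "dept": emp.findtext("dept"),   # None when the child is absent
        })
    return rows

records = parse_employees(XML_TEXT)
for row in records:
    print(row)
```

The second record has no `<dept>` element, which a fixed-format table would not allow without an explicit empty column. That flexibility is what the slides mean by structure that is present but not defined in advance.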

Analytic Processes and Tools

[Figure: classification workflow. Labels shown: preprocessing raw data; converting data into training data for the classifier; converting classifiable data into vectors; training a classifier.]
HDFS: The Hadoop Distributed File System enables the storage of large files by distributing the data among a pool of data nodes. A file created in HDFS appears to be a single file, even though HDFS splits it into blocks that are stored on individual data nodes.
ZooKeeper: ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.
HBase: HBase is derived from Google's Bigtable and is a column-oriented data layout that, when layered on top of Hadoop, provides a fault-tolerant method for storing and manipulating large data tables.
Hive: Hive is layered on top of the file system and execution framework for Hadoop and enables applications and users to query the stored data with a SQL-like language (HiveQL).
Pig: The Pig environment allows developers to create new user-defined functions.
Mahout: Mahout is a project to provide a library of scalable implementations of machine learning algorithms on top of MapReduce and Hadoop.
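Several of the tools above sit on top of MapReduce, so a toy version of that programming model may help. The sketch below runs in a single Python process and does not use any Hadoop API. The function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are hypothetical, and the input lines are made up. It only shows the three conceptual stages: map each record to key/value pairs, group the pairs by key, then reduce each group.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def word_count(lines):
    """Run map, shuffle and reduce in sequence (single process, no cluster)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on the nodes that hold the data blocks; here they run one after another purely to show the data flow.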
Analysis vs. Reporting
Reporting translates data into information, while analysis turns information into insights.
Reporting should enable users to ask "What?" questions about the information, whereas analysis should answer "Why?" and "What can we do about it?"
Five differences between reporting and analysis:
1. Purpose
2. Tasks
3. Outputs
4. Delivery
5. Value
Purpose: Before covering the differing roles of reporting and analysis, let's start with some high-level definitions of these two key areas of analytics.
Reporting: The process of organizing data into informational summaries in order to monitor how different areas of a business are performing.
Analysis: The process of exploring data and reports in order to extract meaningful insights, which can be used to better understand and improve business performance.
Outputs: There are three main types of reporting: canned reports, dashboards, and alerts.
Delivery: Reporting is more of a push model, where people can access reports through an analytics tool, Excel spreadsheet, or widget, or have them scheduled for delivery to their mailbox, mobile device, FTP site, etc.
Analysis is all about human beings using their superior reasoning and analytical skills to extract key insights from the data and form actionable recommendations for their organizations.
1. Purpose
Reporting has helped companies monitor their data since before digital technology boomed.
Organizations have long depended on the information it brings to their business, as reporting extracts that information and makes it easier to understand.
Analysis interprets data at a deeper level. While reporting can link data across channels, provide comparisons, and make information easier to understand (think of dashboards, charts, and graphs, which are reporting tools and not analysis reports), analysis interprets this information and provides recommendations on actions.
2. Tasks
Because only a very fine line divides reporting and analysis, it is easy to label a task as analysis when all it does is reporting. Hence, ensure that your analytics team has a healthy balance of both.
Here is a useful way to tell whether what you are doing is reporting or analysis:
Reporting includes building, configuring, consolidating, organizing, formatting, and summarizing. It covers the activities mentioned above, such as turning data into charts and graphs and linking data across multiple channels.
Analysis consists of questioning, examining, interpreting, comparing, and confirming. With big data, predicting is possible as well.
3. Outputs
Reporting and analysis have push and pull effects on their users through their outputs. Reporting has a push approach: it pushes information to users, and its outputs come in the form of canned reports, dashboards, and alerts.
Analysis has a pull approach, where a data analyst draws out information to probe further and to answer business questions. Its outputs can take the form of ad hoc responses and analysis presentations. Analysis presentations consist of insights, recommended actions, and a forecast of their impact on the company, all in language that is easy to understand at the level of the user who will be reading and deciding on it.
This matters if organizations are to truly realize the value of data, since a standard report is not the same as a meaningful analysis.
4. Delivery
Considering that reporting involves repetitive tasks, often with truckloads of data, automation has been a lifesaver, especially now with big data. It is not surprising that the first things outsourced are data entry services, since outsourcing companies are perceived as data reporting experts.
Analysis requires a more custom approach, with human minds doing superior reasoning and analytical thinking to extract insights, and technical skills to provide efficient steps towards accomplishing a specific goal.
This is why data analysts and scientists are in demand these days, as organizations depend on them to come up with recommendations that help leaders or business executives make decisions about their businesses.
5. Value
This isn't about identifying which one brings more value, but rather understanding that both are indispensable when looking at the big picture. Together they should help businesses grow, expand, move forward, and make more profit or increase their value.
The Path to Value diagram illustrates how data converts into value through reporting and analysis, such that neither achieves it without the other:
Data -> Reporting -> Analysis -> Decision making -> Action -> Value
Data alone is useless, and action without data is baseless. Both reporting and analysis are vital to bringing value to your data and operations.
Reference
https://blogs.adobe.com/digitalmarketing/analytics/reporting-vs-analysis-whats-the-difference/
http://www.infinitdatum.com/blog/5-differences-between-reporting-and-analysis/
Modern Data Analytic Tools
http://www.slideshare.net/GWOcon/great-wideopentalk
Statistical Concepts: Sampling Distributions
The sampling distribution is a distribution of a sample statistic. While the concept of a distribution of a set of numbers is intuitive for most students, the concept of a distribution of a set of statistics is not.
It is a model of a distribution of scores, like the population distribution, except that the scores are not raw scores, but statistics. It is a thought experiment: "What would the world be like if a person repeatedly took samples of size N from the population distribution and computed a particular statistic each time?" The resulting distribution of statistics is called the sampling distribution of that statistic.
For example, suppose that a sample of size sixteen (N=16) is taken from some population. The mean of the sixteen numbers is computed. Next a new sample of sixteen is taken, and the mean is again computed. If this process were repeated an infinite number of times, the distribution of the now infinite number of sample means would be called the sampling distribution of the mean.
Every statistic has a sampling distribution. For example, suppose that instead of the mean, medians were computed for each sample. The infinite number of medians would be called the sampling distribution of the median.
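The thought experiment above can be approximated on a computer by replacing "an infinite number of times" with a few thousand repetitions. The sketch below is one way to do that with the Python standard library. The population (10,000 roughly normal scores with mean 50 and standard deviation 10), the seeds and the helper name `sampling_distribution` are assumptions made for illustration. It also compares the spread of the simulated sample means with the textbook value sigma / sqrt(N) for independent draws.

```python
import random
import statistics

def sampling_distribution(population, n, statistic, repeats, seed=0):
    """Approximate the sampling distribution of `statistic` for samples of size n."""
    rng = random.Random(seed)
    # Each repeat draws a fresh sample of size n and records one statistic.
    return [statistic(rng.sample(population, n)) for _ in range(repeats)]

# Hypothetical population, generated only for this illustration.
pop_rng = random.Random(42)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

means = sampling_distribution(population, 16, statistics.mean, repeats=2000)
medians = sampling_distribution(population, 16, statistics.median, repeats=2000)

sigma = statistics.pstdev(population)
print("SD of sample means  :", round(statistics.stdev(means), 2))
print("sigma / sqrt(16)    :", round(sigma / 16 ** 0.5, 2))
print("SD of sample medians:", round(statistics.stdev(medians), 2))
```

The lists `means` and `medians` are finite stand-ins for the sampling distribution of the mean and of the median. Plotting either as a histogram shows its shape.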
Re-Sampling
In statistics, resampling is any of a variety of methods for doing one of the following:
Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping).
Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests).
Validating models by using random subsets (bootstrapping, cross-validation).
Common resampling techniques include bootstrapping, jackknifing and permutation tests.
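As an illustration of the first use listed above, the following sketch estimates the precision (standard error) of a statistic by bootstrapping, that is, by drawing randomly with replacement from the observed data points. The ten data values, the number of resamples and the helper name `bootstrap_se` are assumptions chosen for the example, not values from the course material.

```python
import random
import statistics

def bootstrap_se(data, statistic, n_resamples=2000, seed=0):
    """Bootstrap estimate of the standard error of `statistic` on `data`."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n points WITH replacement from the observed data.
        resample = [rng.choice(data) for _ in range(n)]
        estimates.append(statistic(resample))
    # The spread of the resampled statistics approximates the standard error.
    return statistics.stdev(estimates)

# Hypothetical measurements, invented for illustration.
data = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.8, 11.9, 9.5, 12.6]

se_mean = bootstrap_se(data, statistics.mean)
se_median = bootstrap_se(data, statistics.median)
print("bootstrap SE of the mean  :", round(se_mean, 3))
print("bootstrap SE of the median:", round(se_median, 3))
```

The same function works for any statistic that takes a list and returns a number, which is the practical appeal of the bootstrap for medians and percentiles where no simple formula is at hand.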
Statistical Inference
Statistical Inference, Model & Estimation
Recall that statistical inference aims at learning characteristics of the population from a sample; the population characteristics are parameters and the sample characteristics are statistics.
A statistical model is a representation of a complex phenomenon that generated the data.
It has mathematical formulations that describe relationships between random variables and parameters.
It makes assumptions about the random variables, and sometimes the parameters.
A general form: data = model + residuals
The model should explain most of the variation in the data.
Residuals are a representation of lack of fit, that is, of the portion of the data unexplained by the model.
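The decomposition data = model + residuals can be checked numerically. The sketch below assumes a straight-line model fitted by ordinary least squares, which is one possible choice of model and not one prescribed by the slides. The six data points and the helper name `fit_line` are invented for illustration.

```python
import statistics

def fit_line(xs, ys):
    """Least-squares slope and intercept for the model y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical observations, invented for illustration.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]

slope, intercept = fit_line(xs, ys)
fitted = [intercept + slope * x for x in xs]          # the "model" part
residuals = [y - f for y, f in zip(ys, fitted)]       # the unexplained part

# Share of the variation in the data that the model explains (R squared).
ss_total = sum((y - statistics.mean(ys)) ** 2 for y in ys)
ss_resid = sum(r ** 2 for r in residuals)
r_squared = 1 - ss_resid / ss_total

print("slope, intercept:", round(slope, 3), round(intercept, 3))
print("R squared       :", round(r_squared, 4))
```

Each observation equals its fitted value plus its residual, and a value of `r_squared` close to 1 corresponds to the statement that the model should explain most of the variation in the data.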
Estimation is the process of learning and determining the population parameter based on the model fitted to the data.
Point estimation, interval estimation, and hypothesis testing are three main ways of learning about the population parameter from the sample statistic.
An estimator is a particular example of a statistic, which becomes an estimate when the formula is replaced with actual observed sample values.
Point estimation = a single value that estimates the parameter. Point estimates are single values calculated from the sample.
Confidence intervals = a range of values for the parameter. Interval estimates are intervals within which the parameter is expected to fall, with a certain degree of confidence.
Hypothesis tests = tests for a specific value (or values) of the parameter.
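The three approaches can be shown side by side for a population mean. The sketch below relies on a normal approximation, using `statistics.NormalDist` from the Python standard library. That is an assumption which is reasonable for moderately large samples, while small samples would normally call for the t distribution. The sample, the hypothesized value 50 and the helper name `mean_inference` are made up for illustration.

```python
import math
import random
import statistics
from statistics import NormalDist

def mean_inference(sample, mu0, confidence=0.95):
    """Point estimate, confidence interval and two-sided p-value for a mean.

    Uses a normal (z) approximation; this is an assumption of the sketch.
    """
    n = len(sample)
    point = statistics.mean(sample)                   # point estimate
    se = statistics.stdev(sample) / math.sqrt(n)      # estimated standard error
    z_crit = NormalDist().inv_cdf(0.5 + confidence / 2)
    interval = (point - z_crit * se, point + z_crit * se)
    z = (point - mu0) / se                            # test statistic for H0: mean = mu0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return point, interval, p_value

# Hypothetical sample of 40 observations, generated for illustration.
rng = random.Random(7)
sample = [rng.gauss(52, 8) for _ in range(40)]

point, interval, p_value = mean_inference(sample, mu0=50)
print("point estimate:", round(point, 2))
print("95% interval  :", tuple(round(v, 2) for v in interval))
print("p-value (H0: mean = 50):", round(p_value, 4))
```

All three outputs are built from the same two ingredients, the sample statistic and its standard error, which is why the sampling distribution of the statistic is needed for inference.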
In order to perform these inferential tasks, i.e., to make inferences about the unknown population parameter from the sample statistic, we need to know the likely values of the sample statistic. What would happen if we sampled many times?
We need the sampling distribution of the statistic. It depends on the model assumptions about the population distribution, and/or on the sample size.
Standard error refers to the standard deviation of a sampling distribution.
Prediction Error
Prediction error is a discontinuity attribute that removes the predictable image components and reveals the unpredictable.
To use prediction error as a discontinuity attribute (the original goal and starting point of the cited project), one has to devise a prediction-error computation that predicts and removes the plane-wave volumes of sedimentary layers but that is incapable of predicting the discontinuities.
Ref: http://sep.stanford.edu/public/docs/sep99/cohy_Fig/paper_html/node38.html

End of Unit I
