Big Data Analytics by Seema Acharya PDF
BIG DATA AND ANALYTICS
Second Edition

Seema Acharya
Infosys Limited

Subhashini Chellappan

WILEY

Copyright © 2019 by Wiley India Pvt. Ltd., 4456/7, Ansari Road, Daryaganj, New Delhi-110002.

Cover Image: Getty Images

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording or scanning) without the written permission of the publisher.

Limits of Liability: While the publisher and the author have used their best efforts in preparing this book, Wiley and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this book, and specifically disclaim any implied warranties of merchantability or fitness for any particular purpose. There are no warranties which extend beyond the descriptions contained in this paragraph. No warranty may be created or extended by sales representatives or written sales materials. The accuracy and completeness of the information provided herein and the opinions stated herein are not guaranteed or warranted to produce any particular results, and the advice and strategies contained herein may not be suitable for every individual. Neither Wiley India nor the author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Disclaimer: The contents of this book have been checked for accuracy. Since deviations cannot be ruled out entirely, Wiley or its author cannot guarantee full agreement. As the book is intended for educational purposes, Wiley or its author shall not be responsible for any errors, omissions or damages arising out of the use of the information contained in the book. This publication is designed to provide accurate and authoritative information with regard to the subject covered. It is sold on the understanding that the Publisher is not engaged in rendering professional services.

Trademarks: All brand names and product names used in this book are trademarks, registered trademarks, or trade names of their respective holders. Wiley is not associated with any product or vendor mentioned in this book.

Other Wiley Editorial Offices:
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Wiley-VCH Verlag GmbH, Pappelallee 3, D-69469 Weinheim, Germany
John Wiley & Sons Australia Ltd, 42 McDougall Street, Milton, Queensland 4064, Australia
John Wiley & Sons (Asia) Pte Ltd, 1 Fusionopolis Walk, #07-01 Solaris South Tower, Singapore 138628
John Wiley & Sons Canada Ltd, 22 Worcester Road, Etobicoke, Ontario, Canada, M9W 1L1

First Edition: 2015
Second Edition: 2019
Reprint: 2022

ISBN: 978-81-265-7951-8
ISBN: 978-81-265-8836-7 (ebk)

www.wileyindia.com

Preface

The last few years have been witness to a burgeoning growth of data. We have heard it being called Big Data! So what really is this big data? Big data is an evolving term used to describe any voluminous amount of structured, semi-structured and unstructured data that has the potential to be mined for information. There is data everywhere: from the sensors that gather weather information, to the likes, posts and comments on social media sites, to the digital pictures, audios and videos that get circulated, to the conversations in a chat room, etc. All this and more is big data.

Need for this Book

We felt the need to compose a book, egged on by the enthusiasm and inquisitiveness of the students and instructors fraternity alike: a book which can take the readers through an easy comprehension of the big data technology landscape. Ours is an attempt to cover a plethora of technologies, from NoSQL databases such as MongoDB and Cassandra, to components of the Hadoop ecosystem such as MapReduce, Pig and Hive, to delving into analytics with association rule mining on one hand and decision trees on the other.

The Audience

This book is for all interested in learning about Big Data, Hadoop and Analytics. The only criterion is the willingness to learn and the ability to stretch yourself in learning to limits that you have not done before. The book is for all those who are new to big data, irrespective of the field/background that you come from. The book will be equally useful to an engineering graduate as it would be to a management graduate.
The book has been designed and crafted such that it caters to the knowledge requirements of an IT person as well as a business user with ease.

Organization of the Book

This book has a total of 14 chapters. Here is a sneak peek into the chapters of our book.

Chapters 1-4 of the book provide a basic understanding of the types of digital data, the characteristics of big data, the challenges confronting the enterprises embracing big data, the sudden hype around big data analytics, and the technologies that make up the big data landscape.

Chapter 5 introduces the open source software framework called Hadoop. We have attempted to introduce you to most of the major concepts and components to empower you to hold your own in any meaningful conversation on big data and analytics.

Chapters 6 and 7 introduce you to the world of NoSQL databases. We have chosen MongoDB, the document-oriented database, and Cassandra, the wide column store, to give you a feel of NoSQL databases. In explaining the NoSQL databases, we have built on the familiarity that the readers will have with RDBMS (Relational Database Management Systems).

Chapter 8 introduces you to the nitty-gritties of MapReduce programming. The merits and challenges have been dealt with for a clearer appreciation.

Chapters 9 and 10 cover two major components of the Hadoop ecosystem, namely "Pig" and "Hive".

Chapter 11 introduces you to an open source tool to draw out reports by pulling data from NoSQL databases.

Chapter 12 is focused on introducing you to the world of machine learning and analysis, with algorithms under both supervised and unsupervised learning categories.

Chapter 13 is focused on bringing out the differences between various Hadoop ecosystem components for an easy lookup and remembrance. It will be good to read this chapter sequentially for better absorption.
Starting with data warehouses versus data lakes, and HDFS versus the first non-batch component, HBase, it builds further to explain the differences between HDFS and RDBMS, then goes on to highlight the differences of MapReduce with Pig and Spark, and finally delves into the differences between Pig and Hive.

Chapter 14 discusses the big data trends in 2019 and beyond. The years ahead will see an increase in the adoption of open-source technologies. Hadoop is and will remain fundamental, although there will be increased usage of the in-memory Spark. The years ahead will also awake to the container(ed) revolution. The last half a decade has been a witness to the commoditization of visualization. The rising wave of IoT (Internet of Things) will lead to processing being done on the edge of the network before moving it to the central data center in the cloud. The world will witness the power of empowered computing: edge and quantum. It is time to utilize and draw value/insight from the abundant dark data. Also, bots will mature and get smarter in the coming years.

Glossary: A glossary of terms frequently used in the big data and analytics parlance is given at the end of the book. Although we strive to define terms as we introduce them in this book, we think you will find the glossary a useful resource.

To Get the Most out of this Book

We have included sections such as "POINT ME", "CONNECT ME" and "TEST ME" to enable you to further your learning and comprehension. The section "POINT ME" provides a list of books that you as a reader should check out to further your learning. The section "CONNECT ME" provides a list of reference links which will feed you with good content on the topics covered in the chapter. The section "TEST ME" has a gamut of self-assessments such as "Crossword" puzzles, "Fill in the blanks", "Match the columns", etc. We have provided solved and unsolved exercises to better your learning.
There are HANDS-ON ASSIGNMENTS provided with MongoDB, Cassandra, MapReduce, Pig, Hive and JasperReports. We sincerely urge you to attempt these to gain good hands-on practice on these major technologies.

Next Steps...

We have endeavored to create an overview of big data and introduced you to all its significant components. We recommend you to read the book from cover to cover, but if you are not that kind of person, we have made an attempt to keep the chapters self-contained so that you can go straight to the topics that interest you most. Whichever approach you may choose, we wish you well!

Available with the Book (www.wileyindia.com)

We have put together an installation guide to help our learners with easy steps to install and configure a Hadoop cluster. The steps to setting up the components of the Hadoop ecosystem such as MapReduce, Pig and Hive have also been explained in easy, DIY (Do It Yourself) steps. We have provided a Microsoft Access database (.accdb) and a text file on which we have based an assignment that, when attempted and solved, will surely challenge and satiate you.

A Quick Word for the Instructors' Fraternity

Attention has been paid in arriving at the sequence of chapters and also to the flow of topics within each chapter. This is done particularly with an objective to assist our fellow instructors and academicians in carving out a syllabus from the Table of Contents (TOC) of the book. The complete TOC can qualify as the syllabus for a semester, or, if the college has an existing syllabus on big data and analytics, a few chapters can be added to the syllabus to make it more robust. We leave it to your discretion on how you wish to use the same for your students.
We have ensured that each tool/component discussed in the book comes with adequate hands-on content to enable you to teach better and provide ample hands-on practice to your students. The easy-to-follow installation guide provided on the website should help you set up the lab environment for practice. We have also provided Instructor Resources (IR) that can be procured directly from our publisher, Wiley India, by visiting their website or writing to
[email protected]
. These Instructor Resources are presentation decks (one for each chapter) which can be taken to the class directly or can be customized as per your requirements.

Connect with Authors

To stay connected with the students and instructors fraternity, we run a group on LinkedIn titled "Exploring big data and analytics". Join us to discuss, share and learn.

Happy Learning!!!

Seema Acharya
Subhashini Chellappan

Acknowledgements

The making of the book was like a journey that we had undertaken for several months. We had our families, friends, colleagues, and well-wishers onboard this journey, and we wish to express our heartfelt gratitude to each one of them. Without their unflinching support and affection, we could not have pulled it off.

We are grateful to the student and teacher community who kept us on our toes with their constant bombardment of queries, which prompted us to learn more, simplify our learnings and findings, and place them neatly in the book. This book is for them.

We wish to thank our friends, the practitioners from the field, for filling us in on the latest in the big data field and sharing with us valuable insights on the best practices and methodologies followed therein. A special thanks to RN Prasad for his encouragement.

We have been fortunate to have the support of our teams, who sometimes knowingly and at other times unknowingly contributed to the making of the book by lending us their unwavering support.

We consider ourselves very fortunate for the editorial assistance provided by Wiley India. We wish to acknowledge and appreciate Meenakshi Sehrawat, Associate Publisher, and her team of associates who adeptly guided us through the entire process of preparation and publication. Appreciation is also due to Rakesh Poddar and his team for working with us through the entire production process.
And finally, we can never sufficiently thank our families and friends who have been our pillars of strength, our stimulus, and our soundboard all through the process, and who endured patiently our crazy schedules as we assembled the book.

Author Profile

Seema Acharya

Seema Acharya is a Senior Lead Principal with the Education, Training and Assessment department of Infosys Limited. She is a technology evangelist, a learning strategist, and an author with over 15 years of IT experience in learning/education services. She has designed and delivered several large-scale competency development programs across the globe, involving organizational competency need analysis, conceptualization, design, development and deployment of competency development programs. She is an educator by choice and vocation, and has rich experience in both academia and the software industry. She is also the author of the following books:

1. "Fundamentals of Business Analytics", ISBN: 978-81-265-3203-2, publisher: Wiley India.
2. "Pro Tableau: A Step by Step Guide", ISBN: 978-1484223512, publisher: Apress.
3. "Data Analytics using R", ISBN: 9789352605248, publisher: McGraw Hill Higher Education (2018).

She has co-authored a paper on "Collaborative Engineering Competency Development" for ASEE (American Society for Engineering Education). She holds the patent on "Method and System for Automatically Generating Questions for a Programming Language".
Her areas of interest and expertise are centered on Business Intelligence, Big Data and Analytics technologies such as Data Warehousing, Data Mining, Data Analytics, Text Mining and Data Visualization. She is passionate about exploring new paradigms of learning and also dabbles into creating e-learning content to facilitate learning anytime and anywhere.

Subhashini Chellappan

Subhashini Chellappan has rich experience in both academia and the software industry. She has published a couple of papers in various journals and conferences. Her areas of interest and expertise are centered on Business Intelligence, Big Data and Analytics technologies such as Hadoop, NoSQL Databases, Spark and Machine Learning.

Contents

Preface
Acknowledgements
Author Profile

Chapter 1: Types of Digital Data
1.1 Classification of Digital Data

Chapter 2: Introduction to Big Data
2.1 Characteristics of Data
2.2 Evolution of Big Data
2.3 Definition of Big Data
2.4 Challenges with Big Data
2.5 What is Big Data?
2.6 Other Characteristics of Data Which are not Definitional Traits of Big Data
2.7 Why Big Data?
2.8 Are We Just an Information Consumer or Do We also Produce Information?
2.9 Traditional Business Intelligence (BI) versus Big Data
2.10 A Typical Data Warehouse Environment
2.11 A Typical Hadoop Environment
2.12 What is New Today?
2.13 What is Changing in the Realms of Big Data?

Chapter 3: Big Data Analytics
3.1 Where do we Begin?
3.2 What is Big Data Analytics?
3.3 What Big Data Analytics Isn't
3.4 Why this Sudden Hype Around Big Data Analytics?
3.5 Classification of Analytics
3.6 Greatest Challenges that Prevent Businesses from Capitalizing on Big Data
3.7 Top Challenges Facing Big Data
3.8 Why is Big Data Analytics Important?
3.9 What Kind of Technologies are we Looking Toward to Help Meet the Challenges Posed by Big Data?
3.10 Data Science
3.11 Data Scientist...Your New Best Friend
3.12 Terminologies Used in Big Data Environments
3.13 Basically Available Soft State Eventual Consistency (BASE)
3.14 Few Top Analytics Tools

Chapter 4: The Big Data Technology Landscape
4.1 NoSQL (Not Only SQL)
4.2 Hadoop

Chapter 5: Introduction to Hadoop
5.1 Introducing Hadoop
5.2 Why Hadoop?
5.3 Why not RDBMS?
5.4 RDBMS versus Hadoop
5.5 Distributed Computing Challenges
5.6 History of Hadoop
5.7 Hadoop Overview
5.8 Use Case of Hadoop
5.9 Hadoop Distributors
5.10 HDFS (Hadoop Distributed File System)
5.11 Processing Data with Hadoop
5.12 Managing Resources and Applications with Hadoop YARN (Yet Another Resource Negotiator)
5.13 Interacting with Hadoop Ecosystem

Chapter 6: Introduction to MongoDB
6.1 What is MongoDB?
6.2 Why MongoDB?
6.3 Terms Used in RDBMS and MongoDB
6.4 Data Types in MongoDB
6.5 MongoDB Query Language

Chapter 7: Introduction to Cassandra
7.1 Apache Cassandra: An Introduction
7.2 Features of Cassandra
7.3 CQL Data Types
7.4 CQLSH
7.5 Keyspaces
7.6 CRUD (Create, Read, Update, and Delete) Operations
7.7 Collections
7.8 Using a Counter
7.9 Time to Live (TTL)
7.10 Alter Commands
7.11 Import and Export
7.12 Querying System Tables
7.13 Practice Examples

Chapter 8: Introduction to MapReduce Programming
8.1 Introduction
8.2 Mapper
8.3 Reducer
8.4 Combiner
8.5 Partitioner
8.6 Searching
8.7 Sorting
8.8 Compression

Chapter 9: Introduction to Hive
9.1 What is Hive?
9.2 Hive Architecture
9.3 Hive Data Types
9.4 Hive File Format
9.5 Hive Query Language (HQL)
9.6 RCFile Implementation
9.7 SerDe
9.8 User-Defined Function (UDF)

Chapter 10: Introduction to Pig
10.1 What is Pig?
10.2 The Anatomy of Pig
10.3 Pig on Hadoop
10.4 Pig Philosophy
10.5 Use Case for Pig: ETL Processing
10.6 Pig Latin Overview
10.7 Data Types in Pig
10.8 Running Pig
10.9 Execution Modes of Pig
10.10 HDFS Commands
10.11 Relational Operators
10.12 Eval Function
10.13 Complex Data Types
10.14 Piggy Bank
10.15 User-Defined Functions (UDF)
10.16 Parameter Substitution
10.17 Diagnostic Operator
10.18 Word Count Example using Pig
10.19 When to use Pig?
10.20 When not to use Pig?
10.21 Pig at Yahoo!
10.22 Pig versus Hive

Chapter 11: JasperReport using Jaspersoft
11.1 Introduction to JasperReports
11.2 Connecting to MongoDB NoSQL Database
11.3 Connecting to Cassandra NoSQL Database

Chapter 12: Introduction to Machine Learning
12.1 Introduction to Machine Learning
12.2 Machine Learning Algorithms

Chapter 13: Few Interesting Differences
13.1 Difference between Data Warehouse and Data Lake
13.2 Difference between RDBMS and HDFS
13.3 Difference between HDFS and HBase
13.4 Hadoop MapReduce versus Pig
13.5 Difference between Hadoop MapReduce and Spark
13.6 Difference between Pig and Hive

Chapter 14: Big Data Trends in 2019 and Beyond
14.1 Rise of the New Age "Data Curators"
14.2 CDOs are Stepping Up
14.3 Dark Data in the Cloud
14.4 Streaming the IoT for Machine Learning
14.5 Edge Computing
14.6 Open Source
14.7 Hadoop is Fundamental and will Remain So
14.8 Chatbots will Get Smarter
14.9 Container(ed) Revolution
14.10 Commoditization of Visualization

Glossary
Index

(Each chapter also includes "What's in Store?", "Remind Me", "Point Me", "Connect Me" and "Test Me" sections, along with assignments for hands-on practice.)

Chapter 1: Types of Digital Data

"In God we trust, all others must bring data."
- W. Edwards Deming

WHAT'S IN STORE?

Irrespective of the size of the enterprise (big or small), data continues to be a precious and irreplaceable asset. Data is present internal to the enterprise and also exists outside the four walls and firewalls of the enterprise. Data is present in homogeneous sources as well as in heterogeneous sources. The need of the hour is to understand, manage, process, and take the data for analysis to draw valuable insights.

Data → Information → Insights

This chapter is a "must read" for first-time learners interested in understanding the role of data in business intelligence and business analysis, and in businesses at large.
This chapter will introduce you to the various for- sass of digital daa (sruccured, semi-structured, and unstructured data), the sources of each format of data, ‘Se isues with the terminology of unstructured data, eteae Big Data and Analytics We suggest you refer to che learning resources suggested at the end of this chapter and also attempt ll the ‘exercises to get a grip on this topic. We suggest you make your own notes/bookmarks while reading through the chapter. 1.1 CLASSIFICATION OF DIGITAL DATA “Asdepicted in Figure 1.1, digital daa can be broadly classified into structured, semi-structured, and unstruc- cured data, 1. Unstructured data: This is the data which does not conform to a data model ors notin a form which can be used easily by a computer program. About 80-90% data of an organization isin this format; for example, memos, chat rooms, PowerPoint presentations, images, videos, letters, researches, white papers, body of an email, etc. 2, Semi-structured data: This is the data which does not conform co a data model but has some struc- ture, However, itis not in a form which cam be used easly by a computer program; for example, emails, XML, markup languages like HTML, etc. Metadata for this data is available but is not sufficient. 13. Structured data: This is the data which is in an organized form (eg, in rows and columns) and can bbe exsily used by a computer program. Relationships exis beeween entities of data such as classes and their objects. Data stored in databases is an example of structured data. Ever since the 1980s most ofthe enterprise data has been scored in relational databases complece with rows! records/tuples, columns/atributes/ficlds, primary keys, foreign keys, etc. Over a period of time Relational Database Management System (RDBMS) marured and the RDBMS, as chey ate available today, have become more robust, cost-effective, and efficient. 
We have grown comfortable working with RDBMS — che storage, retrieval, and management of data has been immensely simplified. The data held in RDBMS is typically structured data. However, with the Internet connecting the worl, data that existed beyond one’s ‘enterprise started to become an integral part of daily transactions. This data grew by leaps and bounds so ‘much so that it became difficult forthe enterprises to ignore i. All of this dara was not structured. A lo of ic was unstructured. Infact, Gartner estimates that almost 80% of data generated in any enterprise today is unstructured date. Roughly around 10% of daca is in the structured and semi-structured category. Refet Figure 1.2 1.1.1. Structured Data Let us begin wich a very basic question ~ When do we say that the data is structured? The simple answer is ‘when data conforms to a pre-defined schemafstructure we say it is structured data. Ser Figure 1.1 Classification of digital data.‘Types of Digital Date 3 = Structures data = Semistructured data = Unstructured data Figure 1.2 Approximate percentage distribution of digital data. ‘Think structured data, and think data model ~a model of the types of business data that we intend to store, proces, and access. Let us discuss this in the context of an RDBMS. Most ofthe structured data is held in RDBMS. An RDBMS conforms to the relational data model wherein che data is stored in rows/columas Refer Table 1.1 ‘The number of rows/records/tuples in a relation is called the cardinality ofa relaron and the number of columns is referted to as the degree ofa relation. “The first step is the design of relation/table, the felds/columns to store the data, the type of data that will be scored [number (integer or real, alphabets, date, Boolean, etc}. 
Next we think of the constraints that we ‘would like our data to conform to (constraints such as UNIQUE values in the column, NOT NULL values s= she column, a business constraint such as the value held inthe column should not drop below 50, the set ‘ef permissible values in the column such asthe column should accept only “CS”, “IS”, "MS", etc. as input). “To explain further, let us design a able/relation structure o store the details ofthe employees of an enter- ese. Table 1.2 shows the struccure/schema of an “Employee” table in a RDBMS such as Oracle. “Table 1.2is an example ofa good structured table (complete with table name, meaningful column names wich data types, data length, and the relevane constraints) with absolute adherence to relational data model. Table 1.1 A relation/table with rows and columns Column 4 “Column 2 “column 3 —“totumn' 6 few Table 1.2 Schema of an “Employee” table in 2 RDBMS such as Oracle wc eee gee : 9 Varchar(10) PRIMARY KEY Exohame Varchar(50) Designation Varchar(25) NOT NULL BestNo Varchar(5) Contacto Varchar(10) NoT NULLae Big Data and Analyte Table 1.3 Sample records in the “Employee” table ContactNo EmpNo “‘Empwiame "Designation ‘éptNe E101 allen Software Engineer D1 0999999999 e102 simon Consultant overt 1k goes withour saying that each record in the table will have exactly the same structure. Let us take a look ata few records in Table 1.3. ‘The tables in an RDBMS can also be related, For example, the above “Employee” table is related to the “Department” rable on the basis of the common column, “DeptNo”. Ie is not mandatory forthe two tables that are related to have exactly the same name for the common column. On the contrary, the two tables are related on the basis of values held within the column, “DeptNo”. Given in Figure 1.3 is a depiction of ref. «erential integrity constraine (primary — foreign key) with the “Department” table being the referenced table and “Employee” table being the referencing table. 
1.1.1.1 Sources of Structured Data

If your data is highly structured, one can look at leveraging any of the available RDBMSs [Oracle Corp. - Oracle, IBM - DB2, Microsoft - Microsoft SQL Server, EMC - Greenplum, Teradata - Teradata, MySQL (open source), PostgreSQL (advanced open source), etc.] to house it. Refer Figure 1.4. These databases are typically used to hold the transaction/operational data generated and collected by day-to-day business activities. In other words, the data of On-Line Transaction Processing (OLTP) systems is generally quite structured.

Figure 1.3 Relationship between "Employee" and "Department" tables (Department: DeptNo, DeptName, DeptLocation, DeptEmpStrength; Employee: EmpNo, EmpName, EmpDesignation, DeptNo, EmpContactNo).

Figure 1.4 Sources of structured data.

1.1.1.2 Ease of Working with Structured Data

Structured data provides the ease of working with it. Refer Figure 1.5. The ease is with respect to the following:

1. Insert/update/delete: The Data Manipulation Language (DML) operations provide the required ease with data input, storage, access, processing, analysis, etc.
2. Security: How does one ensure the security of information? There are staunch encryption and tokenization solutions available to warrant the security of information throughout its lifecycle. Organizations are able to retain control and maintain compliance adherence by ensuring that only authorized individuals are able to decrypt and view sensitive information.
3. Indexing: An index is a data structure that speeds up the data retrieval operations (primarily the SELECT DML statement) at the cost of additional writes and storage space, but the benefits that ensue in search operations are worth the additional writes and storage space.
4. Scalability: The storage and processing capabilities of the traditional RDBMS can be easily scaled up by increasing the horsepower of the database server (increasing the primary and secondary or peripheral storage capacity, the processing capacity of the processor, etc.).
5. Transaction processing: RDBMS has support for the Atomicity, Consistency, Isolation, and Durability (ACID) properties of transactions. Given next is a quick explanation of the ACID properties:
   • Atomicity: A transaction is atomic; either it happens in its entirety or none of it happens at all.
   • Consistency: The database moves from one consistent state to another consistent state. In other words, if the same piece of information is stored at two or more places, they are in complete agreement.
   • Isolation: The resource allocation to the transaction happens in such a way that the transaction gets the impression that it is the only transaction happening, in isolation.
   • Durability: All changes made to the database during a transaction are permanent, and that accounts for the durability of the transaction.

Figure 1.5 Ease of working with structured data.

1.1.2 Semi-Structured Data

Semi-structured data is also referred to as having a self-describing structure. Refer Figure 1.6. It has the following characteristics:

Figure 1.6 Characteristics of semi-structured data.

1. It does not conform to the data models that one typically associates with relational databases or any other form of data tables.
2. It uses tags to segregate semantic elements.
3. Tags are also used to enforce hierarchies of records and fields within data.
4. There is no separation between the data and the schema. The amount of structure used is dictated by the purpose at hand.
5. In semi-structured data, entities belonging to the same class and grouped together need not necessarily have the same set of attributes.
And if at all they have the same set of attributes, the order of the attributes may not be the same; for all practical purposes, it is not important either.

1.1.2.1 Sources of Semi-Structured Data

Amongst the sources of semi-structured data, the front runners are "XML" and "JSON", as depicted in Figure 1.7.

1. XML: eXtensible Markup Language (XML) was hugely popularized by web services developed utilizing the Simple Object Access Protocol (SOAP) principles.
2. JSON: JavaScript Object Notation (JSON) is used to transmit data between a server and a web application. JSON was popularized by web services developed utilizing Representational State Transfer (REST), an architectural style for creating scalable web services. MongoDB (an open-source, distributed, NoSQL, document-oriented database) and Couchbase (originally known as Membase; an open-source, distributed, NoSQL, document-oriented database) store data natively in JSON format.

An example of HTML is as follows:
Figure 1.7 Sources of semi-structured data.
<a href="...">Link Name</a>
<H2>This is a sub Header</H2>
Send me mail at <a href="mailto:support@yourcompany.com">support@yourcompany.com</a>.
<P>This is a new paragraph!</P>
<B><I>This is a new sentence without a paragraph break, in bold italics.</I></B>
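Notice that the tags themselves carry the structure; no schema is declared anywhere. That is what lets a generic parser walk a semi-structured document such as the fragment above. The following is a minimal sketch (not from the book) using Python's built-in html.parser module, with the sample string abridged from the fragment above:

```python
from html.parser import HTMLParser

# Collects the tag events it encounters; the hierarchy emerges from the
# markup itself rather than from any pre-defined schema.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():                  # skip whitespace-only runs
            self.events.append(("text", data.strip()))

sample = ('<H2>This is a sub Header</H2>'
          'Send me mail at <a href="mailto:support@yourcompany.com">'
          'support@yourcompany.com</a>.')

parser = TagCollector()
parser.feed(sample)
for event in parser.events:
    print(event)
```

Note that html.parser reports tag names in lower case ("h2", "a"); the printed event stream interleaves start tags, end tags, and the text they delimit.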
A sample JSON document:

{
    "id": 1,
    "BookTitle": "Fundamentals of Business Analytics",
    "AuthorName": "Seema Acharya",
    "Publisher": "Wiley India",
    "YearofPublication": "2011"
}

1.1.3 Unstructured Data

Unstructured data does not conform to any pre-defined data model. In fact, to explain things a little more, let us take a closer look at the various kinds of text available and the possible structure associated with them. As can be seen from the examples quoted in Table 1.4, the structure is quite unpredictable. In Figure 1.8 we look at the other sources of unstructured data.

Table 1.4 A few examples of disparate unstructured data

Twitter message: Feeling miffed. Victim of twishing. LOL. C ya.
Web log: 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I; Nav)"
Email: Hey Joan, possible to send across the first cut on the Hadoop chapter by Friday EOD, or maybe we can meet up over a cup of coffee. Best regards, Tom

Figure 1.8 Sources of unstructured data.

Figure 1.9 Issues with the terminology of unstructured data.

1.1.3.1 Issues with "Unstructured" Data

Although unstructured data is known NOT to conform to a pre-defined data model or be organized in a pre-defined manner, there are incidents wherein the structure of the data (placed in the unstructured category) can still be implied. As mentioned in Figure 1.9, there could be a few other reasons behind placing data in the unstructured category despite it having some structure or even being highly structured.

There are situations where people argue that a text file should be in the category of semi-structured data and not unstructured data. Let us look at where they are coming from. Well, the text file does have a name, and one can easily look at its properties to get information such as the owner of the file, the date on which the file was created, the size of the file, etc.
Okay, we do have a little metadata. But when it comes to analysis, we are more concerned with the content of the text file rather than its name or any of the other properties. In fact, the other properties may not in any way contribute to the processing/analysis task at hand. Therefore, it is fair to place it in the unstructured data category.

1.1.3.2 How to Deal with Unstructured Data?

Today, unstructured data constitutes approximately 80% of the data being generated in any enterprise. The balance is clearly shifting in favor of unstructured data, as shown in Figure 1.10. It is such a big percentage that it cannot be ignored. Figure 1.11 states a few ways of dealing with unstructured data.

Figure 1.10 Unstructured data clearly constitutes a major percentage of enterprise data.

Figure 1.11 Dealing with unstructured data.

The following techniques are used to find patterns in or interpret unstructured data:

1. Data mining: First, we deal with large data sets. Second, we use methods at the intersection of artificial intelligence, machine learning, statistics, and database systems to unearth consistent patterns in large data sets and/or systematic relationships between variables. It is the analysis step of the "knowledge discovery in databases" process. A few popular data mining algorithms are as follows:
   • Association rule mining: It is also called "market basket analysis" or "affinity analysis". It is used to determine "what goes with what?": when you buy a product, what is the other product that you are likely to purchase with it? For example, if you pick up bread from the grocery, are you likely to pick eggs or cheese to go with it?
   • Regression analysis: It helps to predict the relationship between variables. The variable whose value needs to be predicted is called the dependent variable, and the variables which are used to predict the value are referred to as the independent variables. For example, you are interested in purchasing real estate and have been looking at a few good sites. You have come to the conclusion that the cost of the real estate depends on the location (outskirts or prime locale), the amenities provided by the builder (joggers' track, senior citizen zone, gymnasium, swimming pool, etc.), the built-up area, etc. The cost of the real estate is the dependent variable, and the location, amenities, built-up area, etc. are called the independent variables.
   • Collaborative filtering: It is about predicting a user's preference or preferences based on the preferences of a group of users. For example, take a look at Table 1.5.

Table 1.5 Sample records depicting learners' preferences for modes of learning

User   | Learning using Audios | Learning using Videos | Textual Learner
User 1 | Yes                   | Yes                   | No
User 2 | Yes                   | Yes                   | Yes
User 3 | Yes                   | Yes                   | No
User 4 | Yes                   | ?                     | ?

We are looking at predicting whether User 4 will prefer to learn using videos or is a textual learner, depending on one or a couple of his or her known preferences. We analyze the preferences of similar user profiles and, on the basis of it, predict that User 4 will also like to learn using videos and is not a textual learner.

2. Text analytics or text mining: Compared to the structured data stored in relational databases, text is largely unstructured, amorphous, and difficult to deal with algorithmically. Text mining is the process of gleaning high-quality and meaningful information (through devising of patterns and trends by means of statistical pattern learning) from text. It includes tasks such as text categorization, text clustering, sentiment analysis, concept/entity extraction, etc.

3. Natural language processing (NLP): It is related to the area of human-computer interaction. It is about enabling computers to understand human or natural language input.
4. Noisy text analytics: It is the process of extracting structured or semi-structured information from noisy unstructured data such as chats, blogs, wikis, emails, message boards, text messages, etc. The noisy unstructured data usually comprises one or more of the following: spelling mistakes, abbreviations, acronyms, non-standard words, missing punctuation, missing letter case, filler words such as "uh", "um", etc.

5. Manual tagging with metadata: This is about tagging manually with adequate metadata to provide the requisite semantics to understand unstructured data.

6. Part-of-speech tagging: It is also called POS or POST or grammatical tagging. It is the process of reading text and tagging each word in a sentence as belonging to a particular part of speech such as "noun", "verb", "adjective", etc.

7. Unstructured Information Management Architecture (UIMA): It is an open-source platform from IBM. It is used for real-time content analytics. It is about processing text and other unstructured data to find latent meaning and relevant relationships buried therein. Read up more on UIMA at the link: http://www.ibm.com/developerworks/data/downloads/uima/

REMIND ME

• Structured data: It conforms to a data model. For example, RDBMS conforms to the relational data model. It has a pre-defined schema.
• Semi-structured data: For this format of data, a little metadata is available, but it is insufficient. Semi-structured data has a self-describing structure. There is little or no separation between data and schema.
• Unstructured data: This data is growing by the day, and growing by leaps and bounds. It has innumerable sources, such as human-generated data (social media data, emails, word documents, presentations, audio and video files that we create and share every day, etc.) and machine-generated data (sensors, web server logs, call data records, etc.).

POINT ME (BOOK)

• Chapter 2: Types of Digital Data, "Fundamentals of Business Analytics", Wiley India; Authors: R.N.
Prasad and Seema Acharya, 2011.

CONNECT ME (INTERNET RESOURCES)

• http://data-magnum.com/the-big-deal-about-big-data-whats-inside-structured-unstructured-and-semi-structured-data/
• http://www.webopedia.com/TERM/S/structured_data.html
• http://en.wikipedia.org/wiki/UIMA
• Matching unstructured data and structured data by Bill Inmon: http://www.tdan.com/view-515009
• Semi-structured data analytics: Relational or Hadoop platform? (IBM): http://www.ibmbigdatahub.com/blog/semi-structured-data-analytics-relational-or-hadoop-platform-part-1

TEST ME

A. Place Me in the Basket

The following words are to be placed in the relevant basket (Structured or Unstructured): Email, Relations/Tables, MS Access, Facebook, Images, Videos, Database, MS Excel, Chat conversations, XML.

Answer:
Structured: MS Access, XML, Database, Relations/Tables, MS Excel
Unstructured: Email, Images, Chat conversations, Facebook, Videos

B. Match the Following

Column A: NLP; Text analytics; UIMA; Noisy unstructured data; Data mining
Column B: Comprehend human or natural language input; Text mining; Content analytics; IBM; Text messages; Chats; Uses methods at the intersection of statistics, AI, machine learning & DBs

Answer:
NLP - Comprehend human or natural language input
Text analytics - Text mining
UIMA - Content analytics; IBM
Noisy unstructured data - Text messages; Chats
Data mining - Uses methods at the intersection of statistics, AI, machine learning & DBs

Column A: SOAP; REST; MongoDB; Couchbase; Flexible structure
Column B: XML; JSON

Answer:
SOAP - XML
REST - JSON
MongoDB - JSON
Couchbase - JSON
Flexible structure - XML

C. Solve Me

You are a senior faculty at a premier engineering institute of the city.
The Head of the Department has asked you to take a look at the institute's learning website and make a list of the unstructured data that is generated on the website, which can then be stored and analyzed to improve the website to facilitate and enhance the students' learning. You log into the institute's learning website and observe the following features on it:

• Presentation decks (.pdf files)
• Laboratory manual (.doc files)
• Discussion forum
• Students' blog
• Link to Wikipedia
• A survey questionnaire for the students
• Students' performance sheet downloadable into an .xls sheet
• Students' performance sheet downloadable into a .txt file
• Audio/video learning files (.wav files)
• An .xls sheet having a compiled list of FAQs

From this list, you select the following as sources of unstructured data:
1. ________
2. ________
3. ________
4. ________

You have just finished making your list when your colleague comes in looking for you. Both of you decide to go to a cafeteria in the vicinity of the institute's campus. You have forever liked this cafeteria, and you have reasons for the same. There are a couple of machines in the cafeteria's reception area that the customers can use to feed in their orders from a selection of menu items. Once the order is done, you are given a token number. Once your order is ready for serving, the display flashes your token number. It goes without saying that the billing is also automated. You, being in the IT department, cannot refrain from thinking about the data that gets collected by these automatic applications. Here's your list:
________

You are thinking of the analysis that you can perform on this data. Here's your list:
________

D. Solved Exercises

1. Why is an email placed in the "unstructured" category?
Answer: Let us take a look at what we can place in the body of an email. We can have any or more of the following:
• Hyperlinks
• PDF/DOC/XLS/etc. attachments
• Emoticons
• Images
• Audio/video attachments
• Free-flowing text, etc.
The above are the reasons behind placing email in the "unstructured" category.

2. What category will you place CCTV footage into?
Answer: Unstructured.

3. You have just got a book issued from the library. What are the details about the book that can be placed in an RDBMS table?
Answer:
• Title of the book
• Author of the book
• Publisher of the book
• Year of publication
• No. of pages in the book
• Type of book, such as whether hardbound or paperback
• Price of the book
• ISBN No. of the book
• Attachments, such as "with CD" or "without CD", etc.

4. Which category would you place consumer complaints and feedback into?
Answer: Unstructured data.

E. Unsolved Exercises

1. Which category (structured, semi-structured, or unstructured) will you place a web page in?
2. What, according to you, are the challenges with unstructured data?
3. Which category (structured, semi-structured, or unstructured) will you place a PowerPoint presentation in?
4. Which category (structured, semi-structured, or unstructured) will you place a Word document in?
5. State a few examples of human-generated and machine-generated data.

SCENARIO-BASED QUESTION

We are at the university library. You see a few students browsing through the library catalog on a kiosk. You observe the librarians busy at work issuing and returning books. You see a few students fill up the feedback form on the services offered by the library. Quite a few students are learning using the e-learning content. Think for a while on the different types of data that are being generated in this scenario. Support your answer with logic.

"Data is the new science. Big Data holds the answers."
Pat Gelsinger, the Chief Executive Officer of VMware, Inc. and former Chief Operating Officer of EMC Corporation

WHAT'S IN STORE?

This chapter focuses on defining and explaining big data. The "Internet of Things" and its widely device-connected nature are leading to a burgeoning rise in big data. There is no dearth of data for today's enterprise. On the contrary, enterprises are mired in data, and quite deep at that. That brings us to the following questions:

1. Why is it that we cannot forego big data?
2. How has it come to assume such magnanimous importance in running business?
3. How does it compare with the traditional Business Intelligence (BI) environment?
4. Is it here to replace the traditional relational database management system and data warehouse environment, or is it likely to complement their existence?

"Data is widely available. What is scarce is the ability to extract wisdom from it."
Hal Varian, Google's Chief Economist, 2010

You recently availed the opportunity to attend a virtual classroom session from a leading training institute. You are reflecting back on the experience. Since the session was on big data, it gets you thinking on the types and volume of data that were created before, during, and after the session. It all began with you registering online a week ago for the "Big Data" course. You remember having received an acknowledgment confirming your registration. They had also stated that they would send across some reading content two days prior to the session. And true to their word, they did. When you logged into the session, you saw that there were 493 other participants. The presenter was introducing the process for smooth learning through the session. During the session, the participants could converse with the presenter as well as with other participants using the chat facility. They had also activated a discussion forum for participants to share their learnings/views/opinions/experiences, etc. There were assignments, which would have to be attempted and submitted on their site. There was an assessment towards the end of the session that was graded. There was a feedback form made available at the end of the session to hear back from the participants.

2.1 CHARACTERISTICS OF DATA
They also provided additional reading content in the form of references to white papers/research papers. The lecture was recorded and made available for better learning and comprehension by the participants. It was a good experience, and you are already thinking of being part of another such experience very soon.

There is no dearth of such virtual classroom sessions being conducted today. There is a huge learning community out there, eager to learn. Just think of the volume of data that gets generated, and the variety (the list of attendees, their scores and grades, their chat conversations, their assignments, the polling questions put forth by the instructor to gauge the level of understanding and participation from the learners, etc.) of data that we produce as well as consume as we become part of these virtual training sessions.

Let us start with the characteristics of data. As depicted in Figure 2.1, data has three key characteristics:

1. Composition: The composition of data deals with the structure of data, that is, the sources of data, the granularity, the types, and the nature of data as to whether it is static or real-time streaming.
2. Condition: The condition of data deals with the state of data, that is, "Can one use this data as is for analysis?" or "Does it require cleansing for further enhancement and enrichment?"
3. Context: The context of data deals with "Where has this data been generated?", "Why was this data generated?", "How sensitive is this data?", "What are the events associated with this data?", and so on.

Small data (data as it existed prior to the big data revolution) is about certainty. It is about fairly known data sources; it is about no major changes to the composition or context of data. Most often we have answers to queries like why this data was generated, where and when it was generated, exactly how we would like to use it, what questions this data will be able to answer, and so on.
Big data is about complexity: complexity in terms of multiple and unknown datasets, in terms of exploding volume, in terms of the speed at which the data is being generated and the speed at which it needs to be processed, and in terms of the variety of data (internal or external, behavioral or social) that is being generated.

Figure 2.1 Characteristics of data.

2.2 EVOLUTION OF BIG DATA

The 1970s and before was the era of mainframes. The data was essentially primitive and structured. Relational databases evolved in the 1980s and 1990s; that era was one of data-intensive applications. The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data. Refer Table 2.1.

Table 2.1 The evolution of big data

1970s and before: Mainframes; basic data storage; primitive and structured data.
1980s and 1990s: Relational databases; data-intensive applications; relational data.
2000s and beyond: Structured, unstructured, and multimedia data; complex and data-driven applications.

2.3 DEFINITION OF BIG DATA

If we were to ask you the simple question, "Define big data", what would your answer be? Well, we will give a few responses that we have heard over time:

1. Anything beyond the human and technical infrastructure needed to support storage, processing, and analysis.
2. Today's BIG may be tomorrow's NORMAL.
3. Terabytes or petabytes or zettabytes of data.
4. I think it is about the 3 Vs. Refer Figure 2.2.

Well, all of these responses are correct. But it is not just one of these; in fact, big data is all of the above and more.

Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.
Source: Gartner IT Glossary.

The 3Vs concept was proposed by the Gartner analyst Doug Laney in a 2001 MetaGroup research publication titled "3D Data Management: Controlling Data Volume, Variety and Velocity".

Source: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

For the sake of easy comprehension, we will look at the definition in three parts. Refer Figure 2.3.

Figure 2.2 Definition of big data.

Figure 2.3 Definition of big data (Gartner): high-volume, high-velocity, high-variety; cost-effective, innovative forms of information processing; enhanced insight and decision making.

Part I of the definition, "big data is high-volume, high-velocity, and high-variety information assets", talks about voluminous data (humongous data) that may have great variety (a good mix of structured, semi-structured, and unstructured data) and will require a good speed/pace for storage, preparation, processing, and analysis.

Part II of the definition, "cost-effective, innovative forms of information processing", talks about embracing new techniques and technologies to capture (ingest), store, process, persist, integrate, and visualize the high-volume, high-velocity, and high-variety data.

Part III of the definition, "enhanced insight and decision making", talks about deriving deeper, richer, and meaningful insights and then using these insights to make faster and better decisions to gain business value and thus a competitive edge.

Data → Information → Actionable intelligence → Better decisions → Enhanced business value

2.4 CHALLENGES WITH BIG DATA

Refer Figure 2.4. Following are a few challenges with big data:

1. Data today is growing at an exponential rate. Most of the data that we have today has been generated in the last 2-3 years. This high tide of data will continue to rise incessantly. The key questions here are: "Will all this data be useful for analysis?", "Do we work with all this data or a subset of it?", "How will we separate the knowledge from the noise?", etc.
2. Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing infrastructure for big data as far as cost-efficiency, elasticity, and easy upgrading/downgrading are concerned. This further complicates the decision to host big data solutions outside the enterprise.
3. The other challenge is to decide on the period of retention of big data. Just how long should one retain this data? A tricky question indeed, as some data is useful for making long-term decisions, whereas in a few cases the data may quickly become irrelevant and obsolete just a few hours after having been generated.
The key questions here are: “Will all chis data be useful for analysis, "Do we work with allthis data or a subset of ie”, “How will we separate the knowledge from the noise?” ete. 2 Cloud computing and virtualization are here to stay. Cloud computing is the answer to managing. infrastructure for big data as far as costefficiency, elasticity, and easy upgrading/downgrading is con- cetmed. This further complicares the decision to host big daa solutions outside the enterprise. 3. The other challenge isto decide on the period of retention of big dat. Just how long should one resin this data?A ricky question indeed as some data is useful for making long-term decisions, whereas in few cases, the data may quickly become irrelevant and obsolete just afew hours after having being generated.Big Data and Analytics ea Figure 2.4 Challenges with big data. 4, There is a dearth of skilled professionals who possess a high level of proficiency in data sciences that is vital in implementing big data solutions. 5. Then, of course, there are other challenges with respect ro capture, storage, preparation, search, anal- ysis, transfer, security, and visualization of big daca. Big data refers co datasets whose size is typically beyond the storage capacity of traditional database software tools. There is no explicit definition how big the dataset should be for ic to be considered “big data.” Here we are to deal with data that is just oo big, moves way to fast, and does not fit the structures of typical database systems, The dat ‘changes are highly dynamic and therefore there isa need co ingest this as quickly as possible 6. Data visualization is becoming popular asa separate discipline. We are short by quite a number, as as business visualization experts ae concerned. 2.5 WHAT IS BIG DATA? Big data is daca chat is ig in volume, velocity, and variety, Refer Figure 2.5. 2.5.1 Volume ‘We have seen it grow from bits to bytes to petabytes and exabytes. 
Refer Table 2.2 and Figure 2.6.

Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes → Petabytes → Exabytes → Zettabytes → Yottabytes

2.5.1.1 Where Does This Data Get Generated?

There are a multitude of sources for big data. An XLS, a DOC, a PDF, etc. is unstructured data; a video on YouTube, a chat conversation on Internet Messenger, a customer feedback form on an online retail website, a CCTV coverage, a weather forecast report is unstructured data too. Refer Figure 2.7 for the sources of big data.

Figure 2.5 Data: big in volume, variety, and velocity.

Table 2.2 Growth of data

Bits        0 or 1
Bytes       8 bits
Kilobytes   1024 bytes
Megabytes   1024² bytes
Gigabytes   1024³ bytes
Terabytes   1024⁴ bytes
Petabytes   1024⁵ bytes
Exabytes    1024⁶ bytes
Zettabytes  1024⁷ bytes
Yottabytes  1024⁸ bytes

Figure 2.6 A mountain of data.

Figure 2.7 Sources of big data.

1. Typical internal data sources: Data present within an organization's firewall. It is as follows:
   • Data storage: File systems, SQL (RDBMSs: Oracle, MS SQL Server, DB2, MySQL, PostgreSQL, etc.), NoSQL (MongoDB, Cassandra, etc.), and so on.
   • Archives: Archives of scanned documents, paper archives, customer correspondence records, patients' health records, students' admission records, students' assessment records, and so on.

2. External data sources: Data residing outside an organization's firewall. It is as follows:
   • Public Web: Wikipedia, weather, regulatory, compliance, census, etc.

3. Both (internal + external data sources):
   • Sensor data: Car sensors, smart electric meters, office buildings, air conditioning units, refrigerators, and so on.
   • Machine log data: Event logs, application logs, business process logs, audit logs, clickstream data, etc.
   • Social media: Twitter, blogs, Facebook, LinkedIn, YouTube, Instagram, etc.
   • Business apps: ERP, CRM, HR, Google Docs, and so on.
   • Media: Audio, video, image, podcast, etc.
   • Docs: Comma-separated values (CSV), Word documents, PDF, XLS, PPT, and so on.

2.5.2 Velocity

We have moved from the days of batch processing (remember our payroll applications) to real-time processing:

Batch → Periodic → Near real time → Real-time processing

2.5.3 Variety

Variety deals with a wide range of data types and sources of data. We will study this under three categories: structured data, semi-structured data, and unstructured data.

1. Structured data: From traditional transaction processing systems, RDBMS, etc.
2. Semi-structured data: For example, Hyper Text Markup Language (HTML) and eXtensible Markup Language (XML).
3. Unstructured data: For example, unstructured text documents, audios, videos, emails, photos, PDFs, social media, etc.

2.6 OTHER CHARACTERISTICS OF DATA WHICH ARE NOT DEFINITIONAL TRAITS OF BIG DATA

There are yet other characteristics of data which are not necessarily the definitional traits of big data. A few of these are listed as follows:

1. Veracity and validity: Veracity refers to biases, noise, and abnormality in data. The key question here is: "Is all the data that is being stored, mined, and analyzed meaningful and pertinent to the problem under consideration?" Validity refers to the accuracy and correctness of the data. Any data that is picked up for analysis needs to be accurate; this is not just true about big data alone.
2. Volatility: Volatility of data deals with: how long is the data valid? And how long should it be stored? There is some data that is required for long-term decisions and remains valid for longer periods of time. However, there are also pieces of data that quickly become obsolete minutes after their generation.
The retaiterisikely to experience _ the festival season, This reemphasizes the point that | ‘a8 upsurge in customer traffic to the website during one might witness spikes in data at some point 2) ‘is week. In the same way, he/she might experi- time and at other times, the data flow can go fiat. 27 WHY BIG DATA? ‘more data we have for analysis, the greater will be the analytical accuracy and also the greater would be "Se confidence in our decisions based on these analytical findings. This will entail a greater positive impact in ses of enhancing operational efficiencies, reducing cost and time, and innovating on new products, new s=viccs and optimizing existing services. Refer Figure 2.8. More data + More accurate analysis -> Greater confidence in decision making —~ Greater operational efficiencies, cost reduction, time reduction, new product development, and optimized offerings, ete.26+ Big Data and Analytics Figure 2.8 Why big data? 2.8 ARE WE JUST AN INFORMATION CONSUMER OR DO WE ALSO. PRODUCE INFORMATION? You have been invited to your friend's promotion Archie's store to pick a good greeting card anda gift, party. You are happy and excited to join your friend. You get the items billed at the Point of Sale system at this important milestone in her career. You send and pay cash at the counter. While at the party, you ‘in your confirmation through a text.message. You click photographs and post it. on Facebook, Flickr, get ready and leave for your friend's residence. On and the Likes. Within minutes, you start to get likes the way, you stop at a gas station to refuel, You and comments on your posts. pay using your credit card. You stop at an upmarket Mention the places in this scenario where data was generated: 1, Text message to send in che confirmation to attend the promotion bash, 2. Use of creditcard ro pay for gas/fuel a the gas station. 3. Point of Sale system at Archie's where your transaction gets recorded. 
4. Photographs and posts on social networking sites.
5. Likes and comments on your posts.

Likewise, there are several instances every day where you generate data. Think about other such cases where you, too, are a producer of information.

2.9 TRADITIONAL BUSINESS INTELLIGENCE (BI) VERSUS BIG DATA

Let us take a sneak peek into some of the differences that one encounters dealing with traditional BI and big data.

1. In a traditional BI environment, all the enterprise's data is housed in a central server, whereas in a big data environment data resides in a distributed file system. The distributed file system scales by scaling in or out horizontally, as compared to a typical database server that scales vertically.
2. In traditional BI, data is generally analyzed in an offline mode, whereas in big data it is analyzed in both real-time as well as offline mode.
3. Traditional BI is about structured data, and it is here that data is taken to the processing functions (move data to code), whereas big data is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).

2.10 A TYPICAL DATA WAREHOUSE ENVIRONMENT

Let us look at a typical Data Warehouse (DW) environment. Operational or transactional or day-to-day business data is gathered from Enterprise Resource Planning (ERP) systems, Customer Relationship Management (CRM) systems, legacy systems, and several third-party applications. The data from these sources may differ in format [data could have been housed in any RDBMS such as Oracle, MS SQL Server, DB2, MySQL, Teradata, and so on, or in spreadsheets (.xls, .xlsx, etc.) or .csv or .txt files]. Data may come from sources located in the same geography or different geographies. This data is then integrated, cleaned, transformed, and standardized through the process of Extraction, Transformation, and Loading (ETL). The transformed data is then loaded into the enterprise data warehouse (available at the enterprise level) or data marts (available at the business unit/
functional unit or business process level). A host of market-leading business intelligence and analytics tools are then used to enable decision making through ad-hoc queries, SQL, enterprise dashboards, data mining, etc. Refer Figure 2.9.

Figure 2.9 A typical data warehouse environment.

2.11 A TYPICAL HADOOP ENVIRONMENT

Let us now study the Hadoop environment. Is it very different from the data warehouse environment, and what exactly is this difference?

As is fairly obvious from Figure 2.10, the data sources are quite disparate, from web logs to images, audios, and videos, to social media data, to the various docs, PDFs, etc. Here the data in focus is not just the data within the company's firewall but also data residing outside the company's firewall. This data is placed in the Hadoop Distributed File System (HDFS). If need be, it can be repopulated back to operational systems or fed to the enterprise data warehouse or data marts or Operational Data Store (ODS) to be picked up for further processing and analysis.

Figure 2.10 A typical Hadoop environment.

2.12 WHAT IS NEW TODAY?

A co-existence strategy that combines the best of the legacy data warehouse and analytics environment with the new power of big data solutions is the best of both worlds. Refer Figure 2.11.

2.12.1 Coexistence of Big Data and Data Warehouse

It is NOT about rip and replace. It will not be possible to get rid of RDBMS or massively parallel processing (MPP) systems; instead, use the right tool for the right job. As we are aware, quite a few companies are comfortable working with the incumbent data warehouse for standard BI and analytics reporting, for example the quarterly sales report, customer dashboard, etc. The data warehouse can continue with its standard workload, drawing data from legacy operational systems and storing the historical data to provision traditional BI reporting and analytics needs.
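The "right tool for the right job" idea can be sketched as a simple routing rule: send standard, structured reporting workloads to the warehouse and computation-rich or exploratory work on varied data to Hadoop. The sketch below is an invented illustration, not code from the book; the workload attributes and platform names are assumptions made for the example.

```python
# Hypothetical workload router for a coexistence setup: the warehouse keeps
# its standard BI workload, Hadoop takes exploratory/computation-rich jobs.
def route(workload):
    """Pick a platform for a job based on what each platform was designed for."""
    structured = workload.get("structured", False)
    exploratory = workload.get("exploratory", False)
    if structured and not exploratory:
        return "data_warehouse"   # standard BI reporting and dashboards
    return "hadoop"               # variety of data, computation-rich processing

jobs = [
    {"name": "quarterly_sales_report", "structured": True},
    {"name": "clickstream_exploration", "structured": False, "exploratory": True},
    {"name": "customer_churn_model", "structured": True, "exploratory": True},
]
print([route(j) for j in jobs])  # -> ['data_warehouse', 'hadoop', 'hadoop']
```

In practice the routing decision is, of course, organizational as much as technical, but the point stands: the same operational sources can feed both platforms, each serving the workload it handles best.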
However, one will not be able to ignore the power that Hadoop brings to the table with different types of analysis on different types of data. The same operational systems, which till now were engaged in powering the data warehouse, can also populate the big data environment when they are needed for computation-rich processing or for data exploration. It will be a tight balancing act to steer the workload to the right platform based on what that platform was designed to do.

Here is a thought-provoking piece from Ralph Kimball at a Cloudera webinar:

"Here's a question that makes me laugh a little bit, but it's a serious question: 'Well, does this mean that relational databases are going to die?' I think that there was a sense, three or four years ago, that maybe this was all a giant zero-sum game between Hadoop and relational databases, and that has simply gone away. Everyone has now realized that there's a huge legacy value in relational databases for the purposes they are used for. Not only transaction processing, but for all the much-focused, index-oriented queries on that kind of data, and that will continue in a very robust way forever. Hadoop, therefore, will present this alternative kind of environment for different types of analysis for different kinds of data, and the two of them will coexist. And they will call each other. There may be points at which the business user isn't actually quite sure which one of them they are touching at any point of time."

Figure 2.11 Big data and data warehouse coexistence.

Just as one cannot ignore the powerful analytics capability of Hadoop, one will not be able to ignore the evolutionary developments in RDBMS such as in-memory processing, etc. The need of the hour is to have both the data warehouse and Hadoop co-exist in today's environment.

2.13 WHAT IS CHANGING IN THE REALMS OF BIG DATA?

Gone are the days when IT and business could work in silos and still see the business through.
Today, it is more of a tight handshake between business, IT, and yet another class called Data Scientists (more on this in Chapter 3 on "Big Data Analytics"). We are citing three very important reasons why companies should compulsorily consider leveraging big data:

1. Competitive advantage: The most important resource with any organization today is their data. What they do with it will determine their fate in the market.
2. Decision making: Decision making has shifted from the hands of the elite few to the empowered many. Good decisions play a significant role in furthering customer engagement, reducing operating margins in retail, and cutting cost and other expenditures in the health sector.
3. Value of data: The value of data continues to see a steep rise. As it is the all-important resource, it is time to look at newer architectures, tools, and practices to leverage it.

REMIND ME

+ The World Wide Web (WWW) and the Internet of Things (IoT) have led to an onslaught of structured, unstructured, and multimedia data.
+ Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making. Source: Gartner IT Glossary.
+ More data → More accurate analysis → Greater confidence in decision making → Greater operational efficiencies, cost reduction, time reduction, new product development, and optimized offerings, etc.
+ Traditional BI is about structured data, and it is here that data is taken to the processing functions (move data to code). On the other hand, big data is about variety: structured, semi-structured, and unstructured data, and here the processing functions are taken to the data (move code to data).

POINT ME (BOOK)

+ Big Data for Dummies, by Judith Hurwitz, Alan Nugent, Fern Halper, Marcia Kaufman, Wiley India Pvt.
Ltd.

CONNECT ME (INTERNET RESOURCES)

http://en.wikipedia.org/wiki/Big_data
http://www.sas.com/en_us/insights/big-data/what-is-big-data.html
https://www.oracle.com/bigdata/
http://bigdatauniversity.com/
http://www.sap.com/solution/big-data/software/overview.html
http://www.ibm.com/software/data/bigdata/
http://www.ibm.com/big-data/us/en/
http://timoelliott.com/blog/2014/04/no-hadoop-isnt-going-to-replace-your-data-warehouse.html

TEST ME

A. Crossword Puzzle on Big Data

Across
2. ________, a Gartner analyst, coined the term "Big Data".
3. ________ is the characteristic of data dealing with its retention.
5. ________ is a large data repository that stores data in its native format until it is needed.

Down
1. The ________ characteristic of data explains the spikes in data.
4. Near real-time processing or real-time processing deals with the ________ characteristic of data.

Answer:
Across: 2. Doug Laney  3. Volatility  5. Data Lakes
Down: 1. Variability  4. Velocity

B. Fill Me
1. Big data is high-volume, high-velocity, and high-variety information assets that demand ________, ________ forms of information processing for enhanced ________ and ________.
Answer: Cost-effective, Innovative, Insight, Decision making

C. Match the Following

Column A | Column B
PostgreSQL | Machine-generated unstructured data
Scientific data | Open-source relational database
Point-of-sale data | Human-generated unstructured data
Social media data | Machine-generated structured data
Gaming-related data | Human-generated unstructured data
Mobile data | Human-generated structured data

Answer:

Column A | Column B
PostgreSQL | Open-source relational database
Scientific data | Machine-generated unstructured data
Point-of-sale data | Machine-generated structured data
Social media data | Human-generated unstructured data
Gaming-related data | Human-generated unstructured data
Mobile data | Human-generated structured data

D. Unsolved Exercises
1. Share your understanding of big data.
2.
How is the traditional BI environment different from the big data environment?
3. Big data (Hadoop) will replace the traditional RDBMS and data warehouse. Comment.
4. Share your experience as a customer on an e-commerce site. Comment on the big data that gets created on a typical e-commerce site.
5. What is your understanding of "Big Data Analytics"?

CHALLENGE ME

1. What is the Internet of Things and why does it matter?
Answer: See http://www.sas.com/en_us/insights/big-data/internet-of-things.html

2. Can the same visualization tool that we run over a conventional data warehouse be used in a big data environment?
Answer: Let us look at Figure 2.12 to understand the solution.

As per Figure 2.12, structured data is stored in a Relational Database Management System (RDBMS), whereas big data (largely unstructured data) is stored in NoSQL databases. Structured data, after cleansing, transforming, and converting to a uniform standard format, is placed in the enterprise data warehouse (at the enterprise level) or the data marts (at the business unit or function level) or operational data stores (almost the complete operational data of an enterprise is housed here), whereas the good variety of data (structured, semi-structured, and unstructured) is placed in data lakes (a large data repository that stores raw data in its native format until it is needed). Data can then be scooped from data lakes to data warehouses, and traditional BI tools can then be run over them. A common set of data visualization tools can thus serve both environments.

Figure 2.12 Visualization tools for traditional BI and big data.
Big data analytics is about working with datasets whose volume and variety exceed the current storage and processing capabilities and infrastructure of your enterprise. It is about moving code to data. This makes perfect sense, as the program for distributed processing is tiny (just a few KBs) compared to the data (terabytes or petabytes today, and likely to be exabytes or zettabytes in the near future). It brings together IT, business users, and data scientists.

We have often asked participants of our learning programs what comes to mind when they hear the term "Big Data," and we are not surprised by the answer. It is "Volume." But now that we have a clear understanding of big data, we know it is not only about volume; the variety and velocity too are very important factors.

Figure 3.3 What is big data analytics?

Refer Figure 3.4. Big data is not just about technology. It is about understanding what the data is saying to us. It is about understanding relationships that we thought never existed between datasets. It is about patterns and trends waiting to be unveiled.

And of course, big data analytics is not here to replace our now very robust and powerful Relational Database Management System (RDBMS) or our traditional Data Warehouse. It is here to coexist with both RDBMS and Data Warehouse, leveraging the power of each to yield business value. Big data analytics is not a "one-size-fits-all" traditional RDBMS built on shared disk and memory.

Figure 3.4 What big data analytics isn't?

And before we think it is only used by huge online companies like Google or Amazon, let us clear the myth. It is for any business and any industry that needs actionable insights out of its data (both internal and external).

3.4 WHY THIS SUDDEN HYPE AROUND BIG DATA ANALYTICS?

If we go by the industry buzz, everywhere there seems to be talk about big data and big data analytics. Why this sudden hype? Refer Figure 3.5.
Let us put it down to three foremost reasons:

1. Data is growing at a 40% compound annual rate, reaching nearly 45 ZB by 2020. In 2010, almost 1.2 trillion gigabytes of data was generated. This amount doubled to 2.4 trillion gigabytes in 2012 and to about 5 trillion gigabytes in the year 2014. The volume of business data worldwide is expected to double every 1.2 years. Wal-Mart, the world's largest retailer, processes one million customer transactions per hour. 500 million "tweets" are posted by Twitter users every day. 2.7 billion "Likes" and comments are posted by Facebook users in a day. Every day 2.5 quintillion bytes of data are created, with 90% of the world's data created in the past 2 years alone.
Source: (a) http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
(b) http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
2. Cost per gigabyte of storage has hugely dropped.
3. There are an overwhelming number of user-friendly analytics tools available in the market today.

Figure 3.5 What big data entails?

3.5 CLASSIFICATION OF ANALYTICS

There are basically two schools of thought:

1. Those that classify analytics into basic, operationalized, advanced, and monetized.
2. Those that classify analytics into analytics 1.0, analytics 2.0, and analytics 3.0.

3.5.1 First School of Thought

1. Basic analytics: This primarily is slicing and dicing of data to help with basic business insights. This is about reporting on historical data, basic visualization, etc.
2. Operationalized analytics: It is operationalized analytics if it gets woven into the enterprise's business processes.
3. Advanced analytics: This largely is about forecasting for the future by way of predictive and prescriptive modeling.
4. Monetized analytics: This is analytics in use to derive direct business revenue.

3.5.2
Second School of Thought

Let us take a closer look at analytics 1.0, analytics 2.0, and analytics 3.0. Refer Table 3.1.

Table 3.1 Analytics 1.0, 2.0, and 3.0

Analytics 1.0 | Analytics 2.0 | Analytics 3.0
Era: mid-1950s to 2009 | 2005 to 2012 | 2012 to present
Descriptive statistics (report on events, occurrences, etc. of the past) | Descriptive statistics + predictive statistics (use data from the past to make predictions for the future) | Descriptive + predictive + prescriptive statistics (use data from the past to make prophecies for the future and at the same time make recommendations to leverage the situation to one's advantage)
Key questions asked: What happened? Why did it happen? | Key questions asked: What will happen? Why will it happen? | Key questions asked: What will happen? When will it happen? Why will it happen? What should be the action taken to take advantage of what will happen?
Data from legacy systems, ERP, CRM, and 3rd-party applications | Big data | A blend of big data and data from legacy systems, ERP, CRM, and 3rd-party applications
Small and structured data sources; data stored in enterprise data warehouses or data marts | Big data is being taken up seriously; data is mainly unstructured, arriving at a much higher pace. This fast flow of data entailed that the influx of big-volume data had to be stored and processed rapidly, often on massively parallel servers running Hadoop | A blend of big data and traditional analytics to yield insights and offerings with speed and impact
Data was internally sourced | Data was often externally sourced | Data is both internally and externally sourced
Relational databases | Database appliances, Hadoop clusters, SQL-to-Hadoop environments, etc. | In-memory analytics, in-database processing, agile analytical methods, machine learning techniques, etc.

Figure 3.6 Analytics 1.0, 2.0, and 3.0.

Figure 3.6 shows the subtle growth of analytics from descriptive → diagnostic → predictive → prescriptive analytics.
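The descriptive, predictive, and prescriptive questions in Table 3.1 can be made concrete with a toy example. The monthly sales figures, the naive linear trend, and the safety-stock rule below are all invented for illustration; real prescriptive analytics would use far richer models.

```python
# Toy monthly sales series (hypothetical numbers).
sales = [100, 110, 120, 130, 140, 150]

# Descriptive: what happened? Summarize the past.
average = sum(sales) / len(sales)

# Predictive: what will happen? A naive linear trend: sales grew by a
# roughly constant step, so project the next month from the average step.
step = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + step

# Prescriptive: what should we do about it? A made-up reorder rule that
# keeps 20% safety stock above the forecast demand.
recommended_stock = forecast * 1.2

print(average, forecast, recommended_stock)  # -> 125.0 160.0 192.0
```

The progression mirrors the three eras: analytics 1.0 stops at the summary, 2.0 adds the forecast, and 3.0 turns the forecast into a recommended action.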
3.6 GREATEST CHALLENGES THAT PREVENT BUSINESSES FROM CAPITALIZING ON BIG DATA

1. Obtaining executive sponsorship for investments in big data and its related activities (such as training, etc.).
2. Getting the business units to share information across organizational silos.
3. Finding the right skills (business analysts and data scientists) that can manage large amounts of structured, semi-structured, and unstructured data and create insights from it.
4. Determining the approach to scale rapidly and elastically. In other words, the need to address the storage and processing of the large volume, velocity, and variety of big data.
5. Deciding whether to use structured or unstructured, internal or external data to make business decisions.
6. Choosing the optimal way to report findings and analysis of big data (visual presentation and analytics) for the presentations to make the most sense.
7. Determining what to do with the insights created from big data.

3.7 TOP CHALLENGES FACING BIG DATA

1. Scale: Storage (RDBMS (Relational Database Management System) or NoSQL (Not only SQL)) is one major concern that needs to be addressed to handle the need for scaling rapidly and elastically. The need of the hour is a storage that can best withstand the onslaught of the large volume, velocity, and variety of big data. Should you scale vertically or should you scale horizontally?
2. Security: Most of the NoSQL big data platforms have poor security mechanisms (lack of proper authentication and authorization mechanisms) when it comes to safeguarding big data. A spot that cannot be ignored, given that big data carries credit card information, personal information, and other sensitive data.
3. Schema: Rigid schemas have no place. We want the technology to be able to fit our big data and not the other way around. The need of the hour is dynamic schema. Static (pre-defined) schemas are passé.
4.
Continuous availability: The big question here is how to provide 24/7 support, because almost all RDBMS and NoSQL big data platforms have a certain amount of downtime built in.
5. Consistency: Should one opt for consistency or eventual consistency?
6. Partition tolerance: How to build partition-tolerant systems that can take care of both hardware and software failures?
7. Data quality: How to maintain data quality (data accuracy, completeness, timeliness, etc.)? Do we have appropriate metadata in place?

3.8 WHY IS BIG DATA ANALYTICS IMPORTANT?

Let us study the various approaches to analysis of data and what each leads to.

1. Reactive – Business Intelligence: What does Business Intelligence (BI) help us with? It allows businesses to make faster and better decisions by providing the right information to the right person at the right time in the right format. It is about analysis of past or historical data and then displaying the findings of the analysis, or reports, in the form of enterprise dashboards, alerts, notifications, etc. It has support for both pre-specified reports as well as ad hoc querying.
2. Reactive – Big Data Analytics: Here the analysis is done on huge datasets, but the approach is still reactive, as it is still based on static data.
3. Proactive – Analytics: This is to support futuristic decision making by the use of data mining, predictive modeling, text mining, and statistical analysis. This analysis is not on big data, as it still uses the traditional database management practices and therefore has severe limitations on storage capacity and processing capability.
4. Proactive – Big Data Analytics: This is sieving through terabytes, petabytes, and exabytes of information to filter out the relevant data to analyze. This also includes high-performance analytics to gain rapid insights from big data and the ability to solve complex problems using more data.

3.9 WHAT KIND OF TECHNOLOGIES ARE WE LOOKING TOWARD TO HELP MEET THE CHALLENGES POSED BY BIG DATA?
1. The first requirement is cheap and abundant storage.
2. We need faster processors to help with quicker processing of big data.
3. Affordable open-source, distributed big data platforms, such as Hadoop.
4. Parallel processing, clustering, virtualization, large grid environments (to distribute processing to a number of machines), high connectivity, and high throughputs rather than low latency.
5. Cloud computing and other flexible resource allocation arrangements.

3.10 DATA SCIENCE

Data science is the science of extracting knowledge from data. In other words, it is the science of drawing out the insights hidden amongst data using statistical and mathematical techniques. It employs techniques and theories drawn from many fields within the broad areas of mathematics, statistics, and information technology, including machine learning, data engineering, probability models, statistical learning, pattern recognition and learning, etc.

Today we have a plethora of use-cases for "Data Science" that are already exploring massive datasets (up to zettabytes of information) for weather predictions, oil drilling, seismic activities, financial frauds, terrorist networks and activities, global economic impacts, sensor logs, social media analytics, and many more, beyond standard retail and manufacturing use-cases such as customer churn, market basket analytics (associative mining), collaborative filtering, regression analysis, etc. Data science is multi-disciplinary. Refer Figure 3.7.

A data scientist should have the prowess to counter the pressures of business. A firm understanding of the business domain further helps. The following is a list of traits that need to be honed to play the role of a data scientist.

3.10.1 Technology Expertise

It goes without saying that technology expertise will come in handy if one is to play the role of a data scientist. Listed below are a few skills required as far as technical expertise is concerned.

Figure 3.7 Data scientist.
3. Programming languages such as Java, Python, C+4, et. 4, Open-source tools such as Hadoop. 5. Data warehousing. 6. Data mining. 7. Visualization such as Tableau, Flare, Google visualization APIs, etc. 3.10.3 Mathematics Expertise Since the core job of the data scientist will require him o comprehend data, interpret it, make sense of i and analyze it, he/she will have to dabble in earning algorichms. The following are the key skills that ada scientise will have to have in his arsenal. 1. Mathematics. 2. Statistics, 3. Artificial Inelligence (Al). 4, Algorithms. 5. Machine learr 6. Pattern recognition. 7. Natural Language Processing. To sum it up, the data science process is 1, Collecting raw data from multiple disparate data sources. 2. Processing the data. 3, Integrating the daca and preparing clean datasers 4, Engaging in explorative data analysis using model and algorithms. 5. Preparing presentations using data visualizaions (commonly called Infographics, ot VizAnalytics, etc: 6, Communicating the findings to all stakeholders. 7. Making faster and better decisions. nalytis, 3.11_ DATA SCIENTIST...YOUR NEW BEST FRIEND!!! Tn today’s daca age, a data scientist isthe best fiend that you can gift yourself. Refer Figure 3.8 to learn ab the tasks thar the data scientist can help you with. 3.11.1. Responsibilities of a Data Scientist Refer Figure 3.8. 1, Data Management: A data scientist employs several approaches to develop the relevant datasets analysis. Raw data is just “RAW,” unsuitable for analysis, ‘The data scientist works on it to prepare to reflect the relationships and contexts. This data then becomes useful for processing and fu analysis,Big Data Analytics 45 Medel and analyses to Soe oc q 2s Businesd/Domain as one tindng Figure 3.8 Data scientist: your new best friend!!! 2. Analytical Techniques: Depending on the business questions which we are uying to find answers coand. 
the type of data available at hand, the data scientist employs a blend of analytical techniques to develop models and algorithms to understand the data, interpret relationships, spot tends, and unveil patterns 3. Business Analysts: A daca scientist isa business analyst who distinguishes cool facts from insights and. is able o apply his business acumen and domain knowledge to sce the results in the business context. He is a good presenter and communicator who is able to communicate the results of his findings in a Janguage that is understood by the different business stakeholders. 3.12 TERMINOLOGIES USED IN BIG DATA ENVIRONMENTS In order to geta good handle on the big data environment, let us get familiar with a few key terminologies ss thisarena, 3.12.1 In-Memory Analytics Data access from non-volatile storage such as hard disk isa slow process. The more the dara is required to be ‘cched from hard disk or secondary storage, the slower the process gets. One way to combat this challenge s to presprocess and store data (cubes, aggregate tables, query sets, et.) so that the CPU has to fetch a small subset of records. But this requires thinking in advance as to what data will be required for analysis. If there sa need for different or more data, it is back to the inital process of pre-computing and storing data or ching i from secondary storage. This problem has been addressed using in-memory analytics. Here all the relevant data is stored in Random Access Memory (RAM) or primary storage thus eliminating the need to access the data from hard disk. The advantage is faster access, rapid deployment, better insights, and minimal IT involvement. 3.12.2. In-Database Processing «database processing is also called as in-database analytics. It works by Fusing data warchouses with analyti- cal systems, Typically the data from various enterprise On Line Transaction Processing (OLTP) systems afterBig Data and Analytics cleaning up (de-duplication, scrubbing, et.) 
through the process of ETL is stored in the Enterprise Data ‘Warehouse (EDW) or data marts. The huge datasets are then exported to analytical programs for complex and extensive computations, With in-database processing, the database program itself can run the compu- tations eliminating the need for export and thereby saving on time. Leading database vendors are offering this feature co large businesses. 3.12.3 Symmetric Multiprocessor System (SMP) In SMP, there is 8 single common main memory that is shared by two or more identical processors. The processors have full access to all I/O devices and are controlled by a single operating system instance, SMP ate tightly coupled multiprocessor systems. Each processor has its own high-speed memory, called ‘ache memory and are connected using a system bus. Refer Figure 3.9. 3.12.4 Massively Parallel Processing ‘Massive Parallel Processing (MPP) refers to the coordinated processing of programs by a number of processors working parallel The processors, cach have their own operating systems and dedicated memory. They work on different parts of the same program, The MPP processors communicate using some sort of messaging interface. The MPP systems are more difficult to program as the application must be divided in such a way that all che executing segments can communicate with each other. MPP is different from Symmetrcaly “Multiprocessing (SMP) in that SMP works with the processors sharing the same operating system and same memory. SMP is aso referred to as tight-coupled mudiprocesing. 3.12.5 Difference Between Parallel and Distributed Systems The next two terms that we discuss are parallel and distributed systems, [As is evident from Figure 3.10, a parallel database system is a tightly coupled system, The processo co-operate for query processing. The user is unaware of the parallelism since he/she has no access ro a specif ecg Figure 3.9 Symmetric Multiprocessor System.Sig Daca Analytics “7 ‘Back end paral! 
processor of the system. Either the processors have access to a common memory (refer Figure 3.11) or they make use of message passing for communication.

Figure 3.10 Parallel system.
Figure 3.11 Parallel system.

Distributed database systems are known to be loosely coupled and are composed of individual machines. Refer Figure 3.12. Each of the machines can run its individual application and serve its own respective user. The data is usually distributed across several machines, thereby necessitating quite a number of machines to be accessed to answer a user query. Refer Figure 3.13.

Figure 3.12 Distributed system.
Figure 3.13 Distributed system.

3.12.6 Shared-Nothing Architecture

Let us look at the three most common types of architecture for multiprocessor high-transaction-rate systems:

1. Shared Memory (SM).
2. Shared Disk (SD).
3. Shared Nothing (SN).

In shared-memory architecture, a common central memory is shared by multiple processors. In shared-disk architecture, multiple processors share a common collection of disks while having their own private memory. In shared-nothing architecture, neither memory nor disk is shared among multiple processors.

3.12.6.1 Advantages of a "Shared-Nothing Architecture"

1. Fault Isolation: A "shared-nothing architecture" provides the benefit of isolating faults. A fault in a single node is contained and confined to that node exclusively, and exposed only through messages (or lack thereof).
2. Scalability: Assume that the disk is a shared resource. It implies that the controller and the disk bandwidth are also shared. Synchronization will have to be implemented to maintain a consistent state. This would mean that different nodes will have to take turns to access the critical data. This
Ic states that in a distributed computing environment 2 collection of interconnected nodes that share data), ic is impossible to provide the following guarantees. Refer Figure 3.14, At best you can have two of the following three ~ one must be sactficed. 1. Consistency 2. Availability 3. Partition tolerance 3.12.7.1 CAP Theorem ‘Let us spend some time understanding the earlier mentioned terms. 1. Consistency implies that every read fetches the last writ. 2. Availability implies that reads and writes always succeed. In other words, each non-fuling node will return a response in a reasonable amount of time. 3. Partition tolerance implies that the system will continue to function when nerwork partition occurs. Lecus try to understand this using a real-life situation, ‘You work fora training instituce, “XYZ.” The insticute has 50 instructors including you, All of you report so a training coordinator, At the end of the month, all the instructors together with the training coordina- cor peruse through the training requests received from the various corporate houses and prepare a training schedule for each instructor. These training schedules (one for each instructor) are shared with “Amer.” the ce administrator. Each morning, you either call the office helpdesk (essentially Ameys desk) or check. person with Amey for your schedule for the day In case a training request has been cancelled or updated updates can be in che form of change in course, change in duration, change of the training timings, et.), Amey is informed of the updates and the schedules are subsequently updated by him. ‘Things were good until now. Few corporate houses were your clients and the schedules ofeach instructor could be smoothly managed withoue any major hiccups. Bue your raining institute has been implement- g promotion campaigns to expand the business. 
As a result of advertising in the media and word-of-mouth publicity by your existing clients, you suddenly see an upsurge in training requests from existing and new clients. In consequence, more instructors have been recruited. A few trainers/consultants have also been roped in from other training institutes to help tackle the load.

Figure 3.14 Brewer's CAP.

Now when you go to Amey to check your schedule or call in at the helpdesk, you are prepared for a wait in the queue. Looking at the current state of affairs, the training coordinator decides to recruit an additional office administrator, "Joey." The helpdesk number will remain the same and will be shared by both the office administrators.

This arrangement works well for a couple of days. Then one day:

You: Hey Amey!
Amey: Hi! How can I help?
You: I think I am scheduled to anchor a training at 3:00 pm today. Can I please have the details?
Amey: Sure! Just a minute.

Amey browses through the file where he maintains the schedules. He does not see a training scheduled against your name at 3:00 pm today and responds back, "You do not have any training to conduct at 3:00 pm."

You: How is that possible? The training coordinator called up yesterday evening to inform of the same and said he has updated the office administrators.
Amey: Oh! Did he say which office administrator? It could have been Joey. Please check with Joey.
Amey: Hey Joey! Please check the schedule for Paul here... Do you see something scheduled at 3:00 pm today?
Joey: Sure enough! He is anchoring the training for client "Z" today at 3:00 pm.

A clear case of an inconsistent system!!! The updates in the schedule were shared by the training coordinator with Joey and you were checking for your schedule with Amey.

You share this incident with the training coordinator and that gets him thinking. The issue has to be addressed immediately, otherwise it will be difficult to avoid a chaotic situation.
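The Amey/Joey incident is exactly what a pair of unsynchronized replicas looks like: a write lands on one copy while a read is served from the other. A minimal Python sketch of the situation (the variable names are ours, purely illustrative):

```python
# Toy sketch of the incident: two replicas of the schedule, an update
# applied to only one of them, and a stale read served from the other.

amey = {}   # Amey's file (replica 1)
joey = {}   # Joey's file (replica 2)

# The coordinator's update reaches only Joey...
joey["Paul"] = "3:00 pm, client Z"

# ...while Paul asks Amey for his schedule.
print(amey.get("Paul"))   # None -> "you do not have any training today"
print(joey.get("Paul"))   # 3:00 pm, client Z

# The two replicas disagree: the system is not consistent.
assert amey.get("Paul") != joey.get("Paul")
```

A read can fetch the last write only if every replica that serves reads has seen that write — which is precisely the guarantee the two administrators were not providing.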
He comes up with a plan and shares it with both the office administrators the following day.

Training Coordinator: Folks, each time that either an instructor or I call any one of you to update a schedule, make sure that both of you update it in your respective files. This way the instructor will always get the most recent and consistent information irrespective of whom amongst the two of you he/she speaks to.
Joey: But that could mean a delay in answering either a phone call or sharing the schedule with the instructor waiting in queue.
Training Coordinator: Yes, I understand. But there is no way that we can give incorrect information.
Amey: There is this other problem as well. Suppose one of us is on leave on a particular day. That would mean that we cannot take any update-related calls as we will not be able to simultaneously update both the files (my file and Joey's).
Training Coordinator: Well, good point! That is the availability problem! But I have thought about that as well. Here is the plan:
1. If one of you receives the update call (any updates to any schedule), ensure that you inform the other person if he is available.
2. In case the other person is not available, ensure that you inform him of all the updates to all schedules via email. It is a must!
3. When the other person resumes duty, the first thing he will do is update his file with all the updates to all schedules that he has received via email.

Wow! That is sure a Consistent and Available system! Looks like everything is in control.

Wait a minute! There is a tiff that has taken place between the office administrators. The two are pretty much available but are not talking to each other which, in other words, means that the updates are not flowing from one to the other. We have to be partition tolerant! As the training coordinator, you instruct them saying that none of you are taking any calls requesting schedules or updates to schedules till you patch up.
This implies that the system is partition tolerant but not available at that time.

In summary, one can at most decide to go with two of the three.
1. Consistency: The instructors or the training coordinator, once they have updated information with you, will always get the most updated information when they call subsequently.
2. Availability: The instructors or the training coordinator will always get the schedule if any one or both of the office administrators have reported to work.
3. Partition Tolerance: Work will go on as usual even if there is a communication loss between the office administrators owing to a spat or a tiff.

When to choose consistency over availability and vice-versa:
1. Choose availability over consistency when your business requirements allow some flexibility around when the data in the system synchronizes.
2. Choose consistency over availability when your business requirements demand atomic reads and writes.

Examples of databases follow one of the three possible combinations:
1. Availability and Partition Tolerance (AP)
2. Consistency and Partition Tolerance (CP)
3. Consistency and Availability (CA)

Refer Figure 3.15 to get a glimpse of databases that adhere to two of the three criteria of the CAP theorem.

Figure 3.15 Databases and CAP. [The figure labels the three vertices - Consistency: commits are atomic across the entire distributed system; Availability: the system is available/accessible/operational at all times; Partition tolerance: the system responds incorrectly only when there is a total network failure - and places CA: traditional RDBMSs such as PostgreSQL, MySQL, etc.; AP: Riak, Cassandra, CouchDB, and Dynamo-like systems; CP: HBase, MongoDB, Redis, MemcacheDB, and BigTable-like systems.]

3.13 BASICALLY AVAILABLE SOFT STATE EVENTUAL CONSISTENCY (BASE)

A few basic questions to start with:
1. Where is it used?
In distributed computing.
2. Why is it used?
To achieve high availability.
3. How is it achieved?
Assume a given data item. If no new updates are made to this given data item for a stipulated period of time, eventually all accesses to this data item will return the updated value.
In other words, if no new updates are made to a given data item for a stipulated period of time, all updates that were made in the past and not yet applied to this given data item and its several replicas will percolate to this data item, so that it stays as current/recent as is possible.
4. What is replica convergence?
A system that has achieved eventual consistency is said to have converged, or achieved replica convergence.
5. Conflict resolution: How is the conflict resolved?
(a) Read repair: If a read leads to a discrepancy or inconsistency, a correction is initiated. It slows down the read operation.
(b) Write repair: If a write leads to a discrepancy or inconsistency, a correction is initiated. This will cause the write operation to slow down.
(c) Asynchronous repair: Here, the correction is not part of a read or write operation.

3.14 FEW TOP ANALYTICS TOOLS

There is no dearth of analytical tools in the market. Please find below our list of a few top analytics tools. We have also provided the links after each tool for you to explore more.
1. MS Excel
https://support.office.microsoft.com/en-in/article/Whats-new-in-Excel-2013
2. SAS
http://www.sas.com/en_us/home.html
3. IBM SPSS Modeler
http://www-01.ibm.com/software/analytics/spss/products/modeler/
4. Statistica
http://www.statsoft.com/
5. Salford Systems
http://www.salford-systems.com/
6. WPS (World Programming System)
http://www.teamwpc.co.uk/products/wps

3.14.1 Open Source Analytics Tools

Let us look at a couple of open source analytics tools. We have also provided the links after each tool for you to explore more.
1. R Analytics
http://www.revolutionanalytics.com/
2. Weka
http://www.cs.waikato.ac.nz/ml/weka/
REMIND ME

- Quite a few data analytics and visualization tools are available in the market today from leading vendors such as IBM, Tableau, SAS, R Analytics, Statistica, World Programming Systems (WPS), etc. to help process and analyze your big data.
- Big data analytics is about a tight handshake between three communities: IT, business users, and data scientists.
- Data science is the science of extracting knowledge from data.
- The CAP theorem is also called the Brewer's Theorem. It states that in a distributed computing environment (a collection of interconnected nodes that share data), it is impossible to provide all of the following guarantees. At best you can have two of the following three; one must be sacrificed.
  - Consistency
  - Availability
  - Partition tolerance

CONNECT ME (INTERNET RESOURCES)

- http://en.wikipedia.org/wiki/Data_science
- http://simplystatistics.org/2013/12/12/the-key-word-in-data-science-is-not-data-it-is-science/
- http://www.coralytics.com/2012/06/data-science-is-multidisciplinary.html
- http://spotfire.tibco.com/blog/?p=4240
- http://reports.informationweek.com/abstract/106/1255/Financial/tech-center-taking-advantage-of-in-memory-analytics.html
- http://www.informationweek.com/software/information-management/oracle-analytics-package-expands-in-database-processing-options/d/d-id/11027122

TEST ME

A. Fill Me
1. The ________ technology helps query data that resides in a computer's random access memory (RAM) rather than data stored on physical disks.
2. Eventual consistency is a consistency model used in distributed computing to achieve high ________.
3. A coordinated processing of a program by multiple processors, each working on different parts of the program and using its own operating system and memory, is called ________.
4. A collection of independent computers that appears to its users as a single coherent system is ________.

Answers:
1. In-memory analytics
2. Availability
3.
Massively parallel processing
4. Distributed systems

B. Answer Me
1. What are the various types of analytics?
Answer: Descriptive, predictive, and prescriptive analytics.
2. What are the key questions to be answered by all organizations stepping into analytics?
Answer: The key questions for any organization stepping into analytics are:
- Should you be storing all of your big data? If "Yes", where are you going to store it? If "No", how do you know what to store and what to discard?
- How will you sieve through your massive data to filter out the relevant from the irrelevant?
- How long will you store this data?
- How will you accommodate the peaks (variability in terms of data influx) in your data?
- How will you analyze? Will you analyze all the data that is stored or analyze a sample?
- What will you do with the insights generated from this analysis?
3. What can one expect from analytics 3.0?
Answer:
- In-memory analytics.
- In-database processing.
- Leveraging analytics to improve operational, tactical, and strategic decision making.
- Coupling the in-memory analytics and in-database processing with agile analytical methods and machine learning techniques.
- Appropriate tools to effectively support decision-making at the front lines, such as mobile and self-serve analytical applications.
4. Which industries will be affected most by analytics 3.0? Who will benefit the most?
Answer: Almost all the firms in all the industries, and not just online firms, will be affected by analytics 3.0. A lot of analytics has already been done in the Transport, Retail, and Banking sectors. Telecom, entertainment, and health sectors have a bit of catching up to do.
5. What is predictive and prescriptive analytics?
Answer: Predictive analytics helps you answer the questions "What will happen?" and "Why will it happen?" Prescriptive analytics goes beyond "What will happen?", "Why will it happen?", and "When will it happen?" to answer "What should be the action taken to take advantage of what will happen?"
C. Crossword

1. Puzzle on CAP Theorem

Across
3. CAP theorem is also called as ________ theorem.
4. System will continue to function even when network partition occurs.

Down
1. Every read fetches the most recent write.
2. A non-failing node will return a reasonable response within a reasonable amount of time.

Solution:
Across: 3. Brewer 4. Partition Tolerant
Down: 1. Consistency 2. Availability

2. Puzzle on Architecture

Down
1. In this architecture, central memory is shared by multiple processors.
2. In this architecture, multiple processors have their own private memory.

Answer:
Across: 1. Scalability
Down: 1. Shared Memory 2. Shared Disk

The Big Data Technology Landscape

"The goal is to turn data into information, and information into insight."
- Carly Fiorina, former CEO, Hewlett-Packard Co.

WHAT'S IN STORE?

The focus of this chapter is on understanding the "big data technology landscape". This chapter is an overview of NoSQL and Hadoop. There are separate chapters on NoSQL (MongoDB and Cassandra) as well as Hadoop in the book.

The big data technology landscape can be majorly studied under two important technologies:
1. NoSQL
2. Hadoop

4.1 NoSQL (NOT ONLY SQL)

The term NoSQL was first coined by Carlo Strozzi in 1998 to name his lightweight, open-source, relational database that did not expose the standard SQL interface. Johan Oskarsson, who was then a developer at Last.fm, reintroduced the term NoSQL in 2009 at an event called to discuss open-source distributed databases. The hashtag #NoSQL was coined by Eric Evans, and the database people at the event found it suitable to describe these non-relational databases.

A few features of NoSQL databases are as follows:
1. They are open source.
2. They are non-relational.
3. They are distributed.
4. They are schema-less.
5. They are cluster friendly.
6. They are born out of 21st century web applications.

4.1.1 Where is it Used?

NoSQL databases are widely used in big data and other real-time web applications. Refer Figure 4.1. NoSQL
databases are used to store log data which can then be pulled up for analysis. Likewise they are used to store social media data and all such data which cannot be stored and analyzed comfortably in RDBMS.

4.1.2 What is it?

NoSQL stands for Not Only SQL. These are non-relational, open source, distributed databases. They are hugely popular today owing to their ability to scale out or scale horizontally and their adeptness at dealing with a rich variety of data: structured, semi-structured, and unstructured. Refer Figure 4.2 for additional features of NoSQL.

Figure 4.2 What is NoSQL?

NoSQL databases:
1. Are non-relational: They do not adhere to the relational data model. In fact, they are either key-value pairs or document-oriented or column-oriented or graph-based databases.
2. Are distributed: They are distributed, meaning the data is distributed across several nodes in a cluster constituted of low-cost commodity hardware.
3. Offer no support for ACID properties (Atomicity, Consistency, Isolation, and Durability): They do not offer support for ACID properties of transactions. On the contrary, they adhere to Brewer's CAP (Consistency, Availability, and Partition tolerance) theorem and are often seen compromising on consistency in favor of availability and partition tolerance.
4. Provide no fixed table schema: NoSQL databases are becoming increasingly popular owing to their support for flexibility of the schema. They do not mandate that the data strictly adhere to any schema structure at the time of storage.

4.1.3 Types of NoSQL Databases

We have already stated that NoSQL databases are non-relational. They can be broadly classified into the following:
1. Key-value or the big hash table.
2. Schema-less.

Refer Figure 4.3. Let us take a closer look at key-value and a few other types of schema-less databases:
1. Key-value: It maintains a big hash table of keys and values.
For example, Dynamo, Redis, Riak, etc.
Sample Key-Value Pair in Key-Value Database
Key | Value
First Name | Simmonds
Last Name | David
2. Document: It maintains data in collections constituted of documents. For example, MongoDB, Apache CouchDB, Couchbase, MarkLogic, etc.
Sample Document in Document Database
{ "Book Name": "Fundamentals of Business Analytics", "Publisher": "Wiley India", "Year of Publication": "2011" }
3. Column: Each storage block has data from only one column. For example, Cassandra, HBase, etc.

Figure 4.3 Types of NoSQL databases.

4. Graph: They are also called network databases. A graph stores data in nodes. For example, Neo4j, HyperGraphDB, etc.
Sample Graph in Graph Database: nodes connected by labelled edges such as "knows since 2002", "is member since 2002", and "is member since 2003".

Refer Table 4.1 for popular schema-less databases.

4.1.4 Why NoSQL?

1. It has a scale-out architecture instead of the monolithic architecture of relational databases.
2. It can house large volumes of structured, semi-structured, and unstructured data.
3. Dynamic schema: A NoSQL database allows insertion of data without a pre-defined schema. In other words, it facilitates application changes in real time, which thus supports faster development, easy code integration, and requires less database administration.
4. Auto-sharding: It automatically spreads data across an arbitrary number of servers. The application in question is often not even aware of the composition of the server pool. It balances the load of data and queries on the available servers; and if and when a server goes down, it is quickly replaced without any major activity disruptions.
5. Replication: It offers good support for replication, which in turn guarantees high availability, fault tolerance, and disaster recovery.

4.1.5 Advantages of NoSQL

Let us enumerate the advantages of NoSQL. Refer Figure 4.4.
1.
Can easily scale up and down: A NoSQL database supports scaling rapidly and elastically and even allows scaling to the cloud.

Table 4.1 Popular schema-less databases
Key-Value Data Store: Riak, Redis, Membase
Column-Oriented Data Store: Cassandra, HBase
Document Data Store: MongoDB, CouchDB, RavenDB
Graph Data Store: InfiniteGraph, Neo4j, AllegroGraph

Figure 4.4 Advantages of NoSQL.

(a) Cluster scale: It allows distribution of the database across 100+ nodes, often in multiple data centers.
(b) Performance scale: It sustains over 100,000+ database reads and writes per second.
(c) Data scale: It supports housing of 1 billion+ documents in the database.
2. Doesn't require a pre-defined schema: NoSQL does not require any adherence to a pre-defined schema. It is pretty flexible. For example, if we look at MongoDB, the documents (equivalent of records in RDBMS) in a collection (equivalent of a table in RDBMS) can have different sets of key-value pairs.
{ "_id": 101, "BookName": "Fundamentals of Business Analytics", "AuthorName": "Seema Acharya", "Publisher": "Wiley India" }
{ "_id": 102, "BookName": "Big Data and Analytics" }
3. Cheap, easy to implement: Deploying NoSQL properly allows for all of the benefits of scale, high availability, fault tolerance, etc. while also lowering operational costs.
4. Relaxes the data consistency requirement: NoSQL databases adhere to the CAP theorem (Consistency, Availability, and Partition tolerance). Most of the NoSQL databases compromise on consistency in favor of availability and partition tolerance. However, they do go for eventual consistency.
5.
Data can be replicated to multiple nodes and can be partitioned: There are two terms that we will discuss here:
(a) Sharding: Sharding is when different pieces of data are distributed across multiple servers. NoSQL databases support auto-sharding; this means that they can natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Servers can be added or removed from the data layer without application downtime. This means that data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.
(b) Replication: Replication is when multiple copies of data are stored across the cluster and even across data centers. This promises high availability and fault tolerance.

4.1.6 What We Miss With NoSQL?

With NoSQL around, we have been able to counter the problem of scale (NoSQL scales out). There is also flexibility with respect to schema design. However, there are a few features of conventional RDBMS that are missed. Refer Figure 4.5.

Figure 4.5 What we miss with NoSQL?

NoSQL does not support joins. However, it compensates for it by allowing embedded documents, as in MongoDB. It does not have provision for ACID properties of transactions. However, it obeys the Brewer's CAP theorem. NoSQL does not have a standard SQL interface, but NoSQL databases such as MongoDB and Cassandra have their own rich query languages [MongoDB query language and Cassandra Query Language (CQL)] to compensate for the lack of it. One thing which is clearly missed is the integration with other applications that support SQL.

4.1.7 Use of NoSQL in Industry

NoSQL is being put to use in varied industries. NoSQL databases are used to support analysis for applications such as web user data analysis, log analysis, sensor feed analysis, making recommendations for upsell and cross-sell, etc. Refer Figure 4.6.
Figure 4.6 Use of NoSQL in industry (e.g., key-value pairs for shopping carts and web user data analysis at Amazon, LinkedIn).

4.1.8 NoSQL Vendors

Refer Table 4.2 for a few popular NoSQL vendors.

Table 4.2 Few popular NoSQL vendors
Company | Product | Most Widely Used by
Amazon | DynamoDB | LinkedIn, Mozilla
Facebook | Cassandra | Netflix, Twitter, eBay
Google | Bigtable | Adobe Photoshop

4.1.9 SQL versus NoSQL

Refer Table 4.3 for a few salient differences between SQL and NoSQL.

Table 4.3 SQL versus NoSQL
SQL | NoSQL
Relational database | Non-relational, distributed database
Relational model | Model-less approach
Pre-defined schema | Dynamic schema for unstructured data
Table-based databases | Document-based or graph-based or wide-column-store or key-value-pairs databases
Vertically scalable (by increasing system resources) | Horizontally scalable (by creating a cluster of commodity machines)
Uses SQL | Uses UnQL (Unstructured Query Language)
Not preferred for large datasets | Largely preferred for large datasets
Not a best fit for hierarchical data | Best fit for hierarchical storage as it follows the key-value pair way of storing data, similar to JSON (JavaScript Object Notation)
Emphasis on ACID properties | Follows Brewer's CAP theorem
Excellent support from vendors | Relies heavily on community support
Supports complex querying and data keeping needs | Does not have good support for complex querying
Can be configured for strong consistency | Few support strong consistency (e.g., MongoDB); some others can be configured for eventual consistency (e.g., Cassandra)
Examples: Oracle, DB2, MySQL, MS SQL, PostgreSQL, etc. | Examples: MongoDB, HBase, Cassandra, Redis, Neo4j, CouchDB, Couchbase, Riak, etc.
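The "dynamic schema" row in Table 4.3 is easy to demonstrate in a few lines of Python: a document store will happily hold records with different key sets in one collection. Below, plain dicts and two small helper functions stand in for a MongoDB-style collection; the function and field names are our own illustration, not a real driver API.

```python
# Toy sketch of a schema-less "collection": documents with differing
# key sets live side by side, mimicking a document store such as MongoDB.

collection = []

def insert(doc):
    # no schema check: any dict is accepted as a document
    collection.append(doc)

def find(**criteria):
    # return every document whose fields match all the given criteria
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

insert({"_id": 101, "BookName": "Fundamentals of Business Analytics",
        "AuthorName": "Seema Acharya", "Publisher": "Wiley India"})
insert({"_id": 102, "BookName": "Big Data and Analytics"})  # fewer fields: fine

# Documents with missing fields are still fully queryable.
print(find(_id=102))
print(find(Publisher="Wiley India"))
```

In an RDBMS the second record would either need NULL columns or a schema migration; here the two shapes simply coexist, which is exactly the flexibility (and the lost uniformity) the table above contrasts.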