Bda Chapter 1 Techneo

BDA CHAPTER1
Big Data Analytics (MU-Sem.8-IT)
(MU- 22-28) (MB-131) Tech-Neo Publications

Table of Contents

Syllabus
Introduction to Big Data, Big Data characteristics, types of Big Data, Traditional vs. Big Data business approach, Big Data Challenges, Examples of Big Data in Real Life, Big Data Applications.
Self-learning Topics : Identification of Big Data applications and its solutions.

1.1 Introduction to Big Data and Hadoop
1.1.1 What is Big Data ?
1.1.2 Sources of Big Data
1.2 Big Data Characteristics
1.2.1 Volume
1.2.2 Variety
1.2.3 Veracity
1.2.4 Velocity
1.2.5 Value
1.2.6 Visualization
1.2.7 Virality
1.3 Types of Big Data
1.3.1 Type #1 : Unstructured
1.3.1(A) Characteristics of Unstructured Data
1.3.1(B) Sources of Unstructured Data
1.3.1(C) Advantages and Disadvantages of Unstructured Data
1.3.2 Type #2 : Structured
1.3.2(A) Characteristics of Structured Data
1.3.2(B) Sources of Structured Data
1.3.2(C) Advantages of Structured Data
1.3.3 Type #3 : Semi Structured
1.3.3(A) Characteristics of Semi-structured Data
1.3.3(B) Sources of Semi-structured Data
1.4
1.5 Traditional vs. Big Data business approach
1.6 Examples of Big Data Applications
1.7 Big Data Challenges
1.8 Examples of Big Data in Real Life

Introduction to Big Data

1.1 Introduction to Big Data and Hadoop
(1) Nowadays the amount of data created by various advanced technologies like social networking sites, e-commerce, etc. is very large.
It is really difficult to store such huge data using traditional data storage facilities.
(2) Until 2003, the size of data produced was 5 billion gigabytes. If this data were stored on disks, it could fill an entire football field. In 2011, the same amount of data was created every two days, and in 2013 it was created every ten minutes. This is a really tremendous rate.
(3) In this topic, we will discuss big data at a fundamental level and define common concepts related to big data. We will also look in depth at some of the processes and technologies currently being used in this field.

1.1.1 What is Big Data ?
Big Data is a massive collection of data that continues to grow dramatically over time.

1.1.2 Sources of Big Data
3. Video sharing portals : Video sharing portals like YouTube, Vimeo, etc. contain millions of videos, each of which requires a lot of memory to store.
4. Search Engine Data : Search engines like Google and Yahoo hold a large amount of metadata regarding various sites.
5. Transport Data : Transport data contains information about the model, capacity, distance and availability of various vehicles.
6. Banking Data : The big giants in the banking domain like SBI or ICICI hold a large amount of data regarding the huge number of transactions of account holders.
Fig. 1.1.1 : Sources of big data

1.2 Big Data Characteristics
(1) Volume represents the amount of data, which is growing at a high rate, i.e. data volume in petabytes.
(2) Value refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
(3) Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
(4) Visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
(5) Variety refers to the different data types, i.e. various data formats like text, audio, video, etc.
(6) Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
(7) Virality describes how quickly information spreads across people-to-people (P2P) networks.

1.2.1 Volume
• As follows from the name, big data refers to enormous amounts of information. We are talking not about gigabytes but about terabytes and petabytes of data.
• The IoT (Internet of Things) is creating exponential growth in data. The volume of data is projected to change significantly in the coming years.
• Hence, 'Volume' is one characteristic which needs to be considered while dealing with Big Data.
Fig. : Volume [Data at Rest] (Terabytes, Petabytes; Records/Arch; Table/Files; Distributed)

1.2.2 Variety
• Variety refers to heterogeneous sources and the nature of data, both structured and unstructured.
• Data comes in different formats, from structured, numeric data in traditional databases to unstructured text documents, emails, videos, audio, stock ticker data and financial transactions.
• This variety of unstructured data poses certain issues for storing, mining and analysing data.
• Organizing the data in a meaningful way is no simple task, especially when the data itself changes rapidly.
• Another challenge of Big Data processing goes beyond the massive volumes and increasing velocities of data: it lies in manipulating the enormous variety of these data.
Fig. : Variety [Data in many Forms] (Structured, Unstructured, Text, Multimedia)

1.2.3 Veracity
• Veracity describes whether the data can be trusted. Veracity refers to the uncertainty of available data.
• Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
• Hygiene of data in analytics is important because otherwise you cannot guarantee the accuracy of your results.
• Because data comes from so many different sources, it is difficult to link, match, cleanse and transform data across systems.
• However, analysis is useless if the data being analysed is inaccurate or incomplete.
• Veracity is all about making sure the data is accurate, which requires processes to keep bad data from accumulating in your systems.
Fig. : Veracity [Data in Doubt] (Trustworthiness, Authenticity, Accurate, Availability)

1.2.4 Velocity
• Velocity is the speed at which data grows, is processed and becomes accessible.
• Data flows in from sources like business processes, application logs, networks, social media sites, sensors, mobile devices, etc.
• The flow of data is massive and continuous.
• Although most data is warehoused before analysis, there is an increasing need for real-time processing of these enormous volumes.
• Real-time processing reduces storage requirements while providing more responsive, accurate and profitable responses.
• Data should be processed fast, by batch or in a stream-like manner, because it keeps growing every year.
Fig. : Velocity [Data in Motion] (Streaming, Batch, Real / Near Time, Processes)

1.2.5 Value
It refers to turning data into value. By turning accessed big data into value, businesses may generate revenue.
Value is the end game. After addressing volume, velocity, variety, variability, veracity, and visualization, which takes a lot of time, effort and resources, you want to be sure your organization is getting value from the data.
For example, data that can be used to analyze consumer behavior is valuable for your company because you can use the research results to make individualized offers.
Fig. : Value [Data into Money] (Statistical, Events, Correlations)

1.2.6 Visualization
Big data visualization is the process of displaying data in charts, graphs, maps, and other visual forms.
It is used to help people easily understand and interpret their data at a glance, and to clearly show trends and patterns that arise from this data.
Raw data comes in different formats, so creating data visualizations is a process of gathering, managing, and transforming data into a format that is most usable and meaningful.
Big data visualization makes your data as accessible as possible to everyone within your organization, whether they have technical data skills or not.
Fig. : Visualization [Data Readable] (Readable, Accessible, Presentation, Visual Forms)

1.2.7 Virality
Virality describes how quickly information spreads across people-to-people (P2P) networks. It measures how quickly data is spread and shared to each unique node. Time is a determining factor along with the rate of spread.
Fig. : Virality [Data Spread] (Shared, Rate of Spread)

1.3 Types of Big Data
There are three types of Big Data :
1. Unstructured
2. Structured
3. Semi-structured

1.3.1 Type #1 : Unstructured
Any data with unknown form or structure is classified as unstructured data. In addition to its size being huge, unstructured data poses multiple challenges in terms of processing it to derive value out of it.
A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc., like the results of a Google search.
Nowadays organizations have a wealth of data available with them, but unfortunately they don't know how to derive value out of it since this data is in its raw form or unstructured format.
Fig. : Unstructured data (Human Generated Data, Machine Generated Data)
Unstructured - Example : The output returned by 'Google Search'

1.3.1(A) Characteristics of Unstructured Data
• Data neither conforms to a data model nor has any structure.
• Data cannot be stored in the form of rows and columns as in databases.
• Data does not follow any semantics or rules.
• Data lacks any particular format or sequence.
• Data has no easily identifiable structure.
• Due to the lack of identifiable structure, it cannot be used by computer programs easily.

1.3.1(B) Sources of Unstructured Data
(1) Web pages
(2) Images (JPEG, GIF, PNG, etc.)
(3) Videos
(4) Memos
(5) Reports
(6) Word documents and PowerPoint presentations
(7) Surveys

1.3.1(C) Advantages and Disadvantages of Unstructured Data
Advantages
1. It supports data which lacks a proper format or sequence.
2. The data is not constrained by a fixed schema.
3. It is very flexible due to the absence of a schema.
4. Data is portable.
5. It is very scalable.
6. It can deal easily with the heterogeneity of sources.
7. This type of data has a variety of business intelligence and analytics applications.
Disadvantages
1. It is difficult to store and manage unstructured data due to the lack of schema and structure.
2. Indexing the data is difficult and error prone due to the unclear structure and the lack of pre-defined attributes. As a result, search results are not very accurate.
3. Ensuring security of the data is a difficult task.

1.3.2 Type #2 : Structured
• Any data that can be stored, accessed and processed in a fixed format is termed "structured" data.
• Over time, computer science has achieved great success in developing techniques for working with such data (where the format is well known in advance) and also deriving value out of it.
• However, issues arise when the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
• Data stored in a relational database management system is one example of structured data.
• Structured data is data which conforms to a data model, has a well-defined structure, follows a consistent order and can be easily accessed and used by a person or a computer program.
• Structured data is usually stored in well-defined schemas such as databases.
• It is generally tabular, with columns and rows that clearly define its attributes.
• SQL (Structured Query Language) is often used to manage structured data stored in databases.

1.3.2(A) Characteristics of Structured Data
• Data conforms to a data model and has an easily identifiable structure.
• Data is stored in the form of rows and columns. Example : Database.
• Data is well organised, so the definition, format and meaning of the data are explicitly known.
• Data resides in fixed fields within a record or file.
• Similar entities are grouped together to form relations or classes.
• Entities in the same group have the same attributes.
• It is easy to access and query, so the data can be easily used by other programs.
• Data elements are addressable, so they are efficient to analyse and process.

1.3.2(B) Sources of Structured Data
(1) SQL Databases
(2) Spreadsheets such as Excel
(3) OLTP Systems
(4) Online forms
(5) Sensors such as GPS or RFID tags
(6) Network and Web server logs
(7) Medical devices

1.3.2(C) Advantages of Structured Data
1. Structured data has a well-defined structure that helps in easy storage and access of data.
2. Data can be indexed based on text strings as well as attributes. This makes search operations hassle-free.
3. Data mining is easy, i.e. knowledge can be easily extracted from the data.
4. Operations such as updating and deleting are easy due to the well-structured form of the data.
5. Business intelligence operations such as data warehousing can be easily undertaken.
6. It is easily scalable in case there is an increment of data.
7. Ensuring security of the data is easy.
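The characteristics and advantages listed above (rows and columns, indexing on attributes, easy update and delete, SQL as the query language) can be sketched in a few lines of Python with the standard-library `sqlite3` module. This is only a minimal illustration: the table name `employee`, its column names, the index name and the sample rows are hypothetical and are not taken from the text.

```python
import sqlite3


def build_employee_db():
    """Create an in-memory database holding a small, hypothetical employee table."""
    conn = sqlite3.connect(":memory:")
    # A fixed schema: every row has the same, explicitly typed attributes.
    conn.execute(
        "CREATE TABLE employee ("
        " emp_id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " department TEXT NOT NULL,"
        " salary INTEGER NOT NULL)"
    )
    # Hypothetical sample rows, for illustration only.
    rows = [
        (1, "A", "FINANCE", 500000),
        (2, "B", "ADMIN", 250000),
        (3, "C", "SALES", 350000),
    ]
    conn.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    # Because attributes are pre-defined, an index on one of them is straightforward.
    conn.execute("CREATE INDEX idx_employee_department ON employee (department)")
    conn.commit()
    return conn


def names_in_department(conn, department):
    """Return employee names in one department, ordered by id."""
    cursor = conn.execute(
        "SELECT name FROM employee WHERE department = ? ORDER BY emp_id",
        (department,),
    )
    return [row[0] for row in cursor.fetchall()]


conn = build_employee_db()
print(names_in_department(conn, "FINANCE"))

# Updating and deleting address rows directly through their attributes.
conn.execute("UPDATE employee SET salary = ? WHERE emp_id = ?", (300000, 2))
conn.execute("DELETE FROM employee WHERE emp_id = ?", (3,))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0])
```

The sketch uses an in-memory database so that it leaves no files behind; a real system would connect to a persistent relational database instead.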
Structured - Example : Employee Table
1 | XYX | MALE | FINANCE |
2 | ABC | MALE | ADMIN | 250000
3 | PQR | | SALES | 350000
4 | MNR | FEMALE | FINANCE | 600000

1.3.3 Type #3 : Semi Structured
• Semi-structured is the third type of big data. Semi-structured data pertains to data containing both of the forms of data mentioned above, that is, structured and unstructured.
• To be precise, it refers to data that, although it has not been classified under a particular repository (database), contains vital information or tags that segregate individual elements within the data.
• Web application data, which is unstructured, consists of log files, transaction history files, etc. Online transaction processing systems are built to work with structured data, wherein data is stored in relations (tables).
• Semi-structured data is data that does not conform to a data model but has some structure. It lacks a fixed or rigid schema. It is data that does not reside in a relational database but that has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database.

1.3.3(A) Characteristics of Semi-structured Data
• Data does not conform to a data model but has some structure.
• Data cannot be stored in the form of rows and columns as in databases.
• Semi-structured data contains tags and elements (metadata) which are used to group data and describe how the data is stored.
• Similar entities are grouped together and organized in a hierarchy.
• Entities in the same group may or may not have the same attributes or properties.
• It does not contain sufficient metadata, which makes automation and management of data difficult.
• The size and type of the same attributes in a group may differ.
• Due to the lack of a well-defined structure, it cannot be used by computer programs easily.
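The characteristics above can be made concrete with a short Python sketch. It uses JSON as the semi-structured format, which is an assumption made for brevity (the chapter itself names XML and other markup languages), and the records and field names are hypothetical. The keys act as tags, the nested `contact` object forms a hierarchy, and the two records in the same group do not carry the same attributes. The `flatten` helper is one possible way to picture the "with some processing, it can be stored in a relational database" step, not a prescribed method.

```python
import json

# Hypothetical records: same group, but not the same attributes.
RAW = """
[
  {"id": 1, "name": "A", "contact": {"email": "a@example.com"}},
  {"id": 2, "name": "B", "contact": {"phone": "000-0000"}, "tags": ["sales"]}
]
"""


def attribute_sets(records):
    """Return the set of top-level attribute names carried by each record."""
    return [set(record.keys()) for record in records]


def flatten(record, prefix=""):
    """Flatten nested objects into dotted column names, e.g. contact.email."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


records = json.loads(RAW)

# Attributes shared by every record versus all columns a table would need.
common = set.intersection(*attribute_sets(records))
all_columns = sorted({column for record in records for column in flatten(record)})

print(sorted(common))
print(all_columns)
```

Any column that a given record lacks would have to be stored as a missing value, which is one reason semi-structured data does not map cleanly onto fixed rows and columns.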
1.3.3(B) Sources of Semi-structured Data
(1) E-mails
(2) XML and other markup languages
(3) Binary executables
(4) TCP/IP packets
(5) Zipped files
(6) Integration of data from different sources
(7) Web pages

1.3.3(C) Advantages and Disadvantages of Semi-structured Data
Advantages
1. The data is not constrained by a fixed schema.
2. It is flexible, i.e. the schema can be easily changed.
3. Data is portable.
4. It is possible to view structured data as semi-structured data.
5. It supports users who cannot express their needs in SQL.
6. It can deal easily with the heterogeneity of sources.
Disadvantages
1. The lack of a fixed, rigid schema makes storage of the data difficult.
2. Interpreting the relationship between data is difficult as there is no separation of the schema and the data.
3. Queries are less efficient compared to structured data.

Semi-structured - Example : The user can see semi-structured data as structured in form, but it is not actually defined with, e.g., a table definition in a relational DBMS.

Properties | Structured data | Semi-structured data | Unstructured data
Technology | It is based on relational database tables | It is based on XML/RDF (Resource Description Framework) | It is based on character and binary data
Transaction management | Matured transaction and various concurrency techniques | Transaction is adapted from DBMS, not matured | No transaction management and no concurrency
Version management | Versioning over tuples, rows, tables | Versioning over tuples or graphs is possible | Versioned as a whole
Flexibility | It is schema dependent and less flexible | It is more flexible than structured data but less flexible than unstructured data | It is more flexible and there is an absence of schema
Scalability | It is very difficult to scale the DB schema | Its scaling is simpler than structured data | It is more scalable
Robustness | Very robust | New technology, not very widespread |
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible | Only textual queries are possible

1.5 Traditional vs. Big Data business approach
1. Traditional Data
• Traditional data is the structured data which is maintained by all types of businesses, from very small to big organizations.
• For managing and accessing the data, Structured Query Language (SQL) is used.
2. Big data
• It deals with large volumes of structured, semi-structured and unstructured data.
• Volume, Velocity, Variety, Veracity and Value are referred to as the 5 V characteristics of big data.
• Big data not only refers to a large amount of data; it also refers to extracting meaningful data by analyzing huge and complex data sets.

Sr. No. | Traditional Data | Big Data
1. | Traditional data is generated at the enterprise level. | Big data is generated at the outside and enterprise level.
2. | Its volume ranges from Gigabytes to Terabytes. | Its volume ranges from Petabytes to Zettabytes or Exabytes.
3. | A traditional database system deals with structured data. | A big data system deals with structured, semi-structured and unstructured data.
4. | Traditional data is generated per hour or per day or more. | Big data is generated more frequently, mainly per second.
5. | The traditional data source is centralized and it is managed in centralized form. | The big data source is distributed and it is managed in distributed form.
6. | Data integration is very easy. | Data integration is very difficult.
7. | A normal system configuration is capable of processing traditional data. | A high system configuration is required to process big data.
8. | The size of the data is very small. | The size is more than the traditional data size.
9. | Traditional database tools are required to perform any database operation. | Special kinds of database tools are required to perform any database operation.
10. | Normal functions can manipulate data. | Special kinds of functions can manipulate data.
11. | Its data model is strict schema based and it is static. | Its data model is flat schema based and it is dynamic.
12. | Traditional data is stable, with known inter-relationships. | Big data is not stable, with unknown relationships.
13. | Traditional data is in manageable volume. | Big data is in huge volume, which becomes unmanageable.
14. | It is easy to manage and manipulate the data. | It is difficult to manage and manipulate the data.
15. | Its data sources include ERP transaction data, CRM transaction data, financial data, organizational data, web transaction data, etc. | Its data sources include social media, device data, sensor data, video, images, audio, etc.

1.6 Examples of Big Data Applications
Big data applications :
1. Fraud detection
2. IT log analytics
3. Call center analytics
4. Social media analysis
Fig. 1.6.1 : Big data applications

1. Fraud detection
• Fraud detection is a Big Data application example for businesses which have operations like any type of claims or transaction processing.
• Many times the detection of fraud is concluded long after the fact. At this point the damage has already been done; all that is left is to minimize the harm and revise policies to prevent it in future.
• Big Data platforms can analyze the claims and transactions of businesses. They identify large-scale patterns across many transactions or detect anomalous behaviour of an individual user. This helps to avoid fraud.

2. IT log analytics
• An enormous quantity of logs and trace data is generated in IT solutions and IT departments.
• Many times such data goes unexamined: organizations simply don't have the manpower or resources to go through all such information.
• Big data has the ability to quickly identify large-scale patterns to help in diagnosing and preventing problems. This helps organizations with a large IT department.

3. Call center analytics
• Now we turn to the customer-facing Big Data application examples, of which call center analytics are particularly powerful.
• Without a Big Data solution, much of the insight that a call center can provide will be ignored or exposed only later.
• By making sense of time/quality resolution metrics, Big Data solutions are able to identify recurring problems or customer and staff behaviour patterns.
• Big data can also capture and process the call content itself.

4. Social media analysis
• With the help of social media we can observe real-time insights into how the market is responding to products and campaigns.
• With the help of these insights, it is possible for companies to adjust their pricing, promotion, and campaign placement to get optimal results.

1.7 Big Data Challenges
1. Sharing and Accessing Data
• Perhaps the most frequent challenge in big data efforts is the inaccessibility of data sets from external sources.
• Sharing data can cause substantial challenges. These include the need for inter- and intra-institutional legal documents.
• Accessing data from public repositories leads to multiple difficulties.
• The data must be available in an accurate, complete and timely manner, because data in a company's information system can be used to make accurate and timely decisions only if it is available in this manner.

2. Privacy and Security
• It is another most important challenge with Big Data. This challenge includes sensitive, conceptual, technical as well as legal significance.
• Most organizations are unable to maintain regular checks due to the large amounts of data generated. However, security checks and observation should be performed in real time because this is most beneficial.
• There is some information about a person which, when combined with large external data sets, may reveal facts that the person considers private and would not want the data owner to know.
• Some organizations collect information about people in order to add value to their business. This is done by gaining insights into their lives that they are unaware of.

3. Analytical Challenges
• There are some huge analytical challenges in big data, which raise questions such as: How to deal with a problem if the data volume gets too large? How to find the important data points? How to use data to the best advantage?
• The large amounts of data on which this type of analysis is to be done can be structured (organized data), semi-structured (semi-organized data) or unstructured (unorganized data).
• There are two techniques through which decision making can be done :
1. Either incorporate massive data volumes in the analysis,
2. Or determine upfront which Big data is relevant.

4. Technical challenges
Quality of data
1. The collection and storage of a large amount of data comes at a cost. Big companies, business leaders and IT leaders always want large data storage.
2. For better results and conclusions, Big data focuses on quality data storage rather than on having irrelevant data.
3. This further raises the questions of how it can be ensured that data is relevant, how much data would be enough for decision making, and whether the stored data is accurate or not.
Fault tolerance
1. Fault tolerance is another technical challenge, and fault-tolerant computing is extremely hard, involving intricate algorithms.
2. Nowadays, new technologies like cloud computing and big data are designed so that whenever a failure occurs, the damage done stays within an acceptable threshold, that is, the whole task should not have to begin from scratch.
Scalability
1. Big data projects can grow and evolve rapidly. The scalability issue of Big Data has led towards cloud computing.
2. It leads to various challenges, such as how to achieve the goal of each workload effectively.
3. It also requires dealing with system failures in an efficient manner. This again leads to a big question: what kinds of storage devices are to be used?

1.8 Examples of Big Data in Real Life
(1) In the Education Industry
The University of Alabama has more than 38,000 students and an ocean of data. In the past, when there were no real solutions to analyze that much data, some of it seemed useless. Now, administrators can use analytics and data visualizations on this data to draw out patterns of students, revolutionizing the university's operations, recruitment, and retention efforts.

(2) In Healthcare
Wearable devices and sensors have been introduced in the healthcare industry which can provide a real-time feed to the electronic health record of a patient. One such technology comes from Apple, which has come up with Apple HealthKit, CareKit, and ResearchKit. The main goal is to empower iPhone users to store and access their real-time health records on their phones.

(3) In Government Sector
The Food and Drug Administration (FDA), which runs under the jurisdiction of the Federal Government of the USA, leverages the analysis of big data to discover patterns and associations in order to identify and examine the expected or unexpected occurrences of food-based infections.

(4) In Media and Entertainment Industry
Spotify, an on-demand music-providing platform, uses Big Data Analytics, collects data from all its users around the globe, and then uses the analyzed data to give informed music recommendations and suggestions to every individual user.
Amazon Prime, which offers videos and music in a one-stop shop, is also big on using big data.

(5) In Weather Patterns
IBM Deep Thunder, which is a research project by IBM, provides weather forecasting through high-performance computing of big data. IBM is also assisting Tokyo with improved weather forecasting for natural disasters or predicting the probability of damaged power lines.

(6) In Transportation Industry
... customer for more than 25 years.

(8) In Marketing
... new offers and advertisements.

(9) In Business Insights
... is using Big Data to understand user behavior, the type of content users like, popular movies on the website, similar content that can be suggested to the user, and which series or movies it should invest in.

(10) In Space Sector
NASA is collecting data from different satellites and rovers about the geography, atmospheric conditions, and other factors of Mars for its upcoming mission. It uses big data to manage all that data and analyzes it to run simulations.

Chapter Ends...
