Bda Unit 1 - Big data analytics
4. Define Big Data Analytics.
Big data analytics is the process of collecting, examining, and analyzing large amounts of data to discover market trends, insights, and patterns that can help companies make better business decisions. The information is available quickly and efficiently, so that companies can be agile in crafting plans to maintain their competitive advantage.

5. Under which conditions is data called "Big Data"?
A huge amount of data, which can be structured or unstructured in nature, and which cannot be easily processed by traditional database systems such as RDBMS, can be termed Big Data. Its size will not be in terms of Megabytes or Gigabytes; it would actually be in terms of Zettabytes or Petabytes.

6. Give the difference between data analysis and data reporting.
Data analysis - the process of examining data with the goal of answering business questions, producing information that supports decision-making. An analysis can reveal powerful insights into why something is happening and what you can do about it.
Data reporting - the process of organizing data into charts and tables in order to track the performance of your business. This raw view of what is happening with your data keeps you aware of the state of your business. When your business is not reaching one of its goals, your reporting charts should alert you to the issue, prompting you to respond.

7. List the characteristics of Big Data. [AU NOV/DEC-22]
Here are the five Vs of Big Data:
1. Variety - Variety makes Big Data really big. Big Data comes from a great variety of sources and is generally of three types: structured, semi-structured and unstructured data.
2. Velocity - Velocity refers to the speed with which data is generated. High-velocity data is generated with such a pace that it requires distinct (distributed) processing techniques.
3. Volume - Volume refers to the size of the data sets that need to be analyzed and processed, which are now frequently larger than terabytes and petabytes.
4. Veracity - Veracity refers to the quality of the data that is being analyzed. High-veracity data has many records that are valuable to analyze and that contribute in a meaningful way to the overall results. Low-veracity data, on the other hand, contains a high percentage of meaningless data; the non-valuable portion of these data sets is referred to as noise.
5. Value - Value is an essential characteristic of big data. It is not just any data that we process or store; it is valuable and reliable data that we store, process, and also analyze.

8. Tabulate the difference between analysis and reporting. [AU NOV/DEC-22]

Analysis                | Reporting
Provides answers        | Provides data
Provides what is needed | Provides what is asked for
Typically customized    | Typically standardized
Involves a person       | Does not involve a person
Extremely flexible      | Fairly inflexible

9. The 5 Ps are key marketing elements used to position a business; identifying and creating the optimal mix of these 5 Ps determines the value a business can generate. While advances in retail have transformed the producer-oriented business into a customer-centric model, algorithms are helping businesses combine business sense with data to redefine every "P".

DS4015 R2021 BIG DATA ANALYTICS UNIT I

10. Define the terms Structured, Semi-Structured and Unstructured Data.
Structured data - Structured data is information that has been formatted and transformed into a well-defined data model. The raw data is mapped into predesigned fields that can then be extracted and read through SQL easily. SQL relational databases, consisting of tables with rows and columns, are the perfect example of structured data.
Semi-structured data - Semi-structured data, or partially structured data, is another category between structured and unstructured data. It is a type of data that has some consistent and definite characteristics. Example: XML.
Unstructured data - Unstructured data is defined as data present in absolute raw form.
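As a small illustration of these three categories (a sketch with made-up values, not taken from the notes), the same record can be held as a SQL row, an XML fragment, and free text:

```python
# Hypothetical example: one "person" record in structured, semi-structured,
# and unstructured form.
import sqlite3
import xml.etree.ElementTree as ET

# Structured: rows and columns in a relational table, queried via SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE person (name TEXT, age INTEGER)")
db.execute("INSERT INTO person VALUES ('Asha', 31)")
structured = db.execute("SELECT name, age FROM person").fetchone()

# Semi-structured: XML carries tags (some consistent characteristics)
# but no rigid relational schema.
root = ET.fromstring("<person><name>Asha</name><age>31</age></person>")
semi_structured = (root.findtext("name"), int(root.findtext("age")))

# Unstructured: raw text; extracting the same fields needs parsing or NLP.
unstructured = "Asha, who turned 31 last May, moved to Chennai."

print(structured)       # ('Asha', 31)
print(semi_structured)  # ('Asha', 31)
```

The point of the sketch: the structured row is directly queryable, the XML needs tag-aware parsing, and the free text would need text-mining techniques to yield the same fields.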
This data is difficult to process and analyze because it has no predefined structure.

Basics of probability:
(a) Axioms of probability: P(A) ≥ 0 for every event A; P(Ω) = 1; additivity: if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B).
(b) Conditional probability and independence: The probability that event A occurs may be influenced by information about event B. The probability of event A, given that B will occur or has occurred, is called the conditional probability of A given B, denoted by P(A | B). It follows from the axioms of probability that P(A | B) = P(A ∩ B) / P(B).
(c) Random variables: A random variable X is a function from the outcome space Ω to R. Example 3: Consider the random experiment of tossing a coin twice. The outcome space is Ω = {(H,H), (H,T), (T,H), (T,T)}.
(d) Probability distribution: A probability function p assigns a probability p(x), i.e. P(X = x), to every possible value x of a discrete random variable X. From the axioms of probability it follows that Σx p(x) = 1.
(e) Entropy: The entropy of a random variable measures the uncertainty about its value; the information provided by observing the variable equals its entropy, H(X) = −Σx p(x) log p(x).
(f) Expectation: For a discrete random variable, the expected value or mean is E(X) = Σx x·p(x). Example 5: Consider once more the coin-tossing experiment and let X be the number of heads. The expected value or mean of X is E(X) = 1/2(1) + 1/4(2) = 1.
(g) Conditional probability distributions and expectation: For a discrete random variable X we define a conditional probability distribution …

Bayes' Rule
Bayes' rule shows how probabilities change in the light of evidence. It is a very important tool in Bayesian statistical inference. Let B1, B2, … be a partition of the outcome space. From the axioms of probability,
P(Bi | A) = P(A | Bi) P(Bi) / Σj P(A | Bj) P(Bj)

6. Discuss Different Sources of Data in Big Data.
Data is the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or other media. Big Data is also data, but with a huge size: the term describes a collection of data that is huge in size and yet growing exponentially with time, so large and complex that traditional data management tools cannot store or process it efficiently. There are some of many sources of Big Data:
1. Sensors/meters and activity records from electronic devices: These data are produced in real time; the number and periodicity of observations are variable, sometimes depending on a lapse of time, at other times on the occurrence of some event (for example a car passing by the vision angle of a camera). Quality of this kind of source depends mostly on the capacity of the sensor to take accurate measurements in the way it is expected.
2. Social interactions: This is data produced by human interactions through a network, most commonly in social networks. These data imply qualitative and quantitative aspects which are of some interest to be measured. Quantitative aspects are easier to measure than qualitative aspects: the first ones imply counting occurrences and geographical or temporal characteristics, while the qualitative ones depend on the accuracy of the algorithms applied to extract the meaning of the contents, which are commonly found as unstructured text written in natural language; examples of analyses made on this data are sentiment analysis, trend topics analysis, etc.
3. Business transactions: Data produced as a result of business activities, for example companies recording their sales. When these kinds of records are not stored in relational databases they may have a more or less defined structure, but if we need to put the data they contain into a relational database we need to apply some process to distribute that data over different tables (in accordance with relational database theory; the data may be in Excel records, etc.). Another problem, as previously said, is that data may be being produced too fast, so we would need to use the data processing it as it is, without putting it in a relational database, perhaps discarding observations (by which criteria?) or using parallel processing. Quality of business transaction data is tightly related to the capacity to get representative records and to process them.
4. Electronic files: These refer to unstructured documents which are stored or published as electronic files, like Internet pages, videos, audio files, PDF files, etc. They can have contents of special interest but are difficult to exploit; different techniques must be used, like text mining, pattern recognition, and so on. Quality of our measurements will mostly depend on the capacity to extract and correctly interpret all the representative information from those documents.
5. Broadcastings: Mainly referred to video and audio produced in real time. Getting statistical data from the contents of this kind of electronic data is by now very complex and implies large computational and communications power. Once the conversion of such contents to "digital-data" contents is solved, we will have complications similar to those we can find in social interactions.

6. What are the processes involved in Intelligent Data Analysis (IDA)?
Intelligent Data Analysis (IDA) is one of the hot topics in the field of artificial intelligence and information analysis. Intelligent data analysis reveals implicit, previously unknown and potentially valuable information or knowledge from large amounts of data. The process of IDA generally consists of the following stages:
(1) data preparation
(2) rule finding or data mining
(3) result validation and explanation
Data preparation also involves data cleaning, since errors may be introduced when data are entered and stored. Data cleaning is the process of preventing and correcting these errors; common tasks include record matching, identifying inaccuracy of data, assessing the overall quality of existing data, deduplication, and column segmentation.
Result validation requires examining the discovered rules, and result explanation is giving intuitive, reasonable and understandable descriptions using logical reasoning. As the goal of intelligent data analysis is to extract useful knowledge, the process demands a combination of extraction, analysis, conversion, classification, organization, reasoning, and so on. It is challenging to choose appropriate methods to resolve the difficulties encountered in the process. Intelligent data analysis methods and tools, as well as the authenticity of obtained results, pose continued challenges.
In general, data analysis can be classified as:
(i) Descriptive/Explanatory Analysis - basically deals with what is inside the data. It deals with analyzing your datasets and deriving insights, not with coming up with the right recommendation/action to solve a particular problem. Normally it involves the following steps:
* Bi-variate/Multi-variate analysis - deals with simple statistical procedures such as mean, median, standard deviation, correlation, range, variance etc. to understand data variables.
Some of the key things you can do from this analysis are outlier identification, finding relationships between variables, and variable summaries and insights.
* Data visualization - data plots between different variables to derive insights.
(ii) Inferential analysis - The inferential analysis uses statistical tests to see whether an observed pattern is due to a random sample of data, or perhaps takes place because of some intervention effect from outside.

Some tools for intelligent data analysis are:
See5 - a program for analyzing data and generating classifiers in the form of decision trees or rule sets.
Cubist - analyzes data and generates rule-based piecewise linear models, with an associated linear expression for computing a target value.
ILLM - the tool constructs classification models in the form of rules, which reveal relations hidden in data.
Intelligent data analysis results are evaluated by measures such as:
* Absolute & relative accuracy
* Sensitivity & specificity
* False positives & false negatives
* Error rate
* Reliability of rules

8. Write down the different analytic processes and tools.
Data Analytics (DA) is the process of examining data sets in order to draw conclusions about the information they contain. In the context of IT, data in its raw form may be redundant or inconsistent.
Data Discovery - Data discovery is the process of extracting insight from data. The extraction is generally performed on dedicated systems, and the data is typically presented in a visual format such as a dashboard. Much of the time we may not know what we want to know, or what the data will look like, until we explore it.
Iteration - A process for arriving at a decision or a desired result by repeating rounds of analysis. The objective is to bring the desired result closer to discovery with each repetition (iteration). The iterative process can be revocable or irrevocable.
Flexible capacity - Because of the iterative nature of big data analysis, be prepared to spend more time and utilize more resources to solve problems.
Data Mining - Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related, also known as "big data") in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings
by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction, and predictive data mining has the most direct business applications. The process of data mining consists of three stages:
1. the initial exploration
2. model building or pattern identification with validation/verification
3. deployment (i.e., the application of the model to new data in order to generate predictions)
Decision Management - addresses how analytics improve business decisions. It:
* Clearly identifies the potential business value of Big Data
* Links improved decisions to key business objectives
* Determines what kinds of analytics will improve decisions
* Identifies automation opportunities
Analysis and process tools are broadly classified into the following categories:
(i) Data discovery tools - Data discovery is a process that enables viewing data or applying guided advanced analytics to assist business users. The process uses specialized software, including data visualization and business intelligence software. The term implies analysis functionality and not just traditional reporting. While no single discovery solution exists, software seekers can combine tools for analysis.
(ii) Business Intelligence tools - a set of methodologies and technologies to prepare, process, and present data so that the data is turned into actionable business insight, enabling users to make more effective data-driven decisions. (Fig 1.3 Business intelligence tools) The set of methodologies and technologies used for business intelligence is widely diverse, depending on the purpose of the solution.
(iii) In-Database Analytics tools - In-database analytics refers to a model of analysis where data processing is performed within a database to eliminate the overhead of moving data to a separate analytic application; the analytic logic is built into the database itself instead of a separate application. Advantages of in-database analytics include parallel processing, analytic optimization and partitioning.
(iv) Decision Management tools - Businesses increasingly are relying on technology to help them make business decisions. This type of decision making makes it possible for companies to improve and automate the processes they use.
However, deploying this technology effectively remains a challenge. Big Data tools are based on the Hadoop solution:
MapReduce - The MapReduce paradigm provides the means to break a large task into smaller tasks, each working on one portion of the input which can be run independently of the other input parts. In other words, the work can be easily distributed over the cluster.
Hadoop Distributed File System (HDFS) - is a file system designed to distribute data across a cluster to take advantage of the parallel processing of MapReduce. It is designed to run on common low-cost hardware, not only on supercomputers. A dedicated node, apart from the data nodes, manages the namespace for the range of machines.
Pig - provides a high-level programming technique to conduct distributed operations without writing MapReduce tasks. When a Pig script is executed, the Pig commands are automatically translated into MapReduce operations in the background, saving the developer the time which he could have spent writing MapReduce tasks.
Hive - abstracts from MapReduce operations by providing a high-level programming environment. The main difference between Pig and Hive is that the Hive language, HiveQL (Hive Query Language), is designed to query data (similar to SQL) rather than to be considered as a scripting language. A Hive table structure consists of rows and columns; the rows typically correspond to a record, transaction or a particular entity, and the corresponding columns represent the attributes or characteristics for each row. Hive is typically used to apply some structure to unstructured data.
HBase - Unlike Pig and Hive, which are intended for batch applications, Apache HBase is capable of providing real-time read and write access to datasets with billions of rows and millions of columns. HBase is a technology which provides a distributed data store across a cluster. As with other Hadoop projects, HBase is built upon HDFS to distribute the workload over the nodes in the cluster. An HBase table is composed of rows, columns and a third dimension intended to maintain versions of the contents of a row and column intersection over time. HBase uses a key/value structure: each value is the data to be stored at the intersection of the row, column, and version.
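The row/column/version layout just described can be mimicked with a toy key/value store (a plain Python dictionary, not the real HBase API; all names are illustrative):

```python
# Toy sketch of a versioned key/value store: each cell value lives at a
# (row, column, version) intersection, as in the HBase data model above.
store = {}

def put(row, col, version, value):
    # store a value at the row/column intersection for a given version
    store[(row, col, version)] = value

def get(row, col, version):
    # read the value back for a specific version of the cell
    return store[(row, col, version)]

put("user1", "city", 1, "Chennai")
put("user1", "city", 2, "Madurai")  # newer version of the same cell

print(get("user1", "city", 1))  # Chennai
print(get("user1", "city", 2))  # Madurai
```

Both versions of the cell coexist, which is the point of the third dimension: reads can ask for the contents of a cell as of a particular version.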
Mahout - provides Java code that implements machine learning algorithms for data analysis, such as classification, clustering and filtering.
R Programming - R is a leading analytics tool in the industry, widely used for statistics and data modeling. It can easily manipulate your data and present it in many ways. It compiles and runs on a wide variety of platforms viz. UNIX, Windows and macOS. You can browse the packages by categories, and R also provides tools to install packages as per user requirement, which can also be well assembled with Big Data.
Spark - also includes a library, MLlib, that provides machine learning algorithms for repetitive data science techniques like classification, regression, clustering, collaborative filtering, etc.
Flume - is a distributed and reliable service for efficiently collecting, aggregating and moving large amounts of log data, with a simple and easy-to-use architecture and tunable reliability mechanisms.

9. Explain the different types of sampling distributions. [AU APR/MAY-23]
Sampling is a process used in statistical analysis in which a predetermined number of observations are taken from a larger population. The methodology used to sample from a larger population depends on the type of analysis being performed, but may include simple random sampling or systematic sampling.
A sampling distribution is a probability distribution of a statistic obtained through a large number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of that population. (Fig 1.4 example for a small population)
Drawing a probability sample of size m from a population consisting of M units may be a quite complex random experiment. The experiment is simplified considerably by subdividing it into m sub-experiments, consisting of drawing the m consecutive units. In a simple random sample the m consecutive units are drawn with equal probabilities from the units concerned. In random sampling with replacement the sub-experiments (drawing of one unit) are all identical and independent: m times a unit is randomly selected from the entire population.
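A quick simulation sketch of this scheme (hypothetical population values, not the figure from the notes): m = 2 identical, independent draws with replacement, repeated many times, give sample means centred on the population mean.

```python
# Simulate drawing samples of size m = 2 with replacement from a small
# population; each sub-experiment is an independent uniform draw.
import random

random.seed(0)
population = [1, 1, 2, 3]  # M = 4 units with their X-values; mean = 1.75
m = 2

def draw_sample():
    # m identical, independent sub-experiments: pick one unit each time
    return [random.choice(population) for _ in range(m)]

samples = [draw_sample() for _ in range(10_000)]
means = [sum(s) / m for s in samples]
average_of_means = sum(means) / len(means)
print(average_of_means)  # close to the population mean 1.75
```

Because the draws are independent and identically distributed, the sample mean is an unbiased estimator of the population mean, which is what the simulation illustrates.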
We will see that this property simplifies the ensuing analysis considerably. A random sample of size m = 2 is drawn with replacement from this population. For each unit drawn we observe the value of X. This yields two random variables X1 and X2, with identical probability distributions as displayed in the table (Fig 1.5 Probability distribution).
Because the draws are independent, their joint distribution equals the product of their individual distributions: p(x1, x2) = p(x1) p(x2). The probability distribution of a sample of size m = 2 obtained by sampling with replacement from this population then follows (Fig 1.6).

10. Discuss Resampling Distributions. [AU APR/MAY-23]
* A resampling distribution refers to the distribution of a statistic (such as the mean, median, or another measure) obtained from repeated sampling from the observed data itself, rather than assuming a theoretical distribution.
* In big data analytics, where traditional assumptions about distributional forms may not hold due to the sheer volume and complexity of data, resampling provides a practical way to estimate sampling variability.
(i) Cross-Validation
Cross-validation is a resampling technique that is often used for the estimation of the prediction error of a classification or regression function. Squared error is a natural measure of prediction error for regression functions:
PE = E(y − f(x))²
Estimating prediction error on the same data used for model estimation tends to give downward-biased estimates, because the parameter estimates are "fine-tuned" to the peculiarities of the sample. Cross-validation is a technique used in machine learning and statistics to assess how well a model generalizes to new data. It helps in evaluating the performance of a model by testing it on different subsets of the data.
Example of Cross-Validation:
Let's consider an example where we have a dataset of housing prices with features like square footage, number of bedrooms, and location. The task is to predict housing prices from these features using a machine learning model.
1. Dataset Splitting:
   o Typically, the dataset is split into two parts: a training set and a test set. The training set is used to train the model, and the test set is used to evaluate it.
2. K-Fold Cross-Validation:
   o In K-fold cross-validation, the training set is further divided into K subsets (folds). The model is trained K times, each time using K−1 folds for training and the remaining fold as validation data.
For instance, if we choose K=5 (5-fold cross-validation), the data is divided into 5 subsets. The model is trained on 4 subsets (4/5 of the data) and validated on the remaining subset (1/5 of the data). This process is repeated 5 times, each time with a different validation subset.
3. Evaluation:
   o After performing K-fold cross-validation, the performance metric (e.g., accuracy, mean squared error, etc.) is computed for each fold. The average performance across all folds is used as the overall performance estimate.
(ii) Arcing Classifiers (AdaBoost) Explained:
1. Basic Idea:
   o AdaBoost (Adaptive Boosting) is a meta-algorithm that combines multiple weak classifiers to create a strong classifier.
   o The key idea behind AdaBoost is to iteratively train weak learners on different distributions of the data. It assigns higher weights to instances that are misclassified by previous classifiers, thereby focusing subsequent classifiers on the more difficult cases.
2. Procedure:
   o Step 1: Initialize Weights: Assign equal weights to all training instances.
   o Step 2: Train Weak Learner: Train a weak learner (e.g., a decision stump, which is a decision tree with only one split) on the training data using the current weights.
   o Step 3: Evaluate Error: Calculate the error rate (weighted sum of misclassified instances).
   o Step 4: Update Weights: Increase the weights of misclassified instances so that they have more influence in the next iteration, and decrease the weights of correctly classified instances.
   o Step 5: Iterate: Repeat Steps 2-4 for a predetermined number of iterations or until the performance of the ensemble stabilizes.
   o Step 6: Aggregate Predictions: Combine the predictions from all weak learners. The final prediction is typically determined by a weighted majority vote (for classification tasks) or a weighted average (for regression tasks).
3. Advantages:
   o Improved Accuracy: AdaBoost can significantly improve the accuracy of weak classifiers by focusing on difficult instances.
   o Versatility: It can be used with a variety of base learners (weak classifiers), making it versatile across different types of data and tasks.
   o Robustness: AdaBoost is less prone to overfitting compared to individual complex classifiers, especially when used with appropriate regularization.
4. Example:
Let's illustrate AdaBoost with a classification task:
   o Suppose we have a dataset with features and binary labels (0 or 1).
   o AdaBoost would start by assigning equal weights to all training instances.
   o It would then train a weak learner (e.g., a decision stump) on this data.
   o After evaluating its performance, AdaBoost would increase the weights of misclassified instances to emphasize them more in the next iteration.
   o This process continues for multiple iterations, with each subsequent learner focusing more on the instances that previous learners misclassified.
   o Finally, AdaBoost combines the predictions of all weak learners to make the final classification decision.
5. Implementation:
   o AdaBoost can be implemented using machine learning libraries in languages like Python (e.g., scikit-learn). These libraries provide implementations of AdaBoost along with other ensemble methods.
(iii) Prediction Error
Sources of Prediction Error:
   o Bias: Refers to the systematic error that leads to consistently wrong predictions; it can result from an overly simplistic model that fails to capture the underlying patterns in the data.
   o Variance: Refers to the variability of model predictions for a given data observation. High variance can lead to overfitting, where the model performs well on training data but poorly on unseen data.
   o Noise: Refers to the random variability in the data that cannot be explained by the model. Noise contributes to prediction error but is inherent in the data and cannot be reduced by model improvements.
Evaluation Techniques for Prediction Error:
   o Cross-Validation: Divide the data into folds, train on some folds and evaluate on the others, over combinations of these folds. This provides a more robust estimate of prediction error, especially in scenarios where datasets are large and diverse.
   o Bootstrap Resampling: Generate multiple bootstrap samples from the original dataset, train the model on each sample, and evaluate its performance. This is useful for estimating prediction error and assessing model stability.
Challenges in Big Data Analytics:
   o Computational Complexity: Large datasets require scalable algorithms and distributed computing frameworks to handle model training and evaluation.
   o Dimensionality: High-dimensional data can lead to overfitting and increased prediction error if not properly managed through feature selection or dimensionality reduction techniques.
   o Data Quality: Noisy or incomplete data can introduce errors, affecting the accuracy and reliability of results.
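The K-fold cross-validation procedure described earlier can be condensed into a minimal sketch (pure Python, hypothetical numbers; the "model" is just the training-set mean, standing in for a real regressor), scoring each held-out fold with the squared error PE = E(y − f(x))²:

```python
# 3-fold cross-validation on a toy target list: train on K-1 folds,
# score the held-out fold, and average the per-fold errors.
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
K = 3
folds = [ys[i::K] for i in range(K)]  # round-robin split into K folds

fold_errors = []
for k in range(K):
    held_out = folds[k]
    train = [y for j in range(K) if j != k for y in folds[j]]
    prediction = sum(train) / len(train)  # "model": predict the training mean
    mse = sum((y - prediction) ** 2 for y in held_out) / len(held_out)
    fold_errors.append(mse)

cv_estimate = sum(fold_errors) / K  # average error across the K folds
print(cv_estimate)  # 15.0
```

Because each fold is scored by a model that never saw it, the averaged error avoids the downward bias of evaluating on the training data itself.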
Improvement Strategies:
   o Feature Engineering: Identify and create relevant features to improve model performance.
   o Ensemble Methods: Combine predictions from multiple models (e.g., Random Forests, Gradient Boosting Machines) to reduce prediction error and improve robustness.
   o Regularization: Apply techniques such as L1/L2 regularization to prevent overfitting.

11. Explain the steps involved in the Analytics Process with an example. [AU APR/MAY-23]
The analytic process involves several systematic steps to extract useful insights from data. Here is an outline of these steps, illustrated with an example of predicting house prices:
1. Define the Objective:
   o Example: A real estate company wants to predict house prices based on features such as location, size, number of bedrooms, and age of the house.
2. Data Collection:
   o Example: Gather data on houses sold in the past five years, including their sale prices and features (location, size, number of bedrooms, age of the house).
3. Data Preparation and Cleaning:
   o Example: Clean the collected data by handling missing values, correcting inaccuracies, and ensuring consistency. For instance, if some records lack the number of bedrooms, decide whether to impute these values or remove the records.
4. Exploratory Data Analysis (EDA):
   o Example: Analyze the data to understand the distributions, identify patterns, and detect outliers. Create visualizations such as scatter plots of house size vs. price, and box plots of prices by location.
5. Feature Selection and Engineering:
   o Example: Select relevant features and engineer new ones where necessary, such as the distance to the city center or the average price in the neighborhood.
… model training …
8. Model Evaluation:
   o Example: Evaluate
