Big Data Analytics Unit-1

Anna university

Uploaded by

prasathdhanam66
SYLLABUS
Big Data Analytics - [CCS334]

UNIT I UNDERSTANDING BIG DATA
Introduction to big data - convergence of key trends - unstructured data - industry examples of big data - web analytics - big data applications - big data technologies - introduction to Hadoop - open source technologies - cloud and big data - mobile business intelligence - crowd sourcing analytics - inter and trans firewall analytics. (Chapter - 1)

UNIT II NOSQL DATA MANAGEMENT
Introduction to NoSQL - aggregate data models - key-value and document data models - relationships - graph databases - schemaless databases - materialized views - distribution models - master-slave replication - consistency - Cassandra - Cassandra data model - Cassandra examples - Cassandra clients. (Chapter - 2)

UNIT III BASICS OF HADOOP
Data format - analyzing data with Hadoop - scaling out - Hadoop streaming - Hadoop pipes - design of Hadoop distributed file system (HDFS) - HDFS concepts - Java interface - data flow - Hadoop I/O - data integrity - compression - serialization - Avro - file-based data structures - Cassandra - Hadoop integration. (Chapter - 3)

UNIT IV MAP REDUCE APPLICATIONS
MapReduce workflows - unit tests with MRUnit - test data and local tests - anatomy of MapReduce job run - classic Map-reduce - YARN - failures in classic Map-reduce and YARN - job scheduling - shuffle and sort - task execution - MapReduce types - input formats - output formats. (Chapter - 4)

UNIT V HADOOP RELATED TOOLS
HBase - data model and implementations - HBase clients - HBase examples - praxis. Pig - Grunt - Pig data model - Pig Latin - developing and testing Pig Latin scripts. Hive - data types and file formats - HiveQL data definition - HiveQL data manipulation - HiveQL queries. (Chapter - 5)

TABLE OF CONTENTS

Chapter - 1 Understanding Big Data
1.1 Introduction to Big Data
    1.1.1 Difference between Data Science and Big Data
    1.1.2 Benefits of Big Data Processing
    1.1.3 Big Data Challenges
1.2 Convergence of Key Trends
    1.2.1 V's of Big Data
    1.2.2 Compare Cloud Computing and Big Data
1.3 Unstructured Data
    1.3.1 Difference between Structured and Unstructured Data
1.4 Industry Examples of Big Data
1.5 Web Analytics
1.6 Big Data Applications
1.7 Big Data Technologies
1.8 Introduction to Hadoop
    1.8.1 Hadoop Ecosystem
    1.8.2 Hadoop Advantages
1.9 Open Source Technologies
    1.9.1 Difference between Open Source and Open Standards
    1.9.2 Advantages of Open Sources
    1.9.3 Disadvantages of Open Sources
    1.9.4 Application of Open Source Software
    1.9.5 Comparison of Open Source with Close Source / Proprietary Software
1.10 Cloud and Big Data
    1.10.1 Difference between Cloud Computing and Big Data
    1.10.2 Difference between Cloud Computing and Internet
1.11 Mobile Business Intelligence
    1.11.1 Difference between Mobile Analytics and Web Analytics
1.12 Crowd Sourcing Analytics
1.13 Inter and Trans Firewall Analytics
    1.13.1 Firewall Rules
    1.13.2 Types of Firewall
    1.13.3 Comparison between Packet Filter and Application Level Gateways
1.14 Two Marks Questions with Answers

Chapter - 2 NoSQL Data Management
2.1 Introduction to NoSQL
    2.1.1 The Definition of Four Types of NoSQL Databases
    2.1.2 Example and Advantages
    2.1.3 CAP Theorem
    2.1.4 Comparison of SQL and NoSQL Databases
2.2 Aggregate Data Models
    2.2.1 Key-value Store
    2.2.2 Document-based
    2.2.3 Column-based
    2.2.4 Graph-based
    2.2.5 NoSQL Key/Value Database : MongoDB
2.3 Schemaless Databases
2.4 Materialized Views
2.5 Distribution Models
    2.5.3 Master-slave Replication
    2.5.4 Peer-to-Peer Replication
    2.5.5 Combining Sharding and Replication
    2.5.6 Difference between Replication and Sharding
2.6 Consistency
    2.6.1 Update Consistency
    2.6.2 Read Consistency
    2.6.3 Quorums
    2.6.4 Relaxing Durability
2.7 Cassandra
    2.7.1 Cassandra Architecture
    2.7.2 Cassandra Data Model
    2.7.3 Cassandra Clients
2.8 Two Marks Questions with Answers

Chapter - 3 Basics of Hadoop
3.1 Data Format
    3.1.1 Analyzing the Data with Hadoop
    3.1.2 Scaling Out
3.2 Hadoop Streaming
3.3 Hadoop Pipes
3.4 Design of Hadoop Distributed File System (HDFS)
    3.4.1 HDFS Architecture
    3.4.2 HDFS Block
    3.4.3 Java Interface
    3.4.4 Data Flow
    3.4.5 Heartbeat Mechanism in HDFS
    3.4.6 Role of Sorter, Shuffler and Combiner in MapReduce Paradigm
3.5 Hadoop I/O
    3.5.1 Data Integrity
    3.5.2 Hadoop Local File System
    3.5.3 Compression
    3.5.4 Serialization
    3.5.5 The Writable Interface
    3.5.6 Avro
3.6 File-based Data Structures
3.7 Cassandra - Hadoop Integration
3.8 Two Marks Questions with Answers

Chapter - 4 MapReduce Applications
4.1 Introduction to MapReduce
    4.1.1 MapReduce Workflows
    4.1.2 Data Flow in the MapReduce Programming Model
    4.1.3 Functions of Job Tracker and Task Tracker
    4.1.4 Limitation of MapReduce
4.2 Unit Tests with MRUnit
4.3 Anatomy of MapReduce Job Run
4.4 YARN
    4.4.1 Merits and Demerits of YARN
    4.4.2 Difference between YARN and MapReduce
4.5 Failures in Classic MapReduce and YARN
4.6 Job Scheduling
    4.6.3 Capacity Scheduler
    4.6.4 Difference between ...
4.7 Shuffle and Sort
4.8 Task Execution
4.9 MapReduce Types
    4.9.1 Input Formats - Output ...
4.10 Two Marks Questions with Answers

Chapter - 5 Hadoop Related Tools
5.1 HBase
    5.1.1 Features and Application of HBase
    5.1.2 Difference between HDFS and HBase
    5.1.3 Difference between HBase and Relational Database
    5.1.4 Limitations of HBase
5.2 Data Model and Implementations
5.3 HBase Clients
5.4 Praxis
5.5 Pig
    5.5.1 Pig Data Model
    5.5.2 Pig Latin
    5.5.3 Developing and Testing Pig Latin Scripts
5.6 Hive
    5.6.1 Hive Architecture
    5.6.2 Data Types and File Formats
5.7 HiveQL Data Definition
5.8 HiveQL Data Manipulation
5.9 HiveQL Queries
5.10 Two Marks Questions with Answers
Solved Model Question Paper

Chapter - 1 Understanding Big Data

Syllabus
Introduction to big data - convergence of key trends - unstructured data - industry examples of big data - web analytics - big data applications - big data technologies - introduction to Hadoop - open source technologies - cloud and big data - mobile business intelligence - crowd sourcing analytics - inter and trans firewall analytics.

Contents
1.1 Introduction to Big Data
1.2 Convergence of Key Trends
1.3 Unstructured Data
1.4 Industry Examples of Big Data
1.5 Web Analytics
1.6 Big Data Applications
1.7 Big Data Technologies
1.8 Introduction to Hadoop
1.9 Open Source Technologies
1.10 Cloud and Big Data
1.11 Mobile Business Intelligence
1.12 Crowd Sourcing Analytics
1.13 Inter and Trans Firewall Analytics
1.14 Two Marks Questions with Answers

1.1 Introduction to Big Data
• Big data can be defined as very large volumes of data available at various sources, in varying degrees of complexity, generated at different speeds (i.e. velocities) and with varying degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms or any commercial solutions.
• 'Big data' is a term used to describe a collection of data that is huge in size and yet growing exponentially with time. In short, such data is so large and complex that none of the traditional data management tools are able to store it or process it efficiently.
• The processing of big data begins with the raw data that isn't aggregated or organized and is most often impossible to store in the memory of a single computer.
• Big data processing is a set of techniques or programming models to access large-scale data to extract useful information for supporting and providing decisions. Hadoop is the open-source implementation of MapReduce and is widely used for big data processing.
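The MapReduce programming model mentioned above can be illustrated with a minimal single-machine sketch in plain Python. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample sentences are assumptions made only for illustration. It shows the three conceptual steps of a word count: mapping each record to key-value pairs, grouping the pairs by key, and reducing each group to one result.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as a framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    # Hypothetical input records; on a real cluster these would be file splits.
    docs = ["big data needs big storage", "data drives decisions"]
    print(word_count(docs))
```

In a real cluster the map and reduce functions run on many machines at once and the framework performs the shuffle over the network; here everything runs in one process so that the data flow is easy to follow.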
1.1.1 Difference between Data Science and Big Data

Data science | Big data
It is a field of scientific analysis of data in order to solve analytically complex problems, together with the significant process of cleansing and preparing data. | Big data is the storing and processing of large volumes of structured and unstructured data, which is not possible with traditional applications.
It is used in biotech, energy, gaming and insurance. | Used in retail, education, healthcare and ...
Data classification, anomaly detection, prediction, scoring and ranking. | ...

1.1.2 Benefits of Big Data Processing
Benefits of big data processing :
1. Improved customer service.
2. Businesses can utilize outside intelligence while taking decisions.
3. Reducing maintenance costs.
4. Re-develop your products : Big data can also help you understand how others perceive your products so that you can adapt them, or your marketing, if need be.
5. Early identification of risk to the product / services, if any.
6. Better operational efficiency.

TECHNICAL PUBLICATIONS® - an up-thrust for knowledge

1.1.3 Big Data Challenges
• Collecting, storing and processing big data comes with its own set of challenges :
1. Big data is growing exponentially and existing data management solutions have to be constantly updated to cope with the three Vs.
2. Organizations do not have enough skilled data professionals who can understand and work with big data and big data tools.

1.2 Convergence of Key Trends
• The essence of computer applications is to store things in the real world into computer systems in the form of data, i.e., it is a process of producing data. Some data are the records related to culture and society and others are the descriptions of phenomena of the universe and life. The large scale of data is rapidly generated and stored in computer systems, which is called data explosion.
• Data is generated automatically by mobile devices and computers: think Facebook, search queries, directions and GPS locations, and image capture.
• Sensors also generate volumes of data, including medical data and commerce location-based sensors. Experts expect 55 billion IP-enabled sensors by 2021. Even storage of all this data is expensive. Analysis gets more important and more expensive every year.
• Fig. 1.2.1 shows the data explosion caused by the current data boom and how critical it is for us to be able to extract meaning from all of this data.

[Fig. 1.2.1 Data explosion]

• The phenomenon of exponential multiplication of the data that gets stored is termed "data explosion". A continuous inflow of real-time data from various processes, machinery and manual inputs keeps flooding the storage servers every second. Sending emails, making phone calls, collecting information for campaigns: each day we create a massive amount of data just by going about our normal business, and this data explosion does not seem to be slowing down. In fact, 90 % of the data that currently exists was created in just the last two years.
• The reason for this data explosion is innovation. Innovation changed the way in which we do business and provide services. The trends governing the data world are business model transformation, globalization and personalization of services.
1. Business model transformation : Organizations have traditionally treated data as a legal or compliance requirement, supporting limited management reporting requirements. Consequently, organizations have treated data as a cost to be minimized. Businesses are now required to produce more data related to products and to provide services that cater to each sector and channel of customers.
2. Globalization : Globalization is an emerging trend in business where organizations start operating on an international scale. From manufacturing to customer service, globalization has changed the commerce of the world. A variety of data in different formats is generated due to globalization.
3. Personalization of services : To enhance customer service, customers opt for one-to-one marketing in the form of personalized services. Customers expect communication through various channels, which increases the speed of data generation.
4. New sources of data : The shift to online advertising supported by the likes of Google, Yahoo and others is a key driver in the data boom. Social media, mobile devices, sensor networks and new media are at the fingertips of customers or users. The data generated through these is used by corporations for decision support systems like business intelligence and analytics.
• The growth of technology has helped new business models to emerge over the last decade or more. Integration of all the data across the enterprise is used to create a business decision support platform.

1.2.1 V's of Big Data
• Big data characteristics are differentiated from traditional data by one or more of the five V's : volume, velocity, variety, veracity and value.
1. Volume :

[Fig. 1.2.2 Big data volume]

2. Velocity : The term 'velocity' refers to the speed of generation of data. How fast the data is generated and processed to meet demand determines the real potential in the data. Data is being created in or near real time.
3. Variety : It refers to heterogeneous sources and the nature of data, both structured and unstructured.
• Fig. 1.2.3 (a) and Fig. 1.2.3 (b) show big data velocity and data variety.

[Fig. 1.2.3 (a) Data velocity]

4. Value : It represents the business value to be derived from big data.
• The ultimate objective of any big data project should be to generate some sort of value for the company doing all the analysis. Otherwise, you're just performing some technological task for technology's sake.

[Fig. 1.2.3 (b) Data variety]

• For real-time spatial big data, decisions can be enhanced through visualization of dynamic change in such spatial phenomena as climate, traffic, social-media-based attitudes and massive inventory locations.
• Exploration of data trends can include spatial proximities and relationships. Once spatial big data are structured, formal spatial analytics can be applied, such as spatial autocorrelation, overlays, buffering, spatial cluster techniques and location quotients.

1.2.2 Compare Cloud Computing and Big Data

Cloud computing | Big data
It provides resources on demand. | It provides a way to handle huge volumes of data and generate insights.
... | It refers to data, which can be structured, semi-structured or unstructured.
... | Big data is a highly scalable, robust and cost-effective ecosystem.
Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
The main focus of cloud computing is to provide computer resources and services with the help of a network. | The main focus of big data is solving problems when a huge amount of data is being generated and processed.

1.3 Unstructured Data
• Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
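To make the retrieval difficulty concrete, the following is a minimal Python sketch. The order record, the e-mail text and the helper `extract_order` are hypothetical and exist only for this illustration. It contrasts reading a field from a record with a fixed layout against recovering the same facts from free text, where ad hoc patterns have to be guessed.

```python
import re

# Structured record: every field sits in a known, named column.
structured_row = {"order_id": 1042, "customer": "R. Kumar", "amount": 2499.0}

# Unstructured record: the same facts buried in free text (hypothetical e-mail).
email_body = (
    "Hi team, customer R. Kumar called about order 1042. "
    "He was charged Rs. 2499.00 and wants an invoice copy."
)


def extract_order(text):
    """Pull an order id and an amount out of free text with ad hoc patterns."""
    order = re.search(r"order\s+(\d+)", text, re.IGNORECASE)
    amount = re.search(r"Rs\.\s*(\d+(?:\.\d+)?)", text)
    return {
        "order_id": int(order.group(1)) if order else None,
        "amount": float(amount.group(1)) if amount else None,
    }


if __name__ == "__main__":
    # Structured access is a direct lookup by column name.
    print(structured_row["amount"])
    # Unstructured access depends on patterns that may not match other e-mails.
    print(extract_order(email_body))
```

The patterns work only for text worded like this sample; a differently phrased message returns `None` for both fields, which is the practical meaning of "no identifiable structure".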
• Examples of unstructured data are e-mails, click streams, textual data, images, log data and videos. In the case of unstructured data, size is not the only problem: deriving value or getting results out of unstructured data is much more complex and challenging than with structured data.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data.
• Even today, in most organizations more than 80 % of the data is in unstructured form. It carries lots of information, but extracting information from these various sources is a very big challenge.
• Characteristics of unstructured data :
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequence for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in nature.
• Examples of machine generated unstructured data :
1. Satellite images : This includes weather data or the data that the government captures in its satellite surveillance imagery.
2. Scientific data : This includes atmospheric data and high energy physics.
3. Photographs and video : This includes security, surveillance and traffic video.

Structured data :
• Structured data is arranged in rows and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data.
• Any data that can be stored in the form of a particular fixed format is known as structured data. For example, data stored in the columns and rows of tables in a relational database management system is a form of structured data.

1.3.1 Difference between Structured and Unstructured Data

Structured data | Unstructured data
It is in discrete form, i.e. stored in row and column format. | Unstructured data is data that does not follow a specified format.
Database management system | Unmanaged file structure
SQL, ADO.NET, ODBC | Open XML, SMTP, SMS
ETL | Batch processing or manual data entry
With a structured document, certain information always appears in the same location on the page. | In an unstructured document, information can appear in unexpected places on the document.
Low volume operations | High volume operations

1.4 Industry Examples of Big Data
• Big data plays an important role in digital marketing. The amount of information shared digitally increases significantly each day. With the help of big data, marketers can analyze every action of the consumer. It provides better marketing insights and helps marketers to build more accurate and advanced marketing strategies.
• Reasons why big data is important for digital marketers :
a) Real-time customer insights
b) Personalized targeting
c) Increasing sales
d) Improves the efficiency of a marketing campaign
e) Budget optimization
f) Measuring campaign's results more accurately.
• Data constantly informs marketing teams of customer behaviors and industry trends and is used to optimize future efforts, create innovative campaigns and build lasting relationships with customers.
• Big data regarding customers provides marketers details about user demographics, locations and interests, which can be used to personalize the product experience and increase customer loyalty over time.
• Big data solutions can help organize data and pinpoint which marketing campaigns, strategies or social channels are getting the most traction. This lets marketers allocate marketing resources and reduce costs for projects that are not yielding as much revenue or meeting desired audience goals.
• Personalized targeting : Nowadays, personalization is the key strategy for every marketer.
Engaging the customers at the right moment with the right message is the biggest issue for marketers. Big data helps marketers to create targeted and personalized campaigns. Personalized marketing is creating and delivering messages to individuals or to a group of the audience through data analysis, with the help of consumer data such as geolocation, browsing history, clickstream behavior and purchasing history. It is also known as one-to-one marketing.
• Consumer insights : In this day and age, marketing has become the ability of a company to interpret the data and change its strategies accordingly. Big data allows for real-time consumer insights, which are crucial to understanding the habits of your customers. By interacting with your consumers through social media you will know exactly what they want and expect from your product or service, which will be key to distinguishing your campaign from your competitors.
• Help increase sales : Big data will help with demand predictions for a product or service. Information gathered on user behaviour will allow marketers to answer what types of product their users are buying, how often they conduct purchases or search for a product or service and, lastly, what payment methods they prefer using.
• Analyse campaign results : Big data allows marketers to measure their campaign performance. This is the most important part of digital marketing. Marketers will use reports to measure any negative changes to marketing KPIs. If they have not achieved the desired results, it is a signal that the strategy needs to be changed in order to maximize revenue and make the marketing efforts more scalable in future.

1.5 Web Analytics
• Web analytics is the collection, reporting and analysis of website data. The focus is on identifying measures based on your organizational and user goals and using the website data to determine the success or failure of those goals and to drive strategy and improve the user's experience.
• The WWW is an evolving system for publishing and accessing resources and services across the Internet. The web is an open system. Its operations are based on freely published communication standards and document standards.
• Web analytics is important to help us to :
1. Refine your marketing campaigns
2. Understand your website visitors
3. Analyze website conversions
4. Improve the website user experience
5. Boost your search engine ranking
6. Understand and optimize referral sources
7. Boost online sales.
• Businesses use web analytics platforms to measure and benchmark site performance and to look at key performance indicators that drive their business, such as purchase conversion rate.
• Website analytics provide insights and data that can be used to create a better user experience for website visitors. Understanding customer behaviour is also key to optimizing a website for key conversion metrics.
• For example, web analytics will show us the most popular pages on your website. With website analytics, we can also track the effectiveness of your online marketing campaigns.
• Web analytics can help a digital marketer understand their customers better by providing :
  • Insight into who the customers are and their interests
  • Conversion challenges

1.6 Big Data Applications
• Big data applications can help companies to make better business decisions by analyzing large volumes of data and discovering hidden patterns. These data sets might be from social media, data captured by sensors, website logs, customer feedback, etc. Organizations are spending huge amounts on big data applications to discover hidden patterns, unknown associations, market trends, consumer preferences and other valuable business information.
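As a small illustration of analyzing website logs of the kind mentioned above, the sketch below computes a purchase conversion rate per referral source. The event tuples in `EVENTS`, the source names and the function `conversion_rates` are hypothetical, and the definition used (visitors who purchased divided by visitors who arrived) is one common convention, assumed here rather than taken from the text.

```python
from collections import defaultdict

# Hypothetical click-stream events: (visitor_id, referral_source, action).
EVENTS = [
    ("v1", "search", "visit"),
    ("v1", "search", "purchase"),
    ("v2", "search", "visit"),
    ("v3", "email", "visit"),
    ("v3", "email", "purchase"),
    ("v4", "social", "visit"),
]


def conversion_rates(events):
    """Return the purchase conversion rate for each referral source.

    Conversion rate = distinct visitors who purchased / distinct visitors seen.
    """
    visitors = defaultdict(set)
    buyers = defaultdict(set)
    for visitor_id, source, action in events:
        visitors[source].add(visitor_id)
        if action == "purchase":
            buyers[source].add(visitor_id)
    return {source: len(buyers[source]) / len(visitors[source]) for source in visitors}


if __name__ == "__main__":
    for source, rate in conversion_rates(EVENTS).items():
        print(f"{source}: {rate:.0%}")
```

Counting distinct visitors rather than raw events keeps a visitor who returns several times from inflating the rate. At real scale the same grouping would be run by a distributed engine instead of an in-memory dictionary.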
• Domains where big data can be applied include health care, media and entertainment, IoT, manufacturing and government.
• Relation between IoT and big data : Big data production in the Industrial Internet of Things (IIoT) is evident due to the massive deployment of sensors and Internet of Things (IoT) devices. However, big data processing is challenging due to limited computational, networking and storage resources at the IoT device end. Big Data Analytics (BDA) is expected to provide operational and customer-level intelligence in IoT systems.
• The extensive installation of sensors on machines causes a massive increase in the volume of data collected within industrial processes. The data consist of operating data, error lists, history of maintenance activities and the like.
• In combination with the related business data, the overall plethora of data provides the raw material for process optimizations and other applications. To unlock this potential for optimization, the raw data needs to be processed systematically, passing through various algorithms.
• The results are prepared information with specific application objectives. Pattern detection is especially worth mentioning in this context, since this method identifies and quantifies cause and effect correlations and allows predictions of state changes. The significance of the information produced by the analysis depends on the amount of data processed.

1. Healthcare :
• Big data analytics for healthcare uses health-related information of an individual or community to understand a patient, organization or community. In the past, managing and analyzing healthcare data was tedious and expensive. More recently, technology has helped the healthcare sector make leaps and bounds to keep up with the flow of big data in healthcare.
• Diagnostic devices, medical machinery, instrumentation and online services: sources such as these are transferring data throughout a healthcare network.
This is done with the help of big data tools such as Hadoop and Spark.
• One of the most relevant big data examples in healthcare is how it helped during the global coronavirus crisis. Big data analytics supported the rapid development of COVID-19 vaccines, helped by the sharing of data. Big data in healthcare also predicted the spread of disease by allowing health information to be processed much more rapidly than in the past during other pandemics.
• Smoother hospital administration : Healthcare administration becomes smoother with the help of big data. It helps to reduce the cost of measurement, provide the best clinical support and manage the population of at-risk patients. It also helps medical experts analyze data from diverse sources. It helps healthcare providers draw conclusions about the deviations among patients and the effects treatments have on their health.
• Fraud prevention and detection : Big data helps to prevent a wide range of errors on the side of health administrators in the form of wrong dosage, wrong medicines and other human errors. It will also be particularly useful to insurance companies, which can prevent a wide range of fraudulent insurance claims.
• Challenges of big data in healthcare : As a relatively new field, big data in healthcare is still evolving to keep up with the fast pace and changing nature of technology. With such vast amounts of data available to work with, organizations and leaders can struggle with knowing where and how to start with data analytics in healthcare to find the information that is meaningful. Many healthcare organizations lack adequate systems and databases and the skilled professionals to handle them. As such, the demand for healthcare analysts with advanced education and training is very high in the world.

2. Manufacturing :
• Improving efficiency across the business helps a manufacturing company control costs, increase productivity and boost margins. Automated production lines are already standard practice for many, but manufacturing big data can exponentially improve line speed and quality.
• Manufacturing big data also increases transparency into the entire supply chain, for example, by using sensor and RFID data to track the location of tools, parts and inventory in real time, reducing interruptions and delays.
• Companies can also identify which products are faster and easier to produce, so that they know where to focus their efforts, perhaps even concentrating on those products for maximum production. It helps companies to see the areas in which they are most efficient, with the added possibility of working on those areas that need the most improvement.
• AI-driven analysis of manufacturing big data enables companies to aggregate and analyze both their own and competitors' pricing and cost data to produce continually optimized price variants. For manufacturers that focus on build-to-order products, ML can also ensure the accuracy of their customized configurations and streamline the Configure-Price-Quote (CPQ) workflow.

1.7 Big Data Technologies
• Big data technology is defined as the technology and software utility that is designed for analysis, processing and extraction of information from large and extremely complex data sets, which are very difficult for traditional systems to deal with. Big data technology is used to handle both real-time and batch related data.
• Big data technology is defined as software-utility. This technology is primarily designed to analyze, process and extract information from a large data set and a huge set of extremely complex structures.
This is very difficult for traditional data processing software to deal with.
• Big data technologies include Apache Hadoop, Apache Spark, MongoDB, Cassandra, Plotly, Pig and Tableau.
• Cassandra : Cassandra is one of the leading big data technologies among the list of top NoSQL databases. It is open-source, distributed and has extensive column storage options. It is freely available and provides high availability.
• Apache Pig is a high-level scripting language used to execute queries for larger datasets that are used within Hadoop.
• Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways; it features Java, Python, Scala and R programming languages and supports SQL, streaming data, machine learning and graph processing, which can be used together in an application.
• MongoDB : MongoDB is another important component of big data technologies in terms of storage. Relational and RDBMS properties do not apply to MongoDB because it is a NoSQL database. It is not the same as RDBMS databases that use structured query languages. Instead, MongoDB uses schema documents.

1.8 Introduction to Hadoop
• Apache Hadoop is an open source framework that is used to efficiently store and process large datasets ranging in size from gigabytes to petabytes of data. Hadoop is designed to scale up from a single computer to thousands of clustered computers, with each machine offering local computation and storage.
• Hadoop is sometimes referred to as an acronym for High Availability Distributed Object Oriented Platform.
• The Hadoop framework consists of a storage layer known as the Hadoop Distributed File System (HDFS) and a processing framework called the MapReduce programming model.
Hadoop splits large amounts of data into chunks, distributes them within the network cluster and processes them in its MapReduce framework.
• Hadoop can also be installed on cloud servers to better manage the compute and storage resources. Leading cloud vendors such as Amazon Web Services and Microsoft Azure offer solutions. Cloudera supports Hadoop workloads both on-premises and in the cloud, including options for one or more public cloud environments from multiple vendors.
• Hadoop provides a distributed file system and a framework for the analysis and transformation of very large data sets using the MapReduce paradigm. An important characteristic of Hadoop is the partitioning of data and computation across many (thousands) of hosts and executing application computations in parallel close to their data. A Hadoop cluster scales computation capacity, storage capacity and I/O bandwidth by simply adding commodity servers.
• Key features of Hadoop :
1. Cost effective system
2. Large cluster of nodes
3. Parallel processing
4. Distributed data
5. Automatic failover management
6. Data locality optimization
7. Heterogeneous cluster
8. Scalability.
• Hadoop allows for the distribution of datasets across a cluster of commodity hardware. Processing is performed in parallel on multiple servers simultaneously. Software clients input data into Hadoop. HDFS handles metadata and the distributed file system. MapReduce then processes and converts the data. Finally, YARN divides the jobs across the computing cluster.
• All Hadoop modules are designed with a fundamental assumption that hardware failures of individual machines or racks of machines are common and should be automatically handled in software by the framework.
• Challenges of Hadoop : MapReduce complexity : As a file-intensive system, MapReduce can be a difficult tool to utilize for complex jobs, such as interactive analytical tasks.
• There are four main libraries in Hadoop.
1. Hadoop Common : This provides utilities used by all other modules in Hadoop.
2. Hadoop MapReduce : This works as a parallel framework for scheduling and processing the data.
3. Hadoop YARN : This is an acronym for Yet Another Resource Negotiator. It took over resource management and job scheduling from classic MapReduce and is used for processes running over Hadoop.
4. Hadoop Distributed File System (HDFS) : This stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.

1.8.1 Hadoop Ecosystem
• The Hadoop ecosystem is neither a programming language nor a service; it is a platform or framework which solves big data problems.
• The Hadoop ecosystem refers to the various components of the Apache Hadoop software library, as well as to the accessories and tools provided by the Apache Software Foundation for these types of software projects and to the ways that they work together.
• Hadoop is a Java-based framework that is extremely popular for handling and analysing large sets of data. The idea of a Hadoop ecosystem involves the use of different parts of the core Hadoop set such as MapReduce, a framework for handling vast amounts of data, and the Hadoop Distributed File System (HDFS), a sophisticated file-handling system. There is also YARN, a Hadoop resource manager.
• In addition to these core elements of Hadoop, Apache has also delivered other kinds of accessories or complementary tools for developers. Some of the most well-known tools of the Hadoop ecosystem include HDFS, Hive, Pig, YARN, MapReduce, Spark, HBase, Oozie, Sqoop, Zookeeper, etc.
• Fig. 1.8.1 shows the Apache Hadoop ecosystem.
• Hadoop Distributed File System (HDFS) is one of the largest Apache projects and the primary storage system of Hadoop. It employs a NameNode and DataNode architecture.
It is a distributed file system able to store large files across a cluster of commodity hardware.
* YARN stands for Yet Another Resource Negotiator. It is one of the core components of open source Apache Hadoop and handles resource management. It is responsible for managing workloads, monitoring and implementing security controls.

Fig. 1.8.1 Apache Hadoop ecosystem (layers shown : management and monitoring (Ambari), scripting, query, machine learning (Mahout), NoSQL (HBase), scheduling (Oozie), distributed processing (MapReduce), distributed storage (HDFS))

* Hive is an ETL and data warehousing tool used to query or analyze large datasets stored within the Hadoop ecosystem. Hive has three main functions : data summarization, query and analysis of structured and semi-structured data in Hadoop.
* MapReduce : It is the core processing component of the Hadoop ecosystem, as it provides the logic of processing. In other words, MapReduce is a software framework which helps in writing applications that process large data sets using distributed and parallel algorithms inside the Hadoop environment.
* Apache Pig is a high-level scripting language used to execute queries over large datasets stored within Hadoop.
* Apache Spark is a fast, in-memory data processing engine suitable for use in a wide range of circumstances. Spark can be deployed in several ways. It supports the Java, Python, Scala and R programming languages as well as SQL, streaming data, machine learning and graph processing, all of which can be used together in one application.
* Apache HBase is a Hadoop ecosystem component. It is a distributed database designed to store structured data in tables that could have billions of rows and millions of columns. HBase is a scalable, distributed NoSQL database built on top of HDFS. HBase provides real-time access to read or write data in HDFS.
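The MapReduce flow described above (map, shuffle and sort, reduce) can be illustrated with a small single-machine sketch. This is not Hadoop code and does not use the Hadoop API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are assumptions chosen for illustration, and each input string stands in for one chunk of a larger file.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(key, values):
    """Reducer: combine all values seen for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of input chunks."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for a chunk stored on a different node.
    chunks = ["big data needs big storage", "hadoop stores big data"]
    print(word_count(chunks))
```

In a real cluster the mappers and reducers run in parallel on the nodes that hold the data, and the shuffle moves intermediate pairs across the network. Here everything runs in one process so that the data flow is easy to follow.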
Hadoop Advantages
1. Scalable : A Hadoop cluster can be extended by just adding nodes to the cluster.
2. Cost effective : Hadoop is open source and uses commodity hardware to store data, so it is cost effective compared to traditional relational database management systems.
3. Resilient to failure : HDFS can replicate data over the network.
4. Hadoop can handle unstructured as well as semi-structured data.
5. The unique storage method of Hadoop is based on a distributed file system that effectively maps data wherever it is located in the cluster.

Open Source Technologies
* Open source software is like any other software (closed/proprietary software). It is differentiated by its use and licenses. Open source software guarantees the right to access and modify the source code and to use, reuse and redistribute the software, all with no royalty or other costs.
* Standard software is sold and supported commercially. However, open source software can be sold and/or supported commercially too. Open source is a disruptive technology.
* Open source is an approach to the design, development and distribution of software, offering practical accessibility to the software's source code.
* Open source licenses must permit non-exclusive commercial exploitation of the licensed work, must make available the work's source code and must permit the creation of derivative works from the work itself. For example, Netscape released its browser source code under the Netscape Public License and subsequently under the Mozilla Public License.
* Proprietary software is computer software which is the legal property of one party. The terms of use for other parties are defined by contracts or licensing agreements. These terms may include various privileges to share, alter, disassemble and use the software and its code.
* Closed source is a term for software whose license does not allow the release or distribution of the software's source code. Generally, it means only the binaries of a computer program are distributed and the license provides no access to the program's source code. The source code of such programs is usually regarded as a trade secret of the company. Access to source code by third parties commonly requires the party to sign a non-disclosure agreement.
* The needs of enterprises are ever increasing, and it is a fact that a single solution provider cannot satisfy all of their different needs. Open source, freeware and free software are now available for anyone and for any use.
* In the 1970s and early 1980s, the software industry started using technical measures to prevent computer users from being able to study and modify software. Copyright law was extended to computer programs in 1980.
* The free software movement was conceived in 1983 by Richard Stallman to satisfy the need for and to give the benefit of "software freedom" to computer users. Richard Stallman announced the idea of the GNU operating system in September 1983. The GNU Manifesto was written by Richard Stallman and published in March 1985.
* The Free Software Foundation (FSF) is a non-profit corporation started by Richard Stallman on 4 October 1985 to support the free software movement, a copyleft-based movement which aims to promote the universal freedom to distribute and modify computer software without restriction. In February 1986, the first formal definition of free software was published.
* The term "free software" refers to software licensed so as to permit both open distribution and open modification.
* The term open source gained popularity with the rise of the Internet, which provided access to diverse production models, communication paths and, last but not least, interactive communities.
* Netscape licensed and released its code as open source.

Successes of Open Source
* Successful open source projects make up many of today's most widely used technologies.
Operating systems : Linux, Symbian, GNU Project, NetBSD.
Servers : Apache, Tomcat, MediaWiki, Drupal, WordPress, Eclipse, Moodle, Joomla.
Programming languages : Java, JavaScript, PHP, Python, Ruby.
Client software : Mozilla Firefox, Mozilla Thunderbird, OpenOffice, Songbird, Audacity, 7-Zip.
Digital content : Wikipedia, Wiktionary, Project Gutenberg.
Examples of open source and proprietary software : [table illegible in the source]

Difference between Open Source and Open Standards
* Open source software is a type of software where the user has access to the software's source code and can freely use, modify and distribute the software. Thus open source concerns the code the software is made of.
* Open standards denote that the code responsible for communication with other systems is open and has technical specifications which are accessible free of charge. Thus open standards concern the communication between software.

Advantages of Open Source
1. The right to use the software in any way.
2. There is usually no license cost.
3. The source code is open and can be modified freely.
4. It is possible to reuse the software in another context or with another authority.
5. Open standards.
6. It provides higher flexibility.

Disadvantages of Open Source
1. There is no guarantee that development will happen.
2. It is sometimes difficult to know that a project exists, and its current status.
3. No secured follow-up development strategy.

Application of Open Source Software
* Following is the list of applications where open source software is used.
1. Social networking
2. Multimedia
3. Animation
4. Accounting
5. Instant messaging
6. ERP
7. Desktop publishing
8. Website development
9. Resource management
10. Video editing.

Comparison of Open Source with Closed Source / Proprietary Software
Sr. No. | Open source software | Closed source / proprietary software
1. | Source code is freely available. | Source code is kept secret.
2. | Modifications are allowed. | Modifications are not allowed.
3. | Licensees may do their own development. | All upgrades, support, maintenance and development are done by the licensor.
4. | [illegible] | Sublicensing is not allowed.
5. | No guarantee of further development. | Guarantee of further development.
6. | Fees, if any, are for integration, packaging, support and consulting. | [illegible]
7. | Example : Android OS, provided by Google. | Examples : Microsoft Windows; iOS, provided by Apple.

Cloud and Big Data
* NIST defines cloud computing as : "Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model is composed of five essential characteristics, three service models and four deployment models."
* The cloud provider is responsible for the physical infrastructure and the cloud consumer is responsible for application configuration, personalization and data.
* Broad network access refers to resources hosted in a cloud network that are available for access from a wide range of devices.
* Rapid elasticity describes the capability to provide scalable cloud computing services.
* Measured service : NIST describes measured service as a setup where cloud systems control a user's or tenant's use of resources through a metering capability somewhere in the system.
* On-demand self-service refers to the service provided by cloud computing vendors that enables the provision of cloud resources on demand whenever they are required.
* The Cloud Cube Model has four dimensions to differentiate cloud formations :
a) External / Internal
b) Proprietary / Open
c) De-perimeterized / Perimeterized
d) Outsourced / Insourced.
* External / Internal : The physical location of data is defined by the external/internal dimension. It defines the organization's boundary. Example : Information inside a datacenter using a private cloud deployment would be considered internal and data that resided on Amazon EC2 would be considered external.
* Proprietary / Open : This dimension measures not only ownership of the technology but also its interoperability, use of data, ease of data transfer and degree of vendor application lock-in. Proprietary means that the organization providing the service is keeping the means of provision under its ownership. Clouds that are open use technology that is not proprietary, meaning that there are likely to be more suppliers.
* De-perimeterized / Perimeterized : This dimension describes the security range. It measures whether the operations are inside or outside the security boundary, firewall, etc.
* Encryption and key management will be the technological means for providing data confidentiality and integrity in a de-perimeterized model.
* Outsourced / Insourced : This dimension defines whether the customer or the service provider provides the service.
* Outsourced means the service is provided by a third party.
It refers to letting contractors or service providers handle all requests, and most cloud business models fall into this category.
* Insourced means the services are provided by your own staff under the organization's control, that is, in-house development of clouds.
* Cloud computing is often described as a stack, as a response to the broad range of services built on top of one another under the "cloud". A cloud computing stack is a cloud architecture built in layers of one or more cloud-managed services (SaaS, PaaS, IaaS, etc.).
* Cloud computing stacks are used for all sorts of applications and systems. They are especially good for microservices and scalable applications, as each tier is dynamically scaling and replaceable.
* The cloud computing stack makes up a threefold system that comprises its lower-level elements. These components function as formalized cloud computing delivery models :
a) Software as a Service (SaaS)
b) Platform as a Service (PaaS)
c) Infrastructure as a Service (IaaS)
* SaaS applications are designed for end-users and delivered over the web.
* PaaS is the set of tools and services designed to make coding and deploying those applications quick and efficient.
* IaaS is the hardware and software that powers it all, including servers, storage, networks and operating systems.

Difference between Cloud Computing and Big Data
Cloud computing | Big data
[illegible] | It provides a way to handle huge volumes of data and generate insights.
It refers to internet services from SaaS, PaaS to IaaS. | It refers to data, which can be structured, semi-structured or unstructured.
[illegible] | It is used to describe huge volumes of data and information.
Cloud computing is economical as it has low maintenance costs, a centralized platform, no upfront cost and disaster-safe implementation. | Big data is a highly scalable, robust ecosystem and is cost effective.
Vendors and solution providers of cloud computing are Google, Amazon Web Services, Dell, Microsoft, Apple and IBM. | Vendors and solution providers of big data are Cloudera, Hortonworks, Apache and MapR.
The main focus of cloud computing is to provide computer resources and services with the help of a network connection. | [illegible]

* Cloud computing is an application-based software infrastructure that stores data on remote servers, which can be accessed through the Internet.
* The Internet provides the software/hardware infrastructure to establish and maintain connectivity of the computers.
* Cloud computing is owned by a company, institution or government.

Mobile Business Intelligence
* Mobile Business Intelligence (BI) or mobile analytics is a rising software technology that allows users to access information and analytics on their phones and tablets instead of desktop-based BI systems. Mobile analytics involves measuring and analyzing data generated by mobile platforms and properties, such as mobile sites and mobile applications.
* Analytics is the practice of measuring and analyzing user data in order to understand user behavior as well as the performance of a website or application. If this practice is done on mobile apps and app users, it is called "mobile analytics".
* Mobile analytics is the practice of collecting user behavior data, determining intent from those metrics and taking action to drive retention, engagement and conversion.
* Mobile analytics is similar to web analytics in that it identifies unique customers and records their usage.
* With mobile analytics data, you can improve your cross-channel marketing initiatives, optimize the mobile experience for your customers and grow mobile user engagement and retention.
* Analytics usually comes in the form of software that integrates into a company's existing websites and apps to capture, store and analyze the data.
* It is always very important for businesses to measure their critical KPIs (Key Performance Indicators), as the old rule is always valid : "If you can't measure it, you can't improve it".
* To be more specific, if a business finds out that 75 % of its users exit at the shipment screen of the sales funnel, there is probably something wrong with that screen in terms of its design, user interface (UI) or user experience (UX), or there is a technical problem preventing users from completing the process.

Working of Mobile Analytics :
* Most analytics tools need a library (an SDK) to be embedded into the mobile app's project code and, at the minimum, an initialization code in order to track users and screens. SDKs differ for each platform, such as iOS, Android and Windows Phone.
* On top of that, additional code is required for custom event tracking.
* With the help of this code, analytics tools track and count each user, app launch, tap, event and app crash, along with any additional information about the user such as device, operating system, version and IP address (and probable location).
* Unlike web analytics, mobile analytics tools don't depend on cookies to identify unique users, since mobile analytics SDKs can generate a persistent and unique identifier for each device.
* The tracking technology varies between websites, which use either JavaScript or cookies, and apps, which use a Software Development Kit (SDK). Each time a website or app visitor takes an action, the application fires off data which is recorded in the mobile analytics platform.
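The tracking steps above can be sketched in a few lines of Python. This is a simplified illustration and not the API of any real analytics SDK : the class `AnalyticsClient`, its `track` method and the function `exit_rate_per_screen` are hypothetical names. The sketch assumes that events are kept in memory (a real SDK would send them to the analytics platform) and that each session is available as an ordered list of screen names.

```python
import uuid
from collections import Counter


class AnalyticsClient:
    """Hypothetical in-app tracker; a real SDK would send events over the network."""

    def __init__(self, device_id=None):
        # A persistent per-device identifier plays the role of a browser cookie.
        self.device_id = device_id or str(uuid.uuid4())
        self.events = []

    def track(self, name, **properties):
        """Record one event (app launch, tap, crash, ...) with optional properties."""
        event = {"device_id": self.device_id, "event": name, **properties}
        self.events.append(event)
        return event


def exit_rate_per_screen(sessions):
    """For each screen, return the fraction of sessions reaching it that ended there.

    `sessions` is a list of sessions; each session is the ordered list of
    screen names the user visited.
    """
    reached = Counter()
    exited = Counter()
    for screens in sessions:
        for screen in set(screens):
            reached[screen] += 1
        if screens:
            exited[screens[-1]] += 1  # the last screen seen is the exit screen
    return {screen: exited[screen] / reached[screen] for screen in reached}


if __name__ == "__main__":
    client = AnalyticsClient()
    client.track("app_launch", os="Android")
    sessions = [
        ["cart", "shipment"],
        ["cart", "shipment"],
        ["cart", "shipment"],
        ["cart", "shipment", "payment"],
    ]
    print(exit_rate_per_screen(sessions))
```

With the four sample sessions, three out of the four users who reach the shipment screen leave there, which is the kind of 75 % drop-off that the KPI example in the text describes.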
Difference between Mobile Analytics and Web Analytics
[Table illegible in the source.]

Crowd Sourcing Analytics
* Crowdsourcing is the process of exploring customers' ideas, opinions and thoughts available on the Internet from large groups of people, aimed at incorporating innovation, implementing new ideas and eliminating product issues.
* Crowdsourcing means the outsourcing of human-intelligence tasks to a large group of unspecified people via the Internet.
* Crowdsourcing is all about collecting data from users through some services, ideas or content. The data then needs to be stored on a server so that it can be provided to users whenever necessary.
* Most users nowadays use Truecaller to find unknown numbers and Google Maps to find places and the traffic in a region. All these services are based on crowdsourcing.
* Crowdsourced data is a form of secondary data. Secondary data refers to data collected by someone other than the researcher. Secondary data provides important context for any investigation into a policy intervention.
* When crowdsourcing data, researchers collect plentiful, valuable and dispersed data at a cost typically lower than that of traditional data collection methods.
* Consider the trade-offs between sample size and sampling issues before deciding to crowdsource data. Ensuring data quality means making sure the platform on which you are collecting crowdsourced data is well-tested.
* Crowdsourcing experiments are normally set up by asking a set of users to perform a task for a very small remuneration on each unit of the task. Amazon Mechanical Turk (AMT) is a popular platform that has a large set of registered remote workers who are hired to perform tasks such as data labeling.
* In data labeling tasks, the crowd workers are randomly assigned a single item in the dataset.
A data object may receive multiple labels from different workers and these have to be aggregated to get the overall true label.
* Crowdsourcing allows for many contributors to be recruited in a short period of time, thereby eliminating traditional barriers to data collection. Furthermore, crowdsourcing platforms usually employ their own tools to optimize the annotation process, making it easier to conduct time-intensive labeling tasks.
* Crowdsourcing data is especially effective in generating complex and free-form labels such as in the case of audio transcription, sentiment analysis, image annotation or translation.
* With crowdsourcing, companies can collect information from customers and use it to their advantage. Brands gather opinions, ask for help, receive feedback to improve their product or service, and drive sales. For instance, Lego conducted a campaign where customers had the chance to develop their own toy designs and submit them.
* To become the winner, the creator had to receive the largest number of votes. The best design was moved to the production process. Moreover, the winner received a 1 % royalty on the net revenue.
* Types of Crowdsourcing : There are four main types of crowdsourcing.
1. Wisdom of the crowd : It is the collective opinion of different individuals gathered in a group. This type is used for decision-making since it allows one to find the best solution for problems.
2. Crowd creation : This type involves a company asking its customers to help with new products. This way, companies get brand new ideas and thoughts that help a business stand out.
3. Crowd voting : It is a type of crowdsourcing where customers are allowed to choose a winner. They can vote to decide which of the options is the best for them. This type can be applied to different situations.
Consumers can choose one of the options provided by experts or products created by consumers.
4. Crowdfunding : It is when people collect money and ask for investments for charities, projects and startups without planning to return the money to the owners. People do it voluntarily. Often, companies gather money to help individuals and families suffering from natural disasters, poverty, social problems, etc.

Inter and Trans Firewall Analytics
* A firewall is a device designed to control the flow of traffic into and out of a network. In general, firewalls are installed to prevent attacks. A firewall can be a software program or a hardware device.
* Fig. 1.13.1 shows a firewall.

Fig. 1.13.1 Firewall

* Firewalls are software programs or hardware devices that filter the traffic that flows into a user's PC or network through an Internet connection. They sift through the data flow and block that which they deem harmful to the user's network or computer system.
* Firewalls filter based on IP, UDP and TCP information. A firewall is placed on the link between a network router and the Internet or between a user and a router. For large organizations with many small networks, the firewall is placed on every connection attached to the Internet.
* Large organizations may use multiple levels of firewalls or distributed firewalls, locating a firewall at a single access point to the network.
* Firewalls test all traffic against consistent rules and pass traffic that meets those rules. Many routers support basic firewall functionality. A firewall can also be used to control data traffic.
* Firewall-based security depends on the firewall being the only connectivity to the site from outside. There should be no way to bypass the firewall via other gateways or wireless connections.
* A firewall filters out all incoming messages addressed to a particular IP address or a particular TCP port number. It divides a network into a more trusted zone internal to the firewall and a less trusted zone external to the firewall.
* Firewalls may also impose restrictions on outgoing traffic, to prevent certain attacks and to limit losses if an attacker succeeds in getting access inside the firewall.
* Functions of firewall :
1. Access control : The firewall filters incoming as well as outgoing packets.
2. Address/port translation : Using network address translation (NAT), internal machines, though not visible on the Internet, can establish a connection with external machines on the Internet. NAT is often done by the firewall.
3. Logging : The security architecture ensures that each incoming or outgoing packet encounters at least one firewall. The firewall can log all anomalous packets.
* Firewalls can protect the computer and the user's personal information from :
1. Hackers who break your system security.
2. Malware and other Internet attacks, which are stopped before they reach your computer.
3. Outgoing traffic from your computer created by a virus infection.
* Firewalls cannot provide protection :
1. Against phishing scams and other fraudulent activity
2. Against viruses spread through e-mail
3. From physical access to your computer or network
4. For an unprotected wireless network.
* Firewall characteristics :
1. All traffic from inside to outside, and vice versa, must pass through the firewall.
2. The firewall itself is resistant to penetration.
3. Only authorized traffic, as defined by the local security policy, will be allowed to pass.

Firewall Rules
* Policy is the set of rules and regulations set by the organization. Policy determines the type of internal and external information resources employees can access, the kinds of programs they may install on their own computers and their authority for reserving network resources.
* Policy is typically general and set at a high level within the organization. Policies that contain details generally become too much of a "living document".
* A user can create or disable firewall filter rules based on the following conditions :
1. IP addresses : The system admin can block a certain range of IP addresses.
2. Domain names : The admin can allow only certain specific domain names to access your systems, or allow access to only some specific types of domain names or domain name extensions.
3. Protocol : A firewall can decide which of the systems can allow or have access to common protocols like IP, SMTP, FTP, UDP, ICMP, Telnet or SNMP.
4. Ports : Blocking or disabling ports of servers that are connected to the Internet helps maintain the kind of data flow you want and closes down possible entry points for hackers or malicious software.
5. Keywords : Firewalls can also sift through the data flow for a match of keywords or phrases, to block offensive or unwanted data from flowing in.
* When your computer makes a connection with another computer on the network, several things are exchanged, including the source and destination ports. In a standard firewall configuration, most inbound ports are blocked. This would normally cause a problem with return traffic, since the source port is randomly assigned. A state is a dynamic rule created by the firewall containing the source-destination port combination, allowing the desired return traffic to pass the firewall.

Types of Firewall
1. Packet filter
2. Application level firewall
3. Circuit level gateway.
* Fig. 1.13.2 shows the relation between the OSI layers and firewalls.
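The filter conditions and the state mechanism described above can be sketched as a toy packet filter in Python. This is an illustrative model only and not the configuration syntax or API of any real firewall product : the classes `Packet`, `Rule` and `Firewall` and their fields are assumptions made for this example. Rules match on source network, protocol and destination port, the default for inbound traffic is to deny, and every outbound connection creates a dynamic state entry that lets the reply back in.

```python
import ipaddress
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "TCP" or "UDP"


@dataclass(frozen=True)
class Rule:
    action: str                      # "allow" or "deny"
    network: str = "0.0.0.0/0"       # source network in CIDR notation
    protocol: Optional[str] = None   # None matches any protocol
    dst_port: Optional[int] = None   # None matches any destination port

    def matches(self, pkt: Packet) -> bool:
        in_network = (ipaddress.ip_address(pkt.src_ip)
                      in ipaddress.ip_network(self.network))
        return (in_network
                and self.protocol in (None, pkt.protocol)
                and self.dst_port in (None, pkt.dst_port))


class Firewall:
    def __init__(self, inbound_rules):
        self.inbound_rules = list(inbound_rules)  # checked in order, first match wins
        self.states = set()                       # dynamic rules for return traffic

    def outbound(self, pkt: Packet) -> str:
        # Remember the connection as seen from the reply's point of view.
        self.states.add((pkt.dst_ip, pkt.dst_port,
                         pkt.src_ip, pkt.src_port, pkt.protocol))
        return "allow"

    def inbound(self, pkt: Packet) -> str:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        if key in self.states:
            return "allow"  # reply to a connection opened from inside
        for rule in self.inbound_rules:
            if rule.matches(pkt):
                return rule.action
        return "deny"  # default: inbound ports are blocked


if __name__ == "__main__":
    fw = Firewall([
        Rule("deny", network="203.0.113.0/24"),       # blocked address range
        Rule("allow", protocol="TCP", dst_port=25),   # inbound mail is permitted
    ])
    reply = Packet("198.51.100.9", 80, "192.168.1.10", 50000, "TCP")
    print(fw.inbound(reply))                          # no state yet
    fw.outbound(Packet("192.168.1.10", 50000, "198.51.100.9", 80, "TCP"))
    print(fw.inbound(reply))                          # state now exists
```

The sketch only inspects addresses, ports and the protocol name, which corresponds to the packet filter type. It does not look at packet payloads, so keyword filtering and application level checks are outside its scope.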
* A packet filter firewall controls access to packets on the basis of packet source and destination address or specific transport protocol type. It is done at the OSI data link, network and transport layers. A packet filter firewall works on the network layer of the OSI model.
* Packet filters do not see inside a packet. They block or accept packets solely on the basis of the IP addresses and ports. All incoming SMTP and FTP packets are parsed to check whether they should be dropped or forwarded. But outgoing SMTP and FTP packets have already been screened by the gateway and do not have to be checked by the packet filtering router. A packet filter firewall only checks the header information.

Fig. 1.13.2 Relation between OSI layer and firewall (OSI layers : application, presentation, session, transport, network, data link, physical; firewall types shown against them : application level firewall, circuit level gateway firewall, packet filter firewall)

* An application level gateway is also called a bastion host. It operates at the application level. Multiple application gateways can run on the same host, but each gateway is a separate server with its own processes.
* These firewalls, also known as application proxies, provide the most secure type of data connection because they can examine every layer of the communication, including the application data.
* Circuit level gateway : A circuit-level firewall is a second generation firewall that validates TCP and UDP sessions before opening a connection.
* The firewall does not simply allow or disallow packets but also determines whether the connection is valid. It then opens a session and permits traffic only from the allowed source and possibly only for a limited period of time.
* It typically performs basic packet filter operations and then adds verification of proper handshaking of TCP and the legitimacy of the session information used in establishing the connection.
* The decision to accept or reject a packet is based upon examining the packet's IP header and TCP header.
* A circuit level gateway cannot examine the data content of the packets it relays between a trusted network and an untrusted network.

Comparison between Packet Filter and Application Level Gateways
Packet filter | Application level gateway
[illegible] | Even more complex.
[illegible] | [illegible]
Network topology cannot be hidden. | [illegible]
Transparent to user. | Not transparent to user.
Sees only addresses and service protocol type. | Sees full data portion of packet.

Two Marks Questions with Answers
Q.1 What is data science ?
Ans. : Data science is an interdisciplinary field that seeks to extract knowledge or insights from various forms of data. It aims to extract actionable knowledge from data that can be used to make decisions and predictions. Data science uses advanced analytical theory and methods such as time series analysis for predicting the future.
Q.2 What is data ?
Ans. : A data set is a collection of related records or information. The information may be on some entity or some subject area.
Q.3 Define structured data.
Ans. : Structured data is arranged in rows and column format. It helps applications to retrieve and process data easily. A database management system is used for storing structured data. The term structured data refers to data that is identifiable because it is organized in a structure.
Q.4 What is unstructured data ?
Ans. : Unstructured data is data that does not follow a specified format. Rows and columns are not used for unstructured data. Therefore it is difficult to retrieve required information. Unstructured data has no identifiable structure.
Q.5 What is machine-generated data ?
Ans. : Machine-generated data is information that is created without human interaction as a result of a computer process or application activity. This means that data entered manually by an end-user is not recognized to be machine-generated.
Q.6 Define streaming data.
Ans. : Streaming data is data that is generated continuously by thousands of data sources.
