BigData Unit1
The document provides an overview of Big Data, its characteristics defined by the 3Vs (Volume, Velocity, Variety), and the challenges and opportunities it presents. It introduces Hadoop as a key framework for managing and processing large datasets, detailing its components like HDFS, MapReduce, and YARN. Additionally, it covers various types of digital data, relationships in big data, and the history of Apache Hadoop's development.
Big Data Analytics
Dr. U. Vinay Kumar
Associate Professor, FCE, Poornima University

Unit 1: Introduction to Big Data and Hadoop

Introduction to Big Data
• Big Data refers to the immense volume of structured and unstructured data generated by various sources at an unprecedented speed.
• This data comes from a wide range of channels, including social media, sensors, mobile devices, business transactions, and more.
• The concept of Big Data is characterized by three primary dimensions, often referred to as the "3Vs":
1. Volume
2. Velocity
3. Variety

Big Data Characteristics: Volume
• Big Data involves large amounts of data.
• Traditional data management tools may struggle to process and store such massive volumes.
• The sheer size of the data sets is a key aspect of what makes it "big."
➢ Walmart handles 1 million customer transactions per hour.
➢ Instagram and Facebook ingest 10 PB of new data every day.
➢ A flight generates 1 PB of data over a 2-4 hour journey.
➢ More than 5 billion people are calling, texting, tweeting and browsing on mobile phones worldwide.

Big Data Characteristics: Velocity
• Data is generated at an incredibly high speed, in real-time or near-real-time.
• Examples include social media posts, online transactions, and sensor data.
• The ability to handle and process data at this pace is crucial for extracting meaningful insights.

Big Data Characteristics: Variety
• Big Data comes in various formats and types, including structured data (such as databases and tables), unstructured data (like text and images), and semi-structured data (such as XML or JSON files).
• Managing and analyzing diverse data types is a significant challenge in the realm of Big Data.

Big Data Characteristics
• Additionally, two more Vs are sometimes added to the definition:
➢ Variability: This refers to the inconsistency in the data flow. Data can be unpredictable and can vary at times.
➢ Veracity: Veracity deals with the quality of the data. With the vast amount of data being generated, there is often uncertainty about its accuracy and reliability.

Sources of Big Data

An example of Big Data:
• Real-time traffic information.
Challenges and Opportunities of Big Data

Challenges:
• Storage: Managing and storing large volumes of data efficiently.
• Processing: Analyzing and processing data quickly to derive meaningful insights.
• Analysis: Extracting relevant information from diverse and complex data sets.
• Privacy and Security: Ensuring the confidentiality and protection of sensitive data.

Opportunities:
• Innovation: Big Data analytics can lead to innovative solutions, products, and services.
• Efficiency: Improved decision-making and operational efficiency through data-driven insights.
• Competitive Advantage: Organizations can gain a competitive edge by harnessing Big Data effectively.

What's Driving Big Data

Tools and Technologies
• Numerous tools and technologies have emerged to handle the challenges posed by Big Data.
• These include distributed storage systems like Hadoop, in-memory processing frameworks like Apache Spark, and various data analytics and machine learning tools.

Introduction to Hadoop
• Hadoop is an open-source framework for distributed storage and processing of large sets of data using a cluster of commodity hardware.
• It is designed to scale from single servers to thousands of machines, offering a cost-effective solution for managing and analyzing massive amounts of data.
• The project is inspired by Google's MapReduce and Google File System (GFS) papers.
• Key components of Hadoop include:
1. Hadoop Distributed File System (HDFS)
2. MapReduce
3. YARN (Yet Another Resource Negotiator)
4. Hadoop Ecosystem

Introduction to Hadoop: HDFS
• HDFS is a distributed file system that provides high-throughput access to data.
• It breaks large files into smaller blocks (typically 128 MB or 256 MB) and distributes them across the nodes in a Hadoop cluster.
• HDFS is fault-tolerant, with data replication across multiple nodes to ensure data durability.

Introduction to Hadoop: MapReduce
• MapReduce is a programming model and processing engine for parallel and distributed data processing.
• It consists of two main steps: the Map phase, where data is divided into key-value pairs, and the Reduce phase, where the results from the Map phase are aggregated.
• MapReduce allows for scalable and efficient processing of large datasets across a Hadoop cluster.
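To make the Map and Reduce phases concrete, the following is a minimal word-count sketch in plain Python (no Hadoop required): the map step emits (word, 1) pairs, a shuffle step groups the pairs by key as Hadoop does between the two phases, and the reduce step sums each group. The function names and sample lines are illustrative, not part of Hadoop's API.

from collections import defaultdict

def map_phase(line):
    # Emit a (key, value) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Group all values by key, mimicking the shuffle between Map and Reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Aggregate the values for one key into a final (key, total) result.
    return key, sum(values)

lines = ["big data needs big storage", "hadoop processes big data"]
pairs = [pair for line in lines for pair in map_phase(line)]
results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
print(results)  # e.g. [('big', 3), ('data', 2), ('needs', 1), ...]

In a real Hadoop job the mappers and reducers run in parallel on different nodes, but the key-value flow is the same as in this sketch.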
Introduction to Hadoop: YARN
• YARN is a resource management layer in Hadoop that manages and schedules resources across the cluster.
• It allows multiple applications to share resources effectively, enabling more flexible and dynamic allocation of resources.

Introduction to Hadoop: Ecosystem
• Hadoop has a rich ecosystem of related projects and tools that extend its capabilities.
• Some notable components include Apache Hive (data warehouse infrastructure), Apache Pig, Apache HBase (a distributed, scalable big data store), Apache Spark, and more.

Introduction to Hadoop
• Hadoop is widely used in various industries for processing and analyzing large datasets, including log files, social media data, sensor data, and more.
• Its ability to handle massive amounts of data across a distributed cluster makes it a crucial tool for organizations seeking to gain insights and make informed decisions based on their data.

Types of Digital Data
• Digital data refers to information that is stored and transmitted in a form composed of discrete elements.
• There are various types of digital data, and they can be broadly categorized based on their formats, structures, and characteristics.

Types of Digital Data: Text Data
• Plain Text: Unformatted text without any styling or formatting.
• Rich Text Format (RTF): Text with formatting options such as bold, italic, and font changes.
• HTML (Hypertext Markup Language): Used for creating and structuring web content.

Types of Digital Data: Numeric Data
• Integers: Whole numbers without decimal points.
• Floating-Point Numbers: Numbers with decimal points or in scientific notation.
• Complex Numbers: Numbers with both real and imaginary parts.

Types of Digital Data: Audio Data
• Digital Audio: Recorded or synthesized sound stored in digital format (e.g., MP3, WAV).
• Speech Data: Transcribed or recorded human speech.

Types of Digital Data: Image Data
• Bitmap Images: Pixel-based images (e.g., JPEG, PNG, BMP).
• Vector Images: Graphics represented by mathematical equations (e.g., SVG).

Types of Digital Data: Video Data
• Digital Video: Sequences of images presented in rapid succession (e.g., MP4, AVI).
• Streaming Video: Video content transmitted in real time over the internet.

Types of Digital Data: Binary Data
• Executable Files: Programs and applications in binary format (e.g., EXE, ELF).
• Binary Code: Machine code instructions for computer processors.

Types of Digital Data: Geospatial Data
• Geographic Information System (GIS) Data: Information related to geographic locations.
• Global Positioning System (GPS) Data: Location data obtained from GPS devices.

Types of Digital Data: Metadata
• Descriptive Metadata: Information describing other data (e.g., file size, creation date).
• Structural Metadata: Information about the structure and relationships within data.

Types of Digital Data: Sensor Data
• Environmental Sensor Data: Information collected from sensors measuring environmental parameters.
• Biometric Data: Data related to physiological or behavioral characteristics (e.g., fingerprints, heart rate).

Types of Digital Data: Social Media Data
• Text Posts: Messages, tweets, or status updates.
• Multimedia Posts: Images, videos, and audio shared on social media platforms.

Types of Digital Data: Machine Learning Data
• Training Data: Examples used to train machine learning models.
• Testing Data: Examples used to evaluate the performance of machine learning models.
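As a small illustration of how the same information can appear as different types of digital data, the Python sketch below represents one sensor reading as a structured record, as semi-structured JSON, and as unstructured free text. The field names and values are invented for the example.

import json

# Structured: fixed, named fields with known types (like a database row).
structured = {"sensor_id": 42, "temperature_c": 21.5, "timestamp": "2024-01-01T10:00:00"}

# Semi-structured: self-describing JSON whose fields may vary between records.
semi_structured = json.dumps({"sensor": {"id": 42, "readings": [21.5, 21.7]}})

# Unstructured: free text that must be parsed or processed before analysis.
unstructured = "Sensor 42 reported roughly 21.5 degrees Celsius at 10 am."

print(structured["temperature_c"])                    # direct field access
print(json.loads(semi_structured)["sensor"]["id"])    # parse, then navigate
print("21.5" in unstructured)                         # only string-level operations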
Relationships and Representations

Relationships in Big Data
1. Inter-Data Relationships
2. Temporal Relationships
3. Graph Relationships

Inter-Data Relationships
• Big data often involves diverse datasets that may have complex relationships.
• Understanding how different datasets relate to each other is crucial for deriving meaningful insights.
• For example, in a retail setting, you might explore the relationship between customer demographics and purchasing behavior to target specific market segments.

Temporal Relationships
• Many big data applications involve time-series data. Analyzing temporal relationships helps uncover patterns and trends over time.
• This is valuable in various domains, such as finance (stock market trends), healthcare (patient monitoring), and manufacturing (predictive maintenance).

Graph Relationships
• Some big data scenarios involve data with intricate network or graph structures.
• Social networks, for instance, can be represented as graphs where individuals are nodes and relationships between them are edges.
• Understanding these relationships is vital for social network analysis, recommendation systems, and fraud detection.

Representations in Big Data: Data Models
• Choosing the right data model is crucial in big data systems.
• Whether it's a relational database model, NoSQL models like document or graph databases, or specialized models for specific data types, the representation of data affects how it can be stored, queried, and processed.

Data Formats
• Big data is often stored in various formats, such as JSON, XML, Parquet, or Avro.
• The choice of data format impacts data storage efficiency, query performance, and ease of integration with different tools and systems.

Visualization Representations
• Converting big data into visual representations, such as charts, graphs, and dashboards, is essential for making the data accessible and understandable.
• Visualization aids in identifying patterns, trends, and outliers, facilitating better decision-making.

Feature Representations
• In machine learning and data analytics, representing data features effectively is crucial.
• Feature engineering involves transforming raw data into a format that machine learning algorithms can understand.
• This process influences the model's performance and accuracy.

Graph Databases
• Graph databases are a type of NoSQL database that is designed to store and manage data using graph structures.
• In a graph database, data is represented as nodes, edges, and properties.
• Nodes represent entities, edges represent relationships between entities, and properties provide additional information about nodes and edges.

Graph Databases
1. Nodes: Nodes are entities in the graph, and each node can have properties that describe its attributes. For example, in a social network graph, a node could represent a person, and properties could include the person's name, age, and location.
2. Edges: Edges are the relationships between nodes. They connect nodes and can also have properties to describe the nature of the relationship. In a social network graph, an edge could represent a friendship between two people.
3. Properties: Nodes and edges can have associated properties, which are key-value pairs providing additional information about the entity or relationship. For instance, a property on a person node might be "gender" with values like "male" or "female."
4. Graph Query Language: Graph databases often use a specialized query language to navigate and retrieve data from the graph. Common graph query languages include Cypher (used in Neo4j) and Gremlin (used in Apache TinkerPop).
5. Schema-less: Unlike traditional relational databases, graph databases are typically schema-less, allowing for flexibility in adding new types of nodes and relationships without modifying a predefined schema.

Graph Databases: Use Cases
• Social Networks: Modeling relationships between users in a social network.
• Recommendation Engines: Analyzing user preferences and recommending items based on connections.
• Fraud Detection: Detecting patterns and connections in financial transactions.
• Network Analysis: Analyzing and visualizing complex relationships in various domains.

Examples of Graph Databases
• Neo4j: A popular open-source graph database.
• Amazon Neptune: A fully managed graph database service by Amazon Web Services.
• ArangoDB: A multi-model database that supports graph, document, and key-value data models.
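To show how nodes, edges, and properties are created and queried in practice, here is a minimal sketch that runs Cypher statements through the official neo4j Python driver. The connection URI, credentials, and the Alice/Bob data are placeholders, and the snippet assumes a running Neo4j instance.

from neo4j import GraphDatabase  # pip install neo4j

# Placeholder connection details for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two Person nodes and a FRIENDS_WITH edge carrying a property.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH {since: 2020}]->(b)",
        a="Alice", b="Bob",
    )
    # Query the graph: who are Alice's friends?
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f:Person) RETURN f.name",
        a="Alice",
    )
    print([record["f.name"] for record in result])

driver.close()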
History of Apache Hadoop
• Hadoop is an open-source framework designed for distributed storage and processing of large data sets using a cluster of commodity hardware.
• The history of Hadoop dates back to the early 2000s, and it has since become a fundamental tool in the field of big data.
• Google's MapReduce Paper (2004): The roots of Hadoop can be traced back to a paper titled "MapReduce: Simplified Data Processing on Large Clusters," published by Google researchers Jeffrey Dean and Sanjay Ghemawat in 2004. The paper described a programming model for processing and generating large datasets that could be distributed across a cluster of computers.
• Creation of Hadoop (2005): Doug Cutting, along with Mike Cafarella, created an open-source implementation of MapReduce in the Java programming language. The project was named Hadoop after Doug's son's toy elephant. Hadoop aimed to provide an open-source, distributed computing framework that could process large datasets.
• Nutch and Yahoo! (2006): Hadoop became an integral part of the Apache Nutch project, an open-source web search engine. Yahoo! showed early interest in Hadoop and became a major contributor to its development.
• Formation of the Apache Hadoop Project (2008): In January 2008, the Apache Software Foundation (ASF) established the Apache Hadoop project, and Hadoop became a top-level Apache project. This move facilitated collaboration and contributions from a broader community.
• Hadoop Distributed File System (HDFS): HDFS, a distributed file system designed to store vast amounts of data across multiple machines, was developed as part of the Hadoop project. HDFS follows the principles outlined in the Google File System (GFS) paper.
• Expansion of the Hadoop Ecosystem: Over time, the Hadoop ecosystem expanded with the introduction of various projects that complemented the core Hadoop framework. Apache Hive (data warehousing), Apache HBase (NoSQL database), Apache Pig (data flow language), Apache Spark (cluster computing), and many others became integral components of the Hadoop ecosystem.
• Hadoop 2.0 and YARN (2013): Hadoop 2.0, released in 2013, introduced the YARN (Yet Another Resource Negotiator) framework. YARN separated resource management and job scheduling/monitoring functions, making Hadoop more versatile and capable of running a broader range of applications.

Analysing Data with Hadoop
• Analysing data with Hadoop involves processing and deriving insights from large datasets using the Hadoop ecosystem, a set of open-source tools designed for distributed storage and processing of big data.
1. Set up a Hadoop Cluster: Install and configure Hadoop on a cluster of machines. Hadoop follows a distributed computing model, and a cluster typically consists of multiple nodes.
2. Store Data in the Hadoop Distributed File System (HDFS): HDFS is the primary storage system in Hadoop. Upload your datasets into HDFS to distribute and replicate the data across the cluster.
3. MapReduce Programming Model: Write MapReduce programs to process and analyze data. MapReduce is a programming model that allows you to process large datasets in parallel across a distributed cluster.
• Mapper: Processes input data and produces intermediate key-value pairs.
• Reducer: Aggregates and processes the intermediate key-value pairs to produce the final result.
Analysing Data with Hadoop
• Hive: Hive provides a high-level SQL-like language called HiveQL, allowing you to query data stored in Hadoop. It translates queries into MapReduce jobs.
• Pig: Pig is a scripting language designed for processing and analyzing large datasets. Pig scripts are translated into a series of MapReduce jobs.
• Apache Spark: Apache Spark is a fast and general-purpose cluster computing system that can be integrated with Hadoop. It provides higher-level APIs in Java, Scala, Python, and R.
• Spark enables in-memory data processing and supports interactive queries, iterative algorithms, and stream processing.
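As an illustration of Spark's Python API, the sketch below counts words in a text file with PySpark. The application name and input path are placeholders; it assumes PySpark is installed, and on a real cluster the path could be an hdfs:// URI and the session would be configured to run on YARN.

from pyspark.sql import SparkSession  # pip install pyspark

# Start (or reuse) a Spark session; locally this runs in-process.
spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()

# Placeholder path; on a cluster this could be hdfs://... instead.
lines = spark.read.text("input.txt").rdd.map(lambda row: row[0])

counts = (
    lines.flatMap(lambda line: line.split())   # one record per word
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # sum the counts per word
)

for word, count in counts.collect():
    print(word, count)

spark.stop()

Because Spark keeps intermediate results in memory, iterative and interactive workloads like this typically run much faster than equivalent chains of MapReduce jobs.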
Analysing Data with Hadoop
• Data Visualization: Use tools like Apache Zeppelin or Jupyter notebooks to create visualizations and reports based on the analyzed data.
• Hadoop Ecosystem Tools: Leverage other tools in the Hadoop ecosystem for specific tasks:
1. HBase: A NoSQL database for real-time read/write access to Hadoop data.
2. Sqoop: Transfers data between Hadoop and relational databases.
3. Flume: Collects, aggregates, and moves large amounts of log data to Hadoop.
• Optimization and Performance Tuning: Tune Hadoop configurations, optimize MapReduce jobs, and adjust cluster settings for better performance.
• Scaling: Hadoop is designed to scale horizontally. As data volumes grow, add more nodes to the cluster to handle the increased processing demands.

IBM Big Data Strategy
• IBM has been a significant player in the big data and analytics space, offering a range of solutions and services to help organizations manage, analyze, and derive insights from large volumes of data.

InfoSphere BigInsights
• IBM InfoSphere BigInsights is an analytics platform designed for processing and analyzing large volumes of structured and unstructured data.
• It is built on open-source Apache Hadoop and includes additional tools and capabilities to simplify big data analytics.
• Hadoop Ecosystem Integration: BigInsights leverages the Apache Hadoop ecosystem, providing distributed storage and processing capabilities for big data.
• Analytical Tools: The platform includes various tools for data exploration, analysis, and visualization, allowing users to derive insights from diverse datasets.
• Advanced Analytics: BigInsights supports advanced analytics, including machine learning and predictive analytics, to uncover patterns and trends in data.
• Security and Governance: It includes features for securing data and ensuring compliance with regulatory requirements, including access controls, encryption, and auditing capabilities.
• Integration with IBM and Open Source Tools: BigInsights integrates with other IBM products and open-source tools, providing flexibility in the choice of programming languages and frameworks.

BigSheets
• IBM BigSheets is a component of InfoSphere BigInsights designed to simplify the exploration and analysis of large datasets without requiring extensive programming skills. It provides a spreadsheet-like interface for users to interact with and analyze data.
• Spreadsheet Interface: BigSheets offers a familiar spreadsheet-like interface that enables users to perform data exploration and analysis using point-and-click interactions.
• Data Exploration: Users can explore and analyze large datasets by applying filters, aggregations, and transformations through a visual interface.
• Integration with BigInsights: BigSheets is tightly integrated with InfoSphere BigInsights, allowing users to leverage the underlying Hadoop-based analytics platform for processing and querying large volumes of data.
• Visualization: Users can create visualizations of data directly within BigSheets to better understand patterns and trends.
• Data Enrichment: BigSheets supports the enrichment of data through external sources, enhancing its analysis capabilities.
• Collaboration: Users can share and collaborate on BigSheets workbooks, facilitating collaborative data analysis within a team.

Hadoop Streaming
• Hadoop Streaming is a utility that comes with Apache Hadoop, a distributed storage and processing framework.
• It allows users to create and run MapReduce jobs with any executable or script as the mapper and/or reducer.
• Mapper and Reducer Execution: In Hadoop Streaming, mappers and reducers can be implemented using any executable or script (e.g., Python, Perl, or Ruby). This flexibility allows users to leverage their preferred programming languages for processing data.
• Input and Output Formats: Hadoop Streaming uses standard input and output streams for communication between the Hadoop framework and the user's mapper and reducer scripts. Each line of input to the mapper is treated as a separate record, and the output of the mapper is likewise treated as input for the reducer.
• Command-Line Interface: Users specify the mapper and reducer scripts along with input and output paths using the Hadoop Streaming command-line interface.

Hadoop Streaming: Data Flow
• Input Data: Input data is typically stored in the Hadoop Distributed File System (HDFS) and is divided into fixed-size blocks. Each block is processed by a separate mapper.
• Intermediate Data: The output of each mapper is partitioned and sorted based on keys. This sorted intermediate data is then passed to the reducers for further processing.
• Output Data: The final output is stored in HDFS or another specified location. Each reducer produces a part of the final output, and these parts are combined to form the complete result.

Hadoop Streaming: Use Cases
• Hadoop Streaming is particularly useful when existing code or scripts can be easily adapted to the MapReduce paradigm without the need for a full Java implementation.
• It provides a bridge between traditional, non-Java applications and the Hadoop ecosystem, making it accessible to a wider range of users.
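To make the stdin/stdout contract concrete, here is a minimal word-count sketch for Hadoop Streaming with the mapper and reducer written as two Python scripts. The file names, HDFS paths, and the location of the streaming jar are placeholders that vary by installation.

# mapper.py - emits "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py - sums the counts per word; Hadoop sorts mapper output by key
# before it reaches the reducer, so equal keys arrive on consecutive lines.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A typical, installation-dependent invocation would then look like:

hadoop jar /path/to/hadoop-streaming.jar -files mapper.py,reducer.py -mapper "python3 mapper.py" -reducer "python3 reducer.py" -input /data/input -output /data/output

Because the scripts only read lines from stdin and write lines to stdout, the same pair can be tested locally with a pipeline such as: cat sample.txt | python3 mapper.py | sort | python3 reducer.py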