Unit 1 Handouts
Big Data analysis uncovers social trends, sentiment, and user behaviour insights.
4. E-commerce and Last-Mile Delivery Innovations: The growth of e-commerce has driven innovations in last-mile delivery, including drones, autonomous vehicles, and smart logistics. These technologies are converging to create more efficient, cost-effective, and environmentally friendly delivery methods, transforming the retail and logistics industries.

5. Blockchain and Supply Chain Management: The convergence of blockchain technology with supply chain management is enhancing transparency, traceability, and security in global supply chains. By creating an immutable and decentralized ledger of transactions, blockchain ensures the authenticity and integrity of products as they move through the supply chain, reducing fraud and enhancing trust.

6. 5G Connectivity and Augmented Reality (AR)/Virtual Reality (VR): The rollout of 5G networks is enabling high-speed, low-latency connectivity, which is crucial for immersive technologies like AR and VR. This convergence is driving the development of new entertainment experiences, remote collaboration tools, and training simulations.

7. Environmental Sustainability and Circular Economy: The convergence of environmental sustainability efforts with the circular economy concept aims to minimize waste, promote recycling, and extend the lifespan of products. This approach is reshaping industries by focusing on designing products for durability, repairability, and recyclability.

UNSTRUCTURED DATA
Unstructured data refers to information that does not have a pre-defined data model or organized structure. Unlike structured data, which fits neatly into traditional databases and tables, unstructured data lacks a specific format, making it more challenging to process and analyze using conventional methods. Unstructured data can come from a variety of sources and formats, including text, images, audio, video, social media posts, sensor data, and more.

Some common examples of unstructured data:
1. Text Data: This includes documents, emails, web pages, social media posts, and any other textual content. Unstructured text data can be challenging to analyze due to variations in language, grammar, and context.
2. Images and Videos: Image files and video recordings contain visual content that cannot be directly stored in tabular databases. Analyzing images and videos often involves techniques such as computer vision and pattern recognition.
3. Audio Recordings: Audio data, such as voice recordings, podcasts, and music tracks, falls into the category of unstructured data. Speech recognition and audio analysis are used to extract insights from this type of data.
4. Sensor Data: Data collected from various sensors, such as those in IoT devices or scientific instruments, often lacks a predefined structure. This data can include temperature readings, GPS coordinates, and more.
5. Social Media Feeds: Posts, comments, likes, and shares on social media platforms generate vast amounts of unstructured data. Analyzing sentiment, trends, and user behavior from social media requires specialized techniques.
6. Free-Form Surveys: Responses from open-ended survey questions provide valuable qualitative data but are unstructured and need processing to derive meaningful insights.
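To make the structured/unstructured distinction concrete, here is a minimal sketch that imposes tabular structure (document, word, count rows) on free-form text. The email snippet and the record layout are invented for illustration; real text-analysis pipelines would add tokenization rules, stop-word removal, and language handling.

```python
from collections import Counter
import re

def text_to_records(doc_id, text):
    """Turn unstructured text into structured (doc_id, word, count) rows."""
    # Lowercase and extract word-like tokens; punctuation and digits are dropped.
    words = re.findall(r"[a-z']+", text.lower())
    # Count occurrences and emit one structured row per distinct word.
    return [(doc_id, word, count) for word, count in Counter(words).items()]

rows = text_to_records("email-001", "Meeting moved. The meeting is now at 3 PM.")
```

Once the text is in row form, it can be loaded into an ordinary relational table and queried like any structured dataset.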
INDUSTRY EXAMPLES OF BIG DATA
Big Data has made a significant impact across various industries by providing insights, optimizing operations, and enabling data-driven decision-making.

1. Retail and E-commerce: Retailers use Big Data to analyze customer purchase patterns, preferences, and behavior. This helps in personalizing marketing campaigns, optimizing inventory management, and improving supply chain efficiency. E-commerce platforms also utilize Big Data for product recommendations and targeted advertising.

2. Healthcare and Life Sciences: Big Data plays a crucial role in medical research, drug development, and patient care. It aids in genomics research, analyzing patient data for personalized treatments, predicting disease outbreaks, and managing health records efficiently.

3. Finance and Banking: Financial institutions use Big Data for fraud detection, risk assessment, algorithmic trading, and customer segmentation. Analyzing transaction data helps detect unusual patterns indicative of fraudulent activity, while customer data informs the development of personalized financial products and services.

4. Telecommunications: Telecommunication companies analyze call records, network data, and customer interactions to optimize network performance, enhance customer experiences, and develop targeted marketing strategies.

5. Manufacturing and Industry 4.0: In manufacturing, Big Data is utilized for predictive maintenance, quality control, and supply chain optimization. Sensors and IoT devices collect data from machinery, which is then analyzed to prevent equipment failures and streamline production processes.

6. Energy and Utilities: Big Data assists in optimizing energy consumption, monitoring power grids, and managing renewable energy sources. Analyzing data from smart meters helps consumers and utilities track and manage energy usage more efficiently.

7. Transportation and Logistics: Transportation companies use Big Data for route optimization, real-time tracking of vehicles and shipments, and demand forecasting. This improves delivery efficiency and reduces operational costs.

8. Media and Entertainment: Big Data aids in content recommendation, audience analysis, and marketing campaign optimization. Streaming services use viewer data to suggest content, while social media platforms analyze user engagement patterns.

9. Agriculture: Agriculture benefits from Big Data through precision farming, where sensor data, satellite imagery, and weather forecasts help optimize crop yield, resource allocation, and pest management.

10. Government and Public Services: Government agencies use Big Data for urban planning, crime analysis, disaster response, and public health monitoring. Analyzing social media data can provide insights into citizen sentiment during emergencies.

11. Insurance: Insurance companies leverage Big Data for risk assessment, claims processing, and customer segmentation. Data analytics help insurers set accurate premiums and improve customer satisfaction.

12. Hospitality and Tourism: In the hospitality industry, Big Data is used for demand forecasting, pricing optimization, and guest personalization. Hotels and travel agencies tailor services based on customer preferences and behaviour.
WEB ANALYTICS
Web analytics is the process of collecting, analyzing, and interpreting data related to the performance of a website or online platform. It involves tracking various metrics and user interactions to gain insights into user behaviour, website effectiveness, and overall digital marketing strategies. Web analytics provides valuable information that can guide decision-making, optimize user experiences, and improve online business outcomes.

Key Aspects of Web Analytics:
1. Data Collection: Web analytics tools gather data about website visitors, their interactions, and their journeys through the site. This data includes information about page views, clicks, conversions, session duration, referral sources, device types, geographic locations, and more.
2. Metrics and KPIs: Web analytics provides a wide range of metrics and key performance indicators (KPIs) that help measure the success of online efforts. Some common metrics include bounce rate (percentage of visitors who leave after viewing only one page), conversion rate (percentage of visitors who take a desired action), average session duration, and exit pages.
3. User Segmentation: Web analytics allows segmentation of website visitors based on various attributes such as demographics, behavior, referral source, or device type. This segmentation helps in understanding different user groups and tailoring strategies accordingly.
4. Conversion Tracking: Tracking conversions is a critical aspect of web analytics. Conversions can include actions like purchases, sign-ups, downloads, or any other goals set by the website owner. Analyzing conversion funnels helps identify points of friction and optimization opportunities.
5. A/B Testing: Web analytics supports A/B testing (also known as split testing), which involves comparing two versions of a webpage or element to determine which one performs better in terms of user engagement or conversions.
6. User Flow Analysis: User flow analysis visually represents the path users take through a website, showing entry and exit points, navigation patterns, and the most common paths users follow.
7. Heatmaps and Click Tracking: These tools provide visual representations of where users click or interact the most on a webpage. Heatmaps help identify user engagement patterns and areas of interest.
8. Real-Time Monitoring: Web analytics tools often offer real-time monitoring of website traffic, allowing you to see how visitors are interacting with your site at any given moment.
9. Goal and Event Tracking: Beyond conversions, web analytics can track specific user interactions, such as clicks on specific buttons, video plays, or downloads.
10. Content Analysis: Web analytics helps assess the performance of different types of content (articles, videos, images) by measuring engagement and interactions.
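As a concrete illustration of the metrics in point 2, the KPIs can be computed directly from raw session records. The session schema below (pages viewed, duration, converted flag) is a hypothetical minimal example, not any particular analytics tool's format.

```python
def summarize_sessions(sessions):
    """Compute common web-analytics KPIs from a list of session dicts."""
    n = len(sessions)
    # Bounce: a session where the visitor viewed only one page.
    bounce_rate = sum(1 for s in sessions if s["pages"] == 1) / n
    # Conversion: a session where the visitor completed the goal action.
    conversion_rate = sum(1 for s in sessions if s["converted"]) / n
    avg_duration = sum(s["seconds"] for s in sessions) / n
    return {"bounce_rate": bounce_rate,
            "conversion_rate": conversion_rate,
            "avg_session_duration": avg_duration}

kpis = summarize_sessions([
    {"pages": 1, "seconds": 30,  "converted": False},
    {"pages": 5, "seconds": 240, "converted": True},
    {"pages": 3, "seconds": 90,  "converted": False},
    {"pages": 1, "seconds": 10,  "converted": False},
])
```

With these four invented sessions, half the visits bounce and one in four converts, which is the kind of summary a dashboard would surface per day or per traffic segment.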
Demand Forecasting: Utilizing historical sales data and external factors to predict demand,
optimize inventory, and reduce stockouts.
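As a minimal illustration of the idea, next-period demand can be forecast from historical sales with a simple moving average. This is a deliberate baseline sketch; production forecasters fold in seasonality, promotions, and the external factors mentioned above, and the sales numbers here are invented.

```python
def moving_average_forecast(history, window=3):
    """Forecast next-period demand as the mean of the last `window` observations."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# Five periods of historical unit sales (invented data).
forecast = moving_average_forecast([120, 130, 125, 140, 150], window=3)
```

A forecast like this feeds directly into reorder-point calculations, which is how it helps optimize inventory and reduce stockouts.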
… data across multiple machines. MapReduce: A programming model and processing framework for parallel computation of large datasets. Apache Spark: A fast and flexible data processing framework that supports in-memory processing and a wide range of data analytics tasks.

2. NoSQL Databases:
MongoDB, Cassandra, Couchbase, etc.: Non-relational databases designed for high scalability, flexibility, and performance when handling unstructured or semi-structured data.

3. Data Warehousing:
Amazon Redshift, Google BigQuery, Snowflake, etc.: Cloud-based data warehousing solutions that allow efficient storage, processing, and querying of large datasets.

4. Stream Processing:
Apache Kafka, Apache Flink, Apache Storm, etc.: Technologies for processing and analyzing real-time streaming data from various sources.

… graph-based data structures, useful for analyzing complex relationships.

8. Data Visualization:
Tableau, Power BI, D3.js, etc.: Tools for creating visual representations of data to aid in understanding …

… analytics and high-performance applications.

10. Data Integration and ETL:
Apache NiFi, Talend, Apache Airflow, etc.: Tools for extracting, transforming, and loading data from various sources into a target system or data warehouse.

11. Cloud Services:
Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP): Cloud computing platforms offering various Big Data services, such as storage, processing, and analytics.

12. Data Lakes:
Hadoop-based: Repositories that store vast amounts of raw and processed data, often using Hadoop as a foundation.
Cloud-based: Services like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage for building and managing data lakes in the cloud.
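The extract-transform-load pattern behind tools like NiFi, Talend, and Airflow can be shown with a minimal in-process sketch. The CSV payload is invented and an in-memory SQLite table stands in for the target warehouse; this illustrates the pattern, not any specific tool's API.

```python
import csv
import io
import sqlite3

# Extract: read raw CSV text (stand-in for a source-system export).
raw = "id,amount\n1, 10.5\n2, 20.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: strip whitespace and cast strings to proper types.
clean = [(int(r["id"]), float(r["amount"].strip())) for r in rows]

# Load: insert into a target table and verify with a query.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
db.executemany("INSERT INTO sales VALUES (?, ?)", clean)
total = db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
```

Real ETL tools add the parts this sketch omits: scheduling, incremental loads, error handling, and lineage tracking.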
INTRODUCTION TO HADOOP
Hadoop is an open-source framework designed for storing, processing, and analyzing large datasets across distributed computing clusters. It was developed to address the challenges of working with massive volumes of data, often referred to as Big Data. Hadoop's architecture and components enable organizations to process data in parallel, making it a cornerstone technology for handling complex and large-scale data processing tasks.

Key Components of Hadoop:
2. MapReduce: MapReduce is a programming model and processing framework for parallel computation.
It breaks down data processing tasks into two main steps: the "map" phase, where data is processed in
parallel across nodes, and the "reduce" phase, where results are aggregated.
3. YARN (Yet Another Resource Negotiator): YARN is a resource management platform that manages
computing resources in a Hadoop cluster. It allows various applications to share and allocate resources
dynamically.
4. Hadoop Common: Hadoop Common contains essential libraries and utilities needed by other Hadoop
components. It provides tools for managing and interacting with Hadoop clusters.
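The map and reduce phases described above can be sketched in plain Python with the classic word-count example. This simulates the phases in a single process for clarity; it is not the Hadoop Java API, and the shuffle step here stands in for the grouping the framework performs between the two phases.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit (word, 1) pairs for each word in one input split."""
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate all values emitted for one key."""
    return key, sum(values)

# Two input splits that real Hadoop would process on different nodes.
splits = ["big data big insights", "big clusters"]
intermediate = chain.from_iterable(map_phase(s) for s in splits)
counts = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
```

In a real cluster the map calls run in parallel across nodes and the shuffle moves data over the network, but the dataflow is exactly this.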
Hadoop: HDFS
… Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends a health report to the resource manager from time to time.

Apache Spark: A fast and flexible data processing framework that supports in-memory processing and a wide range of data analytics tasks.
Apache Kafka: A distributed streaming platform for building real-time data pipelines and streaming applications.
Apache Flink: A stream processing framework for high-throughput, low-latency data processing.
Recommendation systems
6. Real-Time Analytics: Cloud-based platforms can handle real-time data processing and analytics, enabling
organizations to make informed decisions in near real-time based on streaming data.
7. Global Accessibility: Cloud-based Big Data solutions enable teams to collaborate on data analysis projects regardless
of their geographical location. This is particularly useful for organizations with distributed teams or partners.
8. Managed Services: Cloud providers offer managed Big Data services that handle various aspects of data processing
and analysis, allowing organizations to focus on deriving insights rather than managing infrastructure.
7. Personalization: Users can customize their mobile BI experience by selecting the specific data, metrics, and visualizations that are most relevant to their roles and responsibilities.

Benefits of Mobile Business Intelligence:
1. Increased Accessibility: Decision-makers can access business data and insights from anywhere, enabling them to make informed decisions on the go.
2. Timely Decision-Making: Real-time access to data allows for faster decision-making, especially when time-sensitive choices need to be made.
3. Enhanced Productivity: Mobile BI empowers users to stay productive by analyzing data and generating insights without being tied to a desk.
4. Improved Collaboration: Sharing and collaborating on data and reports becomes easier, fostering better communication among team members.
5. Better User Adoption: The user-friendly and intuitive nature of mobile apps encourages broader user adoption of BI tools across an organization.
6. Data-Driven Culture: Mobile BI contributes to a data-driven culture by providing easy access to data and encouraging data-driven decision-making at all levels.

Use Cases of Mobile Business Intelligence:
1. Sales and Marketing: Sales teams can access real-time sales data, track performance metrics, and analyze customer trends while in the field.
2. Executive Dashboards: Business executives can monitor key performance indicators (KPIs) and business metrics on their mobile devices.
3. Field Service: Field service professionals can access job-related data, schedules, and customer information, improving service efficiency.
4. Supply Chain Management: Supply chain managers can track inventory levels, monitor shipments, and analyze supply chain performance remotely.
5. Retail Analytics: Retailers can track sales, inventory, and customer behavior to make informed merchandising and pricing decisions.

CROWD SOURCING ANALYTICS
Crowdsourcing analytics refers to the practice of harnessing the collective intelligence, skills, and input of a large group of people (the "crowd") to perform various data analysis tasks. It involves outsourcing data analysis tasks to a diverse group of individuals, often through online platforms or communities, to collectively solve complex problems, generate insights, and produce meaningful results. Crowdsourcing analytics can offer unique perspectives, expertise, and scalability that traditional data analysis methods may not achieve.

Key Aspects of Crowdsourcing Analytics:
1. Task Distribution: Organizations break down complex data analysis tasks into smaller, more manageable units that can be distributed to a large number of participants in the crowd.
2. Diverse Expertise: Crowdsourcing can tap into a wide range of skills and expertise from individuals with diverse backgrounds, enabling multidisciplinary insights and creative problem-solving.
3. Scalability: Crowdsourcing provides the ability to scale up data analysis efforts rapidly by involving a large number of contributors working concurrently.
4. Rapid Turnaround: With many contributors working simultaneously, crowdsourcing can often achieve faster results than traditional methods.
5. Cost-Effectiveness: Crowdsourcing can be a cost-effective way to conduct data analysis, especially for tasks that require a large amount of manual effort.
6. Innovation: The diverse perspectives and ideas from the crowd can lead to innovative solutions and approaches to data analysis challenges.
7. Data Annotation and Labeling: Crowdsourcing is commonly used for tasks like annotating or labeling large datasets, which are essential for training machine learning models.
8. Quality Control: Effective crowdsourcing platforms include mechanisms for quality control, such as validation, consensus, and moderation, to ensure the accuracy of results.
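The consensus mechanism described under quality control can be sketched as a simple majority vote over crowd labels. This is a minimal illustration with invented labels; real platforms weight annotators by their track record rather than counting votes equally.

```python
from collections import Counter

def consensus_label(annotations, min_agreement=0.5):
    """Aggregate crowd labels for one item by majority vote.

    Returns the winning label, or None when agreement is too low
    and the item should be escalated to a moderator.
    """
    winner, votes = Counter(annotations).most_common(1)[0]
    if votes / len(annotations) > min_agreement:
        return winner
    return None  # no clear consensus: route to manual review

# Five crowd workers labeled the same image (invented example).
label = consensus_label(["cat", "cat", "dog", "cat", "cat"])
```

Items that return None are exactly the ones where the validation and moderation mechanisms earn their keep.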
- Kaggle is an Australian firm that provides innovative solutions for outsourcing statistical analytics. Organizations that confront complex statistical challenges describe the problems to Kaggle and provide data sets. Kaggle converts the problems and the data into contests that are posted on its website. The contests feature cash prizes ranging in value from $100 to $3 million. Kaggle's clients range in size from tiny start-ups to multinational corporations such as Ford Motor Company and government agencies such as NASA. The idea is that someone comes to Kaggle with a problem, they put it up on their website, and then people from all over the world can compete to see who can produce the best solution. In essence, Kaggle has developed an effective global platform for crowdsourcing complex analytic problems.
- 99designs.com does crowdsourcing of graphic design.
- Agentanything.com posts missions where agents are invited to do various jobs.
- 33needs.com allows people to contribute to charitable programs to make social impact.

Inter-Firewall Analytics:
Inter-firewall analytics involve the examination and monitoring of network traffic that moves between different segments of a network, each protected by its own firewall or security perimeter. This analysis focuses on understanding the communication patterns and potential threats that emerge when data crosses these security boundaries. It aims to detect anomalies, unauthorized access, or malicious activities that might occur during data transfer between different zones.

Key aspects of inter-firewall analytics include:
1. Traffic Monitoring: Monitoring and analyzing data flows between different security zones or segments of a network.
2. Anomaly Detection: Detecting unusual or suspicious traffic patterns that might indicate unauthorized access or malicious activity.
3. Access Control Verification: Ensuring that access controls and security policies are consistently enforced across different zones.
4. Intrusion Detection and Prevention: Identifying and mitigating potential intrusion attempts or security breaches that occur when data crosses firewall boundaries.
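As a toy illustration of the anomaly-detection aspect, per-interval traffic volumes crossing a firewall boundary can be screened with a simple z-score rule. The byte counts and threshold below are invented, and real systems use far richer features (ports, protocols, flow direction) and models than a single statistic.

```python
import statistics

def flag_anomalies(byte_counts, threshold=2.0):
    """Return indices of cross-zone transfer intervals whose volume
    deviates from the mean by more than `threshold` standard deviations."""
    mean = statistics.fmean(byte_counts)
    stdev = statistics.pstdev(byte_counts)
    if stdev == 0:
        return []  # perfectly uniform traffic: nothing to flag
    return [i for i, b in enumerate(byte_counts)
            if abs(b - mean) / stdev > threshold]

# Bytes observed per interval at a zone boundary; one obvious spike (invented).
spikes = flag_anomalies([1200, 1100, 1300, 1250, 98000, 1150])
```

Flagged intervals would then be correlated with access-control logs to decide whether the spike is a backup job or an exfiltration attempt.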
Trans-Firewall Analytics:
Trans-firewall analytics extend the analysis to include data that moves between different networks or
security domains, potentially involving external entities. This type of analysis focuses on understanding the
behavior and risks associated with data flows that traverse not only internal network boundaries but also
external connections.
1. External Threat Detection: Identifying and mitigating threats that might arise when data enters or leaves the organization's network, interacting with external entities.
2. Third-Party Risk Management: Assessing the security of connections and interactions with external partners, vendors, or service providers.
3. Malware and Threat Detection: Detecting potential malware, viruses, or other malicious content that might be introduced from external sources.