
CCS334 BIG DATA ANALYTICS

UNIT I UNDERSTANDING BIG DATA
 Introduction to big data
 Convergence of key trends
 Unstructured data
 Industry examples of big data
 Web analytics
 Big data applications
 Big data technologies
 Introduction to Hadoop
 Open source technologies
 Cloud and big data
 Mobile business intelligence
 Crowd sourcing analytics
 Inter and trans firewall analytics

INTRODUCTION TO BIG DATA

Big Data refers to the massive volume of structured, semi-structured, and unstructured data that is generated at an unprecedented rate in our digital world. Data comes from various sources, including sensors, social media, mobile devices, websites, and more.

The term "Big Data" not only refers to the volume of data but also encompasses the challenges and opportunities associated with capturing, storing, managing, and analyzing such vast and complex datasets.

Key Characteristics of Big Data

1. Volume: Big Data involves enormous amounts of data that can range from terabytes to petabytes and beyond. Traditional data management systems are inadequate for handling these massive datasets.
2. Velocity: Data is generated and collected at high speeds, often in real time or near real time. This rapid data flow requires efficient processing and analysis to derive timely insights.
3. Variety: Big Data encompasses diverse types of data, including structured data (e.g., databases), semi-structured data (e.g., XML, JSON), and unstructured data (e.g., text, images, videos). Unstructured data refers to information that does not have a predefined data model or is not organized in a predefined manner. Managing this variety requires flexible data storage and processing methods.
4. Value: Extracting value from Big Data involves discovering insights, patterns, trends, and correlations that can lead to decision-making and new business opportunities.
5. Veracity: Ensuring the accuracy, reliability, and quality of Big Data can be challenging due to data inconsistencies, errors, and biases. Verifying and cleaning data is a crucial step in the analysis process (a minimal cleaning sketch follows the challenges list below).

Challenges and Opportunities of Big Data:

1. Storage and Management: Storing and managing large volumes of data requires scalable and cost-effective solutions, such as distributed databases, data lakes, and cloud storage.
2. Processing: Traditional data processing tools may struggle to handle the speed and complexity of Big Data. Distributed computing frameworks like Hadoop and Spark have emerged to address these challenges.
3. Analysis and Interpretation: Extracting meaningful insights from Big Data requires advanced analytics techniques, including machine learning, data mining, and natural language processing.
4. Privacy and Security: Managing and protecting sensitive data in compliance with privacy regulations is a critical concern when dealing with Big Data.
5. Resource Allocation: Optimizing resources such as computational power and storage capacity is essential to efficiently process and analyze Big Data.
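To make the veracity step concrete, here is a minimal data-cleaning sketch in Python with pandas. The file name (transactions.csv) and column names (customer_id, amount) are hypothetical placeholders, not part of these notes.

```python
import pandas as pd

# Load a raw dataset (hypothetical file and column names).
df = pd.read_csv("transactions.csv")

# Remove exact duplicate records.
df = df.drop_duplicates()

# Drop rows missing essential fields.
df = df.dropna(subset=["customer_id", "amount"])

# Coerce a numeric column: invalid entries become NaN and are dropped.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df = df[df["amount"] > 0]

print(f"{len(df)} clean rows ready for analysis")
```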

Applications of Big Data:

1. Business and Marketing: Big Data is used for customer segmentation, predictive analytics, market trend analysis, and personalized marketing campaigns.
2. Healthcare: Big Data is leveraged for patient data analysis, drug discovery, genomics research, and disease outbreak prediction.
3. Finance: Big Data is applied in fraud detection, risk assessment, algorithmic trading, and credit scoring.
4. Transportation: Big Data helps optimize routes, manage traffic congestion, and enhance public transportation systems.
5. Energy: Big Data is used for smart grid management, renewable energy optimization, and energy consumption analysis.
6. Manufacturing: Big Data enables predictive maintenance, quality control, and supply chain optimization.
7. Social Media: Big Data analysis uncovers social trends, sentiment analysis, and user behaviour insights.

CONVERGENCE OF KEY TRENDS

The convergence of key trends refers to the intersection and blending of multiple significant developments in various fields, industries, or technologies. This convergence often results in new opportunities, disruptions, and transformative changes that have a profound impact on how we live, work, and interact. Let's explore a few examples of the convergence of key trends:

1. Internet of Things (IoT) and Artificial Intelligence (AI): The combination of IoT and AI is leading to the creation of "smart" systems that can collect, analyze, and act upon vast amounts of data in real time. For instance, connected devices (IoT) can gather data from the environment, which is then processed by AI algorithms to make informed decisions or trigger automated actions. This convergence is driving the development of smart cities, industrial automation, and personalized healthcare.
2. HealthTech and Data Analytics: The integration of health technology (HealthTech) with advanced data analytics is transforming healthcare. Wearable devices, electronic health records, and medical sensors collect patient data, which is then analyzed using AI and machine learning to identify patterns, diagnose diseases, and predict health outcomes. This convergence is leading to personalized medicine and more effective patient care.
3. Renewable Energy and Energy Storage: The convergence of advancements in renewable energy sources (such as solar and wind) with energy storage technologies (such as batteries) is revolutionizing the energy sector. Energy storage solutions help smooth out the intermittent output of renewable sources. This convergence is accelerating the adoption of clean energy and reducing reliance on fossil fuels.
4. E-commerce and Last-Mile Delivery Innovations: The growth of e-commerce has driven innovations in last-mile delivery, including drones, autonomous vehicles, and smart logistics. These technologies are converging to create more efficient, cost-effective, and environmentally friendly delivery methods, transforming the retail and logistics industries.
5. Blockchain and Supply Chain Management: The convergence of blockchain technology with supply chain management is enhancing transparency, traceability, and security in global supply chains. By creating an immutable and decentralized ledger of transactions, blockchain ensures the authenticity and integrity of products as they move through the supply chain, reducing fraud and enhancing trust.
6. 5G Connectivity and Augmented Reality (AR)/Virtual Reality (VR): The rollout of 5G networks is enabling high-speed, low-latency connectivity, which is crucial for immersive technologies like AR and VR. This convergence is driving the development of new entertainment experiences, remote collaboration tools, and training simulations.
7. Environmental Sustainability and Circular Economy: The convergence of environmental sustainability efforts with the circular economy concept aims to minimize waste, promote recycling, and extend the lifespan of products. This approach is reshaping industries by focusing on designing products for durability, repairability, and recyclability.

UNSTRUCTURED DATA

Unstructured data refers to information that does not have a pre-defined data model or organized structure. Unlike structured data, which fits neatly into traditional databases and tables, unstructured data lacks a specific format, making it more challenging to process and analyze using conventional methods. Unstructured data can come from a variety of sources and formats, including text, images, audio, video, social media posts, sensor data, and more.

Some common examples of unstructured data:

1. Text Data: This includes documents, emails, web pages, social media posts, and any other textual content. Unstructured text data can be challenging to analyze due to variations in language, grammar, and context (a small tokenizing sketch follows this list).
2. Images and Videos: Image files and video recordings contain visual content that cannot be directly stored in tabular databases. Analyzing images and videos often involves techniques such as computer vision and pattern recognition.
3. Audio Recordings: Audio data, such as voice recordings, podcasts, and music tracks, falls into the category of unstructured data. Speech recognition and audio analysis are used to extract insights from this type of data.
4. Sensor Data: Data collected from various sensors, such as those in IoT devices or scientific instruments, often lacks a predefined structure. This data can include temperature readings, GPS coordinates, and more.
5. Social Media Feeds: Posts, comments, likes, and shares on social media platforms generate vast amounts of unstructured data. Analyzing sentiment, trends, and user behavior from social media requires specialized techniques.
6. Free-Form Surveys: Responses from open-ended survey questions provide valuable qualitative data but are unstructured and need processing to derive meaningful insights.
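As a small illustration of getting a first signal out of unstructured text, the sketch below tokenizes raw documents and counts word frequencies in plain Python; the sample strings and stopword list are invented for the example.

```python
import re
from collections import Counter

# Hypothetical unstructured text, e.g. social media posts or survey answers.
documents = [
    "Great product, fast delivery!",
    "Delivery was slow but the product is great.",
    "Terrible support. Product arrived broken.",
]

STOPWORDS = {"the", "is", "was", "but", "a", "an", "and"}

counts = Counter()
for doc in documents:
    # Lowercase and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", doc.lower())
    counts.update(t for t in tokens if t not in STOPWORDS)

print(counts.most_common(5))  # e.g. [('product', 3), ('delivery', 2), ...]
```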

Why Unstructured Data Matters:

Despite its lack of structure, unstructured data holds immense value and insights. Many organizations recognize the importance of tapping into unstructured data to gain a more comprehensive understanding of their operations, customers, and markets. Here's why unstructured data matters:

1. Rich Insights: Unstructured data often contains valuable insights, patterns, and trends that might not be apparent in structured data alone.
2. Holistic Understanding: Analyzing unstructured data along with structured data can provide a more complete view of a situation or phenomenon.
3. Innovation: Extracting knowledge from unstructured data can lead to innovative products, services, and solutions. For example, sentiment analysis of customer reviews can guide product improvements.
4. Competitive Advantage: Organizations that effectively harness unstructured data can gain a competitive edge by making informed decisions and anticipating market trends.

Challenges of Unstructured Data:

While unstructured data offers valuable opportunities, it presents challenges as well:

1. Data Volume: Unstructured data can be vast, making storage, processing, and analysis resource-intensive.
2. Data Quality: Ensuring the accuracy and relevance of unstructured data can be difficult, as it may contain noise, errors, or biases.
3. Processing Complexity: Traditional data processing methods are often insufficient for handling unstructured data. Specialized tools and techniques are required.
4. Contextual Understanding: Interpreting the context and meaning of unstructured text or media data can be complex, requiring natural language processing and other advanced techniques.

INDUSTRY EXAMPLES OF BIG DATA

Big Data has made a significant impact across various industries by providing insights, optimizing operations, and enabling data-driven decision-making.

1. Retail and E-commerce: Retailers use Big Data to analyze customer purchase patterns, preferences, and behavior. This helps in personalizing marketing campaigns, optimizing inventory management, and improving supply chain efficiency. E-commerce platforms also utilize Big Data for product recommendations and targeted advertising.
2. Healthcare and Life Sciences: Big Data plays a crucial role in medical research, drug development, and patient care. It aids in genomics research, analyzing patient data for personalized treatments, predicting disease outbreaks, and managing health records efficiently.
3. Finance and Banking: Financial institutions use Big Data for fraud detection, risk assessment, algorithmic trading, and customer segmentation. Analyzing transaction data helps detect unusual patterns indicative of fraudulent activity, while customer data informs the development of personalized financial products and services.
4. Telecommunications: Telecommunication companies analyze call records, network data, and customer interactions to optimize network performance, enhance customer experiences, and develop targeted marketing strategies.
5. Manufacturing and Industry 4.0: In manufacturing, Big Data is utilized for predictive maintenance, quality control, and supply chain optimization. Sensors and IoT devices collect data from machinery, which is then analyzed to prevent equipment failures and streamline production processes.
6. Energy and Utilities: Big Data assists in optimizing energy consumption, monitoring power grids, and managing renewable energy sources. Analyzing data from smart meters helps consumers and utilities track and manage energy usage more efficiently.
7. Transportation and Logistics: Transportation companies use Big Data for route optimization, real-time tracking of vehicles and shipments, and demand forecasting. This improves delivery efficiency and reduces operational costs.
8. Media and Entertainment: Big Data aids in content recommendation, audience analysis, and marketing campaign optimization. Streaming services use viewer data to suggest content, while social media platforms analyze user engagement patterns.
9. Agriculture: Agriculture benefits from Big Data through precision farming, where sensor data, satellite imagery, and weather forecasts help optimize crop yield, resource allocation, and pest management.
10. Government and Public Services: Government agencies use Big Data for urban planning, crime analysis, disaster response, and public health monitoring. Analyzing social media data can provide insights into citizen sentiment during emergencies.
11. Insurance: Insurance companies leverage Big Data for risk assessment, claims processing, and customer segmentation. Data analytics help insurers set accurate premiums and improve customer satisfaction.
12. Hospitality and Tourism: In the hospitality industry, Big Data is used for demand forecasting, pricing optimization, and guest personalization. Hotels and travel agencies tailor services based on customer preferences and behaviour.
WEB ANALYTICS
Web analytics is the process of collecting, analyzing, and interpreting data related to the performance of a website or online platform. It involves tracking various metrics and user interactions to gain insights into user behaviour, website effectiveness, and overall digital marketing strategies. Web analytics provides valuable information that can guide decision-making, optimize user experiences, and improve online business outcomes.

Key Aspects of Web Analytics:

1. Data Collection: Web analytics tools gather data about website visitors, their interactions, and their journeys through the site. This data includes information about page views, clicks, conversions, session duration, referral sources, device types, geographic locations, and more.
2. Metrics and KPIs: Web analytics provides a wide range of metrics and key performance indicators (KPIs) that help measure the success of online efforts. Some common metrics include bounce rate (percentage of visitors who leave after viewing only one page), conversion rate (percentage of visitors who take a desired action), average session duration, and exit pages (see the sketch after this list for how the first two are computed).
3. User Segmentation: Web analytics allows segmentation of website visitors based on various attributes such as demographics, behavior, referral source, or device type. This segmentation helps in understanding different user groups and tailoring strategies accordingly.
4. Conversion Tracking: Tracking conversions is a critical aspect of web analytics. Conversions can include actions like purchases, sign-ups, downloads, or any other goals set by the website owner. Analyzing conversion funnels helps identify points of friction and optimization opportunities.
5. A/B Testing: Web analytics supports A/B testing (also known as split testing), which involves comparing two versions of a webpage or element to determine which one performs better in terms of user engagement or conversions.
6. User Flow Analysis: User flow analysis visually represents the path users take through a website, showing entry and exit points, navigation patterns, and the most common paths users follow.
7. Heatmaps and Click Tracking: These tools provide visual representations of where users click or interact the most on a webpage. Heatmaps help identify user engagement patterns and areas of interest.
8. Real-Time Monitoring: Web analytics tools often offer real-time monitoring of website traffic, allowing you to see how visitors are interacting with your site at any given moment.
9. Goal and Event Tracking: Beyond conversions, web analytics can track specific user interactions, such as clicks on specific buttons, video plays, or downloads.
10. Content Analysis: Web analytics helps assess the performance of different types of content (articles, videos, images) by measuring engagement and interactions.
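To illustrate the metrics in aspect 2, here is a minimal sketch assuming session records are already available as Python dictionaries; the field names pages_viewed and converted are invented for the example.

```python
# Hypothetical session log: one record per visitor session.
sessions = [
    {"pages_viewed": 1, "converted": False},
    {"pages_viewed": 4, "converted": True},
    {"pages_viewed": 2, "converted": False},
    {"pages_viewed": 1, "converted": False},
]

total = len(sessions)
# Bounce rate: share of sessions that viewed only one page.
bounce_rate = sum(s["pages_viewed"] == 1 for s in sessions) / total
# Conversion rate: share of sessions that completed the desired action.
conversion_rate = sum(s["converted"] for s in sessions) / total

print(f"Bounce rate: {bounce_rate:.0%}, conversion rate: {conversion_rate:.0%}")
```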

Popular Web Analytics Tools:

1. Google Analytics: One of the most widely used web analytics platforms, offering a comprehensive set of features for tracking and analyzing website performance.
2. Adobe Analytics: Provides in-depth data analysis and reporting, particularly suited for larger enterprises.
3. Matomo (formerly Piwik): An open-source alternative to Google Analytics, giving users full control over their data.
4. Hotjar: Offers heatmaps, session recordings, and user surveys to understand user behaviour and optimize website experiences.
5. Mixpanel: Focuses on event-based tracking and user segmentation for analyzing user behaviour and engagement.

BIG DATA APPLICATIONS

Big Data applications span a wide range of industries and use cases, leveraging large and complex datasets to extract valuable insights, drive innovation, and make informed decisions. Here are some notable applications of Big Data:

1. Healthcare and Medical Research:
 Genomic Sequencing: Analyzing large genomic datasets to identify genetic variations linked to diseases and personalize treatments.
 Disease Prediction: Predicting disease outbreaks, monitoring public health trends, and improving patient outcomes through data-driven insights.
 Drug Discovery: Using Big Data analytics to identify potential drug candidates, predict drug interactions, and accelerate drug development processes.

2. E-commerce and Retail:
 Customer Behaviour Analysis: Analyzing purchasing patterns, preferences, and behaviours to personalize marketing strategies and enhance customer experiences.
 Demand Forecasting: Utilizing historical sales data and external factors to predict demand, optimize inventory, and reduce stockouts.

3. Finance and Banking:
 Fraud Detection: Detecting fraudulent activities by analyzing transaction patterns and identifying anomalies in real time (a minimal detection sketch follows this list).
 Risk Assessment: Evaluating credit risk, assessing loan eligibility, and making investment decisions using predictive modeling.
 Algorithmic Trading: Analyzing market data and trends to develop algorithmic trading strategies that capitalize on market fluctuations.

4. Transportation and Logistics:
 Route Optimization: Using real-time data to optimize delivery routes, reduce transportation costs, and improve overall supply chain efficiency.
 Traffic Management: Analyzing traffic patterns and congestion data to enhance urban mobility and plan infrastructure improvements.

5. Energy and Utilities:
 Smart Grid Management: Analyzing data from smart meters and sensors to optimize energy distribution, minimize waste, and improve grid reliability.
 Renewable Energy Integration: Balancing energy generation from renewable sources by predicting supply and demand patterns.

6. Manufacturing and Industry 4.0:
 Predictive Maintenance: Analyzing sensor data from machinery to predict equipment failures and optimize maintenance schedules.
 Quality Control: Using real-time data to identify defects and anomalies in production processes, ensuring product quality.

7. Media and Entertainment:
 Content Personalization: Recommending content to users based on their preferences, viewing history, and behavior.
 Audience Engagement: Analyzing social media data and user interactions to tailor marketing campaigns and optimize content distribution.

8. Agriculture and Farming:
 Precision Agriculture: Using data from sensors, satellites, and drones to optimize crop planting, irrigation, and fertilization for higher yields.
 Livestock Management: Monitoring animal health and behavior using sensor data to improve animal welfare and productivity.

9. Urban Planning and Smart Cities:
 City Management: Using data from IoT devices and sensors to enhance urban planning, optimize resource allocation, and improve city services.
 Sustainability: Analyzing energy usage, waste management, and environmental data to develop sustainable city policies.

10. Social Sciences and Research:
 Sentiment Analysis: Analyzing social media and online content to understand public sentiment, opinions, and trends.
 Societal Insights: Studying human behavior and interactions to gain insights into societal patterns and dynamics.
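As a toy illustration of the fraud-detection idea above, this sketch flags transactions that deviate sharply from a customer's historical spending using a z-score; the threshold of 3 and all data are invented for the example.

```python
from statistics import mean, stdev

# Hypothetical spending history and incoming transactions per customer.
history = {"alice": [25.0, 30.0, 27.5, 22.0, 35.0]}
new_transactions = [("alice", 29.0), ("alice", 480.0)]

for customer, amount in new_transactions:
    past = history[customer]
    mu, sigma = mean(past), stdev(past)
    z = (amount - mu) / sigma
    # Flag anything far above this customer's typical spend.
    if z > 3:
        print(f"ALERT: {customer} spent {amount} (z={z:.1f})")
```

Real systems combine many such signals (merchant, location, device) and score them with trained models, but the principle of comparing new events against a learned baseline is the same.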
BIG DATA TECHNOLOGIES

Big Data technologies encompass a wide range of tools, frameworks, and platforms designed to handle and analyze large volumes of data with varying levels of complexity. These technologies are essential for storing, processing, and extracting insights from massive datasets. Here are some prominent Big Data technologies:

1. Hadoop:
 Hadoop Distributed File System (HDFS): A distributed storage system that can store large volumes of data across multiple machines.
 MapReduce: A programming model and processing framework for parallel computation of large datasets.
 Apache Spark: A fast and flexible data processing framework that supports in-memory processing and a wide range of data analytics tasks (a short PySpark sketch follows this list).
2. NoSQL Databases:
 MongoDB, Cassandra, Couchbase, etc.: Non-relational databases designed for high scalability, flexibility, and performance when handling unstructured or semi-structured data.
3. Data Warehousing:
 Amazon Redshift, Google BigQuery, Snowflake, etc.: Cloud-based data warehousing solutions that allow efficient storage, processing, and querying of large datasets.
4. Stream Processing:
 Apache Kafka, Apache Flink, Apache Storm, etc.: Technologies for processing and analyzing real-time streaming data from various sources.
5. Machine Learning Frameworks:
 TensorFlow, PyTorch, scikit-learn, etc.: Libraries and frameworks for building and training machine learning models on large datasets.
6. Distributed Computing:
 Apache Mesos, Kubernetes: Platforms for managing and orchestrating the deployment of applications and services in a distributed environment.
7. Graph Databases:
 Neo4j, Amazon Neptune, JanusGraph, etc.: Databases optimized for storing and querying graph-based data structures, useful for analyzing complex relationships.
8. Data Visualization:
 Tableau, Power BI, D3.js, etc.: Tools for creating visual representations of data to aid in understanding and insights.
9. In-Memory Databases:
 Redis, Apache Ignite: Databases that store data in-memory, providing fast access for real-time analytics and high-performance applications.
10. Data Integration and ETL:
 Apache NiFi, Talend, Apache Airflow, etc.: Tools for extracting, transforming, and loading data from various sources into a target system or data warehouse.
11. Cloud Services:
 Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP): Cloud computing platforms offering various Big Data services, such as storage, processing, and analytics.
12. Data Lakes:
 Hadoop-based: Repositories that store vast amounts of raw and processed data, often using Hadoop as a foundation.
 Cloud-based: Services like Amazon S3, Azure Data Lake Storage, and Google Cloud Storage for building and managing data lakes in the cloud.
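To ground item 1, here is a minimal PySpark sketch that counts words in a text file; the input path logs.txt is a hypothetical placeholder, and the job assumes a local Spark installation.

```python
from pyspark.sql import SparkSession

# Start a Spark session.
spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file into an RDD of lines (hypothetical path).
lines = spark.sparkContext.textFile("logs.txt")

# Classic MapReduce-style word count, executed in parallel by Spark.
counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```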

INTRODUCTION TO HADOOP

Hadoop is an open-source framework designed for storing, processing, and analyzing large datasets across distributed computing clusters. It was developed to address the challenges of working with massive volumes of data, often referred to as Big Data. Hadoop's architecture and components enable organizations to process data in parallel, making it a cornerstone technology for handling complex and large-scale data processing tasks.

Key Components of Hadoop:

1. Hadoop Distributed File System (HDFS): HDFS is a storage system that divides large files into smaller blocks and distributes them across multiple machines (nodes) in a cluster. This approach provides fault tolerance, high availability, and efficient data storage.
2. MapReduce: MapReduce is a programming model and processing framework for parallel computation. It breaks down data processing tasks into two main steps: the "map" phase, where data is processed in parallel across nodes, and the "reduce" phase, where results are aggregated.
3. YARN (Yet Another Resource Negotiator): YARN is a resource management platform that manages computing resources in a Hadoop cluster. It allows various applications to share and allocate resources dynamically.
4. Hadoop Common: Hadoop Common contains essential libraries and utilities needed by other Hadoop components. It provides tools for managing and interacting with Hadoop clusters.

Hadoop: HDFS

HDFS has two core components, i.e. the NameNode and the DataNode.

The NameNode is the main node, and it doesn't store the actual data. It contains metadata, i.e. the data about the data (file name, size, information about block locations (block numbers, block IDs), transaction logs, etc.). It therefore requires less storage but high computational resources.

Data blocks are stored on the DataNodes, which therefore require more storage resources.
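From a client's point of view this split is invisible: files are simply written to and listed from HDFS. A minimal sketch, assuming a configured cluster, driving the standard hdfs dfs shell client from Python; the paths /data and sales.csv are hypothetical.

```python
import subprocess

# Copy a local file into HDFS and list the target directory.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "sales.csv", "/data/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)
```

Behind these calls, the NameNode records the file's metadata and block placement while the DataNodes receive and replicate the actual blocks.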
Hadoop: Map Reduce
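The MapReduce flow can be made concrete with a word-count job in the Hadoop Streaming style, where the mapper and reducer are plain scripts reading stdin and writing stdout. This is a minimal sketch, not the full job configuration.

```python
import sys
from itertools import groupby

def mapper():
    # Map phase: emit a (word, 1) pair for every word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: Hadoop delivers pairs sorted by key, so consecutive
    # lines with the same word can be summed with groupby.
    pairs = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    # Run as `python wordcount.py map` or `python wordcount.py reduce`.
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Hadoop Streaming runs the mapper on each input split in parallel, sorts the intermediate pairs by key, and feeds them to the reducers, which aggregate the final counts.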

Hadoop: YARN

Resource Manager: The master daemon of YARN, responsible for resource assignment and management among all the applications.

Node Manager: Takes care of an individual node in the Hadoop cluster and manages the applications and workflow of that particular node. It registers with the Resource Manager and sends heartbeats with the health status of the node. It monitors resource usage and performs log management.

Application Master: An application is a single job submitted to the framework. The Application Master is responsible for negotiating resources with the Resource Manager and for tracking the status and monitoring the progress of a single application.

Container: A collection of physical resources, such as RAM, CPU cores, and disk, on a single node. The Application Master requests a container from the Node Manager by sending a Container Launch Context (CLC), which includes everything an application needs to run. Once the application is started, it sends health reports to the Resource Manager from time to time.

Key Features of Hadoop:

 Scalability: Hadoop can scale horizontally by adding more nodes to a cluster, making it suitable for handling ever-growing data volumes.
 Fault Tolerance: Data stored in HDFS is replicated across nodes, ensuring data availability even in the event of hardware failures.
 Parallel Processing: Hadoop's distributed nature allows it to process data in parallel, significantly speeding up processing times for large datasets.
 Cost-Effective: Hadoop can be run on commodity hardware, making it a cost-effective solution for managing and processing Big Data.
 Flexibility: Hadoop is capable of handling various types of data, including structured, semi-structured, and unstructured data.

Hadoop Ecosystem:

The Hadoop ecosystem consists of a collection of related projects and tools that extend Hadoop's capabilities and make it more versatile for different use cases. Some notable components of the Hadoop ecosystem include:

 Apache Hive: A data warehousing and SQL-like query language for Hadoop, making it easier to manage and query large datasets.
 Apache Pig: A platform for creating data flows and processing pipelines using a scripting language called Pig Latin.
 Apache HBase: A NoSQL database that provides real-time read and write access to large datasets.
 Apache Spark: A fast and flexible data processing framework that supports in-memory processing and a wide range of data analytics tasks.
 Apache Kafka: A distributed streaming platform for building real-time data pipelines and streaming applications (a minimal consumer sketch follows this list).
 Apache Flink: A stream processing framework for high-throughput, low-latency data processing.
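As a small taste of the streaming side of the ecosystem, here is a minimal Kafka consumer sketch using the third-party kafka-python package; the topic name clicks and the broker address are assumptions for the example.

```python
from kafka import KafkaConsumer  # third-party package: kafka-python

# Subscribe to a hypothetical topic on a local broker.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
)

# Each message arrives as raw bytes; decode and process in real time.
for message in consumer:
    event = message.value.decode("utf-8")
    print(f"partition={message.partition} offset={message.offset}: {event}")
```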

Use Cases of Hadoop:

Hadoop is widely used across industries for various purposes:

 Data warehousing and business intelligence

 Log and event processing

 Machine learning and data analytics

 Genomics and bioinformatics

 Social media analysis

 Fraud detection and cybersecurity

 Recommendation systems

 IoT data processing


OPEN SOURCE TECHNOLOGIES

Open-source software is computer software that is available in source code form under an open-source license that permits users to study, change, improve, and distribute the software (e.g., Hadoop is an open-source project).

Advantages
- Open-source software is not constrained by someone else's predetermined ideas or vision
- Source code is open and can be modified freely
- Extensible
- Flexible
- Free of cost / low cost

Disadvantages
- It has to coexist with proprietary solutions for a long time, for many reasons. For example, getting data from Hadoop to a database required a Hadoop expert, and if the data was not 100% clean, a developer was needed to get it into a consistent, proper form. This meant that business analysts couldn't directly access and analyze data in Hadoop clusters. SQL-H is software that was developed to solve this problem.
- There is no guarantee that development will occur.
- No assured, sustained development strategy.

CLOUD COMPUTING AND BIG DATA

Cloud computing and Big Data are two complementary technologies that often go hand in hand to address the challenges of managing and processing large volumes of data. Cloud computing provides the infrastructure and resources needed to handle Big Data workloads efficiently and cost-effectively. Let's explore how these two technologies intersect:

Cloud Computing: Cloud computing involves the delivery of computing services (such as computing power, storage, databases, networking, and software) over the internet. It eliminates the need for organizations to own and maintain physical hardware and infrastructure, allowing them to scale resources up or down based on demand.

Big Data: Big Data refers to the massive volumes of structured and unstructured data that cannot be effectively processed or analyzed using traditional methods. Big Data technologies enable organizations to extract valuable insights from these large datasets, leading to better decision-making and new opportunities.

Cloud and Big Data Integration Benefits:

1. Scalability and Flexibility: Cloud platforms offer on-demand scalability, making them well-suited for handling the variable workloads associated with Big Data. Organizations can provision additional resources as needed to process large datasets and run complex analytics tasks.
2. Cost Efficiency: Cloud services operate on a pay-as-you-go model, allowing organizations to avoid upfront infrastructure costs. This is particularly advantageous for Big Data projects, as processing massive datasets on-premises can be expensive and resource-intensive.
3. Storage: Cloud providers offer scalable and cost-effective storage solutions, such as object storage and data lakes, which are ideal for storing and managing Big Data. This eliminates the need to invest in and manage physical storage infrastructure (a small upload sketch follows the examples below).
4. Data Processing: Cloud platforms provide tools and services for Big Data processing, including managed Hadoop clusters, data warehouses, and serverless computing. Organizations can offload the processing of large datasets to the cloud, leveraging its resources and expertise.
5. Data Analytics: Cloud services offer a variety of analytics tools, including machine learning, data visualization, and business intelligence solutions. These tools can be used to analyze Big Data and derive valuable insights.
6. Real-Time Analytics: Cloud-based platforms can handle real-time data processing and analytics, enabling organizations to make informed decisions in near real-time based on streaming data.
7. Global Accessibility: Cloud-based Big Data solutions enable teams to collaborate on data analysis projects regardless of their geographical location. This is particularly useful for organizations with distributed teams or partners.
8. Managed Services: Cloud providers offer managed Big Data services that handle various aspects of data processing and analysis, allowing organizations to focus on deriving insights rather than managing infrastructure.

Examples of Cloud and Big Data Integration:

1. Amazon Web Services (AWS): Offers services like Amazon EMR (Elastic MapReduce) for processing large datasets with tools like Hadoop and Spark, and Amazon Redshift for data warehousing.
2. Google Cloud Platform (GCP): Provides BigQuery for analyzing large datasets using SQL queries and Dataproc for managing Hadoop and Spark clusters.
3. Microsoft Azure: Offers Azure HDInsight for managing Hadoop, Spark, and other Big Data clusters, and Azure Data Lake Storage for scalable data storage.
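As a concrete taste of the storage benefit, the sketch below uploads a local dataset into Amazon S3 (often the landing zone of a cloud data lake) using the official boto3 SDK; the bucket and file names are hypothetical, and credentials are assumed to be configured in the environment.

```python
import boto3

# Create an S3 client; credentials come from the environment or AWS config.
s3 = boto3.client("s3")

# Upload a local file into a hypothetical data-lake bucket.
s3.upload_file("sales_2024.csv", "my-datalake-raw", "sales/sales_2024.csv")

# List what landed under the prefix to confirm the upload.
response = s3.list_objects_v2(Bucket="my-datalake-raw", Prefix="sales/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```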

MOBILE BUSINESS INTELLIGENCE

Mobile Business Intelligence (Mobile BI) refers to the practice of using mobile devices, such as smartphones and tablets, to access, analyze, and present business data and insights. It enables decision-makers to access critical information anytime, anywhere, and make informed decisions on the go. Mobile BI leverages the principles of business intelligence (BI) but tailors them to the mobile platform, providing a seamless and user-friendly experience for accessing and interacting with data.

Simplicity and ease of use had been the major barriers to BI adoption, but mobile devices have made complicated actions easy to perform. For example, a young child can use an iPad or iPhone easily, but not a laptop. This ease of use will drive the wide adoption of mobile BI. Multi-touch, software-oriented devices have brought mobile analytics and intelligence to a much wider audience, and the ease of mobile application development and deployment has also contributed to the wide adoption of mobile BI.

Three elements that have impacted the viability of mobile BI are:
i) Location: the GPS component makes finding location easy.
ii) Transactions: transactions can be completed through smartphones.
iii) Multimedia functionality.

Three challenges with mobile BI include:
i) Managing standards for these devices.
ii) Managing security (always a big challenge).
iii) Managing "bring your own device", where both devices owned by the company and devices owned by the individual contribute to productivity.

Key Aspects of Mobile Business Intelligence:

1. Data Visualization: Mobile BI tools provide interactive and visually appealing data visualizations, such as charts, graphs, dashboards, and maps. These visual representations make it easier to understand complex data and trends.
2. Real-Time Access: Mobile BI allows users to access real-time or near-real-time data directly from various data sources, including databases, data warehouses, and cloud services. This enables timely decision-making based on the latest information.
3. Interactivity: Mobile BI applications support interactive features that enable users to drill down into data, apply filters, and perform ad-hoc analyses using touch gestures.
4. Collaboration: Mobile BI tools often include collaboration features, allowing users to share reports, dashboards, and insights with colleagues, partners, or clients. This fosters better communication and collaboration among teams.
5. Offline Capabilities: Some mobile BI applications offer offline access, allowing users to download and view reports even when they are not connected to the internet. This ensures access to critical information in remote or low-connectivity environments.
6. Security: Mobile BI platforms implement security measures, such as data encryption, secure authentication, and access controls, to ensure that sensitive business data remains protected.
7. Personalization: Users can customize their mobile BI experience by selecting the specific data, metrics, and visualizations that are most relevant to their roles and responsibilities.
Benefits of Mobile Business Intelligence:

1. Increased Accessibility: Decision-makers can access business data and insights from anywhere, enabling them to make informed decisions on the go.
2. Timely Decision-Making: Real-time access to data allows for faster decision-making, especially when time-sensitive choices need to be made.
3. Enhanced Productivity: Mobile BI empowers users to stay productive by analyzing data and generating insights without being tied to a desk.
4. Improved Collaboration: Sharing and collaborating on data and reports becomes easier, fostering better communication among team members.
5. Better User Adoption: The user-friendly and intuitive nature of mobile apps encourages broader user adoption of BI tools across an organization.
6. Data-Driven Culture: Mobile BI contributes to a data-driven culture by providing easy access to data and encouraging data-driven decision-making at all levels.

Use Cases of Mobile Business Intelligence:

1. Sales and Marketing: Sales teams can access real-time sales data, track performance metrics, and analyze customer trends while in the field.
2. Executive Dashboards: Business executives can monitor key performance indicators (KPIs) and business metrics on their mobile devices.
3. Field Service: Field service professionals can access job-related data, schedules, and customer information, improving service efficiency.
4. Supply Chain Management: Supply chain managers can track inventory levels, monitor shipments, and analyze supply chain performance remotely.
5. Retail Analytics: Retailers can track sales, inventory, and customer behavior to make informed merchandising and pricing decisions.

CROWD SOURCING ANALYTICS

Crowdsourcing analytics refers to the practice of harnessing the collective intelligence, skills, and input of a large group of people (the "crowd") to perform various data analysis tasks. It involves outsourcing data analysis tasks to a diverse group of individuals, often through online platforms or communities, to collectively solve complex problems, generate insights, and produce meaningful results. Crowdsourcing analytics can offer unique perspectives, expertise, and scalability that traditional data analysis methods may not achieve.

Key Aspects of Crowdsourcing Analytics:

1. Task Distribution: Organizations break down complex data analysis tasks into smaller, more manageable units that can be distributed to a large number of participants in the crowd.
2. Diverse Expertise: Crowdsourcing can tap into a wide range of skills and expertise from individuals with diverse backgrounds, enabling multidisciplinary insights and creative problem-solving.
3. Scalability: Crowdsourcing provides the ability to scale up data analysis efforts rapidly by involving a large number of contributors working concurrently.
4. Rapid Turnaround: With many contributors working simultaneously, crowdsourcing can often achieve faster results than traditional methods.
5. Cost-Effectiveness: Crowdsourcing can be a cost-effective way to conduct data analysis, especially for tasks that require a large amount of manual effort.
6. Innovation: The diverse perspectives and ideas from the crowd can lead to innovative solutions and approaches to data analysis challenges.
7. Data Annotation and Labeling: Crowdsourcing is commonly used for tasks like annotating or labeling large datasets, which are essential for training machine learning models.
8. Quality Control: Effective crowdsourcing platforms include mechanisms for quality control, such as validation, consensus, and moderation, to ensure the accuracy of results.

Use Cases of Crowdsourcing Analytics:

1. Image and Video Analysis: Crowdsourcing can be used to annotate and categorize images or videos for various applications, including object recognition and sentiment analysis.
2. Natural Language Processing: Crowdsourcing can help generate and validate training data for natural language processing tasks like sentiment analysis, named entity recognition, and language translation.
3. Market Research: Crowdsourcing can provide insights into consumer preferences, opinions, and trends by collecting and analyzing data from surveys, reviews, and social media.
4. Healthcare: Crowdsourcing can assist in medical image analysis, such as identifying anomalies in medical scans, and in the analysis of patient-reported data for research purposes.
5. Environmental Monitoring: Crowdsourcing can gather data related to environmental conditions, wildlife observations, and weather patterns for scientific research and conservation efforts.
6. Historical Research: Crowdsourcing historical documents or artifacts can contribute to historical research, data digitization, and preservation.

Challenges of Crowdsourcing Analytics:

1. Quality Assurance: Ensuring the accuracy and quality of crowdsourced data can be challenging. Implementing validation mechanisms and training contributors is crucial.
2. Privacy and Data Security: Protecting sensitive data and ensuring compliance with privacy regulations is a concern when outsourcing data-related tasks.
3. Bias and Diversity: Ensuring a diverse and representative crowd is important to avoid potential biases in the collected data or insights.
4. Task Complexity: While crowdsourcing is effective for certain tasks, complex data analysis requiring deep domain expertise may still be best suited for traditional methods.

Types of Crowdsourcing:

Crowdsourcing involves outsourcing tasks or obtaining contributions from a large and often diverse group of people, typically through an online platform or community. There are several types of crowdsourcing, each serving different purposes and utilizing the collective intelligence and skills of the crowd. Here are some common types of crowdsourcing:

1. Ideation Crowdsourcing: Involves gathering ideas and suggestions from the crowd to solve a specific problem or generate innovative solutions. It often takes the form of open-ended challenges, brainstorming sessions, or idea competitions.
2. Microtask Crowdsourcing: Breaks down complex tasks into small, discrete microtasks that can be completed quickly by individual contributors. Examples include image tagging, data annotation, and content moderation.
3. Crowd Creativity: Focuses on leveraging the creative skills of the crowd to generate artistic, design, or multimedia content. This can include logo design contests, art competitions, and creative writing projects.
4. Crowdfunding: Involves raising funds for a project, business, or initiative by collecting small contributions from a large number of individuals. It is commonly used for startup funding, creative projects, and charitable causes.
5. Open Innovation: Refers to seeking external contributions and ideas from the crowd to drive innovation within an organization. This could involve collaborating with external experts, researchers, or enthusiasts to solve specific challenges.
6. Citizen Science: Enlists the general public to participate in scientific research projects by collecting data, conducting experiments, or contributing observations. This approach is often used in environmental and scientific research.
7. Crowd Wisdom (Prediction Markets): Utilizes the collective predictions or opinions of the crowd to forecast future events or outcomes. Prediction markets are often used for financial predictions, election outcomes, and market trends.
8. Crowd Labor: Involves outsourcing tasks such as data entry, transcription, and content creation to a distributed workforce.
9. Distributed Problem Solving: Taps into the crowd to solve complex technical or scientific problems that require specialized knowledge.
10. Sourcing Expertise: Engages subject-matter experts from the crowd to provide insights, advice, or consulting services on specific topics.
11. Localization and Translation: Involves crowdsourcing the translation of content, software localization, and language-related tasks.
12. Human-Based Computing: Leverages human intelligence to perform tasks that are difficult for computers, such as image recognition, natural language processing, and sentiment analysis.
Examples of Crowdsourcing Analytics:

- In October 2006, Netflix, an online DVD rental business, announced a contest to create a new predictive model for recommending movies based on past user ratings. The grand prize was $1,000,000. Netflix already had an algorithm to solve the problem but thought there was an opportunity to improve the model, which would turn out huge revenues.
- Kaggle is an Australian firm that provides innovative solutions for outsourced statistical analytics. Organizations that confront complex statistical challenges describe the problems to Kaggle and provide data sets. Kaggle converts the problems and the data into contests that are posted on its website. The contests feature cash prizes ranging in value from $100 to $3 million. Kaggle's clients range in size from tiny start-ups to multinational corporations such as Ford Motor Company and government agencies such as NASA. The idea is that someone comes to Kaggle with a problem, they put it up on their website, and then people from all over the world can compete to see who can produce the best solution. In essence, Kaggle has developed an effective global platform for crowdsourcing complex analytic problems.
- 99designs.com does crowdsourcing of graphic design.
- Agentanything.com posts missions where agents are invited to do various jobs.
- 33needs.com allows people to contribute to charitable programs to make social impact.

"INTER-FIREWALL" AND "TRANS-FIREWALL" ANALYTICS

"Inter-firewall" and "trans-firewall" analytics refer to the analysis of network traffic and data that traverse multiple firewalls or network boundaries. These terms are often used in the context of cybersecurity and network monitoring to describe the analysis of data flows that move between different network segments, zones, or security domains, typically protected by firewalls.

Inter-Firewall Analytics:

Inter-firewall analytics involve the examination and monitoring of network traffic that moves between different segments of a network, each protected by its own firewall or security perimeter. This analysis focuses on understanding the communication patterns and potential threats that emerge when data crosses these security boundaries. It aims to detect anomalies, unauthorized access, or malicious activities that might occur during data transfer between different zones.

Key aspects of inter-firewall analytics include:
1. Traffic Monitoring: Monitoring and analyzing data flows between different security zones or segments of a network.
2. Anomaly Detection: Detecting unusual or suspicious traffic patterns that might indicate unauthorized access or malicious activity.
3. Access Control Verification: Ensuring that access controls and security policies are consistently enforced across different zones (illustrated in the sketch after this list).
4. Intrusion Detection and Prevention: Identifying and mitigating potential intrusion attempts or security breaches that occur when data crosses firewall boundaries.
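To make aspects 1-3 concrete, here is a minimal sketch that checks observed cross-zone flows against an allow-list of permitted zone pairs; the zone names, flow records, and policy are all invented for the illustration.

```python
# Hypothetical policy: which (source zone, destination zone) pairs are allowed.
ALLOWED_PAIRS = {("dmz", "internal"), ("internal", "dmz"), ("internal", "db")}

# Hypothetical flow records captured at the firewalls.
flows = [
    {"src_zone": "dmz", "dst_zone": "internal", "dst_port": 443},
    {"src_zone": "dmz", "dst_zone": "db", "dst_port": 3306},   # policy violation
    {"src_zone": "internal", "dst_zone": "db", "dst_port": 5432},
]

for flow in flows:
    pair = (flow["src_zone"], flow["dst_zone"])
    if pair not in ALLOWED_PAIRS:
        # Flag traffic crossing a boundary that policy does not permit.
        print(f"VIOLATION: {pair[0]} -> {pair[1]} on port {flow['dst_port']}")
```

Production systems apply the same idea continuously over firewall and flow logs, combining policy checks with statistical anomaly detection.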
Trans-Firewall Analytics:
Trans-firewall analytics extend the analysis to include data that moves between different networks or
security domains, potentially involving external entities. This type of analysis focuses on understanding the
behavior and risks associated with data flows that traverse not only internal network boundaries but also
external connections.

Key aspects of trans-firewall analytics include:

1. External Threat Detection: Identifying and mitigating threats that might arise when data enters or
leaves the organization's network, interacting with external entities.

2. Data Leakage Prevention: Ensuring sensitive or confidential information is not inadvertently exposed when crossing network boundaries.

3. Third-Party Risk Management: Assessing the security of connections and interactions with
external partners, vendors, or service providers.

4. Malware and Threat Detection: Detecting potential malware, viruses, or other malicious content
that might be introduced from external sources.
