Technical Seminar Report
title
In
Submitted by
Rakshitha G
1CR23MC085
Internal Guide
Associate Professor
Department of MCA
2023 - 2024
Contents
Introduction
Problem Statement
Survey of Technologies
Methodology
Impact on Environment, Society and Domain
Conclusion
References
Introduction:
Big Data refers to the vast volume of data, both structured and unstructured,
that is generated at high speed and requires advanced methods to store, process,
and analyse. It is characterized by the "3 Vs": Volume, Velocity, and Variety.
Volume represents the massive amount of data generated by sources such as
social media, sensors, and transactions. Velocity refers to the speed at which
this data is produced and processed. Variety highlights the different types of
data, including text, images, videos, and more. Traditional data processing
methods are inadequate to handle such complexity, leading to the development
of specialized technologies like Hadoop and Spark. Big Data enables
organizations to gain valuable insights, optimize operations, and make data-
driven decisions. Its applications span across industries like healthcare, finance,
retail, and more, allowing businesses to understand trends, predict outcomes,
and enhance customer experiences. Cloud data platforms such as Snowflake show
how modern architectures meet these demands. Snowflake separates compute from
storage, so organizations can scale compute independently for high-demand
workloads while keeping storage costs low. Snowflake also supports elastic
scaling, automatically increasing or decreasing computational power with the
workload, ensuring optimal performance without manual intervention.
The key challenges of Big Data include not just managing the volume but also
handling the speed at which data arrives (velocity), ensuring accuracy and
consistency (veracity), and dealing with the different formats of data (variety).
Modern Big Data technologies, such as Hadoop (which enables distributed
storage and processing) and Apache Spark (which processes data at lightning
speed), allow businesses to manage these challenges effectively.
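The MapReduce programming model behind Hadoop can be sketched in plain Python. This is only an illustration of the map and reduce phases; real Hadoop distributes them across a cluster, which is omitted here.

```python
# Minimal sketch of the MapReduce idea: mappers emit (word, 1) pairs,
# reducers sum the counts per word. Both phases run locally here.
from collections import defaultdict

def map_phase(documents):
    """Emit (word, 1) pairs from each document, as a mapper would."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Sum the counts for each word, as reducers would after shuffling."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data needs big tools", "data drives decisions"]
word_counts = reduce_phase(map_phase(docs))
print(word_counts["big"])   # 2
print(word_counts["data"])  # 2
```

Frameworks like Spark express the same pattern but keep intermediate results in memory, which is where their speed advantage comes from.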
Another important aspect is the "value" of Big Data—how organizations can
turn these massive amounts of raw data into valuable information for decision-
making. Big Data analytics enables predictive analytics, real-time monitoring,
and machine learning, which can be used for predictive maintenance,
personalized marketing, fraud detection, and improving customer satisfaction.
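The fraud-detection idea mentioned above can be made concrete with a toy sketch: flag transactions whose amount sits far from the typical value. The data, the median-based statistic, and the cutoff factor are all illustrative assumptions, not a production method.

```python
# Flag anomalous transaction amounts using the median absolute
# deviation (MAD), a spread estimate that is robust to the very
# outliers we are trying to find. The factor k=5 is arbitrary.
import statistics

def flag_outliers(amounts, k=5.0):
    """Return the amounts lying more than k * MAD from the median."""
    med = statistics.median(amounts)
    mad = statistics.median(abs(a - med) for a in amounts)
    return [a for a in amounts if abs(a - med) > k * mad]

history = [20.0, 25.0, 22.0, 19.0, 24.0, 21.0, 23.0, 5000.0]
print(flag_outliers(history))  # [5000.0]
```

Real systems combine many such signals with machine-learned models, but the principle of scoring deviations from learned behavior is the same.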
Ultimately, Big Data has revolutionized industries by allowing more intelligent
and informed decision-making, leading to competitive advantages and
innovations across sectors.
Big Data presents significant challenges, with data privacy and security being
primary concerns. As vast amounts of personal and sensitive information are
processed, ensuring its protection is critical. Regulations like GDPR (General
Data Protection Regulation) and CCPA (California Consumer Privacy Act)
mandate stringent guidelines, compelling companies to safeguard user privacy
and handle data responsibly. Another challenge is maintaining data quality, as
inconsistencies, duplicates, and errors are common in massive datasets and can
lead to unreliable results. Additionally, storing and processing this data is
costly, requiring advanced infrastructure and expertise. Data integration is
complex as well, since data from various sources and formats need
harmonization to ensure accuracy. Lastly, finding skilled professionals to
manage and analyze Big Data is difficult, as specialized expertise in Big Data
technologies and analytics is essential to drive value from these complex
datasets.
Problem Statement
As organizations generate increasing amounts of data from multiple sources—
such as customer interactions, social media, sensors, and transactional records—
they face challenges in extracting valuable insights in real-time. The problem is
to design a Big Data solution that can efficiently store, process, and analyse
these large datasets to provide actionable insights while ensuring data privacy,
security, and quality.
1. Data Variety and Integration: The solution must handle multiple data
formats (structured, semi-structured, and unstructured) and integrate data
from diverse sources to produce a unified view of information.
4. Scalability and Cost Efficiency: The solution should be cost-effective
and scalable, capable of handling increasing volumes of data over time
without compromising performance.
Ultimately, the goal is to create a robust Big Data solution that allows
organizations to uncover actionable insights from complex datasets, enabling
more informed decision-making and facilitating innovation. This requires a
combination of advanced analytics, machine learning, scalable storage, and
secure data governance practices to effectively harness the value within Big
Data.
Survey of Technologies
The following surveys key Big Data technologies, covering a range of areas from
storage and processing to analytics and data visualization:
Apache Spark: Known for its in-memory processing, Spark is faster than
Hadoop MapReduce and supports both batch and real-time data
processing. Spark includes libraries for SQL, machine learning, graph
processing, and streaming analytics, making it well suited for real-time
applications like fraud detection and sensor data analysis.
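The sliding-window aggregation pattern used by streaming engines for sensor analysis can be sketched with a plain Python deque so the idea stands alone. The window size of three readings is an arbitrary assumption.

```python
# Keep a running mean over the most recent readings, the core pattern
# behind windowed streaming aggregations. deque(maxlen=...) evicts the
# oldest reading automatically as new ones arrive.
from collections import deque

class SlidingAverage:
    """Mean of the most recent `size` sensor readings."""
    def __init__(self, size):
        self.window = deque(maxlen=size)

    def add(self, reading):
        self.window.append(reading)
        return sum(self.window) / len(self.window)

avg = SlidingAverage(size=3)
for value in [10, 20, 30, 100]:
    current = avg.add(value)
print(current)  # mean of the last three readings: (20 + 30 + 100) / 3 = 50.0
```

Streaming engines apply this same logic per key, in parallel, across a cluster, with fault tolerance layered on top.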
Apache Sqoop: Primarily used for transferring bulk data between Hadoop
and relational databases. Sqoop is often used in ETL workflows for data
import and export to and from Hadoop clusters.
5. Data Querying and SQL on Big Data
Apache Storm: A distributed real-time computation system for streaming
analytics. Storm is used in applications that require immediate response
times, such as monitoring and alerting systems.
10. Data Governance and Metadata Management
Google Cloud Dataproc: Google’s managed service for Hadoop and
Spark, Dataproc facilitates quick, cost-effective data processing and
integration with Google’s AI and analytics tools.
Apache Beam: A unified programming model for batch and streaming
data processing. Apache Beam allows users to create pipelines that run on
multiple Big Data platforms, including Apache Flink and Google Cloud
Dataflow.
AWS Glue: Amazon’s managed ETL service that helps prepare and
transform data for analytics, Glue automatically discovers and catalogs
data, allowing for efficient data transformations.
Distributed machine learning frameworks provide support for both distributed
and in-memory processing, enabling large-scale model training.
AWS IoT Analytics: A managed service that collects and analyzes IoT
data, AWS IoT Analytics enables real-time analytics and actionable
insights from edge data collected from devices and sensors.
ThoughtSpot: A search-driven analytics platform that provides an
intuitive interface for querying Big Data and finding insights through
natural language queries, making it accessible to non-technical users.
This extended survey of Big Data technologies highlights the diverse tools and
platforms available for each stage of the Big Data lifecycle, from ingestion and
processing to analytics, storage, security, and real-time processing. As data
volume, variety, and velocity continue to grow, these technologies are vital for
building scalable, resilient, and intelligent data ecosystems that drive informed
decision-making and innovation.
Methodology of Big Data
Here’s a structured methodology for implementing a Big Data solution,
covering phases from requirement gathering to deployment and maintenance:
1. Requirement Analysis
Define Business Objectives: Identify the goals of the Big Data project.
These could include improving decision-making, customer insights, fraud
detection, predictive analytics, or optimizing operations.
Identify Data Sources: Catalog the sources from which data will be
collected, such as transactional databases, social media feeds, IoT
devices, logs, and third-party sources.
Select Data Ingestion Tools: Choose tools like Apache Kafka, Apache
NiFi, or Amazon Kinesis for real-time ingestion or Apache Sqoop for
batch data transfer.
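The ingestion tools above all follow a producer/consumer pattern: sources publish events to a buffer and downstream processors consume them. This stdlib sketch mimics that flow with `queue.Queue`; it illustrates the pattern only and is not a Kafka or Kinesis client.

```python
# Producer/consumer ingestion pattern with a stdlib queue as the buffer.
import queue

events = queue.Queue()

# Producer side: a source pushes raw events into the buffer.
for record in ["login", "purchase", "logout"]:
    events.put(record)

# Consumer side: a processor drains the buffer in arrival order.
consumed = []
while not events.empty():
    consumed.append(events.get())

print(consumed)  # ['login', 'purchase', 'logout']
```

Real brokers add persistence, partitioning, and replay on top of this basic decoupling of producers from consumers.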
Data Validation and Cleansing: Implement rules to handle missing
values, duplicates, and errors, ensuring data quality before further
processing.
Data Transformation: Cleanse, filter, and transform raw data into a usable
format. This may involve normalization, aggregation, or feature
engineering for machine learning applications.
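The validation and transformation steps above can be sketched as a small pipeline: drop duplicates, fill a missing value, then min-max normalize. The field names and the fill-with-zero rule are illustrative assumptions.

```python
# Cleansing then transformation, as two small pipeline stages.

def cleanse(records):
    """Remove duplicate records (by id) and fill missing 'amount' with 0.0."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["id"] in seen:
            continue
        seen.add(rec["id"])
        cleaned.append({"id": rec["id"], "amount": rec.get("amount") or 0.0})
    return cleaned

def normalize(records):
    """Scale 'amount' into [0, 1] (min-max normalization)."""
    amounts = [r["amount"] for r in records]
    lo, hi = min(amounts), max(amounts)
    span = (hi - lo) or 1.0
    return [{**r, "amount": (r["amount"] - lo) / span} for r in records]

raw = [{"id": 1, "amount": 50.0},
       {"id": 1, "amount": 50.0},   # duplicate row
       {"id": 2, "amount": None},   # missing value
       {"id": 3, "amount": 100.0}]
clean = normalize(cleanse(raw))
print(clean)
```

At scale the same stages run as distributed jobs (e.g., Spark transformations), but the per-record logic is unchanged.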
Model Selection: Choose machine learning algorithms suited to the
analysis. Libraries like Spark MLlib, TensorFlow, and H2O.ai are
commonly used for scalable machine learning.
Model Training and Validation: Split data into training, validation, and
test sets. Train the model on historical data and validate it to ensure
accuracy, generalizability, and performance on unseen data.
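The split described above can be sketched as follows: partition historical records into training, validation, and test sets. The 70/15/15 ratio is a common convention, not a requirement, and the fixed seed is only for reproducibility.

```python
# Shuffle, then cut the data into three disjoint partitions.
import random

def split(data, train=0.7, val=0.15, seed=42):
    rows = data[:]
    random.Random(seed).shuffle(rows)   # fixed seed => reproducible split
    n_train = int(len(rows) * train)
    n_val = int(len(rows) * val)
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])

records = list(range(100))
train_set, val_set, test_set = split(records)
print(len(train_set), len(val_set), len(test_set))  # 70 15 15
```

For time-ordered data a chronological split is usually preferred over shuffling, so the model is always validated on data newer than what it was trained on.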
Data Visualization Tools: Use tools like Tableau, Power BI, or Apache
Superset to create interactive dashboards and visualizations for
stakeholders.
7. Deployment
Performance Monitoring: Continuously monitor system performance and
model accuracy to ensure reliability and accuracy. Track metrics like data
latency, processing time, and model error rates.
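The latency tracking described above can be sketched minimally: record per-batch processing times and report a summary. The metric names and sample values are illustrative; real systems export such metrics to monitoring platforms.

```python
# Collect per-batch latency samples and summarize them.
import statistics

class LatencyMonitor:
    def __init__(self):
        self.samples = []

    def record(self, seconds):
        """Store one observed batch-processing duration."""
        self.samples.append(seconds)

    def summary(self):
        """Report mean and worst-case latency over all samples."""
        return {"mean": statistics.mean(self.samples),
                "max": max(self.samples)}

monitor = LatencyMonitor()
for duration in [0.12, 0.15, 0.09]:
    monitor.record(duration)
print(monitor.summary())
```

Alerting is then a matter of comparing these summaries against agreed thresholds (e.g., paging when max latency exceeds a service-level objective).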
Audit Trails: Maintain logs and audit trails for data access and
modifications, enabling tracking for compliance and identifying potential
security issues.
Feedback Loops: Gather feedback from stakeholders on the usability of
insights and dashboards. Use their input to improve data processes and
model relevance.
The Impact of Big Data on Environment, Society, and
Domain
The impact of Big Data spans across various domains, influencing
environmental sustainability, societal structures, and specific industries. Below
is an exploration of how Big Data affects each of these areas:
2. Impact on Society
Enhanced Public Safety: Law enforcement agencies use Big Data to
analyze crime patterns, improve resource allocation, and enhance
community safety. Predictive policing uses historical data to identify
potential crime hotspots, allowing for proactive measures to prevent
crime.
Social Insights: Big Data provides insights into societal behaviors and
trends. Businesses can analyze consumer sentiment from social media
and other platforms to better understand public opinion, leading to more
informed decision-making in marketing and product development.
Finance and Banking: Financial institutions utilize Big Data for risk
assessment, fraud detection, and personalized services. Analyzing
transaction patterns and customer data helps banks identify suspicious
activities, reduce losses, and tailor financial products to individual clients.
Energy Sector: In energy production and consumption, Big Data
analytics helps optimize grid management and energy distribution.
Analyzing consumption data allows utility companies to predict demand,
reduce outages, and integrate renewable energy sources more effectively.
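The demand prediction mentioned above can be illustrated with a deliberately naive sketch: forecast the next reading as the mean of the last few. Utilities use far richer models; the three-reading window and the sample figures are assumptions for illustration only.

```python
# Naive moving-average forecast of the next consumption reading.

def naive_forecast(consumption, window=3):
    """Predict the next value as the mean of the last `window` readings."""
    recent = consumption[-window:]
    return sum(recent) / len(recent)

hourly_kwh = [30.0, 32.0, 35.0, 33.0, 34.0]
print(naive_forecast(hourly_kwh))  # (35.0 + 33.0 + 34.0) / 3 = 34.0
```

Even this baseline shows why consumption data matters: any improvement over it, from seasonality models to machine learning, translates directly into better grid planning.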
Data Bias and Fairness: Big Data algorithms can perpetuate existing
biases if the underlying data is not representative. Ensuring fairness in
algorithms is essential to prevent discrimination, particularly in areas
such as hiring, lending, and law enforcement. Organizations must
implement rigorous testing and auditing processes to identify and
mitigate biases in their data and algorithms.
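The bias auditing described above can be made concrete with a simple check: compare the rate of positive outcomes across groups (demographic parity). The data is fabricated for illustration, and the 0.8 cutoff follows the "four-fifths rule" sometimes used in hiring audits; it is one heuristic among many, not a complete fairness test.

```python
# Compute per-group selection rates and flag a large disparity.

def selection_rates(decisions):
    """decisions: list of (group, approved) pairs -> approval rate per group."""
    totals, approved = {}, {}
    for group, ok in decisions:
        totals[group] = totals.get(group, 0) + 1
        approved[group] = approved.get(group, 0) + (1 if ok else 0)
    return {g: approved[g] / totals[g] for g in totals}

audit = [("A", True), ("A", True), ("A", False), ("A", True),
         ("B", True), ("B", False), ("B", False), ("B", False)]
rates = selection_rates(audit)
ratio = min(rates.values()) / max(rates.values())
print(rates, ratio < 0.8)  # group B is selected far less often
```

Such audits are only a starting point: a disparity may have legitimate explanations, but surfacing it is what allows the rigorous testing the paragraph above calls for.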
Data Democratization: Organizations are increasingly focusing on
democratizing data access, allowing non-technical users to leverage data
analytics tools without requiring extensive training. This trend empowers
a wider range of stakeholders to make data-driven decisions, fostering a
culture of innovation and agility.
Conclusion
In summary, the impact of Big Data on the environment, society, and various
domains is transformative, presenting both significant opportunities and
challenges. By enabling organizations to harness vast amounts of data, Big Data
analytics drives innovation, enhances efficiency, and supports informed
decision-making across multiple sectors. From optimizing resource
management for environmental sustainability to improving public health and
personalizing customer experiences, the applications of Big Data are far-
reaching and impactful.
References
1. https://en.wikipedia.org/wiki/bigdata
2. https://www.investopedia.com/terms/b/bigdata.asp
3. https://www.pngegg.com/en/search?q=bigdata+Technology
4. https://www.simplilearn.com/tutorials/bigdata-tutorial/why-is-bigdata-important
5. https://www.geeksforgeeks.org/advantages-and-disadvantages-of-bigdata/
6. https://builtin.com/bigdata