BIG Data_Unit_1
Spark: Installing Spark, Spark applications, jobs, stages and tasks, Resilient Distributed
Datasets, anatomy of a Spark job run, Spark on YARN
SCALA: Introduction, classes and objects, basic types and operators, built-in control
structures, functions and closures, inheritance.
Hadoop Eco System Frameworks: Applications on Big Data using Pig, Hive and
HBase
Pig: Introduction to Pig, Execution Modes of Pig, Comparison of Pig with Databases,
Grunt, Pig Latin, User Defined Functions, Data Processing operators, Hive - Apache
Hive architecture and installation, Hive shell, Hive services, Hive
metastore, comparison with traditional databases, HiveQL, tables, querying data and
user defined functions, sorting and aggregating, Map Reduce scripts, joins &
subqueries.
HBase – HBase concepts, clients, example, HBase vs RDBMS, advanced usage,
schema design, advanced indexing, ZooKeeper – how it helps in monitoring a cluster,
how to build applications with ZooKeeper. IBM Big Data strategy, introduction to
InfoSphere, BigInsights and Big Sheets, introduction to Big SQL.
INDEX
United Institute of Management
Types of digital data
History of Big Data innovation
Big Data architecture and characteristics
Big Data Tools and Techniques
Big Data features – security, compliance, auditing and protection
Big Data privacy and ethics
Big Data Analytics
Key components of analytics
Big Data Analytics Process
Challenges of conventional systems
ANALYSIS vs REPORTING
MODERN DATA ANALYTIC TOOLS
Key Points
Introduction to Big Data: Types of digital data, history of Big Data innovation, introduction
to Big Data platform, drivers for Big Data, Big Data architecture and characteristics, 5 Vs of
Big Data, Big Data technology components, Big Data importance and applications, Big Data
features – security, compliance, auditing and protection, Big Data privacy and ethics, Big
Data Analytics, Challenges of conventional systems, intelligent data analysis, nature of data,
analytic processes and tools, analysis vs reporting, modern data analytic tools.
Types of digital data
Structured Data:
Structured data is highly organized and conforms to a fixed schema, typically stored as rows and columns in relational databases.
It is the easiest type of data to store, query, and analyze, for example with SQL.
Examples include:
Relational database tables (e.g., MySQL, Oracle).
Spreadsheets with defined rows and columns.
Transactional records such as sales or banking data.
Semi-Structured Data:
Semi-structured data is less organized than structured data but still has some level of
structure.
It may not conform to a rigid schema but often includes tags, labels, or hierarchies that
provide some meaning.
Examples include:
JSON (JavaScript Object Notation) data.
XML (eXtensible Markup Language) data.
NoSQL databases like MongoDB, which store data in semi-structured formats.
Semi-structured data is commonly used in web applications and data interchange formats.
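To make the idea concrete, here is a small Python sketch that parses a JSON record; the field names are invented for the illustration. The nested fields carry their own labels, but there is no rigid table schema:

import json

# A hypothetical JSON record: fields are labelled, but there is no fixed schema
raw = '{"user": "alice", "tags": ["big data", "nosql"], "profile": {"age": 30}}'

record = json.loads(raw)            # parse the JSON text into Python objects
print(record["user"])               # top-level field   -> alice
print(record["profile"]["age"])     # nested field      -> 30
print(len(record["tags"]))          # list-valued field -> 2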
Unstructured Data:
Unstructured data lacks a predefined structure and does not fit neatly into tables or rows.
It includes a wide range of content types, making it the most challenging type of data to work
with.
Examples include:
Text documents: Email messages, social media posts, articles.
Multimedia content: Images, videos, audio recordings.
Sensor data: Data from IoT (Internet of Things) devices.
Weblogs and clickstream data.
These types of digital data play a crucial role in big data analytics and inform the choice of
tools, techniques, and technologies used for data storage, processing, and analysis in the
world of big data.
History of Big Data innovation
The concept of Big Data can be traced back to the 1960s and 1970s, when the world was
generating data at an increasing rate, primarily in scientific research and government projects.
Relational Databases:
In the 1970s, Edgar F. Codd introduced the concept of relational databases, which allowed for
structured data storage and querying. While not initially designed for Big Data, relational
databases played a crucial role in data management.
Data Warehousing:
In the 1980s, data warehousing emerged as a method to consolidate and manage large
volumes of data for analytical purposes. This laid the foundation for the analysis of Big Data.
Internet Era:
The rise of the internet in the late 20th century led to an explosion of data generation. Search
engines like Google and e-commerce platforms like Amazon were early pioneers in dealing
with large datasets.
Hadoop and MapReduce:
In the mid-2000s, Google introduced the MapReduce programming model, and Hadoop, an
open-source implementation created by Doug Cutting and Mike Cafarella and later backed by
Yahoo, was developed under the Apache Software Foundation. This open-source framework
allowed distributed processing of large datasets across clusters of computers.
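To give a feel for the MapReduce model, here is a minimal word-count sketch in plain Python. It only imitates the map and reduce phases on a single machine; it is not the Hadoop API itself.

from collections import defaultdict

def map_phase(line):
    # map step: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(pairs):
    # reduce step: sum the counts for each distinct word
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

lines = ["big data needs big tools", "data drives decisions"]
intermediate = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(intermediate))
# {'big': 2, 'data': 2, 'needs': 1, 'tools': 1, 'drives': 1, 'decisions': 1}

In Hadoop, the map and reduce functions run in parallel on different nodes, with the framework shuffling the intermediate (word, count) pairs between them.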
Introduction to Big Data Platform
Data Ingestion: This phase involves collecting data from various sources, including
databases, social media, IoT devices, and more. Technologies like Apache Kafka and Flume
are used for data ingestion (a small Kafka sketch follows this list of layers).
Data Storage: Big data platforms use distributed storage systems like Hadoop Distributed
File System (HDFS) or cloud-based storage solutions such as Amazon S3 or Azure Blob
Storage.
Data Analysis: Machine learning and data analytics tools are integrated into big data
platforms. Libraries such as TensorFlow, scikit-learn, and Spark MLlib are used for analysis.
Data Visualization: Data visualization tools like Tableau, Power BI, and open-source options
like D3.js help present insights in a meaningful way.
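As a small illustration of the ingestion layer, the sketch below publishes a few events to a Kafka topic with the kafka-python client. The broker address and the topic name are assumptions for the example, and a running Kafka broker is required.

import json
from kafka import KafkaProducer  # pip install kafka-python

# Assumed local broker and example topic name
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

events = [
    {"sensor_id": 1, "temperature": 21.5},
    {"sensor_id": 2, "temperature": 19.8},
]

for event in events:
    producer.send("sensor-readings", value=event)  # publish to the ingestion topic

producer.flush()  # block until the events have actually been delivered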
Big Data Technology Components
Apache Spark: A fast, in-memory data processing engine for large-scale data processing (a short PySpark sketch follows this list).
NoSQL Databases: Such as MongoDB, Cassandra, and HBase, which are designed for
unstructured and semi-structured data.
Machine Learning Libraries: Including TensorFlow, scikit-learn, and PyTorch for building
predictive models.
Cloud-Based Solutions: Like AWS, Azure, and Google Cloud, which offer managed big data
services.
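To show what working with one of these components looks like, here is a minimal PySpark sketch that builds a tiny DataFrame and runs a grouped aggregation; the column names and values are invented for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("component-demo").getOrCreate()

# A tiny in-memory dataset standing in for a much larger distributed one
df = spark.createDataFrame(
    [("electronics", 1200.0), ("groceries", 85.5), ("electronics", 499.0)],
    ["category", "amount"],
)

# Total sales per category, computed in parallel across the cluster
df.groupBy("category").agg(F.sum("amount").alias("total")).show()

spark.stop()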
Drivers for Big Data
Data Explosion: The digital age has seen an exponential increase in data generation. The
proliferation of smartphones, IoT devices, social media, and online transactions has produced
an enormous volume of data that organizations can harness for insights.
Competitive Advantage: Companies recognize that by effectively utilizing big data, they can
gain a competitive edge. Analyzing data allows for better decision-making, improved
products and services, and enhanced customer experiences.
Cost Reduction: Big data technologies enable organizations to store and process data more
efficiently, often at a lower cost compared to traditional methods. This cost-effectiveness is a
strong incentive for adoption.
Operational Efficiency: Big data can optimize operational processes, supply chain
management, and resource allocation. It helps organizations make data-driven decisions to
enhance efficiency and reduce waste.
Risk Management: Big data analytics can identify and mitigate risks in real time. This is
crucial in industries such as finance and insurance to prevent fraud and minimize losses.
Innovation: Big data is a catalyst for innovation. By analyzing data, organizations can
discover new insights, create innovative products, and adapt to changing market conditions.
Customer Insights: Understanding customer behavior, sentiment, and feedback is critical for
businesses. Big data provides a means to gain deep insights into customer preferences and
needs.
Regulatory Compliance: Many industries face stringent data regulations. Big data platforms
help organizations manage and report data in compliance with these regulations, reducing
legal risks.
Healthcare Advancements: In the healthcare sector, big data is driving research and
enabling personalized medicine. It assists in diagnosing diseases, predicting outbreaks, and
improving patient care.
Real-Time Decision-Making: Big data technologies can process data in real time, allowing
organizations to make immediate decisions and respond swiftly to changing circumstances.
Supply Chain Optimization: Big data analytics can optimize the supply chain by predicting
demand, reducing lead times, and ensuring efficient inventory management.
Smart Cities: The concept of smart cities relies on big data to enhance urban planning, traffic
management, energy consumption, and public services.
Social Media and Sentiment Analysis: Organizations can gauge public sentiment and
opinions through social media data, which is valuable for marketing, reputation management,
and product development.
Predictive Maintenance: Big data analytics helps in predicting when machinery and
equipment will require maintenance, reducing downtime and operational disruptions.
In conclusion, the drivers for big data adoption are multifaceted, ranging from the need to
manage data growth to gaining a competitive advantage and improving operational
efficiency. As the volume and complexity of data continue to grow, big data technologies and
analytics will remain central to the success of organizations in various industries.
Big Data architecture and characteristics
Big data architecture is a comprehensive solution for dealing with an enormous amount of
data. It is the blueprint for the infrastructure and solutions a company needs, based on its
demands, and it clearly defines the components, layers, and methods of communication. The
reference points are the ingestion, processing, storage, management, access, and analysis of
the data.
• Data Sources: All of the sources that feed the data extraction pipeline fall under this
definition, so this is the starting point of the big data pipeline. Data sources, open and
third-party, play a significant role in the architecture. They include relational databases,
data warehouses, cloud-based data warehouses, SaaS applications, real-time data from
company servers and sensors such as IoT devices, third-party data providers, and static
files such as Windows logs. The data they supply can be handled by batch processing,
real-time processing, or both.
• Data Storage: Data is kept in distributed file stores that can hold large files in a variety
of formats; such files can also be collected in a data lake. The data managed for batch
operations is saved in these file stores. Common options include HDFS and blob or
object storage on Microsoft Azure, AWS, and GCP.
• Batch Processing: Long-running jobs split the data into chunks, filter and aggregate it,
and prepare it for analysis. These jobs typically read source files, process them, and
write the results to new files. Common approaches include Hive jobs, U-SQL jobs,
Sqoop or Pig jobs, and custom MapReduce jobs written in Java, Scala, or other
languages such as Python.
• Real-Time Message Ingestion: In contrast to batch processing, this layer covers the
real-time streaming systems that capture data at the moment it is generated. In the
simplest case, a data store receives all incoming messages and drops them into a folder
for later processing. If message-based processing is required, however, message-
ingestion stores such as Apache Kafka, Apache Flume, or Azure Event Hubs must be
used; these provide more reliable delivery along with other message-queuing semantics.
• Stream Processing: Real-time message ingestion and stream processing are different.
Ingestion captures the incoming data, often through a publish-subscribe mechanism,
while stream processing consumes that data as windows or streams, transforms it, and
writes the results to a sink. Engines include Apache Spark, Flink, and Storm (a minimal
streaming sketch follows this list).
• Reporting and Analysis: The processed data must finally be turned into insights. This
is accomplished by reporting and analysis tools that use embedded technology to
produce useful graphs, analyses, and insights for the business. Examples include
Cognos and Hyperion.
• Orchestration: Big data solutions involve repetitive data-related tasks that are organized
into workflow chains, which transform source data, move data between sources and
sinks, and load it into stores. Sqoop, Oozie, and Azure Data Factory are a few examples.
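The sketch below illustrates the stream-processing layer with Spark Structured Streaming. It reads lines from a local socket (a stand-in for a real message broker), keeps a running word count over the stream, and writes the counts to the console as the sink; the host and port are assumptions for the example.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Treat a local socket as the streaming source (e.g. fed by `nc -lk 9999`)
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word
words = lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Write the updated counts to the console sink after each micro-batch
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()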
Big Data Tools and Techniques
A big data tool can be classified into the four buckets listed below based on its practicability.
Massively Parallel Processing (MPP)
In an MPP system, each processor works on a separate part of the task, runs its own operating
system, and does not share memory. Two hundred or more processors can work on
applications connected over a high-speed network, with each processor handling its own set
of instructions. MPP systems coordinate by passing messages between processes, which
allows commands to be sent to the processors.
Examples of MPP-based databases include IBM Netezza, Oracle Exadata, Teradata, SAP
HANA, and EMC Greenplum.
No-SQL Databases
In a traditional structured database, data must be converted to a fixed structure before it can
be stored. NoSQL ("not only SQL") databases relax that requirement: they can hold
unstructured and heterogeneous data belonging to the same domain without a rigid schema.
NoSQL databases offer a vast range of configuration options, versatility, and scalability in
handling large quantities of data, and they support distributed data storage, making data
available locally or remotely.
Graph-based model: Graph databases store both entities and the relationships between them,
so the data is multi-relational. Entities are stored as nodes of the graph, and the relationships
between them are represented by edges. Graph databases are employed for mapping,
transportation, social networks, and spatial data applications, and they can also be used to
discover patterns in semi-structured and unstructured data. Neo4j, FlockDB, and OrientDB
are examples of graph databases.
Column-based model: Column families are also known as wide-column or columnar stores.
They are used for business intelligence, CRM, and library card catalogues. Cassandra,
HBase, and Hypertable are NoSQL databases that use columnar storage.
Document-based model: NoSQL document databases include MongoDB, CouchDB, Amazon
SimpleDB, Riak, and Lotus Notes.
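As a brief illustration of the document model, the sketch below stores and retrieves JSON-like documents with the pymongo client. The connection string, database, and collection names are assumptions, and a running MongoDB instance is required.

from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")
collection = client["demo_db"]["customers"]  # database and collection are created lazily

# Documents in the same collection do not have to share a fixed schema
collection.insert_one({"name": "Asha", "orders": [{"item": "laptop", "qty": 1}]})
collection.insert_one({"name": "Ravi", "loyalty_tier": "gold"})

print(collection.find_one({"name": "Asha"}))  # query by field value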
5 Vs of Big Data
Volume:
Definition: Volume refers to the vast amount of data generated and collected. Big Data involves
extremely large datasets, often measured in terabytes, petabytes, or even exabytes.
Significance: Managing and processing such immense volumes of data requires specialized
infrastructure and technologies designed for storage, retrieval, and analysis at scale.
Velocity:
Definition: Velocity represents the speed at which data is generated and how quickly it must be
processed and analyzed. Data can be generated in real-time or near-real-time.
Significance: Real-time processing capabilities are essential for handling data streams that are
generated rapidly, such as social media updates, sensor data, financial transactions, and more.
Variety:
Definition: Variety encompasses the diverse types of data that make up Big Data, including
structured, semi-structured, and unstructured data. This data can include text, images, videos,
sensor readings, and more.
Significance: Big Data systems must be able to handle and analyze different data formats and
structures. This requires flexibility in storage and processing to accommodate a wide range of
data sources.
Veracity:
Definition: Veracity refers to the trustworthiness and quality of data. Big Data can be noisy,
with errors, duplications, inconsistencies, and inaccuracies.
Significance: To derive meaningful insights, it's essential to address data quality issues through
data cleansing, validation, and integration processes. Trustworthy data leads to more reliable
analysis.
Value:
Definition: Value signifies the goal of Big Data, which is to extract valuable insights, patterns,
and knowledge from the data.
Significance: The ultimate purpose of handling Big Data is to make data-driven decisions, gain
competitive advantages, improve processes, and create value for organizations. Extracting
actionable insights from large datasets is at the heart of Big Data analytics.
Understanding the 5 Vs of Big Data is essential for organizations to develop effective strategies
for data collection, storage, processing, and analysis. By addressing these characteristics,
businesses can harness the power of Big Data to make informed decisions and discover new
opportunities.
1. Ingestion:
The ingestion layer is the very first step of pulling in raw data.
It comes from internal sources, relational databases, non-relational databases, social media,
emails, phone calls etc.
There are two kinds of ingestion:
Batch, in which large groups of data are gathered and delivered together.
Streaming, which is a continuous flow of data. This is necessary for real-time data analytics.
2. Storage:
The ingested data is stored in distributed file stores such as HDFS, in cloud object storage
such as Amazon S3 or Azure Blob Storage, or in a data lake that can hold files in many
formats.

Big Data features – security, compliance, auditing and protection
Security:
Definition: Security in Big Data involves safeguarding data against unauthorized access,
breaches, and cyber threats. It encompasses data encryption, authentication, authorization,
and network security.
Importance: Protecting sensitive information is paramount, as data breaches can have severe
consequences. Security measures, including encryption of data at rest and in transit, secure
access controls, and intrusion detection systems, are vital to ensure data confidentiality and
integrity.
Compliance:
Definition: Compliance refers to adhering to legal and regulatory requirements that pertain to
data handling, privacy, and security. Depending on the industry, this might involve
regulations like GDPR, HIPAA, or industry-specific standards.
Auditing:
Definition: Auditing involves tracking and monitoring all activities related to data, including
access, modification, and usage. Audit logs capture details such as who accessed data, what
actions were taken, and when.
Importance: Auditing provides transparency and accountability, aiding in compliance efforts
and identifying any suspicious or unauthorized activities. It is an essential part of ensuring
data security and governance in Big Data systems.
Data Protection:
Definition: Data protection focuses on safeguarding data from loss, corruption, and damage.
This includes backup and disaster recovery strategies.
Importance: Big Data systems often deal with irreplaceable data. Data protection measures
ensure data availability in the face of hardware failures, disasters, or other unforeseen events.
Key Measures for These Features in Big Data:
Access Control: Implement role-based access control (RBAC) to restrict data access to
authorized users. This ensures that only those with proper permissions can interact with
sensitive data.
Encryption: Use encryption techniques for data at rest and data in transit to protect data from
unauthorized access during storage and transmission (a brief encryption sketch follows this list).
Data Masking: For non-production environments, sensitive data can be replaced with
fictional data or obscured to prevent exposure of actual sensitive information.
Data Governance: Establish data governance policies and practices to ensure data quality,
compliance, and ethical data use.
Audit Trails: Implement detailed audit logs to track who accessed the data, what they did, and
when they did it. Regularly review audit logs for anomalies.
Data Classification: Categorize data according to its sensitivity and define appropriate
security and protection measures based on the classification.
Incident Response: Develop an incident response plan to address security breaches or data
loss effectively. This includes strategies for data recovery and system restoration.
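To make the encryption measure concrete, here is a minimal sketch using symmetric encryption from the Python cryptography library. Key management (storing and rotating the key securely) is deliberately left out, and the sample record is invented.

from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()   # in practice, keep this in a key-management service
cipher = Fernet(key)

sensitive = b"customer_id=42;card=4111-xxxx-xxxx-1111"
token = cipher.encrypt(sensitive)   # ciphertext that is safe to store at rest
restored = cipher.decrypt(token)    # only holders of the key can recover the data

assert restored == sensitive
print(token[:20])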
Big Data privacy and ethics
Big Data technologies and practices have raised important concerns related to privacy and
ethics. The massive volume of data, advanced analytics, and data-driven decision-making
have the potential to impact individuals, society, and organizations in significant ways.
Therefore, addressing privacy and ethical considerations is crucial. Here are key aspects of
Big Data privacy and ethics:
1. Data Privacy:
Definition: Data privacy involves the protection of individuals' personal and sensitive
information, ensuring that it is collected, stored, processed, and shared in a way that respects
their rights and choices.
Importance: With Big Data, there's an increased risk of privacy violations as vast amounts of
personal data are collected from various sources. Ensuring data privacy is essential to
maintain trust with customers and users.
2. Informed Consent:
Definition: Informed consent requires individuals to be aware of what data is being collected,
how it will be used, and to have the choice to opt in or out of data collection.
Importance: Informed consent is crucial in data collection. It empowers individuals to make
choices about their data and promotes transparency.
3. Data Anonymization:
Definition: Data anonymization removes or masks personally identifiable information so that
individuals cannot be identified from the data.
Importance: Anonymization lets data be analyzed and shared while reducing the risk of
exposing personal information.
4. Fairness:
Definition: Fairness refers to ensuring that Big Data analysis and algorithms are free from
bias and do not discriminate against individuals or groups based on race, gender, age, or other
characteristics.
Importance: Unintentional biases can lead to unfair treatment. Ethical considerations include
identifying and mitigating biases in data and algorithms.
5. Data Ownership and Control:
Definition: Data ownership concerns the question of who owns the data. Individuals should
have control over their own data, including the right to access, correct, or delete it.
Importance: Clarifying data ownership and control is essential to uphold individuals' rights
and privacy.
6. Transparency:
Definition: Transparency involves making data collection and processing practices clear to
individuals. Organizations should be open about their data practices.
Importance: Transparency builds trust, allows individuals to make informed decisions about
their data, and enhances accountability.
7. Data Security:
Definition: Data security measures protect data from unauthorized access, breaches, and
cyber threats.
Importance: Protecting data against security threats is fundamental to safeguarding privacy
and maintaining data integrity.
8. Ethical Use of Data:
Definition: Ethical use of data involves using data in ways that align with moral and social
values, respecting human rights and avoiding harm.
Importance: Adhering to ethical principles in data analysis and decision-making is essential to
avoid potential harm or misuse of data.
9. Social Implications:
Definition: Big Data can have profound societal impacts, such as influencing public opinion,
political decisions, and employment practices.
Importance: Understanding the social implications of Big Data is critical to address the
ethical and societal consequences.
10. Compliance with Regulations:
Definition: Compliance involves following legal and regulatory frameworks, such as GDPR
(General Data Protection Regulation) or HIPAA (Health Insurance Portability and
Accountability Act).
Importance: Organizations must comply with data protection regulations to avoid legal
consequences and protect individual privacy.
Big Data Analytics Process
Data Processing: Once collected, data often requires preprocessing, which involves cleaning,
transforming, and organizing it to ensure accuracy and consistency. This step is crucial for
accurate analysis.
ANALYSIS vs REPORTING
Purpose:
Analysis: The primary purpose of analysis is to examine data in depth and identify patterns,
trends, anomalies, and relationships within it. It aims to extract insights and answer specific
questions, often leading to a deeper understanding of the underlying factors.
Reporting: The primary purpose of reporting is to organize and summarize data into a
structured format, such as reports or dashboards, that communicates what has happened.
Nature:
Analysis: Analysis involves exploration, data mining, and hypothesis testing. It requires
critical thinking, data modeling, and the application of statistical or analytical techniques to
uncover meaningful insights.
Reporting: Reporting is descriptive in nature, presenting predefined metrics and summaries
rather than investigating why they occur.
Level of Detail:
Analysis: Analysis goes into granular detail, often looking at individual data points or
specific subsets of data. It aims to answer specific questions and requires a more thorough
examination of the data.
Reporting: Reporting presents aggregated, high-level summaries of the data rather than
granular detail.
Interactivity:
Analysis: Analysis is interactive and iterative. Analysts explore data, adjust models or
hypotheses, and perform multiple iterations to gain insights. It often involves ad-hoc queries
and exploratory data analysis.
Reporting: Reporting is typically static, offering a fixed view of data. While some reports and
dashboards may allow limited interaction (e.g., filtering or drilling down), it's not as flexible
as analysis.
Timing:
Analysis: Analysis can be an ongoing process and is often performed as needed or in
response to specific questions or problems. It doesn't have a fixed schedule.
Reporting: Reporting is usually periodic and scheduled. Reports are generated at regular
intervals (daily, weekly, monthly) and provide a snapshot of data at that moment.
Audience:
Analysis: The audience for analysis is often data analysts, scientists, or subject-matter experts
who need to deep dive into data to make informed decisions or recommendations.
Reporting: Reporting caters to a broader audience, including executives, managers, and other
stakeholders who require concise, high-level information for decision-making.
MODERN DATA ANALYTIC TOOLS
Apache Spark: An open-source, distributed computing framework known for its speed. It
supports real-time and batch data processing, machine learning, and graph processing,
making it a versatile choice for various analytics tasks.
Python and R: Both Python and R are popular programming languages for data analysis and
machine learning. They offer extensive libraries and frameworks, such as Pandas, NumPy,
SciPy, Scikit-Learn (Python), and the Tidyverse (R), that provide powerful data manipulation
and statistical analysis capabilities (a short Pandas sketch follows this list).
Tableau: A data visualization tool that allows users to create interactive and shareable
dashboards. Tableau is known for its user-friendly interface and the ability to connect to a
wide range of data sources.
QlikView and Qlik Sense: Qlik's products offer associative data modeling and in-memory
analytics. They allow users to explore data, create dynamic visualizations, and share insights
with others.
Google Data Studio: A free data visualization and reporting tool from Google. It allows
users to create custom reports and dashboards, connect to various data sources, and share
them with others.
KNIME: An open-source data analytics, reporting, and integration platform. KNIME allows
users to build data pipelines and automate data processing and analysis workflows.
Databricks: A unified analytics platform that combines the power of Apache Spark with
collaborative data science tools. It's designed for big data analytics, machine learning, and
artificial intelligence.
Alteryx: A data analytics platform that offers data blending, data preparation, and advanced
analytics capabilities. It's known for its user-friendly workflow-based interface.
Sisense: A business intelligence platform that enables data integration, analysis, and
visualization. It's designed to handle complex data and provide insights to non-technical
users.
Looker: A data exploration and business intelligence platform that allows organizations to
create and share reports and dashboards.
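As a small example of the kind of analysis the Python libraries above support, the following Pandas sketch computes a grouped summary over an invented dataset; the column names are assumptions for the illustration.

import pandas as pd

# A small invented dataset standing in for an extract from a larger store
sales = pd.DataFrame({
    "region": ["north", "south", "north", "east"],
    "revenue": [1200.0, 850.0, 430.0, 990.0],
})

# Total and average revenue per region
summary = sales.groupby("region")["revenue"].agg(["sum", "mean"])
print(summary)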
Big Data features – security, compliance, auditing and protection, Big Data privacy and
ethics
Security:
Volume and Velocity: The massive volume and high velocity of Big Data make it challenging
to secure. Security measures must keep pace with the continuous influx of data.
Data Breach Risks: Big Data often contains sensitive and personally identifiable information.
The risk of data breaches and unauthorized access is a significant concern.
Encryption: Data encryption is essential for protecting data both in transit and at rest. It
ensures that even if data is intercepted, it remains unreadable without the appropriate
decryption keys.
Regulatory Compliance: Big Data often contains data subject to various regulations, such as
GDPR, HIPAA, and industry-specific compliance requirements. Organizations must ensure
compliance with these regulations.
Data Auditing: Auditing data access and usage is vital for compliance and security.
Organizations need to maintain comprehensive audit logs and monitor data access and
modifications.
Data Governance: Establishing robust data governance practices helps organizations manage
data quality, privacy, and compliance effectively. This includes data lineage, metadata
management, and policy enforcement.
Protection Against Insider Threats: Insider threats pose significant risks. Organizations must
implement strategies to detect and mitigate threats from within their own workforce.
Key Points:
Types of Digital Data:
Digital data encompasses a wide variety of types, including text, images, audio, video,
structured databases, sensor data, and more.
Each type of data requires different processing and storage methods.
History of Big Data Innovation:
Big Data innovation has its roots in the early days of computing but gained prominence in the
21st century with the explosion of digital data.
Technologies like Hadoop, NoSQL databases, and advanced analytics tools have driven this
innovation.
Introduction to Big Data Platform:
A Big Data platform is an integrated technology framework for collecting, storing, processing,
and analyzing large and complex datasets.
It enables organizations to derive insights and value from data.
Drivers for Big Data:
The key drivers for Big Data adoption include the need for real-time decision-making, data-
driven insights, and competitive advantages.
The exponential growth of data and the Internet of Things (IoT) are also significant drivers.
Big Data Architecture and Characteristics:
Big Data architecture typically includes data sources, data storage, data processing, and data
analytics layers.
Key characteristics are scalability, fault tolerance, and parallel processing.
5 Vs of Big Data:
The 5 Vs represent Volume (data size), Velocity (data speed), Variety (data types), Veracity
(data quality), and Value (data usefulness).
These dimensions define the challenges and opportunities in Big Data.
Big Data Technology Components:
Components may include storage (Hadoop HDFS, NoSQL databases), processing
(MapReduce, Spark), and analytics tools (Tableau, R, Python).
Data integration, ETL (Extract, Transform, Load), and visualization tools are also part of the
ecosystem.
Big Data Importance and Applications:
Big Data enables data-driven decision-making, competitive advantage, operational efficiency,
and innovation.
Applications span healthcare, finance, supply chains, smart cities, marketing, and predictive
maintenance.
Big Data Analytics:
Big Data analytics involves processing and analyzing large and complex datasets to extract
valuable insights.
Techniques include descriptive, diagnostic, predictive, and prescriptive analytics.
Challenges of Conventional Systems:
Traditional systems struggle to handle Big Data's volume, velocity, and variety.
They may lack scalability, real-time processing, and advanced analytics capabilities.
Intelligent Data Analysis:
Intelligent data analysis uses AI and machine learning to derive deeper insights from data.
It includes techniques like clustering, regression, and neural networks.
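As a brief illustration of one such technique, the sketch below clusters a few invented two-dimensional points with scikit-learn's k-means; the data and the number of clusters are assumptions for the example.

import numpy as np
from sklearn.cluster import KMeans  # pip install scikit-learn

# Six invented points that form two visually separate groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each point
print(model.cluster_centers_)  # learned cluster centres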
Nature of Data:
Data can be structured (organized) or unstructured (not organized), and semi-structured (partly
organized).
Understanding data's nature is crucial for its effective processing.
Analytic Processes and Tools:
Analytic processes involve data collection, data preparation, data analysis, and interpretation.
Tools encompass software and algorithms used in data analytics.
Analysis vs. Reporting:
Analysis explores data in depth to uncover patterns and answer specific questions, while
reporting presents summarized, scheduled views of data to a broader audience.
Modern Data Analytic Tools:
Modern tools include open-source platforms like Apache Spark, data warehouses like
Snowflake, and cloud-based analytics services.
They offer advanced analytics and visualization capabilities.