Finance - Unit 4
Variety: Big data comes in various formats and types, including structured, semi-structured, and
unstructured data. Structured data refers to well-organized data with a clear schema, such as
database records and spreadsheets. Semi-structured data, such as XML and JSON files, has some
organizational properties but lacks a rigid structure. Unstructured data, such as text documents,
emails, images, and videos, does not follow any predefined format. Analyzing diverse data types
requires flexible data processing techniques and tools.
Examples of these diverse data types include:
Structured Data: Relational databases store structured data in tables with predefined
schemas, such as customer information, product inventory, and financial transactions.
Semi-Structured Data: XML and JSON files represent semi-structured data with
hierarchical structures and flexible schemas, commonly used in web services, APIs, and
data interchange formats.
Unstructured Data: Text documents, emails, images, videos, and social media posts are
examples of unstructured data that lack a predefined format and organization, requiring
advanced techniques for analysis and interpretation.
Veracity: Veracity refers to the reliability and accuracy of the data. Big data often includes
noise, inconsistencies, and errors, which can affect the quality of analysis and decision-making.
Ensuring data quality through validation, cleansing, and preprocessing is crucial for obtaining
reliable insights from big data.
Examples of data quality challenges include:
Incomplete Data: Missing or incomplete data entries in databases or datasets can lead to
inaccuracies and biases in analysis and decision-making.
Inconsistencies: Data inconsistencies may arise from discrepancies in data formats, units
of measurement, or naming conventions across different sources or systems.
Data Errors: Data errors, such as typographical errors, duplicates, and outliers, can distort
analysis results and hinder data-driven decision-making processes.
By understanding these aspects of big data and leveraging advanced technologies and analytical
techniques, organizations can unlock valuable insights from large and complex datasets to drive
innovation, improve decision-making, and gain a competitive edge in today's data-driven world.
The emergence of big data technologies and platforms has enabled organizations to harness the
potential of large-scale data analytics. Some key technologies and concepts associated with big
data include:
1. Distributed Computing: Traditional data processing frameworks struggle to handle big data
due to limitations in processing power and storage capacity. Distributed computing frameworks
like Apache Hadoop and Apache Spark allow for the distributed storage and parallel processing
of large datasets across clusters of computers, enabling scalable and efficient data processing.
2. NoSQL Databases: NoSQL (Not Only SQL) databases are designed to handle large volumes
of unstructured and semi-structured data more efficiently than traditional relational databases.
Examples of NoSQL databases include MongoDB, Cassandra, and HBase. These databases are
optimized for scalability, flexibility, and high availability, making them suitable for big data
applications.
3. Data Lakes and Data Warehouses: Data lakes and data warehouses are storage repositories
that store and manage large volumes of structured and unstructured data for analysis. While data
warehouses are optimized for structured data and support structured query language (SQL)
queries, data lakes can store raw, unprocessed data in its native format, allowing for more
flexibility in data processing and analysis.
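To make the contrast concrete, the following minimal Python sketch (not from the text) first queries a small warehouse-style SQL table and then reads raw, schema-on-read JSON files from a hypothetical data-lake folder; the folder path, table, and field names are illustrative assumptions.

```python
# Warehouse-style access: structured data with a fixed schema, queried via SQL.
import glob
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("EU", 120.0), ("US", 340.5), ("EU", 75.25)])
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(region, total)

# Lake-style access: raw files kept in their native format, with the schema
# applied only at read time ("schema on read").
for path in glob.glob("lake/events/*.json"):      # hypothetical landing folder
    with open(path) as f:
        for line in f:                            # newline-delimited JSON
            event = json.loads(line)
            print(event.get("event_type"), event.get("timestamp"))
```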
4. Machine Learning and AI: Big data analytics often involves the use of machine learning and
artificial intelligence techniques to derive insights, predict future trends, and automate decision-
making processes. Machine learning algorithms can analyze large datasets to identify patterns,
correlations, and anomalies that may not be apparent through traditional analysis methods.
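As a simple illustration of this idea, the hedged Python sketch below uses scikit-learn's IsolationForest to flag anomalous values in a synthetic set of transaction amounts; the data and the contamination setting are assumptions for illustration, not an example from the text.

```python
# Flagging anomalies in a (synthetic) set of transaction amounts.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=100.0, scale=15.0, size=(10_000, 1))      # typical amounts
outliers = rng.uniform(low=1_000.0, high=5_000.0, size=(20, 1))   # unusual spikes
amounts = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0)
labels = model.fit_predict(amounts)        # -1 marks suspected anomalies

print("flagged as anomalous:", int((labels == -1).sum()))
```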
Big data represents a paradigm shift in how organizations collect, store, process, and analyze
data. By leveraging advanced technologies and analytical techniques, organizations can extract
valuable insights from large and complex datasets to drive innovation, improve decision-making,
and gain a competitive edge in today's data-driven world.
Core technologies for distribution and scale
Distributing and scaling data processing is essential for efficiently handling big data. Several
core technologies and frameworks have emerged to address these challenges:
1. Apache Hadoop:
Hadoop is one of the most widely used frameworks for distributed storage and processing of
big data. It consists of two main components: Hadoop Distributed File System (HDFS) for
distributed storage and MapReduce for distributed processing. HDFS divides large files into
smaller blocks and distributes them across a cluster of commodity hardware. MapReduce is a
programming model for parallel processing of data across the cluster. The Hadoop ecosystem also
includes various related projects such as YARN (Yet Another Resource Negotiator) for resource
management, Hive for data warehousing and querying, and Pig for data flow scripting.
Application: Hadoop is widely used in various industries for large-scale batch processing of data.
One notable application is in the field of cybersecurity. Security analysts use Hadoop clusters to
process and analyze massive volumes of log data from network devices, servers, and applications
to detect and mitigate cybersecurity threats. Hadoop's distributed processing capabilities enable parallel analysis of massive log volumes to identify patterns, anomalies, and suspicious activities.
Example: A cybersecurity company leverages Hadoop to build a threat detection platform. They
ingest logs from firewalls, intrusion detection systems, and other security appliances into HDFS.
Using MapReduce jobs, they analyze these logs to identify potential security breaches, such as
unauthorized access attempts, malware infections, or data exfiltration activities. By correlating
events across different data sources and applying machine learning algorithms, they can detect
and respond to security incidents more effectively.
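A minimal sketch in the spirit of this example is shown below as a pair of Hadoop Streaming scripts in Python: a mapper that emits a count for every failed-login log line and a reducer that totals the counts per source IP. The log format, the field positions, and the "LOGIN_FAILED" status value are hypothetical assumptions.

```python
# mapper.py -- emits "source_ip<TAB>1" for each failed-login log line.
# Assumes space-separated logs with the status in field 3 and the IP in field 6.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) > 6 and fields[3] == "LOGIN_FAILED":
        print(f"{fields[6]}\t1")
```

```python
# reducer.py -- sums the counts per source IP (Hadoop sorts input by key).
import sys

current_ip, count = None, 0
for line in sys.stdin:
    ip, value = line.rstrip("\n").split("\t")
    if ip == current_ip:
        count += int(value)
    else:
        if current_ip is not None:
            print(f"{current_ip}\t{count}")
        current_ip, count = ip, int(value)
if current_ip is not None:
    print(f"{current_ip}\t{count}")
```

In practice these two scripts would be submitted with the Hadoop Streaming jar, which runs the mapper in parallel over the HDFS blocks of the log files and feeds the sorted intermediate output to the reducer; the exact invocation depends on the cluster setup.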
2. Apache Spark:
Spark is a fast and general-purpose cluster computing system that provides in-memory data
processing capabilities. It offers higher-level APIs than MapReduce, making it easier to develop
complex data processing workflows. Spark supports various programming languages such as
Scala, Java, Python, and R. It includes libraries for SQL, machine learning (MLlib), graph
processing (GraphX), and streaming data processing (Spark Streaming). Spark can run on Hadoop YARN, on Mesos, or in standalone mode.
Application: Spark is used for both batch and stream processing of big data, making it suitable
for a wide range of applications, including real-time analytics, machine learning, and interactive
querying. An example application is in e-commerce for real-time personalized recommendations.
E-commerce platforms use Spark Streaming to process user clickstream data in real time and
generate personalized product recommendations based on users' browsing and purchasing
history.
Example: An online retailer implements a recommendation engine using Spark Streaming. As
users browse the website, their clickstream data is ingested into a Kafka cluster. Spark Streaming
jobs analyze this data in near real-time to identify patterns and preferences. Based on users'
behavior and product affinities, personalized recommendations are generated and displayed on
the website in milliseconds. This enhances the user experience and increases the likelihood of
conversions.
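A hedged sketch of such a pipeline using PySpark Structured Streaming with its Kafka source (which requires the Spark Kafka connector package) is shown below; the broker address, topic name, JSON fields, and window sizes are illustrative assumptions rather than details taken from the example.

```python
# Counting product views per user in short event-time windows as a simple
# stand-in for the features a recommendation engine might compute.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("clickstream-demo").getOrCreate()

schema = (StructType()
          .add("user_id", StringType())
          .add("product_id", StringType())
          .add("event_time", TimestampType()))

clicks = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")   # assumed broker
          .option("subscribe", "clickstream")                    # assumed topic
          .load()
          .select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

views = (clicks
         .withWatermark("event_time", "10 minutes")
         .groupBy(F.window("event_time", "5 minutes"), "user_id", "product_id")
         .count())

query = (views.writeStream
         .outputMode("update")
         .format("console")     # a real system would write to a serving store
         .start())
query.awaitTermination()
```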
3. Apache Flink:
Flink is a stream processing framework that provides low-latency, high-throughput processing
of continuous data streams. It supports event-driven and batch processing modes and offers fault-
tolerance and exactly-once processing semantics. Flink's APIs enable the development of
complex stream processing applications with features like event time processing, windowing,
and stateful computations. Flink integrates seamlessly with other big data technologies and
supports deployment on various cluster managers.
Application: Flink is primarily used for real-time stream processing and event-driven
applications. One common application is in the Internet of Things (IoT) for real-time monitoring
and analytics of sensor data. Companies in industries like manufacturing, energy, and
transportation use Flink to process streams of sensor data from IoT devices to detect anomalies,
predict equipment failures, and optimize operations in real time.
Example: A smart manufacturing company employs Flink for predictive maintenance of
industrial machinery. Sensors embedded in machines continuously monitor various parameters
such as temperature, vibration, and pressure. Flink jobs process this streaming data, applying
machine learning models to detect early signs of equipment degradation or failure. Predictive
alerts are generated to notify maintenance teams, enabling proactive maintenance and
minimizing downtime.
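The following is a minimal PyFlink DataStream sketch in the spirit of this example: it filters a small in-memory stand-in for the sensor stream and emits alerts for over-temperature readings. The sample readings, the 80-degree threshold, and the alert format are assumptions; a production job would read from a connector such as Kafka and apply trained models rather than a fixed threshold.

```python
# Filtering a stream of (machine_id, temperature) readings for alerts.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

# An in-memory collection stands in for the continuous sensor stream.
readings = env.from_collection([
    ("machine-1", 72.4),
    ("machine-2", 91.7),
    ("machine-1", 68.9),
    ("machine-3", 88.2),
])

alerts = (readings
          .filter(lambda r: r[1] > 80.0)                         # over-temperature
          .map(lambda r: f"ALERT {r[0]}: temperature {r[1]}"))

alerts.print()
env.execute("sensor-alert-demo")
```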
4. Apache Kafka:
Kafka is a distributed streaming platform that is often used for building real-time data pipelines
and event-driven architectures. It provides high-throughput, fault-tolerant messaging by
decoupling data producers from consumers through distributed commit logs. Kafka topics serve
as durable, append-only logs for publishing and subscribing to streams of records. Kafka
Connect enables integration with external data sources and sinks, while Kafka Streams allows for
stream processing within the Kafka ecosystem.
Application: An example application is in financial services for real-time fraud detection. Banks and payment processors use Kafka to ingest transaction data from multiple channels and systems in real time. Stream processing applications analyze this data to detect fraudulent transactions, such as unauthorized access, identity theft, or suspicious spending patterns.
Example: A financial institution deploys Kafka for real-time fraud detection. Transaction data
from ATM networks, online banking systems, and point-of-sale terminals is ingested into Kafka
topics. Kafka Connect is used to integrate with fraud detection systems and machine learning
models. Stream processing applications consume these data streams, applying complex
algorithms to identify potentially fraudulent activities. Suspicious transactions trigger immediate
alerts to fraud analysts for investigation and mitigation.
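A minimal sketch of such a pipeline using the kafka-python client is shown below; the broker address, topic names, message fields, and the simple rule are illustrative assumptions, and a real deployment would apply trained models rather than a fixed threshold.

```python
# Consume transactions, flag suspicious ones, and publish alerts.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "transactions",                               # assumed input topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:                          # runs until interrupted
    txn = message.value
    # Simple stand-in rule: large amounts or transactions outside the home country.
    if txn.get("amount", 0) > 10_000 or txn.get("country") != txn.get("home_country"):
        producer.send("fraud-alerts", txn)        # assumed alert topic
```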
5. NoSQL Databases:
NoSQL databases are designed to handle large volumes of unstructured and semi-structured
data with horizontal scalability and high availability. Examples include Apache Cassandra,
MongoDB, Apache HBase, and Amazon DynamoDB. These databases use distributed
architectures and techniques like sharding, replication, and eventual consistency to ensure
scalability and fault tolerance. NoSQL databases are often used for real-time analytics, content
management, and user profile management in big data applications.
Application: An example application is in digital advertising for real-time ad targeting and personalization. Ad tech companies use NoSQL databases to store user
profiles, behavioral data, and ad inventory in scalable and high-performance databases. These
databases support real-time querying and analysis to deliver targeted ads to users based on their
interests, demographics, and browsing history.
Example: An advertising technology platform utilizes MongoDB for real-time ad targeting. User
profiles and behavioral data are stored in MongoDB collections, allowing for flexible schema
design and horizontal scalability. Ad targeting algorithms query MongoDB in real time to
retrieve user segments and preferences. Based on this information, relevant ads are selected and
delivered to users across websites and mobile apps, maximizing advertising effectiveness and
revenue.
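The hedged PyMongo sketch below illustrates the pattern: a flexible user-profile document is stored, an index is created on the fields used for targeting, and a user segment is retrieved with a query. The database, collection, field names, and values are assumptions for illustration.

```python
# Store a schema-light user profile and query a targeting segment.
from pymongo import ASCENDING, MongoClient

client = MongoClient("mongodb://localhost:27017")
profiles = client["adtech"]["user_profiles"]

profiles.insert_one({
    "user_id": "u-1001",
    "interests": ["running", "wearables"],
    "country": "IN",
    "last_seen": "2024-05-01",
})

# Index the fields used by the targeting query so lookups stay fast at scale.
profiles.create_index([("interests", ASCENDING), ("country", ASCENDING)])

# Retrieve the segment of users to target with a fitness-related campaign.
segment = profiles.find({"interests": "running", "country": "IN"},
                        {"_id": 0, "user_id": 1})
print([doc["user_id"] for doc in segment])
```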
6. Containerization and Orchestration:
Containerization technologies like Docker and container orchestration platforms like
Kubernetes have become essential for deploying and managing distributed big data applications
at scale. Containers encapsulate applications and their dependencies, enabling consistent
deployment across different environments. Kubernetes automates the deployment, scaling, and
management of containerized applications, providing features like service discovery, load
balancing, and rolling updates for distributed systems.
Application: An example application is in cloud computing for scalable data analytics. Cloud service providers use Kubernetes to orchestrate containerized big data workloads across multiple nodes in a cloud infrastructure, ensuring high availability and efficient use of resources.
Example: A cloud-based data analytics platform utilizes Kubernetes for scalable data processing.
Spark and Flink jobs are containerized using Docker images and deployed as Kubernetes pods in
a cloud-native environment. Kubernetes automatically schedules and scales these pods based on
resource requirements and workload demands. Containerized data pipelines ingest, process, and
analyze large volumes of data across distributed clusters, providing elastic scalability and
efficient resource utilization for data-intensive workloads.
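As a small illustration, the sketch below uses the official Kubernetes Python client to list the pods running a containerized workload and to scale a worker deployment; it assumes a cluster reachable through ~/.kube/config and a deployment named "spark-worker" in an "analytics" namespace, both of which are hypothetical.

```python
# Inspect and scale a containerized big data workload via the Kubernetes API.
from kubernetes import client, config

config.load_kube_config()                 # uses the local kubeconfig

# List the pods currently running in the (assumed) analytics namespace.
core = client.CoreV1Api()
for pod in core.list_namespaced_pod(namespace="analytics").items:
    print(pod.metadata.name, pod.status.phase)

# Scale the (assumed) worker deployment up to handle a heavier load.
apps = client.AppsV1Api()
apps.patch_namespaced_deployment_scale(
    name="spark-worker",
    namespace="analytics",
    body={"spec": {"replicas": 5}},
)
```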
These core technologies provide the foundation for building distributed and scalable big data
systems that can handle the volume, velocity, variety, and veracity of modern data sources. By
leveraging these technologies, organizations can unlock the full potential of big data analytics to
drive innovation, improve decision-making, and gain a competitive edge in their respective
industries.
Database management
Database management encompasses the processes, tools, and techniques used to efficiently store,
organize, retrieve, and manage data in a database system. It involves designing, implementing,
and maintaining databases to meet the information needs of an organization while ensuring data
integrity, security, and performance. Here's a detailed note covering all relevant aspects of
database management:
A database is a structured collection of data organized for efficient retrieval and manipulation.
Databases store data in tables, which consist of rows (records) and columns (fields).
Data is accessed and manipulated using queries, transactions, and other database operations.
Database Design:
Database design involves translating the logical data model into a physical database schema. It includes decisions on data types, indexing strategies, partitioning schemes, and storage optimization techniques to ensure efficient data storage and retrieval.
Denormalization may be applied selectively to optimize query performance, trading off redundancy for faster data retrieval.
Database Administration:
Database administrators (DBAs) oversee the day-to-day operations of database systems, ensuring optimal performance, availability, and security. Their tasks include installation, configuration, monitoring, performance tuning, capacity planning, and troubleshooting. DBAs also implement security policies, backup and recovery procedures, and database upgrades.
Database Design:
Conceptual Design: Identifies the key entities, attributes, and relationships needed to model the real-world domain. This stage focuses on creating an abstract representation of the data, based on the system's requirements, without considering implementation details.
Logical Design: Translates the conceptual design into a data model, such as an Entity-Relationship (ER) model or a relational model. It defines the structure of the database, including tables, columns, keys, and relationships between entities.
Physical Design: Determines how the logical design will be implemented in the underlying database system. This includes decisions on storage structures, indexing strategies, partitioning, and optimization techniques to ensure efficient data storage and retrieval.
Data Modeling:
Entity-Relationship (ER) Modeling: Creates a visual representation of the data model using entities (objects or concepts), attributes (properties of entities), and relationships (associations between entities).
Normalization: The process of organizing data to minimize redundancy and dependency, typically through normal forms such as First Normal Form (1NF), Second Normal Form (2NF), and Third Normal Form (3NF). Normalization reduces data duplication and improves data integrity (see the sketch after this list).
Denormalization: Selectively reintroduces redundancy into the database to improve read performance. This technique can be useful for read-heavy workloads or when optimizing for specific query patterns.
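The sketch referenced above is shown here using Python's built-in sqlite3 module: repeated customer attributes are factored into their own table and referenced by key, and a join reassembles the combined view when a report needs it. Table and column names are illustrative assumptions.

```python
# A miniature example of a normalized (3NF-style) schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Customer attributes are stored once, in one place.
cur.execute("""CREATE TABLE customers (
                   customer_id INTEGER PRIMARY KEY,
                   name TEXT NOT NULL,
                   city TEXT NOT NULL)""")
# Orders reference the customer by key instead of repeating name and city.
cur.execute("""CREATE TABLE orders (
                   order_id INTEGER PRIMARY KEY,
                   customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
                   amount REAL NOT NULL)""")

cur.execute("INSERT INTO customers VALUES (1, 'Asha Rao', 'Pune')")
cur.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(101, 1, 250.0), (102, 1, 90.5)])

# A join reassembles the denormalized view when needed.
for row in cur.execute("""SELECT o.order_id, c.name, c.city, o.amount
                          FROM orders o JOIN customers c USING (customer_id)"""):
    print(row)
```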
Database Administration:
Backup and Recovery: Regularly backing up database data and transaction logs to prevent data
loss and ensure recoverability in case of failures.
Performance Tuning: Optimizing database performance through indexing, query optimization,
caching, and hardware configuration.
Security Management: Implementing access controls, authentication, and encryption
mechanisms to protect data from unauthorized access and breaches.
Database Architecture:
Client-Server Architecture: Separates database management system (DBMS) into client
applications and a server that handles data storage and processing.
Distributed Databases: Spread data across multiple servers or nodes for scalability, fault
tolerance, and performance.
Cloud Databases: Database services provided by cloud providers (e.g., Amazon RDS, Google
Cloud SQL) offering scalability, managed services, and pay-as-you-go pricing.
Data Warehousing and Business Intelligence:
Data Warehousing: Process of collecting, storing, and managing data from various sources for
analysis and reporting.
Online Analytical Processing (OLAP): Analytical database approach for querying and analyzing
multidimensional data for business intelligence and decision support.
Data Mining: Extracting patterns and insights from large datasets using statistical algorithms,
machine learning, and AI techniques.
Emerging Trends:
Big Data: Handling large volumes of structured and unstructured data for analytics, predictive
modeling, and decision-making.
Blockchain Databases: Distributed and tamper-proof databases used for recording transactions
and maintaining an immutable ledger.
Graph Databases: Storing and querying data based on graph theory, suitable for applications
involving complex relationships and network analysis.
Database management plays a critical role in modern organizations by providing a reliable and
efficient means of storing, accessing, and managing data assets. Effective database management
practices are essential for leveraging data as a strategic asset, enabling informed decision-
making, and driving business growth and innovation.
Data stream management
Data Stream Management involves the processing and analysis of continuous, real-time data
streams. Here's an elaboration on this topic:
Data Stream Management refers to the handling of continuous, high-volume streams of data
generated from various sources such as sensors, social media, financial transactions, and IoT
devices.
Unlike traditional batch processing, where data is collected and processed in batches, data stream
management deals with data that is continuously produced and processed in near real-time.
Characteristics:
Continuous: Data streams flow continuously and indefinitely, without a defined beginning or
end.
High Volume and Velocity: Streams may produce vast amounts of data at high speeds, requiring
real-time or near real-time processing.
Dynamic: Data streams can be dynamic and unpredictable, with varying data rates, patterns, and
distributions.
Transient: Data in streams may have a transient nature, where older data becomes less relevant
over time.
Components:
Data Ingestion: Involves capturing and collecting data from various sources and ingesting it into
the stream processing system.
Stream Processing Engine: A software system designed to process, analyze, and transform data
streams in real-time. Examples include Apache Kafka Streams, Apache Flink, Apache Storm,
and Amazon Kinesis.
Analytics and Applications: Applications built on top of stream processing engines perform real-
time analytics, monitoring, alerting, and decision-making based on incoming data streams.
Output Destinations: Processed data may be sent to various destinations such as databases, data
warehouses, dashboards, or downstream systems for further analysis or action.
Stream Processing Paradigms:
Event-driven Processing: Processes events as they occur in the stream, reacting to changes or
triggers in real time.
Window-based Processing: Groups incoming events into fixed-size or time-based windows for analysis, aggregation, or pattern recognition (a minimal sketch follows this list).
Stateful Processing: Maintains state across multiple events or windows to perform complex
analytics or computations.
Complex Event Processing (CEP): Identifies patterns or sequences of events in real time to
detect anomalies, correlations, or trends.
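The minimal, framework-free Python sketch below illustrates the window-based processing idea referenced in the list above: events carrying timestamps are grouped into fixed ten-second tumbling windows and counted per key. The event structure and window size are assumptions for illustration.

```python
# Count events per (window_start, key) for a stream of (timestamp, key) pairs.
from collections import defaultdict

WINDOW_SECONDS = 10

def tumbling_window_counts(events):
    counts = defaultdict(int)
    for timestamp, key in events:
        # Assign each event to the window that starts at a multiple of 10 seconds.
        window_start = (timestamp // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, key)] += 1
    return dict(counts)

stream = [(1, "login"), (4, "click"), (9, "click"), (12, "login"), (19, "click")]
print(tumbling_window_counts(stream))
# {(0, 'login'): 1, (0, 'click'): 2, (10, 'login'): 1, (10, 'click'): 1}
```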
Use Cases:
IoT and Sensor Data Processing: Monitoring and analyzing sensor data from IoT devices for
predictive maintenance, anomaly detection, and resource optimization.
Financial Trading: Analyzing market data streams for real-time trading decisions, risk
management, and fraud detection in financial markets.
Social Media Analytics: Processing social media feeds for sentiment analysis, trending topics,
and targeted advertising.
Healthcare Monitoring: Monitoring patient data from medical devices for early disease detection,
patient care management, and remote patient monitoring.
Network Monitoring: Analyzing network traffic streams for intrusion detection, network
performance monitoring, and cybersecurity.
Challenges:
Scalability: Handling high-volume data streams and scaling the system to accommodate
increasing data rates and processing requirements.
Latency: Minimizing processing latency to ensure timely responses and actionable insights from
real-time data.
Fault Tolerance: Building fault-tolerant systems that can withstand failures, recover quickly, and
maintain data consistency in the event of node failures or network partitions.
Complexity: Dealing with the complexity of distributed systems, stream processing algorithms,
and state management in real-time environments.
Technologies:
Apache Kafka: A distributed streaming platform that provides scalable, fault-tolerant messaging
and event streaming capabilities.
Apache Flink: A stream processing framework that supports event-driven, window-based, and
stateful processing of data streams.
Apache Storm: A real-time stream processing system for processing large volumes of data
streams with low latency.
Amazon Kinesis: A managed service for real-time data streaming and analytics on the AWS
cloud platform.
Google Cloud Dataflow: A fully managed stream and batch processing service on the Google
Cloud Platform for building real-time data pipelines.
Data Stream Management is critical for organizations looking to extract insights, make informed
decisions, and take timely actions based on real-time data streams. By leveraging stream
processing technologies and methodologies, businesses can gain competitive advantages in
today's fast-paced and data-driven world.
Quick identification of trends: With real-time data visualization, organizations can quickly identify and respond to changes as they happen, whether that means capitalizing on an increase in sales or responding to suspicious activity. Visualization can help present clear conclusions as soon as the data is processed.
What is data visualization?
Data visualization is the practice of translating information into a visual context, such as a map
or graph, to make data easier for the human brain to understand and pull insights from. The main
goal of data visualization is to make it easier to identify patterns, trends and outliers in large data
sets. The term is often used interchangeably with others, including information graphics,
information visualization and statistical graphics.
Data visualization is one of the steps of the data science process, which states that after data has
been collected, processed and modeled, it must be visualized for conclusions to be made. Data
visualization is also an element of the broader data presentation architecture (DPA) discipline,
which aims to identify, locate, manipulate, format and deliver data in the most efficient way
possible.
Data visualization is important for almost every career. It can be used by teachers to display
student test results, by computer scientists exploring advancements in artificial intelligence (AI)
or by executives looking to share information with stakeholders. It also plays an important role
in big data projects. As businesses accumulated massive collections of data during the early
years of the big data trend, they needed a way to get an overview of their data quickly and easily.
Visualization tools were a natural fit.
Visualization is central to advanced analytics for similar reasons. When a data scientist is writing
advanced predictive analytics or machine learning (ML) algorithms, it becomes important to
visualize the outputs to monitor results and ensure that models are performing as intended. This
is because visualizations of complex algorithms are generally easier to interpret than numerical
outputs.
A timeline depicting the history of data visualization
Why is data visualization important?
Data visualization provides a quick and effective way to communicate information in a universal
manner using visual information. The practice can also help businesses identify which factors
affect customer behavior; pinpoint areas that need to be improved or need more attention; make
data more memorable for stakeholders; understand when and where to place specific products;
and predict sales volumes.
Other benefits of data visualization include the following:
The ability to absorb information quickly, improve insights and make faster decisions;
An increased understanding of the next steps that must be taken to improve the organization;
An improved ability to maintain the audience's interest with information they can understand;
An easy distribution of information that increases the opportunity to share insights with everyone involved;
A reduced reliance on data scientists, since data is more accessible and understandable; and
An increased ability to act on findings quickly and, therefore, achieve success with greater speed and fewer mistakes.
Data visualization and big data
The increased popularity of big data and data analysis projects has made visualization more
important than ever. Companies are increasingly using machine learning to gather massive
amounts of data that can be difficult and slow to sort through, comprehend and explain.
Visualization offers a means to speed this up and present information to business owners and
stakeholders in ways they can understand.
Big data visualization often goes beyond the typical techniques used in normal visualization,
such as pie charts, histograms and corporate graphs. It instead uses more complex
representations, such as heat maps and fever charts. Big data visualization requires powerful
computer systems to collect raw data, process it and turn it into graphical representations that
humans can use to quickly draw insights.
While big data visualization can be beneficial, it can pose several disadvantages to organizations.
They are as follows:
To get the most out of big data visualization tools, a visualization specialist must be hired. This
specialist must be able to identify the best data sets and visualization styles to guarantee
organizations are optimizing the use of their data.
Big data visualization projects often require involvement from IT, as well as management, since
the visualization of big data requires powerful computer hardware, efficient storage systems and
even a move to the cloud.
The insights provided by big data visualization will only be as accurate as the information being
visualized. Therefore, it is essential to have people and processes in place to govern and control
the quality of corporate data, metadata and data sources.
Examples of data visualization
In the early days of visualization, the most common visualization technique was using
a Microsoft Excel spreadsheet to transform the information into a table, bar graph or pie chart.
While these visualization methods are still commonly used, more intricate techniques are now
available, including the following:
Infographics
Bubble Clouds
Bullet Graphs
Heat Maps
Fever Charts
Time Series Charts
Some other popular techniques are as follows:
Line charts. This is one of the most basic and common techniques used. Line charts display how
variables can change over time.
Area charts. This visualization method is a variation of a line chart; it displays multiple values in
a time series -- or a sequence of data collected at consecutive, equally spaced points in time.
Scatter plots. This technique displays the relationship between two variables. A scatter plot takes
the form of an x- and y-axis with dots to represent data points.
Treemaps. This method shows hierarchical data in a nested format. The size of the rectangles
used for each category is proportional to its percentage of the whole. Treemaps are best used
when multiple categories are present, and the goal is to compare different parts of a whole.
Population pyramids. This technique uses a stacked bar graph to display the complex social
narrative of a population. It is best used when trying to display the distribution of a population.
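As a simple illustration of two of the techniques above, the following Python sketch uses matplotlib to draw a line chart of a value over time and a scatter plot of two related variables; the data is synthetic and purely illustrative.

```python
# A line chart (value over time) and a scatter plot (two related variables).
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(7)
months = np.arange(1, 13)
revenue = 100 + 5 * months + rng.normal(0, 8, size=12)        # trend plus noise
ad_spend = rng.uniform(10, 50, size=12)
traffic = 200 + 12 * ad_spend + rng.normal(0, 40, size=12)    # correlated pair

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(months, revenue, marker="o")
ax1.set(title="Line chart: revenue over time", xlabel="Month", ylabel="Revenue")

ax2.scatter(ad_spend, traffic)
ax2.set(title="Scatter plot: ad spend vs. traffic",
        xlabel="Ad spend", ylabel="Web traffic")

fig.tight_layout()
plt.show()
```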
Common data visualization use cases
Common use cases for data visualization include the following:
Sales and marketing. Research from market and consumer data provider Statista estimated that $566 billion was spent on digital advertising in 2022 and projected that the figure will cross the $700 billion mark by 2025. Marketing teams must pay close attention to their sources of web traffic and how their web properties generate revenue. Data visualization makes it easy to see how marketing efforts affect traffic trends over time.
Politics. A common use of data visualization in politics is a geographic map that displays the
party each state or district voted for.
Healthcare. Healthcare professionals frequently use choropleth maps to visualize important
health data. A choropleth map displays divided geographical areas or regions that are assigned a
certain color in relation to a numeric variable. Choropleth maps allow professionals to see how a
variable, such as the mortality rate of heart disease, changes across specific territories.
Scientists. Scientific visualization, sometimes referred to in shorthand as SciVis, allows
scientists and researchers to gain greater insight from their experimental data than ever before.
Finance. Finance professionals must track the performance of their investment decisions when choosing to buy or sell an asset. Candlestick charts are used as trading tools and help finance professionals analyze price movements over time for assets such as securities, derivatives, currencies, stocks, bonds and commodities. By analyzing how the price has changed over time, data analysts and finance professionals can detect trends.
Logistics. Shipping companies can use visualization tools to determine the best global shipping
routes.
Data scientists and researchers. Visualizations built by data scientists are typically for the
scientist's own use, or for presenting the information to a select audience. The visual
representations are built using visualization libraries of the chosen programming languages and
tools. Data scientists and researchers frequently use open source programming languages --
such as Python -- or proprietary tools designed for complex data analysis. The data visualization
performed by these data scientists and researchers helps them understand data sets and identify
patterns and trends that would have otherwise gone unnoticed.
The science of data visualization
The science of data visualization comes from an understanding of how humans gather and
process information. Daniel Kahneman and Amos Tversky collaborated on research that defined two
different methods for gathering and processing information.
System 1 focuses on thought processing that is fast, automatic and unconscious. This method is
frequently used in day-to-day life and helps accomplish:
Reading the text on a sign;
Solving simple math problems, like 1+1;
Identifying where a sound is coming from;
Riding a bike; and
Determining the difference between colors.
System 2 focuses on slow, logical, calculating and infrequent thought processing. This method is
used in one of the following situations:
Reciting a phone number;
Solving complex math problems, like 132 x 154;
Determining the difference in meaning between multiple signs standing side by side; and
Understanding complex social cues.
Data visualization tools and vendors
Data visualization tools can be used in a variety of ways. The most common use today is as a
business intelligence (BI) reporting tool. Users can set up visualization tools to generate
automatic dashboards that track company performance across key performance indicators (KPIs)
and visually interpret the results.
The generated images may also include interactive capabilities, enabling users to manipulate
them or look more closely into the data for questioning and analysis. Indicators designed to alert
users when data has been updated or when predefined conditions occur can also be integrated.
Many business departments implement data visualization software to track their own initiatives.
For example, a marketing team might implement the software to monitor the performance of an
email campaign, tracking metrics like open rate, click-through rate and conversion rate.
As data visualization vendors extend the functionality of these tools, they are increasingly being
used as front ends for more sophisticated big data environments. In this setting, data visualization
software helps data engineers and scientists keep track of data sources and do basic exploratory
analysis of data sets prior to or after more detailed advanced analyses.
The biggest names in the big data tools marketplace include Microsoft, IBM, SAP and SAS.
Some other vendors offer specialized big data visualization software; popular names in this
market include Tableau, Qlik and Tibco.
While Microsoft Excel continues to be a popular tool for data visualization, others have been
created that provide more sophisticated abilities:
IBM Cognos Analytics
Qlik Sense and QlikView
Microsoft Power BI
Oracle Visual Analyzer
SAP Lumira
SAS Visual Analytics
Tibco Spotfire
Zoho Analytics
D3.js
Jupyter
MicroStrategy
Google Charts
Visualization in the world of big data involves the creation of meaningful and insightful visual
representations of large and complex datasets. Here's an overview of visualization in the context
of big data:
Importance of Visualization in Big Data:
Complexity Management: Big data often contains vast amounts of information, making it
challenging to comprehend and analyze without visualization techniques.
Insight Discovery: Visualization helps uncover patterns, trends, and relationships within data that
may not be apparent from raw numbers or text.
Decision Support: Visual representations facilitate decision-making by providing stakeholders
with intuitive and actionable insights from complex datasets.
Communication: Visualizations enable effective communication of data-driven insights to a wide
range of stakeholders, including executives, analysts, and non-technical users.
Additional readings:
https://www.knowledgehut.com/blog/big-data/big-data-visualization