Chapter 2 - Introduction To Data Science
Data Science
Prepared by: Surafiel H.
Department of Computer Science,
Addis Ababa University
November, 2021
Overview of Data Science
• Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science continues to evolve as one of the most promising and in-demand
career paths for skilled professionals.
What is Data?
• A representation of facts, concepts, or instructions in a formalized manner, which should be suitable for communication, interpretation, or processing by humans or electronic machines.
• Data can also be defined as groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
• Data is represented with the help of characters such as alphabets (A-Z, a-z) or special characters (+, -, /, *, <, >, =, etc.).
What is Information?
• Organized or classified data that has some meaningful value for the receiver.
Summary: Data vs. Information
• Data: described as unprocessed or raw facts and figures.
  Information: described as processed data.
• Data: the raw material that can be organized, structured, and interpreted to create useful information.
  Information: interpreted data; created from organized, structured, and processed data in a particular context.
• Data: groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
  Information: processed data in the form of text, images, and voice representing quantities, actions, and objects.
Data vs. Information - Examples
• Data:
The number of cars sold by a dealership in the past month: 100
The number of customers who visited the dealership in the past month: 500
• Information:
The dealership's sales have increased by 10% in the past month.
The dealership's conversion rate is 20%.
• Data:
The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was 23 degrees Celsius.
• Information:
The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was above average for that time of
year.
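• As a worked illustration (the previous-month sales figure below is a hypothetical value, not given above), the derived information can be computed directly from the data:

    cars_sold = 100              # data: cars sold in the past month
    visitors = 500               # data: customers who visited in the past month
    previous_month_sales = 91    # hypothetical prior figure, for illustration only

    conversion_rate = cars_sold / visitors * 100                                  # information
    sales_growth = (cars_sold - previous_month_sales) / previous_month_sales * 100

    print(f"Conversion rate: {conversion_rate:.1f}%")   # Conversion rate: 20.0%
    print(f"Sales growth: {sales_growth:.1f}%")         # Sales growth: 9.9%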
Data Processing Cycle
• It is a sequence of steps or operations for processing raw data into a usable form.
• It is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Simply put, it is the transformation of raw data into meaningful information.
• It is a cyclical process; it starts and ends with data, and the output of one step is the input for the next step.
• The value of data is often realized when it is processed and turned into actionable information.
• Data processing can be used for various purposes, such as business intelligence, research, or decision support.
• The data processing cycle typically consists of four main stages:
Input,
Processing,
Output, and
Storage
Input:
The input data is prepared in some convenient form for
processing.
Processing:
In this step, the input data is changed to produce data in a
more useful form (The data is transformed into meaningful
information).
The raw data is processed by a suitable or selected processing
method.
For example, a summary of sales for a month can be calculated from the
sales orders data.
Output:
At this stage, the result of the processing step is collected.
The processed data is presented in a human-readable format, such as reports, charts, graphs, and dashboards.
The particular form of the output data depends on the intended use of the data.
For example, the output can be the total sales for a month.
Storage:
Refers to how and where the output of the data processing is
stored for future use.
The processed data can be stored in databases or file systems,
and it can be kept on various storage devices such as hard
drives, solid-state drives, and cloud storage.
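• The four stages can be illustrated with a minimal Python sketch (the inline CSV data and the summary fields are invented for illustration, not taken from the slides):

    import csv, io, json

    # Input: raw sales-order data; a small inline CSV stands in for a real source.
    raw_csv = "order_id,amount\n1,2500\n2,1800\n3,3200\n"
    orders = list(csv.DictReader(io.StringIO(raw_csv)))

    # Processing: transform the raw records into a monthly sales summary.
    total_sales = sum(float(o["amount"]) for o in orders)
    summary = {"orders": len(orders), "total_sales": total_sales}

    # Output: present the processed data in a human-readable form.
    print(f"Orders: {summary['orders']}, total sales: {summary['total_sales']:.2f}")

    # Storage: persist the result for future use (here, a JSON file on disk).
    with open("monthly_summary.json", "w") as f:
        json.dump(summary, f)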
Data Types and their Representation
• In Computer Science and/or Computer Programming, a data type is an
attribute of data which tells the compiler or interpreter how the programmer
intends to use the data.
• Common data types in computer programming include:
Integers (int), used to store whole numbers and all the negatives (or opposites) of the natural numbers; these are mathematically known as integers.
Booleans (bool), used to represent values restricted to one of two values: true (1) or false (0).
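• A small Python sketch of these two data types (the variable names are illustrative):

    count = 42             # int: a whole number
    is_valid = True        # bool: restricted to one of two values

    print(type(count))     # <class 'int'>
    print(type(is_valid))  # <class 'bool'>
    print(int(is_valid))   # 1 -> True behaves as the integer 1, False as 0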
Structured Data
• Data that adheres to a predefined data model and is therefore straightforward to analyze.
• It depends on the creation of a data model, which defines what types of data to include and how to store and process them.
Data model: a visual representation of a database structure.
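• As a minimal sketch of the idea, every row in a relational table must conform to a predefined schema; the table and column names below are hypothetical:

    import sqlite3

    # Data model: every record must fit these typed columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER)")

    # Structured data: records that conform to the predefined model.
    conn.execute("INSERT INTO customers VALUES (1, 'Abebe', 34)")
    conn.execute("INSERT INTO customers VALUES (2, 'Sara', 28)")

    # Because the structure is known in advance, analysis is straightforward.
    print(conn.execute("SELECT AVG(age) FROM customers").fetchone())   # (31.0,)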
Semi-structured Data
• A form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
• However, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Examples of semi-structured data: JSON (JavaScript Object Notation) and XML (Extensible Markup
Language).
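• A minimal Python sketch that parses the same hypothetical record expressed once as JSON and once as XML (the record contents are invented):

    import json
    import xml.etree.ElementTree as ET

    json_text = '{"student": {"name": "Abebe", "dept": "CS", "year": 3}}'
    xml_text = "<student><name>Abebe</name><dept>CS</dept><year>3</year></student>"

    record = json.loads(json_text)      # keys/tags give the data its hierarchy
    print(record["student"]["name"])    # Abebe

    root = ET.fromstring(xml_text)
    print(root.find("dept").text)       # CS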
Unstructured Data
• Data that either does not have a predefined data model or is not organized in a predefined manner.
• It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
• Common examples include text documents, e-mails, images, videos, and social media posts.
• For example:
The metadata of a photo could describe when and where the photo was taken.
• The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
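• A small sketch of that idea in Python (the field names and values are invented for illustration):

    # A photo is unstructured data (raw pixel bytes), but its metadata is structured.
    photo = {
        "pixels": b"...raw image bytes...",      # unstructured content
        "metadata": {                            # structured fields
            "taken_on": "2021-10-21",
            "location": "Addis Ababa",
        },
    }

    # The metadata fields can be queried like any structured record.
    meta = photo["metadata"]
    print(f"Taken on {meta['taken_on']} in {meta['location']}")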
What is Big Data?
• Generally speaking, Big Data is:
Large datasets,
The category of computing strategies and technologies that are used to handle large datasets.
• A data set is an ordered collection of data.
• Big data is a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing
applications.
• The common scale of big data sets is constantly shifting and may vary significantly
from organization to organization.
• Big Data is mainly characterized by the 4 V's: Volume, Velocity, Variety, and Veracity.
• Volume: refers to the amount of data.
Large amounts of data (zettabytes, massive datasets).
Other V's of Big Data
• Value: refers to the usefulness of the gathered data for the business.
• Variability: refers to the number of inconsistencies in the data and the inconsistent speed at which big data is loaded into the database.
• Venue: refers to distributed, heterogeneous data drawn from multiple platforms.
• Vulnerability: big data brings new security concerns.
• Volatility: due to the velocity and volume of big data, its volatility (how long the data remains valid and should be stored) needs to be carefully considered.
Data Value Chain
• Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition,
Data Analysis,
Data Curation,
Data Storage, and
Data Usage
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried out.
• Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
• The infrastructure required for data acquisition must:
Deliver low, predictable latency both in capturing data and in executing queries, and
Be able to handle very high transaction volumes, often in a distributed environment.
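• A purely illustrative Python sketch of the gather-filter-clean idea (the record source and field names are invented; real pipelines read from sensors, logs, or message queues):

    # Hypothetical raw event stream arriving from some external source.
    raw_events = [
        {"sensor": "s1", "temp": "23.4"},
        {"sensor": "s2", "temp": ""},        # incomplete record
        {"sensor": "s1", "temp": "24.1"},
    ]

    def acquire(events):
        """Gather, filter, and clean records before they reach storage."""
        for e in events:
            if not e.get("temp"):            # filter out incomplete records
                continue
            yield {"sensor": e["sensor"], "temp": float(e["temp"])}   # clean/convert

    cleaned = list(acquire(raw_events))
    print(cleaned)   # ready to load into a data warehouse or other storage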
Data Analysis
• A process of cleaning, transforming, and modeling data to discover useful information
for business decision-making.
• Involves exploring, transforming, and modelling data with the goal of highlighting
relevant data, synthesising and extracting useful hidden information with high potential
from a business point of view.
• Related areas include data mining, business intelligence, and machine learning.
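• A toy Python sketch of the clean-transform-model flow described above (the figures are invented for illustration):

    # Hypothetical monthly sales figures, with missing values to clean out.
    sales = [1200, None, 1350, 1500, None, 1650]

    cleaned = [s for s in sales if s is not None]                    # cleaning
    growth = [(b - a) / a for a, b in zip(cleaned, cleaned[1:])]     # transforming

    # A very simple "model": the average month-over-month growth rate,
    # highlighting a trend that could support a business decision.
    avg_growth = sum(growth) / len(growth)
    print(f"Average monthly growth: {avg_growth:.1%}")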
Data Curation
• Active management of data over its life cycle to ensure it meets the necessary data
quality requirements for its effective usage.
• Data curation processes can be categorized into different activities, such as content:
Creation,
Selection,
Classification,
Transformation,
Validation, and
Preservation
• Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
• Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
• A key trend for the curation of big data utilizes community and crowdsourcing approaches.
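• For instance, the validation and transformation activities listed above might look like this minimal Python sketch (the quality rules and field names are invented):

    records = [
        {"id": 1, "email": "abebe@example.com", "age": "34"},
        {"id": 2, "email": "not-an-email", "age": "28"},
    ]

    def is_valid(rec):
        """Validation: keep only records that meet basic quality rules."""
        return "@" in rec["email"] and rec["age"].isdigit()

    def transform(rec):
        """Transformation: normalize types so the data is fit for reuse."""
        return {"id": rec["id"], "email": rec["email"].lower(), "age": int(rec["age"])}

    curated = [transform(r) for r in records if is_valid(r)]
    print(curated)   # only the trustworthy, normalized records remain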
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
• Relational Database Management Systems (RDBMS) have been the main, and almost unique, solution to the storage paradigm for nearly 40 years.
• Relational databases guarantee database transactions, but lack flexibility with regard to schema changes, performance, and fault tolerance when data volumes and complexity grow.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
Data Usage
• Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
• In business decision-making, data usage can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Cluster Computing
• Computing refers to the process of using computers to perform various tasks, such as processing, managing, and analyzing data.
• It involves the manipulation and transformation of data using software applications and algorithms.
• Cluster computing is a specific type of computing that involves the use of a cluster: a group of interconnected computers that work together as a single system.
• It allows the computational load to be distributed across multiple machines, enabling faster processing and increased computational power.
• In a cluster computing setup, each computer in the cluster, also known as a node, works
in parallel with other nodes to handle different parts of a larger problem or workload.
• The nodes are connected through a high-speed network and communicate with each
other to coordinate their tasks.
• Every node is connected to a single coordinating node, which is called the head node.
• Accessing a cluster system typically means accessing a head node or gateway node.
• A head node or gateway node is set up to be the launching point for jobs running on the
cluster and the main point of access to the cluster.
• A classic cluster essentially allows the nodes to
share infrastructure, such as disk space, and
collaborate by sharing program data while those
programs are running.
• Cluster computing offers solutions to complicated problems by:
Providing faster computational speed, and
Ensuring enhanced data integrity.
Data integrity refers to the overall accuracy, completeness, and consistency of data.
Big Data Cluster System
• In big data, individual computers are often inadequate for handling data at most stages.
• Therefore, the high storage and computational needs of big data are addressed through
computer clusters.
• A big data cluster system is a specialized type of cluster computing system designed to
manage and process large volumes of data.
• The primary goal of a big data cluster system is to enable scalable and distributed
processing of big data across multiple nodes within the cluster.
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits (discussed below).
• Some examples of big data clustering software/tools include Hadoop's YARN (Yet Another Resource Negotiator), among others.
• Using a big data cluster provides solutions for:
Managing cluster membership,
Coordinating resource sharing, and
Scheduling actual work on individual nodes.
• Cluster membership & resource allocation can be handled by software like Hadoop’s
YARN (Yet Another Resource Negotiator).
• The assembled computing cluster often acts as a foundation which other software
interfaces with to process data.
• Additionally, the machines in the computing cluster typically manage a distributed
storage system.
Benefits of Big Data Clustering Software
1. Resource Pooling:
Encompasses storage, CPU, and memory pooling, all of which are crucial for processing large datasets.
2. High Availability:
Clusters offer varying levels of fault tolerance and availability guarantees, preventing hardware or software failures from affecting access to data and processing.
3. Easy Scalability:
Clusters facilitate easy horizontal scaling by adding more machines to the group.
This allows the system to adapt to changes in resource demands without needing to increase the physical resources of individual machines.
Hadoop
• Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers; it simplifies interaction with big data.
• Reliable:
Hadoop is reliable due to its ability to store data copies on different machines, making it resistant to hardware
failures.
• Scalable:
Hadoop is easily scalable both horizontally and vertically, allowing the framework to expand with the addition of
extra nodes.
• Flexible:
Hadoop is flexible, enabling storage of both structured and unstructured data for future use as needed.
Hadoop Ecosystem
• Hadoop has an ecosystem that has evolved from its four core components:
1. Data management: Involves handling the storage, organization, and retrieval of data.
2. Data access: Enables users to interact with and retrieve data stored in Hadoop.
3. Data processing: Enables the execution of computations and analytics on large datasets.
4. Data storage: Focuses on efficient and scalable ways to store and manage data.
• The Hadoop ecosystem is continuously expanding to meet the needs of big data.
• The ecosystem offers a range of tools and technologies to address different aspects of the
data lifecycle and cater to various use cases in big data analytics.
Components of Hadoop Ecosystem
• Hadoop Distributed File System (HDFS): A distributed file system that provides reliable
and scalable storage for big data across multiple machines.
• Spark: A fast and general-purpose cluster computing system with in-memory processing capabilities. It seamlessly integrates with Hadoop and offers higher performance for certain workloads (a minimal Spark word-count sketch follows the component list below).
• Pig: A high-level scripting platform in Hadoop that simplifies data processing tasks using
a language called Pig Latin, abstracting away the complexities of MapReduce.
• HBase: A distributed and scalable NoSQL database that runs on top of Hadoop. It offers
random, real-time access to big data and is suitable for low-latency read/write operations.
• MLlib: A machine learning library in Spark that offers a rich set of algorithms and tools for scalable
machine learning tasks, including data preprocessing, feature extraction, model training, and
evaluation.
• Solr: An open-source search platform based on Apache Lucene, providing powerful search capabilities
such as full-text search, faceted search, and real-time indexing.
• Lucene: A Java library for full-text search, providing indexing and searching functionalities and
serving as the core technology behind search-related applications like Solr and Elasticsearch.
• ZooKeeper: A centralized coordination service for distributed systems, offering reliable infrastructure
for maintaining configuration information, synchronizing processes, and managing distributed locks.
• Oozie: A workflow scheduling system for Hadoop, enabling users to define and manage workflows that
coordinate the execution of multiple Hadoop jobs, automating complex data processing pipelines.
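• As a minimal illustration of Spark's programming model (a sketch assuming the pyspark package is installed; the sample lines are invented), a local word count can be written as:

    from pyspark.sql import SparkSession

    # Start a local Spark session (no Hadoop cluster is required for this sketch).
    spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["hello big data", "hello spark"])    # tiny in-memory dataset
    counts = (lines.flatMap(lambda line: line.split())            # split into words
                   .map(lambda word: (word, 1))                   # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))              # sum counts per word

    print(counts.collect())   # e.g. [('hello', 2), ('big', 1), ('data', 1), ('spark', 1)]
    spark.stop()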
Big Data Life Cycle with Hadoop
1. Ingesting data into the system:
Data is transferred into Hadoop from various sources, such as relational databases or log files; tools such as Sqoop and Flume are commonly used for this step.
2. Processing the data in storage:
Data is stored in the distributed file system, HDFS, and in the NoSQL
distributed database, HBase.
3. Computing and analyzing data:
Pig employs MapReduce techniques for data conversion and analysis, while Hive, which is also based on MapReduce programming, is well suited for structured data.
4. Visualizing the results:
The final stage, access, involves tools such as Hue and Cloudera Search.
Here, users can access the analyzed data and visualize the results.