
Chapter Two

Data Science
Prepared by: Surafiel H.
Department of Computer Science,
Addis Ababa University
November, 2021
Overview of Data Science
• Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.

• Data science is much more than simply analyzing data.

• It offers a range of roles and requires a range of skills.

• Data science continues to evolve as one of the most promising and in-demand
career paths for skilled professionals.
What is Data?
• A representation of facts, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.

• Data can be described as unprocessed facts and figures.


Or
• Data are streams of raw facts representing events occurring in an organization or in the physical environment, before they have been organized and arranged into a form that people can understand and use.
• Data can also be defined as groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.

• Data is represented with the help of characters such as alphabets (A-Z, a-z) or special characters (+, -, /, *, <, >, =, etc.).
What is Information?
• Organized or classified data, which has some meaningful values for the
receiver.

• Processed data on which decisions and actions are based.

• Plain collected data as raw facts cannot help much in decision-making.

• Interpreted data; created from organized, structured, and processed data in a particular context.
Summary: Data Vs. Information

Data: Described as unprocessed or raw facts and figures.
Information: Described as processed data.

Data: Cannot help in decision making.
Information: Can help in decision making.

Data: Raw material that can be organized, structured, and interpreted to create useful information.
Information: Interpreted data; created from organized, structured, and processed data in a particular context.

Data: Groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
Information: Processed data in the form of text, images, and voice representing quantities, actions, and objects.
Data VS. Information - Examples
• Data:
 The number of cars sold by a dealership in the past month: 100
 The number of customers who visited the dealership in the past month: 500
• Information:
 The dealership's sales have increased by 10% in the past month.
 The dealership's conversion rate is 20%.
• Data:
 The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was 23 degrees Celsius.
• Information:
 The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was above average for that time of
year.
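
To make the distinction concrete, here is a minimal Python sketch that turns the dealership's raw figures from the example above into information (last month's sales figure is an assumed baseline, added only for illustration):

# Raw data (facts and figures)
cars_sold_this_month = 100      # from the example above
visitors_this_month = 500       # from the example above
cars_sold_last_month = 91       # hypothetical baseline, assumed for illustration

# Processing the raw data into information
conversion_rate = cars_sold_this_month / visitors_this_month * 100
sales_growth = (cars_sold_this_month - cars_sold_last_month) / cars_sold_last_month * 100

print(f"Conversion rate: {conversion_rate:.0f}%")            # Conversion rate: 20%
print(f"Sales growth vs. last month: {sales_growth:.0f}%")   # about 10%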

Data Processing Cycle
• It’s a sequence of steps or operations for processing raw data into a usable form.
• It is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Simply put, it’s the process of converting the raw data into information—the
transformation of raw data into meaningful information.
• It is a cyclical process; it starts and ends with data, and the output of one step is the input
for the next step.
• The value of data is often realized when it’s processed and turned into actionable
information.
• Data processing can be used for various purposes, such as business intelligence, research
or decision support.
• The data processing cycle typically consists of four main stages:

 Input

 Processing,

 Output, and

 Storage

Input:
 The input data is prepared in some convenient form for
processing.

 The form will depend on the processing machine.


 For example, when electronic computers are used, the input data can be recorded on any one of several types of input media, such as flash disks, hard disks, and so on.
Processing:
 In this step, the input data is changed to produce data in a
more useful form (The data is transformed into meaningful
information).
 The raw data is processed by a suitable or selected processing
method.
 For example, a summary of sales for a month can be calculated from the
sales orders data.
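
As an illustration of the processing step, the following minimal Python sketch computes the monthly sales summary mentioned above from a small, made-up list of sales orders (the field names are hypothetical):

# Input: individual sales orders (hypothetical records)
sales_orders = [
    {"order_id": 1, "month": "2021-10", "amount": 1200.00},
    {"order_id": 2, "month": "2021-10", "amount": 850.50},
    {"order_id": 3, "month": "2021-11", "amount": 400.00},
]

# Processing: transform the raw orders into a per-month sales summary
monthly_totals = {}
for order in sales_orders:
    month = order["month"]
    monthly_totals[month] = monthly_totals.get(month, 0) + order["amount"]

print(monthly_totals)   # {'2021-10': 2050.5, '2021-11': 400.0}
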
Output:
 At this stage, the result of the processing step is collected.
 Processed data is presented in a human-readable format, such as reports, charts, graphs, and dashboards.
 The particular form of the output data depends on the use of
the data.
 For example, the output can be the total sales for a month.

Storage:
 Refers to how and where the output of the data processing is
stored for future use.
 The processed data can be stored in databases or file systems,
and it can be kept on various storage devices such as hard
drives, solid-state drives, and cloud storage.

Data Types and its Representation
• In Computer Science and/or Computer Programming, a data type is an
attribute of data which tells the compiler or interpreter how the programmer
intends to use the data.

• A data type defines:

 The operations that can be done on the data,

 The meaning of the data, and

 The way values of that type can be stored.

• Common data types in Computer Programming includes:

 Integers (int) are used to store whole numbers and all the negatives (or opposites) of
the natural numbers, mathematically known as integers.

 Booleans (bool) are used to represent values restricted to one of two values: true (1)
or false (0).

 Characters (char) are used to store a single character.

 Floating-point numbers (float) are used to store real numbers.

 Alphanumeric strings (strings) are used to store a combination of characters and numbers.
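
The following short Python sketch illustrates these common data types; note that Python has no separate character type, so a one-character string stands in for char:

age = 25                   # int: a whole number
is_enrolled = True         # bool: true or false
grade = "A"                # char: a single character (a one-character string in Python)
gpa = 3.75                 # float: a real number
student_id = "CS2021-42"   # string: a combination of characters and numbers

for value in (age, is_enrolled, grade, gpa, student_id):
    print(type(value).__name__, value)
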
Data Types from Data Analytics Perspective
• From a data analytics point of view, there are three common data types or structures:
1. Structured,
2. Semi Structured, and
3. Unstructured data types.

Structured Data
• Data that adhere to a predefined data model and is therefore
straightforward to analyze.

• Data that resides in a fixed field within a file or record.

• Conforms to a tabular format with relationships between different rows and columns.
• It depends on the creation of a data model, which defines what types of data to include and how to store and process it.
 Data model: a visual representation of a database structure.

 Database: an organized collection of structured data, typically stored in a computer system.

• Common examples of structured data are Excel files or SQL databases.
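
As a minimal sketch of structured data, the snippet below uses Python's built-in sqlite3 module to define a fixed tabular schema and query it (the table and column names are invented for illustration):

import sqlite3

conn = sqlite3.connect(":memory:")   # an in-memory SQL database
conn.execute("CREATE TABLE students (id INTEGER, name TEXT, gpa REAL)")
conn.execute("INSERT INTO students VALUES (1, 'Abebe', 3.8)")
conn.execute("INSERT INTO students VALUES (2, 'Sara', 3.5)")

# Every row conforms to the predefined model: fixed fields with fixed types
for row in conn.execute("SELECT name, gpa FROM students WHERE gpa > 3.6"):
    print(row)                       # ('Abebe', 3.8)
conn.close()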

Semi-structured Data
• A form of structured data that does not conform to the formal structure of the data models associated with relational databases or other forms of data tables.

• In other words, it does not fully conform to a formal data model.

• However, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.

• Therefore, it is also known as a self-describing structure.

• Examples of semi-structured data: JSON (JavaScript Object Notation) and XML (Extensible Markup
Language).
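
Since the original sample figures are not reproduced here, the following Python sketch shows small, made-up JSON and XML fragments and parses them with the standard json and xml.etree.ElementTree modules; the keys and tags act as the self-describing markers discussed above:

import json
import xml.etree.ElementTree as ET

# A small JSON document: keys describe the data they hold
json_text = '{"name": "Abebe", "age": 25, "courses": ["Math", "CS"]}'
person = json.loads(json_text)
print(person["name"], person["courses"])                  # Abebe ['Math', 'CS']

# Similar data as XML: tags separate elements and enforce a hierarchy
xml_text = "<person><name>Abebe</name><age>25</age></person>"
root = ET.fromstring(xml_text)
print(root.find("name").text, root.find("age").text)      # Abebe 25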

Unstructured Data
• Data that either doesn’t have a predefined data model or isn’t organized in a
predefined manner.

• There is no data model; the data is stored in its native format.

• It is typically text-heavy, but may contain data such as dates, numbers, and
facts as well.

• Common examples:

 Audio, video files, NoSQL, pictures, pdfs ...


Metadata – Data about Data
• It provides additional information about a specific set of data.

• For example:

 Metadata of a photo could describe when and where the photo was taken.

• The metadata then provides fields for dates and locations which, by
themselves, can be considered structured data.
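
A minimal sketch of photo metadata represented as structured key-value fields (all values are invented for illustration):

# Metadata describing a photo: data about the data, not the image itself
photo_metadata = {
    "file_name": "sunset.jpg",
    "date_taken": "2021-10-21 18:00",
    "location": "Addis Ababa, Ethiopia",
    "camera_model": "Canon EOS 2000D",
    "resolution": "6000x4000",
}

# The metadata fields themselves are structured and can be queried directly
print(photo_metadata["date_taken"], photo_metadata["location"])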

What is Big Data?
• Generally speaking, Big Data is:
 Large datasets,
 The category of computing strategies and technologies that are used to handle large datasets.
• A data set is an ordered collection of data.
• Big data is a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing
applications.
• The common scale of big data sets is constantly shifting and may vary significantly
from organization to organization.
• Big Data is mainly characterized by 4V’s.
• Volume: refers to the amount of data.
 Large amounts of data - zettabytes/massive datasets.

• Velocity: refers to the speed of data processing.
 Data is live streaming or in motion.

• Variety: refers to the number of types of data.
 Data comes in many different forms from diverse sources.

• Veracity: which in this context is equivalent to quality.
 Can we trust the data?
 Are the data “clean” and accurate?
 Do they really have something to offer?
Other V's of Big Data
• Value: refers to the usefulness of the gathered data for the business.

• Variability: refers to the number of inconsistencies in the data and the inconsistent speed at which big data is loaded into the database.

• Validity: refers to data quality, governance, and master data management on massive data sets.

• Venue: refers to distributed, heterogeneous data drawn from multiple platforms.

• Vocabulary: refers to the data models and semantics that describe the data's structure.
• Vulnerability: big data brings new security concerns.

 After all, a data breach with big data is a big breach.

• Volatility: due to the velocity and volume of big data, how long data remains relevant and needs to be kept must be carefully considered.

 How long does data need to be kept for?

• Visualization: different ways of representing data, such as data clustering or using tree maps, sunbursts, parallel coordinates, circular network diagrams, and cone trees.

• Vagueness: confusion over the meaning of big data and the tools used.
Data Value Chain
• Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.

• The Big Data Value Chain identifies the following key high-level activities:

 Data Acquisition,
 Data Analysis,
 Data Curation,
 Data Storage, and
 Data Usage
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put in a data

warehouse or any other storage solution on which data analysis can be carried out.

 A data warehouse is a system that aggregates, stores, and processes information from diverse data sources.

• Data acquisition is one of the major big data challenges in terms of infrastructure

requirements.

• The infrastructure required for data acquisition must:

 Deliver low, predictable latency in both capturing data and in executing queries.

 Be able to handle very high transaction volumes, often in a distributed

environment.

 Support flexible and dynamic data structures.

Data Analysis
• A process of cleaning, transforming, and modeling data to discover useful information
for business decision-making.

• Involves exploring, transforming, and modeling data with the goal of highlighting relevant data, and synthesizing and extracting useful hidden information with high potential from a business point of view.

• Related areas include data mining, business intelligence, and machine learning.
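
As a small illustration of cleaning, transforming, and summarizing data, the sketch below uses Python with the pandas library (assuming pandas is installed; the column names and values are made up):

import pandas as pd

# Hypothetical raw sales records, including a missing value to be cleaned
raw = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "sales":  [1200.0, None, 850.0, 920.0],
})

clean = raw.dropna(subset=["sales"])                  # cleaning: drop incomplete records
clean = clean.assign(sales_k=clean["sales"] / 1000)   # transforming: derive a new column
summary = clean.groupby("region")["sales_k"].mean()   # summarizing by region

print(summary)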

Data Curation
• Active management of data over its life cycle to ensure it meets the necessary data
quality requirements for its effective usage.

• Data curation processes can be categorized into different activities such as content:
 Creation,
 Selection,
 Classification,
 Transformation,
 Validation, and
 Preservation
• Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
• Data curators (also known as scientific curators, or data annotators) hold the
responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable,
and fit their purpose.

• A key trend in the curation of big data is the use of community and crowdsourcing approaches.

Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of

applications that require fast access to the data.

• Relational Database Management Systems (RDBMS) have been the main, and almost

unique, solution to the storage paradigm for nearly 40 years.

• Relational databases guarantee database transactions, but they lack flexibility with regard to schema changes, and their performance and fault tolerance suffer when data volumes and complexity grow, making them unsuitable for big data scenarios.

• NoSQL technologies have been designed with the scalability goal in mind and present a

wide range of solutions based on alternative data models.
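
As a brief sketch of the schema flexibility offered by NoSQL stores, the snippet below uses Python's pymongo driver to insert documents with different fields into the same collection (this assumes a MongoDB server running locally; the database and collection names are made up):

from pymongo import MongoClient

# Assumes a MongoDB server is running locally; names below are made up
client = MongoClient("mongodb://localhost:27017/")
events = client["demo_db"]["events"]

# Unlike rows in a relational table, documents in the same collection
# may have completely different fields (no fixed schema required)
events.insert_one({"type": "click", "page": "/home", "ts": "2021-10-21T18:00:00"})
events.insert_one({"type": "purchase", "amount": 49.99, "items": ["book", "pen"]})

print(events.count_documents({"type": "click"}))   # 1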

Data Usage
• Covers the data-driven business activities that need access to data, its analysis, and the

tools needed to integrate the data analysis within the business activity.

• In business decision-making, it can enhance competitiveness through reduction of costs,

increased added value, or any other parameter that can be measured against existing

performance criteria.

Cluster Computing
• Computing refers to the process of using computers to perform various tasks, including

calculations, data processing, and problem-solving.

• It involves the manipulation and transformation of data using software applications and

algorithms.

• Computing can be done on a single computer or distributed across multiple computers

connected through a network.

• Cluster computing is a specific type of computing that involves the use of a cluster.

• In general, cluster means “small group”.

• Cluster computing is a group of interconnected computers or servers working together to perform a task or solve a problem.

• It refers to multiple computers connected to a network that function as a single entity.

• It allows for the distribution of computational load across multiple machines, enabling
faster processing and increased computational power.
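
A cluster spreads work across machines; as a simplified single-machine analogy, the following Python sketch uses the multiprocessing module to split a computation across several worker processes and then combine the partial results:

from multiprocessing import Pool

def partial_sum(chunk):
    # Each worker ("node") handles one part of the larger problem
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i::4] for i in range(4)]   # split the workload into 4 parts
    with Pool(processes=4) as pool:           # 4 worker processes stand in for 4 nodes
        partials = pool.map(partial_sum, chunks)
    print(sum(partials))                      # combine the partial results
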
• In a cluster computing setup, each computer in the cluster, also known as a node, works
in parallel with other nodes to handle different parts of a larger problem or workload.

• The nodes are connected through a high-speed network and communicate with each
other to coordinate their tasks.

• Each node performs a dedicated task.

• The nodes are connected to a single node called the head node.

• Accessing a cluster system typically means accessing a head node or gateway node.

• A head node or gateway node is set up to be the launching point for jobs running on the
cluster and the main point of access to the cluster.
• A classic cluster essentially allows the nodes to
share infrastructure, such as disk space, and
collaborate by sharing program data while those
programs are running.
• Cluster computing offers solutions to complicated problems by:
 Providing faster computational speed, and
 Providing enhanced data integrity.
 Data integrity refers to the overall
accuracy, completeness, and
consistency of data.

Big Data Cluster System
• In big data, individual computers are often inadequate for handling data at most stages.

• Therefore, the high storage and computational needs of big data are addressed through
computer clusters.

• A big data cluster system is a specialized type of cluster computing system designed to
manage and process large volumes of data.

• The primary goal of a big data cluster system is to enable scalable and distributed
processing of big data across multiple nodes within the cluster.

• Big Data clustering software that combines the resources of many smaller machines

offers several benefits.

• Some examples of big data clustering software/tools include Hadoop's YARN (Yet

Another Resource Negotiator), Qubole, HPCC, Cassandra, MongoDB, Apache Storm,

CouchDB, and Statwing.

• Using a big data cluster provides solutions for:
 Managing cluster membership,
 Coordinating resource sharing, and
 Scheduling actual work on individual nodes.
• Cluster membership & resource allocation can be handled by software like Hadoop’s
YARN (Yet Another Resource Negotiator).
• The assembled computing cluster often acts as a foundation which other software
interfaces with to process data.
• Additionally, the machines in the computing cluster typically manage a distributed
storage system.

Benefits of Big Data Clustering Software
1. Resource Pooling:

 Involves combining available storage space to hold data.

 Encompasses CPU and memory pooling, which are crucial for processing large datasets that

require substantial amounts of these two resources.

2. High Availability:
 Clusters offer varying levels of fault tolerance and availability.

 They guarantee that hardware or software failures do not affect access to data and processing.

 This becomes increasingly important as real-time analytics continue to be emphasized.


3. Easy Scalability:

 Clusters facilitate easy horizontal scaling by adding more machines to the group.

 This allows the system to adapt to changes in resource demands without needing
to increase the physical resources of individual machines.

Hadoop
• Hadoop is an open-source framework designed to simplify interaction with big data through distributed storage and processing of large datasets across clusters of computers.

• It is inspired by a technical document published by Google.


 Open-source software allows anyone to inspect, modify, and enhance its source code.
 This development and distribution model provide the public with access to the underlying
(source) code of a program or application.
 Source code refers to the part of software that typical computer users don’t see; it's the code
that programmers can modify to alter how a piece of software, such as a program or
application, functions.
 A software framework is an abstraction that provides generic functionality, allowing users to extend it with additional code to create application-specific software.
Characteristics of Hadoop
• Economical:
 Hadoop systems are highly economical because they can utilize ordinary computers for data processing.

• Reliable:
 Hadoop is reliable due to its ability to store data copies on different machines, making it resistant to hardware
failures.

• Scalable:
 Hadoop is easily scalable both horizontally and vertically, allowing the framework to expand with the addition of
extra nodes.

• Flexible:
 Hadoop is flexible, enabling storage of both structured and unstructured data for future use as needed.
Hadoop Ecosystem
• Hadoop has an ecosystem that has evolved from its four core components:
1. Data management: Involves handling the storage, organization, and retrieval of data.

2. Data access: Enables users to interact with and retrieve data stored in Hadoop.

3. Data processing: Enables the execution of computations and analytics on large datasets.

4. Data storage: Focuses on efficient and scalable ways to store and manage data.

• The Hadoop ecosystem is continuously expanding to meet the needs of big data.

• These components work together to provide a comprehensive ecosystem for managing, accessing, processing, and storing big data.

• The ecosystem offers a range of tools and technologies to address different aspects of the
data lifecycle and cater to various use cases in big data analytics.
Components of Hadoop Ecosystem
• Hadoop Distributed File System (HDFS): A distributed file system that provides reliable
and scalable storage for big data across multiple machines.

• Yet Another Resource Negotiator (YARN): The resource management framework in Hadoop that manages resources and schedules tasks across the cluster, enabling multiple processing engines to run on Hadoop.

• MapReduce: A programming model and processing engine in Hadoop for parallel processing of large datasets by dividing tasks into map and reduce phases (see the sketch after this list).

• Spark: A fast and general-purpose cluster computing system with in-memory processing
capabilities. It seamlessly integrates with Hadoop and offers higher performance for
certain workloads.
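
To illustrate the map and reduce phases mentioned in the MapReduce bullet above, here is a minimal single-machine Python sketch of a word count; real MapReduce jobs distribute these phases across the nodes of a cluster, so this only mirrors the programming model:

from collections import defaultdict

documents = ["big data needs big clusters", "hadoop processes big data"]

# Map phase: emit (key, value) pairs, here (word, 1)
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle: group the values by key
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # {'big': 3, 'data': 2, ...}
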
• Pig: A high-level scripting platform in Hadoop that simplifies data processing tasks using
a language called Pig Latin, abstracting away the complexities of MapReduce.

• Hive: A data warehouse infrastructure built on Hadoop, providing a high-level query language (HiveQL) for querying and analyzing data stored in Hadoop.

• HBase: A distributed and scalable NoSQL database that runs on top of Hadoop. It offers
random, real-time access to big data and is suitable for low-latency read/write operations.

• Mahout: A library of machine learning algorithms for Hadoop, providing scalable implementations for tasks like clustering, classification, and recommendation.

• MLlib: A machine learning library in Spark that offers a rich set of algorithms and tools for scalable
machine learning tasks, including data preprocessing, feature extraction, model training, and
evaluation.

• Solr: An open-source search platform based on Apache Lucene, providing powerful search capabilities
such as full-text search, faceted search, and real-time indexing.

• Lucene: A Java library for full-text search, providing indexing and searching functionalities and
serving as the core technology behind search-related applications like Solr and Elasticsearch.

• ZooKeeper: A centralized coordination service for distributed systems, offering reliable infrastructure
for maintaining configuration information, synchronizing processes, and managing distributed locks.

• Oozie: A workflow scheduling system for Hadoop, enabling users to define and manage workflows that
coordinate the execution of multiple Hadoop jobs, automating complex data processing pipelines.
Big Data Life Cycle with Hadoop
1. Ingesting data into the system:

 The first stage of big data processing with Hadoop is ingestion.

 Data is transferred to Hadoop from various sources such as relational databases, other systems, or local files.

 Sqoop facilitates data transfer from RDBMS to HDFS, while Flume handles event data transfer.

2. Processing the data in storage:

 The second stage involves processing and storing.

 Data is stored in the distributed file system, HDFS, and in the NoSQL
distributed database, HBase.

 Data processing is carried out using Spark and MapReduce.
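
As a brief sketch of this processing stage, the same word-count idea expressed in Spark's Python API might look as follows (this assumes pyspark is installed; the input path is hypothetical):

from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCount")         # local mode; on a cluster this would run under YARN

counts = (sc.textFile("hdfs:///data/input.txt")    # hypothetical HDFS path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

print(counts.take(5))                              # a few (word, count) pairs
sc.stop()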

3. Computing and analyzing data:

 The third stage is analysis.

 Data is analyzed using processing frameworks like Pig, Hive, and Impala.

 Pig employs map and reduce techniques for data conversion and analysis, while Hive, based on map and reduce programming, is well-suited for structured data.

4. Visualizing the results:

 The final stage, access, involves tools such as Hue and Cloudera Search.

 Here, users can access the analyzed data and visualize the results.
