Chapter 2 - Introduction To Data Science
Data Science
Prepared by: Surafiel H.
Department of Computer Science,
Addis Ababa University
November, 2021
Overview of Data Science
• Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured, semi-structured, and unstructured data.
• Data science continues to evolve as one of the most promising and in-demand
career paths for skilled professionals.
What is Data?
• A representation of facts, concepts, or instructions in a formalized manner, which should be suitable for communication, interpretation, or processing by humans or electronic machines.
• Data can also be defined as groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
• Data is represented with the help of characters such as alphabets (A-Z, a-z) or special characters (+, -, /, *, <, >, =, etc.).
What is Information?
• Organized or classified data that has some meaningful value for the receiver.
Summary: Data vs. Information
• Data: described as unprocessed or raw facts and figures.
  Information: described as processed data.
• Data: the raw material that can be organized, structured, and interpreted to create useful information.
  Information: interpreted data; created from organized, structured, and processed data in a particular context.
• Data: groups of non-random symbols in the form of text, images, and voice representing quantities, actions, and objects.
  Information: processed data in the form of text, images, and voice representing quantities, actions, and objects.
Data vs. Information - Examples
• Data:
The number of cars sold by a dealership in the past month: 100
The number of customers who visited the dealership in the past month: 500
• Information:
The dealership's sales have increased by 10% in the past month.
The dealership's conversion rate is 20%.
• Data:
The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was 23 degrees Celsius.
• Information:
The temperature in Addis Ababa on October 21, 2021, at 6:00 PM was above average for that time of
year.
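• As a worked illustration (the previous-month sales figure below is a hypothetical value, not given above), the derived information can be computed directly from the data:

    cars_sold = 100              # data: cars sold in the past month
    visitors = 500               # data: customers who visited in the past month
    previous_month_sales = 91    # hypothetical prior figure, for illustration only

    conversion_rate = cars_sold / visitors * 100                                  # information
    sales_growth = (cars_sold - previous_month_sales) / previous_month_sales * 100

    print(f"Conversion rate: {conversion_rate:.1f}%")   # Conversion rate: 20.0%
    print(f"Sales growth: {sales_growth:.1f}%")         # Sales growth: 9.9%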
Data Processing Cycle
• It is a sequence of steps or operations for processing raw data into a usable form.
• It is the restructuring or reordering of data by people or machines to increase its usefulness and add value for a particular purpose.
• Simply put, it is the transformation of raw data into meaningful information.
• It is a cyclical process; it starts and ends with data, and the output of one step is the input for the next step.
• The value of data is often realized when it is processed and turned into actionable information.
• Data processing can be used for various purposes, such as business intelligence, research, or decision support.
• The data processing cycle typically consists of four main stages:
Input,
Processing,
Output, and
Storage
Input:
The input data is prepared in some convenient form for
processing.
Processing:
In this step, the input data is changed to produce data in a
more useful form (The data is transformed into meaningful
information).
The raw data is processed by a suitable or selected processing
method.
For example, a summary of sales for a month can be calculated from the
sales orders data.
Output:
At this stage, the result of the processing step is collected.
The processed data is presented in a human-readable format, such as reports, charts, graphs, and dashboards.
The particular form of the output data depends on the intended use of the data.
For example, the output can be the total sales for a month.
Storage:
Refers to how and where the output of the data processing is
stored for future use.
The processed data can be stored in databases or file systems,
and it can be kept on various storage devices such as hard
drives, solid-state drives, and cloud storage.
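• The four stages can be illustrated with a minimal Python sketch (the inline CSV data and the summary fields are invented for illustration, not taken from the slides):

    import csv, io, json

    # Input: raw sales-order data; a small inline CSV stands in for a real source.
    raw_csv = "order_id,amount\n1,2500\n2,1800\n3,3200\n"
    orders = list(csv.DictReader(io.StringIO(raw_csv)))

    # Processing: transform the raw records into a monthly sales summary.
    total_sales = sum(float(o["amount"]) for o in orders)
    summary = {"orders": len(orders), "total_sales": total_sales}

    # Output: present the processed data in a human-readable form.
    print(f"Orders: {summary['orders']}, total sales: {summary['total_sales']:.2f}")

    # Storage: persist the result for future use (here, a JSON file on disk).
    with open("monthly_summary.json", "w") as f:
        json.dump(summary, f)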
Data Types and their Representation
• In Computer Science and/or Computer Programming, a data type is an
attribute of data which tells the compiler or interpreter how the programmer
intends to use the data.
• Common data types in computer programming include:
Integers (int), used to store whole numbers and all the negatives (or opposites) of the natural numbers; these are mathematically known as integers.
Booleans (bool), used to represent values restricted to one of two values: true (1) or false (0).
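• A small Python sketch of these two data types (the variable names are illustrative):

    count = 42             # int: a whole number
    is_valid = True        # bool: restricted to one of two values

    print(type(count))     # <class 'int'>
    print(type(is_valid))  # <class 'bool'>
    print(int(is_valid))   # 1 -> True behaves as the integer 1, False as 0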
Structured Data
• Data that adheres to a predefined data model and is therefore straightforward to analyze.
• It depends on the creation of a data model, which defines what types of data to include and how to store and process them.
Data model: a visual representation of a database structure.
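• As a minimal sketch of the idea, every row in a relational table must conform to a predefined schema; the table and column names below are hypothetical:

    import sqlite3

    # Data model: every record must fit these typed columns.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, age INTEGER)")

    # Structured data: records that conform to the predefined model.
    conn.execute("INSERT INTO customers VALUES (1, 'Abebe', 34)")
    conn.execute("INSERT INTO customers VALUES (2, 'Sara', 28)")

    # Because the structure is known in advance, analysis is straightforward.
    print(conn.execute("SELECT AVG(age) FROM customers").fetchone())   # (31.0,)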
Semi-structured Data
• A form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables.
• However, it contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data.
• Examples of semi-structured data: JSON (JavaScript Object Notation) and XML (Extensible Markup
Language).
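• A minimal Python sketch that parses the same hypothetical record expressed once as JSON and once as XML (the record contents are invented):

    import json
    import xml.etree.ElementTree as ET

    json_text = '{"student": {"name": "Abebe", "dept": "CS", "year": 3}}'
    xml_text = "<student><name>Abebe</name><dept>CS</dept><year>3</year></student>"

    record = json.loads(json_text)      # keys/tags give the data its hierarchy
    print(record["student"]["name"])    # Abebe

    root = ET.fromstring(xml_text)
    print(root.find("dept").text)       # CS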
Unstructured Data
• Data that either does not have a predefined data model or is not organized in a predefined manner.
• It is typically text-heavy, but may also contain data such as dates, numbers, and facts.
• Common examples include text documents, e-mails, images, videos, and social media posts.
• For example:
The metadata of a photo could describe when and where the photo was taken.
• The metadata then provides fields for dates and locations which, by themselves, can be considered structured data.
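• A small sketch of that idea in Python (the field names and values are invented for illustration):

    # A photo is unstructured data (raw pixel bytes), but its metadata is structured.
    photo = {
        "pixels": b"...raw image bytes...",      # unstructured content
        "metadata": {                            # structured fields
            "taken_on": "2021-10-21",
            "location": "Addis Ababa",
        },
    }

    # The metadata fields can be queried like any structured record.
    meta = photo["metadata"]
    print(f"Taken on {meta['taken_on']} in {meta['location']}")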
What is Big Data?
• Generally speaking, Big Data is:
Large datasets,
The category of computing strategies and technologies that are used to handle large datasets.
• A data set is an ordered collection of data.
• Big data is a collection of data sets so large and complex that it becomes difficult to
process using on-hand database management tools or traditional data processing
applications.
• The common scale of big data sets is constantly shifting and may vary significantly
from organization to organization.
• Big Data is mainly characterized by the 4 V's: Volume, Velocity, Variety, and Veracity.
• Volume: refers to the amount of data.
Large amounts of data (zettabytes, massive datasets).
Other V's of Big Data
• Value: refers to the usefulness of the gathered data for the business.
• Variability: refers to the number of inconsistencies in the data and the inconsistent speed at which big data is loaded into the database.
• Venue: refers to distributed, heterogeneous data drawn from multiple platforms.
• Vulnerability: big data brings new security concerns.
• Volatility: due to the velocity and volume of big data, its volatility (how long the data remains valid and should be stored) needs to be carefully considered.
Data Value Chain
• Describes the information flow within a big data system as a series of steps needed to generate value and useful insights from data.
• The Big Data Value Chain identifies the following key high-level activities:
Data Acquisition,
Data Analysis,
Data Curation,
Data Storage, and
Data Usage
Data Acquisition
• It is the process of gathering, filtering, and cleaning data before it is put in a data
warehouse or any other storage solution on which data analysis can be carried out.
• Data acquisition is one of the major big data challenges in terms of infrastructure
requirements.
• The infrastructure required for data acquisition must:
Deliver low, predictable latency both in capturing data and in executing queries, and
Be able to handle very high transaction volumes, often in a distributed environment.
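• A purely illustrative Python sketch of the gather-filter-clean idea (the record source and field names are invented; real pipelines read from sensors, logs, or message queues):

    # Hypothetical raw event stream arriving from some external source.
    raw_events = [
        {"sensor": "s1", "temp": "23.4"},
        {"sensor": "s2", "temp": ""},        # incomplete record
        {"sensor": "s1", "temp": "24.1"},
    ]

    def acquire(events):
        """Gather, filter, and clean records before they reach storage."""
        for e in events:
            if not e.get("temp"):            # filter out incomplete records
                continue
            yield {"sensor": e["sensor"], "temp": float(e["temp"])}   # clean/convert

    cleaned = list(acquire(raw_events))
    print(cleaned)   # ready to load into a data warehouse or other storage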
Data Analysis
• A process of cleaning, transforming, and modeling data to discover useful information
for business decision-making.
• Involves exploring, transforming, and modelling data with the goal of highlighting
relevant data, synthesising and extracting useful hidden information with high potential
from a business point of view.
• Related areas include data mining, business intelligence, and machine learning.
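• A toy Python sketch of the clean-transform-model flow described above (the figures are invented for illustration):

    # Hypothetical monthly sales figures, with missing values to clean out.
    sales = [1200, None, 1350, 1500, None, 1650]

    cleaned = [s for s in sales if s is not None]                    # cleaning
    growth = [(b - a) / a for a, b in zip(cleaned, cleaned[1:])]     # transforming

    # A very simple "model": the average month-over-month growth rate,
    # highlighting a trend that could support a business decision.
    avg_growth = sum(growth) / len(growth)
    print(f"Average monthly growth: {avg_growth:.1%}")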
Data Curation
• Active management of data over its life cycle to ensure it meets the necessary data
quality requirements for its effective usage.
• Data curation processes can be categorized into different activities, such as content:
Creation,
Selection,
Classification,
Transformation,
Validation, and
Preservation
• Data curation is performed by expert curators who are responsible for improving the accessibility and quality of data.
• Data curators (also known as scientific curators or data annotators) hold the responsibility of ensuring that data are trustworthy, discoverable, accessible, reusable, and fit for their purpose.
• A key trend for the curation of big data utilizes community and crowdsourcing approaches.
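• For instance, the validation and transformation activities listed above might look like this minimal Python sketch (the quality rules and field names are invented):

    records = [
        {"id": 1, "email": "abebe@example.com", "age": "34"},
        {"id": 2, "email": "not-an-email", "age": "28"},
    ]

    def is_valid(rec):
        """Validation: keep only records that meet basic quality rules."""
        return "@" in rec["email"] and rec["age"].isdigit()

    def transform(rec):
        """Transformation: normalize types so the data is fit for reuse."""
        return {"id": rec["id"], "email": rec["email"].lower(), "age": int(rec["age"])}

    curated = [transform(r) for r in records if is_valid(r)]
    print(curated)   # only the trustworthy, normalized records remain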
Data Storage
• It is the persistence and management of data in a scalable way that satisfies the needs of applications requiring fast access to the data.
• Relational Database Management Systems (RDBMS) have been the main, and almost unique, solution to the storage paradigm for nearly 40 years.
• Relational databases guarantee database transactions, but lack flexibility with regard to schema changes, performance, and fault tolerance when data volumes and complexity grow.
• NoSQL technologies have been designed with the scalability goal in mind and present a wide range of solutions based on alternative data models.
Data Usage
• Covers the data-driven business activities that need access to data, its analysis, and the tools needed to integrate the data analysis within the business activity.
• In business decision-making, data usage can enhance competitiveness through reduction of costs, increased added value, or any other parameter that can be measured against existing performance criteria.
Cluster Computing
• Computing refers to the process of using computers to perform various tasks, such as processing, managing, and analyzing data.
• It involves the manipulation and transformation of data using software applications and algorithms.
• Cluster computing is a specific type of computing that involves the use of a cluster: a group of interconnected computers that work together as a single system.
• It allows the computational load to be distributed across multiple machines, enabling faster processing and increased computational power.
• In a cluster computing setup, each computer in the cluster, also known as a node, works
in parallel with other nodes to handle different parts of a larger problem or workload.
• The nodes are connected through a high-speed network and communicate with each
other to coordinate their tasks.
• Every node is connected to a single coordinating node, which is called the head node.
• Accessing a cluster system typically means accessing a head node or gateway node.
• A head node or gateway node is set up to be the launching point for jobs running on the
cluster and the main point of access to the cluster.
• A classic cluster essentially allows the nodes to
share infrastructure, such as disk space, and
collaborate by sharing program data while those
programs are running.
• Cluster computing offers solutions to complicated problems by:
Providing faster computational speed, and
Ensuring enhanced data integrity.
Data integrity refers to the overall accuracy, completeness, and consistency of data.
Big Data Cluster System
• In big data, individual computers are often inadequate for handling data at most stages.
• Therefore, the high storage and computational needs of big data are addressed through
computer clusters.
• A big data cluster system is a specialized type of cluster computing system designed to
manage and process large volumes of data.
• The primary goal of a big data cluster system is to enable scalable and distributed
processing of big data across multiple nodes within the cluster.
• Big data clustering software combines the resources of many smaller machines, seeking to provide a number of benefits (discussed below).
• Some examples of big data clustering software/tools include Hadoop's YARN (Yet Another Resource Negotiator), among others.
• Using a big data cluster provides solutions for:
Managing cluster membership,
Coordinating resource sharing, and
Scheduling actual work on individual nodes.
• Cluster membership & resource allocation can be handled by software like Hadoop’s
YARN (Yet Another Resource Negotiator).
• The assembled computing cluster often acts as a foundation which other software
interfaces with to process data.
• Additionally, the machines in the computing cluster typically manage a distributed
storage system.
Benefits of Big Data Clustering Software
1. Resource Pooling:
Encompasses storage, CPU, and memory pooling, all of which are crucial for processing large datasets.
2. High Availability:
Clusters offer varying levels of fault tolerance and availability guarantees, preventing hardware or software failures from affecting access to data and processing.
3. Easy Scalability:
Clusters facilitate easy horizontal scaling by adding more machines to the group.
This allows the system to adapt to changes in resource demands without needing to increase the physical resources of individual machines.
Hadoop
• Hadoop is an open-source framework designed for distributed storage and processing of large datasets across clusters of computers; it simplifies interaction with big data.
• Reliable:
Hadoop is reliable due to its ability to store data copies on different machines, making it resistant to hardware
failures.
• Scalable:
Hadoop is easily scalable both horizontally and vertically, allowing the framework to expand with the addition of
extra nodes.
• Flexible:
Hadoop is flexible, enabling storage of both structured and unstructured data for future use as needed.
Hadoop Ecosystem
• Hadoop has an ecosystem that has evolved from its four core components:
1. Data management: Involves handling the storage, organization, and retrieval of data.
2. Data access: Enables users to interact with and retrieve data stored in Hadoop.
3. Data processing: Enables the execution of computations and analytics on large datasets.
4. Data storage: Focuses on efficient and scalable ways to store and manage data.
• The Hadoop ecosystem is continuously expanding to meet the needs of big data.
• The ecosystem offers a range of tools and technologies to address different aspects of the
data lifecycle and cater to various use cases in big data analytics.
Components of Hadoop Ecosystem
• Hadoop Distributed File System (HDFS): A distributed file system that provides reliable
and scalable storage for big data across multiple machines.
• Spark: A fast and general-purpose cluster computing system with in-memory processing capabilities. It seamlessly integrates with Hadoop and offers higher performance for certain workloads (a minimal Spark word-count sketch follows the component list below).
• Pig: A high-level scripting platform in Hadoop that simplifies data processing tasks using
a language called Pig Latin, abstracting away the complexities of MapReduce.
• HBase: A distributed and scalable NoSQL database that runs on top of Hadoop. It offers
random, real-time access to big data and is suitable for low-latency read/write operations.
• MLlib: A machine learning library in Spark that offers a rich set of algorithms and tools for scalable
machine learning tasks, including data preprocessing, feature extraction, model training, and
evaluation.
• Solr: An open-source search platform based on Apache Lucene, providing powerful search capabilities
such as full-text search, faceted search, and real-time indexing.
• Lucene: A Java library for full-text search, providing indexing and searching functionalities and
serving as the core technology behind search-related applications like Solr and Elasticsearch.
• ZooKeeper: A centralized coordination service for distributed systems, offering reliable infrastructure
for maintaining configuration information, synchronizing processes, and managing distributed locks.
• Oozie: A workflow scheduling system for Hadoop, enabling users to define and manage workflows that
coordinate the execution of multiple Hadoop jobs, automating complex data processing pipelines.
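• As a minimal illustration of Spark's programming model (a sketch assuming the pyspark package is installed; the sample lines are invented), a local word count can be written as:

    from pyspark.sql import SparkSession

    # Start a local Spark session (no Hadoop cluster is required for this sketch).
    spark = SparkSession.builder.master("local[*]").appName("WordCount").getOrCreate()
    sc = spark.sparkContext

    lines = sc.parallelize(["hello big data", "hello spark"])    # tiny in-memory dataset
    counts = (lines.flatMap(lambda line: line.split())            # split into words
                   .map(lambda word: (word, 1))                   # pair each word with 1
                   .reduceByKey(lambda a, b: a + b))              # sum counts per word

    print(counts.collect())   # e.g. [('hello', 2), ('big', 1), ('data', 1), ('spark', 1)]
    spark.stop()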
Big Data Life Cycle with Hadoop
1. Ingesting data into the system:
Data is transferred into Hadoop from various sources, such as relational databases or log files; tools such as Sqoop and Flume are commonly used for this step.
2. Processing the data in storage:
Data is stored in the distributed file system, HDFS, and in the NoSQL
distributed database, HBase.
3. Computing and analyzing data:
Pig employs MapReduce techniques for data conversion and analysis, while Hive, which is also based on MapReduce programming, is well suited for structured data.
4. Visualizing the results:
The final stage, access, involves tools such as Hue and Cloudera Search.
Here, users can access the analyzed data and visualize the results.