UNIT I
INTRODUCTION TO DATA SCIENCE
Syllabus
Introduction to Data Science – Basic Data Analytics using R – R Graphical User Interfaces – Data Import and Export – Attribute and Data Types – Descriptive Statistics – Exploratory Data Analysis – Visualization Before Analysis – Dirty Data – Visualizing a Single Variable – Examining Multiple Variables – Data Exploration Versus Presentation.
Several industries have led the way in developing their ability to gather and exploit data:
● Credit card companies monitor every purchase their customers make and can identify
fraudulent purchases with a high degree of accuracy using rules derived by processing billions of
transactions.
● Mobile phone companies analyze subscribers’ calling patterns to determine, for example,
whether a caller’s frequent contacts are on a rival network. If that rival network is offering an
attractive promotion that might cause the subscriber to defect, the mobile phone company can
proactively offer the subscriber an incentive to remain in her contract.
● For companies such as LinkedIn and Facebook, data itself is their primary product. The
valuations of these companies are heavily derived from the data they gather and host, which
contains more and more intrinsic value as the data grows.
Definition
A widely cited definition of Big Data comes from the 2011 McKinsey Global Institute report:
Big Data is data whose scale, distribution, diversity, and/or timeliness require the use of new
technical architectures and analytics to enable insights that unlock new sources of business
value.
McKinsey’s definition of Big Data implies that organizations will need new data architectures and analytic sandboxes, new tools, new analytical methods, and an integration of multiple skills into the new role of the data scientist. Figure 1-1 highlights several sources of the Big Data deluge. Big Data can arrive in several forms; two of these categories are described below:
● Semi-structured data: Textual data files with a discernible pattern that enables parsing (such as Extensible Markup Language [XML] data files that are self-describing and defined by an XML schema).
● Quasi-structured data: Textual data with erratic data formats that can be formatted with effort, tools, and time (for instance, web clickstream data that may contain inconsistencies in data values and formats).
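The distinction can be made concrete in R, the language used throughout this unit. The snippet below is a minimal sketch, assuming the xml2 package is available; the XML string and the clickstream line are made-up illustrations.

library(xml2)

# Semi-structured: XML is self-describing, so a parser can rely on its tags.
order_xml <- read_xml("<order><item>book</item><qty>2</qty></order>")
xml_text(xml_find_first(order_xml, "//item"))    # "book"
xml_integer(xml_find_first(order_xml, "//qty"))  # 2

# Quasi-structured: a clickstream line has a pattern, but the format is
# erratic and needs effort (string handling, cleanup) to extract fields.
click <- "2023-05-01 10:15:22|user42 ; /products/view?id=381"
trimws(strsplit(click, "[|;]")[[1]])             # timestamp, user id, URL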
The figure shows a typical data architecture and several of the challenges it presents to data scientists and others trying to do advanced analytics. This section examines the flow of data to the data scientist and how this individual fits into the process of getting data to analyze on projects.
1. For data sources to be loaded into the enterprise data warehouse (EDW), data needs to be well understood,
structured, and normalized with the appropriate data type definitions. Although this kind
of centralization enables security, backup, and failover of highly critical data, it also
means that data typically must go through significant preprocessing and checkpoints
before it can enter this sort of controlled environment, which does not lend itself to data
exploration and iterative analytics.
2. As a result of this level of control on the EDW, additional local systems may emerge in the
form of departmental warehouses and local data marts that business users create to
accommodate their need for flexible analysis. These local data marts may not have the
same constraints for security and structure as the main EDW and allow users to do some
level of more in-depth analysis. However, these one-off systems reside in isolation, often
are not synchronized or integrated with other data stores, and may not be backed up.
3. Once in the data warehouse, data is read by additional applications across the enterprise
for BI and reporting purposes. These are high-priority operational processes getting
critical data feeds from the data warehouses and repositories.
4. At the end of this workflow, analysts get data provisioned for their downstream analytics.
Because users generally are not allowed to run custom or intensive analytics on
production databases, analysts create data extracts from the EDW to analyze data offline
in R or other local analytical tools. Many times these tools are limited to in-memory
analytics on desktops analyzing samples of data, rather than the entire population of a
dataset. Because these analyses are based on data extracts, they reside in a separate
location, and the results of the analysis—and any insights on the quality of the data or
anomalies—rarely are fed back into the main data repository.
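The offline-extract workflow in step 4 can be sketched in R. This is only an illustration; the file name edw_extract.csv and its columns are assumptions, not part of any particular EDW.

# Read a data extract exported from the EDW into local memory.
extract <- read.csv("edw_extract.csv", stringsAsFactors = FALSE)

# Desktop tools often cannot hold the full population, so analysts work
# on a reproducible sample of the rows instead.
set.seed(42)
rows <- sample(nrow(extract), size = min(10000, nrow(extract)))
extract_sample <- extract[rows, ]

summary(extract_sample)                          # quick offline profiling
write.csv(extract_sample, "extract_sample.csv", row.names = FALSE)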
Data acquisition
Representing, transforming, grouping, and linking the data are all tasks that need to occur before
the data can be profitably analyzed, and these are all tasks in which the data scientist is actively
involved.
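As a small sketch of the grouping and linking tasks mentioned above, consider two made-up data frames (the column names are illustrative assumptions):

customers <- data.frame(cust_id = 1:3,
                        region  = c("North", "South", "North"))
orders    <- data.frame(cust_id = c(1, 1, 2, 3, 3, 3),
                        amount  = c(20, 35, 15, 50, 10, 5))

# Linking: join each order to its customer on the shared key.
linked <- merge(orders, customers, by = "cust_id")

# Grouping: total order amount per region.
aggregate(amount ~ region, data = linked, FUN = sum)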
Data analysis
The analysis phase is where data scientists are most heavily involved. In this context we are
using analysis to include summarization of the data, using portions of data (samples) to make
inferences about the larger context, and visualization of the data by presenting it in tables,
graphs, and even animations.
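These three activities can be illustrated in R on the built-in mtcars dataset; this is a minimal sketch, not a full analysis.

data(mtcars)

# Summarization: descriptive statistics for every variable.
summary(mtcars)

# Inference: use a sample of cars to estimate the mean fuel efficiency of
# the larger population, with a confidence interval.
set.seed(1)
mpg_sample <- sample(mtcars$mpg, size = 15)
t.test(mpg_sample)$conf.int

# Visualization: tables and graphs.
table(mtcars$cyl)                                # frequency table of cylinders
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")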
Data archiving
Finally, the data scientist must become involved in the archiving of the data. Preservation of collected data in a form that makes it highly reusable - what you might think of as "data curation" - is a difficult challenge because it is so hard to anticipate all of the future uses of the data.
2. Retrieving data
Retrieving data means checking the existence, quality, and access to the data. Data can also be delivered by third-party companies and can take many forms, ranging from Excel spreadsheets to different types of databases.
3. Data cleansing
Data collection is an error-prone process; in this phase you enhance the quality of the data and
prepare it for use in subsequent steps. This phase consists of three subphases: data cleansing
removes false values from a data source and inconsistencies across data sources, data
integration enriches data sources by combining information from multiple data sources, and
data transformation ensures that the data is in a suitable format for use in your models.
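The three subphases can be sketched in R on a small made-up sales table; the column names and correction rules below are illustrative assumptions.

sales <- data.frame(cust_id = c(1, 2, 2, 3),
                    country = c("USA", "usa", "usa", "U.S.A"),
                    amount  = c(120, -5, 80, 300))

# Data cleansing: harmonize inconsistent spellings and drop impossible values.
sales$country <- toupper(gsub("[^A-Za-z]", "", sales$country))
sales <- sales[sales$amount >= 0, ]

# Data integration: enrich the sales with attributes from a second source.
customers <- data.frame(cust_id = 1:3,
                        segment = c("retail", "retail", "corporate"))
sales <- merge(sales, customers, by = "cust_id")

# Data transformation: put the amount on a log scale for later modeling.
sales$log_amount <- log(sales$amount + 1)
sales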
4. Data exploration
Data exploration is concerned with building a deeper understanding of your data. You try to
understand how variables interact with each other, the distribution of the data, and whether
there are outliers. To achieve this you mainly use descriptive statistics, visual techniques, and
simple modeling. This step often goes under the abbreviation EDA for Exploratory Data Analysis.
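A minimal EDA sketch in R, again using the built-in mtcars dataset:

data(mtcars)

summary(mtcars$mpg)                        # distribution of a single variable
boxplot(mpg ~ cyl, data = mtcars,          # interaction with a grouping variable
        xlab = "cylinders", ylab = "mpg")  # and a quick visual outlier check
cor(mtcars[, c("mpg", "wt", "hp")])        # how key variables move together
hist(mtcars$wt, main = "Car weight", xlab = "weight (1000 lbs)")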
5. Data modeling or model building
In this phase you use models, domain knowledge, and insights about the data you found in the
previous steps to answer the research question. You select a technique from the fields of
statistics, machine learning, operations research, and so on. Building a model is an iterative step
between selecting the variables for the model, executing the model, and model diagnostics.
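The iterative loop of selecting variables, executing the model, and checking diagnostics can be sketched in R with a simple linear model on mtcars; the choice of variables here is purely illustrative.

data(mtcars)

# Variable selection and execution: predict fuel efficiency from weight
# and horsepower.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Diagnostics: coefficient estimates, fit quality, residual plots.
summary(fit)
par(mfrow = c(2, 2))
plot(fit)

# If the diagnostics look poor, go back, change the variables, and refit.
fit2 <- lm(mpg ~ wt + hp + factor(cyl), data = mtcars)
anova(fit, fit2)    # does the extra variable improve the model?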
6. Presentation and automation
Finally, you present the results to your business. These results can take many forms, ranging
from presentations to research reports. Sometimes you’ll need to automate the execution of the
process because the business will want to use the insights you gained in another project or
enable an operational process to use the outcome from your model.
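One simple way to automate reuse of a model in R is to persist the fitted object and wrap scoring in a function; the file name and the scoring function below are hypothetical.

fit <- lm(mpg ~ wt + hp, data = mtcars)
saveRDS(fit, "mpg_model.rds")              # persist the fitted model

# An operational process (or another project) can later reload the model
# and score new records without rerunning the whole analysis.
score_new_cars <- function(new_data, model_path = "mpg_model.rds") {
  model <- readRDS(model_path)
  predict(model, newdata = new_data)
}
score_new_cars(data.frame(wt = 2.8, hp = 150))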
AN ITERATIVE PROCESS
The previous description of the data science process gives you the impression that you walk
through this process in a linear way, but in reality you often have to step back and rework
certain findings. For instance, you might find outliers in the data exploration phase that point to
data import errors. As part of the data science process you gain incremental insights, which may
lead to new questions. To prevent rework, make sure that you scope the business question
clearly and thoroughly at the start.
****************************