
UNIT - I

Introduction to Data Science

Data Science
T. Y. BTECH

SCHOOL OF COMPUTER ENGINEERING AND TECHNOLOGY
Prepared By Shilpa Sonawani
Data Science: Why all the Excitement?

Exciting new, effective applications of data analytics, e.g.:

• Google Flu Trends: detecting outbreaks two weeks ahead of CDC data.
• New models are estimating which cities are most at risk for the spread of the Ebola virus.

The prediction model is built on various data sources, data types and analyses.
“Big Data” Sources
It’s all happening on-line. User generated (Web & Mobile) – every:
• Click
• Ad impression
• Billing event
• Fast forward, pause, …
• Server request
• Transaction
• Network message
• Fault

Internet of Things / M2M
Health / Scientific Computing


Graph Data

Lots of interesting data has a graph structure:
• Social networks
• Communication networks
• Computer networks
• Road networks
• Citations
• Collaborations/Relationships
• …

Some of these graphs can get quite large (e.g., the Facebook user graph).
What can you do with the data?

Crowdsourcing + physical modeling + sensing + data assimilation to produce:
(figure from Alex Bayen, UCB)
5 Vs of Big Data

• Raw data: Volume
• Change over time: Velocity
• Data types: Variety
• Data quality: Veracity
• Information for decision making: Value
DATA SCIENCE – WHAT IS IT?
Data Science – A Definition

Data Science is the science which uses computer science, statistics, machine learning, visualization and human-computer interaction to collect, clean, integrate, analyze, visualize and interact with data to create data products.
Ben Fry’s Model
Visualizing Data Process

1. Acquire
2. Parse
3. Filter
4. Mine
5. Represent
6. Refine
7. Interact
Jeff Hammerbacher’s Model

1. Identify problem
2. Instrument data sources
3. Collect data
4. Prepare data (integrate, transform, clean, filter, aggregate)
5. Build model
6. Evaluate model
7. Communicate results
Data Scientist’s Practice

(diagram) An iterative cycle: hypothesize, dig around in the data, clean and prep, build models, exploit models at large scale, evaluate and interpret, and step back to the big picture; Extract, Transform, Load (ETL) underpins the flow.
Data Science: Getting Value out of Data
Why the Increased Interest in Data Science?
Why Python for Data Science?

https://www.kdnuggets.com/2019/05/poll-top-data-science-machine-learning-platforms.html
Applications

• Climate change and weather
• Traffic control
• Agriculture
• Personalised healthcare
• Twitter data analysis
• Facebook information links
• Pollution and weather
Contrast: Databases

Databases vs. Data Science:
• Data Value: “Precious” vs. “Cheap”
• Data Volume: Modest vs. Massive
• Examples: Bank records, personnel records, census, medical records vs. online clicks, GPS logs, tweets, building sensor readings
• Priorities: Consistency, error recovery, auditability vs. speed, availability, query richness
• Structured: Strongly (schema) vs. weakly or none (text)
• Properties: Transactions, ACID* vs. CAP* theorem (2 of 3), eventual consistency
• Realizations: SQL vs. NoSQL (MongoDB, CouchDB, HBase, Cassandra, …)

*ACID = Atomicity, Consistency, Isolation, Durability; CAP = Consistency, Availability, Partition tolerance
Contrast: BI
Contrast: AI
Modern Data Science Skills
The Structure Spectrum

• Structured (schema-first): relational database, formatted messages
• Semi-structured (schema-later): documents (XML), tagged text/media
• Unstructured (schema-never): plain text, media
Key Concept: Structured Data

A data model is a collection of concepts for describing data.

A schema is a description of a particular collection of data, using a given data model.
Data Cleaning-Dirty Data
• The Statistics View:
• There is a process that produces data
• We want to model ideal samples of that process, but
in practice we have non-ideal samples:
• Distortion – some samples are corrupted by a process
• Selection Bias - likelihood of a sample depends on its
value
• Left and right censorship - users come and go from our
scrutiny
• Dependence – samples are supposed to be independent,
but are not (e.g. social networks)
Dirty Data
• The Database View:
• Some of the values are missing, corrupted, wrong,
duplicated
• Results are absolute (relational model)
• You get a better answer by improving the quality
of the values in your dataset
Dirty Data
• The Domain Expert’s View:
  • This data doesn’t look right
  • This answer doesn’t look right
  • What happened?

• Domain experts have an implicit model of the data that they can test against…
Dirty Data

• The Data Scientist’s View:
  • Some combination of all of the above

Solution:
Data Quality Problems
• (Source) Data is dirty on its own.
• Transformations corrupt the data (complexity of software
pipelines).
• Data sets are clean but integration (i.e., combining them)
screws them up.
• “Rare” errors can become frequent after transformation or
integration.
• Data sets are clean but suffer “bit rot”
• Old data loses its value/accuracy over time

• Any combination of the above


Big Picture: Where can Dirty Data Arise?

(diagram) Dirty data can arise at every stage of the pipeline: Extract, Transform, Load; Integrate; Clean.
Numeric Outliers

Adapted from Joe Hellerstein’s 2012 CS 194 Guest Lecture


Dirty Data Problems
1) Parsing text into fields (separator issues)
2) Naming conventions (entity resolution): NYC vs. New York
3) Missing required field (e.g. key field)
4) Different representations (2 vs. Two)
5) Fields too long (get truncated)
6) Primary key violations (when moving from unstructured to structured data, or during integration)
7) Redundant records (exact match or otherwise)
8) Formatting issues, especially dates
9) Licensing/privacy issues that keep you from using the data as you would like
Conventional Definition of Data Quality

• Accuracy
– The data was recorded correctly.
• Completeness
– All relevant data was recorded.
• Uniqueness
– Entities are recorded once.
• Timeliness
– The data is kept up to date.
• Special problems in federated data: time consistency.
• Consistency
– The data agrees with itself.
How can we deal with noisy data?

• Data binning: the data is sorted and each value is smoothed by consulting its neighborhood (the other values in the same bin). This method is also known as local smoothing; a short sketch follows this list.
• Preprocessing by clustering: outliers may be detected by grouping similar data into the same group, i.e., the same cluster; values that fall outside every cluster are candidates for noise.
• Machine learning: a machine learning algorithm can be used to smooth the data. For example, a regression algorithm can fit a specified linear function and replace noisy values with the fitted ones.
• Removing manually: noisy data can be deleted manually by a human being, but this is time-consuming, so the method is rarely given priority.
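A minimal sketch of bin-mean smoothing with NumPy; the values and the bin size are made up purely for illustration:

import numpy as np

# Illustrative noisy measurements (values invented for the example)
prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)

bin_size = 4
smoothed = prices.copy()

# Bin-mean smoothing: sort the values, group them into equal-count bins,
# and replace every value in a bin with that bin's mean (local smoothing)
order = np.argsort(prices)
for start in range(0, len(prices), bin_size):
    idx = order[start:start + bin_size]
    smoothed[idx] = prices[idx].mean()

print(smoothed)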
Missing, Noisy and inconsistent Data
What is Data Preprocessing?
Data Preprocessing

Data Preprocessing is a technique used to convert raw data into a clean data set. Data gathered from different sources is collected in a raw format that is not feasible for analysis.

This set of steps is known as Data Preprocessing. It includes the following (a short end-to-end sketch follows the list):
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction
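A compact pandas sketch of these four steps; the sources, column names and values are invented purely for illustration:

import pandas as pd

# Two hypothetical raw sources to be integrated
sales_a = pd.DataFrame({"city": ["NYC", "Pune"], "amount": ["120", None]})
sales_b = pd.DataFrame({"city": ["Mumbai"], "amount": ["95"]})

# Data Integration: combine the sources into one table
raw = pd.concat([sales_a, sales_b], ignore_index=True)

# Data Cleaning: drop duplicates and records with missing values
clean = raw.drop_duplicates().dropna().copy()

# Data Transformation: convert the text column into a numeric type
clean["amount"] = clean["amount"].astype(float)

# Data Reduction: keep only the columns needed downstream
reduced = clean[["city", "amount"]]
print(reduced)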
Why is Data Preparation Important?

• Data Preprocessing is necessary because real-world data is mostly unformatted. Real-world data is typically composed of:
• Inaccurate data (missing data) – there are many reasons for missing data, such as data not being collected continuously, mistakes in data entry, technical problems with biometrics, and more.
• Noisy data (erroneous data and outliers) – the reasons for noisy data include technological problems with the gadget that gathers the data, human mistakes during data entry, and more.
• Inconsistent data – inconsistencies arise from duplication within the data, human data entry, and mistakes in codes or names, i.e., violations of data constraints, and more.
How is Data Preprocessing performed?
Missing data can be handled in three different ways, given below (a short sketch follows the list):
• Ignoring the missing record – the simplest and most efficient method for handling missing data. However, it should not be used when the number of missing values is immense, or when the pattern of missingness is related to the unrecognized root cause of the problem.
• Filling the missing values manually – one of the best-chosen methods, but it has the limitation that with a large data set and many missing values this approach is not efficient, as it becomes a time-consuming task.
• Filling using computed values – the missing values can be filled with the mean, mode or median of the observed values. Another option is predicted values, computed using a Machine Learning or Deep Learning algorithm.
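A short pandas sketch of the first and third options; the columns and values are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "age":  [25, None, 31, None, 40],                 # numeric column with gaps
    "city": ["Pune", "Pune", None, "Delhi", "Pune"],  # categorical column with a gap
})

# Option 1: ignore (drop) every record that has a missing value
dropped = df.dropna()

# Option 3: fill using computed values - median for the numeric column,
# mode (most frequent value) for the categorical column
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(dropped)
print(filled)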
Tasks of Data Preparation

• Data Cleaning: the first step implemented in Data Preprocessing. The primary focus is on handling missing data and noisy data, detecting and removing outliers, and minimizing duplication and computed biases within the data.
• Data Integration: used when data is gathered from various data sources and combined to form consistent data. This consistent data, after data cleaning, is used for analysis.
• Data Transformation: used to convert the raw data into a specified format according to the needs of the model. The options for transforming data are given below (a short sketch follows this list):
  • Normalization – numerical data is converted into a specified range, i.e., between 0 and 1, so that the data is on a common scale.
  • Aggregation – combines features into one; for example, two categories can be combined to form a new group.
  • Generalization – lower-level attributes are converted to a higher-level standard.
• Data Reduction: after the transformation and scaling of data, duplication (i.e., redundancy) within the data is removed and the data is organized efficiently.
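A small pandas sketch of normalization, generalization and reduction; the column names and the category mapping are illustrative assumptions, not from the slides:

import pandas as pd

df = pd.DataFrame({
    "height_cm": [150, 165, 180, 172],                      # numeric feature to normalize
    "device":    ["phone", "tablet", "laptop", "desktop"],  # low-level categories
})

# Normalization: rescale the numeric column into the range 0..1 (min-max scaling)
col = df["height_cm"]
df["height_norm"] = (col - col.min()) / (col.max() - col.min())

# Generalization / aggregation: map low-level categories to a higher-level group
df["device_class"] = df["device"].map(
    {"phone": "mobile", "tablet": "mobile", "laptop": "fixed", "desktop": "fixed"}
)

# Data Reduction: drop the raw columns once the derived ones are in place
reduced = df[["height_norm", "device_class"]]
print(reduced)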
What Is Data Wrangling?

• Data Wrangling is used to convert raw data into a format that is convenient for the consumption of data. It is a technique that is executed at the time of making an interactive model.
• Steps (a short sketch follows this list):
  – extract the data from different data sources
  – sort the data using a certain algorithm
  – decompose the data into a different structured format
  – finally, store the data in another database.

Data is converted to a proper, feasible format before applying any model to it. By filtering, grouping and selecting appropriate data, the accuracy and performance of the model can be increased.
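A minimal pandas sketch of that flow; the inline table stands in for a real data source, and the output file name is hypothetical:

import pandas as pd

# Extract: in practice this comes from files, APIs or databases;
# a tiny inline table stands in for the raw source here.
orders = pd.DataFrame({
    "customer_id": [1, 2, 1, 2],
    "month":       ["2024-01", "2024-01", "2024-02", "2024-01"],
    "amount":      [120.0, 80.0, 60.0, 40.0],
})

# Sort the records by a chosen key
orders = orders.sort_values(["customer_id", "month"])

# Decompose / restructure into a different format:
# one row per (customer, month) with the total spend
monthly = orders.groupby(["customer_id", "month"], as_index=False)["amount"].sum()

# Store the wrangled result in another destination (a CSV file here)
monthly.to_csv("orders_by_month.csv", index=False)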
Data Wrangling in Python

Data wrangling in Python deals with the functionalities below (a short sketch follows the list):
1. Data exploration: the data is studied, analyzed and understood by visualizing representations of it.
2. Dealing with missing values: most large datasets contain missing values (NaN); they need to be taken care of by replacing them with the mean, mode or most frequent value of the column, or simply by dropping rows that have a NaN value.
3. Reshaping data: the data is manipulated according to requirements, where new data can be added or pre-existing data can be modified.
4. Filtering data: sometimes datasets contain unwanted rows or columns which need to be removed or filtered out.
5. Other: after dealing with the raw dataset using the functionalities above, we get an efficient dataset as per our requirements, which can then be used for a required purpose such as data analysis, machine learning, data visualization, model training, etc.
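A short pandas sketch of functionalities 1-4 on a tiny made-up dataset:

import pandas as pd

df = pd.DataFrame({
    "city":  ["Pune", "Pune", "Delhi", "Delhi"],
    "year":  [2023, 2024, 2023, 2024],
    "sales": [10.0, None, 7.0, 9.0],
})

# 1. Data exploration: inspect structure, dtypes and summary statistics
df.info()
print(df.describe())

# 2. Dealing with missing values: fill NaN with the column mean (or drop the row)
df["sales"] = df["sales"].fillna(df["sales"].mean())

# 3. Reshaping data: pivot to one row per city and one column per year
wide = df.pivot(index="city", columns="year", values="sales")

# 4. Filtering data: keep only the rows that satisfy a condition
recent = df[df["year"] == 2024]

print(wide)
print(recent)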
Why is Data Wrangling Important?

• Data Wrangling is used to handle the issue of Data Leakage while implementing Machine Learning and Deep Learning.
• Data Leakage causes an invalid Machine Learning/Deep Learning model due to over-optimization of the applied model.
Data Leakage
Data Leakage can show up in many ways, for example:

• Leakage of data from the test dataset into the training dataset.
• Leakage of the computed correct prediction into the training dataset.
• Leakage of future data into past data.
• Usage of data outside the scope of the applied algorithm.
• In general, data leakage is observed from two primary sources in Machine Learning/Deep Learning algorithms: the feature attributes (variables) and the training dataset.
• Checking for the presence of Data Leakage within the applied model.
Minimizing Data Leakage
How is Data Wrangling performed?

• If the complete data set is used for normalization and standardization, and cross-validation is then performed to estimate the performance of the model, data leakage has already begun.
• The effect of Data Leakage can be minimized by recalculating the required Data Preparation inside the cross-validation process; this includes feature selection, outlier detection and removal, projection methods, scaling of selected features, and more (a short sketch follows).
• Another solution is to divide the complete dataset into a training dataset that is used to train the model and a validation dataset that is used to evaluate the performance and accuracy of the applied model.
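A sketch of that idea using scikit-learn, which is not named in the slides and is assumed here purely for illustration; the point is that the scaler is re-fit on the training split of every cross-validation fold, so statistics from the held-out data never leak into training:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A small synthetic dataset stands in for real data
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaling is part of the pipeline, so it is recalculated inside each fold
# instead of being fit once on the complete dataset (which would leak)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())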
Tasks of Data Wrangling
• Discovering: first, the data should be understood thoroughly to examine which approach will best suit it. For example, with weather data, analysis may show that the data comes from one area, so the primary focus is on determining patterns.
• Structuring: as the data is gathered from different sources, it will be present in various shapes and sizes, so it needs to be structured in a proper format.
• Cleaning: data that can degrade the performance of the analysis should be cleaned or removed.
• Enrichment: extract new features or data from the given data set to optimize the performance of the applied model.
• Validating: this approach is used to improve the quality of the data and to check consistency rules, so that the transformations applied to the data can be verified.
• Publishing: after completing the steps of Data Wrangling, the steps can be documented so that similar steps can be repeated for the same kind of data, to save time.
NumPy/Python
• NumPy is a Python library used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• NumPy stands for Numerical Python.
• In Python we have lists that serve the purpose of arrays, but they are slow to process.
• NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.
• The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarray very easy.
• Arrays are very frequently used in data science, where speed and resources are very important.
• NumPy arrays are stored in one continuous place in memory, unlike lists, so processes can access and manipulate them very efficiently (a short sketch follows this list).
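A tiny sketch of the ndarray object and its vectorized operations:

import numpy as np

# Create an ndarray from a Python list
a = np.array([1, 2, 3, 4, 5])
print(type(a))          # <class 'numpy.ndarray'>
print(a.dtype, a.shape)

# Vectorized arithmetic runs in compiled code, which is where the
# speed advantage over plain Python lists comes from
b = a * 2 + 1
print(b)                # [ 3  5  7  9 11]

# Linear algebra support, e.g. a matrix product
m = np.arange(6).reshape(2, 3)
print(m @ m.T)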
Pandas
• Pandas officially stands for ‘Python Data Analysis Library’ and is arguably the most important Python tool used by Data Scientists today.
• Pandas is an open source Python library that allows users to explore, manipulate and visualise data in an extremely efficient manner. It is, loosely speaking, Microsoft Excel in Python.
• It is easy to read and learn.
• It is extremely fast and powerful.
• It integrates well with other visualisation libraries.
• Pandas can take in a huge variety of data; the most common formats are CSV, Excel, SQL, or even a webpage.
Pandas/Python
Core data structures (a short sketch follows the list):
• Series: a named, ordered dictionary
  – The keys of the dictionary are the indexes
  – Built on NumPy’s ndarray
  – Values can be any NumPy data type object
• DataFrame: a table with named columns
  – Represented as a dict (col_name -> Series)
  – Each Series object represents a column
Operations
Typical DataFrame operations (a short sketch follows the list):
• map() functions
• filter (apply predicate to rows)
• sort / group by
• aggregate: sum, count, average, max, min
• pivot or reshape
• Relational:
  – union, intersection, difference, cartesian product (CROSS JOIN)
  – select/filter, project
  – join: natural join (INNER JOIN), theta join, semi-join, etc.
  – rename
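A short sketch of a few of these operations (map, filter, group-by aggregation and an inner join) on a made-up table:

import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S"],
    "units":  [10, 15, 7, 3],
})
regions = pd.DataFrame({"region": ["N", "S"], "name": ["North", "South"]})

# map(): apply a function element-wise to a column
sales["units_squared"] = sales["units"].map(lambda x: x ** 2)

# filter: keep the rows that satisfy a predicate
big = sales[sales["units"] > 5]

# sort / group by with aggregation (sum, count, mean, ...)
totals = sales.groupby("region")["units"].agg(["sum", "count", "mean"])

# relational-style natural join (INNER JOIN) on the shared column
joined = sales.merge(regions, on="region", how="inner")

print(big)
print(totals)
print(joined)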
Applications
Amazing real-time Data Science applications:
Recommendation – Most apps and websites like Amazon, YouTube, Flipkart, etc. give recommendations based on the viewer’s interest. Online music applications like Spotify give recommendations based on your taste in music. These are good examples of data science recommendation applications.
Search results – Machine learning algorithms are used to find the most relevant results for the Google search engine; similar algorithms rank the most visited sites in Google Chrome.
Intelligent assistants – Google Assistant and Siri are examples of intelligent assistants. Advanced machine learning algorithms convert voice input into text output. These smart assistants recognize the voice and provide the required information as both voice and text output.
Autonomous driving vehicles – Automobile companies like Waymo and Tesla are working toward the next generation of autonomous vehicles. 3D images are taken by cameras and the information is passed to algorithms for further processing.
Piracy detection – YouTube is an example of piracy detection using machine learning algorithms. Because of the huge database, copied content cannot be detected manually, so machine learning helps detect and remove copied content, reducing human effort.
Image recognition – Facebook uses image recognition based on data science and machine learning for friend suggestions. Google Lens also uses an image recognition algorithm to provide related information to you.
