
DSBDA UNIT 1
Introduction to Data Science and Big Data

1. Basics and need of Data Science and Big Data


2. Applications of Data Science
3. Data explosion
4. 5 V's of Big Data
5. Relationship between Data Science and Information Science
6. Business intelligence versus Data Science
7. Data Science Life Cycle
8. Data: Data Types, Data Collection
9. Need of Data Wrangling
10. Methods:
a. Data Cleaning
b. Data Integration
c. Data Reduction
d. Data Transformation
e. Data Discretization.

Definitions

Data Science:- Data Science is an interdisciplinary field that aims to discover and extract
actionable knowledge from various forms of data to support business decisions and make
predictions.

Big Data:- Big Data refers to extremely large and diverse collections of structured, unstructured,
and semi-structured data that continue to grow exponentially over time. These datasets are so
large and complex in volume, velocity, and variety that traditional data management systems cannot
store, process, or analyze them.

Q. Basics and Need of Data Science and Big Data


1. Data and Its Types

• Data is a collection of facts and figures that relay specific information but are not
organized.
• It includes numbers, words, measurements, observations, or descriptions of things.
• Data acts as raw material in the production of information.
• Types of data:
◦ Record data
◦ Data matrix
◦ Document data
◦ Transaction data
◦ Graph data
◦ Ordered data

2. Data Science

Basics of Data Science


• Data Science is an interdisciplinary field that extracts insights from various forms of data.
• It aims to discover and extract actionable knowledge to support business decisions and
predictions.
• Uses advanced analytical methods such as time series analysis for forecasting future
trends.
• Instead of just analyzing past sales, Data Science helps in predicting future sales and
revenue.

Need for Data Science


1. Helps businesses process huge amounts of structured and unstructured data to detect
patterns.
2. Uses modern tools and techniques to uncover hidden insights and meaningful information.
3. Utilizes complex machine learning algorithms to build predictive models.
4. Enables organizations to make data-driven decisions for better efficiency and growth.

3. Big Data

Basics of Big Data


• Big Data refers to extremely large volumes of data that cannot be processed using
traditional tools.
• This data comes from multiple sources, has varying complexity, and is generated at high
speeds (velocity).

• The term "Big Data" describes data that is:


◦ Huge in size
◦ Growing exponentially over time
◦ Difficult to store and process using conventional methods

Need for Big Data


• Traditional data management tools are inefficient in handling large-scale data.
• The processing of Big Data starts with raw, unstructured data that is difficult to
aggregate or organize.
• Big Data enables businesses to identify trends, detect patterns, and gain insights that
were previously impossible to obtain.

Conclusion
• Data Science and Big Data are essential for analyzing and utilizing vast amounts of data
effectively.
• They help organizations predict trends, improve decision-making, and enhance
efficiency.
• With the increasing growth of data, advanced analytics, AI, and machine learning are
necessary to extract meaningful insights.

Q. Applications of Data Science


1. Predictions and Surveys
a. Data science helps in making accurate predictions for surveys, elections, and flight
ticket confirmations.
b. Enables accurate voter-targeting models and helps increase voter participation.
2. Healthcare
a. Used by healthcare companies to develop sophisticated medical instruments for
detecting and curing diseases.
b. Supports personal healthcare management.
3. Gaming
a. Enhances video and computer games by utilizing data science techniques, improving
gaming experiences.
4. Image Recognition

a. Helps in identifying patterns and detecting objects in images, widely used in facial
recognition and medical imaging.
5. Logistics
a. Optimizes delivery routes for faster transportation, ensuring efficient supply chain
management.
b. Provides real-time traffic information.
6. Predicting Future Market Trends
a. Analyzes large-scale data to identify emerging market trends.
b. Tracking purchase behavior, influencer impact, and search queries helps businesses
understand consumer interests.
7. Recommendation Systems
a. Platforms like Netflix and Amazon use data science to provide personalized movie and
product recommendations based on user behavior.
8. Streamlining Manufacturing
a. Identifies inefficiencies in manufacturing by analyzing high volumes of production data.
b. Algorithms help in cleaning, sorting, and interpreting data quickly and accurately,
improving productivity.

Conclusion
• Data Science is transforming industries by improving efficiency, decision-making, and
customer experiences.
• It plays a crucial role in healthcare, gaming, logistics, marketing, and manufacturing,
making processes more data-driven and automated.

Q. Explain the 5 V’s of Big Data


Big Data is characterized by five key attributes, commonly known as the 5 V’s: Volume,
Velocity, Variety, Value, and Veracity.

These differentiate Big Data from traditional data systems.

1. Volume
• Refers to the large scale of data, often in terabytes or petabytes, which exceeds the
capacity of conventional relational databases.
• Managing and processing such vast amounts of data requires specialized Big Data
technologies like Hadoop and Spark.

2. Velocity
• Represents the speed at which data is generated and processed, often in real-time.
• Examples include social media updates, IoT sensor data, and financial transactions
that require immediate processing for timely insights.
3. Variety
• Describes the diverse types and sources of data, which can be structured (databases,
spreadsheets), semi-structured (XML, JSON), or unstructured (videos, images, social
media posts).
• Handling this variety requires flexible data storage and processing frameworks.
4. Value
• Deriving business value is the ultimate goal of Big Data analysis.
• Organizations leverage Big Data for decision-making, trend analysis, and predictive
analytics, improving efficiency and profitability.
• In real-time spatial Big Data, visualization enhances decision-making in areas like
climate monitoring, traffic analysis, and inventory management.
5. Veracity
• Refers to the trustworthiness and accuracy of data, as inaccurate or misleading data
can affect insights and decisions.
• Since data comes from multiple sources, ensuring data integrity, quality, and
credibility is crucial for effective analytics.

These 5 V’s define the characteristics and challenges of Big Data, emphasizing the need for
advanced analytics and processing techniques to extract meaningful insights.


Q. Compare Business Intelligence and Data Science.

Aspect | Business Intelligence | Data Science
Focus | Descriptive: reports on past and current performance. | Predictive and exploratory: forecasts future outcomes.
Data | Mostly structured data from data warehouses. | Structured, semi-structured, and unstructured data.
Methods | Dashboards, reporting, OLAP, SQL queries. | Statistics, machine learning, data mining.
Tools | Power BI, Tableau, SQL. | Python, R, Hadoop, Spark, Scikit-learn.
Output | Reports and dashboards for monitoring KPIs. | Predictive models and data-driven strategies.

Q. Difference Between Data Science and Information Science

Feature | Data Science | Information Science
Definition | Focuses on extracting insights from structured and unstructured data using statistical and computational techniques. | Focuses on organizing, managing, and retrieving information to improve accessibility and usability.
Primary Goal | Discover patterns, make predictions, and drive data-driven decision-making. | Improve information retrieval, storage, and dissemination for better knowledge management.
Core Techniques | Machine Learning, AI, Data Mining, Big Data Analytics, Statistical Modeling. | Information Retrieval, Knowledge Organization, Library Science, Digital Archiving.
Data Handling | Works with raw, large-scale, and complex datasets. | Deals with structured, processed, and organized information.
Tools & Technologies | Python, R, SQL, Hadoop, TensorFlow, Scikit-learn, Power BI. | Database Management Systems (DBMS), Digital Libraries, Metadata Standards (Dublin Core, MARC), Search Algorithms.
Fields of Application | Business Analytics, Healthcare, Finance, AI, Robotics, Scientific Research. | Library Science, Knowledge Management, Information Systems, Digital Media.
Output | Predictive models, AI systems, data-driven strategies. | Efficient information retrieval systems, well-organized knowledge bases.

Key Takeaways:
• Data Science focuses on analyzing and extracting insights from data.
• Information Science deals with managing and organizing information for effective access
and usage.
• Data Science is more technical and algorithm-driven, while Information Science is more
structural and organizational.

Q. Explain different phases of data analytics life cycle with neat diagram.
The Data Analytics Life Cycle consists of six key phases, each playing a crucial role in
transforming raw data into actionable insights.

1. Discovery
In this initial phase, the team gathers information about the business domain, objectives, and
available resources. The focus is on understanding past experiences, potential challenges,
and formulating hypotheses.
Key Activities:
• Learning the Business Domain
• Developing Initial Hypotheses (IHs)
• Framing the Business Problem as an Analytics Challenge
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Assessing Available Resources (People, Technology, Time, Data)
• Identifying Potential Data Sources

2. Data Preparation
This phase involves setting up an analytic sandbox, where data can be extracted,
transformed, and loaded (ETL or ETLT). The team ensures the data is clean, structured, and
ready for analysis. A small sketch follows the list below.

Key Activities:
• Preparing the Analytic Sandbox
• Performing ETLT (Extract, Transform, Load, and Transform)
• Understanding and Familiarizing with the Data
• Data Conditioning (Cleaning, Handling Missing Values, etc.)
• Surveying and Visualizing Data
• Using Common Data Preparation Tools
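
A minimal ETLT sketch in pandas, illustrative rather than from the original notes: the inline frame stands in for a raw source, and the column names, table name, and SQLite sandbox are all assumptions.

```python
import sqlite3

import pandas as pd

# Extract: in practice pd.read_csv/read_sql pulls from the raw source;
# here a small inline frame stands in for the extracted data.
raw = pd.DataFrame({
    "user_id":   [101, 102, None, 104],             # assumed key column
    "timestamp": ["2025-03-01", "2025-03-02",
                  "2025-03-02", "2025-03-03"],
})

# Transform: condition the data (fix types, drop rows missing the key).
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
conditioned = raw.dropna(subset=["user_id"])

# Load: write the conditioned data into the analytic sandbox (SQLite here).
with sqlite3.connect("sandbox.db") as conn:
    conditioned.to_sql("events", conn, if_exists="replace", index=False)
```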

3. Model Planning
The team determines the analytical techniques, methods, and workflows to be used in the
next phase. This includes selecting key variables and exploring relationships between them.

Key Activities:
• Data Exploration and Variable Selection
• Choosing the Best Model for the Problem
• Selecting the Most Suitable Analytical Techniques
• Using Common Tools for Model Planning

4. Model Building
In this phase, the team develops and tests models using different datasets (training, testing,
and production). The execution environment is also evaluated for efficiency. A small sketch
follows the list below.

Key Activities:
• Developing Training and Testing Datasets
• Building and Running Models Based on Selected Techniques
• Evaluating Computational Requirements (e.g., Fast Hardware, Parallel Processing)
• Using Common Tools for Model Building
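
As a hedged illustration of developing training and testing datasets and building a model, here is a minimal scikit-learn sketch; the synthetic dataset and the choice of logistic regression are stand-ins, not the method prescribed by the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for project data: 500 records, 5 features, binary target.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Develop training and testing datasets (70/30 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Build and run the model based on the selected technique.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data before the model moves toward production.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```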

5. Communicate Results
The team collaborates with stakeholders to assess the success of the project. The results are
presented in a clear and structured manner.

Key Activities:
• Identifying Key Findings
• Quantifying Business Value and Model Performance
• Developing a Narrative to Convey Insights
• Presenting Results to Stakeholders

6. Operationalize
In the final phase, the models and insights are deployed into a production environment. A
pilot project may be run to test the models before full implementation.

Key Activities:
• Delivering Final Reports, Briefings, Code, and Technical Documentation
• Deploying Models into Production
• Running a Pilot Project to Validate Performance

Conclusion
The Data Analytics Life Cycle ensures a structured approach to deriving insights from data.
Each phase plays a vital role in improving decision-making, optimizing operations, and driving
business value.

Q. What is Data Wrangling? Why Do You Need It?


Data Wrangling is the process of cleaning, organizing, and transforming raw data into a
structured format suitable for analysis. It ensures that data is accurate, consistent, and ready
for decision-making.

Why Do We Need Data Wrangling?

1. Key Tasks in Data Wrangling:

• Merging datasets to create a unified dataset for analysis.
• Handling missing values and filling data gaps.
• Identifying and removing outliers or anomalies (see the sketch after the benefits list below).
• Standardizing data inputs for consistency.

2. Benefits of Data Wrangling:

• Helps in quickly building data pipelines and workflows.
• Reduces time spent by analysts on data preparation.
• Ensures data consistency, completeness, usability, and security.
• Standardizes data formats for Big Data processing.
• Integrates multiple data sources for a holistic view.
• Enables efficient processing of large-scale data.
• Enhances data-driven decision-making by ensuring clean and structured data.
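
A minimal pandas sketch of one task from the list above, removing outliers with the interquartile-range (IQR) rule; the data is hypothetical and the 1.5×IQR threshold is a common convention, not something from the original notes.

```python
import pandas as pd

# Hypothetical daily sales with one obvious anomaly.
df = pd.DataFrame({"sales": [120, 130, 125, 128, 900, 122]})

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers.
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

clean = df[inliers]  # the 900 row is dropped; the rest are kept
```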

In summary, data wrangling is essential for businesses and analysts to derive meaningful
insights from raw data, ensuring efficiency and accuracy in decision-making.

Q. Data Cleaning as a Method of Data Wrangling


Data Cleaning is a crucial step in data wrangling, ensuring that raw data is accurate,
consistent, and usable for analysis. It helps improve overall data quality.

Key Tasks in Data Cleaning:


1. Data Acquisition and Metadata Management
• Understanding the structure, source, and meaning of the data.
2. Handling Missing Values (see the pandas sketch after this list)
a. Ignoring the tuple: usually done when the class label is missing or most values are absent.
b. Manual filling: suitable for small datasets but impractical for large-scale data.
c. Global constant replacement: replacing missing values with a predefined constant.
d. Mean/median substitution: using the mean or median of an attribute to fill missing values.
e. Class-wise mean replacement: using the average value of a specific category.
f. Most probable value filling: predicting the missing value using statistical models.
3. Identifying and Handling Noisy Data (Errors/Variations in Data)
• Noise refers to random errors or inconsistencies in data, affecting data integrity.
• Methods to handle noisy data:
◦ Binning: sorting values into bins and applying smoothing techniques:
Smoothing by bin means: replacing bin values with their mean.
Smoothing by bin medians: replacing bin values with their median.
Smoothing by bin boundaries: replacing values with the closest boundary value.
◦ Regression Analysis: using linear regression or multiple regression to smooth data.
◦ Clustering Analysis: detecting and handling anomalies using clustering or statistical methods.
◦ Combined computer and human inspection.
4. Ensuring Data Consistency and Standardization
• Unified Date Format: ensuring all dates follow a single format.
• Converting Nominal to Numeric Data: transforming categorical data into numerical values.
• Correcting Inconsistent Data: resolving discrepancies across datasets.
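
A minimal pandas sketch of two of the tasks above, median substitution for missing values and smoothing by bin means; the small dataset is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and a noisy attribute.
df = pd.DataFrame({
    "age":    [23, np.nan, 35, 29, np.nan, 41],
    "salary": [32000, 45000, 39000, 500000, 41000, 38000],
})

# Mean/median substitution: fill missing ages with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Smoothing by bin means: sort values into equal-frequency bins,
# then replace every value with the mean of its bin.
bins = pd.qcut(df["salary"], q=3)
df["salary_smooth"] = df.groupby(bins, observed=True)["salary"].transform("mean")
```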

Why is Data Cleaning Important in Data Wrangling?


• Ensures reliable and accurate analysis.
• Reduces errors and improves data integrity.
• Enhances decision-making by eliminating inconsistencies.
• Facilitates seamless integration of structured and unstructured data.

Conclusion
Data Cleaning is a foundational step in Data Wrangling, helping to refine raw data into a
structured and meaningful format. By removing errors, inconsistencies, and noise, it
improves data quality, making it more suitable for analysis and insights.

Data Integration and Transformation are two essential steps in this process that ensure data
is unified, consistent, and ready for analytics.

Q. Data Integration as a Method of Data Wrangling


Definition: Data integration is the process of combining data from multiple sources into a
single, coherent data store. It involves managing metadata, resolving conflicts, and detecting
redundancies to provide a unified view of the data.

Importance of Data Integration:


• Ensures a unified view of scattered data.
• Maintains data consistency and accuracy.
• Helps in decision-making by eliminating data silos.

Challenges in Data Integration:

1. Entity Identification Problem
• Data is collected from multiple heterogeneous sources, and different sources may use
different identifiers for the same real-world entity.
• Example: One dataset has customer_id, while another uses customer_number;
metadata is used to match these attributes correctly.
2. Redundancy
• Duplicate data or unnecessary attributes can lead to inefficiencies.
• Example: If one dataset contains customer age and another has date of birth, the age
is redundant since it can be derived from the date of birth.
• Correlation analysis and covariance analysis are used to detect and remove redundant
attributes (see the sketch after this list).
3. Detection and Resolution of Data Value Conflicts
• For the same real-world entity, attribute values from different sources may differ.
• Possible reasons: different representations, different scales.
• Example: metric vs. imperial (British) units.
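
A minimal pandas sketch of entity identification and redundancy detection, assuming two hypothetical sources whose identifiers and columns are illustrative.

```python
import pandas as pd

# Two hypothetical sources naming the same entity differently.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 28, 45]})
billing = pd.DataFrame({"customer_number": [1, 2, 3],
                        "birth_year": [1991, 1997, 1980]})

# Entity identification: metadata says customer_id matches customer_number,
# so the sources can be merged into a single coherent store.
merged = crm.merge(billing, left_on="customer_id",
                   right_on="customer_number").drop(columns="customer_number")

# Redundancy detection: a correlation near -1 between age and birth_year
# shows one attribute is derivable from the other and can be dropped.
print(merged["age"].corr(merged["birth_year"]))
```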

Q. Data Transformation as a Method of Data Wrangling


Definition: Data transformation is the process of modifying, converting, or aggregating data
into a suitable format for analysis.

Key Methods of Data Transformation:


1. Smoothing
• Removes noise from data using techniques like binning, regression, and clustering.
2. Aggregation
• Summarizes data into a higher-level format for easier interpretation.
3. Generalization
• Replaces raw (low-level) data with higher-level concepts using concept hierarchies.
• Example: Replacing specific dates of birth with broader age groups.
4. Normalization
• Scales numerical attributes to fit within a specific range (see the sketch after this list).
• Techniques include:
◦ Min-Max Normalization: Rescales values between a minimum and maximum range.
◦ Z-score Normalization: Adjusts values based on mean and standard deviation.
◦ Decimal Scaling: Moves the decimal point to normalize values.
5. Attribute Construction
• New attributes are created from existing ones to enhance analysis.
• Example: Creating an "Income Bracket" column from raw salary data.
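
The three normalization techniques above in a minimal pandas sketch; the attribute values are hypothetical, and the formulas in the comments are the standard definitions.

```python
import pandas as pd

v = pd.Series([200, 300, 400, 600, 1000])  # hypothetical attribute values

# Min-max normalization: v' = (v - min) / (max - min), mapping into [0, 1].
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: v' = (v - mean) / std.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: v' = v / 10^j, with the smallest j such that max(|v'|) < 1.
j = len(str(int(v.abs().max())))  # here j = 4, since max |v| = 1000
decimal_scaled = v / 10 ** j
```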

Conclusion
Data Integration and Transformation play a crucial role in preparing data for analytics. They help in
merging, cleaning, and converting raw data into a structured format, making it usable, reliable, and
efficient for decision-making.

Q. Data Reduction as a Method of Data Wrangling


Definition:
Data reduction is the process of reducing the volume of data while preserving its integrity
and analytical value. It helps improve efficiency and storage without affecting the mining
results.

Key Data Reduction Strategies:


1. Data Cube Aggregation
• Aggregates data at different levels (e.g., daily sales → monthly/yearly totals).
2. Attribute Subset Selection
• Removes irrelevant or redundant attributes to reduce dataset size.
• Example: Removing student roll number when predicting CGPA.
3. Dimensionality Reduction
• Reduces the number of variables while retaining key information (see the PCA sketch
after this list).
• Principal Component Analysis (PCA) and the Discrete Wavelet Transform (DWT) are
common techniques.
4. Numerosity Reduction
• Replaces large data volumes with compact models or approximations.
• Parametric (statistical models) and Non-parametric (histograms, clustering)
approaches are used.
5. Discretization & Concept Hierarchy Generation
• Converts continuous data into categorical ranges for better representation.
• Example: Converting age into groups like young, middle-aged, senior.
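
A minimal scikit-learn sketch of dimensionality reduction with PCA on synthetic data; the dataset and the 95% variance threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 100 records, 10 columns driven by only 3 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 3))
X = factors @ rng.normal(size=(3, 10))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # e.g. (100, 10) -> (100, 3)
```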

Why Use Data Reduction in Wrangling?

• Improves storage efficiency and processing speed.
• Maintains data quality while reducing size.
• Ensures faster and more effective analysis.

Conclusion
Data Reduction streamlines large datasets, making them easier to store, process, and
analyze, ensuring efficient data wrangling without loss of key insights.

Q. Data Discretization as a Method of Data Wrangling


Definition: Data discretization is the process of dividing continuous attributes into intervals
and replacing actual values with interval labels. It helps in reducing complexity, enabling
classification algorithms that require categorical data, and improving knowledge
representation.

Types of Data Discretization:


1. Based on Class Information:
• Supervised Discretization: Uses class labels to guide interval formation.
• Unsupervised Discretization: Does not use class labels.
2. Based on Direction of Processing:
• Top-Down (Splitting): Starts with a broad range and recursively splits into smaller
intervals.
• Bottom-Up (Merging): Begins with individual values and merges them into larger
intervals.

Techniques for Data Discretization:


1. Binning – Divides data into bins and replaces values with bin labels (see the pandas sketch after this list).
2. Histogram Analysis – Groups data based on frequency distribution.
3. Clustering Analysis – Uses clustering algorithms to create meaningful groups.
4. Entropy-Based Discretization – Uses information gain to determine split points.
5. Segmentation by Natural Partitioning – Forms intervals based on natural divisions in data.
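
A minimal pandas sketch of discretization by binning on a hypothetical continuous attribute; the cut points and labels are illustrative.

```python
import pandas as pd

ages = pd.Series([15, 22, 34, 47, 58, 63, 71])  # hypothetical continuous values

# Top-down splitting with explicit cut points and interval labels.
age_groups = pd.cut(ages, bins=[0, 30, 60, 100],
                    labels=["young", "middle-aged", "senior"])

# Equal-frequency binning: each bin holds roughly the same number of values.
quartiles = pd.qcut(ages, q=4)

print(age_groups.tolist())  # ['young', 'young', 'middle-aged', ...]
```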

Why Use Data Discretization in Wrangling?

• Reduces data complexity and improves interpretability.
• Enables categorical data representation for certain models.
• Supports hierarchical and multiresolution analysis.

Conclusion
Data Discretization is a key method in data wrangling that simplifies continuous data, making
it easier to analyze, interpret, and use in classification and data mining tasks.
