
DSBDA UNIT 1
Introduction to Data Science and Big Data

1. Basics and need of Data Science and Big Data


2. Applications of Data Science
3. Data explosion
4. 5 V's of Big Data
5. Relationship between Data Science and Information Science
6. Business intelligence versus Data Science
7. Data Science Life Cycle
8. Data: Data Types, Data Collection
9. Need of Data Wrangling
10. Methods:
a. Data Cleaning
b. Data Integration
c. Data Reduction
d. Data Transformation
e. Data Discretization.

Definitions

Data Science:- Data Science is an interdisciplinary field that aims to discover and extract
actionable knowledge from various forms of data to support business decisions and make
predictions.

Big Data:- Big Data refers to extremely large and diverse collections of structured, unstructured,
and semi-structured data that continue to grow exponentially over time. These datasets are so
large and complex in volume, velocity, and variety that traditional data management systems cannot
store, process, or analyze them.

Q. Basics and Need of Data Science and Big Data


1. Data and Its Types

• Data is a collection of facts and figures that relay specific information but are not
organized.
• It includes numbers, words, measurements, observations, or descriptions of things.
• Data acts as raw material in the production of information.
• Types of data:
◦ Record data
◦ Data matrix
◦ Document data
◦ Transaction data
◦ Graph data
◦ Ordered data

2. Data Science

Basics of Data Science


• Data Science is an interdisciplinary field that extracts insights from various forms of data.
• It aims to discover and extract actionable knowledge to support business decisions and
predictions.
• Uses advanced analytical methods such as time series analysis for forecasting future
trends.
• Instead of just analyzing past sales, Data Science helps in predicting future sales and
revenue.

Need for Data Science


1. Helps businesses process huge amounts of structured and unstructured data to detect
patterns.
2. Uses modern tools and techniques to uncover hidden insights and meaningful information.
3. Utilizes complex machine learning algorithms to build predictive models.
4. Enables organizations to make data-driven decisions for better efficiency and growth.

3. Big Data

Basics of Big Data


• Big Data refers to extremely large volumes of data that cannot be processed using
traditional tools.
• This data comes from multiple sources, has varying complexity, and is generated at high
speeds (velocity).

• The term "Big Data" describes data that is:


◦ Huge in size
◦ Growing exponentially over time
◦ Difficult to store and process using conventional methods

Need for Big Data


• Traditional data management tools are inefficient in handling large-scale data.
• The processing of Big Data starts with raw, unstructured data that is difficult to
aggregate or organize.
• Big Data enables businesses to identify trends, detect patterns, and gain insights that
were previously impossible to obtain.

Conclusion
• Data Science and Big Data are essential for analyzing and utilizing vast amounts of data
effectively.
• They help organizations predict trends, improve decision-making, and enhance
efficiency.
• With the increasing growth of data, advanced analytics, AI, and machine learning are
necessary to extract meaningful insights.

Q. Applications of Data Science


1. Predictions and Surveys
a. Data science helps in making accurate predictions for surveys, elections, and flight
ticket confirmations.
b. Enables accurate voter-targeting models and helps increase voter participation.
2. Healthcare
a. Used by healthcare companies to develop sophisticated medical instruments for
detecting and curing diseases.
b. Supports personal healthcare management.
3. Gaming
a. Enhances video and computer games by utilizing data science techniques, improving
gaming experiences.
4. Image Recognition

a. Helps in identifying patterns and detecting objects in images, widely used in facial
recognition and medical imaging.
5. Logistics
a. Optimizes delivery routes for faster transportation, ensuring efficient supply chain
management.
b. Provides real-time traffic information.
6. Predicting Future Market Trends
a. Analyzes large-scale data to identify emerging market trends.
b. Tracking purchase behavior, influencer impact, and search queries helps businesses
understand consumer interests.
7. Recommendation Systems
a. Platforms like Netflix and Amazon use data science to provide personalized movie and
product recommendations based on user behavior.
8. Streamlining Manufacturing
a. Identifies inefficiencies in manufacturing by analyzing high volumes of production data.
b. Algorithms help in cleaning, sorting, and interpreting data quickly and accurately,
improving productivity.

Conclusion
• Data Science is transforming industries by improving efficiency, decision-making, and
customer experiences.
• It plays a crucial role in healthcare, gaming, logistics, marketing, and manufacturing,
making processes more data-driven and automated.

Q. Explain the 5 V’s of Big Data


Big Data is characterized by five key attributes, commonly known as the 5 V’s: Volume,
Velocity, Variety, Value, and Veracity.

These differentiate Big Data from traditional data systems.

1. Volume
• Refers to the large scale of data, often in terabytes or petabytes, which exceeds the
capacity of conventional relational databases.
• Managing and processing such vast amounts of data requires specialized Big Data
technologies like Hadoop and Spark.

2. Velocity
• Represents the speed at which data is generated and processed, often in real-time.
• Examples include social media updates, IoT sensor data, and financial transactions
that require immediate processing for timely insights.
3. Variety
• Describes the diverse types and sources of data, which can be structured (databases,
spreadsheets), semi-structured (XML, JSON), or unstructured (videos, images, social
media posts).
• Handling this variety requires flexible data storage and processing frameworks.
4. Value
• Deriving business value is the ultimate goal of Big Data analysis.
• Organizations leverage Big Data for decision-making, trend analysis, and predictive
analytics, improving efficiency and profitability.
• In real-time spatial Big Data, visualization enhances decision-making in areas like
climate monitoring, traffic analysis, and inventory management.
5. Veracity
• Refers to the trustworthiness and accuracy of data, as inaccurate or misleading data
can affect insights and decisions.
• Since data comes from multiple sources, ensuring data integrity, quality, and
credibility is crucial for effective analytics.

These 5 V’s define the characteristics and challenges of Big Data, emphasizing the need for
advanced analytics and processing techniques to extract meaningful insights.


Q. Compare Business Intelligence and Data Science.

Aspect | Business Intelligence | Data Science
Focus | Descriptive: reports on past and current performance. | Predictive and exploratory: forecasts future outcomes.
Data | Mostly structured data from data warehouses. | Structured, semi-structured, and unstructured data.
Methods | Dashboards, reporting, OLAP, SQL queries. | Statistics, machine learning, data mining.
Tools | Power BI, Tableau, SQL. | Python, R, Hadoop, Spark, Scikit-learn.
Output | Reports and dashboards for monitoring KPIs. | Predictive models and data-driven strategies.

Q. Difference Between Data Science and Information Science

Feature | Data Science | Information Science
Definition | Focuses on extracting insights from structured and unstructured data using statistical and computational techniques. | Focuses on organizing, managing, and retrieving information to improve accessibility and usability.
Primary Goal | Discover patterns, make predictions, and drive data-driven decision-making. | Improve information retrieval, storage, and dissemination for better knowledge management.
Core Techniques | Machine Learning, AI, Data Mining, Big Data Analytics, Statistical Modeling. | Information Retrieval, Knowledge Organization, Library Science, Digital Archiving.
Data Handling | Works with raw, large-scale, and complex datasets. | Deals with structured, processed, and organized information.
Tools & Technologies | Python, R, SQL, Hadoop, TensorFlow, Scikit-learn, Power BI. | Database Management Systems (DBMS), Digital Libraries, Metadata Standards (Dublin Core, MARC), Search Algorithms.
Fields of Application | Business Analytics, Healthcare, Finance, AI, Robotics, Scientific Research. | Library Science, Knowledge Management, Information Systems, Digital Media.
Output | Predictive models, AI systems, data-driven strategies. | Efficient information retrieval systems, well-organized knowledge bases.

Key Takeaways:
• Data Science focuses on analyzing and extracting insights from data.
• Information Science deals with managing and organizing information for effective access
and usage.
• Data Science is more technical and algorithm-driven, while Information Science is more
structural and organizational.

Q. Explain different phases of data analytics life cycle with neat diagram.
The Data Analytics Life Cycle consists of six key phases, each playing a crucial role in
transforming raw data into actionable insights.

1. Discovery
In this initial phase, the team gathers information about the business domain, objectives, and
available resources. The focus is on understanding past experiences, potential challenges,
and formulating hypotheses.
Key Activities:
• Learning the Business Domain
• Developing Initial Hypotheses (IHs)
• Framing the Business Problem as an Analytics Challenge
• Identifying Key Stakeholders
• Interviewing the Analytics Sponsor
• Assessing Available Resources (People, Technology, Time, Data)
• Identifying Potential Data Sources

2. Data Preparation
This phase involves setting up an analytic sandbox, where data can be extracted,
transformed, and loaded (ETL or ETLT). The team ensures the data is clean, structured, and
ready for analysis. A small sketch follows the list below.

Key Activities:
• Preparing the Analytic Sandbox
• Performing ETLT (Extract, Transform, Load, and Transform)
• Understanding and Familiarizing with the Data
• Data Conditioning (Cleaning, Handling Missing Values, etc.)
• Surveying and Visualizing Data
• Using Common Data Preparation Tools
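
A minimal ETLT sketch in pandas, illustrative rather than from the original notes: the inline frame stands in for a raw source, and the column names, table name, and SQLite sandbox are all assumptions.

```python
import sqlite3

import pandas as pd

# Extract: in practice pd.read_csv/read_sql pulls from the raw source;
# here a small inline frame stands in for the extracted data.
raw = pd.DataFrame({
    "user_id":   [101, 102, None, 104],             # assumed key column
    "timestamp": ["2025-03-01", "2025-03-02",
                  "2025-03-02", "2025-03-03"],
})

# Transform: condition the data (fix types, drop rows missing the key).
raw["timestamp"] = pd.to_datetime(raw["timestamp"])
conditioned = raw.dropna(subset=["user_id"])

# Load: write the conditioned data into the analytic sandbox (SQLite here).
with sqlite3.connect("sandbox.db") as conn:
    conditioned.to_sql("events", conn, if_exists="replace", index=False)
```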

3. Model Planning
The team determines the analytical techniques, methods, and workflows to be used in the
next phase. This includes selecting key variables and exploring relationships between them.

Key Activities:
• Data Exploration and Variable Selection
• Choosing the Best Model for the Problem
• Selecting the Most Suitable Analytical Techniques
• Using Common Tools for Model Planning

4. Model Building
In this phase, the team develops and tests models using different datasets (training, testing,
and production). The execution environment is also evaluated for efficiency. A small sketch
follows the list below.

Key Activities:
• Developing Training and Testing Datasets
• Building and Running Models Based on Selected Techniques
• Evaluating Computational Requirements (e.g., Fast Hardware, Parallel Processing)
• Using Common Tools for Model Building
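
As a hedged illustration of developing training and testing datasets and building a model, here is a minimal scikit-learn sketch; the synthetic dataset and the choice of logistic regression are stand-ins, not the method prescribed by the notes.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for project data: 500 records, 5 features, binary target.
X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Develop training and testing datasets (70/30 split).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Build and run the model based on the selected technique.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Evaluate on held-out data before the model moves toward production.
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```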

5. Communicate Results
The team collaborates with stakeholders to assess the success of the project. The results are
presented in a clear and structured manner.

Key Activities:
• Identifying Key Findings
• Quantifying Business Value and Model Performance
• Developing a Narrative to Convey Insights
• Presenting Results to Stakeholders

6. Operationalize
In the final phase, the models and insights are deployed into a production environment. A
pilot project may be run to test the models before full implementation.

Key Activities:
• Delivering Final Reports, Briefings, Code, and Technical Documentation
• Deploying Models into Production
• Running a Pilot Project to Validate Performance

Conclusion
The Data Analytics Life Cycle ensures a structured approach to deriving insights from data.
Each phase plays a vital role in improving decision-making, optimizing operations, and driving
business value.

Q. What is Data Wrangling? Why Do You Need It?


Data Wrangling is the process of cleaning, organizing, and transforming raw data into a
structured format suitable for analysis. It ensures that data is accurate, consistent, and ready
for decision-making.

Why Do We Need Data Wrangling?

1. Key Tasks in Data Wrangling:

• Merging datasets to create a unified dataset for analysis.
• Handling missing values and filling data gaps.
• Identifying and removing outliers or anomalies (see the sketch after the benefits list below).
• Standardizing data inputs for consistency.

2. Benefits of Data Wrangling:

• Helps in quickly building data pipelines and workflows.
• Reduces time spent by analysts on data preparation.
• Ensures data consistency, completeness, usability, and security.
• Standardizes data formats for Big Data processing.
• Integrates multiple data sources for a holistic view.
• Enables efficient processing of large-scale data.
• Enhances data-driven decision-making by ensuring clean and structured data.
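
A minimal pandas sketch of one task from the list above, removing outliers with the interquartile-range (IQR) rule; the data is hypothetical and the 1.5×IQR threshold is a common convention, not something from the original notes.

```python
import pandas as pd

# Hypothetical daily sales with one obvious anomaly.
df = pd.DataFrame({"sales": [120, 130, 125, 128, 900, 122]})

# IQR rule: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers.
q1, q3 = df["sales"].quantile([0.25, 0.75])
iqr = q3 - q1
inliers = df["sales"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

clean = df[inliers]  # the 900 row is dropped; the rest are kept
```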

In summary, data wrangling is essential for businesses and analysts to derive meaningful
insights from raw data, ensuring efficiency and accuracy in decision-making.

Q. Data Cleaning as a Method of Data Wrangling


Data Cleaning is a crucial step in data wrangling, ensuring that raw data is accurate,
consistent, and usable for analysis. It helps improve overall data quality.

Key Tasks in Data Cleaning:


1. Data Acquisition and Metadata Management
• Understanding the structure, source, and meaning of the data.
2. Handling Missing Values (see the pandas sketch after this list)
a. Ignoring the tuple: usually done when the class label is missing or most values are absent.
b. Manual filling: suitable for small datasets but impractical for large-scale data.
c. Global constant replacement: replacing missing values with a predefined constant.
d. Mean/median substitution: using the mean or median of an attribute to fill missing values.
e. Class-wise mean replacement: using the average value of a specific category.
f. Most probable value filling: predicting the missing value using statistical models.
3. Identifying and Handling Noisy Data (Errors/Variations in Data)
• Noise refers to random errors or inconsistencies in data, affecting data integrity.
• Methods to handle noisy data:
◦ Binning: sorting values into bins and applying smoothing techniques:
Smoothing by bin means: replacing bin values with their mean.
Smoothing by bin medians: replacing bin values with their median.
Smoothing by bin boundaries: replacing values with the closest boundary value.
◦ Regression Analysis: using linear regression or multiple regression to smooth data.
◦ Clustering Analysis: detecting and handling anomalies using clustering or statistical methods.
◦ Combined computer and human inspection.
4. Ensuring Data Consistency and Standardization
• Unified Date Format: ensuring all dates follow a single format.
• Converting Nominal to Numeric Data: transforming categorical data into numerical values.
• Correcting Inconsistent Data: resolving discrepancies across datasets.
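
A minimal pandas sketch of two of the tasks above, median substitution for missing values and smoothing by bin means; the small dataset is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values and a noisy attribute.
df = pd.DataFrame({
    "age":    [23, np.nan, 35, 29, np.nan, 41],
    "salary": [32000, 45000, 39000, 500000, 41000, 38000],
})

# Mean/median substitution: fill missing ages with the median.
df["age"] = df["age"].fillna(df["age"].median())

# Smoothing by bin means: sort values into equal-frequency bins,
# then replace every value with the mean of its bin.
bins = pd.qcut(df["salary"], q=3)
df["salary_smooth"] = df.groupby(bins, observed=True)["salary"].transform("mean")
```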

Why is Data Cleaning Important in Data Wrangling?


• Ensures reliable and accurate analysis.
• Reduces errors and improves data integrity.
• Enhances decision-making by eliminating inconsistencies.
• Facilitates seamless integration of structured and unstructured data.

Conclusion
Data Cleaning is a foundational step in Data Wrangling, helping to refine raw data into a
structured and meaningful format. By removing errors, inconsistencies, and noise, it
improves data quality, making it more suitable for analysis and insights.

Data Integration and Transformation are two essential steps in this process that ensure data
is unified, consistent, and ready for analytics.

Q. Data Integration as a Method of Data Wrangling


Definition: Data integration is the process of combining data from multiple sources into a
single, coherent data store. It involves managing metadata, resolving conflicts, and detecting
redundancies to provide a unified view of the data.

Importance of Data Integration:


• Ensures a unified view of scattered data.
• Maintains data consistency and accuracy.
• Helps in decision-making by eliminating data silos.

Challenges in Data Integration:

1. Entity Identification Problem
• Data is collected from multiple heterogeneous sources, and different sources may use
different identifiers for the same real-world entity.
• Example: One dataset has customer_id, while another uses customer_number;
metadata is used to match these attributes correctly.
2. Redundancy
• Duplicate data or unnecessary attributes can lead to inefficiencies.
• Example: If one dataset contains customer age and another has date of birth, the age
is redundant since it can be derived from the date of birth.
• Correlation analysis and covariance analysis are used to detect and remove redundant
attributes (see the sketch after this list).
3. Detection and Resolution of Data Value Conflicts
• For the same real-world entity, attribute values from different sources may differ.
• Possible reasons: different representations, different scales.
• Example: metric vs. imperial (British) units.
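
A minimal pandas sketch of entity identification and redundancy detection, assuming two hypothetical sources whose identifiers and columns are illustrative.

```python
import pandas as pd

# Two hypothetical sources naming the same entity differently.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "age": [34, 28, 45]})
billing = pd.DataFrame({"customer_number": [1, 2, 3],
                        "birth_year": [1991, 1997, 1980]})

# Entity identification: metadata says customer_id matches customer_number,
# so the sources can be merged into a single coherent store.
merged = crm.merge(billing, left_on="customer_id",
                   right_on="customer_number").drop(columns="customer_number")

# Redundancy detection: a correlation near -1 between age and birth_year
# shows one attribute is derivable from the other and can be dropped.
print(merged["age"].corr(merged["birth_year"]))
```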

Q. Data Transformation as a Method of Data Wrangling


Definition: Data transformation is the process of modifying, converting, or aggregating data
into a suitable format for analysis.

Key Methods of Data Transformation:


1. Smoothing
• Removes noise from data using techniques like binning, regression, and clustering.
2. Aggregation
• Summarizes data into a higher-level format for easier interpretation.
3. Generalization
• Replaces raw (low-level) data with higher-level concepts using concept hierarchies.
• Example: Replacing specific dates of birth with broader age groups.
4. Normalization
• Scales numerical attributes to fit within a specific range (see the sketch after this list).
• Techniques include:
◦ Min-Max Normalization: Rescales values between a minimum and maximum range.
◦ Z-score Normalization: Adjusts values based on mean and standard deviation.
◦ Decimal Scaling: Moves the decimal point to normalize values.
5. Attribute Construction
• New attributes are created from existing ones to enhance analysis.
• Example: Creating an "Income Bracket" column from raw salary data.
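
The three normalization techniques above in a minimal pandas sketch; the attribute values are hypothetical, and the formulas in the comments are the standard definitions.

```python
import pandas as pd

v = pd.Series([200, 300, 400, 600, 1000])  # hypothetical attribute values

# Min-max normalization: v' = (v - min) / (max - min), mapping into [0, 1].
min_max = (v - v.min()) / (v.max() - v.min())

# Z-score normalization: v' = (v - mean) / std.
z_score = (v - v.mean()) / v.std()

# Decimal scaling: v' = v / 10^j, with the smallest j such that max(|v'|) < 1.
j = len(str(int(v.abs().max())))  # here j = 4, since max |v| = 1000
decimal_scaled = v / 10 ** j
```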

Conclusion
Data Integration and Transformation play a crucial role in preparing data for analytics. They help in
merging, cleaning, and converting raw data into a structured format, making it usable, reliable, and
efficient for decision-making.

Q. Data Reduction as a Method of Data Wrangling


Definition:
Data reduction is the process of reducing the volume of data while preserving its integrity
and analytical value. It helps improve efficiency and storage without affecting the mining
results.

Key Data Reduction Strategies:


1. Data Cube Aggregation
• Aggregates data at different levels (e.g., daily sales → monthly/yearly totals).
2. Attribute Subset Selection
• Removes irrelevant or redundant attributes to reduce dataset size.
• Example: Removing student roll number when predicting CGPA.
3. Dimensionality Reduction
• Reduces the number of variables while retaining key information (see the PCA sketch
after this list).
• Principal Component Analysis (PCA) and the Discrete Wavelet Transform (DWT) are
common techniques.
4. Numerosity Reduction
• Replaces large data volumes with compact models or approximations.
• Parametric (statistical models) and Non-parametric (histograms, clustering)
approaches are used.
5. Discretization & Concept Hierarchy Generation
• Converts continuous data into categorical ranges for better representation.
• Example: Converting age into groups like young, middle-aged, senior.
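
A minimal scikit-learn sketch of dimensionality reduction with PCA on synthetic data; the dataset and the 95% variance threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 100 records, 10 columns driven by only 3 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(100, 3))
X = factors @ rng.normal(size=(3, 10))

# Keep just enough principal components to explain 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)  # e.g. (100, 10) -> (100, 3)
```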

Why Use Data Reduction in Wrangling?

• Improves storage efficiency and processing speed.
• Maintains data quality while reducing size.
• Ensures faster and more effective analysis.

Conclusion
Data Reduction streamlines large datasets, making them easier to store, process, and
analyze, ensuring efficient data wrangling without loss of key insights.

Q. Data Discretization as a Method of Data Wrangling


Definition: Data discretization is the process of dividing continuous attributes into intervals
and replacing actual values with interval labels. It helps in reducing complexity, enabling
classification algorithms that require categorical data, and improving knowledge
representation.

Types of Data Discretization:


1. Based on Class Information:
• Supervised Discretization: Uses class labels to guide interval formation.
• Unsupervised Discretization: Does not use class labels.
2. Based on Direction of Processing:
• Top-Down (Splitting): Starts with a broad range and recursively splits into smaller
intervals.
• Bottom-Up (Merging): Begins with individual values and merges them into larger
intervals.

Techniques for Data Discretization:


1. Binning – Divides data into bins and replaces values with bin labels (see the pandas sketch after this list).
2. Histogram Analysis – Groups data based on frequency distribution.
3. Clustering Analysis – Uses clustering algorithms to create meaningful groups.
4. Entropy-Based Discretization – Uses information gain to determine split points.
5. Segmentation by Natural Partitioning – Forms intervals based on natural divisions in data.
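
A minimal pandas sketch of discretization by binning on a hypothetical continuous attribute; the cut points and labels are illustrative.

```python
import pandas as pd

ages = pd.Series([15, 22, 34, 47, 58, 63, 71])  # hypothetical continuous values

# Top-down splitting with explicit cut points and interval labels.
age_groups = pd.cut(ages, bins=[0, 30, 60, 100],
                    labels=["young", "middle-aged", "senior"])

# Equal-frequency binning: each bin holds roughly the same number of values.
quartiles = pd.qcut(ages, q=4)

print(age_groups.tolist())  # ['young', 'young', 'middle-aged', ...]
```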

Why Use Data Discretization in Wrangling?

• Reduces data complexity and improves interpretability.
• Enables categorical data representation for certain models.
• Supports hierarchical and multiresolution analysis.

Conclusion
Data Discretization is a key method in data wrangling that simplifies continuous data, making
it easier to analyze, interpret, and use in classification and data mining tasks.
