
1.1 Basics of the Need for Data Science and Big Data


In today’s digital world, massive amounts of data are generated every second.
Extracting useful insights from this data is crucial for decision-making, automation,
and innovation. This is where Data Science and Big Data come into play.
Need for Data Science
Data Science is an interdisciplinary field that combines statistics, mathematics,
programming, and domain knowledge to extract meaningful insights from structured
and unstructured data.
Why is Data Science Needed?
• Data Explosion: With billions of internet users, IoT devices, and business transactions, organizations need to process vast amounts of data.
• Better Decision-Making: Companies leverage data science for data-driven decisions, improving efficiency and profitability.
• Automation & AI: Machine Learning and AI models, powered by data science, help automate processes, from chatbots to self-driving cars.
• Personalization: Streaming platforms (like Netflix), e-commerce sites (like Amazon), and social media (like Facebook) use data science to personalize recommendations.
• Fraud Detection: Banks and financial institutions use data science to detect fraudulent activities in real time.
• Healthcare Advancements: Predictive analytics help diagnose diseases, improve treatment plans, and manage healthcare data efficiently.
Need for Big Data
Big Data refers to extremely large and complex datasets that traditional data
processing tools cannot handle efficiently.
Why is Big Data Needed?
• Volume of Data: Data is generated from social media, sensors, transactions, etc., requiring scalable storage and processing solutions.
• Velocity of Data: Real-time processing is necessary for quick decision-making in financial markets, healthcare, and cybersecurity.
• Variety of Data: Data comes in multiple formats (structured, semi-structured, and unstructured) like text, images, audio, and video.
• Business Insights: Companies use Big Data analytics to understand customer behaviour, market trends, and operational efficiency.
• Predictive Analytics: Businesses forecast future trends, optimize logistics, and manage risks using big data insights.
Difference Between Data Science and Big Data
Data Science is the interdisciplinary field and process of extracting insights from data using statistics, programming, and machine learning, whereas Big Data refers to the extremely large, fast-arriving, and varied datasets themselves, which traditional tools cannot handle efficiently and which are managed with technologies such as Hadoop and Apache Spark.

Applications of Data Science:
Personalized recommendations (Netflix, Amazon, Facebook), fraud detection in banking and finance, predictive analytics in healthcare, and automation through AI systems such as chatbots and self-driving cars.

1.2 Data Explosion
Data Explosion refers to the rapid and exponential growth of digital data generated
across various sources, including social media, IoT devices, business transactions, and
more.
This massive increase in data volume, velocity, and variety poses challenges for
storage, processing, and analysis.
Causes of Data Explosion
1. Growth of the Internet & Social Media
o Billions of users generate text, images, videos, and interactions daily.
o Platforms like Facebook, Instagram, YouTube, and Twitter contribute to
high data generation.
2. IoT (Internet of Things) Devices
o Smart home devices, wearables (smartwatches), and industrial sensors
continuously produce data.
o Example: A smart car collects GPS, speed, and engine performance data in real time.
3. E-commerce & Online Transactions
o Online shopping platforms (Amazon, Flipkart) generate huge amounts
of customer data, including purchase history, preferences, and reviews.
4. Cloud Computing & Digital Transformation
o Businesses shift to cloud-based storage and applications, increasing
data traffic.
o Example: Google Drive, Dropbox, and AWS store vast amounts of files
and application logs.
5. Advancements in AI & Machine Learning
o AI models require large datasets for training and improvement,
increasing storage and processing demands.
o Example: Chatbots and voice assistants (Alexa, Siri) process vast amounts of speech and text data.
6. Streaming Services & Multimedia Content
o Platforms like YouTube, Netflix, and Spotify generate petabytes of data
daily through video streaming, user preferences, and content
recommendations.
7. Scientific & Healthcare Data
o Genomic sequencing, medical imaging, and patient records contribute
to vast data volumes.
o Example: COVID-19 data tracking involved processing billions of test
results globally.
Challenges of Data Explosion
1. Storage Issues – Traditional databases struggle to handle massive data
volumes.
2. Processing Speed – Analyzing huge datasets in real time is challenging.
3. Security & Privacy – Increased data breaches and misuse risks.
4. Data Quality – Large-scale data often contains noise, redundancy, or
inconsistencies.
Solutions for Managing Data Explosion
• Big Data Technologies – Hadoop and Apache Spark for large-scale data processing (a Spark sketch follows below).
• Cloud Computing – AWS, Google Cloud, and Microsoft Azure for scalable storage.
• Data Compression & Optimization – Reducing redundant data for efficient storage.
• AI & Machine Learning – Automating data analysis for better decision-making.
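As a rough illustration of the Big Data Technologies point above, the following minimal PySpark sketch processes a large CSV file in a distributed way. The file name transactions.csv and the region/amount columns are assumptions made for this example, not part of these notes.

# Minimal PySpark sketch: distributed processing of a large CSV file.
# Assumes PySpark is installed and "transactions.csv" (with region and amount
# columns) exists; both are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataExplosionDemo").getOrCreate()

# Spark reads the file lazily and splits the work across executors,
# so the full dataset never has to fit in one machine's memory.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Aggregate total amount per region; the group-by runs in parallel across nodes.
summary = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()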
1.3 The 5 V’s of Big Data
• Volume – the sheer amount of data generated from social media, sensors, transactions, and devices, requiring scalable storage.
• Velocity – the speed at which data arrives and must be processed, often in real time.
• Variety – the different formats data comes in: structured, semi-structured, and unstructured (text, images, audio, video).
• Veracity – the trustworthiness and quality of the data, given noise, inconsistencies, and uncertainty.
• Value – the usefulness of the insights that can actually be extracted from the data.
[Figure: growth of big data volume]


Relationship between Data Science and Information Science
1.4 Data Science Life Cycle

A data science life cycle is an iterative set of steps you take to deliver a project or analysis; typical stages include problem definition, data collection, data preparation (wrangling), exploratory analysis, model building, evaluation, and deployment.
1.5 Data
Data refers to raw facts, figures, and information that can be processed and analyzed
to extract meaningful insights. In Data Science, data is the foundation for analysis,
machine learning, and decision-making.
Data Types:
1. Structured Data
Definition: Data that is well-organized, follows a fixed format, and is stored in rows and columns (relational/SQL databases, spreadsheets, etc.).
Example: Tables containing sales records, customer details, or employee data.
Sources: Relational databases (MySQL, PostgreSQL), Excel sheets, CRM systems.
Subtypes of Structured Data
✅ Numerical Data: Age, salary, temperature, stock prices.
✅ Categorical Data: Gender (Male/Female), Product Type (Electronics, Clothing).
2. Unstructured Data
Definition: Data that does not follow a predefined format and is difficult to store in
relational databases.
Example: Emails, social media posts, videos, images, audio files.
Sources: Twitter feeds, YouTube videos, WhatsApp messages, customer reviews.
Subtypes of Unstructured Data
✅ Text Data: Emails, blog posts, chat messages.
✅ Multimedia Data: Images, audio recordings, videos.
3. Semi-Structured Data
Definition: Data that is partially structured but does not fit into traditional databases.
It has some level of organization using tags, keys, or metadata.
Example: JSON, XML, log files, NoSQL databases.
Sources: Web pages, APIs, sensor logs, IoT device data.
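The contrast between structured and semi-structured data can be shown directly in Python. Below is a minimal sketch; the table columns, the JSON record, and its fields are invented for illustration and are not taken from these notes.

# Structured data: fixed columns in tabular form, handled well by pandas.
import json
import pandas as pd

sales = pd.DataFrame({
    "order_id": [101, 102, 103],                    # numerical data
    "product": ["Laptop", "Phone", "Headphones"],   # categorical data
    "amount": [55000.0, 20000.0, 1500.0],           # numerical data
})
print(sales.dtypes)

# Semi-structured data: JSON has tags/keys but no rigid table schema.
record = '{"user": "asha", "review": "Great phone!", "tags": ["electronics", "mobile"]}'
parsed = json.loads(record)
print(parsed["tags"])

# Simple JSON records can still be flattened into a structured table when needed.
reviews = pd.json_normalize([parsed])
print(reviews.head())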

1.6 Data Collection
Data collection is the process of gathering data from sources such as relational databases, web APIs, CSV files, web scraping, and IoT sensors; the quality and coverage of this step determine how useful every later stage of analysis can be. A minimal collection sketch follows.
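Here is a minimal sketch of gathering data from two of the sources these notes mention most often, a CSV file and a web API; the file name customers.csv and the URL are hypothetical placeholders.

# Collect structured data from a local CSV file and semi-structured data from
# a (hypothetical) REST API; both sources are placeholders for illustration.
import pandas as pd
import requests

customers = pd.read_csv("customers.csv")                 # assumed local file

response = requests.get("https://api.example.com/orders", timeout=10)
orders = pd.DataFrame(response.json())                   # JSON records -> table

print(customers.shape, orders.shape)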
Data Wrangling:
Data Wrangling, also known as Data Munging, is the process of transforming raw data into a
clean, structured, and usable format for analysis.
It involves various techniques to prepare data before applying statistical or machine learning
models.

1. Raw Data (Input Stage)


• The process starts with raw data collected from various sources such as databases, APIs, CSV files, or unstructured formats.
• Raw data is often messy, containing missing values, inconsistencies, duplicates, and noise.
2. Cleanse (Data Cleaning)
• This step involves cleaning the data to make it usable.
• Key tasks:
o Handling missing values (e.g., removing or filling them).
o Removing duplicate records.
o Fixing incorrect data formats (e.g., converting text to numerical data).
o Removing noise and outliers.
3. Evaluate Usability (Data Transformation)
• After cleaning, the data is evaluated for usability.
• Key tasks:
o Checking if the data is structured correctly for the next steps.
o Performing data normalization (scaling values to a standard range).
o Feature engineering (creating new useful attributes).
o Identifying important attributes for analysis.
4. Analyze (Data Processing & Insights)
• At this stage, usable data is ready for analysis.
• Key tasks:
o Applying statistical techniques to extract insights.
o Running machine learning models if needed.
o Finding patterns, trends, and relationships in the data.
5. Visualize (Results & Reporting)
• The final step is to visualize the findings (a short pandas/Matplotlib sketch of steps 2-5 follows this list).
• Key tasks:
o Creating charts, graphs, and dashboards.
o Using tools like Matplotlib, Seaborn, Power BI, or Tableau.
o Presenting data in an understandable format for decision-making.
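A minimal pandas and Matplotlib sketch of steps 2-5 above, assuming a hypothetical sales_raw.csv file with amount and order_date columns; the columns and operations are illustrative choices, not something prescribed by the notes.

# Data wrangling sketch: cleanse -> transform -> analyze -> visualize.
import pandas as pd
import matplotlib.pyplot as plt

raw = pd.read_csv("sales_raw.csv")                      # 1. raw input (assumed file)

# 2. Cleanse: remove duplicates, fix formats, handle missing values.
clean = raw.drop_duplicates().copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # text -> numbers
clean = clean.dropna(subset=["amount"])                 # drop rows still missing amount

# 3. Evaluate usability / transform: normalize and engineer a new feature.
clean["amount_scaled"] = (clean["amount"] - clean["amount"].min()) / (
    clean["amount"].max() - clean["amount"].min()
)
clean["order_month"] = pd.to_datetime(clean["order_date"]).dt.month

# 4. Analyze: simple statistics and patterns.
monthly = clean.groupby("order_month")["amount"].sum()
print(monthly.describe())

# 5. Visualize: report the trend as a chart.
monthly.plot(kind="bar", title="Sales by month")
plt.tight_layout()
plt.show()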
Need of Data Wrangling:
1. Handling Raw & Messy Data
• Real-world data comes from various sources like databases, APIs, and web scraping, often containing errors, inconsistencies, and missing values.
• Wrangling cleans and transforms this data into a structured format suitable for analysis.
2. Improving Data Quality & Consistency
• Ensures accuracy, completeness, and consistency in data.
• Eliminates duplicates, outliers, and incorrect formats that can lead to misleading analysis.
3. Enhancing Data Usability
• Converts raw data into a usable format for machine learning and analytics.
• Helps in normalizing, aggregating, and structuring data efficiently.
4. Boosting Analytical Efficiency
• Prepares clean and structured data, reducing errors and improving analysis speed.
• Avoids computational inefficiencies caused by missing or inconsistent values.
5. Enabling Better Decision-Making
• Accurate, clean data leads to reliable insights for businesses and researchers.
• Prevents incorrect conclusions that could arise from flawed or incomplete data.

✅ What is Dimensionality Reduction?


Dimensionality Reduction is a technique in data preprocessing where we reduce the
number of input variables or features in a dataset, while preserving as much important
information as possible.
It is used when we have datasets with a large number of features (high dimensionality),
which may cause issues like overfitting, slow processing, and difficulty in visualization.

✅ Why Dimensionality Reduction?


• To simplify datasets.
• To improve model performance.
• To reduce computational cost.
• To avoid overfitting caused by irrelevant or redundant features.
• To visualize high-dimensional data in 2D or 3D (see the PCA sketch below).
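To make this concrete, here is a minimal scikit-learn sketch using PCA (Principal Component Analysis), one common dimensionality reduction technique; the notes do not name a specific method, so treat this purely as an illustration on a built-in example dataset.

# Reduce the 64-feature digits dataset to 2 principal components for plotting.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)            # 64 features per sample

X_scaled = StandardScaler().fit_transform(X)   # scale features before PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)             # 64 dimensions -> 2 dimensions

print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=8, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Digits reduced from 64 to 2 dimensions")
plt.show()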
