
1.1 Basics of the Need for Data Science and Big Data


In today’s digital world, massive amounts of data are generated every second.
Extracting useful insights from this data is crucial for decision-making, automation,
and innovation. This is where Data Science and Big Data come into play.
Need for Data Science
Data Science is an interdisciplinary field that combines statistics, mathematics,
programming, and domain knowledge to extract meaningful insights from structured
and unstructured data.
Why is Data Science Needed?
• Data Explosion: With billions of internet users, IoT devices, and business transactions, organizations need to process vast amounts of data.
• Better Decision-Making: Companies leverage data science for data-driven decisions, improving efficiency and profitability.
• Automation & AI: Machine Learning and AI models, powered by data science, help automate processes, from chatbots to self-driving cars.
• Personalization: Streaming platforms (like Netflix), e-commerce sites (like Amazon), and social media (like Facebook) use data science to personalize recommendations.
• Fraud Detection: Banks and financial institutions use data science to detect fraudulent activities in real time.
• Healthcare Advancements: Predictive analytics help diagnose diseases, improve treatment plans, and manage healthcare data efficiently.
Need for Big Data
Big Data refers to extremely large and complex datasets that traditional data
processing tools cannot handle efficiently.
Why is Big Data Needed?
• Volume of Data: Data is generated from social media, sensors, transactions, etc., requiring scalable storage and processing solutions.
• Velocity of Data: Real-time processing is necessary for quick decision-making in financial markets, healthcare, and cybersecurity.
• Variety of Data: Data comes in multiple formats (structured, semi-structured, and unstructured) like text, images, audio, and video.
• Business Insights: Companies use Big Data analytics to understand customer behaviour, market trends, and operational efficiency.
• Predictive Analytics: Businesses forecast future trends, optimize logistics, and manage risks using big data insights.
Difference Between Data Science and Big Data
Data Science is the interdisciplinary field and process of extracting insights from data using statistics, programming, and machine learning, whereas Big Data refers to the extremely large, fast-arriving, and varied datasets themselves, which traditional tools cannot handle efficiently and which are managed with technologies such as Hadoop and Apache Spark.

Applications of Data Science:
Personalized recommendations (Netflix, Amazon, Facebook), fraud detection in banking and finance, predictive analytics in healthcare, and automation through AI systems such as chatbots and self-driving cars.

1.2 Data Explosion
Data Explosion refers to the rapid and exponential growth of digital data generated
across various sources, including social media, IoT devices, business transactions, and
more.
This massive increase in data volume, velocity, and variety poses challenges for
storage, processing, and analysis.
Causes of Data Explosion
1. Growth of the Internet & Social Media
o Billions of users generate text, images, videos, and interactions daily.
o Platforms like Facebook, Instagram, YouTube, and Twitter contribute to
high data generation.
2. IoT (Internet of Things) Devices
o Smart home devices, wearables (smartwatches), and industrial sensors
continuously produce data.
o Example: A smart car collects GPS, speed, and engine performance data in real time.
3. E-commerce & Online Transactions
o Online shopping platforms (Amazon, Flipkart) generate huge amounts
of customer data, including purchase history, preferences, and reviews.
4. Cloud Computing & Digital Transformation
o Businesses shift to cloud-based storage and applications, increasing
data traffic.
o Example: Google Drive, Dropbox, and AWS store vast amounts of files
and application logs.
5. Advancements in AI & Machine Learning
o AI models require large datasets for training and improvement,
increasing storage and processing demands.
o Example: Chatbots and voice assistants (Alexa, Siri) process vast amounts of speech and text data.
6. Streaming Services & Multimedia Content
o Platforms like YouTube, Netflix, and Spotify generate petabytes of data
daily through video streaming, user preferences, and content
recommendations.
7. Scientific & Healthcare Data
o Genomic sequencing, medical imaging, and patient records contribute
to vast data volumes.
o Example: COVID-19 data tracking involved processing billions of test
results globally.
Challenges of Data Explosion
1. Storage Issues – Traditional databases struggle to handle massive data
volumes.
2. Processing Speed – Analyzing huge datasets in real time is challenging.
3. Security & Privacy – Increased data breaches and misuse risks.
4. Data Quality – Large-scale data often contains noise, redundancy, or
inconsistencies.
Solutions for Managing Data Explosion
• Big Data Technologies – Hadoop and Apache Spark for large-scale data processing (a Spark sketch follows below).
• Cloud Computing – AWS, Google Cloud, and Microsoft Azure for scalable storage.
• Data Compression & Optimization – Reducing redundant data for efficient storage.
• AI & Machine Learning – Automating data analysis for better decision-making.
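As a rough illustration of the Big Data Technologies point above, the following minimal PySpark sketch processes a large CSV file in a distributed way. The file name transactions.csv and the region/amount columns are assumptions made for this example, not part of these notes.

# Minimal PySpark sketch: distributed processing of a large CSV file.
# Assumes PySpark is installed and "transactions.csv" (with region and amount
# columns) exists; both are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DataExplosionDemo").getOrCreate()

# Spark reads the file lazily and splits the work across executors,
# so the full dataset never has to fit in one machine's memory.
df = spark.read.csv("transactions.csv", header=True, inferSchema=True)

# Aggregate total amount per region; the group-by runs in parallel across nodes.
summary = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()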
1.3 The 5 V’s of Big Data
• Volume – the sheer amount of data generated from social media, sensors, transactions, and devices, requiring scalable storage.
• Velocity – the speed at which data arrives and must be processed, often in real time.
• Variety – the different formats data comes in: structured, semi-structured, and unstructured (text, images, audio, video).
• Veracity – the trustworthiness and quality of the data, given noise, inconsistencies, and uncertainty.
• Value – the usefulness of the insights that can actually be extracted from the data.
[Figure: growth of big data volume]


Relationship between Data Science and Information Science
1.4 Data Science Life Cycle

A data science life cycle is an iterative set of steps you take to deliver a project or analysis; typical stages include problem definition, data collection, data preparation (wrangling), exploratory analysis, model building, evaluation, and deployment.
1.5 Data
Data refers to raw facts, figures, and information that can be processed and analyzed
to extract meaningful insights. In Data Science, data is the foundation for analysis,
machine learning, and decision-making.
Data Types:
1. Structured Data
Definition: Data that is well-organized, follows a fixed format, and is stored in rows and columns (relational/SQL databases, spreadsheets, etc.).
Example: Tables containing sales records, customer details, or employee data.
Sources: Relational databases (MySQL, PostgreSQL), Excel sheets, CRM systems.
Subtypes of Structured Data
✅ Numerical Data: Age, salary, temperature, stock prices.
✅ Categorical Data: Gender (Male/Female), Product Type (Electronics, Clothing).
2. Unstructured Data
Definition: Data that does not follow a predefined format and is difficult to store in
relational databases.
Example: Emails, social media posts, videos, images, audio files.
Sources: Twitter feeds, YouTube videos, WhatsApp messages, customer reviews.
Subtypes of Unstructured Data
✅ Text Data: Emails, blog posts, chat messages.
✅ Multimedia Data: Images, audio recordings, videos.
3. Semi-Structured Data
Definition: Data that is partially structured but does not fit into traditional databases.
It has some level of organization using tags, keys, or metadata.
Example: JSON, XML, log files, NoSQL databases.
Sources: Web pages, APIs, sensor logs, IoT device data.
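The contrast between structured and semi-structured data can be shown directly in Python. Below is a minimal sketch; the table columns, the JSON record, and its fields are invented for illustration and are not taken from these notes.

# Structured data: fixed columns in tabular form, handled well by pandas.
import json
import pandas as pd

sales = pd.DataFrame({
    "order_id": [101, 102, 103],                    # numerical data
    "product": ["Laptop", "Phone", "Headphones"],   # categorical data
    "amount": [55000.0, 20000.0, 1500.0],           # numerical data
})
print(sales.dtypes)

# Semi-structured data: JSON has tags/keys but no rigid table schema.
record = '{"user": "asha", "review": "Great phone!", "tags": ["electronics", "mobile"]}'
parsed = json.loads(record)
print(parsed["tags"])

# Simple JSON records can still be flattened into a structured table when needed.
reviews = pd.json_normalize([parsed])
print(reviews.head())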

1.6 Data Collection
Data collection is the process of gathering data from sources such as relational databases, web APIs, CSV files, web scraping, and IoT sensors; the quality and coverage of this step determine how useful every later stage of analysis can be. A minimal collection sketch follows.
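Here is a minimal sketch of gathering data from two of the sources these notes mention most often, a CSV file and a web API; the file name customers.csv and the URL are hypothetical placeholders.

# Collect structured data from a local CSV file and semi-structured data from
# a (hypothetical) REST API; both sources are placeholders for illustration.
import pandas as pd
import requests

customers = pd.read_csv("customers.csv")                 # assumed local file

response = requests.get("https://api.example.com/orders", timeout=10)
orders = pd.DataFrame(response.json())                   # JSON records -> table

print(customers.shape, orders.shape)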
Data Wrangling:
Data Wrangling, also known as Data Munging, is the process of transforming raw data into a
clean, structured, and usable format for analysis.
It involves various techniques to prepare data before applying statistical or machine learning
models.

1. Raw Data (Input Stage)


• The process starts with raw data collected from various sources such as databases, APIs, CSV files, or unstructured formats.
• Raw data is often messy, containing missing values, inconsistencies, duplicates, and noise.
2. Cleanse (Data Cleaning)
• This step involves cleaning the data to make it usable.
• Key tasks:
o Handling missing values (e.g., removing or filling them).
o Removing duplicate records.
o Fixing incorrect data formats (e.g., converting text to numerical data).
o Removing noise and outliers.
3. Evaluate Usability (Data Transformation)
• After cleaning, the data is evaluated for usability.
• Key tasks:
o Checking if the data is structured correctly for the next steps.
o Performing data normalization (scaling values to a standard range).
o Feature engineering (creating new useful attributes).
o Identifying important attributes for analysis.
4. Analyze (Data Processing & Insights)
• At this stage, usable data is ready for analysis.
• Key tasks:
o Applying statistical techniques to extract insights.
o Running machine learning models if needed.
o Finding patterns, trends, and relationships in the data.
5. Visualize (Results & Reporting)
• The final step is to visualize the findings (a short pandas/Matplotlib sketch of steps 2-5 follows this list).
• Key tasks:
o Creating charts, graphs, and dashboards.
o Using tools like Matplotlib, Seaborn, Power BI, or Tableau.
o Presenting data in an understandable format for decision-making.
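A minimal pandas and Matplotlib sketch of steps 2-5 above, assuming a hypothetical sales_raw.csv file with amount and order_date columns; the columns and operations are illustrative choices, not something prescribed by the notes.

# Data wrangling sketch: cleanse -> transform -> analyze -> visualize.
import pandas as pd
import matplotlib.pyplot as plt

raw = pd.read_csv("sales_raw.csv")                      # 1. raw input (assumed file)

# 2. Cleanse: remove duplicates, fix formats, handle missing values.
clean = raw.drop_duplicates().copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")  # text -> numbers
clean = clean.dropna(subset=["amount"])                 # drop rows still missing amount

# 3. Evaluate usability / transform: normalize and engineer a new feature.
clean["amount_scaled"] = (clean["amount"] - clean["amount"].min()) / (
    clean["amount"].max() - clean["amount"].min()
)
clean["order_month"] = pd.to_datetime(clean["order_date"]).dt.month

# 4. Analyze: simple statistics and patterns.
monthly = clean.groupby("order_month")["amount"].sum()
print(monthly.describe())

# 5. Visualize: report the trend as a chart.
monthly.plot(kind="bar", title="Sales by month")
plt.tight_layout()
plt.show()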
Need of Data Wrangling:
1. Handling Raw & Messy Data
• Real-world data comes from various sources like databases, APIs, and web scraping, often containing errors, inconsistencies, and missing values.
• Wrangling cleans and transforms this data into a structured format suitable for analysis.
2. Improving Data Quality & Consistency
• Ensures accuracy, completeness, and consistency in data.
• Eliminates duplicates, outliers, and incorrect formats that can lead to misleading analysis.
3. Enhancing Data Usability
• Converts raw data into a usable format for machine learning and analytics.
• Helps in normalizing, aggregating, and structuring data efficiently.
4. Boosting Analytical Efficiency
• Prepares clean and structured data, reducing errors and improving analysis speed.
• Avoids computational inefficiencies caused by missing or inconsistent values.
5. Enabling Better Decision-Making
• Accurate, clean data leads to reliable insights for businesses and researchers.
• Prevents incorrect conclusions that could arise from flawed or incomplete data.

✅ What is Dimensionality Reduction?


Dimensionality Reduction is a technique in data preprocessing where we reduce the
number of input variables or features in a dataset, while preserving as much important
information as possible.
It is used when we have datasets with a large number of features (high dimensionality),
which may cause issues like overfitting, slow processing, and difficulty in visualization.

✅ Why Dimensionality Reduction?


• To simplify datasets.
• To improve model performance.
• To reduce computational cost.
• To avoid overfitting caused by irrelevant or redundant features.
• To visualize high-dimensional data in 2D or 3D (see the PCA sketch below).
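To make this concrete, here is a minimal scikit-learn sketch using PCA (Principal Component Analysis), one common dimensionality reduction technique; the notes do not name a specific method, so treat this purely as an illustration on a built-in example dataset.

# Reduce the 64-feature digits dataset to 2 principal components for plotting.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)            # 64 features per sample

X_scaled = StandardScaler().fit_transform(X)   # scale features before PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)             # 64 dimensions -> 2 dimensions

print("Explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, s=8, cmap="tab10")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("Digits reduced from 64 to 2 dimensions")
plt.show()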
