Dsbda Unit 1
Introduction to Data Science and Big Data
Definitions
Data Science:- Data Science is an interdisciplinary field that aims to discover and extract
actionable knowledge from various forms of data to support business decisions and make
predictions.
Big Data:- Big Data refers to extremely large and diverse collections of structured,
semi-structured, and unstructured data that continue to grow exponentially over time. These
datasets are so large and complex in volume, velocity, and variety that traditional data
management systems cannot store, process, or analyze them.
https://lite.evernote.com/ce Page 1 of 18
Evernote 09/03/25, 2:07 PM
• Data is a collection of raw facts and figures that convey specific information but are not
yet organized.
• It includes numbers, words, measurements, observations, or descriptions of things.
• Data acts as raw material in the production of information.
• Types of data:
◦ Record data
◦ Data matrix
◦ Document data
◦ Transaction data
◦ Graph data
◦ Ordered data
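As a rough illustration of the list above, each of these data types can be represented with plain Python structures. All names and values below are made up for illustration and do not come from the notes.

```python
from collections import Counter

# Hypothetical toy examples of the data types listed above.

# Record data: every record has the same fixed set of attributes.
records = [
    {"id": 1, "age": 25, "income": 40000},
    {"id": 2, "age": 32, "income": 52000},
]

# Data matrix: records with only numeric attributes, viewed as rows x columns.
data_matrix = [[r["age"], r["income"]] for r in records]

# Document data: a document reduced to term frequencies.
doc_vector = Counter("data science uses data".split())

# Transaction data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
]

# Graph data: nodes linked by edges, stored here as an adjacency list.
graph = {"A": ["B", "C"], "B": ["C"], "C": []}

# Ordered data: values whose sequence matters, e.g. (hour, reading) pairs.
temperatures = [(1, 21.5), (2, 22.0), (3, 21.8)]

# Count how many transactions contain "bread".
bread_count = sum("bread" in t for t in transactions)
print(data_matrix, doc_vector["data"], bread_count)
```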
2. Data Science
3. Big Data
Conclusion
• Data Science and Big Data are essential for analyzing and utilizing vast amounts of data
effectively.
• They help organizations predict trends, improve decision-making, and enhance
efficiency.
• With the increasing growth of data, advanced analytics, AI, and machine learning are
necessary to extract meaningful insights.
a. Helps in identifying patterns and detecting objects in images, widely used in facial
recognition and medical imaging.
5. Logistics
a. Optimizes delivery routes for faster transportation, ensuring efficient supply chain
management.
b. Uses real-time traffic information to plan and adjust routes.
6. Predicting Future Market Trends
a. Analyzes large-scale data to identify emerging market trends.
b. Tracking purchase behavior, influencer impact, and search queries helps businesses
understand consumer interests.
7. Recommendation Systems
a. Platforms like Netflix and Amazon use data science to provide personalized movie and
product recommendations based on user behavior.
8. Streamlining Manufacturing
a. Identifies inefficiencies in manufacturing by analyzing high volumes of production data.
b. Algorithms help in cleaning, sorting, and interpreting data quickly and accurately,
improving productivity.
Conclusion
• Data Science is transforming industries by improving efficiency, decision-making, and
customer experiences.
• It plays a crucial role in healthcare, gaming, logistics, marketing, and manufacturing,
making processes more data-driven and automated.
1. Volume
• Refers to the large scale of data, often in terabytes or petabytes, which exceeds the
capacity of conventional relational databases.
• Managing and processing such vast amounts of data requires specialized Big Data
technologies like Hadoop and Spark.
2. Velocity
• Represents the speed at which data is generated and processed, often in real-time.
• Examples include social media updates, IoT sensor data, and financial transactions
that require immediate processing for timely insights.
3. Variety
• Describes the diverse types and sources of data, which can be structured (databases,
spreadsheets), semi-structured (XML, JSON), or unstructured (videos, images, social
media posts).
• Handling this variety requires flexible data storage and processing frameworks.
4. Value
• The business value derived from Big Data analysis is its ultimate goal.
• Organizations leverage Big Data for decision-making, trend analysis, and predictive
analytics, improving efficiency and profitability.
• In real-time spatial Big Data, visualization enhances decision-making in areas like
climate monitoring, traffic analysis, and inventory management.
5. Veracity
• Refers to the trustworthiness and accuracy of data, as inaccurate or misleading data
can affect insights and decisions.
• Since data comes from multiple sources, ensuring data integrity, quality, and
credibility is crucial for effective analytics.
These 5 V’s define the characteristics and challenges of Big Data, emphasizing the need for
advanced analytics and processing techniques to extract meaningful insights.
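To make the Variety point more concrete, the sketch below reads one structured, one semi-structured, and one unstructured input using only the Python standard library. The inline sample strings and field names are assumptions chosen for illustration.

```python
import csv
import io
import json

# Structured: tabular text with a fixed schema (CSV).
csv_text = "id,amount\n1,250\n2,400\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Semi-structured: JSON carries its own flexible, nested structure.
json_text = '{"user": "u1", "tags": ["sale", "mobile"]}'
event = json.loads(json_text)

# Unstructured: free text has no schema; here we only count words.
post = "Loved the new phone, great camera"
word_count = len(post.split())

# Each form needs a different way of getting at the values.
total_amount = sum(int(r["amount"]) for r in rows)
print(total_amount, event["tags"], word_count)
```

Each input needs its own parsing step before analysis, which is why flexible storage and processing frameworks are needed.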
Data Handling: Data Science works with raw, large-scale, and complex datasets; Information
Science deals with structured, processed, and organized information.
Key Takeaways:
• Data Science focuses on analyzing and extracting insights from data.
• Information Science deals with managing and organizing information for effective access
and usage.
• Data Science is more technical and algorithm-driven, while Information Science is more
structural and organizational.
Q. Explain different phases of data analytics life cycle with neat diagram.
The Data Analytics Life Cycle consists of six key phases, each playing a crucial role in
transforming raw data into actionable insights.
1. Discovery
In this initial phase, the team gathers information about the business domain, objectives, and
available resources. The focus is on understanding past experiences, potential challenges,
and formulating hypotheses.
Key Activities:
2. Data Preparation
This phase involves setting up an analytic sandbox, where data can be extracted,
transformed, and loaded (ETL or ETLT). The team ensures the data is clean, structured, and
ready for analysis.
Key Activities:
• Preparing the Analytic Sandbox
• Performing ETLT (Extract, Transform, Load, and Transform)
• Understanding and Familiarizing with the Data
• Data Conditioning (Cleaning, Handling Missing Values, etc.)
• Surveying and Visualizing Data
• Using Common Data Preparation Tools
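The activities above can be sketched as a very small extract, transform, and load pass. This is only a minimal sketch: an in-memory SQLite database stands in for the analytic sandbox, and the inline CSV, table name, and column names are hypothetical.

```python
import csv
import io
import sqlite3
from statistics import mean

# Extract: read raw rows (an inline CSV stands in for a source system).
raw = "customer,spend\nA,120\nB,\nC,80\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform (data conditioning): convert types and fill the missing
# spend with the mean of the known values.
known = [float(r["spend"]) for r in rows if r["spend"]]
fill = mean(known)
clean = [
    (r["customer"], float(r["spend"]) if r["spend"] else fill)
    for r in rows
]

# Load: write into an in-memory SQLite database acting as the sandbox.
sandbox = sqlite3.connect(":memory:")
sandbox.execute("CREATE TABLE spend (customer TEXT, amount REAL)")
sandbox.executemany("INSERT INTO spend VALUES (?, ?)", clean)

# Survey the loaded data with a quick summary query.
count, avg = sandbox.execute(
    "SELECT COUNT(*), AVG(amount) FROM spend"
).fetchone()
print(count, avg)
```

Mean imputation is just one way to handle missing values; dropping the row or using the median are equally valid choices depending on the data.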
3. Model Planning
The team determines the analytical techniques, methods, and workflows to be used in the
next phase. This includes selecting key variables and exploring relationships between them.
Key Activities:
• Data Exploration and Variable Selection
• Choosing the Best Model for the Problem
• Selecting the Most Suitable Analytical Techniques
• Using Common Tools for Model Planning
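One simple way to explore relationships between variables is to rank candidate inputs by their Pearson correlation with the target. The sketch below assumes a small made-up dataset; the variable names `sales`, `ad_spend`, and `store_id` are hypothetical, and correlation is only one of several possible selection criteria.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical data: the target is sales; candidates are two inputs.
sales = [10, 20, 30, 40, 50]
candidates = {
    "ad_spend": [1, 2, 3, 4, 5],
    "store_id": [3, 1, 5, 2, 4],
}

# Score every candidate and keep the one most strongly related to sales.
scores = {name: pearson(vals, sales) for name, vals in candidates.items()}
best = max(scores, key=lambda k: abs(scores[k]))
print(scores, best)
```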
4. Model Building
In this phase, the team develops and tests models using different datasets (training, testing,
and production). The execution environment is also evaluated for efficiency.
Key Activities:
• Developing Training and Testing Datasets
• Building and Running Models Based on Selected Techniques
• Evaluating Computational Requirements (e.g., Fast Hardware, Parallel Processing)
• Using Common Tools for Model Building
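As a minimal sketch of this phase, the example below splits a dataset into training and testing sets, fits a straight line by ordinary least squares, and checks it on the held-out data. The noise-free data (y = 2x + 1), the 15/5 split, and the function name `fit_line` are assumptions made for illustration.

```python
import random

# Hypothetical dataset following y = 2x + 1 exactly (no noise), for clarity.
data = [(x, 2 * x + 1) for x in range(20)]

# Split into training and testing sets with a fixed seed for repeatability.
rng = random.Random(0)
shuffled = data[:]
rng.shuffle(shuffled)
train, test = shuffled[:15], shuffled[15:]

def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = fit_line(train)

# Evaluate on the held-out test set with mean squared error.
mse = sum((slope * x + intercept - y) ** 2 for x, y in test) / len(test)
print(slope, intercept, mse)
```

Testing on data the model has not seen is what shows whether it generalizes, rather than merely memorizing the training set.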
5. Communicate Results
The team collaborates with stakeholders to assess the success of the project. The results are
presented in a clear and structured manner.
Key Activities:
• Identifying Key Findings
• Quantifying Business Value and Model Performance
• Developing a Narrative to Convey Insights
• Presenting Results to Stakeholders
6. Operationalize
In the final phase, the models and insights are deployed into a production environment. A
pilot project may be run to test the models before full implementation.
Key Activities:
• Delivering Final Reports, Briefings, Code, and Technical Documentation
• Deploying Models into Production
• Running a Pilot Project to Validate Performance
Conclusion
The Data Analytics Life Cycle ensures a structured approach to deriving insights from data.
Each phase plays a vital role in improving decision-making, optimizing operations, and driving
business value.
In summary, data wrangling is essential for businesses and analysts to derive meaningful
insights from raw data, ensuring efficiency and accuracy in decision-making.
Conclusion
Data Cleaning is a foundational step in Data Wrangling, helping to refine raw data into a
structured and meaningful format. By removing errors, inconsistencies, and noise, it
improves data quality, making it more suitable for analysis and insights.
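A small sketch of typical cleaning steps (standardizing values, removing duplicates, and filling missing values) is shown here. The records and the choice of median imputation are assumptions for illustration only.

```python
from statistics import median

# Hypothetical raw records with a duplicate, inconsistent casing,
# stray whitespace, and a missing age.
raw = [
    {"name": " Alice ", "city": "pune", "age": "30"},
    {"name": "alice", "city": "Pune", "age": "30"},
    {"name": "Bob", "city": "MUMBAI", "age": ""},
    {"name": "Cara", "city": "Pune", "age": "41"},
]

def standardize(rec):
    """Trim whitespace, unify casing, and convert age to int or None."""
    return {
        "name": rec["name"].strip().title(),
        "city": rec["city"].strip().title(),
        "age": int(rec["age"]) if rec["age"].strip() else None,
    }

# Standardize first, then drop exact duplicates while keeping order.
seen, cleaned = set(), []
for rec in map(standardize, raw):
    key = (rec["name"], rec["city"], rec["age"])
    if key not in seen:
        seen.add(key)
        cleaned.append(rec)

# Fill missing ages with the median of the known ages.
known_ages = [r["age"] for r in cleaned if r["age"] is not None]
for r in cleaned:
    if r["age"] is None:
        r["age"] = median(known_ages)

print(cleaned)
```

Standardizing before deduplicating matters: the two "Alice" rows only become identical once casing and whitespace are fixed.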
Data Integration and Transformation are two essential steps in this process that ensure data
is unified, consistent, and ready for analytics.
Conclusion
Data Integration and Transformation play a crucial role in preparing data for analytics. They help in
merging, cleaning, and converting raw data into a structured format, making it usable, reliable, and
efficient for decision-making.
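The two steps can be sketched together: integration merges records from two sources on a shared key, and transformation rescales a numeric attribute with min-max normalization. The two sources `crm` and `sales` and their fields are hypothetical.

```python
# Two hypothetical sources keyed by customer id.
crm = {1: {"name": "Alice"}, 2: {"name": "Bob"}, 3: {"name": "Cara"}}
sales = {1: {"spend": 200.0}, 2: {"spend": 500.0}, 3: {"spend": 800.0}}

# Integration: merge records that share the same key (an inner join).
merged = [
    {"id": k, **crm[k], **sales[k]}
    for k in sorted(crm.keys() & sales.keys())
]

# Transformation: min-max normalization rescales spend into [0, 1].
# This assumes the values are not all identical (hi != lo).
lo = min(r["spend"] for r in merged)
hi = max(r["spend"] for r in merged)
for r in merged:
    r["spend_scaled"] = (r["spend"] - lo) / (hi - lo)

print(merged)
```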
Conclusion
Data Reduction streamlines large datasets, making them easier to store, process, and
analyze, ensuring efficient data wrangling without loss of key insights.
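Two common reduction ideas, aggregation and sampling, can be sketched as follows. The daily sales data and the sample size are made up for illustration; which reduction technique fits best depends on the analysis.

```python
import random
from collections import defaultdict

# Hypothetical daily sales records: (month, amount).
daily = [("Jan", 10), ("Jan", 20), ("Feb", 5), ("Feb", 15), ("Feb", 30)]

# Aggregation: replace many daily rows with one total per month.
monthly = defaultdict(int)
for month, amount in daily:
    monthly[month] += amount

# Sampling: keep a random subset of rows (fixed seed for repeatability).
sample = random.Random(42).sample(daily, k=3)

print(dict(monthly), len(sample))
```

The monthly totals are smaller to store yet still add up to the same overall figure, which is the sense in which reduction keeps the key information.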
Conclusion
Data Discretization is a key method in data wrangling that simplifies continuous data, making
it easier to analyze, interpret, and use in classification and data mining tasks.
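As one possible illustration, equal-width binning splits the range of a continuous attribute into k intervals of the same width and replaces each value with its interval label. The function name `equal_width_bins`, the ages, and the labels are assumptions; other methods such as equal-frequency binning exist.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k equal-width bins, numbered 0..k-1.

    Assumes the values are not all identical (max > min).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # min(..., k - 1) keeps the maximum value inside the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

ages = [18, 22, 25, 31, 40, 47, 58, 60]
labels = ["young", "middle", "senior"]

bins = equal_width_bins(ages, 3)
binned = [labels[b] for b in bins]
print(list(zip(ages, binned)))
```

Here the range 18 to 60 is split into three intervals of width 14, so continuous ages become three categories that a classifier can use directly.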