Data Science and Big Data Analytics Unit 1 notes
A data science life cycle is the iterative sequence of steps followed to deliver a data
science project or analysis.
1.5 Data
Data refers to raw facts, figures, and information that can be processed and analyzed
to extract meaningful insights. In Data Science, data is the foundation for analysis,
machine learning, and decision-making.
Data Types:
1. Structured Data
Definition: Data that is well organized, follows a fixed schema of rows and columns, and
is stored in relational (SQL) databases or spreadsheets.
Example: Tables containing sales records, customer details, or employee data.
Sources: Relational databases (MySQL, PostgreSQL), Excel sheets, CRM systems.
Subtypes of Structured Data
✅ Numerical Data: Age, salary, temperature, stock prices.
✅ Categorical Data: Gender (Male/Female), Product Type (Electronics, Clothing).
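The distinction between numerical and categorical columns can be illustrated with a small
sketch. The records, column names, and the helper functions below are invented for
illustration, and the rule used (a column is numerical if every value parses as a number)
is a simplifying assumption, not a general-purpose type detector.

```python
import csv
import io

# Hypothetical structured records; names and values are made up for this example.
CSV_TEXT = """age,salary,gender,product_type
34,52000.0,Female,Electronics
27,41000.5,Male,Clothing
45,67000.0,Female,Electronics
"""


def is_number(value: str) -> bool:
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def classify_columns(csv_text: str) -> dict:
    """Label each column 'numerical' or 'categorical' based on its values."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    kinds = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        all_numeric = all(is_number(v) for v in values)
        kinds[column] = "numerical" if all_numeric else "categorical"
    return kinds


column_kinds = classify_columns(CSV_TEXT)
print(column_kinds)
```

Because every row shares the same columns, the whole table can be processed with one
simple loop. Note that this rule would mislabel numeric codes such as ID numbers, which
are categorical in meaning.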
2. Unstructured Data
Definition: Data that does not follow a predefined format and is difficult to store in
relational databases.
Example: Emails, social media posts, videos, images, audio files.
Sources: Twitter feeds, YouTube videos, WhatsApp messages, customer reviews.
Subtypes of Unstructured Data
✅ Text Data: Emails, blog posts, chat messages.
✅ Multimedia Data: Images, audio recordings, videos.
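Because unstructured text has no columns or fields, it usually has to be processed before
it can be analyzed. The sketch below counts words in a made-up customer review; the review
text and the function name word_counts are assumptions for illustration, and the simple
regular-expression split stands in for real text preprocessing.

```python
import re
from collections import Counter

# Hypothetical customer review; free text with no predefined fields.
review = (
    "The delivery was late, but the phone itself is great. "
    "Great battery, great screen!"
)


def word_counts(text: str) -> Counter:
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


counts = word_counts(review)
print(counts.most_common(3))
```

The counts form a structured summary (word, frequency) derived from the raw text, which is
the kind of step needed before unstructured data can be used for analysis.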
3. Semi-Structured Data
Definition: Data that has no rigid tabular schema and so does not fit neatly into
relational tables, but still carries some organization through tags, keys, or metadata.
Example: JSON and XML documents, log files, records stored in NoSQL document databases.
Sources: Web pages, APIs, sensor logs, IoT device data.
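A short sketch can show what "some level of organization" means in practice. The JSON
readings below are invented, as are the field names (device, temp_c, status) and the
helper flatten. Each record is keyed, but the records do not all share the same keys, so
missing fields must be handled when converting them to uniform rows.

```python
import json

# Hypothetical IoT readings: keyed records whose fields vary from one to the next.
RAW = """[
  {"device": "sensor-1", "temp_c": 21.5, "tags": ["lab"]},
  {"device": "sensor-2", "temp_c": 19.0},
  {"device": "sensor-3", "status": "offline"}
]"""

records = json.loads(RAW)


def flatten(record: dict) -> dict:
    """Map a variable-shaped record onto a fixed set of columns.

    Fields absent from the record become None.
    """
    return {
        "device": record["device"],
        "temp_c": record.get("temp_c"),
        "status": record.get("status"),
    }


rows = [flatten(r) for r in records]
for row in rows:
    print(row)
```

The keys make each value self-describing, which is why such data is easier to parse than
free text, while the varying fields explain why it does not map directly onto a fixed
table.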