
Ds Short

The document outlines the challenges in data processing, including volume, variety, velocity, veracity, and value. It defines data science as an interdisciplinary field and describes roles in a data science project, such as data engineer and data scientist. Additionally, it discusses data types, collection, management, regression analysis, common errors in data handling, and the importance of model maintenance.

1. Different Facets of Data and Challenges in Processing

• Volume: Handling large datasets requires efficient storage and computing power.
• Variety: Data can be structured, unstructured, or semi-structured, requiring different processing techniques.
• Velocity: Real-time data streams need fast processing.
• Veracity: Ensuring data accuracy and reliability is challenging.
• Value: Extracting meaningful insights from raw data is complex.

2. Statistical Description of Data

• Measures of Central Tendency: Mean, Median, Mode
• Measures of Dispersion: Variance, Standard Deviation, Range
• Shape of Data Distribution: Skewness, Kurtosis
• Correlation: Relationship between variables
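As a rough illustration of the measures listed above, the sketch below computes them with Python's standard `statistics` module. The sample values and the helper `standardized_moment` are assumptions made up for this example; skewness and kurtosis are computed in their population form as the third and fourth standardized moments.

```python
import statistics as st

# Hypothetical sample, invented for illustration only.
data = [2, 4, 4, 5, 7, 9, 11]

# Central tendency
mean = st.mean(data)
median = st.median(data)
mode = st.mode(data)

# Dispersion (population forms)
variance = st.pvariance(data)
std_dev = st.pstdev(data)
value_range = max(data) - min(data)


def standardized_moment(values, k):
    """Mean of the z-scores raised to the k-th power (population form)."""
    mu = st.mean(values)
    sigma = st.pstdev(values)
    return sum(((x - mu) / sigma) ** k for x in values) / len(values)


# Shape: skewness > 0 suggests a longer right tail;
# kurtosis is 3 for a normal distribution under this definition.
skewness = standardized_moment(data, 3)
kurtosis = standardized_moment(data, 4)

print(mean, median, mode, value_range)
print(round(variance, 3), round(std_dev, 3), round(skewness, 3), round(kurtosis, 3))
```

Correlation needs two variables, so it is left out of this single-variable sketch.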

3. Definition of Data Science

Data Science is an interdisciplinary field that combines statistics, programming, and domain
knowledge to extract insights and knowledge from structured and unstructured data using
analytical and machine learning techniques.

4. Use of Roles in a Data Science Project

• Data Engineer: Prepares and manages data pipelines.
• Data Scientist: Develops models and performs analysis.
• Machine Learning Engineer: Deploys and maintains models.
• Business Analyst: Interprets insights for business decisions.

5. Difference Between Structured and Unstructured Data

Feature | Structured Data | Unstructured Data
Format | Well-organized (tables, rows, columns) | Freeform (text, images, videos)
Storage | Relational Databases (SQL) | NoSQL, Data Lakes
Processing | Easier to analyze | Complex analysis required
Example | Sales records, Customer details | Social media posts, Emails

6. Short Notes on Data Collection and Management

• Data Collection: Gathering data from various sources (surveys, IoT, databases).
• Data Cleaning: Handling missing values, duplicates, and inconsistencies.
• Data Storage: Using databases, data warehouses, or cloud storage.
• Data Governance: Ensuring data security, privacy, and compliance.

7. Difference Between Simple Regression and Multiple Regression

Feature | Simple Regression | Multiple Regression
Number of Predictors | One independent variable | Two or more independent variables
Complexity | Easier to interpret | More complex
Example | Predicting house price based on area | Predicting house price based on area, location, and number of rooms
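To make the comparison concrete, here is a minimal sketch that fits both kinds of model with ordinary least squares, using the normal equations and plain Python. The helper names `solve` and `fit_linear` and the house-price numbers are assumptions invented for this example. Location is left out because it would first need a numeric encoding.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_linear(rows, targets):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_p]."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, targets)) for i in range(p)]
    return solve(xtx, xty)


# Synthetic data: price = 50 + 2 * area + 10 * rooms (no noise).
areas = [50, 60, 80, 100, 120]
rooms = [2, 3, 3, 4, 5]
prices = [50 + 2 * a + 10 * r for a, r in zip(areas, rooms)]

# Simple regression: one predictor (area).
simple = fit_linear([[a] for a in areas], prices)

# Multiple regression: two predictors (area and number of rooms).
multiple = fit_linear(list(zip(areas, rooms)), prices)

print("simple:  ", [round(c, 3) for c in simple])
print("multiple:", [round(c, 3) for c in multiple])
```

Because the synthetic prices depend on both predictors, the multiple model recovers the generating coefficients, while the simple model has to fold the effect of rooms into its area slope and intercept.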

8. Common Errors in Data Retrieval and Cleansing Solutions

• Missing Data → Use imputation or remove incomplete records
• Duplicate Data → Identify and remove duplicates
• Inconsistent Data → Standardize formats and correct errors
• Outliers → Detect and handle using statistical methods
• Encoding Errors → Convert data to a consistent format
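The sketch below walks through four of these fixes (duplicates, inconsistent formats, outliers, missing values) on a small in-memory dataset. The records, the choice of the 1.5 × IQR rule for outliers, and median imputation are all assumptions chosen for illustration; other methods may suit real data better.

```python
import statistics as st

# Hypothetical raw records: (customer_id, city, amount). None marks a missing value.
raw = [
    (1, "Delhi ", 120.0),
    (2, "mumbai", None),
    (2, "mumbai", None),     # exact duplicate of the previous record
    (3, "DELHI", 130.0),
    (4, "Pune", 110.0),
    (5, "pune", 125.0),
    (6, "Mumbai", 5000.0),   # suspiciously large amount
    (7, "Delhi", 115.0),
    (8, "Pune", 135.0),
    (9, "mumbai", 128.0),
]

# Duplicate data: keep the first occurrence of each record, preserving order.
deduped = list(dict.fromkeys(raw))

# Inconsistent data: trim whitespace and normalise capitalisation of city names.
standardized = [(cid, city.strip().title(), amt) for cid, city, amt in deduped]

# Outliers: flag amounts outside 1.5 * IQR of the known values.
known = sorted(amt for _, _, amt in standardized if amt is not None)
q1, _, q3 = st.quantiles(known, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outlier_ids = [
    cid for cid, _, amt in standardized
    if amt is not None and not (low <= amt <= high)
]

# Missing data: impute with the median of the non-outlier amounts.
typical = st.median([amt for amt in known if low <= amt <= high])
cleaned = [
    (cid, city, typical if amt is None else amt)
    for cid, city, amt in standardized
]

print("outliers:", outlier_ids)
print("imputed value:", typical)
```

Here the outlier is only flagged, not removed; whether to drop, cap, or keep it depends on the domain.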

9. Model Maintenance and Relevance

• Difficulty: Models degrade over time due to changing data patterns.
• Lifespan: If left untouched, models may remain relevant for weeks to months, depending on data drift.
• Solution: Regular retraining, monitoring for concept drift, and updating features.
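One simple way to monitor for drift can be sketched as follows: compare the mean of a feature in recent data against its mean at training time, measured in training standard deviations. The function names, the threshold of 2.0, and the sample values are assumptions for illustration. This check only catches a shift in the input distribution; detecting concept drift in the strict sense also requires tracking model error against fresh labels.

```python
import statistics as st


def drift_score(baseline, recent):
    """Shift of the recent mean from the baseline mean, in baseline standard deviations."""
    return abs(st.mean(recent) - st.mean(baseline)) / st.stdev(baseline)


def needs_retraining(baseline, recent, threshold=2.0):
    """Flag the model for retraining when the shift exceeds the (assumed) threshold."""
    return drift_score(baseline, recent) > threshold


# Hypothetical values of one feature at training time and in two later windows.
training = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stable_window = [10.1, 9.9, 10.3, 9.7]
shifted_window = [13.0, 12.5, 13.5, 12.8]

print(needs_retraining(training, stable_window))
print(needs_retraining(training, shifted_window))
```

In practice such a check would run on a schedule for each important feature, and a flagged window would trigger a closer look before any retraining.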

10. Statistical Description of Data

• Central Tendency: Mean, Median, Mode
• Dispersion: Range, Variance, Standard Deviation
• Distribution: Histogram, Box Plot
• Correlation: Pearson, Spearman
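The two correlation measures named above can differ on the same data. The sketch below, written in plain Python with hypothetical helper names (`pearson`, `ranks`, `spearman`) and made-up data, computes Pearson's coefficient directly and Spearman's as the Pearson coefficient of the ranks.

```python
import statistics as st


def pearson(xs, ys):
    """Pearson correlation: strength of the linear relationship."""
    mx, my = st.mean(xs), st.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data with a monotonic but non-linear relationship: y = x ** 3.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(round(pearson_r, 3), round(spearman_rho, 3))
```

Since y always increases with x, the rank-based Spearman value is 1 (up to rounding), while Pearson falls below 1 because the relationship is not a straight line.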
