Data Science Dse

Data Science is an interdisciplinary field that utilizes scientific methods and algorithms to extract insights from data, encompassing the entire data lifecycle from collection to decision-making. It differs from traditional data analysis by incorporating advanced techniques like machine learning and statistical inference, which are essential for making predictions and understanding data patterns. Key components include data cleaning, exploratory data analysis, model building, and evaluation, with a focus on improving data quality and model performance.

What is Data Science? How is it different from traditional data analysis?

Data Science is an interdisciplinary field that uses scientific methods, algorithms, processes, and systems to extract knowledge and insights from structured and unstructured data. It combines concepts from:
- Statistics
- Computer science
- Mathematics
- Domain knowledge

Data Science involves the entire data lifecycle: data collection, cleaning, exploration, modeling, visualization, and decision-making, using techniques like machine learning, data mining, and predictive analytics. It differs from traditional data analysis, which mainly describes and summarizes historical data, in that it also builds predictive models using machine learning and statistical inference.

Key Components of Data Science:
1. Data Collection & Storage: Using databases, APIs, sensors, etc.
2. Data Cleaning & Preprocessing: Removing noise and inconsistencies.
3. Exploratory Data Analysis (EDA): Summarizing main characteristics using visualization and statistics.
4. Model Building: Applying machine learning algorithms to make predictions or classifications.
5. Model Evaluation & Tuning: Measuring performance and optimizing.
6. Deployment & Decision Making: Implementing the model in real-world systems.

Define statistical inference. How is it used in Data Science?

Statistical inference is the process of using data from a sample to draw conclusions or make estimates about a larger population. It involves applying probability theory to estimate population parameters, test hypotheses, and quantify uncertainty.

Key Elements of Statistical Inference:
1. Population: The entire group you're interested in studying.
2. Sample: A subset of the population used to make inferences.
3. Parameter: A measurable characteristic of a population (e.g., mean, variance).
4. Statistic: A measurable characteristic of a sample, used to estimate a parameter.
5. Inference: Drawing conclusions about the parameter based on the statistic.

Common Techniques in Statistical Inference:
- Estimation (Point & Interval): Estimating population parameters.
- Hypothesis Testing: Assessing whether a claim about a population is likely true.
- Confidence Intervals: Giving a range of plausible values for a parameter.
- P-values: Measuring the strength of evidence against a null hypothesis.

How is Statistical Inference Used in Data Science?

Statistical inference is fundamental to Data Science, particularly in:
- A/B Testing: Determines if a new feature or product leads to a significant improvement.
- Model Evaluation: Helps assess whether model performance is statistically significant.
- Sampling: Allows working with large datasets by analyzing representative samples.
- Feature Selection: Identifies which variables are significantly associated with the target.
- Uncertainty Quantification: Provides confidence intervals for predictions.
- Bias Detection: Tests whether models or data contain significant bias.

What is the difference between a population and a sample? Why do we sample?

- Definition: The population is the complete set of individuals, items, or data you're interested in studying; a sample is a subset of the population selected for analysis.
- Size: The population is usually large or infinite; a sample is a smaller, manageable portion.
- Purpose: The population provides the true characteristics (parameters) of the entire group; the sample is used to make inferences about the population.
- Data: The population contains all possible observations; the sample contains only part of the data.
- Measure: Results computed on the population are called parameters (e.g., population mean, μ); results computed on the sample are called statistics (e.g., sample mean, x̄).

We sample because measuring an entire population is usually too costly, too slow, or impossible; a representative sample lets us estimate population characteristics with quantifiable uncertainty.
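As an illustration of sampling and basic inference, here is a minimal R sketch. It uses simulated data, so the specific numbers are hypothetical: it draws a sample from a simulated population, compares the sample mean (a statistic) with the population mean (a parameter), and uses t.test() to obtain a 95% confidence interval and a p-value.

# Simulated "population" of 100,000 values (illustrative only)
set.seed(42)
population <- rnorm(100000, mean = 50, sd = 10)

# Draw a random sample of 200 observations
sample_data <- sample(population, size = 200)

# Sample statistic vs. population parameter
mean(population)   # population mean (parameter)
mean(sample_data)  # sample mean (statistic)

# 95% confidence interval and hypothesis test for the mean (H0: true mean = 50)
t.test(sample_data, mu = 50)

The confidence interval reported by t.test() quantifies the uncertainty in using the sample mean to estimate the population mean.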
What is a probability distribution? Name and explain any two types.

A probability distribution is a function or rule that assigns probabilities to the possible outcomes of a random variable. It describes how likely different values of the variable are and provides a mathematical framework to model uncertainty.

Types of Probability Distributions
Probability distributions are generally classified into:
- Discrete distributions: for variables that take countable values.
- Continuous distributions: for variables that can take any value within a range.

1. Binomial Distribution (Discrete)
Used when:
- There is a fixed number of trials (n).
- Each trial has only two possible outcomes: success or failure.
- The probability of success (p) remains constant.

Formula:
P(X = k) = C(n, k) · p^k · (1 − p)^(n − k)
Where:
- X: Number of successes
- C(n, k) = n! / (k!(n − k)!): Number of ways to choose k successes in n trials

Example: Flipping a coin 10 times and counting how many times it lands heads.

2. Normal Distribution (Continuous)
Used when:
- The data is symmetrically distributed around a mean.
- Common in natural and social phenomena (e.g., height).

Characteristics:
- Bell-shaped curve
- Mean = Median = Mode
- Described by two parameters: mean (μ) and standard deviation (σ)

Formula (PDF):
f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

Example: Heights of adult males in a country often follow a normal distribution.

Write basic R code to calculate mean, median, and standard deviation of a dataset.

# Sample dataset (numeric vector)
data <- c(12, 15, 22, 9, 18, 30, 24, 17, 21, 14)

# Calculate mean
mean_value <- mean(data)
print(paste("Mean:", mean_value))

# Calculate median
median_value <- median(data)
print(paste("Median:", median_value))

# Calculate standard deviation
sd_value <- sd(data)
print(paste("Standard Deviation:", sd_value))

Explanation:
- mean(data): returns the average.
- median(data): returns the middle value.
- sd(data): returns the standard deviation (a measure of spread).

Explain the process of fitting a model to data.

Fitting a model to data means finding a mathematical relationship between input variables (features) and output variables (targets) so that the model can make predictions or understand patterns.

Steps in Model Fitting:
1. Data Collection
   - Gather raw data from experiments, sensors, databases, or APIs.
2. Data Preprocessing
   - Cleaning: Handle missing values, outliers, and duplicates.
   - Encoding: Convert categorical variables to numerical (e.g., one-hot encoding).
   - Normalization/Scaling: Standardize numeric values for better model performance.
   - Splitting: Divide data into training and testing sets (commonly 70/30 or 80/20).
3. Choosing a Model
   - Select a suitable model based on the task:
     - Linear Regression for continuous outcomes.
     - Logistic Regression or Decision Trees for classification.
     - Clustering algorithms for grouping, etc.
4. Model Training (Fitting)
   - Use the training data to teach the model the relationship between inputs and outputs. The model "learns" by minimizing a loss or cost function (e.g., Mean Squared Error).
5. Model Evaluation
   - Test the model on the testing/validation data.
   - Use performance metrics:
     - Regression: RMSE, R²
     - Classification: Accuracy, Precision, Recall, F1-score
6. Model Tuning (Hyperparameter Optimization)
   - Improve performance by adjusting hyperparameters (e.g., learning rate, tree depth).
   - Use techniques like grid search or cross-validation.
7. Model Deployment
   - Integrate the model into real-world applications or systems for prediction.
8. Monitoring and Maintenance
   - Continuously monitor model performance.
   - Retrain if performance drops due to new data or concept drift.
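To make step 2 concrete, here is a minimal, illustrative R sketch of preprocessing and splitting. The data frame df and its columns (age, city, y) are invented purely for this example.

# Hypothetical data frame (created here so the sketch is self-contained)
df <- data.frame(
  age  = c(23, 35, NA, 41, 29, 52, 37, 44),
  city = c("A", "B", "A", "C", "B", "A", "C", "B"),
  y    = c(10, 15, 12, 20, 14, 25, 18, 21)
)

# Cleaning: impute the missing age with the median
df$age[is.na(df$age)] <- median(df$age, na.rm = TRUE)

# Encoding: treat the categorical column as a factor (modeling functions create dummies)
df$city <- as.factor(df$city)

# Scaling: standardize the numeric feature
df$age <- as.numeric(scale(df$age))

# Splitting: 80/20 train/test split
set.seed(123)
train_idx <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))
train <- df[train_idx, ]
test  <- df[-train_idx, ]

The linear-model example that follows then illustrates the training (fitting) step on a built-in dataset.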
Example in R (Linear Model):

# Load data
data <- mtcars

# Fit linear model: mpg (target) ~ wt (feature)
model <- lm(mpg ~ wt, data = data)

# Summary of the model
summary(model)

What is Exploratory Data Analysis (EDA)? Why is it important?

Exploratory Data Analysis (EDA) is the process of analyzing and summarizing datasets to uncover patterns, detect anomalies, test hypotheses, and check assumptions before applying modeling techniques. It involves both visual and statistical methods to better understand the structure of the data.

Key Objectives of EDA:
1. Understand Data Distribution: See how values are spread across variables.
2. Detect Outliers and Missing Values: Identify data quality issues.
3. Identify Relationships: Find correlations or associations between variables.
4. Summarize Data: Use descriptive statistics like mean, median, range.
5. Guide Model Selection: Choose appropriate modeling techniques based on data behavior.

Common EDA Techniques:
- Numerical summaries: mean, median, mode, variance, standard deviation, min, max, percentiles.
- Visualizations:
  - Histograms: for the distribution of numerical data
  - Boxplots: for spotting outliers
  - Scatter plots: for relationships between two variables
  - Bar charts: for categorical variables
  - Correlation matrices / heatmaps
- Missing value analysis: check how much data is missing and where.

Compare supervised and unsupervised learning with examples.

- Definition: Supervised learning learns from labeled data (input-output pairs); unsupervised learning learns from unlabeled data (no target variable).
- Goal: Supervised learning predicts or classifies the output based on input features; unsupervised learning discovers hidden patterns, structures, or groupings in data.
- Data Requirement: Supervised learning requires a dataset with known outcomes (labels); unsupervised learning works with raw input data only.
- Output: Supervised learning produces a predictive model (e.g., a class or value); unsupervised learning produces groupings, associations, or dimensionality reductions.
- Evaluation Metrics: Supervised learning uses accuracy, precision, recall, RMSE, etc.; unsupervised learning uses silhouette score, inertia, variance explained, etc.
- Examples: Supervised: spam email classification, house price prediction. Unsupervised: customer segmentation, topic modeling, anomaly detection.
- Common Algorithms: Supervised: linear/logistic regression, decision trees, SVM, neural networks. Unsupervised: k-means, hierarchical clustering, PCA, DBSCAN.

Explain Linear Regression with an example.

Linear regression is a supervised learning algorithm used to model the relationship between a dependent variable (target) and one or more independent variables (features) by fitting a straight line.

Types of Linear Regression:
1. Simple Linear Regression: one independent variable
   y = β₀ + β₁x + ε
2. Multiple Linear Regression: more than one independent variable
   y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ + ε

Where:
- y: Dependent variable (target)
- x: Independent variable(s)
- β₀: Intercept
- β₁: Slope (effect of x on y)
- ε: Error term

Objective: Find the line (model) that minimizes the difference (error) between the predicted values and the actual values, typically using the least squares method.

Example: Predicting House Price

Size (sq ft)   Price (in $1000s)
1000           200
1500           250
2000           300
2500           350

You want to build a model to predict house price based on size.

In R: Simple Linear Regression

# Data
size <- c(1000, 1500, 2000, 2500)
price <- c(200, 250, 300, 350)

# Fit linear model
model <- lm(price ~ size)

# Model summary
summary(model)

# Predict price for 1800 sq ft
predict(model, data.frame(size = 1800))

# Plot
plot(size, price, main = "House Price vs Size", col = "blue", pch = 19)
abline(model, col = "red")  # regression line

Interpretation: If the model outputs
Price = 100 + 0.1 × Size
it means:
- Intercept (100): Base price (in $1000s) when size = 0 (theoretically).
- Slope (0.1): Each additional square foot increases the price by 0.1 thousand dollars, i.e., $100.
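The example above is simple linear regression. As an illustrative extension, the sketch below fits a multiple linear regression in R on the built-in mtcars dataset; the choice of predictors (wt and hp) is arbitrary and only meant to show the syntax.

# Multiple linear regression: mpg explained by weight and horsepower
model_multi <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficients and fit statistics (R-squared, p-values)
summary(model_multi)

# Predict mpg for a hypothetical car: wt = 3.0 (3000 lbs) and hp = 110
predict(model_multi, data.frame(wt = 3.0, hp = 110))

Each coefficient is interpreted as the expected change in mpg for a one-unit change in that predictor, holding the other predictor fixed.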
Describe how the k-Nearest Neighbors (k-NN) algorithm works.

The k-Nearest Neighbors (k-NN) algorithm is a simple, non-parametric, supervised learning method used for both classification and regression. It predicts the outcome for a new data point based on how similar it is to nearby points in the training set.

How k-NN Works (Step-by-Step):
1. Choose k: Decide how many neighbors (k) to consider (commonly odd numbers like 3, 5, 7).
2. Calculate Distance: Compute the distance between the new point and all points in the training data. Common distance metrics:
   - Euclidean distance (for two features): d = √((x₂ − x₁)² + (y₂ − y₁)²)
   - Manhattan and Minkowski distances are also used.
3. Find Nearest Neighbors: Identify the k closest data points (neighbors).
4. Make Prediction:
   - Classification: take a majority vote among the k neighbors' classes.
   - Regression: take the average (mean) of the neighbors' values.
5. Return Result: Assign the most common class (or average value) to the new data point.

Example (Classification):
Suppose you want to classify whether a fruit is an apple or an orange based on weight and color, and you set k = 3.
- You calculate the distance between the new fruit and all labeled fruits.
- You pick the 3 closest ones.
- If 2 of them are apples and 1 is an orange, classify the new fruit as an apple.

Advantages of k-NN:
- Simple to understand and implement.
- No training phase (it is a "lazy learner").
- Works well with small datasets.

Disadvantages:
- Slow with large datasets (high computational cost).
- Sensitive to irrelevant features and outliers.
- Choosing the right k is critical.

In R: Simple k-NN Example

library(class)

# Features
train_X <- data.frame(height = c(6.5, 6.0, 5.0, 5.8),
                      weight = c(150, 180, 120, 165))

# Labels
train_Y <- factor(c("Male", "Male", "Female", "Male"))

# New observation
test_X <- data.frame(height = 5.4, weight = 130)

# k-NN classification
prediction <- knn(train = train_X, test = test_X, cl = train_Y, k = 3)
print(prediction)

What is Data Wrangling? Why is it essential?

Data wrangling (also called data cleaning or data preprocessing) is the process of transforming and cleaning raw data into a usable format for analysis. It involves tasks such as handling missing values, correcting errors, standardizing data, and converting data into the appropriate structure and format for further processing.

Key Steps in Data Wrangling:
1. Data Collection: Gathering data from various sources (databases, CSV files, APIs, etc.).
2. Data Cleaning:
   - Handling missing data (e.g., filling, dropping, or imputing missing values).
   - Removing duplicates to avoid redundancy.
   - Correcting inconsistencies (e.g., correcting spelling errors in categorical data).
3. Data Transformation:
   - Converting data types (e.g., changing a column from string to numeric).
   - Normalizing or scaling data for consistency.
   - Aggregating or grouping data to summarize or derive new insights.
4. Data Integration: Merging data from different sources or formats into a single unified dataset.
5. Data Formatting: Structuring data in a consistent format (e.g., date formats, consistent units).

Why is Data Wrangling Essential?
1. Improves Data Quality: Ensures that the dataset is accurate, complete, and consistent, which is crucial for building reliable models.
2. Prepares Data for Analysis: Raw data is often unstructured and messy. Wrangling transforms it into a clean, structured format that can be easily analyzed.
3. Enables Better Insights: A clean dataset allows clearer trends and patterns to emerge, leading to more reliable and actionable insights.
4. Reduces Errors in Analysis: Cleaning the data minimizes the risk of errors, such as biases or inconsistencies that could lead to incorrect conclusions.
5. Optimizes Model Performance: Models trained on well-preprocessed data are more likely to perform better, because the features are consistent and relevant.
6. Saves Time and Resources: Without proper wrangling, the analysis process may involve unnecessary troubleshooting or even lead to invalid results.
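As a small, hypothetical illustration of these wrangling steps in base R (the data frame raw and all its values are invented for the example):

# Hypothetical raw data with common quality problems
raw <- data.frame(
  id    = c(1, 2, 2, 3, 4),
  age   = c("25", "31", "31", "NA", "47"),   # stored as text, with a missing value
  city  = c("Delhi", "delhi", "delhi", "Mumbai", "Pune"),
  sales = c(100, 250, 250, 300, NA)
)

# 1. Remove duplicate rows
clean <- raw[!duplicated(raw), ]

# 2. Convert data types (text to numeric; the "NA" string becomes a real NA)
clean$age <- as.numeric(clean$age)

# 3. Handle missing values (impute with the median)
clean$age[is.na(clean$age)]     <- median(clean$age, na.rm = TRUE)
clean$sales[is.na(clean$sales)] <- median(clean$sales, na.rm = TRUE)

# 4. Fix inconsistent categories (standardize capitalization)
clean$city <- tolower(clean$city)

str(clean)  # inspect the cleaned, consistently typed dataset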
What is the difference between feature generation and feature selection?

Feature Generation:

Definition: Feature generation refers to the process of creating new features from the existing ones. This process aims to improve the model's performance by introducing more relevant or meaningful features.

Goal:
- To enrich the dataset with additional features that may improve the predictive power of the model.
- To create new representations of the data that the model can learn from.

How It Works:
- Mathematical transformations: Applying mathematical functions (e.g., logarithms, square roots) to existing features.
- Interaction terms: Creating new features by combining two or more existing features (e.g., multiplying, adding, or creating ratios).
- Domain-specific features: Creating new features based on domain knowledge (e.g., combining year and month to create a "seasonality" feature).
- Time-based features: Extracting features like "day of the week," "hour of the day," or "month" from timestamps.

Example: If you have height and weight, you could create a new feature called BMI (body mass index) using the formula
BMI = weight (kg) / height (m)²

Feature Selection:

Definition: Feature selection refers to the process of selecting a subset of the most relevant features from the existing set. This helps reduce the complexity of the model and improves model performance by removing irrelevant or redundant features.

Goal:
- To improve model efficiency by reducing dimensionality.
- To eliminate noise and irrelevant data, making the model simpler and faster.

How It Works:
- Filter Methods: Select features based on statistical measures (e.g., correlation, chi-square test, ANOVA) without using a model.
- Wrapper Methods: Evaluate subsets of features based on model performance (e.g., forward selection, backward elimination, recursive feature elimination).
- Embedded Methods: Perform feature selection during model training (e.g., LASSO, decision trees with feature importance).

Example: You may have a dataset with 100 features, but after feature selection you may find that only 10 of them contribute significantly to the model's predictions. You then select only these 10 features to build your model.

Key Differences:
- Purpose: Feature generation creates new, potentially more useful features; feature selection reduces the number of features to eliminate irrelevant ones.
- Impact on Dataset: Generation increases the number of features in the dataset; selection reduces it.
- Approach: Generation adds new variables or transforms existing ones; selection removes irrelevant or redundant features.
- Focus: Generation expands the feature space; selection narrows it down to the most important features.
- Techniques: Generation uses mathematical transformations, domain knowledge, etc.; selection uses statistical tests, model-based methods, etc.
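A minimal R sketch of both ideas, using an invented data frame (the column names, values, and the 0.5 correlation threshold are purely illustrative):

# Hypothetical dataset
df <- data.frame(
  height_m  = c(1.60, 1.75, 1.68, 1.82, 1.55),
  weight_kg = c(55, 80, 62, 90, 50),
  age       = c(25, 34, 29, 41, 23),
  target    = c(1.2, 2.5, 1.8, 3.1, 1.0)
)

# Feature generation: derive BMI from existing columns
df$bmi <- df$weight_kg / df$height_m^2

# Feature selection (simple filter method): keep features whose absolute
# correlation with the target exceeds an arbitrary threshold of 0.5
features <- setdiff(names(df), "target")
cors     <- sapply(features, function(f) cor(df[[f]], df$target))
selected <- features[abs(cors) > 0.5]
print(selected)

In practice, wrapper or embedded methods (e.g., stepwise selection or LASSO) are often preferred over a plain correlation filter.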
What are the components of a user-facing recommendation engine?

A recommendation engine is a system that suggests products, services, content, or information to users based on their preferences, behavior, or characteristics. User-facing recommendation engines are typically part of platforms like e-commerce sites (e.g., Amazon), streaming services (e.g., Netflix), social media (e.g., Facebook), and other online platforms. The main components of a user-facing recommendation engine are:

1. User Interaction Data (Input Data)
Description: Data generated from the user's interactions with the system (e.g., clicks, purchases, ratings, time spent on content).
Types of Data:
- Explicit Feedback: Ratings, likes, or direct user input (e.g., "I like this product").
- Implicit Feedback: Behavior-based data such as clicks, browsing history, or time spent on content.
- Demographic Data: Information about users (e.g., age, gender, location) that can influence recommendations.
Example: On a movie streaming platform, user actions such as watching a movie, rating it, or adding it to a watchlist serve as interaction data.

2. Item Database (Product/Content Catalog)
Description: The collection of all the items that the recommendation engine can suggest to the user. This includes products, movies, books, or other content.
Types of Data:
- Item Features: Descriptive data about each item (e.g., genre, price, artist, director).
- Metadata: Additional information like release date, description, or keywords that help categorize and filter items.
Example: In an e-commerce platform, this would be the database of all products, including details like product descriptions, price, and category.

3. Recommendation Algorithm(s)
Description: The core engine that analyzes user data and item information to generate recommendations. The algorithm is based on one or more methods, such as:
1. Collaborative Filtering:
   - User-based Collaborative Filtering: Recommends items that similar users have liked.
   - Item-based Collaborative Filtering: Recommends items similar to what the user has liked in the past.
2. Content-Based Filtering: Recommends items that are similar to ones the user has liked, based on item features (e.g., recommending action movies if the user has watched many action movies).
3. Hybrid Models: Combine collaborative filtering and content-based filtering to take advantage of both methods.
4. Matrix Factorization: Techniques like Singular Value Decomposition (SVD) decompose the user-item interaction matrix to find latent factors that explain user preferences.
5. Deep Learning: Uses neural networks to learn complex patterns from user-item interactions.
6. Popularity-based Recommendations: Suggest items based on their overall popularity (e.g., top-rated items).
Example: Netflix uses collaborative filtering (e.g., "users who watched this movie also watched...") combined with content-based filtering (e.g., recommending movies from the same genre).

4. Ranking and Personalization
Description: Once recommendations are generated, they need to be ranked and personalized to suit each user.
- Ranking: Sorting the recommended items based on relevance, considering factors such as user preferences, popularity, recency, and expected value to the user.
- Personalization: Tailoring the recommendations to each individual user based on their specific interests, past behavior, and context.
Example: On Spotify, song recommendations are not only personalized based on the user's listening history but also ranked by what is most relevant (e.g., the latest releases or the user's most-listened genres).

5. Filtering and Diversity
Description: To avoid recommending the same items repeatedly and to provide variety, filters are applied. Diversity ensures the user sees a mix of content.
- Diversity Filtering: Prevents the engine from recommending items too similar to each other or ones the user has already seen.
- Contextual Filters: Filters based on context like location, time of day, or season.
Example: On an e-commerce website, the engine might filter out products that the user has already purchased or viewed in the last 30 days.
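As a toy illustration of the algorithm component, the sketch below computes item-item cosine similarities from a small, made-up user-item rating matrix in base R and recommends the item most similar to one a user liked. All user names, item names, and ratings are invented; this is a sketch of item-based collaborative filtering, not a production design.

# Made-up user-item rating matrix (rows = users, columns = items; 0 = not rated)
ratings <- matrix(c(5, 4, 0, 1,
                    4, 5, 1, 0,
                    1, 0, 5, 4,
                    0, 1, 4, 5),
                  nrow = 4, byrow = TRUE,
                  dimnames = list(paste0("user", 1:4),
                                  c("itemA", "itemB", "itemC", "itemD")))

# Cosine similarity between two item rating vectors
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Item-item similarity matrix
n_items <- ncol(ratings)
sim <- outer(1:n_items, 1:n_items,
             Vectorize(function(i, j) cosine(ratings[, i], ratings[, j])))
dimnames(sim) <- list(colnames(ratings), colnames(ratings))

# Recommend the item most similar to "itemA" (excluding itemA itself)
candidates <- sim["itemA", setdiff(colnames(ratings), "itemA")]
names(which.max(candidates))

Real systems combine this kind of similarity scoring with the ranking, personalization, and filtering components described above.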
What are the principles of effective data visualization?

Effective data visualization helps communicate information clearly and concisely by using visual elements like charts, graphs, and maps. To create impactful visualizations, it is essential to follow certain principles that ensure clarity, accuracy, and engagement.

1. Know Your Audience
Principle: Understand the target audience's expertise, background, and goals. Tailor the complexity and style of the visualization to suit their needs.
Why It's Important: Different audiences (e.g., executives vs. data scientists) have varying levels of technical knowledge and specific interests. A simple, intuitive design might be needed for non-experts, while advanced users may require more detailed and interactive visuals.
Example: A pie chart might work well for an executive to show market share percentages, while a data scientist might prefer a more detailed bar chart or line graph to identify trends over time.

2. Clarity and Simplicity
Principle: Strive for simplicity and clarity in design. Avoid unnecessary elements (e.g., extraneous text, excessive colors) that might confuse the message.
Why It's Important: A cluttered or overly complex visualization can overwhelm the viewer, leading to confusion and misunderstanding. Keep the design minimal and focused on the data story.
Example: Instead of using five different colors in a bar chart, limit the chart to two or three colors that are easy to distinguish and convey the main points.

3. Use the Right Type of Chart
Principle: Choose the most appropriate chart or graph to represent the data. Different types of charts work better for different data types.
Why It's Important: The chart type should match the message you want to convey and the nature of the data (e.g., trends, comparisons, distributions).
Example:
- Line charts for trends over time.
- Bar charts for comparing quantities across categories.
- Pie charts for showing parts of a whole.
- Scatter plots for relationships between two variables.

4. Highlight Key Insights
Principle: Emphasize the most important parts of the data so the audience can quickly identify key insights.
Why It's Important: Viewers should be able to extract actionable insights at a glance without having to interpret the whole chart.
Example: Use color to highlight important trends or outliers, or add annotations to key data points that need attention.

5. Maintain Proportions and Accuracy
Principle: Ensure that the visual representation is proportionate and accurately reflects the data.
Why It's Important: Distorted or misleading visuals (e.g., using an inappropriate scale or changing axis ranges) can mislead the audience and cause wrong interpretations.
Example: In bar charts, ensure the y-axis starts at zero, unless there's a compelling reason not to, to avoid exaggerating the differences between bars.
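To tie these principles to code, here is a small base-R sketch. The data values are invented; it picks chart types to match the data (principle 3), highlights one category (principle 4), and keeps the bar chart's y-axis anchored at zero (principle 5).

# Invented sales figures for three product categories
sales <- c(Electronics = 120, Clothing = 95, Groceries = 140)

# Bar chart for comparing quantities across categories;
# the y-axis starts at 0 so differences between bars are not exaggerated
barplot(sales, ylim = c(0, 160),
        main = "Sales by Category", ylab = "Sales (units)",
        col = c("steelblue", "grey70", "grey70"))  # highlight one key category

# Line chart for a trend over time (invented monthly values)
monthly <- c(10, 12, 15, 14, 18, 21)
plot(monthly, type = "l", main = "Monthly Trend",
     xlab = "Month", ylab = "Value")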
