
Foundation of Data Science Previous Year Question Paper

2019 Pattern, T.Y. B.Sc. Computer Science

T.Y. B.Sc. COMPUTER SCIENCE
CS - 354 : Foundations of Data Science (2019 Pattern) (CBCS) (Semester - V)
Time : 2 Hours]                                [Max. Marks : 35
Instructions to the candidates:
1) All questions are compulsory.
2) Figures to the right indicate full marks.

Q1) Attempt any Eight of the following : [8 x 1 = 8]

a) List any two applications of Data Science.
=> Predictive Analytics: Data science is used to predict future outcomes based on historical data, for example predicting customer behavior, stock prices, or sales.
Recommendation Systems: Data science helps build recommendation engines for platforms like Netflix, Amazon, or YouTube, which suggest products, movies, or videos based on user preferences and past behavior.

b) What is an outlier?
=> An outlier is a data point that differs significantly from the rest of the data. It can be much higher or lower than the other values, and it may indicate natural variability in the data or an error in measurement.

c) What are missing values?
=> Missing values refer to the absence of data for certain variables in a dataset. They can occur due to errors in data collection or non-response in surveys, and they may require imputation or removal during data analysis.

d) Define variance.
=> Variance is a statistical measure of the spread between numbers in a dataset. It represents how much the values differ from the mean of the dataset. The formula is:
Variance σ² = (1/n) Σ (xᵢ − μ)²
where xᵢ are the data points, μ is the mean of the dataset, and n is the number of data points.

e) What is a nominal attribute?
=> A nominal attribute is a categorical variable that represents different categories without any inherent order or ranking. Examples include gender, color, and country of origin. Nominal variables are used to label or categorize data.

f) What is data transformation?
=> Data transformation is the process of converting data from its original format into a format more suitable for analysis. This can include normalization, scaling, encoding, or aggregating data so that it fits the requirements of specific models or analysis techniques.

g) What is one-hot encoding?
=> One-hot encoding is a technique used to convert categorical variables into a binary format. Each category is represented as a binary vector in which only one element is 1 (for that category) and the rest are 0. This allows machine learning models to process categorical data effectively.

h) What is the use of a bubble plot?
=> A bubble plot is used to visualize three-dimensional data in two dimensions. It is similar to a scatter plot but adds a third variable by varying the size of the bubbles. It helps in understanding the relationships between three continuous variables simultaneously.

i) Define data visualisation.
=> Data visualization is the graphical representation of data. It involves creating charts, graphs, and plots to communicate information clearly and effectively, so that patterns, trends, and outliers in the data can be easily understood.

j) Define standard deviation.
=> Standard deviation is a measure of the amount of variation or dispersion in a dataset. It shows how much individual data points deviate from the mean of the dataset. A low standard deviation indicates that the data points are close to the mean, while a high standard deviation indicates that the data points are spread out. The formula is:
Standard Deviation σ = √( (1/n) Σ (xᵢ − μ)² )
where xᵢ are the data points, μ is the mean of the dataset, and n is the number of data points.
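As a quick reference, here is a minimal NumPy sketch (an illustration, not part of the paper) of the population variance and standard deviation defined in d) and j); the sample values are made up.

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7])          # illustrative values

mu = data.mean()                              # mean of the dataset
variance = np.mean((data - mu) ** 2)          # population variance: (1/n) * sum((x_i - mu)^2)
std_dev = np.sqrt(variance)                   # standard deviation is the square root of the variance

print(variance, std_dev)
# np.var(data) and np.std(data) give the same results (population form, ddof=0)
```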
Q2) Attempt any four of the following : [4 x 2 = 8]

a) Differentiate structured and unstructured data.
=> Structured data is organized in a fixed format, typically in rows and columns (e.g., databases, spreadsheets). It is easy to search, analyze, and process using traditional data tools.
Unstructured data has no predefined structure or format (e.g., text documents, images, videos). It is more complex to analyze and often requires advanced tools such as machine learning or natural language processing to extract meaningful insights.

b) What is inferential statistics?
=> Inferential statistics involves making predictions or inferences about a population based on a sample of data. It uses probability theory to estimate population parameters, test hypotheses, and determine relationships between variables.

c) What do you mean by data preprocessing?
=> Data preprocessing is the process of cleaning, transforming, and organizing raw data into a format suitable for analysis. It can involve handling missing values, removing outliers, normalizing, and encoding data.

d) Define data discretization.
=> Data discretization is the process of converting continuous numerical values into discrete categories or intervals. It is often used in data analysis and machine learning to simplify complex data and make it easier to analyze.

e) What is visual encoding?
=> In data visualization, visual encoding is the process of mapping data values to visual properties of a graphic, such as position, length, size, shape, and color, so that viewers can read the underlying values from the chart. Choosing an appropriate encoding (for example, position for quantitative values and color or shape for categories) determines how accurately and quickly the information can be perceived. (In cognitive psychology, the same term refers to converting visual information such as images, shapes, and colors into a mental representation that can be stored in and retrieved from memory, as distinct from acoustic or semantic encoding.)

Q3) Attempt any two of the following : [2 x 4 = 8]

a) Explain outlier detection methods in brief.
=> Outlier detection is the process of identifying data points that deviate significantly from the majority of the data. Outliers may indicate errors, anomalies, or interesting patterns. Common outlier detection methods include:

1. Z-score: The Z-score measures how many standard deviations a data point lies from the mean. If the Z-score exceeds a certain threshold (typically +3 or −3), the point is considered an outlier. Formula: Z = (X − μ) / σ, where X is the data point, μ is the mean, and σ is the standard deviation.

2. IQR (Interquartile Range): The IQR method detects outliers as values that lie outside the range defined by the first (Q1) and third (Q3) quartiles. Outliers are points outside the bounds Lower bound = Q1 − 1.5 × IQR and Upper bound = Q3 + 1.5 × IQR, where IQR = Q3 − Q1.

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise): DBSCAN groups points based on their proximity. Points that are far away from dense clusters are labeled as outliers (also called noise).

4. Isolation Forest: Isolation Forest isolates outliers instead of profiling normal data points. It works by recursively partitioning the data using random features and checking how easily each data point can be isolated; points that are isolated quickly are likely outliers.

A short code sketch of the first two methods is given below.
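The following is a minimal illustration (not from the paper) of the Z-score and IQR rules described above, applied to an assumed one-dimensional sample with one obvious outlier.

```python
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 11, 10, 12,
                    13, 11, 12, 10, 11, 13, 12, 95])   # 95 is an obvious outlier

# Z-score rule: flag points more than 3 standard deviations from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(z_outliers.tolist())    # both rules flag only the value 95
print(iqr_outliers.tolist())
```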
b) Write different data visualization libraries in Python.
=>
1. Matplotlib: A widely used library for creating static, interactive, and animated visualizations. It provides a variety of chart types such as line plots, bar charts, histograms, and scatter plots.
Example:
```python
import matplotlib.pyplot as plt
plt.plot(x, y)
plt.show()
```

2. Seaborn: Built on top of Matplotlib, Seaborn simplifies the creation of statistical plots. It is particularly useful for visualizing relationships between variables and has built-in support for themes and color palettes.
Example:
```python
import seaborn as sns
sns.boxplot(x='category', y='value', data=df)
```

3. Plotly: A library for creating interactive web-based visualizations. It supports a wide range of chart types and is particularly useful for dashboards and interactive plots.
Example:
```python
import plotly.express as px
fig = px.scatter(df, x='x_column', y='y_column')
fig.show()
```

4. Bokeh: Another interactive visualization library that produces web-ready plots. It is well suited to complex visualizations such as heatmaps and time-series plots.
Example:
```python
from bokeh.plotting import figure, show
p = figure(title="Simple line plot")
p.line(x, y)
show(p)
```

5. Altair: A declarative statistical visualization library based on Vega and Vega-Lite. It allows plots to be generated easily by combining data and transformations.
Example:
```python
import altair as alt
chart = alt.Chart(df).mark_point().encode(x='x_column', y='y_column')
chart.show()
```

6. ggplot (Plotnine): Plotnine is a Python implementation of the R package ggplot2. It follows the "grammar of graphics" philosophy and allows plots to be constructed by layering components.
Example:
```python
from plotnine import ggplot, aes, geom_point
ggplot(df, aes(x='x_column', y='y_column')) + geom_point()
```

c) What is data cleaning? Explain any two data cleaning methods.
=> Data cleaning is the process of identifying and rectifying errors, inconsistencies, and missing values in a dataset to ensure its accuracy and integrity. It is a crucial step in data preprocessing that makes the data usable for analysis and modeling.

Two common data cleaning methods:

1. Handling missing data: Missing data is a common issue in datasets. There are several strategies to handle missing values:
Imputation: replace missing values with statistical values such as the mean, median, or mode of the column.
```python
df['column'].fillna(df['column'].mean(), inplace=True)
```
Deletion: remove rows with missing values, although this may lead to loss of important data.
```python
df.dropna(subset=['column'], inplace=True)
```

2. Removing duplicates: Duplicate entries can distort the analysis and lead to incorrect results. Duplicates can be identified and removed as follows:
```python
df.drop_duplicates(inplace=True)
```

By addressing these and other data quality issues, data cleaning ensures that the dataset is well prepared for analysis, reducing the likelihood of errors or biased outcomes.
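The snippets above assume that a DataFrame df (and plotting data x, y) already exist. As a rough end-to-end illustration (not part of the paper), the cleaning steps can be combined on a small made-up dataset and the result plotted:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up dataset with one missing value and one duplicated row
df = pd.DataFrame({
    "day":   [1, 2, 3, 3, 4, 5],
    "sales": [100, None, 130, 130, 125, 140],
})

df["sales"] = df["sales"].fillna(df["sales"].mean())   # impute the missing value with the column mean
df = df.drop_duplicates()                               # remove the duplicated row

plt.plot(df["day"], df["sales"], marker="o")
plt.xlabel("day")
plt.ylabel("sales")
plt.show()
```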
—>The "3V's" of Data Science refer to three key dimensions that characterize big data and © seamed nth one semner the challenges faced when working with large datasets. These are: @ = Volume: This refers to the sheer amount of data generated every day. In data science, volume represents the scale of data that must be collected, processed, and analyzed. With advancements in technology, businesses and organizations now generate massive amounts of data from various sources like social media, loT devices, transactional systems, etc. A key challenge here is how to store, manage, and analyze such large datasets efficiently. @ Variety: Data comes in various forms, including structured (e.g,, tables in databases), semi- structured (e.g., XML, JSON), and unstructured (e.g,, text, images, videos). Variety refers to the different types and formats of data that need to be integrated and processed. Data scientists often have to deal with disparate sources of data and transform them into a format that is suitable for analysis. @ Velocity. Velocity refers to the speed at which data is generated and needs to be processed. In the age of real-time data, the rate at which data is produced, transmitted, and analyzed is crucial. For example, stock market data or social media feeds require real-time analysis to provide actionable insights. Efficient algorithms and technologies like stream processing are needed to handle high-velocity data effectively. These 3Vs — Volume, Variety, and Velocity — highlight the complexities and challenges in the field of data science, particularly in the era of big data. b) Explain data cube aggregation method in detail. —>A Data Cube is a multi-dimensional array of values used in data analysis, often in the context of Online Analytical Processing (OLAP). It allows users to view data from multiple perspectives (dimensions), enabling efficient querying and reporting. Aggregation in the context of a data cube refers to summarizing data based on certain attributes across different levels of granularity. Key Features of Data Cube Aggregation: @ Dimensions and Measures: @ Dimensions: These are the perspectives or categories (e.g., time, location, product) along which data is aggregated. @ Measures: These are the numerical values or facts (e.g., sales, profit) that are analyzed within the data cube. @ Aggregation Levels: Aggregation in data cubes can occur at different levels of granusarity. For example, data might be aggregated by year, quarter, month, or day for a time dimension. @ Roll-Up and Drill-Down: @ Roll-Up: This operation involves moving up in the hierarchy of a dimension. For instance, if we have data for each day, rolling up to a higher level might aggregate it by month or year. © __Drill-Down: This is the opposite of roll-up, where you break down data from a high- level summary to a more granular level, such as drilling down from a year to specific days or from a country to individual stores. @ = Slicing and Dicing: © seamed nth one semner @ = Slicing: This operation involves selecting a specific value from one of the dimensions to analyze the data at that level. For example, slicing the cube to show data for a specific year. @ _ Dicing: This operation involves selecting a subset of the cube, choosing specific values for multiple dimensions. For example, dicing to view sales data for a particular region in a specific quarter of a year. @ Example: Consider a data cube with the dimensions of "Time" (Year, Quarter, Month) and "Product" (Product Categories). 
c) Explain any two data transformation techniques in detail.
=> Data transformation is the process of converting data into a format suitable for analysis or for use in a machine learning model. There are various techniques for transforming raw data; two common ones are normalization and feature engineering.

1. Normalization (scaling):
Normalization is the process of adjusting values in numerical data to fall within a specific range, often between 0 and 1. This is especially important when data comes from different scales or units, such as age, salary, or weight, where each variable may have a different range. Without normalization, variables with larger ranges can disproportionately influence the results of machine learning algorithms.

Types of normalization:
- Min-max normalization rescales the data so that the minimum value becomes 0 and the maximum value becomes 1:
  X_norm = (X − X_min) / (X_max − X_min)
  where X is the original value, and X_min and X_max are the minimum and maximum values of the feature.
- Z-score normalization (standardization) rescales the data so that it has a mean of 0 and a standard deviation of 1, which is useful when the data has a roughly Gaussian distribution:
  X_standardized = (X − μ) / σ
  where μ is the mean and σ is the standard deviation of the feature.

Normalization ensures that each feature contributes equally to the analysis, which is particularly important for algorithms that are sensitive to scale, such as K-nearest neighbors (KNN) and Support Vector Machines (SVM).

2. Feature engineering:
Feature engineering is the process of creating new features from existing data to improve the performance of a machine learning model. It involves domain knowledge and creative problem-solving to enhance predictive power and simplify the model's learning process.

Types of feature engineering:
- Interaction features: the interaction between two features can reveal patterns that are not obvious when looking at each feature individually. For example, combining "height" and "weight" yields a new feature, BMI (Body Mass Index), which may have predictive value in health-related models.
- Aggregation features: for time-series data, aggregation can create features that capture trends or summary statistics; for instance, a moving average or rolling mean over a period can capture the underlying trend and smooth out noise.
- Binning: converting continuous values into categorical ones, for example turning ages into groups such as "20-30" and "30-40". This can simplify the model and suits certain algorithms, especially decision trees.
- One-hot encoding: converting categorical variables into binary vectors, one per category. For example, a "Color" feature with values "Red", "Green", and "Blue" would be transformed into three binary features, "IsRed", "IsGreen", and "IsBlue", each taking the value 1 or 0.

Feature engineering is an iterative process that requires experimentation and a deep understanding of the data and the problem being solved. It can significantly improve model performance, reduce overfitting, and allow the model to capture more meaningful patterns. A short sketch of min-max scaling and one-hot encoding is given after this answer.
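For concreteness, here is a minimal pandas sketch (an illustration, not part of the paper) of min-max normalization and one-hot encoding applied to an assumed toy DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [22, 35, 58, 41],
    "salary": [25000, 48000, 90000, 61000],
    "color":  ["Red", "Green", "Blue", "Red"],
})

# Min-max normalization: rescale each numeric column to the [0, 1] range
for col in ["age", "salary"]:
    df[col + "_norm"] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# One-hot encoding: expand the categorical column into binary indicator columns
df = pd.get_dummies(df, columns=["color"], prefix="is")

print(df)
```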
Q5) Attempt any one of the following : [1 x 3 = 3]

a) Write a short note on feature extraction.
=> Feature extraction is a crucial step in preparing data for machine learning. It involves transforming raw data into a set of meaningful attributes, or features, that better represent the underlying patterns or relationships in the data. Raw data such as images, text, or sensor readings often contains redundant or irrelevant information; feature extraction helps reduce the dimensionality of the data, making it more suitable for modeling. For example:
- In image processing, feature extraction might involve detecting edges, textures, or shapes that are critical for recognizing objects.
- In text analysis, extracting key terms, word frequencies, or semantic features (such as sentiment) is a common approach.
The goal is to create a set of features that simplifies the modeling process and improves the performance of machine learning algorithms.

b) Explain Exploratory Data Analysis (EDA) in detail.
=> Exploratory Data Analysis (EDA) is a critical step in the data analysis pipeline that involves using statistical and visualization techniques to understand the characteristics of a dataset before applying machine learning models. The primary goal of EDA is to summarize the main features of a dataset, identify patterns, spot anomalies, and test assumptions, typically with the help of summary statistics and graphical representations.
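A minimal sketch of typical first EDA steps (illustrative only; the file name "data.csv" and the use of a pandas DataFrame are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")      # assumed input file

print(df.shape)                    # size of the dataset
print(df.head())                   # first few rows
print(df.describe())               # summary statistics for numeric columns
print(df.isnull().sum())           # count of missing values per column

df.hist(figsize=(8, 6))            # distribution of each numeric column
plt.tight_layout()
plt.show()
```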
