Question Bank (Syllabus-wise)

Syllabus-wise question bank for Computer Science in Data Science


Unit – 1

Q.1 What is Data Science? Explain the scope of data science.

Ans - Data science is the domain of study that deals with vast volumes of data, using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models. The data used for
analysis can come from many different sources and be presented in various formats.
Scope of data science:
 Data science is a rapidly growing field in India, with increasing demand for skilled professionals
who can analyze and interpret large datasets. The scope for data science in India is vast, with
opportunities in various industries, including finance, healthcare, e-commerce, and government
organizations.
 The scope for data science in India is not limited to these industries, however. Data scientists are
also in demand in sectors such as transportation, energy, and manufacturing. Additionally, the
field of data science is constantly evolving, with new applications and techniques being developed
all the time. This means that there will continue to be a wide range of opportunities for data
scientists in India in the future.

Q.2 Applications of data science and sources of data.


Ans:
There are various applications of data science, including:

1. Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and
cure diseases.

2. Gaming
Video and computer games are now being created with the help of data science and that has taken
the gaming experience to the next level.

3. Image Recognition
Identifying patterns in images and detecting objects within an image is one of the most popular
applications of data science.

4. Recommendation Systems
Next up in the data science and its applications list comes Recommendation Systems. Netflix and
Amazon give movie and product recommendations based on what you like to watch, purchase, or
browse on their platforms.

5. Logistics
Data Science is used by logistics companies to optimize routes to ensure faster delivery of products
and increase operational efficiency.

6. Fraud Detection
Fraud detection comes next in the list of applications of data science. Banking and financial
institutions use data science and related algorithms to detect fraudulent transactions.

7. Internet Search
Internet search comes next in the list of applications of data science. When we think of search, we
immediately think of Google. However, other search engines, such as Yahoo, DuckDuckGo, Bing, AOL,
and Ask, also employ data science algorithms to offer the best results for a searched query in a matter
of seconds. Given that Google handles more than 20 petabytes of data per day, Google would not be
the 'Google' we know today if data science did not exist.

8. Speech recognition
Speech recognition is one of the most commonly known applications of data science. It is a technology
that enables a computer to recognize and transcribe spoken language into text. It has a wide range of
applications, from virtual assistants and voice-controlled devices to automated customer service
systems and transcription services.

DATA SOURCES

DATABASES
Databases are structured collections of data organized in a tabular form, making them a foundational
source for data retrieval and analysis in many applications. Data scientists access databases to extract,
query and analyze data for various applications. SQL-based and NoSQL databases are commonly used.
TYPES
Relational Databases: Data organized based on the relational model (e.g., MySQL, PostgreSQL).
NoSQL Databases: Non-relational databases used for unstructured or semi-structured data (e.g., MongoDB, Cassandra).
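As a hedged illustration (not part of the original answer), the short Python sketch below queries a relational database using the standard-library sqlite3 module; the table name and columns are hypothetical.

import sqlite3

conn = sqlite3.connect("example.db")   # open (or create) a local SQLite database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS customers (id INTEGER, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers VALUES (1, 'Asha', 'Pune')")
conn.commit()
# structured, tabular data is retrieved with an SQL query
for row in cur.execute("SELECT name, city FROM customers WHERE city = 'Pune'"):
    print(row)
conn.close()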
FILES
Data stored in files can be in various formats such as text files, CSV (Comma-Separated Values),
Excel spreadsheets and more. Data can be collected and stored in files, making them accessible and
shareable. Data scientists often work with files for preprocessing and analysis.
TYPES
Text Files: Unstructured text data (e.g., TXT files).
CSV Files: Structured data separated by commas (e.g., data exported from spreadsheets).
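A minimal, hedged sketch of reading file-based data with pandas; the file name "sales.csv" is a placeholder.

import pandas as pd

df = pd.read_csv("sales.csv")   # load a comma-separated file into a DataFrame
print(df.head())                # inspect the first few rows before preprocessing
print(df.dtypes)                # check how each column was interpreted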

APIs (APPLICATION PROGRAMMING INTERFACE)
APIs provide a structured way to access specific functionalities or data from applications, platforms
or services. Data scientists use APIs to fetch real-time data, such as financial market data, weather
information, social media data, and more.
TYPES
Web APIs: Allow access to data over HTTP (e.g., RESTful APIs) and usually return data in JSON or XML format.
Library APIs: APIs provided by programming libraries to access specific functions and data.
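The sketch below, which assumes the third-party requests library and a placeholder URL, shows the typical pattern of calling a web API and parsing its JSON response.

import requests

response = requests.get("https://api.example.com/weather", params={"city": "Mumbai"})
if response.status_code == 200:
    data = response.json()      # most RESTful APIs return JSON
    print(data)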

SENSORS
Sensors collect data from the environment or devices, providing valuable information for various
applications and IoT (Internet of Things) projects. Sensor data is critical in various domains such as
IoT, environmental monitoring, healthcare, manufacturing and more. Data scientists analyze this data
to derive insights and patterns for decision-making.
TYPES
Temperature Sensors: Collect temperature-related data.
Accelerometers: Measure acceleration or vibrations.

SOCIAL MEDIA
Social media platforms generate vast amounts of data daily, including text, images, videos, and user
engagement metrics. Social media is a rich and valuable data source for various applications, including
sentiment analysis, trend identification, user behavior analysis, marketing strategies, brand monitoring,
and more.

Q.3 Compare data science with other fields such as Business Intelligence, Artificial Intelligence,
Machine Learning, and Data Warehousing/Data Mining.
Ans :

Artificial Intelligence:
Artificial intelligence is the field of computer science associated with making machines that are
programmed to be capable of thinking and solving problems like the human brain. These machines can
perform human-like tasks and can also learn from past experiences like human beings. Artificial
intelligence involves advanced algorithms and theories of computer science. It is used in robotics and
gaming extensively.

Business Intelligence:
Business intelligence is a set of technologies, procedures, and applications that help us to convert the raw
data into meaningful information that can be used for decision making. It involves data analysis through
statistical methods. It combines data mining, data warehousing techniques, and various tools to extract
more data-driven information. It involves the processing of data and then using the data for decision-
making.

Data Mining:
Data mining is the extraction of hidden predictive information from large databases. It is a powerful
technology with great potential to help companies focus on the most important information in their
data warehouses. It has enormous scope in small as well as large organizations. Data mining is
essentially used in the opposite direction to data warehousing: by analyzing a company's customer
data, data mining tools can build a predictive model that can tell you which customers are at risk of
being lost.

Machine Learning:
Machine learning is a set of techniques by which machines can be made intelligent, meaning they can
make decisions on their own, classify things, predict outcomes, and recommend things based on your
preferences. The aim is to make the machine (algorithm) intelligent enough to take sensible decisions.
Advantages of Machine Learning:
ML can identify complex patterns and relationships in data that may be difficult for humans to detect.
ML can be used to make predictions about future events based on historical data.
ML can be used to automate tasks and processes, which can save time and reduce errors.
Disadvantages of Machine Learning:
ML requires significant computing resources and may be too expensive for smaller businesses.
ML algorithms can be difficult to interpret, which may make it challenging to explain the reasoning behind
the algorithm’s predictions.
ML requires extensive data preparation and cleaning before it can be used effectively.

Q.4. Different types of data: structured data, unstructured data and semi-structured data.
Ans :
1 .STRUCTURED DATA

Structured Data is highly organized and formatted, often in a tabular form with a
well-defined schema. It is easy to query, analyze, and store in databases. Each data
element is categorized and has a clear relationship with other elements.
Characteristics :
- ORGANIZED: Data is arranged in a structured format, often using rows and columns.
- CLEAR SCHEMA: The data structure is defined in advance, specifying data types for
each field.
- EASY TO QUERY: Structured data can be queried using standard SQL or other query
languages.
EXAMPLES: Relational databases, spreadsheets, CSV files.

2. UNSTRUCTURED DATA

Unstructured Data lacks a predefined structure and organization. It is often in the form of
text, images, audio, video, or other formats without a clear schema. It is more challenging
to process and analyze.
CHARACTERISTICS:
 LACK OF STRUCTURE: Data elements are not organized in a predefined manner.
 Varied Formats: Unstructured data can include text, images, videos and audio.
 Difficult to Query: Traditional databases cannot easily handle unstructured data.
 EXAMPLES: Text documents, social media posts, images, audio recordings

3. SEMI STRUCTURED DATA

Semi Structured data falls between structured and unstructured data. It is partially
structured and follows a format, but it may not conform to a strict schema.
Semi-structured data often uses metadata or tags for organization.
CHARACTERISTICS:
 Self-describing: Semi-structured data often includes labels, tags, or markers to describe
its elements.
 Flexibility: It allows for variations in data structure and can accommodate changing data
formats.
 EXAMPLES: JSON, XML, YAML files
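For illustration, the hedged Python sketch below parses a small made-up JSON record; note how labels describe each element and nesting replaces a fixed schema.

import json

record = '{"name": "Asha", "skills": ["SQL", "Python"], "address": {"city": "Pune"}}'
parsed = json.loads(record)     # tags/labels describe the data; structure can vary per record
print(parsed["skills"][0])      # access nested, self-describing fields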

Q.5 What Is Data Wrangling?


Data wrangling is the process of transforming and structuring data from one raw form into a desired
format with the intent of improving data quality and making it more consumable and useful for analytics
or machine learning. It’s also sometimes called data munging.

The data wrangling process often includes transforming, cleansing, and enriching data from multiple
sources. As a result of data wrangling, the data being analyzed is more accurate and meaningful, leading
to better solutions, decisions, and outcomes.

Because of the increase in data collection and usage, especially of diverse and unstructured data from
multiple data sources, organizations are now dealing with larger amounts of raw data, and preparing it
for analysis can be time-consuming and costly.
Self-service approaches and analytics automation can speed up and increase the accuracy of data
wrangling processes by eliminating the errors that can be introduced by people when they transform data
using Excel or other manual processes.
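As a hedged example of what data wrangling looks like in code, the pandas sketch below (with made-up data) removes duplicates, drops missing values, and converts a text column to numbers.

import pandas as pd

raw = pd.DataFrame({
    "customer": ["A", "B", "B", "C"],
    "amount": ["100", "250", "250", None],   # stored as text, with one duplicate and one missing value
})
clean = (
    raw.drop_duplicates()                                      # remove the duplicate row
       .dropna(subset=["amount"])                              # drop rows with missing amounts
       .assign(amount=lambda d: d["amount"].astype(float))     # convert text to numeric
)
print(clean)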

Q.6. What is Feature Engineering?


Feature engineering, in data science, refers to the manipulation (addition, deletion, combination,
mutation) of your data set to improve machine learning model training, leading to better performance
and greater accuracy. Effective feature engineering is based on sound knowledge of the business
problem and the available data sources.

Creating new features gives you a deeper understanding of your data and results in more valuable insights.
When done correctly, feature engineering is one of the most valuable techniques of data science, but it is
also one of the most challenging. A common example of feature engineering is when your doctor uses
your body mass index (BMI). BMI is calculated from both body weight and height, and serves as a
surrogate for a characteristic that is very hard to accurately measure: the proportion of lean body mass.
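A minimal sketch (illustrative numbers, hypothetical column names) of the BMI example as a feature-engineering step in pandas:

import pandas as pd

patients = pd.DataFrame({"weight_kg": [70, 85], "height_m": [1.75, 1.68]})
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2   # new feature derived from two raw columns
print(patients)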

Common Feature Types:

Numerical: Values with numeric types (int, float, etc.). Examples: age, salary, height.
Categorical Features: Features that can take one of a limited number of values. Examples: gender (male,
female, non-binary), color (red, blue, green).

Ordinal Features: Categorical features that have a clear ordering. Examples: T-shirt size (S, M, L, XL).
Binary Features: A special case of categorical features with only two categories. Examples: is_smoker (yes,
no), has_subscription (true, false).

Text Features: Features that contain textual data. Textual data typically requires special preprocessing
steps (like tokenization) to transform it into a format suitable for machine learning models.
Q.7. Data Science Libraries
Ans.
Data Science Libraries
NumPy
NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi-
dimensional arrays and matrices, along with a collection of mathematical functions to operate on these
arrays efficiently. NumPy forms the foundation for many other data science libraries in the Python
ecosystem.
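A brief illustrative example of NumPy array operations:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-D array (matrix)
print(a.shape)                         # (2, 3)
print(a.mean(axis=0))                  # column-wise means, computed efficiently
print(a @ a.T)                         # matrix multiplication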
Pandas
Pandas is a powerful data manipulation and analysis library in Python. It offers easy-to-use data structures,
such as DataFrames and Series, which enable data scientists to perform various operations like filtering,
grouping, and merging data. Pandas simplifies the handling of structured data and plays a crucial role in
exploratory data analysis.
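A short illustrative example of filtering and grouping with pandas (made-up data):

import pandas as pd

df = pd.DataFrame({"dept": ["IT", "IT", "HR"], "salary": [50000, 60000, 45000]})
print(df[df["salary"] > 48000])             # filtering rows
print(df.groupby("dept")["salary"].mean())  # grouping and aggregation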
Matplotlib
Matplotlib is a popular plotting library in Python. It provides a flexible and comprehensive set of functions
for creating various types of visualizations, including line plots, scatter plots, bar plots, and histograms.
Matplotlib allows data scientists to present their findings visually and communicate insights effectively.
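A minimal Matplotlib example with illustrative numbers:

import matplotlib.pyplot as plt

quarters = [1, 2, 3, 4]
sales = [10, 20, 15, 25]
plt.plot(quarters, sales, marker="o")   # a simple line plot
plt.xlabel("Quarter")
plt.ylabel("Sales")
plt.title("Quarterly sales (illustrative data)")
plt.show()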
Scikit-learn
Scikit-learn is a machine learning library for Python that offers a wide range of algorithms and tools for
tasks like classification, regression, clustering, and dimensionality reduction. It provides an easy-to-use
interface and supports various evaluation metrics, making it a valuable asset for both beginners and
experienced data scientists.
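A hedged sketch of the typical scikit-learn workflow (fit, predict, score) on the bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)   # train a simple classifier
print(model.score(X_test, y_test))                                  # accuracy on unseen data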
TensorFlow
TensorFlow is an open-source machine learning framework developed by Google. It specializes in building
and training deep learning models. TensorFlow provides a high-level API for constructing neural networks,
as well as lower-level capabilities for fine-tuning models. It has gained popularity due to its scalability and
extensive support for deployment across different platforms.
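A minimal sketch of a small feed-forward network using the tf.keras API; the layer sizes and the assumed four input features and three output classes are illustrative choices.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),   # e.g. three output classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, epochs=10)   # training would require real data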

Q.8. EDA(Exploratory data analysis)


Ans :
Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and
summarize their main characteristics, often employing data visualization methods.

1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It
includes techniques such as data imputation, handling missing data, and identifying and removing
outliers.

2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability,
and distribution of variables. Measures like mean, median, mode, standard deviation, range, and
percentiles are commonly used.

3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations
such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help in identifying
patterns, trends, and relationships within the data.

4. Feature Engineering: EDA allows for the exploration of different variables and their transformations
to create new features or derive meaningful insights. Feature engineering can involve scaling,
normalization, binning, encoding categorical variables, and creating interaction or derived variables.

5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables.
Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the
strength and direction of relationships between variables.

6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain
criteria or characteristics. This segmentation helps gain insights into specific subgroups within the
data and can lead to more focused analysis.
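A hedged EDA sketch in pandas touching several of the steps above; "data.csv" is a placeholder file name.

import pandas as pd

df = pd.read_csv("data.csv")
print(df.describe())                 # descriptive statistics: mean, std, percentiles
print(df.isnull().sum())             # missing values per column (data cleaning)
print(df.corr(numeric_only=True))    # pairwise correlations (recent pandas versions)
df.hist(figsize=(8, 6))              # quick look at the distribution of each numeric column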

Q.9. Supervised and Unsupervised Learning


Ans :

1. Supervised Learning :
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and on the basis of that data, machines predict the output. Labelled data
means some input data is already tagged with the correct output.

In supervised learning, the training data provided to the machines works as the supervisor that
teaches the machines to predict the output correctly. It applies the same concept as a student
learning under the supervision of a teacher.

Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. The aim of a supervised learning algorithm is to find a mapping function
to map the input variable (x) to the output variable (y).

In the real world, supervised learning can be used for risk assessment, image classification, fraud
detection, spam filtering, etc.

In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on test data (a
subset of the data held out from training), and then it predicts the output.
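A minimal supervised-learning sketch with made-up labelled data, using a scikit-learn decision tree:

from sklearn.tree import DecisionTreeClassifier

X = [[25, 40000], [47, 25000], [52, 110000], [23, 15000]]   # inputs: [age, income] (illustrative)
y = [0, 1, 0, 1]                                            # correct output labels act as the "supervisor"
model = DecisionTreeClassifier().fit(X, y)                  # learn the mapping from x to y
print(model.predict([[30, 30000]]))                         # predict the label for an unseen input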

2. Unsupervised Learning :
Unsupervised learning is a machine learning technique in which models are not supervised using a
training dataset. Instead, the models themselves find the hidden patterns and insights from the
given data. It can be compared to the learning which takes place in the human brain while learning
new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem because,
unlike supervised learning, we have the input data but no corresponding output data. The goal of
unsupervised learning is to find the underlying structure of the dataset, group the data according
to similarities, and represent the dataset in a compressed format.
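A minimal unsupervised-learning sketch: k-means clustering groups unlabelled points purely by similarity (made-up data).

from sklearn.cluster import KMeans

X = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]   # inputs only, no target labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                                      # cluster assignments discovered from the data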
Q.10. Regression Analysis
Ans :

Regression analysis is a statistical method to model the relationship between a dependent (target)
variable and one or more independent (predictor) variables. More specifically, regression analysis
helps us to understand how the value of the dependent variable changes corresponding to an
independent variable when the other independent variables are held fixed. It predicts
continuous/real values such as temperature, age, salary, price, etc.

Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the cause-and-effect relationship between variables.

Types of Regression
There are various types of regression which are used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all the
regression methods analyze the effect of the independent variables on the dependent
variable. Here we discuss some important types of regression, which are given below:

o Linear Regression

The name says it all: linear regression can be used only when there is a linear relationship among the
variables. It is a statistical model used to understand the association between independent variables (X) and
dependent variables (Y).

The variables that are taken as input are called independent variables. For example, when predicting the
sales of a product, the price and advertising spend are independent variables, whereas the quantity
being predicted ('sales') is the dependent variable. A short sketch of fitting a linear and a logistic
regression model follows after the next point.

o Logistic Regression

Logistic regression analysis is generally used to find the probability of an event. It is used
when the dependent variable is dichotomous or binary. For example, if the output is 0 or
1, True or False, Yes or No, Cat or Dog, etc., it is said to be a binary variable. Since it gives
us the probability, the output will be in the range of 0-1.
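As referenced above, here is a hedged scikit-learn sketch of both models on small, made-up data:

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# linear regression: advertising spend (X) vs. sales (y), illustrative numbers
X = np.array([[10], [20], [30], [40]])
y = np.array([25, 45, 65, 85])
reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)       # fitted slope and intercept
print(reg.predict([[50]]))             # predicted sales for a new spend value

# logistic regression: hours studied (X) vs. pass/fail (y = 1/0), illustrative numbers
Xc = np.array([[1], [2], [3], [4], [5], [6]])
yc = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(Xc, yc)
print(clf.predict_proba([[3.5]]))      # class probabilities, each in the range 0-1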
STEPWISE REGRESSION

Stepwise regression is a method of fitting a regression model by iteratively adding or removing
variables. It is used to build a model that is accurate and parsimonious, meaning that it has the
smallest number of variables that can explain the data.
There are two main types of stepwise regression:
- Forward selection: the algorithm starts with an empty model and iteratively adds variables to the
model until no further improvement is made.
- Backward elimination: the algorithm starts with a model that includes all variables and iteratively
removes variables until no further improvement is made.

The advantage of stepwise regression is that it can automatically select the most important variables
for the model and build a parsimonious model. The disadvantage is that it may not always select the
best model, and it can be sensitive to the order in which the variables are added or removed.
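Scikit-learn does not implement classical p-value-based stepwise regression, but its SequentialFeatureSelector performs a comparable forward or backward variable selection; the hedged sketch below uses the bundled diabetes dataset and an arbitrary choice of three features.

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=3, direction="forward"   # or "backward"
)
selector.fit(X, y)
print(selector.get_support())   # boolean mask of the variables kept in the model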

Q.11. TECHNIQUES FOR EVALUATING MODEL PERFORMANCE


Ans :
Machine Learning Model does not require hard-coded algorithms. We feed a large amount of
data to the model and the model tries to figure out the features on its own to make future
predictions. So we must also use some techniques to determine the predictive power of the model.
Machine Learning Model Evaluation :
Model evaluation is the process that uses some metrics which help us to analyze the performance
of the model. Evaluating a model plays a vital role so that we can judge the performance of our
model. The evaluation also helps to analyze a model’s key weaknesses. There are many metrics
like Accuracy, Precision, Recall, F1 score, Area under Curve, Confusion Matrix, and Mean Square
Error. Cross Validation is one technique that is followed during the training phase and it is a model
evaluation technique as well.
ACCURACY, PRECISION, RECALL :
1. Accuracy:
Accuracy is the most straightforward metric and measures the overall correctness of the
model's predictions. It is calculated as the ratio of the correctly predicted instances
(TP + TN) to the total number of instances (TP + TN + FP + FN). However, accuracy alone
may not always be the best metric, especially when the dataset is imbalanced.

2. Precision:
Precision focuses on the proportion of correctly predicted positive instances (TP) out of all
predicted positive instances (TP + FP). It quantifies the model's ability to avoid false
positives. Precision is calculated as TP / (TP + FP). A high precision indicates a low number
of false positives.

3. Recall:
Recall (also known as sensitivity or true positive rate) measures the proportion of correctly
predicted positive instances (TP) out of all actual positive instances (TP + FN). It determines
the model's ability to identify all the positive instances in the dataset. Recall is
calculated as TP / (TP + FN). A high recall indicates a low number of false negatives.
4. F1 Score
The F1 score is the harmonic mean of precision and recall, which
makes it sensitive to small values. This means if either precision
or recall is significantly lower than the other, it will have a more
pronounced impact on the F1 score.
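A hedged sketch computing these metrics with scikit-learn on made-up predictions:

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # actual labels (illustrative)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # model predictions
print(confusion_matrix(y_true, y_pred))   # rows: actual, columns: predicted ([[TN, FP], [FN, TP]])
print(accuracy_score(y_true, y_pred))
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall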
Q.12. Machine Learning Algorithms: Random Forest, Support Vector Machine, and Neural Network.

Ans :
Random Forest :
The random forest algorithm is a powerful supervised machine learning technique used for both
classification and regression tasks. It is used to find patterns in data (classification) and to predict
outcomes (regression). During training, the algorithm constructs numerous decision trees, each
built on a unique subset of the training data. These individual trees then vote on the final
prediction, leading to a robust and accurate outcome.
In a random forest, many decision trees are made during training. Each tree is created separately
using a random part of the training data. When making predictions, each tree in the forest makes
its own prediction. Finally, the overall prediction is decided by combining these individual
predictions. Random Forest is recommended when dealing with diverse datasets, especially when
you prioritize a balance between model interpretability and performance. Its ability to avoid
overfitting and work well with high-dimensional data makes it a suitable choice in a wide range
of applications, including regression and classification tasks.
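A minimal random forest sketch with scikit-learn on the bundled iris dataset (parameter values are illustrative):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
forest = RandomForestClassifier(n_estimators=100, random_state=42)   # an ensemble of 100 decision trees
forest.fit(X_train, y_train)                # each tree is trained on a random bootstrap sample
print(forest.score(X_test, y_test))         # accuracy of the trees' combined vote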

Support Vector Machine :


A Support Vector Machine (SVM) is a tool used in machine learning to sort data into different
groups. It’s good for both figuring out which group something belongs to (classification) and
predicting outcomes (regression). It works by finding the best line or plane that separates the data
points into different groups, making sure it’s as far away as possible from the points closest to it
(these are called support vectors).
In regression tasks, SVM works similarly to regression methods but with the objective of fitting a
hyperplane that captures the relationships between input features and target variables. SVM is
known for its ability to handle high-dimensional data, its effectiveness in dealing with small to
medium-sized datasets, and its robustness against overfitting. SVM is recommended when
dealing with datasets requiring clear margins between classes or when non-linear relationships
need to be captured. It's a valuable choice for tasks involving small to medium-sized datasets,
provided computational expense and sensitivity to hyperparameter tuning are kept in mind.
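A minimal SVM sketch (made-up 2-D points; the RBF kernel and C value are illustrative choices):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = [[0.2, 0.1], [0.4, 0.6], [0.5, 0.9], [0.9, 0.8]]   # illustrative 2-D points
y = [0, 0, 1, 1]
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))   # feature scaling matters for SVMs
clf.fit(X, y)
print(clf.predict([[0.6, 0.7]]))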
Neural Network :
A neural network is like a computer brain made of lots of small units (neurons) that work
together. It’s based on how our brain works, with layers of these units. This model is used
in machine learning and Artificial Intelligence to help computers learn and make decisions.
Neural networks learn from data through a process called training. During training, the
network adjusts its parameters (weights and biases) based on the input data and expected
output. This is typically done using optimization algorithms such as gradient descent and
backpropagation, which minimize the difference between the predicted output and the
actual output. Neural networks often achieve cutting-edge results in image, text, and speech
recognition and automatically extract valuable features from raw data.
Neural Networks are ideal for tasks demanding a high degree of flexibility and
performance, particularly in complex domains like image or speech recognition. While their
computational requirements can be substantial, their ability to automatically learn
hierarchical features from raw data makes them invaluable for cutting-edge applications
like image recognition, natural language processing, speech recognition and more.
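A hedged sketch of a small neural network using scikit-learn's MLPClassifier on the bundled digits dataset; the layer sizes are arbitrary, and training internally uses backpropagation as described above.

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300, random_state=0)   # two hidden layers
net.fit(X_train, y_train)            # weights and biases are adjusted by backpropagation
print(net.score(X_test, y_test))     # accuracy on held-out data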

Q.13. Explain the metrics used to evaluate the performance of classification (confusion matrix,
precision, recall, F1 score and accuracy).
Ans :

The confusion matrix serves as a foundational tool for evaluating the performance of a
classification model. It compares the model's predicted values against the actual values
and presents a detailed breakdown of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN). This matrix offers a comprehensive view of how
well the model is performing across different classes.

Precision, also known as positive predictive value, assesses the accuracy of the model's
positive predictions. It calculates the proportion of correctly predicted positive instances
out of all instances predicted as positive (TP / (TP + FP)). A high precision score indicates
that the model is effectively identifying positive cases.

Recall, also referred to as sensitivity or true positive rate, measures the model's ability to
correctly identify positive instances from the actual positives. It is calculated as TP / (TP
+ FN), highlighting the model's capacity to capture all positive instances without missing
any. A high recall score suggests that the model is sensitive to positive cases.

Accuracy is a straightforward metric that calculates the proportion of correctly predicted
instances, irrespective of their class. It is determined by (TP + TN) / (TP + TN + FP + FN)
and provides an overall assessment of the model's performance across all classes.

The F1 score combines precision and recall into a single metric, offering a balanced
evaluation of the model's performance. It is calculated as 2 * ((precision * recall) /
(precision + recall)) and considers both false positives and false negatives. A higher F1
score indicates a better balance between precision and recall, showcasing the model's
effectiveness in handling both positive and negative cases.
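A short worked example with assumed confusion-matrix counts (TP=40, TN=45, FP=10, FN=5):

TP, TN, FP, FN = 40, 45, 10, 5
accuracy = (TP + TN) / (TP + TN + FP + FN)            # 85 / 100 = 0.85
precision = TP / (TP + FP)                            # 40 / 50 = 0.80
recall = TP / (TP + FN)                               # 40 / 45 ≈ 0.89
f1 = 2 * (precision * recall) / (precision + recall)  # ≈ 0.84
print(accuracy, precision, recall, f1)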

Q.14. Explain the types of visualization with examples (bar chart, scatter plot, box plot, line chart,
heat map).
Ans :
Visualization is a powerful tool for presenting data and information in a visual format, making it easier to
understand and interpret. There are various types of visualizations, each with its own purpose and
characteristics. Let's explore some common types of visualizations along with examples:

1. Bar Chart:

A bar chart represents categorical data using rectangular bars of different lengths or heights. It is
commonly used to compare and display numerical values for different categories. For example, a bar chart
can be used to show the sales performance of different products over a specific time period.

2. Scatter Plot:
A scatter plot displays the relationship between two numerical variables. It consists of a series of points
on a graph, with each point representing the value of the variables on the x and y-axis. Scatter plots are
useful in identifying patterns, correlations, clusters, or outliers in data. For instance, a scatter plot can be
employed to visualize the correlation between the age and income of individuals in a population.

3. Box Plot:
A box plot, also known as box-and-whisker plot, summarizes the distribution and statistical properties of
numerical data. It uses a rectangular box to represent the interquartile range (IQR), with lines (whiskers)
extending from the box to indicate the range of the data. Box plots provide insights into the central
tendency, spread, and skewness of the data. For example, a box plot can be used to compare the salaries
across different job titles in a company.

4. Link Chart:
A link chart, or network diagram, visualizes connections or relationships between entities. It uses nodes
to represent individual entities and edges or links to indicate the connections between them. Link charts
are commonly used in social network analysis, graph theory, and organizational charts. For instance, a link
chart can be used to visualize the connections between users on a social media platform.
5. Heat Map:
A heat map displays data values using a color scale, where each cell or pixel represents a data point and
its color intensity represents the value. Heat maps are often used to visualize data on a grid or matrix,
making it easier to identify patterns, trends, or concentration of values. For example, a heat map can be
used to represent the population density of different regions in a country.
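A hedged Matplotlib sketch producing three of these chart types side by side with randomly generated, illustrative data:

import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].bar(["A", "B", "C"], [30, 55, 20])                  # bar chart: compare categories
axes[0].set_title("Bar chart")
rng = np.random.default_rng(0)
axes[1].scatter(rng.normal(size=50), rng.normal(size=50))   # scatter plot: two numeric variables
axes[1].set_title("Scatter plot")
axes[2].imshow(rng.random((5, 5)), cmap="viridis")          # heat map: values shown as colour intensity
axes[2].set_title("Heat map")
plt.tight_layout()
plt.show()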
