Question Bank Syllabuswise
Ans - Data science is the domain of study that deals with vast volumes of data using modern tools and
techniques to find unseen patterns, derive meaningful information, and make business decisions.
Data science uses complex machine learning algorithms to build predictive models. The data used for
analysis can come from many different sources and be presented in various formats.
Scope of data science:
Data science is a rapidly growing field in India, with increasing demand for skilled professionals
analyzing and interpreting large datasets. The scope for data science in India is vast, with
opportunities in various industries, including finance, healthcare, e-commerce, and government
organizations.
The scope for data science in India is not limited to these industries, however. Data scientists are
also in demand in sectors such as transportation, energy, and manufacturing. Additionally, the
field of data science is constantly evolving, with new applications and techniques being developed
all the time. This means that there will continue to be a wide range of opportunities for data
scientists in India in the future.
Diagram :
1. Healthcare
Healthcare companies are using data science to build sophisticated medical instruments to detect and
cure diseases.
2. Gaming
Video and computer games are now being created with the help of data science and that has taken
the gaming experience to the next level.
3. Image Recognition
Identifying patterns in images and detecting objects within an image is one of the most popular
applications of data science.
4. Recommendation Systems
Next in the list of data science applications come recommendation systems. Netflix and
Amazon give movie and product recommendations based on what you like to watch, purchase, or
browse on their platforms.
5. Logistics
Data Science is used by logistics companies to optimize routes to ensure faster delivery of products
and increase operational efficiency.
6. Fraud Detection
Fraud detection comes next in the list of applications of data science. Banking and financial
institutions use data science and related algorithms to detect fraudulent transactions.
7. Internet Search
Internet search comes next in the list of applications of data science. When we think of search, we
immediately think of Google. Right? However, there are other search engines, such as Yahoo,
Duckduckgo, Bing, AOL, Ask, and others, that employ data science algorithms to offer the best results
for our searched query in a matter of seconds. Given that Google handles more than 20 petabytes of
data per day, Google would not be the 'Google' we know today if data science did not exist.
8. Speech recognition
Speech recognition is one of the most commonly known applications of data science. It is a technology
that enables a computer to recognize and transcribe spoken language into text. It has a wide range of
applications, from virtual assistants and voice-controlled devices to automated customer service
systems and transcription services.
DATA SOURCES
DATABASES
SENSORS
Sensors collect data from the environment or devices, providing
valuable information for various applications and IoT(Internet of
Things) projects. Sensor Data is critical in various domains such as
IoT, environmental monitoring, healthcare, manufacturing and
more. Data Scientists analyze this data to derive insights and
patterns for decision-making.
TYPES
Temperature Sensors: Collect temperature-related data.
Accelerometers: Measure acceleration or vibrations.
SOCIAL MEDIA
Social media platforms generate vast amounts of data daily,
including text, images, videos, and user engagement metrics.
Social media is a rich and valuable data source for various
applications, including sentiment analysis, trend identification,
user behavior analysis, marketing strategies, brand monitoring,
and more.
Q.3. Comparison of Data Science with other fields like Business Intelligence, Artificial Intelligence,
Machine Learning and Data Warehousing/Data Mining.
Ans :
Artificial Intelligence:
Artificial intelligence is the field of computer science associated with making machines that are
programmed to be capable of thinking and solving problems like the human brain. These machines can
perform human-like tasks and can also learn from past experiences like human beings. Artificial
intelligence involves advanced algorithms and theories of computer science. It is used in robotics and
gaming extensively.
Business Intelligence:
Business intelligence is a set of technologies, procedures, and applications that help us to convert the raw
data into meaningful information that can be used for decision making. It involves data analysis through
statistical methods. It combines data mining, data warehousing techniques, and various tools to extract
more data-driven information. It involves the processing of data and then using the data for decision-
making.
Data Mining:
It is the extraction of hidden predictive information from large databases, and it is a
powerful modern technology with great potential to help companies focus on the
most important information in their data warehouses. It has enormous scope in small as well
as large organizations. Data mining is essentially used in the opposite direction
to data warehousing. By analyzing the customer data of a company, data
mining tools can build a predictive model that can tell you which
customers are at risk of churn or loss.
Machine Learning:
Machine learning is a technique through which machines can be made intelligent; this means they can
make decisions on their own, classify things, predict outcomes, and recommend items based on your likes.
The aim is to make the machine (algorithm) smart enough to take an intelligent decision.
Advantages of Machine Learning:
ML can identify complex patterns and relationships in data that may be difficult for humans to detect.
ML can be used to make predictions about future events based on historical data.
ML can be used to automate tasks and processes, which can save time and reduce errors.
Disadvantages of Machine Learning:
ML requires significant computing resources and may be too expensive for smaller businesses.
ML algorithms can be difficult to interpret, which may make it challenging to explain the reasoning behind
the algorithm’s predictions.
ML requires extensive data preparation and cleaning before it can be used effectively.
Q.4. Different types of Data: Structured data, unstructured data and semi-structured data.
Ans :
1. STRUCTURED DATA
Structured Data is highly organized and formatted, often in a tabular form with a
well-defined schema. It is easy to query, analyze, and store in databases. Each data
element is categorized and has a clear relationship with other elements.
Characteristics :
- ORGANIZED: Data is arranged in a structured format, often using rows and columns.
- CLEAR SCHEMA: The data structure is defined in advance, specifying data types for
each field.
- EASY TO QUERY: Structured Data can be queried using standard SQL or other query
languages.
EXAMPLE: Relational databases, spreadsheets, CSV files.
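As a brief illustration of how structured data is queried with standard SQL, the following is a minimal Python sketch using the built-in sqlite3 module; the table name and records are hypothetical.

import sqlite3

# Build a tiny in-memory relational table and query it with standard SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
con.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                [(1, "Asha", 52000.0), (2, "Ravi", 61000.0)])

# Every row follows the same schema, so filtering and selecting columns is easy.
for row in con.execute("SELECT name, salary FROM employees WHERE salary > 55000"):
    print(row)
con.close()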
2. UNSTRUCTURED DATA
Unstructured Data lacks a predefined structure and organization. It is often in the form of
text, images, audio, video, or other formats without a clear schema. It is more challenging
to process and analyze.
CHARACTERISTICS:
LACK OF STRUCTURE: Data elements are not organized in a predefined manner.
Varied Formats: Unstructured data can include text, images, videos and audio.
Difficult to Query: Traditional databases cannot easily handle unstructured data.
EXAMPLES: Text documents, social media posts, images, audio recordings
3. SEMI-STRUCTURED DATA
Semi-structured data falls between structured and unstructured data. It is partially
structured and follows a format, but it may not conform to a strict schema.
Semi-structured data often uses metadata or tags for organization.
CHARACTERISTICS:
Self-describing: Semi-structured data often includes labels, tags, or markers to describe
its elements.
Flexibility: It allows for variations in data structure and can accommodate changing data
formats.
EXAMPLES: JSON, XML, YAML files
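The following is a small Python sketch showing how a semi-structured JSON record with nested and variable-length fields can be parsed; the record itself is invented for illustration.

import json

# A hypothetical semi-structured record: "address" is nested and "phones"
# can hold a varying number of entries from record to record.
raw = '{"name": "Asha", "address": {"city": "Pune", "pin": "411001"}, "phones": ["98xxxx", "97xxxx"]}'

record = json.loads(raw)          # parse the JSON text into Python dicts/lists
print(record["name"])             # access a top-level field
print(record["address"]["city"])  # access a nested field
print(len(record["phones"]))      # fields may vary in length across records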
The data wrangling process often includes transforming, cleansing, and enriching data from multiple
sources. As a result of data wrangling, the data being analyzed is more accurate and meaningful, leading
to better solutions, decisions, and outcomes.
Because of the increase in data collection and usage, especially diverse and unstructured data from
multiple data sources, organizations are now dealing with larger amounts of raw data and preparing it for
analysis can be time-consuming and costly.
Self-service approaches and analytics automation can speed up and increase the accuracy of data
wrangling processes by eliminating the errors that can be introduced by people when they transform data
using Excel or other manual processes.
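The following is a minimal pandas sketch of the wrangling steps mentioned above (cleansing and transforming raw data); the table and column names are hypothetical.

import pandas as pd

# Hypothetical raw sales data with missing values and inconsistent text.
raw = pd.DataFrame({
    "customer": ["  Asha ", "Ravi", None, "Meena"],
    "amount":   [250.0, None, 300.0, 410.0],
    "city":     ["pune", "Mumbai", "PUNE", "mumbai"],
})

clean = (
    raw.dropna(subset=["customer"])                                  # drop rows with no customer
       .assign(customer=lambda d: d["customer"].str.strip(),         # trim stray whitespace
               city=lambda d: d["city"].str.title(),                 # standardize text case
               amount=lambda d: d["amount"].fillna(d["amount"].median()))  # impute missing amounts
)
print(clean)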
Diagram :
Creating new features gives you a deeper understanding of your data and results in more valuable insights.
When done correctly, feature engineering is one of the most valuable techniques of data science, but it is
also one of the most challenging. A common example of feature engineering is when your doctor uses
your body mass index (BMI). BMI is calculated from both body weight and height, and serves as a
surrogate for a characteristic that is very hard to accurately measure: the proportion of lean body mass.
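A minimal pandas sketch of this BMI example, assuming hypothetical weight and height columns:

import pandas as pd

# Hypothetical patient data; weight in kilograms, height in metres.
patients = pd.DataFrame({
    "weight_kg": [70, 85, 60],
    "height_m":  [1.75, 1.80, 1.62],
})

# Engineer the BMI feature described above: weight divided by height squared.
patients["bmi"] = patients["weight_kg"] / patients["height_m"] ** 2
print(patients)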
Numerical Features: Values with numeric types (int, float, etc.). Examples: age, salary, height.
Categorical Features: Features that can take one of a limited number of values. Examples: gender (male,
female, non-binary), color (red, blue, green).
Ordinal Features: Categorical features that have a clear ordering. Examples: T-shirt size (S, M, L, XL).
Binary Features: A special case of categorical features with only two categories. Examples: is_smoker (yes,
no), has_subscription (true, false).
Text Features: Features that contain textual data. Textual data typically requires special preprocessing
steps (like tokenization) to transform it into a format suitable for machine learning models.
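The following is a small pandas sketch showing how categorical, ordinal, and binary features of the kinds listed above might be encoded for a model; the column names and values are invented for illustration.

import pandas as pd

# Hypothetical records containing categorical, ordinal, and binary features.
df = pd.DataFrame({
    "color":     ["red", "blue", "green"],   # categorical
    "tshirt":    ["S", "XL", "M"],           # ordinal
    "is_smoker": ["yes", "no", "no"],        # binary
})

# One-hot encode the unordered categorical feature.
encoded = pd.get_dummies(df, columns=["color"])

# Map the ordinal feature to integers that preserve its ordering.
size_order = {"S": 0, "M": 1, "L": 2, "XL": 3}
encoded["tshirt"] = encoded["tshirt"].map(size_order)

# Convert the binary feature to 0/1.
encoded["is_smoker"] = (encoded["is_smoker"] == "yes").astype(int)
print(encoded)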
Q.7. Data Science Libraries
Ans.
NumPy
NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi-
dimensional arrays and matrices, along with a collection of mathematical functions to operate on these
arrays efficiently. NumPy forms the foundation for many other data science libraries in the Python
ecosystem.
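A minimal NumPy sketch of the array operations described above:

import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2-D array (matrix)
print(a.shape)          # (2, 3)
print(a.mean())         # mean of all elements
print(a.sum(axis=0))    # column-wise sums
print(a * 2)            # vectorized arithmetic, no explicit loops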
Pandas
Pandas is a powerful data manipulation and analysis library in Python. It offers easy-to-use data structures,
such as DataFrames and Series, which enable data scientists to perform various operations like filtering,
grouping, and merging data. Pandas simplifies the handling of structured data and plays a crucial role in
exploratory data analysis.
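A minimal pandas sketch of filtering and grouping, using a hypothetical sales table:

import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame({
    "product": ["pen", "pen", "book", "book"],
    "region":  ["north", "south", "north", "south"],
    "units":   [10, 7, 3, 5],
})

high = sales[sales["units"] > 4]                    # filtering rows
totals = sales.groupby("product")["units"].sum()    # grouping and aggregation
print(high)
print(totals)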
Matplotlib
Matplotlib is a popular plotting library in Python. It provides a flexible and comprehensive set of functions
for creating various types of visualizations, including line plots, scatter plots, bar plots, and histograms.
Matplotlib allows data scientists to present their findings visually and communicate insights effectively.
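A minimal Matplotlib sketch of a simple line plot, using invented monthly revenue figures:

import matplotlib.pyplot as plt

# Hypothetical monthly revenue figures.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]

plt.plot(months, revenue, marker="o")   # a simple line plot
plt.title("Monthly revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (in thousands)")
plt.show()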
Scikit-learn
Scikit-learn is a machine learning library for Python that offers a wide range of algorithms and tools for
tasks like classification, regression, clustering, and dimensionality reduction. It provides an easy-to-use
interface and supports various evaluation metrics, making it a valuable asset for both beginners and
experienced data scientists.
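A minimal scikit-learn sketch of a classification workflow on the bundled Iris dataset; the choice of a k-nearest-neighbours classifier here is just one illustrative option:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)   # a simple classification algorithm
model.fit(X_train, y_train)                   # train on the labelled training split
print(accuracy_score(y_test, model.predict(X_test)))   # evaluate on the held-out split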
TensorFlow
TensorFlow is an open-source machine learning framework developed by Google. It specializes in building
and training deep learning models. TensorFlow provides a high-level API for constructing neural networks,
as well as lower-level capabilities for fine-tuning models. It has gained popularity due to its scalability and
extensive support for deployment across different platforms.
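A minimal TensorFlow/Keras sketch of a small feed-forward network; the layer sizes and input shape are arbitrary choices for illustration, and training is left commented out since no data is loaded here:

import tensorflow as tf

# A small feed-forward network for 10-class classification of 784-dimensional inputs.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(784,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(x_train, y_train, epochs=5)  # train once real data is available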
1. Data Cleaning: EDA involves examining the data for errors, missing values, and inconsistencies. It
includes techniques such as data imputation, handling missing data, and identifying and removing
outliers.
2. Descriptive Statistics: EDA uses descriptive statistics to understand the central tendency, variability,
and distribution of variables. Measures like mean, median, mode, standard deviation, range, and
percentiles are commonly used (see the sketch after this list).
3. Data Visualization: EDA employs visual techniques to represent the data graphically. Visualizations
such as histograms, box plots, scatter plots, line plots, heatmaps, and bar charts help in identifying
patterns, trends, and relationships within the data.
4. Feature Engineering: EDA allows for the exploration of variables and their transformations to create
new features or derive meaningful insights. Feature engineering can involve scaling, normalization,
binning, encoding categorical variables, and creating interaction or derived variables.
5. Correlation and Relationships: EDA helps discover relationships and dependencies between variables.
Techniques such as correlation analysis, scatter plots, and cross-tabulations offer insights into the strength
and direction of relationships between variables (see the sketch after this list).
6. Data Segmentation: EDA can involve dividing the data into meaningful segments based on certain
criteria or characteristics. This segmentation helps gain insights into specific subgroups within the
data and can lead to more focused analysis.
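A minimal pandas sketch of points 2 and 5 above, using a small hypothetical survey table (the column names and values are invented for illustration):

import pandas as pd

# Hypothetical survey data used for a quick exploratory pass.
df = pd.DataFrame({
    "age":    [23, 31, 35, 29, 41, 35, 52, 27],
    "income": [30, 42, 50, 39, 65, 52, 80, 35],   # in thousands
})

print(df.describe())   # count, mean, std, min, quartiles, max per column
print(df.corr())       # pairwise correlations, e.g. between age and income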
1. Supervised Learning :
Supervised learning is a type of machine learning in which
machines are trained using well "labelled" training data, and
on the basis of that data, machines predict the output.
Labelled data means that some input data is already tagged with
the correct output.
2. Unsupervised Learning :
Unsupervised learning is a type of machine learning in which machines are trained on
unlabelled data; without supervision, the machine tries to find hidden patterns,
groupings, or structure in the data on its own.
Regression :
Regression is a supervised learning technique which helps in finding the correlation between
variables and enables us to predict the continuous output variable based on one or more
predictor variables. It is mainly used for prediction, forecasting, time series modeling, and
determining the causal-effect relationship between variables.
Types of Regression
There are various types of regression used in data science and machine
learning. Each type has its own importance in different scenarios, but at the core, all
regression methods analyze the effect of the independent variable on the dependent
variable. Here we are discussing some important types of regression, which are given
below:
o Linear Regression
The name says it all: linear regression can be used only when there is a linear relationship among the
variables. It is a statistical model used to understand the association between independent variables (X) and
dependent variables (Y).
The variables taken as input are called independent variables. For example, when predicting the sales
of a gym supplement, the price and advertisement spend are the independent variables, whereas the
variable being predicted is called the dependent variable (in this case, 'sales').
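A minimal scikit-learn sketch of linear regression, using invented advertisement-spend and sales figures:

from sklearn.linear_model import LinearRegression

# Hypothetical data: advertisement spend (X) versus product sales (y).
X = [[10], [20], [30], [40], [50]]
y = [25, 45, 65, 85, 105]

model = LinearRegression()
model.fit(X, y)                        # fit a straight line through the points
print(model.coef_, model.intercept_)   # slope and intercept of the fitted line
print(model.predict([[60]]))           # predicted sales for a new spend value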
o Logistic Regression
Logistic regression analysis is generally used to find the probability of an event. It is used
when the dependent variable is dichotomous or binary. For example, if the output is 0 or
1, True or False, Yes or No, Cat or Dog, etc., it is said to be a binary variable. Since it gives
us the probability, the output will be in the range of 0-1.
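A minimal scikit-learn sketch of logistic regression on an invented pass/fail example, showing that the predicted probabilities fall in the range 0-1:

from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (X) versus pass (1) / fail (0) outcome.
X = [[1], [2], [3], [4], [5], [6], [7], [8]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict([[4.5]]))        # predicted class (0 or 1)
print(clf.predict_proba([[4.5]]))  # probability of each class, in the range 0-1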
STEPWISE REGRESSION
2. Precision:
Precision focuses on the proportion of correctly predicted positive instances (TP) out of all
predicted positive instances (TP + FP). It quantifies the model's ability to avoid false
positives. Precision is calculated as TP / (TP + FP). A high precision indicates a low number
of false positives.
3. Recall:
Recall (also known as sensitivity or true positive rate) measures the proportion of correctly
predicted positive instances (TP) out of all actual positive instances (TP + FN). It determines
the model's ability to identify all the positive instances in the dataset. Recall is
calculated as TP / (TP + FN). A high recall indicates a low number of false negatives.
4. F1 Score
The F1 score is the harmonic mean of precision and recall, which
makes it sensitive to small values. This means if either precision
or recall is significantly lower than the other, it will have a more
pronounced impact on the F1 score.
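A quick worked example of these formulas in Python, using illustrative counts from a hypothetical confusion matrix:

# Illustrative counts from a hypothetical confusion matrix.
tp, fp, fn, tn = 40, 10, 5, 45

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 0.85
precision = tp / (tp + fp)                                  # 0.80
recall    = tp / (tp + fn)                                  # about 0.89
f1        = 2 * precision * recall / (precision + recall)   # about 0.84
print(accuracy, precision, recall, f1)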
Q.12. Machine Learning Algorithms :
Ans :
Random Forest :
The random forest algorithm is a powerful supervised machine learning technique used for both
classification and regression tasks. It is used to find patterns in data (classification) and to predict
outcomes (regression). During training, the algorithm constructs numerous decision trees, each
built on a unique subset of the training data. These individual trees then vote on the final
prediction, leading to a robust and accurate outcome.
In a random forest, many decision trees are made during training. Each tree is created separately
using a random part of the training data. When making predictions, each tree in the forest makes
its own prediction. Finally, the overall prediction is decided by combining these individual
predictions. Random Forest is recommended when dealing with diverse datasets, especially when
you prioritize a balance between model interpretability and performance. Its ability to avoid
overfitting and work well with high-dimensional data makes it a suitable choice in a wide range
of applications, including regression and classification tasks.
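A minimal scikit-learn sketch of a random forest classifier on the bundled Iris dataset; the number of trees shown is just the library default made explicit:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 decision trees, each trained on a bootstrap sample of the training data;
# their individual predictions are combined into the final prediction.
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))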
Q.13. Explain the metrics used to evaluate the performance of classification (confusion matrix,
precision, F1 score, recall and accuracy).
Ans :
The confusion matrix serves as a foundational tool for evaluating the performance of a
classification model. It compares the model's predicted values against the actual values
and presents a detailed breakdown of true positives (TP), true negatives (TN), false
positives (FP), and false negatives (FN). This matrix offers a comprehensive view of how
well the model is performing across different classes.
Precision, also known as positive predictive value, assesses the accuracy of the model's
positive predictions. It calculates the proportion of correctly predicted positive instances
out of all instances predicted as positive (TP / (TP + FP)). A high precision score indicates
that the model is effectively identifying positive cases.
Recall, also referred to as sensitivity or true positive rate, measures the model's ability to
correctly identify positive instances from the actual positives. It is calculated as TP / (TP
+ FN), highlighting the model's capacity to capture all positive instances without missing
any. A high recall score suggests that the model is sensitive to positive cases.
The F1 score combines precision and recall into a single metric, offering a balanced
evaluation of the model's performance. It is calculated as 2 * ((precision * recall) /
(precision + recall)) and considers both false positives and false negatives. A higher F1
score indicates a better balance between precision and recall, showcasing the model's
effectiveness in handling both positive and negative cases.
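A minimal scikit-learn sketch computing all of these metrics for a hypothetical set of true labels and predictions:

from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# Hypothetical true labels and model predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows are actual classes: [[TN, FP], [FN, TP]]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1 score :", f1_score(y_true, y_pred))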
Q.14. Explain the types of visualization with examples (bar chart, scatter plot, box plot, line chart, heat
map).
Ans :
Visualization is a powerful tool for presenting data and information in a visual format, making it easier to
understand and interpret. There are various types of visualizations, each with its own purpose and
characteristics. Let's explore some common types of visualizations along with examples:
1. Bar Chart:
A bar chart represents categorical data using rectangular bars of different lengths or heights. It is
commonly used to compare and display numerical values for different categories. For example, a bar chart
can be used to show the sales performance of different products over a specific time period.
2. Scatter Plot:
A scatter plot displays the relationship between two numerical variables. It consists of a series of points
on a graph, with each point representing the value of the variables on the x and y-axis. Scatter plots are
useful in identifying patterns, correlations, clusters, or outliers in data. For instance, a scatter plot can be
employed to visualize the correlation between the age and income of individuals in a population.
3. Box Plot:
A box plot, also known as a box-and-whisker plot, summarizes the distribution and statistical properties of
numerical data. It uses a rectangular box to represent the interquartile range (IQR), with lines (whiskers)
extending from the box to indicate the range of the data. Box plots provide insights into the central
tendency, spread, and skewness of the data. For example, a box plot can be used to compare the salaries
across different job titles in a company.
4. Link Chart:
A link chart, or network diagram, visualizes connections or relationships between entities. It uses nodes
to represent individual entities and edges or links to indicate the connections between them. Link charts
are commonly used in social network analysis, graph theory, and organizational charts. For instance, a link
chart can be used to visualize the connections between users on a social media platform.
5. Heat Map:
A heat map displays data values using a color scale, where each cell or pixel represents a data point and
its color intensity represents the value. Heat maps are often used to visualize data on a grid or matrix,
making it easier to identify patterns, trends, or concentration of values. For example, a heat map can be
used to represent the population density of different regions in a country.
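A minimal Matplotlib sketch drawing four of the chart types described above (bar chart, scatter plot, box plot, and heat map) with invented data:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

# Bar chart: sales of three hypothetical products.
axes[0, 0].bar(["A", "B", "C"], [30, 45, 22])
axes[0, 0].set_title("Bar chart")

# Scatter plot: two related numeric variables.
x = rng.normal(size=100)
axes[0, 1].scatter(x, 2 * x + rng.normal(scale=0.5, size=100), s=10)
axes[0, 1].set_title("Scatter plot")

# Box plot: distributions of two invented samples.
axes[1, 0].boxplot([rng.normal(0, 1, 100), rng.normal(1, 2, 100)])
axes[1, 0].set_title("Box plot")

# Heat map: a small matrix of values shown as colours.
im = axes[1, 1].imshow(rng.random((5, 5)), cmap="viridis")
fig.colorbar(im, ax=axes[1, 1])
axes[1, 1].set_title("Heat map")

plt.tight_layout()
plt.show()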