Data Science Notes
● Data science is the field of study that combines domain expertise, programming skills, and knowledge of
mathematics and statistics to extract meaningful insights from data.
● Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and
more to produce artificial intelligence (AI) systems to perform tasks that ordinarily require human
intelligence. In turn, these systems generate insights which analysts and business users can translate
into tangible business value.
● A data scientist is someone who creates programming code and combines it with statistical knowledge
to create insights from data.
2) Difference between Data Science and Big Data

Data Science:
● It is about the collection, processing, analysis, and utilization of data in various operations. It is more conceptual.
● It is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics.
● The goal is to build data-dominant products for a venture.
● Tools mainly used in Data Science include SAS, R, Python, etc.
● It is a superset of Big Data, as Data Science consists of data scraping, cleaning, visualization, statistics, and many more techniques.
● It is mainly used for scientific purposes.
● It broadly focuses on the science of data.

Big Data:
● It is about extracting vital and valuable information from a huge amount of data.
● It is a technique for tracking and discovering trends in complex data sets.
● The goal is to make data more vital and usable, i.e. by extracting only the important information from the huge data within existing traditional aspects.
● Tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
● It is a subset of Data Science, as its mining activities form one stage of the Data Science pipeline.
● It is mainly used for business purposes and customer satisfaction.
● It is more involved with the processes of handling voluminous data.
3) How to clean up and organize big data sets towards Data Science?
1. Removing duplicate or irrelevant observations
Duplicate observations most frequently arise during data collection, and irrelevant observations are those that
don't actually fit the specific problem that you're trying to solve.
1. Redundant observations reduce efficiency to a great extent: because the data repeats, it may tip
the results toward the correct or the incorrect side, producing unreliable results.
2. Irrelevant observations are any type of data that is of no use to us and can be removed
directly.
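A minimal R sketch of both steps, using a small hypothetical data frame (the column names and the "irrelevant" rule are illustrative):

# Hypothetical raw data with a duplicate row and an irrelevant segment
df <- data.frame(id = c(1, 2, 2, 3),
                 country = c("US", "UK", "UK", "DE"),
                 spend = c(10, 20, 20, 5))
df <- df[!duplicated(df), ]        # remove exact duplicate (redundant) observations
df <- subset(df, country != "DE")  # drop observations irrelevant to the problem at hand
df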
2. Fixing Structural errors
The errors that arise during measurement, transfer of data, or other similar situations are called
structural errors. Structural errors include typos in the name of features, the same attribute with a
different name, mislabeled classes, i.e. separate classes that should really be the same, or
inconsistent capitalization.
1. For example, the model will treat "America" and "america" as different classes or values,
though they represent the same value, or red, yellow, and red-yellow as different classes
or attributes, though one class can be included in the other two. These structural errors
make our model inefficient and give poor-quality results.
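A short R sketch of fixing such structural errors; the class mapping for the colour example is an assumed rule:

# Inconsistent capitalization: "America" and "america" should be one class
country <- c("America", "america", " AMERICA ")
country <- tolower(trimws(country))   # standardize case and whitespace
table(country)

# Mislabeled / overlapping classes: fold "red-yellow" into an existing class (assumed mapping)
colour <- c("red", "yellow", "red-yellow")
colour[colour == "red-yellow"] <- "yellow"
table(colour)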
3. Managing Unwanted outliers
Outliers can cause problems with certain types of models. For example, linear regression models are
less robust to outliers than decision tree models. Generally, we should not remove outliers unless we
have a legitimate reason to do so. Sometimes removing them improves performance, sometimes not.
So one must have a good reason to remove an outlier, such as suspicious measurements that are
unlikely to be part of the real data.
4. Handling missing data
Missing data is a deceptively tricky issue in data science. We cannot just ignore or remove the
missing observation. They must be handled carefully as they can be an indication of something
important. The two most common ways to deal with missing data are:
1. Dropping observations with missing values.
■ The fact that the value was missing may be informative in itself.
■ Plus, in the real world, you often need to make predictions on new data even if
some of the features are missing!
2. Imputing the missing values from past observations.
● Again, “missingness” is almost always informative in itself, and you should tell your
algorithm if a value was missing.
● Even if you build a model to impute your values, you’re not adding any real
information. You’re just reinforcing the patterns already provided by other
features.
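Both approaches above can be sketched in R; the data frame and column names here are hypothetical:

# Toy data frame with missing values
df <- data.frame(age = c(25, NA, 31, 47, NA),
                 spend = c(200, 150, NA, 300, 120))

# Option 1: drop observations with missing values
df_drop <- na.omit(df)

# Option 2: impute with the column mean, but keep a flag so the model
# is "told" that the value was originally missing
df_imp <- df
df_imp$age_missing <- is.na(df_imp$age)
df_imp$age[is.na(df_imp$age)] <- mean(df$age, na.rm = TRUE)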
4) What is Datafication? Explain with suitable examples.
● Datafication is the transformation of social action into online quantified data, thus allowing for real-time
tracking and predictive analysis.
● Simply said, it is about taking previously invisible processes/activity and turning it into data that can be
monitored, tracked, analyzed and optimized.
● The latest technologies we use have enabled lots of new ways to 'datify' our daily and basic activities.
● Summarizing, datafication is a technological trend turning many aspects of our lives into computerized
data using processes to transform organizations into data-driven enterprises by converting this
information into new forms of value.
● Datafication refers to the fact that daily interactions of living things can be rendered into a data format
and put to social use.
● Example - Social platforms such as Facebook or Instagram collect and monitor data about our
friendships in order to market products and services to us and to provide surveillance services to
agencies, which in turn changes our behavior; the promotions that we see daily on social media are also
the result of this monitored data. In this model, datafication is used to inform how content is created,
rather than only to drive recommendation systems.
● Other examples -
○ Insurance: Data used to update risk profile development and business models.
○ Banking: Data used to establish trustworthiness and likelihood of a person paying back a loan.
○ Human resources: Data used to identify e.g. employees risk-taking profiles.
○ Hiring and recruitment: Data used to replace personality tests.
○ Social science research: Datafication replaces sampling techniques and restructures the manner
in which social science research is performed.
5) What are the 8 Data Science skills that will get you hired?
● Programming
○ No matter what type of company or role you're interviewing for, you're likely going to be
expected to know how to use the tools of the trade — and that includes several programming
languages.
○ You’ll be expected to know a statistical programming language, like R or Python, and a database
querying language like SQL.
● Statistics
○ Statistics is important at all company types, but especially data-driven companies where
stakeholders will depend on your help to make decisions and design / evaluate experiments.
● Machine Learning
○ If you’re at a large company with huge amounts of data or working at a company where the
product itself is especially data-driven (e.g. Netflix, Google Maps, Uber), it may be the case that
you’ll want to be familiar with machine learning methods.
○ This can mean things like k-nearest neighbors, random forests, ensemble methods, and more.
○ A lot of these techniques can be implemented using R or Python libraries so it’s not necessary to
become an expert on how the algorithms work.
○ Your goal is to understand the broad strokes and when it’s appropriate to use different
techniques.
○ Understanding these concepts is most important at companies where the product is defined by
the data, and small improvements in predictive performance or algorithm optimization can lead to
huge wins for the company.
● Multivariable Calculus and Linear Algebra
○ In an interview for a data science role, you may be asked to derive some of the machine learning
or statistics results you employ elsewhere. Or, your interviewer may ask you some basic
multivariable calculus or linear algebra questions, since they form the basis of a lot of these
techniques.
○ You may wonder why a data scientist would need to understand this when there are so many
out-of-the-box implementations in Python or R. The answer is that at a certain point, it can
become worth it for a data science team to build out their own implementations in house.
● Data Wrangling
○ Often, the data you’re analyzing is going to be messy and difficult to work with. Because of this,
it’s really important to know how to deal with imperfections in data — aka data wrangling.
○ Some examples of data imperfections include missing values, inconsistent string formatting (e.g.,
‘New York’ versus ‘new york’ versus ‘ny’), and date formatting (‘2021-01-01’ vs. ‘01/01/2021’, unix
time vs. timestamps, etc.).
○ This will be most important at small companies where you’re an early data hire, or data-driven
companies where the product is not data-related (particularly because the latter has often grown
quickly with not much attention to data cleanliness), but this skill is important for everyone to
have.
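A small base-R sketch of cleaning up the kinds of string and date inconsistencies mentioned above (the alias mapping and date formats are assumptions):

# Inconsistent string formatting
city <- c("New York", "new york", " NY ")
city <- trimws(tolower(city))       # consistent case and whitespace
city[city == "ny"] <- "new york"    # map a known alias to the canonical value
city

# Mixed date formats
dates  <- c("2021-01-01", "01/01/2021")
parsed <- as.Date(dates, format = "%Y-%m-%d")                                # try ISO format first
parsed[is.na(parsed)] <- as.Date(dates[is.na(parsed)], format = "%m/%d/%Y")  # fall back to US format
parsed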
● Data Visualization & Communication
○ Visualizing and communicating data is incredibly important, especially with young companies that
are making data-driven decisions for the first time, or companies where data scientists are
viewed as people who help others make data-driven decisions.
○ When it comes to communicating, this means describing your findings, or the way techniques
work, to both technical and non-technical audiences.
○ Visualization-wise, it can be immensely helpful to be familiar with data visualization tools like
matplotlib, ggplot, or d3.js. Tableau has become a popular data visualization and dashboarding
tool as well.
○ It is important to not just be familiar with the tools necessary to visualize data, but also the
principles behind visually encoding data and communicating information.
● Software Engineering
○ If you’re interviewing at a smaller company and are one of the first data science hires, it can be
important to have a strong software engineering background.
○ You’ll be responsible for handling a lot of data logging, and potentially the development of
data-driven products.
● Data Intuition
6) Five real-time challenges faced by the Data Science industry and how to combat them?
1) Problem-Identification
● One of the major concerns in analyzing a problem is to identify it accurately for designing a better
solution and defining each aspect of it.
● We have seen data scientists try mechanical approaches by beginning their work on data and tools
without getting a clear understanding of the business requirement from the client.
● There should be a well-defined workflow before starting off with the analysis of the data.
● Therefore, as a first step, you need to identify the problem very well to design a proper solution and build
a checklist to tick off as you analyze the results.
2) Access to the Right Data
● It is vital to get your hands on the right kind of data for the right analysis, which can be a little
time-consuming, as you need to access the data in the most proper format.
● There might be issues ranging from hidden data and insufficient data volume to too little data variety.
● It is also a kind of challenge to gain permission for accessing the data from various businesses.
● Data scientists are expected to manage the data management system and other information integration
tools such as Stream analytics software which is used for data filtering and aggregation.
● The software allows to connect all the external data sources and sync them in the proper workflow.
3) Data Cleansing
● Working with big data can become expensive, because data cleansing adds considerably to operating
expenses before the data can generate more revenue.
● It can be a nightmare for every data scientist to work with databases which are full of inconsistencies and
anomalies as unwanted data leads to unwanted results.
● Here, they work with tons of data and spend a huge amount of time in sanitizing the data before
analyzing.
● Data scientists make use of data governance tools for improving their overall accuracy and data
formatting.
● In addition to this, maintaining data quality should be everyone’s goal and businesses need to function
across the enterprise to benefit from good quality data.
● Bad data can result in a big enterprise issue.
4) Lack of Professionals
● It is one of the biggest misconceptions to expect that data scientists only need to be good at high-end
tools and mechanisms.
● They also need sound domain knowledge and subject depth.
● Data scientists are considered as bridging the gap between the IT department and top management as
domain expertise is required for conveying the needs of the business to the IT department and vice
versa.
● To resolve this, data scientists need to get more useful insights from businesses in order to understand
the problem and work accordingly by modeling the solutions.
● They also need to focus on the requirements of the businesses by mastering statistical and technical
tools.
● In big corporations, a Data Scientist is regarded as a jack of all trades who is assigned with the task of
getting the data, building the model, and making right business decisions which is a big ask for any
individual.
● In a Data Science team, the role should be split among different individuals, such as Data Engineering,
Data Visualization, Predictive Analytics, model building, and so on.
7) Explain the characteristics (5 V's) of Big Data.
1. Volume:
● To determine the value of data, the size of the data plays a very crucial role. If the volume of data is
very large, then it is actually considered 'Big Data'. This means that whether a particular data set can
actually be considered Big Data or not depends upon the volume of data.
● Hence, while dealing with Big Data it is necessary to consider the characteristic 'Volume'.
● Example: In 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month,
and it was estimated that by 2020 there would be almost 40,000 exabytes of data.
2. Velocity:
● It refers to the high speed of accumulation of data, i.e. the rate at which data is generated, collected,
and processed (for example, streams of social-media posts or sensor readings arriving every second).
3. Variety:
● It refers to the nature of data that is structured, semi-structured and unstructured data.
● It also refers to heterogeneous sources.
● Variety is basically the arrival of data from new sources that are both inside and outside of an
enterprise. It can be structured, semi-structured and unstructured.
○ Structured data: This data is basically an organized data. It generally refers to data that
has defined the length and format of data.
○ Semi- Structured data: This data is basically a semi-organised data. It is generally a form
of data that does not conform to the formal structure of data. Log files are examples of
this type of data.
○ Unstructured data: This data basically refers to unorganized data. It generally refers to
data that doesn’t fit neatly into the traditional row and column structure of the relational
database. Texts, pictures, videos etc. are examples of unstructured data which can’t be
stored in the form of rows and columns.
4. Veracity:
● It refers to inconsistencies and uncertainty in data, that is data which is available can sometimes get
messy and quality and accuracy are difficult to control.
● Big Data is also variable because of the multitude of data dimensions resulting from multiple
disparate data types and sources.
● Example: Data in bulk could create confusion whereas less amount of data could convey half or
Incomplete Information.
5. Value:
● After taking the 4 V's into account, there comes one more V, which stands for Value! Bulk data that has
no value is of no good to the company unless it is turned into something useful.
● Data in itself is of no use or importance; it needs to be converted into something valuable in order to
extract information. Hence, Value is regarded as the most important of the V's.
8) Explain systematic approach for finding business needs to translate data available into business
value.
Refer Q9.
9) How can you convert a Business Problem into a Data Problem? Elaborate with a suitable example.
● Many times data scientists are presented with very vague problems such as how to reduce customer
churn, how to increase revenue, how to cut cost, how to improve sales, what do users want.
● These problems are very vague, however, it is the job of the data scientist to frame and define it in a way
that can be solved with data science. A data scientist is expected to probe and ask the stakeholders
questions.
● For example, if the business wants to reduce churn and increase revenue, you want to ask the stakeholder
questions like: What strategies do you employ to retain customers? What initiatives does the business
employ to increase revenue? What promotions are given to users? What are the major pain points that
you experienced that led to a loss of revenue? Which product had the most decline in revenue?
● Try to get a balanced perspective from stakeholders, if some users are not happy with the products,
compare their view with those that are happy with it. It helps identify bias.
● In defining the problem, the problem posed by the stakeholder might not always be the pressing
problem. For example, the stakeholder might want to find out why the users come to the website but do
not purchase anything meanwhile the real problem is, can they improve the recommendations to users
that align with their interest and push them to place an order.
When defining the problem, it is important to think in terms of the decision that needs to be made to solve the
problem such as Which user would churn in the next 70 days? Which user must be given the discounts to stay
back on the app and when to trigger them? To a new user who has just landed on the app, what is the right ad to
show?
● Consider timing. The problem should be framed in a way that would enable the decision to be made with
respect to the time. For example, when should a particular ad be shown to a user for maximum
conversion
● Analyze every data science problem in a way that would lead to quantifiable impact for users, such as an
increase in daily active users, and quantifiable impact for stakeholders, such as an increase in revenue at
a lower cost.
● Now you have defined your problem, “which user should be given a discount to prevent them from
churning in the 70 days?”
For example, the goal of “which user should be given a discount to prevent them from churning in the 70 days”
is clear enough from a business perspective, but in terms of running an actual analysis, we need to further break
it down into smaller milestones.
● How do we identify customers that are going to churn in the next 70 days?
● What criteria should be used to determine who should be given a discount?
● What features can be used to differentiate churners from non-churners?
● What is the lifetime value for each customer?
● How do we determine when to trigger them with a discount, what data do we need?
These questions also guide you in thinking of important data points while solving your problem. Thinking in
terms of milestones helps to foresee dependencies.
● After defining your problem and setting your milestones, you want to start building the solution as a data
scientist, you want to build a minimum viable product that allows you to provide value to your
stakeholders in smaller increments.
● For example, if a client wants to build a mansion, inexperienced data scientists will then try to figure out
how to build the mansion they were asked for. Experienced data scientists will try to figure out how to
build a shed, then figure out how to turn the shed to a tent, then to a hut, a bungalow, a storey building
and finally a mansion.
● It is important to consider the following questions when building MVP
○ What is the smallest benefit stakeholders could get from the analysis and still consider it
valuable?
○ When do stakeholders need results? Do they need all the results at once, or do some results
have a more pressing deadline than others?
○ What is the simplest way to meet a benchmark, regardless of whether you consider it the “best”
way?
● The typical journey of a data science product is
○ Descriptive solution — tells you what happened in the past.
○ Diagnostic solution — helps you understand why something happened in the past.
○ Predictive solution — predicts what is most likely to happen in the future.
○ Prescriptive solution — not only identifies what’s likely to happen but also provides insights and
recommends actions you can take to affect those outcomes.
● A data scientist should plan in sprints, think modularly and get regular feedback from the stakeholders.
Having a target metric is important because it tells you and your stakeholders how successful your data science
solution is in solving the business problem.
● Think explicitly about trade-offs. Almost any metric will involve a trade-off. For example, in a classification
problem, “precision” focuses on minimizing false positives, while “recall” focuses on minimizing false
negatives. False positives might be more important to the business than false negatives, or the reverse
could be true.
● Which is more harmful: identifying a loyal customer as likely to churn, or identifying a likely-to-churn
customer as loyal? The stakeholders want to identify customers that are likely to churn, so identifying
likely-to-churn customers as loyal would not help the business. Hence we want to reduce false negatives,
and a high-recall model would be more suitable (see the short R sketch after this list).
● Find out the business’s “value” units: Find out what unit of value your stakeholders think in, and estimate
the value of your analysis using that unit. For example, stakeholders have said that they want to reduce
churn, but upon further investigation, you might find that what they really want is increased daily active
users which in turn impacts revenue.
● Subset all metrics. An analysis should almost never have only one set of metrics. All metrics used for the
analysis as a whole should be repeated for any relevant subsets: customer age bracket, customer spend,
site visit, etc. An analysis may perform very well on average but abjectly fail for certain subsets.
Make it as non-technical and explainable as possible; stakeholders need to be able to understand whatever
metrics you use.
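A small R sketch of the precision/recall trade-off discussed above, using hypothetical predictions (1 = churner):

actual    <- factor(c(1, 1, 0, 1, 0, 0, 1, 0), levels = c(0, 1))
predicted <- factor(c(1, 0, 0, 1, 0, 1, 1, 0), levels = c(0, 1))
tab <- table(predicted, actual)
precision <- tab["1", "1"] / sum(tab["1", ])   # TP / (TP + FP): how many flagged churners really churn
recall    <- tab["1", "1"] / sum(tab[, "1"])   # TP / (TP + FN): how many real churners we catch
c(precision = precision, recall = recall)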
● You would need to standardize the different columns and get the data to the format you need. There
could be a lot of inconsistencies in the data and cleaning and transforming this data becomes really
crucial. As you go deeper into data wrangling, analysis and aligning the data to the problem, more such
challenges would arise that need to be overcome.
● Data is the key to success or failure for any data science project
● Ensure you check for sufficiency of data to solve the problem. Sometimes you won’t even realize that a
crucial data point is missing until you are in the thick of your analysis.
● Identify all dataset needs ahead of time. Make sure you have all the pieces to the data puzzle available.
For example, you could say: “customer age bracket, site visit, location, customer spend to start with.”
● If data from different datasets don’t have a common key on which to join the information, or you can’t get
access to some datasets even though they exist, or some of the data have so many missing values that
they cannot support your use case, then your analysis will disappoint both you and your stakeholders.
● Focus on data refresh cycles: how old is the data? When does it get updated? How is it updated?
What/who decides when it is updated?
● Know when additional data collection is necessary. Sometimes the only way to complete an analysis is to
collect more data.
● It is always easier to plan for contingencies before you begin your analysis than it is to try to adapt in the
middle of your work as deadlines approach.
● Some of those problems manifest themselves only through careful Exploratory Data Analysis (EDA). It’s
easy to look at a column name and assume the dataset has what you need. Because of that, it’s very
common for data scientists to find out, at least halfway into their analysis, that the data they have isn’t
really the data they need. Hence a thorough EDA is essential before applying the methods. If you are
able to answer most of the questions in the EDA phase and surface the right insights for the stakeholders,
that is in itself a huge value add.
● Which methods/models are inappropriate for your analysis? Of those methods/models that are
appropriate, what are the costs and benefits of using each one? If you find a number of methods that are
appropriate and have roughly the same costs and benefits, how do you decide how to proceed?
● Keep constraints in mind. If your preferred method requires a GPU but you don’t have easy access to a
GPU, then it shouldn’t be your preferred method, even if you think it is analytically superior to its
alternatives. Similarly, some methods simply do not work well for large numbers of features, or only work
if you know beforehand how many clusters you want. Save time by thinking about the constraints each
method places on your work — because every method carries constraints of some kind.
● Even after you eliminated unsuitable methods and further narrowed down your list to accommodate your
project’s constraints, you will still likely have more than one method that could plausibly work for you.
There is no way to know beforehand which of these methods is better — you will have to try as many of
them as possible, and try each with as many initializing parameters as possible, to know what performs
best.
10) Explain some real-world applications of Data Science.
1. Search Engines
● As we know, when we want to search for something on the internet, we mostly use search engines like
Google, Yahoo, Bing, DuckDuckGo, etc. Data Science is used to make these searches faster.
● For example, when we search for "Data Structure and algorithm courses", we get the link to
GeeksforGeeks Courses among the first results. This happens because the GeeksforGeeks website is
visited the most for information regarding Data Structure courses and computer-related subjects. This
analysis is done using Data Science, and the most-visited web links are returned at the top.
2. Transport
● Data Science also entered into the Transport field like Driverless Cars. With the help of Driverless Cars, it
is easy to reduce the number of Accidents.
● For example, in driverless cars, training data is fed into the algorithm, and with the help of Data
Science techniques the data is analyzed, such as what the speed limit is on highways, busy streets,
and narrow roads, and how to handle different situations while driving.
3. Finance
● Data Science plays a key role in Financial Industries. Financial Industries always have an issue of fraud
and risk of losses.
● Thus, financial industries need to automate risk-of-loss analysis in order to carry out strategic decisions
for the company.
● Financial industries also use Data Science analytics tools in order to predict the future. This allows
companies to predict customer lifetime value and their stock market moves.
● For example, in the stock market, Data Science is used to examine past behavior with past data, with
the goal of estimating future outcomes. Data is analyzed in such a way that it becomes possible to
predict future stock prices over a set time frame.
4. E-Commerce
● E-Commerce Websites like Amazon, Flipkart, etc. use data Science to make a better user experience
with personalized recommendations.
● For example, when we search for something on e-commerce websites, we get suggestions similar to our
past choices based on our past data, and we also get recommendations based on the most-bought,
most-rated, and most-searched products, etc. This is all done with the help of Data Science.
5. Health Care
In the Healthcare Industry data science acts as a boon. Data Science is used for:
● Detecting Tumor.
● Drug discoveries.
● Medical Image Analysis.
● Virtual Medical Bots.
● Genetics and Genomics.
● Predictive Modeling for Diagnosis etc.
6. Image Recognition
● Data Science is used to recognize objects, faces, and patterns in images, for example when a social
platform suggests tagging the friends it has recognized in an uploaded photo.
7. Targeting Recommendation
● Based on a user's browsing and purchase history, Data Science is used to target the user with
recommendations and advertisements for the products they are most likely to want.
8. Airline Route Planning
● With the help of Data Science, the airline sector is also growing. It becomes easy to predict flight delays.
● It also helps to decide whether to fly directly to the destination or take a halt in between: for example, a
flight can take a direct route from Delhi to the U.S.A. or halt in between before reaching the destination.
9. Gaming
● In most games where a user plays against a computer opponent, Data Science concepts are used along
with machine learning, so that with the help of past data the computer improves its performance.
● There are many games, like Chess and EA Sports titles, that use Data Science concepts.
10. Medicine and Drug Development
● The process of creating medicine is very difficult and time-consuming and has to be done with full
discipline because it is a matter of someone's life.
● Without Data Science, it takes a lot of time, resources, and money to develop new medicines or drugs,
but with the help of Data Science it becomes easier, because the prediction of the success rate can be
determined based on biological data and factors.
● Algorithms based on Data Science can forecast how a compound will react in the human body without
lab experiments.
11. Delivery Logistics
● Various logistics companies like DHL, FedEx, etc. make use of Data Science.
● Data Science helps these companies to find the best route for the shipment of their products, the best
time suited for delivery, the best mode of transport to reach the destination, etc.
12. Autocomplete
● AutoComplete feature is an important part of Data Science where the user will get the facility to just type
a few letters or words, and he will get the feature of auto-completing the line.
● In Google Mail, when we are writing a formal mail to someone, the Data Science concept behind the
Autocomplete feature is used to offer an efficient suggestion for completing the whole line.
● Also in Search Engines in social media, in various apps, AutoComplete feature is widely used.
11) How does Data Science add value to a business?
1. Reduces Inefficiencies
● As much as companies value data that helps them understand their customers and internal processes,
they’re also eager to gain an edge over their competitors.
● Data scientists are responsible for understanding and gleaning insights from data about competitors.
● Effective competitor research helps businesses make competitive pricing decisions, reach new markets,
and stay up to date with changes in consumer behavior.
● By ensuring a ready stream of actionable insights about customer psychology, behavior, and satisfaction,
data science enables businesses to consistently reshape their products and services to fit with a shifting
marketplace.
● Data about customers is available from a variety of sources, and mining information from third-party
platforms, like social media, search engines, and purchased datasets, presents a unique challenge.
● One of the big problems faced by businesses when searching for new employees is the disconnect
between prospects that look good on paper and those that perform well in practice.
● Data science seeks to bridge this gap by using evidence to improve hiring practices.
● By combining and analyzing a variety of data-points about candidates, it’s possible to move towards an
ideal ‘company-employee fit’.
12) What is the need of estimation and validation for added value due to data science?
Cross Validation:
● Cross-Validation is an essential tool in the Data Scientist toolbox.
● It divides the dataset into two parts (train and test). The model is trained on the train part, and
predictions are made on the test part, which is unseen data for our model.
● After that, we will check our model to see how well it works. If the model gives us good accuracy on test
data, it means that our model is good and we can trust it.
Types of Cross-Validation:
3. K-Fold Cross-Validation:
● We split our data into K parts, let’s use K=3 for a toy example.
● If we have 3000 instances in our dataset, we split it into three parts, part 1, part 2 and part 3.
● We then build three different models, each model is trained on two parts and tested on the third.
● Our first model is trained on part 1 and 2 and tested on part 3.
● Our second model is trained on part 1 and part 3 and tested on part 2 and so on.
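A minimal base-R sketch of 3-fold cross-validation on a toy regression problem (the data and model are illustrative):

set.seed(42)
data   <- data.frame(x = rnorm(3000))
data$y <- 2 * data$x + rnorm(3000)
folds  <- sample(rep(1:3, length.out = nrow(data)))   # assign every row to one of the 3 parts

rmse <- numeric(3)
for (k in 1:3) {
  train <- data[folds != k, ]                         # train on two parts
  test  <- data[folds == k, ]                         # test on the held-out part
  model <- lm(y ~ x, data = train)
  rmse[k] <- sqrt(mean((test$y - predict(model, test))^2))
}
mean(rmse)                                            # average performance over the three models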
4. Stratified Cross-Validation:
● When we split our data into folds, we want to make sure that each fold is a good representative
of the whole data.
● The most basic example is that we want the same proportion of different classes in each fold.
Most of the time it happens by just doing it randomly, but sometimes, in complex datasets, we
have to enforce a correct distribution for each fold.
● By using cross-validation, we can make predictions on our dataset in the same way as described
before and so our second model's input will be real predictions on data that our first model has
never seen before.
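A sketch of the stratified fold assignment described above, in base R, so that each fold keeps the class proportions (the class labels are assumed):

set.seed(1)
y <- factor(sample(c("churn", "stay"), 300, replace = TRUE, prob = c(0.2, 0.8)))
folds <- integer(length(y))
for (cls in levels(y)) {
  idx <- which(y == cls)
  folds[idx] <- sample(rep(1:3, length.out = length(idx)))  # spread each class evenly over the folds
}
table(y, folds)   # every fold has roughly the same churn/stay proportion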
5. Parameters Fine-Tuning
● Most learning algorithms require some parameters tuning. We want to find the best parameters
for our problem.
● We do it by trying different values and choosing the best ones.
● There are many methods to do this. It could be a manual search, a grid search or optimization.
● However, in all these cases we can't do it on our training set, and of course not on our test set.
We have to use a third set, a validation set.
● By splitting our data into three sets instead of two, we run into all the same issues we talked
about before, especially if we don't have a lot of data.
● By doing cross-validation, we’re able to do all those steps using a single set.
13) Who is a Data Scientist? What are his responsibilities and characteristics?
Data Scientist:
● A data scientist is an analytics professional who is responsible for collecting, analyzing and interpreting
data to help drive decision-making in an organization.
● The data scientist role combines elements of several traditional and technical jobs, including
mathematician, scientist, statistician and computer programmer.
● It involves the use of advanced analytics techniques, such as machine learning and predictive modeling,
along with the application of scientific principles.
● As part of data science initiatives, data scientists often must work with large amounts of data to develop
and test hypotheses, make inferences and analyze things such as customer and market trends, financial
risks, cybersecurity threats, stock trades, equipment maintenance needs and medical conditions.
● In businesses, data scientists typically mine data for information that can be used to predict customer
behavior, identify new revenue opportunities, detect fraudulent transactions and meet other business
needs.
● They also do valuable analytics work for healthcare providers, academic institutions, government
agencies, sports teams and other types of organizations.
In many organizations, data scientists are also responsible for helping to define and promote best practices for
data collection, preparation and analysis. In addition, some data scientists develop AI technologies for use
internally or by customers -- for example, conversational AI systems, AI-driven robots and other autonomous
machines, including key components in self-driving cars.
Statistic vs Parameter
● A parameter is a numerical measure that describes a characteristic of an entire population, while a
statistic is a numerical measure that describes a characteristic of a sample drawn from that population.
Example:
● A researcher wants to know the average weight of females aged 22 years or older in India. The
researcher obtains the average weight of 54 kg, from a random sample of 40 females.
● In the given situation, the statistic is the average weight of 54 kg, calculated from a simple random
sample of 40 females in India, while the parameter is the mean weight of all females aged 22 years or
older.
● Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric
about the mean, showing that data near the mean are more frequent in occurrence than data far from
the mean.
● In graphical form, the normal distribution appears as a "bell curve".
● Its mean (average), median (midpoint), and mode (most frequent observation) are all equal to one
another. Moreover, these values all represent the peak, or highest point, of the distribution.
● The distribution then falls symmetrically around the mean, the width of which is defined by the standard
deviation.
The probability density function of the normal distribution is:

f(x) = (1 / (σ√(2π))) · e^( −(x − μ)² / (2σ²) )

where:
x = value of the variable or data being examined
f(x) = the probability density function
μ = the mean
σ = the standard deviation
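A quick R illustration of these properties, using the standard normal distribution (mean 0, standard deviation 1):

pnorm(1) - pnorm(-1)          # about 68% of values fall within one standard deviation of the mean
pnorm(2) - pnorm(-2)          # about 95% within two standard deviations
dnorm(0, mean = 0, sd = 1)    # the peak of the bell curve sits at the mean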
3) Explain Normal (Gaussian) distribution with an example. State and explain one application where this
distribution is suitable for model fitting.
Refer Q3.
4) Probability
5) Statistics
6) Distribution
1. Uniform Distribution
● Rolling a fair die has 6 discrete, equally probable outcomes
● You can roll a 1 or a 2, but not a 1.5
● The probabilities of each outcome are evenly distributed across the sample space.
2. Binomial Distribution
● “Binomial” means there are two discrete, mutually exclusive outcomes of a trial.
● heads or tails
● on or off
● sick or healthy
● success or failure
3. Poisson Distribution
● A binomial distribution considers the number of successes out of n trials
● A Poisson Distribution considers the number of successes per unit of time or any other
continuous unit, e.g. distance over the course of many units
1. Normal Distribution
● Many real-life data points follow a normal distribution: people's heights and weights, population
blood pressure, test scores, measurement errors.
● These data sources tend to cluster around a central value with no bias to the left or right, and the
distribution of values gets close to a "Normal Distribution" (a bell curve).
● Unlike discrete distributions, where the sum of all the bars equals one, in a normal distribution the
area under the curve equals one.
2. Log-normal Distribution
● This distribution is used to plot the random variables whose logarithm values follow a normal
distribution.
● Consider the random variables X and Y, where Y = ln(X) and ln denotes the natural logarithm. If Y
follows a normal distribution, then X follows a log-normal distribution.
3. Student’s T Distribution
● The student’s t distribution is similar to the normal distribution.
● The difference is that the tails of the distribution are thicker.
● This is used when the sample size is small and the population variance is not known.
● This distribution is defined by the degrees of freedom (p), which is calculated as the sample size
minus 1 (n − 1).
4. Chi-square Distribution
● This distribution is equal to the sum of squares of p normal random variables. p is the number of
degrees of freedom.
● Like the t-distribution, as the degrees of freedom increase, the distribution gradually approaches
the normal distribution.
[Figure: a chi-square distribution with three degrees of freedom]
5. Exponential Distribution
● The exponential distribution is closely related to the Poisson distribution: while the Poisson distribution
counts the number of events in a fixed interval, the exponential distribution models the waiting time
between those events.
● The events in consideration are independent of each other.
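For reference, a minimal R sketch drawing one small sample from each of the distributions discussed above (the parameter values are arbitrary):

runif(5, min = 1, max = 6)          # uniform
rbinom(5, size = 10, prob = 0.5)    # binomial: successes out of 10 trials
rpois(5, lambda = 3)                # Poisson: events per unit of time
rnorm(5, mean = 0, sd = 1)          # normal
rlnorm(5, meanlog = 0, sdlog = 1)   # log-normal
rt(5, df = 9)                       # Student's t with 9 degrees of freedom
rchisq(5, df = 3)                   # chi-square with 3 degrees of freedom
rexp(5, rate = 1)                   # exponential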
7) Data
8) Population vs Sample
Population vs Sample
● Meaning: A population refers to the collection of all elements possessing common characteristics that
comprise the universe, whereas a sample is a subgroup of the members of the population chosen for
participation in the study.
● Includes: A population includes each and every unit of the group, whereas a sample includes only a
handful of units of the population.
9) Types of sampling
1. Random Sampling:
● As its name suggests, random sampling means every member of a population has an equal
chance of being selected.
● However, since samples are usually much smaller than populations, there’s a chance that entire
demographics might be missed.
2. Stratified Random Sampling:
● Stratified random sampling ensures that groups within a population are adequately represented.
● First, divide the population into segments based on some characteristics.
● Members cannot belong to two groups at once.
● Next, take random samples from each group
● The size of each sample is based on the size of the group relative to the population.
3. Clustering:
● A third – and often less precise – method of sampling is clustering
● The idea is to break the population down into groups and sample a random selection of groups,
or clusters.
● Usually this is done to reduce costs.
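A base-R sketch contrasting simple random sampling with stratified random sampling on a hypothetical population:

set.seed(3)
population <- data.frame(id = 1:1000,
                         group = sample(c("A", "B", "C"), 1000, replace = TRUE,
                                        prob = c(0.6, 0.3, 0.1)))

# Simple random sampling: every member has an equal chance of selection
srs <- population[sample(nrow(population), 100), ]

# Stratified random sampling: sample 10% within each group, so every group is represented
strata <- split(population, population$group)
strat  <- do.call(rbind, lapply(strata, function(g) g[sample(nrow(g), round(0.1 * nrow(g))), ]))
table(strat$group)   # sample sizes mirror the group sizes in the population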
11) What is meant by the Row Reduced Echelon Form (RREF) of an augmented matrix.
12) What are the functionalities of R for statistical procedures? List some basic statistical functions in R.
1. Mean - mean(x)
2. Median - median(x)
3. Percentage quantile - quantile(x)
4. Variance - var(x)
5. Standard deviation - sd(x)
6. Minimum - min(x)
7. Maximum - max(x)
8. Correlation - cor(x,y)
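For example, applying the functions listed above to a small numeric vector:

x <- c(12, 15, 21, 9, 30, 18)
y <- c(24, 31, 40, 20, 61, 35)
mean(x); median(x); quantile(x)   # centre and percentage quantiles
var(x); sd(x)                     # spread
min(x); max(x)                    # range
cor(x, y)                         # correlation between two vectors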
R Programming language provides a wide range of functionalities for statistical procedures. Some of the major
functionalities of R for statistical analysis are:
Data Manipulation: R provides powerful data manipulation tools to clean and transform data before analysis. R
provides functions to handle missing values, outliers, and duplicate data.
Descriptive Statistics: R provides functions to compute basic summary statistics such as mean, median, standard
deviation, and quantiles. R also provides functions to compute more advanced descriptive statistics such as
frequency distributions, contingency tables, and correlation matrices.
Inferential Statistics: R provides a large number of functions for hypothesis testing and estimation of population
parameters. R provides functions for t-tests, ANOVA, chi-square tests, and non-parametric tests. R also provides
functions for regression analysis, including linear, logistic, and nonlinear regression.
Model Selection: R provides functions for model selection and validation. R provides functions to evaluate model
fit, determine the optimal number of variables, and perform cross-validation.
Data Visualization: R provides a wide range of graphical capabilities to visualize data and results. R provides
functions to create histograms, scatter plots, box plots, and more advanced visualizations such as scatter plot
matrices, heatmaps, and 3D plots.
Package System: R provides a large number of packages that extend its functionalities. The packages provide
additional functions and capabilities for statistical procedures, machine learning, and data visualization.
Report Generation: R provides functionalities for creating reports and presentations. R provides packages for
generating reports in HTML, PDF, and Word formats, as well as packages for creating interactive dashboards
and presentations.
In summary, R is a comprehensive and powerful language for statistical procedures, providing a wide range of
functionalities for data manipulation, descriptive statistics, inferential statistics, model selection, data
visualization, and report generation.
13) What are coefficient of determination that can be used for both linear and nonlinear fitting?
Coefficient of determination is a statistical measure used to evaluate the goodness of fit of a regression model. It
provides information on how well the model fits the data, and it can be used for both linear and nonlinear fitting.
The two commonly used coefficients of determination for both linear and nonlinear fitting are:
R-squared (R²): R-squared measures the proportion of variance in the dependent variable that is explained by
the independent variables in the model. For linear regression models, R-squared is a measure of how well the
linear regression line fits the data. For nonlinear regression models, R-squared is a measure of how well the
nonlinear regression curve fits the data.
Adjusted R-squared (Adjusted R²): Adjusted R-squared takes into account the number of independent variables
in the model and provides a more accurate estimate of the goodness of fit compared to R-squared. Unlike
R-squared, which never decreases as more independent variables are added (even if their contribution is not
significant), adjusted R-squared increases only when a new variable improves the model more than would be
expected by chance.
Both R-squared and adjusted R-squared are expressed as values between 0 and 1, with 1 indicating a perfect fit
and values close to 0 indicating a poor fit. When selecting a model, it is recommended to choose the model with
the highest adjusted R-squared value, as it provides a better balance between model fit and model complexity.
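In R, both values can be read from a fitted model object; a small sketch using the built-in mtcars data:

fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)$r.squared       # proportion of variance in mpg explained by the model
summary(fit)$adj.r.squared   # the same, penalised for the number of predictors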
14) What is the best approach for detection of outliers using R programming for real time data? Explain
with appropriate example.
● Outliers are data points that don’t fit the pattern of the rest of the data set.
● The best way to detect the outliers in the given data set is to plot the boxplot of the data set and the
point located outside the box in the boxplot are all the outliers in the data set.
● In this approach, to remove the outliers from the given data set, the user first plots a boxplot of the data
using the simple boxplot() function. If outliers are present in the data, the user then calls the
boxplot.stats() function, which is a base function of the R language, and passes the required parameters
to it; the values it reports as outliers can then be removed from the data set.
Example:
# Simulate 500 normally distributed values
gfg <- rnorm(500)
# Inject a few extreme values so the data contains outliers
gfg[1:10] <- c(-4, 2, 5, 6, 4, 1, -5, 8, 9, -6)
# The points plotted beyond the whiskers are the outliers
boxplot(gfg)
Now let us again visualize the above plot but this time without outliers by applying the given approach.
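A sketch of that removal step, using boxplot.stats() as described above:

out_vals  <- boxplot.stats(gfg)$out     # values the boxplot flags as outliers
gfg_clean <- gfg[!gfg %in% out_vals]    # keep everything except those values
boxplot(gfg_clean)                      # the extreme points identified earlier are gone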
15) Write a function in R language to replace the missing value in a vector with the mean of that vector.
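A minimal sketch of such a function (the name replace_na_with_mean is assumed), matching the description below:

replace_na_with_mean <- function(x) {
  mean_value <- mean(x, na.rm = TRUE)   # mean of the non-missing values
  x[is.na(x)] <- mean_value             # overwrite the missing entries
  return(x)
}

# Example
replace_na_with_mean(c(1, 2, NA, 4, NA, 6))   # the NAs become 3.25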
This function takes as input a vector x, and calculates the mean of the non-missing values in the vector
using mean(x, na.rm = TRUE). The argument na.rm = TRUE specifies that the mean should be calculated
without considering the missing values.
Next, the function replaces the missing values in the vector with the mean value using the line x[is.na(x)]
<- mean_value. The function is.na(x) returns a logical vector indicating which elements of x are missing,
and the assignment x[is.na(x)] <- mean_value replaces the missing values with the mean value.
16) What are the different data objects in R? How do you split a continuous variable into different
groups/ranks in R?
1. Vectors
Atomic vectors are one of the basic types of objects in R programming. Atomic vectors can store
homogeneous data types such as character, doubles, integers, raw, logical, and complex. A single
element variable is also said to be a vector.
Example:
x <- c(1, 2, 3, 4)
y <- c("a", "b", "c", "d")
z <- 5
2. Lists
A list is another type of object in R programming. Lists can contain heterogeneous data types, such as
vectors or other lists.
Example:
ls <- list(c(1, 2, 3, 4), list("a", "b", "c"))
3. Matrices
To store values as 2-Dimensional array, matrices are used in R. Data, number of rows and columns are
defined in the matrix() function.
Example:
x <- c(1, 2, 3, 4, 5, 6)
mat <- matrix(x, nrow = 2)
4. Factors
Factor object encodes a vector of unique elements (levels) from the given data vector.
Example:
s <- c("spring", "autumn", "winter", "summer",
"spring", "autumn")
print(factor(s))
Output:
[1] spring autumn winter summer spring autumn
Levels: autumn spring summer winter
5. Arrays
array() function is used to create n-dimensional array. This function takes dim attribute as an argument
and creates required length of each dimension as specified in the attribute.
Example:
arr <- array(c(1, 2, 3), dim = c(3, 3, 3))
6. Data Frames
Data frames are 2-dimensional tabular data objects in R programming. A data frame consists of multiple
columns, and each column represents a vector. Unlike matrices, the columns in a data frame can have
different modes of data.
Example:
x <- 1:5
y <- LETTERS[1:5]
z <- c("Albert", "Bob", "Charlie", "Denver", "Elie")
df <- data.frame(x, y, z)
To split a continuous variable into different groups/ranks, use the cut() function:
set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43
# Split the continuous variable into three groups/ranks
# (three equal-width intervals are assumed here)
age_groups <- cut(ages, breaks = 3,
                  labels = c("low", "middle", "high"))
table(age_groups)
17) Explain the augmented matrix notation in linear system of equations with an example.
1) Explain how a large number of raw data sources and exploratory data analysis are required to
produce a single valuable application of the given data.
To produce a valuable application from a large number of raw data sources, a thorough exploratory data analysis
(EDA) is necessary. EDA is the process of examining, cleaning, transforming, and modeling data to gain insight
into its structure, patterns, and relationships. The following are the steps involved in this process:
1. Data collection: Collect the raw data from multiple sources and store it in a centralized repository.
2. Data cleaning: Clean the data by removing any missing or inconsistent data and transforming it into a
format that can be easily analyzed.
3. Data transformation: Transform the data into a format that is suitable for analysis, such as aggregating the
data or calculating derived variables.
4. Data exploration: Explore the data by generating descriptive statistics and visualizations to gain an
understanding of the patterns and relationships in the data.
5. Data modeling: Build models that can be used to make predictions or gain insights into the relationships
between the variables.
6. Validation: Validate the models by testing them on independent data sets to ensure that they generalize
well to new data.
By following these steps, a single valuable application can be produced from the raw data sources. The
application may be a predictive model, a dashboard that provides insights into the data, or a recommendation
system, for example. The goal of the EDA process is to create a high-quality, well-understood data set that can
be used to support decision-making and drive business value.
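A tiny end-to-end illustration of these steps in R, using the built-in airquality dataset (purely illustrative):

data(airquality)
str(airquality)                           # structure and data types (collection/inspection)
summary(airquality)                       # descriptive statistics; reveals missing values
aq <- na.omit(airquality)                 # a simple cleaning step: drop rows with NAs
hist(aq$Ozone)                            # univariate exploration
plot(aq$Temp, aq$Ozone)                   # bivariate exploration
cor(aq$Temp, aq$Ozone)                    # quantify the relationship
fit <- lm(Ozone ~ Temp, data = aq)        # a first, simple model
summary(fit)                              # validate: inspect fit quality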
1. Data Collection
Nowadays, data is generated in huge volumes and various forms belonging to every sector of human life, like
healthcare, sports, manufacturing, tourism, and so on. Every business knows the importance of using data
beneficially by properly analyzing it. However, this depends on collecting the required data from various sources
through surveys, social media, and customer reviews, to name a few. Without collecting sufficient and relevant
data, further activities cannot begin.
Once the analysis is over, the findings must be examined cautiously and carefully so that a proper interpretation
can be made. The trends in the spread of the data and the correlations between variables give good insights for
making suitable changes to the data parameters. The data analyst should have the requisite capability to analyze
and be well-versed in all analysis techniques. The results obtained will be specific to the data of that particular
domain, whether it is retail, healthcare, or agriculture.
Collecting Data
The next step is to collect the right set of data. High-quality, targeted data—and the mechanisms to collect
them—are crucial to obtaining meaningful results. Since much of the roughly 2.5 quintillion bytes of data created
every day come in unstructured formats, you’ll likely need to extract the data and export it into a usable format,
such as a CSV or JSON file.
Cleaning Data
Most of the data you collect during the collection phase will be unstructured, irrelevant, and unfiltered. Bad data
produces bad results, so the accuracy and efficacy of your analysis will depend heavily on the quality of your
data.
Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types, invalid entries, missing
data, and improper formatting.
This step is the most time-intensive process, but finding and resolving flaws in your data is essential to building
effective models.
● Exploratory Data Analysis is a data analytics process to understand the data in depth and learn the
different data characteristics, often with visual means. This allows you to get a better feel of your data
and find useful patterns in it.
● Exploratory Data Analysis helps you gather insights and make better sense of the data, and removes
irregularities and unnecessary values from data.
○ Helps you prepare your dataset for analysis.
○ Allows a machine learning model to predict our dataset better.
○ Gives you more accurate results.
○ It also helps us to choose a better machine learning model.
1. Univariate
2. Bivariate
3. Multivariate
● In univariate analysis, the output is a single variable and all data collected is for it. There is no
cause-and-effect relationship at all. For example, data shows products produced each month for twelve
months.
● In bivariate analysis, the outcome depends on two variables, e.g., an employee's age analyzed together
with the salary earned per month.
● In multivariate analysis, the outcome depends on more than two variables, e.g., the quantity of a product
sold analyzed against the product price, advertising expenses, and discounts offered.
● The analysis of data is done on variables that can be numerical or categorical. The result of the analysis
can be represented in numerical values, visualization, or graphical form. Accordingly, they could be
further classified as non-graphical or graphical.
1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical
1. Univariate Non-graphical: This is the simplest form of data analysis, as we use just one variable to
explore the data. The standard goal of univariate non-graphical EDA is to understand the underlying sample
distribution of the data and make observations about the population. Outlier detection is also part of the
analysis. The characteristics of the population distribution include:
● Central tendency: The central tendency or location of a distribution has to do with its typical or middle
values. The commonly useful measures of central tendency are the mean, median, and sometimes the
mode, of which the most common is the mean. For a skewed distribution, or when there is concern
about outliers, the median may be preferred.
● Spread: Spread is an indicator of how far from the center we are likely to find the data values. The
standard deviation and variance are two useful measures of spread. The variance is the mean of the
squares of the individual deviations, and the standard deviation is the square root of the variance.
● Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the
distribution. Skewness is a measure of asymmetry, and kurtosis is a more subtle measure of
peakedness compared to a normal distribution.
2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are generally used to show the
relationship between two or more variables in the form of either cross-tabulation or statistics.
● For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two
variables, cross-tabulation is performed by making a two-way table with column headings that match the
levels of one variable and row headings that match the levels of the other variable, then filling in the
counts of all subjects that share the same pair of levels.
● For one categorical variable and one quantitative variable, we compute statistics for the quantitative
variable separately for each level of the categorical variable, and then compare the statistics across the
levels of the categorical variable.
● Comparing the means is an informal version of ANOVA, and comparing medians is a robust version of
one-way ANOVA.
3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do not give a
complete picture of the data; therefore, graphical methods, which involve a degree of subjective analysis, are
also required. Common sorts of univariate graphics are:
Histogram: The most basic graph is the histogram, which is a barplot in which each bar represents the
frequency (count) or proportion (count/total count) of cases for a range of values. Histograms are one of the
simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and
outliers.
Stem-and-leaf plots: This is a very simple but powerful EDA method used to display quantitative data but in a
shortened format. It displays the values in the data set, keeping each observation intact but separating them as
stem (the leading digits) and remaining or trailing digits as leaves.
Box Plots: These are used to display the distribution of quantitative value in the data. If the data set consists of
categorical variables, the plots can show the comparison between them. Further, if outliers are present in the
data, they can be easily identified. These graphs are very useful when comparisons are to be shown in
percentages, like values in the 25 %, 50 %, and 75% range (quartiles).
Quantile-normal plots: It’s used to see how well a specific sample follows a specific theoretical distribution. It
allows detection of non-normality and diagnosis of skewness and kurtosis.
4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or
more sets of data. The one used most commonly is a grouped barplot, with each group representing one level
of one of the variables and each bar within a group representing a level of the other variable.
Scatterplot: For two quantitative variables, the basic graphical EDA technique is the scatterplot, which has one
variable on the x-axis and one on the y-axis, with a point for every case in the dataset.
Heat map: It’s a graphical representation of data where values are depicted by color.
Multivariate chart: It’s a graphical representation of the relationships between factors and response.
Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Tools:
1. Python
Python is used for different tasks in EDA, such as finding missing values in data collection, data description,
handling outliers, obtaining insights through charts, etc. The syntax for EDA libraries like Matplotlib, Pandas,
Seaborn, NumPy, Altair, and more in Python is fairly simple and easy to use for beginners. You can find many
open-source packages in Python, such as D-Tale, AutoViz, PandasProfiling, etc., that can automate the entire
exploratory data analysis process and save time.
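As a minimal illustration (a sketch only, with a small hypothetical DataFrame standing in for data loaded via pd.read_csv), a few pandas calls cover a typical first EDA pass:

import pandas as pd

# Hypothetical data; in practice this would come from pd.read_csv() or similar
df = pd.DataFrame({'age': [25, 32, None, 41, 29], 'income': [40000, 52000, 61000, None, 45000]})
print(df.shape)             # number of rows and columns
print(df.describe())        # summary statistics for the numeric columns
print(df.isnull().sum())    # count of missing values per column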
2. R
R programming language is a regularly used option to make statistical observations and analyze data, i.e.,
perform detailed EDA by data scientists and statisticians. Like Python, R is also an open-source programming
language suitable for statistical computing and graphics. Apart from the commonly used libraries like ggplot,
Leaflet, and Lattice, there are several powerful R libraries for automated EDA, such as Data Explorer, SmartEDA,
GGally, etc.
3. MATLAB
MATLAB is a well-known commercial tool among engineers since it has a very strong mathematical calculation
ability. Due to this, it is possible to use MATLAB for EDA but it requires some basic knowledge of the MATLAB
programming language.
1) Is it true that predictive modeling goes beyond insight (knowing why things happen) to foresight
(knowing what is likely to happen in future)? How do you explain predictive modeling?
Yes, predictive modeling does go beyond providing insight into the underlying relationships in data to providing
predictions about future outcomes. It uses statistical algorithms and machine learning techniques to analyze
existing data and make predictions about future events. Predictive modeling can help businesses make
informed decisions and allocate resources effectively.
Predictive modeling is a statistical process for analyzing data, learning from that data, and making a prediction
about future events. It uses algorithms and machine learning techniques to identify patterns in data, and make a
prediction about future outcomes based on that information. Predictive modeling is used in a variety of
applications such as marketing, financial forecasting, and risk management.
2) Your linear regression doesn’t run and communicates that there is an infinite number of best
estimates for the regression coefficients. What could be wrong? How do you know that linear
regression is suitable for any given data?
A linear regression model may report an "infinite number of best estimates for the regression coefficients" if the model is under-determined, meaning that there are more independent variables than observations, or if there is perfect multicollinearity, meaning that some independent variables are exact (or near-exact) linear combinations of others.
To determine if linear regression is suitable for a given data set, it's important to check for several assumptions:
Linearity: The relationship between the independent and dependent variables should be linear.
Homoscedasticity: The variance of the errors should be constant for all values of the independent variables.
No multicollinearity: The independent variables should not be highly correlated with each other.
If these assumptions are not met, alternative regression techniques or transformations of the data may need to
be applied.
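As an illustrative check for the multicollinearity assumption, the variance inflation factor (VIF) from statsmodels can be computed per predictor; the tiny DataFrame below is hypothetical, with one deliberately near-collinear column:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = pd.DataFrame({'x1': [1, 2, 3, 4, 5, 6], 'x2': [2, 1, 4, 3, 6, 5]})   # hypothetical predictors
X['x3'] = 1.01 * X['x1'] + 0.99 * X['x2']                                # almost a linear combination of x1 and x2
vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1])], index=X.columns)
print(vif)    # very large VIF values flag the (near-)collinear features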
3) Given a decision tree, you have the option (a) converting the decision tree to rules and then pruning
the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What
advantages does (a) have over (b)?
Converting a decision tree to rules and then pruning the resulting rules (Option A) has the following advantages
over pruning the decision tree and then converting the pruned tree to rules (Option B):
Better interpretability: Rules are more easily interpreted and understood by humans compared to decision trees.
Pruning rules after they have been extracted from the tree provides more control over the interpretability of the
model.
Better accuracy: The pruning process can be more effective when performed on rules rather than on the tree
structure. Pruning rules can reduce the number of irrelevant or redundant rules, improving the accuracy of the
model.
Better performance: Pruning rules may result in a smaller number of rules compared to pruning a decision tree,
which can lead to faster prediction times. This is because rules can be processed in parallel and do not require a
traversal of the tree structure.
In conclusion, Option A provides better interpretability, accuracy, and performance compared to Option B.
However, the choice between these two options may depend on the specific requirements of the problem and
the goals of the modeling process.
4) What is data wrangling and why is it important? Explain steps in data wrangling.
Data wrangling is the transformation of raw data into a format that is easier to use. Data wrangling is a term often
used to describe the early stages of the data analytics process. It involves transforming and mapping data from
one format into another. The aim is to make data more accessible for things like business analytics or machine
learning. The data wrangling process can involve a variety of tasks. These include things like data collection,
exploratory analysis, data cleansing, creating data structures, and storage.
Data wrangling is time-consuming. In fact, it can take up to about 80% of a data analyst’s time. This is partly
because the process is fluid, i.e. there aren’t always clear steps to follow from start to finish. However, it’s also
because the process is iterative and the activities involved are labor-intensive.
Insights gained during the data wrangling process can be invaluable. They will likely affect the future course of a
project. Skipping or rushing this step will result in poor data models that impact an organization’s
decision-making and reputation. So, if you ever hear someone suggesting that data wrangling isn’t that
important, you have our express permission to tell them otherwise!
Unfortunately, because data wrangling is sometimes poorly understood, its significance can be overlooked.
High-level decision-makers who prefer quick results may be surprised by how long it takes to get data into a
usable format. Unlike the results of data analysis (which often provide flashy and exciting insights), there’s little to
show for your efforts during the data wrangling phase. And as businesses face budget and time pressures, this
makes a data wrangler’s job all the more difficult. The job involves careful management of expectations, as well
as technical know-how.
One key step in data wrangling is validation. Validating your data means checking it for consistency, quality, and accuracy. We can do this using
pre-programmed scripts that check the data’s attributes against defined rules. This is also a good example of an
overlap between data wrangling and data cleaning—validation is key to both. Because you’ll likely find errors,
you may need to repeat this step several times.
5) Linear Regression
Linear regression is used for finding a linear relationship between a target and one or more predictors. There are two types of linear regression: simple and multiple.
Example:
We have a dataset which contains information about the relationship between ‘number of hours studied’ and
‘marks obtained’. Many students have been observed and their hours of study and grade are recorded. This will
be our training data. The goal is to design a model that can predict marks if given the number of hours studied.
Using the training data, a regression line is obtained which will give minimum error. This linear equation is then
used for any new data. That is, if we give the number of hours studied by a student as an input, our model
should predict their mark with minimum error.
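A minimal sketch of this example with scikit-learn, using small made-up hours/marks values:

import numpy as np
from sklearn.linear_model import LinearRegression

hours = np.array([[1], [2], [3], [4], [5], [6]])    # hours studied (feature)
marks = np.array([35, 45, 50, 62, 68, 80])          # marks obtained (target), hypothetical values
model = LinearRegression().fit(hours, marks)
print(model.coef_, model.intercept_)                # slope and intercept of the regression line
print(model.predict(np.array([[7]])))               # predicted mark for 7 hours of study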
Multiple linear regression is used to estimate the relationship between two or more independent variables and
one dependent variable. You can use multiple linear regression when you want to know:
● How strong the relationship is between two or more independent variables and one dependent variable
(e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
● The value of the dependent variable at a certain value of the independent variables (e.g. the expected
yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).
6) K-Nearest Neighbors (KNN)
K-nearest neighbors (KNN) is a type of supervised learning algorithm used for both regression and classification. KNN tries to predict the correct class for the test data by calculating the distance between the test data and all the training points. It then selects the K points that are closest to the test data. The KNN algorithm calculates the probability of the test data belonging to each of the classes of the 'K' training points, and the class with the highest probability is selected. In the case of regression, the predicted value is the mean of the 'K' selected training points.
Suppose, we have an image of a creature that looks similar to cat and dog, but we want to know either it is a cat
or dog. So for this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN
model will find the similar features of the new data set to the cats and dogs images and based on the most
similar features it will put it in either cat or dog category.
The working of K-NN can be summarized by the following steps:
1. Choose the number of neighbors K.
2. Calculate the distance (e.g. Euclidean distance) between the test point and every training point.
3. Select the K training points nearest to the test point.
4. For classification, assign the class that is most frequent among these K neighbors; for regression, take the mean of their values.
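A minimal sketch of these steps with scikit-learn's KNeighborsClassifier on the built-in iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
knn = KNeighborsClassifier(n_neighbors=5)    # K = 5 nearest neighbours
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))             # accuracy on the held-out test data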
7) k-means
K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in
supervised learning. K-Means performs the division of objects into clusters that share similarities and are
dissimilar to the objects belonging to another cluster.
The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2
refers to two clusters. There is a way of finding out what is the best or optimum value of K for a given data.
For a better understanding of k-means, let's take an example from cricket. Imagine you received data on a lot of
cricket players from all over the world, which gives information on the runs scored by the player and the wickets
taken by them in the last ten matches. Based on this information, we need to group the data into two clusters,
namely batsman and bowlers.
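A minimal sketch of this cricket example with scikit-learn, using hypothetical runs/wickets numbers:

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical [runs scored, wickets taken] for six players
players = np.array([[450, 2], [510, 1], [480, 3], [120, 18], [90, 22], [150, 15]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(players)
print(kmeans.labels_)             # cluster assignment for each player (batsmen vs bowlers)
print(kmeans.cluster_centers_)    # centroids of the two clusters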
8) Naive Bayes
A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification task. The crux of
the classifier is based on the Bayes theorem.
Bayes Theorem:
P(A|B) = P(B|A) x P(A) / P(B)
Using Bayes' theorem, we can find the probability of A happening, given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent, i.e. the presence of one particular feature does not affect the others. Hence it is called "naive".
9) Why Linear Regression and k-NN are poor choices for filtering spam
● Linear regression predicts a continuous value rather than a class, and k-NN has to compute distances over very high-dimensional, sparse word-count vectors, which is slow at prediction time and makes the distances less meaningful; this makes both poor fits for spam filtering.
● Naive Bayes, by contrast, works on conditional probability: the probability of an event occurring in the future can be estimated from previous occurrences of the same event. This technique can be used to classify spam emails, and word probabilities play the main role here.
● If some words occur often in spam but not in ham, then an incoming e-mail containing them is probably spam.
● The Naive Bayes classifier has become a very popular method for email filtering. Every word has a certain probability of occurring in spam or ham email in its database. If the combined word probabilities exceed a certain limit, the filter marks the email as spam; otherwise it is treated as ham.
● Here, only two categories are necessary: spam or ham.
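A minimal sketch of such a filter with scikit-learn, using word counts and a multinomial Naive Bayes classifier on a tiny made-up corpus:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["win money now", "lowest price guaranteed", "meeting at noon", "lunch tomorrow?"]   # hypothetical corpus
labels = [1, 1, 0, 0]                                 # 1 = spam, 0 = ham
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)                  # word-count features
clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["win a free lunch"])))    # predicted class for a new email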
Web scraping is simply automating the collection of structured data from the internet. Web scraping may also be known as web data extraction or data extraction. Companies utilize web scraping techniques as a way to keep an eye on the competition.
Tools:
1. ParseHub is an incredibly powerful and elegant tool that allows you to build web scrapers without having
to write a single line of code. It is therefore as simple as simply selecting the data you need.
2. Scrapy is a Web Scraping library used by python developers to build scalable web crawlers.
3. OctoParse has a target audience similar to ParseHub, catering to people who want to scrape data
without having to write a single line of code, while still having control over the full process with their
highly intuitive user interface.
4. Scraper API is designed for developers building web scrapers. It handles browsers, proxies, and CAPTCHAs, which means that raw HTML from any website can be obtained through a simple API call.
5. Mozenda caters to enterprises looking for a cloud-based self serve Web Scraping platform.
6. Content Grabber is a cloud-based Web Scraping Tool that helps businesses of all sizes with data
extraction.
Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To
ingest something is to take something in or absorb something.
Batch processing. In batch processing, the ingestion layer collects data from sources incrementally and sends
batches to the application or system where the data is to be used or stored. Data can be grouped based on a
schedule or criteria, such as if certain conditions are triggered. This approach is good for applications that don't
require real-time data. It is typically less expensive.
Real-time processing. This type of data ingestion is also referred to as stream processing. Data is not grouped
in any way in real-time processing. Instead, each piece of data is loaded as soon as it is recognized by the
ingestion layer and is processed as an individual object. Applications that require real-time data should use this
approach.
Micro batching. This is a type of batch processing that streaming systems like Apache Spark Streaming use. It
divides data into groups, but ingests them in smaller increments that make it more suitable for applications that
require streaming data.
1) Write a note on any open-source data visualization tool such as Seaborn or PyTorch.
Seaborn:
Seaborn is an open-source Python library built on top of matplotlib. It is used for data visualization and
exploratory data analysis. Seaborn works easily with dataframes and the Pandas library. The graphs created can
also be customized easily. Below are a few benefits of Data Visualization.
● Graphs can help us find data trends that are useful in any machine learning or forecasting project.
● Graphs make it easier to explain your data to non-technical people.
● Visually attractive graphs can make presentations and reports much more appealing to the reader.
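The snippets below assume the usual imports and a pandas DataFrame named df; for example, one of seaborn's built-in sample datasets can be loaded as shown (the 'titanic' dataset has the 'age' column used in the first plots, while 'iris' has the sepal/petal columns and 'species' used further down):

import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('titanic')    # DataFrame with an 'age' column
# df = sns.load_dataset('iris')     # alternative with sepal/petal columns and 'species'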
KDE Plot
A Kernel Density Estimate (KDE) Plot is used to plot the distribution of continuous data.
sns.kdeplot(x = 'age' , data = df , color = 'black')
Distribution plot
A Distribution plot is similar to a KDE plot. It is used to plot the distribution of continuous data.
sns.displot(x = 'age',kde=True,bins = 5 , data =df)
Scatter plot
sns.scatterplot(x='sepal_length', y ='petal_length' , data = df , hue = 'species')
Pair plots
Seaborn lets us plot multiple scatter plots. It’s a good option when you want to get a quick overview of your
data.
sns.pairplot(df)
Heatmaps
A heat map can be used to visualize confusion matrices and correlation matrices.
corr = df.corr()
sns.heatmap(corr)
Pytorch:
PyTorch is an open source machine learning library used for developing and training neural network based deep
learning models. It is primarily developed by Facebook’s AI research group. PyTorch can be used with Python as
well as C++. Naturally, the Python interface is more polished. PyTorch (backed by big names like Facebook, Microsoft, Salesforce, and Uber) is immensely popular in research labs. It is not yet on as many production servers, which are still dominated by frameworks like TensorFlow (backed by Google), but PyTorch is picking up fast.
Unlike most other popular deep learning frameworks such as TensorFlow, which use static computation graphs, PyTorch uses dynamic computation, which allows greater flexibility in building complex architectures. PyTorch uses core Python concepts like classes, structures and conditional loops that are already familiar to us, and is hence a lot more intuitive to understand. This makes it simpler than frameworks like TensorFlow that bring in their own programming style.
Implementation steps:
1. Install PyTorch
2. Import the Modules - The first step is of course to import the relevant libraries.
3. Gather the Data - Gather the data required to train the model.
4. Build the Network - Having done this, we start off with the real code. As mentioned before, PyTorch uses the basic, familiar programming paradigms rather than inventing its own. A neural network in PyTorch is an object: an instance of a class that defines the network and inherits from torch.nn.Module.
5. Train the Network - Now that the model is ready, we have to train it with the data available to us. This is typically done in a train() method.
6. Test the Network - Similarly, we have a test method that verifies the performance of the network based
on the given test data set.
7. Put it Together - With the skeleton in place, we have to start with stitching these pieces into an
application that can build, train and validate the neural network model.
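A minimal sketch of steps 2 and 4, with hypothetical layer sizes, showing how a network is defined as a class that inherits from torch.nn.Module:

import torch
import torch.nn as nn

class SimpleNet(nn.Module):                   # the network is an object: an instance of this class
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)        # hypothetical sizes, e.g. flattened 28x28 images
        self.fc2 = nn.Linear(128, 10)         # 10 output classes

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

model = SimpleNet()
print(model)                                  # prints the layer structure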
2) Explain in short, the role played by domain experts in information collection for a data science
application.
You may have studied data science and machine learning and used some machine learning algorithms like
regression, classification to predict on some test data. But the true power of an algorithm and data can be
harnessed only when we have some form of domain knowledge. Needless to say, the accuracy of the model
also increases with the use of such knowledge of data.
For example, the knowledge of the automobile industry when working with the relevant data can be used like —
Let’s say we have two features Horsepower and RPM from which we can create an additional feature like Torque
from the formula
TORQUE = HP x 5252 ÷ RPM
This could potentially influence the output when we train a machine learning model and result in higher
accuracy.
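For instance (with hypothetical column names), the derived feature can be added to a pandas DataFrame in one line:

import pandas as pd

cars = pd.DataFrame({'horsepower': [150, 200, 110], 'rpm': [4000, 5200, 3600]})   # hypothetical data
cars['torque'] = cars['horsepower'] * 5252 / cars['rpm']                          # TORQUE = HP x 5252 / RPM
print(cars)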
A domain expert has usually become an expert both by education and experience in that domain. Both imply a
significant amount of time spent in the domain. As most domains in the commercial world are not freely
accessible to the public, this usually entails a professional career in the domain. This is a person who could
define the framework for a data science project as they would know what the current challenges are and how
they must be answered to be practically useful given the state of the domain as it is today. The expert can judge
what data is available and how good it is. The expert can use and apply the deliverables of a data science
project in the real world. Most importantly, this person can communicate with the intended users of the project’s
outcome. This is crucial as many projects end up being shelved because the conclusions are either not
actionable or not acted upon.
If there are two individuals, they can get excellent results quickly by good communication. While the domain
expert (DE) defines the task, the data scientist (DS) chooses and configures the right toolset to solve it.
Adopting AI-based tools that help data scientists maintain their edge and increase their efficacy is the best
method to deal with this issue. Another flexible workplace AI technology that aids in data preparation and sheds
light on the topic at hand is augmented learning.
Poorly managed and scattered data sources can result in unnecessary repeats or erroneous choices. The data is most valuable when exploited effectively for maximum usefulness in enterprise artificial intelligence.
Companies now can build up sophisticated virtual data warehouses that are equipped with a centralized
platform to combine all of their data sources in a single location. It is possible to modify or manipulate the data
that is stored in the central repository to satisfy the needs of a company and increase its efficiency. This
easy-to-implement modification has the potential to significantly reduce the amount of time and labor required
by data scientists.
Before commencing analytical operations, data scientists may have a structured workflow in place. The process
must consider all company stakeholders and important parties. Using specialized dashboard software that
provides an assortment of visualization widgets, the enterprise's data may be rendered more understandable.
To provide an effective narrative for their analyses and visualizations, data scientists need to incorporate concepts such as "data storytelling."
5. Data Security
Due to the need to scale quickly, businesses have turned to cloud management for the safekeeping of their
sensitive information. Cyberattacks and online spoofing have made sensitive data stored in the cloud exposed
to the outside world. Strict measures have been enacted to protect data in the central repository against
hackers. Data scientists now face additional challenges as they attempt to work around the new restrictions
brought forth by the new rules.
Organizations must use cutting-edge encryption methods and machine learning security solutions to counteract
the security threat. In order to maximize productivity, it is essential that the systems be compliant with all
applicable safety regulations and designed to deter lengthy audits.
6. Efficient Collaboration
It is common practice for data scientists and data engineers to collaborate on the same projects for a company.
Maintaining strong lines of communication is necessary to avoid potential conflicts. To guarantee that the workflows of both teams are compatible, the organization should make the necessary effort to establish clear communication channels. The organization may also choose to establish a Chief Officer position to monitor whether or not both departments are functioning along the same lines.
It is vital for any company to have a certain set of metrics to measure the analyses that a data scientist presents.
In addition, they have the responsibility of analyzing the effects that these indicators have on the operation of
the company.
The many responsibilities and duties of a data scientist make for a demanding work environment. Nevertheless,
it is one of the occupations that are in most demand in the market today. The challenges that are experienced
by data scientists are simply solvable difficulties that may be used to increase the functionality and efficiency of
workplace AI in high-pressure work situations.
4) Is it necessary to use feature extraction for classification? Which one do you prefer between filter
approach and wrapper approach when doing feature selection? Justify your answer.
It is not necessary to use feature extraction for classification, but it can be useful in many cases. Feature
extraction can improve the performance of a classifier by reducing the dimensionality of the data, removing
noisy or irrelevant features, and enhancing the separability between the classes.
When doing feature selection, there are two main approaches: filter approach and wrapper approach.
Filter approach: In this approach, features are ranked based on their statistical properties or mutual information
with the target variable, and the top-k features are selected for classification. The filter approach is
computationally efficient, easy to implement, and can be used as a pre-processing step for any classifier.
Wrapper approach: In this approach, features are selected based on their performance in improving the
accuracy of a specific classifier. A search algorithm is used to explore the space of all possible feature subsets,
and the best subset of features is selected based on cross-validation performance. The wrapper approach is
more computationally expensive but provides a more accurate representation of the feature importance for a
specific classifier.
In conclusion, I prefer the wrapper approach for feature selection because it provides a more accurate
representation of the feature importance for a specific classifier, and it considers the interactions between
features and the classifier. However, the choice between these two approaches may depend on the specific
requirements of the problem, the size of the data, and the computational resources available.
5) What is the best way to visualize a time oriented multivariate data set? Describe this with respect to
information visualization.
The best way to visualize a time-oriented multivariate data set depends on the specific requirements of the
problem and the type of information that you want to communicate. However, some commonly used techniques
in information visualization for time-oriented multivariate data include:
Line chart: A line chart is a simple and effective way to visualize the trends and patterns in multiple time series
over time. Each series is represented by a separate line, and the lines can be color-coded to distinguish
between the different variables.
Stacked area chart: A stacked area chart is a variation of the line chart that displays the contributions of each
variable to the total over time. This can be useful for visualizing how the variables change relative to each other
over time.
Heatmap: A heatmap is a graphical representation of data where values are represented as colors. Heatmaps
can be used to visualize the relationships between multiple variables over time, with each cell representing a
combination of time and variable values.
Scatter plot matrix: A scatter plot matrix is a set of scatter plots showing the relationships between multiple
variables. Scatter plots can be time-oriented by using time as one of the variables, and this can be useful for
visualizing how the relationships between the variables change over time.
In conclusion, the choice of the best way to visualize a time-oriented multivariate data set depends on the
specific requirements of the problem and the type of information that you want to communicate. It is important to
use an appropriate visualization technique that clearly communicates the information and provides insights into
the trends and patterns in the data.
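As a small sketch of the line-chart approach described above, using hypothetical monthly data:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
dates = pd.date_range('2023-01-01', periods=12, freq='MS')    # hypothetical monthly index
data = pd.DataFrame({'sales': rng.random(12) * 100, 'visits': rng.random(12) * 500}, index=dates)
data.plot()              # one line per variable, time on the x-axis
plt.xlabel('Month')
plt.show()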
6) Why is the training data random in the definition of the random forest algorithm? How can random
forests be used for predicting sales prices?
The training data in the random forest algorithm is considered random because each tree is trained on a random bootstrap sample of the training data (bagging) and, in addition, a random subset of the features is considered at each split in the decision trees. This randomness helps to reduce overfitting, which can occur when using a single decision tree built on all of the data and features.
Random forests can be used for predicting sales prices by using the historical sales data as the input features
and the target variable being the sales price. The algorithm builds a set of decision trees, each of which makes a
prediction based on a random subset of the features. The final prediction is made by averaging the predictions
of all the trees, and this can provide a more accurate and robust prediction compared to using a single decision
tree.
Random forests can handle both continuous and categorical variables, and they can also handle missing data
and noisy data effectively. Additionally, the feature importance values generated by random forests can be used
to identify the most important variables in predicting the sales price.
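A minimal sketch with scikit-learn's RandomForestRegressor on a tiny hypothetical sales dataset:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical historical sales data
X = pd.DataFrame({'floor_area': [50, 80, 120, 65, 90, 150], 'num_rooms': [2, 3, 4, 2, 3, 5]})
y = [150000, 220000, 310000, 175000, 240000, 390000]    # recorded sale prices
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(model.predict(pd.DataFrame({'floor_area': [100], 'num_rooms': [3]})))    # predicted price for a new property
print(model.feature_importances_)                                              # which features matter most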
Filter Methods:
Filter methods are generally used as a preprocessing step. The selection of features is independent of any
machine learning algorithms. Instead, features are selected on the basis of their scores in various statistical tests
for their correlation with the outcome variable. Correlation is a relative term here; as basic guidance, correlation coefficients are commonly described as weak, moderate, or strong according to their absolute value.
Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1. Pearson’s correlation is given as:
r = cov(X, Y) / (sd(X) x sd(Y))
LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates
two or more classes (or levels) of a categorical variable.
ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that it is operated using
one or more categorical independent features and one continuous dependent feature. It provides a statistical
test of whether the means of several groups are equal or not.
Chi-Square: It is a statistical test applied to the groups of categorical features to evaluate the likelihood of
correlation or association between them using their frequency distribution.
One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal
with multicollinearity of features as well before training models for your data.
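A sketch of a filter method with scikit-learn, scoring features with the ANOVA F-test and keeping the top k:

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)    # keep the 2 best-scoring features
X_new = selector.fit_transform(X, y)
print(selector.scores_)    # F-statistic for each original feature
print(X_new.shape)         # data reduced to the selected features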
Wrapper Methods
In wrapper methods, we try to use a subset of features and train a model using them. Based on the inferences that we draw from the previous model, we decide to add or remove features from the subset. The problem is essentially reduced to a search problem. These methods are usually very computationally expensive.
Some common examples of wrapper methods are forward feature selection, backward feature elimination,
recursive feature elimination, etc.
Forward Selection: Forward selection is an iterative method in which we start with having no feature in the
model. In each iteration, we keep adding the feature which best improves our model till an addition of a new
variable does not improve the performance of the model.
Backward Elimination: In backward elimination, we start with all the features and remove the least significant
feature at each iteration which improves the performance of the model. We repeat this until no improvement is
observed on removal of features.
Recursive Feature elimination: It is a greedy optimization algorithm which aims to find the best performing
feature subset. It repeatedly creates models and keeps aside the best or the worst performing feature at each
iteration. It constructs the next model with the left features until all the features are exhausted. It then ranks the
features based on the order of their elimination.
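A sketch of a wrapper-style method, recursive feature elimination wrapped around a logistic regression (features scaled so the model converges):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)    # boolean mask of the 5 selected features
print(rfe.ranking_)    # rank 1 = kept; higher ranks were eliminated earlier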
Embedded Methods
Embedded methods perform feature selection during the model training, which is why we call them embedded
methods.
A learning algorithm takes advantage of its own variable selection process and performs feature selection and
classification/regression at the same time.
Embedded methods work as follows:
1. First, these methods train a machine learning model.
2. They then derive feature importance from this model, which is a measure of how important each feature is when making a prediction.
3. Finally, they remove non-important features using the derived feature importance.
Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.
Some of the most popular examples of these methods are LASSO and RIDGE regression which have inbuilt
penalization functions to reduce overfitting.
● Lasso regression performs L1 regularization which adds a penalty equivalent to absolute value of the
magnitude of coefficients.
● Ridge regression performs L2 regularization which adds a penalty equivalent to the square of the
magnitude of coefficients.
● Elastic nets perform L1/L2 regularization which is a combination of the L1 and L2. It incorporates their
penalties, and therefore we can end up with features with zero as a coefficient—similar to L1.
8) Decision Tree
A decision tree can be used to visually and explicitly represent decisions and decision making. As the name
goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy
to reach a particular goal, it's also widely used in machine learning.
A decision tree is drawn upside down, with its root at the top. In a typical example, such as a tree predicting whether Titanic passengers survived, each condition/internal node is a test on which the tree splits into branches/edges, and the end of a branch that doesn't split anymore is the decision/leaf; in this case, whether the passenger died or survived.
A real dataset will have many more features, and such a tree would just be a branch in a much bigger tree, but you can't ignore the simplicity of this algorithm. The feature importance is clear and relations can be viewed easily.
This methodology is more commonly known as learning decision tree from data and above tree is called
Classification tree as the target is to classify passenger as survived or died. Regression trees are represented in
the same manner, just they predict continuous values like price of a house. In general, Decision Tree algorithms
are referred to as CART or Classification and Regression Trees.
So, what is actually going on in the background? Growing a tree involves deciding which features to choose and what conditions to use for splitting, along with knowing when to stop. Because a tree tends to grow arbitrarily, it usually needs to be pruned back to keep it simple and avoid overfitting.
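A sketch of growing a classification tree with scikit-learn, where max_depth acts as a simple form of pruning and export_text shows the learned splits as readable rules:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
print(export_text(tree))             # the learned conditions, printed as rules
print(tree.score(X_test, y_test))    # accuracy on held-out data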
9) Random Forest
Random forest, like its name implies, consists of a large number of individual decision trees that operate as an
ensemble. Each individual tree in the random forest spits out a class prediction and the class with the most votes
becomes our model’s prediction (see figure below).
The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data
science speak, the reason that the random forest model works so well is:
A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the
individual constituent models.
The low correlation between models is the key. Just like how investments with low correlations (like stocks and
bonds) come together to form a portfolio that is greater than the sum of its parts, uncorrelated models can
produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this
wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly
all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the
trees are able to move in the correct direction. So the prerequisites for random forest to perform well are:
1. There needs to be some actual signal in our features so that models built using those features do better
than random guessing.
2. The predictions (and therefore the errors) made by the individual trees need to have low correlations
with each other.
Informed Consent
In human subject research, there is a notion of informed consent. We understand what is being done, we
voluntarily consent to the experiment, and we have the right to withdraw consent at any time.
However, this is more vague in "ordinary conduct of business", such as A/B testing. For example, Facebook may
perform these tests all the time without explicit consent or even knowledge!
Privacy
Privacy is a basic human need. Loss of privacy occurs when there's a loss of control over personal data.
In some cases, even when identifiable information is removed from data – like name, phone number, address,
and so on – it may not be sufficient to protect individuals' identities.
Unfair discrimination
The incorrect and unchecked use of data science can lead to unfair discrimination against individuals based on
their gender, demographics and socio-economic conditions.
Algorithms are also influenced by analysts’ biases, as they may choose data and hypotheses that seem
important to them.
Lack of transparency
Data science algorithms can sometimes be a black box where the model predicts an outcome but does not
explain the rationale behind the result.
Numerous recent machine learning algorithms fall into this category. With black box solutions, it is not easy for a
business to understand and explain the reason for a business decision.