Data Science Notes

The document provides an introduction to Data Science, defining it as a field that combines domain expertise, programming skills, and statistical knowledge to extract insights from data. It distinguishes Data Science from Big Data, outlines essential skills for data scientists, and discusses challenges faced in the industry, such as problem identification and data cleansing. Additionally, it explains the concept of datafication and its implications across various sectors.


Module 1 : Introduction to Data Science

1) What is Data Science?

● Data science is the field of study that combines domain expertise, programming skills, and knowledge of
mathematics and statistics to extract meaningful insights from data.
● Data science practitioners apply machine learning algorithms to numbers, text, images, video, audio, and
more to produce artificial intelligence (AI) systems to perform tasks that ordinarily require human
intelligence. In turn, these systems generate insights which analysts and business users can translate
into tangible business value.
● A data scientist is someone who creates programming code and combines it with statistical knowledge
to create insights from data.

2) What is the difference between Data Science and Big Data?

● Data Science is an area of study; Big Data is a technique to collect, maintain, and process huge amounts of information.
● Data Science is about the collection, processing, analysis, and use of data in various operations, and is more conceptual; Big Data is about extracting vital and valuable information from huge amounts of data.
● Data Science is a field of study, just like Computer Science, Applied Statistics, or Applied Mathematics; Big Data is a technique for tracking and discovering trends in complex data sets.
● The goal of Data Science is to build data-dominant products for a venture; the goal of Big Data is to make data more vital and usable, i.e., to extract only the important information from huge data within existing traditional aspects.
● Tools mainly used in Data Science include SAS, R, Python, etc.; tools mostly used in Big Data include Hadoop, Spark, Flink, etc.
● Data Science is a superset of Big Data, since it covers data scraping, cleaning, visualization, statistics, and many more techniques; Big Data is a subset of Data Science, its mining activities forming one stage of the Data Science pipeline.
● Data Science is mainly used for scientific purposes; Big Data is mainly used for business purposes and customer satisfaction.
● Data Science broadly focuses on the science of data; Big Data is more involved with the processes of handling voluminous data.

3) How to clean up and organize big data sets towards Data Science?

1. Removal of unwanted observations


This includes deleting duplicate/redundant or irrelevant values from your dataset. Duplicate observations most frequently arise during data collection, and irrelevant observations are those that don't actually fit the specific problem that you're trying to solve.
1. Redundant observations reduce efficiency to a great extent: because the data repeats, it may skew results in one direction or another, producing unreliable results.
2. Irrelevant observations are any type of data that is of no use to us and can be removed
directly.
2. Fixing Structural errors
The errors that arise during measurement, transfer of data, or other similar situations are called
structural errors. Structural errors include typos in the name of features, the same attribute with a
different name, mislabeled classes, i.e. separate classes that should really be the same, or
inconsistent capitalization.
1. For example, the model will treat "America" and "america" as different classes or values even though they represent the same value, or treat red, yellow, and red-yellow as different classes or attributes even though one class can be included in the other two. These structural errors make our model inefficient and give poor-quality results.
3. Managing Unwanted outliers
Outliers can cause problems with certain types of models. For example, linear regression models are
less robust to outliers than decision tree models. Generally, we should not remove outliers until we
have a legitimate reason to remove them. Sometimes, removing them improves performance,
sometimes not. So, one must have a good reason to remove the outlier, such as suspicious
measurements that are unlikely to be part of real data.
4. Handling missing data
Missing data is a deceptively tricky issue in data science. We cannot just ignore or remove the
missing observation. They must be handled carefully as they can be an indication of something
important. The two most common ways to deal with missing data are:
1. Dropping observations with missing values.
■ The fact that the value was missing may be informative in itself.
■ Plus, in the real world, you often need to make predictions on new data even if
some of the features are missing!
2. Imputing the missing values from past observations.
● Again, “missingness” is almost always informative in itself, and you should tell your
algorithm if a value was missing.
● Even if you build a model to impute your values, you’re not adding any real
information. You’re just reinforcing the patterns already provided by other
features.
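
A minimal pandas sketch of these four cleaning steps, assuming a small hypothetical DataFrame with a categorical `color` column and a numeric `price` column; the column names, the IQR outlier rule, and the median imputation are illustrative choices rather than something prescribed by the notes.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data containing a duplicate row, inconsistent capitalization,
# an implausible outlier, and a missing value.
df = pd.DataFrame({
    "color": ["red", "red", "Red", "YELLOW", "red", "yellow"],
    "price": [10.0, 10.0, 12.5, 11.0, 9999.0, np.nan],
})

# 1) Removal of unwanted (duplicate/redundant) observations.
df = df.drop_duplicates()

# 2) Fixing structural errors such as inconsistent capitalization.
df["color"] = df["color"].str.strip().str.lower()

# 3) Managing unwanted outliers (here a simple IQR rule; only drop with a good reason).
q1, q3 = df["price"].quantile([0.25, 0.75])
fence = 1.5 * (q3 - q1)
is_outlier = (df["price"] < q1 - fence) | (df["price"] > q3 + fence)
df = df[~is_outlier].copy()

# 4) Handling missing data: impute, but also flag that the value was missing,
#    since "missingness" is usually informative in itself.
df["price_was_missing"] = df["price"].isna()
df["price"] = df["price"].fillna(df["price"].median())

print(df)
```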

4) What is Datafication? What is its current landscape of perspectives?

● Datafication is the transformation of social action into online quantified data, thus allowing for real-time
tracking and predictive analysis.
● Simply said, it is about taking previously invisible processes/activity and turning it into data that can be
monitored, tracked, analyzed and optimized.
● The latest technologies we use have enabled lots of new ways to 'datify' our daily and basic activities.
● Summarizing, datafication is a technological trend turning many aspects of our lives into computerized
data using processes to transform organizations into data-driven enterprises by converting this
information into new forms of value.
● Datafication refers to the fact that daily interactions of living things can be rendered into a data format
and put to social use.
● Example - Social platforms such as Facebook or Instagram collect and monitor data about our friendships in order to market products and services to us and to provide surveillance services to agencies, which in turn changes our behavior; the promotions we see daily on social media are also the result of this monitored data. In this model, datafication is used to inform how content itself is created, not just to drive recommendation systems.
● Other examples -
○ Insurance: Data used to update risk profile development and business models.
○ Banking: Data used to establish trustworthiness and likelihood of a person paying back a loan.
○ Human resources: Data used to identify, e.g., employees' risk-taking profiles.
○ Hiring and recruitment: Data used to replace personality tests.
○ Social science research: Datafication replaces sampling techniques and restructures the manner
in which social science research is performed.

5) What are the 8 Data Science skills that will get you hired?

● Programming and Database Skills

○ No matter what type of company or role you’re interviewing for, you’re likely going to be
expected to know how to use the tools of the trade — and that includes several programming
languages.
○ You’ll be expected to know a statistical programming language, like R or Python, and a database
querying language like SQL.
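
As a small illustration of the two tool families above, the sketch below uses an in-memory SQLite table (the `orders` table and its columns are hypothetical) to show a SQL query feeding a pandas analysis.

```python
import sqlite3
import pandas as pd

# In-memory database standing in for a real warehouse; table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 20.0), (1, 35.5), (2, 12.0), (3, 99.9)])

# SQL does the heavy lifting of retrieval and aggregation...
query = "SELECT customer_id, SUM(amount) AS total_spend FROM orders GROUP BY customer_id"
df = pd.read_sql_query(query, conn)

# ...and Python handles the statistical summary.
print(df)
print("mean spend per customer:", df["total_spend"].mean())
```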

● Statistics

○ A good understanding of statistics is vital as a data scientist.


○ You should be familiar with statistical tests, distributions, maximum likelihood estimators, etc.
○ One of the more important aspects of your statistics knowledge will be understanding when
different techniques are (or aren’t) a valid approach.

○ Statistics is important at all company types, but especially data-driven companies where
stakeholders will depend on your help to make decisions and design / evaluate experiments.

● Machine Learning

○ If you’re at a large company with huge amounts of data or working at a company where the
product itself is especially data-driven (e.g. Netflix, Google Maps, Uber), it may be the case that
you’ll want to be familiar with machine learning methods.
○ This can mean things like k-nearest neighbors, random forests, ensemble methods, and more.
○ A lot of these techniques can be implemented using R or Python libraries so it’s not necessary to
become an expert on how the algorithms work.
○ Your goal is to understand the broad strokes and when it’s appropriate to use different
techniques.
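
A brief, hypothetical scikit-learn sketch of the point above: the same synthetic data can be fitted with k-nearest neighbors or a random forest by swapping a single estimator, with the library handling the algorithmic details.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for a real business dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two of the methods mentioned above; knowing when each is appropriate
# matters more than re-implementing them.
for model in (KNeighborsClassifier(n_neighbors=5),
              RandomForestClassifier(n_estimators=200, random_state=0)):
    model.fit(X_train, y_train)
    print(type(model).__name__, "test accuracy:", round(model.score(X_test, y_test), 3))
```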

● Multivariable Calculus & Linear Algebra

○ Understanding these concepts is most important at companies where the product is defined by
the data, and small improvements in predictive performance or algorithm optimization can lead to
huge wins for the company.
○ In an interview for a data science role, you may be asked to derive some of the machine learning
or statistics results you employ elsewhere. Or, your interviewer may ask you some basic
multivariable calculus or linear algebra questions, since they form the basis of a lot of these
techniques.
○ You may wonder why a data scientist would need to understand this when there are so many
out-of-the-box implementations in Python or R. The answer is that at a certain point, it can
become worth it for a data science team to build out their own implementations in house.

● Data Wrangling

○ Often, the data you’re analyzing is going to be messy and difficult to work with. Because of this,
it’s really important to know how to deal with imperfections in data — aka data wrangling.
○ Some examples of data imperfections include missing values, inconsistent string formatting (e.g.,
‘New York’ versus ‘new york’ versus ‘ny’), and date formatting (‘2021-01-01’ vs. ‘01/01/2021’, unix
time vs. timestamps, etc.).
○ This will be most important at small companies where you’re an early data hire, or data-driven
companies where the product is not data-related (particularly because the latter has often grown
quickly with not much attention to data cleanliness), but this skill is important for everyone to
have.
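
A short pandas sketch of the imperfections listed above (inconsistent city strings, mixed date formats, unix timestamps); the alias map and column names are hypothetical, and the format="mixed" option assumes pandas 2.0 or later.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["New York", "new york", "ny", "Boston"],
    "signup": ["2021-01-01", "01/02/2021", "2021-01-03", "2021-01-04"],
    "last_seen_unix": [1609459200, 1609545600, 1609632000, 1609718400],
})

# Normalize inconsistent string formatting via a (hypothetical) alias map.
aliases = {"new york": "New York", "ny": "New York", "boston": "Boston"}
df["city"] = df["city"].str.strip().str.lower().map(aliases)

# Parse mixed date formats into proper datetimes (format="mixed" needs pandas >= 2.0).
df["signup"] = pd.to_datetime(df["signup"], format="mixed")

# Convert unix time (seconds) into timestamps.
df["last_seen"] = pd.to_datetime(df["last_seen_unix"], unit="s")

print(df.dtypes)
```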

● Data Visualization & Communication

○ Visualizing and communicating data is incredibly important, especially with young companies that
are making data-driven decisions for the first time, or companies where data scientists are
viewed as people who help others make data-driven decisions.
○ When it comes to communicating, this means describing your findings, or the way techniques
work to audiences, both technical and non-technical.

○ Visualization-wise, it can be immensely helpful to be familiar with data visualization tools like
matplotlib, ggplot, or d3.js. Tableau has become a popular data visualization and dashboarding
tool as well.
○ It is important to not just be familiar with the tools necessary to visualize data, but also the
principles behind visually encoding data and communicating information.
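
A tiny matplotlib sketch, with made-up numbers, of the kind of simple and clearly labeled chart that communicates a single finding to a non-technical audience.

```python
import matplotlib.pyplot as plt

# Hypothetical monthly churn rates to be presented to stakeholders.
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
churn_rate = [0.052, 0.048, 0.047, 0.055, 0.041, 0.038]

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, churn_rate, marker="o")
ax.set_title("Monthly customer churn rate")
ax.set_ylabel("Churn rate")
ax.set_ylim(0, 0.08)
fig.tight_layout()
plt.show()
```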

● Software Engineering

○ If you’re interviewing at a smaller company and are one of the first data science hires, it can be
important to have a strong software engineering background.
○ You’ll be responsible for handling a lot of data logging, and potentially the development of
data-driven products.

● Data Intuition

○ Companies want to see that you’re a data-driven problem-solver.


○ At some point during the interview process, you’ll probably be asked about some high-level
problem, possibly about a test the company may want to run, or a data-driven product it may
want to develop.
○ It’s important to think about what things are key to the process and what things aren’t. How
should you, as the data scientist, interact with the engineers and product managers? What
methods should you use? When do approximations make sense?

6) Five real-time challenges faced by the Data Science industry and how to combat them?

1) Problem-Identification

● One of the major concerns in analyzing a problem is to identify it accurately for designing a better
solution and defining each aspect of it.
● We have seen data scientists try mechanical approaches by beginning their work on data and tools
without getting a clear understanding of the business requirement from the client.

How to Resolve it?

● There should be a well-defined workflow before starting off with the analysis of the data.
● Therefore, as a first step, you need to identify the problem very well to design a proper solution and build
a checklist to tick off as you analyze the results.

2) Accessing the Right Data

● It is vital to get your hands on the right kind of data for the right analysis, which can be a little time consuming as you need to access the data in the proper format.
● There might be issues ranging from hidden data and insufficient data volume to less data variety.
● It is also a kind of challenge to gain permission for accessing the data from various businesses.

How to Resolve it?

● Data scientists are expected to manage the data management system and other information-integration tools, such as stream-analytics software used for data filtering and aggregation.
● Such software allows them to connect all the external data sources and sync them into the proper workflow.

3) Cleansing of the Data

● Working with big data can become expensive relative to the revenue it generates, because data cleansing adds significantly to operating expenses.
● It can be a nightmare for every data scientist to work with databases which are full of inconsistencies and
anomalies as unwanted data leads to unwanted results.
● Here, they work with tons of data and spend a huge amount of time in sanitizing the data before
analyzing.

How to Resolve it?

● Data scientists make use of data governance tools for improving their overall accuracy and data
formatting.
● In addition to this, maintaining data quality should be everyone’s goal and businesses need to function
across the enterprise to benefit from good quality data.
● Bad data can result in a big enterprise issue.

4) Lack of Professionals

● It is one of the biggest misconceptions that data scientists only need to be good with high-end tools and mechanisms; they also need sound domain knowledge and depth in the subject.
● Data scientists are expected to bridge the gap between the IT department and top management, as domain expertise is required for conveying the needs of the business to the IT department and vice versa.

How to Resolve it?

● To resolve this, data scientists need to get more useful insights from businesses in order to understand
the problem and work accordingly by modeling the solutions.
● They also need to focus on the requirements of the businesses by mastering statistical and technical
tools.

5) Misconception About the Role

● In big corporations, a Data Scientist is regarded as a jack of all trades who is assigned the tasks of getting the data, building the model, and making the right business decisions, which is a big ask for any individual.
● In a Data Science team, the role should be split among different individuals covering data engineering, data visualization, predictive analytics, model building, and so on.

How to Resolve it?


● The organization should be clear about its requirements and specify the tasks the Data Scientist needs to perform, without putting unrealistic expectations on the individual.
● Though a Data Scientist possesses the majority of the necessary skills, distributing the task would ensure
flawless operation of the business.
● Thus a clear description and communication about the role are necessary before anyone starts working
as a Data Scientist in the company.

7) Explain in detail 5 V’s and their roles in Data Science applications.

1. Volume:

● The name ‘Big Data’ itself is related to a size which is enormous.


● Volume is a huge amount of data.

● To determine the value of data, size of data plays a very crucial role. If the volume of data is very
large then it is actually considered as a ‘Big Data’. This means whether a particular data can actually
be considered as a Big Data or not, is dependent upon the volume of data.
● Hence while dealing with Big Data it is necessary to consider a characteristic ‘Volume’.
● Example: In 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month; it was also estimated that by 2020 there would be almost 40,000 exabytes of data.

2. Velocity:

● Velocity refers to the high speed of accumulation of data.


● In Big Data velocity data flows in from sources like machines, networks, social media, mobile phones
etc.
● There is a massive and continuous flow of data. This determines the potential of the data, i.e., how fast the data is generated and processed to meet demands.
● Sampling data can help in dealing with issues like velocity.
● Example: There are more than 3.5 billion searches per day on Google. Also, Facebook users are increasing by approximately 22% year over year.

3. Variety:

● It refers to the nature of data that is structured, semi-structured and unstructured data.
● It also refers to heterogeneous sources.
● Variety is basically the arrival of data from new sources that are both inside and outside of an
enterprise. It can be structured, semi-structured and unstructured.
○ Structured data: Basically organized data, with a defined length and format.
○ Semi-structured data: Basically semi-organized data, which does not conform to a formal data structure; log files are an example of this type of data.
○ Unstructured data: Basically unorganized data, which doesn't fit neatly into the traditional row-and-column structure of a relational database; texts, pictures, videos, etc. are examples of unstructured data, and they can't be stored in the form of rows and columns.

4. Veracity:

● It refers to inconsistency and uncertainty in data: the available data can sometimes be messy, and its quality and accuracy are difficult to control.
● Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.
● Example: Data in bulk can create confusion, whereas too little data may convey only partial or incomplete information.

5. Value:

● After taking the other four V's into account, there comes one more V, which stands for Value. Bulk data with no value is of no good to the company unless it is turned into something useful.
● Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, Value is arguably the most important of the 5 V's.

8) Explain a systematic approach for finding business needs and translating the available data into business value.

Refer Q9.

9) How can you convert a Business Problem into a Data Problem? Elaborate with a suitable example.

There are 3 steps to translating business problems to data science problems

A. Understand & Define the problem

B. Set analytic goals and scope your solution

C. Plan the analysis

A. Understand & Define the problem

Frame the business problem

● Many times data scientists are presented with very vague problems such as how to reduce customer
churn, how to increase revenue, how to cut cost, how to improve sales, what do users want.
● These problems are very vague, however, it is the job of the data scientist to frame and define it in a way
that can be solved with data science. A data scientist is expected to probe and ask the stakeholders
questions.
● For example, the business wants to reduce churn and increase revenue, you want to ask the stakeholder
questions like - What strategies do you employ to retain customers? What are the initiatives the business
employs to increase revenue? What promotions are given to users? What are the major pain-points that
you experienced that led to a loss of revenue? Which product had the most decline in revenue?

● Try to get a balanced perspective from stakeholders, if some users are not happy with the products,
compare their view with those that are happy with it. It helps identify bias.
● In defining the problem, the problem posed by the stakeholder might not always be the pressing
problem. For example, the stakeholder might want to find out why the users come to the website but do
not purchase anything meanwhile the real problem is, can they improve the recommendations to users
that align with their interest and push them to place an order.

Prepare for a decision

When defining the problem, it is important to think in terms of the decision that needs to be made to solve the
problem such as Which user would churn in the next 70 days? Which user must be given the discounts to stay
back on the app and when to trigger them? To a new user who has just landed on the app, what is the right ad to
show?

Here are some guidelines for mapping out relevant decisions.

● Consider timing. The problem should be framed in a way that would enable the decision to be made with
respect to the time. For example, when should a particular ad be shown to a user for maximum
conversion
● Analyze every data science problem in a way that leads to a quantifiable impact for users, such as an increase in daily active users, and a quantifiable impact for stakeholders, such as an increase in revenue at a lower cost.
● Now you have defined your problem: "which user should be given a discount to prevent them from churning in the next 70 days?"

B. Set analytic goals and scope your solution

Set objectives and define milestones

● Translate the defined problem into analytical needs.


● What analytics goal do you need to accomplish in order to claim you have found a solution?
● What are the options for reaching those goals?
● Which options are cost-effective?
● How will you measure the extent to which your proposed solution addresses the business problem?

For example, the goal of “which user should be given a discount to prevent them from churning in the 70 days”
is clear enough from a business perspective, but in terms of running an actual analysis, we need to further break
it down into smaller milestones.

● How do we identify customers that are going to churn in the next 70 days?
● What criteria should be used to determine who should be given a discount?
● What features can be used to differentiate churners from non-churners?
● What is the lifetime value for each customer?
● How do we determine when to trigger them with a discount, what data do we need?

These questions also guide you in thinking of important data points while solving your problem. Thinking in
terms of milestones helps to foresee dependencies.

Design Minimum Viable Product

● After defining your problem and setting your milestones, you want to start building the solution. As a data scientist, you want to build a minimum viable product (MVP) that allows you to provide value to your stakeholders in smaller increments.
● For example, if a client wants to build a mansion, inexperienced data scientists will then try to figure out
how to build the mansion they were asked for. Experienced data scientists will try to figure out how to
build a shed, then figure out how to turn the shed to a tent, then to a hut, a bungalow, a storey building
and finally a mansion.
● It is important to consider the following questions when building MVP
○ What is the smallest benefit stakeholders could get from the analysis and still consider it
valuable?
○ When do stakeholders need results? Do they need all the results at once, or do some results
have a more pressing deadline than others?
○ What is the simplest way to meet a benchmark, regardless of whether you consider it the “best”
way?
● The typical journey of a data science product is
○ Descriptive solution — tells you what happened in the past.
○ Diagnostic solution — helps you understand why something happened in the past.
○ Predictive solution — predicts what is most likely to happen in the future.
○ Prescriptive solution — not only identifies what’s likely to happen but also provides insights and
recommends actions you can take to affect those outcomes.
● A data scientist should plan in sprints, think modularly and get regular feedback from the stakeholders.

Identify target metrics

Having a target metric is important because it tells you and your stakeholders how successful your data science
solution is in solving the business problem.

Here are some guidelines for selecting good metrics.

● Think explicitly about trade-offs. Almost any metric will involve a trade-off. For example, in a classification
problem, “precision” focuses on minimizing false positives, while “recall” focuses on minimizing false
negatives. False positives might be more important to the business than false negatives, or the reverse
could be true.
● Which is more harmful: identifying a loyal customer as likely to churn, or identifying a likely-to-churn customer as loyal? The stakeholders want to identify customers that are likely to churn, so labeling likely-to-churn customers as loyal would not help the business. Hence we want to reduce false negatives, and a high-recall model would be better suited (see the short sketch after these guidelines).
● Find out the business’s “value” units: Find out what unit of value your stakeholders think in, and estimate
the value of your analysis using that unit. For example, stakeholders have said that they want to reduce
churn, but upon further investigation, you might find that what they really want is increased daily active
users which in turn impacts revenue.
● Subset all metrics. An analysis should almost never have only one set of metrics. All metrics used for the
analysis as a whole should be repeated for any relevant subsets: customer age bracket, customer spend,
site visit, etc. An analysis may perform very well on average but abjectly fail for certain subsets

Make your metrics as non-technical and explainable as possible; stakeholders need to be able to understand whatever metrics you use.
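
A minimal scikit-learn sketch of the precision/recall trade-off described above, on ten hypothetical churn labels (1 = churned); since false negatives are costlier in this scenario, recall is the number to watch.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels for ten customers: 1 = churned, 0 = stayed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]  # misses two churners (false negatives)

print("precision:", precision_score(y_true, y_pred))  # penalizes false positives
print("recall:   ", recall_score(y_true, y_pred))     # penalizes false negatives
```

Here precision comes out around 0.67 while recall is only 0.5, so a stakeholder focused on catching churners would ask for a higher-recall model even at the cost of some precision.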

C. Plan the analysis

Plan your dataset

● You would need to standardize the different columns and get the data to the format you need. There
could be a lot of inconsistencies in the data and cleaning and transforming this data becomes really
crucial. As you go deeper into data wrangling, analysis and aligning the data to the problem, more such
challenges would arise that need to be overcome.
● Data is the key to success or failure for any data science project
● Ensure you check for sufficiency of data to solve the problem. Sometimes you won’t even realize that a
crucial data point is missing until you are in the thick of your analysis.
● Identify all dataset needs ahead of time. Make sure you have all the pieces to the data puzzle available.
For example, you could say: “customer age bracket, site visit, location, customer spend to start with.”
● If data from different datasets don’t have a common key on which to join the information, or you can’t get
access to some datasets even though they exist, or some of the data have so many missing values that
they cannot support your use case, then your analysis will disappoint both you and your stakeholders.
● Focus on data refresh cycles: how old is the data? When does it get updated? How is it updated?
What/who decides when it is updated?
● Know when additional data collection is necessary. Sometimes the only way to complete an analysis is to
collect more data.

● It is always easier to plan for contingencies before you begin your analysis than it is to try to adapt in the
middle of your work as deadlines approach.
● Some of those problems manifest themselves only through careful Exploratory Data Analysis (EDA). It’s
easy to look at a column name and assume the dataset has what you need. Because of that, it’s very
common for data scientists to find out, at least halfway into their analysis, that the data they have isn’t
really the data they need. Hence a thorough EDA is essential before applying the methods. If you are
able to answer most of the questions in the EDA phase and identify the right insights to the stakeholders
that is itself a huge value add.

Plan your methods/models

● Which methods/models are inappropriate for your analysis? Of those methods/models that are
appropriate, what are the costs and benefits of using each one? If you find a number of methods that are
appropriate and have roughly the same costs and benefits, how do you decide how to proceed?
● Keep constraints in mind. If your preferred method requires a GPU but you don’t have easy access to a
GPU, then it shouldn’t be your preferred method, even if you think it is analytically superior to its
alternatives. Similarly, some methods simply do not work well for large numbers of features, or only work
if you know beforehand how many clusters you want. Save time by thinking about the constraints each
method places on your work — because every method carries constraints of some kind.
● Even after you eliminated unsuitable methods and further narrowed down your list to accommodate your
project’s constraints, you will still likely have more than one method that could plausibly work for you.
There is no way to know beforehand which of these methods is better — you will have to try as many of
them as possible, and try each with as many initializing parameters as possible, to know what performs
best.

10) Applications of Data Science.

1. Search Engines

● As we know, when we want to search for something on the internet, we mostly use search engines like Google, Yahoo, Bing, DuckDuckGo, etc. Data Science is used to make these searches faster and more relevant.
● For example, when we search for something such as "Data Structure and algorithm courses", the first link we get is often a GeeksforGeeks course. This happens because the GeeksforGeeks website is the one most visited for information on Data Structure courses and computer-related subjects. This analysis is done using Data Science, which surfaces the top most-visited web links.

2. Transport

● Data Science has also entered the transport field, for example with driverless cars, which help reduce the number of accidents.

● For example, in driverless cars the training data is fed into the algorithm and, with the help of Data Science techniques, the data is analyzed to learn things like the speed limit on highways, busy streets, and narrow roads, and how to handle different situations while driving.

3. Finance

● Data Science plays a key role in the financial industry, which constantly faces issues of fraud and risk of losses.
● Thus, financial companies need to automate risk-of-loss analysis in order to carry out strategic decisions for the company.
● Financial companies also use Data Science analytics tools to predict the future; this allows them to predict customer lifetime value and stock market moves.
● For example, Data Science is central to the stock market, where it is used to examine past behavior from historical data with the goal of estimating future outcomes. Data is analyzed in such a way that it becomes possible to predict future stock prices over a set timeframe.

4. E-Commerce

● E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience through personalized recommendations.
● For example, when we search for something on e-commerce websites, we get suggestions similar to our past choices, as well as recommendations based on the most bought, most rated, and most searched products. This is all done with the help of Data Science.

5. Health Care

In the Healthcare Industry data science acts as a boon. Data Science is used for:

● Detecting Tumor.
● Drug discoveries.
● Medical Image Analysis.
● Virtual Medical Bots.
● Genetics and Genomics.
● Predictive Modeling for Diagnosis etc.

6. Image Recognition

● Currently, Data Science is also used in Image Recognition.


● For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and Data Science: when an image is recognized, the analysis is run against one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging them.

7. Targeting Recommendation

● Targeting Recommendation is the most important application of Data Science.


● Whatever a user searches for on the internet, they will then see numerous related posts everywhere.
● For example, suppose I want a mobile phone, search for it on Google, and afterwards decide to buy it offline. Data Science helps the companies that pay for advertisements for that phone, so everywhere on the internet, on social media, on websites, and in apps I will see recommendations for the phone I searched for, which nudges me to buy it online after all.

8. Airline Routing Planning

● With the help of Data Science, the airline sector is also improving; for instance, it becomes easier to predict flight delays.
● It also helps decide whether to fly directly to the destination or take a halt in between; for example, a flight can take a direct route from Delhi to the U.S.A. or stop over along the way before reaching the destination.

9. Data Science in Gaming

● In most of the games where a user will play with an opponent i.e. a Computer Opponent, data science
concepts are used with machine learning where with the help of past data the Computer will improve its
performance.
● There are many games like Chess, EA Sports, etc. that use Data Science concepts.

10. Medicine and Drug Development

● The process of creating medicine is very difficult and time-consuming and has to be done with full
discipline because it is a matter of someone's life.
● Without Data Science, developing a new medicine or drug takes a lot of time, resources, and money; with Data Science it becomes easier, because the likely success rate can be estimated from biological data and factors.
● Data-science-based algorithms can forecast how a compound will react in the human body without lab experiments.

11. In Delivery Logistics

● Various Logistics companies like DHL, FedEx, etc. make use of Data Science.
● Data Science helps these companies to find the best route for the shipment of their products, the best
time suited for delivery, the best mode of transport to reach the destination, etc.

12. Autocomplete

● The autocomplete feature is an important application of Data Science: the user types just a few letters or words, and the rest of the line is completed automatically.

● In Gmail, when we are writing a formal mail to someone, the autocomplete feature suggests an efficient way to complete the whole line.
● Autocomplete is also widely used in search engines, on social media, and in various apps.

11) What is the impact of applying Data Science in business scenarios?

1. Reduces Inefficiencies

● Inefficiencies often cost businesses up to 30% of their revenue.


● Data scientists track a range of company-wide metrics – factory production times, delivery expenditure,
employee productivity, and more – and pinpoint areas for improvement.
● By limiting wasted resources, it’s possible to lower overall costs and boost return-on-investment. It’s
expected, for example, that big data will reduce healthcare costs in the US by 20%.

2. Predicts Trends and Customer Behavior

● Predictive models are essential business tools.


● Data scientists organize huge swathes of historical data and utilize it to inform planning processes, thus
helping businesses make informed decisions about the future.
● On a practical level, data-based predictions have an array of applications. It’s possible, for example, to
determine peak customer shopping times and adjust staff levels accordingly, or to identify early buyer
trends and implement appropriate promotional campaigns.

3. Enables Competitor Research

● As much as companies value data that helps them understand their customers and internal processes,
they’re also eager to gain an edge over their competitors.
● Data scientists are responsible for understanding and gleaning insights from data about competitors.
● Effective competitor research helps businesses make competitive pricing decisions, reach new markets,
and stay up to date with changes in consumer behavior.

4. Allows Testing of Business Initiatives

● Consistent, long-term testing enables companies to drive incremental revenue gains.


● Data scientists are responsible for conducting extensive tests to guarantee successful marketing
campaigns, product launches, employee satisfaction, website optimization, and more.
● Testing is one of the most exciting areas of data science.
● New, innovative alternatives are posed against existing features, often with unexpected results.
● Businesses like Amazon adopt an indefinite approach to testing, trialing new changes and implementing
them as part of a long-term strategy, rather than ‘one-off’ optimization campaigns.

5. Develops Market Understanding

● By ensuring a ready stream of actionable insights about customer psychology, behavior, and satisfaction,
data science enables businesses to consistently reshape their products and services to fit with a shifting
marketplace.
● Data about customers is available from a variety of sources, and mining information from third-party
platforms, like social media, search engines, and purchased datasets, presents a unique challenge.

6. Informs Hiring Decisions

● One of the big problems faced by businesses when searching for new employees is the disconnect between prospects who look good on paper and those who perform well in practice.
● Data science seeks to bridge this gap by using evidence to improve hiring practices.
● By combining and analyzing a variety of data-points about candidates, it’s possible to move towards an
ideal ‘company-employee fit’.

12) What is the need of estimation and validation for added value due to data science?

Cross Validation:
● Cross-Validation is an essential tool in the Data Scientist toolbox.
● Cross-validation divides the dataset into two parts (train and test): the model is trained on the train part, and predictions are made on the test part, which is unseen data for the model.
● After that, we check the model to see how well it works. If the model gives us good accuracy on the test data, it means that our model is good and we can trust it. (A short scikit-learn sketch of several cross-validation schemes follows the list of types below.)

Types of Cross-Validation:

1. Hold Out Method:


● It simply divides the dataset into training and testing sets.
● The training dataset is used to train the model and then the testing dataset is fitted in the trained
model to make predictions.
● This method is used as it is computationally less costly.

2. Leave One Out Cross-Validation (LOOCV):


● This is the most extreme way to do cross-validation.
● For each instance in our dataset, we build a model using all other instances and then test it on
the selected instance.

3. K-Fold Cross-Validation:
● We split our data into K parts, let’s use K=3 for a toy example.
● If we have 3000 instances in our dataset, we split it into three parts, part 1, part 2 and part 3.
● We then build three different models, each model is trained on two parts and tested on the third.
● Our first model is trained on part 1 and 2 and tested on part 3.
● Our second model is trained on part 1 and part 3 and tested on part 2 and so on.

(Diagram: k-fold cross-validation with k = 4.)

4. Stratified Cross-Validation:
● When we split our data into folds, we want to make sure that each fold is a good representative
of the whole data.
● The most basic example is that we want the same proportion of different classes in each fold.
Most of the time it happens by just doing it randomly, but sometimes, in complex datasets, we
have to enforce a correct distribution for each fold.

5. Time Series Cross-Validation:


● In time series cross-validation, we cannot randomly split our dataset into training and testing sets, because the order of the observations matters.
● Time series cross-validation starts with a small subset of data for training and makes a prediction
for the future data points and then checking the accuracy for the predicted data points.
● After that, the same predicted data points are then included as part of the next training dataset
and future data points are predicted.

● Similarly, the process continues in time series cross-validation.
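
A hedged scikit-learn sketch of three of the schemes above (k-fold, stratified k-fold, and a time-series split); the synthetic data, the random-forest estimator, and K = 3 are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import (KFold, StratifiedKFold, TimeSeriesSplit,
                                     cross_val_score)

# Synthetic stand-in data; in practice this would be the project's real dataset.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

# K-Fold: K models are built, each tested on the one fold it did not see in training.
kfold = KFold(n_splits=3, shuffle=True, random_state=0)
print("k-fold:     ", cross_val_score(model, X, y, cv=kfold).round(3))

# Stratified K-Fold keeps the class proportions the same in every fold.
print("stratified: ", cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=3)).round(3))

# Time-series split: each fold trains on the past and tests on the future.
print("time series:", cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=3)).round(3))
```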

Need for cross validation:


1. Use All Your Data
● When we have very little data, splitting it into training and test sets might leave us with a very
small test set.
● Say we have only 100 examples, if we do a simple 80–20 split, we’ll get 20 examples in our test
set which is not enough. The problem is even worse when we have a multi-class problem. If we
have 10 classes and only 20 examples, it leaves us with only 2 examples for each class on
average. Testing anything on only 2 examples can’t lead to any real conclusion.
● If we use cross-validation in this case, we build K different models, so we are able to make
predictions on all of our data. For each instance, we make a prediction by a model that didn’t see
this example, and so we are getting 100 examples in our test set. For the multi-class problem, we
get 10 examples for each class on average, and it’s much better than just 2.

2. Get More Metrics


● When we do a single evaluation on our test set, we get only one result.
● Say we trained five models and we use accuracy as our measurement.
● The best scenario is that our accuracy is similar in all our folds, say 92.0, 91.5, 92.0, 92.5 and 91.8.
This means that our algorithm (and our data) is consistent and we can be confident.
● However, we could end up in a slightly different scenario, say 92.0, 44.0, 91.5, 92.5 and 91.8. It
looks like one of our folds is from a different distribution, we have to go back and make sure that
our data is what we think it is.
● The worst scenario we can end up in is when we have considerable variation in our results, say
80, 44, 99, 60 and 87. Here it looks like that our algorithm or our data (or both) is not consistent, it
could be that our algorithm is unable to learn, or our data is very complicated.
● By using Cross-Validation, we are able to get more metrics and draw important conclusions both
about our algorithm and our data.

3. Use Models Stacking


● When we have limited data, we can’t train both our models on the same dataset because then,
our second model learns on predictions that our first model has already seen.
● These will probably be over-fitted or at least have better results than on a different set.
● This means that our second algorithm is trained not on what it will be tested on.
● This may lead to different effects in our final evaluations that will be hard to understand.

● By using cross-validation, we can make predictions on our dataset in the same way as described
before and so our second model's input will be real predictions on data that our first model has
never seen before.

4. Work with Dependent/Grouped Data


● Let’s look at spoken digits recognition. In this dataset, for example, there are 3 speakers and
1500 recordings (500 for each speaker).
● If we do a random split, our training and test set will share the same speaker saying the same
words. This will boost our algorithm performance but once tested on a new speaker, our results
will be much worse.
● The proper way to do it is to split the speakers, i.e., use 2 speakers for training and use the third
for testing. However, then we’ll test our algorithm only on one speaker which is not enough. We
need to know how our algorithm performs on different speakers.
● We can use cross-validation on the speakers level. We will train 3 models, each time using one
speaker for testing and two others for training. This way we’ll be able to better evaluate our
algorithm and finally build our model on all speakers.

5. Parameters Fine-Tuning
● Most learning algorithms require some parameters tuning. We want to find the best parameters
for our problem.
● We do it by trying different values and choosing the best ones.
● There are many methods to do this. It could be a manual search, a grid search or optimization.
● However, in all those cases we can't do the tuning on our training set, and of course not on our test set either. We have to use a third set, a validation set.
● By splitting our data into three sets instead of two, we’ll tackle all the same issues we talked
about before, especially if we don’t have a lot of data.
● By doing cross-validation, we’re able to do all those steps using a single set.
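
A brief scikit-learn sketch of cross-validated parameter tuning via grid search, which is one common way to avoid carving out a separate validation set; the estimator and parameter grid are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Each candidate value of n_neighbors is scored with 5-fold cross-validation.
search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [3, 5, 7, 11]},
                      cv=5)
search.fit(X, y)
print("best parameters: ", search.best_params_)
print("best CV accuracy:", round(search.best_score_, 3))
```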

13) Who is a Data Scientist? What are his responsibilities and characteristics?

Data Scientist:

● A data scientist is an analytics professional who is responsible for collecting, analyzing and interpreting
data to help drive decision-making in an organization.
● The data scientist role combines elements of several traditional and technical jobs, including
mathematician, scientist, statistician and computer programmer.
● It involves the use of advanced analytics techniques, such as machine learning and predictive modeling,
along with the application of scientific principles.
● As part of data science initiatives, data scientists often must work with large amounts of data to develop
and test hypotheses, make inferences and analyze things such as customer and market trends, financial
risks, cybersecurity threats, stock trades, equipment maintenance needs and medical conditions.
● In businesses, data scientists typically mine data for information that can be used to predict customer
behavior, identify new revenue opportunities, detect fraudulent transactions and meet other business
needs.
● They also do valuable analytics work for healthcare providers, academic institutions, government
agencies, sports teams and other types of organizations.

The basic responsibilities of a data scientist include the following activities:

● Gather and prepare relevant data to use in analytics applications


● Create forecasting algorithms and data models
● Use various types of analytics tools to detect patterns, trends and relationships in data sets
● Develop statistical and predictive models to run against the data sets
● Create data visualizations, dashboards and reports to communicate their findings
● Improve the quality of data or product offerings by utilizing machine learning techniques
● Distribute suggestions to other teams and top management
● In data analysis, use data tools such as R, SAS, Python, or SQL
● Stay on top of innovations in the field of data science

In many organizations, data scientists are also responsible for helping to define and promote best practices for
data collection, preparation and analysis. In addition, some data scientists develop AI technologies for use
internally or by customers -- for example, conversational AI systems, AI-driven robots and other autonomous
machines, including key components in self-driving cars.

Characteristics of an effective data scientist


● The personal characteristics and soft skills required by data scientists include intellectual curiosity, critical
thinking, a healthy skepticism, good intuition, problem-solving abilities and creativity.
● The ability to collaborate with other people is critical, too.
● Data scientists typically work on a data science team that also includes data engineers, lower-level data
analysts and others, and the role often involves working with various business teams on a regular basis.
● Many employers expect their data scientists to be strong communicators who can use data storytelling
capabilities to present and explain data insights to business executives, managers and workers.
● They also need leadership capabilities and business savvy to help steer data-driven decision-making
processes in an organization.

Module 2 : Introduction to Mathematical Foundation

1) What are the differences between statistics and parameters of distribution?

● Meaning: A statistic is a measure that describes a fraction (a sample) of the population; a parameter is a measure that describes the whole population.
● Numerical value: A statistic is variable and known; a parameter is fixed and unknown.
● Statistical notation:
○ x̄ = Sample Mean, μ = Population Mean
○ s = Sample Standard Deviation, σ = Population Standard Deviation
○ p̂ = Sample Proportion, P = Population Proportion
○ x = Data Elements (sample), X = Data Elements (population)
○ n = Size of Sample, N = Size of Population
○ r = Correlation coefficient (sample), ρ = Correlation coefficient (population)

Example:
● A researcher wants to know the average weight of females aged 22 years or older in India. The
researcher obtains the average weight of 54 kg, from a random sample of 40 females.
● In the given situation, the statistic is the average weight of 54 kg, calculated from a simple random sample of 40 females in India, while the parameter is the mean weight of all females aged 22 years or older in India.

2) What is a normal distribution? Elaborate one real life application.

● Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric
about the mean, showing that data near the mean are more frequent in occurrence than data far from
the mean.
● In graphical form, the normal distribution appears as a "bell curve".
● Its mean (average), median (midpoint), and mode (most frequent observation) are all equal to one
another. Moreover, these values all represent the peak, or highest point, of the distribution.
● The distribution then falls symmetrically around the mean, the width of which is defined by the standard
deviation.

The Empirical Rule: approximately 68% of values lie within one standard deviation of the mean, 95% within two, and 99.7% within three.

Formula for the Normal Distribution:

f(x) = (1 / (σ√(2π))) · e^( −(x − μ)² / (2σ²) )

where:
x = value of the variable or data being examined
f(x) = the probability density function
μ = the mean
σ = the standard deviation

Example of a Normal Distribution

● The distribution of the heights of human beings.


● The average height is found to be roughly 175 cm (5' 9"), counting both males and females.
● As the chart below shows, most people conform to that average.
● Meanwhile, taller and shorter people exist, but with decreasing frequency in the population.
● According to the empirical rule, 99.7% of all people will fall within +/- three standard deviations of the mean, or between 154 cm (5' 0") and 196 cm (6' 5").
● Those taller or shorter than this range are quite rare (just 0.15% of the population on each side).
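
A small SciPy sketch tying the height example to the empirical rule; the mean of 175 cm comes from the text, and a standard deviation of 7 cm is assumed so that ±3σ spans roughly 154-196 cm as stated above.

```python
from scipy.stats import norm

mu, sigma = 175, 7              # mean height (cm); sigma chosen so mu ± 3*sigma ≈ 154-196 cm
heights = norm(loc=mu, scale=sigma)

# Probability mass within 1, 2 and 3 standard deviations (the empirical rule).
for k in (1, 2, 3):
    p = heights.cdf(mu + k * sigma) - heights.cdf(mu - k * sigma)
    print(f"within {k} standard deviation(s): {p:.4f}")

# Density at the mean, i.e. f(mu) from the formula above.
print("f(mu) =", round(heights.pdf(mu), 4))
```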

3) Explain Normal (Gaussian) distribution with an example. State and explain one application where this
distribution is suitable for model fitting.

Refer Q2.

4) Probability

● Probability is the chance of an outcome in an experiment (also called event).


Event: Tossing a fair coin
Outcome: Head, Tail
● Probability is the science of uncertainty.
● Whenever there is a doubt of an event occurring, probability concepts are used to estimate the
likelihood of the event.
● Probability is a value between 0 and 1 that a certain event will occur
● For example, the probability that a fair coin will come up heads is 0.5.
Mathematically we write: P(heads) = 0.5.

5) Statistics

● Statistics is the application of what we know to what we want to know.


● Is the S&P 500 a good model of the entire U.S. economy?
● Does the population of Texas reflect the entire U.S. population?

6) Distribution

● A distribution describes all of the probable outcomes of a variable.


● In a discrete distribution, the sum of all the individual probabilities must equal 1
● In a continuous distribution, the area under the probability curve equals 1.

Discrete Distribution : 3 types

1. Uniform Distribution
● Rolling a fair die has 6 discrete, equally probable outcomes
● You can roll a 1 or a 2, but not a 1.5
● The probabilities of each outcome are evenly distributed across the sample space.

In the bar chart of a uniform distribution, the heights are all the same and add up to 1.

2. Binomial Distribution
● “Binomial” means there are two discrete, mutually exclusive outcomes of a trial.
● heads or tails
● on or off
● sick or healthy
● success or failure

3. Poisson Distribution
● A binomial distribution considers the number of successes out of n trials
● A Poisson Distribution considers the number of successes per unit of time or any other
continuous unit, e.g. distance over the course of many units

Continuous Distribution : 5 types

1. Normal Distribution
● Many real life data points follow a normal distribution: People's Heights and Weights, Population
Blood Pressure, Test Scores, Measurement Errors.
● These data sources tend to be around a central value with no bias left or right, and it gets close
to a "Normal Distribution" like this:

● Unlike discrete distributions, where the sum of all the bars equals one, in a normal distribution the
area under the curve equals one.

2. Log-normal Distribution
● This distribution is used for random variables whose logarithm follows a normal distribution.
● Consider random variables X and Y with Y = ln(X), where ln denotes the natural logarithm of X. If Y is normally distributed, then X follows a log-normal distribution.

3. Student’s T Distribution
● The student’s t distribution is similar to the normal distribution.
● The difference is that the tails of the distribution are thicker.
● This is used when the sample size is small and the population variance is not known.
● This distribution is defined by its degrees of freedom (p), which is calculated as the sample size minus 1 (n − 1).

4. Chi-square Distribution
● This distribution is the distribution of the sum of squares of p independent standard normal random variables, where p is the number of degrees of freedom.
● Like the t-distribution, as the degrees of freedom increase, the distribution gradually approaches
the normal distribution.
● (Figure: a chi-square distribution with three degrees of freedom.)

5. Exponential Distribution
● The exponential distribution can be seen as the continuous counterpart of the Poisson distribution: it
models the waiting time between events in a Poisson process.
● The events in consideration are independent of each other.
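The densities and cumulative probabilities of these continuous distributions are also available as base R functions (the argument values shown are illustrative):

dnorm(1, mean = 0, sd = 1)         # normal density at x = 1
pnorm(1, mean = 0, sd = 1)         # P(X <= 1) for a standard normal variable
dlnorm(1, meanlog = 0, sdlog = 1)  # log-normal density
dt(1, df = 9)                      # Student's t density with 9 degrees of freedom (n = 10)
dchisq(1, df = 3)                  # chi-square density with 3 degrees of freedom
dexp(1, rate = 2)                  # exponential density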

7) Data

Data is the collected observations we have about something.

8) Population vs Sample

● Meaning: A population is the collection of all elements possessing common characteristics that comprises
the universe. A sample is a subgroup of the members of the population chosen for participation in the study.
● Includes: Population - each and every unit of the group; Sample - only a handful of units of the population.
● Characteristic measured: Population - parameter; Sample - statistic.
● Data collection: Population - complete enumeration or census; Sample - sample survey or sampling.
● Focus on: Population - identifying the characteristics; Sample - making inferences about the population.

9) Types of sampling

1. Random Sampling:
● As its name suggests, random sampling means every member of a population has an equal
chance of being selected.
● However, since samples are usually much smaller than populations, there’s a chance that entire
demographics might be missed.
2. Stratified Random Sampling:
● Stratified random sampling ensures that groups within a population are adequately represented.
● First, divide the population into segments based on some characteristics.
● Members cannot belong to two groups at once.
● Next, take random samples from each group
● The size of each sample is based on the size of the group relative to the population.
3. Clustering:
● A third – and often less precise – method of sampling is clustering
● The idea is to break the population down into groups and sample a random selection of groups,
or clusters.
● Usually this is done to reduce costs.
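A small sketch of random and stratified sampling in base R (the population data frame and the 10% fraction are made up for illustration):

set.seed(1)
population <- data.frame(id = 1:1000,
                         group = sample(c("A", "B", "C"), 1000, replace = TRUE))

# Random sampling: every member has an equal chance of being selected
srs <- population[sample(nrow(population), 50), ]

# Stratified random sampling: divide into groups, then take 10% from each group
strata <- split(population, population$group)
strat  <- do.call(rbind, lapply(strata, function(g) g[sample(nrow(g), ceiling(0.1 * nrow(g))), ]))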


10) State steps that are followed when testing hypotheses.

1. Set up the hypotheses and check conditions:


● Each hypothesis test includes two hypotheses about the population. One is the null hypothesis,
notated as H0, and the other is the alternative hypothesis, notated as Ha (or H1).
● The null hypothesis (H0) is a statement of no effect, relationship, or difference between two or
more groups or factors. In research studies, a researcher is usually interested in disproving the
null hypothesis.
● The alternative hypothesis (H1) is the statement that there is an effect or difference. This is
usually the hypothesis the researcher is interested in proving.
2. Decide on the significance level:
● The significance level (denoted by the Greek letter alpha, a) is generally set at 0.05. This means
that there is a 5% chance of rejecting the null hypothesis when it is actually true (a Type I error).
● The most common value is 0.05 or 5%. Other popular choices are 0.01 (1%) and 0.1 (10%).
3. Calculate the test statistic:
● Gather sample data and calculate a test statistic where the sample statistic is compared to the
parameter value.
● The test statistic is calculated under the assumption the null hypothesis is true and incorporates a
measure of standard error and assumptions (conditions) related to the sampling distribution.
4. Calculate probability value (p-value), or find the rejection region:
● A p-value is found by using the test statistic to calculate the probability of the sample data
producing such a test statistic or one more extreme.
● The rejection region is found by using alpha to find a critical value; the rejection region is the area
that is more extreme than the critical value.
5. Make a decision about the null hypothesis:
In this step, we decide to either reject the null hypothesis or decide to fail to reject the null hypothesis.
6. State an overall conclusion:
Once we have found the p-value or rejection region, and made a statistical decision about the null
hypothesis (i.e. we will reject the null or fail to reject the null), we then want to summarize our results into
an overall conclusion for our test.
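These steps can be traced in R with a one-sample t-test (the data and the hypothesized mean of 50 are made up for illustration):

set.seed(1)
scores <- rnorm(30, mean = 52, sd = 5)   # sample data
result <- t.test(scores, mu = 50)        # H0: mean = 50, Ha: mean != 50, alpha = 0.05
result$statistic                         # step 3: the test statistic
result$p.value                           # step 4: the p-value, compared with alpha to make the decision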

11) What is meant by the Row Reduced Echelon Form (RREF) of an augmented matrix.
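An augmented matrix [A | b] of a linear system Ax = b is in Row Reduced Echelon Form (RREF) when:
1. All rows consisting entirely of zeros are at the bottom of the matrix.
2. The first nonzero entry (the leading entry) of each nonzero row is 1.
3. The leading 1 of each row is to the right of the leading 1 in the row above it.
4. Each leading 1 is the only nonzero entry in its column.
RREF is obtained by applying elementary row operations (swapping rows, multiplying a row by a nonzero constant,
adding a multiple of one row to another), for example by Gauss-Jordan elimination. Once the augmented matrix is
in RREF, the solution of the system can be read off directly. For example, the system x + 2y = 5, 3x + 4y = 11 has
augmented matrix
[ 1 2 | 5 ]
[ 3 4 | 11 ]
whose RREF is
[ 1 0 | 1 ]
[ 0 1 | 2 ]
giving x = 1 and y = 2.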


12) State R Programming language functionalities for statistical procedures.

1. Mean - mean(x)
2. Median - median(x)
3. Percentage quantile - quantile(x)
4. Variance - var(x)
5. Standard deviation - sd(x)
6. Minimum - min(x)
7. Maximum - max(x)
8. Correlation - cor(x,y)
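For example, applied to two small numeric vectors (the values are made up for illustration):

x <- c(12, 7, 3, 9, 21, 15)
y <- c(2, 1, 0.5, 1.5, 4, 3)
mean(x); median(x); quantile(x)   # centre and quartiles of x
var(x); sd(x); min(x); max(x)     # spread and range of x
cor(x, y)                         # correlation between x and y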
R Programming language provides a wide range of functionalities for statistical procedures. Some of the major
functionalities of R for statistical analysis are:

Data Manipulation: R provides powerful data manipulation tools to clean and transform data before analysis. R
provides functions to handle missing values, outliers, and duplicate data.

Descriptive Statistics: R provides functions to compute basic summary statistics such as mean, median, standard
deviation, and quantiles. R also provides functions to compute more advanced descriptive statistics such as
frequency distributions, contingency tables, and correlation matrices.

Inferential Statistics: R provides a large number of functions for hypothesis testing and estimation of population
parameters. R provides functions for t-tests, ANOVA, chi-square tests, and non-parametric tests. R also provides
functions for regression analysis, including linear, logistic, and nonlinear regression.


Model Selection: R provides functions for model selection and validation. R provides functions to evaluate model
fit, determine the optimal number of variables, and perform cross-validation.

Data Visualization: R provides a wide range of graphical capabilities to visualize data and results. R provides
functions to create histograms, scatter plots, box plots, and more advanced visualizations such as scatter plot
matrices, heatmaps, and 3D plots.

Package System: R provides a large number of packages that extend its functionalities. The packages provide
additional functions and capabilities for statistical procedures, machine learning, and data visualization.

Report Generation: R provides functionalities for creating reports and presentations. R provides packages for
generating reports in HTML, PDF, and Word formats, as well as packages for creating interactive dashboards
and presentations.

In summary, R is a comprehensive and powerful language for statistical procedures, providing a wide range of
functionalities for data manipulation, descriptive statistics, inferential statistics, model selection, data
visualization, and report generation.

13) What are coefficient of determination that can be used for both linear and nonlinear fitting?

Coefficient of determination is a statistical measure used to evaluate the goodness of fit of a regression model. It
provides information on how well the model fits the data, and it can be used for both linear and nonlinear fitting.
The two commonly used coefficients of determination for both linear and nonlinear fitting are:

R-squared (R²): R-squared measures the proportion of variance in the dependent variable that is explained by
the independent variables in the model. For linear regression models, R-squared is a measure of how well the
linear regression line fits the data. For nonlinear regression models, R-squared is a measure of how well the
nonlinear regression curve fits the data.

Adjusted R-squared (Adjusted R²): Adjusted R-squared takes into account the number of independent variables
in the model and provides a more accurate estimate of the goodness of fit compared to R-squared. Unlike
R-squared, which never decreases when independent variables are added to the model even if their contribution
is not significant, adjusted R-squared increases only when an added variable improves the model more than
would be expected by chance.

Both R-squared and adjusted R-squared are expressed as values between 0 and 1, with 1 indicating a perfect fit
and values close to 0 indicating a poor fit. When selecting a model, it is recommended to choose the model with
the highest adjusted R-squared value, as it provides a better balance between model fit and model complexity.
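Both values can be read directly from a fitted model in R (the simulated data below is purely illustrative):

set.seed(1)
x <- 1:50
y <- 3 + 2 * x + rnorm(50, sd = 5)   # simulated linear relationship with noise
fit <- lm(y ~ x)
summary(fit)$r.squared               # R-squared
summary(fit)$adj.r.squared           # Adjusted R-squared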

14) What is the best approach for detection of outliers using R programming for real time data? Explain
with appropriate example.

● Outliers are data points that don’t fit the pattern of the rest of the data set.
● The best way to detect the outliers in the given data set is to plot the boxplot of the data set and the
point located outside the box in the boxplot are all the outliers in the data set.


● In this approach to remove the outliers from the given data set, the user needs to just plot the boxplot of
the given data set using the simple boxplot() function, and if found the presence of the outliers in the
given data the user needs to call the boxplot.stats() function which is a base function of the R language,
and pass the required parameters into this function, which will further lead to the removal of the outliers
present in the given data sets.

Example:
gfg<-rnorm(500)
gfg[1:10]<-c(-4,2,5,6,4,1,-5,8,9,-6)
boxplot(gfg)

Now let us again visualize the above plot but this time without outliers by applying the given approach.

Removing Outliers Using boxplot.stats() Function:


gfg <- rnorm(500)                                # 500 random values
gfg[1:10] <- c(-4,2,5,6,4,1,-5,8,9,-6)           # inject some extreme values as outliers
gfg <- gfg[!gfg %in% boxplot.stats(gfg)$out]     # drop the values flagged as outliers
boxplot(gfg)                                     # the boxplot now shows no outliers

15) Write a function in R language to replace the missing value in a vector with the mean of that vector.
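A minimal implementation consistent with the explanation below (the function name replace_na_with_mean is illustrative):

replace_na_with_mean <- function(x) {
  mean_value <- mean(x, na.rm = TRUE)   # mean of the non-missing values
  x[is.na(x)] <- mean_value             # replace the missing values with the mean
  return(x)
}

# Usage
v <- c(1, 2, NA, 4, NA, 6)
replace_na_with_mean(v)   # 1.00 2.00 3.25 4.00 3.25 6.00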


This function takes as input a vector x, and calculates the mean of the non-missing values in the vector
using mean(x, na.rm = TRUE). The argument na.rm = TRUE specifies that the mean should be calculated
without considering the missing values.

Next, the function replaces the missing values in the vector with the mean value using the line x[is.na(x)]
<- mean_value. The function is.na(x) returns a logical vector indicating which elements of x are missing,
and the assignment x[is.na(x)] <- mean_value replaces the missing values with the mean value.

Finally, the function returns the modified vector x.

16) What are the different data objects in R? How do you split a continuous variable into different
groups/ranks in R?

There are 5 basic types of objects in the R language:

1. Vectors
Atomic vectors are one of the basic types of objects in R programming. Atomic vectors can store
homogeneous data types such as character, doubles, integers, raw, logical, and complex. A single
element variable is also said to be a vector.
Example:
x <- c(1, 2, 3, 4)
y <- c("a", "b", "c", "d")
z <- 5

2. Lists
List is another type of object in R programming. List can contain heterogeneous data types such as
vectors or another lists.
Example:
ls <- list(c(1, 2, 3, 4), list("a", "b", "c"))

3. Matrices
To store values as 2-Dimensional array, matrices are used in R. Data, number of rows and columns are
defined in the matrix() function.
Example:
x <- c(1, 2, 3, 4, 5, 6)
mat <- matrix(x, nrow = 2)


4. Factors
Factor object encodes a vector of unique elements (levels) from the given data vector.
Example:
s <- c("spring", "autumn", "winter", "summer",
"spring", "autumn")
print(factor(s))

Output:
[1] spring autumn winter summer spring autumn
Levels: autumn spring summer winter

5. Arrays
array() function is used to create n-dimensional array. This function takes dim attribute as an argument
and creates required length of each dimension as specified in the attribute.
Example:
arr <- array(c(1, 2, 3), dim = c(3, 3, 3))

6. Data Frames
Data frames are 2-dimensional tabular data object in R programming. Data frames consists of multiple
columns and each column represents a vector. Columns in data frame can have different modes of data
unlike matrices.
Example:
x <- 1:5
y <- LETTERS[1:5]
z <- c("Albert", "Bob", "Charlie", "Denver", "Elie")
df <- data.frame(x, y, z)

Split a continuous variable into different groups/ranks in R:

First, make up some sample data:

set.seed(1)
ages <- floor(runif(20, min = 20, max = 50))
ages
# [1] 27 31 37 47 26 46 48 39 38 21 26 25 40 31 43 34 41 49 31 43

1. Use findInterval() to categorize the "ages" vector.


findInterval(ages, c(20, 30, 40))
# [1] 1 2 2 3 1 3 3 2 2 1 1 1 3 2 3 2 3 3 2 3
2. Use cut()
# Example 1
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE)
cut(ages, breaks=c(20, 30, 40, 50), right = FALSE, labels = FALSE)
# Example 2 (assuming df is a data frame with a numeric column a)
df$category <- cut(df$a,
                   breaks=c(-Inf, 0.5, 0.6, Inf),
                   labels=c("low","middle","high"))

17) Explain the augmented matrix notation in linear system of equations with an example.
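For a system of linear equations written as Ax = b, the augmented matrix is formed by appending the column of
constants b to the coefficient matrix A, written [A | b]. Each row of the augmented matrix represents one equation,
and each column (except the last) holds the coefficients of one variable, so the whole system can be manipulated
with row operations on a single matrix.
Example: the system
2x + 3y = 8
x - y = 1
has coefficient matrix A = [2 3; 1 -1] and constant vector b = [8; 1], so its augmented matrix notation is
[ 2  3 | 8 ]
[ 1 -1 | 1 ]
Performing Gaussian elimination on this augmented matrix solves both equations simultaneously.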

Module 3 : Exploratory Data Analysis

1) Explain how a large number of raw data sources and exploratory data analysis are required to
produce a single valuable application of the given data.

To produce a valuable application from a large number of raw data sources, a thorough exploratory data analysis
(EDA) is necessary. EDA is the process of examining, cleaning, transforming, and modeling data to gain insight
into its structure, patterns, and relationships. The following are the steps involved in this process:

1. Data collection: Collect the raw data from multiple sources and store it in a centralized repository.

2. Data cleaning: Clean the data by removing any missing or inconsistent data and transforming it into a
format that can be easily analyzed.

3. Data transformation: Transform the data into a format that is suitable for analysis, such as aggregating the
data or calculating derived variables.


4. Data exploration: Explore the data by generating descriptive statistics and visualizations to gain an
understanding of the patterns and relationships in the data.

5. Data modeling: Build models that can be used to make predictions or gain insights into the relationships
between the variables.

6. Validation: Validate the models by testing them on independent data sets to ensure that they generalize
well to new data.

By following these steps, a single valuable application can be produced from the raw data sources. The
application may be a predictive model, a dashboard that provides insights into the data, or a recommendation
system, for example. The goal of the EDA process is to create a high-quality, well-understood data set that can
be used to support decision-making and drive business value.

2) State fundamental steps of Exploratory Data Analysis.

1. Data Collection
Nowadays, data is generated in huge volumes and various forms belonging to every sector of human life, like
healthcare, sports, manufacturing, tourism, and so on. Every business knows the importance of using data
beneficially by properly analyzing it. However, this depends on collecting the required data from various sources
through surveys, social media, and customer reviews, to name a few. Without collecting sufficient and relevant
data, further activities cannot begin.

2. Finding all Variables and Understanding Them


When the analysis process starts, the first focus is on the available data that gives a lot of information. This
information contains changing values about various features or characteristics, which helps to understand and
get valuable insights from them. It requires first identifying the important variables which affect the outcome and
their possible impact. This step is crucial for the final result expected from any analysis.

3. Cleaning the Dataset


The next step is to clean the data set, which may contain null values and irrelevant information. These are to be
removed so that data contains only those values that are relevant and important from the target point of view.
This not only saves time but also reduces the computational power required for estimation.
Preprocessing takes care of all issues, such as identifying null values, outliers, anomaly detection, etc.

4. Identify Correlated Variables


Finding a correlation between variables helps to know how a particular variable is related to another. The
correlation matrix method gives a clear picture of how different variables correlate, which further helps in
understanding vital relationships among them.

5. Choosing the Right Statistical Methods


Depending on the data, categorical or numerical, the size, type of variables, and the purpose of analysis,
different statistical tools are employed. Statistical formulae applied for numerical outputs give fair information,
but graphical visuals are more appealing and easier to interpret.

6. Visualizing and Analyzing Results


Once the analysis is over, the findings are to be observed cautiously and carefully so that proper interpretation
can be made. The trends in the spread of data and correlation between variables give good insights for making
suitable changes in the data parameters. The data analyst should have the requisite capability to analyze and be
well-versed in all analysis techniques. The results obtained will be specific to the data of that particular domain
and can then be applied in areas such as retail, healthcare, and agriculture.

3) Data Science Process

Framing the Problem


Understanding and framing the problem is the first step of the data science life cycle. This framing will help you
build an effective model that will have a positive impact on your organization.

Collecting Data
The next step is to collect the right set of data. High-quality, targeted data—and the mechanisms to collect
them—are crucial to obtaining meaningful results. Since much of the roughly 2.5 quintillion bytes of data created
every day come in unstructured formats, you’ll likely need to extract the data and export it into a usable format,
such as a CSV or JSON file.

Cleaning Data
Most of the data you collect during the collection phase will be unstructured, irrelevant, and unfiltered. Bad data
produces bad results, so the accuracy and efficacy of your analysis will depend heavily on the quality of your
data.
Cleaning data eliminates duplicate and null values, corrupt data, inconsistent data types, invalid entries, missing
data, and improper formatting.
This step is the most time-intensive process, but finding and resolving flaws in your data is essential to building
effective models.

Exploratory Data Analysis (EDA)


Now that you have a large amount of organized, high-quality data, you can begin conducting an exploratory data
analysis (EDA). Effective EDA lets you uncover valuable insights that will be useful in the next phase of the data
science lifecycle.

Model Building and Deployment


Next, you’ll do the actual data modeling. This is where you’ll use machine learning, statistical models, and
algorithms to extract high-value insights and predictions.


Communicating Your Results


Lastly, you’ll communicate your findings to stakeholders. Every data scientist needs to build their repertoire of
visualization skills to do this.
Your stakeholders are mainly interested in what your results mean for their organization, and often won’t care
about the complex back-end work that was used to build your model. Communicate your findings in a clear,
engaging way that highlights their value in strategic business planning and operation.

4) Exploratory Data Analysis and its types.

● Exploratory Data Analysis is a data analytics process to understand the data in depth and learn the
different data characteristics, often with visual means. This allows you to get a better feel of your data
and find useful patterns in it.
● Exploratory Data Analysis helps you gather insights and make better sense of the data, and removes
irregularities and unnecessary values from data.
○ Helps you prepare your dataset for analysis.
○ Allows a machine learning model to predict our dataset better.
○ Gives you more accurate results.
○ It also helps us to choose a better machine learning model.

There are three main types of EDA:

1. Univariate
2. Bivariate
3. Multivariate

● In univariate analysis, the output is a single variable and all data collected is for it. There is no
cause-and-effect relationship at all. For example, data shows products produced each month for twelve
months.
● In bivariate analysis, the outcome is dependent on two variables, e.g., the age of an employee, while the
relation with it is compared with two variables, i.e., his salary earned and expenses per month.
● In multivariate analysis, the outcome is more than two, e.g., type of product and quantity sold against the
product price, advertising expenses, and discounts offered.
● The analysis of data is done on variables that can be numerical or categorical. The result of the analysis
can be represented in numerical values, visualization, or graphical form. Accordingly, they could be
further classified as non-graphical or graphical.

TYPES OF EXPLORATORY DATA ANALYSIS:

1. Univariate Non-graphical
2. Multivariate Non-graphical
3. Univariate graphical
4. Multivariate graphical

1. Univariate Non-graphical: This is the simplest form of data analysis, as just one variable is used to analyze
the data. The standard goal of univariate non-graphical EDA is to understand the underlying sample distribution
of the data and make observations about the population. Outlier detection is also part of the analysis. The
characteristics of the population distribution include:

● Central tendency: The central tendency or location of a distribution has to do with its typical or middle
values. The commonly useful measures of central tendency are the mean, median, and sometimes the
mode, of which the mean is the most common. For a skewed distribution, or when there is concern about
outliers, the median may be preferred.
● Spread: Spread is an indicator of how far from the centre the data values tend to be found. The standard
deviation and variance are two useful measures of spread. The variance is the mean of the squares of the
individual deviations, and the standard deviation is the square root of the variance.
● Skewness and kurtosis: Two more useful univariate descriptors are the skewness and kurtosis of the
distribution. Skewness is a measure of asymmetry, and kurtosis is a more subtle measure of peakedness
compared to a normal distribution.

2. Multivariate Non-graphical: Multivariate non-graphical EDA techniques are usually used to show the relationship
between two or more variables in the form of either cross-tabulation or statistics.

● For categorical data, an extension of tabulation called cross-tabulation is extremely useful. For two
variables, cross-tabulation is done by making a two-way table with column headings that match the levels
of one variable and row headings that match the levels of the other variable, then filling in the counts of
all subjects that share the same pair of levels.
● For one categorical variable and one quantitative variable, we compute statistics for the quantitative
variable separately for each level of the categorical variable, and then compare the statistics across the
levels of the categorical variable.
● Comparing the means is an informal version of ANOVA, and comparing medians is a robust version of
one-way ANOVA.

3. Univariate graphical: Non-graphical methods are quantitative and objective, but they do not give a complete
picture of the data; therefore, graphical methods, which involve a degree of subjective analysis, are also
required. Common types of univariate graphics are:

Histogram: The most basic graph is a histogram, which is a bar plot in which each bar represents
the frequency (count) or proportion (count/total count) of cases for a range of values. Histograms are one of the
simplest ways to quickly learn a lot about your data, including central tendency, spread, modality, shape and
outliers.


Stem-and-leaf plots: This is a very simple but powerful EDA method used to display quantitative data but in a
shortened format. It displays the values in the data set, keeping each observation intact but separating them as
stem (the leading digits) and remaining or trailing digits as leaves.

Box Plots: These are used to display the distribution of quantitative value in the data. If the data set consists of
categorical variables, the plots can show the comparison between them. Further, if outliers are present in the
data, they can be easily identified. These graphs are very useful when comparisons are to be shown in
percentages, like values in the 25 %, 50 %, and 75% range (quartiles).

Quantile-normal plots: It’s used to see how well a specific sample follows a specific theoretical distribution. It
allows detection of non-normality and diagnosis of skewness and kurtosis.


4. Multivariate graphical: Multivariate graphical EDA uses graphics to display relationships between two or
more sets of data. The one used most commonly is a grouped bar plot, with each group representing one level
of one of the variables and each bar within a group representing the level of the other variable.

Other common sorts of multivariate graphics are:

Scatterplot: For two quantitative variables, the essential graphical EDA technique is the scatter plot, which has one
variable on the x-axis, one on the y-axis, and a point for every case in your dataset.

Run chart: It’s a line graph of data plotted over time.


Heat map: It’s a graphical representation of data where values are depicted by color.

Multivariate chart: It’s a graphical representation of the relationships between factors and response.
Bubble chart: It’s a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
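Several of these plot types can be produced with base R graphics; a minimal sketch using the built-in mtcars data set:

hist(mtcars$mpg)                   # histogram of one quantitative variable
boxplot(mpg ~ cyl, data = mtcars)  # box plots compared across a categorical variable
plot(mtcars$wt, mtcars$mpg)        # scatterplot of two quantitative variables
heatmap(cor(mtcars))               # heat map of the correlation matrix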

5) Basic tools (plots, graphs and summary statistics) of EDA

For plots, refer Q4.

Tools:

1. Python
Python is used for different tasks in EDA, such as finding missing values in data collection, data description,
handling outliers, obtaining insights through charts, etc. The syntax for EDA libraries like Matplotlib, Pandas,
Seaborn, NumPy, Altair, and more in Python is fairly simple and easy to use for beginners. You can find many
open-source packages in Python, such as D-Tale, AutoViz, PandasProfiling, etc., that can automate the entire
exploratory data analysis process and save time.


2. R
R programming language is a regularly used option to make statistical observations and analyze data, i.e.,
perform detailed EDA by data scientists and statisticians. Like Python, R is also an open-source programming
language suitable for statistical computing and graphics. Apart from the commonly used libraries like ggplot,
Leaflet, and Lattice, there are several powerful R libraries for automated EDA, such as Data Explorer, SmartEDA,
GGally, etc.

3. MATLAB
MATLAB is a well-known commercial tool among engineers since it has a very strong mathematical calculation
ability. Due to this, it is possible to use MATLAB for EDA but it requires some basic knowledge of the MATLAB
programming language.

Module 4 : Introduction to Basic Machine Learning Algorithms

1) Is it true that predictive modeling goes beyond insight (knowing why things happen) to foresight
(knowing what is likely to happen in future)? How do you explain predictive modeling?

Yes, predictive modeling does go beyond providing insight into the underlying relationships in data to providing
predictions about future outcomes. It uses statistical algorithms and machine learning techniques to analyze
existing data and make predictions about future events. Predictive modeling can help businesses make
informed decisions and allocate resources effectively.

Predictive modeling is a statistical process for analyzing data, learning from that data, and making a prediction
about future events. It uses algorithms and machine learning techniques to identify patterns in data, and make a
prediction about future outcomes based on that information. Predictive modeling is used in a variety of
applications such as marketing, financial forecasting, and risk management.

2) Your linear regression doesn’t run and communicates that there is an infinite number of best
estimates for the regression coefficients. What could be wrong? How do you know that linear
regression is suitable for any given data?

A linear regression model may produce an "infinite number of best estimates for the regression coefficients" if
the model is over-determined, meaning that there are more independent variables than observations, or if there
is multicollinearity, meaning that the independent variables are highly correlated with each other.

To determine if linear regression is suitable for a given data set, it's important to check for several assumptions:

Linearity: The relationship between the independent and dependent variables should be linear.

Independence: The observations should be independent of each other.

Homoscedasticity: The variance of the errors should be constant for all values of the independent variables.

Normality: The errors should be normally distributed.

No multicollinearity: The independent variables should not be highly correlated with each other.


If these assumptions are not met, alternative regression techniques or transformations of the data may need to
be applied.
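A rough way to inspect these assumptions in R is to look at the standard diagnostic plots of a fitted model (the mtcars model below is purely illustrative):

fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)                    # residuals vs fitted (linearity, homoscedasticity) and Q-Q plot (normality)
cor(mtcars$wt, mtcars$hp)    # a quick check for strong correlation between the predictors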

3) Given a decision tree, you have the option (a) converting the decision tree to rules and then pruning
the resulting rules, or (b) pruning the decision tree and then converting the pruned tree to rules. What
advantages does (a) have over (b)?

Converting a decision tree to rules and then pruning the resulting rules (Option A) has the following advantages
over pruning the decision tree and then converting the pruned tree to rules (Option B):

Better interpretability: Rules are more easily interpreted and understood by humans compared to decision trees.
Pruning rules after they have been extracted from the tree provides more control over the interpretability of the
model.

Better accuracy: The pruning process can be more effective when performed on rules rather than on the tree
structure. Pruning rules can reduce the number of irrelevant or redundant rules, improving the accuracy of the
model.

Better performance: Pruning rules may result in a smaller number of rules compared to pruning a decision tree,
which can lead to faster prediction times. This is because rules can be processed in parallel and do not require a
traversal of the tree structure.

In conclusion, Option A provides better interpretability, accuracy, and performance compared to Option B.
However, the choice between these two options may depend on the specific requirements of the problem and
the goals of the modeling process.

4) What is data wrangling and why is it important? Explain steps in data wrangling.

Data wrangling is the transformation of raw data into a format that is easier to use. Data wrangling is a term often
used to describe the early stages of the data analytics process. It involves transforming and mapping data from
one format into another. The aim is to make data more accessible for things like business analytics or machine
learning. The data wrangling process can involve a variety of tasks. These include things like data collection,
exploratory analysis, data cleansing, creating data structures, and storage.

Data wrangling is time-consuming. In fact, it can take up to about 80% of a data analyst’s time. This is partly
because the process is fluid, i.e. there aren’t always clear steps to follow from start to finish. However, it’s also
because the process is iterative and the activities involved are labor-intensive.

Why is data wrangling important?

Insights gained during the data wrangling process can be invaluable. They will likely affect the future course of a
project. Skipping or rushing this step will result in poor data models that impact an organization’s


decision-making and reputation. So, if you ever hear someone suggesting that data wrangling isn’t that
important, you have our express permission to tell them otherwise!

Unfortunately, because data wrangling is sometimes poorly understood, its significance can be overlooked.
High-level decision-makers who prefer quick results may be surprised by how long it takes to get data into a
usable format. Unlike the results of data analysis (which often provide flashy and exciting insights), there’s little to
show for your efforts during the data wrangling phase. And as businesses face budget and time pressures, this
makes a data wrangler’s job all the more difficult. The job involves careful management of expectations, as well
as technical know-how.

Data wrangling process:

Extracting the data


Not everybody considers data extraction part of the data wrangling process. But in our opinion, it’s a vital aspect
of it. You can’t transform data without first collecting it. This stage requires planning. You’ll need to decide which
data you need and where to collect them from. You’ll then pull the data in a raw format from its source. This
could be a website, a third-party repository, or some other location.

Carrying out exploratory data analysis (EDA)


EDA involves determining a dataset’s structure and summarizing its main features. Whether you do this
immediately, or wait until later in the process, depends on the state of the dataset and how much work it
requires. Ultimately, EDA means familiarizing yourself with the data so you know how to proceed.

Structuring the data


Freshly collected data are usually in an unstructured format. This means they lack an existing model and are
completely disorganized. Unstructured data are often text-heavy but may contain things like ID codes, dates,
numbers, and so on. To structure your dataset, you’ll usually need to parse it. In this context, parsing means
extracting relevant information. For instance, you might parse HTML code scraped from a website, pulling out
what you need and discarding the rest. The result might be a more user-friendly spreadsheet containing the
useful data with columns, headings, classes, and so on.

Cleaning the data


Once your dataset has some structure, you can start applying algorithms to tidy it up. You can automate a range
of algorithmic tasks using tools like Python and R. They can be used to identify outliers, delete duplicate values,
standardize systems of measurement, and so on.
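A minimal cleaning sketch in R on a made-up data frame (the column names and the threshold of 250 are illustrative):

df <- data.frame(id = c(1, 2, 2, 3, 4),
                 height_cm = c(170, 182, 182, 999, NA))
df <- df[!duplicated(df), ]                       # delete duplicate rows
df$height_cm[which(df$height_cm > 250)] <- NA     # treat an impossible value as missing
df$height_cm[is.na(df$height_cm)] <- mean(df$height_cm, na.rm = TRUE)   # impute missing values with the mean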

Enriching the data


Once your dataset is in good shape, you’ll need to check if it’s ready to meet your requirements. At this stage,
you may want to enrich it. Data enrichment involves combining your dataset with data from other sources. This
might include internal systems or third-party providers. Your goal could be to accumulate a greater number of
data points (to improve the accuracy of an analysis). Or it could simply be to fill in gaps…Say, by combining two
databases of customer info where one contains telephone numbers, and the other doesn’t.

Validating the data


Validating your data means checking it for consistency, quality, and accuracy. We can do this using
pre-programmed scripts that check the data’s attributes against defined rules. This is also a good example of an
overlap between data wrangling and data cleaning—validation is key to both. Because you’ll likely find errors,
you may need to repeat this step several times.

Publishing the data


Last but not least, it’s time to publish your data. This means making the data accessible by depositing them into
a new database or architecture. End-users might include data analysts, engineers, or data scientists. They may
use the data to create business reports and other insights. Or they might further process it to build more
complex data structures, e.g. data warehouses.

5) Linear Regression

Linear regression is used for finding linear relationship between target and one or more predictors. There are
two types of linear regression- Simple and Multiple.

Simple Linear Regression


Simple linear regression is useful for finding relationships between two continuous variables. One is a predictor
or independent variable and other is response or dependent variable. It looks for statistical relationships but not
deterministic relationships. Relationship between two variables is said to be deterministic if one variable can be
accurately expressed by the other.
The core idea is to obtain a line that best fits the data. The best fit line is the one for which the total prediction error
(over all data points) is as small as possible. Error is the distance from a data point to the regression line.

Example:
We have a dataset which contains information about the relationship between ‘number of hours studied’ and
‘marks obtained’. Many students have been observed and their hours of study and grade are recorded. This will
be our training data. The goal is to design a model that can predict marks if given the number of hours studied.
Using the training data, a regression line is obtained which will give minimum error. This linear equation is then
used for any new data. That is, if we give the number of hours studied by a student as an input, our model
should predict their mark with minimum error.
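A minimal sketch of this example in R, with made-up training data:

hours <- c(1, 2, 3, 4, 5, 6, 7, 8)
marks <- c(35, 42, 50, 55, 61, 68, 74, 82)
model <- lm(marks ~ hours)                  # fit the best-fit line
coef(model)                                 # intercept and slope
predict(model, data.frame(hours = 5.5))     # predicted mark for a new number of hours studied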

Multiple linear regression is used to estimate the relationship between two or more independent variables and
one dependent variable. You can use multiple linear regression when you want to know:

● How strong the relationship is between two or more independent variables and one dependent variable
(e.g. how rainfall, temperature, and amount of fertilizer added affect crop growth).
● The value of the dependent variable at a certain value of the independent variables (e.g. the expected
yield of a crop at certain levels of rainfall, temperature, and fertilizer addition).

6) k-Nearest Neighbors (kNN)

K-nearest neighbors (KNN) is a type of supervised learning algorithm used for both regression and classification.
KNN tries to predict the correct class for the test data by calculating the distance between the test data and all
the training points. It then selects the K points that are closest to the test data. The KNN algorithm


calculates the probability of the test data belonging to each class of the ‘K’ training points, and the class with the
highest probability is selected. In the case of regression, the predicted value is the mean of the ‘K’ selected training points.

Suppose we have an image of a creature that looks similar to both a cat and a dog, but we want to know whether it is a
cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN
model will find the similar features of the new data set to the cats and dogs images and based on the most
similar features it will put it in either cat or dog category.

How does K-NN work?

The K-NN working can be explained on the basis of the below algorithm:

Step-1: Select the number K of the neighbors


Step-2: Calculate the Euclidean distance of K number of neighbors
Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
Step-4: Among these k neighbors, count the number of the data points in each category.
Step-5: Assign the new data points to that category for which the number of the neighbor is maximum.
Step-6: Our model is ready.

Example
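A minimal sketch of these steps in R using the knn() function from the class package (the iris data and the 100/50 split are illustrative):

library(class)
set.seed(1)
idx   <- sample(nrow(iris), 100)          # random training indices
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]
cl    <- iris$Species[idx]                # known classes of the training points
pred  <- knn(train, test, cl, k = 5)      # steps 1-5: choose K, compute distances, vote
mean(pred == iris$Species[-idx])          # proportion of correct predictions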


7) k-means

K-Means clustering is an unsupervised learning algorithm. There is no labeled data for this clustering, unlike in
supervised learning. K-Means performs the division of objects into clusters that share similarities and are
dissimilar to the objects belonging to another cluster.

The term ‘K’ is a number. You need to tell the system how many clusters you need to create. For example, K = 2
refers to two clusters. There is a way of finding out what is the best or optimum value of K for a given data.

For a better understanding of k-means, let's take an example from cricket. Imagine you received data on a lot of
cricket players from all over the world, which gives information on the runs scored by the player and the wickets
taken by them in the last ten matches. Based on this information, we need to group the data into two clusters,
namely batsman and bowlers.

Example
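A minimal sketch of the cricket example in R, with made-up runs/wickets data and K = 2:

set.seed(1)
players <- data.frame(runs    = c(450, 520, 480, 60, 40, 500, 55, 70, 490, 45),
                      wickets = c(2, 1, 3, 18, 20, 2, 17, 19, 1, 22))
km <- kmeans(players, centers = 2)   # K = 2: batsmen vs bowlers
km$cluster                           # cluster assignment for each player
km$centers                           # centre of each cluster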

8) Naive Bayes

A Naive Bayes classifier is a probabilistic machine learning model that’s used for classification tasks. The crux of
the classifier is based on the Bayes theorem.

Bayes Theorem:
P(A|B) = P(B|A) × P(A) / P(B)


Using Bayes theorem, we can find the probability of A happening, given that B has occurred. Here, B is the
evidence and A is the hypothesis. The assumption made here is that the predictors/features are independent.
That is, the presence of one particular feature does not affect the other. Hence it is called naive.

Types of Naive Bayes Classifier:

Multinomial Naive Bayes:


This is mostly used for document classification problems, i.e. whether a document belongs to the category of
sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words
present in the document.

Bernoulli Naive Bayes:


This is similar to the multinomial naive bayes but the predictors are boolean variables. The parameters that we
use to predict the class variable take up only values yes or no, for example if a word occurs in the text or not.

Gaussian Naive Bayes:


When the predictors take up a continuous value and are not discrete, we assume that these values are sampled
from a gaussian distribution.
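A minimal sketch of a Gaussian naive Bayes classifier in R, assuming the e1071 package is installed (the iris data is used purely for illustration):

library(e1071)
model <- naiveBayes(Species ~ ., data = iris)   # continuous predictors, Gaussian assumption
predict(model, iris[c(1, 51, 101), 1:4])        # predicted class for three example rows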

9) Why Linear Regression and k-NN are poor choices for filtering spam

Why Won’t Linear Regression Work for Filtering Spam?


● We need to have a training set of emails where the messages have already been labeled with some
outcome variable. In this case, the outcomes are either spam or not.
● Once you build a model, email messages would come in without a label, and you’d use your model to
predict the labels.
● The first thing to consider is that your target is binary (0 if not spam, 1 if spam)—you wouldn’t get a 0 or a
1 using linear regression; you’d get a number. Strictly speaking, this option really isn’t ideal; linear
regression is aimed at modeling a continuous output and this is binary.

Why Won’t KNN Work for Filtering Spam?


● We would still need to choose features, probably corresponding to words, and we’d likely define the
value of those features to be 0 or 1, depending on whether the word is present or not. Then, we’d need
to define when two emails are “near” each other based on which words they both contain.
● Again, with 10,000 emails and 100,000 words, we’ll encounter a problem of “too many dimensions”. Yes,
computing distances in a 100,000-dimensional space requires lots of computational work. But that’s not
the real problem.
● The real problem is even more basic: even our nearest neighbors are really far away. This is called “the
curse of dimensionality,” and it makes k-NN a poor algorithm in this case.

10) Naive Bayes and its use for Filtering Spam

● Naive Bayes works on conditional probabilities: the probability of an event occurring in the future can be
estimated from previous occurrences of the same event. This technique can be used to classify spam
emails; word probabilities play the main role here.
● If some words occur often in spam but not in ham, then this incoming e-mail is probably spam.


● The Naive Bayes classifier technique has become a very popular method for filtering email. Every word
has a certain probability of occurring in spam or ham email in its database. If the total of the word
probabilities exceeds a certain limit, the filter will mark the email as belonging to one category or the other.
● Here, only two categories are necessary: spam or ham.

11) APIs and other tools for scraping the Web

Web scraping is simply automating the collection of structured data sets from the internet. Web scraping may
also be known as web data extraction or data extraction. Companies utilize web scraping techniques as a way to
keep an eye on the competition.

Tools:
1. ParseHub is an incredibly powerful and elegant tool that allows you to build web scrapers without having
to write a single line of code. It is as simple as selecting the data you need.
2. Scrapy is a Web Scraping library used by python developers to build scalable web crawlers.
3. OctoParse has a target audience similar to ParseHub, catering to people who want to scrape data
without having to write a single line of code, while still having control over the full process with their
highly intuitive user interface.
4. Scraper API is designed for designers building web scrapers. It handles browsers, proxies, and
CAPTCHAs which means that raw HTML from any website can be obtained through a simple API call.
5. Mozenda caters to enterprises looking for a cloud-based self serve Web Scraping platform.
6. Content Grabber is a cloud-based Web Scraping Tool that helps businesses of all sizes with data
extraction.

12) What is meant by data ingesting? Explain in short.

Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. To
ingest something is to take something in or absorb something.

Types of data ingestion

1. Batch processing: In batch processing, the ingestion layer collects data from sources incrementally and sends
batches to the application or system where the data is to be used or stored. Data can be grouped based on a
schedule or criteria, such as if certain conditions are triggered. This approach is good for applications that don't
require real-time data. It is typically less expensive.
2. Real-time processing: This type of data ingestion is also referred to as stream processing. Data is not grouped
in any way in real-time processing. Instead, each piece of data is loaded as soon as it is recognized by the
ingestion layer and is processed as an individual object. Applications that require real-time data should use this
approach.
3. Micro batching: This is a type of batch processing that streaming systems like Apache Spark Streaming use. It
divides data into groups, but ingests them in smaller increments, which makes it more suitable for applications
that require streaming data.

Module 5 : Feature Generation and Data Visualization


1) Write a note on any open-source data visualization tool such as Seaborn or PyTorch.

Seaborn:
Seaborn is an open-source Python library built on top of matplotlib. It is used for data visualization and
exploratory data analysis. Seaborn works easily with dataframes and the Pandas library. The graphs created can
also be customized easily. Below are a few benefits of Data Visualization.
● Graphs can help us find data trends that are useful in any machine learning or forecasting project.
● Graphs make it easier to explain your data to non-technical people.
● Visually attractive graphs can make presentations and reports much more appealing to the reader.

To import the Seaborn library, the command used is:


import seaborn as sns

Different types of graphs


Count plot
A count plot is helpful when dealing with categorical values. It is used to plot the frequency of the different
categories. The column sex contains categorical data in the titanic data, i.e., male and female.
sns.countplot(x='sex',data=df)

KDE Plot
A Kernel Density Estimate (KDE) Plot is used to plot the distribution of continuous data.
sns.kdeplot(x = 'age' , data = df , color = 'black')


Distribution plot
A Distribution plot is similar to a KDE plot. It is used to plot the distribution of continuous data.
sns.displot(x = 'age',kde=True,bins = 5 , data =df)

Scatter plot
# here df refers to a dataset such as iris, with sepal/petal measurements and a species column
sns.scatterplot(x='sepal_length', y ='petal_length' , data = df , hue = 'species')

Pair plots
Seaborn lets us plot multiple scatter plots. It’s a good option when you want to get a quick overview of your
data.


sns.pairplot(df)

Heatmaps
A heat map can be used to visualize confusion, matrices, and correlation.
corr = df.corr()
sns.heatmap(corr)

Pytorch:
PyTorch is an open source machine learning library used for developing and training neural network based deep
learning models. It is primarily developed by Facebook’s AI research group. PyTorch can be used with Python as
well as with C++. Naturally, the Python interface is more polished. PyTorch (backed by companies like Facebook,
Microsoft, Salesforce, and Uber) is immensely popular in research labs. Although it is not yet on many production
servers, which are still ruled by frameworks like TensorFlow (backed by Google), PyTorch is picking up fast.

Unlike most other popular deep learning frameworks like TensorFlow, which use static computation graphs,
PyTorch uses dynamic computation, which allows greater flexibility in building complex architectures. Pytorch
uses core Python concepts like classes, structures and conditional loops — that are a lot familiar to our eyes,
hence a lot more intuitive to understand. This makes it a lot simpler than other frameworks like TensorFlow that
bring in their own programming style.

Implementation steps:
1. Install PyTorch
2. Import the Modules - The first step is of course to import the relevant libraries.
3. Gather the Data - Gather the data required to train the model.
4. Build the Network - Having done this, we start off with the real code. As mentioned before, PyTorch uses
the basic, familiar programming paradigms rather than inventing its own. A neural network in PyTorch is
an object: it is an instance of a class that defines this network and inherits from the torch.nn.Module class.
5. Train the Network - Now that the model is ready, we have to work on training the model with the data
available to us. This is done by a method train().
6. Test the Network - Similarly, we have a test method that verifies the performance of the network based
on the given test data set.
7. Put it Together - With the skeleton in place, we have to start with stitching these pieces into an
application that can build, train and validate the neural network model.


2) Explain in short, the role played by domain experts in information collection for a data science
application.

You may have studied data science and machine learning and used some machine learning algorithms like
regression, classification to predict on some test data. But the true power of an algorithm and data can be
harnessed only when we have some form of domain knowledge. Needless to say, the accuracy of the model
also increases with the use of such knowledge of data.

For example, the knowledge of the automobile industry when working with the relevant data can be used like —
Let’s say we have two features Horsepower and RPM from which we can create an additional feature like Torque
from the formula
TORQUE = HP x 5252 ÷ RPM
This could potentially influence the output when we train a machine learning model and result in higher
accuracy.

A domain expert has usually become an expert both by education and experience in that domain. Both imply a
significant amount of time spent in the domain. As most domains in the commercial world are not freely
accessible to the public, this usually entails a professional career in the domain. This is a person who could
define the framework for a data science project as they would know what the current challenges are and how
they must be answered to be practically useful given the state of the domain as it is today. The expert can judge
what data is available and how good it is. The expert can use and apply the deliverables of a data science
project in the real world. Most importantly, this person can communicate with the intended users of the project’s
outcome. This is crucial as many projects end up being shelved because the conclusions are either not
actionable or not acted upon.
If the domain expert and the data scientist are two different individuals, they can get excellent results quickly
through good communication: while the domain expert (DE) defines the task, the data scientist (DS) chooses and
configures the right toolset to solve it.

3) State any two challenges faced in developing data science applications.

1. Preparation of Data for Smart Enterprise AI


Finding and cleaning up the proper data is a data scientist's priority. Nearly 80% of a data scientist's day is spent
on cleaning, organizing, mining, and gathering data, according to a CrowdFlower poll. In this stage, the data is
double-checked before undergoing additional analysis and processing. Most data scientists (76%) agree that this
is one of the most tedious elements of their work. As part of the data wrangling process, data scientists must
efficiently sort through terabytes of data stored in a wide variety of formats and codes on a wide variety of
platforms, all while keeping track of changes to such data to avoid data duplication.

Adopting AI-based tools that help data scientists maintain their edge and increase their efficacy is the best
method to deal with this issue. Another flexible workplace AI technology that aids in data preparation and sheds
light on the topic at hand is augmented learning.

2. Generation of Data from Multiple Sources


Data is obtained by organizations in a broad variety of forms from the many programs, software, and tools that
they use. Managing voluminous amounts of data is a significant obstacle for data scientists. This method calls for
the manual entering of data and compilation, both of which are time-consuming and have the potential to result


in unnecessary repeats or erroneous choices. The data may be most valuable when exploited effectively for
maximum usefulness in company artificial intelligence.

Companies now can build up sophisticated virtual data warehouses that are equipped with a centralized
platform to combine all of their data sources in a single location. It is possible to modify or manipulate the data
that is stored in the central repository to satisfy the needs of a company and increase its efficiency. This
easy-to-implement modification has the potential to significantly reduce the amount of time and labor required
by data scientists.

3. Identification of Business Issues


Identifying issues is a crucial component of conducting a solid organization. Before constructing data sets and
analyzing data, data scientists should concentrate on identifying enterprise-critical challenges. Before
establishing the data collection, it is crucial to determine the source of the problem rather than immediately
resorting to a mechanical solution.

Before commencing analytical operations, data scientists may have a structured workflow in place. The process
must consider all company stakeholders and important parties. Using specialized dashboard software that
provides an assortment of visualization widgets, the enterprise's data may be rendered more understandable.

4. Communication of Results to Non-Technical Stakeholders


The primary objective of a data scientist is to enhance the organization's capacity for decision-making, which is
aligned with the business plan that its function supports. The most difficult obstacle for data scientists to
overcome is effectively communicating their findings and interpretations to business leaders and managers.
Because the majority of managers or stakeholders are unfamiliar with the tools and technologies used by data
scientists, it is vital to provide them with the proper foundation concept to apply the model using business AI.

In order to provide an effective narrative for their analysis and visualizations of the notion, data scientists need to
incorporate concepts such as "data storytelling."

5. Data Security
Due to the need to scale quickly, businesses have turned to cloud platforms for the safekeeping of their sensitive information. Cyberattacks and online spoofing have left sensitive data stored in the cloud exposed to the outside world, so strict measures have been enacted to protect data in the central repository against hackers. Data scientists now face additional challenges as they work within the restrictions these new rules impose.

Organizations must use up-to-date encryption methods and machine-learning-based security solutions to counter these threats. To keep productivity high, the systems should comply with all applicable security regulations and be designed so that audits do not become lengthy bottlenecks.

6. Efficient Collaboration
Data scientists and data engineers commonly collaborate on the same projects for a company, so maintaining strong lines of communication is necessary to avoid potential conflicts. To keep the workflows of both teams aligned, the organization should make the effort to establish clear communication channels. It may also choose to create a dedicated officer position to monitor whether both departments are working along the same lines.


7. Selection of Non-Specific KPI Metrics


It is a common misunderstanding that data scientists can handle the majority of the job on their own and come prepared with answers to all of the challenges the company encounters. This puts data scientists under a great deal of strain and reduces their productivity.

It is vital for any company to define a specific set of metrics against which a data scientist's analyses can be measured. It is also important to analyze the effect these indicators have on the company's operations.

The many responsibilities and duties of a data scientist make for a demanding work environment. Nevertheless, it is one of the occupations most in demand in the market today. The challenges that data scientists experience are solvable difficulties, and addressing them can increase the functionality and efficiency of workplace AI in high-pressure work situations.

4) Is it necessary to use feature extraction for classification? Which one do you prefer between filter
approach and wrapper approach when doing feature selection? Justify your answer.

It is not necessary to use feature extraction for classification, but it can be useful in many cases. Feature
extraction can improve the performance of a classifier by reducing the dimensionality of the data, removing
noisy or irrelevant features, and enhancing the separability between the classes.

When doing feature selection, there are two main approaches: filter approach and wrapper approach.

Filter approach: In this approach, features are ranked based on their statistical properties or mutual information
with the target variable, and the top-k features are selected for classification. The filter approach is
computationally efficient, easy to implement, and can be used as a pre-processing step for any classifier.

Wrapper approach: In this approach, features are selected based on their performance in improving the
accuracy of a specific classifier. A search algorithm is used to explore the space of all possible feature subsets,
and the best subset of features is selected based on cross-validation performance. The wrapper approach is
more computationally expensive but provides a more accurate representation of the feature importance for a
specific classifier.

In conclusion, I prefer the wrapper approach for feature selection because it provides a more accurate
representation of the feature importance for a specific classifier, and it considers the interactions between
features and the classifier. However, the choice between these two approaches may depend on the specific
requirements of the problem, the size of the data, and the computational resources available.
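
As a rough illustration of this trade-off, the sketch below (a minimal example, assuming scikit-learn is installed; the synthetic dataset and parameter values are arbitrary choices, not from any source) compares a filter method (SelectKBest with an ANOVA F-test) against a wrapper method (recursive feature elimination around a logistic regression) using cross-validation.

# Sketch: filter vs. wrapper feature selection (illustrative only).
# Assumes scikit-learn is installed; dataset and parameters are arbitrary.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

clf = LogisticRegression(max_iter=1000)

# Filter approach: rank features by ANOVA F-score, keep the top 5.
filter_pipe = make_pipeline(SelectKBest(score_func=f_classif, k=5), clf)

# Wrapper approach: recursively eliminate features using the classifier itself.
wrapper_pipe = make_pipeline(RFE(estimator=clf, n_features_to_select=5), clf)

for name, pipe in [("filter", filter_pipe), ("wrapper", wrapper_pipe)]:
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")

The wrapper pipeline typically takes noticeably longer to run, which is the practical price paid for letting the classifier itself judge each feature subset.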

5) What is the best way to visualize a time oriented multivariate data set? Describe this with respect to
information visualization.

The best way to visualize a time-oriented multivariate data set depends on the specific requirements of the
problem and the type of information that you want to communicate. However, some commonly used techniques
in information visualization for time-oriented multivariate data include:


Line chart: A line chart is a simple and effective way to visualize the trends and patterns in multiple time series
over time. Each series is represented by a separate line, and the lines can be color-coded to distinguish
between the different variables.

Stacked area chart: A stacked area chart is a variation of the line chart that displays the contributions of each
variable to the total over time. This can be useful for visualizing how the variables change relative to each other
over time.

Heatmap: A heatmap is a graphical representation of data where values are represented as colors. Heatmaps
can be used to visualize the relationships between multiple variables over time, with each cell representing a
combination of time and variable values.

Scatter plot matrix: A scatter plot matrix is a set of scatter plots showing the relationships between multiple
variables. Scatter plots can be time-oriented by using time as one of the variables, and this can be useful for
visualizing how the relationships between the variables change over time.

In conclusion, the choice of the best way to visualize a time-oriented multivariate data set depends on the
specific requirements of the problem and the type of information that you want to communicate. It is important to
use an appropriate visualization technique that clearly communicates the information and provides insights into
the trends and patterns in the data.
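
As a minimal sketch of two of these techniques, the example below (assuming pandas and matplotlib are available; the sensor columns and values are made up for illustration) draws a line chart and a stacked area chart for a synthetic multivariate time series.

# Sketch: visualizing a synthetic multivariate time series (illustrative only).
# Assumes pandas and matplotlib are installed; the data are made up.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=120, freq="D")
df = pd.DataFrame(
    {"sensor_a": rng.normal(10, 2, 120).cumsum(),
     "sensor_b": rng.normal(8, 2, 120).cumsum(),
     "sensor_c": rng.normal(5, 2, 120).cumsum()},
    index=dates,
)

fig, axes = plt.subplots(2, 1, figsize=(8, 6), sharex=True)

# Line chart: one line per variable, useful for comparing trends.
df.plot(ax=axes[0], title="Line chart of each variable over time")

# Stacked area chart: contribution of each variable to the total over time.
df.clip(lower=0).plot.area(ax=axes[1], title="Stacked area chart")

plt.tight_layout()
plt.show()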

6) Why is the training data random in the definition of the random forest algorithm? How can random
forests be used for predicting sales prices?

The training data in a random forest is considered random because each tree is trained on a bootstrap sample of the original data, that is, rows drawn at random with replacement. In addition, a random subset of the features is considered at each split in the decision trees. This randomness decorrelates the trees and helps to reduce overfitting, which can occur when a single decision tree is grown on all of the data with all of the features.

Random forests can be used for predicting sales prices by using the historical sales data as the input features
and the target variable being the sales price. The algorithm builds a set of decision trees, each of which makes a
prediction based on a random subset of the features. The final prediction is made by averaging the predictions
of all the trees, and this can provide a more accurate and robust prediction compared to using a single decision
tree.

Random forests can handle both continuous and categorical variables, and they can also handle missing data
and noisy data effectively. Additionally, the feature importance values generated by random forests can be used
to identify the most important variables in predicting the sales price.
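
A minimal sketch of this idea is shown below, assuming scikit-learn and pandas are installed; the column names and rows are hypothetical, not real sales records.

# Sketch: predicting sales prices with a random forest (illustrative only).
# Assumes scikit-learn and pandas; columns and values are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical historical sales data.
sales = pd.DataFrame({
    "floor_area": [55, 80, 120, 95, 60, 150, 70, 110],
    "num_rooms":  [2, 3, 4, 3, 2, 5, 3, 4],
    "age_years":  [30, 12, 5, 20, 40, 2, 15, 8],
    "sale_price": [150_000, 230_000, 380_000, 290_000,
                   140_000, 520_000, 200_000, 350_000],
})

X = sales.drop(columns="sale_price")
y = sales["sale_price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# The final prediction averages the predictions of all the trees;
# feature_importances_ shows which variables drive the price.
print("Predicted prices:", model.predict(X_test))
print("Feature importances:",
      dict(zip(X.columns, model.feature_importances_.round(3))))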

7) Feature Selection algorithms – Filters, Wrappers

Filter Methods:

Filter methods are generally used as a preprocessing step. The selection of features is independent of any machine learning algorithm. Instead, features are selected on the basis of their scores in various statistical tests of their correlation with the outcome variable. "Correlation" is used somewhat loosely here; the appropriate measure depends on the types of the feature and the outcome variable, as summarized by the measures below.

Pearson’s Correlation: It is used as a measure for quantifying linear dependence between two continuous variables X and Y. Its value varies from -1 to +1, and it is given by the covariance of X and Y divided by the product of their standard deviations: r = cov(X, Y) / (σX · σY).

LDA: Linear discriminant analysis is used to find a linear combination of features that characterizes or separates
two or more classes (or levels) of a categorical variable.
ANOVA: ANOVA stands for Analysis of variance. It is similar to LDA except for the fact that it is operated using
one or more categorical independent features and one continuous dependent feature. It provides a statistical
test of whether the means of several groups are equal or not.
Chi-Square: It is a statistical test applied to the groups of categorical features to evaluate the likelihood of
correlation or association between them using their frequency distribution.

One thing that should be kept in mind is that filter methods do not remove multicollinearity. So, you must deal
with multicollinearity of features as well before training models for your data.
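
The sketch below illustrates two filter-style scores on synthetic data (assuming pandas and scikit-learn; the feature names, data, and k value are arbitrary): Pearson correlation of each feature with the target, and a chi-square test applied through SelectKBest.

# Sketch: simple filter-style scoring (illustrative only).
# Assumes pandas and scikit-learn; the data are synthetic.
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((200, 4)), columns=["f1", "f2", "f3", "f4"])
target = (df["f1"] + 0.1 * rng.random(200) > 0.5).astype(int)

# Pearson correlation of each continuous feature with the target.
print(df.corrwith(target))

# Chi-square test for non-negative (e.g. count or binned) features.
selector = SelectKBest(score_func=chi2, k=2).fit(df, target)
print("Selected features:", df.columns[selector.get_support()].tolist())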

Wrapper Methods

In wrapper methods, we use a subset of features to train a model. Based on the inferences drawn from the previous model, we decide to add features to, or remove features from, the subset. The problem is essentially reduced to a search problem, and these methods are usually very expensive computationally.

Some common examples of wrapper methods are forward feature selection, backward feature elimination,
recursive feature elimination, etc.

Forward Selection: Forward selection is an iterative method in which we start with no features in the model. In each iteration we add the feature that best improves the model, until adding a new variable no longer improves the model's performance.
Backward Elimination: In backward elimination, we start with all of the features and, at each iteration, remove the least significant feature, i.e. the one whose removal most improves (or least harms) the model's performance. We repeat this until no improvement is observed on removing a feature.


Recursive Feature Elimination: It is a greedy optimization algorithm which aims to find the best-performing feature subset. It repeatedly builds models and sets aside the best- or worst-performing feature at each iteration, then constructs the next model with the remaining features until all the features are exhausted. Finally, it ranks the features based on the order of their elimination.
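
A minimal sketch of these three wrapper strategies is given below, assuming a recent scikit-learn (SequentialFeatureSelector requires version 0.24 or later); the synthetic dataset and the choice of logistic regression are illustrative only.

# Sketch: wrapper-style selection (illustrative only).
# Assumes scikit-learn >= 0.24; parameters are arbitrary.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Forward selection: start empty, add the feature that helps most each round.
forward = SequentialFeatureSelector(clf, n_features_to_select=4,
                                    direction="forward").fit(X, y)
print("Forward selection kept:", forward.get_support().nonzero()[0])

# Backward elimination: start with everything, drop the least useful feature.
backward = SequentialFeatureSelector(clf, n_features_to_select=4,
                                     direction="backward").fit(X, y)
print("Backward elimination kept:", backward.get_support().nonzero()[0])

# Recursive feature elimination: rank features by the order they were dropped.
rfe = RFE(estimator=clf, n_features_to_select=4).fit(X, y)
print("RFE ranking (1 = kept):", rfe.ranking_)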

Embedded Methods

Embedded methods perform feature selection during the model training, which is why we call them embedded
methods.
A learning algorithm takes advantage of its own variable selection process and performs feature selection and
classification/regression at the same time.
Embedded methods work as follows:
1. First, these methods train a machine learning model.
2. They then derive feature importances from this model, i.e. a measure of how important each feature is when making a prediction.
3. Finally, they remove unimportant features based on the derived feature importances.

Embedded methods combine the qualities of filter and wrapper methods. They are implemented by algorithms that have their own built-in feature selection methods.

Some of the most popular examples of these methods are LASSO and RIDGE regression which have inbuilt
penalization functions to reduce overfitting.

● Lasso regression performs L1 regularization which adds a penalty equivalent to absolute value of the
magnitude of coefficients.
● Ridge regression performs L2 regularization which adds a penalty equivalent to the square of the
magnitude of coefficients.
● Elastic nets perform L1/L2 regularization which is a combination of the L1 and L2. It incorporates their
penalties, and therefore we can end up with features with zero as a coefficient—similar to L1.
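
As a small illustration of the embedded idea, the sketch below (assuming scikit-learn; the alpha value is an arbitrary choice) fits a Lasso model and keeps only the features whose coefficients remain non-zero.

# Sketch: embedded selection with L1 regularization (illustrative only).
# Assumes scikit-learn; the alpha value is arbitrary.
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=10,
                       n_informative=3, noise=5.0, random_state=0)

# The L1 penalty drives the coefficients of unhelpful features to exactly zero.
lasso = Lasso(alpha=1.0).fit(X, y)
print("Lasso coefficients:", lasso.coef_.round(2))

# SelectFromModel keeps only the features with non-zero coefficients.
selected = SelectFromModel(lasso, prefit=True).transform(X)
print("Features kept:", selected.shape[1])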


8) Decision Tree

A decision tree can be used to visually and explicitly represent decisions and decision making. As the name
goes, it uses a tree-like model of decisions. Though a commonly used tool in data mining for deriving a strategy
to reach a particular goal, it's also widely used in machine learning.

How can an algorithm be represented as a tree?


For this, let’s consider a very basic example that uses the Titanic data set to predict whether a passenger will survive or not. The model below uses three features/attributes/columns from the data set, namely sex, age, and sibsp (the number of siblings or spouses aboard).

A decision tree is drawn upside down with its root at the top. In such a diagram, each condition (internal node) is the point at which the tree splits into branches (edges). The end of a branch that does not split any further is the decision (leaf); in this case, it indicates whether the passenger died or survived.

Although a real dataset will have many more features and this would be just one branch of a much bigger tree, the simplicity of this algorithm is hard to ignore: the feature importance is clear and the relations are easy to view. This methodology is commonly known as learning a decision tree from data, and the tree above is called a classification tree because the target is to classify each passenger as having survived or died. Regression trees are represented in the same manner, except that they predict continuous values, such as the price of a house. In general, decision tree algorithms are referred to as CART, or Classification and Regression Trees.

So, what is actually going on in the background? Growing a tree involves deciding which features to choose and what conditions to use for splitting, along with knowing when to stop. Because a tree can grow arbitrarily deep, it usually needs to be pruned so that it generalizes well.
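
A minimal sketch of such a classification tree is shown below, assuming scikit-learn and pandas are installed; the handful of Titanic-style rows is invented for illustration and is not the real dataset.

# Sketch: a tiny classification tree on Titanic-style features (illustrative only).
# Assumes scikit-learn and pandas; the rows below are made up.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "sex":      [0, 1, 1, 0, 0, 1, 1, 0],   # 0 = male, 1 = female
    "age":      [22, 38, 26, 35, 54, 2, 27, 4],
    "sibsp":    [1, 1, 0, 0, 0, 3, 0, 1],
    "survived": [0, 1, 1, 0, 0, 1, 1, 1],
})

X = data[["sex", "age", "sibsp"]]
y = data["survived"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Print the learned splits as readable if/else rules.
print(export_text(tree, feature_names=list(X.columns)))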

9) Random Forest

Random forest, as its name implies, consists of a large number of individual decision trees that operate as an ensemble. Each individual tree in the random forest produces a class prediction, and the class with the most votes becomes the model’s prediction.

The fundamental concept behind random forest is a simple but powerful one — the wisdom of crowds. In data
science speak, the reason that the random forest model works so well is:

A large number of relatively uncorrelated models (trees) operating as a committee will outperform any of the
individual constituent models.

The low correlation between models is the key. Just like how investments with low correlations (like stocks and
bonds) come together to form a portfolio that is greater than the sum of its parts, uncorrelated models can
produce ensemble predictions that are more accurate than any of the individual predictions. The reason for this
wonderful effect is that the trees protect each other from their individual errors (as long as they don’t constantly
all err in the same direction). While some trees may be wrong, many other trees will be right, so as a group the
trees are able to move in the correct direction. So the prerequisites for random forest to perform well are:
1. There needs to be some actual signal in our features so that models built using those features do better
than random guessing.
2. The predictions (and therefore the errors) made by the individual trees need to have low correlations
with each other.


Random forest achieves this low correlation through bagging and bootstrapping: each tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement), and each split considers only a random subset of the features.
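
The sketch below illustrates the bagging idea by hand (assuming scikit-learn and numpy; the numbers of trees and samples are arbitrary). In practice, RandomForestClassifier performs this bootstrapping and voting internally.

# Sketch: bagging by hand with bootstrap samples (illustrative only).
# Assumes scikit-learn and numpy; RandomForestClassifier does this internally.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=0)
rng = np.random.default_rng(0)

trees = []
for _ in range(25):
    # Bootstrap sample: draw n rows with replacement from the training data.
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(max_features="sqrt",
                                        random_state=0).fit(X[idx], y[idx]))

# Majority vote across the ensemble for the first five rows.
votes = np.stack([t.predict(X[:5]) for t in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
print("Ensemble prediction for first 5 rows:", majority)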

10) Ethical Issues in Data Science.

Informed Consent
In human subject research, there is a notion of informed consent. We understand what is being done, we
voluntarily consent to the experiment, and we have the right to withdraw consent at any time.

However, consent is much less clear-cut in the ordinary conduct of business, such as A/B testing. For example, a platform like Facebook may run such tests all the time without users' explicit consent, or even their knowledge.

Privacy
Privacy is a basic human need. Loss of privacy occurs when there's a loss of control over personal data.
In some cases, even when identifiable information is removed from data – like name, phone number, address,
and so on – it may not be sufficient to protect individuals' identities.

Unfair discrimination
The incorrect and unchecked use of data science can lead to unfair discrimination against individuals based on
their gender, demographics and socio-economic conditions.

Reinforcing human biases


Data science algorithms use past data to predict future outcomes. Data are generated based on human
decisions made in the past. Training the algorithm purely based on past data could lead to some of these biases
being included in the algorithms.

Algorithms are also influenced by analysts’ biases, as they may choose data and hypotheses that seem
important to them.

Lack of transparency
Data science algorithms can sometimes be a black box where the model predicts an outcome but does not
explain the rationale behind the result.

Numerous recent machine learning algorithms fall into this category. With black box solutions, it is not easy for a
business to understand and explain the reason for a business decision.
