DASC5133 FA23 Assignment


DASC 5133 Introduction to Data Science Individual Assignment

Assigned Date: Sep 13th, 2023 Due Date: Oct 4th, 2023

Instruction:
1. Your completed work will be submitted on Blackboard before 11:59pm on the due date.
2. Please name your work DASC5133_FA23_A2_[YourLastName] for submission.
3. Insert your answers into THIS document for submitting. Do NOT start a new document.
4. Make sure your name is available on every page of your submitted document, either in the header or
footer.
5. All individual assignments do NOT allow collaboration. In addition, your work will be checked against
others' using software. If any academic dishonesty violation is detected, you will get a ZERO.
6. Make sure you cite properly if you refer to other people’s work.

Grade: 100 points

Problem 1. Data Science Essential Steps (40 points)


In class, we discussed that data science is mostly:
 Turning business problems into data problems
 Collecting data
 Understanding data
 Cleaning data
 Formatting data
 Finding patterns—probably with machine learning

Please read this article (How to Structure Business Problems for Data Science Solutions) and answer the
following questions:
Q1. According to this article, what are the four common data science problems? Please provide a summary
for each.

ANS :

According to the article, there are four common data science problems, and here is a summary of each:
1. Customer-centric Data Science Problems:
 These problems are prevalent in commercial data science projects, especially in retail,
marketing, and advertising.
 Objectives include increasing revenue through improved product recommendations,
upselling, cross-selling, reducing churn, personalizing user experiences, improving
targeted marketing, performing sentiment analysis, and optimizing product or service
pricing.
 Success in these goals relies on a deep understanding of customer needs, motivations,
preferences, and behaviors using available data.
2. Optimization Problems:
 Optimization problems involve maximizing or minimizing factors like costs, revenues, risks,
time, or pollution within specific quantitative constraints.
 They are often solved by modeling them as graphs or networks and using specialized
algorithms.
 Examples include supply chain optimization, logistics, financial portfolio risk minimization,
and scheduling optimization for staffing or airline routes.
 These problems are complex because solutions are dependent on the current state,
making them path-dependent.
3. Demand Prediction:
 Traditional demand forecasting is typically a top-down process, estimating demand based
on historical aggregate data and external variables like weather.
 Data science can invert this process and estimate demand from the bottom up using
various data sources, including consumer data, macroeconomic data, and open data.
 This approach allows for more granular demand estimation, such as per store, per hour,
or per customer, which can be crucial in cases with logistical constraints.
4. Counter Fraud Analytics:
 Counter-fraud analytics deals with detecting and preventing fraudulent activities, which
can be highly challenging for several reasons.
 Fraudsters adapt to avoid detection, making their behavior unpredictable.
 Limited data points for fraudulent activities make statistical modeling difficult.
 Fraud detection is a "needle in a haystack" problem because most transactions are
legitimate, making it critical to build models that can identify rare fraudulent cases.
 This problem involves continually evolving challenges due to changes in fraud tactics and
legal/regulatory frameworks.
These four categories encompass a wide range of data science challenges commonly encountered in
business and can serve as a starting point for structuring data science projects effectively.
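The "needle in a haystack" point above can be made concrete with a tiny sketch (all numbers are synthetic): a classifier that never flags fraud still looks excellent on accuracy, which is exactly why imbalanced-class metrics such as recall matter in counter-fraud analytics.

```python
# With 1% fraud, a naive model that labels everything legitimate still
# scores 99% accuracy but catches no fraud at all.
labels = [1] * 10 + [0] * 990          # 1 = fraud, 0 = legitimate (synthetic)
predictions = [0] * len(labels)        # naive model: flag nothing

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_pos / sum(labels)        # fraction of fraud actually caught

print(f"accuracy={accuracy:.2%}, fraud recall={recall:.0%}")
# accuracy=99.00%, fraud recall=0%
```

This is why fraud models are evaluated on precision/recall over the rare class rather than overall accuracy.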

Q2. Conduct some research (make sure you cite your sources) and identify two real examples (of different
types stated in Q1) of data science projects for businesses (that are not already discussed in this article),
and fill out the following table:
Data Science Project 1:
Source: The New York Times - "How Uber Uses Data to Perfect Its Customer Experience"

What is/are the business problem(s) this project is trying to solve?
Uber aimed to enhance its customer experience by reducing pickup times and improving ride reliability.
Additionally, they sought to improve the overall efficiency of their operations and optimize pricing
strategies.

What is/are the data problem(s) this project is trying to solve?
The project needed to leverage massive volumes of data to achieve these goals, including information on
historical ride requests, traffic patterns, and GPS data. Real-time data streams were also crucial to make
on-the-fly adjustments and provide accurate ETAs.

What data needs to be collected, from where, and any issues associated with the collection?
Uber collected data from various sources, including user app interactions, driver locations, traffic
conditions, and weather forecasts. One of the challenges was collecting and processing this real-time
data rapidly enough to make accurate predictions.

Challenges in understanding the data:
Understanding and predicting traffic patterns and user behaviors accurately was challenging due to the
dynamic nature of city environments.

Challenges in data cleansing and what techniques are used:
Data cleansing involved dealing with noisy GPS data, outliers, and incomplete records. Techniques like
outlier removal and imputation were used to address these issues.

What techniques are used for formatting/transforming/structuring the data for the data science project?
Data was transformed into structured formats suitable for machine learning, including time-series data
for traffic patterns and user behavior.

What patterns will be interesting/valuable?
Valuable patterns included peak demand times, traffic congestion hotspots, and user behavior during
surge pricing.

What machine learning technique(s) are used?
Machine learning techniques such as time-series forecasting, clustering, and reinforcement learning
were used.

What patterns/models are generated?
Models were generated to predict demand patterns, optimize driver dispatches, and set pricing
strategies dynamically based on real-time supply and demand.
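As one illustration of the cleansing techniques mentioned above (outlier removal and imputation), here is a minimal pandas sketch on made-up GPS speed readings; the IQR rule and median imputation are common generic choices, not necessarily what Uber actually uses.

```python
import numpy as np
import pandas as pd

# Hypothetical GPS speed readings (km/h) with a sensor spike and a gap
speeds = pd.Series([32.0, 35.0, np.nan, 31.0, 240.0, 33.0, 34.0])

# Outlier removal via the 1.5*IQR rule, then median imputation for gaps
q1, q3 = speeds.quantile(0.25), speeds.quantile(0.75)
iqr = q3 - q1
mask = speeds.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = speeds.where(mask)                 # the 240 km/h spike becomes NaN
cleaned = cleaned.fillna(cleaned.median())   # fill all gaps with the median

print(cleaned.tolist())
# [32.0, 35.0, 33.0, 31.0, 33.0, 33.0, 34.0]
```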

Data Science Project 2:
Source: Harvard Business Review - "How Starbucks Uses AI to Counter the Effects of a Pandemic"

What is/are the business problem(s) this project is trying to solve?
Starbucks aimed to adapt to the challenges posed by the COVID-19 pandemic by optimizing store
operations, inventory management, and staffing levels. They also needed to personalize customer
experiences to drive loyalty and revenue.

What is/are the data problem(s) this project is trying to solve?
The project required the analysis of a diverse range of data, including store sales data, inventory levels,
customer preferences, and social media sentiment.

What data needs to be collected, from where, and any issues associated with the collection?
Data was collected from various sources, including point-of-sale systems, mobile app interactions, and
social media channels. Challenges included ensuring data privacy and security while collecting
customer-related data.

Challenges in understanding the data:
Understanding and predicting customer behavior during a pandemic was challenging due to the
unprecedented nature of the situation.

Challenges in data cleansing and what techniques are used:
Data cleansing involved dealing with incomplete and noisy data, especially in the case of social media
sentiment analysis. Techniques like text preprocessing and sentiment analysis were used.

What techniques are used for formatting/transforming/structuring the data for the data science project?
Data was formatted and structured to create customer profiles, sales forecasts, and inventory
optimization models.

What patterns will be interesting/valuable?
Valuable patterns included identifying customer preferences for specific products, predicting demand
fluctuations during lockdowns, and optimizing staffing levels.

What machine learning technique(s) are used?
Machine learning techniques such as recommendation systems, time-series forecasting, and natural
language processing were used.

What patterns/models are generated?
Starbucks generated models for personalized product recommendations, inventory forecasting, and
staffing optimization to navigate the challenges of the pandemic effectively.
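The text-preprocessing and sentiment-analysis step mentioned above can be sketched as a toy lexicon-based scorer; the lexicon words, weights, and review text are invented, and a production system would use far richer NLP models.

```python
import re

# Toy sentiment lexicon: word -> score (entirely made up for illustration)
LEXICON = {"love": 1, "great": 1, "slow": -1, "cold": -1}

def sentiment_score(text: str) -> int:
    """Lowercase, tokenize, and sum lexicon scores for known words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)

print(sentiment_score("Love the new oat latte, but service was SLOW"))
# 0  (one positive word, one negative word)
```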

Problem 2. Predictive Modeling for Life Insurance (40 points)

Read the article “Predictive Modeling for Life Insurance” provided in the reading pack and answer the
following questions:

Q1. What does "underwrite" mean in the insurance industry? Why is it costly in insurance?

ANS: In the insurance industry, "underwrite" refers to the process of evaluating and assessing the risk
associated with insuring a particular individual, entity, or event. The primary goal of underwriting is to
determine the terms and conditions of an insurance policy, including the premium amount that the
insured party should pay to cover the potential risks adequately. Underwriting involves a thorough
analysis of various factors to estimate the likelihood of a claim occurring and to calculate an appropriate
premium that covers the potential losses.
Key aspects of underwriting in the insurance industry include:
1. Risk Assessment: Underwriters assess the level of risk associated with an insurance application.
They analyze factors such as the applicant's age, health condition, occupation, lifestyle, past
insurance claims, and the type of insurance coverage being sought.
2. Pricing: Underwriters use their risk assessment to determine the premium that the insured party
should pay. The premium should cover the expected losses and expenses associated with the
policy, while also allowing the insurance company to make a profit.
3. Policy Terms: Underwriters establish the terms and conditions of the insurance policy, including
coverage limits, deductibles, and any specific policy exclusions or endorsements.
4. Acceptance or Rejection: Based on their risk assessment, underwriters decide whether to accept
or reject the insurance application. If they accept it, they determine the premium amount and
policy terms. If the risk is deemed too high, they may reject the application.
5. Ongoing Monitoring: In some cases, underwriters continue to monitor the insured party's risk
profile throughout the policy term and may adjust premiums or coverage as needed.
Now, regarding why underwriting can be costly in insurance:
1. Data Collection: Underwriting requires the collection and analysis of a significant amount of data
about the applicant. This may involve medical examinations, background checks, financial records,
and other relevant information. Gathering and verifying this data can be time-consuming and
expensive.
2. Expertise: Skilled underwriters with expertise in risk assessment, actuarial science, and insurance
regulations are needed to make informed decisions. Employing and retaining these experts adds
to the overall cost.
3. Technology: Insurance companies often invest in advanced technology and software to
streamline the underwriting process and analyze large datasets effectively. Developing and
maintaining these systems can be costly.
4. Regulatory Compliance: Insurance is a highly regulated industry, and underwriting practices must
comply with various laws and regulations. Ensuring compliance involves additional expenses, such
as legal and regulatory compliance teams.
5. Risk Assessment: Accurate risk assessment is crucial to an insurance company's profitability. If
underwriting is not thorough, it can lead to adverse selection, where the company insures a
disproportionately high number of high-risk individuals, resulting in potential losses.
6. Customer Service: The underwriting process involves interactions with applicants, including
communication, document collection, and answering inquiries. Providing excellent customer
service during this process also adds to the costs.
7. Fraud Prevention: Detecting and preventing insurance fraud is an ongoing challenge for insurance
companies. Implementing fraud detection measures and investigating suspicious claims require
resources.
In summary, underwriting in the insurance industry involves evaluating and pricing risks, and it can be
costly due to the need for extensive data collection, specialized expertise, technology investments,
regulatory compliance, and the importance of accurate risk assessment for the company's profitability.

Q2. Traditionally, what are the possible sources of external information to be considered in determining
insurance premium?

ANS: Traditionally, insurance companies consider various sources of external information when
determining insurance premiums. These external sources provide valuable data that helps underwriters
assess risk and set appropriate premium rates. Some of the common traditional sources of external
information in insurance underwriting include:
1. Mortality Tables: Mortality tables provide historical data on death rates within specific
demographic groups. These tables are crucial for life insurance underwriting, helping insurers
estimate the likelihood of policyholders' deaths based on factors like age, gender, and other
demographics.
2. Credit Reports: Credit reports and credit scores are often used in underwriting for various types
of insurance, such as auto and home insurance. Insurers may use credit information to assess an
applicant's financial responsibility and predict the likelihood of filing a claim.
3. Medical Records: For health and medical insurance, insurers rely on medical records to evaluate
an applicant's health status and pre-existing conditions. This information helps determine the
premium and coverage options.
4. Driving History: In auto insurance underwriting, an applicant's driving history, including traffic
violations and accidents, is a key external source of information. It helps insurers gauge the risk
of insuring a particular driver.
5. Claims History: Insurance companies maintain databases of past claims, which can be used to
assess an applicant's claims history. A history of frequent claims may result in higher premiums.
6. Location Data: Geographic information, such as the insured person's address and location of
insured property, is used in various insurance types. For example, the risk of natural disasters or
theft can vary by location, affecting premium rates.
7. Occupation and Industry: Information about an applicant's occupation and industry can be
relevant in certain insurance types. Some jobs may involve higher risks, leading to adjusted
premiums.
8. Vehicle Information: In auto insurance, details about the insured vehicle, including make, model,
year, and safety features, can impact premiums. Safer vehicles may qualify for lower rates.
9. Home Characteristics: For homeowners and property insurance, the characteristics of the insured
property, such as its size, construction materials, and security features, can influence premium
rates.
10. Environmental Factors: Environmental data, such as floodplain maps, can be used to assess the
risk of natural disasters and determine insurance rates for properties in high-risk areas.
11. Criminal Records: In some cases, insurers may check an applicant's criminal record, especially for
policies that involve liability coverage.
12. Social and Demographic Data: Social and demographic information, such as marital status, family
size, and lifestyle factors, can be used to assess risk in some insurance contexts.
It's important to note that the sources of external information considered may vary depending on the
type of insurance and the specific underwriting guidelines of the insurance company. Additionally,
advancements in data analytics and technology have expanded the range of external data sources that
insurers can use to refine their underwriting processes and more accurately price policies.

Q3. It is stated in the article that "Insurers have begun to turn to predictive models for scientific
guidance of expert decisions in areas such as claims management, fraud detection, premium audit,
target marketing, cross-selling, and agency recruiting and placement" (Page 5). Do some research and
briefly explain what each application is about.

ANS: Below is an explanation of each application.

1. Claims Management: Claims management involves the process of handling insurance claims filed
by policyholders. Predictive models in this context are used to assess the validity and severity of
claims. By analyzing historical claims data and various factors, insurers can predict the likelihood
of a claim being legitimate or fraudulent. This helps them allocate resources effectively and make
quicker and more accurate claims decisions.
2. Fraud Detection: Predictive models are employed to identify fraudulent insurance claims. These
models analyze data patterns and anomalies in claims submissions to flag potentially fraudulent
activities. By detecting fraud early, insurers can reduce financial losses and maintain the integrity
of their insurance policies.
3. Premium Audit: Premium audits are conducted to ensure that policyholders are paying the
correct insurance premiums based on their actual risk exposure. Predictive models can assist in
automating this process and identifying any discrepancies in premium calculations, which can
result in adjustments to the policy premium.
4. Target Marketing: Predictive models are used to analyze customer data and identify individuals
or businesses most likely to purchase specific insurance products. This allows insurers to tailor
their marketing efforts more effectively, improving the return on investment for marketing
campaigns.
5. Cross-Selling: Cross-selling involves offering additional insurance products to existing
policyholders. Predictive models can help insurers identify which policyholders are most likely to
be interested in complementary insurance products, thereby increasing revenue for the insurer.
6. Agency Recruiting and Placement: Insurers use predictive models to assess the potential
performance of insurance agents and brokers. These models consider various factors and
attributes to identify individuals who are well-suited to the insurance sales role. This approach
helps insurers make better recruiting and placement decisions.
In all these applications, predictive modeling leverages data analysis and statistical techniques to make
more informed and data-driven decisions. It enhances the efficiency and effectiveness of various
insurance processes, ultimately leading to better business outcomes for insurers.

Q4. Briefly explain how predictive modeling helps reduce underwriting costs. You can refer to the
example in Table 1.

ANS : Predictive modeling helps reduce underwriting costs by streamlining the underwriting process and
making it more efficient. Here's how it works, with reference to the example in Table 1:
1. Identifying Low-Risk Applicants: Predictive models assess applicant profiles using a combination
of underwriting requirements and third-party data. These models predict the likelihood of an
applicant being low-risk based on historical data and patterns.
2. Reducing Invasive Tests: Instead of requiring all applicants to undergo a full battery of medical
tests and evaluations, the predictive model recommends which applicants are likely to be low-
risk. In the example in Table 1, about 30% to 50% of applicants are identified as low-risk by the
model.
3. Cost Savings: By relying on the model's recommendations, insurers can eliminate the need for
certain underwriting requirements, such as paramedical exams, blood and urine analysis, EKGs,
and stress tests. These requirements can be costly and time-consuming.
4. Faster Decision-Making: Predictive modeling allows insurers to issue policies more quickly to low-
risk applicants, often within just a few days. This contrasts with the traditional underwriting
process, which may take longer due to the extensive requirements.
5. Efficiency and Productivity: Streamlining the underwriting process with predictive models not
only saves costs but also increases efficiency. Underwriters can focus on more complex cases,
while routine work is handled by the model. This can lead to increased productivity for the
underwriting staff.
6. Competitive Advantage: Insurers can attract more applicants by offering a faster, less invasive
underwriting process. This competitive advantage can lead to more policies being sold and greater
market share.
In the example provided in Table 1, the potential annual savings for a representative life insurer are
significant, ranging from $2 to $3 million. These savings are achieved by reducing the number of expensive
requirements for a large portion of applicants while maintaining underwriting accuracy and efficiency.
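The triage idea above can be sketched in a few lines. All numbers here are hypothetical: the model scores, the 0.8 cutoff, and the $300 per-applicant exam cost are illustrative assumptions, not figures from the article's Table 1.

```python
# Hypothetical triage rule: a model score routes clearly low-risk applicants
# past the expensive medical workup. Scores, cutoff, and cost are invented.
EXAM_COST = 300  # assumed per-applicant cost of the full medical workup ($)

applicants = {"A": 0.92, "B": 0.55, "C": 0.88, "D": 0.31}  # model P(low risk)

# Applicants above the cutoff skip the exams entirely
fast_track = [a for a, p in applicants.items() if p >= 0.8]
savings = len(fast_track) * EXAM_COST

print(fast_track, savings)
# ['A', 'C'] 600
```

Scaled to hundreds of thousands of applications per year, fast-tracking 30% to 50% of applicants is what drives the multi-million-dollar savings the article describes.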
Q5. The Modeling Process section (Page 19) explains how the predictive model is built. Please create
a flowchart to demonstrate the process (use a diagram tool of your choice) and insert the diagram below.

Problem 3. Supply Chain Analytics (20 points)


Read the article here and answer the following questions:
Q1. What is a supply chain? Why is it important to manage the supply chain using the data-driven
approaches described in the article?

Ans: A supply chain is the entire system of producing and delivering a product or service, from the very
beginning stage of sourcing the raw materials to the final delivery of the product or service to end-users.

The supply chain lays out all aspects of the production process, including the activities involved at each
stage, information that is being communicated, natural resources that are transformed into useful
materials, human resources, and other components that go into the finished product or service.

A supply chain can be defined as several parties exchanging flows of material, information, or money
resources with the ultimate goal of fulfilling a customer request.

A "data-driven supply chain" is supply chain quality management based on the collection and analysis of
product information from every important point of production.

Data such as quality control (QC) inspection results, manufacturing speed, and even
delivery efficiency can all be gathered at their corresponding production
points. Then, with the help of machine-based analytics, these large chunks of data
can provide companies with a clear picture of their supply chain's current
performance.

Here is where companies can pinpoint specific areas along their supply chain that are
in need of quality, compliance, and efficiency improvement.

Additionally, predictions about future growth potential and trends in product demand can be
determined, for instance, through subtle changes in inventory movement over time.

Finally, a data-driven supply chain management system can also help companies
determine the 'cost of quality,' meaning the increased cost of reworking or
rebuilding products based on quality and compliance failures discovered, either
internally by QC inspectors, or worse, externally by customers or regulators.

When successfully implemented, a data-driven supply chain management system allows managers and
operators at every level of production to work together, proactively and with confidence, to address all
manner of sourcing, manufacturing, and delivery problems.

A confident and proactive supply chain, backed by powerful data-driven analysis, is your company's best
bet for increasing sales, boosting profit margins, and garnering customer retention and brand loyalty.

Q2. What are the four types of supply chain analytics? Please provide an example (one for each)
to illustrate.

Ans: Different Types of Supply Chain Analytics

Supply chain analytics can be represented as a set of tools that use the flow of information to answer
questions and support the decision-making process.

[Figure: Four Types of Supply Chain Analytics]

For each type, you’ll need specific methodology, mathematical concepts and
analytics tools to answer the question.

Descriptive Analytics

A set of tools to provide visibility and a single source of truth across the
supply chain to track your shipments, detect incidents and measure the
performance of your operations.

The final deliverable is usually a set of dashboards that can be put on the cloud using PowerBI/Tableau,
such as:

 Warehouse Workload Report, reporting the key indicators to measure warehouse activity (orders
prepared, productivity, logistic ratios)
 Supply Chain Control Tower, to track your shipments along your distribution networks
 Transportation Route Analysis, to visualize the routing of your past deliveries

Diagnostic Analytics
This can be summarized as incident root cause analysis. Let us take the
example of the supply chain control tower.

[Figure: Time Stamps]

If a shipment is delivered late, the root cause analysis consists of checking each time stamp to see where
your shipment missed a cut-off time.

The process of analysis is designed by the operational teams and implemented by data engineers for
complete automation.
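The cut-off check described above can be sketched in a few lines; the milestone names, planned cut-offs, and actual time stamps are invented for illustration.

```python
from datetime import datetime

# Planned cut-off time for each shipment milestone (invented example)
cutoffs = {
    "order_received": datetime(2023, 9, 1, 10, 0),
    "picked":         datetime(2023, 9, 1, 14, 0),
    "shipped":        datetime(2023, 9, 1, 18, 0),
}
# Actual time stamps recorded for a late shipment
actuals = {
    "order_received": datetime(2023, 9, 1, 9, 45),
    "picked":         datetime(2023, 9, 1, 15, 30),   # missed its cut-off
    "shipped":        datetime(2023, 9, 1, 19, 10),
}

# Root cause = the first milestone whose actual time exceeds its cut-off
root_cause = next(step for step in cutoffs if actuals[step] > cutoffs[step])
print(root_cause)  # picked
```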

Predictive Analytics

Support the operations to understand the most likely outcome or future scenario and its business
implications.

[Figure: Example of Predictive Analytics]

For example, by using predictive analytics, you can estimate the impact of
future promotions on the sales volumes in stores to support inventory
management.
POSITIONS INVOLVED: Supply Chain Engineers, Data Scientists, Business Experts
TOOLS: Cloud computing, Python processing libraries (Pandas, Spark), BI, Machine Learning, Statistics

In the example above, data scientists will work with business experts to
understand which features can help to improve the accuracy of sales
forecasts.
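The promotion-impact estimate described above can be sketched with synthetic weekly sales data. For a single 0/1 feature (promotion on/off), the ordinary least squares slope reduces to the difference between the two group means, so no modeling library is needed for the illustration.

```python
# Synthetic weekly sales: (promo flag, units sold). Numbers are invented.
weeks = [(0, 100), (1, 130), (0, 98), (1, 127), (0, 105), (1, 133)]

on  = [units for promo, units in weeks if promo == 1]
off = [units for promo, units in weeks if promo == 0]

# OLS slope for one binary regressor = mean(promo weeks) - mean(non-promo weeks)
uplift = sum(on) / len(on) - sum(off) / len(off)
print(f"estimated promo uplift: {uplift:.1f} units/week")
# estimated promo uplift: 29.0 units/week
```

In practice the data scientists mentioned above would add store, seasonality, and price features, but the core question (how much does the promotion move sales?) is the same.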

Prescriptive Analytics

Assist the operations to solve problems and optimize the resources to reach the best efficiency.

[Figure: Examples of Prescriptive Analytics]

Most of the time, prescriptive analytics is linked to optimization problems where you need to maximize
(or minimize) objective functions considering several constraints.
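A tiny prescriptive sketch (all numbers invented): choose production quantities for two products to maximize profit under a shared machine-hours constraint. Real problems of this kind are solved with LP/MIP solvers rather than brute-force enumeration, but the objective-plus-constraints structure is the same.

```python
# Maximize 30a + 50b subject to 1a + 2b <= 10 machine hours (invented data)
HOURS = 10
profit = {"A": 30, "B": 50}     # profit per unit ($)
hours  = {"A": 1,  "B": 2}      # machine hours per unit

# Enumerate every feasible integer plan and keep the most profitable one
best = max(
    ((a, b) for a in range(HOURS + 1) for b in range(HOURS + 1)
     if a * hours["A"] + b * hours["B"] <= HOURS),
    key=lambda ab: ab[0] * profit["A"] + ab[1] * profit["B"],
)
print(best)  # (10, 0): product A earns more profit per machine hour
```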
