Data Management Assg - 1
Insurance Company
1. Database Documentation
Belfast Insurance Company (BIC) currently stores its data separately in several files. The data is
segregated according to the various insurance portfolios the company offers. This makes it
difficult to get a complete, bird's-eye view of the company as a whole. BIC
currently deals with health insurance, motor insurance, and travel insurance; hence, it has
three separate segment-specific datasheets and a separate database for customer information.
The important customer attributes, such as name, card type, occupation, gender, age, and
location, are captured and stored in the customer database, while the segment-specific datasheets
hold niche attributes such as the start and end dates of a policy, claims, and types. The first step
was to build an analytics base table (ABT) by consolidating all the databases into
a single base table, which is then used to carry out the analysis.
Step 1 - Download and save all the datasheets on your local machine, preferably in a single
folder.
Step 3 - Select the appropriate attributes within the sheets using a SELECT query.
● Data 1 - Customer
   Primary Key: Customer ID
   Foreign Keys: Motor ID, Health ID, Travel ID
● Motor datasheet
   Primary Key: Motor ID
● Health datasheet
   Primary Key: Health ID
● Travel datasheet
   Primary Key: Travel ID
Step 5 - Using the identified PKs and FKs, join the datasheets with a JOIN query. The type of join
we used is a LEFT JOIN.
Step 1 - Open Query Design under the Create tab and load all the datasheets as tables
by clicking on their names.
Step 2 - Select all the attributes and PKs from each entity while making sure that FKs are not
selected. Drag and drop all the selected attributes into the table console below.
Step 3 - Match the PKs to the FKs of each database. This creates an INNER JOIN, which is the
MS Access default when no join type is specified.
Step 4 - Change the join type using Join Properties. Select LEFT JOIN to make a
consolidated ABT.
Fig 1 - Database Structure
The databases are linked using Primary Keys (PKs) and Foreign Keys (FKs). A primary key is
a column, or combination of columns, whose values uniquely identify each record in a table.
A foreign key is a column or combination of columns whose values match the primary key of
another table; it forms the link between the two tables. In our case, the FKs of
Data 1 - Customer were used to link to the PKs of the other datasheets.
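The PK/FK linking described above can be sketched with Python's standard-library sqlite3 module. This is only an illustration: SQLite stands in for MS Access, and the table names, column names, and rows (customer, motor, policy_start, and so on) are hypothetical rather than BIC's real schema. It shows why a LEFT JOIN keeps customers who have no motor policy, whereas an INNER JOIN drops them.

```python
import sqlite3

# In-memory database with hypothetical toy tables (not BIC's real schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,   -- PK of the customer table
        name        TEXT,
        motor_id    INTEGER                -- FK pointing at motor.motor_id
    );
    CREATE TABLE motor (
        motor_id     INTEGER PRIMARY KEY,  -- PK of the motor table
        policy_start TEXT
    );
    INSERT INTO customer VALUES (1, 'Ann', 10), (2, 'Bob', NULL), (3, 'Cara', 11);
    INSERT INTO motor VALUES (10, '2021-01-01'), (11, '2021-06-15');
""")

# Same query for both join types; only the JOIN keyword changes.
join_sql = """
    SELECT c.customer_id, c.name, m.policy_start
    FROM customer AS c
    {join_type} JOIN motor AS m ON c.motor_id = m.motor_id
    ORDER BY c.customer_id
"""

inner_rows = conn.execute(join_sql.format(join_type="INNER")).fetchall()
left_rows = conn.execute(join_sql.format(join_type="LEFT")).fetchall()
conn.close()

print(len(inner_rows))  # only customers that have a motor policy
print(len(left_rows))   # every customer; a missing policy shows up as None
```

Because the ABT should contain one row per customer regardless of which policies they hold, the LEFT JOIN from the customer table is the variant that matches the approach described here.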
There are a few limitations to using MS Access for data analysis. The biggest
limitation is that visualisation options in MS Access are limited, so it needs support from an
external visualisation tool. Hence, the operator must be proficient in both tools to generate
quality reports. Features such as Power Pivot are not available in MS Access, although MS
Excel has them. Only a limited set of summary functions is available, and in-depth
analysis can sometimes prove cumbersome. However, these limitations can be tackled by
using MS Access in conjunction with MS Excel or any other data visualisation tool. We can also
skip the preliminary analysis in MS Access and transition entirely to R or
Python.
This report aims to demonstrate and address the data quality issues in BIC's database. It
assesses the state of the data using characteristics such as accuracy, consistency,
integrity, and usability. If the data quality of a database is low, the insights drawn from that
data will be skewed and might not give us the correct picture. Hence, it is important to
address, document, and rectify any possible data flaws before running the analysis. While
running the data quality check and preliminary analysis, we found a few data quality issues. The
"Gender" column had a few "m"s instead of "male" and "f"s instead of "female". As a result,
calculating the average age by gender returned four rows instead of two. Since there were four
different values ("male", "m", "female", and "f"), the records were split across all four groups,
giving us misleading averages. This was fixed using UPDATE queries in MS Access.
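To illustrate the effect, the short Python sketch below groups a handful of made-up records by gender before and after the labels are cleaned. The records, the GENDER_MAP dictionary, and the function name are hypothetical; the actual fix was done with UPDATE queries in MS Access.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (gender, age) records with inconsistent labels.
records = [("male", 40), ("m", 30), ("female", 50), ("f", 20), ("female", 35)]

# Mapping from the short codes to the full labels.
GENDER_MAP = {"m": "male", "f": "female"}


def average_age_by_gender(rows, normalise=False):
    """Return {gender label: mean age}, optionally cleaning the labels first."""
    groups = defaultdict(list)
    for gender, age in rows:
        label = gender
        if normalise:
            key = gender.strip().lower()
            label = GENDER_MAP.get(key, key)
        groups[label].append(age)
    return {label: mean(ages) for label, ages in groups.items()}


raw = average_age_by_gender(records)
clean = average_age_by_gender(records, normalise=True)

print(sorted(raw))    # four groups before cleaning
print(sorted(clean))  # two groups after cleaning
```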
The next issue that we identified was an outlier in the "Age" column. An outlier pulls the
average age toward itself. We first suspected it after noticing a significant difference between
the median and the mean. To confirm this, we plotted a histogram, whose X-axis stretched far
beyond the bulk of the data, indicating the outlier. This was fixed by using the "mutate" and
"replace" functions in R: the outliers were replaced by the mean value of the column. A similar
outlier was also identified and replaced in the "dependent kids" column. In addition, a few
inconsistent values were identified in the "ComChannel" column. Instead of using the entire
word, customers entered initials for the channel of communication, which created redundant
categories. Just like the "Gender" column, this issue was fixed by using UPDATE queries in
MS Access.
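A minimal Python sketch of the mean-versus-median check and the replacement step follows. Two things are assumptions made for this illustration and are not taken from the analysis itself: the 1.5 x IQR rule used to flag outliers, and the choice to fill with the mean of the remaining (non-outlier) values. The ages are made up.

```python
from statistics import mean, median, quantiles

# Hypothetical ages; 420 stands in for a data-entry error.
ages = [23, 31, 35, 38, 42, 45, 51, 58, 420]


def replace_outliers_with_mean(values):
    """Replace values outside the 1.5 * IQR fences with the mean of the rest."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles (default 'exclusive' method)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = [v for v in values if low <= v <= high]
    fill = mean(inliers)
    return [v if low <= v <= high else fill for v in values]


# A large gap between the mean and the median hints at an outlier.
print(mean(ages), median(ages))

cleaned = replace_outliers_with_mean(ages)
print(mean(cleaned), median(cleaned))
```

Filling with the mean of the non-outlier values avoids letting the outlier inflate its own replacement; using the plain column mean would be the other obvious choice.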
From the preliminary analysis, it was clear that the majority of data quality issues were due to
typing errors. These can be prevented by restricting input with data validation. Data
validation allows the user to input data only according to the rules set by
the administrator and rejects any value that does not satisfy them. Alternatively, users
can be provided with a drop-down menu. This not only ensures that only permitted values are
entered for that attribute, but also eliminates free-text variation.
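As a rough sketch of what such a validation rule could look like in code, the Python function below accepts a record only if each field takes a permitted value. The allowed value sets, the 0 to 120 age range, and the field names are assumptions chosen for illustration; in MS Access the same idea would be configured through validation rules or a lookup list.

```python
# Hypothetical allowed values, mirroring the entries of a drop-down list.
ALLOWED_CHANNELS = {"Email", "Phone", "SMS"}
ALLOWED_GENDERS = {"male", "female"}


def validate_record(record):
    """Return a list of validation errors; an empty list means the record is accepted."""
    errors = []
    if record.get("Gender") not in ALLOWED_GENDERS:
        errors.append(f"Gender must be one of {sorted(ALLOWED_GENDERS)}")
    if record.get("ComChannel") not in ALLOWED_CHANNELS:
        errors.append(f"ComChannel must be one of {sorted(ALLOWED_CHANNELS)}")
    age = record.get("Age")
    # Assumed plausible range for a policyholder's age.
    if not isinstance(age, int) or not 0 <= age <= 120:
        errors.append("Age must be a whole number between 0 and 120")
    return errors


good = {"Gender": "female", "ComChannel": "Email", "Age": 34}
bad = {"Gender": "f", "ComChannel": "E", "Age": 420}

print(validate_record(good))  # accepted: no errors
print(validate_record(bad))   # rejected: one error per invalid field
```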
2. Insights Report
This report discusses the main insights that were drawn from the analysis. It
describes the types of analysis that were performed, explains their results, and considers
their implications and possible responses. Furthermore, it proposes a few
recommendations that we feel will help BIC expand and grow its business. These
recommendations are entirely based on the quantitative and qualitative analysis that was
carried out.
After a careful analysis of motor insurance policies, it was found that the occurrence of claims in
urban areas is 27.20 percent higher than in rural areas. The company is therefore spending more
on claims for urban vehicles, which reduces its net gains. On the other
hand, the average value of urban vehicles is much higher than that of rural vehicles, and hence
the premium that the company receives is significantly higher. It would therefore not make sense
for BIC to focus only on rural vehicles, since the profit margin on urban vehicles is higher.
Hence, we recommend applying a loading premium to urban vehicles. This should help maintain
the net profit and offset the higher cost of claims.
BIC sells five types of travel insurance. After a detailed analysis, it was found that the average
age of Backpacker Travel policyholders was 25.23 years, while that of Business Travel
policyholders was 39.40 years. The average age of Premium Travel policyholders was 31.82
years, that of Senior Travel policyholders was 65.56 years, and that of Standard Travel
policyholders was 43.56 years. This insight gives us a clear picture of the age
bracket for each type of travel policy. We recommend that BIC offer the Backpacker policy to
customers under the age of 28. The Premium policy should be pitched to people aged 29 to 35,
and the Business policy should be targeted at people aged 36 to 41, among whom it is likely
to be more popular. The Standard policy is more likely to be bought by customers aged between 42
and 60 years, while the Senior policy should be pitched to customers who are older than 60
years. This will help the marketing team target the right set of customers without
wasting effort.
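The recommended age brackets can be expressed as a simple lookup, sketched below in Python. The function and constant names are hypothetical. Note one assumption: the brackets above do not say where a customer aged exactly 28 belongs, so this sketch groups age 28 with the Backpacker policy.

```python
from bisect import bisect_left

# Inclusive upper age bound of each bracket except the last (Senior has none).
# Assumption: age 28, which the brackets leave unassigned, goes to Backpacker.
UPPER_BOUNDS = [28, 35, 41, 60]
POLICIES = ["Backpacker", "Premium", "Business", "Standard", "Senior"]


def suggested_travel_policy(age):
    """Return the travel policy to pitch to a customer of the given age."""
    # bisect_left finds the first bracket whose upper bound is >= age;
    # ages above every bound fall through to the last policy (Senior).
    return POLICIES[bisect_left(UPPER_BOUNDS, age)]


for age in (22, 30, 38, 50, 70):
    print(age, suggested_travel_policy(age))
```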
When we analysed "ComChannel" against "Age", we found that certain age groups preferred
certain modes of communication. Customers who were less than 30 years old preferred
email or SMS, while customers in the age bracket of 31 to 60
preferred email and phone. Customers who were over 60 years of age preferred only the
phone. It is important to note that if an age
group is targeted via the wrong mode of communication, the effort is likely to be wasted. BIC can
use these preferences to market its products as well, and we recommend doing so for
better reach. This would increase the reachability and visibility of BIC significantly. We can send
a monthly or weekly newsletter or a brochure to a specific age group by using its preferred
modes of communication. Since customers have chosen these modes voluntarily, these are likely
the platforms on which they are most active.
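A cross-tabulation of communication channel against age group, of the kind used for this insight, might be sketched in Python as follows. The customer records are made up, and the grouping of a customer aged exactly 30 with the youngest group is an assumption, since the brackets above leave that age unassigned.

```python
from collections import Counter

# Hypothetical (age, channel) pairs; real data would come from the ABT.
customers = [
    (24, "Email"), (27, "SMS"), (29, "Email"),
    (45, "Phone"), (52, "Email"),
    (67, "Phone"), (71, "Phone"),
]


def age_group(age):
    """Bucket an age into the three groups discussed above."""
    if age <= 30:  # assumption: age 30 is grouped with the youngest customers
        return "30 and under"
    if age <= 60:
        return "31-60"
    return "over 60"


# Count customers for every (age group, channel) combination.
crosstab = Counter((age_group(age), channel) for age, channel in customers)

for (group, channel), count in sorted(crosstab.items()):
    print(group, channel, count)
```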
BIC has a total of three portfolios: motor insurance, health insurance, and travel insurance.
After analysing the data, it was found that 3,357 customers bought motor insurance, 2,538
customers purchased health insurance, and 2,105 purchased travel insurance. This tells us that
around 32.3% more customers bought motor insurance than health insurance, while around 20.6%
more customers bought health insurance than travel insurance. There were 501 customers who
had motor insurance but did not have health or travel insurance. Likewise, there were 1,149
customers who had both motor and health insurance but not travel insurance. This gives us an
opportunity to tap into those customers for prospective business. The advantage of doing this is
that the customer is already aware of BIC and its pros and cons. The customer
chose this company at some point in the past, which is why they bought a policy from BIC.
So when a customer is offered a new product from a familiar brand or company, the chances of
conversion are high. The customer is likely to purchase from BIC again unless there was
some unpleasant experience. So instead of approaching unknown customers through cold calling, BIC
could use the hot calling technique with its existing customers and reap the benefits.
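The relative differences can be recomputed from the customer counts quoted above, and the cross-sell segments can be found with set operations. The Python sketch below shows both; the customer ID sets are hypothetical stand-ins for the real policy tables, and the helper name pct_more is introduced only for this example.

```python
def pct_more(a, b):
    """Percentage by which count a exceeds count b."""
    return (a - b) / b * 100


# Customer counts per portfolio as reported in the analysis.
motor_vs_health = pct_more(3357, 2538)
health_vs_travel = pct_more(2538, 2105)
print(round(motor_vs_health, 2), round(health_vs_travel, 2))

# Hypothetical customer IDs holding each type of policy.
motor = {1, 2, 3, 4, 5}
health = {2, 3, 6}
travel = {3, 7}

# Customers with motor insurance only: candidates for health and travel offers.
motor_only = motor - health - travel

# Customers with motor and health but no travel: candidates for travel offers.
motor_health_no_travel = (motor & health) - travel

print(sorted(motor_only), sorted(motor_health_no_travel))
```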
All of the above recommendations to BIC are based solely on quantitative and qualitative
analysis. We used descriptive and diagnostic analysis techniques to identify and address the
data quality issues. Since this approach does not employ any algorithm, the accuracy and quality
of the recommendations depend entirely on the experience and knowledge of the analysts. We
recommend using more sophisticated machine learning algorithms in future work. For example, we
could deploy unsupervised machine learning algorithms, which recognise patterns in
datasets without labelled examples, alongside models that make predictions based on past trends.
This would mitigate the risk of human error and also help BIC move on to the next phase of
analytics, which is predictive analytics.
2. UPDATE Final_DB1
   SET Gender = 'Male'
   WHERE Gender = 'm';

   UPDATE Final_DB1
   SET Gender = 'Female'
   WHERE Gender = 'f';

3. UPDATE Final_DB1
   SET ComChannel = 'Email'
   WHERE ComChannel = 'E';

   UPDATE Final_DB1
   SET ComChannel = 'Phone'
   WHERE ComChannel = 'P';

   UPDATE Final_DB1
   SET ComChannel = 'SMS'
   WHERE ComChannel = 'S';

4. SELECT t1.[Gender],
          AVG(t1.[Age]) AS Average_Age,
          MIN(t1.[Age]) AS Minimum_Age,
          MAX(t1.[Age]) AS Maximum_Age,
          STDEV(t1.[Age]) AS Deviation
   FROM Final_DB1 AS t1
   WHERE t1.[Gender] <> ''
   GROUP BY t1.[Gender];
Appendix 2: R Code
1. Building ABT
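# Check and set the working directory
getwd()
setwd("/Users/abhishek/Documents/R_Projects")

# Load the packages (install dplyr first if it is not already installed)
library(readxl)
install.packages("dplyr")
library(dplyr)

# Final_DB1 is assumed to be loaded already; the import step is not shown here
summary(Final_DB1)

# Standardise the Gender labels; assign the result back so the change is kept
Final_DB1 <- Final_DB1 %>%
  mutate(Gender = replace(Gender, Gender == "m", "male"))
Final_DB1 <- Final_DB1 %>%
  mutate(Gender = replace(Gender, Gender == "f", "female"))

# Cross-tabulate communication channel against age
table(Final_DB1$ComChannel, Final_DB1$Age)

# Expand the ComChannel initials to the full channel names
Final_DB1 <- Final_DB1 %>%
  mutate(ComChannel = replace(ComChannel, ComChannel == "E", "Email"))
Final_DB1 <- Final_DB1 %>%
  mutate(ComChannel = replace(ComChannel, ComChannel == "P", "Phone"))
Final_DB1 <- Final_DB1 %>%
  mutate(ComChannel = replace(ComChannel, ComChannel == "S", "SMS"))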
6. Summarising by groups