NASSCOM Internship Report
BACHELOR OF TECHNOLOGY in
COMPUTER SCIENCE AND ENGINEERING
By
CERTIFICATE
This is to certify that the “Internship Report” submitted by <NAME> (<Registration Number>) is work done by him and submitted during the 2021–2022 academic year, in partial fulfillment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE AND ENGINEERING at VIT Bhopal University, Kothri-Kalan.
CERTIFICATION
Certified in the Gold Tier with a score in the 95th percentile.
ACKNOWLEDGEMENT
I would like to thank my Program Chair Dr. Sandip Mal for his
constructive criticism throughout my internship.
<NAME>
(<Registration Number>)
ABSTRACT
The process of analyzing, cleaning, manipulating, and modelling data with the objective of identifying useful information, informing conclusions, and supporting decision-making is known as data analysis. Data analysis has several dimensions and approaches, encompassing a wide range of techniques under a variety of names, and is applied in many business, scientific, and social science sectors. Data analysis is important in today's business environment, since it helps firms make more scientific choices and run more efficiently.
Data mining is a type of data analysis that focuses on statistical modelling and knowledge discovery for predictive rather than just descriptive purposes, whereas business intelligence is a type of data analysis that depends heavily on aggregation and is primarily concerned with business data. In statistical applications, data analysis may be separated into descriptive statistics, exploratory data analysis (EDA), and confirmatory data analysis (CDA). CDA focuses on verifying or falsifying existing assumptions, whereas EDA focuses on identifying new characteristics in the data. Text analytics uses statistical, linguistic, and structural techniques to extract and classify information from textual sources, a type of unstructured data. Predictive analytics focuses on the application of statistical models for predictive forecasting or classification. All of the aforementioned are examples of data analysis.
As a result, implementing a measurement and data analysis strategy is a recognized best practice within the software industry for assisting stakeholders in making decisions. However, putting measurement tools and analytics into practice in industry is difficult. The real-world issues that emerge during the execution of a software measurement and analytics project are discussed in this chapter. We also share what we have learned about overcoming these obstacles, along with best practices for conducting practical, successful data analysis in the workplace. The lessons learned may be used by researchers who want to work on data analytics with industry partners, as well as by industry practitioners who want to set up a successful measurement program and reap its advantages.
Methodology:
The knowledge sessions for this certification were spread over the course of two semesters (5th and Interim). The course curriculum was designed by NASSCOM with the help of its industry partners, with the aim of making students future-ready and more skilled. NAS1001 (Associative Data Analytics) and NAS2001 (NASSCOM Advance Data Analytics) were introduced in our college curriculum, and my batch was the first to make use of this wonderful opportunity.
The Institute combines pioneering research with top class education. An innovative
curriculum allows the student flexibility in selecting courses and projects. Students,
even at the undergraduate level, get to participate in ongoing research and technology
development - an opportunity unprecedented in India. As a result, a vibrant
undergraduate program co-exists with a strong postgraduate program.
Organization Information
Sector Skills Council NASSCOM (SSC NASSCOM) is the national standard-setting body for IT skills, set up under the aegis of the National Skill Development Corporation and the Ministry of Skill Development & Entrepreneurship. SSC NASSCOM acts as a conduit to support the voices of, and build synergies across, the different groups of stakeholders it works with.
Benefits to the company / institution through your report:
The Institute combines pioneering research with top class education. An innovative
curriculum allows the student flexibility in selecting courses and projects. Students,
even at the undergraduate level, get to participate in ongoing research and technology
development - an opportunity unprecedented in India.
Table of Contents
BACHELOR OF TECHNOLOGY
ACKNOWLEDGEMENT
Methodology
Programs and opportunities
Organization Information
Benefits to the company / institution through your report
National Occupational Standards done in the Internship, Week by Week
2. History
DEFINITION
5. STEPS IN DATA ANALYSIS
Hardware Requirements
BIBLIOGRAPHY
National Occupational Standards (NOS) are statements of the conduct and performance expected of an individual carrying out a function in the workplace, together with the understanding and knowledge they need to meet that standard consistently. Each NOS defines a key function of the job role, and every employee should practice it at their workplace.
We studied the following NOS in this data analytics course:
a) Overview:
Here, the students learn the various documentation techniques used in the corporate world. These include various types of documents, such as case studies, best practices, project artifacts, reports, minutes, policies, procedures, and work instructions. However, technical documents, i.e. the documents associated with an application or product, are not covered here.
b) Goal:
The main goal of this session is that the students get hands-on practice with MS Word and MS Visio, and become able to draft reports and documents following the techniques used in the corporate world.
c) Objective:
● To agree the document’s purpose, scope, format, and target audience with the right group of people.
● To discuss and work with people in the organization to collect and verify the information required for the documents.
● To access existing documents, language standards, templates, and documentation tools from the organization.
● To finalize the content and structure of the document with an appropriate group of people.
● To create documents that meet the standard template and agreed language standards.
● To discuss the documents with the group and make changes when relevant inputs are given.
● To submit the documents for approval.
● After approval, to publish the documents to the agreed standards.
● To update the organization’s knowledge base with the documents.
● To comply with the organization’s policies, procedures, and guidelines when creating the documents.
a) Overview:
In this session, students will learn how to use the R tool for business analytics. Further, the students will get an idea of applied statistical concepts such as descriptive statistics and their usage with R, along with an overview of Big Data and its basic functionality. We will also get an overview of machine learning and its use in data mining and predictive analytics. Data visualization and the graphical representation of data will also be covered in this part.
b) Goal:
Here, we get an idea of how to use the R tool for Big Data and Big Data analytics. Basic applied statistical concepts will also be covered in this chapter.
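As a small illustration of the descriptive statistics mentioned above, here is a minimal sketch in base R; the built-in iris dataset is only a stand-in for real course data:

```r
# Minimal sketch: descriptive statistics in base R on the built-in iris dataset.
data(iris)

summary(iris)                                    # five-number summary plus mean, per column

mean(iris$Sepal.Length)                          # arithmetic mean
median(iris$Sepal.Length)                        # median
sd(iris$Sepal.Length)                            # standard deviation
quantile(iris$Sepal.Length, c(0.25, 0.5, 0.75))  # quartiles
```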
c) Objective:
● To get a clear idea of the objectives and scope of the analysis.
● Discuss with the appropriate people to identify suitable data
sources and to finalize the methodological approaches.
● To structure the data using standard tools and templates.
● To validate the data accurately and identify anomalies (see the sketch after this list).
● Learn from the appropriate group of people how to handle anomalies in the data.
● To carry out rule-based analysis of the data in line with the
analysis plan.
● After this, validate the result of the analysis according to the
statistical guidelines.
● Check these results with the appropriate group of people.
● Based on the inputs received from other people, change the
data accordingly for better results.
● Then draw justifiable inferences from your analysis.
● Present the results and inferences from your analysis using
standard templates and tools.
● Finally, comply with your organization’s policies, procedures
and guidelines when carrying out rule-based quantitative
analysis
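As referenced above, here is a small sketch of flagging anomalies in R; the 1.5 × IQR rule and the mtcars dataset are illustrative assumptions, not part of the NOS itself:

```r
# Sketch: flag anomalous values in a numeric variable with the 1.5 * IQR rule.
data(mtcars)
x <- mtcars$hp

q     <- quantile(x, c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Rows outside the whiskers are candidate anomalies to review with the team
anomalies <- mtcars[x < lower | x > upper, ]
print(anomalies)
```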
a) Overview:
The main objective of this session is to understand how important time management is in a corporate environment and how to manage work accordingly to meet the required deadlines. Everyone should follow a set of principles to manage their time and work in the business world.
b) Goal:
The requirements of the work unit are classified as follows: activities, deliverables, quantity, standards, and timelines, and the goal of this session is to manage time so as to deliver the goals within the specified deadlines. Also, one of the main motives of this session is to plan the work in advance so that students can effectively deal with failure points and minimize their impact.
c) Objective:
● The first and foremost objective of this session is to establish and agree the work requirements with your colleagues.
● Learn how to keep the work environment clean and tidy for better
results.
● We should utilize the time effectively and meet the stated deadlines.
● Candidates should use the resources correctly and effectively.
● The confidential information should be kept safe and it is the
responsibility of the candidate to treat the information correctly.
● Candidates should work according to the stated policies and
procedures of the organization.
● Employees should work according to their job roles.
● Candidates should seek guidance wherever required.
● Ensure your work meets the agreed requirements.
a) Overview:
This session covers a basic overview of how to manage working relationships with your colleagues and how important teamwork is in a work environment. This session also covers how to respect your colleagues, along with points on personal grooming for the workplace.
b) Goal:
The main goal of this module is to understand the importance of professional relationships at the workplace and how professionalism and teamwork inter-relate.
c) Objective:
● We should interact with our colleagues accurately, concisely and in a
clear manner.
● We should coordinate our work with that of our colleagues to achieve better results.
● Discuss all the required information for the desired project to avoid
confusion.
● Work with your colleagues respectfully.
● Stick to all the commitments and promises made to the colleagues
when working in a team project.
● Politely inform colleagues if there is any delay in completing the work you committed to, give genuine reasons, and seek their help.
● If you are facing any problems with your colleagues then discuss with
them and try to solve them.
● You should follow all the rules and regulations stated by the
organization about working with colleagues.
5. SSC/N 9003 - Maintain a Healthy, Safe and Secure Working Environment
a) Overview:
This session gives an idea of the safety rules and regulations to be followed by every individual at their workplace. They should follow the guidelines provided by the organization to prevent and handle any accidents or emergencies taking place at the organization.
b) Goal:
The main goal of this session is that the candidates be aware of the various hazards that they may come across at the workplace, and of the defined health, safety, and security measures to be followed when such unpredictable events occur. It also covers the practical application of health and safety procedures to deal with any kind of circumstance.
c) Objective:
● Employees should follow all the health policies and procedures.
● If an employee finds any breach of health and safety measures, then he/she should inform the responsible person.
● Identify and correct any hazards that you can deal with safely,
competently and within the limits of your authority.
● Employees should report any hazards that may affect other people to the concerned authorities.
● One should follow the organization’s procedures calmly and
effectively.
● If you have any suggestions that can improve the safety procedures of the organization, then suggest them to the respective authorities.
● Complete any health and safety records legibly and accurately.
a) Overview:
In this module, candidates will learn the standard operating procedures to report data in a logical sequence and arrive at conclusive decisions after analysis of the data. It covers how an individual should handle and report data in standard formats. Candidates will also learn how data should be shared within and outside a particular group without disclosing confidential information.
b) Goal:
The main goal of this module is to analyze the data and publish the report in the standardized format given by the organization. Candidates should also learn how to prepare the report with the specified objective in mind.
c) Objective:
● Discuss with your team what information is to be provided by you, in what form it should be provided, and when you should submit the data.
● Collect all the data from reliable and trusted sources.
● Check whether the data provided is complete and up to date.
● Keep an open discussion with your team and identify the problems in your data.
● Carry out a rule-based analysis of your data according to the requirements.
● Enter the data in the agreed and accepted format.
● Keep checking your work.
● Report any unresolved anomalies in the data/information to the appropriate people.
● Provide complete, accurate, and up-to-date data/information to the appropriate people in the required formats, on time.
a) Overview:
This module covers how to develop skills for the professional environment and how the right skills will help the candidate excel. It emphasizes how to enhance skills and knowledge in a diversified professional environment.
b) Goal:
This session will cover how skill enhancement will help the candidate to grow
in their professional and personal life. It gives knowledge on organizational
context, technical knowledge, core skills/generic skills, professional skills and
technical skills. We will learn how skill enhancement and growth are the two
main factors for improvement at the workplace.
c) Objective:
● We should learn from an appropriate group of people to develop our
knowledge, skill and competence.
● Keep track of the skills required for the job role.
● Candidates should identify accurately the current level of knowledge,
skills and competence and any learning and development needs.
● Schedule a plan of learning and development activities with an
appropriate group of people.
● Candidates should undertake learning and development activities in
line with their plan.
● We should then apply this knowledge and these skills at the workplace under the guidance of an expert.
● Ask for feedback from the appropriate group of people and act on it accordingly to get improved results.
● One should check their knowledge, skills and competence regularly
and keep improving wherever possible.
Most companies are collecting loads of data all the time—but, in its raw form, this data
doesn’t really mean anything. This is where data analytics comes in. Data analytics is the
process of analyzing raw data in order to draw out meaningful, actionable insights.
These insights are then used to inform and drive smart business decisions. So, a data analyst
will extract raw data, organize it, and then analyze it, transforming it from incomprehensible
numbers into coherent, intelligible information. Having interpreted the data, the data analyst
will then pass on their findings in the form of suggestions or recommendations about what
the company’s next steps should be.
You can think of data analytics as a form of business intelligence, used to solve specific
problems and challenges within an organization. It’s all about finding patterns in a dataset
which can tell you something useful and relevant about a particular area of the business—
how certain customer groups behave, for example, or how employees engage with a
particular tool. Data analytics helps you to make sense of the past and to predict future trends
and behaviors; rather than basing your decisions and strategies on guesswork, you’re making
informed choices based on what the data is telling you. Armed with the insights drawn from
the data, businesses and organizations are able to develop a much deeper understanding of
their audience, their industry, and their company as a whole—and, as a result, are much better
equipped to make decisions and plan ahead.
2. History
Data analytics is based on statistics. It has been surmised that statistics were used as far back as ancient Egypt for building the pyramids. Governments worldwide have used statistics based on
censuses, for a variety of planning activities, including taxation. After the data has been
collected, the goal of discovering useful information and insights begins. For example, an
analysis of population growth by county and city could determine the location of a new
hospital.
The development of computers and the evolution of computing technology has dramatically
enhanced the process of data analytics. In 1880, prior to computers, it took over seven years
for the U.S. Census Bureau to process the collected information and complete a final report.
In response, inventor Herman Hollerith produced the “tabulating machine,” which was used
in the 1890 census. The tabulating machine could systematically process data recorded on
punch cards. With this device, the 1890 census was finished in 18 months.
In the late 1980s, the amount of data being collected continued to grow significantly, in part
due to the lower costs of hard disk drives. During this time, the architecture of data
warehouses was developed to help in transforming data coming from operational systems into
decision-making support systems.
The term business intelligence (BI) was first used in 1865, and was later adapted by Howard
Dresner at Gartner in 1989, to describe making better business decisions through searching,
gathering, and analyzing the accumulated data saved by an organization. Using the term
“business intelligence” as a description of decision-making based on data technologies was
both novel and far-sighted. Large companies first embraced BI in the form of analyzing
customer data systematically, as a necessary step in making business decisions.
Data mining began in the 1990s and is the process of discovering patterns within large data
sets. Analyzing data in non-traditional ways provided results that were both surprising and
beneficial. The use of data mining came about directly from the evolution of database and data warehouse technologies.
In 2005, big data was given that name by Roger Magoulas. He was describing a large amount
of data, which seemed almost impossible to cope with using the Business Intelligence tools
available at the time. In the same year, Hadoop, which could process big data, was developed.
Hadoop’s foundation was based on Nutch, which was then merged with Google’s
MapReduce.
DEFINITION
Data analytics (DA) is the process of examining data sets in order to find trends and draw
conclusions about the information they contain. Increasingly, data analytics is done with the
aid of specialized systems and software. Data analytics technologies and techniques are
widely used in commercial industries to enable organizations to make more-informed
business decisions. Scientists and researchers also use analytics tools to verify or disprove
scientific models, theories and hypotheses.
Data analytics initiatives can help businesses increase revenue, improve operational
efficiency, optimize marketing campaigns and bolster customer service efforts. Analytics also
enable organizations to respond quickly to emerging market trends and gain a competitive
edge over business rivals. The ultimate goal of data analytics, however, is boosting business
performance. Depending on the particular application, the data that's analyzed can consist of
either historical records or new information that has been processed for real-time analytics. In
addition, it can come from a mix of internal systems and external data sources.
Data analytics is a broad field. There are four primary types of data analytics: descriptive,
diagnostic, predictive and prescriptive analytics. Each type has a different goal and a different
place in the data analysis process. These are also the primary data analytics applications in
business.
1. Descriptive analytics helps answer questions about what happened. These techniques
summarize large datasets to describe outcomes to stakeholders. By developing key performance indicators (KPIs), these strategies can help track successes or failures.
Metrics such as return on investment (ROI) are used in many industries. Specialized
metrics are developed to track performance in specific industries. This process
requires the collection of relevant data, processing of the data, data analysis and data
visualization. This process provides essential insight into past performance.
2. Diagnostic analytics helps answer questions about why things happened. These
techniques supplement more basic descriptive analytics. They take the findings from
descriptive analytics and dig deeper to find the cause. The performance indicators are
further investigated to discover why they got better or worse. This generally occurs in
three steps:
a. Identify anomalies in the data. These may be unexpected changes in a metric
or a particular market.
b. Data that is related to these anomalies is collected.
c. Statistical techniques are used to find relationships and trends that explain
these anomalies.
3. Predictive analytics helps answer questions about what will happen in the future.
These techniques use historical data to identify trends and determine if they are likely
to recur. Predictive analytical tools provide valuable insight into what may happen in
the future and its techniques include a variety of statistical and machine learning
techniques, such as: neural networks, decision trees, and regression.
4. Prescriptive analytics helps answer questions about what should be done. By using
insights from predictive analytics, data-driven decisions can be made. This allows
businesses to make informed decisions in the face of uncertainty. Prescriptive
analytics techniques rely on machine learning strategies that can find patterns in large
datasets. By analyzing past decisions and events, the likelihood of different outcomes
can be estimated.
These types of data analytics provide the insight that businesses need to make effective and
efficient decisions. Used in combination they provide a well-rounded understanding of a
company’s needs and opportunities.
Before getting into the nitty-gritty of data analysis, a business must first define why it
requires a well-founded process in the first place. The first step in a data analysis
process is determining why you need data analysis. This need typically stems from a
business problem or question, such as:
How can we reduce production costs without sacrificing quality?
What are some ways to increase sales opportunities with our current resources?
Do customers see our brand positively?
In addition to finding a purpose, consider which metrics to track along the way. Also,
be sure to identify sources of data when it’s time to collect.
This process can be long and arduous, so building a roadmap will greatly prepare your
data team for all the following steps.
Data analysts can use many data analysis techniques to extract meaningful information from raw data for real-world applications and computational purposes. Some of the notable data analysis techniques that aid a data analysis process are:
Exploratory data analysis
Exploratory data analysis is used to understand the messages within a dataset. This
technique involves many iterative processes to ensure that the cleaned data is further
sorted to better understand the useful meaning. Data visualization techniques such as
analyzing data in an Excel sheet or other graphical format and descriptive analysis
techniques such as calculating the mean or median are examples of exploratory data
analysis.
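To make this concrete, here is a minimal exploratory sketch in R; the ggplot2 package and the built-in mtcars dataset are assumptions standing in for real business data:

```r
# Sketch: simple exploratory data analysis with ggplot2.
library(ggplot2)
data(mtcars)

# Descriptive analysis: central tendency of one variable
mean(mtcars$mpg)
median(mtcars$mpg)

# Visual exploration: distribution of fuel efficiency
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(binwidth = 2, fill = "steelblue", colour = "white") +
  labs(title = "Distribution of mpg", x = "Miles per gallon", y = "Count")
```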
Algorithms have become an integral part of today's data environment and include
mathematical calculations for data analysis. Mathematical formulas or models such as
correlation or causation help identify the relationships between data variables.
Modeling techniques such as regression analysis analyze data by modeling the change
in one variable caused by another. For example, determining whether a change in
marketing (independent variable) explains a change in engagement (dependent
variable). Such techniques are part of inferential statistics, the process of analyzing
statistical data to draw conclusions about the relationship between different sets of
data.
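A small sketch of the correlation and regression idea described above; mtcars is a stand-in dataset, with the built-in wt and mpg variables playing the roles of the independent and dependent variables purely for illustration:

```r
# Sketch: correlation and simple linear regression in base R.
data(mtcars)

# Strength of the linear relationship between two variables
cor(mtcars$wt, mtcars$mpg)

# Model the change in mpg (dependent) explained by wt (independent)
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)  # coefficients, R-squared, p-values

# Predict the dependent variable for a new observation
predict(fit, newdata = data.frame(wt = 3.0))
```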
The final step is interpreting the results from the data analysis. This part is essential
because it’s how a business will gain actual value from the previous four steps.
Interpreting data analysis results should validate why you conducted it, even if it’s not
100 percent conclusive. For example, “options A and B can be explored and tested to
reduce production costs without sacrificing quality.”
Analysts and business users should look to collaborate during this process. Also,
when interpreting results, consider any challenges or limitations that may not have
been present in the data. This will only bolster your confidence in the next steps.
Data analytics is used in almost every field you can see around you. Be it online shopping, high-tech industries, or government, everyone uses data analytics to help with decision making, budgeting, planning, and so on. Data analytics is employed in various areas, such as:
1. Transportation
Logistics companies such as DHL and FedEx use data analytics to manage their overall operations. Using data analytics, they can figure out the best shipping routes and approximate delivery times, and can also track the real-time status of goods dispatched using GPS trackers. Data analytics has also made online shopping easier and more in demand.
Web search engines such as Yahoo, Bing, DuckDuckGo, and Google use data analytics to return a set of results when you search for something. Whenever you hit the search button, the search engines use data analytics algorithms to deliver the best search results within a limited time frame. The set of results that appears whenever we search for information is obtained through data analytics.
The searched term is treated as a keyword, and all the related pieces of information are presented in a sorted manner that one can easily understand. For example, when you search for a product on Amazon, it keeps showing up on your social media profiles, providing you with details of the product to convince you to buy it.
4. Manufacturing
Data analytics helps manufacturing industries manage their overall operations through tools such as predictive analysis, regression analysis, and budgeting. A unit can figure out the number of products that need to be manufactured from the data collected and analyzed from demand samples, and likewise improve many other operations, increasing operating capacity as well as profitability.
5. Security
Data analytics provides the utmost security to organizations. Security analytics is an approach to cybersecurity focused on the analysis of data to produce proactive security measures. No business can foresee the future, particularly where security threats are concerned, but by deploying security analytics tools that can analyze security events, it is possible to detect a threat before it gets a chance to affect your infrastructure and bottom line.
6. Education
Data analytics applications in education are among the most needed in the current scenario. They are mostly used in adaptive learning, new innovations, adaptive content, etc. Learning analytics is the measurement, collection, analysis, and reporting of data about students and their specific circumstances, for the purpose of understanding and optimizing learning and the conditions in which it happens.
7. Healthcare
Applications of data analytics in healthcare can be used to sift through enormous amounts of data in seconds to find treatment options or solutions for various illnesses. This will not only give precise solutions based on historical data but may also provide accurate answers to the unique concerns of specific patients.
8. Military
Military applications of data analytics bring together an assortment of technical and application-oriented use cases. They enable decision-makers and technologists to make connections between data analysis and fields such as augmented reality and cognitive science that are driving military organizations around the globe forward.
9. Insurance
There is a lot of data analysis taking place during the insurance process. Several kinds of data, such as actuarial data and claims data, help insurance companies assess the risk involved in insuring a person. Analytical software can be used to identify risky claims and bring them before the authorities for further investigation.
10. Digital Advertisement
Digital advertising has also been transformed as a result of the application of data
science. Data analytics and data algorithms are used in a wide range of advertising
mediums, including digital billboards in cities and banners on websites.
11. Fraud and Risk Detection
Detecting fraud may have been the first application of data analytics. Companies applied data analytics because they already had a large amount of customer data at their disposal. Data analysis was used to examine recent spending patterns and customer profiles to determine the likelihood of default. It eventually resulted in a reduction in fraud and risk.
12. Travel
Data analysis applications can be used to improve the traveler’s purchasing
experience by analyzing social media and mobile/weblog data. Companies can use
data on recent browse-to-buy conversion rates to create customized offers and
packages that take into account the preferences and desires of their customers.
13. Communication, Media, and Entertainment
When it comes to creating content for different target audiences, recommending
content, and measuring content performance, organizations in this industry analyze
customer data and behavioral data simultaneously. Data analytics is applied to collect
and utilize customer insights and understand their pattern of social-media usage.
14. Energy and Utility
Many firms involved in energy management use data analysis applications in areas such as smart-grid management, energy distribution, energy optimization, and building automation for other utility-based firms.
► The next step involves tidying up the data by enumerating the outcome variable and renaming badly encoded variables. Packages such as tidyverse and dplyr would be used in this step.
► Once our data frame is made intelligible, we build a deeper understanding of our data through visualization by plotting density curves, one-on-one scatter plots, and box plots for all the different attributes using the ggplot2 library.
► Next, we check for any missing or null values and, if present, we apply appropriate imputations to our data based on the pattern of missing entries, which also gives hints about the missingness mechanism. The VIM package of R would come in handy for this purpose.
► Since the ratio of attributes to instances is quite high in our case, we next aim at reducing a few of the attributes to avoid overfitting the data. We plot the correlation plot and deploy the caret package to remove highly correlated variables, which provide redundant data, based on a cutoff value of 0.9 (a consolidated code sketch follows this list).
► To further enhance visualization and preprocess our data we apply PCA as part of
EDA. This converts our original variables into a smaller number of “Principal
Components”. This is done by finding the straight line that best spreads the data out
when it is projected along it i.e. transforming a set of x correlated variables over y
samples to a set of p uncorrelated principal components over the same samples.
► For dimensionality reduction, we also apply LDA. The LDA algorithm tries to find
linear combinations of the predictor variables that can maximize the separation among
the outcome classes which would then be used for predicting the class of each and
every instance.
► Now that our preprocessing is done, we partition our final data frame into training and testing sets. We use 80% of the data for training and the remaining 20% for testing. We also apply a cross-validation technique to resample the data at least 15 times.
► We will apply different machine learning models and determine all the performance measures: the confusion matrix and statistics comprising accuracy, sensitivity, specificity, etc.
► All the models use ROC as a metric. The ROC metric measures the AUC (area under the ROC curve) of each model. This metric is independent of any threshold.
► Our first model performs logistic regression on the training dataset derived from the data frame with the highly correlated variables removed.
► Our second model uses random forest induction. Similarly, we use the data frame with the highly correlated variables removed, and we also produce some diagnostic plots here.
► Our third model uses KNN (the k-nearest neighbors algorithm) on the training dataset.
► Our fourth model uses SVM (Support Vector Machines) on the non-PCA training dataset; for better results, SVM is also applied to the PCA dataset.
► Our last and best model is a neural network with LDA; to use the LDA preprocessing step, we also create the corresponding training and testing sets.
► After training all the models, we perform model evaluation: the resampled performance distributions are summarized in terms of percentiles, then as box plots, and finally as dot plots.
► The model with the best results for sensitivity (detection of breast cancer cases) will be used in the application.
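A consolidated sketch of this pipeline using the caret package follows. It is an illustration under stated assumptions, not the exact project code: df is a hypothetical data frame whose diagnosis column is a two-level factor (levels B and M) and whose remaining columns are numeric predictors, and method = "rf" additionally requires the randomForest package to be installed.

```r
# Sketch: correlation filtering, 80/20 split, cross-validated training with
# ROC as the metric, and resampled model comparison, all with caret.
library(caret)
set.seed(42)

# Drop predictors with pairwise correlation above the 0.9 cutoff
predictors <- df[, setdiff(names(df), "diagnosis")]  # 'df' is assumed, see above
high_cor   <- findCorrelation(cor(predictors), cutoff = 0.9)
if (length(high_cor) > 0) predictors <- predictors[, -high_cor]
df2 <- cbind(diagnosis = df$diagnosis, predictors)

# 80% training / 20% testing partition
idx       <- createDataPartition(df2$diagnosis, p = 0.8, list = FALSE)
train_set <- df2[idx, ]
test_set  <- df2[-idx, ]

# Resample 15 times via cross-validation; twoClassSummary enables the ROC metric
ctrl <- trainControl(method = "cv", number = 15,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

fit_glm <- train(diagnosis ~ ., data = train_set, method = "glm",
                 metric = "ROC", trControl = ctrl,
                 preProcess = c("center", "scale"))
fit_rf  <- train(diagnosis ~ ., data = train_set, method = "rf",
                 metric = "ROC", trControl = ctrl)
fit_knn <- train(diagnosis ~ ., data = train_set, method = "knn",
                 metric = "ROC", trControl = ctrl,
                 preProcess = c("center", "scale"))

# Summarize the resampled ROC/sensitivity/specificity distributions
res <- resamples(list(GLM = fit_glm, RF = fit_rf, KNN = fit_knn))
summary(res)   # percentiles
bwplot(res)    # box plots
dotplot(res)   # dot plots

# Confusion matrix and statistics (accuracy, sensitivity, specificity) on the test set
pred <- predict(fit_rf, newdata = test_set)
confusionMatrix(pred, test_set$diagnosis)
```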
Result
Breast cancer is one of the most severe cancers, taking hundreds of thousands of lives every year. Early prediction of breast cancer plays an important role in successful treatment and in saving the lives of thousands of patients every year. However, conventional approaches are limited in providing such a capability. Recent breakthroughs in data analytics and data mining techniques have opened a new door for healthcare diagnostics and prediction. Machine learning methods for diagnosis can significantly increase processing speed and, at a big scale, can make diagnosis significantly cheaper.
This research was carried out to predict the accuracy of detecting cancer at an early stage, after comparing five different models. The best result for sensitivity (detection of breast cancer cases) was achieved by LDA_NNET, which also has a great F1 score.
Project implementation: -
The dataset comprises 8 variables: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
The RFM model is fundamentally built using principles of data-driven marketing.
Data-driven marketing has fundamentally transformed how marketing works ever
since its inception, as it allows the analysis of large sets of customer data like never
before.
There are three digits in each RFM score; in general, we rate the customers using points from 1 to 8 in each dimension. A higher score means better customer value, so 8 points is the best and 1 is the worst score.
We used the K-means clustering algorithm for clustering the data into various segments.
The following steps were involved in the successful implementation of the project:
Step 1: Read the data into a data frame
The data for this analysis has been taken from Kaggle. The data is of a retail store,
describing the past transactions and purchase history of the customers.
Step 2: Data cleaning and preprocessing
Looking at the summary statistics of the data frame, we can see two problems in the data: 1) the presence of null values, and 2) invalid data, i.e. negative values for quantities. We solve these problems by omitting the rows with null values and negative quantity values.
Step 3: Calculate Recency, Frequency and Monetary values for every customer
We now calculate the following values:
1. Recency: the difference between the analysis date and the most recent date on which the customer shopped in the store. The analysis date here has been taken as the maximum date available for the variable InvoiceDate.
2. Frequency : Number of transactions performed by every customer.
3. Monetary: Total money spent by every customer in the store.
Step 4: Calculate the RFM score
Recency, frequency, and monetary values have different ranges, so we first convert these quantities to scores based on their quartiles. For this, we start by looking at the summary of these values.
The RFM score is the total score of a customer’s engagement or loyalty, which can be used to categorize customers: 7–8 -> ‘Diamond’, 5–6 -> ‘Gold’, 3–4 -> ‘Silver’ and 1–2 -> ‘Bronze’. A consolidated sketch of these steps follows below.
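Here is a hedged sketch of Steps 1–4 plus the K-means segmentation, using dplyr; "retail.csv" is a placeholder for the Kaggle file, and the quartile scoring shown is one plausible scheme rather than the project's exact one:

```r
# Sketch: RFM computation and K-means segmentation with dplyr.
library(dplyr)

# Step 1: read the transactions into a data frame (file name is a placeholder)
retail <- read.csv("retail.csv", stringsAsFactors = FALSE)
retail$InvoiceDate <- as.Date(retail$InvoiceDate)

# Step 2: drop rows with null customer IDs or invalid (negative) quantities
retail <- retail %>%
  filter(!is.na(CustomerID), Quantity > 0, UnitPrice > 0)

# Step 3: Recency, Frequency, and Monetary value per customer
analysis_date <- max(retail$InvoiceDate)
rfm <- retail %>%
  group_by(CustomerID) %>%
  summarise(
    Recency   = as.numeric(analysis_date - max(InvoiceDate)),
    Frequency = n_distinct(InvoiceNo),
    Monetary  = sum(Quantity * UnitPrice)
  )

# Step 4: quartile-based scores; lower recency is better, hence the reversal
rfm <- rfm %>%
  mutate(
    R     = 5 - ntile(Recency, 4),
    F     = ntile(Frequency, 4),
    M     = ntile(Monetary, 4),
    Score = R + F + M
  )

# K-means on the scaled RFM values; the seed keeps the clusters reproducible
set.seed(123)
km <- kmeans(scale(rfm[, c("Recency", "Frequency", "Monetary")]),
             centers = 4, nstart = 25)
rfm$Segment <- km$cluster
table(rfm$Segment)
```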
● Today, businesses can go beyond the above questions with the help of the RFM model and get answers to highly specific questions such as: Who are my best customers? Which customers have the potential to buy more? Which customers have churned or lapsed?
● Low churn rates are the easiest way to maintain and grow a business, as they enable a reliance on customer satisfaction and the creation of positive word of mouth by customers. The RFM model helps businesses create unique customer journeys for different customer segments, creating value for customers and establishing loyalty and trust.
Results:
73% of the annual sales are produced by the top 27% of customers.
Out of the four segments present, a large chunk of customers (2,405) are in the Silver category.
Most customers have a Recency of less than 50 days, a Frequency of less than 5 transactions, and a Monetary value of less than $50,000. That is why the customers are not distributed evenly across the RFM cells.
[Figure: customer distribution per recency]
These data points are all represented with the triangle shape in the plot, and they fall in the top 80/20 category. At the other end of the continuum, we have the no-value customers in the bottom left-hand corner.
While RFM segmentation is powerful, it does have limits. When performed manually,
it’s prone to human error. RFM analysis is also based on just a few behavioral traits,
lacking the power of the advanced predictive analytics now available.
Using a limited number of selection variables is another issue: other variables could also influence and determine the value score of customers.
Lacks consistency: K-means clustering can give varying results on different runs of the algorithm. A random choice of initial cluster centers yields different clustering results, leading to inconsistency.
Sensitivity to scale: Changing or rescaling the dataset either through normalization or
standardization will completely change the final results.
k-means can only separate clusters that are more or less linearly separable. If your
clusters are based on distance to the origin, k-means won’t be able to identify them.
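The first two limitations are commonly mitigated by fixing the random seed, using several random starts, and standardizing the variables; a small sketch (the dataset and parameter values are illustrative):

```r
# Sketch: mitigating K-means inconsistency and scale sensitivity.
data(iris)
x <- iris[, 1:4]

set.seed(42)               # fix the RNG so runs are reproducible
km <- kmeans(scale(x),     # standardize so no variable dominates by scale
             centers = 3,
             nstart  = 25) # keep the best of 25 random initializations
table(km$cluster, iris$Species)
```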
REQUIREMENTS
Software Requirements:
1. An OS capable of running the R programming language
2. R from CRAN
3. IDE: RStudio
4. Packages such as ggplot2
Hardware Requirements:
1. A laptop or desktop with internet connectivity and at least 4 GB of RAM
CONCLUSION
Project-1
If properly executed, the model should deliver more satisfied customers, fewer confrontations with competitors, and better-designed marketing programmes.
Each of these RFM metrics has been shown to be effective in predicting future
customer behavior and increasing revenue. Customers who have made a purchase in
the recent past are more likely to do so in the near future. Those who interact with
your brand more frequently are more likely to do so again soon. And those who have
spent the most are more likely to be big spenders going forward.
It is evident from the results obtained that roughly 80 percent of the business comes from 20 percent of your consumers.
Project -2
Data visualization and machine learning techniques can provide significant benefits and impact cancer detection in the decision-making process. Using a comparative analysis of various algorithms, we can find a model with high accuracy, meaning it predicts a greater number of values correctly. This research has translational potential for women who have abnormal mammogram findings or who have been diagnosed with breast cancer.
Finding new ways to determine the stage of metastatic breast cancer would have a major clinical impact. Heat map, scatter plot, and box plot visualizations helped us understand the correlation between features and brought out unnecessary features that were not essential for making predictions.
BIBLIOGRAPHY
• Nikita Rane, Jean Sunny, Rucha Kanade, Prof. Sulochana Devi, “Breast Cancer Classification and Prediction using Machine Learning”, International Journal of Engineering Research & Technology (IJERT), http://www.ijert.org/, Vol. 9, Issue 02, February 2020.
• https://cran.r-project.org/web/packages/caret/index.html
• https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/
• https://rpubs.com/Aakansha_garg/aakansha_cancer
• https://canceratlas.cancer.org/the-burden/
• https://www.kaggle.com/lbronchal/breast-cancer-dataset-analysis
• https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)