
Instructions for making your report from this template

 Change the name and registration number (obvious)
 Change the name of your supervisor
 Change the certificate and percentiles
 Change the projects and conclusions
 Change the table of contents accordingly
 Change the bibliography and references
 Remove this page
INTERNSHIP REPORT
A report submitted in partial fulfillment of the
requirements for the Award of Degree of

BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE AND ENGINEERING
By

<NAME> (<Registration Number>)


Under Supervision of
Dr. Nilamadhab Mishra
(Duration: August 2020 to January 2021)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
VIT BHOPAL UNIVERSITY
2018 – 2022

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING


VIT BHOPAL UNIVERSITY

CERTIFICATE
This is to certify that the “Internship Report”
submitted by <NAME> (<Registration Number>) is the
work done by him and submitted during the 2021 – 2022
academic year, in partial fulfillment of the requirements for
the award of the degree of BACHELOR OF
TECHNOLOGY in COMPUTER SCIENCE AND
ENGINEERING, at VIT Bhopal University, Kothri-Kalan.

College Internship Guide – Dr. Nilamadhab Mishra, Senior Assistant Professor
Program Chair – Dr. Sandip Mal, Senior Assistant Professor
Dean – Dr. S. Poonkuntran, Professor and Dean, SCSE

CERTIFICATION
Certified in the Gold Tier with a score in the 95th percentile.
ACKNOWLEDGEMENT

First, I would like to thank Dr. Shriram R for giving me the
opportunity to do an internship with NASSCOM through VIT Bhopal
University.
I am really grateful to Dr. Nilamadhab Mishra for his specific guidance
and mentoring. He made the online classes interactive and fun, which helped
me learn a lot about the fascinating field of data analysis.

It is indeed with a great sense of pleasure and an immense sense of
gratitude that I acknowledge the help of these individuals.

I am highly indebted to Dean Dr. S. Poonkuntran for the facilities provided
to accomplish this internship.

I would like to thank my Program Chair Dr. Sandip Mal for his
constructive criticism throughout my internship.

I am extremely grateful to my department staff members and friends
who helped me in the successful completion of this internship.

<NAME>
(<Registration Number>)

ABSTRACT
The process of analyzing, cleaning, manipulating, and modelling data with
the objective of identifying usable information, informing conclusions, and
assisting decision-making is known as data analysis. Data analysis has
several dimensions and methodologies, including a wide range of techniques
under many titles and being applied in a number of business, scientific, and
social science sectors. Data analysis is important in today's business
environment since it helps firms make more scientific choices and run more
efficiently.
Data mining is a type of data analysis that focuses on statistical modelling
and knowledge discovery for predictive rather than just descriptive purposes,
whereas business intelligence is a type of data analysis that heavily depends
on aggregation and is primarily concerned with business data. Data analysis
may be separated into descriptive statistics, exploratory data analysis (EDA),
and confirmatory data analysis (CDA) in statistical applications. CDA
focuses on verifying or falsifying existing assumptions, whereas EDA
focuses on identifying new characteristics in the data. Text analytics uses
statistical, linguistic, and structural techniques to extract and classify
information from textual sources, a type of unstructured data. Predictive
analytics focuses on the application of statistical models for predictive
forecasting or classification. All of the aforementioned are examples of data
analysis.
As a result, implementing a measurement and data analysis strategy is a
recognized best practice within the software business for assisting
stakeholders in making choices. However, putting measuring tools and
analytics into practice in industry is difficult. The real-world issues that
emerge during the execution of a software measurement and analytics project
are discussed in this report. We also share what we've learned about
overcoming these obstacles and best practices for conducting practical,
successful data analysis in the workplace. The lessons learnt may be used by
researchers who want to work on data analytics with industry partners, as
well as industry practitioners who want to set up and reap the advantages of a
successful measurement program.

Methodology:

The knowledge sessions for this certification were divided over the course of two
semesters (5th and Interim). The course curriculum was designed by NASSCOM with
the help of its industry partners with the aim of making students future-ready
and more skilled. NAS1001 - Associative Data Analytics and NAS2001 - NASSCOM
Advance Data Analytics were introduced in our college curriculum, and my batch
was the first to make use of this wonderful opportunity.

Programs and opportunities:

The Institute combines pioneering research with top-class education. An innovative
curriculum allows the student flexibility in selecting courses and projects. Students,
even at the undergraduate level, get to participate in ongoing research and technology
development - an opportunity unprecedented in India. As a result, a vibrant
undergraduate programme co-exists with a strong postgraduate programme.

Organization Information

Sector Skills Council NASSCOM (SSC NASSCOM) is the national standard-setting body
for IT skills, set up under the aegis of the National Skill Development Corporation
and the Ministry of Skill Development & Entrepreneurship. SSC NASSCOM acts as a
conduit to support the voices of, and build synergies across, the different groups of
stakeholders it works with.
Benefits to the company / institution through your report:
The Institute combines pioneering research with top-class education. An innovative
curriculum allows the student flexibility in selecting courses and projects. Students,
even at the undergraduate level, get to participate in ongoing research and technology
development - an opportunity unprecedented in India.

Learning Objectives/Internship Objectives

 To establish clearly the objectives and scope of the predictive analysis


 Use the R programming language to identify suitable data sources and agree on
the methodological approach
 Validate and review data accurately and identify anomalies
 To appreciate the current trends in data analysis procedure
 Carry out rule-based analysis of the data in line with the analysis plan
 Apply statistical models to perform Regression Analysis, Clustering and
Classification
 Present the results and inferences from your analysis using R tool
 To improve document management and teamwork

Table of Contents
CERTIFICATE
ACKNOWLEDGEMENT
ABSTRACT
Methodology
Programs and opportunities
Organization Information
Benefits to the company / institution through your report
Learning Objectives / Internship Objectives
National Occupational Standards done in the Internship, Week by Week
1. Introduction to Data Analytics
2. History
3. DEFINITION
4. TYPES OF DATA ANALYSIS
5. STEPS IN DATA ANALYSIS
6. APPLICATIONS OF DATA ANALYTICS
Internship Project – NAS2001: Breast Cancer Prediction and Analysis
Internship Project – NAS1001: Customer Segmentation using RFM analysis in R
REQUIREMENTS
CONCLUSION
BIBLIOGRAPHY

National Occupational Standards done in the Internship, Week by Week

National Occupational Standards (NOS) are a set of conducts that an individual should
follow while carrying out a function in a workplace, together with the understanding and
knowledge needed to meet the standard consistently. Each NOS defines a key
function of the job role, and every employee should practice it at their workplace.
We have studied the following NOS in this data analytics course:

1. SSC/ N 0703 - Create documents for knowledge sharing

a) Overview:
Here, the students will learn the various documentation techniques used in the
corporate world. These include various types of documents, such as case studies,
best practices, project artifacts, reports, minutes, policies, procedures, work
instructions, etc. However, technical documents - the documents associated with
an application or product - are not covered here.

b) Goal:
The main goal of this session is that the students get hands-on
practice with MS Word and MS Visio, and are able to draft reports and documents
following the techniques used in the corporate world.

c) Objective:
● To agree on the document’s purpose, scope, format and target audience
with the right group of people.
● Discuss and work with the organization people to collect and verify the
information required for the documents.
● To access existing documents, language standards, templates and
documentation tools from the organization.
● Make an appropriate group of people and finalize the content and
structure of the document.
● Set a goal of creating documents that meet the standard template and
agreed language standards.
● Discuss the documents with the group and make changes if relevant
inputs are given.
● Submit the documents for approval.
● After approval of the documents, submit them with the decided
standards.
● Update the organization’s knowledge base with the documents.
● Meet the organization’s policy, procedure and guidelines when
creating the documents.

2. SSC/ N 2101 - Carry out rule-based statistical analysis

a) Overview:
In this session, students will learn how to use the R tool for Business
Analytics. Further, the students will get an idea of applied statistical
concepts like descriptive statistics and their usage in R, along with an
overview of Big Data and its basic functionality. We will also get an
overview of Machine Learning and its use in Data Mining and Predictive
Analytics. Data visualization and the graphical representation of data will also
be covered in this part.

b) Goal:
Here, we will get an idea of how to use the R tool for Big Data and Big
Data analytics. The basic applied statistical concepts will also be
covered in this chapter.

c) Objective:
● To get a clear idea of the objectives and scope of the analysis.
● Discuss with the appropriate people to identify suitable data
sources and to finalize the methodological approaches.
● To structure the data using standard tools and templates.
● To validate data accurately and identify anomalies.
● Learn from the appropriate group of people on how to handle
anomalies in the data.
● To carry out rule-based analysis of the data in line with the
analysis plan.
● After this, validate the result of the analysis according to the
statistical guidelines.
● Check these results with the appropriate group of people.
● Based on the inputs received from other people, change the
data accordingly for better results.
● Then draw justifiable inferences from your analysis.
● Present the results and inferences from your analysis using
standard templates and tools.
● Finally, comply with your organization’s policies, procedures
and guidelines when carrying out rule-based quantitative
analysis

3. SSC/ N 9001 - Manage your work to meet requirements

a) Overview:
The main objective of this session is to understand how time management is
important in corporate environment and how to manage the work accordingly
to meet the required deadlines. Everyone should follow a set of principles to
manage their time and work in the business world.

b) Goal:
The requirements of the work unit are classified as follows: activities,
deliverable, quantity, standards and timelines and the goal of this sessions to
manage the time to deliver the goals in the specified deadline. Also, one of the
main motive of this session is to plan the work in advance so that students can
effectively deal with failure points and minimize the impact.

c) Objective:
● The first and foremost objective of this session is to establish and agree
on the work requirements with your colleagues.
● Learn how to keep the work environment clean and tidy for better
results.
● We should utilize the time effectively and meet the stated deadlines.
● Candidates should use the resources correctly and effectively.
● The confidential information should be kept safe and it is the
responsibility of the candidate to treat the information correctly.
● Candidates should work according to the stated policies and
procedures of the organization.
● Employees should work according to their job roles.
● Candidates should seek guidance wherever required.
● Ensure your work meets the agreed requirements.

4. SSC/ N 9002 - Work Effectively with Colleagues.

a) Overview:
This session covers a basic overview of how to manage the working
relationship with your colleagues and how teamwork is very important in a
work environment. This session also covers how to respect your
colleagues and points on personal grooming for the workplace.

b) Goal:
The main goal of this module is to understand the importance of professional
relationships at the workplace and how to achieve the interrelationship of
professionalism and teamwork.

c) Objective:
● We should interact with our colleagues accurately, concisely and in a
clear manner.
● We should work with colleagues to manage our work with their work
and achieve better results.
● Discuss all the required information for the desired project to avoid
confusion.
● Work with your colleagues respectfully.
● Stick to all the commitments and promises made to the colleagues
when working in a team project.
● Inform the colleagues politely if there is any delay to complete the
committed work by you and give genuine reasons to seek out their
help.
● If you are facing any problems with your colleagues then discuss with
them and try to solve them.
● You should follow all the rules and regulations stated by the
organization about working with colleagues.
5. SSC/ N 9003 - Maintain a Healthy, Safe and Secure working Environment.

a) Overview:
This session gives an idea about the safety rules and regulations to be followed
by every individual at their workplace. They should follow the guidelines
provided by the organization to prevent and handle any accidents or
emergencies taking place at the organization.

b) Goal:
The main goal of this session is that the candidates should be aware about the
various hazards that they may come across at workplace and what are the
defined health, safety and security measures that should be followed at the
time of occurrence of such unpredictable events. It also covers the practical
application of health and safety procedures to deal with any kind of
circumstances.

c) Objective:
● Employees should follow all the health policies and procedures.
● If the employee finds any breach in the health and safety measures, then
he/she should inform the responsible person.
● Identify and correct any hazards that you can deal with safely,
competently and within the limits of your authority.
● Employees should report any hazards that may affect other
people and inform the concerned authorities.
● One should follow the organization’s procedures calmly and
effectively.
● If you have any suggestions that can improve the safety procedures of
the organization, then suggest them to the respective authorities.
● Complete any health and safety records legibly and accurately.

6. SSC/ N 9004 - Provide data/information in standard formats.

a) Overview:
In this module, candidates will learn the standard operating procedures for
reporting data in a logical sequence and arriving at conclusive decisions
after analysis of the data. It will cover how an individual should handle and
report data in standard formats. Candidates will also come to know
how data should be shared within and outside a particular group without
disclosing confidential information.

b) Goal:
The main goal of this module is to analyze the data and publish the report in
the standardized format given by the organization. They should also learn how
to make the report with the specified objective.

c) Objective:
● Discuss with your team what information is to be provided by
you, in what way it should be provided and when you should submit the
data.
● Collect all the data from a reliable and trusted source.
● You should check whether the data provided is complete and up to date.
● Keep an open discussion with your team and identify the problems in
your data.
● You should carry out a rule-based analysis of your data according to
the requirement.
● You should enter the data in the provided and accepted format.
● You should keep a check on your work.
● Report any unresolved anomalies in the data/information to the appropriate
people.
● You should provide complete, accurate and up-to-date
data/information to the appropriate people in the required formats on
time.

7. SSC/ N 9005 - Develop your knowledge, skills and competence.

a) Overview:
This module will cover the steps for developing skills for the professional
environment and how the right skills will help the candidate excel. It
emphasizes how to enhance skills and knowledge in a diversified professional
environment.

b) Goal:
This session will cover how skill enhancement will help the candidate to grow
in their professional and personal life. It gives knowledge on organizational
context, technical knowledge, core skills/generic skills, professional skills and
technical skills. We will learn how skill enhancement and growth are the two
main factors for improvement at the workplace.

c) Objective:
● We should learn from an appropriate group of people to develop our
knowledge, skills and competence.
● Keep track of the skills required for the job role.
● Candidates should identify accurately the current level of knowledge,
skills and competence and any learning and development needs.
● Schedule a plan of learning and development activities with an
appropriate group of people.
● Candidates should undertake learning and development activities in
line with their plan.
● We should then apply this knowledge and these skills at the workplace
under the guidance of an expert.
● Ask for feedback from the appropriate group of people and act on it
to get improved results.
● One should check their knowledge, skills and competence regularly
and keep improving wherever possible.

1. Introduction to Data Analytics

Most companies are collecting loads of data all the time—but, in its raw form, this data
doesn’t really mean anything. This is where data analytics comes in. Data analytics is the
process of analyzing raw data in order to draw out meaningful, actionable insights.
These insights are then used to inform and drive smart business decisions. So, a data analyst
will extract raw data, organize it, and then analyze it, transforming it from incomprehensible
numbers into coherent, intelligible information. Having interpreted the data, the data analyst
will then pass on their findings in the form of suggestions or recommendations about what
the company’s next steps should be.

You can think of data analytics as a form of business intelligence, used to solve specific
problems and challenges within an organization. It’s all about finding patterns in a dataset
which can tell you something useful and relevant about a particular area of the business—
how certain customer groups behave, for example, or how employees engage with a
particular tool. Data analytics helps you to make sense of the past and to predict future trends
and behaviors; rather than basing your decisions and strategies on guesswork, you’re making
informed choices based on what the data is telling you. Armed with the insights drawn from
the data, businesses and organizations are able to develop a much deeper understanding of
their audience, their industry, and their company as a whole—and, as a result, are much better
equipped to make decisions and plan ahead.

2. History

Data analytics is based on statistics. It has been surmised statistics were used as far back as
Ancient Egypt for building pyramids. Governments worldwide have used statistics based on
censuses, for a variety of planning activities, including taxation. After the data has been
collected, the goal of discovering useful information and insights begins. For example, an
analysis of population growth by county and city could determine the location of a new
hospital.

The development of computers and the evolution of computing technology has dramatically
enhanced the process of data analytics. In 1880, prior to computers, it took over seven years
for the U.S. Census Bureau to process the collected information and complete a final report.
In response, inventor Herman Hollerith produced the “tabulating machine,” which was used
in the 1890 census. The tabulating machine could systematically process data recorded on
punch cards. With this device, the 1890 census was finished in 18 months.

In the late 1980s, the amount of data being collected continued to grow significantly, in part
due to the lower costs of hard disk drives. During this time, the architecture of data
warehouses was developed to help in transforming data coming from operational systems into
decision-making support systems.

The term business intelligence (BI) was first used in 1865, and was later adapted by Howard
Dresner at Gartner in 1989, to describe making better business decisions through searching,
gathering, and analyzing the accumulated data saved by an organization. Using the term
“business intelligence” as a description of decision-making based on data technologies was
both novel and far-sighted. Large companies first embraced BI in the form of analyzing
customer data systematically, as a necessary step in making business decisions.

Data mining began in the 1990s and is the process of discovering patterns within large data
sets. Analyzing data in non-traditional ways provided results that were both surprising and
beneficial. The use of data mining came about directly from the evolution of database and
data warehouse technologies

In 2005, big data was given that name by Roger Magoulas. He was describing a large amount
of data, which seemed almost impossible to cope with using the Business Intelligence tools
available at the time. In the same year, Hadoop, which could process big data, was developed.
Hadoop’s foundation was based on Nutch, which was then merged with Google’s
MapReduce.
3. DEFINITION

Data analytics (DA) is the process of examining data sets in order to find trends and draw
conclusions about the information they contain. Increasingly, data analytics is done with the
aid of specialized systems and software. Data analytics technologies and techniques are
widely used in commercial industries to enable organizations to make more-informed
business decisions. Scientists and researchers also use analytics tools to verify or disprove
scientific models, theories and hypotheses.

As a term, data analytics predominantly refers to an assortment of applications, from basic
business intelligence (BI), reporting and online analytical processing (OLAP) to various
forms of advanced analytics. In that sense, it's similar in nature to business analytics, another
umbrella term for approaches to analyzing data. The difference is that the latter is oriented to
business uses, while data analytics has a broader focus. The expansive view of the term isn't
universal, though: In some cases, people use data analytics specifically to mean advanced
analytics, treating BI as a separate category.

Data analytics initiatives can help businesses increase revenue, improve operational
efficiency, optimize marketing campaigns and bolster customer service efforts. Analytics also
enable organizations to respond quickly to emerging market trends and gain a competitive
edge over business rivals. The ultimate goal of data analytics, however, is boosting business
performance. Depending on the particular application, the data that's analyzed can consist of
either historical records or new information that has been processed for real-time analytics. In
addition, it can come from a mix of internal systems and external data sources.

4. TYPES OF DATA ANALYSIS

Data analytics is a broad field. There are four primary types of data analytics: descriptive,
diagnostic, predictive and prescriptive analytics. Each type has a different goal and a different
place in the data analysis process. These are also the primary data analytics applications in
business.

1. Descriptive analytics helps answer questions about what happened. These techniques
summarize large datasets to describe outcomes to stakeholders. By developing key
performance indicators (KPIs), these strategies can help track successes or failures.
Metrics such as return on investment (ROI) are used in many industries, and specialized
metrics are developed to track performance in specific industries. This process
requires the collection of relevant data, processing of the data, data analysis and data
visualization, and it provides essential insight into past performance (a small R
sketch follows this list).
2. Diagnostic analytics helps answer questions about why things happened. These
techniques supplement more basic descriptive analytics. They take the findings from
descriptive analytics and dig deeper to find the cause. The performance indicators are
further investigated to discover why they got better or worse. This generally occurs in
three steps:
a. Identify anomalies in the data. These may be unexpected changes in a metric
or a particular market.
b. Data that is related to these anomalies is collected.
c. Statistical techniques are used to find relationships and trends that explain
these anomalies.
3. Predictive analytics helps answer questions about what will happen in the future.
These techniques use historical data to identify trends and determine if they are likely
to recur. Predictive analytical tools provide valuable insight into what may happen in
the future, and their techniques include a variety of statistical and machine learning
methods, such as neural networks, decision trees, and regression.
4. Prescriptive analytics helps answer questions about what should be done. By using
insights from predictive analytics, data-driven decisions can be made. This allows
businesses to make informed decisions in the face of uncertainty. Prescriptive
analytics techniques rely on machine learning strategies that can find patterns in large
datasets. By analyzing past decisions and events, the likelihood of different outcomes
can be estimated.
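As a small illustration of descriptive analytics, the snippet below computes the ROI metric mentioned in item 1 of the list above; the campaigns data frame and all of its figures are invented for this sketch.

# Descriptive-analytics sketch: computing a KPI (ROI) from made-up campaign figures.
campaigns <- data.frame(
  name = c("email", "search", "social"),
  cost = c(1000, 5000, 2000),   # spend (hypothetical)
  gain = c(1800, 6500, 1900)    # attributed revenue (hypothetical)
)
campaigns$roi <- (campaigns$gain - campaigns$cost) / campaigns$cost
campaigns[order(-campaigns$roi), ]   # rank campaigns by ROI, best first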

These types of data analytics provide the insight that businesses need to make effective and
efficient decisions. Used in combination they provide a well-rounded understanding of a
company’s needs and opportunities.

5. STEPS IN DATA ANALYSIS

Step 1: Define why you need data analysis

Before getting into the nitty-gritty of data analysis, a business must first define why it
requires a well-founded process in the first place. The first step in a data analysis
process is determining why you need data analysis. This need typically stems from a
business problem or question, such as:
How can we reduce production costs without sacrificing quality?
What are some ways to increase sales opportunities with our current resources?
Do customers see our brand positively?
In addition to finding a purpose, consider which metrics to track along the way. Also,
be sure to identify sources of data when it’s time to collect.
This process can be long and arduous, so building a roadmap will greatly prepare your
data team for all the following steps.

Step 2: Collect data


After a purpose has been defined, it’s time to begin collecting the data needed for
analysis. This step is important because the nature of the collected data sources
determines how in-depth the analysis is.
Data collection starts with primary sources, also known as internal sources. This is
typically structured data gathered from CRM software, ERP systems, marketing
automation tools, and others. These sources contain information about customers,
finances, gaps in sales, and more.
Then come secondary sources, also known as external sources. This is both
structured and unstructured data that can be gathered from many places.
For example, if you’re looking to perform a sentiment analysis toward your brand,
you could gather data from review sites or social media APIs.
While it’s not required to gather data from secondary sources, it could add another
element to your data analysis. This is becoming more common in the age of big data.

Step 3: Clean unnecessary data


Once data is collected from all the necessary sources, your data team will be tasked
with cleaning and sorting through it. Data cleaning is extremely important during the
data analysis process, simply because not all data is good data.
Data scientists must identify and purge duplicate data, anomalous data, and other
inconsistencies that could skew the analysis to generate accurate results.
With advances in data science and machine learning platforms, more intelligent
automation can save a data analyst’s valuable time while cleaning data.
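A minimal sketch of this cleaning step in R; the sales data frame and its columns are hypothetical stand-ins, not data from this report.

# Data-cleaning sketch: duplicates, missing values, and anomalous rows.
library(dplyr)

sales <- data.frame(
  order_id = c(1, 2, 2, 3, 4),
  amount   = c(100, 250, 250, NA, -50)   # invented values
)

clean <- sales %>%
  distinct() %>%               # purge exact duplicate rows
  filter(!is.na(amount)) %>%   # remove rows with missing values
  filter(amount >= 0)          # drop anomalous negative amounts

summary(clean)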

Step 4: Perform data analysis


One of the last steps in the data analysis process is analyzing and manipulating the
data. This can be done in a variety of ways.
One way is through data mining, which is defined as “knowledge discovery within
databases”. Data mining techniques like clustering analysis, anomaly detection,
association rule mining, and others could unveil hidden patterns in data that weren’t
previously visible.
There’s also business intelligence and data visualization software, both of which are
optimized for decision-makers and business users. These options generate easy-to-
understand reports, dashboards, scorecards, and charts.
Data scientists may also apply predictive analytics, which makes up one of the four
data analytics used today (descriptive, diagnostic, predictive, prescriptive). Predictive
analysis looks ahead to the future, attempting to forecast what will likely happen next
with a business problem or question.
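As one concrete example of the data mining techniques named above, the following sketch mines association rules; the arules package and its bundled Groceries transactions are our illustrative choice, not tools named in this report.

# Association rule mining sketch with the arules package.
library(arules)
data(Groceries)   # grocery transactions bundled with arules

# Mine rules with at least 1% support and 50% confidence
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 3))   # the three strongest rules by lift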

What are the types of data analysis methods?


► Data analysis methods can be broadly classified into the following categories:
► Quantitative data analysis
► Qualitative data analysis
► Statistical analysis
► Textual analysis
► Descriptive analysis
► Predictive analysis
► Prescriptive analysis
► Diagnostic analysis

Examples of data analysis techniques

Data analysts can use many data analysis techniques to extract meaningful
information from raw data for real-world applications and computational purposes.
Some of the notable data analysis techniques that aid a data analysis process are:
Exploratory data analysis
Exploratory data analysis is used to understand the messages within a dataset. This
technique involves many iterative passes over the data so that the cleaned data can be
further sorted and its useful meaning better understood. Data visualization techniques,
such as analyzing data in an Excel sheet or another graphical format, and descriptive
analysis techniques, such as calculating the mean or median, are examples of
exploratory data analysis.
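For instance, a first exploratory pass in R might look like the following; R's built-in iris data is used here only as a stand-in dataset.

# Quick exploratory pass over one numeric variable (iris is a built-in stand-in).
data(iris)
summary(iris$Sepal.Length)   # five-number summary plus the mean
median(iris$Sepal.Length)    # a descriptive statistic mentioned above
hist(iris$Sepal.Length,      # base-graphics view of the distribution
     main = "Distribution of Sepal Length", xlab = "Sepal length (cm)")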

Using algorithms and models

Algorithms have become an integral part of today's data environment and include
mathematical calculations for data analysis. Mathematical formulas or models such as
correlation or causation help identify the relationships between data variables.
Modeling techniques such as regression analysis analyze data by modeling the change
in one variable caused by another. For example, determining whether a change in
marketing (independent variable) explains a change in engagement (dependent
variable). Such techniques are part of inferential statistics, the process of analyzing
statistical data to draw conclusions about the relationship between different sets of
data.
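A tiny simulated version of that marketing-engagement example; all numbers are invented, and lm() is the standard R tool for such a regression.

# Does marketing spend explain engagement? (simulated, hypothetical data)
set.seed(42)
marketing  <- runif(100, 0, 50)                         # independent variable
engagement <- 3 + 0.8 * marketing + rnorm(100, sd = 4)  # simulated dependent variable

cor(marketing, engagement)          # strength of the linear relationship
fit <- lm(engagement ~ marketing)   # simple linear regression model
summary(fit)                        # slope estimate, R-squared, p-value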

Step 5: Interpret the results

The final step is interpreting the results from the data analysis. This part is essential
because it’s how a business will gain actual value from the previous four steps.
Interpreting data analysis results should validate why you conducted it, even if it’s not
100 percent conclusive. For example, “options A and B can be explored and tested to
reduce production costs without sacrificing quality.”
Analysts and business users should look to collaborate during this process. Also,
when interpreting results, consider any challenges or limitations that may not have
been present in the data. This will only bolster your confidence in the next steps.

6. APPLICATIONS OF DATA ANALYTICS

The use of data analytics is not confined to one or two fields; it can be seen
everywhere around us. Be it online shopping, high-tech industries, or the government,
everyone uses data analytics to help with decision-making, budgeting, planning, etc.
Data analytics is employed in various areas, such as:

1. Transportation

Data analytics can be applied to help improve transportation systems and the
intelligence around them. The predictive method of analysis helps find transport
problems like traffic or network congestion. It helps synchronize the vast amount of
data and use it to build and design plans and strategies to plan alternative routes and
reduce congestion and traffic, which in turn reduces the number of accidents and
mishaps. Data analytics can also help optimize the buyer's experience in travel by
recording information from social media. It also helps travel companies fix their
packages and boost the personalized travel experience as per the data collected.
For example, during the wedding season or the holiday season, transport
facilities are prepared to accommodate the heavy number of passengers traveling from
one place to another, using prediction tools and techniques.

2. Logistics and Delivery

There are various logistics companies, like DHL and FedEx, that use data analytics
to manage their overall operations. Using the applications of data analytics, they can
figure out the best shipping routes and approximate delivery times, and can also track
the real-time status of goods dispatched using GPS trackers. Data analytics has
made online shopping easier and more popular.

Example of Use of data analytics in Logistics and Delivery:


When a shipment is dispatched from its origin until it reaches its buyer, every position
is tracked, which minimizes the loss of goods.

3. Web Search or Internet Web Results

Web search engines like Yahoo, Bing, DuckDuckGo, and Google use data analytics
to serve results when you search. Whenever you hit the search button, the search
engine uses data analytics algorithms to deliver the best results within a
limited time frame. The set of results that appears whenever we search for any
information is obtained through data analytics.
The searched query is treated as a keyword, and all the related pieces of information
are presented in a sorted manner that one can easily understand. For example, when
you search for a product on Amazon, it keeps showing up on your social media profiles
with details of the product to persuade you to buy it.

4. Manufacturing

Data analytics helps the manufacturing industries maintain their overall operations
through tools like predictive analysis, regression analysis, budgeting, etc. A
unit can figure out the number of products that need to be manufactured according to
the data collected and analyzed from demand samples, and likewise in many other
operations, increasing operating capacity as well as profitability.
5. Security
Data analytics provides the utmost security to an organization. Security analytics is
an approach to cybersecurity focused on the analysis of data to deliver proactive
security measures. No business can foresee the future, particularly where security
threats are concerned, but by deploying security analytics tools that can analyze
security events, it is possible to detect a threat before it gets an opportunity to
affect your systems and bottom line.

6. Education
Data analytics applications in education are among the most needed in the
current scenario. They are mostly used in adaptive learning, new innovations, adaptive
content, etc. Learning analytics is the measurement, collection, analysis, and reporting
of data about students and their specific circumstances, for the purposes of
understanding and optimizing learning and the conditions in which it happens.

7. Healthcare
Applications of data analytics in healthcare can be used to filter enormous
amounts of data in seconds to find treatment options or answers for various
illnesses. This will not only give precise solutions based on historical data
but may also give accurate answers to the unique concerns of specific patients.

8. Military
Military applications of data analytics bring together an assortment of technical and
application-oriented use cases. They enable decision-makers and technologists to make
connections between data analysis and such fields as augmented reality and cognitive
science that are driving military organizations around the globe forward.
9. Insurance
There is a lot of data analysis taking place during the insurance process. Several data
sources, such as actuarial data and claims data, help insurance companies understand
the risk involved in insuring a person. Analytical software can be used to identify risky
claims and bring them before the authorities for further investigation.
10. Digital Advertisement
Digital advertising has also been transformed as a result of the application of data
science. Data analytics and data algorithms are used in a wide range of advertising
mediums, including digital billboards in cities and banners on websites.
11. Fraud and Risk Detection
Detecting fraud may have been the first application of data analytics. Financial
institutions applied data analytics because they already had a large amount of
customer data at their disposal. Data analysis was used to examine recent spending
patterns and customer profiles to determine the likelihood of default. It eventually
resulted in a reduction in fraud and risk.
12. Travel
Data analysis applications can be used to improve the traveler’s purchasing
experience by analyzing social media and mobile/weblog data. Companies can use
data on recent browse-to-buy conversion rates to create customized offers and
packages that take into account the preferences and desires of their customers.
13. Communication, Media, and Entertainment
When it comes to creating content for different target audiences, recommending
content, and measuring content performance, organizations in this industry analyze
customer data and behavioral data simultaneously. Data analytics is applied to collect
and utilize customer insights and understand their pattern of social-media usage.
14. Energy and Utility
Many firms involved in energy management use data analysis applications in areas
such as smart-grid management, energy distribution, energy optimization, and
building automation for other utility-based firms.

Applications Of Data Analytics In The Business World


The use of data analytics in business is not confined to internal operations. Business
analysts conduct market analyses, examining both product lines and the overall
profitability of the business. Furthermore, they create and monitor data quality
metrics and ensure that business data and reporting needs are met.

Internship Project – NAS2001


Breast Cancer Prediction and Analysis

The following steps were involved in the successful implementation of the project:

► As the first step, we aim to collect the data from the UCI repository (URL:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29),
consisting of 569 instances and 32 attributes, which will be read from a CSV file
and converted to a data frame for further use.

► The next step involves tidying up the data by enumerating the outcome variable
and renaming badly encoded variables. Packages such as tidyverse and dplyr are
used in this step.

► Once our data frame is made explicable, we build a deeper understanding of our data
through visualization by plotting density curves and one-on-one scatter plots and box
plots for all the different attributes using the ggplot2 library.

► Next, we check for any missing or null values, and if present, we apply
appropriate imputations to our data based on the pattern of missing entries, which
also gives hints about the missingness mechanism. The VIM package in R comes in
handy for this purpose.

► Since the ratio of attributes to instances is quite high in our case, we next aim at
reducing a few of the attributes to avoid overfitting the data. We plot the
correlation matrix and deploy the caret package to remove highly correlated variables,
which provide redundant information, based on a cutoff value of 0.9.

► To further enhance visualization and preprocess our data we apply PCA as part of
EDA. This converts our original variables into a smaller number of “Principal
Components”. This is done by finding the straight line that best spreads the data out
when it is projected along it i.e. transforming a set of x correlated variables over y
samples to a set of p uncorrelated principal components over the same samples.

► For dimensionality reduction, we also apply LDA. The LDA algorithm tries to find
linear combinations of the predictor variables that can maximize the separation among
the outcome classes which would then be used for predicting the class of each and
every instance.

► Now that our preprocessing is done, we partition our final data frame into
training and testing sets. We use 80% of the data for training and the remaining
20% for testing. We also apply a cross-validation technique to resample the data at
least 15 times.

► We then apply different machine learning models and determine all the
performance measures: the confusion matrix and statistics comprising accuracy,
sensitivity, specificity, etc.

► All the models use ROC as a metric. The ROC metric measures the AUC of the ROC
curve of each model. This metric is independent of any threshold.

► Our first model applies logistic regression to the training dataset from which the
highly correlated variables were removed.

► Our second model uses random forest and induction. Similarly, we use the data
frame from which the highly correlated variables were removed, and we also
produce some diagnostic plots here.

► Our third model uses KNN (k-nearest neighbors’ algorithm) on the training dataset.

► Our fourth model uses SVM (Support Vector Machines) on the non-PCA
training dataset; SVM gives better results when it is applied to the PCA dataset.

► Our last and best model is a neural network with LDA; to use the LDA
pre-processing step, we create the same training and testing split on the
LDA-transformed data.

► After training all the models, we perform model evaluation: the resampled
performance distributions are summarized in terms of percentiles, then as box
plots, and finally as dot plots (a condensed R sketch of this pipeline follows
this list).

► The model with the best results for sensitivity (detection of breast cancer cases)
will be used in the application.
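The condensed R sketch below illustrates the pipeline described in the list above. It is not the project's actual code: the local file name wdbc.csv, the column name diagnosis, and the choice of logistic regression and random forest (two of the five models) are assumptions made for illustration.

# Condensed sketch of the pipeline above; file and column names are assumed.
library(caret)   # model training/evaluation; method = "rf" also needs randomForest

df <- read.csv("wdbc.csv")   # hypothetical local copy of the UCI dataset
df$diagnosis <- factor(df$diagnosis, levels = c("B", "M"))

# Drop highly correlated predictors at the 0.9 cutoff used in the report
preds <- setdiff(names(df), "diagnosis")
drop  <- findCorrelation(cor(df[, preds]), cutoff = 0.9)
if (length(drop) > 0) preds <- preds[-drop]
df2 <- df[, c("diagnosis", preds)]

# 80/20 stratified split and 15-fold cross-validation with ROC as the metric
set.seed(1)
idx       <- createDataPartition(df2$diagnosis, p = 0.8, list = FALSE)
train_set <- df2[idx, ]
test_set  <- df2[-idx, ]
ctrl <- trainControl(method = "cv", number = 15,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# PCA as part of EDA/preprocessing (variance explained per component)
pca <- prcomp(train_set[, preds], center = TRUE, scale. = TRUE)
summary(pca)

# Two of the five models, as examples: logistic regression and random forest
fit_glm <- train(diagnosis ~ ., data = train_set, method = "glm",
                 metric = "ROC", trControl = ctrl)
fit_rf  <- train(diagnosis ~ ., data = train_set, method = "rf",
                 metric = "ROC", trControl = ctrl)

# Summarize resampled ROC/sensitivity/specificity as box plots and dot plots
res <- resamples(list(GLM = fit_glm, RF = fit_rf))
summary(res)
bwplot(res)
dotplot(res)

# Hold-out evaluation: confusion matrix with accuracy, sensitivity, specificity
confusionMatrix(predict(fit_rf, test_set), test_set$diagnosis)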

Skills learned during project implementation: -


1. In-depth knowledge of the dataset selection process, and how to consider different
attributes and variables while selecting a dataset.
2. Working knowledge of data pre-processing techniques and how they help
transform a raw dataset into a more useful and understandable format.
3. Learning different classifier techniques such as Support Vector Machine (SVM),
Naive Bayes, Logistic Regression, Linear Regression, Random Forest etc.
4. Implementation of the Principal Component Analysis (PCA) algorithm on a given
dataset, which is a linear transformation that fits the dataset to a new coordinate
system in such a way that the most significant variance is found on the first coordinate.
5. To execute Linear Discriminant Analysis (LDA) algorithm which is a machine
learning technique and classification method for predicting categories.

Significance of the project


Data visualization and machine learning techniques can provide significant benefits
and impact cancer detection in the decision-making process. Using comparative
analysis of various algorithms, we can find a model with high accuracy, meaning it
predicts a greater number of cases correctly and produces fewer false negatives. This
research has translational potential for women who have abnormal mammogram
findings or who have been diagnosed with breast cancer.
Finding new ways to determine the stage of metastatic breast cancer would have a
major clinical impact. Heat map, scatter plot and box plot visualizations helped us
understand the correlation between features and brought out unnecessary features
that were not essential for making predictions.

Result
Breast cancer is one of the most severe cancers. It has taken hundreds of thousands of
lives every year. Early prediction of breast cancer plays an important role in successful
treatment and in saving the lives of thousands of patients every year. However,
conventional approaches are limited in providing such capability. Recent breakthroughs
in data analytics and data mining techniques have opened a new door for healthcare
diagnostics and prediction. Machine learning methods for diagnosis can significantly
increase processing speed and, at a large scale, can make diagnosis significantly
cheaper.

This research was carried out to predict cancer accurately at an early stage, after
comparing five different models. The best results for sensitivity (detection of
breast cancer cases) came from LDA_NNET, which also has a great F1 score.

Challenges during the project


1. Availability of a dataset that has all the necessary variables and data to carry out
the research and prediction.
2. Splitting the dataset in such a way that it overcomes the problems of underfitting
and overfitting. In both cases, the prediction percentage could be way less than
expected.
3. Implementing the LDA and PCA algorithms on the dataset, as they required
standardization of the data beforehand, which could limit the relationships between
variables.
4. Comparing different algorithms and their performance measures, such as sensitivity,
accuracy, specificity, etc., to determine which algorithm gives the best prediction results.

Internship Project – NAS1001


Customer Segmentation using RFM analysis in R

Project implementation: -
 The dataset comprises 8 variables: InvoiceNo, StockCode, Description,
Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
 The RFM model is fundamentally built using principles of data-driven marketing.
Data-driven marketing has fundamentally transformed how marketing works ever
since its inception, as it allows the analysis of large sets of customer data like never
before.
 Each RFM score consists of three digits; in general, we rate customers
using points from 1 to 8 in each dimension. A higher score means better customer
value, so 8 points is the best score and 1 is the worst.
 We used the K-means clustering algorithm to cluster the data into various segments.

The following steps were involved in the successful implementation of the project:
Step 1: Read the data into a data frame
The data for this analysis has been taken from Kaggle. The data is of a retail store,
describing the past transactions and purchase history of the customers.
Step 2: Data cleaning and preprocessing
Looking at the summary statistics of the data frame, we can see two problems in the
data: (1) the presence of null values, and (2) invalid data, i.e., negative values for
quantities. We solve these problems by omitting the rows with null values and
negative quantity values.
Step 3: Calculate Recency, Frequency and Monetary values for every customer
We now calculate the following values:
1. Recency: the difference between the analysis date and the most recent date on
which the customer shopped in the store. The analysis date here has been taken as the
maximum date available for the variable InvoiceDate.
2. Frequency: the number of transactions performed by every customer.
3. Monetary: the total money spent by every customer in the store.
Step 4: Calculate the RFM score
Recency, frequency and monetary values have different ranges. We first convert these
quantities to scores based on their quartiles. For this, we start by looking at the summary
of these values. The RFM score is the total score of a customer’s engagement or loyalty,
which can be used to categorize customers: 7–8 -> ‘Diamond’, 5–6 -> ‘Gold’, 3–4 ->
‘Silver’ and 1–2 -> ‘Bronze’. A dplyr sketch of Steps 3 and 4 follows.
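This is a minimal dplyr sketch of Steps 3 and 4, assuming the cleaned transactions sit in a data frame named retail whose columns match the variable list above and whose InvoiceDate has been parsed as a Date; the quartile scoring shown is one common convention, and the report's exact 1-8 banding may differ.

# Sketch of Steps 3-4 (`retail` is the assumed name of the cleaned data frame).
library(dplyr)

analysis_date <- max(retail$InvoiceDate)   # latest invoice date, as in Step 3

rfm <- retail %>%
  group_by(CustomerID) %>%
  summarise(
    Recency   = as.numeric(analysis_date - max(InvoiceDate)),  # days since last purchase
    Frequency = n_distinct(InvoiceNo),                         # number of transactions
    Monetary  = sum(Quantity * UnitPrice)                      # total money spent
  ) %>%
  mutate(
    R = ntile(-Recency, 4),   # quartile scores: more recent purchases score higher
    F = ntile(Frequency, 4),
    M = ntile(Monetary, 4),
    RFM_score = R + F + M     # 3..12 under this convention; report banding differs
  )
head(rfm)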

Skills learned during project implementation: -


● Learned to apply an unsupervised machine learning algorithm called K-means to
cluster the data into various segments (see the clustering sketch after this list).
● Visualize Recency, Frequency & Monetary Value of Customer Segments.
● Analyzing cluster output through the silhouette coefficient (a measure of how similar
an object is to its own cluster compared to other clusters), the Hubert index (a graphical
method of determining the number of clusters) and the D index.
● Elbow method to analyze the percentage of variance and find the optimal number of
clusters.
● Understanding the concept of Pareto’s principle which is the base of RFM analysis
through graph plots.
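A short sketch of the clustering and diagnostics mentioned above; it reuses the rfm data frame from the previous sketch, and the package choices (base kmeans, cluster::silhouette) are ours rather than ones confirmed by the project.

# Clustering sketch (reuses `rfm` from the previous sketch).
library(cluster)   # provides silhouette()

rfm_mat <- scale(rfm[, c("Recency", "Frequency", "Monetary")])   # standardize first

# Elbow method: total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(rfm_mat, centers = k, nstart = 25)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")

set.seed(7)
km <- kmeans(rfm_mat, centers = 4, nstart = 25)   # four segments, as in the report
rfm$segment <- km$cluster

# Average silhouette width: how well each point sits in its assigned cluster
sil <- silhouette(km$cluster, dist(rfm_mat))
mean(sil[, "sil_width"])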

Significance of the project


● Principles of segmentation, targeting, and positioning have been used for ages in the field
of marketing. However, with the advent of data analytics and the creation of
number-driven models like RFM, the scope of these principles has widened tremendously.

● The objective of RFM Analysis is to segment customers according to their purchase


history, and turn them into loyal customers by recommending products of their choice.

● Today, businesses can go beyond the above questions with the help of the RFM model
and get answers to highly specific questions such as: Who are my best customers?
Which customers have the potential to buy more? Which customers have churned or
lapsed?

● Low churn rates are the easiest way to maintain and grow a business, as they enable a
reliance on customer satisfaction and the creation of positive word of mouth by
customers. The RFM model helps businesses create unique customer journeys for
different customer segments, creating value for customers and establishing loyalty and
trust.

Results :

 73% of the annual sales are produced by the top 27% of customers
 Out of the four segments present, a large chunk of customers (2,405) are in the
silver category.
 Most of the customers have a Recency of < 50 days, a Frequency of less than 5
times, and a Monetary value of less than $50,000. That is why the customers are
not distributed evenly across the RFM cells.

[Figure: Customer distribution per recency]

These data points are all represented with triangle shapes in the plot, and they fall in
the top 80/20 category. At the other end of the continuum, we have the no-value
customers in the bottom left-hand corner.

Challenges during the project

 While RFM segmentation is powerful, it does have limits. When performed manually,
it’s prone to human error. RFM analysis is also based on just a few behavioral traits,
lacking the power of the advanced predictive analytics now available.
 Using a limited number of selection variables is another issue, which means that some
other variables are possibly able to influence and determine the value score of
customers.
 Lacks consistency: K-means clustering can give varying results on different runs of
the algorithm. A random choice of initial cluster centers yields different clustering
results, leading to inconsistency.
 Sensitivity to scale: Changing or rescaling the dataset either through normalization or
standardization will completely change the final results.
 k-means can only separate clusters that are more or less linearly separable. If your
clusters are based on distance to the origin, k-means won’t be able to identify them.

REQUIREMENTS

Software Requirements:
1. An OS capable of running the R programming language
2. R from CRAN
3. IDE – RStudio
4. Packages such as ggplot2

Hardware Requirements:
1. A Laptop or Desktop with Internet Connectivity and at least 4 GB of RAM
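A one-time R setup matching these requirements might look like the following; the package list combines those named in this report with the ones used in the sketches above.

# One-time setup: install the packages used in this report's projects and sketches.
install.packages(c("tidyverse", "dplyr", "ggplot2", "caret", "VIM",
                   "randomForest", "cluster", "arules"))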

CONCLUSION
Project-1

 The underlying principle of the RFM (Recency, Frequency, Monetary) technique is
that the product and service needs of individual customers differ. Segmentation
involves grouping customers together with the aim of better satisfying their
needs whilst maintaining economies of scale.

 If properly executed, the model should deliver more satisfied customers, fewer
confrontations with competitors, and better-designed marketing programmes.

 Each of these RFM metrics has been shown to be effective in predicting future
customer behavior and increasing revenue. Customers who have made a purchase in
the recent past are more likely to do so in the near future. Those who interact with
your brand more frequently are more likely to do so again soon. And those who have
spent the most are more likely to be big spenders going forward.

 It’s evident from the results obtained that 80 percent of the business comes from 20
percent of your consumers.

Project -2
 Data visualization and machine learning techniques can provide significant benefits
and impact cancer detection in the decision-making process. Using comparative
analysis of various algorithms, we can find a model with high accuracy, meaning it
predicts a greater number of cases correctly and produces fewer false negatives. This
research has translational potential for women who have abnormal mammogram
findings or who have been diagnosed with breast cancer.

 Finding new ways to determine the stage of metastatic breast cancer would have a
major clinical impact. Heat map, scatter plot and box plot visualizations helped us
understand the correlation between features and brought out unnecessary features
that were not essential for making predictions.

 By applying several data mining and machine learning techniques, classification
will be done as to whether the tumor mass is benign or malignant. This will help
in understanding the underlying importance of the attributes, thereby helping in
predicting the stage of breast cancer depending on the values of these attributes.
Performance measures of all the algorithms will be taken into account in order to do a
comparative analysis.

BIBLIOGRAPHY

• Madhu Kumari, Vijendra Singh, Breast Cancer Prediction System, International
Conference on Computational Intelligence and Data Science (ICCIDS 2018),
published in Procedia Computer Science, Volume 132, 2018.

• Vivek Kumar, Brojo Kishore Mishra, Manuel Mazzara, Dang N. H. Thanh,
Abhishek Verma, Prediction of Malignant & Benign Breast Cancer: A Data Mining
Approach in Healthcare Applications.

• Nikita Rane, Jean Sunny, Rucha Kanade, Prof. Sulochana Devi, Breast Cancer
Classification and Prediction using Machine Learning, International Journal of
Engineering Research & Technology (IJERT), published at https://www.ijert.org/,
Vol. 9, Issue 02, February 2020.

• https://cran.r-project.org/web/packages/caret/index.html
• https://machinelearningmastery.com/compare-models-and-select-the-best-using-the-caret-r-package/
• https://rpubs.com/Aakansha_garg/aakansha_cancer
• https://canceratlas.cancer.org/the-burden/
• https://www.kaggle.com/lbronchal/breast-cancer-dataset-analysis
• https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
