Major Project Report
A PROJECT REPORT
Submitted in partial fulfillment of the
requirement for the award of the degree
of
Bachelor of Technology (B.Tech.)
in
Information Technology
by
Varun Goel
159102156
IT Department
MANIPAL UNIVERSITY JAIPUR
JAIPUR – 303007
RAJASTHAN, INDIA
May – 2019
DEPARTMENT OF INFORMATION TECHNOLOGY, MANIPAL
UNIVERSITY JAIPUR, JAIPUR – 303007(RAJASTHAN), INDIA
CERTIFICATE
This is to certify that the project titled Implementation of Latest Technologies like RPA,
ML, IOT on Real Time Data is a record of the bonafide work done by Varun Goel
(159102156) submitted in partial fulfilment of the requirements for the award of the Degree of
Bachelor of Technology (B.Tech) in Information Technology of Manipal University
Jaipur, during the academic year 2019-2020.
ACKNOWLEDGEMENT
I express my sincere gratitude to the Director (SCIT), Dr. Shekhawat, and the HOD of IT, Dr. Pankaj Vyas, for their
administrative support of our academic growth.
I express sincere gratitude to my Project Lead, Mr. Nirmal Aagash Chandramohan, and to Mr. Sudhakar Gopalrao, Mr.
Bhanu Pratap Sikarwar and Deepika Av, Engineer, Mindtree Ltd., whose ideas, encouragement, appreciation and
intellectual zeal motivated us to complete this internship successfully.
I record it as my privilege to deeply thank the Mindtree Kalinga Manager for C1 Engineers, Mr. Amarendra
Gopalakrishna, for his constant motivation to turn our ideas into reality.
I express my sincere thanks to my Department Guide, Mr. Rahul Saxena, for his constant guidance towards the
successful completion of our academic semester.
Finally, I am pleased to acknowledge my indebtedness to all those who contributed, directly or indirectly, to
making this project report a success.
ABSTRACT
As a part of the Enterprise Re-Imagination Business (ERB) team in Mindtree, we were trained and worked on Machine
Learning and the Internet of Things using real-time data provided by the customers of the company. We also
worked on RPA and Chat-Bot development.
The ERB team focuses on IOT, Machine Learning, Artificial Intelligence and Chat-Bot development, and most of the
work is based on Python and advanced Python, including libraries such as scikit-learn, SciPy, Pandas, Matplotlib
and NumPy.
As the work was on these advanced technologies, there was no specific use case designed to implement as a project
during the three-month training; instead, a proper course structure was designed and we were trained on it.
In RPA, the tools we worked on were Automation Anywhere and UIPath. In IOT, we were provided with
small real-time sensors such as IR sensors, temperature sensors, pressure sensors and proximity sensors,
and we transmitted data from these sensors through a gateway to the cloud, for which we used Microsoft Azure
IOT Hub. For Machine Learning we used Jupyter Notebook.
Mindtree Limited is an Indian multinational information technology and outsourcing company
headquartered in Bengaluru, India and New Jersey, USA.
Founded in 1999, the company has approximately 20,204 employees and an annual revenue of about $1
billion. The company deals in e-commerce, mobile applications, cloud computing, digital transformation,
data analytics, enterprise application integration and enterprise resource planning, with more than 339
active clients and 43 offices in over 17 countries as of 31 July 2018.
Its largest operations are in India, and its major markets are the United States and Europe, where it works with
thousands of clients on technologies such as DotNet, Salesforce, Machine Learning, the Internet of Things
etc.
The company works in Application Development and Maintenance, Data Analytics, Digital Services,
Enterprise Application Integration and Business Process Management, Engineering R&D, Enterprise
Application Services, Testing, and Infrastructure Management Services.
Mindtree provides various research and development services including Bluetooth solutions, digital
video surveillance, an integrated test methodology called MindTest, an IT infrastructure management
and service platform called MWatch, the application management service Atlas, SAP Insurance and
OmniChannel.
Mindtree's business is structured around clients in verticals such as Banking, Capital Markets, Consumer
Devices & Electronics, Consumer Packaged Goods, Independent Software Vendors, Manufacturing,
Insurance, Media & Entertainment, Retail, Semiconductors, and Travel and Hospitality.
LIST OF TABLES
LIST OF FIGURES
Contents
Acknowledgement
Abstract
List of Tables
List of Figures
Introduction
Chapter 1 ROBOTIC PROCESS AUTOMATION
1.1 Introduction
1.2 Automation Anywhere
1.3 UIPath
1.4 Automation Anywhere vs UIPath
Chapter 2 INTERNET OF THINGS
2.1 Introduction
2.2 Architecture
2.3 Use Case: U-Sit
Chapter 3 CHATBOT
3.1 Introduction
3.2 Dialogflow/Api.ai
3.3 Use Case: Trippy
Chapter 4 MACHINE LEARNING
4.1 Introduction
4.2 Use Case 1: Wisconsin Breast Cancer Dataset
4.3 Use Case 2: Pima Diabetes Dataset
Chapter 5 ADVANCED PYTHON
5.1 Introduction
REFERENCES
INTRODUCTION
Mindtree works for thousands of clients based in different parts of the world, serving them on
various projects and offering solutions on technologies like Java, DotNet, Salesforce, AppDev etc.
In the Basecamp phase, we were trained in basic language-specific programming skills, and the team leads
assessed us through competitive coding challenges and logic-building programming questions
on HackerEarth. Each capability starts in the 'Learn' phase and is later moved to the
'Manage' or 'Thrive' phase based on performance on the given problems during discussions with
the team leader.
Later, after the successful completion of the three-week Basecamp phase, tracks were allocated to students based
on their working mindsets. The following are the tracks that everyone is split into:
Full Stack: DotNet, Java, Software Testing, Android and iOS application development, WebTech (React
and Angular), Big Data, SAP.
After the Engineering Camp phase, the FSE tracks are allocated projects and enter the Project Camp
phase, while the non-FSE tracks continue their Engineering Camp till the end.
ENTERPRISE RE-IMAGINATION BUSINESS
1. ROBOTIC PROCESS AUTOMATION
1.1 Introduction
RPA tools have strong technical similarities to graphical user interface testing tools. These tools also
automate interactions with the GUI, often by repeating a set of demonstration actions performed
by a user. RPA tools differ from such systems in that they include features that allow data to be handled in and
between multiple applications, for instance, receiving an email containing an invoice, extracting the data, and
then typing it into a bookkeeping system.
The hosting of RPA services also aligns with the metaphor of a software robot, with each robotic instance
having its own virtual workstation, much like a human worker. The robot uses keyboard and mouse
controls to take actions and execute automations. Normally all of these actions take place in a virtual
environment and not on screen; the robot does not need a physical screen to operate, rather it interprets
the screen display electronically. The scalability of modern solutions based on architectures such as these
owes much to the advent of virtualization technology, without which the scalability of large deployments
would be limited by available capacity to manage physical hardware and by the associated costs. The
implementation of RPA in business enterprises has shown dramatic cost savings when compared to
traditional non-RPA solutions.
There are, however, several risks with RPA. Criticisms include the risk of stifling innovation and of creating a
more complex maintenance environment for existing software, which now needs to consider graphical
user interfaces being used in a way they weren't intended to be.
According to Harvard Business Review, most operations groups adopting RPA have promised their
employees that automation would not result in layoffs. Instead, workers have been redeployed to do more
interesting work. One academic study highlighted that knowledge workers did not feel threatened by
automation: they embraced it and viewed the robots as team-mates. The same study highlighted that,
rather than resulting in a lower "headcount", the technology was deployed in such a way as to achieve
more work and greater productivity with the same number of people.
Conversely however, some analysts proffer that RPA represents a threat to the business process
outsourcing (BPO) industry. The thesis behind this notion is that RPA will enable enterprises to
"repatriate" processes from offshore locations into local data centers, with the benefit of this new
technology. The effect, if true, will be to create high value jobs for skilled process designers in onshore
locations (and within the associated supply chain of IT hardware, data center management, etc.) but to
decrease the available opportunity to low skilled workers offshore. On the other hand, this discussion
appears to be healthy ground for debate as another academic study was at pains to counter the so-called
"myth" that RPA will bring back many jobs from offshore.
RPA is the automation of repetitive, software-based tasks using third-party tools such as UIPath,
Automation Anywhere (AA) and BluePrism, together with Artificial Intelligence (AI). It is an emerging form of
business process automation technology. I have been trained on UIPath and Automation Anywhere.
UIPath and AA are both capable automation tools, but the basic difference between them is that
Automation Anywhere can deploy a bot through a cloud-based server, known as the Control Room,
with a Bot Creator and a Bot Runner for creating and running individual bots; a single
bot can thus be deployed on multiple machines. UIPath bots, on the other hand, are deployed on the same
machine where the software is installed.
Some advanced modules, such as Optical Character Recognition (OCR) and queue management, do not work
well on AA but work perfectly fine on UIPath.
1.2 Automation Anywhere
1.2.1 Introduction
It is one of the popular RPA vendors, offering powerful and user-friendly RPA capabilities to automate
complex tasks. It is a "revolutionary" technology that changes the way enterprises operate, combining
conventional RPA with cognitive elements like natural language understanding and the reading of
unstructured data.
Automation Anywhere allows organizations to automate processes that are otherwise performed manually by
humans. It is a web-based management system that uses a Control Room to run automated tasks, and it can
automate end-to-end business operations for companies.
1.2.2 Architecture
The Control Room is a web-based platform that controls Automation Anywhere; in other words, it is the
server that controls the Automation Anywhere bots.
Apart from that, the Control Room handles:
User management.
Source control: the code for the bots is managed by the Control Room, so it becomes easy to share the
code across different systems.
Dashboard: gives complete analytics for the Automation Anywhere bots, such as how many bots are
running and which bots have passed or failed.
License management: the purchased licenses for Automation Anywhere are configured in the
Control Room.
The Bot Creator is a desktop-based application used by developers to create bots. Developer licenses
are checked against those configured in the Control Room, and on authentication the code of the bots they
create is stored in the Control Room. Different developers may create individual tasks/bots, and these bots
can be merged and executed together.
The Bot Runner is the machine where the bots are run. Multiple bots can run in parallel, and only a Run
License is needed to run them. The bots report their execution logs and pass/fail status back to
the Control Room.
Bot Insights is the tool that shows statistics and displays graphs to analyse the performance of every bot
in the system. Here you can also calculate the time saved because of the automation process.
Bot Farm is integrated with Automation Anywhere Enterprise. It allows you to create multiple bots, and
these bots can also be given out on a rental basis.
Bot Store is the first digital workforce marketplace, where you will find many pre-built bots for every type
of business automation.
1.2.3 Advantages
No programming knowledge is required: you can record your actions, or point and click through the action
wizards.
Eliminates human error.
Increases transaction speed and saves time and cost.
Quick time to value, and non-intrusive.
Helps you automate data transfers and import or export data between files or applications.
Scales from the desktop to the data center.
The following proof-of-concept (POC) tasks were built in Automation Anywhere:
Email Automation – Log in to a Gmail account by passing the credentials, compose a mail by
entering the recipient's email ID, subject and body, and then send the email. After this, log out of the
account. This was done using Object Cloning.
Data Extraction from Website – In this task a CSV file with city names is given. Open a
website, extract the temperatures of the different cities mentioned in the CSV file using OCR, and then
save them into a separate CSV.
Data Extraction from PDF – A PDF of a hospital bill was given. We had to extract the Name,
Address, Bill Number, Admitted On, Discharged On, Doctor Name and Total Bill Amount fields from the
PDF using the PDF Integration tool of Automation Anywhere and store them in a CSV file.
Fig-1.3 PDF Integration
Autofill a Registration Form – A dummy registration link and a CSV file containing 10 records with
details of people were given. We wrote a script to autofill the registration form using the following
methodology (a rough code sketch follows the steps):
1. First open the browser and then open the link.
2. Read values from the CSV file one by one and enter them in the registration form.
3. Submit the registration form for every input record.
4. Update the input CSV file, marking all the completed rows with the status 'Registered'.
5. Create an error log file if any exception is raised.
6. After completing all the registrations, close the browser.
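A rough Python/Selenium equivalent of this flow is sketched below. The form URL, field names and CSV columns are hypothetical placeholders; the actual POC was built with Automation Anywhere's web recorder rather than code.

import csv
from selenium import webdriver
from selenium.webdriver.common.by import By

# Illustrative sketch only: the URL, field names and CSV layout are assumed.
FORM_URL = "https://example.com/register"

driver = webdriver.Chrome()                        # 1. open the browser
with open("registrations.csv", newline="") as f:   # 2. read the input records
    rows = list(csv.DictReader(f))

errors = []
for row in rows:
    row["Status"] = "Error"
    try:
        driver.get(FORM_URL)
        driver.find_element(By.NAME, "name").send_keys(row["Name"])
        driver.find_element(By.NAME, "email").send_keys(row["Email"])
        driver.find_element(By.NAME, "phone").send_keys(row["Phone"])
        driver.find_element(By.NAME, "submit").click()   # 3. submit the form
        row["Status"] = "Registered"                     # 4. mark completed rows
    except Exception as exc:                             # 5. log any exception
        errors.append({"record": row.get("Name", ""), "error": str(exc)})

if rows:                                                 # write the statuses back
    with open("registrations.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)

if errors:                                               # error log file
    with open("error_log.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["record", "error"])
        writer.writeheader()
        writer.writerows(errors)

driver.quit()                                            # 6. close the browser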
Fig-1.4b Registration Form POC
1.3 UIPath
1.3.1 Introduction
More and more companies are going digital day by day. With digitization, the biggest advantage is the
speed of execution, but the challenge is that it requires diverse tools, and hence manpower with a diverse
skill set to handle those tools; a workforce with such a varied skill set is scarce. To resolve this, the IT
industry has been looking for a reliable, fast, intelligent and robust solution, and this demand is fulfilled
by UIPath.
The UIPath Studio software solution allows repetitive office tasks to be automated. The company was founded by
Romanian entrepreneur Daniel Dines in 2005. It converts boring tasks into automated processes and can work
with multiple tools.
1.3.2 Architecture
UIPath Studio allows us to plan automation processes visually with the help of diagrams, where
each diagram represents a certain type of work to perform. Scripts in UIPath Studio take the form of a
flowchart or a sequence.
UIPath Robot is used to execute the processes once they have been designed in the Studio. Robots pick up
those steps and run them without human direction in any environment; they can also run when a human
triggers the process.
UIPath Orchestrator is a web-based application that helps you deploy, schedule, monitor and manage
robots and processes. It is a centralized platform from which all the robots are managed.
1.3.3 Advantages
Application compatibility: offers a wide range of applications to work with, including web
and desktop applications.
Centralized repository: helps users handle all the robots simultaneously.
Advanced screen-scraping solution: works with any application, such as DotNet,
Java, Flash, PDF, legacy and SAP, with absolute accuracy.
Reliable tool for modelling business processes: UIPath Studio offers automation excellence with
the help of modelled business processes.
The following POC tasks were built using UIPath:
Outlook Automation – The task was to send a mail to a colleague with the subject line 'UIPath
Training' and attach a file while sending the mail. This task was performed using the
'Outlook Automation' tool.
Fig-1.6 Outlook Automation POC
Flipkart Data – The task was to open Flipkart and search for Samsung mobiles, then extract the Mobile
Name, Price, Rating and Vendor details across multiple pages. The adopted methodology was as
follows (an illustrative sketch follows the steps):
1. First, open the browser and open the Flipkart website. This can be done using the OpenBrowser tool
in UIPath.
2. Create an input modifier using the 'Input' tool and use the 'Type Into' tool to search for
Samsung mobiles. Add a delay using the 'Wait' tool so that enough time is given for the page to
load.
3. Using the 'DataTable' tool, extract the required fields and store the values in separate variables.
4. Finally, using the 'WriteCSV' tool, add the entire DataTable to the CSV file.
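The same scrape-and-save flow can be sketched in Python. Flipkart's real pages are rendered dynamically and the CSS selectors below are hypothetical placeholders, so treat this purely as an illustration of the steps rather than a working Flipkart scraper.

import csv
import requests
from bs4 import BeautifulSoup

# Assumed search-URL pattern; in the POC the navigation was done by UIPath tools.
SEARCH_URL = "https://www.flipkart.com/search?q=samsung+mobiles&page={}"

def text_of(card, selector):
    """Return the text of a child node, or an empty string if it is missing."""
    node = card.select_one(selector)
    return node.get_text(strip=True) if node else ""

records = []
for page in range(1, 4):                              # extract data across multiple pages
    html = requests.get(SEARCH_URL.format(page), timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for card in soup.select("div.product-card"):      # hypothetical product-card selector
        records.append({
            "Name":   text_of(card, ".product-title"),
            "Price":  text_of(card, ".product-price"),
            "Rating": text_of(card, ".product-rating"),
            "Vendor": text_of(card, ".product-vendor"),
        })

with open("samsung_mobiles.csv", "w", newline="") as f:   # write the whole table to CSV
    writer = csv.DictWriter(f, fieldnames=["Name", "Price", "Rating", "Vendor"])
    writer.writeheader()
    writer.writerows(records)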
Fig-1.7 Flipkart Data POC
1.4 Automation Anywhere vs UIPath
Table-1.1 Automation Anywhere vs UIPath
Feature | Automation Anywhere | UIPath
Robots | Front Office and Back Office | Front Office and Back Office
2. INTERNET OF THINGS
2.1 Introduction
IOT is the network of physical objects such as devices, vehicles, buildings and other items – embedded
with electronics, software, sensors, and network connectivity that enables these objects to collect and
exchange data. The definition of the Internet of things has evolved due to convergence of multiple
technologies, real-time analytics, machine learning, commodity sensors, and embedded systems.
Traditional fields of embedded systems, wireless sensor networks, control systems, automation (including
home and building automation), and others all contribute to enabling the Internet of things.
It consists of devices (including sensors) which are connected to a gateway using a protocol. All the
telemetry and other data is sent to the gateway in JSON/XML format and stored in NoSQL/SQL
databases using Spark and HDFS.
Then this data can be used for:
1. Web/Mobile app integrations
2. Real Time/Batch data analysis
3. Storing
All these functionalities are performed on any cloud-based service such as Azure IOT, Intel IOT, AWS,
Gladeus etc.
In this phase we were trained on various hardware components, specifically on Raspberry Pi and Nano-
Pi boards, along with some sensors and gateway programming. Later, data was filtered at the gateway, the
filtered sensor data was sent to the Azure cloud through the gateway, and the clustered data was represented
as tables using PowerBI on Azure.
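A minimal sketch of the gateway-side filtering step is shown below, using the paho-mqtt client (1.x-style API). The broker address, topic names and the acceptable temperature range are assumptions made for illustration.

import json
import paho.mqtt.client as mqtt

LOCAL_BROKER = "localhost"          # broker the sensors publish to (assumed)
RAW_TOPIC = "sensors/raw"           # assumed topic names
FILTERED_TOPIC = "sensors/filtered"
TEMP_RANGE = (-20.0, 60.0)          # readings outside this range are treated as noise

def on_message(client, userdata, msg):
    reading = json.loads(msg.payload)          # telemetry arrives as JSON
    temp = reading.get("temperature")
    if temp is not None and TEMP_RANGE[0] <= temp <= TEMP_RANGE[1]:
        # forward only clean readings towards the cloud-facing topic
        client.publish(FILTERED_TOPIC, json.dumps(reading))

client = mqtt.Client()
client.on_message = on_message
client.connect(LOCAL_BROKER, 1883)
client.subscribe(RAW_TOPIC)
client.loop_forever()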
2.2 IOT Architecture
2.3 Use Case: U-SIT
U-Sit is a solution for all the above-mentioned scenarios: it detects when a person sits down on or gets up
from a seat and updates the seat's status in real time.
2.3.2 Motivation
2.3.3 Architecture Diagram
2.3.4 Technologies Used
A load cell is a transducer that is used to create an electrical signal whose magnitude is directly proportional
to the force being measured. The various load cell types include hydraulic, pneumatic, and strain gauge.
Strain Gauge Load Cells are the most common in industry. These load cells are particularly stiff, have very
good resonance values, and tend to have long life cycles in application. Strain gauge load cells work on the
principle that the strain gauge (a planar resistor) deforms when the material of the load cells deforms
appropriately. Deformation of the strain gauge changes its electrical resistance, by an amount that is
proportional to the strain. The change in resistance of the strain gauge provides an electrical value change that
is calibrated to the load placed on the load cell.
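Because the output is (approximately) linear in the applied load, a load cell can be calibrated with a simple two-point fit. A small illustrative calculation is shown below; the raw ADC counts are made-up numbers.

RAW_AT_ZERO = 84000     # ADC reading with the seat empty (offset), made-up value
RAW_AT_KNOWN = 151000   # ADC reading with a known 50 kg reference load, made-up value
KNOWN_LOAD_KG = 50.0

scale = KNOWN_LOAD_KG / (RAW_AT_KNOWN - RAW_AT_ZERO)   # kilograms per ADC count

def to_kg(raw_count):
    """Convert a raw load-cell reading to kilograms using the two-point fit."""
    return (raw_count - RAW_AT_ZERO) * scale

# A reading of 139000 counts maps to roughly 41 kg, which U-Sit would
# interpret as "seat occupied".
print(round(to_kg(139000), 1))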
We chose MQTT because it is highly compatible with Azure IOT Hub (Cloud Platform) and XMC4800
(Gateway).
The XMC4800 is an industrial-grade microcontroller which can connect to at most 196 sensors. It offers an EtherCAT
node on an ARM Cortex-M controller with on-chip flash and analog/mixed-signal capabilities. That is why
we use an Ethernet-to-Wi-Fi module, which enables us to send data from the sensors to the IOT Hub. This
module can connect up to 5 XMC4800 microcontrollers, so the setup can be used for large-scale
applications.
Azure IoT Hub lets you focus on developing IoT functionality without worrying about how it all gets
connected up and managed.
The Internet of Things (IoT) offers businesses immediate, real-world opportunities to reduce costs, increase
revenue and transform their businesses. Azure IoT Hub is a managed IoT service which is hosted in
the cloud. It allows bi-directional communication between IoT applications and the devices it manages. This
cloud-to-device connectivity means that you can receive data from your devices, but you can also send
commands and policies back to the devices. How Azure IoT hub differs from the existing solutions is that it
also provides the infrastructure to authenticate, connect and manage the devices connected to it.
Azure IoT Hub allows full-featured and scalable IoT solutions. Virtually, any device can be connected to
Azure IoT Hub and it can scale up to millions of devices. Events can be tracked and monitored, such as the
creation, failure, and connection of devices.
IoT Hub also provides device libraries for the most commonly used platforms and languages, for easy device
connectivity, and secure communication with multiple options for device-to-cloud and cloud-to-device
hyper-scale messaging.
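As a concrete example of device-to-cloud messaging, a minimal sketch using the azure-iot-device Python SDK is shown below. The connection string and the seat-occupancy payload fields are placeholders.

import json
import time
from azure.iot.device import IoTHubDeviceClient, Message

# Placeholder connection string; the real one comes from the IoT Hub device registry.
CONNECTION_STRING = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"

client = IoTHubDeviceClient.create_from_connection_string(CONNECTION_STRING)
client.connect()

for reading in ({"seatId": "A-12", "occupied": True},
                {"seatId": "A-12", "occupied": False}):
    msg = Message(json.dumps(reading))        # telemetry is sent as JSON
    msg.content_type = "application/json"
    msg.content_encoding = "utf-8"
    client.send_message(msg)                  # device-to-cloud message to IoT Hub
    time.sleep(5)

client.shutdown()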
After sending the data to Azure IOT Hub, we can use it for various purposes:
1. Analytics can be done on the received data from sensors to find out which areas are more occupied
most of the time so that more logistics can be provided there.
2. Analytics on this data will help users to find more popular restaurants and preferred conference rooms.
3. All the analytics dashboards can be displayed on a Mobile or Web Applications on user’s devices.
4. Last-minute cancellations of train or movie tickets won't result in a loss for railway services and
theatres, because someone else might book the seat later.
3. CHATBOT
3.1 Introduction
A chatbot is a computer program that conducts a conversation through text or speech. Such programs are
often designed to convincingly simulate how a human would behave as a
conversational partner, thereby passing the Turing test. Chatbots are typically used in dialog systems for
various practical purposes including customer service or information acquisition. Some chatbots use
sophisticated natural language processing systems, but many simpler ones scan for keywords within the
input, then pull a reply with the most matching keywords, or the most similar wording pattern, from
a database.
The term "ChatterBot" was originally coined by Michael Mauldin (creator of the first Verbot, Julia) in
1994 to describe these conversational programs. Today, most chatbots are accessed via virtual
assistants such as Google Assistant and Amazon Alexa, via messaging apps such as Facebook
Messenger or WeChat, or via individual organizations' apps and websites. Chatbots can be classified into
usage categories such as conversational commerce (e-commerce via chat), analytics, communication,
customer support, design, developer tools, education, entertainment, finance, food, games, health, HR,
marketing, news, personal, productivity, shopping, social, sports, travel and utilities.
Beyond chatbots, Conversational AI refers to the use of messaging apps, speech-based assistants and
chatbots to automate communication and create personalized customer experiences at scale.
3.2 Dialogflow/Api.ai
Dialogflow (previously known as API.AI) is where the magic happens. It works on natural language
processing and is backed by machine learning; the whole 'conversation' takes place in Dialogflow.
Dialogflow is backed by Google and runs on Google infrastructure, which means you can scale to millions
of users.
Give users new ways to interact with your product by building engaging voice and text-based
conversational interfaces powered by AI. Connect with users on the Google Assistant, Amazon Alexa,
Facebook Messenger, and other popular platforms and devices.
The process a Dialogflow agent follows from invocation to fulfillment is similar to someone answering a
question, with some liberties taken of course.
Dialogflow supports more than 20 platforms, from Google Home to Twitter, and runs across devices from
wearables to phones. It supports more than 14 languages worldwide, with support for more on the way.
3.2.1 Components of Dialogflow
Agents help convert user requests into actionable data (for example, TestAgent).
Intents are configured by developers and indicate what the objective of the user might be when
he/she makes a specific request. Example: Book a Flight, Collect Feedback etc.
Entities help extract information from user speech with the help of prompts. Example: a "Book a flight"
intent might need entities such as the to and from cities, the date and the class, which the agent tries to
extract from the user through conversation. The information received here is passed on for fulfilment.
Fulfilment is the code that fulfils the intent behind the user's request (a minimal webhook sketch is shown
after this list).
Integrations are various third-party applications, such as Twitter, Slack and Google Home, where you
can integrate your chatbot.
Prebuilt Agents: various pre-built agents already present in Dialogflow help you build your chatbot
easily; you can adapt them in your own way.
Smalltalk helps make bots friendly and chatty with no coding from our end.
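A minimal fulfilment webhook sketch (Flask) is shown below. The intent and parameter names are assumptions based on the Trippy agent described in the next section; the actual agent relied mostly on Dialogflow's built-in responses.

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    req = request.get_json(force=True)
    intent = req["queryResult"]["intent"]["displayName"]   # intent matched by Dialogflow
    params = req["queryResult"]["parameters"]               # entities extracted from the user

    if intent == "BookFlight":                               # assumed intent name
        reply = "Your {} flight from {} to {} on {} is booked.".format(
            params.get("flight-type", "economy"),            # assumed parameter names
            params.get("from-city", "?"),
            params.get("to-city", "?"),
            params.get("date", "?"))
    else:
        reply = "Sorry, I can only help with bookings right now."

    # Dialogflow reads the reply text from the 'fulfillmentText' field.
    return jsonify({"fulfillmentText": reply})

if __name__ == "__main__":
    app.run(port=5000)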
3.3 Use Case: Trippy
I created a chatbot named Trippy which can be used to book flights, book rooms and rent vehicles at
the destination.
First, it asks the user which service he/she wants. Then, depending upon the service, it asks the user
for details such as the departure location, destination, type of flight, date of journey, and type of vehicle
or room, and then gives a confirmation that the booking has been completed. Finally, the user can also give
feedback and a rating for his/her experience.
The ever-increasing use of smartphones and the internet has led to digitization, and many organizations,
businesses and services are moving online rather than only having a physical presence. Earlier, flight
bookings happened through online booking websites, where people sometimes did not find it easy to book
flights because of messy UIs and were not able to compare prices and select the best option. Now this
can be done by just telling a chatbot your needs and requirements.
3.3.2 Flow of Conversation in Dialogflow
Dialogflow, or Api.ai, which was used for this use case, is a cloud-based platform for creating text- or
voice-based chatbots. It is free to use; you only need to create a free account.
The methodology for the development of the application was as follows:
i. The problem statement of the chatbot was understood and the requirements and specifications were
gathered.
ii. The conversation flow for the chatbot was designed. This is an integral part of planning since it
can affect the user experience.
iii. Entities were created based on the requirements of the chatbot, such as car type, flight type, rating,
room type etc.
iv. Small Talk, a pre-built intent of Dialogflow, was used so that the agent can support small
talk without any extra development. By default it responds with predefined phrases, but it can
be customized as per developer requirements.
v. Intents were created as per the functionality of the chatbot, for example BookCars for booking cars,
BookFlight for booking flights etc.
vi. Each intent was tested continuously to make sure the conversation flow is maintained (this check can
also be scripted, as sketched after this list).
vii. Finally, the chatbot was integrated with Telegram and the integration was tested properly.
viii. A final test was performed to check the working of the chatbot on Telegram.
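A sketch of such scripted intent testing with the google-cloud-dialogflow client is shown below. The project ID, test phrases and expected intent names are placeholders; during training the intents were mostly tested interactively in the Dialogflow console and on Telegram.

from google.cloud import dialogflow

PROJECT_ID = "trippy-demo"       # placeholder GCP project ID
TEST_PHRASES = {
    "I want to book a flight to Delhi": "BookFlight",   # phrase -> expected intent
    "Book me a room for tomorrow": "BookRooms",
}

sessions = dialogflow.SessionsClient()
session = sessions.session_path(PROJECT_ID, "test-session")

for phrase, expected in TEST_PHRASES.items():
    query = dialogflow.QueryInput(
        text=dialogflow.TextInput(text=phrase, language_code="en"))
    result = sessions.detect_intent(
        request={"session": session, "query_input": query}).query_result
    status = "OK" if result.intent.display_name == expected else "MISMATCH"
    print(f"{status}: '{phrase}' -> {result.intent.display_name}")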
3.3.4 Diagrams
Once the user says hi, the bot greets the user, introduces itself and its functionality, and asks which service
the user wants.
While booking flights and hotel rooms for the same date and location, the user does not have to give the details
again.
The user can opt to directly give feedback and a rating for a previous experience.
Phrases not recognized by the chatbot are routed to the default fallback intent so that the user can state
the requirements again.
4. MACHINE LEARNING
4.1 Introduction
Machine learning is an application of artificial intelligence (AI) that provides systems the ability to
automatically learn and improve from experience without being explicitly programmed. Machine
learning focuses on the development of computer programs that can access data and use it to learn for
themselves.
The process of learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future based on the
examples that we provide. The primary aim is to allow computers to learn automatically, without
human intervention or assistance, and adjust their actions accordingly.
Supervised machine learning algorithms can apply what has been learned in the past to new
data using labeled examples to predict future events. Starting from the analysis of a known
training dataset, the learning algorithm produces an inferred function to make predictions about
the output values. The system is able to provide targets for any new input after sufficient training.
The learning algorithm can also compare its output with the correct, intended output and find
errors in order to modify the model accordingly.
In contrast, unsupervised machine learning algorithms are used when the information used to
train is neither classified nor labeled. Unsupervised learning studies how systems can infer a
function to describe a hidden structure from unlabeled data. The system doesn’t figure out the
right output, but it explores the data and can draw inferences from datasets to describe hidden
structures from unlabeled data.
Semi-supervised machine learning algorithms fall somewhere in between supervised and
unsupervised learning, since they use both labeled and unlabeled data for training – typically a
small amount of labeled data and a large amount of unlabeled data. The systems that use this
method are able to considerably improve learning accuracy. Usually, semi-supervised learning is
chosen when the acquired labeled data requires skilled and relevant resources in order to train it /
learn from it. Otherwise, acquiring unlabeled data generally doesn’t require additional resources.
Reinforcement machine learning is a learning method in which an agent interacts with its
environment by producing actions and discovering errors or rewards. Trial-and-error search and
delayed reward are the most relevant characteristics of reinforcement learning. This method allows
machines and software agents to automatically determine the ideal behavior within a specific
context in order to maximize its performance. Simple reward feedback is required for the agent to
learn which action is best; this is known as the reinforcement signal.
Machine learning enables analysis of massive quantities of data. While it generally delivers faster, more
accurate results in order to identify profitable opportunities or dangerous risks, it may also require
additional time and resources to train it properly. Combining machine learning with AI and cognitive
technologies can make it even more effective in processing large volumes of information.
4.2 Use Case 1: Wisconsin Breast Cancer Dataset
Breast cancer is the most common malignancy among women, accounting for nearly 1 in 3 cancers
diagnosed among women in the United States, and it is the second leading cause of cancer death among
women. Breast cancer occurs as a result of abnormal growth of cells in the breast tissue, commonly
referred to as a tumor. A tumor does not necessarily mean cancer: tumors can be benign (not cancerous),
pre-malignant (pre-cancerous), or malignant (cancerous). Tests such as MRI, mammogram, ultrasound and
biopsy are commonly performed to diagnose breast cancer.
We are given breast cancer results from a fine needle aspiration (FNA) test, a quick and simple procedure
that removes some fluid or cells from a breast lesion or cyst (a lump, sore or swelling) with a fine needle
similar to a blood-sample needle. From these results we build a model that can classify a breast cancer
tumor into two classes:
1. 1 = Malignant (cancerous) – present
2. 0 = Benign (not cancerous) – absent
4.2.2 Objective
Since the labels in the data are discrete, the prediction falls into two categories (i.e. malignant or
benign). In machine learning this is a classification problem.
Thus, the goal is to classify whether the breast cancer is benign or malignant and predict the recurrence
and non-recurrence of malignant cases after a certain period. To achieve this, we have used machine
learning classification methods to fit a function that can predict the discrete class of new input.
Important libraries were imported and the dataset was loaded using Pandas.
Data Preprocessing:
i. Converted text categorical data to numerical data using one-hot encoding.
ii. Removed null values: using dataframe.describe() we can find the number of empty
values in each attribute; these null values were then filled using the mean for continuous data or
the mode for categorical data.
iii. Removed outliers from each attribute using the interquartile range (IQR).
Feature Selection:
i. Plotted a correlation matrix between all the attributes to find the relationships between them.
ii. Dropped the attributes with very high correlation, as they only increase the complexity of the
model rather than improving the accuracy (see the Pandas sketch after this list).
Classification Model:
i. Divided the dataset into train and test data, with the test size as one-third of the whole dataset.
ii. A Logistic Regression model, a Support Vector Machine and a Random Forest model were trained on
the training dataset, and the accuracy metric was used to evaluate the predictions.
iii. A classification report for each model was generated to obtain the precision and recall for each of
them.
Logistic Regression achieved the highest accuracy at 95.74%, while Random Forest achieved 93.03%
and the Support Vector Machine achieved 91.5%. Logistic Regression worked best for this binary
classification data (a condensed scikit-learn sketch of the modelling step is given below).
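The sketch below assumes the preprocessed data has been saved to a CSV with 'diagnosis' as the label column; the file name and random seed are placeholders.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("breast_cancer_preprocessed.csv")   # assumed output of the preprocessing step
X, y = df.drop(columns=["diagnosis"]), df["diagnosis"]

# One-third of the data is held out for testing, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(name, "accuracy:", round(accuracy_score(y_test, pred), 4))
    print(classification_report(y_test, pred))   # precision and recall per class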
Table-4.1 Classification Report of Logistic Regression
4.3 Use Case 2: Pima Diabetes Dataset
The Pima Indian Diabetes dataset has 9 attributes in total. All the records are of female patients, and the
number of pregnancies they have had is recorded as the first attribute of the dataset. The second is the
plasma glucose concentration at 2 hours in an oral glucose tolerance test, the third the diastolic blood
pressure (mm Hg), the fourth the triceps skin fold thickness (mm), the fifth the 2-hour serum insulin
(mu U/ml), the sixth the body mass index (weight in kg / (height in m)^2), the seventh the diabetes
pedigree function, and the eighth the age (years). The ninth column is the class variable (0 or 1): 0 for no
diabetes and 1 for its presence.
To start with, we take a description of the dataset. We infer little from this except basic facts, such as that
the dataset has 768 rows and the maximum values of Age and Pregnancies; nothing more is of much use for
prediction. We also counted the records that tested positive and negative for diabetes: 268 and 500
respectively. Taking the mean BMI, we found that the average BMI of a person suffering from the disease
is 35.14, which means they are unhealthy and obese.
It is also interesting to note that the mean BMI for people not suffering from the disease is 30, which is the
threshold at which people are considered obese. Looking at the mean of the second parameter, Glucose
(plasma glucose concentration), we found that those who suffered from the disease had a mean value of
141.25, which indicates a pre-diabetic state of hyperglycemia that is associated with insulin resistance and
an increased risk of cardiovascular pathology.
4.3.2 Objective
The diabetes dataset is a binary classification problem in which we need to determine whether a patient is
suffering from the disease or not on the basis of the available features. Different methods and procedures
for cleaning the data, feature extraction, feature engineering and prediction algorithms are applied to the
Pima Indians Diabetes dataset to predict the onset of diabetes as a diagnostic measure.
A. A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their
possible consequences, including chance event outcomes, resource costs, and utility. It is one way to
display an algorithm that only contains conditional control statements. Decision trees are commonly used
in operations research, specifically in decision analysis, to help identify the strategy most likely to reach a
goal, but they are also a popular tool in machine learning. The decision tree algorithm is as follows:
• The attribute/feature that best splits the set is taken as the root.
• Distribute the set into different subsets, each having the same value for that attribute.
• Repeat the above steps until we reach the leaf nodes of the tree, where no further division can take place.
B. In statistics, linear regression is a linear approach for modelling the relationship between a scalar
dependent variable y and one or more explanatory variables (or independent variables) denoted X. The
case of one explanatory variable is called simple linear regression. For more than one explanatory
variable, the process is called multiple linear regression. (This term is distinct from multivariate linear
regression, where multiple correlated dependent variables are predicted, rather than a single scalar
variable.)
C. A multilayer perceptron (MLP) is a class of feedforward artificial neural network. An MLP consists
of at least three layers of nodes. Except for the input nodes, each node is a neuron that uses a nonlinear
activation function. MLP utilizes a supervised learning technique called backpropagation for training. Its
multiple layers and non-linear activation distinguish MLP from a linear perceptron. It can distinguish data
that is not linearly separable.
D. In machine learning, support vector machines (SVMs, also support vector networks) are supervised
learning models with associated learning algorithms that analyse data used for classification and
regression analysis. Given a set of training examples, each marked as belonging to one or the other of two
categories, an SVM training algorithm builds a model that assigns new examples to one category or the
other, making it a non-probabilistic binary linear classifier (although methods such as Platt scaling exist to
use SVM in a probabilistic classification setting). An SVM model is a representation of the examples as
points in space, mapped so that the examples of the separate categories are divided by a clear gap that is
as wide as possible. New examples are then mapped into that same space and predicted to belong to a
category based on which side of the gap they fall.
We applied many algorithms and did a lot of feature manipulation and extraction, and obtained the best
accuracy of 80.5% using SVM. A lot of information about the dataset was also extracted without using
complex algorithms, and we were able to perform extensive exploratory data analysis and draw many
conclusions. Random Forest and ensemble learning could probably achieve a better result. Our result was
also very close to the best reported result, which shows that, with the right parameters, SVM can be a good
and practical choice for classifying medical data.
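The comparison itself reduces to a few lines with scikit-learn. The sketch below assumes the standard Pima CSV layout with 'Outcome' (0/1) as the class column; the hyper-parameters are illustrative rather than the tuned values used in the actual experiments.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

df = pd.read_csv("pima-indians-diabetes.csv")        # assumed file name
X, y = df.drop(columns=["Outcome"]), df["Outcome"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# Scaling matters for the MLP and the SVM, much less for the decision tree.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "Decision Tree": (DecisionTreeClassifier(max_depth=5, random_state=42), False),
    "MLP": (MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=42), True),
    "SVM": (SVC(kernel="rbf", C=1.0), True),
}

for name, (model, needs_scaling) in models.items():
    model.fit(X_train_s if needs_scaling else X_train, y_train)
    pred = model.predict(X_test_s if needs_scaling else X_test)
    print(f"{name}: {accuracy_score(y_test, pred):.3f}")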
Table-4.4 Classification Report of Logistic Regression
5. ADVANCED PYTHON
5.1 Introduction
Being part of the ERB team, a lot of focus was given to training us in Python and its advanced
libraries, as they are needed in all the advanced technologies taught to us. To master Machine Learning,
IOT and chatbot development, an in-depth knowledge of Python is necessary. These capabilities were
assessed and cleared through various tests and coding challenges.