DS&ML 1
Data Science can be defined as the study of data, where it comes from, what it
represents, and the ways by which it can be transformed into valuable inputs and
resources to create business and IT strategies.
Data science is the in-depth study of large volumes of data. It involves extracting
meaningful insights from raw, structured, and unstructured data using the scientific
method, a range of technologies, and algorithms.
It is a multidisciplinary field that uses tools and techniques to manipulate the data so
that you can find something new and meaningful.
Data science relies on powerful hardware, programming systems, and efficient
algorithms to solve data-related problems, and it underpins much of the work in
artificial intelligence.
Data science is about finding patterns in data through analysis and using those
patterns to make predictions.
Data Science is used in many industries in the world today, e.g. banking, consultancy,
healthcare, and manufacturing.
A data scientist needs expertise in several areas:
• Machine Learning
• Statistics
• Programming (Python or R)
• Mathematics
• Databases
Most data can be categorized into 4 basic types from a Machine Learning perspective:
numerical data
categorical data
time-series data
text data
1.Numerical Data
Numerical data is any data where the data points are exact numbers. Statisticians
also call numerical data quantitative data. This data has meaning as a measurement,
such as a house price, or as a count, such as the number of residential properties in
Los Angeles or the number of houses sold in the past year.
2.Categorical Data
In the context of supervised classification, categorical data would be the class label.
It also covers characteristics such as whether a person is a man or a woman, or whether
a property is residential or commercial.
3.Time-Series Data
Time-series data is a sequence of numbers collected at regular intervals over some
period of time. It is especially important in fields such as finance. Each observation
has a temporal value attached to it, such as a date or a timestamp, which lets you
look for trends over time.
4.Text
Text data is essentially words. Often the first thing you do with text is turn it into
numbers, using a technique such as the bag-of-words representation.
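As a minimal sketch of the bag-of-words idea mentioned above: each document becomes a
vector of word counts over a shared vocabulary. The function name `bag_of_words`, the
naive whitespace tokenization, and the two sample sentences are assumptions made for
illustration only.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one word-count vector per document."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is the sorted set of all words seen in any document.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for words that do not occur in this document.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

# Hypothetical sample documents.
docs = ["the house is big", "the house is the best house"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['best', 'big', 'house', 'is', 'the']
print(vectors)     # [[0, 1, 1, 1, 1], [1, 0, 2, 1, 2]]
```

Word order is discarded: only how often each word occurs is kept, which is what makes
the text usable as numerical input.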
3. Differentiate between supervised and unsupervised learning
algorithms.
1.Unsupervised learning
Unsupervised learning is a type of machine learning in which models are trained on
an unlabeled dataset and are allowed to act on that data without any supervision.
Below are some of the main reasons why unsupervised learning is important:
• Unsupervised learning helps find useful insights in the data.
• In the real world, we do not always have input data with the corresponding output,
so unsupervised learning is needed to handle such cases.
Types of Unsupervised Learning Algorithm:
• Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no
similarities with the objects of another group. Cluster analysis finds the
commonalities between data objects and categorizes them according to the presence
or absence of those commonalities.
Some popular unsupervised learning algorithms:
• K-means clustering
• Hierarchical clustering
• Anomaly detection
• Neural Networks
• Apriori algorithm
Disadvantage of Unsupervised Learning:
• The results of an unsupervised learning algorithm might be less accurate because the
input data is not labeled and the algorithm does not know the exact output in advance.
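To illustrate the clustering idea described above, here is a minimal sketch of K-means
written with the standard library only. The six 2-D points, the function names, and the
fixed number of iterations are assumptions chosen for illustration; it is not a
production implementation.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return labels, centroids

# Hypothetical unlabeled data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

No labels are given to the algorithm. The first three points end up in one cluster and
the last three in the other, although which cluster is numbered 0 or 1 depends on the
random starting centroids.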
2.Supervised learning
Supervised learning is a type of machine learning in which machines are trained
using well "labelled" training data and, on the basis of that data, predict the
output. Labelled data means that the input data is already tagged with the correct
output.
In supervised learning, the training data provided to the machine works as the
supervisor that teaches the machine to predict the output correctly. It applies the
same concept as a student learning under the supervision of a teacher.
Supervised learning is the process of providing input data as well as the correct output
data to the machine learning model. The aim of a supervised learning algorithm is to
find a mapping function that maps the input variable (x) to the output variable (y).
In the real world, supervised learning can be used for risk assessment, image
classification, fraud detection, spam filtering, etc.
In supervised learning, models are trained using a labelled dataset, from which the
model learns about each type of data. Once the training process is completed, the model
is tested on test data (a held-out portion of the labelled data that was not used for
training), and then it predicts the output.
Steps involved in supervised learning:
• Split the labelled dataset into a training set, a validation set, and a test set.
• Determine the input features of the training dataset, which should carry enough
information for the model to predict the output accurately.
• Determine a suitable algorithm for the model, such as a support vector machine or
a decision tree.
• Execute the algorithm on the training dataset. Sometimes a validation set, held out
from the training data, is needed to tune the control parameters.
• Evaluate the accuracy of the model on the test set. If the model predicts the correct
outputs for the test set, the model is accurate.
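The steps above can be sketched roughly as follows. For brevity this sketch uses a
simple k-nearest-neighbours classifier in place of the support vector machine or
decision tree named above, and it runs on synthetic data. The data generator, the
60/20/20 split, and the candidate values of k are all assumptions made for
illustration.

```python
import random
from collections import Counter

def make_dataset(n_per_class, rng):
    """Synthetic labelled data: two 2-D Gaussian blobs (hypothetical example)."""
    data = []
    for label, centre in enumerate([(0.0, 0.0), (6.0, 6.0)]):
        for _ in range(n_per_class):
            x = (rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0))
            data.append((x, label))
    return data

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, examples, k):
    """Fraction of examples whose predicted label matches the true label."""
    correct = sum(knn_predict(train, x, k) == y for x, y in examples)
    return correct / len(examples)

rng = random.Random(42)
data = make_dataset(100, rng)
rng.shuffle(data)

# Split the labelled data into training (60%), validation (20%), test (20%).
train, validation, test = data[:120], data[120:160], data[160:]

# "Training" k-NN only stores the training set; the validation set is used
# to choose the control parameter k.
best_k = max([1, 3, 5], key=lambda k: accuracy(train, validation, k))

# Evaluate on test data that played no part in training or tuning.
test_accuracy = accuracy(train, test, best_k)
print(f"best k = {best_k}, test accuracy = {test_accuracy:.2f}")
```

Keeping the test set separate from both training and tuning is what makes the final
accuracy figure an honest estimate of how the model behaves on new data.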
1. Regression
Regression algorithms are used when the output variable is continuous and depends on
the input variables. They are used to predict continuous quantities, for example in
weather forecasting and market trend analysis. Below are some popular regression
algorithms which come under supervised learning:
• Linear Regression
• Regression Trees
• Non-Linear Regression
• Polynomial Regression
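As a minimal sketch of the first algorithm in the list, simple linear regression with
one input variable can be fitted by ordinary least squares. The function names and the
five data points are assumptions chosen so that the arithmetic is easy to check by
hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x.
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_value(x, slope, intercept):
    """Predict a continuous output for input x."""
    return slope * x + intercept

# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)                      # 2.0 1.0
print(predict_value(6, slope, intercept))    # 13.0
```

The output is a continuous number rather than a class, which is what distinguishes
regression from classification.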
2. Classification
Classification algorithms are used when the output variable is categorical, which
means it takes one of a fixed set of classes, such as Yes-No, Male-Female, or
True-False. A typical application is spam filtering. Some popular classification
algorithms:
• Random Forest
• Decision Trees
• Logistic Regression
Advantages of supervised learning:
• With the help of supervised learning, the model can predict the output on the
basis of prior experience.
• In supervised learning, we have an exact idea of the classes of objects.
Disadvantages of supervised learning:
• Supervised learning models are not well suited to very complex tasks.
• Supervised learning may not predict the correct output if the test data differs
from the training data, for example if it comes from a different distribution.
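As a sketch of one classification algorithm from the list above, logistic regression
for two classes can be trained by gradient descent. The spam-style data (number of
suspicious words per email), the learning rate, and the number of epochs are
assumptions made for illustration only.

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=10000):
    """Fit weight w and bias b by batch gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Error per example: predicted probability minus true label.
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        # Gradients of the mean log loss with respect to w and b.
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def predict(x, w, b):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Hypothetical data: x = suspicious words in an email, y = 1 for spam.
xs = [0, 1, 2, 3, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = train_logistic_regression(xs, ys)
print(predict(1, w, b), predict(8, w, b))
```

Unlike regression, the final output is a class label (here 0 for not spam, 1 for spam),
obtained by thresholding the predicted probability.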
1. Decision Trees
3.KNN