
PCA

Principal component analysis is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation,
increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data.
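
As a rough illustration of the idea, the sketch below (assuming scikit-learn and NumPy are available, and using randomly generated toy data) reduces a five-feature dataset to two principal components:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))           # 100 observations, 5 features (toy data)

pca = PCA(n_components=2)               # keep the two directions of largest variance
X_reduced = pca.fit_transform(X)        # project the data onto those directions

print(X_reduced.shape)                  # (100, 2): now plottable in two dimensions
print(pca.explained_variance_ratio_)    # share of the variance preserved by each component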

Data
There are different types of data in statistics that are collected, analyzed, interpreted and presented. Data are the individual pieces of factual
information recorded, and they are used for the purpose of the analysis process. Data is broadly classified into four categories:

 Nominal data
 Ordinal data
 Discrete data
 Continuous data

Categorical Data
Categorical data describes data that fits into categories. Such qualitative data are not numerical.

Nominal Data
Nominal data is a type of qualitative data that labels variables without providing any numerical value. Nominal data is also called the nominal scale. These data are often visualized using pie charts.

Ordinal Data
Ordinal data is a type of data that follows a natural order. A significant feature of ordinal data is that the differences between the data values are not determined. It is commonly found in surveys, finance, economics, questionnaires, and so on.

Numerical Data
Numerical data represents a numerical value (i.e., how much, how often, how many). Numerical data gives information about the quantities of a specific thing. Some examples of numerical data are height, length, size, weight, and so on.

Discrete Data
Discrete data can take only certain distinct values; it contains only a finite number of possible values. Example: the number of students in a class.

Continuous Data
Continuous data is data that can be measured. It can take an infinite number of possible values within a given range. Example: a temperature range.

NLP
Natural language processing (NLP) is a branch of artificial intelligence (AI) concerned with giving computers the ability to understand text and spoken words in much the same way human beings can.

NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models.

Differentiate between causation and correlation


Correlation gives the relationship between two variables, whereas causation means that one event is caused by another.

Correlation is when two things happen together, while causation is when one thing causes another thing to happen. Correlation is the degree of association between two random variables.

 It describes the size and direction of the relationship between two variables.
 Correlation does not imply that a change in one variable causes a change in another variable.
 The value of Correlation Coefficient varies from -1 to 1.

Causation means that changes in one variable bring about changes in the other; there is a cause-and-effect relationship between the variables. In that case the two variables are correlated with each other, and there is also a causal link between them.
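
As a small illustration (the study-hours and exam-score numbers below are made up), the Pearson correlation coefficient can be computed with NumPy; a value near +1 indicates a strong positive association but, by itself, says nothing about causation:

import numpy as np

hours_studied = np.array([1, 2, 3, 4, 5, 6])
exam_score = np.array([52, 55, 61, 64, 70, 74])

r = np.corrcoef(hours_studied, exam_score)[0, 1]   # Pearson correlation coefficient
print(r)   # close to +1: strong positive correlation, not proof of causation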

Blockchain
A blockchain is a distributed database or ledger shared among a computer network's nodes. Blockchains are best known for their crucial role in cryptocurrency systems, where they maintain a secure and decentralized record of transactions.
 Blockchain is a type of shared database that differs from a typical database in the way it stores information; blockchains store data in blocks linked
together via cryptography.
 Different types of information can be stored on a blockchain, but the most common use so far has been as a ledger for transactions.
 In Bitcoin’s case, blockchain is decentralized so that no single person or group has control—instead, all users collectively retain control.
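
A toy sketch of the block-linking idea (not a real blockchain implementation; the transactions are invented) shows how each block stores the hash of the previous one, so tampering with an earlier block breaks every later link:

import hashlib, json

def block_hash(block):
    # Hash the block's contents, which include the previous block's hash
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

genesis = {"index": 0, "transactions": [], "prev_hash": "0" * 64}
block1 = {"index": 1, "transactions": ["A pays B 5"], "prev_hash": block_hash(genesis)}
block2 = {"index": 2, "transactions": ["B pays C 2"], "prev_hash": block_hash(block1)}

genesis["transactions"].append("A pays A 100")      # tamper with an earlier block
print(block1["prev_hash"] == block_hash(genesis))   # False: the chain no longer verifies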

Importance of Sigmoid function


A Sigmoid function is a mathematical function which has a characteristic S-shaped curve.
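
A minimal sketch of the function, sigma(x) = 1 / (1 + e^(-x)), using NumPy:

import numpy as np

def sigmoid(x):
    # Maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # approximately [0.0067, 0.5, 0.9933]
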
mode
The mode in statistics refers to the number in a set of numbers that appears most often. For example, if a set of numbers contained the digits 1, 1, 3, 5, 6, 6, 7, 7, 7, 8, the mode would be 7.

Mean 
Mean is the average of the given numbers and is calculated by dividing the sum of given numbers by the total number of numbers. 
What is the mean of 2, 4, 6, 8 and 10?
2 + 4 + 6 + 8 + 10 = 30
Mean = 30/5 = 6

Median
Median is the middle value of the given list of data when arranged in an order.
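
A quick check of these three measures with Python's built-in statistics module, using the numbers from the examples above:

import statistics

print(statistics.mode([1, 1, 3, 5, 6, 6, 7, 7, 7, 8]))   # 7, the most frequent value
print(statistics.mean([2, 4, 6, 8, 10]))                 # 6, the average
print(statistics.median([2, 4, 6, 8, 10]))               # 6, the middle value once sorted
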
inconsistency
Inconsistent data are values or observations that are distant from the other observations of the same phenomenon, which means that they contrast sharply with the values that are normally measured.

Duplicate
Duplicates occur when two or more rows have the same, or nearly the same, values.
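
A small sketch of spotting and dropping duplicate rows with pandas (the tiny table is illustrative):

import pandas as pd

df = pd.DataFrame({"name": ["Ann", "Bob", "Ann"], "score": [90, 85, 90]})
print(df.duplicated())        # True for the repeated row
print(df.drop_duplicates())   # keeps only the first occurrence of each row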

null hypothesis
The null hypothesis in statistics states that there is no difference between groups or no relationship between variables.

outlier
An outlier is an extremely high or extremely low data point relative to the nearest data points and the rest of the neighbouring values in a data graph or dataset.

Sigmoid functions
Sigmoid functions are also useful for many machine learning applications where a real number needs to be converted to a probability. A sigmoid function
placed as the last layer of a machine learning model can serve to convert the model's output into a probability score, which can be easier to work with
and interpret.

Sigmoid functions are an important part of a logistic regression model. Logistic regression is a modification of linear regression for two-class classification,
and converts one or more real-valued inputs into a probability, such as the probability that a customer will purchase a product. The final stage of a logistic
regression model is often set to the logistic function, which allows the model to output a probability.

Linear Regression vs Logistic Regression


Linear Regression and Logistic Regression are two well-known Machine Learning algorithms that come under the supervised learning technique. Since both algorithms are supervised, they use labelled datasets to make predictions. The main difference between them is how they are used: Linear Regression is used for solving regression problems, whereas Logistic Regression is used for solving classification problems.

Linear Regression:
Linear Regression is one of the simplest Machine Learning algorithms that comes under the Supervised Learning technique and is used for solving regression problems.

It is used for predicting a continuous dependent variable with the help of independent variables.
The goal of linear regression is to find the best-fit line that can accurately predict the output for the continuous dependent variable.
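
A minimal sketch with scikit-learn, assuming toy data generated around the line y = 2x + 1:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2 * X.ravel() + 1 + rng.normal(scale=0.1, size=10)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # close to 2 and 1, the slope and intercept of the best-fit line
print(model.predict([[12.0]]))         # continuous prediction for a new input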

Logistic Regression:
Logistic regression is one of the most popular Machine Learning algorithms that comes under the Supervised Learning technique.

It can be used for classification as well as regression problems, but it is mainly used for classification problems.

Logistic regression is used to predict a categorical dependent variable with the help of independent variables.
The output of a logistic regression model is a probability, which can only be between 0 and 1.
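
A minimal two-class sketch with scikit-learn (the hours-studied versus pass/fail data below are invented for illustration) shows the probability output:

import numpy as np
from sklearn.linear_model import LogisticRegression

hours = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [4.0], [5.0], [6.0]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])        # categorical (0/1) dependent variable

clf = LogisticRegression().fit(hours, passed)
print(clf.predict_proba([[2.5]])[0, 1])   # probability of passing, always between 0 and 1
print(clf.predict([[2.5]]))               # predicted class label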

Binomial distribution
Binomial distribution is a probability distribution in statistics that summarizes the likelihood that a value will take one of two independent values under a
given set of parameters or assumptions.

The underlying assumptions of the binomial distribution are that:

 there is only one outcome for each trial,
 each trial has the same probability of success, and
 each trial is mutually exclusive of, or independent of, the others.

Binomial distribution is a common discrete distribution used in statistics, as opposed to a continuous distribution, such as normal distribution.
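
A small sketch with scipy.stats, assuming 10 independent trials each with success probability 0.5:

from scipy.stats import binom

n, p = 10, 0.5
print(binom.pmf(3, n, p))   # probability of exactly 3 successes in 10 trials
print(binom.cdf(3, n, p))   # probability of 3 or fewer successes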

Bayes Theorem
Bayes' theorem is a theorem in probability and statistics, named after the Reverend Thomas Bayes, that helps determine the probability of an event based on another event that has already occurred. Bayes' rule has many applications, such as Bayesian inference and, in the healthcare sector, determining the chances of developing health problems with increasing age, among many others.
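
A worked example with made-up numbers, computing the probability of having a disease given a positive test result (the prior and posterior terms in the next entry map directly onto these quantities):

p_disease = 0.01                      # prior probability of the disease
p_pos_given_disease = 0.95            # test sensitivity
p_pos_given_healthy = 0.05            # false positive rate

# Bayes' theorem: P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)
print(p_pos_given_disease * p_disease / p_pos)   # about 0.16, the posterior probability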

Prior probability and Posterior probability


A posterior probability is the probability of assigning observations to groups given the data. A prior probability is the probability that an observation will
fall into a group before you collect the data.

Central Tendency
A number that represents the centre of a data distribution

Clustering
Clustering is a machine learning technique for analyzing data and dividing it into groups of similar data. These groups or sets of similar data are known as clusters. Cluster analysis looks at clustering algorithms that can identify clusters automatically. Hierarchical and partitional are two such classes of clustering algorithms. Hierarchical clustering algorithms break the data up into a hierarchy of clusters, while partitional algorithms divide the data set into mutually disjoint partitions.

Hierarchical Clustering
Hierarchical clustering algorithms repeat the cycle of either merging smaller clusters into larger ones or splitting larger clusters into smaller ones.

Partitional Clustering
Partitional clustering algorithms generate various partitions and then evaluate them by some criterion. They are also referred to as nonhierarchical as
each instance is placed in exactly one of k mutually exclusive clusters.
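
A minimal sketch of partitional clustering with k-means in scikit-learn, assuming two synthetic blobs of points:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(20, 2)),    # points around (0, 0)
               rng.normal(5, 0.5, size=(20, 2))])   # points around (5, 5)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)            # each point is placed in exactly one of the k=2 disjoint clusters
print(km.cluster_centers_)   # the learned cluster centres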

Machine Learning
Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that
humans learn, gradually improving its accuracy.

Supervised and unsupervised learning are two different types of machine learning approach. They differ in the way the models are trained and in the condition of the training data that is required. Each approach has different strengths, so the task or problem faced by a supervised versus an unsupervised learning model will usually be different.

Supervised learning
Supervised machine learning requires labelled input and output data during the training phase of the machine learning model lifecycle. This training data
is often labelled by a data scientist in the preparation phase, before being used to train and test the model. Once the model has learned the relationship
between the input and output data, it can be used to classify new and unseen datasets and predict outcomes.  

unsupervised learning
Unsupervised machine learning is the training of models on raw and unlabelled training data. It is often used to identify patterns and trends in raw
datasets, or to cluster similar data into a specific number of groups. It’s also often an approach used in the early exploratory phase to better understand
the datasets.  

big data ethics


Big data ethics is defined as outlining, defending and recommending concepts of right and wrong practice when it comes to the use of data, with particular emphasis on personal data.

Time-Series Data
Time-series data is a sequence of data points collected over time intervals, allowing us to track changes over time. Time-series data can track changes
over milliseconds, days, or even years. Businesses, governments, schools, and communities, large and small, are finding invaluable ways to mine value
from analyzing time-series data.

What sets time series data apart from other data is that the analysis can show how variables change over time. In other words, time is a crucial variable
because it shows how the data adjusts over the course of the data points as well as the final results.

Time series analysis helps organizations understand the underlying causes of trends or systemic patterns over time. Using data visualizations, business
users can see seasonal trends and dig deeper into why these trends occur.

When organizations analyze data over consistent intervals, they can also use time series forecasting to predict the likelihood of future events.
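
A small sketch of time-series data in pandas (the daily "sales" figures are synthetic), showing how smoothing over consistent intervals makes a trend visible:

import numpy as np
import pandas as pd

dates = pd.date_range("2023-01-01", periods=90, freq="D")
sales = pd.Series(np.arange(90) + np.random.default_rng(0).normal(0, 5, size=90), index=dates)

weekly_avg = sales.rolling(window=7).mean()   # average over 7-day windows to smooth daily noise
print(weekly_avg.tail())                      # the upward trend over time becomes easy to see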

Computer vision
Computer vision is a field of artificial intelligence (AI) that enables computers and systems to derive meaningful information from digital images,
videos and other visual inputs — and take actions or make recommendations based on that information.
