Week 7: Data Processing Methods

Ⅱ. Questions

Dimensionality reduction techniques fall into two broad categories, linear and nonlinear:
★ Linear dimensionality reduction techniques. These focus on keeping dissimilar points far apart in the low-dimensional representation.
① PCA (Principal Component Analysis)
② MDS (Multidimensional Scaling), etc.
★ Nonlinear dimensionality reduction techniques (broadly, "nonlinear dimensionality reduction" ≈ "manifold learning"; strictly, the latter is a subset of the former). These techniques assume that the high-dimensional data actually lie on a nonlinear manifold of lower dimension than the ambient space, so they focus on keeping similar neighboring points close in the low-dimensional representation.
① Sammon mapping
② SNE (Stochastic Neighbor Embedding); t-SNE is based on SNE
③ Isomap (Isometric Mapping)
④ MVU (Maximum Variance Unfolding)
⑤ LLE (Locally Linear Embedding), etc.

Data dimensionality reduction

Q1. Principal Component Analysis (PCA)

Each principal component is a linear combination of the original influencing factors.
A1. Principal component analysis is a dimensionality reduction algorithm that converts many indicators into a few principal components. These principal components are linear combinations of the original variables, are uncorrelated with each other, and reflect most of the information in the original data. Generally speaking, when a research problem involves many variables and those variables are strongly correlated, principal component analysis can be used to simplify the data.
I. Introduction to principal component analysis
Principal component analysis replaces a larger number of original variables with a smaller number of new variables, and these fewer new variables retain as much of the information reflected by the original data as possible.
Principal component analysis is a data dimensionality reduction algorithm. Dimensionality reduction keeps the most important features of high-dimensional data (too many indicators) and removes noise and unimportant features, thereby improving data processing speed.
Ⅱ. Methodology
1) Standardize the data;
2) Compute the covariance matrix of the standardized samples;
3) Compute the eigenvalues and eigenvectors of the covariance matrix;
4) Compute each principal component's contribution rate and the cumulative contribution rate;
5) Generally, keep the leading principal components corresponding to the eigenvalues whose cumulative contribution rate exceeds 80% (see the sketch below).
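Below is a minimal NumPy sketch of the five steps above; the function name, the 80% threshold argument, and the toy data are illustrative assumptions, not the notes' own code.

import numpy as np

def pca(X, threshold=0.80):
    # 1) Standardize each feature to zero mean and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # 2) Covariance matrix of the standardized samples.
    C = np.cov(Z, rowvar=False)
    # 3) Eigenvalues and eigenvectors (eigh suits the symmetric matrix C).
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]            # sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # 4) Contribution rate and cumulative contribution rate.
    contrib = eigvals / eigvals.sum()
    cum_contrib = np.cumsum(contrib)
    # 5) Keep the leading components whose cumulative rate exceeds 80%.
    k = int(np.searchsorted(cum_contrib, threshold)) + 1
    return Z @ eigvecs[:, :k], contrib[:k]

scores, rates = pca(np.random.rand(100, 6))      # toy data: 100 samples, 6 indicators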
Reference: 清风数学建模学习笔记——主成分分析 (PCA) 原理详解及案例分析, Xiu Yan's blog, CSDN.

Q2. Multidimensional Scaling (MDS)

Multidimensional scaling (MDS) is another linear dimensionality reduction method. Unlike principal component analysis and linear discriminant analysis, its goal is not to preserve the maximum separability of the data; instead it focuses on the internal structure of the high-dimensional data. MDS concentrates on preserving the "similarity" information of the high-dimensional space, and in typical problems this "similarity" is defined by Euclidean distance.
A2. The goal of multidimensional scaling is to display the differences (or similarities) between objects in a graph. The distance between two points in the graph represents the difference between the corresponding objects: the greater the distance, the greater the difference between the two objects; the shorter the distance, the more similar they are.
Scaling: projection.
Key point: turn the multidimensional data into two dimensions.
Reference: sklearn 与机器学习系列专题之降维(三)一文弄懂 MDS 特征筛选&降维, 南上加南's blog, CSDN.
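A minimal scikit-learn illustration of metric MDS on toy data; the parameter choices and variable names are assumptions, not values from the notes.

import numpy as np
from sklearn.manifold import MDS

X = np.random.rand(50, 10)                   # toy data: 50 samples, 10 features
mds = MDS(n_components=2, random_state=0)    # Euclidean dissimilarity by default
X_2d = mds.fit_transform(X)                  # 2-D coordinates that preserve distances
print(mds.stress_)                           # residual mismatch between distance sets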

Q3. t-distributed Stochastic Neighbor Embedding (t-SNE)

Reduces high-dimensional data to 1–2 dimensions: t-distributed stochastic neighbor embedding.
A3. t-distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent high-dimensional data sets in a two- or three-dimensional space for visualization. In contrast to other dimensionality reduction algorithms such as PCA, t-SNE creates a reduced feature space in which similar samples are modeled by nearby points and dissimilar samples are modeled by distant points with high probability.
At a high level, t-SNE constructs a probability distribution over pairs of high-dimensional samples such that similar samples are assigned a high probability of being picked and dissimilar points a low probability. t-SNE then defines a similar distribution over the points in the low-dimensional embedding. Finally, t-SNE minimizes the Kullback-Leibler (KL) divergence between the two distributions with respect to the locations of the embedded points.
t-SNE is a machine learning algorithm for dimensionality reduction proposed by Laurens van der Maaten et al. in 2008. It is a nonlinear dimensionality reduction algorithm and is very well suited to reducing high-dimensional data to 2 or 3 dimensions for visualization. In the low-dimensional space, the algorithm places points with high similarity close together, while points with low similarity are pushed farther apart.
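A minimal scikit-learn t-SNE illustration; the toy data and the perplexity setting (the library default) are assumptions, not values from the notes.

import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(200, 50)                  # toy high-dimensional data
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_2d = tsne.fit_transform(X)                 # similar samples land close in X_2d
print(tsne.kl_divergence_)                   # the KL divergence that was minimized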

Q4. Shepard Diagrams

Used with non-metric multidimensional scaling (NMDS).
A4. Shepard diagram - Oxford Reference
A plot of two measurements of the distances between objects. One measurement
is the true distance, and the other measurement is the apparent distance in some
representation of the objects. For example, the apparent distance between objects in a
photograph (two dimensions) and the real three-dimensional distance. The diagram is
used in multidimensional scaling to assess the extent of any distortion. Zero distortion
would correspond to a set of collinear points.
A Shepard plot compares the actual (or transformed) proximities with the predicted proximities. The figure reflects the extent to which the multidimensional scaling solution reproduces the actual proximities. A Shepard diagram is similar to a "predicted value versus actual value" plot; ideally, the points fall on the line Y = X.
NMDS weakens the dependence on the actual distance values and puts more emphasis on the ranking (rank order) of the values. For example, pairwise distances (1,2,3) and (10,20,30) among three samples have the same rank order, so they produce the same result in an NMDS analysis.
The NMDS analysis runs as follows (a code sketch follows the list):
1. Set the analysis dimension (usually a 2-dimensional plane);
2. Build an initial configuration and place the points according to the distance values (input data);
3. Judge the suitability of the model by comparing the fitted distances with the original data (stress criterion).
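A sketch of an NMDS run plus a Shepard diagram, assuming a precomputed Euclidean distance matrix; it uses scikit-learn's non-metric MDS together with SciPy and Matplotlib, and all names and data are illustrative.

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

X = np.random.rand(40, 8)                    # toy data
D = squareform(pdist(X))                     # original pairwise distance matrix

# metric=False gives non-metric MDS: it fits the rank order of distances.
nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
Z = nmds.fit_transform(D)

# Shepard diagram: original vs. embedded distances; low distortion means the
# points hug a monotone line.
plt.scatter(pdist(X), pdist(Z), s=8)
plt.xlabel("original distance")
plt.ylabel("embedded distance")
plt.show()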

Shepard diagram: a comparison of the original distances between samples/observations and the NMDS ordination result. The stress value measures how much the original dissimilarities change after NMDS ordination: the smaller the change, the better the ordination, and the more accurately it reflects the samples' original spatial positions or gradients. Non-metric fit R^2 = 1 - stress^2. In the goodness-of-fit plot, the larger a bubble, the greater the gap between an observation's position in ordination space and its original position.

PCA (Principal Component Analysis)

Idea: find a mapping matrix W from high dimension N to low dimension d such that the variance after projection is maximized (retaining as much information as possible).
Advantages:
Linear transformation: each new dimension is a linear combination of the original dimensions (highly interpretable, e.g. x' = 0.6*gender + 0.3*age; see the sketch below)
Preserves global structure
No hyperparameters; the result is stable
Relatively fast
Disadvantages:
Discriminative power and expressiveness are not outstanding
The result is easily affected by outliers (a side effect of preserving global structure)
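A short scikit-learn illustration of the interpretability point above: components_ exposes each new dimension as an explicit linear combination of the original features. The feature names and data here are hypothetical.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 3)                   # hypothetical columns: gender, age, income
pca = PCA(n_components=2)
pca.fit(X)
print(pca.components_)                       # rows are loadings, e.g. x' = w1*gender + w2*age + w3*income
print(pca.explained_variance_ratio_)         # variance retained per component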

MDS (Metric Multidimensional Scaling)

Distance metric: Euclidean distance
Optimization objective: the distance between any two instances in the low-dimensional embedding space equals their distance in the original space (a stress sketch follows).
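A minimal sketch of that objective as a quantity to minimize (raw stress, the sum of squared mismatches between the two sets of pairwise distances); the function name is illustrative.

import numpy as np
from scipy.spatial.distance import pdist

def raw_stress(X, Z):
    # Sum of squared differences between original and embedded pairwise
    # distances; metric MDS drives this quantity toward zero.
    return np.sum((pdist(X) - pdist(Z)) ** 2)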

t-SNE (t-distributed Stochastic Neighbor Embedding)

Key feature: distances between data points are represented as probability distributions.
Distance metric:
High-dimensional space: Gaussian distribution (the variance sigma differs for each x_i, so there are N distributions)
Low-dimensional space: Student's t-distribution with 1 degree of freedom
The t-distribution pushes points that are far apart in the high-dimensional space even farther apart in the low-dimensional space, which alleviates the crowding problem in the low-dimensional space.
Optimization objective: KL divergence (measures the difference between two probability distributions; see the sketch after this list).
Advantages:
Strong discriminative power after reduction
Can handle outliers
Disadvantages:
Poor performance (slow)
The result is strongly affected by hyperparameters, and optimization is difficult
Shape is unrelated to density (the density parameter is dropped from the t-distribution)
The result is hard to read and interpret
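A minimal sketch of the two pieces named above, the low-dimensional Student-t kernel and the KL objective, assuming the high-dimensional affinity matrix P has already been computed; all names are illustrative.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def kl_objective(P, Z, eps=1e-12):
    # Low-dimensional affinities from a Student-t kernel with 1 degree of freedom.
    num = 1.0 / (1.0 + squareform(pdist(Z)) ** 2)
    np.fill_diagonal(num, 0.0)               # a point is not its own neighbor
    Q = num / num.sum()
    # KL(P || Q): the quantity t-SNE minimizes over the embedding Z.
    return np.sum(P * np.log((P + eps) / (Q + eps)))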

Shepard Diagram: a visualization scheme
Goal: for dimensionality-reduced data, provide the main visualization content plus auxiliary information, helping the audience understand from which angles the data should be analyzed and what information can be obtained.
Global view:
1. Shape exploration (data separability): label the points and examine the shapes and distributions of different groups (the labels can come from actions such as clicks or add-to-favorites/add-to-cart, or from different targeting segments). Use grid search to try multiple parameter settings, cluster the resulting layouts to find the main spatial structures, and reflect the quality of the dimensionality reduction quantitatively.
2. Density differentiation: the probability-density formulas of t-SNE, UMAP, etc. all discard the density parameter in the low-dimensional space; a density map can help restore the density information (a sketch follows this list).
Local view:
1. Subgroup exploration (salient-feature analysis): for a particular subgroup, identify the features that are salient compared with the other groups;
2. Feature-space distribution analysis;
3. Boundary composition and related-factor analysis.
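A small sketch of the density-map idea from item 2 of the global view, overlaying a Gaussian kernel density estimate on a 2-D embedding; the embedding here is placeholder data.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

X_2d = np.random.randn(500, 2)               # placeholder for a t-SNE/UMAP embedding
density = gaussian_kde(X_2d.T)(X_2d.T)       # per-point density estimate
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=density, s=8, cmap="viridis")
plt.colorbar(label="estimated density")
plt.show()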
