Manifold Learning for data visualizing
wepe committed May 22, 2015
1 parent 73f3c9b commit c040753
Showing 3 changed files with 288 additions and 0 deletions.
100 changes: 100 additions & 0 deletions ManifoldLearning/DimensionalityReduction_DataVisualizing/README.md
## 1. The concept of manifold learning

Manifold learning, first introduced in the famous journal *Science* in 2000, has since become a research hotspot in information science; it is of real significance in both theory and application.

Assume the data are sampled uniformly from a low-dimensional manifold embedded in a high-dimensional Euclidean space. Manifold learning then recovers that low-dimensional structure from the high-dimensional samples: it finds the manifold hidden in the high-dimensional space together with the corresponding embedding map, achieving dimensionality reduction or data visualization. In other words, it looks behind the observed phenomena for the intrinsic laws that generated the data.

> The above is adapted from [百度百科](http://baike.baidu.com/link?url=vQmr30kzWc3gXfZM-6ANTtPdWJ1JyUsJR0pzoOWfjG79QK4zVZ_PvFN8BRfgHeGkqFPR-HZGsguaYuZrSTEcwK)

Put simply, manifold learning can be used to reduce the dimensionality of high-dimensional data: if we bring the data down to 2 or 3 dimensions, we can visualize it, get an intuitive view of how it is distributed, and perhaps discover patterns hiding in it.
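As a concrete illustration (a sketch of my own, not part of the original experiment, assuming only sklearn and matplotlib): the classic "swiss roll" toy dataset is a 2D sheet rolled up in 3D space, and a manifold learning method such as Isomap can recover, i.e. "unroll", the underlying 2D structure:

```python
# A minimal sketch of manifold recovery on the swiss roll.
import matplotlib.pyplot as plt
from sklearn import datasets, manifold

# 3D points lying on a rolled-up 2D sheet; t parameterizes position along the roll
X, t = datasets.make_swiss_roll(n_samples=1000, random_state=0)

# Isomap approximately preserves geodesic (along-the-sheet) distances,
# so it can flatten the roll back into two dimensions
X_iso = manifold.Isomap(n_neighbors=10, n_components=2).fit_transform(X)

plt.scatter(X_iso[:, 0], X_iso[:, 1], c=t, cmap=plt.cm.Spectral)
plt.title('Swiss roll unrolled by Isomap')
plt.show()
```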

## 2. Categories of manifold learning methods

Manifold learning methods can be divided into two camps, linear and nonlinear. Linear methods include the familiar Principal Component Analysis (PCA); nonlinear methods include Isomap, Laplacian Eigenmaps (LE), and Locally Linear Embedding (LLE).

Of course, there are more manifold learning methods than these; as my knowledge is still limited, I won't expand on them here, and their principles are more than a single article can explain. For an introduction to the various methods, there is a good online read (I can no longer find the original source): [流形学习 (Manifold Learning)](http://blog.csdn.net/zhulingchen/article/details/2123129)
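To make the linear/nonlinear distinction concrete, here is a small sketch of my own (again assuming only sklearn) comparing PCA and LLE on the swiss roll from the previous example. A linear method can only project the roll onto a plane, so points from different turns of the spiral land on top of each other; LLE instead reconstructs each point from its nearest neighbors, which lets it undo the roll:

```python
# Sketch only: linear PCA vs. nonlinear LLE on the same swiss roll.
from sklearn import datasets, decomposition, manifold

X, t = datasets.make_swiss_roll(n_samples=1000, random_state=0)

# PCA: the best *linear* 2D projection -- the spiral stays tangled
X_pca = decomposition.PCA(n_components=2).fit_transform(X)

# LLE: local neighborhood reconstruction -- the roll comes undone
X_lle = manifold.LocallyLinearEmbedding(n_neighbors=10,
                                        n_components=2).fit_transform(X)
```

Scatter-plotting `X_pca` and `X_lle` side by side, colored by `t`, makes the difference immediately visible.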

## 3. Reducing and visualizing high-dimensional data

For dimensionality reduction, one chart sums things up well (again, I don't know the original source):

![overview of dimensionality reduction methods](http://img.blog.csdn.net/20150522194801297)


The chart covers most of the manifold learning methods, but it leaves out t-SNE, which is relatively new compared with the others and gives some of the best results. t-SNE was proposed in 2008 by deep learning heavyweight Hinton and lvdmaaten (his student?); lvdmaaten maintains a homepage for it, [tsne](http://lvdmaaten.github.io/tsne/), with the paper and implementations in various languages.
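As a quick reference (my own summary of the 2008 paper, not from the original post): t-SNE turns pairwise distances into similarity distributions, $P$ over pairs in the original space and $Q$ over pairs in the embedding, where $Q$ is built from a heavy-tailed Student-t kernel. It then minimizes the KL divergence between the two:

$$
C = \mathrm{KL}(P \,\|\, Q) = \sum_{i}\sum_{j \neq i} p_{ij}\,\log\frac{p_{ij}}{q_{ij}},
\qquad
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l}\left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}
$$

The heavy tail lets moderately dissimilar points be placed far apart in the embedding, which relieves the "crowding problem" of the earlier SNE.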

What follows is a small experiment: reducing the dimensionality of the handwritten-digits dataset bundled with sklearn (a small MNIST-style dataset) and visualizing it with a dozen or so algorithms. All of the algorithms are built into sklearn, and matplotlib is used for plotting. Most of the experiment follows this sklearn [example](http://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html), with minor modifications.

Matlab users can use the toolbox provided by lvdmaaten instead: [drtoolbox](http://lvdmaaten.github.io/drtoolbox/)

### **- Loading the data**


The digits data comes from sklearn's built-in `datasets` module. The code is below. To make the later plots easier to read, I select only `n_class=5`, i.e. the five digits 0-4. Each image is 8×8 pixels, which flattens to 64 dimensions.


```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits(n_class=5)
X = digits.data
y = digits.target
print(X.shape)

# tile the first 400 digits into a 20x20 montage;
# each 8x8 image sits in a 10x10 cell with a 1-pixel border
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.title('A selection from the 64-dimensional digits dataset')
```


Running this, `X` has shape (901, 64): 901 samples of 64 dimensions each (the digits dataset has roughly 180 examples per class, so five classes give 901). The figure below shows some of the samples:

![a selection of the digit images](http://img.blog.csdn.net/20150522195128952)




### **- Dimensionality reduction**

Take t-SNE as the example; the code is below. `n_components` is set to 3, i.e. the 64 dimensions are reduced to 3. `init` sets how the embedding is initialized, either `random` or `pca`; I use `pca` here, which is more stable than random initialization.


print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding_2d(X_tsne[:,0:2],"t-SNE 2D")
plot_embedding_3d(X_tsne,"t-SNE 3D (time %.2fs)" %(time() - t0))


The reduced data `X_tsne` has shape (901, 3). `plot_embedding_2d()` visualizes the first two dimensions, and `plot_embedding_3d()` visualizes all three.
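Besides `n_components` and `init`, the main t-SNE knob is `perplexity`, roughly the effective number of neighbors considered per point; sklearn's default is 30.0. A variation of the call above (a sketch of my own, embedding directly into 2D):

```python
# Sketch: make perplexity explicit; smaller values emphasize local structure,
# larger values emphasize global structure (sklearn's default is 30.0)
tsne = manifold.TSNE(n_components=2, perplexity=30.0, init='pca', random_state=0)
X_tsne2d = tsne.fit_transform(X)
plot_embedding_2d(X_tsne2d, "t-SNE 2D, perplexity=30")
```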


The function `plot_embedding_3d` is defined as follows (`plot_embedding_2d` is analogous; both appear in the full script further down):


```python
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection

def plot_embedding_3d(X, title=None):
    # rescale each coordinate to the [0, 1] interval
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    # the embedded coordinate of sample i is (X[i, 0], X[i, 1], X[i, 2]);
    # draw the digit's label at that position
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], X[i, 2], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})
    if title is not None:
        plt.title(title)
```


### **- Results**

With a dozen or so algorithms the results vary, but overall t-SNE performs best; it is also the most computationally expensive (the standard algorithm is O(N²) in the number of samples, though lvdmaaten's page also offers a faster Barnes-Hut variant). Below are the results for PCA, LDA, and t-SNE:
![embedding result](http://img.blog.csdn.net/20150522195334439)
![embedding result](http://img.blog.csdn.net/20150522195314420)
![embedding result](http://img.blog.csdn.net/20150522195347336)
![embedding result](http://img.blog.csdn.net/20150522195443173)
![embedding result](http://img.blog.csdn.net/20150522195502751)
![embedding result](http://img.blog.csdn.net/20150522195440501)




184 changes: 184 additions & 0 deletions (the experiment's full Python script)
#coding:utf-8
"""
CreatedCreated on Fri May 22 2015
@author: wepon
@blog;
"""
from time import time
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d.axes3d import Axes3D
from sklearn import (manifold, datasets, decomposition, ensemble, lda, random_projection)  # sklearn.lda was later renamed to sklearn.discriminant_analysis

#%%
# load the data and display a sample of the digits
digits = datasets.load_digits(n_class=5)
X = digits.data
y = digits.target
print(X.shape)

n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.title('A selection from the 64-dimensional digits dataset')

# LLE, Isomap and LTSA all need the n_neighbors parameter
n_neighbors = 30


#%%
# visualize the embedded data in 2D
def plot_embedding_2d(X, title=None):
    # rescale each coordinate to the [0, 1] interval
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    # the embedded coordinate of sample i is (X[i, 0], X[i, 1]);
    # draw the digit's label at that position
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})

    if title is not None:
        plt.title(title)

#%%
# visualize the embedded data in 3D
def plot_embedding_3d(X, title=None):
    # rescale each coordinate to the [0, 1] interval
    x_min, x_max = np.min(X, axis=0), np.max(X, axis=0)
    X = (X - x_min) / (x_max - x_min)

    # the embedded coordinate of sample i is (X[i, 0], X[i, 1], X[i, 2]);
    # draw the digit's label at that position
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1, projection='3d')
    for i in range(X.shape[0]):
        ax.text(X[i, 0], X[i, 1], X[i, 2], str(digits.target[i]),
                color=plt.cm.Set1(y[i] / 10.),
                fontdict={'weight': 'bold', 'size': 9})

    if title is not None:
        plt.title(title)


#%%
# Random projection
print("Computing random projection")
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
plot_embedding_2d(X_projected, "Random Projection")

#%%
# PCA (TruncatedSVD on the raw data, i.e. PCA without mean-centering)
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=3).fit_transform(X)
plot_embedding_2d(X_pca[:, 0:2], "PCA 2D")
plot_embedding_3d(X_pca, "PCA 3D (time %.2fs)" % (time() - t0))

#%%
# LDA
print("Computing LDA projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # perturb so no feature is degenerate and LDA's covariance stays invertible
t0 = time()
X_lda = lda.LDA(n_components=3).fit_transform(X2, y)
plot_embedding_2d(X_lda[:, 0:2], "LDA 2D")
plot_embedding_3d(X_lda, "LDA 3D (time %.2fs)" % (time() - t0))



#%%
# Isomap
print("Computing Isomap embedding")
t0 = time()
X_iso = manifold.Isomap(n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding_2d(X_iso, "Isomap (time %.2fs)" % (time() - t0))


#%%
# standard LLE
print("Computing LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='standard')
t0 = time()
X_lle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_lle, "Locally Linear Embedding (time %.2fs)" % (time() - t0))


#%%
# modified LLE
print("Computing modified LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_mlle, "Modified Locally Linear Embedding (time %.2fs)" % (time() - t0))


#%%
# HLLE
print("Computing Hessian LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_hlle, "Hessian Locally Linear Embedding (time %.2fs)" % (time() - t0))


#%%
# LTSA
print("Computing LTSA embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2, method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding_2d(X_ltsa, "Local Tangent Space Alignment (time %.2fs)" % (time() - t0))

#%%
# MDS
print("Computing MDS embedding")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=100)
t0 = time()
X_mds = clf.fit_transform(X)
print("Done. Stress: %f" % clf.stress_)
plot_embedding_2d(X_mds, "MDS (time %.2fs)" % (time() - t0))

#%%
# Totally Random Trees: hash the samples with random trees, then reduce the sparse codes with truncated SVD
print("Computing Totally Random Trees embedding")
hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0, max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
pca = decomposition.TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X_transformed)

plot_embedding_2d(X_reduced, "Random Trees (time %.2fs)" % (time() - t0))

#%%
# Spectral embedding
print("Computing Spectral embedding")
embedder = manifold.SpectralEmbedding(n_components=2, random_state=0, eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)
plot_embedding_2d(X_se, "Spectral (time %.2fs)" % (time() - t0))

#%%
# t-SNE
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=3, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)
plot_embedding_2d(X_tsne[:, 0:2], "t-SNE 2D")
plot_embedding_3d(X_tsne, "t-SNE 3D (time %.2fs)" % (time() - t0))

plt.show()
4 changes: 4 additions & 0 deletions README.md
CSDN:[wepon的专栏](http://blog.csdn.net/u012162613)
- **logistic regression**

Logistic regression (two-class) implemented with python+numpy; a detailed write-up: [article link](http://blog.csdn.net/u012162613/article/details/41844495)

- **ManifoldLearning**

[DimensionalityReduction_DataVisualizing]() applies a variety of manifold learning methods to reduce the dimensionality of high-dimensional data and visualizes the results with matplotlib (in 2D and 3D)
- **SVM**

