
This article was published some time ago; parts of it may be outdated, so read with care.

Machine Learning Notes

Reference links

1、Required libraries

python
pip install scikit-learn
pip install scipy
pip install numpy
pip install matplotlib

2、Function reference

2.1、PCA (Principal Component Analysis)

2.1.1、Function description

Principal Component Analysis (PCA) is a dimensionality-reduction technique used in data preprocessing.

The general steps of PCA are: first zero-center the original data, then compute the covariance matrix, then compute the eigenvectors and eigenvalues of that covariance matrix; these eigenvectors form the new feature space.
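A minimal NumPy sketch of these steps (the toy data and the number of kept components are only illustrative):

python
import numpy as np

X = np.random.rand(100, 6)                 # toy data: 100 samples, 6 features
X_centered = X - X.mean(axis=0)            # 1) zero-center each feature
cov = np.cov(X_centered, rowvar=False)     # 2) covariance matrix, shape (6, 6)
eig_vals, eig_vecs = np.linalg.eigh(cov)   # 3) eigenvalues / eigenvectors
order = np.argsort(eig_vals)[::-1]         # sort directions by decreasing variance
W = eig_vecs[:, order[:3]]                 # keep the top-3 directions as the new basis
X_reduced = X_centered @ W                 # project the data into the new feature space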

python
sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False)

Parameters:

n_components:

Meaning: the number of principal components n to keep, i.e. the number of features retained after the reduction.

Type: int, float, or the string 'mle'. The default is None, in which case all components are kept. If an int, for example n_components=1, the data is reduced to that many dimensions. If a float between 0 and 1, the number of components is chosen automatically so that the retained explained-variance ratio exceeds that value. If 'mle', the number of components is chosen automatically by maximum-likelihood estimation.

copy:

Type: bool, default True.

Meaning: whether to copy the original training data before running the algorithm. If True, the original training data is left unchanged, because the computation runs on a copy; if False, the original data may be modified, because the reduction is computed in place.

whiten:

Type: bool, default False.

Meaning: whitening, which rescales the retained components so that every output feature has the same variance.
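For reference, a short sketch of how these constructor parameters can be combined (the values are arbitrary):

python
from sklearn.decomposition import PCA

pca_int = PCA(n_components=3)                            # keep exactly 3 components
pca_ratio = PCA(n_components=0.95)                       # keep enough components for 95% of the variance
pca_mle = PCA(n_components='mle')                        # choose the number of components automatically
pca_white = PCA(n_components=3, whiten=True, copy=True)  # whitened output, computed on a copy of the data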

1、PCA attributes (a short sketch of reading them follows this list):

  • components_: the principal components, i.e. the directions of maximum variance.
  • explained_variance_ratio_: the fraction of variance explained by each of the n retained components.
  • n_components_: the number of components actually retained.
  • mean_: the per-feature empirical mean estimated from the training data.
  • noise_variance_: the estimated noise variance.
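A quick sketch of reading these attributes after fitting (toy data; the printed values depend on the input):

python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(50, 6)                  # toy data: 50 samples, 6 features
pca = PCA(n_components=3).fit(X)
print(pca.components_)                     # principal directions, shape (3, 6)
print(pca.explained_variance_ratio_)       # variance ratio of each kept component
print(pca.n_components_)                   # number of components actually kept
print(pca.mean_)                           # per-feature mean, shape (6,)
print(pca.noise_variance_)                 # estimated noise variance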

2、PCA methods:

(1) fit(X, y=None)

fit(X) trains the PCA model on the data X.

Return value: the object on which fit was called; for example, pca.fit(X) trains the pca object on X.

Note: fit() is the common entry point throughout scikit-learn; every estimator that needs training has a fit() method, and it corresponds to the "training" step of the algorithm. Since PCA is an unsupervised algorithm, y is simply None here.

(2) transform(X)

Transforms X into its reduced representation. Once the model has been trained, any newly arriving data can be reduced with transform.
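A small self-contained sketch of reducing data that was not seen during fitting (the variable names are illustrative):

python
import numpy as np
from sklearn.decomposition import PCA

X_train = np.random.rand(100, 6)          # data used to fit the model
X_new = np.random.rand(10, 6)             # new samples arriving later
pca = PCA(n_components=3).fit(X_train)
X_new_reduced = pca.transform(X_new)      # shape (10, 3), projected with the components learned from X_train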

(3) fit_transform(X)

Trains the PCA model on X and returns the reduced data at the same time; it is equivalent to calling fit(X) followed by transform(X).

newX = pca.fit_transform(X), where newX is the reduced data.

(4) inverse_transform()

Maps reduced data back to the original space: X = pca.inverse_transform(newX).

2.1.2、Usage

python
import numpy as np
from sklearn.decomposition import PCA

# input data: a 5 x 6 matrix (5 samples, 6 features)
A = np.array([[84,65,61,72,79,81],[64,77,77,76,55,70],[65,67,63,49,57,67],[74,80,69,75,63,74],[84,74,70,80,74,82]])
print(A)
'''
 [[84 65 61 72 79 81]
  [64 77 77 76 55 70]
  [65 67 63 49 57 67]
  [74 80 69 75 63 74]
  [84 74 70 80 74 82]]
'''

# reduce dimensionality directly with PCA
pca = PCA(n_components=3)  # reduce to 3 dimensions
pca.fit(A)

a_pca = pca.transform(A)  # the reduced data
print('a_pca', a_pca.shape, '\n',a_pca)
'''
a_pca (5, 3)
 [[-16.14860528 -12.48396235  -2.05120596]
 [ 10.61676743  15.67317428  -3.55918823]
 [ 23.40212697 -13.607117     0.54500782]
 [ -0.43966353   7.77054621   4.72024299]
 [-17.43062559   2.64735885   0.34514339]]
'''

a_reverse = pca.inverse_transform(a_pca)  # map the reduced data back to the original space
print('a_reverse', a_reverse.shape,'\n', a_reverse)
'''
a_reverse (5, 6)
 [[84.6267619  65.03472928 62.12667603 71.28115667 78.79268699 81.55012354]
 [64.14111429 77.00781923 77.25366904 75.83815375 54.95332386 70.1238593 ]
 [64.68210057 66.98238499 62.42853949 49.36460399 57.10515108 66.72097225]
 [74.48374539 80.02680464 69.86958753 74.44518462 62.83999233 74.42459461]
 [83.06627786 73.94826186 68.32152791 81.07090098 74.30884575 81.18045029]]
'''

2.1.3、PyTorch implementation

2.1.3.1、Function implementation

python
# PyTorch implementation via SVD
import torch

def PCA_svd(X, k, center=True):
  n = X.size()[0]
  ones = torch.ones(n).view([n, 1])
  # centering matrix H = I - (1/n) * 1 * 1^T (identity when center=False)
  h = ((1 / n) * torch.mm(ones, ones.t())) if center else torch.zeros(n * n).view([n, n])
  H = torch.eye(n) - h
  X_center = torch.mm(H.double(), X.double())
  u, s, v = torch.svd(X_center)
  components = v[:k].t()
  # explained_variance = torch.mul(s[:k], s[:k]) / (n - 1)
  return components
python
# quick check
feature_dim = 3
feature = torch.arange(1, 51).reshape(5, 10)
print('1', feature)
feature = PCA_svd(feature, feature_dim)
feature = feature.float()
print('2', feature)
'''
1 tensor([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10],
        [11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
        [21, 22, 23, 24, 25, 26, 27, 28, 29, 30],
        [31, 32, 33, 34, 35, 36, 37, 38, 39, 40],
        [41, 42, 43, 44, 45, 46, 47, 48, 49, 50]])
2 tensor([[-3.1623e-01, -3.1623e-01, -3.1623e-01],
        [-9.4868e-01,  1.0541e-01,  1.0541e-01],
        [-7.1898e-17,  9.4281e-01, -1.1785e-01],
        [-3.3463e-17,  1.9550e-16,  9.3541e-01],
        [ 4.1693e-18, -3.6006e-17,  1.5841e-16]])
'''

2.1.3.2、Class implementation

python
import torch
from torch.linalg import eig


class PCA():
    def __init__(self, n_component: int = 16, device=torch.device('cpu')) -> None:
        """主成分分析

        Args:
            n_component (int): 保留的主成分数
        """
        super().__init__()
        self.n_component = n_component
        self.device = device

    def CHECK_SHAPE(self, shape: torch.Size) -> None:
        assert len(shape) >= 2, 'Input is expected to have at least 2 dimensions!'
        limit = 1
        for i in range(1, len(shape)):
            limit *= shape[i]
        assert limit >= self.n_component, f'n_component = {self.n_component}, expected <= {limit}'

    @torch.no_grad()
    def fit(self, X: torch.Tensor) -> None:
        """提取主成分

        Args:
            X (torch.Tensor): 待进行主成分分析的输入张量,形状应当为 (batch_size, ...)
            X的shape长度应该大于2,可以是(batch_size, channel, width, length)或者其他
        """
        self.CHECK_SHAPE(X.shape)
        Y = X.reshape(X.shape[0], -1).to(self.device)
        self.mean = Y.mean(0)
        Z = Y - self.mean

        covariance = Z.T @ Z
        eig_val, eig_vec = eig(covariance)

        # torch.linalg.eig does not guarantee any ordering, so sort the eigenvectors
        # by descending eigenvalue before keeping the leading components
        idx = torch.argsort(eig_val.real, descending=True)
        self.components = eig_vec[:, idx[:self.n_component]]

    @torch.no_grad()
    def transform(self, X: torch.Tensor) -> torch.Tensor:
        """数据降维

        Args:
            X (torch.Tensor): 待降维数据,形状应当为 (batch_size, ...)

        Returns:
            torch.Tensor: 降维后数据
        """
        self.CHECK_SHAPE(X.shape)
        Z = X.reshape(X.shape[0], -1).to(self.device)

        return (Z - self.mean) @ self.components.real

    @torch.no_grad()
    def reconstruct(self, X: torch.Tensor) -> torch.Tensor:
        """高维数据重建

        Args:
            X (torch.Tensor): 待重建数据,形状应当为 (batch_size, ...)

        Returns:
            torch.Tensor: 重建后数据
        """
        assert len(X.shape) == 2, 'Shape of input is expected to equal to 2!!!'

        return (X @ self.components.real.T) + self.mean

Quick check:

python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.ToTensor()])
# path should point to the root directory that holds the MNIST data
dataset = datasets.MNIST(root=path, train=False, download=False, transform=transform)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
pca = PCA(n_component=8, device=device)
for i, (images, labels) in enumerate(dataloader):
    images = images.to(device)
    print('images', images.shape)
    pca.fit(images)
    pca_image = pca.transform(images)
    print('pca_image', pca_image.shape)
    restruct_image = pca.reconstruct(pca_image)
    print('restruct_image', restruct_image.shape)
    print('===============================')
'''
images torch.Size([64, 1, 28, 28])
pca_image torch.Size([64, 8])
restruct_image torch.Size([64, 784])
'''

2.2、Incremental PCA (IPCA)

2.2.1、Function description

PCA is useful, but it requires keeping the entire dataset in memory, so when the dataset to decompose is too large the memory footprint becomes very high. In that case Incremental Principal Component Analysis (IPCA) is commonly used as a replacement for PCA: it obtains essentially the same result through partial, batched computation.

  • The partial_fit method lets you feed the data in chunks.
  • For sparse matrices or memory-mapped files, numpy.memmap is used.

IPCA builds a low-rank approximation of the input data using an amount of memory that does not depend on the number of input samples. It still depends on the number of input features, but changing the batch size gives control over memory usage.

The class adds a batch_size parameter to control the batches; everything else matches PCA, so it is not repeated here.

2.2.2、Usage

1、partial_fit(X, y=None, check_input=True)

Incremental fit with X. All of X is processed as a single batch.

  • Parameters:

    X: array-like of shape (n_samples, n_features). Training data, where n_samples is the number of samples and n_features is the number of features.

    y: Ignored. Not used, present for API consistency by convention.

    check_input: bool, default=True. Run check_array on X.
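A minimal sketch of feeding data to IncrementalPCA chunk by chunk with partial_fit (the toy data and chunk count are illustrative):

python
import numpy as np
from sklearn.decomposition import IncrementalPCA

X = np.random.rand(1000, 64)             # toy data: 1000 samples, 64 features
ipca = IncrementalPCA(n_components=7)
for chunk in np.array_split(X, 10):      # ten chunks of about 100 samples each
    ipca.partial_fit(chunk)              # each call updates the running estimate
X_reduced = ipca.transform(X)            # shape (1000, 7)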

2.2.3、Example

python
from sklearn.datasets import load_digits
from sklearn.decomposition import IncrementalPCA
from scipy import sparse
X, _ = load_digits(return_X_y=True)
# print('X', X.shape)  # X (1797, 64)
transformer = IncrementalPCA(n_components=7, batch_size=200)
x_ipca = transformer.fit_transform(X)
# print('x_ipca', x_ipca.shape)  # x_ipca (1797, 7)
# print('X[:100, :]', X[:100, :].shape) # X[:100, :] (100, 64)
# # either partially fit on smaller batches of data
transformer.partial_fit(X[:100, :])
# # or let the fit function itself divide the data into batches
X_sparse = sparse.csr_matrix(X)
X_transformed = transformer.fit_transform(X_sparse)
print(X_transformed.shape)  # (1797, 7)

Usage of sparse.csr_matrix
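For context, a tiny sketch of what csr_matrix does (scipy's compressed sparse row format):

python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0]])
X_sparse = sparse.csr_matrix(dense)    # store only the non-zero entries
print(X_sparse.shape, X_sparse.nnz)    # (2, 3) 2
print(X_sparse.toarray())              # convert back to a dense array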

Further reading: seven ways to do dimensionality reduction with sklearn

2.3、SVM

Support Vector Machines (SVM) are a powerful machine learning model that can be used for classification, regression, or anomaly detection. They work by finding a hyperplane in a high-dimensional space that maximizes the margin between classes. scikit-learn provides sklearn.svm.SVC(), which makes it easy to apply SVM for classification.

2.3.1、Function description

sklearn.svm.SVC() has many configurable parameters that help tune the model's performance. The most important ones are listed below, with a short construction sketch after the list:

  • C: penalty parameter of the error term. The larger C is, the heavier the penalty for misclassification and the narrower the margin.
  • kernel: the kernel type used inside the SVM. Options are 'linear', 'poly' (polynomial), 'rbf' (radial basis function), 'sigmoid', 'precomputed', or a callable.
  • degree: if the 'poly' kernel is chosen, this sets the degree of the polynomial.
  • gamma: kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma is set to 'auto', 1/n_features is used.
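A brief sketch combining these parameters (the values are arbitrary and only for illustration):

python
from sklearn import svm

clf_linear = svm.SVC(kernel='linear', C=1.0)           # linear kernel, default penalty
clf_poly = svm.SVC(kernel='poly', degree=3, C=10.0)    # cubic polynomial kernel
clf_rbf = svm.SVC(kernel='rbf', gamma='auto', C=0.5)   # RBF kernel, gamma = 1 / n_features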

2.3.2、Usage

1、Create an SVC instance:

python
from sklearn import svm
clf = svm.SVC()

2、Then train the model with the fit() method:

python
# X_train is the feature data and y_train the label data.
clf.fit(X_train, y_train)

3、Finally, use the predict() method to make predictions:

python
# X_test is the feature data of the test set.
predictions = clf.predict(X_test)

# or use the score method to compute the accuracy directly;
# X_test is the feature data and y_test the labels: this is equivalent to calling
# predict first and then comparing the predictions against y_test
score = clf.score(X_test, y_test)

2.3.3、Examples

2.3.3.1、Iris classification

In this part we show how to classify the iris dataset with svm.SVC(). First, import the required libraries and the dataset:

python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import accuracy_score

# load the iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

Then split the data into a training set and a test set:

python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Create an SVM classifier instance and train it on the training data:

python
clf = svm.SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)

Finally, predict on the test set and compute the accuracy:

python
y_pred = clf.predict(X_test)
print('Accuracy: ', accuracy_score(y_test, y_pred))
'''
Accuracy:  1.0
'''

2.3.3.2、A custom implementation

python
import sklearn
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

class SVM(object):
    def __init__(self, C=1.0):
        self._support_vectors = None
        self.C = C
        self.W = None  # shape = (d,)
        self.b = None
        self.x = None  # shape = (n, d)
        self.y = None  # shape = (n,)
        self.n = 0  # number of samples
        self.d = 0  # feature dimension

    def __decision_function(self, X):
        return X.dot(self.W) + self.b

    def __margin(self, X, y):
        return y * self.__decision_function(X)

    def __cost(self, margin):
        # hinge loss plus L2 regularization on the weights
        return (1 / 2) * self.W.dot(self.W) + self.C * np.sum(np.maximum(0, 1 - margin))

    def fit(self, X, y, lr=1e-3, epochs=500):
        self.n, self.d = X.shape[0], X.shape[1]
        self.W = np.random.rand(self.d)
        self.b = np.random.rand()

        self.x = X
        self.y = y
        losses = []
        for i in range(epochs):
            margin = self.__margin(X, y)
            loss = self.__cost(margin)
            losses.append(loss)

            # samples that violate the margin contribute to the gradient
            misclassified_pts_idx = np.where(margin < 1)[0]
            d_W = self.W - self.C * y[misclassified_pts_idx].dot(X[misclassified_pts_idx])
            self.W = self.W - lr * d_W

            d_b = -self.C * np.sum(y[misclassified_pts_idx])
            self.b = self.b - lr * d_b

            self._support_vectors = np.where(self.__margin(X, y) < 1)[0]

    def predict(self, X):
        return np.sign(self.__decision_function(X))

    def score(self, X, y):
        P = self.predict(X)
        return np.mean(P == y)

    def plotresult(self):
        plt.figure()
        plt.scatter(self.x[:, 0], self.x[:, 1], c=self.y, s=50, cmap=plt.cm.Paired, alpha=0.7)
        ax = plt.gca()
        xlim = ax.get_xlim()
        ylim = ax.get_ylim()

        xx = np.linspace(xlim[0], xlim[1], 30)
        yy = np.linspace(ylim[0], ylim[1], 30)
        XX, YY = np.meshgrid(xx, yy)
        xy = np.stack([XX.ravel(), YY.ravel()], axis=1)
        z = self.__decision_function(xy).reshape(XX.shape)
        ax.contour(XX, YY, z, colors=['r', 'b', 'r'], levels=[-1, 0, 1], alpha=0.5,
                   linestyles=['--', '-', '--'], linewidths=[2.0, 2.0, 2.0])
        ax.scatter(self.x[:, 0][self._support_vectors], self.x[:, 1][self._support_vectors],
                   s=100, linewidth=1, facecolors='r', edgecolors='r')

        plt.show()

if __name__ == '__main__':
    iris = datasets.load_iris()
    X_train, y_target = iris.data[:100, [2, 3]], iris.target[:100]
    y_target[y_target == 0] = -1

    model = SVM(C=0.10)
    model.fit(X_train, y_target)
    print(f'TrainScore: {model.score(X_train, y_target)}')
    model.plotresult()

2.4 Decision trees

2.4.1 Ordinary decision trees
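A minimal sketch using sklearn's DecisionTreeClassifier (the same class that appears in the comparisons below); the dataset and hyperparameters here are only illustrative:

python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=3, random_state=42)  # a shallow tree is easy to inspect
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out split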

2.4.2 Fuzzy decision trees

(1) Introduction

fuzzytree implements fuzzy decision trees, a more advanced variant of the decision tree. GitHub link:

GitHub - balins/fuzzytree: A Fuzzy Decision Tree implementation for Python.

Manual: Welcome to fuzzytree’s documentation! — fuzzytree 0.1.4 documentation

Install the package:

python
pip install fuzzytree

(2) Code walkthrough

1、Import the packages and the dataset (using make_moons)

python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
X, y = make_moons(n_samples=300, noise=0.5, random_state=42)
print("X", X.shape)
print("y", y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print("X_train", X_train.shape)
print("y_train", y_train.shape)
print("X_test", X_test.shape)
print("y_test", y_test.shape)

Output:

X (300, 2)
y (300,)
X_train (240, 2)
y_train (240,)
X_test (60, 2)
y_test (60,)

2、Define and train the models

python
from sklearn.tree import DecisionTreeClassifier
from fuzzytree import FuzzyDecisionTreeClassifier

clf_sk = DecisionTreeClassifier().fit(X_train, y_train)
clf_fuzz = FuzzyDecisionTreeClassifier().fit(X_train, y_train)

3、Evaluate on the test set

python
print(f"fuzzytree: {clf_fuzz.score(X_test, y_test)}")
print(f"  sklearn: {clf_sk.score(X_test, y_test)}")

Output:

fuzzytree: 0.8333333333333334
  sklearn: 0.75

4、Visualize the results

python
from mlxtend.plotting import plot_decision_regions
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
gs = gridspec.GridSpec(2, 2)
fig = plt.figure(figsize=(10,8))
labels = ['Fuzzy Decision Tree', 'sklearn Decision Tree']
for clf, lab, grd in zip([clf_fuzz, clf_sk], labels, [[0, 0], [0, 1]]):
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X_train, y=y_train, clf=clf, legend=2)
    plt.title("%s (train)" % lab)
for clf, lab, grd in zip([clf_fuzz, clf_sk], labels, [[1, 0], [1, 1]]):
    ax = plt.subplot(gs[grd[0], grd[1]])
    fig = plot_decision_regions(X=X_test, y=y_test, clf=clf, legend=2)
    plt.title("%s (test)" % lab)
plt.show()

(3) Full example

Using the make_blobs dataset:

python
import matplotlib.pyplot as plt
from matplotlib import gridspec
from sklearn.svm import SVC
from mlxtend.plotting import plot_decision_regions
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from fuzzytree import FuzzyDecisionTreeClassifier

X, y = make_blobs(n_samples=150, n_features=2,
                  centers=[[0, 5], [10, 20], [20, 5]],
                  cluster_std=[10, 5, 10], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf_fuzz = FuzzyDecisionTreeClassifier().fit(X_train, y_train)
clf_sk = DecisionTreeClassifier().fit(X_train, y_train)
clf_svc = SVC().fit(X_train, y_train)
clf_rf = RandomForestClassifier().fit(X_train, y_train)


print(f"   fuzzytree: {clf_fuzz.score(X_test, y_test)}")
print(f"decisiontree: {clf_sk.score(X_test, y_test)}")
print(f"         svc: {clf_svc.score(X_test, y_test)}")
print(f"randomForest: {clf_rf.score(X_test, y_test)}")

# set up the grid layout and figure size
gs = gridspec.GridSpec(2, 3)  # 2 rows, 3 columns
fig = plt.figure(figsize=(15, 10))  # enlarge the figure to fit more subplots

# classifiers to plot and their labels
classifiers = [clf_fuzz, clf_sk, clf_svc]
labels = ["Fuzzy Decision Tree", "sklearn Decision Tree", "SVC"]

# iterate over the classifiers and draw their decision regions
for i, (clf, label) in enumerate(zip(classifiers, labels)):
    if clf is not None:  # make sure the classifier is available
        # position of the subplot
        row = i // 3  # row index (0 or 1)
        col = i % 3  # column index (0, 1, or 2)

        # decision regions on the training set
        plt.subplot(gs[row, col])
        plot_decision_regions(X=X_train, y=y_train, clf=clf, legend=2 if i == len(classifiers) - 1 else 0)
        plt.title(f"{label} (train)")

        # each classifier needs two subplots (train and test), so the test plot
        # goes in the same column of the other row
        test_row = (row + 1) % 2

        # decision regions on the test set
        plt.subplot(gs[test_row, col])
        plot_decision_regions(X=X_test, y=y_test, clf=clf, legend=2 if i == len(classifiers) - 1 else 0)
        plt.title(f"{label} (test)")

# show the figure
plt.tight_layout()  # space out the subplots to avoid overlap
plt.show()

Output:

   fuzzytree: 0.86
decisiontree: 0.8
         svc: 0.87
randomForest: 0.85

3、Hands-on examples

3.1、MNIST classification with PCA and SVM

1、First, import the required packages

python
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np
from sklearn.svm import SVC

2、Load the dataset

python
mnist_transform = transforms.Compose([transforms.ToTensor()])
mnist_train_dataset = datasets.MNIST(root='./mnist/', train=True, download=True, transform=mnist_transform)
mnist_test_dataset = datasets.MNIST(root='./mnist/', train=False, download=True, transform=mnist_transform)
# print(len(mnist_train_dataset))  # 60000 10000
# batch_size = 1000
mnist_train_dataloader = DataLoader(mnist_train_dataset, batch_size=60000, shuffle=False)
mnist_test_dataloader = DataLoader(mnist_test_dataset, batch_size=1000, shuffle=False)
# print(len(mnist_train_dataloader))  # 1
# device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
x, y = next(iter(mnist_train_dataloader))
print(x.shape, y.shape)
'''
torch.Size([60000, 1, 28, 28]) torch.Size([60000])
'''

3、Train, comparing the accuracy obtained at different reduced dimensionalities

python
feature_ratio = np.linspace(0.5, 0.99, 20)
x_shape = []  # dimensionality of the data after PCA for each retained-variance ratio
scores = []  # test accuracy for each setting
train_images, train_labels = next(iter(mnist_train_dataloader))
test_images, test_labels = next(iter(mnist_test_dataloader))
train_images = train_images.reshape(-1, 28 * 28)
test_images = test_images.reshape(-1, 28 * 28)
for i in feature_ratio:
    pca = PCA(i)  # keep enough components to explain a fraction i of the variance
    pca_train_image = pca.fit_transform(train_images)
    pca_test_image = pca.transform(test_images)
    classifier = SVC(kernel='rbf')
    history = classifier.fit(pca_train_image, train_labels)
    score = classifier.score(pca_test_image, test_labels)
    print(i, score)
    x_shape.append(pca_test_image.shape[1])
    scores.append(score)
print('x_shape', x_shape)
print('scores', scores)
'''
x_shape [11, 12, 14, 15, 17, 19, 22, 24, 27, 31, 35, 40, 46, 54, 64, 78, 99, 132, 193, 331]
scores [0.935, 0.944, 0.954, 0.961, 0.967, 0.966, 0.973, 0.971, 0.971, 0.976, 0.974, 0.974, 0.975, 0.977, 0.979, 0.979, 0.979, 0.977, 0.979, 0.978]
'''

4、Plot the results

python
plt.plot(x_shape,scores)
plt.xlabel('number of features')
plt.ylabel('accuracy')
plt.show()

3.2、CIFAR-10 classification with PCA and SVM (use with caution)

python
import numpy as np
from sklearn.decomposition import PCA
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from sklearn.svm import SVC
import matplotlib.pyplot as plt

mnist_transform = transforms.Compose([transforms.ToTensor()])
# mnist_train_dataset = datasets.MNIST(root='./mnist/', train=True, download=True, transform=mnist_transform)
# mnist_test_dataset = datasets.MNIST(root='./mnist/', train=False, download=True, transform=mnist_transform)
cifar10_train_dataset = datasets.CIFAR10(root='../Train_cifar10/dataset', train=True, download=True, transform=mnist_transform)
cifar10_test_dataset = datasets.CIFAR10(root='../Train_cifar10/dataset', train=False, download=True, transform=mnist_transform)
# print(len(mnist_train_dataset), len(mnist_test_dataset))  # 60000 10000
# print(len(cifar10_train_dataset), len(cifar10_test_dataset))  # 50000 10000
# batch_size = 1000
# mnist_train_dataloader = DataLoader(cifar10_train_dataset, batch_size=60000, shuffle=False)
# mnist_test_dataloader = DataLoader(cifar10_test_dataset, batch_size=1000, shuffle=False)
cifar10_train_dataloader = DataLoader(cifar10_train_dataset, batch_size=50000, shuffle=False)
cifar10_test_dataloader = DataLoader(cifar10_test_dataset, batch_size=10000, shuffle=False)
# print(len(mnist_dataloader))  # 200
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
x, y = next(iter(cifar10_train_dataloader))
# print(x.shape, y)  # torch.Size([50000, 3, 32, 32]) tensor([6, 9, 9,  ..., 9, 1, 1])
# print(device)

feature_ratio = np.linspace(0.5, 0.99, 20)
x_shape = []  # dimensionality of the data after PCA for each setting
scores = []  # test accuracy for each setting
train_images, train_labels = next(iter(cifar10_train_dataloader))
test_images, test_labels = next(iter(cifar10_test_dataloader))
train_images = train_images.reshape(50000, -1).squeeze(1)
test_images = test_images.reshape(10000, -1).squeeze(1)
# print(feature_ratio)
for i in range(2000, 2500, 50):
    pca = PCA(n_components=i)  # reduce to i components
    pca_train_image = pca.fit_transform(train_images)
    pca_test_image = pca.transform(test_images)
    classifier = SVC(kernel='rbf')
    history = classifier.fit(pca_train_image, train_labels)
    score = classifier.score(pca_test_image, test_labels)
    print(i, score)
    x_shape.append(pca_test_image.shape[1])
    scores.append(score)
print('x_shape', x_shape)
print('scores', scores)

plt.plot(x_shape, scores)
plt.xlabel('number of features')
plt.ylabel('accuracy')
plt.savefig('compare_dimension.png')  # save before show, otherwise the saved figure is blank
plt.show()

3.3 Comparing PCA combined with SVM, decision trees, and fuzzytree

python
import torch
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions
from matplotlib import gridspec
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from fuzzytree import FuzzyDecisionTreeClassifier
import numpy as np
from sklearn.svm import SVC

if __name__ == '__main__':
    mnist_transform = transforms.Compose([transforms.ToTensor()])
    mnist_train_dataset = datasets.MNIST(root='/home/wangchangmiao/sxy/TotalDataset/mnist', train=True, download=True, transform=mnist_transform)
    mnist_test_dataset = datasets.MNIST(root='/home/wangchangmiao/sxy/TotalDataset/mnist', train=False, download=True, transform=mnist_transform)
    # print(len(mnist_train_dataset))  # 60000 10000
    # batch_size = 1000
    mnist_train_dataloader = DataLoader(mnist_train_dataset, batch_size=500, shuffle=False)
    mnist_test_dataloader = DataLoader(mnist_test_dataset, batch_size=400, shuffle=False)
    # print(len(mnist_train_dataloader))  # 1
    # device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    x, y = next(iter(mnist_train_dataloader))
    print(x.shape, y.shape)

    x_shape = []  # dimensionality of the data after PCA for each setting
    scores = []  # test accuracy for each setting
    train_images, train_labels = next(iter(mnist_train_dataloader))
    test_images, test_labels = next(iter(mnist_test_dataloader))

    train_images = train_images.reshape(-1, 28 * 28)
    test_images = test_images.reshape(-1, 28 * 28)
    print("train_images", train_images.shape)
    print("test_images", test_images.shape)
    pca = PCA(20)  # reduce to 20 dimensions
    pca_train_image = pca.fit_transform(train_images)
    pca_test_image = pca.transform(test_images)
    print("pca_train_image", pca_train_image.shape)
    print("pca_test_image", pca_test_image.shape)


    clf_fuzz = FuzzyDecisionTreeClassifier().fit(pca_train_image, train_labels)
    clf_sk = DecisionTreeClassifier().fit(pca_train_image, train_labels)
    clf_svc = SVC().fit(pca_train_image, train_labels)

    gs = gridspec.GridSpec(2, 2)
    fig = plt.figure(figsize=(10, 8))
    labels = ["Fuzzy Decision Tree", "sklearn Decision Tree"]
    print(f"   fuzzyTree: {clf_fuzz.score(pca_test_image, test_labels)}")
    print(f"decisionTree: {clf_sk.score(pca_test_image, test_labels)}")
    print(f"         SVC: {clf_svc.score(pca_test_image, test_labels)}")
    print("pca_train_image", type(pca_train_image))
    print("train_labels", type(train_labels))
    print("pca_test_image", type(pca_test_image))
    print("test_labels", type(test_labels))

    # for clf, lab, grd in zip([clf_fuzz, clf_sk], labels, [[0, 0], [0, 1]]):
    #     plt.subplot(gs[grd[0], grd[1]])
    #     plot_decision_regions(X=pca_train_image, y=train_labels.numpy(), clf=clf, legend=2)
    #     plt.title("%s (train)" % lab)
    #
    #     plt.subplot(gs[grd[0] + 1, grd[1]])
    #     plot_decision_regions(X=pca_test_image, y=test_labels.numpy(), clf=clf, legend=2)
    #     plt.title("%s (test)" % lab)
    #
    # plt.show()

Output:

torch.Size([500, 1, 28, 28]) torch.Size([500])
train_images torch.Size([500, 784])
test_images torch.Size([400, 784])
pca_train_image (500, 20)
pca_test_image (400, 20)
   fuzzyTree: 0.75
decisionTree: 0.6
         SVC: 0.8725
pca_train_image <class 'numpy.ndarray'>
train_labels <class 'torch.Tensor'>
pca_test_image <class 'numpy.ndarray'>
test_labels <class 'torch.Tensor'>
<Figure size 1000x800 with 0 Axes>

When PCA reduces the data to 2 dimensions, the decision regions can be plotted:

python
for clf, lab, grd in zip([clf_fuzz, clf_sk], labels, [[0, 0], [0, 1]]):
    plt.subplot(gs[grd[0], grd[1]])
    plot_decision_regions(X=pca_train_image, y=train_labels.numpy(), clf=clf, legend=2)
    plt.title("%s (train)" % lab)

    plt.subplot(gs[grd[0] + 1, grd[1]])
    plot_decision_regions(X=pca_test_image, y=test_labels.numpy(), clf=clf, legend=2)
    plt.title("%s (test)" % lab)

plt.show()