DL_fan's blog: Machine Learning Methods (kNN, Logistic Regression, SVM, Decision Trees)

    I. Fundamentals of Discriminative and Generative Models

    Example: to decide whether a melon is good or bad, the discriminative approach learns a model from historical data, then extracts this melon's features and directly predicts the probability that it is a good melon and the probability that it is a bad one.

    Example: the generative approach first learns a model of good melons from good-melon features, then learns a model of bad melons from bad-melon features. To classify a new melon, extract its features, evaluate their probability under the good-melon model and again under the bad-melon model, and predict whichever class gives the higher probability.

    Example:

    Suppose your task is to identify which language a speech sample belongs to. Someone walks up and says a sentence, and you must decide whether it is Chinese, English, French, and so on. There are two ways to achieve this:

    1. Learn every language. You spend great effort mastering Chinese, English, French, etc.; by "mastering" I mean you know which sounds correspond to which language. Then when someone speaks to you, you know which language it is.

    2. Do not learn each language; learn only the differences between the languages, and then discriminate (classify). That is, you learn that Chinese and English are pronounced differently, and knowing that difference is enough.
    The first approach is the generative method; the second is the discriminative method.

    A generative model is a full joint probability model over all variables, while a discriminative model models only the conditional probability of the target variable given the observed variables. A generative model can therefore simulate (i.e., generate) values of any variable in the model, whereas a discriminative model can only sample the target variable conditioned on the observations. A discriminative model does not model the distribution of the observed variables, so it cannot express more complex relationships between the observed and target variables. For this reason, generative models are better suited to unsupervised tasks such as clustering.
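    As a concrete illustration of the two families (a minimal sketch; scikit-learn and the iris data are stand-in choices, not part of the discussion above), LogisticRegression below is discriminative, modeling P(y|x) directly, while GaussianNB is generative, modeling P(x|y) and P(y) and applying Bayes' rule:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression  # discriminative: models P(y|x)
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB  # generative: models P(x|y) and P(y)

    X_train, X_test, y_train, y_test = train_test_split(
        *load_iris(return_X_y=True), test_size=0.25, random_state=0)

    for model in (LogisticRegression(max_iter=1000), GaussianNB()):
        model.fit(X_train, y_train)
        print(type(model).__name__, 'test accuracy:', model.score(X_test, y_test))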

    Conditional probability: the probability that event A occurs given that event B has occurred, written P(A|B) and read "the probability of A given B".
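    In symbols, the standard definition (assuming P(B) > 0):

    P(A|B) = \frac{P(A \cap B)}{P(B)}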

    Bayes' theorem:

    P(X|Y) = \frac{P(Y|X)\,P(X)}{P(Y)}

    P(X) is the probability that event X occurs, also called the prior probability;

    P(Y|X) is the probability that event Y occurs given that X has occurred, also called the likelihood;

    P(X|Y) is the probability that event X occurs given that Y has occurred, also called the posterior probability.

    Maximum likelihood estimation (MLE) is a method for estimating the parameters of a probabilistic model: it chooses the parameter values that maximize the probability of the observed data.
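    As a standard worked example: for n independent coin flips x_1, ..., x_n ∈ {0, 1} with unknown head probability θ, the likelihood and its maximizer are

    L(\theta) = \prod_{i=1}^{n} \theta^{x_i}(1-\theta)^{1-x_i}, \qquad \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} L(\theta) = \frac{1}{n}\sum_{i=1}^{n} x_i

    i.e., setting the derivative of log L(θ) to zero recovers the sample mean.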


    Conditional probability: here, the probability that a melon is good given that its color is green.

    Prior probability: the probability of the "cause" as revealed by common sense, experience, or statistics; here, the probability that a melon's color is green.

    Posterior probability: inferring the "cause" after observing the "effect"; that is, given that we already know a melon is good, the probability that its color is green. Relating the posterior to the prior is exactly what Bayesian decision theory solves.
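    Plugging illustrative numbers into Bayes' theorem (these values are made up for the example): if P(good) = 0.5, P(green|good) = 0.6, and P(green) = 0.4, then

    P(\text{good}\,|\,\text{green}) = \frac{P(\text{green}\,|\,\text{good})\,P(\text{good})}{P(\text{green})} = \frac{0.6 \times 0.5}{0.4} = 0.75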

    Under the attribute conditional independence assumption, the posterior over multiple attributes factorizes as

    P(c\,|\,\mathbf{x}) = \frac{P(c)\,P(\mathbf{x}\,|\,c)}{P(\mathbf{x})} = \frac{P(c)}{P(\mathbf{x})}\prod_{i=1}^{d} P(x_i\,|\,c)

    where d is the number of attributes and x_i is the value of \mathbf{x} on the i-th attribute.
    Since P(\mathbf{x}) is the same for all classes, the Bayes decision rule with maximum-likelihood-estimated probabilities yields the naive Bayes classifier:

    h_{nb}(\mathbf{x}) = \arg\max_{c}\; P(c)\prod_{i=1}^{d} P(x_i\,|\,c)

    Naive Bayes implemented from scratch:

    #coding:utf-8

    # P(y|x) = [P(x|y)*P(y)] / P(x)

    import numpy as np
    import pandas as pd


    class Naive_Bayes:
        def __init__(self):
            self.classes = None
            self.class_prior = None
            self.prior = None

        # Naive Bayes training: estimate the class priors and the per-attribute
        # class-conditional probabilities by counting
        def nb_fit(self, X, y):
            classes = y[y.columns[0]].unique()
            class_count = y[y.columns[0]].value_counts()
            # class prior P(y)
            class_prior = class_count / len(y)
            print('==class_prior:', class_prior)
            # class-conditional probabilities, i.e. P(x_i = ? | y = ?)
            prior = dict()
            for col in X.columns:
                for j in classes:
                    mask = (y.iloc[:, 0] == j).values
                    p_x_y = X[mask][col].value_counts()
                    for i in p_x_y.index:
                        prior[(col, i, j)] = p_x_y[i] / class_count[j]
            print('==prior:', prior)
            # keep the estimates on the instance so predict() can use them
            self.classes, self.class_prior, self.prior = classes, class_prior, prior
            return classes, class_prior, prior

        # predict a new instance: argmax over y of P(y) * prod_i P(x_i | y)
        def predict(self, X_test):
            res = []
            for c in self.classes:
                p_y = self.class_prior[c]
                p_x_y = 1
                for i in X_test.items():
                    p_x_y *= self.prior[tuple(list(i) + [c])]
                res.append(p_y * p_x_y)
            return self.classes[np.argmax(res)]


    if __name__ == "__main__":
        x1 = [1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
        x2 = ['S', 'M', 'M', 'S', 'S', 'S', 'M', 'M', 'L', 'L', 'L', 'M', 'M', 'L', 'L']
        y = [-1, -1, 1, 1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
        df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})
        print('==df:\n', df)
        X = df[['x1', 'x2']]
        y = df[['y']]
        X_test = {'x1': 2, 'x2': 'S'}

        nb = Naive_Bayes()
        classes, class_prior, prior = nb.nb_fit(X, y)
        print('Predicted class for the test point:', nb.predict(X_test))
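    For this data set (the training pairs match Example 4.1 in Li Hang's Statistical Learning Methods), the test point {'x1': 2, 'x2': 'S'} should come out as -1: P(y=1)P(x1=2|y=1)P(x2=S|y=1) = (9/15)(3/9)(1/9) ≈ 0.022, while P(y=-1)P(x1=2|y=-1)P(x2=S|y=-1) = (6/15)(2/6)(3/6) ≈ 0.067.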


    Naive Bayes classifier with Gaussian likelihoods:

    The naive Bayes classifier adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be mutually independent. In other words, each attribute is assumed to influence the classification result independently of the others.

    Using GaussianNB (Gaussian naive Bayes), the class-conditional probability density is

    p(x_i\,|\,c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}}\exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)

    where \mu_{c,i} and \sigma_{c,i} are the mean and standard deviation of the i-th attribute within class c.

    import math


    class NaiveBayes:
        def __init__(self):
            self.model = None

        # sample mean
        @staticmethod
        def mean(X):
            """Compute the mean.
            Param: X : list or np.ndarray

            Return:
                avg : float
            """
            return sum(X) / float(len(X))

        # standard deviation
        def stdev(self, X):
            """Compute the standard deviation.
            Param: X : list or np.ndarray

            Return:
                res : float
            """
            avg = self.mean(X)
            return math.sqrt(sum([pow(x - avg, 2) for x in X]) / float(len(X)))

        # probability density function
        def gaussian_probability(self, x, mean, stdev):
            """Density of x under the Gaussian with the given mean and standard deviation.
            Parameters:
            ----------
            x : input value
            mean : mean
            stdev : standard deviation

            Return:
                res : float, the density at x
            """
            exponent = math.exp(-(math.pow(x - mean, 2) /
                                  (2 * math.pow(stdev, 2))))
            return (1 / (math.sqrt(2 * math.pi) * stdev)) * exponent

        # process X_train
        def summarize(self, train_data):
            """Compute the mean and standard deviation of every feature for one class.
            Param: train_data : list

            Return : list of (mean, stdev) tuples, one per feature
            """
            return [(self.mean(i), self.stdev(i)) for i in zip(*train_data)]

        # compute mean and standard deviation separately for each class
        def fit(self, X, y):
            labels = list(set(y))
            data = {label: [] for label in labels}
            for f, label in zip(X, y):
                data[label].append(f)
            self.model = {
                label: self.summarize(value) for label, value in data.items()
            }
            print(self.model)  # per-class mean and stdev of every feature
            return 'gaussianNB train done!'

        # compute per-class probabilities
        def calculate_probabilities(self, input_data):
            """Probability of the input under each class's Gaussians.
            Parameter:
            input_data : input sample

            Return:
            probabilities : {label : p}
            """
            # self.model: {0.0: [(5.0, 0.37), (3.42, 0.40)], 1.0: [(5.8, 0.449), (2.7, 0.27)]}
            # input_data: [1.1, 2.2]
            probabilities = {}
            for label, value in self.model.items():
                probabilities[label] = 1
                for i in range(len(value)):
                    mean, stdev = value[i]
                    probabilities[label] *= self.gaussian_probability(
                        input_data[i], mean, stdev)
            return probabilities

        # predicted class
        def predict(self, X_test):
            # e.g. {0.0: 2.97e-27, 1.0: 3.57e-26} -> take the label with the largest probability
            label = sorted(self.calculate_probabilities(X_test).items(), key=lambda x: x[-1])[-1][0]
            return label

        # accuracy on a test set
        def score(self, X_test, y_test):
            right = 0
            for X, y in zip(X_test, y_test):
                label = self.predict(X)
                if label == y:
                    right += 1

            return right / float(len(X_test))


    def test_bayes_model():
        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        iris = load_iris()
        X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
        model = NaiveBayes()
        model.fit(X_train, y_train)
        print(model.predict([4.4, 3.2, 1.3, 0.2]))
        print('test accuracy:', model.score(X_test, y_test))


    if __name__ == '__main__':
        test_bayes_model()
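    As a sanity check, the same task can be run through scikit-learn's built-in GaussianNB, which fits the same per-class Gaussian likelihood model (a minimal sketch; the random split makes the exact score vary):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X_train, X_test, y_train, y_test = train_test_split(
        *load_iris(return_X_y=True), test_size=0.2, random_state=0)
    clf = GaussianNB().fit(X_train, y_train)
    print('sklearn GaussianNB accuracy:', clf.score(X_test, y_test))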

    A Bayesian network example based on pgmpy:

    pgmpy is a Python package for probabilistic graphical models; it implements common models such as Bayesian networks, along with inference methods including Markov chain Monte Carlo.

    The example below is the classic "student" network for the quality of the recommendation letter a student receives. The directed graph is D → G ← I, G → L, I → S (exam difficulty, intelligence, grade, letter, SAT score); the probability tables appear in the code below.

    Code:

    #coding:utf-8
    #git clone https://github.com/pgmpy/pgmpy
    #cd pgmpy
    #python setup.py install
    
    
    from pgmpy.factors.discrete import TabularCPD
    from pgmpy.models import BayesianModel
    
    student_model = BayesianModel([('D', 'G'),
                                   ('I', 'G'),
                                   ('G', 'L'),
                                   ('I', 'S')])
    # grade node
    grade_cpd = TabularCPD(
        variable='G',  # node name
        variable_card=3,  # number of values the node takes
        values=[[0.3, 0.05, 0.9, 0.5],  # the node's probability table
                [0.4, 0.25, 0.08, 0.3],
                [0.3, 0.7, 0.02, 0.2]],
        evidence=['I', 'D'],  # the node's parents
        evidence_card=[2, 2]  # number of values of each parent
    )
    # exam difficulty node
    difficulty_cpd = TabularCPD(
                variable='D',
                variable_card=2,
                values=[[0.6], [0.4]]  # column vector of shape (variable_card, 1)
    )
    # intelligence node
    intel_cpd = TabularCPD(
                variable='I',
                variable_card=2,
                values=[[0.7], [0.3]]
    )
    # recommendation letter node
    letter_cpd = TabularCPD(
                variable='L',
                variable_card=2,
                values=[[0.1, 0.4, 0.99],
                        [0.9, 0.6, 0.01]],
                evidence=['G'],
                evidence_card=[3]
    )
    # SAT score node
    sat_cpd = TabularCPD(
                variable='S',
                variable_card=2,
                values=[[0.95, 0.2],
                        [0.05, 0.8]],
                evidence=['I'],
                evidence_card=[2]
    )
    
    student_model.add_cpds(
        grade_cpd,
        difficulty_cpd,
        intel_cpd,
        letter_cpd,
        sat_cpd
    )
    print(student_model.get_cpds())
    
    
    print('active trail from D:', student_model.active_trail_nodes('D'))
    print('active trail from I:', student_model.active_trail_nodes('I'))
    
    print(student_model.local_independencies('G'))
    
    # print(student_model.get_independencies())
    
    # print(student_model.to_markov_model())
    
    # perform Bayesian inference with variable elimination
    from pgmpy.inference import VariableElimination
    student_infer = VariableElimination(student_model)
    prob_G = student_infer.query(variables=['G'])
    
    print('marginal distribution over grades prob_G:', prob_G)
    
    prob_G = student_infer.query(
                variables=['G'],
                evidence={'I': 1, 'D': 0})
    print('grade distribution for a smart student on an easy exam, prob_G:', prob_G)
    
    # prob_G = student_infer.query(
    #             variables=['G'],
    #             evidence={'I': 0, 'D': 1})
    # print(prob_G)
    
    
    # # generate data
    # import numpy as np
    # import pandas as pd
    #
    # raw_data = np.random.randint(low=0, high=2, size=(1000, 5))
    # data = pd.DataFrame(raw_data, columns=['D', 'I', 'G', 'L', 'S'])
    # data.head()
    #
    #
    # # define the model
    # from pgmpy.models import BayesianModel
    # from pgmpy.estimators import MaximumLikelihoodEstimator, BayesianEstimator
    #
    # model = BayesianModel([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])
    #
    # # fit the model with maximum likelihood estimation
    # model.fit(data, estimator=MaximumLikelihoodEstimator)
    # for cpd in model.get_cpds():
    #     # print the conditional probability distribution
    #     print("CPD of {variable}:".format(variable=cpd.variable))
    #     print(cpd)
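    pgmpy can also return the single most likely assignment directly. A minimal sketch using VariableElimination's map_query on the student_model defined above, with the same evidence as the last query:

    # most likely grade for a smart student (I=1) taking an easy exam (D=0)
    print(student_infer.map_query(variables=['G'], evidence={'I': 1, 'D': 0}))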
    
    
    

    II. Machine Learning

    Detailed post on kNN: https://blog.csdn.net/fanzonghao/article/details/86411102

    Detailed post on decision trees: https://blog.csdn.net/fanzonghao/article/details/85246720

    1. SVM: finding the maximum-margin separator

    Optimum under equality constraints: via Lagrange multipliers.

    Optimum under inequality constraints: via the KKT conditions.

    The final classifier:

    f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_{i=1}^{m}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)

    In other words, C (the penalty coefficient on the slack variables) controls the trade-off: a larger C gives a high-variance, low-bias model that tends to overfit;

    a smaller C gives a low-variance, high-bias model that tends to underfit.
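    The role of C is easiest to see in the standard soft-margin objective, where C weights the total slack:

    \min_{\mathbf{w},\,b,\,\xi}\ \frac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{m}\xi_{i} \quad \text{s.t.} \quad y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0

    As C grows, slack becomes expensive and the model bends to fit every training point; as C shrinks, violations are tolerated and the margin widens.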


    An SVM example implementing the SMO algorithm:

    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_iris
    from sklearn.model_selection import  train_test_split
    import matplotlib.pyplot as plt
    
    
    def create_data():
        iris = load_iris()
        df = pd.DataFrame(iris.data, columns=iris.feature_names)
        df['label'] = iris.target
        df.columns = [
            'sepal length', 'sepal width', 'petal length', 'petal width', 'label'
        ]
        data = np.array(df.iloc[:100, [0, 1, -1]])
        for i in range(len(data)):
            if data[i, -1] == 0:
                data[i, -1] = -1
        # print(data)
        return data[:, :2], data[:, -1]
    
    X, y = create_data()
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    print('==X_train.shape:', X_train.shape)
    print('==y_train.shape:', y_train.shape)
    plt.scatter(X[:50, 0], X[:50, 1], label='-1', color='r')
    plt.scatter(X[50:, 0], X[50:, 1], label='1', color='g')
    plt.legend()
    # plt.show()
    
    # for a linear kernel, w = sum_i alpha_i * y_i * x_i
    class SVM:
        def __init__(self, max_iter=100, kernel='linear'):
            self.max_iter = max_iter
            self._kernel = kernel

        def init_args(self, features, labels):
            self.m, self.n = features.shape  # m samples, n feature dimensions
            self.X = features
            self.Y = labels
            self.b = 0.0
            self.alpha = np.ones(self.m)
            # cache every E_i in a list
            self.E = [self._E(i) for i in range(self.m)]
            # penalty coefficient C on the slack variables
            self.C = 1.0
        def _KKT(self, i):
            y_g = self._g(i) * self.Y[i]
            if self.alpha[i] == 0:
                return y_g >= 1
            elif 0 < self.alpha[i] < self.C:
                return y_g == 1
            else:
                return y_g <= 1
        # g(x): decision-function value for sample x_i (X[i])
        def _g(self, i):
            r = self.b
            for j in range(self.m):
                r += self.alpha[j] * self.Y[j] * self.kernel(self.X[i], self.X[j])
            return r
        # E(i): difference between the prediction g(x_i) and the label y_i
        def _E(self, i):
            return self._g(i) - self.Y[i]
        # kernel function
        def kernel(self, x1, x2):
            if self._kernel == 'linear':
                return sum([x1[k] * x2[k] for k in range(self.n)])
            elif self._kernel == 'poly':
                return (sum([x1[k] * x2[k] for k in range(self.n)]) + 1)**2
            return 0
        def _init_alpha(self):
            # the outer loop first scans samples with 0 < alpha < C and checks the KKT conditions
            index_list = [i for i in range(self.m) if 0 < self.alpha[i] < self.C]
            # then falls back to the rest of the training set
            non_satisfy_list = [i for i in range(self.m) if i not in index_list]
            index_list.extend(non_satisfy_list)
            for i in index_list:
                if self._KKT(i):
                    continue
                E1 = self.E[i]
                # choose j to maximize |E1 - E2|: if E1 is positive take the
                # smallest E, if E1 is negative take the largest
                if E1 >= 0:
                    j = min(range(self.m), key=lambda x: self.E[x])
                else:
                    j = max(range(self.m), key=lambda x: self.E[x])
                return i, j
            # every sample satisfies the KKT conditions: nothing left to optimize
            return None
        def _compare(self, _alpha, L, H):
            if _alpha > H:
                return H
            elif _alpha < L:
                return L
            else:
                return _alpha
        def fit(self, features, labels):
            self.init_args(features, labels)
            for t in range(self.max_iter):
                # pick the pair of multipliers to optimize
                pair = self._init_alpha()
                if pair is None:  # all samples already satisfy KKT: converged
                    break
                i1, i2 = pair
                # clipping bounds L, H that keep alpha2 feasible
                if self.Y[i1] == self.Y[i2]:
                    L = max(0, self.alpha[i1] + self.alpha[i2] - self.C)
                    H = min(self.C, self.alpha[i1] + self.alpha[i2])
                else:
                    L = max(0, self.alpha[i2] - self.alpha[i1])
                    H = min(self.C, self.C + self.alpha[i2] - self.alpha[i1])
                E1 = self.E[i1]
                E2 = self.E[i2]
                # eta = K11 + K22 - 2*K12
                eta = self.kernel(self.X[i1], self.X[i1]) + self.kernel(
                    self.X[i2],
                    self.X[i2]) - 2 * self.kernel(self.X[i1], self.X[i2])
                if eta <= 0:
                    # print('eta <= 0')
                    continue
                alpha2_new_unc = self.alpha[i2] + self.Y[i2] * (
                    E1 - E2) / eta  # corrected to E1 - E2, per Statistical Learning Methods (Li Hang), pp. 130-131
                alpha2_new = self._compare(alpha2_new_unc, L, H)
    
                alpha1_new = self.alpha[i1] + self.Y[i1] * self.Y[i2] * (
                    self.alpha[i2] - alpha2_new)
    
                b1_new = -E1 - self.Y[i1] * self.kernel(self.X[i1], self.X[i1]) * (
                    alpha1_new - self.alpha[i1]) - self.Y[i2] * self.kernel(
                        self.X[i2],
                        self.X[i1]) * (alpha2_new - self.alpha[i2]) + self.b
                b2_new = -E2 - self.Y[i1] * self.kernel(self.X[i1], self.X[i2]) * (
                    alpha1_new - self.alpha[i1]) - self.Y[i2] * self.kernel(
                        self.X[i2],
                        self.X[i2]) * (alpha2_new - self.alpha[i2]) + self.b
    
                if 0 < alpha1_new < self.C:
                    b_new = b1_new
                elif 0 < alpha2_new < self.C:
                    b_new = b2_new
                else:
                    # take the midpoint
                    b_new = (b1_new + b2_new) / 2
                # update the parameters
                self.alpha[i1] = alpha1_new
                self.alpha[i2] = alpha2_new
                self.b = b_new
                self.E[i1] = self._E(i1)
                self.E[i2] = self._E(i2)
            return 'train done!'
    
        def predict(self, data):
            r = self.b
            for i in range(self.m):
                r += self.alpha[i] * self.Y[i] * self.kernel(data, self.X[i])
    
            return 1 if r > 0 else -1
    
        def score(self, X_test, y_test):
            right_count = 0
            for i in range(len(X_test)):
                result = self.predict(X_test[i])
                if result == y_test[i]:
                    right_count += 1
            return right_count / len(X_test)
    
        # def _weight(self):
        #     # linear model
        #     yx = self.Y.reshape(-1, 1) * self.X
        #     self.w = np.dot(yx.T, self.alpha)
        #     return self.w
    
    
    svm = SVM(max_iter=200)
    svm.fit(X_train, y_train)
    score = svm.score(X_test, y_test)
    print('===score:', score)
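    For the linear kernel, the explicit weight vector can be recovered from the trained multipliers, as the commented-out _weight method above suggests:

    \mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i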

    An SVM example on the fruit data set, using scikit-learn:

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from sklearn.svm import SVC
    import matplotlib.patches as mpatches
    from matplotlib.colors import ListedColormap
    
    def plot_class_regions_for_classifier(clf, X, y, X_test=None, y_test=None, title=None,
                                          target_names=None, plot_decision_regions=True):
        """
            Visualize the classifier's decision regions on the data.
            Only usable for data with two-dimensional features.
        """
    
        num_classes = np.amax(y) + 1
        color_list_light = ['#FFFFAA', '#EFEFEF', '#AAFFAA', '#AAAAFF']
        color_list_bold = ['#EEEE00', '#000000', '#00CC00', '#0000CC']
        cmap_light = ListedColormap(color_list_light[0:num_classes])
        cmap_bold = ListedColormap(color_list_bold[0:num_classes])
    
        h = 0.03
        k = 0.5
        x_plot_adjust = 0.1
        y_plot_adjust = 0.1
        plot_symbol_size = 50
    
        x_min = X[:, 0].min()
        x_max = X[:, 0].max()
        y_min = X[:, 1].min()
        y_max = X[:, 1].max()
        x2, y2 = np.meshgrid(np.arange(x_min-k, x_max+k, h), np.arange(y_min-k, y_max+k, h))
    
        P = clf.predict(np.c_[x2.ravel(), y2.ravel()])
        P = P.reshape(x2.shape)
        plt.figure()
        if plot_decision_regions:
            plt.contourf(x2, y2, P, cmap=cmap_light, alpha=0.8)
    
        plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, s=plot_symbol_size, edgecolor='black')
        plt.xlim(x_min - x_plot_adjust, x_max + x_plot_adjust)
        plt.ylim(y_min - y_plot_adjust, y_max + y_plot_adjust)
    
        if X_test is not None:
            plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cmap_bold, s=plot_symbol_size,
                        marker='^', edgecolor='black')
            train_score = clf.score(X, y)
            test_score = clf.score(X_test, y_test)
            title = title + "\nTrain score = {:.2f}, Test score = {:.2f}".format(train_score, test_score)
    
        if target_names is not None:
            legend_handles = []
            for i in range(0, len(target_names)):
                patch = mpatches.Patch(color=color_list_bold[i], label=target_names[i])
                legend_handles.append(patch)
            plt.legend(loc=0, handles=legend_handles)
    
        if title is not None:
            plt.title(title)
        plt.show()
    
    # load the data set
    fruits_df = pd.read_table('fruit_data_with_colors.txt')
    
    X = fruits_df[['width', 'height']]
    y = fruits_df['fruit_label'].copy()
    
    # set every non-apple label to 0: a binary task, apple vs. not apple
    y[y != 1] = 0
    # split the data set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
    print(y_test.shape)
    # different values of C
    c_values = [0.0001, 1, 100]
    
    for c_value in c_values:
        # build the model
        svm_model = SVC(C=c_value, kernel='rbf')

        # train the model
        svm_model.fit(X_train, y_train)

        # evaluate on the test set
        y_pred = svm_model.predict(X_test)

        acc = accuracy_score(y_test, y_pred)
        print('C={}, accuracy: {:.3f}'.format(c_value, acc))

        # visualize the decision regions
        plot_class_regions_for_classifier(svm_model, X_test.values, y_test.values, title='C={}'.format(c_value))

    The RBF kernel here is a two-dimensional Gaussian.

    Replacing the kernel with 'linear' yields straight-line decision boundaries instead.
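    A minimal sketch of that swap, reusing X_train, X_test, y_train, y_test and the plotting helper from the example above:

    # same pipeline as above, but with a linear kernel
    svm_model = SVC(C=1, kernel='linear')
    svm_model.fit(X_train, y_train)
    print('linear kernel accuracy: {:.3f}'.format(svm_model.score(X_test, y_test)))
    plot_class_regions_for_classifier(svm_model, X_test.values, y_test.values, title="kernel='linear'")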

    2. Ensemble Learning

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler


    def load_data():
        # load the data set
        fruits_df = pd.read_table('fruit_data_with_colors.txt')
        print('number of samples:', len(fruits_df))
        # build a dict mapping target labels to fruit names
        fruit_name_dict = dict(zip(fruits_df['fruit_label'], fruits_df['fruit_name']))

        # split the data set
        X = fruits_df[['mass', 'width', 'height', 'color_score']]
        y = fruits_df['fruit_label']

        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/4, random_state=0)
        print('total samples: {}, training samples: {}, test samples: {}'.format(len(X), len(X_train), len(X_test)))
        return X_train, X_test, y_train, y_test

    # feature normalization
    def minmax_scaler(X_train, X_test):
        scaler = MinMaxScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        # the scaler has already learned the training min/max,
        # so the test set is only transformed, never fitted
        X_test_scaled = scaler.transform(X_test)

        for i in range(4):
            print('before scaling, training feature {}: max {:.3f}, min {:.3f}'.format(
                i + 1, X_train.iloc[:, i].max(), X_train.iloc[:, i].min()))
            print('after scaling, training feature {}: max {:.3f}, min {:.3f}'.format(
                i + 1, X_train_scaled[:, i].max(), X_train_scaled[:, i].min()))
        return X_train_scaled, X_test_scaled
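    A short driver for these helpers (a sketch; it assumes the same fruit_data_with_colors.txt file as the SVM example above):

    if __name__ == '__main__':
        X_train, X_test, y_train, y_test = load_data()
        X_train_scaled, X_test_scaled = minmax_scaler(X_train, X_test)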