Artificial Intelligence 3: Bayesian Statistical Machine Learning (2), lecture slides (.pptx)

Slide deck, 78 pages.

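Bayesian Statistical Machine Learning (2)

Outline
- Bayesian decision theory
- Approaches to machine learning
- Machine learning problem examples
- Main machine learning models
  - Linear regression model
  - Logistic regression model
  - Neural network model
  - Support vector machine model

Reference Reading
Pattern Recognition and Machine Learning:
- Chapter 1: 1.2 Probability Theory, 1.5 Decision Theory
- Chapter 3: 3.1
- Chapter 4: 4.3 (4.3.1, 4.3.2)
- Chapter 5: 5.1, 5.2, 5.3

Basic Concepts
- Training set: $x_1, \dots, x_N$
- Target vector: $\mathbf{t}$
- Mapping function: $y(x)$
- Generalization (inferring new cases from known ones)
- Model evaluation and model selection
- Regularization and cross-validation
- Classification
- Regression
- Reinforcement learning

Basic Problems of Pattern Recognition and Machine Learning
- Supervised learning: classification, regression
- Input variable: $x$
- Target variable: $t$
- Given: training samples $(x, t)$
- Goal: learn the functional relationship between $x$ and $t$, so that $t$ can be predicted for a given $x$

Basic Problems of Pattern Recognition and Machine Learning (continued)
- Learn from data
- Algorithm: explains the data
- Result: predicts data
- Measure of an algorithm: generalization ability
- Guiding principle: fit the training data with the simplest model
- Represent the data with a function or another model

Polynomial Curve Fitting
- Problem description
  - Input variable: $x$
  - Target variable: $t$
  - Generating process of the data (unknown in practical problems)
  - Given: training samples $(x, t)$
- Goal: given a new $x$, predict the corresponding value of $t$
- Linear model: $y(x, \mathbf{w}) = w_0 + w_1 x + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j$
- Use the training samples to estimate the model parameters $\mathbf{w}$
- Method: minimize the sum of squared errors

Sum-of-Squares Error Function
$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2$

Figures: fits with 0th-, 1st-, 3rd-, and 9th-order polynomials.

Model Evaluation and Model Selection
- Polynomial curve fitting: which fit is best?
- Training error
- Test error
- Over-fitting
- Root-mean-square (RMS) error

Over-fitting
- Root-mean-square (RMS) error: $E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{\star}) / N}$

Factors Related to Over-fitting
- Model complexity (figure: table of polynomial coefficients)
- Number of training samples (figures: 9th-order polynomial fits for two data set sizes; the sizes were lost in extraction)
- Learning method
  - Maximum likelihood
  - Bayesian methods

Regularization and Cross-Validation
- Regularization: penalize large coefficient values
  $\tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \|\mathbf{w}\|^2$
- Figures: fits under different regularization strengths; error versus regularization strength; table of polynomial coefficients (values lost in extraction)
- Cross-validation
  - Training set: used to train the model
  - Validation set: used for model selection
  - Test set: used for the final evaluation of the learning method
- Simple cross-validation
- S-fold cross-validation
- Leave-one-out cross-validation

Classification Problems
- Evaluation metrics for binary classification
  - TP: true positive
  - FN: false negative
  - FP: false positive
  - TN: true negative
- Precision: $P = \mathrm{TP} / (\mathrm{TP} + \mathrm{FP})$
- Recall: $R = \mathrm{TP} / (\mathrm{TP} + \mathrm{FN})$
- F1 score: $F_1 = 2PR / (P + R)$

Regression Problems
- A regression model is a function that represents the mapping from input variables to output variables.
- Learning a regression problem is equivalent to function fitting.
- Two stages: learning and prediction
- Training set: pairs of inputs and outputs

Regression Problems
- Example: each word is tagged as the beginning, the end, or other with respect to a noun phrase (written B, E, and O respectively).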

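- Input: At Microsoft Research, we have an insatiable curiosity and the desire to create new technology that will help define the computing experience.
- Output: At/O Microsoft/B Research/E, we/O have/O an/O insatiable/B curiosity/E and/O the/O desire/BE to/O create/O new/B technology/E that/O will/O help/O define/O the/O computing/B experience/E.

Linear Basis Function Models (1)
- Example: polynomial curve fitting, $y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j$

Linear Basis Function Models (2)
- Generally, $y(\mathbf{x}, \mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x})$, where $\phi_j(\mathbf{x})$ are known as basis functions.
- Typically, $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias.
- In the simplest case, we use linear basis functions: $\phi_d(\mathbf{x}) = x_d$.

Linear Basis Function Models (3)
- Polynomial basis functions: $\phi_j(x) = x^j$
- These are global; a small change in $x$ affects all basis functions.

Linear Basis Function Models (4)
- Gaussian basis functions: $\phi_j(x) = \exp\left\{ -\frac{(x - \mu_j)^2}{2 s^2} \right\}$
- These are local; a small change in $x$ only affects nearby basis functions.
- $\mu_j$ and $s$ control location and scale (width).

Linear Basis Function Models (5)
- Sigmoidal basis functions: $\phi_j(x) = \sigma\left( \frac{x - \mu_j}{s} \right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$
- These are also local; a small change in $x$ only affects nearby basis functions.
- $\mu_j$ and $s$ control location and scale (slope).

Fixed Basis Functions 1
- Figure: two Gaussian basis functions $\phi_1(\mathbf{x})$ and $\phi_2(\mathbf{x})$

Fixed Basis Functions 2
- Figure: two Gaussian basis functions $\phi_1(\mathbf{x})$ and $\phi_2(\mathbf{x})$

Logistic Regression
- Logistic regression
- Adjustable parameters for an $M$-dimensional feature space
  - Gaussian class-conditional model: $M(M+5)/2 + 1$
  - Logistic regression: $M$
- Logistic sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$
- Normalized exponential (softmax function): $p(C_k \mid \boldsymbol{\phi}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$

Supplement: Cross-Entropy Loss Function
Squared loss is comparatively too strict; a measure better suited to the difference between two probability distributions can be used instead. Cross-entropy is a commonly used one:
$H(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}) = -\sum_{j} y_j^{(i)} \log \hat{y}_j^{(i)}$
Because only the $y^{(i)}$-th element of the vector $\mathbf{y}^{(i)}$ is 1 and all others are 0, this reduces to $H(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}) = -\log \hat{y}_{y^{(i)}}^{(i)}$. Suppose the training data set has $n$ samples. The cross-entropy loss function is then defined as
$\ell(\boldsymbol{\Theta}) = \frac{1}{n} \sum_{i=1}^{n} H(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)})$
where $\boldsymbol{\Theta}$ denotes the model parameters.

Likewise, if each sample has only one label, the cross-entropy loss can be abbreviated as $\ell(\boldsymbol{\Theta}) = -\frac{1}{n} \sum_{i=1}^{n} \log \hat{y}_{y^{(i)}}^{(i)}$. From another angle, minimizing $\ell(\boldsymbol{\Theta})$ is equivalent to maximizing $\exp(-n \ell(\boldsymbol{\Theta})) = \prod_{i=1}^{n} \hat{y}_{y^{(i)}}^{(i)}$. That is, minimizing the cross-entropy loss is equivalent to maximizing the joint predicted probability of all label classes in the training data set.

KL Divergence (Kullback-Leibler divergence)
If we have two separate probability distributions $P(x)$ and $Q(x)$ over the same random variable $x$, the KL divergence measures how different the two distributions are:
$D_{\mathrm{KL}}(P \| Q) = \mathbb{E}_{x \sim P}\left[ \log P(x) - \log Q(x) \right]$
A quantity closely related to the KL divergence is the cross-entropy, $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \| Q)$. It resembles the KL divergence but lacks the left-hand term:
$H(P, Q) = -\mathbb{E}_{x \sim P} \log Q(x)$

Supplement: Information Gain in Decision Trees
- Example
- Information gain ratio

Neurons

Artificial Neurons
- Node
- The function realized by this model is exactly the linear classifier mentioned earlier.
- Nonlinear mapping unit

Feed-forward Network Functions 1
- Training the basis functions

How Artificial Neural Networks Work
- A more complex discriminant function divides the feature space into two regions.
- The boundary is a polyline made of two rays.
- $y = 1$ on one side of the polyline and $y = 0$ on the other.
- Clearly, a single neuron cannot do this.

How Artificial Neural Networks Work
- A more complex discriminant function
- The whole space is divided into four regions according to the signs of the two function values.
- The region $y = 0$ is the one where both function values are less than zero.
- An additional logical operation is needed to solve the problem.
- The three operations can be implemented with three neuron nodes.

How Artificial Neural Networks Work
- A more complex discriminant function
- "Whereas a two-layer network classifier can only implement a linear decision boundary, given an adequate number of hidden units, three-, four- and higher-layer networks can implement arbitrary decision boundaries. The decision regions need not be convex or simply connected."
- From: Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. Copyright (c) 2001 by John Wiley & Sons, Inc.

Parameter Optimization

Local Quadratic Approximation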
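The formula for this heading did not survive extraction. The following is a sketch of the expansion the heading usually denotes, assuming the slide follows Section 5.2 of Pattern Recognition and Machine Learning, which the deck lists as reference reading. The symbols $\hat{\mathbf{w}}$ (expansion point), $\mathbf{b}$ (gradient), and $\mathbf{H}$ (Hessian) are taken from that textbook, not from the slide text.

```latex
E(\mathbf{w}) \simeq E(\hat{\mathbf{w}})
  + (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{b}
  + \frac{1}{2} (\mathbf{w} - \hat{\mathbf{w}})^{\mathrm{T}} \mathbf{H} (\mathbf{w} - \hat{\mathbf{w}}),
\qquad
\mathbf{b} \equiv \nabla E \big|_{\mathbf{w} = \hat{\mathbf{w}}},
\quad
(\mathbf{H})_{ij} \equiv \frac{\partial^2 E}{\partial w_i \, \partial w_j} \bigg|_{\mathbf{w} = \hat{\mathbf{w}}}
```

With $W$ weights, $\mathbf{b}$ has $W$ elements and the symmetric matrix $\mathbf{H}$ has $W(W+1)/2$ independent elements, which gives $W(W+3)/2$ in total.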

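Use of Gradient Information
- Independent elements of the local quadratic approximation: $W(W+3)/2$ in total, which is $O(W^2)$
- Without gradient information: $O(W^2)$ function evaluations, each taking $O(W)$ steps, so $O(W^3)$ overall
- With gradient information: $O(W)$ gradient evaluations, each taking $O(W)$ steps, so $O(W^2)$ overall

Gradient Descent Optimization
- Batch methods
  - Gradient descent (steepest descent)
  - Conjugate gradients
  - Quasi-Newton methods
- On-line methods
  - Sequential gradient descent, or stochastic gradient descent

Error Backpropagation
- First stage: evaluate the derivatives of the error function with respect to the weights. This stage also applies to other kinds of network.
- Second stage: use the derivatives to compute the adjustments to the weights.

Evaluation of Error-Function Derivatives 1
- Error function: $E(\mathbf{w}) = \sum_{n=1}^{N} E_n(\mathbf{w})$
- Forward propagation: each unit computes a weighted sum of its inputs, $a_j = \sum_i w_{ji} z_i$
- Nonlinear activation function: $z_j = h(a_j)$

Evaluation of Error-Function Derivatives 2
- The derivative of $E_n$ with respect to a weight $w_{ji}$: $\frac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$, where $\delta_j \equiv \frac{\partial E_n}{\partial a_j}$
- For the output units: $\delta_k = y_k - t_k$

Evaluation of Error-Function Derivatives 3
- For hidden units: $\delta_j = h'(a_j) \sum_k w_{kj} \delta_k$

Evaluation of Error-Function Derivatives 4

Error Backpropagation

Maximum Margin Classifiers 1
- The two-class classification problem: $y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}) + b$
- Training data set: $N$ input vectors $\mathbf{x}_1, \dots, \mathbf{x}_N$ with target values $t_1, \dots, t_N$, where $t_n \in \{-1, 1\}$
- New data points $\mathbf{x}$ are classified according to the sign of $y(\mathbf{x})$.
- Assume the data are linearly separable: $t_n y(\mathbf{x}_n) > 0$ for all $n$

Maximum Margin Classifiers 2
- Many such solutions exist.
- The perceptron algorithm finds a solution in a finite number of steps. The solution depends on the (arbitrary) initial values chosen for $\mathbf{w}$ and $b$ and on the order in which the data points are presented.
- We should try to find the solution with the smallest generalization error.
- The support vector machine uses the concept of the margin, defined to be the smallest distance between the decision boundary and any of the samples.

Maximum Margin Classifiers 3
- Figures: maximum margin, showing the decision boundary $y = 0$ and the margin boundaries $y = 1$ and $y = -1$

Maximum Margin Classifiers 4
- The perpendicular distance of a point $\mathbf{x}$ from a hyperplane $y(\mathbf{x}) = 0$: $|y(\mathbf{x})| / \|\mathbf{w}\|$
- The distance of a point $\mathbf{x}_n$ to the decision surface: $\frac{t_n y(\mathbf{x}_n)}{\|\mathbf{w}\|} = \frac{t_n (\mathbf{w}^{\mathrm{T}} \boldsymbol{\phi}(\mathbf{x}_n) + b)}{\|\mathbf{w}\|}$

Reflection and Discussion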
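As a concrete check of the point-to-hyperplane distance $|y(\mathbf{x})| / \|\mathbf{w}\|$ and of the margin as the smallest such distance over the samples, here is a minimal Python sketch. It assumes the identity feature map $\boldsymbol{\phi}(\mathbf{x}) = \mathbf{x}$. The function names and the toy data are hypothetical and do not come from the slides.

```python
import math


def decision_value(w, b, x):
    """y(x) = w^T x + b, using phi(x) = x (an assumption of this sketch)."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


def distance_to_hyperplane(w, b, x):
    """Perpendicular distance |y(x)| / ||w|| from x to the hyperplane y(x) = 0."""
    norm_w = math.sqrt(sum(w_i * w_i for w_i in w))
    return abs(decision_value(w, b, x)) / norm_w


def margin(w, b, points, targets):
    """Smallest t_n * y(x_n) / ||w|| over the samples.

    The result is positive only if (w, b) separates the data,
    that is, t_n * y(x_n) > 0 for every sample.
    """
    norm_w = math.sqrt(sum(w_i * w_i for w_i in w))
    return min(t * decision_value(w, b, x) for x, t in zip(points, targets)) / norm_w


# Hypothetical toy data: a hyperplane in 2-D and three labelled points.
w = [3.0, 4.0]
b = -5.0
points = [[3.0, 4.0], [0.0, 0.0], [1.0, 3.0]]
targets = [1, -1, 1]

distances = [distance_to_hyperplane(w, b, x) for x in points]
toy_margin = margin(w, b, points, targets)
print(distances)
print(toy_margin)
```

The sketch only evaluates the margin for a fixed $(\mathbf{w}, b)$. Finding the $(\mathbf{w}, b)$ that maximizes it is the optimization problem a support vector machine solves, which is not attempted here.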
