A simple example of PCA from an overseas lecture.ppt

Uploaded by: 小飞机 | Document ID: 6381235 | Uploaded: 2023-10-22 | Format: PPT | Pages: 45 | Size: 480.50 KB

Principal Component Analysis (PCA)

Data Reduction
- summarization of data with many (p) variables by a smaller set of (k) derived (synthetic, composite) variables

Data Reduction
- "residual" variation is information in A that is not retained in X
- a balancing act between clarity of representation and ease of understanding on one side, and oversimplification (loss of important or relevant information) on the other

Principal Component Analysis (PCA)
- probably the most widely used and well-known of the "standard" multivariate methods
- invented by Pearson (1901) and Hotelling (1933)
- first applied in ecology by Goodall (1954) under the name "factor analysis" ("principal factor analysis" is a synonym of PCA)

Principal Component Analysis (PCA)
- takes a data matrix of n objects by p variables, which may be correlated, and summarizes it by uncorrelated axes (principal components, or principal axes) that are linear combinations of the original p variables
- the first k components display as much as possible of the variation among objects

Geometric Rationale of PCA
- objects are represented as a cloud of n points in a multidimensional space, with one axis for each of the p variables
- the centroid of the points is defined by the mean of each variable
- the variance of each variable is the average squared deviation of its n values around the mean of that variable
- the degree to which the variables are linearly correlated is represented by their covariances
- the objective of PCA is to rigidly rotate the axes of this p-dimensional space to new positions (principal axes) with the following properties:
- the axes are ordered such that principal axis 1 has the highest variance, axis 2 the next highest, ..., and axis p the lowest variance
- the covariance between each pair of principal axes is zero (the principal axes are uncorrelated)

2D Example of PCA
- variables X1 and X2 have positive covariance and each has a similar variance

Configuration Is Centered
- each variable is adjusted to a mean of zero (by subtracting the mean from each value)

Principal Components Are Computed
- PC 1 has the highest possible variance (9.88)
- PC 2 has a variance of 3.03
- PC 1 and PC 2 have zero covariance

The Dissimilarity Measure Used in PCA Is Euclidean Distance
- PCA uses Euclidean distance, calculated from the p variables, as the measure of dissimilarity among the n objects
- PCA derives the best possible k-dimensional (k < p) representation of the Euclidean distances among objects

Generalization to p Dimensions
- in practice nobody uses PCA with only 2 variables; the algebra for finding principal axes readily generalizes to p variables
- PC 1 is the direction of maximum variance in the p-dimensional cloud of points
- PC 2 is the direction of the next highest variance, subject to the constraint that it has zero covariance with PC 1
- PC 3 is the direction of the next highest variance, subject to the constraint that it has zero covariance with both PC 1 and PC 2, and so on, up to PC p
- each principal axis is a linear combination of the original variables:
  PC_i = a_i1 * Y_1 + a_i2 * Y_2 + ... + a_ip * Y_p
  where a_ij is the coefficient for variable j on component i, multiplied by the measured value for variable j
- the PC axes are a rigid rotation of the original variables
- PC 1 is simultaneously the direction of maximum variance and a least-squares "line of best fit" (the squared distances of points away from PC 1 are minimized)
- if we take the first k principal components, they define the k-dimensional "hyperplane of best fit" to the point cloud
- of the total variance of all p variables, PCs 1 to k represent the maximum possible proportion of that variance that can be displayed in k dimensions; i.e. the squared Euclidean distances among points calculated from their coordinates on PCs 1 to k are the best possible representation of their squared Euclidean distances in the full p dimensions

Covariance vs Correlation
- using covariances among variables only makes sense if they are measured in the same units
- even then, variables with high variances will dominate the principal components
- these problems are generally avoided by standardizing each variable to unit variance and zero mean
- the covariances between standardized variables are correlations; after standardization, each variable has a variance of 1.000
- correlations can also be calculated from the variances and covariances: r_jk = c_jk / sqrt(c_jj * c_kk)

The Algebra of PCA
- the first step is to calculate the cross-products matrix of variances and covariances (or correlations) among every pair of the p variables
- this is a square, symmetric matrix; the diagonals are the variances and the off-diagonals are the covariances
- (the example variance-covariance and correlation matrices shown on the slides are not reproduced here)
- in matrix notation this is computed as S = X'X / (n - 1), where X is the n x p data matrix with each variable centered (and also standardized by its SD if correlations are used)
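The centering and cross-products steps above can be sketched in NumPy. The slides' 2D data are not included on this page, so the two covarying variables below are a synthetic stand-in (an assumption for illustration only):

```python
import numpy as np

# Synthetic stand-in for the slides' 2D example (the original data
# are not reproduced on this page): two positively covarying variables.
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 2.0, size=50)
x2 = 0.8 * x1 + rng.normal(0.0, 1.0, size=50)
A = np.column_stack([x1, x2])        # n x p raw data matrix

# Center each variable on zero by subtracting its mean.
X = A - A.mean(axis=0)

# Cross-products matrix S = X'X / (n - 1): variances on the diagonal,
# covariances off the diagonal.
n = X.shape[0]
S = X.T @ X / (n - 1)

# Same matrix from NumPy's built-in covariance routine.
assert np.allclose(S, np.cov(X, rowvar=False))

# The trace of S is the total variance in the data.
total_variance = np.trace(S)
```

To work with correlations instead of covariances, divide each centered column by its standard deviation before forming S.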

Manipulating Matrices
- transposing: changes the columns to rows and the rows to columns
- multiplying matrices: the premultiplicand matrix must have the same number of columns as the postmultiplicand matrix has rows
- example (X and its transpose X'):
  X  = | 10  0  4 |        X' = | 10  7 |
       |  7  1  2 |             |  0  1 |
                                |  4  2 |

The Algebra of PCA
- the sum of the diagonals of the variance-covariance matrix is called the trace
- it represents the total variance in the data
- it is the mean squared Euclidean distance between each object and the centroid in p-dimensional space
- in the 2D example, trace = 12.9091 for the variance-covariance matrix and trace = 2.0000 for the correlation matrix

The Algebra of PCA
- finding the principal axes involves eigenanalysis of the cross-products matrix (S)
- the eigenvalues (latent roots) of S are the solutions (λ) to the characteristic equation |S - λI| = 0
- the eigenvalues λ1, λ2, ..., λp are the variances of the coordinates on each principal component axis
- the sum of all p eigenvalues equals the trace of S (the sum of the variances of the original variables)
- in the example: λ1 = 9.8783, λ2 = 3.0308; note that λ1 + λ2 = 12.9091 = trace

The Algebra of PCA
- each eigenvector consists of p values which represent the "contribution" of each variable to the principal component axis
- eigenvectors are uncorrelated (orthogonal): their cross-products are zero, e.g. 0.7291 * (-0.6844) + 0.6844 * 0.7291 = 0

The Algebra of PCA
- the coordinates of each object i on the kth principal axis, known as the scores on PC k, are computed as Z = XU, where Z is the n x k matrix of PC scores, X is the n x p centered data matrix, and U is the p x k matrix of eigenvectors
- the variance of the scores on each PC axis is equal to the corresponding eigenvalue for that axis
- the eigenvalue represents the variance displayed ("explained" or "extracted") by the kth axis
- the sum of the first k eigenvalues is the variance explained by the k-dimensional ordination
- in the example: λ1 = 9.8783, λ2 = 3.0308, trace = 12.9091, so PC 1 displays ("explains") 9.8783 / 12.9091 = 76.5% of the total variance

The Algebra of PCA
- the cross-products matrix computed among the p principal axes has a simple form: all off-diagonal values are zero (the principal axes are uncorrelated) and the diagonal values are the eigenvalues; this is the variance-covariance matrix of the PC axes

A More Challenging Example
- data from research on habitat definition for the endangered Baw Baw frog, Philoria frosti
- 16 environmental and structural variables measured at each of 124 sites
- a correlation matrix is used because the variables have different units
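As a numerical check of the eigenanalysis described above, one can rebuild the 2x2 variance-covariance matrix from the eigenvalues (9.8783, 3.0308) and the eigenvector values quoted on the slides, then recover them with an eigen-solver. This is only a sketch: the reconstructed S is derived from those rounded slide values, not from the original data.

```python
import numpy as np

# Eigenvalues and eigenvector matrix quoted on the slides for the
# 2D example (eigenvectors as columns of U).
lam = np.array([9.8783, 3.0308])
U = np.array([[0.7291, -0.6844],
              [0.6844,  0.7291]])

# Rebuild the variance-covariance matrix S = U diag(lam) U' from its
# eigen-decomposition, then recover the eigenvalues with eigh.
S = U @ np.diag(lam) @ U.T
evals, evecs = np.linalg.eigh(S)     # eigh returns ascending order
evals = evals[::-1]                  # sort descending, like the slides
evecs = evecs[:, ::-1]

# Sum of eigenvalues equals the trace (total variance).
assert np.isclose(evals.sum(), np.trace(S))

# Proportion of total variance displayed by PC 1: 9.8783/12.9091 = 76.5%.
pc1_share = evals[0] / evals.sum()
print(round(100 * pc1_share, 1))     # -> 76.5
```

Given a centered data matrix X whose covariance matrix is S, the PC scores would then be Z = X @ evecs, and the variance of each column of Z equals the corresponding eigenvalue.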

Eigenvalues
- (the slide's table of eigenvalues is not reproduced here)

Interpreting Eigenvectors
- correlations between the variables and the principal axes are known as loadings
- each element of an eigenvector represents the contribution of a given variable to a component

How Many Axes Are Needed?
- does the (k+1)th principal axis represent more variance than would be expected by chance?
- several tests and rules have been proposed
- a common "rule of thumb" when PCA is based on correlations is that axes with eigenvalues > 1 are worth interpreting

What Are the Assumptions of PCA?
- PCA assumes that relationships among variables are LINEAR: the cloud of points in p-dimensional space has linear dimensions that can be effectively summarized by the principal axes
- if the structure in the data is NONLINEAR (the cloud of points twists and curves its way through p-dimensional space), the principal axes will not be an efficient and informative summary of the data

When Should PCA Be Used?
- in community ecology, PCA is useful for summarizing variables whose relationships are approximately linear, or at least monotonic
- e.g. a PCA of many soil properties might be used to extract a few components that summarize the main dimensions of soil variation
- PCA is generally NOT useful for ordinating community data. Why? Because relationships among species are highly nonlinear

The "Horseshoe" or Arch Effect
- community trends along environmental gradients appear as "horseshoes" in PCA ordinations
- none of the PC axes effectively summarizes the trend in species composition along the gradient
- sample units (SUs) at opposite extremes of the gradient appear relatively close together

Ambiguity of Absence
- (the slide's illustration is not reproduced here)

The "Horseshoe" Effect
- the curvature of the gradient and the degree of infolding of the extremes increase with beta diversity
- PCA ordinations are not useful summaries of community data except when beta diversity is very low
- using correlation generally does better than covariance, because standardization by species improves the correlation between Euclidean distance and environmental distance

What If There Is More Than One Underlying Ecological Gradient?

The "Horseshoe" Effect
- with two or more underlying gradients of high beta diversity, a "horseshoe" is usually not detectable
- the SUs fall on a curved hypersurface that twists and turns through the p-dimensional species space
- the interpretation problems are then even more severe
- PCA should NOT be used with community data (except perhaps when beta diversity is very low)

Impact on Ordination History
- by 1970, PCA was the ordination method of choice for community data
- simulation studies by Swan (1970) and Austin & Noy-Meir (1971) demonstrated the horseshoe effect and showed that the linear assumption of PCA was not compatible with the nonlinear structure of community data
- this stimulated the quest for more appropriate ordination methods
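A small simulation can make the arch visible. The Gaussian species response curves, gradient range, and niche breadth below are all illustrative assumptions in the spirit of those simulation studies, not data from the slides:

```python
import numpy as np

# Illustrative horseshoe/arch simulation: 12 species with (assumed)
# Gaussian response curves along a single environmental gradient,
# sampled at 60 sites.
gradient = np.linspace(0.0, 10.0, 60)
optima = np.linspace(0.0, 10.0, 12)    # species optima along the gradient
sigma = 2.0                            # niche breadth
abundance = np.exp(-(gradient[:, None] - optima[None, :]) ** 2
                   / (2 * sigma ** 2))  # 60 x 12 community matrix

# PCA on the centered community matrix.
X = abundance - abundance.mean(axis=0)
S = X.T @ X / (X.shape[0] - 1)
evals, evecs = np.linalg.eigh(S)
evals, evecs = evals[::-1], evecs[:, ::-1]
scores = X @ evecs                     # site scores on the PC axes

# With a single symmetric gradient, the arch typically shows up on
# PC 2: its scores track a quadratic function of gradient position
# rather than the gradient itself.
quad = (gradient - gradient.mean()) ** 2
arch_r = np.corrcoef(scores[:, 1], quad)[0, 1]
```

Plotting scores[:, 0] against scores[:, 1] for the sites in gradient order shows the characteristic arch, with the gradient extremes bent toward each other.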
