Slide 1: Principal Components Analysis

Objectives:
- Understand the principles of principal components analysis (PCA)
- Recognize conditions under which PCA may be useful
- Use the R function princomp to perform a principal components analysis
- Interpret princomp output

Slide 2: Typical Form of Data

A data set in an 8x3 matrix. The rows could be species and the columns sampling sites.

        | 100  97  99 |
        |  96  90  90 |
        |  80  75  60 |
    X = |  75  85  95 |
        |  62  40  28 |
        |  77  80  78 |
        |  92  91  80 |
        |  75  85 100 |

A matrix is often referred to as an nxp matrix (n for the number of rows and p for the number of columns). Our matrix has 8 rows and 3 columns, so it is an 8x3 matrix.

Slide 3: What are Principal Components?

Principal components are linear combinations of the observed variables:

    Y = b1*X1 + b2*X2 + ... + bp*Xp

The coefficients of these principal components are chosen to meet three criteria. What are the three criteria?

Slide 4: What are Principal Components?
The three criteria:
1. There are exactly p principal components (PCs), each being a linear combination of the observed variables.
2. The PCs are mutually orthogonal (i.e., perpendicular and uncorrelated).
3. The components are extracted in order of decreasing variance.

Slide 5: A Simple Data Set

Correlation matrix:
        X      Y
  X     1      1
  Y     1      1

Covariance matrix:
        X      Y
  X     1.000  1.414
  Y     1.414  2.000
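These matrices can be checked in R; a minimal sketch using the five standardized points that reappear in the slide 8 program (x has variance 1, y has variance 2):

  # Correlation and covariance matrices of the simple data set
  x <- c(-1.264911064, -0.632455532, 0, 0.632455532, 1.264911064)
  y <- c(-1.788854382, -0.894427191, 0, 0.894427191, 1.788854382)
  cor(cbind(x, y))   # every entry is 1: X and Y are perfectly correlated
  cov(cbind(x, y))   # variances 1 and 2, covariance sqrt(2) = 1.414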
Slide 6: General Patterns

The total variance is 3 (= 1 + 2). The two variables, X and Y, are perfectly correlated, with all points falling on the regression line. The spatial relationship among the 5 points can therefore be represented by a single dimension. PCA is a dimension-reduction technique. What would happen if we apply PCA to the data?

Slide 7: Graphic PCA

[figure]

Slide 8: R Program

  # Principal Components Analysis
  # entering raw data and extracting PCs
  # from the correlation matrix
  x <- c(-1.264911064, -0.632455532, 0, 0.632455532, 1.264911064)
  y <- c(-1.788854382, -0.894427191, 0, 0.894427191, 1.788854382)
  mydata <- cbind(x, y)
  fit <- princomp(mydata, cor = TRUE)
  summary(fit)                  # print variance accounted for
  loadings(fit)                 # PC loadings
  plot(fit, type = "lines")     # scree plot
  fit$scores                    # the principal component scores
  biplot(fit)

Slide 9: Steps in a PCA
- Have at least two variables
- Generate a correlation or variance-covariance matrix
- Obtain eigenvalues and eigenvectors (this is called an eigenvalue problem, and will be illustrated with a simple numerical example)
- Generate principal component (PC) scores
- Plot the PC scores in the space with reduced dimensions

All these steps can be automated by using R, as in the sketch below.
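A minimal hand-rolled version of these steps with eigen(), using the same data as the slide 8 program (the scores match princomp up to sign; eigen() on cov() uses the n-1 divisor, so its eigenvalues differ from princomp's by a factor of (n-1)/n):

  # PCA step by step, without princomp
  x <- c(-1.264911064, -0.632455532, 0, 0.632455532, 1.264911064)
  y <- c(-1.788854382, -0.894427191, 0, 0.894427191, 1.788854382)
  mydata <- cbind(x, y)                    # step 1: the variables
  S <- cov(mydata)                         # step 2: covariance matrix
  e <- eigen(S)                            # step 3: eigenvalues and eigenvectors
  e$values                                 # 3 and 0: PC1 carries all the variance
  scores <- scale(mydata, center = TRUE, scale = FALSE) %*% e$vectors  # step 4
  plot(scores[, 1], rep(0, 5))             # step 5: one dimension suffices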
Slides 10-12: Covariance or Correlation Matrix?

[figures: abundance of two species, Sp1 and Sp2 (y-axis: Abundance, 0-40), illustrating the choice between the covariance matrix and the correlation matrix]

Slide 13: The Eigenvalue Problem

Start from the covariance matrix A. The eigenvalues are the set of values λ that satisfy the condition |A − λI| = 0. For the covariance matrix of our simple data set this gives (1 − λ)(2 − λ) − 1.414² = 0, i.e., λ² − 3λ = 0, so the resulting eigenvalues are λ1 = 3 and λ2 = 0 (there are n eigenvalues for n variables). The sum of the eigenvalues is equal to the sum of the variances in the covariance matrix (here 3 = 1 + 2). Finding the eigenvalues and eigenvectors is called an eigenvalue problem (or a characteristic value problem).
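The same values can be checked in R, either as roots of the characteristic polynomial or directly with eigen():

  # Eigenvalues of the 2x2 covariance matrix from slide 5
  A <- matrix(c(1, sqrt(2), sqrt(2), 2), nrow = 2)
  # |A - lambda*I| = lambda^2 - trace(A)*lambda + det(A)
  polyroot(c(det(A), -sum(diag(A)), 1))   # roots 0 and 3
  eigen(A)$values                         # same eigenvalues, largest first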
Slide 14: Get the Eigenvectors

An eigenvector is a vector x that satisfies the condition A x = λx. In our case A is a variance-covariance matrix of order 2, and x is a vector specified by x1 and x2.

Slide 15: Get the Eigenvectors

We want to find an eigenvector of unit length, i.e., x1² + x2² = 1. The first eigenvector is the one associated with the largest eigenvalue, so we solve (A − 3I)x = 0, which gives x2 = 1.414·x1; combined with the unit-length constraint, x1² + 2·x1² = 1, so x1 = 0.5774 and x2 = 0.8165.

Slide 16: Get the PC Scores

The PC scores are obtained by multiplying the (already centered) original data (x and y) by the eigenvectors: the first column of scores is the first PC, the second column the second PC. The original data in a two-dimensional space are thereby reduced to one dimension.
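A sketch of that multiplication for the simple data set, using the eigenvector just derived (the data are already centered):

  # PC scores = centered data %*% eigenvectors
  x <- c(-1.264911064, -0.632455532, 0, 0.632455532, 1.264911064)
  y <- c(-1.788854382, -0.894427191, 0, 0.894427191, 1.788854382)
  v1 <- c(1, sqrt(2)) / sqrt(3)   # first eigenvector (0.5774, 0.8165)
  pc1 <- cbind(x, y) %*% v1       # first PC scores: -2.191, -1.095, 0, 1.095, 2.191
  var(pc1)                        # 3, the first eigenvalue: all variance is on PC1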
Slide 17: What Are Principal Components?

Principal components are a new set of variables, which are linear combinations of the observed ones, with these properties:
- Because of the decreasing-variance property, much of the variance (information in the original set of p variables) tends to be concentrated in the first few PCs. This implies that we can drop the last few PCs without losing much information. PCA is therefore considered a dimension-reduction technique.
- Because PCs are orthogonal, they can be used instead of the original variables in situations where having orthogonal variables is desirable (e.g., regression), as sketched below.
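A minimal sketch of that regression use, with a hypothetical response z (not in the original slides) regressed on an orthogonal PC score instead of the correlated predictors:

  # Principal components regression on the simple data set
  x <- c(-1.264911064, -0.632455532, 0, 0.632455532, 1.264911064)
  y <- c(-1.788854382, -0.894427191, 0, 0.894427191, 1.788854382)
  fit <- princomp(cbind(x, y), cor = TRUE)
  z <- c(2.1, 2.9, 5.0, 7.2, 7.8)   # hypothetical response, for illustration only
  summary(lm(z ~ fit$scores[, 1]))  # regress z on the first PC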
Slide 18: Index of Hidden Variables

The ranking of Asian universities by Asiaweek: HKU is ranked second in financial resources, but seventh in academic research. How did HKU get ranked third? Is there a more objective way of ranking? An illustrative example follows.

Slide 19: A Simple Data Set

[table of five schools' scores] School 5 is clearly the best school; School 1 is clearly the worst school.

Slide 20: Graphic PCA

[figure; the axis shows the PC scores -1.7889, -0.8944, 0, 0.8944, 1.7889]
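The school table itself did not survive extraction, so here is a sketch with hypothetical marks for five schools in three subjects; the first PC then serves as an overall-quality index, the "hidden variable" behind the observed scores:

  # Hypothetical school-by-subject marks (illustration only; not the original table)
  schools <- matrix(c(50, 55, 52,
                      60, 62, 58,
                      70, 68, 66,
                      80, 78, 82,
                      92, 95, 90),
                    nrow = 5, byrow = TRUE,
                    dimnames = list(paste("School", 1:5), c("Sub1", "Sub2", "Sub3")))
  pc <- princomp(schools, cor = TRUE)
  pc$scores[, 1]   # ordering along PC1 separates School 1 (worst) from
                   # School 5 (best), up to an arbitrary sign flip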
Slide 21: Crime Data in 50 States

DATA CRIME;
  TITLE 'CRIME RATES PER 100,000 POP BY STATE';
  INPUT STATENAME $ 1-15 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
  CARDS;
Alabama        14.2 25.2  96.8 278.3 1135.5 1881.9  280.7
Alaska         10.8 51.6  96.8 284.0 1331.7 3369.8  753.3
Arizona         9.5 34.2 138.2 312.3 2346.1 4467.4  439.5
Arkansas        8.8 27.6  83.2 203.4  972.6 1862.1  183.4
California     11.5 49.4 287.0 358.0 2139.4 3499.8  663.5
Colorado        6.3 42.0 170.7 292.9 1935.2 3903.2  477.1
Connecticut     4.2 16.8 129.5 131.8 1346.0 2620.7  593.2
Delaware        6.0 24.9 157.0 194.2 1682.6 3678.4  467.0
Florida        10.2 39.6 187.9 449.1 1859.9 3840.5  351.4
Georgia        11.7 31.1 140.5 256.5 1351.1 2170.2  297.9
Hawaii          7.2 25.5 128.0  64.1 1911.5 3920.4  489.4
Idaho           5.5 19.4  39.6 172.5 1050.8 2599.6  237.6
Illinois        9.9 21.8 211.3 209.0 1085.0 2828.5  528.6
Indiana         7.4 26.5 123.2 153.5 1086.2 2498.7  377.4
Iowa            2.3 10.6  41.2  89.8  812.5 2685.1  219.9
Kansas          6.6 22.0 100.7 180.5 1270.4 2739.3  244.3
Kentucky       10.1 19.1  81.1 123.3  872.2 1662.1  245.4
Louisiana      15.5 30.9 142.9 335.5 1165.5 2469.9  337.7
Maine           2.4 13.5  38.7 170.0 1253.1 2350.7  246.9
Maryland        8.0 34.8 292.1 358.9 1400.0 3177.7  428.5
Massachusetts   3.1 20.8 169.1 231.6 1532.2 2311.3 1140.1
Michigan        9.3 38.9 261.9 274.6 1522.7 3159.0  545.5
Minnesota       2.7 19.5  85.9  85.8 1134.7 2559.3  343.1
Mississippi    14.3 19.6  65.7 189.1  915.6 1239.9  144.4
Missouri        9.6 28.3 189.0 233.5 1318.3 2424.2  378.4
Montana         5.4 16.7  39.2 156.8  804.9 2773.2  309.2
Nebraska        3.9 18.1  64.7 112.7  760.0 2316.1  249.1
Nevada         15.8 49.1 323.1 355.0 2453.1 4212.6  559.2
New Hampshire   3.2 10.7  23.2  76.0 1041.7 2343.9  293.4
New Jersey      5.6 21.0 180.4 185.1 1435.8 2774.5  511.5
New Mexico      8.8 39.1 109.6 343.4 1418.7 3008.6  259.5
New York       10.7 29.4 472.6 319.1 1728.0 2782.0  745.8
North Carolina 10.6 17.0  61.3 318.3 1154.1 2037.8  192.1
North Dakota    0.9  9.0  13.3  43.8  446.1 1843.0  144.7
Ohio            7.8 27.3 190.5 181.1 1216.0 2696.8  400.4
Oklahoma        8.6 29.2  73.8 205.0 1288.2 2228.1  326.8
Oregon          4.9 39.9 124.1 286.9 1636.4 3506.1  388.9
Pennsylvania    5.6 19.0 130.3 128.0  877.5 1624.1  333.2
Rhode Island    3.6 10.5  86.5 201.0 1489.5 2844.1  791.4
South Carolina 11.9 33.0 105.9 485.3 1613.6 2342.4  245.1
South Dakota    2.0 13.5  17.9 155.7  570.5 1704.4  147.5
Tennessee      10.1 29.7 145.8 203.9 1259.7 1776.5  314.0
Texas          13.3 33.8 152.4 208.2 1603.1 2988.7  397.6
Utah            3.5 20.3  68.8 147.3 1171.6 3004.6  334.5
Vermont         1.4 15.9  30.8 101.2 1348.2 2201.0  265.2
Virginia        9.0 23.3  92.1 165.7  986.2 2521.2  226.7
Washington      4.3 39.6 106.2 224.8 1605.6 3386.9  360.3
West Virginia   6.0 13.2  42.2  90.9  597.4 1341.7  163.3
Wisconsin       2.8 12.9  52.2  63.7  846.9 2614.2  220.7
Wyoming         5.4 21.9  39.7 173.9  811.6 2772.2  282.0
;
PROC PRINCOMP OUT=CRIMCOMP;
RUN;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
RUN;
PROC GPLOT;
  PLOT PRIN2*PRIN1=STATENAME;
  TITLE2 'PLOT OF THE FIRST TWO PRINCIPAL COMPONENTS';
RUN;
PROC PRINCOMP DATA=CRIME COV OUT=CRIMCOMP;
RUN;
PROC PRINT;
  ID STATENAME;
  VAR PRIN1 PRIN2 MURDER RAPE ROBBERY ASSAULT BURGLARY LARCENY AUTO;
RUN;

/* Add to have a map view */
proc sort data=crimcomp out=crimcomp;
  by STATENAME;
run;
proc sort data=maps.us2 out=mymap;
  by STATENAME;
run;
data both;
  merge mymap crimcomp;
  by STATENAME;
run;
proc gmap data=both;
  id _map_geometry_;
  choro PRIN1 PRIN2 / levels=15;
  /* choro PRIN1 / discrete; */
run;
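For the R workflow used earlier in these slides, a sketch of the same correlation-matrix analysis, assuming the data above have been saved as a tab-delimited file crime.txt with a header row (the file name is an assumption, not part of the original):

  # R analogue of PROC PRINCOMP on the crime data
  crime <- read.delim("crime.txt", row.names = 1)   # assumed local file
  fit <- princomp(crime, cor = TRUE)
  summary(fit)       # proportion of variance per PC
  loadings(fit)      # eigenvectors
  plot(fit$scores[, 1], fit$scores[, 2], type = "n",
       xlab = "PC1", ylab = "PC2")
  text(fit$scores[, 1], fit$scores[, 2], labels = rownames(crime))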
Slide 24: Correlation Matrix

          MURDER    RAPE  ROBBERY  ASSAULT  BURGLARY  LARCENY    AUTO
MURDER    1.0000  0.6012   0.4837   0.6486    0.3858   0.1019  0.0688
RAPE      0.6012  1.0000   0.5919   0.7403    0.7121   0.6140  0.3489
ROBBERY   0.4837  0.5919   1.0000   0.5571    0.6372   0.4467  0.5907
ASSAULT   0.6486  0.7403   0.5571   1.0000    0.6229   0.4044  0.2758
BURGLARY  0.3858  0.7121   0.6372   0.6229    1.0000   0.7921  0.5580
LARCENY   0.1019  0.6140   0.4467   0.4044    0.7921   1.0000  0.4442
AUTO      0.0688  0.3489   0.5907   0.2758    0.5580   0.4442  1.0000

If the variables were not correlated, there would be no point in doing PCA. The correlation matrix is symmetric, so we only need to inspect either the upper or the lower triangular matrix.
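With the crime data frame from the earlier sketch, the same matrix comes from one call:

  # Reproduce the correlation matrix
  round(cor(crime), 4)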
Slide 25: Eigenvalues

        Eigenvalue  Difference  Proportion  Cumulative
PRIN1      4.11496     2.87624    0.587851     0.58785
PRIN2      1.23872     0.51291    0.176960     0.76481
PRIN3      0.72582     0.40938    0.103688     0.86850
PRIN4      0.31643     0.05846    0.045205     0.91370
PRIN5      0.25797     0.03593    0.036853     0.95056
PRIN6      0.22204     0.09798    0.031720     0.98228
PRIN7      0.12406        .       0.017722     1.00000

With seven standardized variables the total variance is 7, so each proportion is the eigenvalue divided by 7 (e.g., 4.11496/7 = 0.5879); the first two PCs already account for about 76% of the variance.
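From the R fit of the crime data, the same table and a scree plot:

  # Eigenvalues and scree plot (fit as in the earlier crime sketch)
  fit$sdev^2                      # eigenvalues of the correlation matrix
  summary(fit)                    # proportions and cumulative proportions
  screeplot(fit, type = "lines")  # look for the "elbow"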
Slide 26: Eigenvectors

           PRIN1    PRIN2    PRIN3    PRIN4    PRIN5    PRIN6    PRIN7
MURDER    0.3002  -0.6291   0.1782  -0.2321   0.5381   0.2591   0.2675
RAPE      0.4317  -0.1694  -0.2441   0.0622   0.1884  -0.7732  -0.2964
ROBBERY   0.3968   0.0422   0.4958  -0.5579  -0.5199  -0.1143  -0.0039
ASSAULT   0.3966  -0.3435  -0.0695   0.6298  -0.5066   0.1723   0.1917
BURGLARY  0.4401   0.2033  -0.2098  -0.0575   0.1010   0.5359  -0.6481
LARCENY   0.3573   0.4023  -0.5392  -0.2348   0.0300   0.0394   0.6016
AUTO      0.2951   0.5024   0.5683   0.4192   0.3697  -0.0572   0.1470

Do these eigenvectors mean anything? All crimes load positively on the first eigenvector, which is therefore interpreted as a measure of the overall crime rate. The second eigenvector has positive loadings on AUTO, LARCENY and ROBBERY and negative loadings on MURDER, ASSAULT and RAPE; it is interpreted as measuring the preponderance of property crime over violent crime.

Slide 27: PC Plot: Crime Data

[figure: the states plotted on the first two principal components (PC1 and PC2); labeled points include North and South Dakota, Nevada, New York, California, Mississippi, Alabama, Louisiana, South Carolina and Maryland]