Applied Multivariate Statistical Analysis.ppt


Preface to the 1st Edition

Most of the observable phenomena in the empirical sciences are of a multivariate nature. In financial studies, assets in stock markets are observed simultaneously and their joint development is analyzed to better understand general tendencies and to track indices. The underlying theoretical structure of these and many other quantitative studies of applied sciences is multivariate. This book on Applied Multivariate Statistical Analysis presents the tools and concepts of multivariate data analysis with a strong focus on applications.

The aim of the book is to present multivariate data analysis in a way that is understandable for non-mathematicians and practitioners who are confronted by statistical data analysis. This is achieved by focusing on the practical relevance and through the e-book character of this text. All practical examples may be recalculated and modified by the reader using a standard web browser and without reference to or application of any specific software.

The book is divided into three main parts. The first part is devoted to graphical techniques describing the distributions of the variables involved. The second part deals with multivariate random variables and presents, from a theoretical point of view, distributions, estimators and tests for various practical situations. The last part is on multivariate techniques and introduces the reader to the wide selection of tools available for multivariate data analysis. All data sets are given in the appendix and are downloadable from md-stat. The text contains a wide variety of exercises, the solutions of which are given in a separate textbook. In addition, a full set of transparencies on md-stat is provided, making it easier for an instructor to present the materials in this book. All transparencies contain hyperlinks to the statistical web service so that students and instructors alike may recompute all examples via a standard web browser.

Weeks 1-2
UNIT I: Descriptive Techniques
1 Comparison of Batches
1.1 Boxplots
1.2 Histograms
1.3 Scatterplots
1.4 Data Set: Boston Housing

1 Comparison of Batches

Multivariate statistical analysis is concerned with analyzing and understanding data in high dimensions. We suppose that we are given a set $\{x_i\}_{i=1}^n$ of $n$ observations of a variable vector $X$ in $\mathbb{R}^p$. That is, we suppose that each observation $x_i$ has $p$ dimensions: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, and that it is an observed value of a variable vector $X \in \mathbb{R}^p$. Therefore, $X$ is composed of $p$ random variables: $X = (X_1, X_2, \ldots, X_p)$, where $X_j$, for $j = 1, \ldots, p$, is a one-dimensional random variable.

How do we begin to analyze this kind of data? Before we investigate questions on what inferences we can reach from the data, we should think about how to look at the data. This involves descriptive techniques. Questions that we could answer by descriptive techniques are:
Are there components of $X$ that are more spread out than others?
Are there some elements of $X$ that indicate subgroups of the data?
Are there outliers in the components of $X$?
How “normal” is the distribution of the data?

1.1 Boxplots
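
As a minimal sketch of the quantities a boxplot displays, the Python code below computes the median, the lower and upper fourths $F_L$ and $F_U$, the F-spread $d_F = F_U - F_L$, and the points lying beyond $F_{UL} \pm 1.5\,d_F$ (outside) or $F_{UL} \pm 3\,d_F$ (far outside). It assumes Tukey's depth rule for the fourths; other quartile conventions give slightly different values. The function names and the toy data are illustrative and do not come from the slides.

```python
from math import floor


def value_at_depth(sorted_x, depth, from_top=False):
    """Return the value at a (possibly half-integer) depth, counted from one end."""
    xs = sorted_x[::-1] if from_top else sorted_x
    lo = floor(depth)
    if depth == lo:
        return float(xs[lo - 1])
    # Half-integer depth: average the two neighbouring order statistics.
    return (xs[lo - 1] + xs[lo]) / 2.0


def boxplot_stats(data):
    """Five-number style summary plus outlier flags (assumed depth-based fourths)."""
    x = sorted(data)
    n = len(x)
    median_depth = (n + 1) / 2
    fourth_depth = (floor(median_depth) + 1) / 2

    median = value_at_depth(x, median_depth)
    f_l = value_at_depth(x, fourth_depth)                 # lower fourth F_L
    f_u = value_at_depth(x, fourth_depth, from_top=True)  # upper fourth F_U
    d_f = f_u - f_l                                       # F-spread d_F

    inner = (f_l - 1.5 * d_f, f_u + 1.5 * d_f)  # outside bars
    outer = (f_l - 3.0 * d_f, f_u + 3.0 * d_f)  # far outside bars

    inside = [v for v in x if inner[0] <= v <= inner[1]]
    outside = [v for v in x
               if not (inner[0] <= v <= inner[1]) and outer[0] <= v <= outer[1]]
    far_outside = [v for v in x if v < outer[0] or v > outer[1]]

    return {
        "mean": sum(x) / n,
        "median": median,
        "F_L": f_l,
        "F_U": f_u,
        "d_F": d_f,
        # Whiskers reach the most extreme points still inside the outside bars.
        "whiskers": (min(inside), max(inside)),
        "outside": outside,
        "far_outside": far_outside,
    }


if __name__ == "__main__":
    toy = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 30]  # made-up numbers, not the bank data
    for name, value in boxplot_stats(toy).items():
        print(name, value)
```

In this toy batch the mean is pulled above the median by the two large values, which is the kind of skewness the relative position of the median and mean bars in the box is meant to reveal.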

Summary
The median and mean bars are measures of location.
The relative location of the median (and the mean) in the box is a measure of skewness.
The length of the box and whiskers is a measure of spread.
The length of the whiskers indicates the tail length of the distribution.
The outlying points are marked with one of two symbols, depending on whether they lie outside of $F_{UL} \pm 1.5\,d_F$ or $F_{UL} \pm 3\,d_F$, respectively.
The boxplots do not indicate multimodality or clusters.
If we compare the relative size and location of the boxes, we are comparing distributions.

Reading material

1.2 Histograms
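
A histogram with origin $x_0$ and binwidth $h$ can be sketched in a few lines of Python: each observation is assigned to the bin $[x_0 + (j-1)h,\; x_0 + jh)$ that contains it, and the bin counts are divided by $nh$ so that the bars integrate to one. The function names, the toy sample, and the `sigma` scaling in `optimal_binwidth` are assumptions made for illustration; the binwidth rule used is the normal-reference value $(24\sqrt{\pi}/n)^{1/3}$, which presumes roughly Gaussian data on a unit-variance scale.

```python
from math import floor, pi, sqrt


def bin_index(x, x0, h):
    """Index j of the bin [x0 + (j-1)h, x0 + jh) that contains x."""
    return floor((x - x0) / h) + 1


def histogram_heights(sample, x0, h):
    """Map each non-empty bin index j to its bar height count / (n * h)."""
    n = len(sample)
    counts = {}
    for xi in sample:
        j = bin_index(xi, x0, h)
        counts[j] = counts.get(j, 0) + 1
    return {j: c / (n * h) for j, c in sorted(counts.items())}


def histogram_density(sample, x, x0, h):
    """Histogram estimate at x: height of the bin that x falls into."""
    return histogram_heights(sample, x0, h).get(bin_index(x, x0, h), 0.0)


def optimal_binwidth(n, sigma=1.0):
    """Normal-reference binwidth; the sigma factor is an assumed rescaling."""
    return sigma * (24.0 * sqrt(pi) / n) ** (1.0 / 3.0)


if __name__ == "__main__":
    toy = [0.2, 0.4, 0.6, 0.8, 1.7]  # made-up sample, not the bank data
    h = 0.5
    # Same data, same binwidth, two different origins.
    for x0 in (0.0, -0.25):
        print("x0 =", x0, "heights:", histogram_heights(toy, x0, h))
        print("  estimate at x = 0.1:", histogram_density(toy, 0.1, x0, h))
    print("normal-reference h for n = 100:", optimal_binwidth(100))
```

Running the demo with the two origins gives different bar heights for the same sample and the same $h$, a small-scale version of the origin dependence that makes histograms hard to interpret.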

Histograms are density estimates. A density estimate gives a good impression of the distribution of the data. In contrast to boxplots, density estimates show possible multimodality of the data. The idea is to locally represent the data density by counting the number of observations in a sequence of consecutive intervals (bins) with origin $x_0$. Let $B_j(x_0, h)$ denote the bin of length $h$ which is the element of a bin grid starting at $x_0$:

$$B_j(x_0, h) = [x_0 + (j-1)h,\; x_0 + jh), \quad j \in \mathbb{Z},$$

where $[\cdot, \cdot)$ denotes a left-closed and right-open interval.

If $\{x_i\}_{i=1}^n$ is an i.i.d. sample with density $f$, the histogram is defined as follows:

$$\hat{f}_h(x) = n^{-1} h^{-1} \sum_{j \in \mathbb{Z}} \sum_{i=1}^{n} I\{x_i \in B_j(x_0, h)\}\, I\{x \in B_j(x_0, h)\}. \qquad (1.7)$$

In sum (1.7) the first indicator function $I\{x_i \in B_j(x_0, h)\}$ counts the number of observations falling into bin $B_j(x_0, h)$. The second indicator function $I\{x \in B_j(x_0, h)\}$ is responsible for “localizing” the counts around $x$. The parameter $h$ is a smoothing or localizing parameter and controls the width of the histogram bins. An $h$ that is too large leads to very big blocks and thus to a very unstructured histogram. On the other hand, an $h$ that is too small gives a very variable estimate with many unimportant peaks.

[Figure 1.6: histograms of Diagonal; panels h = 0.1, h = 0.2, h = 0.3, h = 0.4]

The effect of $h$ is given in detail in Figure 1.6. It contains the histogram (upper left) for the diagonal of the counterfeit bank notes for $x_0 = 137.8$ (the minimum of these observations) and $h = 0.1$. Increasing $h$ to $h = 0.2$ and using the same origin, $x_0 = 137.8$, results in the histogram shown in the lower left of the figure. This density histogram is somewhat smoother due to the larger $h$. The binwidth is next set to $h = 0.3$ (upper right). From this histogram, one has the impression that the distribution of the diagonal is bimodal with peaks at about 138.5 and 139.9. The detection of modes requires a fine tuning of the binwidth. Using methods from smoothing methodology one can find an “optimal” binwidth $h$ for $n$ observations:

$$h_{opt} = \left( \frac{24\sqrt{\pi}}{n} \right)^{1/3}.$$

In Figure 1.7, we show histograms with $x_0 = 137.65$ (upper left), $x_0 = 137.75$ (lower left), $x_0 = 137.85$ (upper right), and $x_0 = 137.95$ (lower right). All the graphs have been scaled equally on the y-axis to allow comparison. One sees that, despite the fixed binwidth $h$, the interpretation is not facilitated. The shift of the origin $x_0$ (to 4 different locations) created 4 different histograms. This property of histograms strongly contradicts the goal of presenting data features.

Summary
Modes of the density are detected with a histogram.
Modes correspond to strong peaks in the histogram.
Histograms with the same $h$ need not be identical. They also depend on the origin $x_0$ of the grid.
The influence of the origin $x_0$ is drastic. Changing $x_0$ creates different looking histograms.
The consequence of an $h$ that is too large is an unstructured histogram that is too flat.
A binwidth $h$ that is too small results in an unstable histogram.
There is an “optimal” $h = (24\sqrt{\pi}/n)^{1/3}$.
It is recommended to use averaged histograms. They are kernel densities.

1.4 Scatterplots

Scatterplots are bivariate or trivariate plots of variables against each other. They help us understand relationships among the variables of a data set. A downward-sloping scatter indicates that as we increase the variable on the horizontal axis, the variable on the vertical axis decreases. An analogous statement can be made for upward-sloping scatters.

Figure 1.12 plots the 5th column (upper inner frame) of the bank data against the 6th column (diagonal). The scatter is downward-sloping. As we already know from the previous section on marginal comparison, a good separation between genuine and counterfeit bank notes is visible for the diagonal variable. The sub-cloud in the upper half (circles) of Figure 1.12 corresponds to the true bank notes. As noted before, this separation is not distinct, since the two groups overlap somewhat.

Summary
Scatterplots in two and three dimensions help in identifying separated points, outliers or sub-clusters.
Scatterplots help us in judging positive or negative dependencies.
Draftman scatterplot matrices help detect structures conditioned on values of other variables.
As the brush of a scatterplot matrix moves through a point cloud, we can study conditional dependence.

1.8 Data Set
Boston Housing Data Set

First Step: New Words (160 first-tier high-frequency words)
