《华语料库session(外语学习)课件.ppt》 (corpus session courseware for foreign language learning, 38 pages)
Making statistical claims
Corpus Linguistics
Richard X

Update on assignments
- Deadline for submission (email submission): TBA
- The Harvard referencing style
- Assignment A
  - Corpus study: introduction; synopsis / overview, critical review of data, method of analysis, conclusion etc.; conclusions; bibliography
    - CL2005: http://www.corpus.bham.ac.uk/pclc/index.shtml
    - CL2007: http://www.corpus.bham.ac.uk/conference/proceedings.shtml
    - UCCTS2008: http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2008Proceedings/
    - UCCTS2010: http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/
  - Corpus tool: introduction; description of the tool, its main features and functions; your critical evaluation of the tool (how well it does the jobs it is supposed to do, user interface, power, etc.); conclusions; bibliography
- Assignment B
  - Introduction; literature review; methodology; results and discussions; conclusions; bibliography
- Option B: a 3,500-word essay, similar to Assignment B

Outline of the session
- Lecture
  - Raw and normalised frequency
  - Descriptive statistics (mean, mode, median, measure of dispersion)
  - Inferential statistics (chi-squared, LL, Fisher's Exact tests)
  - Collocation statistics
- Lab
  - UCREL online LL calculator
  - Xu's LL calculator
  - SPSS

Quantitative analysis
- Corpus analysis is both qualitative and quantitative
- One of the advantages of corpora is that they can readily provide quantitative data which intuitions cannot provide reliably
- "The use of quantification in corpus linguistics typically goes well beyond simple counting" (McEnery and Wilson 2001: 81)
- What can we do with those numbers and counts?

Raw frequency
- The arithmetic count of the occurrences of a linguistic feature (a word, a structure etc.)
- The most direct quantitative data provided by a corpus
- Frequency itself does NOT tell you much in terms of the validity of a hypothesis
- There are 250 instances of the f*k swearword in the spoken BNC, so what?
- Does this mean that people swear frequently or infrequently when they speak?
Normalized frequency
- ... in relation to what?
- Corpus analysis is inherently comparative
- There are 250 instances of the swearword in the spoken BNC and 500 instances in the written BNC
- Do people swear twice as often in writing as in speech?
- Remember the written BNC is 9 times as large as the spoken BNC
- When comparing corpora of different sizes, we need to normalize the frequencies to a common base (e.g. per million tokens)
- Normalised freq = raw freq / token number * common base (below, token numbers and the common base are both given in millions)
  - Swearword in spoken BNC = 250 / 10 * 1 = 25 per million tokens
  - Swearword in written BNC = 500 / 90 * 1 = 6 per million tokens (5.56 before rounding)
- The swearword is about 4 times as frequent in speech as in writing
- ... but is this difference statistically significant?

Normalized frequency
- The size of a sample may affect the level of statistical significance
- Tips for normalizing frequency data
  - The common base for normalization must be comparable to the sizes of the corpora
  - Normalizing the spoken vs. written BNC to a common base of 1000 tokens?
- Warning
  - Results obtained on an irrationally enlarged or reduced common base are distorted

Descriptive statistics
- Frequencies are a type of descriptive statistics
- Descriptive statistics are used to describe a dataset
- A group of ten students took a test and their scores are as follows: 4, 5, 6, 6, 7, 7, 7, 9, 9, 10
- How will you report the measure of central tendency of this group of test results using a single score?
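The normalisation rule from the Normalized frequency slides can be sketched in Python as follows. This is only an illustration: the helper name `normalised_freq` is made up here, and the corpus sizes (10 million tokens for the spoken BNC, 90 million for the written BNC) are the rounded figures implied by the arithmetic on the slide.

```python
def normalised_freq(raw_freq, corpus_tokens, base=1_000_000):
    """Return the frequency per `base` tokens (hypothetical helper)."""
    # Multiply before dividing so that simple cases stay exact.
    return raw_freq * base / corpus_tokens


# Rounded corpus sizes assumed from the slide: 10M spoken, 90M written tokens.
spoken = normalised_freq(250, 10_000_000)
written = normalised_freq(500, 90_000_000)

print(f"spoken:  {spoken:.2f} per million tokens")
print(f"written: {written:.2f} per million tokens")
print(f"spoken / written ratio: {spoken / written:.1f}")

# The same data on a base of 1,000 tokens gives very small values,
# which is the kind of ill-suited common base the slides warn about.
print(f"spoken per 1,000 tokens: {normalised_freq(250, 10_000_000, base=1_000)}")
```

With these assumed sizes the spoken figure is 25 per million and the written figure about 5.56 per million, which the slide rounds to 6.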
The mean
- The mean is the arithmetic average
- The most common measure of central tendency
- Can be calculated by adding all of the scores together and then dividing the sum by the number of scores (i.e. 10)
  - (4+5+6+6+7+7+7+9+9+10) / 10 = 70 / 10 = 7
- While the mean is a useful measure, unless we also know how dispersed (i.e. spread out) the scores in a dataset are, the mean can be an uncertain guide

The mode and the median
- The mode is the most common score in a set of scores
- The mode in our testing example is 7, because this score occurs more frequently than any other score: 4, 5, 6, 6, 7, 7, 7, 9, 9, 10
- The median is the middle score of a set of scores ordered from the lowest to the highest
  - For an odd number of scores, the median is the central score in an ordered list
  - For an even number of scores, the median is the average of the two central scores
- In the above example the median is 7 (i.e. (7+7)/2)

Measure of dispersion: range
- The range is a simple way to measure the dispersion of a set of data
- The difference between the highest and lowest frequencies / scores
- In our testing example the range is 6 (i.e. highest 10 - lowest 4)
- Only a poor measure of dispersion
  - An unusually high or low score in a dataset may make the range unreasonably large, thus giving a distorted picture of the dataset

Measure of dispersion: variance
- The variance is built on the distance (deviation) of each score in the dataset from the mean
- In our test results, the deviation of the score 4 is 3 (i.e. 7 - 4, below the mean), and the deviation of the score 9 is 2 (i.e. 9 - 7, above the mean)
- For the whole dataset, the sum of these signed differences is always zero
  - Some scores will be above the mean while some will be below the mean
- It is therefore meaningless to use the plain sum of the deviations to measure the dispersion of a whole dataset; the deviations have to be squared first, as in the standard deviation
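The measures of central tendency and the range described above can be checked with Python's standard `statistics` module. This is a minimal sketch using the ten test scores from the slides; the variable names are illustrative only.

```python
import statistics

scores = [4, 5, 6, 6, 7, 7, 7, 9, 9, 10]  # test scores from the slides

mean = statistics.mean(scores)       # arithmetic average
median = statistics.median(scores)   # average of the two central scores (even N)
mode = statistics.mode(scores)       # most frequent score
score_range = max(scores) - min(scores)

# Signed deviations from the mean cancel each other out, so their sum
# says nothing about how spread out the scores are.
deviations = [s - mean for s in scores]

print("mean:", mean)
print("median:", median)
print("mode:", mode)
print("range:", score_range)
print("sum of deviations:", sum(deviations))
```

The last line illustrates why the deviations are squared before being averaged.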
Measure of dispersion: std dev
- Standard deviation is equal to the square root of the sum of the squared deviation scores divided by the number of scores in a dataset
  - $\sigma = \sqrt{\frac{\sum (F - \bar{F})^2}{N}}$
  - $F$ is a score in a dataset (i.e. any of the ten scores)
  - $\bar{F}$ is the mean score (i.e. 7)
  - $N$ is the number of scores under consideration (i.e. 10)
- Std dev in our example of test results is $\sqrt{32 / 10} \approx 1.789$

Measure of dispersion: std dev
- For a normally distributed dataset (i.e. where most of the items are clustered towards the centre rather than the lower or higher end of the scale)
  - 68% of the scores lie within one standard deviation of the mean
  - 95% lie within two standard deviations of the mean
  - 99.7% lie within three standard deviations of the mean
- The standard deviation is the most reasonable measure of the dispersion of a dataset
- [Figure: normal distribution (bell-shaped curve)]
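The standard deviation definition above can be worked through step by step for the ten test scores. The sketch below divides by N, as in the verbal definition on the slide; the N - 1 (sample) variant is shown alongside as an assumption about what a statistics package may report by default.

```python
import math
import statistics

scores = [4, 5, 6, 6, 7, 7, 7, 9, 9, 10]  # test scores from the slides
n = len(scores)
mean = statistics.mean(scores)

# Sum of the squared deviations from the mean.
sum_sq = sum((f - mean) ** 2 for f in scores)

pop_sd = math.sqrt(sum_sq / n)           # divide by N (definition on the slide)
sample_sd = math.sqrt(sum_sq / (n - 1))  # divide by N - 1 (sample estimate)

# Cross-check against the standard library implementations.
assert math.isclose(pop_sd, statistics.pstdev(scores))
assert math.isclose(sample_sd, statistics.stdev(scores))

print(f"sum of squared deviations: {sum_sq}")
print(f"std dev (N):     {pop_sd:.3f}")
print(f"std dev (N - 1): {sample_sd:.3f}")

# Intervals of one, two and three standard deviations around the mean.
for k in (1, 2, 3):
    print(f"mean +/- {k} sd: {mean - k * pop_sd:.2f} to {mean + k * pop_sd:.2f}")
```

For these scores the squared deviations sum to 32, giving about 1.789 when dividing by N and about 1.886 when dividing by N - 1.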
Computing std dev with SPSS
- SPSS menu: Analyze - Descriptive Statistics - Descriptives

Inferential statistics
- Descriptive statistics are useful in summarizing a dataset
- Inferential statistics are typically used to formulate or test a hypothesis
  - Using statistical measures to test whether or not any differences observed are statistically significant
- Tests of statistical significance
  - chi-square test
  - log-likelihood (LL) test
  - Fisher's Exact test
- Collocation statistics
  - Mutual information (MI)
  - z score

Statistical significance
- In testing a linguistic hypothesis, it would be nice to be 100% sure that the hypothesis can be accepted
- However, one can never be 100% sure in real-life cases
  - There is always the possibility that the differences observed between two corpora are due to chance
- In our swearword example, it is about 4 times as frequent in speech as in writing
  - We need to use a statistical test to help us decide whether this difference is statistically significant
- The level of statistical significance = the level of our confidence in accepting a given hypothesis
  - The closer the likelihood is to 100%, the more confident we can be
  - One must be more than 95% confident that the observed differences have not arisen by chance

Commonly used statistical tests
- Chi-square test
  - Compares the difference between the observed values (e.g. the actual frequencies extracted from corpora) and the expected values (e.g. the frequencies that one would expect if no factor other than chance was affecting the frequencies)
- Log-likelihood test (LL)
  - Similar, but more reliable as LL does not assume that data is normally distributed
  - The preferred test for statistical significance
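The comparison of observed and expected values behind both tests can be sketched for the swearword example. Several things here are assumptions rather than part of the slides: the function names are illustrative, the corpus sizes are the rounded 10 million and 90 million tokens, and the LL function uses the two-term form (summing only over the two word frequencies) that is commonly associated with online LL calculators.

```python
import math


def expected_counts(table):
    """Expected cell values for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square: sum of (O - E)^2 / E over all cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def log_likelihood(freq1, size1, freq2, size2):
    """Two-term LL: 2 * sum(O * ln(O / E)) over the two word frequencies."""
    total_freq = freq1 + freq2
    e1 = size1 * total_freq / (size1 + size2)
    e2 = size2 * total_freq / (size1 + size2)
    ll = 0.0
    for o, e in ((freq1, e1), (freq2, e2)):
        if o > 0:  # a zero frequency contributes nothing to the sum
            ll += o * math.log(o / e)
    return 2 * ll


# Assumed rounded corpus sizes: 10M spoken and 90M written tokens.
spoken_size, written_size = 10_000_000, 90_000_000
# Rows are corpora; columns are [swearword, all other tokens].
table = [[250, spoken_size - 250], [500, written_size - 500]]

chi2 = chi_square(table)
ll = log_likelihood(250, spoken_size, 500, written_size)
print(f"expected swearword count in speech: {expected_counts(table)[0][0]:.1f}")
print(f"chi-square: {chi2:.2f}")
print(f"log-likelihood: {ll:.2f}")
```

If no factor other than chance were at work, 75 of the 750 instances would be expected in the spoken part; the observed 250 is far from that, and with these assumed sizes the LL score comes out at about 301.88.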
Commonly used statistical tests
- Interpreting results
  - The greater the difference (absolute value) between the observed values and the expected values, the less likely it is that the difference is due to chance; conversely, the closer the observed values are to the expected values, the more likely it is that the difference has arisen by chance
  - A probability value p close to 0 indicates that a difference is highly significant statistically; a value close to 1 indicates that a difference is almost certainly due to chance
  - By convention, the general practice is that a hypothesis can be accepted only when the level of significance is less than 0.05 (i.e. p < 0.05, or more than 95% confident)

Online LL calculator
- http://ucrel.lancs.ac.uk/llwizard.html
- How to find the probability value p for an LL score of 301.88?

Contingency table
- degree of freedom (d.f.) = (No. of rows - 1) * (No. of columns - 1) = (2 - 1) * (2 - 1) = 1 * 1 = 1

Critical values
- The chi-square test or LL test score must be greater than 3.84 (1 d.f.) for a difference to be statistically significant at p < 0.05
- Oakes, M. (1998) Statistics for Corpus Linguistics, EUP, p. 266
- In the example of the swearword in the spoken/written BNC, LL = 301.88 for 1 d.f.
  - More than 99.99% confident that the difference is statistically significant

Excel LL calculator by Xu
- www.corpus4u.org/attachment.php?attachmentid=560&d=1240826440

SPSS: Left- vs. right-handed
- Define variables
- Data view
- Weight cases (Data - Weight Cases)
- Cross-tab
- Select variables
- Critical value (X2 / LL) for 1 d.f. at p < 0.05 (95%): 3.84
- Is there a relationship between gender and left- or right-handedness?
- Any cells with an expected value less than 5?

Fisher's Exact test
- The chi-square or log-likelihood test may not be reliable with very low frequencies
- When a cell in a contingency table has an expected value less than 5, Fisher's Exact test is more reliable
- In this case, SPSS computes Fisher's exact significance level automatically when the chi-square test is selected
- SPSS Releases 15 and 16 have removed the Fisher's Exact test module, which can be purchased separately
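The decision steps in these slides (degrees of freedom, looking up p for a score with 1 d.f., checking for expected values below 5, and falling back on Fisher's Exact test) can be sketched in plain Python. Everything named here is illustrative: the 2 x 2 counts are hypothetical and are not the handedness data from the SPSS demonstration, the p-value shortcut holds only for 1 d.f., and the two-sided Fisher p-value uses one common definition (summing all tables that are no more probable than the observed one), which may differ from what a given package reports.

```python
import math


def degrees_of_freedom(table):
    """(number of rows - 1) * (number of columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


def p_value_1df(score):
    """Upper-tail probability of a chi-square or LL score with 1 d.f. only."""
    return math.erfc(math.sqrt(score / 2))


def has_low_expected(table, threshold=5):
    """True if any cell has an expected value below `threshold`."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return any(r * c / total < threshold for r in row_totals for c in col_totals)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def weight(x):
        # Number of tables with x in the top-left cell and the same margins.
        return math.comb(col1, x) * math.comb(n - col1, row1 - x)

    lowest = max(0, row1 - (n - col1))
    highest = min(row1, col1)
    observed = weight(a)
    # Integer comparison avoids floating-point ties between tables.
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / math.comb(n, row1)


# Hypothetical 2 x 2 counts, for illustration only.
table = [[8, 2], [1, 5]]
print("d.f.:", degrees_of_freedom(table))
print("any expected value below 5:", has_low_expected(table))
print(f"Fisher's exact p (two-sided): {fisher_exact_two_sided(8, 2, 1, 5):.4f}")
print(f"p for a score of 3.84 at 1 d.f.: {p_value_1df(3.84):.4f}")
print(f"p for a score of 301.88 at 1 d.f.: {p_value_1df(301.88):.3g}")
```

A score of 3.84 corresponds to p of roughly 0.05, which is why it serves as the critical value for 1 d.f.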
Fisher's Exact test
- Don't forget to weight cases!
- Force an FE test

Practice
- Use the UCREL LL calculator, Xu's LL calculator and SPSS to determine if the difference in the frequencies of passives in the CLEC and LOCNESS corpora is statistically significant
  - CLEC: 7,911 instances in 1,070,602 words
  - LOCNESS: 5,465 instances in 324,304 words

Collocation statistics
- Collocation: the habitual or characteristic co-occurrence patterns of words
- Can be identified using a statistical approach in CL, e.g. Mutual Information (MI), t test, z score
- Can be computed using tools like SPSS, Wordsmith, AntConc, Xaira
- Only a brief introduction here
  - More discussion of collocation statistics to follow

Mutual information
- Computed by dividing the observed frequency of the co-occurring word in the defined span for the search string (the so-called node word), e.g. a 4:4 window, by the expected frequency of the co-occurring word in that span and then taking the logarithm to the base 2 of the result
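The MI computation just described can be sketched as below. The slides do not say how the expected frequency is obtained, so the estimate used here (node frequency times collocate frequency times span size, divided by corpus size) is an assumption, and all counts are hypothetical.

```python
import math


def expected_cooccurrence(node_freq, collocate_freq, span_size, corpus_size):
    """Assumed estimate of the chance co-occurrence frequency within the span."""
    return node_freq * collocate_freq * span_size / corpus_size


def mutual_information(observed, expected):
    """MI = log2(observed / expected)."""
    return math.log2(observed / expected)


# Hypothetical counts: a node word occurring 1,000 times, a collocate
# occurring 500 times, a 4:4 window (8 positions), a 1M-token corpus.
expected = expected_cooccurrence(1000, 500, 8, 1_000_000)
observed = 32  # hypothetical co-occurrences found within the window

print(f"expected co-occurrences: {expected}")
print(f"MI score: {mutual_information(observed, expected)}")
```

Here the pair co-occurs eight times as often as chance would predict, so the MI score is log2(8) = 3.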
Mutual information
- A measure of collocational strength
- The higher the MI score, the stronger the link between two items
- An MI score of 3.0 or higher is taken as evidence that two items are collocates
- The closer to 0 the MI score gets, the more likely it is that the two items co-occur by chance
- A negative MI score indicates that the two items tend to shun each other

The t test
- Computed by subtracting the expected frequency from the observed frequency and then dividing the result by the standard deviation
- A t score of 2 or higher is normally considered to be statistically significant
- The specific probability level can be looked up in a table of the t distribution

The z score
- The z score is the number of standard deviations from the mean frequency
- The z test compares the observed frequency with the frequency expected if only chance is affecting the distribution
- A higher z score indicates a greater degree of collocability of an item with the node word
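The t and z scores for a collocation can be sketched in the same way. The slides only say that the difference between observed and expected frequency is divided by the standard deviation; the approximations used below (the square root of the observed frequency for t, the square root of the expected frequency for z) are common simplifications and are assumptions here, as are the counts.

```python
import math


def t_score(observed, expected):
    """(O - E) / sqrt(O): standard deviation approximated by sqrt(O)."""
    return (observed - expected) / math.sqrt(observed)


def z_score(observed, expected):
    """(O - E) / sqrt(E): standard deviation approximated by sqrt(E)."""
    return (observed - expected) / math.sqrt(expected)


# Hypothetical counts: 16 observed co-occurrences against 4 expected by chance.
observed, expected = 16, 4

print(f"t score: {t_score(observed, expected)}")
print(f"z score: {z_score(observed, expected)}")
```

With these counts the t score is 3, above the threshold of 2 mentioned above, and the z score is 6.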