A Brief History of Statistics and Data Science
Yuan Wei, 10 December 2016, Zhongnan University of Economics and Law

Francis Bacon (England): "Histories make men wise."
Schlözer (Germany): "Statistics is static history; history is dynamic statistics."

I. Early Beginnings
II. Mathematical Foundations
III. Modern Era

I. Early Beginnings (450 BC to the 15th century)

450 BC (use of the mean): Hippias of Elis, born in Elis in the north-west of the Peloponnese and a contemporary of Plato, uses the average length of a king's reign (the mean) to work out the date of the first Olympic Games, some 300 years before his time, arriving at 776 BC.

431 BC (use of the mode): Attackers besieging Plataea in the Peloponnesian War calculate the height of the wall by counting the courses of bricks. The count is repeated several times by different soldiers, and the most frequent value (the mode) is taken to be the most likely. Multiplying it by the height of one brick gives the length of the ladders needed to scale the walls.

400 BC (sampling inference): In the Indian epic the Mahabharata, King Rtuparna estimates the number of fruit and leaves (2,095 fruit and 50,000,000 leaves) on two great branches of a vibhitaka tree by counting the number on a single twig and multiplying by the number of twigs. The estimate is found to be very close to the actual number. This is the first recorded example of sampling, "but this knowledge is kept secret", says the account.

AD 2 (census): The Chinese census under the Han dynasty finds 57.67 million people in 12.36 million households. It is the first census from which data survives, and scholars still consider it to have been accurate.

AD 7 (census): A census by Quirinius, the Roman governor of Syria, which also covered Judea, is mentioned in Luke's Gospel as causing Joseph and Mary to travel to Bethlehem, the ancestral home of Joseph's family, the house of David, to be registered and taxed.

840 (frequency analysis): The Islamic mathematician Al-Kindi uses frequency analysis to break secret codes: the most common symbols in a coded message will stand for the most common letters. Al-Kindi also introduces Arabic numerals to Europe.

10th century (graph): The earliest known graph, in a commentary on a book by Cicero, shows the movements of the planets through the zodiac. It is apparently intended for use in monastery schools.

1086 (official statistics): Domesday Book, a survey for William the Conqueror of the farms, villages and livestock in his new kingdom, marks the start of official statistics in England. (England then had about 1.5 million people, roughly 90% of them peasants.)

1150 (random sampling): The Trial of the Pyx, an annual test of the purity of coins from the Royal Mint, begins. Coins are drawn at random, in fixed proportion to the number minted. It continues to this day.

1188 (population census): Gerald of Wales completes the first population census of Wales.

1303 (binomial coefficients): A Chinese diagram entitled "The Old Method Chart of the Seven Multiplying Squares" shows the binomial coefficients up to the eighth power. These numbers are fundamental to the mathematics of probability. The arrangement is known in China as Yang Hui's triangle (1261, with Jia Xian's version earlier still) and appeared centuries later in the West as Pascal's triangle.

1346 (population and trade statistics): Giovanni Villani's Nuova Cronica gives statistical information on the population and trade of Florence.

II. Mathematical Foundations (16th century to the end of the 19th century)

1560 (early probability): Gerolamo Cardano, the Italian Renaissance scholar, calculates the probabilities of different dice throws for gamblers.

1570 (means and errors): The Danish astronomer Tycho Brahe uses the arithmetic mean to reduce errors in his estimates of the locations of stars and planets.

1644 (error chart): Michael van Langren draws the first known graph of statistical data that shows the size of possible errors. It plots different estimates of the distance between Toledo and Rome.

1654 (mathematical basis of probability): Pascal and Fermat correspond about dividing stakes in gambling games and together create the mathematical theory of probability.

1657 (first book on probability): Huygens's On Reasoning in Games of Chance is the first book on probability theory. He also invented the pendulum clock.

1663 (demographic statistics): John Graunt uses parish records, such as baptisms, to estimate the population of London, and is the first to report a sex ratio at birth of about 52:48.

1693 (first mortality table): Edmund Halley prepares the first mortality tables statistically relating death rates to age, the foundation of life insurance. He also draws a stylised map of the path of a solar eclipse over England, one of the first data visualisation maps.

1713 (law of large numbers): Jacob Bernoulli's Ars Conjectandi derives the law of large numbers: the more often you repeat an experiment, the more accurately you can predict the result.

1728 (lottery statistics): Voltaire and his mathematician friend de la Condamine spot that a Paris bond lottery is offering more in prize money than the total cost of the tickets; they corner the market and win themselves a fortune.

1749 (the German word "Statistik"): Gottfried Achenwall coins the word "statistics" (in German, Statistik); he means the information you need to run a nation state.

1757 (national lottery): Casanova becomes a trustee of, and may have had a hand in devising, the French national lottery.

1761 (Bayes' theorem): The Rev. Thomas Bayes proves Bayes' theorem, the cornerstone of conditional probability and the testing of beliefs and hypotheses.

1786 (charts of economic data): William Playfair introduces graphs and bar charts to show economic data.

1789 (climate statistics): Gilbert White and other clergymen-naturalists keep records of temperatures, dates of first snowdrops and cuckoos, etc.; the data is later useful for the study of climate change.

1790 (first US census): The first US census, taken by men on horseback under the direction of Thomas Jefferson (later the third President), counts 3.9 million Americans.

1791 (the English word "statistics"): First use of the word "statistics" in English, by Sir John Sinclair in his Statistical Account of Scotland.

1805 (least squares): Adrien-Marie Legendre introduces the method of least squares for fitting a curve to a given set of observations.

1808 (normal distribution): Gauss, with contributions from Laplace, derives the normal distribution, the bell-shaped curve fundamental to the study of variation and error.

1833 (British statistical society): The British Association for the Advancement of Science sets up a statistics section. Thomas Malthus, who analysed population growth, and Charles Babbage are members. It later becomes the Royal Statistical Society.

1835 (application to social science): The Belgian Adolphe Quetelet's Treatise on Man introduces social science statistics and the concept of the "average man": his height, body mass index, and earnings.

1839 (American Statistical Association): The American Statistical Association is formed. Alexander Graham Bell, Andrew Carnegie and President Martin Van Buren will become members.

1840 (medical statistics): William Farr sets up the official system for recording causes of death in England and Wales. This allows epidemics to be tracked and diseases compared, the start of medical statistics.

1849 (early computer programs): Charles Babbage designs his "difference engine", embodying the ideas of data handling and the modern computer. Ada Lovelace, Lord Byron's daughter, writes what is regarded as the world's first computer program, for Babbage's related Analytical Engine.
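
The 1849 entry names Babbage's difference engine without saying how such a machine tabulates values. As a minimal sketch, assuming the standard method of finite differences (the polynomial `poly` and both function names are hypothetical, chosen only for illustration), a table of polynomial values can be extended using additions alone:

```python
def leading_differences(values):
    """Return the first entry of each finite-difference row of `values`."""
    rows = [list(values)]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([b - a for a, b in zip(prev, prev[1:])])
    return [row[0] for row in rows]


def tabulate(seed, count):
    """Produce `count` table entries from the seed using additions only."""
    state = list(seed)
    table = []
    for _ in range(count):
        table.append(state[0])
        # Each register absorbs the difference stored in the next register.
        for i in range(len(state) - 1):
            state[i] += state[i + 1]
    return table


def poly(x):
    """Hypothetical example polynomial: 2x^2 + 3x + 1."""
    return 2 * x * x + 3 * x + 1


# A degree-2 polynomial needs three starting values to seed the differences.
seed = leading_differences([poly(x) for x in range(3)])
table = tabulate(seed, 8)
print(table)
```

After the seed is computed, no multiplication is needed, which is the property that made mechanical tabulation practical.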

1854 (founding of epidemiology): John Snow's "cholera map" pins down the source of an outbreak as a water pump in Broad Street, London, beginning the modern study of epidemics.

1859 (use of the pie chart): Florence Nightingale uses statistics of Crimean War casualties to influence public opinion and the War Office. She shows casualties month by month on a circular chart she devises, the "Nightingale rose", a relative of the pie chart. She is the first woman member of the Royal Statistical Society and the first overseas member of the American Statistical Association.

1869 (the power of statistical graphics): Minard's graphic diagram of Napoleon's march on Moscow shows on one diagram the distance covered, the number of men still alive at each kilometre of the march, and the temperatures they encountered on the retreat.

1877 (regression and correlation): Francis Galton, Darwin's cousin, describes regression to the mean. In 1888 he introduces the concept of correlation. At a "Guess the weight of an Ox" contest in Devon he describes the "wisdom of crowds": the average of many uninformed guesses comes close to the true value.

1886 (poverty map): The philanthropist Charles Booth begins his survey of the London poor, to produce his "poverty map of London". Areas are coloured black for the poorest, through to yellow for the upper-middle class and wealthy.

1894 (standard deviation): Karl Pearson introduces the term "standard deviation". If errors are normally distributed, about 68% of samples will lie within one standard deviation of the mean. Later he develops chi-squared tests for whether two variables are independent of each other.

1898 (Poisson distribution): Von Bortkiewicz's data on deaths of soldiers in the Prussian army from horse kicks shows that apparently rare events follow a predictable pattern, the Poisson distribution.

III. Modern Era (early 20th century to the present)

1900 (financial mathematics): The French mathematician Louis Bachelier shows that fluctuations in stock market prices behave in the same way as the random Brownian motion of molecules, the start of financial mathematics.

1908 (small-sample t statistic): William Sealy Gosset, chief brewer for Guinness in Dublin, describes the t-test. It uses a small number of samples to ensure that every brew tastes equally good.
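
The 1908 entry describes the t-test only in outline. As a rough sketch of the idea, assuming the one-sample form of the statistic, t = (mean - mu0) / (s / sqrt(n)), the calculation might look like this. The function name and the readings are hypothetical and are not Gosset's data.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return the one-sample t statistic and its degrees of freedom."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    t = (mean - mu0) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical quality readings from five brews, compared with a target of 5.0.
readings = [5.3, 4.9, 5.4, 5.1, 5.2]
t_stat, dof = one_sample_t(readings, 5.0)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared with a critical value from Student's t distribution for the given degrees of freedom; that lookup is left out of the sketch.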

1911 (pioneer of machine data processing): Herman Hollerith, inventor of the punch-card devices used to analyse data in US censuses, merges his company to form what will become IBM, a pioneer of machines to handle business data and of early computers.

1916 (statistics in the First World War): During the First World War the car designer Frederick Lanchester develops statistical laws to predict the outcomes of aerial battles: if you double their size, land armies are only twice as strong, but air forces are four times as powerful.

1924 (quality control chart): Walter Shewhart of Bell Laboratories invents the control chart to aid industrial production and management.

1935 (Zipf's law): George Zipf finds that many phenomena (river lengths, city populations, the frequency of English words) obey a power law, so that the largest is twice the size of the second largest, three times the size of the third, and so on. This is known as Zipf's law, a power law related to, though not the same as, the familiar "80/20 rule".

1935 (modern experimental design): R. A. Fisher revolutionises modern statistics. His Design of Experiments gives ways of deciding which results of scientific experiments are significant and which are not.

1937 (confidence intervals): Jerzy Neyman introduces confidence intervals in statistical testing. His work leads to modern scientific sampling.

1940-45 (first programmable computer): Alan Turing at Bletchley Park helps crack the German wartime Enigma code using advanced Bayesian statistics. Bletchley Park's codebreakers also use Colossus, the first programmable electronic computer.

1944 (statistics in the Second World War): The German tank problem: the Allies desperately need to know how many Panther tanks they will face in France on D-Day. Statistical analysis of the serial numbers on gearboxes from captured tanks indicates how many of each are being produced. Statisticians predict 270 a month; reports from intelligence sources predict many fewer. The total turns out to be 276. Statistics had outperformed spies.

1946 (axioms of probability): Cox's theorem derives the axioms of probability from simple logical assumptions.

1948 (information theory and the bit): Claude Shannon introduces information theory and the "bit", fundamental to the digital age.

1948-53 (sex research): The Kinsey Report gathers objective data on human sexual behaviour. A large-scale survey of 5,000 men and, later, 5,000 women, it causes outrage.

1950 (smoking and lung cancer): Richard Doll and Bradford Hill establish the link between cigarette smoking and lung cancer. Despite fierce opposition the result is conclusively proved, to huge public health benefit.

1950s (Taguchi's design of experiments): Genichi Taguchi's statistical methods to improve the quality of automobile and electronics components revolutionise Japanese industry, which far overtakes its Western rivals.

1958 (survival analysis): The Kaplan-Meier estimator gives doctors a simple statistical way of judging which treatments work best. It has saved millions of lives.

1972 (proportional hazards model and partial likelihood): David Cox introduces the proportional hazards model and the concept of partial likelihood.

1977 (exploratory data analysis): John Tukey introduces the box plot, or box-and-whisker diagram, which shows the quartiles, median and spread of data in a single image, along with the stem-and-leaf plot.

1979: Bradley Efron of Stanford University introduces bootstrapping, a simple way to estimate the sampling distribution of almost any statistic from a sample of data.
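
The 1979 entry on bootstrapping can be illustrated with a small sketch. It assumes the simplest variant, resampling with replacement and reading off a percentile interval for the mean. The function names, the fixed seed and the observations are hypothetical, and this is not presented as Efron's original formulation.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Return the means of resamples drawn with replacement from `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, lower=0.025, upper=0.975):
    """Return the values at the given lower and upper quantile positions."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[int(lower * last)], ordered[int(upper * last)]


# Hypothetical observations.
data = [12, 15, 9, 20, 14, 11, 17, 13]
means = bootstrap_means(data)
low, high = percentile_interval(means)
print(f"sample mean {statistics.fmean(data):.2f}, approximate 95% interval ({low:.2f}, {high:.2f})")
```

The same resampling loop works for a median or a correlation by swapping the statistic computed on each resample.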


