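What Is Data Mining?

Simply put, data mining is the extraction, or “mining”, of knowledge from large amounts of data. The term is actually something of a misnomer. Note that mining gold from ore or sand is called gold mining, not ore mining. By the same logic, data mining would be more accurately named “knowledge mining from data”, which is unfortunately rather long. “Knowledge mining” is shorter, but it may not convey the idea of mining from large amounts of data. After all, mining is a vivid term that captures the process of finding a few nuggets in a large mass of raw material. So the misnomer, which carries both “data” and “mining”, became the popular choice. Several other terms have meanings similar to, though slightly different from, data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another commonly used term, knowledge discovery in databases, or KDD. Others regard data mining as just one essential step in the process of knowledge discovery in databases. The knowledge discovery process consists of the following steps:
1) Data cleaning: remove noise and inconsistent data.
2) Data integration: combine multiple data sources.
3) Data selection: retrieve from the database the data relevant to the analysis task.
4) Data transformation: transform or consolidate the data into forms suitable for mining, for example through summary or aggregation operations.
5) Data mining: the essential step, in which intelligent methods are applied to extract data patterns.
6) Pattern evaluation: identify the truly interesting patterns that represent knowledge, based on some interestingness measure.
7) Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to the user.

The data mining step may interact with the user or with a knowledge base. Interesting patterns are presented to the user or stored in the knowledge base as new knowledge. Note that on this view data mining is only one step in the whole process, albeit the most important one, because it uncovers hidden patterns.

We agree that data mining is one step in the knowledge discovery process. However, in industry, in the media, and in the database research community, “data mining” is more popular than the longer term “knowledge discovery in databases”. The term used in this book is therefore data mining. We adopt a broad view of data mining: data mining is the process of discovering interesting knowledge from large amounts of data stored in databases or other information repositories.

Based on this view, a typical data mining system has the following major components:

Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and integration can be performed on the data.

Database or data warehouse server: fetches the relevant data in response to the user's data mining request.

Knowledge base: the domain knowledge used to guide the search or to evaluate the interestingness of resulting patterns. Such knowledge may include concept hierarchies, which organize attributes or attribute values into different levels of abstraction. Knowledge about user beliefs can also be included and used to assess a pattern's interestingness by its unexpectedness. Other examples of domain knowledge are interestingness constraints or thresholds, and metadata (for example, metadata describing data from multiple heterogeneous sources).

Data mining engine: the essential part of the data mining system, consisting of a set of functional modules for characterization, association, classification, cluster analysis, and evolution and deviation analysis.

Pattern evaluation module: this component typically uses interestingness measures and interacts with the data mining modules to focus the search on interesting patterns. It may use interestingness thresholds to filter discovered patterns. The pattern evaluation module can also be integrated with the mining module, depending on how the data mining method is implemented. For efficient data mining, it is recommended to push pattern evaluation as deep as possible into the mining process, so that the search is confined to the interesting patterns.

Graphical user interface: this module communicates between the user and the data mining system. It lets the user interact with the system by specifying a data mining query or task, providing information that helps focus the search, and performing exploratory data mining based on intermediate results. It also lets the user browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, by incorporating more advanced data understanding techniques, data mining goes well beyond the summarization-style analytical processing of data warehouses.

Although many “data mining systems” are on the market, not all of them can perform true data mining. A data analysis system that cannot handle large amounts of data is at most a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values in large databases or answering deductive queries, should be classified as a database system, an information retrieval system, or a deductive database system.

Data mining involves the integration of techniques from multiple disciplines, including database technology, statistics, machine learning, high-performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. In discussing data mining in this book, we adopt a database perspective; that is, we emphasize efficient and scalable data mining techniques for large databases. An algorithm is scalable if, given the available system resources such as main memory and disk space, its running time grows linearly with the size of the database. Through data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Data mining is therefore regarded by the information industry as one of the most important frontiers in database systems and as one of the most promising interdisciplinary fields in the information industry.

Data mining is an interdisciplinary field influenced by several disciplines, including database systems, statistics, machine learning, visualization, and information science. In addition, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high-performance computing. Depending on the kinds of data mined or on the given application, a data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, business, bioinformatics, or psychology.

Because data mining draws on many disciplines, data mining research has produced a large number of data mining systems of many different kinds. A clear classification of data mining systems is therefore needed. Such a classification can help users distinguish among data mining systems and identify the ones that best match their needs. Data mining systems can be classified according to different criteria, as follows.

1) Classification by the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), and each class may require its own data mining techniques. Data mining systems can be classified accordingly. For example, classifying by data model gives relational, transactional, object-oriented, object-relational, or data warehouse mining systems. Classifying by the special types of data handled gives spatial, time-series, text, or multimedia data mining systems, or World Wide Web mining systems.

2) Classification by the kinds of knowledge mined. Data mining systems can be classified by the kinds of knowledge they mine, that is, by data mining functionality, such as characterization, discrimination, association, classification, clustering, outlier analysis, evolution analysis, deviation analysis, and similarity analysis. A comprehensive data mining system should provide multiple and/or integrated data mining functionalities. Data mining systems can also be distinguished by the granularity or level of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at the raw data level), or multiple-level knowledge (considering several levels of abstraction). An advanced data mining system should support knowledge discovery at multiple levels of abstraction. Data mining systems can also be classified as those that mine data regularities (commonly occurring patterns) and those that mine data irregularities (such as exceptions or outliers). In general, concept description, association analysis, classification, prediction, and clustering mine data regularities and reject outliers as noise. These methods can also help detect outliers.

3) Classification by the kinds of techniques used. Data mining systems can also be classified by the data mining techniques they employ. These techniques can be described by the degree of user interaction involved (for example, autonomous systems, interactive exploratory systems, query-driven systems) or by the methods of data analysis employed (for example, database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks). A sophisticated data mining system will often adopt multiple data mining techniques, or an effective, integrated technique that combines the merits of several individual approaches.

What is Data Mining?

Simply stated, data mining refers to extracting or “mining” knowledge from large amounts of data. The term is actually a misnomer. Remember that the mining of gold from rocks or sand is referred to as gold mining rather than rock or sand mining. Thus, “data mining” should have been more appropriately named “knowledge mining from data”, which is unfortunately somewhat long. “Knowledge mining”, a shorter term, may not reflect the emphasis on mining from large amounts of data. Nevertheless, mining is a vivid term characterizing the process that finds a small set of precious nuggets from a great deal of raw material. Thus, such a misnomer, which carries both “data” and “mining”, became a popular choice. There are many other terms carrying a similar or slightly different meaning to data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term, “Knowledge Discovery in Databases”, or KDD. Alternatively, others view data mining as simply an essential step in the process of knowledge discovery in databases. Knowledge discovery consists of an iterative sequence of the following steps:
1) data cleaning: to remove noise or irrelevant data,
2) data integration: where multiple data sources may be combined,
3) data selection: where data relevant to the analysis task are retrieved from the database,
4) data transformation: where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance,
5) data mining: an essential process where intelligent methods are applied in order to extract data patterns,
6) pattern evaluation: to identify the truly interesting patterns representing knowledge based on some interestingness measures, and
7) knowledge presentation: where visualization and knowledge representation techniques are used to present the mined knowledge to the user.

The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user, and may be stored as new knowledge in the knowledge base. Note that according to this view, data mining is only one step in the entire process, albeit an essential one since it uncovers hidden patterns for evaluation.

We agree that data mining is a step in the knowledge discovery process. However, in industry, in media, and in the database research milieu, the term “data mining” is becoming more popular than the longer term “knowledge discovery in databases”. Therefore, in this book, we choose to use the term “data mining”. We adopt a broad view of data mining functionality: data mining is the process of discovering interesting knowledge from large amounts of data stored either in databases, data warehouses, or other information repositories.

Based on this view, the architecture of a typical data mining system may have the following major components:

1. Database, data warehouse, or other information repository. This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

2. Database or data warehouse server. The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

3. Knowledge base. This is the domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attributes or attribute values into different levels of abstraction. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may also be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata (e.g., describing data from multiple heterogeneous sources).

4. Data mining engine. This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association analysis, classification, and evolution and deviation analysis.

5. Pattern evaluation module. This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may access interestingness thresholds stored in the knowledge base. Alternatively, the pattern evaluation module may be integrated with the mining module, depending on the implementation of the data mining method used. For efficient data mining, it is highly recommended to push the evaluation of pattern interestingness as deep as possible into the mining process so as to confine the search to only the interesting patterns.

6. Graphical user interface. This module communicates between users and the data mining system, allowing the user to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on the intermediate data mining results. In addition, this component allows the user to browse database and data warehouse schemas or data structures, evaluate mined patterns, and visualize the patterns in different forms.

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond the narrow scope of summarization-style analytical processing of data warehouse systems by incorporating more advanced techniques for data understanding.

While there may be many “data mining systems” on the market, not all of them can perform true data mining. A data analysis system that does not handle large amounts of data can at most be categorized as a machine learning system, a statistical data analysis tool, or an experimental system prototype. A system that can only perform data or information retrieval, including finding aggregate values, or that performs deductive query answering in large databases should be more appropriately categorized as either a database system, an information retrieval system, or a deductive database system.

Data mining involves an integration of techniques from multiple disciplines such as database technology, statistics, machine learning, high performance computing, pattern recognition, neural networks, data visualization, information retrieval, image and signal processing, and spatial data analysis. We adopt a database perspective in our presentation of data mining in this book. That is, emphasis is placed on efficient and scalable data mining techniques for large databases. By performing data mining, interesting knowledge, regularities, or high-level information can be extracted from databases and viewed or browsed from different angles. The discovered knowledge can be applied to decision making, process control, information management, query processing, and so on. Therefore, data mining is considered one of the most important frontiers in database systems and one of the most promising new database applications in the information industry.

A classification of data mining systems

Data mining is an interdisciplinary field, the confluence of a set of disciplines, including database systems, statistics, machine learning, visualization, and information science. Moreover, depending on the data mining approach used, techniques from other disciplines may be applied, such as neural networks, fuzzy and/or rough set theory, knowledge representation, inductive logic programming, or high performance computing. Depending on the kinds of data to be mined or on the given data mining application, the data mining system may also integrate techniques from spatial data analysis, information retrieval, pattern recognition, image analysis, signal processing, computer graphics, Web technology, economics, or psychology.

Because of the diversity of disciplines contributing to data mining, data mining research is expected to generate a large variety of data mining systems. Therefore, it is necessary to provide a clear classification of data mining systems. Such a classification may help potential users distinguish data mining systems and identify those that best match their needs. Data mining systems can be categorized according to various criteria, as follows.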

1) Classification according to the kinds of databases mined. A data mining system can be classified according to the kinds of databases mined. Database systems themselves can be classified according to different criteria (such as data models, or the types of data or applications involved), each of which may require its own data mining technique. Data mining systems can therefore be classified accordingly. For instance, if classifying according to data models, we may have a relational, transactional, object-oriented, object-relational, or data warehouse mining system. If classifying according to the special types of data handled, we may have a spatial, time-series, text, or multimedia data mining system, or a World-Wide Web mining system. Other system types include heterogeneous data mining systems and legacy data mining systems.

2) Classification according to the kinds of knowledge mined. Data mining systems can be categorized according to the kinds of knowledge they mine, i.e., based on data mining functionalities, such as characterization, discrimination, association, classification, clustering, trend and evolution analysis, deviation analysis, similarity analysis, etc. A comprehensive data mining system usually provides multiple and/or integrated data mining functionalities. Moreover, data mining systems can also be distinguished based on the granularity or levels of abstraction of the knowledge mined, including generalized knowledge (at a high level of abstraction), primitive-level knowledge (at a raw data level), or knowledge at multiple levels (considering several levels of abstraction). An advanced data mining system should facilitate the discovery of knowledge at multiple levels of abstraction.

3) Classification according to the kinds of techniques utilized. Data mining systems can also be categorized according to the underlying data mining techniques employed. These techniques can be described according to the degree of user interaction involved (e.g., autonomous systems, interactive exploratory systems, query-driven systems), or the methods of data analysis employed (e.g., database-oriented or data warehouse-oriented techniques, machine learning, statistics, visualization, pattern recognition, neural networks, and so on). A sophisticated data mining system will often adopt multiple data mining techniques or work out an effective, integrated technique which combines the merits of a few individual approaches.
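
As a concrete illustration of the knowledge discovery steps listed earlier (cleaning, integration, selection, transformation, mining, pattern evaluation, presentation), the following Python sketch runs a toy pipeline over made-up purchase records. Everything in it is an assumption for illustration only: the data in `STORE_A` and `STORE_B`, the function names, and the choice of co-occurrence counting with a `min_support` threshold as the interestingness measure. The text itself does not prescribe any particular mining method or measure.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw records from two sources: (transaction_id, item).
# None marks a missing (noisy) value; names may have stray spaces or case.
STORE_A = [(1, "milk"), (1, "bread"), (2, "milk"), (2, None), (3, "bread"), (3, "milk")]
STORE_B = [(4, "milk"), (4, "bread"), (4, "eggs"), (5, "eggs"), (5, "  Bread ")]


def clean(records):
    """Data cleaning: drop records with missing items, normalize item names."""
    return [(tid, item.strip().lower()) for tid, item in records if item]


def integrate(*sources):
    """Data integration: combine several cleaned sources into one list."""
    return [rec for src in sources for rec in src]


def select(records, relevant_items):
    """Data selection: keep only records relevant to the analysis task."""
    return [(tid, item) for tid, item in records if item in relevant_items]


def transform(records):
    """Data transformation: aggregate records into one item set per transaction."""
    baskets = {}
    for tid, item in records:
        baskets.setdefault(tid, set()).add(item)
    return list(baskets.values())


def mine(baskets):
    """Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(counts, n_baskets, min_support):
    """Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: c / n_baskets for pair, c in counts.items()
            if c / n_baskets >= min_support}


def present(patterns):
    """Knowledge presentation: format patterns, highest support first."""
    ranked = sorted(patterns.items(), key=lambda kv: (-kv[1], kv[0]))
    return [f"{a} + {b}: support {s:.2f}" for (a, b), s in ranked]


def run_pipeline(min_support=0.5):
    records = integrate(clean(STORE_A), clean(STORE_B))
    records = select(records, {"milk", "bread", "eggs"})
    baskets = transform(records)
    patterns = evaluate(mine(baskets), len(baskets), min_support)
    return present(patterns)


if __name__ == "__main__":
    for line in run_pipeline():
        print(line)
```

With the default threshold, only the bread and milk pair survives evaluation; lowering `min_support` lets less frequent pairs through. For readability the sketch evaluates patterns after mining has finished, whereas the text recommends pushing the evaluation as deep as possible into the mining process for efficiency.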