Data Mining Assignment (毛卓, M080501447)
1. Based on your observation, describe another possible kind of knowledge that needs to be discovered by data mining methods but has not been listed in this chapter. Does it require a mining methodology that is quite different from those outlined in this chapter?
Answer: One candidate is the kind of knowledge extracted by the SLIQ algorithm. SLIQ builds an index for each attribute, and only the class list and the current attribute list reside in memory. It relies on decision tree induction methods in data mining.
2. Suppose that the data for analysis include the attribute age. The age values for the data tuples are (in increasing order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
(a) Use smoothing by bin means to smooth the above data, using a bin depth of 3. Illustrate your steps. Comment on the effect of this technique for the given data.
Answer: Bin 1: 14.67, 14.67, 14.67; Bin 2: 18.33, 18.33, 18.33; Bin 3: 21, 21, 21; Bin 4: 24, 24, 24; Bin 5: 26.67, 26.67, 26.67; Bin 6: 33.67, 33.67, 33.67; Bin 7: 35, 35, 35; Bin 8: 40.33, 40.33, 40.33; Bin 9: 56, 56, 56.
Step 1: Partition the sorted data into 9 equal-frequency bins of depth 3.
Step 2: Compute the mean of each bin and replace every value in the bin with that mean.
(b) How might you determine outliers in the
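data?
Answer: Compute the interquartile range IQR = Q3 - Q1 and flag values that lie more than 1.5 × IQR above Q3 or below Q1.
(c) What other methods are there for data smoothing?
Answer: Binning, regression, and clustering.
3. Using the data for age given in Exercise 2, answer the following:
(a) Use min-max normalization to transform the value 35 for age onto the range [0.0, 1.0].
Answer: v' = (35 - 13) / (70 - 13) × (1.0 - 0.0) + 0.0 = 22/57 ≈ 0.386
(b) Use z-score normalization to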
transform the value 35 for age, where the standard deviation of age is 12.94 years.
Answer: The mean age is 809/27 ≈ 29.96, so v' = (35 - 29.96) / 12.94 ≈ 0.389
(c) Use normalization by decimal scaling to transform the value 35 for age.
Answer: The largest absolute value is 70, so j = 2 and v' = 35 / 10^2 = 0.35
(d) Comment on which method you would prefer to use for the given data, giving reasons as to why.
4. Suppose that a data warehouse for
Big University consists of the following four dimensions: student, course, semester, and instructor, and two measures, count and avg_grade. When at the lowest conceptual level (e.g., for a given student, course, semester, and instructor combination), the avg_grade measure stores the actual course grade of the student. At higher conceptual levels, avg_grade stores the average grade for the given combination.
(a) Draw a snowflake schema diagram for the data warehouse. (p. 116)
Answer: Big University is considered along four dimensions, namely semester, student, course, and instructor. The schema contains a central fact table for Big University that holds keys to each of the four dimensions, along with two measures: count and avg_grade.
Figure 3.4 Snowflake schema of a data warehouse for Big_University
(b) Starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations (e.g., roll-up from semester to year) should one perform in order to list the average grade of CS courses for each Big University student?
Answer: Starting with the base cuboid [student, course, semester, instructor], we use the following OLAP operations to list the average grade of CS courses for each Big University student.
Roll-up: The roll-up operation performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction, for example along a hierarchy defined as the total order “quarter < year”.
Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
(c) If each dimension has five levels (including all), such as “student < major < status < university [text missing in source] 50%, the rule buys(x, hotdogs) => buys(x, hamburgers) is a strong association rule.
(b) Based on the given data, is the purchase of
hot dogs independent of the purchase of hamburgers? If not, what kind of correlation relationship exists between the two?
Answer: lift(hotdogs, hamburgers) = P(hotdogs ∧ hamburgers) / (P(hotdogs) × P(hamburgers)) = (2000/5000) / ((3000/5000) × (2500/5000)) = 0.4 / 0.3 ≈ 1.33 > 1. Based on the given data, the purchase of hot dogs and the purchase of hamburgers are not independent: the lift value is greater than 1, so they are positively correlated.
7. Write an algorithm for k-nearest-neighbor classification given k and n, the number of attributes describing each tuple.
Answer: We denote an arbitrary instance x as <a1(x), ..., an(x)>, where ar(x) denotes the value of
the r-th attribute of instance x. The distance between two instances xi and xj is defined to be d(xi, xj), where
d(xi, xj) = sqrt( sum over r = 1..n of (ar(xi) - ar(xj))^2 )
The algorithm follows.
k-nearest-neighbor algorithm:
Training: Build the set of training examples D.
Classification: Given a query instance xq to be classified,
Let x1, ..., xk denote the k instances from D that are nearest to xq.
Return f^(xq) = argmax over v in V of ( sum over i = 1..k of δ(v, f(xi)) ),
where V is the set of class labels, f(xi) is the class label of xi, and δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.
8. Show that accuracy is a function of sensitivity and specificity, that is, prove Equation (6.58).
Answer: Suppose that you have trained a classifier to classify medical data tuples as either “cancer” or “not_cancer”. An accuracy rate of, say, 90% may not be acceptable: the classifier could be correctly labeling only the “not_cancer” tuples, for instance. Instead, we would like to be able to assess how well the classifier can recognize “cancer” tuples (the positive tuples) and how well it can recognize “not_cancer” tuples (the negative tuples). The sensitivity and specificity measures can be used, respectively, for this purpose. Sensitivity is also referred to as the true positive (recognition) rate (that is, the proportion of positive tuples that are correctly identified), while specificity is the true negative rate (that is, the proportion of negative tuples that are correctly identified). In addition, we may use precision to assess the percentage of tuples labeled as “cancer” that actually are “cancer” tuples. These measures are defined as
sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
where t_pos is the number of true positives (“cancer” tuples that were correctly classified as such), pos is the number of positive (“cancer”) tuples, t_neg is the number of true negatives (“not_cancer” tuples that were correctly classified as such), neg is the number of negative (“not_cancer”) tuples, and f_pos is the number of false positives (“not_cancer” tuples that were
incorrectly labeled as “cancer”). It can be shown that accuracy is a function of sensitivity and specificity:
accuracy = (t_pos + t_neg) / (pos + neg) = (t_pos / pos) × pos / (pos + neg) + (t_neg / neg) × neg / (pos + neg) = sensitivity × pos / (pos + neg) + specificity × neg / (pos + neg)
9. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
Answer: sqrt((22 - 20)^2 + (1 - 0)^2 + (42 - 36)^2 + (10 - 8)^2) = sqrt(45) ≈ 6.71
(b) Compute the Manhattan distance between the two objects.
Answer: |22 - 20| + |1 - 0| + |42 - 36| + |10 - 8| = 11
(c) Compute the Minkowski distance between the two objects, using q = 3.
Answer: (2^3 + 1^3 + 6^3 + 2^3)^(1/3) = 233^(1/3) ≈ 6.15
10. Suppose that the data mining task is to cluster the following eight points (with (x, y) representing location) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), B1(5, 8), B2(7, 5), B3(6, 4), C1(1, 2), C2(4, 9).
The distance function is Euclidean distance. Suppose initially we assign A1, B1, and C1 as the center of each cluster, respectively. Use the k-means algorithm to show only
(a) The three cluster centers after the first round of execution
Answer: The first computation for each point:
For A2: d(A1,A2) = 5; d(B1,A2) = 4.24; d(C1,A2) = 3.16, so A2 belongs to center C1.
For A3: d(A1,A3) = 8.49; d(B1,A3) = 5; d(C1,A3) = 7.28, so A3 belongs to center B1.
For B2: d(A1,B2) = 7.07; d(B1,B2) = 3.61; d(C1,B2) = 6.71, so B2 belongs to center B1.
For B3: d(A1,B3) = 7.21; d(B1,B3) = 4.12; d(C1,B3) = 5.39, so B3 belongs to center B1.
For C2: d(A1,C2) = 2.24; d(B1,C2) = 1.41; d(C1,C2) = 7.62, so C2 belongs to center B1.
So after the first computation, the three clusters are: Cluster 1: A1; Cluster 2: A2, C1; Cluster 3: A3, B1, B2, B3, C2.
The three new centers are A1(2, 10), O1(1.5, 3.5), and O2(6, 6).
(b) The final three clusters
Answer: The second computation for each point:
A1 and C2 belong to center A1; A2 and C1 belong to center O1; A3, B1, B2, and B3 belong to center O2.
So after the second computation, the three clusters are: Cluster 1: A2, C1; Cluster 2: A1, C2; Cluster 3: A3, B1, B2, B3.
The three new centers are O1(1.5, 3.5), O3(3, 9.5), and O4(6.5, 5.25).
The third computation for each point:
A1 belongs to center O3; A2 belongs to center O1; A3 belongs to center O4;
B1 belongs to center O3; B2 belongs to center O4; B3 belongs to center O4;
C1 belongs to center O1; C2 belongs to center O3.
So after the third computation, the three new centers are O1(1.5, 3.5), O5(3.67, 9), and O6(7, 4.33).
Similarly, in the fourth computation no point changes cluster, so the process terminates. The final three clusters are: Cluster 1: A2, C1; Cluster 2: A1, B1, C2; Cluster 3: A3, B2, B3.
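The rounds above can be checked with a short script. What follows is only a minimal sketch in plain Python, not part of the original answer: the names POINTS, assign, recompute, and kmeans are hypothetical, and the code assumes that no cluster ever becomes empty (which holds for these eight points). Clusters are reported in the order of the initial seeds A1, B1, C1, so the numbering differs from the cluster numbers used in the answer.

```python
import math

# The eight points from Exercise 10, keyed by label.
POINTS = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4),
    "B1": (5, 8), "B2": (7, 5), "B3": (6, 4),
    "C1": (1, 2), "C2": (4, 9),
}


def assign(points, centers):
    """Assign each point to the index of its nearest center (Euclidean)."""
    clusters = [[] for _ in centers]
    for name, (x, y) in points.items():
        nearest = min(
            range(len(centers)),
            key=lambda i: math.hypot(x - centers[i][0], y - centers[i][1]),
        )
        clusters[nearest].append(name)
    return clusters


def recompute(points, clusters):
    """Return the mean point of each cluster (assumes no cluster is empty)."""
    centers = []
    for members in clusters:
        xs = [points[m][0] for m in members]
        ys = [points[m][1] for m in members]
        centers.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centers


def kmeans(points, centers, max_rounds=100):
    """Alternate assignment and center updates until assignments stop changing."""
    clusters = None
    history = []  # centers obtained after each round that changed an assignment
    for _ in range(max_rounds):
        new_clusters = assign(points, centers)
        if new_clusters == clusters:
            break  # no point moved, so the process terminates
        clusters = new_clusters
        centers = recompute(points, clusters)
        history.append(centers)
    return clusters, centers, history


# Seeds follow the exercise: A1, B1, and C1 are the initial centers.
seeds = [POINTS["A1"], POINTS["B1"], POINTS["C1"]]
clusters, centers, history = kmeans(POINTS, seeds)
print("centers after round 1:", history[0])
print("final clusters:", clusters)
```

Here history[0] holds the centers asked for in part (a), and clusters holds the grouping asked for in part (b).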