Foreign-literature translation: Gesture Spotting Using a Wrist Microphone and a 3-Axis Accelerometer.doc

Gesture Spotting Using Wrist Worn Microphone and 3-Axis Accelerometer (Original text 2)

Abstract. We perform continuous activity recognition using only two wrist-worn sensors: a 3-axis accelerometer and a microphone. We build on the intuitive notion that two very different sensors are unlikely to agree in classification of a false activity. By comparing imperfect, sliding window classifications from each of these sensors, we are able to discern activities of interest from null or uninteresting activities. Where one sensor alone is unable to perform such partitioning, using comparison we are able to report good overall system performance of up to 70% accuracy. In presenting these results, we attempt to give a more in-depth visualization of the errors than can be gathered from confusion matrices alone.

1 Introduction

Hand actions play a crucial role in most human activities. As a consequence, detecting and recognising such activities is one of the most important aspects of context recognition. At the same time it is one of the most difficult. This is particularly true for continuous recognition where a set of relevant hand motions (gestures) needs to be spotted in a data stream. The difficulties of such recognition stem
from two things. First, due to a large number of degrees of freedom, hand motions tend to be very diverse. The same activity might be performed in many different ways even by a single person. Second, in terms of motion, hands are the most active body parts. We move our hands continuously, mostly in an unstructured way, even when not doing anything particular with them. In fact, in most situations such unstructured motions by far outnumber gestures that are relevant for context recognition. This means that a continuous gesture spotting application has to deal with a zero class that is difficult to model while taking up most of the signal.

1.1 Paper Contributions

Our group has invested a considerable amount of work into hand gesture spotting. To date this work has focused on using several sensors distributed over the user's body to maximise system performance. This included motion sensors (3-axis
accelerometer, 3-axis gyroscopes and 3-axis magnetic sensors) on the upper and lower arm [3], microphone/accelerometer combination on the upper and lower arm [5] as well as, more recently, a combination of several motion sensors and ultrasonic location devices.

This paper investigates the performance of a gesture spotting system based on a single, wrist-mounted device. The idea behind the work is that wrist-mounted accessories are broadly accepted and worn by most people on a daily basis. In contrast, systems that require the user to put on several sensors at locations such as the upper arm would have
many more problems with user acceptance.

The downside of this approach is the reduced amount of information available for the recognition. This means, for example, that the method of analysing sound intensity differences between microphones on different parts of the body, which was the cornerstone of our previous signal partitioning work, is not feasible. This problem is compounded by the fact that, for the approach to make sense, the wrist-mounted device can neither contain too many sensors nor require computing and/or communication power that would imply large, bulky batteries.

The main contribution of the paper is to show that, for a certain subset of activities, reasonable gesture spotting results can be achieved with a combination of a microphone and 3-axis accelerometer mounted on the wrist. Our method relies on simple jumping window sound processing algorithms that we have shown [10] to
require only minimal computational and communication performance. For the acceleration we use inference on Hidden Markov Models (HMM), again on jumping windows across the data.

To our knowledge this is the first time that such a simple system and a straightforward jumping window method have been successfully used for hand gesture spotting in a continuous data stream with a dominant, unstructured zero class. Previously such setups and algorithms have only been shown to be successful either for segmented recognition or for scenarios where the zero class was either easy to model or not relevant (e.g. recognition of standing, sitting, walking, running [6, 9, 12]). Where these approaches use acceleration sensors, in the work of [?], [?] sound was exploited for performing situation analysis in the wearable computing domain. Also [?] used sound information to improve the performance of hearing aids. Complementary information from sound and acceleration has been used before to detect defects in material surfaces, e.g. in [13], but no work that the authors are aware of uses these for recognition of complex activities.

In the paper we summarise the sound and acceleration algorithms and then focus on the performance of different fusion methods. It is shown that appropriate fusion is the key to achieving good performance despite simple sensors and algorithms. We verify our approach on data from a wood workshop assembly experiment that we have introduced and used in previous work [5]. We present the results
using both traditional confusion matrices and a novel visualisation method that provides a more in-depth understanding of the error types.

2 Recognition Method

We apply sliding windows of length $w_{len}$ seconds across all the data in increments of $w_{jmp}$. At each step we apply an LDA-based classification on the sound data, and an HMM classification on the acceleration data. The 'soft' results of each classification (LDA distances for sound and HMM class likelihoods for acceleration) are converted into class rankings, and these are fused together using one of two methods: comparison of top rank (COMP), and a method
using Logistic Regression (LR).

2.1 Frame by Frame Sound Classification Using LDA

Frame-by-frame sound classification was carried out using pattern matching of features extracted in the frequency domain. Each frame represents a window on 100 ms of raw audio data. These windows are then jumped over the entire dataset in 25 ms increments, producing a 40 Hz output.

The audio stream was taken at a sample rate of 2 kHz from the wrist-worn microphone. From this a Fast Fourier Transform (FFT) was carried out on each 100 ms window, generating a 100-bin output vector ($\frac{1}{2} \cdot f_s \cdot \mathit{fft}_{wnd} = \frac{1}{2} \cdot 2000 \cdot 0.1 = 100$ bins).

Making use of
the fact that our recognition problem requires a small, finite number of classes, we applied Linear Discriminant Analysis (LDA) [1] to reduce the dimensionality of these FFT vectors from 100 to #Classes - 1.

Classification of each frame can then be carried out using a simple Euclidean minimum distance calculation. Whenever we wish to make a decision, we simply calculate the incoming point in LDA space and find its nearest class mean value from the training dataset. This saving in computational complexity by dimensionality reduction comes at the comparatively minor cost of requiring us to compute and store a set of LDA class mean values from which the LDA distances might be obtained.

Equally, a nearest neighbour approach might be used. For the experiment described here, however, Euclidean distance was found to be sufficient.

A larger window, $w_{len}$, was moved over the data in $w_{jmp}$ second increments. This relatively large window was chosen to reflect the fact that all of the activities we are interested in occur at the timescale of at least several seconds. On each window we compute a sum of the constituent LDA distances for each class. From these total distances, we then rank each class according to minimum distance. Classification of the window is then simply a matter of choosing the top ranking class.

2.2 HMM Acceleration Classification

In contrast to the approach used for sound recognition, we employed model-based classification, specifically the Hidden Markov Model (HMM), for classifying accelerometer data [8, 11]. (The implementation of the HMM learning and inference routines for this experiment was provided courtesy of Kevin P. Murphy's HMM Toolbox for MATLAB [7].)

The features used to feed the HMM models were calculated from sliding 100 ms windows on the x, y, and z axes of the 100 Hz sampled acceleration data. These windows were moved over the data in 25 ms increments, producing the following features, output at 40 Hz:
- Mean of x-axis
- Variance of x-axis
- A count of the number of peaks (for x, y, z)
- Mean amplitude of the peaks (for x, y, z)

Finally we globally standardised the features so as to avoid numerical complications with the model learning algorithms in MATLAB.

In previous work we employed single Gaussian observation models, but this was found to be inadequate for some classes unless a large number of states were used. Intuitively, the descriptive power of a mixture of Gaussians is much closer to 'reality' than only
one, and so for these classes a mixture model was used. The specific number of mixtures and the number of hidden states used were individually tailored by hand for each class. The parameters themselves were trained from the data.

A window of $w_{len}$, in $w_{jmp}$ increments, was run over the acceleration features, and the corresponding log likelihood for each HMM class model calculated. Classification is carried out for each window by choosing the class which produces the largest log likelihood given the stream of feature data from the test set.

2.3 Fusion of classifiers

Comparison of top choices (COMP). The top
rankings from each of the sound and acceleration classifiers for a given jumping window segment are taken, compared, and returned as valid if they agree. Those where the classifiers disagree are thrown out, i.e. classified as null.

Logistic regression (LR). The main problem with a direct comparison of top classifier rankings is that it fails to take into account cases where one classifier might be more reliable than another at recognising particular classes. If one classifier reliably detects a class, but the other classifier fails to, perhaps relegating the class to second or third rank, then a basic comparison
would just assign null. For such cases, a 'softer' method of classifier fusion is needed: one that takes into account the different rankings of each classifier.

In the work of Ho et al. [2], three methods for classifier fusion based on class rankings are presented and evaluated: Highest Rank, whereby each
class is assigned a rank according to the highest rank assigned to it by any of the classifiers; Borda count, whereby each class is ranked according to the total number of classes ranking below it by each classifier; and Logistic Regression (LR), a method based on the Borda count, but which estimates weights for each class combination using regression.

Of the methods presented, only Logistic Regression (LR) makes sense to apply here, as it is the only one which provides the scope to deal with assigning results to null.

The basic motivation behind LR is to assign a score for each class
and every combination of classifier rankings. However, such a scoring would soon become computationally prohibitive, even for a moderate number of classes and classifiers. Instead, LR makes use of a linear function to estimate the likelihood of whether a class is correct or not for a given set of rankings. Such a regression function, estimating a binary outcome with $P(\mathrm{true} \mid X, \mathrm{class})$ or $P(\mathrm{false} \mid X, \mathrm{class})$, would be far simpler to compute. So for each class a function can be computed:

$L(X) = \alpha + \sum_{i=1}^{m} \beta_i x_i$

where $X = [x_1, x_2, \ldots, x_m]$ are the rankings of the class for each of the $m$ classifiers, and $\alpha$ and $\beta$ are the logistic regression coefficients. These coefficients can be computed by applying a suitable regression fit using the correctly classified ranking combinations from, for example, training data.

So that unlikely combinations are assigned to null, we introduce an empirically obtained threshold on $L(X)$ for each class. Of the classes which fall below this threshold, the most likely $L(X)$ value is taken and re-assigned to the null class. This means that if all classes fall below their threshold for a given ranking combination, then null will take top ranking.

Classification can then be carried out by estimating $L(X)$ for
each class on the input rankings, comparing with the null threshold, and then ranking the values obtained. The final classification result can then be taken from the highest rank.

3 Experimental setup

Data was collected using a Sony microphone and a 3-axis accelerometer (from the ETH PadNET sensor network [4]) strapped to the wrist. Each subject was asked to follow a predefined sequence of activities using tools in the wood workshop of our lab. The 9 activities which we set out to spot were: hammering (h), sawing (s), filing (f), using a machine drill (r), sanding (a), using a machine grinder (g), screwdriving (w), opening and closing a vise (v), opening and closing a drawer (d). All other activities and movements were labeled as null ().

Each subject performed the entire sequence in about 5 minutes. In all, twenty such sets of data were collected from five different subjects.

4 Results

The system was initially evaluated across sweeps of the two main parameters, window length $w_{len}$ and window jump length $w_{jmp}$. From these sweeps, setting both $w_{len}$ and $w_{jmp}$ to 2 seconds was found to produce favourable results. All further analysis was carried out with these parameters set.

Both the LDA and HMM methods require training of parameters using data. This was carried out in a user-dependent leave-one-out fashion. That is, for each set under test, the training data was taken from the sets of the same user but not including the set under test.

We applied HMM classification to the accelerometer data, and LDA minimum distance to the audio. This was applied to all 20 sets of data. Typical results from one of these sets are plotted in Figure 1, with class predictions compared alongside the hand-labelled ground truth.

With each of the 2 second segments, we then carried out firstly the classification comparison fusion, and then the logistic regression using the rankings obtained from the HMM likelihood and LDA distance information.

On first run, the LR method continued to produce a large number of insertions, primarily from the class screwdriving. This was due to the fact that this is a comparatively silent class, and as the training data consisted mostly of noisy, positive class examples (at no stage do we use null labelled data for training), it winds up being a catch-all class for non-activities which should have been assigned null. Reducing the weights of the ranking combinations for this class
during training helps to alleviate this problem.

The final predictions from each of these, compared alongside the ground truth, are shown in Figure 2.

Lacking any ability to distinguish valid activities from null, the constituent classifiers, as expected, produce much noise, with LDA tending to misclassify null as a quiet class, such as screwdriving, and HMM generally giving random misclassifications. Both perform relatively well when set
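
As an illustration of the window ranking of Sections 2.1 and 2.2 and the COMP fusion of Section 2.3, the following is a minimal Python sketch, not the authors' implementation (which used MATLAB). The function names, the dictionary layout and the toy numbers are assumptions made for this example only; the class letters follow Section 3.

```python
from typing import Dict, List, Sequence

NULL = "null"  # label returned when the two classifiers disagree


def rank_by_distance(frame_distances: Sequence[Dict[str, float]]) -> List[str]:
    """Sum the per-frame LDA distances of one window for each class.

    The class with the smallest total distance is ranked first.
    """
    totals: Dict[str, float] = {}
    for frame in frame_distances:
        for cls, dist in frame.items():
            totals[cls] = totals.get(cls, 0.0) + dist
    return sorted(totals, key=lambda c: totals[c])


def rank_by_loglik(logliks: Dict[str, float]) -> List[str]:
    """Rank classes by HMM log likelihood, largest first."""
    return sorted(logliks, key=lambda c: logliks[c], reverse=True)


def comp_fusion(sound_ranking: Sequence[str], accel_ranking: Sequence[str]) -> str:
    """Return the shared top choice, or null if the top choices differ."""
    if sound_ranking and accel_ranking and sound_ranking[0] == accel_ranking[0]:
        return sound_ranking[0]
    return NULL


# Toy window with two sound frames and made-up values (hypothetical data).
frames = [{"h": 0.2, "s": 1.5, "w": 0.9}, {"h": 0.4, "s": 1.1, "w": 0.8}]
sound = rank_by_distance(frames)
accel = rank_by_loglik({"h": -10.0, "s": -25.0, "w": -18.0})
print(comp_fusion(sound, accel))
```

Both classifiers rank hammering (h) first in this toy window, so COMP returns that class; any disagreement in the top choice would yield null instead.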

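
The LR fusion of Section 2.3 can be sketched in the same spirit. This is one possible reading of the description, under stated assumptions: rank 1 denotes the best rank (so the illustrative weights are negative), and the coefficients and thresholds below are made-up values rather than fitted ones. In the paper they come from a regression fit and from empirical tuning.

```python
from typing import Dict, Sequence, Tuple

NULL = "null"


def lr_score(ranks: Sequence[int], alpha: float, betas: Sequence[float]) -> float:
    """Linear score L(X) = alpha + sum_i beta_i * x_i for one class."""
    if len(ranks) != len(betas):
        raise ValueError("need one coefficient per classifier ranking")
    return alpha + sum(b * x for b, x in zip(betas, ranks))


def lr_fusion(
    rankings: Dict[str, Sequence[int]],
    coeffs: Dict[str, Tuple[float, Sequence[float]]],
    thresholds: Dict[str, float],
) -> str:
    """Pick the top class after thresholding, following Section 2.3.

    Classes scoring below their own threshold are dropped, and the best of
    the dropped scores is re-assigned to the null class. The winner is the
    highest score among the surviving classes and null.
    """
    candidates: Dict[str, float] = {}
    null_score = None
    for cls, ranks in rankings.items():
        alpha, betas = coeffs[cls]
        score = lr_score(ranks, alpha, betas)
        if score >= thresholds[cls]:
            candidates[cls] = score
        elif null_score is None or score > null_score:
            null_score = score
    if null_score is not None:
        candidates[NULL] = null_score
    if not candidates:
        return NULL
    return max(candidates, key=lambda c: candidates[c])


# Hypothetical coefficients and thresholds for two classes, two classifiers.
coeffs = {"h": (3.0, (-1.0, -1.0)), "w": (3.0, (-1.0, -0.5))}
thresholds = {"h": 0.0, "w": 0.5}
print(lr_fusion({"h": (1, 1), "w": (2, 2)}, coeffs, thresholds))
print(lr_fusion({"h": (2, 3), "w": (3, 2)}, coeffs, thresholds))
```

In the first call hammering clears its threshold and wins; in the second call every class falls below its threshold, so null takes the top rank, as the text describes.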