FRAUD DETECTION USING DATA MINING TECHNIQUES: APPLICATIONS IN THE MOTOR INSURANCE INDUSTRY

HIAN CHYE KOH AND GABRIEL GERVAIS
SIM University (Singapore)
School of Business
535A Clementi Road
Singapore, 599490
Tel: (65) 6248-9644 Fax: (65) 6462-4377
email: hckoh@unisim.edu.sg

Abstract

Fraud costs stakeholders (e.g., victims, merchants and insurance companies) billions of dollars worldwide, and effective fraud detection is the key to preventing it. This paper examines fraud detection within a data mining framework by first discussing the general approaches to fraud detection and then focusing on particular data mining techniques that can be applied to improve it. Finally, this paper illustrates the special case of fraud detection in the motor insurance industry, where the identification of illicit activities can be particularly challenging because of the nature of motor insurance fraud.

Keywords: Data mining; fraud detection; motor insurance.

INTRODUCTION

Fraud has serious implications in business. Take, for example, the case of credit card fraud, which includes stolen cards, counterfeit cards and compromised accounts (e.g., application fraud and skimming). As reported in Wikipedia (“Credit card fraud,”
2009), the cost of credit card fraud in 2006 was 7 cents per 100 dollars of transactions. Given the huge volume of annual credit card transactions, this translated into fraudulent activities amounting to billions of dollars worldwide. Accordingly, fraud detection has important applications. In particular, effective fraud detection can contribute to fraud prevention.

This paper has two objectives. The first objective is to examine the use of data mining techniques in fraud detection. Within the data mining framework, fraud detection can be done using the clustering approach, expectations approach or predictive modelling approach. The second objective is to focus on motor insurance fraud and illustrate fraud detection in this area.

While credit card fraud is often self-reported (e.g., credit cardholders will quickly find out and report fraudulent transactions made with their credit cards),
motor insurance fraud is much more difficult to pin down (e.g., deliberate “accidents”, inflated claims such as personal injury and unnecessary or excessive repairs). This difficulty is often compounded by possible collusion among different parties (e.g., insurance policyholders and car workshops). Hence, fraud detection in motor insurance can be very challenging.

The paper is organised into the following sections. The next (second) section reviews the literature in data mining, fraud detection and motor insurance fraud. The third section discusses the research methodology, including fraud detection approaches and sample data. The fourth section presents the findings and implications. The illustrations focus on two datasets relating to repairs and claims, respectively. The clustering and expectations approaches are applied. Finally, the concluding section summarises the study and highlights the limitations and future directions.

It is hoped that this exploratory paper can make a contribution to the fraud detection and data mining literature.

LITERATURE REVIEW

The term “data mining” is not new in that it has been used for a long time to denote the idea of unscientific “fishing” or “dredging” of data in data analysis. That is, if an analyst is searching for a particular conclusion, then there is a good chance that this conclusion can be “found” by repeatedly analysing the data in various ways, including inappropriate ways. For a long time, the term “data
mining” has had a negative connotation.

“Data mining” as used today, however, refers to an entirely different concept from that of unscientific data fishing or dredging. The new concept of data mining can be considered a recently developed methodology and technology, coming into prominence only in 1994 (Trybula, 1997).

Judging from the number of its definitions in the literature (for example, Hormozi & Giles (2004) found six different ones), data mining appears to be a discipline whose domain is still evolving. Yet, upon closer inspection of the literature published to date, data mining researchers, practitioners and users do concur on its key aims and characteristics. As a data analysis tool, data mining aims to uncover previously unknown trends and patterns or establish relationships in large datasets so as to help decision makers make better decisions. It fulfils this purpose by employing statistical methods such as cluster and logistic analyses but also by using data analysis methods borrowed from other disciplines (e.g. neural networks in artificial intelligence and decision trees in machine learning).

Data mining has been used by both the public and private sectors, and increasingly so. For instance, governments find it useful in ensuring corporate governance compliance (Songini, 2004), fighting money-laundering activities (Zengan & Mao, 2007) and supporting counter-terrorism activities (Baesens, Mues, Martens, & Vanthienen, 2009). In the private sector, companies use data
mining in business forecasting, marketing (e.g. market segmentation, advertising campaign optimisation and customer churn reduction; see Agosta, 2004), customer relationship management (e.g. customer acquisition and retention; see Hormozi & Giles, 2004) as well as in corporate governance (Ata & Seyreck, 2009; Volonino, Gessner, & Kermis, 2004; Johnson, 2004).

Moreover, because data mining excels at anomaly detection in large datasets (a particularly useful feature for unearthing outliers), financial institutions are increasingly relying on it for risk management (e.g. credit scoring and bankruptcy prediction) (Sinha & Zhao, 2008) and fraud detection and prevention (O'Flaherty, 2005). This is done either internally against computer misuse (Heatley & Otto, 1998) and accounting fraud (Jans, Lybaert, & Vanhoof, 2010; Dubinsky & Warner, 2008) or externally to combat credit/debit card fraud (Hand, Whitrow, Adams, Juszczak, & Weston, 2008; Huber, 2004) or insurance claim fraud (Rejesus, 2004).

Alongside health and medical insurance fraud, motor insurance fraud represents a significant and very costly problem for many stakeholders, not only the insurance companies. According to the Coalition Against Insurance Fraud, a US-based not-for-profit organisation representing the interests of a number of American insurance companies, federal and state government authorities as well as some consumer associations, motor insurance fraud falls into three categories: (1) underwriting fraud, where dishonest drivers try to lower motor insurance premiums by lying on their insurance applications or renewals; (2) staged car accidents; and (3) fraudulent and abusive car accident injury claims, “which added $4.8 billion to $6.8 billion in excess payments to auto injury claims in 2007” (Go figure: fraud data - Auto
insurance, n.d.). Recent studies in motor insurance fraud include Viaene, Derrig, & Dedene (2005), Agyemang et al. (2006), Takeuchi & Yamanishi (2006), Eberle & Holder (2007), Deshmeh & Rahmati (2008) and Lian, Lida, Ying, & Lee (2009). They involve the use of Bayesian learning neural networks, numeric and symbolic outlier mining techniques, time-series analyses, graphs, association analyses and cluster-based outlier detection, respectively.

For example, Eberle & Holder (2007) presented graph-based approaches to uncovering anomalies and developed three algorithms to discover particular anomalous types. They validated all three approaches using synthetic data and found that the algorithms were able to detect the anomalies with very high detection rates and minimal false positives. They also validated the algorithms, with very good results, using real-world cargo data and actual fraud scenarios injected into the dataset.

Deshmeh & Rahmati (2008) addressed the problem of detecting anomalies in horizontally distributed data. They trained local predictors and extracted association rules using the difference between predicted and actual values on a context dataset. These association rules are used to represent normal and anomalous behaviours, while a final set of learners uses these representations to detect anomalies.

Further, in Lian et al. (2009), outlier detection was applied to detect observations that were grossly different from or inconsistent with the remaining observations in the
dataset. Traditionally, outliers are considered as single points. However, many abnormal events have both temporal and spatial locality and might form small clusters that also need to be deemed outliers. In this context, Lian et al. (2009) presented a new definition and detection algorithm for cluster-based outliers, which is meaningful and gives weight to local data behaviour.

RESEARCH METHODOLOGY

This section discusses fraud detection approaches in general and the analyses performed in the study in particular. It also discusses the data used in the illustrations. For confidentiality reasons, no real cases are incorporated in the datasets. Instead, the datasets are simulated based on patterns found in motor insurance data.

Fraud Detection Approaches

Generally, fraud detection can be done using: (1) the clustering approach, (2) the expectations approach, and (3) the predictive modelling approach. While the first two approaches highlight suspicious cases for further fraud investigation, the last approach directly predicts the probability of fraud.

The clustering approach focuses on “normal” patterns/clusters and searches for deviations from the “norm”. These deviations flag suspicious cases that may be further investigated for fraud. They indicate outliers only and not necessarily fraud cases. On the other hand, the expectations approach focuses on what should be the (expected) value and compares it with what is the (actual) value. Large deviations are suspicious.
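
The clustering approach described above can be sketched in a few lines of Python. This is a minimal illustration only, not the outlier clustering used in the study (which was run in IBM-SPSS Modeller): it substitutes a plain k-means with a deterministic start, and the data points, the function names `kmeans` and `flag_outliers`, and the rule of flagging points that lie more than three times the median distance from their nearest centroid are all assumptions made for the example.

```python
import math
import statistics


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as the initial centroids."""
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the previous centroid if a cluster empties
                centroids[i] = [statistics.fmean(col) for col in zip(*group)]
    return centroids


def flag_outliers(points, centroids, factor=3.0):
    """Return indices of points unusually far from their nearest centroid."""
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    typical = statistics.median(dists)
    return [i for i, d in enumerate(dists) if d > factor * typical]


# Hypothetical observations: (odometer in 10,000 km, repair cost in S$1,000).
points = [
    (2.0, 3.0), (10.0, 8.0), (2.2, 3.1), (1.8, 2.9), (2.1, 3.3), (1.9, 2.7),
    (10.2, 8.1), (9.8, 7.9), (10.1, 8.3), (9.9, 7.7), (2.3, 2.8), (9.7, 8.2),
    (6.0, 30.0),  # sits far from both "normal" clusters
]

centroids = kmeans(points, k=2)
flagged = flag_outliers(points, centroids)
print(flagged)
```

As in the text, a flagged index marks an outlier worth investigating, not a confirmed fraud case.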

The expectations approach requires a predictive model that generates the expectations.

Finally, the predictive modelling approach constructs a predictive model that predicts the probability of fraud. Such a model attempts to differentiate fraud from non-fraud cases and hence requires data from both categories. This data requirement may be difficult to satisfy in some types of fraud (e.g., motor insurance fraud). In particular, the fraud data may not be sufficient because there may not be many cases of confirmed fraud, relative to non-fraud cases.

This research employs only the clustering and expectations approaches. Further, data mining techniques such as outlier clustering (similar to IBM-SPSS's proprietary TwoStep clustering) and decision trees are used to generate the clustering results and expected/predicted values. The data mining software IBM-SPSS Modeller (previously called SPSS Clementine) is used in this study.

Sample Data

Two motor insurance datasets are used in this study. The first dataset comprises repairs data with the following inputs: (1) age of driver, (2) gender of driver, (3) claim type, (4) number of injuries, (5) excess amount category, (6) repair workshop, (7) odometer reading, (8) brand of car, (9) year of manufacture, (10) initial estimate of repair costs, and (11) final amount of repair cost. There are a total of 15,000 observations in the repairs dataset.

The second dataset comprises claims data and has 50,000 observations. It captures the following inputs: (1) age of
insured, (2) gender of insured, (3) marital status of insured, (4) occupation of insured, (5) nationality of insured, (6) policy type, (7) policyholder type, (8) number of policy renewals, (9) type of accident, (10) type of damage or injury, (11) property or body damage, (12) insurance coverage under claim, (13) whether own damage or third party claim, (14) whether claimant is policyholder, (15) claim amount, and (16) paid out amount.

Although many inputs are captured in both datasets, not all the inputs are used in the fraud detection analyses. In particular, inputs that contain substantial missing values are not included.

FINDINGS AND IMPLICATIONS

The analyses performed can be grouped according to the datasets on which they are performed, namely the repairs dataset and the claims dataset.

Repairs Dataset

Two models were constructed on the repairs dataset. The first model used the expectations approach to estimate/compute the expected repair cost, while the second model looked at the difference between the final amount and initial estimate of the repair cost (i.e., diff = final amount - initial estimate).

Repairs Model 1

In this model, the following inputs were included in the analysis: (1) age of driver, (2) gender of driver, (3) claim type, (4) number of injuries, (5) excess amount category, (6) repair workshop, (7) odometer reading, (8) brand of car, and (9) year of manufacture. The output was the final amount of repair cost. Regression analysis, neural network, CHAID and CART (the last two being decision trees) were employed to construct the prediction model.

Figure 1 shows the accuracy and hit rates of the models. Based on these results, the neural network performed the best, followed closely by the CHAID and CART models. The regression model did not perform well in terms of predicting the final amount of repair cost. Given the advantages of decision trees (e.g., ease of interpretation and deployment), the CHAID decision tree was selected for fraud detection.

-Insert Figure 1 about here-

The CHAID decision tree indicates that the following inputs are significantly
associated with the repair cost: (1) brand of car, (2) year of manufacture, and (3) repair workshop. In particular, luxury cars and newer cars show a higher level of repair costs. Certain repair workshops tend to charge more too. Figure 2 presents the CHAID decision tree.

-Insert Figure 2 about here-

To flag suspicious cases, the difference between the final amount of repair cost and the expected/predicted repair cost (as generated by the CHAID decision tree) was computed. Figure 3 shows the observations with high differences. In particular, 46 observations have differences greater than S$30,000. These are the flagged suspicious fraud cases.

-Insert Figure 3 about here-

Repairs Model 2

The second model looks at the difference between the final amount and initial estimate of the repair cost (i.e., diff = final amount - initial estimate). The difference is dichotomised such that “2” represents a difference of more than S$1,000. A larger amount shows a greater difference between the expected repair cost (b
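
The flagging step of Repairs Model 1 can be illustrated with a small Python sketch. It is not the study's CHAID model: as a labelled stand-in for a decision-tree leaf prediction, the expected cost here is simply the median final cost among records sharing the same brand category and workshop. The records, the field layout and the function names `expected_costs` and `flag_suspicious` are hypothetical; only the S$30,000 cut-off comes from the text.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (claim_id, brand_category, workshop, final_cost_sgd)
repairs = [
    ("C01", "luxury", "W1", 42000.0),
    ("C02", "luxury", "W1", 45000.0),
    ("C03", "luxury", "W1", 90000.0),   # far above its segment
    ("C04", "economy", "W2", 6000.0),
    ("C05", "economy", "W2", 7500.0),
    ("C06", "economy", "W2", 5200.0),
    ("C07", "economy", "W2", 41000.0),  # far above its segment
]

THRESHOLD_SGD = 30000.0  # cut-off quoted in the text


def expected_costs(records):
    """Expected cost per (brand, workshop) segment: the segment median.

    Assumption: the median stands in for the tree's leaf prediction.
    """
    by_segment = defaultdict(list)
    for _, brand, workshop, cost in records:
        by_segment[(brand, workshop)].append(cost)
    return {segment: median(costs) for segment, costs in by_segment.items()}


def flag_suspicious(records, threshold=THRESHOLD_SGD):
    """Return (claim_id, difference) where actual exceeds expected by more than threshold."""
    expected = expected_costs(records)
    flagged = []
    for claim_id, brand, workshop, cost in records:
        diff = cost - expected[(brand, workshop)]
        if diff > threshold:
            flagged.append((claim_id, diff))
    return flagged


suspicious = flag_suspicious(repairs)
for claim_id, diff in suspicious:
    print(claim_id, diff)
```

A median is used rather than a mean so that a single inflated claim does not pull up the expectation for its own segment; a fitted tree or other predictive model would take its place in practice.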
