Detecting Coalition Attacks in Online Advertising: A hybrid data mining approach

Qinglei Zhang; Wenying Feng; Qinglei Zhang; Wenying Feng

doi:10.3934/bdia.2016006

Big Data and Information Analytics

2016, Volume 1, Issue 2: 227-245. doi: 10.3934/bdia.2016006

Previous Article Next Article

Detecting Coalition Attacks in Online Advertising: A hybrid data mining approach

Qinglei Zhang ,
Wenying Feng ^,

Department of Computing & Information Systems Department of Mathematics, Trent University Peterborough, Ontario K9J 0G2, Canada

Received: 01 February 2016 Revised: 01 September 2016 Published: 01 July 2016

Coalition attack is nowadays one of the most common type of attacks in the industry of online advertising. In this paper, we attempt to mitigate the problem of frauds by proposing a hybrid framework that detects the coalition attacks based on multiple metrics. We also articulate the theoretical basis for these metrics to be integrated into the hybrid framework. Furthermore, we instance the framework with two metrics and develop a detection system that identifies the coalition attacks from two distinguish perspectives.

Keywords:

Citation: Qinglei Zhang, Wenying Feng. Detecting Coalition Attacks in Online Advertising: A hybrid data mining approach[J]. Big Data and Information Analytics, 2016, 1(2): 227-245. doi: 10.3934/bdia.2016006

Related Papers:

[1]	Ana Jofre, Lan-Xi Dong, Ha Phuong Vu, Steve Szigeti, Sara Diamond . Rendering website traffic data into interactive taste graph visualizations. Big Data and Information Analytics, 2017, 2(2): 107-118. doi: 10.3934/bdia.2017003
[2]	Zhen Mei . Manifold Data Mining Helps Businesses Grow More Effectively. Big Data and Information Analytics, 2016, 1(2): 275-276. doi: 10.3934/bdia.2016009
[3]	Anupama N, Sudarson Jena . A novel approach using incremental under sampling for data stream mining. Big Data and Information Analytics, 2018, 3(1): 1-13. doi: 10.3934/bdia.2017017
[4]	Weidong Bao, Wenhua Xiao, Haoran Ji, Chao Chen, Xiaomin Zhu, Jianhong Wu . Towards big data processing in clouds: An online cost-minimization approach. Big Data and Information Analytics, 2016, 1(1): 15-29. doi: 10.3934/bdia.2016.1.15
[5]	Xiaoying Chen, Chong Zhang, Zonglin Shi, Weidong Xiao . Spatio-temporal Keywords Queries in HBase. Big Data and Information Analytics, 2016, 1(1): 81-91. doi: 10.3934/bdia.2016.1.81
[6]	Sunmoo Yoon, Maria Patrao, Debbie Schauer, Jose Gutierrez . Prediction Models for Burden of Caregivers Applying Data Mining Techniques. Big Data and Information Analytics, 2017, 2(3): 209-217. doi: 10.3934/bdia.2017014
[7]	Pankaj Sharma, David Baglee, Jaime Campos, Erkki Jantunen . Big data collection and analysis for manufacturing organisations. Big Data and Information Analytics, 2017, 2(2): 127-139. doi: 10.3934/bdia.2017002
[8]	Ricky Fok, Agnieszka Lasek, Jiye Li, Aijun An . Modeling daily guest count prediction. Big Data and Information Analytics, 2016, 1(4): 299-308. doi: 10.3934/bdia.2016012
[9]	Wenxue Huang, Yuanyi Pan . On Balancing between Optimal and Proportional categorical predictions. Big Data and Information Analytics, 2016, 1(1): 129-137. doi: 10.3934/bdia.2016.1.129
[10]	Maria Gabriella Mosquera, Vlado Keselj . Identifying electronic gaming machine gambling personae through unsupervised session classification. Big Data and Information Analytics, 2017, 2(2): 141-175. doi: 10.3934/bdia.2017015

Abstract

1. Introduction

As an opportunity for organizations and companies to reach out millions of potential customers at a low cost, online advertising costs for nearly one-quarter of the global ad spent [1]. Figure 1 illustrates the process of online advertising. The advertiser manages the advertising campaigns for various brands, while internet publisher provides the inventory for hosting the ads. Moreover, advertisers and publishers can communicate directly or indirectly. Exchangers such as Google's DoubleClick [2] and Yahoo!'s RightMedia [3] are involved as intermediaries who pair the publishers' ad request with the most profitable advertiser bid. During the life-cycle of an advertisement, an impression (view) means the ad is displayed on the publisher's webpage when a visitor loading the page. Consequently, when the visitor clicks on the ad a click event is fired and the visitor is taken to the brand's landing page. Finally, a conversion is recorded if the visitor performs certain actions on the brand's websites.

Figure 1. The process of online advertising.

DownLoad: Full-Size Img PowerPoint

In the context of online advertising, a buyer/seller relationship is established between the advertiser and publisher [26]. With accordance to the three basic events during the advertising life-cycle, the three price models in the industry are PPM (Pay-Per-Mille¹), PPC (Pay-Per-Click), and PPA (Pay-Per-Action) [7,22,27], respectively. Since publishers earn revenue based on the number of views, clicks, or actions that are associated with the ads, publishers have strong incentives to maximize those numbers [15,16]. While publishers can apply legitimate methods to attract more traffic to their websites, malicious publishers attempt to make more money by generating invalid (fraudulent) traffic. Such publisher frauds can be further categorized as impression frauds, click frauds, and conversion (action) frauds [7,9,22]. Various types of fraud attacks from both industry and academia have been identified in the literature (e.g. [20,21,24]). To generate traffic to publishers' websites, attacks can be launched either on publishers' websites or on machines that are controlled by malicious publishers, and attacks can be performed either by real human or bots. More details of some existing online advertising frauds are discussed in [26].

¹Mille means thousand impressions in the context.

In online advertising industry, the number of views, clicks, and actions of ads is critical to the revenue of advertisers. Advertisers use several metrics such as CPM (Cost Perm Mille), CTR (Click Through Rate), and CR (Conversion Rate) as the KPIs (Key Performance Indicators) to evaluate the performance of their market campaigns. However, the inflated numbers of views, clicks, and actions that are generated by fraudulent publishers hinder the reliability and effectiveness of those performance metrics. Publisher fraud is a severe problem for advertisers and worth the endeavour to detect and prevent.

Publisher frauds are mainly launched with two key entities: fraudulent sites and machines. With respect to correlations between machines and sites, publisher frauds can be classified as coalition attacks and non-coalition attacks. As illustrated in Figure 2, the non-coalition attack corresponds to a one-to-one relationship, and the coalition attack represents a many-to-many relationship. An elementary attack can be easily detected and blocked by identifying repeated views, clicks, or actions from one machine on one site. To avoid being detected, fraudulent publishers attempt to blur the strong correlation between the fraudulent sites and machines. A single fraudulent publisher needs to simulate fake visitors, or frequently change the identification of users to blur the relationship, which significantly increases the difficulty and cost for launching the attacks. On the other hand, several fraudulent publishers can make a collusion and share their resources, which also blur the correlation between a pair of fraudulent sites and machines. Since coalition attacks can be launched without developing sophisticated techniques, they have become more and more popular recently. While coalition attacks can be deployed easily and effectively, the detection of such attacks is more challengeable as an open issue.

Figure 2. Different types of publisher frauds.

DownLoad: Full-Size Img PowerPoint

Both active and passive approaches can be used to detect the fraudulent problems in general. To verify the validity of the traffic, active approaches usually require the interaction with the web pages and/or visitors (e.g. [20]). Passive detection approaches, on the other hand, apply data analysis/mining techniques to recognize the behaviour pattern of traffics (e.g. [11,19,21,23,25,27]). While active detection techniques typically adopt the signature based detection mechanism to mark the fraudulent traffic, passive detection techniques can use both signature based (e.g. [11]) and anomaly based (e.g. [23]) approaches to identify frauds. Moreover, the active detection may change the model of advertising network and compromise the privacy of visitors, which is sometimes impractical. On the other hand, the premise for data mining approach to detect frauds is to identify correlations between publishers and visitors. Due to the flexibility and feasibility of the passive detection measurement, we adopt the data mining approach in this paper. In particular, we propose an algorithm to detect correlations of publishers and visitors that identify coalition attacks.

The paper is structured as follows. Section 2 introduces background and related notations of coalition attacks and parallel computing. Section 3 presents the proposed detection system for coalition attacks. Next, in Section 4 we demonstrate the evaluation results of our new techniques. Finally, some conclusions and future work are discussed in Section 5.

2. Background

2.1. Identification for coalition attacks

A classical way to detect publisher frauds is to block suspicious domains when their impressions and CTR exceed a hard (static) threshold. However, fraudsters can easily comprise the classical detection tools by mimicking legitimate criteria. To identify malicious intents of fraudsters, the key is to identify the association between fraudulent websites and machines. In the literature, two types of metrics have been explored. One set of metrics is derived from the graphic topology (e.g. sets of neighbours and paths between nodes). Those metrics are more generic and can be applied to any domain. In our context, real network data indicates that two legitimate sites should have different sets of visitors [17,18]. Therefore, the degree of overlap among visitors of two sites can be used to measure the collusion of two sites. Different similarity metrics having been proposed for generic link prediction problems, such as Jaccard coefficient [8] and Adamic [6] can be used to estimate the similarity of visitors for two publishers. For example, Jaccard coefficient is employed for the coalition attack problem in [17]. On the other hand, another set of metrics is associated with attributes of vertices and links, which requires the domain knowledge. In the context of e-commerce, as ROI (Return of Investor) is one of the key concerns, coalition groups incentively have high ROI than normal publishers. Therefore, different variants of ROI can be derived as the metrics for identifying frauds. For example, in [12], a metric called GPR (Gain-Per-Resource) is proposed to detect the coalition attacks.

In this paper, we propose a hybrid approach that combines the metrics from different perspectives to detect the coalition attacks. In particular, the similarity and GPR are selected as two representative metrics in our technique. To systematically and automatically detect coalition attacks, we give formal definitions for identifying coalition attacks based on similarity and GPR in the rest of this section.

The ratio of gain and cost is an estimator of ROI (Return of Investment), which can be formally defined as:

$GPR(g) = {W(g) \over R(g)},$

(1)

where $W(g)$ denotes the gain of group $g$ and $R(g)$ denotes the amount of resources for the group. Given the definition of the GPR metric, the GPR value of coalition groups will exceed a threshold, denoted as $\theta_{gpr}$ . However, a group that consists of coalitions and normal publishers may still have a high GPR. In order to exclude such cases, a definition of GPR-GROUP is further introduced.

Definition 2.1. A group is called a GPR-GROUP iff

$\forall \left( g\ \ |\ \ {g}'\subset g\ \ \text{:}\ \ GPR({g}')<GPR(g) \right),$

where $g'$ denotes any subgroup of $g$ .

A coalition group can be defined in terms of GPR.

Definition 2.2 (gpr-based identification). (1) A group of publishers and visitors, denoted as $g$ , is called a coalition iff

$GPR(g) \geq \theta_{gpr} \; \mbox{ and }\; g \; \mbox{is GPR-GROUP}.$

(2) A coalition $g$ is a maximal coalition iff

$\neg \exists \left( {g}''\ \ |\ \ g\subset {g}''\ \ :\ \ {g}''\text{ is a coalition} \right).$

Let $S_{p_1}$ and $S_{p_2}$ denote two sets of visitors to publishers $p_1$ and $p_2$ , respectively. The Jaccard coefficient, ${|S_{p_1} \cap S_{p_2}| \over |S_{p_1} \cup S_{p_2}|}$ , captures the traffic similarity of $p_1$ and $p_2$ . However, since the number of repeated visitors is also a signal of frauds in online advertising, the set model for publishers' visitors will lose important information for detecting frauds. Therefore, we use a variant of Jaccard coefficient w.r.t a bag model² to represent publishers' visitors. Formally, let $B_{p_1}$ and $B_{p_2}$ denote two bags of visitors to publishers $p_1$ and $p_2$ , respectively. The similarity of a pair of sites $p_1$ and $p_2$ is modelled by the formula:

²The elements in bag models can be duplicated.

$\mbox{Similarity}(p_1, p_1) = {|B_{p_1} \sqcap B_{p_2}| \over |B_{p_1} \sqcup B_{p_2}|},$

(2)

where $\sqcap$ and $\sqcup$ denote the intersection and union of bags. We use a pair $(n, e)$ to represent an element in a bag, where $n$ indicates how many times $e$ repeats in the bag. Then $\sqcap$ and $\sqcup$ can be defined as the following:

$\forall \left( (n, e)\ |\ (n, e)\in A\sqcap B:\ \ \ \exists \left( {{m}_{1}}, {{m}_{2}}\ |\ ({{m}_{1}}, e)\in A\wedge ({{m}_{2}}, e)\in B\ :\ n = min({{m}_{1}}, {{m}_{2}}) \right) \right)$

$\begin{align} &\forall \left( (n, e)\ \ |\ \ (n, e)\in A\sqcup B:\ \ \ ((n, e)\in A\wedge \neg \exists \left( k\ \ |\ \ k\in N\ :\ \ (k, e)\in B \right) \right) \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \vee ((n, e)\in B\wedge \neg \exists \left( k\ \ |\ \ k\in N\ \ :\ \ (k, e)\in A) \right) \\ &\ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \ \vee (\exists \left( {{m}_{1}}, {{m}_{2}}\ |\ ({{m}_{1}}, e)\in A\wedge ({{m}_{2}}, e)\in B\ :\ n = max({{m}_{1}}, {{m}_{2}}) \right)). \\ \end{align}$

With the definition of similarity metric, a pair-wise similar group means that the similarity of each pair in the group is greater than a threshold, denoted by $\theta_{sim}$ . Since two legitimate sites only have negligible similarity, it is unlikely that a random group is pair-wise similar. On the other hand, coalition groups share the resources, which would generate similar traffics inherently. Therefore, coalition groups can be identified with regard to the traffic similarity.

Definition 2.3 (similarity-based identification).A group of publishers $g$ is a coalition group iff

${\forall} {p_1, p_2}: \; {(p_1\in g) \wedge (p_2 \in g) \wedge (p_1 \neq p_2)} \Longrightarrow {\mbox{ similarity} \; (p_1, p_2) \geq \theta_{sim}}.$

2.2. Examples

To illustrate the effectiveness of the hybrid approach, we use several trivial examples to show that the two metrics, similarity and GPR, are insufficient and not consistent. Figure 3 illustrates click graphs of different traffic patterns. Publishers and visitors are denoted as $p_{\{1, 2, 3\}}$ , and $v_{\{1, 2, 3, 4\}}$ , respectively. Only the case (a) is an example of coalition attacks that the three publishers share the resources. However, the similarity of the group in case (b) is greater than or equal to the value of case (a), and the GPR of the group in case (c) is same as the value of case (a). In other terms, case (b) and case (c) can be respectively mis-identified as coalitions according to similarity-based and gpr-based identifications for coalition attacks given in Definitions 2.3 and 2.2.

Figure 3. Different Traffic Patterns with/without Coalition Attacks.

DownLoad: Full-Size Img PowerPoint

2.3. MapReduce computing paradigm

A key challenge of detecting coalition attacks is the complexity for calculating the metrics for all possible groups among the whole set of publishers. The problem of detecting coalitions is NP-hard in general, and usually involves with processing large set of data. Therefore, in order to apply the proposed metric in practical, we adopt a parallel computing technique called MapReduce.

MapReduce is a prominent parallel, distributed data processing paradigm that grows rapidly and has gained significant momentum from both industry and academia. Figure 4 illustrates an overview of the Hadoop architecture, which is an open-source Java implementation of MapReduce [5]. In general, each processing job is broken down to as many tasks as input data blocks on multiple clusters. With the MapReduce model, details of parallel execution are hidden and handled by the internal implementation. Users focus only on the task of data processing. A single MapReduce (MR) job simply consists of two primitive stages: Map and Reduce. The Map stage is to map the input data to a list of pairs (key, value), and the Reduce stage calculates results by aggregating the value of keys.

Figure 4. The Architecture of Hadoop [14].

DownLoad: Full-Size Img PowerPoint

3. Detection of coalition attacks

3.1. The theoretical basis for the detection technique

In Definitions 2.2 and 2.3, we have formally defined metrics that can be used to identify the coalition attacks. However, the detection technique to find all of them is not straightforward. For a universal group with $N$ publishers in total, all its possible sub-groups would be $2^N$ . It is impractical to calculate the proposed metric for all possible sub groups and identify those that satisfy the given definitions. Such brute-force searching scheme is especially impossible for a hybrid approach as the computational complexity would increase exponentially when more metrics are integrated. On the other hand, a priority-style detection scheme would search the space inductively and prune impossible candidate as early as possible, which as a result may reduce the complexity dramatically. Furthermore, the combination of several metrics helps to further prune a significant number of candidates on the fly, and further reduce the computation time.

A well-known property derived from association rule mining is the following definition of anti-monotone.

Definition 3.1 (anti-monotone property[13]). Let $P: {\mathbb X} \rightarrow$ {true, false} be a predicate over the domain ${\mathbb X}$ , $P$ is said to be anti-monotone iff:

$\forall \left( g, {g}'\in \mathbb{X}\ \ |\ \ {g}'\subseteq g\ \ :\ \ P(g)\ \ \Rightarrow \ \ P({g}') \right).$

We can consider the identification of coalition attacks as a predicate $P$ . Let $D$ denote the set of all publishers and visitors. The domain ${\mathbb X}$ corresponds to the powerset of $D$ in our contex. However, not every metric that is abstracted from the domain knowledge has the anti-monotone property. In other words, to fit the metrics into the inductive detection style, we should derive anti-monotonic predicates based on the original metrics.

To obtain the anti-monotone predicate w.r.t. the similarity-based identification (see Definition 2.3), we use notations of similar graph, and clique. Let $G_{sim}$ denote a similarity graph associated with the underlying traffic. Each vertex in $G_{sim}$ corresponds to a publisher, and edges represent the similarity of visitors between two publishers is greater than or equal to the threshold $\theta_{sim}$ . Moreover, a clique is a subset of vertices of an undirected graph $G$ such that its induced subgraph is complete. With accordance to the similarity-based identification, the detection of coalition attacks is abstracted as finding cliques in the associated similarity graph. Formally, we can prove the following lemma according to Definition 2.3.

Lemma 3.2. Given the similarity graph $G_{sim}$ , a group g is a coalition group iff g is a clique associated with $G_{sim}$ .

Consequently, we can derive a predicate using the notation of clique. Proposition 1 indicates that the predicate is anti-monotone.

Proposition 1 (similar clique property). $\forall ({g}'\ \ |\ \ {g}'\subseteq g\ \ :\ \ g\ is\ a\ clique\ associated$ $with\ {{G}_{sim}}\ \ \Rightarrow \ \ {g}'\ is\ a\ clique\ associated\ with\ {{G}_{sim}})$

With respect to the gpr-based identification given in Definition 2.2, the GPR property cannot be transferred to an anti-monotone predicate directly. We instead use the anti-monotone predicate according to an alternative concept called GPR-CORE [12]. Let $G_{click}$ denote a click graph induced by the traffic data. The graph in Figure 3 is an example of such click graphs. Each vertex in $G_{click}$ corresponds to either a publisher or a visitor. An edge $(p, v)$ in $G_{click}$ means that the visitor $v$ visits the publisher $p$ . Given the definition of GPR-CORE below, we can also have the anti-monotone property of GPR-CORE in Proposition 2.

Definition 3.3. A click graph $g$ is called a GPR-CORE if $g$ is a GPR-GROUP and every connected subgraph of $g$ is also a GPR-GROUP.

Proposition 2 (gpr core property). $\forall ({g}'\ \ |\ \ {g}'\subseteq g\ \ :\ \ g\ \ is\ a\ gpr-core\ group\ \Rightarrow$ ${g}'\ is\ a\ gpr-core\ group)$

Consequently, the following lemma articulates the gpr-based identification w.r.t. the anti-monotone property GPR-CORE.

Lemma 3.4. a group $g$ is a coalition group if $g$ satisfies the following conditions:

● $\exists \left( {g}', {g}''\ \ |\ \ {g}'\cup {g}'' = g\ \ :\ \ {g}'\text{ }is\ GPR-CORE\ and\ {g}''\ is\ GPR-CORE \right)$ ,

● the GPR value of $g$ is greater than or equal to $\theta_{gpr}$ ,

● the graph $G_{click}$ induced by $g$ is connected.

3.2. The framework of the hybrid detection system

Figure 5 illustrates the framework of the hybrid detection system. There are three major stages of the system: initialization phase, inductive phase, and finalization phase. With such a framework, we adopt an aprior-style to identify the coalition attacks. Moreover, we define generic modules that can be instanced with different metrics. Such a framework enables the efficiency and extensibility of the hybrid approach. Following Figure 5 and the discussions in Section 3.1, implementation of the system is straightforward.

Figure 5. Framework of the Hybrid Detection System.

DownLoad: Full-Size Img PowerPoint

For the similarity-based identification, the initializing module responses to generate the similarity graph $G_{sim}$ . In the property checking module, we check the predicate $P(g) \equiv \langle g \mbox{ is a clique associated with } G_{sim}\rangle$ . As discussed previously, the anti-monotone property (see Proposition 1) of the predicate ensures the soundness of the inductive detection mechanism. Algorithms 1 and 2 articulate the pseudo code for the similarity-based initializing and property checking modules. In Line $2$ of Algorithm 1, we call a function ${\tt PAIR\_INTERSECT\_UNION}$ to calculate the bag union and intersection of visitors for a pair of publishers according to the definitions. In Line $2$ of Algorithm 2, a function ${\tt COUNT\_GROUP\_EDGES}$ is used to count the number of edges in $G_{sim}$ for potential candidate subgroups. To improve the efficiency of the algorithm, these two functions are implemented with the MapReduce paradigm, which will be discussed in the next section. Moreover, the final identification module for the similarity based metric is trivial. As given in Definition 3.2, identified similar cliques are identical to coalition groups.

Table . .

Algorithm 1 initializing: similarity-based
1:procedure PAIR_SIMILARITY(Data)
2:pairs $\gets$ the list of all pairs of Publishers
3:pairs_value $\gets$ PAIR_INTERSECT_UNION(pairs, Data)
4: $G\_{sim} \gets$ initialize the storage of the similar graph
5:for each in pairs do
6:(intersection, union) $\gets$ pairs_value[each]
7:similarity $\gets$ intersection / union
8:if similarity $\geq \theta_{sim}$ then
9:Adding each as an edge in $G\_{sim}$
10:end if
11:end for
12:return $G\_{sim}$
13:end procedure

| Show Table

DownLoad: CSV

Table . .

Algorithm 2 checking: similarity-based
1:procedure SIMILAR GROUP (candidate, G_sim)
2:group_edges $\gets$ COUNT_GROUP_EDGES(candidate, $G_{sim}$ )
3:clique_groups $\gets$ initialize the storage of the clique groups
4:for each in candidate do
5:complete_edge $\gets$ len(each) * (len(each)-1)
6:if complete_edge $\gets$ group_edges [each] then
7:Adding each to clique_groups
8:end if
9:end for
10:end procedure

| Show Table

DownLoad: CSV

For the gpr-based identification, we calculate the gpr value for single publisher and visitor in the initializing module, and check the gpr-core property in the checking module. Moreover, Lemma 3.4 is used to identify tasks of the identification module in the gpr-based case. The corresponding algorithm is given in Algorithms 3, 4, and 5, respectively. The function ${\tt CALCULATE\_RESOURCE}$ that is called in Lines $5$ and $14$ of Algorithm 3, Line $6$ of Algorithm 4, and Line $7$ of Algorithm 5 calculates the resource cost of the corresponding group (i.e., $R(g)$ in Definition 1). In Algorithm 3, we initially obtain the boundary value for individual (i.e., either a publish or a visitor) gain by calling the function ${\tt INDIVIDUAL\_GAIN\_BOUNDARY}$ . The individual gbr value is always $0$ , and the induced click graph can be considered as connected. Furthermore, the function ${\tt GROUP\_GBC\_VALUES}$ is called in Line $2$ of Algorithm 4, and Line $3$ of Algorithm 5 to calculate the gain, gain boundary, and connectivity for all candidate groups. Both functions ${\tt INDIVIDUAL\_GAIN\_BOUNDARY}$ and ${\tt GROUP\_GBC\_VALUES}$ are implemented using the map-reduce paradigm. Moreover, we implement another map reduce function ${\tt MERGE\_GPR\_CORE}$ to merge the GPR-CORE groups to potential GPR groups according to the first item in Lemma 3.4.

Table . .

Algorithm 3 initializing: gpr-based
1:procedure INDIVIDUAL_GPR(Data)
2:(P_gainb, V_gainb) $\gets$ INDIVIDUAL_GAIN_BOUNDARY(Data)
3:GPR $_1$ $\gets$ initialize the storage for individual gpr values
4:for p in P_gainb do
5:resource $\gets$ CALCULATE_RESOURCE([p], [])
6:gprb $\gets$ P_gainb[p] / resource
7:gpr $\gets$ 0
8:connectivity $\gets$ true
9:if gprb $\ge \theta_{gpr}$ then
10:Adding (p, gpr, gprb, connectivity) to the GPR $_1$
11:end if
12:end for
13:for v in V_gainb do
14:resource $\gets$ CALCULATE_RESOURCE([], [v])
15:gprb $\gets$ V_gainb[v] / resource
16:gpr $\gets$ 0
17:connectivity $\gets$ true
18:if gprb $\ge \theta_{gpr}$ then
19:Adding (v, gpr, gprb, connectivity) to the GPR $_1$
20:end if
21:end for
22:return GPR $_1$
23:end procedure

| Show Table

DownLoad: CSV

Table . .

Algorithm 4 checking: gpr-based
1: procedure GPR-CORE (Data, candidate, $\mbox{GPR}_{(n-1)}$ )
2:gbc_values $\gets$ GROUP_GBC_VALUES(Data, candidate)
3: $\mbox{GPR}_n \gets$ initial the storage for the group gain values of size $n$
4:for g in candidate do
5:(publishers, visitors) $\gets$ g
6:resource $\gets$ CALCULATE_RESOURCE (publishers, visitors)
7:(gpr, gprb) $\gets$ gbc_values[g] / resource
8Adding $(g, gpr, gprb)$ to the $\mbox{GPR}_n$
9:end for
10:subset $\gets$ COUNT_VALID_SUBSET(candidate, $\mbox{GPR}_{(n-1)}$ , $\mbox{GPR}_n$ )
11:gprcore $\gets$ initialize the storage of the GPR-CORE groups
12:for g in candidate do
13:if subset[g] == n $\And$ $\mbox{GPR}_{n} \ge \theta_{gpr}$ then
14:Adding g to the gprcore
15:end if
16:end for
17:end procedure

| Show Table

DownLoad: CSV

Table . .

Algorithm 5 identification: gpr-based
1:procedure FIND_GPR_GROUPS (gprcore, $GPR_{1\dots n}$ )
2:merged_group $\gets$ MERGE_GPR_CORE(gprcore)
3:gbc_values $\gets$ GROUP_GBC_VALUES(merged_group)
4:gpr_groups $\gets$ initialize the storage for the gpr groups
5:for g in merged_group do
6:(publishers, visitors) $\gets$ g
7:resource $\gets$ CALCULATE_RESOURCE(publishers, visitors)
8:(gain, gain_b, connectivity) $\gets$ gb_values[g]
9:gbr $\gets$ gain / resource
10:if $gbr \ge \theta_{gpr}$ $\And$ connectivity is true then
11:Adding g to gpr_group
12:end if
13: end for
14:end procedure

| Show Table

DownLoad: CSV

3.3. The map and reduce functions for the detection techniques

The map reduce paradigm is discussed in Section 2. The input data is mapped into a collection of pairs $($ key, values $)$ , and then the result is calculated by aggregating the values w.r.t. each key. In order to allow the parallel computation on different clusters, the reduce operation on the mapped dataset should be commutative and associative. In our context, different metrics for a large collection of groups have to be calculated. All groups form a lattice data structure, which allows the inductive search strategy in the algorithm. With the map reduce paradigm, a set of groups is mapped to a set of partial ordered keys, while values are emitted with respect to different metrics. All functions for key encoder, decoder and comparison are given in Appendix A. With the inductive style, we generate new sets of keys in each iteration based on the potential candidates from the previous iteration. To reduce the computation time, we also implement the inductive candidate generator with the map-reduce paradigm (see Appendix A for the map and reduce functions). Furthermore, in Algorithm 6, we articulate the pseudo code for function ${\tt pair\_intersection\_union}$ and ${\tt count\_group\_edges}$ . functions ${\tt individual\_gain\_boundary}$ , ${\tt group\_gbc\_values}$ , ${\tt count\_valid\_subset}$ , and ${\tt merge\_gpr\_core}$ are given in Algorithms 7 and 8. As mentioned in the last section, map-reduce computation is adopted in Lines $3, 21$ of Algorithm 6, Lines $3, 23$ of Algorithm 7, and Lines $10, 24$ of Algorithm 8. In particular, we use Disco [4] to conduct the computation and the implementation of those mapper functions are also given in the Appendix.

Table . .

Algorithm 6 functions associated with the similarity metric
1:function PAIR_INTERSECTION_UNION (Data, pair)
2:pairs_value $\gets$ {}
3:for (key, values) $\in$ ${\tt reduce(mapper\_similarity\_pairs}$ (pairs, Data)) do
4:\\ "key" is the key for each pairs
5:\\ "values" are counts of clicks that are associated the pair, respectively
6:pairs $\gets$ KEY_DECODER (key)[0]
7:if values[0] $>$ 0 $\And$ values[1] $>$ 0 then
8:intersection $\gets$ min(values[0], values[1])
9:union $\gets$ max(values[0], values[1])
10:else if values[0] $>$ 0 $\And$ values[1] $==$ 0 then
11:intersection $\gets$ 0
12:union $\gets$ values[0]
13:else values[0] $==$ 0 $\And$ values[1] $>$ 0
14:intersection $\gets$ 0
15:union $\gets$ values[1]
16:end if
17:pairs_value[key] $\gets$ [intersection, union]
18:end for
19:return pairs_value
20:end function
21:function COUNT_GROUP_EDGES ( $G\_{sim}$ , candidate)
22:group_edges $\gets$ {}
23:for (key, value) $\in$ ${\tt reducer(map\_similarity\_groups}$ (candidate, $G\_{sim}$ ))do
24:\\ "key" is the encoded key of each group in candidate
25:\\ "value" is the count of edges in $G\_{sim}$ that are associated with all elements in the group
26:group=key.decoder(key)
27:group_edges[group] $\gets$ value
28:end for
29:return group_edges
30:end function

| Show Table

DownLoad: CSV

Table . .

Algorithm 7 functions associated with the gpr metric
1:function INDIVIDUAL_GAIN_BOUNDARY(Data)
2:P_gainb $\gets$ {}
3:V_gainb $\gets$ {}
4:for key, value) $\in$ ${\tt reduce(map\_vertices}$ (Data) do
5:\\ "key" is the encoded key of each publisher or visitor
6:\\ "value" is the sum cost that is associated with the publisher or visitor
7:(p, v)=key.decoder(key)
8:if v=="" then
9:P_gain[p] $\gets$ value[1]
10:elsep==""
11:V_gain[p] $\gets$ value[1]
12:end if
13:end for
14:return (P_gainb, V_gainb)
15:end function
16:function GROUP_GBC_VALUES (Data, candidate, size)
17:gbc_values $\gets$ {}
18:for (key, values) $\in$ ${\tt reduce(map\_gbc\_values}$ (Data, candidate)) do
19:\\ "key" is the encoded key for each group in candidate
20:\\ "values" correspond to the gain, gain boundary, and the number of connected edges that are associated with the group
21:gain $\gets$ values[0]
22:gain_boundary $\gets$ values[1]
23:if values[2] $>$ 0 then
24:connectivity = true
25:else
26:connectivity = false
27:end if
28:gbc_values[key] $\gets$ (gain, gain_boundary, connectivity)
29:end for
30:return gbc_values
31:end function
32:function COUNT_VALID_SUBSET(candidate)
33:subset $\gets$ {}
34:for (key, value) $\in$ ${\tt reduce(map}$ (Data, candidate)) do
35:\\ "key" is the encoded key for each group in candidate
36:\\ "value" count the number of valid subgroup of the group (i.e. satisfies the properties of GRP-Core)
37:subset[key] $\gets$ value
38:end for
39:return subset
40:end function

| Show Table

DownLoad: CSV

Table . .

Algorithm 8 functions associated with the gpr metric (cont.)
1:function MERGE_GPR_CORE(Data, gprcore)
2: $i \gets 0$
3: $\mbox{merged}_i \gets$ gprcore
4:prev $\gets \mbox{merged}_i$
5:new $\gets$ []
6:while prev!=new do
7:pairs $\gets$ combination(new, 2)
8:i $\gets$ i+1
9:separated $\gets$ {}
10:for (key, value) $\in$ REDUCE(MAP_MERGE_GPRCORE(pairs))do
11:\\ "key" is the encoded key for the merged group that associated with each pair.
12:\\ "value" is the number of overlapped vertices for the pair
13:if value $>$ 0 then
14:Adding key to $\mbox{merged}_i$
15:else
16:(merged_key, component_key) $\gets$ key.split(":")
17:separated[merged_key] $\gets$ component_key
18:end if
19:end for
20:unknown_keys $\gets$ separated.keys() - $\mbox{merged}_i$
21:for key in unknown_keys do
22:candidate[key] $\gets$ separated[key]
23:end for
24:for(key, values) $\in$ REDUCE(MAP_GBC_VALUES(Data, Candidate)) do
25:if values[2] $>$ 0 then
26:Adding key to $\mbox{merged}_i$
27:end if
28:end for
29:prev $\gets \mbox{merged}_{i-1}$
30:new $\gets \mbox{merged}_i$
31:end while
32:
33:end function

| Show Table

DownLoad: CSV

4. Evaluation of the detection system

To evaluate the proposed detection algorithm, we compare the performance of our algorithm with those that only use single metric. The proposed hybrid framework provides a convenient way to extend the system with different metrics. In other term, instead of re-implementing those algorithms separately, we can simply modify the list of metrics in Figure 5 to obtain and compare different algorithms.

We retrieve the motivative examples introduced in Section 2. Both values of $\theta_{sim}$ and $\theta_{gpr}$ have critical effects on the detection rate and error rate of the algorithm. The detection rate becomes low if the threshold is too high. On the other hand, the precision rate can also decrease when the threshold is too low. In Figure 6, we illustrate effects of $\theta_{sim}$ and $\theta_{gpr}$ on the traffic that corresponds to the examples given in Figure 3. The detection rate is $1$ when the $\theta_{sim}$ less than $0.33$ , while the detection rate becomes $0$ when the value greater than $0.33$ . The detection rate is $0$ with $\theta_{gpr}$ greater than $1$ . With a relative low threshold $\theta_{gpr}$ , we may also increase the opportunity of positive false alarms. For example, the precision rate could be as low as $0.3$ when $\theta_{gpr}$ is less than $0.84$ . On the other hand, with the hybrid approach, we can obtain a high detection rate while still keep a reasonable precision rate. Figure 7 shows the effect of the two parameters $\theta_{sim}$ and $\theta_{gpr}$ on the detection rate and the precision of the hybrid approach. In particular, we adjust the value of one parameter with a fix value for the other. With accordance to the dropping point of detection rate in Figure 6, we set $\theta_{sim}$ to be $0.32$ and $\theta_{sim}$ to be $1$ , respectively. We can see that the precision in Figure 7(b) is always much better than the precision in Figure 6(b), and the average precision in Figure 7(a) is also better than the case of 6(a).

Figure 6. Independent Parameter Adjustment.

DownLoad: Full-Size Img PowerPoint

Figure 7. Dependent Parameter Adjustment.

DownLoad: Full-Size Img PowerPoint

We also applied the proposed technique to practical data. The raw data are daily log including impressions, clicks, and conversions from real traffic. Each record in the log is related to more than $50$ features. To prepare the dataset of our algorithm, we index the publisher and visitor, associate the impression and clicks, and then extract the cost for each click. To further evaluate the effectiveness, we use third party service to label the click either as 'normal' or 'coalition attack'. Table 1 shows results for three different datasets.

Table 1. Results obtained from practical data.

#publishers	detection rate	precision

340	96.12%	95.03%
414	97.12%	99.35%
407	99.18%	94.94%

| Show Table

DownLoad: CSV

We can see that the algorithm has an average detection rate of $97.67\%$ and precision $96.44\%$ . In experiments, each dataset includes the traffic of $7$ days with $300\sim500$ publishers. Since we adopt the map-reduce paradigm, our detection technique allows parallel computing. Although the complexity of the algorithm is determined by the number of publishers in the dataset, our detection system inherently has no restriction on the size of the problem. Therefore, with increasing number of clusters, we can further reduce the processing time and handle even larger size of data.

5. Conclusion and future work

In this paper, we proposed a detection technique for coalition attacks in online advertising. A hybrid framework that can detect coalition attacks based on multiple metrics is developed. We articulate the theoretical properties that each metric should satisfy in the framework. In particular, we adopt similarity and gpr as two metrics in the system. The performance of the technique with different datasets is evaluated.

As future work, more datasets will be tested. Since there is no benchmark data sets for coalition frauds in online advertising, obtaining the data for evaluation is usually difficult and expensive. Besides real-world data sets and the third party service, simulation technique can be used to generate synthetic date for both normal and fraudulent traffic. Extensive empirical study on different traffic patterns can be studied. On the other side, to improve the performance of the detection technique, more metrics can be explored to identify coalition attacks [6,10]. With the hybrid framework, additional metrics can be integrated in the system efficiently. The ultimate goal of the work is to provide an reliable, extensible, and scalable on-line detection system for the dramatically developing online advertising industry.

Acknowledgments

The project was supported by the Mathematics of Information Technology and Complex Systems of Canada (MITACS) and EQ Works (http://www.eqworks.com/).

Appendix A. Implementation

The following code are written with PYTHON.

A.1. Key handler

In this subsection, we list functions for encoding, decoding, and comparing the keys.

def key_decoder(publishers, visitors):

sorted_p = sorted(publishers)

sorted_v = sorted(visitors)

key_p="_".join(each for each in sorted_p)

key_v="_".join(each for each in sorted_v)

key = key_p+"!!"+key_v

return key

def key_decoder(key):

(key_p, key_v) = key.split("!!")

publishers = key_p.split("_") if not key_p = ="" else []

visitors = key_v.split("_") if not key_v = ="" else []

return (publishers, visitors)

def key_cmp(key1, key2):

(p1, v1) = key_decoder(key1)

(p2, v2) = key_decoder(key2)

p_in = set(p1).issubset(set(p2))

v_in = set(v1).issubset(set(v2))

return p_in and v_in

A.2. General mapper reduce functions

We generalize two mapper and reducer functions that can be reused by other functions.

def inductive_generator_mapper(line, params):

candidate = line.split(", ")[0]

P = params[0]

V = params[1]

(publishers, visitors) = candidate.split("!!")

for p in P:

if not p in publishers:

key = key_encoder(publishers+[p], visitors)

yield (key+":"+each, [1])

for v in V:

if not v in visitors:

key = key_encoder(publishers, visitors+[V])

yield (key+":"+each, [1])

def reducer(iter, params):

data = {}

first = True

prev = None

for key, value in sorted(iter):

if key not in data:

if not first:

yield prev, data[prev]

data = {}

data[key] = value

else:

for k, v in enumerate(value):

if type(v) = =set:

data[key][k] = data[key][k].union(v)

else:

data[key][k]+ = v

first = False

prev = key

for key, value in data.iteritems():

yield key, value

A.3. Other mapper reduce functions

All other mapper reduce functions are given in the rest of this section.

def mapper_similarity_pairs(line, params):

groups = params

(head, tail, weight, label) = line.strip().split(", ")

for each in groups:

(p1, p2) = each.split("_")

key = each+"!!"+tail

if head = =p1:

yield (key, [1, 0])

elif head = =p2:

yield (key, [0, 1])

else:

pass

def mapper_similarity_groups(line, params):

#tag2group = params[0]

group_keys = params[0]

V_key = params[1]

threshold = params[2]

(p1, p2, similarity) = line.split(", ")

if float(similarity) < threshold:

pass

else:

for key in group_keys:

(p_key, v_key) = key.split("!!")

new_key = p_key+"!!"+V_key

pubs = p_key.split("_")

if p1 in pubs and p2 in pubs:

yield (new_key, [1])

def mapper_vertices(line, param):

(head, tail, weight, label) = line.split(", ")

p_key = head+"!!"

v_key="!!"+tail

yield (p_key, [1, float(weight)])

yield (v_key, [1, float(weight)])

def mapper_gbc_values(line, params):

groups = params

(head, tail, weight, label) = line.split(", ")

#print groups

for key in groups:

and_edge = 0

or_edge = 0

(key_p, key_v) = key.split("!!")

publishers = key_p.split("_") if not key_p = ="" else []

visitors = key_v.split("_") if not key_v = ="" else []

if head in publishers and tail in visitors:

and_edge = float(weight)

if head in publishers or tail in visitors:

or_edge = float(weight)

#yield (key, [and_edge, or_edge])

components = groups[key]

connectivity = 0

for each in components:

(p_key, v_key) = each.split("!!")

heads = p_key.split("_")

tails = v_key.split("_")

if head in heads and tail in tails:

connectivity = 1

break

yield (key, [and_edge, or_edge, connectivity])

References

[1]	[ Zenithoptimedia, Available from http://techcrunch.com/2014/04/07/.
[2]	[ Doubleclick by Google, Available from:http://www.google.ca/doubleclick/.
[3]	[ Rightmedia from Yahoo!, Available from:https://www.rightmedia.com/.
[4]	[ Disco MapReduce, Available from http://discoproject.org/.
[5]	[ Hadoop MapReduce, Available from:https://hadoop.apache.org.
[6]	[ L. Adamic and E. Adar, Friends and neighbors on the web, Social Networks, 25(2003), 211-230.
[7]	[ V. Anupam, A. Mayer, K. Nissim and B. Pinkas, On the security of pay-per-click and other web advertising schemes, in The 8th International Conference on World Wide Web, 31(1999), 1091-1100.
[8]	[ M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of the Thiry-fourth Annual ACM Symposium on Theory of Computing, (2002), 380-388.
[9]	[ N. Daswani, C. Mysen, V. Rao, S. Weis, K. Gharachorloo and S. Ghosemajumde, Online advertising fraud, in Crimeware:Understanding New Attacks and Defenses, Addison-Wesley Professional, Reading, 2008.
[10]	[ M. A. Hasan, A survey of link prediction in social networks, Social Network Data Analytics, (2011), 243-275.
[11]	[ B. Kitts, J. Y. Zhang, A. Roux and R. Mills, Click fraud detection with bot signatures, in Proceedings of ISI 2013, Seattle, Washington, USA, (2013), 146-150.
[12]	[ C. Kim, H. Miao and K. Shim, CATCH:A detecting algorithm for coalition attacks of hit inflation in internet advertising, Information Systems, 36(2011), 1105-1123.
[13]	[ C. K. S. Leung, Anti-monotone constraints, in Encyclopedia of Database Systems (eds. Ling Liu and M. Tamer Özsu), Springer, (2009), 98-98.
[14]	[ K. Lee, H. Choi and B. Moon, Parallel data processing with MapReduce:a survey, in The ACM Special Interest Group on Management of Data Record, 40(2011), 11-20.
[15]	[ A. Metwally, D. Agrawal and A. EI Abbadi, Duplicate detection in click streams, in International World Wide Web Conference, (2005), 12-21.
[16]	[ A. Metwally, D. Agrawal and A. EI Abbadi, Using association rules for fraud detection in web advertising networks, in Proceedings of the 31st international conference on very large data bases, (2005), 169-180.
[17]	[ A. Metwally, D. Agrawal and A. EI Abbadi, Detectives:detecting coalition hit inflation attacks in advertising networks streams, in Proceedings of the 16th International Conference on World Wide Web, (2007), 241-250.
[18]	[ A. Metwally, D. Agrawal, A. EI Abbadi and Q. Zheng, On hit inflation techniques and detection in streams of web advertising networks, in Proceedings of the 27th International Conference on Distributed Computing Systems, (2007), 52-52.
[19]	[ C. Phua, E. Y. Cheu, G. E. Yap, K. Sim and M. N. Nguyen, Feature engineering for click fraud detection, in ACML Workshop on Fraud Detection in Mobile Advertising, 2012.
[20]	[ Y. Peng, L. Zhang, J. M. Chang and Y. Guan, An effective method for combating malicious scripts clickbots, in Computer Security, the Series Lecture Notes in Computer Science, 5789(2009), 523-538.
[21]	[ B. K. Perera, A Class Imbalance Learning Approach to Fraud Detection in Online Advertising, M.Sc. thesis, Masdar Institute of Science and Technology, 2013.
[22]	[ K. Springborn and P. Barford, Impression fraud in online advertising via pay-per-view networks, in Proceedings of the 22nd USENIX Security Symposium, (2013), 211-226.
[23]	[ F. Soldo and A. Metwally, Traffic anomaly detection based on the IP size distribution, in Proceedings of the INFOCOM International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, (2012), 2005-2013.
[24]	[ C. Walgampaya and M. Kantardzic, Cracking the smart clickbot, in Proceedings of the 13th IEEE International Symposium on Web Systems Evolution, (2011), 125-134.
[25]	[ C. Walgampaya and M. Kantardzic and R. Yampolskiy, Evidence fusion for real time click fraud detection and prevention, in Intelligent Automation and Systems Engineering, Lecture Notes in Electrical Engineering, Springer Science+Business Media, 103(2011), 1-14.
[26]	[ Q. Zhang and W. Feng, Detecting coalition frauds in online-advertising, in Mathematical and Computational Approaches in Advancing Modern Science and Engineering, (eds. Jacques Bélair, Ian A. Frigaard, Herb Kunze, Roman Makarov, Roderick Melnik and Raymond J. Spiteri) Springer, (2016), 595-605.
[27]	[ L. Zhang and Y. Guan, Detecting click fraud in pay-per-click streams of online advertising networks, in The 28th International Conference on Distributed Computing Systems, (2008), 77-84.

This article has been cited by:

1. Amine Kechid, Habiba Drias, 2019, Chapter 41, 978-3-030-22998-6, 470, 10.1007/978-3-030-22999-3_41
Reader Comments

Your name:*

Email:*
© 2016 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)

通讯作者: 陈斌, bchen63@163.com

1.
沈阳化工大学材料科学与工程学院沈阳 110142

Big Data and Information Analytics

Metrics

Article views(3147) PDF downloads(533) Cited by(1)

Preview PDF

Download XML

Export Citation

Article outline

Show full outline

Figures and Tables

Figures(7) / Tables(1)

Big Data and Information Analytics

Detecting Coalition Attacks in Online Advertising: A hybrid data mining approach

Related Papers:

Abstract

1. Introduction

2. Background

2.1. Identification for coalition attacks

2.2. Examples

2.3. MapReduce computing paradigm

3. Detection of coalition attacks

3.1. The theoretical basis for the detection technique

3.2. The framework of the hybrid detection system

3.3. The map and reduce functions for the detection techniques

4. Evaluation of the detection system

5. Conclusion and future work

Acknowledgments

Appendix A. Implementation

A.1. Key handler

A.2. General mapper reduce functions

A.3. Other mapper reduce functions

References

This article has been cited by:

Reader Comments

通讯作者: 陈斌, bchen63@163.com

Metrics

Figures and Tables

Other Articles By Authors

Catalog

Big Data and Information Analytics

Detecting Coalition Attacks in Online Advertising: A hybrid data mining approach

Related Papers:

Abstract

1. Introduction

2. Background

2.1. Identification for coalition attacks

2.2. Examples

2.3. MapReduce computing paradigm

3. Detection of coalition attacks

3.1. The theoretical basis for the detection technique

3.2. The framework of the hybrid detection system

3.3. The map and reduce functions for the detection techniques

4. Evaluation of the detection system

5. Conclusion and future work

Acknowledgments

Appendix A. Implementation

A.1. Key handler

A.2. General mapper reduce functions

A.3. Other mapper reduce functions

References

This article has been cited by:

Reader Comments

通讯作者: 陈斌, bchen63@163.com

Metrics

Figures and Tables

Other Articles By Authors

Related pages

Tools

Export File

Citation

Format

Content

Catalog