Citation: Qinglei Zhang, Wenying Feng. Detecting Coalition Attacks in Online Advertising: A hybrid data mining approach[J]. Big Data and Information Analytics, 2016, 1(2): 227-245. doi: 10.3934/bdia.2016006
As an opportunity for organizations and companies to reach millions of potential customers at low cost, online advertising accounts for nearly one-quarter of global ad spend [1]. Figure 1 illustrates the process of online advertising. The advertiser manages the advertising campaigns for various brands, while the internet publisher provides the inventory for hosting the ads. Advertisers and publishers can communicate directly or indirectly: exchangers such as Google's DoubleClick [2] and Yahoo!'s RightMedia [3] act as intermediaries who pair the publishers' ad requests with the most profitable advertiser bids. During the life-cycle of an advertisement, an impression (view) occurs when the ad is displayed on the publisher's webpage as a visitor loads the page. When the visitor clicks on the ad, a click event is fired and the visitor is taken to the brand's landing page. Finally, a conversion is recorded if the visitor performs certain actions on the brand's website.
In the context of online advertising, a buyer/seller relationship is established between the advertiser and the publisher [26]. Corresponding to the three basic events in the advertising life-cycle, the three pricing models in the industry are PPM (Pay-Per-Mille1), PPC (Pay-Per-Click), and PPA (Pay-Per-Action) [7,22,27], respectively. Since publishers earn revenue based on the number of views, clicks, or actions associated with the ads, they have strong incentives to maximize those numbers [15,16]. While publishers can apply legitimate methods to attract more traffic to their websites, malicious publishers attempt to make more money by generating invalid (fraudulent) traffic. Such publisher frauds can be further categorized as impression fraud, click fraud, and conversion (action) fraud [7,9,22]. Various types of fraud attacks have been identified by both industry and academia in the literature (e.g. [20,21,24]). To generate traffic to publishers' websites, attacks can be launched either on publishers' websites or on machines controlled by malicious publishers, and can be performed either by real humans or by bots. More details of existing online advertising frauds are discussed in [26].
1Mille means a thousand impressions in this context.
In the online advertising industry, the number of views, clicks, and actions of ads is critical to advertisers' revenue. Advertisers use metrics such as CPM (Cost Per Mille), CTR (Click Through Rate), and CR (Conversion Rate) as KPIs (Key Performance Indicators) to evaluate the performance of their marketing campaigns. However, the inflated numbers of views, clicks, and actions generated by fraudulent publishers undermine the reliability and effectiveness of those performance metrics. Publisher fraud is a severe problem for advertisers and well worth the effort to detect and prevent.
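The three KPIs mentioned above follow standard industry formulas; the sketch below, with invented event counts and spend, shows how they are computed from raw totals.

```python
# Standard KPI formulas (not from the paper's implementation); all the
# numbers in the example are made up for illustration.
def cpm(cost, impressions):
    """Cost Per Mille: spend per thousand impressions."""
    return 1000.0 * cost / impressions

def ctr(clicks, impressions):
    """Click Through Rate: fraction of impressions that are clicked."""
    return clicks / impressions

def cr(conversions, clicks):
    """Conversion Rate: fraction of clicks that lead to a conversion."""
    return conversions / clicks

# Example: 50,000 impressions, 400 clicks, 10 conversions, $120 spend.
print(cpm(120.0, 50000))   # 2.4
print(ctr(400, 50000))     # 0.008
print(cr(10, 400))         # 0.025
```

Inflated click counts from fraudulent publishers distort CTR upward while leaving CR artificially low, which is exactly why these metrics lose reliability under attack.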
Publisher frauds are mainly launched with two key entities: fraudulent sites and machines. With respect to the correlation between machines and sites, publisher frauds can be classified as coalition attacks and non-coalition attacks. As illustrated in Figure 2, a non-coalition attack corresponds to a one-to-one relationship, while a coalition attack represents a many-to-many relationship. An elementary attack can be easily detected and blocked by identifying repeated views, clicks, or actions from one machine on one site. To avoid detection, fraudulent publishers attempt to blur the strong correlation between fraudulent sites and machines. A single fraudulent publisher must simulate fake visitors, or frequently change user identifiers, to blur the relationship, which significantly increases the difficulty and cost of launching the attacks. On the other hand, several fraudulent publishers can collude and share their resources, which also blurs the correlation between any pair of fraudulent sites and machines. Since coalition attacks can be launched without developing sophisticated techniques, they have become increasingly popular. While coalition attacks can be deployed easily and effectively, their detection remains a challenging open issue.
Both active and passive approaches can be used to detect fraud in general. To verify the validity of the traffic, active approaches usually require interaction with the web pages and/or visitors (e.g. [20]). Passive detection approaches, on the other hand, apply data analysis/mining techniques to recognize the behaviour patterns of traffic (e.g. [11,19,21,23,25,27]). While active detection techniques typically adopt a signature-based mechanism to mark fraudulent traffic, passive detection techniques can use both signature-based (e.g. [11]) and anomaly-based (e.g. [23]) approaches to identify frauds. Moreover, active detection may change the model of the advertising network and compromise the privacy of visitors, which is sometimes impractical. For a data mining approach, the premise for detecting frauds is to identify correlations between publishers and visitors. Due to the flexibility and feasibility of passive detection, we adopt a data mining approach in this paper. In particular, we propose an algorithm that detects correlations between publishers and visitors to identify coalition attacks.
The paper is structured as follows. Section 2 introduces background and related notations of coalition attacks and parallel computing. Section 3 presents the proposed detection system for coalition attacks. Next, Section 4 presents the evaluation results of our new techniques. Finally, conclusions and future work are discussed in Section 5.
A classical way to detect publisher frauds is to block suspicious domains when their impressions and CTR exceed a hard (static) threshold. However, fraudsters can easily circumvent such classical detection tools by mimicking legitimate criteria. To identify the malicious intent of fraudsters, the key is to identify the association between fraudulent websites and machines. In the literature, two types of metrics have been explored. One set of metrics is derived from the graph topology (e.g. sets of neighbours and paths between nodes). These metrics are generic and can be applied to any domain. In our context, real network data indicates that two legitimate sites should have different sets of visitors [17,18]. Therefore, the degree of overlap between the visitors of two sites can be used to measure their collusion. Similarity metrics proposed for generic link prediction problems, such as the Jaccard coefficient [8] and Adamic/Adar [6], can be used to estimate the similarity of the visitors of two publishers. For example, the Jaccard coefficient is employed for the coalition attack problem in [17]. The other set of metrics is associated with attributes of vertices and links, which requires domain knowledge. In the context of e-commerce, since ROI (Return on Investment) is a key concern, coalition groups are incentivized to achieve higher ROI than normal publishers. Therefore, variants of ROI can be derived as metrics for identifying frauds. For example, in [12], a metric called GPR (Gain-Per-Resource) is proposed to detect coalition attacks.
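The visitor-overlap idea can be sketched with the Jaccard coefficient over visitor sets; the site and visitor identifiers below are invented for illustration.

```python
# Minimal sketch of visitor-overlap similarity between two sites using the
# Jaccard coefficient. Visitor ids are illustrative, not real data.
def jaccard(visitors_a, visitors_b):
    a, b = set(visitors_a), set(visitors_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Colluding sites tend to share most of their (fraudulent) visitors...
print(jaccard({"v1", "v2", "v3"}, {"v2", "v3", "v4"}))  # 0.5
# ...while two legitimate sites rarely do.
print(jaccard({"v1", "v2", "v3"}, {"v7", "v8"}))        # 0.0
```

A high pairwise Jaccard value flags a pair of sites as candidates for a coalition, which is exactly how the overlap metric is used in [17].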
In this paper, we propose a hybrid approach that combines metrics from different perspectives to detect coalition attacks. In particular, similarity and GPR are selected as the two representative metrics in our technique. To detect coalition attacks systematically and automatically, we give formal definitions for identifying coalition attacks based on similarity and GPR in the rest of this section.
The ratio of gain to cost is an estimator of ROI (Return on Investment), which can be formally defined as:
GPR(g) = W(g) / R(g), | (1) |
where W(g) is the gain of group g and R(g) is the amount of resources the group consumes.
Definition 2.1. A group g is called a GPR-GROUP iff
∀(g′ | g′ ⊂ g : GPR(g′) < GPR(g)), |
where g′ ⊂ g denotes a proper subgroup of g.
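Definition 2.1 can be checked by brute force over all proper subgroups. The sketch below does so; the aggregation functions `gain_of`/`resource_of` and all numbers are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations

# Brute-force sketch of Definition 2.1 (GPR-GROUP): every proper subgroup
# must have a strictly smaller GPR. gain_of/resource_of are assumed helpers.
def gpr(gain, resource):
    return gain / resource

def is_gpr_group(members, gain_of, resource_of):
    g_val = gpr(gain_of(members), resource_of(members))
    for k in range(1, len(members)):
        for sub in combinations(members, k):
            if gpr(gain_of(sub), resource_of(sub)) >= g_val:
                return False
    return True

# Publishers share a visitor, so the group earns more per resource together.
gains = {"p1": 3.0, "p2": 3.0}
visits = {"p1": {"v1", "v2"}, "p2": {"v1", "v3"}}
gain_of = lambda g: sum(gains[p] for p in g)
resource_of = lambda g: len(set().union(*(visits[p] for p in g)))
print(is_gpr_group(("p1", "p2"), gain_of, resource_of))  # True

# With disjoint visitors, the group's GPR is no better than its members'.
visits_d = {"p1": {"v1", "v2"}, "p2": {"v3", "v4"}}
resource_d = lambda g: len(set().union(*(visits_d[p] for p in g)))
print(is_gpr_group(("p1", "p2"), gain_of, resource_d))  # False
```

The toy example captures the economics of collusion: sharing visitors (resources) is what pushes the group's GPR above every subgroup's.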
A coalition group can be defined in terms of GPR.
Definition 2.2 (gpr-based identification). (1) A group of publishers and visitors, denoted as g, is identified as a coalition iff
GPR(g) ≥ θgpr and g is a GPR-GROUP. |
(2) A coalition g is maximal iff
¬∃(g″ | g ⊂ g″ : g″ is a coalition). |
Let Bp denote the bag2 of visitors associated with publisher p.
2The elements in bag models can be duplicated.
Similarity(p1, p2) = |Bp1 ⊓ Bp2| / |Bp1 ⊔ Bp2|, | (2) |
where Bp1 and Bp2 are the bags of visitors for publishers p1 and p2, and ⊓ and ⊔ denote bag intersection and bag union, respectively, defined by:
∀((n,e) | (n,e) ∈ A ⊓ B : ∃(m1, m2 | (m1,e) ∈ A ∧ (m2,e) ∈ B : n = min(m1, m2))) |
∀((n,e) | (n,e) ∈ A ⊔ B : ((n,e) ∈ A ∧ ¬∃(k | k ∈ N : (k,e) ∈ B)) ∨ ((n,e) ∈ B ∧ ¬∃(k | k ∈ N : (k,e) ∈ A)) ∨ (∃(m1, m2 | (m1,e) ∈ A ∧ (m2,e) ∈ B : n = max(m1, m2)))). |
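The bag intersection and union above are exactly multiset minimum and maximum, which Python's `collections.Counter` implements as `&` and `|`. A small sketch with invented click counts:

```python
from collections import Counter

# The bag operations ⊓ and ⊔ correspond to per-element min and max of
# multiplicities; Counter's & and | implement precisely that. Visitor ids
# and click counts below are illustrative.
def bag_similarity(bag1, bag2):
    inter = sum((bag1 & bag2).values())   # per-visitor min of click counts
    union = sum((bag1 | bag2).values())   # per-visitor max of click counts
    return inter / union if union else 0.0

# p1's visitors clicked {v1: 3 times, v2: once}; p2's {v1: 2, v3: 1}.
b1 = Counter({"v1": 3, "v2": 1})
b2 = Counter({"v1": 2, "v3": 1})
print(bag_similarity(b1, b2))  # 2 / 5 = 0.4
```

Using bags rather than plain sets lets the metric weight repeated clicks from the same visitor, which matters for click-fraud traffic.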
With the similarity metric defined, a pair-wise similar group is one in which the similarity of each pair in the group is greater than a threshold, denoted by θsim.
Definition 2.3 (similarity-based identification). A group of publishers g is identified as pair-wise similar iff
∀p1, p2 : (p1 ∈ g) ∧ (p2 ∈ g) ∧ (p1 ≠ p2) ⟹ similarity(p1, p2) ≥ θsim. |
To illustrate the need for the hybrid approach, we use several simple examples to show that the two metrics, similarity and GPR, are individually insufficient and not always consistent with each other. Figure 3 illustrates click graphs of different traffic patterns, with publishers and visitors denoted as p and v vertices, respectively.
A key challenge in detecting coalition attacks is the complexity of calculating the metrics for all possible groups over the whole set of publishers. The problem of detecting coalitions is NP-hard in general, and usually involves processing large sets of data. Therefore, in order to apply the proposed metrics in practice, we adopt a parallel computing technique called MapReduce.
MapReduce is a prominent parallel, distributed data processing paradigm that has grown rapidly and gained significant momentum in both industry and academia. Figure 4 illustrates an overview of the Hadoop architecture, an open-source Java implementation of MapReduce [5]. In general, each processing job is broken down into as many tasks as there are input data blocks across the cluster. With the MapReduce model, the details of parallel execution are hidden and handled by the framework; users focus only on the data processing task itself. A single MapReduce (MR) job consists of two primitive stages: Map and Reduce. The Map stage maps the input data to a list of (key, value) pairs, and the Reduce stage computes results by aggregating the values of each key.
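The two stages can be illustrated with a toy, single-process sketch (not the Hadoop implementation used in the paper); the click-log fields and the counting task are invented for illustration.

```python
from collections import defaultdict

# Toy illustration of the Map and Reduce stages: count clicks per
# (site, visitor) pair. Record fields are assumptions for the example.
def map_stage(records):
    for site, visitor in records:       # Map: emit (key, value) pairs
        yield (site, visitor), 1

def reduce_stage(pairs):
    counts = defaultdict(int)           # Reduce: aggregate values by key
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

log = [("s1", "v1"), ("s1", "v1"), ("s2", "v1"), ("s1", "v2")]
print(reduce_stage(map_stage(log)))
# {('s1', 'v1'): 2, ('s2', 'v1'): 1, ('s1', 'v2'): 1}
```

In a real Hadoop deployment the mapper runs on each input split and the framework shuffles pairs to reducers by key; the sketch only shows the data flow.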
In Definitions 2.2 and 2.3, we have formally defined metrics that can be used to identify coalition attacks. However, a detection technique to find all such groups is not straightforward: for a universal group with n publishers, there are 2n candidate subgroups to examine, so a brute-force search is intractable.
A well-known property derived from association rule mining is the anti-monotone property, defined as follows.
Definition 3.1 (anti-monotone property [13]). Let X be a set of groups and P a predicate over X. P is anti-monotone iff
∀(g, g′ ∈ X | g′ ⊆ g : P(g) ⇒ P(g′)). |
We can consider the identification of coalition attacks as a predicate over groups. If the predicate is anti-monotone, an apriori-style, level-wise search can prune the exponential candidate space: once a group fails the predicate, none of its supergroups needs to be examined.
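As a hedged sketch of such an apriori-style, level-wise search (the predicate `satisfies` below is a stand-in, not one of the paper's metrics): a size-k candidate is generated only if all of its size-(k-1) subgroups survived the previous level, which is exactly the pruning that anti-monotonicity justifies.

```python
from itertools import combinations

# Level-wise (apriori-style) search over groups of publishers, pruning via
# an anti-monotone predicate. `satisfies` is an illustrative stand-in.
def apriori(items, satisfies):
    survivors = {frozenset([i]) for i in items if satisfies({i})}
    all_groups = set(survivors)
    k = 2
    while survivors:
        candidates = set()
        for a in survivors:
            for b in survivors:
                c = a | b
                # keep a size-k union only if every (k-1)-subgroup survived
                if len(c) == k and all(frozenset(s) in survivors
                                       for s in combinations(c, k - 1)):
                    candidates.add(c)
        survivors = {c for c in candidates if satisfies(c)}
        all_groups |= survivors
        k += 1
    return all_groups

# Toy anti-monotone predicate: groups drawn from one colluding pool.
pool = {"p1", "p2", "p3"}
groups = apriori({"p1", "p2", "p3", "p4"}, lambda g: set(g) <= pool)
print(max(groups, key=len))  # the full coalition {p1, p2, p3}
```

The same skeleton underlies the inductive phase of the detection framework: only the predicate (clique membership, GPR-CORE) changes per metric.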
To obtain an anti-monotone predicate w.r.t. the similarity-based identification (see Definition 2.3), we use the notions of similarity graph and clique. Let Gsim denote the graph whose vertices are the publishers and whose edges connect each pair of publishers with similarity no less than θsim.
Lemma 3.2. Given the similarity graph Gsim, a group g is pair-wise similar iff g forms a clique in Gsim.
Consequently, we can derive a predicate using the notion of clique. Proposition 1 indicates that this predicate is anti-monotone.
Proposition 1 (similar clique property). The predicate "g forms a clique in Gsim" is anti-monotone.
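A minimal sketch of the clique check behind Lemma 3.2 and Proposition 1; the edge set below stands in for a similarity graph whose edges already passed the θsim threshold.

```python
from itertools import combinations

# Sketch: a group is pair-wise similar iff it is a clique in the similarity
# graph. The edges here are invented; in the system they come from the
# initializing module's threshold test.
def is_clique(group, edges):
    return all(frozenset(p) in edges for p in combinations(group, 2))

edges = {frozenset(e) for e in [("p1", "p2"), ("p2", "p3"),
                                ("p1", "p3"), ("p3", "p4")]}
print(is_clique(["p1", "p2", "p3"], edges))        # True
print(is_clique(["p1", "p2", "p3", "p4"], edges))  # False: p1-p4 missing
```

Anti-monotonicity is visible here: every subgroup of a clique is itself a clique, so a group that fails the check rules out all of its supergroups.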
With respect to the gpr-based identification given in Definition 2.2, the GPR property cannot be transformed into an anti-monotone predicate directly. We instead use an anti-monotone predicate based on an alternative concept called GPR-CORE [12]. Let the click graph of a group be the weighted bipartite graph between its publishers and visitors, where edges represent clicks.
Definition 3.3. A click graph g is a GPR-CORE iff the GPR boundary of g is no less than the threshold and g is connected.
Proposition 2 (gpr core property). The GPR-CORE property is anti-monotone.
Consequently, the following lemma articulates the gpr-based identification in terms of the anti-monotone GPR-CORE property.
Lemma 3.4. A group g satisfies the gpr-based identification iff:
● g is merged from a set of GPR-COREs;
● the GPR value of g is no less than θgpr;
● the graph of g is connected.
Figure 5 illustrates the framework of the hybrid detection system. The system has three major stages: the initialization phase, the inductive phase, and the finalization phase. Within this framework, we adopt an apriori-style search to identify coalition attacks. Moreover, we define generic modules that can be instantiated with different metrics. Such a framework supports both the efficiency and the extensibility of the hybrid approach. Following Figure 5 and the discussion in Section 3.1, the implementation of the system is straightforward.
For the similarity-based identification, the initializing module is responsible for generating the similarity graph Gsim, and the checking module verifies the clique property of candidate groups; the corresponding procedures are given in Algorithms 1 and 2.
Algorithm 1 initializing: similarity-based |
1:procedure PAIR_SIMILARITY(Data) |
2:pairs |
3:pairs_value |
4: |
5:for each in pairs do |
6:(intersection, union) |
7:similarity |
8:if similarity |
9:Adding each as an edge in |
10:end if |
11:end for |
12:return |
13:end procedure |
Algorithm 2 checking: similarity-based |
1:procedure SIMILAR GROUP (candidate, G_sim) |
2:group_edges |
3:clique_groups |
4:for each in candidate do |
5:complete_edge |
6:if complete_edge |
7:Adding each to clique_groups |
8:end if |
9:end for |
10:end procedure |
For the gpr-based identification, we calculate the gpr value for each individual publisher and visitor in the initializing module, and check the gpr-core property in the checking module. Moreover, Lemma 3.4 defines the tasks of the identification module in the gpr-based case. The corresponding procedures are given in Algorithms 3, 4, and 5, respectively; the supporting functions are listed in Algorithms 7 and 8.
Algorithm 3 initializing: gpr-based |
1:procedure INDIVIDUAL_GPR(Data) |
2:(P_gainb, V_gainb) |
3:GPR |
4:for p in P_gainb do |
5:resource |
6:gprb |
7:gpr |
8:connectivity |
9:if gprb |
10:Adding (p, gpr, gprb, connectivity) to the GPR |
11:end if |
12:end for |
13:for v in V_gainb do |
14:resource |
15:gprb |
16:gpr |
17:connectivity |
18:if gprb |
19:Adding (v, gpr, gprb, connectivity) to the GPR |
20:end if |
21:end for |
22:return GPR |
23:end procedure |
Algorithm 4 checking: gpr-based |
1: procedure GPR-CORE (Data, candidate, |
2:gbc_values |
3: |
4:for g in candidate do |
5:(publishers, visitors) |
6:resource |
7:(gpr, gprb) |
8: Adding |
9:end for |
10:subset |
11:gprcore |
12:for g in candidate do |
13:if subset[g] == n |
14:Adding g to the gprcore |
15:end if |
16:end for |
17:end procedure |
Algorithm 5 identification: gpr-based |
1:procedure FIND_GPR_GROUPS (gprcore, |
2:merged_group |
3:gbc_values |
4:gpr_groups |
5:for g in merged_group do |
6:(publishers, visitors) |
7:resource |
8:(gain, gain_b, connectivity) |
9:gbr |
10:if |
11:Adding g to gpr_group |
12:end if |
13: end for |
14:end procedure |
The MapReduce paradigm is discussed in Section 2. The input data is mapped into a collection of (key, value) pairs, and the reducers aggregate the values by key. The mapper and reducer functions used by the algorithms above are listed in Algorithms 6-8.
Algorithm 6 functions associated with the similarity metric |
1:function PAIR_INTERSECTION_UNION (Data, pair) |
2:pairs_value |
3:for (key, values) |
4:\\ "key" is the key for each pairs |
5:\\ "values" are counts of clicks that are associated the pair, respectively |
6:pairs |
7:if values[0] |
8:intersection |
9:union |
10:else if values[0] |
11:intersection |
12:union |
13:else values[0] |
14:intersection |
15:union |
16:end if |
17:pairs_value[key] |
18:end for |
19:return pairs_value |
20:end function |
21:function COUNT_GROUP_EDGES ( |
22:group_edges |
23:for (key, value) |
24:\\ "key" is the encoded key of each group in candidate |
25:\\ "value" is the count of edges in |
26:group=key.decoder(key) |
27:group_edges[group] |
28:end for |
29:return group_edges |
30:end function |
Algorithm 7 functions associated with the gpr metric |
1:function INDIVIDUAL_GAIN_BOUNDARY(Data) |
2:P_gainb |
3:V_gainb |
4: for (key, value) |
5:\\ "key" is the encoded key of each publisher or visitor |
6:\\ "value" is the sum cost that is associated with the publisher or visitor |
7:(p, v)=key.decoder(key) |
8:if v=="" then |
9:P_gain[p] |
10: else if p == "" then |
11:V_gain[p] |
12:end if |
13:end for |
14:return (P_gainb, V_gainb) |
15:end function |
16:function GROUP_GBC_VALUES (Data, candidate, size) |
17:gbc_values |
18:for (key, values) |
19:\\ "key" is the encoded key for each group in candidate |
20:\\ "values" correspond to the gain, gain boundary, and the number of connected edges that are associated with the group |
21:gain |
22:gain_boundary |
23:if values[2] |
24:connectivity = true |
25:else |
26:connectivity = false |
27:end if |
28:gbc_values[key] |
29:end for |
30:return gbc_values |
31:end function |
32:function COUNT_VALID_SUBSET(candidate) |
33:subset |
34:for (key, value) |
35:\\ "key" is the encoded key for each group in candidate |
36:\\ "value" count the number of valid subgroup of the group (i.e. satisfies the properties of GRP-Core) |
37:subset[key] |
38:end for |
39:return subset |
40:end function |
Algorithm 8 functions associated with the gpr metric (cont.) |
1:function MERGE_GPR_CORE(Data, gprcore) |
2: |
3: |
4:prev |
5:new |
6:while prev!=new do |
7:pairs |
8:i |
9:separated |
10:for (key, value) |
11:\\ "key" is the encoded key for the merged group that associated with each pair. |
12:\\ "value" is the number of overlapped vertices for the pair |
13:if value |
14:Adding key to |
15:else |
16:(merged_key, component_key) |
17:separated[merged_key] |
18:end if |
19:end for |
20:unknown_keys |
21:for key in unknown_keys do |
22:candidate[key] |
23:end for |
24:for(key, values) |
25:if values[2] |
26:Adding key to |
27:end if |
28:end for |
29:prev |
30:new |
31:end while |
32: |
33:end function |
To evaluate the proposed detection algorithm, we compare its performance with algorithms that use only a single metric. The proposed hybrid framework provides a convenient way to extend the system with different metrics: instead of re-implementing those algorithms separately, we can simply modify the list of metrics in Figure 5 to obtain and compare the different algorithms.
We revisit the motivating examples introduced in Section 2. Both values of
We also applied the proposed technique to practical data. The raw data are daily logs including impressions, clicks, and conversions from real traffic. Each record in the log is related to more than
#publishers | detection rate | precision |
340 | 96.12% | 95.03% |
414 | 97.12% | 99.35% |
407 | 99.18% | 94.94% |
We can see that the algorithm achieves an average detection rate of 97.47% and an average precision of 96.44% over the three datasets.
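The two reported measures can be computed as in the following sketch (publisher sets invented), with detection rate read as recall over the known fraudulent publishers and precision as the fraction of flagged publishers that are truly fraudulent.

```python
# Hedged sketch of the evaluation measures; the publisher ids are made up.
def detection_rate(true_fraud, flagged):
    """Recall: fraction of known fraudulent publishers that were flagged."""
    return len(true_fraud & flagged) / len(true_fraud)

def precision(true_fraud, flagged):
    """Fraction of flagged publishers that are truly fraudulent."""
    return len(true_fraud & flagged) / len(flagged)

true_fraud = {"p1", "p2", "p3", "p4"}
flagged = {"p1", "p2", "p3", "p9"}
print(detection_rate(true_fraud, flagged))  # 0.75
print(precision(true_fraud, flagged))       # 0.75
```

The two measures trade off through the thresholds θsim and θgpr: lowering them flags more groups (higher detection rate, lower precision), and vice versa.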
In this paper, we proposed a detection technique for coalition attacks in online advertising. We developed a hybrid framework that can detect coalition attacks based on multiple metrics, and articulated the theoretical properties that each metric must satisfy within the framework. In particular, we adopted similarity and gpr as the two metrics in the system, and evaluated the performance of the technique on different datasets.
As future work, more datasets will be tested. Since there are no benchmark data sets for coalition frauds in online advertising, obtaining data for evaluation is usually difficult and expensive. Besides real-world data sets and third-party services, simulation techniques can be used to generate synthetic data for both normal and fraudulent traffic, enabling extensive empirical studies on different traffic patterns. To further improve the performance of the detection technique, more metrics can be explored to identify coalition attacks [6,10]; with the hybrid framework, additional metrics can be integrated into the system efficiently. The ultimate goal of this work is to provide a reliable, extensible, and scalable online detection system for the rapidly growing online advertising industry.
The project was supported by the Mathematics of Information Technology and Complex Systems of Canada (MITACS) and EQ Works (http://www.eqworks.com/).
The following code is written in Python.
In this subsection, we list functions for encoding, decoding, and comparing the keys.
def key_encoder(publishers, visitors):
    # Encode sorted publisher and visitor ids into a single string key.
    sorted_p = sorted(publishers)
    sorted_v = sorted(visitors)
    key_p = "_".join(each for each in sorted_p)
    key_v = "_".join(each for each in sorted_v)
    key = key_p + "!!" + key_v
    return key
def key_decoder(key):
    # Decode a string key back into publisher and visitor id lists.
    (key_p, key_v) = key.split("!!")
    publishers = key_p.split("_") if key_p != "" else []
    visitors = key_v.split("_") if key_v != "" else []
    return (publishers, visitors)
def key_cmp(key1, key2):
    # True iff the group encoded by key1 is contained in that of key2.
    (p1, v1) = key_decoder(key1)
    (p2, v2) = key_decoder(key2)
    p_in = set(p1).issubset(set(p2))
    v_in = set(v1).issubset(set(v2))
    return p_in and v_in
We define generic mapper and reducer functions that can be reused by the other functions.
def inductive_generator_mapper(line, params):
    # Extend each candidate group by one publisher or one visitor.
    candidate = line.split(", ")[0]
    P = params[0]
    V = params[1]
    (publishers, visitors) = key_decoder(candidate)
    for p in P:
        if p not in publishers:
            key = key_encoder(publishers + [p], visitors)
            yield (key + ":" + candidate, [1])
    for v in V:
        if v not in visitors:
            key = key_encoder(publishers, visitors + [v])
            yield (key + ":" + candidate, [1])
def reducer(iter, params):
    # Aggregate values per key: element-wise sum for numbers, union for sets.
    data = {}
    first = True
    prev = None
    for key, value in sorted(iter):
        if key not in data:
            if not first:
                yield prev, data[prev]
                data = {}
            data[key] = value
        else:
            for k, v in enumerate(value):
                if isinstance(v, set):
                    data[key][k] = data[key][k].union(v)
                else:
                    data[key][k] += v
        first = False
        prev = key
    for key, value in data.items():
        yield key, value
All other mapper and reducer functions are given in the rest of this section.
def mapper_similarity_pairs(line, params):
    # Emit per-pair click indicators for computing bag intersection/union.
    groups = params
    (head, tail, weight, label) = line.strip().split(", ")
    for each in groups:
        (p1, p2) = each.split("_")
        key = each + "!!" + tail
        if head == p1:
            yield (key, [1, 0])
        elif head == p2:
            yield (key, [0, 1])
        else:
            pass
def mapper_similarity_groups(line, params):
    # Count, per candidate group, the similar pairs it fully contains.
    group_keys = params[0]
    V_key = params[1]
    threshold = params[2]
    (p1, p2, similarity) = line.split(", ")
    if float(similarity) < threshold:
        pass
    else:
        for key in group_keys:
            (p_key, v_key) = key.split("!!")
            new_key = p_key + "!!" + V_key
            pubs = p_key.split("_")
            if p1 in pubs and p2 in pubs:
                yield (new_key, [1])
def mapper_vertices(line, param):
    # Emit per-publisher and per-visitor edge counts and weights.
    (head, tail, weight, label) = line.split(", ")
    p_key = head + "!!"
    v_key = "!!" + tail
    yield (p_key, [1, float(weight)])
    yield (v_key, [1, float(weight)])
def mapper_gbc_values(line, params):
    # Emit gain, gain-boundary, and connectivity contributions per group.
    groups = params
    (head, tail, weight, label) = line.split(", ")
    for key in groups:
        and_edge = 0
        or_edge = 0
        (key_p, key_v) = key.split("!!")
        publishers = key_p.split("_") if key_p != "" else []
        visitors = key_v.split("_") if key_v != "" else []
        if head in publishers and tail in visitors:
            and_edge = float(weight)
        if head in publishers or tail in visitors:
            or_edge = float(weight)
        components = groups[key]
        connectivity = 0
        for each in components:
            (p_key, v_key) = each.split("!!")
            heads = p_key.split("_")
            tails = v_key.split("_")
            if head in heads and tail in tails:
                connectivity = 1
                break
        yield (key, [and_edge, or_edge, connectivity])
[1] | Zenithoptimedia, Available from: http://techcrunch.com/2014/04/07/. |
[2] | DoubleClick by Google, Available from: http://www.google.ca/doubleclick/. |
[3] | RightMedia from Yahoo!, Available from: https://www.rightmedia.com/. |
[4] | Disco MapReduce, Available from: http://discoproject.org/. |
[5] | Hadoop MapReduce, Available from: https://hadoop.apache.org. |
[6] | L. Adamic and E. Adar, Friends and neighbors on the web, Social Networks, 25 (2003), 211-230. |
[7] | V. Anupam, A. Mayer, K. Nissim and B. Pinkas, On the security of pay-per-click and other web advertising schemes, in The 8th International Conference on World Wide Web, 31 (1999), 1091-1100. |
[8] | M. S. Charikar, Similarity estimation techniques from rounding algorithms, in Proceedings of the Thirty-fourth Annual ACM Symposium on Theory of Computing, (2002), 380-388. |
[9] | N. Daswani, C. Mysen, V. Rao, S. Weis, K. Gharachorloo and S. Ghosemajumder, Online advertising fraud, in Crimeware: Understanding New Attacks and Defenses, Addison-Wesley Professional, Reading, 2008. |
[10] | M. A. Hasan, A survey of link prediction in social networks, Social Network Data Analytics, (2011), 243-275. |
[11] | B. Kitts, J. Y. Zhang, A. Roux and R. Mills, Click fraud detection with bot signatures, in Proceedings of ISI 2013, Seattle, Washington, USA, (2013), 146-150. |
[12] | C. Kim, H. Miao and K. Shim, CATCH: A detecting algorithm for coalition attacks of hit inflation in internet advertising, Information Systems, 36 (2011), 1105-1123. |
[13] | C. K. S. Leung, Anti-monotone constraints, in Encyclopedia of Database Systems (eds. Ling Liu and M. Tamer Özsu), Springer, (2009), 98. |
[14] | K. Lee, H. Choi and B. Moon, Parallel data processing with MapReduce: A survey, in The ACM Special Interest Group on Management of Data Record, 40 (2011), 11-20. |
[15] | A. Metwally, D. Agrawal and A. El Abbadi, Duplicate detection in click streams, in International World Wide Web Conference, (2005), 12-21. |
[16] | A. Metwally, D. Agrawal and A. El Abbadi, Using association rules for fraud detection in web advertising networks, in Proceedings of the 31st International Conference on Very Large Data Bases, (2005), 169-180. |
[17] | A. Metwally, D. Agrawal and A. El Abbadi, Detectives: Detecting coalition hit inflation attacks in advertising networks streams, in Proceedings of the 16th International Conference on World Wide Web, (2007), 241-250. |
[18] | A. Metwally, D. Agrawal, A. El Abbadi and Q. Zheng, On hit inflation techniques and detection in streams of web advertising networks, in Proceedings of the 27th International Conference on Distributed Computing Systems, (2007), 52. |
[19] | C. Phua, E. Y. Cheu, G. E. Yap, K. Sim and M. N. Nguyen, Feature engineering for click fraud detection, in ACML Workshop on Fraud Detection in Mobile Advertising, 2012. |
[20] | Y. Peng, L. Zhang, J. M. Chang and Y. Guan, An effective method for combating malicious scripts clickbots, in Computer Security, Lecture Notes in Computer Science, 5789 (2009), 523-538. |
[21] | B. K. Perera, A Class Imbalance Learning Approach to Fraud Detection in Online Advertising, M.Sc. thesis, Masdar Institute of Science and Technology, 2013. |
[22] | K. Springborn and P. Barford, Impression fraud in online advertising via pay-per-view networks, in Proceedings of the 22nd USENIX Security Symposium, (2013), 211-226. |
[23] | F. Soldo and A. Metwally, Traffic anomaly detection based on the IP size distribution, in Proceedings of the INFOCOM International Conference on Computer Communications, (2012), 2005-2013. |
[24] | C. Walgampaya and M. Kantardzic, Cracking the smart clickbot, in Proceedings of the 13th IEEE International Symposium on Web Systems Evolution, (2011), 125-134. |
[25] | C. Walgampaya, M. Kantardzic and R. Yampolskiy, Evidence fusion for real time click fraud detection and prevention, in Intelligent Automation and Systems Engineering, Lecture Notes in Electrical Engineering, Springer, 103 (2011), 1-14. |
[26] | Q. Zhang and W. Feng, Detecting coalition frauds in online-advertising, in Mathematical and Computational Approaches in Advancing Modern Science and Engineering (eds. Jacques Bélair, Ian A. Frigaard, Herb Kunze, Roman Makarov, Roderick Melnik and Raymond J. Spiteri), Springer, (2016), 595-605. |
[27] | L. Zhang and Y. Guan, Detecting click fraud in pay-per-click streams of online advertising networks, in The 28th International Conference on Distributed Computing Systems, (2008), 77-84. |
340 | 96.12% | 95.03% |
414 | 97.12% | 99.35% |
407 | 99.18% | 94.94% |
Algorithm 1 initializing: similarity-based
1:procedure PAIR_SIMILARITY(Data)
2:pairs
3:pairs_value
4:
5:for each in pairs do
6:(intersection, union)
7:similarity
8:if similarity
9:Adding each as an edge in G_sim
10:end if
11:end for
12:return G_sim
13:end procedure
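The right-hand sides of PAIR_SIMILARITY are elided in the listing above, so the following is only a minimal Python sketch of its intent. It assumes the similarity metric is the Jaccard ratio of two publishers' visitor sets and that the cut-off `THRESHOLD` is a free parameter; neither is fixed by the listing.

```python
# Sketch of Algorithm 1 (PAIR_SIMILARITY). Jaccard similarity and the
# threshold value are illustrative assumptions, not the paper's exact metric.
from itertools import combinations

THRESHOLD = 0.5  # hypothetical similarity cut-off


def pair_similarity(clicks):
    """clicks: dict mapping each publisher to the set of visitors who clicked."""
    edges = []
    for p1, p2 in combinations(sorted(clicks), 2):
        inter = len(clicks[p1] & clicks[p2])
        union = len(clicks[p1] | clicks[p2])
        similarity = inter / union if union else 0.0
        if similarity >= THRESHOLD:
            edges.append((p1, p2))  # add the pair as an edge in G_sim
    return edges


# Example: p1 and p2 share most visitors, p3 is independent.
data = {"p1": {"v1", "v2", "v3"}, "p2": {"v1", "v2"}, "p3": {"v9"}}
print(pair_similarity(data))  # [('p1', 'p2')]
```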
Algorithm 2 checking: similarity-based
1:procedure SIMILAR_GROUP(candidate, G_sim)
2:group_edges
3:clique_groups
4:for each in candidate do
5:complete_edge
6:if complete_edge
7:Adding each to clique_groups
8:end if
9:end for
10:end procedure
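Under the natural reading that `complete_edge` is the edge count n(n-1)/2 of a complete graph on n vertices, the clique test of SIMILAR_GROUP can be sketched as below; the edge-list representation of G_sim is an assumption.

```python
# Sketch of Algorithm 2 (SIMILAR_GROUP): a candidate group is kept only
# if it forms a clique in G_sim, i.e. all n*(n-1)/2 pairwise edges exist.
def similar_group(candidates, g_sim_edges):
    edge_set = {frozenset(e) for e in g_sim_edges}
    clique_groups = []
    for group in candidates:
        n = len(group)
        complete_edge = n * (n - 1) // 2  # edges in a complete graph on n vertices
        present = sum(
            1
            for i, a in enumerate(group)
            for b in group[i + 1:]
            if frozenset((a, b)) in edge_set
        )
        if present == complete_edge:
            clique_groups.append(group)
    return clique_groups


edges = [("p1", "p2"), ("p2", "p3"), ("p1", "p3")]
print(similar_group([["p1", "p2", "p3"], ["p1", "p2", "p4"]], edges))
# [['p1', 'p2', 'p3']]
```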
Algorithm 3 initializing: gpr-based
1:procedure INDIVIDUAL_GPR(Data)
2:(P_gainb, V_gainb)
3:GPR
4:for p in P_gainb do
5:resource
6:gprb
7:gpr
8:connectivity
9:if gprb
10:Adding (p, gpr, gprb, connectivity) to the GPR
11:end if
12:end for
13:for v in V_gainb do
14:resource
15:gprb
16:gpr
17:connectivity
18:if gprb
19:Adding (v, gpr, gprb, connectivity) to the GPR
20:end if
21:end for
22:return GPR
23:end procedure
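The exact formulas behind `gpr` and `gprb` are elided in the listing, so the sketch below only illustrates the shape of INDIVIDUAL_GPR: for each individual publisher or visitor it derives a gain-per-resource value and a boundary value, and keeps the entity when the boundary value clears a threshold. The ratios `gain / resource` and `gain_boundary / resource`, and the `GPR_THRESHOLD` constant, are assumptions.

```python
# Sketch of Algorithm 3 (INDIVIDUAL_GPR). The gain, gain-boundary, and
# resource formulas are not recoverable from the listing; the ratios and
# the threshold used here are illustrative assumptions.
GPR_THRESHOLD = 1.0  # hypothetical cut-off on the gain-boundary ratio


def individual_gpr(p_gainb, v_gainb, resources):
    """p_gainb / v_gainb: dicts of (gain, gain_boundary) per publisher / visitor."""
    gpr_table = []
    for entity, (gain, gain_b) in {**p_gainb, **v_gainb}.items():
        resource = resources[entity]
        gpr = gain / resource
        gprb = gain_b / resource
        connectivity = False  # a single vertex has no connecting edges yet
        if gprb >= GPR_THRESHOLD:
            gpr_table.append((entity, gpr, gprb, connectivity))
    return gpr_table
```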
Algorithm 4 checking: gpr-based
1:procedure GPR-CORE(Data, candidate,
2:gbc_values
3:
4:for g in candidate do
5:(publishers, visitors)
6:resource
7:(gpr, gprb)
8:Adding
9:end for
10:subset
11:gprcore
12:for g in candidate do
13:if subset[g] == n then
14:Adding g to the gprcore
15:end if
16:end for
17:end procedure
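The `subset[g] == n` test in Algorithm 4 keeps a candidate group only when all of its relevant subgroups are themselves valid. Under the assumption that n counts the (|g|-1)-element subgroups, the check can be sketched as follows; the validity criterion itself (the GPR-Core properties) is supplied externally here, and the `gains` dictionary in the example is hypothetical.

```python
# Sketch of the GPR-Core test in Algorithm 4: a group g survives only if
# every one of its (|g|-1)-element subgroups is valid, i.e. subset[g] == |g|.
# The per-subgroup validity criterion is assumed to be given as a predicate.
from itertools import combinations


def gpr_core(candidates, is_valid_subgroup):
    gprcore = []
    for g in candidates:
        n = len(g)  # a group of size n has n subgroups of size n - 1
        valid = sum(1 for sub in combinations(g, n - 1) if is_valid_subgroup(sub))
        if valid == n:
            gprcore.append(g)
    return gprcore


gains = {"p1": 5, "p2": 3, "p3": 1}  # hypothetical individual gains
is_valid = lambda sub: sum(gains[x] for x in sub) >= 3
print(gpr_core([("p1", "p2"), ("p2", "p3")], is_valid))  # [('p1', 'p2')]
```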
Algorithm 5 identification: gpr-based
1:procedure FIND_GPR_GROUPS(gprcore,
2:merged_group
3:gbc_values
4:gpr_groups
5:for g in merged_group do
6:(publishers, visitors)
7:resource
8:(gain, gain_b, connectivity)
9:gbr
10:if
11:Adding g to gpr_groups
12:end if
13:end for
14:end procedure
Algorithm 6 functions associated with the similarity metric
1:function PAIR_INTERSECTION_UNION(Data, pair)
2:pairs_value
3:for (key, values)
4:\\ "key" is the key for each pair
5:\\ "values" are the counts of clicks associated with the pair, respectively
6:pairs
7:if values[0]
8:intersection
9:union
10:else if values[0]
11:intersection
12:union
13:else
14:intersection
15:union
16:end if
17:pairs_value[key]
18:end for
19:return pairs_value
20:end function
21:function COUNT_GROUP_EDGES(
22:group_edges
23:for (key, value)
24:\\ "key" is the encoded key of each group in candidate
25:\\ "value" is the count of edges in the group
26:group=key.decoder(key)
27:group_edges[group]
28:end for
29:return group_edges
30:end function
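The three-way branch on `values[0]` in PAIR_INTERSECTION_UNION suggests a comparison of the pair's two click counts. Reading the smaller count as the intersection and the larger as the union (an assumption, since the right-hand sides are elided), the function can be sketched as:

```python
# Sketch of PAIR_INTERSECTION_UNION (Algorithm 6). The min/max reading of
# the two per-pair click counts is an assumption made for illustration.
def pair_intersection_union(data):
    pairs_value = {}
    for key, values in data.items():
        if values[0] < values[1]:
            intersection, union = values[0], values[1]
        elif values[0] > values[1]:
            intersection, union = values[1], values[0]
        else:  # values[0] == values[1]
            intersection = union = values[0]
        pairs_value[key] = (intersection, union)
    return pairs_value


print(pair_intersection_union({("p1", "p2"): (3, 5), ("p1", "p3"): (4, 4)}))
# {('p1', 'p2'): (3, 5), ('p1', 'p3'): (4, 4)}
```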
Algorithm 7 functions associated with the gpr metric
1:function INDIVIDUAL_GAIN_BOUNDARY(Data)
2:P_gainb
3:V_gainb
4:for (key, value)
5:\\ "key" is the encoded key of each publisher or visitor
6:\\ "value" is the summed cost that is associated with the publisher or visitor
7:(p, v)=key.decoder(key)
8:if v=="" then
9:P_gainb[p]
10:else if p=="" then
11:V_gainb[v]
12:end if
13:end for
14:return (P_gainb, V_gainb)
15:end function
16:function GROUP_GBC_VALUES(Data, candidate, size)
17:gbc_values
18:for (key, values)
19:\\ "key" is the encoded key for each group in candidate
20:\\ "values" correspond to the gain, gain boundary, and the number of connected edges that are associated with the group
21:gain
22:gain_boundary
23:if values[2]
24:connectivity = true
25:else
26:connectivity = false
27:end if
28:gbc_values[key]
29:end for
30:return gbc_values
31:end function
32:function COUNT_VALID_SUBSET(candidate)
33:subset
34:for (key, value)
35:\\ "key" is the encoded key for each group in candidate
36:\\ "value" counts the number of valid subgroups of the group (i.e., subgroups satisfying the properties of GPR-Core)
37:subset[key]
38:end for
39:return subset
40:end function
Algorithm 8 functions associated with the gpr metric (cont.)
1:function MERGE_GPR_CORE(Data, gprcore)
2:
3:
4:prev
5:new
6:while prev!=new do
7:pairs
8:i
9:separated
10:for (key, value)
11:\\ "key" is the encoded key for the merged group that is associated with each pair
12:\\ "value" is the number of overlapping vertices for the pair
13:if value
14:Adding key to
15:else
16:(merged_key, component_key)
17:separated[merged_key]
18:end if
19:end for
20:unknown_keys
21:for key in unknown_keys do
22:candidate[key]
23:end for
24:for (key, values)
25:if values[2]
26:Adding key to
27:end if
28:end for
29:prev
30:new
31:end while
32:
33:end function
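MERGE_GPR_CORE iterates until the set of merged groups stops changing (`while prev != new`). That fixpoint loop can be illustrated by repeatedly merging any groups that share a vertex; the overlap condition (at least one shared vertex) is an assumption, and the paper's acceptance test on merged groups is elided here.

```python
# Sketch of the fixpoint loop in Algorithm 8: repeatedly merge GPR-core
# groups that overlap in at least one vertex until the group set is stable.
# The >= 1 shared-vertex merge rule is an illustrative assumption.
def merge_gpr_core(gprcore):
    groups = [set(g) for g in gprcore]
    prev, new = None, len(groups)
    while prev != new:  # stop once a pass produces no further merging
        prev = new
        merged = []
        for g in groups:
            for m in merged:
                if m & g:          # pair overlaps: fold g into m
                    m |= g
                    break
            else:                  # no overlap with any merged group so far
                merged.append(set(g))
        groups = merged
        new = len(groups)
    return [sorted(g) for g in groups]


print(merge_gpr_core([{"p1", "p2"}, {"p2", "p3"}, {"p7", "p8"}]))
# [['p1', 'p2', 'p3'], ['p7', 'p8']]
```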
#publishers | detection rate | precision
340 | 96.12% | 95.03%
414 | 97.12% | 99.35%
407 | 99.18% | 94.94%