1.
Introduction
Private set intersection (PSI) is a cryptographic protocol that allows two parties to jointly calculate set intersection over their datasets without leaking any additional information. Formally, a PSI protocol is described as follows: Two clients, Alice and Bob, have different (or same) size of the datasets, where Alice has a set X={x1,x2,⋯,xm} and Bob has a set Y={y1,y2,⋯,yn}. At the conclusion of the protocol, if the protocol is symmetric, it outputs (X∩Y,X∩Y)←PSI(X,Y) to Alice and Bob simultaneously; If the protocol is asymmetric, it outputs (X∩Y,Λ)←PSI(X,Y) to one client, where Λ denotes the empty string.
PSI has been applied in many scenarios. For instance, the department of homeland security (DHS), who maintains a terror watch list (or 'do-not-fly' list), wants to check whether its terrorist suspects are in the list of passengers from a flight. Obviously, DHS is not allowed to obtain the private information of the innocent passengers in the list. As another classic example, two real estate companies would like to identity homeowners who have signed double contracts with both companies for profit purpose, and both two are not willing to reveal any extra information of other homeowners. Furthermore, PSI also has been utilized extensively in a wide range of emerging cloud computing paradigm setting such as internet-based Personal Healthy Records (PHRs) systems [1], mobile social networks [2], human genomic research [3], and online advertising [4].
With the rapid development of big data, traditional PSI protocols (e.g., [5,6,7]) are never well applicable for large-scale dataset. How to design an efficient and scalable privacy-preserving PSI solution is an open problem. Dong et al. [6] combined Garbled Bloom filter to propose a two-party PSI protocol for million-element datasets, and hereafter several works focused on enhancing the efficiency of PSI. Especially, Pinkas et al. [8] improved the efficiency with Oblivious Transfer extension technique, and then proposed another new circuit-based protocols for computing variants of the intersection with an almost linear number of comparisons based on new variants of Cuckoo hashing in two dimensions. And recently, Chase et al. [9] presented an efficient PSI protocol with lightweight Oblivious Pseudo-Random Functions in a two-party setting model. However, such two-party model introduces multiple interactions and cannot be scalable to billion-scale datasets.
Kamara et al. [10] introduced an efficient PSI protocol for billion-element datasets using Pseudorandom Permutation under a server-aided setting by parallel mode and hardware acceleration. Later, Zhang et al. [11] proposed a two-server-aided PSI protocol with multiple keys with scarifying a little efficiency compared with [10]. There are also amount of works focusing on server-aided privacy preserving outsourced computations (e.g. [12,13,14,15]), and especially [16,17,18,19] realize outsourced private set intersection with combing functional encryption algorithm. Regrettably, these solutions require either multiple interactions or very complex operations over the outsourced ciphertexts. Moreover, in the hidden size model, the clients have to undertake most of set intersection calculation. All of these obstacles make them unscalable to a very large datasets (such as billion set size).
Our Contributions. Motivated by previous PSI works on massive datasets, we exploit the Bloom Filter technique and lightweight deterministic encryption algorithm to design an efficient fog/cloud server-aided PSI protocol in this paper. Our Protocol can output an approximate intersection due to the false positive in Bloom Filter, while this approximate result is guaranteed to be almost the same with the actual intersection through setting the suitable parameter of Bloom Filter. The main contributions can be summarized as below:
● We present the first basic protocol with extremely high efficiency and it is secure against a semi-honest server. Our first protocol is also regarded as a basic building block for the second protocol.
● For more security, we propose the second protocol with secure against a lazy or cheating malicious server. Specifically, by adding dummy sets, our second protocol not only can preserve the set/intersection size from being leaked to the server, and also can check whether the server honestly returns the intersection result with a quite high probability.
● Experimental evaluation shows that our two basic protocols need only around 15 and 24 seconds respectively over one million-element datasets. Furthermore, we propose a novel multi-round mechanism to improve the efficiency and accuracy. Through the implementation, a two-round mechanism can enhance efficiency by almost twice than two basic protocols in average with maintaining certain accuracy.
2.
Related work
Private Set Intersection (PSI) proposed by Freedman et al. [20] is a representative cryptographic computation protocol between two parties. Subsequent works focus on the study of PSI protocols with more efficiency and security. We summarize them with two modes as follows: two-party PSI mode and server-aided PSI mode.
Two-party PSI Mode. Freedman et al. [20] proposed two secure PSI protocols with Oblivious Polynomial Evaluation, and lots of subsequent works focused on improving the computation efficiency or obtaining more security (e.g., [21,22]). There exist several solutions using Pseudo-Random Functions (PRF) (e.g., [2,9,23]), which are instantiated effectively since with symmetric-key technique. What's more, several other works combining different techniques focused on realizing high efficiency and security over a large-scale dataset (e.g., Bloom Filter [6], cut-and-choose [24], Circuits-based techniques [25,26], and Oblivious Transfer (OT) [8,27,28,29]). Unfortunately, these two-party setting solutions are seemingly infeasible with low capability clients since multiple interactions would incur an extreme computation cost. So a large body of works under server-aided setting for PSI were introduced (e.g., [7,10,11,16,17,18,19,30,31]).
Server-aided PSI Mode. In the Server-aided mode, two parties with weak computation capability in the protocols can outsource their storage and intersection computation to an aided server, and the server returns the intersection results without learning any additional private information of each parties. A branch of works (e.g., [7,31,32]) used homomorphic encryption to outsource the encrypted datasets and delegate the intersection operations to the server. Woefully, the lower efficiency of homomorphic encryption prevents these schemes from performing on a large scale dataset (e.g., billion set size). In order to get a higher efficiency, [10,16,18] leveraged Pseudo-Random Functions with no public-key operations. Moreover, an effective data structure, Bloom Filter, is also utilized to do set operations in variety of work [16,19,33,34]. Unfortunately, as illustrated in Section 1, multiple interactions with complex operations make these solutions infeasible for a very large-scale set. [30] proposed a fine-grained access control private set intersection (PSI) scheme with non-interaction, while due to the inefficiency of the attribute-based encryption, this scheme is unable to scale to the PSI computation over billion size sets. In terms of efficiency, [10] is a quite feasible solution, but it cannot minimize privacy disclosure because the set/intersection size will be leaked to the server. In this paper, we focus on designing an efficient and secure protocol via leveraging Bloom Filter and lightweight symmetric deterministic encryption, which can also preserve the set/intersection size from leaking to the aided server.
3.
Definitions
3.1. System definition
Three entities are consisted in our server-aided PSI system model: Alice, Bob, and an aided server (the fog/cloud server), where two clients Alice and Bob can engage in a computation of set intersection by the assistance of an un-fully trusted server without leaking any extra private information. Specifically, each client, Alice and Bob, holds a secret set respectively, denoted as SA and SB. Before conducting the set intersection, two clients encrypt their original sets respectively and outsource them to the aided server. Then, the server is delegated to compute the intersection over two encrypted sets of SA and SB without gaining any more private information, and finally outputs the private set intersection of SA∩SB to Alice and Bob. For that, we aim to build a more efficient solution with leaking minimal privacy to the server and saving more computation cost of two clients, and we formalize our security model as follows.
3.2. Security definition
Similar to the related works (e.g., [6,10]), we also consider two different models of adversaries, including the semi-honest and malicious adversaries, and only one of the parties can be corrupted with an adversary at the same time. Specifically, a semi-honest adversary can execute the protocol with the specified steps honestly, but it curiously records all intermediate computations to derive other parties' private data; a malicious client would provide a false input to infer extra privacy of the other client, and a malicious server would arbitrarily deviate from the specified executions in the protocol or temper with the results of computation, such as deleting or modifying elements of the computed intersection or even randomly outputting a false result without any calculations.
In addition, we suppose that no collusion is occurred in any two parties in our protocols under the above two types of adversary model and this non-collusion setting is referred in most previous protocols (e.g., [10,35]).
Ideal vs. Real. We introduce the general ideal vs. real model like [36] to illustrate the privacy of our protocols in this paper, which is a common security model of secure multiple computation.
Generally speaking, we assume that there is a fully Trusted Third Party (TTP) and a probabilistic polynomial-time independent simulator S in the ideal world, they evaluate two parties' set intersection through simulating π. S outputs IdealπS as its view. While in a real world, the valid participants and A execute π without TTP. In the end of the execution, A outputs RealπA as its view. The protocol π is secure only if the views for any adversary in the real and ideal world are computationally indistinguishable.
Assuming there is no collusion occurred in our executions, for the security parameter λ, all feasible inputs ˉx, randomness ˉz, and each i∈{1,2,3}
where ″c≈" indicates computational indistinguishability[36].
4.
Preliminaries
To achieve a better efficiency, we adopt an effective data structure in this paper, namely Bloom Filter. In addition, we use a lightweight symmetric deterministic encryption to encrypt the original datasets, and to preserve the position privacy, a pseudo-random permutation is also introduced in our protocols. Moreover, for better readability, we first describes some main mathematical notations referred to in this paper in Table 1.
4.1. Bloom Filter
Bloom Filter, a simple space-efficient randomized data structure, supports fast membership testing in a set. A Bloom Filter of set S is an array with certain length m, denoted as BFS, which is generated as in Figure 1.
Note that, it sometimes occurs false positive in Bloom Filter, meaning that it is possible that s is not in S, while all BFS[hi(s)]=1 (0≤i<k). The probability of false positive p can be computed as:
According to [6], given a particular false positive probability p, Bloom Filter length is at least m≥−nlog2e⋅log2p. It is obviously that the probability of false positive and Bloom Filter length are negative correlation. Therefore, to maintain a better accuracy and efficiency, we set m=−nlog2e⋅log2p in this paper.
4.2. Deterministic encryption
In a symmetric deterministic encryption (DE) cryptosystem [37], a same original plaintext would be encrypted to a same ciphertext [37], and thus it can be used to do equality checking or membership testing over encrypted data [10]. We generally state DE={KeyGen,Enc,Dec} as follows,
● sk←KeyGen(1λ): for security parameter λ, return secret key sk∈K.
● c←Enc(sk,d): for secret key sk and plaintext d∈M, return ciphertext c.
● d←Dec(sk,c): for secret key sk and ciphertext c∈C, return plaintext d.
Here, KeyGen is a probabilistic algorithm while Enc and Dec are deterministic for equality checking in our protocols.
4.3. Pseudo-random permutation
Suppose that F is an efficient pseudo-random permutation, then for all probably distinguishers D and a random β-bit permutation functions fβ, there exists a negligible ε(⋅) such that:
where αR←{0,1}β.
Especially, to preserve the element's positions in the Bloom Filter, we introduce π:[|S|]→[|S|] as a pseudo-random permutation in this paper. That is to say, π(S) permutes the real positions in Bloom Filter, which is denoted as
5.
Our protocols
Here, two protocols are proposed to compute the private set intersection with the assistance of an aided server. More specifically, each client firstly generates a Bloom Filter BFj (j∈{A,B}) of the corresponding Sj, and then encrypts BFj with DE and sends it to the server. Finally, the aided server performs equality test on the encrypted Bloom Filters. Since there exist only two different values (i.e, 0 or 1) in the Bloom Filters, to keep the computation correct, we encrypt BF[i] with different cases,
where ri (i∈{0,1,⋯,m−1}) is randomly chosen from message domain M, and ″||" denotes concatenation. Then we can obtain the ciphertext of Bloom Filter as C={C[1],C[2],⋯,C[m−1]} based on Eq (5.1).
5.1. Protocol I under semi-honest model
In Protocol I, the server is honest-but-curious about inferring the private information of other parties with a record of the intermediate computations. More details are described in Figure 2.
Then we analyze the correctness and security of our Protocol I. The correctness is obvious, for ∀i, i∈{0,1,⋯,k−1}, CA[hi(a)]=CB[hi(b)], a∈SA, b∈SB, since the encryption algorithm DE is deterministic, we have BFA[hi(a)]=BFB[hi(b)], and then hi(a)=hi(b), that is a=b.
In case of security, our Protocol I is secure under a semi-honest server or a malicious client without distinguishing the views of ideal and real world. Then we formalize it as follows:
Theorem 1. Since the pseudo-randomness for permutation function π and DE, our Protocol I is secure if: 1) with a semi-honest server and two honest clients, no extra information is revealed to the server; 2) with a honest server and one of malicious clients, no extra private information is leaked to the other client.
Proof. Recall that, in order to prove security against an adversary A corrupting the semi-honest server and attempting to get more private information of the input sets, we need to construct a simulator S in the ideal-world to simulate the server's behavior in the real-world executions with the honest Alice and Bob. Our Protocol I is secure if the joint distribution of the view generated by A in the real-world is distinguishable from the view of S in the ideal-world executions.
It is clear to see that the only information that the simulator S obtains in the ideal-world executions is the encrypted Bloom filters CA=DE.Enc(sk,BFA) and CB=DE.Enc(sk,BFB). Then, S computes CA∩CB and finally returns the computed result to the adversary A. With the pseudo-randomness of DE encryption, A cannot discriminate these ciphertexts from the real and ideal world. The only difference between the view of S and the adversary A is the length of the returned intersection result. However, this length is randomly distributed in [0,m] (m is bloom filter length.), which guarantees that A cannot distinguish it from the real world and the ideal world. Additionally, due to the private permutation function π, A is unable to guess the real bit positions of the intersection in the Bloom filter. Therefore, there is no way to obtain any elements in the intersection via brute-force attack based on the intersection Bloom filter CA∩CB.
Note that, from the aspect of the clients, the only information obtained by both two is the last intersection which is exactly what they should know. Therefore, each client obtains no extra private information of the other one in the above protocol.
5.2. Protocol II under malicious model
Our Protocol I shown in above is proven to be secure under a semi-honest model, where the server always runs the protocol and outputs the result honestly. Unfortunately, such assumption can not satisfy the actual requirements since the server sometimes could perform cheating in practice to make more profits or save computational expands with economically-driven nature, such as returning a random results with no computation or an incorrect intersection after computation to both clients without being detected.
Moreover, the server somewhat may conjecture the intersection size via the intersection Bloom Filter in our Protocol I, which is private information in some situations. To solve this problem, we present our second protocol with adding a checking mechanism over our Protocol I. Concretely, given the original sets SA and SB, Alice negotiates S0 to S2 with Bob privately, and they get DA=SA∪S0∪S1 and DB=SB∪S0∪S2 respectively. Then, Alice, Bob and the server compute the dummied intersection of DA and DB by running Protocol I, and finally Alice and Bob gain the original SA∩SB via deducting S0 from dummied intersection returned by the server. Here, with a short shared random seed, we can leverage a pseudorandom generator to effectively generate three disjoint dummy sets mentioned above, which can not only reduce time overhead compared to directly agreeing on these dummy sets and also maintain the randomness of them. Our Protocol II is illustrated as shown in Figure 3.
The correctness of Protocol II can be easily verified as the Protocol I. Next, we define its security as follows:
Theorem 2. Our Protocol II is secure under the following cases: 1) with a malicious server deceiving the honest users during the protocol, which can be detected with a very high probability; 2) with a malicious client, no extra private information can be inferred to this malicious client.
Proof. Firstly, similar to the proof of Theorem 1, suppose that there is a malicious adversary A corrupting the server to execute the protocol and try to obtain as much as private information of both two clients' inputs. The view of A in the execution includes the ciphertexts of the bloom filters for two dummied sets DA and DB, and the approximate size of DA∩DB. As the pseudo-randomness of DE, A cannot get any extra privacy belonging to the input sets. With the computed intersection result of two dummied encrypted Bloom filters, the adversary would not infer any private information of the real intersection SA∩SB. Specifically, A can approximate the size of DA∩DB from its view in the executions. However, the server is unable to obtain the size of SA∩SB unless it can correctly guess the size of dummy set S0. Unfortunately, the size of S0 is randomly chosen from a large domain, and thus the adversary has only a negligible probability to correctly guess it in polynomial time. Therefore, our Protocol II can successfully preserve the intersection size from the malicious server.
Another advantage of Protocol II is that it can prevent a cheating server from removing or adding element into the computed intersection or a lazy server from returning a random result without computation. Specifically, by adding the dummy sets, the intersection result DI=DA∩DB is impossible to be empty due to the existence of S0 and also impossible to be an entire set of one of the sets DA or DB since neither S1 nor S2 should be contained in the intersection, so the server is not able to arbitrarily return an empty intersection or a whole set from DA and DB without detection. Furthermore, it also can guarantee that the server cannot remove or add some elements into the intersection, because it needs to make sure that the intersection set includes all the elements in S0 while all the elements in S1 or S2 must be not in the intersection. Therefore, we can easily know that the probability of a lazy server randomly returning a result without detection is (1n+2t)t⋅(1−1n+2t)2t(n is the size of original sets and t is size of dummy sets), which would almost trend to zero with a large n and a suitable t. What's more, for a cheating server, according to [10], the security can be significantly increased against a malicious server by choosing suitable parameters.
For the case of one client being corrupted with a malicious adversary A (Without loss of generality, suppose Alice is corrupted with A, which is able to provide arbitrary inputs.), A is unable to learn any extra private information of the other client Bob. Specifically, the most powerful malicious behavior of A with arbitrary input is that it sets all bits of the Bloom filter to be 1, thus the result returned from the server would contain all the elements provided by Bob, including the dummy set S2. However, even if Alice recovers all the elements based on the intersection Bloom filter, he/she is still unable to decide which elements should be contained in Bob's private set since S2 is also included in the recovery set, unless via two times executions of the protocol with the same S2. While Bob can detect such malicious behavior with checking whether S2 is contained in the intersection once receiving the intersection Bloom filter, and stops carrying out the private set intersection with Alice if so. Therefore, A cannot gain extra private information of Bob with the above malicious case.
Note that, the intersection size is not leaked to the server since the size of dummy set S0 is secret only known to both two clients. That is to say, our Protocol II achieves the intersection size-hiding from the server.
5.3. Enhanced Efficiency with Multi-Round Executions
Since there exits a false positive probability in the Bloom Filter used in our protocols, two clients may recover a different intersection by running Query(BFSA∩SB) at the end of two protocols. Based on Eq (4.1), we observe that the accuracy of the intersection is higher with the increasing of Bloom Filter length m, while it would lead to lower efficiency with more computation and storage overhead. To leverage a better efficiency and accuracy, we present a multi-round mechanism to achieve a higher efficiency with a desired accuracy, and our two basic protocols in this paper can be regarded as one-round protocols. Intuitively, we assume that pi is the false positive probability and mi is the Bloom Filter length of i-th round, then
where τ denotes total rounds in the multi-round algorithm, and mi=−nlog2e⋅log2pi (i∈{1,⋯,τ}). m is the total length of all the Bloom Filters used in the multi-round algorithm and p is the final achieved false positive probability. As shown in Algorithm 1, we introduce that Optimal(pi) is the most optimal false positive of i-th round, thoroughly guaranteeing a shortest Bloom Filter length m as illustrated in Eq (5.2).
The security of Algorithm 1 follows the security of two basic protocols in this paper since the main step of Algorithm 1 is step 5. With the following experimental implementations, the results show that the multi-round executions significantly enhance the performance of our two basic protocols.
6.
Performance evaluation
6.1. Efficiency of Protocol I and II
To evaluate the performance of our protocols, we set the experimental environments as: two clients run on Linux operating system, 12GB RAM and 8 vCPUs; and the server runs on Linux operating system, 8 vCPUs and 14GB RAM. We use Java language to implement our protocols, and leverage Crypto++ library and AES-ECB mode with 128-bit security to realize our cryptographic operations with Deterministic Encryption. Furthermore, to get more efficiency, we adopt MD5* to construct Bloom Filters.
* MD5 guarantees a sound accuracy in most situations, and it is enough in our works and can be replaced to SHA-1 if needed.
In the following part, we evaluate the performance of our protocols in the pipeline mode and parallel mode as described in [6]. Specifically, all computation on each side is executed in a single thread under the pipeline mode, while in the parallel mode, multiple threads are used to perform the protocols simultaneously on multiple CPUs. The experimental results given below show that the performance can be critically enhanced under the parallel mode with multi-core machines since all dominating computations in both two protocols, including constructing and encrypting the Bloom Filters, and performing equality checking over the encrypted Bloom filter, can be executed simultaneously in parallel.
Table 2 presents the average execution time of our two basic protocols for the set size varied from 10 thousand to 100 million. To guarantee the correctness and efficiency of our Protocol II, we set the size of the dummy sets to be n/2 and a feasible false positive probability to be 1/n, where n is the size of original set. Experimental results show that our setting of false positive probability 1/n may ensure a good accuracy, and the efficiency of both two protocols can be boosted via performing on the parallel mode. For instance, when considering one million set size, our two protocols cost 79 seconds and 123 seconds in the pipeline mode, while in parallel mode, both two can be respectively reduced to 15 seconds and 24.3 seconds in average.
To verify the relation of false positive probability p and performance for our protocols, we set dataset size n=1.0×107 and p varying from 10−7 to 10−1. As shown in Figure 4, it is clear to see that the execution time linearly decreases with the increase of the false positive probability which indicates a lower accuracy. This is consistent with our theoretical analysis in Subsection 4.1.
6.2. Enhanced efficiency of multi-round algorithm
For simplify, we just verify a two-round mechanism over our two basic protocols since all multi-round (more than two rounds) protocols can be easily realized with a two-round block. Suppose that m1 and m2 are the Bloom filters length in this two-round algorithm respectively, and p1 and p2 are the corresponding false positive probability. Based on Eq (5.2), we can get
where mi=−nlog2e⋅log2pi,(i∈{1,2}). Then, with a desirable p in Eq (6.1), the optimal p1 can be easily computed and can also get the minimum m:
where β=|SA∩SB|n, and p<p1<1.
Table 3 lists the optimal false positive probability p1 in the first round computed from Eq (6.2) with β ranging from 0.1 to 0.9 with step 0.2, where n=1.0×107 and p=1.0×10−7. Through our implementation, the experimental results are almost consistent with the theoretical results in Table 3. Next, we verify the efficiency of our two-round algorithm with these optimal parameters under the pipeline mode.
We can observe that the two-round protocols execute almost twice more efficient than one-round setting in average in Figure 5. Specifically, given a 10 million set size, our Protocol I under one-round protocol requires at least 13 minutes, while it only needs less than 7 minutes in the two-round setting. Note that, it is clear to see that the execution time in our two-round protocol linearly increases with β, which is the ratio of intersection size to the original set size; the output intersection of the first round in our two-round protocol is the input of the second round execution, and thus the efficiency of the second round is linear to β. However, two-round setting still runs more efficient than one-round setting even though in the worst case β=0.9 (Actually, β would not be very high in most applications.), as shown in Figure 5. Furthermore, we can also deeply optimize the performance for two-round setting with a parallel mode, and make our protocols more scalable to large datasets.
Additionally, we compare the efficiency of our enhanced protocols with the server-aided PSI protocols in [10]. According to [10], we can observe that it costs around 7s for the basic protocol SHPSI and 82 s for the enhanced protocol SizePSI over 10 million set size under a parallel mode with 100-threads. For the efficient of our protocols, as stated above in Figure 5, our Protocol I and Protocol II run around 400 and 600 s in average respectively under pipeline mode with one thread. It is obviously that our Protocol II has significant efficiency improvements over SizePSI in [10] after the same acceleration of parallel mode with 100-threads.
7.
Conclusions
In this paper, we present two enhanced secure and efficient PSI protocols with assistance of an aided server (a fog/cloud server) by leveraging Bloom Filter technique and lightweight symmetric deterministic encryption. Especially, our Protocol II is able to hide the size of the set/intersection from leaking to the server via adding dummy sets, and also can improve security against a malicious server with modifying the computed intersection or randomly returning a false result. Our novel multi-round mechanism can significantly enhance efficiency of the basic protocols with maintaining a desirable accuracy, and it is quite feasible to the practical application.
Acknowledgments
This work was supported by Program for Scientific Research Foundation for Talented Scholars of Jinling Institute of Technology (JIT-B-201639, JIT-B-201726), Natural Science Research Projects of Universities (19KJB520033), Philosophy and Social Science Foundation of the Jiangsu Higher Education Institutions of China (2021SJA0448), Natural Science Foundation of Jiangsu Province (BK20210928), Higher Education Research Project of Nanjing Institute of Technology (2021ZC13), and National Natural Science Foundation of China (61902163).
Conflict of interest
The authors declare there is no conflict of interest.