User:Yuval Madar/Sandbox/Test Tube 3

Differential privacy is a requirement placed on a statistical database system: it must allow reliable retrieval of statistical data without risking the exposure of specific records in the database.

Scenario

A trusted party holds a database containing sensitive data (for example, medical records, voting records, or financial data) and wishes to publish statistics about the records as a whole (for example, the percentage of cancer patients in the country) without revealing information about specific individuals in it (for example, the illness of one particular person in the database).

ε-differential privacy

The actions of the trusted server are modeled as a (randomized) algorithm $\mathcal{A}$.

The algorithm is said to be $\epsilon$-differentially private if for every two databases $D_{1},D_{2}$ that differ in a single element, and for every set $S$ of outputs of the algorithm:

$\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon)\times \Pr[\mathcal{A}(D_{2})\in S]$

where the probability is taken over the algorithm's own randomness (its "coin tosses").

This means that on any two databases differing in a single element, an $\epsilon$-differentially private algorithm produces nearly the same distribution of outputs. (Consequently, the algorithm cannot be used to determine the value of an individual element in the database.)

Example

Let $D_{1}$ be a database of medical records in which each record is a pair $(Name,Issick)$, where the first element is the person's name and the second indicates whether that person has a certain disease.

For example:

Name	Sick?
Ross	Yes
Monica	Yes
Joey	No
Phoebe	No
Chandler	Yes

Suppose a malicious adversary wants to find out whether Chandler is sick. Suppose further that the adversary's access to the database is limited to a query $Q(k)$ that returns the number of sick people among the first $k$ rows of the database. If the adversary learns, one way or another, Chandler's position in the database (fifth), he can find out whether Chandler is sick by computing the difference $Q(5)-Q(4)$.

Hence the proposed access scheme does not preserve privacy, even though by itself it does not allow querying the details of any single record.
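The differencing attack described above can be sketched in a few lines of Python. This is only an illustration: the list `records` and the 0/1 encoding of the "sick" column are assumptions made for the example, and `Q` mirrors the partial-count query from the text.

```python
# Toy database from the table above; 1 means "sick", 0 means "not sick".
# The variable name and the 0/1 encoding are assumptions for this sketch.
records = [
    ("Ross", 1),
    ("Monica", 1),
    ("Joey", 0),
    ("Phoebe", 0),
    ("Chandler", 1),
]


def Q(k):
    """Count the sick people among the first k rows of the database."""
    return sum(is_sick for _, is_sick in records[:k])


# The adversary knows Chandler is row 5 and may only ask partial counts.
chandler_is_sick = Q(5) - Q(4)
print(chandler_is_sick)  # 1, so Chandler is sick
```

Neither query mentions Chandler, yet their difference isolates his row exactly.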

Let us take this example a little further. We construct $D_{2}$ by replacing (Chandler, 1) with (Chandler, 0), that is, by changing Chandler's entry from sick to not sick. Call the release mechanism (which releases the output of $Q(i)$) $\mathcal{A}$. We say $\mathcal{A}$ is $\epsilon$-differentially private if it satisfies the definition above. If the output of $\mathcal{A}$ is a discrete random variable (i.e. has a probability mass function), $S$ can be thought of as a singleton set such as $\{3.5\}$ or $\{4\}$; if it is a continuous random variable (i.e. has a probability density function), $S$ can be thought of as a small range of reals such as $3.5\leq \mathcal{A}(D_{1})\leq 3.7$.

In essence, if such an $\mathcal{A}$ exists, then a particular individual's presence or absence in the database does not alter the distribution of the query's output by a significant amount, which assures the privacy of individual information in an information-theoretic sense.

Motivation

Several approaches to anonymizing public records have been proposed in the past, and most of them failed when attackers obtained access to additional databases and broke the anonymity of the released data by cross-referencing the two.

Two well-known linkage attacks across several databases are the Netflix Prize competition and the medical database of the Massachusetts Group Insurance Commission (GIC).

Netflix Prize

Netflix offered a prize of one million dollars to anyone who could improve its content recommendation system by 10%. To that end, the company released a partial dataset for competitors to develop and test against. The release was accompanied by the statement: "To protect customer privacy, all personal information identifying individual customers has been removed and all customer ids have been replaced by randomly assigned ids."

Netflix is not the only movie-rating site on the web; many similar sites exist, among them IMDb. On IMDb, users do not necessarily rate movies anonymously. Two researchers, Arvind Narayanan and Vitaly Shmatikov of the University of Texas, linked the databases of the two sites (based on the times of the ratings) and managed to reveal the identities of some of the users in the dataset released by Netflix,^[1] thereby showing that the public dataset was not as anonymous as the company believed when it released it.

The GIC medical database

The Group Insurance Commission (GIC), the body responsible for purchasing group insurance for state employees in Massachusetts, United States, published the medical records of state employees on the Internet after removing their names; the records still contained their dates of birth, sex, and ZIP codes. Latanya Sweeney, a researcher at Carnegie Mellon University, linked the records to the voter registration list in order to extract the medical record of the Governor of Massachusetts.^[2]

Sensitivity

Returning to the main discussion of differential privacy, the sensitivity^[3] $\Delta f$ of a function $f:\mathcal{D}\rightarrow \mathbb{R}^{d}$ is

$\Delta f=\max_{D_{1},D_{2}}\lVert f(D_{1})-f(D_{2})\rVert_{1}$

where the maximum is taken over all pairs $D_{1},D_{2}\in \mathcal{D}$ differing in at most one element.

For more intuition, let us return to the example of the medical database and a query $Q(i)$ (which can also be seen as the function $f$) that counts how many people in the first $i$ rows of the database have the disease. Clearly, if we change one of the entries in the database, the output of $Q(i)$ changes by at most one, so the sensitivity of this query is one. It turns out that there are techniques (described below) for building a differentially private algorithm for functions with low sensitivity.
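The claim that the counting query has sensitivity one can be checked by brute force on small databases. The sketch below is illustrative only: it assumes databases are tuples of 0/1 entries and that neighbouring databases differ by flipping exactly one entry; the helper names are invented for the example.

```python
from itertools import product


def Q(db, i):
    """Number of 1-entries among the first i rows of a 0/1 database."""
    return sum(db[:i])


def neighbours(db):
    """All databases obtained from db by changing exactly one entry."""
    for j in range(len(db)):
        yield db[:j] + (1 - db[j],) + db[j + 1:]


def sensitivity(n, i):
    """Brute-force max |Q(D1, i) - Q(D2, i)| over all neighbouring pairs."""
    return max(
        abs(Q(d1, i) - Q(d2, i))
        for d1 in product((0, 1), repeat=n)
        for d2 in neighbours(d1)
    )


print(sensitivity(n=5, i=5))  # 1
```

The enumeration is exponential in `n`, so it only serves as a sanity check for tiny databases.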

Trade-off between utility and privacy

There is a trade-off between the accuracy of statistics estimated in a privacy-preserving manner and the privacy parameter ε. This trade-off is studied in ^[4] and ^[5].

Laplace noise

Many differentially private algorithms rely on adding controlled noise^[3] to functions with low sensitivity. We illustrate this with noise drawn from a Laplace distribution, whose probability density function is $\text{noise}(y)\propto \exp(-|y|/\lambda)$, with mean zero and scale parameter $\lambda$ (standard deviation $\sqrt{2}\lambda$). We define the output of $\mathcal{A}$ as a real-valued function (called the transcript output by $\mathcal{A}$), $\mathcal{T}_{\mathcal{A}}(x)=f(x)+Y$, where $Y\sim \text{Lap}(\lambda)$ and $f$ is the original real-valued query we plan to execute on the database. $\mathcal{T}_{\mathcal{A}}(x)$ is then a continuous random variable, with

$\frac{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_{1}}(x)=t)}{\mathrm{pdf}(\mathcal{T}_{\mathcal{A},D_{2}}(x)=t)}=\frac{\text{noise}(t-f(D_{1}))}{\text{noise}(t-f(D_{2}))}$

which is at most $e^{\frac{|f(D_{1})-f(D_{2})|}{\lambda}}\leq e^{\frac{\Delta f}{\lambda}}$. We can take $\frac{\Delta f}{\lambda}$ to be the privacy factor $\epsilon$, so $\mathcal{T}$ is a differentially private mechanism (as can be seen from the definition). Applied to the medical database example, where the counting query has sensitivity one, this means that to make $\mathcal{A}$ $\epsilon$-differentially private we need $\lambda=1/\epsilon$. Although we have used Laplace noise here, other forms of noise, such as Gaussian noise, also yield differentially private mechanisms (with a slight relaxation of the definition of differential privacy ^[2]).
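A minimal sketch of this mechanism in Python follows. It is not a production implementation: the function names are invented for the example, the noise is sampled as the difference of two exponential variables (one standard way to obtain a Laplace variable), and the final check evaluates the density ratio on a grid of outputs only, for two assumed neighbouring counts 3 and 2.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Y ~ Lap(scale) as the difference of two exponential variables."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release true_answer plus Laplace noise of scale sensitivity / epsilon."""
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


def laplace_pdf(y, scale):
    """Density of Lap(scale) at y."""
    return math.exp(-abs(y) / scale) / (2 * scale)


epsilon = 0.5
scale = 1 / epsilon      # counting query: sensitivity 1
f_d1, f_d2 = 3, 2        # assumed query answers on neighbouring databases

# Ratio of output densities on a grid of possible outputs t.
worst_ratio = max(
    laplace_pdf(t - f_d1, scale) / laplace_pdf(t - f_d2, scale)
    for t in (x / 10 for x in range(-100, 101))
)
print(worst_ratio <= math.exp(epsilon) * (1 + 1e-9))  # True

rng = random.Random(0)
print(laplace_mechanism(f_d1, 1, epsilon, rng))  # a noisy count around 3
```

A smaller ε gives a larger noise scale, so the released count is more private but less accurate.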

Composability

Sequential composition ^[6]

If we query an ε-differentially private mechanism $t$ times, the combined result is $\epsilon t$-differentially private. More generally, if there are $n$ mechanisms $\mathcal{M}_{1},\dots,\mathcal{M}_{n}$ that are $\epsilon_{1},\dots,\epsilon_{n}$-differentially private, respectively, then any function $g(\mathcal{M}_{1},\dots,\mathcal{M}_{n})$ of them is $(\sum_{i=1}^{n}\epsilon_{i})$-differentially private.
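To see sequential composition at work, the following sketch evaluates the joint density of $t$ independent Laplace-noised releases of the same counting query on two neighbouring databases. The concrete numbers (true counts 3 and 2, the observed outputs) are assumptions chosen for illustration.

```python
import math


def laplace_pdf(y, scale):
    """Density of Lap(scale) at y."""
    return math.exp(-abs(y) / scale) / (2 * scale)


def joint_density(outputs, true_answer, scale):
    """Density of independent Laplace-noised releases of the same answer."""
    return math.prod(laplace_pdf(o - true_answer, scale) for o in outputs)


epsilon, t = 0.5, 3
scale = 1 / epsilon              # counting query: sensitivity 1
outputs = (4.0, 3.5, 5.2)        # one possible transcript of t releases

# Neighbouring databases with assumed true counts 3 and 2.
ratio = joint_density(outputs, 3, scale) / joint_density(outputs, 2, scale)
print(ratio <= math.exp(t * epsilon) * (1 + 1e-9))  # True
```

Each release contributes a factor of at most $\exp(\epsilon)$ to the ratio, so $t$ releases together are bounded by $\exp(\epsilon t)$.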

Parallel composition ^[6]

If, however, the mechanisms above are computed on disjoint subsets of the private database, then the function $g$ is $(\max_{i}\epsilon_{i})$-differentially private instead.
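The sketch below illustrates the disjoint-subset case with a noisy histogram. It assumes that neighbouring databases differ by adding or removing one row, so that only one bucket's count changes; the bucket labels, per-bucket ε values and observed outputs are invented for the example.

```python
import math


def laplace_pdf(y, scale):
    """Density of Lap(scale) at y."""
    return math.exp(-abs(y) / scale) / (2 * scale)


def partition_counts(rows, keys):
    """Count rows per key; every row falls into exactly one bucket."""
    return [sum(1 for r in rows if r == k) for k in keys]


def joint_density(outputs, counts, scales):
    """Density of one Laplace-noised count per bucket."""
    return math.prod(
        laplace_pdf(o - c, s) for o, c, s in zip(outputs, counts, scales)
    )


keys = ["A", "B", "C"]
epsilons = [0.5, 0.2, 0.1]            # one mechanism per disjoint bucket
scales = [1 / e for e in epsilons]    # each count has sensitivity 1

d1 = ["A", "A", "B", "C", "C"]
d2 = ["A", "A", "B", "C"]             # neighbour: one "C" row removed
outputs = [2.3, 0.7, 4.1]             # one possible noisy histogram

ratio = joint_density(outputs, partition_counts(d1, keys), scales) / \
    joint_density(outputs, partition_counts(d2, keys), scales)
print(ratio <= math.exp(max(epsilons)) * (1 + 1e-9))  # True
```

Only the factor of the bucket that contains the changed row differs between the two databases, which is why the cost is the maximum rather than the sum of the ε values.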

Group privacy

In general, ε-differential privacy is designed to protect privacy between neighboring databases, which differ in only one row. This means that no adversary with arbitrary auxiliary information can know whether one particular participant submitted his information. The guarantee extends to databases differing in $c$ rows, which amounts to saying that no adversary with arbitrary auxiliary information can know whether $c$ particular participants submitted their information. This holds because if $c$ items change, the probability dilation is bounded by $\exp(\epsilon c)$ instead of $\exp(\epsilon)$,^[2] i.e. for $D_{1}$ and $D_{2}$ differing on $c$ items:

$\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon c)\times \Pr[\mathcal{A}(D_{2})\in S]$

Thus, running the mechanism with parameter $\epsilon/c$ instead of ε achieves the desired result (protection of $c$ items). In other words, instead of each item being protected with ε-differential privacy, every group of $c$ items is now protected with ε-differential privacy (and each item with $(\epsilon/c)$-differential privacy).

Proof idea

For three datasets $D_{1}$, $D_{2}$ and $D_{3}$ such that $D_{1}$ and $D_{2}$ differ on one item, and $D_{2}$ and $D_{3}$ differ on one item (so that $D_{1}$ and $D_{3}$ differ on at most 2 items), the following holds for an ε-differentially private mechanism $\mathcal{A}$:

$\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon)\times \Pr[\mathcal{A}(D_{2})\in S]$, and $\Pr[\mathcal{A}(D_{2})\in S]\leq \exp(\epsilon)\times \Pr[\mathcal{A}(D_{3})\in S]$

hence:

$\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon)\times (\exp(\epsilon)\times \Pr[\mathcal{A}(D_{3})\in S])=\exp(2\epsilon)\times \Pr[\mathcal{A}(D_{3})\in S]$

Repeating the argument extends the bound from 2 to any $c$.
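
Written out for general $c$, and assuming a chain of intermediate datasets $D_{1},D_{2},\dots,D_{c+1}$ (notation introduced here for illustration) in which each consecutive pair differs on one item, the same step applied $c$ times gives:

$\Pr[\mathcal{A}(D_{1})\in S]\leq \exp(\epsilon)\times \Pr[\mathcal{A}(D_{2})\in S]\leq \exp(2\epsilon)\times \Pr[\mathcal{A}(D_{3})\in S]\leq \dots \leq \exp(c\epsilon)\times \Pr[\mathcal{A}(D_{c+1})\in S]$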

Stable transformations

A transformation $T$ is $c$-stable if the Hamming distance between $T(A)$ and $T(B)$ is at most $c$ times the Hamming distance between $A$ and $B$ for any two databases $A,B$. Theorem 2 in ^[6] asserts that if a mechanism $M$ is $\epsilon$-differentially private, then the composite mechanism $M\circ T$ is $(\epsilon \times c)$-differentially private.

This can be generalized to group privacy, since the group size can be thought of as the Hamming distance $h$ between $A$ and $B$ (where $A$ contains the group and $B$ does not). In this case $M\circ T$ is $(\epsilon \times c\times h)$-differentially private.
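As a small illustration of $c$-stability, the sketch below uses a hypothetical transformation that repeats every row $c$ times. It assumes that the distance between two databases is measured as the size of their multiset symmetric difference; both the transformation and that choice of distance are assumptions made for the example.

```python
from collections import Counter


def distance(a, b):
    """Size of the multiset symmetric difference between databases a and b."""
    ca, cb = Counter(a), Counter(b)
    return sum(((ca - cb) + (cb - ca)).values())


def duplicate(db, c):
    """Hypothetical c-stable transformation: repeat every row c times."""
    return [row for row in db for _ in range(c)]


a = ["ross", "monica", "joey"]
b = ["ross", "monica"]      # A contains a group of size h = 1 that B lacks
c, epsilon = 3, 0.1

h = distance(a, b)
stable = distance(duplicate(a, c), duplicate(b, c)) <= c * h
print(stable)               # True: the distance grew by a factor of at most c

# Privacy parameter of M∘T for this pair, per the bound stated above.
composite_epsilon = epsilon * c * h
```

Here one differing row becomes three differing rows after the transformation, so an ε-differentially private mechanism applied afterwards is charged $\epsilon \times c\times h$.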

Notes

^ Arvind Narayanan, Vitaly Shmatikov. Robust De-anonymization of Large Sparse Datasets. In IEEE Symposium on Security and Privacy 2008, p. 111–125.
^ ¹ ² ³ Dwork, ICALP 2006.
^ ¹ ² Dwork, McSherry, Nissim and Smith, 2006.
^ A. Ghosh, T. Roughgarden, and M. Sundararajan. Universally utility-maximizing privacy mechanisms. In Proceedings of the 41st annual ACM Symposium on Theory of Computing, pages 351–360. ACM New York, NY, USA, 2009.
^ H. Brenner and K. Nissim. Impossibility of Differentially Private Universally Optimal Mechanisms. In Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2010.
^ ¹ ² ³ McSherry, SIGMOD 2009 (Theorem 3 and 4).

References

Calibrating Noise to Sensitivity in Private Data Analysis by Cynthia Dwork, Frank McSherry, Kobbi Nissim, Adam Smith In Theory of Cryptography Conference (TCC), Springer, 2006.
Differential Privacy by Cynthia Dwork, International Colloquium on Automata, Languages and Programming (ICALP) 2006, p. 1–12.
Frank D. McSherry. 2009. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 35th SIGMOD international conference on Management of data (SIGMOD '09), Carsten Binnig and Benoit Dageville (Eds.). ACM, New York, NY, USA, 19-30. DOI= 10.1145/1559845.1559850

External links

Differential Privacy: A Survey of Results by Cynthia Dwork, Microsoft Research, April 2008
Privacy of Dynamic Data: Continual Observation and Pan Privacy by Moni Naor, Institute for Advanced Study, November 2009
A Practical Beginner's Guide To Differential Privacy by Christine Task, Purdue University, April 2012

Category:Cryptography Category:Privacy
