Third Person Data

by Dr. Michael Sonntag

What is “Third Person Data”?

Data about a person may exist at various locations. If we think about our own personal data, e.g. our communication preferences (whom we send e-mails to or receive them from), interests (what we “like” on Facebook) or habits (which websites we visit regularly), then many persons know about these things. Obviously we know quite a lot about ourselves (but take care: you might be able to name your favorite e-mail contacts, but could you correctly identify all topical areas you “liked” in the past?). Comparatively easy to identify are the persons or corporations we gave the data to directly: our e-mail provider, the social networking platform, and the websites we visit regularly. We might not always like that they know, but this is difficult to avoid, and data protection laws provide limits and remedies against data misuse that are not merely theoretical but at least partly effective.

Much more difficult to “take care” of are third persons processing our data: not only would you have to know about their existence and what they actually collect and store, but they can also be located anywhere and you do not have any contract with them – or they would not be third persons. Examples of such third persons and third person data are:

Third person data can therefore be defined as data about a person which is collected, stored, or used by someone (a third person) without the person knowing that this is happening, so that no direct control, verification, or supervision is possible.

Implications for Computer Forensics

For computer forensics, which can be roughly defined as the investigation of digital data in the context of legal proceedings, third person data can be invaluable, but also very problematic. It is invaluable because it was collected by someone who is not involved in the case and is therefore trustworthy. The suspect might have deleted his browser history, but the advertisement network still knows (at least partially) where he has been and when. It can be problematic because in most cases the investigator does not know who might possess such data and, if someone does, how to obtain access to it. Additionally, there is no guarantee that the data exists, or that it is complete, of good quality, reliable, etc.

How to know data exists at all – and where

The first and simplest option is just to know who collects which data. While this is a good approach for experts and in narrow areas, it obviously cannot be a general solution. Still, it should not be dismissed: computer forensics is something only experts should perform, and these may well know potential holders of additional data. Such knowledge can be obtained or expanded through investigations. If e-mails are of interest, for example, a layperson will see the sender (“Sent” mailbox) and the recipient (“Inbox”) as the ones possessing data about the time the mail was sent or received. But experts know that additional header lines exist (the “Received” lines, normally not shown!), which form a trace of the servers the mail traversed on its way from source to destination. Every server appearing there might have some third person data referencing this mail (or should, unless it was never saved or has already been deleted) and can therefore confirm or refute certain aspects of it. This means that investigations may uncover potential holders of additional data usable as evidence.
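The header trace mentioned above can be read out with a few lines of Python. The following is a minimal sketch, not a forensic tool: the sample message, its host names and its timestamps are invented for illustration, and the simple "from ... by ...; date" pattern is an assumption that real-world headers do not always follow. Note also that lines added before the mail reached a trusted server can be forged, so each hop is only a lead to a potential data holder, not proof.

```python
import re
from email import message_from_string
from email.utils import parsedate_to_datetime

# Hypothetical raw message; hosts and timestamps are invented for illustration.
RAW_MESSAGE = """\
Received: from mx.recipient.example (mx.recipient.example [203.0.113.7])
\tby mailbox.recipient.example with ESMTP id 3; Tue, 01 Mar 2016 10:00:05 +0100
Received: from relay.transit.example (relay.transit.example [198.51.100.4])
\tby mx.recipient.example with ESMTP id 2; Tue, 01 Mar 2016 10:00:03 +0100
Received: from client.sender.example (client.sender.example [192.0.2.10])
\tby relay.transit.example with ESMTP id 1; Tue, 01 Mar 2016 10:00:01 +0100
From: alice@sender.example
To: bob@recipient.example
Subject: Test

Hello
"""


def received_hops(raw_message):
    """Return the hops recorded in the Received headers, oldest first."""
    msg = message_from_string(raw_message)
    hops = []
    for value in msg.get_all("Received", []):
        # Unfold the header: continuation lines start with whitespace.
        flat = " ".join(value.split())
        # By convention the timestamp follows the last semicolon.
        if ";" in flat:
            path, stamp = flat.rsplit(";", 1)
        else:
            path, stamp = flat, ""
        sender = re.search(r"\bfrom\s+(\S+)", path)
        receiver = re.search(r"\bby\s+(\S+)", path)
        try:
            when = parsedate_to_datetime(stamp.strip())
        except (TypeError, ValueError):
            when = None  # missing or unparsable date
        hops.append({
            "from": sender.group(1) if sender else None,
            "by": receiver.group(1) if receiver else None,
            "time": when,
        })
    # Each server prepends its own line, so the topmost entry is the newest.
    hops.reverse()
    return hops


for hop in received_hops(RAW_MESSAGE):
    print(hop["time"], hop["from"], "->", hop["by"])
```

Every host in the "by" position is a server that handled the mail and might therefore hold log entries about it.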

Another option to discover third person data is to perform the same activities as the person suspecting the existence of such data, while explicitly looking for any signs of surveillance or even employing tools that actively scan for them. This is suitable for open video monitoring, for example: normally we don’t notice any cameras, but when we explicitly look for them, they are easy to see. Hidden (or temporary: see tow truck example above) cameras are more difficult to catch, but with enough experience and diligence these might be discovered at least sometimes. Wireless connections can also be detected easily (though not necessarily their content), leading to processing devices which might collect some data (or not). On the Internet this is easier, as it is trivial to monitor all network traffic of your own computer in detail. When visiting a webpage it is then possible to identify which hosts the computer connects to, which cookies it sends and receives, etc. However, data passed on by the server itself, as well as changes between the original incident and the investigation (e.g. a new advertisement partner), pose significant problems, as neither can be observed this way.
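Once the traffic of a page visit has been recorded, separating the visited site from everyone else who was contacted is straightforward. The sketch below assumes the list of requested URLs has already been captured by some monitoring tool; the URLs are invented, and the function names are hypothetical. Treating the last two labels of a host name as "the site" is a deliberate simplification that fails for domains such as example.co.uk, where a real tool would consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Crude 'site' key: the last two labels of the host name.

    Assumption for this sketch only; it is wrong for suffixes
    like 'co.uk', which need the Public Suffix List.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def third_party_hosts(page_url, request_urls):
    """Return the sorted hosts contacted that do not belong to the page's site."""
    own_site = site_of(page_url)
    hosts = set()
    for url in request_urls:
        host = urlsplit(url).hostname
        if host and site_of(url) != own_site:
            hosts.add(host)
    return sorted(hosts)


# Hypothetical capture of the requests triggered by one page visit.
PAGE = "https://www.news.example/article"
REQUESTS = [
    "https://www.news.example/style.css",
    "https://static.news.example/logo.png",
    "https://ads.tracker.example/pixel.gif?id=123",
    "https://cdn.widgets.example/share.js",
]

print(third_party_hosts(PAGE, REQUESTS))
# prints ['ads.tracker.example', 'cdn.widgets.example']
```

The result only lists parties the browser contacted directly; whatever the visited server forwards on its own remains invisible to this kind of observation.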

If third parties collecting such data have been hacked, their data might have been published on the Internet. Based on this information, it can be inferred what these (or similar) parties are observing. So even if you are not part of the disclosed data, some conclusions can still be drawn. The Snowden disclosures can serve as an example here: the capabilities of one specific secret service have been published, but it must be assumed that similar services in other countries are mostly capable of the same actions. Additionally, it is possible to identify what someone else with comparable access might be able to do – and therefore probably is doing.

This leads to the next, and rather pessimistic, category: whoever possesses the technical capabilities to monitor and collect data will do so. This is not necessarily true, but at least in countries with weak privacy laws it must be assumed. There, data will be collected just in case and quickly sold to others if they show interest and are willing to pay. This approach is therefore better suited to identifying what kind of data might be third person data than who the third party is.

Definite third party holders of data are all kinds of “upstream providers” of services. Airbnb, for example, does not own any servers but instead uses Amazon Web Services. So Amazon obviously has physical access to all of its data. Amazon might not be allowed to look at it (by contract), but it can access it, e.g. in case of emergencies or on request of third parties. While direct data access seems unlikely, using the data for calculating statistics is quite probable. Physical access is especially interesting if the customer company gets into financial difficulties: the provider might retain the data as security, preventing any access by the company or by you, the actual owner, or use it as compensation for unpaid invoices (e.g. by selling it to someone else; similar to utilizing domain names in bankruptcy).

The last useful option is inserting incorrect data and waiting for it to come up somewhere again. For instance, an arbitrary e-mail address might be created and disclosed to a single provider (creating a new one for each target is not difficult). Whenever someone contacts you at this address, you know one party to whom your data (or at least parts of it) was disclosed, and from which source. This is obviously time-consuming and works only if data usage is observable (which is difficult e.g. with video surveillance). Also, “storing for future use/reference” cannot be detected in this way.
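Creating one trap address per provider can be automated. The following sketch is one possible way to do it, under several assumptions: the mailbox name, domain and secret are placeholders, the function names are hypothetical, and the mail server is assumed to deliver "name+tag@domain" to the mailbox "name" (subaddressing, which not every provider supports). Deriving the tag with a keyed hash means no list of issued addresses has to be stored, and the tag does not reveal which provider it was given to.

```python
import hashlib
import hmac

SECRET = b"change-me"      # hypothetical private key, known only to you
MAILBOX = "jane"           # hypothetical local part of your mailbox
DOMAIN = "mail.example"    # hypothetical domain supporting "+tag" addresses


def trap_address(provider):
    """Derive the unique address that is disclosed to exactly one provider."""
    digest = hmac.new(SECRET, provider.encode("utf-8"), hashlib.sha256)
    tag = digest.hexdigest()[:10]
    return f"{MAILBOX}+{tag}@{DOMAIN}"


def identify_source(recipient, providers):
    """Return the provider whose trap address received the mail, if any."""
    for provider in providers:
        if trap_address(provider) == recipient.lower():
            return provider
    return None


# Hypothetical providers that each received their own address.
providers = ["shop.example", "forum.example"]
leaked_to = trap_address("shop.example")

# A mail from an unknown sender arriving at this address points back to the shop.
print(identify_source(leaked_to, providers))
```

As noted above, this only reveals parties that actually use the address; a third party that merely stores it stays undetected.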

Finally, you could perform illegal actions where the only evidence is the potentially monitored behavior, and wait to be arrested (which resembles the previous approach). This method is very reliable, as the police or prosecutor will have to disclose how they found you and present the evidence, in court at the latest (hence the behavior must be the only evidence existing at all). Nevertheless, it cannot be recommended. Still, it is useful regarding other persons (e.g. criminals performing illegal activities for other reasons): whatever has been used as evidence in the past can be taken as a lower limit of what is possible today.

As an overview, the methods described above are presented here briefly in a table with some properties: time required to obtain information, reliability (wrongly assumed to possess data or incorrectly seen as having no data), completeness (will we find all such third parties) and associated costs (not necessarily monetary, but also effort required or “drawbacks” experienced).

| Source of knowledge | Time required | Reliability | Completeness | Costs |
|---|---|---|---|---|
| Just know | None, but long preparation | Medium-Very good; depends on sources | Medium-Very good; depends on sources | Low; most sources are free |
| Do again and observe | Low/as long as the original | Good; wrong identification is unlikely | Medium; depends on observer | Low |
| Third party sources | None, but long preparation | Medium; disclosed data is correct, mere reports not necessarily | Low; only what actually occurred and was published | Low-Medium; many sources are free |
| Technical capabilities | Medium; investigation who+capabilities required | Low; not everyone who can, actually does | Good, but not every party can be identified | Low |
| Upstream providers | Low | Good; services can be bought (=tested) | Low; often not disclosed | Low-Medium; depends on data provided/testing |
| Publish traps | Medium-Long | Very good; actual use is observed | Medium; depends on time and observability | Low-Medium; providing data is free, but active tests might cost |
| Illegal activity + wait | Medium; investigation will take some time | Very good | Very good; limited to certain groups (enforcement) | Very high |

Obtaining access to third-person data

If somebody wants to know what a third person knows about them, several options exist. However, it must be considered that this party might possess the data illegally (or in excess of legal permissions) or is simply not interested in disclosing this fact (only bad press, but no additional revenue). Replies may therefore be slow or non-existent. From the computer forensic view, at least in “official” cases, e.g. court proceedings, several additional options exist. Moreover, cooperation of the data holder might then be enforceable (at least within a country).

First, the person can request access to her/his own data. This only works for personal data, and only under privacy laws that explicitly grant this right. Outside the EU, someone possessing data because of a contract is not necessarily required to provide it. This situation is very problematic with third parties, as they are usually unwilling to disclose the data voluntarily. Also, while the person might have a contract with company A, and this company a contract with company B, this does not automatically mean that data held by B must be disclosed to the person. Any court case is between the person and A, for example, so B is an “innocent bystander” and unaffected by these proceedings. Only A might be ordered to rely on some contract provisions it has with B, first to obtain the data and secondly to disclose it. This requires the person to at least conclusively demonstrate that such data probably does exist and would help the case. Even then, especially in civil proceedings, access might be difficult, as B could argue that disclosure would adversely impact trade secrets. Only in criminal proceedings is such transitive disclosure easier, because the police can also search or impound data located at third parties (after obtaining appropriate permissions, typically from a judge).

Indirectly this third person data might be obtained through information from the second party: what is stored there could have been passed on to third parties, and, if logs are available, the actual transfers might be reconstructed as well. While this seems to establish an “upper limit” (at most these items could have been transferred), that is not the case. The third party may have obtained separate additional data from other sources, combined it with such other information, or enriched it with previously anonymous data. So in reality, more or more detailed information may exist with the third party. Still this approach serves as a first approximation.

An illegal method to obtain access is hacking the data custodian. This could be the actual owner or someone else with physical access, e.g. a cloud provider. While this is obviously illegal, given sufficient knowledge and resources it is a quite promising method. Its advantages are that no owner consent is required and that internationality is not a problem but rather a boon. However, hacking is typically not that easy and there is no guarantee of success. Often only a webserver can be compromised, while other servers, where third-party data might be expected, are more difficult to reach.

As third-party data is only rarely collected for the purpose of merely owning it, but rather for deriving monetary benefits, offering to buy it is another chance for retrieval. It might be necessary to pose as someone else (typically a company intending to use the data), as well as to obtain a larger part of the dataset (e.g. all Austrian users). This may obviously be costly and/or illegal, especially if data of other persons must be acquired too or false statements (“I am a company”) are involved.

Problems of third-person data

While the person the data is about typically desires access to it and simultaneously wants to keep it secret (i.e. the owner of the data should not be allowed to use it or pass it on further), this is not necessarily the case. Sometimes the person would be interested in publishing the data, e.g. to be able to provide an alibi. This may contradict the interests of the third party: data is only valuable if it is not generally available, and more so when its legality is questionable. But even if the person obtains the data, owners might retain some rights to it, especially if the original data (collected or received) has been enhanced or combined with data collected by them. This is comparable to the problem of credit-worthiness checks: while data access is granted (and the person could then publish it), the algorithm for calculating the score remains secret and need not be disclosed. Additional persons might be involved too, as with telephone call records, which can create further difficulties: any party might obtain access, but publication must consider the rights of the other communication participants too.

While third person data can be difficult to access legally for the persons affected, this is not equally true for data owners – collecting or buying it is legal in many jurisdictions. Even then – and more so when ownership is not perfectly legal – such data is typically kept “secret”. So knowing about it becomes difficult, reducing the acceptance of such data by the persons affected. However, this effect should not be overestimated. Considering the existing public registers of applications using/storing personal data (mandatory within the EU), little effect on the general population is observable, which rarely even knows of their existence. From this it can be concluded that public registers or general availability of data categories stored by someone are unlikely to significantly improve the situation. And individual rights to retrieve such information would be enough for e.g. investigative journalists.

Legally, third person data is difficult to regulate: by definition there exists no direct contact or contract between data subject and data owner. Therefore all rights of both parties depend either on the law or on a chain of contracts, which might be enforceable by third parties – or not (legally possible, but restricted in scope and difficult in practice). The typical internationality of electronic data further complicates matters, as normal contracts are much easier to enforce across borders than such contract chains. Also, national laws obviously differ, and then the only hope is the EU or international treaties, i.e. harmonized rules applying to many countries. The problem of international relations in personal data was recently tackled by the ECJ, which ruled that the “Safe Harbor” provisions allowing the export of personal data to the USA are invalid. Another example is that the collection and export of personal data might be illegal in the “source” country, while gathering and importing it can be perfectly legal in the “destination” country. While such trans-border situations hardly ever arise in “real” life (using a telescope to watch persons across borders), they are the typical situation on the Internet.

Another issue of third person data is correctness: how does someone (i.e. the person it is attributed to, but similarly the third party itself) know whether the data is correct or not? It could lack important details, contain old values that are no longer valid, or include calculated data which was correct enough for the original purpose but is not for the new one. Also, third person data might just be invented. An example of the latter is the fake profiles identified in the Ashley-Madison website hack. While it is unlikely that names or e-mail addresses of real persons have been used, pictures and other data are often harvested from actual profiles on other dating websites or scraped from social media platforms. Re-identification could therefore lead to real persons, for whom it can be difficult to explain that it was not them using a fake name and an anonymous e-mail account. Verification of third person data is complicated by the fact that it was not obtained from the persons directly, so modifications or additions might have been introduced at any intermediary point the data passed through, typically without any information about where exactly.

Another source of incorrectness or inconsistencies is that such data is often collected solely indirectly (i.e. not by asking the person but by observing and drawing conclusions). For instance, devices might be shared (especially common with PCs and tablets, which the whole family might use; less so with mobile phones), but any data collected through them is attributed to the “one and only” owner. When a father allows his children to use his tablet, they might contact their friends, e.g. through chats, visiting social media profiles, posting messages and so on. In the eyes of someone observing the data only indirectly, e.g. through trackers in advertisements, this adult male is therefore obviously strangely interested in small children, contacts them, and must be a pedophile. This danger is much higher for third parties, as they typically do not interact directly with the person they are collecting data about and therefore have few chances of noticing, for instance, a different user as in the example.

When considering the difficulties of deleting e.g. revenge porn or any other data from the Internet, it becomes clear that the existence of third person data is problematic. This is exemplified by the possibility of “removing” data from Google search results. The data itself remains on the Internet, is still indexed, and will continue to show up in search results; only searches for “name” or “name + topic” will not contain this specific link (searches for “topic” will!). In relation to Google this is again third person data, and while rendering it a bit more difficult to find is commendable, this cannot be considered a real solution. Either the data needs to (or may) remain publicly accessible, or it should be deleted. Otherwise we create classes of people: those who possess the tools or the knowledge to find things, and the “dumb masses” who do not. The latter will then have no control over their own data and will not be able to find it, while the “privileged” can access all data (their own and that of others). This creates an artificial distinction and partial immunity, as the privileged can hide their misdeeds while others cannot.

Outlook

Third person data will increase in the future, as much more data is being collected and will be retained. And what is stored will be used and passed on to maximize profit. Especially problematic in this context is the “Internet of Things”, where many small devices are equipped with computing power and communication capabilities. Here even the vendor could easily become a third party: no permanent contract is really needed for a coffee machine, but “outsourcing the evaluation of the data to the cloud for better brewing of coffee” is going to be a reality. See for instance Nest thermostats, which send a lot of information to the cloud in the hope of slightly improving comfort or reducing heating costs (where energy savings might be offset by the additional energy required for communication and cloud servers!). Who exactly receives this data and what is or will be done with it later remains unclear. Regarding future developments, similar considerations apply to cars (mandatory eCall: an automatic telephone call is placed to an emergency number in case of a crash; a continuous mobile phone connection is optional for this application, but added-value services are envisaged, and then a third party, the mobile phone operator, will be able to continuously locate any car, at least while it is in use) and to fitness trackers (e.g. sending data to health insurance companies for lower payments).

What options exist to improve the situation or reduce problems? Some approaches could be:

Mag. Dipl.-Ing. Dr. Michael Sonntag (AT) is associate professor at the Johannes Kepler University in Linz at the Institute for Networks and Security. He studied both computer science and law and is researching and teaching in the areas of smart home and web security, computer forensics, and IT law. In addition to the Universities of Linz and Graz, he also regularly teaches at the ELTE in Budapest and the University of Economics in Prague.
