
Machine Learning Data: Do You Really Have Rights to Use It?

Organizations that use machine learning need data to train their systems. But where does that data come from? And can they get into trouble if they don’t have the rights to use it? The short answer is yes; they can, if they aren’t careful.

A few recent cases show the risks companies face when they use personal information to train AI systems allegedly without authorization. First, Burke v. Clearview AI, Inc., a class action filed in federal district court in San Diego at the end of February 2020, involves Clearview, a company accused of “scraping” thousands of sites to obtain three billion images of individuals’ faces, used to train AI algorithms for facial recognition and identification. “Scraping” refers to automated processes that scan websites, collect certain content from them, store that content, and use it later for the collecting company’s own purposes. The basis for the complaint is that Clearview AI failed to obtain consent to use the scraped images. Moreover, given the vast scale of the scraping – three billion images – the risk to privacy is tremendous.
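
To make the mechanics concrete, here is a minimal, hypothetical sketch of what a scraper does. The URL, libraries, and page structure are illustrative assumptions on my part, not details drawn from the Clearview complaint:

    # Hypothetical illustration of scraping; the target URL and page structure are invented.
    import requests
    from bs4 import BeautifulSoup

    def scrape_image_urls(page_url):
        """Download a page and collect the address of every image on it."""
        html = requests.get(page_url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        # Nothing in this loop checks the site's terms of service, robots.txt,
        # or whether the people pictured ever consented to this use.
        return [img.get("src") for img in soup.find_all("img") if img.get("src")]

    urls = scrape_image_urls("https://example.com/profiles")
    print(len(urls), "images collected")

Repeated across thousands of sites, a loop like this is how a collection can reach billions of images without any of the pictured individuals ever being asked.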

In Stein v. Clarifai, Inc., filed earlier in February, the plaintiffs’ class action complaint filed in Illinois state court claims that investors in Clarifai, who were founders of the dating site OKCupid, used their access to OKCupid’s database of profile photographs to transfer the database to Clarifai. Clarifai then allegedly used the photos to train its algorithms for analyzing images and videos, including for facial recognition. Clarifai is the defendant in this case and will have to fight claims that it wasn’t entitled to take the OKCupid photos without notifying the dating site’s users and obtaining their consent. OKCupid is potentially a target too. It isn’t clear whether the plaintiffs contend that OKCupid’s management approved the access to its database, but if it did, the plaintiffs may have claims against OKCupid as well.

Dinerstein v. Google, LLC, also raises questions about the right to use data for AI training. The case involves Google’s use of supposedly de-identified electronic health records (EHR) from the University of Chicago and the University of California San Francisco medical centers to train Google’s medical AI systems and to support the development of a variety of AI services, including assistance with medical diagnoses. In a class action complaint filed in late June 2019, a patient at the University of Chicago’s medical center, on behalf of a putative class, alleged injury from the University’s sharing of medical records. According to the plaintiffs, although the EHR data was supposedly de-identified, Google collects huge amounts of data to profile people, including geolocation data from Android phones, and can therefore reidentify individual patients from the de-identified data.

I am skeptical that Google intended to combine the EHR data with other data, and it isn’t clear what the plaintiffs think Google was going to do with the reidentified data. There was, for instance, no allegation that Google would push medical condition-specific ads to the Android users. Nor does the fact that an Android phone was in the E.R. when an E.R. medical record was created mean that the two are linked: the patient may be a child, and the Android user the child’s parent. It isn’t even clear that geolocation is precise enough to link an Android user to the department or room where the record was created.

Regardless, organizations should consider the sources of AI training data in their risk management plans. They should obtain any needed consents from data subjects. Certain business models, like scraping the public web for photos, are especially problematic. Under the European Union’s General Data Protection Regulation, companies that scrape public Internet sites for personal information at least have to inform the individual data subjects that their personal data has been collected and provide a mechanism for opting out. While the GDPR does not generally apply in the United States, companies should consider that kind of notice-and-opt-out mechanism to reduce their liability risk in the United States as well.

Moreover, if data subjects originally consented to one use, organizations should analyze whether they need to obtain new consent for AI training purposes. De-identification of personal data will help, although the source of the personal data may want to be made whole if lawsuits stem from the data-using organization’s use of that data. Also, organizations should be careful about bridging contexts – using data from one source and combining it with data from another, thereby potentially reidentifying data subjects and violating their privacy. These measures will reduce the liability risks associated with sharing personal data.
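
As a rough illustration of why de-identification alone may not be enough, consider the hypothetical sketch below. The field names and records are invented; the point is simply that once direct identifiers are stripped, the quasi-identifiers that remain (a timestamp plus a coarse location) can still be joined against another data set:

    # Hypothetical sketch; field names and values are invented for illustration.
    DIRECT_IDENTIFIERS = {"name", "email", "medical_record_number"}

    def deidentify(record):
        """Drop the fields that identify a person directly."""
        return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

    ehr_record = {
        "name": "Jane Doe",
        "medical_record_number": "MRN-0001",
        "admitted": "2019-06-01T14:03",
        "location": "ER, Hospital A",
        "diagnosis": "(omitted)",
    }
    phone_ping = {"device_id": "android-123", "timestamp": "2019-06-01T14:05", "location": "Hospital A"}

    released = deidentify(ehr_record)
    # Bridging contexts: a holder of both data sets can match on time and place,
    # re-linking the "de-identified" record to a specific device and its owner.
    same_place_and_time = (released["location"].endswith(phone_ping["location"])
                           and released["admitted"][:13] == phone_ping["timestamp"][:13])

This mirrors the plaintiffs’ theory in Dinerstein, although, as noted above, it is far from clear that real geolocation data is precise enough to make that match.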


Originally posted at airoboticslaw.com.
