The implications of data mining got mildly creepy and a bit too personal the summer I began grad school in data science at Indiana University. I had just bought a pint of ice cream and a bottle of red wine at my neighborhood grocer. As I was smiling about the nice evening meal these two items would pair well with, I noticed I had received a coupon for baby formula. Since I was just getting my start in data analytics, I realized the grocer had fed loyalty card data about baby items I had bought in the past into a recommender system, which then suggested I buy formula. After all, the grocer’s algorithm and big data had surmised that new parents need not only ice cream and red wine but also formula. (In my opinion this is not an unreasonable set of items for some new parents, but I digress.)
Last week I began a discussion on why data ethics is more important than ever. I want to discuss the four principles Matthew J. Salganik presents in his book Bit by Bit: Social Research in the Digital Age: Respect for Persons, Beneficence, Justice, and Respect for Law and Public Interest. My personal example with baby formula did me no harm, but it demonstrates a gap in the ethics of my grocer’s big data collection and use policies. Salganik’s first principle is Respect for Persons, or ‘treating people as autonomous and honoring their wishes.’ The Respect for Persons principle means letting people control their own lives and obtaining their informed consent for how their data is used. As far as I could tell, my grocer offered no information on how to opt out of recommender systems or otherwise control my shopping data.
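To make the opt-out idea concrete, here is a minimal sketch of what a consent check in front of a recommender might look like. Everything in it is hypothetical: the Customer fields, the category names, and the allowed_offers function are my own illustrations, not anything my grocer or Salganik describes.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Customer:
    # All fields are hypothetical; a real loyalty program would differ.
    customer_id: str
    personalization_opt_in: bool = False  # no consent unless the customer gives it
    blocked_categories: Set[str] = field(default_factory=set)


def allowed_offers(
    customer: Customer, candidate_offers: List[Tuple[str, str]]
) -> List[Tuple[str, str]]:
    """Keep only the (category, offer) pairs this customer has agreed to receive."""
    if not customer.personalization_opt_in:
        # Without informed consent, send no personalized offers at all.
        return []
    return [
        (category, offer)
        for category, offer in candidate_offers
        if category not in customer.blocked_categories
    ]
```

The design choice that matters here is the default: personalization stays off until the customer turns it on, and even then the customer can block whole categories such as baby products.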
The second principle, Beneficence, is based on the 1979 Belmont Report and says researchers need to understand and improve the balance of benefits and risks in their research and then decide whether that balance is the right one. In my formula example, the humans designing the grocer’s machine learning algorithms should have considered what harm, if any, a customer would experience by getting an unwanted (and, to some, possibly offensive) formula coupon. What if I were a customer whose child was sent to the emergency room after consuming that brand of formula? Or what if I were a customer who had recently experienced a miscarriage, and the coupon brought up very negative emotions and memories? By weighing the small benefit of a $1 coupon against the potentially much higher risks, perhaps the grocer would have decided, under the Beneficence principle, not to give me the coupon.
The third principle Salganik discusses is Justice, or ‘ensuring that the risks and benefits of research are distributed fairly.’ This principle means one group should not bear the costs of the research while another group unfairly gets its benefits. Another way to apply this component of an ethical data science framework is to think intentionally about ways data mining could exclude certain groups of customers. In the formula coupon example, this principle is probably met: I was most likely not the only parent who was recommended the coupon after buying the “trigger products” of ice cream and red wine.
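One simple way to look for exclusion is to compare how often each customer group actually receives an offer. The sketch below is only an illustration under my own assumptions: the group labels, the (group, received_offer) record format, and both function names are hypothetical, and a gap between groups is a prompt for a closer look, not proof of unfairness.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def offer_rate_by_group(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Share of customers in each group who received the offer.

    Each record is a hypothetical (group, received_offer) pair.
    """
    offered = defaultdict(int)
    total = defaultdict(int)
    for group, received in records:
        total[group] += 1
        if received:
            offered[group] += 1
    return {group: offered[group] / total[group] for group in total}


def largest_gap(rates: Dict[str, float]) -> float:
    """Difference between the highest and lowest group offer rates."""
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())
```

A large gap would be the cue to ask why one group of customers is being left out of a benefit, or singled out for a risk.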
The last principle, Respect for Law and Public Interest, comes from the Menlo Report, which builds on the 1979 Belmont Report’s guidelines to protect people during biomedical and behavioral research. The principle extends the principle of Beneficence beyond individual research participants to everyone with a stake in the research. Two components of the principle are compliance and accountability through transparency. Salganik argues that although there may be times when researchers break the rules that govern how they obtain data, they should be transparent about doing so. I would go a bit further and suggest that all published research using data should require authors to include a section explaining how data ethics was a part of the research design and implementation process.
I would love to hear your take on these four principles, or about real-world cases where you have had to weigh data ethics considerations.