Dataset information disclosure for AI startups

  • Alessandro Perilli
    10 April 2019

Good morning, and thanks for the opportunity to participate in such a critical conversation. One of the many aspects I'd like to discuss with the group, and have the HLEG weigh in on, is transparency in the characteristics and acquisition of the datasets used by AI startups.

As this forum is fully aware, a small or improperly gathered dataset can generate biased or inaccurate results. Neither outcome is necessarily obvious without investigation, but both can have a very direct impact on consumers.

As part of my job, I interact with hundreds of technology startups per year, many of which, these days, claim to use modern artificial intelligence techniques to deliver what they promise. When I evaluate these companies, I always ask how they acquired their dataset, what its size is, and what sort of analysis they performed to identify and mitigate any bias introduced during the data gathering phase. I rarely receive clear, detailed, transparent answers, which suggests either that this is a problem startups don't consider or care about, or that the dataset is not as rich as it should be.

Moreover, my interactions with data scientists on the subject suggest that very few are genuinely concerned about the risk of bias in datasets, whether acquired directly, purchased from third parties, or generated synthetically.

My recommendation to end-user organizations interested in these AI startups is always to investigate how the companies acquire or build their datasets and what effort they make to evaluate both accuracy and bias. However, it would be in customers' interest to have strict rules that force companies to disclose specific information about their datasets.
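To make the kind of due diligence I'm describing concrete: a minimal sketch of one question an end-user organization could ask a vendor to answer is a representation report over a sensitive attribute. Everything here is illustrative, not a standard API: `records`, the `region` attribute, and the `min_share` threshold are assumptions chosen for the example.

```python
from collections import Counter

def representation_report(records, attribute, min_share=0.10):
    """Compute each attribute value's share of the dataset and flag
    values below `min_share` as possibly under-represented.

    `records` is a list of dicts; `attribute` and `min_share` are
    illustrative parameters, not part of any standard library.
    """
    counts = Counter(r[attribute] for r in records)
    total = sum(counts.values())
    # Map each value to (share of dataset, under-representation flag).
    return {value: (n / total, n / total < min_share)
            for value, n in counts.items()}

# Toy dataset: 'region' stands in for any attribute along which
# a gathering process could introduce bias.
data = (
    [{"region": "EU"}] * 90 +
    [{"region": "US"}] * 8 +
    [{"region": "APAC"}] * 2
)

for value, (share, flagged) in representation_report(data, "region").items():
    print(f"{value}: {share:.0%}" + ("  <- under-represented" if flagged else ""))
```

A real audit would of course go further (label quality, provenance, drift between the gathered sample and the deployment population), but even this one-page check is more than many startups can produce on request.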

 

I'd be interested in hearing your opinion on the matter.

Thanks

 

Alessandro Perilli

GM, Management Strategy

Red Hat Inc.

@giano

linkedin.com/in/alessandroperilli