May 15, 2017
Wei Hu, Amrapali Zaveri, Honglei Qiu, Michel Dumontier. Cleaning by Clustering: Methodology for addressing data quality issues in biomedical metadata. Under Review

Amrapali Zaveri, Wei Hu, Honglei Qiu, Michel Dumontier. MetaCrowd: Crowdsourcing biomedical metadata quality assessment. Under Review
Clustering Algorithm

Our agglomerative clustering algorithm grouped keys that were similar to each other in terms of (i) name, (ii) core concept and (iii) values. We evaluated the method against a manually created gold standard in which 359 keys were grouped into 27 clusters based on six types of characteristics: (i) age, (ii) cell line, (iii) disease, (iv) strain, (v) tissue and (vi) treatment. The algorithm generated 18 clusters containing 355 keys (four clusters with only one key were excluded). Within these 18 clusters, 13 keys were not related to the cluster they were assigned to; the remaining keys were correctly identified as related to their cluster. We also compared our approach with four other published methods. Our approach significantly outperformed them on most of the characteristics (all except “cell line”) and achieved the best average F-Score (0.63).
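
To make the idea of grouping keys by combined similarity concrete, the following is a minimal Python sketch of average-linkage agglomerative clustering over metadata keys. It is an illustration under stated assumptions, not the implementation evaluated above: the weights, the 0.4 threshold, the helper names (`jaccard`, `key_similarity`, `agglomerative`), the toy keys, and the use of shared name tokens as a stand-in for "core concept" similarity are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def jaccard(a, b):
    """Jaccard overlap of two collections (0.0 when both are empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def key_similarity(k1, k2, weights=(0.4, 0.3, 0.3)):
    """Weighted mix of name, core-concept and value similarity.

    ASSUMPTION: the weights and the three measures are illustrative only.
    """
    # (i) name similarity: character-level string similarity
    name_sim = SequenceMatcher(None, k1["name"], k2["name"]).ratio()
    # (ii) core-concept similarity: approximated here by shared name tokens
    concept_sim = jaccard(k1["name"].split("_"), k2["name"].split("_"))
    # (iii) value similarity: overlap of the values observed for each key
    value_sim = jaccard(k1["values"], k2["values"])
    w_name, w_concept, w_value = weights
    return w_name * name_sim + w_concept * concept_sim + w_value * value_sim


def agglomerative(keys, threshold=0.4):
    """Merge the most similar pair of clusters until none reaches threshold."""
    clusters = [[i] for i in range(len(keys))]  # start with singletons
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for a, b in combinations(range(len(clusters)), 2):
            # average linkage: mean similarity over all cross-cluster pairs
            sims = [key_similarity(keys[i], keys[j])
                    for i in clusters[a] for j in clusters[b]]
            avg = sum(sims) / len(sims)
            if avg > best:
                best, best_pair = avg, (a, b)
        if best < threshold:
            break  # no remaining pair is similar enough to merge
        a, b = best_pair
        clusters[a] += clusters[b]
        del clusters[b]
    return [sorted(keys[i]["name"] for i in c) for c in clusters]


# Hypothetical toy keys, not taken from the evaluated dataset
keys = [
    {"name": "age", "values": {"12", "30", "45"}},
    {"name": "age_years", "values": {"12", "30", "60"}},
    {"name": "tissue", "values": {"liver", "lung"}},
    {"name": "tissue_type", "values": {"liver", "brain"}},
]
clusters = agglomerative(keys, threshold=0.4)
print(clusters)
```

On these toy keys the sketch places the two age-like keys in one cluster and the two tissue-like keys in another. Raising the threshold leaves more keys as singletons, which a pipeline like the one described above might then exclude.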