One of the challenges in accurately classifying privacy policies is the shortage of training data. Previous studies have depended heavily on trained annotators, some even with legal training, which limits the size of labeled datasets. The conventional wisdom is that untrained annotators will not have the background (or perhaps the patience) to read through and understand privacy policies. Calpric shows this is not true: with Calpric’s text selection and segmentation, crowdsourced annotators can be competitive with trained annotators. This allows Calpric to be trained on a dataset several times larger than previous datasets, which lets us capture and study privacy policy properties in minority data categories and data actions. For example, Calpric can accurately distinguish explicit denials (e.g., “we will not collect…”) from the absence of any statement about collection. We find an inverse correlation between explicit denials and app popularity, likely because popular apps end up collecting more personal data and thus cannot flatly deny collection. You can read more about Wendy’s Calpric in the paper here. It will appear in August at USENIX Security 2023.