[Jenkins-infra] Revisiting access control for anonymized "census data"

Mon Jun 13 18:44:26 UTC 2016

(replies inline)

On Wed, 08 Jun 2016, Baptiste Mathus wrote:

> +1. I've played with it some months ago and agree there's no privacy issue
> IMO with the data already previously available.

I brought this up in the governance meeting[1] and decided not to push it
towards an action after Daniel Beck expressed concerns about some changes
Baptiste has been pushing in a few pull requests.

Consider these JIRAs:

 * Extension Point for contributing usage statistics - <https://issues.jenkins-ci.org/browse/JENKINS-32485>
 * Split node monitors from core - <https://issues.jenkins-ci.org/browse/JENKINS-26466>

There are two concerns which I would like to raise for you, Baptiste:

1: Potential infrastructure impact and feasibility of additional "usage stats"

The way that usage statistics are reported is through an encrypted HTTP GET
request to usage.jenkins-ci.org. This limits the amount of data that can be
transmitted, but generally allows us more guaranteed delivery since a Jenkins
user's browser is what is used for reporting this information.

That encrypted payload is then decrypted, anonymized and then re-uploaded by an
out-of-band process, after which the `infra-statistics` process takes over.

A few questions:

 - How much additional data is going to be crammed into that HTTP GET payload?
   What happens when there's too much?

 - Is there an estimate of what kind of additional transit and disk storage
   requirements this is going to impose on project infrastructure?

 - How are plugin developers expected to "see" their custom usage stats at the
   end of the processing cycle?
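
To make the first question a bit more concrete, here is a rough sketch of how
one might estimate whether a report still fits in a GET request. It is not the
actual Jenkins reporting code: the 2048-character budget, the field names, and
the gzip-plus-base64 encoding are all assumptions for illustration, and
encryption (omitted here) would only add overhead on top.

```python
import base64
import gzip
import json
from urllib.parse import quote

# Assumed budget for illustration; real limits vary by browser, proxy, server.
ASSUMED_URL_LIMIT = 2048


def encoded_query_length(report: dict) -> int:
    """Length of the report after JSON, gzip, base64 and URL-encoding."""
    raw = json.dumps(report, separators=(",", ":")).encode("utf-8")
    packed = base64.b64encode(gzip.compress(raw)).decode("ascii")
    # Percent-encode '+', '/' and '=' so the value is safe in a query string.
    return len(quote(packed, safe=""))


def fits_in_get(report: dict, limit: int = ASSUMED_URL_LIMIT) -> bool:
    """True if the encoded report stays within the assumed URL budget."""
    return encoded_query_length(report) <= limit


# Hypothetical reports: a baseline, and one with extra plugin-contributed data.
baseline = {
    "version": "x.y",
    "plugins": [{"name": f"plugin-{i}", "version": "1.0"} for i in range(20)],
}
extended = dict(baseline, extra={f"metric-{i}": i * 7919 for i in range(500)})

for name, report in (("baseline", baseline), ("extended", extended)):
    print(name, encoded_query_length(report), fits_in_get(report))
```

Whatever the real limit turns out to be, a check along these lines would at
least make "too much" a measurable condition instead of a silent failure.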

2: Potential privacy impact of additional usage stats

The anonymization code for existing usage stats is successful at anonymizing
because we know where "custom" fields or sensitive information might come from
ahead of time. For example, custom plugins, hostnames, and things like that.
Without knowledge beforehand of what metrics are going to be reported, I don't
understand how we can ensure user privacy.
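
The problem can be illustrated with a minimal sketch. This is not the existing
anonymization code; the field names, the salt handling, and the allowlist
approach are assumptions made up for the example. It only shows why a field
nobody knew about in advance has to be either dropped or passed through blind.

```python
import hashlib

# Hypothetical field names, chosen for illustration only.
KNOWN_SAFE = {"version", "executors"}
KNOWN_SENSITIVE = {"hostname", "install_id"}


def pseudonymize(value: str, salt: str) -> str:
    """Replace a sensitive value with a truncated salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def anonymize(record: dict, salt: str) -> dict:
    """Keep known-safe fields, hash known-sensitive ones, drop the rest."""
    out = {}
    for key, value in record.items():
        if key in KNOWN_SAFE:
            out[key] = value
        elif key in KNOWN_SENSITIVE:
            out[key] = pseudonymize(str(value), salt)
        # Unknown fields are dropped: without advance knowledge of a field
        # there is no way to tell whether it carries identifying data.
    return out


record = {
    "version": "x.y",
    "hostname": "ci.example.com",
    "build_user_email": "someone@example.com",  # contributed by a plugin
}
print(anonymize(record, salt="not-a-real-salt"))
```

Dropping unknown fields keeps the output safe but defeats the purpose of
plugin-contributed metrics; passing them through does the opposite.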

 - How would we keep the anonymization code updated with the new metrics
   potentially reported by plugins?

 - How are we going to ensure that plugin authors don't purposefully or
   accidentally publish personally-identifiable information?

I'm in agreement with Daniel that, right now, we can either open up census
information, *or* we can merge the changes proposed by Baptiste. But in their
current implementations, we cannot have both at the same time.

[1] <http://meetings.jenkins-ci.org/jenkins-meeting/2016/jenkins-meeting.2016-06-08-18.00.html>

- R. Tyler Croy

------------------------------------------------------
     Code: <https://github.com/rtyler>
  Chatter: <https://twitter.com/agentdero>

  % gpg --keyserver keys.gnupg.net --recv-key 3F51E16F
------------------------------------------------------