[Jenkins-infra] Revisiting access control for anonymized "census data"

Baptiste Mathus bmathus at gmail.com
Thu Jun 16 12:03:11 UTC 2016

See inline.

2016-06-13 18:44 GMT+00:00 R. Tyler Croy <tyler at monkeypox.org>:

> (replies inline)
> On Wed, 08 Jun 2016, Baptiste Mathus wrote:
> > +1. I've played with it some months ago and agree there's no privacy issue
> > IMO with the data already previously available.
> I brought this up in the governance meeting[1] and decided not to push it
> towards an action after Daniel Beck expressed concerns about some changes
> Baptiste has been pushing in a few pull requests.
> Consider these JIRAs:
>  * Extension Point for contributing usage statistics -
>    <https://issues.jenkins-ci.org/browse/JENKINS-32485>
>  * Split node monitors from core -
>    <https://issues.jenkins-ci.org/browse/JENKINS-26466>

First, as a reminder and for context: JENKINS-32485 was created because
breaking cycles was required to move JENKINS-26466 forward.
In the past, I had also had a question about how plugin developers could
push custom stats, so this happened to be a match.

Hence the resulting PR: https://github.com/jenkinsci/jenkins/pull/1985

> There are two concerns which I would like to raise for you, Baptiste:
> 1: Potential infrastructure impact and feasibility of additional "usage stats"
> The way that usage statistics are reported is through an encrypted HTTP GET
> request to usage.jenkins-ci.org. This limits the amount of data that can be
> transmitted, but generally allows us more guaranteed delivery since a Jenkins
> user's browser is what is used for reporting this information.
> That encrypted payload is then decrypted, anonymized and then re-uploaded by
> an out-of-band process, after which the `infra-statistics` process takes over.
> A few questions:
>  - How much additional data is going to be crammed into that HTTP GET payload?
>    What happens when there's too much?

Potentially, it could be a lot if someone screws up, for sure. That's why I
put a lot of Javadoc in the PR to make clear this is to be used with care,
and to strongly encourage developers to ask for reviews when using the
feature, possibly pinging the Security Officer specifically (which Daniel
seems to actually disagree with).

But basically, this is already the case: anyone could quite easily use the
current pipe, though there is currently no way to get the data back out at
the end without discussing it upstream. I can only agree that the PR would
make it easier to screw up.
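
To make the "keep it small" idea concrete, here is a rough sketch of a size
guard applied when merging contributed stats into the report. It is written
in Python for brevity and is not what the PR does: the budget value and the
names `MAX_CONTRIBUTION_BYTES` and `merge_contributions` are assumptions made
up for illustration.

```python
import json

# Hypothetical per-contribution budget; the real limit would depend on how
# much data fits in the HTTP GET request.
MAX_CONTRIBUTION_BYTES = 512


def merge_contributions(core_stats, contributions, budget=MAX_CONTRIBUTION_BYTES):
    """Merge plugin-contributed stats into the core report.

    Returns (payload, rejected). A contribution is rejected when its
    compact JSON encoding exceeds the budget, or when its name would
    overwrite a field already present in the core stats.
    """
    payload = dict(core_stats)
    rejected = []
    for name in sorted(contributions):
        data = contributions[name]
        size = len(json.dumps(data, separators=(",", ":")).encode("utf-8"))
        if name in core_stats or size > budget:
            rejected.append(name)
            continue
        payload[name] = data
    return payload, rejected
```

A guard like this would not replace code review, but it would bound the
damage of a careless contribution to a dropped field rather than an
oversized request.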

>  - Is there an estimate of what kind of additional transit and disk storage
>    requirements this is going to impose on project infrastructure?

No. It's not possible. It's totally dependent on what "stats" each
implementation would add. (Though again, I could stress a lot more in the
Javadoc that the data must stay small in every situation, even apart from
the privacy aspects.)

>  - How are plugin developers expected to "see" their custom usage stats at
>    the end of the processing cycle?

As explained, the initial goal was only to send exactly the same data as
today, but decoupled to make JENKINS-26466 feasible. So, not a whole lot of
thought has been given to that (but many discussions happened here and
there, more specifically in the PR
<https://github.com/jenkinsci/jenkins/pull/1985> or on IRC mostly with

> 2: Potential privacy impact of additional usage stats
> The anonymization code for existing usage stats is successful at anonymizing
> because we know where "custom" fields or sensitive information might come
> from ahead of time. For example, custom plugins, hostnames, and things like
> that.
> Without knowledge beforehand of what metrics are going to be reported, I
> don't understand how we can ensure user privacy.
>  - How would we keep the anonymization code updated with the new metrics
>    potentially reported by plugins?
>  - How are we going to ensure that plugin authors don't purposefully or
>    accidentally publish personally-identifiable information?

Code review. Very. Very. Strongly. Recommended.
Again, there's no magical solution.

Any plugin is able to (spoiler: some *already* do) push stats data
anywhere it wants while Jenkins is running.

> I'm in agreement with Daniel that, right now, we can either open up census
> information, *or* we can merge the changes proposed by Baptiste. But in
> their current implementations, we cannot have both at the same time.

Though I would like to be able to keep things as open as we can, I agree.
IMO we would have to keep the census data closed, with some checking in
place to detect when new data has been added.
Again, we would probably want to be /aggressive/ about any new stats data
that is only discovered at that point, i.e. pushed without the associated
code having been reviewed by the dev community.
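
The kind of checking I have in mind could look roughly like the sketch
below (again Python, purely illustrative). It assumes each report is a
JSON object and that we maintain an explicit list of reviewed top-level
fields. The field names in `KNOWN_FIELDS` and the function names are
assumptions, not the actual processing code.

```python
# Hypothetical allowlist of fields the anonymization code has been reviewed
# against; the real list would come from the existing processing scripts.
KNOWN_FIELDS = frozenset({"install", "version", "plugins", "jobs", "nodes"})


def unknown_fields(report, known=KNOWN_FIELDS):
    """Return the sorted top-level fields of a report that nobody has reviewed."""
    return sorted(set(report) - set(known))


def publishable(report, known=KNOWN_FIELDS):
    """Split a report into the part safe to publish and the fields to flag.

    Unknown fields are dropped from the published copy and returned
    separately so that someone can chase down the code that sent them.
    """
    flagged = unknown_fields(report, known)
    filtered = {key: value for key, value in report.items() if key in known}
    return filtered, flagged
```

Dropping by default means an unreviewed field never reaches the public
data set, at the cost of someone having to extend the list each time a
new contribution is approved.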

I understand this is a tough subject.
I would understand if we decide that PR 1985 must be trashed and an
alternative decoupling approach found (though an extension point is the
usual approach to inversion of control, and it is what Jesse originally
advised when I started working on it).

Hope this clarifies.

-- Baptiste