data.debian.org

Original Project Description and Rationale

Subject: Large data packages in the archive

Hi,

one important question lately has been "What should we do with large packages containing data?", such as game data, huge icon/wallpaper sets, some science data sets, etc. Naturally, this is a decision ftpmaster has to take, so here are a few thoughts to facilitate discussion and to see if we missed important points, but we keep the right to have the last word here. :)

Basic Problem: "What to do with large data packages?"

That already has a problem: How to define "large"? One way, which we chose for now, is simply "everything > 50MB".
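
As a minimal sketch of that cutoff: the helper names below are hypothetical, and reading "50MB" as 50 * 1024 * 1024 bytes is an assumption, since nothing above says whether decimal or binary megabytes are meant.

```python
import os

# Assumption: "50MB" is read as 50 MiB; the mail does not pin down the unit.
LARGE_THRESHOLD_BYTES = 50 * 1024 * 1024


def is_large(size_bytes: int, threshold: int = LARGE_THRESHOLD_BYTES) -> bool:
    """Return True if a package of this size counts as "large".

    The comparison is strict, matching "everything > 50MB".
    """
    return size_bytes > threshold


def is_large_file(path: str) -> bool:
    """Apply the same rule to a file on disk, for example a built .deb."""
    return is_large(os.path.getsize(path))
```

A package of exactly the threshold size is not "large" under this reading; only anything strictly bigger is.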

While the archive software is written in Python, this problem sounds like a Perl one as "There is more than one way to do (solve) it":

a.) We can simply say that we don't want this in Debian and people should use external hosting for such packages. After all, they usually serve only a very small minority.

b.) We can just add another component "data" besides main/contrib/non-free.

c.) We can host a separate archive for it under control of ftpmaster.

The first two seem to have grave problems:

a.) Is basically no (good) option. It is our job to maintain the archive, and if there is enough demand we should make it possible to also host things like these data packages. Additionally it has the problem that it would require a move of everything that needs those data packages into contrib, as there wouldn't be a good base for a Policy exception.

b.) While that would be the simplest solution, it has other problems, large enough that we decided against it. The biggest one is the principle of least surprise for our mirrors. We are talking about this in order not to bloat the main archive too much. If we just add another component, it will end up being mirrored widely anyway, even if we send an announcement weeks before. Requiring every mirror admin to decide whether to mirror or exclude it, and then adjust their scripts, is a simple no-go for us.

So the way to go for us seems to be c.), hosting the archive somewhere below data.debian.org probably.

For all the rest of the mail I talk about solution c., unless otherwise stated.

So assume we go for solution c. (which is what happens unless someone has a very strong reason not to, which I currently can't imagine): we will set up a separate archive for this. This will work the same way as our main archive does, with a few notable points:

It will be solely arch:all, not split per architecture. Or, if someone presents good reasons why a data archive needs to be architecture-aware, we will also offer this, but NO autobuilder support will be provided. This is meant as a place for large datasets, and those should be arch independent.
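
To illustrate the arch:all restriction, here is a rough sketch of the kind of check it implies for a binary package's control paragraph. This is not how the archive software does it; the function names are hypothetical and the parser handles only simple "Field: value" lines plus indented continuation lines.

```python
from __future__ import annotations


def parse_stanza(text: str) -> dict[str, str]:
    """Parse one control paragraph into a dict keyed by lowercased field name."""
    fields: dict[str, str] = {}
    last = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and last is not None:
            # Indented line: continuation of the previous field.
            fields[last] += "\n" + line.strip()
        elif ":" in line:
            name, value = line.split(":", 1)
            last = name.strip().lower()
            fields[last] = value.strip()
    return fields


def acceptable_for_data_archive(stanza: str) -> bool:
    """Hypothetical rule: only Architecture: all packages are accepted."""
    return parse_stanza(stanza).get("architecture") == "all"
```

A paragraph with "Architecture: amd64", or with no Architecture field at all, would be rejected by this sketch.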

It is a separate archive, so it needs full source uploads to work: every data package you create will be a full source package, and you have to split the source between this archive and the rest that goes into the normal Debian one.

We need to change policy. It currently forbids packages in main to Depend/Recommend something outside of it (which is good). As that would make the data archive less useful, I propose to change this to something including the meaning of "Packages in main are allowed to recommend packages in the data archive". Dependencies should not be allowed, but read the next point.

Packages in main need to be installable and not cause their (indirect) reverse build-depends to FTBFS in the absence of data.debian.org. If the data is necessary for the package to work and there is a small dataset (like 5 to 10 MB) that can be reasonably substituted for the complete data package, the smaller dataset should be included in main and the package may then depend on "foo-data | foo-data-small".
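
The installability requirement can be sketched as a check over a Depends field: every comma-separated group must have at least one alternative that is available in main. The function names and the main_packages set are assumptions for illustration only, and the parsing is deliberately simplified (version constraints and architecture restrictions are just stripped).

```python
from __future__ import annotations


def parse_depends(field: str) -> list[list[str]]:
    """Split a Depends-style field into groups of alternative package names.

    "a, b | c (>= 1.0)" becomes [["a"], ["b", "c"]].
    """
    groups = []
    for clause in field.split(","):
        alternatives = []
        for alt in clause.split("|"):
            # Drop version constraints "(>= 1.0)" and arch restrictions "[amd64]".
            name = alt.split("(")[0].split("[")[0].strip()
            if name:
                alternatives.append(name)
        if alternatives:
            groups.append(alternatives)
    return groups


def unsatisfiable_without_data(depends: str, main_packages: set[str]) -> list[list[str]]:
    """Return the dependency groups that no package in main can satisfy."""
    return [
        group
        for group in parse_depends(depends)
        if not any(pkg in main_packages for pkg in group)
    ]
```

With foo-data-small in main, "foo-data | foo-data-small" yields no unsatisfiable groups, while a bare "foo-data" dependency would be reported.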

Any comments?

Timeframe for this? I expect it to be ready within 2 weeks.

-- bye Joerg