Issue40

Title: Hash cache needs to be more flexible
Priority: feature
Status: resolved
Superseder: (none)
Nosy List: ant, poeml
Assigned To: poeml
Keywords: (none)

Created on 2010-03-08.20:44:37 by poeml, last changed by poeml.

Messages
msg204 Author: poeml Date: 2010-09-01.16:13:33
Generation of hashes for zsync and torrents can now be (separately) switched off 
in /etc/mirrorbrain.conf. 
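
For illustration, here is a minimal Python sketch of how a tool could read two such switches from an INI-style file. The section name "general" and the option names "zsync_hashes" and "chunked_hashes" are assumptions made for this example and are not confirmed in this issue; the torrent default used here is likewise assumed.

```python
import configparser

# Hypothetical config fragment; section and option names are assumptions.
SAMPLE_CONF = """
[general]
zsync_hashes = 0
chunked_hashes = 1
"""

def read_hash_switches(text):
    """Return which expensive hash types are enabled, given config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {
        # zsync defaults to off, as stated in this message.
        "zsync": parser.getboolean("general", "zsync_hashes", fallback=False),
        # The torrent default is an assumption of this sketch.
        "torrent": parser.getboolean("general", "chunked_hashes", fallback=True),
    }

switches = read_hash_switches(SAMPLE_CONF)
print(switches)
```

Reading each switch separately keeps the two hash types independent, which matches the "separately" wording above.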

For the zsync hashes, the default is "off", because Apache currently allocates 
large amounts of memory for such large data.

On another note, empty files seem to be handled as they should.

Hence, I regard this bug as resolved.
msg182 Author: poeml Date: 2010-04-23.03:03:42
What's also missing is a way to switch generation of the "expensive" hashes, 
like torrents and zsync, on or off via /etc/mirrorbrain.conf. Maybe with a file 
mask or list of directories.
msg160 Author: poeml Date: 2010-03-12.02:57:44
Note to self: need to check whether empty files (0 byte size) are still
handled correctly, or if they need a special case.
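
A quick way to check the whole-file digests for a 0-byte file, as a sketch only: the helper name hash_file and the chunk size are made up for this example and are not taken from the mb code. A chunked read loop simply never runs its body for an empty file, so the digests are those of the empty input.

```python
import hashlib
import os
import tempfile

def hash_file(path, chunk_size=65536):
    """Return (md5, sha1) hex digests of a file, read in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        # For a 0-byte file the first read returns b"" and the loop ends.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
            sha1.update(chunk)
    return md5.hexdigest(), sha1.hexdigest()

# Create an empty temporary file and hash it.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    empty_path = tmp.name
try:
    empty_md5, empty_sha1 = hash_file(empty_path)
finally:
    os.unlink(empty_path)

print(empty_md5)   # d41d8cd98f00b204e9800998ecf8427e
print(empty_sha1)  # da39a3ee5e6b4b0d3255bfef95601890afd80709
```

Per-chunk hash lists (as used for torrents) would come out empty with such a loop, which may be the part that deserves the closer look.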
msg150 Author: poeml Date: 2010-03-11.23:53:05
This is largely done.

Code in metalink-hasher seems to work well, and creates hashes in the
database in addition to the on-disk storage which we keep available for
transition. 

The new hashes in the database are not yet cleaned up when they become
obsolete. Maybe "mb db vacuum" should become involved in the cleanup,
but it would need to look into the file tree for that. It is probably
necessary to let "mb makehashes" clean up per directory. Otherwise
obsolete entries could accumulate very quickly.
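
The per-directory cleanup could look roughly like the following sketch. It uses an in-memory SQLite database for self-containment; the table name "hashes", its columns, and the function cleanup_directory are hypothetical and do not reflect the actual MirrorBrain schema or code.

```python
import posixpath
import sqlite3

def cleanup_directory(conn, directory, existing_names):
    """Delete hash rows for files in `directory` that are no longer on disk.

    `existing_names` is the set of file names currently found in the directory.
    Returns the sorted list of paths whose rows were removed.
    """
    rows = conn.execute(
        "SELECT path FROM hashes WHERE dirname = ?", (directory,)
    ).fetchall()
    obsolete = [p for (p,) in rows
                if posixpath.basename(p) not in existing_names]
    conn.executemany("DELETE FROM hashes WHERE path = ?",
                     [(p,) for p in obsolete])
    conn.commit()
    return sorted(obsolete)

# Demo with placeholder hash values.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE hashes (path TEXT PRIMARY KEY, dirname TEXT, md5 TEXT)"
)
conn.executemany(
    "INSERT INTO hashes VALUES (?, ?, ?)",
    [
        ("pub/a.iso", "pub", "placeholder-a"),
        ("pub/b.iso", "pub", "placeholder-b"),   # file was deleted on disk
        ("other/c.iso", "other", "placeholder-c"),
    ],
)

removed = cleanup_directory(conn, "pub", {"a.iso"})
print(removed)
```

Because only rows of the directory being processed are touched, the cleanup needs no separate walk over the whole file tree.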

mod_mirrorbrain uses the new hashes from the database and falls back to
on-disk hashes for transition. The new hashes are already used in old
Metalinks, new Metalinks, and also in the mirror lists!
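
The lookup order described here, sketched in Python for illustration only (mod_mirrorbrain itself is an Apache module, and the function and the dict-based stores below are hypothetical stand-ins):

```python
def lookup_hashes(path, db_hashes, disk_hashes):
    """Prefer hashes from the database, fall back to the on-disk cache.

    Both stores are plain dicts here (path -> hash record); they stand in
    for the real database and on-disk formats.
    Returns a (source, record) tuple, or (None, None) if nothing is found.
    """
    record = db_hashes.get(path)
    if record is not None:
        return "database", record
    record = disk_hashes.get(path)
    if record is not None:
        return "disk", record
    return None, None

db = {"pub/new.iso": {"md5": "placeholder-new"}}
disk = {"pub/old.iso": {"md5": "placeholder-old"},
        "pub/new.iso": {"md5": "placeholder-stale"}}

print(lookup_hashes("pub/new.iso", db, disk))
print(lookup_hashes("pub/old.iso", db, disk))
```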
msg148 Author: poeml Date: 2010-03-10.00:17:41
In svn trunk, there is now working code that also saves the hashes to the 
database. Seems like a good step forward. The code needs more testing to become 
robust enough to be used by mod_mirrorbrain.
msg135 Author: poeml Date: 2010-03-08.20:44:36
The hash cache is too inflexible in its current on-disk format. It was fine in the past, 
when Apache included the contents in v3 Metalinks. The snippets on disk were prepared 
just for that. However, it's difficult to add further features like
 - hashes in HTTP headers
 - inclusion of hashes into RFC Metalinks (different format)
 - inclusion of hashes into the mirror lists
 - building a "hash server" (append .md5 to any URL and get the md5 sum)
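
To make the "hash server" idea concrete, here is a small Python sketch. The function name, the dict used as hash store, and the md5sum-style response format are assumptions for illustration, not a description of an existing implementation.

```python
import posixpath

# Suffix -> hash type. Only .md5 is named in this issue; more could be added.
HASH_SUFFIXES = {".md5": "md5"}

def resolve_hash_request(url_path, hash_db):
    """Map a request like /pub/file.iso.md5 to a response body.

    `hash_db` is a dict (path -> {hash type: hex digest}) standing in for
    the real hash store. Returns None if the request is not a hash request
    or no hash is known for the target file.
    """
    for suffix, kind in HASH_SUFFIXES.items():
        if url_path.endswith(suffix):
            target = url_path[: -len(suffix)]
            record = hash_db.get(target)
            if record is None or kind not in record:
                return None
            # Same layout as md5sum output: digest, two spaces, file name.
            return f"{record[kind]}  {posixpath.basename(target)}\n"
    return None

hash_db = {"/pub/empty.txt": {"md5": "d41d8cd98f00b204e9800998ecf8427e"}}
print(resolve_hash_request("/pub/empty.txt.md5", hash_db), end="")
```

A real server would presumably serve an existing file named *.md5 first and only fall through to a lookup like this when no such file exists.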

So this is blocking several good things that could be done. 

Issue 15 contains some ramblings about this, but let's track this change here.

I currently think that moving the hashes into the database might be best. It would definitely 
be a flexible option without the need to invent an on-disk format and write parsers for it. 
Also, it would make the data available to a web frontend.
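
A rough sketch of what database-backed hash storage might look like, using SQLite so that the example is self-contained. The table layout, the column names, and the size/mtime freshness check are assumptions for this sketch and not the schema that was eventually implemented.

```python
import sqlite3

# Hypothetical table: one row per file, keyed by path.
SCHEMA = """
CREATE TABLE IF NOT EXISTS file_hashes (
    path  TEXT PRIMARY KEY,
    size  INTEGER NOT NULL,
    mtime INTEGER NOT NULL,
    md5   TEXT NOT NULL,
    sha1  TEXT NOT NULL
)
"""

def store_hashes(conn, path, size, mtime, md5, sha1):
    """Insert or overwrite the hash record for one file."""
    conn.execute(
        "INSERT OR REPLACE INTO file_hashes VALUES (?, ?, ?, ?, ?)",
        (path, size, mtime, md5, sha1),
    )
    conn.commit()

def is_current(conn, path, size, mtime):
    """True if a stored record matches the file's present size and mtime."""
    row = conn.execute(
        "SELECT size, mtime FROM file_hashes WHERE path = ?", (path,)
    ).fetchone()
    return row == (size, mtime)

hash_conn = sqlite3.connect(":memory:")
hash_conn.execute(SCHEMA)
store_hashes(hash_conn, "pub/empty.txt", 0, 1268080000,
             "d41d8cd98f00b204e9800998ecf8427e",
             "da39a3ee5e6b4b0d3255bfef95601890afd80709")

print(is_current(hash_conn, "pub/empty.txt", 0, 1268080000))  # unchanged file
print(is_current(hash_conn, "pub/empty.txt", 5, 1268090000))  # file changed
```

With the hashes in ordinary columns, each consumer (headers, Metalinks, mirror lists, a web frontend) can query just what it needs instead of parsing a prepared snippet.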

Before the on-disk format is dropped, we can test how well the database approach works.

As a first step, I have now transferred all functionality from the external metalink-hasher 
script into the "mb" tool. Thus, the database functionality is now available at no extra cost.
History
Date                 User   Action  Args
2010-09-01 16:13:33  poeml  set     status: testing -> resolved; messages: + msg204
2010-04-23 03:05:37  poeml  set     status: in-progress -> testing
2010-04-23 03:03:42  poeml  set     messages: + msg182
2010-03-29 06:44:31  ant    set     nosy: + ant
2010-03-12 02:57:45  poeml  set     messages: + msg160
2010-03-11 23:53:06  poeml  set     messages: + msg150
2010-03-10 00:17:41  poeml  set     messages: + msg148
2010-03-08 20:44:37  poeml  create