On 6/19/2011 1:34 PM, Don Y wrote:
> Hi David,
>
> On 6/19/2011 12:07 PM, David Brown wrote:
>> On 19/06/11 20:41, Don Y wrote:
>
>>> I'm looking for a tool that will let me pass "portions
>>> of filesystems" to it (i.e., lists of file hierarchies)
>>> along with a "volume size" and have *it* come up with
>>> *an* "optimal" packing arrangement to put those
>>> files onto the minimum number of volume-sized media.
>>
>> /Optimal/ cannot be done (at least, not in polynomial time, as far
>> as is currently known).
>
> Correct. I was trying to acknowledge this by emphasizing
> "an" (optimal).
>
>> You are looking for something roughly equivalent
>> to the "knapsack problem", which is a well-known NP-complete problem.
>> The best you could hope for is a reasonable solution.
>
> Exactly. There are multiple (unknown) ways of packing a set
> of objects onto a medium. In terms of "quality of solution",
> they might be equivalent (i.e., two media -- A & B -- have
> free spaces a1 and b1 in a given packing while an alternate
> packing scheme can result in identical free spaces -- but different
> *contents* on their respective media). Or, they can be "less ideal"
> (i.e., total number of media remains the same -- and, thus, the
> total amount of free space -- but the contents are spread out over
> more volumes).
>
> *An* ideal packing would use exactly total_size/volume_size volumes
> with free_space on only the last volume.
>
> *An* "optimal" packing would use the minimum number of volumes
> necessary to contain all of the objects (ideal is better than
> optimal).
>
> [Of course, all of the above subject to the above constraints.]
>
>>> [this is similar to packing software for distribution]
>>>
>>> MS had a similar tool in one of their ancient kits
>>> that was effective for packing *few* files onto floppies
>>> (yes, *that* ancient!).
>>>
>>> The tool should be able to take into account block
>>> sizes and overheads of the target media (i.e., a 1 byte
>>> file takes up a lot more than 1 byte!).
>>>
>>> And, the tool should try to preserve related portions
>>> of the hierarchy on the same medium, where possible.
>>> (i.e., /foo/bar and /foo/baz should cling together
>>> more tightly than either would with /bizzle/bag)
>>>
>>> My current need is for a windows platform but I can
>>> easily port something from "elsewhere" if need be.
>>
>> I'd be surprised if you find anything useful. Perhaps if you try to
>> describe what you are trying to do, and why it can't be handled by a
>> simpler solution, it may be possible to help. The only vaguely related
>> problem I could imagine is deciding how to split the packages of a large
>> Linux distro between different CD images (or different floppy images for
>> old distros, or different DVD images for the newest and biggest distros).
>
> I generate large data sets (almost) daily. Up to this
> point, I have just been generating (gzip'ed) tarballs
> and split-ing them into "volume sized" pieces.
>
> This works fine for getting them onto the media with
> the least amount of effort/intervention. I.e., this is
> an "ideal" packing (using the above terminology).
>
> But, going back and trying to retrieve/inspect a portion of a
> data set is a real PITA. It's easy to find a particular
> *day*. But, with that set of media, the only reliable way
> of finding what I'm looking for is to mount all of the media,
> cat them together, pipe that through gunzip (or let tar do
> it for me), then de-tar, locate the file of interest and
> *extract* it.
>
> It would be much nicer to just mount *a* medium, browse to
> that part of the filesystem and see if the file sought is
> present. If not, umount and move on to the next in the series.
>
> (I think this is also more tolerant of media errors -- i.e.,
> a bad read on volume N could make it impossible (?) to retrieve
> the contents of volumes > N, depending on the nature of the read
> error.)
Don --
Do you really need to put this on some sort of "small" (no more than,
say, 4.7GB) media? Our level 1 backup system here is a 1.5TB USB hard drive,
on which we keep, simply as directories, full copies of the entire
server daily for the last two weeks and biweekly for the last two
months. I think it cost $100, thirty lines of bash script, and two cron
jobs.
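
To make the retention scheme concrete, here is a rough sketch of the
pruning rule in Python. It is not the actual bash script; the function
name, the 14-day "daily" window, the 60-day stand-in for "two months",
and reading "biweekly" as "every 14 days" are all assumptions made for
illustration.

```python
from datetime import date, timedelta

def snapshots_to_keep(snapshots, today, daily_days=14, max_days=60, step_days=14):
    """Return the snapshot dates a daily-then-biweekly rotation would retain.

    All parameters are illustrative assumptions:
      daily_days - keep every snapshot younger than this many days
      max_days   - discard anything older than this (roughly two months)
      step_days  - spacing of the older, sparser snapshots
    """
    keep = set()
    last_sparse = None
    for snap in sorted(snapshots, reverse=True):  # newest first
        age = (today - snap).days
        if age < 0 or age > max_days:
            continue  # future-dated or too old: not retained
        if age < daily_days:
            keep.add(snap)  # recent window: keep every day
        elif last_sparse is None or (last_sparse - snap).days >= step_days:
            keep.add(snap)  # older window: keep one snapshot per step
            last_sparse = snap
    return keep
```

A cron job would then delete every dated directory whose date is not in
the returned set.
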
Newegg's got on their front page today a 2TB internal drive for $60.
Four of those would give you a 6TB RAID5 array; are you really
generating so much data so fast that you would have to worry about
filling that up? Or is this an archival issue?
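
If the small media really are a requirement, the usual heuristic for
this kind of packing is first-fit decreasing: sort the items largest
first and drop each one onto the first volume that still has room. A
minimal sketch in Python follows. The function names are made up for
illustration, the 2048-byte block size is an assumption (adjust it for
the target filesystem), and directory/metadata overhead is ignored. It
also does nothing about keeping /foo/bar near /foo/baz; one way to
approximate that would be to pass whole directories as items rather
than individual files.

```python
def on_disk_size(size_bytes, block_size=2048):
    """Round a size up to whole blocks (assumes even an empty file costs one block)."""
    blocks = max(1, (size_bytes + block_size - 1) // block_size)
    return blocks * block_size

def pack_first_fit_decreasing(sizes, volume_size, block_size=2048):
    """Pack items onto volumes with the first-fit-decreasing heuristic.

    sizes: dict mapping an item name (file or directory) to its size in bytes.
    Returns a list of volumes, each a list of item names.
    This is a heuristic, so the result is not guaranteed to be minimal.
    """
    items = sorted(
        ((on_disk_size(s, block_size), name) for name, s in sizes.items()),
        reverse=True,  # largest first
    )
    volumes = []  # each entry is [free_bytes, [names]]
    for need, name in items:
        if need > volume_size:
            raise ValueError(f"{name} does not fit on a single volume")
        for vol in volumes:
            if vol[0] >= need:  # first volume with enough room
                vol[0] -= need
                vol[1].append(name)
                break
        else:
            volumes.append([volume_size - need, [name]])  # open a new volume
    return [names for _, names in volumes]

def min_volumes_lower_bound(sizes, volume_size, block_size=2048):
    """The 'ideal' count from the discussion: ceil(total on-disk size / volume size)."""
    total = sum(on_disk_size(s, block_size) for s in sizes.values())
    return (total + volume_size - 1) // volume_size
```

Comparing the number of volumes returned against
min_volumes_lower_bound() shows how far a given packing is from the
"ideal" case described above.
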
--
Rob Gaddi, Highland Technology
Email address is currently out of order