Motherboard Forums



"volume packing"

 
 
Don Y
Guest
Posts: n/a
 
      06-19-2011, 06:41 PM
Hi,

I'm looking for a tool that will let me pass "portions
of filesystems" to it (i.e., lists of file hierarchies)
along with a "volume size" and have *it* come up with
*an* "optimal" packing arrangement to put those
files onto the minimum number of volume-sized media.

[this is similar to packing software for distribution]

MS had a similar tool in one of their ancient kits
that was effective for packing *few* files onto floppies
(yes, *that* ancient!).

The tool should be able to take into account block
sizes and overheads of the target media (i.e., a 1 byte
file takes up a lot more than 1 byte!).
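
To make the overhead concrete, here is a rough sketch in Python (the
2048-byte cluster and 256-byte per-file metadata figures are just
placeholder assumptions, not properties of any particular medium):

import math

def on_media_size(file_bytes, cluster_bytes=2048, metadata_bytes=256):
    # Data is rounded up to whole clusters, plus a guessed per-file
    # directory/metadata cost.
    clusters = max(1, math.ceil(file_bytes / cluster_bytes))
    return clusters * cluster_bytes + metadata_bytes

# A 1-byte file still costs a whole cluster: on_media_size(1) -> 2304

So the packer has to budget for the *allocated* size of each file,
not its logical size.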

And, the tool should try to preserve related portions
of the hierarchy on the same medium, where possible.
(i.e., /foo/bar and /foo/baz should cling together
more tightly than either would with /bizzle/bag)

My current need is for a windows platform but I can
easily port something from "elsewhere" if need be.

Thanks!
--don
 
 
 
 
 
David Brown
Guest
Posts: n/a
 
      06-19-2011, 07:07 PM
On 19/06/11 20:41, Don Y wrote:
> Hi,
>
> I'm looking for a tool that will let me pass "portions
> of filesystems" to it (i.e., lists of file hierarchies)
> along with a "volume size" and have *it* come up with
> *an* "optimal" packing arrangement to put those
> files onto the minimum number of volume-sized media.
>


/Optimal/ cannot be done in any practical sense (at least, not in
polynomial time, as far as is currently known). You are looking for
something roughly equivalent to the "knapsack problem", which is a
well-known NP-complete problem. The best you could hope for is a
reasonable solution.

> [this is similar to packing software for distribution]
>
> MS had a similar tool in one of their ancient kits
> that was effective for packing *few* files onto floppies
> (yes, *that* ancient!).
>
> The tool should be able to take into account block
> sizes and overheads of the target media (i.e., a 1 byte
> file takes up a lot more than 1 byte!).
>
> And, the tool should try to preserve related portions
> of the hierarchy on the same medium, where possible.
> (i.e., /foo/bar and /foo/baz should cling together
> more tightly than either would with /bizzle/bag)
>
> My current need is for a windows platform but I can
> easily port something from "elsewhere" if need be.
>
> Thanks!
> --don


I'd be surprised if you find anything useful. Perhaps if you try to
describe what you are trying to do, and why it can't be handled by a
simpler solution, it may be possible to help. The only vaguely related
problem I could imagine is deciding how to split the packages of a large
Linux distro between different CD images (or different floppy images for
old distros, or different DVD images for the newest and biggest distros).
 
 
 
 
 
Don Y
Guest
Posts: n/a
 
      06-19-2011, 08:34 PM
Hi David,

On 6/19/2011 12:07 PM, David Brown wrote:
> On 19/06/11 20:41, Don Y wrote:


>> I'm looking for a tool that will let me pass "portions
>> of filesystems" to it (i.e., lists of file hierarchies)
>> along with a "volume size" and have *it* come up with
>> *an* "optimal" packing arrangement to put those
>> files onto the minimum number of volume-sized media.

>
> /Optimal/ cannot be done in any practical sense (at least, not in
> polynomial time, as far as is currently known).


Correct. I was trying to acknowledge this by emphasizing
"an" (optimal).

> You are looking for something roughly equivalent
> to the "knapsack problem", which is a well-known NP-complete problem.
> The best you could hope for is a reasonable solution.


Exactly. There are multiple (unknown) ways of packing a set
of objects onto a medium. In terms of "quality of solution",
they might be equivalent (i.e., two media -- A & B -- have
free spaces a1 and b1 in a given packing while an alternate
packing scheme can result in identical free spaces -- but different
*contents* on their respective media). Or, they can be "less ideal"
(i.e., total number of media remains the same -- and, thus, the
total amount of free space -- but the contents are spread out over
more volumes).

*An* ideal packing would use exactly total_size/volume_size volumes
with free_space on only the last volume.

*An* "optimal" packing would use the minimum number of volumes
necessary to contain all of the objects (ideal is better than
optimal).

[Of course, all of the above subject to the above constraints.]
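
(In code terms, the floor for any packing is the rounded-up ratio --
math.ceil(sum(sizes) / volume_size) in Python -- and an "ideal"
packing hits that floor exactly.)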

>> [this is similar to packing software for distribution]
>>
>> MS had a similar tool in one of their ancient kits
>> that was effective for packing *few* files onto floppies
>> (yes, *that* ancient!).
>>
>> The tool should be able to take into account block
>> sizes and overheads of the target media (i.e., a 1 byte
>> file takes up a lot more than 1 byte!).
>>
>> And, the tool should try to preserve related portions
>> of the hierarchy on the same medium, where possible.
>> (i.e., /foo/bar and /foo/baz should cling together
>> more tightly than either would with /bizzle/bag)
>>
>> My current need is for a windows platform but I can
>> easily port something from "elsewhere" if need be.

>
> I'd be surprised if you find anything useful. Perhaps if you try to
> describe what you are trying to do, and why it can't be handled by a
> simpler solution, it may be possible to help. The only vaguely related
> problem I could imagine is deciding how to split the packages of a large
> Linux distro between different CD images (or different floppy images for
> old distros, or different DVD images for the newest and biggest distros).


I generate large data sets (almost) daily. Up to this
point, I have just been generating (gzip'ed) tarballs
and split-ing them into "volume sized" pieces.

This works fine for getting them onto the media with
the least amount of effort/intervention. I.e., this is
an "ideal" packing (using the above terminology).

But, going back and trying to retrieve/inspect a portion of a
data set is a real PITA. It's easy to find a particular
*day*. But, with that set of media, the only reliable way
of finding what I'm looking for is to mount all of the media,
cat them together, pipe that through gunzip (or let tar do
it for me), then de-tar, locate the file of interest and
*extract* it.

It would be much nicer to just mount *a* medium, browse to
that part of the filesystem and see if the file sought is
present. If not, umount and move on to the next in the series.

(I think this would also be more tolerant of media errors -- with the
split tarballs, a bad read on volume N could make it impossible (?) to
retrieve the contents of volumes > N, depending on the nature of the
read error.)
 
 
Jon Kirwan
Guest
Posts: n/a
 
      06-19-2011, 10:28 PM
On Sun, 19 Jun 2011 13:34:56 -0700, Don Y <(E-Mail Removed)>
wrote:

><snip>
>*An* ideal packing would use exactly total_size/volume_size volumes
>with free_space on only the last volume.
>
>*An* "optimal" packing would use the minimum number of volumes
>necessary to contain all of the objects (ideal is better than
>optimal).
>
>[Of course, all of the above subject to the above constraints.]
><snip>


I'd probably want to _add_ the requirement that if one of the
media is destroyed or otherwise non-functional, either partially
or completely, the rest can still be recovered and used.

This is reminding me of two areas.  One is hibernation and sleep
tapes -- we'd lose part of a tape by getting it tangled up, for
example, and could splice it back into working order but would
lose a file.  The other is breaking up transmissions over the
internet into smaller sections for broadcast: compression was
desperately wanted (we were using 2400 bps modems), but so was
handling the files within the set so that, should one of the
sections be lost, the rest could still be recovered (it often
wasn't possible to ask for a repeated broadcast).

I don't have an answer off the top of my head. But one place
I'd look is towards Dr. Knuth. He might be receptive to an
email with this kind of question and have some thoughts about
where to look or what to think more about. I'd consider
writing him were I you. Just on a lark.

Jon
 
 
Don Y
Guest
Posts: n/a
 
      06-20-2011, 12:52 AM
Hi Jon,

On 6/19/2011 3:28 PM, Jon Kirwan wrote:
> On Sun, 19 Jun 2011 13:34:56 -0700, Don Y<(E-Mail Removed)>
> wrote:
>
>> *An* ideal packing would use exactly total_size/volume_size volumes
>> with free_space on only the last volume.
>>
>> *An* "optimal" packing would use the minimum number of volumes
>> necessary to contain all of the objects (ideal is better than
>> optimal).

>
> I'd probably want to _add_ that if one of the media is
> destroyed or otherwise non-functional either partially or
> completely, that the rest can still be recovered and used.


You could do that by treating the volumes as a RAID array.
But, that increases the amount of data you preserve.

These aren't long term backups. Rather, I might want to look
at something that I did a few days ago to compare against
a new dataset.

> This is reminding me of two areas -- hibernation and sleep
> tapes (we'd lose part of the tape in getting it tangled up,
> for example, and could splice it back into functionality but
> would lose a file) and also breaking up transmissions over
> the internet into smaller sections for broadcast, where
> compression was desperately wanted (we were using 2400 bps
> modems) but so was the handling of files within the set
> should it come to pass that one of the sections was lost, the
> rest could still be recovered (it often wasn't possible to
> ask for a repeated broadcast of it.)
>
> I don't have an answer off the top of my head. But one place
> I'd look is towards Dr. Knuth. He might be receptive to an
> email with this kind of question and have some thoughts about
> where to look or what to think more about. I'd consider
> writing him were I you. Just on a lark.


If I can't find something ready made, I'll just brute force
something. The advantage, there, is that I know how *my*
datasets are organized so I can take advantage of that in
the algorithms. E.g., if "1, 2, 3, 4, 5, 28" form a "best fit"
for a volume, it would probably be "friendlier" (to me) to
use "1, 2, 3, 4, 5, 6" and have a little extra waste.
 
 
Nobody
Guest
Posts: n/a
 
      06-20-2011, 09:35 AM
On Sun, 19 Jun 2011 21:07:31 +0200, David Brown wrote:

> You are looking for something roughly equivalent to the "knapsack
> problem", which is a well-known NP-complete problem.


This specific case is the "bin packing problem", which is NP-hard.
According to Wikipedia, approximation algorithms exist which
asymptotically use only ~20% more bins than an optimal solution.

Depending upon the typical file sizes, a brute-force (i.e. optimal)
solution might not be out of the question. If most of the files are only a
few blocks, that would drastically reduce the number of distinct
partitions.
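
For reference, the classic first-fit-decreasing heuristic is only a
few lines -- a sketch, assuming the item sizes already include block
rounding and that no single item exceeds the volume size:

def first_fit_decreasing(sizes, volume_size):
    # Place items largest-first into the first volume with room,
    # opening a new volume when none has space.  Worst case is roughly
    # 11/9 of the optimal volume count.
    volumes, free = [], []
    for i in sorted(range(len(sizes)), key=lambda k: sizes[k], reverse=True):
        for v in range(len(free)):
            if sizes[i] <= free[v]:
                volumes[v].append(i)
                free[v] -= sizes[i]
                break
        else:
            volumes.append([i])
            free.append(volume_size - sizes[i])
    return volumes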

 
 
nospam
Guest
Posts: n/a
 
      06-20-2011, 12:55 PM
Don Y <(E-Mail Removed)> wrote:

http://bttb.sourceforge.net/
http://lars.werner.no/?page_id=2
 
 
Vladimir Vassilevsky
Guest
Posts: n/a
 
      06-20-2011, 01:25 PM


Nobody wrote:

> On Sun, 19 Jun 2011 21:07:31 +0200, David Brown wrote:
>
>
>>You are looking for something roughly equivalent to the "knapsack
>>problem", which is a well-known NP-complete problem.

>
>
> This specific case is the "bin packing problem", which is NP-hard.
> According to wikipedia, approximate algorithms exist which asymptotically
> use only ~20% more bins than an optimal solution.


This is not what they meant.
For a *large* number of *random* items, the *average* result achieved by
greedy algorithms would be somewhere around 20% worse than optimal.
Iterative algorithms could get close to optimal, depending on how much
processing time is available.


 
 
Frnak McKenney
Guest
Posts: n/a
 
      06-20-2011, 01:32 PM
On Sun, 19 Jun 2011 13:34:56 -0700, Don Y <(E-Mail Removed)> wrote:
> Hi David,
>
> On 6/19/2011 12:07 PM, David Brown wrote:
>> On 19/06/11 20:41, Don Y wrote:

>
>>> I'm looking for a tool that will let me pass "portions
>>> of filesystems" to it (i.e., lists of file hierarchies)
>>> along with a "volume size" and have *it* come up with
>>> *an* "optimal" packing arrangement to put those
>>> files onto the minimum number of volume-sized media.

--snip--

> I generate large data sets (almost) daily. Up to this
> point, I have just been generating (gzip'ed) tarballs
> and split-ing them into "volume sized" pieces.
>
> This works fine for getting them onto the media with
> the least amount of effort/intervention. I.e., this is
> an "ideal" packing (using the above terminology).
>
> But, going back and trying to retrieve/inspect a portion of a
> data set is a real PITA. It's easy to find a particular
> *day*. But, with that set of media, the only reliable way
> of finding what I'm looking for is to mount all of the media,
> cat them together, pipe that through gunzip (or let tar do
> it for me), then de-tar, locate the file of interest and
> *extract* it.
>
> It would be much nicer to just mount *a* medium, browse to
> that part of the filesystem and see if the file sought is
> present. If not, umount and move on to the next in the series.


Hi, Don.

Splitting up your problem and only attacking the portion I can claim
any expertise on... why not just check the 1403 printout to see
which of the 2400' tape reels contains your file? <grin!>

More seriously, many of the desktop-PC backup programs I've used
keep an index to help you figure out which diskette/tape cassette
contains a particular file. Or, often more useful, remind you that
you had deleted the file prior to that particular backup, and that
you needed to be checking an earlier backup set.

If you don't mind using 'grep', this could be as simple as a text
file with the backup date, file names, media volume IDs, and
timestamps:

grep critical_file Backup-Index-bagheera-2011-06-01.log

yields:

/etc/sysadmin/critical_file.cfg 2011-05-15 02:41:12 Diskette 12

The actual _construction_ of this log file depends on your specific
situation, and is therefore left as an exercise for the student
(accompanied by the usual hand-waving and superior look <grin!>).
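
If you want a starting point for that exercise, here's a minimal
sketch in Python (the label and mount-point arguments are whatever
your media setup uses -- purely illustrative names):

import os, time

def append_index(index_path, volume_label, mount_point):
    # Walk one packed volume and append "path <TAB> mtime <TAB> volume"
    # lines to a plain-text index that grep can search later.
    with open(index_path, "a") as index:
        for root, _dirs, files in os.walk(mount_point):
            for name in files:
                full = os.path.join(root, name)
                mtime = time.strftime("%Y-%m-%d %H:%M:%S",
                                      time.localtime(os.path.getmtime(full)))
                rel = os.path.relpath(full, mount_point)
                index.write(rel + "\t" + mtime + "\t" + volume_label + "\n")

# e.g. append_index("Backup-Index-bagheera-2011-06-01.log",
#                   "Diskette 12", "D:\\")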

Does this help?


Frank McKenney
--
Until the millennium comes and countries cease trying to enslave
others, it will be necessary to accept one's responsibilities
and to be willing to make sacrifices for one's country -- as my
comrades did. As the troops used to say, "If the country is good
enough to live in, it's good enough to fight for." With privilege
goes responsibility.
-- E.B. Sledge / With the Old Breed at Peleliu and Okinawa
--
Frank McKenney, McKenney Associates
Richmond, Virginia / (804) 320-4887
Munged E-mail: frank uscore mckenney ayut mined spring dawt cahm (y'all)

 
 
Rob Gaddi
Guest
Posts: n/a
 
      06-20-2011, 04:43 PM
On 6/19/2011 1:34 PM, Don Y wrote:
> Hi David,
>
> On 6/19/2011 12:07 PM, David Brown wrote:
>> On 19/06/11 20:41, Don Y wrote:

>
>>> I'm looking for a tool that will let me pass "portions
>>> of filesystems" to it (i.e., lists of file hierarchies)
>>> along with a "volume size" and have *it* come up with
>>> *an* "optimal" packing arrangement to put those
>>> files onto the minimum number of volume-sized media.

>>
>> /Optimal/ cannot be done in any practical sense (at least, not in
>> polynomial time, as far as is currently known).

>
> Correct. I was trying to acknowledge this by emphasizing
> "an" (optimal).
>
>> You are looking for something roughly equivalent
>> to the "knapsack problem", which is a well-known NP-complete problem.
>> The best you could hope for is a reasonable solution.

>
> Exactly. There are multiple (unknown) ways of packing a set
> of objects onto a medium. In terms of "quality of solution",
> they might be equivalent (i.e., two media -- A & B -- have
> free spaces a1 and b1 in a given packing while an alternate
> packing scheme can result in identical free spaces -- but different
> *contents* on their respective media). Or, they can be "less ideal"
> (i.e., total number of media remains the same -- and, thus, the
> total amount of free space -- but the contents are spread out over
> more volumes).
>
> *An* ideal packing would use exactly total_size/volume_size volumes
> with free_space on only the last volume.
>
> *An* "optimal" packing would use the minimum number of volumes
> necessary to contain all of the objects (ideal is better than
> optimal).
>
> [Of course, all of the above subject to the above constraints.]
>
>>> [this is similar to packing software for distribution]
>>>
>>> MS had a similar tool in one of their ancient kits
>>> that was effective for packing *few* files onto floppies
>>> (yes, *that* ancient!).
>>>
>>> The tool should be able to take into account block
>>> sizes and overheads of the target media (i.e., a 1 byte
>>> file takes up a lot more than 1 byte!).
>>>
>>> And, the tool should try to preserve related portions
>>> of the hierarchy on the same medium, where possible.
>>> (i.e., /foo/bar and /foo/baz should cling together
>>> more tightly than either would with /bizzle/bag)
>>>
>>> My current need is for a windows platform but I can
>>> easily port something from "elsewhere" if need be.

>>
>> I'd be surprised if you find anything useful. Perhaps if you try to
>> describe what you are trying to do, and why it can't be handled by a
>> simpler solution, it may be possible to help. The only vaguely related
>> problem I could imagine is deciding how to split the packages of a large
>> Linux distro between different CD images (or different floppy images for
>> old distros, or different DVD images for the newest and biggest distros).

>
> I generate large data sets (almost) daily. Up to this
> point, I have just been generating (gzip'ed) tarballs
> and split-ing them into "volume sized" pieces.
>
> This works fine for getting them onto the media with
> the least amount of effort/intervention. I.e., this is
> an "ideal" packing (using the above terminology).
>
> But, going back and trying to retrieve/inspect a portion of a
> data set is a real PITA. It's easy to find a particular
> *day*. But, with that set of media, the only reliable way
> of finding what I'm looking for is to mount all of the media,
> cat them together, pipe that through gunzip (or let tar do
> it for me), then de-tar, locate the file of interest and
> *extract* it.
>
> It would be much nicer to just mount *a* medium, browse to
> that part of the filesystem and see if the file sought is
> present. If not, umount and move on to the next in the series.
>
> (I think this would also be more tolerant of media errors -- with the
> split tarballs, a bad read on volume N could make it impossible (?) to
> retrieve the contents of volumes > N, depending on the nature of the
> read error.)


Don --

Do you really need to put this on some sort of "small" (no more than say
4.7GB) media? Our level 1 backup system here is a 1.5TB USB hard drive,
on which we keep, simply as directories, full copies of the entire
server daily for the last two weeks and biweekly for the last two
months. I think it cost $100, thirty lines of bash script, and two cron
jobs.

Newegg's got on their front page today a 2TB internal drive for $60.
Four of those would give you a 6TB RAID5 array; are you really
generating so much data so fast that you would have to worry about
running that out? Or is this an archival issue?

--
Rob Gaddi, Highland Technology
Email address is currently out of order
 