
Sun T1000/T2000 + ZFS + SAS performance

Discussion in 'Sun Hardware' started by jwa, Aug 4, 2006.

  1. jwa

    jwa Guest

    Hello -- we have ~100TB spread over about 45 PCs with
    SATA disks + FreeBSD. We're looking at reducing the
    number of machines we have to maintain, and are
    considering using a much smaller number of T1000 or
    T2000 boxes + SAS + ZFS/RAIDZ + as many 16-bay
    JBOD SAS arrays as we can.

    Availability isn't critical (i.e., if a T1000 crashes for a few
    days, it's OK), but we don't want to lose data or have a bad
    drive corrupt files, so we would like to use ZFS w/ RAIDZ as the
    underlying filesystem. We also want to minimize power consumption
    (thus the Tx000 boxes).

    Questions:

    * Are there any PCI-Express SAS cards that work
    w/ Solaris 10/sparc? Ideally, we could throw a
    PCI-Express SAS card in a T1000 and attach N
    terabytes of storage to it. (where N >= 30,
    hopefully :)

    * Any general recommendations as to how much
    storage to attach to a single T1000 or T2000?

    * What sort of ZFS gotchas would we encounter on
    volumes this large?

    Thanks!
    James
     
    jwa, Aug 4, 2006
    #1

  2. Frank Cusack

    Frank Cusack Guest

    Plus zfs has snapshots and has great management features
    yadda yadda.
    I've been battling this one for a while. Good luck. The LSI Logic
    3442E-R (they don't make a non-R [non-RAID] version) seems to work,
    but I can only see the drive enclosure, not the drives. See my post
    "I can see the scsi enclosure but no disks". A 3442X (same
    controller, different host interface) works just fine in an x4100,
    so the difference on the t1000 seems like it must be [lack of]
    FCode. But if that's the case, why does the card work at all?

    There's a firmware patch for the onboard LSI SAS controller, I was
    going to try to apply it to the 3442E and see what happens. The card
    isn't useful to me if it doesn't work, so if I kill it, then
    "whatever".

    If that doesn't work I was going to get an iscsi array.

    What array/JBOD are you looking at? I have a promise J300s. Let me
    know if you want to buy it. ;-)
    Well isn't that based on your availability and performance requirements?
    I just learned (on zfs-discuss) that you should pay attention to the
    man page when it says the maximum number of disks in a raidz should
    be 9. So for 16-drive arrays you might want three 5-drive raidz's
    and save a drive for a hot spare (support coming in U3). Note that
    zfs can combine all these raidz's into a single pool, so you don't
    have to worry about free space management.
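
    Just to make that layout concrete, here's a rough sketch of the pool
    creation (device names are invented, and the "spare" line assumes
    the U3 hot-spare support mentioned above):

    zpool create tank \
        raidz c1t0d0  c1t1d0  c1t2d0  c1t3d0  c1t4d0 \
        raidz c1t5d0  c1t6d0  c1t7d0  c1t8d0  c1t9d0 \
        raidz c1t10d0 c1t11d0 c1t12d0 c1t13d0 c1t14d0 \
        spare c1t15d0

    All three raidz vdevs land in the single pool "tank", so free space
    is shared across the whole 16-bay shelf.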

    -frank
     
    Frank Cusack, Aug 4, 2006
    #2

  3. Dan Foster

    Dan Foster Guest

    Here's a thought:

    Get drive arrays that have SAS or SATA disks internally and connect
    from the array to the host via FC.

    For instance, the AC&NC Jetstor 416FC4 array supports 16 disks (SCSI,
    SATA, unfortunately no SAS) and supports dual 4 Gb/sec FC paths.

    It supports 750 GB Seagate SATA disks, too. So one array would be 12 TB
    raw. Not sure cost... probably around USD $22,000??

    SATA disks in these Jetstors do not live as long as their SCSI
    counterparts, however. Our SATA disks seem to last about three years,
    whereas our SCSI disks last five to seven years. That's part of the
    price tradeoff. You get what you pay for. :)

    SCSI and SAS would be nice, but they're not as high capacity as SATA,
    which means you need far more disks (and arrays)...

    Solaris FC drivers are stable and very well supported, and FC cards for
    SPARC are plentiful.

    Sun even sells StorageTek (bought out by Sun a while ago) dual port 4
    Gb/sec FC PCI-e HBAs for SPARC.

    FC HBA options possible:

    1 Gb/sec, 2 Gb/sec, 4 Gb/sec.

    Single port, dual port.

    PCI-X, PCI-e.

    Most importantly, drivers are solid and there are at least two major
    card manufacturers with good Solaris support: Emulex and Qlogic. (JNI
    was bought out by AMCC then AMCC dropped FC HBA support. Too bad.)

    The T2000 has 3 PCI-e slots and 2 PCI-X slots, I believe.

    If you use FC to connect to your storage... you could throw in a SAN
    switch. Fewer cables to the host and, therefore, fewer cards. Also,
    zoned setups make it possible for other hosts to see the disks.

    If you did that... then you could easily run Sun Cluster 3.1 (free). The
    T2000 has enough network ports to do public interfaces, private
    interfaces, and cluster interconnects.

    Between two or more cluster nodes, availability is pretty good. You're
    probably going to have to get at least two servers anyhow... might as
    well not make the disk-to-host attachment a single point of failure.

    But if you're thinking of 100 TB of disk, that's hundreds to thousands
    of disks... you probably really want to consider an enterprise disk
    storage system (e.g. IBM DS6000, DS8000, EMC Clariion, Sun StorageTek,
    Hitachi HDS, etc) for the management and support angle.

    Think about it this way: if you went with 146 GB SAS disks, you'd need
    about 685 of them to reach 100 TB. With 750 GB SATA disks, you'd only
    need about 133.

    Our average disk failure rate (starting around year 4 or year 5 of
    life) is about 1.5% per year, meaning with 685 disks you'd want at
    least 10 spare drives on hand for swapouts as disks fail. With 133
    disks, at least 2 spares.
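
    Spelling out that arithmetic (decimal units, raw capacity, ignoring
    RAID overhead):

    echo "100000 / 146" | bc -l    # ~685 x 146 GB SAS drives for 100 TB
    echo "100000 / 750" | bc -l    # ~133 x 750 GB SATA drives
    echo "685 * 0.015"  | bc -l    # ~10 expected failures/year at 1.5%
    echo "133 * 0.015"  | bc -l    # ~2 expected failures/year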

    You had better have really good monitoring systems to keep an eye on
    degrading and failed disks, especially if it's 133, 685, or even more. :)

    I would still recommend FC interconnects, no matter what the disk array
    is -- whether high or low end.

    -Dan

    P.S. I don't work for Sun or any system or storage vendors. I also hate
    sales people. ;) I'm just a customer with a pile of Suns and disks.
     
    Dan Foster, Aug 4, 2006
    #3
  4. Hi James,
    This doesn't answer your questions, but have you considered using the
    X4500 (aka Thumper)?

    http://www.sun.com/servers/x64/x4500/

    With 24TB per box, 5 boxes could give you the ~100TB you wanted, or
    were you hoping to reuse the drives from your existing machines?

    Kind Regards,

    Nathan Dietsch
     
    Nathan Dietsch, Aug 4, 2006
    #4
  5. Frank Cusack

    Frank Cusack Guest

    Yeah, but compare that to a promise J300s (only 12 drives), about
    $7200 for 9TB. Plus only about $300 for the controller (and that's
    only because you are forced to pay for the raid) vs what $1000 or even
    $2000?

    If iscsi is acceptable, the promise M500i is ~$5400, so then you get
    11.2TB for $11500, still half the price of FC and no HBA to buy. I
    guess the major disadvantage here is that it's slow compared to
    either SAS or FC. Maybe we'll see 10Gb iscsi soon.

    hmm, I was going to say that given the use of zfs, maybe the Apple RAID
    is acceptable. (You wouldn't want to use the Apple RAID without zfs.)
    The problem with that guy is that it's Ultra-ATA, not serial ATA, and
    it's not as dense as other solutions. And you might end up having to
    deal with Apple support. But running the numbers, it seems expensive:
    $13k for 7TB (and it eats 3U of space for half the data).

    hmmm I had just assumed the FC->SATA arrays were much more expensive
    than SAS (given numbers like $22k) but the promise M500f is just $5k.
    So 11.2TB for $11k seems a good deal. On the downside it only
    supports a single controller (dual ports) so there is less
    availability. (But this is the same for the Jetstor.)

    Getting back to Apple, don't be confused about the dual controllers;
    the Apple RAID is two distinct arrays in one box.

    -frank
     
    Frank Cusack, Aug 4, 2006
    #5
  6. jwa

    jwa Guest

    Unfortunately, that won't fit in a T1000 (although since it doesn't
    seem to work, it may be moot).
    Possibly something like the Adaptec SANbloc S50:

    http://www.adaptec.com/en-US/products/nas/expansion_arrays/SANbloc_S50_JBOD/

    SAS port on the back, takes SAS or SATA disks.
    Put another way: assuming we're spreading out I/O across nearly
    all drives, at what point (physical drive count) does performance
    peak? At what rate does performance degrade?
    Good point.

    wrt hot spare replacement, it should be possible to script something
    to do the replacement; either tail the syslog or run zpool status
    periodically to look for failures, and then enable spares as
    appropriate. I've done this manually in only a few steps, so
    it can't be that hard to script..
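
    A rough sketch of what such a script might look like (the pool name
    and spare device are made up, and the zpool status parsing is naive):

    #!/bin/sh
    # Poll for an unhealthy pool and swap a designated spare in by hand.
    POOL=tank
    SPARE=c1t15d0
    while true; do
        zpool status -x | grep "all pools are healthy" > /dev/null
        if [ $? -ne 0 ]; then
            # crude parse: grab the first FAULTED/UNAVAIL device name
            bad=`zpool status $POOL | awk '$2 == "FAULTED" || $2 == "UNAVAIL" { print $1; exit }'`
            [ -n "$bad" ] && zpool replace $POOL $bad $SPARE
        fi
        sleep 300
    done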

    James
     
    jwa, Aug 4, 2006
    #6
  7. jwa

    jwa Guest

    Actually, I've thought about using SCSI -> SATA enclosures, since
    it's easier to find compatible SCSI cards for Solaris than SAS.
    We'd use the 750GB SATA Seagates.
    As is SCSI... but like SAS, I don't know where the performance peaks
    and begins to degrade.
    Why FC vs. SCSI? Is it just about better performance, or thinner
    cables? :)

    James
     
    jwa, Aug 4, 2006
    #7
  8. jwa

    jwa Guest

    Yup, we have... but it seems to be more cost-effective to use
    a T1000 and a bunch of SAS->SATA shelves. How well this
    combination actually works remains to be determined :)

    James
     
    jwa, Aug 4, 2006
    #8
  9. Dan Foster

    Dan Foster Guest

    Well, it's a lot easier on the host wiring for a clustered setup if you
    only have to run a couple FC cables from a switch to the host... and
    this also makes it easier to have a cluster of more than two hosts if
    ever desired.

    This also relieves slot pressure on the host systems. If you only need
    to wire up, let's say, 4 FC cables to access all 12-16 arrays, then...
    you'd only need two dual-port FC HBAs in each host, leaving a few slots
    free for future expansion in a T2000. Or maybe you could just get by
    with a T1000 and a single dual port FC HBA.

    With SCSI, you'd be limited to only two hosts in a clustered setup (for
    the typical SCSI-attached disk array with dual ports) and you'd have to
    wire up all the arrays to both hosts. It works, but is relatively ugly,
    inelegant, etc.

    FC cables also seem to be more resistant to damage than SCSI, in my
    experience. Both can be damaged, but the SCSI cables I've had are more
    sensitive to reflection due to kinks or even bends.

    Distances are also *MUCH* greater with FC than with SCSI. 1.5-6m for SE
    SCSI, 12m for LVD SCSI, 25m for HVD SCSI, 300-500m for FC (100-300km
    with FC channel extenders).

    Distance is good for two things:

    1. Less pressure to pick nearby locations if your data center is short
    on space and things end up a little far apart, or if some equipment is
    managed by another department in another section of the computer room.

    2. Less likely to run into attenuation issues with FC than with SCSI
    over computer room distances.

    Bandwidth: 4 Gb/sec FC would be 500 MB/sec; SCSI peaks out at 320 MB/sec.
    10 Gb/sec FC (1250 MB/sec) is already starting to appear.

    With larger SCSI setups, you need a small farm of goats for sacrifice to
    appease Murphy. ;) All kidding aside, we make use of both SCSI and FC,
    but for a large-scale setup, it is generally a lot easier to scale
    further with FC than with SCSI. When you're looking at a 100+ TB setup,
    you definitely want something that will scale.

    12 arrays to 2 clustered hosts via a SAN switch would require a minimum
    of perhaps... (12 * 1) + (3 * 2) = 18 FC cables. Thereafter, you can add
    more hosts just by adding (let's say) 3 FC cables per host and adjusting
    zoning on the SAN switch.

    12 arrays to 2 hosts via SCSI would require a minimum of (12 * 2) = 24
    bigger cables that would have to be recabled if you ever wanted to move
    them to other hosts in the future.

    BTW, FC-connected storage uses SCSI commands to manage things... so...
    think of FC as being SCSI without SCSI cabling limitations.

    -Dan
     
    Dan Foster, Aug 4, 2006
    #9
  10. Rich Teer

    Rich Teer Guest

    Forgive the FC-newbie question, but I'm not following your math. 12 * 1 I
    get (1 FC cable from each of the 12 arrays to the one SAN switch), but I
    don't understand why the two hosts would need 3 cables each (unless
    they're being trunked for more bandwidth)?

    --
    Rich Teer, SCNA, SCSA, OpenSolaris CAB member

    President,
    Rite Online Inc.

    Voice: +1 (250) 979-1638
    URL: http://www.rite-group.com/rich
     
    Rich Teer, Aug 4, 2006
    #10
  11. Frank Cusack

    Frank Cusack Guest

    ?

    It's in my T1000 right now.

    -frank
     
    Frank Cusack, Aug 4, 2006
    #11
  12. Frank Cusack

    Frank Cusack Guest

    I'd have done that also, but SCSI = problems.

    -frank
     
    Frank Cusack, Aug 4, 2006
    #12
  13. Frank Cusack

    Frank Cusack Guest

    I tried to buy that before I got the promise. At the time (2 months
    ago), the product was vaporware, although they were taking orders
    which just sat forever in some backorder queue.

    Even today, you can't download documentation and you can't buy it from
    the Adaptec online store. I doubt you'd get this product anytime soon.

    -frank
     
    Frank Cusack, Aug 4, 2006
    #13
  14. Dan Foster

    Dan Foster Guest

    3 was an arbitrary number. You need at least 2 if you want to do load
    balancing and multipathing (for redundancy), and more than 2 if you
    need greater aggregate bandwidth.

    -Dan
     
    Dan Foster, Aug 4, 2006
    #14
  15. Frank Cusack

    Frank Cusack Guest

    Well, this still depends on your application. Random writes? Sequential
    reads? etc.

    But Tom's Hardware says the seagate 750 does 63.5 MB/s read (and write),
    so with a 4x SAS interface at 1200MB/s, 18 drives will saturate the SAS link.

    You're likely to reach the peak at fewer than 18 drives, thanks to
    cache effects. And I guess the 1200MB/s number doesn't include
    protocol overhead. So 12-16 drives per interface might be a starting
    ballpark. But you'd really want to test it against your
    application. Maybe you'd peak at only 6 drives; then it'd be smarter
    to get lower-capacity arrays (if performance, as opposed to space or
    power, is your primary concern). Maybe you'd peak at 32 drives; then
    you'd want to chain arrays together and buy fewer CPUs/HBAs.
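
    The back-of-the-envelope version (ignoring protocol overhead, using
    the 63.5 MB/s and 4 x 3Gb/s figures above):

    # 4 lanes x ~300 MB/s per 3Gb/s lane = ~1200 MB/s for the wide port
    echo "1200 / 63.5" | bc -l    # ~18.9, so 18-19 drives saturate the link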

    -frank
     
    Frank Cusack, Aug 4, 2006
    #15
  16. Frank Cusack

    Frank Cusack Guest

    yup, pretty easy, but U3 will make it automatic.

    -frank
     
    Frank Cusack, Aug 4, 2006
    #16
  17. Frank Cusack

    Frank Cusack Guest

    Hey, wow! I just got this working. Sorry this is so long, but I'm
    pretty excited ... it's fun to tell the story.

    In my testing, I was actually using the enclosure "live" on the x4100,
    which has 2 disks mounted (and there were only 2 disks in the enclosure).

    I then attached the T1000 to the enclosure (port 2 on the single controller),
    and it wouldn't see the disks. I thought maybe just maybe it is something
    to do with the x4100 "owning" those disks, somehow. So I disconnected and
    shut off the x4100, and attached only the t1000. Still no love ... which
    is what I expected.

    Since you got me going again on this, I remember having this kind of
    problem with a 3511 JBOD when attaching it to a SAN. (3511 JBOD is
    unsupported as a direct attach array, it's only supported as an
    expansion unit for a 3510 or 3511 RAID.) When attaching the 3511
    directly, I could use it fine, but with a switch in between, I could
    only see the enclosure. So based on the SAN hint I found luxadm and
    thought I'd play with that.

    I hooked the t1000 back up to the jbod, but this time I added a new
    drive -- a seagate 750gb. Then, without doing anything else, lo and
    behold, cfgadm -al saw a disk:

    # cfgadm -al
    Ap_Id            Type       Receptacle   Occupant     Condition
    c0               scsi-bus   connected    configured   unknown
    c0::dsk/c0t0d0   disk       connected    configured   unknown
    c1               scsi-bus   connected    configured   unknown
    c1::es/ses0      ESI        connected    configured   unknown
    c1::sd28         disk       connected    configured   unknown
    #

    However, format didn't see it. After 'devfsadm -c disk', format saw it
    as c1t13, and the cfgadm output changed as well. After doing the same
    on the x4100, it still doesn't see the new disk. So there must be some
    ownership of drives going on. Why the t1000 couldn't see the original
    2 drives when I turned off the x4100, I still don't get.
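
    So the discovery sequence boils down to something like this
    (controller and target numbers will obviously differ per setup):

    cfgadm -al          # enclosure plus the raw disk show up on c1
    devfsadm -c disk    # build the /dev/dsk links
    format              # the new disk is now visible (c1t13 here)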

    Yay! This is a pretty cheap solution. Certainly performance won't be
    as good as FC or SCSI disks (for one thing because the seagates are
    only 7200 RPM cf 10k or 15k) but for me, zfs (or even svm; slow raid5
    write is fine for my app) will take care of that. My application is
    large sequential reads of static data so I should be able to take
    enough advantage of striping that the inherent per-disk performance
    difference won't matter. The drives themselves are probably not
    nearly as reliable either. I'm ok with that.

    So we have

    promise j300s     $2200
    lsi 3442e-r        $340
    seagate 750gb     $5400  (13 x $415 -- always buy a spare!)
                      -----
                      $7940

    = $0.88 / GB, excluding the computer. $1.43/GB if you add a $5k
    T1000. I think that's going to be hard to beat. (unless you use
    a $1000 x2100 == $0.99/GB, but there are other reasons to use T1000)
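
    For reference, where the per-GB numbers come from (assuming 12 usable
    drives, with the 13th kept as the spare):

    echo "7940 / (12 * 750)" | bc -l            # ~$0.88/GB, array only
    echo "(7940 + 5000) / (12 * 750)" | bc -l   # with a $5k T1000 added
    echo "(7940 + 1000) / (12 * 750)" | bc -l   # with a $1000 x2100 instead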

    The controller on the j300s has 2 ports (plus an expansion port), so
    you can dual-attach to protect against host failure; basically this
    comes free if you are already going to buy multiple hosts. (Assuming
    there's a way to get 2 hosts to see the same drives.) An additional
    controller for the JBOD is available for only $838 if you want.

    There are SAS switches (a la SAN switches) that should be available
    within the next year or so for a real storage network solution.

    Oh, last note, the rackmount rails for the promise are horrendous.
    Hopefully you can get some kind of mounting rails 3rd party -- I was
    able to use APC rails.

    -frank
     
    Frank Cusack, Aug 5, 2006
    #17
  18. Frank Cusack

    Frank Cusack Guest

    Looks like the problem was that I am using different drivers on the
    x4100 vs the t1000. On the x4100 it's S10U1 with the LSI itmpt driver,
    and on the t1000 it's S10U2 with the Sun mpt driver. The two drivers
    are setting the drive IDs differently, which has to cause conflicts.
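
    A quick way to check which driver each box has bound to the HBA
    (both are stock Solaris commands):

    prtconf -D | grep -i mpt    # driver name attached to each device node
    modinfo | grep -i mpt       # shows whether itmpt or mpt is loaded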

    If I power off either system, the other sees all the drives.

    -frank
     
    Frank Cusack, Aug 5, 2006
    #18
  19. I've got a raidz pool of six 146GB spindles. The CPU overhead
    is pretty massive, and the performance bottleneck comes down to the
    fact that an E250 with 2 x 400MHz CPUs can't deliver enough CPU time
    to run the spindles at anything approaching full speed. I tried a
    280R with 12 x 9GB FCAL spindles and the performance was similarly
    sluggish.
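
    A quick way to watch that happen (pool name assumed to be "tank";
    both tools ship with Solaris):

    zpool iostat tank 5    # pool throughput levels off well below the disks
    mpstat 5               # while idle sits near zero: CPU-bound, not disk-bound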
     
    Andre van Eyssen, Aug 5, 2006
    #19
  20. Maybe Sun's x4500 (aka Thumper) would be better?
     
    Robert Milkowski, Aug 5, 2006
    #20