How to copy the top level of a website to get rid of Javascript etc?

Discussion in 'Apple' started by Paul Sture, Feb 19, 2006.

  1. Paul Sture

    Paul Sture Guest

    The problem:

    I frequently visit a couple of websites (not my own) that I wish to make
    more accessible to disabled and older folks (and, for that matter, to
    those who only have a trackpad on a laptop).

    The problem is that the main pages are full of Javascript and extremely
    tricky to navigate with a mouse, even for someone with good motor
    coordination and eyesight.

    What I would like to do is grab the main page(s), rip out the highly
    graphical mouseover whizz-bangs, and put a plain, HTML-compliant menu in
    their place.

    For my initial attempt I want to host the simplified index pages on my
    own systems, but I do not wish to use my own bandwidth for more than
    that, and of course I want to avoid copyright infringement issues, so I
    don't want to copy the information beneath those pages.

    I have Apache on both Mac OS X and VMS (hence my posting to both
    comp.os.vms and comp.sys.mac.system) and would like to know how best to
    do this.

    wget and htdig come to mind, but I am open to all suggestions.
     
    Paul Sture, Feb 19, 2006
    #1

  2. Okay, so you're set to host whatever you come up with.
    Well, htdig isn't going to do much for you. (It's an indexer program that
    crawls web pages; you'd have to get into the crawler part, make it _not_
    follow links, and also save intermediate results, and that doesn't buy you
    anything over using a tool built for fetching.)

    Wget can suck down the page; so can cURL. You can also use Perl with libwww,
    and I'd expect there's a Python equivalent. But if this is a one-time thing,
    you can also just use a browser to save the page.
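
    For instance, a minimal libwww-perl fetch of the front page might look
    something like this (the URL and filename are just placeholders; a wget
    or cURL one-liner would do the same job):

      #!/usr/bin/perl
      # Fetch one page and save a local copy to edit by hand.
      use strict;
      use warnings;
      use LWP::UserAgent;

      my $url = 'http://www.example.com/';   # site whose index you want to simplify
      my $out = 'original-index.html';       # local copy to strip down

      my $ua   = LWP::UserAgent->new(agent => 'index-mirror/0.1');
      my $resp = $ua->get($url);
      die "Fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

      open my $fh, '>', $out or die "Can't write $out: $!\n";
      print {$fh} $resp->decoded_content;    # decoded_content sorts out the charset
      close $fh;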

    But if you're concerned with ongoing issues of accessibility to the websites,
    you probably want to automate fetching _and editing_ the index page as much
    as possible. I would tend to do this on VMS because of the robust batch
    system, but a cron job on OS X is likely to be good enough as well.

    I'd use Perl, running in batch.

    In addition to having the tools to do the fetch in libwww, there is
    HTML::Parser, plus regular expression support, so you have at least a
    fighting chance of decoding the JavaScript and extracting the URLs, or,
    failing that, at least finding all the URLs hidden in the code; you can
    then test those URLs and capture the statuses, all within Perl.
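
    As a rough sketch of that extract-and-check step (HTML::LinkExtor, from
    the HTML-Parser distribution, handles the ordinary links; the regex is
    only a crude fallback for URLs buried in script blocks, and the base URL
    and filename are placeholders):

      #!/usr/bin/perl
      # Pull every link out of a saved page and check each one with a HEAD request.
      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTML::LinkExtor;
      use URI;

      my $base = 'http://www.example.com/';   # placeholder base URL
      my $file = 'original-index.html';       # page saved by the fetch step

      my $html = do { open my $fh, '<', $file or die $!; local $/; <$fh> };

      # Links sitting in ordinary HTML attributes (a href, img src, ...).
      my %urls;
      my $extor = HTML::LinkExtor->new(
          sub { my ($tag, %attr) = @_; $urls{$_}++ for values %attr },
          $base,
      );
      $extor->parse($html);
      $extor->eof;

      # Crude fallback: quoted strings in the JavaScript that look like pages.
      for my $candidate ($html =~ m{["']([^"'\s]+\.html?[^"'\s]*)["']}g) {
          $urls{ URI->new_abs($candidate, $base) }++;
      }

      # Test each candidate and report its status.
      my $ua = LWP::UserAgent->new;
      for my $u (sort keys %urls) {
          printf "%-4s %s\n", $ua->head($u)->code, $u;
      }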

    At the very least, you can easily write a Perl script that notifies you when
    the originating site has changed in ways you care about, and alerts you to do
    a manual edit.
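
    The crudest version of that watcher just keeps a digest of the last fetch
    and complains when it changes; the URL and state-file path below are made
    up, and to trigger only on changes you care about you would digest the
    extracted link list rather than the whole page. It can be kicked off from
    a VMS batch job or an OS X cron job (adjust the state-file path to taste):

      #!/usr/bin/perl
      # Warn when the originating page no longer matches the last saved digest.
      use strict;
      use warnings;
      use LWP::UserAgent;
      use Digest::MD5 qw(md5_hex);

      my $url   = 'http://www.example.com/';       # page being watched
      my $state = "$ENV{HOME}/.index-watch.md5";   # where the last digest lives

      my $resp = LWP::UserAgent->new->get($url);
      die "Fetch failed: ", $resp->status_line, "\n" unless $resp->is_success;

      my $now = md5_hex($resp->content);

      my $old = '';
      if (open my $fh, '<', $state) {
          $old = <$fh>;
          $old = '' unless defined $old;
          chomp $old;
          close $fh;
      }

      if ($now ne $old) {
          print "$url has changed; time to re-edit the simplified copy.\n";
          open my $fh, '>', $state or die "Can't write $state: $!\n";
          print {$fh} "$now\n";
          close $fh;
      }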

    -- Alan
     
    Alan Winston - SSRL Central Computing, Feb 20, 2006
    #2

  3. JF Mezei

    JF Mezei Guest


    Not enough. A lot of sites have JavaScript code that builds unnecessarily
    complex URLs, so you need to actually execute the damned JavaScript to
    get to a URL it wants you to go to.
     
    JF Mezei, Feb 20, 2006
    #3
  4. Paul Sture

    Paul Sture Guest

    I'll email privately if that's OK with you.
     
    Paul Sture, Feb 22, 2006
    #4
  5. JF Mezei

    JF Mezei Guest

    An additional comment/warning:

    I just spent hours trying to debug and de-javascript a bank's "comments"
    page so I could send a comment about how I am unable to send comments to
    them (as well as the original comments about bad UI design in their new ATMs).


    That "send us comments" page has 120 html errors, and god only knows how
    many javascript ones.

    The comment form is fairly simple, with a number of hidden fields.

    So although I am submitting the same fields as their fancy web page does,
    they keep refusing it. My guess is that they may be checking the
    referring page, and if it isn't from their domain, they may automatically
    refuse it.

    So it isn't a given that making copies of JavaScript- and HTML-error-laden
    pages and turning them into simple, HTML-compliant ones will work, because
    the servers at the other end may block it.
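
    (For what it's worth, the referring-page theory is easy enough to test
    from Perl by replaying the POST with a Referer header pointing back at
    their own domain. Everything below, the URLs, field names and values, is
    invented for illustration, and if they check anything beyond the referring
    page, say cookies or session tokens built by the JavaScript, this still
    won't get through.)

      #!/usr/bin/perl
      # Replay a simple comment-form POST, copying the hidden fields and
      # claiming the bank's own page as the referring page.
      use strict;
      use warnings;
      use LWP::UserAgent;

      my $ua = LWP::UserAgent->new;
      $ua->cookie_jar({});    # hang on to any session cookies they set

      my $resp = $ua->post(
          'https://www.example-bank.com/comments/submit',   # the form's action URL
          {
              name      => 'J. Customer',
              comment   => 'Your comments page rejects comments.',
              form_id   => '12345',      # hidden fields copied verbatim from their page
              form_hash => 'deadbeef',
          },
          Referer => 'https://www.example-bank.com/comments/form.html',
      );

      print $resp->status_line, "\n";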


    BTW, I signed up with PayPal last week to make a payment for something
    on eBay. I had to call PayPal twice. It seems they have bugs in their
    system, and accessing the payments stuff via the obvious menus doesn't
    work; you need to access it from a different page. (Pressing PAY on the
    obvious page would simply bring that page up over and over again without
    any error message.)

    We're talking about financial institutions here. If their HTML code is
    that buggy, isn't that an indication that their quality assurance is way
    down, and that similarly terrible quality would also be tolerated in their
    core banking systems?
     
    JF Mezei, Mar 1, 2006
    #5
  6. I think you should drop your computer into a block of cement. You'll be
    a lot happier and less bitchy. And you won't be posting such pointless
    whiney flames. We'll all win then.
     
    Michael Vilain, Mar 1, 2006
    #6
