Wednesday, June 8, 2016

How to parse HTML in shell

In shell, you can parse HTML using:
  • sed, though that route looks like this:
    1. Turing.sed
    2. Write HTML parser (homework)
    3. ???
    4. Profit!
  • hxselect from html-xml-utils package
  • vim/ex (which can easily jump between HTML tags), for example:
    • removing the style tag together with its contents:
      $ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
  • grep, for example:
    • extracting the outer HTML of the h1 element:
      $ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
      <h1>Example Domain</h1>
    • extracting the body:
      $ curl -s http://example.com/ | tr '\n' ' ' | grep -o '<body>.*</body>'
      <body> <div> <h1>Example Domain</h1> ...
  • html2text, for converting HTML to plain text
  • xpath (from the XML::XPath Perl module), see example here
  • Perl or Python (see @Gilles's example)
  • for parsing multiple files at once, see: How to parse hundred html source code files in shell?
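
The grep examples above can be wrapped in a small helper so they are easier to reuse. This is only a rough sketch: the function name extract_tag and the sample document are made up for illustration, and it assumes the tag has no attributes and is not nested inside itself. Because `.*` is greedy, a page with several matching elements returns everything from the first opening tag to the last closing tag.

```shell
#!/bin/sh
# Hypothetical helper: print the outer HTML of <TAG>...</TAG> read from stdin.
# Assumes a bare tag (no attributes) that is not nested inside itself.
extract_tag() {
  tag=$1
  # Join all lines so the match can span line breaks, then keep only the match.
  tr '\n' ' ' | grep -o "<$tag>.*</$tag>" | head -n 1
}

# Made-up sample standing in for the output of: curl -s http://example.com/
html='<html>
<body>
<div>
<h1>Example Domain</h1>
<p>Some text.</p>
</div>
</body>
</html>'

printf '%s\n' "$html" | extract_tag h1
printf '%s\n' "$html" | extract_tag body
```

For anything beyond simple, attribute-free tags, a real parser such as hxselect is the safer choice.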
