In shell, you can parse HTML using:
- sed though:
- Turing.sed
- Write HTML parser (homework)
- ???
- Profit!
- hxselect from the html-xml-utils package
- ex (Vim's Ex mode), for example:
- removing style tag with inner code:
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
- grep, for example:
- extracting the outer HTML of H1:
$ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
<h1>Example Domain</h1>
- extracting the body:
$ curl -s http://example.com/ | tr '\n' ' ' | grep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
- html2text for plain-text parsing:
- like parsing tables:
$ html2text foo.txt | column -ts'|'
- Perl or Python (see @Gilles' example)
- for parsing multiple files at once, see: How to parse hundred html source code files in shell?
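The grep approach can be wrapped in a small function and run over several files. This is only a minimal sketch, not a robust parser: the function name `extract_tag` and the sample files are made up for illustration, and it assumes the tag is written without attributes and occurs once per file (`.*` is greedy, so with several occurrences it would match from the first opening tag to the last closing tag).

```shell
#!/bin/sh
# extract_tag TAG: read HTML on stdin, print the outer HTML of <TAG>...</TAG>.
# Hypothetical helper; it only handles tags written without attributes.
extract_tag() {
    # Join all lines first so elements spanning several lines still match.
    tr '\n' ' ' | grep -o "<$1>.*</$1>"
}

# Sample files for the demonstration (made-up content).
dir=$(mktemp -d)
printf '<html><body>\n<h1>First</h1>\n</body></html>\n' > "$dir/a.html"
printf '<html><body>\n<h1>Second\npage</h1>\n</body></html>\n' > "$dir/b.html"

# Run the extraction over every file, prefixing each result with the file name.
for f in "$dir"/*.html; do
    printf '%s: %s\n' "$(basename "$f")" "$(extract_tag h1 < "$f")"
done

# Clean up the temporary sample files.
rm -rf "$dir"
```

With the two sample files this should print `a.html: <h1>First</h1>` and `b.html: <h1>Second page</h1>`; the line break inside the second heading becomes a space because of the `tr` step.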