Solutions to all issues, bugs, fixes, errors and how-to: How to parse HTML in shell

Wednesday, June 8, 2016

In shell, you can parse HTML using:

sed, though:
1. Turing.sed
2. Write HTML parser (homework)
3. ???
4. Profit!
hxselect, from the html-xml-utils package
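A minimal sketch of the hxselect approach, assuming the html-xml-utils package is installed. The function name select_css is made up for illustration. hxnormalize -x (from the same package) runs first because hxselect expects well-formed XML, and real-world HTML often is not:

```shell
#!/bin/sh
# select_css SELECTOR (hypothetical helper, not part of html-xml-utils):
# read HTML on stdin and print the content of every element matching the
# CSS selector, one match per line.
select_css() {
    # hxnormalize -x : rewrite the input as well-formed XML
    # hxselect -c    : print only the content of each match, not the tags
    # hxselect -s    : string printed after each match (here a newline)
    hxnormalize -x | hxselect -c -s '\n' "$1"
}

# Usage:
#   curl -s http://example.com/ | select_css h1
```

Unlike the regex-based approaches, this selects by document structure, so it should not depend on how the tags are spread across lines.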
vim/ex (which can easily jump between HTML tags), for example:
- removing the style tag together with its contents:
```
$ curl -s http://example.com/ | ex -s +'/<style.*/norm nvatd' +%p -cq! /dev/stdin
```

grep, for example:

extracting the outer HTML of the h1 element (this works only when the whole element sits on one line):

$ curl -s http://example.com/ | grep -o '<h1>.*</h1>'
<h1>Example Domain</h1>

extracting the body (tr joins the lines first, because grep matches one line at a time):

$ curl -s http://example.com/ | tr '\n' ' ' | grep -o '<body>.*</body>'
<body> <div> <h1>Example Domain</h1> ...
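
The two grep examples can be combined into a small helper. This is only a sketch under stated assumptions: the function name extract_tag_text is made up, grep must support -o (GNU and BSD grep do), and the pattern matches only elements whose content is plain text with no nested tags:

```shell
#!/bin/sh
# extract_tag_text TAG (hypothetical helper): read HTML on stdin and print
# the text of each <TAG>...</TAG> element that contains no nested tags.
extract_tag_text() {
    tag=$1
    # 1. Join all lines, since grep matches one line at a time.
    # 2. Match the opening tag (with optional attributes), text without
    #    any '<', and the closing tag. [^<]* keeps the match from running
    #    on to a later closing tag, which the greedy .* would do.
    # 3. Strip the tags, leaving only the text.
    tr '\n' ' ' |
        grep -o "<${tag}\( [^>]*\)\{0,1\}>[^<]*</${tag}>" |
        sed -e 's/<[^>]*>//g'
}

# Usage:
#   curl -s http://example.com/ | extract_tag_text h1
```

For elements that do contain nested markup, a real parser such as hxselect or xpath is the safer choice.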

html2text, for converting HTML to plain text:
- for example, to parse tables:
```
$ html2text foo.txt | column -ts'|'
```
xpath (from the XML::XPath Perl module), see example here
Perl or Python (see @Gilles' example)
for parsing multiple files at once, see: How to parse hundred html source code files in shell?
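
For a handful of files, a plain shell loop is often enough. The sketch below is not taken from the linked question. The function name print_titles is made up, and it assumes each title element is plain text with no attributes:

```shell
#!/bin/sh
# print_titles FILE... (hypothetical helper): for each HTML file given,
# print the file name, a tab, and the text of its first <title> element.
print_titles() {
    for f in "$@"; do
        # Skip arguments that are not regular files (e.g. an unmatched glob).
        [ -f "$f" ] || continue
        title=$(tr '\n' ' ' < "$f" |
            grep -o '<title>[^<]*</title>' |
            head -n 1 |
            sed -e 's/<[^>]*>//g')
        printf '%s\t%s\n' "$f" "$title"
    done
}

# Usage:
#   print_titles *.html
```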
