vsoup
A fast, JSoup-inspired HTML5 parser and DOM manipulation library for V.
Features
- HTML5 parsing: full spec-compliant parsing via Lexbor
- CSS selectors: select(), select_first() with cached compiled selectors
- DOM traversal: children(), parent(), next_sibling(), etc.
- DOM manipulation: set_attr(), add_class(), append(), remove(), etc.
- Serialization: html(), outer_html(), pretty_html()
- HTTP client: JSoup-style connect(url).get() builder
Quick Start
import vsoup
doc := vsoup.parse('<div class="main"><p>Hello</p><a href="/link">World</a></div>')!
defer { doc.free() }
// CSS selectors
links := doc.@select('a[href]')
println(links.first()!.attr('href')) // "/link"
println(links.first()!.text()) // "World"
// DOM traversal
body := doc.body()!
for child in body.children() {
    println(child.tag_name())
}
// DOM manipulation
mut div := doc.select_first('.main')!
div.set_attr('data-processed', 'true')
div.add_class('active')
div.append('<span>New content</span>')
println(doc.html())
Installation
v install marcalc.vsoup
Prerequisites
- V compiler
- C compiler (cc/gcc/clang)
- CMake
Building from source
git clone --recurse-submodules https://github.com/marcalc/vsoup.git
cd vsoup
make setup # builds lexbor static library
API Reference
Parsing
doc := vsoup.parse(html_string)! // parse HTML string
doc := vsoup.parse_file('page.html')! // parse from file
doc := vsoup.connect('https://example.com').get()! // fetch & parse
defer { doc.free() }
Document
| Method | Returns | Description |
|---|---|---|
| `doc.body()` | `?Element` | The `<body>` element |
| `doc.head()` | `?Element` | The `<head>` element |
| `doc.title()` | `string` | Document title text |
| `doc.@select(css)` | `Elements` | All matching elements (`@` is required because `select` is a V keyword) |
| `doc.select_first(css)` | `?Element` | First matching element |
| `doc.html()` | `string` | Serialized HTML |
| `doc.pretty_html()` | `string` | Pretty-printed HTML |
| `doc.free()` | | Free all resources |
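
As a rough illustration of the Document methods listed above, the sketch below parses an inline page and reads a few values back. The HTML string and the `p.lead` selector are invented for the example; only calls that appear in the table are used.

```v
import vsoup

fn main() {
    // Sample input, made up for this sketch.
    html := '<html><head><title>Demo</title></head><body><p class="lead">Hi</p><p>Bye</p></body></html>'
    doc := vsoup.parse(html) or {
        eprintln('parse failed: ${err}')
        return
    }
    defer { doc.free() }

    // Plain string accessor.
    println(doc.title())

    // Optional accessors need an `or` block (or propagation).
    body := doc.body() or {
        eprintln('document has no <body>')
        return
    }
    println(body.local_name())

    lead := doc.select_first('p.lead') or {
        eprintln('no p.lead element')
        return
    }
    println(lead.text())

    // Collection query and serialization.
    println(doc.@select('p').len())
    println(doc.pretty_html())
}
```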
Element
| Method | Returns | Description |
|---|---|---|
| `e.tag_name()` | `string` | Uppercase tag name (e.g. `"DIV"`) |
| `e.local_name()` | `string` | Lowercase tag name (e.g. `"div"`) |
| `e.id()` | `string` | The `id` attribute |
| `e.class_name()` | `string` | The `class` attribute |
| `e.class_names()` | `[]string` | Individual class names |
| `e.has_class(name)` | `bool` | Check for a class |
| `e.attr(key)` | `string` | Attribute value |
| `e.has_attr(key)` | `bool` | Check attribute existence |
| `e.attributes()` | `map[string]string` | All attributes |
| `e.text()` | `string` | Text content (recursive) |
| `e.html()` | `string` | Inner HTML |
| `e.outer_html()` | `string` | Outer HTML |
| `e.@select(css)` | `Elements` | CSS select descendants (`@` is required because `select` is a V keyword) |
| `e.select_first(css)` | `?Element` | First matching descendant |
| `e.parent()` | `?Element` | Parent element |
| `e.children()` | `[]Element` | Child elements |
| `e.first_child()` | `?Element` | First child element |
| `e.next_sibling()` | `?Element` | Next sibling element |
| `e.prev_sibling()` | `?Element` | Previous sibling element |
| `e.set_attr(k, v)` | | Set attribute |
| `e.remove_attr(k)` | | Remove attribute |
| `e.add_class(name)` | | Add a class |
| `e.remove_class(name)` | | Remove a class |
| `e.append(html)` | | Append child HTML |
| `e.prepend(html)` | | Prepend child HTML |
| `e.remove()` | | Remove from DOM |
| `e.empty()` | | Remove all children |
| `e.set_text(text)` | | Set text content |
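
The following sketch shows how the read-only and mutating Element methods above might be combined. The markup, the `#menu` selector and the `data-count` attribute are hypothetical. The element is declared `mut`, as in the Quick Start, because it is modified.

```v
import vsoup

fn main() {
    // Sample input, made up for this sketch.
    doc := vsoup.parse('<ul id="menu" class="nav main"><li>One</li><li>Two</li></ul>') or {
        eprintln('parse failed: ${err}')
        return
    }
    defer { doc.free() }

    mut menu := doc.select_first('#menu') or {
        eprintln('no #menu element')
        return
    }

    // Read-only inspection.
    println(menu.local_name())
    println(menu.class_names())
    for item in menu.children() {
        println(item.text())
    }

    // Mutation: record the child count, add a class, append a child.
    menu.set_attr('data-count', menu.children().len.str())
    menu.add_class('open')
    menu.append('<li>Three</li>')
    println(menu.outer_html())
}
```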
Elements
| Method | Returns | Description |
|---|---|---|
| `es.len()` | `int` | Number of elements |
| `es.first()` | `?Element` | First element |
| `es.last()` | `?Element` | Last element |
| `es.at(i)` | `?Element` | Element at index |
| `es.text()` | `string` | Combined text of all |
| `es.attr(key)` | `string` | First matching attr |
| `es.each_attr(key)` | `[]string` | Attr from each element |
| `es.@select(css)` | `Elements` | Sub-select across all |
| `es.iter()` | `[]Element` | For use in `for` loops |
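
A minimal sketch of working with an Elements collection, using only the methods listed above. The markup is invented for the example.

```v
import vsoup

fn main() {
    // Sample input, made up for this sketch.
    html := '<nav><a href="/a">A</a><a href="/b">B</a><a>no link</a></nav>'
    doc := vsoup.parse(html) or {
        eprintln('parse failed: ${err}')
        return
    }
    defer { doc.free() }

    links := doc.@select('a[href]')
    println('matched ${links.len()} links')

    // each_attr() yields one attribute value per element.
    for href in links.each_attr('href') {
        println(href)
    }

    // iter() exposes the collection as a plain array for `for` loops.
    for link in links.iter() {
        label := link.text()
        target := link.attr('href')
        println('${label} -> ${target}')
    }

    // first(), last() and at() are optional and need an `or` block.
    last := links.last() or {
        eprintln('collection is empty')
        return
    }
    println(last.outer_html())
}
```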
HTTP Client
doc := vsoup.connect('https://example.com')
    .user_agent('vsoup/0.1')
    .header('Accept', 'text/html')
    .cookie('session', 'abc123')
    .get()!
defer { doc.free() }
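
Once a page has been fetched, the result is an ordinary document, so the selector API applies as usual. The sketch below, which needs network access, prints the title and every link target of a page. The URL is a placeholder and the builder calls are the ones shown above.

```v
import vsoup

fn main() {
    // Placeholder URL; replace with the page you want to inspect.
    doc := vsoup.connect('https://example.com').user_agent('vsoup/0.1').get() or {
        eprintln('fetch failed: ${err}')
        return
    }
    defer { doc.free() }

    println(doc.title())
    for href in doc.@select('a[href]').each_attr('href') {
        println(href)
    }
}
```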
Benchmarks
Selector benchmarks against native Lexbor C and jsoup (Java), using the same HTML fixture and methodology: 5 iterations x 10,000 repetitions, mean time in seconds.
Lexbor v2.6.0 | jsoup 1.22.2 | macOS ARM64
| Selector | Lexbor C | vsoup (V) | jsoup (Java) |
|---|---|---|---|
| `div` | 0.00418 | 0.00622 (1.5x) | 0.01596 (3.8x) |
| `div span` | 0.00554 | 0.00715 (1.3x) | 0.02966 (5.4x) |
| `p ~ p` | 0.00503 | 0.00652 (1.3x) | 0.02262 (4.5x) |
| `p + p` | 0.00496 | 0.00660 (1.3x) | 0.01900 (3.8x) |
| `div > p` | 0.00507 | 0.00692 (1.4x) | 0.01434 (2.8x) |
| `div > div` | 0.00512 | 0.00731 (1.4x) | 0.01414 (2.8x) |
| `div p:not(#p-5) a` | 0.00785 | 0.00953 (1.2x) | 0.03763 (4.8x) |
| `div:has(a) a` | 0.00726 | 0.00905 (1.2x) | 0.02558 (3.5x) |
| `div p:nth-child(n+2)` | 0.00643 | 0.00799 (1.2x) | 0.02950 (4.6x) |
| `div p:nth-child(n+2 of div > p)` | 0.01364 | 0.01685 (1.2x) | n/a |
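Across these selectors, vsoup takes 1.2-1.5x as long as native Lexbor C, while jsoup takes 2.8-5.4x as long (the ratios in parentheses are relative to Lexbor C).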
The remaining overhead relative to raw C comes from the V function-call layer and from collecting results into V arrays; the underlying `lxb_selectors_find` call is the same.
Running benchmarks
make bench-selectors # vsoup vs lexbor (raw C bindings + public API)
make bench-parse # vsoup microbenchmarks (parse, traverse, select, serialize, manipulate)
make bench-jsoup # jsoup comparison (downloads jar automatically)
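
To get a feel for the methodology described above (5 iterations of 10,000 repetitions, reporting the mean time per iteration), a timing loop could look roughly like the sketch below. This is not the project's benchmark harness: the fixture and the selector are placeholders, and it relies on V's standard `time` stopwatch.

```v
import time
import vsoup

fn main() {
    // Placeholder fixture; the real benchmarks use a larger shared HTML file.
    html := '<div><p>a</p><p>b</p><span>c</span></div>'
    doc := vsoup.parse(html) or {
        eprintln('parse failed: ${err}')
        return
    }
    defer { doc.free() }

    iterations := 5
    repetitions := 10_000
    mut total := f64(0)
    for _ in 0 .. iterations {
        sw := time.new_stopwatch()
        for _ in 0 .. repetitions {
            matches := doc.@select('div > p')
            // Use the result so the query cannot be skipped.
            if matches.len() == 0 {
                panic('selector unexpectedly matched nothing')
            }
        }
        total += sw.elapsed().seconds()
    }
    mean := total / f64(iterations)
    println('div > p: mean ${mean:.5f} s per iteration')
}
```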
Thread Safety
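vsoup is not thread-safe. Do not access a `Document`, or any `Element` obtained from it, from more than one thread at a time; give each thread its own `Document`.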
Memory Management
`Document` owns the underlying C memory and must be released with `free()`:
doc := vsoup.parse(html)!
defer { doc.free() } // always pair with defer
`Element` values are non-owning views into their `Document` and need no separate cleanup.
Error Handling
Parsing and HTTP operations return V `Result` types. Propagate errors with `!` or handle them with `or {}`:
// Parsing errors
doc := vsoup.parse(html) or {
    eprintln('Parse failed: ${err}')
    return
}
defer { doc.free() }
// Selector queries return Option types
elem := doc.select_first('.missing') or {
    println('Element not found')
    return
}
// HTTP errors
doc2 := vsoup.connect('https://example.com').get() or {
    eprintln('Fetch failed: ${err}')
    return
}
defer { doc2.free() }
defer { doc2.free() }
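
Inside a function that itself returns a `Result`, the two styles can be mixed: propagate the parse error with `!` and turn a missing element into an error with `or {}`. The helper below is a sketch of that pattern. `first_text` is a hypothetical name, not part of vsoup.

```v
import vsoup

// first_text returns the text of the first element matching `css`.
// Hypothetical helper for illustration only.
fn first_text(html string, css string) !string {
    // Propagate parse errors to the caller.
    doc := vsoup.parse(html)!
    defer { doc.free() }

    // Convert "no match" into an error with a descriptive message.
    el := doc.select_first(css) or { return error('no element matches "${css}"') }

    // Read the text before the deferred free() runs.
    text := el.text()
    return text
}

fn main() {
    text := first_text('<div><p>Hello</p></div>', 'p') or {
        eprintln('lookup failed: ${err}')
        return
    }
    println(text)
}
```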
Architecture
vsoup
├── bindings.v # C FFI declarations (lexbor)
├── helpers.v # C↔V conversion, serialization, selector cache
├── vsoup.v # parse(), parse_file(), connect()
├── document.v # Document struct
├── element.v # Element struct (non-owning DOM node view)
├── elements.v # Elements collection
├── node_type.v # NodeType enum
├── connection.v # HTTP client
├── c_shims.c/h # Compatibility shims for lexbor v2.6.0
└── lexbor/ # Vendored lexbor v2.6.0 (git submodule)
Key design decisions:
- `Element` is a lightweight, non-owning pointer wrapper (24 bytes), freely copyable
- `Document` owns the C memory and must be freed with `free()`
- CSS selectors are compiled once and cached per-document for reuse
- All V strings are copies from C memory (no dangling pointers)
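
One practical consequence of the last point, assuming it holds for every string-returning call, is that extracted values can be returned from a function after its document has been freed. The sketch below illustrates this. `collect_hrefs` is a hypothetical helper, not part of vsoup.

```v
import vsoup

// collect_hrefs parses `html`, gathers every link target, and frees the
// document before returning. Hypothetical helper for illustration only.
fn collect_hrefs(html string) ![]string {
    doc := vsoup.parse(html)!
    defer { doc.free() }

    // The strings are V-owned copies, so they should stay valid after free().
    hrefs := doc.@select('a[href]').each_attr('href')
    return hrefs
}

fn main() {
    hrefs := collect_hrefs('<a href="/one">1</a><a href="/two">2</a>') or {
        eprintln('parse failed: ${err}')
        return
    }
    println(hrefs)
}
```

By contrast, an `Element` is only a view into the document, so it should not be kept around once `free()` has been called.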
Acknowledgements
- Lexbor: the fast, spec-compliant HTML5 engine that powers vsoup's parsing and selector machinery. Created by Alexander Borisov.
- jsoup: the excellent Java HTML parser whose clean API design inspired vsoup's interface. Created by Jonathan Hedley.
License
MIT