PCRE - Perl Compatible Regular Expressions - for V
The
PCRE
This package uses the older, but still widely deployed PCRE library, originally released in 1997, at version 8.45. If you are interested in the current version, PCRE2, released in 2015 and now at version 10.42, see prantlf.pcre2.
Synopsis
import prantlf.pcre { pcre_compile }
pattern := r'answer (?<answer>\d+)'
text := 'Is the answer 42?'
re := pcre_compile(pattern, 0)!
defer { re.free() }
assert re.contains(text, 0)!
idx := re.index_of(text, 0)!
assert idx == 7 // the whole match 'answer 42' starts at offset 7
start, end := re.index_range(text, 0)!
assert start == 7
assert end == 16
m := re.exec(text, 0)!
assert re.captures == 1
assert re.names == 1
group_start, group_end := m.group_bounds(1)?
assert group_start == 14
assert group_end == 16
assert m.group_text(text, 1)? == '42'
assert re.group_index_by_name('answer') == 1
text2 := re.replace(text, 'question known', 0)!
assert text2 == 'Is the question known?'
assert !re.contains(text2, 0)!
Installation
You can install this package either from VPM or from GitHub:
v install prantlf.pcre
v install --git https://github.com/prantlf/v-pcre
Usage
For the syntax of the regular expression patterns, see the quick reference.
Compile
A regular expression pattern has to be compiled first. The two functions below are synonyms with identical behaviour:
import prantlf.pcre { pcre_compile }
pcre_compile(source string, options u32) !&RegEx
import prantlf.pcre
pcre.compile(source string, options u32) !&RegEx
The following options can be applied. Combine multiple options with the | operator:
opt_anchored Force pattern anchoring
opt_auto_callout Compile automatic callouts
opt_bsr_anycrlf \R matches only CR, LF, or CRLF
opt_bsr_unicode \R matches all Unicode line endings
opt_caseless Do caseless matching
opt_dollar_endonly $ not to match newline at end
opt_dotall . matches anything including NL
opt_dupnames Allow duplicate names for subpatterns
opt_extended Ignore white space and # comments
opt_extra PCRE extra features (not much use currently)
opt_firstline Force matching to be before newline
opt_javascript_compat JavaScript compatibility
opt_multiline ^ and $ match newlines within data
opt_never_utf Lock out UTF, e.g. via (*UTF)
opt_newline_any Recognize any Unicode newline sequence
opt_newline_anycrlf Recognize CR, LF, and CRLF as newline sequences
opt_newline_cr Set CR as the newline sequence
opt_newline_crlf Set CRLF as the newline sequence
opt_newline_lf Set LF as the newline sequence
opt_no_auto_capture Disable numbered capturing parentheses (named ones available)
opt_no_auto_possess Disable auto-possessification
opt_no_start_optimize Disable match-time start optimizations
opt_no_utf8_check Do not check the pattern for UTF-8 validity (only relevant if opt_utf8 is set)
opt_ucp Use Unicode properties for \d, \w, etc.
opt_ungreedy Invert greediness of quantifiers
opt_utf8 Run pcre_compile() in UTF-8 mode
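As a minimal sketch of combining options, the following compiles a pattern that is both caseless and multiline. It assumes that the option constants are exported from the module under the names listed above. The cast to u32 is there only so that the sketch does not depend on the declared type of the constants. The pattern and the input text are made up for illustration.

```v
import prantlf.pcre

// Combine two compile options with the bitwise OR operator.
// The cast to u32 matches the signature of compile regardless of
// how the constants are typed (an assumption of this sketch).
options := u32(pcre.opt_caseless | pcre.opt_multiline)
re := pcre.compile(r'^second', options)!
defer { re.free() }

// Caseless: 'SECOND' matches 'second'.
// Multiline: ^ also matches right after the newline.
found := re.contains('first line\nSECOND line', 0)!
assert found
```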
If the compilation fails, an error will be returned:
struct CompileError {
msg string // the error message
code int // the error code
offset int // if >= 0, points to the pattern where the compilation failed
}
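A failed compilation can be handled with an or block instead of propagating the error. The sketch below uses a hypothetical helper, is_valid_pattern, which is not part of this package, and reads only the generic error message, which is available on every V error.

```v
import prantlf.pcre

// Hypothetical helper (not part of the package): reports whether
// a pattern can be compiled.
fn is_valid_pattern(source string) bool {
	re := pcre.compile(source, 0) or {
		// err is the error returned by compile.
		eprintln('invalid pattern "${source}": ${err.msg()}')
		return false
	}
	// The compiled expression is not needed any more.
	re.free()
	return true
}

valid := is_valid_pattern(r'a+b')
invalid := is_valid_pattern(r'(unclosed') // missing closing parenthesis
assert valid
assert !invalid
```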
Don't forget to free the regular expression object when you no longer need it:
(r &RegEx) free()
defer { re.free() }
Some characteristics of the regular expression, which are usually needed when executing it later, can be queried right after compiling it:
struct RegEx {
captures int // total count of the capturing groups
names int // total count of the named capturing groups
}
(r &RegEx) group_index_by_name(name string) int
(r &RegEx) group_name_by_index(idx int) string
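The fields and methods above can be tried on a pattern with two named groups. This is a sketch with a made-up pattern. It assumes that group indexes start at 1, as in the synopsis.

```v
import prantlf.pcre

re := pcre.compile(r'(?<year>\d{4})-(?<month>\d{2})', 0)!
defer { re.free() }

// Both capturing groups are named.
assert re.captures == 2
assert re.names == 2

// Translate between group names and group indexes.
month_idx := re.group_index_by_name('month')
year_name := re.group_name_by_index(1)
assert month_idx == 2
assert year_name == 'year'
```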
See also the original documentation for pcre_compile2.
Execute
After compiling, the regular expression can be executed with various subjects:
(r &RegEx) exec(subject string, options int) !Match
(r &RegEx) exec_within(subject string, start int, end int, options int) !Match
(r &RegEx) exec_within_nochk(subject string, start int, end int, options int) !Match
The following options can be applied. Combine multiple options with the | operator:
opt_anchored Match only at the first position
opt_bsr_anycrlf \R matches only CR, LF, or CRLF
opt_bsr_unicode \R matches all Unicode line endings
opt_newline_any Recognize any Unicode newline sequence
opt_newline_anycrlf Recognize CR, LF, & CRLF as newline sequences
opt_newline_cr Recognize CR as the only newline sequence
opt_newline_crlf Recognize CRLF as the only newline sequence
opt_newline_lf Recognize LF as the only newline sequence
opt_notbol Subject string is not the beginning of a line
opt_noteol Subject string is not the end of a line
opt_notempty An empty string is not a valid match
opt_notempty_atstart An empty string at the start of the subject is not a valid match
opt_no_start_optimize Do not do "start-match" optimizations
opt_no_utf8_check Do not check the subject for UTF-8 validity (only relevant if opt_utf8 was set at compile time)
opt_partial, opt_partial_soft Return error_partial for a partial match if no full matches are found
opt_partial_hard Return error_partial for a partial match if that is found before a full match
If the execution succeeds, an object with information about the match will be returned:
struct Match {}
Capturing groups can be obtained by the following methods, which return none if the group is not available. Group 0 refers to the whole match:
(m &Match) group_bounds(idx int) ?(int, int)
(m &Match) group_text(subject string, idx int) ?string
If the execution cannot match the pattern, a special error will be returned:
struct NoMatch {}
If the execution matches the pattern only partially (see the options opt_partial_hard and opt_partial_soft), a special error will be returned:
struct Partial {}
If the execution fails for other reasons, a general error will be returned:
struct ExecuteError {
msg string
code int
}
The following error codes may be encountered and are exported as public constants:
error_null = -2
error_badoption = -3
error_badmagic = -4
error_unknown_opcode = -5
error_nomemory = -6
error_nosubstring = -7
error_matchlimit = -8
error_badutf8 = -10
error_badutf8_offset = -11
error_badpartial = -13
error_internal = -14
error_badcount = -15
error_recursionlimit = -21
error_badnewline = -23
error_badoffset = -24
error_shortutf8 = -25
error_recurseloop = -26
error_badmode = -28
error_badendianness = -29
error_badlength = -32
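When the kind of failure does not matter, the result of exec can be handled with a single or block. The sketch below uses a hypothetical helper, first_number, which is not part of this package, and treats NoMatch, Partial and ExecuteError alike.

```v
import prantlf.pcre

// Hypothetical helper (not part of the package): returns the text
// of the first capturing group, or none if the pattern did not match.
fn first_number(re &pcre.RegEx, text string) ?string {
	m := re.exec(text, 0) or {
		// NoMatch, Partial and ExecuteError all end up here.
		return none
	}
	return m.group_text(text, 1)
}

re := pcre.compile(r'(\d+)', 0)!
defer { re.free() }

with_number := first_number(re, 'abc 42') or { '' }
without_number := first_number(re, 'abc') or { '' }
assert with_number == '42'
assert without_number == ''
```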
See also the original documentation for pcre_exec.
Others
The API consists of two parts: the basic compilation and execution of a regular expression, corresponding with the PCRE API, and the additional convenience methods listed below.
Search
(r &RegEx) matches(s string, opt int) !bool
(r &RegEx) matches_within(s string, at int, end int, opt int) !bool
(r &RegEx) matches_within_nochk(s string, at int, stop int, opt int) !bool
(r &RegEx) contains(s string, opt int) !bool
(r &RegEx) contains_within(s string, at int, end int, opt int) !bool
(r &RegEx) contains_within_nochk(s string, at int, stop int, opt int) !bool
(r &RegEx) starts_with(s string, opt int) !bool
(r &RegEx) starts_with_within(s string, at int, end int, opt int) !bool
(r &RegEx) starts_with_within_nochk(s string, at int, stop int, opt int) !bool
(r &RegEx) index_of(s string, option int) !int
(r &RegEx) index_of_within(s string, start int, end int, opt int) !int
(r &RegEx) index_of_within_nochk(s string, start int, stop int, opt int) !int
(r &RegEx) index_range(s string, opt int) !(int, int)
(r &RegEx) index_range_within(s string, start int, end int, opt int) !(int, int)
(r &RegEx) index_range_within_nochk(s string, start int, stop int, opt int) !(int, int)
(r &RegEx) ends_with(s string, opt int) !bool
(r &RegEx) ends_with_within(s string, from int, to int, opt int) !bool
(r &RegEx) ends_with_within_nochk(s string, from int, to int, opt int) !bool
(r &RegEx) count_of(s string, opt int) !int
(r &RegEx) count_of_within(s string, start int, end int, opt int) !int
(r &RegEx) count_of_within_nochk(s string, start int, stop int, opt int) !int
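A few of the search methods can be sketched on one input as follows. The pattern and the text are made up for illustration. As in the synopsis, the end of a range is assumed to point just behind the last matched character.

```v
import prantlf.pcre

re := pcre.compile(r'\d+', 0)!
defer { re.free() }
text := 'a1 b22 c333'

// Is there a match anywhere in the text?
has_number := re.contains(text, 0)!
// Where does the first match start, and where does it end?
first_at := re.index_of(text, 0)!
start, end := re.index_range(text, 0)!
// How many matches are there?
total := re.count_of(text, 0)!

assert has_number
assert first_at == 1
assert start == 1 && end == 2
assert total == 3
```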
Replace
Replace either all occurrences or only the first one matching the pattern of the regular expression:
(r &RegEx) replace(s string, with string, opt int) !string
(r &RegEx) replace_first(s string, with string, opt int) !string
If the regular expression doesn't match the input string, a special error will be returned:
struct NoMatch {}
If the regular expression matches, but the replacement string is the same as the matched text, so that replacing wouldn't change anything, a special error will be returned:
struct NoReplace {}
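The difference between the two methods, and one way of handling an input without a match, can be sketched as follows. The replacement is a plain string. Whether the replacement string supports references to capturing groups is not covered here.

```v
import prantlf.pcre

re := pcre.compile(r'\d+', 0)!
defer { re.free() }

// Replace every match, or only the first one.
replaced_all := re.replace('a1 b22', '#', 0)!
replaced_first := re.replace_first('a1 b22', '#', 0)!
assert replaced_all == 'a# b#'
assert replaced_first == 'a# b22'

// Without a match an error is returned; fall back to the input.
original := 'no digits'
unchanged := re.replace(original, '#', 0) or { original }
assert unchanged == 'no digits'
```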
Split
Split the input string by the regular expression and return the remaining parts in a string array:
(r &RegEx) split(s string, opt int) ![]string
(r &RegEx) split_first(s string, opt int) ![]string
Split the input string by the regular expression and return all parts, both remaining and splitting, in a string array:
(r &RegEx) chop(s string, opt int) ![]string
(r &RegEx) chop_first(s string, opt int) ![]string
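As a sketch of the difference between split and chop, the following splits a comma-separated list. The pattern and the input are made up. The result of chop is only printed, because the exact layout of the separators in the returned array is not specified above.

```v
import prantlf.pcre

re := pcre.compile(r',\s*', 0)!
defer { re.free() }
text := 'a, b,c'

// Only the parts between the separators.
parts := re.split(text, 0)!
assert parts == ['a', 'b', 'c']

// The parts between the separators together with the separators.
pieces := re.chop(text, 0)!
println(pieces)
```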
Classify
Classify ASCII characters:
char_space = 0x01
char_letter = 0x02
char_digit = 0x04
char_xdigit = 0x08
char_word = 0x10
char_meta = 0x80
pcre_chartype(ch u8) int
Additional character-classifying functions:
pcre_isalnum(ch u8) bool
pcre_isalpha(ch u8) bool
pcre_isdigit(ch u8) bool
pcre_isxdigit(ch u8) bool
pcre_isword(ch u8) bool
pcre_isspace(ch u8) bool
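The value returned by pcre_chartype is a combination of the char_ flags above, so a single flag can be tested with the bitwise AND operator. This is a sketch which assumes that the constants and functions are exported from the module under the names listed above.

```v
import prantlf.pcre

// Obtain all flags of the character 'a' at once.
flags := pcre.pcre_chartype(u8(`a`))
is_letter := (flags & pcre.char_letter) != 0
is_digit := (flags & pcre.char_digit) != 0
assert is_letter
assert !is_digit

// The shortcut functions test one class directly.
seven_is_digit := pcre.pcre_isdigit(u8(`7`))
assert seven_is_digit
```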
Classify Unicode characters by their general category:
enum UnicodeGeneral {
other
letter
mark
number
punctuation
symbol
separator
}
pcre_unicode_gentype(r rune) UnicodeGeneral
Classify Unicode characters by their particular category:
enum UnicodeParticular {
control
format
unassigned
private_use
surrogate
lowercase_letter
modifier_letter
other_letter
titlecase_letter
uppercase_letter
spacing_mark
enclosing_mark
nonspacing_mark
decimal_number
letter_number
other_number
connector_punctuation
dash_punctuation
close_punctuation
final_punctuation
initial_punctuation
other_punctuation
open_punctuation
currency_symbol
modifier_symbol
mathematical_symbol
other_symbol
line_separator
paragraph_separator
space_separator
}
pcre_unicode_partype(r rune) UnicodeParticular
Additional Unicode-capable character-classifying functions:
pcre_unicode_isalnum(r rune) bool
pcre_unicode_isalpha(r rune) bool
pcre_unicode_islower(r rune) bool
pcre_unicode_isupper(r rune) bool
pcre_unicode_isdigit(r rune) bool
pcre_unicode_iscntrl(r rune) bool
pcre_unicode_isword(r rune) bool
pcre_unicode_isspace(r rune) bool
pcre_unicode_isblank(r rune) bool
pcre_unicode_ispunct(r rune) bool
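The Unicode-capable functions accept a rune, so they work with characters outside of ASCII too. This is a sketch with the letter č chosen for illustration. It assumes that the enums and functions are exported from the module under the names listed above.

```v
import prantlf.pcre

// General and particular category of a lowercase non-ASCII letter.
general := pcre.pcre_unicode_gentype(`č`)
particular := pcre.pcre_unicode_partype(`č`)
assert general == pcre.UnicodeGeneral.letter
assert particular == pcre.UnicodeParticular.lowercase_letter

// The shortcut functions test one property directly.
is_upper := pcre.pcre_unicode_isupper(`Č`)
is_lower := pcre.pcre_unicode_islower(`č`)
assert is_upper
assert is_lower
```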
Contributing
In lieu of a formal styleguide, take care to maintain the existing coding style. Lint and test your code.
License
Copyright (c) 2023-2024 Ferdinand Prantl
Licensed under the MIT license.