Utility class for working with UTF-8 encoded strings.

#include "donner/base/Utf8.h"

Static Public Member Functions
static bool	IsSurrogateCodepoint (char32_t ch)
	Returns true if the codepoint is a surrogate, per https://infra.spec.whatwg.org/#surrogate.

static bool	IsValidCodepoint (char32_t ch)
	Returns true if the codepoint is a valid UTF-8 codepoint.

static int	SequenceLength (char leadingCh)
	Determines the length in bytes of a UTF-8 encoded character based on its leading byte.

static std::tuple< char32_t, int >	NextCodepointLenient (std::string_view str)
	Decodes the next UTF-8 codepoint from the input string, without validating the encoded sequence.

static std::tuple< char32_t, int >	NextCodepoint (std::string_view str)
	Decodes the next UTF-8 codepoint from the input string, while strictly validating continuation bytes and sequence lengths.

template<std::output_iterator< char > OutputIterator>
static OutputIterator	Append (char32_t ch, OutputIterator it)
	Appends the UTF-8 encoding of the given Unicode codepoint to the output iterator.

Static Public Attributes
static constexpr char32_t	kUnicodeReplacementCharacter = 0xFFFD
	U+FFFD REPLACEMENT CHARACTER (�)

static constexpr char32_t	kUnicodeMaximumAllowedCodepoint = 0x10FFFF
	The greatest codepoint defined by Unicode, per https://www.w3.org/TR/css-syntax-3/#maximum-allowed-code-point.

Detailed Description

Utility class for working with UTF-8 encoded strings.
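
The codepoint predicates and constants listed in the summary can be pictured with a minimal sketch. The names below (kMaxCodepointSketch, IsSurrogateSketch, IsValidCodepointSketch) are hypothetical stand-ins and are not copied from donner/base/Utf8.h. The surrogate range follows the WHATWG Infra definition; treating "valid" as "in range and not a surrogate" is an assumption made for illustration.

```cpp
// Hypothetical stand-ins for the Utf8 predicates; not the library's code.
constexpr char32_t kMaxCodepointSketch = 0x10FFFF;

// Surrogates occupy U+D800..U+DFFF (WHATWG Infra definition).
constexpr bool IsSurrogateSketch(char32_t ch) {
  return ch >= 0xD800 && ch <= 0xDFFF;
}

// Assumption: a codepoint is "valid" when it is within the Unicode range
// and is not a surrogate.
constexpr bool IsValidCodepointSketch(char32_t ch) {
  return ch <= kMaxCodepointSketch && !IsSurrogateSketch(ch);
}
```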

Member Function Documentation

◆ Append()

template<std::output_iterator< char > OutputIterator>

static OutputIterator donner::Utf8::Append ( char32_t ch, OutputIterator it )

inline static

Appends the UTF-8 encoding of the given Unicode codepoint to the output iterator.

Template Parameters

OutputIterator An output iterator that accepts char elements.

Parameters

ch	The Unicode codepoint to encode and append.
it	The output iterator to which the encoded bytes are appended.

Returns: An iterator pointing to the element past the last inserted element.
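
To make the encoding step concrete, here is a standalone sketch of how a codepoint can be written as one to four UTF-8 bytes through an output iterator. AppendSketch and EncodeSketch are hypothetical names; this is not the implementation in donner/base/Utf8.h. The sketch uses an unconstrained template parameter instead of std::output_iterator so that it also builds before C++20, and it assumes the caller passes a valid codepoint (no range or surrogate check).

```cpp
#include <iterator>
#include <string>

// Hypothetical stand-in for Utf8::Append. Assumes `ch` is a valid codepoint.
template <typename OutputIterator>
OutputIterator AppendSketch(char32_t ch, OutputIterator it) {
  if (ch < 0x80) {
    // 1 byte: 0xxxxxxx
    *it++ = static_cast<char>(ch);
  } else if (ch < 0x800) {
    // 2 bytes: 110xxxxx 10xxxxxx
    *it++ = static_cast<char>(0xC0 | (ch >> 6));
    *it++ = static_cast<char>(0x80 | (ch & 0x3F));
  } else if (ch < 0x10000) {
    // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    *it++ = static_cast<char>(0xE0 | (ch >> 12));
    *it++ = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    *it++ = static_cast<char>(0x80 | (ch & 0x3F));
  } else {
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    *it++ = static_cast<char>(0xF0 | (ch >> 18));
    *it++ = static_cast<char>(0x80 | ((ch >> 12) & 0x3F));
    *it++ = static_cast<char>(0x80 | ((ch >> 6) & 0x3F));
    *it++ = static_cast<char>(0x80 | (ch & 0x3F));
  }
  return it;  // Points past the last byte written.
}

// Convenience wrapper: encode a single codepoint into a new string.
std::string EncodeSketch(char32_t ch) {
  std::string out;
  AppendSketch(ch, std::back_inserter(out));
  return out;
}
```

With the real class, the equivalent call would presumably be Utf8::Append(ch, std::back_inserter(out)), matching the signature documented above.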

◆ NextCodepoint()

static std::tuple< char32_t, int > donner::Utf8::NextCodepoint ( std::string_view str )

inline static

Decodes the next UTF-8 codepoint from the input string, while strictly validating continuation bytes and sequence lengths.

If an invalid sequence is encountered, the function returns the Unicode replacement character (U+FFFD) and consumes the invalid bytes.

Parameters

str	The input string_view from which to read the codepoint.

Returns: A tuple containing the decoded Unicode codepoint and the number of bytes consumed.
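
The returned tuple is meant to drive a decode loop: read one codepoint, then advance the view by the number of bytes consumed. The sketch below shows that pattern with a hypothetical strict decoder, NextCodepointSketch, standing in for Utf8::NextCodepoint. It checks only the lead byte, the sequence length, and the continuation bytes. How many bytes are reported as consumed on error is an assumption of this sketch, and it does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <tuple>

// Hypothetical strict decoder with the same return shape as Utf8::NextCodepoint.
std::tuple<char32_t, int> NextCodepointSketch(std::string_view str) {
  constexpr char32_t kReplacement = 0xFFFD;
  if (str.empty()) {
    return {kReplacement, 0};
  }

  const auto lead = static_cast<unsigned char>(str[0]);
  int length = 0;
  char32_t cp = 0;
  if (lead < 0x80) {
    return {static_cast<char32_t>(lead), 1};
  } else if ((lead & 0xE0) == 0xC0) {
    length = 2;
    cp = lead & 0x1F;
  } else if ((lead & 0xF0) == 0xE0) {
    length = 3;
    cp = lead & 0x0F;
  } else if ((lead & 0xF8) == 0xF0) {
    length = 4;
    cp = lead & 0x07;
  } else {
    // Stray continuation byte or invalid lead byte: skip one byte.
    return {kReplacement, 1};
  }

  if (str.size() < static_cast<std::size_t>(length)) {
    // Truncated sequence. Assumption: consume whatever is left.
    return {kReplacement, static_cast<int>(str.size())};
  }
  for (int i = 1; i < length; ++i) {
    const auto b = static_cast<unsigned char>(str[i]);
    if ((b & 0xC0) != 0x80) {
      // Not a continuation byte. Assumption: consume the bytes read so far.
      return {kReplacement, i};
    }
    cp = (cp << 6) | (b & 0x3F);
  }
  return {cp, length};
}

// Decodes a whole string, substituting U+FFFD for malformed input.
std::u32string DecodeAllSketch(std::string_view str) {
  std::u32string out;
  while (!str.empty()) {
    const auto [cp, consumed] = NextCodepointSketch(str);
    out.push_back(cp);
    str.remove_prefix(static_cast<std::size_t>(consumed));
  }
  return out;
}
```

The loop terminates because the decoder always reports at least one consumed byte for non-empty input.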

◆ NextCodepointLenient()

static std::tuple< char32_t, int > donner::Utf8::NextCodepointLenient ( std::string_view str )

inline static

Decodes the next UTF-8 codepoint from the input string, without validating the encoded sequence.

If the string is empty or contains insufficient bytes, returns the Unicode replacement character (U+FFFD).

Parameters

str	The input string_view from which to read the codepoint.

Returns: A tuple containing the decoded Unicode codepoint and the number of bytes consumed.
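
As a rough illustration of what "lenient" can mean here, the sketch below trusts the lead byte to pick a sequence length and folds in the following bytes without checking that they are continuation bytes. NextCodepointLenientSketch is a hypothetical name, and the masking details and the consumed count on short input are assumptions, not the library's documented behavior.

```cpp
#include <cstddef>
#include <string_view>
#include <tuple>

// Hypothetical lenient decoder; illustrates the documented contract only.
std::tuple<char32_t, int> NextCodepointLenientSketch(std::string_view str) {
  constexpr char32_t kReplacement = 0xFFFD;
  if (str.empty()) {
    return {kReplacement, 0};
  }

  // Pick the sequence length from the lead byte without validating it.
  const auto lead = static_cast<unsigned char>(str[0]);
  int length = 1;
  char32_t cp = lead;
  if (lead >= 0xF0) {
    length = 4;
    cp = lead & 0x07;
  } else if (lead >= 0xE0) {
    length = 3;
    cp = lead & 0x0F;
  } else if (lead >= 0xC0) {
    length = 2;
    cp = lead & 0x1F;
  }

  if (str.size() < static_cast<std::size_t>(length)) {
    // Insufficient bytes. Assumption: report every remaining byte as consumed.
    return {kReplacement, static_cast<int>(str.size())};
  }
  // Fold in the low six bits of each following byte, valid or not.
  for (int i = 1; i < length; ++i) {
    cp = (cp << 6) | (static_cast<unsigned char>(str[i]) & 0x3F);
  }
  return {cp, length};
}
```

Well-formed input decodes as expected, while malformed input yields an arbitrary codepoint rather than U+FFFD, which is why a lenient decoder suits only input that is already known to be valid.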

◆ SequenceLength()

static int donner::Utf8::SequenceLength ( char leadingCh )

inline static

Determines the length in bytes of a UTF-8 encoded character based on its leading byte.

Parameters

leadingCh The leading byte of the UTF-8 character.

Returns: The number of bytes in the UTF-8 character, or 0 if invalid.
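
The mapping from leading byte to sequence length follows the standard UTF-8 bit patterns. A minimal sketch, using the hypothetical name SequenceLengthSketch and assuming that continuation bytes (10xxxxxx) and bytes 0xF8 and above count as invalid:

```cpp
// Hypothetical stand-in for Utf8::SequenceLength.
int SequenceLengthSketch(char leadingCh) {
  // Work on an unsigned value so the comparisons do not depend on whether
  // plain char is signed.
  const auto b = static_cast<unsigned char>(leadingCh);
  if (b < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                          // Continuation byte or invalid lead
}
```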

The documentation for this class was generated from the following file:

donner/base/Utf8.h
