contrib: Import utf8proc 1.1.5

URL
https://juliastrings.github.io/utf8proc/releases/

Hash
There is no repo until 1.1.6, so the file archive used was
utf8proc-v1.1.5.tar.gz, released on 2009-10-16. The sha512 is:

cd75a0aabdf7b331ce6cc210fe343e15804e5a097607e387ec0ab9c994ceecc\
f80aecbe25b06efb756d2989fd623b7a4d6de2c4d3416add20ac8692cf37912c6
Amar Takhar
2025-07-25 10:43:11 -04:00
committed by Kinsey Moore
parent 4374d0ef8b
commit d37b554bc5
17 changed files with 16872 additions and 0 deletions

@@ -0,0 +1,128 @@
Changelog
2006-06-02:
- initial release of version 0.1
2006-06-05:
- changed behaviour of PostgreSQL function to return NULL in case of
invalid input, rather than raising an exceptional condition
- improved efficiency of PostgreSQL function (no transformation to C string
is done)
2006-06-20:
- added -fpic compiler flag in Makefile
- fixed bug in the C code for the ruby library (usage of non-existent
function)
Release of version 0.2
2006-07-18:
- changed normalization from NFC to NFKC for postgresql unifold function
2006-08-04:
- added support to mark the beginning of a grapheme cluster with 0xFF
(option: CHARBOUND)
- added the ruby method String#chars, which is returning an array of UTF-8
encoded grapheme clusters
- added NLF2LF transformation in postgresql unifold function
- added the DECOMPOSE option, if you neither use COMPOSE or DECOMPOSE, no
normalization will be performed (different from previous versions)
- using integer constants rather than C-strings for character properties
- fixed (hopefully) a problem with the ruby library on Mac OS X, which
occurred when compiler optimization was switched on
Release of version 0.3
2006-09-17:
- added the LUMP option, which lumps certain characters together
(see lump.txt) (also used for the PostgreSQL "unifold" function)
- added the STRIPMARK option, which strips marking characters
(or marks of composed characters)
- deprecated ruby method String#char_ary in favour of String#utf8chars
Release of version 1.0
2006-09-20:
- included a gem file for the ruby version of the library
Release of version 1.0.1
2006-09-21:
- included a check in Integer#utf8, which raises an exception, if the given
code-point is invalid because of being too high (this was missing yet)
2006-12-26:
- added support for PostgreSQL version 8.2
Release of version 1.0.2
2007-03-16:
- Fixed a bug in the ruby library, which caused an error, when splitting an
empty string at grapheme cluster boundaries (method String#utf8chars).
Release of version 1.0.3
2007-06-25:
- Added a new PostgreSQL function 'unistrip', which behaves like 'unifold',
but also removes all character marks (e.g. accents).
2007-07-22:
- Changed license from BSD to MIT style.
- Added a new function 'utf8proc_codepoint_valid' to the C library.
- Changed compiler flags in Makefile from -g -O0 to -O2
- The ruby script, which was used to build the utf8proc_data.c file, is now
included in the distribution.
Release of version 1.1.1
2007-07-25:
- Fixed a serious bug in the data file generator, which caused characters
being treated incorrectly, when stripping default ignorable characters or
calculating grapheme cluster boundaries.
Release of version 1.1.2
2008-10-04:
- Added a function utf8proc_version returning a string containing the version
number of the library.
- Included a target libutf8proc.dylib for MacOSX.
2009-05-01:
- PostgreSQL 8.3 compatibility (use of SET_VARSIZE macro)
Release of version 1.1.3
2009-06-14:
- replaced C++ style comments for compatibility reasons
- added typecasts to suppress compiler warnings
- removed redundant source files for ruby-gemfile generation
2009-08-19:
- Changed copyright notice for Public Software Group e. V.
- Minor changes in the README file
- Release of version 1.1.4
2009-08-20:
- Use RSTRING_PTR() and RSTRING_LEN() instead of RSTRING()->ptr and
RSTRING()->len for ruby1.9 compatibility (and #define them, if not
existent)
2009-10-02:
- Patches for compatibility with Microsoft Visual Studio
2009-10-08:
- Fixes to make utf8proc usable in C++ programs
2009-10-16:
- Release of version 1.1.5

@@ -0,0 +1,64 @@
Copyright (c) 2009 Public Software Group e. V., Berlin, Germany
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
This software distribution contains derived data from a modified version of
the Unicode data files. The following license applies to that data:
COPYRIGHT AND PERMISSION NOTICE
Copyright (c) 1991-2007 Unicode, Inc. All rights reserved. Distributed
under the Terms of Use in http://www.unicode.org/copyright.html.
Permission is hereby granted, free of charge, to any person obtaining a
copy of the Unicode data files and any associated documentation (the "Data
Files") or Unicode software and any associated documentation (the
"Software") to deal in the Data Files or Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, and/or sell copies of the Data Files or Software, and
to permit persons to whom the Data Files or Software are furnished to do
so, provided that (a) the above copyright notice(s) and this permission
notice appear with all copies of the Data Files or Software, (b) both the
above copyright notice(s) and this permission notice appear in associated
documentation, and (c) there is clear notice in each modified Data File or
in the Software as well as in the documentation associated with the Data
File(s) or Software that the data or software has been modified.
THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF
THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS
INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR
CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF
USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THE DATA FILES OR SOFTWARE.
Except as contained in this notice, the name of a copyright holder shall
not be used in advertising or otherwise to promote the sale, use or other
dealings in these Data Files or Software without prior written
authorization of the copyright holder.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and may be
registered in some jurisdictions. All other trademarks and registered
trademarks mentioned herein are the property of their respective owners.

@@ -0,0 +1,68 @@
# libutf8proc Makefile
# settings
cflags = -O2 -std=c99 -pedantic -Wall -fpic $(CFLAGS)
cc = $(CC) $(cflags)
# meta targets
c-library: libutf8proc.a libutf8proc.so
ruby-library: ruby/utf8proc_native.so
pgsql-library: pgsql/utf8proc_pgsql.so
all: c-library ruby-library ruby-gem pgsql-library
clean::
rm -f utf8proc.o libutf8proc.a libutf8proc.so
cd ruby/ && test -e Makefile && (make clean && rm -f Makefile) || true
rm -Rf ruby/gem/lib ruby/gem/ext
rm -f ruby/gem/utf8proc-*.gem
cd pgsql/ && make clean
# real targets
utf8proc.o: utf8proc.h utf8proc.c utf8proc_data.c
$(cc) -c -o utf8proc.o utf8proc.c
libutf8proc.a: utf8proc.o
rm -f libutf8proc.a
ar rs libutf8proc.a utf8proc.o
libutf8proc.so: utf8proc.o
$(cc) -shared -o libutf8proc.so utf8proc.o
chmod a-x libutf8proc.so
libutf8proc.dylib: utf8proc.o
$(cc) -dynamiclib -o $@ $^ -install_name $(libdir)/$@
ruby/Makefile: ruby/extconf.rb
cd ruby && ruby extconf.rb
ruby/utf8proc_native.so: utf8proc.h utf8proc.c utf8proc_data.c \
ruby/utf8proc_native.c ruby/Makefile
cd ruby && make
ruby/gem/lib/utf8proc.rb: ruby/utf8proc.rb
test -e ruby/gem/lib || mkdir ruby/gem/lib
cp ruby/utf8proc.rb ruby/gem/lib/
ruby/gem/ext/extconf.rb: ruby/extconf.rb
test -e ruby/gem/ext || mkdir ruby/gem/ext
cp ruby/extconf.rb ruby/gem/ext/
ruby/gem/ext/utf8proc_native.c: utf8proc.h utf8proc_data.c utf8proc.c ruby/utf8proc_native.c
test -e ruby/gem/ext || mkdir ruby/gem/ext
cat utf8proc.h utf8proc_data.c utf8proc.c ruby/utf8proc_native.c | grep -v '#include "utf8proc.h"' | grep -v '#include "utf8proc_data.c"' | grep -v '#include "../utf8proc.c"' > ruby/gem/ext/utf8proc_native.c
ruby-gem:: ruby/gem/lib/utf8proc.rb ruby/gem/ext/extconf.rb ruby/gem/ext/utf8proc_native.c
cd ruby/gem && gem build utf8proc.gemspec
pgsql/utf8proc_pgsql.so: utf8proc.h utf8proc.c utf8proc_data.c \
pgsql/utf8proc_pgsql.c
cd pgsql && make

@@ -0,0 +1,116 @@
Please read the LICENSE file, which ships with this software.
*** QUICK START ***
For compilation of the C library call "make c-library", for compilation of
the ruby library call "make ruby-library" and for compilation of the
PostgreSQL extension call "make pgsql-library".
For ruby you can also create a gem-file by calling "make ruby-gem".
"make all" can be used to build everything, but both ruby and PostgreSQL
installations are required in this case.
*** GENERAL INFORMATION ***
The C library is found in this directory after successful compilation and
is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you choose to create a gem file, it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.
Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and are therefore not dependent on the dynamic version of the
C library files, but this behaviour might change in future releases.
The Unicode version being supported is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
version 5.0.0 had not been available at the time of implementation.
For Unicode normalizations, the following options have to be used:
Normalization Form C: STABLE, COMPOSE
Normalization Form D: STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT
*** C LIBRARY ***
The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping
UTF-8 strings, unless you want to allocate memory yourself.
*** RUBY API ***
The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.
The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"
The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in ruby are all
lowercase.
String#utf8map! is the destructive variant: the string itself is replaced
by the result.
There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd, String#utf8nfd!,
String#utf8nfc, String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!
The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
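The byte values in the second example follow directly from the UTF-8 bit layout. A minimal, self-contained C sketch of that encoding step is shown here; the function name sketch_utf8_encode and its validity rule (reject surrogates and anything above U+10FFFF) are assumptions made for illustration, not the library's own utf8proc_encode_char.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into dst, which must
 * have room for 4 bytes. Returns the number of bytes written, or 0 if the
 * code point is a surrogate or lies outside U+0000..U+10FFFF. */
size_t sketch_utf8_encode(int32_t cp, unsigned char *dst) {
  assert(dst != NULL);
  if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
  if (cp < 0x80) {                       /* 0xxxxxxx */
    dst[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
    dst[0] = (unsigned char)(0xC0 | (cp >> 6));
    dst[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
    dst[0] = (unsigned char)(0xE0 | (cp >> 12));
    dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  /* 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
  dst[0] = (unsigned char)(0xF0 | (cp >> 18));
  dst[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
  dst[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
  dst[3] = (unsigned char)(0x80 | (cp & 0x3F));
  return 4;
}
```

For U+2028 the sketch yields the bytes E2 80 A8, which is the same string as the octal form "\342\200\250" shown above.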
*** POSTGRESQL API ***
For PostgreSQL there are two SQL functions supplied named "unifold" and
"unistrip". These functions can be used to prepare index fields in
order to be folded in a way where string-comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".
CREATE TABLE people (
id serial8 primary key,
name text,
CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');
The function "unistrip" removes character marks like accents or diaeresis,
while "unifold" keeps them.
NOTICE: The outputs of the function can change between releases, as
utf8proc does not follow a versioning stability policy. You have to
rebuild your database indices, if you upgrade to a newer version
of utf8proc.
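To make the idea of folding before comparison concrete, here is a deliberately reduced C sketch. It covers only the two cases named above (dropping U+00AD SOFT HYPHEN and lowercasing ASCII letters); the real "unifold" also applies normalization, full Unicode case folding, lumping and more. The names sketch_fold and sketch_fold_equal are hypothetical and not part of utf8proc.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: write a folded copy of src into dst (capacity cap,
 * including the terminating NUL). Returns the folded length, or (size_t)-1
 * if dst is too small. */
size_t sketch_fold(const char *src, char *dst, size_t cap) {
  size_t n = 0;
  assert(src != NULL && dst != NULL);
  if (cap == 0) return (size_t)-1;
  for (size_t i = 0; src[i] != '\0'; i++) {
    unsigned char c = (unsigned char)src[i];
    /* U+00AD SOFT HYPHEN is the two-byte sequence C2 AD in UTF-8: skip it. */
    if (c == 0xC2 && (unsigned char)src[i + 1] == 0xAD) {
      i++;
      continue;
    }
    /* ASCII-only lowercasing; other bytes pass through unchanged. */
    if (c >= 'A' && c <= 'Z') c = (unsigned char)(c - 'A' + 'a');
    if (n + 1 >= cap) return (size_t)-1;
    dst[n++] = (char)c;
  }
  dst[n] = '\0';
  return n;
}

/* Returns 1 if both strings are equal after folding, 0 otherwise
 * (including when either string does not fit the fixed scratch buffers). */
int sketch_fold_equal(const char *a, const char *b) {
  char fa[256], fb[256];
  if (sketch_fold(a, fa, sizeof fa) == (size_t)-1) return 0;
  if (sketch_fold(b, fb, sizeof fb) == (size_t)-1) return 0;
  return strcmp(fa, fb) == 0;
}
```

Comparing folded forms on both sides is the same shape as the query above, where unifold() is applied to the column and to the search literal.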
*** TODO ***
- detect stable code points and process segments independently in order to
save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing
*** CONTACT ***
If you find any bugs or experience difficulties in compiling this software,
please contact us:
Project page: http://www.public-software-group.org/utf8proc

File diff suppressed because it is too large

@@ -0,0 +1,26 @@
U+0020 <-- all space characters (general category Zs)
U+0027 ' <-- left/right single quotation mark U+2018..2019,
modifier letter apostrophe U+02BC,
modifier letter vertical line U+02C8
U+002D - <-- all dash characters (general category Pd),
minus U+2212
U+002F / <-- fraction slash U+2044,
division slash U+2215
U+003A : <-- ratio U+2236
U+003C < <-- single left-pointing angle quotation mark U+2039,
left-pointing angle bracket U+2329,
left angle bracket U+3008
U+003E > <-- single right-pointing angle quotation mark U+203A,
right-pointing angle bracket U+232A,
right angle bracket U+3009
U+005C \ <-- set minus U+2216
U+005E ^ <-- modifier letter up arrowhead U+02C4,
modifier letter circumflex accent U+02C6,
caret U+2038,
up arrowhead U+2303
U+005F _ <-- all connector characters (general category Pc),
modifier letter low macron U+02CD
U+0060 ` <-- modifier letter grave accent U+02CB
U+007C | <-- divides U+2223
U+007E ~ <-- tilde operator U+223C
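The explicitly listed code points in the table above can be read as a plain lookup. The C sketch below transcribes only those rows; the category-based rows (Zs, Pd, Pc) are left out because they need Unicode property data that this sketch does not carry. The function names are hypothetical, and this is not how the library itself stores the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map one code point to its lumped ASCII replacement,
 * using only the individually listed code points from the table. Any other
 * code point is returned unchanged. */
int32_t sketch_lump(int32_t cp) {
  switch (cp) {
    case 0x2018: case 0x2019: case 0x02BC: case 0x02C8: return 0x0027;
    case 0x2212: return 0x002D;
    case 0x2044: case 0x2215: return 0x002F;
    case 0x2236: return 0x003A;
    case 0x2039: case 0x2329: case 0x3008: return 0x003C;
    case 0x203A: case 0x232A: case 0x3009: return 0x003E;
    case 0x2216: return 0x005C;
    case 0x02C4: case 0x02C6: case 0x2038: case 0x2303: return 0x005E;
    case 0x02CD: return 0x005F;
    case 0x02CB: return 0x0060;
    case 0x2223: return 0x007C;
    case 0x223C: return 0x007E;
    default: return cp;
  }
}

/* Apply the mapping in place to a buffer of code points. */
void sketch_lump_buffer(int32_t *buf, size_t len) {
  assert(buf != NULL || len == 0);
  for (size_t i = 0; i < len; i++) buf[i] = sketch_lump(buf[i]);
}
```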

@@ -0,0 +1,10 @@
utf8proc_pgsql.so: utf8proc_pgsql.o
ld -shared -o utf8proc_pgsql.so utf8proc_pgsql.o
utf8proc_pgsql.o: utf8proc_pgsql.c
gcc -Wall -fpic -c -I`pg_config --includedir-server` \
-o utf8proc_pgsql.o utf8proc_pgsql.c
clean:
rm -f *.o *.so

@@ -0,0 +1,6 @@
CREATE OR REPLACE FUNCTION unifold (text) RETURNS text
LANGUAGE 'C' IMMUTABLE STRICT AS '$libdir/utf8proc_pgsql.so',
'utf8proc_pgsql_unifold';
CREATE OR REPLACE FUNCTION unistrip (text) RETURNS text
LANGUAGE 'C' IMMUTABLE STRICT AS '$libdir/utf8proc_pgsql.so',
'utf8proc_pgsql_unistrip';

@@ -0,0 +1,139 @@
/*
* Copyright (c) Public Software Group e. V., Berlin, Germany
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
/*
* File name: pgsql/utf8proc_pgsql.c
*
* Description:
* PostgreSQL extension to provide two functions 'unifold' and 'unistrip',
* which can be used to case-fold and normalize index fields and
* optionally strip marks (e.g. accents) from strings.
*/
#include "../utf8proc.c"
#include <postgres.h>
#include <utils/elog.h>
#include <fmgr.h>
#include <string.h>
#include <unistd.h>
#include <utils/builtins.h>
#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif
#define UTF8PROC_PGSQL_FOLD_OPTS ( UTF8PROC_REJECTNA | UTF8PROC_COMPAT | \
UTF8PROC_COMPOSE | UTF8PROC_STABLE | UTF8PROC_IGNORE | UTF8PROC_STRIPCC | \
UTF8PROC_NLF2LF | UTF8PROC_CASEFOLD | UTF8PROC_LUMP )
#define UTF8PROC_PGSQL_STRIP_OPTS ( UTF8PROC_REJECTNA | UTF8PROC_COMPAT | \
UTF8PROC_COMPOSE | UTF8PROC_STABLE | UTF8PROC_IGNORE | UTF8PROC_STRIPCC | \
UTF8PROC_NLF2LF | UTF8PROC_CASEFOLD | UTF8PROC_LUMP | UTF8PROC_STRIPMARK )
ssize_t utf8proc_pgsql_utf8map(
text *input_string, text **output_string_ptr, int options
) {
ssize_t result;
text *output_string;
result = utf8proc_decompose(
VARDATA(input_string), VARSIZE(input_string) - VARHDRSZ,
NULL, 0, options
);
if (result < 0) return result;
if (result > (SIZE_MAX-1-VARHDRSZ)/sizeof(int32_t))
return UTF8PROC_ERROR_OVERFLOW;
/* reserve one extra byte for termination */
*output_string_ptr = palloc(result * sizeof(int32_t) + 1 + VARHDRSZ);
output_string = *output_string_ptr;
if (!output_string) return UTF8PROC_ERROR_NOMEM;
result = utf8proc_decompose(
VARDATA(input_string), VARSIZE(input_string) - VARHDRSZ,
(int32_t *)VARDATA(output_string), result, options
);
if (result < 0) return result;
result = utf8proc_reencode(
(int32_t *)VARDATA(output_string), result, options
);
if (result >= 0) SET_VARSIZE(output_string, result + VARHDRSZ);
return result;
}
void utf8proc_pgsql_utf8map_errchk(ssize_t result, text *output_string) {
if (result < 0) {
int sqlerrcode;
if (output_string) pfree(output_string);
switch(result) {
case UTF8PROC_ERROR_NOMEM:
sqlerrcode = ERRCODE_OUT_OF_MEMORY; break;
case UTF8PROC_ERROR_OVERFLOW:
sqlerrcode = ERRCODE_PROGRAM_LIMIT_EXCEEDED; break;
case UTF8PROC_ERROR_INVALIDUTF8:
case UTF8PROC_ERROR_NOTASSIGNED:
return;
default:
sqlerrcode = ERRCODE_INTERNAL_ERROR;
}
ereport(ERROR, (
errcode(sqlerrcode),
errmsg("%s", utf8proc_errmsg(result))
));
}
}
PG_FUNCTION_INFO_V1(utf8proc_pgsql_unifold);
Datum utf8proc_pgsql_unifold(PG_FUNCTION_ARGS) {
text *input_string;
text *output_string = NULL;
ssize_t result;
input_string = PG_GETARG_TEXT_P(0);
result = utf8proc_pgsql_utf8map(
input_string, &output_string, UTF8PROC_PGSQL_FOLD_OPTS
);
PG_FREE_IF_COPY(input_string, 0);
utf8proc_pgsql_utf8map_errchk(result, output_string);
if (result >= 0) {
PG_RETURN_TEXT_P(output_string);
} else {
PG_RETURN_NULL();
}
}
PG_FUNCTION_INFO_V1(utf8proc_pgsql_unistrip);
Datum utf8proc_pgsql_unistrip(PG_FUNCTION_ARGS) {
text *input_string;
text *output_string = NULL;
ssize_t result;
input_string = PG_GETARG_TEXT_P(0);
result = utf8proc_pgsql_utf8map(
input_string, &output_string, UTF8PROC_PGSQL_STRIP_OPTS
);
PG_FREE_IF_COPY(input_string, 0);
utf8proc_pgsql_utf8map_errchk(result, output_string);
if (result >= 0) {
PG_RETURN_TEXT_P(output_string);
} else {
PG_RETURN_NULL();
}
}
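The size check in utf8proc_pgsql_utf8map guards the multiplication that follows it. Isolated from PostgreSQL, the same pattern might look like the sketch below, where the header parameter stands in for VARHDRSZ and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return header + count * sizeof(int32_t) + 1, or 0 if
 * that sum would not fit in size_t. The division-based comparison rejects
 * an oversized count before any multiplication takes place, so the
 * arithmetic below can never wrap around. */
size_t sketch_buffer_size(size_t count, size_t header) {
  assert(header < SIZE_MAX - 1);
  if (count > (SIZE_MAX - 1 - header) / sizeof(int32_t)) return 0;
  return count * sizeof(int32_t) + 1 + header;
}
```

A caller would treat a return value of 0 as the overflow case, much as the function above returns UTF8PROC_ERROR_OVERFLOW instead of calling palloc.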

@@ -0,0 +1,2 @@
require 'mkmf'
create_makefile("utf8proc_native")

@@ -0,0 +1,64 @@
Copyright (c) 2009 Public Software Group e. V., Berlin, Germany
Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.
This software distribution contains derived data from a modified version of
the Unicode data files. The following license applies to that data:
COPYRIGHT AND PERMISSION NOTICE
Copyright (c) 1991-2007 Unicode, Inc. All rights reserved. Distributed
under the Terms of Use in http://www.unicode.org/copyright.html.
Permission is hereby granted, free of charge, to any person obtaining a
copy of the Unicode data files and any associated documentation (the "Data
Files") or Unicode software and any associated documentation (the
"Software") to deal in the Data Files or Software without restriction,
including without limitation the rights to use, copy, modify, merge,
publish, distribute, and/or sell copies of the Data Files or Software, and
to permit persons to whom the Data Files or Software are furnished to do
so, provided that (a) the above copyright notice(s) and this permission
notice appear with all copies of the Data Files or Software, (b) both the
above copyright notice(s) and this permission notice appear in associated
documentation, and (c) there is clear notice in each modified Data File or
in the Software as well as in the documentation associated with the Data
File(s) or Software that the data or software has been modified.
THE DATA FILES AND SOFTWARE ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY
KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT OF
THIRD PARTY RIGHTS. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR HOLDERS
INCLUDED IN THIS NOTICE BE LIABLE FOR ANY CLAIM, OR ANY SPECIAL INDIRECT OR
CONSEQUENTIAL DAMAGES, OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF
USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER
TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR
PERFORMANCE OF THE DATA FILES OR SOFTWARE.
Except as contained in this notice, the name of a copyright holder shall
not be used in advertising or otherwise to promote the sale, use or other
dealings in these Data Files or Software without prior written
authorization of the copyright holder.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and may be
registered in some jurisdictions. All other trademarks and registered
trademarks mentioned herein are the property of their respective owners.

@@ -0,0 +1,12 @@
require 'rubygems'
SPEC = Gem::Specification.new do |s|
s.name = 'utf8proc'
s.version = '1.1.5'
s.author = 'Public Software Group e. V., Berlin, Germany'
s.homepage = 'http://www.public-software-group.org/utf8proc'
s.summary = 'UTF-8 Unicode string processing'
s.files = ['LICENSE', 'lib/utf8proc.rb', 'ext/utf8proc_native.c']
s.require_path = 'lib/'
s.extensions = ['ext/extconf.rb']
s.has_rdoc = false
end

@@ -0,0 +1,98 @@
# Copyright (c) 2009 Public Software Group e. V., Berlin, Germany
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.
#
# File name: ruby/utf8proc.rb
#
# Description:
# Part of the ruby wrapper for libutf8proc, which is written in ruby.
#
require 'utf8proc_native'
module Utf8Proc
SpecialChars = {
:HT => "\x09",
:LF => "\x0A",
:VT => "\x0B",
:FF => "\x0C",
:CR => "\x0D",
:FS => "\x1C",
:GS => "\x1D",
:RS => "\x1E",
:US => "\x1F",
:LS => "\xE2\x80\xA8",
:PS => "\xE2\x80\xA9",
}
module StringExtensions
def utf8map(*option_array)
options = 0
option_array.each do |option|
flag = Utf8Proc::Options[option]
raise ArgumentError, "Unknown argument given to String#utf8map." unless
flag
options |= flag
end
return Utf8Proc::utf8map(self, options)
end
def utf8map!(*option_array)
self.replace(self.utf8map(*option_array))
end
def utf8nfd; utf8map( :stable, :decompose); end
def utf8nfd!; utf8map!(:stable, :decompose); end
def utf8nfc; utf8map( :stable, :compose); end
def utf8nfc!; utf8map!(:stable, :compose); end
def utf8nfkd; utf8map( :stable, :decompose, :compat); end
def utf8nfkd!; utf8map!(:stable, :decompose, :compat); end
def utf8nfkc; utf8map( :stable, :compose, :compat); end
def utf8nfkc!; utf8map!(:stable, :compose, :compat); end
def utf8chars
result = self.utf8map(:charbound).split("\377")
result.shift if result.first == ""
result
end
def char_ary
# deprecated, use String#utf8chars instead
utf8chars
end
end
module IntegerExtensions
def utf8
return Utf8Proc::utf8char(self)
end
end
end
class String
include(Utf8Proc::StringExtensions)
end
class Integer
include(Utf8Proc::IntegerExtensions)
end

@@ -0,0 +1,160 @@
/*
* Copyright (c) 2009 Public Software Group e. V., Berlin, Germany
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
/*
* File name: ruby/utf8proc_native.c
*
* Description:
* Native part of the ruby wrapper for libutf8proc.
*/
#include "../utf8proc.c"
#include "ruby.h"
#ifndef RSTRING_PTR
#define RSTRING_PTR(s) (RSTRING(s)->ptr)
#endif
#ifndef RSTRING_LEN
#define RSTRING_LEN(s) (RSTRING(s)->len)
#endif
typedef struct utf8proc_ruby_mapenv_struct {
int32_t *buffer;
} utf8proc_ruby_mapenv_t;
void utf8proc_ruby_mapenv_free(utf8proc_ruby_mapenv_t *env) {
free(env->buffer);
free(env);
}
VALUE utf8proc_ruby_module;
VALUE utf8proc_ruby_options;
VALUE utf8proc_ruby_eUnicodeError;
VALUE utf8proc_ruby_eInvalidUtf8Error;
VALUE utf8proc_ruby_eCodeNotAssignedError;
VALUE utf8proc_ruby_map_error(ssize_t result) {
VALUE excpt_class;
switch (result) {
case UTF8PROC_ERROR_NOMEM:
excpt_class = rb_eNoMemError; break;
case UTF8PROC_ERROR_OVERFLOW:
case UTF8PROC_ERROR_INVALIDOPTS:
excpt_class = rb_eArgError; break;
case UTF8PROC_ERROR_INVALIDUTF8:
excpt_class = utf8proc_ruby_eInvalidUtf8Error; break;
case UTF8PROC_ERROR_NOTASSIGNED:
excpt_class = utf8proc_ruby_eCodeNotAssignedError; break;
default:
excpt_class = rb_eRuntimeError;
}
rb_raise(excpt_class, "%s", utf8proc_errmsg(result));
return Qnil;
}
VALUE utf8proc_ruby_map(VALUE self, VALUE str_param, VALUE options_param) {
VALUE str;
int options;
VALUE env_obj;
utf8proc_ruby_mapenv_t *env;
ssize_t result;
VALUE retval;
str = StringValue(str_param);
options = NUM2INT(options_param) & ~UTF8PROC_NULLTERM;
env_obj = Data_Make_Struct(rb_cObject, utf8proc_ruby_mapenv_t, NULL,
utf8proc_ruby_mapenv_free, env);
result = utf8proc_decompose(RSTRING_PTR(str), RSTRING_LEN(str),
NULL, 0, options);
if (result < 0) {
utf8proc_ruby_map_error(result);
return Qnil; /* needed to prevent problems with optimization */
}
env->buffer = ALLOC_N(int32_t, result+1);
result = utf8proc_decompose(RSTRING_PTR(str), RSTRING_LEN(str),
env->buffer, result, options);
if (result < 0) {
free(env->buffer);
env->buffer = 0;
utf8proc_ruby_map_error(result);
return Qnil; /* needed to prevent problems with optimization */
}
result = utf8proc_reencode(env->buffer, result, options);
if (result < 0) {
free(env->buffer);
env->buffer = 0;
utf8proc_ruby_map_error(result);
return Qnil; /* needed to prevent problems with optimization */
}
retval = rb_str_new((char *)env->buffer, result);
free(env->buffer);
env->buffer = 0;
return retval;
}
static VALUE utf8proc_ruby_char(VALUE self, VALUE code_param) {
char buffer[4];
ssize_t result;
int uc;
uc = NUM2INT(code_param);
if (!utf8proc_codepoint_valid(uc))
rb_raise(rb_eArgError, "Invalid Unicode code point");
result = utf8proc_encode_char(uc, buffer);
return rb_str_new(buffer, result);
}
#define register_utf8proc_option(sym, field) \
rb_hash_aset(utf8proc_ruby_options, ID2SYM(rb_intern(sym)), INT2FIX(field))
void Init_utf8proc_native() {
utf8proc_ruby_module = rb_define_module("Utf8Proc");
rb_define_module_function(utf8proc_ruby_module, "utf8map",
utf8proc_ruby_map, 2);
rb_define_module_function(utf8proc_ruby_module, "utf8char",
utf8proc_ruby_char, 1);
utf8proc_ruby_eUnicodeError = rb_define_class_under(utf8proc_ruby_module,
"UnicodeError", rb_eStandardError);
utf8proc_ruby_eInvalidUtf8Error = rb_define_class_under(
utf8proc_ruby_module, "InvalidUtf8Error", utf8proc_ruby_eUnicodeError);
utf8proc_ruby_eCodeNotAssignedError = rb_define_class_under(
utf8proc_ruby_module, "CodeNotAssignedError",
utf8proc_ruby_eUnicodeError);
utf8proc_ruby_options = rb_hash_new();
register_utf8proc_option("stable", UTF8PROC_STABLE);
register_utf8proc_option("compat", UTF8PROC_COMPAT);
register_utf8proc_option("compose", UTF8PROC_COMPOSE);
register_utf8proc_option("decompose", UTF8PROC_DECOMPOSE);
register_utf8proc_option("ignore", UTF8PROC_IGNORE);
register_utf8proc_option("rejectna", UTF8PROC_REJECTNA);
register_utf8proc_option("nlf2ls", UTF8PROC_NLF2LS);
register_utf8proc_option("nlf2ps", UTF8PROC_NLF2PS);
register_utf8proc_option("nlf2lf", UTF8PROC_NLF2LF);
register_utf8proc_option("stripcc", UTF8PROC_STRIPCC);
register_utf8proc_option("casefold", UTF8PROC_CASEFOLD);
register_utf8proc_option("charbound", UTF8PROC_CHARBOUND);
register_utf8proc_option("lump", UTF8PROC_LUMP);
register_utf8proc_option("stripmark", UTF8PROC_STRIPMARK);
OBJ_FREEZE(utf8proc_ruby_options);
rb_define_const(utf8proc_ruby_module, "Options", utf8proc_ruby_options);
}

@@ -0,0 +1,587 @@
/*
* Copyright (c) 2009 Public Software Group e. V., Berlin, Germany
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
/*
* This library contains derived data from a modified version of the
* Unicode data files.
*
* The original data files are available at
* http://www.unicode.org/Public/UNIDATA/
*
* Please notice the copyright statement in the file "utf8proc_data.c".
*/
/*
* File name: utf8proc.c
*
* Description:
* Implementation of libutf8proc.
*/
#include "utf8proc.h"
#include "utf8proc_data.c"
const int8_t utf8proc_utf8class[256] = {
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
4, 4, 4, 4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0 };
#define UTF8PROC_HANGUL_SBASE 0xAC00
#define UTF8PROC_HANGUL_LBASE 0x1100
#define UTF8PROC_HANGUL_VBASE 0x1161
#define UTF8PROC_HANGUL_TBASE 0x11A7
#define UTF8PROC_HANGUL_LCOUNT 19
#define UTF8PROC_HANGUL_VCOUNT 21
#define UTF8PROC_HANGUL_TCOUNT 28
#define UTF8PROC_HANGUL_NCOUNT 588
#define UTF8PROC_HANGUL_SCOUNT 11172
/* END is exclusive */
#define UTF8PROC_HANGUL_L_START 0x1100
#define UTF8PROC_HANGUL_L_END 0x115A
#define UTF8PROC_HANGUL_L_FILLER 0x115F
#define UTF8PROC_HANGUL_V_START 0x1160
#define UTF8PROC_HANGUL_V_END 0x11A3
#define UTF8PROC_HANGUL_T_START 0x11A8
#define UTF8PROC_HANGUL_T_END 0x11FA
#define UTF8PROC_HANGUL_S_START 0xAC00
#define UTF8PROC_HANGUL_S_END 0xD7A4
#define UTF8PROC_BOUNDCLASS_START 0
#define UTF8PROC_BOUNDCLASS_OTHER 1
#define UTF8PROC_BOUNDCLASS_CR 2
#define UTF8PROC_BOUNDCLASS_LF 3
#define UTF8PROC_BOUNDCLASS_CONTROL 4
#define UTF8PROC_BOUNDCLASS_EXTEND 5
#define UTF8PROC_BOUNDCLASS_L 6
#define UTF8PROC_BOUNDCLASS_V 7
#define UTF8PROC_BOUNDCLASS_T 8
#define UTF8PROC_BOUNDCLASS_LV 9
#define UTF8PROC_BOUNDCLASS_LVT 10
const char *utf8proc_version(void) {
return "1.1.5";
}
const char *utf8proc_errmsg(ssize_t errcode) {
switch (errcode) {
case UTF8PROC_ERROR_NOMEM:
return "Memory for processing UTF-8 data could not be allocated.";
case UTF8PROC_ERROR_OVERFLOW:
return "UTF-8 string is too long to be processed.";
case UTF8PROC_ERROR_INVALIDUTF8:
return "Invalid UTF-8 string";
case UTF8PROC_ERROR_NOTASSIGNED:
return "Unassigned Unicode code point found in UTF-8 string.";
case UTF8PROC_ERROR_INVALIDOPTS:
return "Invalid options for UTF-8 processing chosen.";
default:
return "An unknown error occurred while processing UTF-8 data.";
}
}
ssize_t utf8proc_iterate(
const uint8_t *str, ssize_t strlen, int32_t *dst
) {
int length;
int i;
int32_t uc = -1;
*dst = -1;
if (!strlen) return 0;
length = utf8proc_utf8class[str[0]];
if (!length) return UTF8PROC_ERROR_INVALIDUTF8;
if (strlen >= 0 && length > strlen) return UTF8PROC_ERROR_INVALIDUTF8;
for (i=1; i<length; i++) {
if ((str[i] & 0xC0) != 0x80) return UTF8PROC_ERROR_INVALIDUTF8;
}
switch (length) {
case 1:
uc = str[0];
break;
case 2:
uc = ((str[0] & 0x1F) << 6) + (str[1] & 0x3F);
if (uc < 0x80) uc = -1;
break;
case 3:
uc = ((str[0] & 0x0F) << 12) + ((str[1] & 0x3F) << 6)
+ (str[2] & 0x3F);
if (uc < 0x800 || (uc >= 0xD800 && uc < 0xE000) ||
(uc >= 0xFDD0 && uc < 0xFDF0)) uc = -1;
break;
case 4:
uc = ((str[0] & 0x07) << 18) + ((str[1] & 0x3F) << 12)
+ ((str[2] & 0x3F) << 6) + (str[3] & 0x3F);
if (uc < 0x10000 || uc >= 0x110000) uc = -1;
break;
}
if (uc < 0 || ((uc & 0xFFFF) >= 0xFFFE))
return UTF8PROC_ERROR_INVALIDUTF8;
*dst = uc;
return length;
}
bool utf8proc_codepoint_valid(int32_t uc) {
if (uc < 0 || uc >= 0x110000 ||
((uc & 0xFFFF) >= 0xFFFE) || (uc >= 0xD800 && uc < 0xE000) ||
(uc >= 0xFDD0 && uc < 0xFDF0)) return false;
else return true;
}
ssize_t utf8proc_encode_char(int32_t uc, uint8_t *dst) {
if (uc < 0x00) {
return 0;
} else if (uc < 0x80) {
dst[0] = uc;
return 1;
} else if (uc < 0x800) {
dst[0] = 0xC0 + (uc >> 6);
dst[1] = 0x80 + (uc & 0x3F);
return 2;
} else if (uc == 0xFFFF) {
dst[0] = 0xFF;
return 1;
} else if (uc == 0xFFFE) {
dst[0] = 0xFE;
return 1;
} else if (uc < 0x10000) {
dst[0] = 0xE0 + (uc >> 12);
dst[1] = 0x80 + ((uc >> 6) & 0x3F);
dst[2] = 0x80 + (uc & 0x3F);
return 3;
} else if (uc < 0x110000) {
dst[0] = 0xF0 + (uc >> 18);
dst[1] = 0x80 + ((uc >> 12) & 0x3F);
dst[2] = 0x80 + ((uc >> 6) & 0x3F);
dst[3] = 0x80 + (uc & 0x3F);
return 4;
} else return 0;
}
const utf8proc_property_t *utf8proc_get_property(int32_t uc) {
/* ASSERT: uc >= 0 && uc < 0x110000 */
return utf8proc_properties + (
utf8proc_stage2table[
utf8proc_stage1table[uc >> 8] + (uc & 0xFF)
]
);
}
#define utf8proc_decompose_lump(replacement_uc) \
return utf8proc_decompose_char((replacement_uc), dst, bufsize, \
options & ~UTF8PROC_LUMP, last_boundclass)
ssize_t utf8proc_decompose_char(int32_t uc, int32_t *dst, ssize_t bufsize,
int options, int *last_boundclass) {
/* ASSERT: uc >= 0 && uc < 0x110000 */
const utf8proc_property_t *property;
utf8proc_propval_t category;
int32_t hangul_sindex;
property = utf8proc_get_property(uc);
category = property->category;
hangul_sindex = uc - UTF8PROC_HANGUL_SBASE;
if (options & (UTF8PROC_COMPOSE|UTF8PROC_DECOMPOSE)) {
if (hangul_sindex >= 0 && hangul_sindex < UTF8PROC_HANGUL_SCOUNT) {
int32_t hangul_tindex;
if (bufsize >= 1) {
dst[0] = UTF8PROC_HANGUL_LBASE +
hangul_sindex / UTF8PROC_HANGUL_NCOUNT;
if (bufsize >= 2) dst[1] = UTF8PROC_HANGUL_VBASE +
(hangul_sindex % UTF8PROC_HANGUL_NCOUNT) / UTF8PROC_HANGUL_TCOUNT;
}
hangul_tindex = hangul_sindex % UTF8PROC_HANGUL_TCOUNT;
if (!hangul_tindex) return 2;
if (bufsize >= 3) dst[2] = UTF8PROC_HANGUL_TBASE + hangul_tindex;
return 3;
}
}
if (options & UTF8PROC_REJECTNA) {
if (!category) return UTF8PROC_ERROR_NOTASSIGNED;
}
if (options & UTF8PROC_IGNORE) {
if (property->ignorable) return 0;
}
if (options & UTF8PROC_LUMP) {
if (category == UTF8PROC_CATEGORY_ZS) utf8proc_decompose_lump(0x0020);
if (uc == 0x2018 || uc == 0x2019 || uc == 0x02BC || uc == 0x02C8)
utf8proc_decompose_lump(0x0027);
if (category == UTF8PROC_CATEGORY_PD || uc == 0x2212)
utf8proc_decompose_lump(0x002D);
if (uc == 0x2044 || uc == 0x2215) utf8proc_decompose_lump(0x002F);
if (uc == 0x2236) utf8proc_decompose_lump(0x003A);
if (uc == 0x2039 || uc == 0x2329 || uc == 0x3008)
utf8proc_decompose_lump(0x003C);
if (uc == 0x203A || uc == 0x232A || uc == 0x3009)
utf8proc_decompose_lump(0x003E);
if (uc == 0x2216) utf8proc_decompose_lump(0x005C);
if (uc == 0x02C4 || uc == 0x02C6 || uc == 0x2038 || uc == 0x2303)
utf8proc_decompose_lump(0x005E);
if (category == UTF8PROC_CATEGORY_PC || uc == 0x02CD)
utf8proc_decompose_lump(0x005F);
if (uc == 0x02CB) utf8proc_decompose_lump(0x0060);
if (uc == 0x2223) utf8proc_decompose_lump(0x007C);
if (uc == 0x223C) utf8proc_decompose_lump(0x007E);
if ((options & UTF8PROC_NLF2LS) && (options & UTF8PROC_NLF2PS)) {
if (category == UTF8PROC_CATEGORY_ZL ||
category == UTF8PROC_CATEGORY_ZP)
utf8proc_decompose_lump(0x000A);
}
}
if (options & UTF8PROC_STRIPMARK) {
if (category == UTF8PROC_CATEGORY_MN ||
category == UTF8PROC_CATEGORY_MC ||
category == UTF8PROC_CATEGORY_ME) return 0;
}
if (options & UTF8PROC_CASEFOLD) {
if (property->casefold_mapping) {
const int32_t *casefold_entry;
ssize_t written = 0;
for (casefold_entry = property->casefold_mapping;
*casefold_entry >= 0; casefold_entry++) {
written += utf8proc_decompose_char(*casefold_entry, dst+written,
(bufsize > written) ? (bufsize - written) : 0, options,
last_boundclass);
if (written < 0) return UTF8PROC_ERROR_OVERFLOW;
}
return written;
}
}
if (options & (UTF8PROC_COMPOSE|UTF8PROC_DECOMPOSE)) {
if (property->decomp_mapping &&
(!property->decomp_type || (options & UTF8PROC_COMPAT))) {
const int32_t *decomp_entry;
ssize_t written = 0;
for (decomp_entry = property->decomp_mapping;
*decomp_entry >= 0; decomp_entry++) {
written += utf8proc_decompose_char(*decomp_entry, dst+written,
(bufsize > written) ? (bufsize - written) : 0, options,
last_boundclass);
if (written < 0) return UTF8PROC_ERROR_OVERFLOW;
}
return written;
}
}
if (options & UTF8PROC_CHARBOUND) {
bool boundary;
int tbc, lbc;
tbc =
(uc == 0x000D) ? UTF8PROC_BOUNDCLASS_CR :
(uc == 0x000A) ? UTF8PROC_BOUNDCLASS_LF :
((category == UTF8PROC_CATEGORY_ZL ||
category == UTF8PROC_CATEGORY_ZP ||
category == UTF8PROC_CATEGORY_CC ||
category == UTF8PROC_CATEGORY_CF) &&
!(uc == 0x200C || uc == 0x200D)) ? UTF8PROC_BOUNDCLASS_CONTROL :
property->extend ? UTF8PROC_BOUNDCLASS_EXTEND :
((uc >= UTF8PROC_HANGUL_L_START && uc < UTF8PROC_HANGUL_L_END) ||
uc == UTF8PROC_HANGUL_L_FILLER) ? UTF8PROC_BOUNDCLASS_L :
(uc >= UTF8PROC_HANGUL_V_START && uc < UTF8PROC_HANGUL_V_END) ?
UTF8PROC_BOUNDCLASS_V :
(uc >= UTF8PROC_HANGUL_T_START && uc < UTF8PROC_HANGUL_T_END) ?
UTF8PROC_BOUNDCLASS_T :
(uc >= UTF8PROC_HANGUL_S_START && uc < UTF8PROC_HANGUL_S_END) ? (
((uc-UTF8PROC_HANGUL_SBASE) % UTF8PROC_HANGUL_TCOUNT == 0) ?
UTF8PROC_BOUNDCLASS_LV : UTF8PROC_BOUNDCLASS_LVT
) :
UTF8PROC_BOUNDCLASS_OTHER;
lbc = *last_boundclass;
boundary =
(tbc == UTF8PROC_BOUNDCLASS_EXTEND) ? false :
(lbc == UTF8PROC_BOUNDCLASS_START) ? true :
(lbc == UTF8PROC_BOUNDCLASS_CR &&
tbc == UTF8PROC_BOUNDCLASS_LF) ? false :
(lbc == UTF8PROC_BOUNDCLASS_CONTROL) ? true :
(tbc == UTF8PROC_BOUNDCLASS_CONTROL) ? true :
(lbc == UTF8PROC_BOUNDCLASS_L &&
(tbc == UTF8PROC_BOUNDCLASS_L ||
tbc == UTF8PROC_BOUNDCLASS_V ||
tbc == UTF8PROC_BOUNDCLASS_LV ||
tbc == UTF8PROC_BOUNDCLASS_LVT)) ? false :
((lbc == UTF8PROC_BOUNDCLASS_LV ||
lbc == UTF8PROC_BOUNDCLASS_V) &&
(tbc == UTF8PROC_BOUNDCLASS_V ||
tbc == UTF8PROC_BOUNDCLASS_T)) ? false :
((lbc == UTF8PROC_BOUNDCLASS_LVT ||
lbc == UTF8PROC_BOUNDCLASS_T) &&
tbc == UTF8PROC_BOUNDCLASS_T) ? false :
true;
*last_boundclass = tbc;
if (boundary) {
if (bufsize >= 1) dst[0] = 0xFFFF;
if (bufsize >= 2) dst[1] = uc;
return 2;
}
}
if (bufsize >= 1) *dst = uc;
return 1;
}
ssize_t utf8proc_decompose(
const uint8_t *str, ssize_t strlen,
int32_t *buffer, ssize_t bufsize, int options
) {
/* strlen will be ignored, if UTF8PROC_NULLTERM is set in options */
ssize_t wpos = 0;
if ((options & UTF8PROC_COMPOSE) && (options & UTF8PROC_DECOMPOSE))
return UTF8PROC_ERROR_INVALIDOPTS;
if ((options & UTF8PROC_STRIPMARK) &&
!(options & UTF8PROC_COMPOSE) && !(options & UTF8PROC_DECOMPOSE))
return UTF8PROC_ERROR_INVALIDOPTS;
{
int32_t uc;
ssize_t rpos = 0;
ssize_t decomp_result;
int boundclass = UTF8PROC_BOUNDCLASS_START;
while (1) {
if (options & UTF8PROC_NULLTERM) {
rpos += utf8proc_iterate(str + rpos, -1, &uc);
/* checking of return value is not necessary,
as 'uc' is < 0 in case of error */
if (uc < 0) return UTF8PROC_ERROR_INVALIDUTF8;
if (rpos < 0) return UTF8PROC_ERROR_OVERFLOW;
if (uc == 0) break;
} else {
if (rpos >= strlen) break;
rpos += utf8proc_iterate(str + rpos, strlen - rpos, &uc);
if (uc < 0) return UTF8PROC_ERROR_INVALIDUTF8;
}
decomp_result = utf8proc_decompose_char(
uc, buffer + wpos, (bufsize > wpos) ? (bufsize - wpos) : 0, options,
&boundclass
);
if (decomp_result < 0) return decomp_result;
wpos += decomp_result;
/* prohibiting integer overflows due to too long strings: */
if (wpos < 0 || wpos > SSIZE_MAX/sizeof(int32_t)/2)
return UTF8PROC_ERROR_OVERFLOW;
}
}
if ((options & (UTF8PROC_COMPOSE|UTF8PROC_DECOMPOSE)) && bufsize >= wpos) {
ssize_t pos = 0;
while (pos < wpos-1) {
int32_t uc1, uc2;
const utf8proc_property_t *property1, *property2;
uc1 = buffer[pos];
uc2 = buffer[pos+1];
property1 = utf8proc_get_property(uc1);
property2 = utf8proc_get_property(uc2);
if (property1->combining_class > property2->combining_class &&
property2->combining_class > 0) {
buffer[pos] = uc2;
buffer[pos+1] = uc1;
if (pos > 0) pos--; else pos++;
} else {
pos++;
}
}
}
return wpos;
}
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options) {
/* UTF8PROC_NULLTERM option will be ignored, 'length' is never ignored
ASSERT: 'buffer' has one spare byte of free space at the end! */
if (options & (UTF8PROC_NLF2LS | UTF8PROC_NLF2PS | UTF8PROC_STRIPCC)) {
ssize_t rpos;
ssize_t wpos = 0;
int32_t uc;
for (rpos = 0; rpos < length; rpos++) {
uc = buffer[rpos];
if (uc == 0x000D && rpos < length-1 && buffer[rpos+1] == 0x000A) rpos++;
if (uc == 0x000A || uc == 0x000D || uc == 0x0085 ||
((options & UTF8PROC_STRIPCC) && (uc == 0x000B || uc == 0x000C))) {
if (options & UTF8PROC_NLF2LS) {
if (options & UTF8PROC_NLF2PS) {
buffer[wpos++] = 0x000A;
} else {
buffer[wpos++] = 0x2028;
}
} else {
if (options & UTF8PROC_NLF2PS) {
buffer[wpos++] = 0x2029;
} else {
buffer[wpos++] = 0x0020;
}
}
} else if ((options & UTF8PROC_STRIPCC) &&
(uc < 0x0020 || (uc >= 0x007F && uc < 0x00A0))) {
if (uc == 0x0009) buffer[wpos++] = 0x0020;
} else {
buffer[wpos++] = uc;
}
}
length = wpos;
}
if (options & UTF8PROC_COMPOSE) {
int32_t *starter = NULL;
int32_t current_char;
const utf8proc_property_t *starter_property = NULL, *current_property;
utf8proc_propval_t max_combining_class = -1;
ssize_t rpos;
ssize_t wpos = 0;
int32_t composition;
for (rpos = 0; rpos < length; rpos++) {
current_char = buffer[rpos];
current_property = utf8proc_get_property(current_char);
if (starter && current_property->combining_class > max_combining_class) {
/* combination perhaps possible */
int32_t hangul_lindex;
int32_t hangul_sindex;
hangul_lindex = *starter - UTF8PROC_HANGUL_LBASE;
if (hangul_lindex >= 0 && hangul_lindex < UTF8PROC_HANGUL_LCOUNT) {
int32_t hangul_vindex;
hangul_vindex = current_char - UTF8PROC_HANGUL_VBASE;
if (hangul_vindex >= 0 && hangul_vindex < UTF8PROC_HANGUL_VCOUNT) {
*starter = UTF8PROC_HANGUL_SBASE +
(hangul_lindex * UTF8PROC_HANGUL_VCOUNT + hangul_vindex) *
UTF8PROC_HANGUL_TCOUNT;
starter_property = NULL;
continue;
}
}
hangul_sindex = *starter - UTF8PROC_HANGUL_SBASE;
if (hangul_sindex >= 0 && hangul_sindex < UTF8PROC_HANGUL_SCOUNT &&
(hangul_sindex % UTF8PROC_HANGUL_TCOUNT) == 0) {
int32_t hangul_tindex;
hangul_tindex = current_char - UTF8PROC_HANGUL_TBASE;
if (hangul_tindex >= 0 && hangul_tindex < UTF8PROC_HANGUL_TCOUNT) {
*starter += hangul_tindex;
starter_property = NULL;
continue;
}
}
if (!starter_property) {
starter_property = utf8proc_get_property(*starter);
}
if (starter_property->comb1st_index >= 0 &&
current_property->comb2nd_index >= 0) {
composition = utf8proc_combinations[
starter_property->comb1st_index +
current_property->comb2nd_index
];
if (composition >= 0 && (!(options & UTF8PROC_STABLE) ||
!(utf8proc_get_property(composition)->comp_exclusion))) {
*starter = composition;
starter_property = NULL;
continue;
}
}
}
buffer[wpos] = current_char;
if (current_property->combining_class) {
if (current_property->combining_class > max_combining_class) {
max_combining_class = current_property->combining_class;
}
} else {
starter = buffer + wpos;
starter_property = NULL;
max_combining_class = -1;
}
wpos++;
}
length = wpos;
}
{
ssize_t rpos, wpos = 0;
int32_t uc;
for (rpos = 0; rpos < length; rpos++) {
uc = buffer[rpos];
wpos += utf8proc_encode_char(uc, ((uint8_t *)buffer) + wpos);
}
((uint8_t *)buffer)[wpos] = 0;
return wpos;
}
}
ssize_t utf8proc_map(
const uint8_t *str, ssize_t strlen, uint8_t **dstptr, int options
) {
int32_t *buffer;
ssize_t result;
*dstptr = NULL;
result = utf8proc_decompose(str, strlen, NULL, 0, options);
if (result < 0) return result;
buffer = malloc(result * sizeof(int32_t) + 1);
if (!buffer) return UTF8PROC_ERROR_NOMEM;
result = utf8proc_decompose(str, strlen, buffer, result, options);
if (result < 0) {
free(buffer);
return result;
}
result = utf8proc_reencode(buffer, result, options);
if (result < 0) {
free(buffer);
return result;
}
{
int32_t *newptr;
newptr = realloc(buffer, (size_t)result+1);
if (newptr) buffer = newptr;
}
*dstptr = (uint8_t *)buffer;
return result;
}
uint8_t *utf8proc_NFD(const uint8_t *str) {
uint8_t *retval;
utf8proc_map(str, 0, &retval, UTF8PROC_NULLTERM | UTF8PROC_STABLE |
UTF8PROC_DECOMPOSE);
return retval;
}
uint8_t *utf8proc_NFC(const uint8_t *str) {
uint8_t *retval;
utf8proc_map(str, 0, &retval, UTF8PROC_NULLTERM | UTF8PROC_STABLE |
UTF8PROC_COMPOSE);
return retval;
}
uint8_t *utf8proc_NFKD(const uint8_t *str) {
uint8_t *retval;
utf8proc_map(str, 0, &retval, UTF8PROC_NULLTERM | UTF8PROC_STABLE |
UTF8PROC_DECOMPOSE | UTF8PROC_COMPAT);
return retval;
}
uint8_t *utf8proc_NFKC(const uint8_t *str) {
uint8_t *retval;
utf8proc_map(str, 0, &retval, UTF8PROC_NULLTERM | UTF8PROC_STABLE |
UTF8PROC_COMPOSE | UTF8PROC_COMPAT);
return retval;
}
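The Hangul branch of utf8proc_decompose_char and the matching branch of utf8proc_reencode are pure index arithmetic on the UTF8PROC_HANGUL_* constants. The following is a minimal standalone sketch of that arithmetic, not part of the imported library: the helper names hangul_decompose and hangul_compose are hypothetical, the constants are copied from utf8proc.c with shortened names, and unlike the library code it assumes the output array always has room for three entries.

```c
#include <assert.h>
#include <stdint.h>

/* Constants copied from utf8proc.c (names shortened for this sketch). */
#define SBASE  0xAC00
#define LBASE  0x1100
#define VBASE  0x1161
#define TBASE  0x11A7
#define LCOUNT 19
#define VCOUNT 21
#define TCOUNT 28
#define NCOUNT (VCOUNT * TCOUNT)  /* 588 */
#define SCOUNT (LCOUNT * NCOUNT)  /* 11172 */

/* Hypothetical helper: splits a precomposed Hangul syllable into jamo.
 * Writes 2 or 3 code points to out[] and returns how many were written.
 * Returns 0 if 's' is not a precomposed Hangul syllable. */
static int hangul_decompose(int32_t s, int32_t out[3]) {
  int32_t sindex = s - SBASE;
  int32_t tindex;
  if (sindex < 0 || sindex >= SCOUNT) return 0;
  out[0] = LBASE + sindex / NCOUNT;             /* leading consonant */
  out[1] = VBASE + (sindex % NCOUNT) / TCOUNT;  /* vowel */
  tindex = sindex % TCOUNT;
  if (tindex == 0) return 2;                    /* no trailing consonant */
  out[2] = TBASE + tindex;                      /* trailing consonant */
  return 3;
}

/* Hypothetical helper: the inverse mapping.
 * Pass t == 0 for a syllable without a trailing consonant. */
static int32_t hangul_compose(int32_t l, int32_t v, int32_t t) {
  int32_t lindex = l - LBASE;
  int32_t vindex = v - VBASE;
  int32_t tindex = t ? t - TBASE : 0;
  assert(lindex >= 0 && lindex < LCOUNT);
  assert(vindex >= 0 && vindex < VCOUNT);
  assert(tindex >= 0 && tindex < TCOUNT);
  return SBASE + (lindex * VCOUNT + vindex) * TCOUNT + tindex;
}
```

For example, U+D55C has syllable index 10588, which splits into leading index 18, vowel index 0 and trailing index 4, giving U+1112, U+1161 and U+11AB. Composing those three code points yields U+D55C again.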


@@ -0,0 +1,385 @@
/*
* Copyright (c) 2009 Public Software Group e. V., Berlin, Germany
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
* FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
* DEALINGS IN THE SOFTWARE.
*/
/*
* File name: utf8proc.h
*
* Description:
* Header file for libutf8proc, which is a mapping tool for UTF-8 strings
* with the following features:
* - decomposing and composing of strings
* - replacing compatibility characters with their equivalents
* - stripping of "default ignorable characters"
* like SOFT-HYPHEN or ZERO-WIDTH-SPACE
* - folding of certain characters for string comparison
* (e.g. HYPHEN U+2010 and MINUS U+2212 to ASCII "-")
* (see "LUMP" option)
* - optional rejection of strings containing non-assigned code points
* - stripping of control characters
* - stripping of character marks (accents, etc.)
* - transformation of LF, CRLF, CR and NEL to line-feed (LF)
* or to the unicode characters for paragraph separation (PS)
* or line separation (LS).
* - unicode case folding (for case insensitive string comparisons)
* - rejection of illegal UTF-8 data
* (i.e. UTF-8 encoded UTF-16 surrogates)
* - support for Korean Hangul characters
* Unicode Version 5.0.0 is supported.
*/
#ifndef UTF8PROC_H
#define UTF8PROC_H
#include <stdlib.h>
#include <sys/types.h>
#ifdef _MSC_VER
typedef signed char int8_t;
typedef unsigned char uint8_t;
typedef short int16_t;
typedef unsigned short uint16_t;
typedef int int32_t;
#ifdef _WIN64
#define ssize_t __int64
#else
#define ssize_t int
#endif
typedef unsigned char bool;
enum {false, true};
#else
#include <stdbool.h>
#include <inttypes.h>
#endif
#include <limits.h>
#ifdef __cplusplus
extern "C" {
#endif
#ifndef SSIZE_MAX
#define SSIZE_MAX ((size_t)SIZE_MAX/2)
#endif
#define UTF8PROC_NULLTERM (1<<0)
#define UTF8PROC_STABLE (1<<1)
#define UTF8PROC_COMPAT (1<<2)
#define UTF8PROC_COMPOSE (1<<3)
#define UTF8PROC_DECOMPOSE (1<<4)
#define UTF8PROC_IGNORE (1<<5)
#define UTF8PROC_REJECTNA (1<<6)
#define UTF8PROC_NLF2LS (1<<7)
#define UTF8PROC_NLF2PS (1<<8)
#define UTF8PROC_NLF2LF (UTF8PROC_NLF2LS | UTF8PROC_NLF2PS)
#define UTF8PROC_STRIPCC (1<<9)
#define UTF8PROC_CASEFOLD (1<<10)
#define UTF8PROC_CHARBOUND (1<<11)
#define UTF8PROC_LUMP (1<<12)
#define UTF8PROC_STRIPMARK (1<<13)
/*
* Flags being regarded by several functions in the library:
* NULLTERM: The given UTF-8 input is NULL terminated.
* STABLE: Unicode Versioning Stability has to be respected.
* COMPAT: Compatibility decomposition
* (i.e. formatting information is lost)
* COMPOSE: Return a result with composed characters.
* DECOMPOSE: Return a result with decomposed characters.
* IGNORE: Strip "default ignorable characters"
* REJECTNA: Return an error, if the input contains unassigned
* code points.
* NLF2LS: Indicating that NLF-sequences (LF, CRLF, CR, NEL) are
* representing a line break, and should be converted to the
* unicode character for line separation (LS).
* NLF2PS: Indicating that NLF-sequences are representing a paragraph
* break, and should be converted to the unicode character for
* paragraph separation (PS).
* NLF2LF: Indicating that the meaning of NLF-sequences is unknown.
* STRIPCC: Strips and/or converts control characters.
* NLF-sequences are transformed into space, except if one of
* the NLF2LS/PS/LF options is given.
* VerticalTab (VT) and FormFeed (FF) are treated as an
* NLF-sequence in this case; HorizontalTab (HT) is
* converted to space.
* All other control characters are simply removed.
* CASEFOLD: Performs unicode case folding, to be able to do a
* case-insensitive string comparison.
* CHARBOUND: Inserts 0xFF bytes at the beginning of each sequence which
* is representing a single grapheme cluster (see UAX#29).
* LUMP: Lumps certain characters together
* (e.g. HYPHEN U+2010 and MINUS U+2212 to ASCII "-").
* (See lump.txt for details.)
* If NLF2LF is set, this includes a transformation of
* paragraph and line separators to ASCII line-feed (LF).
* STRIPMARK: Strips all character markings
* (non-spacing, spacing and enclosing) (i.e. accents)
* NOTE: this option works only with COMPOSE or DECOMPOSE
*/
#define UTF8PROC_ERROR_NOMEM -1
#define UTF8PROC_ERROR_OVERFLOW -2
#define UTF8PROC_ERROR_INVALIDUTF8 -3
#define UTF8PROC_ERROR_NOTASSIGNED -4
#define UTF8PROC_ERROR_INVALIDOPTS -5
/*
* Error codes being returned by almost all functions:
* ERROR_NOMEM: Memory could not be allocated.
* ERROR_OVERFLOW: The given string is too long to be processed.
* ERROR_INVALIDUTF8: The given string is not a legal UTF-8 string.
* ERROR_NOTASSIGNED: The REJECTNA flag was set,
* and an unassigned code point was found.
* ERROR_INVALIDOPTS: Invalid options have been used.
*/
typedef int16_t utf8proc_propval_t;
typedef struct utf8proc_property_struct {
utf8proc_propval_t category;
utf8proc_propval_t combining_class;
utf8proc_propval_t bidi_class;
utf8proc_propval_t decomp_type;
const int32_t *decomp_mapping;
unsigned bidi_mirrored:1;
int32_t uppercase_mapping;
int32_t lowercase_mapping;
int32_t titlecase_mapping;
int32_t comb1st_index;
int32_t comb2nd_index;
unsigned comp_exclusion:1;
unsigned ignorable:1;
unsigned control_boundary:1;
unsigned extend:1;
const int32_t *casefold_mapping;
} utf8proc_property_t;
#define UTF8PROC_CATEGORY_LU 1
#define UTF8PROC_CATEGORY_LL 2
#define UTF8PROC_CATEGORY_LT 3
#define UTF8PROC_CATEGORY_LM 4
#define UTF8PROC_CATEGORY_LO 5
#define UTF8PROC_CATEGORY_MN 6
#define UTF8PROC_CATEGORY_MC 7
#define UTF8PROC_CATEGORY_ME 8
#define UTF8PROC_CATEGORY_ND 9
#define UTF8PROC_CATEGORY_NL 10
#define UTF8PROC_CATEGORY_NO 11
#define UTF8PROC_CATEGORY_PC 12
#define UTF8PROC_CATEGORY_PD 13
#define UTF8PROC_CATEGORY_PS 14
#define UTF8PROC_CATEGORY_PE 15
#define UTF8PROC_CATEGORY_PI 16
#define UTF8PROC_CATEGORY_PF 17
#define UTF8PROC_CATEGORY_PO 18
#define UTF8PROC_CATEGORY_SM 19
#define UTF8PROC_CATEGORY_SC 20
#define UTF8PROC_CATEGORY_SK 21
#define UTF8PROC_CATEGORY_SO 22
#define UTF8PROC_CATEGORY_ZS 23
#define UTF8PROC_CATEGORY_ZL 24
#define UTF8PROC_CATEGORY_ZP 25
#define UTF8PROC_CATEGORY_CC 26
#define UTF8PROC_CATEGORY_CF 27
#define UTF8PROC_CATEGORY_CS 28
#define UTF8PROC_CATEGORY_CO 29
#define UTF8PROC_CATEGORY_CN 30
#define UTF8PROC_BIDI_CLASS_L 1
#define UTF8PROC_BIDI_CLASS_LRE 2
#define UTF8PROC_BIDI_CLASS_LRO 3
#define UTF8PROC_BIDI_CLASS_R 4
#define UTF8PROC_BIDI_CLASS_AL 5
#define UTF8PROC_BIDI_CLASS_RLE 6
#define UTF8PROC_BIDI_CLASS_RLO 7
#define UTF8PROC_BIDI_CLASS_PDF 8
#define UTF8PROC_BIDI_CLASS_EN 9
#define UTF8PROC_BIDI_CLASS_ES 10
#define UTF8PROC_BIDI_CLASS_ET 11
#define UTF8PROC_BIDI_CLASS_AN 12
#define UTF8PROC_BIDI_CLASS_CS 13
#define UTF8PROC_BIDI_CLASS_NSM 14
#define UTF8PROC_BIDI_CLASS_BN 15
#define UTF8PROC_BIDI_CLASS_B 16
#define UTF8PROC_BIDI_CLASS_S 17
#define UTF8PROC_BIDI_CLASS_WS 18
#define UTF8PROC_BIDI_CLASS_ON 19
#define UTF8PROC_DECOMP_TYPE_FONT 1
#define UTF8PROC_DECOMP_TYPE_NOBREAK 2
#define UTF8PROC_DECOMP_TYPE_INITIAL 3
#define UTF8PROC_DECOMP_TYPE_MEDIAL 4
#define UTF8PROC_DECOMP_TYPE_FINAL 5
#define UTF8PROC_DECOMP_TYPE_ISOLATED 6
#define UTF8PROC_DECOMP_TYPE_CIRCLE 7
#define UTF8PROC_DECOMP_TYPE_SUPER 8
#define UTF8PROC_DECOMP_TYPE_SUB 9
#define UTF8PROC_DECOMP_TYPE_VERTICAL 10
#define UTF8PROC_DECOMP_TYPE_WIDE 11
#define UTF8PROC_DECOMP_TYPE_NARROW 12
#define UTF8PROC_DECOMP_TYPE_SMALL 13
#define UTF8PROC_DECOMP_TYPE_SQUARE 14
#define UTF8PROC_DECOMP_TYPE_FRACTION 15
#define UTF8PROC_DECOMP_TYPE_COMPAT 16
extern const int8_t utf8proc_utf8class[256];
const char *utf8proc_version(void);
const char *utf8proc_errmsg(ssize_t errcode);
/*
* Returns a static error string for the given error code.
*/
ssize_t utf8proc_iterate(const uint8_t *str, ssize_t strlen, int32_t *dst);
/*
* Reads a single char from the UTF-8 sequence being pointed to by 'str'.
* The maximum number of bytes read is 'strlen', unless 'strlen' is
* negative.
* If a valid unicode char could be read, it is stored in the variable
* being pointed to by 'dst', otherwise that variable will be set to -1.
* In case of success the number of bytes read is returned, otherwise a
* negative error code is returned.
*/
bool utf8proc_codepoint_valid(int32_t uc);
/*
* Returns 1, if the given unicode code-point is valid, otherwise 0.
*/
ssize_t utf8proc_encode_char(int32_t uc, uint8_t *dst);
/*
* Encodes the unicode char with the code point 'uc' as an UTF-8 string in
* the byte array being pointed to by 'dst'. This array has to be at least
* 4 bytes long.
* In case of success the number of bytes written is returned,
* otherwise 0.
* This function does not check if 'uc' is a valid unicode code point.
*/
const utf8proc_property_t *utf8proc_get_property(int32_t uc);
/*
* Returns a pointer to a (constant) struct containing information about
* the unicode char with the given code point 'uc'.
* If the character is not existent a pointer to a special struct is
* returned, where 'category' is 0.
* WARNING: The parameter 'uc' has to be in the range of 0x0000 to
* 0x10FFFF, otherwise the program might crash!
*/
ssize_t utf8proc_decompose_char(
int32_t uc, int32_t *dst, ssize_t bufsize,
int options, int *last_boundclass
);
/*
* Writes a decomposition of the unicode char 'uc' into the array being
* pointed to by 'dst'.
* Following flags in the 'options' field are regarded:
* REJECTNA: an unassigned unicode code point leads to an error
* IGNORE: "default ignorable" chars are stripped
* CASEFOLD: unicode casefolding is applied
* COMPAT: replace certain characters with their
* compatibility decomposition
* CHARBOUND: Inserts 0xFF bytes before each grapheme cluster
* LUMP: lumps certain different characters together
* STRIPMARK: removes all character marks
* The pointer 'last_boundclass' has to point to an integer variable which
* is storing the last character boundary class, if the CHARBOUND option
* is used.
* In case of success the number of chars written is returned,
* in case of an error, a negative error code is returned.
* If the number of written chars would be bigger than 'bufsize',
* the buffer (up to 'bufsize') has unpredictable data, and the needed
* buffer size is returned.
* WARNING: The parameter 'uc' has to be in the range of 0x0000 to
* 0x10FFFF, otherwise the program might crash!
*/
ssize_t utf8proc_decompose(
const uint8_t *str, ssize_t strlen,
int32_t *buffer, ssize_t bufsize, int options
);
/*
* Does the same as 'utf8proc_decompose_char', but acts on a whole UTF-8
* string, and orders the decomposed sequences correctly.
* If the NULLTERM flag in 'options' is set, processing will be stopped,
* when a NULL byte is encountered, otherwise 'strlen' bytes are processed.
* The result in form of unicode code points is written into the buffer
* being pointed to by 'buffer', having the length of 'bufsize' entries.
* In case of success the number of chars written is returned,
* in case of an error, a negative error code is returned.
* If the number of written chars would be bigger than 'bufsize',
* the buffer (up to 'bufsize') has unpredictable data, and the needed
* buffer size is returned.
*/
ssize_t utf8proc_reencode(int32_t *buffer, ssize_t length, int options);
/*
* Reencodes the sequence of unicode characters given by the pointer
* 'buffer' and 'length' as UTF-8.
* The result is stored in the same memory area where the data is read.
* Following flags in the 'options' field are regarded:
* NLF2LS: converts LF, CRLF, CR and NEL into LS
* NLF2PS: converts LF, CRLF, CR and NEL into PS
* NLF2LF: converts LF, CRLF, CR and NEL into LF
* STRIPCC: strips or converts all non-affected control characters
* COMPOSE: tries to combine decomposed characters into composite
* characters
* STABLE: prohibits combining characters which would violate
* the unicode versioning stability
* In case of success the length of the resulting UTF-8 string is
* returned, otherwise a negative error code is returned.
* WARNING: The amount of free space being pointed to by 'buffer', has to
* exceed the amount of the input data by one byte, and the
* entries of the array pointed to by 'buffer' have to be in the
* range of 0x0000 to 0x10FFFF, otherwise the program might
* crash!
*/
ssize_t utf8proc_map(
const uint8_t *str, ssize_t strlen, uint8_t **dstptr, int options
);
/*
* Maps the given UTF-8 string being pointed to by 'str' to a new UTF-8
* string, which is allocated dynamically, and afterwards pointed to by
* the pointer being pointed to by 'dstptr'.
* If the NULLTERM flag in the 'options' field is set, the length is
* determined by a NULL terminator, otherwise the parameter 'strlen' is
* evaluated to determine the string length, but in any case the result
* will be NULL terminated (though it might contain NULL characters
* before). Other flags in the 'options' field are passed to the functions
* defined above, and regarded as described.
* In case of success the length of the new string is returned,
* otherwise a negative error code is returned.
* NOTICE: The memory of the new UTF-8 string will have been allocated with
* 'malloc', and therefore has to be freed with 'free'.
*/
uint8_t *utf8proc_NFD(const uint8_t *str);
uint8_t *utf8proc_NFC(const uint8_t *str);
uint8_t *utf8proc_NFKD(const uint8_t *str);
uint8_t *utf8proc_NFKC(const uint8_t *str);
/*
* Returns a pointer to newly allocated memory of a NFD, NFC, NFKD or NFKC
* normalized version of the null-terminated string 'str'.
*/
#ifdef __cplusplus
}
#endif
#endif
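The header defines UTF8PROC_NLF2LF as the bitwise OR of UTF8PROC_NLF2LS and UTF8PROC_NLF2PS, so the two bits together select among four replacements for a newline sequence. The following is a minimal standalone sketch of that selection as it appears in utf8proc_reencode. The helper name nlf_replacement is hypothetical and not part of the library. The flag values are copied from utf8proc.h.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values copied from utf8proc.h. */
#define UTF8PROC_NLF2LS (1<<7)
#define UTF8PROC_NLF2PS (1<<8)
#define UTF8PROC_NLF2LF (UTF8PROC_NLF2LS | UTF8PROC_NLF2PS)

/* Hypothetical helper: returns the code point that replaces an
 * NLF sequence (LF, CRLF, CR or NEL) for the given option bits. */
static int32_t nlf_replacement(int options) {
  switch (options & UTF8PROC_NLF2LF) {
  case UTF8PROC_NLF2LF: return 0x000A;  /* both bits set: line feed */
  case UTF8PROC_NLF2LS: return 0x2028;  /* line separator (LS) */
  case UTF8PROC_NLF2PS: return 0x2029;  /* paragraph separator (PS) */
  default:              return 0x0020;  /* neither bit set: space */
  }
}
```

In utf8proc_reencode the fallback to space is reachable only when UTF8PROC_STRIPCC is set, because the newline-handling loop is skipped entirely when none of the three options is given.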

File diff suppressed because it is too large