RTOS/rtems

Files

Ralf Kirchner 46b7f92152 libmisc: Add utf8proc-v1.1.5

utf8proc is a small library for processing UTF-8 encoded Unicode strings.
Some features are Unicode normalization, stripping of default ignorable characters, case folding and detection of grapheme cluster boundaries.
For the time being, utf8proc is intended to be used for normalizing and folding UTF-8 strings
for comparison purposes when adding UTF-8 support to the FAT file system.

2013-06-03 17:28:40 +02:00

File             Last commit                    Date
LICENSE          libmisc: Add utf8proc-v1.1.5   2013-06-03 17:28:40 +02:00
README           libmisc: Add utf8proc-v1.1.5   2013-06-03 17:28:40 +02:00
utf8proc_data.c  libmisc: Add utf8proc-v1.1.5   2013-06-03 17:28:40 +02:00
utf8proc.c       libmisc: Add utf8proc-v1.1.5   2013-06-03 17:28:40 +02:00
utf8proc.h       libmisc: Add utf8proc-v1.1.5   2013-06-03 17:28:40 +02:00

README

Please read the LICENSE file, which ships with this software.


*** QUICK START ***

For compilation of the C library call "make c-library", for compilation of
the ruby library call "make ruby-library" and for compilation of the
PostgreSQL extension call "make pgsql-library".

For ruby you can also create a gem-file by calling "make ruby-gem".

"make all" can be used to build everything, but both ruby and PostgreSQL
installations are required in this case.


*** GENERAL INFORMATION ***

The C library is found in this directory after successful compilation and
is named "libutf8proc.a" and "libutf8proc.so". The ruby library consists of
the files "utf8proc.rb" and "utf8proc_native.so", which are found in the
subdirectory "ruby/". If you chose to create a gem-file it is placed in the
"ruby/gem" directory. The PostgreSQL extension is named "utf8proc_pgsql.so"
and resides in the "pgsql/" directory.

Both the ruby library and the PostgreSQL extension are built as stand-alone
libraries and therefore do not depend on the dynamic version of the
C library files, but this behaviour might change in future releases.

The supported Unicode version is 5.0.0.
Note: Version 4.1.0 of Unicode Standard Annex #29 was used, as
      version 5.0.0 was not yet available at the time of implementation.

For Unicode normalizations, the following options have to be used:
Normalization Form C:  STABLE, COMPOSE
Normalization Form D:  STABLE, DECOMPOSE
Normalization Form KC: STABLE, COMPOSE, COMPAT
Normalization Form KD: STABLE, DECOMPOSE, COMPAT


*** C LIBRARY ***

The documentation for the C library is found in the utf8proc.h header file.
"utf8proc_map" is most likely the function you will be using for mapping UTF-8
strings, unless you want to allocate memory yourself.


*** RUBY API ***

The ruby library adds the methods "utf8map" and "utf8map!" to the String
class, and the method "utf8" to the Integer class.

The String#utf8map method does the same as the "utf8proc_map" C function.
Options for the mapping procedure are passed as symbols, e.g.:
"Hello".utf8map(:casefold) => "hello"

The descriptions of all options are found in the C header file
"utf8proc.h". Please note that the corresponding symbols in ruby are all
lowercase.

String#utf8map! is the destructive variant: the string itself is replaced
by the result.

There are shortcuts for the 4 normalization forms specified by Unicode:
String#utf8nfd,  String#utf8nfd!,
String#utf8nfc,  String#utf8nfc!,
String#utf8nfkd, String#utf8nfkd!,
String#utf8nfkc, String#utf8nfkc!

The method Integer#utf8 returns a UTF-8 string containing the Unicode
character given by the code point.
0x000A.utf8 => "\n"
0x2028.utf8 => "\342\200\250"
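
To show where the bytes "\342\200\250" come from, here is a minimal C sketch
of the standard UTF-8 encoding of a single code point. The function
"encode_utf8" is a hypothetical helper written for this illustration; it is
not the code used by Integer#utf8 or by the C library.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of utf8proc): encodes the code point cp as
 * UTF-8 into dst, which must provide room for at least 4 bytes.
 * Returns the number of bytes written, or 0 if cp is a surrogate or lies
 * above U+10FFFF.
 */
size_t encode_utf8(uint32_t cp, uint8_t *dst)
{
    if (cp < 0x80) {                       /* 1 byte: 0xxxxxxx */
        dst[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 2 bytes: 110xxxxx 10xxxxxx */
        dst[0] = (uint8_t)(0xC0 | (cp >> 6));
        dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return 0;                      /* surrogates are not encodable */
        }
        dst[0] = (uint8_t)(0xE0 | (cp >> 12));
        dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 4 bytes: 11110xxx followed by 3 x 10xxxxxx */
        dst[0] = (uint8_t)(0xF0 | (cp >> 18));
        dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
        dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
        dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

For 0x2028 the three-byte branch applies and yields the bytes 0xE2 0x80 0xA8,
which are the octal values 342 200 250 shown in the ruby example.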


*** POSTGRESQL API ***

For PostgreSQL there are two SQL functions supplied named "unifold" and
"unistrip". These functions can be used to prepare index fields in
order to be folded in a way where string-comparisons make more sense, e.g.
where "bathtub" == "bath<soft hyphen>tub"
or "Hello World" == "hello world".

CREATE TABLE people (
  id    serial8 primary key,
  name  text,
  CHECK (unifold(name) NOTNULL)
);
CREATE INDEX name_idx ON people (unifold(name));
SELECT * FROM people WHERE unifold(name) = unifold('John Doe');

The function "unistrip" removes character marks such as accents or
diaereses, while "unifold" keeps them.

NOTICE: The outputs of the function can change between releases, as
        utf8proc does not follow a versioning stability policy. You have to
        rebuild your database indices if you upgrade to a newer version
        of utf8proc.


*** TODO ***

- detect stable code points and process segments independently in order to
  save memory
- do a quick check before normalizing strings to optimize speed
- support stream processing


*** CONTACT ***

If you find any bugs or experience difficulties in compiling this software,
please contact us:

Project page: http://www.public-software-group.org/utf8proc