23 Commits

Author SHA1 Message Date
Nick Alcock
7de5fa111a UPSTREAM: unicode: remove MODULE_LICENSE in non-modules
Since commit 8b41fc4454e ("kbuild: create modules.builtin without
Makefile.modbuiltin or tristate.conf"), MODULE_LICENSE declarations
are used to identify modules. As a consequence, uses of the macro
in non-modules will cause modprobe to misidentify their containing
object file as a module when it is not (false positives), and modprobe
might succeed rather than failing with a suitable error message.

So remove it in the files in this commit, none of which can be built as
modules.

Signed-off-by: Nick Alcock <nick.alcock@oracle.com>
Suggested-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Gabriel Krisman Bertazi <krisman@suse.de>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: linux-modules@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: Hitomi Hasegawa <hasegawa-hitomi@fujitsu.com>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Signed-off-by: mrsrimar22 <mar.pashter1922@gmail.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-07-11 19:27:21 -03:00
Christoph Hellwig
3764e06c6d UPSTREAM: unicode: clean up the Kconfig symbol confusion
Turn the CONFIG_UNICODE symbol into a tristate that generates some always
built in code and remove the confusing CONFIG_UNICODE_UTF8_DATA symbol.

Note that a lot of the IS_ENABLED() checks could be turned from cpp
statements into normal ifs, but this change is intended to be fairly
mechanic, so that should be cleaned up later.

Fixes: 2b3d04787012 ("unicode: Add utf8-data module")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Change-Id: I91c9031a7320e996b1cd931a18d79bfab05ee959
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: mrsrimar22 <mar.pashter1922@gmail.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-07-11 19:25:23 -03:00
Krzysztof Wilczynski
832e98b3c6 UPSTREAM: unicode: Move static keyword to the front of declarations
Move the static keyword to the front of declarations of nfdi_test_data
and nfdicf_test_data, and resolve the following compiler warnings that
can be seen when building with warnings enabled (W=1):

fs/unicode/utf8-selftest.c:38:1: warning:
  ‘static’ is not at beginning of declaration [-Wold-style-declaration]

fs/unicode/utf8-selftest.c:92:1: warning:
  ‘static’ is not at beginning of declaration [-Wold-style-declaration]

Signed-off-by: Krzysztof Wilczynski <kw@linux.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: mrsrimar22 <mar.pashter1922@gmail.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-07-11 18:38:22 -03:00
Linus Torvalds
c4db30fcd3 UPSTREAM: unicode: fix .gitignore for generated utfdata file
Commit 2b3d04787012 ("unicode: Add utf8-data module") changed the
generated utf8data file from 'utf8data.h' to 'utf8data.c', but didn't
change the comments or the .gitignore to match.

The comments should be updated too, but at least they don't cause any
visible breakage.  But the gitignore file needs changing to avoid git
complaining about untracked files.

Fixes: 2b3d04787012 ("unicode: Add utf8-data module")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: mrsrimar22 <mar.pashter1922@gmail.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-07-11 18:38:21 -03:00
Christoph Hellwig
adbec9dbf3 UPSTREAM: unicode: only export internal symbols for the selftests
The exported symbols in utf8-norm.c are not needed for normal
file system consumers, so move them to conditional _GPL exports
just for the selftest.

Change-Id: Ie9170398b992e5b4ece631ba302e3e8feb8b5ef7
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
cbc05b0350 UPSTREAM: unicode: Add utf8-data module
utf8data.h contains a large database table which is an auto-generated
decodification trie for the unicode normalization functions.

Allow building it into a separate module.

Based on a patch from Shreeya Patel <shreeya.patel@collabora.com>.

Change-Id: I6ce97a0249b1fedbd27b3f05d73459f5a02d3e75
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
89b5457241 UPSTREAM: unicode: cache the normalization tables in struct unicode_map
Instead of repeatedly looking up the version add pointers to the
NFD and NFD+CF tables to struct unicode_map, and pass a
unicode_map plus index to the functions using the normalization
tables.

Change-Id: Iced7cfaa61057defd15581258a8d1b093f041759
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
1da79aa576 UPSTREAM: unicode: move utf8cursor to utf8-selftest.c
Only used by the tests, so no need to keep it in the core.

Change-Id: Id1f83cc3575f2683e7020268e34336b68eb55318
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
02f80a8bf8 UPSTREAM: unicode: simplify utf8len
Just use the utf8nlen implementation with a (size_t)-1 len argument,
similar to utf8_lookup.  Also move the function to utf8-selftest.c, as
it isn't used anywhere else.

Change-Id: I6a5b7f73de9493ed76a2aa884db1cc6ee99655d3
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
12489ae4c6 UPSTREAM: unicode: remove the unused utf8{,n}age{min,max} functions
No actually used anywhere.

Change-Id: Ia319ea78385ec20c96036e83aac378638429e6cd
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
343d1e862f UPSTREAM: unicode: pass a UNICODE_AGE() tripple to utf8_load
Don't bother with pointless string parsing when the caller can just pass
the version in the format that the core expects.  Also remove the
fallback to the latest version that none of the callers actually uses.

Change-Id: I4c0056bcdce44ea655d5da0f69f92f7a1a590445
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Christoph Hellwig
f0006a3c97 UPSTREAM: unicode: remove the charset field from struct unicode_map
It is hardcoded and only used for a f2fs sysfs file where it can be
hardcoded just as easily.

Change-Id: I4a0e2e143a62c411dbcff89a2398ed72e8125500
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: chrisl7 <wandersonrodriguesf1@gmail.com>
Signed-off-by: Richard Raya <rdxzv.dev@gmail.com>
2024-06-15 15:31:08 -03:00
Daniel Rosenberg
afff68f6cf unicode: Add utf8_casefold_hash
This adds a case insensitive hash function to allow taking the hash
without needing to allocate a casefolded copy of the string.

The existing d_hash implementations for casefolding allocate memory
within rcu-walk, by avoiding it we can be more efficient and avoid
worrying about a failed allocation.

Signed-off-by: Daniel Rosenberg <drosen@google.com>
Reviewed-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Reviewed-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jaegeuk Kim <jaegeuk@kernel.org>
2020-09-11 11:23:08 -07:00
Gabriel Krisman Bertazi
0abff11d2e ext4: optimize case-insensitive lookups
Temporarily cache a casefolded version of the file name under lookup in
ext4_filename, to avoid repeatedly casefolding it.  I got up to 30%
speedup on lookups of large directories (>100k entries), depending on
the length of the string under lookup.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:08:46 -07:00
Theodore Ts'o
7db4acd88a unicode: update to Unicode 12.1.0 final
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-09-23 19:08:44 -07:00
Theodore Ts'o
1fc69d99b2 unicode: add missing check for an error return from utf8lookup()
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
2019-09-23 19:08:43 -07:00
Masahiro Yamada
a01d5c6fe9 unicode: refactor the rule for regenerating utf8data.h
scripts/mkutf8data is used only when regenerating utf8data.h,
which never happens in the normal kernel build. However, it is
irrespectively built if CONFIG_UNICODE is enabled.

Moreover, there is no good reason for it to reside in the scripts/
directory since it is only used in fs/unicode/.

Hence, move it from scripts/ to fs/unicode/.

In some cases, we bypass build artifacts in the normal build. The
conventional way to do so is to surround the code with ifdef REGENERATE_*.

For example,

 - 7373f4f83c71 ("kbuild: add implicit rules for parser generation")
 - 6aaf49b495b4 ("crypto: arm,arm64 - Fix random regeneration of S_shipped")

I rewrote the rule in a more kbuild'ish style.

In the normal build, utf8data.h is just shipped from the check-in file.

$ make
  [ snip ]
  SHIPPED fs/unicode/utf8data.h
  CC      fs/unicode/utf8-norm.o
  CC      fs/unicode/utf8-core.o
  CC      fs/unicode/utf8-selftest.o
  AR      fs/unicode/built-in.a

If you want to generate utf8data.h based on UCD, put *.txt files into
fs/unicode/, then pass REGENERATE_UTF8DATA=1 from the command line.
The mkutf8data tool will be automatically compiled to generate the
utf8data.h from the *.txt files.

$ make REGENERATE_UTF8DATA=1
  [ snip ]
  HOSTCC  fs/unicode/mkutf8data
  GEN     fs/unicode/utf8data.h
  CC      fs/unicode/utf8-norm.o
  CC      fs/unicode/utf8-core.o
  CC      fs/unicode/utf8-selftest.o
  AR      fs/unicode/built-in.a

I renamed the check-in utf8data.h to utf8data.h_shipped so that this
will work for the out-of-tree build.

You can update it based on the latest UCD like this:

$ make REGENERATE_UTF8DATA=1 fs/unicode/
$ cp fs/unicode/utf8data.h fs/unicode/utf8data.h_shipped

Also, I added entries to .gitignore and dontdiff.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:08:41 -07:00
Gabriel Krisman Bertazi
b1bda14bfb unicode: update unicode database unicode version 12.1.0
Regenerate utf8data.h based on the latest UCD files and run tests
against the latest version.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:07:21 -07:00
Gabriel Krisman Bertazi
2c2236abc9 unicode: introduce test module for normalized utf8 implementation
This implements a in-kernel sanity test module for the utf8
normalization core.  At probe time, it will run basic sequences through
the utf8n core, to identify problems will equivalent sequences and
normalization/casefold code.  This is supposed to be useful for
regression testing when adding support for a new version of utf8 to
linux.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:07:20 -07:00
Gabriel Krisman Bertazi
c71834b718 unicode: implement higher level API for string handling
This patch integrates the utf8n patches with some higher level API to
perform UTF-8 string comparison, normalization and casefolding
operations.  Implemented is a variation of NFD, and casefold is
performed by doing full casefold on top of NFD.  These algorithms are
based on the core implemented by Olaf Weber from SGI.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:07:18 -07:00
Olaf Weber
74fc5fb3da unicode: reduce the size of utf8data[]
Remove the Hangul decompositions from the utf8data trie, and do
algorithmic decomposition to calculate them on the fly. To store the
decomposition the caller of utf8lookup()/utf8nlookup() must provide a
12-byte buffer, which is used to synthesize a leaf with the
decomposition. This significantly reduces the size of the utf8data[]
array.

Changes made by Gabriel:
  Rebase to mainline
  Fix checkpatch errors
  Extract robustness fixes and merge back to original mkutf8data.c patch
  Regenerate utf8data.h

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:07:17 -07:00
Olaf Weber
e58b26a26f unicode: introduce code for UTF-8 normalization
Supporting functions for UTF-8 normalization are in utf8norm.c with the
header utf8norm.h. Two normalization forms are supported: nfdi and
nfdicf.

  nfdi:
   - Apply unicode normalization form NFD.
   - Remove any Default_Ignorable_Code_Point.

  nfdicf:
   - Apply unicode normalization form NFD.
   - Remove any Default_Ignorable_Code_Point.
   - Apply a full casefold (C + F).

For the purposes of the code, a string is valid UTF-8 if:

 - The values encoded are 0x1..0x10FFFF.
 - The surrogate codepoints 0xD800..0xDFFFF are not encoded.
 - The shortest possible encoding is used for all values.

The supporting functions work on null-terminated strings (utf8 prefix)
and on length-limited strings (utf8n prefix).

From the original SGI patch and for conformity with coding standards,
the utf8data_t typedef was dropped, since it was just masking the struct
keyword.  On other occasions, namely utf8leaf_t and utf8trie_t, I
decided to keep it, since they are simple pointers to memory buffers,
and using uchars here wouldn't provide any more meaningful information.

From the original submission, we also converted from the compatibility
form to canonical.

Changes made by Gabriel:
  Rebase to Mainline
  Fix up checkpatch.pl warnings
  Drop typedefs
  move out of libxfs
  Convert from NFKD to NFD

Signed-off-by: Olaf Weber <olaf@sgi.com>
Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:07:16 -07:00
Gabriel Krisman Bertazi
5c0d8663cf unicode: introduce UTF-8 character database
The decomposition and casefolding of UTF-8 characters are described in a
prefix tree in utf8data.h, which is a generate from the Unicode
Character Database (UCD), published by the Unicode Consortium, and
should not be edited by hand.  The structures in utf8data.h are meant to
be used for lookup operations by the unicode subsystem, when decoding a
utf-8 string.

mkutf8data.c is the source for a program that generates utf8data.h. It
was written by Olaf Weber from SGI and originally proposed to be merged
into Linux in 2014.  The original proposal performed the compatibility
decomposition, NFKD, but the current version was modified by me to do
canonical decomposition, NFD, as suggested by the community.  The
changes from the original submission are:

  * Rebase to mainline.
  * Fix out-of-tree-build.
  * Update makefile to build 11.0.0 ucd files.
  * drop references to xfs.
  * Convert NFKD to NFD.
  * Merge back robustness fixes from original patch. Requested by
    Dave Chinner.

The original submission is archived at:

<https://linux-xfs.oss.sgi.narkive.com/Xx10wjVY/rfc-unicode-utf-8-support-for-xfs>

The utf8data.h file can be regenerated using the instructions in
fs/unicode/README.utf8data.

- Notes on the update from 8.0.0 to 11.0:

The structure of the ucd files and special cases have not experienced
any changes between versions 8.0.0 and 11.0.0.  8.0.0 saw the addition
of Cherokee LC characters, which is an interesting case for
case-folding.  The update is accompanied by new tests on the test_ucd
module to catch specific cases.  No changes to mkutf8data script were
required for the updates.

Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.co.uk>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
2019-09-23 19:07:15 -07:00