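A variant form is an alternate glyph for a character, encoded in Unicode through the mechanism of variation sequences. A variation sequence is a Unicode sequence that consists of a base character followed by a variation selector character.
A variant form usually has a very similar appearance and meaning to its base character. The mechanism is intended for variant forms where, if the variant is unavailable, displaying the base character instead does not change the meaning of the text and may go unnoticed by many readers.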
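Unicode defines two types of variation sequences:
- Standardized variation sequences, defined in StandardizedVariants.txt[1]
- Ideographic variation sequences, defined in the Ideographic Variation Database (IVD)[2][3]
Variation selector characters reside in several Unicode blocks:
- Variation Selectors (16 characters, abbreviated VS1–VS16)
- Variation Selectors Supplement (240 characters, abbreviated VS17–VS256)
- Mongolian (4 characters, abbreviated FVS1–FVS4)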
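As an illustration of the three groups of selectors listed above, the following minimal Python sketch classifies a character as a variation selector by its code point, lists the variation sequences in a string, and drops the selectors to fall back to the base characters. The function names are hypothetical helpers introduced here; the code point ranges (U+FE00–U+FE0F, U+E0100–U+E01EF, and U+180B–U+180D plus U+180F for Mongolian) are those of the selector characters themselves. The sketch does not check whether a given sequence is actually defined by Unicode.

```python
from typing import List, Optional, Tuple

# Mongolian free variation selectors FVS1..FVS4 (not a contiguous range).
MONGOLIAN_FVS = (0x180B, 0x180C, 0x180D, 0x180F)


def selector_name(ch: str) -> Optional[str]:
    """Return the abbreviation (e.g. 'VS16', 'FVS2'), or None if ch is not a selector."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:          # Variation Selectors block
        return f"VS{cp - 0xFE00 + 1}"
    if 0xE0100 <= cp <= 0xE01EF:        # Variation Selectors Supplement block
        return f"VS{cp - 0xE0100 + 17}"
    if cp in MONGOLIAN_FVS:             # selectors in the Mongolian block
        return f"FVS{MONGOLIAN_FVS.index(cp) + 1}"
    return None


def variation_sequences(text: str) -> List[Tuple[str, str]]:
    """List (base character, selector abbreviation) pairs found in text."""
    pairs = []
    for prev, ch in zip(text, text[1:]):
        name = selector_name(ch)
        # A selector only forms a sequence with a preceding non-selector character.
        if name is not None and selector_name(prev) is None:
            pairs.append((prev, name))
    return pairs


def strip_selectors(text: str) -> str:
    """Fallback: keep only the base characters by removing every selector."""
    return "".join(ch for ch in text if selector_name(ch) is None)
```

For example, `variation_sequences("\u2764\uFE0F")` reports the heart character paired with `VS16`, and `strip_selectors` applied to the same string returns the bare base character.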
Variation selectors are not required for Arabic and cursive Latin characters, where glyph substitution can be determined from context: glyphs are joined differently depending on whether the character is in initial, medial, final, or isolated position in a word. This kind of substitution is derived from the character's context alone, with no additional input from the author. Authors may also use special-purpose characters such as joiners and non-joiners to force an alternate glyph form where it would not otherwise appear. Ligatures are a similar case: glyphs can be substituted simply by turning ligatures on or off as a rich-text attribute.
For other glyph substitutions, the author's intent cannot be determined from context and may need to be encoded with the text. This is the case with the characters or glyphs referred to as gaiji, where different glyphs are used for the same character, either historically or in ideographs for family names. This is one of the gray areas in distinguishing a glyph from a character: if a family name differs slightly from the ideograph it derives from, is that a simple glyph variant or a character variant?
Character substitutions may also occur outside of Unicode, for example with OpenType Layout tags.[4]
Blocks with standardized variation sequences
As of Unicode version 17.0, standardized variation sequences specifically for emoji/text presentation are defined for base characters in 20 blocks:[1]
- Arrows
- Basic Latin
- CJK Symbols and Punctuation
- Dingbats
- Emoticons
- Enclosed Alphanumeric Supplement
- Enclosed Alphanumerics
- Enclosed CJK Letters and Months
- Enclosed Ideographic Supplement
- General Punctuation
- Geometric Shapes
- Latin-1 Supplement
- Letterlike Symbols
- Mahjong Tiles
- Miscellaneous Symbols
- Miscellaneous Symbols and Arrows
- Miscellaneous Symbols and Pictographs
- Miscellaneous Technical
- Supplemental Arrows-B
- Transport and Map Symbols
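As a small illustration of an emoji/text presentation sequence, the sketch below appends VS15 (U+FE0E, text presentation) or VS16 (U+FE0F, emoji presentation) to a base character. The helper `with_presentation` is a hypothetical name introduced here, and it does not verify that the resulting sequence is one of the defined ones; U+2764 HEAVY BLACK HEART, from the Dingbats block, is used as the base character.

```python
VS15 = "\uFE0E"  # VS15: requests text presentation
VS16 = "\uFE0F"  # VS16: requests emoji presentation


def with_presentation(base: str, emoji: bool) -> str:
    """Append VS16 (emoji style) or VS15 (text style) to a single base character.

    This only builds the sequence; it does not check that the sequence
    is actually defined for this base character.
    """
    if len(base) != 1:
        raise ValueError("expected exactly one base character")
    return base + (VS16 if emoji else VS15)


heart = "\u2764"  # HEAVY BLACK HEART (Dingbats block)
text_heart = with_presentation(heart, emoji=False)
emoji_heart = with_presentation(heart, emoji=True)

# Show the code points of the emoji-style sequence.
print([f"U+{ord(c):04X}" for c in emoji_heart])  # ['U+2764', 'U+FE0F']
```

Whether the two sequences actually render differently depends on the fonts and the text renderer in use.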
Other standardized variation sequences are formed with base characters in the following sixteen blocks:[1]
- CJK Unified Ideographs
- CJK Unified Ideographs Extension A
- CJK Unified Ideographs Extension B
- Egyptian Hieroglyph Format Controls
- Egyptian Hieroglyphs
- Egyptian Hieroglyphs Extended-A
- Halfwidth and Fullwidth Forms
- Manichaean
- Mathematical Alphanumeric Symbols
- Mathematical Operators
- Miscellaneous Mathematical Symbols-B
- Mongolian
- Myanmar
- Myanmar Extended-A
- Phags-pa
- Supplemental Mathematical Operators
Blocks with ideographic variation sequences
As of 14 July 2025, ideographic variation sequences are defined for base characters in eleven blocks:[2][3]
- CJK Compatibility Ideographs
- CJK Unified Ideographs
- CJK Unified Ideographs Extension A
- CJK Unified Ideographs Extension B
- CJK Unified Ideographs Extension C
- CJK Unified Ideographs Extension D
- CJK Unified Ideographs Extension E
- CJK Unified Ideographs Extension F
- CJK Unified Ideographs Extension G
- CJK Unified Ideographs Extension H
- CJK Unified Ideographs Extension I
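To make the shape of an ideographic variation sequence concrete, the following sketch parses lines in a layout of three semicolon-separated fields: the sequence as two hexadecimal code points, a collection name, and an identifier within that collection. This layout is assumed here to match the IVD sequence data file; the sample lines, the collection name `Example_Collection`, and the identifiers are invented for illustration and are not real IVD registrations. `IVSRecord` and `parse_ivd_lines` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class IVSRecord(NamedTuple):
    base: str        # base ideograph
    selector: str    # variation selector (VS17..VS256)
    collection: str  # name of the registered collection
    identifier: str  # identifier of the glyph within the collection


def parse_ivd_lines(lines: Iterable[str]) -> List[IVSRecord]:
    """Parse 'BASE VS; collection; identifier' lines, skipping blanks and '#' comments."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        seq, collection, identifier = (field.strip() for field in line.split(";"))
        base_hex, vs_hex = seq.split()
        records.append(
            IVSRecord(chr(int(base_hex, 16)), chr(int(vs_hex, 16)), collection, identifier)
        )
    return records


# Hypothetical sample data, NOT taken from the real database.
SAMPLE = [
    "# base VS; collection; identifier",
    "4E00 E0100; Example_Collection; glyph-001",
    "20000 E0101; Example_Collection; glyph-002",
]

records = parse_ivd_lines(SAMPLE)
for rec in records:
    print(f"U+{ord(rec.base):04X} U+{ord(rec.selector):04X} {rec.collection} {rec.identifier}")
```

The first sample line uses a base character from CJK Unified Ideographs and the second one from CJK Unified Ideographs Extension B, both paired with selectors from the Variation Selectors Supplement block.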
References
- [1] "UCD: Standardized Variation Sequences". Unicode Consortium.
- [2] "Ideographic Variation Database". Unicode Consortium.
- [3] "UTS #37, Unicode Ideographic Variation Database". Unicode Consortium.
- [4] "Language system tags". Microsoft. 30 September 2022.