Post on 10-Dec-2015
What’s New in Globalization
Mark Davis
Unicode Character Database: UCD 5.0
Schedule
Currently in β2
Due June, 2006
Major part of the Unicode Standard 5.0
Frozen and published to give implementers a head-start
New Character Repertoire: +1,369
Total Graphic + Control: 99,089
Total PU/NC/SG: 139,582
U5.0 character properties
New characters
Corrections
Unicode Standard 5.0
Due 2006Q4: obsoletes previous versions
Years of implementation experience
Encoding model; casing; writing systems; security; classification of code points; Unicode strings; variation selectors; new properties; linebreak; bidi; segmentation; …
Increased interoperability for BIDI, Indic, …
Required basis for: regex, collation, segmentation, identifiers, security, …
Planned for major software releases:
Windows Vista, Solaris, Java, GNOME, …
Unicode Guide
Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
See Globalization Gotchas at this conference
Language Tags
RFC3066 replacement approved: 2005-11-15
Not yet published, but registry now operating
http://inter-locale.com/ID/draft-ietf-ltru-registry-14.html
Addresses problems in RFC3066
Stability / accessibility / ambiguity of the underlying ISO standards
Parseability; extensibility; registration speed
Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
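To make the parseability point concrete, here is a minimal sketch of splitting a tag into language, script, and region. It assumes a simplified grammar (language, optional four-letter script, optional region) and ignores variants, extensions, and private-use subtags; the function name `parse_tag` is hypothetical and not part of any specification.

```python
import re

# Simplified grammar for this sketch only: language[-Script][-REGION].
# The real grammar also allows variants, extensions, and private use.
_TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)


def parse_tag(tag):
    """Split a language tag into its subtags, normalizing letter case."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError("not a tag this sketch understands: %r" % tag)
    parts = match.groupdict()
    return {
        "language": parts["language"].lower(),
        "script": parts["script"].title() if parts["script"] else None,
        "region": parts["region"].upper() if parts["region"] else None,
    }


print(parse_tag("zh-Hant"))
print(parse_tag("sr-Latn"))
print(parse_tag("az-Cyrl-AZ"))
```

Because each subtag type has a distinct length and position, the script subtag can be recognized without a registry lookup, which is the kind of parseability the slide refers to.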
Common Locale Data Repository: CLDR
Common, necessary software locale data for world languages
XML format for effective interchange
Δευτέρα, 05 Σεπτεμβρίου 2005
Montag, 5. September 2005
¥ 1,234.57
1 234,57 руб.
Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…
Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…
AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…
Z < Å
CLDR 1.4 Features
Repository separated into language vs locale data
Language-specific segmentation (word/line breaks…)
Transliterations (eg Ελληνικά ↔ Ellēniká)
Data for lenient date/time formatting and parsing:
Programmer asks for “numeric day” + “abbreviated month”
Best format pattern returned, eg “dd.MMM”
Algorithm and locale data for choosing, adjusting
Calendar usage data
Quarters in dates (eg 2006Q1)
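The "ask for fields, get the best pattern" idea can be illustrated with a toy lookup. This is only a sketch: the table `AVAILABLE_FORMATS`, its contents, and the functions `skeleton` and `best_pattern` are all hypothetical, and the matching rule is much simpler than whatever algorithm CLDR ships. Real pattern data would come from the locale files.

```python
# Hypothetical data for one locale: canonical field skeleton -> pattern.
AVAILABLE_FORMATS = {
    "d": "d",
    "MMM": "MMM",
    "MMMd": "dd.MMM",
    "yMMMd": "dd.MMM yyyy",
}

FIELD_ORDER = "yMd"  # canonical order of field letters in this sketch


def skeleton(fields):
    """Build a canonical key such as 'MMMd' from the requested fields."""
    return "".join(sorted(fields, key=lambda f: FIELD_ORDER.index(f[0])))


def best_pattern(fields, table):
    """Return the exact match if present, else the closest available pattern."""
    key = skeleton(fields)
    if key in table:
        return table[key]
    wanted = {f[0] for f in fields}

    def score(item):
        have = set(item[0])
        # Prefer more requested fields covered, then fewer extra fields.
        return (len(wanted & have), -len(have - wanted))

    return max(table.items(), key=score)[1]


# "numeric day" + "abbreviated month", in either order
print(best_pattern(["d", "MMM"], AVAILABLE_FORMATS))
print(best_pattern(["MMM", "d"], AVAILABLE_FORMATS))
```

The caller names fields rather than writing a pattern, so field order and separators stay under the control of the locale data.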
CLDR 1.4 Schedule
Data gathering phase: in progress
Vetting phase start: March 15
Release: May 15
Aside from features:
New data, corrections
Metadata for parsing & validation
New tool for gathering/vetting data
CLDR Survey Tool
New web tool for data submission
Unicode members and others
Automatically incorporated into XML
Process for resolving differences, approval by committee
CLDR Vetting Process
Vetters confirm or approve new translations, corrections
Errors and alerts for areas of concern
Data accepted when approved by multiple organizations (plus exception process)
Unicode Security Issues
Examples:
Visual confusables: “paypal.com” with Cyrillic ‘a’ …
Non-visual problems: buffer overflows, non-shortest form, …
UTR #36: Unicode Security Considerations
Process recommendations
Best practices
UTS #39: Unicode Security Mechanisms
Limitations on Repertoire
Testing for Confusables
See Unicode Security at this conference
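Both example problems can be illustrated with a few lines of standard-library Python. This is a rough sketch, not the mechanism defined in the security documents: `scripts_used` is a hypothetical helper that guesses a script from the first word of each character name (a heuristic, not the Unicode Script property), and `is_well_formed_utf8` is a hypothetical helper that relies on Python's strict UTF-8 decoder to reject non-shortest forms.

```python
import unicodedata


def scripts_used(label):
    """Heuristic: collect the first word of each letter's character name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch).split()[0])
    return scripts


def is_well_formed_utf8(data):
    """True if the bytes decode as strict UTF-8 (no non-shortest forms)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


genuine = "paypal"
spoof = "p\u0430ypal"  # second letter is U+0430 CYRILLIC SMALL LETTER A

print(scripts_used(genuine))
print(scripts_used(spoof))

# 0xC0 0xAF is a two-byte, non-shortest encoding of "/" (U+002F).
print(is_well_formed_utf8(b"/"))
print(is_well_formed_utf8(b"\xc0\xaf"))
```

A label that mixes scripts is not necessarily malicious, so a check like this is only a signal for closer inspection.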
Internationalized Domain Names
Unicode Recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2
Broader problem; many RFCs use Nameprep, but that is limited to Unicode 3.2
New ICANN Guidelines (2.0)
Improved, but needs more work
IETF idn-nextsteps
Positive developments, but misreads Unicode, needs more work
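For readers who have not seen what Nameprep-based processing produces, here is a small sketch using Python's built-in `idna` codec, which to my knowledge implements the 2003-era IDNA rules (Nameprep plus Punycode). The helper names `to_ascii` and `to_unicode` and the domain `bücher.example` are illustrative assumptions, not taken from the talk.

```python
def to_ascii(domain):
    """Convert a Unicode domain name to its ASCII ("xn--") form."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain):
    """Convert an ASCII ("xn--") domain name back to Unicode."""
    return domain.encode("ascii").decode("idna")


ascii_form = to_ascii("b\u00fccher.example")  # "bücher.example"
print(ascii_form)              # xn--bcher-kva.example
print(to_unicode(ascii_form))  # bücher.example
```

The conversion is applied label by label, so the ASCII-only label `example` passes through unchanged.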
URL -> IRI
Internationalized Resource Identifier (IRI)
http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html
= http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D...%E8%B1%86.html
UTF-8, %-escaped
See http://ietf.org/rfc/rfc3987.txt
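The "UTF-8, %-escaped" step can be sketched with the Python standard library. This is a minimal illustration that handles only the path component; the function name `iri_path_to_uri` and the host `example.org` are hypothetical, and a full conversion would also need to handle the host name (via IDNA), the query, and the fragment.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_path_to_uri(iri):
    """Percent-escape the UTF-8 bytes of non-ASCII characters in the path."""
    parts = urlsplit(iri)
    # Keep "/" as a separator and leave existing "%" escapes untouched.
    path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, path, parts.query, parts.fragment)
    )


print(iri_path_to_uri("http://example.org/納豆.html"))
# http://example.org/%E7%B4%8D%E8%B1%86.html
```

Each of the two characters becomes three escaped bytes, the same %E7%B4%8D and %E8%B1%86 sequences visible in the escaped URL on the slide.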
World Wide Web Consortium
Work Areas
Web Services Internationalization
Language Tags and Locale Identifiers
Internationalization Tag Set
CSS WG on vertical text, etc.
Many W3C specs being upgraded to include IRIs
Growing number of articles, tutorials and tests available
Find out more at
http://w3.org/International/
Ideographic Variation Database
U+82A6 ashi: multiple forms
The first occurrence – any glyph
Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
Registration for variants
Unicode Members: Full
Unicode Members: Institutional & Supporting
New membership levels, between Full and Associate
Unicode Members: Associate
Why Join?
Support the technology…
That enables your success in international, technical, and emerging markets.
Protect your investment:
The stability you need
The extensions you require
The development you call for: security, …
Demonstrate your leadership…
In furthering the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.