What's New in Globalization, Mark Davis. Unicode Character Database: UCD 5.0 Schedule Currently in β2...

Post on 10-Dec-2015


What’s New in Globalization

Mark Davis

Unicode Character Database: UCD 5.0

Schedule
Currently in β2
Due June 2006

Major part of the Unicode Standard 5.0
Frozen and published to give implementers a head start
New Character Repertoire: +1,369

Total Graphic + Control: 99,089
Total private use, noncharacter, and surrogate (PU/NC/SG): 139,582

U5.0 character properties
New characters
Corrections
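The two totals above can be cross-checked against the size of the Unicode code space. The sketch below is an editorial check, not part of the slides: the breakdown of private-use, noncharacter, and surrogate code points follows from the fixed layout of the code space, and the variable names are made up for illustration.

```python
# Cross-check of the UCD 5.0 totals quoted on the slide.
CODE_SPACE = 17 * 0x10000          # 17 planes of 65,536 code points

graphic_and_control = 99_089       # figure taken from the slide

private_use = 6_400 + 2 * 65_534   # BMP private use area plus planes 15 and 16
noncharacters = 66                 # U+FDD0..U+FDEF plus the last two code points of each plane
surrogates = 2_048                 # U+D800..U+DFFF
pu_nc_sg = private_use + noncharacters + surrogates

# Whatever is neither assigned nor set aside is still reserved.
reserved = CODE_SPACE - graphic_and_control - pu_nc_sg

print(pu_nc_sg)   # 139582, the PU/NC/SG total on the slide
print(reserved)   # code points left unassigned
```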

Unicode Standard 5.0

Due 2006Q4: obsoletes previous versions
Years of implementation experience

Encoding model; casing; writing systems; security; classification of code points; Unicode strings; variation selectors; new properties; linebreak; bidi; segmentation; …
Increased interoperability for BIDI, Indic, …

Required basis for: regex, collation, segmentation, identifiers, security, …
Planned for major software releases:

Windows Vista, Solaris, Java, GNOME, …

Unicode Guide

Authoritative but lightweight
Introduction, overview, and quick reference
Main principles of the Unicode Standard
Best practices in Software Globalization
See Globalization Gotchas at this conference

Language Tags

RFC3066 replacement approved: 2005-11-15
Not yet published, but registry now operating

http://inter-locale.com/ID/draft-ietf-ltru-registry-14.html

Addresses problems in RFC3066
Stability / accessibility / ambiguity of the underlying ISO standards
Parseability; Extensibility; Registration speed
Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
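The script-bearing tags listed above have a regular shape that is easy to parse. The following is a minimal sketch that handles only a simplified language[-script][-region] subset. It is not the full grammar of the RFC3066 replacement; the regular expression and the parse_tag helper are illustrative assumptions.

```python
import re

# Simplified shape: 2-3 letter language, optional 4-letter script,
# optional 2-letter or 3-digit region. Variants, extensions, and
# private-use subtags are deliberately not handled here.
TAG = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|\d{3}))?$"
)

def parse_tag(tag):
    """Split a tag into subtags, using conventional casing (zh, Hant, TW)."""
    m = TAG.match(tag)
    if m is None:
        raise ValueError(f"not in the simplified subset: {tag!r}")
    script, region = m["script"], m["region"]
    return {
        "language": m["language"].lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
    }

for example in ("zh-Hant", "sr-Latn", "az-Cyrl", "en-US"):
    print(example, parse_tag(example))
```

Because the script subtag always has four letters and the region has two letters or three digits, the subtags can be told apart by length alone, which is part of what makes the new tags parseable.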

Common Locale Data Repository: CLDR

Common, necessary software locale data for world languages
XML format for effective interchange

Δευτέρα, 05 Σεπτεμβρίου 2005

Montag, 5. September 2005

¥ 1,234.57
1 234,57 руб.

Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…

Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…

AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…

Z < Å

CLDR 1.4 Features

Repository separated into language vs locale data
Language-specific segmentation (word/line breaks, …)
Transliterations (e.g. Ελληνικά ↔ Ellēniká)
Data for lenient date/time formatting and parsing

Programmer asks for “numeric day” + “abbreviated month”
Best format pattern returned, e.g. “dd.MMM”
Algorithm and locale data for choosing, adjusting

Calendar usage data
Quarters in dates (e.g. 2006Q1)
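The “numeric day” + “abbreviated month” request above can be pictured as a lookup from a field skeleton to the closest pattern the locale provides. The sketch below is a rough illustration under stated assumptions, not the CLDR algorithm itself: the table is hypothetical (only “dd.MMM” comes from the slide), the function names are invented, and the “adjusting” step that rewrites field widths in the returned pattern is left out.

```python
from itertools import groupby

def fields(skeleton):
    """Map each field letter to its repeat count, e.g. "ddMMM" -> {"d": 2, "M": 3}."""
    return {letter: len(list(run)) for letter, run in groupby(skeleton)}

def best_pattern(requested, available):
    """Return the pattern whose skeleton has the same fields and the closest widths."""
    want = fields(requested)
    best, best_distance = None, None
    for skeleton, pattern in available.items():
        have = fields(skeleton)
        if have.keys() != want.keys():
            continue  # a different set of fields is never a match in this sketch
        distance = sum(abs(have[f] - want[f]) for f in want)
        if best_distance is None or distance < best_distance:
            best, best_distance = pattern, distance
    return best

# Hypothetical locale data; the second entry is invented for contrast.
AVAILABLE = {"ddMMM": "dd.MMM", "yMMM": "MMM y"}

print(best_pattern("dMMM", AVAILABLE))   # numeric day + abbreviated month
```

Matching on the set of fields first and on field widths second keeps the caller independent of locale-specific ordering and punctuation, which is the point of asking for fields rather than a literal pattern.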

CLDR 1.4 Schedule

Gathering data phase: currently
Vetting phase start: March 15
Release: May 15
Aside from features:

New data, corrections
Metadata for parsing & validation
New tool for gathering/vetting data

CLDR Survey Tool

New web tool for data submission
Unicode members and others
Automatically incorporated into XML
Process for resolving differences, approval by committee

CLDR Vetting Process

Vetters confirm or approve new translations, corrections
Errors and alerts for areas of concern
Data accepted when approved by multiple organizations (plus exception process)

Unicode Security Issues

Examples:
Visual confusables: “paypal.com” with Cyrillic ‘a’, …
Non-visual problems: buffer overflows, non-shortest form, …

UTR #36: Unicode Security Considerations
Process recommendations
Best practices

UTS #39: Unicode Security Mechanisms
Limitations on Repertoire
Testing for Confusables

See Unicode Security at this conference
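Both kinds of problem named above can be demonstrated in a few lines. This is a crude sketch, not the confusable test defined by the Unicode security documents: it approximates a character's script by the first word of its Unicode name (an assumption made for brevity), and the helper names are invented.

```python
import unicodedata

def scripts_used(label):
    """Rough script proxy: the first word of each letter's Unicode character name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

def looks_mixed(label):
    """Flag labels whose letters come from more than one script."""
    return len(scripts_used(label)) > 1

def is_strict_utf8(data):
    """True only if the bytes decode under Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Visual confusable: U+0430 CYRILLIC SMALL LETTER A replaces the Latin "a".
spoof = "p\u0430ypal.com"
print(sorted(scripts_used(spoof)), looks_mixed(spoof))
print(looks_mixed("paypal.com"))

# Non-shortest form: 0xC0 0xAF is an overlong encoding of "/" and must be rejected.
print(is_strict_utf8(b"\xc0\xaf"), is_strict_utf8(b"/"))
```

A mixed-script check catches this particular spoof but not confusables within a single script, which is why a real implementation needs confusable data rather than a name-based heuristic.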

Internationalized Domain Names

Unicode Recommendations
Narrow the repertoire: exclude symbols, punctuation
Expand the coverage: currently only Unicode 3.2

Broader problem; many RFCs use Nameprep, but that is limited to Unicode 3.2

New ICANN Guidelines (2.0)
Improved, but needs more work

IETF idn-nextsteps
Positive developments, but misreads Unicode, needs more work

URL -> IRI

Internationalized Resource Identifier (IRI)
http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html

= http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D...%E8%B1%86.html

UTF-8, %-escaped
See http://ietf.org/rfc/rfc3987.txt
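The mapping from the IRI to its URI form can be reproduced with the Python standard library. The sketch below converts only the path component; the host is left untouched, since internationalized domain names are handled by a separate mechanism. The function name iri_to_uri is an illustrative choice.

```python
from urllib.parse import quote, urlsplit, urlunsplit

def iri_to_uri(iri):
    """Percent-escape the UTF-8 bytes of non-ASCII characters in the path."""
    parts = urlsplit(iri)
    # Keep "/" as a separator and leave existing %-escapes alone.
    path = quote(parts.path, safe="/%")
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

iri = "http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html"
print(iri_to_uri(iri))
```

Each non-ASCII character becomes one escape per UTF-8 byte, so 納 (U+7D0D) becomes %E7%B4%8D and 豆 (U+8C46) becomes %E8%B1%86, matching the visible ends of the escaped form on the slide.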

World Wide Web Consortium

Work Areas
Web Services Internationalization
Language Tags and Locale Identifiers
Internationalization Tag Set
CSS WG on vertical text, etc.

Many W3C specs being upgraded to include IRIs
Growing number of articles, tutorials and tests available
Find out more at

http://w3.org/International/

Ideographic Variation Database

U+82A6 ashi: multiple forms
The first occurrence – any glyph
Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
Registration for variants

Unicode Members: Full

Unicode Members: Institutional & Supporting
New membership levels, between Full and Associate

Unicode Members: Associate

Why Join?

Support the technology…
That enables your success in international, technical, and emerging markets.

Protect your investment:
The stability you need
The extensions you require
The development you call for: security, …

Demonstrate your leadership…
In furthering the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.