Whats New in Globalization Mark Davis. Unicode Character Database: UCD 5.0 Schedule Currently in β2...

Page 1

What’s New in Globalization

Mark Davis

Page 2

Unicode Character Database: UCD 5.0

- Schedule
  - Currently in β2
  - Due June, 2006
- Major part of the Unicode Standard 5.0
  - Frozen and published to give implementers a head start
- New character repertoire: +1,369
  - Total Graphic + Control: 99,089
  - Total PU/NC/SG: 139,582
- U5.0 character properties
  - New characters
  - Corrections
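
The PU/NC/SG total can be checked by hand. The sketch below assumes the abbreviation stands for private-use, noncharacter, and surrogate code points (the slide does not expand it) and simply counts the fixed ranges the Unicode Standard reserves for each. The variable and function names are illustrative only.

```python
# Inclusive code point ranges reserved by the Unicode Standard.
PRIVATE_USE = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]
SURROGATES = [(0xD800, 0xDFFF)]


def count(ranges):
    """Number of code points covered by a list of inclusive ranges."""
    return sum(hi - lo + 1 for lo, hi in ranges)


def noncharacter_count():
    """U+FDD0..U+FDEF plus the last two code points of each of the 17 planes."""
    return (0xFDEF - 0xFDD0 + 1) + 2 * 17


private_use = count(PRIVATE_USE)      # 6,400 + 65,534 + 65,534
noncharacters = noncharacter_count()  # 32 + 34
surrogates = count(SURROGATES)        # 2,048
total = private_use + noncharacters + surrogates

print(private_use, noncharacters, surrogates, total)
# 137468 66 2048 139582
```

Under that reading, the sum matches the 139,582 shown on the slide.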

Page 3

Unicode Standard 5.0

- Due 2006Q4: obsoletes previous versions
- Years of implementation experience
  - Encoding model; casing; writing systems; security; classification of code points; Unicode strings; variation selectors; new properties; linebreak; bidi; segmentation; …
  - Increased interoperability for BIDI, Indic, …
- Required basis for: regex, collation, segmentation, identifiers, security, …
- Planned for major software releases:
  - Windows Vista, Solaris, Java, GNOME, …

Page 4

Unicode Guide

- Authoritative but lightweight
- Introduction, overview, and quick reference
- Main principles of the Unicode Standard
- Best practices in software globalization
- See Globalization Gotchas at this conference

Page 5

Language Tags

- RFC3066 replacement approved: 2005-11-15
- Not yet published, but registry now operating
  - http://inter-locale.com/ID/draft-ietf-ltru-registry-14.html
- Addresses problems in RFC3066
  - Stability / accessibility / ambiguity of the underlying ISO standards
  - Parseability, extensibility; registration speed
  - Identification of script (where necessary): Traditional Chinese (zh-Hant), Serbian in Latin (sr-Latn), Azerbaijani in Cyrillic (az-Cyrl), etc.
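
To illustrate the "parseability" point, here is a minimal sketch of splitting tags like the ones above into language, script, and region by subtag shape alone. It is a simplification, not the full grammar of the RFC3066 replacement: it ignores extended language subtags, extensions, private-use subtags, and grandfathered tags, and the function name `parse_language_tag` is made up for this example.

```python
import re


def parse_language_tag(tag):
    """Split a simple language tag into language, script, region, variants."""
    subtags = tag.split("-")
    result = {"language": subtags[0].lower(), "script": None,
              "region": None, "variants": []}
    rest = subtags[1:]
    # Script: exactly four letters, conventionally title-cased (Hant, Latn).
    if rest and re.fullmatch(r"[A-Za-z]{4}", rest[0]):
        result["script"] = rest.pop(0).title()
    # Region: two letters or three digits, conventionally upper-cased.
    if rest and re.fullmatch(r"[A-Za-z]{2}|[0-9]{3}", rest[0]):
        result["region"] = rest.pop(0).upper()
    # Anything left over is treated as variants in this sketch.
    result["variants"] = rest
    return result


for tag in ("zh-Hant", "sr-Latn", "az-Cyrl"):
    print(tag, parse_language_tag(tag))
```

Because each subtag type has a distinct length and character class, the position of a missing script or region never has to be guessed.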

Page 6

Common Locale Data Repository: CLDR

- Common, necessary software locale data for world languages
- XML format for effective interchange

Δευτέρα, 05 Σεπτεμβρίου 2005

Montag, 5. September 2005

¥ 1,234.57

1 234,57 руб.

Arabic – arabski
Bulgarian – bułgarski
Czech – czeski
…

Africa – 非洲
Central America – 中美洲
Eastern Africa – 东非
Northern Africa – 北非
…

AED – د.إ.
BHD – د.ب.
DZD – د.ج.
EGP – ج.م.
EUR – €
…

Z < Å

Page 7

CLDR 1.4 Features

- Repository separated into language vs locale data
- Language-specific segmentation (word/line breaks, …)
- Transliterations (e.g. Ελληνικά ↔ Ellēniká)
- Data for lenient date/time formatting and parsing
  - Programmer asks for “numeric day” + “abbreviated month”
  - Best format pattern returned, e.g. “dd.MMM”
  - Algorithm and locale data for choosing, adjusting
- Calendar usage data
- Quarters in dates (e.g. 2006Q1)
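
The "ask for fields, get back a pattern" idea can be sketched roughly as follows. This is not the CLDR algorithm: the table `AVAILABLE_FORMATS`, its entries, the locale assignments, and the function `best_pattern` are all hypothetical, and the fallback rule (most shared fields, then fewest mismatches) is a deliberately naive stand-in for the real choosing and adjusting step.

```python
# Hypothetical per-locale data: set of requested fields -> format pattern.
# The entries are invented for illustration, not copied from CLDR.
AVAILABLE_FORMATS = {
    "de": {
        frozenset({"d", "MMM"}): "dd.MMM",
        frozenset({"d", "MMM", "y"}): "d. MMM y",
    },
    "en": {
        frozenset({"d", "MMM"}): "MMM d",
        frozenset({"d", "MMM", "y"}): "MMM d, y",
    },
}


def best_pattern(locale, requested):
    """Return the locale's pattern whose fields best match the request."""
    wanted = frozenset(requested)
    table = AVAILABLE_FORMATS[locale]
    if wanted in table:
        return table[wanted]
    # Naive fallback: most fields in common, then fewest mismatched fields.
    closest = max(table, key=lambda f: (len(f & wanted), -len(f ^ wanted)))
    return table[closest]


# "numeric day" + "abbreviated month"
print(best_pattern("de", ["d", "MMM"]))
print(best_pattern("en", ["d", "MMM"]))
```

The point of the design is that the caller names the fields it needs and never hard-codes field order or separators, which vary by locale.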

Page 8

CLDR 1.4 Schedule

- Gathering data phase: currently
- Vetting phase start: March 15
- Release: May 15
- Aside from features:
  - New data, corrections
  - Metadata for parsing & validation
  - New tool for gathering/vetting data

Page 9

CLDR Survey Tool

- New web tool for data submission
  - Unicode members and others
- Automatically incorporated into XML
- Process for resolving differences, approval by committee

Page 10

CLDR Vetting Process

- Vetters confirm or approve new translations, corrections
- Errors and alerts for areas of concern
- Data accepted when approved by multiple organizations (plus exception process)

Page 11

Unicode Security Issues

- Examples:
  - Visual confusables: “paypal.com” with Cyrillic ‘a’ …
  - Non-visual problems: buffer overflows, non-shortest form, …
- UTR #36: Unicode Security Considerations
  - Process recommendations
  - Best practices
- UTS #39: Unicode Security Mechanisms
  - Limitations on repertoire
  - Testing for confusables
- See Unicode Security at this conference
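
The “paypal.com” case can be made concrete with a small mixed-script check. This is only a heuristic sketch: it approximates a letter's script by the first word of its Unicode character name, which is not how UTS #39 defines script or confusable detection, and the function names are invented for the example.

```python
import unicodedata


def scripts_used(text):
    """Approximate the scripts in text from each letter's character name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "LATIN SMALL LETTER A" -> "LATIN"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def looks_mixed_script(text):
    """True if the letters in text appear to come from more than one script."""
    return len(scripts_used(text)) > 1


genuine = "paypal.com"
spoofed = "p\u0430ypal.com"  # U+0430 CYRILLIC SMALL LETTER A replaces the first 'a'

print(genuine == spoofed)           # False, although they render alike
print(looks_mixed_script(genuine))  # False
print(looks_mixed_script(spoofed))  # True
```

A check like this flags the spoofed label, but it would also flag legitimate mixed-script text, so it is a starting point rather than a policy.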

Page 12

Internationalized Domain Names

- Unicode recommendations
  - Narrow the repertoire: exclude symbols, punctuation
  - Expand the coverage: currently only Unicode 3.2
    - Broader problem: many RFCs use Nameprep, but that is limited to Unicode 3.2
- New ICANN Guidelines (2.0)
  - Improved, but needs more work
- IETF idn-nextsteps
  - Positive developments, but misreads Unicode, needs more work
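
For a feel of what Nameprep-based IDN processing does, Python's built-in `idna` codec can be used as a stand-in: it implements the original IDNA specification (RFC 3490) with Nameprep, and its string preparation tables are tied to Unicode 3.2, which is exactly the limitation noted above. The domain `bücher.example` and the helper name `to_ascii` are chosen for illustration.

```python
def to_ascii(domain):
    """Convert a domain name to its ASCII-compatible (xn--) form."""
    return domain.encode("idna").decode("ascii")


ace = to_ascii("bücher.example")
print(ace)                                  # xn--bcher-kva.example
print(ace.encode("ascii").decode("idna"))   # bücher.example
```

Each non-ASCII label is normalized by Nameprep and then Punycode-encoded with the `xn--` prefix; pure ASCII labels pass through unchanged.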

Page 13

URL -> IRI

- Internationalized Resource Identifier (IRI)
  - http://w3.org/International/articles/idn-and-iri/JP納豆/引き割り納豆.html
  - = http://w3.org/International/articles/idn-and-iri/JP%E7%B4%8D...%E8%B1%86.html
- UTF-8, %-escaped
- See http://ietf.org/rfc/rfc3987.txt
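
The "UTF-8, %-escaped" step can be tried with the standard library. This sketch converts only a path fragment and uses `urllib.parse.quote`, which also escapes some ASCII characters; it is not a complete RFC 3987 IRI-to-URI conversion (for instance, it does not handle the host name), and the helper name is invented here.

```python
from urllib.parse import quote


def iri_path_to_uri_path(path):
    """Encode non-ASCII characters as UTF-8 bytes and %-escape each byte."""
    # "/" is kept as a path separator.
    return quote(path, safe="/")


print(iri_path_to_uri_path("納豆"))
# %E7%B4%8D%E8%B1%86
```

The output matches the first and last escape groups visible in the slide's URI: 納 becomes %E7%B4%8D and 豆 becomes %E8%B1%86.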

Page 14

World Wide Web Consortium

- Work areas
  - Web Services Internationalization
  - Language Tags and Locale Identifiers
  - Internationalization Tag Set
  - CSS WG on vertical text, etc.
- Many W3C specs being upgraded to include IRIs
- Growing number of articles, tutorials and tests available
- Find out more at
  - http://w3.org/International/

Page 15

Ideographic Variation Database

- U+82A6 ashi: multiple forms
  - The first occurrence – any glyph
  - Second occurrence is in the name of the town Ashiya – customarily displayed with form #4
- Registration for variants
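
In plain text, a registered variant is requested by following the base ideograph with a variation selector. The sketch below shows the mechanics only: which selector corresponds to "form #4" is not stated on the slide, so the use of U+E0100 here is purely a placeholder, and the helper name is invented.

```python
import unicodedata

BASE = "\u82a6"          # the ideograph "ashi"
SELECTOR = "\U000E0100"  # placeholder selector; the real one for form #4 is not given

plain = BASE               # first occurrence: any glyph is acceptable
variant = BASE + SELECTOR  # second occurrence: a specific registered form


def strip_variation_selectors(text):
    """Drop variation selectors, leaving only the base characters."""
    return "".join(
        ch for ch in text
        if not (0xFE00 <= ord(ch) <= 0xFE0F or 0xE0100 <= ord(ch) <= 0xE01EF)
    )


print(unicodedata.name(SELECTOR))                   # VARIATION SELECTOR-17
print(len(plain), len(variant))                     # 1 2
print(strip_variation_selectors(variant) == plain)  # True
```

Software that does not know the sequence can ignore the selector and still display the base character, which is why both occurrences remain the same underlying text.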

Page 16

Unicode Members: Full

Page 17

Unicode Members: Institutional & Supporting

- New membership levels, between Full and Associate

Page 18

Unicode Members: Associate

Page 19

Why Join?

- Support the technology…
  - That enables your success in international, technical, and emerging markets.
- Protect your investment:
  - The stability you need
  - The extensions you require
  - The development you call for: security, …
- Demonstrate your leadership…
  - In furthering the goal that all the world's languages can be used on computers everywhere, from mobile phones to mainframes.