ISO Language Codes: How Different Languages are Identified
By: Shahzad Bashir
Today's business world is an ever-expanding field of companies seeking to make a name for themselves amongst the competition. To stay ahead in the race to win more customers, a company needs to leave its comfort zone and explore new areas and markets. Translation and localization help a business achieve that global success, and language codes are one of the tools that make translation and localization workable in practice.
ISO language codes are essentially the global standard for language codification, allowing international companies to identify languages, scripts, and dialects consistently. Language codes also play a significant role in the field of machine learning. Because the world has so many living languages, they needed to be labeled and categorized. Much of this work was carried out by Ethnologue, a yearly publication that collects a large amount of data on the world's living languages. In 1984, Ethnologue introduced the SIL system, a three-letter coding scheme for identifying the languages it described. The SIL system significantly influenced the ISO 639 language code standards, and since its 15th edition Ethnologue has itself used the ISO 639-3 codes. ISO 639 and its parts ISO 639-1, ISO 639-2, and ISO 639-3 are the best-known standards for language codes, and ISO 15924 is the corresponding standard for writing systems (scripts).
This article describes some of the ISO language codes that are followed by companies worldwide.
1. IETF language tags
IETF language tags are codes used to identify human languages. They are built from existing ISO standards: a tag joins a language subtag from ISO 639, an optional script subtag from ISO 15924, and an optional region subtag from ISO 3166-1 (countries) or UN M.49 (wider geographic areas). Although IETF tags can seem challenging at first, because their structure is defined by the Internet Engineering Task Force rather than by ISO, their wide use in HTTP, HTML, and XML has made them popular in the global market.
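The way a tag joins its subtags can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: it handles the common language[-Script][-REGION] shape and ignores the other subtag types (variants, extensions, private use) that full IETF tags allow. The names `LanguageTag` and `parse_tag` are hypothetical, chosen for this example.

```python
import re
from typing import NamedTuple, Optional

# Simplified shape: language[-Script][-REGION].
# Assumption: variants, extensions, and private-use subtags are out of scope.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"             # ISO 639 language subtag
    r"(?:-(?P<script>[A-Za-z]{4}))?"            # ISO 15924 script subtag
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"  # ISO 3166-1 or UN M.49 region
)


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]


def parse_tag(tag: str) -> LanguageTag:
    """Split a simple language tag into its subtags, using conventional casing."""
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unsupported or malformed tag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    return LanguageTag(
        language=match.group("language").lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
    )


print(parse_tag("zh-Hant-TW"))  # Chinese, Traditional script, Taiwan
print(parse_tag("es-419"))      # Spanish, Latin America (UN M.49 area 419)
```

For production work, a maintained library that follows the full tag grammar and the subtag registry is a safer choice than a hand-written pattern like this one.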
2. ISO 639 standards
One of the most commonly referenced sets of language codes in translation and localization, the ISO 639 lists have been developed in stages over some 30 years. The very first ISO 639-1 (then known simply as ISO 639) appeared in 1988, introduced by the International Organization for Standardization. This international NGO develops universal requirements, guidelines, and specifications, with the aim of making information, data, services, and products easier to process and share. When the codes were first introduced, the idea was to help language professionals tag and identify content accurately. Over time, natural language processing and machine learning also came to depend on proper language codification.
3. ISO 639-1
The very first ISO 639 set comprises identifiers for the world's major languages. These are the languages most commonly used in world literature, along with highly developed languages that have specialized terminology and vocabulary. ISO 639-1 uses two-letter codes and is maintained by the International Information Center for Terminology (Infoterm).
4. ISO 639-2
ISO 639-2, a three-letter code, contains identifiers for all the languages in ISO 639-1 as well as additional languages that have a significant body of literature. Because the list also includes identifiers for language groups and families, it covers nearly all of the world's languages, even if many are covered only collectively. The Library of Congress is responsible for maintaining ISO 639-2.
5. ISO 639-3
Today's computer systems need to support a very large number of languages. The ISO 639-3 list covers the individual languages of ISO 639-2 and adds ancient, extinct, historic, and constructed languages. It also identifies smaller languages that are rarely used. Its alpha-3 codes largely match the ISO 639-2 codes and are managed by SIL International. Following the SIL system, the language identifiers in ISO 639-3 were designed for use in computer systems.
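How the three parts relate can be illustrated with a small lookup table. The sketch below is a hand-picked sample, not an official code list, and the names `LanguageCodes`, `LANGUAGES`, and `lookup` are hypothetical. It also reflects one detail not covered above: for some languages ISO 639-2 has a separate "bibliographic" code (for example `fre` for French) alongside the code shared with ISO 639-3 (`fra`).

```python
from typing import NamedTuple, Optional


class LanguageCodes(NamedTuple):
    name: str
    part1: Optional[str]   # ISO 639-1 two-letter code, if one exists
    part2b: Optional[str]  # ISO 639-2 bibliographic code, if one exists
    part3: str             # ISO 639-3 three-letter code


# Illustrative sample only; the registration authorities publish the full lists.
LANGUAGES = [
    LanguageCodes("English", "en", "eng", "eng"),
    LanguageCodes("French", "fr", "fre", "fra"),
    LanguageCodes("German", "de", "ger", "deu"),
    LanguageCodes("Chinese", "zh", "chi", "zho"),
    LanguageCodes("Hawaiian", None, "haw", "haw"),  # no two-letter code
    LanguageCodes("Cantonese", None, None, "yue"),  # ISO 639-3 only
]


def lookup(code: str) -> Optional[LanguageCodes]:
    """Find a language by any of its codes (case-insensitive)."""
    code = code.lower()
    for entry in LANGUAGES:
        if code in (entry.part1, entry.part2b, entry.part3):
            return entry
    return None


french = lookup("fr")
if french is not None:
    print(french.name, french.part2b, french.part3)
```

The sample shows why a system that must handle smaller languages usually stores the ISO 639-3 code: it is the only column that is never empty.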
6. ISO 639-4
ISO 639-4 belongs to the same "codes for the representation of names of languages" series, but it does not define a code list of its own. Instead, it sets out the general principles, rules, and guidelines behind language coding and gives common guidance for applying ISO 639.
7. ISO 639-5
ISO 639-5 provides alpha-3 (three-letter) codes for language groups and families, both living and extinct.
8. ISO 639-6
ISO 639-6 provides alpha-4 (four-letter) codes intended for comprehensive coverage of language variants. Even though the ISO language code sets together cover just about all current languages, not every language appears in every part of ISO 639.
A look at the Macro Languages
Macrolanguages are essentially a mechanism for mapping language codes between ISO 639-2 and ISO 639-3, which follow quite different criteria for dividing up languages. ISO 639-2 groups languages by a shared written and literary tradition, whereas ISO 639-3 emphasizes a shared lexicon and mutual intelligibility. For example, Chinese is classed as a macrolanguage: it covers several languages whose speakers cannot easily understand one another. Hindi and Urdu, on the other hand, do not fall under a macrolanguage. In fact, even some dialects of Hindi are categorized as distinct languages.
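The relationship between a macrolanguage and its member languages can be sketched as a simple mapping. The table below is a deliberately small subset (the member lists for Chinese and Arabic are much longer in ISO 639-3), and the names `MACROLANGUAGE_MEMBERS`, `macrolanguage_of`, and `is_macrolanguage` are hypothetical.

```python
from typing import Optional

# Partial, illustrative subset of macrolanguage -> member language codes.
MACROLANGUAGE_MEMBERS = {
    "zho": ["cmn", "yue", "wuu", "hak", "nan"],  # Chinese: Mandarin, Cantonese, Wu, Hakka, Min Nan
    "ara": ["arb", "arz"],                       # Arabic: Standard Arabic, Egyptian Arabic
}

# Reverse index: member language code -> macrolanguage code.
_MEMBER_TO_MACRO = {
    member: macro
    for macro, members in MACROLANGUAGE_MEMBERS.items()
    for member in members
}


def is_macrolanguage(code: str) -> bool:
    """True if the code is a macrolanguage in this sample table."""
    return code.lower() in MACROLANGUAGE_MEMBERS


def macrolanguage_of(code: str) -> Optional[str]:
    """Return the macrolanguage a code belongs to, or None if it has none here."""
    return _MEMBER_TO_MACRO.get(code.lower())


print(macrolanguage_of("yue"))  # Cantonese belongs to the Chinese macrolanguage
print(macrolanguage_of("hin"))  # Hindi is an individual language with no macrolanguage
```

A mapping like this lets a system fall back from a specific code such as `yue` to the broader `zho` when content is only tagged at the macrolanguage level.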
Collective languages are very different from macrolanguages. They are groups of languages that, taken individually, do not meet the ISO criteria for a separate language code. Because a collective code does not refer to one individual language, these groupings are omitted from ISO 639-3. They are covered instead by ISO 639-5.
ISO 639-2 and ISO 639-3 also include four special codes for cases where an ordinary language code cannot be applied. The first is the uncoded (miscellaneous) language code (mis), the right tag for a language that is not yet included in the ISO standards. The second is the multiple languages code (mul), used when the content is in several languages. The third is the undetermined code (und), suited to cases where the language has not been identified. The last is the not applicable/no linguistic content code (zxx), for content that is not in a real language.
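The choice among the four special codes can be written as a short decision function. This is a sketch under stated assumptions: the code values `mis`, `mul`, and `und` are the ones ISO 639-2 assigns to the first three cases, the function `pick_code` and its parameters are hypothetical, and a `None` entry is used here as a convention for "a language that was identified but has no ISO code".

```python
from typing import List, Optional

SPECIAL_CODES = {
    "mis": "uncoded languages",
    "mul": "multiple languages",
    "und": "undetermined",
    "zxx": "no linguistic content / not applicable",
}


def pick_code(languages: List[Optional[str]], has_linguistic_content: bool = True) -> str:
    """Choose a language code for a piece of content.

    languages: codes of the languages found in the content; None stands for
    an identified language that has no ISO code (an assumption of this sketch).
    """
    if not has_linguistic_content:
        return "zxx"   # e.g. instrumental music or a raw data table
    if not languages:
        return "und"   # there is language, but it could not be identified
    if len(set(languages)) > 1:
        return "mul"   # content mixes several languages
    only = languages[0]
    return only if only is not None else "mis"


print(pick_code(["eng", "fra"]))  # bilingual document
print(pick_code([], has_linguistic_content=False))
```

The order of the checks matters: content with no language at all is tagged before the function asks how many languages were found.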
Choosing the right language code
Choosing the right language code can be difficult, particularly when an ISO 639-1 code stands for a macrolanguage and a more specific code may be needed. Several questions should guide the choice. The first is whether the macrolanguage code is enough to reach the intended audience. The others are whether the target audience's language is specific to one region, whether the language differs across the areas being targeted, and whether the localization project requires more than one script for the language.
The Last Word
Translation and localization can be a tough job for a firm, but with the right tools at a company's disposal it can be made easier and quicker. This is why ISO language codes are used to support translation processes. To make the job easier for companies, translation and localization agencies need to hire expert, professional linguists who are adept at handling large volumes of content and can deliver assigned projects on time.