Best Practices


Introduction 

This document presents the Best Practices and the open source tools developed in the context of the Organic.Lingua project that can be used to extend the multilinguality of an online service. Its main goal is to help portal owners transform an online service into a truly multilingual one.

Rationale

Online services that are used by people from diverse geographical regions need language technologies in order to become truly multilingual. To that end, one of the main outcomes of the Organic.Lingua project is a set of Best Practices describing how this can be achieved. The project aims to help various stakeholders through the challenging and time-consuming process of providing truly multilingual services to their users.

Target audiences

The target audiences of the Best Practices fall into three categories: Online Services, European Projects, and other Stakeholders from the public and private sectors. Each target audience is described in the following sections.

Online Services

Online services that provide content discovery services for users with significant geographical distribution such as


The above online services have an architecture and operational model similar to those of the Organic.Edunet online service. Some of them, such as AGRIS and Europeana, have already integrated automatic translation services. However, merely integrating a commercial or free MT service is not sufficient to provide a truly multilingual service. Translation quality can be poor, especially for languages with little or no Machine Translation support. In addition, domain-specific terminology is not translated correctly by generic MT services, which is a significant limitation because the terminology is critical for understanding the content. Furthermore, in content discovery services it is important to allow searching in languages other than those of the manually produced descriptions. This calls for multilingual search mechanisms, which can be enhanced with domain-specific multilingual Knowledge Organization Systems (KOSs).

Projects

European initiatives that

  • are developing federated discovery services for different types of content. Such projects are OpenAIRE, EUDAT, Open Discovery Space and agINFRA
  • are developing language technologies and frameworks e.g. META-NET
  • are developing infrastructure for sharing language resources and hosting language technologies such as META-SHARE
  • are defining standards for language technologies e.g. MUMIA

These EU initiatives can take advantage of the experience, guidelines, resources, tools and web services developed in the context of the Organic.Lingua project.


Stakeholders

The Best Practices defined in the context of the Organic.Lingua project may be of interest to the following stakeholders

  • Online services/portal owners
  • SMEs that are developing data products and services using language technologies (Technology providers)
  • Software Engineers and Architects who seek a solution for transforming an online service into a truly multilingual service
  • Digital collection managers of Libraries and Organizations that are focused on knowledge sharing
  • Language technology experts and researchers
  • SMEs that are developing language technologies


Lessons Learned

During the Organic.Lingua project, several important lessons were learned about applying language technologies in a pan-European online service such as the Organic.Edunet portal. These are:

  • Evaluate the performance of linguistic services with users and not only in vitro
  • Provide generic components that can wrap different linguistic services
  • Effort is needed to find, collect, organize and prepare domain specific language resources
  • Generate and publish language resources that can be used to improve linguistic services
  • Enable user feedback by allowing users to post-edit the MT results
  • Always inform the users that the generated text is produced by MT
  • Do not apply MT directly and on the fly to the graphical user interface of the online service
  • Do not keep several instances of the Knowledge Organization Systems but manage them centrally and publish them as linked data
  • Having a metadata aggregation service and indexing as a part of the portal introduces important limitations in terms of content update and performance

Common Multilingual Processing Framework

One of the main final outcomes of the Organic.Lingua project is the definition of a common multilingual processing framework that can be applied to any online service that aims to become truly multilingual. This framework includes a) a number of principles on which the framework is based, b) a generic architecture for the integration of language technologies in an online service and c) processes for supporting and managing multilinguality in an online service.

Principles

The main principles of the proposed framework are:

  • Open and modular architecture based on REST APIs (SOA approach) that can easily support new content sources, new language services and new KOSs. 
  • Wrapping of language services
  • Decouple front end services of the portal from the metadata aggregation services.
  • KOSs are managed externally and exposed as linked data so they can be consumed by various systems
  • The architecture is domain agnostic and can be re-used in any other domain
  • The architecture is scalable
  • The architecture is technology independent i.e. the portal can be built using either a standard CMS or an MVC framework, the language services can be built using various technologies. 

All these principles were the basis for the definition of the generic architecture that is presented in the following section.

Architecture

The architecture of the Common Multilingual Processing Framework is presented in the following diagram. The main goal was to define a generic architecture that could be applied in the case of any online service that is providing federated content discovery services. The architecture includes the following main components

  • the online service (portal) that allows content discovery
  • a language technologies component that includes all the language services, such as MT, cross-language information retrieval, domain terminology verification and automatic ontology alignment, which can be used by the portal or any other component of the architecture
  • an analytics and language wrapper service, which is the main gateway to all the language services and keeps analytics on their use. In addition, it enables user feedback on the MT output through a rating web service.
  • a metadata aggregator that harvests metadata for the content from different sources. The aggregator includes operations such as transformation, filtering and enrichment of metadata records. It is based on big data technologies to ensure scalability.
  • a KOS management and publishing system. Such a system allows the multilingual evolution of the KOSs by groups of domain experts and the publishing of the KOSs through a REST API so they can be used by other systems, e.g. the portal, semantic annotation tools and language tools.
  • the content sharing systems that provide functionalities for semantic and multilingual annotation. Such tools can be a standard digital repository software such as DSpace or a Learning Management System such as Moodle.
  • the social services component that can be used by the portal and any other app e.g. mobile apps to store and retrieve tags, ratings and comments for the content.


In the architecture, a feedback loop from users regarding the content and the automatic translations of the descriptions is depicted with a dashed line. User feedback can be implemented either as a portal module or as a widget that consumes an external service. Content metadata suggested or corrected by users is aggregated by the metadata aggregator like any other content source and, after validation, is published on the portal. Multilingual indexing of the metadata records is done externally and is not part of the portal infrastructure. The indexes are generated in the language technologies component and are queried through a REST-based search API. Alternatively, multilingual indexing can be part of the metadata aggregation infrastructure.

Architecture of the Common Multilingual Processing Framework


The proposed architecture enables:

  • content scalability, since any new repository can be connected to the metadata aggregator and made available through the portal with little effort
  • updating and changing the language services without changing code on the portal side. Suppose that one wishes to switch from the MS Translator (Bing) MT service to Google MT. In that case, only the MT client for the service needs to be updated in the Analytics and Language Wrapper component; no significant changes are required in the portal or in any other tool that uses the MT services
  • storing the transactions of the portal and of each tool with the language services in the analytics component, where they can be used at any point to extract useful information such as the ratings of the MT, which terms have been used in multilingual search, which MT service was used, etc.
  • making any update or multilingual extension of the KOSs available through the KOS management and publishing system, so that no major changes are needed in the portal or in any other tool that uses the KOSs, such as the annotation tools
  • development of new web and mobile apps that use the same content and multilingual services and target new audiences.
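
The provider swap described in the list above can be illustrated with a minimal Python sketch. It is not the Organic.Lingua implementation: the class name `LanguageWrapper`, the two stand-in backends and the log fields are all assumptions made for illustration. The point is only that callers depend on the wrapper, so changing the active MT service is a one-line configuration change and every call is recorded for analytics.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A backend is any callable (text, source_lang, target_lang) -> translated text.
Backend = Callable[[str, str, str], str]


@dataclass
class LanguageWrapper:
    """Hypothetical gateway: routes calls to the active backend and logs them."""
    backends: Dict[str, Backend]
    active: str
    log: List[dict] = field(default_factory=list)

    def translate(self, text: str, source: str, target: str) -> str:
        result = self.backends[self.active](text, source, target)
        # Keep one analytics record per transaction.
        self.log.append({"service": self.active, "source": source,
                         "target": target, "chars": len(text)})
        return result


# Stand-in backends; a real client would call an external MT service here.
def service_a(text: str, source: str, target: str) -> str:
    return f"[A {source}->{target}] {text}"


def service_b(text: str, source: str, target: str) -> str:
    return f"[B {source}->{target}] {text}"


wrapper = LanguageWrapper({"a": service_a, "b": service_b}, active="a")
first = wrapper.translate("organic farming", "en", "el")

# Switching provider touches only the wrapper, not the calling code.
wrapper.active = "b"
second = wrapper.translate("organic farming", "en", "el")
```

The portal code that calls `wrapper.translate(...)` is identical before and after the switch, which is the property the architecture aims for.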


Process View

The framework implies a specific process for the re-engineering of an existing online service to a truly multilingual one. This process includes the following steps:

  1. identify the limitations of the existing multilingual approach, e.g. where automatic translation services should be applied, which parts are not multilingual, etc.
  2. set up a powerful metadata aggregation infrastructure in case of federated discovery services.
  3. use an open source component like the Analytics and Language Wrapper Service and access all the language services through it. This service can also be integrated in tools other than the portal, such as repository tools and the KOS management and publishing tool.
  4. use a KOS management and publishing service to store the multilingual vocabularies and ontologies.
  5. enable user feedback by implementing a simple widget that registered users can use to improve translations


After applying the common multilingual framework, the process of supporting a new language in the online service is as follows:

  1. Check whether the new language is supported by the MT services wrapped by the Analytics Service. Services such as MS Bing and Google support a wide range of languages
  2. Translate the user interface of the portal and of any other tool into the new language by using the automatic translation facilities. If possible, check the translation of the whole user interface before enabling the new language at the portal
  3. Enable the new language at the portal, i.e. add the language switcher value, add the language to the MT drop-down list box, etc.
  4. Translate the KOSs into the new language by using the automatic translation services and ask domain experts to correct and improve the translations
  5. If a multilingual search functionality is used in the portal, add the language resources for the new language to support language guessing, morphological analysis and dictionary translation.


Technical View

Protocols and Standards

A number of protocols and standards should be supported in order to ensure a) the interoperability of the proposed framework with existing technical solutions and b) the sharing of content. The following protocols and standards have been used in the implementation of the framework:

  • OAI-PMH for harvesting the content descriptions (metadata)
  • REST for the integration of the different components
  • SKOS for publishing the KOSs
  • IEEE LOM as a standard for describing educational content, or any other metadata standard such as Dublin Core.
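
As an illustration of the harvesting step, the following Python sketch builds an OAI-PMH `ListRecords` request URL and extracts identifiers and language-tagged Dublin Core titles from a response. The base URL `https://example.org/oai`, the sample response and the function names are assumptions for illustration only; the verb, argument names and XML namespaces are those defined by OAI-PMH 2.0 and Dublin Core. No network call is made here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords URL; resumptionToken is an exclusive argument."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [(t.get(XML_LANG), t.text) for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)


# Hypothetical response, trimmed to the elements used above.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title xml:lang="en">Organic farming basics</dc:title>
          <dc:title xml:lang="es">Fundamentos de agricultura ecologica</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

records, next_token = parse_records(SAMPLE_RESPONSE)
next_url = list_records_url("https://example.org/oai", resumption_token=next_token)
```

A harvester would repeat the request with each returned resumption token until none is returned.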


Open Source Tools

A number of open source tools that were developed or evolved in the context of Organic.Lingua can be re-used by other online services. These are:

  • AgLR can be used to share digital resources and to provide semantic and multilingual annotation for them.
  • Language Analytics can be used a) to wrap several language services such as MS Bing, XEROX MT, MosesCore MT services b) to evaluate the quality of the translation by providing a simple rating and c) to keep analytics about all the translations that have been done.
  • MoKi can be used to collaboratively develop an ontology and to expose the ontology in linked data format.
  • agriMoodle can be used to share learning resources and courses and to provide semantic and multilingual annotation for them
  • Organic.Edunet portal infrastructure that can be re-used to develop a multilingual online document discovery service
  • Proxy service for Moses MT that can be used as a free alternative for MT services for specific language pairs
  • Machine Translation Caching service
  • Tools for automatically matching concepts between multilingual Thesauri (RDF SKOS format)
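
The Machine Translation Caching service in the list above is not described in detail here. As a general illustration of the idea (not of the actual service), the Python sketch below caches translations in memory, keyed by a hash of the language pair and the source text, so that a repeated request does not reach the MT backend again. The class name `TranslationCache` and the stand-in backend are assumptions.

```python
import hashlib
from typing import Callable, Dict


class TranslationCache:
    """Hypothetical in-memory cache in front of an MT backend."""

    def __init__(self, translate: Callable[[str, str, str], str]):
        self._translate = translate
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str, source: str, target: str) -> str:
        # The unit separator keeps the three fields from running together.
        raw = "\x1f".join((source, target, text))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def translate(self, text: str, source: str, target: str) -> str:
        key = self._key(text, source, target)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._translate(text, source, target)
        return self._store[key]


# Stand-in backend that records how often it is actually called.
backend_calls = []


def slow_backend(text: str, source: str, target: str) -> str:
    backend_calls.append(text)
    return f"[{source}->{target}] {text}"


cache = TranslationCache(slow_backend)
cache.translate("soil fertility", "en", "de")
cache.translate("soil fertility", "en", "de")  # served from the cache
```

A production cache would also need persistence and an invalidation policy, for example when users post-edit a translation.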


Best Practice Statements

In this section we present the Best Practice Statements as defined by the Organic.Lingua consortium. This set of Best Practices could be a starting point for the formation of a working group on the Multilinguality of Online Services.

Use a Service oriented approach

To allow easy integration of language technologies, the portal should be transformed into an infrastructure based on REST APIs. This also allows the development of multiple multilingual apps, both web and mobile.

Select the best MT service for your case

Nowadays, there are many options for MT services. Before selecting one, it is recommended to assess its performance on your own resources. During selection and operation, a tool such as the automatic MT selector can help in using the optimum MT service.
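
One simple way to approach such a selection, sketched here under assumptions and not taken from the Organic.Lingua automatic MT selector, is to pick, for each language pair, the service with the highest average user rating and to fall back to a default when no ratings exist. The record fields and the 1 to 5 rating scale are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def best_service(ratings, source, target, default):
    """Return the service with the highest mean rating for a language pair."""
    by_service = defaultdict(list)
    for r in ratings:
        if r["source"] == source and r["target"] == target:
            by_service[r["service"]].append(r["rating"])
    if not by_service:
        return default  # no evidence yet for this language pair
    return max(by_service, key=lambda name: mean(by_service[name]))


# Hypothetical rating records on an assumed 1-5 scale.
ratings = [
    {"service": "service_a", "source": "en", "target": "el", "rating": 3},
    {"service": "service_a", "source": "en", "target": "el", "rating": 2},
    {"service": "service_b", "source": "en", "target": "el", "rating": 4},
    {"service": "service_a", "source": "en", "target": "de", "rating": 5},
]

choice_el = best_service(ratings, "en", "el", default="service_a")
choice_fr = best_service(ratings, "en", "fr", default="service_a")
```

Averaging ratings ignores sample size; with few ratings per pair, a minimum count threshold before overriding the default would be a reasonable refinement.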

Support the graphical user interface translation with MT

Using MT to translate the graphical user interface of your online service is beneficial because it speeds up the process, but applying it on the fly is not recommended, as the results may significantly reduce the quality of the generic content of the online service. It is recommended to use MT to support the translation of all labels and generic texts of the portal, while making it possible to edit and correct the translations in an admin area. This ensures high quality of the generic content of the service. Always follow the principles of international web design (i18n).
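
The workflow recommended above can be sketched in Python as follows. This is an illustration under assumptions, not the portal's code: MT produces a draft for every interface label ahead of time, corrections entered by an administrator take precedence, and each entry carries a flag showing whether it has been reviewed. The function name, the catalog layout and the stand-in MT function are hypothetical.

```python
def build_ui_catalog(labels, mt_translate, target, overrides):
    """Pre-translate UI labels with MT; human corrections always win."""
    catalog = {}
    for key, english in labels.items():
        if key in overrides:
            catalog[key] = {"text": overrides[key], "reviewed": True}
        else:
            draft = mt_translate(english, "en", target)
            catalog[key] = {"text": draft, "reviewed": False}
    return catalog


# Stand-in MT function; a real one would call the wrapped MT service.
def fake_mt(text, source, target):
    return f"{target}:{text}"


labels = {"search.button": "Search", "nav.home": "Home"}
overrides = {"nav.home": "Inicio"}  # corrected in the admin area

catalog = build_ui_catalog(labels, fake_mt, "es", overrides)
pending_review = [k for k, v in catalog.items() if not v["reviewed"]]
```

The `pending_review` list gives the administrator the labels that still rely on raw MT output before the new language is enabled.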

Avoid incorporating text in the graphics

All the textual information that you are using in your online service should be in text form and not incorporated in graphics. You should decouple the textual information from the graphics and ensure the translatability of the content. 

Wrap language services

Wrapping language services is very important if you want to ensure the sustainability and easy maintenance of the multilingual online service. More specifically, it enables

  • integration of new MT services without making changes in the applications (web and mobile apps, annotation tools, KOS management tools, etc.)
  • integration of further language services and tools, such as caching and domain adaptation, rather than mere integration of MT services

Keep analytics

You want to know how the visitors are using the language technologies. It is recommended to keep information such as the search terms used, which translations were requested, from which MT services and for which language pairs. Such information can be very useful to extract insights for the use and performance of language technologies.
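
As a small illustration of what such analytics might look like, the Python sketch below summarizes a list of logged events by MT service, by language pair and by search term. The event fields (`service`, `source`, `target`, `query`) are assumptions chosen for this example and do not reflect the schema of the Organic.Lingua analytics component.

```python
from collections import Counter


def summarize(events):
    """Count logged language-service events by service, pair and search term."""
    events = list(events)  # allow generators; the events are scanned three times
    return {
        "by_service": Counter(e["service"] for e in events),
        "by_pair": Counter((e["source"], e["target"]) for e in events),
        "search_terms": Counter(e["query"].lower() for e in events if e.get("query")),
    }


# Hypothetical log entries.
events = [
    {"service": "service_a", "source": "en", "target": "el", "query": "Compost"},
    {"service": "service_a", "source": "en", "target": "de", "query": None},
    {"service": "service_b", "source": "en", "target": "el", "query": "compost"},
]

summary = summarize(events)
top_pair, top_pair_count = summary["by_pair"].most_common(1)[0]
```

Summaries like these show which language pairs and services carry most of the traffic, which helps decide where domain adaptation effort is best spent.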

Be domain specific

Domain specific translations are very important especially if your online service targets a specific domain such as agriculture, education, biodiversity, culture etc. It is important to train the MT models with domain specific resources and use domain specific language resources for language services like Cross Lingual Information Retrieval and automatic annotation. 

Enable user feedback for MT

It is important to allow your users to provide feedback about the MT results in your service. This can be done by a) introducing a simple evaluation with ratings for the quality of the MT and b) allowing users to improve the translation.

Publish multilingual KOSs in linked data

Publish multilingual KOSs (ranging from simple vocabularies to ontologies) on the web in a linked data format. You can use an open source tool for that. This ensures consistent annotation across systems and provides a central point of reference for all the systems that want to use the KOSs. All updates of a KOS are made available in real time, and mappings to other KOSs are done centrally.
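
To make this concrete, the Python sketch below serializes a single concept with multilingual labels as SKOS in Turtle syntax. It is a hand-written illustration and not the output of MoKi or any other tool; the concept URIs under `https://example.org/kos/` and the function name are assumptions. The `skos:Concept`, `skos:prefLabel` and `skos:broader` terms and the namespace are those of the W3C SKOS vocabulary.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def concept_to_turtle(uri, labels, broader=None):
    """Render one SKOS concept with language-tagged preferred labels."""
    lines = [f"<{uri}> a skos:Concept ;"]
    for lang, label in sorted(labels.items()):
        # Escape backslashes and double quotes for a Turtle string literal.
        escaped = label.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'    skos:prefLabel "{escaped}"@{lang} ;')
    if broader:
        lines.append(f"    skos:broader <{broader}> ;")
    # Close the statement: the final " ;" becomes " .".
    lines[-1] = lines[-1][:-2] + " ."
    return SKOS_PREFIX + "\n\n" + "\n".join(lines)


# Hypothetical concept URIs for illustration.
turtle = concept_to_turtle(
    "https://example.org/kos/organic-farming",
    {"en": "Organic farming", "de": "Biolandbau"},
    broader="https://example.org/kos/agriculture",
)
```

Because each concept has a stable URI, the portal, the annotation tools and the language services can all refer to the same identifier and pick the label in the language they need.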


Case Studies

The main goal of this section is to present cases from different sectors for which the best practices and the common multilingual processing framework could be applied to extend the multilinguality of an online service.

Online pan-European service for cultural content

Background and multilinguality requirements

In the cultural domain, Europeana is currently the pan-European service that can be used to discover the cultural treasures of Europe. Supporting all European languages in such an online service is very important. Currently, the Europeana portal supports 30 languages in its user interface. However, the experience is not truly multilingual, as

  • some parts of the portal remain untranslated, such as some vocabularies of the metadata schema used at the portal (e.g. providing country)
  • users cannot provide feedback on the automatic translation, e.g. through a simple rating or by suggesting an improved translation

Such requirements could be fulfilled by partially applying the common multilingual processing framework (processes and open source tools) and the best practices defined in the context of the Organic.Lingua project. The Europeana online service is based on a metadata infrastructure that is similar to the one used in the Organic.Edunet network. The portal infrastructure is decoupled from the metadata repository, which allows some of the Organic.Lingua solutions to be applied as described in the following section.

How the common multilingual framework could be applied

The Europeana online service could take advantage of the Best practices and Open Source tools that were developed in the context of the Organic.Lingua project by

  • using a tool like MoKi to manage and publish all the multilingual vocabularies for the EDM and ESE metadata schema
  • using the Analytics Service Component to wrap the external MT services and to enable user feedback about the MT results
  • using the MosesCore wrapper to get free MT services for some language pairs
  • using an approach for the automatic MT service selection 
  • using services for multilingual search such as the Cross Language Information Retrieval API


Online service for scientific content in the Agricultural domain

Background and multilingual requirements

AGRIS is an online service that contains more than 7 million bibliographic references on agricultural research and technology, with links to related data resources on the Web such as DBPedia, World Bank, Nature, FAO Fisheries and FAO Country profiles. It is a main tool for discovering research achievements in the agricultural domain and receives approximately 10,000 visits per day. Visitors come from 235 top-level domains and speak more than 200 languages. A large share of the visitors (80%) do not use English as a working language, which strongly indicates that the AGRIS service needs to take advantage of language technologies in order to solve one of the main problems researchers face when searching for publications on the web: discovering research outcomes in their own language.

Recently, a commercial MT service has been integrated into the AGRIS service to allow the automatic translation of the publications' descriptions (metadata). However, the experience is not truly multilingual, as

  • some parts of the portal remain untranslated, such as some vocabularies of the metadata schema used at the portal (e.g. providing country)
  • users cannot provide feedback on the automatic translation, e.g. through a simple rating or by suggesting an improved translation
  • users cannot search in their native language, as multilingual indexes of the documents are not available.

The AGRIS online service is based on a metadata infrastructure that harvests metadata from diverse sources that is similar to the one used in Organic.Edunet network. This allows the re-use of the multilingual tools and processes developed in the context of the Organic.Lingua project.

How the common multilingual framework could be applied

The AGRIS online service could take advantage of the Best practices and Open Source tools that were developed in the context of the Organic.Lingua project by

  • adopting the multilingual framework as an architecture and set of processes, e.g. the approach for translating the user interface in sync with metadata translation, user feedback, and translation of each snippet in the result set
  • using the Analytics and Language Services wrapper to a) automatically translate the descriptions of publications and b) facilitate the translation of the portal's user interface
  • using the MT caching service to speed up the translation of the descriptions
  • using the MosesCore wrapper for free MT services
  • using the Analytics and Language Services wrapping component to facilitate the translation of AGROVOC
  • using services for multilingual search such as the Cross Language Information Retrieval API
  • using the Domain terminology checking service that is based on AGROVOC to improve the translation of domain specific terms
  • using the Ontology Alignment component to facilitate the mapping of domain specific KOSs such as Organic Agriculture and Agro-ecology ontology to AGROVOC
  • adopting the approach for the user feedback about the results of the automatic translation and enabling the suggestion of the correct translations


Online services for researchers and libraries

Background and multilinguality requirements

The European Library provides access to more than 200 million scientific resources. Users can search over 24 million pages of full-text content, 18 million digital objects and 119 million bibliographic records within Europe. Supporting all European languages in this online service is very important. Currently, the European Library portal supports more than 30 languages in its user interface. However, the experience is not truly multilingual, as

  • the generic content of the portal is not translated e.g. the About page
  • the metadata elements of the records are not translated in the selected language
  • vocabularies such as the language of the resource and the resource type are not translated in the selected language

The European Library online service is based on a metadata infrastructure that harvests metadata from diverse sources that is similar to the one used in Organic.Edunet network. This allows the re-use of the multilingual tools and processes developed in the context of the Organic.Lingua project.

How the common multilingual framework could be applied

The European Library online service could take advantage of the Best practices and Open Source tools that were developed in the context of the Organic.Lingua project by

  • using a tool like MoKi to manage and publish all the multilingual vocabularies for the metadata schema used by the metadata aggregator
  • using the Analytics Service Component to enable automatic translation of the descriptions by wrapping the external MT services
  • using the Analytics and language services wrapping component to enable user feedback i.e. allow users to add rating
  • using the MosesCore wrapper to get free MT services for some language pairs
  • using an approach for the automatic MT service selection


Knowledge sharing services for a research community working on Organic Agriculture

Background and multilinguality requirements

This use case focuses on how a knowledge sharing portal for a specific research community, namely Organic eprints, can be set up easily using the Organic.Lingua Best Practices and open source tools. One of the main needs of the Organic eprints community, which includes researchers from several countries, is to share research achievements related to Organic Agriculture. Language is the main barrier: for example, researchers from Spain would like to understand content created by Danish researchers.

How the common multilingual framework could be applied

Assuming that the collections are available at the Organic.Edunet metadata repository or are published through a protocol such as OAI-PMH, to set up a portal for the Organic eprints community the following steps should be followed:

  • Download and install the web app for the Organic.Edunet portal
  • Configure in the portal the languages that should be supported both at the user interface level and at the translation of the descriptions
  • Customize the CSS and HTML to the needs of the community
  • Download and install the Analytics Service and configure the languages to be supported
  • Select the collections that should be included in the service
  • Test the service

In order to expand the collections of the new portal one should expose the metadata of the original repository using the OAI-PMH so they can be harvested by the Organic.Edunet metadata repository and made available at the portal. 


Multilingual pan-European online service for OER sharing

Background and multilinguality requirements

In the context of the Open Discovery Space project, a pan-European service is being developed to enable the sharing of Open Educational Resources (OER) and the creation of teachers' communities. The online service needs to be truly multilingual to serve pupils, parents, teachers and policy makers from several European countries. The online service is based on

  • a metadata infrastructure that harvests OER metadata from diverse sources and is similar to the one used in the Organic.Edunet network
  • a Drupal-based portal that allows OER discovery


How the common multilingual framework could be applied

The Open Discovery Space online service could take advantage of the Best practices and open source tools that were developed in the context of the Organic.Lingua project by

  • adopting the multilingual framework as an architecture and set of processes, e.g. the approach for translating the user interface in sync with metadata translation, user feedback, and translation of each snippet in the result set
  • using the Analytics and Language Services module to a) automatically translate the descriptions of OERs and b) facilitate the translation of the portal's user interface
  • using the MT caching service to speed up the translation of the descriptions
  • using the MosesCore wrapper for free MT services
  • adopting a tool such as MoKi to manage and publish the multilingual vocabularies as well as to facilitate the translation of the KOSs into several languages
  • using services for multilingual search such as the Cross Language Information Retrieval API
  • using the Domain terminology checking service to improve the translation of domain specific terms. This could be achieved by using educational KOSs.
  • using the Ontology Alignment component to facilitate the mapping among different KOSs in the educational domain
  • adopting the approach for the user feedback about the results of the automatic translation and enabling the suggestion of the correct translations

The Open Discovery Space online service is based on a metadata infrastructure that harvests metadata from diverse sources that is similar to the one used in Organic.Edunet network. This allows the re-use of the multilingual tools and processes developed in the context of the Organic.Lingua project.
