[GALA Valencia 2024] Can we obtain enough quality data for AI training and personalization?

23 Apr 2024

This event has expired, video available

AI's efficacy relies heavily on the quality and quantity of its training data. In recent years, we have witnessed a profound shift in the field, driven by the advent of Neural Machine Translation (NMT). Today, we stand on the cusp of another transformative era with the rise of GenAI. Language Service Providers (LSPs) often grapple with a dual challenge: acquiring the technical expertise to create, integrate, and harness these cutting-edge technologies, and ensuring access to enough high-quality data to train and personalize deep learning-based systems effectively. The recurring message in industry forums is clear: while technical know-how is crucial, the backbone of AI's prowess is the availability of robust data. Obtaining such data can be an expensive and labor-intensive endeavor.

To thrive in this rapidly evolving landscape, we need a technology capable of surmounting this obstacle, whether we are building models from scratch or customizing Large Language Models (LLMs) for translation and other applications. Do you need a bilingual TMX corpus for a specific language pair? Do you require this corpus to cover particular topics or to fit certain dimensions? Or are you perhaps looking for monolingual corpora in various languages?

Enter SmartBiC (Bilingual Corpora), a project funded by the European Union under the NextGenerationEU initiative's 2021 call for Research and Development projects in Artificial Intelligence and digital technologies, driven by the Spanish Public Business Entity RED.ES. SmartBiC builds on the success of the Broader/Continued Web-Scale Provision of Parallel Corpora for European Languages project (paracrawl.eu) and is scheduled to conclude in June 2024. SmartBiC is designed to equip us with the technology needed to efficiently identify, collect, align, tag, and filter bilingual data from the Internet. This data will serve as the bedrock for training selective neural engines and Deep Learning models, empowering a range of applications and sectors. These include targeted search for bilingual text based on specific criteria or domain focus; training and customization of neural machine translation systems and LLMs; training other neuro-linguistic systems; terminology extraction; and text preprocessing, cleaning, filtering, and annotation, among others.

Host organization: Globalization and Localization Association

Event Speakers

Pedro L. Diez-Orzas
Linguaserve Internacionalización de Servicios S.A.

Pedro is the founder and CEO of Linguaserve I.S. S.A., a company providing cutting-edge multilingual solutions. He has a PhD in Computational Linguistics and is also a professor at a university in Madrid. He is an expert in Language Engineering & Translation, with 25 years of professional experience in R&D in the language industries. Pedro developed his career at WordPerfect as Head of Spanish Lexicography; at Novell as Head of Multilingual Semantics; and as European Project Coordinator (INTERLEX), Technical Manager (EuroWordNet), and a member of the EU Project MLW-LT and the ITS 2.0 IG, as well as at several universities.

Almudena Ballester Carrillo
Linguaserve Internacionalización de Servicios S.A.

Language Resources and Technologies Coordinator for Linguaserve.