SPECULA: Spectral Induction and Transduction of Natural Language Models

A project coordinated by IIIA.

Web page:

Principal investigator: Xavier Carreras (Tenured Scientist, Phone Ext. 238)

Collaborating organisations:

  • Universitat Politècnica de Catalunya (UPC), CS Department.
  • Universidad Nacional de Córdoba y CONICET (Argentina).

Funding entity:

Ministerio de Ciencia e Innovación

Funding call:

Proyectos I+D+i 2020

Funding call URL:

Project #:

PID2020-119395GB-I00

Total funding amount:

71.148,00€

IIIA funding amount:

71.148,00€

Duration:

01/Sep/2021 – 31/Aug/2024

Extension date:

The SPECULA project develops theoretically grounded machine learning methods for natural language based on spectral learning. Current machine learning methods for Natural Language Processing rely heavily on supervised approaches, which limits our ability to apply natural language models to new textual domains or to the specific information needs of an application. The fundamental limitation is that there is no universal representation of natural language that enables effective generalization. While deep learning methods have made significant progress in this area, fundamental open questions remain: what class of models should we use to capture the structure and meaning of natural language, and how should we use such representations for specific natural language tasks?
 

We will study these questions theoretically and empirically using the paradigm of grammar induction and transduction, which encompasses two processes. By grammar induction we refer to unsupervised machine learning methods that learn the structure of natural language from a large and representative collection of sentences. By transduction we refer to the process of transforming the unsupervised structures learned in this way into human-designed linguistic representations, such as syntactic-semantic trees. Composing the two processes yields models that predict the linguistic structure of sentences through an intermediate grammatical representation learned from textual data. Under this induction and transduction paradigm, our research will focus on unsupervised learning of grammars for natural language, and we will test whether the grammars we learn are rich enough to solve NLP tasks via transduction processes.
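To make the composition of the two processes concrete, the following sketch writes them as typed functions. All names here (Induce, Transduce, pipeline) are illustrative placeholders, not part of the project's design:

    from typing import Callable, Iterable

    # Hypothetical type aliases, only to make the two processes concrete.
    Sentence = list[str]
    Grammar = object   # e.g., an induced probabilistic automaton or grammar
    Tree = object      # a human-designed linguistic representation

    # Induction: unsupervised; learns a grammar from raw sentences.
    Induce = Callable[[Iterable[Sentence]], Grammar]
    # Transduction: maps induced structure onto target linguistic structure.
    Transduce = Callable[[Grammar, Sentence], Tree]

    def pipeline(induce: Induce, transduce: Transduce,
                 corpus: Iterable[Sentence]) -> Callable[[Sentence], Tree]:
        grammar = induce(corpus)                # learned from raw text alone
        return lambda s: transduce(grammar, s)  # predicts trees for new input

The point of the composition is that only the transduction step needs supervision: the intermediate grammar itself is induced from unannotated text.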

 

The technical objectives of the project are to develop formulations of grammar induction and transduction that are efficient and that scale to large collections of natural language text. The main technical workhorse is spectral learning, which offers formal tools to reduce grammar induction and transduction to low-rank matrix learning problems. In doing so, we expect to improve our understanding of grammar induction for natural language and to establish connections between deep and spectral approaches to it.
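As a concrete illustration of this reduction (a minimal sketch, not the project's actual method), the classic Hankel-matrix spectral algorithm for weighted finite automata estimates a matrix of string weights, takes its truncated SVD, and reads the automaton parameters off the low-rank factors. The function names and basis choices below are illustrative:

    import numpy as np

    def spectral_wfa(f, prefixes, suffixes, alphabet, rank):
        # f: dict mapping strings to empirical weights (e.g., frequencies).
        # prefixes, suffixes: lists of strings forming the Hankel basis.
        # Returns WFA parameters (alpha0, {A_sigma}, alpha_inf).
        def hankel(middle):
            # Hankel block: entry (u, v) holds f(u + middle + v).
            return np.array([[f.get(u + middle + v, 0.0) for v in suffixes]
                             for u in prefixes])
        H = hankel("")
        h_S = np.array([f.get(v, 0.0) for v in suffixes])  # f on suffixes
        h_P = np.array([f.get(u, 0.0) for u in prefixes])  # f on prefixes
        # Low-rank step: truncated SVD of the Hankel matrix.
        _, _, Vt = np.linalg.svd(H)
        V = Vt[:rank].T                 # |suffixes| x rank
        P_inv = np.linalg.pinv(H @ V)   # pseudo-inverse of the left factor
        alpha0 = h_S @ V                # initial weight vector
        alpha_inf = P_inv @ h_P         # final weight vector
        A = {a: P_inv @ hankel(a) @ V for a in alphabet}  # operator per symbol
        return alpha0, A, alpha_inf

    def score(wfa, x):
        # Weight assigned to string x: alpha0 . A[x1] ... A[xn] . alpha_inf
        alpha0, A, alpha_inf = wfa
        v = alpha0
        for a in x:
            v = v @ A[a]
        return float(v @ alpha_inf)

The rank of the factorization plays the role of the number of latent grammatical states, which is what turns grammar induction into a low-rank matrix problem rather than a non-convex likelihood optimization.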
 
