Interactive Grounded Language Understanding is an ability that develops in young children through joint interaction with their caretakers and their physical environment. At this level, human language understanding can be described as interpreting and expressing semantic concepts (e.g. objects, actions and relations) through what can be perceived (or inferred) from the current context in the environment. Previous work in the field of artificial intelligence has failed to address the acquisition of such perceptually-grounded knowledge in virtual agents (avatars), mainly because of the lack of physical embodiment (ability to interact physically) and of dialogue and communication skills (ability to interact verbally). We believe that robotic agents are more appropriate for this task, and that interaction is so important an aspect of human language learning and understanding that pragmatic knowledge (identifying or conveying intention) must be present to complement semantic knowledge. Through a developmental approach where knowledge grows in complexity while driven by multimodal experience and language interaction with a human, we propose an agent that will incorporate models of dialogues, human emotions and intentions as part of its decision-making process. This will lead to anticipation and reaction based not only on its internal state (own goal and intention, perception of the environment), but also on the perceived state and intention of the human interactant. This will be possible through the development of advanced machine learning methods (combining developmental, deep and reinforcement learning) to handle large-scale multimodal inputs, in addition to leveraging state-of-the-art technological components of a language-based dialog system available within the consortium.
Evaluations of learned skills and knowledge will be performed using an integrated architecture in a culinary use-case, and novel databases enabling research in grounded human language understanding will be released. IGLU will gather an interdisciplinary consortium composed of committed and experienced researchers in machine learning, neurosciences and cognitive sciences, developmental robotics, speech and language technologies, and multimodal/multimedia signal processing. We expect to have key impacts in the development of more interactive and adaptable systems sharing our environment in everyday life.

IGLU is part of the European CHIST-ERA network.
The coordinator is Prof. J. Rouat, NECOTIS, Université de Sherbrooke, Québec, Canada.

Future events under the sponsorship of IGLU

First Visually-Grounded Interaction and Language (ViGIL), NIPS 2017 Workshop, Long Beach, California, USA, Friday, December 8th
Shared sponsorship with Facebook Research and Google DeepMind.

Past events under the sponsorship of IGLU

First International Workshop on Grounding Language Understanding, Satellite of Interspeech 2017, KTH Royal Institute of Technology, Stockholm, Sweden, Friday, August 25th
Shared sponsorship with ISCA.

Partners inside IGLU and list of institutional partners

List of institutional partners


The IGLU consortium is composed of 8 research teams, across 6 different countries. The project is a total effort of 325 person-months (PM).

Public Open Access Datasets and Software


We recorded 3 databases covering 3 levels of knowledge types and representations, which gives a gradation in semantic representation and in levels of interaction and grounding:
A first one for environment representation and learning on a mobile platform (ROS Create database, Documentation on arXiv);
A second one for object learning and representation on a Baxter platform (Multimodal Human-Robot Interaction (MHRI) database);
A third one for dialogue and richer semantics with the new GuessWhat?! game (The GuessWhat?! database).


HoME, a Household Multimodal Environment, has been designed to enable artificial agents to learn as humans do, in an interactive, multimodal, and richly contextualized setting. It provides support for vision, audio, semantics, physics and interaction with objects and other agents inside a 3D environment. It can integrate the 45,000 diverse 3D house layouts based on the large-scale SUNCG dataset.

Real-time GCC-NMF Blind Speech Separation and Enhancement is software that eases audio interaction by separating and enhancing sound sources.


  • Deep learning & machine learning - A. Courville (MILA, UdeM)
  • Reinforcement learning - O. Pietquin, B. Piot (CRIStAL, Lille1 & DeepMind)
  • Neurosciences and cognitive sciences - J. Rouat (NECOTIS, UdeS & UdeM), R.K. Moore (U. Sheffield)
  • Robotics - P.Y. Oudeyer (INRIA), A.C. Murillo (UNIZAR), J. Civera (UNIZAR)
  • Signal processing (audition, vision) and machine learning - J. Rouat (UdeS & UdeM), S. Dupont (U. Mons), G. Salvi (KTH)
  • Human-machine interaction - S. Dupont (U. Mons)


11 PhD & 3 M.Sc.A.



  • Brodeur, S.; Carrier, S.; Rouat, J., CREATE: Multimodal Dataset for Unsupervised Learning and Generative Modeling of Sensory Data from a Mobile Robot, IEEE Dataport, 10.21227/H2M94J, 2018
  • Brodeur, S.; Carrier, S.; Rouat, J., CREATE: Multimodal Dataset for Unsupervised Learning, Generative Modeling and Prediction of Sensory Data from a Mobile Robot in Indoor Environments, CoRR, arXiv:1801.10214v1 [cs.RO], 2018
  • K. Stefanov, J. Beskow, G. Salvi, Self-Supervised Vision-Based Detection of the Active Speaker as a Prerequisite for Socially-Aware Language Acquisition, submitted to IEEE Transactions on Cognitive and Developmental Systems.
  • G. Saponaro, L. Jamone, A. Bernardino, G. Salvi, Beyond the Self: Using Grounded Affordances to Interpret and Describe Others’ Actions, submitted to IEEE Transactions on Cognitive and Developmental Systems.
  • E. Perez, F. Strub, H. de Vries, V. Dumoulin, A. Courville, FiLM: Visual Reasoning with a General Conditioning Layer. arXiv preprint arXiv:1709.07871. Under submission at Association for the Advancement of Artificial Intelligence 2018 (AAAI 2018)


  • Pablo Azagra, Florian Golemo, Yoan Mollard, Ana Cristina Murillo and Javier Civera, A Multimodal Dataset for Object Model Learning from Natural Human-Robot Interaction, 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC, 2017, pp. 6134-6141.

  • Pablo Azagra, Javier Civera, and Ana C. Murillo. Finding Regions of Interest from Multimodal Human-Robot Interactions, Proc. GLU 2017 International Workshop on Grounding Language Understanding. 2017.

  • Julien Pérolat, Florian Strub, Bilal Piot, Olivier Pietquin, Learning Nash Equilibrium for General-Sum Markov Games from Batch Data. arXiv preprint arXiv:1606.08718, Accepted at the International Conference on Artificial Intelligence and Statistics 2017 (AISTATS 2017)

  • Brodeur, S. & Rouat, J., Optimality of Inference in Hierarchical Coding for Distributed Object-Based Representations, 15th IEEE Canadian Workshop on Information Theory (CWIT), DOI:10.1109/CWIT.2017.7994828, [pdf]

  • Harm de Vries, Florian Strub, Sarath Chandar, Olivier Pietquin, Hugo Larochelle, Aaron Courville, GuessWhat?! Visual object discovery through multi-modal dialogue. arXiv preprint arXiv:1611.08481, Accepted at the Conference on Computer Vision and Pattern Recognition 2017 (CVPR 2017) - Spotlight

  • Wood, S.U.; Rouat, J.; Dupont, S. & Pironkov, Blind Speech Separation and Enhancement with GCC-NMF. IEEE Transactions on Audio, Speech and Language Processing, pp. 3329-3341, 2017, DOI:10.1109/TASLP.2017.2656805, [pdf]

  • S. U. N. Wood and J. Rouat, Real-time speech enhancement with GCC-NMF. Proc. Interspeech 2017, 2665-2669, DOI: 10.21437/Interspeech.2017-1458.

  • S. U. N. Wood and J. Rouat, Real-time speech enhancement with GCC-NMF: Demonstration on the Raspberry Pi and NVIDIA Jetson. Proc. Interspeech 2017, 2048-2049, [pdf]

  • S. U. N. Wood and J. Rouat, Towards GCC-NMF speech enhancement for hearing assistive devices: Reducing latency with asymmetric windows, 1st Int. Conference on Challenges in Hearing Assistive Technology (CHAT-17), Stockholm, Sweden, August 19, 2017, [pdf]

  • F. Strub, H. de Vries, J. Mary, B. Piot, A. Courville, O. Pietquin, End-to-end optimization of goal-driven and visually grounded dialogue systems. arXiv preprint arXiv:1703.05423, Accepted at the International Joint Conference on Artificial Intelligence 2017 (IJCAI 2017) - Oral presentation.

  • Kumar Dhaka, A., Salvi, G., Sparse Autoencoder Based Semi-Supervised Learning for Phone Classification with Limited Annotations, Proc. GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden, 22-26, DOI: 10.21437/GLU.2017-5.

  • Fahlström Myrman, A., Salvi. G., Partitioning of Posteriorgrams Using Siamese Models for Unsupervised Acoustic Modelling, Proc. GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden, 27-31, DOI: 10.21437/GLU.2017-6.

  • Delbrouck, J., Dupont, S., Seddati, O., Visually Grounded Word Embeddings and Richer Visual Features for Improving Multimodal Neural Machine Translation. Proc. GLU 2017 International Workshop on Grounding Language Understanding, 62-67, DOI: 10.21437/GLU.2017-13

  • Stefanov, K., Beskow, J., Salvi, G., Vision-based Active Speaker Detection in Multiparty Interaction, Proc. GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden, 47-51, DOI: 10.21437/GLU.2017-10.

  • Saponaro, G., Jamone, L., Bernardino, A., Salvi, G., Interactive Robot Learning of Gestures, Language and Affordances, Proc. GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden, 83-87, DOI: 10.21437/GLU.2017-17.

  • Brodeur, S., Celotti, L., Rouat, J., Proposal of a Generative Model of Event-based Representations for Grounded Language Understanding, Proc. GLU 2017 International Workshop on Grounding Language Understanding, Stockholm, Sweden, 68-72, DOI: 10.21437/GLU.2017-14.

  • Delbrouck, J., Dupont, S., An empirical study on the effectiveness of images in Multimodal Neural Machine Translation, Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

  • H. de Vries, F. Strub, J. Mary, H. Larochelle, O. Pietquin, A. Courville, Modulating Early Visual Processing by Language. arXiv preprint arXiv:1707.00683, Accepted at the Conference on Neural Information Processing Systems 2017 (NIPS 2017) - Spotlight

  • Delbrouck, J., Dupont, S., Multimodal Compact Bilinear Pooling for Multimodal Neural Machine Translation, arXiv preprint arXiv:1703.08084

  • E. Perez, H. de Vries, F. Strub, V. Dumoulin, A. Courville, Learning Visual Reasoning Without Strong Priors. arXiv preprint arXiv:1707.03017, Accepted at ICML Speech and Language Processing Workshop 2017.


  • Azagra, P., Mollard, Y., Golemo, F., Murillo, A. C., Lopes, M., & Civera, J. (2016, December). A Multimodal Human-Robot Interaction Dataset. In Future of Interactive Learning Machines Workshop, NIPS 2016. Barcelona, Spain. December 2016, [pdf]

  • Cambra, A. B., Muñoz, A., Guerrero, J. J., & Murillo, A. C. Dense Labeling with User Interaction: An Example for Depth-Of-Field Simulation. British Machine Vision Conference (BMVC), 2016.

