Study-Unit Description


CODE CSA5011

 
TITLE Corpora and Statistical Methods

 
UM LEVEL 05 - Postgraduate Modular Diploma or Degree Course

 
MQF LEVEL Not Applicable

 
ECTS CREDITS 6

 
DEPARTMENT Intelligent Computer Systems

 
DESCRIPTION This unit focuses on techniques for designing Natural Language Processing applications whose core is a statistical model of language derived from large linguistic corpora. The study-unit is divided into three main parts, as follows.

1. Part I deals with introductory material and some of the mathematical and linguistic background. In this part, participants will also be introduced to existing corpora, as well as annotation methods.

2. Part II focuses in detail on particular areas of corpus-based research in NLP, and the methods used, with particular emphasis on:
* Research on words, word distributions, word frequencies and collocations
* Semantic similarity, vector-space models and corpus-derived thesauri
* Supervised and unsupervised Word Sense Disambiguation methods
* N-gram language models
* Hidden Markov Models

3. Part III aims to provide a more comprehensive picture of state-of-the-art NLP research, with emphasis on the following areas:
* Part of speech tagging
* Statistical Parsing
* Statistical techniques in Natural Language Generation
* Supervised and unsupervised techniques for Automatic Summarisation
* Text clustering and categorisation

These topics will be tackled at both a theoretical and practical level, with emphasis on the mathematical models underlying the various methods, as well as the way they are evaluated.

Throughout the unit, students will be encouraged to apply the skills learned through hands-on tasks, which will be discussed during weekly tutorials. These tasks will introduce students to existing tools and give them the opportunity to implement programs to carry out particular tasks.
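As an illustration of the kind of hands-on task involved, the following is a minimal Python sketch of one of the Part II topics, an n-gram language model. It is not taken from the unit's materials. The toy corpus, the function names, the `<s>` and `</s>` boundary markers and the choice of add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter
import math


def train_bigram_model(sentences):
    """Count unigrams and bigrams over tokenised sentences.

    Each sentence is padded with the (assumed) boundary markers <s> and </s>.
    """
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one smoothed estimate of P(word | prev).

    Simplification: the vocabulary is every type seen in training,
    including the boundary markers.
    """
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def sentence_log_prob(tokens, unigrams, bigrams):
    """Natural-log probability of a tokenised sentence under the bigram model."""
    padded = ["<s>"] + tokens + ["</s>"]
    return sum(
        math.log(bigram_prob(prev, word, unigrams, bigrams))
        for prev, word in zip(padded, padded[1:])
    )


# Hypothetical toy corpus, for illustration only.
toy_corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]
unigrams, bigrams = train_bigram_model(toy_corpus)

seen_order = sentence_log_prob(["the", "cat", "sat"], unigrams, bigrams)
scrambled = sentence_log_prob(["sat", "cat", "the"], unigrams, bigrams)
print(seen_order > scrambled)
```

Under these assumptions the model assigns a higher probability to a word order attested in the corpus than to a scrambled one. Add-one smoothing is used here only because it is the simplest scheme to write down; it is known to perform poorly on realistic corpora.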

Study-unit Aims

- To provide students with a solid background in statistical techniques applied to large data repositories (especially text and speech corpora)
- To familiarise students with the main subtasks of NLP in which such techniques are applied, and the principal paradigms in use today
- To allow students to use existing tools and implement their own solutions to NLP problems
- To familiarise students with evaluation techniques

Learning Outcomes

1. Knowledge & Understanding: By the end of the study-unit the student will be able to:

- formulate hypotheses in explicit, probabilistic terms, and test them
- identify the correct method(s) for solving a particular task in language understanding or generation

2. Skills: By the end of the study-unit the student will be able to:

- implement programs to solve particular problems
- evaluate the outcome of NLP systems against corpora or human users
- use tools and corpora for Natural Language analysis

Main Text/s and any supplementary readings

- C. Manning and H. Schuetze (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press
- D. Jurafsky and J. H. Martin (2009). Speech and Language Processing (2nd Ed). Upper Saddle River, NJ: Prentice Hall

In addition, several articles will be made available to students via the study-unit website (http://staff.um.edu.mt/albert.gatt/statNLP.html).

 
STUDY-UNIT TYPE Lecture, Independent Study & Tutorial

 
METHOD OF ASSESSMENT
Assessment Component/s    Sept. Asst Session    Weighting
Assignment                No                    15%
Examination (3 Hours)     Yes                   85%

 
LECTURER/S Albert Gatt

 

 
It should be noted that all the information in the description above applies to study-units available during the academic year 2023/4. It may be subject to change in subsequent years.
