Study-Unit Description


CODE CIS5113

 
TITLE Large Scale Databases

 
UM LEVEL 05 - Postgraduate Modular Diploma or Degree Course

 
MQF LEVEL 7

 
ECTS CREDITS 5

 
DEPARTMENT Computer Information Systems

 
DESCRIPTION This study-unit focuses on current research topics in databases and on data modelling for the consolidation and presentation of an organisation's data infrastructure. Building on this, it covers the scaling of data processing operations under varying data consistency requirements and, conversely, starting from operational targets and determining an acceptable (or feasible) level of operational support.

Given that a number of databases and datasets are accessible to an organisation, it is in a position to consolidate these sources so as to provide "a subject oriented, nonvolatile, integrated, time variant collection of data in support of management's decisions" (B. Inmon).

Combining a company's data with data streaming in from various sources and in various structures also makes it possible to investigate and follow up opportunities as they arise from day to day.

This unit presents knowledge and know-how on building repositories over which data warehousing and data mining exercises can be executed. It covers the handling of large data sets, whose origin can be transactional systems or pattern extraction programs (e.g. data extraction from large repositories). To better accommodate the velocity aspect of Big Data, an overview of data stream processing systems and their typical scenarios is presented.

Design and implementation techniques in SQL and procedural extensions to SQL are presented. With regard to velocity, practical use of Storm and Spark is given.
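By way of illustration, the following is a minimal sketch (in Python) of stream processing with Spark Structured Streaming: a running word count over lines arriving on a local socket. The socket source, host and port are assumptions made for the example only; a Kafka or file source could equally be used.

    # Minimal Spark Structured Streaming sketch: running word count over a socket feed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split

    spark = SparkSession.builder.appName("VelocitySketch").getOrCreate()

    # Read an unbounded stream of text lines from a (hypothetical) local socket feed.
    lines = (spark.readStream.format("socket")
             .option("host", "localhost").option("port", 9999).load())

    # Split each line into words and maintain a running count per word.
    words = lines.select(explode(split(lines.value, " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Emit the full updated result table to the console after every micro-batch.
    query = counts.writeStream.outputMode("complete").format("console").start()
    query.awaitTermination()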

Furthermore, specialised tools and techniques are studied to consolidate and validate the quality of data. It is now accepted that a portion of this processing is farmed out to less general-purpose DBMSs in a direct effort to reach performance targets.

A substantial part is devoted to query design and optimisation for these massive data repositories.
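As a small, hedged illustration of one way to inspect how such a query is planned and optimised, Spark's explain() can be used; the table and column names below (warehouse.fact_sales, region, amount) are assumptions made for the example.

    # Inspecting the logical and physical plans chosen for a warehouse-style query.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("PlanSketch").getOrCreate()

    df = spark.sql(
        "SELECT region, SUM(amount) AS total_sales "
        "FROM warehouse.fact_sales GROUP BY region")

    # Prints the parsed, analysed and optimised logical plans and the physical plan.
    df.explain(extended=True)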

Study-unit Aims:

The aims of this study-unit are to:
- instil techniques for identifying and understanding the underlying databases (and the processes executed over them);
- introduce methodologies for moving data from a source to a destination and then integrating it into a centralised repository. This centralised database needs to adhere to its own set of integrity constraints and to provide the capability of tracing data back to its source (a minimal illustrative sketch follows this list);
- extend knowledge of the physical design of a database by including hardware and design techniques (e.g. what, how, when and where to index) that are very different from those of on-line systems;
- apply data warehousing and data mining techniques that incur an extensive computational load when executed over massive datasets. Such queries/algorithms require careful design and optimisation for execution. It has become customary for a number of specific techniques to be applied to known problems;
- allow students to consider data-intensive distributed computing for both monolithic and object-oriented applications. In the case of object-oriented applications, the interoperability of objects running on different computer platforms and developed in different languages is the main issue. Another crop of tools to be introduced are current NoSQL DBMSs, which offer added performance if their set-up is acceptable (e.g. does not affect business processes);
- understand what the Velocity aspect of Big Data is;
- appreciate the challenges posed by Velocity;
- apply a data processing framework such as Spark or Storm to process data in this scenario.
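
As flagged in the second aim above, a minimal sketch of an extract-transform-load step is given below (in Python with PySpark); the file path, column names and the target table warehouse.fact_sales are assumptions made for illustration only, not part of the syllabus.

    # Hedged ETL sketch: extract a raw operational export, clean it, load it centrally.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, to_date

    spark = SparkSession.builder.appName("ETLSketch").getOrCreate()

    # Extract: read a raw operational export (assumed to be a CSV file with a header).
    raw = spark.read.option("header", True).csv("/staging/sales_export.csv")

    # Transform: type the measures, standardise dates, drop rows failing basic checks.
    clean = (raw
             .withColumn("amount", col("amount").cast("double"))
             .withColumn("sale_date", to_date(col("sale_date"), "yyyy-MM-dd"))
             .dropna(subset=["customer_id", "amount"]))

    # Load: append into the centralised repository table (assumed to exist already).
    clean.write.mode("append").saveAsTable("warehouse.fact_sales")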

Learning Outcomes:

1. Knowledge & Understanding:
By the end of the study-unit the student will be able to:

- recognise the need for, and know how to build, a cross-organisation data infrastructure for a data warehousing and data mining exercise;
- evaluate data sources and determine how to extract and move data into a staging area;
- build an organisation-wide data repository for data warehousing and data mining (at the logical and physical levels);
- write complex queries in SQL and SQL procedural extensions;
- write complex queries in NoSQL and NoSQL procedural extensions;
- understand the concept of distributing data and functions across networks of computers;
- build a framework that enables distributed computation across databases and massive datasets;
- explain the difference between building the infrastructure and querying it in terms of computational load;
- explain query processing and optimisation in massive datasets;
- gain awareness of Velocity by analysing typical scenarios;
- appreciate the different architectures used to deal with the Velocity aspect of Big Data.

2. Skills:
By the end of the study-unit the student will be able to:

- create large scale distributed and interoperable systems;
- write and implement complex database design for an enterprise infrastructure with a database high level language;
- write and implement problematic extract, load and transform methods to consolidate the source databases into the infrastructure;
- write and implement extract, load and transform methods to read the output of pattern extraction programs;
- write SQL commands for roll-up (and cube), top-n, group by, partitions and CTEs (see the illustrative sketch after this list);
- write procedures with embedded queries for basic algorithms that extract patterns;
- write code for specific data-intensive problems, e.g. association rules and clustering;
- write code for dimension reduction for data-intensive problems and datasets;
- write code to implement data mining in time series datasets;
- select, use, and deploy specialised tools for data warehousing and data mining;
- identify cases of Velocity data;
- apply the right architecture to Velocity data depending on the scenario.
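
As referenced in the skills list above, the following hedged sketch shows typical warehouse-style queries (a roll-up, and a top-n per group using a CTE and a window function) expressed in Spark SQL from Python; the table warehouse.fact_sales and its columns are assumptions carried over from the earlier ETL sketch.

    # Warehouse-style queries in Spark SQL: roll-up subtotals and top-n per group.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WarehouseQueries").getOrCreate()

    # Roll-up: per region/product subtotals plus region and grand totals in one pass.
    rollup_df = spark.sql("""
        SELECT region, product, SUM(amount) AS total_sales
        FROM warehouse.fact_sales
        GROUP BY region, product WITH ROLLUP
    """)

    # Top-n via a CTE and a window function: the three best-selling products per region.
    top_df = spark.sql("""
        WITH product_totals AS (
            SELECT region, product, SUM(amount) AS total_sales
            FROM warehouse.fact_sales
            GROUP BY region, product
        )
        SELECT region, product, total_sales
        FROM (
            SELECT pt.*,
                   ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rnk
            FROM product_totals pt
        ) ranked
        WHERE rnk <= 3
    """)

    rollup_df.show()
    top_df.show()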

Main Text/s and any supplementary readings:

• Fundamentals of Database Systems, Ramez Elmasri, Shamkant B. Navathe, 6th Edition, 2010, Addison Wesley, ISBN-13: 978-0136086208.
• Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber, Jian Pei, 3rd Edition, 2011, Morgan Kaufmann (The Morgan Kaufmann Series in Data Management Systems), ISBN-13: 978-0123814791.
• Data Warehouse Design: Modern Principles and Methodologies, Matteo Golfarelli, Stefano Rizzi, 2011, McGraw-Hill Osborne, ISBN-13: 978-0071610391.
• Principles of Distributed Database Systems, M. Tamer Özsü, Patrick Valduriez, 3rd Edition, 2011, Springer, ISBN-13: 978-1-4419-8833-1.
• A number of research papers are made available.
• System manuals as per need (available in the department's labs).

Note: Inmon and Kimball books are still a good read for data warehousing.

 
RULES/CONDITIONS Before taking this unit you are advised to take CIS2090 or CIS3107

 
STUDY-UNIT TYPE Lecture, Independent Study and Practical

 
METHOD OF ASSESSMENT
Assessment Component/s   Assessment Due   Sept. Asst Session   Weighting
Practical                SEM1             Yes                  30%
Examination (2 Hours)    SEM1             Yes                  70%

 
LECTURER/S Joseph Vella

 

 
The University makes every effort to ensure that the published Courses Plans, Programmes of Study and Study-Unit information are complete and up-to-date at the time of publication. The University reserves the right to make changes in case errors are detected after publication.
The availability of optional units may be subject to timetabling constraints.
Units not attracting a sufficient number of registrations may be withdrawn without notice.
It should be noted that all the information in the description above applies to study-units available during the academic year 2023/4. It may be subject to change in subsequent years.

https://www.um.edu.mt/course/studyunit