|TITLE||Data X - An Introduction to Data Science: Storage, Visualisation and Analysis|
|LEVEL||H - Higher Level|
|DEPARTMENT||Centre for the Liberal Arts and Sciences|
|DESCRIPTION||What is a fair price for a three-bedroom (with ensuite) apartment in Attard, and how variable is that price? Which way will a referendum go, given a small voting sample, and how confident are we in the prediction? Which of my Facebook friends like me most? Am I more popular than other people? Data science allows us to answer these questions (and more) in a rigorous manner, and gives us confidence in the answers we provide.
The exponential growth of data in the past decade has placed an emphasis on the formal study of data and given rise to the Data Scientist (or Analyst) role in industry. In many areas, such as biology, physics, finance and social networks, the generation of data has outpaced its analysis. The data available may be used either to formulate a novel hypothesis or to confirm a working one. Data science allows us to build useful models to predict behaviour and to improve business decisions.
During this Unit we will gain an introductory multi-disciplinary overview of Data Science using computational, statistical and cognitive science tools. We will go through the main steps of a data science project: processing (including cleaning noisy data), storage, visualization and modelling of data. The statistical aspect will be carried out in R, a statistical programming language. Main themes will include:
- Cleaning noisy and incomplete data originating from different sources;
- Storing Data (Relational Databases, SQL, text files);
- Statistical analysis using R (correlation vs causation, linear modelling, hypothesis testing etc.);
- Data Visualization using R (different types of graphs, how to present data, graphing pitfalls, possibility of exploring D3 for web-based visualizations etc.);
- Advanced Topics (Machine Learning, tools for big data such as Cloud Computing and Hadoop, etc.).
1. Knowledge & Understanding:
By the end of the Unit the student will be able to:
- Execute a data analysis project from start to finish. This includes completing practical worksheets in data processing, storage, visualization and modelling on real-world and artificial datasets;
- Use descriptive statistics (mean, mode, standard deviation, distributions etc.) to summarize a given dataset and formulate a hypothesis;
- Criticize or justify a decision based on the data at hand;
- Build mathematical models (e.g. linear regression) describing datasets;
- Use these models to predict real-world phenomena.
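As a rough illustration of the outcomes above (summarizing a dataset, fitting a linear model and using it to predict), here is a minimal sketch. The Unit itself works in R; Python is used here only for illustration. The function names `summarize` and `fit_line` are hypothetical, and the floor areas and prices are made-up numbers, not real property data.

```python
import statistics


def summarize(values):
    """Return the mean, mode and sample standard deviation of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),
    }


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: floor area (square metres) and price (thousands of euro).
areas = [80, 95, 110, 120, 140]
prices = [200, 230, 260, 280, 320]

slope, intercept = fit_line(areas, prices)
# Use the fitted model to predict the price of a 100 square metre apartment.
predicted = intercept + slope * 100
print(summarize(prices), slope, intercept, predicted)
```

A real analysis would also report the uncertainty of the fit, which this sketch leaves out.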
2. Skills:
By the end of the Unit the student will be able to:
- Program (and analyse datasets) in R;
- Use plotting libraries, such as ggplot2, to programmatically create graphs describing the underlying datasets;
- Use hypothesis testing to find if differences in datasets are statistically significant.
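One way to check whether a difference between two datasets is statistically significant is a permutation test, sketched below. This is only one of several possible tests and is not necessarily the one taught in the Unit, which works in R; Python is used here for illustration. The function name `permutation_test`, its parameters and the sample values are all hypothetical.

```python
import random


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def permutation_test(a, b, n_resamples=10_000, seed=0):
    """Approximate two-sided p-value for a difference in means.

    Group labels are shuffled repeatedly; the p-value is the share of
    shuffles whose difference in means is at least as large as observed.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            extreme += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (extreme + 1) / (n_resamples + 1)


# Made-up samples from two groups.
group_a = [10, 11, 12, 13, 14, 15]
group_b = [1, 2, 3, 4, 5, 6]
p_value = permutation_test(group_a, group_b)
print(p_value)
```

A small p-value suggests the observed difference would be unusual if the two groups came from the same distribution; the threshold for "significant" (often 0.05) is a convention chosen by the analyst.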
Main Text/s and any supplementary readings:
Practical Data Science with R (2014)
Nina Zumel, John Mount
An introductory textbook which takes you from start to finish.
R for Everyone: Advanced Analytics and Graphics (2014)
Jared P. Lander
Good introductory book for R programming.
Naked Statistics: Stripping the Dread from the Data (2014)
Charles Wheelan
Gives you a solid grasp of Statistics.
Data Scientists at Work (2014)
Sebastian Gutierrez
Contains a set of interviews with luminaries in the data science field. Useful to learn which technologies are used in industry.
Doing Data Science: Straight Talk from the Frontline (2013)
Cathy O'Neil, Rachel Schutt
O'Reilly's take on data science, based on a set of lectures.
The Visual Display of Quantitative Information (2001)
Edward R. Tufte
Tufte is probably the leading expert in data visualization, and this is possibly the seminal text in the field.
Information is Beautiful (2012)
David McCandless
Contains many infographics and visualization examples. A follow-up book by the same author, Knowledge is Beautiful (2014), is also available.
Statistics in Plain English (2010)
Timothy C. Urdan
Excellent first textbook for people who want to gain a working knowledge of statistics.
The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2011)
Trevor Hastie, Robert Tibshirani, Jerome Friedman
Very popular advanced text. Together with Tom Mitchell's "Machine Learning", it is considered the bible of the field.
|ADDITIONAL NOTES||Pre-requisite knowledge, skills and competences:
- A strong interest in data, an inquisitive nature and a Sherlock-like aptitude for investigation.
- The ability to program in an imperative language (e.g. Java, C, Python, Perl, Go) is beneficial.
- Familiarity with databases is a plus.
- Basic numeracy skills are required.
|STUDY-UNIT TYPE||Lecture and Practical|
|METHOD OF ASSESSMENT||
|LECTURER/S||Jean Paul Ebejer
All the information in the study-unit description above applies to the academic year 2019/0, if the study-unit is available during this academic year, and may be subject to change in subsequent years.