Big Data Management Systems
Course Description
The emergence of sophisticated web applications, the proliferation of social networks, and the massive deployment of sensor networks and other data-producing applications have led to an exponential growth of data volumes, unforeseen just a few years ago. At the same time, the incorporation of a variety of data formats (structured, semi-structured, and unstructured) into mainstream data analysis, together with the velocity of modern applications, is revising the fundamental aspects of data management. The era of big data has mandated a new generation of data management systems: not necessarily relational, focused on fault tolerance and availability, involving cloud-based computation, distributed in nature, and exploiting large main memories. The goal of this course is to delineate the challenges in managing big data, present the various systems that have emerged for this purpose, and provide representative implementations and applications. The following systems will be presented: MapReduce/Hadoop, Redis, MongoDB, Neo4j, and Azure Stream Analytics.
Learning Outcomes
The use of data in making correct, valid, and timely decisions has become a "must-have" success factor for most modern businesses and organizations. At the same time, in recent years, new technologies and applications – the spread of social networks, the widespread use of smartphones, the installation of sensors, etc. – have dramatically changed the volume and form of data: we now handle petabytes and exabytes of data in text, audio, video, and image formats. The need to manage and exploit this data has led to a new generation of systems, models, and programming tools – still in their infancy – such as MapReduce, Hadoop and its ecosystem, and NoSQL stores: technologies that allow large-scale, fault-tolerant parallel processing of data. The purpose of this course is to present the basic principles of these systems and how they operate.
Upon completion of the course, students should be able to:
- understand the concept of a "data analysis pipeline", the phases of such a pipeline, and the implementation requirements of each phase,
- use HDFS to store and retrieve data, develop MapReduce jobs to answer specific queries, and gain a first exposure to the Hadoop ecosystem (see the word-count sketch after this list),
- use a key-value store such as Redis from a programming language such as Python or Java to support applications that require one (see the Redis sketch below),
- use a document store such as MongoDB and write queries over JSON documents (see the MongoDB sketch below),
- use a graph database such as Neo4j and write queries in a graph query language such as Cypher (see the Neo4j sketch below),
- define simple continuous queries over data streams using an extended SQL query language and a stream engine such as Azure Stream Analytics (see the windowed-query sketch below).
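To ground the MapReduce outcome, the following is a minimal word-count sketch for Hadoop Streaming, which lets the mapper and reducer be plain Python scripts reading stdin and writing stdout. The file name and sample paths are illustrative, not course material.

```python
#!/usr/bin/env python3
# wordcount.py -- Hadoop Streaming word count.
# Run as "wordcount.py map" for the mapper, "wordcount.py reduce" for the reducer.
import sys

def mapper():
    # Emit one tab-separated (word, 1) pair per word read from stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so equal words arrive contiguously.
    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, 0
        current_count += int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Data moves in and out of HDFS with `hdfs dfs -put` / `hdfs dfs -get`, and the job is submitted with something like `hadoop jar hadoop-streaming.jar -files wordcount.py -mapper "wordcount.py map" -reducer "wordcount.py reduce" -input /data/in -output /data/out` (the jar location and paths vary by installation).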
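For the key-value outcome, here is a minimal sketch using the redis-py client; the connection settings and key names are assumptions for illustration.

```python
import redis

# Connect to a local Redis server; decode_responses returns str instead of bytes.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Plain string key/value.
r.set("user:42:name", "Alice")
print(r.get("user:42:name"))        # -> Alice

# Hash: one key holding several fields, a common modeling pattern.
r.hset("user:42", mapping={"name": "Alice", "city": "Athens"})
print(r.hgetall("user:42"))         # -> {'name': 'Alice', 'city': 'Athens'}

# Atomic counter, e.g. for page views.
print(r.incr("user:42:visits"))     # -> 1 on first call
```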
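For the document-store outcome, a minimal pymongo sketch follows; the database, collection, and field names are illustrative.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["blog"]["posts"]     # database "blog", collection "posts"

# Insert a JSON-like document (a Python dict).
posts.insert_one({"author": "alice", "tags": ["big-data", "nosql"], "likes": 10})

# Query: posts by alice with at least 5 likes, projecting two fields.
for doc in posts.find({"author": "alice", "likes": {"$gte": 5}},
                      {"_id": 0, "author": 1, "likes": 1}):
    print(doc)
```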
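For the graph-database outcome, here is a minimal sketch with the official neo4j Python driver, passing Cypher as strings; the URI, credentials, and the tiny movie graph are assumptions.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Idempotently create two nodes and an ACTED_IN relationship between them.
    session.run(
        "MERGE (p:Person {name: $name}) "
        "MERGE (m:Movie {title: $title}) "
        "MERGE (p)-[:ACTED_IN]->(m)",
        name="Keanu Reeves", title="The Matrix",
    )
    # Cypher query: which person acted in which movie?
    result = session.run(
        "MATCH (p:Person)-[:ACTED_IN]->(m:Movie) "
        "RETURN p.name AS actor, m.title AS movie"
    )
    for record in result:
        print(record["actor"], "->", record["movie"])

driver.close()
```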
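Finally, for the streaming outcome, below is a sketch of a windowed continuous query in the SQL-like language used by engines such as Azure Stream Analytics. It is held in a Python string purely for presentation; in practice the query text is defined in the job itself (e.g., in the Azure portal), and the input/output aliases and the EventTime field are assumptions.

```python
# Count events per device over 10-second tumbling windows.
TUMBLING_WINDOW_QUERY = """
SELECT DeviceId, COUNT(*) AS EventCount
INTO [output]
FROM [input] TIMESTAMP BY EventTime
GROUP BY DeviceId, TumblingWindow(second, 10)
"""
```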