BDARG Research Meeting

The research meeting of the Big Data and Analytics Research Group (BDARG) at the Department of Computer Science and Information Systems of UL has taken place since January 2013. The topics presented at the meeting include data mining, text mining, pattern recognition, machine learning, information visualisation, network visualisation, big data management, etc.

Organising committee: Dr. Patrick Healy, Dr. Nikola S. Nikolov, Kevin O'Brien

24 January 2019

15:00 - 16:00, CSG-027

The Global Mangrove Watch: monitoring of mangrove forests globally

Dr. Pete Bunting, DGES

Abstract: Globally, mangrove forests have witnessed a considerable amount of change over the past 30 years, with clearing (e.g., for aquaculture), erosion, sedimentation events and dieback making this environment one of the most dynamic of any ecosystem. Mangroves are also at the forefront of climate change, with changing sea levels and temperatures providing new opportunities for mangroves but also making some areas uninhabitable. The time lag between these events and the mangrove response is also short; it can be only a few months. Little is known about the overall extent of mangroves and how this extent and condition have been changing at a global scale. This study has used very large datasets to create the first global monitoring system for mangrove forests using a machine learning approach. The system came out of the JAXA Kyoto & Carbon (K&C) initiative and has therefore focused on the application of JAXA ALOS PALSAR, ALOS-2 PALSAR-2 and JERS-1 data, with the augmentation of Landsat to aid the definition of a new 2010 global mangrove baseline. Change products (from the 2010 baseline) were produced for 1996 using JERS-1, for 2007, 2008 and 2009 using ALOS PALSAR, and for 2015 and 2016 using ALOS-2 PALSAR-2. The baseline classification was created using a scalable automated machine learning approach based on the random forests algorithm, while the change products were produced using a new map-to-image change detection approach. The baseline was assessed to have an overall accuracy of 95.25% with 2.5% omission and 6% commission based on 53,800 visually assessed random points from 20 regions. Globally, mangrove change has been estimated at a rate of -0.29% per year, i.e., a loss of 5.8% in mangrove extent from 1996 to 2016. This is not equally distributed: Indonesia, which has the largest area of mangroves within a single country, lost almost 20% of its mangrove extent from 1996 to 2016.
The talk will also include an overview of some of the other projects within the Earth Observation and Ecosystem Dynamics (EOED) lab at Aberystwyth University that are based on the analysis of very large spatial datasets, together with an overview of the (open source) software and tools used to create the data processing system that undertakes these analyses.

About the presenter: Pete Bunting is a Reader in Remote Sensing within the Department of Geography and Earth Sciences (DGES). His research focuses on understanding changes in ecosystems (e.g., rainforests, savannas, mangroves) through integration of ground, airborne and spaceborne remote sensing data. He has developed a number of innovative techniques for the analysis of these data that have been implemented in several active open source software projects, including the remote sensing and GIS software library (RSGISLib), which provides a wide range of tools and algorithms for large scale processing of remotely sensed imagery. Dr. Bunting has also produced the Atmospheric and Radiometric Correction of Satellite Imagery (ARCSI) software for the rapid and automated correction of optical imagery to surface reflectance. These tools and innovations, including the KEA image file format, have enabled large area remote sensing based mapping, including the Global Mangrove Watch (GMW), which has mapped changes in global mangrove extent from 1996 to 2016. Dr. Bunting is also an expert in the application of high performance computing (HPC) for the analysis of large remotely sensed datasets.

19 November 2018

15:00 - 16:00, LG011

Empirical Modelling of Emissions from Electricity Grids

Dr. Joe Wheatley, Biospherica Risk

Abstract: De-carbonisation policy is effecting radical change to electric power grids. These systems are an example of complex networks. In this talk the problem of quantifying the impact of renewable generation on emissions in mixed renewable-thermal (fossil fuel) grids is described. A successful approach combines rich datasets available from grid operators with a stochastic matrix description of thermal generators.

About the presenter: Joe Wheatley is a TCD physics graduate. He received a PhD from Princeton University in 1988. After a career in condensed matter physics research, he left academia to work in finance as an interest rate derivatives dealer. Joe currently works on data-science projects for private clients as well as on other research projects in which he takes a personal interest. He is based in Dublin and Clare.

12 November 2018

16:00 - 17:00, ERB001

Reinforcement Learning 101

Dr. Asanka Wasala, Senior Software Engineer at Jaguar Land Rover, Shannon

Abstract: This talk is intended for those who would like an introduction to the field of Reinforcement Learning (RL). The aim is to intuitively present the basics of RL.
Reinforcement learning (RL) is a sub-field of machine learning that goes back at least as far as the 1980s. In RL, an AI agent (e.g., a self-driving car) learns what to do in a given situation (or according to some observed environmental conditions) by taking actions so as to maximise a numerical reward signal. According to Sutton & Barto, "The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two most important distinguishing features of reinforcement learning". RL therefore addresses the question of how an agent that senses and acts in a dynamic environment can be trained to choose optimal actions (a.k.a. a policy) so as to maximise a scalar reward or reinforcement signal.
In this talk, we will first provide an overview of terminology in Reinforcement Learning and cover the basic theory related to Q-learning and Deep-Q Learning. We will then present some examples of recent advances in the field.
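As a rough illustration of the Q-learning idea mentioned above, the following minimal sketch applies the tabular Q-learning update to a toy problem. The five-state chain environment, the reward of 1 at the final state, and the values of ALPHA and GAMMA are all assumptions chosen for illustration; they are not taken from the talk.

```python
import random

N_STATES = 5          # hypothetical toy chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9   # assumed learning rate and discount factor


def step(state, action):
    """Deterministic chain: reward 1 only when the terminal state is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so a purely random behaviour policy works here.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bootstrapped target: immediate reward plus discounted best next value.
            target = reward if done else reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


q_table = train()
# Greedy policy for the non-terminal states, read off the learned table.
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The delayed reward is visible in the learned table: states further from the goal receive smaller, discounted values, yet the greedy policy still points towards it.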

About the presenter: Asanka Wasala is a Senior Software Engineer in the AI team at Jaguar Land Rover, Shannon, Ireland. Previously, he worked as a Postdoctoral Researcher in Software Engineering at Lero, the Irish Software Research Centre. During his PhD studies, he also served as a voting member of the OASIS XML Localisation Interchange File Format (XLIFF) Standard Technical Committee. His main areas of expertise are Natural Language Processing, Speech Processing, Machine Learning and Software Localization. In 2004, Asanka graduated from the prestigious University of Colombo, Sri Lanka, receiving the best student award (Gold Medal) and being the only person to obtain a First Class qualification that year. After completing his BSc in Physical Science, he worked in the PAN Localization project as a Senior Research Assistant at the University of Colombo School of Computing, Sri Lanka, where he developed the first Sinhala Text-to-Speech Engine. He received a full scholarship from Microsoft Ireland to pursue his Master's degree at the University of Limerick, before transferring to the Ph.D. programme in 2010. In 2013 he completed his PhD thesis on the identification of limitations of localization data exchange standards using corpus-based approaches and Data Mining techniques. His thesis won the LRC Best Thesis Award (2013) and a prize from Microsoft Ireland.

14 June 2018

15:00 - 16:00, CSG-27

Open Source Processing of Remote Sensing Images with the ORFEO ToolBox: (very) Big Data Science At Scale

Manuel Grizonnet, Centre National d’Études Spatiales (French Space Agency)

Abstract: From weather forecasting to military intelligence, satellite images help solve some of our most challenging problems on Earth. Since 2006, the French Space Agency have been actively developing an open source remote sensing image processing toolbox called the Orfeo ToolBox (OTB), which provides a large set of ready-to-use tools and a high performance satellite image viewer. It offers a wide range of processing algorithms, which permit the creation of high level processing chains that will run on a desktop computer or clusters alike. The processing capabilities cover pre-processing for several sensors, feature extraction, image segmentation, classification, as well as algorithms for hyperspectral and radar data. OTB is now used to design processing chains that can efficiently process thousands of remote sensing images (Sentinels, SPOT, Pléiades) to derive country-scale value added products.

The presentation will outline the challenge of extracting valuable information from Earth Observation data and how open source software like OTB can help in this context.

About the presenter: Manuel Grizonnet received the Mathematical modelling, Vision, Graphics and Simulation Engineer degree from the École Nationale Supérieure d'Informatique et de Mathématiques Appliquées de Grenoble, France, in 2007. From 2007 to 2009 he worked for BRGM (French geological survey) in Niamey, Niger, as a systems and GIS engineer within the SYSMIN project, which aims to give Niger ways to promote its mining potential by establishing a Geological Information System (GIS). He is currently with the Centre National d’Études Spatiales (French Space Agency), Toulouse, France, where he is developing image processing algorithms and software for the exploitation of Earth observation images.

22 May 2018

14:00 - 15:00, CSG-27

Applying AI Solutions in Large Non-Tech Companies

David Curran, OpenJaw Technologies

Abstract: A short talk on how to harness and coordinate the abilities of large companies to deploy AI solutions in industry with examples focused on chatbots and natural language processing.

About the presenter: David Curran is a machine learning engineer at OpenJaw Technologies where he is involved in the development of an airline chatbot platform. Previously, he worked as a technical lead for IBM Watson where he created chatbots for Mercedes, Credit Mutuel, Orange Bank, La Caixa, RBS, RBC, Arthritis UK, the Dubai Government and others. He has also worked as the CTO of the legal search company Courtsdesk and the optimization company Fair and Square.

15 May 2018

15:00 - 16:00, CSG-27

Drawing Big Networks: Shape, Proxies, and Sampling

Peter Eades, Emeritus Professor, University of Sydney

Abstract: Shape faithfulness: It is easy to see that a function that takes a very large data set and draws it on a computer screen cannot be invertible, because there are not enough pixels on the screen. As such, visualization functions may be a little "unfaithful" to the data. As big data gets bigger, visualization unfaithfulness becomes a big issue. We describe a model for network visualization faithfulness, and suggest some metrics based on the idea of the "shape" of the picture.
Proxies and sampling: The "proxy approach" to very large network visualization has two steps: (1) Compute a proxy network G' from the input network G. The network G' should be much smaller than G, but G' should represent G well (in some sense). (2) We draw G', and treat it as a drawing of G. We describe experiments with using a sample of a large network as a proxy. In particular, we investigate spectral sampling.
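The two-step proxy idea can be sketched in a few lines. The sketch below uses uniform random vertex sampling as a stand-in; it is not the spectral sampling investigated in the talk, and the function name sample_proxy and the ring-shaped input network are assumptions made purely for illustration.

```python
import random


def sample_proxy(edges, k, seed=0):
    """Step (1): build a proxy G' as the subgraph induced by k sampled vertices."""
    vertices = sorted({v for edge in edges for v in edge})
    rng = random.Random(seed)
    kept = set(rng.sample(vertices, k))
    # Keep only the edges whose endpoints both survive the sampling.
    proxy_edges = [(u, v) for u, v in edges if u in kept and v in kept]
    return kept, proxy_edges


# Hypothetical input network G: a ring of 12 vertices.
G = [(i, (i + 1) % 12) for i in range(12)]
kept, G_proxy = sample_proxy(G, k=6)
print(len(kept), len(G_proxy))
```

Step (2) would then pass G_proxy to any layout routine and treat the result as a drawing of G. How well such a sample represents G is exactly the question the talk raises.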

About the presenter: Peter Eades is an Emeritus Professor in the School of Information Technologies at the University of Sydney, known for his research in Graph Drawing. He gained his PhD from the Australian National University in 1978, and has worked in Universities and Research Institutes across the world.

2 May 2018

15:00 - 17:00, CSG-27

Bandwidth Aware Geo-Distributed Data Analytics

Mohammed Bergui, PhD student at Laboratory of Intelligent Systems and Applications (LSIA), USMBA

Abstract: In the era of global-scale services, organisations produce huge volumes of data, often distributed across multiple data centres, separated by vast geographical distances. The necessity to utilise such infrastructure introduces new challenges in the data analytics process due to bandwidth limitations of the inter-data-centre communication. In this talk we summarise these challenges and discuss possible solutions.

Lung Nodule Detection with Convolutional Neural Networks and Restricted Boltzmann Machines

Brahim Ait Skourt, PhD student at Laboratory of Intelligent Systems and Applications (LSIA), USMBA

Abstract: The utilization of deep neural networks in the medical field has recently been the subject of great interest due to the ability of deep nets to be effectively trained with large amounts of data for accurate pattern recognition in medical images. In this talk we discuss the use of two types of deep nets, Convolutional Neural Networks and Restricted Boltzmann Machines, for detection and classification of lung nodules in MRI images as well as the application of Restricted Boltzmann Machines for extracting features from MRI images.

The Explosion of Genomic Data and the Need for Feature Selection

Khawla Tadist, PhD student at Laboratory of Intelligent Systems and Applications (LSIA), USMBA

Abstract: In this talk we discuss the need for feature selection techniques specifically designed for genomic data. It is expected that by the year 2025, genomics will become the main generator of big data as researchers aim to sequence the genomes of all living creatures. In particular, privacy concerns make it hard to utilise such data in the medical field without appropriate pre-processing, which in turn leads to more complexity and veracity issues. Feature selection techniques are expected to become a game changer that can help substantially reduce the complexity of genomic data, making it easier to analyse.

About the presenters: Khawla Tadist, Brahim Ait Skourt and Mohammed Bergui are PhD students at the Laboratory of Intelligent Systems and Applications (LSIA), USMBA, Morocco. They have spent four months at CSIS as visiting PhD students within an Erasmus+ mobility programme.

12 April 2018

15:00 - 16:00, CSG-27

Applied Predictive Modelling in the Energy Sector

Sean Walsh, Bord Gais Energy, Scully Analytics

Abstract: The primary business problem for companies that follow a subscription business model is customer churn. It is usually far more expensive to acquire a new customer than it is to retain a current one. As such, many companies wish to maximise retention while also creating positive sentiment among their customer base. Churn in the energy sector is difficult to predict but there are many algorithms and strategies, from both the regression and classification domains, which a data scientist can utilise to address the problem. This talk will discuss the work-flow associated with the development of a typical churn modelling system and some powerful R packages that can help along the way.

About the presenter: Sean Walsh is a data scientist at Bord Gais Energy with a background in environmental science and statistics. He has worked as a data scientist for almost three years specialising in statistical modelling and data visualisation using the R programming language. He is a regular contributor to the popular R-bloggers website and has consulted on data science projects across a range of professions including sports science, physiotherapy, agricultural science and psychology.

29 March 2018

15:00 - 16:00, CSG-27

Zellige (Arab-Spanish Ceramic) Decor Images Retrieval System Based on Shape and Spatial Relationships Indexing

Prof. Arsalane Zarghili, Head of Laboratory of Intelligent Systems and Applications (LSIA) at Sidi Mohamed Ben Abdellah University, Morocco

Abstract: Arab-Spanish decorative arts, which tend towards abstraction based on rigorous geometrical laws, reached the summit of perfection in the Arab-Spanish civilization. Today, this decorative art renaissance in Morocco is stored in the form of images taken by amateurs, professionals and specialists in Arab-Andalusian arts. These images, which constitute a documentary background of inestimable value, are scattered, making them difficult to access. The advent of Information and Communication Technologies (ICT) will, in time, establish databases where all these documents, once scanned, will be available to historians, archaeologists and all amateurs and professionals. Visual information retrieval in this type of image database has recently been addressed by a new approach based directly on the information content (shape, colour, texture, etc.) of the query images instead of using keywords. From this perspective, we had the idea to develop an Arab-Andalusian decor image retrieval system, especially for the ceramic decors called Zellige.

Towards a Multi-font printed and Handwritten Arabic OCR

Prof. Arsalane Zarghili, Head of Laboratory of Intelligent Systems and Applications (LSIA) at Sidi Mohamed Ben Abdellah University, Morocco

Abstract: Optical character recognition, usually abbreviated as OCR, is a field of research in artificial intelligence, pattern recognition and computer vision. OCR involves computer systems designed to translate text images into machine-readable and editable text. Such a system can be used in many applications such as document processing, bank check processing, automatic mail sorting and routing, verification of signatures, etc. Although many researchers have addressed Arabic text recognition, it still faces great challenges, especially for handwritten text. Our aim is to construct an offline OCR system able to recognize Arabic multi-font printed and handwritten text. This system is part of a bigger project, which will permit the digitisation of the oldest library in the world: AL Quaraouiyine University in Fez, founded in 859, which contains about 24000 printed books dating back more than 100 years, existing only as unique copies, and 4000 handwritten books dating back more than 1200 years.

About the presenter: Arsalane Zarghili is a full professor at Sidi Mohamed Ben Abdellah University (USMBA) in Fez, Morocco. He received his Ph.D. in 2001 and joined the Faculty of Science and Technology (FST-Fez) of USMBA in 2002. From 2007 to 2010 he was the head of the Department of Computer Science and chair of the Master in Software Quality programme at FST-Fez. In 2011, he co-founded the Laboratory of Intelligent Systems and Applications (LSIA) at FST-Fez as well as the research group “Artificial Vision and Embedded Systems”. From 2011 to 2014, he was the chair of Master in Intelligent Systems and Networks. He is a member and a treasurer of the Moroccan Association of Innovation and Engineering in Healthcare (AMIIS). Prof. Zarghili’s main research contributions are in pattern recognition, image indexing and retrieval systems in cultural heritage, biometrics, and healthcare. He also works on Arabic natural language processing and especially in Arabic OCR and digitalization of Arabic manuscripts. He is a co-editor of the 2014 special issue of the IADIS International Journal in Computer Science and Information Systems (IJCSIS) and a co-editor of the 2017 special issue of Elsevier’s JKSU.

15 March 2018

15:00 - 16:00, CSG-27

Data Warehousing in the age of big data. The end of an era?

Uli Bethke, CEO of Sonra and VP of DAMA Ireland

Abstract: The hype around big data technologies has reached fever pitch levels. Marketing departments go into overdrive and come up with cool new buzzwords every day. If we believe the vendors, big data tools can cure world hunger! But do schemas on data lakes, query offloads, and the Hadoops of this world really deliver? Can they tame the growing data volumes and prevent the data warehouse from bursting at the seams? What about growing license costs? And are there techniques to shorten the data warehouse lifecycle? We all know it takes forever to get a subject area into an enterprise data warehouse. In this talk I will share my experience in big data warehousing: what works and what does not. Where are the opportunities and limitations of big data tools and concepts, and where are we better off sticking with traditional technologies? The presentation is a condensed version of my popular training course Big Data for Data Warehouse Professionals.

About the presenter: Uli Bethke has 18 years of hands-on experience as a consultant, architect, and manager in the data industry. He is a traveller between the worlds of traditional data warehousing and big data technologies and has delivered data warehouse solutions in Europe, North America, and South East Asia. Uli is a regular contributor to blogs and books, holds an Oracle ACE award, and chairs the Hadoop User Group Ireland. He is also a co-founder and VP of the Irish chapter of DAMA, a nonprofit global data management organization, as well as a co-founder of the Irish Oracle Big Data User Group. Uli holds degrees in political science from Freie Universität Berlin, Albert-Ludwigs-Universität Freiburg, and the University of Ulster, Coleraine. Last but not least, he is the CEO of Sonra, the data liberation company, which develops Flexter, a tool to automate the conversion of complex XML to formats suitable for a variety of database management systems and Spark/Hadoop.

15 February 2018

15:00 - 16:00, CSG-27

Data Science @ Zalando

Paul O'Grady, Zalando Fashion Insights Centre

Abstract: In this industry-focused talk we will present the Zalando approach to Data Science. The talk is split into two sections. In the first we will discuss a number of topics including the technologies we employ in our solutions, the tools we use in our day-to-day work and the skill sets required to do the job. In the second section we will present a solution to a real-world problem we encountered in the design of one of our systems. The problem is related to the Record Linkage Problem, which is an O(n^2) problem, and we discuss our approach to making our system run at scale given the computational complexity of the problem.
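To see where the O(n^2) cost comes from, the sketch below counts the candidate pairs for a naive all-pairs comparison and for blocking, a commonly used way of reducing the number of comparisons. The abstract does not say which technique Zalando used, so blocking is shown only as one plausible option. The records and the choice of city as the blocking key are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

records = [  # hypothetical records, for illustration only
    {"id": 1, "name": "Anna Smith", "city": "Dublin"},
    {"id": 2, "name": "Ana Smith", "city": "Dublin"},
    {"id": 3, "name": "Brian Kelly", "city": "Cork"},
    {"id": 4, "name": "Bryan Kelly", "city": "Cork"},
    {"id": 5, "name": "Ciara Walsh", "city": "Limerick"},
]


def all_pairs(recs):
    """Naive linkage: every record is compared with every other, n(n-1)/2 pairs."""
    return list(combinations([r["id"] for r in recs], 2))


def blocked_pairs(recs, key):
    """Blocking: only records that share a blocking key are compared."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[key(r)].append(r["id"])
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs


naive = all_pairs(records)
candidates = blocked_pairs(records, key=lambda r: r["city"])
print(len(naive), len(candidates))
```

The trade-off is that a poorly chosen blocking key can separate true matches into different blocks, so the key has to be tolerant of the errors expected in the data.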

About the presenter: Dr. Paul O’Grady is a Senior Data Scientist at Zalando's Fashion Insights Centre in Dublin, where he works on a Data Platform team. Paul's day-to-day work involves looking at computationally intensive problems and making them work at scale. He has a Ph.D. in Machine Learning from the Hamilton Institute at NUI Maynooth, and has also worked as a post-doctoral researcher in the School of Engineering at University College Dublin. Since leaving academia Paul has worked in a number of different industries in the areas of Machine Learning & Software Engineering. He is a committed Python Ireland Committee member, and has been involved in the running of the PyCon Ireland conference over the last couple of years.

1st February 2018

15:00 - 16:00, CSG-27

Feature Location in Source Code

Farshad Ghassemi Toosi, Lero

Abstract: Feature location is the task of locating the implementation of a feature in the source code of a software system. A feature, in terms of a software system, is a functional requirement of a system or a user-observable behaviour of a system. For example, the 'mkdir' feature in Linux offers the user the functionality of making a new directory, and the identification of the relevant implementation of this feature (mkdir) in the main source code is the objective of feature location. The importance of the feature location task is highlighted in several different fields such as software maintenance, software evolution, software debugging or even source code migration. There are many different feature location techniques that are used to perform the task of feature location. Based on different criteria (input data, developer’s requirements etc.) one technique, or a combination of them, can be suggested. There are already some tools that offer implementations of these techniques that can be directly used by developers to inspect their source code and locate features.

About the presenter: Dr. Farshad Ghassemi Toosi is a postdoctoral research assistant in Lero. His work involves feature location and source code analysis. He completed his PhD at the Department of Computer Science and Information Systems in UL with a thesis on network visualization.

18th January 2018

15:00 - 16:00, CSG-27

Star Wars: A Social Network Analysis

Siobhán Grayson, Insight Centre for Data Analytics, UCD

Abstract: An introductory talk for anyone interested in learning about social network analysis that focuses on how to extract, process, and construct networks from text using Python and the scripts of Star Wars. The talk will then demonstrate how to calculate basic network metrics and describe the relevance of each metric in the context of our Star Wars dataset.
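A minimal sketch of the kind of pipeline described above might look as follows. It builds a character co-occurrence network from per-scene cast lists and computes two basic metrics, degree and density. The scene lists are invented placeholders rather than data extracted from the actual scripts, and the talk may well have used a dedicated network library instead of plain Python.

```python
from collections import Counter
from itertools import combinations

scenes = [  # hypothetical scene cast lists, not real script data
    ["LUKE", "LEIA", "HAN"],
    ["LUKE", "OBI-WAN"],
    ["HAN", "LEIA"],
    ["LUKE", "VADER"],
]

# Edge weight = number of scenes in which two characters appear together.
edge_weights = Counter()
for cast in scenes:
    for a, b in combinations(sorted(set(cast)), 2):
        edge_weights[(a, b)] += 1

# Degree = number of distinct characters a character shares a scene with.
degree = Counter()
for a, b in edge_weights:
    degree[a] += 1
    degree[b] += 1

# Density = fraction of all possible character pairs that are connected.
n = len(degree)
density = 2 * len(edge_weights) / (n * (n - 1))

print(degree.most_common(1), density)
```

In this toy network the character with the highest degree is the one who shares scenes with the most others, which is the kind of interpretation the talk attaches to each metric.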

About the presenter: Siobhán Grayson is a third year PhD Student in Computer Science whose primary research involves identifying structure in multi-relational social networks. Currently, she is focused on modelling discussion forums by combining text and network analysis using data collected from Reddit. Her research interests include machine learning, social network analysis, link prediction, digital humanities, natural language processing, time series analysis, and anomaly detection.

24th November 2017

15:00 - 16:00, KB119

A Novel Text Mining Approach Using a Bipartite Graph Projected onto Two Dimensions

Stephen Redmond, Accenture Ireland

Abstract: The collection of text data is exploding. From blogs to news reports, to helpdesk tickets, there seems to be a never-ending supply of writings. The owners of these data seek methods to group texts and look for clusters of topics. Because of the size of the data, solutions that scale on clustered computer solutions are ideal. The traditional term vector approach can lead to the curse of dimensionality. Simple solutions are better than complex ones because it is often necessary to explain the model to either business users or even regulators. This talk presents a method of keyword mining using the graph-of-words technique and classification by projecting the bipartite graph of terms and documents onto two dimensions. This method can be scaled using a cluster computing technology such as Apache Spark, and the results are easily surfaced to users.

About the presenter: Stephen Redmond is a Senior Manager at Accenture in the Analytics Delivery team. He holds a Master’s degree in Data Analytics from the National College of Ireland, where he also holds a part-time Associate Faculty role, lecturing to post-graduate students in the area of big data analytics. After over 20 years working in data, he has covered most of the bases, from back-end databases and warehouses to front-end data visualisation. He has a focus on using information and technology to solve business problems. Stephen is the author of Mastering QlikView, QlikView Server and Publisher, and the QlikView for Developers Cookbook. He was recognised as a Qlik Luminary in 2014, 2015, 2016 & 2017.

26th October 2017

10:00 - 11:00, CSG-027

The data.table R package - High Performance Data Processing for R

Kevin O’Brien, Dept. of Mathematics and Statistics, UL

Abstract: An unheralded, but critically important, component of data science is the management of data. The data.table R package provides a high-performance version of base R's data.frame with syntax and feature enhancements for ease of use, convenience and programming speed. Data.table performs extremely well for large datasets, and also offers powerful indexing, transformation/grouping, and merging/joining. The features of data.table include fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns and a fast file reader (fread). It offers a natural and flexible syntax for faster development.

About the presenter: Kevin O’Brien is a Limerick-based data scientist who specialises in agriculture and forestry, working with R and Python. He is also a teaching fellow in UL.

6th April 2017

15:00 - 16:00, CSG-027

Machine Learning, Visualization, Distributed and High-Performance Computing in Julia

Dr. Paulito P. Palmes, IBM Research

Abstract: In the long history of programming languages, only a handful have been developed and designed from the outset for scientific and high-performance computing. Most noteworthy among them is FORTRAN, developed in 1954 by John Backus of IBM. FORTRAN is still a favourite language for running numerical calculations on the world’s fastest supercomputers because of its highly optimized compiled code. With the increasing improvements in computer technology, scripting languages such as R (1993) and Python (1991) became more popular in statistical and scientific computing for their user-friendly syntax and interactivity, at the expense of performance. Typically, R and Python interpreters are used to call functions from libraries compiled in C/C++ or Fortran. Critical tasks are still written in a low-level language for performance, which makes the entire development workflow slow because of having to deal with the two-language problem.
Recently, a new language called Julia (2012) was developed at MIT to take advantage of recent developments in compiler technology and computing. Julia is a high-level, just-in-time compiled language designed for high-performance computing. It runs as fast as statically-compiled languages such as C/Fortran while offering a very high-level dynamic language. It supports modern language features such as map-reduce style parallelism and distributed computation, coroutines, automatic code generation, metaprogramming, a dynamic type system, and multiple dispatch. It aims to solve the two-language problem by providing a user-friendly, interactive, and dynamic language while compiling the code on the fly for optimal performance. In Julia, most of the math/stat/scientific/machine-learning libraries and packages are written in Julia itself, which makes the entire workflow easy to explore, modify, and maintain. In this talk, the presenter will discuss their recent explorations of using Julia to implement machine learning algorithms exploiting various Julia features including, most importantly, parallelism, distributed computations, and metaprogramming. The talk covers the following areas: parallel weight update across layers by Direct Feedback Alignment (DFA) for deep learning architectures, population ensembling and evolution of machine learning models, as well as performance tracking and visualization.

About the presenter: Paulito P. Palmes is currently a Research Scientist in IBM Ireland’s Dublin Research Lab (DRL) with research interests in the areas of data mining, deep learning, cognitive computing, and biomedical engineering.

9th March 2017

11:00 - 12:00, CSG-027

The Tidyverse Toolkit: Managing your Data with R

Presenter: Kevin O’Brien, Dept. of Mathematics and Statistics, UL

Abstract: An unheralded, but critically important, component of data science is the management of data. In his Tidy Tools Manifesto, Hadley Wickham proposes four basic principles for any computer interface for handling data: reuse existing data structures, compose simple functions with the pipe, embrace functional programming, and design for humans. The tidyverse is a collection of R packages that share common philosophies and are designed to work together. Tidyverse puts a complete suite of modern data-handling tools into your R session, and provides an essential toolbox for any data scientist using R. This talk will look at several key components of the tidyverse, including dplyr, tidyr and stringr.

About the presenter: Kevin O’Brien is currently completing a PhD in Statistics in UL. He is also a teaching fellow for undergraduate statistics modules in UL.

3rd November 2016

14:00 - 15:00, CSG-027

Utilising Heterogeneous Information Networks for Hybrid Recommendation in a Sparse Matrix

Presenter: Haiyang Zhang, Telecommunications Research Centre, UL

Abstract: Item-based collaborative filtering (CF) is one of the most widely used techniques for building recommendation systems. However, it often suffers from data sparsity and a cold start problem. Additional item metadata, which naturally constitutes a heterogeneous information network (HIN), may be used to solve these problems. In this work, we combine item-based CF with different types of item relationships in the HIN to exploit sparse implicit feedback data. We introduce meta path-based features which represent item relations and define an integrated item-based CF model. The Bayesian personalised ranking optimisation technique is used to estimate the proposed model. The proposed method is evaluated by comparing its results with those of traditional item-based CF. The experimental results demonstrate that the proposed approach achieves better accuracy.
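
For readers unfamiliar with the baseline, the following is a minimal sketch of traditional item-based CF on implicit feedback, not the HIN-based model presented in the talk. The toy `interactions` data, the function names, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical implicit-feedback data: user -> set of items interacted with.
interactions = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"d"},
}

def item_users(interactions):
    """Invert the user -> items map into an item -> users map."""
    inverted = {}
    for user, items in interactions.items():
        for item in items:
            inverted.setdefault(item, set()).add(user)
    return inverted

def cosine(users_i, users_j):
    """Cosine similarity between two binary item columns."""
    if not users_i or not users_j:
        return 0.0
    return len(users_i & users_j) / sqrt(len(users_i) * len(users_j))

def recommend(user, interactions, top_n=2):
    """Score each unseen item by its summed similarity to the user's items."""
    inverted = item_users(interactions)
    seen = interactions[user]
    scores = {}
    for candidate, cand_users in inverted.items():
        if candidate in seen:
            continue
        scores[candidate] = sum(cosine(cand_users, inverted[s]) for s in seen)
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, score in ranked[:top_n] if score > 0]
```

In this toy data, `recommend("u4", interactions)` returns an empty list because item "d" shares no users with any other item, which is the kind of sparsity that extra item metadata is meant to address.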

About the presenter: Haiyang Zhang is a PhD student at the Telecommunications Research Centre in UL.

20th October 2016

14:00 - 15:00, CSG-027

Vertex-Neighboring Multilevel Force-Directed Graph Drawing

Presenter: Farshad Ghassemi Toosi, Department of CSIS, UL

Abstract: We introduce a new force-directed graph drawing algorithm for large undirected graphs with at least a few hundred vertices. Our algorithm falls into the class of multilevel force-directed graph drawing algorithms. Unlike other multilevel algorithms, it has no pre-processing step and it ignores repulsion forces between pairs of non-adjacent vertices. As a result, our algorithm demonstrably outperforms known multilevel algorithms in terms of running time while keeping the quality of the layout sufficiently good.
This work was recently presented at the 2016 IEEE International Conference on Systems, Man, and Cybernetics (SMC 2016).

About the presenter: Farshad Ghassemi Toosi is a PhD student at the Department of Computer Science and Information Systems in UL.

19th February 2016

12:00 - 13:00, CSG-027

Simple Ways to Solve Hard Problems

Presenter: Prof. Barry O'Sullivan, Director of the Insight Centre for Data Analytics at UCC

Abstract: There are many combinatorial problems in artificial intelligence, computer hardware design, production scheduling, timetabling, product configuration, planning, diagnosis, etc. that can be formulated and solved using boolean satisfiability (SAT) or constraint programming (CP). Over the past 10-15 years significant advances have been made in terms of the scalability of these problem solving approaches to the point where we can now solve instances with millions of variables. In this talk I will give some insights into why such scalability can be achieved. The 'no free lunch' theorem tells us that there is no one best method for solving combinatorial problems. This opens the door for the application of machine learning techniques to improve the use of SAT and CP methods. I will highlight a number of ways in which machine learning has been successfully used in tandem with SAT and CP to give superior performance. While the talk focuses on solving combinatorial problems, the methods I will present can be readily transferred to other complex domains, and it will be interesting to discuss the opportunities as part of this talk.
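
As a rough illustration of what a SAT formulation looks like, here is a minimal DPLL-style solver sketch. It is a textbook procedure written for this page, not one of the solvers or methods discussed in the talk. Clauses are assumed to be lists of non-zero integers, where `3` means variable 3 is true and `-3` means it is false; all function names are hypothetical.

```python
def simplify(clauses, literal):
    """Assign `literal` true: drop satisfied clauses, remove its negation."""
    result = []
    for clause in clauses:
        if literal in clause:
            continue  # clause is already satisfied
        result.append(clause - {-literal})
    return result

def dpll(clauses, assignment=None):
    """Return a satisfying assignment {var: bool}, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    clauses = [frozenset(c) for c in clauses]
    # Unit propagation: a one-literal clause forces that literal's value.
    while True:
        if any(len(c) == 0 for c in clauses):
            return None  # an empty clause cannot be satisfied
        units = [next(iter(c)) for c in clauses if len(c) == 1]
        if not units:
            break
        lit = units[0]
        assignment[abs(lit)] = lit > 0
        clauses = simplify(clauses, lit)
    if not clauses:
        return assignment  # every clause is satisfied
    # Branch on a literal from the first remaining clause, then its negation.
    lit = next(iter(clauses[0]))
    for choice in (lit, -lit):
        extended = {**assignment, abs(choice): choice > 0}
        result = dpll(simplify(clauses, choice), extended)
        if result is not None:
            return result
    return None

def satisfies(clauses, assignment):
    """Check that every clause has at least one literal made true."""
    return all(
        any(assignment.get(abs(l)) == (l > 0) for l in clause)
        for clause in clauses
    )
```

For example, `dpll([[1, 2], [-1, 3], [-2, -3]])` returns a model that `satisfies` accepts, while `dpll([[1], [-1]])` returns `None`. Variables missing from the returned assignment are unconstrained.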

About the presenter: Professor Barry O'Sullivan is the Director of the Insight Centre for Data Analytics at University College Cork. He is a Full Professor (Chair of Constraint Programming) in the Department of Computer Science at University College Cork. Professor O'Sullivan is a Fellow and board member of the European Coordinating Committee for Artificial Intelligence, the world's largest artificial intelligence society, and a Senior Member of AAAI, the Association for the Advancement of Artificial Intelligence. He was the longest serving President of the international Association for Constraint Programming until 2012; he received the association's Distinguished Service Award for that leadership. He also won UCC's Leadership Award in 2013. Professor O’Sullivan has been involved in winning more than €140 million in research funding, of which approximately €25 million has directly funded his research at UCC. Professor O'Sullivan is an editor of Springer's book series on "Artificial Intelligence: Foundations, Theory, and Algorithms", and serves on the editorial board of a number of international artificial intelligence journals.

3rd November 2015

4 - 5 pm, CSG-027

Introduction to Data Science with Python and Sci-Kit Learn

Presenter: Kevin O’Brien, UL Department of Mathematics and Statistics

Abstract: This talk will look at how the infrastructure for data science and machine learning is developing in the Python programming environment. We will look at the key libraries that Python has to offer, in particular the scikit-learn package, which was designed for machine learning. A quick discussion of data visualization libraries will also be included.

About the presenter: Kevin O’Brien is currently completing a PhD in Statistics in UL. He is also a teaching fellow for undergraduate statistics modules in UL.

20th October 2015

4 - 5 pm, CSG-027

A Survey on Cyberbullying Detection in Online Communication

Presenter: Azalden Alakrot, Department of CSIS, UL

Abstract: Cyberbullying, as defined by Patchin and Hinduja (2006), is wilful and repeated harm inflicted through the medium of electronic text. Most works on detecting misbehaviour and cybercrime (incl. cyberbullying) in online communication involve text mining techniques. This talk discusses the major steps in cyberbullying detection as a text-mining task. We present a comparison between various recent studies within the scope and give an illustration with the Linguistic Inquiry and Word Count (LIWC) program.

6th October 2015

4 - 5 pm, CSG-027

Visual Analytics Methods and Resources

Presenters: Nikola S. Nikolov and Farshad Ghassemi Toosi, Department of CSIS, UL

Abstract: This is an informal talk to discuss the current state of the art in the area of visual analytics. We present the general methodology for visual analysis of large datasets and some of our current research results in network visualisation.

3rd June 2015

10 - 11 am, SG-15 (Schumann Building)

The Influence of Soil Carbon and Afforestation on Climate Change Mitigation and the Potential Role of Big Data Analytics

Presenter: Michael A. Clancy, Department of Life Sciences, UL

Abstract*: Soils are our largest terrestrial carbon (C) pool, estimated to hold between 1500–2400 Pg C to a soil depth of 1 m, which is approximately twice the size of the atmospheric C pool and three times the biotic C pool. Hence soils play a very significant role in the global C cycle due to their potential to accumulate C through both soil organic matter and mineral inputs. Afforestation is promoted as a means of sequestering atmospheric carbon dioxide (CO2). Globally, forests contain approximately 45% of terrestrial C stocks and are therefore inextricably tied to the C fluxes driving global climate change. Even if afforestation has only a minimal effect on soil C stocks at the regional or country level, its effect on the global C pool could be significant if large-scale conversion of agricultural land to plantation continues. The latest revision of the Irish Soil Information System (SIS), published in 2014, has gathered together new and existing information and data on Ireland's soils and produced a 1:250,000 scale map with links to their associated properties. This database provides a foundation on which future Irish soil research can be built. Some of the proposed next steps for Irish soil data, specifically soil C stocks, require the consolidation of all other existing soil datasets and the addition of a perpetual soil monitoring network. With these in place, the need for Big Data Analytics solutions will arise to maximise the potential research outputs and improve national and international policy-level decision making.

* This is a collaborative work of M. A. Clancy, R. Creamer, J. Jovani, T. Cummins and K. A. Byrne.

About the presenter: Michael A. Clancy is a PhD student at UL. His research interests include the impact of afforestation on soil carbon stocks as well as investigating the greenhouse gas balance of Short Rotation Forestry via Life Cycle Assessment. He has a BSc in Environmental Science (UL, 2012), plus over twenty years in the IT industry with Wang Labs b.v. and Dell Inc., with experience in Enterprise Architecture, Project Management, and Software Analysis, Design, & Development roles.

20 May 2015

Source Code Analysis for Feature Location

Presenter: Jacek Rosik, Lero

The FLINTS (Feature Location towards INTerfaceS) project tries to leverage Feature Location Techniques to facilitate porting of large legacy heterogeneous systems to more modern architectures and/or languages. Here, features are considered as observable user functionalities in the system and, by locating these features in code, we get an approximation of the code that needs to be extracted to form services. We aim to achieve this feature location by extracting and analysing information from the source code in the form of structural call/data dependencies, to locate code responsible for the implementation of certain features. This is incorporated into a semi-automatic, iterative process where an experienced developer seeds the process with some code associated with the feature, the analysis comes back with additional suggestions based on that seed and the analysis algorithms, and the user then selects the appropriate code to add to the seed. Then this process begins again. This work is carried out with our industrial partner on a legacy system implemented using a combination of proprietary DSLs.
As part of this talk we would like to discuss the envisaged algorithms and challenges relating to analysis of the data extracted from the implementation, in order to identify code relating to features. We also would like to show a demo of our PoC tool that has been implemented in Java and is currently being updated to facilitate the proprietary DSLs used by our commercial partner.

About the presenter: Jacek Rosik received his M.Sc. Eng. in computer science from the Technical University of Lodz, Poland in 2004. In 2014 he completed his PhD with Dr. Jim Buckley at the University of Limerick in the area of Software Architecture focusing on consistency between the design and implementation. His research interests include source code analysis, program comprehension, human computer interaction and empirical evaluation. Recently he joined the FLINTS team at Lero, University of Limerick where he is working on applying feature location techniques to legacy software with the commercial international partner company.

6 May 2015

The ICHEC National Service and New Horizons in Technical Computing

Presenter: Simon Wong, ICHEC

The Irish Centre for High-End Computing (ICHEC) is a national technical computing centre that provides supercomputing resources, support, training and related services. The first part of this talk will focus on ICHEC's National Service that provides access to supercomputing facilities and support for academic researchers in Ireland, including an overview of the available hardware and software infrastructure, means of access, application procedure and training. The second part of this talk will highlight some of ICHEC's activities in the broader field of technical computing. In particular, some of the major challenges facing numerical algorithms used by scientific and engineering applications will be discussed as we progress towards the many-core and exascale era, and how these represent novel opportunities for the Irish research community.

21 April 2015

Big Data Management with SAP Hana

Presenters: Austin Devine and Declan Kearney, SAP Dublin

We will present SAP's in-memory database management system Hana and discuss its advantages for big data management. The topics considered include data compression and database partitioning with Hana as well as an introduction to application development for Hana.

This week BDARG is held jointly with a special lecture delivered to the Database Systems class at CSIS. Please note the change of time and venue for this week only.

Time: 2:00 pm
Venue: CSG-001

8 April 2015

Semantic-Based UCWW Service Recommendation

Presenter: Haiyang Zhang, Telecommunications Research Centre, UL

We present a roadmap to the design of a context-aware service recommendation system utilising semantic knowledge in the Ubiquitous Consumer Wireless World (UCWW). The main objective of the system is to provide users with the 'best' service instances that match their dynamic, contextualised and personalised requirements and expectations, thereby aligning to the always best connected and best served (ABC&S) paradigm. Conventional recommendation systems usually recommend instances of the same type. However, target services to recommend in UCWW typically vary in types. We propose a semantic-based recommendation framework in which services and their related attributes are modelled dynamically as a heterogeneous network, named a UCWW heterogeneous service network (HSN), by collecting and extracting service information. We model user profiles by finding profile kernels which are the minimal set of features describing user preference. Subsequently, a recommendation engine considering both user profiles and current context (user- and network context) will be applied to recommend best service instances to users.

25 March 2015

Basketball Simulation and Management

Presenter: Liam Walsh, Department of CSIS, UL

This project is a Basketball Simulation and Management game that aims to simulate basketball matches to a high degree of accuracy. The simulation makes use of the AI technique of influence maps to define player behaviour. The user can make tactical decisions for their team which will influence the players, the matches and their outcome. This project is based on the statistics and probabilities of events occurring in the sport of basketball and modelling them within the program. Modern sports analytics techniques are used for the design and evaluation of the match system. It is written in C++ and uses the Qt framework for GUI and graphical development.

11 March 2015

Visualization of Text Mining

Presenter: Bartosz Kaminiecki, Department of CSIS, UL

This project is a standalone application in the area of text mining. It generates a database from a text file; the data is processed and mined. The end product is a visualisation in the form of various word/tag clouds. The current version is designed to read Amazon datasets sourced from Stanford University for the purpose of this project. The architectural design, however, allows the application to be expanded in the future to enable analysis of any dataset.
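
The counting step behind a word cloud can be sketched as follows. This is an illustrative assumption about how term weights might be derived, not the project's actual implementation; the stopword list and the `word_frequencies` name are hypothetical.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real application would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "is", "it", "of", "to", "this"}

def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stopword terms with their counts."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)
```

The resulting counts could then be mapped to font sizes, so that more frequent terms appear larger in the cloud.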

Sentiment Analysis of Twitter Data

Presenter: Brian Greene, Department of CSIS, UL

The aim of this project is to research and develop an application to analyse the sentiment of Twitter data. Given a topic, matching tweets are classified as either positive or negative with regard to their overall sentiment. A sample of tweets will be displayed on the UI, highlighting the sentiment of each tweet. The application also lets users improve the classification model by evaluating classified tweets and updating the model, compare results to a 3rd party sentiment analysis model, and view various visualisations of the results achieved by the application.

25 February 2015

Synchronisation-Driven Graph Drawing: An Update

Presenter: Farshad Ghassemi Toosi, Department of CSIS, UL

We introduce an algorithm for force-directed graph drawing in which forces of attraction and repulsion between vertices depend on synchronisation dynamics simulated on the graph. This algorithm has two phases: in the first phase, synchronisation is employed to bring highly interconnected vertices closer to each other, and in the second phase, equal-size forces are applied to all nodes to spread them more evenly throughout the drawing area. The results are aesthetically pleasing drawings with a circular shape; we compare them to drawings of the same graphs produced by the Fruchterman-Reingold force-directed graph drawing algorithm.