Open Universiteit

Title: Comparison of Data Preprocessing Techniques on Software Sources for Topic Modeling
Authors: Willems, John
Keywords: Latent Dirichlet Allocation; Topic Modeling; source code
Issue Date: 23-Apr-2014
Publisher: Open Universiteit Nederland
Abstract: Studies have shown that topic modeling with Latent Dirichlet Allocation (LDA) is a useful (semi-)unsupervised technique for revealing previously unknown information about a software system. Topic modeling operates on unstructured data, but we found no consensus in the literature on how to preprocess software source code to extract such data. In this thesis we aim to find the data preprocessing technique that leads to the best topic distribution for a given software system. To that end, we set up an experiment in which we compare four data preprocessing techniques: two selected from the literature, one that we define ourselves, and one in which we take the software source code as-is. To measure the differences between the four techniques we use structural coupling metrics. We develop software dedicated to our experiment in the domain-specific language Rascal and in Java. Results from performing the experiment on two software systems suggest that there are only minor differences between the four techniques. This implies that the software source code can be used as-is for topic modeling. If future work confirms this preliminary result, it would mean a significant reduction in the effort required to apply topic modeling to software systems.
Appears in Collections: MSc Software Engineering

Files in This Item:
File: INF_20140422_Willems.pdf
Size: 2.75 MB
Format: Adobe PDF

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.