|Title:||Comparison of Data Preprocessing Techniques on Software Sources for Topic Modeling|
|Keywords:||Latent Dirichlet Allocation|
|Publisher:||Open Universiteit Nederland|
|Abstract:||Studies have shown that topic modeling with Latent Dirichlet Allocation (LDA) is a useful (semi-)unsupervised technique for revealing previously unknown information about a software system. Topic modeling operates on unstructured data, yet we found no consensus in the literature on how to preprocess software source code to extract such data. In this thesis we aim to find the data preprocessing technique that leads to the best topic distribution for a given software system. To that end, we set up an experiment in which we compare four data preprocessing techniques: two selected from the literature, one that we define ourselves, and one in which we take the software source code as-is. To measure the differences between the four techniques we use structural coupling metrics. We develop software dedicated to our experiment in the domain-specific language Rascal and in Java. Results for two software systems suggest that there are only minor differences between the four techniques. This implies that we can use the software source code as-is for topic modeling. If future work confirms this preliminary result, it would mean a significant reduction in the effort required to apply topic modeling to software systems.|
|Appears in Collections:||MSc Software Engineering|