FABIO PALOMBA | Associate Professor

[J98] JSS 2026

Understanding Machine Learning Testing in Practice.*

Elsevier's Journal of Systems and Software

A. Cannavale, V. Pontillo, A. De Lucia, F. Palomba. Journal | Software Quality | Empirical Software Engineering

Abstract. Machine Learning is increasingly embedded in critical software systems, making their quality assurance a matter of growing concern. While the research community has proposed several techniques for testing ML-enabled systems, there is limited empirical evidence on whether these techniques are adopted in practice or align with developers' testing workflows. This paper presents a two-step empirical investigation aimed at characterizing the current landscape of ML testing in real-world development. Our goal is to understand how developers approach testing, whether proposed techniques are adopted, and what barriers hinder their implementation. We designed a mixed-method study that triangulates insights from two complementary sources: (1) a mining study of 398 open-source repositories to analyze implemented testing strategies and tool usage; and (2) a survey of 100 practitioners to capture perceptions, motivations, and practical challenges. Our findings reveal that developers rely heavily on foundational strategies like Smoke Testing and Rule-Based Checking, implemented through custom testing logic built on general-purpose libraries (e.g., PyTest, NumPy). Conversely, we identified a critical adoption gap in specialized tools and advanced techniques such as Metamorphic Testing, which are rarely implemented despite their academic prominence. Our survey indicates that this gap is driven by practical barriers, including high integration costs and a poor fit with existing developer workflows. These findings suggest that future research and tooling must prioritize usability, integration, and a clearer alignment with the pragmatic needs of developers.

[J97] ESWA 2026

RobustDRNet: A Clinically-Aligned Hybrid Ensemble Model with Multi-Method Explainability for Lesion-Aware Diabetic Retinopathy Grading.*

Elsevier's Expert Systems with Applications (ESWA)

P. Khokhar, V. Pentangelo, C. Gravino, F. Palomba. Journal | Empirical Software Engineering

Abstract. Diabetic retinopathy (DR) screening requires artificial intelligence (AI) models that are not only highly accurate in grading five clinical stages but are also capable of generating reliable explanations at the level of the lesion to earn the trust of clinicians. We propose RobustDRNet, a hybrid ensemble model that combines local convolutional features from Residual Network-34 (ResNet-34) and Convolutional Neural Network Next-Tiny (ConvNeXt-Tiny) with global transformer embeddings from the Vision Transformer Base/16 (ViT-B/16) via two-stage feature fusion, a disentangled multilayer perceptron (MLP), followed by a stacking logistic regression meta-learner for prediction aggregation. To address severe class imbalance, our training pipeline employs stratified sampling, contrast-limited adaptive histogram equalization (CLAHE) for contrast enhancement, hard data augmentation, and class-weighted focal loss. Evaluated on the Asia Pacific Tele-Ophthalmology Society (APTOS) 2019 dataset, RobustDRNet achieved 88.4% validation accuracy, 0.967 macro-averaged area under the receiver operating characteristic curve (macro-AUC), and a Cohen’s Kappa of 0.823, outperforming individual backbones and simple voting ensembles. In addition to classification performance, we integrated six complementary explainable AI (XAI) techniques: Gradient-weighted Class Activation Mapping++ (Grad-CAM++), Integrated Gradients, attention rollout, SHapley Additive exPlanations (SHAP), Local Interpretable Model-Agnostic Explanations (LIME), and Testing with Concept Activation Vectors (TCAV). Each technique was quantitatively benchmarked against expert-annotated lesion maps from the Indian Diabetic Retinopathy Image Dataset (IDRiD). Saliency maps achieve mean Intersection over Union (IoU) scores of 0.06 for Grad-CAM++ and 0.10 for Integrated Gradients; SHAP perturbations show a deletion drop of 0.25 and insertion gain of 0.22; and TCAV achieves perfect concept alignment (score=1.0) with clinically coherent, grade-wise importance trajectories. By combining cutting-edge grading with multi-perspective and clinically validated interpretability, RobustDRNet delivers a deployable DR screening solution whose decisions are both highly accurate and transparently grounded in lesion-level pathology.

[J96] TOSEM 2026

Advancing LLM-Based Issue Report Classification with Explained Few-Shot Learning, Intent Extraction, Ensemble, and Summarization.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

G. De Vito, L. Starace, F. Palomba, S. Di Martino, F. Ferrucci. Journal | Software Quality | Empirical Software Engineering

Abstract. Effective software maintenance requires automated issue report classification due to increasing report volume and complexity. Although fine-tuned BERT models and Large Language Models (LLMs) have exhibited potential in this field, they face critical limitations in handling lengthy reports and ensuring classification consistency. This paper presents an LLM-based method for processing long reports and explores two classification perspectives, namely user intent understanding and example-based decision-making. On this basis, we propose three LLM-based methods: (1) Intent Extraction and Classification, which identifies and classifies user intent from issue reports; (2) Ensemble Classification, which enhances the intent-based method through majority voting; and (3) Explained Few-Shot Learning, which implements the example-based strategy with transparent rationales. We compare these methods against three baselines: a RoBERTa-based model, SETFIT, and a previous LLM-based method, using GPT-4o, GPT-3.5-turbo, and Qwen 2.5-32B, through an extensive evaluation that comprises consistency analysis, ablation studies, and an analysis of misclassification patterns. The results show that GPT-4 outperforms the state-of-the-art by 5–8% and performs well across all methods. Furthermore, the results show that Qwen-2.5 performs better than the larger GPT-3.5-turbo, suggesting that multiple factors beyond parameter count (including architectural design, training data composition, and optimization strategies) influence classification performance. The analysis yields a taxonomy of classification challenges and reveals important findings about the pros and cons of each approach. We also introduce innovative ensemble techniques based on LLM perplexity, as well as adaptive strategies capable of selecting the most effective LLM and proposed classification method under specific privacy constraints.

[J95] TOSEM 2026

What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source Projects.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

M. Robredo, M. Esposito, F. Palomba, R. Peñaloza, V. Lenarduzzi. Journal | Software Quality | Empirical Software Engineering

Abstract. Code refactoring improves software quality without changing external behavior. Despite its advantages, its benefits are hindered by the considerable cost of time, resources, and continuous effort it demands. Understanding why developers refactor, and which metrics capture these motivations, may support wider and more effective use of refactoring in practice. We performed a large-scale empirical study to analyze developers' refactoring activity, leveraging Large Language Models (LLMs) to identify underlying motivations from version control data, comparing our findings with previous motivations reported in the literature. LLMs matched human judgment in 80% of cases, but aligned with literature-based motivations in only 47%. They enriched 22% of motivations with more detailed rationale, often highlighting readability, clarity, and structural improvements. Most motivations were pragmatic, focused on simplification and maintainability. While metrics related to developer experience and code readability ranked highest, their correlation with motivation categories was weak. We conclude that LLMs effectively capture surface-level motivations but struggle with architectural reasoning. Their value lies in providing localized explanations, which, when combined with software metrics, can form hybrid approaches. Such integration offers a promising path toward prioritizing refactoring more systematically and balancing short-term improvements with long-term architectural goals.

[J94] VR 2026

From Zero to Hero: A Scoping Review of the Emergence of the Metaverse in the Virtual Environments History.*

Springer's Virtual Reality (VR)

V. Pentangelo, C. Gravino, F. Palomba. Journal | Systematic Literature Review

Abstract. The metaverse has transitioned from a science fiction term to a rapidly growing area of research and application, with potential uses in education, professional training, social events, and the virtual economy. However, despite this progress, a fully realized and functional metaverse is not yet available, and its development still requires a clear understanding and definition of the research directions to follow. Nonetheless, the metaverse topic does not start from scratch; it shares its foundations with Virtual Environments (VEs), which represent the Virtual Reality applications' core. In this paper, we built on top of the knowledge given by the history of the two topics to present a scoping review of the historical development of research in VEs and the metaverse from the 1990s to early 2024, analyzing 352 papers from the Scopus database. We aimed to offer a comprehensive understanding of how past research informs the present and future directions of the metaverse. Our findings revealed that the metaverse, while emerging as a distinct research area in recent years, is deeply rooted in the history of VEs, with many of its concepts and technologies deriving from earlier work in the field. We also identified new, underexplored trends within the metaverse research, proposing a future research agenda informed by the shared history of the two topics.

[J93] IST 2026

Pythonic vs Refactorable Pythonic: On the Relationship between Pythonic Idioms and Code Quality in Machine Learning Projects.*

Elsevier's Information and Software Technology (IST)

G. Festa, G. Giordano, V. Pontillo, M. Di Penta, D. Tamburri, F. Palomba. Journal | Software Quality | Empirical Software Engineering

Abstract. Python is increasingly becoming the lingua franca for developing Machine Learning (ML) systems, thanks to a rich ecosystem of libraries and an emphasis on readability. In this context, Pythonic idioms are seen as stylistic conventions that support maintainable and efficient code. Conversely, Refactorable-Pythonic idioms refer to patterns that can be refactored into more idiomatic Python, improving code quality in terms of maintainability, performance, and clarity. While the assumptions about idiomaticity are widely accepted in practice, the extent to which Pythonic or Refactorable-Pythonic idioms relate to software quality in ML projects has not been systematically validated. To address this lack of empirical evidence, this paper conducts a large-scale study to assess how Pythonic and Refactorable-Pythonic idioms are related to code quality in ML systems. We analyze 303 open-source Python projects from the NICHE dataset, distinguishing between "well-engineered" (i.e., projects that adopt structured development practices such as testing, CI, documentation, and packaging) and "non-engineered" (i.e., projects that lack such characteristics). Our analysis proceeds in two main phases: (i) idiom detection, where we extract Pythonic and Refactorable-Pythonic code patterns using a combination of existing and custom detectors; and (ii) quality assessment, where we detect Python-specific smells and relate them to code metrics and other quality indicators. Truth Value Test and Assign Multiple Targets are the most common Pythonic and Refactorable-Pythonic idioms, respectively. In "well-engineered" projects, both idiom types positively correlate with Python-specific code smells, suggesting that idiomatic usage does not always align with higher code quality. In contrast, in "non-engineered" projects, the presence of smells is more strongly influenced by structural factors such as the number of lines of code, complexity, and commit activity. We conclude by distilling lessons learned, implications, and future research directions.

[J92] EMSE 2025

Analyzing the Ripple Effects of Refactoring.*

Springer's Journal of Empirical Software Engineering (EMSE)

M. Robredo Manero, M. Esposito, F. Palomba, R. Peñaloza, V. Lenarduzzi. Journal | Software Quality | Empirical Software Engineering

Abstract. Short development cycles and continuous delivery pressure often push developers toward expedients that lead to poor design and hard-to-maintain systems. A common remedy is code refactoring, which reduces complexity and improves maintainability, though it is often seen as costly and risky. We investigate the long-term effects of refactoring to provide recommendations that support strategic development decisions. We assess refactoring impact through change- and defect-proneness analysis, as well as benefit/effort evaluation. Most refactorings have short-lived effects, persisting for fewer than 10 changes. Structural refactorings may last over 190 changes, with significant differences across families. Medium-lived refactorings (9–19 changes) prove the most stable and efficient, while longer-lasting ones become increasingly defect-prone and costly. Refactorings differ in sustainability. Medium-duration refactorings strike the best balance between stability and maintenance cost, while structural ones, though impactful, pose higher long-term risks. These insights guide prioritization of refactoring types to maximize benefit and minimize technical debt.

[J91] IST 2025

Fair and Square? Evaluating Fairness of LLM-Generated Synthetic Datasets.*

Elsevier's Information and Software Technology (IST)

G. Voria, B. Scala, L. Todisco, C. Venditto, G. Giordano, G. Catolino, F. Palomba. Journal | Empirical Software Engineering | Socio-Technical Analytics

Abstract. Machine Learning (ML) is driving advancements across various industries, including healthcare, finance, and entertainment, but it also raises significant ethical concerns, particularly regarding fairness. Biases in training data can lead to unfair outcomes, perpetuating or even amplifying existing disparities. Prior research in the Software Engineering (SE) and ML communities has developed numerous bias mitigation techniques, yet two key limitations persist: (1) most approaches intervene at later stages of development, such as after data collection or model training, rather than addressing fairness from the outset; and (2) these methods often mitigate bias without fully eliminating it, since the root issue frequently lies in the data itself. In this paper, we explore an alternative approach to mitigate unfairness: synthetic data generation, which involves creating artificial datasets that mimic the statistical properties of real-world data. We aim to assess how this approach can contribute to generating data that positively impacts the trade-off between performance and fairness by creating datasets that reduce the influence of real-world biases through synthetic feature generation. To this end, we conducted an empirical study comparing ML models trained on synthetic datasets generated by large language models to ML models trained on real-world data, evaluating performance and fairness indicators. Our results demonstrate that models trained with synthetic data, particularly those generated using simpler prompts, can achieve competitive performance while enhancing fairness. Our work suggests that synthetic data generation may be a viable approach to addressing fairness requirements in ML systems.

[J90] TOSEM 2025

Sustainability of Machine Learning-Enabled Systems: The Machine Learning Practitioner's Perspective.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

V. De Martino, S. Lambiase, F. Pecorelli, W.J. van den Heuvel, F. Ferrucci, F. Palomba. Journal | Empirical Software Engineering | Socio-Technical Analytics

Abstract. Software sustainability is a key multifaceted non-functional requirement that encompasses environmental, social, and economic concerns, yet its integration into the development of Machine Learning (ML)-enabled systems remains an open challenge. While previous research has explored high-level sustainability principles and policy recommendations, limited empirical evidence exists on how sustainability is practically managed in ML workflows. Existing studies predominantly focus on environmental sustainability, e.g., carbon footprint reduction, while missing the broader spectrum of sustainability dimensions and the challenges practitioners face in real-world settings. To address this gap, we conduct an empirical study to characterize sustainability in ML-enabled systems from a practitioner’s perspective. We investigate (1) how ML engineers perceive and describe sustainability, (2) the software engineering practices they adopt to support it, and (3) the key challenges hindering its adoption. We first perform a qualitative analysis based on interviews with eight experienced ML engineers, followed by a large-scale quantitative survey with 203 ML practitioners. Our key findings reveal a significant disconnection between sustainability awareness and its systematic implementation, highlighting the need for more structured guidelines, measurement frameworks, and regulatory support.

[J89] JSEP 2025

A Novel, Tool-Supported Catalog of Community Smell Symptoms.*

Wiley's Journal of Software: Evolution and Process (JSEP)

A. Della Porta, S. Lambiase, G. Catolino, F. Ferrucci, F. Palomba. Journal | Empirical Software Engineering

Abstract. Software development is a multifaceted endeavor, requiring a profound grasp of both social dynamics and technical intricacies. Poor collaboration often leads to the accumulation of social debt, manifesting as unforeseen project costs due to sub-optimal team interactions. Community smells have emerged as indicators of these socio-technical inefficiencies and potential social debt. While previous research has focused on automated detection of community smells through analyzing developer communication patterns, our study offers a complementary approach. We emphasize the critical role of project managers in assessing socio-technical dynamics and propose a novel, tool-supported catalog of symptoms. This catalog can be used for manual inspections to identify early signs of community smells at the individual level, allowing managers to address issues before they escalate. Using a mixed-method design that leveraged an existing literature review and a user survey, we cataloged symptoms related to four community smell types. Additionally, we developed TOAST, a tool that operationalizes this catalog, and assessed its usability and practical usefulness through an experiment involving project managers. The study showed that even participants unfamiliar with the term "community smells" were able to interpret the tool’s output, reflect on team dynamics, and recognize problematic behavioral patterns when supported by structured symptom-based information. The paper concludes by shedding light on the potential impact of our work and its contribution to advancing the detection and analysis of community smells.

[J88] IST 2025

Fairness Set and Forgotten: Mining Fairness Toolkit Usage in Open-Source Machine Learning Projects.*

Elsevier's Information and Software Technology (IST)

The development of machine learning (ML) systems in high-stakes domains has amplified concerns about fairness, prompting the creation of fairness toolkits offering metrics and mitigation techniques. Open-source software (OSS) ecosystems, a critical driver of AI innovation, present a unique opportunity to study the practical adoption of these toolkits. This paper aims to empirically characterize the adoption of fairness toolkits in OSS ML projects by investigating for what purposes they are used and how their usage evolves over time. We conducted a mining study on GitHub repositories related to real-world ML projects that integrate fairness toolkits such as AIF360 and Fairlearn. Starting from 1,096 candidate repositories, we applied systematic filtering to identify a final dataset of 20 relevant ML projects (comprising 5,777 total commits). We analyzed toolkit usage by examining invoked APIs and commit histories to uncover patterns of adoption and evolution. Our findings reveal that fairness toolkits are predominantly used for diagnostic purposes, with analytic components integrated early in the project lifecycle and rarely modified thereafter. In contrast, mitigation techniques are infrequently adopted, tend to appear later, and exhibit short, unstable lifespans. Download PDF

Journal Empirical Software Engineering A. Cannavale, G. Voria, A. Scognamiglio, G. Giordano, G. Catolino, F. Palomba.


Abstract. The development of machine learning (ML) systems in high-stakes domains has amplified concerns about fairness, prompting the creation of fairness toolkits offering metrics and mitigation techniques. Open-source software (OSS) ecosystems, a critical driver of AI innovation, present a unique opportunity to study the practical adoption of these toolkits. This paper aims to empirically characterize the adoption of fairness toolkits in OSS ML projects by investigating for what purposes they are used and how their usage evolves over time. We conducted a mining study on GitHub repositories related to real-world ML projects that integrate fairness toolkits such as AIF360 and Fairlearn. Starting from 1,096 candidate repositories, we applied systematic filtering to identify a final dataset of 20 relevant ML projects (comprising 5,777 total commits). We analyzed toolkit usage by examining invoked APIs and commit histories to uncover patterns of adoption and evolution. Our findings reveal that fairness toolkits are predominantly used for diagnostic purposes, with analytic components integrated early in the project lifecycle and rarely modified thereafter. In contrast, mitigation techniques are infrequently adopted, tend to appear later, and exhibit short, unstable lifespans. Our results show that the adoption of fairness toolkits in OSS ML projects is limited and often restricted to initial diagnostic phases, with active mitigation practices remaining rare. These findings highlight the need for improved support to foster more sustained and effective integration of fairness practices within open-source development.

Download PDF

[J87] TOSEM 2025

Back to the Roots: Assessing Mining Techniques for Java Vulnerability-Contributing Commits.*

ACM Transactions on Software Engineering and Methodology (TOSEM)


Journal Empirical Software Engineering T. Hinrichs, E. Iannone, T. Aladics, P. Hegedus, A. De Lucia, F. Palomba, R. Scandariato.


Abstract. Vulnerability-contributing commits (VCCs) are code changes that introduce vulnerabilities. Mining historical VCCs relies on SZZ-based algorithms that trace from known vulnerability-fixing commits. Although these techniques have been used, e.g., to train just-in-time vulnerability predictors, they lack systematic benchmarking to evaluate their precision, recall, and error sources. We empirically assessed 12 VCC mining techniques in Java repositories using two benchmark datasets (one from the literature and one newly curated). We also explored combinations of techniques, through intersections, voting schemes, and machine learning, to improve performance. Individual techniques achieved at most 0.60 precision but up to 0.89 recall. The precision rose to 0.75 when the outputs were combined with the logical AND, at the expense of recall. Machine learning ensembles reached 0.80 precision with a better precision–recall balance. Performance varied significantly by dataset. Analyzing "fixing commits" showed that certain fix types (e.g., filtering or sanitization) affect retrieval accuracy, and failure patterns highlighted weaknesses when fixes involve external data handling. Such results help software security researchers select the most suitable mining technique for their studies and understand new ways to design more accurate solutions.

Download PDF

[J86] CSUR 2025

Another Brick in the Wall: A Systematic Mapping Study Toward Defining Metaverse Engineering Through Socio-Technical Issues.*

ACM Computing Surveys (CSUR)


Journal Systematic Literature Review D. Di Dario, F. Palomba, C. Gravino.


Abstract. Nowadays, virtual worlds are evolving into immersive metaverses, i.e., collective, shared virtual spaces that arise from the convergence of virtual reality (VR), augmented reality (AR), the internet, and additional digital technologies. These environments allow users to interact through avatars. While research has explored the technologies and social dynamics of the metaverse, two key limitations remain. First, there is no systematic overview of the expertise required to build metaverses and the key publication venues. Second, the socio-technical issues have not been fully synthesized or made instrumental for developing holistic design approaches that address both social and technical constraints. A deeper investigation is needed to guide future research, highlight challenges, and foster collaboration across disciplines. In this paper, we propose a systematic mapping study that addresses the limitations identified. From an initial pool of 2,323 sources, we identify 63 primary resources to (1) characterize the research community around the metaverse and (2) elicit, synthesize, and categorize 19 social and 18 technical issues affecting the development of an effective metaverse. Based on our results, we contextualize the catalog of issues with respect to the current body of knowledge, providing insights and a research roadmap that transforms issues into actionable challenges in the scope of a novel, unified research asset coined "metaverse engineering", i.e., the multidisciplinary field that identifies processes and instruments for designing socio-technical metaverses.

Download PDF

[J85] IST 2025

Fairness on a Budget, Across the Board: A Cost-Effective Evaluation of Fairness-Aware Practices Across Contexts, Tasks, and Sensitive Attributes.*

Elsevier's Information and Software Technology (IST)


Journal Empirical Software Engineering A. Parziale, G. Voria, G. Giordano, G. Catolino, G. Robles, F. Palomba.


Abstract. Machine Learning (ML) is widely used in critical domains like finance, healthcare, and criminal justice, where unfair predictions can lead to harmful outcomes. Although bias mitigation techniques have been developed by the Software Engineering (SE) community, their practical adoption is limited due to complexity and integration issues. As a simpler alternative, fairness-aware practices, namely conventional ML engineering techniques adapted to promote fairness, e.g., MinMax Scaling, which normalizes feature values to prevent attributes linked to sensitive groups from disproportionately influencing predictions, have recently been proposed, yet their actual impact is still unexplored. Building on our prior work that explored fairness-aware practices in different contexts, this paper extends the investigation through a large-scale empirical study assessing their effectiveness across diverse ML tasks, sensitive attributes, and datasets belonging to specific application domains. We conduct 5,940 experiments, evaluating fairness-aware practices from two perspectives: contextual bias mitigation and cost-effectiveness. Contextual evaluation examines fairness improvements across different ML models, sensitive attributes, and datasets. Cost-effectiveness analysis considers the trade-off between fairness gains and performance costs. Findings reveal that the effectiveness of fairness-aware practices depends on specific contexts’ datasets and configurations, while cost-effectiveness analysis highlights those that best balance ethical gains and efficiency. These insights guide practitioners in choosing fairness-enhancing practices with minimal performance impact, supporting ethical ML development.

Download PDF

[J84] SoftwareX 2025

SENEM-AI: Leveraging LLMs for Student Behavior Simulation in Virtual Learning Environments.*

Elsevier's SoftwareX


Journal Empirical Software Engineering V. Pentangelo, L. Turco, S. Lambiase, C. Gravino, F. Palomba.


Abstract. SENEM-AI is a 3D virtual environment-based tool designed to enhance teaching and presentation skills by leveraging immersive simulations. Built upon the SENEM platform, it integrates virtual students powered by LLaMA with distinct personalities that simulate realistic classroom interactions. Educators can refine their communication strategies by reacting to dynamically generated questions. Preliminary evaluations highlighted its usability and potential impact, with participants valuing the immersive experience and engagement. SENEM-AI represents a novel approach to supporting educators through accessible technology, paving the way for further research into AI-driven teaching aids and training environments in virtual settings.

Download PDF

[J83] TSE 2025

RECOVER: Toward Requirements Generation from Stakeholders’ Conversations.*

IEEE Transactions on Software Engineering (TSE)


Journal Empirical Software Engineering G. Voria, F. Casillo, C. Gravino, G. Catolino, F. Palomba.


Abstract. Stakeholders' conversations in requirements elicitation meetings hold valuable insights into system and client needs. However, manually extracting requirements is time-consuming, labor-intensive, and prone to errors and biases. While current state-of-the-art methods assist in summarizing stakeholder conversations and classifying requirements based on their nature, there is a noticeable lack of approaches capable of both identifying requirements within these conversations and generating corresponding system requirements. These approaches would assist requirement identification, reducing engineers' workload, time, and effort. They would also enhance accuracy and consistency in documentation, providing a reliable foundation for further analysis. To address this gap, this paper introduces RECOVER (Requirements EliCitation frOm conVERsations), a novel conversational requirements engineering approach that leverages natural language processing and large language models (LLMs) to support practitioners in automatically extracting system requirements from stakeholder interactions by analyzing individual conversation turns. The approach is evaluated using a mixed-method research design that combines statistical performance analysis with a user study involving requirements engineers, targeting two levels of granularity. First, at the conversation turn level, the evaluation measures RECOVER's accuracy in identifying requirements-relevant dialogue and the quality of generated requirements in terms of correctness, completeness, and actionability. Second, at the entire conversation level, the evaluation assesses the overall usefulness and effectiveness of RECOVER in synthesizing comprehensive system requirements from full stakeholder discussions. Empirical evaluation of RECOVER shows promising performance, with generated requirements demonstrating satisfactory correctness, completeness, and actionability. The results also highlight the potential of automating requirements elicitation from conversations as an aid that enhances efficiency while maintaining human oversight.

Download PDF

[J82] EMSE 2025

When Code Smells Meet ML: On the Lifecycle of ML-specific Code Smells in ML-enabled Systems.*

Springer's Journal of Empirical Software Engineering (EMSE)


Journal Software Quality Empirical Software Engineering G. Recupito, G. Giordano, F. Ferrucci, D. Di Nucci, F. Palomba.


Abstract. The adoption of Machine Learning (ML)-enabled systems is growing rapidly, introducing novel challenges in maintaining quality and managing technical debt in these complex systems. Among the key quality threats are ML-specific code smells (ML-CSs), suboptimal implementation practices in ML pipelines that can compromise system performance, reliability, and maintainability. Although these smells have been defined in the literature, detailed insights into their characteristics, evolution, and mitigation strategies are still needed to help developers address these quality issues effectively. In this paper, we investigate the emergence and evolution of ML-CSs through a large-scale empirical study focusing on (i) their prevalence in real ML-enabled systems, (ii) how they are introduced and removed, and (iii) their survivability. We analyze over 400,000 commits from 337 ML-enabled projects, leveraging CodeSmile, a novel ML smell detector that we developed to enable our investigation and identify ML-specific code smells. Our results reveal that: (1) CodeSmile can detect ML-CSs with precision and recall rates of 87.4% and 78.6%, respectively; (2) ML-CSs are frequently introduced during file modifications in new feature tasks; (3) smells are typically removed during tasks related to new features, enhancements, or refactoring; and (4) the majority of ML-CSs are resolved within the first 10% of commits. Based on these findings, we provide actionable conclusions and insights to guide future research and quality assurance practices for ML-enabled systems.

Download PDF

[J81] JSS 2025

Into the ML-Universe: An Improved Classification and Characterization of Machine-Learning Projects.*

Elsevier's Journal of Systems and Software (JSS)


Journal Empirical Software Engineering V. De Martino, G. Recupito, G. Giordano, F. Ferrucci, D. Di Nucci, F. Palomba.


Abstract. The prominence of Machine Learning (ML) systems led to the rise of Software Engineering for Artificial Intelligence (SE4AI), which addresses the unique engineering challenges of these systems. Researchers in SE4AI engage with three primary types of ML projects: those that apply ML techniques, those that develop new ML methodologies, and those that provide support tools and libraries. Current classification schemas distinguish ML projects based on their purpose and engineering quality, yet they miss a fine-grained classification of their nature and purpose. In this paper, we propose a novel, tool-supported automated classification schema for ML projects, coined Machine learning Automated Rule-based Classification Kit (MARK), that builds on top of the work by Gonzalez et al. to refine the classification of applied ML projects into 'ML-Model Consumers', 'ML-Model Producers', and 'ML-Model Producers & Consumers'. We evaluated MARK through two empirical studies. The first assessed its classification accuracy across 4,603 ML projects from two datasets. The second analyzed repository metrics, such as community engagement, activity, and structure, to demonstrate MARK’s potential in identifying trends and characteristics unique to each project type. Our findings indicate high F1-scores for our classifier, particularly for 'ML-Model Producer' projects, though challenges remain for 'ML-Model Consumer' classification. Significant differences in repository metrics among the classified projects highlight the usefulness of MARK, offering insights for researchers studying the socio-technical dynamics of ML projects.

Download PDF

[J80] AIIM 2025

Advances in Artificial Intelligence for Diabetes Prediction: Insights from a Systematic Literature Review.*

Elsevier's Artificial Intelligence in Medicine (AIIM)


Journal Systematic Literature Review P. Khokhar, C. Gravino, F. Palomba.


Abstract. Diabetes mellitus (DM), a prevalent metabolic disorder, has significant global health implications. The advent of machine learning (ML) has revolutionized the ability to predict and manage diabetes early, offering new avenues to mitigate its impact. This systematic review examined 53 articles on ML applications for diabetes prediction, focusing on datasets, algorithms, training methods, and evaluation metrics. Various datasets, such as the Singapore National Diabetic Retinopathy Screening Program, REPLACE-BG, National Health and Nutrition Examination Survey (NHANES), and Pima Indians Diabetes Database (PIDD), have been explored, highlighting their unique features and challenges, such as class imbalance. This review assesses the performance of various ML algorithms, such as Convolutional Neural Networks (CNN), Support Vector Machines (SVM), Logistic Regression, and XGBoost, for the prediction of diabetes outcomes from multiple datasets. In addition, it explores explainable AI (XAI) methods such as Grad-CAM, SHAP, and LIME, which improve the transparency and clinical interpretability of AI models in assessing diabetes risk and detecting diabetic retinopathy. Techniques such as cross-validation, data augmentation, and feature selection are discussed in terms of their influence on the versatility and robustness of the model. Some evaluation techniques involving k-fold cross-validation, external validation, and performance indicators such as accuracy, Area Under Curve, sensitivity, and specificity are presented. The findings highlight the usefulness of ML in addressing the challenges of diabetes prediction, the value of sourcing different data types, the need to make models explainable, and the need to keep models clinically relevant. This study highlights significant implications for healthcare professionals, policymakers, technology developers, patients, and researchers, advocating interdisciplinary collaboration and ethical considerations when implementing ML-based diabetes prediction models. By consolidating existing knowledge, this SLR outlines future research directions aimed at improving diagnostic accuracy, patient care, and healthcare efficiency through advanced ML applications. This comprehensive review contributes to the ongoing efforts to utilize artificial intelligence technology for a better prediction of diabetes, ultimately aiming to reduce the global burden of this widespread disease.

Download PDF

[J79] JSS 2025

Examining the Impact of Bias Mitigation Algorithms on the Sustainability of ML-enabled Systems: A Benchmark Study.*

Elsevier's Journal of Systems and Software (JSS)


Journal Empirical Software Engineering V. De Martino, G. Voria, C. Troiano, G. Catolino, F. Palomba.


Abstract. As machine learning (ML) systems become increasingly prevalent across various industries, concerns regarding fairness have intensified. Bias mitigation algorithms, which aim to reduce bias in ML models, serve as solutions to mitigate this issue. However, these techniques can affect more than just social sustainability. They may alter the computational overhead and energy usage of ML systems, affecting their environmental sustainability. Similarly, they can influence businesses' economic sustainability by shaping resource allocation and consumer trust. This work aims to provide a benchmark study of the implications of applying bias mitigation algorithms on the sustainability of ML solutions. We first corroborate previous findings by examining their effect on social sustainability metrics. Additionally, we complement existing studies by offering a comprehensive analysis of how bias mitigation affects environmental and economic sustainability, aiming to highlight trade-offs for practitioners designing ML solutions. We evaluate six bias mitigation algorithms by conducting 3,360 experiments across multiple configurations of four ML algorithms and datasets. From these experiments, we compute metrics for social, environmental, and economic sustainability, evaluating them using statistical analysis. Our quantitative findings show that all bias mitigation algorithms affect the three sustainability dimensions differently, indicating that applying these algorithms involves complex trade-offs. Furthermore, we expand our discussion with qualitative insights that arise from our results, also providing implications for both research and practice. Our study emphasizes the need for a deeper investigation into the trade-offs bias mitigation algorithms introduce and how they impact various non-functional requirements of ML systems.

Download PDF

[J78] FGCS 2025

The Role of Large Language Models in Addressing IoT Challenges: A Systematic Literature Review.*

Future Generation Computer Systems (FGCS)


Journal Systematic Literature Review G. De Vito, F. Palomba, F. Ferrucci.

Abstract. The Internet of Things (IoT) has revolutionized various sectors by enabling devices to communicate and interact seamlessly. However, developing IoT applications poses data management, security, and interoperability challenges. Large Language Models (LLMs) have shown promise in addressing these challenges due to their advanced language processing capabilities. This Systematic Literature Review assesses the role of LLMs in addressing IoT challenges, exploring the strategies, hardware, and software configurations used, and identifying directions for future research. We extensively searched databases like Scopus, IEEE Xplore, and ACM Digital Library, initially screening 1,419 studies and identifying an additional 1,167 through snowballing, ultimately focusing on 55 relevant papers. The findings reveal LLMs' potential to address key IoT challenges such as security and scalability. However, they also highlight significant obstacles, including high computational demands and the complexities of training and tuning these models. Future research should aim to develop methods to reduce the computational requirements of LLMs, improve training datasets, simplify implementation processes, and explore the ethical and privacy implications of using LLMs in IoT applications.

[J77] TSE 2025

LLM-Based Automation of COSMIC Functional Size Measurement from Use Cases.*

IEEE Transactions on Software Engineering (TSE)

Journal Empirical Software Engineering G. De Vito, S. Di Martino, F. Ferrucci, C. Gravino, F. Palomba.

Abstract. COmmon Software Measurement International Consortium (COSMIC) Functional Size Measurement is a method widely used in the software industry to quantify user functionality and measure software size, which is crucial for estimating development effort, cost, and resource allocation. COSMIC measurement is a manual task that requires qualified professionals and effort. To support professionals in COSMIC measurement, we propose an automatic approach, CosMet, that leverages Large Language Models to measure software size starting from use cases specified in natural language. To evaluate the proposed approach, we developed a web tool that implements CosMet using GPT-4 and conducted two studies to assess the approach quantitatively and qualitatively. Initially, we experimented with CosMet on seven software systems, encompassing 123 use cases, and compared the generated results with the ground truth created by two certified professionals. Then, seven professional measurers evaluated the analysis achieved by CosMet and the extent to which the approach reduces the measurement time. The first study's results revealed that CosMet is highly effective in analyzing and measuring use cases. The second study highlighted that CosMet offers a transparent and interpretable analysis, allowing practitioners to understand how the measurement is derived and make necessary adjustments. Additionally, it reduces the manual measurement time by 60-80%.

[J76] TOSEM 2025

Investigating the Role of Cultural Values in Adopting Large Language Models for Software Engineering.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

Journal Empirical Software Engineering S. Lambiase, G. Catolino, F. Palomba, F. Ferrucci, D. Russo.

Abstract. As a socio-technical activity, software development involves the close interconnection of people and technology. The integration of Large Language Models (LLMs) into this process exemplifies the socio-technical nature of software development. Although LLMs influence the development process, software development remains fundamentally human-centric, necessitating an investigation of the human factors in this adoption. Thus, with this study we explore the factors influencing the adoption of LLMs in software development, focusing on the role of professionals' cultural values. Guided by the Unified Theory of Acceptance and Use of Technology (UTAUT2) and Hofstede’s cultural dimensions, we hypothesized that cultural values moderate the relationships within the UTAUT2 framework. Using Partial Least Squares-Structural Equation Modelling and data from 188 software engineers, we found that habit and performance expectancy are the primary drivers of LLM adoption, while cultural values do not significantly moderate this process. These findings suggest that, by highlighting how LLMs can boost performance and efficiency, organizations can encourage their use, no matter the cultural differences. Practical steps include offering training programs to demonstrate LLM benefits, creating a supportive environment for regular use, and continuously tracking and sharing performance improvements from using LLMs.

[J75] IST 2025

Fairness-Aware Practices from Developers' Perspective: A Survey.*

Elsevier's Information and Software Technology (IST)

Journal Empirical Software Engineering G. Voria, G. Sellitto, C. Ferrara, F. Abate, A. De Lucia, F. Ferrucci, G. Catolino, F. Palomba.

Abstract. Machine Learning (ML) technologies have shown great promise in many areas, but when used without proper oversight, they can produce biased results that discriminate against historically underrepresented groups. In recent years, the software engineering research community has contributed to addressing the need for ethical machine learning by proposing a number of fairness-aware practices, e.g., fair data balancing or testing approaches, that may support the management of fairness requirements throughout the software lifecycle. Nonetheless, the actual validity of these practices, in terms of practical application, impact, and effort, from the developers' perspective has not been investigated yet. This paper addresses this limitation, assessing the developers’ perspective of a set of 28 fairness practices collected from the literature. We perform a survey study involving 155 practitioners who have been working on the development and maintenance of ML-enabled systems, analyzing the answers via statistical and clustering analysis to group fairness-aware practices based on their application frequency, impact on bias mitigation, and effort required for their application. While all the practices are deemed relevant by developers, those applied at the early stages of development appear to be the most impactful. More importantly, the effort required to implement the practices is average and sometimes high, with a subsequent average application. The findings highlight the need for effort-aware automated approaches that ease the application of the available practices, as well as recommendation systems that may suggest when and how to apply fairness-aware practices throughout the software lifecycle.

[J74] JSS 2025

An Empirical Investigation into the Capabilities of Anomaly Detection Approaches for Test Smell Detection.*

Elsevier's Journal of Systems and Software (JSS)

Journal Software Testing Empirical Software Engineering V. Pontillo, L. Martins, I. Machado, F. Palomba, F. Ferrucci.

Abstract. Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous research has demonstrated their harmfulness for test code maintainability and effectiveness, showing their impact on test code quality. As such, the quality of test cases affected by test smells is likely to deviate significantly from the quality of test cases not affected by any smell and might be classified as anomalies. In this paper, we challenge this observation by experimenting with three anomaly detection approaches based on machine learning, cluster analysis, and statistics to understand their effectiveness for the detection of four test smells, i.e., Eager Test, Mystery Guest, Resource Optimism, and Test Redundancy on 66 open-source Java projects. In addition, we compare our results with state-of-the-art heuristic-based and machine learning-based baselines. Our ultimate goal is not to prove that anomaly detection methods are better than existing approaches, but to objectively assess their effectiveness in this domain. The key findings of the study show that the F-Measure of anomaly detectors never exceeds 47%, obtained in the Eager Test detection using the statistical approach, while the Recall is generally higher for the statistical and clustering approaches. Nevertheless, the anomaly detection approaches have a higher Recall than the heuristic and machine learning-based techniques for all test smells. The low F-Measure values we observed for anomaly detectors provide valuable insights into the current limitations of anomaly detection in this context. We conclude our study by elaborating on and discussing the reasons behind these negative results through qualitative investigations. Our analysis shows that the detection of test smells could depend on the approach exploited, suggesting the feasibility of developing a meta-approach.

[J73] IST 2025

Classification and Challenges of Non-Functional Requirements in ML-Enabled Systems: A Systematic Literature Review.*

Elsevier's Information and Software Technology (IST)

Journal Empirical Software Engineering Systematic Literature Review V. De Martino, F. Palomba.

Abstract. Machine learning (ML) is nowadays so pervasive and diffused that virtually no application can avoid its use. Nonetheless, its enormous potential is often tempered by the need to manage non-functional requirements (NFRs) and navigate pressing, contrasting trade-offs. In this respect, we notice a lack of systematic synthesis of challenges explicitly tied to achieving and managing NFRs in ML-enabled systems. Such a synthesis may not only provide a comprehensive summary of the state of the art but also drive further research on the analysis, management, and optimization of NFRs of ML-enabled systems. In this paper, we propose a systematic literature review targeting two key aspects: (1) the classification of the NFRs investigated so far, and (2) the challenges associated with achieving and managing NFRs in ML-enabled systems during model development. Through the combination of well-established guidelines for conducting systematic literature reviews and additional search criteria, we survey a total of 130 research articles. Our findings report that current research identified 31 different NFRs, which can be grouped into six main classes. We also compiled a catalog of 26 software engineering challenges, emphasizing the need for further research to systematically address, prioritize, and balance NFRs in ML-enabled systems. We conclude our work by distilling implications and a future outlook on the topic.

[J72] TOSEM 2025

Uncovering Community Smells in Machine Learning-Enabled Systems: Causes, Effects, and Mitigation Strategies.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

Journal Socio-Technical Analytics Empirical Software Engineering G. Annunziata, S. Lambiase, D. Tamburri, W.J. Van den Heuvel, F. Palomba, G. Catolino, F. Ferrucci, A. De Lucia.

Abstract. Successful software development hinges on effective communication and collaboration, which are significantly influenced by human and social dynamics. Poor management of these elements can lead to the emergence of 'community smells', i.e., negative patterns in socio-technical interactions that gradually accumulate as 'social debt'. This issue is particularly pertinent in machine learning-enabled systems, where diverse actors such as data engineers and software engineers interact at various levels. The unique collaboration context of these systems presents an ideal setting to investigate community smells and their impact on development communities. This paper addresses a gap in the literature by identifying the types, causes, effects, and potential mitigation strategies of community smells in machine learning-enabled systems. Using Partial Least Squares Structural Equation Modeling (PLS-SEM), we developed hypotheses based on existing literature and interviews, and conducted a questionnaire-based study to collect data. Our analysis resulted in the construction and validation of five models that represent the causes, effects, and strategies for five specific community smells. These models can help practitioners identify and address community smells within their organizations, while also providing valuable insights for future research on the socio-technical aspects of machine learning-enabled system communities.

[J71] EMSE 2024

Test Code Refactoring Unveiled: Where and How Does It Affect Test Code Quality and Effectiveness?*

Springer's Journal of Empirical Software Engineering (EMSE)

Journal Software Testing Empirical Software Engineering L. Martins, V. Pontillo, H. Costa, F. Ferrucci, F. Palomba, I. Machado.

Abstract. Refactoring has been widely investigated in the past in relation to production code quality, yet little is known about how developers apply refactoring to test code. Specifically, there is still a lack of investigation into how developers typically refactor test code and its effects on test code quality and effectiveness. This paper presents an exploratory empirical study aimed at bridging this knowledge gap by investigating (1) whether test refactoring actually targets test classes affected by quality and effectiveness concerns and (2) the extent to which refactoring contributes to the improvement of test code quality and effectiveness. First, we performed an exploratory mining software repository study to collect test refactoring data of open-source Java projects from GitHub. Then, we statistically analyzed them in combination with quality metrics, test smells, and code/mutation coverage indicators. Furthermore, we measured how refactoring operations impact the quality and effectiveness of test code. Our findings indicate that test refactorings primarily address low-quality test code, as evidenced by test smells and quality metrics. At the same time, we did not find a statistically significant relationship between test refactorings and code/mutation coverage metrics. In addition, test refactorings enhance the coupling, cohesion, and size of the test code, albeit sometimes leading to an increase in certain test smells. We conclude our study by emphasizing the significance of incorporating both quality metrics and test smells into refactoring decisions to enhance the overall quality of test code.

[J70] CSUR 2024

Motivations, Challenges, Best Practices, and Benefits for Bots and Conversational Agents in Software Engineering: A Multivocal Literature Review.*

ACM Computing Surveys (CSUR)

Journal Empirical Software Engineering Systematic Literature Review S. Lambiase, G. Catolino, F. Palomba, F. Ferrucci.

Abstract. Bots are software systems designed to support users by automating specific processes, tasks, or activities. When these systems implement a conversational component to interact with users, they are also known as conversational agents or chatbots. Bots, particularly in their conversation-oriented and AI-powered versions, have seen increased adoption over time for software development and engineering purposes. Despite their exciting potential, which has been further enhanced by the advent of Generative AI and Large Language Models, bots still face challenges in terms of development and integration into the development cycle, as practitioners report that bots can add difficulties rather than provide improvements. In this work, we aim to provide a taxonomy for characterizing bots, as well as a series of challenges for their adoption in software engineering, accompanied by potential mitigation strategies. To achieve our objectives, we conducted a multivocal literature review, examining both research and practitioner literature. Through such an approach, we hope to contribute to both researchers and practitioners by (i) providing a series of future research directions to pursue, (ii) offering a list of strategies to adopt for improving the use of bots for software engineering purposes, and (iii) fostering technology and knowledge transfer from the research field to practice, which is one of the primary goals of multivocal literature reviews.

[J69] IEEE Access 2024

A Large-Scale Empirical Investigation into Cross-Project Flaky Test Prediction.*

IEEE Access

Journal Software Testing Empirical Software Engineering A. Afeltra, A. Cannavale, F. Pecorelli, V. Pontillo, F. Palomba.

Abstract. Test flakiness arises when a test case exhibits inconsistent behavior by alternating between passing and failing states when executed against the same code. Previous research showed the significance of the problem in practice, proposing empirical studies into the nature of flakiness and automated techniques for its detection. Machine learning models emerged as a promising approach for flaky test prediction. However, existing research has predominantly focused on within-project scenarios, where models are trained and tested using data from a single project. On the contrary, little is known about how flaky test prediction models may be adapted to software projects lacking sufficient historical data for effective prediction. In this paper, we address this gap by proposing a large-scale assessment of flaky test prediction in cross-project scenarios, i.e., in situations where predictive models are trained using data coming from external projects. Leveraging a dataset of 1,385 flaky tests from 29 open-source projects, we examine static test flakiness prediction models and evaluate feature- and instance-based filtering methods for cross-project predictions. Our study underscores the difficulties in utilizing cross-project flaky test data and highlights the significance of filtering methods in enhancing prediction accuracy. Notably, we find that the TrAdaBoost filtering method significantly reduces data heterogeneity, leading to an F-Measure of 70%.

[J68] IST 2024

The Quantum Frontier of Software Engineering: A Systematic Mapping Study.*

Elsevier's Information and Software Technology (IST)

Journal Software Quality Empirical Software Engineering M. De Stefano, F. Pecorelli, D. Di Nucci, F. Palomba, A. De Lucia.

Abstract. Quantum computing is becoming a reality, and quantum software engineering (QSE) is emerging as a new discipline to enable developers to design and develop quantum programs. This paper presents a systematic mapping study of the current state of QSE research, aiming to identify the most investigated topics, the types and number of studies, the main reported results, and the most studied quantum computing tools/frameworks. Additionally, the study aims to explore the research community's interest in QSE, how it has evolved, and any prior contributions to the discipline before its formal introduction through the Talavera Manifesto. We searched for relevant articles in several databases and applied inclusion and exclusion criteria to select the most relevant studies. After evaluating the quality of the selected resources, we extracted relevant data from the primary studies and analyzed them. We found that QSE research has primarily focused on software testing, with little attention given to other topics, such as software engineering management. The most commonly studied technology for techniques and tools is Qiskit, although, in most studies, either multiple or no specific technologies were employed. The researchers most interested in QSE are interconnected through direct collaborations, and several strong collaboration clusters have been identified. Most articles in QSE have been published in non-thematic venues, with a preference for conferences. The study's implications include providing a centralized source of information for researchers and practitioners in the field, facilitating knowledge transfer, and contributing to the advancement and growth of QSE.

[J67] JSS 2024

Technical Debt in AI-Enabled Systems: On the Prevalence, Severity, Impact, and Management Strategies for Code and Architecture.*

Elsevier's Journal of Systems and Software (JSS)

Journal Software Quality Empirical Software Engineering G. Recupito, F. Pecorelli, G. Catolino, V. Lenarduzzi, D. Taibi, D. Di Nucci, F. Palomba.

Technical Debt in AI-Enabled Systems: On the Prevalence, Severity, Impact, and Management Strategies for Code and Architecture.*

G. Recupito, F. Pecorelli, G. Catolino, V. Lenarduzzi, D. Taibi, D. Di Nucci, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. Artificial Intelligence (AI) is pervasive in several application domains and promises to be even more diffused in the next decades. Developing high-quality AI-enabled systems (software systems embedding one or multiple AI components, algorithms, and models) could introduce critical challenges for mitigating specific risks related to the systems' quality. Such development alone is insufficient to fully address socio-technical consequences and the need for rapid adaptation to evolutionary changes. Recent work proposed the concept of AI technical debt, a potential liability concerned with developing AI-enabled systems whose impact can affect the overall systems’ quality. While the problem of AI technical debt is rapidly gaining the attention of the software engineering research community, scientific knowledge that contributes to understanding and managing the matter is still limited. In this paper, we leverage the expertise of practitioners to offer useful insights to the research community, aiming to enhance researchers' awareness about the detection and mitigation of AI technical debt. Our ultimate goal is to empower practitioners by providing them with tools and methods. Additionally, our study sheds light on novel aspects that practitioners might not be fully acquainted with, contributing to a deeper understanding of the subject. Method: We develop a survey study featuring 53 AI practitioners, in which we collect information on the practical prevalence, severity, and impact of AI technical debt issues affecting the code and the architecture, as well as the strategies applied by practitioners to identify and mitigate them. The key findings of the study reveal the multiple impacts that AI technical debt issues may have on the quality of AI-enabled systems (e.g., the high negative impact that Undeclared Consumers has on security, whereas Jumbled Model Architecture can make the code hard to maintain) and the little support practitioners have to deal with them, which is limited to manual effort for identification and refactoring. We conclude the article by distilling lessons learned and actionable insights for researchers.

Download PDF

[J66] IST 2024

SENEM: A Software Engineering-Enabled Educational Metaverse.*

Elsevier's Information and Software Technology (IST)

The term metaverse refers to a persistent, virtual, three-dimensional environment where individuals may communicate, engage, and collaborate. One of the most multifaceted and challenging use cases of the metaverse is education, where educators and learners may require multiple technical, social, psychological, and interaction instruments to accomplish their learning objectives. While the characteristics of the metaverse might nicely fit the problem's needs, our research points out a noticeable lack of knowledge into (1) the specific requirements that an educational metaverse should actually fulfill to let educators and learners successfully interact toward their objectives and (2) how to design an appropriate educational metaverse for both educators and learners. In this paper, we aim to bridge this knowledge gap by proposing SENEM, a novel software engineering-enabled educational metaverse. We first elicit a set of functional requirements that an educational metaverse should fulfill. Download PDF

Journal Software Quality Empirical Software Engineering V. Pentangelo, D. Di Dario, S. Lambiase, F. Ferrucci, C. Gravino, F. Palomba.

SENEM: A Software Engineering-Enabled Educational Metaverse.*

V. Pentangelo, D. Di Dario, S. Lambiase, F. Ferrucci, C. Gravino, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. The term metaverse refers to a persistent, virtual, three-dimensional environment where individuals may communicate, engage, and collaborate. One of the most multifaceted and challenging use cases of the metaverse is education, where educators and learners may require multiple technical, social, psychological, and interaction instruments to accomplish their learning objectives. While the characteristics of the metaverse might nicely fit the problem's needs, our research points out a noticeable lack of knowledge into (1) the specific requirements that an educational metaverse should actually fulfill to let educators and learners successfully interact toward their objectives and (2) how to design an appropriate educational metaverse for both educators and learners. In this paper, we aim to bridge this knowledge gap by proposing SENEM, a novel software engineering-enabled educational metaverse. We first elicit a set of functional requirements that an educational metaverse should fulfill. In this respect, we conduct a literature survey to extract the currently available knowledge on the matter discussed by the research community, and afterward, we assess and complement such knowledge through semi-structured interviews with educators and learners. Upon completing the requirements elicitation stage, we then build our prototype implementation of SENEM, a metaverse that makes available to educators and learners the features identified in the previous stage. Finally, we evaluate the tool in terms of learnability, efficiency, and satisfaction through a Rapid Iterative Testing and Evaluation research approach, leading us to the iterative refinement of our prototype. Through our survey strategy, we extracted nine requirements that guided the development of the tool, which the study participants evaluated positively. Our study reveals that the target audience appreciates the elicited design strategy. Our work has the potential to form a solid contribution that other researchers can use as a basis for further improvements.

Download PDF

[J65] IEEE Access 2024

FedCSD: A Federated Learning Based Approach for Code-Smell Detection.*

IEEE Access

Software quality is critical, as low quality or code smells increase technical debt and maintenance costs. There is a timely need for a collaborative model that detects and manages code smells by learning from diverse and distributed data sources while respecting privacy and providing a scalable solution for continuously integrating new patterns and practices in code quality management. However, the current literature is still missing such capabilities. This paper addresses the previous challenges by proposing a Federated Learning Code Smell Detection (FedCSD) approach, specifically targeting "God Class", to enable organizations to collaboratively train distributed ML models while safeguarding data privacy. Download PDF

Journal Software Quality Empirical Software Engineering S. Alawadi, K. Alkharabsheh, F. Alkhabbas, V. Kebande, F. Awaysheh, F. Palomba, M. Awad.

FedCSD: A Federated Learning Based Approach for Code-Smell Detection.*

S. Alawadi, K. Alkharabsheh, F. Alkhabbas, V. Kebande, F. Awaysheh, F. Palomba, M. Awad. Journal Software Quality Empirical Software Engineering

Abstract. Software quality is critical, as low quality or code smells increase technical debt and maintenance costs. There is a timely need for a collaborative model that detects and manages code smells by learning from diverse and distributed data sources while respecting privacy and providing a scalable solution for continuously integrating new patterns and practices in code quality management. However, the current literature is still missing such capabilities. This paper addresses the previous challenges by proposing a Federated Learning Code Smell Detection (FedCSD) approach, specifically targeting "God Class", to enable organizations to collaboratively train distributed ML models while safeguarding data privacy. We conduct experiments using manually validated datasets to detect and analyze code smell scenarios to validate our approach. Experiment 1, a centralized training experiment, revealed varying accuracies across datasets, with dataset two achieving the lowest accuracy (92.30%) and datasets one and three achieving the highest (98.90% and 99.5%, respectively). Experiment 2, focusing on cross-evaluation, showed a significant drop in accuracy (lowest: 63.80%) when fewer smells were present in the training dataset, reflecting technical debt. Experiment 3 involved splitting the dataset across 10 companies, resulting in a global model accuracy of 98.34%, comparable to the centralized model's highest accuracy. The application of federated ML techniques demonstrates promising performance improvements in code-smell detection, benefiting both software developers and researchers.

Download PDF

[J64] TOSEM 2024

Early and Realistic Exploitability Prediction of Just-Disclosed Software Vulnerabilities: How Reliable Can It Be?*

ACM Transactions on Software Engineering and Methodology (TOSEM)

With the rate of discovered and disclosed vulnerabilities escalating, researchers have been experimenting with machine learning to predict whether a vulnerability will be exploited. Existing solutions leverage information unavailable when a CVE is created, making them unsuitable just after the disclosure. This paper experiments with early exploitability prediction models driven exclusively by the initial CVE record, i.e., the original description and the linked online discussions. Leveraging NVD and Exploit Database, we evaluate 72 prediction models trained using six traditional machine learning classifiers, four feature representation schemas, and three data balancing algorithms. We also experiment with five pre-trained large language models (LLMs). Download PDF

Journal Empirical Software Engineering E. Iannone, G. Sellitto, E. Iaccarino, F. Ferrucci, A. De Lucia, F. Palomba.

Early and Realistic Exploitability Prediction of Just-Disclosed Software Vulnerabilities: How Reliable Can It Be?*

E. Iannone, G. Sellitto, E. Iaccarino, F. Ferrucci, A. De Lucia, F. Palomba. Journal Empirical Software Engineering

Abstract. With the rate of discovered and disclosed vulnerabilities escalating, researchers have been experimenting with machine learning to predict whether a vulnerability will be exploited. Existing solutions leverage information unavailable when a CVE is created, making them unsuitable just after the disclosure. This paper experiments with early exploitability prediction models driven exclusively by the initial CVE record, i.e., the original description and the linked online discussions. Leveraging NVD and Exploit Database, we evaluate 72 prediction models trained using six traditional machine learning classifiers, four feature representation schemas, and three data balancing algorithms. We also experiment with five pre-trained large language models (LLMs). The models leverage seven different corpora made by combining three data sources, i.e., CVE description, Security Focus, and BugTraq. The models are evaluated in a realistic, time-aware fashion by removing the training and test instances that cannot be labeled "neutral" with sufficient confidence. The validation reveals that CVE descriptions and Security Focus discussions are the best data to train on. Pre-trained LLMs do not show the expected performance, requiring further pre-training in the security domain. We distill new research directions, identify possible room for improvement, and envision automated systems assisting security experts in assessing the exploitability.

Download PDF

[J63] EMSE 2024

An Empirical Study Into the Effects of Transpilation on Quantum Circuit Smells.*

Springer's Journal of Empirical Software Engineering (EMSE)

Quantum computing is a promising field that can solve complex problems beyond traditional computers' capabilities. Developing high-quality quantum software applications, called quantum software engineering, has recently gained attention. However, quantum software development faces challenges related to code quality. A recent study found that many open-source quantum programs are affected by quantum-specific code smells, with long circuit being the most common. While the study provided relevant insights into the prevalence of code smells in quantum circuits, it did not explore the potential effect of transpilation, a necessary step for executing quantum computer programs, on the emergence of code smells. Download PDF

Journal Software Quality Empirical Software Engineering M. De Stefano, D. Di Nucci, F. Palomba, A. De Lucia.

An Empirical Study Into the Effects of Transpilation on Quantum Circuit Smells.*

M. De Stefano, D. Di Nucci, F. Palomba, A. De Lucia. Journal Software Quality Empirical Software Engineering

Abstract. Quantum computing is a promising field that can solve complex problems beyond traditional computers' capabilities. Developing high-quality quantum software applications, called quantum software engineering, has recently gained attention. However, quantum software development faces challenges related to code quality. A recent study found that many open-source quantum programs are affected by quantum-specific code smells, with long circuit being the most common. While the study provided relevant insights into the prevalence of code smells in quantum circuits, it did not explore the potential effect of transpilation, a necessary step for executing quantum computer programs, on the emergence of code smells. Indeed, transpilation might alter those characteristics employed to detect the presence of a smell on a circuit. To address this limitation, we present a new study investigating the impact of transpilation on quantum-specific code smells and how different target gate sets affect the results. We conducted experiments on 17 open-source quantum programs alongside a set of 100 synthetic circuits. We found that transpilation can significantly alter the metrics that are used to detect code smells, even in previously smell-free circuits, with the long circuit smell being the most susceptible to transpilation. Furthermore, the choice of the gate set significantly influences the presence and severity of code smells in transpiled circuits, highlighting the need for careful gate set selection to mitigate their impact. These findings have implications for circuit optimization and high-quality quantum software development. Further research is needed to understand the consequences of code smells and their potential impact on quantum computations, considering the characteristics and constraints of different gate sets and hardware platforms.

Download PDF

[J62] EMSE 2024

Toward Granular Search-Based Automatic Unit Test Case Generation.*

Springer's Journal of Empirical Software Engineering (EMSE)

Unit testing verifies the presence of faults in individual software components. Previous research has been targeting the automatic generation of unit tests through the adoption of random or search-based algorithms. Despite their effectiveness, these approaches aim at creating tests by solely optimizing metrics like code coverage, without ensuring that the resulting tests have granularities that would allow them to verify both the behavior of individual production methods and the interaction between methods of the class under test. To address this limitation, we propose a two-step systematic approach to the generation of unit tests. Download PDF

Journal Software Testing Empirical Software Engineering F. Pecorelli, G. Grano, F. Palomba, H. Gall, A. De Lucia.

Toward Granular Search-Based Automatic Unit Test Case Generation.*

F. Pecorelli, G. Grano, F. Palomba, H. Gall, A. De Lucia. Journal Software Testing Empirical Software Engineering

Abstract. Unit testing verifies the presence of faults in individual software components. Previous research has been targeting the automatic generation of unit tests through the adoption of random or search-based algorithms. Despite their effectiveness, these approaches aim at creating tests by solely optimizing metrics like code coverage, without ensuring that the resulting tests have granularities that would allow them to verify both the behavior of individual production methods and the interaction between methods of the class under test. To address this limitation, we propose a two-step systematic approach to the generation of unit tests: we first force search-based algorithms to create tests that cover individual methods of the production code, hence implementing the so-called intra-method tests; then, we relax the constraints to enable the creation of intra-class tests that target the interactions among production code methods. The assessment of our approach is conducted through a mixed-method research design that combines statistical analyses with a user study. The key results report that our approach is able to keep the same level of code and mutation coverage while providing test suites that are more structured, more understandable and aligned to the design principles of unit testing.

Download PDF

[J61] IST 2023

Test Code Flakiness in Mobile Apps: The Developer's Perspective.*

Elsevier's Information and Software Technology (IST)

Test flakiness arises when test cases have a non-deterministic, intermittent behavior that leads them to either pass or fail when run against the same code. While researchers have been contributing to the detection, classification, and removal of flaky tests with several empirical studies and automated techniques, little is known about how the problem of test flakiness arises in mobile applications. We point out a lack of knowledge on: (1) The prominence and harmfulness of the problem; (2) The most frequent root causes inducing flakiness; and (3) The strategies applied by practitioners to deal with it in practice. An improved understanding of these matters may lead the software engineering research community to assess the need for tailoring existing instruments to the mobile context or for brand-new approaches that focus on the peculiarities identified. Download PDF

Journal Software Testing Empirical Software Engineering V. Pontillo, F. Palomba, F. Ferrucci.

Test Code Flakiness in Mobile Apps: The Developer's Perspective.*

V. Pontillo, F. Palomba, F. Ferrucci. Journal Software Testing Empirical Software Engineering

Abstract. Test flakiness arises when test cases have a non-deterministic, intermittent behavior that leads them to either pass or fail when run against the same code. While researchers have been contributing to the detection, classification, and removal of flaky tests with several empirical studies and automated techniques, little is known about how the problem of test flakiness arises in mobile applications. We point out a lack of knowledge on: (1) The prominence and harmfulness of the problem; (2) The most frequent root causes inducing flakiness; and (3) The strategies applied by practitioners to deal with it in practice. An improved understanding of these matters may lead the software engineering research community to assess the need for tailoring existing instruments to the mobile context or for brand-new approaches that focus on the peculiarities identified. We address this gap of knowledge by means of an empirical study into the mobile developer's perception of test flakiness. We first perform a systematic grey literature review to elicit how developers discuss and deal with the problem of test flakiness in the wild. Then, we complement the systematic review through a survey study that involves 130 mobile developers and that aims at analyzing their experience on the matter. The results of the grey literature review indicate that developers are often concerned with flakiness connected to user interface elements. In addition, our survey study reveals that flaky tests are perceived as critical by mobile developers, who pointed out major production code- and source code design-related root causes of flakiness, as well as the long-term effects of recurrent flaky tests. Furthermore, our study reveals the diagnosing and fixing processes currently adopted by developers, along with their limitations. We conclude by distilling lessons learned, implications, and future research directions.

Download PDF

[J60] EMSE 2023

Machine Learning-Based Test Smell Detection.*

Springer's Journal of Empirical Software Engineering (EMSE)

Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of these detectors is still limited and dependent on tunable thresholds. We design and experiment with a novel test smell detection approach based on machine learning to detect four test smells. First, we develop the largest dataset of manually-validated test smells to enable experimentation. Afterward, we train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we compare the ML-based approach with state-of-the-art heuristic-based techniques. Download PDF

Journal Software Testing Empirical Software Engineering V. Pontillo, D. Amoroso D'Aragona, F. Pecorelli, D. Di Nucci, F. Ferrucci, F. Palomba.

Machine Learning-Based Test Smell Detection.*

V. Pontillo, D. Amoroso D'Aragona, F. Pecorelli, D. Di Nucci, F. Ferrucci, F. Palomba. Journal Empirical Software Engineering

Abstract. Test smells are symptoms of sub-optimal design choices adopted when developing test cases. Previous studies have proved their harmfulness for test code maintainability and effectiveness. Therefore, researchers have been proposing automated, heuristic-based techniques to detect them. However, the performance of these detectors is still limited and dependent on tunable thresholds. We design and experiment with a novel test smell detection approach based on machine learning to detect four test smells. First, we develop the largest dataset of manually-validated test smells to enable experimentation. Afterward, we train six machine learners and assess their capabilities in within- and cross-project scenarios. Finally, we compare the ML-based approach with state-of-the-art heuristic-based techniques. The key findings of the study report a negative result. The performance of the machine learning-based detector is significantly better than heuristic-based techniques, but none of the learners is able to exceed an average F-Measure of 51%. We further elaborate and discuss the reasons behind this negative result through a qualitative investigation into the current issues and challenges that prevent the appropriate detection of test smells, which allowed us to catalog the next steps that the research community may pursue to improve test smell detection techniques.

Download PDF

[J59] EMSE 2023

On the Adoption and Effects of Source Code Reuse on Defect Proneness and Maintenance Effort.*

Springer's Journal of Empirical Software Engineering (EMSE)

Software reusability mechanisms, like inheritance and delegation in Object-Oriented programming, are widely recognized as key instruments of software design that reduce the risks of source code being affected by defects, as well as the effort required to maintain and evolve source code. Previous work has traditionally employed source code reuse metrics for prediction purposes, e.g., in the context of defect prediction. However, our research identifies two noticeable limitations of the current literature. First, little is still known about the extent to which developers actually employ code reuse mechanisms over time. Second, it is still unclear how these mechanisms may contribute to explaining defect-proneness and maintenance effort during software evolution. Download PDF

Journal Empirical Software Engineering G. Giordano, G. Festa, G. Catolino, F. Palomba, F. Ferrucci, C. Gravino.

On the Adoption and Effects of Source Code Reuse on Defect Proneness and Maintenance Effort.*

G. Giordano, G. Festa, G. Catolino, F. Palomba, F. Ferrucci, C. Gravino. Journal Empirical Software Engineering

Abstract. Software reusability mechanisms, like inheritance and delegation in Object-Oriented programming, are widely recognized as key instruments of software design that reduce the risks of source code being affected by defects, as well as the effort required to maintain and evolve source code. Previous work has traditionally employed source code reuse metrics for prediction purposes, e.g., in the context of defect prediction. However, our research identifies two noticeable limitations of the current literature. First, little is still known about the extent to which developers actually employ code reuse mechanisms over time. Second, it is still unclear how these mechanisms may contribute to explaining defect-proneness and maintenance effort during software evolution. We aim at bridging this gap of knowledge, as an improved understanding of these aspects might provide insights into the actual support provided by these mechanisms, e.g., by suggesting whether and how to use them for prediction purposes. We propose an exploratory study, conducted on 12 Java projects (over 44,900 commits) of the Defects4J dataset, aiming at (1) assessing how developers use inheritance and delegation during software evolution; and (2) statistically analyzing the impact of inheritance and delegation on fault proneness and maintenance effort. Our results reveal various usage patterns that describe the way inheritance and delegation vary over time. In addition, we find that inheritance and delegation are statistically significant factors that influence both source code defect-proneness and maintenance effort.

Download PDF

[J58] JSS 2023

An Empirical Investigation into the Influence of Software Communities' Cultural and Geographical Dispersion on Productivity.*

Elsevier's Journal of Systems and Software (JSS)

Estimating and understanding software development productivity represent crucial tasks for researchers and practitioners. Although different works focused on evaluating the impact of human factors on productivity, few explored the influence of cultural/geographical diversity in software development communities. More particularly, previous treatises address cultural aspects as abstract concepts without providing a quantitative representation. Improved knowledge of these matters might help project managers to assemble more productive teams and tool vendors to design software analytics toolkits that may better estimate productivity. This paper has the goal of enlarging the existing body of knowledge on the factors affecting productivity by focusing on cultural and geographical dispersion of a development community, namely, how diverse a community is in terms of cultural attitudes and geographical collocation of the members who belong to it. Download PDF

Journal Empirical Software Engineering S. Lambiase, G. Catolino, F. Pecorelli, D. Tamburri, F. Palomba, W.J. van den Heuvel, F. Ferrucci.

An Empirical Investigation into the Influence of Software Communities' Cultural and Geographical Dispersion on Productivity.*

S. Lambiase, G. Catolino, F. Pecorelli, D. Tamburri, F. Palomba, W.J. van den Heuvel, F. Ferrucci. Journal Empirical Software Engineering

Abstract. Estimating and understanding software development productivity represent crucial tasks for researchers and practitioners. Although different works focused on evaluating the impact of human factors on productivity, few explored the influence of cultural/geographical diversity in software development communities. More particularly, previous treatises address cultural aspects as abstract concepts without providing a quantitative representation. Improved knowledge of these matters might help project managers to assemble more productive teams and tool vendors to design software analytics toolkits that may better estimate productivity. This paper has the goal of enlarging the existing body of knowledge on the factors affecting productivity by focusing on cultural and geographical dispersion of a development community, namely, how diverse a community is in terms of cultural attitudes and geographical collocation of the members who belong to it. To reach this goal, we performed a mixed-method empirical study. First, we built a statistical model relating dispersion metrics with the productivity of 25 open-source communities on GitHub. Then, we performed a confirmatory survey with 140 practitioners. The key results of our study indicate that cultural and geographical dispersion considerably impact productivity, thus encouraging managers and practitioners to consider such aspects during all the phases of the software development lifecycle. We conclude our paper by elaborating on the main insights from our analyses and distilling implications that may drive further research.

Download PDF

[J57] EMSE 2023

Fairness-Aware Machine Learning Engineering: How Far Are We?*

Springer's Journal of Empirical Software Engineering (EMSE)

Machine learning is part of the daily life of people and companies worldwide. Unfortunately, bias in machine learning algorithms risks unfairly influencing the decision-making process and reiterating possible discrimination. While the interest of the software engineering community in software fairness is rapidly increasing, there is still a lack of understanding of various aspects connected to fair machine learning engineering, i.e., the software engineering process involved in developing fairness-critical machine learning systems. Questions connected to the practitioners’ awareness and maturity about fairness, the skills required to deal with the matter, and the best development phase(s) where fairness should be faced more are just some examples of the knowledge gaps currently open. Download PDF

Journal Empirical Software Engineering C. Ferrara, G. Sellitto, F. Ferrucci, F. Palomba, A. De Lucia.

Fairness-Aware Machine Learning Engineering: How Far Are We?*

C. Ferrara, G. Sellitto, F. Ferrucci, F. Palomba, A. De Lucia. Journal Empirical Software Engineering

Abstract. Machine learning is part of the daily life of people and companies worldwide. Unfortunately, bias in machine learning algorithms risks unfairly influencing the decision-making process and reiterating possible discrimination. While the interest of the software engineering community in software fairness is rapidly increasing, there is still a lack of understanding of various aspects connected to fair machine learning engineering, i.e., the software engineering process involved in developing fairness-critical machine learning systems. Questions connected to the practitioners’ awareness and maturity about fairness, the skills required to deal with the matter, and the best development phase(s) where fairness should be faced more are just some examples of the knowledge gaps currently open. In this paper, we provide insights into how fairness is perceived and managed in practice, to shed light on the instruments and approaches that practitioners might employ to properly handle fairness. We conducted a survey with 117 professionals who shared their knowledge and experience highlighting the relevance of fairness in practice, and the skills and tools required to handle it. The key results of our study show that fairness is still considered a second-class quality aspect in the development of artificial intelligence systems. The building of specific methods and development environments, other than automated validation tools, might help developers to treat fairness throughout the software lifecycle and revert this trend.

Download PDF

[J56] CSUR 2023

A Systematic Literature Review on Code Smells Datasets and Validation Mechanisms.*

ACM Computing Surveys (CSUR)

The accuracy reported for code smell detection tools varies depending on the dataset used to evaluate the tools. Our survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells highly depends on relevant properties such as the size, severity level, project types, number of each type of smell, number of smells, and the ratio of smelly to non-smelly samples in the dataset. Most existing datasets support God Class, Long Method, and Feature Envy while six smells in Fowler and Beck's catalog are not supported by any datasets. We conclude that existing datasets suffer from imbalanced samples, lack of supporting severity level, and restriction to Java language. Download PDF

Journal Empirical Software Engineering Systematic Literature Review M. Zakeri-Nasrabadi, S. Parsa, E. Esmaili, F. Palomba.

A Systematic Literature Review on Code Smells Datasets and Validation Mechanisms.*

M. Zakeri-Nasrabadi, S. Parsa, E. Esmaili, F. Palomba. Journal Empirical Software Engineering Systematic Literature Review

Abstract. The accuracy reported for code smell detection tools varies depending on the dataset used to evaluate the tools. Our survey of 45 existing datasets reveals that the adequacy of a dataset for detecting smells highly depends on relevant properties such as the size, severity level, project types, number of each type of smell, number of smells, and the ratio of smelly to non-smelly samples in the dataset. Most existing datasets support God Class, Long Method, and Feature Envy while six smells in Fowler and Beck's catalog are not supported by any datasets. We conclude that existing datasets suffer from imbalanced samples, lack of supporting severity level, and restriction to Java language.

Download PDF

[J55] SoftwareX 2023

QuantuMoonLight: A Low-Code Platform to Experiment with Quantum Machine Learning.*

Elsevier's SoftwareX

Nowadays, machine learning is being used to address multiple problems in various research fields, with software engineering researchers being among the most active users of machine learning mechanisms. Recent advances revolve around the use of quantum machine learning, which promises to revolutionize program computation and boost software systems' problem-solving capabilities. However, using quantum computing technologies is not trivial and requires interdisciplinary skills and expertise. Download PDF

Journal Empirical Software Engineering F. Amato, [other authors], F. Palomba.

QuantuMoonLight: A Low-Code Platform to Experiment with Quantum Machine Learning.*

F. Amato, M. Cicalese, L. Contrasto, G. Cubicciotti, G. D'Ambola, A. La Marca, G. Pagano, F. Tomeo, G. Robertazzi, G. Vassallo, G. Acampora, A. Vitiello, G. Catolino, G. Giordano, S. Lambiase, V. Pontillo, G. Sellitto, F. Ferrucci, F. Palomba. Journal Empirical Software Engineering

Abstract. Nowadays, machine learning is being used to address multiple problems in various research fields, with software engineering researchers being among the most active users of machine learning mechanisms. Recent advances revolve around the use of quantum machine learning, which promises to revolutionize program computation and boost software systems' problem-solving capabilities. However, using quantum computing technologies is not trivial and requires interdisciplinary skills and expertise. For such a reason, we propose QuantuMoonLight, a community-based low-code platform that allows researchers and practitioners to configure and experiment with quantum machine learning pipelines, compare them with classic machine learning algorithms, and share lessons learned and experience reports. We showcase the architecture and main features of QuantuMoonLight, other than discussing its envisioned impact on research and practice.

Download PDF

[J54] JSS 2023

The Anatomy of a Vulnerability Database: A Systematic Mapping Study.*

Elsevier's Journal of Systems and Software (JSS)

Software vulnerabilities play a major role, as there are multiple risks associated, including loss and manipulation of private data. The software engineering research community has been contributing to the body of knowledge by proposing several empirical studies on vulnerabilities and automated techniques to detect and remove them from source code. The reliability and generalizability of the findings heavily depend on the quality of the information mineable from publicly available datasets of vulnerabilities as well as on the availability and suitability of those databases. Download PDF

Journal Empirical Software Engineering Systematic Literature Review X. Li, S. Moreschini, Z. Zhang, F. Palomba, D. Taibi.

The Anatomy of a Vulnerability Database: A Systematic Mapping Study.*

X. Li, S. Moreschini, Z. Zhang, F. Palomba, D. Taibi. Journal Empirical Software Engineering Systematic Literature Review

Abstract. Software vulnerabilities play a major role, as there are multiple risks associated, including loss and manipulation of private data. The software engineering research community has been contributing to the body of knowledge by proposing several empirical studies on vulnerabilities and automated techniques to detect and remove them from source code. The reliability and generalizability of the findings heavily depend on the quality of the information mineable from publicly available datasets of vulnerabilities as well as on the availability and suitability of those databases. In this paper, we seek to understand the anatomy of the currently available vulnerability databases through a systematic mapping study where we analyze (1) what are the popular vulnerability databases adopted; (2) what are the goals for adoption; (3) what are the other sources of information adopted; (4) what are the methods and techniques; (5) which tools are proposed. An improved understanding of these aspects might not only allow researchers to take informed decisions on the databases to consider when doing research but also practitioners to establish reliable sources of information to inform their security policies and standards.

Download PDF

[J53] EMSE 2023

Rubbing Salt in The Wound? A Large-Scale Investigation into The Effects of Refactoring on Security.*

Springer's Journal of Empirical Software Engineering (EMSE)

Software refactoring is a behavior-preserving activity to improve the source code quality without changing its external behavior. Unfortunately, it is often a manual and error-prone task that may induce regressions in the source code. Researchers have provided initial compelling evidence of the relation between refactoring and defects, yet little is known about how much it may impact software security. This paper bridges this knowledge gap by presenting a large-scale empirical investigation into the effects of refactoring on the security profile of applications. Download PDF

Journal Software Quality Empirical Software Engineering E. Iannone, Z. Codabux, V. Lenarduzzi, A. De Lucia, F. Palomba.

Rubbing Salt in The Wound? A Large-Scale Investigation into The Effects of Refactoring on Security.*

E. Iannone, Z. Codabux, V. Lenarduzzi, A. De Lucia, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. Software refactoring is a behavior-preserving activity to improve the source code quality without changing its external behavior. Unfortunately, it is often a manual and error-prone task that may induce regressions in the source code. Researchers have provided initial compelling evidence of the relation between refactoring and defects, yet little is known about how much it may impact software security. This paper bridges this knowledge gap by presenting a large-scale empirical investigation into the effects of refactoring on the security profile of applications. We conduct a three-level mining software repository study to establish the impact of 14 refactoring types on (i) security-related metrics, (ii) security technical debt, and (iii) the introduction of known vulnerabilities. The study covers 39 projects and a total amount of 7,708 refactoring commits. The key results show that refactoring has a limited connection to security. However, Inline Method and Extract Interface statistically contribute to improving some security aspects connected to encapsulating security-critical code components. Extract Superclass and Pull Up Attribute refactoring are commonly found in commits violating specific security best practices for writing secure code. Finally, Extract Superclass and Extract and Move Method refactoring tend to occur more often in commits contributing to the introduction of vulnerabilities. We conclude by distilling lessons learned and recommendations for researchers and practitioners.

Download PDF

[J52] JSEP 2022

"Through the looking-glass..." An Empirical Study on Blob Infrastructure Blueprints in TOSCA.*

Wiley's Journal of Software: Evolution and Process (JSEP)

Infrastructure-as-Code (IaC) helps keep up with the demand for fast, reliable, high-quality services by provisioning and managing infrastructures through configuration files. Those files ensure efficient and repeatable routines for system provisioning, but they might be affected by code smells that negatively affect quality and code maintenance. Research has broadly studied code smells for traditional source code development; however, none explored them in the "Topology and Orchestration Specification for Cloud Applications" (TOSCA), the technology-agnostic OASIS standard for IaC. In this paper, we investigate a prominent traditional implementation code smell potentially applicable to TOSCA: Large Class, or "Blob Blueprint" in IaC terms. Download PDF

Journal Software Quality Empirical Software Engineering S. Dalla Palma, C. van Asseldonk, G. Catolino, D. Di Nucci, F. Palomba, D. Tamburri.

"Through the looking-glass..." An Empirical Study on Blob Infrastructure Blueprints in TOSCA.*

S. Dalla Palma, C. van Asseldonk, G. Catolino, D. Di Nucci, F. Palomba, D. Tamburri. Journal Software Quality Empirical Software Engineering

Abstract. Infrastructure-as-Code (IaC) helps keep up with the demand for fast, reliable, high-quality services by provisioning and managing infrastructures through configuration files. Those files ensure efficient and repeatable routines for system provisioning, but they might be affected by code smells that negatively affect quality and code maintenance. Research has broadly studied code smells for traditional source code development; however, none explored them in the "Topology and Orchestration Specification for Cloud Applications" (TOSCA), the technology-agnostic OASIS standard for IaC. In this paper, we investigate a prominent traditional implementation code smell potentially applicable to TOSCA: Large Class, or "Blob Blueprint" in IaC terms. We compare metrics-based and unsupervised learning-based detectors on a large dataset of manually validated observations related to Blob Blueprints. We provide insights on code metrics that corroborate previous findings and empirically show that metrics-based detectors perform highly in detecting Blob Blueprints. We deem our results put forward a new research path toward dealing with this problem, e.g., in the scope of fully automated service pipelines.

Download PDF

[J51] JSS 2022

A Critical Comparison on Six Static Analysis Tools: Detection, Agreement, and Precision.*

Elsevier's Journal of Systems and Software (JSS)

Developers use Static Analysis Tools (SATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on SATs to favor the understanding of their actual capabilities. Despite the work done so far, there is still a lack of knowledge regarding (1) what is their agreement, and (2) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular SATs for Java projects: Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube. Download PDF

Journal Software Quality Empirical Software Engineering V. Lenarduzzi, F. Pecorelli, N. Saarimaki, S. Lujan, F. Palomba.

A Critical Comparison on Six Static Analysis Tools: Detection, Agreement, and Precision.*

V. Lenarduzzi, F. Pecorelli, N. Saarimaki, S. Lujan, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. Developers use Static Analysis Tools (SATs) to control for potential quality issues in source code, including defects and technical debt. Tool vendors have devised quite a number of tools, which makes it harder for practitioners to select the most suitable one for their needs. To better support developers, researchers have been conducting several studies on SATs to favor the understanding of their actual capabilities. Despite the work done so far, there is still a lack of knowledge regarding (1) what is their agreement, and (2) what is the precision of their recommendations. We aim at bridging this gap by proposing a large-scale comparison of six popular SATs for Java projects: Better Code Hub, CheckStyle, Coverity Scan, FindBugs, PMD, and SonarQube. We analyze 47 Java projects applying 6 SATs. To assess their agreement, we compared them by manually analyzing---at line- and class-level---whether they identify the same issues. Finally, we evaluate the precision of the tools against a manually-defined ground truth. The key results show little to no agreement among the tools and a low degree of precision. Our study provides the first overview on the agreement among different tools as well as an extensive analysis of their precision that can be used by researchers, practitioners, and tool vendors to map the current capabilities of the tools and envision possible improvements.

Download PDF

[J50] EMSE 2022

Static Test Flakiness Prediction: How Far Can We Go?*

Springer's Journal of Empirical Software Engineering (EMSE)

Test flakiness is a phenomenon occurring when a test case is non-deterministic and exhibits both a passing and failing behavior when run against the same code. The problem has been closely investigated by researchers and practitioners, who all have shown its relevance in practice. The software engineering research community has been working toward defining approaches for detecting and addressing test flakiness. Despite being quite accurate, most of these approaches rely on expensive dynamic steps, e.g., the computation of code coverage information. Consequently, they might suffer from scalability issues that possibly preclude their practical use. This limitation has been recently targeted through machine learning solutions that could predict the flakiness of tests using various features, like source code vocabulary or a mixture of static and dynamic metrics computed on individual snapshots of the system. Download PDF

Journal Software Testing Empirical Software Engineering V. Pontillo, F. Palomba, F. Ferrucci.

Static Test Flakiness Prediction: How Far Can We Go?*

V. Pontillo, F. Palomba, F. Ferrucci. Journal Software Testing Empirical Software Engineering

Abstract. Test flakiness is a phenomenon occurring when a test case is non-deterministic and exhibits both a passing and failing behavior when run against the same code. The problem has been closely investigated by researchers and practitioners, who all have shown its relevance in practice. The software engineering research community has been working toward defining approaches for detecting and addressing test flakiness. Despite being quite accurate, most of these approaches rely on expensive dynamic steps, e.g., the computation of code coverage information. Consequently, they might suffer from scalability issues that possibly preclude their practical use. This limitation has been recently targeted through machine learning solutions that could predict the flakiness of tests using various features, like source code vocabulary or a mixture of static and dynamic metrics computed on individual snapshots of the system. In this paper, we aim to perform a step forward and predict test flakiness only using static metrics. We propose a large-scale experiment on 70 Java projects coming from the iDFlakies and FlakeFlagger datasets. First, we statistically assess the differences between flaky and non-flaky tests in terms of 25 test and production code metrics and smells, analyzing both their individual and combined effects. Based on the results achieved, we experiment with a machine learning approach that predicts test flakiness solely based on static features, comparing it with two state-of-the-art approaches. The key results of the study show that the static approach has performance comparable to those of the baselines. In addition, we found that the characteristics of the production code might impact the performance of the flaky test prediction models.

Download PDF

[J49] JSS 2022

On the Use of Artificial Intelligence to Deal with Privacy in IoT Systems: A Systematic Literature Review.*

Elsevier's Journal of Systems and Software (JSS)

The Internet of Things (IoT) refers to a network of Internet-enabled devices that can make different operations, like sensing, communicating, and reacting to changes arising in the surrounding environment. Nowadays, the number of IoT devices is already higher than the world population. These devices operate by exchanging data between them, sometimes through an intermediate cloud infrastructure, and may be used to enable a wide variety of novel services that can potentially improve the quality of life of billions of people. Nonetheless, all that glitters is not gold: the increasing adoption of IoT comes with several privacy concerns due to the lack or loss of control over the sensitive data exchanged by these devices. Download PDF

Journal Empirical Software Engineering Systematic Literature Review G. Giordano, F. Palomba, F. Ferrucci.

On the Use of Artificial Intelligence to Deal with Privacy in IoT Systems: A Systematic Literature Review.*

G. Giordano, F. Palomba, F. Ferrucci. Journal Empirical Software Engineering Systematic Literature Review

Abstract. The Internet of Things (IoT) refers to a network of Internet-enabled devices that can make different operations, like sensing, communicating, and reacting to changes arising in the surrounding environment. Nowadays, the number of IoT devices is already higher than the world population. These devices operate by exchanging data between them, sometimes through an intermediate cloud infrastructure, and may be used to enable a wide variety of novel services that can potentially improve the quality of life of billions of people. Nonetheless, all that glitters is not gold: the increasing adoption of IoT comes with several privacy concerns due to the lack or loss of control over the sensitive data exchanged by these devices. This represents a key challenge for software engineering researchers attempting to address those privacy concerns by proposing (semi-)automated solutions to identify sources of privacy leaks. In this respect, a notable trend is represented by the adoption of smart solutions, that is, the definition of techniques based on artificial intelligence (AI) algorithms. This paper proposes a systematic literature review of the research in smart detection of privacy concerns in IoT devices. Following well-established guidelines, we identify 152 primary studies that we analyze under three main perspectives: (1) What are the privacy concerns addressed with AI-enabled techniques; (2) What are the algorithms employed and how they have been configured/validated; and (3) Which are the domains targeted by these techniques. The key results of the study identified six main tasks targeted through the use of artificial intelligence, like Malware Detection or Network Analysis. Support Vector Machine is the technique most frequently used in the literature; however, in many cases researchers do not explicitly indicate the domain where to use artificial intelligence algorithms. We conclude the paper by distilling several lessons learned and implications for software engineering researchers.

Download PDF

[J48] EMSE 2022

FindICI: Using Machine-Learning to Detect Linguistic Inconsistencies between Code and Natural Language Descriptions in Infrastructure-as-Code.*

Springer's Journal of Empirical Software Engineering (EMSE)

Linguistic anti-patterns are recurring poor practices concerning inconsistencies in the naming, documentation, and implementation of an entity. They impede the readability, understandability, and maintainability of source code. This paper attempts to detect linguistic anti-patterns in Infrastructure-as-Code (IaC) scripts used to provision and manage computing environments. In particular, we consider inconsistencies between the logic/body of IaC code units and their short text names. To this end, we propose FindICI, a novel automated approach that employs word embedding and classification algorithms. Download PDF

Journal Empirical Software Engineering N. Borovits, I. Kumara, D. Di Nucci, P. Krishnan, S. Dalla Palma, F. Palomba, D. Tamburri, W.J. van den Heuvel.

FindICI: Using Machine-Learning to Detect Linguistic Inconsistencies between Code and Natural Language Descriptions in Infrastructure-as-Code.*

N. Borovits, I. Kumara, D. Di Nucci, P. Krishnan, S. Dalla Palma, F. Palomba, D. Tamburri, W.J. van den Heuvel. Journal Empirical Software Engineering

Abstract. Linguistic anti-patterns are recurring poor practices concerning inconsistencies in the naming, documentation, and implementation of an entity. They impede the readability, understandability, and maintainability of source code. This paper attempts to detect linguistic anti-patterns in Infrastructure-as-Code (IaC) scripts used to provision and manage computing environments. In particular, we consider inconsistencies between the logic/body of IaC code units and their short text names. To this end, we propose FindICI, a novel automated approach that employs word embedding and classification algorithms. We build and use the abstract syntax tree of IaC code units to create code embeddings used by machine learning techniques to detect inconsistent IaC code units. We evaluated our approach with two experiments on Ansible tasks systematically extracted from open source repositories for various word embedding models and classification algorithms. Classical machine learning models and novel deep learning models with different word embedding methods showed comparable and satisfactory results in detecting inconsistent Ansible tasks related to the top-10 used Ansible modules.

Download PDF

[J47] EMSE 2022

The Making of Accessible Android Applications: An Empirical Study on the State of the Practice.*

Springer's Journal of Empirical Software Engineering (EMSE)

Nowadays, mobile applications represent the principal means to enable human interaction. Being so pervasive, these applications should be made usable for all users: accessibility collects the guidelines that developers should follow to include features allowing users with disabilities (e.g., visual impairments) to better interact with an application. While research in this field is gaining interest, there is still a notable lack of knowledge on how developers practically deal with the problem: (i) whether they are aware and take accessibility guidelines into account when developing apps, (ii) which guidelines are harder for them to implement, and (iii) which tools they use to be supported in this task. Download PDF

Journal Empirical Software Engineering Computer-Human Interaction M. Di Gregorio, D. Di Nucci, F. Palomba, G. Vitiello.

The Making of Accessible Android Applications: An Empirical Study on the State of the Practice.*

M. Di Gregorio, D. Di Nucci, F. Palomba, G. Vitiello. Journal Empirical Software Engineering Computer-Human Interaction

Abstract. Nowadays, mobile applications represent the principal means to enable human interaction. Being so pervasive, these applications should be made usable for all users: accessibility collects the guidelines that developers should follow to include features allowing users with disabilities (e.g., visual impairments) to better interact with an application. While research in this field is gaining interest, there is still a notable lack of knowledge on how developers practically deal with the problem: (i) whether they are aware and take accessibility guidelines into account when developing apps, (ii) which guidelines are harder for them to implement, and (iii) which tools they use to be supported in this task. To bridge the gap of knowledge on the state of the practice concerning the accessibility of mobile applications, we adopt a mixed-method research approach with a twofold goal. We aim to (i) verify how accessibility guidelines are implemented in mobile applications through a coding strategy and (ii) survey mobile developers on the issues and challenges of dealing with accessibility in practice. The key results of the study show that most accessibility guidelines are ignored when developing mobile apps. This behavior is mainly due to the lack of developers’ awareness of accessibility concerns and the lack of tools to support them during the development.

Download PDF

[J46] CCIS 2022

Unsupervised Labor Intelligence Systems: A Detection Approach and Its Evaluation.*

Springer's Communications in Computer and Information Science (CCIS)

In recent years, job advertisements through the web or social media represent an easy way to spread this information. However, social media are often a dangerous showcase of possibly labor exploitation advertisements. This paper aims to determine the potential indicators of labor exploitation for unskilled jobs offered in the Netherlands. Download PDF

Journal Computer-Human Interaction A. Andreou, G. Cascavilla, G. Catolino, F. Palomba, D. Tamburri, W.J. Van Den Heuvel.

Unsupervised Labor Intelligence Systems: A Detection Approach and Its Evaluation.*

A. Andreou, G. Cascavilla, G. Catolino, F. Palomba, D. Tamburri, W.J. Van Den Heuvel. Journal Computer-Human Interaction

Abstract. In recent years, job advertisements through the web or social media represent an easy way to spread this information. However, social media are often a dangerous showcase of possibly labor exploitation advertisements. This paper aims to determine the potential indicators of labor exploitation for unskilled jobs offered in the Netherlands. Specifically, we exploited topic modeling to extract and handle information from textual data about job advertisements for analyzing deceptive and characterizing features. Finally, we use these features to investigate whether automated machine learning methods can predict the risk of labor exploitation by looking at salary discrepancies. The results suggest that some features, e.g., hours and link, need to be carefully monitored. Overall, we obtained encouraging results, i.e., an F1-Score of 61%, thus meaning that Data Science methods and AI approaches can be used to detect labor exploitation, starting from job advertisements, based on the discrepancy of delta salary, possibly representing a revolutionary step.

Download PDF

[J45] JSS 2022

Software Engineering for Quantum Programming: How Far Are We?*

Elsevier's Journal of Systems and Software (JSS)

Quantum computing is no longer only a scientific interest but is rapidly becoming an industrially available technology that can potentially overcome the limits of classical computation. Over the last years, all major companies have provided frameworks and programming languages that allow developers to create their quantum applications. This shift has led to the definition of a new discipline called quantum software engineering, which is demanded to define novel methods for engineering large-scale quantum applications. While the research community is successfully embracing this call, we notice a lack of systematic investigations into the state of the practice of quantum programming. Understanding the challenges that quantum developers face is vital to precisely define the aims of quantum software engineering. Download PDF

Journal Software Quality Empirical Software Engineering M. De Stefano, F. Pecorelli, D. Di Nucci, F. Palomba, A. De Lucia.

Software Engineering for Quantum Programming: How Far Are We?*

M. De Stefano, F. Pecorelli, D. Di Nucci, F. Palomba, A. De Lucia. Journal Software Quality Empirical Software Engineering

Abstract. Quantum computing is no longer only a scientific interest but is rapidly becoming an industrially available technology that can potentially overcome the limits of classical computation. Over the last years, all major companies have provided frameworks and programming languages that allow developers to create their quantum applications. This shift has led to the definition of a new discipline called quantum software engineering, which is demanded to define novel methods for engineering large-scale quantum applications. While the research community is successfully embracing this call, we notice a lack of systematic investigations into the state of the practice of quantum programming. Understanding the challenges that quantum developers face is vital to precisely define the aims of quantum software engineering. Hence, in this paper, we first mine all the GitHub repositories that make use of the most used quantum programming frameworks currently on the market and then conduct coding analysis sessions to produce a taxonomy of the purposes which quantum technologies are used for. In the second place, we conduct a survey study that involves the contributors of the considered repositories and that aims at eliciting the developers’ opinions on the current adoption and challenges of quantum programming. On the one hand, the results achieved highlight that the current adoption of quantum programming is still limited. On the other hand, there are many challenges that the software engineering community should carefully consider: these do not strictly pertain to technical concerns but also socio-technical matters.

Download PDF

[J44] EMSE 2022

Handling Uncertainty in SBSE: A Possibilistic Evolutionary Approach for Code Smells Detection.*

Springer's Journal of Empirical Software Engineering (EMSE)

S. Boutaib, M. Elarbi, S. Bechikh, F. Palomba, L. Ben Said. Journal Software Quality Empirical Software Engineering

Abstract. Code smells, also known as anti-patterns, are poor design or implementation choices that hinder program comprehensibility and maintainability. While several code smell detection methods have been proposed, Mantyla et al. identified the uncertainty issue as one of the major individual human factors that may affect developers' decisions about the smelliness of software classes: they may indeed have different opinions mainly due to their different knowledge and expertise. Unfortunately, almost all the existing approaches assume data perfection and neglect the uncertainty when identifying the labels of the software classes. Ignoring or rejecting any form of uncertainty could lead to a considerable loss of information, which could significantly deteriorate the effectiveness of the detection and identification processes. Inspired by our previous works and motivated by the interesting performance of the PDT (Possibilistic Decision Tree) in classifying uncertain data, we propose ADIPE (Anti-pattern Detection and Identification using Possibilistic decision tree Evolution), a new tool that evolves and optimizes a set of detectors (PDTs) that can effectively deal with uncertainty in software class labels using some concepts from Possibility theory. ADIPE uses a PBE (Possibilistic Base of Examples: a dataset with possibilistic labels) that is built using a set of opinion-based classifiers (i.e., a set of probabilistic classifiers) with the aim of simulating human developers’ uncertainty. A set of advisors and probabilistic classifiers are employed in order to mimic the subjectivity and the doubtfulness of software engineers. A detailed experimental study is conducted to show the merits and outperformance of ADIPE in dealing with uncertainty in code smell detection and identification with respect to four relevant state-of-the-art methods, including the baseline PDT. The experimental study was performed in uncertain and certain environments based on two suitable metrics: PF-measure_dist (Possibilistic F-measure_Distance) and IAC (Information Affinity Criterion), which correspond to the F-measure and Accuracy (PCC) for the certain case. The obtained results for the uncertain environment reveal that, for the detection process, the PF-measure_dist of ADIPE ranges within [0.9047, 0.9285] and its IAC lies within [0.9288, 0.9557], while for the identification process, the PF-measure_dist of ADIPE is in [0.8545, 0.9228] and its IAC lies within [0.8751, 0.933]. ADIPE is able to find 35% more code smells with uncertain data than the second best algorithm (i.e., BLOP). In addition, ADIPE decreases the number of false alarms (i.e., misclassified smelly instances) by 12%. Our proposed approach is also able to identify 43% more smell types than BLOP and decreases the number of false alarms by 32%. Similar results were obtained for the certain environment, which demonstrates the ability of ADIPE to also deal with the certain environment.

Download PDF

[J43] JSS 2022

Just-in-Time Software Vulnerability Detection: Are We There Yet?*

Elsevier's Journal of Systems and Software (JSS)

F. Lomio, E. Iannone, A. De Lucia, F. Palomba, V. Lenarduzzi. Journal Software Quality Empirical Software Engineering

Abstract. Background. Software vulnerabilities are weaknesses in source code that might be exploited to cause harm or loss. Previous work has proposed a number of automated machine learning approaches to detect them. Most of these techniques work at release-level, meaning that they aim at predicting the files that will potentially be vulnerable in a future release. Yet, researchers have shown that a commit-level identification of source code issues might better fit the developer’s needs, speeding up their resolution. Objective. To investigate how currently available machine learning-based vulnerability detection mechanisms can support developers in the detection of vulnerabilities at commit-level. Method. We perform an empirical study where we consider nine projects accounting for 8,991 commits and experiment with eight machine learners built using process, product, and textual metrics. Results. We point out three main findings: (1) basic machine learners rarely perform well; (2) the use of ensemble machine learning algorithms based on boosting can substantially improve the performance; and (3) the combination of more metrics does not necessarily improve the classification capabilities. Conclusion. Further research should focus on just-in-time vulnerability detection, especially with respect to the introduction of smart approaches for feature selection and training strategies.

Download PDF

[J42] EMSE 2022

On the Adequacy of Static Analysis Warnings with Respect to Code Smell Prediction.*

Springer's Journal of Empirical Software Engineering (EMSE)

F. Pecorelli, S. Lujan, V. Lenarduzzi, F. Palomba, A. De Lucia. Journal Software Quality Empirical Software Engineering

Abstract. Code smells are poor implementation choices that developers apply while evolving source code and that affect program maintainability. Multiple automated code smell detectors have been proposed: while most of them relied on heuristics applied over software metrics, a recent trend concerns the definition of machine learning techniques. However, machine learning-based code smell detectors still suffer from low accuracy: one of the causes is the lack of adequate features to feed machine learners. In this paper, we face this issue by investigating the role of static analysis warnings generated by three state-of-the-art tools to be used as features of machine learning models for the detection of seven code smell types. We conduct a three-step study in which we (1) verify the relation between static analysis warnings and code smells and the potential predictive power of these warnings; (2) build code smell prediction models exploiting and combining the most relevant features coming from the first analysis; (3) compare and combine the performance of the best code smell prediction model with the one achieved by a state-of-the-art approach. The results reveal the low performance of the models exploiting static analysis warnings alone, while we observe significant improvements when combining the warnings with additional code metrics. Nonetheless, we still find that the best model does not perform better than a random model, hence leaving open the challenges related to the definition of ad-hoc features for code smell prediction.

Download PDF

[J41] TSE 2022

The Secret Life of Software Vulnerabilities: A Large-Scale Empirical Study.*

IEEE Transactions on Software Engineering (TSE)

E. Iannone, R. Guadagni, F. Ferrucci, A. De Lucia, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. Software vulnerabilities are weaknesses in source code that can be potentially exploited to cause loss or harm. While researchers have been devising a number of methods to deal with vulnerabilities, there is still a noticeable lack of knowledge on their software engineering life cycle, for example how vulnerabilities are introduced and removed by developers. This information can be exploited to design more effective methods for vulnerability prevention and detection, as well as to understand the granularity at which these methods should aim. To investigate the life cycle of known software vulnerabilities, we focus on how, when, and under which circumstances the contributions to the introduction of vulnerabilities in software projects are made, as well as how long they survive and how they are removed. We consider 3,663 vulnerabilities with public patches from the National Vulnerability Database, pertaining to 1,096 open-source software projects on GitHub, and define an eight-step process involving both automated parts (e.g., using a procedure based on the SZZ algorithm to find the vulnerability-contributing commits) and manual analyses (e.g., how vulnerabilities were fixed). The investigated vulnerabilities can be classified into 144 categories, take on average at least 4 contributing commits before being introduced, and half of them remain unfixed for more than one year. Most of the contributions are made by developers with a high workload, often when doing maintenance activities, and vulnerabilities are removed mostly with the addition of new source code aiming at implementing further checks on inputs. We conclude by distilling practical implications on how vulnerability detectors should work to assist developers in timely identifying these issues.

Download PDF

[J40] JSEP 2021

Evolving Software Forges: an Experience Report from Apache Allura.*

Wiley's Journal of Software: Evolution and Process (JSEP)

D. Tamburri, F. Palomba. Journal Socio-Technical Analytics Empirical Software Engineering

Abstract. The open-source phenomenon has reached unimaginable proportions to a point in which it is virtually impossible to find large applications that do not rely on open-source as well. However, such proportions may turn into a risk if the organisational and socio-technical aspects (e.g., the contribution and release schemes) behind open-source communities are not explicitly supported by open-source forges by-design. In an effort to make such aspects explicit and supported by-design in open-source forges, we conducted empirical software engineering as follows: (a) through online industrial surveying, we elicited organisational and social aspects relevant in open-source communities; (b) through action research, we extended a widely known open-source support system and top-level Apache project, Allura; (c) through ethnography, we studied the Allura community and, learning from its social and organisational structure, (d) we elicited a metrics framework that supports more explicit organisational and socio-technical design principles around open-source communities. This article is an experience report on these results and the lessons we learned in obtaining them. We found that the extensions provided to Apache Allura formed the basis for community awareness by design, providing valuable and usable community characteristics. Ultimately, however, the extensions we provided to Apache Allura were de-activated by its core developers because of performance overheads. Our results and lessons learned allow us to provide recommendations for designing forges, like GitHub. Architecting a forge is a participatory process that requires active engagement, hence remarking the need for mechanisms enabling it. At the same time, we conclude that more active support for governance is required to avoid the failure of the forge.

Download PDF

[J39] EMSE 2021

Software Testing and Android Applications: A Large-Scale Empirical Study.*

Springer's Journal of Empirical Software Engineering (EMSE)

F. Pecorelli, G. Catolino, F. Ferrucci, A. De Lucia, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. These days, over three billion users rely on mobile applications (a.k.a. apps) on a daily basis to access high-speed connectivity and all kinds of services it enables, from social to emergency needs. Having high-quality apps is therefore a vital requirement for developers to keep staying on the market and acquire new users. For this reason, the research community has been devising automated strategies to better test these applications. Despite the effort spent so far, most developers write their test cases manually without the adoption of any tool. Nevertheless, we still observe a lack of knowledge on the quality of these manually written tests: an enhanced understanding of this aspect may provide evidence-based findings on the current status of testing in the wild and point out future research directions to better support the daily activities of mobile developers. We perform a large-scale empirical study targeting 1,693 open-source Android apps and aiming at assessing (1) the extent to which these apps are actually tested, (2) how well-designed are the available tests, (3) what is their effectiveness, and (4) how well manual tests can reduce the risk of having defects in production code. In addition, we conduct a focus group with 5 Android testing experts to discuss the findings achieved and gather insights into the next research avenues to undertake. The key results of our study show Android apps are poorly tested and the available tests have low (i) design quality, (ii) effectiveness, and (iii) ability to find defects in production code. Among the various suggestions, testing experts report the need for improved mechanisms to locate potential defects and deal with the complexity of creating tests that effectively exercise the production code.

Download PDF

[J38] IST 2021

On the Impact of Continuous Integration on Refactoring Practice: An Exploratory Study on TravisTorrent.*

Elsevier's Information and Software Technology (IST)

I. Saidani, A. Ouni, M. Mkaouer, F. Palomba. Journal Software Quality Empirical Software Engineering

Abstract. The ultimate goal of Continuous Integration (CI) is to support developers in integrating changes into production constantly and quickly through an automated build process. While CI provides developers with prompt feedback on several quality dimensions after each change, such frequent and quick changes may in turn compromise software quality without refactoring. Indeed, recent work emphasized the potential of CI in changing the way developers perceive and apply refactoring. However, we still lack empirical evidence to confirm or refute this assumption. We aim to explore and understand the evolution of refactoring practices, in terms of frequency, size and involved developers, after the switch to CI in order to emphasize the role of this process in changing the way refactoring is applied. We collect a corpus of 99,545 commits and 89,926 refactoring operations extracted from 39 open-source GitHub projects that adopt Travis CI and analyze the changes using Multiple Regression Analysis (MRA). Our study delivers several important findings. We found that the adoption of CI is associated with a drop in the refactoring size as recommended, while refactoring frequency as well as the number (and its related rate) of developers that perform refactoring are estimated to decrease after the shift to CI, indicating that refactoring is less likely to be applied in a CI context. Our study uncovers insights about CI theory and practice and adds evidence to existing knowledge about CI practices related especially to quality assurance. Software developers need more customized refactoring tool support in the context of CI to better maintain and evolve their software systems.

Download PDF

[J37] IST 2021

The Do's and Don'ts of Infrastructure Code: a Systematic Grey Literature Review.*

Elsevier's Information and Software Technology (IST)

I. Kumara, M. Garriga, A. Romeu, D. Di Nucci, F. Palomba, D. Tamburri, W. J. van den Heuvel. Journal Empirical Software Engineering

Abstract. Infrastructure-as-code (IaC) is the DevOps tactic of managing and provisioning software infrastructures through machine-readable definition files, rather than manual hardware configuration or interactive configuration tools. From a maintenance and evolution perspective, the topic has piqued the interest of practitioners and academics alike, given the relative scarcity of supporting patterns and practices in the academic literature. At the same time, a considerable amount of grey literature exists on IaC. Thus, we aim to characterize IaC and compile a catalog of best and bad practices for widely used IaC languages, all using grey literature materials. In this paper, we systematically analyze the industrial grey literature on IaC, such as blog posts, tutorials, and white papers, using qualitative analysis techniques. We proposed a definition for IaC and distilled a broad catalog summarized in a taxonomy consisting of 10 and 4 primary categories for best practices and bad practices, respectively, both language-agnostic and language-specific ones, for three IaC languages, namely Ansible, Puppet, and Chef. The practices reflect implementation issues, design issues, and the violation of/adherence to the essential principles of IaC. Our findings reveal critical insights concerning the top languages as well as the best practices adopted by practitioners to address (some of) those challenges. We evidence that the field of IaC development and maintenance is in its infancy and deserves further attention.

Download PDF

[J36] TSE 2021

Within-project Defect Prediction of Infrastructure-as-Code Using Product and Process Metrics.*

IEEE Transactions on Software Engineering (TSE)

S. Dalla Palma, D. Di Nucci, F. Palomba, D. Tamburri. Journal Empirical Software Engineering

Abstract. Infrastructure-as-code (IaC) is the DevOps practice enabling management and provisioning of infrastructure through the definition of machine-readable files, hereinafter referred to as IaC scripts. Similarly to other source code artefacts, these files may contain defects that can preclude their correct functioning. In this paper, we aim at assessing the role of product and process metrics when predicting defective IaC scripts. We propose a fully integrated machine-learning framework for IaC Defect Prediction that allows for repository crawling, metrics collection, model building, and evaluation. To evaluate it, we analyzed 104 projects and employed five machine-learning classifiers to compare their performance in flagging suspicious defective IaC scripts. The key results of the study report Random Forest as the best-performing model, with a median AUC-PR of 0.93 and MCC of 0.80. Furthermore, at least for the collected projects, product metrics identify defective IaC scripts more accurately than process metrics. Our findings set a baseline for investigating IaC Defect Prediction and the relationship between product and process metrics and the quality of IaC scripts.

Download PDF

[J35] EMSE 2020

The Relation of Test-Related Factors to Software Quality: A Case Study on Apache Systems.*

Springer's Journal of Empirical Software Engineering (EMSE)

F. Pecorelli, F. Palomba, A. De Lucia. Journal Testing Empirical Software Engineering

Abstract. Testing represents a crucial activity to ensure software quality. Recent studies have shown that test-related factors (e.g., code coverage) can be reliable predictors of software code quality, as measured by post-release defects. While these studies provided initial compelling evidence on the relation between tests and post-release defects, they considered different test-related factors separately: as a consequence, there is still a lack of knowledge of whether these factors are still good predictors when considered all together. In this paper, we propose a comprehensive case study on how test-related factors relate to production code quality in Apache systems. We first investigated how the presence of tests relates to post-release defects; then, we analyzed the role played by the test-related factors previously shown as significantly related to post-release defects. The key findings of the study show that, when controlling for other metrics (e.g., size of the production class), test-related factors have a limited connection to post-release defects.

Download PDF

[J34] JSS 2020

Predicting the Emergence of Community Smells using Socio-Technical Metrics: A Machine-Learning Approach.*

Elsevier's Journal of Systems and Software (JSS)

F. Palomba, D. Tamburri. Journal Socio-Technical Analytics Empirical Software Engineering

Abstract. Community smells represent sub-optimal conditions appearing within software development communities (e.g., non-communicating sub-teams, deviant contributors, etc.) that may lead to the emergence of social debt and increase the overall project's cost. Previous work has studied these smells under different perspectives, investigating their nature, diffuseness, and impact on technical aspects of source code. Furthermore, it has been shown that some socio-technical metrics like, for instance, the well-known socio-technical congruence, can potentially be employed to foresee their appearance. Yet, there is still a lack of knowledge of the actual predictive power of such socio-technical metrics. In this paper, we aim at tackling this problem by empirically investigating (i) the potential value of socio-technical metrics as predictors of community smells and (ii) what is the performance of within- and cross-project community smell prediction models based on socio-technical metrics. To this aim, we exploit a dataset composed of 60 open-source projects and consider four community smells such as Organizational Silo, Black Cloud, Lone Wolf, and Bottleneck. The key results of our work report that a within-project solution can reach F-Measure and AUC-ROC of 77% and 78%, respectively, while cross-project models still require improvements, being however able to reach an F-Measure of 62% and overcome a random baseline. Among the metrics investigated, socio-technical congruence, communicability, and turnover-related metrics are the most powerful predictors of the emergence of community smells.

Download PDF

[J33] ESWA 2020

Code Smell Detection and Identification in Imbalanced Environments.*

Elsevier's Expert Systems with Applications (ESWA)

S. Boutaib, S. Bechikh, F. Palomba, M. Elarbi, L. Ben Said. Journal Software Quality Empirical Software Engineering

Abstract. Code smells are sub-optimal design choices that could lower software maintainability. Previous literature did not consider an important characteristic of the smell detection problem, namely data imbalance. When considering a high number of code smell types, the number of smelly classes is likely to largely exceed the number of non-smelly ones, and vice versa. Moreover, most studies did address the smell identification problem, which is more likely to present a higher imbalance as the number of smelly classes is relatively much smaller than the number of non-smelly ones. Furthermore, an additional research gap in the literature consists in the fact that the number of smell type identification methods is very small compared to the detection ones. The main challenges in smell detection and identification in an imbalanced environment are: (1) the structuring of the smell detector that should be able to deal with complex splitting boundaries and small disjuncts, (2) the design of the detector quality evaluation function that should take into account data imbalance, and (3) the efficient search for effective software metrics' thresholds that should well characterize the different smells. We propose ADIODE, an effective search-based engine that is able to deal with all the above-described challenges not only for the smell detection case but also for the identification one. Indeed, ADIODE is an EA (Evolutionary Algorithm) that evolves a population of detectors encoded as ODTs (Oblique Decision Trees) using the F-measure as a fitness function. This allows ADIODE to efficiently approximate globally-optimal detectors with effective oblique splitting hyper-planes and metrics’ thresholds.
Results. A comparative experimental study on six open-source software systems demonstrates the merits of our approach, which outperforms four of the most representative and prominent baseline techniques available in the literature. The detection results show that the F-measure of ADIODE ranges between 91.23% and 95.24%, and its AUC lies between 0.9273 and 0.9573. Similarly, the identification results indicate that the F-measure of ADIODE varies between 86.26% and 94.5%, and its AUC is between 0.8653 and 0.9531.
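
To illustrate why the F-measure is a more informative fitness function than accuracy on imbalanced data, and what an oblique split looks like, here is a minimal Python sketch. It is not the ADIODE implementation: the function names, metric values, weights, and threshold are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def oblique_split(metrics: Sequence[float], weights: Sequence[float], threshold: float) -> bool:
    """Oblique test: compare a weighted sum of several metrics with a threshold.

    An axis-parallel tree would test a single metric per node instead.
    """
    return sum(w * x for w, x in zip(weights, metrics)) > threshold


def f_measure(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Harmonic mean of precision and recall for the positive (smelly) class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(predicted: Sequence[bool], actual: Sequence[bool]) -> float:
    """Fraction of classes labelled correctly, regardless of class."""
    return sum(1 for p, a in zip(predicted, actual) if p == a) / len(actual)


# Hypothetical data set: 2 smelly classes out of 10.
actual = [True, True] + [False] * 8
always_clean = [False] * 10  # a detector that never reports a smell

print(accuracy(always_clean, actual))   # high despite finding nothing
print(f_measure(always_clean, actual))  # zero: no smelly class was found
```

A detector that never reports a smell still scores well on accuracy here, while its F-measure is zero, which is the kind of behavior a fitness function for imbalanced data needs to penalize.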

Download PDF

[J32] JSS 2020

Towards a Catalogue of Software Quality Metrics for Infrastructure Code.*

Elsevier's Journal of Systems and Software (JSS)

Infrastructure-as-code (IaC) is a practice to implement continuous deployment by allowing management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. Download PDF

Journal Empirical Software Engineering S. Dalla Palma, D. Di Nucci, F. Palomba, D. Tamburri.

Abstract. Infrastructure-as-code (IaC) is a practice to implement continuous deployment by allowing management and provisioning of infrastructure through the definition of machine-readable files and automation around them, rather than physical hardware configuration or interactive configuration tools. On the one hand, although IaC is an increasingly widely adopted practice, little is known about how to best maintain, speedily evolve, and continuously improve the code behind the IaC practice in a measurable fashion. On the other hand, source code measurements are often computed and analyzed to evaluate the different quality aspects of the software developed. However, unlike general-purpose programming languages (GPLs), IaC scripts use domain-specific languages, and metrics used for GPLs may not be applicable to IaC scripts. This article proposes a catalogue consisting of 46 metrics to identify IaC properties focusing on Ansible, one of the most popular IaC languages to date, and shows how they can be used to analyze IaC scripts.

Download PDF

[J31] SPE 2020

"The Canary in the Coal Mine..." A Cautionary Tale from the Decline of SourceForge.*

Wiley's Software: Practice and Experience (SPE)

Forges are online collaborative platforms to support the development of distributed open-source software. While once mighty keepers of open-source vitality, software forges are rapidly becoming less and less relevant. For example, of the top 10 forges in 2011, only one survives today — SourceForge — the biggest of them all, but its numbers are dropping and its community is tenuous at best. Download PDF

Journal Socio-Technical Analytics Empirical Software Engineering D. Tamburri, K. Blincoe, F. Palomba, R. Kazman.

"The Canary in the Coal Mine..." A Cautionary Tale from the Decline of SourceForge.*

D. Tamburri, K. Blincoe, F. Palomba, R. Kazman. Journal Socio-Technical Analytics Empirical Software Engineering

Abstract. Forges are online collaborative platforms to support the development of distributed open-source software. While once mighty keepers of open-source vitality, software forges are rapidly becoming less and less relevant. For example, of the top 10 forges in 2011, only one survives today, SourceForge, the biggest of them all, but its numbers are dropping and its community is tenuous at best. Through mixed-methods research, this manuscript chronicles and analyzes the software practice and experiences of the project's history, in particular its architectural and community/organizational decisions. We discovered a number of sub-optimal social and architectural decisions and circumstances that may have led to SourceForge's demise. In addition, we found evidence suggesting that the impact of such decisions could have been monitored, reduced, and possibly avoided altogether. The use of socio-technical insights needs to become a basic set of design and software/organization monitoring principles that tell a cautionary tale on what to measure and what not to do in the context of large-scale software forge and community design and management.

Download PDF

[J30] TEM 2020

Success and Failure in Software Engineering: A Followup Systematic Literature Review.*

IEEE Transactions on Engineering Management (TEM)

Success and failure in software engineering are still among the least understood phenomena in the discipline. In a recent special journal issue on the topic, Mantyla et al. started discussing these topics from different angles; the authors focused their contributions on offering a general overview of both topics without deeper detail. Recognising the importance and impact of the topic, we have executed a followup, more in-depth systematic literature review with additional analyses beyond what was previously provided. Download PDF

Journal Socio-Technical Analytics Systematic Literature Review D. Tamburri, F. Palomba, R. Kazman.

Abstract. Success and failure in software engineering are still among the least understood phenomena in the discipline. In a recent special journal issue on the topic, Mantyla et al. started discussing these topics from different angles; the authors focused their contributions on offering a general overview of both topics without deeper detail. Recognising the importance and impact of the topic, we have executed a followup, more in-depth systematic literature review with additional analyses beyond what was previously provided. These new analyses offer: (a) a grounded theory of success and failure factors, harvesting over 500 factors from the literature; (b) 14 manually-validated clusters of factors that provide relevant areas for success- and failure-specific measurement and risk-analysis; (c) a quality model composed of previously unmeasured organizational structure quantities which are germane to software product, process, and community quality. We show that the topics of success and failure deserve further study as well as further automated tool support, e.g., monitoring tools and metrics able to track the factors and patterns emerging from our study. This paper provides managers with risks as well as a more fine-grained analysis of the parameters that can be appraised to anticipate the risks.

Download PDF

[J29] JSS 2019

On the Performance of Method-Level Defect Prediction: A Negative Result.*

Elsevier's Journal of Systems and Software (JSS)

Bug prediction is aimed at identifying software artifacts that are more likely to be defective in the future. Most approaches defined so far target the prediction of bugs at class/file level. Nevertheless, past research has provided evidence that this granularity is too coarse-grained for its use in practice. Download PDF

Journal Software Quality Empirical Software Engineering L. Pascarella, F. Palomba, A. Bacchelli.

Abstract. Bug prediction is aimed at identifying software artifacts that are more likely to be defective in the future. Most approaches defined so far target the prediction of bugs at class/file level. Nevertheless, past research has provided evidence that this granularity is too coarse-grained for its use in practice. As a consequence, researchers have started proposing defect prediction models targeting a finer granularity (particularly method-level granularity), providing promising evidence that it is possible to operate at this level. Particularly, models mixing product and process metrics provided the best results. We present a study in which we first replicate previous research on method-level bug-prediction, by using different systems and timespans. Afterwards, based on the limitations of existing research, we (1) re-evaluate method-level bug prediction models more realistically and (2) analyze whether alternative features based on textual aspects, code smells, and developer-related factors can be exploited to improve method-level bug prediction abilities. Key results of our study include that (1) the performance of the previously proposed models, tested using the same strategy but on different systems/timespans, is confirmed; but, (2) when evaluated with a more practical strategy, all the models show a dramatic drop in performance, with results close to that of a random classifier. Finally, we find that (3) the contribution of alternative features within such models is limited and unable to improve the prediction capabilities significantly. As a consequence, our replication and negative results indicate that method-level bug prediction is still an open challenge.

Download PDF

[J28] IEEE SW 2019

Gender Diversity and Community Smells: Insights from the Trenches.*

IEEE Software

Effective communication and organization within a software development team might influence the quality of both the software development process and the software created. Download PDF

Journal Socio-Technical Analytics Empirical Software Engineering G. Catolino, F. Palomba, D. A. Tamburri, A. Serebrenik, F. Ferrucci.

Abstract. Effective communication and organization within a software development team might influence the quality of both the software development process and the software created. It is estimated that the consequences of poor communication in terms of cost reached $37 billion for companies. This motivated research to understand so-called "social debt", meant as the presence of non-cohesive development communities whose members have communication or coordination issues, and to identify community smells, namely socio-technical characteristics and patterns, which may lead to the emergence of social and technical debt. In this study, we triangulate the results previously obtained by surveying 60 software practitioners to understand the dimensions and presumed importance of gender diversity, but also whether there are additional factors to consider to reduce community smells. As a result, we found that practitioners seem not to perceive the phenomenon of gender diversity as an important factor to mitigate the presence of community smells. Nevertheless, practitioners who consider this as an important factor tried to strongly motivate their considerations. Finally, as the main takeaway message from the survey, we found that most of the participants suggest taking into account communication skills when hiring and managing teams.

Download PDF

[J27] EMSE 2019

How Developers Engage with Static Analysis Tools in Different Contexts.*

Springer's Journal of Empirical Software Engineering (EMSE)

Automatic static analysis tools (ASATs) are instruments that support code quality assessment by automatically detecting defects and design issues. Despite their popularity, they are characterized by (i) a high false positive rate and (ii) the low comprehensibility of the generated warnings.

Most Influential 5-Years Journal First Paper on Software Testing

Download PDF

Journal Software Quality Empirical Software Engineering C. Vassallo, S. Panichella, F. Palomba, S. Proksch, H. Gall, A. Zaidman.

Abstract. Automatic static analysis tools (ASATs) are instruments that support code quality assessment by automatically detecting defects and design issues. Despite their popularity, they are characterized by (i) a high false positive rate and (ii) the low comprehensibility of the generated warnings. However, no prior studies have investigated the usage of ASATs in different development contexts (e.g., code reviews, regular development), nor how open source projects integrate ASATs into their workflows. These perspectives are paramount to improve the prioritization of the identified warnings. To shed light on the actual ASATs usage practices, in this paper we first survey 56 developers (66% from industry and 34% from open source projects) and interview 11 industrial experts leveraging ASATs in their workflow with the aim of understanding how they use ASATs in different contexts. Furthermore, to investigate how ASATs are being used in the workflows of open source projects, we manually inspect the contribution guidelines of 176 open-source systems and extract the ASATs’ configuration and build files from their corresponding GitHub repositories. Our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context; (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming; and (iii) 66% of the projects define how to use specific ASATs, but only 37% enforce their usage for new contributions. The perceived relevance of ASATs varies between different projects and domains, which is a sign that ASATs use is still not a common practice. In conclusion, this study confirms previous findings on the unwillingness of developers to configure ASATs and it emphasizes the necessity to improve existing strategies for the selection and prioritization of ASATs warnings that are shown to developers.

Download PDF

[J26] JSS 2019

Scented Since the Beginning: On the Diffuseness of Test Smells in Automatically Generated Test Code.*

Elsevier's Journal of Systems and Software (JSS)

Software testing represents a key software engineering practice to ensure source code quality and reliability. To support developers in this activity and reduce testing effort, several automated unit test generation tools have been proposed. Most of these approaches have the main goal of covering as many branches as possible. Download PDF

Journal Software Testing Empirical Software Engineering G. Grano, F. Palomba, D. Di Nucci, A. De Lucia, H. Gall.

Abstract. Software testing represents a key software engineering practice to ensure source code quality and reliability. To support developers in this activity and reduce testing effort, several automated unit test generation tools have been proposed. Most of these approaches have the main goal of covering as many branches as possible. While these approaches have good performance, little is known about the maintainability of the test code they produce, i.e., whether the generated tests have good code quality and whether they introduce issues threatening their effectiveness. To bridge this gap, in this paper we study to what extent existing automated test case generation tools produce potentially problematic test code. We consider seven test smells, i.e., suboptimal design choices applied by programmers during the development of test cases, as a measure of the code quality of the generated tests, and evaluate their diffuseness in the unit test classes automatically generated by three state-of-the-art tools, namely Randoop, JTExpert, and EvoSuite. Moreover, we investigate whether there are characteristics of test and production code influencing the generation of smelly tests. Our study shows that all the considered tools tend to generate a high quantity of two specific test smell types, i.e., Assertion Roulette and Eager Test, which are those that previous studies showed to negatively impact the reliability of production code. We also discover that test size is correlated with the generation of smelly tests. Based on our findings, we argue that more effective automated generation algorithms that explicitly take into account test code quality should be further investigated and devised.

Download PDF

[J25] EMSE 2019

Third-Party Libraries in Mobile Apps: When, How, and Why Developers Update Them.*

Springer's Journal of Empirical Software Engineering (EMSE)

When developing new software, third-party libraries are commonly used to reduce implementation efforts. However, even these libraries undergo evolution activities to offer new functionalities and fix bugs or security issues. Download PDF

Journal Mobile Apps Evolution Empirical Software Engineering P. Salza, F. Palomba, D. Di Nucci, A. De Lucia, F. Ferrucci.

Abstract. When developing new software, third-party libraries are commonly used to reduce implementation efforts. However, even these libraries undergo evolution activities to offer new functionalities and fix bugs or security issues. The research community has mainly investigated third-party libraries in the context of desktop applications, while only little is known regarding the mobile context. In this paper, we bridge this gap by investigating when, how, and why mobile developers update third-party libraries. By mining 2752 mobile apps, we study (i) whether mobile developers update third-party libraries, (ii) how much such apps lag behind the latest version of their dependencies, (iii) which categories of libraries are more prone to be updated, and (iv) what common patterns developers follow when updating a library. Then, we perform a survey with 73 mobile developers that aims at shedding light on the reasons why they update (or not) third-party libraries. We find that mobile developers rarely update libraries, and when they do, they mainly tend to update libraries related to the Graphical User Interface. Avoiding bug propagation and making the app compatible with new Android releases are the top reasons why developers update their libraries.

Download PDF

[J24] EMSE 2019

Improving Change Prediction Models with Code Smells-Related Information.*

Springer's Journal of Empirical Software Engineering (EMSE)

Code smells are sub-optimal implementation choices applied by developers that have the effect of negatively impacting, among others, the change-proneness of the affected classes. Download PDF

Journal Software Quality Empirical Software Engineering G. Catolino, F. Palomba, F. Arcelli Fontana, A. De Lucia, A. Zaidman, F. Ferrucci.

Abstract. Code smells are sub-optimal implementation choices applied by developers that have the effect of negatively impacting, among others, the change-proneness of the affected classes. Based on this consideration, in this paper we conjecture that code smell-related information can be effectively exploited to improve the performance of change prediction models, i.e., models having the goal of indicating which classes are more likely to change in the future. We exploit the so-called intensity index, a previously defined metric that captures the severity of a code smell, and evaluate its contribution when added as an additional feature in the context of three state-of-the-art change prediction models based on product, process, and developer-based features. We also compare the performance achieved by the proposed model with a model based on previously defined antipattern metrics, a set of indicators computed considering the history of code smells in files. Our results report that (i) the prediction performance of the intensity-including models is statistically better than the baselines and (ii) the intensity is a better predictor than antipattern metrics. We observed some orthogonality between the set of change-prone and non-change-prone classes correctly classified by the models relying on intensity and antipattern metrics: for this reason, we also devise and evaluate a smell-aware combined change prediction model including product, process, developer-based, and smell-related features. We show that the F-Measure of this model is notably higher than that of the other models.

Download PDF

[J23] SCP 2019

A Large-Scale Empirical Exploration on Refactoring Activities in Open Source Software Projects.*

Elsevier's Science of Computer Programming (SCP)

Refactoring is a well-established practice that aims at improving the internal structure of a software system without changing its external behavior. Existing literature provides evidence of how and why developers perform refactoring in practice. Download PDF

Journal Software Quality Empirical Software Engineering C. Vassallo, G. Grano, F. Palomba, H. Gall, A. Bacchelli.

Abstract. Refactoring is a well-established practice that aims at improving the internal structure of a software system without changing its external behavior. Existing literature provides evidence of how and why developers perform refactoring in practice. In this paper, we continue on this line of research by performing a large-scale empirical analysis of refactoring practices in 200 open source systems. Specifically, we analyze the change history of these systems at commit level to investigate (i) whether developers perform refactoring operations and, if so, which are more diffused; (ii) when refactoring operations are applied; and (iii) which are the main developer-oriented factors leading to refactoring. Based on our results, future research can focus on enabling automatic support for less frequent refactorings and on recommending refactorings based on the developer’s workload, project’s maturity and developer’s commitment to the project.

Download PDF BibTeX

@article{vassallo2019large,
  title={A large-scale empirical exploration on refactoring activities in open source software projects},
  author={Vassallo, Carmine and Grano, Giovanni and Palomba, Fabio and Gall, Harald C and Bacchelli, Alberto},
  journal={Science of Computer Programming},
  volume={180},
  pages={1--15},
  year={2019},
  publisher={Elsevier}
}

[J22] JSS 2019 Recommended

Not All Bugs Are the Same: Understanding, Characterizing, and Classifying Bug Types.*

Elsevier's Journal of Systems and Software (JSS)

Modern version control systems, e.g., GitHub, include bug tracking mechanisms that developers can use to highlight the presence of bugs. This is done by means of bug reports, i.e., textual descriptions reporting the problem and the steps that led to a failure. Download PDF

Journal Software Quality Empirical Software Engineering G. Catolino, F. Palomba, A. Zaidman, F. Ferrucci.

Abstract. Modern version control systems, e.g., GitHub, include bug tracking mechanisms that developers can use to highlight the presence of bugs. This is done by means of bug reports, i.e., textual descriptions reporting the problem and the steps that led to a failure. In past and recent years, the research community deeply investigated methods for easing bug triage, that is, the process of assigning the fixing of a reported bug to the most qualified developer. Nevertheless, only a few studies have reported on how to support developers in the process of understanding the type of a reported bug, which is the first and most time-consuming step to perform before assigning a bug-fix operation. In this paper, we target this problem in two ways: first, we analyze 1,280 bug reports of 119 popular projects belonging to three ecosystems, namely Mozilla, Apache, and Eclipse, with the aim of building a taxonomy of the types of reported bugs; then, we devise and evaluate an automated classification model able to classify reported bugs according to the defined taxonomy. As a result, we found nine main common bug types over the considered systems. Moreover, our model achieves high F-Measure and AUC-ROC (64% and 74% overall, respectively).

Download PDF BibTeX

@article{catolino2019not,
  title={Not all bugs are the same: Understanding, characterizing, and classifying bug types},
  author={Catolino, Gemma and Palomba, Fabio and Zaidman, Andy and Ferrucci, Filomena},
  journal={Journal of Systems and Software},
  volume={152},
  pages={165--181},
  year={2019},
  publisher={Elsevier}
}

[J21] TSE 2019 Recommended

Lightweight Assessment of Test-Case Effectiveness using Source-Code-Quality Indicators.*

IEEE Transactions on Software Engineering (TSE)

Test cases are crucial to help developers prevent the introduction of software faults. Unfortunately, not all the tests are properly designed or can effectively capture faults in production code. Download PDF

Journal Software Testing Empirical Software Engineering G. Grano, F. Palomba, H. Gall.

Abstract. Test cases are crucial to help developers prevent the introduction of software faults. Unfortunately, not all the tests are properly designed or can effectively capture faults in production code. Some measures have been defined to assess test-case effectiveness: the most relevant one is the mutation score, which highlights the quality of a test by generating the so-called mutants, i.e., variations of the production code that make it faulty and that the test is supposed to identify. However, previous studies revealed that mutation analysis is extremely costly and hard to use in practice. The approaches proposed by researchers so far have not been able to provide practical gains in terms of mutation testing efficiency. This leaves the problem of efficiently assessing test-case effectiveness still open. In this paper, we investigate a novel, orthogonal, and lightweight methodology to assess test-case effectiveness: in particular, we study the feasibility of exploiting production and test-code-quality indicators to estimate the mutation score of a test case. We first select a set of 67 factors and study their relation with test-case effectiveness. Then, we devise a mutation score prediction model exploiting such factors and investigate its performance as well as its most relevant features. The key results of the study reveal that our prediction model, based only on static features, achieves 86% for both F-Measure and AUC-ROC. This means that we can estimate the test-case effectiveness, using source-code-quality indicators, with high accuracy and without executing the tests. As a consequence, we can provide a practical approach that is beyond the typical limitations of current mutation testing techniques.
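
For readers unfamiliar with the mutation score mentioned above, it is commonly computed as the fraction of generated mutants that a test detects ("kills"). The Python sketch below illustrates that ratio only. The function name and the boolean-list input are hypothetical, equivalent mutants are ignored for simplicity, and this is not the prediction model proposed in the paper.

```python
from typing import Sequence


def mutation_score(killed_flags: Sequence[bool]) -> float:
    """Return the share of mutants killed by a test.

    killed_flags[i] is True when the test fails on mutant i (the mutant is
    "killed") and False when the mutant survives.
    """
    if not killed_flags:
        raise ValueError("at least one mutant is required")
    return sum(1 for killed in killed_flags if killed) / len(killed_flags)


# Hypothetical outcome of running one test against four mutants.
print(mutation_score([True, True, False, True]))
```

Obtaining each flag requires running the test against a separate mutated build, which is where the cost discussed in the abstract comes from.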

Download PDF BibTeX

@article{grano2019lightweight,
  title={Lightweight Assessment of Test-Case Effectiveness using Source-Code-Quality Indicators},
  author={Grano, Giovanni and Palomba, Fabio and Gall, Harald C},
  journal={IEEE Transactions on Software Engineering},
  year={2019},
  publisher={IEEE}
}

[J20] TSE 2019

Exploring Community Smells in Open-Source: An Automated Approach.*

IEEE Transactions on Software Engineering

Software engineering is now more than ever a community effort. Its success often weighs on balancing distance, culture, global engineering practices and more. Download PDF

Journal Socio-Technical Analytics Empirical Software Engineering D. A. Tamburri, F. Palomba, R. Kazman.

Abstract. Software engineering is now more than ever a community effort. Its success often weighs on balancing distance, culture, global engineering practices and more. In this scenario many unforeseen socio-technical events may result in additional project cost or “social” debt, e.g., sudden, collective employee turnover. With industrial research we discovered community smells, that is, sub-optimal patterns across the organisational and social structure in a software development community that are precursors of such nasty socio-technical events. To understand the impact of community smells at large, in this paper we first introduce CODEFACE4SMELLS, an automated approach able to identify four community smell types that reflect socio-technical issues that have been shown to be detrimental in both the software engineering and organisational research fields. Then, we perform a large-scale empirical study involving over 100 years' worth of releases and communication structures data of 60 open-source communities: we evaluate (i) their diffuseness, i.e., how much they are distributed in open-source, (ii) how developers perceive them, to understand whether practitioners recognize their presence and their negative effects in practice, and (iii) how community smells relate to existing socio-technical factors, with the aim of assessing the inter-relations between them. The key findings of our study highlight that community smells are highly diffused in open-source and are perceived by developers as relevant problems for the evolution of software communities. Moreover, a number of state-of-the-art socio-technical indicators (e.g., socio-technical congruence) can be used to monitor how healthy a community is and possibly avoid the emergence of social debt.

Download PDF BibTeX

@article{tamburri2019exploring,
  title={Exploring Community Smells in Open-Source: An Automated Approach},
  author={Tamburri, Damian Andrew and Palomba, Fabio and Kazman, Rick},
  journal={IEEE Transactions on Software Engineering},
  year={2019},
  publisher={IEEE}
}

[J19] IST 2019

Machine Learning Techniques for Code Smell Detection: A Systematic Literature Review and Meta-Analysis.*

Elsevier's Information and Software Technology

Code smells indicate suboptimal design or implementation choices in the source code that often lead it to be more change- and fault-prone. Download PDF

Journal Software Quality Systematic Literature Review M. I. Azeem, F. Palomba, L. Shi, Q. Wang.

Abstract.
Background: Code smells indicate suboptimal design or implementation choices in the source code that often lead it to be more change- and fault-prone. Researchers defined dozens of code smell detectors, which exploit different sources of information to support developers when diagnosing design flaws. Despite their good accuracy, previous work pointed out three important limitations that might preclude the use of code smell detectors in practice: (i) subjectiveness of developers with respect to code smells detected by such tools, (ii) scarce agreement between different detectors, and (iii) difficulties in finding good thresholds to be used for detection. To overcome these limitations, the use of machine learning techniques represents an ever-increasing research area.
Objective: While the research community carefully studied the methodologies applied by researchers when defining heuristic-based code smell detectors, there is still a noticeable lack of knowledge on how machine learning approaches have been adopted for code smell detection and whether there are points of improvement to allow a better detection of code smells. Our goal is to provide an overview and discuss the usage of machine learning approaches in the field of code smells.
Method: This paper presents a Systematic Literature Review (SLR) on Machine Learning Techniques for Code Smell Detection. Our work considers papers published between 2000 and 2017. Starting from an initial set of 2,456 papers, we found that 15 of them actually adopted machine learning approaches. We studied them under four different perspectives: (i) code smells considered, (ii) setup of machine learning approaches, (iii) design of the evaluation strategies, and (iv) a meta-analysis on the performance achieved by the models proposed so far.
Results: The analyses performed show that God Class, Long Method, Functional Decomposition, and Spaghetti Code have been heavily considered in the literature. Decision Trees and Support Vector Machines are the most commonly used machine learning algorithms for code smell detection. Models based on a large set of independent variables have performed well. JRip and Random Forest are the most effective classifiers in terms of performance. The analyses also reveal the existence of several open issues and challenges that the research community should focus on in the future.
Conclusion: Based on our findings, we argue that there is still room for the improvement of machine learning techniques in the context of code smell detection. The open issues that emerged in this study can serve as input for researchers interested in developing more powerful techniques.

Download PDF BibTeX

@article{azeem2019machine,
  title={Machine learning techniques for code smell detection: A systematic literature review and meta-analysis},
  author={Azeem, Muhammad Ilyas and Palomba, Fabio and Shi, Lin and Wang, Qing},
  journal={Information and Software Technology},
  year={2019},
  publisher={Elsevier}
}

[J17] JSS 2019

Fine-Grained Just-In-Time Defect Prediction.*

Elsevier's Journal of Systems and Software (JSS)

Defect prediction models focus on identifying defect-prone code elements, for example to allow practitioners to allocate testing resources on specific subsystems and to provide assistance during code reviews. Download PDF

Journal Software Quality Empirical Software Engineering L. Pascarella, F. Palomba, A. Bacchelli.

Fine-Grained Just-In-Time Defect Prediction.*

L. Pascarella, F. Palomba, A. Bacchelli. Journal Software Quality Empirical Software Engineering

Abstract. Defect prediction models focus on identifying defect-prone code elements, for example to allow practitioners to allocate testing resources on specific subsystems and to provide assistance during code reviews. While the research community has been highly active in proposing metrics and methods to predict defects over long-term periods (i.e., at release time), a recent trend is represented by the so-called short-term defect prediction (i.e., at commit-level). Indeed, this strategy represents an effective alternative in terms of effort required to inspect files likely affected by defects. Nevertheless, the granularity considered by such models might still be too coarse. Indeed, existing commit-level models highlight an entire commit as defective even in cases where only specific files actually contain defects. In this paper, we first investigate to what extent commits are partially defective; then, we propose a novel fine-grained just-in-time defect prediction model to predict the specific files, contained in a commit, that are defective. Finally, we evaluate our model in terms of (i) performance and (ii) the extent to which it decreases the effort required to diagnose a defect. Our study highlights that: (1) defective commits are frequently composed of a mixture of defective and non-defective files, (2) our fine-grained model can accurately predict defective files with an AUC-ROC up to 82% and (3) our model would allow practitioners to save inspection efforts with respect to standard just-in-time techniques.

Download PDF BibTeX

@article{pascarella2019fine,
  title={Fine-grained just-in-time defect prediction},
  author={Pascarella, Luca and Palomba, Fabio and Bacchelli, Alberto},
  journal={Journal of Systems and Software},
  volume={150},
  pages={22--36},
  year={2019},
  publisher={Elsevier}
}

[J16] IST 2019

A Survey on Software Coupling Relations and Tools*

Elsevier's Information and Software Technology (IST)

Coupling relations reflect the dependencies between software entities and can be used to assess the quality of a program. For this reason, a vast amount of them has been developed, together with tools to compute their related metrics. Download PDF

Journal Software Quality Systematic Literature Review E. Fregnan, T. Baum, F. Palomba, A. Bacchelli.

A Survey on Software Coupling Relations and Tools*

E. Fregnan, T. Baum, F. Palomba, A. Bacchelli. Journal Software Quality Systematic Literature Review

Abstract.
Context: Coupling relations reflect the dependencies between software entities and can be used to assess the quality of a program. For this reason, a vast amount of them has been developed, together with tools to compute their related metrics. However, this makes the coupling measures suitable for a given application challenging to find.
Goals: The first objective of this work is to provide a classification of the different kinds of coupling relations, together with the metrics to measure them. The second consists in presenting an overview of the tools proposed until now by the software engineering academic community to extract these metrics.
Method: This work constitutes a systematic literature review in software engineering. To retrieve the referenced publications, publicly available scientific research databases were used. These sources were queried using keywords inherent to software coupling. We included publications from the period 2002 to 2017 and highly cited earlier publications. A snowballing technique was used to retrieve further related material.
Results: Four groups of coupling relations were found: structural, dynamic, semantic and logical. A fifth set of coupling relations includes approaches too recent to be considered an independent group and measures developed for specific environments. The investigation also retrieved tools that extract the metrics belonging to each coupling group.
Conclusion: This study shows the directions followed by the research on software coupling: e.g., developing metrics for specific environments. Concerning the metric tools, three trends have emerged in recent years: use of visualization techniques, extensibility and scalability. Finally, some coupling metrics applications were presented (e.g., code smell detection), indicating possible future research directions.

Download PDF BibTeX

@article{fregnan2018survey,
  title={A survey on software coupling relations and tools},
  author={Fregnan, Enrico and Baum, Tobias and Palomba, Fabio and Bacchelli, Alberto},
  journal={Information and Software Technology},
  year={2018},
  publisher={Elsevier}
}

[J15] TSE 2019

Beyond Technical Aspects: How Do Community Smells Influence the Intensity of Code Smells?*

IEEE Transactions on Software Engineering (TSE)

Code smells are poor implementation choices applied by developers during software evolution that often lead to critical flaws or failure. Much in the same way, community smells reflect the presence of organizational and socio-technical issues within a software community that may lead to additional project costs. Download PDF

Journal Socio-Technical Analytics Empirical Software Engineering F. Palomba, D. A. Tamburri, F. Arcelli Fontana, R. Oliveto, A. Zaidman, A. Serebrenik.

Beyond Technical Aspects: How Do Community Smells Influence the Intensity of Code Smells?*

F. Palomba, D. A. Tamburri, F. Arcelli Fontana, R. Oliveto, A. Zaidman, A. Serebrenik. Journal Socio-Technical Analytics Empirical Software Engineering

Abstract. Code smells are poor implementation choices applied by developers during software evolution that often lead to critical flaws or failure. Much in the same way, community smells reflect the presence of organizational and socio-technical issues within a software community that may lead to additional project costs. Recent empirical studies provide evidence that community smells are often, if not always, connected to circumstances such as code smells. In this paper we look deeper into this connection by conducting a mixed-methods empirical study of 117 releases from 9 open-source systems. The qualitative and quantitative sides of our mixed-methods study were run in parallel and assume a mutually-confirmative connotation. On the one hand, we survey 162 developers of the 9 considered systems to investigate whether developers perceive a relationship between community smells and the code smells found in those projects. On the other hand, we perform a fine-grained analysis of the 117 releases of our dataset to measure the extent to which community smells impact code smell intensity (i.e., criticality). We then propose a code smell intensity prediction model that relies on both technical and community-related aspects. The results of both sides of our mixed-methods study lead to one conclusion: community-related factors contribute to the intensity of code smells. This conclusion supports the joint use of community and code smells detection as a mechanism for the joint management of technical and social problems around software development communities.

Download PDF BibTeX

@article{palomba2018beyond,
  title={Beyond technical aspects: How do community smells influence the intensity of code smells?},
  author={Palomba, Fabio and Tamburri, Damian Andrew and Fontana, Francesca Arcelli and Oliveto, Rocco and Zaidman, Andy and Serebrenik, Alexander},
  journal={IEEE transactions on software engineering},
  year={2018},
  publisher={IEEE}
}

[J14] EMSE 2019

Discovering Community Patterns in Open-Source: A Systematic Approach and Its Evaluation.*

Springer's Journal of Empirical Software Engineering (EMSE)

The open-source phenomenon has reached the point in which it is virtually impossible to find large applications that do not rely on it. Such grand adoption may turn into a risk if the community regulatory aspects behind open-source work (e.g., contribution guidelines or release schemas) are left implicit and their effect untracked. Download PDF

Journal Socio-Technical Analytics Empirical Software Engineering D. A. Tamburri, F. Palomba, A. Serebrenik, A. Zaidman.

Discovering Community Patterns in Open-Source: A Systematic Approach and Its Evaluation.*

D. A. Tamburri, F. Palomba, A. Serebrenik, A. Zaidman. Journal Socio-Technical Analytics Empirical Software Engineering

Abstract. “There can be no vulnerability without risk; there can be no community without vulnerability; there can be no peace, and ultimately no life, without community.” - [M. Scott Peck]
The open-source phenomenon has reached the point in which it is virtually impossible to find large applications that do not rely on it. Such grand adoption may turn into a risk if the community regulatory aspects behind open-source work (e.g., contribution guidelines or release schemas) are left implicit and their effect untracked. We advocate the explicit study and automated support of such aspects and propose Yoshi (Yielding Open-Source Health Information), a tool able to map open-source communities onto community patterns, sets of known organisational and social structure types and characteristics with measurable core attributes. This mapping is beneficial since it allows, for example, (a) further investigation of community health measuring established characteristics from organisations research, (b) reuse of pattern-specific best-practices from the same literature, and (c) diagnosis of organisational anti-patterns specific to open-source, if any. We evaluate the tool in a quantitative empirical study involving 25 open-source communities from GitHub, finding that the tool offers a valuable basis to monitor key community traits behind open-source development and may form an effective combination with web-portals such as OpenHub or Bitergia. We made the proposed tool open source and publicly available.

Download PDF BibTeX

@article{tamburri2019discovering,
  title={Discovering community patterns in open-source: A systematic approach and its evaluation},
  author={Tamburri, Damian A and Palomba, Fabio and Serebrenik, Alexander and Zaidman, Andy},
  journal={Empirical Software Engineering},
  volume={24},
  number={3},
  pages={1369--1417},
  year={2019},
  publisher={Springer}
}

[J13] IST 2019

On the Impact of Code Smells on the Energy Consumption of Mobile Applications.*

Elsevier's Information and Software Technology (IST)

The demand for green software design is steadily growing, especially in the context of mobile devices, where the computation is often limited by battery life. Previous studies found that poor programming solutions have a strong impact on energy consumption. Download PDF

Journal Software Quality Empirical Software Engineering F. Palomba, D. Di Nucci, A. Panichella, A. Zaidman, A. De Lucia.

On the Impact of Code Smells on the Energy Consumption of Mobile Applications.*

F. Palomba, D. Di Nucci, A. Panichella, A. Zaidman, A. De Lucia. Journal Software Quality Empirical Software Engineering

Abstract.
Context. The demand for green software design is steadily growing, especially in the context of mobile devices, where the computation is often limited by battery life. Previous studies found that poor programming solutions have a strong impact on energy consumption.
Objective. Despite the efforts spent so far, little is known about the influence of code smells, i.e., symptoms of poor design or implementation choices, on the energy consumption of mobile applications.
Method. To provide a wider overview on the relationship between smells and energy efficiency, in this paper we conducted a large-scale empirical study on the influence of 9 Android-specific code smells on the energy consumption of 60 Android apps. In particular, we focus our attention on the design flaws that are theoretically supposed to be related to non-functional attributes of source code, such as performance and energy consumption.
Results. The results of the study highlight that methods affected by four code smell types, i.e., Internal Setter, Leaking Thread, Member Ignoring Method, and Slow Loop, consume up to 87 times more than methods affected by other code smells. Moreover, we found that refactoring these code smells reduces energy consumption in all of the situations.
Conclusions. Based on our findings, we argue that more research aimed at designing automatic refactoring approaches and tools for mobile apps is needed.

Download PDF BibTeX

@article{palomba2019impact,
  title={On the impact of code smells on the energy consumption of mobile applications},
  author={Palomba, Fabio and Di Nucci, Dario and Panichella, Annibale and Zaidman, Andy and De Lucia, Andrea},
  journal={Information and Software Technology},
  volume={105},
  pages={43--55},
  year={2019},
  publisher={Elsevier}
}

[J12] JSS 2018

Enhancing Change Prediction Models using Developer-Related Factors.*

Elsevier's Journal of Systems and Software (JSS)

Continuous changes applied during software maintenance risk deteriorating the structure of a system and threatening its maintainability. In this context, predicting the portions of source code where specific maintenance operations should be focused may be crucial for developers to prevent maintainability issues. Download PDF

Journal Software Quality Empirical Software Engineering G. Catolino, F. Palomba, A. De Lucia, F. Ferrucci, A. Zaidman.

Enhancing Change Prediction Models using Developer-Related Factors.*

G. Catolino, F. Palomba, A. De Lucia, F. Ferrucci, A. Zaidman. Journal Software Quality Empirical Software Engineering

Abstract. Continuous changes applied during software maintenance risk deteriorating the structure of a system and threatening its maintainability. In this context, predicting the portions of source code where specific maintenance operations should be focused may be crucial for developers to prevent maintainability issues. Researchers proposed change prediction models based on product metrics, while recent papers have shown the adaptability of process metrics to the same context. However, we believe that existing approaches still miss an important piece of information, i.e., developer-related factors that are able to capture how complex the development process is under different perspectives. In this paper, we first investigate three change prediction models that exploit developer-related factors (e.g., number of developers working on a class) as predictors of change-proneness of classes and then we compare them with existing models. Our findings reveal that these factors might improve in some cases the capabilities of change prediction models. Moreover, we observed interesting complementarities among the prediction models. For this reason, we devised a novel change prediction model exploiting the combination of developer-related factors and product and evolution metrics. The results show that such a model is up to 20% more effective than the single models in the identification of change-prone classes.

Download PDF BibTeX

@article{catolino2018enhancing,
  title={Enhancing change prediction models using developer-related factors},
  author={Catolino, Gemma and Palomba, Fabio and De Lucia, Andrea and Ferrucci, Filomena and Zaidman, Andy},
  journal={Journal of Systems and Software},
  volume={143},
  pages={14--28},
  year={2018},
  publisher={Elsevier}
}

[J11] IST 2018

A Large-Scale Empirical Study on the Lifecycle of Code Smell Co-occurrences.*

Elsevier's Information and Software Technology (IST)

Code smells are suboptimal design or implementation choices made by programmers during the development of a software system that possibly lead to low code maintainability and higher maintenance costs. Download PDF

Journal Software Quality Empirical Software Engineering F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, A. De Lucia.

A Large-Scale Empirical Study on the Lifecycle of Code Smell Co-occurrences.*

F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, A. De Lucia. Journal Software Quality Empirical Software Engineering

Abstract.
Context. Code smells are suboptimal design or implementation choices made by programmers during the development of a software system that possibly lead to low code maintainability and higher maintenance costs.
Objective. Previous research mainly studied the characteristics of code smell instances affecting a source code file, while only a few studies analyzed the magnitude and effects of smell co-occurrence, i.e., the co-occurrence of different types of smells on the same code component. This paper aims at studying this phenomenon in detail.
Method. We analyzed 13 code smell types detected in 395 releases of 30 software systems to first assess the extent to which code smells co-occur, and then to analyze (i) which code smells co-occur, and (ii) how and why they are introduced and removed by developers.
Results. 59% of smelly classes are affected by more than one smell, and in particular there are six pairs of smell types (e.g., Message Chains and Spaghetti Code) that frequently co-occur. Furthermore, we observed that method-level code smells may be the root cause for the introduction of class-level smells. Finally, code smell co-occurrences are generally removed together as a consequence of other maintenance activities causing the deletion of the affected code components (with a consequent removal of the code smell instances) as well as the result of a major restructuring or scheduled refactoring actions.
Conclusions. Based on our findings, we argue that more research aimed at designing co-occurrence-aware code smell detectors and refactoring approaches is needed.
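The two quantities reported in the Results above (the share of smelly classes affected by more than one smell, and the smell pairs that co-occur) can be illustrated with a minimal sketch. It is not the study's tooling: the class names, the smell assignments, and the `co_occurrence_stats` helper are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical input: each class mapped to the set of smell types detected in it.
smells_by_class = {
    "OrderManager": {"Message Chains", "Spaghetti Code"},
    "ReportBuilder": {"Long Method"},
    "UserSession": {"Message Chains", "Spaghetti Code", "Long Method"},
    "Config": set(),  # smell-free class, excluded from the share
}


def co_occurrence_stats(smells_by_class):
    """Return (share of smelly classes with more than one smell, pair counts)."""
    smelly = [s for s in smells_by_class.values() if s]
    multi = [s for s in smelly if len(s) > 1]
    pair_counts = Counter()
    for smells in multi:
        # Sort so that each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(smells), 2))
    share = len(multi) / len(smelly) if smelly else 0.0
    return share, pair_counts


share, pairs = co_occurrence_stats(smells_by_class)
print(f"{share:.0%} of smelly classes have more than one smell")
print(pairs.most_common(3))
```

Counting unordered pairs per class keeps the sketch simple; an actual analysis would additionally need to decide what frequency threshold makes a pair "frequently" co-occurring.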

Download PDF BibTeX

@article{palomba2018large,
  title={A large-scale empirical study on the lifecycle of code smell co-occurrences},
  author={Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Fasano, Fausto and Oliveto, Rocco and De Lucia, Andrea},
  journal={Information and Software Technology},
  volume={99},
  pages={1--10},
  year={2018},
  publisher={Elsevier}
}

[J10] JSS 2018

Crowdsourcing User Reviews to Support the Evolution of Mobile Apps.*

Elsevier's Journal of Systems and Software (JSS)

In recent software development and distribution scenarios, app stores are playing a major role, especially for mobile apps. On one hand, app stores allow continuous releases of app updates. On the other hand, they have become the premier point of interaction between app providers and users. Download PDF

Journal Mobile Apps Evolution Empirical Software Engineering F. Palomba, M. Linares Vasquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, A. De Lucia.

Crowdsourcing User Reviews to Support the Evolution of Mobile Apps.*

F. Palomba, M. Linares Vasquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, A. De Lucia. Journal Mobile Apps Evolution Empirical Software Engineering

Abstract. In recent software development and distribution scenarios, app stores are playing a major role, especially for mobile apps. On one hand, app stores allow continuous releases of app updates. On the other hand, they have become the premier point of interaction between app providers and users. After installing/updating apps, users can post reviews and provide ratings, expressing their level of satisfaction with apps, and possibly pointing out bugs or desired features. In this paper we empirically investigate—by performing a study on the evolution of 100 open source Android apps and by surveying 73 developers—to what extent app developers take user reviews into account, and whether addressing them contributes to apps’ success in terms of ratings. In order to perform the study, as well as to provide a monitoring mechanism for developers and project managers, we devised an approach, named CRISTAL, for tracing informative crowd reviews onto source code changes, and for monitoring the extent to which developers accommodate crowd requests and follow-up user reactions as reflected in their ratings. The results of our study indicate that (i) on average, half of the informative reviews are addressed, and over 75% of the interviewed developers claimed to take them into account often or very often, and that (ii) developers implementing user reviews are rewarded in terms of significantly increased user ratings.

Download PDF BibTeX

@article{palomba2018crowdsourcing,
  title={Crowdsourcing user reviews to support the evolution of mobile apps},
  author={Palomba, Fabio and Linares-V{\'a}squez, Mario and Bavota, Gabriele and Oliveto, Rocco and Di Penta, Massimiliano and Poshyvanyk, Denys and De Lucia, Andrea},
  journal={Journal of Systems and Software},
  volume={137},
  pages={143--162},
  year={2018},
  publisher={Elsevier}
}

[J9] EMSE 2018 Recommended

On the Diffuseness and the Impact on Maintainability of Code Smells: A Large Scale Empirical Study.*

Springer's Journal of Empirical Software Engineering (EMSE)

Code smells are symptoms of poor design and implementation choices that may hinder code comprehensibility and maintainability. Despite the effort devoted by the research community to studying code smells, the extent to which code smells in software systems affect software maintainability remains unclear. Download PDF

Journal Software Quality Empirical Software Engineering F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, A. De Lucia.

On the Diffuseness and the Impact on Maintainability of Code Smells: A Large Scale Empirical Study.*

F. Palomba, G. Bavota, M. Di Penta, F. Fasano, R. Oliveto, A. De Lucia. Journal Recommended Software Quality Empirical Software Engineering

Abstract. Code smells are symptoms of poor design and implementation choices that may hinder code comprehensibility and maintainability. Despite the effort devoted by the research community to studying code smells, the extent to which code smells in software systems affect software maintainability remains unclear. In this paper we present a large scale empirical investigation on the diffuseness of code smells and their impact on code change- and fault-proneness. The study was conducted across a total of 395 releases of 30 open source projects and considering 17,350 manually validated instances of 13 different code smell kinds. The results show that smells characterized by long and/or complex code (e.g., Complex Class) are highly diffused, and that smelly classes have a higher change- and fault-proneness than smell-free classes.

Download PDF BibTeX

@article{palomba2018diffuseness,
  title={On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation},
  author={Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Fasano, Fausto and Oliveto, Rocco and De Lucia, Andrea},
  journal={Empirical Software Engineering},
  volume={23},
  number={3},
  pages={1188--1221},
  year={2018},
  publisher={Springer}
}

[J8] TSE 2017

Toward a Smell-aware Bug Prediction Model.*

IEEE Transactions on Software Engineering (TSE)

Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. Download PDF

Journal Software Quality Empirical Software Engineering F. Palomba, M. Zanoni, F. Arcelli Fontana, A. De Lucia, R. Oliveto.

Toward a Smell-aware Bug Prediction Model.*

F. Palomba, M. Zanoni, F. Arcelli Fontana, A. De Lucia, R. Oliveto. Journal Software Quality Empirical Software Engineering

Abstract. Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper, we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models based on both product and process metrics, and comparing the results of the new model against the baseline models. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as a predictor. We also compare the results achieved by the proposed model with the ones of an alternative technique which considers metrics about the history of code smells in files, finding that our model works generally better. However, we observed interesting complementarities between the set of buggy and smelly classes correctly classified by the two models. By evaluating the actual information gain provided by the intensity index with respect to the other metrics in the model, we found that the intensity index is a relevant feature for both product and process metrics-based models. At the same time, the metric counting the average number of code smells in previous versions of a class considered by the alternative model is also able to reduce the entropy of the model. On the basis of this result, we devise and evaluate a smell-aware combined bug prediction model that includes product, process, and smell-related features. We demonstrate how such a model classifies bug-prone code components with an F-Measure at least 13% higher than the existing state-of-the-art models.

Download PDF BibTeX

@article{palomba2017toward,
  title={Toward a smell-aware bug prediction model},
  author={Palomba, Fabio and Zanoni, Marco and Fontana, Francesca Arcelli and De Lucia, Andrea and Oliveto, Rocco},
  journal={IEEE Transactions on Software Engineering},
  volume={45},
  number={2},
  pages={194--218},
  year={2017},
  publisher={IEEE}
}

[J7] TSE 2017

The Scent of a Smell: An Extensive Comparison between Textual and Structural Smells.*

IEEE Transactions on Software Engineering (TSE)

Code smells are symptoms of poor design or implementation choices that have a negative effect on several aspects of software maintenance and evolution, such as program comprehension or change- and fault-proneness. This is why researchers have spent a lot of effort on devising methods that help developers to automatically detect them in source code. Download PDF

Journal Software Quality Empirical Software Engineering F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, A. De Lucia.

The Scent of a Smell: An Extensive Comparison between Textual and Structural Smells.*

F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, A. De Lucia. Journal Software Quality Empirical Software Engineering

Abstract. Code smells are symptoms of poor design or implementation choices that have a negative effect on several aspects of software maintenance and evolution, such as program comprehension or change- and fault-proneness. This is why researchers have spent a lot of effort on devising methods that help developers to automatically detect them in source code. Almost all the techniques presented in literature are based on the analysis of structural properties extracted from source code, although alternative sources of information (e.g., textual analysis) for code smell detection have also been recently investigated. Nevertheless, some studies have indicated that code smells detected by existing tools based on the analysis of structural properties are generally ignored (and thus not refactored) by the developers. In this paper, we aim at understanding whether code smells detected using textual analysis are perceived and refactored by developers in the same or different way than code smells detected through structural analysis. To this aim, we set up two different experiments. We have first carried out a software repository mining study to analyze how developers act on textually or structurally detected code smells. Subsequently, we have conducted a user study with industrial developers and quality experts in order to qualitatively analyze how they perceive code smells identified using the two different sources of information. Results indicate that textually detected code smells are easier to identify and for this reason they are considered easier to refactor with respect to code smells detected using structural properties. On the other hand, the latter are often perceived as more severe, but more difficult to exactly identify and remove.

Download PDF BibTeX

@article{palomba2017scent,
  title={The scent of a smell: An extensive comparison between textual and structural smells},
  author={Palomba, Fabio and Panichella, Annibale and Zaidman, Andy and Oliveto, Rocco and De Lucia, Andrea},
  journal={IEEE Transactions on Software Engineering},
  volume={44},
  number={10},
  pages={977--1000},
  year={2017},
  publisher={IEEE}
}

[J6] TETCI 2017

Dynamic Selection of Classifiers in Bug Prediction: An Adaptive Method.*

IEEE Transactions on Emerging Topics in Computational Intelligence (TETCI)

In the last decades the research community has devoted a lot of effort to the definition of approaches able to predict the defect proneness of source code files. Such approaches exploit several predictors (e.g., product or process metrics) and use machine learning classifiers to classify classes as buggy or not buggy, or provide the likelihood that a class will exhibit a fault in the near future. Download PDF

Journal Software Quality Empirical Software Engineering D. Di Nucci, F. Palomba, R. Oliveto, A. De Lucia.

Abstract. In the last decades the research community has devoted a lot of effort in the definition of approaches able to predict the defect proneness of source code files. Such approaches exploit several predictors (e.g., product or process metrics) and use machine learning classifiers to predict classes into buggy or not buggy, or provide the likelihood that a class will exhibit a fault in the near future. The empirical evaluation of all these approaches indicated that there is no machine learning classifier providing the best accuracy in any context, highlighting interesting complementarity among them. For these reasons ensemble methods have been proposed to estimate the bug-proneness of a class by combining the predictions of different classifiers. Following this line of research, in this paper we propose an adaptive method, named ASCI (Adaptive Selection of Classifiers in bug predIction), able to dynamically select among a set of machine learning classifiers the one which better predicts the bug proneness of a class based on its characteristics. An empirical study conducted on 30 software systems indicates that ASCI exhibits higher performances than 5 different classifiers used independently and combined with the majority voting ensemble method.

@article{di2017dynamic,
  title={Dynamic selection of classifiers in bug prediction: An adaptive method},
  author={Di Nucci, Dario and Palomba, Fabio and Oliveto, Rocco and De Lucia, Andrea},
  journal={IEEE Transactions on Emerging Topics in Computational Intelligence},
  volume={1},
  number={3},
  pages={202--212},
  year={2017},
  publisher={IEEE}
}

[J5] TSE 2017

A Developer Centered Bug Prediction Model.*

IEEE Transactions on Software Engineering (TSE)

Journal Software Quality Empirical Software Engineering D. Di Nucci, F. Palomba, G. De Rosa, G. Bavota, R. Oliveto, A. De Lucia.

Abstract. Several techniques have been proposed to accurately predict software defects. These techniques generally exploit characteristics of the code artefacts (e.g., size, complexity, etc.) and/or of the process adopted during their development and maintenance (e.g., the number of developers working on a component) to spot out components likely containing bugs. While these bug prediction models achieve good levels of accuracy, they mostly ignore the major role played by human-related factors in the introduction of bugs. Previous studies have demonstrated that focused developers are less prone to introduce defects than non-focused developers. According to this observation, software components changed by focused developers should also be less error prone than components changed by less focused developers. We capture this observation by measuring the scattering of changes performed by developers working on a component and use this information to build a bug prediction model. Such a model has been evaluated on 26 systems and compared with four competitive techniques. The achieved results show the superiority of our model, and its high complementarity with respect to predictors commonly used in the literature. Based on this result, we also show the results of a “hybrid” prediction model combining our predictors with the existing ones.

@article{di2017developer,
  title={A developer centered bug prediction model},
  author={Di Nucci, Dario and Palomba, Fabio and De Rosa, Giuseppe and Bavota, Gabriele and Oliveto, Rocco and De Lucia, Andrea},
  journal={IEEE Transactions on Software Engineering},
  volume={44},
  number={1},
  pages={5--24},
  year={2017},
  publisher={IEEE}
}

[J4] TSE 2017 Recommended

When and Why Your Code Starts to Smell Bad (and Whether the Smells Go Away).*

IEEE Transactions on Software Engineering (TSE)

Journal Software Quality Empirical Software Engineering M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk.

Abstract. Technical debt is a metaphor introduced by Cunningham to indicate “not quite right code which we postpone making it right”. One noticeable symptom of technical debt is represented by code smells, defined as symptoms of poor design and implementation choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. While the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and why bad smells are introduced, what is their survivability, and how they are removed by developers. To empirically corroborate such anecdotal evidence, we conducted a large empirical study over the change history of 200 open source projects. This study required the development of a strategy to identify smell-introducing commits, the mining of over half a million of commits, and the manual analysis and classification of over 10K of them. Our findings mostly contradict common wisdom, showing that most of the smell instances are introduced when an artifact is created and not as a result of its evolution. At the same time, 80% of smells survive in the system. Also, among the 20% of removed instances, only 9% is removed as a direct consequence of refactoring operations.

@article{tufano2017and,
  title={When and why your code starts to smell bad (and whether the smells go away)},
  author={Tufano, Michele and Palomba, Fabio and Bavota, Gabriele and Oliveto, Rocco and Di Penta, Massimiliano and De Lucia, Andrea and Poshyvanyk, Denys},
  journal={IEEE Transactions on Software Engineering},
  volume={43},
  number={11},
  pages={1063--1088},
  year={2017},
  publisher={IEEE}
}

[J3] JSEP 2017

There and Back Again: Can you Compile that Snapshot?.*

Wiley's Journal of Software: Evolution and Process (JSEP)

Journal Empirical Software Engineering M. Tufano, F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk.

Abstract. A broken snapshot represents a snapshot from a project’s change history that cannot be compiled. Broken snapshots can have significant implications for researchers, as they could hinder any analysis of the past project history that requires code to be compiled. Noticeably, while some broken snapshots may be observable in change history repositories (e.g., no longer available dependencies), some of them may not necessarily happen during the actual development. In this paper we systematically study the compilability of broken snapshots in 219,395 snapshots belonging to 100 Java projects from the Apache Software Foundation, all relying on Maven as an automated build tool. We investigated broken snapshots from two different perspectives: (i) how frequently they happen and (ii) likely causes behind them. The empirical results indicate that broken snapshots occur in most (96%) of the projects we studied and that they are mainly due to problems related to the resolution of dependencies. On average, only 38% of the change history of project systems is currently successfully compilable.

@article{tufano2017there,
  title={There and back again: Can you compile that snapshot?},
  author={Tufano, Michele and Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Oliveto, Rocco and De Lucia, Andrea and Poshyvanyk, Denys},
  journal={Journal of Software: Evolution and Process},
  volume={29},
  number={4},
  pages={e1838},
  year={2017},
  publisher={Wiley Online Library}
}

[J2] JSS 2015

An Experimental Investigation on the Innate Relationship between Quality and Refactoring.*

Elsevier's Journal of Systems and Software (JSS)

Journal Software Quality Empirical Software Engineering G. Bavota, A. De Lucia, M. Di Penta, R. Oliveto, F. Palomba.

Abstract. Previous studies have investigated the reasons behind refactoring operations performed by developers, and proposed methods and tools to recommend refactorings based on quality metric profiles, or on the presence of poor design and implementation choices, i.e., code smells. Nevertheless, the existing literature lacks observations about the relations between metrics/code smells and refactoring activities performed by developers. In other words, the characteristics of code components increasing/decreasing their chances of being object of refactoring operations are still unknown. This paper aims at bridging this gap. Specifically, we mined the evolution history of three Java open source projects to investigate whether refactoring activities occur on code components for which certain indicators—such as quality metrics or the presence of smells as detected by tools—suggest there might be need for refactoring operations. Results indicate that, more often than not, quality metrics do not show a clear relationship with refactoring. In other words, refactoring operations are generally focused on code components for which quality metrics do not suggest there might be need for refactoring operations. Finally, 42% of refactoring operations are performed on code entities affected by code smells. However, only 7% of the performed operations actually remove the code smells from the affected class.

@article{bavota2015experimental,
  title={An experimental investigation on the innate relationship between quality and refactoring},
  author={Bavota, Gabriele and De Lucia, Andrea and Di Penta, Massimiliano and Oliveto, Rocco and Palomba, Fabio},
  journal={Journal of Systems and Software},
  volume={107},
  pages={1--14},
  year={2015},
  publisher={Elsevier}
}

[J1] TSE 2015 Recommended

Mining Version Histories for Detecting Code Smells.*

IEEE Transactions on Software Engineering (TSE)

Journal Software Quality Empirical Software Engineering F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, D. Poshyvanyk, A. De Lucia.

Abstract. Code smells are symptoms of poor design and implementation choices that may hinder code comprehension, and possibly increase change- and fault-proneness. While most of the detection techniques just rely on structural information, many code smells are intrinsically characterized by how code elements change over time. In this paper, we propose HIST (Historical Information for Smell deTection), an approach exploiting change history information to detect instances of five different code smells, namely Divergent Change, Shotgun Surgery, Parallel Inheritance, Blob, and Feature Envy. We evaluate HIST in two empirical studies. The first, conducted on twenty open source projects, aimed at assessing the accuracy of HIST in detecting instances of the code smells mentioned above. The results indicate that the precision of HIST ranges between 72% and 86%, and its recall ranges between 58% and 100%. Also, results of the first study indicate that HIST is able to identify code smells that cannot be identified by competitive approaches solely based on code analysis of a single system’s snapshot. Then, we conducted a second study aimed at investigating to what extent the code smells detected by HIST (and by competitive code analysis techniques) reflect developers’ perception of poor design and implementation choices. We involved twelve developers of four open source projects that recognized more than 75% of the code smell instances identified by HIST as actual design/implementation problems.

@article{palomba2014mining,
  title={Mining version histories for detecting code smells},
  author={Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Oliveto, Rocco and Poshyvanyk, Denys and De Lucia, Andrea},
  journal={IEEE Transactions on Software Engineering},
  volume={41},
  number={5},
  pages={462--489},
  year={2014},
  publisher={IEEE}
}

[C107] Fairness 2026

Bias Ahead: Sensitive Prompts as Early Warnings for Fairness in Large Language Models.*

International Workshop on Fairness in Software Systems (Fairness 2026), Limassol, Cyprus.

Best Paper Award

Conference Empirical Software Engineering G. Voria, M. De Lucia, A. Raia, A. De Lucia, G. Catolino, F. Palomba.

Abstract. Large Language Models (LLMs) are being increasingly integrated into software systems, offering powerful capabilities but also raising concerns about fairness. Existing fairness benchmarks, however, focus on stereotype-specific associations, which limit their ability to anticipate risks in diverse, real-world contexts. In this paper, we propose sensitive prompts as a new abstraction for fairness evaluation: inputs that are not inherently biased but are more likely to elicit biased or inadequate responses due to the sensitivity of their content. We construct and release SENSY, a dataset of 12,801 prompts, categorized as sensitive and non-sensitive, spanning seven thematic domains, combining synthetic generation and real user inputs. Using this dataset, we query three open-source LLMs and manually analyze 4,500 responses to evaluate their adequacy in answering sensitive prompts. Results show that while models often provide factually correct answers, they frequently fail to acknowledge the ethical, relational, or contextual implications of sensitive inputs. In addition, we develop an automated classifier for predicting prompt sensitivity, achieving robust performance on sensitive prompts. Our findings demonstrate that prompt sensitivity can serve as an effective early-warning mechanism for fairness risks in LLMs. This perspective shifts fairness assessment from reactive mitigation toward preventive design, enabling developers to anticipate and manage bias before deployment.

[C106] ICPC 2026

Déjà Vu: A Replication Study on Code Smells and Faults in JavaScript Projects.*

34th IEEE/ACM International Conference on Program Comprehension (ICPC 2026), Rio de Janeiro, Brazil, 2026.

Conference Empirical Software Engineering Software Quality K. Pacifico, G. Giordano, V. Pontillo, M. Di Penta, D. Tamburri, F. Palomba.

Abstract. Code smells are symptoms of poor design choices that can harm software quality. Their relation to fault-proneness has been studied in statically typed languages, such as Java, but less so in dynamic languages like JavaScript, which are becoming increasingly central given their primary role in rendering AI models accessible to their intended audience. Previous work on JavaScript was limited in scope, which affected the generalizability of its findings. This paper replicates and extends Johannes et al.'s study "A Large-scale Empirical Study of Code Smells in JavaScript Projects" to examine how code smells impact the fault-proneness of JavaScript applications. We analyze a large sample of 50 projects and nearly 100k commits across multiple domains, applying survival analysis with Cox models and robustness checks. We confirm that files with smells, such as Variable Re-Assignment, Complex Code, and Conditional Assignment, are more prone to faults. We also find that smell survivability varies across projects and that smells introduced at file creation often persist. These results offer a more ecologically valid and replicable perspective on the impact of code smells on JavaScript systems.

[C105] MSR 2026

Tracing Stereotypes in Pre-trained Transformers: From Biased Neurons to Fairer Models.*

23rd IEEE/ACM International Conference on Mining Software Repositories (MSR 2026), Rio de Janeiro, Brazil, 2026.

Conference Empirical Software Engineering Software Quality G. Voria, M. Openja, F. Khomh, G. Catolino, F. Palomba.

Abstract. The advent of transformer-based language models has reshaped how AI systems process and generate text. In software engineering (SE), these models now support diverse activities, accelerating automation and decision-making. Yet, evidence shows that these models can reproduce or amplify social biases, raising fairness concerns. Recent work on neuron editing has shown that internal activations in pre-trained transformers can be traced and modified to alter model behavior. Building on the concept of knowledge neurons (neurons that encode factual information), we hypothesize the existence of biased neurons that capture stereotypical associations within pre-trained transformers. To test this hypothesis, we build a dataset of biased relations, i.e., triplets encoding stereotypes across nine bias types, and adapt neuron attribution strategies to trace and suppress biased neurons in BERT models. We then assess the impact of suppression on SE tasks. Our findings show that biased knowledge is localized within small neuron subsets, and suppressing them substantially reduces bias with minimal performance loss. This demonstrates that bias in transformers can be traced and mitigated at the neuron level, offering an interpretable approach to fairness in SE.

[C104] TechDebt 2026

Investigating Technical Debt Types, Issues, and Solutions in Serverless Computing.*

9th IEEE/ACM International Conference on Technical Debt (TechDebt 2026), Rio de Janeiro, Brazil, 2026.

Conference Empirical Software Engineering Software Quality H. Perera, Z. Codabux, F. Palomba.

Abstract. Serverless computing is a cloud execution model where developers run code and the server management is handled by the cloud provider. Serverless computing is increasingly gaining popularity as more systems adopt it to enhance scalability and reduce operational costs. While it has numerous benefits, it also embodies unique challenges inherent to serverless computing. One such challenge is Technical Debt (TD), which is exacerbated by the complexities of the serverless paradigm. While prior work has investigated the activities and bad practices that lead to TD in serverless computing, there remains a gap in understanding how TD manifests, the challenges it poses, and the solutions proposed to address TD issues in serverless systems. This study aims to investigate TD in the serverless context using Stack Overflow (SO) as a knowledge base. We collected 78,867 serverless questions on SO and labeled them as TD or non-TD using deep learning. We further conducted a deeper exploration to identify types of TD in serverless settings, issues and proposed solutions, and also explored TD in the code snippets. We found that 37% of serverless questions on SO are TD-related, and that the majority of code snippets contained code smells and security vulnerabilities. We also identified six serverless-specific issues. Our research highlights the need for tools that can effectively detect TD in serverless applications.

[C103] REFSQ 2026

Fairness as a First-Class Requirement: A Fairness Hazard Analysis Approach to Socio-Technical Processes.*

32nd International Working Conference on Requirements Engineering Foundation for Software Quality (REFSQ 2026), Poznan, Poland.

Conference Empirical Software Engineering Socio-Technical Analytics G. Broccia, L. Lelii, R. Cirillo, D. Di Nucci, S. Flicker, F. Palomba, G. Spagnuolo, A. Ferrari.

Abstract. Fairness in socio-technical systems is increasingly recognised as a critical requirement, especially in processes involving human-AI interaction. Fairness hazards are situations or factors that threaten the fair treatment of individuals or groups. If left unaddressed, they can accumulate into systemic bias. Therefore, ensuring fairness must be treated as a first-class requirement during system design, rather than a post-hoc fix. Systematic methods for identifying fairness hazards in socio-technical workflows and translating them into requirements-level mitigations are still missing. We propose Fairness Hazard Analysis (FHA), an adaptation of hazard analysis methods from the safety-critical domain to analyse fairness in socio-technical processes. FHA is demonstrated through an AI-assisted hiring case and supported by HumAInFlow, a modelling and simulation platform. The approach is preliminarily evaluated through two focus groups. The feedback from participants highlights FHA's usefulness for structured fairness analysis, the importance of diverse expertise, and the potential for deeper integration within HumAInFlow. This work offers a novel method for integrating fairness into requirements analysis of socio-technical workflows, and provides an LLM-based tool to automate the analysis, marking a shift from bias detection to bias prevention with fairness-by-design.

[C102] ICSE 2026

Toward Systematic Counterfactual Fairness Evaluation of Large Language Models: The CAFFE Framework.*

48th IEEE/ACM International Conference on Software Engineering (ICSE 2026), Software Engineering in Society, Rio de Janeiro, Brazil.

Conference Empirical Software Engineering Socio-Technical Analytics A. Parziale, G. Voria, V. Pontillo, G. Catolino, A. De Lucia, F. Palomba.

Abstract. Nowadays, Large Language Models (LLMs) are foundational components of modern software systems. As their influence grows, concerns about fairness have become increasingly pressing. Prior work has proposed metamorphic testing to detect fairness issues, applying input transformations to uncover inconsistencies in model behavior. This paper introduces an alternative perspective for testing counterfactual fairness in LLMs, proposing a structured and intent-aware framework coined CAFFE (Counterfactual Assessment Framework for Fairness Evaluation). Inspired by traditional non-functional testing, CAFFE (1) formalizes LLM-Fairness test cases through explicitly defined components, including prompt intent, conversational context, input variants, expected fairness thresholds, and test environment configuration, (2) assists testers by automatically generating targeted test data, and (3) evaluates model responses using semantic similarity metrics. Our experiments, conducted on three different architectural families of LLM, demonstrate that CAFFE achieves broader bias coverage and more reliable detection of unfair behavior than existing metamorphic approaches.

[C101] ICSE 2026

Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation.*

48th IEEE/ACM International Conference on Software Engineering (ICSE 2026), Software Engineering in Society, Rio de Janeiro, Brazil.

Conference Empirical Software Engineering Socio-Technical Analytics A. Parziale, G. Voria, V. Pontillo, A. Di Salle, P. Pelliccione, G. Catolino, F. Palomba.

Once Upon a Team: Investigating Bias in LLM-Driven Software Team Composition and Task Allocation.

A. Parziale, G. Voria, V. Pontillo, A. Di Salle, P. Pelliccione, G. Catolino, F. Palomba. Conference Empirical Software Engineering Socio-Technical Analytics

Abstract. LLMs are increasingly used to boost productivity and support software engineering tasks. However, when applied to socially sensitive decisions such as team composition and task allocation, they raise concerns of fairness. Prior studies have revealed that LLMs may reproduce stereotypes; however, these analyses remain exploratory and examine sensitive attributes in isolation. This study investigates whether LLMs exhibit bias in team composition and task assignment by analyzing the combined effects of candidates' country and pronouns. Using three LLMs and 3,000 simulated decisions, we find systematic disparities: demographic attributes significantly shaped both selection likelihood and task allocation, even when accounting for expertise-related factors. Task distributions further reflected stereotypes, with technical and leadership roles unevenly assigned across groups. Our findings indicate that LLMs exacerbate demographic inequities in software engineering contexts, underscoring the need for fairness-aware assessment.

Download PDF

[C100] iMeta 2025

Privacy-Aware 3D Reconstruction and Obfuscation in the Metaverse.*

IEEE 3rd International Conference on Intelligent Metaverse Technologies and Applications (iMeta), Dubrovnik, Croatia.

V. Pentangelo, S. Tabet, S. Lambiase, A. Kayssi, I. H. Elhajj, F. Palomba. Conference Empirical Software Engineering

Abstract. The automated reconstruction of 3D environments is crucial for immersive metaverse experiences, as it enables realistic and dynamic virtual spaces that enhance user interaction and presence; nevertheless, it could raise significant privacy concerns. Image-to-3D techniques, such as photogrammetry and Neural Radiance Fields (NeRF), can inadvertently capture and render sensitive information, posing ethical and legal risks. To address this issue, this work evaluates the combination of obfuscation techniques with 3D reconstruction methods to mitigate privacy threats while maintaining visual quality. Various obfuscation approaches are applied to input images before reconstruction, and their impact on model fidelity and privacy preservation is assessed through quantitative metrics and qualitative discussion. The findings highlight trade-offs between privacy protection and 3D reconstruction quality, where diffusion-based inpainting combined with weaker reconstruction methods can achieve strong privacy preservation for volumetric objects (up to 177% improvement) with lightweight meshes (up to 6.5× less) but slightly lower visual fidelity (less than 5% difference), whereas machine learning-based reconstruction methods can reconstruct challenging surfaces with high realism (up to 80% realism) at the cost of larger meshes (10k–56k polygons) and reduced privacy (56% less on average). This study informs researchers and practitioners on balancing realism and privacy, contributing to the development of privacy-aware metaverse environments.

Download PDF

[C99] ESEM 2025

On the Harmfulness of Test Smells in Manual System Testing: A Controlled Experiment.*

ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), Honolulu, Hawaii, USA.

G. Soares, V. Santos, M. Ribeiro, L. Martins, V. Pontillo, M. Aranda, R. Gheyi, I. Machado, F. Palomba. Conference Software Quality Empirical Software Engineering

Abstract. Test smells can pose difficulties during testing activities, such as poor maintainability, non-deterministic behavior, and incomplete verification. Existing research has extensively addressed test smells in automated software tests, but little attention has been paid to smells in natural language tests. While some research has attempted to catalog such test smells, there is a lack of investigation into their impact on the effectiveness of test cases. In this paper, we conduct a controlled experiment with 30 participants from academia and industry to examine the impact of test smells in manual test descriptions. Specifically, we analyze whether the presence of two test smells, Ambiguous Test and Eager Action, results in (1) increased test execution time, (2) a higher number of steps needed to complete the tests, and (3) high divergence in the perceived success of the test outcomes. Our findings reveal that an Ambiguous Test can increase execution time by up to five times and screen flow by up to seven times. In addition, if the Eager Actions are dependent on one another, there is no increase in execution time and screen flow.

Download PDF

[C98] SEAA 2025

Socio-Technical Well-Being of Quantum Software Communities: An Overview on Community Smells.*

51st Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Salerno, Italy.

S. Lambiase, M. De Stefano, F. Palomba, F. Ferrucci, A. De Lucia. Conference Socio-Technical Analytics Empirical Software Engineering

Abstract. Quantum computing has gained significant attention due to its potential to solve computational problems beyond the capabilities of classical computers. With major corporations and academic institutions investing in quantum hardware and software, there has been a rise in the development of quantum-enabled systems, particularly within open-source communities. However, despite the promising nature of quantum technologies, these communities face critical socio-technical challenges, including the emergence of socio-technical anti-patterns known as community smells. These anti-patterns, prevalent in open-source environments, have the potential to negatively impact both product quality and community health by introducing technical debt and amplifying architectural and code smells. Despite the importance of these socio-technical factors, there remains a scarcity of research investigating their influence within quantum open-source communities. This work aims to address this gap by providing a first step in analyzing the socio-technical well-being of quantum communities through a cross-sectional study. By understanding the socio-technical dynamics at play, it is expected that foundational knowledge can be established to mitigate the risks associated with community smells and ensure the long-term sustainability of open-source quantum initiatives.

Download PDF

[C97] SEAA 2025

"The candle is burning out on its own...": Modeling Fatigue and Empathy Among Chinese Developers.*

51st Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Salerno, Italy.

D. Tamburri, H. Zhang, K. Blincoe, R. Kazman, G. Giordano, V. Pontillo, F. Palomba. Conference Socio-Technical Analytics Empirical Software Engineering

Abstract. Developer turnover and layoffs are at a historical peak, contributing to increased stress, fatigue, and declining morale among software developers. To investigate this issue, in this paper, we surveyed 178 developers in China and found that over half reported experiencing psychological distress, which is significantly higher than the national average. Using factor analysis and regression modeling, we identified key psychological dimensions of fatigue and empathy and examined their relationship to workplace conditions. We complemented the survey with 17 behavioral metrics from Azure DevOps and Microsoft Viva Insight, enabling a data-driven assessment of developers' work context. Finally, we developed the Empathy Catalogues Analysis Model, a statistical model linking work context metrics to empathy scores, revealing a significant negative correlation between workload burden and perceived empathy. Our findings provide a foundation for scalable, automated monitoring of psychological well-being in teams.

Download PDF

[C96] SEAA 2025

An Evidence-Based Study on the Relationship of Software Engineering Practices on Code Smells in Python ML Projects.*

51st Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Salerno, Italy.

G. Giordano, A. Della Porta, F. Ferrucci, F. Palomba. Conference Empirical Software Engineering

Abstract. The rapid adoption of Machine Learning (ML) technologies has introduced new challenges for code quality. Code smells, i.e., suboptimal design and implementation choices applied when developing source code, represent a particularly prevalent problem. While software engineering (SE) practices are often recommended to improve maintainability, their actual impact on code smells in ML projects remains unclear. In this paper, we present an evidence-based empirical study of 566 real-world Python ML projects from the NICHE dataset, labeled according to adherence to eight established SE practices. Using static analysis and statistical testing, we assess the relationship between these practices and the presence of ten Python-specific code smells. Our results show that projects adopting SE practices exhibit significantly fewer code smells. In particular, Continuous Integration is negatively correlated with the Complex Container Comprehension smell. These findings highlight the importance of engineering discipline in managing code quality in ML development.

Download PDF

[C95] SEAA 2025

Teaching Software Engineering for Artificial Intelligence: An Experience Report.*

51st Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Salerno, Italy.

F. Palomba, G. Voria, A. Parziale, V. Pentangelo, A. Della Porta, V. De Martino, G. Recupito, G. Giordano. Conference Empirical Software Engineering

Abstract. As Artificial Intelligence (AI) becomes integral to modern software systems, the software engineering (SE) research community has been actively developing methods, tools, and frameworks to address software quality assurance of AI-enabled systems across critical dimensions such as robustness, ethics, security, and sustainability. These contributions are designed to tackle the complexity of AI systems, such as their probabilistic nature, data dependencies, and societal impact, ensuring they meet the standards of modern software engineering. These advances have, in turn, inspired educators to introduce Software Engineering for Artificial Intelligence (SE4AI) courses aimed at preparing the next generation of software engineers, with notable success examples already reported in the literature. In this experience report, we contribute to the field of SE4AI education by sharing lessons learned in designing and teaching a course that addresses the unique characteristics of AI-enabled systems. Drawing on insights gathered over four iterations of the course, we discuss how students perceive and apply key software engineering concepts, the challenges they encounter with tools and techniques, and how project-based learning bridges the gap between theoretical knowledge and real-world application. Furthermore, we address the broader educational challenges, such as interdisciplinary barriers and the integration of rapidly evolving AI technologies, and provide recommendations to enhance SE4AI education. By reflecting on these experiences, we aim to offer insights and strategies for improving the teaching of SE4AI topics.

Download PDF

[C94] ECTEL 2025

Toward Realistic AI-Generated Student Questions to Support Instructor Training.*

European Conference on Technology Enhanced Learning (ECTEL 2025), Newcastle, UK.

F. Cardia, V. Pentangelo, S. Lambiase, C. Gravino, F. Palomba, M. Marras. Conference Empirical Software Engineering

Abstract. Instructor effectiveness is fundamental to student learning, with the ability to manage student inquiries serving as a critical component of effective teaching. Student questions represent a valuable training resource for instructors to strengthen their teaching strategies, yet interactions with students are often constrained by several factors. In this paper, we investigate how instructors perceive machine- and student-generated questions, considering the potential for the former to complement the latter in a cost-effective manner. Our study involved 121 undergraduate students and an equivalent number of simulated students modeled using a state-of-the-art large language model, generating over 360 questions in total based on video lectures given by seven university instructors. We assessed whether instructors could distinguish between human- and machine-generated questions and how they evaluated their relevance, clarity, answerability, challenge level, and cognitive depth. Results show that instructors struggle to differentiate between the two sets of questions, with accuracy close to random chance. Instructors tended to (i) rate machine-generated questions slightly higher in relevance, clarity, answerability, and challenge (though only relevance and answerability showed significant differences) and (ii) associate them marginally more often with higher-order cognitive skills. This confirms the potential of machine-generated questions as tools for instructor training.

Download PDF

[C93] CSEET 2025

Students' Perception of ChatGPT in Software Engineering: Lessons Learned from Five Courses.*

International Conference on Software Engineering Education and Training (CSEET 2025), Ottawa, Canada.

L. Baresi, A. De Lucia, A. Di Marco, M. Di Penta, D. Di Ruscio, L. Mariani, D. Micucci, F. Palomba, M. T. Rossi, F. Zampetti. Conference Empirical Software Engineering

Abstract. A few years after their release, Large Language Model (LLM)-based tools are becoming an essential component of software education, much as calculators are in math courses. When learning software engineering (SE), the challenge is the extent to which LLMs are suitable and easy to use for different software development tasks. In this paper, we report the findings and lessons learned from using LLM-based tools, ChatGPT in particular, in five SE courses from four universities. After instructing students on the potential of LLMs in SE and on prompting strategies, we asked participants to complete a survey and take part in semi-structured interviews. The collected results report (i) indications about the usefulness of the LLM for different tasks, (ii) challenges in prompting the LLM, i.e., interacting with it, (iii) challenges in adapting the generated artifacts to students' own needs, and (iv) wishes about valuable features students would like to see in LLM-based tools. Although results vary among courses, partly because of students' seniority and course goals, the perceived usefulness is greater for low-level phases (e.g., coding or debugging/fault localization) than for analysis and design phases. Interaction and code adaptation challenges vary among tasks and are mostly related to the need for task-specific prompts, as well as better specification of the development context.

Download PDF

[C92] EASE 2025

Do Prompt Patterns Affect Code Quality? A First Empirical Assessment of ChatGPT-Generated Code.*

International Conference on Evaluation and Assessment in Software Engineering (EASE 2025), Istanbul, Turkey.

A. Della Porta, S. Lambiase, F. Palomba. Conference Empirical Software Engineering

Abstract. Large Language Models (LLMs) have rapidly transformed software development, especially in code generation. However, their inconsistent performance, prone to hallucinations and quality issues, complicates program comprehension and hinders maintainability. Research indicates that prompt engineering (the practice of designing inputs to direct LLMs toward generating relevant outputs) may help address these challenges. In this regard, researchers have introduced prompt patterns, structured templates intended to guide users in formulating their requests. However, the influence of prompt patterns on code quality has yet to be thoroughly investigated. An improved understanding of this relationship would be essential to advancing our collective knowledge on how to effectively use LLMs for code generation, thereby enhancing their understandability in contemporary software development. This paper empirically investigates the impact of prompt patterns on code quality, specifically maintainability, security, and reliability, using the Dev-GPT dataset. Results show that Zero-Shot prompting is most common, followed by Zero-Shot with Chain-of-Thought and Few-Shot. Analysis of 7,583 code files across quality metrics revealed minimal issues, with Kruskal-Wallis tests indicating no significant differences among patterns, suggesting that prompt structure may not substantially impact these quality metrics in ChatGPT-assisted code generation.

Download PDF

[C91] EASE 2025

How Do Communities of ML-Enabled Systems Smell? A Cross-Sectional Study on the Prevalence of Community Smells.*

International Conference on Evaluation and Assessment in Software Engineering (EASE 2025), Istanbul, Turkey.

G. Annunziata, S. Lambiase, F. Palomba, G. Catolino, F. Ferrucci. Conference Socio-Technical Analytics Empirical Software Engineering

Abstract. Successful software development depends on effectively managing both collaboration and technology. However, socio-technical challenges can disrupt team dynamics and lead to technical debt. Despite the interdisciplinary nature of teams working on ML-enabled systems, research on their socio-technical dynamics remains limited compared to the emphasis on technical aspects. This study aims to address this gap by examining the prevalence, evolution, and inter-relations of "community smells", i.e., social anti-patterns that indicate dysfunctional collaboration, in open-source ML projects. We conducted an empirical study on 188 repositories from the NICHE dataset. Leveraging the CADOCS tool, we identified and analyzed community smells within these repositories. Our analysis focused on three key aspects: (1) the prevalence of community smells, (2) their correlations, and (3) their variations over time. Our findings indicate that while some community smells are more prevalent than others, their overall occurrence remains relatively stable over time; Prima Donna Effects, Sharing Villainy, and Solution Defiance are particularly prominent in ML-enabled projects compared to Radio Silence or Organizational Skirmish. These insights provide valuable guidance for project managers and team leads in ML-focused communities, helping them mitigate social challenges more effectively, allocate resources more effectively, and improve collaboration and team dynamics.

Download PDF

[C90] Fairness 2025

Contextual Fairness-Aware Practices in ML: A Cost-Effective Empirical Evaluation.*

International Workshop on Fairness in Software Systems (Fairness 2025), Montreal, Canada.

Best Paper Award

A. Parziale, G. Voria, G. Giordano, G. Catolino, G. Robles, F. Palomba. Conference Empirical Software Engineering

Abstract. As machine learning (ML) systems become central to critical decision-making, concerns over fairness and potential biases have increased. To address this, the software engineering (SE) field has introduced bias mitigation techniques aimed at enhancing fairness in ML models at various stages. Additionally, recent research suggests that standard ML engineering practices can also improve fairness; these practices, known as fairness-aware practices, have been cataloged across each stage of the ML development life cycle. However, fairness remains context-dependent, with different domains requiring customized solutions. Furthermore, existing specific bias mitigation methods may sometimes degrade model performance, raising ongoing discussions about the trade-offs involved. In this paper, we empirically investigate fairness-aware practices from two perspectives: contextual and cost-effectiveness. The contextual evaluation explores how these practices perform in various application domains, identifying areas where specific fairness adjustments are particularly effective. The cost-effectiveness evaluation considers the trade-off between fairness improvements and potential performance costs. Our findings provide insights into how context influences the effectiveness of fairness-aware practices. This research aims to guide SE practitioners in selecting practices that achieve fairness with minimal performance costs, supporting the development of ethical ML systems.

Download PDF

[C89] WSESE 2025

A Framework for Using LLMs for Repository Mining Studies in Empirical Software Engineering.*

International Workshop on Methodological Issues with Empirical Studies in Software Engineering (co-located with ICSE 2025), Ottawa, Canada, 2025.

Conference Empirical Software Engineering V. De Martino, J. Castano, F. Palomba, X. Franch, S. Martinez-Fernandez.

Abstract. The emergence of Large Language Models (LLMs) has significantly transformed Software Engineering (SE) by providing innovative methods for analyzing software repositories. Our objective is to establish a practical framework for future SE researchers needing to enhance the data collection and dataset while conducting software repository mining studies using LLMs. This experience report shares insights from two previous repository mining studies, focusing on the methodologies used for creating, refining, and validating prompts that enhance the output of LLMs, particularly in the context of data collection in empirical studies. Our research packages a framework, coined Prompt Refinement and Insights for Mining Empirical Software repositories (PRIMES), consisting of a checklist that can improve LLM usage performance, enhance output quality, and minimize errors through iterative processes and comparisons among different LLMs. We also emphasize the significance of reproducibility by implementing mechanisms for tracking model results. Our findings indicate that standardizing prompt engineering and using PRIMES can enhance the reliability and reproducibility of studies utilizing LLMs. Ultimately, this work calls for further research to address challenges like hallucinations, model biases, and cost-effectiveness in integrating LLMs into workflows.

Download PDF

[C88] ICSE 2025

Do Developers Adopt Green Architectural Tactics for ML-Enabled Systems? A Mining Software Repository Study.*

IEEE/ACM International Conference on Software Engineering (ICSE 2025), Ottawa, Canada, 2025.

Conference Empirical Software Engineering V. De Martino, S. Martinez-Fernandez, F. Palomba.

Abstract. As machine learning (ML) and artificial intelligence (AI) technologies become more widespread, concerns about their environmental impact are increasing due to the resource-intensive nature of training and inference processes. Green AI advocates for reducing computational demands while still maintaining accuracy. Although various strategies for creating sustainable ML systems have been identified, their real-world implementation is still underexplored. This paper addresses this gap by studying 168 open-source ML projects on GitHub. It employs a novel large language model (LLM)-based mining mechanism to identify and analyze green strategies. The findings reveal the adoption of established tactics that offer significant environmental benefits. This provides practical insights for developers and paves the way for future automation of sustainable practices in ML systems.

Download PDF

[C87] ICSE 2025

From Expectation to Habit: Why Do Software Practitioners Adopt Fairness Toolkits?*

IEEE/ACM International Conference on Software Engineering (ICSE 2025), Ottawa, Canada, 2025.

Conference Socio-Technical Analytics Empirical Software Engineering G. Voria, S. Lambiase, M.C. Schiavone, G. Catolino, F. Palomba.

Abstract. As the adoption of machine learning (ML) systems continues to grow across industries, concerns about fairness and bias in these systems have taken center stage. Fairness toolkits—designed to mitigate bias in ML models—serve as critical tools for addressing these ethical concerns. However, their adoption in the context of software development remains underexplored, especially regarding the cognitive and behavioral factors driving their usage. As a deeper understanding of these factors could be pivotal in refining tool designs and promoting broader adoption, this study investigates the factors influencing the adoption of fairness toolkits from an individual perspective. Guided by the Unified Theory of Acceptance and Use of Technology (UTAUT2), we examined the factors shaping the intention to adopt and actual use of fairness toolkits. Specifically, we employed Partial Least Squares Structural Equation Modeling (PLS-SEM) to analyze data from a survey study involving practitioners in the software industry. Our findings reveal that performance expectancy and habit are the primary drivers of fairness toolkit adoption. These insights suggest that by emphasizing the effectiveness of these tools in mitigating bias and fostering habitual use, organizations can encourage wider adoption. Practical recommendations include improving toolkit usability, integrating bias mitigation processes into routine development workflows, and providing ongoing support to ensure professionals see clear benefits from regular use.

Download PDF

[C86] SEAA 2024

An Empirical Study on the Relation between Programming Languages and the Emergence of Community Smells.*

50th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Paris, France.

Conference Socio-Technical Analysis G. Annunziata, C. Ferrara, S. Lambiase, F. Palomba, G. Catolino, F. Ferrucci, A. De Lucia.

Abstract. To provide a measurable representation of social issues in software teams, the research community defined a set of anti-patterns that may lead to the emergence of both social and technical debt, i.e., "community smells". Researchers have investigated community smells from different perspectives; in particular, they have analyzed how product-related aspects of software development, such as architecture and the introduction of a new language, could influence community smells. However, how technical project characteristics relate to the emergence of community smells is still unknown. Different from those works, we aim to investigate how adopting specific programming languages might influence the socio-technical alignment and congruence of the development community, possibly affecting its overall ability to communicate and collaborate and leading to the emergence of social anti-patterns, i.e., community smells. We studied the relationship between the most used programming languages and the community smells in 100 open-source projects on GitHub. Key results of the study show a low statistical correlation for specific community smells like Prima Donna Effects, Solution Defiance, and Organizational Skirmish, highlighting that, for some programming languages, adoption may not be an indicator of the presence or absence of community smells.

Download PDF

[C85] SEAA 2024

AGORA: An Approach for Generating Acceptance Test Cases from Use Cases.*

50th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Paris, France.

Conference Software Testing Empirical Software Engineering G. De Vito, G. Vassallo, F. Palomba, F. Ferrucci.

Abstract. This paper introduces AGORA, an innovative approach that leverages Large Language Models to automate the definition of acceptance test cases from use cases. AGORA consists of two phases that exploit prompt engineering to 1) identify test cases for specific use cases and 2) generate detailed acceptance test cases. AGORA was evaluated through a controlled experiment involving industry professionals, comparing the effectiveness and efficiency of the proposed approach with the manual method. The results showed that AGORA can generate acceptance test cases with a quality comparable to that obtained manually while improving process efficiency by over 90%, requiring only a fraction of the time. Furthermore, user feedback indicated high satisfaction with using the proposed approach. These findings underscore the potential of AGORA as a tool to enhance the efficiency and quality of the software testing process.

Download PDF

[C84] UKDE 2024

Collecting and Implementing Ethical Guidelines for Emotion Recognition in an Educational Metaverse.*

International Conference on User-Centered Practices of Knowledge Discovery in Educational Data (UKDE 2024), Cagliari, Italy, 2024.

Conference Socio-Technical Analysis D. Di Dario, V. Pentangelo, M. Colella, F. Palomba, C. Gravino.

Abstract. The metaverse represents a persistent, online 3D universe where people can interact, socialize, and work toward common goals. Education represents a key application domain, as it has the potential to enhance experiential learning and collaboration between learners and between learners and educators. However, challenges to the widespread adoption of educational metaverses persist. This paper focuses on emotional isolation, i.e., the feeling of emotional disconnection or loneliness, which can hinder learners' motivation and participation. Machine learning-enabled emotion recognition systems have the potential to address this challenge, offering educators feedback on the emotional states of learners within the metaverse. Yet, the integration of emotion recognition systems raises ethical concerns regarding consent, privacy, and algorithmic bias. In this paper, we first conduct a literature review on the ethical considerations surrounding the deployment of emotion recognition technology within educational metaverses, from which we collect a set of ethical guidelines. Then, we report on the implementation of these guidelines within SENEM, an educational metaverse platform available in the literature. Through this research, we aim to contribute to the responsible deployment of emotion recognition technology in educational settings, ultimately fostering a supportive and inclusive learning environment for all learners.

Download PDF

[C83] FORGE 2024

Is Attention All You Need? Toward a Conceptual Model for Social Awareness in Large Language Models.*

International Conference on AI Foundation Models and Software Engineering (FORGE 2024), Lisbon, Portugal, 2024.

Conference Socio-Technical Analysis G. Voria, G. Catolino, F. Palomba.

Abstract. Large Language Models (LLMs) are revolutionizing the landscape of Artificial Intelligence (AI) due to recent technological breakthroughs. Their remarkable success in aiding various Software Engineering (SE) tasks through AI-powered tools and assistants has led to the integration of LLMs as active contributors within development teams, ushering in novel modes of communication and collaboration. However, great power comes with great responsibility: ensuring that these models meet fundamental ethical principles such as fairness is still an open challenge. In this light, our vision paper analyzes the existing body of knowledge to propose a conceptual model designed to frame the ethical, social, and cultural considerations that researchers and practitioners should take into account when defining, employing, and validating LLM-based approaches for software engineering tasks.

Download PDF

[C82] CAIN 2024

Unmasking Data Secrets: An Empirical Investigation into Data Smells and Their Impact on Data Quality.*

International Conference on AI Engineering – Software Engineering for AI (CAIN 2024), Lisbon, Portugal, 2024.

Conference Software Quality Empirical Software Engineering G. Recupito, R. Rapacciuolo, D. Di Nucci, F. Palomba.

Abstract. Artificial Intelligence (AI) is rapidly advancing with a data-centered approach suitable for various domains. Nevertheless, AI faces significant challenges, particularly in data quality. Data collection from diverse sources can introduce quality issues that may threaten the development of AI-enabled systems. A growing concern in this context is the emergence of data smells – issues specific to the data used in building AI models, which can have long-term consequences. In this paper, we aim to enlarge the current body of knowledge on data smells by proposing a two-step investigation into the matter. First, we update an existing literature review to catalogue the currently known data smells and the tools to detect them. Afterward, we assess the prevalence of data smells and their correlation with data quality metrics. As a first result, we identify a novel set of 12 data smells distributed across three additional categories. Second, we observe that the correlation between data smells and data quality is pronounced and substantial, especially for highly diffused data smell instances. This research sheds light on the complex relationship between data smells and data quality, providing valuable insights into the challenges of maintaining AI-enabled systems.

Download PDF

[C81] ICSE 2024

ReFAIR: Toward a Context-Aware Recommender for Fairness Requirements Engineering.*

IEEE/ACM International Conference on Software Engineering (ICSE 2024), Lisbon, Portugal, 2024.

Conference Socio-Technical Analytics Empirical Software Engineering C. Ferrara, F. Casillo, C. Gravino, A. De Lucia, F. Palomba.

Abstract. Machine learning (ML) is increasingly being used as a key component of most software systems, yet serious concerns have been raised about the fairness of ML predictions. Researchers have been proposing novel methods to support the development of fair machine learning solutions. Nonetheless, most of them can only be used in late development stages, e.g., during model training, while there is a lack of methods that may provide practitioners with early fairness analytics enabling the treatment of fairness throughout the development lifecycle. This paper proposes ReFair, a novel context-aware requirements engineering framework that classifies sensitive features from User Stories. By exploiting natural language processing and word embedding techniques, our framework first identifies both the use case domain and the machine learning task to be performed in the system being developed; afterward, it recommends the context-specific sensitive features to be considered during the implementation. We assess the capabilities of ReFair by evaluating it against a synthetic dataset, which we built as part of our research, composed of 12,401 User Stories related to 34 application domains. Our findings showcase the high accuracy of ReFair, while also highlighting its current limitations.

Download PDF

[C80] ICSE 2024

SERGE – Serious Game for the Education of Risk Management in Software Project Management.*

IEEE/ACM International Conference on Software Engineering (ICSE 2024) - Software Engineering Education and Training Track, Lisbon, Portugal, 2024.

Conference Empirical Software Engineering G. Annunziata, S. Lambiase, F. Palomba, F. Ferrucci.

Abstract. Software Project Management is the systematic and disciplined approach for planning, executing, monitoring, controlling, and closing software development projects. It plays a critical role in the success of software projects and encompasses several processes for ensuring the successful completion of a software project. Among them, risk management emerges as a critical pivot to be able to react to the unpredictable events that often affect software projects. Teaching risk management is vital to equip individuals and organizations with the adapted skills to prevent and monitor challenges and potential issues. In this paper, we propose a serious game named Serge, conceived to involve students in learning risk management and improve their skills through gamification and simulation of a real-world application context. The features for the design of Serge were identified through a literature review. An iterative Game Design Phase was employed to build, test, and refine the design of Serge. Finally, the proposed approach was assessed by conducting a controlled experiment to compare risk management skills acquired through a traditional lecture and using Serge. The results show that adopting a serious game such as Serge, which actively involves students, can improve the acquisition of risk management skills.

Download PDF

[C79] ICSE 2024

Dealing With Cultural Dispersion: a Novel Theoretical Framework for Software Engineering Research and Practice.*

IEEE/ACM International Conference on Software Engineering (ICSE 2024) - Software Engineering in Society Track, Lisbon, Portugal, 2024.

Conference Socio-Technical Analytics Empirical Software Engineering S. Lambiase, G. Catolino, B. Della Piana, F. Ferrucci, F. Palomba.

Abstract. Software development is fundamentally a team-driven process; researchers in software engineering have identified various human and social factors that can significantly impact it. Culture emerged as a critical element, and the diversity deriving from cultural differences can be highly impactful both positively and negatively. Despite existing knowledge about how culture influences software development, limitations persist. Most importantly, a unified and comprehensive (grounded) theory of how cultural differences influence and are managed in software development has yet to exist. This lack has two significant consequences: (1) it makes research on culture fragmented, leading to the continual definition of new concepts that do not allow the state of the art to advance significantly, and (2) it reduces the ability of the research to be transferred to practitioners since there is no framework designed to be understood and used by them. To address the above-mentioned limitation, this work proposes a theoretical framework of "Dealing With Cultural Dispersion", which focuses on challenges and benefits originating from cultural differences and strategies for dealing with them. Such a framework was developed through a qualitative study using an iterative research approach, including interviews and socio-technical grounded theory for data analysis. The proposed framework was designed to reveal the tangible effects of practitioners' culture in software development, allowing software teams to (1) clearly understand the problem and (2) implement the correct strategy for addressing it. Additionally, researchers can use this framework as a foundation to (deductively) develop a more robust and comprehensive theory in this field.

Download PDF

[C78] SCAM 2023

Automating Test-Specific Refactoring Mining: A Mixed-Method Investigation.*

23rd IEEE International Working Conference on Source Code Analysis and Manipulation (SCAM), Bogotá, Colombia.

Conference Software Testing Empirical Software Engineering L. Martins, H. Costa, M. Ribeiro, F. Palomba, I. Machado.

Abstract. Refactoring is a practice commonly used by developers to restructure the source code without changing its external behavior. Over the last decades, the software engineering research community has been making use of mining software repository techniques to investigate refactoring under multiple perspectives, identifying properties and impact of this practice on source code quality, as well as using refactoring data coming from software repositories to build automated recommendation systems. While the current state of the art proposes various automated tools to mine refactoring data, there is still a lack of instruments that may help researchers when mining test-specific refactoring data. The availability of those instruments may enable additional, specialized techniques to support developers while refactoring test code. In this paper, we introduce an approach that extends RefactoringMiner, a well-established refactoring mining tool with high precision and recall scores, and is able to detect seven test-specific refactoring operations. We perform mixed-method research to assess the capabilities and usefulness of the approach. First, we compare the test-specific refactoring data extracted by the approach against an oracle of 375 test-specific refactorings. Second, we engage with 15 software engineering researchers and apply a technology acceptance model to investigate how they would benefit from our approach. The key results of the study show that our approach reaches precision and recall scores of 100% and 92.5%, respectively. In addition, the approach is considered useful and suitable for various research tasks, including the definition of novel learning models able to recommend test-specific refactoring actions.

Download PDF

[C77] MENSURA 2023

Please, Be Realistic! An Empirical Study on the Performance of Vulnerability Prediction Models.*

17th International Conference on Software Process and Product Measurement (MENSURA), Rome, Italy.

Conference Software Quality Empirical Software Engineering G. Sellitto, A. Sheykina, F. Palomba, A. De Lucia.

Abstract. Software vulnerabilities are infamous threats to the security of computing systems, and it is vital to detect and correct them before releasing any piece of software to the public. Many approaches for the detection of vulnerabilities have been proposed in the literature; in particular, those leveraging machine learning techniques, i.e., vulnerability prediction models, seem quite promising. However, recent work has warned that most models have only been evaluated in in-vitro settings, under certain assumptions that do not resemble the real scenarios in which such approaches are supposed to be employed. This observation raises the risk that the encouraging results obtained in previous literature may not hold as well in practice. Recognizing the danger of biased and unrealistic evaluations, we aim to dive deep into the problem by investigating whether and to what extent the performance of vulnerability prediction models changes when measured in realistic settings. To do this, we perform an empirical study evaluating the performance of a vulnerability prediction model, configured with three data balancing techniques, executed at three different degrees of realism, leveraging two datasets. Our findings highlight that the outcome of any measurement strictly depends on the experiment setting, calling on researchers to take into account the realism and practical applicability of the approaches they propose and evaluate.

[C76] MENSURA 2023

Understanding Developer Practices and Code Smells Diffusion in AI-Enabled Software: A Preliminary Study.*

17th International Conference on Software Process and Product Measurement (MENSURA), Rome, Italy.

Conference Software Quality Empirical Software Engineering G. Giordano, G. Annunziata, A. De Lucia, F. Palomba.

Abstract. To deal with continuous change requests and the strict time-to-market, practitioners and big companies constantly update their software systems to meet users' requirements. This practice forces developers to release immature products, neglecting best practices to reduce delivery times. As a possible result, technical debt can arise, i.e., potential design issues that can negatively impact software maintenance and evolution and, in turn, increase both the time-to-market and costs. Code smells, i.e., sub-optimal design decisions identifiable by computing software metrics and providing a general overview of code quality, are common symptoms of technical debt. While previous research focused on code smells primarily in the context of Java, the growing popularity of Python, particularly for developing artificial intelligence (AI)-enabled systems, calls for additional investigations. This preliminary analysis addresses this gap by exploring the diffusion of Python-specific code smells and the activities performed by developers that induce the introduction of code smells in their systems. To perform our preliminary investigation, we selected 200 AI-enabled systems available in the Niche dataset; we extracted 10,611 pieces of information on the releases using PyDriller, and used PySmell to extract information about code smells. The results reveal several insights: 1) Code smells related to object-oriented principles are rarely detected in Python; 2) Complex List Comprehension is the most prevalent and the longest-lived smell; 3) The main activities that can induce code smells are evolutionary. This study fills a critical gap in the literature by providing empirical evidence on the evolution of code smells in Python-based AI-enabled systems.

[C75] SEAA 2023

The Yin and Yang of Software Quality: On the Relationship between Design Patterns and Code Smells.*

49th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Durres, Albania.

Best Paper Award

Conference Software Quality Empirical Software Engineering G. Giordano, G. Sellitto, A. Sepe, F. Palomba, F. Ferrucci.

Abstract. Software reuse is considered the silver bullet of software engineering. It has been largely demonstrated that the proper implementation of design and reuse principles can substantially reduce the effort, time, and costs required to develop software systems. Design patterns are one of the most established techniques for source code reuse. While previous work pointed out their benefits in terms of maintainability and understandability, some studies raise the opposite concern, suggesting that they can negatively impact code quality from the developers' perspective. We recognize such a discrepancy in the literature, and we aim to fill this gap by investigating whether and how design patterns are related to the emergence of issues compromising code understandability, namely the Complex Class, God Class, and Spaghetti Code smells, which have also been shown to increase the change- and fault-proneness of code. We perform an empirical evaluation on 15 Java projects evolving over 542 releases, and we find that, although design patterns are supposed to improve code quality without prejudice, they can be related to dangerous issues, as we observe the emergence of code smells in the classes participating in their implementation. From our findings, we distill a number of implications for developers and project managers to support them in dealing with design patterns.

[C74] SEAA 2023

Toward a Secure Educational Metaverse: A Tale of Blockchain Design for Educational Environments.*

49th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Durres, Albania.

Conference Software Quality Empirical Software Engineering D. Di Dario, U. Bilotti, M. Sibilio, C. Gravino, F. Palomba.

Abstract. In the era of social distancing, distance learning represents a crucial educational challenge. Several 2D information technologies have been provided, yet these share multiple limitations and have negative social, educational, and psychological implications for learners. The metaverse promises to revolutionize education as we know it: it is a persistent, virtual, three-dimensional environment that is supposed to address most of the limitations of 2D information technologies. Nonetheless, there are still software engineering challenges to face to enable such a metaverse, especially when turning to software security and privacy. In this paper, we take the first steps toward an improved understanding of the security perspective of the educational metaverse, by analyzing how blockchain can be employed within educational environments and how applications may be designed. Our ultimate goal is to provide insights into how blockchain can be further tailored in the context of the educational metaverse. We conduct a systematic literature review, which targets 20 primary studies. The key findings of the study showcase the use of blockchain in 3 educational tasks and describe the blockchain design approaches, the protocols they commonly use, and the associated limitations. We conclude by developing a conceptualization of a blockchain-based educational metaverse.

[C73] SEAA 2023

Meet C4SE: Your New Collaborator for Software Engineering Tasks.*

49th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Durres, Albania.

Conference Software Quality Empirical Software Engineering G. De Vito, S. Lambiase, F. Palomba, F. Ferrucci.

Abstract. The software industry has rapidly increased in complexity and scale, leading to challenges in managing information and tasks among developer teams, often resulting in inefficiencies, misunderstandings, and delays. Moreover, the increasing search for automated tasks led to the extensive adoption of chatbots (a.k.a. conversational agents) for software development purposes. However, despite their undoubted positive contributions, practitioners started to identify numerous issues deriving from their adoption, both technical and social, first among them the uselessness of the provided support due to the bot’s lack of full working context. To address such a limitation, we propose C4SE, a chatbot designed to assist software engineers and managers in performing several tasks. The idea behind the bot is to collect information from the different tasks that could be useful for others, in order to provide better support and tailor the bot to the specific operational context, i.e., the development team using it. To enable such task heterogeneity and contextual persistence, we operationalize the GPT 3.5 model to understand the user’s intent and a specialized data store, based on a vector database, as long-term memory to maintain contextual information. With these characteristics, C4SE can provide benefits to the entire software development lifecycle, increasing practitioners' productivity. We present a prototype of the tool able to perform code suggestions, code reviews, GitHub API operationalization, and unit and acceptance test case generation. A preliminary evaluation was carried out, reporting encouraging results.

[C72] SEAA 2023

ECHO: An Approach to Enhance Use Case Quality Exploiting Large Language Models.*

49th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Durres, Albania.

Conference Empirical Software Engineering G. De Vito, F. Palomba, C. Gravino, S. Di Martino, F. Ferrucci.

Abstract. UML use cases are commonly used in software engineering to specify the functional requirements of a system since they are an effective tool for interacting with stakeholders thanks to the use of natural languages. However, producing high-quality use cases can be challenging due to the lack of precise guidelines and suitable tools. This can lead to problems, e.g., inaccuracy and incompleteness, in the derived software artifacts and the final product. Recent advancements in Natural Language Processing and Large Language Models (LLMs) can provide the premises for developing tools supporting activities based on natural languages. In this paper, we propose ECHO, a novel approach for supporting software engineers in enhancing the quality of UML use cases using LLMs. Our approach consists of a co-prompt engineering approach and an iterative and interactive process with the LLM to improve the quality of use cases, based on practitioners’ feedback. To prove the feasibility of the proposal, we instantiated the approach using ChatGPT and performed a controlled experiment to assess its effectiveness by involving seven software engineering professionals. Three were part of the experimental group and used ECHO to improve the quality of the use cases. Three others formed the control group and enhanced the quality of use cases manually. Finally, the last participant acted as an oracle, blind w.r.t. the groups, and evaluated the quality of the enhanced use cases, both qualitatively, by means of a questionnaire, and quantitatively, by means of the Use Case Points metric. Results show that ECHO can effectively support software engineers in improving use cases’ quality thanks to the prompts suitably designed to interact with ChatGPT.

[C71] SEAA 2023

Security Testing in the Wild: An Interview Study.*

49th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Durres, Albania.

Conference Software Testing Empirical Software Engineering D. Di Dario, V. Pontillo, S. Lambiase, F. Ferrucci, F. Palomba.

Abstract. Modern software systems are increasingly complex and the risk of falling into security concerns is high if these systems are not developed with a proper security mindset. Despite the empirical studies and security-oriented approaches proposed by researchers and tool vendors, we still point out a lack of knowledge on the security testing processes applied by companies to reduce risks connected to software security. In this paper, we aim to bridge this gap of knowledge by performing an interview-based study with 19 security experts to understand how companies arrange security testing and how the process of security testing is actually performed in practice. Our results highlight that some companies incorporated the figure of the security tester in the software life cycle, yet practitioners reported a lack of standardized guidelines for security testing. From a management perspective, our results suggest that the introduction of formal communication between development and security testing teams may lead to better performance.

[C70] SEAA 2022

"There and Back Again?" On the Influence of Software Community Dispersion Over Productivity.*

48th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Gran Canaria, Spain.

Best Paper Award

Conference Socio-Technical Analytics Empirical Software Engineering S. Lambiase, G. Catolino, F. Pecorelli, D. Tamburri, F. Palomba, W.J. van den Heuvel, F. Ferrucci.

Abstract. Estimating and understanding productivity still represents a crucial task for researchers and practitioners. Researchers have spent significant effort identifying the factors that influence software developers' productivity, providing several approaches for analyzing and predicting such a metric. Although different works focused on evaluating the impact of human factors on productivity, little is known about the influence of cultural/geographical diversity in software development communities. Indeed, in previous studies, researchers treated cultural aspects as an abstract concept without providing a quantitative representation. This work provides an empirical assessment of the relationship between the cultural and geographical dispersion of a development community (namely, how diverse a community is in terms of cultural attitudes and geographical collocation of the members who belong to it) and its productivity. To reach our aim, we built a statistical model that contained product and socio-technical factors as independent variables to assess the correlation with productivity, i.e., the number of commits performed in a given time. Then, we ran our model considering data from 25 open-source communities on GitHub. Results of our study indicate that cultural and geographical dispersion impact productivity, thus encouraging managers and practitioners to consider such aspects during all the phases of the software development lifecycle.

[C69] SEAA 2022

A Multivocal Literature Review of MLOps Tools and Features.*

48th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Gran Canaria, Spain.

Conference Software Quality Empirical Software Engineering G. Recupito, F. Pecorelli, G. Catolino, S. Moreschini, D. Di Nucci, F. Palomba, D. Tamburri.

Abstract. DevOps has become increasingly widespread, with companies employing its methods in different fields. In this context, MLOps automates Machine Learning pipelines by applying DevOps practices. Despite the high number of tools available and the high interest of practitioners in being supported by tools that automate the steps of Machine Learning pipelines, little is known concerning MLOps tools and their functionalities. To address this, we conducted a Multivocal Literature Review (MLR) to (i) extract tools that allow for and support the creation of MLOps pipelines and (ii) analyze their main characteristics and features to provide a comprehensive overview of their value. Overall, we investigate the functionalities of 13 MLOps tools. Our results show that most MLOps tools support the same features but apply different approaches that can bring different advantages, depending on user requirements.

[C68] SEAA 2022

A Preliminary Conceptualization and Analysis on Automated Static Analysis Tools for Vulnerability Detection in Android Apps.*

48th Euromicro Conference Series on Software Engineering and Advanced Applications (SEAA), Gran Canaria, Spain.

Conference Software Quality Empirical Software Engineering G. Giordano, F. Palomba, F. Ferrucci.

Abstract. The availability of dependable mobile apps is a crucial need for over three billion people who use apps daily for any social and emergency connectivity. A key challenge for mobile developers concerns the detection of security-related issues. While a number of tools have been proposed over the years, especially for the Android operating system, we point out a lack of empirical investigations on the actual support provided by these tools; such investigations might guide developers in selecting the most appropriate instruments to improve their apps. In this paper, we propose a preliminary conceptualization of the vulnerabilities detected by three automated static analysis tools, namely AndroBugs2, Trueseeing, and Insider. We first derive a taxonomy of the issues detectable by the tools. Then, we run the tools against a dataset composed of 6,500 Android apps to investigate their detection capabilities in terms of frequency of detection of vulnerabilities and complementarity among tools. Key findings of the study show that current tools identify similar concerns, but they use different naming conventions. Perhaps more importantly, the tools only partially cover the most common vulnerabilities classified by the Open Web Application Security Project (OWASP) Foundation.

[C67] HCII 2022

AI-based Emotion Recognition to Study Users' Perception of Dark Patterns.*

24th International Conference on Human-Computer Interaction (HCII 2022), Virtual, 2022.

Conference Computer-Human Interaction S. Avolicino, M. Di Gregorio, M. Romano, G. Vitiello, F. Palomba, M. Sebillo.

Abstract. Dark Patterns are design patterns used to trick users into acting against their real interest. The web provides an infinite number of services accessible to anyone, which do not always promote a good user experience and are often structured with the aim of leading users to perform unwanted actions or discouraging them from making decisions that could damage the company. This is a very common practice, especially in neuromarketing. Human behavioral and perceptual patterns are cleverly exploited to achieve a specific goal. For this reason, dark pattern developers try to create an environment that invites as much purchase as possible by stimulating the customer's unconscious. Among the areas in which these strategies are adopted is tourism: online travel agency websites promote "fake discounts" for the products/services they are selling and display inaccurate pricing information leading to incorrect pricing assumptions, thus misleading consumers. One of the goals of this work is to identify which dark patterns are most used in online travel agencies; once identified, they will be used to run scenarios that simulate booking a vacation online. During the execution of the tests, users will be filmed via webcam, tracking their expressions and emotions through AI-based facial recognition. Finally, the data obtained from the tests will be analyzed to study the emotions and feelings that users experience when they are confronted with dark patterns, to understand which users are more at risk and which types of dark patterns they are more vulnerable to.

[C66] GECCO 2022

A Bi-level Evolutionary Approach for the Multi-label Detection of Smelly Classes.*

The Genetic and Evolutionary Computation Conference (GECCO 2022), Boston, USA, 2022.

Conference Software Quality Empirical Software Engineering S. Boutaib, M. Elarbi, S. Bechikh, F. Palomba, L. Ben Said.

Abstract. This paper presents a new evolutionary method and tool called BMLDS (Bi-level Multi-Label Detection of Smells) that optimizes a population of classifier chains for the multi-label detection of smells. As the chain is sensitive to the labels' (i.e., smell types) order, the chains induction task is framed as a bi-level optimization problem, where the upper-level role is to search for the optimal order of each considered chain while the lower-level one is to generate the chains. This allows taking into consideration the interactions between smells in the multi-label detection process. The statistical analysis of the experimental results reveals the merits of our proposal with respect to several existing works.

[C65] CHASE 2022

A Preliminary Study on the Assignment of GitHub Issues to Issue Commenters and the Relationship with Social Smells.*

International Conference on Cooperative and Human Aspects of Software Engineering (CHASE 2022), Pittsburgh, USA, 2022.

Conference Socio-Technical Analytics Empirical Software Engineering H. Mumtaz, C. Paradis, F. Palomba, D. Tamburri, R. Kazman, K. Blincoe.

Abstract. GitHub is the world's largest software hosting platform. Its features affect millions of developers. Investigating the impact of GitHub features on software teams is essential to gain insights into features' usefulness. As a preliminary step in this direction, this paper explores the relationship between the use of one GitHub feature and the social structure of the projects that adopt the feature. We explore whether the feature is used and whether the feature is associated with positive or negative changes in the team’s social structure. In this paper, we report on a preliminary study of 13 projects that used the GitHub "assign issues to issue commenters" feature. We examine the social smells in the software teams before and after the introduction of this new feature using statistical and temporal analysis. Our results indicate that the usage of this feature varied across the analyzed projects. We also find that social smells that reflect low or missing communications (Organizational Silo and Missing Links) decrease in most of the projects that used the feature consistently. The results suggest that the social structure of the teams has a positive relationship with the feature adoption. Still, future research should study the feature’s impact (and its use cases) on other aspects and over longer time periods to learn its diverse and long-term benefits on the social structure of software projects.

Download PDF

[C64] ICPC 2022

Regularity or Anomaly? On The Use of Anomaly Detection for Fine-Grained Just-in-Time Defect Prediction.*

IEEE/ACM International Conference on Program Comprehension (ICPC 2022), Pittsburgh, USA, 2022.

Fine-grained just-in-time defect prediction aims at identifying likely defective files within new commits pushed by developers onto a shared repository. Most of the techniques proposed in literature are based on supervised learning, where machine learning algorithms are fed with historical data. One of the limitations of these techniques is concerned with the use of imbalanced data that only contain a few defective samples to enable a proper learning phase. To overcome this problem, recent work has shown that anomaly detection methods can be used as an alternative to supervised learning, given that these do not necessarily need labelled samples. Download PDF

Conference Software Quality Empirical Software Engineering F. Lomio, L. Pascarella, F. Palomba, V. Lenarduzzi.

F. Lomio, L. Pascarella, F. Palomba, V. Lenarduzzi. Conference Software Quality Empirical Software Engineering

Abstract. Fine-grained just-in-time defect prediction aims at identifying likely defective files within new commits pushed by developers onto a shared repository. Most of the techniques proposed in literature are based on supervised learning, where machine learning algorithms are fed with historical data. One of the limitations of these techniques is concerned with the use of imbalanced data that only contain a few defective samples to enable a proper learning phase. To overcome this problem, recent work has shown that anomaly detection methods can be used as an alternative to supervised learning, given that these do not necessarily need labelled samples. We aim at assessing how anomaly detection methods can be employed for the problem of fine-grained just-in-time defect prediction. We conduct an empirical investigation on 32 open-source projects, designing and evaluating three anomaly detection methods for fine-grained just-in-time defect prediction. However, our results are negative because anomaly detection methods, taken alone, do not overcome the prediction performance of existing machine learning solutions.

Download PDF

[C63] ICSE 2022

Good Fences Make Good Neighbours? On the Impact of Cultural and Geographical Dispersion on Community Smells.*

IEEE/ACM International Conference on Software Engineering (ICSE 2022) - Software Engineering in Society Track, Pittsburgh, USA, 2022.

Software development is de facto a social activity that often involves people from all places to join forces globally. In such common instances, project managers must face social challenges, e.g., personality conflicts and language barriers, which often amount literally to "culture shock". In this paper, we seek to analyze and illustrate how cultural and geographical dispersion—that is, how much a community is diverse in terms of its members' cultural attitudes and geographical collocation—influence the emergence of collaboration and communication problems in open-source communities, a.k.a. community smells, the socio-technical precursors of unforeseen, often nasty organizational conditions amounting collectively to the phenomenon called social debt. Download PDF

Conference Socio-Technical Analytics Empirical Software Engineering S. Lambiase, G. Catolino, D. Tamburri, A. Serebrenik, F. Palomba, F. Ferrucci.

S. Lambiase, G. Catolino, D. Tamburri, A. Serebrenik, F. Palomba, F. Ferrucci. Conference Socio-Technical Analytics Empirical Software Engineering

Abstract. Software development is de facto a social activity that often involves people from all places to join forces globally. In such common instances, project managers must face social challenges, e.g., personality conflicts and language barriers, which often amount literally to "culture shock". In this paper, we seek to analyze and illustrate how cultural and geographical dispersion—that is, how much a community is diverse in terms of its members' cultural attitudes and geographical collocation—influence the emergence of collaboration and communication problems in open-source communities, a.k.a. community smells, the socio-technical precursors of unforeseen, often nasty organizational conditions amounting collectively to the phenomenon called social debt. We perform an extensive empirical study on cultural characteristics of GitHub developers, and build a regression model relating the two types of dispersion—cultural and geographical—with the emergence of four types of community smells, i.e., Organizational Silo, Lone Wolf, Radio Silence, and Black Cloud. Results indicate that cultural and geographical factors influence collaboration and communication within open-source communities, to an extent which incites—or even more interestingly mitigates, in some cases—community smells, e.g., Lone Wolf, in development teams. Managers can use these findings to address their own organizational structure and tentatively diagnose any nasty phenomena related to the conditions under study.

Download PDF

[C62] SANER 2022

Toward Understanding the Impact of Refactoring on Program Comprehension.*

IEEE International Conference on Software Analysis, Engineering, and Reengineering, Honolulu, Hawaii, USA, 2022.

Software refactoring is the activity associated with developers changing the internal structure of source code without modifying its external behavior. The literature argues that refactoring might have beneficial and harmful implications for software maintainability, primarily when performed without the support of automated tools. This paper continues the narrative on the effects of refactoring by exploring the dimension of program comprehension.

IEEE/TCSE Distinguished Paper Award

Download PDF

Conference Software Quality Empirical Software Engineering G. Sellitto, E. Iannone, Z. Codabux, V. Lenarduzzi, A. De Lucia, F. Palomba, F. Ferrucci.

G. Sellitto, E. Iannone, Z. Codabux, V. Lenarduzzi, A. De Lucia, F. Palomba, F. Ferrucci. Conference Software Quality Empirical Software Engineering

Abstract. Software refactoring is the activity associated with developers changing the internal structure of source code without modifying its external behavior. The literature argues that refactoring might have beneficial and harmful implications for software maintainability, primarily when performed without the support of automated tools. This paper continues the narrative on the effects of refactoring by exploring the dimension of program comprehension, namely the property that describes how easy it is for developers to understand source code. We start our investigation by assessing the basic unit of program comprehension, namely program readability. Next, we set up a large-scale empirical investigation – conducted on 156 open-source projects – to quantify the impact of refactoring on program readability. First, we mine refactoring data and, for each commit involving a refactoring, we compute (i) the amount and type(s) of refactoring actions performed and (ii) eight state-of-the-art program comprehension metrics. Afterwards, we build statistical models relating the various refactoring operations to each of the readability metrics considered to quantify the extent to which each refactoring impacts the metrics in either a positive or negative manner. The key results are that refactoring has a notable impact on most of the readability metrics considered.

Download PDF

[C61] SANER 2022

On the Evolution of Inheritance and Delegation Mechanisms and Their Impact on Code Quality.*

IEEE International Conference on Software Analysis, Engineering, and Reengineering, Honolulu, Hawaii, USA, 2022.

Source code reuse is considered one of the holy grails of modern software development. Indeed, it has been widely demonstrated that this activity decreases software development and maintenance costs while increasing its overall trustworthiness. The Object-Oriented (OO) paradigm provides different internal mechanisms to favor code reuse, i.e., specification inheritance, implementation inheritance, and delegation. Download PDF

Conference Software Quality Empirical Software Engineering G. Giordano, A. Fasulo, G. Catolino, F. Palomba, F. Ferrucci, C. Gravino.

G. Giordano, A. Fasulo, G. Catolino, F. Palomba, F. Ferrucci, C. Gravino. Conference Software Quality Empirical Software Engineering

Abstract. Source code reuse is considered one of the holy grails of modern software development. Indeed, it has been widely demonstrated that this activity decreases software development and maintenance costs while increasing its overall trustworthiness. The Object-Oriented (OO) paradigm provides different internal mechanisms to favor code reuse, i.e., specification inheritance, implementation inheritance, and delegation. While previous studies investigated how inheritance relations impact source code quality, there is still a lack of understanding of their evolutionary aspects and, more particularly, of how these mechanisms may impact source code quality over time. To bridge this gap of knowledge, this paper proposes an empirical investigation into the evolution of specification inheritance, implementation inheritance, and delegation and their impact on the variability of source code quality attributes. First, we assess how the implementation of those mechanisms varies over 15 releases of three software systems. Second, we devise a statistical approach with the aim of understanding how inheritance and delegation let source code quality—as indicated by the severity of code smells—vary in either a positive or a negative manner. The key results of the study indicate that inheritance and delegation evolve over time, but not in a statistically significant manner. At the same time, their evolution often leads code smell severity to be reduced, hence possibly contributing to improving code maintainability.

Download PDF

[C60] SANER 2022

Gender Diversity and Community Smells: A Double-Replication Study on Brazilian Software Teams.*

IEEE International Conference on Software Analysis, Engineering, and Reengineering, Honolulu, Hawaii, USA, 2022.

Social debts in software teams are gaining increasing attention from the research community due to their potential adverse effects on software quality. For instance, community smells are indicators of sub-optimal organizational structures and may well lead to the emergence of social debt. Previous studies analyzed which factors influence the emergence/mitigation of such smells. In particular, studies by Catolino et al. showed how factors related to team composition, particularly gender diversity, correlated to the mitigation of community smells. Download PDF

Conference Socio-Technical Analytics Empirical Software Engineering C. Sarmento, T. Massoni, A. Serebrenik, G. Catolino, D. Tamburri, F. Palomba.

C. Sarmento, T. Massoni, A. Serebrenik, G. Catolino, D. Tamburri, F. Palomba. Conference Socio-Technical Analytics Empirical Software Engineering

Abstract. Social debts in software teams are gaining increasing attention from the research community due to their potential adverse effects on software quality. For instance, community smells are indicators of sub-optimal organizational structures and may well lead to the emergence of social debt. Previous studies analyzed which factors influence the emergence/mitigation of such smells. In particular, studies by Catolino et al. showed how factors related to team composition, particularly gender diversity, correlated to the mitigation of community smells. However, a confirmation survey on 60 practitioners suggested that these results were not aligned with the experts’ perceptions. In addition, in a separate survey, Catolino et al. collected the most common team refactoring strategies for those community smells. In this work we replicate two studies by those authors, focusing on the Brazilian software teams; culture-specific expectations on the behavior of people of different genders might have affected the perception of the importance of gender diversity and refactoring strategies when mitigating community smells. We translated the survey instrument used by Catolino et al. to Brazilian Portuguese and recruited 184 Brazilian developers. Results did not show significant differences from the original study; indeed, participants perceived gender diversity as less valuable to mitigate community smells than such factors like experience or team size. Additionally, we performed a qualitative analysis of an open question within the questionnaire for the refactoring strategies. Brazilian developers agree with the original studies for most smells, mainly promoting restructuring communities, creating a communication plan and mentoring. We believe these results provide further evidence on the problem and its implications when managing software teams, avoiding technical debt and maintenance issues due to team communication and coordination problems.

Download PDF

[C59] QRS 2021

A Possibilistic Evolutionary Approach to Handle the Uncertainty of Software Metrics Thresholds in Code Smells Detection.*

IEEE International Conference on Software Quality, Reliability, and Security, Hainan Island, China, 2021.

A code smells detection rule is a combination of metrics with their corresponding crisp thresholds and labels. The goal of this paper is to deal with metrics' thresholds uncertainty; as usual, such thresholds could not be exactly determined to judge the smelliness of a particular software class. To deal with this issue, we first propose to encode each metric value into a binary possibility distribution with respect to a threshold computed from a discretization technique; using the Possibilistic C-means classifier. Download PDF

Conference Software Quality Empirical Software Engineering S. Boutaib, M. Elarbi, S. Bechikh, F. Palomba, L. Ben Said.

S. Boutaib, M. Elarbi, S. Bechikh, F. Palomba, L. Ben Said. Conference Software Quality Empirical Software Engineering

Abstract. A code smells detection rule is a combination of metrics with their corresponding crisp thresholds and labels. The goal of this paper is to deal with metrics' thresholds uncertainty; as usual, such thresholds could not be exactly determined to judge the smelliness of a particular software class. To deal with this issue, we first propose to encode each metric value into a binary possibility distribution with respect to a threshold computed from a discretization technique; using the Possibilistic C-means classifier. Then, we propose ADIPOK-UMT as an evolutionary algorithm that evolves a population of PK-NN classifiers for the detection of smells under thresholds' uncertainty. The experimental results reveal that the possibility distribution-based encoding allows the implicit weighting of software metrics (features) with respect to their computed discretization thresholds. Moreover, ADIPOK-UMT is shown to outperform four relevant state-of-art approaches on a set of commonly adopted benchmark software systems.

Download PDF

[C58] ICSE 2021

Understanding Community Smells Variability: A Statistical Approach.*

IEEE/ACM International Conference on Software Engineering (ICSE 2021) - Software Engineering in Society Track, Madrid, Spain, 2021.

Social debt has been defined as the presence in a project of costly sub-optimal organizational conditions, e.g., non-cohesive development communities whose members have communication or coordination issues. Community smells are indicators of such sub-optimal organizational structures and may well lead to social debt. Download PDF

Conference Socio-Technical Analytics Empirical Software Engineering G. Catolino, F. Palomba, D. Tamburri, A. Serebrenik.

G. Catolino, F. Palomba, D. Tamburri, A. Serebrenik. Conference Socio-Technical Analytics Empirical Software Engineering

Abstract. Social debt has been defined as the presence in a project of costly sub-optimal organizational conditions, e.g., non-cohesive development communities whose members have communication or coordination issues. Community smells are indicators of such sub-optimal organizational structures and may well lead to social debt. Recently, several studies analyzed factors affecting the presence of community smells and their harmfulness, or proposed refactoring strategies to mitigate them. However, to the best of our knowledge, there is still a limited understanding of the factors influencing the variability of community smells, namely how they increase/decrease in magnitude over time. In this paper, we aim at conducting the first statistical experimentation on the matter, by analyzing how a set of 40 socio-technical factors, e.g., turnover or communicability, impact the variability of four community smells on a dataset composed of 60 open-source communities. The results of the study reveal that communicability is, in most cases, important to reduce the risk of an increase of community smell instances, while broadening the collaboration network does not always have a positive effect.

Download PDF

[C57] ESEC/FSE 2020

tsDetect: An Open Source Test Smells Detection Tool.*

ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Sacramento (California), USA, 2020.

The test code, just like production source code, is subject to bad design and programming practices, also known as smells. The presence of test smells in a software project may affect the quality, maintainability, and extendability of test suites making them less effective in finding potential faults and quality issues in the project's production code. Download PDF

Conference Software Testing Software Quality A. Peruma, K. Almalki, C. Newman, M. Mkaouer, A. Ouni, F. Palomba.

A. Peruma, K. Almalki, C. Newman, M. Mkaouer, A. Ouni, F. Palomba. Conference Software Testing Empirical Software Engineering

Abstract. The test code, just like production source code, is subject to bad design and programming practices, also known as smells. The presence of test smells in a software project may affect the quality, maintainability, and extendability of test suites making them less effective in finding potential faults and quality issues in the project's production code. In this paper, we introduce tsDetect, an automated test smell detection tool for Java software systems that uses a set of detection rules to locate existing test smells in test code. We evaluate the effectiveness of tsDetect on a benchmark of 65 unit test files containing instances of 19 test smell types. Results show that tsDetect achieves a high detection accuracy with an average precision score of 96% and an average recall score of 97%. tsDetect is publicly available, with a demo video, at: https://testsmells.github.io/

Download PDF

[C56] ICSME 2020

Pizza versus Pinsa: On the Perception and Measurability of Unit Test Code Quality.*

IEEE International Conference on Software Maintenance and Evolution, Adelaide, Australia, 2020.

Test cases are an essential asset to evaluate software quality. The research community has provided various alternatives to help developers assess the quality of tests, like code or mutation coverage. Despite the effort spent so far, however, little is known on how practitioners perceive unit test code quality and whether the existing metrics reflect their perception. Download PDF

Conference Software Testing Empirical Software Engineering G. Grano, C. De Iaco, F. Palomba, H. Gall.

G. Grano, C. De Iaco, F. Palomba, H. Gall. Conference Software Testing Empirical Software Engineering

Abstract. Test cases are an essential asset to evaluate software quality. The research community has provided various alternatives to help developers assess the quality of tests, like code or mutation coverage. Despite the effort spent so far, however, little is known on how practitioners perceive unit test code quality and whether the existing metrics reflect their perception. This paper aims at addressing this gap of knowledge. We first conduct semi-structured interviews and surveys with practitioners to establish a taxonomy of relevant factors for unit test quality and collect a dataset of tests rated by developers based on their perceived quality. Then, we devise a statistical model to measure how the metrics available in literature reflect the perceived quality of test cases. The findings of our study show that readability and maintainability are the key aspects for developers to diagnose the outcome of test cases and drive debugging activities. On the contrary, code coverage metrics are necessary but not sufficient to evaluate the capability of tests. Finally, we discover that available metrics are effective in characterizing poor-quality tests, while limited when distinguishing high-quality ones.

Download PDF

[C55] ICSME 2020

The Making of Accessible Android Applications: An Empirical Study on the State of the Practice.*

IEEE International Conference on Software Maintenance and Evolution - Registered Report, Adelaide, Australia, 2020.

Nowadays, mobile applications represent the principal means to enable human interaction. Being so pervasive, these applications should be made usable for all users: accessibility collects the guidelines that developers should follow to include features allowing users with disabilities (e.g., visual impairments) to better interact with an application. Download PDF

Conference Software Quality Computer-Human Interaction M. Di Gregorio, D. Di Nucci, F. Palomba, G. Vitiello.

M. Di Gregorio, D. Di Nucci, F. Palomba, G. Vitiello. Conference Software Quality Computer-Human Interaction

Abstract. Context. Nowadays, mobile applications represent the principal means to enable human interaction. Being so pervasive, these applications should be made usable for all users: accessibility collects the guidelines that developers should follow to include features allowing users with disabilities (e.g., visual impairments) to better interact with an application. Problem. While research in this field is gaining interest, there is still a notable lack of knowledge on how developers practically deal with the problem: (i) whether they are aware and take accessibility guidelines into account when developing apps, (ii) which guidelines are harder for them to implement, and (iii) which tools they use to be supported in this task. Objective. To bridge the gap of knowledge on the state of the practice concerning the accessibility of mobile applications. Method. Adopting a mixed-method research approach, we aim to (i) verify how accessibility guidelines are implemented in mobile applications through a coding strategy and (ii) survey mobile developers on the issues and challenges of dealing with accessibility in practice. Limitations. Threats are represented by the size of the app sample and the number of answers to our survey study.

Download PDF

[C54] AVI 2020

VITRuM - A Plug-In for the Visualization of Test-Related Metrics.*

ACM International Conference on Advanced Visual Interfaces, Ischia, Italy, 2020.

Software testing is the first weapon against software faults, used by developers to preventively locate implementation errors in the exercised production code that may cause critical failures to the inner-working of software systems. According to recent findings, the effectiveness of testing might be not only due to its ability to cover the production code but also to some other properties, like code quality. Download PDF

Conference Software Testing Computer-Human Interaction F. Pecorelli, G. Di Lillo, F. Palomba, A. De Lucia.

F. Pecorelli, G. Di Lillo, F. Palomba, A. De Lucia. Conference Software Testing Computer-Human Interaction

Abstract. Software testing is the first weapon against software faults, used by developers to preventively locate implementation errors in the exercised production code that may cause critical failures to the inner-working of software systems. According to recent findings, the effectiveness of testing might be not only due to its ability to cover the production code but also to some other properties, like code quality. Among other aspects, the literature reported that an advanced visualization of test-related metrics, e.g., test code coverage on production code, turns out to be a key strength for developers when dealing with software faults. In this paper, we propose VITRuM (VIsualization of Test-Related Metrics), an IntelliJ plug-in able to provide developers with an advanced visual interface of both static and dynamic test-related metrics that has the potential of making them more able to diagnose production code faults. The plug-in is available in the official JetBrains Plugins Repository. A video showing the tool in action is available at https://youtu.be/kFE81eYPgUg.

Download PDF

[C53] AVI 2020

cASpER: A Plug-in for Automated Code Smell Detection and Refactoring.*

ACM International Conference on Advanced Visual Interfaces, Ischia, Italy, 2020.

During software evolution, code is inevitably subject to continuous changes that are often performed by developers within short and strict deadlines. As a consequence, good design practices are often sacrificed, possibly leading to the introduction of sub-optimal design or implementation solutions, the so-called code smells. Download PDF

Conference Software Quality Computer-Human Interaction M. De Stefano, M. Gambardella, F. Pecorelli, F. Palomba, A. De Lucia.

M. De Stefano, M. Gambardella, F. Pecorelli, F. Palomba, A. De Lucia. Conference Software Quality Computer-Human Interaction

Abstract. During software evolution, code is inevitably subject to continuous changes that are often performed by developers within short and strict deadlines. As a consequence, good design practices are often sacrificed, possibly leading to the introduction of sub-optimal design or implementation solutions, the so-called code smells. Several studies have shown that the presence of code smells makes the source code more change- and fault-prone, reduces productivity, and causes greater rework and more significant design efforts for developers. Refactoring is the practice that developers may use to remove code smells without changing the external behavior of the source code. However, it requires much time and effort and is poorly automated, often leading developers to prefer keeping low-quality code instead of spending time in designing and performing refactoring operations. To mitigate this problem and support developers throughout the process of code smell identification and refactoring, in this paper we present cASpER, an IntelliJ IDEA plugin that provides visual and semi-automatic support for detecting and refactoring four different types of code smells.

Download PDF

[C52] AVI 2020

Counterterrorism for Cyber-Physical Spaces: A Computer Vision Approach.*

ACM International Conference on Advanced Visual Interfaces, Ischia, Italy, 2020.

Simulating terrorist scenarios in cyber-physical spaces — that is, urban open or (semi-) closed spaces combined with a cyber-physical systems counterparts — is challenging given the context and variables therein. This paper addresses the aforementioned issue with Alter, a framework featuring computer vision and Generative Adversarial Neural Networks (GANs) over terrorist scenarios. Download PDF

Conference Computer-Human Interaction G. Cascavilla, J. Slabber, F. Palomba, D. Di Nucci, D. Tamburri, W.J. van den Heuvel.

G. Cascavilla, J. Slabber, F. Palomba, D. Di Nucci, D. Tamburri, W.J. van den Heuvel. Conference Computer-Human Interaction

Abstract. Simulating terrorist scenarios in cyber-physical spaces — that is, urban open or (semi-) closed spaces combined with a cyber-physical systems counterparts — is challenging given the context and variables therein. This paper addresses the aforementioned issue with Alter, a framework featuring computer vision and Generative Adversarial Neural Networks (GANs) over terrorist scenarios. We obtained the data for the terrorist scenarios by creating a synthetic dataset, exploiting the Grand Theft Auto V (GTAV) videogame, and the Unreal Game Engine behind it, in combination with OpenStreetMap data. The results of the proposed approach show its feasibility to predict criminal activities in cyber-physical spaces. Moreover, the usage of our synthetic scenarios elicited from GTAV is promising in building datasets for cybersecurity and Cyber-Threat Intelligence (CTI) featuring simulated videogaming platforms. We learned that local authorities can simulate terrorist scenarios for their own cities based on previous or related reference and this helps them in three ways: (1) better determine the necessary security measures; (2) better use the expertise of the authorities; (3) refine preparedness scenarios and drills for sensitive areas.

Download PDF

[C51] ICPC 2020

Just-In-Time Test Smell Detection and Refactoring: The DARTS Project.*

IEEE/ACM International Conference on Program Comprehension (ICPC 2020) - Tool Demo Track, Seoul, South Korea, 2020.

Test smells represent sub-optimal design or implementation solutions applied when developing test cases. Previous research has shown that these smells may decrease both maintainability and effectiveness of tests and, as such, researchers have been devising methods to automatically detect them. Download PDF

Conference Software Quality S. Lambiase, A. Cupito, F. Pecorelli, A. De Lucia, F. Palomba.

S. Lambiase, A. Cupito, F. Pecorelli, A. De Lucia, F. Palomba. Conference Software Quality

Abstract. Test smells represent sub-optimal design or implementation solutions applied when developing test cases. Previous research has shown that these smells may decrease both maintainability and effectiveness of tests and, as such, researchers have been devising methods to automatically detect them. Nevertheless, there is still a lack of tools that developers can use within their integrated development environment to identify test smells and refactor them. In this paper, we present DARTS (Detection And Refactoring of Test Smells), an IntelliJ plug-in which (1) implements a state-of-the-art detection mechanism to detect instances of three test smell types, i.e., General Fixture, Eager Test, and Lack of Cohesion of Test Methods, at commit-level and (2) enables their automated refactoring through the integrated APIs provided by IntelliJ.

Download PDF

[C50] ICPC 2020

Refactoring Android-specific Energy Smells: A Plugin for Android Studio.*

IEEE/ACM International Conference on Program Comprehension (ICPC 2020) - Tool Demo Track, Seoul, South Korea, 2020.

Conference Mobile Apps Evolution Software Quality E. Iannone, F. Pecorelli, D. Di Nucci, F. Palomba, A. De Lucia.

Abstract. Mobile applications are major means to perform daily actions, including social and emergency connectivity. However, their usability is threatened by energy consumption that may be impacted by code smells, i.e., symptoms of bad implementation and design practices. In particular, researchers derived a set of mobile-specific code smells resulting in increased energy consumption of mobile apps and removing such smells through refactoring can mitigate the problem. In this paper, we extend and revise aDoctor, a tool that we previously implemented to identify energy-related smells. On the one hand, we present and implement automated refactoring solutions to those smells. On the other hand, we make the tool completely open-source and available in Android Studio as a plugin published in the official store. The video showing the tool in action is available at: https://www.youtube.com/watch?v=1c2EhVXiKis

Download PDF

[C49] ICPC 2020

OpenSZZ: A Free, Open-Source, Web-Accessible Implementation of the SZZ Algorithm.*

IEEE/ACM International Conference on Program Comprehension (ICPC 2020) - Tool Demo Track, Seoul, South Korea, 2020.

Conference Software Quality V. Lenarduzzi, F. Palomba, D. Taibi, D. Tamburri.

Abstract. The accurate identification of defect-inducing commits represents a key problem for researchers interested in studying the naturalness of defects and defining defect prediction models. To tackle this problem, software engineering researchers have relied on and proposed several implementations of the well-known Sliwerski-Zimmermann-Zeller (SZZ) algorithm. Despite its popularity and wide usage, no open-source, publicly available, and web-accessible implementation of the algorithm has been proposed so far. In this paper, we prototype and make available one such implementation for further use by practitioners and researchers alike. The evaluation of the proposed prototype showed competitive results and lays the foundation for future work. This paper outlines our prototype, illustrating its usage and reporting on its evaluation in action.

Download PDF

[C48] MSR 2020

Developer-Driven Code Smell Prioritization.*

IEEE/ACM International Conference on Mining Software Repositories (MSR 2020), Seoul, South Korea, 2020.

Conference Software Quality Empirical Software Engineering F. Pecorelli, F. Palomba, F. Khomh, A. De Lucia.

Abstract. Code smells are symptoms of poor implementation choices applied during software evolution. While previous research has devoted effort to the definition of automated solutions to detect them, still little is known on how to support developers when prioritizing them. Some works attempted to deliver solutions that can rank smell instances based on their severity, computed on the basis of software metrics. However, this may not be enough since it has been shown that the recommendations provided by current approaches do not take the developer's perception of design issues into account. In this paper, we perform a first step toward the concept of developer-driven code smell prioritization and propose an approach based on machine learning able to rank code smells according to the perceived criticality that developers assign to them. We evaluate our technique in an empirical study to investigate its accuracy and the features that are more relevant for classifying the developer's perception. Finally, we compare our approach with a state-of-the-art technique. Key findings show that our solution has an F-Measure up to 85% and outperforms the baseline approach.

Download PDF

[C47] ICPC 2020

Testing of Mobile Applications in the Wild: A Large-Scale Empirical Study on Android Apps.*

IEEE/ACM International Conference on Program Comprehension (ICPC 2020), Seoul, South Korea, 2020.

Conference Mobile Apps Evolution Empirical Software Engineering F. Pecorelli, G. Catolino, F. Ferrucci, A. De Lucia, F. Palomba.

Abstract. Nowadays, mobile applications (a.k.a., apps) are used by over two billion users for every type of need, including social and emergency connectivity. Their pervasiveness in today's world has inspired the software testing research community to devise approaches that allow developers to better test their apps and improve the quality of the tests being developed. In spite of this research effort, we still notice a lack of empirical studies aiming at assessing the actual quality of test cases developed by mobile developers: this perspective could provide evidence-based findings on the current status of testing in the wild as well as on the future research directions in the field. As such, we performed a large-scale empirical study targeting 1,780 open-source Android apps and aiming at assessing (1) the extent to which these apps are actually tested, (2) how well-designed the available tests are, and (3) what their effectiveness is. The key results of our study show that mobile developers still tend not to properly test their apps. Furthermore, we discovered that the test cases of the considered apps have a low (i) design quality, both in terms of test code metrics and test smells, and (ii) effectiveness when considering code coverage as well as assertion density.

Download PDF

[C46] ICSE 2020

Refactoring Community Smells in the Wild: The Practitioner's Field Manual.*

IEEE/ACM International Conference on Software Engineering (ICSE 2020) - Software Engineering in Society Track, Seoul, South Korea, 2020.

Conference Socio-Technical Analytics Empirical Software Engineering G. Catolino, F. Palomba, D. Tamburri, A. Serebrenik, F. Ferrucci.

Abstract. Community smells have been defined as sub-optimal organizational structures that may lead to social debt. Previous studies have shown that they are highly diffused in both open- and closed-source projects, are perceived as harmful by practitioners, and can even lead to the introduction of technical debt in source code. Despite the presence of this body of research, little is known on the practitioners’ perceived prominence of community smells in practice as well as on the strategies adopted to deal with them. This paper aims at bridging this gap by proposing an empirical study in which 76 software practitioners are surveyed on (i) the prominence of four well-known community smells, i.e., Organizational Silo, Black Cloud, Lone Wolf, and Radio Silence, in their contexts and (ii) the methods they adopted to "refactor" them. Our results first reveal that community smells frequently manifest themselves in software projects and, more importantly, that there exist specific refactoring practices to deal with each of the considered community smells.

Download PDF

[C45] CHI 2020

UI Dark Patterns and Where to Find Them: A Study on Mobile Applications and User Perception.*

38th ACM CHI Conference on Human Factors in Computing Systems, Honolulu (Hawaii), USA, 2020.

Conference Mobile Apps Evolution Computer-Human Interaction L. Di Geronimo, L. Braz, E. Fregnan, F. Palomba, A. Bacchelli.

Abstract. A Dark Pattern (DP) is an interface maliciously crafted to deceive users into performing actions they did not mean to do. Although design experts have reported on DPs extensively, little effort has been made to study how pervasive they are, especially in mobile applications. In this work, we analyze DPs in 240 popular apps and conduct an online study with 589 users on how they perceive DPs. The results of the analysis showed that 95% of apps contain one or more forms of DPs and, on average, popular applications include at least seven different types of deceiving UIs. The online study shows that most users do not recognize DPs, and they would change their behavior on app usage once informed about them. We discuss the impact of our work and what measures could be applied to alleviate malicious design issues.

Download PDF

[C44] CASCON 2019

On the Distribution of Test Smells in Open Source Android Applications: An Exploratory Study.*

29th International Conference on Computer Science and Software Engineering, Ontario, Canada, 2019.

Conference Software Testing Empirical Software Engineering A. Peruma, K. Almalki, C. Newman, M. Mkaouer, A. Ouni, F. Palomba.

Abstract. The impact of bad programming practices, such as code smells, in production code has been the focus of numerous studies in software engineering. Like production code, unit tests are also affected by bad programming practices which can have a negative impact on the quality and maintenance of a software system. While several studies addressed code and test smells in desktop applications, there is little knowledge of test smells in the context of mobile applications. In this study, we extend the existing catalog of test smells by identifying and defining new smells and survey over 40 developers who confirm that our proposed smells are bad programming practices in test suites. Additionally, we perform an empirical study on the occurrences and distribution of the proposed smells on 656 open-source Android apps. Our findings show a widespread occurrence of test smells in apps. We also show that apps tend to exhibit test smells early in their lifetime with different degrees of co-occurrences on different smell types. This empirical study demonstrates that test smells can be used as an indicator for necessary preventive software maintenance for test suites.

Download PDF

[C43] ICSME 2019

How the Experience of Development Teams Relates to Assertion Density of Test Classes.*

35th IEEE International Conference on Software Maintenance and Evolution (ICSME), Cleveland, USA, 2019.

Conference Software Testing Empirical Software Engineering G. Catolino, F. Palomba, A. Zaidman, F. Ferrucci.

Abstract. The impact of developers’ experience on several development practices has been widely investigated in the past. One of the most promising research fields is software testing, as many researchers found significant correlations between developers’ experience and testing effectiveness. In this paper, we aim at further studying this relation, by focusing on how development teams’ experience is associated with the assertion density, i.e., the number of assertions per test class KLOC, that has previously been shown as an effective way to decrease fault density. We perform a mixed-methods empirical study. First, we devise a statistical model relating development teams’ experience and other control factors to the assertion density of test classes belonging to 12 software projects. This model enables us to investigate whether experience comes out as a statistically significant factor to explain assertion density. Second, we contrast the statistical findings with a survey study conducted with 57 developers, who were asked their opinions on how developer’s experience is related to the way they add assertions in test code. Our findings suggest the existence of a relationship: on the one hand, the development team’s experience is a statistically significant factor in most of the systems that we have investigated; on the other hand, developers confirm the importance of experience and team composition for the effective testing of production code.
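
The assertion density metric defined in the abstract (assertions per test class KLOC) can be illustrated with a small worked computation. This is only a minimal sketch of the definition as stated above, not the measurement tooling used in the study; the function name and the example numbers are hypothetical.

```python
def assertion_density(num_assertions: int, test_class_loc: int) -> float:
    """Return the number of assertions per thousand lines (KLOC) of a test class.

    Hypothetical helper: it only restates the definition given in the abstract.
    """
    if test_class_loc <= 0:
        raise ValueError("test_class_loc must be positive")
    return num_assertions / (test_class_loc / 1000)


# Hypothetical test class: 45 assertions spread over 1,500 lines of test code.
density = assertion_density(45, 1500)
print(f"{density} assertions per KLOC")
```

Normalizing by KLOC makes test classes of different sizes comparable, which is presumably why the metric is expressed per thousand lines rather than as a raw count.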

Download PDF

[C42] ICSME 2019

Adoption, Support, and Challenges of Infrastructure-as-Code: Insights from Industry.*

35th IEEE International Conference on Software Maintenance and Evolution (ICSME), Industrial Track, Cleveland, USA, 2019.

Conference Empirical Software Engineering M. Guerriero, M. Garriga, D. A. Tamburri, F. Palomba.

Abstract. Infrastructure-as-code (IaC) is the DevOps tactic of managing and provisioning infrastructure through machine-readable definition files, rather than physical hardware configuration or interactive configuration tools. From a maintenance and evolution perspective, the topic has piqued the interest of practitioners and academics alike, given the relative scarcity of supporting patterns, best practices, tools, and software engineering techniques. Using the data coming from 44 semi-structured interviews in as many companies, in this paper we shed light on the state of the practice in the adoption of IaC and the key software engineering challenges in the field. Particularly, we investigate (i) how practitioners adopt and develop IaC, (ii) which support is currently available, i.e., the typically used tools and their advantages/disadvantages, and (iii) what the practitioners’ needs are when dealing with IaC development, maintenance, and evolution. Our findings clearly highlight the need for more research in the field: the support provided by currently available tools is still limited, and developers feel the need for novel techniques for testing and maintaining IaC code.

Download PDF

[C41] ESEC/FSE 2019 Recommended

Understanding Flaky Tests: The Developer's Perspective.*

27th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Tallinn, Estonia, 2019.

Conference Software Testing Empirical Software Engineering M. Eck, F. Palomba, M. Castelluccio, A. Bacchelli.

Abstract. Flaky tests are software tests that exhibit a seemingly random outcome (pass or fail) when run against the same, identical code. Previous work has examined fixes to flaky tests and has proposed automated solutions to locate as well as fix flaky tests; we complement it by examining the perceptions of software developers about the nature, relevance, and challenges of this phenomenon. We asked 21 professional developers to classify 200 flaky tests they previously fixed, in terms of the nature of the flakiness, the origin of the flakiness, and the fixing effort. We complement this analysis with information about the fixing strategy. Subsequently, we conducted an online survey with 121 developers with a median industrial programming experience of five years. Our research shows that: (1) flakiness is due to several different causes, four of which have never been reported before, despite being the most costly to fix; (2) flakiness is perceived as significant by the vast majority of developers, regardless of their team’s size and project’s domain, and it can have effects on resource allocation, scheduling, and the perceived reliability of the test suite; and (3) the challenges developers report facing mostly concern the reproduction of the flaky behavior and the identification of the cause of the flakiness.

Download PDF BibTeX

@article{eck2019understanding,
  title={Understanding Flaky Tests: The Developer’s Perspective},
  author={Eck, Moritz and Palomba, Fabio and Castelluccio, Marco and Bacchelli, Alberto},
  year={2019}
}

[C40] MSR 2019

On the Effectiveness of Manual and Automatic Unit Test Generation: Ten Years Later.*

IEEE/ACM Working Conference on Mining Software Repositories (MSR 2019), Montreal, Canada, 2019.

Conference Software Testing Empirical Software Engineering D. Serra, G. Grano, F. Palomba, F. Ferrucci, H. Gall, A. Bacchelli.

Abstract. Good unit tests play a paramount role when it comes to fostering and evaluating software quality. However, writing effective tests is an extremely costly and time-consuming practice. To reduce such a burden for developers, researchers devised ingenious techniques to automatically generate test suites for existing code bases. Nevertheless, how automatically generated test cases fare against manually written ones is an open research question. In 2008, Bacchelli et al. conducted an initial case study comparing automatic and manually generated test suites. Since in the last ten years we have witnessed a huge amount of work on novel approaches and tools for automatic test generation, in this paper we revise their study using current tools as well as complementing their research method by evaluating these tools’ ability to find regressions.

Download PDF BibTeX

@inproceedings{serra2019effectiveness,
  title={On the effectiveness of manual and automatic unit test generation: ten years later},
  author={Serra, Domenico and Grano, Giovanni and Palomba, Fabio and Ferrucci, Filomena and Gall, Harald C and Bacchelli, Alberto},
  booktitle={Proceedings of the 16th International Conference on Mining Software Repositories},
  pages={121--125},
  year={2019},
  organization={IEEE Press}
}

[C39] ICPC 2019

Comparing Machine Learning and Heuristic Approaches for Metric-Based Code Smell Detection.*

IEEE/ACM International Conference on Program Comprehension (ICPC 2019), Montreal, Canada, 2019.

Conference Software Quality Empirical Software Engineering F. Pecorelli, F. Palomba, D. Di Nucci, A. De Lucia.

Abstract. Code smells represent poor implementation choices performed by developers when enhancing source code. Their negative impact on source code maintainability and comprehensibility has been widely shown in the past and several techniques to automatically detect them have been devised. Most of these techniques are based on heuristics, namely they compute a set of code metrics and combine them by creating detection rules; while they have a reasonable accuracy, a recent trend is represented by the use of machine learning where code metrics are used as predictors of the smelliness of code artefacts. Despite the recent advances in the field, there is still a noticeable lack of knowledge of whether machine learning can actually be more accurate than traditional heuristic-based approaches. To fill this gap, in this paper we propose a large-scale study to empirically compare the performance of heuristic-based and machine-learning-based techniques for metric-based code smell detection. We consider five code smell types and compare machine learning models with DECOR, a state-of-the-art heuristic-based approach. Key findings emphasize the need for further research aimed at improving the effectiveness of both machine learning and heuristic approaches for code smell detection: while DECOR generally achieves better performance than a machine learning baseline, its precision is still too low to make it usable in practice.

Download PDF

[C38] ICSE 2019

Gender Diversity and Women in Software Teams: How Do They Affect Community Smells?*

IEEE/ACM International Conference on Software Engineering (ICSE 2019) - Software Engineering in Society Track, Montreal, Canada, 2019.

Invited for the Special Issue

Conference Socio-Technical Analytics Empirical Software Engineering G. Catolino, F. Palomba, D. A. Tamburri, A. Serebrenik, F. Ferrucci.

Abstract. As social as software engineers are, there is a known and established gender imbalance in our community structures, regardless of their open- or closed-source nature. To shed light on the actual benefits of achieving such balance, this empirical study looks into the relations between such balance and the occurrence of community smells, that is, sub-optimal circumstances and patterns across the software organizational structure. Examples of community smells are Organizational Silo effects (overly disconnected sub-groups) or Lone Wolves (defiant community members). Results indicate that the presence of women generally reduces the amount of community smells. We conclude that women are instrumental to reducing community smells in software development teams.

Download PDF BibTeX

@inproceedings{catolino2019gender,
  title={Gender diversity and women in software teams: How do they affect community smells?},
  author={Catolino, Gemma and Palomba, Fabio and Tamburri, Damian A and Serebrenik, Alexander and Ferrucci, Filomena},
  booktitle={Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Society},
  pages={11--20},
  year={2019},
  organization={IEEE Press}
}

[C37] ICSE 2019

Test-Driven Code Review: An Empirical Study.*

IEEE/ACM International Conference on Software Engineering (ICSE 2019), Montreal, Canada, 2019.

Conference Software Testing Empirical Software Engineering D. Spadini, F. Palomba, T. Baum, S. Hanenberg, M. Bruntink, A. Bacchelli.

Abstract. Test-Driven Code Review (TDR) is a code review practice in which a reviewer inspects a patch by examining the changed test code before the changed production code. Although this practice has been mentioned positively by practitioners in informal literature and interviews, there is no systematic knowledge on its effects, prevalence, problems, and advantages. In this paper, we aim at empirically understanding whether this practice has an effect on code review effectiveness and how developers perceive TDR. We conduct (i) a controlled experiment with 93 developers that perform more than 150 reviews, and (ii) 9 semi-structured interviews and a survey with 103 respondents to gather information on how TDR is perceived. Key results from the experiment show that developers adopting TDR find the same proportion of defects in production code, but more in test code, at the expense of fewer maintainability issues in production code. Furthermore, we found that most developers prefer to review production code as they deem it more important and tests should follow from it. Moreover, widespread poor test code quality and no tool support hinder the adoption of TDR.

Download PDF

[C36] CSCW 2018

Information Needs in Contemporary Code Review.*

ACM Conference on Computer Supported Cooperative Work (CSCW 2018), New York, USA, 2018.

CSCW 2018 Best Paper Honorable Mention

Conference Software Quality L. Pascarella, D. Spadini, F. Palomba, M. Bruntink, A. Bacchelli.

Abstract. Contemporary code review is a widespread practice used by software engineers to maintain high software quality and share project knowledge. However, conducting proper code review takes time and developers often have limited time for review. In this paper, we aim at investigating the information that reviewers need to conduct a proper code review, to better understand this process and how research and tool support can make developers become more effective and efficient reviewers. Previous work has provided evidence that a successful code review process is one in which reviewers and authors actively participate and collaborate. In these cases, the threads of discussions that are saved by code review tools are a precious source of information that can be later exploited for research and practice. In this paper, we focus on this source of information as a way to gather reliable data on the aforementioned reviewers’ needs. We manually analyze 900 code review comments from three large open-source projects and organize them in categories by means of a card sort. Our results highlight the presence of seven high-level information needs, such as knowing the uses of methods and variables declared/modified in the code under review. Based on these results we suggest ways in which future code review tools can better support collaboration and the reviewing task.

Download PDF BibTeX

@article{pascarella2018information,
  title={Information needs in contemporary code review},
  author={Pascarella, Luca and Spadini, Davide and Palomba, Fabio and Bruntink, Magiel and Bacchelli, Alberto},
  journal={Proceedings of the ACM on Human-Computer Interaction},
  volume={2},
  number={CSCW},
  pages={135},
  year={2018},
  publisher={ACM}
}

[C35] ASE 2018

Mining File Histories: Should We Consider Branches?*

International Conference on Automated Software Engineering (ASE 2018), Montpellier, France, 2018.

Conference Empirical Software Engineering V. Kovalenko, F. Palomba, A. Bacchelli.

Abstract. Modern distributed version control systems, such as Git, offer support for branching — the possibility to develop parts of software outside the master trunk. Consideration of the repository structure in Mining Software Repository (MSR) studies requires a thorough approach to mining, but there is no well-documented, widespread methodology regarding the handling of merge commits and branches. Moreover, there is still a lack of knowledge of the extent to which considering branches during MSR studies impacts the results of the studies. In this study, we set out to evaluate the importance of proper handling of branches when calculating file modification histories. We analyze over 1,400 Git repositories of four open source ecosystems and compute modification histories for over two million files, using two different algorithms. One algorithm only follows the first parent of each commit when traversing the repository, the other returns the full modification history of a file across all branches. We show that the two algorithms consistently deliver different results, but the scale of the difference varies across projects and ecosystems. Further, we evaluate the importance of accurate mining of file histories by comparing the performance of common techniques that rely on file modification history — reviewer recommendation, change recommendation, and defect prediction — for two algorithms of file history retrieval. We find that considering full file histories leads to an increase in the techniques’ performance that is rather modest.
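
The difference between the two traversal strategies described in the abstract can be sketched on a toy commit graph. This is a minimal illustration under stated assumptions, not the authors' implementation: the `PARENTS` and `TOUCHED` dictionaries, the commit ids, and both function names are hypothetical, and a real study would read this information from a Git repository.

```python
from typing import Dict, List, Set

# Hypothetical commit graph: commit id -> parent ids (first parent listed first).
PARENTS: Dict[str, List[str]] = {
    "c1": [],
    "c2": ["c1"],        # commit on the main line
    "c3": ["c1"],        # commit on a side branch
    "c4": ["c2", "c3"],  # merge commit; its first parent is c2
}

# Hypothetical mapping: commit id -> files modified by that commit.
TOUCHED: Dict[str, Set[str]] = {
    "c1": {"a.py"},
    "c2": {"b.py"},
    "c3": {"a.py"},
    "c4": set(),
}


def first_parent_history(head: str, path: str) -> List[str]:
    """Follow only the first parent of each commit, newest to oldest."""
    history = []
    commit = head
    while commit is not None:
        if path in TOUCHED[commit]:
            history.append(commit)
        parents = PARENTS[commit]
        commit = parents[0] if parents else None
    return history


def full_history(head: str, path: str) -> List[str]:
    """Visit every commit reachable from head, across all branches."""
    seen: Set[str] = set()
    stack = [head]
    history = []
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        if path in TOUCHED[commit]:
            history.append(commit)
        stack.extend(PARENTS[commit])
    return sorted(history)  # sorted by id only to make the output deterministic


print(first_parent_history("c4", "a.py"))  # misses the change made on the side branch
print(full_history("c4", "a.py"))
```

In this toy graph the first-parent traversal reports only `c1` for `a.py`, while the full traversal also finds `c3`, the modification made on the side branch. That is the kind of discrepancy between the two algorithms that the study quantifies.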

Download PDF BibTeX

@inproceedings{kovalenko2018mining,
  title={Mining file histories: should we consider branches?},
  author={Kovalenko, Vladimir and Palomba, Fabio and Bacchelli, Alberto},
  booktitle={Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering},
  pages={202--213},
  year={2018},
  organization={ACM}
}

[C34] ASE 2018

Continuous Code Quality: Are We (Really) Doing That?*

International Conference of Automated Software Engineering (ASE 2018), Montpellier, France, 2018.

Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Download PDF

Conference Software Quality Empirical Software Engineering C. Vassallo, F. Palomba, A. Bacchelli, H. Gall.

Continuous Code Quality: Are We (Really) Doing That?*

C. Vassallo, F. Palomba, A. Bacchelli, H. Gall. Conference Software Quality Empirical Software Engineering

Abstract. Continuous Integration (CI) is a software engineering practice where developers constantly integrate their changes to a project through an automated build process. The goal of CI is to provide developers with prompt feedback on several quality dimensions after each change. Indeed, previous studies provided empirical evidence on a positive association between properly following CI principles and source code quality. A core principle behind CI, Continuous Code Quality (also known as CCQ, which includes automated testing and automated code inspection), may appear simple and effective, yet we know little about its practical adoption. In this paper, we propose a preliminary empirical investigation aimed at understanding how rigorously practitioners follow CCQ. Our study reveals a strong dichotomy between theory and practice: developers do not perform continuous inspection but rather control for quality only at the end of a sprint and most of the time only on the release branch.

Download PDF BibTeX

@inproceedings{vassallo2018continuous,
  title={Continuous code quality: are we (really) doing that?},
  author={Vassallo, Carmine and Palomba, Fabio and Bacchelli, Alberto and Gall, Harald C},
  booktitle={Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering},
  pages={790--795},
  year={2018},
  organization={ACM}
}

[C33] ICSME 2018

Continuous Refactoring in CI: A Preliminary Study On the Perceived Advantages and Barriers*

International Conference of Software Maintenance and Evolution (ICSME 2018), Madrid, Spain, 2018.

By definition, the practice of Continuous Integration (CI) promotes continuous software quality improvement. In systems adopting such a practice, quality assurance is usually performed by using static and dynamic analysis tools (e.g., SonarQube) that compute overall metrics such as maintainability or reliability measures. Download PDF

Conference Software Quality Empirical Software Engineering C. Vassallo, F. Palomba, H. Gall.

Continuous Refactoring in CI: A Preliminary Study On the Perceived Advantages and Barriers*

C. Vassallo, F. Palomba, H. Gall. Conference Software Quality Empirical Software Engineering

Abstract. By definition, the practice of Continuous Integration (CI) promotes continuous software quality improvement. In systems adopting such a practice, quality assurance is usually performed by using static and dynamic analysis tools (e.g., SonarQube) that compute overall metrics such as maintainability or reliability measures. Furthermore, developers usually define quality gates, i.e., source code quality thresholds that must be reached by the software product after every newly committed change. If a quality gate fails (e.g., a maintainability metric is below a certain threshold), developers should refactor the code, possibly addressing some of the proposed warnings. While previous research findings showed that refactoring is often not done in practice, it is still unclear whether and how the adoption of a CI philosophy has changed the way developers perceive and adopt refactoring. In this paper, we preliminarily study, through a survey involving 31 developers, how developers perform refactoring in CI, which needs they have, and the barriers they face while continuously refactoring source code.

Download PDF BibTeX

@inproceedings{vassallo2018continuous,
  title={Continuous refactoring in ci: A preliminary study on the perceived advantages and barriers},
  author={Vassallo, Carmine and Palomba, Fabio and Gall, Harald C},
  booktitle={2018 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={564--568},
  year={2018},
  organization={IEEE}
}

[C32] ICSME 2018

On The Relation of Test Smells to Software Code Quality*

International Conference of Software Maintenance and Evolution (ICSME 2018), Madrid, Spain, 2018.

Test smells are sub-optimal design choices in the implementation of test code. As reported by recent studies, their presence might not only negatively affect the comprehension of test suites, but can also lead to test cases being less effective in finding bugs in production code. Download PDF

Conference Software Testing Empirical Software Engineering D. Spadini, F. Palomba, A. Zaidman, M. Bruntink, A. Bacchelli.

On The Relation of Test Smells to Software Code Quality*

D. Spadini, F. Palomba, A. Zaidman, M. Bruntink, A. Bacchelli. Conference Software Testing Empirical Software Engineering

Abstract. Test smells are sub-optimal design choices in the implementation of test code. As reported by recent studies, their presence might not only negatively affect the comprehension of test suites, but can also lead to test cases being less effective in finding bugs in production code. Although important steps have been made toward understanding test smells, there is still a notable absence of studies assessing their association with software quality. In this paper, we investigate the relationship between the presence of test smells and the change- and defect-proneness of test code, as well as the defect-proneness of the production code being tested. To this aim, we collect data pertaining to 221 releases of ten software systems and we analyze more than a million test cases to investigate the association of six test smells and their co-occurrence with software quality. Key results of our study include: (i) tests with smells are more change- and defect-prone, (ii) ‘Indirect Testing’, ‘Eager Test’, and ‘Assertion Roulette’ are the most significant smells for change-proneness, and (iii) production code is more defect-prone when tested by smelly tests.

Download PDF BibTeX

@inproceedings{spadini2018relation,
  title={On the relation of test smells to software code quality},
  author={Spadini, Davide and Palomba, Fabio and Zaidman, Andy and Bruntink, Magiel and Bacchelli, Alberto},
  booktitle={2018 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={1--12},
  year={2018},
  organization={IEEE}
}

[C31] ICSME 2018

Automatic Test Smell Detection Using Information Retrieval Techniques*

International Conference of Software Maintenance and Evolution (ICSME 2018), Madrid, Spain, 2018.

Software testing is a key activity to control the reliability of production code. Unfortunately, the effectiveness of test cases can be threatened by the presence of faults. Recent work showed that static indicators can be exploited to identify test-related issues. Download PDF

Conference Software Testing Empirical Software Engineering F. Palomba, A. Zaidman, A. De Lucia.

Automatic Test Smell Detection Using Information Retrieval Techniques*

F. Palomba, A. Zaidman, A. De Lucia. Conference Software Testing Empirical Software Engineering

Abstract. Software testing is a key activity to control the reliability of production code. Unfortunately, the effectiveness of test cases can be threatened by the presence of faults. Recent work showed that static indicators can be exploited to identify test-related issues. In particular, test smells, i.e., sub-optimal design choices applied by developers when implementing test cases, have been shown to be related to test case effectiveness. While some approaches for the automatic detection of test smells have been proposed so far, they generally suffer from poor performance: as a consequence, current detectors cannot properly support developers when diagnosing the quality of test cases. In this paper, we aim at making a step ahead toward the automated detection of test smells by devising a novel textual-based detector, coined TASTE (Textual AnalySis for Test smEll detection), with the aim of evaluating the usefulness of textual analysis for detecting three test smell types: General Fixture, Eager Test, and Lack of Cohesion of Methods. We evaluate TASTE in an empirical study that involves a manually-built dataset composed of 494 test smell instances belonging to 12 software projects, comparing the capabilities of our detector with those of two code metrics-based techniques proposed by Van Rompaey et al. and Greiler et al. Our results show that the structural-based detection applied by existing approaches cannot identify most of the test smells in our dataset, while TASTE is up to 44% more effective. Finally, we find that textual and structural approaches can identify different sets of test smells, thereby indicating complementarity.

Download PDF BibTeX

@inproceedings{palomba2018automatic,
  title={Automatic test smell detection using information retrieval techniques},
  author={Palomba, Fabio and Zaidman, Andy and De Lucia, Andrea},
  booktitle={2018 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={311--322},
  year={2018},
  organization={IEEE}
}

[C30] SANER 2018

BECLoMA: Augmenting Stack Traces with User Review Information.*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2018) - Formal Tool Demo, Campobasso, Italy.

Mobile devices such as smartphones, tablets and wearables are changing the way we do things, radically modifying our approach to technology. To sustain the high competition characterizing the mobile market, developers need to deliver high quality applications in a short release cycle.

SANER 2018 Best Tool Demo Paper Award

Download PDF

Conference Mobile Apps Evolution Tool Demo L. Pelloni, G. Grano, A. Ciurumelea, S. Panichella, F. Palomba, H. Gall.

BECLoMA: Augmenting Stack Traces with User Review Information.*

L. Pelloni, G. Grano, A. Ciurumelea, S. Panichella, F. Palomba, H. Gall. Conference Mobile Apps Evolution Tool Demo

Abstract. Mobile devices such as smartphones, tablets and wearables are changing the way we do things, radically modifying our approach to technology. To sustain the high competition characterizing the mobile market, developers need to deliver high quality applications in a short release cycle. To reveal and fix bugs as soon as possible, researchers and practitioners proposed tools to automate the testing process. However, such tools generate a high number of redundant inputs, lack contextual information, and produce reports that are difficult to analyze. In this context, the content of user reviews represents an unmatched source for developers seeking defects in their applications. However, no prior work explored the adoption of information available in user reviews for testing purposes. In this demo we present BECLoMA, a tool to enable the integration of user feedback in the testing process of mobile apps. BECLoMA links information from testing tools and user reviews, presenting to developers an augmented testing report combining stack traces with user review information referring to the same crash. We show that BECLoMA not only facilitates the diagnosis and fix of app bugs, but also presents additional benefits: it eases the usage of testing tools and automates the analysis of user reviews from the Google Play Store.

Download PDF BibTeX

@inproceedings{pelloni2018becloma,
  title={Becloma: Augmenting stack traces with user review information},
  author={Pelloni, Lucas and Grano, Giovanni and Ciurumelea, Adelina and Panichella, Sebastiano and Palomba, Fabio and Gall, Harald C},
  booktitle={2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={522--526},
  year={2018},
  organization={IEEE}
}

[C29] ICPC 2018

Do Developers Update Third-Party Libraries in Mobile Apps?*

International Conference on Program Comprehension (ICPC 2018), Gothenburg, Sweden, 2018.

One of the most common strategies to develop new software is to take advantage of existing source code, which is available in comprehensive packages called third-party libraries. As for all software systems, even these libraries change to offer new functionalities and fix bugs or security issues.

Invited for the Special Issue

Download PDF

Conference Mobile Apps Evolution Empirical Software Engineering P. Salza, F. Palomba, D. Di Nucci, C. D'Uva, F. Ferrucci, A. De Lucia.

Do Developers Update Third-Party Libraries in Mobile Apps?*

P. Salza, F. Palomba, D. Di Nucci, C. D'Uva, F. Ferrucci, A. De Lucia. Conference Mobile Apps Evolution Empirical Software Engineering

Abstract. One of the most common strategies to develop new software is to take advantage of existing source code, which is available in comprehensive packages called third-party libraries. As for all software systems, even these libraries change to offer new functionalities and fix bugs or security issues. The way the changes are propagated has been studied by researchers interested in understanding their impact on the non-functional attributes of the systems’ source code. While the research community mainly focused on the change propagation phenomenon in the context of traditional applications, little is known regarding the mobile context. In this paper, we aim at bridging this gap by conducting an empirical study on the evolution history of 291 mobile apps, investigating (i) whether mobile developers actually update third-party libraries, (ii) which categories of libraries developers are more prone to update in their apps, (iii) what common patterns developers follow when updating a software library, and (iv) whether high- and low-rated apps present peculiar update patterns. The results of the study showed that mobile developers rarely update their apps with respect to the used libraries, and when they do, they mainly tend to update the libraries related to the Graphical User Interface, with the aim of keeping the mobile apps updated with the latest design tendencies. In some cases developers ignore updates because of poor awareness of the benefits or an excessively high cost/benefit ratio. Finally, high- and low-rated apps present strong differences.

Download PDF BibTeX

@inproceedings{salza2018developers,
  title={Do developers update third-party libraries in mobile apps?},
  author={Salza, Pasquale and Palomba, Fabio and Di Nucci, Dario and D'Uva, Cosmo and De Lucia, Andrea and Ferrucci, Filomena},
  booktitle={Proceedings of the 26th Conference on Program Comprehension},
  pages={255--265},
  year={2018},
  organization={ACM}
}

[C28] MSR 2018

How Is Video Game Development Different from Software Development in Open Source?*

IEEE/ACM Working Conference on Mining Software Repositories (MSR 2018), Gothenburg, Sweden, 2018.

Recent research has provided evidence that, in the industrial context, developing video games diverges from developing software systems in other domains, such as office suites and system utilities. Download PDF

Conference Software Quality Empirical Software Engineering L. Pascarella, F. Palomba, M. Di Penta, A. Bacchelli.

How Is Video Game Development Different from Software Development in Open Source?*

L. Pascarella, F. Palomba, M. Di Penta, A. Bacchelli. Conference Software Quality Empirical Software Engineering

Abstract. Recent research has provided evidence that, in the industrial context, developing video games diverges from developing software systems in other domains, such as office suites and system utilities. In this paper, we consider video game development in the open source system (OSS) context. Specifically, we investigate how developers contribute to video games vs. non-games by working on different kinds of artifacts, how they handle malfunctions, and how they perceive the development process of their projects. To this purpose, we conducted a mixed, qualitative and quantitative study on a broad suite of 60 OSS projects. Our results confirm the existence of significant differences between game and non-game development, in terms of how project resources are organized and in the diversity of developers’ specializations. Moreover, game developers responding to our survey perceive more difficulties than other developers when reusing code as well as performing automated testing, and they lack a clear overview of their system’s requirements.

Download PDF BibTeX

@inproceedings{pascarella2018video,
  title={How is video game development different from software development in open source?},
  author={Pascarella, Luca and Palomba, Fabio and Di Penta, Massimiliano and Bacchelli, Alberto},
  booktitle={2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)},
  pages={392--402},
  year={2018},
  organization={IEEE}
}

[C27] MSR 2018

A Graph-based Dataset of Commit History of Real-World Android apps.*

IEEE/ACM Working Conference on Mining Software Repositories (MSR 2018), Gothenburg, Sweden, 2018.

Empirical studies on the engineering of Android apps need to be based on open datasets and tools to allow comparisons, improve generalizability, and enable replicability. However, obtaining a good dataset is problematic and this state of things slows down empirical research on this topic. Download PDF

Conference Software Quality Dataset F. Geiger, I. Malavolta, L. Pascarella, F. Palomba, D. Di Nucci, A. Bacchelli.

A Graph-based Dataset of Commit History of Real-World Android apps.*

F. Geiger, I. Malavolta, L. Pascarella, F. Palomba, D. Di Nucci, A. Bacchelli. Conference Software Quality Dataset

Abstract. Empirical studies on the engineering of Android apps need to be based on open datasets and tools to allow comparisons, improve generalizability, and enable replicability. However, obtaining a good dataset is problematic, and this state of things slows down empirical research on this topic. In this paper, we contribute to overcoming this challenge by presenting the first self-contained, publicly available dataset weaving spread-out data sources about real-world, open-source Android apps. Our dataset is encoded as a graph-based database and contains the following information about 8,431 real open-source Android apps: (i) metadata about their GitHub projects, (ii) Git repositories with full commit history, and (iii) metadata extracted from the Google Play store, such as app ratings and permissions. The dataset is available in Docker images to ease adoption.

Download PDF BibTeX

@inproceedings{geiger2018graph,
  title={A graph-based dataset of commit history of real-world android apps},
  author={Geiger, Franz-Xaver and Malavolta, Ivano and Pascarella, Luca and Palomba, Fabio and Di Nucci, Dario and Bacchelli, Alberto},
  booktitle={Proceedings of the 15th International Conference on Mining Software Repositories},
  pages={30--33},
  year={2018},
  organization={ACM}
}

[C26] MOBILE SOFT 2018

Self-Reported Activities of Android Developers.*

International Conference on Mobile Software Engineering and Systems (MobileSoft 2018), Gothenburg, Sweden, 2018.

To gain a deeper empirical understanding of how developers work on Android apps, we investigate self-reported activities of Android developers and to what extent these activities can be classified with machine learning techniques. Download PDF

Conference Mobile Apps Evolution Empirical Software Engineering L. Pascarella, F. Geiger, F. Palomba, D. Di Nucci, I. Malavolta, A. Bacchelli.

Self-Reported Activities of Android Developers.*

L. Pascarella, F. Geiger, F. Palomba, D. Di Nucci, I. Malavolta, A. Bacchelli. Conference Mobile Apps Evolution Empirical Software Engineering

Abstract. To gain a deeper empirical understanding of how developers work on Android apps, we investigate self-reported activities of Android developers and to what extent these activities can be classified with machine learning techniques. To this aim, we first create a taxonomy of self-reported activities coming from the manual analysis of 5,000 commit messages from 8,280 Android apps. Then, we study the frequency of each category of self-reported activities identified in the taxonomy, and investigate the feasibility of an automated classification approach. Our findings can be used by both practitioners and researchers to make informed decisions or support other software engineering activities.

Download PDF BibTeX

@inproceedings{pascarella2018self,
  title={Self-reported activities of android developers},
  author={Pascarella, Luca and Geiger, Franz-Xaver and Palomba, Fabio and Di Nucci, Dario and Malavolta, Ivano and Bacchelli, Alberto},
  booktitle={2018 IEEE/ACM 5th International Conference on Mobile Software Engineering and Systems (MOBILESoft)},
  pages={144--155},
  year={2018},
  organization={IEEE}
}

[C25] SANER 2018

Re-evaluating Method-Level Bug Prediction.*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2018 - RENE Track), Campobasso, Italy, 2018.

Bug prediction is aimed at supporting developers in the identification of code artifacts more likely to be defective. Most approaches defined so far target the prediction of bugs at class-level, thus pinpointing the presence of a bug in an entire source file. Download PDF

Conference Software Quality Empirical Software Engineering L. Pascarella, F. Palomba, A. Bacchelli.

Re-evaluating Method-Level Bug Prediction.*

L. Pascarella, F. Palomba, A. Bacchelli. Conference Software Quality Empirical Software Engineering

Abstract. Bug prediction is aimed at supporting developers in the identification of code artifacts more likely to be defective. Most approaches defined so far target the prediction of bugs at class-level, thus pinpointing the presence of a bug in an entire source file. Nevertheless, past research has provided evidence that this granularity might be too coarse-grained, thus reducing the usability of bug prediction in practice. As a consequence, researchers have started proposing method-level bug prediction models, showing promising evidence that it is possible to operate at this level of granularity. In this study, we first replicate previous research on method-level bug prediction on different systems/timespans. Afterwards, we reflect on the evaluation strategy and propose a more realistic one. Key results of our study show that the performance of the method-level bug prediction model is similar to what was previously reported, also for different systems/timespans, when evaluated with the same strategy. However, when evaluated with a more realistic strategy, all the models show a dramatic drop in performance, with results close to those of a random classifier. Our replication and negative results indicate that method-level bug prediction is still an open challenge.

Download PDF BibTeX

@inproceedings{pascarella2018re,
  title={Re-evaluating method-level bug prediction},
  author={Pascarella, Luca and Palomba, Fabio and Bacchelli, Alberto},
  booktitle={2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={592--601},
  year={2018},
  organization={IEEE}
}

[C24] SANER 2018

Detecting Code Smells using Machine Learning Techniques: Are We There Yet?*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2018 - RENE Track), Campobasso, Italy, 2018.

Code smells are symptoms of poor design and implementation choices weighing heavily on the quality of produced source code. During the last decades several code smell detection tools have been proposed. Download PDF

Conference Software Quality Empirical Software Engineering D. Di Nucci, F. Palomba, D. A. Tamburri, A. Serebrenik, A. De Lucia.

Detecting Code Smells using Machine Learning Techniques: Are We There Yet?*

D. Di Nucci, F. Palomba, D. A. Tamburri, A. Serebrenik, A. De Lucia. Conference Software Quality Empirical Software Engineering

Abstract. Code smells are symptoms of poor design and implementation choices weighing heavily on the quality of produced source code. During the last decades several code smell detection tools have been proposed. However, the literature shows that the results of these tools can be subjective and are intrinsically tied to the nature and approach of the detection. In a recent work, Arcelli Fontana et al. [1] proposed the use of Machine-Learning (ML) techniques for code smell detection, possibly solving the issue of tool subjectivity by giving a learner the ability to discern between smelly and non-smelly source code elements. While this work opened a new perspective for code smell detection, in the context of our research we found a number of possible limitations that might threaten the results of this study. The most important issue is related to the metric distribution of smelly instances in the used dataset, which differs strongly from that of non-smelly instances. In this work, we investigate this issue and our findings show that the high performance achieved in the study by Arcelli Fontana et al. was in fact due to the specific dataset employed rather than the actual capabilities of machine-learning techniques for code smell detection.

Download PDF BibTeX

@inproceedings{di2018detecting,
  title={Detecting code smells using machine learning techniques: are we there yet?},
  author={Di Nucci, Dario and Palomba, Fabio and Tamburri, Damian A and Serebrenik, Alexander and De Lucia, Andrea},
  booktitle={2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={612--621},
  year={2018},
  organization={IEEE}
}

[C23] SANER 2018

Context Is King: The Developer Perspective on the Usage of Static Analysis Tools.*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2018), Campobasso, Italy, 2018.

Automatic static analysis tools (ASATs) are tools that support automatic code quality evaluation of software systems with the aim of (i) avoiding and/or removing bugs and (ii) spotting design issues. Hindering their widespread acceptance are their (i) high false positive rates and (ii) low comprehensibility of the generated warnings.

Invited for the Special Issue

Download PDF

Conference Software Quality Empirical Software Engineering C. Vassallo, S. Panichella, F. Palomba, S. Proksch, A. Zaidman, H. Gall.

Context Is King: The Developer Perspective on the Usage of Static Analysis Tools.*

C. Vassallo, S. Panichella, F. Palomba, S. Proksch, A. Zaidman, H. Gall. Conference Software Quality Empirical Software Engineering

Abstract. Automatic static analysis tools (ASATs) are tools that support automatic code quality evaluation of software systems with the aim of (i) avoiding and/or removing bugs and (ii) spotting design issues. Hindering their widespread acceptance are their (i) high false positive rates and (ii) low comprehensibility of the generated warnings. Researchers and ASAT vendors have proposed solutions to prioritize such warnings with the aim of guiding developers toward the most severe ones. However, none of the proposed solutions considers the development context in which an ASAT is being used to further improve the selection of relevant warnings. To shed light on the impact of such contexts on warning configuration, usage, and adopted prioritization strategies, we surveyed 42 developers (69% in industry and 31% in open source projects) and interviewed 11 industrial experts who integrate ASATs in their workflow. While we can confirm previous findings on the reluctance of developers to configure ASATs, our study highlights that (i) 71% of developers do pay attention to different warning categories depending on the development context, and (ii) 63% of our respondents rely on specific factors (e.g., team policies and composition) when prioritizing warnings to fix during their programming. Our results clearly indicate ways to better assist developers by improving existing warning selection and prioritization strategies.

Download PDF BibTeX

@inproceedings{vassallo2018context,
  title={Context is king: The developer perspective on the usage of static analysis tools},
  author={Vassallo, Carmine and Panichella, Sebastiano and Palomba, Fabio and Proksch, Sebastian and Zaidman, Andy and Gall, Harald C},
  booktitle={2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={38--49},
  year={2018},
  organization={IEEE}
}

[C22] SANER 2018

Exploring the Integration of User Feedback in Automated Testing of Android Applications.*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2018), Campobasso, Italy, 2018.

The intense competition characterizing mobile application marketplaces forces developers to create and maintain high-quality mobile apps in order to ensure their commercial success and acquire new users. This motivated the research community to propose solutions that automate the testing process of mobile apps.

Invited for the Special Issue

Download PDF

Conference Software Testing Empirical Software Engineering G. Grano, A. Ciurumelea, S. Panichella, F. Palomba, H. Gall.

Exploring the Integration of User Feedback in Automated Testing of Android Applications.*

G. Grano, A. Ciurumelea, S. Panichella, F. Palomba, H. Gall. Conference Software Testing Empirical Software Engineering

Abstract. The intense competition characterizing mobile application marketplaces forces developers to create and maintain high-quality mobile apps in order to ensure their commercial success and acquire new users. This motivated the research community to propose solutions that automate the testing process of mobile apps. However, the main problem of current testing tools is that they generate redundant and random inputs that are insufficient to properly simulate human behavior, thus leaving feature and crash bugs undetected until they are encountered by users. To cope with this problem, we conjecture that information available in user reviews (which previous work showed to be effective for maintenance and evolution problems) can be successfully exploited to identify the main issues users experience while using mobile applications, e.g., GUI problems and crashes. In this paper we provide initial insights into this direction, investigating (i) what type of user feedback can be actually exploited for testing purposes, (ii) how complementary user feedback and automated testing tools are when detecting crash bugs or errors, and (iii) whether an automated system able to monitor crash-related information reported in user feedback is sufficiently accurate. Results of our study, involving 11,296 reviews of 8 mobile applications, show that user feedback can be exploited to provide contextual details about errors or exceptions detected by automated testing tools. Moreover, they also help detect bugs that would remain uncovered when relying on testing tools only. Finally, the accuracy of the proposed automated monitoring system demonstrates the feasibility of our vision, i.e., integrating user feedback into the testing process.

Download PDF BibTeX

@inproceedings{grano2018exploring,
  title={Exploring the integration of user feedback in automated testing of android applications},
  author={Grano, Giovanni and Ciurumelea, Adelina and Panichella, Sebastiano and Palomba, Fabio and Gall, Harald C},
  booktitle={2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={72--83},
  year={2018},
  organization={IEEE}
}

[C20] ICPC 2017

Developer-Related Factors in Change Prediction: An Empirical Assessment.*

25th International Conference on Program Comprehension (ICPC 2017), Buenos Aires, Argentina, 2017.

Predicting the areas of the source code having a higher likelihood to change in the future is a crucial activity to allow developers to plan preventive maintenance operations such as refactoring or peer-code reviews. In the past the research community was active in devising change prediction models based on structural metrics extracted from the source code. Download PDF

Conference Software Quality Empirical Software Engineering G. Catolino, F. Palomba, A. De Lucia, F. Ferrucci, A. Zaidman.

Developer-Related Factors in Change Prediction: An Empirical Assessment.*

G. Catolino, F. Palomba, A. De Lucia, F. Ferrucci, A. Zaidman. Conference Software Quality Empirical Software Engineering

Abstract. Predicting the areas of the source code having a higher likelihood to change in the future is a crucial activity to allow developers to plan preventive maintenance operations such as refactoring or peer-code reviews. In the past the research community was active in devising change prediction models based on structural metrics extracted from the source code. More recently, Elish et al. showed how evolution metrics can be more efficient for predicting change-prone classes. In this paper, we aim at making a further step ahead by investigating the role of different developer-related factors, which are able to capture the complexity of the development process under different perspectives, in the context of change prediction. We also compared such models with existing change-prediction models based on evolution and code metrics. Our findings reveal the capabilities of developer-based metrics in identifying classes of a software system more likely to be changed in the future. Moreover, we observed interesting complementarities among the experimented prediction models, that may possibly lead to the definition of new combined models exploiting developer-related factors as well as product and evolution metrics.

Download PDF BibTeX

@inproceedings{catolino2017developer,
  title={Developer-related factors in change prediction: an empirical assessment},
  author={Catolino, Gemma and Palomba, Fabio and De Lucia, Andrea and Ferrucci, Filomena and Zaidman, Andy},
  booktitle={Proceedings of the 25th International Conference on Program Comprehension},
  pages={186--195},
  year={2017},
  organization={IEEE Press}
}

[C19] ICPC 2017

An Exploratory Study on the Relationship between Changes and Refactoring.*

25th International Conference on Program Comprehension (ICPC 2017), Buenos Aires, Argentina, 2017.

Refactoring aims at improving the internal structure of a software system without changing its external behavior. Previous studies empirically assessed, on the one hand, the benefits of refactoring in terms of code quality and developers’ productivity, and on the other hand, the underlying reasons that push programmers to apply refactoring. Download PDF

Conference Software Quality Empirical Software Engineering F. Palomba, A. Zaidman, R. Oliveto, A. De Lucia.

An Exploratory Study on the Relationship between Changes and Refactoring.*

F. Palomba, A. Zaidman, R. Oliveto, A. De Lucia. Conference Software Quality Empirical Software Engineering

Abstract. Refactoring aims at improving the internal structure of a software system without changing its external behavior. Previous studies empirically assessed, on the one hand, the benefits of refactoring in terms of code quality and developers’ productivity, and on the other hand, the underlying reasons that push programmers to apply refactoring. Results achieved in the latter investigations indicate that besides personal motivation such as the responsibility concerned with code authorship, refactoring is mainly performed as a consequence of changes in the requirements rather than driven by software quality. However, these findings have been derived by surveying developers, and therefore no studies performed on the actual modifications made on software repositories have been carried out to corroborate the achieved findings. To bridge this gap, we provide a quantitative investigation on the relationship between different types of code changes (i.e., Fault Repairing Modification, Feature Introduction Modification, and General Maintenance Modification) and 28 different refactoring types coming from 3 open source projects. Results showed that developers tend to apply a higher number of refactoring operations aimed at improving maintainability and comprehensibility of the source code when fixing bugs. Instead, when new features are implemented, more complex refactoring operations are performed to improve code cohesion. Most of the time, the underlying reasons behind the application of such refactoring operations are represented by the presence of duplicate code or previously introduced self-admitted technical debt.

Download PDF BibTeX

@inproceedings{palomba2017exploratory,
  title={An exploratory study on the relationship between changes and refactoring},
  author={Palomba, Fabio and Zaidman, Andy and Oliveto, Rocco and De Lucia, Andrea},
  booktitle={2017 IEEE/ACM 25th International Conference on Program Comprehension (ICPC)},
  pages={176--185},
  year={2017},
  organization={IEEE}
}

[C18] ICSE 2017

PETrA: a Software-Based Tool for Estimating the Energy Profile of Android Applications.*

39th International Conference on Software Engineering (ICSE 2017) - Formal Tool Demo, Buenos Aires, Argentina, 2017.

Energy efficiency is a vital characteristic of any mobile application, and indeed is becoming an important factor for user satisfaction. For this reason, in recent years several approaches and tools for measuring the energy consumption of mobile devices have been proposed. Download PDF

Conference Mobile Apps Evolution Tool Demo D. Di Nucci, F. Palomba, A. Prota, A. Panichella, A. Zaidman, A. De Lucia.

PETrA: a Software-Based Tool for Estimating the Energy Profile of Android Applications.*

D. Di Nucci, F. Palomba, A. Prota, A. Panichella, A. Zaidman, A. De Lucia. Conference Mobile Apps Evolution Tool Demo

Abstract. Energy efficiency is a vital characteristic of any mobile application, and indeed is becoming an important factor for user satisfaction. For this reason, in recent years several approaches and tools for measuring the energy consumption of mobile devices have been proposed. Hardware-based solutions are highly precise, but at the same time they require costly hardware toolkits. Model-based techniques require a possibly difficult calibration of the parameters needed to correctly create a model on a specific hardware device. Finally, software-based solutions are easier to use, but they are possibly less precise than hardware-based solutions. In this demo, we present PETrA, a novel software-based tool for measuring the energy consumption of Android apps. With respect to other tools, PETrA is compatible with all the smartphones with Android 5.0 or higher. We also provide evidence that our tool is able to perform similarly to hardware-based solutions.

Download PDF BibTeX

@inproceedings{di2017petra,
  title={Petra: a software-based tool for estimating the energy profile of android applications},
  author={Di Nucci, Dario and Palomba, Fabio and Prota, Antonio and Panichella, Annibale and Zaidman, Andy and De Lucia, Andrea},
  booktitle={2017 IEEE/ACM 39th International Conference on Software Engineering Companion (ICSE-C)},
  pages={3--6},
  year={2017},
  organization={IEEE}
}

[C17] ICSE 2017

Recommending and Localizing Change Requests for Mobile Apps based on User Reviews.*

39th International Conference on Software Engineering (ICSE 2017), Buenos Aires, Argentina, 2017.

Researchers have proposed several approaches to extract information from user reviews useful for maintaining and evolving mobile apps. However, most of them just perform automatic classification of user reviews according to specific keywords (e.g., bugs, features). Download PDF

Conference Mobile Apps Evolution Empirical Software Engineering F. Palomba, P. Salza, A. Ciurumelea, S. Panichella, H. Gall, F. Ferrucci, A. De Lucia.

Recommending and Localizing Change Requests for Mobile Apps based on User Reviews.*

F. Palomba, P. Salza, A. Ciurumelea, S. Panichella, H. Gall, F. Ferrucci, A. De Lucia. Conference Mobile Apps Evolution Empirical Software Engineering

Abstract. Researchers have proposed several approaches to extract information from user reviews useful for maintaining and evolving mobile apps. However, most of them just perform automatic classification of user reviews according to specific keywords (e.g., bugs, features). Moreover, they do not provide any support for linking user feedback to the source code components to be changed, thus requiring a manual, time-consuming, and error-prone task. In this paper, we introduce ChangeAdvisor, a novel approach that analyzes the structure, semantics, and sentiments of sentences contained in user reviews to extract useful (user) feedback from maintenance perspectives and recommend to developers changes to software artifacts. It relies on natural language processing and clustering algorithms to group user reviews around similar user needs and suggestions for change. Then, it involves textual-based heuristics to determine the code artifacts that need to be maintained according to the recommended software changes. The quantitative and qualitative studies carried out on 44,683 user reviews of 10 open source mobile apps and their original developers showed a high accuracy of ChangeAdvisor in (i) clustering similar user change requests and (ii) identifying the code components impacted by the suggested changes. Moreover, the obtained results show that ChangeAdvisor is more accurate than a baseline approach for linking user feedback clusters to the source code in terms of both precision (+47%) and recall (+38%).

Download PDF BibTeX

@inproceedings{palomba2017recommending,
  title={Recommending and localizing change requests for mobile apps based on user reviews},
  author={Palomba, Fabio and Salza, Pasquale and Ciurumelea, Adelina and Panichella, Sebastiano and Gall, Harald and Ferrucci, Filomena and De Lucia, Andrea},
  booktitle={Proceedings of the 39th international conference on software engineering},
  pages={106--117},
  year={2017},
  organization={IEEE Press}
}

[C16] SANER 2017

Lightweight Detection of Android-specific Code Smells: the aDoctor Project.*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2017) - Tool Track, Klagenfurt, Austria, 2017.

Code smells are symptoms of poor design solutions applied by programmers during the development of software systems. While the research community devoted a lot of effort to studying and devising approaches for detecting the traditional code smells defined by Fowler, little knowledge and support is available for an emerging category of Mobile app code smells. Download PDF

Conference Mobile Apps Evolution Tool Demo F. Palomba, D. Di Nucci, A. Panichella, A. Zaidman, A. De Lucia.

Lightweight Detection of Android-specific Code Smells: the aDoctor Project.*

F. Palomba, D. Di Nucci, A. Panichella, A. Zaidman, A. De Lucia. Conference Mobile Apps Evolution Tool Demo

Abstract. Code smells are symptoms of poor design solutions applied by programmers during the development of software systems. While the research community devoted a lot of effort to studying and devising approaches for detecting the traditional code smells defined by Fowler, little knowledge and support is available for an emerging category of Mobile app code smells. Recently, Reimann et al. proposed a new catalogue of Android-specific code smells that may be a threat to the maintainability and the efficiency of Android applications. However, current tools working in the context of Mobile apps provide limited support and, more importantly, are not available for developers interested in monitoring the quality of their apps. To overcome these limitations, we propose a fully automated tool, coined aDoctor, able to identify 15 Android-specific code smells from the catalogue by Reimann et al. An empirical study conducted on the source code of 18 Android applications reveals that the proposed tool reaches, on average, 98% precision and 98% recall. We made aDoctor publicly available.

Download PDF BibTeX

@inproceedings{palomba2017lightweight,
  title={Lightweight detection of Android-specific code smells: The aDoctor project},
  author={Palomba, Fabio and Di Nucci, Dario and Panichella, Annibale and Zaidman, Andy and De Lucia, Andrea},
  booktitle={2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER)},
  pages={487--491},
  year={2017},
  organization={IEEE}
}

[C15] SANER 2017

Software-based Energy Profiling of Android Apps: Simple, Efficient and Reliable?*

International Conference on Software Analysis, Evolution, and Reengineering (SANER 2017), Klagenfurt, Austria, 2017.

Modeling the power profile of mobile applications is a crucial activity to identify the causes behind energy leaks. To this aim, researchers have proposed hardware-based tools as well as model-based and software-based techniques to approximate the actual energy profile. Download PDF

Conference Mobile Apps Evolution Empirical Software Engineering D. Di Nucci, F. Palomba, A. Prota, A. Panichella, A. Zaidman, A. De Lucia.

Software-based Energy Profiling of Android Apps: Simple, Efficient and Reliable?*

D. Di Nucci, F. Palomba, A. Prota, A. Panichella, A. Zaidman, A. De Lucia. Conference Mobile Apps Evolution Empirical Software Engineering

Abstract. Modeling the power profile of mobile applications is a crucial activity to identify the causes behind energy leaks. To this aim, researchers have proposed hardware-based tools as well as model-based and software-based techniques to approximate the actual energy profile. However, all these solutions present their own advantages and disadvantages. Hardware-based tools are highly precise, but at the same time their use is bound to the acquisition of costly hardware components. Model-based tools require the calibration of parameters needed to correctly create a model on a specific hardware device. Software-based approaches are cheaper and easier to use than hardware-based tools, but they are believed to be less precise. In this paper, we take a deeper look at the pros and cons of software-based solutions, investigating to what extent their measurements depart from hardware-based solutions. To this aim, we propose a software-based tool named PETrA that we compare with the hardware-based Monsoon toolkit on 54 Android apps. The results show that PETrA performs similarly to Monsoon despite not using any sophisticated hardware components. In fact, the mean relative error with respect to Monsoon is always lower than 0.05. Moreover, 95% of the estimation errors are within 5% of the actual values measured using the hardware-based toolkit.

Download PDF BibTeX

@inproceedings{di2017software,
  title={Software-based energy profiling of android apps: Simple, efficient and reliable?},
  author={Di Nucci, Dario and Palomba, Fabio and Prota, Antonio and Panichella, Annibale and Zaidman, Andy and De Lucia, Andrea},
  booktitle={2017 IEEE 24th international conference on software analysis, evolution and reengineering (SANER)},
  pages={103--114},
  year={2017},
  organization={IEEE}
}

[C14] ASE 2016

An Empirical Investigation into the Nature of Test Smells.*

International Conference on Automated Software Engineering (ASE 2016), Singapore, Singapore, 2016.

Test smells have been defined as poorly designed tests and, as reported by recent empirical studies, their presence may negatively affect comprehension and consequently maintenance of test suites. Despite this, there are no available automated tools to support identification and removal of test smells. Download PDF

Conference Software Testing Empirical Software Engineering M. Tufano, F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk.

An Empirical Investigation into the Nature of Test Smells.*

M. Tufano, F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk. Conference Software Testing Empirical Software Engineering

Abstract. Test smells have been defined as poorly designed tests and, as reported by recent empirical studies, their presence may negatively affect comprehension and consequently maintenance of test suites. Despite this, there are no available automated tools to support identification and removal of test smells. In this paper, we first investigate developers’ perception of test smells in a study with 19 developers. The results show that developers generally do not recognize (potentially harmful) test smells, highlighting that automated tools for identifying such smells are much needed. However, to build effective tools, deeper insights into the test smells phenomenon are required. To this aim, we conducted a large-scale empirical investigation aimed at analyzing (i) when test smells occur in source code, (ii) what their survivability is, and (iii) whether their presence is associated with the presence of design problems in production code (code smells). The results indicate that test smells are usually introduced when the corresponding test code is committed in the repository for the first time, and they tend to remain in a system for a long time. Moreover, we found various unexpected relationships between test and code smells. Finally, we show how the results of this study can be used to build effective automated tools for test smell detection and refactoring.

Download PDF BibTeX

@inproceedings{tufano2016empirical,
  title={An empirical investigation into the nature of test smells},
  author={Tufano, Michele and Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Oliveto, Rocco and De Lucia, Andrea and Poshyvanyk, Denys},
  booktitle={2016 31st IEEE/ACM International Conference on Automated Software Engineering (ASE)},
  pages={4--15},
  year={2016},
  organization={IEEE}
}

[C13] ICSME 2016

Alternative Sources of Information for Code Smell Detection: Postcards from Far Away.*

International Conference on Software Maintenance and Evolution (ICSME 2016) - Doctoral Symposium, Raleigh, USA, 2016.

Code smells have been defined as symptoms of poor design and implementation choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. Download PDF

Conference Software Quality F. Palomba.

Alternative Sources of Information for Code Smell Detection: Postcards from Far Away.*

F. Palomba. Conference Software Quality

Abstract. Code smells have been defined as symptoms of poor design and implementation choices. Previous studies showed the negative impact of code smells on the comprehensibility and maintainability of code. For this reason, several detection techniques have been proposed. Most of them rely on the analysis of the properties extractable from the source code. In the context of this work, we highlight several aspects that can possibly contribute to the improvement of the current state of the art and propose our solutions, based on the analysis of how code smells are actually introduced as well as the usefulness of historical and textual information to realize more reliable code smell detectors. Finally, we present an overview of the open issues and challenges related to code smell detection and management that the research community should focus on in the near future.

Download PDF BibTeX

@inproceedings{palomba2016alternative,
  title={Alternative sources of information for code smell detection: Postcards from far away},
  author={Palomba, Fabio},
  booktitle={2016 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={636--640},
  year={2016},
  organization={IEEE}
}

[C12] ICSME 2016

Smells like Teen Spirit: Improving Bug Prediction Performance Using the Intensity of Code Smells.*

International Conference on Software Maintenance and Evolution (ICSME 2016), Raleigh, USA, 2016.

Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. Download PDF

Conference Software Quality Empirical Software Engineering F. Palomba, M. Zanoni, F. Arcelli Fontana, A. De Lucia, R. Oliveto.

Smells like Teen Spirit: Improving Bug Prediction Performance Using the Intensity of Code Smells.*

F. Palomba, M. Zanoni, F. Arcelli Fontana, A. De Lucia, R. Oliveto. Conference Software Quality Empirical Software Engineering

Abstract. Code smells are symptoms of poor design and implementation choices. Previous studies empirically assessed the impact of smells on code quality and clearly indicate their negative impact on maintainability, including a higher bug-proneness of components affected by code smells. In this paper we capture previous findings on bug-proneness to build a specialized bug prediction model for smelly classes. Specifically, we evaluate the contribution of a measure of the severity of code smells (i.e., code smell intensity) by adding it to existing bug prediction models and comparing the results of the new model against the baseline model. Results indicate that the accuracy of a bug prediction model increases by adding the code smell intensity as predictor. We also evaluate the actual gain provided by the intensity index with respect to the other metrics in the model, including the ones used to compute the code smell intensity. We observe that the intensity index is much more important as compared to other metrics used for predicting the bugginess of smelly classes.

Download PDF BibTeX

@inproceedings{palomba2016smells,
  title={Smells like teen spirit: Improving bug prediction performance using the intensity of code smells},
  author={Palomba, Fabio and Zanoni, Marco and Fontana, Francesca Arcelli and De Lucia, Andrea and Oliveto, Rocco},
  booktitle={2016 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={244--255},
  year={2016},
  organization={IEEE}
}

[C11] ISSTA 2016

Automatic Test Case Generation: What if Test Code Quality Matters?*

International Symposium on Software Testing and Analysis (ISSTA 2016), Saarbrücken, Germany, 2016.

Test case generation tools that optimize code coverage have been extensively investigated. Recently, researchers have suggested adding other non-coverage criteria, such as memory consumption or readability, to increase the practical usefulness of generated tests. Download PDF

Conference Software Testing Empirical Software Engineering F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, A. De Lucia.

Automatic Test Case Generation: What if Test Code Quality Matters?*

F. Palomba, A. Panichella, A. Zaidman, R. Oliveto, A. De Lucia. Conference Software Testing Empirical Software Engineering

Abstract. Test case generation tools that optimize code coverage have been extensively investigated. Recently, researchers have suggested adding other non-coverage criteria, such as memory consumption or readability, to increase the practical usefulness of generated tests. In this paper, we observe that test code quality metrics, and test cohesion and coupling in particular, are valuable candidates as additional criteria. Indeed, tests with low cohesion and/or high coupling have been shown to have a negative impact on future maintenance activities. In an exploratory investigation we show that most generated tests are indeed affected by poor test code quality. For this reason, we incorporate cohesion and coupling metrics into the main loop of a search-based algorithm for test case generation. Through an empirical study we show that our approach is not only able to generate tests that are more cohesive and less coupled, but can (i) increase branch coverage up to 10% when enough time is given to the search and (ii) result in statistically shorter tests.

Download PDF BibTeX

@inproceedings{palomba2016automatic,
  title={Automatic test case generation: What if test code quality matters?},
  author={Palomba, Fabio and Panichella, Annibale and Zaidman, Andy and Oliveto, Rocco and De Lucia, Andrea},
  booktitle={Proceedings of the 25th International Symposium on Software Testing and Analysis},
  pages={130--141},
  year={2016},
  organization={ACM}
}

[C10] ICPC 2016

A Textual-based Technique for Smell Detection.*

24th International Conference on Program Comprehension (ICPC 2016), Austin, USA, 2016.

In this paper, we present TACO (Textual Analysis for Code Smell Detection), a technique that exploits textual analysis to detect a family of smells of different nature and different levels of granularity.

Invited for the Special Issue

Download PDF

Conference Software Quality Empirical Software Engineering F. Palomba, A. Panichella, A. De Lucia, R. Oliveto, A. Zaidman.

A Textual-based Technique for Smell Detection.*

F. Palomba, A. Panichella, A. De Lucia, R. Oliveto, A. Zaidman. Conference Software Quality Empirical Software Engineering

Abstract. In this paper, we present TACO (Textual Analysis for Code Smell Detection), a technique that exploits textual analysis to detect a family of smells of different nature and different levels of granularity. We run TACO on 10 open source projects, comparing its performance with existing smell detectors purely based on structural information extracted from code components. The analysis of the results indicates that TACO’s precision ranges between 67% and 77%, while its recall ranges between 72% and 84%. Also, TACO often outperforms alternative structural approaches confirming, once again, the usefulness of information that can be derived from the textual part of code components.

Download PDF BibTeX

@inproceedings{palomba2016textual,
  title={A textual-based technique for smell detection},
  author={Palomba, Fabio and Panichella, Annibale and De Lucia, Andrea and Oliveto, Rocco and Zaidman, Andy},
  booktitle={2016 IEEE 24th international conference on program comprehension (ICPC)},
  pages={1--10},
  year={2016},
  organization={IEEE}
}

[C9] ICSME 2015

User Reviews Matter! Tracking Crowdsourced Reviews to Support Evolution of Successful Apps.*

31st IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, Germany, 2015.

Nowadays software applications, and especially mobile apps, undergo frequent release updates through app stores. After installing/updating apps, users can post reviews and provide ratings, expressing their level of satisfaction with apps, and possibly pointing out bugs or desired features. Download PDF

Conference Mobile Apps Evolution Empirical Software Engineering F. Palomba, M. Linares Vasquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, A. De Lucia.

User Reviews Matter! Tracking Crowdsourced Reviews to Support Evolution of Successful Apps.*

F. Palomba, M. Linares Vasquez, G. Bavota, R. Oliveto, M. Di Penta, D. Poshyvanyk, A. De Lucia. Conference Mobile Apps Evolution Empirical Software Engineering

Abstract. Nowadays software applications, and especially mobile apps, undergo frequent release updates through app stores. After installing/updating apps, users can post reviews and provide ratings, expressing their level of satisfaction with apps, and possibly pointing out bugs or desired features. In this paper we show, by performing a study on 100 Android apps, how applications addressing user reviews increase their success in terms of rating. Specifically, we devise an approach, named CRISTAL, for tracing informative crowd reviews onto source code changes, and for monitoring the extent to which developers accommodate crowd requests and follow-up user reactions as reflected in their ratings. The results indicate that developers implementing user reviews are rewarded in terms of ratings. This poses the need for specialized recommendation systems aimed at analyzing informative crowd reviews and prioritizing feedback to be satisfied in order to increase the apps' success.

Download PDF BibTeX

@inproceedings{palomba2015user,
  title={User reviews matter! tracking crowdsourced reviews to support evolution of successful apps},
  author={Palomba, Fabio and Linares-Vasquez, Mario and Bavota, Gabriele and Oliveto, Rocco and Di Penta, Massimiliano and Poshyvanyk, Denys and De Lucia, Andrea},
  booktitle={2015 IEEE international conference on software maintenance and evolution (ICSME)},
  pages={291--300},
  year={2015},
  organization={IEEE}
}

[C8] ICSME 2015

On the Role of Developer’s Scattered Changes in Bug Prediction.*

31st IEEE International Conference on Software Maintenance and Evolution (ICSME 2015), Bremen, Germany, 2015.

The importance of human-related factors in the introduction of bugs has recently been the subject of a number of empirical studies. However, these observations have not been captured yet in bug prediction models which simply exploit product metrics or process metrics based on the number and type of changes or on the number of developers working on a software component. Download PDF

Conference Software Quality Empirical Software Engineering D. Di Nucci, F. Palomba, S. Siravo, G. Bavota, R. Oliveto, A. De Lucia.

On the Role of Developer’s Scattered Changes in Bug Prediction.*

D. Di Nucci, F. Palomba, S. Siravo, G. Bavota, R. Oliveto, A. De Lucia. Conference Software Quality Empirical Software Engineering

Abstract. The importance of human-related factors in the introduction of bugs has recently been the subject of a number of empirical studies. However, these observations have not been captured yet in bug prediction models, which simply exploit product metrics or process metrics based on the number and type of changes or on the number of developers working on a software component. Some previous studies have demonstrated that focused developers are less prone to introduce defects than non-focused developers. According to this observation, software components changed by focused developers should also be less error-prone than software components changed by less focused developers. In this paper we capture this observation by measuring the structural and semantic scattering of changes performed by the developers working on a software component and use these two measures to build a bug prediction model. Such a model has been evaluated on five open source systems and compared with two competitive prediction models: the first exploits the number of developers working on a code component in a given time period as predictor, while the second is based on the concept of code change entropy. The achieved results show the superiority of our model with respect to the two competitive approaches, and the complementarity of the defined scattering measures with respect to standard predictors commonly used in the literature.

Download PDF BibTeX

@inproceedings{di2015role,
  title={On the role of developer's scattered changes in bug prediction},
  author={Di Nucci, Dario and Palomba, Fabio and Siravo, Sandro and Bavota, Gabriele and Oliveto, Rocco and De Lucia, Andrea},
  booktitle={2015 IEEE International Conference on Software Maintenance and Evolution (ICSME)},
  pages={241--250},
  year={2015},
  organization={IEEE}
}

[C7] MSR 2015

Landfill: an Open Dataset of Code Smells with Public Evaluation.*

12th IEEE/ACM Working Conference on Mining Software Repositories (MSR 2015), Florence, Italy, 2015.

Code smells are symptoms of poor design and implementation choices that may hinder code comprehension and possibly increase change- and fault-proneness of source code. Several techniques have been proposed in the literature for detecting code smells. Download PDF

Conference Software Quality Dataset F. Palomba, D. Di Nucci, M. Tufano, G. Bavota, R. Oliveto, D. Poshyvanyk, A. De Lucia.

Landfill: an Open Dataset of Code Smells with Public Evaluation.*

F. Palomba, D. Di Nucci, M. Tufano, G. Bavota, R. Oliveto, D. Poshyvanyk, A. De Lucia. Conference Software Quality Dataset

Abstract. Code smells are symptoms of poor design and implementation choices that may hinder code comprehension and possibly increase change- and fault-proneness of source code. Several techniques have been proposed in the literature for detecting code smells. These techniques are generally evaluated by comparing their accuracy on a set of detected candidate code smells against a manually-produced oracle. Unfortunately, with only a few exceptions, such comprehensive sets of annotated code smells are not available in the literature. In this paper we contribute (i) a dataset of 243 instances of five types of code smells identified from 20 open source software projects, (ii) a systematic procedure for validating code smell datasets, (iii) LANDFILL, a Web-based platform for sharing code smell datasets, and (iv) a set of APIs for programmatically accessing LANDFILL’s contents. Anyone can contribute to Landfill by (i) improving existing datasets (e.g., adding missing instances of code smells, flagging possibly incorrectly classified instances), and (ii) sharing and posting new datasets. Landfill is available at www.sesa.unisa.it/landfill/, while the video demonstrating its features in action is available at http://www.sesa.unisa.it/tools/landfill.jsp.

Download PDF BibTeX

@inproceedings{palomba2015landfill,
  title={Landfill: An open dataset of code smells with public evaluation},
  author={Palomba, Fabio and Di Nucci, Dario and Tufano, Michele and Bavota, Gabriele and Oliveto, Rocco and Poshyvanyk, Denys and De Lucia, Andrea},
  booktitle={2015 IEEE/ACM 12th Working Conference on Mining Software Repositories},
  pages={482--485},
  year={2015},
  organization={IEEE}
}

[C6] ICSE 2015

Extract Package Refactoring in ARIES.*

37th IEEE/ACM International Conference on Software Engineering (ICSE 2015) - Formal Tool Demo, IEEE Press, Florence, Italy, 2015.

Software evolution often leads to the degradation of software design quality. In Object-Oriented (OO) systems, this often results in packages that are hard to understand and maintain, as they group together heterogeneous classes with unrelated responsibilities. Download PDF

Conference Software Quality Tool Demo F. Palomba, M. Tufano, G. Bavota, R. Oliveto, A. Marcus, D. Poshyvanyk, A. De Lucia.

Extract Package Refactoring in ARIES.*

F. Palomba, M. Tufano, G. Bavota, R. Oliveto, A. Marcus, D. Poshyvanyk, A. De Lucia. Conference Software Quality Tool Demo

Abstract. Software evolution often leads to the degradation of software design quality. In Object-Oriented (OO) systems, this often results in packages that are hard to understand and maintain, as they group together heterogeneous classes with unrelated responsibilities. In such cases, state-of-the-art re-modularization tools solve the problem by proposing a new organization of the existing classes into packages. However, as indicated by recent empirical studies, such approaches require changing thousands of lines of code to implement the new recommended modularization. In this demo, we present the implementation of an Extract Package refactoring approach in ARIES (Automated Refactoring In EclipSe), a tool supporting refactoring operations in Eclipse. Unlike state-of-the-art approaches, ARIES automatically identifies and removes single low-cohesion packages from software systems, which represent very localized design flaws in the package organization, with the aim of incrementally improving the overall quality of the software modularization.

Download PDF BibTeX

@inproceedings{palomba2015extract,
  title={Extract package refactoring in ARIES},
  author={Palomba, Fabio and Tufano, Michele and Bavota, Gabriele and Oliveto, Rocco and Marcus, Andrian and Poshyvanyk, Denys and De Lucia, Andrea},
  booktitle={Proceedings of the 37th International Conference on Software Engineering-Volume 2},
  pages={669--672},
  year={2015},
  organization={IEEE Press}
}

[C5] ICSE 2015

Textual Analysis for Code Smell Detection.*

37th IEEE/ACM International Conference on Software Engineering (ICSE 2015) - Student Research Competition (SRC) Track, Florence, Italy, 2015.

The negative impact of smells on the quality of software systems has been empirically investigated in several studies. This has highlighted the need for approaches to identify and remove smells.

ACM Student Research Competition Bronze Medal

Download PDF

Conference Software Quality F. Palomba.

Textual Analysis for Code Smell Detection.*

F. Palomba. Conference Software Quality

Abstract. The negative impact of smells on the quality of software systems has been empirically investigated in several studies. This has highlighted the need for approaches to identify and remove smells. While approaches to remove smells have investigated the use of both structural and conceptual information extracted from source code, approaches to identify smells are based on structural information only. In this paper, we bridge this gap by analyzing to what extent conceptual information, extracted using textual analysis techniques, can be used to identify smells in source code. The proposed textual-based approach for detecting smells in source code, coined TACO (Textual Analysis for Code smell detectiOn), has been instantiated for detecting the Long Method smell and has been evaluated on three Java open source projects. The results indicate that TACO is able to detect between 50% and 77% of the smell instances with a precision ranging between 63% and 67%. In addition, the results show that TACO identifies smells that are not identified by approaches based solely on structural information.

Download PDF BibTeX

@inproceedings{palomba2015textual,
  title={Textual analysis for code smell detection},
  author={Palomba, Fabio},
  booktitle={Proceedings of the 37th International Conference on Software Engineering-Volume 2},
  pages={769--771},
  year={2015},
  organization={IEEE Press}
}

[C4] ICSE 2015

When and Why Your Code Starts to Smell Bad.*

37th IEEE/ACM International Conference on Software Engineering (ICSE 2015), Florence, Italy, 2015.

In past and recent years, the issues related to managing technical debt have received significant attention from researchers in both industry and academia. There are several factors that contribute to technical debt.

ACM/SIGSOFT Distinguished Paper Award

Download PDF

Conference Software Quality Empirical Software Engineering M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk.

When and Why Your Code Starts to Smell Bad.*

M. Tufano, F. Palomba, G. Bavota, R. Oliveto, M. Di Penta, A. De Lucia, D. Poshyvanyk. Conference Software Quality Empirical Software Engineering

Abstract. In past and recent years, the issues related to managing technical debt have received significant attention from researchers in both industry and academia. There are several factors that contribute to technical debt. One of these is represented by code bad smells, i.e., symptoms of poor design and implementation choices. While the repercussions of smells on code quality have been empirically assessed, there is still only anecdotal evidence on when and why bad smells are introduced. To fill this gap, we conducted a large empirical study over the change history of 200 open source projects from different software ecosystems and investigated when bad smells are introduced by developers, and the circumstances and reasons behind their introduction. Our study required the development of a strategy to identify smell-introducing commits, the mining of over 0.5M commits, and the manual analysis of 9,164 of them (i.e., those identified as smell-introducing). Our findings mostly contradict common wisdom stating that smells are being introduced during evolutionary tasks. In the light of our results, we also call for the development of a new generation of recommendation systems aimed at properly planning smell refactoring activities.

Download PDF BibTeX

@inproceedings{tufano2015and,
  title={When and why your code starts to smell bad},
  author={Tufano, Michele and Palomba, Fabio and Bavota, Gabriele and Oliveto, Rocco and Di Penta, Massimiliano and De Lucia, Andrea and Poshyvanyk, Denys},
  booktitle={Proceedings of the 37th International Conference on Software Engineering-Volume 1},
  pages={403--414},
  year={2015},
  organization={IEEE Press}
}

[C3] ICSME 2014

Do They Really Smell Bad? A Study on Developers’ Perception of Bad Code Smells.*

30th IEEE International Conference on Software Maintenance and Evolution (ICSME 2014), Victoria, Canada, 2014.

In the last decade several catalogues have been defined to characterize bad code smells, i.e., symptoms of poor design and implementation choices. On top of such catalogues, researchers have defined methods and tools to automatically detect and/or remove bad smells. Download PDF

Conference Software Quality Empirical Software Engineering F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia.

Do They Really Smell Bad? A Study on Developers’ Perception of Bad Code Smells.*

F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia. Conference Software Quality Empirical Software Engineering

Abstract. In the last decade several catalogues have been defined to characterize bad code smells, i.e., symptoms of poor design and implementation choices. On top of such catalogues, researchers have defined methods and tools to automatically detect and/or remove bad smells. Nevertheless, there is an ongoing debate regarding the extent to which developers perceive bad smells as serious design problems. Indeed, there seems to be a gap between theory and practice, i.e., what is believed to be a problem (theory) and what is actually a problem (practice). This paper presents a study aimed at providing empirical evidence on how developers perceive bad smells. In this study, we showed developers code entities from three systems, some affected by bad smells and some not, and asked them to indicate whether the code contains a potential design problem and, if so, the nature and severity of the problem. The study involved both original developers from the three projects and outsiders, namely industrial developers and Master students. The results provide insights on characteristics of bad smells not yet explored sufficiently. Also, our findings could guide future research on approaches for the detection and removal of bad smells.

Download PDF BibTeX

@inproceedings{palomba2014they,
  title={Do they really smell bad? a study on developers' perception of bad code smells},
  author={Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Oliveto, Rocco and De Lucia, Andrea},
  booktitle={2014 IEEE International Conference on Software Maintenance and Evolution},
  pages={101--110},
  year={2014},
  organization={IEEE}
}

[C2] ASE 2013

Detecting Bad Smells in Source Code Using Change History Information.*

28th IEEE/ACM International Conference on Automated Software Engineering (ASE 2013), Palo Alto, California, 2013.

Code smells represent symptoms of poor implementation choices. Previous studies found that these smells make source code more difficult to maintain, possibly also increasing its fault-proneness.

ACM/SIGSOFT Distinguished Paper Award

Download PDF

Conference Software Quality Empirical Software Engineering F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk.

Detecting Bad Smells in Source Code Using Change History Information.*

F. Palomba, G. Bavota, M. Di Penta, R. Oliveto, A. De Lucia, D. Poshyvanyk. Conference Software Quality Empirical Software Engineering

Abstract. Code smells represent symptoms of poor implementation choices. Previous studies found that these smells make source code more difficult to maintain, possibly also increasing its fault-proneness. There are several approaches that identify smells based on code analysis techniques. However, we observe that many code smells are intrinsically characterized by how code elements change over time. Thus, relying solely on structural information may not be sufficient to detect all the smells accurately. We propose an approach to detect five different code smells, namely Divergent Change, Shotgun Surgery, Parallel Inheritance, Blob, and Feature Envy, by exploiting change history information mined from versioning systems. We applied the approach, coined HIST (Historical Information for Smell deTection), to eight software projects written in Java, and wherever possible compared it with existing state-of-the-art smell detectors based on source code analysis. The results indicate that HIST’s precision ranges between 61% and 80%, and its recall ranges between 61% and 100%. More importantly, the results confirm that HIST is able to identify code smells that cannot be identified through approaches solely based on code analysis.

Download PDF BibTeX

@inproceedings{palomba2013detecting,
  title={Detecting bad smells in source code using change history information},
  author={Palomba, Fabio and Bavota, Gabriele and Di Penta, Massimiliano and Oliveto, Rocco and De Lucia, Andrea and Poshyvanyk, Denys},
  booktitle={Proceedings of the 28th IEEE/ACM International Conference on Automated Software Engineering},
  pages={268--278},
  year={2013},
  organization={IEEE Press}
}

[C1] ICSE 2012

Supporting Extract Class Refactoring in Eclipse: The ARIES Project.*

34th International Conference on Software Engineering (ICSE 2012) - Formal Tool Demo, Zurich, Switzerland, 2012.

During software evolution changes are inevitable. These changes may lead to design erosion and the introduction of inadequate design solutions, such as design antipatterns. Download PDF

Conference Software Quality Tool Demo G. Bavota, A. De Lucia, A. Marcus, R. Oliveto, F. Palomba.

Supporting Extract Class Refactoring in Eclipse: The ARIES Project.*

G. Bavota, A. De Lucia, A. Marcus, R. Oliveto, F. Palomba. Conference Software Quality Tool Demo

Abstract. During software evolution changes are inevitable. These changes may lead to design erosion and the introduction of inadequate design solutions, such as design antipatterns. Several empirical studies provide evidence that the presence of antipatterns is generally associated with lower productivity, greater rework, and more significant design efforts for developers. In order to improve the quality and remove antipatterns, refactoring operations are needed. In this demo, we present the Extract class features of ARIES (Automated Refactoring In EclipSe), an Eclipse plug-in that supports the software engineer in removing the “Blob” antipattern.

Download PDF BibTeX

@inproceedings{bavota2012supporting,
  title={Supporting extract class refactoring in eclipse: The aries project},
  author={Bavota, Gabriele and De Lucia, Andrea and Marcus, Andrian and Oliveto, Rocco and Palomba, Fabio},
  booktitle={Proceedings of the 34th International Conference on Software Engineering},
  pages={1419--1422},
  year={2012},
  organization={IEEE Press}
}

[B3] 2024

Teaching Mining Software Repositories.*

Teaching Empirical Research Methods in Software Engineering.

Mining Software Repositories (MSR) has become a popular research area recently. MSR analyzes different sources of data, such as version control systems, code repositories, defect tracking systems, archived communication, deployment logs, and so on, to uncover interesting and actionable insights from the data for improved software development, maintenance, and evolution. This chapter provides an overview of MSR and how to conduct an MSR study, including setting up a study, formulating research goals and questions, identifying repositories, extracting and cleaning the data, performing data analysis and synthesis, and discussing MSR study limitations. Download PDF

Book Chapter Software Quality Z. Codabux, F. Fard, R. Verdecchia, F. Palomba, D. Di Nucci, G. Recupito.

Teaching Mining Software Repositories.*

Z. Codabux, F. Fard, R. Verdecchia, F. Palomba, D. Di Nucci, G. Recupito. Book Chapter Software Quality

Abstract. Mining Software Repositories (MSR) has become a popular research area recently. MSR analyzes different sources of data, such as version control systems, code repositories, defect tracking systems, archived communication, deployment logs, and so on, to uncover interesting and actionable insights from the data for improved software development, maintenance, and evolution. This chapter provides an overview of MSR and how to conduct an MSR study, including setting up a study, formulating research goals and questions, identifying repositories, extracting and cleaning the data, performing data analysis and synthesis, and discussing MSR study limitations. Furthermore, the chapter discusses MSR as part of a mixed method study, how to mine data ethically, and gives an overview of recent trends in MSR as well as reflects on the future. As a teaching aid, the chapter provides tips for educators, exercises for students at all levels, and a list of repositories that can be used as a starting point for an MSR study.

Download PDF

[B2] 2024

Quantum Software Engineering Issues and Challenges: Insights from Practitioners.*

Quantum Software Book: The Practitioners Voice.

Quantum computing is an emerging field in which theoretical principles are being transformed into practical applications, largely due to the efforts of the developer community. In order to ensure that quantum software engineering continues to advance, it is vital to understand the experiences, challenges, and aspirations of developers. This chapter is a continuation of our previous work, which provided a comprehensive survey exploring the adoption patterns and common challenges in quantum software engineering. In addition to the survey, we conducted in-depth, semi-structured interviews with practitioners in the field to gain a deeper and more detailed understanding of their perspectives. Download PDF

Book Chapter Software Quality M. De Stefano, F. Pecorelli, F. Palomba, D. Taibi, D. Di Nucci, A. De Lucia.

Quantum Software Engineering Issues and Challenges: Insights from Practitioners.*

M. De Stefano, F. Pecorelli, F. Palomba, D. Taibi, D. Di Nucci, A. De Lucia. Book Chapter Software Quality

Abstract. Quantum computing is an emerging field in which theoretical principles are being transformed into practical applications, largely due to the efforts of the developer community. In order to ensure that quantum software engineering continues to advance, it is vital to understand the experiences, challenges, and aspirations of developers. This chapter is a continuation of our previous work, which provided a comprehensive survey exploring the adoption patterns and common challenges in quantum software engineering. In addition to the survey, we conducted in-depth, semi-structured interviews with practitioners in the field to gain a deeper and more detailed understanding of their perspectives. Through the interviews and survey findings, we have gained nuanced insights into the motivations, hurdles, and outlook of developers toward the rapidly evolving quantum computing landscape. We describe the research methodology in detail, including the tools and techniques used, in order to provide a comprehensive understanding of the research process. Furthermore, we present critical insights from both the survey and interviews, enriching the narrative with fresh perspectives obtained from the post-publication interviews. This chapter is a blend of academic investigation and real-world practitioner insights, aiming to provide a comprehensive understanding of the current state of quantum software engineering. By illuminating the path for future research and development in this dynamic field, we hope to guide the way toward continued progress and innovation.

Download PDF

[B1] 2014

Anti-Pattern Detection: Methods, Challenges, and Open Issues.*

Advances in Computers.

Anti-patterns are poor solutions to recurring design problems. They occur in object-oriented systems when developers unwillingly introduce them while designing and implementing the classes of their systems. Download PDF

Book Chapter Software Quality F. Palomba, G. Bavota, R. Oliveto, A. De Lucia.

Anti-Pattern Detection: Methods, Challenges, and Open Issues.*

F. Palomba, G. Bavota, R. Oliveto, A. De Lucia. Book Chapter Software Quality

Abstract. Anti-patterns are poor solutions to recurring design problems. They occur in object-oriented systems when developers unwillingly introduce them while designing and implementing the classes of their systems. Several empirical studies have highlighted that anti-patterns have a negative impact on the comprehension and maintainability of software systems. Consequently, their identification has recently received more attention from both researchers and practitioners, who have proposed various approaches to detect them. This chapter discusses the approaches proposed in the literature. In addition, from the analysis of the state of the art, we (i) derive a set of guidelines for building and evaluating recommendation systems supporting the detection of anti-patterns; and (ii) discuss some problems that are still open, to trace future research directions in the field. For this reason, the chapter supports both researchers, who are interested in comprehending the results achieved so far in the identification of anti-patterns, and practitioners, who are interested in adopting a tool to identify anti-patterns in their software systems.

Download PDF BibTeX

@incollection{palomba2014anti,
  title={Anti-pattern detection: Methods, challenges, and open issues},
  author={Palomba, Fabio and De Lucia, Andrea and Bavota, Gabriele and Oliveto, Rocco},
  booktitle={Advances in Computers},
  volume={95},
  pages={201--238},
  year={2014},
  publisher={Elsevier}
}

[W15] MALTESQUE 2020

A Preliminary Study on the Adequacy of Static Analysis Warnings with Respect to Code Smell Prediction.*

4th International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE 2020), Sacramento, USA, 2020.

Code smells are poor implementation choices applied during software evolution that can affect source code maintainability. While several heuristic-based approaches have been proposed in the past, machine learning solutions have recently gained attention since they may potentially address some limitations of state-of-the-art approaches. Download PDF

Workshop Software Quality Empirical Software Engineering S. Lujan, F. Pecorelli, F. Palomba, A. De Lucia, V. Lenarduzzi.

A Preliminary Study on the Adequacy of Static Analysis Warnings with Respect to Code Smell Prediction.*

S. Lujan, F. Pecorelli, F. Palomba, A. De Lucia, V. Lenarduzzi. Workshop Software Quality Empirical Software Engineering

Abstract. Code smells are poor implementation choices applied during software evolution that can affect source code maintainability. While several heuristic-based approaches have been proposed in the past, machine learning solutions have recently gained attention since they may potentially address some limitations of state-of-the-art approaches. Unfortunately, however, machine learning-based code smell detectors still suffer from low accuracy. In this paper, we aim at advancing the knowledge in the field by investigating the role of static analysis warnings as features of machine learning models for the detection of three code smell types. We first verify the potential contribution given by these features. Then, we build code smell prediction models exploiting the most relevant features coming from the first analysis. The main finding of the study is that the warnings given by the considered tools drastically increase the performance of code smell prediction models with respect to what was reported by previous research in the field.

Download PDF

[W14] MALTESQUE 2020

Speeding Up the Data Extraction of Machine Learning Approaches: A Distributed Framework.*

4th International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE 2020), Sacramento, USA, 2020.

In the last decade, mining software repositories (MSR) has become one of the most important sources to feed machine learning models. Especially open-source projects on platforms like GitHub are providing a tremendous amount of data and make them easily accessible. Nevertheless, there is still a lack of standardized pipelines to extract data in an automated and fast way. Download PDF

Workshop Software Quality Empirical Software Engineering M. Steinhauer, F. Palomba.

Speeding Up the Data Extraction of Machine Learning Approaches: A Distributed Framework.*

M. Steinhauer, F. Palomba. Workshop Software Quality Empirical Software Engineering

Abstract. In the last decade, mining software repositories (MSR) has become one of the most important sources to feed machine learning models. Especially open-source projects on platforms like GitHub are providing a tremendous amount of data and make them easily accessible. Nevertheless, there is still a lack of standardized pipelines to extract data in an automated and fast way. Even though several frameworks and tools exist that can fulfill specific tasks or parts of the data extraction process, none of them allows building an automated mining pipeline or supports full parallelization. As a consequence, researchers interested in using mining software repositories to feed machine learning models are often forced to re-implement commonly used tasks, leading to additional development time, and libraries may not be integrated optimally. This preliminary study aims to demonstrate current limitations of existing tools and of Git itself that threaten the prospects of standardization and parallelization. We also introduce the multidimensionality aspects of a Git repository and how they affect the computation time. Then, as a proof of concept, we define an exemplary pipeline for predicting refactoring operations and assess its performance. Finally, we discuss the limitations of the pipeline and further optimizations to be done.

Download PDF

[W13] MALTESQUE 2020

DeepIaC: Deep Learning-based Linguistic Anti-Pattern Detection for Infrastructure-as-Code.*

4th International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE 2020), Sacramento, USA, 2020.

Linguistic anti-patterns are recurring poor practices concerning inconsistencies among the naming, documentation, and implementation of an entity. They impede readability, understandability, and maintainability of source code. In this paper, we attempt to detect linguistic anti-patterns in infrastructure as code (IaC) scripts used to provision and manage computing environments. Download PDF

Workshop Software Quality Empirical Software Engineering N. Borovits, I. Kumara, P. Krishnan, S. Dalla Palma, D. Di Nucci, F. Palomba, D. Tamburri, J. A. van den Heuvel.

DeepIaC: Deep Learning-based Linguistic Anti-Pattern Detection for Infrastructure-as-Code.*

N. Borovits, I. Kumara, P. Krishnan, S. Dalla Palma, D. Di Nucci, F. Palomba, D. Tamburri, J. A. van den Heuvel. Workshop Software Quality Empirical Software Engineering

Abstract. Linguistic anti-patterns are recurring poor practices concerning inconsistencies among the naming, documentation, and implementation of an entity. They impede readability, understandability, and maintainability of source code. In this paper, we attempt to detect linguistic anti-patterns in infrastructure as code (IaC) scripts used to provision and manage computing environments. In particular, we consider inconsistencies between the logic/body of IaC code units and their names. To this end, we propose a novel automated approach that employs word embeddings and deep learning techniques. We build and use the abstract syntax tree of IaC code units to create their code embedments. Our experiments with a dataset systematically extracted from open source repositories show that our approach yields an accuracy between 0.785 and 0.915 in detecting inconsistencies.

Download PDF

[W12] IWoR 2020

An Exploratory Study on the Refactoring of Unit Test Files in Android Applications.*

4th International Workshop on Refactoring (IWoR 2020), Seoul, South Korea, 2020.

An essential activity of software maintenance is the refactoring of source code. Refactoring operations enable developers to take necessary actions to correct bad programming practices (i.e., smells) in the source code of both production and test files. Download PDF

Workshop Software Testing Empirical Software Engineering A. Peruma, C. Newman, M. Mkaouer, A. Ouni, F. Palomba.

An Exploratory Study on the Refactoring of Unit Test Files in Android Applications.*

A. Peruma, C. Newman, M. Mkaouer, A. Ouni, F. Palomba. Workshop Software Testing Empirical Software Engineering

Abstract. An essential activity of software maintenance is the refactoring of source code. Refactoring operations enable developers to take necessary actions to correct bad programming practices (i.e., smells) in the source code of both production and test files. With unit testing being a vital and fundamental part of ensuring the quality of a system, developers must address smelly test code. In this paper, we empirically explore the impact and relationship between refactoring operations and test smells in 250 open-source Android applications (apps). Our experiments showed that the types of refactoring operations performed by developers on test files differ from those performed on non-test files. Further, results around test smells show a co-occurrence between certain smell types and refactorings, and how refactorings are utilized to eliminate smells. Findings from this study will not only further our knowledge of refactoring operations on test files, but will also help developers understand the possible ways to maintain their apps.

Download PDF

[W11] SoHeal 2020

Splicing Community Patterns and Smells: A Preliminary Study.*

3rd International Workshop on Software Health (SoHeal 2020), Seoul, South Korea, 2020.

Software engineering projects are now more than ever a community effort. In the recent past, researchers have shown that their success may not only depend on source code quality, but also on other aspects like the balance of distance, culture, global engineering practices, and more. Download PDF

Workshop Socio-Technical Analytics M. De Stefano, F. Pecorelli, D. Tamburri, F. Palomba, A. De Lucia.

Splicing Community Patterns and Smells: A Preliminary Study.*

M. De Stefano, F. Pecorelli, D. Tamburri, F. Palomba, A. De Lucia. Workshop Socio-Technical Analytics

Abstract. Software engineering projects are now more than ever a community effort. In the recent past, researchers have shown that their success may not only depend on source code quality, but also on other aspects like the balance of distance, culture, global engineering practices, and more. In such a scenario, understanding the characteristics of the community around a project and foreseeing possible problems may be the key to developing successful systems. In this paper, we focus on this research problem and propose an exploratory study on the relation between community patterns, i.e., recurrent mixes of organizational or social structure types, and smells, i.e., sub-optimal patterns across the organizational structure of a software development community that may be precursors of some sort of social debt. We exploit association rule mining to discover frequent relations between them. Our findings show that different organizational patterns are connected to different forms of socio-technical problems, possibly suggesting that practitioners should put in place specific preventive actions aimed at avoiding the emergence of community smells depending on the organization of the project.

Download PDF

[W10] BENEVOL 2019

The Secret Life of Software Communities: What we Know and What we Don’t Know.*

18th BElgian-NEtherlands software eVOLution symposium (BENEVOL 2019), Brussels, Belgium, 2019.

Communities of software practice are increasingly playing a central role in the development, operation, maintenance, and evolution of good-quality software, as well as DevOps pipelines, lean Organizations, and Global Software Development.

BENEVOL 2019 Best Paper Award

Download PDF

Workshop Socio-Technical Analytics G. Catolino, F. Palomba, D. Tamburri.

Abstract. Communities of software practice are increasingly playing a central role in the development, operation, maintenance, and evolution of good-quality software, as well as DevOps pipelines, lean Organizations, and Global Software Development. However, the structures and characteristics behind such communities are still unknown. For this reason, in this paper, we explore the organizational secret of communities, offering a few practical extracts of (1) what we know and is known, (2) what we know to be unknown, and (3) what we know to be tentatively discoverable in the near future from an empirical research point of view. Moreover, the paper provides a number of recommendations for practitioners to help and be helped in their community endeavors.

Download PDF

[W9] WAMA 2019

Healthcare Android Apps: A Tale of the Customers’ Perspective.*

International Workshop on App Market Analytics (WAMA), Tallinn, Estonia, 2019.

Healthcare mobile apps are becoming a reality for users interested in keeping their daily activities under control. In recent years, several researchers have investigated the effect of healthcare mobile apps on the life of their users as well as the positive/negative impact they have on the quality of life. Download PDF

Workshop Mobile Apps Evolution Empirical Software Engineering M. Nicolai, L. Pascarella, F. Palomba, A. Bacchelli.

Abstract. Healthcare mobile apps are becoming a reality for users interested in keeping their daily activities under control. In recent years, several researchers have investigated the effect of healthcare mobile apps on the life of their users as well as the positive/negative impact they have on the quality of life. Nonetheless, it remains unclear how users approach and interact with the developers of those apps. Understanding whether healthcare mobile app users request different features with respect to other applications is important to estimate the alignment between the development process of healthcare apps and the requests of their users. In this study, we perform an empirical analysis aimed at (i) classifying the user reviews of healthcare open-source apps and (ii) analyzing the sentiment with which users write reviews of those apps. In doing so, we define a manual process that enables the creation of an extended taxonomy of healthcare users’ requests. The results of our study show that users of healthcare apps are more likely to request new features and support for other hardware than users of different types of apps. Moreover, they tend to be less critical of the defects of the application and better support developers when debugging.

Download PDF

[W8] GE 2019

Characterizing Women (Not) Contributing To Open-Source.*

International Workshop on Gender Equality (GE), Montreal, Canada, 2019.

Women are under-represented not only in software development, but also in the Open-Source Software (OSS) community. Based on previous research, there are observed differences between developers who contribute to OSS and those who do not. Download PDF

Workshop Socio-Technical Analytics Empirical Software Engineering P. Wurzelova, F. Palomba, A. Bacchelli.

Abstract. Women are under-represented not only in software development, but also in the Open-Source Software (OSS) community. Based on previous research, there are observed differences between developers who contribute to OSS and those who do not. In this study, we examine whether the same differences are present in a sample of women. Characterizing women who participate in OSS may help to attract other women to contribute to OSS. Furthermore, it might uncover potential biases in data about female developers that are gathered through the mining of software repositories. Using the data from the Stack Overflow Developer Survey 2018, counting 100,000+ respondents (6.9% female), we compare the characteristics of women who report contributing to OSS and those who report not contributing. Surprisingly, we did not find the differences that we expected based on previous literature, thus suggesting that open-source software data seem to represent well the closed-source population, in the context of female developers. However, the correlates of female under-representation in OSS remain unexplained.

Download PDF

[W7] RAISE 2018

Evaluating the Adaptive Selection of Classifiers for Cross-Project Bug Prediction*

International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE 2018), Gothenburg, Sweden.

Bug prediction models are used to locate source code elements more likely to be defective. One of the key factors influencing their performances is related to the selection of a machine learning method (a.k.a., classifier) to use when discriminating buggy and non-buggy classes. Download PDF

Workshop Software Quality Empirical Software Engineering D. Di Nucci, F. Palomba, A. De Lucia.

Abstract. Bug prediction models are used to locate source code elements more likely to be defective. One of the key factors influencing their performances is related to the selection of a machine learning method (a.k.a., classifier) to use when discriminating buggy and non-buggy classes. Given the high complementarity of stand-alone classifiers, a recent trend is the definition of ensemble techniques, which try to effectively combine the predictions of different stand-alone machine learners. In a recent work we proposed ASCI, a technique that dynamically selects the right classifier to use based on the characteristics of the class on which the prediction has to be made. We tested it in a within-project scenario, showing its higher accuracy with respect to the Validation and Voting strategy. In this paper, we continue this line of research by (i) evaluating ASCI in a global and local cross-project setting and (ii) comparing its performances with those achieved by a stand-alone and an ensemble baseline, namely Naive Bayes and Validation and Voting, respectively. A key finding of our study shows that ASCI is able to perform better than the other techniques in the context of cross-project bug prediction. Moreover, although local learning is not able to improve the performances of the corresponding models in most cases, it is able to improve the robustness of the models relying on ASCI.
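
To make the idea of dynamically selecting a classifier per class more concrete, here is a minimal sketch of one possible selection scheme. It is not the ASCI implementation: the two rule-based "classifiers", the metric tuples, and the 1-nearest-neighbour selector are all hypothetical stand-ins chosen for brevity, since the abstract does not specify how the selection is learned.

```python
# Hypothetical base classifiers over (lines_of_code, coupling) metrics.
def size_rule(metrics):
    return metrics[0] > 300      # predicts "buggy" for large classes

def coupling_rule(metrics):
    return metrics[1] > 10       # predicts "buggy" for highly coupled classes

def label_training_set(classifiers, instances, labels):
    """Pair each training instance with the index of the first base
    classifier that predicts it correctly (instances nobody gets right
    are dropped)."""
    labelled = []
    for x, y in zip(instances, labels):
        correct = [i for i, clf in enumerate(classifiers) if clf(x) == y]
        if correct:
            labelled.append((x, correct[0]))
    return labelled

def predict(classifiers, labelled, x):
    """Reuse the classifier that was right for the most similar training
    instance (squared Euclidean distance on the raw, unscaled metrics)."""
    def distance(item):
        return sum((a - b) ** 2 for a, b in zip(item[0], x))
    _, chosen = min(labelled, key=distance)
    return classifiers[chosen](x)

# Hypothetical training data: (lines_of_code, coupling) -> buggy?
classifiers = [size_rule, coupling_rule]
train_x = [(500, 2), (100, 20), (450, 3), (120, 15), (80, 1)]
train_y = [True, True, True, True, False]
labelled = label_training_set(classifiers, train_x, train_y)

# A small but highly coupled class: the size rule alone would miss it,
# while the selector delegates to the coupling rule.
small_coupled = predict(classifiers, labelled, (110, 18))
```

In this toy setting the selector routes each new class to the rule that worked for its closest training neighbour. A real study would use trained learners and a properly validated selector.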

Download PDF BibTeX

@inproceedings{di2018evaluating,
  title={Evaluating the adaptive selection of classifiers for cross-project bug prediction},
  author={Di Nucci, Dario and Palomba, Fabio and De Lucia, Andrea},
  booktitle={2018 IEEE/ACM 6th International Workshop on Realizing Artificial Intelligence Synergies in Software Engineering (RAISE)},
  pages={48--54},
  year={2018},
  organization={IEEE}
}

[W6] DEVOPS 2018

Omniscient DevOps Analytics.*

First international workshop on software engineering aspects of continuous development and new paradigms of software production and deployment (DEVOPS 2018), Chateau de Villebrumier, France, 2018.

DevOps predicates the continuity between Development and Operations teams at an unprecedented scale. Also, the continuity does not stop at tools, or processes but goes beyond into organisational practices, collaboration, co-located and coordinated effort. Download PDF

Workshop Socio-Technical Analytics D. A. Tamburri, D. Di Nucci, L. Di Giacomo, F. Palomba.

Abstract. DevOps predicates the continuity between Development and Operations teams at an unprecedented scale. Also, the continuity does not stop at tools, or processes but goes beyond into organisational practices, collaboration, co-located and coordinated effort. We conjecture that this unprecedented scale of continuity requires predictive analytics which are *omniscient*, that is, (a) transversal to the technical, organisational, and social stratification in software processes, and (b) correlate all strata to provide a live and holistic snapshot of software development, its operations, and organisation. We elaborate on this conjecture and illustrate it with an example scenario.

Download PDF BibTeX

@inproceedings{tamburri2018omniscient,
  title={Omniscient devops analytics},
  author={Tamburri, Damian Andrew and Di Nucci, Dario and Di Giacomo, Lucio and Palomba, Fabio},
  booktitle={International Workshop on Software Engineering Aspects of Continuous Development and New Paradigms of Software Production and Deployment},
  pages={48--59},
  year={2018},
  organization={Springer}
}

[W5] BENEVOL 2017

Social Debt Analytics for Improving the Management of Software Evolution Tasks.*

16th BElgian-NEtherlands software eVOLution symposium (BENEVOL 2017), Antwerp, Belgium, 2017.

The success of software engineering projects is in large part dependent on social and organizational aspects of the development community. Indeed, it not only depends on the complexity of the product or the number of requirements to be implemented, but also on people, processes, and how they impact the technical side of software development. Download PDF

Workshop Socio-Technical Analytics F. Palomba, A. Serebrenik, A. Zaidman.

Abstract. The success of software engineering projects is in large part dependent on social and organizational aspects of the development community. Indeed, it not only depends on the complexity of the product or the number of requirements to be implemented, but also on people, processes, and how they impact the technical side of software development. Social debt represents patterns across the organizational structure around a software system that may lead to additional unforeseen project costs. Condescending behavior, disgruntlement or rage quitting are just some examples of social issues that may occur among the developers of a software project. While the research community has recently investigated the underlying dynamics leading to the introduction of social debt (e.g., the so-called “community smells” which represent symptoms of the presence of social problems in a community), as well as how such debt can be paid off, there is still a noticeable lack of empirical evidence on how social debt impacts software maintenance and evolution. In this paper, we present our position on how social debt can impact technical aspects of source code by presenting a road map toward a deeper understanding of this relationship.

Download PDF BibTeX

@inproceedings{palomba2017social,
  title={Social Debt Analytics for Improving the Management of Software Evolution Tasks.},
  author={Palomba, Fabio and Serebrenik, Alexander and Zaidman, Andy},
  booktitle={BENEVOL},
  pages={18--21},
  year={2017}
}

[W4] MALTESQUE 2017

Investigating Code Smell Co-Occurrences using Association Rule Learning: A Replicated Study.*

International Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE 2017), Klagenfurt, Austria.

Previous research demonstrated how code smells (i.e., symptoms of the presence of poor design or implementation choices) threaten software maintainability. Moreover, some studies showed that their interaction has a stronger negative impact on the ability of developers to comprehend and enhance the source code when compared to cases when a single code smell instance affects a code element (i.e., a class or a method). Download PDF

Workshop Software Quality Empirical Software Engineering F. Palomba, R. Oliveto, A. De Lucia.

Abstract. Previous research demonstrated how code smells (i.e., symptoms of the presence of poor design or implementation choices) threaten software maintainability. Moreover, some studies showed that their interaction has a stronger negative impact on the ability of developers to comprehend and enhance the source code when compared to cases when a single code smell instance affects a code element (i.e., a class or a method). While such studies analyzed the effect of the co-presence of more smells from the developers’ perspective, little knowledge is currently available regarding which code smell types tend to co-occur in the source code. Indeed, previous papers on smell co-occurrence have been conducted on a small number of code smell types or on small datasets, thus possibly missing important relationships. To corroborate and possibly enlarge the knowledge on the phenomenon, in this paper we provide a large-scale replication of previous studies, taking into account 13 code smell types on a dataset composed of 395 releases of 30 software systems. Code smell co-occurrences have been captured by using association rule mining, an unsupervised learning technique able to discover frequent relationships in a dataset. The results highlighted some expected relationships, but also shed light on co-occurrences missed by previous research in the field.
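
As a rough illustration of the association rule mining mentioned in the abstract, the sketch below computes support, confidence, and lift for pairwise smell co-occurrence rules. The per-class smell sets, the function name `pairwise_rules`, and the thresholds are hypothetical; they are not taken from the study's dataset or tooling, and only single-smell antecedents and consequents are considered.

```python
from itertools import permutations

def pairwise_rules(transactions, min_support, min_confidence):
    """Return rules as (antecedent, consequent, support, confidence, lift).

    Each transaction is the set of smell types detected in one class.
    """
    n = len(transactions)
    items = sorted(set().union(*transactions))
    # Fraction of classes affected by each smell type.
    freq = {i: sum(i in t for t in transactions) / n for i in items}
    rules = []
    for a, b in permutations(items, 2):
        # Fraction of classes affected by both smells.
        support = sum(a in t and b in t for t in transactions) / n
        if support < min_support:
            continue
        confidence = support / freq[a]   # estimate of P(b | a)
        if confidence < min_confidence:
            continue
        lift = confidence / freq[b]      # > 1 means positive association
        rules.append((a, b, support, confidence, lift))
    return rules

# Hypothetical data: smells detected in five classes.
classes = [
    {"Long Method", "Complex Class"},
    {"Long Method", "Complex Class", "Spaghetti Code"},
    {"Long Method"},
    {"Complex Class", "Long Method"},
    {"Feature Envy"},
]
rules = pairwise_rules(classes, min_support=0.3, min_confidence=0.6)
for a, b, support, confidence, lift in rules:
    print(f"{a} -> {b}: support={support:.2f} "
          f"confidence={confidence:.2f} lift={lift:.2f}")
```

With these toy inputs, only the rules linking Long Method and Complex Class pass both thresholds. Full rule miners such as Apriori also handle multi-item antecedents, which this sketch leaves out.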

Download PDF BibTeX

@inproceedings{palomba2017investigating,
  title={Investigating code smell co-occurrences using association rule learning: A replicated study},
  author={Palomba, Fabio and Oliveto, Rocco and De Lucia, Andrea},
  booktitle={2017 IEEE Workshop on Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE)},
  pages={8--13},
  year={2017},
  organization={IEEE}
}

[W3] SBST 2016

On the Diffusion of Test Smells in Automatically Generated Test Code: An Empirical Study.*

9th International Workshop on Search-based Software Testing (SBST 2016), Austin, USA, 2016.

The role of software testing in the software development process is widely recognized as a key activity for successful projects. This is the reason why in the last decade several automatic unit test generation tools have been proposed, focusing particularly on high code coverage. Download PDF

Workshop Software Testing Empirical Software Engineering F. Palomba, D. Di Nucci, A. Panichella, R. Oliveto, A. De Lucia.

Abstract. The role of software testing in the software development process is widely recognized as a key activity for successful projects. This is the reason why in the last decade several automatic unit test generation tools have been proposed, focusing particularly on high code coverage. Despite the effort spent by the research community, there is still a lack of empirical investigation aimed at analyzing the characteristics of the produced test code. Indeed, while some studies inspected the effectiveness and the usability of these tools in practice, it is still unknown whether the generated test code is maintainable. In this paper, we conducted a large-scale empirical study in order to analyze the diffusion of bad design solutions, namely test smells, in automatically generated unit test classes. Results of the study show the high diffusion of test smells as well as the frequent co-occurrence of different types of design problems. Finally, we found that all test smells have a strong positive correlation with structural characteristics of the systems such as size or number of classes.

Download PDF BibTeX

@inproceedings{palomba2016diffusion,
  title={On the diffusion of test smells in automatically generated test code: An empirical study},
  author={Palomba, Fabio and Di Nucci, Dario and Panichella, Annibale and Oliveto, Rocco and De Lucia, Andrea},
  booktitle={Proceedings of the 9th international workshop on search-based software testing},
  pages={5--14},
  year={2016},
  organization={ACM}
}

[W2] 2013

ARIES: An Eclipse plug-in to Support Extract Class Refactoring.*

8th Italian Workshop on Eclipse Technologies, Crema, Italy, 2013. LCNS Press.

During Object-Oriented development, developers try to define classes having (i) strongly related responsibilities, i.e., high cohesion, and (ii) limited number of dependencies with other classes, i.e., low coupling [1]. Download PDF

Workshop Software Quality Tool Demo G. Bavota, A. De Lucia, A. Marcus, R. Oliveto, F. Palomba, M. Tufano

Abstract. During Object-Oriented development, developers try to define classes having (i) strongly related responsibilities, i.e., high cohesion, and (ii) a limited number of dependencies with other classes, i.e., low coupling [1]. Unfortunately, due to strict deadlines, programmers do not always have sufficient time to make sure that the resulting source code conforms to such development laws [8]. In particular, during software evolution the internal structure of the system undergoes continuous modifications that make the source code more complex and cause it to drift away from its original design. Classes grow rapidly because programmers often add a responsibility to a class thinking that it is not required to include it in a separate class. However, when the added responsibility grows and breeds, the class becomes too complex and its quality deteriorates [8]. A class having more than one responsibility has generally low cohesion and high coupling. Several empirical studies provided evidence that high levels of coupling and lack of cohesion are generally associated with lower productivity, greater rework, and more significant design efforts for developers [6], [10], [11], [12], [13]. In addition, classes with lower cohesion and/or higher coupling have been shown to correlate with higher defect rates [9], [14], [15]. Classes with unrelated methods often need to be restructured by distributing some of their responsibilities to new classes, thus reducing their complexity and improving their cohesion. The research domain that addresses this problem is referred to as refactoring [8]. In particular, Extract Class Refactoring allows splitting classes with many responsibilities into different classes. Moreover, it is a widely used technique to address the Blob antipattern [8], namely a large and complex class, with generally low cohesion, that centralizes the behavior of a portion of a system and only uses other classes as data holders.
It is worth noting that performing Extract Class Refactoring operations manually might be very difficult, due to the high complexity of some Blobs. For this reason, several approaches and tools have been proposed to support this kind of refactoring. Bavota et al. [2] proposed an approach based on graph theory that is able to split a class with low cohesion into two classes having a higher cohesion, using a MaxFlow-MinCut algorithm. An important limitation of this approach is that classes often need to be split into more than two classes. Such a problem can be mitigated using partitioning or hierarchical clustering algorithms. However, such algorithms suffer from important limitations as well. The former requires as input the number of clusters, i.e., the number of classes to be extracted, while the latter requires the definition of a threshold to cut the dendrogram. Unfortunately, no heuristics have been derived to suggest good default values for all these parameters. Indeed, in the tool JDeodorant [7], which uses a hierarchical clustering algorithm to support Extract Class Refactoring, the authors tried to mitigate such an issue by proposing different refactoring opportunities that can be obtained using various thresholds to cut the dendrogram. However, such an approach requires an additional effort by the software engineer, who has to analyze different solutions in order to identify the one that provides the most adequate division of responsibilities. We tried to mitigate such deficiencies by defining an approach able to suggest a suitable decomposition of the original class by also identifying the appropriate number of classes to extract [3, 4]. Given a class to be refactored, the approach calculates a measure of cohesion between all the possible pairs of methods in the class. Such a measure captures relationships between methods that impact class cohesion (e.g., attribute references, method calls, and semantic content).
Then, a weighted graph is built where each node represents a method and the weight of an edge that connects two nodes is given by the cohesion of the two methods. The higher the cohesion between two methods, the higher the likelihood that the methods should be in the same class. Thus, a cohesion threshold is applied to cut all the edges having cohesion lower than the threshold in order to reduce spurious relationships between methods. The approach defines chains of strongly related methods exploiting the transitive closure of the filtered graph. The extracted chains are then refined by merging trivial chains (i.e., chains with few methods) with non-trivial chains. Exploiting the extracted chains of methods, it is possible to create new classes, one for each chain, having higher cohesion than the original class. In this paper, we present the implementation of the proposed Extract Class Refactoring method in ARIES (Automated Refactoring In EclipSe) [5], a plug-in to support refactoring operations in Eclipse. ARIES provides support for Extract Class Refactoring through a three-step wizard. In the first step, shown in figure 1, the tool supports the software engineer in the identification of candidate Blobs by computing three quality metrics, namely LCOM5 [6], C3 [9] and MPC [16]. Thus, ARIES does not compute an overall quality of the classes, but it considers only cohesion and coupling as the main indicators of class quality in this context. Hence, Blobs are usually outliers or classes having a quality much lower than the average quality of the system under analysis [9]. The identification of Blobs in ARIES is based on such a conjecture. In the second step of the wizard, the software engineer has the possibility to further analyze a candidate Blob and get insights on the different responsibilities implemented by analyzing its topic map, represented as the five most frequent terms in a class (the terms present in the highest number of methods).
For this reason, the topic map is represented by a pentagon where each vertex represents one of the main topics. Once a class that needs to be refactored is identified, the software engineer activates the last step of the wizard (shown in figure 2) to obtain a possible restructuring of the class under analysis. ARIES reports for each class that should be extracted from the Blob the following information: (i) its topic map; (ii) the set of methods composing it; and (iii) a text field where the developer can assign a name to the class. The tool also allows the developer to customize the proposed refactoring by moving methods between the extracted classes. In addition, ARIES offers the software engineer on-demand analysis of the quality improvement obtained by refactoring the Blob, by comparing various measures of the new classes with the measures of the Blob. When the developer ends the analysis, the extraction process begins. ARIES will generate the new classes making sure that the changes made by the refactoring do not introduce any syntactic error. A video of the tool is available on YouTube.
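
The chain-extraction steps described in the abstract (threshold filtering of the method graph, transitive closure, merging of trivial chains) can be sketched roughly as follows. This is an illustrative reading, not the ARIES code: the pairwise cohesion values are given as input rather than computed from attribute references, method calls, and semantic content, and the rule used to merge a trivial chain (attach it to the non-trivial chain with the highest average cohesion) is an assumption, since the abstract does not state it. All method names and values are hypothetical.

```python
from itertools import combinations

def extract_chains(methods, cohesion, threshold, min_size=2):
    """Group methods into chains of strongly related methods.

    `cohesion` maps frozenset({m1, m2}) to a value in [0, 1];
    pairs that are missing count as 0.
    """
    # Keep only the edges whose cohesion reaches the threshold.
    neighbours = {m: set() for m in methods}
    for a, b in combinations(methods, 2):
        if cohesion.get(frozenset((a, b)), 0.0) >= threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    # Connected components of the filtered graph (its transitive closure).
    chains, seen = [], set()
    for m in methods:
        if m in seen:
            continue
        stack, chain = [m], []
        while stack:
            cur = stack.pop()
            if cur in seen:
                continue
            seen.add(cur)
            chain.append(cur)
            stack.extend(neighbours[cur] - seen)
        chains.append(chain)

    big = [c for c in chains if len(c) >= min_size]
    small = [c for c in chains if len(c) < min_size]
    if not big:
        return [sorted(c) for c in chains]

    # Assumed merge rule: attach each trivial chain to the non-trivial
    # chain with the highest average pairwise cohesion (unfiltered values).
    for chain in small:
        def avg_cohesion(target):
            pairs = [(a, b) for a in chain for b in target]
            total = sum(cohesion.get(frozenset(p), 0.0) for p in pairs)
            return total / len(pairs)
        max(big, key=avg_cohesion).extend(chain)
    return [sorted(c) for c in big]

# Hypothetical Blob with six methods and a few known cohesion values.
methods = ["open", "read", "close", "render", "paint", "log"]
cohesion = {
    frozenset(("open", "read")): 0.8,
    frozenset(("read", "close")): 0.7,
    frozenset(("open", "close")): 0.2,
    frozenset(("render", "paint")): 0.9,
    frozenset(("log", "paint")): 0.3,
    frozenset(("log", "read")): 0.1,
}
chains = extract_chains(methods, cohesion, threshold=0.5)
print(chains)
```

Each resulting chain corresponds to one candidate class to extract. Here `log` has no edge above the threshold, so it forms a trivial chain and is merged into the chain it is most cohesive with.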

Download PDF BibTeX

@article{bavotaaries,
  title={ARIES: An Eclipse plug-in to Support Extract Class Refactoring},
  author={Bavota, Gabriele and De Lucia, Andrea and Marcus, Andrian and Oliveto, Rocco and Palomba, Fabio and Tufano, Michele}
}

[W1] 2013

Textual Analysis and Software Quality: Challenges and Opportunities.*

50th Italian Workshop on Computing and Distributed Computing, Salerno, Italy, 2013.

Source code lexicon (identifier names and comments) has been used – as an alternative or as a complement to source code structure – to perform various kinds of analyses (e.g., traceability recovery). Download PDF

Workshop Software Quality G. Bavota, A. De Lucia, R. Oliveto, F. Palomba, A. Panichella.

Abstract. Source code lexicon (identifier names and comments) has been used – as an alternative or as a complement to source code structure – to perform various kinds of analyses (e.g., traceability recovery). In recent years, all these successful applications have increased the interest in using textual analysis for improving and assessing the quality of a software system. In particular, textual analysis could be used to identify refactoring opportunities or ambiguous identifiers that may increase the program comprehension burden by creating a mismatch between the developers' cognitive model and the intended meaning of the term, thus ultimately increasing the risk of fault proneness. In addition, when used “on-line” during software development, textual analysis could guide the programmers to select better identifiers aiming at improving the quality of the source code lexicon. In this paper, we overview research in text analysis for the assessment and the improvement of software quality and discuss our achievements to date, the challenges, and the opportunities for the future.

Download PDF

[T1] 2017

Code Smells: Relevance of the Problem and Novel Detection Techniques.*

A thesis submitted for the degree of Doctor of Philosophy.

Software systems are becoming the core of the business of several industrial companies and, for this reason, they are getting bigger and more complex. Furthermore, they are subject to frantic modifications every day with regard to the implementation of new features or for bug fixing activities. Download PDF

PhD Thesis Software Quality Software Testing Empirical Software Engineering F. Palomba.

Abstract. Software systems are becoming the core of the business of several industrial companies and, for this reason, they are getting bigger and more complex. Furthermore, they are subject to frantic modifications every day with regard to the implementation of new features or for bug fixing activities. In this context, developers often do not have the possibility to design and implement ideal solutions, leading to the introduction of technical debt, i.e., “not quite right code which we postpone making it right”. One noticeable symptom of technical debt is represented by the bad code smells, which were defined by Fowler to indicate sub-optimal design choices applied in the source code by developers. In the recent past, several studies have demonstrated the negative impact of code smells on the maintainability of the source code, as well as on the ability of developers to comprehend a software system. This is the reason why several automatic techniques and tools aimed at discovering portions of code affected by design flaws have been devised. Most of them rely on the analysis of the structural properties (e.g., method calls) mined from the source code. Despite the effort spent by the research community in recent years, there are still limitations that threaten the industrial applicability of tools for detecting code smells. Specifically, there is a lack of evidence regarding (i) the circumstances leading to code smell introduction and (ii) the real impact of code smells on maintainability, since previous studies focused the attention on a limited number of software projects. Moreover, existing code smell detectors might be inadequate for the detection of many code smells defined in literature. For instance, a number of code smells are intrinsically characterized by how code elements change over time, rather than by structural properties extractable from the source code.
In the context of this thesis we face these specific challenges, by proposing a number of large-scale empirical investigations aimed at understanding (i) when and why smells are actually introduced, (ii) what is their longevity and the way developers remove them in practice, (iii) what is the impact of code smells on change- and fault-proneness, and (iv) how developers perceive code smells. At the same time, we devise two novel approaches for code smell detection that rely on alternative sources of information, i.e., historical and textual, and we evaluate and compare their ability in detecting code smells with respect to other existing baseline approaches relying solely on structural analysis. The findings reported in this thesis somewhat contradict common expectations. In the first place, we demonstrate that code smells are usually introduced during the first commit on the repository involving a source file, and therefore they are not the result of frequent modifications during the history of source code. More importantly, almost 80% of the smells survive during the evolution, and the number of refactoring operations performed on them is dramatically low. Of these, only a small percentage actually removed a code smell. At the same time, we also found that code smells have a negative impact on maintainability, and in particular on both change- and fault-proneness of classes. In the second place, we demonstrate that developers can correctly perceive only a subset of code smells characterized by long or complex code, while the perception of other smells depends on the intensity with which they manifest themselves. Furthermore, we also demonstrate the usefulness of historical and textual analysis as a way to improve existing detectors using orthogonal information. The usage of these alternative sources of information helps developers correctly diagnose design problems and, therefore, they should be actively exploited in future research in the field.
Finally, we provide a set of open issues that need to be addressed by the research community in the future, as well as an overview of further future applications of code smells in other software engineering fields.

Download PDF BibTeX

@article{palomba2017code,
  title={Code smells: relevance of the problem and novel detection techniques},
  author={Palomba, Fabio},
  year={2017},
  publisher={Universita degli studi di Salerno}
}

ABOUT

CAREER

ASSOCIATE PROFESSOR

UNIVERSITY OF SALERNO

ASSISTANT PROFESSOR

UNIVERSITY OF SALERNO

SENIOR RESEARCH ASSOCIATE

UNIVERSITY OF ZURICH - Switzerland

POST-DOC RESEARCHER

DELFT UNIVERSITY OF TECHNOLOGY - Netherlands

EINDHOVEN UNIVERSITY OF TECHNOLOGY - Netherlands

DEGREE OF EUROPEAN DOCTOR OF PHILOSOPHY (PH.D.) IN MANAGEMENT AND INFORMATION TECHNOLOGY

UNIVERSITY OF SALERNO

MASTER’S DEGREE (M.SC.) IN COMPUTER SCIENCE

UNIVERSITY OF SALERNO

BACHELOR’S DEGREE (B.SC.) IN COMPUTER SCIENCE

UNIVERSITY OF MOLISE

ITALIAN SCIENTIFIC QUALIFICATION AS FULL PROFESSOR

SECTOR 01/B1 – INFORMATICA

ITALIAN SCIENTIFIC QUALIFICATION AS FULL PROFESSOR

SECTOR 09/H1 – SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI

ITALIAN SCIENTIFIC QUALIFICATION AS ASSOCIATE PROFESSOR

SECTOR 01/B1 – INFORMATICA

ITALIAN SCIENTIFIC QUALIFICATION AS ASSOCIATE PROFESSOR

SECTOR 09/H1 – SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI

LICENCE OF COMPUTER ENGINEER

UNIVERSITY OF MOLISE

PUBLICATIONS

Understanding Machine Learning Testing in Practice.*

Elsevier's Journal of Systems and Software

RobustDRNet: A Clinically-Aligned Hybrid Ensemble Model with Multi-Method Explainability for Lesion-Aware Diabetic Retinopathy Grading.*

Elsevier's Expert Systems with Applications (ESWA)

Advancing LLM-Based Issue Report Classification with Explained Few-Shot Learning, Intent Extraction, Ensemble, and Summarization.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

What Were You Thinking? An LLM-Driven Large-Scale Study of Refactoring Motivations in Open-Source Projects.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

From Zero to Hero: A Scoping Review of the Emergence of the Metaverse in the Virtual Environments History.*

Springer's Virtual Reality (VR)

Pythonic vs Refactorable Pythonic: On the Relationship between Pythonic Idioms and Code Quality in Machine Learning Projects.*

Elsevier's Information and Software Technology (IST)

Analyzing the Ripple Effects of Refactoring.*

Springer's Journal of Empirical Software Engineering (EMSE)

Fair and Square? Evaluating Fairness of LLM-Generated Synthetic Datasets.*

Elsevier's Information and Software Technology (IST)

Sustainability of Machine Learning-Enabled Systems: The Machine Learning Practitioner's Perspective.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

A Novel, Tool-Supported Catalog of Community Smell Symptoms.*

Wiley's Journal of Software: Evolution and Process (JSEP)

Fairness Set and Forgotten: Mining Fairness Toolkit Usage in Open-Source Machine Learning Projects.*

Elsevier's Information and Software Technology (IST)

Back to the Roots: Assessing Mining Techniques for Java Vulnerability-Contributing Commits.*

ACM Transactions on Software Engineering and Methodology (TOSEM)

Another Brick in the Wall: A Systematic Mapping Study Toward Defining Metaverse Engineering Through Socio-Technical Issues.*

ACM Computing Surveys (CSUR)

Fairness on a Budget, Across the Board: A Cost-Effective Evaluation of Fairness-Aware Practices Across Contexts, Tasks, and Sensitive Attributes.*

Elsevier's Information and Software Technology (IST)

SENEM-AI: Leveraging LLMs for Student Behavior Simulation in Virtual Learning Environments.*

Elsevier's SoftwareX

RECOVER: Toward Requirements Generation from Stakeholders’ Conversations.*