Comparison of Non-Homogeneous Poisson Process Software Reliability Models in Web Applications

Software reliability is an important quality factor that affects project success. By modelling software reliability, it can be estimated when, and with how much effort, a project can be deployed, which in turn contributes to the resource and schedule planning of a project. Software reliability models (SRMs) are therefore frequently used for measuring the maturity of software. A number of studies in the literature compare SRMs in terms of their modelling performance; however, these SRMs also need to be evaluated with the software project domain taken into account. This study aims to compare the performance of SRMs in the context of Web applications. To this end, six software reliability models, namely Goel-Okumoto, Musa Exponential, Inflected S-shaped, Delayed S-shaped, Yamada and Pham Nordmann Zhang Imperfect Fault Detection (PNZ), are evaluated by using the defect records of four Web application projects developed by a Turkish software organization. 100%, 70% and 50% of the recorded data are used as input to the maximum likelihood parameter estimation (MLPE) method, and the results of these three cases are investigated and discussed separately. The goodness of fit and the predictive validity of the models are tested by calculating the Mean Square Error (MSE), Mean Magnitude of Relative Error (MMRE), Percentage Relative Error Deviation (PRED) and Average Balanced Predicted Relative Error (A.BPRE) measures. For each NHPP model, 48 separate cases, which are combinations of the three defect inflow data cases (100%, 70% and 50%), four projects and four measures, are investigated and ranked. It is shown that the NHPP models can be applied to Web applications and that the Delayed S-shaped model displays the best results among the alternatives.
However, it is also observed that the Goel-Okumoto and Yamada models give nearly identical results and that these two models converge to each other with respect to the project defect data used. Combined, these two models obtain the highest ranking scores, and it is concluded that they perform better than the other models with respect to Web-based software.


Introduction
Web applications, with the proliferation of the Internet, have become part of everyone's daily life, having an important impact on business, entertainment and education. Li, Das and Dowe (2014) define Web applications as systems that are typically composed of a back-end (a database) and front-ends (Web pages) that the users interact with over a network by using a browser. Web applications can be static or dynamic, depending on whether they change based on user inputs, interactions, sequences of interactions, etc. (Li, Das, & Dowe, 2014). Öztürk, Çavuşoğlu and Zengin (2015) differentiate Web applications from other software by regarding them as distributed, multi-user, heterogeneous systems, where different software components work in the same architectural environment and respond according to input and server values. Moreover, Web applications have very high quality requirements and are considered to be highly interactive applications that employ several new languages, technologies and programming models (Qian & Miao, 2011). These features of Web applications and their pervasiveness necessitate high quality and low defectiveness. As with any software, Web applications are also tested. Di Lucca and Fasolino (2006) state that the testing of Web applications is conducted in order to address major software engineering issues such as maintainability, testability, security, performance, correctness and reliability, whereas Qian and Miao (2011) point out that testing will have a gradually increasing importance and relevance for ensuring the requested software quality of Web applications. However, several drawbacks exist regarding the testing of Web applications. First of all, the testing of Web applications is frequently neglected by developers. The main reasons for this are market pressure, very short time-to-market periods and the fact that testing efforts, especially for Web applications, are considered to be too time-consuming and with insignificant returns (Hieatt & Mee, 2002; Di Lucca & Fasolino, 2006). Moreover, as stated by Torchiano, Ricca and Marchetto (2011), the number of studies that seek to understand the specific nature of Web application defects is inadequate, and Web application testing techniques and tools are more immature compared to those used for testing desktop or embedded applications. Similarly, Ferrara, De Meo, Fiumara and Baumgartner (2014) argue that the increase in the complexity of Web applications results in test approaches becoming more and more insufficient.
The concept of software testing is closely related to software reliability, and one major approach to assessing the reliability of a Web application is the use of software reliability models (SRMs). IEEE 1633: Recommended Practice on Software Reliability (IEEE Reliability Society, 2008) not only specifies the recommended procedures for software reliability assessment and prediction but also defines two major concepts that are used throughout this paper:

Software Reliability is (a) the probability that software will not cause the failure of a system for a specified time under specified conditions, or (b) the ability of a program to perform a required function under stated conditions for a stated period of time.
An SRM is a mathematical expression that specifies the general form of the software failure process as a function of factors such as fault introduction, fault removal and the operational environment.
SRMs are used to estimate the total number of defects and the rate of software deterioration within a particular time period by using numeric data, and they are the result of applying reliability engineering theory to the software development domain (Rana et al., 2014). The deterioration behavior of the software is predicted by using the known or assumed characteristics of the software in question (Lai & Garg, 2012). During the initial testing processes, some of the software defects are identified and resolved; as the tests progress over time, the number of unidentified defects continuously decreases; in other words, as the probability of finding a defect decreases, the reliability of the software increases (Yamada, 2014). As stated by Rana et al. (2014), SRMs mainly try to answer two very practical questions: "When is the attained software quality enough so that the testing can stop?" and "Considering the remaining defects, is the software ready to be released?". Therefore, we believe that SRMs can provide a solution to the aforementioned testing problems experienced in Web applications, and in this research we focus on a specific category of SRMs, namely Non-Homogeneous Poisson Process (NHPP) models.
This study tries to answer the following three research questions (RQ):
- RQ1: Are the NHPP models applicable to Web applications?
- RQ2: How does each one of the assessed NHPP models perform with respect to Web applications?
- RQ3: How good predictors of the future testing stages in a Web application development are the NHPP model parameters that are estimated by using 70% and 50% of the defect data?
The rest of this paper is structured as follows: First, we review the basic literature regarding SRMs and NHPP models and provide the required background information. Section 3 presents the methodology used in our research. The results and the analysis of these results are given in Section 4. The last section concludes the paper, addresses the validity threats and outlines the planned future work.

Related Work and Background
Reliability is an important factor with respect to the success of Web applications. As stated by Offutt (2002), the majority of Web applications operate within highly competitive environments, where users select the software to use based on the level to which it meets their requirements and can easily switch to some other application if their requirements are not met. When assessing reliability, the different characteristics of the software should be taken into consideration, one of them being the type of the software being assessed. The differences between Web applications and other software are given in detail in several studies (Offutt, 2002; Mendes, 2014; Murugesan, Deshpande, Hansen, & Ginige, 2001). Lists of testing techniques and approaches currently utilized for Web-based applications are given in detail in (Li, Das, & Dowe, 2014), (Di Lucca & Fasolino, 2006) and (Fasolino, Amalfitano, & Tramontana, 2013). The work of Garousi, Mesbah, Betin-Can and Mirshokraie (2013) is a detailed systematic mapping study of Web application testing, and based on that mapping, a systematic literature review of Web application testing is given by Doğan, Serdar and Garousi (2014). In these two studies the authors discuss the emerging trends in Web application testing, concluding that Web testing is an active area of research with an increasing number of publications. Similarly, Qian and Miao (2011) state that with respect to Web-based software testing many issues have not been sufficiently investigated yet, and many open questions, one of them being the reliability of Web applications, still need to be addressed. Moreover, Hieatt and Mee (2002) argue that adequate, efficient and cost-effective approaches for testing Web applications are needed, increasing the significance of research on Web software reliability.
As software reliability is considered to be a must-be quality characteristic, many SRMs have been proposed and applied in practice (Yamada, 2014). Pham (2006) states that SRMs can be grouped into two categories: the deterministic models, which are used to study the number of distinct operators and operands, the number of errors and the number of machine instructions in the program, and the probabilistic models, which represent the failure occurrences and the fault removals as probabilistic events. The probabilistic SRMs can be classified into different groups, namely error seeding, failure rate, curve fitting, reliability growth, Markov structure, time-series and NHPP. This study focuses only on the NHPP models.
NHPP models are based on estimating the mean value function of the cumulative number of defects observed up to a certain point in time and thus provide an analytical framework for describing defect identification during testing (Yamada, 2014). The mean value function m(t), giving the expected cumulative number of defects identified up to time t, is generally of the form

m(t) = ∫₀ᵗ λ(s) ds

where λ(t) is the defect density function. Similarly, the reliability, i.e. the probability that no failure occurs in the interval (t, t + x], is defined with the following function:

R(x | t) = exp(−[m(t + x) − m(t)])

The most common NHPP models, which are used in this study, are given in Table 1, where a(t) is the total number of defects at time t, identified and not, whereas b(t) stands for the defect identification rate at time t (Pham, 2007). The NHPP models make the following specific assumptions: the defects in a software system occur randomly and are independent of each other; the cumulative sum of the defects up to time t follows the NHPP; at any point in the process, the defect density rate in the software is proportional to the defects not yet identified; and when a defect is identified, efforts are undertaken to fix it, and these efforts are independent of the defect location (Pham, 2006; Xie, Hong, & Wohlin, 2003).
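As a minimal sketch of these two functions, the snippet below implements the Goel-Okumoto and Delayed S-shaped mean value functions named in the text and the conditional reliability R(x | t); the parameter values are illustrative, not fitted values from the paper:

```python
import math

def m_goel_okumoto(t, a, b):
    """Expected cumulative defects by time t (Goel-Okumoto): a * (1 - e^(-b*t))."""
    return a * (1.0 - math.exp(-b * t))

def m_delayed_s(t, a, b):
    """Expected cumulative defects by time t (Delayed S-shaped): a * (1 - (1 + b*t) * e^(-b*t))."""
    return a * (1.0 - (1.0 + b * t) * math.exp(-b * t))

def reliability(m, t, x, *params):
    """Probability of no failure in (t, t+x], given a mean value function m."""
    return math.exp(-(m(t + x, *params) - m(t, *params)))

a, b = 100.0, 0.05          # illustrative parameters only
print(m_goel_okumoto(40.0, a, b))                    # defects expected by t = 40
print(reliability(m_goel_okumoto, 40.0, 5.0, a, b))  # reliability over the next 5 time units
```

As testing progresses (larger t), fewer defects remain to be found, so R(x | t) increases, matching the intuition stated in the text.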

NHPP models are approaches whose popularity is continuously increasing within the SRM domain (Lai & Garg, 2012; Pham, 2006). Pham (2003), while comparing the performances of NHPP models, also investigates the effect of environmental factors and of the budget model used on software reliability. Rana et al. (2014) evaluate eight SRMs (namely Musa-Okumoto, Goel-Okumoto, Inflection S-shaped, Delayed S-shaped, Rayleigh, Logistic, Gompertz and Linear) on 11 large software projects within the embedded software domain from three different companies. The authors mainly try to answer which SRMs are the best to assist decisions for optimal allocation of testing resources and which SRMs best assess the release readiness of a software system. They observe that the Gompertz model is the best for software development processes that are either V-model based or follow lean and agile software development practices, and the Logistic model for the waterfall development process, whereas for assessing the release readiness of a software system, Logistic and Gompertz perform best from the perspective of asymptote precision and also when the shape of the defect inflow is predicted. In a similar study by Ullah, Morisio and Vetro (2012), eight SRMs (namely Musa-Okumoto, Inflection S-shaped, Goel-Okumoto, Delayed S-shaped, Logistic, Gompertz, Yamada Exponential and Generalized Goel) are evaluated on fifty different defect record data sets coming from system test phases, field defects and Open Source Software projects. All models are tested for their performance and investigated for each situation. The authors argue that when all data sets are considered Musa-Okumoto fits best, but for the Open Source Software data sets all examined models fit. When the prediction capability is considered, for industrial data sets Musa-Okumoto, Inflection S-shaped and Goel-Okumoto are the best predictors, whereas for Open Source Software Gompertz and Yamada are the best predictors. When fitting and prediction
capability are combined, in the industrial data sets Musa-Okumoto and Inflection S-shaped are the best performers, and for the Open Source Software data sets the best performers are the Gompertz and Inflection S-shaped models. SRMs can be fitted to the observed defect data using statistical techniques such as maximum likelihood parameter estimation (MLPE) or curve-fitting techniques like non-linear least squares. As Zhao and Xie (1996) state, MLPE is the most common approach used for estimating the parameters of NHPP models; therefore, MLPE is used in this research, and the function sets were solved with the use of MATLAB and the MATLAB Optimization Toolbox. To assess which NHPP SRMs perform best and when it is most appropriate to apply them during a project timeline, the approach proposed by Rana et al. (2014) was employed in this study. The NHPP SRMs in question were evaluated by using full and partial data sets. For each of the four projects, the defect data was divided into three sets containing 100%, 70% and 50% of the defect data points respectively (observed region), corresponding to the same level of project completion with respect to the project timeline. The data points in the observed region of each set were then used for model fitting, while the defect data points outside this region (predicted region; 0%, 30% and 50% respectively) were used to assess the predictive power of the SRMs.
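As an illustration of how MLPE fits an NHPP model to grouped defect data, the sketch below maximizes the Poisson log-likelihood of the per-interval defect counts for the Goel-Okumoto model. The paper solves the likelihood equations in MATLAB; here SciPy is used instead, and the weekly cumulative defect counts are invented for the example:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative data: cumulative defects observed at the end of each test week.
t = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([12, 21, 28, 34, 38, 41, 43, 44], dtype=float)

def m(t, a, b):
    """Goel-Okumoto mean value function."""
    return a * (1.0 - np.exp(-b * t))

def neg_log_likelihood(params):
    a, b = params
    if a <= 0.0 or b <= 0.0:
        return np.inf
    mt = m(t, a, b)
    n = np.diff(np.concatenate(([0.0], y)))    # observed defects per interval
    dm = np.diff(np.concatenate(([0.0], mt)))  # expected defects per interval
    # Poisson log-likelihood for grouped NHPP data (constant log(n!) terms dropped)
    return -(np.sum(n * np.log(dm)) - mt[-1])

res = minimize(neg_log_likelihood, x0=[60.0, 0.1], method="Nelder-Mead")
a_hat, b_hat = res.x
print(f"a = {a_hat:.1f}, b = {b_hat:.3f}")  # estimated total defects and detection rate
```

With these made-up counts, the fitted total defect count a exceeds the 44 defects observed so far, i.e. the model predicts a few residual defects, which is exactly the kind of output the 70% and 50% experiments in this study evaluate against the held-out data points.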

Figure 2 Normalized Cumulative Defect Inflow Profiles for the Analyzed Four Projects
Following the identification of the data sets, the SRMs were evaluated with respect to goodness-of-fit and predictive validity criteria (Ahmad, Khan, & Rafi, 2011). The measures of Mean Square Error (MSE), Mean Magnitude of Relative Error (MMRE), Percentage Relative Error Deviation (PRED) and Average Balanced Predicted Relative Error (A.BPRE) were used. In MSE, the error is the difference between actual and predicted defect values, and the lower the MSE, the better the model's goodness-of-fit:

MSE = (1/n) Σᵢ (eaᵢ − eeᵢ)²

The Magnitude of Relative Error (MRE) computes the absolute percentage of error between actual defects (ea) and estimated defects (ee):

MREᵢ = |eaᵢ − eeᵢ| / eaᵢ

MMRE calculates the average of MRE over all investigated items (n):

MMRE = (1/n) Σᵢ MREᵢ

PRED(q) is used to count the percentage of defect estimates whose MRE is less than or equal to q:
PRED(q) = λ / N

where λ is the number of estimates for which MREᵢ ≤ q and N is the number of all estimates. As stated by Chouseinoglou and Aydın (2013), an estimation model with lower MMRE and higher PRED(q) can be interpreted as more accurate than other models, and a model with MMRE ≤ 0.25 and PRED(0.25) ≥ 0.75 is considered to be a good one. Finally, the A.BPRE, introduced by Rana et al. (2014), is based on the Predicted Relative Error (PRE):

PREᵢ = (eeᵢ − eaᵢ) / eeᵢ

To obtain results that are balanced with respect to positive and negative deviations, the Balanced Predicted Relative Error (BPRE) is used:

BPREᵢ = (eeᵢ − eaᵢ) / eeᵢ,  if eeᵢ − eaᵢ ≥ 0
BPREᵢ = (eeᵢ − eaᵢ) / eaᵢ,  if eeᵢ − eaᵢ < 0


The BPRE takes values in the range of −1 to 1; a value closer to 0 denotes that the model is producing better estimations. The A.BPRE is the average of the BPRE values over all investigated items:

A.BPRE = (1/n) Σᵢ BPREᵢ
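As a minimal sketch, the four measures can be computed as below. The BPRE form used here (dividing the deviation by the larger of estimate and actual, so the value stays within [−1, 1]) follows the piecewise definition above but should be checked against Rana et al. (2014); the sample defect values are illustrative:

```python
import numpy as np

def mse(actual, predicted):
    """Mean Square Error between actual and predicted defect counts."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean((actual - predicted) ** 2)

def mmre(actual, predicted):
    """Mean Magnitude of Relative Error: average of |ea - ee| / ea."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual)

def pred(actual, predicted, q=0.25):
    """Fraction of estimates whose MRE is <= q."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    mre = np.abs(actual - predicted) / actual
    return np.mean(mre <= q)

def a_bpre(actual, predicted):
    """Average BPRE; deviation divided by max(ee, ea) keeps each term in [-1, 1]."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    bpre = (predicted - actual) / np.maximum(predicted, actual)
    return np.mean(bpre)

ea = [10, 20, 30, 40]   # actual cumulative defects (illustrative)
ee = [12, 19, 33, 41]   # model estimates (illustrative)
print(mse(ea, ee), mmre(ea, ee), pred(ea, ee), a_bpre(ea, ee))
```

For these sample values, all four MREs fall below 0.25, so PRED(0.25) = 1.0 and, by the rule of thumb quoted above, the hypothetical model would count as a good one.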

Results and Analysis
Since each examined project has different defect inflow characteristics, each project was analyzed separately. Initially, the three data sets were created with respect to the amount of defect data to be used (100%, 70%, 50%), and then the parameters for each NHPP model and for each data set were calculated with the use of MLPE. The calculated NHPP model parameters are given in Table 3. When the parameters are analyzed, it is observed that the models estimate the actual total defects for Project A to be much more than what has been observed, and that only 80%-90% of the total defects for Project B have already been identified. This is in accordance with the fact that these projects (A and B) are not completed yet and are still ongoing. On the contrary, all investigated models denote that almost all defects have been identified for Project C and that it is ready to be released to the customer. Project D is a completed project, but the recorded defect data shows that the project is still producing errors, and the model parameters can be interpreted as indicating that Project D has reached a specific maturity but that a significant number of defects still exist in the software. Based on the parameters, it is estimated that approximately 10%-15% of all defects are yet to be discovered in the software. It can be argued that if Project D is released to the customer in its current state, the probability that it will crash is higher than that of Project C.
The graphs of each investigated NHPP model, drawn based on the parameters calculated for 100%, 70% and 50% of the data, are given in Figure 3, Figure 4 and Figure 5. Each of the NHPP models is evaluated with respect to the four performance measures, namely MSE, MMRE, PRED and A.BPRE. The values of the four performance measures for each project (A, B, C and D) and each input data set (100%, 70% and 50%) are given in Table 4. For MSE, MMRE and A.BPRE, values closer to 0 denote a better result, whereas for PRED, values closer to 1 are interpreted as a better result. In Table 4, each superscript next to a calculated value shows the ranking of that performance measure within the column. For example, the MSE value for Project A / 100% is given as 314.2668², where 314.2668 is the actual calculated value and the superscript (2) denotes that this MSE value ranks second among all MSE values for Project A / 100%. Several models generated similar PRED values; in these cases the PRED values were ranked based on their respective MMRE values, as PRED is calculated by using MRE. The calculated MSE values for the 70%-30% (observed and predicted) and 50%-50% (observed and predicted) splits of each project are given in Figure 6. The models are sorted for each project from the highest total MSE to the lowest. The absolute A.BPRE values of each NHPP model for all four projects and the three data sets (100%, 70% and 50%) are given in Figure 7. The NHPP models are sorted in increasing order of total A.BPRE. It is observed in the graph that Yamada is the model that displays the best results with respect to the A.BPRE values.
In RQ1, we tried to examine whether the NHPP models are applicable to Web applications. When the defect inflow data of the Web applications used in this study is examined, it is observed that defect creation is a random process and does not follow a specific pattern. With respect to these characteristics, it can be argued that the defect creation of the examined Web applications satisfies the assumptions of the NHPP models. The results of the examined NHPP models were communicated to the project team members, who ratified them. Based on these findings, we conclude that NHPP SRMs are applicable to Web applications.
In order to answer RQ2, that is, how each one of the assessed NHPP models performs with respect to Web applications, the performance measurements obtained from each of the assessed NHPP models are examined. Our initial observation is that no single model significantly outperforms all others. Thus, the results are summarized, and the best and second best performing models for each case (100%, 70% and 50% in all four projects) are given in Table 5. The values in Table 5 are read as follows: A50% stands for the measurement for Project A conducted with 50% of the given defect data. Each of the NHPP models is therefore evaluated based on data collected from four projects (A, B, C and D), with three data sets (100%, 70% and 50%) and by using four different measures (MSE, MMRE, PRED and A.BPRE), resulting in a total of 48 measurements. As can be seen, the Delayed S-shaped model performs best in 13 cases, the Yamada model in 12 and the Goel-Okumoto model in 11 cases. Moreover, it is observed that the Yamada and Goel-Okumoto models display very close results with respect to the four measurement types. These two models differ from each other with respect to their total defect functions a(t): in the Goel-Okumoto model the total defect function is the constant a(t) = a, whereas for the Yamada model it is given as a(t) = a·e^(αt). The α parameters computed in this research, given in Table 3, are very small (near-zero) values, thus making the two total defect functions converge. The small α values are interpreted as the introduction of very few new defects over time to the developed software. Because of these α values, the two models converge to each other. Based on these findings, we conclude that the Yamada and Goel-Okumoto models give the best results in Web projects. On the other hand, the worst performing model in almost all cases is Pham Nordmann Zhang.
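The convergence argument can be checked numerically. The sketch below assumes the Yamada imperfect-debugging form with total defect function a(t) = a·e^(αt), whose mean value function is m(t) = a·b/(α + b) · (e^(αt) − e^(−bt)); as α approaches 0 this reduces to the Goel-Okumoto m(t) = a·(1 − e^(−bt)). The parameter values are illustrative, not the fitted values from Table 3:

```python
import math

def m_yamada(t, a, b, alpha):
    """Yamada imperfect-debugging mean value function (assumed form, see lead-in)."""
    return a * b / (alpha + b) * (math.exp(alpha * t) - math.exp(-b * t))

def m_go(t, a, b):
    """Goel-Okumoto mean value function."""
    return a * (1.0 - math.exp(-b * t))

a, b, t = 100.0, 0.1, 30.0   # illustrative parameters
for alpha in (1e-2, 1e-4, 1e-6):
    gap = abs(m_yamada(t, a, b, alpha) - m_go(t, a, b))
    print(f"alpha={alpha:g}  |difference|={gap:.6f}")
```

The printed gap shrinks as α shrinks, which mirrors why near-zero fitted α values make the Yamada and Goel-Okumoto rankings nearly identical in this study.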
When the performance measurements for the models constructed by using 70% and 50% of the defect data sets are examined, it is seen that the models perform well for Projects C and D (the completed projects), but that the performance drops significantly for Projects A and B (the ongoing projects). The outputs were discussed with the project team members in order to answer RQ3, that is, how good predictors of the future testing stages in a Web application development are the NHPP model parameters estimated by using 70% and 50% of the defect data. As stated by Rana et al. (2014), BPRE and A.BPRE can be used to assess the release readiness of a software system. When Figure 7 and Table 5 are examined, it is seen that the Yamada and Delayed S-shaped models perform best with respect to their A.BPRE values, and therefore we can argue that they are the best models for the assessment of the release readiness of a Web application.

Conclusion
In this paper we have evaluated six well-known NHPP SRMs on four software projects within the Web software context, developed in a Turkish software company. By using this unique setting and focusing on this scope, three main research questions were set. In RQ1, we tried to examine whether the NHPP SRMs are applicable to Web applications, and we discovered that NHPP models can be used with Web software defect data, as the project and defect inflow characteristics are in accordance with the NHPP model characteristics. The results of our analysis and the feedback collected from project members ratify our assumption. In RQ2, the performance of each one of the assessed NHPP models with respect to Web applications was analyzed. A total of 48 measurements were compared and ranked, and it is observed that the Yamada and Goel-Okumoto models give the best results in the examined Web projects.
Finally, for RQ3 we tried to assess the release readiness of a Web application by trying to understand how good predictors of the future testing stages are the NHPP model parameters estimated by using 70% and 50% of the defect data. Overall, the Yamada and Delayed S-shaped models performed better with respect to their A.BPRE values, and therefore we argue that these models are better for the assessment of the release readiness of a Web application.
It is important to address the validity threats of this study, and we do so by using the approach presented by Runeson and Höst (2009). The reliability of the study is concerned with the extent to which the data and the analysis depend on the specific researchers. The defect data used was collected directly from the defect databases of the same organization. With the use of semi-structured interviews, it was assessed that all members of the software teams have the same understanding of the term "defect", which was done in order to address the issues of construct validity. All formulas, computations and intermediate steps are given clearly; thus, if the study is repeated, the same results can be obtained. On the other hand, internal validity is related to causal relations. This study tries to assess how NHPP models perform with respect to Web applications; however, factors like the development lifecycle, the development environment, the programming language used and the deployment platform may affect the results. In order to minimize these effects, we tried to select projects with different characteristics. Another threat to internal validity arises from using MLPE to estimate the SRM parameters; however, as stated by Zhao and Xie (1996), MLPE is the most common approach used for estimating the parameters of NHPP models. In order to minimize the internal validity threat regarding the performance assessment, four different performance criteria, namely MSE, MMRE, PRED and A.BPRE, were used, and the results from all four of them were used in the evaluation process. As the data used in this study comes from projects that belong to a single software development company, this is also a threat to the validity of the research. Therefore, in order to assess the generalizability of the obtained results with respect to external validity, it is important to conduct further similar case studies in other software organizations that develop Web software, which is also within the scope of the planned future studies.
In this study, the applicability of NHPP SRMs in Web projects was analyzed with a practical focus. Further studies in a similar direction, which may address different aspects of these models in the specific domain of Web applications, would allow further and wider adoption of reliability approaches, as there is a significant need for more efficient reliability solutions in Web software projects. Combining NHPP SRMs with time-series analysis and modelling separate life-cycle stages of a project with different SRMs are two of the initially planned future studies. We believe that this research constitutes a necessary and important initial step towards the wider and more detailed use of SRMs, and specifically NHPP models, in Web application development processes.

Figure 3 The graphs of NHPP models drawn based on the parameters calculated for 100% of defect data

Figure 6 MSE Values for Studied NHPP Models over Full and Partial Data

AJIT-e: Online Academic Journal of Information Technology, 2016 Summer, Vol. 7, No. 24, DOI: 10.5824/1309-1581.2016.3.001.x
Aydın and Tarhan (2014) investigate whether defect density estimations and the project life cycle are related; for a project with an iterative software development life cycle, the performance of Rayleigh and Linear Regression models is assessed. Comparison of actual and predicted defect density values shows that the Rayleigh model at module level and the Linear Regression model at project level produce more reliable results.

Methodology

Following Robson's approach, as adapted by Runeson and Höst (2009), this research is a case study that aims to evaluate the applicability and performance results of a specific group of SRMs, namely NHPP models, in the context of Web application development. Based on the guidelines and taxonomy of Runeson and Höst (2009), this study is an interpretive case study that uses a fixed design principle, organized as an embedded case study where each of the four projects analyzed constitutes a unit of analysis. The overview of the case study design is given in Figure 1. The case study was conducted at ALTAIR Software and Defence Technologies Inc., a software development organization with an ISO 9001:2008 Quality Certificate, employing approximately 30 employees who work in different project groups with varying responsibilities as developers, testers, project managers and quality assurance specialists. The details of the four Web application projects used in the case study are given in Table 2. The defect data was collected from the JIRA Issue & Project Tracking Software used by the organization, and semi-structured interviews were conducted with the responsible project team members to complement the defect data. The cumulative normalized defect inflow profiles for the four projects analyzed are presented in Figure 2.