A multi-institutional evaluation of Intelligent Tutoring Tools in Numeric Disciplines
A number of researchers have considered the benefits and limitations of Computer Aided Learning (CAL) and its effects on the educational community. CAL has been compared with traditional, human-teacher-led methods to investigate its effectiveness and has been observed to perform better in many applications (Kaplan & Rock, 1995). In other cases, the low performance of CAL can be attributed to poor interface design (Hazari & Reaves, 1994; Wong, 1994), lower flexibility than human teachers, or poor or inappropriate evaluation of CAL packages (Murray, 1993; Shute & Regian, 1993; Duncan, 1993; Alexander & Hedberg, 1994). It is therefore important to develop benchmarks for assessing the suitability of CAL packages in the actual learning environment.
Heller (1991) noted that instructional software, like all other educational material, should be evaluated before it is used in the classroom or research laboratory. The challenge is to decide what to evaluate, who should carry out the process and how it should be carried out. The literature suggests that the evaluation of a tutoring system needs to be carried out in two stages (Wyatt & Spiegelhalter, 1990; Murray, 1993; Legree et al., 1995). Initially, the system should be evaluated for its overall effectiveness and usability. Such evaluations play an important role in informing subsequent modifications of procedures and interface design. When a system meets the objectives of the initial evaluation stage, the efficacy of its components should be determined in the real environment. This paper presents some of the analysis and findings of a multi-institutional evaluation study investigating the efficacy of intelligent tutoring systems designed for numeric disciplines. The study was conducted as part of the formative evaluations carried out towards the end of the software development cycle.
The evaluation of Intelligent Tutoring Systems in numeric disciplines has not received much attention in the literature. Although there are some instances of small-scale evaluations completed within a single institution, little work has been reported on large-scale evaluations conducted across several institutions. This paper is concerned with the findings of research involving a multi-institutional evaluation of the effectiveness of tutoring packages as an alternative to human-led tutorials. It employs a mainly quantitative approach, as favoured by various researchers (for example, Legree et al., 1993; Murray, 1993; Mark & Greer, 1993) for initial investigations, although the subjective views of the students towards the functionality and effectiveness of these packages have also been recorded. The evaluation is based on three packages used for teaching different techniques in management accounting. Although the evaluation studies were conducted under laboratory-based controlled testing conditions and may not provide a fully accurate picture of how students would behave in a real teaching environment, the multi-institutional nature of this study brings it close to a field trial, with a sample size exceeding that required for a power level of 0.95 (Altman, 1991). This is sufficient to enable firm conclusions to be drawn about the efficacy of the tutoring system, at least within the scope of the testing conditions. In addition, an independent study by Stoner & Harvey (1999), described later in this paper, validates the effectiveness of the tutoring packages in a real environment.
Byzantium model of CILE and Intelligent Tutoring Tools
Although computers are being used at all levels of the curriculum, introductory topics are becoming more popular for the use of CAL. One explanation may be the simple and relatively discrete nature of the concepts acquired at the introductory level, which are inter-linked at later stages of study to solve more complex problems. Recognising that students construct knowledge of different degrees of complexity at different stages of their learning, a model of Computer Integrated Learning Environments (CILE) was formulated by a consortium of six universities under the Byzantium project, funded through the Teaching and Learning Technology Programme (TLTP) of the Higher Education Funding Councils of the United Kingdom. This model, which proposes that the level at which a discipline is taught and learnt provides a vital context for tutoring software design, divides the learning of the subject discipline into three distinct knowledge levels:
The current research output is focused on the development of the first-level packages and their evaluation. It is recognised, however, that on-going developments in the fields of the Internet, fuzzy logic and natural language processing may greatly assist subsequent development stages by respectively providing: (i) an infra-structure not only for distributing development efforts but also for linking the outputs of such distributed efforts (Patel & Kinshuk, 1997); (ii) the processing of imprecise and possibly qualitative data; and (iii) a more natural student-computer interaction interface that removes much of the effort in encoding data to suit computer processing and thus lifts current limitations on the range of activities that can be performed on a computer with ease.
The Intelligent Tutoring Tools (ITTs) are aimed at extending a lecturer's scope by horizontally partitioning some of the teaching activities, e.g. supervising the development of operational skills, and assigning them to a tutoring package. Although the accounting domain has been used to develop these ITTs, the structure of an ITT is largely domain independent and the same structure can be used for any numeric discipline. The structure and use of the ITTs have been discussed in Patel & Kinshuk (1996a and 1996b) and Patel, Kinshuk & Russell (2000) respectively.
Evaluation of ITTs
The evaluation stage of the ITT design commenced in May 1995, when students at one university in the United Kingdom studied Capital Investment Appraisal in a two-group parallel trial. The Control group had classroom-based tutorials led by an experienced teacher, whereas the CAL group was exposed to the CAL package in computer-laboratory-based tutorials. Group comparison with the help of pre and post tests provided the initial validation of the effectiveness of the ITT, whereas the observations and subjective questionnaire feedback from the CAL group validated the interface design adopted in the ITT. The study also provided a validation of the measurement techniques and questionnaire design adopted. A Phase II study was carried out after incorporating some design changes resulting from the Phase I study. It was conducted at six UK institutions and utilised three CAL packages: Capital Investment Appraisal, Absorption Costing and Marginal Costing, which were used by different groups of students. A two-group study was organised at two universities on the Capital Investment Appraisal package. At the other institutions, where it was not feasible to test all students, each of the three ITTs was tested on a random sample of about 40 students.
Since the aim of the evaluation in this study was to examine the overall effectiveness of the tutoring packages, mainly quantitative methods were used, as suggested by various researchers (Legree et al., 1993; Murray, 1993; Mark & Greer, 1993), although qualitative views were also obtained from comments recorded during student observation and through the subjective questionnaire. Subject-based evaluation methods were used in the study, as they are widely employed and favoured for the evaluation of CAL packages (Daroca, 1986; Simpson, 1986; Gallagher & Letza, 1991; Tonge et al., 1994; Iqbal et al., 1999). These are based directly on the user's judgement, and the process of data collection is facilitated under laboratory conditions with less chance of bias. Two-group trial studies, the most common technique for the evaluation of CAL packages (Webb et al., 1991; Simons & De Jong, 1992; Wang & Sleeman, 1993; Ruf et al., 1994; Forrester, 1995; Magnuson-Martinson, 1995), were adopted for assessing the effectiveness of the packages.
Two types of subject-based evaluation techniques were used: questionnaires and observations. Since the questionnaires contained both structured and open-ended questions, a large amount of specific information could be elicited quickly and easily, and users were free to provide detailed opinions about the packages in the open-ended questions. Students were also observed by one of the authors and a staff member at the various institutions. The information collected through both these techniques provided a valuable understanding of the students' feelings towards the navigational procedures, screen layouts and other human-computer interaction related matters. Student observation was employed as a supplementary technique to augment the information obtained through the questionnaire. It also captured the initial reactions of students that may not be conveyed in a questionnaire completed at the end of a session, when the initial problems may have been forgotten due to increased confidence in operating the software.
The main objective of the research was to determine if the Byzantium project ITTs are an effective alternative to the resource-intensive human-tutor-led tutorials for introductory numeric disciplines.
The research questions addressed for statistical analysis in the study were as follows:
The questionnaires employed in the study consisted of: (i) a Pre and Post Test Questionnaire for each of the three packages; (ii) a Learning Style Questionnaire; and (iii) a Subjective Questionnaire. Since the students had a mixed background of subjects studied at secondary school level, the Pre and Post Test Questionnaires were essential for eliminating any bias from previous exposure to the subject matter and were designed to assess the improvements in student knowledge following the use of each package. The subjective questionnaire was divided into three parts. The first part collected biographical data of the users and the second part related to their experience with general computing. The information obtained in these two parts provided the basis for dividing students into various subgroups according to their background for the purpose of analysis. The third part of the questionnaire related to the subjective assessment of the tutoring system. It contained 113 closed-ended statements and one three-part open-ended question. Of the closed-ended statements, 44% were phrased in favour of the packages and 56% against, so that the questionnaire was balanced and unbiased. All statements used a five-point agree-disagree Likert scale to facilitate easy and reliable analysis.
Sample size determination
The adequacy of the sample size is based on the standardised difference, which is the ratio of the difference of interest to the standard deviation of the observations. In the comparative study of the two teaching methods for the introductory subjects, the difference in the means of the gains obtained by students was used as the basis for comparison. A real difference of 10% between the means of the gains was taken as representing an important difference between the performance of the two teaching methods. The standard deviation for the phase II study varied between 7.4 and 14.7. Taking the maximum standard deviation of 14.7, the standardised difference is 10/14.7 ≈ 0.68. According to Altman (1991), a power level of 0.95 is then achieved with a total sample size of 110 at a significance level of 0.05. This power is large enough to draw firm conclusions. The total sample sizes for all packages under study were well above 110 students.
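The arithmetic above can be checked with the standard normal-approximation formula for comparing two means, n per group ≈ 2(z₁₋α/₂ + z₁₋β)²/d². The sketch below is illustrative only: it uses the closed-form approximation rather than the nomogram Altman provides, so the total it yields (114) is slightly more conservative than the quoted 110.

```python
import math
from scipy.stats import norm

def sample_size_two_means(diff, sd, alpha=0.05, power=0.95):
    """Approximate per-group sample size needed to detect a difference
    `diff` between two means with common standard deviation `sd`."""
    d = diff / sd                        # standardised difference (0.68 here)
    z_alpha = norm.ppf(1 - alpha / 2)    # two-sided significance level
    z_beta = norm.ppf(power)             # desired power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

n = sample_size_two_means(diff=10, sd=14.7)
print(n, 2 * n)   # 57 per group, 114 in total
```

The totals actually recruited (well above 110 per package) comfortably exceed either figure.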
Initially, the gains were obtained for the different institutions where the two-group trial studies were carried out. Two-way ANOVA was applied to the data to investigate whether the gain in student knowledge was consistent across both teaching methods. The interaction between modes of instruction and centres was also analysed, and the Least Significant Difference method was employed to investigate which centres had significantly different results (see Altman, 1991). The consistency among the gains obtained by the students was also investigated by two-way ANOVA for the different packages at the various centres. To ascertain the students' views about the packages, the subjective questionnaire data were analysed. The questionnaires were grouped according to centres and packages, and since the data were categorical, the Mantel-Haenszel chi-square test was used for the analysis.
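The gain analysis described above can be sketched with statsmodels. The data frame below uses hypothetical gain scores, and the variable names `mode`, `centre` and `gain` are assumptions for illustration, not the study's own coding:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Hypothetical gain scores: two modes of instruction crossed with three centres,
# 20 students per cell (the real study had unequal, larger cells).
rng = np.random.default_rng(0)
rows = []
for mode in ["control", "cal"]:
    for centre in ["A", "B", "C"]:
        for gain in rng.normal(60, 12, size=20):
            rows.append({"mode": mode, "centre": centre, "gain": gain})
df = pd.DataFrame(rows)

# Two-way ANOVA with the mode x centre interaction, as used for the gains.
model = smf.ols("gain ~ C(mode) * C(centre)", data=df).fit()
table = anova_lm(model)
print(table)
```

A significant `C(mode):C(centre)` row would indicate the interaction between modes of instruction and centres that the study examined; pairwise follow-up comparisons (the Least Significant Difference step) can then be made using the residual mean square from the same table.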
Analysis of the evaluation data
The evaluation took place at six universities in the United Kingdom. Four of the six (universities A, B, D and E in table 1) were new universities (formerly polytechnics). The other two (universities C and F) were traditional universities. One new university (university E) used the packages in its open learning programs, whereas, at the other universities, the packages were used in general tutorial settings.
Table 1 lists the number of students who participated in the evaluation at various universities.
Table 1. Summary of sample sizes at various universities
Analysis of the gains obtained from pre and post test results
Two-way ANOVA was applied to the gains from the pre and post test results of the two-group parallel trial study on the Capital Investment Appraisal package at two universities. At one university, the study was carried out in two phases, the phase I study being a pilot study.
Table 2 shows the means and standard deviations obtained at the different universities.
Table 2. Means and standard deviations for various two group parallel trial studies
The results of the ANOVA analysis are as follows:
The above analysis can be summarised as follows:
Since the two-group trial study was completed at two universities and one university conducted the study twice, the Least Significant Difference test was used to investigate whether the difference lay between the universities or between phases I and II, yielding (comparisons significant at the 0.05 level are indicated by '***'):
The results showed that the differences lie between the gains of the phase I and phase II evaluation studies. These differences can be attributed to the relatively minor but critical modifications made to the design following the phase I study. Once it was established that there was no significant difference between the gains at different universities, the analysis was extended to all the universities in phase II where the CAL study took place. Table 3 shows the means and standard deviations obtained at the various universities.
Table 3. Phase II studies of CAL
The results of two-way ANOVA analysis are as follows:
The analysis showed that there were significant differences in the gains between different packages and between different centres, and that there was significant interaction between packages and centres. To identify significant differences in the gains, the Least Significant Difference test was applied to the packages. The analysis showed that the Marginal Costing package results were significantly different from those of the other packages, with a difference of about 10% in the gains between Marginal Costing and the other packages. The application of the Least Significant Difference test to the universities revealed that universities B and D were significantly different from universities A, C, E and F, whereas university A was significantly different from university E. The differences at universities B and D can be attributed to the higher gains obtained by the students at these universities, and the difference between universities A and E can be attributed to the lower percentage gains obtained by the students of university E (a new university where the software was used in open learning programs).
The analysis showed that the overall feelings of the students about the system were quite positive, and most students agreed that the packages do not require any prior knowledge of accounting or computing. Mantel-Haenszel chi-square tests were applied to the subjective questionnaire data for various parameters such as gender and students' attitude towards computers. Though the details of this analysis are beyond the scope of this paper, it should be noted that, at the 0.05 significance level, the performance of students did not differ significantly with previous computer training, confidence in operating computers or enjoyment in using computers. In response to the open-ended questions, a large number of students provided positive feedback on the user interface of the packages and appreciated the error messages and the ease of navigation. They found the packages easy to understand and use. Many students commented that the layout of the screens made the programs easy to follow. Some students wanted greater transparency in the saving routines of the packages and the ability to view some details of the examples saved for computer marking. A software utility program has subsequently been developed to fulfil this requirement.
The key issue in this study was to determine whether the Byzantium approach to CAL in the numeric disciplines, based on the cognitive apprenticeship model (Collins, Brown & Newman, 1989), provides an adequate means of tutoring in procedural skills that can be employed as an adjunct to traditional lectures and replace some of the human-led tutorials. The study revealed that the means of the gains obtained by the students in the traditional teaching group (mean = 60.8) and the CAL teaching group (mean = 62.7) are almost equal, and that statistically the difference in performance is not significant. Since the sample size of this study is large enough to draw firm conclusions, it can be concluded that this CAL tutoring approach is a suitable alternative to human-led tutorials. As there was no significant difference in the performance gain between students with and without previous computer training, confidence in operating computers and enjoyment in using computers, it can be concluded that the Byzantium packages are suitable for all students learning numeric disciplines.
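The comparison of the two reported means can be illustrated with a two-sample t-test computed from summary statistics. Only the two means come from the study; the per-group sizes (55 each, giving the total of 110 used in the power calculation) and the common standard deviation (14.7, the maximum reported earlier) are assumptions for illustration:

```python
from scipy.stats import ttest_ind_from_stats

# Reported means of gains; the group sizes and pooled standard deviation
# below are illustrative assumptions, not figures reported in the paper.
result = ttest_ind_from_stats(
    mean1=60.8, std1=14.7, nobs1=55,   # traditional (human-led) group
    mean2=62.7, std2=14.7, nobs2=55,   # CAL group
)
print(result.statistic, result.pvalue)
```

With these figures the t statistic is well below any conventional critical value, consistent with the 'no significant difference' finding.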
Feedback from independent evaluation in real environment
However, in an independent evaluation exercise carried out at the University of Glasgow, Stoner & Harvey (1999) found that students’ performance had improved statistically significantly over the period since the Learning Technology materials were introduced, and that this improvement appeared mainly in the students’ ability to complete numeric questions, the area addressed by the ITTs. Interestingly, their evaluation involved the Byzantium ITTs, another widely used traditional Computer Based Learning package and human teachers, and was based on a comparison of examination performances over a period of three years. Considering the problems of maintaining control-group conditions over an extended period of time, the Stoner & Harvey approach is, perhaps, better able to capture the improved long-term retention enabled through the cognitive apprenticeship model of tutoring system design, and may therefore provide a better measure for summative evaluation.
Reflections and Recommendations
The control-group-based, mainly formative evaluations described here, as with many such evaluations reporting ‘no significant gain’, may have suffered from the short-term ‘freshness’ of what is learnt and may have failed to evaluate improved retention and recall over the longer term.
Our formative evaluation activities, however, yielded some other interesting findings. The work also focused on the differences in the knowledge gained through the different packages. The means of the gains were found to be statistically significantly different for the different packages (Capital Investment Appraisal = 67.1, Absorption Costing = 66.4 and Marginal Costing = 78.6). The difference may be attributed to the ‘problem span’ and ‘problem size’, as opposed to the ‘problem complexity’, covered in these packages. In Marginal Costing, only 14 variables are involved and these are closely related to each other, although the relationships are more complex. The fewer variables allow the whole problem to be displayed on one screen, so the user visually maintains a full view of the problem.
Unlike Marginal Costing, Capital Investment Appraisal (48 variables under 4 different techniques) and Absorption Costing (114 variables) spread the processing of a given problem over a number of screens to prevent cluttered layouts and for ease of learning. However, this appears to increase the cognitive load on students, as they have to conceptually relate the variables on one screen with those on another. Though critical variables are reproduced on the current screen, a novice user still has to retain a mental map of the variables on the previous screen to maintain the semantic link, and may have to move between the screens to refresh this link until the concepts and their relationships are fully grasped and internalised.
Another possible difference favouring Marginal Costing is the smaller amount of overall information students have to process in solving a given problem, as compared to the other packages. This appears to have a psychological effect on student performance, as students perceive the relative size of the problem to be considerably larger in the other packages than in Marginal Costing. These reasons may explain the better performance of students on the Marginal Costing package compared to the other packages. Interestingly, these observations, when presented at conferences, invoked interest among multi-media researchers who have suspected a similar increase in cognitive load and drop in performance due to screen changes.
Since the comparative evaluations involving the laboratory-based CAL groups and classroom-based Control groups employed Capital Investment Appraisal and found similar knowledge gains with no significant difference between the two groups, the factors described above appear to affect traditional teaching and learning methods too. Some topics may by their nature offer greater ease of learning, while others may be found more difficult to learn by most students. The study suggests that two of the contributing factors may be (i) the problem size, in terms of the amount of data to be processed, and (ii) the problem span and the cognitive overhead involved in maintaining mental maps of the different parts of the problem at an initial stage of learning. It appears that while partitioning a larger problem provides ease of learning within the scope of an individual component, it reduces overall visibility, whether a computer screen or a paper-based interface is used.
There is a need for more research on these issues. However, it does seem that students need more time and practice for multi-part problems involving larger amounts of data and that, in the absence of adequate consideration of these factors, the traditional methods of teaching may not be allocating adequate time for learning a particular topic, disadvantaging the weaker students.
In conclusion, it is worth reiterating that evaluation has two dimensions. Summative evaluation in a real environment, involving a longer time frame, is perhaps a better way to evaluate any teaching and learning system. Shorter time frame formative evaluations may fail to fully capture all the dimensions of learning and retention (perhaps blurring the difference between concept acquisition and cognitive skill acquisition), returning verdicts of ‘no significant difference’. Formative evaluations, however, are vital, not only for identifying key design issues but also for improving our understanding of the pedagogical issues.