Introduction

As a vital twenty-first century skill, development of critical thinking (CT) is one of the major goals of higher education (Association of American Colleges and Universities 2005; Halpern 2014; van Gelder 2005). CT involves the ability to clearly and precisely raise vital questions, gather relevant information and reach well-reasoned conclusions, make accurate decisions, assess the credibility of sources, identify cause–effect relationships, and effectively communicate with others in figuring out solutions (Ennis 1989; Halpern 1998). Proficiency in CT is linked with success in college (Williams et al. 2004; Zohar and Nemet 2002), improved decision-making with regard to complex, real-life problems (Dwyer et al. 2012; Facione 1990a), and more generally with a tendency to become a more active and informed citizen (Butler et al. 2012; Halpern 2014).

Although widely recognized as desirable, CT competence is often found to be low among students in higher education (e.g., Arum and Roksa 2011; Billing 2007; Pascarella and Terenzini 2005; van Gelder 2005). Identifying effective ways to develop students’ CT has therefore been the focus of a large body of intervention-based research. Most previous efforts to address the challenge of CT development took place in a context in which general CT skills were taught separately from regular subject matter domains (for reviews, see McMillan 1987; Pascarella and Terenzini 2005). In this approach, the ability to think critically is seen as independent of the acquisition of knowledge and skills of a particular subject matter domain. However, this point of view has become less dominant in recent years. Empirical attempts to develop students’ CT have shifted mainly towards embedding CT skills within subject matter domains (for reviews, see Abrami et al. 2008, 2015; Niu et al. 2013; Tiruneh et al. 2014).

The accompanying expectation has been that embedding CT skills within a subject matter domain will enable the acquisition of CT skills that are applicable to a wide variety of thinking tasks within the domain in question and that it will facilitate transfer to other problems in everyday life (Perkins and Salomon 1989; Resnick et al. 2010; van Merrienboer 1997). Successful teaching of CT within a subject matter domain is in other words expected to result in the development of both domain-specific and domain-general CT skills. Domain-specific CT refers to the ability to think critically in a domain that requires specific subject matter expertise (e.g., McPeck 1990), while domain-general CT refers to the ability to think critically in a domain that requires knowledge of everyday life (e.g., Ennis 1989).

Although extensive empirical studies have focused on developing domain-general CT skills (see Abrami et al. 2008), research into the acquisition of domain-specific CT skills has largely been lacking (Fischer et al. 2014; Pascarella and Terenzini 2005; Tiruneh et al. 2014). Aside from a few successful attempts in the domain of psychology (e.g., Penningroth et al. 2007; Williams et al. 2004), there is a dearth of empirical evidence with respect to the question of whether embedding CT skills within a subject matter domain promotes the development of CT in specific domains of science and arts. In addition, evidence on the effectiveness of embedding CT skills in developing domain-general CT has been inconsistent. Some studies found that explicit CT instruction within subject matter domains is an effective way of developing domain-general CT skills (e.g., Bensley and Spero 2014; Dwyer et al. 2012; Solon 2007), whereas several others reported an insignificant effect (e.g., Anderson et al. 2001; McLean and Miller 2010; Toy and Ok 2012). Furthermore, it is unclear whether instructional intervention that aims to promote domain-general CT skills also improves students’ ability to solve domain-specific CT tasks, and vice versa (Fischer et al. 2014; Siegel 1988). In view of the dearth and inconsistency of the existing empirical evidence, determining the features of instructional interventions that contribute to developing domain-specific and domain-general CT remains an important challenge in CT research.

Recent developments in cognitive psychology have influenced instructional design in various ways (Elen 1995; Jonassen 1991; Merrill 2002; van Merrienboer 1997). One of the influences has been on the conception of learning and instruction. Cognitive psychologists view learning as an active, cumulative, constructive, goal-oriented, self-regulated, and situated process of knowledge and meaning building (e.g., Elen 1995; Shuell 1986; van Merrienboer 1997). Instruction is viewed as a set of activities that aim to support and enable learning, and that means helping and guiding students to actively process information, monitoring their performance, and providing feedback with respect to the appropriateness of students’ learning activities (Elen 1995; Merrill 2013). These moderate constructivist views on learning and instruction (Elen 1995) emphasize that learning and understanding go hand in hand (e.g., Shuell 1986). Echoing this view, Perkins and Unger (1999) argued that understanding a subject matter domain is a matter of being able to think critically and act competently with one’s knowledge of the subject matter. This implies that meaningful subject matter learning in any domain inherently involves the development of relevant CT skills. It follows that the development of CT is essentially an implicit goal in all subject matter learning.

Despite the theoretical claim that subject matter instruction in any domain can stimulate the development of CT (Perkins and Salomon 1989; Resnick et al. 2010; Resnick 1987; Smith 2002; van Merrienboer 1997), the potential impact of the design of subject matter instruction has been overlooked in existing CT research. The development of CT is largely explored through loosely defined instructional interventions that consist of teaching general CT skills within less optimally designed subject matter instruction (Tiruneh et al. 2014). Research attempts to embed CT skills within subject matter instruction have not systematically built on instructional design research, and the link between the acquisition of domain-specific and domain-general CT skills appears to be vague. In sum, although it is unclear to what extent systematically designed subject matter instruction in itself promotes the development of domain-specific and domain-general CT skills, strong impact on the development of domain-specific CT skills is to be expected since they are an integral part of the domain-specific expertise that instruction aspires toward.

Drawing on past research on cognitive development (Glaser 1984; Perkins and Salomon 1989), we explored the question of whether systematically designed subject matter instruction may facilitate the acquisition of domain-specific and, to some extent, domain-general CT skills. The aim of this paper is therefore to examine the effectiveness of systematically designed subject matter instruction in promoting the development of domain-specific and domain-general CT skills, and to investigate the relationship between the two.

Teaching CT in higher education: state-of-the-art

What is CT?

Existing literature suggests widespread disagreement among educators and researchers with regard to the definition of CT and what is to be accomplished in teaching it. Ennis (1993) defines CT as logical and reflective thought that focuses on a decision on what to believe or do. Halpern (1998, 2014) defines CT as the use of thinking strategies that increase the probability of a desirable outcome. Together with her definition, Halpern identified five major categories of CT skills: verbal reasoning, argument analysis, hypothesis testing, likelihood and uncertainty analysis, and decision-making and problem-solving. Halpern argues that the use of CT skills in solving various cognitive tasks can increase the probability of ‘a desirable outcome’ (Halpern 1998, p. 450). McPeck (1990) defines CT as the appropriate use of reflective skepticism within the problem area under consideration, and he closely relates the problem areas to particular subject matter domains.

Some researchers (Facione 1990a; Halpern 1998; Norris 1989; Perkins et al. 1993) have moreover argued that in addition to mastery of a set of cognitive skills, a more meaningful and comprehensive understanding of CT must include CT dispositions. The latter refers to a person’s inclination to use CT skills appropriately without prompting, and with conscious intent in a variety of settings, for instance, when faced with problems to solve, ideas to evaluate, or decisions to make (Ennis 1993; Halpern 1998). Researchers have arrived at a list of CT dispositions that in the main includes open-mindedness, inquisitiveness, systematicity, analyticity, truth-seeking, self-confidence, and maturity (Facione 1990a). Halpern (1998) also notes that a critical thinker demonstrates the following dispositions:

(a) willingness to engage and persist in a complex task, (b) habitual use of plans and the suppression of impulsive activity, (c) flexibility or open-mindedness, (d) willingness to abandon non-productive strategies in an attempt to self-correct, and (e) awareness of the social realities that need to be overcome (such as the need to seek consensus or compromise) so that thoughts can become actions. (p. 452).

We used Halpern’s (2014) classification of CT skills for the purposes of this study. After synthesizing the various conceptions of CT (e.g., Bailin et al. 1999; Ennis 1989; Halpern 2014; McPeck 1990; Resnick et al. 2010; Smith 2002), we defined CT as the proficiency a person demonstrates in using thinking strategies to accomplish a task in a reasonable manner. The thinking task in question may require specific subject matter expertise for it to be reasonably performed, and we call such proficiency domain-specific CT. On the other hand, the thinking task in question may not require specific subject matter expertise, but rather knowledge of everyday life. We refer to such proficiency as domain-general CT.

Specificity and generality of CT and its implications for instruction

The question of whether CT is a set of general skills that can be applied across domains or whether it is by and large specific to a particular domain has been the subject of heated debate (e.g., Bailin et al. 1999; Davies 2013; Ennis 1989; McPeck 1990; Moore 2011; Norris 1989; Paul 1985). This disagreement has had major implications for approaches to integrating CT in higher education curricula. Generalists (Davies 2013; De Bono 1991; Ennis 1989; Halpern 1998; Kuhn 1999) claim that a set of CT skills exists that is general and applicable across a wide variety of domains. They contend that this set of general CT skills can be taught either as a specific curriculum subject (i.e., a stand-alone course), or be integrated explicitly into regular courses. On the other hand, specifists (McPeck 1990; Moore 2004, 2011) argue that thinking is highly dependent on specific domain knowledge and that CT teaching should therefore always be pursued within the context of a specific domain. McPeck (1990) has strongly argued against the notion of general CT skills on the basis that the thinking skills required in one domain are different from those required in another. This specifist position implies that each domain will need to identify its own distinctive thinking skills, and students will learn those domain-specific CT skills while building up knowledge of that particular domain.

However, it seems that the generality versus specificity debate has recently shifted towards a synthesis of the two views (Davies 2013; Robinson 2011; Smith 2002). First, although the related content and issues differ from one domain to the next, a set of CT skills that are applicable across a wide variety of domains exists. Second, the ability to think critically on a particular task is understood to be highly dependent on knowledge of the task at hand as well as knowledge of relevant CT skills. This implies that effective CT instructional approaches need to target students’ in-depth understanding of a domain and that of the relevant CT skills.

CT assessment

In tandem with the absence of a consistent CT definition, one of the main challenges in CT research has been the lack of uniform CT tests. Researchers have employed various kinds of CT tests that use a broad range of formats, scope, and psychometric characteristics to measure CT outcomes (for reviews, see Ennis 1993; McMillan 1987; Tiruneh et al. 2014). Some of the available standardized domain-general CT tests include the Cornell Critical Thinking Test (CCTT: Ennis et al. 1985), the California Critical Thinking Skills Test (CCTST: Facione 1990b), the Watson–Glaser Critical Thinking Appraisal (WGCTA: Watson and Glaser 2002), the Ennis–Weir CT Essay Test (Ennis and Weir 1985) and the Halpern Critical Thinking Assessment (HCTA: Halpern 2010). These domain-general CT tests use content from a variety of real-life situations with which test takers are assumed to already be familiar.

Except for the Ennis–Weir CT Essay test and HCTA, all the above-mentioned tests use forced-choice format items, which have been criticized for not efficiently measuring significant CT features such as drawing warranted conclusions, analyzing arguments, making decisions and systematically solving problems (Norris 1989). The HCTA is the only standardized measure of domain-general CT proficiency that uses two different types of item formats: forced-choice and constructed-response formats. Halpern claims that the constructed-response format of the HCTA measures CT dispositions (Halpern 2013).

A few domain-specific CT tests also exist in the science domain. Lawson’s Classroom Test of Scientific Reasoning (CTSR) is the most commonly administered test in the domain of science focused on measuring general scientific reasoning skills (Lawson 1978, 2004). It is a multiple-choice test that measures scientific reasoning skills that include probabilistic reasoning, combinatorial reasoning, proportional reasoning and controlling of variables in the context of scientific domains (Lawson 1978). Respondents do not necessarily need to have expertise in a specific science domain; rather, the test focuses on general science-related issues that students can reasonably be presumed to have acquired in specific science subjects. The test mainly targets junior and senior high school students, but it is also used to assess scientific reasoning skills among college science freshmen (Lawson 1978, 2004). Another domain-specific CT test is the biology critical thinking exam (McMurray 1991). It is a multiple-choice test with 52 questions that aims to measure university students’ CT skills in biology. The Critical Thinking in Electricity and Magnetism test (CTEM) is a domain-specific CT test that was recently developed and that aims to measure students’ ability to draw valid inferences, analyze arguments, solve problems, make predictions, and analyze probabilities and assumptions with respect to thinking tasks that are specific to a freshman physics course (De Cock et al. 2015). The CTEM test consists of 20 items, two of which are forced-choice; the remainder are constructed-response format items. The items were designed to mirror the five CT structural components identified in the HCTA (Halpern 2010), and target the content of an introductory electricity and magnetism course. The CTEM test was validated to prompt students’ ability to demonstrate the aforementioned domain-specific CT skills.

Despite the existence of a few domain-specific CT tests, the assessment of CT has thus far mainly focused on domain-general CT skills. CT has mainly been linked with everyday problem solving, and there is a general lack of experience among researchers and educators when it comes to testing for domain-specific CT skills. As discussed in the previous section, the embedded approach aims to teach desired CT skills as part of subject matter instruction. This approach is expected to result in the acquisition of both domain-specific and domain-general CT skills. Standardized tests that measure students’ ability to think critically on issues and problems that are specific to a subject matter domain, however, were hardly ever administered in the various studies that adopted an embedded approach (for review, see Tiruneh et al. 2014).

Embedding CT within regular courses: instructional approaches

Ennis (1989) divided the various approaches to embedding CT within subject matter domains into two types: Infusion and Immersion. In the Infusion approach, students are explicitly trained on how to apply CT skills as part of a specific subject matter domain instruction. Students are explicitly introduced to the desired CT skills and extensively engaged in domain-specific classroom activities that call for the application of the desired CT skills. The Immersion approach, however, aims to help students acquire the desired CT skills as they construct knowledge and skills of a subject matter domain, without explicit instruction about desired CT skills. The main assumption behind this approach is that proficiency in CT is by definition targeted in meaningful subject matter learning; it follows that students can learn relevant and transferrable CT skills when immersed in well-designed subject matter instruction (e.g., McPeck 1990). Given the limited empirical evidence on the effectiveness of well-designed subject matter instruction on the development of domain-specific and domain-general CT skills, the effect of an Immersion-based instructional intervention is the focus of the present study.

The present study

The central question in CT instruction appears to be identifying theoretically sound and empirically valid instructional design principles that foster the development of the desired CT skills (Perkins and Salomon 1989; van Merriënboer 2013). There are a few instructional design models that offer specific guidelines to develop learning environments that enable students to acquire complex cognitive skills. The First Principles of Instruction model is one of the instructional design models that offer explicit guidelines to designing learning environments that can promote the active and constructive acquisition of higher-order learning outcomes (Merrill 2002, 2013). The model is a synthesis of the various instructional design models that emerged from research on the acquisition of subject matter knowledge and skills. Merrill systematically reviewed the different instructional design principles that claim to be empirically valid and abstracted five interrelated prescriptive instructional design principles: activation, demonstration, application, integration and problem-centeredness. This model emphasizes that subject matter instruction designed on the basis of those principles can result in effective, efficient and engaging learning that leads to students’ acquisition of knowledge and skills that are necessary to complete complex real-world tasks (Merrill 2013).

Because of its comprehensiveness and strong theoretical foundation, the First Principles of Instruction model was chosen to guide the design of the learning environment for this study. No previous study, to our knowledge, has tested the efficacy of this model in designing instructional interventions that target the development of CT skills. A brief explanation of the First Principles of Instruction model and its implications for designing subject matter instruction is offered in the next section. A learning environment in the context of a freshman physics course was designed based on the model. The following research questions are addressed: (a) What is the effect of systematically designed subject matter instruction on the development of domain-specific CT skills? (b) What is the effect of systematically designed subject matter instruction on the development of domain-general CT skills? and (c) What is the relationship between performance on domain-specific and domain-general CT tests? In line with existing theoretical literature (e.g., Perkins and Unger 1999; Resnick et al. 2010), we hypothesized that subject matter instruction systematically designed according to the First Principles of Instruction model would produce a significantly higher acquisition of domain-specific and domain-general CT skills than regular subject matter instruction.

Method

Participants

The study participants were first-year students with physics majors at two universities in northwest Ethiopia. Students at one of the universities formed the experimental group (n = 45), while those at the other university constituted the control group (n = 44). The experimental group was comprised of 24 women and 21 men between the ages of 19 and 23 years (M = 20.09, SD = .93), while the control group consisted of 23 women and 21 men between the ages of 19 and 24 years (M = 20.32, SD = .98).

Design and development of the Immersion-based instructional intervention

The intervention focused on a freshman introductory physics course, namely introductory electricity and magnetism (E&M). At both universities, this course was taught based on a harmonized national curriculum, with the same content and credit hours. The targeted course was taught during the second semester of the 2013/2014 academic year. The intervention focused only on the first five chapters of the course: electric field, electric flux, electric potential energy, capacitor and capacitance, and direct current circuits (as specified in the course textbooks of the two universities).

In recognition of the complex and multidimensional nature of CT, an effort was first made to acquire a clearer understanding of the desired CT outcomes learners ought to demonstrate after the intervention. The CT skills that were the focus of our intervention were reasoning, argument analysis, hypothesis testing, likelihood and uncertainty analysis, and problem solving and decision-making. The targeted CT skills were split into sub-skills before the instructional intervention was designed. A more precise description of each of the domain-specific and domain-general CT outcomes was subsequently developed with respect to the post-intervention performance (see Table 1). Such an in-depth analysis of the CT outcomes that we wished our students to demonstrate helped us decide on the specific and relevant instructional strategies that should be targeted while the learning environment is designed and implemented.

Table 1 Description of desired domain-specific and domain-general CT outcomes

After the desired CT outcomes were identified, the next important phase was designing a learning environment based on the First Principles of Instruction model. Table 2 offers a brief description of the principles, the implications for instructional design, and brief examples of what happened in the actual design and implementation phase of the learning environments. Two regular course instructors from the experimental university, two physics professors, one instructional psychology professor and one doctoral candidate collaborated in designing the experimental learning environment. Efforts were made to embrace the desired CT skills as part of the regular domain-specific classroom activities during this design process.

Table 2 Comparison between the experimental and control learning environments in relation to the First Principles of Instruction model

Implementation of the experimental and control interventions

Students in both the experimental and control conditions learned the same five chapters. The lessons were taught by regular instructors at the two universities. Two instructors (one as a main instructor and the other as an assistant for the tutorial sessions only) participated in the study at each university. In order to control for the teacher effect, we involved instructors from the two universities who had the same education level (all MSc in Physics) and similar years of teaching experience.

Training the experimental instructors

The two regular instructors received adequate training to be able to teach the experimental class. Their collaboration began during the design phase of the intervention, and they were both fully informed on the purpose of the intervention and what was required of them in implementing the designed lessons. For example, we initially asked them to comment on a draft version of the lessons designed for chapter one and both instructors provided useful feedback. Their involvement and feedback continued throughout the design process of the five chapters. On a number of occasions, they reported that some of the activities and questions in the draft versions were unclear or less relevant for the targeted students. A number of modifications were accordingly made.

Moreover, to facilitate implementation of the lessons as designed and provide the necessary theoretical knowledge base, the first author and the two experimental instructors participated in 5 h of face-to-face discussions over a period of 3 days. The instructors were briefed on the overall goal of the instructional intervention as well as the specific designed lesson activities of the full five chapters.

Experimental condition

The developed lessons were taught during the regular lecture hours. Students were divided into 10 groups of 4 or 5 students. Efforts were made to have groups that were evenly spread in terms of gender and academic performance (with the latter based on students’ GPA in the first semester). Students received guidance in performing both the individual and group activities that had been designed. At the beginning of each chapter, students were assigned contextually relevant E&M problems that required them to collaborate to find solutions. Throughout the intervention, students were made to observe well-scripted instructor demonstrations that modeled the important procedures and reasoning involved in solving various E&M problems. The demonstrations were followed by extensive opportunities for the students to practice solving E&M problems both individually and in small groups for a substantial amount of time. A number of activities that encouraged students to activate prior knowledge and communicate their ideas to both their group and the entire class were carefully designed and implemented. Both peer and instructor feedback was provided as needed. Overall, students were carefully assisted in developing an in-depth understanding of the subject matter domain, and they were coached and supported in the acquisition of the CT outcomes through the various domain-specific instructional activities. The first author monitored overall implementation of the intervention, which lasted 8 weeks. Three lessons of 2 h each were taught every week. See Table 2 for a brief overview of the activities designed and implemented in the experimental class.

Control condition

Students in the control condition followed the regular subject matter instruction. Two instructors (one main instructor and one assistant for only the tutorial sessions) from the control university were responsible for designing and implementing the lessons. The lesson durations for this group were the same as for the experimental group: a total of 8 weeks with 3 lessons of 2 h each per week. This group was similar to the experimental group in terms of previous courses and enrollment in parallel courses during the intervention. However, the E&M lessons for this group were not designed according to the First Principles of Instruction model, and we will refer to the instructional method in the control class as “regular” E&M instruction. See Table 2 for a detailed comparison of the control and experimental learning environments. To obtain an overview of the instructional processes, the first author observed one of the control group’s lessons. In addition, interviews were conducted with the E&M instructor on three separate occasions (at the beginning of the semester, a month after the semester, and at the posttest) to acquire additional information on the various classroom activities. A brief description of the instructional activities that took place in the control group is offered below.

At the beginning of each chapter, the main instructor gave a brief overview of the general learning outcomes. He immediately proceeded by discussing the first subtopic of a chapter and asked oral questions between presentations that encouraged students to engage in discussions. However, students were not pushed to give more detailed explanations of their responses. In most cases, the instructor himself offered the explanations. He usually showed the solutions to one or two problems after a brief discussion of a particular topic. In most cases, students took notes and wrote down the solutions. Towards the end of the lesson, students were usually handed homework that was to be solved by the next lesson. The students, however, did not receive comprehensive and contextually relevant E&M tasks at the beginning of each chapter. The E&M problems solved by the teacher during class and those given as homework assignments were traditional end-of-chapter problems that focused on computation and gave students limited opportunities to engage in thoughtful discussions (see Fig. 1 for a comparison of E&M problems for the control and experimental conditions).

Fig. 1 Sample E&M problems for the control and experimental condition

Fig. 2 Sample whole-task for chapter three

Instruments

The effects of an instructional intervention on the development of CT skills should be measured by using valid and reliable CT measures that are sensitive enough to capture the changes of targeted CT outcomes (Ennis 1993; Halpern 1993; McMillan 1987). The CTEM test was administered in order to measure students’ acquisition of the desired domain-specific CT outcomes. The HCTA (Halpern 2010) was administered to measure the acquisition of domain-general CT outcomes. A pilot study was conducted to examine the applicability of the HCTA for use with the present participants. The test consists of 25 scenarios (5 scenarios for each of the domain-general CT skills targeted in the study), covering a variety of everyday health, education, politics and social policy issues. Each scenario is followed by questions that require respondents to provide a constructed response and to subsequently select the best option from a short list of alternatives (forced-choice items). Based on the findings of the pilot study, 5 scenarios (1 from each CT category) that were somewhat confusing and reduced the test’s overall internal consistency in this particular context were omitted. As a result, 20 constructed-response and 20 forced-choice items were ultimately administered.

Both the CTEM and HCTA focus on similar CT components, with the exception that the CTEM items focus on E&M tasks, while the HCTA items focus on thinking tasks drawn from everyday life that do not require specific subject matter expertise (see Fig. 3 for sample CTEM and HCTA items). We computed the internal consistencies (Cronbach’s alpha) of the administered tests in the present study: .74 for the CTEM, .76 for the HCTA constructed-response, .73 for the HCTA forced-choice and .77 for the HCTA overall test. Although a desirable value for internal consistency may vary as a function of the nature of the construct being measured, Cronbach’s alpha values between .70 and .80 are considered acceptable (Cohen et al. 2007). Prior physics knowledge of the participants (physics scores from the Ethiopian Higher Education Entrance Examination) was collected from the student records offices of the two universities.
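For readers who wish to reproduce this kind of reliability estimate, the computation of Cronbach's alpha can be sketched as follows. This is a minimal illustration rather than the procedure or software used in the study: the function name `cronbach_alpha` and the layout of the score matrix (one row per respondent, one column per item) are assumptions made for the example, and the sample (n - 1) variance is used throughout.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    `scores` is a list of rows, one per respondent; each row holds that
    respondent's score on each of the k items. (Hypothetical layout,
    chosen for this illustration only.)
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")

    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*scores)]

    # Sample variance of the respondents' total scores.
    total_variance = variance([sum(row) for row in scores])

    # alpha = k / (k - 1) * (1 - sum of item variances / total variance)
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)
```

Applied to the 20 CTEM item scores of all participants, a function of this kind would return a single coefficient comparable to the values reported above; applying it separately to the constructed-response and forced-choice HCTA items would give the per-format coefficients.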

Fig. 3 Sample CTEM and HCTA items

Procedure

The CTEM was administered as a posttest-only test a week after the end of the intervention. Because the CTEM items require prior knowledge of E&M, we felt it was reasonable to administer the test only at the end of the intervention. The HCTA test, on the other hand, was administered both to the experimental and control groups as a pretest, immediately before the beginning of the intervention and as a posttest a week after the end of the intervention. For practical reasons, the paper version of the HCTA test was administered, since computer-based administration of the HCTA was not possible. Participants were required to first answer all the constructed-response format items and then the forced-choice format items. Administration of the CTEM test lasted between 60 and 75 min, and the HCTA (both formats) between 70 and 90 min.

Approximately 90 % of the experimental lessons were observed, and the experimental instructor was consulted after each lesson to reflect on challenges that surfaced as well as any other aspects that might improve implementation of the lessons as designed. Post-lesson discussions focused on such issues as usage of instructional time, giving of support and feedback to groups within the allocated instructional time, oral questions used to prompt students to further elaborate on their answers, and overall evaluation of the implementation of the lesson in relation to the design. Instructors registered class attendance for each session both in the experimental and control conditions. Eighty-five percent of the experimental group students and approximately 80 % of the control group students attended more than 90 % of the sessions. There were two dropouts in the experimental group and one dropout in the control group. The pretest data of those three students were omitted from the results. This means that our analysis of the data from the two groups is based on 45 students for the experimental group and 44 students for the control group.

Results

Screening of the data

The CTEM and HCTA scores were screened for accuracy of data entry, missing values and the assumptions of normality and homogeneity of variances. A separate overview of the experimental and control students’ scores for each CTEM and HCTA item showed randomly missing data for a few items. However, the proportion of missing values per item was very limited (<5 %) and scattered across the 20 CTEM and HCTA items. Mean substitution was therefore used to estimate the missing data. The mean scores for each separate item for the experimental and control groups were calculated and the handful of missing values were substituted with the respective group mean scores. Outliers were also sought separately in the experimental and control groups. Visual inspection of boxplots and inspection of the z scores for each of the CTEM and HCTA variables showed that there were no potential outliers.
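
The group-wise mean substitution described above can be sketched as follows. This is an illustrative sketch only: the function name `impute_group_item_means`, the group labels and the example scores are assumptions made for the illustration, and missing values are assumed to be coded as `None`.

```python
from statistics import mean


def impute_group_item_means(scores, groups):
    """Replace missing item scores (None) with the item mean of the same group.

    `scores` is a list of rows (one per student, one column per item);
    `groups` holds the group label of each row, in the same order.
    """
    n_items = len(scores[0])
    # Item means per group, computed from the observed (non-missing) values only.
    item_means = {}
    for label in set(groups):
        for j in range(n_items):
            observed = [
                row[j]
                for row, row_label in zip(scores, groups)
                if row_label == label and row[j] is not None
            ]
            item_means[(label, j)] = mean(observed)
    # Keep observed values; fill each gap with the matching group/item mean.
    return [
        [item_means[(label, j)] if value is None else value for j, value in enumerate(row)]
        for row, label in zip(scores, groups)
    ]


# Hypothetical scores for four students on two items (NOT data from the study).
raw_scores = [[1, None], [3, 4], [None, 2], [5, 6]]
labels = ["experimental", "experimental", "control", "control"]
print(impute_group_item_means(raw_scores, labels))
```

Substituting within each group, rather than with the overall item mean, avoids pulling the two groups’ item means toward each other.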

Moreover, tests of assumptions for normality and homogeneity of variances were conducted through examination of the standardized residuals for the CTEM and HCTA scores. For the CTEM, a Shapiro–Wilk test (p > .05) and a visual inspection of the histograms, the Q–Q plot and boxplot suggested that the scores from the two groups were approximately normally distributed. Using the standardized residuals, the assumption of homogeneity of variances was tested and satisfied based on Levene’s F test, F(1, 87) = 1.57, p = .11. For the HCTA scores, a Shapiro–Wilk test (p > .05) and a visual inspection of the histograms and boxplot showed that the HCTA pretest and posttest scores were also approximately normally distributed for both the experimental and control groups. Furthermore, the assumption of homogeneity of variances was tested and satisfied based on Levene’s F test for the pretest (F(1, 87) = .16, p = .69) and posttest scores (F(1, 87) = 1.36, p = .25).

Domain-specific CT performance: CTEM

Initial comparison of prior physics knowledge revealed no significant differences between the experimental and control group, t(87) = .15, p = .88. An independent sample t test was therefore conducted to compare the performance of the two groups on the domain-specific CT test. The results indicated that the CTEM mean score for the experimental group was significantly higher than that of the control group, t(87) = 7.15, p < .001, d = 1.55. The effect size for this analysis was found to exceed Cohen’s (1988) convention for a large effect (d = .80).

An analysis of covariance (ANCOVA) was conducted to examine whether the statistically significant mean score difference was maintained after controlling for prior physics knowledge. The ANCOVA results showed that the CTEM mean score of the experimental group was significantly higher than that of the control group, F(1, 86) = 52.56, p < .001, η² = .379. The results indicated that the intervention accounted for 37.9 % of the variance in the acquisition of domain-specific CT skills. Post-hoc power analysis using G*Power (Faul et al. 2007) indicated that the power to detect the effect size observed in the present study (d = 1.55, p < .001) was >.99. The a priori power analysis indicated that a total sample size of 84 would be sufficient to detect a large effect (d = .8; Cohen 1988) with a power of .95 (α = .05), and a total sample size of 210 would be sufficient to detect a medium effect (d = .5; Cohen 1988) with a power of .95 (α = .05). See Table 3 for descriptive statistics of the CTEM test.
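
The reported effect size and sample size figures can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that the reported eta squared is a partial eta squared recoverable from the F ratio and its degrees of freedom, and that a two-sided normal approximation is an acceptable stand-in for the noncentral t computation that G*Power performs. The function names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return f_value * df_effect / (f_value * df_effect + df_error)


def approx_total_n(d, alpha=0.05, power=0.95):
    """Approximate total N for a two-group comparison (two-sided test).

    Uses the normal approximation n per group = 2 * (z_alpha/2 + z_power)^2 / d^2,
    which is an ASSUMED simplification of the exact noncentral t calculation.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    n_per_group = ceil(2 * (z_alpha + z_power) ** 2 / d ** 2)
    return 2 * n_per_group


print(round(partial_eta_squared(52.56, 1, 86), 3))  # ANCOVA group effect
print(approx_total_n(0.8), approx_total_n(0.5))     # large and medium effects
```

Under these assumptions the F ratio reproduces the reported value of .379, and the approximation gives totals of 82 and 208, slightly below the reported 84 and 210, as expected when the normal distribution replaces the t distribution.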

Table 3 Descriptive statistics for experimental and control groups: prior knowledge, CTEM and HCTA scores

Domain-general CT performance: HCTA

In order to examine the effect of the instructional intervention on students’ domain-general CT performance, a 2 (group: experimental and control) × 2 (testing time: pretest and posttest) mixed design ANOVA was conducted. The results revealed that the two groups together demonstrated a statistically significant improvement on the HCTA mean scores across the two time points, F(1, 87) = 4.61, p = .035, η² = .05. The effect size value (η² = .05) suggested a small practical significance. However, there was no significant interaction between the intervention type (experimental–control) and the testing time (pretest–posttest), F(1, 87) = .14, p = .71. In other words, the experimental learning environment did not result in a significantly greater pretest–posttest improvement in the acquisition of domain-general CT skills than the control learning environment. The descriptive statistics of the HCTA scores are shown in Table 3.
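
In a 2 × 2 mixed design, the group × time interaction asks whether the two groups differ in their pretest-to-posttest gains, and its F ratio equals the squared t statistic from an independent-samples t test on the gain scores. The sketch below illustrates that equivalence; the function name and the scores are hypothetical and are not data from the study.

```python
from math import sqrt
from statistics import mean, variance


def interaction_f_from_gains(pre_a, post_a, pre_b, post_b):
    """Group x time interaction F for a 2 x 2 mixed design, via gain scores.

    Returns (F, error degrees of freedom). F is the square of the pooled-variance
    independent-samples t statistic computed on posttest-minus-pretest gains.
    """
    gains_a = [post - pre for pre, post in zip(pre_a, post_a)]
    gains_b = [post - pre for pre, post in zip(pre_b, post_b)]
    n_a, n_b = len(gains_a), len(gains_b)
    df_error = n_a + n_b - 2
    pooled_variance = (
        (n_a - 1) * variance(gains_a) + (n_b - 1) * variance(gains_b)
    ) / df_error
    t_value = (mean(gains_a) - mean(gains_b)) / sqrt(
        pooled_variance * (1 / n_a + 1 / n_b)
    )
    return t_value ** 2, df_error


# Hypothetical pretest/posttest scores (NOT data from the study).
f_value, df_error = interaction_f_from_gains(
    pre_a=[10, 12, 14], post_a=[12, 15, 16],
    pre_b=[11, 13, 15], post_b=[12, 14, 17],
)
print(round(f_value, 2), df_error)
```

With 45 and 44 students in the two groups, the same calculation yields 87 error degrees of freedom, matching the F(1, 87) reported for the interaction.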

Relationship between domain-specific and domain-general CT performances

Calculation of Pearson’s correlation coefficient showed a significant positive relationship between pretest HCTA and posttest HCTA scores (r = .29, p = .006). Moreover, the CTEM scores correlated significantly with the posttest HCTA scores (r = .38, p < .01). These findings show that when both groups are taken together, those students who scored higher on the pretest HCTA also tended to score higher on the posttest HCTA. Post-intervention comparison similarly indicated that those who scored higher on the CTEM test also tended to score higher on the posttest HCTA. A linear regression analysis also revealed that the CTEM test explained a significant proportion of the variance in posttest HCTA performance, F(1, 87) = 14.7, p < .05, R² = .145. The result shows that CTEM performance was a significant predictor, accounting for 14.5 % of the variance in posttest HCTA scores. Post-hoc power analysis using G*Power (Faul et al. 2007) indicated that the power to detect the observed effect at the .05 level was .94 for the regression predicting posttest HCTA performance.
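
With a single predictor, the regression results follow directly from the correlation: the proportion of explained variance is the squared correlation, and the F ratio depends only on that proportion and the sample size. The sketch below shows this link, assuming the combined sample of 89 students (45 experimental and 44 control); the function name is hypothetical.

```python
def simple_regression_from_r(r, n):
    """R squared and F(1, n - 2) for a one-predictor regression, from Pearson's r."""
    r_squared = r ** 2
    f_value = r_squared * (n - 2) / (1 - r_squared)
    return r_squared, f_value


# r between CTEM and posttest HCTA scores, with n = 45 + 44 students.
r_squared, f_value = simple_regression_from_r(r=0.38, n=89)
print(round(r_squared, 4), round(f_value, 2))
```

Starting from the rounded correlation of .38, this gives an explained variance of about .144 and an F ratio of about 14.7, in line with the regression results reported above up to rounding.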

Discussion

In this study, we argued that the design of CT instructional interventions should be supported by the principles of instructional design research. To that end, we tested an alternative method to address the challenge of CT development through the systematic design of subject matter instruction rather than explicit instruction on general CT skills. A regular physics course was systematically designed in accordance with the First Principles of Instruction model. We hypothesized that E&M instruction systematically designed in line with the First Principles of Instruction model would produce higher acquisition of domain-specific and domain-general CT skills than regular E&M instruction.

Implementation of the lessons for the experimental condition was carefully monitored, and sufficient information was gathered with respect to the implementation of the lessons in the control condition. With regard to the first research question, we found that systematically designed E&M instruction that implicitly targeted CT skills in various domain-specific classroom activities resulted in higher acquisition of domain-specific CT skills compared to regular E&M instruction. We focused on the systematic design of subject matter instruction (supported by valid principles of instructional design research) because previous CT intervention studies did not systematically explore how subject matter instruction in itself may stimulate learning of domain-specific CT skills. The instructional interventions designed and implemented in several previous Immersion-oriented empirical CT studies (e.g., Barnett and Francis 2012; Garside 1996; Renaud and Murray 2008; Stark 2012; Wheeler and Collins 2003) appear to show significant limitations. The interventions focused mainly on a specific component of the learning environment (e.g., small group discussion only), and only minimally emphasized other important learning environment components such as the types of learning tasks/problems designed for discussion (e.g., are the learning tasks challenging enough to provoke discussion among students? Are the tasks authentic/contextually relevant?). They also paid scant attention to the adequacy of support, feedback and coaching offered during full-class and small group discussions. Moreover, in most previous CT studies, the desired CT outcomes learners were expected to demonstrate after instruction were barely described or articulated during the design phase. As a result, it is next to impossible to evaluate the extent to which the various designed tasks and instructional activities were relevant in stimulating the acquisition of the desired CT outcomes.

For the present study, efforts were made to design a learning environment that addressed the limitations of previous studies. First, the desired domain-specific and domain-general CT outcomes were operationalized and described. A learning environment that could stimulate the acquisition of the desired CT outcomes was subsequently systematically designed. In accordance with the theoretical claim that meaningful subject matter learning inherently involves development of relevant CT skills (e.g., Glaser 1984; Resnick 1987), the E&M instruction was systematically designed in such a way that it provided students with the opportunity to engage in a number of domain-specific classroom activities. It is important to point out that previous studies already used one or two of the instructional strategies implemented in the present study. For example, the discussion method of teaching (e.g., Wheeler and Collins 2003) and teacher modeling (e.g., Anderson et al. 2001) are among the most commonly employed instructional strategies in previous Immersion-oriented CT studies. However, for this study, we designed a comprehensive intervention that integrates most of the empirically validated instructional design principles. The findings with regard to domain-specific CT skills suggest that systematic design of subject matter instruction based on a combination of empirically valid instructional principles promotes the acquisition of domain-specific CT skills. CT development, this study argues, involves both domain-specific and domain-general dimensions. It demonstrates that acquisition of domain-specific CT skills can be improved through systematic design of subject matter instruction without explicit teaching of general CT skills. This finding is consistent with the result of a recent meta-analysis of strategies for teaching CT (Abrami et al. 2015) as well as previous theoretical claims (e.g., Glaser 1984; McPeck 1990; Resnick et al. 2010; Resnick 1987) that underlined the importance of learning environments systematically designed in accordance with relevant instructional principles.

For the second research question, however, the findings showed that the experimental learning environment did not result in a statistically significant improvement for domain-general CT skills compared to the control learning environment. Gains in domain-specific CT proficiency found in the experimental condition were not accompanied by gains in domain-general CT proficiency. The two groups together demonstrated improvement in the acquisition of domain-general CT skills between the pretest and the posttest. However, because the same test was administered both prior to and after the intervention, the observed pretest–posttest improvement might simply be a test–retest effect.

On the other hand, we found that domain-specific CT proficiency significantly predicted posttest domain-general CT proficiency. This suggests that when a domain-general CT test that presumably required similar thinking skills was administered to the participants, performance on a domain-specific CT test was a significant predictor of performance on a domain-general CT test. To a degree, this reveals a tendency to transfer the acquired domain-specific CT skills in solving domain-general CT tasks. This finding is consistent with previous psychology studies in which higher performance on a psychological CT test also predicted higher performance on a domain-general CT test (e.g., Williams et al. 2004).

A number of reasons may explain why the designed learning environment did not have a significant effect on the acquisition of domain-general CT skills. The absence of an explicit focus on the desired CT skills during the E&M instruction may have kept students from abstracting the domain-specific CT skills and applying them in solving domain-general tasks. This suggests that a great emphasis on systematic development of domain-specific knowledge alone may not be sufficient to facilitate transfer of domain-specific CT skills to everyday problems. Perhaps a worthwhile approach to CT instruction may be to explicitly emphasize desired CT skills within specific subject matter instruction. Proponents of the embedded approach often claim that explicitly teaching CT skills within subject matter instruction is the best way to stimulate development of transferrable CT skills (Davies 2013; Halpern and Hakel 2002; Halpern 1998). For example, some generalists have argued that students must be aware that they are being taught CT skills during specific subject matter instruction and they will be expected to use those skills to solve everyday problems or issues they will come across. However, the main criticism that has been directed at generalists is that they largely see CT as everyday problem-solving that is detached from domain-specific CT proficiency (see Bailin et al. 1999; Resnick et al. 2010; Smith 2002). To date, there is no agreement on how specific subject matter instruction can be optimally designed to develop both domain-specific and domain-general CT skills. An important area for future studies would therefore be to evaluate the effectiveness of explicit teaching of CT skills within well-designed subject matter instruction to develop both domain-specific and domain-general CT skills. 
It could prove interesting to compare an Immersion-based learning environment with an Infusion-based learning environment in which CT skills are explicitly trained within systematically designed subject matter instruction.

Another possible explanation for the nonsignificant effect on the acquisition of domain-general CT skills may relate to the longstanding debate around the specificity and generality of CT skills. As noted in our analysis of existing CT literature above, generalists (e.g., De Bono 1991; Ennis 1989; Siegel 1988) view CT skills as applicable across domains, whereas specifists (e.g., McPeck 1990) argue against the existence of general CT skills on the grounds that thinking always amounts to thinking about something and that specific knowledge of a subject matter is necessary for CT. In this study, students in the experimental condition were intensively engaged in acquiring deeper understanding of E&M through an implicit emphasis on the desired CT outcomes. These students performed significantly better than the control group students on domain-specific CT tasks. However, the acquired domain-specific CT proficiency did not transfer when the same students were confronted with domain-general CT tasks (viz., the HCTA). Following the specifist view, it could be argued that the study participants perhaps lacked adequate knowledge of the content used in preparing the HCTA test. This reinforces the notion that the ability to think critically is mainly content dependent (e.g., Bailin et al. 1999; McPeck 1990; Smith 2002). The findings revealed that, compared to the control group, the experimental group students were able to demonstrate proficiency in using CT skills for E&M-specific thinking tasks. However, those CT skills were not applied when they were presented with domain-general CT tasks. Students’ failure to transfer the acquired domain-specific CT skills may therefore spring from the HCTA itself. An important area for future study would therefore be to evaluate the effectiveness of CT-embedded instructional approaches through administration of at least two domain-general CT tests that are based on different everyday content yet focus on similar CT skills.

A third possible explanation for the unimproved domain-general CT skills may relate to the brief duration of the intervention: 8 weeks, covering just 50 % of the E&M course content. Perhaps the intervention was too short to produce a substantial change in participants’ modes of thought, which made it impossible for them to transfer the acquired domain-specific CT skills to domains other than E&M. Moreover, the experimental group students were simultaneously following other courses in which subject matter instruction appeared to be less systematically designed. This may have resulted in limited opportunities for students to extensively practice the desired CT skills in other subject matter domains, and hence hindered their transfer. An important implication of this finding is that transfer of domain-specific CT skills to everyday problems may not automatically occur during a brief instructional intervention, but may instead require a conscious and systematic design of all subject matter instruction toward CT.

Study limitations

The findings of this study are based on a comparison of two intact classrooms at different universities taught by different instructors. Although the initial plan was to use two intact groups at the same university, the number of first-year physics majors at the targeted university was very limited, with just one intact group. To minimize the effects of having two different instructors and institutions, efforts were made to recruit instructors from the two universities with similar education levels and equivalent years of teaching experience. Efforts were also made to closely monitor the implementation of the lessons at both the experimental and control universities. However, the findings from the present study should be interpreted in light of the limitations that sprang from having different institutions and instructors. Moreover, random assignment of the two intact groups to an experimental and a control condition was not feasible. The first author is affiliated with one of the two universities. Since we expected to collaborate intensively with the regular instructors, and to make the close follow-up more convenient, the group at the affiliated university was purposely assigned to the experimental condition.

Conclusion

This study explored the effectiveness of systematically designed subject matter instruction on the development of domain-specific and domain-general CT skills. It demonstrated that a typical freshman course systematically designed based on the First Principles of Instruction model—with an implicit focus on the desired CT outcomes as an integral part of the domain-specific classroom activities—can stimulate the development of domain-specific CT skills. This finding suggests that systematic design of subject matter instruction needs to be made an important component of teaching and learning in undergraduate education if students are to demonstrate domain-specific CT proficiency. Although this study’s instructional intervention failed to provide evidence of the transfer of the acquired domain-specific CT skills to everyday problems, this does not mean that domain-general CT skills cannot be systematically taught. Our hope is that the present study will encourage researchers and instructional designers to pay attention to systematic design of subject matter instruction as a valuable approach to addressing the challenges of CT development. The following observations with regard to CT research in undergraduate education were particularly important. First, we showed that both the domain-specific and domain-general CT outcomes that we wish students to demonstrate need to be identified and precisely articulated before any attempts at teaching CT. Second, through a systematic design of regular subject matter instruction, useful empirical evidence was presented that supports the longstanding theoretical claim that meaningful subject matter learning in a domain can result in the development of domain-specific CT skills. 
Third, following the argument that embedding CT within subject matter domains should result in the acquisition of both domain-specific and domain-general CT skills, the CTEM and HCTA tests were administered to evaluate the effect of the designed instructional intervention on domain-specific and domain-general CT skills, respectively. Accordingly, empirical evidence was provided on the relationship between the acquisition of domain-specific and domain-general CT skills, a barely examined research question. Our starting point was that instructional interventions for CT are not sufficiently supported by the principles of instructional design research. Through this study, we hope to have demonstrated how the two largely detached fields of CT and instructional design research can systematically be integrated. We moreover argued that the instructional principles behind various instructional design models are not sufficiently attuned to specific instructional settings. In this study, we hope to have shown how those empirically valid instructional design principles can be translated into usable instructional design prescriptions that are also relevant to CT.