Assessment and Evaluation of the Learning by Design™ Physical Science Unit, 1999-2000: A Document in Progress

May, 2001

Jennifer K. Holbrook, Jackie Gray, Barbara Fasse, Paul Camp, and Janet Kolodner.

College of Computing

Georgia Institute of Technology

Atlanta, GA 30332-0280

{holbrook, grayj, bfasse, pjcamp, jlk}@cc.gatech.edu

 

Abstract: This paper describes the methods and results of our project evaluation, which has focused on the middle-school physical science unit, Vehicles in Motion. Vehicles in Motion is a project-based unit developed to help middle school students understand Newton’s Laws and related concepts of force, mass, and motion. The approach seeks to teach science content through design practice, but also to help teachers develop a classroom in which all members of the class are engaged in scientific practices and science culture. Herein, we report results, which currently span implementations in school years 1998-1999 and 1999-2000. As results from the final grant year assessment effort are completed, they will be integrated into this document.

 

1. Context and Purpose of the LBD™ Assessment and Evaluation Effort

Our project, Learning by Design™ (LBD™), attempts, like other cognitive-science-derived approaches to science education, to provide a learning environment that will foster the development of science knowledge and "habits of mind" for becoming self-regulated life-long learners. In our work on developing LBD™, we have focused on four major areas.

What distinguishes assessment instruments from scaffolding instruments? Ideally, all assessment instruments serve to provide feedback for teachers and students so that appropriate scaffolding is provided to help a student or class bridge from level to level of concept understanding, skill acquisition, and metacognitive awareness. But to arrive at the point at which assessment and scaffolding are interwoven requires that each aspect of assessment be developed to the point that we understand what it can and can’t tell us. Then the instrument must be integrated into the curriculum such that the assessment is easy to undertake, the results are easy to evaluate and distribute, and the outcome leads directly to the appropriate level of scaffolding. Meanwhile, data from assessment instruments serve program evaluation needs as well. In designing instruments that serve formative and program evaluation for the LBD™ project, we generally start by developing instruments and methodology to be used external to the curriculum for program evaluation purposes. But each was developed with the purpose of eventual integration into the curriculum as a tool for use by teachers and students, rather than by external assessors. In this paper, we explain how each instrument and its related rubrics have been developed, how it has been used in program evaluation, what we have been finding through these instruments, and how we envision it being integrated into the curriculum.

Making assessment information available to students can encourage self-directed learning, as it provides them with feedback on their progress. Assessment is also essential to providing instructional intervention or support for the learner (scaffolding). Supporting and extending the developing understandings that students exhibit is central to our approach. Our assessment tools reveal this developing understanding which, in turn, is used to determine the most relevant scaffolding to apply. Students vary in their understanding and in their developmental progressions toward knowledge and skill acquisition. The assessment and scaffolding tools we have developed provide an important practical translation of what we know about how to assess and scaffold complex cognitive activity.

Our approach is distinguished by (i) our focus on assessing and scaffolding science and collaborative practices (where others focus on assessing knowledge), (ii) our development of resources that scaffold students’ performance and learning in a project-based classroom and at the same time scaffold teachers as they develop facilitation skills, and (iii) our design of assessment and scaffolding tools in collaboration with teachers and students so that we know we are addressing real issues in actual classrooms. Our aim in addressing these goals is for students to learn content in such a way that they can apply it in a variety of situations (promoting transfer) and to become skilled science learners, competent at participating in the practices of science and science learning.

Our assessment goals have been to provide alternative, more dynamic ways to assess student learning and skill mastery beyond static measures like standardized achievement tests; to involve our teachers as collaborators in each step of this developmental process; and to validate those processes and tools that would have value to other teachers of middle school students. A lesson we continue to acknowledge as we work with our teachers is that we all want students who have high achievement test scores, but we also want students who learn the deep principles and processes of science and who will go on to become the innovators, designers, and problem solvers of tomorrow. We want to extend what counts as learning in science to include the process skills that will equip our students to be life-long learners.

Indeed, the science education community and the recently-published American standards about science literacy want students to gain both competencies just listed (American Association for the Advancement of Science (AAAS), 1993) – to learn science concepts in a way that allows them to apply those concepts to new situations as they arise and to become enculturated into the practices of scientists. This includes being able to enact science process skills such as inquiry, observation, measurement, experiment design, modeling, informed decision making, and communication of results, as well as social practices such as learning from each other. The aim is for students to learn in ways that will allow them to take part in those practices in skilled ways both inside and outside the classroom. Transfer, in the context of science learning, means gaining expertise in the practices of the scientific community (e.g., designing experiments, managing variables, justifying with evidence, analyzing results, planning investigations, communicating ideas, communicating results, incrementally building up one’s understanding of a new concept) and learning scientific concepts and their conditions of applicability in order to engage in scientific reasoning.

From several iterations on the development of our assessment and scaffolding tools, and from the data we've collected to show student learning with these tools, we have had an impact on addressing both the process-oriented and content-oriented national standards for middle school students in science:

The assessment and scaffolding tools we’ve designed address these standards and more. Our results so far show that the scaffolding tools and assessment instruments we have designed have potential for impacting the development of these very important skills, routinely practiced in our LBD™ classrooms. We also note from the standards (AAAS, 1993) several practices that students tend to have difficulty with, and LBD™ provides extensive practice opportunities and scaffolding for those:

As well, LBD™ units cover more content and process skill development than these standards designate. For example, our students develop great skill in collaboration.

2. Methods and Instruments

2.1. Ethnography

Evaluation of the earliest LBD™ curriculum implementations focused on providing formative feedback for the curriculum development effort. For the most part, this called for ethnographic methodology: case studies of a small number of students and the environment in which they were operating (the unit, the students’ expectations of the classroom culture, the teacher’s approach, the social environment of the class and school, the school’s values and expectations, etc.). The case studies were generated by regular passive participant observation, formal and informal interviews, and teacher/student self-report. The data set includes field notes, videotapes and audiotapes, and analysis of artifacts. Such feedback helped identify the differences between what the unit developers assumed about the classroom culture in which the units would be instantiated and the actual environment in which the units were implemented.

After several major revisions, the developers were fairly satisfied with the unit format and activities, and with the type of training offered to teachers. As this happened, the purpose of classroom observation evolved: now we were interested in studying how variations in implementation affected the success of the curriculum. "Success" in this case is broadly defined as how well the students learn target content, science and design practices, collaborative skills, and metacognitive strategies; it also includes the students’ level of engagement with and enthusiasm for the curriculum, and their teacher’s satisfaction with teaching the curriculum. Operationalizing and measuring such definitions of success led to the use of very different methods of assessment, detailed below. However, it is still through ethnographic methods that local variation continues to be analyzed. Now, the ethnographic efforts focus on understanding how different teachers with different teaching styles, knowledge of the target science content, and knowledge of science and design practice make the affordances of LBD™ available to students. Also, the ethnography gives us information about how students are responding, their levels of engagement, what’s difficult for students, how students use one another as resources, and so on. From this, we are learning about the affordances our materials and ancillary training provide for the students and the teachers, and what is still needed.

In the first years of LBD™, when only 1-4 teachers were implementing our units at a time, we were able to spend enough time with each teacher to do in-depth case studies of all. In the last few years, however, the number of implementing teachers per year (and the geographic area over which they are spread) has grown as our staffing has stayed constant. It was therefore necessary to develop methods of assessment that could be used reliably by observers without ethnographic training and that would yield valid data about the differences in:

Thus, we create our ethnography from data gathered through four strategies (Fasse & Kolodner, 2000). First, we’ve developed two observation instruments to help observers focus their observations in all of the classrooms (these can be found at http://www.cc.gatech.edu/projects/lbd/obs_tools/tools.html). While this flies in the face of qualitative methodology, we do have a practical need to make sure that our untrained observers include the taken-for-granted world in their notes. We target a subset of teachers to visit at least once a week. This subset is chosen based on specific questions we need to answer. For example, one teacher was included in the set because of both a strong science background and specific training in teaching inquiry-based classes, while another was included because she was teaching science for the first time ever. This allows us to understand what the teacher needs to know to make the implementation successful; in turn, this information is used in developing teacher support and in helping teachers decide if the curriculum is appropriate for them to use. Second, we interleave thick description (Geertz, 1983) from our observations with description derived from video documentary. We videotape all of the LBD™ and the comparison classrooms at least twice during the year; some are selected for much more frequent videotaping. Third, we meet with our teachers in focus groups every six weeks to learn what works and doesn’t work in their classrooms and to allow them to share their experiences with each other. Fourth, we have team members use prompt sheets and checklists to create a description of classes each time they have reason to visit.

The data’s audit trail includes field notes from observations and interviews, written evaluations of class visits based on the checklist and prompt instruments, transcriptions of audiotapes, and written summaries and coding of videotapes (Lincoln & Guba, 1985; Spradley, 1980). Our staff ethnographer organizes and maintains these data, and then members of the team make use of whichever resource is most appropriate to inform specific research questions as needed.

2.2. Measuring LBD™ Fidelity of Implementation and Inquiry-Friendly Practices in LBD™ and Comparison Classrooms

The teachers who implement LBD™ represent a wide variety of teaching styles, knowledge of target science, experience in science methodology, and experience with teaching and exposure to inquiry-based teaching methods. Then too, the classes and schools in which they teach vary a great deal: there are economic disparities between schools and districts and disparities of administration style; there are class cohorts with a wide variety of student abilities and class cohorts that have been restricted, sometimes to honors students, sometimes to at-risk students. Each of these features will affect how the teacher implements the unit, from the amount of time taken on a given construction activity to the types of questions the teacher frames for class discussion, from decisions made about grading collaborative efforts to the emphasis placed on identifying methodological variation in experiments.

Learning by Design units provide teachers and students with specific sequences of activities, group and individual assignments, and carefully articulated expectations. The ways in which teachers depart from the unit and the reasons for these departures must be documented so that, when faced with different outcomes among different teachers’ classes, we are able to identify covariates of the unit.

As discussed above, the thick description, interviews, and videotapes that we gather from some classes make it easy to gauge the fidelity of unit implementation. However, in classes that are not targeted for intense ethnography, it is still important to gather an accurate description of the fidelity of implementation.

Staff visits to classrooms afford the opportunity to get a "snapshot" of implementation fidelity. However, unless the staff member who is visiting is well versed in qualitative methodology, it is difficult to know what counts as evidence in forming opinions and conclusions of implementation fidelity. It is also important that any such summaries of implementation fidelity meet a standard of objectivity, or at least inter-observer reliability.

The Observation Prompt Tool (Fasse, Gray, & Holbrook, 1999) and LBD™ Fidelity Report Card (Fasse, Holbrook, & Gray, 2000) (http://www.cc.gatech.edu/projects/lbd/obs_tools/tools.html) are both designed specifically for use with our evaluation effort. The Observation Prompt Tool prompts for information about which types of LBD™ activities take place during the visit, and for specific aspects of how such activities are carried out, as well as a description of the environment in which the visit is carried out. It also prompts for descriptions of teacher and student roles during the various activities, and for descriptions of how well these roles are carried out. For example, in a discussion, the teacher may adopt the role of lecturer or inquisitor, or may moderate a discussion among the students. The students may actively seek to question one another, or they may expect the teacher to provide the discussion framework. The questions themselves may be more- or less-well-suited to promoting connections between class actions and science concepts. Each of these areas has a set of several statements and questions to cue the observer to describe specific elements of an activity, and to speak to specific issues about the classroom.

The form of data gathered from such visits includes a written summary of the visit and a completed copy of the OPT form; it often also includes a videotape of the class. The written summary is indexed to the appropriate sections of the OPT, so that the evidence and context for a specific descriptive choice is preserved.

The Fidelity Report Card is intended to be used to sum up a set of such observations. Evaluative statements for both students and teachers are included. The items are scored on a 5-point scale (Unsatisfactory - Needs much improvement - Meets fair expectations - Good - Ideal). Many of the items could be used to evaluate any science classroom (e.g., "Students use science vocabulary and/or measurement techniques w/accuracy"; "Students listen/discuss/consider ideas/suggestions of others w/in group"; "Teacher knowledge of the specific science area"; "Teacher shows flexibility for changing plans when indicated by student needs"). Other questions are specific to LBD™, such as: "Students independently employ LBD™ rituals to inform decisions"; "Teacher uses the [LBD™] rituals as scaffolding tools to promote or model the process of reflection". When data across time are available, the Fidelity Report Card is applied separately to early and subsequent time frames to see how the class culture evolves throughout the year.

What results is a Fidelity of LBD™ Implementation score (or set of scores, when employed across time) for each teacher in the program. These scores are ranked ordinally, and the scale is used as a covariate in comparing results on written tests, performance assessments, and structured interviews. Three staff members each score the teachers for Fidelity of Implementation, based on the OPT and written reports.

We obtain high reliability (r > .9) on all items for Fidelity of Implementation scores based on reports from at least halfway through the content in the school year. Fidelity of Implementation scores based on reports from earlier in the school year are less reliable the earlier the time frame. We are currently investigating whether this is attributable to some scorers setting developmentally-relative standards and other scorers, absolute standards.
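As an illustration of how a reliability check of this kind can be computed, the sketch below takes three scorers' overall Fidelity scores for a handful of teachers and reports pairwise Pearson correlations. The scores and the choice of Pearson r as the index are illustrative assumptions, not the project's actual data or procedure.

```python
import numpy as np

# Hypothetical Fidelity of Implementation scores (one overall score per
# teacher) assigned independently by three staff scorers.
scorer_a = np.array([4, 3, 5, 2, 4, 3])
scorer_b = np.array([4, 3, 5, 3, 4, 3])
scorer_c = np.array([5, 3, 4, 2, 4, 3])
scores = [scorer_a, scorer_b, scorer_c]

# Pairwise Pearson correlations between scorers, averaged as a simple
# index of inter-rater reliability.
pairs = [(0, 1), (0, 2), (1, 2)]
rs = [np.corrcoef(scores[i], scores[j])[0, 1] for i, j in pairs]
print("pairwise r:", [round(r, 2) for r in rs])
print("mean r:", round(float(np.mean(rs)), 2))
```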

LBD™ specifically seeks to promote an inquiry-friendly culture through the activities, the written materials, the assignments, and the in-service teacher training and support. We believe that such a culture is key to promoting deep learning. Thus, it is important to document not only the differences in the written materials and activities of LBD™ and comparison classes, but the differences in classroom culture, as part of the context in which outcomes for LBD™ and comparison classes are discussed.

We approach documenting inquiry-friendly culture in the same general way that we approach documenting fidelity of implementation, i.e., employing those sections of the OPT which are not curriculum-specific, and a subscale of the Fidelity Report Card that focuses on inquiry skills only. The scores on this subscale for both LBD™ and comparison teachers are then ranked ordinally and used as a covariate in comparing results on written tests, performance assessments, and structured interviews.

The value of ranking teachers’ classes for "inquiry friendliness" and "fidelity of implementation" becomes clear as we begin to look at measures of learning outcomes. For example, we find that the teacher who "used science terminology well" (Fidelity Report Card item 33) had students who did the same (FRC item 1), and those same students have significantly higher scores in performance assessments and structured interviews. The teacher who used design terminology and orchestrated LBD™ rituals well, but did not have great science knowledge (FRC items 32a, 26), had students who spontaneously used the rituals to solve their own problems (FRC item 15) and transferred the language and methodology both in performance assessments and to science fair projects (FRC items 2, 3, 4, 16), but had only mildly significant pre- to post-test gains on the target science content. The comparison teacher who used science terminology very well (FRC item 33) but did not identify science concepts as they emerged in student discussion, questions, and demonstrations (FRC item 41), and did not identify science in everyday events or cases (FRC item 42), had students who did not use science terminology well (FRC item 1) and who did not show a significant change from pre- to post-test on content knowledge.

And so on. In other words, we are beginning to be able to tie specific teacher and student habits in the classroom to outcomes on the measures of target science content learning and science practice, in both LBD™ and non-LBD™ classrooms. These predictors in turn help us to focus our ethnography more tightly and to tailor teacher training and curricular materials.

2.3. Targeting Specific Teacher Practices

As the curriculum units changed less from implementation to implementation, we were able to make preliminary guesses, based on ethnography and outcome data, about how particular practices were most closely related to various aspects of student success. By narrowing our scope to studying two practices in depth per implementation year, we added a new dimension to the ethnographic data being gathered. This year, for example, we have looked at how teachers make use of the practice of articulating Design Rules of Thumb (Crismond, Camp, & Ryan, 2001). The idea behind these rules is that, as class groups design an artifact and conduct tests upon it, their data will help them form conclusions about the intrinsic constraints of the task. Articulating these constraints can lead from the specific artifact’s behavior in a specific set of circumstances to the recognition of an underlying aspect of science. For example, student groups seek to develop the best balloon-propelled car by testing a number of variables, including the aperture of the straw through which air escapes from the balloon. The "best" width is the width that allows the most air to be released - up to a point. The student group may begin by articulating the design rule as a piece of advice about straw width. Sometimes, the advice is too artifact-specific (e.g., "use straws of 7/8’ diameter"). Teachers who push for reasons to back up the advice are requiring the students to back up their conclusions with evidence. Teachers who understand the science will orchestrate the discussion so that it moves first toward generalizing the rule (e.g., "use wide straws instead of narrow ones"), then toward explaining why the wider straw is better, and on to considering whether there is a ratio of balloon size to aperture. They will also help the students remember that air is a gas, and that it’s being released through a rigid tube; they may have students consider the behavior of the balloon and the behavior of a rocket. Thus, when the rule of thumb is articulated, the science underlying the concept is much better understood, and the opportunity for far transfer in applying the rule of thumb is enhanced.

2.4. Measuring student outcomes

Among the questions we seek to answer through our program evaluation are:

  1. Do students in LBD™ classes learn more of the target content than students in comparable classes that use more traditional curricula, teaching methods, and culture?
  2. How great is the increase in expertise in the target content areas by LBD™ students and their comparison counterparts?
  3. Do LBD™ students gain greater ancillary skills that suggest they have "learned how to learn" better than their comparison counterparts? In particular:
      • do LBD™ students give more detailed explanations of their answers (in prompted situations or spontaneously)?
      • do LBD™ students show greater ability to transfer knowledge?
      • do LBD™ students reference cases more often?
      • do LBD™ students understand scientific methodology issues better?

Quantitative assessment techniques must be used in ascertaining these outcomes, for credibility in the larger community comes from being able to compare outcomes on agreed-upon measures. However, much of what passes for knowledge assessment is justifiably criticized as favoring rote memorization without understanding (e.g., Wiggins, 1993). The difficulty for those who design assessment instruments is to find a way that measures knowledge objectively, reliably, and validly without measuring knowledge at too shallow a level. Fortunately, alternative assessment methods are becoming more widely available and accepted, and we have been able to adapt numerous available instruments and techniques to meet our own program evaluation needs.

2.4.1 Recognition-Based Content Tests

In developing the current versions of content tests for earth science and physical science, we sought to keep the tests simple to administer, but at the same time, to include questions that allow a richer picture of students’ depth of understanding of target science content. The content tests are in the familiar multiple-choice question format, but the question scenarios and the choices themselves are carefully crafted to allow analysis of the stage of understanding that the student is at before instruction, and of what qualitative differences in understanding occur as a result of instruction.

2.4.1.1 Question Origins

Two types of questions were specifically included to help answer the questions about student outcomes posed above.

One type, which we call "Explanatory Answer," links a multiple-choice response with a required explanation of the response choice. The score on the question is based on giving the correct response among the choices and explaining why this is the correct response.

Another type of question is multiple-choice in nature, but the choices are carefully crafted to reveal the depth of understanding of the concept at issue, a methodology Thornton (1997) refers to as "conceptual dynamics." These questions were published items in the Tools for Scientific Thinking Force and Motion Conceptual Evaluation, or FMCE (Sokoloff & Thornton, 1989), and the Force Concept Inventory, or FCI (Hestenes, Wells, & Swackhamer, 1992), which were developed to track the development of depth of understanding of a number of physical science concepts. These tests were designed to be administered to secondary and college students. The questions that we selected were adapted with simpler wording and choice selections for the younger students who are being assessed in the LBD™ project.

Thirteen of the questions on the exam were adapted from these instruments to study the development of depth of understanding of concepts of force and motion. Many of these questions were clustered together to probe understanding of a specific concept carefully. Questions 14-16 look at the effect of force on a given velocity; questions 17-18 probe the understanding of acceleration as a change in direction; questions 20-24 are about force, mass, acceleration, and "equal & opposite reactions"; and questions 25-27 are specifically about gravity as a force in acceleration on a hill.

2.4.1.2 Coding "Depth Of Understanding" Questions

The content test is administered both pre- and post-curriculum. One way that we compare LBD™ and comparison student learning of target science is to compare LBD™ and comparison students’ overall change of scores on the test items focusing on target content. For such an analysis, it is easiest to interpret findings if questions are coded as being "correct" or "incorrect" as on traditional tests, rather than trying to encode different levels of understanding as interval values. Differing stages of understanding could not appropriately be interpreted on an interval scale, as is assumed by the types of repeated-measures analyses we intend to employ. However, providing nominal and ordinal codes, then using descriptive and non-parametric analyses on these items specifically, will allow us to study a number of changes.
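To make the intended analysis concrete, the following sketch runs a mixed-design ANOVA (curriculum as a between-subjects factor, test administration time as a within-subjects factor) on a small invented data set, so that the Curriculum x Test Administration Time interaction is the term of interest. The column names, the scores, and the use of the pingouin library are our own choices for illustration, not part of the project's actual analysis toolchain.

```python
import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per student per test administration,
# with the summed score on the target-content items.
df = pd.DataFrame({
    "student":    [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "curriculum": ["LBD"] * 6 + ["Comp"] * 6,
    "time":       ["pre", "post"] * 6,
    "score":      [3, 8, 4, 9, 2, 7, 3, 4, 4, 5, 2, 3],
})

# Mixed ANOVA: the Curriculum x Test Administration Time interaction term
# asks whether the pre-to-post gain differs between curricula.
aov = pg.mixed_anova(data=df, dv="score", within="time",
                     subject="student", between="curriculum")
print(aov[["Source", "F", "p-unc"]])
```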

The coding scheme we devised, then, serves a dual purpose: (1) to provide simple "correct/incorrect" labels of most items on the test, (2) to provide more detailed labels that could serve to compare how pre-curriculum and post-curriculum misconceptions and missing knowledge differ from one another. Codes were assigned as follows: On multiple choice questions, the letter of the multiple choice answer was preserved in one stage of coding. In another stage, the letters were matched against an answer key and converted into "correct" and "incorrect" answers: 1 was given for a correct answer; 0 for an incorrect answer.

On explanatory answer questions, incorrect answers were given an additional coding label of a single letter that describes the rationale of the answer. For example, "m" on question 27 might stand for "mass". Coding labels for each explanatory answer were unique to that question (thus, "m" might stand for "measure" in question 7). As these labels do not affect the designation of correct vs. incorrect in the repeated-measures analysis, and each question’s answer pattern must be studied separately, there is no need for these codes to be systematic from question to question; making them systematic would in fact require longer codes or numeric codes, which are less mnemonic for coder and analyst alike.
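A minimal sketch of the two coding stages follows; the item numbers, answer key, and rationale legends are invented for illustration, not taken from the actual test.

```python
# Hypothetical answer key and per-item rationale legends for two
# explanatory-answer items (the real legends are item-specific).
ANSWER_KEY = {7: "c", 27: "b"}
RATIONALE_LEGEND = {
    7:  {"m": "measure", "g": "guess", "n": "no answer"},
    27: {"m": "mass", "g": "guess", "n": "no answer"},
}

def code_response(item, letter, rationale=None):
    """Return (raw letter, 1/0 correctness, rationale code if incorrect)."""
    correct = 1 if letter == ANSWER_KEY[item] else 0
    code = None
    if not correct and rationale is not None:
        code = rationale if rationale in RATIONALE_LEGEND[item] else "?"
    return letter, correct, code

print(code_response(27, "a", rationale="m"))  # ('a', 0, 'm') -> "mass"-based reasoning
print(code_response(7, "c"))                  # ('c', 1, None)
```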

2.4.1.3 Analyzing for Nature of Concept Change

For the questions from the FCI and FMCE we’ve looked at the differences in answer patterns from pre- to post-tests. The concept inventory questions were originally designed to tease apart different levels of concept development. Thus, even answers that are designated "wrong" can show movement from one level of understanding (e.g., guessing by using contextual clues, Aristotelean-level concepts) to another (partial understanding, with remaining misconceptions). Should the patterns of responses differ between LBD™ and comparison students, we will be able to compare their relative depth of understanding of the concepts.

The first step was to model question-answer patterns that would be indicative of each level of understanding. Some of this modeling work was actually done in the devising of the questions. However, guessing patterns were not modeled by the test authors. Random guessing would of course be indicated by a relatively even distribution of answers. We also model a more sophisticated guessing strategy that takes each question’s contextual statements as the basis for the answer, rather than the physical law that the question is designed to elicit.

Each model predicted a specific pattern of response preferences for each question. We ranked these models from least-to-greatest concept understanding, with more specific guessing strategies ranked higher than less sophisticated guessing strategies.
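The matching step can be pictured as follows; the three-question cluster, the predicted patterns, and the model names are placeholders rather than the actual FMCE/FCI predictions.

```python
# Hypothetical predicted answer patterns for a 3-question cluster, ordered
# from least to greatest concept understanding.  Random guessing has no
# preferred pattern, so it is the default when nothing else matches.
MODELS = [
    ("contextual guessing",   ["d", "d", "e"]),
    ("partial understanding", ["a", "a", "b"]),
    ("full understanding",    ["b", "b", "b"]),
]

def classify(pattern):
    """Assign an observed response pattern to the closest predicted model."""
    best_name, best_hits = "random guessing", 0
    for name, predicted in MODELS:
        hits = sum(p == q for p, q in zip(pattern, predicted))
        if hits > best_hits:
            best_name, best_hits = name, hits
    return best_name

print(classify(["b", "b", "b"]))  # full understanding
print(classify(["a", "a", "e"]))  # partial understanding
print(classify(["c", "e", "a"]))  # random guessing (matches nothing)
```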

Similarly, on the explanatory answer questions, we are looking at the differences in both the quantity of correct choice/explanation pairs from pre-test to post-test, and the nature of the explanations, especially incorrect explanations. We expect to find (1) that on the pre-test, many of the explanations will indicate guessing (answers such as "I just picked one" are coded as guesses), and that a large number of students will not attempt any explanation at all; and (2) that incorrect choices and explanations will indicate a later developmental stage of the concept on the post-test than on the pre-test.

Additionally, with explanations, we can look at the nature of explanation. We can see whether the answer cites aspects of the question, physical science/earth science principles, or cases from personal/class experience. Differences between LBD™ and comparison students on the nature of explanation from pre-test to post-test will allow us insight into most of the questions cited under 3 at the beginning of this document. (The exceptions are noted in the footnote.)

Finally, we will be able to see whether LBD™ students are more likely than comparison students simply to articulate explanations for their answers from pre-test to post-test, for among the encoded responses is "no answer." In a very perfunctory scan of the complete answer worksheet we see a large difference between the number of pre-test and post-test "no answer" codes. To show that LBD™ students have developed a habit of explaining their answers, even when those answers are tenuously grasped, would support our claim that LBD™ classrooms provide an environment in which both "thinking through" and explanation are valued by students. The first step in resolving misconceptions is being able to identify them; students’ ability to articulate a conception helps both them and the teacher in such identification.
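Once responses are coded, the pre/post comparison of "no answer" codes is a simple tabulation; the sketch below assumes a long-format table of coded explanations with hypothetical column names and values.

```python
import pandas as pd

# Hypothetical coded explanations ("n" is the "no answer" code).
df = pd.DataFrame({
    "time": ["pre", "pre", "pre", "pre", "post", "post", "post", "post"],
    "code": ["n",   "n",   "g",   "m",   "m",    "g",    "m",    "n"],
})

# Count "no answer" codes at each administration.
no_answer = (df["code"] == "n").groupby(df["time"]).sum()
print(no_answer)  # post: 1, pre: 2
```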

2.4.2 Performance Assessments and rubrics

Goals:

(1) To develop an instrument/methodology that measures both content knowledge and science process skills, such that the dynamic aspects of the thinking processes necessary to transfer the knowledge and apply the skills are measurable.

(2) To develop an instrument/methodology that allows assessment of students’ ability to use evidence to answer questions and in support of assertions or recommendations.

(3) To integrate the instrument/methodology into the curriculum such that (a) assessment is easy to accomplish as a natural part of the curriculum, and (b) scoring rubrics are learnable, reliable, easy to use, and easy to explain to students.

It is difficult to measure complex cognitive activity with content tests, since traditional measures do not generally offer a window into how complex skills actually play out or are developed. Zimmerman (2000) reviews the literature on scientific reasoning and makes the point that to assess scientific reasoning, activities must be examined during which such reasoning is needed. Performance assessment tasks are just that. They allow reasoning skills to be examined in knowledge-rich contexts.

We have gained expertise in the past two years in designing and adapting performance tasks that can be used to assess student learning of skills and practices and in creating rubrics that can be used to analyze the extent to which students are participating, and we’ve developed several such tasks and the materials needed to use them reliably for assessment. Preliminary evidence based on these tasks indicates that LBD™’s practices promote transfer in the subset of the students we have evaluated, and shows us that such tasks can be used for assessment of skills learning and can be coded reliably.

We have adapted performance assessment tasks to follow a format that allows us to better assess the collaboration and science process skills that we seek to promote in the LBD™ curricula. The task is designed in three parts: (i) students design an experiment to gather evidence to address an issue in the context of a real-world problem; (ii) students work in groups to run a specified experiment with materials we have provided, and gather data from this experiment; (iii) students answer questions that require them to utilize the data they gathered, and to apply their knowledge of science to interpret the data. The quasi-experimental design has different classes assigned to different participation conditions: Some classes have the students do all three parts of the task as a group, writing a single group answer; some classes have the students run the experiment as a group, but to work as individuals on parts 1 (designing/writing an experiment) and 3 (interpreting data, answering questions); and some classes have the students work together on all three parts to develop answers, but each student writes these answers in his/her own words.

We videotape the two conditions in which groups of students work together throughout the task. The design-an-experiment part of the task allows us opportunity to judge group ability to design an investigation, their understanding of what a variable is, and their ability to control variables, among other things. The middle part helps us determine their ability to carry out a procedure carefully and correctly: to measure, observe, and record. The third part allows us to determine if they know how to use evidence to justify and how well they can explain. All three parts provide evidence about their collaboration and communication capabilities and their facility at remembering and applying important classroom lessons.

An example task may help bring this to life. "Where the Rubber Meets the Road" was adapted from a task developed by the Kentucky Department of Education and is now available through the PALS (Performance Assessment Links in Science) website (PALS, 1999). Part I has students design an experiment that compares the efficacy of two tire types that differ in the hardness of the rubber used when tested in different road conditions. The science concept being tested is understanding of the force needed to overcome sliding friction.

Coding categories include negotiations during collaboration; distribution of the task; use of prior knowledge; adequacy of prior knowledge mentioned; science talk; science practice; and self checks during the design of the experiment. Each group is scored on a Likert scale of 1-5, with 5 being the highest score. (See Appendix 1 for examples from the coding scheme developed to assess collaboration and science practice skills during these tasks.)
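As a sketch of how a group's ratings might be recorded and summarized, assuming one 1-5 rating per coding category (the category names follow the list above; the scores themselves are invented):

```python
# Hypothetical rubric ratings for one group on the performance task,
# one 1-5 rating per coding category.
group_scores = {
    "negotiations during collaboration": 4,
    "distribution of the task": 3,
    "use of prior knowledge": 5,
    "adequacy of prior knowledge mentioned": 4,
    "science talk": 3,
    "science practice": 4,
    "self checks during experiment design": 2,
}

mean_score = sum(group_scores.values()) / len(group_scores)
print("category ratings:", group_scores)
print(f"group mean: {mean_score:.2f} / 5")
```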

2.5. Structured Interviews

While performance assessments allow for assessment of dynamic processes, they have some disadvantages for use in the curriculum:

(i) they take a large chunk of time to implement, evaluate, and provide feedback

(ii) the written-answer format does not allow for on-line probing of understanding or skills

(iii) they take the form of a class activity that is separate from the class project, so they are most naturally treated as tests

(iv) they are intended to be implemented for the whole class

In response to these concerns, we have been developing an assessment method to meet the same goals, but that allows for more flexibility in implementation. Structured interviews can be used for individual or small group assessment by a teacher (or assessment team member). They can be administered in multiple ways: by having a student read a scenario, run a simulation, or devise a solution to an orally-presented problem. They can be administered one-on-one and casually, pulling a student who is at leisure aside for a few minutes, or they can be assigned to a group or a class as a quiz. In each form, the student is to describe the way s/he will go about solving a problem; the student(s) then carries out the proposed methodology and reports on results. The salient feature of the structured interview is its interactive nature: the assessor is not simply recording the student’s responses, but using probing questions, observing the procedure, asking questions about choices and results, and noting what the student spontaneously offers and what the student is able to do with various levels of scaffolding.

The 2000-2001 academic year marks the pilot effort of structured interviews. We have used several different tasks with subsets of students, and we have begun refining the interview framework for each. We intend these to eventually be integrated into the curriculum, having teachers use them for either informal or formal assessment.

2.6. Student self assessment survey

We have collected an exhaustive list of skills for teachers and students to use to guide student progress throughout our LBD™ units, along with questions that students can ask themselves about each in order to assess their capabilities and developmental progress. We then targeted two specific aspects of learning for self assessment, collaboration skills and active research skills. Twenty survey items were selected and adapted to cover these two topics. The Twenty Questions survey is intended to be used after each major activity. It is administered after each of the launcher unit activities, then every few weeks as specific sub-challenges to the overall problem are completed. The students keep the survey with their notebook; they have access to their own survey answers each time they look at the survey, and the instructions encourage them to notice differences from administration to administration.

 

3. Results

Our design of LBD™ predicts three important aspects of learning that stand to gain from the approach - (a) domain knowledge, (b) specific science process skills such as those involved in designing experiments, and (c) more general learning practices, such as collaborative skills. We have compared knowledge and capabilities of students participating in LBD™ environments, which take a project-based approach that uses iteration as a central practice and that focuses on target concepts that are closely tied to one another, to students in matched comparison classes (with matched teachers) where an inquiry approach with multiple hands-on demonstrations and activities is used, but in a less orchestrated sequence than in LBD™ and without the context of a project goal. Our results, based on analysis of content pre- and post-tests and performance tasks, indicate that learning science content well is tied to the domain expertise of teachers coupled with their ability to facilitate an inquiry-based approach. Thus, both LBD™ students and matched comparison students who had teachers with high domain knowledge and good inquiry facilitation skills showed high levels of content understanding when we look at right and wrong answers on the content test. When we compare student capabilities on performance assessments with respect to science process skills and general learning practices, and when we compare the degree to which student understanding of content improves, however, LBD™ seems to make a big difference. Looking at correct and incorrect answers on the content tests, we see larger increases in science content scores in LBD™ classes than in matched comparison classes. As well, our results show that the ability to use science practices and processes well and the ability to engage in more general learning practices (e.g., productive collaboration) are greater among LBD™ students than in comparison classes.

3.1. Learning of content

(1) Do students in LBD™ classes learn more of the target content than students in comparable classes that use more traditional curricula, teaching methods, and culture?

(2) How great is the increase in expertise in the target content areas by LBD™ students and their comparison counterparts?

Recall that we assess content knowledge using a 36-item paper-and-pencil content test; 19 of these items assessed knowledge of the specific content areas of Vehicles in Motion, 4 addressed areas of physical science typically covered in the middle-school curriculum but not covered in the Vehicles in Motion LBD™ unit (e.g., simple machines, energy), 7 assessed knowledge of general science practices such as drawing and interpreting graphs and understanding scientific procedure, 2 assessed knowledge of life science, 3 of earth science, and 1 of chemistry. The content test was administered within the first few weeks of school, before any target content had been introduced. Typically, the topics covered before the pre-test were more general aspects of science: for LBD™ students, the launcher unit material introduced science and design, and in the comparison classes, the introductory textbook chapter included science methodology, safety, and measurement. The content post-test was administered after the LBD™ unit was completed (for LBD™ classes) and after the target content had been covered (for the comparison classes). Post-test administration typically occurred in March; all classes completed the post-test within three weeks of one another. All but one teacher administered each test on one day to all of his/her classes. The exception, a comparison teacher, administered the pre-tests in mid-December to some classes, and in early January to the remaining class. This teacher administered post-tests within one week of the LBD™ teachers with whom the comparison was paired.

We have divided classes into three cohorts for assessment purposes: those who were in both middle-income communities and mixed-achievement classes, those in both an affluent community and in Honors Science, and those in a lower-income community with a mixed-grade (grades 6-8) Honors Science class. The LBD™ and comparison students in the affluent community/Honors Science cohort were taught by teachers who had high domain expertise in physical science and extensive experience using innovative teaching methods and enriched teaching environments. Most of the LBD™ and comparison students in the mixed-achievement classes had teachers whose domain expertise was in Life Science (one has an Earth Science background, one a Chemistry background) and who had some experience with enriched teaching environments, but less experience with inquiry-driven teaching methods than the Honors cohort teachers. Similarly, the teacher of the mixed-grade Honors Science class had expertise in Life Science; LBD™ was this teacher’s first inquiry-driven, enriched-environment teaching experience. The mixed-grade Honors students were compared to the comparison classes of the mixed-achievement cohort, because of the teacher background similarity, because these two cohorts also had fewer students in advanced math than the affluent/Honors Science cohort, and because there was greater socio-economic similarity between these two cohorts than between the two Honors Science cohorts.

Results

Mixed-Achievement and Mixed-Grade Honors Cohorts

In comparing the first cohort of two LBD™ teachers’ (average achievers) classes to the comparison teacher’s (average achievers) classes, we find a modest (p < .05) but significant Curriculum x Test Administration Time interaction for learning gains on target science content questions: LBD™ students show a mean gain of about 1/2 point out of a possible total of 7 points on the target science items, which is significantly greater than the gain of matched comparison students with a matched comparison teacher (no mean gain on the target science items). No other category of questions showed a significant Curriculum x Test Administration Time interaction within this comparison cohort. In comparing the multiple-grade Honors classes to the comparison teacher from this cohort, there was a significant (p < .001) Curriculum x Test Administration Time interaction for general physical science, but not for target physical science or for any other category.

Unfortunately, the mixed-achievement comparison teacher administered the pre-test incorrectly to her classes, presenting them with copies that included only the odd-numbered pages of a two-sided test. This resulted in a large number of items being unavailable for pre-to-post-test analysis. In particular, instead of 16 multiple-choice target content items, only 7 target content items were administered to the comparison teacher’s classes on the pre-tests.

8th Grade Honors Cohort

Of particular interest is the Honors Classes comparison. The LBD™ teacher’s mean pre-test scores on target science items (possible total of 16 points; see footnote 3) were similar to those of the matched comparison teacher, but whereas the scores for the LBD™ teacher’s classes more than doubled from pre-test (M = 3.87, SD = 1.63) to post-test (M = 8.48, SD = 2.67), the comparison teacher’s class only had a mean gain of about 1 point from pre-test (M = 3.17, SD = 1.79) to post-test (M = 4.44, SD = 1.85). The Curriculum x Test Administration Time interaction for learning gains for this set of LBD™ classes was statistically significant (p < .001). All but one of the remaining question categories showed no significant Curriculum x Test Administration Time changes. Graph 1 shows these results.

 

FMCE and FCI Question Cluster Results

Three force and motion topics are covered by clustered questions which were adapted from the FMCE (Sokoloff & Thornton, 1989) and the FCI (Hestenes, Wells, & Swackhamer, 1992): "for every action there is an equal and opposite reaction" (5 questions in cluster), "acceleration as change of direction" (2 questions in cluster), and "net forces-gravity as primary force" (3 questions in cluster). (A fourth cluster of questions involved "acceleration vs. velocity" (3 questions), but a printing error rendered two answer selections difficult to interpret, so the cluster was not analyzed.)

The development of understanding of each of these concepts has been the subject of study by numerous researchers (e.g., Chi, Feltovich, & Glaser, 1981; Thornton & Sokoloff, 1990). It is generally accepted that learners will go through a series of predictable stages of understanding of each of these concepts, and that memorizing the formulae that capture the physical laws does not hasten, and may even impede, true understanding of the concept (e.g., Thornton & Sokoloff, 1990). In other words, when interpreting the physical behavior of objects in motion, the application of each "law" is built around the definition of each aspect of the law. Students frequently have misapprehensions about fundamental concepts within these laws, and so they readily misinterpret them. As understanding of fundamental concepts grows, the ability to apply the laws appropriately also grows.

An excellent example of such a misapprehension is the definition of Force. The developmental course of the understanding of Force includes thinking of purposive application of force (such as pushing or pulling) as different from the forces involved in all situations in which objects move in this world, such as gravity and friction. Indeed, each aspect of Force is typically considered a separate, "little-f" force. Thus, when applying a formulaic law to a specific circumstance of an object in motion, a student whose understanding is at this stage will tend to think of "pushing" as "force", but not friction or gravity. Thus, the student reasons that a car that is rolled down a ramp is "not subject to the same force" as a car that is pushed by hand, and both are different from a car that is powered by electricity. When students are able to recognize that Force refers to all of its manifestations simultaneously, their ability to apprehend how a law such as "F=ma" applies to an object in motion undergoes a qualitative shift.

The FCI and the FMCE were developed as clusters of multiple-choice questions. Each cluster of questions provides a general scenario, such as a sled being pushed on ice. The question cluster has a single set of choices for the whole cluster. Each question within the cluster asks about a slightly different version of the general scenario. For example, one question about the sled pushed on the ice might ask, "which force would keep the sled moving toward the right and speeding up at a steady rate?" Another might ask, "Which force would keep the sled moving toward the right at a steady velocity?" Among the choices would be "The force is toward the right and is increasing in strength," "the force is toward the right and is of constant strength," and "no applied force is needed." Often, the clusters were designed so that all questions within the cluster are correctly answered by the same choice, reflecting the underlying law at work among surface differences. But students at different stages of understanding will select different answers. They may show that they have not grasped the concept of "force", or that they understand "force" fairly well but confuse "acceleration" and "velocity", or that they do not realize that "acceleration" includes slowing or reversing or other changes of direction.

Thus, what we are looking for within these clusters is evidence of a shift in understanding. Pre- to post-test answer changes may reflect an intermediate stage of concept acquisition, rather than complete comprehension. This would mean a change not only in the number of correct answers, but in the pattern of incorrect answers. For example, pre-tests may show that while some percentage of the class selects answers for a question randomly, one answer choice has a significant number of student selections. If conceptual change occurs during the course, post-tests might reflect (a) less distribution across all answer choices; (b) a shift toward the "correct" answer by a large number of students; and (c) a shift toward an answer that reflects a less-complete understanding of the concept by another cohort of students.
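Concretely, the shifts described here amount to comparing the proportion of students selecting each choice before and after instruction; the sketch below uses invented responses for a single question.

```python
from collections import Counter

# Invented answer choices for one cluster question, pre- and post-test.
pre  = list("aaaabbccddee")
post = list("aabbbbbbccde")

def proportions(answers):
    """Proportion of responses falling on each choice a-e."""
    counts = Counter(answers)
    return {c: counts.get(c, 0) / len(answers) for c in "abcde"}

pre_p, post_p = proportions(pre), proportions(post)
shift = {c: round(post_p[c] - pre_p[c], 2) for c in "abcde"}
print("pre: ", {c: round(v, 2) for c, v in pre_p.items()})
print("post:", {c: round(v, 2) for c, v in post_p.items()})
print("shift (post - pre):", shift)
```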

We compared the answer patterns on two of the clusters among all of our LBD™ and comparison teachers. The first cluster dealt with the concept of acceleration as a change in direction. The scenario had a diagram of a hockey puck sliding on a frictionless horizontal surface from Point A to Point B; on reaching Point B, the puck receives an instantaneous horizontal kick at a 90° angle. The first question required the student to choose among 5 diagrams, shown in Diagram 1, depicting possible paths that the puck would take after the kick (the correct choice is B).

Diagram 1: choices of path taken by an object following a kick perpendicular to original path.

The second question asked, "along the frictionless path you have chosen, how does the speed of the puck vary after receiving the kick?" Answer choices were as follows:

a. No change (correct answer)

b. Continuously increasing

c. Continuously decreasing

d. Increasing for awhile, then decreasing thereafter

e. Constant for awhile, then decreasing thereafter

 

 

For the first question, a common early stage of understanding has students reason that the "sudden, instantaneous" nature of the force applied toward a different direction will be the only aspect of force that still applies to the object; that "the force" that first sent the puck in one direction expires as "a new force" is applied. A more sophisticated understanding recognizes that the two directional forces will essentially be averaged. (See Thornton & Sokoloff, 1989, for a more specific description of what the selection of each choice reveals.) What we see in our data is depicted in Graphs 2 and 3 (insert figures about here).

Pre-test answers indicate that, while the majority of students’ answers are distributed evenly across all choices, choice A, which reflects partial understanding of the scenario, is the most common selection; choice B, which reflects more complete understanding, is the second-most-common choice. This is true among all classes of all teachers. However, in the post-test, we see something quite interesting. For five of the six classes, there is no major change in choice pattern: both choice A and choice B increase by a small percentage, to the detriment of other choices. However, for Teacher 5, the Honors LBD™ teacher, the choice pattern clearly shows a large change in overall class understanding of the concept; the correct choice has increased by about 30%, while the partial-understanding choice has dropped by only about 10%. Although we have not yet done multiple-regression studies to show us individual students’ choice migration, it is clear that, as a class, large numbers of students gained sophisticated understanding, and a large minority of the remainder moved from low understanding to partial understanding. Teacher 6, the Honors comparison teacher, shows a less dramatic conceptual change pattern for the class; there is comparatively more movement toward partial understanding and less toward full understanding. Meanwhile, in the average-achievement cohort, two LBD™ teachers’ classes (102, 510) seem to show little concept change; the other LBD™ teacher (101) and the comparison teacher (103) show some migration toward sophisticated understanding, but without more in-depth analysis, it is more difficult to gauge the extent of the change from one concept level to the next.

One thing we know about the two Honors teachers is that both tended to depict concepts graphically and that both provided assignments that gave students opportunities to do so. (In particular, both of the Honors teachers introduced force vectors in the class.) The other four teachers did not tend to include graphic depictions of concepts, although both LBD™ curriculum materials and those used by the comparison class included such depictions in both student and teacher materials.

Experience with such graphic depictions would not have been as directly useful for the second question of the cluster as for the first. Again, for the pre-test we see a strong tendency toward a naïve understanding of the concept of acceleration across all six teachers’ classes, reflected by the strong preference for choice D, with remaining answers distributed fairly evenly across all other choices. Again, we see that the students of Teacher 5, the LBD™ teacher, showed the largest answer migration toward the correct answer (a), from both the naïve-understanding choice and from the other choices. The Honors comparison teacher had a less dramatic increase toward the correct answer, and a very large migration toward the naïve-understanding choice. When we see such a large cluster toward a single answer, we assume that the question tends to reflect the class experience fairly directly: that the class activities and discussions reinforce a given conceptual interpretation.

Another cluster of questions that we have been analyzing focuses on understanding the relationships among the pieces of "F=ma" through the application of "equal and opposite reactions." The scenario depicted in five questions shows two vehicles moving toward one another on the same path. Three questions depict one vehicle as much heavier than the other, and two questions depict the vehicles as identical in weight. The answer choices specify which vehicle exerts the greater force on the other, and by how much. Each question varies an aspect of speed ("both move at the same speed when they collide," "one is moving much faster," "the heavier vehicle is standing still," "one of the equal-weight vehicles is standing still"). For each question, the correct choice is E: the vehicles exert the same amount of force on each other. However, the various stages of understanding of these concepts are predicted to produce an intricate pattern of responses. An incomplete understanding leads students to reason that, with vehicles of significantly unequal weights, if their speeds are the same, the heavier vehicle exerts a greater force; if the speed of the lighter vehicle is much greater, then the lighter exerts a greater force; and that there is a point at which the weight differential and the higher speed of the lighter vehicle produce a 1:1 exchange of forces.

At this point, we have not analyzed our data in great enough depth to apply such predictions directly to this population. In general, though, we have found results similar to those described for the cluster above: the students in the Honors LBD™ classes showed high migration toward the correct concept, unlike any other teacher's classes (see Table 1). The change was typically at the expense of a single response choice, suggesting not simply an accrual of knowledge from no understanding (which would appear as equal movement away from all answer choices), but a shift in qualitative understanding (see Table 2).
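The entries in Table 1 are simply the proportion of students choosing the correct answer (E) on each cluster question, before and after the unit, broken out by teacher. The sketch below shows one way such a table can be computed; it is not the project's analysis script, and the question identifiers and column names are hypothetical.

```python
# A minimal sketch, assuming one row per student per test phase, a "teacher"
# column, a "phase" column ("pre"/"post"), and one column per cluster question
# holding the letter the student selected.
import pandas as pd

CLUSTER = ["q20", "q21", "q22", "q23", "q24"]   # placeholder question ids
CORRECT = "e"                                   # correct choice for every cluster item

def cluster_gains(responses: pd.DataFrame) -> pd.DataFrame:
    """Report pre-test and post-test proportion correct, and their difference,
    for each teacher and cluster question."""
    rows = []
    for q in CLUSTER:
        correct = responses[q].str.lower().eq(CORRECT)
        by_group = (responses.assign(correct=correct)
                    .groupby(["teacher", "phase"])["correct"]
                    .mean()
                    .unstack("phase"))
        by_group["diff"] = by_group["post"] - by_group["pre"]
        by_group["question"] = q
        rows.append(by_group.reset_index())
    return pd.concat(rows, ignore_index=True)

# Hypothetical usage:
# df = pd.read_csv("cluster_responses.csv")
# print(cluster_gains(df).round(3))
```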

Table 1

"Equal and Opposite Reaction" Cluster, Change in Percent Correct Answer by Teacher

Question key:
Q1: Disparate weights, same speed at collision
Q2: Disparate weights, lighter is faster at collision
Q3: Disparate weights, heavier not moving at collision
Q4: Same weight, same speed at collision
Q5: Same weight, one "truck" not moving at collision

Teacher | | Q1 | Q2 | Q3 | Q4 | Q5
Mixed Achievement LBD™ (1) | Pre | .077 | .115 | .09 | .487 | .113
 | Post | .231 | .192 | .064 | .577 | .115
 | Diff | +.104 | +.077 | -.03 | .09 | .013
Mixed Achievement LBD™ (2) | Pre | .1 | .067 | .05 | .317 | .083
 | Post | .15 | .25 | .167 | .483 | .05
 | Diff | +.05 | .183 | .017 | .167 | -.03
Mixed Achievement Comparison | Pre | .282 | .141 | .024 | .635 | .059
 | Post | .176 | .129 | .047 | .741 | .035
 | Diff | -.11 | -.01 | .023 | .106 | -.02
Mixed Grade (6-8) Honors LBD™ | Pre | .27 | .122 | .068 | .716 | .081
 | Post | .311 | .122 | .149 | .878 | .054
 | Diff | .041 | 0 | .081 | .162 | -.03
Honors LBD™ | Pre | .25 | .219 | .031 | .844 | .031
 | Post | .802 | .604 | .667 | .979 | .604
 | Diff | .552 | .385 | .635 | .135 | .573
Honors Comparison | Pre | .167 | .389 | .056 | .833 | .056
 | Post | .056 | .389 | .056 | 1 | .056
 | Diff | -.11 | 0 | 0 | .167 | 0

The "difference" cells show the percentage of correct choices gained or lost for each of the questions, by teacher. Note that only the Honors LBD™ classes show major gains for the whole cluster (question 4 shows a ceiling effect for both Honors LBD™ and Honors comparison).

 

Table 2

Honors LBD™ Teacher, "Equal and Opposite Reaction" Cluster: Answer Migration Patterns

Response choice | Q1 pre-post difference | Q2 pre-post difference | Q3 pre-post difference | Q4 pre-post difference | Q5 pre-post difference
a | -0.45 | 0.00 | -0.04 | 0.01 | -0.02
b | 0.00 | -0.24 | -0.19 | 0.00 | -0.28
c | 0.01 | -0.03 | -0.04 | 0.00 | -0.05
d | -0.05 | -0.03 | -0.06 | -0.02 | -0.03
e | 0.55 | 0.39 | 0.64 | 0.14 | 0.57
f | -0.01 | -0.03 | 0.01 | -0.09 | 0.00
g | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
j | -0.01 | 0.01 | -0.28 | -0.01 | -0.15

The large negative entries (highlighted in the original table) indicate a large migration away from the preferred pre-test choice. Note that there is no migration for Q4, as the majority of students in the Honors LBD™ class selected the correct answer to this question on both the pre-test and the post-test.

For the third question cluster, which focused on the concept of net force, only minimal gains occurred toward sophisticated conceptual understanding: the Honors LBD™ teacher's classes showed a 10% migration toward sophisticated understanding on one of the three questions; the Honors comparison teacher's class showed a similar migration on another of the questions; and no major change was found for the other teachers. Because these questions are best considered as a cluster, minor changes on one question apiece do not in themselves constitute a pattern strong enough for interpretation. It should be noted that this concept is not emphasized in either curriculum and that no teacher spent much class time on, or assigned specific work about, this topic.

In considering our current knowledge following these preliminary analyses, we are greatly encouraged. We know that both Honors teachers were highly knowledgeable about physical science, and that the Mixed-Achievement teachers had little expertise in physics. We know that the Honors teachers were careful to present concepts for mastery, and that they tested for mastery along the way. We dare, therefore, to interpret the cluster-question differences between the Honors LBD™ and Honors comparison teachers as attributable to the difference in curriculum: teachers with expertise in their discipline and strong teaching skills can help their students make major conceptual shifts, and LBD™ provides a learning environment in which such gains are optimized.

3.2. Preliminary results from performance assessments

(3) Do LBD™ students gain greater ancillary skills that suggest they have "learned how to learn" better than their comparison counterparts?

Recall that our coding scheme for performance assessments codes on seven dimensions: negotiations during collaboration; distribution of the task; use of prior knowledge; adequacy of the prior knowledge mentioned; science talk (good use of science vocabulary and explanation); science practice; and self-checks. Science practices while designing an experiment include controlling variables and planning for an adequate number of trials. On each dimension, each group is scored on a Likert scale of 1-5, with 5 being the highest score.
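To make the structure of the coding scheme concrete, the sketch below shows one way a coding sheet for the seven dimensions could be represented. The dimension names follow the text; the class itself is illustrative and is not the project's actual coding tool.

```python
# A minimal sketch of a per-group coding record for the seven dimensions,
# each rated on a 1-5 Likert scale.
from dataclasses import dataclass, field

DIMENSIONS = [
    "negotiations",
    "distributed_effort",
    "prior_knowledge_use",
    "prior_knowledge_adequacy",
    "science_talk",
    "science_practice",
    "self_checks",
]

@dataclass
class GroupCoding:
    group_id: str
    segment: str                        # e.g., "design an experiment"
    scores: dict = field(default_factory=dict)

    def rate(self, dimension: str, score: int) -> None:
        """Record a 1-5 Likert rating for one of the seven dimensions."""
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        if not 1 <= score <= 5:
            raise ValueError("scores are on a 1-5 Likert scale")
        self.scores[dimension] = score

# Hypothetical usage:
# coding = GroupCoding("group_3", "design an experiment")
# coding.rate("self_checks", 4)
```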

When we used our coding scheme to analyze student performance, we found that for an LBD™ typical-achievement classroom versus a similar comparison classroom, there were statistically significant differences in mean scores on the distributed-effort and self-checks measures, and a nonsignificant trend on prior-knowledge adequacy. In each case, the LBD™ means were higher than those of the comparison class. For the LBD™ advanced-achievement (Honors) classroom versus a similar comparison classroom, there were significant differences on the negotiation, science practice, and self-checks measures, again with higher LBD™ means (see Table 3). LBD™ students were better than comparison students at collaborating, at maintaining meta-cognitive awareness of their practices, and at remembering and using what they had learned previously. Students in LBD™ classrooms participate in collaboration characterized by negotiation and distribution of the work. Students in comparison classrooms work in groups without taking advantage of the possibilities that arise when work is distributed or solutions are negotiated.
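The text does not state which statistical test produced the significance levels reported in Table 3. The sketch below shows one common choice for comparing per-group rubric scores between conditions (an independent-samples t-test per dimension); with so few coded groups per classroom, a nonparametric alternative such as the Mann-Whitney U test would also be reasonable. The example scores are made up.

```python
# A minimal sketch: compare 1-5 rubric scores (one value per coded group)
# between an LBD classroom and its comparison classroom on one dimension.
from scipy import stats

def compare_dimension(lbd_scores, comparison_scores, alpha=0.05):
    """Two-sample t-test on per-group rubric scores for one dimension."""
    t, p = stats.ttest_ind(lbd_scores, comparison_scores)
    return {"t": round(float(t), 2), "p": round(float(p), 3), "significant": p < alpha}

# Hypothetical self-checks scores for four LBD groups and four comparison groups:
print(compare_dimension([3, 4, 2, 3], [1, 2, 1, 2]))
```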

Table 3

Means and standard deviations for categories from performance assessment coding for LBD™ students (typical and Honors) and Comparison students (typical and honors)

Coding category | LBD™ Typical, mean (s.d.) | Typical Comparison, mean (s.d.) | LBD™ Honors, mean (s.d.) | Honors Comparison, mean (s.d.)
Negotiations | 2.50 (1.00) | 1.50 (.58) | 4.50 (.58) *** | 2.67 (.58)
Distributed effort/tasks | 3.25 (.50) * | 2.25 (.50) | 4.00 (1.15) | 3.00 (1.00)
Prior knowledge | 2.25 (.50) | 1.75 (.50) | 3.75 (1.50) | 3.0 (.00)
Prior knowledge adequacy | 2.75 (.96) | 1.50 (.58) | 3.50 (1.00) | 2.67 (1.15)
Science terms used | 2.50 (1.29) | 1.75 (.50) | 3.50 (1.00) | 2.67 (1.15)
Science practice skills | 2.75 (.96) | 2.25 (.50) | 4.75 (.50) *** | 2.67 (.71)
Self-checks | 3.00 (.82) ** | 1.50 (.58) | 4.25 (.50) *** | 2.33 (.58)

Significance levels: * = p < .03; ** = p < .02; *** = p < .01.
Means are based on the 1-5 Likert scale.

This assessment is important for several reasons. First, it tells us that the combination of scaffolding and orchestration we have developed for LBD™ succeeds in promoting the learning of process skills and practices. Second, it tells us that we are on the right track in designing the performance tasks and their coding metrics. As these become more concise, we will make them available to teachers and students as scaffolding, showing them the kinds of articulations and practices we expect them to be able to achieve. Third, it provides evidence that these "habits of mind" are being learned and transferred (Kolodner, Gray, & Fasse, submitted).

These scores are based on analysis of students' ability to design an experiment while engaging in the Rubber Meets the Road performance task described earlier. We have not yet fully coded their scientific and collaborative capabilities when carrying out an experiment or analyzing data. The same coding rubric is being used, but the science process skills particular to those two tasks are coded under the "science skills" category.

3.3. Preliminary Results on student self-assessment of skills

The Twenty Questions survey was administered multiple times in a subset of our classrooms during 1999-2000. We were able to analyze data from four of our teachers' classes. We had different predictions for the two aspects being self-assessed. For the collaborative skills, we predicted that students would (1) rate themselves highly at the beginning of the year, (2) rate themselves lower around the midpoint, once they had actual experience working as teams, collaborating, doing research, and engaging in challenging design problems, and (3) rate themselves more highly again at the end of the units, reflecting a more accurate picture of their skill development. This U-shaped developmental progression was found to be significant on several items from the survey. Students in one teacher's four classes (N = 120 students) showed significant change in their self-ratings of their ability to think about the function of the objects/artifacts they design; to identify and examine alternative solutions to problems; and to make decisions based on evidence rather than just guesses or opinions.
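The text does not specify how the U-shaped pattern was tested. One conventional way to check for it is a quadratic contrast applied to each student's start, midpoint, and end self-ratings; the sketch below is illustrative only, and the example data are made up.

```python
# A minimal sketch: a quadratic (U-shaped) contrast over three survey
# administrations. A positive contrast means the midpoint rating dips
# below the endpoints.
import numpy as np
from scipy import stats

def u_shape_contrast(ratings: np.ndarray):
    """ratings: array of shape (n_students, 3) with start, mid, end self-ratings."""
    contrast = ratings @ np.array([1.0, -2.0, 1.0])   # quadratic contrast weights
    t, p = stats.ttest_1samp(contrast, 0.0)           # two-sided test that the mean contrast differs from 0
    return {"mean_contrast": float(contrast.mean()), "t": float(t), "p": float(p)}

# Hypothetical 1-5 ratings for five students on one survey item:
demo = np.array([[4, 2, 5], [5, 3, 4], [4, 3, 5], [3, 3, 4], [5, 2, 5]])
print(u_shape_contrast(demo))
```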

For the active research skills segment of the self-assessment, we predicted that students would tend to score themselves lower at the beginning of the year and that the scores would rise over time. Our reasoning was that students would initially judge as low their ability to know where to turn for information and how and when to design experiments; as they practiced these skills, their confidence should grow. We are currently analyzing the results.

4. Discussion

While we clearly have significantly more work to do in analyzing our 1999-2000 data and our more extensive 2000-2001 data, our results so far show significant progress both in our ability to measure the acquisition of skills and practices and in LBD™'s ability to help shape students' learning of science and collaboration skills. We posed a set of questions above about the learning of skills and practices:

(3) Do LBD™ students gain greater ancillary skills that suggest they have "learned how to learn" better than their comparison counterparts? In particular:

  • do LBD™ students give more detailed explanations of their answers (in prompted situations, or spontaneously)?
  • do LBD™ students show greater ability to transfer knowledge?
  • do LBD™ students reference classroom experiences more often?
  • do LBD™ students understand scientific methodology issues better?

We have shown little yet in the way of an answer to the first of the sub-questions, but our evidence suggests an answer of "yes" to the other questions. We are, of course, mindful that we should not interpret our data too broadly, but framing our results in the context of what is known about measuring and developing scientific reasoning gives us grounds for interpreting them strongly.

We define scientific reasoning broadly as the ability to investigate (e.g., design an experiment and collect data) and to make inferences from the resulting evidence (Kuhn & Pearsall, 2000). The scientific reasoning literature points out that such skill capabilities are quite difficult to measure (Zimmerman, 2000). The rubrics we have developed, however, seem to differentiate between student groups that practice scientific reasoning well and those that are less capable. Even with our small sample size, multiple coders were able to use our rubric such that standard deviations were quite small when differentiating student capabilities along many of our dimensions.
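The agreement check described above amounts to looking at the spread of scores across coders for each group and dimension. The sketch below shows one way to tabulate that spread; the data layout (one row per coder per group per dimension) and column names are assumptions, not the project's actual files.

```python
# A minimal sketch: mean and standard deviation across coders for each
# group/dimension pair; small standard deviations indicate coder agreement.
import pandas as pd

def coder_spread(codes: pd.DataFrame) -> pd.DataFrame:
    """Return mean and standard deviation of scores across coders."""
    return (codes
            .groupby(["group_id", "dimension"])["score"]
            .agg(["mean", "std"])
            .reset_index())

# Hypothetical usage:
# codes = pd.read_csv("performance_codes.csv")  # columns: coder, group_id, dimension, score
# print(coder_spread(codes).sort_values("std", ascending=False))
```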

According to our analysis, LBD™ students use scientific reasoning and collaboration skills more successfully than comparison students. We suspect this is because LBD™ foregrounds the learning of those skills, providing repeated practice with them and providing, as well, opportunities for reflecting on those practices and discussing how to carry them out successfully.

On the other hand, our performance assessments show that LBD™ students don't carry out practices as well as we had hoped. While they outscore comparison students, only the Honors students score very high, and then only on a subset of the dimensions; all groups perform at a lower level than what we observe in the classroom. The scientific reasoning literature suggests reasons why. Empirical evidence on the development of scientific reasoning has shown that "it may not develop fully even by adulthood" (Kuhn et al., 1995, p. 115). As Siegler (1996) has noted, developmental change often proceeds more like waves than the stairstep models implied by many stage theories. Changes in children's thinking emerge in one context but not another; for example, a child may use an addition strategy spontaneously in one setting but fail to use it in a different setting where it would be appropriate. This early, fragile competence followed by regression to previous strategies or earlier understandings has been noted in a variety of cognitive tasks, including problem solving and scientific reasoning (e.g., Kuhn & Phelps, 1982; Schauble, 1990). Noted as well is that when learners are reminded of a strategy they might use, they can apply it (Schauble, 1990).

Looking at Table 3, we can see that our Honors students, who scored more highly than others on all dimensions, showed a trend toward being better able to remember relevant prior knowledge. This may be a significant reason why they were better able to carry out the science and collaboration practices we measured. Preliminary analysis of our structured interview data seems to be pointing in this direction as well: students would perform far better on the performance tasks if a facilitator joined them as they carried out the tasks and helped them remember what skills and knowledge they might apply. In our structured interviews in 2000-2001, we asked students to manipulate a simulation to answer several questions relevant to designing a balloon car. Some students ran tests in the simulation in which they changed several variables at one time, showing, on the surface, little understanding of the skill of designing an experiment. When asked to explain the result they got from a trial run, however, they recognized the need to run their simulation trials again, this time testing only one variable at a time. The science practice of controlling for interacting variables is emerging and still fragile; it is not used spontaneously when appropriate, but with prompting, students are able to correct themselves and recall the practices they had learned. Other researchers have observed similar capabilities (Kuhn & Phelps, 1982; Schauble, 1990). We are currently analyzing the full set of structured interviews from this year's implementations. We will add this to our interpretation of our other quantitative findings, and we predict that we will find significant differences between students in our LBD™ classrooms and their comparisons.

What about learning of content? Our analyses so far are not conclusive about the role LBD™ plays in learning of content knowledge, though LBD™ students gained at least as much as comparison students and sometimes significantly more. We need to follow up this analysis with analysis of the deep-learning questions: those that elicit explanatory answers and those that look at the clustering of developing understanding.

Overall, a more comprehensive analysis of a greater volume of data is needed.

5. Future Work: Embedding Assessment for Program Evaluation and Class Scaffolding

In the introduction, we said that one of the goals of the LBD™ project is to develop science units based on what we know about how students learn and retain that learning. One of the most important aspects of helping students learn is providing appropriate scaffolding so that students can form bridges from less sophisticated to more sophisticated understanding. But to form such bridges, we must have ways of ascertaining where the student is beginning, as well as knowing where we wish the student to end up. Traditionally, classroom assessment is a capstone experience: it ascertains where the student ended up.

Among the criteria we have used in designing our program evaluation and research tools, we have always included the need to adapt assessment techniques so that they can be used by teachers, administrators, and students themselves as part of the classroom experience. By embedding assessment in the curriculum itself, we seek to give teachers an understanding of what students still struggle with, as well as what students may be able to bring to bear from related experiences.

The first step, of course, has been to develop and test instruments and rubrics, and this has dovetailed with the requirements of the program evaluation effort. Next, we consider what we intend the curriculum to scaffold and which instrument might be useful to the class for assessing those specific needs. Adaptation and adoption of the various instruments described herein is at a very early stage. We have included the Twenty Questions survey as part of the teacher materials, and we intend to embed the survey in the student materials in a future unit revision. Currently, teachers administer the survey only for program evaluation purposes, but we have begun developing ways for teachers to employ the surveys directly in their assessment of class needs.

We have also been working on methods to integrate performance assessment tasks into the class units. Project-based approaches are natural settings for assessing through doing. Indeed, the investigation cycle that is used throughout our units is easily adapted to performance assessments. In the investigation cycle, students are given a problem for which they are to test a series of solutions and recommend one. Performance assessments, too, are based on a problem and an activity session during which data is collected to address the problem. In the investigation cycle, data gathering and interpretation are followed by critical examination of proposed solutions; performance assessment scoring rubrics can be used to provide feedback directly to students on their own proposed solutions and explanations. The rubric can be used to orchestrate a discussion of the target concepts and skills, so that students become more aware not only of what the better solutions are, but also of the importance and method of using data to explain why a given solution is better. Giving students access to videotaped performance assessment sessions and evaluation rubrics allows them to self-assess their own science practices and collaborative skills; the rubric articulates expectations and keys them to observable actions, so that students have specific actions that they can aim to try or to change.
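To illustrate the shape such rubric-based feedback might take, the sketch below maps low rubric scores to discussion prompts for a design group. The prompts, thresholds, and dimension names are hypothetical; this is not a tool the project has built, only an indication of the kind of scaffolding the text describes.

```python
# A minimal sketch: turn a group's rubric scores into discussion prompts
# for the dimensions that scored below a chosen threshold.
FEEDBACK_PROMPTS = {
    "science_practice": "Which variables did you hold constant, and how many trials did you plan?",
    "self_checks": "At what points did your group stop to check its procedure?",
    "negotiations": "How did you decide between competing ideas for the design?",
}

def feedback_for(scores: dict, threshold: int = 3) -> list:
    """Return prompts for dimensions scored below the threshold (1-5 scale)."""
    return [prompt for dim, prompt in FEEDBACK_PROMPTS.items()
            if scores.get(dim, 0) < threshold]

# Hypothetical usage:
print(feedback_for({"science_practice": 2, "self_checks": 4, "negotiations": 3}))
```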

As we develop structured interviews, we are selecting tasks with qualities that make them easy for teachers to administer: they should be easy to perform one-on-one, use little class time, be unobtrusive so as not to distract the class, come with clear guidelines on the types of questions to ask and the extent of follow-up, and explain how to assess responses. We are beginning to consider innovative applications of computer modeling environments as well as teacher-run interviews.

If we are to replace capstone-type evaluations with embedded assessments, we need to include formats that are familiar to students, teachers, administrators, and other stakeholders. However, the material contained within the familiar format may be more suited to assessment for scaffolding purposes. We are beginning to work with teachers to help them learn to identify aspects of learning that they want to assess for scaffolding purposes, and then to use traditional formats such as written assignments and quizzes as feedback for scaffolding efforts. We are providing teachers with information about resources such as the FMCE (Thornton & Sokoloff, 1998) and the FCI (Hestenes, Wells, & Swackhamer, 1992) so that they can learn to adapt or develop test items that help pinpoint the developmental level of a target concept.

Finally, we seek ways to help teachers assess the classroom culture, and especially their own actions and habits within the classroom. We are working to adapt the OPT and the Fidelity of Implementation Report Card for teachers' use as self-assessment surveys. Helping teachers self-assess will be especially important if LBD™ is disseminated more widely, when teacher support and feedback will come from peers and administrators instead of directly from the LBD™ development team.

Our goal in assessing and scaffolding learning in a large-scale curriculum implementation has been to address the variability we find across classrooms attempting the same curriculum units. Our ultimate evaluation results will provide a mosaic of multiple data sources that can help us document and account for that variability. We believe that multiple measures of the classroom environment and of the learning outcomes complement one another. This approach thus affords capturing the actual complexity and variability of classroom environments, even those sharing the same teaching and learning goals and the same curriculum materials.

Taken together, these efforts constitute a way to capture the complexity in real classrooms attempting serious reform. Acknowledging the variation and unique aspects of each classroom has challenged us to develop the approaches reported here. Our next step is to translate our research efforts into a format that allows teachers and students to develop ever more autonomy in doing embedded assessments throughout their project-based work. We are hopeful that we have established a framework for doing so on what we consider sound theoretical and practical considerations. It is only by being in the classroom and collaborating with our teachers that we feel positioned to embark confidently on this next step.

 

Footnotes

1Acknowledgement and thanks to Dr. Steve Cole for statistical consulting and to Dr. Daniel Hickey for advice on designing, administering, and coding performance assessments.

2Additional examples will be published in Holbrook, Fasse, and Camp (in preparation).

3A few items are scored as full, partial, or no-credit items.

4 The small number of target science items is the result of an administration error by the comparison teacher, who administered only the odd-numbered pages of a two-sided test to her classes.

5 Three target science content items were "explanatory answer" questions rather than multiple choice; one science practices item was "explanatory answer".

6 The LBD™ teacher taught five class periods of Honors Physical Science, but the Comparison teacher only taught two, and only one of these classes participated in assessment. To provide roughly equal ns for analysis, each LBD™ class was separately compared to the comparison class. The LBD™ class with the median gain is presented here; the range of pre-to-post changes was 4.1-6.09 pts. gained.

7 The exception was Life Sciences, an artifact of the comparison teacher's class scoring significantly lower from pre-test to post-test on these questions. Given the small number of items and the fact that Life Sciences is part of the 7th grade curriculum rather than the 8th grade curriculum, it seems likely that this was an unimportant aberration.

8 We will be writing up our results in this area as Gray, Holbrook & Ryan (in preparation).

References

Chi, M.T.H., Feltovich, P.J., & Glaser, R. (1981). Categorization and representation of physics problems by experts and novices. Cognitive Science, 5 (2), 121-152.

Crismond, D., Camp, P.J., & Ryan, M. Design Rules of Thumb-connecting science and design. Symposium talk presented at the 2001 Annual Meeting of the American Education Research Association, Seattle WA, April 2001 www.cc.gatech.edu/projects/lbd/Conference_Papers/2001_conference_index.html

Fasse, B. & Kolodner, J.L. Evaluating classroom methods using qualitative research methods: Defining and refining the process. In Proceedings of the Annual Meeting of the International Conference of Learning Sciences, Ann Arbor MI, June 2000. http://www.umich.edu/~icls/

Fasse, B. Gray, J; & Holbrook, J.K. (1999). Observation prompt tool (OPT). Learning By Design™ project document. Georgia Institute of Technology, Atlanta, GA.

Fasse, B., Holbrook, J.K. & Gray, J. (1999). Fidelity Report Card (FRC). Learning By Design Project document. Georgia Institute of Technology, Atlanta, GA.

Geertz, C. (1983). Thick description: Toward an interpretive theory of culture. In R.M. Emerson (Ed.), Contemporary Field Research (pp. 37-59). USA: Waveland Press.

Goetz, J.P. & LeCompte, M.D. (1984). Ethnography and qualitative design in educational research. Orlando, FL: Academic Press.

Gray, J., Camp, P.J., Holbrook, J.K., & Fasse, B.B. (2001). Science talk as a way to assess student transfer and learning: Implications for formative assessment. Symposium talk presented at the 2001 Annual Meeting of the American Educational Research Association, Seattle, WA, April 2001. www.cc.gatech.edu/projects/lbd/Conference_Papers/2001_conference_index.html

Gray , J. & Holbrook, J. (in preparation). Student self-assessment in a project based classroom. Georgia Institute of Technology.

Gray, J., Holbrook, J. & Ryan, M. (in preparation). Assessing student knowledge of science through simulation tasks. Georgia Institute of Technology.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force Concept Inventory. The Physics Teacher, 30, 159-165.

Holbrook, J.K., Fasse, B., Gray, J., & Camp, P. (in preparation). How the quality of science culture in science class predicts student learning outcomes. Georgia Institute of Technology, Atlanta, GA

Kolodner, J.L. (1993). Case-Based Reasoning. San Mateo, CA: Morgan Kaufmann.

Kuhn, D. and Pearsall, S. (2000). Developmental origins of scientific thinking. Journal of Cognition and Development, 1, 113- 127.

Kuhn, D. & Phelps, E. (1982). The development of problem-solving strategies. In H. Reese (Ed.), Advances in child development and behavior, Vol. 17. New York: Academic Press.

Lincoln, Y.S. & Guba, E.G. (1985). Naturalistic inquiry. CA: Sage.

PALS website (1999). Where the Rubber Meets the Road performance assessment task, developed by the Kentucky Department of Education. SRI International, Center for Technology in Learning. http://www.ctl.sri.com/pals/index.html

Schauble, L. (1990). Belief revision in children: The role of prior knowledge and strategies for generating evidence. Journal of Experimental Child Psychology, 49, 31-57.

Siegler, R. S. (1996). Emerging minds: The process of change in children's thinking. New York: Oxford University Press.

Spradley, J.P. (1980). Participant observation. NY: Holt, Rinehart & Winston.

Thornton, R.K. & Sokolofff, , D.R. (1998) Assessing student learning of Newton’s laws, The Force and Motion Conceptual Evaluation and the evaluation of active learning laboratory and lecture curricula. American Journal of Physics, 66, 338-352.

Wiggins, G.P. (1993). Assessing student performance: Exploring the purpose and limits of testing. San Francisco: Jossey-Bass.

Zimmerman, C. (2000). The development of scientific reasoning skills. Developmental Review, 20, 99-149.

 

Acknowledgments

This research has been supported by the National Science Foundation (ESI-9553583), the McDonnell Foundation, and the BellSouth Foundation. The views expressed are those of the authors.

 

 

Appendix 1: Performance Assessment tasks: Coding for science practice
DRAFT, do not use without permission

Additional notes are fine and can be recorded on the coding sheet.

Please note which event segment is being coded for each episode:

planning an experiment; problem set up; experimental manipulation; response to written questions.

In general, the 5-point Likert scale reflects the following quantitative continuum; details for each item are also included below. (A rough mapping from the proportion of episodes exhibiting a quality onto this continuum is sketched after the anchors.)

1 = Not at all: no evidence of the quality to be rated

2 = Some evidence that at least one episode or one student exhibits the quality rated

3 = The quality is exhibited half the time

4 = The quality is exhibited for more than half the episodes

5 = The quality completely captures the nature of the episodes
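The helper below is only an illustrative translation of the continuum above into code; the cut points follow the wording of the anchors, and coders' qualitative judgments on the item-specific descriptors below take precedence.

```python
# A minimal sketch: map "episodes exhibiting the quality" out of "total
# episodes" onto the general 1-5 continuum described above.
def likert_from_proportion(exhibited: int, total: int) -> int:
    if total == 0 or exhibited == 0:
        return 1                      # no evidence of the quality
    share = exhibited / total
    if share >= 1.0:
        return 5                      # the quality completely captures the episodes
    if share > 0.5:
        return 4                      # more than half the episodes
    if share == 0.5:
        return 3                      # half the time
    return 2                          # some evidence (at least one episode)

# print(likert_from_proportion(3, 6))  # -> 3
```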

Design an experiment segment:

Within an episode, the context of the group is characterized by:

Negotiations
1 = Not at all
2 = At least one member of the group suggests a compromise about some aspect of the procedure
3 = At least one member of the group suggests that compromise or debate is needed for at least half the issues that require it
4 = At least two members of the group question several aspects of the procedure and the group makes the needed changes
5 = Most decisions about the procedure are made with the entire team contributing, and decision making is consensual

Distributed efforts and tasks
1 = Not at all
2 = At least one member of the group suggests that others help do the task
3 = At least two members of the group suggest that all do some part of the task
4 = At least one member of the group suggests and leads the group in dividing and doing the task
5 = More than one member of the group enlists the participation of the whole team in doing the task

 

 

Level of understanding of the problem
1 = Not at all
2 = The group thinks the task is to write something down, disregarding the "design" aspect
3 = At least two members of the group try to work out a method and "run an experiment" with the material available
4 = At least one member of the group recognizes that an experiment is to be designed and shares this with the other members
5 = More than one member of the group enlists the participation of the whole team in designing an experiment and recognizes that it calls for additional materials

Use of materials to get to a method
1 = Not at all
2 = At least one member of the group manipulates the material(s) while trying to develop a solution
3 = At least two members of the group examine and use the material in a way that might suggest an effort to discover a method
4 = At least two team members manipulate the material to explicitly suggest a method
5 = The team explores the material, as if messing about, to understand what to include in their design/method

 

 

Prior knowledge is defined as students referring to some aspect of the curriculum unit that relates to the current problem; referring to some aspect of a personal experience that seems to relate to the current problem; referring to some aspect of the science concept or method at issue that appears to come from previous exposure to the concept or skill.

Students show evidence of using prior knowledge to solve the problem
1 = Not at all
2 = At least one member of the group mentions a prior event or concept that relates to the problem
3 = At least half the team mentions a prior event or concept that relates to the problem
4 = Several events and concepts are mentioned and applied to the problem
5 = The group routinely recalls events or concepts that assist in their collaborative problem solving

 

Prior knowledge seems adequate
1 = Not at all
2 = At least one of the mentions of prior knowledge is followed up on and is useful
3 = At least half the mentions of prior knowledge are appropriate to the problem
4 = More than one member of the group mentions or follows up on events or concepts that are useful
5 = Every mention of prior knowledge is directly applicable to the problem

 

 

 

 

Science terms are credited when they are used in a way that indicates some degree of understanding, and when it can be argued that their use is not simply an echo of the science terms included in the problem description.

Students use science terms to discuss the problem solution
1 = Not at all
2 = At least one member of the group relates the discussion to some science concept
3 = At least half the team relates the discussion to some science concept
4 = Most of the team members use science concepts or terms in a way that shows accurate understanding and application
5 = All members of the team use science concepts or terms in a way that shows accurate understanding and application

Students use science practice to decide on the method/procedures
1 = Not at all
2 = At least one member of the group suggests a method to test at least one variable
3 = At least one member suggests a method and indicates an understanding of fair testing
4 = At least one member suggests a method and indicates an understanding of fair testing and controlling for variables
5 = Most of the team agrees that the method will fairly test the important variables, and their decisions would actually constitute a reasonable experiment

The episodes are characterized by group self-checks on procedures
1 = Not at all
2 = At least one member of the group questions some aspect of the procedure
3 = At least one member of the group questions some aspect of the procedure and makes the needed change
4 = At least one member of the group questions several aspects of the procedure and the group makes the needed changes
5 = More than one member of the group questions several aspects of the procedure and the group makes the needed changes