Development and Implementation of a Four-Tier Close-Ended Test to Analyze Students' Misconceptions of Optical Instruments

The research aims to develop a four-tier test for optical instrument materials. The method used in this study is a 4D design that includes defining, designing, developing, and disseminating. The instrument used consisted of fifteen items in the form of a four-tier closed-ended test. The research participants were 60 female and 15 male students from West Java in grade 11 high school who were randomly selected. The analysis is divided into four parts. The first analysis is a CVR and multi-rater Rasch measurement of the original validation results. The second analysis involves calculating the percentage of students' scores based on their conception scores. The third is a Rasch Model analysis of the instrument's validity and reliability. The Rasch Model is used in the fourth analysis to examine conceptions and misconceptions. Following the analysis, all items met the CVR value criteria. I2, I7, I9, I10, I12, I15

Misconceptions can be influenced by factors such as students' pre-existing knowledge, teachers' prior understanding, textbooks, the learning environment, and inaccuracies in transcribing terminology (Jauhariyah et al., 2018; Kocakulah & Kural, 2010). According to Oberoi (2017), misconceptions can arise due to insufficient knowledge about concepts, textbook confusion, linguistic ambiguities, or overgeneralization. Misconceptions may also stem from learning strategies, students' initial information, the challenge of connecting one concept to another, the content presented in textbooks, and the influence of language and media. According to Oberoi (2017) and Kaltakçi & Didiç (2007), misconceptions among students can be caused by a variety of factors, including both the students themselves and their educational environment.
Misconceptions can be identified through the use of diagnostic tests, which are assessment tools designed to pinpoint challenges or unresolved issues that learners may encounter in the learning process (Fariyani et al., 2017; Gurel et al., 2015; Pertiwi & Setyarsih, 2015; Rosita et al., 2020). Diagnostic tests are also characterized as assessments designed to identify students' weaknesses, facilitating the implementation of appropriate measures to address those areas of concern (Ismail et al., 2015; Rosita et al., 2020). Interviews, concept maps, open tests, multiple-choice tests, two-tier multiple-choice tests, three-tier tests, and four-tier tests can all uncover students' misconceptions. Interviews, open-ended tests, and multiple-choice tests are commonly employed in physics education research. Nevertheless, each diagnostic test instrument comes with its own set of advantages and disadvantages when compared to others (Gurel et al., 2015; Kaltakci-Gurel et al., 2017).
Open-ended tests or descriptive assessments can prompt students to contemplate a concept for an extended period, articulate their thoughts in writing, uncover misconceptions in problem solving, and assist students in overcoming learning difficulties. The open-ended test format enables respondents to express their answers in their own words and allows the administration of a broader sample than interview tests (Gurel et al., 2015; Kaltakci-Gurel et al., 2017; Zhou et al., 2016). Nevertheless, tests in the form of descriptive responses come with practical limitations related to language use. Students often show little enthusiasm for answering in complete sentences, and the results take more time to analyze and assess (Bautista & Boone, 2015; Kaltakci-Gurel et al., 2017; Kaltakçi & Didiç, 2007).
The four-tier test is structured with four levels. The first level involves answer choices (multiple-choice), the second level involves indicating the level of confidence for the answers selected in the first level, the third level entails choosing reasons (multiple-choice) for the answers selected in the first level, and the fourth level involves indicating the confidence level for the reasons selected in the third level (Kaltakci-Gurel et al., 2017). In the second and fourth levels, the expressions of confidence typically involve categorizations such as "sure" and "not sure." The four-tier test represents an advancement over a comparable diagnostic test with a three-tier format comprising only three components. The three-level diagnostic test, in turn, enhances a two-level diagnostic test. Incorporating reasons for selecting answers is a notable improvement (Anggrayni & Ermawati, 2019; Hermita et al., 2017). Therefore, the four-level multiple-choice diagnostic test is the most accurate for detecting misconceptions (Afif et al., 2017; Anggrayni & Ermawati, 2019; Hermita et al., 2017). The comprehension levels of the four-tier test reveal that students' conceptions can be categorized into six distinct conceptual levels according to the level of understanding conveyed by Coştu (2008), the assessment by Kaltakci-Gurel et al. (2017), and the concept category of Amalia et al. (2019). Based on literature studies, students still have misconceptions about some physics concepts (Coetzee & Imenda, 2012; Kocakulah & Kural, 2010; Kucukozer, 2010; Rohmanasari & Ermawati, 2020; Salamah et al., 2017; Salmadhia et al., 2021; Umar et al., 2021

METHOD

Research Design
The research method followed the 4D model, encompassing the stages of defining, designing, developing, and disseminating (Thiagarajan et al., 1974). The define stage, a literature study on optical instrument misconceptions, was completed first. The subsequent design stage involved establishing a construction distribution for each item, designing content for each item, and implementing a four-tier test. In this test, the first, second, and fourth tiers use closed-ended questions, while the third tier adopts an open-ended format.

Participants
The participants involved in this study were 75 high school students in grade 11 in West Java. The students consisted of 60 female students and 15 male students. Random sampling was used to select the students.

Instruments
The research employed a four-tier closed-ended instrument comprising fifteen items related to optical instruments, including topics such as cameras, eyes and eyeglasses, magnifying glasses, microscopes, and binoculars. The first tier presents a concept-based question, while the second tier gauges students' confidence in responding to the initial question. The third tier requires students to provide reasons or explanations for their answers to the first tier. Lastly, the fourth tier assesses the confidence level associated with the explanations provided in the third tier. All questions in the test are closed-ended and take the form of multiple-choice queries.

Data Analysis
The analysis conducted in this study comprises four stages. In the first stage, the expert validation results were analyzed using CVR and multi-rater Rasch measurement. The experts evaluated the validity of each item, determining whether it is valid without revision, valid with revision, or invalid. Equation 1 was applied to calculate CVR.

$$\mathrm{CVR} = \frac{N_e - N/2}{N/2} \qquad (1)$$

Description: $N_e$ = the number of validators who provide valid assessments; $N$ = the total number of validators.

The instrument is deemed valid if the calculated CVR result surpasses the minimum CVR value, as determined by the Schipper Table (Wilson et al., 2012). Table 1 shows the minimum CVR values for the various validator counts.

(Amalia et al., 2019; Aminudin et al., 2019; Coştu, 2008; Kaltakci-Gurel et al., 2017).

The third analysis stage evaluates the four-tier optical instrument questions developed using the Rasch Model:

$$P(X_{ni} = 1 \mid \beta_n, \delta_i) = \frac{e^{\beta_n - \delta_i}}{1 + e^{\beta_n - \delta_i}}$$

where $P(X_{ni} = 1 \mid \beta_n, \delta_i)$ is the probability of respondent n on item i producing the correct answer ($X_{ni} = 1$), given the respondent's ability ($\beta_n$) and the item difficulty level ($\delta_i$) (Sumintono & Widhiarsho, 2015).
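The dichotomous Rasch model referred to above can be illustrated with a short sketch. It assumes that ability and difficulty are both expressed in logits; the function name and the numeric values are hypothetical and are not taken from the study's data.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the dichotomous Rasch model.

    Both arguments are in logits. The expression below is algebraically
    the same as exp(b - d) / (1 + exp(b - d)).
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values, for illustration only.
p_matched = rasch_probability(ability=0.0, difficulty=0.0)
p_hard_item = rasch_probability(ability=0.0, difficulty=1.0)
print(f"ability equals difficulty: {p_matched:.2f}")
print(f"item one logit harder:     {p_hard_item:.2f}")
```

When a respondent's ability equals the item difficulty, the probability of success is one half; this is the reading that underlies the person-item comparison on a Wright map.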
The instruments were subjected to data analysis, focusing on students' conception scores for each item. Instrument analysis was employed to assess the items' validity, reliability, and difficulty level. Rasch analysis served as the methodology for instrument analysis. The instrument's validity is gauged by evaluating the appropriateness of each item. Item validity is determined through the output of the Item Fit Order, considering outfit mean square (MNSQ), outfit Z-Standard (ZSTD), and point measure correlation (PT MEASURE CORR).
Additionally, the unidimensionality output, indicating the raw variance explained by measures, is utilized to ascertain instrument validity.
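The item-fit screening described above can be sketched as a simple range check. The Outfit MNSQ range (0.5 to 1.5) and Outfit ZSTD range (-2 to +2) are the ones quoted in this paper; the PT MEASURE CORR range of 0.4 to 0.85 is an assumption drawn from common Rasch practice, since the text does not state it. The class name, function names, and the sample item are hypothetical.

```python
from dataclasses import dataclass

OUTFIT_MNSQ_RANGE = (0.5, 1.5)      # quoted in the text
OUTFIT_ZSTD_RANGE = (-2.0, 2.0)     # quoted in the text
PT_MEASURE_CORR_RANGE = (0.4, 0.85)  # ASSUMPTION: not stated in the text


@dataclass
class ItemFit:
    name: str
    outfit_mnsq: float
    outfit_zstd: float
    pt_measure_corr: float


def in_range(value: float, bounds: tuple) -> bool:
    low, high = bounds
    return low <= value <= high


def fit_report(item: ItemFit) -> dict:
    """Return which of the three fit criteria the item satisfies."""
    return {
        "outfit_mnsq": in_range(item.outfit_mnsq, OUTFIT_MNSQ_RANGE),
        "outfit_zstd": in_range(item.outfit_zstd, OUTFIT_ZSTD_RANGE),
        "pt_measure_corr": in_range(item.pt_measure_corr, PT_MEASURE_CORR_RANGE),
    }


def criteria_met(item: ItemFit) -> int:
    """Count how many criteria are satisfied."""
    return sum(fit_report(item).values())


# Hypothetical item statistics, not taken from Table 6.
example = ItemFit("IX", outfit_mnsq=1.02, outfit_zstd=0.3, pt_measure_corr=0.21)
print(fit_report(example), criteria_met(example))
```

Reporting each criterion separately, rather than a single pass or fail flag, makes it possible to retain an item that fails only one criterion, which is the kind of decision a per-criterion table supports.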
Rasch analysis was also employed to assess the reliability of the instruments, yielding results such as person reliability, item reliability, and Cronbach's Alpha. Person reliability gauges the consistency of students' responses, while item reliability reflects the quality of the instrument's items. Cronbach's Alpha provides an overview of the overall interaction between individuals and items.
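For readers who want to see what Cronbach's Alpha measures, the sketch below computes it directly from a raw score matrix using the textbook formula. This is an illustration under stated assumptions, not the procedure used in the study, where the value was presumably reported by the Rasch software; the function name and the tiny score matrix are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores: list) -> float:
    """Cronbach's Alpha for a persons-by-items score matrix.

    scores[p][i] is the score of person p on item i.
    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    n_persons = len(scores)
    n_items = len(scores[0]) if scores else 0
    if n_persons < 2 or n_items < 2:
        raise ValueError("need at least two persons and two items")
    item_variances = [variance([row[i] for row in scores]) for i in range(n_items)]
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical scores: three persons, two perfectly consistent items.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```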
In the fourth analysis, students' misconception scores on each item were scrutinized using Rasch analysis. Output tables, such as output variable maps (Wright maps), were utilized in this analysis to interpret the findings.

RESULT AND DISCUSSION
The focus is on exploring alternative conceptions derived from students' responses in the transition from the four-tier open-ended test to the four-tier closed-ended test on optical instruments. The subsequent sections provide a detailed discussion of the stages of development (define, design, develop, and disseminate) and the associated analysis within the framework of the 4D model.

Define
The define stage is a review of the literature on optical instrument misconceptions. This stage is used to locate research sources. A literature review of misconceptions about optical instruments is performed for each sub-material. Eyes, cameras, eyeglasses, magnifying glasses, microscopes, and binoculars are the optical instrument materials investigated. Based on the literature review and the prediction of optical instrument misconceptions that students may hold, a four-tier test can be developed. According to the literature studies, the misconceptions that occur among students regarding optical instruments are detailed as follows (Munawaroh et al., 2016; Kaniawati et al., 2020; Salmadhia, Rusnayati, & Liliawati, 2021).
Table 3. Students' Misconceptions about Each Sub-material.

Sub-material: Students' Misconceptions

Eyes, eyeglasses, and camera
The near point (PP) of hyperopia is farther than the normal eye, so objects must be placed closer than 25 cm.
The larger the field diameter of a lens, the more light comes in, so the image gets bigger.
The older you get, the better your eyes' accommodation power becomes.
The pupil in the human eye has the same function as the diaphragm, regulating the intensity of incoming light.
The near point of the eye of a myopia sufferer is closer than the near point of a normal eye (PP < 25 cm).
The camera's distance to the object is closer when the camera is in a landscape position than in a portrait position.

Magnifying glass
If the loupe lens is partially closed, the image formed is half of the object.
A convex lens is used as a loupe because it spreads light so that the image of an object is enlarged from its original size.
The strength of the loupe is not affected by the medium in which the loupe is used, meaning that the strength of the loupe in air and water is the same.

Microscope
The function of a microscope is to see small objects so that they appear large and clear.
Observations using a microscope with maximum accommodation occur when the image formed by the objective lens is exactly in the focus of the eyepiece lens.
Concave mirrors and convex lenses have the property of scattering light.

Binoculars
All lenses on stage binoculars are convex lenses that can collect light.
In unaccommodated observations, the magnification of the image from a star telescope is influenced by the length of the telescope tube.
The more lenses a binocular has, the greater the angular magnification.

Design
The instrument design stage requires distributing the construction of questions for each sub-material of optical instruments, designing content for each item, and designing questions in the form of a four-tier open-ended test. The first tier is a regular multiple choice, while the second tier is about confidence, with two options: "sure" and "unsure." The third tier of the four-tier open-ended test offers no options, so students must write in their own reason, as in Figure 2a. The fourth tier, on confidence, mirrors the second tier. After alternative conceptions are obtained from students' answers at the third tier, the four-tier close-ended test is designed as in Figure 2b. The distribution of the question construction contains the form of questions and choices in the first tier. Each set of questions and choices in the first tier can take the form of statements, pictures, or tables, as shown in the item construction distribution in Table 4. In the construction distribution, the eye, camera, and eyeglasses are grouped into one sub-material because they work similarly. This sub-material also has the most questions compared to the sub-materials of magnifying glasses, microscopes, and binoculars. This aligns with Kaniawati et al. (2020) and Munawaroh et al. (2016), who report that the most common misconceptions are in the eye, camera, and eyeglasses sub-materials.

open answers. Furthermore, the alternative concept obtained will be an option on the third tier of the four-tier closed-ended instrument. Figure 3 shows an example of a third-tier change from open to closed. In the four-tier closed-ended test's Question 6.3, the answer choices represent reasons given by students. Each reason is categorized, modified, and used as a response option. This approach is highly beneficial for probing student responses. According to Caleon & Subramaniam (2010), Gurel et al. (2015), and Kaltakci-Gurel et al. (2017), reasoning can differentiate between a correct answer due to the right rationale (scientific concept) and a correct answer based on flawed reasoning (false positive).

After all items are converted to a four-tier closed-ended format, expert validation is conducted with input from five validators. The validators assessing the four-tier closed-ended test include three physics education lecturers, a teacher, and a researcher in the same field. The evaluation conducted by the validators encompasses nine aspects: 1) Items are crafted based on misconceptions; 2) Consistency of the concepts in the questions with those advanced by the experts; 3) Items are designed to assess the understanding of students' concepts; 4) Utilization of language that adheres to the rules of the Indonesian language; 5) Language used is accessible and understandable for students; 6) Answer choices and reasons exhibit homogeneity and logical alignment with the material; 7) There is only one correct answer key; 8) Questions do not provide hints or clues to the correct answer; 9) Answer choices do not include statements like "all answers are correct" or "answers are wrong." The results of the validation analysis utilizing CVR are presented in Table 5.
As per Table 5, all items exhibit an average CVR value greater than or equal to 0.736. Given that the minimum CVR value with five validators is 0.736, it can be concluded that the questions are valid and can be used (Wilson et al., 2012). The average CVR for each item therefore indicates that all items passed expert validation. However, when each item is reviewed aspect by aspect, the CVR values on items 3, 10, 11, and 15 need to be improved. Item 3 needs improvement on assessment aspects 2, 3, and 8. Items 10 and 11 only require refinement on the aspect that questions do not provide hints of the correct answer. Item 15 was revised in response to the comments on assessment aspect 3.
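The CVR screening discussed above can be sketched as follows. The sketch assumes that Equation 1 is Lawshe's content validity ratio, CVR = (Ne - N/2) / (N/2), and that an item's average CVR is the plain mean over the nine assessment aspects; the threshold 0.736 for five validators is the value quoted in the text. The function names and the sample counts are hypothetical.

```python
MIN_CVR_FIVE_VALIDATORS = 0.736  # minimum for N = 5, as quoted in the text


def cvr(n_valid: int, n_total: int) -> float:
    """Lawshe's content validity ratio (assumed form of Equation 1)."""
    if n_total <= 0 or not 0 <= n_valid <= n_total:
        raise ValueError("need 0 <= n_valid <= n_total and n_total > 0")
    half = n_total / 2
    return (n_valid - half) / half


def average_cvr(valid_counts: list, n_total: int) -> float:
    """Mean CVR of one item across all assessment aspects."""
    return sum(cvr(n, n_total) for n in valid_counts) / len(valid_counts)


# Hypothetical item: eight aspects judged valid by all five validators,
# one aspect judged valid by four of them.
counts = [5, 5, 5, 5, 5, 5, 5, 5, 4]
per_aspect = [cvr(n, 5) for n in counts]
item_average = average_cvr(counts, 5)
print(per_aspect[-1], round(item_average, 3), item_average >= MIN_CVR_FIVE_VALIDATORS)
```

Under this formula, a single aspect reaches the five-validator minimum only when all five validators agree, so an item can pass on its average while still having individual aspects below the threshold. That is consistent with the aspect-level revisions described in the text.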
The multi-rater Rasch measurement was utilized to analyze the results of expert validation. Figure 4 illustrates the outcomes of the multi-rater analysis, featuring five columns. The first column, known as the size column (logit transformation), displays measurement results with values ranging from +2 (top) to -5 (bottom), representing logit values. The second column delineates the distribution of logit values, spanning from less than logit -1 (I11) to greater than logit +2 (I1). The logit value of 0 serves as the minimum criterion for item quality, with experts considering values above this threshold as indicative of good-quality items and values below as representing items of lesser quality.
Figure 4 shows eight items considered unfavorable by the experts: I2, I7, I9, I10, I12, I15, I11, and I3. Meanwhile, the experts deemed seven items qualified: I14, I13, I8, I4, I5, I6, and I1. The third column in Figure 4 shows the difficulty level of the assessment aspects and their distribution. A lower logit value for an assessment aspect indicates that the aspect is easier for an item to fulfill. Conversely, a higher logit value indicates that the aspect is harder for an item to fulfill, as evaluated by the validators. Assessment aspects with similar logit values share the same level of difficulty.
According to Figure 4, assessment aspects 4 (using language that follows the rules of the Indonesian language) and 9 (answer choices do not use statements such as "all answers are correct" or "answers are wrong") are the easiest aspects, because all items satisfy them according to the validators. Assessment aspect 6 (the answer choices and reasons are homogeneous and logical in terms of material) is the most difficult aspect, because most items do not meet it in the experts' opinion. According to all experts, I1 is the only item that satisfies all aspects of the assessment.
Figure 5 depicts the quality of an expert panel's assessment, sorted by item severity. Expert D is the most consistent when considering statistical fit criteria (Boone et al., 2014). Expert D has Outfit MNSQ and Outfit ZSTD values within the statistical suitability ranges of 0.5 to 1.5 (Outfit MNSQ) and -2 to +2 (Outfit ZSTD), respectively. Experts A and B are the worst because they have the lowest infit value. The reliability between raters is sufficient (0.67), indicating that the experts give quite different scores, but some are the same (Koçak, 2020). The rater's tendency influences the reliability value (Bond & Fox, 2013). The obtained data aligns with the measurement model, and this alignment is corroborated by the Chi-square test value (p < 0.01). The agreement in assessment by the five experts (inter-rater agreement) stands at 86.3%, signifying minimal divergence in evaluating all items among the five experts. Existing studies consistently indicate variations in raters' judgment tendencies, with rater behaviors such as leniency and severity influencing rater reliability (Brookhart et al., 2006; Darmana et al., 2021; Güler, 2014).
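To make the notion of inter-rater agreement concrete, the sketch below computes one common definition: the percentage of exact matches over all pairs of raters and all rated observations. This is an assumption about the definition; the 86.3% reported above was presumably produced by the Rasch software and may be computed differently. The function name and the ratings are hypothetical.

```python
from itertools import combinations


def exact_agreement_percent(ratings: dict) -> float:
    """Percentage of identical ratings over all rater pairs.

    ratings maps a rater label to that rater's list of scores, with every
    list covering the same observations in the same order.
    """
    lengths = {len(scores) for scores in ratings.values()}
    if len(ratings) < 2 or len(lengths) != 1 or lengths == {0}:
        raise ValueError("need two or more raters with equal, non-empty lists")
    agreements = 0
    comparisons = 0
    for first, second in combinations(ratings.values(), 2):
        for a, b in zip(first, second):
            comparisons += 1
            agreements += int(a == b)
    return 100.0 * agreements / comparisons


# Hypothetical ratings from three raters on three observations.
demo = {"A": [1, 1, 0], "B": [1, 0, 0], "C": [1, 1, 0]}
print(round(exact_agreement_percent(demo), 2))
```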

Disseminate
The disseminate stage is the stage at which the instrument is put to use. The results are then examined in three stages of analysis. The first analysis determines the percentage of each conception category derived from student scores, so that students' conceptions can be distributed across the categories created. The second analysis examines the closed-ended four-tier instrument based on conception scores. The third analysis includes a detailed description of conception and misconception and a comparison using Rasch analysis.
Based on the results, the conception score of all students on each item can be determined. The conception score per maximum conception value shows the percentage value of each conception. Figure 6 depicts the proportion of conception categories for each item. The results of the conception category on each item indicate that the highest sound understanding category is item number 3 (60.00%), and the lowest is item numbers 2, 5, 10, and 15 (0.00%). The highest partial positive category is item 9 (9.33%), and the lowest is items 1, 4, 5, and 8 (0.00%). The highest partial negative category is number 2 (65.33%), and the lowest is item 3 (34.67%). The highest misconception category is item number 15 (54.67%), and the lowest is number 3 (4.00%). The category with the highest incidence of "no understanding" is item number 6, accounting for 9.33%, while the lowest is item number 3, registering at 0.00%. Notably, all items exhibit a "no coding" category of 0.00%, indicating that all students responded to all tiers for each item.
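The category percentages reported above come from classifying each student's four-tier response and then tallying per item. The sketch below shows one way this could be done. The decision rules are an assumption made for illustration, because the rubric in Table 2 is not reproduced in the text; only the six category names come from the paper. All function names and the sample responses are hypothetical.

```python
from collections import Counter

CATEGORIES = [
    "sound understanding", "partial positive", "partial negative",
    "misconception", "no understanding", "no coding",
]


def classify(answer_correct, answer_sure, reason_correct, reason_sure) -> str:
    """Map one four-tier response to a conception category.

    ASSUMED rubric (Table 2 of the paper may differ):
    - any unanswered tier (None)                -> no coding
    - answer and reason correct, both sure      -> sound understanding
    - answer and reason correct, any unsure     -> partial positive
    - answer and reason wrong, both sure        -> misconception
    - answer and reason wrong, any unsure       -> no understanding
    - exactly one of answer/reason correct      -> partial negative
    """
    tiers = (answer_correct, answer_sure, reason_correct, reason_sure)
    if any(tier is None for tier in tiers):
        return "no coding"
    both_sure = answer_sure and reason_sure
    if answer_correct and reason_correct:
        return "sound understanding" if both_sure else "partial positive"
    if not answer_correct and not reason_correct:
        return "misconception" if both_sure else "no understanding"
    return "partial negative"


def category_percentages(responses: list) -> dict:
    """Percentage of responses in each category for one item."""
    if not responses:
        raise ValueError("no responses to tally")
    counts = Counter(classify(*response) for response in responses)
    return {name: 100.0 * counts[name] / len(responses) for name in CATEGORIES}


# Hypothetical responses of four students to a single item.
item_responses = [
    (True, True, True, True),
    (False, True, False, True),
    (True, True, False, True),
    (None, True, True, True),
]
print(category_percentages(item_responses))
```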
The analysis of the four-tier closed-ended format involved the application of the Rasch Model. This analysis aimed to ascertain the validity, reliability, and difficulty level of the four-tier closed-ended test on optical instruments. The outcomes related to instrument validity are presented in Table 6. The results in Table 6 reveal that items I1 and I5 do not meet the criteria for PT MEASURE CORR. However, they are retained because they satisfy the criteria for OUTFIT MNSQ and OUTFIT ZSTD values. The remaining four-tier test items meet all the criteria for item suitability.

The examination of the instrument's unidimensionality evaluates the validity of the Rasch model on each item individually and as a whole. Unidimensionality is a criterion to determine if the developed instrument can effectively measure its intended content. Table 7 reports the unidimensionality results, showing that the raw variance explained by measures is 43.0%, surpassing the 40% threshold. This result indicates that the overall validity of the four-tier closed-ended test falls into the "good" category (Sumintono & Widhiarsho, 2015), meaning that the test has good validity in measuring students' misconceptions. Furthermore, the value of each unexplained variance is less than 15%. As a result, all four-tier closed-ended test items are valid and can be used in full without revision (Sumintono & Widhiarsho, 2015).

The conclusion from the analysis of the four-tier test shows that the items have very good quality, but the consistency of the answers given by students is still weak. In addition, the interaction between students and the questions is good.
On the conception scores, students 35F and 39F have the highest ability, while student 64F has the lowest. Although students 35F and 39F have the highest abilities, they still fall short of items I5 and I15; they can only answer items I6 and I10 and those below them. In contrast, the ability of student 64F falls short of all items. I3 is the item with the lowest level of difficulty. Students 04F, 20F, 63M, 67F, 72F, 14F, 03F, 61M, and 64F have abilities that fall below item I3, so these students struggle to answer item I3 correctly. In contrast, I5 and I15 have the highest difficulty levels, and no students are capable of answering them.

The analysis revealed that item I3 had the lowest difficulty in conception, and none of the participants responded with misconceptions. By contrast, item I5 is the most difficult in conception and drew frequent misconception responses. Student 64F had the lowest conception score and the highest misconception score. Meanwhile, the students with the highest conception scores, 35F and 39F, were not the students with the lowest misconception scores. The participant with the lowest misconception score was 45F, who ranked below 35F and 39F on conception scores. This disparity among students is due to their self-confidence level, which shows that student self-confidence influences student misconceptions. The more confidently students hold mistaken concepts, the more misconceptions they will exhibit.

CONCLUSION
There are four conclusions based on the data analysis and discussion results. First, all items met the CVR scoring criteria, and items I2, I7, I9, I10, I12, I15, and I3 were corrected based on expert advice.
Second, students have misunderstandings about each item. Item I5 (49.33 percent) and I15 (49.33 percent) have the most misconceptions (54.67 percent). Third, all items are valid and reliable, with a Cronbach Alpha value of 0.78 in the good category. The fourth conclusion is that conception and misconception are inversely related. The fewer misconceptions, the better the student's understanding, and vice versa. However, misconceptions can also occur due to each student's confidence level.
Students with misconceptions about optical instrument materials should be given appropriate treatment, such as appropriate learning. The developed four-level closed test is expected to be used and improved into a better five-level test to investigate the causes of each student's misconceptions.

Figure 1. Research Chart of the Four-Tier Closed-Ended Questions on Optical Instruments.
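$$\ln\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k$$

Description: $P_{nijk}$ = probability of examinee n receiving a rating of k on criterion i from rater j; $P_{nij(k-1)}$ = probability of examinee n receiving a rating of k-1 on criterion i from rater j. In standard many-facet Rasch notation, $B_n$ is the measure of examinee n, $D_i$ the difficulty of criterion i, $C_j$ the severity of rater j, and $F_k$ the difficulty of rating step k relative to step k-1.

The second step is categorizing students in each concept understanding category based on their responses. The results of the classification of concept categories for students are expressed in percent form. The rubric of the conception category, conception score, and misconception score are shown in Table 2.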

Figure 2. The Design of Four-tier: (a) Open-ended Test, and (b) Close-ended Test

Figure 3. The Example of the Four-tier Close-ended Test on Optical Instruments.

Figure 6. The Conception Categories for Each Item.

Figure 7. Wright Map Conceptions Score and Wright Map Misconceptions Score.
). One of the physics materials that still has misconceptions is optical instrument material (Kaniawati et al., 2020;

Table 1. The Minimum CVR Values for the Various Validator Numbers.

Table 2. The Score of Conceptions and Misconception.

Table 4. Distribution of Item Construction.
*AAN: Assessment Aspect Number; N: the total number of validators; Ne: the number of validators who provide valid assessments; CVRi: CVR index; CVRa: CVR average.

Table 6. OUTFIT MNSQ, OUTFIT ZSTD, and PT MEASURE CORR for Each Item of the Four-tier Close-ended Test.

Table 7. The Unidimensionality of Four-tier Closed-ended Test.

Table 8. The Value of Item Reliability, Person Reliability, and Cronbach Alpha.