SCB-Dataset: A Dataset for Detecting Student and Teacher Classroom Behavior
TL;DR Summary
The paper presents SCB-Dataset, the first large-scale dataset covering 19 classroom behavior classes for students and teachers, addressing data scarcity in education. It includes 13,330 images and 122,977 labels, designed for object detection and image classification, establishing baselines with YOLO-series detectors and a large vision-language model.
Abstract
Using deep learning methods to detect the classroom behaviors of both students and teachers is an effective way to automatically analyze classroom performance and enhance teaching effectiveness. However, there is still a scarcity of publicly available high-quality datasets on student-teacher behaviors. We constructed SCB-Dataset, a comprehensive dataset of student and teacher classroom behaviors covering 19 classes. SCB-Dataset is divided into two types: Object Detection and Image Classification. The Object Detection part includes 13,330 images and 122,977 labels, and the Image Classification part includes 21,019 images. We conducted benchmark tests on SCB-Dataset using YOLO series algorithms and a large vision-language model. We believe that SCB-Dataset can provide a solid foundation for future applications of artificial intelligence in education. Code: https://github.com/Whiffe/SCB-dataset
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
SCB-Dataset: A Dataset for Detecting Student and Teacher Classroom Behavior
1.2. Authors
Fan Yang
1.3. Journal/Conference
The paper is published on arXiv, a preprint server, indicating it has not yet undergone formal peer review or been accepted by a specific journal or conference. However, arXiv is widely used in the academic community for disseminating research quickly.
1.4. Publication Year
2023 (Published at UTC: 2023-04-05T15:02:30.000Z)
1.5. Abstract
This paper introduces SCB-Dataset, a new, high-quality, publicly available dataset designed for detecting the classroom behaviors of both students and teachers using deep learning methods. It addresses the scarcity of such datasets in the educational field. SCB-Dataset covers 19 distinct behavior classes and is organized into two main types: an Object Detection part, comprising 13,330 images and 122,977 labels, and an Image Classification part, containing 21,019 images. The authors conducted benchmark tests on this dataset using YOLO series algorithms for object detection and Large Vision-Language Models (LVLMs) for image classification. The results demonstrate the dataset's utility and provide baseline performance figures, aiming to establish a foundation for future artificial intelligence applications in education.
1.6. Original Source Link
Original Source Link: https://arxiv.org/abs/2304.02488 PDF Link: https://arxiv.org/pdf/2304.02488v7.pdf Publication Status: This is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the scarcity of publicly available, high-quality datasets for detecting student and teacher classroom behaviors. While deep learning methods, particularly computer vision (CV) techniques, have shown great promise in automatically analyzing classroom performance and enhancing teaching effectiveness, their application is severely limited by the lack of suitable training data.
This problem is important because classroom behavior analysis can provide valuable insights into students' learning, personality, and psychological traits, offer feedback on teachers' teaching processes, and contribute to overall education quality evaluations. Existing computer vision datasets (e.g., COCO, Crowded Human, AVA) are either general-purpose or focus on broader human actions, often lacking the specific context, diversity, and annotation quality required for nuanced educational scenarios. Some existing student behavior datasets are either closed-source, have limited samples, or suffer from quality issues like non-standard bounding boxes and labeling errors.
The paper's entry point is to construct a comprehensive, diverse, and high-quality dataset, SCB-Dataset, specifically tailored for the education domain. Its innovative idea is to provide data for two distinct but complementary computer vision tasks: object detection (for locating and classifying individual behaviors) and image classification (for categorizing overall classroom group behaviors or complex actions that depend on global image context).
2.2. Main Contributions / Findings
The paper's primary contributions are:

- First Public Comprehensive Dataset: SCB-Dataset is presented as the first public dataset for student and teacher classroom behaviors in real educational environments, with the most classes (19 in total) and a large number of images. It addresses a significant gap in research on student classroom behavior within teaching scenarios.
- Dual Dataset Types: The dataset is divided into two distinct types:
  - Object Detection Dataset: Allows precise localization of individual students and teachers and classification of their behaviors, making it suitable for traditional deep convolutional neural network (CNN) models like YOLO. It includes 13,330 images and 122,977 labels.
  - Image Classification Dataset: Focuses on classifying the overall behavior of students and teachers in an image, suitable for Large Vision-Language Models (LVLMs). It includes 21,019 images and is designed for behaviors that require global image context (e.g., a teacher patrolling).
- Benchmark Testing and Baselines: The paper conducts benchmark tests on SCB-Dataset using YOLO series algorithms for object detection and Qwen2.5-VL-7B-instruct (an LVLM) for image classification. These results provide valuable baseline references for future research and development in this field.

The key findings from the benchmark tests indicate that:

- YOLOv7 performs strongly on the object detection task, with mAP@0.5 scores above 70% for most behaviors, and above 90% for half of them, suggesting practical applicability for many classes. However, some behaviors like "bow the head" and "turn the head" remain challenging.
- Qwen2.5-VL-7B-instruct demonstrates significant training effectiveness for image classification, with F1 scores mostly above 80% across student and teacher behaviors. This highlights the potential of LVLMs for capturing complex, context-dependent classroom actions.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with fundamental deep learning and computer vision concepts:
- Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn representations of data at multiple levels of abstraction. It has revolutionized tasks like image recognition, natural language processing, and speech recognition.
- Computer Vision (CV): A field of artificial intelligence (AI) that enables computers to "see" and interpret visual data from the world (e.g., images, videos). Common tasks include object recognition, object detection, object tracking, instance segmentation, and action recognition.
- Object Detection: A computer vision task that involves identifying the presence of objects in an image or video and drawing a bounding box (bbox) around each object, while also classifying what type of object it is. For example, in a classroom image, object detection could identify each student, draw a box around them, and label them as "student."
- Image Classification: A computer vision task that assigns a single category label to an entire image. For example, an image classification model might determine whether an image shows "a classroom with students discussing" or "a teacher lecturing."
- Action Recognition: The task of identifying human actions or activities from videos or images. This can range from simple actions (e.g., "walking") to complex ones (e.g., "playing basketball").
- Pose Estimation: A computer vision technique that detects and localizes key points (joints) of a person's body (e.g., shoulders, elbows, knees) in an image or video. This information can then be used to understand body posture and movement.
- Large Vision-Language Models (LVLMs): A new class of AI models that combine capabilities from both computer vision and natural language processing (NLP). They can process and understand both images and text simultaneously, enabling tasks like visual question answering (VQA, answering questions about an image) and image captioning (generating descriptive text for an image). The "large" refers to their massive scale, often billions of parameters, similar to Large Language Models (LLMs).
- YOLO (You Only Look Once): A popular family of object detection algorithms known for speed and accuracy. Unlike traditional object detection systems that separate region proposal and classification, YOLO predicts bounding boxes and class probabilities in a single pass of a neural network, making it very efficient for real-time applications. The paper mentions various versions like YOLOv5, YOLOv7, YOLOv8, etc., which are incremental improvements over the original YOLO architecture.
- Evaluation Metrics for Object Detection:
  - Precision (P): Measures the accuracy of positive predictions; the ratio of correctly predicted positive observations to all predicted positive observations.
    $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
    where True Positives (TP) are correctly detected objects and False Positives (FP) are incorrect detections (e.g., detecting something that isn't there, or detecting the wrong class).
  - Recall (R): Measures the ability of the model to find all relevant cases (all true positives); the ratio of correctly predicted positive observations to all observations in the actual class.
    $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
    where False Negatives (FN) are objects that were present but not detected by the model.
  - mAP (mean Average Precision): A common object detection metric that averages the Average Precision (AP) over all object classes, where AP is the area under the Precision-Recall curve.
    - mAP@0.5: mAP computed with an Intersection over Union (IoU) threshold of 0.5. IoU measures the overlap between the predicted bounding box and the ground-truth bounding box; a detection counts as a TP if its IoU with a ground-truth box exceeds 0.5.
    - mAP@0.95: mAP computed with an IoU threshold of 0.95, meaning detections must be localized very precisely to count as TP. This is a stricter measure of localization accuracy.
- Evaluation Metrics for Image Classification:
  - F1 Score: The harmonic mean of Precision and Recall. It balances both metrics and is particularly useful when the class distribution is uneven.
    $\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
- LoRA (Low-Rank Adaptation): A technique for fine-tuning large language models and large vision-language models efficiently. Instead of updating all the parameters of a large pre-trained model, LoRA injects small, trainable low-rank matrices into the transformer architecture. This significantly reduces the number of trainable parameters and the computational cost, making fine-tuning more accessible (see the sketch below).
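To make the LoRA idea concrete, here is a minimal PyTorch sketch. It is illustrative only, not the paper's implementation (the authors fine-tune Qwen2.5-VL via LLaMA Factory); the rank, alpha, and dropout values simply mirror those reported later in Section 5.5.

```python
# Minimal sketch of the LoRA idea: a frozen linear layer plus a trainable
# low-rank residual W + (alpha/r) * B @ A. Illustrative, not the paper's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16, dropout: float = 0.1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scaling = alpha / r
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path + scaled low-rank residual path
        return self.base(x) + self.dropout(x) @ self.lora_A.T @ self.lora_B.T * self.scaling

# Usage: wrap a projection layer; only lora_A and lora_B receive gradients.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
out = layer(torch.randn(2, 16, 768))   # (batch, tokens, dim)
```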
3.2. Previous Works
The paper reviews existing approaches and datasets for student classroom behavior detection, categorizing them mainly into:
- Video Action Recognition-based: These methods recognize continuous behaviors from video streams. The paper notes that they often require a very large number of annotations, citing the AVA dataset (1.58M annotations) used for SlowFast detection. Examples of general video datasets include UCF101 and Kinetics400.
- Pose Estimation-based: These algorithms characterize human behavior from joint position and motion information. However, the paper points out their limitation in overcrowded classroom scenarios, where dense student populations make accurate pose estimation difficult.
- Object Detection-based: These methods identify and localize individual behaviors within still frames. The paper highlights recent advancements in this area, such as YOLOv7, making it a promising solution.

Regarding datasets, the paper specifically mentions:

- Publicly Available Datasets:
  - STBD-08 (Zhao and Zhu 2023): Claimed 4,432 images and 151,574 annotations, but the paper found the actual quantities to be higher (8,884 images, 267,888 annotations) and noted significant data quality issues (non-standard bboxes, class labeling errors). It also suggests the public version is derived from a paid online dataset.
  - ClaBehavior (Wang et al. 2023b): The paper mentions 1,342 images and 9,911 annotations, but only 400 publicly available images were found, far fewer than stated.
  - SCBehavior (Wang et al. 2024): Similar to ClaBehavior; the paper mentions 1,346 images, but only 400 damaged images were found publicly.
  - UK Datasets (Feng et al. 2025): Derived from elementary school classroom videos, with 8,754 extracted images covering 8 student behaviors. The paper asserts this dataset is also not original and appears to be based on publicly sold data.
  - universe roboflow website (reddy projects 2023): A general resource for datasets.
  - A student classroom behavior dataset from a paid website (Unknown 2023): Noted as a source for some public datasets.
- Closed-source Datasets: The paper lists numerous closed-source datasets for student behavior, indicating active research but limited public access. These vary in size and behavior classes, often focusing on student-specific actions like "focus," "distract," "listening," "writing," "playing," and "raising hands." Some also include teacher behaviors.
3.3. Technological Evolution
The field of computer vision has seen rapid development, particularly with the advent of deep learning. Early computer vision methods relied on hand-crafted features, which were often brittle and less robust to variations in real-world data. Deep learning, especially Convolutional Neural Networks (CNNs), revolutionized object recognition and detection by automatically learning hierarchical features from data. Early significant datasets like COCO propelled research in object detection. Subsequent advancements focused on real-time performance (YOLO, SSD) and more complex tasks like instance segmentation (Mask R-CNN) and video retrieval. More recently, the emergence of Large Vision-Language Models (LVLMs) like Qwen2.5-VL marks a new frontier, allowing models to process and understand multimodal inputs (vision and language) and perform more nuanced tasks, which is particularly relevant for context-rich domains like classroom behavior analysis.
The paper fits into this timeline by leveraging advanced object detection techniques (YOLO series) and state-of-the-art LVLMs to tackle a specific, challenging domain (classroom behavior). It acknowledges the limitations of simpler datasets and the need for complex and realistic datasets to bridge the gap between model performance on benchmarks and real-world applicability.
3.4. Differentiation Analysis
Compared to the main methods and datasets in related work, SCB-Dataset presents several core differences and innovations:
- Quality and Scope: Unlike some existing public datasets (STBD-08, ClaBehavior, SCBehavior, UK Datasets), which are criticized for limited quantity, problematic annotations, or being non-original, SCB-Dataset aims to be a high-quality, comprehensive, and original public resource. It explicitly addresses issues like non-standard bounding boxes and class labeling errors found in other datasets through its rigorous annotation process.
- Comprehensive Behavior Classes: With 19 distinct behaviors, SCB-Dataset offers a wider range of student and teacher behaviors than many existing datasets, which often focus solely on students or a limited set of actions. This comprehensive coverage allows for more holistic classroom analysis.
- Dual-Task Design: The dataset's division into Object Detection and Image Classification parts is a key innovation.
  - Object Detection: Caters to fine-grained, individual behavior analysis, providing bounding box coordinates and a classification for each person. This is crucial for fine-grained research and tracking individual student/teacher engagement.
  - Image Classification: Addresses the need for group-behavior analysis and for behaviors that are difficult to define or recognize via local object features (e.g., "patrol," "listen," "on-stage presentation") because they depend on the overall image context. This leverages the strengths of LVLMs.
  This dual approach provides flexibility and caters to different analytical needs, covering both individual and collective classroom dynamics.
- Real-world Diversity: The dataset incorporates diverse classroom scenarios, including varying shooting angles, different classroom environments, multiple learning stages (kindergarten to university), and multiple ethnic groups (Han and Tibetan), ensuring a robustness and generalizability that is often lacking in simpler datasets.
- Teacher Behaviors: A significant differentiator is the inclusion of teacher behaviors. Many existing datasets focus exclusively on students; SCB-Dataset explicitly includes teacher actions, recognizing their crucial role in classroom dynamics.
- Transparent Annotation Process: The paper details its extensive annotation work and iterative improvements to the VIA tool, emphasizing the effort put into data quality and extensibility, which is often overlooked in dataset publications.

In essence, SCB-Dataset offers a more robust, diverse, and well-structured foundation for AI in education by providing a high-quality, dual-purpose dataset that addresses current limitations in the scale, quality, and scope of existing resources.
4. Methodology
4.1. Principles
The core idea behind SCB-Dataset is to bridge the gap in classroom behavior analysis by providing a robust and comprehensive dataset for deep learning models. The theoretical basis is that classroom performance and teaching effectiveness can be objectively analyzed through the automatic detection of student and teacher behaviors. The intuition is that by collecting a large, diverse set of real-world classroom videos, extracting frames, and meticulously annotating individual and group behaviors, a dataset can be created that enables AI models to learn these complex patterns. The dataset is designed to support two primary computer vision tasks: object detection for fine-grained individual actions and image classification for holistic, context-dependent classroom states, recognizing that different behaviors require different analytical approaches. A key principle is to ensure data quality and diversity to make the models generalizable to various real-world classroom settings.
4.2. Core Methodology In-depth (Layer by Layer)
The SCB-Dataset construction process involves several meticulous steps, from data collection to annotation, and finally, statistical analysis to ensure quality and relevance.
4.2.1. Data Collection
The process begins with acquiring raw video data from various online open-class platforms.
- Video Source: Videos are collected directly from public websites such as bilibili, TikTok, 1s1k, and bjyhjy, ensuring a diverse range of real classroom scenarios.
- Diversity Inclusion: To enhance the dataset's representativeness, classroom data from China's ethnic minorities, specifically the Tibetan nationality, were incorporated in addition to the Han nationality. This broadens the cultural and demographic scope of the dataset.
- Frame Extraction: Once videos are collected, individual frames are extracted from them.
- Differential Frame Selection Strategy: To combat class imbalance, a common problem in real-world datasets where some behaviors are far more frequent than others, a differential frame selection strategy is employed (see the sketch below). This means:
  - Reducing the sampling volume for common classes (e.g., "read" and "write").
  - Increasing the sampling volume for rare classes (e.g., "discuss" and "board writing").
  This strategy aims to ensure a sufficient number of examples for less frequent but important behaviors.
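The paper does not release its sampling code, so the following is a hypothetical sketch of such a class-aware frame extraction step using OpenCV; the per-class sampling intervals, function name, and file layout are invented for illustration.

```python
# Hypothetical sketch of differential frame selection with OpenCV.
# The per-class frame intervals below are illustrative assumptions,
# not values from the paper.
import cv2

# Sample rare behaviors densely, common behaviors sparsely.
FRAME_INTERVAL = {"discuss": 30, "board writing": 30,   # rare: keep more frames
                  "read": 300, "write": 300}            # common: keep fewer frames

def extract_frames(video_path: str, dominant_class: str, out_dir: str) -> int:
    """Save every k-th frame of a video, where k depends on the video's dominant class."""
    step = FRAME_INTERVAL.get(dominant_class, 120)  # default interval for other classes
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_dir}/{dominant_class}_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```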
The overall production process is visualized in Figure 1.
This figure is a schematic of the SCB-Dataset production pipeline: videos are obtained from open-class websites, frames are extracted, the images are annotated, and the object detection and image classification models are iteratively fine-tuned and updated.
Figure 1: The production process of SCB-Dataset.
4.2.2. Behavior Classes
SCB-Dataset is designed to cover a comprehensive set of student and teacher behaviors.
- Total Classes: After removing behaviors that overlap between the two dataset types, SCB-Dataset features a total of 19 distinct behavior classes.
- Object Detection Dataset: Contains 12 types of behaviors, focusing on individual-level actions and objects.
- Image Classification Dataset: Contains 14 types of behaviors, focusing on group-level or context-dependent behaviors that are better described at the whole-image level.

The behavior classification is illustrated in Figure 4. Behaviors with a yellow background in that figure are considered group behaviors.
This figure is a behavior-classification diagram showing the taxonomy of student and teacher classroom behaviors in SCB-Dataset. Behaviors are split into an object detection part and an image classification part, with labels such as hand-raising, reading, and teacher guidance.
Figure 4: The behavior classification of SCB-Dataset
Rationale for Group vs. Individual Behaviors:

- Hand-raising: Considered an individual behavior, as it reflects teacher-student interaction and classroom activity through individual student participation.
- Reading and Writing: Classified as group behaviors. The paper argues that individual instances of these behaviors may not be representative; focusing on overall classroom engagement (i.e., whether the majority of students are reading or writing) better reflects current teaching dynamics.

Specific Class Considerations for Object Detection:

- Blackboard-writing: Divided into "student blackboard-writing" and "teacher blackboard-writing" because they carry completely different pedagogical meanings (student demonstration vs. teacher instruction).
- Teacher: An identity-recognition class, used to locate the teacher.
- Teacher behaviors like "teach" and "patrol": Not included in the Object Detection dataset because YOLO-series networks struggle to fit their characteristics; these behaviors often depend on environmental or timing cues that are better captured by image classification.
- Stand: Classified under "Other," as it can be a behavior of both students and teachers.
- Screen and Blackboard: Included as objects rather than behaviors, to facilitate tasks such as extracting content from screens or blackboards.

Specific Class Considerations for Image Classification:

- LVLM Prompts: Two sets of prompts are used, one for students and one for teachers, each containing an "other" category to cover uncategorized actions.
- Hand-raising: Only images containing 3 or more hand-raising behaviors are retained, to ensure stronger behavioral signals for LVLM training.
- Merged Classes: "Read" and "write" are merged into "read and write" because image classification outputs only one class per image, and these behaviors often co-occur or are hard to distinguish globally.
- Advantages: Image classification can capture behaviors that are difficult for object detection because they rely on whole-image information, such as "on-stage presentation," "reading aloud," "listen," "teach," and "patrol."

Examples of behavior classes are provided in Figure 2, and a sketch of this curation step follows the figure.
This figure shows examples of the behavior classes in SCB-Dataset, covering various student and teacher classroom behaviors such as reading, writing, listening, and raising hands to answer questions, illustrating the interaction patterns between students and teachers.
Figure 2: Examples of behavior classes in SCB-Dataset
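As a concrete illustration of the curation rules above (the ≥3 hand-raising threshold and the read/write merge), here is a hypothetical sketch; the annotation format, function name, and rule ordering are assumptions, not the paper's released tooling.

```python
# Hypothetical sketch of deriving an image-classification label from the
# per-image detection annotations. The input format is an assumption.
from collections import Counter

def image_label(behaviors: list[str]) -> str | None:
    """Map the behaviors annotated in one image to a single classification label."""
    counts = Counter(behaviors)
    # Rule: keep "hand-raising" only when 3 or more students raise hands.
    if counts["hand-raising"] >= 3:
        return "hand-raising"
    # Rule: "read" and "write" are merged into one whole-image class.
    if counts["read"] + counts["write"] > 0:
        return "read and write"
    if counts["discuss"] > 0:
        return "discuss"
    return None  # image not used for these classification classes

print(image_label(["hand-raising"] * 3 + ["read"]))  # -> "hand-raising"
print(image_label(["read", "write", "listen"]))      # -> "read and write"
```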
4.2.3. Dataset Challenges
The real-world classroom environment presents several challenges that SCB-Dataset aims to capture and address. These challenges increase the complexity for computer vision models:
- Pixel Differences: Significant variation in resolution and detail between objects close to the camera (e.g., front-row students) and those far away (e.g., back-row students). (Illustrated in Fig. 5 I)
- Dense Environments: Classrooms are often crowded, leading to severe occlusion (objects blocking each other), especially among densely packed students. (Illustrated in Fig. 5 II)
- Coexistence of Multiple Classes: Individuals (students/teachers) may exhibit multiple behaviors simultaneously (e.g., a teacher standing and teaching, a student hand-raising and reading). (Illustrated in Fig. 5 III)
- Similar Behaviors: High similarity between distinct behaviors makes differentiation difficult (e.g., placing a hand on the forehead vs. raising a hand, writing vs. reading). (Illustrated in Fig. 5 IV)
This figure illustrates the challenges in SCB-Dataset, depicting pixel differences, dense environments, coexistence of multiple classes, and similar behaviors, which characterize teacher and student behavior in the classroom.
Figure 5: Challenges in the SCB-Dataset include pixel differences, dense environments, the coexistence of multiple classes, and similar behaviors.
Furthermore, the dataset incorporates diversity along several axes to ensure robustness:
- Varying Shooting Angles: Frontal, lateral, and back views of the classroom, complicating behavior detection because the same action can appear very different from different perspectives. (Illustrated in Fig. 6 I)
- Class Differences & Environments: Various classroom environments and course types (e.g., computer labs, standard classrooms, art studios), each with unique visual characteristics. (Illustrated in Fig. 6 II)
- Different Learning Stages: Students from kindergarten through university, reflecting developmental differences in appearance and behavior. (Illustrated in Fig. 6 III)
- Different Ethnic Groups: Han and Tibetan nationalities, adding cultural diversity. (Illustrated in Fig. 6 IV)
This figure showcases the diversity of SCB-Dataset, including different shooting angles, classroom types, and learning stages, with scenes such as computer classes, kindergarten, English classes, art classes, and middle school, reflecting the variety of educational environments.
Figure 6: The diversity of the SCB-Dataset includes varying shooting angles, class differences, different learning stages, and different ethnic groups.
4.2.4. Dataset Statistics
The dataset's composition is detailed, including counts for each class and the overall structure.
- Class Count Statistics: Figure 3 presents bar charts showing the number of annotations for each class in both the Object Detection (left) and Image Classification (right) datasets. The charts highlight the class imbalance that remains despite efforts to mitigate it during data collection.
- Annotation Volume: The Object Detection dataset has a much higher number of annotations per image because multiple objects and behaviors are labeled within a single image, whereas the Image Classification dataset assigns only one overall class per image.
This figure is a statistics chart for SCB-Dataset: the left side shows per-class training and validation counts for the object detection part, and the right side shows per-class counts for the image classification part.
Figure 3: SCB-Dataset Class Count Statistics
- Overall Statistics: Table 1 provides the total number of images and annotations for both dataset types.

The following are the results from Table 1 of the original paper (the Object Detection images column has no standard train/val breakdown; see the note below):

| Split | Object Detection Images | Object Detection Annotations | Image Classification Images | Image Classification Annotations |
| Train | - | 89,713 | 19,152 | 19,152 |
| Val | - | 33,264 | 1,867 | 1,867 |
| Total | 13,330 | 122,977 | 21,019 | 21,019 |
Note on Object Detection Split: The paper explicitly states that the Object Detection dataset does not provide a standard train/validation split count due to the severe class imbalance. To handle this, SCB-Dataset is split into multiple sub-parts. For each sub-part, training and validation sets are independently and randomly divided in a 4:1 ratio. This approach leads to overlaps between the training and validation sets across different sub-parts, rendering an overall count of training/validation images for object detection impractical for general reference. The detailed statistics for each sub-part are provided in Appendix A.3.
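For concreteness, here is a minimal sketch of the per-sub-part 4:1 random split described above; the file naming and function are assumptions for illustration.

```python
# Minimal sketch of the independent 4:1 train/val split applied to each
# sub-part of the object detection data. Names and layout are assumed.
import random

def split_subpart(image_ids: list[str], seed: int = 0) -> tuple[list[str], list[str]]:
    """Randomly divide one sub-part into train and val sets at a 4:1 ratio."""
    rng = random.Random(seed)
    ids = list(image_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * 0.8)          # 4:1 ratio -> 80% train, 20% val
    return ids[:cut], ids[cut:]

train, val = split_subpart([f"img_{i:05d}" for i in range(1000)])
print(len(train), len(val))            # 800 200
```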
4.2.5. Annotation Work (Appendix A.1)
The annotation process was the most time-consuming part, accounting for nearly 90% of the total workload.
- Evolution of Annotation: The process began in 2021 with defining input-output expectations and behavior classifications. The initial definitions proved "unreasonable and immature," consuming a significant time investment (1.5 years) before a redesign in early 2023.
- Extensible Method: The redesigned process introduced an extensible behavior annotation method, allowing flexible expansion to future behavior classes.
- Annotation Tool Improvements (VIA): The annotation tool VIA was iteratively optimized to meet practical needs:
  - Original Version: Basic functionality (Fig. 11).
This figure shows an example of the classroom behavior annotation interface: multiple students' behaviors are boxed and labeled, including discussion, attention, and tool use, alongside a label panel for classifying student behaviors.
Figure 11: Original Annotation Interface of VIA
  - Second Version: Optimized the label display position (inside bounding boxes, for better visibility in dense classroom scenes) and added keyboard shortcuts ("z" and "x") for switching the label display, in addition to the mouse scroll wheel. (Fig. 12)
This figure shows a classroom scene with many students studying; different classroom behaviors such as discussing and reading are annotated for the behavior detection dataset.
Figure 12: VIA Second Version
  - Third Version: Optimized the selection of annotation boxes by adding mouse-scroll switching, which helps select very small or mislabeled bounding boxes and highlights the selected box. (Fig. 13)
This figure is a schematic of classroom behavior detection, showing annotation boxes and an information table recording different students' behaviors such as reading and discussing, in support of SCB-Dataset research and analysis.
Figure 13: VIA Third Version
  - Fourth Version: Optimized the display of annotation content in a half-page format for efficiency, and introduced a Full Image Mode (all annotations) and a Single Target Mode (individual annotated targets) for detailed inspection in dense scenarios. (Fig. 14)
This figure shows a classroom scene with multiple students' learning behaviors; yellow boxes mark annotation details such as posture, tools used, and teaching activities, illustrating the annotations in SCB-Dataset.
Figure 14: VIA Fourth Version
  - Fifth Version: Introduced a copy-previous-frame-annotations feature, significantly reducing repetitive work for consecutive frames with high similarity. (Fig. 15)
This figure illustrates behavior-recognition annotation in the classroom: multiple boxes mark different students' behaviors and postures, each classified in detail to support analysis of teaching effectiveness.
Figure 15: VIA Fifth Version
- Final Checks (viaJson): After annotation, a viaJson counting website (Fig. 16) is used to verify the results, identifying unclassified annotation boxes and providing statistics on detection boxes and annotated targets to ensure completeness and accuracy.
This figure is a statistics page produced from the JSON annotation files: 461 detection boxes and 539 labels are counted, and images with missing annotations are listed, helping users verify the completeness and validity of the annotation data.
Figure 16: Annotation Review and Counting Website
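The counting website itself is not released; the following is a minimal, hypothetical sketch of such a consistency check over a VIA-style JSON export. The attribute layout is the standard VIA format, but the exact keys used by the authors are assumptions.

```python
# Hypothetical sketch of a post-annotation consistency check over a
# VIA-style JSON export. Attribute key names are assumptions.
import json

def count_annotations(via_json_path: str) -> None:
    with open(via_json_path, encoding="utf-8") as f:
        data = json.load(f)
    boxes, labels, unlabeled_images = 0, 0, []
    for entry in data.values():                      # one entry per image
        regions = entry.get("regions", [])
        if not regions:
            unlabeled_images.append(entry.get("filename", "?"))
        boxes += len(regions)                        # detection boxes
        # labels: attribute values carried by each box (a box may carry several)
        labels += sum(len(r.get("region_attributes", {})) for r in regions)
    print(f"detection boxes: {boxes}, labels: {labels}")
    print("images without annotations:", unlabeled_images)
```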
4.2.6. Prompt (Appendix A.4)
For the Image Classification task using LVLMs, specific prompts are defined for student and teacher behaviors.
Student Behavior Categories:
read and write, on-stage presentation, student blackboard-writing, answering questions, reading aloud, discuss, hand-raising, listen, others.
Teacher Behavior Categories:
teach, guide, answer, on-stage interaction, blackboard-writing, patrol, others.
Definition and Recognition Rules:
- Priority of single behavior: Only one dominant behavior is identified per picture; compound actions are classified by the dominant behavior.
- Unique output: Only one behavior category is output for each recognition.
The paper provides detailed definitions for each category, clarifying nuances such as:
- Read and write: Students are reading or writing.
- On-stage presentation: Students presenting on stage without teachers (distinct from the teacher's on-stage interaction).
- Student blackboard-writing: Students writing on the blackboard (distinct from the teacher's blackboard-writing).
- Answering questions: Students standing to answer questions with no teacher present in the image (distinct from the teacher's answer class).
- Reading aloud: Students reading aloud in unison, often with open mouths, distinct from silent read and write.
- Discuss: Students discussing, either with deskmates or between rows.
- Listen: Students looking up to listen to the teacher.
- Hand-raising: Generally counted only if 3 or more students are raising hands (for image classification).
- Others: Catch-all for student behaviors not in the defined categories.
- Teach: Teachers on the podium explaining knowledge.
- Guide: Teachers off the podium providing individual guidance, often involving bending over.
- Answer: Students answering the teacher's questions, with both teacher and student standing.
- On-stage interaction: Teachers inviting students to the stage for activities, with both teachers and students visible.
- Blackboard-writing: The teacher writing on the blackboard.
- Patrol: Teachers walking around the classroom, observing.
- Others: Catch-all for teacher behaviors not defined above.
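The verbatim prompt strings appear in the paper's Appendix A.4; as a hypothetical sketch, a single-label classification prompt could be assembled from the categories and rules above like this (the exact wording is an assumption):

```python
# Hypothetical sketch of building a single-label classification prompt for
# the LVLM from the category lists above. The wording is an assumption;
# the paper's verbatim prompts are in its Appendix A.4.
STUDENT_CLASSES = ["read and write", "on-stage presentation",
                   "student blackboard-writing", "answering questions",
                   "reading aloud", "discuss", "hand-raising", "listen", "others"]

def build_prompt(classes: list[str]) -> str:
    """Compose a prompt enforcing the 'one dominant behavior, unique output' rules."""
    return ("Identify the single dominant classroom behavior in this image. "
            "If actions are compound, choose the dominant one. "
            "Answer with exactly one category from: " + ", ".join(classes) + ".")

print(build_prompt(STUDENT_CLASSES))
```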
4.2.7. Behavior Description (Appendix A.5)
A comprehensive list of all 19 behavior classes with detailed descriptions is provided. This ensures clarity and consistency in understanding the dataset's labels.
- hand-raising: Students raise hands to speak or ask.
- read: Students read books, textbooks, or notes.
- write: Students take notes or complete assignments.
- discuss: Students discuss content with each other.
- bow the head: Students lower their heads, possibly distracted or focused on personal items.
- turn the head: Students turn their heads to look at classmates/teachers or events.
- blackboard-writing: Teachers/students write or draw on the blackboard.
- on-stage presentation: Students stand and answer questions. (Clarification: this A.5 definition combines aspects of "answering questions" from A.4. It states: "Students stand up and answer questions. The difference between answering questions and answer is that answering questions means there are only students in the image, no teachers, while answer means there are both students and teachers in the image.")
- reading aloud: Students read content aloud from books.
- listen: Students look up to listen to the teacher.
- guide: Teachers provide guidance or explanations to students.
- answer: Students respond to the teacher's questions.
- on-stage interaction: Students interact with teachers or other students on stage.
- teacher: Identity recognition for the teacher.
- teach: Teachers stand on the podium, explaining knowledge.
- patrol: Teachers walk around the classroom.
- stand: Students or teachers stand.
- screen: The screen displaying PPTs.
- blackboard: The blackboard for writing.
5. Experimental Setup
5.1. Datasets
The dataset used for the experiments is the SCB-Dataset itself, divided into two main parts:
- Object Detection Dataset: Used for training and evaluating models that detect and localize individual student and teacher behaviors within images. It comprises 13,330 images and 122,977 labels.
- Image Classification Dataset: Used for training and evaluating Large Vision-Language Models on classifying the overall or group behavior within an image. It contains 21,019 images.

The paper does not provide concrete examples of raw data samples (e.g., a specific image with its full annotation file). However, various figures throughout the paper give visual intuition about the data's form:
- Figure 2: Examples of behavior classes, showing diverse scenes with students and teachers.
- Figure 5: Illustrates dataset challenges like pixel differences, dense environments, and similar behaviors, offering a glimpse into the visual complexity.
- Figure 6: Demonstrates the diversity of the dataset across angles, classroom types, learning stages, and ethnic groups.
- Figure 7: Shows a YOLOv7 detection result, with bounding boxes drawn around a teacher, blackboard, and screen, along with labels and confidence scores.
- Figure 10: Presents an example of LVLM output for image classification, showing a classroom scene with students discussing and the model outputting "discuss."
- Figure 17: Provides a collage of example images for the various classes, offering an overview of the visual content for each behavior.
This figure shows example images from the SCB dataset, covering student and teacher classroom behaviors across all 19 classes; each class is illustrated by multiple images, including hand-raising, answering, and discussing.
Figure 17: Examples of images from various classes in the SCB dataset.
These datasets were chosen because they are specifically designed to address the research gap in classroom behavior analysis. They offer a comprehensive and diverse collection of real-world classroom scenarios, making them effective for validating the performance of computer vision methods in this domain.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a detailed explanation:
For Object Detection:
The metrics used are Precision (P), Recall (R), mAP@0.5, and mAP@0.95.
- Precision (P)
  - Conceptual Definition: Precision measures the accuracy of the model's positive predictions. In object detection, it quantifies how many of the detected objects are actually correct; high precision indicates few false positives (incorrect detections).
  - Mathematical Formula: $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
  - Symbol Explanation:
    - TP (True Positives): Objects correctly detected by the model that actually exist in the image.
    - FP (False Positives): Objects detected by the model that do not exist in the image, or that were detected with the wrong class or an IoU below the threshold.
- Recall (R)
  - Conceptual Definition: Recall measures the model's ability to find all relevant objects. In object detection, it quantifies how many of the actual objects present in the image were successfully detected; high recall indicates few false negatives (missed objects).
  - Mathematical Formula: $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
  - Symbol Explanation:
    - TP (True Positives): Objects correctly detected by the model that actually exist in the image.
    - FN (False Negatives): Objects that exist in the image but were not detected by the model.
- mAP@0.5 (mean Average Precision at IoU threshold 0.5)
  - Conceptual Definition: mAP is a standard object detection metric that assesses overall detection accuracy across all classes. It averages the Average Precision (AP) of each class, where AP is the area under the Precision-Recall curve. The @0.5 means a detection counts as a True Positive only if its Intersection over Union (IoU) with a ground-truth bounding box is at least 0.5; IoU measures the overlap between the predicted and ground-truth boxes.
  - Mathematical Formulas:
    $\text{IoU}(B_p, B_{gt}) = \frac{\text{Area}(B_p \cap B_{gt})}{\text{Area}(B_p \cup B_{gt})}$
    $\text{AP}_c = \int_0^1 P(R)\, dR$
    $\text{mAP} = \frac{1}{N} \sum_{c=1}^{N} \text{AP}_c$
  - Symbol Explanation:
    - $B_p$: Predicted bounding box.
    - $B_{gt}$: Ground-truth bounding box.
    - $\text{Area}(B_p \cap B_{gt})$: Area of intersection between the predicted and ground-truth boxes.
    - $\text{Area}(B_p \cup B_{gt})$: Area of union between the predicted and ground-truth boxes.
    - $\text{AP}_c$: Average Precision for class $c$.
    - $P(R)$: Precision as a function of Recall.
    - $N$: Number of object classes.
- mAP@0.95 (mean Average Precision at IoU threshold 0.95)
  - Conceptual Definition: Similar to mAP@0.5, but with a much stricter IoU threshold of 0.95, which heavily penalizes inaccurate bounding-box localization. A high mAP@0.95 indicates that the model not only identifies objects correctly but also localizes them almost perfectly.
  - Mathematical Formula: Same as mAP, but the TP condition for the AP calculation requires $\text{IoU} \geq 0.95$.
  - Symbol Explanation: Same as mAP@0.5.
For Image Classification:
The metrics used are Precision (P), Recall (R), and F1 Score.
- Precision (P)
  - Conceptual Definition: For image classification, precision for a given class measures the proportion of images classified as that class that were actually correct.
  - Mathematical Formula: (same as for object detection, applied per class) $\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$
  - Symbol Explanation:
    - TP (True Positives): Images correctly classified as the target class.
    - FP (False Positives): Images incorrectly classified as the target class.
- Recall (R)
  - Conceptual Definition: For image classification, recall for a given class measures the proportion of images actually belonging to that class that were correctly identified.
  - Mathematical Formula: (same as for object detection, applied per class) $\text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}$
  - Symbol Explanation:
    - TP (True Positives): Images correctly classified as the target class.
    - FN (False Negatives): Images that truly belong to the target class but were not classified as such.
- F1 Score
  - Conceptual Definition: The F1 Score is the harmonic mean of Precision and Recall. It provides a single score that balances both metrics and is particularly useful when the class distribution is uneven or when both false positives and false negatives are costly.
  - Mathematical Formula: $\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  - Symbol Explanation:
    - Precision: The precision value for the class.
    - Recall: The recall value for the class.
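To tie these definitions together, here is a small self-contained sketch (not the paper's evaluation code) computing IoU and per-image precision/recall at a single IoU threshold:

```python
# Minimal illustration of IoU-based matching and precision/recall at one
# IoU threshold. Greedy matching is a simplification of real evaluators.
def iou(a: tuple, b: tuple) -> float:
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def precision_recall(preds: list, gts: list, thr: float = 0.5) -> tuple:
    """Greedily match predictions to ground truths; count TP/FP/FN."""
    matched, tp = set(), 0
    for p in preds:
        best = max(range(len(gts)), key=lambda i: iou(p, gts[i]), default=None)
        if best is not None and best not in matched and iou(p, gts[best]) >= thr:
            matched.add(best)
            tp += 1
    fp, fn = len(preds) - tp, len(gts) - len(matched)
    return tp / (tp + fp), tp / (tp + fn)

# One correct detection plus one spurious box -> precision 0.5, recall 1.0
print(precision_recall([(0, 0, 10, 10), (50, 50, 60, 60)], [(1, 1, 10, 10)]))
```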
5.3. Baselines
For Object Detection:
The paper's method is compared against various versions of the YOLO (You Only Look Once) series models, specifically YOLOv7 for the main benchmark and YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12, and YOLOv13 for a comparative study on a subset of the dataset (teacher behavior part).
- Why these baselines are representative: The YOLO series is widely recognized as state-of-the-art, highly efficient object detection models, particularly suitable for real-time applications. Comparing a range of YOLO versions (from v5 to v13) provides a strong benchmark against current best practices in object detection, demonstrating both the dataset's difficulty and the relative performance of different model generations.
For Image Classification:
The paper uses Qwen2.5-VL-7B-instruct, a Large Vision-Language Model (LVLM), as the baseline.
- Why this baseline is representative: LVLMs represent the cutting edge of multimodal understanding. Using Qwen2.5-VL-7B-instruct, a powerful and recent LVLM, demonstrates the dataset's applicability to advanced AI models and establishes a strong baseline for the image classification task, especially for behaviors requiring the contextual understanding that LVLMs excel at. The LoRA (Low-Rank Adaptation) method is employed for efficient fine-tuning of this LVLM.
5.4. Experimental Environment
The experiments were conducted using the following hardware and software specifications:
- GPU: NVIDIA vGPU-32GB GPU with 32GB of video memory.
- CPU: 12 vCPU Intel(R) Xeon(R) Platinum 8352V CPU.
- Operating System: Ubuntu 22.04.
- Software Versions:
  - PyTorch: 2.3.0
  - Python: 3.12
  - CUDA: 12.1
5.5. Model Training
For the Object Detection Dataset (a training sketch follows this list):

- Model: YOLO series algorithms.
- Epochs: Ranged from 30 to 120.
- Batch Size: Set to 4.
- Image Size: Images were resized to a fixed square input resolution.
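As a concrete illustration, a YOLO training run with these settings might look like the following sketch using the ultralytics package. This is an assumption, not the paper's script: the checkpoint, dataset YAML, and the 640-pixel image size (a common default; the paper's value did not survive extraction) are all illustrative.

```python
# Hypothetical training sketch with the ultralytics package. Checkpoint name,
# dataset YAML path, and imgsz=640 are assumptions for illustration.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # pre-trained checkpoint (assumed)
model.train(
    data="scb_dataset.yaml",        # hypothetical dataset config with classes/paths
    epochs=120,                     # paper reports 30-120 epochs
    batch=4,                        # batch size from the paper
    imgsz=640,                      # assumed input resolution
)
metrics = model.val()               # reports precision, recall, mAP@0.5, mAP@0.5:0.95
```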
For the Image Classification Dataset:

- Framework: LLaMA Factory.
- LVLM: Qwen2.5-VL-7B-instruct.
- Fine-tuning Method: LoRA (Low-Rank Adaptation).
- Learning Rate: A fixed learning rate was used.
- Number of Training Epochs: Set to 2.
- Batch Size: Set to 2.
- LoRA Rank: Set to 8.
- Scaling Factor: Set to 16.
- Dropout Rate: Set to 0.1.
- LoRA+ Learning Rate Ratio: Set to 16.
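For reference, these hyperparameters could be collected into a LLaMA Factory-style configuration. The key names below follow common LLaMA Factory conventions and are assumptions rather than the paper's released config; the learning-rate value is a placeholder.

```python
# Hypothetical LLaMA Factory-style fine-tuning configuration assembled from
# the reported hyperparameters. Key names and the learning rate are
# assumptions, not the paper's released files.
lora_config = {
    "model_name_or_path": "Qwen/Qwen2.5-VL-7B-Instruct",
    "finetuning_type": "lora",
    "lora_rank": 8,                      # LoRA rank from the paper
    "lora_alpha": 16,                    # scaling factor from the paper
    "lora_dropout": 0.1,                 # dropout rate from the paper
    "loraplus_lr_ratio": 16,             # LoRA+ learning-rate ratio from the paper
    "num_train_epochs": 2,               # training epochs from the paper
    "per_device_train_batch_size": 2,    # batch size from the paper
    "learning_rate": 1e-4,               # assumed value; see the original paper
}
```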
6. Results & Analysis
6.1. Core Results Analysis
Object Detection Results (YOLOv7)
The paper presents the benchmark results of YOLOv7 on the Object Detection dataset of SCB-Dataset in Table 2. The performance is evaluated using Precision (P), Recall (R), mAP@0.5, and mAP@0.95.
The following are the results from Table 2 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| hand-raising | 79.4 | 76.9 | 79.2 | 59.4 |
| read | 65.5 | 68.2 | 70.5 | 52.9 |
| write | 68.4 | 67.8 | 72.2 | 58.1 |
| discuss | 67.5 | 72.5 | 74.7 | 39.3 |
| bow the head | 26.0 | 34.0 | 21.9 | 7.9 |
| turn the head | 23.8 | 44.0 | 23.5 | 9.5 |
| guide | 88.5 | 78.3 | 83.6 | 48.9 |
| answer | 86.2 | 86.6 | 91.5 | 80.8 |
| on-stage interaction | 82.3 | 84.5 | 90.1 | 81.5 |
| blackboard-writing | 91.0 | 93.5 | 96.4 | 86.6 |
| teacher | 95.5 | 95.2 | 97.7 | 83.0 |
| stand | 93.1 | 94.7 | 96.6 | 79.8 |
| screen | 96.1 | 97.1 | 97.9 | 92.5 |
| blackboard | 96.2 | 97.1 | 98.1 | 93.3 |
Analysis:
- Strong Performance: Many classes achieve high mAP@0.5 scores, with "blackboard-writing," "teacher," "stand," "screen," and "blackboard" all exceeding 96%. "Answer" and "on-stage interaction" also perform exceptionally well, with mAP@0.5 over 90%. This indicates that YOLOv7 is highly effective at detecting and localizing these distinct objects and behaviors. The paper suggests that an mAP@0.5 above 70% makes a behavior practically applicable, and the majority of classes meet this criterion.
- Moderate Performance: "Hand-raising," "read," "write," "discuss," and "guide" show good performance, with mAP@0.5 ranging from 70.5% to 83.6%. These behaviors are more complex than static objects but still detectable with reasonable accuracy.
- Challenging Classes: "Bow the head" and "turn the head" exhibit significantly lower mAP@0.5 scores (21.9% and 23.5%, respectively) and very low mAP@0.95 scores (7.9% and 9.5%). This highlights the difficulty of accurately detecting and localizing these subtle head movements, likely due to small object size, occlusion, and similarity to other non-behavioral head positions. These remain areas for future improvement.

An example of YOLOv7 detection results is shown in Figure 7. It visually demonstrates the model's ability to identify and bound objects like the teacher, blackboard, and screen, along with their confidence scores.
This figure is an example of YOLOv7 detection results, showing a teacher lecturing in front of the blackboard; the teacher's action and environment elements, including the blackboard and screen, are annotated with recognition confidences.
Figure 7: Example of YOLOv7 detection results
Image Classification Results (Qwen2.5-VL-7B-instruct)
Table 3 presents the training results of the Qwen2.5-VL-7B-instruct model on the Image Classification dataset, evaluated using Precision (P), Recall (R), and F1 Score.
The following are the results from Table 3 of the original paper:
| class | P | R | f1 |
| hand-raising | 87.0 | 85.6 | 86.3 |
| read and write | 83.6 | 93.9 | 88.5 |
| discuss | 93.9 | 90.2 | 92.0 |
| student blackboard-writing | 83.3 | 88.2 | 85.7 |
| on-stage presentation | 100 | 70.0 | 82.4 |
| answering questions | 75.0 | 69.2 | 72.0 |
| reading aloud | 100 | 69.2 | 81.8 |
| listen | 88.1 | 89.2 | 88.6 |
| guide | 87.0 | 50.8 | 64.2 |
| answer | 87.6 | 83.4 | 85.4 |
| on-stage interaction | 89.2 | 73.5 | 80.6 |
| blackboard-writing | 99.0 | 98.5 | 98.8 |
| teach | 87.4 | 92.1 | 89.7 |
| patrol | 42.3 | 87.1 | 57.0 |
| all | 86.1 | 83.4 | 83.8 |
Analysis:
- Overall Strong Performance: The F1 scores for most behaviors are well above 80%, with an overall F1 score of 83.8%. This demonstrates that LVLMs can effectively classify complex classroom behaviors based on the entire image context.
- Excellent Performance: "Blackboard-writing" (98.8% F1) and "discuss" (92.0% F1) show particularly high F1 scores. "Read and write," "listen," and "teach" also perform very well (88.5%, 88.6%, and 89.7% F1, respectively). "On-stage presentation" and "reading aloud" achieved perfect precision (100%), though their recall was lower, indicating some instances were missed.
- Areas for Improvement: "Patrol" (57.0% F1) and "guide" (64.2% F1) show lower performance. "Patrol" has very low precision (42.3%) despite high recall, suggesting many false positives (images incorrectly classified as "patrol"). "Guide" has low recall (50.8%), indicating it often misses actual "guide" behaviors. These behaviors likely require more nuanced contextual cues or temporal information than a single image provides. "Answering questions" also has a relatively modest F1 score of 72.0%; a worked example of how these F1 values follow from precision and recall is given below.
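As a check on the reported numbers, the "patrol" row reproduces its F1 from the definition in Section 5.2 (values from Table 3):

$\text{F1}_{\text{patrol}} = 2 \times \frac{0.423 \times 0.871}{0.423 + 0.871} \approx 2 \times \frac{0.368}{1.294} \approx 0.570,$

matching the 57.0% in the table.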
The training process for the LVLM shows a rapid decrease in loss during the initial iterations, stabilizing towards the end, as depicted in Figure 9. The final loss value reached 0.0471.
This figure is a loss curve over training steps, with two traces: the original loss and a smoothed loss. The loss decreases as training proceeds, indicating improving model performance.
Figure 9: Loss of Qwen2.5-VL-7B-instruct during the training iteration process
Figure 10 provides a visual example of the Qwen2.5-VL-7B-instruct model classifying a classroom image as "discuss" given an image and a prompt.
This figure shows a classroom scene of students actively participating and interacting in discussion; following the example output format, the recognized behavior class is "discuss."
Figure 10: Using the LLaMA Factory framework to test the Qwen2.5-VL-7B-instruct example
6.2. Data Presentation (Tables)
SCB-Dataset Training and Validation Data Statistics (Appendix A.3)
The following are the results from Table 7 of the original paper:
| Class: Object Detection | Train | Val |
| hand-raising | 10538 | 2915 |
| read | 17539 | 6539 |
| write | 6447 | 3394 |
| discuss | 3607 | 1785 |
| bow the head | 4422 | 540 |
| turn the head | 7943 | 3213 |
| guide | 1155 | 449 |
| answer | 2574 | 853 |
| on-stage interaction | 528 | 149 |
| blackboard-writing | 821 | 277 |
| teacher | 8490 | 3228 |
| stand | 13932 | 4967 |
| screen | 5025 | 1959 |
| blackboard | 7847 | 3445 |
| Class: Image Classification | Train | Val |
| hand-raising | 1472 | 187 |
| read and write | 814 | 98 |
| discuss | 703 | 51 |
| student blackboard-writing | 211 | 17 |
| on-stage presentation | 34 | 10 |
| answering questions | 639 | 52 |
| reading aloud | 134 | 13 |
| listen | 2294 | 157 |
| guide | 1584 | 185 |
| answer | 3938 | 439 |
| on-stage interaction | 816 | 113 |
| blackboard-writing | 1703 | 204 |
| teach | 3088 | 240 |
| patrol | 1722 | 101 |
6.3. Ablation Studies / Parameter Analysis
The paper includes a comparative analysis of different YOLO series models on the teacher behavior part of the Object Detection dataset. This serves as an assessment of model choice and performance evolution across YOLO generations.
Figure 8 visually summarizes the mAP@0.5 scores for YOLOv5 to YOLOv13 on the teacher behavior subset.
This figure is a bar chart of mAP@0.5 (mAP50) scores of different YOLO models on the teacher-behavior part of SCB-Dataset; YOLOv7 performs best at 94.0, with other models such as YOLOv8 at 93.6 and YOLOv5 at 88.1.
Figure 8: Training and testing results of the SCB-Dataset dataset (teacher behavior part) on YOLO series models
Analysis of Figure 8:
- YOLOv7 achieves the highest mAP@0.5, at 94.0%.
- YOLOv8 is a close second at 93.6%.
- YOLOv5 shows the lowest mAP@0.5, at 88.1%.
- Interestingly, YOLOv9 through YOLOv13 show generally average performance compared to YOLOv7 and YOLOv8, with scores around 91-92%. This suggests that, for this specific task and dataset subset, newer YOLO versions do not necessarily improve on YOLOv7 or YOLOv8; their improvements may target other aspects, such as speed or architectural features not fully leveraged here.

Detailed results for each YOLO version on the teacher behavior part are provided in the tables in Appendix A.3:
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv5 (Appendix A.3)
The following are the results from Table 8 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 81.8 | 84.1 | 88.1 | 67.3 |
| guide | 68.2 | 59.1 | 66.0 | 31.5 |
| answer | 67.4 | 81.7 | 80.8 | 57.1 |
| On-stage interaction | 52.0 | 82.8 | 80.0 | 59.2 |
| blackboard-writing | 97.7 | 70.4 | 96.3 | 73.9 |
| teacher | 95.1 | 91.9 | 95.7 | 71.9 |
| stand | 88.9 | 92.3 | 90.4 | 63.9 |
| screen | 91.6 | 97.5 | 98.0 | 90.0 |
| blackboard | 93.4 | 97.1 | 98.0 | 90.7 |
Training results of SCB-Dataset on YOLOv7 (Appendix A.3)
The following are the results from Table 9 of the original paper:
| Dataset | class | P | R | mAP@0.5 | mAP@.95 |
| SCB5-A | all | 71.1 | 70.9 | 74.0 | 56.8 |
| | hand-raising | 79.4 | 76.9 | 79.2 | 59.4 |
| | read | 65.5 | 68.2 | 70.5 | 52.9 |
| | write | 68.4 | 67.8 | 72.2 | 58.1 |
| SCB5-B | all | 94.5 | 97.3 | 98.7 | 91.9 |
| | screen | 94.8 | 95.7 | 98.2 | 95.1 |
| | blackboard | 94.2 | 98.9 | 99.2 | 88.8 |
| SCB5-C | all/discuss | 67.5 | 72.5 | 74.7 | 39.3 |
| SCB5-D | all | 85.5 | 82.6 | 86.4 | 67.2 |
| | guide | 88.0 | 81.7 | 87.0 | 49.5 |
| | answer | 89.3 | 88.0 | 92.3 | 76.7 |
| | stage interaction | 69.9 | 65.2 | 68.5 | 54.7 |
| | board writing | 94.5 | 95.6 | 97.7 | 87.9 |
| SCB5-E | all/stand | 95.8 | 91.7 | 96.6 | 80.5 |
| SCB5-F | all/teacher | 96.2 | 94.4 | 97.7 | 82.7 |
| SCB5-G | all/bow-head | - | - | - | - |
| | turn-head | - | - | - | - |
| SCB5-H | all/talk | 87.8 | 62.6 | 77.2 | 61.3 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv7 (Appendix A.3)
The following are the results from Table 10 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 91.1 | 90.9 | 94.0 | 80.8 |
| guide | 88.5 | 78.3 | 83.6 | 48.9 |
| answer | 86.2 | 86.6 | 91.5 | 80.8 |
| On-stage interaction | 82.3 | 84.5 | 90.1 | 81.5 |
| blackboard-writing | 91.0 | 93.5 | 96.4 | 86.6 |
| teacher | 95.5 | 95.2 | 97.7 | 83.0 |
| stand | 93.1 | 94.7 | 96.6 | 79.8 |
| screen | 96.1 | 97.1 | 97.9 | 92.5 |
| blackboard | 96.2 | 97.1 | 98.1 | 93.3 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv8 (Appendix A.3)
The following are the results from Table 11 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 90.6 | 89.2 | 93.6 | 83.1 |
| guide | 81.5 | 67.5 | 79.4 | 53.5 |
| answer | 87.2 | 87.5 | 92.9 | 86.4 |
| On-stage interaction | 82.5 | 83.2 | 88.9 | 80.1 |
| blackboard-writing | 90.8 | 93.9 | 97.6 | 87.3 |
| teacher | 96.2 | 93.8 | 97.3 | 86.0 |
| stand | 94.0 | 93.7 | 96.5 | 83.0 |
| screen | 95.8 | 96.7 | 97.9 | 93.4 |
| blackboard | 96.5 | 97.2 | 98.3 | 94.7 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv9 (Appendix A.3)
The following are the results from Table 12 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 87.8 | 87.7 | 91.8 | 78.2 |
| guide | 79.3 | 63.7 | 74.3 | 45.0 |
| answer | 79.7 | 84.5 | 89.5 | 78.7 |
| On-stage interaction | 71.7 | 79.9 | 84.7 | 72.6 |
| blackboard-writing | 95.0 | 95.1 | 97.9 | 84.5 |
| teacher | 94.2 | 93.1 | 96.9 | 81.8 |
| stand | 92.2 | 92.5 | 95.4 | 77.1 |
| screen | 94.6 | 96.3 | 97.6 | 90.9 |
| blackboard | 95.5 | 96.6 | 98.2 | 94.4 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv10 (Appendix A.3)
The following are the results from Table 13 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 86.9 | 84.1 | 90.1 | 76.2 |
| guide | 79.7 | 58.3 | 71.1 | 41.8 |
| answer | 80.9 | 78.4 | 86.1 | 77.0 |
| On-stage interaction | 80.1 | 78.6 | 86.0 | 77.5 |
| blackboard-writing | 86.3 | 88.6 | 94.7 | 79.2 |
| teacher | 90.1 | 88.7 | 95.1 | 77.1 |
| stand | 88.1 | 88.5 | 93.1 | 72.6 |
| screen | 94.9 | 96.0 | 97.4 | 90.9 |
| blackboard | 95.2 | 95.6 | 97.6 | 93.5 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv11 (Appendix A.3)
The following are the results from Table 14 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 87.6 | 90.1 | 92.9 | 81.8 |
| guide | 72.2 | 66.7 | 72.1 | 45.9 |
| answer | 87.1 | 88.8 | 94.1 | 86.6 |
| On-stage interaction | 75.1 | 89.9 | 91.2 | 82.4 |
| blackboard-writing | 92.0 | 93.4 | 97.3 | 86.5 |
| teacher | 94.8 | 94.1 | 97.0 | 85.0 |
| stand | 91.9 | 93.5 | 95.8 | 81.5 |
| screen | 92.7 | 96.9 | 97.4 | 92.1 |
| blackboard | 94.9 | 97.3 | 98.3 | 93.9 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv12 (Appendix A.3)
The following are the results from Table 15 of the original paper:
| class | P | R | mAP@0.5 | mAP@.95 |
| all | 86.2 | 86.6 | 90.6 | 77.1 |
| guide | 78.6 | 58.9 | 71.8 | 43.7 |
| answer | 77.8 | 83.1 | 88.0 | 78.9 |
| On-stage interaction | 66.5 | 81.2 | 82.6 | 69.4 |
| blackboard-writing | 91.4 | 93.5 | 96.4 | 84.2 |
| teacher | 92.6 | 91.9 | 95.9 | 80.2 |
| stand | 91.4 | 90.8 | 94.5 | 75.6 |
| screen | 95.6 | 96.4 | 97.7 | 92.1 |
| blackboard | 95.7 | 96.7 | 98.1 | 92.9 |
Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv13 (Appendix A.3)
The following are the results from Table 16 of the original paper:
| class | P | R | mAP@0.5 | mAP@0.75 | mAP@.95 |
| all | 89.6 | 85.1 | 91.4 | 82.6 | 77.2 |
| guide | 84.6 | 54.6 | 74.8 | 45.1 | 43.4 |
| answer | 82.9 | 80.6 | 87.6 | 80.4 | 76.0 |
| On-stage interaction | 79.0 | 78.2 | 87.1 | 83.5 | 79.8 |
| blackboard-writing | 93.3 | 92.7 | 96.1 | 90.2 | 81.7 |
| teacher | 93.5 | 91.7 | 95.8 | 86.5 | 78.0 |
| stand | 92.2 | 89.0 | 94.0 | 82.3 | 73.7 |
| screen | 95.2 | 97.0 | 98.0 | 96.7 | 92.6 |
| blackboard | 95.9 | 96.7 | 98.1 | 96.1 | 92.9 |
Further Analysis of YOLO Versions on Teacher Behavior:
The detailed tables confirm the trend shown in Figure 8. While all YOLO versions show generally strong performance on most teacher-related classes (e.g., "teacher," "blackboard-writing," "screen," "blackboard," "stand"), there are notable differences:
- YOLOv7 and YOLOv8 consistently achieve higher mAP@0.5 and mAP@0.95 for most behavioral classes compared to YOLOv5 and the later versions (YOLOv9 to YOLOv13).
- Classes like "guide" and "on-stage interaction" often have lower mAP scores across all YOLO versions, indicating they are more challenging to detect accurately, possibly due to their contextual nature or variability.
- The mAP@0.95 scores are significantly lower than mAP@0.5 for many behavioral classes, highlighting that precise localization of bounding boxes for behaviors is harder than simply identifying their presence. Static objects like "screen" and "blackboard" generally maintain high mAP@0.95 because their boundaries are clearer.

This comparative analysis demonstrates the utility of SCB-Dataset for benchmarking and selecting appropriate object detection models for specific classroom behavior tasks. The findings suggest that YOLOv7 provides a good balance of accuracy and efficiency for this domain.
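For readers who want to reproduce per-class tables like those above, the following is a minimal sketch using the Ultralytics YOLO API; the checkpoint and dataset YAML (`scb_teacher.yaml`) are hypothetical placeholders, not artifacts released with the paper. Note that Ultralytics reports the COCO-style mAP@0.5:0.95 average rather than a single mAP@0.95 threshold.

```python
# A minimal sketch, assuming the current Ultralytics API; the checkpoint and
# dataset YAML names are hypothetical placeholders, not from the paper.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")  # substitute any trained detector checkpoint
metrics = model.val(data="scb_teacher.yaml")  # validate on the teacher split

# Aggregate metrics, corresponding to the "all" row in the tables above.
print(f"all: mAP@0.5={metrics.box.map50:.3f}  mAP@0.5:0.95={metrics.box.map:.3f}")

# Per-class mAP@0.5:0.95, indexed by class id, as in the per-class rows.
for cls_id, name in model.names.items():
    print(f"{name}: {metrics.box.maps[cls_id]:.3f}")
```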
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces SCB-Dataset, a novel and comprehensive dataset specifically designed for detecting student and teacher classroom behaviors using deep learning methods. It addresses a critical gap in the availability of high-quality, publicly accessible datasets tailored for educational environments. The dataset is notable for its coverage of 19 distinct behavior classes and its innovative dual structure, offering an Object Detection part (13,330 images, 122,977 labels) for fine-grained individual actions and an Image Classification part (21,019 images) for holistic, context-dependent behaviors. Through rigorous data collection and an iteratively improved annotation process, SCB-Dataset captures significant real-world diversity and challenges. Benchmark tests using YOLO series algorithms for object detection and a Large Vision-Language Model (Qwen2.5-VL-7B-instruct) for image classification establish baseline performances. The results demonstrate the practical applicability of YOLOv7 for many object detection tasks and the strong potential of LVLMs for image classification of complex classroom behaviors, thereby promoting the application of artificial intelligence in education.
7.2. Limitations & Future Work
The authors implicitly acknowledge several limitations through their discussion of dataset challenges and the performance of specific models/classes:
- Challenging Behaviors for Object Detection: Classes like "bow the head" and "turn the head" still yield low mAP scores, indicating the inherent difficulty of precisely detecting and localizing subtle human movements due to factors like pixel differences, occlusion, and similarity to other behaviors.
- Contextual Behaviors for Image Classification: Behaviors like "patrol" and "guide" show relatively lower F1 scores in image classification, suggesting that these actions might require more than single-frame visual information and could benefit from temporal context (i.e., video-based analysis). A minimal per-class F1 sketch follows this list.
- Class Imbalance: Despite efforts to mitigate it, class imbalance remains a challenge within the dataset, which can impact model training and performance for rare classes (a quick way to audit this is sketched after the future-work list below).
- Static Nature: While image classification captures some context, both object detection and image classification on static frames inherently lack temporal information, which is crucial for fully understanding dynamic behaviors and their progression over time.
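To make the F1 observation concrete, here is a minimal sketch of computing per-class F1 for the Image Classification part with scikit-learn; the label subset and prediction arrays are dummy illustrative values, not SCB-Dataset outputs:

```python
# A minimal sketch with dummy data; label names are a subset chosen for
# illustration, not the full 19-class SCB-Dataset taxonomy.
from sklearn.metrics import classification_report

labels = ["patrol", "guide", "teach", "others"]
y_true = [0, 0, 1, 2, 2, 3, 1, 0]  # ground-truth class ids
y_pred = [0, 1, 1, 2, 3, 3, 1, 0]  # e.g., LVLM outputs mapped to class ids

# Prints per-class precision, recall, and F1; comparatively low F1 on
# "patrol"/"guide" would mirror the limitation noted above.
print(classification_report(y_true, y_pred, target_names=labels, zero_division=0))
```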
As for future work, the paper states: "We are also continuously expanding the scale of the dataset to adapt to various challenges in real environments." This implies:

- Dataset Expansion: Continuously adding more images and annotations to increase the dataset's size and diversity.
- Addressing Challenges: Further improving the dataset to handle the identified challenges such as occlusion, pixel differences, and subtle behaviors.
- Incorporating Temporal Data: While not explicitly stated as future work for the dataset itself, the discussion of the limitations of current approaches for "teach" and "patrol" behaviors suggests that future iterations might involve video annotations or spatiotemporal modeling.
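As a companion to the class-imbalance point above, a simple label census is a cheap first step when deciding where expansion matters most. A minimal sketch, assuming annotations exported as YOLO-format text files (the `labels/train` layout is an assumed convention, not the paper's release format):

```python
# A minimal sketch: count boxes per class across YOLO-format label files.
# The directory layout is an assumed export convention, not from the paper.
from collections import Counter
from pathlib import Path

counts = Counter()
for txt in Path("labels/train").glob("*.txt"):
    for line in txt.read_text().splitlines():
        if line.strip():
            counts[int(line.split()[0])] += 1  # first field is the class id

for cls_id, n in counts.most_common():
    print(f"class {cls_id}: {n} boxes")
```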
7.3. Personal Insights & Critique
This paper makes a highly valuable contribution to the AI in Education domain. The meticulous effort put into constructing SCB-Dataset, particularly the iterative improvements to the VIA annotation tool and the careful consideration of real-world classroom challenges, is commendable. The dual approach of providing both object detection and image classification data is a thoughtful design choice, recognizing the varied needs of classroom behavior analysis.
Inspirations & Applications:
- Personalized Learning: More accurate behavior detection can lead to real-time feedback systems for teachers, helping them identify disengaged or struggling students and enabling more personalized interventions.
- Teacher Training: Automated analysis of teaching styles and student responses could provide data-driven insights for teacher professional development.
- Classroom Management: Identifying disruptive or off-task behaviors could assist in maintaining a conducive learning environment.
- Broader Behavior Recognition: The methodology for dataset construction, especially the iterative VIA tool improvements and the differential frame selection strategy, could be adopted for creating high-quality datasets in other complex behavior recognition domains (e.g., patient monitoring in healthcare, worker safety in industrial settings).

Potential Issues & Areas for Improvement:
- Ethical Considerations: The paper does not explicitly discuss the ethical implications of student and teacher behavior monitoring. Such systems raise concerns about privacy, surveillance, data security, and potential biases in AI interpretations of behavior. A discussion of these aspects, including consent, anonymization, and responsible deployment, would significantly strengthen the paper's holistic impact.
- Bias in Data Collection: While the paper mentions including data from ethnic minorities, a deeper analysis of potential biases (e.g., gender, socioeconomic background, classroom setup, geographical region) and how they are mitigated (or acknowledged) would be beneficial.
- Temporal Context: The current dataset focuses on static images. While LVLMs can infer some context, many behaviors (e.g., "patrol," "teach," "guide," "discuss") are inherently spatiotemporal. Future extensions to video-based annotations (e.g., AVA-style atomic actions) or event-based sequences would be a natural and crucial progression for capturing the full richness of classroom dynamics. The paper implicitly acknowledges this by excluding "teach" and "patrol" from object detection because they require timing characteristics. A toy smoothing sketch follows this list.
- Long-tail Behaviors: Despite differential frame selection, class imbalance for some behaviors remains. Techniques like few-shot learning or data augmentation specifically tailored to these long-tail classes could be explored in conjunction with this dataset.
- Granularity of "Others": While "others" is a necessary category, a qualitative analysis of what falls into "others" might reveal new, important behaviors that could be explicitly added to the classification scheme in future iterations.
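As one low-cost illustration of adding temporal context before committing to full video annotation, per-frame predictions can be smoothed with a sliding majority vote; this is an illustrative workaround, not a method from the paper:

```python
# A minimal sketch: majority-vote smoothing of per-frame class predictions.
# Window size is arbitrary; class ids here are dummy values.
from collections import Counter

def smooth_predictions(frame_preds: list[int], window: int = 5) -> list[int]:
    """Replace each frame's label with the majority label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_preds)):
        ctx = frame_preds[max(0, i - half): i + half + 1]
        smoothed.append(Counter(ctx).most_common(1)[0][0])
    return smoothed

# An isolated flicker (class 1) inside a stable segment (class 0) is removed:
print(smooth_predictions([0, 0, 1, 0, 0, 0, 2, 2, 2]))
# -> [0, 0, 0, 0, 0, 0, 2, 2, 2]
```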
Overall, SCB-Dataset is a foundational work that will undoubtedly accelerate research in AI for education. Addressing the limitations, particularly the ethical aspects and extending to spatiotemporal data, would solidify its position as a gold standard in the field.