Paper status: completed

SCB-Dataset: A Dataset for Detecting Student and Teacher Classroom Behavior

Published: 04/05/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The paper presents SCB-Dataset, the first large-scale public dataset covering 19 classroom behavior classes for students and teachers, addressing data scarcity in education. The object detection part includes 13,330 images and 122,977 labels and the image classification part includes 21,019 images; benchmark tests with YOLO-series detectors and a large vision-language model establish baseline results for AI applications in education.

Abstract

Using deep learning methods to detect the classroom behaviors of both students and teachers is an effective way to automatically analyze classroom performance and enhance teaching effectiveness. However, there is still a scarcity of publicly available high-quality datasets on student-teacher behaviors. We constructed SCB-Dataset, a comprehensive dataset of student and teacher classroom behaviors covering 19 classes. SCB-Dataset is divided into two types: Object Detection and Image Classification. The Object Detection part includes 13,330 images and 122,977 labels, and the Image Classification part includes 21,019 images. We conducted benchmark tests on SCB-Dataset using YOLO series algorithms and a large vision-language model. We believe that SCB-Dataset can provide a solid foundation for future applications of artificial intelligence in education. Code: https://github.com/Whiffe/SCB-dataset

In-depth Reading

1. Bibliographic Information

1.1. Title

SCB-Dataset: A Dataset for Detecting Student and Teacher Classroom Behavior

1.2. Authors

Fan Yang

1.3. Journal/Conference

The paper is published on arXiv, a preprint server, indicating it has not yet undergone formal peer review or been accepted by a specific journal or conference. However, arXiv is widely used in the academic community for disseminating research quickly.

1.4. Publication Year

2023 (Published at UTC: 2023-04-05T15:02:30.000Z)

1.5. Abstract

This paper introduces SCB-Dataset, a new, high-quality, publicly available dataset designed for detecting the classroom behaviors of both students and teachers using deep learning methods. It addresses the scarcity of such datasets in the educational field. SCB-Dataset covers 19 distinct behavior classes and is organized into two main types: an Object Detection part, comprising 13,330 images and 122,977 labels, and an Image Classification part, containing 21,019 images. The authors conducted benchmark tests on this dataset using YOLO series algorithms for object detection and Large Vision-Language Models (LVLMs) for image classification. The results demonstrate the dataset's utility and provide baseline performance figures, aiming to establish a foundation for future artificial intelligence applications in education.

Original Source Link: https://arxiv.org/abs/2304.02488
PDF Link: https://arxiv.org/pdf/2304.02488v7.pdf
Publication Status: This is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the scarcity of publicly available, high-quality datasets for detecting student and teacher classroom behaviors. While deep learning methods, particularly computer vision (CV) techniques, have shown great promise in automatically analyzing classroom performance and enhancing teaching effectiveness, their application is severely limited by the lack of suitable training data.

This problem is important because classroom behavior analysis can provide valuable insights into students' learning, personality, and psychological traits, offer feedback on teachers' teaching processes, and contribute to overall education quality evaluations. Existing computer vision datasets (e.g., COCO, CrowdHuman, AVA) are either general-purpose or focus on broader human actions, often lacking the specific context, diversity, and annotation quality required for nuanced educational scenarios. Some existing student behavior datasets are either closed-source, have limited samples, or suffer from quality issues like non-standard bounding boxes and labeling errors.

The paper's entry point is to construct a comprehensive, diverse, and high-quality dataset, SCB-Dataset, specifically tailored for the education domain. Its innovative idea is to provide data for two distinct but complementary computer vision tasks: object detection (for locating and classifying individual behaviors) and image classification (for categorizing overall classroom group behaviors or complex actions that depend on global image context).

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. First Public Comprehensive Dataset: SCB-Dataset is presented as the first public dataset for student and teacher classroom behaviors in real educational environments, boasting the most classes (19 in total) and a large number of images. It addresses a significant gap in research on student classroom behavior within teaching scenarios.

  2. Dual Dataset Types: The dataset is divided into two distinct types:

    • Object Detection Dataset: This part allows for the precise localization of individual students and teachers and the classification of their behaviors, making it suitable for traditional deep convolutional neural network (CNN) models like YOLO. It includes 13,330 images and 122,977 labels.
    • Image Classification Dataset: This part focuses on classifying the overall behavior of students and teachers in an image, suitable for Large Vision-Language Models (LVLMs). It includes 21,019 images and is designed for behaviors that require global image context (e.g., a teacher patrolling).
  3. Benchmark Testing and Baselines: The paper conducts benchmark tests on SCB-Dataset using YOLO series algorithms for object detection and Qwen2.5-VL-7B-instruct (LVLM) for image classification. These results provide valuable baseline data references for future research and development in this field.

    The key findings from the benchmark tests indicate that:

  • YOLOv7 performs strongly on the object detection task, with mAP@0.5 scores above 70% for most behaviors, and even above 90% for half of them, suggesting practical applicability for many classes. However, some behaviors like "bow the head" and "turn the head" remain challenging.
  • Qwen2.5-VL-7B-instruct demonstrates significant training effectiveness for image classification, with F1 scores mostly above 80% for various student and teacher behaviors. This highlights the potential of LVLMs for capturing complex, context-dependent classroom actions.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with fundamental deep learning and computer vision concepts:

  • Deep Learning (DL): A subfield of machine learning that uses artificial neural networks with multiple layers (hence "deep") to learn representations of data with multiple levels of abstraction. It has revolutionized tasks like image recognition, natural language processing, and speech recognition.

  • Computer Vision (CV): A field of artificial intelligence (AI) that enables computers to "see" and interpret visual data from the world (e.g., images, videos). Common tasks include object recognition, object detection, object tracking, instance segmentation, and action recognition.

  • Object Detection: A computer vision task that involves identifying the presence of objects in an image or video and drawing a bounding box (bbox) around each object, while also classifying what type of object it is. For example, in a classroom image, object detection could identify each student, draw a box around them, and label them as "student."

  • Image Classification: A computer vision task that assigns a single category label to an entire image. For example, an image classification model might determine if an image shows "a classroom with students discussing" or "a teacher lecturing."

  • Action Recognition: The task of identifying human actions or activities from videos or images. This can range from simple actions (e.g., "walking") to complex ones (e.g., "playing basketball").

  • Pose Estimation: A computer vision technique that detects and localizes key points (joints) of a person's body (e.g., shoulders, elbows, knees) in an image or video. This information can then be used to understand body posture and movement.

  • Large Vision-Language Models (LVLMs): A new class of AI models that combine capabilities from both computer vision and natural language processing (NLP). They can process and understand both images and text simultaneously, enabling tasks like visual question answering (VQA) (answering questions about an image) or image captioning (generating descriptive text for an image). The "large" refers to their massive scale, often having billions of parameters, similar to Large Language Models (LLMs).

  • YOLO (You Only Look Once): A popular family of object detection algorithms known for its speed and accuracy. Unlike traditional object detection systems that separate region proposal and classification, YOLO predicts bounding boxes and class probabilities in a single pass of a neural network, making it very efficient for real-time applications. The paper mentions various versions like YOLOv7, YOLOv5, YOLOv8, etc., which are incremental improvements over the original YOLO architecture.

  • Evaluation Metrics for Object Detection:

    • Precision (P): Measures the accuracy of positive predictions. It is the ratio of correctly predicted positive observations to the total predicted positive observations. $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
      • True Positives (TP): Correctly detected objects.
      • False Positives (FP): Incorrectly detected objects (e.g., detecting something that isn't there, or detecting the wrong class).
    • Recall (R): Measures the ability of the model to find all relevant cases (all true positives). It is the ratio of correctly predicted positive observations to all observations in the actual class. $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
      • False Negatives (FN): Objects that were present but not detected by the model.
    • mAP (mean Average Precision): A common metric for object detection that averages the Average Precision (AP) over all object classes. AP is the area under the Precision-Recall curve.
      • mAP@0.5: Calculates mAP with an Intersection over Union (IoU) threshold of 0.5. IoU measures the overlap between the predicted bounding box and the ground truth bounding box; a threshold of 0.5 means that a detection is considered TP if its IoU with a ground truth box is greater than 0.5.
      • mAP@0.95: Calculates mAP with an IoU threshold of 0.95, meaning detections must be very precise to be counted as TP. This is a stricter measure of localization accuracy.
  • Evaluation Metrics for Image Classification:

    • F1 Score: The harmonic mean of Precision and Recall. It balances both metrics, being particularly useful when there is an uneven class distribution. $ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
  • LoRA (Low-Rank Adaptation): A technique used to fine-tune large language models and large vision-language models more efficiently. Instead of fine-tuning all the parameters of a large pre-trained model, LoRA injects small, trainable low-rank matrices into the transformer architecture. This significantly reduces the number of trainable parameters and computational cost, making fine-tuning more accessible.
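
As a rough illustration of the low-rank idea (not the implementation used in this paper, which fine-tunes through LLaMA Factory), a linear layer wrapped with LoRA might look like the sketch below; the rank of 8 and scaling factor of 16 mirror the hyperparameters reported later in Section 5.5:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: starts as a no-op
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), rank=8, alpha=16)
out = layer(torch.randn(2, 768))                      # only A and B receive gradients
```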

3.2. Previous Works

The paper reviews existing approaches and datasets for student classroom behavior detection, categorizing them mainly into:

  • Video Action Recognition-based: These methods recognize continuous behaviors from video streams. The paper notes that they often require a very large number of annotations, citing AVA dataset (1.58M annotations) for SlowFast detection. Examples of general video datasets include UCF101 and Kinetics400.

  • Pose Estimation-based: These algorithms characterize human behavior by obtaining joint position and motion information. However, the paper points out their limitation in overcrowded classroom scenarios where dense student populations make accurate pose estimation difficult.

  • Object Detection-based: These methods identify and localize individual behaviors within still frames. The paper highlights recent advancements in this area, such as YOLOv7, making it a promising solution.

    Regarding datasets, the paper specifically mentions:

  • Publicly Available Datasets:

    • STBD-08 (Zhao and Zhu 2023): Claimed 4432 images and 151574 annotations, but the paper found actual quantities to be higher (8884 images, 267888 annotations) and noted significant data quality issues (non-standard bbox, class labeling errors). It also suggests the public version is derived from a paid online dataset.
    • ClaBehavior (Wang et al. 2023b): Paper mentions 1342 images and 9911 annotations, but only 400 publicly available images were found, which is far fewer than stated.
    • SCBehavior (Wang et al. 2024): Similar to ClaBehavior, the paper mentions 1346 images, but only 400 damaged images were found publicly.
    • UK Datasets (Feng et al. 2025): Derived from elementary school classroom videos, it extracted 8754 images for 8 student behaviors. The paper asserts this dataset is also not original and appears to be based on publicly sold data.
    • Roboflow Universe website (reddy projects 2023): A general resource for publicly shared datasets.
    • a student classroom behavior dataset from a paid website (Unknown 2023): Noted as a source for some public datasets.
  • Closed-source Datasets: The paper lists numerous closed-source datasets for student behavior, indicating active research but limited public access. These vary in size and behavior classes, often focusing on student-specific actions like "focus," "distract," "listening," "writing," "playing," "raising hands," etc. Some also include teacher behaviors.

3.3. Technological Evolution

The field of computer vision has seen rapid development, particularly with the advent of deep learning. Early computer vision methods relied on hand-crafted features, which were often brittle and less robust to variations in real-world data. Deep learning, especially Convolutional Neural Networks (CNNs), revolutionized object recognition and detection by automatically learning hierarchical features from data. Early significant datasets like COCO propelled research in object detection. Subsequent advancements focused on real-time performance (YOLO, SSD) and more complex tasks like instance segmentation (Mask R-CNN) and video retrieval. More recently, the emergence of Large Vision-Language Models (LVLMs) like Qwen2.5-VL marks a new frontier, allowing models to process and understand multimodal inputs (vision and language) and perform more nuanced tasks, which is particularly relevant for context-rich domains like classroom behavior analysis.

The paper fits into this timeline by leveraging advanced object detection techniques (YOLO series) and state-of-the-art LVLMs to tackle a specific, challenging domain (classroom behavior). It acknowledges the limitations of simpler datasets and the need for complex and realistic datasets to bridge the gap between model performance on benchmarks and real-world applicability.

3.4. Differentiation Analysis

Compared to the main methods and datasets in related work, SCB-Dataset presents several core differences and innovations:

  • Quality and Scope: Unlike some existing public datasets (STBD-08, ClaBehavior, SCBehavior, UK Datasets) which are criticized for limited quantity, problematic annotations, or being non-original, SCB-Dataset aims to be a high-quality, comprehensive, and original public resource. It explicitly addresses issues like non-standard bounding boxes and class labeling errors found in other datasets through its rigorous annotation process.

  • Comprehensive Behavior Classes: With 19 distinct behaviors, SCB-Dataset offers a wider range of student and teacher behaviors than many existing datasets, which often focus solely on students or a limited set of actions. This comprehensive nature allows for more holistic classroom analysis.

  • Dual-Task Design: The dataset's division into Object Detection and Image Classification parts is a key innovation.

    • Object Detection: Caters to fine-grained, individual behavior analysis, providing bounding box coordinates and classification for each person. This is crucial for fine-grained research and tracking individual student/teacher engagement.
    • Image Classification: Addresses the need for group behavior analysis and behaviors that are difficult to define or recognize via local object features (e.g., "patrol," "listen," "on-stage presentation"), which depend on the overall image context. This leverages the strengths of LVLMs. This dual approach provides flexibility and caters to different analytical needs, covering both individual and collective classroom dynamics.
  • Real-world Diversity: The dataset incorporates diverse classroom scenarios, including varying shooting angles, different classroom environments, multiple learning stages (kindergarten to university), and various ethnic groups (Han and Tibetan), ensuring robustness and generalizability, which is often lacking in simpler datasets.

  • Teacher Behaviors: A significant differentiator is the inclusion of teacher behaviors. Many existing datasets focus exclusively on students. SCB-Dataset explicitly includes teacher actions, recognizing their crucial role in classroom dynamics.

  • Transparent Annotation Process: The paper details its extensive annotation work and iterative improvements to the VIA tool, emphasizing the effort put into data quality and extendibility, which is often overlooked in dataset publications.

    In essence, SCB-Dataset offers a more robust, diverse, and well-structured foundation for AI in education by providing a high-quality, dual-purpose dataset that addresses current limitations in scale, quality, and scope of existing resources.

4. Methodology

4.1. Principles

The core idea behind SCB-Dataset is to bridge the gap in classroom behavior analysis by providing a robust and comprehensive dataset for deep learning models. The theoretical basis is that classroom performance and teaching effectiveness can be objectively analyzed through the automatic detection of student and teacher behaviors. The intuition is that by collecting a large, diverse set of real-world classroom videos, extracting frames, and meticulously annotating individual and group behaviors, a dataset can be created that enables AI models to learn these complex patterns. The dataset is designed to support two primary computer vision tasks: object detection for fine-grained individual actions and image classification for holistic, context-dependent classroom states, recognizing that different behaviors require different analytical approaches. A key principle is to ensure data quality and diversity to make the models generalizable to various real-world classroom settings.

4.2. Core Methodology In-depth (Layer by Layer)

The SCB-Dataset construction process involves several meticulous steps, from data collection to annotation, and finally, statistical analysis to ensure quality and relevance.

4.2.1. Data Collection

The process begins with acquiring raw video data from various online open-class platforms.

  1. Video Source: Videos are directly collected from public websites such as bilibili, TikTok, 1s1k, and bjyhjy. This ensures a diverse range of real classroom scenarios.
  2. Diversity Inclusion: To enhance the dataset's representativeness, classroom data from China's ethnic minorities, specifically Tibetan nationality, were also incorporated in addition to Han nationality. This broadens the cultural and demographic scope of the dataset.
  3. Frame Extraction: Once videos are collected, individual frames are extracted from them.
  4. Differential Frame Selection Strategy: To combat class imbalance, a common problem in real-world datasets where some behaviors are much more frequent than others, a differential frame selection strategy is employed. This means:
    • Reducing the sampling volume for common classes (e.g., "read" and "write").
    • Increasing the sampling volume for rare classes (e.g., "discuss" and "board writing"). This strategy aims to ensure that the dataset has a sufficient number of examples for less frequent but important behaviors.

The overall production process is visualized in Figure 1; a minimal frame-sampling sketch follows the figure.

Figure 1: The production process of SCB-Dataset.
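
To make the differential frame selection strategy concrete, the following sketch (assuming OpenCV is available; the file names and stride values are illustrative, not taken from the paper) samples videos dominated by common behaviors sparsely and those containing rare behaviors densely:

```python
import cv2

# Hypothetical per-class sampling strides: rare behaviors (discuss, blackboard-writing)
# are sampled more densely than common ones (read, write).
FRAME_STRIDE = {"discuss": 15, "blackboard-writing": 15, "read": 120, "write": 120}

def extract_frames(video_path: str, stride: int, out_prefix: str) -> int:
    """Save every `stride`-th frame of a classroom video as a JPEG; return the count."""
    cap = cv2.VideoCapture(video_path)
    saved, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(f"{out_prefix}_{idx:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# e.g. sample a video that mostly contains the rare "discuss" behavior more densely
extract_frames("lesson_01.mp4", FRAME_STRIDE["discuss"], "frames/lesson_01")
```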

4.2.2. Behavior Classes

SCB-Dataset is designed to cover a comprehensive set of student and teacher behaviors.

  • Total Classes: By removing overlapping behaviors between the two dataset types, SCB-Dataset features a total of 19 distinct types of behaviors.

  • Object Detection Dataset: Contains 12 types of behaviors. This part focuses on individual-level actions and objects.

  • Image Classification Dataset: Contains 14 types of behaviors. This part focuses on group-level or context-dependent behaviors that are better described for the entire image.

    The behavior classification is illustrated in Figure 4. Behaviors with a yellow background in this figure are considered group behaviors.

    Figure 4: The behavior classification of SCB-Dataset

Rationale for Group vs. Individual Behaviors:

  • Hand-raising: Considered an individual behavior as it reflects teacher-student interaction and classroom activity through individual student participation.

  • Reading and Writing: Classified as group behaviors. The paper argues that individual instances of these behaviors might not be representative. Instead, focusing on the overall classroom engagement (i.e., whether the majority of students are reading or writing) better reflects current teaching dynamics.

    Specific Class Considerations for Object Detection:

  • Blackboard-writing: Divided into "student blackboard-writing" and "teacher blackboard-writing" due to their completely different pedagogical meanings (student demonstration vs. teacher instruction).

  • Teacher: An identity recognition class, used to locate the teacher.

  • Teacher behaviors like "teach" and "patrol": Not included in the Object Detection dataset because YOLO series networks struggle to fit them; these behaviors depend on environmental or temporal cues that are better captured by image classification.

  • Stand: Classified under Other as it can be a behavior of both students and teachers.

  • Screen and Blackboard: Included as objects, not behaviors, to facilitate tasks like extracting content from screens or blackboards.

    Specific Class Considerations for Image Classification:

  • LVLM Prompts: Two sets of prompts are used for students and teachers, each containing "other" to cover uncategorized actions.

  • Hand-raising: Only images containing 3 or more hand-raising behaviors are retained to ensure stronger behavioral signals for LVLM training.

  • Merged Classes: "Read" and "write" behaviors are merged into "read and write" because image classification outputs only one class per image, and these behaviors often co-occur or are hard to distinguish globally.

  • Advantages: Image Classification can capture behaviors difficult for object detection due to their reliance on entire image information, such as "on-stage presentation," "reading aloud," "listen," "teach," and "patrol."

    Examples of behavior classes are provided in Figure 2.

    Figure 2: Examples of behavior classes in SCB-Dataset

4.2.3. Dataset Challenges

The real-world classroom environment presents several challenges that SCB-Dataset aims to capture and address. These challenges increase the complexity for computer vision models:

  • Pixel Differences: Significant variations in resolution and detail between objects close to the camera (e.g., front-row students) and those far away (e.g., back-row students). (Illustrated in Fig. 5 I)

  • Dense Environments: Classrooms are often crowded, leading to severe occlusion (objects blocking each other) situations, especially among densely packed students. (Illustrated in Fig. 5 II)

  • Coexistence of Multiple Classes: Individuals (students/teachers) may exhibit multiple behaviors simultaneously (e.g., a teacher standing and teaching, a student hand-raising and reading). (Illustrated in Fig. 5 III)

  • Similar Behaviors: High similarity between distinct behaviors, making differentiation difficult (e.g., placing a hand on the forehead vs. raising a hand, writing vs. reading). (Illustrated in Fig. 5 IV)

    Figure 5: Challenges in the SCB-Dataset include pixel differences, dense environments, the coexistence of multiple classes, and similar behaviors.

Furthermore, the dataset incorporates diversity along several axes to ensure robustness:

  • Varying Shooting Angles: Frontal, lateral, and back views of the classroom, complicating behavior detection as the same action can appear very different from various perspectives. (Illustrated in Fig. 6 I)

  • Class Differences & Environments: Includes various classroom environments and course types (e.g., computer labs, standard classrooms, art studios), each with unique visual characteristics. (Illustrated in Fig. 6 II)

  • Different Learning Stages: Covers students from kindergarten through university, reflecting developmental differences in appearance and behavior. (Illustrated in Fig. 6 III)

  • Different Ethnic Groups: Includes Han and Tibetan nationalities, adding cultural diversity. (Illustrated in Fig. 6 IV)

    Figure 6: The diversity of the SCB-Dataset includes varying shooting angles, class differences, different learning stages, and different ethnic groups.

4.2.4. Dataset Statistics

The dataset's composition is detailed, including counts for each class and the overall structure.

  • Class Count Statistics: Figure 3 presents bar charts showing the number of annotations for each class in both Object Detection (left) and Image Classification (right) datasets. This highlights the class imbalance despite efforts to mitigate it during data collection.

  • Annotation Volume: The Object Detection dataset has a much higher number of annotations per image because it labels multiple objects and behaviors within a single image, whereas the Image Classification dataset assigns only one overall class per image.

    Figure 3: SCB-Dataset Class Count Statistics

  • Overall Statistics: Table 1 provides the total number of images and annotations for both dataset types.

    The following are the results from Table 1 of the original paper:

    Split | Object Detection: Images | Object Detection: Annotations | Image Classification: Images | Image Classification: Annotations
    Train | - | 89,713 | 19,152 | 19,152
    Val | - | 33,264 | 1,867 | 1,867
    Total | 13,330 | 122,977 | 21,019 | 21,019
    (Per-split image counts for Object Detection are not reported; see the note below.)

Note on Object Detection Split: The paper explicitly states that the Object Detection dataset does not provide a standard train/validation split count due to the severe class imbalance. To handle this, SCB-Dataset is split into multiple sub-parts. For each sub-part, training and validation sets are independently and randomly divided in a 4:1 ratio. This approach leads to overlaps between the training and validation sets across different sub-parts, rendering an overall count of training/validation images for object detection impractical for general reference. The detailed statistics for each sub-part are provided in Appendix A.3.
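
A 4:1 random split of one sub-part, as described above, can be sketched as follows (file names are illustrative):

```python
import random

def split_subpart(image_paths, val_ratio=0.2, seed=0):
    """Randomly split one sub-part of the dataset into train/val at a 4:1 ratio."""
    paths = list(image_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_ratio)
    return paths[n_val:], paths[:n_val]          # (train, val)

train, val = split_subpart([f"img_{i:05d}.jpg" for i in range(1000)])
print(len(train), len(val))                      # 800 200
```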

4.2.5. Annotation Work (Appendix A.1)

The annotation process was the most time-consuming part, accounting for nearly 90% of the total workload.

  1. Evolution of Annotation: The process involved defining input-output expectations and behavior classifications starting in 2021. Initial definitions proved "unreasonable and immature," leading to a significant time investment (1.5 years) before a redesign in early 2023.

  2. Extensible Method: The redesigned process introduced an extensible behavior annotation method, allowing flexible expansion with future behavior classes.

  3. Annotation Tool Improvements (VIA): The annotation tool VIA was iteratively optimized to meet practical needs:

    • Original Version: Basic functionality (Fig. 11).

      Figure 11: Original Annotation Interface of VIA

    • Second Version: Optimized label display position (inside bounding boxes for better visibility in dense classroom scenes). Added keyboard shortcuts ("z" and "x") for switching label display, in addition to the mouse scroll wheel. (Fig. 12).

      Figure 12: VIA Second Version

    • Third Version: Optimized selection of annotation boxes by adding mouse scroll switching. This helps select very small or mislabeled bounding boxes and highlights the selected box. (Fig. 13).

      Figure 13: VIA Third Version

    • Fourth Version: Optimized display of annotation content in half-page format for efficiency. Introduced Full Image Mode (all annotations) and Single Target Mode (individual annotated targets) for detailed inspection in dense scenarios. (Fig. 14).

      Figure 14: VIA Fourth Version

    • Fifth Version: Introduced copy previous frame annotations feature, significantly reducing repetitive work for consecutive frames with high similarity. (Fig. 15).

      Figure 15: VIA Fifth Version

  4. Final Checks (viaJson): After annotation, the viaJson counting website (Fig. 16) is used to verify the results, identifying unclassified annotation boxes and providing statistics on detection boxes and annotated targets to ensure completeness and accuracy; a small counting sketch follows the figure.

    Figure 16: Annotation Review and Counting Website
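
A rough sketch of the kind of check such a counting page performs is shown below, assuming a plain VIA annotation export in which each image entry carries a "regions" list; the file name and the exact label-counting rule are assumptions for illustration, not the paper's tool:

```python
import json

with open("via_annotations.json", encoding="utf-8") as f:   # hypothetical export file
    annotations = json.load(f)

n_boxes, n_labels, unlabeled_images = 0, 0, []
for entry in annotations.values():
    regions = entry.get("regions", [])
    n_boxes += len(regions)
    # count one label per filled-in region attribute (a simplifying assumption)
    n_labels += sum(len(r.get("region_attributes", {})) for r in regions)
    if not regions:
        unlabeled_images.append(entry.get("filename", "?"))

print(n_boxes, n_labels, unlabeled_images[:5])
```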

4.2.6. Prompt (Appendix A.4)

For the Image Classification task using LVLMs, specific prompts are defined for student and teacher behaviors.

Student Behavior Categories: read and write, on-stage presentation, student blackboard-writing, answering questions, reading aloud, discuss, hand-raising, listen, others.

Teacher Behavior Categories: teach, guide, answer, on-stage interaction, blackboard-writing, patrol, others.

Definition and Recognition Rules:

  1. Priority of single behavior: Only one dominant behavior is identified per picture. Compound actions are classified based on the dominant behavior.

  2. Unique output: Only one behavior category is output for each recognition.

    The paper provides detailed definitions for each category, clarifying nuances such as:

  • Read and write: Students are reading or writing.
  • On-stage presentation: Students presenting on stage without teachers (distinction from teacher on-stage interaction).
  • Student blackboard-writing: Students writing on the blackboard (distinct from teacher blackboard-writing).
  • Answering questions: Students standing to answer questions without teachers present in the image (distinct from teacher answer).
  • Reading aloud: Students reading aloud in unison, often with open mouths, distinct from silent read and write.
  • Discuss: Students discussing (deskmates or between rows).
  • Listen: Students looking up to listen to the teacher.
  • Hand-raising: Generally counted only if 3 or more students are raising hands (for image classification).
  • Others: Catch-all for behaviors not in the defined categories.
  • Teach: Teachers on the podium explaining knowledge.
  • Guide: Teachers off the podium providing individual guidance, often involving bending over.
  • Answer: Students answering teacher's questions, with both teacher and student standing.
  • On-stage interaction: Teachers inviting students to stage for activities, including both teachers and students.
  • Blackboard-writing: Teacher's writing on the blackboard.
  • Patrol: Teachers walking around the classroom, observing.
  • Others: Catch-all for teacher behaviors not defined.
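
To illustrate how such a prompt and label could be packaged as a single supervised sample for LoRA fine-tuning (the paper uses LLaMA Factory; the "messages"/"images" layout, the prompt wording, and the file path below are assumptions for illustration, not the paper's verbatim data):

```python
# Student categories mirror the prompt definitions above.
STUDENT_CATEGORIES = [
    "read and write", "on-stage presentation", "student blackboard-writing",
    "answering questions", "reading aloud", "discuss", "hand-raising",
    "listen", "others",
]

sample = {
    "messages": [
        {"role": "user",
         "content": "<image>Identify the dominant student behavior in this classroom image. "
                    f"Output exactly one of: {', '.join(STUDENT_CATEGORIES)}."},
        {"role": "assistant", "content": "discuss"},
    ],
    "images": ["scb_cls/discuss/000123.jpg"],   # hypothetical image path
}
```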

4.2.7. Behavior Description (Appendix A.5)

A comprehensive list of all 19 behavior classes with detailed descriptions is provided. This ensures clarity and consistency in understanding the dataset's labels.

  1. hand-raising: Students raise hands to speak or ask.
  2. read: Students read books, textbooks, or notes.
  3. write: Students take notes or complete assignments.
  4. discuss: Students discuss content with each other.
  5. bow the head: Students lower heads, possibly distracted or focused on personal items.
  6. turn the head: Students turn heads to look at classmates/teachers or events.
  7. blackboard-writing: Teachers/students write or draw on the blackboard.
  8. on-stage presentation: Students stand and answer questions. (Clarification: The definition here in A.5 seems to combine aspects of "answering questions" from A.4. It states "Students stand up and answer questions. The difference between answering questions and answer is that answering questions means there are only students in the image, no teachers, while answer means there are both students and teachers in the image.")
  9. reading aloud: Students read content aloud from books.
  10. listen: Students look up to listen to the teacher.
  11. guide: Teachers provide guidance or explanations to students.
  12. answer: Students respond to the teacher's questions.
  13. on-stage interaction: Students interact with teachers or other students on stage.
  14. teacher: Identity recognition for the teacher.
  15. teach: Teachers standing on the podium, explaining knowledge.
  16. patrol: Teachers walk around the classroom.
  17. stand: Students or teachers stand.
  18. screen: The screen displaying PPTs.
  19. blackboard: The blackboard for writing.

5. Experimental Setup

5.1. Datasets

The dataset used for the experiments is the SCB-Dataset itself, divided into two main parts:

  • Object Detection Dataset: This part is used for training and evaluating models that detect and localize individual student and teacher behaviors within images. It comprises 13,330 images and 122,977 labels.

  • Image Classification Dataset: This part is used for training and evaluating Large Vision-Language Models on classifying the overall behavior or group behavior within an image. It contains 21,019 images.

    The paper does not provide concrete examples of raw data samples (e.g., a specific image with its full annotation file). However, various figures throughout the paper provide visual intuition about the data's form:

  • Figure 2: Examples of behavior classes, showing diverse scenes with students and teachers.

  • Figure 5: Illustrates dataset challenges like pixel differences, dense environments, and similar behaviors, offering a glimpse into the visual complexity.

  • Figure 6: Demonstrates the diversity of the dataset across angles, classroom types, learning stages, and ethnic groups.

  • Figure 7: Shows a YOLOv7 detection result, where bounding boxes are drawn around a teacher, blackboard, and screen, with corresponding labels and confidence scores.

  • Figure 10: Presents an example of LVLM output for image classification, showing a classroom scene with students discussing and the model outputting "discuss."

  • Figure 17: Provides a collage of example images for various classes, offering an overview of the visual content for each behavior.

    Figure 17: Examples of images from various classes in the SCB dataset.

These datasets were chosen because they are specifically designed to address the research gap in classroom behavior analysis. They offer a comprehensive and diverse collection of real-world classroom scenarios, making them effective for validating the performance of computer vision methods in this domain.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a detailed explanation:

For Object Detection:

The metrics used are Precision (P), Recall (R), mAP@0.5, and mAP@0.95.

  1. Precision (P)

    • Conceptual Definition: Precision measures the accuracy of the model's positive predictions. In object detection, it quantifies how many of the detected objects are actually correct. A high precision indicates a low number of false positives (incorrect detections).
    • Mathematical Formula: $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $
    • Symbol Explanation:
      • TP (True Positives): The number of objects correctly detected by the model that actually exist in the image.
      • FP (False Positives): The number of objects detected by the model that do not actually exist in the image, or were detected with the wrong class or an IoU below the threshold.
  2. Recall (R)

    • Conceptual Definition: Recall measures the model's ability to find all the relevant objects. In object detection, it quantifies how many of the actual objects present in the image were successfully detected by the model. A high recall indicates a low number of false negatives (missed objects).
    • Mathematical Formula: $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
    • Symbol Explanation:
      • TP (True Positives): The number of objects correctly detected by the model that actually exist in the image.
      • FN (False Negatives): The number of objects that exist in the image but were not detected by the model.
  3. mAP@0.5 (mean Average Precision at IoU threshold 0.5)

    • Conceptual Definition: mAP is a standard metric for object detection that assesses the overall accuracy of detections across all classes. It is calculated by averaging the Average Precision (AP) for each class. AP itself is the area under the Precision-Recall curve. The @0.5 signifies that a detected object is considered a True Positive only if its Intersection over Union (IoU) with a ground truth bounding box is greater than or equal to 0.5. IoU is a measure of the overlap between the predicted and ground truth bounding boxes.
    • Mathematical Formula: $ \text{IoU}(B_p, B_{gt}) = \frac{\text{Area}(B_p \cap B_{gt})}{\text{Area}(B_p \cup B_{gt})} $, $ \text{AP}_c = \int_0^1 P(R) \, dR $, $ \text{mAP} = \frac{1}{N} \sum_{c=1}^{N} \text{AP}_c $
    • Symbol Explanation:
      • $B_p$: Predicted bounding box.
      • $B_{gt}$: Ground truth bounding box.
      • $\text{Area}(B_p \cap B_{gt})$: Area of intersection between the predicted and ground truth boxes.
      • $\text{Area}(B_p \cup B_{gt})$: Area of union between the predicted and ground truth boxes.
      • $\text{AP}_c$: Average Precision for class $c$.
      • $P(R)$: Precision as a function of Recall.
      • $N$: Number of object classes.
  4. mAP@0.95 (mean Average Precision at IoU threshold 0.95)

    • Conceptual Definition: Similar to mAP@0.5, but with a much stricter IoU threshold of 0.95. This metric heavily penalizes inaccurate bounding box localization. A high mAP@0.95 indicates that the model not only correctly identifies objects but also localizes them almost perfectly.
    • Mathematical Formula: Same as mAP, but the TP condition for AP calculation requires $\text{IoU} \ge 0.95$.
    • Symbol Explanation: Same as mAP@0.5.
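
A minimal sketch of how IoU gates a true-positive decision (the box format and threshold check below are standard practice, not specific to this paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# At mAP@0.5 a detection only counts as a true positive when IoU >= 0.5.
print(round(iou((10, 10, 60, 60), (30, 30, 80, 80)), 3))   # ~0.22: not a match at IoU 0.5
```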

For Image Classification:

The metrics used are Precision (P), Recall (R), and F1 Score.

  1. Precision (P)

    • Conceptual Definition: For image classification, precision for a given class measures the proportion of images classified as that class that were actually correct.
    • Mathematical Formula: (Same as for object detection, applied per class) $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $
    • Symbol Explanation:
      • TP (True Positives): The number of images correctly classified as the target class.
      • FP (False Positives): The number of images incorrectly classified as the target class.
  2. Recall (R)

    • Conceptual Definition: For image classification, recall for a given class measures the proportion of actual images belonging to that class that were correctly identified.
    • Mathematical Formula: (Same as for object detection, applied per class) $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $
    • Symbol Explanation:
      • TP (True Positives): The number of images correctly classified as the target class.
      • FN (False Negatives): The number of images that truly belong to the target class but were not classified as such.
  3. F1 Score

    • Conceptual Definition: The F1 Score is the harmonic mean of Precision and Recall. It provides a single score that balances both metrics, being particularly useful when there is an uneven class distribution or when both false positives and false negatives are costly.
    • Mathematical Formula: $ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
    • Symbol Explanation:
      • Precision: The precision value for the class.
      • Recall: The recall value for the class.
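
As a quick worked example, the F1 definition can be checked against the "patrol" row reported later in Table 3 (inputs are the paper's rounded percentages):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; here the inputs are percentages."""
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

# 'patrol' in Table 3: P = 42.3, R = 87.1
print(round(f1_score(42.3, 87.1), 1))   # ~56.9, consistent with the reported 57.0 up to rounding
```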

5.3. Baselines

For Object Detection:

The paper's method is compared against various versions of the YOLO (You Only Look Once) series models, specifically YOLOv7 for the main benchmark and YOLOv5, YOLOv8, YOLOv9, YOLOv10, YOLOv11, YOLOv12, and YOLOv13 for a comparative study on a subset of the dataset (teacher behavior part).

  • Why these baselines are representative: The YOLO series are widely recognized as state-of-the-art and highly efficient object detection models, particularly suitable for real-time applications. Comparing against a range of YOLO versions (from v5 to v13) provides a strong benchmark against current best practices in object detection, demonstrating the dataset's challenge and the relative performance of different model generations.

For Image Classification:

The paper uses Qwen2.5-VL-7B-instruct, a Large Vision-Language Model (LVLM), as the baseline.

  • Why this baseline is representative: LVLMs represent the cutting edge in multimodal understanding. Using Qwen2.5-VL-7B-instruct, a powerful and recent LVLM, demonstrates the dataset's applicability to advanced AI models and establishes a strong baseline for image classification tasks, especially for behaviors requiring contextual understanding that LVLMs excel at. The LoRA (Low-Rank Adaptation) method is employed for efficient fine-tuning of this LVLM.
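
A minimal inference sketch with the released Qwen2.5-VL-7B-Instruct checkpoint might look like the following (the model and processor classes follow the public Hugging Face release and need the qwen-vl-utils helper; the image path and prompt wording are illustrative, and the paper's actual prompts are in Appendix A.4):

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "Qwen/Qwen2.5-VL-7B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "classroom_frame.jpg"},          # hypothetical local image
        {"type": "text", "text": "Identify the dominant student behavior in this image "
                                  "and output exactly one category name."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=32)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])  # e.g. "discuss"
```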

5.4. Experimental Environment

The experiments were conducted using the following hardware and software specifications:

  • GPU: NVIDIA vGPU-32GB GPU with 32GB of video memory.
  • CPU: 12 vCPU Intel(R) Xeon(R) Platinum 8352V CPU.
  • Operating System: Ubuntu 22.04.
  • Software Versions:
    • PyTorch: 2.3.0
    • Python: 3.12
    • CUDA: 12.1

5.5. Model Training

For Object Detection Dataset:

  • Model: YOLO series algorithms.
  • Epochs: Ranged from 30 to 120.
  • Batch Size: Set to 4.
  • Image Size: Resized to $640 \times 640$ pixels.
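
As a sketch of how one such run could be launched with the Ultralytics API (the paper benchmarks several YOLO versions from their respective repositories; the checkpoint and dataset config names below are illustrative assumptions):

```python
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                       # any YOLO checkpoint supported by the API
model.train(data="scb_teacher.yaml",             # hypothetical config pointing at one SCB sub-part
            epochs=120, batch=4, imgsz=640)
metrics = model.val()                            # reports P, R, mAP@0.5, mAP@0.5:0.95
```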

For Image Classification Dataset:

  • Framework: LLaMA Factory.
  • LVLM: Qwen2.5-VL-7B-instruct.
  • Fine-tuning Method: LoRA (Low-Rank Adaptation).
  • Learning Rate: Set to $5 \times 10^{-5}$.
  • Number of Training Epochs: Set to 2.
  • Batch Size: Set to 2.
  • LoRA Rank: Set to 8.
  • Scaling Factor: Set to 16.
  • Dropout Rate: Set to 0.1.
  • LoRA+ Learning Rate Ratio: Set to 16.
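
Expressed with Hugging Face PEFT for illustration (the paper configures these values through LLaMA Factory rather than PEFT directly), the LoRA hyperparameters above would look roughly like this:

```python
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=8,                    # LoRA rank
    lora_alpha=16,          # scaling factor
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)
# Training-loop settings from the paper: learning rate 5e-5, 2 epochs, batch size 2.
# The LoRA+ learning-rate ratio of 16 is a LLaMA Factory option, not a LoraConfig field.
```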

6. Results & Analysis

6.1. Core Results Analysis

Object Detection Results (YOLOv7)

The paper presents the benchmark results of YOLOv7 on the Object Detection dataset of SCB-Dataset in Table 2. The performance is evaluated using Precision (P), Recall (R), mAP@0.5, and mAP@0.95.

The following are the results from Table 2 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
hand-raising | 79.4 | 76.9 | 79.2 | 59.4
read | 65.5 | 68.2 | 70.5 | 52.9
write | 68.4 | 67.8 | 72.2 | 58.1
discuss | 67.5 | 72.5 | 74.7 | 39.3
bow the head | 26.0 | 34.0 | 21.9 | 7.9
turn the head | 23.8 | 44.0 | 23.5 | 9.5
guide | 88.5 | 78.3 | 83.6 | 48.9
answer | 86.2 | 86.6 | 91.5 | 80.8
on-stage interaction | 82.3 | 84.5 | 90.1 | 81.5
blackboard-writing | 91.0 | 93.5 | 96.4 | 86.6
teacher | 95.5 | 95.2 | 97.7 | 83.0
stand | 93.1 | 94.7 | 96.6 | 79.8
screen | 96.1 | 97.1 | 97.9 | 92.5
blackboard | 96.2 | 97.1 | 98.1 | 93.3

Analysis:

  • Strong Performance: Many classes achieve high mAP@0.5 scores, with "blackboard-writing," "teacher," "stand," "screen," and "blackboard" all exceeding 96%. "Answer" and "on-stage interaction" also perform exceptionally well with mAP@0.5 over 90%. This indicates that YOLOv7 is highly effective at detecting and localizing these distinct objects and behaviors. The paper suggests that mAP@0.5 above 70% makes a behavior practically applicable, and the majority of classes meet this criterion.

  • Moderate Performance: "Hand-raising," "read," "write," "discuss," and "guide" show good performance with mAP@0.5 ranging from 70.5% to 83.6%. These behaviors are more complex than static objects but still detectable with reasonable accuracy.

  • Challenging Classes: "Bow the head" and "turn the head" exhibit significantly lower mAP@0.5 scores (21.9% and 23.5%, respectively) and very low mAP@0.95 scores (7.9% and 9.5%). This highlights the difficulty in accurately detecting and localizing these subtle head movements, likely due to small object size, occlusion, and similarity to other non-behavioral head positions. These remain areas for future improvement.

    An example of YOLOv7 detection results is shown in Figure 7. It visually demonstrates the model's ability to identify and bound objects like the teacher, blackboard, and screen, along with their confidence scores.

    Figure 7: Example of YOLOv7 detection results

Image Classification Results (Qwen2.5-VL-7B-instruct)

Table 3 presents the training results of the Qwen2.5-VL-7B-instruct model on the Image Classification dataset, evaluated using Precision (P), Recall (R), and F1 Score.

The following are the results from Table 3 of the original paper:

class | P | R | F1
hand-raising | 87.0 | 85.6 | 86.3
read and write | 83.6 | 93.9 | 88.5
discuss | 93.9 | 90.2 | 92.0
student blackboard-writing | 83.3 | 88.2 | 85.7
on-stage presentation | 100 | 70.0 | 82.4
answering questions | 75.0 | 69.2 | 72.0
reading aloud | 100 | 69.2 | 81.8
listen | 88.1 | 89.2 | 88.6
guide | 87.0 | 50.8 | 64.2
answer | 87.6 | 83.4 | 85.4
on-stage interaction | 89.2 | 73.5 | 80.6
blackboard-writing | 99.0 | 98.5 | 98.8
teach | 87.4 | 92.1 | 89.7
patrol | 42.3 | 87.1 | 57.0
all | 86.1 | 83.4 | 83.8

Analysis:

  • Overall Strong Performance: The F1 scores for most behaviors are well above 80%, with an overall F1 score of 83.8%. This demonstrates that LVLMs can effectively classify complex classroom behaviors based on the entire image context.

  • Excellent Performance: "Blackboard-writing" (98.8% F1) and "discuss" (92.0% F1) show particularly high F1 scores. "Read and write," "listen," and "teach" also perform very well (88.5%, 88.6%, 89.7% F1 respectively). "On-stage presentation" and "reading aloud" achieved perfect precision (100%), though their recall was lower, indicating some instances were missed.

  • Areas for Improvement: "Patrol" (57.0% F1) and "guide" (64.2% F1) show lower performance. "Patrol" has a very low precision (42.3%) despite high recall, suggesting many false positives (incorrectly classifying an image as "patrol"). "Guide" has low recall (50.8%), indicating it often misses actual "guide" behaviors. These behaviors likely require more nuanced contextual cues or temporal information that a single image classification might struggle with. "Answering questions" also has a relatively modest F1 score of 72.0%.

    The training process for the LVLM shows a rapid decrease in loss during the initial iterations, stabilizing towards the end, as depicted in Figure 9. The final loss value reached 0.0471.

    Figure 9: Loss of Qwen2.5-VL-7B-instruct during the training iteration process

Figure 10 provides a visual example of the Qwen2.5-VL-7B-instruct model classifying a classroom image as "discuss" given an image and a prompt.

Figure 10: Using the LLaMA Factory framework to test the Qwen2.5-VL-7B-instruct example

6.2. Data Presentation (Tables)

SCB-Dataset Training and Validation Data Statistics (Appendix A.3)

The following are the results from Table 7 of the original paper:

Class (Object Detection) | Train | Val
hand-raising | 10538 | 2915
read | 17539 | 6539
write | 6447 | 3394
discuss | 3607 | 1785
bow the head | 4422 | 540
turn the head | 7943 | 3213
guide | 1155 | 449
answer | 2574 | 853
on-stage interaction | 528 | 149
blackboard-writing | 821 | 277
teacher | 8490 | 3228
stand | 13932 | 4967
screen | 5025 | 1959
blackboard | 7847 | 3445

Class (Image Classification) | Train | Val
hand-raising | 1472 | 187
read and write | 814 | 98
discuss | 703 | 51
student blackboard-writing | 211 | 17
on-stage presentation | 34 | 10
answering questions | 639 | 52
reading aloud | 134 | 13
listen | 2294 | 157
guide | 1584 | 185
answer | 3938 | 439
on-stage interaction | 816 | 113
blackboard-writing | 1703 | 204
teach | 3088 | 240
patrol | 1722 | 101

6.3. Ablation Studies / Parameter Analysis

The paper includes a comparative analysis of different YOLO series models on the teacher behavior part of the Object Detection dataset. This serves as an assessment of model choice and performance evolution across YOLO generations.

Figure 8 visually summarizes the mAP@0.5 scores for YOLOv5 to YOLOv13 on the teacher behavior subset.

Figure 8: Training and testing results of the SCB-Dataset (teacher behavior part) on YOLO series models

Analysis of Figure 8:

  • YOLOv7 achieves the highest mAP@0.5 at 94.0%.

  • YOLOv8 is a close second at 93.6%.

  • YOLOv5 shows the lowest mAP@0.5 at 88.1%.

  • Interestingly, YOLOv9 to YOLOv13 show generally average performance compared to YOLOv7 and YOLOv8, with scores around 91-92%. This suggests that for this specific task and dataset subset, newer YOLO versions do not necessarily provide a significant improvement over YOLOv7 or YOLOv8, or their improvements might be more focused on other aspects like speed or specific architectures not fully leveraged here.

    Detailed results for each YOLO version on the teacher behavior part are provided in tables in Appendix A.3:

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv5 (Appendix A.3)

The following are the results from Table 8 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 81.8 | 84.1 | 88.1 | 67.3
guide | 68.2 | 59.1 | 66.0 | 31.5
answer | 67.4 | 81.7 | 80.8 | 57.1
on-stage interaction | 52.0 | 82.8 | 80.0 | 59.2
blackboard-writing | 97.7 | 70.4 | 96.3 | 73.9
teacher | 95.1 | 91.9 | 95.7 | 71.9
stand | 88.9 | 92.3 | 90.4 | 63.9
screen | 91.6 | 97.5 | 98.0 | 90.0
blackboard | 93.4 | 97.1 | 98.0 | 90.7

Training results of SCB-Dataset on YOLOv7 (Appendix A.3)

The following are the results from Table 9 of the original paper:

Dataset | class | P | R | mAP@0.5 | mAP@0.95
SCB5-A | all | 71.1 | 70.9 | 74.0 | 56.8
SCB5-A | hand-raising | 79.4 | 76.9 | 79.2 | 59.4
SCB5-A | read | 65.5 | 68.2 | 70.5 | 52.9
SCB5-A | write | 68.4 | 67.8 | 72.2 | 58.1
SCB5-B | all | 94.5 | 97.3 | 98.7 | 91.9
SCB5-B | screen | 94.8 | 95.7 | 98.2 | 95.1
SCB5-B | blackboard | 94.2 | 98.9 | 99.2 | 88.8
SCB5-C | all/discuss | 67.5 | 72.5 | 74.7 | 39.3
SCB5-D | all | 85.5 | 82.6 | 86.4 | 67.2
SCB5-D | guide | 88.0 | 81.7 | 87.0 | 49.5
SCB5-D | answer | 89.3 | 88.0 | 92.3 | 76.7
SCB5-D | stage interaction | 69.9 | 65.2 | 68.5 | 54.7
SCB5-D | board writing | 94.5 | 95.6 | 97.7 | 87.9
SCB5-E | all/stand | 95.8 | 91.7 | 96.6 | 80.5
SCB5-F | all/teacher | 96.2 | 94.4 | 97.7 | 82.7
SCB5-G | all (bow-head, turn-head) | - | - | - | -
SCB5-H | all/talk | 87.8 | 62.6 | 77.2 | 61.3

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv7 (Appendix A.3)

The following are the results from Table 10 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 91.1 | 90.9 | 94.0 | 80.8
guide | 88.5 | 78.3 | 83.6 | 48.9
answer | 86.2 | 86.6 | 91.5 | 80.8
on-stage interaction | 82.3 | 84.5 | 90.1 | 81.5
blackboard-writing | 91.0 | 93.5 | 96.4 | 86.6
teacher | 95.5 | 95.2 | 97.7 | 83.0
stand | 93.1 | 94.7 | 96.6 | 79.8
screen | 96.1 | 97.1 | 97.9 | 92.5
blackboard | 96.2 | 97.1 | 98.1 | 93.3

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv8 (Appendix A.3)

The following are the results from Table 11 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 90.6 | 89.2 | 93.6 | 83.1
guide | 81.5 | 67.5 | 79.4 | 53.5
answer | 87.2 | 87.5 | 92.9 | 86.4
on-stage interaction | 82.5 | 83.2 | 88.9 | 80.1
blackboard-writing | 90.8 | 93.9 | 97.6 | 87.3
teacher | 96.2 | 93.8 | 97.3 | 86.0
stand | 94.0 | 93.7 | 96.5 | 83.0
screen | 95.8 | 96.7 | 97.9 | 93.4
blackboard | 96.5 | 97.2 | 98.3 | 94.7

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv9 (Appendix A.3)

The following are the results from Table 12 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 87.8 | 87.7 | 91.8 | 78.2
guide | 79.3 | 63.7 | 74.3 | 45.0
answer | 79.7 | 84.5 | 89.5 | 78.7
on-stage interaction | 71.7 | 79.9 | 84.7 | 72.6
blackboard-writing | 95.0 | 95.1 | 97.9 | 84.5
teacher | 94.2 | 93.1 | 96.9 | 81.8
stand | 92.2 | 92.5 | 95.4 | 77.1
screen | 94.6 | 96.3 | 97.6 | 90.9
blackboard | 95.5 | 96.6 | 98.2 | 94.4

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv10 (Appendix A.3)

The following are the results from Table 13 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 86.9 | 84.1 | 90.1 | 76.2
guide | 79.7 | 58.3 | 71.1 | 41.8
answer | 80.9 | 78.4 | 86.1 | 77.0
on-stage interaction | 80.1 | 78.6 | 86.0 | 77.5
blackboard-writing | 86.3 | 88.6 | 94.7 | 79.2
teacher | 90.1 | 88.7 | 95.1 | 77.1
stand | 88.1 | 88.5 | 93.1 | 72.6
screen | 94.9 | 96.0 | 97.4 | 90.9
blackboard | 95.2 | 95.6 | 97.6 | 93.5

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv11 (Appendix A.3)

The following are the results from Table 14 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 87.6 | 90.1 | 92.9 | 81.8
guide | 72.2 | 66.7 | 72.1 | 45.9
answer | 87.1 | 88.8 | 94.1 | 86.6
on-stage interaction | 75.1 | 89.9 | 91.2 | 82.4
blackboard-writing | 92.0 | 93.4 | 97.3 | 86.5
teacher | 94.8 | 94.1 | 97.0 | 85.0
stand | 91.9 | 93.5 | 95.8 | 81.5
screen | 92.7 | 96.9 | 97.4 | 92.1
blackboard | 94.9 | 97.3 | 98.3 | 93.9

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv12 (Appendix A.3)

The following are the results from Table 15 of the original paper:

class | P | R | mAP@0.5 | mAP@0.95
all | 86.2 | 86.6 | 90.6 | 77.1
guide | 78.6 | 58.9 | 71.8 | 43.7
answer | 77.8 | 83.1 | 88.0 | 78.9
on-stage interaction | 66.5 | 81.2 | 82.6 | 69.4
blackboard-writing | 91.4 | 93.5 | 96.4 | 84.2
teacher | 92.6 | 91.9 | 95.9 | 80.2
stand | 91.4 | 90.8 | 94.5 | 75.6
screen | 95.6 | 96.4 | 97.7 | 92.1
blackboard | 95.7 | 96.7 | 98.1 | 92.9

Training results of Object Detection Dataset in SCB-Dataset (teacher behavior part) on YOLOv13 (Appendix A.3)

The following are the results from Table 16 of the original paper:

class | P | R | mAP@0.5 | mAP@0.75 | mAP@0.95
all | 89.6 | 85.1 | 91.4 | 82.6 | 77.2
guide | 84.6 | 54.6 | 74.8 | 45.1 | 43.4
answer | 82.9 | 80.6 | 87.6 | 80.4 | 76.0
on-stage interaction | 79.0 | 78.2 | 87.1 | 83.5 | 79.8
blackboard-writing | 93.3 | 92.7 | 96.1 | 90.2 | 81.7
teacher | 93.5 | 91.7 | 95.8 | 86.5 | 78.0
stand | 92.2 | 89.0 | 94.0 | 82.3 | 73.7
screen | 95.2 | 97.0 | 98.0 | 96.7 | 92.6
blackboard | 95.9 | 96.7 | 98.1 | 96.1 | 92.9

Further Analysis of YOLO Versions on Teacher Behavior: The detailed tables confirm the trend shown in Figure 8. While all YOLO versions perform strongly on most teacher-related classes (e.g., "teacher," "blackboard-writing," "screen," "blackboard," "stand"), there are notable differences:

  • YOLOv7 and YOLOv8 consistently achieve higher mAP@0.5 and mAP@0.95 on most behavioral classes than YOLOv5 and the later versions (v9 to v13).

  • Classes like "guide" and "on-stage interaction" have lower mAP scores across all YOLO versions, indicating that they are harder to detect accurately, likely because they are context-dependent and visually variable.

  • The mAP@0.95 scores are markedly lower than mAP@0.5 for many behavioral classes, highlighting that precisely localizing bounding boxes for behaviors is harder than merely detecting their presence. Static objects such as "screen" and "blackboard" retain high mAP@0.95 because their boundaries are clearer.

    This comparative analysis demonstrates the utility of SCB-Dataset for benchmarking and selecting appropriate object detection models for specific classroom behavior tasks. The findings suggest that YOLOv7 offers a good balance of accuracy and efficiency for this domain; the reported overall numbers are collected for quick comparison in the short script below.
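
As a quick way to compare the reported runs, the script below simply collects the overall ("all") rows from Tables 10 to 16 above and ranks the YOLO versions by the two reported mAP metrics; the numbers are copied verbatim from the tables, not recomputed.

```python
# Overall ("all" row) teacher-behavior results copied from Tables 10-16 above.
overall = {
    "YOLOv7":  {"mAP@0.5": 94.0, "mAP@0.95": 80.8},
    "YOLOv8":  {"mAP@0.5": 93.6, "mAP@0.95": 83.1},
    "YOLOv9":  {"mAP@0.5": 91.8, "mAP@0.95": 78.2},
    "YOLOv10": {"mAP@0.5": 90.1, "mAP@0.95": 76.2},
    "YOLOv11": {"mAP@0.5": 92.9, "mAP@0.95": 81.8},
    "YOLOv12": {"mAP@0.5": 90.6, "mAP@0.95": 77.1},
    "YOLOv13": {"mAP@0.5": 91.4, "mAP@0.95": 77.2},
}

# Rank the versions by each metric and print the ordering.
for metric in ("mAP@0.5", "mAP@0.95"):
    ranking = sorted(overall.items(), key=lambda kv: kv[1][metric], reverse=True)
    ordered = ", ".join(f"{name}={vals[metric]}" for name, vals in ranking)
    print(f"{metric}: {ordered}")
```

Running this reproduces the observation above: YOLOv7 leads on mAP@0.5 (94.0) while YOLOv8 leads on mAP@0.95 (83.1).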

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces SCB-Dataset, a novel and comprehensive dataset specifically designed for detecting student and teacher classroom behaviors using deep learning methods. It addresses a critical gap in the availability of high-quality, publicly accessible datasets tailored for educational environments. The dataset is notable for its coverage of 19 distinct behavior classes and its innovative dual structure, offering an Object Detection part (13,330 images, 122,977 labels) for fine-grained individual actions and an Image Classification part (21,019 images) for holistic, context-dependent behaviors. Through rigorous data collection and an iteratively improved annotation process, SCB-Dataset captures significant real-world diversity and challenges. Benchmark tests using YOLO series algorithms for object detection and a Large Vision-Language Model (Qwen2.5-VL-7B-instruct) for image classification establish baseline performances. The results demonstrate the practical applicability of YOLOv7 for many object detection tasks and the strong potential of LVLMs for image classification of complex classroom behaviors, thereby promoting the application of artificial intelligence in education.
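
Since the image-classification benchmark relies on prompting a large vision-language model, the following hedged sketch shows how a single frame could be classified with Qwen2.5-VL-7B-Instruct via the Hugging Face transformers API. The prompt wording, label subset, image path, and generation settings are illustrative assumptions rather than the authors' evaluation protocol.

```python
# Sketch only: asking Qwen2.5-VL-7B-Instruct to pick one classroom-behavior
# label for a single frame. Prompt text, image path and label subset are
# illustrative assumptions, not the paper's exact protocol.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

labels = "teach, patrol, guide, discuss, others"  # illustrative subset of classes
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "classroom_frame.jpg"},  # hypothetical file
        {"type": "text",
         "text": f"Classify the main classroom behavior in this image. "
                 f"Answer with exactly one label from: {labels}."},
    ],
}]

# Build the chat prompt, pack image and text, and decode only the new tokens.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
images, videos = process_vision_info(messages)
inputs = processor(text=[text], images=images, videos=videos,
                   padding=True, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16)
answer = processor.batch_decode(out[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)
```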

7.2. Limitations & Future Work

The authors implicitly acknowledge several limitations through their discussion of dataset challenges and the performance of specific models/classes:

  • Challenging Behaviors for Object Detection: Classes like "bow the head" and "turn the head" still yield low mAP scores, indicating the inherent difficulty in precisely detecting and localizing subtle human movements due to factors like pixel differences, occlusion, and similarity to other behaviors.

  • Contextual Behaviors for Image Classification: Behaviors like "patrol" and "guide" show relatively lower F1 scores in image classification, suggesting that these actions might require more than single-frame visual information and could benefit from temporal context (i.e., video-based analysis).

  • Class Imbalance: Despite efforts to mitigate it, class imbalance remains a challenge within the dataset, which can impact model training and performance for rare classes.

  • Static Nature: While image classification captures some context, both object detection and image classification on static frames inherently lack temporal information, which is crucial for fully understanding dynamic behaviors and their progression over time.

    As for future work, the paper states: "We are also continuously expanding the scale of the dataset to adapt to various challenges in real environments." This implies:

  • Dataset Expansion: Continuously adding more images and annotations to increase the dataset's size and diversity.

  • Addressing Challenges: Further improving the dataset to handle the identified challenges such as occlusion, pixel differences, and subtle behaviors.

  • Incorporating Temporal Data: While not explicitly stated as future work for the dataset itself, the discussion around the limitations of current approaches for teach and patrol behaviors suggests that future iterations might involve video annotations or spatiotemporal modeling.

7.3. Personal Insights & Critique

This paper makes a highly valuable contribution to the AI in Education domain. The meticulous effort put into constructing SCB-Dataset, particularly the iterative improvements to the VIA annotation tool and the careful consideration of real-world classroom challenges, is commendable. The dual approach of providing both object detection and image classification data is a thoughtful design choice, recognizing the varied needs of classroom behavior analysis.

Inspirations & Applications:

  • Personalized Learning: More accurate behavior detection can lead to real-time feedback systems for teachers, helping them identify disengaged students or those struggling, thus enabling more personalized interventions.

  • Teacher Training: Automated analysis of teaching styles and student responses could provide data-driven insights for teacher professional development.

  • Classroom Management: Identifying disruptive or off-task behaviors could assist in maintaining a conducive learning environment.

  • Broader Behavior Recognition: The methodology for dataset construction, especially the iterative VIA tool improvements and the differential frame selection strategy, could be adopted for creating high-quality datasets in other complex behavior recognition domains (e.g., healthcare for patient monitoring, industrial settings for worker safety).

    Potential Issues & Areas for Improvement:

  • Ethical Considerations: The paper does not explicitly discuss the ethical implications of student and teacher behavior monitoring. Such systems raise concerns about privacy, surveillance, data security, and potential biases in AI interpretations of behavior. A discussion on these aspects, including consent, anonymization, and responsible deployment, would significantly strengthen the paper's holistic impact.

  • Bias in Data Collection: While the paper mentions including data from ethnic minorities, a deeper analysis of potential biases (e.g., gender, socioeconomic background, classroom setup, geographical region) and how they are mitigated (or acknowledged) would be beneficial.

  • Temporal Context: The current dataset focuses on static images. While LVLMs can infer some context, many behaviors (e.g., "patrol," "teach," "guide," "discuss") are inherently spatiotemporal. Future extensions to video-based annotations (e.g., AVA style atomic actions) or event-based sequences would be a natural and crucial progression for capturing the full richness of classroom dynamics. The paper implicitly acknowledges this by not including "teach" and "patrol" in object detection due to their need for timing characteristics.

  • Long-tail Behaviors: Despite differential frame selection, the class imbalance for some behaviors remains. Techniques like few-shot learning or data augmentation specifically tailored for these long-tail classes could be explored in conjunction with this dataset.

  • Granularity of "Others": While "others" is a necessary category, a qualitative analysis of what falls into "others" might reveal new, important behaviors that could be explicitly added to the classification scheme in future iterations.

    Overall, SCB-Dataset is a foundational work that will undoubtedly accelerate research in AI for education. Addressing the limitations, particularly the ethical aspects and extending to spatiotemporal data, would solidify its position as a gold standard in the field.
