Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions
The process of training and evaluating machine learning (ML) models relies on high-quality and timely annotated datasets. While a significant portion of academic and industrial research is focused on creating new ML methods, these communities rely on open datasets and benchmarks.
Implementation
Source publication / research team or educational organization described in paper
Learning context
Higher education
AI role
Learning object / concept model
Outcome signal
AI literacy
Registry Facets
- Higher education
- Adult / workforce
- Professional / adult learning
- ML engineering
- Data-centric AI
- ML concepts / supervised learning
- Curriculum / course design
- Students
- Adult learners / professionals
- Activity documentation
- AI literacy
- Conceptual understanding
Implementing Organization
Source publication / research team or educational organization described in paper
Not specified in extracted text
Researchers, educators, instructors, or facilitators as described in the source publication
Learning Context
- Higher education
- Professional / adult learning
Course implementation or course design
Not specified in extracted text
ML concepts / supervised learning
- The paper provides limited implementation detail in the extracted abstract; additional manual review may be needed for local replication.
Learner Profile
Higher education, Adult / workforce
Mixed or not explicitly specified; infer from target learner group and intervention design.
Varies by intervention; not specified unless the paper explicitly describes prerequisites.
Educational Intent
- Document the AI education intervention, course, tool, or resource described in the source publication.
- Extract the learner context, AI role, pedagogy, outcomes, and constraints for AAB registry comparison.
- Support AAB comparison across AI literacy, AI education, teacher training, higher education, and workforce contexts.
- Capture evidence maturity, transferability, and limitations rather than treating the publication as product endorsement.
- Not an AAB endorsement of the tool, curriculum, provider, or result.
- Not a direct replication record unless the source paper reports implementation details sufficient for replication.
AI Tool Description
ML concepts / supervised learning
Not specified in extracted text
- Learning object / concept model
- Primary interaction pattern inferred from publication: Curriculum / course design.
- AI capability focus: ML concepts / supervised learning.
- Apply standard AAB safeguards: privacy, transparency, human oversight, and documentation of limitations.
Activity Design
- Review the publication’s reported context, learner group, AI tool or curriculum, implementation process, and outcome evidence.
- Map the case to AAB registry fields for comparison across educational levels and AI capability types.
- Use the source publication and PDF for any manual verification before public registry release.
- Human educators/researchers remain responsible for instructional design, supervision, interpretation, and ethical safeguards.
- AI systems or AI concepts provide the learning object, support tool, evaluator, simulator, or automation context depending on the paper.
- Project-based learning
- Registry extraction emphasizes explicit learning goals, observed outcomes, constraints, and safety limitations.
Observed Challenges
- The paper provides limited implementation detail in the extracted abstract; additional manual review may be needed for local replication.
Design Adaptations
- Case classified under: Published curriculum / implementation paper.
- Pedagogical pattern: Project-based learning.
- Any additional adaptations should be verified against the full paper before public-facing publication.
Reported Outcomes
- Engagement evidence should be interpreted according to the source paper’s reported method and sample.
- To fill the need for this competency, we created a semester course on Data Collection and Labeling for Machine Learning, integrated into a bachelor program that trains data analysts and ML engineers.
Ethical & Privacy Considerations
- Apply standard AAB safeguards: privacy, transparency, human oversight, and documentation of limitations.
Evidence Type
- Activity documentation
Relevance to Research
- Can be used as an AAB evidence record for cross-case comparison, standards drafting, and evidence-maturity mapping.
- Supports identification of recurring patterns in AI literacy, AI education implementation, teacher preparation, assessment, and responsible AI learning.
- AI literacy
- Conceptual understanding
- Curriculum / course design
- ML concepts / supervised learning
Case Status
- Completed
AAB Classification Tags
Higher education, Adult / workforce
Higher education, Professional / adult learning
ML concepts / supervised learning
Project-based learning
Low to Medium
Medium
Source Publication
Data Labeling for Machine Learning Engineers: Project-Based Curriculum and Data-Centric Competitions
- Anastasia Zhdanovskaya
- Daria Baidakova
- Dmitry Ustalov
Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37 No. 13, EAAI-23
2023
10.1609/aaai.v37i13.26886
https://ojs.aaai.org/index.php/AAAI/article/view/26886
https://ojs.aaai.org/index.php/AAAI/article/view/26886/26658
077_Data Labeling for Machine Learning Engineers_ Project-Based Curriculum and Data-Centric Competitions.pdf
8
The process of training and evaluating machine learning (ML) models relies on high-quality and timely annotated datasets. While a significant portion of academic and industrial research is focused on creating new ML methods, these communities rely on open datasets and benchmarks. However, practitioners often face issues with unlabeled and unavailable data specific to their domain. We believe that building scalable and sustainable processes for collecting data of high quality for ML is a complex skill that needs focused development. To fill the need for this competency, we created a semester course on Data Collection and Labeling for Machine Learning, integrated into a bachelor program that trains data analysts and ML engineers. The course design and delivery illustrate how to overcome the challenge of putting university students with a theoretical background in mathematics, computer science, and physics through a program that is substantially different from their educational habits. Our goal was to motivate students to focus on practicing and mastering a skill that was considered unnecessary to their work. We created a system of inverse ML competitions that showed the students how high-quality and relevant data affect their work with ML models, and their mindset changed completely in the end. Project-based learning with increasing complexity of conditions at each stage helped to raise the satisfaction index of students accustomed to difficult challenges. During the course, our invited industry practitioners drew on their firsthand experience with data, which helped us avoid overtheorizing and made the course highly applicable to the students' future career paths.
Transferability
- Higher education
- Professional / adult learning
- The paper provides limited implementation detail in the extracted abstract; additional manual review may be needed for local replication.
Cost And Operations
Not specified in extracted text unless noted in duration field.
Requires educators/researchers/facilitators with sufficient AI literacy and pedagogy knowledge for the target learners.
Infrastructure depends on AI tool type, learner devices, data access, and institutional policy context.
Extraction Notes
High
- group_size
- duration
This entry was automatically extracted from the PDF text and manifest metadata. Fields should be manually verified before public registry publication, especially group size, location, duration, and outcome claims.
