"Beating Neural Scaling Law" something-or-other paper

5.1 Pre-processing

5.1.1 Parsing

5.1.2 cleaning

5.1.3 dimension reduction

5.2 Data augmentation

5.2.1 EDA

5.2.2 LLM-based data generation

5.3 Data refinement

CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data (2019.11)

Microsoft - Phi-3

5.3.1 Deduplication

5.3.1.1 LSH-MinHash

5.3.1.2 SemDeDup

SemDeDup: Data-efficient learning at web-scale through semantic deduplication (2023.04)

5.3.2 SSL Prototype

5.3.3 Finding important data

Deep Learning on a Data Diet: Finding Important Examples Early in Training (2021.07)

5.3.4 D4