Project Overview
Prerequisites

- Request access to the ClueWeb22 dataset.
- Install the dependencies: numpy, tqdm, fasttext, pyyaml, wandb.
- Download the DCLM fastText classifier to fasttext_scorers/.

To run a (simulated) crawl, first create a YAML configuration file under configs/, then run:

```
python crawl.py crawl --config <path_to_your_config_file>
```
Crawl4LLM
Create a YAML configuration file in configs/ with the following content:
```yaml
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_dclm_fasttext
num_selected_docs_per_iter: 10000
num_workers: 16  # set to a number that fits your machine
save_state_every: -1  # set to a positive number to save the state (queue & visited set) of the crawler every certain number of steps
max_num_docs: 20000000
selection_method: dclm_fasttext_score
order: desc  # desc for descending, asc for ascending
wandb: true  # set to false to disable wandb logging
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_dclm_fasttext
rating_methods:
  - type: length
  - type: fasttext_score
    rater_name: dclm_fasttext_score
    model_path: fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin
```
All scorers in rating_methods assign a score to each document. In the configuration above, we define a length scorer, which scores documents by their length, and a fasttext_score scorer, which uses the DCLM fastText model to score documents. The final ranking is determined by selection_method, which is set to dclm_fasttext_score, the name of the fasttext_score scorer.
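The selection rule above can be sketched in a few lines of Python. This is an illustrative assumption about the behavior implied by the config keys (num_selected_docs_per_iter, selection_method, order), not the repository's actual code; the document IDs and scores are made up.

```python
# Sketch: each document carries one score per rater_name; the crawler keeps
# the top-k documents ranked by the rater named in selection_method.
def select_docs(docs, selection_method, order="desc", num_selected=2):
    """docs: list of {"id": ..., "scores": {rater_name: score, ...}}."""
    reverse = order == "desc"
    ranked = sorted(docs, key=lambda d: d["scores"][selection_method], reverse=reverse)
    return [d["id"] for d in ranked[:num_selected]]

docs = [
    {"id": "doc-a", "scores": {"length": 120, "dclm_fasttext_score": 0.91}},
    {"id": "doc-b", "scores": {"length": 800, "dclm_fasttext_score": 0.35}},
    {"id": "doc-c", "scores": {"length": 300, "dclm_fasttext_score": 0.72}},
]

# Rank by the dclm_fasttext_score rater, descending, keep the top 2.
print(select_docs(docs, "dclm_fasttext_score", order="desc", num_selected=2))
# → ['doc-a', 'doc-c']
```

Note that the length scorer still runs and records its scores, but only the rater named in selection_method drives which documents are selected each iteration.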
Baseline Crawlers
Random Crawler
```yaml
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_random
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: random_score
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_random
rating_methods:
  - type: random_score
```
Indegree-Based Crawler
```yaml
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_indegree
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: inlink_count
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_indegree
rating_methods:
  - type: inlink_count
```
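The inlink_count baseline ranks candidate documents by how many already-crawled pages link to them. A minimal sketch of that priority rule, assuming a simple counter over outlinks (an illustration, not the repository's implementation):

```python
from collections import Counter

# Count, for each candidate, how many crawled pages link to it (its indegree),
# then select the highest-indegree candidates first (order: desc).
def rank_by_indegree(crawled_pages, num_selected):
    indegree = Counter()
    for page in crawled_pages:
        for target in page["outlinks"]:
            indegree[target] += 1
    return [doc_id for doc_id, _ in indegree.most_common(num_selected)]

crawled = [
    {"id": "p1", "outlinks": ["x", "y"]},
    {"id": "p2", "outlinks": ["y", "z"]},
    {"id": "p3", "outlinks": ["y"]},
]
# "y" has indegree 3; "x" and "z" each have indegree 1.
print(rank_by_indegree(crawled, 2))
```

Because this rule needs no document content, it serves as a cheap graph-only baseline against the fastText-based selection above.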
Pretraining and Evaluation
After the crawl finishes, the IDs of the crawled documents are placed in the output_dir specified in the configuration file. Run the following command to fetch the document texts:
```
python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
```
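The internals of fetch_docs.py aren't shown here, but the fan-out pattern its --num_workers flag implies, a pool of workers mapping document IDs to texts, can be sketched as follows. The fetch_text lookup is a stand-in assumption for the real ClueWeb22 access code:

```python
from multiprocessing.dummy import Pool  # thread pool; keeps the sketch dependency-free

# Stand-in for the real ClueWeb22 text lookup; assumed for illustration only.
def fetch_text(doc_id):
    return f"text of {doc_id}"

# Map document IDs to texts with num_workers concurrent workers,
# preserving the input order of the IDs.
def fetch_all(doc_ids, num_workers=4):
    with Pool(num_workers) as pool:
        return pool.map(fetch_text, doc_ids)

print(fetch_all(["clueweb22-en0000-00-00000", "clueweb22-en0000-00-00001"], num_workers=2))
```

Parallel fetching matters here because a 20M-document crawl would be painfully slow to materialize sequentially.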
You can then run LLM pretraining and evaluation using the DCLM framework.
Miscellaneous
Browse the Data
Run the following command to print a document and its outlinks by its ID:
```
python access_data.py <path_to_clueweb22> <document_id>
```
Project Link
https://github.com/cxcscmu/Crawl4LLM
(Source: GitHubStore)