Project Overview
Prerequisites

- Request access to the ClueWeb22 dataset.
- Install the dependencies: numpy, tqdm, fasttext, pyyaml, wandb.
- Download the DCLM fastText classifier to fasttext_scorers/.

To run a (simulated) crawl, first create a YAML configuration file under configs/, then run:

```
python crawl.py crawl --config <path_to_your_config_file>
```
Crawl4LLM
Create a YAML configuration file in configs/ with the following content:
```yaml
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_dclm_fasttext
num_selected_docs_per_iter: 10000
num_workers: 16  # set to a number that fits your machine
save_state_every: -1  # set to a positive number to save the state (queue & visited set) of the crawler every certain number of steps
max_num_docs: 20000000
selection_method: dclm_fasttext_score
order: desc  # desc for descending, asc for ascending
wandb: true  # set to false to disable wandb logging
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_dclm_fasttext
rating_methods:
  - type: length
  - type: fasttext_score
    rater_name: dclm_fasttext_score
    model_path: fasttext_scorers/openhermes_reddit_eli5_vs_rw_v2_bigram_200k_train.bin
```
All scorers in rating_methods assign a score to each document. In the configuration above, we define a length scorer, which scores documents by their length, and a fasttext_score scorer, which uses the DCLM fastText model to score documents. The final ranking is determined by selection_method, which is set to dclm_fasttext_score, the name of the fasttext_score scorer.
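The selection rule above can be sketched in a few lines of Python. This is an illustrative assumption about the behavior implied by the config keys (num_selected_docs_per_iter, selection_method, order), not the repository's actual code; the document IDs and scores are made up.

```python
# Sketch: each document carries one score per rater_name; the crawler keeps
# the top-k documents ranked by the rater named in selection_method.
def select_docs(docs, selection_method, order="desc", num_selected=2):
    """docs: list of {"id": ..., "scores": {rater_name: score, ...}}."""
    reverse = order == "desc"
    ranked = sorted(docs, key=lambda d: d["scores"][selection_method], reverse=reverse)
    return [d["id"] for d in ranked[:num_selected]]

docs = [
    {"id": "doc-a", "scores": {"length": 120, "dclm_fasttext_score": 0.91}},
    {"id": "doc-b", "scores": {"length": 800, "dclm_fasttext_score": 0.35}},
    {"id": "doc-c", "scores": {"length": 300, "dclm_fasttext_score": 0.72}},
]

# Rank by the dclm_fasttext_score rater, descending, keep the top 2.
print(select_docs(docs, "dclm_fasttext_score", order="desc", num_selected=2))
# → ['doc-a', 'doc-c']
```

Note that the length scorer still runs and records its scores, but only the rater named in selection_method drives which documents are selected each iteration.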
Baseline Crawlers
Random Crawler
```yaml
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_random
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: random_score
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_random
rating_methods:
  - type: random_score
```
Indegree-Based Crawler
```yaml
cw22_root_path: <path_to_clueweb22_a>
seed_docs_file: seed.txt
output_dir: crawl_results/seed_10k_crawl_20m_indegree
num_selected_docs_per_iter: 10000
num_workers: 16
save_state_every: -1
max_num_docs: 20000000
selection_method: inlink_count
order: desc
wandb: true
wandb_project: crawler
wandb_run_name: seed_10k_crawl_20m_indegree
rating_methods:
  - type: inlink_count
```
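The inlink_count baseline ranks candidate documents by how many already-crawled pages link to them. A minimal sketch of that priority rule, assuming a simple counter over outlinks (an illustration, not the repository's implementation):

```python
from collections import Counter

# Count, for each candidate, how many crawled pages link to it (its indegree),
# then select the highest-indegree candidates first (order: desc).
def rank_by_indegree(crawled_pages, num_selected):
    indegree = Counter()
    for page in crawled_pages:
        for target in page["outlinks"]:
            indegree[target] += 1
    return [doc_id for doc_id, _ in indegree.most_common(num_selected)]

crawled = [
    {"id": "p1", "outlinks": ["x", "y"]},
    {"id": "p2", "outlinks": ["y", "z"]},
    {"id": "p3", "outlinks": ["y"]},
]
# "y" has indegree 3; "x" and "z" each have indegree 1.
print(rank_by_indegree(crawled, 2))
```

Because this rule needs no document content, it serves as a cheap graph-only baseline against the fastText-based selection above.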
Pretraining and Evaluation
After the crawl finishes, the IDs of the crawled documents are placed in the output_dir specified in the configuration file. Run the following command to fetch the document texts:
```
python fetch_docs.py --input_dir <document_ids_dir> --output_dir <document_texts_dir> --num_workers <num_workers>
```
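The internals of fetch_docs.py aren't shown here, but the fan-out pattern its --num_workers flag implies, a pool of workers mapping document IDs to texts, can be sketched as follows. The fetch_text lookup is a stand-in assumption for the real ClueWeb22 access code:

```python
from multiprocessing.dummy import Pool  # thread pool; keeps the sketch dependency-free

# Stand-in for the real ClueWeb22 text lookup; assumed for illustration only.
def fetch_text(doc_id):
    return f"text of {doc_id}"

# Map document IDs to texts with num_workers concurrent workers,
# preserving the input order of the IDs.
def fetch_all(doc_ids, num_workers=4):
    with Pool(num_workers) as pool:
        return pool.map(fetch_text, doc_ids)

print(fetch_all(["clueweb22-en0000-00-00000", "clueweb22-en0000-00-00001"], num_workers=2))
```

Parallel fetching matters here because a 20M-document crawl would be painfully slow to materialize sequentially.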
You can then run LLM pretraining and evaluation using the DCLM framework.
Miscellaneous
Browse the Data
Run the following command to print a document and its outlinks by its ID:
```
python access_data.py <path_to_clueweb22> <document_id>
```
Project Link
https://github.com/cxcscmu/Crawl4LLM
(Source: GitHubStore)