Prerequisites
You need conda and WSL installed.
1. conda can be installed with a single command:
- wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh ; bash ~/miniconda.sh -b -p $HOME/miniconda ; eval "$($HOME/miniconda/bin/conda shell.bash hook)" ; echo 'export PATH="$HOME/miniconda/bin:$PATH"' >> ~/.bashrc ; source ~/.bashrc
2. WSL: in "Turn Windows features on or off", enable the Windows Subsystem for Linux, the Hyper-V platform, and the Virtual Machine Platform
- Download Ubuntu from the Microsoft Store
Why this setup:
1. conda isolates each Python project. A conda environment isolates the entire Python interpreter and all of its packages, not just the pip-installed ones, and lets you pick the Python version per project.
2. On Windows 10, WSL is the most practical way to run Linux. For AI work, WSL can use the NVIDIA driver directly; WSL2 is built on lightweight virtualization, but the overhead is small and performance is close to native.
We therefore build the local deployment and fine-tuning environment on WSL Linux.
Download the model
- (base)root@Riemann-Plan:/mnt/c/Users/25671#
- (base)root@Riemann-Plan:/mnt/c/Users/25671# conda activate huggingface
- (huggingface)root@Riemann-Plan:~# export HF_ENDPOINT=https://hf-mirror.com
- (huggingface)root@Riemann-Plan:~#
- (huggingface)root@Riemann-Plan:~# huggingface-cli download Qwen/Qwen3-8B
Run the model with vLLM and call it locally
vLLM: a high-performance LLM inference engine built for production use; its PagedAttention technique maximizes GPU memory utilization and concurrent throughput.
vLLM will claim nearly the whole GPU: its memory budget is made up of the KV cache (sized by the maximum context length) plus the model weights themselves.
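To get a feel for where that memory goes, here is a back-of-the-envelope KV-cache estimate in Python. The layer/head numbers below are assumptions based on the commonly published Qwen3-8B configuration; check config.json in the downloaded snapshot for the real values.
```python
# Rough KV-cache size estimate (assumed Qwen3-8B-like config; verify against config.json)
num_layers   = 36    # transformer layers (assumption)
num_kv_heads = 8     # GQA key/value heads (assumption)
head_dim     = 128   # dimension per head (assumption)
dtype_bytes  = 2     # fp16/bf16 KV cache

kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes  # K and V
max_model_len = 4096
print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"KV cache for one {max_model_len}-token sequence: "
      f"{kv_bytes_per_token * max_model_len / 1024**3:.2f} GiB")
# Whatever is left of gpu_memory_utilization * total VRAM after the weights
# is carved up into KV-cache blocks, which is why vLLM appears to "eat" the whole GPU.
```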
Ollama: makes running LLMs locally as simple as using Docker; an out-of-the-box tool that downloads and runs all kinds of quantized models with one command.
- (huggingface)root@Riemann-Plan:~# conda env list
- # conda environments:
- #
- base /root/miniconda
- LF /root/miniconda/envs/LF
- django /root/miniconda/envs/django
- docling /root/miniconda/envs/docling
- dots_ocr /root/miniconda/envs/dots_ocr
- huggingface * /root/miniconda/envs/huggingface
- ocrflux /root/miniconda/envs/ocrflux
- unsloth-blackwell /root/miniconda/envs/unsloth-blackwell
- vllm_tx /root/miniconda/envs/vllm_tx
- (huggingface)root@Riemann-Plan:~# conda activate unsloth-blackwell
- (unsloth-blackwell)root@Riemann-Plan:~#
- (unsloth-blackwell)root@Riemann-Plan:~# cd /root/.cache/huggingface/hub/models--Qwen--Qwen3-8B/
- (unsloth-blackwell)root@Riemann-Plan:~/.cache/huggingface/hub/models--Qwen--Qwen3-8B# ls
- blobs refs snapshots
- (unsloth-blackwell)root@Riemann-Plan:~/.cache/huggingface/hub/models--Qwen--Qwen3-8B# cd snapshots/
- (unsloth-blackwell)root@Riemann-Plan:~/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots# ls
- b968826d9c46dd6066d109eabc6255188de91218
- (unsloth-blackwell)root@Riemann-Plan:~/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots#
- python -m vllm.entrypoints.openai.api_server \
- --model /root/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/b968826d9c46dd6066d109eabc6255188de91218/ \
- --served-model-name Qwen3-8B \
- --host 0.0.0.0 \
- --port 8000 \
- --gpu-memory-utilization 0.93 \
- --max-model-len 4096
Call the model
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-8B",
"messages": [
{"role": "user", "content": "Solve (x + 2)^2 = 0."}
],
"temperature": 0.7,
"max_tokens": 2048
}'
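The same OpenAI-compatible endpoint can also be called from Python. A minimal sketch using the openai client; the api_key value is a placeholder, since vLLM only checks it if the server was started with --api-key.
```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen3-8B",  # must match --served-model-name
    messages=[{"role": "user", "content": "Solve (x + 2)^2 = 0."}],
    temperature=0.7,
    max_tokens=2048,
)
print(resp.choices[0].message.content)
```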
Install unsloth and JupyterLab
- pip install -U vllm --extra-index-url https://download.pytorch.org/whl/cu128
- pip install unsloth unsloth_zoo bitsandbytes
- pip install jupyterlab
- jupyter lab --allow-root
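After the installs, a quick sanity check (a minimal sketch) that PyTorch inside WSL can actually see the NVIDIA GPU:
```python
import torch

# If this prints False, the WSL CUDA driver integration is not working
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()  # bytes
    print(f"Free VRAM: {free / 1024**3:.1f} / {total / 1024**3:.1f} GiB")
```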
At this point we have:
1. Deployed the model inference framework (vLLM)
2. Deployed the Unsloth fine-tuning framework
3. Installed JupyterLab
Traditional attention vs. xformers vs. FlashAttention 2
A quick comparison:

| Method | Analogy | Characteristics |
| --- | --- | --- |
| Traditional attention | Spreads every document out on the desk | Accurate, but slow and takes up a lot of space |
| xformers | Only looks at the key documents | Fast, but may miss some information |
| FlashAttention 2 | Reads one book at a time and keeps it in its head | Fast and accurate, and barely takes any space |
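In plain Hugging Face transformers the attention backend can be chosen explicitly. A minimal sketch for illustration (Unsloth normally selects the fastest available backend for you); "flash_attention_2" requires the flash-attn package, otherwise fall back to "sdpa" or "eager":
```python
import torch
from transformers import AutoModelForCausalLM

# Downloads/loads the full model; pick the attention backend via attn_implementation
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # or "sdpa" / "eager"
)
```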
batch-size
Plan A (2×4): 2 bites per mouthful, 4 mouthfuls, 8 bites in total to finish
Problem: you choke on the very first mouthful (the GPU runs out of memory)
Plan B (1×8): 1 bite per mouthful, 8 mouthfuls, 8 bites in total to finish
Result: you finish the meal fine, it just takes a few more mouthfuls. This is exactly per-device batch size × gradient accumulation steps; see the sketch below.
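A minimal, self-contained sketch of what gradient accumulation does internally (toy model and data, just to show the mechanics): 8 micro-batches of size 1 produce one weight update equivalent to a single batch of 8.
```python
import torch
import torch.nn as nn

# Tiny stand-in model/data purely to illustrate gradient accumulation
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
data = [(torch.randn(1, 16), torch.randn(1, 1)) for _ in range(8)]  # 8 micro-batches of size 1

accum_steps = 8
optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so gradients sum like one batch of 8
    loss.backward()                                           # gradients accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()                                      # one optimizer update per 8 micro-batches
        optimizer.zero_grad()
```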
Install llama.cpp
- sudo apt install libcurl4-openssl-dev
- git clone https://github.com/ggml-org/llama.cpp
- cd llama.cpp
- cmake -B build
- cmake --build build --config Release -j 8
Launch the fine-tuned model with vLLM and connect it to Dify
- python -m vllm.entrypoints.openai.api_server --model /root/qwen3math/ --served-model-name qwen3math --host 0.0.0.0 --port 8000 --gpu-memory-utilization 0.93 --max-model-len 4096
Fine-tune with Unsloth
```python
from unsloth import FastLanguageModel
import torch
# The parameters below mean the following:
local_model_path = "/root/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/b968826d9c46dd6066d109eabc6255188de91218"
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = local_model_path, # model name or local path to load
    max_seq_length = 2048,         # maximum context length (input sequence length); larger values use more VRAM
    load_in_4bit = True,           # load the model with 4-bit quantization to save VRAM
    load_in_8bit = False,          # load with 8-bit quantization instead: higher precision but more VRAM
    full_finetuning = False,       # full-parameter fine-tuning; usually False for LoRA fine-tuning
    # token = "hf_...",            # HuggingFace access token, needed for gated models
)
```
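Rough arithmetic behind load_in_4bit (a sketch; the real footprint also includes quantization constants, activations, and the LoRA weights):
```python
# Approximate weight memory for an 8B-parameter model at different precisions
params = 8e9
for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit (nf4)", 0.5)]:
    print(f"{name:>11}: ~{params * bytes_per_param / 1024**3:.1f} GiB of weights")
# 4-bit loading keeps the weights to roughly 4 GiB, leaving room on a consumer GPU
# for the KV cache, activations, and LoRA adapter training.
```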
- model = FastLanguageModel.get_peft_model(
- model,
- r = 32, # LoRA rank: controls the number of trainable parameters. 32 is on the larger side, giving more capacity for big models and complex tasks
- target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
- "gate_proj", "up_proj", "down_proj",],
- # Modules to inject LoRA into; covering the Transformer's main linear layers improves fine-tuning quality
- lora_alpha = 32, # LoRA scaling factor, usually set to rank or 2*rank; equal to r here for stable training
- lora_dropout = 0, # LoRA dropout against overfitting; 0 gives the best performance (already optimized in Unsloth)
- bias = "none", # whether to train bias parameters; "none" saves VRAM and is the optimized setting
- use_gradient_checkpointing = "unsloth", # Unsloth-optimized gradient checkpointing: saves VRAM and supports longer contexts
- random_state = 3407, # random seed for reproducibility
- use_rslora = False, # rank-stabilized LoRA; disabled to keep things simple
- loftq_config = None, # LoftQ quantization; None means disabled
- )
- # These settings balance training efficiency, VRAM usage, and model capacity, and fit most LoRA fine-tuning scenarios
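How big is the adapter this config creates? Each targeted linear layer of shape (d_out, d_in) gets two small matrices A (r × d_in) and B (d_out × r), so LoRA adds r·(d_in + d_out) trainable parameters per layer. A counting sketch that works on the wrapped model above:
```python
# Count trainable (LoRA) vs. frozen parameters on the wrapped model
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total     = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M / total: {total / 1e9:.2f}B "
      f"({100 * trainable / total:.2f}%)")
```
With r = 32 over the seven projection modules this typically lands in the tens of millions of parameters, a fraction of a percent of the 8B base model.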
- from datasets import load_dataset
- reasoning_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
- non_reasoning_dataset = load_dataset("mlabonne/FineTome-100k", split = "train")
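The conversion code below relies on specific column names ("problem"/"generated_solution" in the reasoning set, "conversations" in the chat set); a quick check (sketch) before reshaping:
```python
# Confirm the fields the conversion code below relies on
print(reasoning_dataset.column_names)      # expect "problem" and "generated_solution" among them
print(non_reasoning_dataset.column_names)  # expect "conversations" (ShareGPT style)
print(reasoning_dataset[0]["problem"][:200])
```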
- def generate_conversation(examples):
- # Get each sample's problem and generated solution
- problems = examples["problem"]
- solutions = examples["generated_solution"]
- conversations = []
- # Walk over each problem/solution pair and turn it into conversation format
- for problem, solution in zip(problems, solutions):
- conversations.append([
- {"role" : "user", "content" : problem}, # user question
- {"role" : "assistant", "content" : solution}, # assistant answer
- ])
- # Return a dict with the conversations under the key "conversations"
- return { "conversations": conversations, }
- # Convert each sample in reasoning_dataset to user/assistant conversation format and apply the chat template, producing the final list of conversation strings
- reasoning_conversations = tokenizer.apply_chat_template(
- reasoning_dataset.map(generate_conversation, batched = True)["conversations"],
- tokenize = False,
- )
- reasoning_conversations[0]
- # Import Unsloth's standardize_sharegpt helper to normalize the non-reasoning (ShareGPT-style) dataset into a uniform multi-turn conversation format
- from unsloth.chat_templates import standardize_sharegpt
- dataset = standardize_sharegpt(non_reasoning_dataset)
- # Take the standardized "conversations" field directly and convert it to final conversation text with tokenizer.apply_chat_template
- # Since it is already in standard multi-turn format, there is no need to assemble the user/assistant structure by hand as with the math-olympiad dataset
- non_reasoning_conversations = tokenizer.apply_chat_template(
- dataset["conversations"],
- tokenize = False,
- )
- non_reasoning_conversations[0]
- print(len(reasoning_conversations))
- print(len(non_reasoning_conversations))
- import pandas as pd
- chat_percentage = 0.25 # target share of chat (non-reasoning) data in the final mix; this value is not defined in the original post and is assumed here
- non_reasoning_subset = pd.Series(non_reasoning_conversations)
- non_reasoning_subset = non_reasoning_subset.sample(
- int(len(reasoning_conversations)*(chat_percentage/(1 - chat_percentage))),
- random_state = 2407,
- )
- print(len(reasoning_conversations))
- print(len(non_reasoning_subset))
- print(len(non_reasoning_subset) / (len(non_reasoning_subset) + len(reasoning_conversations)))
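The sampling formula keeps chat data at chat_percentage of the final mix: if the reasoning set has N examples, taking N·p/(1−p) chat examples gives a chat share of exactly p. A worked check with the assumed p = 0.25 and a placeholder N:
```python
# Worked example of the mixing ratio (numbers are illustrative)
N = 10_000            # placeholder for len(reasoning_conversations)
p = 0.25              # chat_percentage
chat_n = int(N * p / (1 - p))
print(chat_n)                 # N/3 when p = 0.25
print(chat_n / (chat_n + N))  # ≈ 0.25, i.e. 25% chat data in the combined set
```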
- data = pd.concat([
- pd.Series(reasoning_conversations),
- pd.Series(non_reasoning_subset)
- ])
- data.name = "text"
- from datasets import Dataset
- combined_dataset = Dataset.from_pandas(pd.DataFrame(data))
- combined_dataset = combined_dataset.shuffle(seed = 3407)
- from trl import SFTTrainer, SFTConfig
- # Create an SFTTrainer instance for supervised fine-tuning of the language model
- trainer = SFTTrainer(
- model = model, # the loaded model with LoRA adapters attached
- tokenizer = tokenizer, # the matching tokenizer
- train_dataset = combined_dataset, # training set, already mixing reasoning and chat data
- eval_dataset = None, # no validation set (optional; can be added later for evaluation)
- args = SFTConfig(
- dataset_text_field = "text", # name of the text field in the dataset used for training
- per_device_train_batch_size = 1, # batch size per GPU; 1 is small and suits large models with limited VRAM
- gradient_accumulation_steps = 8, # gradient accumulation steps; effective batch size = 1*8 = 8, which stabilizes training
- warmup_steps = 5, # learning-rate warmup steps to avoid instability at the start
- # num_train_epochs = 1, # optional: train for one full epoch (commented out; max_steps is used for the demo)
- max_steps = 30, # train for at most 30 steps to keep the demo fast (increase for real training)
- learning_rate = 2e-4, # initial learning rate; relatively high for a short demo, drop to 2e-5 for longer runs
- logging_steps = 1, # log every step to watch training progress
- optim = "adamw_8bit", # 8-bit AdamW optimizer, saves VRAM, good for large models
- weight_decay = 0.01, # weight decay against overfitting
- lr_scheduler_type = "linear", # linear learning-rate schedule, simple and common
- seed = 3407, # random seed for reproducibility
- report_to = "none", # do not report to WandB etc.; purely local training
- ),
- )
- # Summary: this configuration balances VRAM, training speed, and reproducibility, and suits quick fine-tuning demos on limited hardware such as Colab
- trainer_stats = trainer.train()
- messages = [
- {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
- ]
- text = tokenizer.apply_chat_template(
- messages,
- tokenize = False,
- add_generation_prompt = True, # Must add for generation
- enable_thinking = False, # Disable thinking
- )
- from transformers import TextStreamer
- _ = model.generate(
- **tokenizer(text, return_tensors = "pt").to("cuda"),
- max_new_tokens = 256, # Increase for longer outputs!
- temperature = 0.7, top_p = 0.8, top_k = 20, # For non thinking
- streamer = TextStreamer(tokenizer, skip_prompt = True),
- )
- messages = [
- {"role" : "user", "content" : "Solve (x + 2)^2 = 0."}
- ]
- text = tokenizer.apply_chat_template(
- messages,
- tokenize = False,
- add_generation_prompt = True, # Must add for generation
- enable_thinking = True, # Enable thinking
- )
- from transformers import TextStreamer
- _ = model.generate(
- **tokenizer(text, return_tensors = "pt").to("cuda"),
- max_new_tokens = 1024, # Increase for longer outputs!
- temperature = 0.6, top_p = 0.95, top_k = 20, # For thinking
- streamer = TextStreamer(tokenizer, skip_prompt = True),
- )
- # Merge to 16bit
- if True:
- model.save_pretrained_merged("qwen3mathmodel", tokenizer, save_method = "merged_16bit",)
- if False: # Pushing to HF Hub
- model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")
- # Merge to 4bit
- if False:
- model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
- if False: # Pushing to HF Hub
- model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")
- # Just LoRA adapters
- if False:
- model.save_pretrained("model")
- tokenizer.save_pretrained("model")
- if False: # Pushing to HF Hub
- model.push_to_hub("hf/model", token = "")
- tokenizer.push_to_hub("hf/model", token = "")
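This is presumably what the earlier llama.cpp build is for: Unsloth can also export the fine-tuned model to GGUF so it runs under llama.cpp or Ollama. A sketch assuming Unsloth's save_pretrained_gguf/push_to_hub_gguf helpers; check the current Unsloth docs for the supported quantization_method values.
```python
# Export to GGUF for llama.cpp / Ollama (uses the llama.cpp build under the hood)
if False:  # flip to True to export
    model.save_pretrained_gguf(
        "qwen3mathmodel-gguf",
        tokenizer,
        quantization_method = "q4_k_m",  # common 4-bit GGUF quantization
    )
# Pushing GGUF to the Hub works the same way:
# model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
```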