Model Adaptation Workflow Overview

Code changes

torch_npu is now mature enough that most library code needs no changes; adding the following imports to the main function is usually enough to get the model running.

import torch
import torch_npu
import deepspeed
import deepspeed_npu
from torch_npu.contrib import transfer_to_npu

Note:

An error such as module 'torch._C' has no attribute '_cuda_setDevice'

usually means that from torch_npu.contrib import transfer_to_npu was not added, so device calls were not redirected to the NPU.

Next, modify the model's forward-pass file, modeling_telechat.py.

In class FlashSelfAttention(torch.nn.Module):

replace the flash_attn_unpadded_func function with torch_npu.npu_fusion_attention:

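import math  # needed for the scale factor; place with the file's other imports

# Causal mask: entries above the diagonal are 1, marking future positions.
seq_len = q.shape[1]
atten_mask_ = torch.triu(torch.ones(seq_len, seq_len), 1).to(torch.float)
atten_mask_npu = atten_mask_.clone().bool().to(q.device)
# q is laid out as (batch, sequence, heads, head_dim), i.e. "BSND".
head_num = q.shape[2]
output = torch_npu.npu_fusion_attention(
    q,
    k,
    v,
    head_num,
    "BSND",
    keep_prob=1.0,
    atten_mask=atten_mask_npu,
    scale=1.0 / math.sqrt(q.shape[-1]),
    pre_tockens=seq_len,   # parameter name spelled as in the source
    next_tockens=0,        # parameter name spelled as in the source
    inner_precise=0)[0]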

Performance optimization

The Huawei community documentation generates the mask with:

atten_mask_npu= torch.from_numpy(np.triu(np.ones([max_seqlen_q, max_seqlen_k]), k=1))

Building the mask with torch functions alone avoids the round trip through NumPy, which saves the time the torch.from_numpy path spends moving data from one memory region to another:

atten_mask_ = torch.triu(torch.ones(q.shape[1], q.shape[1]),1).to(torch.float)
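To make explicit what this upper-triangular mask encodes, here is a torch-free sketch that mirrors torch.triu(torch.ones(s, s), 1). The helper name causal_mask is hypothetical, and reading a 1 as "this position is masked out" is an assumption inferred from how the mask is used above; check it against the torch_npu documentation.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len mask.

    Entry [i][j] is True when j > i, i.e. when key position j lies in the
    future of query position i. This matches the nonzero entries of
    torch.triu(torch.ones(seq_len, seq_len), 1).
    """
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Print the mask as rows of 0/1 for a short sequence.
    for row in causal_mask(4):
        print("".join("1" if masked else "0" for masked in row))
    # 0111
    # 0011
    # 0001
    # 0000
```

Each row has zeros up to and including the diagonal, so a token sees itself and everything before it.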

Among the input layouts torch_npu.npu_fusion_attention accepts, two are relevant here: TND and BSND.

TND:

T is the total number of tokens in the batch, equivalent to B times S; N is head_num, the number of attention heads; D is the hidden size divided by head_num.

BSND:

B is the batch size and S is the sequence length in tokens; N and D are as above.
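The relationship between the two layouts can be sketched as plain shape arithmetic. The function name attention_shapes and the example sizes are hypothetical, and T = B * S assumes every sequence in the batch has the same length.

```python
def attention_shapes(batch_size, seq_len, hidden_size, head_num):
    """Return the q/k/v tensor shape for the BSND and TND layouts."""
    if hidden_size % head_num != 0:
        raise ValueError("hidden_size must be divisible by head_num")
    head_dim = hidden_size // head_num  # D: per-head dimension
    return {
        # (B, S, N, D)
        "BSND": (batch_size, seq_len, head_num, head_dim),
        # (T, N, D), with T = B * S for equal-length sequences
        "TND": (batch_size * seq_len, head_num, head_dim),
    }


if __name__ == "__main__":
    shapes = attention_shapes(batch_size=2, seq_len=2048,
                              hidden_size=4096, head_num=32)
    print(shapes["BSND"])  # (2, 2048, 32, 128)
    print(shapes["TND"])   # (4096, 32, 128)
```

Going from BSND to TND only merges the first two axes, so the element count is unchanged.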

Profiling collection

profiler_level = Constant.LEVEL2: LEVEL2 is the recommended collection level because it gathers the most data. The call below is written with LEVEL0, so pass LEVEL2 explicitly.

torch_npu.profiler._ExperimentalConfig(profiler_level = Constant.LEVEL0, aic_metrics = Constant.AicMetricsNone, l2_cache = False, record_op_args = False)

wait, skip_first, and warmup each count steps that are not collected; active counts the steps that are collected; repeat is the number of times the active collection is repeated.

torch_npu.profiler.schedule (wait, active, warmup = 0, repeat = 0, skip_first = 0)
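How these parameters interact is easiest to see by simulating which training steps end up collected. The sketch below is a plain-Python model written after the behaviour of torch.profiler.schedule; that torch_npu.profiler.schedule follows the same ordering (skip_first, then repeated wait / warmup / active cycles) is an assumption, and the helper names are hypothetical.

```python
def schedule_phase(step, wait, active, warmup=0, repeat=0, skip_first=0):
    """Classify a 0-based training step under the given schedule."""
    if step < skip_first:
        return "skip_first"
    step -= skip_first
    cycle = wait + warmup + active  # length of one full cycle
    # repeat == 0 is modelled as "keep cycling for the whole run".
    if repeat > 0 and step // cycle >= repeat:
        return "done"
    pos = step % cycle
    if pos < wait:
        return "wait"
    if pos < wait + warmup:
        return "warmup"
    return "active"


def recorded_steps(num_steps, **schedule):
    """List the steps whose data would be collected."""
    return [s for s in range(num_steps)
            if schedule_phase(s, **schedule) == "active"]


if __name__ == "__main__":
    print(recorded_steps(10, wait=1, active=2, warmup=1,
                         repeat=2, skip_first=1))
    # [3, 4, 7, 8]
```

With skip_first=1, wait=1, warmup=1, active=2, each cycle is four steps long and only its last two steps are collected.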

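Data format

Data for large language models generally falls into pretraining data and fine-tuning data: pretraining data is plain text, and fine-tuning data is question-answer pairs.

Both training methods teach the LLM to predict the next token, and both concatenate tokens into sequences of the user-specified max_length. One token sequence of length max_length is one sample.

Pretraining simply concatenates the tokens of the text.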
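The pretraining packing step can be sketched as follows. The function name pack_tokens is hypothetical, the token ids are made up, and dropping the incomplete tail is an assumption; the text does not say how leftovers are handled.

```python
def pack_tokens(documents, max_length):
    """Concatenate tokenized documents and cut them into fixed-size samples.

    documents: iterable of token-id lists, one per text.
    Returns a list of samples, each exactly max_length tokens long.
    Leftover tokens that do not fill a whole sample are dropped (assumption).
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    stream = [tok for doc in documents for tok in doc]
    usable = len(stream) - len(stream) % max_length
    return [stream[i:i + max_length] for i in range(0, usable, max_length)]


if __name__ == "__main__":
    docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(pack_tokens(docs, max_length=4))
    # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Samples may straddle document boundaries, which is what "simple concatenation" implies.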

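Fine-tuning wraps each question-answer pair in special tokens that mark the question, the answer, and the end of the exchange, for example <_user>English name of the Tianyi Cloud company.<_bot>state cloud.<end>. Multiple dialogues are then concatenated into a token sequence of length max_length, and any remaining positions are filled with pad_token.

The training methods differ in the loss: pretraining computes the loss over all tokens, whereas full-parameter fine-tuning counts only the loss on the answer part (a mask covers the loss on the question).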
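A minimal sketch of building one fine-tuning sample and its loss mask is given below. The token ids, function names, and the choice to count the end token but not the <_bot> marker are all assumptions made for illustration; the one-position shift between inputs and labels in next-token prediction is ignored for brevity.

```python
def build_sft_sample(dialogues, max_length,
                     user_id=1, bot_id=2, end_id=3, pad_id=0):
    """Pack (question, answer) token lists into one padded sample.

    Returns (input_ids, loss_mask); loss_mask is 1 only on answer tokens
    and the end token, 0 on question tokens, role markers, and padding.
    The special-token ids are hypothetical placeholders.
    """
    input_ids, loss_mask = [], []
    for question, answer in dialogues:
        prompt = [user_id] + list(question) + [bot_id]
        reply = list(answer) + [end_id]
        input_ids += prompt + reply
        loss_mask += [0] * len(prompt) + [1] * len(reply)
    if len(input_ids) > max_length:
        raise ValueError("dialogues do not fit in max_length")
    padding = max_length - len(input_ids)
    return input_ids + [pad_id] * padding, loss_mask + [0] * padding


def masked_mean_loss(token_losses, loss_mask):
    """Average per-token losses over the positions where the mask is 1."""
    counted = sum(loss_mask)
    if counted == 0:
        return 0.0
    return sum(l * m for l, m in zip(token_losses, loss_mask)) / counted


if __name__ == "__main__":
    ids, mask = build_sft_sample([([10, 11], [20])], max_length=8)
    print(ids)   # [1, 10, 11, 2, 20, 3, 0, 0]
    print(mask)  # [0, 0, 0, 0, 1, 1, 0, 0]
```

For pretraining, the same masked_mean_loss applies with a mask of all ones over the non-padding positions.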
