I could imagine his giving a friend a little pinch of the latest vegetable alkaloid, not out of malevolence, you understand, but simply out of a spirit of inquiry in order to have an accurate idea of the effects.
```python
content = "I could imagine his giving a friend a little pinch of the latest vegetable alkaloid, not out of malevolence, you understand, but simply out of a spirit of inquiry in order to have an accurate idea of the effects."

# Write the content to a txt file
with open('output.txt', 'w') as file:
    file.write(content)
```
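A quick way to confirm the write succeeded is to read the file back and compare it with the original string. A minimal round-trip check (using a temporary directory rather than the working directory, which is an assumption for portability):

```python
import os
import tempfile

# Round-trip check: write a string, read it back, and compare
text = "a little pinch of the latest vegetable alkaloid"
path = os.path.join(tempfile.gettempdir(), "output.txt")

with open(path, "w") as f:
    f.write(text)

with open(path) as f:
    assert f.read() == text

print("round-trip ok")
```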
2. Tokenize the content, collect all the words in the text, and apply one-hot encoding to obtain a one-hot vector representation of each word.
```python
import jieba
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

text = 'I could imagine his giving a friend a little pinch of the latest vegetable alkaloid, not out of malevolence, you understand, but simply out of a spirit of inquiry in order to have an accurate idea of the effects.'

# Tokenize the text
lis = jieba.lcut(text)

# Map each token to an integer label
label_encoder = LabelEncoder()
label_encoded = label_encoder.fit_transform(lis)

# Convert the integer labels to one-hot vectors,
# one row per token in the original sequence
one_hot_encoder = OneHotEncoder()
one_hot_encoded = one_hot_encoder.fit_transform(label_encoded.reshape(-1, 1)).toarray()
```
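What the `LabelEncoder`/`OneHotEncoder` pair does amounts to: build a vocabulary of the unique tokens, then emit one row per token with a 1 at that token's vocabulary index. A minimal pure-Python sketch of the same idea, using a hypothetical toy token list rather than the real tokenizer output:

```python
def one_hot_encode(tokens):
    # Vocabulary: unique tokens in sorted order
    # (LabelEncoder also assigns labels in sorted order)
    vocab = sorted(set(tokens))
    index = {tok: i for i, tok in enumerate(vocab)}
    # One row per token, with a 1 at that token's vocabulary index
    vectors = []
    for tok in tokens:
        row = [0] * len(vocab)
        row[index[tok]] = 1
        vectors.append(row)
    return vocab, vectors

# Toy example: repeated tokens get identical one-hot rows
vocab, vecs = one_hot_encode(["a", "little", "pinch", "a"])
print(vocab)               # ['a', 'little', 'pinch']
print(vecs[0] == vecs[3])  # True
```

Note that one-hot vectors carry no notion of similarity: every pair of distinct words is equally far apart, which is precisely the limitation the BERT embeddings in the next step address.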
```python
import torch
from transformers import BertTokenizer, BertModel

model_name = "bert-base-cased"  # replace with your model name
tokenizer = BertTokenizer.from_pretrained(model_name)

text = "I could imagine his giving a friend a little pinch of the latest vegetable alkaloid, not out of malevolence, you understand, but simply out of a spirit of inquiry in order to have an accurate idea of the effects."

# Tokenize the text and add the special tokens ([CLS] and [SEP])
input_ids = tokenizer.encode(text, add_special_tokens=True)

# Wrap the ID sequence in a batch dimension for the BERT model
input_tensor = torch.tensor([input_ids])

# Load the pre-trained BERT model
model = BertModel.from_pretrained(model_name)

# Pass the input through BERT; the first output is the last hidden state,
# with shape (batch_size, sequence_length, hidden_size)
outputs = model(input_tensor)
last_hidden_state = outputs[0]

# Drop the batch dimension, leaving one vector per token
sentence_vector = last_hidden_state.squeeze(0)
print("Sentence vector:", sentence_vector)
```
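Strictly speaking, the `sentence_vector` above is a matrix of per-token vectors, not a single sentence embedding. A common way to collapse it into one fixed-size vector is mean pooling over the token axis (on the tensor itself this would be `last_hidden_state.mean(dim=1)`). A plain-Python sketch of that pooling, on a hypothetical toy matrix so it runs without torch:

```python
def mean_pool(token_vectors):
    # token_vectors: a list of per-token embedding lists, all the same length.
    # Average each dimension across all tokens to get one fixed-size vector.
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]

# Toy example: three 2-dimensional token vectors
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(pooled)  # [3.0, 4.0]
```

Another frequently used shortcut is to take only the [CLS] token's vector (`last_hidden_state[0, 0]`) as the sentence representation.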