How to Create N-grams Using Python Features and Libraries

Updated: 2023-09-29 09:35:25  Author: 迹忆客

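In computational linguistics, n-grams are important for language processing and for context and semantic analysis. An n-gram is a contiguous sequence of n adjacent words taken from a string of tokens. This article discusses how to create n-grams using Python's built-in features and libraries.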

Creating n-grams from text in Python with a for loop

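We can write an ngrams function that takes a text and a value n and returns a list containing the n-grams.

To build this function, we split the text into words and create an empty list (output) to hold the n-grams. A for loop then walks over every start position in the splitInput list at which a full window of n words still fits.

At each position, the slice of n consecutive words (tokens) is appended to the output list.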

def ngrams(input, num):
    # Split the text into word tokens on any run of whitespace
    splitInput = input.split()
    output = []
    # Slide a window of num tokens across the list, one position at a time
    for i in range(len(splitInput) - num + 1):
        output.append(splitInput[i:i + num])
    return output


text = "Welcome to the abode, and more importantly, our in-house exceptional cooking service which is close to the Burj Khalifa"
print(ngrams(text, 3))

Output:

[['Welcome', 'to', 'the'], ['to', 'the', 'abode,'], ['the', 'abode,', 'and'], ['abode,', 'and', 'more'], ['and', 'more', 'importantly,'], ['more', 'importantly,', 'our'], ['importantly,', 'our', 'in-house'], ['our', 'in-house', 'exceptional'], ['in-house', 'exceptional', 'cooking'], ['exceptional', 'cooking', 'service'], ['cooking', 'service', 'which'], ['service', 'which', 'is'], ['which', 'is', 'close'], ['is', 'close', 'to'], ['close', 'to', 'the'], ['to', 'the', 'Burj'], ['the', 'Burj', 'Khalifa']]
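
The same sliding window can also be expressed with zip. The sketch below is an alternative, not the function used above: ngrams_zip is a hypothetical name, it assumes plain whitespace tokenization, and it returns tuples rather than lists.

```python
def ngrams_zip(text, n):
    """Return the n-grams of text as a list of tuples (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumption: whitespace tokenization, so punctuation stays attached to words
    tokens = text.split()
    # zip over n shifted views of the token list; zip stops at the shortest
    # view, so a text with fewer than n tokens yields an empty list
    return list(zip(*(tokens[i:] for i in range(n))))


print(ngrams_zip("Welcome to the abode", 2))
# [('Welcome', 'to'), ('to', 'the'), ('the', 'abode')]
```

Tuples are hashable, so this form is convenient when the n-grams are later used as dictionary keys or set members.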

Creating n-grams in Python with NLTK

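NLTK (the Natural Language Toolkit) provides easy-to-use interfaces to important text-processing resources such as tokenizers. To install nltk, use the following pip command.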

pip install nltk

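Before writing more detailed code, let's call word_tokenize() to expose a potential problem. This function produces a tokenized copy of the text using NLTK's recommended word tokenizer.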

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)

Output:

Traceback (most recent call last):
  File "c:\Users\akinl\Documents\Python\SFTP\n-gram-two.py", line 4, in <module>
    tokens = nltk.word_tokenize(text)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 129, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python310\lib\site-packages\nltk\tokenize\__init__.py", line 106, in sent_tokenize
    tokenizer = load(f"tokenizers/punkt/{language}.pickle")
  File "C:\Python310\lib\site-packages\nltk\data.py", line 750, in load
    opened_resource = _open(resource_url)
  File "C:\Python310\lib\site-packages\nltk\data.py", line 876, in _open
    return find(path_, path + [""]).open()
  File "C:\Python310\lib\site-packages\nltk\data.py", line 583, in find
    raise LookupError(resource_not_found)
LookupError:
**********************************************************************
Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

>>> import nltk
>>> nltk.download('punkt')

For more information see: https://www.nltk.org/data.html

Attempted to load tokenizers/punkt/english.pickle

Searched in:
- 'C:\\Users\\akinl/nltk_data'
- 'C:\\Python310\\nltk_data'
- 'C:\\Python310\\share\\nltk_data'
- 'C:\\Python310\\lib\\nltk_data'
- 'C:\\Users\\akinl\\AppData\\Roaming\\nltk_data'
- 'C:\\nltk_data'
- 'D:\\nltk_data'
- 'E:\\nltk_data'
- ''
**********************************************************************

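This error occurs because some NLTK functions depend on data resources that have not been downloaded yet, which is typically the case the first time you use the library. We therefore use the NLTK Downloader to fetch two data packages, punkt and averaged_perceptron_tagger. Only punkt is required by word_tokenize(); averaged_perceptron_tagger is used by NLTK's part-of-speech tagger. Newer NLTK releases may report a differently named resource (for example punkt_tab); in that case, download the name shown in the error message.

Functions such as word_tokenize() load this data when they are called. To fix the problem, create a Python file and run the following code.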

import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Alternatively, run the following commands from the command line:

python -m nltk.downloader punkt
python -m nltk.downloader averaged_perceptron_tagger
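
If a script should download the data only when it is missing, the lookup that failed in the traceback can be wrapped in a small check. The sketch below is one possible approach: ensure_nltk_resource is a hypothetical helper, and its find and download parameters exist only so the logic can be exercised without NLTK installed. By default it falls back to nltk.data.find and nltk.download.

```python
def ensure_nltk_resource(resource_path, package_name, find=None, download=None):
    """Download package_name only if resource_path cannot be found.

    Hypothetical helper. Returns True if a download was triggered.
    """
    if find is None or download is None:
        # Imported lazily so the helper can be tested with stand-in callables
        import nltk
        find = find or nltk.data.find
        download = download or nltk.download
    try:
        # nltk.data.find raises LookupError when the resource is absent
        find(resource_path)
        return False
    except LookupError:
        download(package_name)
        return True
```

With NLTK installed, a call such as ensure_nltk_resource("tokenizers/punkt", "punkt") would skip the download on later runs. The resource path and package name should match whatever the LookupError message reports for your NLTK version.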

Example code:

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print(list(textBigGrams), list(textTriGrams))

Output:

[('well', 'the'), ('the', 'money'), ('money', 'has'), ('has', 'finally'), ('finally', 'come')] [('well', 'the', 'money'), ('the', 'money', 'has'), ('money', 'has', 'finally'), ('has', 'finally', 'come')]

To print each n-gram as a readable string instead of a tuple, join its words with spaces. Example code:

import nltk
text = "well the money has finally come"
tokens = nltk.word_tokenize(text)
textBigGrams = nltk.bigrams(tokens)
textTriGrams = nltk.trigrams(tokens)
print("The Bigrams of the Text are")
print(*map(' '.join, textBigGrams), sep=', ')
print("The Trigrams of the Text are")
print(*map(' '.join, textTriGrams), sep=', ')

Output:

The Bigrams of the Text are
well the, the money, money has, has finally, finally come
The Trigrams of the Text are
well the money, the money has, money has finally, has finally come
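
The same formatting works for any n, not only bigrams and trigrams. The sketch below does not use NLTK: format_ngrams is a hypothetical helper, and the tokens come from a plain whitespace split, which is an assumption that happens to give the same tokens as word_tokenize() for this punctuation-free sentence.

```python
def format_ngrams(tokens, n, sep=", "):
    """Join each n-gram with spaces and the n-grams with sep (hypothetical helper)."""
    # One window of n tokens per start position at which a full window fits
    windows = (tokens[i:i + n] for i in range(len(tokens) - n + 1))
    return sep.join(" ".join(window) for window in windows)


tokens = "well the money has finally come".split()
print(format_ngrams(tokens, 2))
# well the, the money, money has, has finally, finally come
print(format_ngrams(tokens, 4))
# well the money has, the money has finally, money has finally come
```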
