# extract_words

**Repository Path**: zongkai28/extract_words

## Basic Information

- **Project Name**: extract_words
- **Description**: Use Python to extract words from ISO26262-2018
- **Primary Language**: Python
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2023-01-27
- **Last Updated**: 2023-01-27

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# extract_words

## Introduction
Use Python to extract words from the English-language ISO26262-2018 standard.


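## Code

### Converting the PDF content to TXT

First, the PDF content needs to be converted to plain TXT text so that its words can be analyzed and counted.

The article below mentions the Python package pdfminer3k, the variant of pdfminer that supports Python 3, which makes parsing and reading PDF files straightforward:

[Word-frequency analysis of PDF files with Python, saving the results to a CSV file (cugzyc's blog, CSDN)](https://blog.csdn.net/qq_41333844/article/details/101372969)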


The article below mentions other tools, most of which are no longer supported under Python 3:

[Parsing PDF files with Python (CV小蜗牛's blog, CSDN)](https://blog.csdn.net/u011331397/article/details/121706490)


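Adobe Acrobat DC can also convert the standard directly to TXT text.

Comparing the results, none of the conversions is perfect. Tables in particular come out with many words run together or split apart.

For simplicity, just use Adobe Acrobat DC.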
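
Part of the split-word problem can be reduced with a small cleanup pass over the converted text. The following is a minimal sketch, not the code used in this repository: the function name `rejoin_hyphenated` is hypothetical, and it assumes that a hyphen at the end of a line followed by a lowercase letter marks a word that was broken by the line wrap. Genuine compounds that happen to wrap at their hyphen would be merged incorrectly, so the result should be spot-checked.

```python
import re


def rejoin_hyphenated(text: str) -> str:
    """Rejoin words that were split by a hyphen at the end of a line.

    Assumption: "func-\\ntional" is a wrapped word and becomes "functional".
    A hyphen followed by an uppercase letter on the next line (for example
    "ASIL-\\nD") is left untouched, since it is more likely a real compound.
    """
    # letter, hyphen, optional spaces, line break, optional spaces, lowercase letter
    pattern = r"([A-Za-z])-[ \t]*\r?\n[ \t]*([a-z])"
    return re.sub(pattern, r"\1\2", text)
```

Words that were run together without a separator cannot be repaired this simply and still need manual review.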


### Extracting words from the TXT


The articles below provide code for extracting words from a TXT file, which was used directly as a reference:

[A simple word-frequency count with a Python dictionary (Pandas_007's blog, CSDN)](https://blog.csdn.net/qq_57329395/article/details/127607411)

[Word-frequency counting with Python (普通网友's blog, CSDN)](https://blog.csdn.net/m0_67401153/article/details/125389042)
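
The general idea from those articles can be sketched with the standard library alone. This is an illustrative sketch rather than the repository's actual script: the function name `count_words` and the `min_length` parameter are assumptions, and `collections.Counter` is used in place of a hand-written dictionary.

```python
import re
from collections import Counter


def count_words(text: str, min_length: int = 2) -> Counter:
    """Count English words in text, ignoring case.

    A word is a run of ASCII letters, optionally joined by single hyphens
    (so "safety-related" counts as one word). Words shorter than
    min_length are dropped to filter out stray single letters.
    """
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    return Counter(word for word in words if len(word) >= min_length)
```

To use it, read the converted TXT file into a string, pass it to `count_words`, and call `.most_common()` on the result to list words from most to least frequent.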


### Looking words up on Bing and building an Anki deck

The following references were used:

- https://zhuanlan.zhihu.com/p/27163677
- https://www.52pojie.cn/thread-1081832-1-1.html
- https://github.com/tongfeima/Make_Anki_Package
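
One simple way to get words into Anki, which is not necessarily the method used in the references above, is to write a tab-separated text file and load it through Anki's text import. The sketch below assumes the definitions have already been collected (for example from the Bing lookups) into a mapping of word to definition. The function name `write_anki_tsv` and the two-field layout (front: word, back: definition) are illustrative assumptions.

```python
import csv
from typing import Mapping, TextIO


def write_anki_tsv(entries: Mapping[str, str], out: TextIO) -> int:
    """Write word/definition pairs as tab-separated lines, one note per line.

    Returns the number of rows written. Whitespace inside a definition
    (including tabs and line breaks) is collapsed to single spaces so that
    every note stays on one line with exactly two fields.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for word, definition in sorted(entries.items()):
        writer.writerow([word, " ".join(definition.split())])
        count += 1
    return count
```

When writing to a real file, opening it with `open(path, "w", encoding="utf-8", newline="")` keeps the line endings under the control of the `csv` writer.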


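### Word-frequency statistics

The original plan was to present the word frequencies as a word cloud or a bar chart, but that seemed to be of limited value, so it is shelved for now.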


## Usage

1.  xxxx
2.  xxxx
3.  xxxx