# NLP-Projects

**Repository Path**: create_future/NLP-Projects

## Basic Information

- **Project Name**: NLP-Projects
- **Description**: NLP R&D
- **Primary Language**: Unknown
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-09-15
- **Last Updated**: 2021-09-15

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# NLP-Projects

Natural Language Processing projects, which include concepts and scripts about:

- [0_Word2vec](https://github.com/gaoisbest/NLP-Projects/blob/master/0_Word2vec/README.md)
  - `gensim`, `fastText` and `tensorflow` implementations. See [Chinese notes](http://url.cn/5PKmy7W)
- [1_Sentence2vec](https://github.com/gaoisbest/NLP-Projects/blob/master/1_Sentence2vec/README.md)
  - `doc2vec`, `word2vec averaging` and `Smooth Inverse Frequency` implementations
- [2_Machine_reading_comprehension](https://github.com/gaoisbest/NLP-Projects/blob/master/2_Machine_reading_comprehension/README.md)
- [3_Dialog_system](https://github.com/gaoisbest/NLP-Projects/blob/master/3_Dialog_system/README.md)
  - Categories and components of dialog systems
- [4_Text_classification](https://github.com/gaoisbest/NLP-Projects/blob/master/4_Text_classification/README.md)
  - `tensorflow LSTM` (see [Chinese notes 1](http://url.cn/5cLDOQI) and [Chinese notes 2](http://url.cn/5w5VbaI))
  - `fastText` implementation
- [5_Pretraining_LM](https://github.com/gaoisbest/NLP-Projects/blob/master/5_Pretraining_LM/README.md)
  - Principles of ELMo, ULMFiT, GPT, BERT and XLNet
- [6_Sequence_labeling](https://github.com/gaoisbest/NLP-Projects/blob/master/6_Sequence_labeling/README.md)
  - [Chinese_word_segmentation](https://github.com/gaoisbest/NLP-Projects/blob/master/6_Sequence_labeling/Chinese_word_segmentation/README.md)
    - `HMM Viterbi` implementations. See [Chinese notes](http://url.cn/5x4KR8u)
  - [Named_Entity_Recognition](https://github.com/gaoisbest/NLP-Projects/tree/master/6_Sequence_labeling/Named_Entity_Recognition)
    - Brand NER via bi-directional LSTM + CRF, `tensorflow` implementation. See [Chinese notes](http://url.cn/5fcC754)
- [7_Information_retrieval](https://github.com/gaoisbest/NLP-Projects/blob/master/7_Information_retrieval/README.md)
- [8_Information_extraction](https://github.com/gaoisbest/NLP-Projects/blob/master/8_Information_extraction/README.md)
- [9_Knowledge_graph](https://github.com/gaoisbest/NLP-Projects/blob/master/9_Knowledge_graph/README.md)
- [10_Text_generation](https://github.com/gaoisbest/NLP-Projects/blob/master/10_Text_generation/README.md)
- [11_Network_embedding](https://github.com/gaoisbest/NLP-Projects/blob/master/11_Network_embedding/README.md)

# Concepts

### 1. Attention

- Attention == **weighted averages** (see the sketch after this list)
- [Review 1](https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html) and [review 2](https://zhuanlan.zhihu.com/p/31547842) summarize attention mechanisms into several types:
  - Additive vs multiplicative attention
  - Self attention
  - Soft vs hard attention
  - Global vs local attention
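To make the "weighted averages" view concrete, here is a minimal NumPy sketch of multiplicative (scaled dot-product) attention. The function name, matrix shapes and random toy inputs are illustrative assumptions, not code from this repository.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention as weighted averages: each query yields a softmax
    distribution over the keys, and the output is the corresponding
    weighted average of the value vectors."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted average of the values

# Toy example: 2 queries attend over 3 key/value pairs of dimension 4.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(2, 4)), rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)  # -> (2, 4)
```

Setting `Q = K = V` (each row a token representation of the same sequence) turns this into the self-attention variant listed above.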
### 2. CNNs, RNNs and Transformer

- **Parallelization** [1]
  - RNNs
    - Why not good?
      - **The last step's output is the input of the current step**
    - Solutions
      - **Simple Recurrent Units (SRU)**: perform the parallelization on each hidden-state neuron independently
      - **Sliced RNNs**: split the sequence into windows, run an RNN inside each window and another RNN over the windows (the same idea as CNNs)
  - CNNs
    - Why good?
      - Parallel across the different windows of one filter
      - Parallel across different filters
- **Long-range dependency** [1]
  - CNNs
    - Why not good?
      - A single convolution can only capture window-range dependencies
    - [Solutions](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture11-convnets.pdf)
      - Dilated CNNs
      - Deep CNNs: `N * [Convolution + skip-connection]`
        - For example, with window size 3 and stride 1, the second convolution covers 5 words (i.e., 1-2-3, 2-3-4, 3-4-5)
  - Transformer > RNNs > CNNs
- **Position** [1]
  - CNNs
    - Why not good?
      - Convolution preserves **relative-order** information, but **max-pooling discards it**
    - Solutions
      - Discard max-pooling and use deep CNNs with skip-connections instead
      - Add position embeddings, as in ConvS2S
  - [Transformer](https://github.com/gaoisbest/NLP-Projects/blob/master/Pretraining_LM/README.md#transformer)
    - Why not good?
      - In self-attention, one word attends to the other words and generates the summarization vector without relative position information
- **Semantic features extraction** [2]
  - Transformer > CNNs == RNNs

### 3. Pattern of DL in NLP models [3]

- **Data**
  - Preprocessing
    - [Sub-word segmentation](https://medium.com/@makcedward/how-subword-helps-on-your-nlp-model-83dd1b836f46) to avoid OOV and reduce the vocabulary size (a minimal `sentencepiece` sketch is given at the end of this README)
      - [sentencepiece](https://github.com/google/sentencepiece)
  - Pre-training (e.g., ELMo, BERT)
  - [Multi-task learning](https://mp.weixin.qq.com/s/ulZBmyt_L-RgGEGhNxrHeQ)
  - Transfer learning, [ref_1](https://mp.weixin.qq.com/s/UJlmjFHWhnlXXJoRv4zkEQ), [ref_2](http://ruder.io/transfer-learning/)
    - Use a source task/domain `S` to improve a target task/domain `T`
    - If `T` has zero/one/a few labeled instances, we call it zero-shot, one-shot or few-shot learning, respectively
- **Model**
  - Encoder
    - CNNs, RNNs, Transformer
  - Structure
    - Sequential, tree, graph
- **Learning** (change the loss definition)
  - Adversarial learning
  - Reinforcement learning

#### References

- [1] [Review](https://zhuanlan.zhihu.com/p/54743941)
- [2] [Why self-attention? A targeted evaluation of neural machine translation architectures](http://aclweb.org/anthology/D18-1458)
- [3] [ACL 2019 oral](https://zhuanlan.zhihu.com/p/72725518?utm_source=wechat_timeline&utm_medium=social&utm_oi=35938507948032&wechatShare=1&s_r=0&from=timeline&isappinstalled=0)

# Awesome public APIs

- [Baidu AI Open Platform](https://ai.baidu.com/)
- [Tencent AI Open Platform](https://ai.qq.com/)
- [Tencent NLP](http://nlp.qq.com/)

# Awesome packages

### Chinese

- [pyltp](http://pyltp.readthedocs.io/zh_CN/develop/api.html)
- [HanLP](http://hanlp.linrunsoft.com/index.html)

### English

- [Spacy](https://spacy.io)
- [gensim](https://radimrehurek.com/gensim/)
- [Install tensorflow with one line](https://towardsdatascience.com/tensorflow-gpu-installation-made-easy-use-conda-instead-of-pip-52e5249374bc): `conda install tensorflow-gpu`

# Future directions

- [Multi-task learning](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture17-multitask.pdf)
- [Self-training](http://web.stanford.edu/class/cs224n/slides/cs224n-2019-lecture20-future.pdf)
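As a concrete illustration of the sub-word segmentation point in concepts section 3 above, here is a minimal `sentencepiece` sketch. The corpus file `corpus.txt`, the model prefix `subword`, the vocabulary size and the BPE model type are hypothetical choices for this example, not settings used by the projects in this repository.

```python
import sentencepiece as spm

# Train a BPE sub-word model on a plain-text corpus (one sentence per line).
# 'corpus.txt', 'subword' and vocab_size=8000 are illustrative placeholders.
spm.SentencePieceTrainer.Train(
    '--input=corpus.txt --model_prefix=subword --vocab_size=8000 --model_type=bpe'
)

# Load the trained model and segment text into sub-word pieces, so that
# rare words become sequences of known pieces instead of OOV tokens.
sp = spm.SentencePieceProcessor()
sp.Load('subword.model')
print(sp.EncodeAsPieces('unbelievable segmentation results'))
print(sp.EncodeAsIds('unbelievable segmentation results'))
```

Because sentencepiece trains directly on raw text, the same recipe applies to both Chinese and English corpora without a language-specific tokenizer.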