FastText
What is FastText?
FastText is a word embedding technique developed by Facebook AI Research in 2016 that extends Word2Vec with subword information. It represents each word as a bag of character n-grams, which lets it capture morphological patterns and represent rare words more effectively than traditional word-level embedding methods.
Key Characteristics
- Subword Information: Uses character n-grams (typically 3-6 characters)
- Morphological Awareness: Captures word structure and morphology
- Rare Word Handling: Better representation of infrequent words
- Out-of-Vocabulary: Can generate vectors for unseen words
- Efficient Training: Optimized implementation for fast training
- Multilingual Support: Works well with morphologically rich languages
- Scalable: Handles large vocabularies effectively
- Transferable: Pre-trained embeddings usable across tasks
Core Concepts
Character N-grams
FastText represents each word as the sum of the vectors of its character n-grams, plus a special sequence for the word itself. For example, with n = 3 the word "where" is represented as:
<wh, whe, her, ere, re> + <where>
Here `<` and `>` mark the word's boundaries, so the tri-gram "her" extracted from "where" is distinct from the whole word "her", which would be represented by its own special sequence "<her>".
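This decomposition can be reproduced in a few lines of Python. The char_ngrams helper below is a hypothetical illustration (not part of any FastText library) and uses the same `<` and `>` boundary markers as above:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word, padded with boundary markers."""
    padded = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

# For n = 3 only, "where" yields: ['<wh', 'whe', 'her', 'ere', 're>']
# FastText additionally keeps the full sequence '<where>' as its own unit.
print(char_ngrams("where", n_min=3, n_max=3))
```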
Subword Embeddings
Each character n-gram has its own vector representation, and the final word vector is the sum of these subword vectors.
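A toy illustration of this summation, using random vectors in place of trained subword embeddings (a real model learns these vectors during training):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4  # tiny dimension for readability

# Subwords of "where" for n = 3, plus the full word as its own unit.
subwords = ["<wh", "whe", "her", "ere", "re>", "<where>"]

# Toy subword table: one random vector per unit (a trained model learns these).
subword_vectors = {g: rng.normal(size=dim) for g in subwords}

# The word vector is the element-wise sum of its subword vectors.
v_where = np.sum([subword_vectors[g] for g in subwords], axis=0)
print(v_where)
```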
Architecture
```mermaid
graph TD
    A[Word] --> B[Character N-grams]
    B --> C[Subword Vectors]
    C --> D[Word Vector]
    D --> E[Applications]
    style A fill:#f9f,stroke:#333
    style E fill:#f9f,stroke:#333
```
FastText vs Other Embedding Methods
| Feature | FastText | Word2Vec | GloVe | BERT/Transformers |
|---|---|---|---|---|
| Subword Info | Yes (character n-grams) | No | No | Yes (WordPiece/BytePair) |
| Rare Words | Excellent handling | Poor handling | Poor handling | Excellent handling |
| OOV Words | Can generate vectors | Cannot handle | Cannot handle | Can handle |
| Training Speed | Fast | Fast | Very fast | Slow |
| Memory Usage | Moderate | Low | Low | High |
| Morphology | Excellent | Limited | Limited | Good |
| Contextual | No | No | No | Yes |
| Vector Arithmetic | Good | Excellent | Excellent | Limited |
| Pre-trained Models | Available | Available | Available | Widely available |
Training Process
- Extract character n-grams for each word
- Initialize subword vectors randomly
- Train using Skip-gram or CBOW architecture
- Combine subword vectors to form the final word representation (a minimal training sketch follows this list)
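A minimal sketch of this pipeline with Gensim's FastText implementation (assuming Gensim ≥ 4; the toy corpus and parameter values are purely illustrative):

```python
from gensim.models import FastText

# Toy corpus: a real run would use a large, tokenized corpus.
sentences = [
    ["fasttext", "builds", "vectors", "from", "character", "ngrams"],
    ["subword", "information", "helps", "with", "rare", "words"],
]

model = FastText(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window
    min_count=1,       # keep even rare words
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    min_n=3, max_n=6,  # character n-gram lengths
    epochs=10,
)

# Vectors exist even for words never seen in training, via their n-grams.
print(model.wv["subwords"][:5])
```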
Mathematical Formulation
FastText extends the Skip-gram objective function:
$$ J = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c,\ j \neq 0} \log p(w_{t+j} \mid w_t) $$
Where the word representation includes both the word vector and its subword vectors:
$$ v_w = \sum_{g \in \mathcal{G}_w} z_g $$
Where:
- $ \mathcal{G}_w $ is the set of character n-grams of word $ w $, including the special sequence for the word itself
- $ z_g $ is the vector for n-gram $ g $
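In the original paper, the subword vectors enter the model through the scoring function between a word $ w $ and a context word $ c $, which replaces the plain word-vector dot product of Skip-gram:
$$ s(w, c) = \sum_{g \in \mathcal{G}_w} z_g^{\top} v_c $$
where $ v_c $ is the output (context) vector of $ c $; this score is then used in the usual negative-sampling objective.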
Applications
Morphologically Rich Languages
FastText excels with languages that have:
- Complex word formation (German, Finnish, Turkish)
- Rich inflectional morphology (Arabic, Hebrew)
- Agglutinative languages (Hungarian, Basque)
Text Classification
FastText is particularly effective for the following tasks (a minimal classification sketch follows this list):
- Document classification
- Sentiment analysis
- Topic classification
- Spam detection
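A minimal sketch with the official fasttext Python package; the training-file path and hyperparameters are illustrative assumptions, and the file must use the library's __label__ prefix format:

```python
import fasttext

# train.txt: one example per line, e.g. "__label__spam limited time offer, click now"
model = fasttext.train_supervised(
    input="train.txt",  # hypothetical path to labelled training data
    lr=0.5,
    epoch=25,
    wordNgrams=2,       # word bigrams often help classification
    dim=100,
)

# Predict the most likely label for a new document.
labels, probs = model.predict("free prize waiting for you")
print(labels, probs)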
Information Retrieval
- Semantic search
- Query expansion
- Document similarity
- Content recommendation
Language Identification
FastText's subword approach makes it ideal for (a usage sketch follows this list):
- Language detection
- Dialect identification
- Code-switching detection
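For example, the pre-trained language-identification model published on fasttext.cc (lid.176.bin, covering 176 languages) can be queried directly, assuming the model file has been downloaded first:

```python
import fasttext

# Assumes lid.176.bin was downloaded from https://fasttext.cc/docs/en/language-identification.html
model = fasttext.load_model("lid.176.bin")

# Return the top-2 predicted languages with their probabilities.
labels, probs = model.predict("Où se trouve la gare la plus proche ?", k=2)
print(labels, probs)  # e.g. ('__label__fr', ...) with confidence scores
```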
Training Best Practices
Hyperparameter Tuning
| Parameter | Typical Range | Recommendation |
|---|---|---|
| Vector Dimension | 100-300 | 300 for most applications |
| N-gram Size | 3-6 | 3-5 for most languages |
| Context Window | 5-10 | 5 for classification, 10 for embeddings |
| Minimum Count | 1-10 | 1-5 for rare word handling |
| Learning Rate | 0.01-0.1 | Start with 0.05, decay over time |
| Epochs | 5-50 | 10-25 for convergence |
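The recommendations above map directly onto parameters of the official trainer; a sketch with illustrative values and a hypothetical corpus path:

```python
import fasttext

model = fasttext.train_unsupervised(
    input="corpus.txt",  # hypothetical path: one pre-tokenized sentence per line
    model="skipgram",    # or "cbow"
    dim=300,             # vector dimension
    minn=3, maxn=5,      # character n-gram sizes
    ws=5,                # context window
    minCount=5,          # minimum word frequency
    lr=0.05,             # initial learning rate
    epoch=10,
)

model.save_model("fasttext_skipgram.bin")
```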
Evaluation
- Intrinsic evaluation: Word similarity tasks
- Extrinsic evaluation: Downstream task performance
- OOV evaluation: Performance on out-of-vocabulary words (a quick check is sketched after this list)
- Morphological tasks: Lemmatization, stemming
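A quick intrinsic and OOV check with Gensim on a toy model (the corpus and query words are illustrative; in practice this would be run on a properly trained model):

```python
from gensim.models import FastText

sentences = [["cats", "chase", "mice"], ["dogs", "chase", "cats"], ["mice", "eat", "cheese"]]
model = FastText(sentences, vector_size=50, min_count=1, min_n=3, max_n=5, epochs=20)

# Intrinsic check: nearest neighbours and pairwise similarity for in-vocabulary words.
print(model.wv.most_similar("cats", topn=3))
print(model.wv.similarity("cats", "dogs"))

# OOV check: "catz" never occurs in the corpus, but its n-grams still yield a vector.
print("catz" in model.wv.key_to_index)      # False
print(model.wv.similarity("catz", "cats"))  # usually fairly high, thanks to shared n-grams
```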
Implementation Tools
Popular Libraries
- FastText (Facebook): Official implementation
- Gensim: Python library with FastText support
- spaCy: Includes FastText integration
- TensorFlow/PyTorch: Custom implementations
Pre-trained Models
- Facebook FastText: Pre-trained on Common Crawl, Wikipedia (157 languages)
- Gensim Data: Pre-trained word vectors
- Domain-Specific: Specialized embeddings for various domains
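The Facebook models above can be loaded with either the official package or Gensim; the snippet assumes the English Common Crawl model (cc.en.300.bin) and a pre-downloaded file or working internet connection:

```python
# Option 1: official fasttext package
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")  # fetches cc.en.300.bin
ft = fasttext.load_model("cc.en.300.bin")
print(ft.get_word_vector("morphology")[:5])

# Option 2: Gensim, keeping only the (query-only) vectors
from gensim.models.fasttext import load_facebook_vectors

wv = load_facebook_vectors("cc.en.300.bin")  # large file; needs several GB of RAM
print(wv.most_similar("morphology", topn=3))
```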
Research and Advancements
Key Papers
- "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
- Introduced FastText algorithm
- Demonstrated superior performance on rare words
- Foundation for subword embeddings
- "Bag of Tricks for Efficient Text Classification" (Joulin et al., 2017)
- Introduced FastText for text classification
- Demonstrated state-of-the-art performance with simple architecture
Emerging Research Directions
- Contextual FastText: Incorporating sentence context
- Multimodal FastText: Combining text with other modalities
- Dynamic FastText: Time-evolving word representations
- Efficient Training: Faster algorithms for large-scale data
- Multilingual FastText: Better cross-lingual representations
- Green FastText: Energy-efficient training methods
Best Practices
Implementation Guidelines
- Use pre-trained embeddings for general applications
- Fine-tune on domain-specific data when needed (a fine-tuning sketch follows this list)
- Adjust n-gram size based on language morphology
- Combine with character-level models for best results
- Evaluate on both intrinsic and extrinsic tasks
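Fine-tuning on domain data (the second guideline above) can be sketched in Gensim by continuing training of a loaded Facebook model; the file name and toy domain corpus below are assumptions:

```python
from gensim.models.fasttext import load_facebook_model

model = load_facebook_model("cc.en.300.bin")  # assumed pre-downloaded binary

domain_sentences = [
    ["the", "patient", "presented", "with", "acute", "dyspnea"],
    ["administer", "bronchodilators", "and", "monitor", "oxygen", "saturation"],
]

# Extend the vocabulary with domain terms, then continue training.
model.build_vocab(domain_sentences, update=True)
model.train(domain_sentences, total_examples=len(domain_sentences), epochs=5)

print(model.wv.most_similar("dyspnea", topn=3))
```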
Common Pitfalls and Solutions
| Pitfall | Solution |
|---|---|
| Small Corpus | Use pre-trained embeddings |
| Suboptimal N-gram Size | Experiment with minn/maxn in the 3-6 range |
| High Memory Usage | Reduce the vector dimension or the n-gram hash bucket size |
| Slow Training | Increase the thread count; the official implementation is CPU-based and multithreaded |
| Domain Mismatch | Fine-tune on domain-specific data |
| Evaluation Bias | Use multiple evaluation metrics |