FastText

Word embedding technique by Facebook that incorporates subword information for better handling of rare and morphologically rich words.

What is FastText?

FastText is a word embedding technique developed by Facebook AI Research (FAIR) in 2016 that extends Word2Vec by incorporating subword information. It represents each word as a bag of character n-grams, allowing it to capture morphological patterns and to handle rare words more effectively than traditional word-level embedding methods.

Key Characteristics

  • Subword Information: Uses character n-grams (typically 3-6 characters)
  • Morphological Awareness: Captures word structure and morphology
  • Rare Word Handling: Better representation of infrequent words
  • Out-of-Vocabulary: Can generate vectors for unseen words
  • Efficient Training: Optimized implementation for fast training
  • Multilingual Support: Works well with morphologically rich languages
  • Scalable: Handles large vocabularies effectively
  • Transferable: Pre-trained embeddings usable across tasks

Core Concepts

Character N-grams

FastText represents each word by the set of character n-grams it contains, where `<` and `>` are added as word-boundary markers and the whole word itself is kept as a special sequence. For example, the word "where" with n=3 yields: <wh, whe, her, ere, re>, plus the special sequence <where>.
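
A minimal Python sketch of this extraction (the function name `char_ngrams` and its defaults are illustrative, not fastText's actual code):

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Extract boundary-marked character n-grams plus the whole word,
    in the way FastText does (illustrative sketch, not the official code)."""
    marked = f"<{word}>"
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(marked) - n + 1):
            ngrams.add(marked[i:i + n])
    ngrams.add(marked)  # the whole word as a special sequence
    return ngrams

print(sorted(char_ngrams("where", min_n=3, max_n=3)))
# ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```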

Subword Embeddings

Each character n-gram has its own vector representation, and the final word vector is the sum of these subword vectors.

Architecture

graph TD
    A[Word] --> B[Character N-grams]
    B --> C[Subword Vectors]
    C --> D[Word Vector]
    D --> E[Applications]

    style A fill:#f9f,stroke:#333
    style E fill:#f9f,stroke:#333

FastText vs Other Embedding Methods

| Feature | FastText | Word2Vec | GloVe | BERT/Transformers |
|---|---|---|---|---|
| Subword Info | Yes (character n-grams) | No | No | Yes (WordPiece/Byte-Pair) |
| Rare Words | Excellent handling | Poor handling | Poor handling | Excellent handling |
| OOV Words | Can generate vectors | Cannot handle | Cannot handle | Can handle |
| Training Speed | Fast | Fast | Very fast | Slow |
| Memory Usage | Moderate | Low | Low | High |
| Morphology | Excellent | Limited | Limited | Good |
| Contextual | No | No | No | Yes |
| Vector Arithmetic | Good | Excellent | Excellent | Limited |
| Pre-trained Models | Available | Available | Available | Widely available |

Training Process

  1. Extract character n-grams for each word
  2. Initialize subword vectors randomly
  3. Train using Skip-gram or CBOW architecture
  4. Combine subword vectors to form final word representation
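
As a sketch of these steps in practice, Gensim's `FastText` class handles the n-gram extraction, training, and vector composition; the toy corpus below is only a placeholder:

```python
from gensim.models import FastText

# Toy corpus: each sentence is a list of tokens (placeholder data).
sentences = [
    ["fasttext", "represents", "words", "as", "character", "ngrams"],
    ["subword", "vectors", "help", "with", "rare", "words"],
]

model = FastText(
    sentences,
    vector_size=100,   # embedding dimension
    window=5,          # context window
    min_count=1,       # keep every word in this tiny corpus
    sg=1,              # 1 = Skip-gram, 0 = CBOW
    min_n=3, max_n=6,  # character n-gram range
    epochs=10,
)

# Because vectors are built from n-grams, even unseen words get a vector.
vec = model.wv["fasttexts"]   # OOV word, still returns a 100-d vector
print(vec.shape)              # (100,)
```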

Mathematical Formulation

FastText extends the Skip-gram objective function:

$$ J = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \leq j \leq c, j \neq 0} \log p(w_{t+j} | w_t) $$

where the representation of a word combines the vector of the word itself with the vectors of its character n-grams:

$$ v_w = \sum_{g \in \mathcal{G}_w} z_g $$

Where:

  • $ \mathcal{G}_w $ is the set of n-grams of word $ w $, including the whole word itself as a special sequence
  • $ z_g $ is the vector for n-gram $ g $
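
A hedged NumPy sketch of this sum: like the reference implementation, n-grams are hashed into a fixed number of buckets that index a shared embedding matrix (Python's built-in `hash` stands in for fastText's FNV-1a hash, and the vectors here are random rather than learned):

```python
import numpy as np

# Real fastText defaults to ~2 million hash buckets; 10,000 keeps this sketch small.
num_buckets, dim = 10_000, 100
rng = np.random.default_rng(0)
z = rng.normal(scale=0.1, size=(num_buckets, dim))  # one vector z_g per bucket

def word_vector(word, n=3):
    """v_w = sum of the vectors of w's character n-grams (plus the whole word).
    Illustrative only: real fastText hashes n-grams with FNV-1a, uses a range
    of n (3-6 by default), and learns z during training."""
    marked = f"<{word}>"
    grams = [marked[i:i + n] for i in range(len(marked) - n + 1)] + [marked]
    return z[[hash(g) % num_buckets for g in grams]].sum(axis=0)

print(word_vector("where").shape)  # (100,)
```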

Applications

Morphologically Rich Languages

FastText excels with languages that have:

  • Complex word formation (German, Finnish, Turkish)
  • Rich inflectional morphology (Arabic, Hebrew)
  • Agglutinative languages (Hungarian, Basque)

Text Classification

FastText is particularly effective for:

  • Document classification
  • Sentiment analysis
  • Topic modeling
  • Spam detection
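
For classification, the official `fasttext` Python package provides supervised training on a text file in which each line starts with `__label__<class>`; a minimal sketch with placeholder file names:

```python
import fasttext

# train.txt (placeholder path): one example per line, e.g.
#   __label__positive great product, works as advertised
#   __label__negative arrived broken and support never replied
model = fasttext.train_supervised(input="train.txt", lr=0.1, epoch=25, wordNgrams=2)

labels, probs = model.predict("the battery died after two days")
print(labels, probs)   # e.g. ('__label__negative',) with its probability

model.save_model("classifier.bin")
```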

Information Retrieval

  • Semantic search
  • Query expansion
  • Document similarity
  • Content recommendation
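
One simple, hedged recipe for document similarity is to compare mean word vectors with cosine similarity, e.g. via Gensim's `n_similarity`; the tiny corpus below exists only so the snippet runs:

```python
from gensim.models import FastText

# Tiny placeholder corpus; in practice use a large corpus or a pre-trained model.
corpus = [["search", "engines", "rank", "documents"],
          ["fasttext", "vectors", "support", "semantic", "search"]]
model = FastText(corpus, vector_size=50, min_count=1, epochs=5)

query = ["semantic", "search"]
doc = ["search", "engines", "rank", "documents"]
# Cosine similarity between the mean vectors of the two token sets.
print(model.wv.n_similarity(query, doc))
```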

Language Identification

FastText's subword approach makes it ideal for:

  • Language detection
  • Dialect identification
  • Code-switching detection
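
Facebook distributes pre-trained language-identification models (`lid.176.bin` and the compressed `lid.176.ftz`, covering 176 languages); a minimal sketch with the official bindings, assuming the model file has already been downloaded:

```python
import fasttext

# Download lid.176.ftz (compressed) or lid.176.bin from the fastText website first.
model = fasttext.load_model("lid.176.ftz")

labels, probs = model.predict("Das ist ein kurzer deutscher Satz.", k=2)
print(labels, probs)   # e.g. ('__label__de', ...) with probabilities
```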

Training Best Practices

Hyperparameter Tuning

| Parameter | Typical Range | Recommendation |
|---|---|---|
| Vector Dimension | 100-300 | 300 for most applications |
| N-gram Size | 3-6 | 3-5 for most languages |
| Context Window | 5-10 | 5 for classification, 10 for embeddings |
| Minimum Count | 1-10 | 1-5 for rare word handling |
| Learning Rate | 0.01-0.1 | Start with 0.05, decay over time |
| Epochs | 5-50 | 10-25 for convergence |
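
With the official `fasttext` Python bindings, the parameters above map roughly onto the following arguments (the input file is a placeholder, one pre-tokenized sentence per line):

```python
import fasttext

model = fasttext.train_unsupervised(
    "data.txt",       # placeholder corpus path
    model="skipgram",
    dim=300,          # Vector Dimension
    minn=3, maxn=5,   # N-gram Size
    ws=5,             # Context Window
    minCount=5,       # Minimum Count
    lr=0.05,          # Learning Rate (decayed automatically during training)
    epoch=15,         # Epochs
)
```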

Evaluation

  • Intrinsic evaluation: Word similarity tasks
  • Extrinsic evaluation: Downstream task performance
  • OOV evaluation: Performance on out-of-vocabulary words
  • Morphological tasks: Lemmatization, stemming
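
A hedged sketch of the intrinsic and OOV evaluations with Gensim, assuming `model` is a trained `FastText` instance (the WordSim-353 file ships with Gensim's test data):

```python
from gensim.test.utils import datapath

# model: a trained gensim FastText instance (e.g. from the training sketch above).

# Intrinsic: correlation against human word-similarity judgements.
pearson, spearman, oov_ratio = model.wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
print(spearman)   # (correlation, p-value)

# OOV: FastText still composes a vector from n-grams for unseen words.
print(model.wv.most_similar("unseenishword", topn=3))
```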

Implementation Tools

  • FastText (Facebook): Official implementation
  • Gensim: Python library with FastText support
  • spaCy: Includes FastText integration
  • TensorFlow/PyTorch: Custom implementations

Pre-trained Models

  • Facebook FastText: Official vectors pre-trained on Common Crawl and Wikipedia for 157 languages (see the loading sketch after this list)
  • Gensim Data: Pre-trained word vectors
  • Domain-Specific: Specialized embeddings for various domains
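
Gensim can load Facebook's published `.bin` files directly; a minimal sketch, assuming `cc.en.300.bin` (the English Common Crawl model) has been downloaded from the fastText site:

```python
from gensim.models.fasttext import load_facebook_vectors

# cc.en.300.bin: Facebook's 300-d English vectors trained on Common Crawl + Wikipedia.
wv = load_facebook_vectors("cc.en.300.bin")

print(wv.most_similar("morphology", topn=5))
print(wv["unseenishword"].shape)   # OOV words still get a 300-d vector
```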

Research and Advancements

Key Papers

  1. "Enriching Word Vectors with Subword Information" (Bojanowski et al., 2017)
    • Introduced FastText algorithm
    • Demonstrated superior performance on rare words
    • Foundation for subword embeddings
  2. "Bag of Tricks for Efficient Text Classification" (Joulin et al., 2017)
    • Introduced FastText for text classification
    • Demonstrated state-of-the-art performance with simple architecture

Emerging Research Directions

  • Contextual FastText: Incorporating sentence context
  • Multimodal FastText: Combining text with other modalities
  • Dynamic FastText: Time-evolving word representations
  • Efficient Training: Faster algorithms for large-scale data
  • Multilingual FastText: Better cross-lingual representations
  • Green FastText: Energy-efficient training methods

Best Practices

Implementation Guidelines

  • Use pre-trained embeddings for general applications
  • Fine-tune on domain-specific data when needed
  • Adjust n-gram size based on language morphology
  • Combine with character-level models for best results
  • Evaluate on both intrinsic and extrinsic tasks

Common Pitfalls and Solutions

| Pitfall | Solution |
|---|---|
| Small corpus | Use pre-trained embeddings |
| N-gram size | Experiment with 3-6 character n-grams |
| Memory usage | Use optimized implementations or reduce the number of hash buckets |
| Training time | Use the multi-threaded reference implementation; reduce epochs or vector dimension |
| Domain mismatch | Fine-tune on domain-specific data |
| Evaluation bias | Use multiple evaluation metrics |

External Resources