Latent Dirichlet Allocation (LDA) for Content-Based Recommendation

Introduction

In the field of natural language processing (NLP), content-based recommendation systems have been widely adopted to provide personalized suggestions to users. However, these systems often struggle with issues such as keyword synonyms and polysemy, which can lead to inaccurate recommendations. To address these challenges, we propose the use of Latent Dirichlet Allocation (LDA), a popular unsupervised learning technique that can effectively capture the underlying semantics of text data.

The Magic of LDA

LDA is a probabilistic topic modeling technique that represents documents as a mixture of topics, where each topic is a distribution over words. The core idea behind LDA is to generate a document by:

  1. Choosing a topic from the document's topic distribution with a certain probability
  2. Choosing a word from that topic's word distribution with a certain probability
  3. Repeating these two steps until all the words of the document have been generated

The document-topic and topic-word distributions are both multinomial distributions, each governed by a Dirichlet prior (the hyperparameters α and β used below), so every word in a document is generated by first drawing a topic and then drawing the word from that topic. A short simulation of this generative story is sketched below.
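
To make the generative story concrete, here is a minimal, self-contained Scala sketch that simulates it for a hypothetical two-topic, five-word vocabulary. The topic mixture theta and the topic-word distributions phi are made-up values for illustration only; in practice they are exactly what LDA infers from the data.

import scala.util.Random

// Hypothetical two-topic, five-word model (values are made up for illustration)
val vocabulary = Array("goal", "match", "coach", "election", "vote")
// phi(k)(w): probability of word w under topic k
val phi = Array(
  Array(0.40, 0.35, 0.15, 0.05, 0.05),  // topic 0, roughly "sports"
  Array(0.05, 0.05, 0.10, 0.40, 0.40)   // topic 1, roughly "politics"
)
// theta(k): topic mixture of one document
val theta = Array(0.7, 0.3)

val rng = new Random(42)

// Draw an index from a discrete probability distribution
def draw(p: Array[Double]): Int = {
  val r = rng.nextDouble()
  var acc = 0.0
  var i = 0
  while (i < p.length - 1 && acc + p(i) < r) { acc += p(i); i += 1 }
  i
}

// Generate a ten-word document by repeatedly picking a topic, then a word from that topic
val document = (1 to 10).map { _ =>
  val topic = draw(theta)        // step 1: choose a topic from the document's topic distribution
  vocabulary(draw(phi(topic)))   // step 2: choose a word from that topic's word distribution
}
println(document.mkString(" "))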

Algorithm

The LDA algorithm is relatively complex and involves several steps:

  1. Data preprocessing: The input text is tokenized, stop words are removed, and the remaining tokens are stemmed (a minimal sketch of this step follows the list).
  2. Vocabulary extraction: The vocabulary is extracted from the preprocessed data, and a word-id mapping table is created.
  3. Document-word frequency matrix creation: The preprocessed data is converted into a document-word frequency matrix, where each row represents a document and each column represents a word.
  4. LDA model training: The LDA model is trained on the document-word frequency matrix using the LDA class in Spark.
  5. Model evaluation: The trained LDA model is evaluated using metrics such as log likelihood and topic coherence.
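
The main listing below starts from an ids RDD of document ids and a words RDD of tokenized documents. The original listing does not show how these are built, so the following is only a minimal sketch of step 1, assuming the raw corpus is available as an RDD of (documentId, rawText) pairs and using a small illustrative stop-word list; it covers tokenization and stop-word removal but omits stemming.

import org.apache.spark.rdd.RDD

// Illustrative stop-word list; a real system would load a full list for the target language
val stopWords = Set("the", "a", "an", "and", "or", "of", "to", "in", "is", "for")

// Turn (documentId, rawText) pairs into the ids and tokenized words used in the main listing
def preprocess(rawDocs: RDD[(String, String)]): (RDD[String], RDD[Seq[String]]) = {
  val ids: RDD[String] = rawDocs.map(_._1)
  val words: RDD[Seq[String]] = rawDocs.map { case (_, text) =>
    text.toLowerCase
      .split("[^\\p{L}]+")                                    // split on anything that is not a letter
      .filter(t => t.length > 1 && !stopWords.contains(t))   // drop short tokens and stop words
      .toSeq
  }
  (ids, words)
}

// val (ids, words) = preprocess(rawDocs)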

Code

The code for training the LDA model using Spark is shown below:

// Imports needed by this listing
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD
import scala.collection.mutable

// Spark context (the application name is illustrative)
val sc = new SparkContext(new SparkConf().setAppName("LDATraining"))

// Configure HDFS
val hadoopConf = new Configuration()
hadoopConf.set("fs.defaultFS", "hdfs://ip:port")

// Training set path
val logFile: String = args(0)

// Output model path
val modelPath: String = args(1)

// Number of topics
val numTopics: Int = args(2).toInt

// Maximum number of iterations
val numMaxIterations: Int = args(3).toInt

// Number of most frequent words to drop as stop words
val numStopwords: Int = if (args.length > 4) args(4).toInt else 0

// Hyperparameter α (document-topic Dirichlet prior)
val docConcentration: Double = if (args.length > 5) args(5).toDouble else 1.1

// Hyperparameter β (topic-word Dirichlet prior)
val topicConcentration: Double = if (args.length > 6) args(6).toDouble else 1.1

// Extract vocabulary: count word frequencies, sort by descending count,
// and drop the numStopwords most frequent words
val termCounts: Array[(String, Long)] =
  words.flatMap(_.map(_ -> 1L)).reduceByKey(_ + _).collect().sortBy(-_._2)
val vocabArray: Array[String] = termCounts.takeRight(termCounts.size - numStopwords).map(_._1)
val vocab: Map[String, Int] = vocabArray.zipWithIndex.toMap

// Create document-word frequency matrix: one sparse term-count vector per document
def vectorize(words: RDD[Seq[String]], vocab: Map[String, Int], ids: RDD[String]): RDD[(Long, Vector)] = {
  words.zip(ids).map { case (tokens, id) =>
    val counts = new mutable.HashMap[Int, Double]()
    tokens.foreach { term =>
      if (vocab.contains(term)) {
        val idx = vocab(term)
        counts(idx) = counts.getOrElse(idx, 0.0) + 1.0
      }
    }
    (id.toLong, Vectors.sparse(vocab.size, counts.toSeq))
  }
}

val corpus = vectorize(words, vocab, ids).cache()

// Train LDA model with the EM optimizer
val ldaModel = new LDA()
  .setK(numTopics)
  .setDocConcentration(docConcentration)
  .setTopicConcentration(topicConcentration)
  .setMaxIterations(numMaxIterations)
  .setSeed(0L)
  .setCheckpointInterval(10)
  .setOptimizer("em")
  .run(corpus)

// Evaluate model
val distLdaModel = ldaModel.asInstanceOf[DistributedLDAModel]
val ll = distLdaModel.logLikelihood
println("Likelihood: " + ll)

// Get topic keywords and their weights, plus the top documents per topic
val topDocs = distLdaModel.topDocumentsPerTopic(20)
val topics = ldaModel.describeTopics(maxTermsPerTopic = 10)
topics.zipWithIndex.foreach { case ((terms, termWeights), i) =>
  println(s"TOPIC: $i")
  terms.zip(termWeights).foreach { case (term, weight) =>
    println(s"${vocabArray(term)}\t$weight")
  }
  println("DOCS:")
  val (docIds, docWeights) = topDocs(i)
  docIds.zip(docWeights).foreach { case (docId, weight) =>
    println(s"$docId\t$weight")
  }
  println()
}

// Save model
ldaModel.save(sc, modelPath)
sc.stop()
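
After the model has been saved, it can be reloaded in a separate job and used to inspect per-document topic mixtures. The snippet below is only an illustration: it assumes a fresh SparkContext named sc and the same modelPath as above.

import org.apache.spark.mllib.clustering.DistributedLDAModel
import org.apache.spark.mllib.linalg.Vector

// Reload the trained model from HDFS
val loadedModel = DistributedLDAModel.load(sc, modelPath)

// Topic mixture of every training document: RDD[(documentId, topicVector)]
val docTopics: org.apache.spark.rdd.RDD[(Long, Vector)] = loadedModel.topicDistributions
docTopics.take(5).foreach { case (docId, vec) =>
  println(s"$docId\t${vec.toArray.mkString(", ")}")
}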

Training Results

The LDA model was trained and evaluated on a corpus of 50,000 news articles, and the results showed that it captured the underlying semantics of the articles well. The topic keywords and their weights, together with each article's topic mixture, were used to generate personalized recommendations for users; one possible way to do this is sketched below.
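
The original write-up does not show how the topic mixtures are turned into recommendations, so the following is only one possible sketch: it reuses the docTopics RDD from the previous snippet, assumes a hypothetical set of article ids the user has already read, and ranks the remaining articles by cosine similarity between their topic vectors and the user's averaged profile vector.

import org.apache.spark.mllib.linalg.Vector

// Hypothetical ids of articles the user has already read
val readIds = Set(101L, 205L, 307L)

// Cosine similarity between two dense topic vectors
def cosine(a: Array[Double], b: Array[Double]): Double = {
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(y => y * y).sum)
  if (norm == 0) 0.0 else dot / norm
}

// User profile: element-wise average of the topic vectors of the articles read
val readVectors = docTopics.filter { case (id, _) => readIds.contains(id) }
                           .map(_._2.toArray)
                           .collect()
val profile = readVectors.transpose.map(col => col.sum / readVectors.length)

// Rank the remaining articles by similarity to the profile and keep the top 10
val recommendations = docTopics
  .filter { case (id, _) => !readIds.contains(id) }
  .map { case (id, vec) => (id, cosine(profile, vec.toArray)) }
  .top(10)(Ordering.by[(Long, Double), Double](_._2))

recommendations.foreach { case (id, score) => println(s"$id\t$score") }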

Example Output

For example, the topic keywords and their weights for Topic 3 were:

TOPIC 3:
Human	0.040956
AI	0.0384
Robot	0.036
Go	0.016823
AI	0.01638
Ke Jie	0.015567
AlphaGo	0.0
Players	0.011772
Learn	0.011311

Some of the top documents for Topic 3, with their topic weights, were:

Coulee: AI changed the rules of Go, Ke Jie winning probability of less than 10%	0.93749
Media reports "kind of stick law", Ke Jie: a trick against the AlphaGo dog	0.91730
Nie Weiping: AlphaGo a lesson to the world who is saying under the category	0.9
God story! AlphaGo start to walk three-three, Ke Jie and learn from each other?	0.9046
The second game man-machine war preview: Ke Jie or with secret weapon	0.9045219669
Finally, human-computer Go wars: Ke Jie was fierce, AlphaGo limit	0.8882
Nie Weiping man-machine war quotations: A French dog was pulling its 20 stages to win power	0.885853
The second game man-machine war preview: Ke Jie or with a secret weapon competitiveness	0.8811950345

These results show that the learned topics group semantically related articles together, which is exactly what is needed to generate content-based, personalized recommendations for users.