Amazon Book Review Similarity Detection

Similarity detection in ~3M Amazon book reviews using MinHash and LSH.

Big Data, LSH, MinHash, Python

Details

About the project

Built a scalable system from scratch, using MinHash and LSH, to detect similar reviews in the Amazon Book Reviews dataset.

Used shingling for text representation, MinHash to approximate Jaccard similarity, and banding for sublinear candidate retrieval. Reduced unnecessary comparisons with prefix filtering.
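The pipeline described above (shingling, MinHash signatures, banded LSH) can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the character-level shingle size `k=5`, the 64-hash signature, and the 16x4 banding are all assumptions chosen for the example, and prefix filtering is left out.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

PRIME = (1 << 61) - 1  # Mersenne prime used as the modulus of the hash family


def shingles(text, k=5):
    """Set of character k-shingles of lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def stable_hash(s):
    """Deterministic 32-bit hash of a shingle (built-in hash() is salted per run)."""
    digest = hashlib.blake2b(s.encode("utf-8"), digest_size=4).digest()
    return int.from_bytes(digest, "big")


def make_hash_params(num_hashes, seed=0):
    """Random (a, b) pairs defining hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hashes)]


def minhash_signature(shingle_set, params):
    """MinHash signature: the minimum hashed value under each hash function."""
    hashed = [stable_hash(s) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in hashed) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidates(signatures, bands, rows):
    """Candidate pairs: documents whose signatures agree on at least one full band."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy reviews; the real dataset has ~3M of them.
reviews = {
    0: "A gripping story with unforgettable characters.",
    1: "A gripping story with unforgettable characters.",
    2: "The binding fell apart after one week of reading.",
}
params = make_hash_params(num_hashes=64)
signatures = {i: minhash_signature(shingles(t), params) for i, t in reviews.items()}
candidates = lsh_candidates(signatures, bands=16, rows=4)
```

Only the pairs in `candidates` need an exact or estimated similarity check, which is what keeps the number of comparisons far below the all-pairs count. A NumPy version would typically vectorize the signature computation; plain Python is used here to keep the sketch dependency-free.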

Highlights

Key features

  • Similarity detection on ~3 million reviews
  • From-scratch MinHash and LSH implementation
  • Sublinear candidate retrieval with banding
  • Analysis with runtime and precision metrics
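As a rough aid for reading the banding and precision items above: under the standard LSH banding analysis, two reviews with Jaccard similarity s become a candidate pair with probability 1 - (1 - s^r)^b, where b is the number of bands and r the rows per band. The sketch below evaluates that curve and a simple precision measure. The helper names and the 16x4 configuration are illustrative assumptions, not the project's reported settings.

```python
def candidate_probability(s, bands, rows):
    """Probability that a pair with Jaccard similarity s shares at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


def approx_threshold(bands, rows):
    """Similarity at which the S-curve rises steeply, approximately (1/b)^(1/r)."""
    return (1.0 / bands) ** (1.0 / rows)


def precision(candidate_pairs, true_pairs):
    """Share of candidate pairs that are truly similar (0.0 if there are no candidates)."""
    if not candidate_pairs:
        return 0.0
    return len(candidate_pairs & true_pairs) / len(candidate_pairs)


# Hypothetical configuration: 16 bands of 4 rows (64 hash functions in total).
curve = {s / 10: candidate_probability(s / 10, bands=16, rows=4) for s in range(11)}
```

Increasing the rows per band pushes the threshold up and cuts false candidates; increasing the number of bands does the opposite, trading precision for recall.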

Tech Stack

Tools used

Python, NumPy, Pandas, Google Colab