Amazon Book Review Similarity Detection
Similarity detection in ~3M Amazon book reviews using MinHash and LSH.
Big Data, LSH, MinHash, Python
Details
About the project
Built a scalable system from scratch, using MinHash and locality-sensitive hashing (LSH), to detect similar reviews in the Amazon Book Reviews dataset.
Used shingling for text representation, MinHash to approximate Jaccard similarity, and banding for sublinear candidate retrieval. Reduced unnecessary comparisons with prefix filtering.
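The shingling and MinHash steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the character shingle length `k=5`, the CRC32 shingle-to-integer mapping, and the `(a*x + b) mod p` hash family are all assumptions chosen for the example.

```python
import zlib
import numpy as np

PRIME = (1 << 31) - 1  # Mersenne prime 2^31 - 1; keeps a*x + b inside int64


def shingles(text, k=5):
    """Return the set of character k-shingles (k=5 is an assumed default)."""
    text = " ".join(text.lower().split())  # lowercase, collapse whitespace
    if len(text) < k:
        return {text} if text else set()
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(num_hashes, seed=0):
    """Draw coefficients for num_hashes functions h(x) = (a*x + b) mod PRIME."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, PRIME, size=num_hashes, dtype=np.int64)
    b = rng.integers(0, PRIME, size=num_hashes, dtype=np.int64)
    return a, b


def minhash_signature(shingle_set, a, b):
    """Signature = per-hash-function minimum over all shingles of the set."""
    if not shingle_set:
        return np.full(len(a), PRIME, dtype=np.int64)
    # Map each shingle to a stable integer id (CRC32 is an assumption here).
    ids = np.array(
        [zlib.crc32(s.encode("utf-8")) for s in shingle_set], dtype=np.int64
    ) % PRIME
    hashed = (a[:, None] * ids[None, :] + b[:, None]) % PRIME
    return hashed.min(axis=1)


def estimate_similarity(sig1, sig2):
    """Fraction of matching signature positions approximates Jaccard."""
    return float(np.mean(sig1 == sig2))
```

The fraction of positions where two signatures agree estimates the Jaccard similarity of the underlying shingle sets, and the estimate tightens as the number of hash functions grows.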
Highlights
Key features
- Similarity detection on ~3 million reviews
- From-scratch MinHash and LSH implementation
- Sublinear candidate retrieval with banding
- Analysis with runtime and precision metrics
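As a rough illustration of the banding feature listed above, the sketch below splits each signature into bands and treats any two reviews that share a bucket in at least one band as a candidate pair. The function names, the dictionary input format, and the band/row parameters are hypothetical, and the project's own implementation may differ.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return candidate pairs from MinHash signatures using banding.

    signatures: dict mapping doc_id -> sequence of ints of length bands * rows.
    Two documents become candidates if they agree on every row of some band.
    """
    candidates = set()
    for band in range(bands):
        start = band * rows
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # The band's slice of the signature acts as the bucket key.
            key = tuple(sig[start:start + rows])
            buckets[key].append(doc_id)
        for ids in buckets.values():
            # Only documents sharing a bucket are ever compared.
            for pair in combinations(sorted(ids), 2):
                candidates.add(pair)
    return candidates


def candidate_probability(s, bands, rows):
    """Probability that a pair with Jaccard similarity s becomes a candidate."""
    return 1.0 - (1.0 - s ** rows) ** bands


# Toy usage with 4-value signatures split into 2 bands of 2 rows each.
toy = {"a": [1, 2, 3, 4], "b": [1, 2, 9, 9], "c": [7, 7, 7, 7]}
pairs = lsh_candidate_pairs(toy, bands=2, rows=2)
```

The choice of bands and rows sets the similarity threshold: more rows per band makes a match stricter, while more bands gives a pair more chances to collide. Candidate pairs would still need to be verified against their signatures or exact Jaccard similarity.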
Tech Stack
Tools used
Python, NumPy, Pandas, Google Colab