Yelp Dataset Analysis: Key Findings and Insights

This report summarizes the major findings from an analysis of the Yelp Academic Dataset. The goal was to explore the characteristics of businesses, understand user behavior, and identify key trends that could inform business strategy. The analysis was conducted using SQL queries on a PostgreSQL database, and the results were visualized with Matplotlib.


1. Hypotheses

Before conducting the analysis, several hypotheses were formulated to guide the investigation:


2. Key Findings and Visualizations

This section details the results of each of the nine distinct analyses performed.

Analysis 1: High-Level Business Statistics

Objective: To get a baseline understanding of the dataset's scale.

Finding: The analysis shows that the average star rating across all businesses is 3.60. The review counts vary dramatically, with some businesses having as few as 3 reviews and the most popular having over 7,500, indicating a wide distribution of business popularity.
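
As a rough illustration of how such baseline figures can be computed, the sketch below runs an aggregate SQL query from Python. It is not the project's actual query: the table name `business`, its columns, and the sample rows are assumptions made for this example, and SQLite's in-memory engine stands in for the PostgreSQL database so the snippet is self-contained.

```python
import sqlite3

# Hypothetical sample rows (name, stars, review_count); not real Yelp data.
SAMPLE_BUSINESSES = [
    ("Cafe A", 4.5, 120),
    ("Diner B", 3.0, 3),
    ("Bistro C", 3.5, 7500),
    ("Grill D", 4.0, 45),
]


def baseline_stats(rows):
    """Return (average stars, minimum review count, maximum review count)."""
    # In-memory SQLite is an assumption used here in place of PostgreSQL.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE business (name TEXT, stars REAL, review_count INTEGER)"
        )
        conn.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)
        return conn.execute(
            "SELECT ROUND(AVG(stars), 2), MIN(review_count), MAX(review_count) "
            "FROM business"
        ).fetchone()
    finally:
        conn.close()


avg_stars, min_reviews, max_reviews = baseline_stats(SAMPLE_BUSINESSES)
print(avg_stars, min_reviews, max_reviews)
```

The same `SELECT` statement should work largely unchanged against PostgreSQL, although rounding there may require casting the average to `numeric`.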

Analysis 2: Top 10 Most Reviewed Restaurants

Objective: To identify the most popular, high-traffic businesses in the dataset.

Finding: A small number of American restaurants, primarily located in major cities, dominate the review landscape. This concentration highlights the strong brand recognition and high customer volume of market leaders.

Top 10 Most Reviewed Restaurants

Analysis 3: Correlation of Features for Highly-Rated Restaurants

Objective: To determine which business characteristics are most associated with a high rating (>4.0 stars).

Finding: A correlation matrix confirms Hypothesis 1: `review_count` has only a very weak correlation with a business being highly rated. The strongest correlates are `overall_star_rating` and `avg_review_stars` (the average of individual review stars), which is expected because the highly-rated label is itself defined by a star threshold.
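
To make the correlation check concrete, here is a minimal sketch of a Pearson correlation between review count and a binary "highly rated" flag (stars above 4.0, as in the objective). The helper name `pearson` and the sample rows are assumptions for illustration only; the project itself may have used a library routine to build the full matrix.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical (stars, review_count) pairs; not real Yelp data.
restaurants = [(4.5, 80), (3.0, 950), (4.5, 1200), (2.5, 40), (5.0, 15), (3.5, 300)]

review_counts = [count for _, count in restaurants]
# Binary flag: 1 when the business is rated above 4.0 stars, else 0.
highly_rated = [1 if stars > 4.0 else 0 for stars, _ in restaurants]

print(pearson(review_counts, highly_rated))
```

A coefficient near zero on real data would correspond to the "very weak correlation" reported above.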

Correlation Matrix for Restaurant Features

Analysis 4: Rating Behavior of Different User Segments

Objective: To see if a user's activity level impacts their rating behavior.

Finding: The data supports Hypothesis 2. "Power Users" (defined by high Stars and fan counts) give higher average ratings than "Engaged Users".

Average Stars Given by User Segment

Analysis 5: Distribution of User Segments

Objective: To understand the composition of the Yelp user base.

Finding: The user base is heavily skewed, in a pattern resembling a power-law distribution: the vast majority of users are "Casual," with "Engaged" and "Power Users" making up a much smaller fraction. This provides context for the previous analysis: the distinctive rating behavior of Power Users comes from a small share of the user base.
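
The segmentation behind Analyses 4 and 5 can be sketched as follows. The thresholds, the choice of review count and fan count as criteria, and the sample users are all assumptions for this example, since the report does not state the exact segment definitions.

```python
from collections import defaultdict


def segment(review_count, fans, power_reviews=100, power_fans=10, engaged_reviews=10):
    """Assign a user to a segment; every threshold here is a hypothetical value."""
    if review_count >= power_reviews and fans >= power_fans:
        return "Power"
    if review_count >= engaged_reviews:
        return "Engaged"
    return "Casual"


def summarize(users):
    """Return each segment's share of users and its mean of average_stars."""
    groups = defaultdict(list)
    for review_count, fans, average_stars in users:
        groups[segment(review_count, fans)].append(average_stars)
    total = len(users)
    return {
        name: {"share": len(stars) / total, "avg_stars": sum(stars) / len(stars)}
        for name, stars in groups.items()
    }


# Hypothetical (review_count, fans, average_stars) rows; not real Yelp data.
sample_users = [
    (2, 0, 5.0),
    (1, 0, 1.0),
    (4, 0, 4.0),
    (3, 1, 4.5),
    (25, 2, 3.8),
    (400, 60, 3.7),
]

for name, stats in summarize(sample_users).items():
    print(name, stats)
```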

Distribution of User Segments

Analysis 6: City Market Opportunity Analysis

Objective: To identify the best cities to open a new restaurant based on market quality and engagement.

Finding: This analysis supports Hypothesis 4. The best markets are not necessarily the biggest. Cities like Philadelphia, which fall in the "sweet spot" of high average ratings and high average review counts, represent prime opportunities.
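
One way to operationalize the "sweet spot" is to keep cities that sit at or above the median on both average rating and average review count. This is a sketch under that assumption; the median cut-off, the function names, and the placeholder city data are not taken from the project.

```python
from collections import defaultdict
from statistics import mean, median


def city_metrics(businesses):
    """Map each city to (average stars, average review count)."""
    by_city = defaultdict(list)
    for city, stars, review_count in businesses:
        by_city[city].append((stars, review_count))
    return {
        city: (
            mean([stars for stars, _ in rows]),
            mean([count for _, count in rows]),
        )
        for city, rows in by_city.items()
    }


def sweet_spot(metrics):
    """Cities at or above the median on both metrics (an assumed criterion)."""
    star_cut = median([avg_stars for avg_stars, _ in metrics.values()])
    review_cut = median([avg_reviews for _, avg_reviews in metrics.values()])
    return sorted(
        city
        for city, (avg_stars, avg_reviews) in metrics.items()
        if avg_stars >= star_cut and avg_reviews >= review_cut
    )


# Hypothetical (city, stars, review_count) rows with placeholder city names.
sample = [
    ("City A", 4.5, 200),
    ("City A", 4.5, 300),
    ("City B", 3.0, 500),
    ("City C", 4.0, 10),
]

print(sweet_spot(city_metrics(sample)))
```

In practice a minimum number of businesses per city would probably also be required, so that a handful of well-rated venues does not make a small town look like a prime market.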

City Market Opportunity Plot

Analysis 7: Top Cities for Nightlife

Objective: To find the most prominent cities for the "Nightlife" category.

Finding: Philadelphia is the undisputed leader in the quantity of nightlife venues and customer engagement. However, other cities like Tampa and Indianapolis have higher average quality ratings.

Top 10 Cities for Nightlife Venues

Analysis 8: Weekly Customer Traffic by Check-ins

Objective: To determine the busiest days of the week for restaurants.

Finding: The data overwhelmingly supports Hypothesis 3. Customer check-ins peak dramatically on Saturday and Sunday, confirming the critical importance of weekend business for restaurants.
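
The weekday totals can be derived by parsing each check-in timestamp and counting by day of the week. The sketch below assumes that each check-in record stores a comma-separated string of `YYYY-MM-DD HH:MM:SS` timestamps; the function name and the sample strings are illustrative, not the project's code or data.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday",
]


def checkins_by_weekday(checkin_strings):
    """Count check-ins per weekday from comma-separated timestamp strings."""
    counts = Counter({day: 0 for day in DAY_NAMES})
    for raw in checkin_strings:
        for stamp in raw.split(","):
            stamp = stamp.strip()
            if not stamp:
                continue  # skip empty fragments such as a trailing comma
            moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            # datetime.weekday() returns 0 for Monday through 6 for Sunday.
            counts[DAY_NAMES[moment.weekday()]] += 1
    return dict(counts)


# Hypothetical check-in strings; not real Yelp data.
sample_checkins = [
    "2020-01-04 12:00:00, 2020-01-05 13:00:00",
    "2020-01-06 09:30:00",
]

print(checkins_by_weekday(sample_checkins))
```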

Total Restaurant Check-ins by Day of the Week

Analysis 9: Elite vs. Non-Elite User Rating Patterns

Objective: To conduct a deeper dive into the rating behavior of Yelp's "Elite Squad."

Finding: This analysis provides further support for Hypothesis 2.

  1. Rating Distribution: Elite users are less likely to give extreme 1-star or 5-star reviews, preferring more nuanced 2-, 3-, and 4-star ratings.
  2. Influence Over Time: Contrary to what might be expected, the proportion of reviews from Elite users is not growing. It peaked around 2007 and has been on a gradual but steady decline ever since.
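
The rating-distribution comparison in the first point can be sketched by converting each group's star ratings into proportions and comparing the share of extreme (1-star and 5-star) reviews. The function names and the sample ratings are assumptions for illustration and do not reproduce the project's figures.

```python
from collections import Counter


def star_distribution(stars):
    """Return the proportion of reviews at each star level from 1 to 5."""
    total = len(stars)
    if total == 0:
        return {star: 0.0 for star in range(1, 6)}
    counts = Counter(stars)
    return {star: counts.get(star, 0) / total for star in range(1, 6)}


def extreme_share(distribution):
    """Combined proportion of 1-star and 5-star reviews."""
    return distribution[1] + distribution[5]


# Hypothetical star ratings per group; not real Yelp data.
elite_stars = [4, 3, 4, 5, 2, 4, 3, 4]
non_elite_stars = [5, 1, 5, 5, 1, 4, 5, 3]

print("elite:", extreme_share(star_distribution(elite_stars)))
print("non-elite:", extreme_share(star_distribution(non_elite_stars)))
```

Comparing proportions rather than raw counts matters here because Elite users write far fewer reviews in total than non-Elite users.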

Rating Distribution: Elite vs. Non-Elite

Elite User Contribution Over Time


3. Limitations of the Analysis


4. Conclusion and Future Work

This project successfully validated our initial hypotheses and yielded key insights into the Yelp ecosystem. By systematically analyzing business attributes, user segments, market dynamics, and operational rhythms, we have painted a data-driven picture of the restaurant industry.

Future Work could include:

Explore the Code

For a detailed look at the Python scripts, ETL process, and SQL queries used in this analysis, please visit the project repository on GitHub.
