Yelp Dataset Analysis: Key Findings

An exploration of business characteristics, user behavior, and market trends from the Yelp Academic Dataset using SQL and Matplotlib.

1. Hypotheses

Before conducting the analysis, several hypotheses were formulated to guide the investigation:

  1. Popularity vs. Quality: A higher number of reviews will not be a strong predictor of a higher average star rating.
  2. Experienced User Criticism: More active and "Elite" users will be more critical and exhibit different rating patterns than casual users.
  3. Weekend Peak: Customer traffic, as measured by check-ins, will be significantly higher on weekends.
  4. Market Saturation: Cities with the most restaurants will not necessarily be the best markets for a new business.

2. Key Findings and Visualizations

Analysis 1: High-Level Business Statistics

Finding: The average star rating across all businesses is 3.60. Review counts vary widely, from as few as 3 reviews to more than 7,500 for the most popular businesses, so business popularity is spread over a very wide range.
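
As a rough illustration of how summary figures like these could be produced with SQL, the sketch below runs an aggregate query through Python's built-in `sqlite3` module. The table name `business` and the toy rows are hypothetical. The `stars` and `review_count` columns are assumed to mirror the fields of the Yelp business file, and the project's actual query may differ.

```python
import sqlite3

# Hypothetical toy rows (business_id, stars, review_count); not real Yelp data.
rows = [
    ("b1", 4.5, 120),
    ("b2", 3.0, 3),
    ("b3", 3.5, 7500),
    ("b4", 4.0, 45),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (business_id TEXT, stars REAL, review_count INTEGER)"
)
conn.executemany("INSERT INTO business VALUES (?, ?, ?)", rows)

# One pass over the table: mean rating plus the extremes of review volume.
avg_stars, min_reviews, max_reviews = conn.execute(
    "SELECT ROUND(AVG(stars), 2), MIN(review_count), MAX(review_count) "
    "FROM business"
).fetchone()
conn.close()

print(avg_stars, min_reviews, max_reviews)
```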

Analysis 2: Top 10 Most Reviewed Restaurants

Finding: A small number of American restaurants, primarily located in major cities, dominate the review landscape. This concentration highlights the strong brand recognition and high customer volume of market leaders.

Chart of Top 10 Most Reviewed Restaurants

Placeholder for the Top 10 Most Reviewed Restaurants chart.

Analysis 3: Correlation for Highly-Rated Restaurants

Finding: Consistent with Hypothesis 1, `review_count` correlates only very weakly with a restaurant being highly rated (>4.0 stars). The strongest correlates are the overall star rating and the average stars from individual reviews, which is expected because the highly-rated label is itself defined by star rating.

Correlation Matrix for Restaurant Features

Placeholder for the Correlation Matrix.
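
To make the correlation check concrete, here is a minimal sketch that computes two entries of such a matrix by hand using the Pearson coefficient. The toy values and the `highly_rated` flag (1 when stars exceed 4.0) are assumptions for illustration. The project may well have used a library routine instead.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data; not taken from the Yelp dataset.
stars = [4.5, 3.0, 4.5, 3.5, 5.0, 2.5]
review_count = [10, 300, 800, 20, 50, 500]

# Binary target: 1 if the restaurant is rated above 4.0 stars, else 0.
highly_rated = [1 if s > 4.0 else 0 for s in stars]

r_reviews = pearson(review_count, highly_rated)
r_stars = pearson(stars, highly_rated)

print(f"review_count vs highly_rated: {r_reviews:.3f}")
print(f"stars vs highly_rated:        {r_stars:.3f}")
```

In this toy sample the review-count correlation is close to zero while the star-rating correlation is close to one, which mirrors the pattern described in the finding without reproducing its actual values.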

Analysis 4 & 5: User Segments and Rating Behavior

Finding: Hypothesis 2 is only partly supported. Rating patterns do differ by segment, but not in the predicted direction: "Power Users" (those with high star/fan counts) give higher average ratings than "Engaged Users" rather than lower ones. The user base is dominated by "Casual" users, so although Power Users are more generous with ratings, their overall impact is limited by their small numbers.

Chart of Average Stars by User Segment

Placeholder for the Average Stars by User Segment chart.

Chart of User Segment Distribution

Placeholder for the Distribution of User Segments chart.
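
The segmentation rules are not spelled out here, so the following is only a sketch of how users might be bucketed and summarized. The thresholds in `segment` are hypothetical. The `review_count`, `fans`, and `average_stars` keys are assumed to match the Yelp user file, and the sample users are invented.

```python
from collections import defaultdict


def segment(user):
    """Assign a user to a segment. Thresholds are hypothetical placeholders."""
    if user["fans"] >= 50 or user["review_count"] >= 500:
        return "Power"
    if user["review_count"] >= 20:
        return "Engaged"
    return "Casual"


# Hypothetical sample users; not real Yelp data.
users = [
    {"review_count": 800, "fans": 120, "average_stars": 4.1},
    {"review_count": 40, "fans": 2, "average_stars": 3.7},
    {"review_count": 3, "fans": 0, "average_stars": 4.0},
    {"review_count": 1, "fans": 0, "average_stars": 2.0},
    {"review_count": 5, "fans": 1, "average_stars": 5.0},
]

# Accumulate [sum of average_stars, user count] per segment.
totals = defaultdict(lambda: [0.0, 0])
for u in users:
    bucket = totals[segment(u)]
    bucket[0] += u["average_stars"]
    bucket[1] += 1

summary = {
    name: {"users": n, "avg_stars": round(total / n, 2)}
    for name, (total, n) in totals.items()
}
print(summary)
```

The same dictionary feeds both charts: the `users` counts give the segment distribution and the `avg_stars` values give the average rating per segment.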

Analysis 6 & 7: City Market & Nightlife Opportunity

Finding: Supporting Hypothesis 4, the best markets are not necessarily the biggest. Cities like Philadelphia fall in the "sweet spot" of high average ratings and high review counts, representing prime opportunities. Philadelphia is also the clear leader in the quantity of "Nightlife" venues, though other cities like Tampa and Indianapolis boast higher average quality in this category.

Scatter Plot of City Market Opportunity

Placeholder for the City Market Opportunity plot.
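
A city-level view like this one can be built with a single grouped query. The sketch below is an assumption about the general shape of that query, not the project's own SQL. The `restaurant` table, the placeholder city names, and the "sweet spot" cut-offs are all hypothetical.

```python
import sqlite3

# Hypothetical toy rows (city, stars, review_count); city names are placeholders.
rows = [
    ("City A", 4.5, 200),
    ("City A", 4.0, 300),
    ("City B", 3.0, 900),
    ("City B", 3.5, 100),
    ("City C", 4.5, 10),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE restaurant (city TEXT, stars REAL, review_count INTEGER)")
conn.executemany("INSERT INTO restaurant VALUES (?, ?, ?)", rows)

# Per-city restaurant count, mean rating, and total review volume.
city_stats = conn.execute(
    """
    SELECT city,
           COUNT(*)              AS n_restaurants,
           ROUND(AVG(stars), 2)  AS avg_stars,
           SUM(review_count)     AS total_reviews
    FROM restaurant
    GROUP BY city
    ORDER BY city
    """
).fetchall()
conn.close()

# Hypothetical cut-offs for a "sweet spot": well rated and well reviewed.
MIN_AVG_STARS = 4.0
MIN_TOTAL_REVIEWS = 100
sweet_spot = [
    city
    for city, _, avg_stars, total_reviews in city_stats
    if avg_stars >= MIN_AVG_STARS and total_reviews >= MIN_TOTAL_REVIEWS
]
print(city_stats)
print(sweet_spot)
```

In the toy data, the city with the most reviews has a low average rating and the city with the highest rating has almost no reviews, so only the middle case passes both cut-offs.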

Analysis 8: Weekly Customer Traffic

Finding: The data overwhelmingly supports Hypothesis 3. Customer check-ins peak dramatically on Saturday and Sunday, confirming the critical importance of weekend business for restaurants.

Bar Chart of Check-ins by Day

Placeholder for the Check-ins by Day of the Week chart.
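
Counting check-ins by weekday could look roughly like the sketch below. It assumes each check-in record stores its timestamps as one comma-separated string in `YYYY-MM-DD HH:MM:SS` form. The sample string is invented, and the project's ETL step may handle this differently.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]

# Hypothetical check-in string for one business; not real Yelp data.
checkin_dates = (
    "2016-04-26 19:49:16, 2016-08-20 12:00:00, "
    "2016-08-27 18:36:57, 2016-10-16 02:45:18"
)

counts = Counter()
for stamp in checkin_dates.split(","):
    moment = datetime.strptime(stamp.strip(), "%Y-%m-%d %H:%M:%S")
    # weekday() returns 0 for Monday through 6 for Sunday.
    counts[DAY_NAMES[moment.weekday()]] += 1

# Report in calendar order so the result maps directly onto a bar chart.
by_day = [(day, counts[day]) for day in DAY_NAMES]
print(by_day)
```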

Analysis 9: Elite vs. Non-Elite User Rating Patterns

Finding: Further supporting Hypothesis 2, Elite users are less likely to give extreme 1-star or 5-star reviews, preferring more nuanced ratings. Interestingly, the proportion of reviews from Elite users peaked around 2007 and has been on a gradual decline, suggesting a shift in the platform's user dynamics.

Chart of Elite vs Non-Elite Rating Distribution

Placeholder for the Elite vs. Non-Elite Rating Distribution chart.
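
One way to compare the two groups is to convert raw counts into the share of reviews at each star level, since Elite and non-Elite users write very different numbers of reviews. The sketch below assumes each review has already been tagged with an `is_elite` flag. The sample reviews are invented and the helper name `star_shares` is hypothetical.

```python
from collections import Counter

# Hypothetical (is_elite, stars) pairs; not real Yelp data.
reviews = [
    (True, 4), (True, 3), (True, 4), (True, 5),
    (False, 5), (False, 1), (False, 5), (False, 4),
]


def star_shares(stars):
    """Return the fraction of reviews at each star level from 1 to 5."""
    counts = Counter(stars)
    total = len(stars)
    return {level: counts.get(level, 0) / total for level in range(1, 6)}


elite = star_shares([s for is_elite, s in reviews if is_elite])
non_elite = star_shares([s for is_elite, s in reviews if not is_elite])

# Share of "extreme" reviews (1 or 5 stars) in each group.
elite_extreme = elite[1] + elite[5]
non_elite_extreme = non_elite[1] + non_elite[5]
print(elite, elite_extreme)
print(non_elite, non_elite_extreme)
```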

3. Limitations of the Analysis

4. Conclusion and Future Work

This project found support for the initial hypotheses and yielded key insights into the Yelp ecosystem. Systematically analyzing business attributes, user segments, market dynamics, and weekly traffic patterns gives a data-driven picture of the restaurant industry that can inform business strategy.

Future work could include:

Explore the Code

For a detailed look at the Python scripts, ETL process, and SQL queries used in this analysis, please visit the project repository on GitHub.
