Yelp Dataset Analysis: Key Findings
An exploration of business characteristics, user behavior, and market trends from the Yelp Academic Dataset using SQL and Matplotlib.
1. Hypotheses
Before conducting the analysis, several hypotheses were formulated to guide the investigation:
- Popularity vs. Quality: A higher number of reviews will not be a strong predictor of a higher average star rating.
- Experienced User Criticism: More active and "Elite" users will be more critical and exhibit different rating patterns than casual users.
- Weekend Peak: Customer traffic, as measured by check-ins, will be significantly higher on weekends.
- Market Saturation: Cities with the most restaurants will not necessarily be the best markets for a new business.
2. Key Findings and Visualizations
Analysis 1: High-Level Business Statistics
Finding: The average star rating across all businesses is 3.60. Review counts vary dramatically, from as few as 3 reviews to more than 7,500 for the most popular businesses, indicating a wide distribution of business popularity.
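Summary statistics of this kind can be produced with a single aggregate query. The sketch below is illustrative only: the `business` table and its `stars` and `review_count` columns are assumed to mirror the fields of the dataset's business records, and the three rows are made-up values, not figures from the analysis.

```python
import sqlite3


def business_summary(conn):
    """Return (average stars, minimum review count, maximum review count)."""
    query = """
        SELECT ROUND(AVG(stars), 2), MIN(review_count), MAX(review_count)
        FROM business
    """
    return conn.execute(query).fetchone()


# Tiny in-memory table with made-up rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (business_id TEXT, stars REAL, review_count INTEGER)"
)
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?)",
    [("a", 4.5, 120), ("b", 3.0, 5), ("c", 3.5, 40)],
)

summary = business_summary(conn)
print(summary)
```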
Analysis 2: Top 10 Most Reviewed Restaurants
Finding: A small number of American restaurants, primarily located in major cities, dominate the review landscape. This concentration highlights the strong brand recognition and high customer volume of market leaders.
Placeholder for the Top 10 Most Reviewed Restaurants chart.
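A ranking like this one typically comes from a filtered, sorted query. The following is a minimal sketch, assuming a hypothetical `business` table with `name`, `city`, `categories`, and `review_count` columns and a comma-separated `categories` string; the rows are invented for the example.

```python
import sqlite3


def top_reviewed_restaurants(conn, limit=10):
    """Return (name, city, review_count) for the most reviewed restaurants."""
    query = """
        SELECT name, city, review_count
        FROM business
        WHERE categories LIKE '%Restaurants%'
        ORDER BY review_count DESC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Made-up rows; the names and cities are placeholders, not dataset entries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE business (name TEXT, city TEXT, categories TEXT, review_count INTEGER)"
)
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?, ?)",
    [
        ("Diner One", "City A", "Restaurants, American (Traditional)", 900),
        ("Cafe Two", "City B", "Coffee & Tea, Restaurants", 1500),
        ("Salon Three", "City A", "Hair Salons, Beauty & Spas", 4000),
        ("Grill Four", "City C", "Restaurants, Barbeque", 300),
    ],
)

top_two = top_reviewed_restaurants(conn, limit=2)
print(top_two)
```

The non-restaurant row has the highest review count but is excluded by the category filter.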
Analysis 3: Correlation for Highly-Rated Restaurants
Finding: Confirming Hypothesis 1, `review_count` correlates only very weakly with a business being highly rated (>4.0 stars). The strongest predictors are, unsurprisingly, the overall star rating and the average stars from individual reviews, both of which are closely tied to the rating threshold itself.
Placeholder for the Correlation Matrix.
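One way to check a relationship like this is to correlate `review_count` with a 0/1 flag for "highly rated". The sketch below computes a plain Pearson coefficient by hand; it is not necessarily the method behind the correlation matrix, and the sample values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up businesses, for illustration only.
stars = [4.5, 3.0, 4.5, 2.5, 4.0, 5.0]
review_counts = [12, 300, 45, 80, 1500, 7]

# Flag businesses above the 4.0-star threshold used in the finding.
highly_rated = [1 if s > 4.0 else 0 for s in stars]

r = pearson(review_counts, highly_rated)
print(f"correlation between review_count and highly_rated: {r:.3f}")
```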
Analysis 4 & 5: User Segments and Rating Behavior
Finding: Hypothesis 2 is only partially supported. "Power Users" (those with high star/fan counts) do rate differently from other segments, but not in the predicted direction: they give higher average ratings than "Engaged Users" rather than harsher ones. The user base is dominated by "Casual" users, so although Power Users are more generous with ratings, their small numbers limit their overall impact.
Placeholder for the Average Stars by User Segment chart.
Placeholder for the Distribution of User Segments chart.
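The segmentation can be pictured as a simple rule applied to each user record. The exact cutoffs used in the analysis are not stated here, so the thresholds below (`power_reviews`, `power_fans`, `engaged_reviews`) are hypothetical placeholders, as are the sample users.

```python
from collections import defaultdict


def segment_user(review_count, fans,
                 power_reviews=100, power_fans=10, engaged_reviews=10):
    """Assign a segment label; all thresholds are hypothetical defaults."""
    if review_count >= power_reviews and fans >= power_fans:
        return "Power"
    if review_count >= engaged_reviews:
        return "Engaged"
    return "Casual"


def average_stars_by_segment(users):
    """Average each user's `average_stars` within their segment."""
    totals = defaultdict(lambda: [0.0, 0])  # segment -> [sum, count]
    for user in users:
        segment = segment_user(user["review_count"], user["fans"])
        totals[segment][0] += user["average_stars"]
        totals[segment][1] += 1
    return {seg: total / count for seg, (total, count) in totals.items()}


# Made-up users, for illustration only.
users = [
    {"review_count": 250, "fans": 40, "average_stars": 4.0},
    {"review_count": 30, "fans": 2, "average_stars": 3.5},
    {"review_count": 12, "fans": 0, "average_stars": 3.7},
    {"review_count": 2, "fans": 0, "average_stars": 5.0},
]

averages = average_stars_by_segment(users)
print(averages)
```

Keeping the thresholds as keyword arguments makes it easy to test how sensitive the segment averages are to the chosen cutoffs.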
Analysis 6 & 7: City Market & Nightlife Opportunity
Finding: Supporting Hypothesis 4, the best markets are not necessarily the biggest. Cities like Philadelphia fall in the "sweet spot" of high average ratings and high review counts, representing prime opportunities. Philadelphia is also the clear leader in the quantity of "Nightlife" venues, though other cities like Tampa and Indianapolis boast higher average quality in this category.
Placeholder for the City Market Opportunity plot.
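The "sweet spot" idea can be sketched as a two-step computation: aggregate per city, then keep the cities that score high on both axes. How "high" was defined in the analysis is not stated, so this sketch assumes "above the median across cities" for both average rating and total reviews. The table layout and all rows are made up.

```python
import sqlite3
from statistics import median


def city_market_stats(conn):
    """Return (city, average stars, total reviews) per city."""
    query = """
        SELECT city, AVG(stars), SUM(review_count)
        FROM business
        GROUP BY city
        ORDER BY city
    """
    return conn.execute(query).fetchall()


def sweet_spot(stats):
    """Cities above the median on both average rating and total reviews."""
    rating_cut = median(avg for _, avg, _ in stats)
    review_cut = median(total for _, _, total in stats)
    return [
        city for city, avg, total in stats
        if avg > rating_cut and total > review_cut
    ]


# Made-up restaurant rows with placeholder city names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE business (city TEXT, stars REAL, review_count INTEGER)")
conn.executemany(
    "INSERT INTO business VALUES (?, ?, ?)",
    [
        ("City A", 4.0, 500),
        ("City A", 4.5, 400),
        ("City B", 3.5, 1500),
        ("City C", 4.0, 300),
        ("City D", 3.0, 100),
    ],
)

stats = city_market_stats(conn)
candidates = sweet_spot(stats)
print(candidates)
```

In this toy data, City B has the most reviews but a below-median rating, so it falls outside the sweet spot.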
Analysis 8: Weekly Customer Traffic
Finding: The data overwhelmingly supports Hypothesis 3. Customer check-ins peak dramatically on Saturday and Sunday, confirming the critical importance of weekend business for restaurants.
Placeholder for the Check-ins by Day of the Week chart.
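Counting check-ins by weekday comes down to parsing timestamps and tallying them. The sketch below assumes each check-in record stores its timestamps as one comma-separated string in `YYYY-MM-DD HH:MM:SS` format; if the data is shaped differently, only the parsing step changes. The sample timestamps are invented.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def checkins_by_weekday(date_strings):
    """Count check-ins per weekday from comma-separated timestamp strings."""
    counts = Counter({day: 0 for day in DAY_NAMES})
    for date_string in date_strings:
        for stamp in date_string.split(","):
            stamp = stamp.strip()
            if not stamp:
                continue  # skip empty fragments such as trailing commas
            moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
            counts[DAY_NAMES[moment.weekday()]] += 1
    return counts


# Made-up check-in strings, one per business.
sample = [
    "2021-01-02 12:00:00, 2021-01-03 13:30:00",
    "2021-01-04 09:15:00, 2021-01-09 20:00:00",
]

weekday_counts = checkins_by_weekday(sample)
for day in DAY_NAMES:
    print(day, weekday_counts[day])
```

The resulting counts, in `DAY_NAMES` order, can be passed straight to a bar chart.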
Analysis 9: Elite vs. Non-Elite User Rating Patterns
Finding: Further supporting Hypothesis 2, Elite users are less likely to give extreme 1-star or 5-star reviews, preferring more nuanced ratings. Interestingly, the proportion of reviews from Elite users peaked around 2007 and has been on a gradual decline, suggesting a shift in the platform's user dynamics.
Placeholder for the Elite vs. Non-Elite Rating Distribution chart.
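Comparing rating patterns between the two groups means looking at the share of reviews at each star level rather than raw counts, since the groups differ greatly in size. This is a minimal sketch with invented ratings; splitting reviews into Elite and non-Elite is assumed to have been done upstream.

```python
from collections import Counter


def rating_distribution(star_ratings):
    """Return the share of reviews at each star level from 1 to 5."""
    if not star_ratings:
        raise ValueError("need at least one rating")
    counts = Counter(star_ratings)
    total = len(star_ratings)
    return {stars: counts.get(stars, 0) / total for stars in range(1, 6)}


def extreme_share(star_ratings):
    """Share of reviews that are either 1-star or 5-star."""
    distribution = rating_distribution(star_ratings)
    return distribution[1] + distribution[5]


# Made-up review ratings, for illustration only.
elite_ratings = [3, 4, 4, 5, 2, 4, 3, 4]
non_elite_ratings = [1, 5, 5, 1, 4, 5, 3, 5]

elite_extreme = extreme_share(elite_ratings)
non_elite_extreme = extreme_share(non_elite_ratings)
print(f"Elite: {elite_extreme:.3f}, non-Elite: {non_elite_extreme:.3f}")
```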
3. Limitations of the Analysis
- Data Timeliness: The dataset is a periodic snapshot and does not represent real-time data.
- Geographic Bias: The data is heavily focused on specific metropolitan areas in North America and Europe.
- Incomplete Data: Some fields were not consistently populated, requiring data cleaning and imputation strategies.
- Definition of "Success": The analysis uses proxies like star ratings and review counts. True business success also involves financial metrics not present in the dataset.
4. Conclusion and Future Work
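This project tested our four initial hypotheses and yielded key insights into the Yelp ecosystem. Hypotheses 1, 3, and 4 were supported, while Hypothesis 2 was only partly borne out: experienced users do rate differently, but Power Users proved more generous rather than more critical. By systematically analyzing business attributes, user segments, market dynamics, and operational rhythms, we have built a data-driven picture of the restaurant industry that can inform business strategy.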
Future work could include:
- Natural Language Processing (NLP): Performing topic modeling on review text to identify specific drivers of positive or negative sentiment (e.g., "service," "ambiance," "price").
- Predictive Modeling: Building a machine learning model to predict a new restaurant's potential star rating based on its attributes.
- Geospatial Analysis: Using geospatial libraries to map restaurant "hotspots" and "deserts" within a city to identify optimal locations.
Explore the Code
For a detailed look at the Python scripts, ETL process, and SQL queries used in this analysis, please visit the project repository on GitHub.