
Is That New Lunch Spot Overpriced? A Study in Linear Regression.

How many times have you gone to a new restaurant where every dish seems overpriced? When your regular lunch spot offers a new special, is it even worth trying? Are restaurants up-charging you for things like decor or reputation? What if you had a model that could help you predict that?

I decided to design this model for my 2nd Metis project, where I would utilize linear regression to predict the price of a lunch dish based on the information that one could gather from the restaurant menu.

To train this model, there were three types of data that were obtained:

To constrain the types of restaurants in the dataset, only restaurants in San Francisco, CA were picked. These restaurants were also limited to 6 cuisines that were a diverse, representative sample of lunch choices in SF:

1. American
2. Mexican
3. Mediterranean
4. African
5. Pakistani
6. Creole & Cajun

Ingredient Data

In order to establish meaning from the restaurant menu text, my model would need a reference list of important words. Thus, using BeautifulSoup (BS4), I scraped a base ingredient list of 600+ unique food-related nouns from some recipe-related websites.
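As a rough illustration of how such a reference list can be used (this is not the code from the project), the sketch below matches the words in a dish description against a small, hypothetical ingredient set. The `INGREDIENTS` set and the `find_ingredients` helper are assumptions made for the example, and the sketch only handles single-word ingredients.

```python
import re

# Hypothetical reference list; the real one held 600+ scraped food nouns.
INGREDIENTS = {"crab", "lobster", "duck", "rice", "beans", "cheese"}

def find_ingredients(description, ingredients=INGREDIENTS):
    """Return the reference ingredients mentioned in a dish description."""
    # Lowercase and keep only runs of letters so "Crab," matches "crab".
    tokens = re.findall(r"[a-z]+", description.lower())
    return sorted(set(tokens) & ingredients)

print(find_ingredients("Dungeness Crab roll with melted cheese, served warm"))
# ['cheese', 'crab']
```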

Demographic Data

With the data in hand there were three aspects that I focused on during the EDA process:

Data Cleaning

In examining the raw data there was some necessary cleaning:

1. Missing Info: For certain entries, the dish price was not listed. This was generally the case when dishes were all grouped and set at the same price (e.g. dim sum). I ended up dropping these dishes from the dataset.

2. Outliers: Some dishes were much more expensive than expected based on their ingredients and stuck out as outliers. Digging deeper, I discovered that many of these dishes were actually labeled as sharing platters or group meals. Because I was focusing on a meal for one, I ended up dropping these entries from the dataset.

Data Cut

As I wanted to restrict my model to predicting the prices of lunch dishes, I ended up removing any dishes that cost more than $20 or less than $7. Even after this cut, there were still 10,000+ data points.
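The cleaning and cut steps above can be sketched roughly as follows. The rows, the field names, and the `GROUP_KEYWORDS` used to spot sharing platters are all hypothetical; only the $7 to $20 window comes from the text.

```python
# Hypothetical scraped rows; field names are illustrative only.
dishes = [
    {"name": "Chicken burrito", "price": 9.50},
    {"name": "Dim sum selection", "price": None},        # price not listed
    {"name": "Family sharing platter", "price": 18.00},  # group meal
    {"name": "Side of chips", "price": 3.00},            # below the lunch range
    {"name": "Lobster roll", "price": 24.00},            # above the lunch range
    {"name": "Lamb gyro", "price": 12.00},
]

GROUP_KEYWORDS = ("sharing", "platter", "family", "group")  # assumed markers
MIN_PRICE, MAX_PRICE = 7.00, 20.00  # lunch price window from the text

def is_lunch_dish(dish):
    """Keep priced, single-serving dishes inside the lunch price window."""
    if dish["price"] is None:
        return False
    name = dish["name"].lower()
    if any(word in name for word in GROUP_KEYWORDS):
        return False
    return MIN_PRICE <= dish["price"] <= MAX_PRICE

lunch_dishes = [d for d in dishes if is_lunch_dish(d)]
print([d["name"] for d in lunch_dishes])
# ['Chicken burrito', 'Lamb gyro']
```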

Feature Generation

With the restaurant text gathered, there was flexibility in feature generation using NLP (Natural Language Processing) such as:

2. The frequency with which each ingredient appeared in high-cost and low-cost dishes from the training set was used to generate a list of high-cost and low-cost ingredients. The number of times these price-categorized ingredients appeared in each dish was also used as a feature in the model.
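One way to picture the price-categorized ingredient features is the sketch below. The training rows, the `PRICE_SPLIT` cutoff, and the rule for labeling an ingredient (whichever price group it appears in more often) are assumptions for illustration, not the rules used in the project.

```python
from collections import Counter

# Hypothetical training rows: (ingredients found in the dish, price in dollars).
train = [
    ({"crab", "butter"}, 19.0),
    ({"lobster", "butter"}, 18.0),
    ({"beans", "rice"}, 8.0),
    ({"rice", "cheese"}, 9.0),
]
PRICE_SPLIT = 13.0  # assumed cutoff between low-cost and high-cost dishes

# Count how often each ingredient shows up in each price group.
high_counts, low_counts = Counter(), Counter()
for ingredients, price in train:
    (high_counts if price >= PRICE_SPLIT else low_counts).update(ingredients)

# Label an ingredient by the group in which it appears more often.
high_cost = {i for i in high_counts if high_counts[i] > low_counts[i]}
low_cost = {i for i in low_counts if low_counts[i] > high_counts[i]}

def price_category_features(ingredients):
    """Count the high-cost and low-cost ingredients in a single dish."""
    return {
        "n_high_cost": len(ingredients & high_cost),
        "n_low_cost": len(ingredients & low_cost),
    }

print(price_category_features({"crab", "rice", "cheese"}))
# {'n_high_cost': 1, 'n_low_cost': 2}
```

The lists are built from the training set only, so the test prices never influence the features.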

In trying to improve my RMSE, there were a couple of modifications that I made to my base linear regression model:

Data Transformations

Distribution of Box-Cox transformed dish prices
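For readers unfamiliar with the Box-Cox transform named in the caption, here is a minimal sketch of the transform and its inverse. The `LAMBDA` value and the prices are purely illustrative; in practice the parameter is usually fitted from the data (for example, `scipy.stats.boxcox` can estimate it), and the value used in the project is not stated.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1) / lam

def inverse_box_cox(z, lam):
    """Map a transformed value back to the original (dollar) scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1) ** (1 / lam)

LAMBDA = 0.5  # illustrative value, not the one fitted in the project
prices = [7.0, 9.5, 12.0, 20.0]  # hypothetical dish prices
transformed = [box_cox(p, LAMBDA) for p in prices]
recovered = [inverse_box_cox(z, LAMBDA) for z in transformed]
```

A model trained on transformed prices predicts in the transformed space, so predictions need the inverse transform before an RMSE in dollars can be reported.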


With about 600 starting features in my model, regularization was sorely needed to help reduce the number of features. By using Lasso regularization, I was able to remove 130 features from my model. Most likely, had I been more aggressive in reducing the number of features during the cross-validation process, even more features would have been dropped.
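To show why Lasso drops features outright rather than merely shrinking them, here is a small pure-Python coordinate descent sketch. It is not the implementation used in the project; the toy data, the `alpha` value, and the function names are assumptions. The objective is (1/2n)·||y − Xw||² + alpha·||w||₁ with no intercept, the same form scikit-learn's `Lasso` minimizes.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within alpha become exactly 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                pred = sum(X[i][k] * w[k] for k in range(p))
                rho += X[i][j] * (y[i] - pred + X[i][j] * w[j])
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Toy data: y depends only on the first column; the second is irrelevant.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2, -2, 2, -2]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
print(weights)  # [1.5, 0.0]
```

The irrelevant feature ends with a weight of exactly zero, and the useful one is shrunk from 2 to 1.5. A larger `alpha` zeroes out more features, which corresponds to the "more aggressive" setting mentioned above.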

Important Features

With regard to the most predictive features, there was a mixture of expected and unexpected results. The most predictive features were:

While restaurant types (Mexican and Jerk) and specific ingredients were expectedly predictive, a real surprise was dish text length. Behind expensive ingredients (like crab, lobster, and duck), dish text length was actually the 4th strongest positive predictor, and it demonstrated a log-squared relationship. That actually made sense, because more expensive restaurants would generally be more verbose and embellish their dish descriptions.

One more note was that demographic features did not end up being particularly predictive, so any information regarding restaurant location ended up being dropped from the model.

Model Performance

My final model ended up having a train RMSE of $2.44 and a test RMSE of $2.63.

Because the two values are close and the train and test residual plots are similar in shape, I believe my model generalizes well and is not overfitting.
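For reference, RMSE is the square root of the mean squared difference between actual and predicted prices, so it reads in dollars. A minimal sketch with hypothetical prices:

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists of prices."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical prices, not results from the project.
actual = [8.0, 12.0, 15.0, 10.0]
predicted = [10.0, 10.0, 13.0, 12.0]
print(rmse(actual, predicted))  # 2.0
```

Computing this once on the training dishes and once on the held-out dishes gives the two numbers being compared; a test RMSE far above the train RMSE would suggest overfitting.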

So bringing this back to the original question, if you went to the restaurant Best Mexican and ordered a burrito that contained:

My model would predict a price of $7.80. If Best Mexican were charging anything more than $10.43, then they would be ripping you off.
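The $10.43 cutoff equals the predicted price plus the test RMSE ($7.80 + $2.63). Assuming that one RMSE of headroom is the intended tolerance, the rule can be sketched as follows; `is_overpriced` is a hypothetical helper, not part of the original model.

```python
TEST_RMSE = 2.63  # test RMSE reported for the final model

def is_overpriced(menu_price, predicted_price, tolerance=TEST_RMSE):
    """Flag a dish priced more than one RMSE above the model's prediction."""
    return menu_price > predicted_price + tolerance

predicted = 7.80  # model's prediction for the burrito
print(round(predicted + TEST_RMSE, 2))  # 10.43
print(is_overpriced(11.00, predicted))  # True
print(is_overpriced(9.50, predicted))   # False
```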

If I had more time to redo this model, a couple of improvements that I think would further reduce the RMSE would be to:
