NYC Yellow Taxi Data

This post describes how to use the NYC Yellow Taxi data to predict the total taxi fare with an AI model.
Table of Contents
- Data Introduction
- Data Visualization with Machbase Neo
- Table Creation and Data Upload in Machbase Neo
- Experimental Methodology
- Experiment Code
- Experimental Results
1. Data Introduction
- DataHub Serial Number: 2025-1.
- Data Name: NYC Yellow Taxi Data.
- Data Collection Methods: Collected through the meter of each taxi.
- Data Source: Link
- Raw data size and format: 6.88GB, CSV.
- Number of tags: 17.
Tag | Description |
---|---|
VendorID | A code indicating the TPEP provider that provided the record: 1 = Creative Mobile Technologies, 2 = VeriFone Inc. |
Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
Pickup_longitude | Longitude where the meter was engaged. |
Pickup_latitude | Latitude where the meter was engaged. |
RateCodeID | The final rate code in effect at the end of the trip: 1 = Standard rate, 2 = JFK, 3 = Newark, 4 = Nassau or Westchester, 5 = Negotiated fare, 6 = Group ride. |
Dropoff_longitude | Longitude where the meter was disengaged. |
Dropoff_latitude | Latitude where the meter was disengaged. |
Payment_type | A numeric code signifying how the passenger paid for the trip: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute, 5 = Unknown, 6 = Voided trip. |
Fare_amount | The time-and-distance fare calculated by the meter. |
Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
Tip_amount | Tip amount. This field is automatically populated for credit card tips; cash tips are not included. |
Tolls_amount | Total amount of all tolls paid during the trip. |
Total_amount | The total amount charged to passengers. Does not include cash tips. |
Trip_duration_minutes | The taxi usage time, calculated by subtracting the pickup time from the drop-off time. |
- Data Time Range: 2015-01-01 00:00:00 to 2016-03-31 23:59:59.
- Number of data records collected: 793,805,956.
- CSV data URL: https://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz
- Data Migration: NYC Yellow Taxi Data Migration
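As a small illustration of the derived Trip_duration_minutes tag described in the table, the sketch below subtracts a pickup time from a drop-off time and expresses the result in minutes. The function name and the timestamp layout are assumptions made for this example, not taken from the dataset's migration code.

```python
from datetime import datetime

# Assumed timestamp layout; the raw data may use a different one.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_duration_minutes(pickup: str, dropoff: str) -> float:
    """Return drop-off time minus pickup time, in minutes."""
    delta = datetime.strptime(dropoff, TIME_FORMAT) - datetime.strptime(pickup, TIME_FORMAT)
    return delta.total_seconds() / 60.0


# A trip of 12 minutes and 30 seconds.
print(trip_duration_minutes("2015-01-01 00:05:00", "2015-01-01 00:17:30"))  # 12.5
```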
2. Data Visualization with Machbase Neo
- Data visualization is possible through the Tag Analyzer in Machbase Neo.
- Select desired tag names and visualize them in various types of graphs.

- The viewer below accesses the 2025-1 DataHub in real time: select the desired tag names from the 17 available tags, visualize them, and preview the data patterns.
DataHub Viewer
3. Table Creation and Data Upload in Machbase Neo
- In the DataHub directory, use setup.wrk located in the NYC Yellow Taxi Dataset folder to create tables and load data, as illustrated in the image below.

1) Table Creation
- The table is created immediately upon pressing the "Run" button in the menu.
- If the Taxi table exists, execute the first line and then the second. If it does not exist, start from the second line.

2) Data Upload
- The table can be loaded in two different ways.
Method 1) Loading the table using TQL in Machbase Neo (since machbase-neo v8.0.29-rc1)
- Pros
- Loading starts as soon as you press the launch button in Machbase Neo.
- Cons
- Slower table loading than Method 2.

Method 2) Loading the table using commands
- Pros
- Fast table loading speed.
- Cons
- The table loading process is more cumbersome: open a command window, change to the machbase-neo directory, and enter the command.
- Running the script below from the command shell loads the data into the Taxi table at high speed.
curl http://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz | machbase-neo shell import --input - --compress gzip --header --method append --timeformat ns taxi
- To use a username and password other than the default sys/manager, add the --user and --password options as shown below.
curl http://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz | machbase-neo shell import --input - --compress gzip --header --method append --timeformat ns taxi --user USERNAME --password PASSWORD
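If the import is scripted rather than typed by hand, the two command lines above can be assembled programmatically. The sketch below is one way to do that in Python; it uses only the options that appear in the commands above, and the helper name `build_import_pipeline` is hypothetical. It only builds the command string and does not run anything.

```python
import shlex

DATA_URL = "http://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz"


def build_import_pipeline(table="taxi", user=None, password=None):
    """Build the curl-to-machbase-neo import pipeline as a shell string."""
    importer = [
        "machbase-neo", "shell", "import",
        "--input", "-",          # read the CSV from standard input
        "--compress", "gzip",    # the download is gzip-compressed
        "--header",              # first CSV line is a header
        "--method", "append",
        "--timeformat", "ns",
        table,
    ]
    # Credentials are only added when both are given (default is sys/manager).
    if user is not None and password is not None:
        importer += ["--user", user, "--password", password]
    return "curl " + shlex.quote(DATA_URL) + " | " + shlex.join(importer)


print(build_import_pipeline())
print(build_import_pipeline(user="USERNAME", password="PASSWORD"))
```

Quoting with `shlex` keeps a password containing spaces or shell metacharacters from breaking the pipeline when the string is pasted into a POSIX shell.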
4. Experimental Methodology
- Model Objective: NYC Yellow Taxi Fare Prediction.
- Tags Used: extra, fare_amount, improvement_surcharge, mta_tax, passenger_count, payment_type, tip_amount, tolls_amount, trip_distance, trip_duration_minutes, total_amount.
- Model Configuration: ResNet1d.
- Learning Method: Supervised Learning.
- Train: Model Training.
- Validation: Model Validation.
- Test: Model Performance Evaluation Based on NYC Yellow Taxi Fare Prediction.
- Model Optimizer: Adam.
- Model Loss Function: Mean Squared Error.
- Model Performance Metric: Mean Squared Error & R2 Score.
- Data Loading Method
- Loading the Entire Dataset.
- Loading the Fetch Dataset.
- Data Preprocessing
- MinMax Scaling.
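The train/validation/test split and the MinMax scaling step can be sketched in a few lines of plain Python. This is a minimal illustration, not the notebooks' implementation: the 80/10/10 split fractions and the function names are assumptions, and the scaler is fitted on the training rows only so that validation and test values do not influence the scaling range.

```python
def split_rows(rows, train_frac=0.8, val_frac=0.1):
    """Split rows in order into train, validation, and test parts.

    The fractions are assumptions for illustration.
    """
    n_train = int(len(rows) * train_frac)
    n_val = int(len(rows) * val_frac)
    return rows[:n_train], rows[n_train:n_train + n_val], rows[n_train + n_val:]


def fit_min_max(rows):
    """Return per-column minimums and maximums of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale each column to [0, 1] relative to the fitted range.

    Constant columns (max == min) are mapped to 0.0 to avoid division by zero.
    """
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


# Toy rows (made-up values): [trip_distance, fare_amount]
rows = [[float(i), float(2 * i)] for i in range(10)]
train, val, test = split_rows(rows)
mins, maxs = fit_min_max(train)
print(apply_min_max(train, mins, maxs)[-1])  # last training row maps to [1.0, 1.0]
```

Values outside the training range, which can occur in the validation or test parts, scale to numbers below 0 or above 1; that is expected with this approach.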
5. Experiment Code
NYC Yellow Taxi Data Exploratory Data Analysis
- Composed of three methods.
- Data Information: Outputs general information about the data.
- Visual Information: Correlation heatmap, plots, and decomposition of the data.
- Statistical Test: ADF, KPSS, PP, Ljung-Box, ARCH, and VIF tests on the data.

- The entire code can be run through 18.NYC_Yellow_Taxi_EDA.
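To make the correlation heatmap step concrete, the sketch below computes the Pearson correlation matrix that such a heatmap visualizes. It is a plain-Python illustration with made-up toy values; the function names are hypothetical and the notebook itself may rely on a data analysis library instead.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / (norm_x * norm_y)


def correlation_matrix(columns):
    """Return {tag: {tag: correlation}} for a dict of tag name -> values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Toy values for illustration only, not taken from the dataset.
toy = {
    "trip_distance": [1.0, 2.0, 3.0, 4.0],
    "fare_amount": [6.0, 9.5, 13.0, 17.5],
}
for name, row in correlation_matrix(toy).items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```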
NYC Yellow Taxi Fare Prediction
- Below is the code for each of the two ways to get data from the database.
- If all the data can be loaded and trained at once without causing memory errors, then method 1 is the fastest and simplest.
- If the data is too large, causing memory errors, then the batch loading method proposed in method 2 is the most efficient.
Method 1) Loading the Entire Dataset
- The code below loads all the data needed for training from the database at once.
- It works the same way as loading the full set of CSV files; the only difference is that the data comes from Machbase Neo.
- Pros
- Code that previously read from CSV files can be reused; only the loading step differs.
- Cons
- Training is not possible if the training data does not fit in memory.
- The entire code can be run through 18.NYC_Yellow_Taxi_Full.
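The shape of Method 1 can be sketched as a single pass that reads every row into memory before training starts. The example below is an illustration under stated assumptions, not the notebook's code: it assumes the query result is available as a CSV stream with one column per tag, the `load_all` helper is hypothetical, and the feature list is just a subset of the tags named in the methodology section. How the stream is obtained from Machbase Neo is left out on purpose.

```python
import csv
import io

# Assumed column layout: one column per tag (a subset is used here).
FEATURES = ["trip_distance", "trip_duration_minutes", "passenger_count"]
TARGET = "total_amount"


def load_all(csv_stream, features=FEATURES, target=TARGET):
    """Read the whole CSV stream into memory as feature rows and targets."""
    X, y = [], []
    for row in csv.DictReader(csv_stream):
        X.append([float(row[name]) for name in features])
        y.append(float(row[target]))
    return X, y


# Toy CSV with made-up values, standing in for the database result.
sample = io.StringIO(
    "trip_distance,trip_duration_minutes,passenger_count,total_amount\n"
    "1.5,10.0,1,9.8\n"
    "3.2,18.5,2,16.3\n"
)
X, y = load_all(sample)
print(len(X), "rows loaded")
```

Because `X` and `y` hold every row at once, memory use grows linearly with the number of records, which is the limitation listed under Cons.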
Method 2) Loading the Fetch Dataset
- This method loads data from Machbase Neo one buffer-sized chunk at a time.
- Pros
- The model can be trained regardless of how large the dataset is.
- Cons
- Training takes longer than with Method 1.
- The entire code can be run through 18.NYC_Yellow_Taxi_Buffered_Fetch.
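The idea behind Method 2 can be sketched with a generator that hands out one buffer of rows at a time, so only a single batch is ever held in memory. This is an illustration rather than the notebook's implementation: the `iter_batches` helper is hypothetical, and the input is assumed to be a CSV stream standing in for the database fetch.

```python
import csv
import io
from itertools import islice


def iter_batches(csv_stream, batch_size):
    """Yield lists of at most batch_size rows until the stream is exhausted."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    reader = csv.DictReader(csv_stream)
    while True:
        batch = list(islice(reader, batch_size))
        if not batch:
            return
        yield batch


# Toy CSV with made-up values, standing in for the database result.
sample = io.StringIO(
    "trip_distance,total_amount\n"
    "1.5,9.8\n3.2,16.3\n0.8,6.3\n5.1,22.8\n2.4,12.3\n"
)
for number, batch in enumerate(iter_batches(sample, batch_size=2), start=1):
    # A real training loop would run one optimizer step per batch here.
    print("batch", number, "rows:", len(batch))
```

The extra training time noted under Cons comes from repeating the fetch for every batch, typically once per epoch, instead of reading the data a single time.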
6. Experimental Results
Method 1) Loading the Entire Dataset Result


Method 2) Loading the Fetch Dataset Result


- The R2 score shows high performance above 0.9 in both methods.
- It is expected that more precise data cleaning would improve performance.
- If data is collected directly using Machbase Neo, there would be no need to put as much effort into data cleaning as with the current data.
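For reference, the two metrics reported above can be computed as in the following sketch. It uses the standard definitions of Mean Squared Error and the R2 score in plain Python with made-up toy values; it is not the notebooks' evaluation code.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Toy fares (made-up values): true totals and model predictions.
y_true = [10.0, 20.0, 30.0]
y_pred = [12.0, 18.0, 30.0]
print(round(mean_squared_error(y_true, y_pred), 4))
print(round(r2_score(y_true, y_pred), 4))
```

An R2 score of 1.0 means perfect predictions, while 0.0 means the model does no better than always predicting the mean fare.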
※ Various datasets and tutorial code can be found in the GitHub repository below.
datahub/dataset at main · machbase/datahub