NYC Yellow Taxi Data

Last updated on Mar 25, 2025

This post describes how to use the NYC Yellow Taxi Data to predict the total taxi fare with a machine learning model.

Data Introduction
Data Visualization with Machbase Neo
Table Creation and Data Upload in Machbase Neo
Experimental Methodology
Experiment Code
Experimental Results

1. Data Introduction

DataHub Serial Number: 2025-1.
Data Name: NYC Yellow Taxi Data.
Data Collection Methods: Collected through the meter of each taxi.
Data Source: Link
Raw data size and format: 6.88GB, CSV.
Number of tags: 17.

Tag	Description
VendorID	A code indicating the TPEP provider that provided the record.
	- 1 = Creative Mobile Technologies
	- 2 = VeriFone Inc.
Passenger_count	The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance	The elapsed trip distance in miles reported by the taximeter.
Pickup_longitude	Longitude where the meter was engaged.
Pickup_latitude	Latitude where the meter was engaged.
RateCodeID	The final rate code in effect at the end of the trip.
	- 1 = Standard rate
	- 2 = JFK
	- 3 = Newark
	- 4 = Nassau or Westchester
	- 5 = Negotiated fare
	- 6 = Group ride
Dropoff_longitude	Longitude where the meter was disengaged.
Dropoff_latitude	Latitude where the meter was disengaged.
Payment_type	A numeric code signifying how the passenger paid for the trip.
	- 1 = Credit card
	- 2 = Cash
	- 3 = No charge
	- 4 = Dispute
	- 5 = Unknown
	- 6 = Voided trip
Fare_amount	The time-and-distance fare calculated by the meter.
Extra	Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax	$0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge	$0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount	Tip amount – This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount	Total amount of all tolls paid in trip.
Total_amount	The total amount charged to passengers. Does not include cash tips.
Trip_duration_minutes	The taxi usage time is calculated by subtracting the pickup time from the drop-off time.
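
As a minimal sketch of how the Trip_duration_minutes tag described above could be derived, the snippet below subtracts a pickup timestamp from a drop-off timestamp. The timestamp layout and the function name are assumptions made for illustration; the source does not show how the value was actually computed.

```python
from datetime import datetime

# Assumed timestamp layout; the raw data may use a different format.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def trip_duration_minutes(pickup: str, dropoff: str) -> float:
    """Return the drop-off time minus the pickup time, in minutes."""
    start = datetime.strptime(pickup, TIME_FORMAT)
    end = datetime.strptime(dropoff, TIME_FORMAT)
    return (end - start).total_seconds() / 60.0


# A trip of 22 minutes and 30 seconds.
print(trip_duration_minutes("2015-01-01 00:10:00", "2015-01-01 00:32:30"))  # 22.5
```

Negative or extremely long durations produced this way usually point to bad pickup or drop-off times and are natural candidates for data cleaning.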

Data Time Range: 2015-01-01 00:00:00 to 2016-03-31 23:59:59.
Number of data records collected: 793,805,956.
CSV data URL: https://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz
Data Migration: NYC Yellow Taxi Data Migration

2. Data Visualization with Machbase Neo

Data can be visualized with the Tag Analyzer in Machbase Neo.
Select the desired tag names and plot them as various types of graphs.

The viewer below connects to the 2025-1 DataHub in real time. Select the desired tag names from the 17 available tags to visualize them and preview the data patterns.

DataHub Viewer

3. Table Creation and Data Upload in Machbase Neo

In the DataHub directory, use setup.wrk located in the NYC Yellow Taxi Dataset folder to create tables and load data, as illustrated in the image below.

1) Table Creation

The table is created immediately upon pressing the "Run" button in the menu.
If the Taxi table exists, execute the first line and then the second. If it does not exist, start from the second line.

2) Data Upload

The table can be loaded in two different ways.

Method 1) Loading the table using TQL in Machbase Neo (since machbase-neo v8.0.29-rc1)

Pros
- Machbase Neo starts loading as soon as you press the launch button.
Cons
- Slower table loading speed compared to the other method.

Method 2) Loading tables using commands

Pros
- Fast table loading speed.
Cons
- The table loading process is cumbersome.
- Steps: open a cmd window, change to the machbase-neo path, then enter the command in the cmd window.

If you run the script below from the command shell, the data is loaded into the Taxi table at high speed.

curl http://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz | machbase-neo shell import --input - --compress gzip --header --method append --timeformat ns taxi

If the username and password differ from the defaults (sys/manager), add the --user and --password options as shown below.

curl http://data.yotahub.com/2025-1/datahub-2025-1-taxi.csv.gz | machbase-neo shell import --input - --compress gzip --header --method append --timeformat ns taxi --user USERNAME --password PASSWORD
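
Before importing a file of this size, it can help to inspect the first few rows. The sketch below reads the header and the first data rows from a gzip-compressed CSV stream without decompressing the whole file. The sample column names (NAME, TIME, VALUE) and values are hypothetical stand-ins, not taken from the actual file; the header row itself is expected because the import command uses --header.

```python
import csv
import gzip
import io
from itertools import islice
from typing import BinaryIO, List


def preview_gzip_csv(stream: BinaryIO, n_rows: int = 5) -> List[List[str]]:
    """Return the header plus the first n_rows data rows of a gzipped CSV."""
    # gzip.open accepts an already opened binary file object as well as a path.
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        reader = csv.reader(text)
        return list(islice(reader, n_rows + 1))


# Hypothetical in-memory sample; timestamps are epoch nanoseconds,
# matching the --timeformat ns option of the import command.
sample = gzip.compress(
    b"NAME,TIME,VALUE\n"
    b"fare_amount,1420070400000000000,12.5\n"
    b"tip_amount,1420070400000000000,2.0\n"
)
rows = preview_gzip_csv(io.BytesIO(sample), n_rows=1)
print(rows)
```

For the real dataset, the same function can be given a downloaded copy of the file opened with open(path, "rb").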

4. Experimental Methodology

Model Objective: NYC Yellow Taxi Fare Prediction.
Tags Used: extra, fare_amount, improvement_surcharge, mta_tax, passenger_count, payment_type, tip_amount, tolls_amount, trip_distance, trip_duration_minutes, total_amount.
Model Configuration: ResNet1d.

Learning Method: Supervised Learning.
- Train: Model Training.
- Validation: Model Validation.
- Test: Model Performance Evaluation Based on NYC Yellow Taxi Fare Prediction.
Model Optimizer: Adam.
Model Loss Function: Mean Squared Error.
Model Performance Metric: Mean Squared Error & R2 Score.
Data Loading Method
- Loading the Entire Dataset.
- Loading the Fetch Dataset.
Data Preprocessing
- MinMax Scaling.
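
The MinMax scaling step listed above can be sketched in plain Python as follows. Fitting the minimum and maximum on the training split only and reusing them for the other splits is an assumption based on common practice; the source does not say how the scaler was fitted, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def fit_min_max(rows: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Compute the per-column minimum and maximum of the given rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def transform_min_max(
    rows: Sequence[Sequence[float]], mins: Sequence[float], maxs: Sequence[float]
) -> List[List[float]]:
    """Scale each column with (x - min) / (max - min)."""
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column has max == min; map it to 0.0 to avoid dividing by zero.
            (v - lo) / (hi - lo) if hi > lo else 0.0
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Illustrative values only, not taken from the taxi data.
train = [[1.0, 10.0], [3.0, 20.0], [5.0, 30.0]]
test = [[2.0, 40.0]]
mins, maxs = fit_min_max(train)
print(transform_min_max(train, mins, maxs))  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(transform_min_max(test, mins, maxs))   # [[0.25, 1.5]]
```

Values outside the training range, such as 40.0 above, fall outside [0, 1] after scaling, which is expected when the scaler is fitted on the training split alone.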

5. Experiment Code

NYC Yellow Taxi Data Exploratory Data Analysis

The analysis is composed of three parts.
Data Information: Outputs general information about the data.
Visual Information: Correlation heatmap, plots, and decomposition of the data.
Statistical Tests: ADF, KPSS, PP, Ljung-Box, ARCH, and VIF tests on the data.

The entire code can be run through 18.NYC_Yellow_Taxi_EDA.

NYC Yellow Taxi Fare Prediction

Below is the code for each of the two ways to load data from the database.
If all the data can be loaded and trained on at once without causing memory errors, Method 1 is the faster and simpler option.
If the data is too large and causes memory errors, the batch loading approach in Method 2 is the better choice.

Method 1) Loading the Entire Dataset

The code below loads all the data needed for training from the database at once.
This works the same way as loading CSV files; the only difference is that the data comes from Machbase Neo.
Pros
- Existing code written for CSV files can be reused; only the loading step differs.
Cons
- Training is not possible if the training data exceeds the available memory.

The entire code can be run through 18.NYC_Yellow_Taxi_Full.
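
As a rough illustration of Method 1, and not taken from the notebook itself, the sketch below requests a whole query result as CSV over HTTP and parses it into memory in one step. The endpoint path, port, and the q and format parameters are assumptions that should be checked against the Machbase Neo HTTP API documentation; the name, time, and value columns of the taxi table are likewise assumed.

```python
import csv
import io
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed address of a local Machbase Neo query endpoint.
BASE_URL = "http://127.0.0.1:5654/db/query"


def build_query_url(sql: str) -> str:
    """Build a query URL that asks for the result in CSV form."""
    return BASE_URL + "?" + urlencode({"q": sql, "format": "csv"})


def parse_csv_rows(text: str) -> List[Dict[str, str]]:
    """Parse CSV text with a header row into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def load_all(sql: str) -> List[Dict[str, str]]:
    """Fetch the full result set at once; it must fit in memory."""
    with urlopen(build_query_url(sql)) as response:
        return parse_csv_rows(response.read().decode("utf-8"))


# Offline demonstration with a hypothetical response body.
url = build_query_url("SELECT name, time, value FROM taxi WHERE name = 'fare_amount'")
demo_rows = parse_csv_rows("NAME,TIME,VALUE\nfare_amount,1420070400000000000,12.5\n")
print(url)
print(demo_rows)
```

Because load_all reads the entire response before parsing, memory use grows with the size of the result set, which is the limitation noted in the cons above.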

Method 2) Loading the Fetch Dataset

This method loads data from Machbase Neo in chunks of a fixed buffer size.
Pros
- The model can be trained regardless of how large the data is.
Cons
- Training takes longer than with Method 1.

The entire code can be run through 18.NYC_Yellow_Taxi_Buffered_Fetch.
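
The idea behind buffered loading can be sketched with a small generator that pulls a fixed number of rows at a time from any row source, so that only one buffer is held in memory. This is a generic illustration, not the notebook's implementation; fake_row_source is a hypothetical stand-in for rows streamed from the database.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(rows: Iterable[T], buffer_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most buffer_size rows."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, buffer_size))
        if not batch:
            return
        yield batch


def fake_row_source(n: int) -> Iterator[int]:
    """Hypothetical stand-in for a database cursor that streams rows."""
    for i in range(n):
        yield i


for batch in iter_batches(fake_row_source(7), buffer_size=3):
    # A training step on this batch would go here.
    print(batch)
```

The last batch may be smaller than the buffer size, so training code that assumes a fixed batch shape needs to handle that case.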

6. Experimental Results

Method 1) Loading the Entire Dataset Result

Method 2) Loading the Fetch Dataset Result

Both methods achieve an R2 score above 0.9.
More precise data cleaning is expected to improve performance further.
If the data were collected directly with Machbase Neo, far less data cleaning effort would be needed than with the current data.
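
For reference, the two metrics reported above can be computed as in the sketch below. The sample values are made up for illustration and are not results from the experiment.

```python
from typing import Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """R2 = 1 - SS_res / SS_tot, where SS_tot is measured around the target mean."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Illustrative fares and predictions only.
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 39.0]
print(mean_squared_error(y_true, y_pred))  # 4.5
print(round(r2_score(y_true, y_pred), 3))  # 0.964
```

An R2 of 1.0 means the predictions match the targets exactly, while 0.0 means the model does no better than always predicting the mean fare.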


※ Various datasets and tutorial codes can be found in the GitHub repository below.