NYC Yellow Taxi Data Migration



Introducing the NYC Yellow Taxi Data Migration

1. Check raw data

The raw data needs basic cleaning before it can be migrated.
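A quick way to see what needs cleaning is to summarize each column. The sketch below is illustrative only and is not taken from convert.py: the helper name `summarize_raw` and the three-row sample are made up, and the column names are assumed to follow the public TLC yellow taxi schema.

```python
import pandas as pd


def summarize_raw(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, number of nulls, number of distinct values."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "nulls": df.isna().sum(),
            "unique": df.nunique(),  # NaN is not counted as a distinct value
        }
    )


# Hand-made sample standing in for the real file (assumed column names).
sample = pd.DataFrame(
    {
        "tpep_pickup_datetime": pd.to_datetime(
            ["2025-01-01 00:10:00", "2025-01-01 00:20:00", "2025-01-01 01:00:00"]
        ),
        "tpep_dropoff_datetime": pd.to_datetime(
            ["2025-01-01 00:25:00", "2025-01-01 00:20:00", "2025-01-01 03:00:00"]
        ),
        "trip_distance": [1.2, 0.0, None],
    }
)

summary = summarize_raw(sample)
print(summary)
```

Null counts, zero distances, and implausibly long trips found this way motivate the cleaning rules in the next step.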

2. Set the Preprocessing Plan

  • Data Cleaning.
    • Drop Unnecessary Columns.
    • Calculate the time difference and convert to minutes.
    • Remove cases where the trip duration is more than 1 hour or the distance is 0 ~ 60.
    • Remove the dropoff time.
  • Convert to the format: Name, Time, Value & Convert to UTC time.

Data Cleaning

  • Drop Unnecessary Columns.
  • Calculate the time difference and convert to minutes.
  • Remove cases where the trip duration is more than 1 hour or the distance is 0 ~ 60.
  • Remove the dropoff time.
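The four cleaning steps above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the code from convert.py: the column names follow the public TLC yellow taxi schema, the choice of which columns count as unnecessary is hypothetical, and the filter rule is read here as "keep trips of at most 60 minutes whose distance is greater than 0 and at most 60".

```python
import pandas as pd

PICKUP = "tpep_pickup_datetime"    # assumed TLC column name
DROPOFF = "tpep_dropoff_datetime"  # assumed TLC column name
DISTANCE = "trip_distance"         # assumed TLC column name


def clean_trips(df: pd.DataFrame, max_minutes: float = 60.0,
                max_distance: float = 60.0) -> pd.DataFrame:
    # 1. Drop unnecessary columns (here: everything except the three below).
    out = df[[PICKUP, DROPOFF, DISTANCE]].copy()
    # 2. Time difference between dropoff and pickup, in minutes.
    out["duration_min"] = (out[DROPOFF] - out[PICKUP]).dt.total_seconds() / 60.0
    # 3. Keep only trips within the assumed duration and distance limits.
    keep = (
        (out["duration_min"] <= max_minutes)
        & (out[DISTANCE] > 0)
        & (out[DISTANCE] <= max_distance)
    )
    # 4. The dropoff time is no longer needed once the duration exists.
    return out[keep].drop(columns=[DROPOFF]).reset_index(drop=True)


raw = pd.DataFrame(
    {
        "VendorID": [1, 2, 1, 2],
        PICKUP: pd.to_datetime(["2025-01-01 00:00:00", "2025-01-01 00:10:00",
                                "2025-01-01 00:20:00", "2025-01-01 00:30:00"]),
        DROPOFF: pd.to_datetime(["2025-01-01 00:15:00", "2025-01-01 01:40:00",
                                 "2025-01-01 00:30:00", "2025-01-01 00:50:00"]),
        DISTANCE: [2.5, 10.0, 0.0, 75.0],
    }
)

cleaned = clean_trips(raw)
print(cleaned)
```

In the sample, only the first trip survives: the second lasts 90 minutes, the third has zero distance, and the fourth exceeds the distance limit.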

Convert to the format: Name, Time, Value & Convert to UTC time

Once the data frame is restructured into the Name, Time, Value format with UTC timestamps, it is ready for upload to Machbase Neo.
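One possible way to do this restructuring with pandas is shown below. It is a sketch under assumptions, not the method used in convert.py: it assumes the pickup times are New York local time, uses each remaining column name as the Name value, and stores Time as epoch nanoseconds to match a nanosecond time format at import. The function name `to_name_time_value` is hypothetical.

```python
import pandas as pd

PICKUP = "tpep_pickup_datetime"  # assumed TLC column name


def to_name_time_value(df: pd.DataFrame, time_col: str = PICKUP,
                       source_tz: str = "America/New_York") -> pd.DataFrame:
    # Interpret naive timestamps as local time (assumption), then convert to UTC.
    # Ambiguous or nonexistent DST times become NaT and are dropped below.
    local = df[time_col].dt.tz_localize(source_tz, ambiguous="NaT", nonexistent="NaT")
    utc = local.dt.tz_convert("UTC")
    keep = utc.notna()
    # Epoch nanoseconds as plain integers.
    epoch = pd.Timestamp("1970-01-01", tz="UTC")
    time_ns = (utc[keep] - epoch) // pd.Timedelta(1, unit="ns")
    wide = df.loc[keep].drop(columns=[time_col]).assign(TIME=time_ns)
    # One row per (column, timestamp) pair: the column name becomes NAME.
    long_df = wide.melt(id_vars="TIME", var_name="NAME", value_name="VALUE")
    long_df = long_df.dropna(subset=["VALUE"])
    return (long_df[["NAME", "TIME", "VALUE"]]
            .sort_values(["NAME", "TIME"])
            .reset_index(drop=True))


cleaned = pd.DataFrame(
    {
        PICKUP: pd.to_datetime(["2025-01-01 00:00:00", "2025-01-01 00:30:00"]),
        "trip_distance": [2.5, 4.0],
        "duration_min": [15.0, 20.0],
    }
)

long_df = to_name_time_value(cleaned)
print(long_df)
```

The result can then be written as a gzip-compressed CSV with a header row, for example with `long_df.to_csv(path, index=False, compression="gzip")`, where `path` is whatever file name the import step expects.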

3. Data Upload

Finally, the data can be uploaded to Machbase Neo using the command below.

machbase-neo shell import --input ./datahub-2025-1-taxi.csv.gz --compress gzip --header --method append --timeformat ns taxi

The full conversion code is available in the machbase/datahub repository (main branch):

datahub/dataset/2025/01.NYC Yellow Taxi/conv/convert.py

4. Check the results after uploading

Run the following query in the Machbase Neo shell to check the uploaded data.

select * from v$taxi_stat;

※ The AI training process that follows this migration: NYC Taxi Data
