Providence's JUMP Bike program publishes the location of every currently available bike through its GBFS data feed. Using Python and AWS, you can approximate trip data from it. While the GBFS data comes directly from the company, it isn't meant to be used for trip data; however, it is currently the only data provided for Providence, and no official trip information is publicly available.

The GBFS data is updated every 60 seconds and shows the location of every bike assigned to the Providence region; each record contains bike_id, lat, lon, and battery_level. Trips are inferred like this: a bike sits at one location from, say, 8am to 9am. Once rented, the bike disappears from the feed; 20 minutes later it reappears somewhere else. That gap is considered a "trip".
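To make the snapshot structure concrete, here is an illustrative payload shaped like the feed described above. The field names follow this post; the live feed may include additional fields, and the values here are made up.

```python
import json

# Illustrative GBFS free_bike_status snapshot (sample values, not real data).
# Each snapshot is a full roster of the bikes available at one timestamp.
sample = json.loads("""
{
  "last_updated": 1573150921,
  "ttl": 60,
  "data": {
    "bikes": [
      {"bike_id": "abc123", "lat": 41.8240, "lon": -71.4128, "battery_level": 87},
      {"bike_id": "def456", "lat": 41.8268, "lon": -71.4025, "battery_level": 42}
    ]
  }
}
""")

timestamp = sample["last_updated"]
bikes = sample["data"]["bikes"]
```

A rented bike simply drops out of `data.bikes` in the next snapshot, which is what the trip inference below relies on.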

A few issues with this method are that it would:

  • include bikes gathered to be charged and returned somewhere else

    • these were excluded by removing any record where the charge increased by 10% or more (14% of records)
  • include bikes taken out for repairs and returned several days later

    • these were excluded by removing any trip over 2 hours (9% of records)
  • include bikes moved to other cities but still assigned to the Providence region

    • these were excluded by using a geographic shape of RI and keeping only the points inside it
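The three exclusions above can be sketched as pandas filters. This is a minimal sketch, assuming a hypothetical `trips` frame with start/end battery, timestamp, and coordinate columns; the post used an actual geographic shape of RI, while this sketch substitutes a rough bounding box to stay dependency-free.

```python
import pandas as pd

# Hypothetical inferred-trip records (made-up sample values).
trips = pd.DataFrame({
    "start_battery": [80, 30, 90, 70],
    "end_battery":   [75, 95, 85, 65],
    "start_time": pd.to_datetime(["2019-11-07 08:00", "2019-11-07 09:00",
                                  "2019-11-07 10:00", "2019-11-07 11:00"]),
    "end_time":   pd.to_datetime(["2019-11-07 08:20", "2019-11-07 09:30",
                                  "2019-11-07 14:30", "2019-11-07 11:15"]),
    "end_lat": [41.82, 41.83, 41.82, 45.00],
    "end_lon": [-71.41, -71.40, -71.42, -93.00],
})

# 1. Charging runs: the battery level *increased* by 10% or more.
charged = trips["end_battery"] - trips["start_battery"] >= 10

# 2. Repair runs: the "trip" lasted over 2 hours.
too_long = trips["end_time"] - trips["start_time"] > pd.Timedelta(hours=2)

# 3. Relocations: end point falls outside a rough RI bounding box
#    (a stand-in for the state shape used in the real filtering).
in_ri = (trips["end_lat"].between(41.1, 42.1)
         & trips["end_lon"].between(-71.9, -71.1))

clean = trips[~charged & ~too_long & in_ri]
```

Only the first sample row survives all three filters; the others trip the charging, duration, and geography rules respectively.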

Scheduled AWS Lambda into S3 Bucket

I had some AWS credit from a while back that I'd been searching for a project to put toward, and figured this would be a good opportunity. I chose to use a Lambda script to feed the JSON data from the GBFS feed directly into an S3 bucket, storing it for future use. Here's a flowchart of what this process looks like.


The code is triggered every 5 minutes by a CloudWatch cron rule running 24/7. It only requires the Python standard library (plus boto3, which is preinstalled in the Lambda runtime), so there's no packaging involved. This is the code used; it requires you to set up IAM permissions allowing the Lambda function to put objects into your bucket.

import json
import boto3
import urllib.request

def collect_vehicles(vehicles_url, vehicle_type):

    # fetch & parse the GBFS feed
    r = urllib.request.urlopen(vehicles_url)
    vehicles_data = json.loads(r.read().decode('utf-8'))

    timestamp = vehicles_data['last_updated']

    # serialize payload & build a timestamped object key
    vehicles_data = json.dumps(vehicles_data, ensure_ascii=False)
    vehicles_file = f'{vehicle_type}/{vehicle_type}_{timestamp}.json'

    return vehicles_data, vehicles_file

def handler(event, context):

    bucket_name = '{YOUR BUCKET NAME}'
    bike_status = ''  # GBFS free bike status feed URL

    bikes_data, bikes_file = collect_vehicles(bike_status, 'bikes')

    # S3 connect
    s3 = boto3.resource('s3')

    # upload file
    s3.Bucket(bucket_name).put_object(Key=bikes_file, Body=bikes_data)
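For the IAM side, the execution role only needs permission to put objects into the target bucket. Below is a minimal sketch of such a policy document, written as a Python dict (the bucket name is a placeholder, and your role will also need the usual CloudWatch Logs permissions).

```python
import json

# Minimal sketch of an IAM policy granting s3:PutObject on the target bucket.
# '{YOUR BUCKET NAME}' is a placeholder, as in the Lambda code above.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::{YOUR BUCKET NAME}/*",
        }
    ],
}

policy_json = json.dumps(policy, indent=2)
```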

Lastly in the AWS process, you can easily retrieve all the files from your bucket using the AWS CLI:

aws s3 sync s3://{YOUR BUCKET NAME} {YOUR FILE PATH}

Making Trips from Data Collected

The first step in making trip data is to combine all your downloaded files into a pandas dataframe.

import glob
import json
import pandas as pd

frames = []
files = glob.glob('{YOUR FILE PATH}/*.json')
for file in files:

    # load file & get timestamp
    with open(file) as f:
        data = json.load(f)
    timestamp = data['last_updated']

    # flatten json data into table & add timestamp
    df = pd.json_normalize(data['data']['bikes'])
    df['timestamp'] = timestamp

    # add file to frames
    frames.append(df)

df = pd.concat(frames)

Next we need to order the data by bike_id so we can compare adjacent rows for changes, as well as convert the timestamp to a readable format.

df = df.sort_values(by=['bike_id', 'timestamp'], ascending=[False, True]).reset_index(drop=True)
df['timestamp'] = df['timestamp'].apply(lambda x: pd.Timestamp(x, unit='s', tz='US/Eastern'))

Now we can use pandas .shift() to compare each row to the one above it. Most importantly, we first make sure the bike_id is the same; then we use .ne() (not equal) to check whether lat OR lon has changed. This produces a True value every time the location has changed, which marks the "end" of a trip; then we can simply look for that True value and create the "start".

df['end'] = (df['bike_id'] == df['bike_id'].shift())  \
          & (df['lat'].ne(df['lat'].shift())          \
          | (df['lon'].ne(df['lon'].shift())))        \
          & (df['timestamp'].ne(df['timestamp'].shift()))

df['start'] = (df['bike_id'] == df['bike_id'].shift(-1)) \
            & (df['end'].shift(-1) == True)

df.loc[df['start'] == True, 'trip'] = 'start'
df.loc[df['end'] == True, 'trip'] = 'end'

However, there is an issue when a bike is re-rented within the 5-minute window, causing a single record to be both a start and an end location. To solve this, we create a copy of every row that needs to be duplicated and append it back into the dataframe.

df['dupe'] = (
         (df['bike_id'] == df['bike_id'].shift())
       & (df['bike_id'] == df['bike_id'].shift(-1))
       & (df['trip'] == 'end')
       & (df['trip'].shift(-1) == 'end')
)
dupe = df[df['dupe'] == True].copy()
dupe['trip'] = 'start'

# merge rows that needed to be duplicated
df = pd.concat([df, dupe], sort=False)

To create a unique ID number for each trip we can use a groupby with .cumcount()

# adding a trip_id to join start & stop data
df['trip_id'] = df.groupby(['trip']).cumcount() + 1

Finally, to create the actual trips, we remove all the extra fat (which is 90% of our rows) and pivot the dataframe on trip and trip_id.

trips = df[df['trip'].notnull()]
trips = trips.pivot(columns='trip', index='trip_id')

col_one = trips.columns.get_level_values(0).astype(str)
col_two = trips.columns.get_level_values(1).astype(str)
trips.columns = col_one + "_" + col_two

trips = trips.reset_index()

Now you end up with basic trip data for every time a bike has moved. The link below leads to the full code, which includes more, like filtering out the errors specified at the top of this post.
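Once pivoted, each row describes one trip, so simple stats fall out directly. Here is a sketch, assuming the flattened column names produced by the pivot above (e.g. lat_start / lat_end, timestamp_start / timestamp_end) and made-up sample values; note this gives straight-line distance, not the route actually ridden (a routing engine like Graphhopper, linked below, could estimate that).

```python
import math
import pandas as pd

# Hypothetical pivoted trips (sample values for illustration).
trips = pd.DataFrame({
    "timestamp_start": pd.to_datetime(["2019-11-07 08:00", "2019-11-07 09:00"]),
    "timestamp_end":   pd.to_datetime(["2019-11-07 08:20", "2019-11-07 09:05"]),
    "lat_start": [41.8240, 41.8300],
    "lon_start": [-71.4128, -71.4000],
    "lat_end":   [41.8268, 41.8310],
    "lon_end":   [-71.4025, -71.3990],
})

def haversine_km(lat1, lon1, lat2, lon2):
    # great-circle distance between two lat/lon points, in kilometres
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

# trip duration in minutes, and straight-line start-to-end distance
trips["duration_min"] = (trips["timestamp_end"]
                         - trips["timestamp_start"]).dt.total_seconds() / 60
trips["distance_km"] = trips.apply(
    lambda r: haversine_km(r.lat_start, r.lon_start, r.lat_end, r.lon_end),
    axis=1)
```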

Useful Links

Full Code - My Github

Graphhopper Routing Engine

Bird gbfs data

Lime gbfs data