Introduction

fingym makes no assumptions about the structure of your agent and is compatible with any numerical computation library, such as TensorFlow or Theano.

The agent sends actions to the environment, and the environment replies with observations and rewards (that is, a score).
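
A minimal sketch of this interaction loop, using the SPY-Daily-v0 environment from the examples below and the do-nothing action [0, 0]:

import fingym

env = fingym.make('SPY-Daily-v0')
observation = env.reset()

done = False
while not done:
    # [0, 0] is the do-nothing action; a real agent would choose based on the observation
    observation, reward, done, info = env.step([0, 0])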

Here we define a couple of agents to illustrate this concept and get you started.

Buy and hold agent

Our favorite holding period is forever - Warren Buffett

This code creates a SPY-Daily-v0 environment and an agent that buys as many shares as it can with the initial investment and holds them through the whole episode.

Run examples/agents/buy_and_hold_agent.py

The agent is very simple:

class BuyAndHoldAgent(object):
    def __init__(self, action_space):
        self.bought_yet = False
    
    def act(self, observation, reward, done):
        if not self.bought_yet:
            # buy as many shares as the cash in hand allows
            cash_in_hand = observation[1]
            close_price = observation[6]
            num_shares_to_buy = cash_in_hand / close_price
            print('will buy {} shares'.format(num_shares_to_buy))
            self.bought_yet = True
            
            # [1, n] buys n shares
            return [1,num_shares_to_buy]
        else:
            # [0, 0] does nothing for the rest of the episode
            return [0,0]

The first time this agent acts, it buys as many shares as it can based on the amount of cash in hand, then simply holds them through the rest of the episode.

The cash in hand and close price come from the observation object. You can read more about the observation (also referred to as the state) here.
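
A quick way to see what the agent receives is to print the first observation; in the agent above, index 1 holds the cash in hand and index 6 holds the close price (the full layout is described in the spaces documentation):

env = fingym.make('SPY-Daily-v0')
obs = env.reset()
print(obs)       # the full state vector for the first step
print(obs[1])    # cash in hand, as used by BuyAndHoldAgent
print(obs[6])    # close price, as used by BuyAndHoldAgent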

Run the episode:

env = fingym.make(args.env_id)
agent = BuyAndHoldAgent(env.action_space)

reward = 0
done = False

cur_val = 0

ob = env.reset()
initial_value = ob[1]

while True:
    action = agent.act(ob, reward, done)
    ob, reward, done, info = env.step(action)
    if done:
        cur_val = info['cur_val']
        break

print('initial value: {}'.format(initial_value))
print('final value: {}'.format(cur_val))

We run through the episode until done is true. Finally, we print the initial and final values. The results:

will buy 290.15784586815226 shares
initial value: 25000.0
final value: 92848.52999999987

Buying and holding more than triples the initial investment over the episode, an extremely successful strategy over a long period of time. Can we do better?

Random agent

Give a monkey enough darts and they'll beat the market. - Research Affiliates

This code creates a SPY-Daily-v0 environment and an agent that takes random actions throughout the episode.

Run examples/agents/random_agent.py

This agent is even simpler than the buy and hold agent:

class RandomAgent(object):
    def __init__(self, action_space):
        self.action_space = action_space
    
    def act(self, observation, reward, done):
        return self.action_space.sample()

For this agent, we run 100 episodes and record the minimum, average, and maximum final values to build a distribution; a single data point is not enough to evaluate this strategy. The more episodes we run, the better the distribution.

env = fingym.make(args.env_id)
agent = RandomAgent(env.action_space)

episode_count = 100
reward = 0
done = False

final_vals = []

initial_value = 0

for i in range(episode_count):
    ob = env.reset()
    initial_value = ob[1]
    while True:
        action = agent.act(ob, reward, done)
        ob, reward, done, info = env.step(action)
        if done:
            final_vals.append(info['cur_val'])
            break

max_value = max(final_vals)
min_value = min(final_vals)
avg_value = sum(final_vals)/len(final_vals)
print('initial value: {}'.format(initial_value))
print('min_value: {}, avg_value: {}, max_value: {}'.format(min_value,avg_value,max_value))

Results

initial value: 25000.0
min_value: 34758.73830000002, avg_value: 49393.79445500023, max_value: 75881.42879999822

Even random actions generate good returns in the long run; however, they are not as good as the buy and hold strategy.

Deep Q-learning

This code creates a SPY-Daily-v0 environment and a deep reinforcement learning agent based on Q-learning and the Bellman equation. This section is not an introductory course on Q-learning; instead, it focuses on the implementation of a deep Q-learning agent for the fingym environment.

This agent requires TensorFlow. As of Feb. 12, 2020, we recommend the nightly build due to a memory leak issue.

Run examples/agents/dqn_agent.py

The agent builds two identical networks, the Q network and the target network. Additionally, we apply an epsilon decay to reduce exploration as training cycles increase. Finally, a replay buffer is used for training.

We are using fully connected neural networks, and our model takes time-frame sequence data as input. For example, if we use the last 5 days of data, the input size is state_size*time_frame.
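
As a concrete sketch of how the windowed input is built (the sizes below are only for illustration; in practice the script uses env.state_dim - 1 for the state size):

import numpy as np
from collections import deque

state_size = 12    # illustrative value
time_frame = 5
window = deque(maxlen=time_frame)

# pretend we have already observed time_frame states
for _ in range(time_frame):
    window.append(np.random.rand(state_size))

# flatten the window into a single input row of size state_size * time_frame
x = np.reshape(np.array(window), (1, state_size * time_frame))
print(x.shape)    # (1, 60)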

class DQNAgent():

    def __init__(self, env_state_dim, time_frame, epsilon = 1, learning_rate=0.01, train = False):
        dirname = os.path.dirname(__file__)
        self.model_filepath = os.path.join(dirname,'weights.h5')
        self._state_size = env_state_dim
        self.trainMode = train

        # 0 - do nothing
        # 1 - buy w/ multiplier .33
        # 2 - buy w/ multiplier .5
        # 3 - buy w/ multiplier .66
        # 4 - sell w/ multiplier .33
        # 5 - sell w/ multiplier .5
        # 6 - sell w/ multiplier .66
        self._action_size = 7

        self.experience_replay = deque(maxlen=2000)

        self.gamma = 0.98
        if not self.trainMode:
            self.epsilon = 0
        else:
            self.epsilon = epsilon
        self.eps_decay = 0.995
        self.eps_min = 0.01

        self.max_shares_to_trade_at_once = 100

        # holds our last time_frame sequential state frames for prediction.
        self._time_frame = time_frame
        self.state_fifo = deque(maxlen=self._time_frame)

        # networks
        self.q_network = self._build_compile_model(learning_rate)
        self.target_network = self._build_compile_model(learning_rate)
        self._load_model_weights(self.q_network, self.model_filepath)
        self.align_target_model()

    def align_target_model(self):
        # reduce exploration rate as more
        # training happens
        if self.epsilon > self.eps_min:
            self.epsilon *= self.eps_decay
    
        print('epsilon: ', self.epsilon)

        self.target_network.set_weights(self.q_network.get_weights())
        self._save_model_weights(self.q_network, self.model_filepath)

    def _build_compile_model(self, learning_rate):
        '''
        Model taken from https://arxiv.org/pdf/1802.09477.pdf
        '''
        model = Sequential()

        # use a dense nn with inputs
        input_size = self._state_size * self._time_frame
        model.add(Dense(400, input_shape=(input_size,), activation='relu'))
        model.add(Dense(300, activation='relu'))
        #model.add(Dense(self._action_size, activation='tanh'))
        model.add(Dense(self._action_size, activation='linear'))
        model.compile(loss='mse', optimizer=Adam(learning_rate=learning_rate))
        print(model.summary())
        return model
    
    def store(self, state, action, reward, next_state, terminated):
        state = np.reshape(state, (self._state_size,self._time_frame))
        next_state = np.reshape(next_state, (self._state_size, self._time_frame))
        self.experience_replay.append((state, action, reward, next_state, terminated))

The output action size is 7. The model buys and sells shares in batches of max_shares_to_trade_at_once*multiplier, as follows (a sketch of the helper that maps these outputs to environment actions appears after the list):

Action space output:

0 - do nothing
1 - buy w/ multiplier .33
2 - buy w/ multiplier .5
3 - buy w/ multiplier .66
4 - sell w/ multiplier .33
5 - sell w/ multiplier .5
6 - sell w/ multiplier .66
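
The helper that maps a discrete network output to an environment action is not shown above. A minimal sketch of what _nn_action_to_env_action might look like, assuming the environment encodes actions as [action_type, num_shares] as in the buy and hold agent, with 1 for buy and, by assumption here, 2 for sell:

    def _nn_action_to_env_action(self, nn_action):
        # hypothetical sketch: 0 = do nothing, 1-3 = buy, 4-6 = sell,
        # sized as max_shares_to_trade_at_once * multiplier
        multipliers = {1: 0.33, 2: 0.5, 3: 0.66, 4: 0.33, 5: 0.5, 6: 0.66}
        if nn_action == 0:
            return [0, 0]
        num_shares = int(self.max_shares_to_trade_at_once * multipliers[nn_action])
        if nn_action <= 3:
            return [1, num_shares]   # buy
        return [2, num_shares]       # sell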

The act function only makes a prediction once state_fifo is full (giving an input of size state_size*time_frame), and chooses between exploring and exploiting based on the epsilon value:

    def act(self, state):
        self.state_fifo.append(state)

        # do nothing for the first time frames until we can start the prediction
        if len(self.state_fifo) < self._time_frame:
            # Our environment takes a tuple for action https://entrpn.github.io/fingym/#spaces
            return np.zeros(2)
        
        # epsilon decays over time
        if np.random.rand() <= self.epsilon:
            return self._random_action()

        state = np.array(list(self.state_fifo))
        state = np.reshape(state,(self._state_size*self._time_frame,1))

        q_values = self.q_network.predict_on_batch(state.T)
        env_action = self._nn_action_to_env_action(np.argmax(q_values[0]))
        return env_action

Finally, the network is retrained with stochastic gradient descent:

    def retrain(self, batch_size):
        if not self.trainMode:
            return
        
        minibatch = random.sample(self.experience_replay, batch_size)

        for state, action, reward, next_state, terminated in minibatch:
            state = np.reshape(state,(self._state_size * self._time_frame,1))
            next_state = np.reshape(next_state, (self._state_size * self._time_frame, 1))
            target = np.array(self.q_network.predict_on_batch(state.T))

            if terminated[-1]:
                target[0][np.argmax(self._env_action_to_nn_action(action))] = reward[-1]
            else:
                t = np.array(self.target_network.predict_on_batch(next_state.T))
                target[0][np.argmax(self._env_action_to_nn_action(action))] = reward[-1] + self.gamma * np.amax(t)
            
            self.q_network.fit(state.T, target, epochs=1, verbose = 0)
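
To make the update concrete, here is the Bellman backup from the loop above on made-up numbers (the values are purely illustrative):

import numpy as np

gamma = 0.98
reward = 1.0
q_next = np.array([0.2, 0.7, -0.1])   # target network's estimates for next_state
target_value = reward + gamma * np.amax(q_next)
print(target_value)                   # 1.686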

The full training loop takes care of filling up the replay buffer with the time frame data.
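
The loop relies on a reset_timeframe helper defined elsewhere in the example script. A plausible sketch, assuming it simply allocates zeroed buffers with one slot per time frame:

import numpy as np

def reset_timeframe(time_frame, state_size):
    # hypothetical sketch: one pre-allocated buffer per field of a replay entry
    s_timeframe = np.zeros((time_frame, state_size))
    r_timeframe = np.zeros(time_frame)
    ns_timeframe = np.zeros((time_frame, state_size))
    d_timeframe = np.zeros(time_frame, dtype=bool)
    return s_timeframe, r_timeframe, ns_timeframe, d_timeframe

With that in mind, the main block looks like this: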

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=None)
    parser.add_argument('env_id', nargs='?', default='SPY-Daily-v0', help='Select the environment to run')
    args = parser.parse_args()

    train = True
    if train:
        rang = 100
    else:
        rang = 1

    # collect the last 10 time frames (10 days for daily env) and use that to make a prediction for current action
    time_frame = 10
    time_frame_counter = 0

    # train on this batch size
    batch_size = 32

    env = fingym.make(args.env_id)
    # removing time element from state_dim since I'm creating a sequence via time_frame
    state_size = env.state_dim - 1
    print('state_size: ', state_size)
    agent = DQNAgent(state_size, time_frame, train = train)

    for i in range(rang):

        # init our env
        state = env.reset()
        # remove time element
        state = np.delete(state, 2)
        done = False

        # init our timeframe
        s_timeframe, r_timeframe, ns_timeframe, d_timeframe = reset_timeframe(time_frame, state_size)

        # align the target network every align_every_itt training iterations
        align_every_itt = 15
        align_counter = 0

        while not done:
            action = agent.act(state)
            next_state, reward, done, info = env.step(action)
            print('action: ', action)
            print('reward: ', reward)

            # remove time element
            if len(state) > state_size:
                state = np.delete(state, 2)
            next_state = np.delete(next_state, 2)

            if time_frame_counter >= time_frame:
                agent.store(s_timeframe, action, r_timeframe, ns_timeframe, d_timeframe)
                s_timeframe[:-1] = s_timeframe[1:]
                r_timeframe[:-1] = r_timeframe[1:]
                ns_timeframe[:-1] = ns_timeframe[1:]
                d_timeframe[:-1] = d_timeframe[1:]
                time_frame_counter-=1
            
            s_timeframe[time_frame_counter] = state
            r_timeframe[time_frame_counter] = reward
            ns_timeframe[time_frame_counter] = next_state
            d_timeframe[time_frame_counter] = done
            time_frame_counter+=1

            if len(agent.experience_replay) > batch_size:
                print('retrain')
                agent.retrain(batch_size)

            if align_counter >= align_every_itt:
                print('align target model')
                agent.align_target_model()
                align_counter = 0
                print(info)
            
            state = next_state
            align_counter+=1

Results

Unfortunately this agent has difficulty finding a good policy and ultimately gets stuck taking a single action every time.

One way to improve this agent is to give it more context via feature extraction, for example additional inputs such as news sentiment or the highest and lowest prices over some time period. But feature extraction takes effort, so we move on from this agent to try other strategies.

Deep neuroevolution

This code creates a SPY-Daily-v0 environment and hundreds of deep neural network agents with weights randomly generated from uniform distributions. Agents are compared to one another to find the fittest (those that obtained the highest rewards). All but the top N agents are discarded, and the top N are used to repopulate the population, much like asexual reproduction. Children contain mutations so that they aren't identical to their parents. The process is repeated across many generations until the fittest agents are obtained. The approach is based on the following paper.

This agent requires TensorFlow. As of Feb. 12, 2020, we recommend the nightly build due to a memory leak issue.

Two techniques are provided: asexual and sexual reproduction. The only difference is that sexual reproduction uses crossover (genes from both parents).

Run examples/agents/evolutionary_agent.py

Run examples/agents/evolutionary_agent_w_crossover.py

The training loop creates 400 agents, runs each of them through the environment multiple times, and takes the average score. The top 20 agents are selected to repopulate the next generation. Some mutation and crossover is applied for diversity, using normally distributed random values; sketches of both operations are shown below. As a note, the top 20 agents are carried over into the new generation to ensure that mutation doesn't produce a worse population.
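
The mutation and crossover helpers live in the example scripts. Minimal sketches of the two operations, working on lists of weight arrays (the function names and the 0.02 mutation power are illustrative, not necessarily what the scripts use):

import numpy as np

def mutate_weights(weights, mutation_power=0.02):
    # add small normally distributed noise to every weight array
    return [w + mutation_power * np.random.randn(*w.shape) for w in weights]

def crossover_weights(weights_a, weights_b):
    # for each parameter, randomly inherit the gene from one of the two parents
    child = []
    for wa, wb in zip(weights_a, weights_b):
        mask = np.random.rand(*wa.shape) < 0.5
        child.append(np.where(mask, wa, wb))
    return child

# usage sketch: child.model.set_weights(mutate_weights(parent.model.get_weights()))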

The main loop looks like this:

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=None)
    parser.add_argument('env_id', nargs='?', default='SPY-Daily-v0', help='Select the environment to run')
    args = parser.parse_args()

    env = fingym.make(args.env_id)

    # removing time element from state_dim
    state_size = env.state_dim - 1
    print('state_size: ', state_size)

    time_frame = 30
    num_agents = 400

    agents = create_random_agents(num_agents, state_size, time_frame)

    # first agent gets saved weights
    dirname = os.path.dirname(__file__)
    weights_file = os.path.join(dirname, 'evo_weights.h5')
    if os.path.exists(weights_file):
        print('loading existing weights')
        agents[0].model.load_weights(weights_file)

    # how many top agents to consider as parents
    top_limit = 20

    # run evolution until x generations
    generations = 1000

    elite_index = None

    for generation in range(generations):
        rewards = run_agents_n_times(env,agents,3) # average of x times

        # sort by rewards
        sorted_parent_indexes = np.argsort(rewards)[::-1][:top_limit]

        top_rewards = []
        for best_parent in sorted_parent_indexes:
            top_rewards.append(rewards[best_parent])
        
        print("Generation ", generation, " | Mean rewards: ", np.mean(rewards), " | Mean of top 5: ",np.mean(top_rewards[:5]))
        print("Top ",top_limit," scores", sorted_parent_indexes)
        print("Rewards for top: ",top_rewards)

        children_agents, elite_index = return_children(env, agents, sorted_parent_indexes, elite_index)

        agents = children_agents

Results

It turns out this agent beats the buy and hold strategy. Remember that buy and hold gave us a final value of 92848 after 10 years.

The evolutionary agents beat it right from the start and improve over generations:

Generation  0  | Mean rewards:  57359.43331030005  | Mean of top 5:  99890.59519999995
Generation  1  | Mean rewards:  82531.03777689998  | Mean of top 5:  105624.93686000006
Generation  2  | Mean rewards:  86718.46527845002  | Mean of top 5:  107301.62140000006
Generation  3  | Mean rewards:  90612.0291022  | Mean of top 5:  108621.7784000001
Generation  4  | Mean rewards:  90634.85884410002  | Mean of top 5:  109371.26040000007

The best agents:

Elite selected with index  276  and score 105707.33
Elite selected with index  330  and score 106712.48000000003
Elite selected with index  180  and score 107936.29000000007
Elite selected with index  27  and score 110383.51000000029
Elite selected with index  399  and score 110383.51000000029

Upon closer inspection, this agent learns to buy early on and hold, so it is not much different from the buy and hold strategy. It seems to overfit the data, choosing the best point to buy and then holding.

Deep evolution

https://openai.com/blog/evolution-strategies/

This code creates a SPY-Daily-v0 environment and a single agent with randomly initialized weights. Optimization is a guess-and-check process: the agent's weights are updated via finite differences, and no backpropagation is needed. This approach is easy to parallelize and has lower computational requirements.
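
The core update is the evolution strategies estimator described in the OpenAI post linked above: sample Gaussian perturbations of the weights, score each perturbed candidate, and move the weights in the direction of the reward-weighted noise. A toy sketch on a one-dimensional objective (all numbers here are illustrative):

import numpy as np

def score(w):
    # toy objective: higher is better, maximum at w = 3
    return -(w - 3.0) ** 2

w = 0.0
sigma, learning_rate, population_size = 0.1, 0.01, 50

for _ in range(500):
    noise = np.random.randn(population_size)
    rewards = np.array([score(w + sigma * n) for n in noise])
    rewards = (rewards - rewards.mean()) / rewards.std()
    w += learning_rate / (population_size * sigma) * np.dot(noise, rewards)

print(w)   # ends up close to 3.0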

Run examples/agents/deep_evolution.py

This agent requires ray to run multiple agents in parallel and train faster.
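
The strategy below calls a reward_function that is defined in the example script as a ray remote function. A rough sketch of the idea, assuming it builds an agent from the candidate weights, runs one full episode, and returns the final portfolio value (build_agent_from_weights is a hypothetical placeholder for the script's own Model/Agent setup):

import ray
import fingym

ray.init()

@ray.remote
def reward_function(weights):
    # hypothetical sketch: evaluate one set of candidate weights on a full episode
    env = fingym.make('SPY-Daily-v0')
    agent = build_agent_from_weights(weights)   # placeholder, not part of fingym
    state = env.reset()
    done = False
    while not done:
        state, reward, done, info = env.step(agent.act(state))
    return info['cur_val']

# the strategy then scores a whole population in parallel:
# futures = [reward_function.remote(w) for w in candidate_weight_sets]
# rewards = ray.get(futures)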

The model is just an array of randomly initialized parameters and a function for forward propagation:

class Model:
    def __init__(self, input_size, layer_size, output_size):
        self.weights = [
            np.random.randn(input_size, layer_size),
            np.random.randn(layer_size, output_size),
            np.random.randn(layer_size, 1),
            np.random.randn(1, layer_size)
        ]
    
    def predict(self, inputs):
        feed = np.dot(inputs, self.weights[0]) + self.weights[-1]
        decision = np.dot(feed, self.weights[1])
        buy = np.dot(feed, self.weights[2])
        return decision, buy
    
    def get_weights(self):
        return self.weights
    
    def set_weights(self, weights):
        self.weights = weights

The agent contains a model and the strategy:

class Agent:
    def __init__(self, model, state_size, time_frame):
        self.model = model
        self.time_frame = time_frame
        self.state_size = state_size
        self.state_fifo = deque(maxlen=self.time_frame)
        self.max_shares_to_trade_at_once = CONFIG['max_shares_to_trade_at_once']
        self.des = Deep_Evolution_Strategy(self.model.get_weights())
    
    def act(self,state):
        self.state_fifo.append(state)
        # do nothing for the first time frames until we can start the prediction
        if len(self.state_fifo) < self.time_frame:
            return np.zeros(2)
        
        state = np.array(list(self.state_fifo))
        state = np.reshape(state,(self.state_size*self.time_frame,1))
        #print(state)
        decision, buy = self.model.predict(state.T)
        # print('decision: ', decision)
        # print('buy: ', buy)

        return [np.argmax(decision[0]), min(self.max_shares_to_trade_at_once,max(int(buy[0]),0))]
    
    def fit(self, iterations, checkpoint):
        self.des.train(iterations, print_every = checkpoint)

The strategy's job is to update the weights using the rewards obtained by the whole population of perturbed candidates.

class Deep_Evolution_Strategy:
    def __init__(self, weights):
        self.weights = weights
        self.population_size = CONFIG['population_size']
        self.sigma = CONFIG['sigma']
        self.learning_rate = CONFIG['learning_rate']
    
    def _get_weight_from_population(self,weights, population):
        weights_population = []
        for index, i in enumerate(population):
            jittered = self.sigma * i
            weights_population.append(weights[index] + jittered)
        return weights_population
    
    def get_weights(self):
        return self.weights

    def train(self,epoch = 500, print_every=1):
        for i in range(epoch):
            population = []
            rewards = np.zeros(self.population_size)
            for k in range(self.population_size):
                x = []
                for w in self.weights:
                    x.append(np.random.randn(*w.shape))
                population.append(x)
            
            futures = [reward_function.remote(self._get_weight_from_population(self.weights,population[k])) for k in range(self.population_size)]

            rewards = ray.get(futures)
            
            rewards = (rewards - np.mean(rewards)) / np.std(rewards)
            for index, w in enumerate(self.weights):
                A = np.array([p[index] for p in population])
                self.weights[index] = (
                    w + self.learning_rate / (self.population_size * self.sigma) * np.dot(A.T, rewards).T
                )
            
            if (i + 1) % print_every == 0:
                print('iter: {}. standard reward: {}'.format(i+1,ray.get(reward_function.remote((self.weights)))))

Results

This strategy returns exceptional results:

iter: 950. standard reward: 258715.4850000006
iter: 960. standard reward: 259706.73000000074
iter: 970. standard reward: 258416.72000000093
iter: 980. standard reward: 260159.72000000047
iter: 990. standard reward: 258908.7600000004
iter: 1000. standard reward: 259261.71000000075

The agent also learns to buy and sell and doesn't get stuck doing a single action.
