Reinforcement Learning

Reference links:

Setup file:

YouTube link:

Install Sublime Text

  1. Quickly open a code folder

Since I use macOS and Linux, I usually run these commands in a terminal; Windows users can search for how to add the subl command to their PATH:

subl [folder_path]: Open folder with sublime text

subl [file_path]: Open file with sublime text

  2. Must-Have Plugins

2.1 Package Control: install this first so that you can search for and install packages directly from Sublime Text. To open the command palette, press Ctrl + Shift + P.

2.2 Emmet: supports super-fast HTML editing.

2.3 SideBar Enhancements: adds many useful file commands (new, rename, delete, open in browser, ...) to the sidebar menu.

2.4 GitGutter: shows markers in the gutter for lines that have been added, modified, or deleted compared to the Git repository.

2.5 DocBlockr: automatically generates standard comment blocks for your functions and classes.

2.6 SublimeCodeIntel: makes it easy to find where the functions, classes, etc. you are using are defined.

2.7 BracketHighlighter: makes it easy to see where a bracket or tag opens and closes.

2.8 AutoFileName: autocompletes file names from the current folder so you can embed files more simply.

2.9 ColorHighlighter: displays color values inline in CSS code.

  3. Shortcuts

3.1 Frequently used keyboard shortcuts

Shift + Alt + (1/2/3/4/5/8/9): Split the window into multiple panes (columns, a grid, or rows)


Shift + F11: Full screen


Ctrl + P: Quickly open a file


Ctrl + Shift + T: Reopen the most recently closed file.


Ctrl + Tab: Go to the most recently opened tab.


Alt + number: Go to tab by numbered order


Ctrl + PgUp/PgDown: Cycle through tabs in order


Ctrl + W: Close the current tab (exits Sublime Text when no tabs are left)


3.2 Shortcuts in 1 tab

Ctrl + F: Search

Ctrl + H: Search and Replace

Ctrl + Shift + K: Delete current line

Ctrl + Shift + D: Duplicate current line

Ctrl + Shift + ↑ (↓): Move the current line/selection up (down); indentation is adjusted automatically to match the surrounding brackets

Ctrl + /: Toggle line comment

Ctrl + Shift + /: Toggle block comment

Ctrl + R: List of functions.

Ctrl + K, U: Convert to uppercase

Ctrl + K, L: Convert to lowercase

Ctrl + X: Cut the current line (deletes it and copies it to the clipboard).


3.3 Navigation shortcuts

Ctrl + G <line number> : Move to line

Ctrl + P, then :<line number> : Move to line

Ctrl + D: Select the current word (press again to add the next occurrence)

Ctrl + M: Move to the nearest closing bracket

Ctrl + Shift + M: Highlight all content inside the current brackets.

Ctrl + Shift + Left Arrow: Extend the selection to the beginning of the word on the left.

Ctrl + Shift + Right Arrow: Extend the selection to the beginning of the word on the right.

Ctrl + L: Highlight the current line and move the cursor to the next line.


4. Configuration

To change settings such as the font size, tab width, and so on, go to Preferences -> Settings and edit the Preferences.sublime-settings – User file.

Reinforcement Learning w/ Python Tutorial

Welcome to a reinforcement learning tutorial. In this part, we’re going to focus on Q-Learning.

Q-Learning is a model-free form of machine learning, in the sense that the AI “agent” does not need to know or have a model of the environment that it will be in. The same algorithm can be used across a variety of environments.

For a given environment, everything is broken down into “states” and “actions.” The states are observations and samplings that we pull from the environment, and the actions are the choices the agent makes based on those observations. For the purposes of the rest of this tutorial, we’ll use the context of our environment to exemplify how this works.

While our agent doesn’t actually need to know anything about our environment, it would be somewhat useful for you to understand how it works in the context of learning how Q-learning works!

We’re going to be working with OpenAI’s gym, specifically with the “MountainCar-v0” environment. To get to the gym, just do a pip install gym.

Okay, now let’s check out this environment. Most of these basic gym environments are very much the same in the way they work. To initialize the environment, you do a gym.make(NAME), then you env.reset the environment, then you enter into a loop where you do an env.step(ACTION) every iteration. Let’s poke around this environment:
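A minimal sketch of that setup, assuming the classic gym API (before gym 0.26, where reset() returns only the observation; newer versions return an extra info value):

```python
import gym

# Create and reset the MountainCar environment.
env = gym.make("MountainCar-v0")
env.reset()

# Query how many discrete actions env.step() will accept.
print(env.action_space.n)
```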

For the various environments, we can query them for how many actions/moves are possible. In this case, there are “3” actions we can pass. This means, when we step the environment, we can pass a 0, 1, or 2 as our “action” for each step. Each time we do this, the environment will return to us the new state, a reward, whether or not the environment is done/complete, and then any extra info that some envs might have.

It doesn’t matter to our model, but, for your understanding, a 0 means push left, 1 is stay still, and 2 means push right. We won’t tell our model any of this, and that’s the power of Q learning. This information is basically irrelevant to it. All the model needs to know is what the options for actions are, and what the reward of performing a chain of those actions would be given a state. Continuing along:
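A hedged sketch of the basic episode loop, again assuming the classic gym API (gym 0.26+ returns five values from step() instead of four):

```python
import gym

env = gym.make("MountainCar-v0")
env.reset()

done = False
while not done:
    action = 2  # always push right, just to watch an episode play out
    new_state, reward, done, info = env.step(action)
env.close()
```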

How will Q-learning do that? So we know we can take 3 actions at any given time. That’s our “action space.” Now, we need our “observation space.” In the case of this gym environment, the observations are returned from resets and steps. For example:
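With the classic gym API (pre-0.26), printing the reset value shows the starting observation directly; newer versions return an (observation, info) tuple instead:

```python
import gym

env = gym.make("MountainCar-v0")
# The starting observation: [position, velocity].
print(env.reset())
```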

Will give you something like [-0.4826636 0. ], which is the starting observation state. While the environment runs, we can also get this information:
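A sketch of printing that information every step, under the same classic-gym assumption:

```python
import gym

env = gym.make("MountainCar-v0")
env.reset()

done = False
while not done:
    # Step with a fixed action and print the reward and the new observation.
    new_state, reward, done, _ = env.step(2)
    print(reward, new_state)
env.close()
```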

At each step, we get the new state, the reward, whether or not the environment is done (either we beat it or exhausted our limit of 200 steps), and then a final “extra info” is returned, but, in this environment, this final return item is not used. Gym throws it in there so we can use the same reinforcement learning programs across a variety of environments without the need to actually change any of the code.


In our case, we can query the environment to find out the possible ranges for each of these state values:
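For example, the observation-space bounds can be printed directly (the values shown in the comments are the ones MountainCar-v0 normally reports):

```python
import gym

env = gym.make("MountainCar-v0")
print(env.observation_space.high)  # [0.6   0.07]
print(env.observation_space.low)   # [-1.2  -0.07]
```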

We’ll use 20 groups/buckets for each range. This is a variable you might decide to tweak later.
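With 20 buckets per dimension, the per-bucket window size falls out of the observation-space bounds; a sketch (variable names are illustrative):

```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")

# 20 buckets for each observed value (position and velocity).
DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)
discrete_os_win_size = (env.observation_space.high
                        - env.observation_space.low) / DISCRETE_OS_SIZE
print(discrete_os_win_size)  # [0.09  0.007]
```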

So this tells us how large each bucket is, basically how much to increment the range by for each bucket. We can build our q_table now with:
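One common way to build it: a random table with one entry per (position bucket, velocity bucket, action). Initializing in [-2, 0) is a reasonable choice here since every step's reward is -1, though other ranges work too:

```python
import gym
import numpy as np

env = gym.make("MountainCar-v0")

DISCRETE_OS_SIZE = [20] * len(env.observation_space.high)

# 20 x 20 x 3 table: a Q value for every discrete state and action.
q_table = np.random.uniform(low=-2, high=0,
                            size=(DISCRETE_OS_SIZE + [env.action_space.n]))
print(q_table.shape)  # (20, 20, 3)
```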

Which is what we’ll be talking about in the next tutorial!

User manual

Step 1: Import the required libraries
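The imports this kind of graph-based Q-learning bot would plausibly need (numpy for the R and Q matrices, networkx for the graph, matplotlib for drawing; the original may have used pylab instead):

```python
import numpy as np               # reward (R) and quality (Q) matrices
import networkx as nx            # building and drawing the graph
import matplotlib
matplotlib.use("Agg")            # headless backend: save figures instead of showing them
import matplotlib.pyplot as plt  # visualization
```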

Step 2: Define and visualize the graph
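A sketch of this step. The document does not include the edge list, so the one below is hypothetical, chosen only to be consistent with the path [0, 1, 3, 9, 10] reported in Step 5:

```python
import networkx as nx
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Hypothetical 11-node edge list (nodes 0-10); node 10 is the goal.
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3),
         (9, 10), (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]

G = nx.Graph()
G.add_edges_from(edges)
pos = nx.spring_layout(G)  # random node positions, which is why each run looks different
nx.draw_networkx(G, pos)
plt.savefig("graph.png")
```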

Note: The graph you get may differ from the one above when you re-run the code, because NetworkX places the nodes at random positions when drawing the given edges.

Step 3: Define the rewards the system gives the bot
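A common encoding for this step, sketched under the same hypothetical 11-node edge list (the node count and edges are assumptions, chosen to match the path [0, 1, 3, 9, 10] reported in Step 5):

```python
import numpy as np

MATRIX_SIZE = 11
goal = 10
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3),
         (9, 10), (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]

# -1 marks "no direct edge", 0 marks a traversable edge, 100 rewards reaching the goal.
R = np.ones((MATRIX_SIZE, MATRIX_SIZE)) * -1
for a, b in edges:
    R[a, b] = 100 if b == goal else 0
    R[b, a] = 100 if a == goal else 0
R[goal, goal] = 100  # staying at the goal also pays off
print(R)
```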

Step 4: Define some utility functions to be used during training
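The usual helpers for this kind of Q-matrix training look something like the sketch below (the function names and the gamma value are illustrative; the document does not state them):

```python
import numpy as np

gamma = 0.8  # discount factor (a typical choice)

def available_actions(state, R):
    # Reachable neighbours: entries of the reward row that are not -1.
    return np.where(R[state] >= 0)[0]

def sample_next_action(actions):
    # Exploration: pick uniformly among the allowed moves.
    return int(np.random.choice(actions))

def update(state, action, R, Q, gamma):
    # Q-matrix update: immediate reward plus discounted best value of the
    # state we land in (the landed-in state is identified with `action`).
    Q[state, action] = R[state, action] + gamma * np.max(Q[action])
    return Q
```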

Step 5: Train and evaluate the bot using the Q-matrix
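A self-contained sketch of the training loop, assuming a hypothetical 11-node edge list consistent with the path the document reports; node 10 is the goal and gamma = 0.8 is an assumed discount factor:

```python
import numpy as np

np.random.seed(1)  # make the run repeatable

# Hypothetical graph, consistent with the reported path [0, 1, 3, 9, 10].
edges = [(0, 1), (1, 5), (5, 6), (5, 4), (1, 2), (1, 3),
         (9, 10), (2, 4), (0, 6), (6, 7), (8, 9), (7, 8), (1, 7), (3, 9)]
goal, n, gamma = 10, 11, 0.8

# Reward matrix: -1 = no edge, 0 = edge, 100 = edge into the goal.
R = np.ones((n, n)) * -1
for a, b in edges:
    R[a, b] = 100 if b == goal else 0
    R[b, a] = 100 if a == goal else 0
R[goal, goal] = 100

Q = np.zeros((n, n))

# Train: sample random (state, action) pairs and apply the Q-matrix update.
for _ in range(10000):
    state = np.random.randint(0, n)
    actions = np.where(R[state] >= 0)[0]   # reachable neighbours
    action = int(np.random.choice(actions))
    Q[state, action] = R[state, action] + gamma * Q[action].max()

# Evaluate: greedily follow the largest Q value from node 0 to the goal.
state, path = 0, [0]
while state != goal:
    state = int(np.argmax(Q[state]))
    path.append(state)
print("Most efficient path:", path)
```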

Most efficient path: [0, 1, 3, 9, 10]

Now let's take this bot to a more realistic environment. Imagine that the bot is a detective trying to find the location of a large drug racket. The detective reasons that the sellers will not sell their product in places the police are known to frequent, and that drug traces tend to turn up near the seller's location. The sellers also leave traces of their product where they sell, and these can help the detective find them. We want to train the bot to find the location using these environmental clues.


Step 6: Define and visualize the new graph with environmental clues

Note: The graph above may look a bit different from the previous one, but they are in fact the same graph; the difference comes from NetworkX placing the nodes at random positions.

Step 7: Define some utility functions for the training process

Step 8: Visualize the environment matrix

Step 9: Train and evaluate the model
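Steps 6 through 9 follow the same pattern as Steps 2 through 5, with the rewards reshaped by the clues. As a rough, entirely hypothetical sketch of how such an environment matrix could be encoded (the clue nodes below are invented for illustration; the document does not say which nodes carry clues):

```python
import numpy as np

police_nodes = [7]    # assumed nodes the police frequent
trace_nodes = [3, 9]  # assumed nodes where drug traces were found

n = 11
env_matrix = np.zeros((n, n))
for node in police_nodes:
    env_matrix[:, node] -= 10.0  # penalize moving toward police-frequented nodes
for node in trace_nodes:
    env_matrix[:, node] += 10.0  # reward following the trace of the product

# The clue-aware reward is then R + env_matrix; the training loop from Step 5
# stays the same, so the bot learns to avoid the police and follow the traces.
print(env_matrix[0])
```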