(AI)mbot: Real-Time Object Detection for CS:GO

Ari Takvorian
12 min read · May 3, 2021

By: Santosh Balachandra, Julian Fritz, Chris Karouta, Victor Lai, Pierce Nguyen, and Ari Takvorian

Background

Today, image-recognition accuracy and speed are better than ever. One open source project at the forefront of this field is YOLOv5. Shown below is a comparison of YOLOv5’s small, medium, large, and extra large pretrained models with EfficientDet, another object detection model released in 2020.

Source: YOLOv5 Github

Even with just one year separating the two, YOLOv5 is substantially faster across all model sizes, and its larger models are also more accurate. Below is a table comparing the models in more detail, covering model size, speed, and accuracy from top to bottom. YOLOv5 itself is explained in depth further below.

Introduction

With the advent of competitive video games and esports, many players have tried to gain an edge on their competition by using ‘aimbots’. An ‘aimbot’ is software designed to run alongside first person shooter (FPS) games and make gameplay easier for players. Although the way different aimbots interface with games can vary, the general idea behind aimbots is automated aiming. When moving around a virtual game-world, aimbots can identify the positions of players in the vicinity and position the cursor automatically on top of the other player(s). Some aimbots even have functionality that will automatically shoot so that there is even less effort required of the user.

The very first aimbots used color recognition. Players were required to input a specific RGB color value (usually the color of enemy players). These were not always accurate, and players would find themselves aiming at the ground or their own teammates. Some aimbots also offer content hacks which allow players to modify game aspects like wall transparency or recoil. The most effective aimbots today are known as graphics driver aimbots. These aimbots hijack the information used by the rendering software to locate every player’s location in 3D space.

In almost every popular multiplayer FPS, the advantages aimbots offer are considered unfair. Players are quick to report suspected cheaters, and bigger games have software to detect aimbot usage. Although some aimbots have flaws built in to prevent detection (such as intentional missed shots), many of the aimbots mentioned above modify the game’s code directly, which is easily detectable. Although we do not condone cheating in video games, we have attempted to create an aimbot (that can go undetected) using machine learning.

Motivation

Because many popular aimbots work by modifying a game’s code, we wanted to make an aimbot that aims exactly like a human does: by looking at the screen. Ideally, running an object detection model on screenshots in real time would locate enemies while touching nothing that a human player wouldn’t touch, making the aimbot very difficult for the game to detect.

The figure above shows graphically why our system is nearly undetectable, but another benefit of our approach is its portability: it can be applied to almost any FPS game. Since the only input from the game is screen captures, our aimbot can theoretically work on any game in which enemies look like humans. Traditional aimbots have to be built specifically for each game.

But why do we want an undetectable wide-use aimbot anyway? It turns out that human reaction speed is pretty slow, much slower than a machine’s. The average simple reaction time (SRT) of a human is between 200 and 250 milliseconds, compared to the less than 25 milliseconds it takes YOLOv5’s largest model to process one image. This means that our aimbot could give a player up to ten times the reaction speed of everyone else, all else being ideal. That is an unbeatable advantage.

Goals

Our goal is to create an aimbot that uses YOLOv5 object detection to detect enemy players in real time and aim at them. For this to work, a few things need to happen. First, the object detection model must identify computer-generated humans from video games with decent accuracy. Second, it must do so quickly enough for live gameplay. An accurate model running at a high frame-rate would create an unstoppable aimbot.

For the first objective, our goal was for the object detection model to detect a nearby, unobstructed CS:GO player with at least 90% confidence. Confidence that high would indicate that CS:GO characters are human-like enough that the model can detect them about as well as it detects real humans in photos. The confidence for partially obstructed or distant characters may be significantly lower, but a 90% baseline should be enough to detect characters in most positions with decent confidence. Our plan was to use a pretrained YOLOv5 model to achieve this level of confidence; if its human detection could not handle CS:GO characters well enough, we would resort to training the model on CS:GO screen captures.

The second objective is even more important than the first, as an accurate model is useless to a real-time aimbot if it cannot perform in real time. Our goal is for the model to run at least 30 times per second so that we can match an acceptable video game frame-rate of 30 FPS. We mentioned earlier that an average human’s simple reaction time is 200–250 milliseconds, whereas YOLOv5 can run in under 25 milliseconds. However, our aimbot needs to do more than just run the model: it must capture the screen, move data, and apply the model, all within hardware limitations, including the fact that other GPU-intensive applications such as CS:GO are running at the same time. We hope to complete all of that in roughly 33 milliseconds to achieve around 30 FPS. Although 33 milliseconds is much quicker than human reaction time, we do not want the aimbot to be any slower, because it also has disadvantages relative to humans, such as occasionally failing to detect an enemy right away.

YOLOv5 — What It Is

Most modern object detection algorithms use Convolutional Neural Networks (CNNs), as they are the best-suited machine-learning models for image data. The two most common frameworks for building them are PyTorch, a Torch-based framework developed at Facebook, and the more popular TensorFlow, a creation of Google. Each framework works in a similar way, providing functions to create, fit, and test neural networks. However, since building and training a neural network completely from scratch can take hours, even on the fastest of GPUs, there are a variety of pretrained, publicly available object detection networks. These networks have been trained on millions of images and can identify dozens of different classes. Traditionally, detectors make two passes over an image: one to propose candidate regions, and a second to classify them. Single-shot detectors, like YOLO (You Only Look Once), look at each image only once, allowing for faster predictions (albeit at a slight cost in accuracy).

YOLOv5 was released in 2020, and advertises itself as the premier object detection network. The base models are trained on the COCO (Common Objects in COntext) dataset, which consists of over 330,000 images and 1,800,000 object instances. The ‘small’ version of YOLOv5 (YOLOv5s) is the most compact of these pretrained variants, and can detect the following classes:

Source: YOLOv5 Github

The beauty of pretrained models lies in their ability to identify objects present in any image. At their core, images are just variations of pixel data to which we’ve assigned classifications; the model has learned to make those assignments itself, and with extreme accuracy.

Source: Google Images

For the above picture of a person, the YOLOv5 model can predict the coordinates and class of the object in the image with the following simple lines of code:

Making prediction
Output Tensor and Class Prediction
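In code, this prediction takes just a few lines. A minimal sketch using YOLOv5’s torch.hub interface (with person.jpg as a hypothetical stand-in for the image above) looks like this:

```python
import torch

# Load the pretrained 'small' YOLOv5 model via torch.hub
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

# Run inference; the input can be a file path, URL, PIL image, or numpy array.
# 'person.jpg' is a stand-in for the image shown above.
results = model('person.jpg')

# Each row of results.xyxy[0] is one detection:
# [x1, y1, x2, y2, confidence, class]
print(results.xyxy[0])
results.print()  # human-readable summary, e.g. "1 person"
```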

As seen above, the YOLOv5 network classified the person as ‘person’ with 91% confidence. The returned ‘tensor’ contains the coordinates of the ‘bounding box’ that is drawn around the detected person.

YOLOv5 (and other object detection methods) can be trained to recognize custom objects (think facial recognition) by modifying the last layer of the neural network. However, for most use-cases (like ours), the COCO dataset classes are satisfactory.

YOLOv5 — How We Used It

The opposing players in CS:GO are computer-generated humans (as seen below), meaning that, for any given frame of gameplay, the YOLOv5 network can locate “people,” even though they aren’t actually real.

Gameplay footage

For the above sample image, the YOLOv5 network detected 12 objects:

  • 7 ‘persons’
  • 2 ‘traffic lights’
  • 2 ‘cars’
  • 1 ‘clock’

While most of the ‘people’ were classified with relatively low confidence (less than 50%), the two in the foreground of the image were detected with confidence upwards of 90%. Since the pretrained network could classify CS:GO characters reasonably well, we decided to use it as-is rather than fine-tune it on gameplay screenshots. This made the process much easier: fine-tuning would have required us to draw and label bounding boxes on a large set of training images and then retrain the last layer of the network, likely taking multiple hours.

After loading the model and configuring the program to grab screen captures at up to 120 FPS, each still frame is fed into the model. The model was set to look only for ‘persons’ (class 0) with a confidence above 60%, ensuring the aimbot only aimed at actual characters. Using the Windows API, we auto-move the mouse to the top of the bounding box, and, after tuning a constant offset, we lower the aim point enough to target (approximately) the head of the opponent. After moving the mouse, the program waits 0.05 seconds, clicks the mouse (shoots), and then releases.
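A rough sketch of this aiming step follows; HEAD_OFFSET is a hypothetical stand-in for our tuned constant, and detections are assumed to arrive as rows of [x1, y1, x2, y2, confidence, class] in screen coordinates:

```python
import time
import win32api
import win32con

CONF_THRESHOLD = 0.60  # act only on confident detections
HEAD_OFFSET = 0.15     # hypothetical constant: fraction of box height below the top edge

def aim_and_shoot(detections):
    """Aim at the first confident 'person' detection and fire once.

    `detections` is assumed to be a tensor of rows
    [x1, y1, x2, y2, confidence, class] in screen coordinates.
    """
    for x1, y1, x2, y2, conf, cls in detections.tolist():
        if int(cls) != 0 or conf < CONF_THRESHOLD:  # class 0 is 'person' in COCO
            continue
        # Aim slightly below the top of the box to approximate the head
        target_x = int((x1 + x2) / 2)
        target_y = int(y1 + HEAD_OFFSET * (y2 - y1))
        win32api.SetCursorPos((target_x, target_y))
        time.sleep(0.05)  # the brief pause described above
        win32api.mouse_event(win32con.MOUSEEVENTF_LEFTDOWN, 0, 0)
        win32api.mouse_event(win32con.MOUSEEVENTF_LEFTUP, 0, 0)
        break  # handle one target per frame
```

Note that games which capture the cursor may ignore SetCursorPos, in which case relative movement via mouse_event with MOUSEEVENTF_MOVE would be needed; the sketch shows the simplest form.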

Since the aiming and shooting are handled by the program, a user simply has to move their character around the map and let the aimbot rack up the kills.

YOLOv5 — How We Implemented It

Although the YOLOv5 model is pretrained, integrating it into our use-case is highly nontrivial due to extremely tight latency constraints. As shown before, in a vacuum, YOLOv5 can produce predictions in about 25 ms. In practice, however, we must also account for data movement and hardware limitations.

Pipelining screen data into Python is particularly difficult because most software packages are not designed to do this in real time. As a benchmark:

  • PIL ImageGrab: 1.01 seconds per image
  • PyScreeze (a dedicated screenshot library): 1 second per image
  • Python gtk: 2.34 seconds per image
  • Even the native win32gui module: 1.01 seconds per image

None of these is anywhere near fast enough for real-time inference and prediction.

To work around this, we leverage the Windows Desktop Duplication API, which provides access to frame-by-frame updates of a desktop window. This API utilizes Direct3D and DXGI system-level libraries to provide lightning fast real-time capture functionality to our program. For ease of implementation, we use a Python wrapper for the Windows Desktop Duplication API called d3dshot. With this pipeline, we’re able to achieve 20–40 FPS on data movement, which is acceptable.
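A minimal sketch of this capture pipeline, assuming d3dshot’s documented numpy interface, might look like:

```python
import d3dshot

# Create a capture instance that returns numpy arrays (cheap to feed to the model)
d = d3dshot.create(capture_output="numpy")

# Start capturing the primary display on a background thread
d.capture(target_fps=60)

# Elsewhere (e.g., in the inference loop), grab the newest available frame
frame = d.get_latest_frame()  # H x W x 3 array, or None before the first capture
```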

Hardware limitations are also challenging to deal with. In particular, our main development and benchmarking environment is a 2018 laptop with a GTX 1060 GPU connected to a 2K monitor. Although this is a moderately powerful laptop, consider the application: during a demo, we have to run (i) the video game itself, (ii) the image pipeline described above, (iii) the YOLOv5 real-time inference model via CUDA, and (iv) a second screen-capture process to record video for the demo. All four processes are highly GPU-intensive, and running them in parallel significantly increases latency. This reveals one interesting property of our system: the aimbot’s performance depends heavily on the underlying hardware. Because we achieved viable performance on this hardware-limited setup, we hypothesize that better hardware (such as an RTX 3070) could lead to competitive-level performance.

Once a prediction is made, we use win32api to send simple input macros (re-position mouse, press LMB, release LMB) as synthetic input events, replicating the behavior of a real human. To further reduce latency, the entire application is multithreaded: inference, display, and image pipelining each run on their own thread.
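A simplified sketch of this threading structure, reusing the hypothetical model, d, and aim_and_shoot names from the earlier sketches, might look like:

```python
import threading
import queue

# Hold only the newest frame; dropping stale frames keeps latency low
frames = queue.Queue(maxsize=1)

def capture_loop():
    # 'd' is the d3dshot instance from the capture sketch above
    while True:
        frame = d.get_latest_frame()
        if frame is None:
            continue
        if frames.full():
            try:
                frames.get_nowait()  # discard the stale frame
            except queue.Empty:
                pass
        frames.put(frame)

def inference_loop():
    # 'model' and 'aim_and_shoot' come from the earlier sketches
    while True:
        frame = frames.get()
        results = model(frame)
        aim_and_shoot(results.xyxy[0])

threading.Thread(target=capture_loop, daemon=True).start()
threading.Thread(target=inference_loop, daemon=True).start()
# (in the real program, the display thread and main loop keep the process alive)
```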

How The Model Performed

Gameplay with Aimbot enabled. Top-left shows the frame-by-frame bounding box drawings by the model

As shown, the aimbot consistently identifies other players accurately. The player maintains a high win rate with the aimbot’s assistance, which automatically shifts aim up a bit to land “headshots,” harder to aim but more effective. The drawn rectangles (rendered on a separate thread) exist only to help visualize what the bot is doing; this display lags noticeably behind the aimbot’s own speed, so the rectangles can look slightly off.
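For illustration, a display thread along these lines could be sketched with OpenCV (a hedged example, not necessarily the exact library the project used):

```python
import cv2

def draw_detections(frame, detections):
    """Draw bounding boxes on a BGR frame, for visualization only.

    This runs on its own thread, so the boxes can lag behind the
    aimbot's actual (faster) inference loop.
    """
    for x1, y1, x2, y2, conf, cls in detections.tolist():
        cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.putText(frame, f"person {conf:.2f}", (int(x1), int(y1) - 5),
                    cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.imshow("aimbot view", frame)
    cv2.waitKey(1)  # needed for the window to refresh
```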

Observe, for example, the scenes below, where the rectangle seems to be placed “in the future” on an image taken from the past: the box reflects a newer prediction drawn over an older frame.

Screenshot of gameplay with Aimbot enabled
Screenshot of gameplay with Aimbot enabled

Limitations and Future Improvements

There are many ways we could have approached this problem to improve the model’s performance on CS:GO. As mentioned previously, we could have trained it on images extracted from CS:GO gameplay. That way, instead of basing its predictions on real humans, which look slightly different from CS:GO characters, it would have a better understanding of CS:GO character pixel data. This would require us to build a large dataset of images from the game and draw and label bounding boxes around the characters, so it is a labor-intensive but promising direction for future improvements of the model.
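For reference, such a fine-tuning run could be invoked with the YOLOv5 repository’s training script, roughly as follows; csgo.yaml is a hypothetical dataset config pointing at our labeled screenshots:

```
# From a clone of the ultralytics/yolov5 repository.
# 'csgo.yaml' is a hypothetical dataset config listing our images and labels.
python train.py --img 640 --batch 16 --epochs 100 --data csgo.yaml --weights yolov5s.pt
```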

Another improvement could be identifying heads instead of bodies. Currently, the model identifies bodies, and we shift the mouse up within the bounding box to aim at the opponent’s head. This leaves room for error, since the head occupies only part of the upper region of the box, so some shots will miss. In most FPS games, including CS:GO, headshots deal far more damage than body shots, so a model that aims directly at heads would noticeably improve user performance.

Because the YOLOv5 model is a pre-trained model, we don’t have much flexibility in the model’s predictions. Namely, you may notice that sometimes the bot jitters to aim at allies (and even dead bodies!) during the gameplay demo. This detracts from the overall performance of the aimbot, particularly when aiming at large, clustered groups of opponents. We anticipate two solutions to this problem: either add an aimbot key that only activates when pressed (many traditional aimbots use this), or fine-tune the last deep layer of the model to not include false positives (allies, dead bodies, etc.).

Finally, we are currently using the smallest YOLOv5 model available. Using one of the larger pretrained variants would improve the model’s ability to recognize humans in varying environments. These larger models take more time to run, so if we had more computing power available, we would take this approach. All of these improvements could be applied to other FPS games as well, improving those users’ performance.

Other Applications

Since the model predicts where a person is on the screen in an FPS environment and places the bounding box accordingly, it could be extended to other FPS games as well, like Valorant, Call of Duty, and Battlefield. As mentioned previously, predictions would improve if the model were trained on images from the specific game it is used on, especially since character models, costumes, and general game aesthetics change from game to game. And since YOLOv5 can be trained to recognize custom objects, the model could potentially even be applied to games like Overwatch and Halo, where the characters look less human, by modifying the last layer of the neural network and training on images from those games.
