Track-before-detect (TBD) based on the particle filter (PF) is known for its outstanding performance in detecting and tracking weak targets. However, its heavy computational load makes real-time application difficult. To address this problem, an efficient implementation of PF-based TBD on graphics processing units (GPUs) is proposed in this article. By mapping the particle propagation and weight calculation steps onto the parallel architecture of the GPU, the running time of the algorithm can be greatly reduced. Simulation results for an infrared scenario and a radar scenario are presented, comparing implementations on two types of GPU card with a CPU-only implementation.
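To illustrate why these two steps suit a parallel architecture, the following minimal CPU-side sketch writes propagation and weight calculation as functions of a single particle, so that each call could in principle be assigned to one GPU thread. It is not the implementation described in this article: the constant-velocity motion model, the point-target Gaussian likelihood ratio, and all names and parameters (`propagate`, `weight`, `step`, `q`, `intensity`, `sigma`) are assumptions chosen for illustration.

```python
import math
import random

def propagate(particle, dt, q, rng):
    """Constant-velocity step for one particle (x, y, vx, vy).

    Assumed motion model; reads and writes no other particle.
    """
    x, y, vx, vy = particle
    vx += rng.gauss(0.0, q)  # assumed velocity process noise
    vy += rng.gauss(0.0, q)
    return (x + vx * dt, y + vy * dt, vx, vy)

def weight(particle, frame, intensity, sigma):
    """Likelihood ratio (target present vs. noise only) at the particle's cell.

    Assumes a point target adding `intensity` to a single cell and
    Gaussian noise with standard deviation `sigma`.
    """
    row, col = int(round(particle[1])), int(round(particle[0]))
    if not (0 <= row < len(frame) and 0 <= col < len(frame[0])):
        return 0.0  # particle has left the sensor frame
    z = frame[row][col]
    return math.exp(-intensity * (intensity - 2.0 * z) / (2.0 * sigma ** 2))

def step(particles, frame, dt=1.0, q=0.1, intensity=2.0, sigma=1.0, seed=0):
    """One filter step without resampling: propagate, weight, normalize."""
    rng = random.Random(seed)
    # Each list element depends on one particle only, so on a GPU
    # these two loops could become one thread per particle.
    moved = [propagate(p, dt, q, rng) for p in particles]
    raw = [weight(p, frame, intensity, sigma) for p in moved]
    # Normalization is the only part here that needs a cross-particle sum.
    total = sum(raw)
    if total == 0.0:
        return moved, [1.0 / len(moved)] * len(moved)
    return moved, [w / total for w in raw]
```

In this sketch only the normalizing sum couples the particles; a GPU version would typically handle it with a parallel reduction, and would also need per-thread random number generation in place of the single shared `rng` used here.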