Abstract
This article introduces a novel Deep Reinforcement Learning (DRL)based approach for mapless navigation in Industrial Autonomous Mobile Robots, emphasizing advancements in generalization through Potential Risk State Augmentation (PRSA) and an adaptive safety optimization reward function. Traditional LiDARbased state representations often fail to capture environmental intricacies, leading to suboptimal performance. PRSA addresses this by improving the representation of highdimensional LiDAR data, focusing on essential riskrelated information to reduce redundancy and enhance the DRL agent’s generalization across various industrial settings. The adaptive reward function integrated with intrinsic reward mitigates the issue of sparse rewards in complex tasks, promoting faster learning and optimal policy convergence. Extensive experiments demonstrate that our method maintains a high success rate (over 90%) and low collision risk in narrow and dynamic environments compared to existing DRLbased methods. Meanwhile, compared with the classic navigation baseline, the proposed method improves the success rate by about 33% and reduces the mean navigation time by about 48% in realworld navigation tasks. The direct transfer of policies trained in simulations to realworld environments has demonstrated significant potential for enhancing both the efficacy and reliability of autonomous navigation.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
1 Introduction
Autonomous Mobile Robots (AMRs) have become pivotal in various industrial sectors, undertaking tasks ranging from material transport in manufacturing facilities to monitoring and inspection in hazardous environments. A critical capability of industrial AMRs is autonomous navigation, which demands high levels of adaptability and precision, especially in complex and dynamic industrial settings. Traditional navigation methods have predominantly utilized mapbased navigation techniques, leveraging Simultaneous Localization and Mapping (SLAM) [1] to create detailed environmental maps and path planning algorithms such as A* and Dynamic Window Approach (DWA) [2, 3] for trajectory optimization. Despite their widespread adoption, these approaches often struggle to cope with the intricacies and unpredictabilities inherent to these environments [4], where the high computational overhead for map updates and the need for realtime adaptability pose considerable challenges.
Recent advancements in Deep Reinforcement Learning (DRL) have opened new avenues for enhancing the autonomy of AMRs [5,6,7]. DRL’s capability to learn from environmental interactions promises significant improvements over traditional rulebased and pathplanning algorithms, particularly in unstructured or dynamic settings. However, deploying DRL in realworld navigation tasks introduces unique challenges, such as processing highdimensional sensory inputs like LiDAR data efficiently and devising reward functions that capture complex navigation objectives accurately. Traditional approaches to DRL state representation often do not fully leverage this data, resulting in navigation policies that may not optimally balance risk and efficiency [8, 9]. Moreover, designing reward functions in DRL is critical yet challenging due to the inherent sparsity of rewards in complex navigation tasks, often leading to prolonged learning periods or convergence to suboptimal policies [10, 11].
Industrial environments amplify these challenges due to their inherent complexity and the higher stakes involved in navigation tasks. For instance, the presence of moving machinery, temporary obstructions, and ongoing human activity requires AMRs to exhibit a high degree of adaptability and generalization capability. Moreover, stringent safety requirements demand navigation strategies that minimize collision risks and ensure the uninterrupted flow of operations. Additionally, the challenge of sparse reward functions becomes apparent in industrial settings where AMRs often receive feedback only upon the successful completion of a task. This sparsity can delay the development of efficient and reliable navigation solutions by hindering the robot’s ability to learn effective navigation strategies. Furthermore, the discrepancy between simulated training environments and realworld operational settings introduces additional challenges in simtoreal transfer. These domain differences can significantly impede the direct application of learned policies in real scenarios, necessitating robust and reliable navigation strategies.
In response to these challenges, we propose the Potential Risk State Augmentation (PRSA) method to enhance the AMR’s perception of environmental risks and improve decisionmaking under uncertainty. Traditional LiDARbased state representations, which often contain excessive redundant information due to highdimensional LiDAR observations, make it difficult to generalize strategies across varied environments. Our PRSA approach addresses this by efficiently encoding LiDAR data into a compact and informative state representation, focusing on crucial riskrelated information. Additionally, we introduce an adaptive safety optimization reward function, complemented by intrinsic rewards, to mitigate the issue of sparse rewards and encourage exploration of novel states. This innovative reward structure is designed to tackle the inherent challenges in training DRL agents for complex navigation tasks, where traditional reward signals may not provide sufficient guidance for effective learning. The key contributions of this paper are as follows:

1)
Introducing PRSA to efficiently process and represent highdimensional LiDAR data, thereby reducing state space complexity and enhancing the DRL agent’s generalization capabilities across diverse industrial scenarios.

2)
Developing an adaptive weight reward function shaped by collision risk and augmented with intrinsic rewards for state novelty, thereby enhancing model training efficiency and adaptability in environments characterized by sparse rewards.

3)
Conducting extensive experiments to validate the superior navigation performance and generalization capabilities of our proposed DRLbased mapless navigation approach, including successful policy transfers to realworld scenarios.
The remainder of this paper is organized as follows: Section 2 provides an overview of related works, emphasizing the evolution of DRLbased navigation systems and highlighting the limitations of existing methodologies. Section 3 presents an overview of the preliminary work. Section 4 details the proposed navigation scheme, including PRSA and the adaptive reward function. Section 5 presents our experimental setup, results, and analysis, showcasing the effectiveness of our approach. Finally, Section 6 concludes the paper with a summary of findings and directions for future research.
2 Related work
In this section, we outline the current advancements made in navigating complex tasks and explore related works focusing on learning and exploration in the domain of robot navigation.
2.1 Robot navigation in complex environments
Autonomous Mobile Robots (AMRs) in industrial settings have traditionally relied on mapbased navigation techniques to perform a broad array of operational tasks [12]. Foundational to these techniques are path planning [13], trajectory tracking [14], and obstacle avoidance algorithms [15, 16], which have established the groundwork for precise movement and trajectory optimization within structured environments. Nevertheless, their reliance on preexisting maps and static obstacle locations limits their applicability in dynamic and unpredictable industrial settings, necessitating more adaptive and responsive solutions [4].
To overcome the limitations of traditional methods, the application of DRL enables AMRs to dynamically learn and adjust navigation policies from environmental interactions [17]. Early DRL efforts explored visual navigation using RGB images [18], depthestimated images [19], and depth camera inputs [7]. Despite demonstrating the potential of DRL, these visualbased methods faced challenges, including high computational requirements and sensitivity to variable conditions like lighting or occlusions. In addition, in order to navigate through crowded environments safely, many studies incorporate pedestrian motion information as observations for DRL agents. Zhou et al. [20] introduced a social graphbased double dueling deep Qnetwork (SGD3QN) that incorporates a social attention mechanism to extract crowd features as the agent state, significantly enhancing the safety of the navigation process. Chen et al. [21] proposed leveraging graph convolutional network (GCN) to extract crucial pedestrian information for navigation, employing an attention mechanism for prediction. Similarly, Sun et al. [22] proposed using a risk function to evaluate the collision probability of pedestrians, and prioritize pedestrians with higher collision risks for obstacle avoidance. These agentlevel methods have one primary limitation: they heavily rely on pedestrian motion information and primarily address crowd navigation in open environments, overlooking the existence of static obstacle structures within the environment. Liu et al. [23] utilized static angle map to represent pedestrians and static obstacles, tackling this issue. However, this method relies on a pretrained feature extraction network to manage highdimensional input.
To address the need for improved generalization of navigation policies, attention has turned to effective state representation techniques, focusing on LiDAR data for its direct relevance and robustness to environmental changes. Pfeiffer et al. [24] employed convolutional neural networks (CNNs) to extract features from 2D laser scans, mapping these features to steering commands. Francis et al. [25] integrated probabilistic roadmaps (PRMs) with RL for navigating complex environments, utilizing direct LiDAR observations to enhance shortrange obstacle avoidance and longrange navigation with the help of additional global path planning. In order to mitigate the degradation of generalization performance caused by highdimensional LiDAR data, several studies have focused on extracting relevant features from laser scans to facilitate navigation across diverse environments. Pfeiffer et al. [26] introduced a method to compress LiDAR data through minimum pooling, simplifying DRL model state inputs while prioritizing crucial distance information from the nearest obstacles. Zhang et al. [9] proposed IPAPRec, a novel input preprocessing method that uses adaptively parametric reciprocal functions to process LiDAR readings, accentuating shortdistance information to improve the generalization performance of DRL agents and accelerate learning in unfamiliar scenarios. However, while sparse laser scanning can enhance generalization, it overlooks crucial information essential for effective obstacle avoidance in complex environments, thereby constraining the agent’s ability to navigate obstacles.
Our research introduces the Potential Risk State Augmentation method, optimizing highdimensional LiDAR data representation by enriching it with riskrelated information. This approach assesses threat levels based on the proximity and orientation of obstacles, providing a more nuanced understanding of the environment compared to existing methods.
2.2 Exploration and learning in robot navigation
Since DRL relies on trial and error for learning, some research has focused on improving training efficiency and policy effectiveness through techniques such as reward shaping [27, 28] and innovative data utilization [29, 30]. Jang et al. [31] addressed these challenges by introducing the Hindsight Intermediate Targets (HIT), which circumvents common navigation issues like dead ends and local minima by creating virtual intermediate targets based on recent local observations, showcasing an innovative use of data utilization techniques to guide navigation. Zhu et al. [32] proposed a hierarchical deep reinforcement learning (HDRL) framework that enhances sampling efficiency and simtoreal transfer capabilities. Miranda et al. [33] introduced a dense reward function that uses map information to encourage exploration and help avoid traps in chaotic environments. Shi et al. [6] utilized curiosity to enhance the agent’s exploration efficiency and employed it as an auxiliary reward to guide the agent in learning navigation tasks within discrete action spaces. Although the above methods can overcome the local optimal problems of exploration and navigation, they still have certain limitations when facing dynamic environments. To tackle this issue, Xie et al. [11] integrated velocity obstacles into the reward function to actively prevent collisions, thereby enhancing navigation safety in crowded pedestrian environments. However, this approach necessitates additional pedestrian location information as observations.
Our work introduces the Adaptive Safety Optimization Reward Shaping method, which stands distinct in its simplicity and effectiveness without complex models or extensive data techniques. It mitigates reward sparsity by shaping rewards and harmonizing goal achievement with obstacle avoidance through adaptive weighting based on collision risk. Additionally, we incorporate an intrinsic curiosity model to provide intrinsic rewards, enhancing the agent’s exploration ability in highdimensional state space. This method fosters strategy generalization across diverse environments and enhances the training efficiency of the model.
3 Preliminaries
3.1 Problem formulation
Consider a differentialdrive mobile robot navigating within a twodimensional environment, represented by its position \(\textbf{p}_{t}=[{p}_{t}^{x}, {p}_{t}^{y}]^{T} \in \mathbb {R}^{2}\) and orientation \(\vartheta _{t} \in \mathbb {R}\). This work aims to develop a navigation control policy \(\pi \), which effectively controls the robot’s linear velocity \(v_{t}\) and angular velocity \(w_{t}\), enabling it to reach a desired position \(\textbf{p}_{g}=[{p}_{g}^{x}, {p}_{g}^{y}]^{T} \in \mathbb {R}^{2}\) efficiently while avoiding collisions. The mapless navigation problem is modeled as a sequential decisionmaking process within the framework of reinforcement learning. The agent controlling the robot’s motion operates within a Markov Decision Process (MDP), represented by a tuple \((\mathcal {S}, \mathcal {A}, \mathcal {P}(s's,a), r(s, a), \gamma )\) where \(\mathcal {S}\) and \(\mathcal {A}\) represent the state and action spaces respectively. \(\mathcal {P}(s' s,a)\) describes the probability of transitioning from the current state \(s \in \mathcal {S}\) to the next state \(s' \in \mathcal {S}\) when action \(a \in \mathcal {A}\) is taken. The reward function r(s, a) and discount factor \(\gamma \in [0,1]\) define the immediate reward and the importance of future rewards respectively. The objective of reinforcement learning is to find an optimal policy \(\pi ^{*}\) that maximizes the expected return:
3.2 Traditional LiDAR state representation
LiDAR observations are crucial for environmental perception in robotic systems. These observations are typically preprocessed to extract navigationrelevant features, which serve as representational states for decisionmaking. Traditional LiDAR data representations involve vectors derived from either raw or preprocessed laser scans. The raw LiDAR measurements are denoted as \(\textbf{l}=[l_{1},l_{2},...,l_{n}]\in \mathbb {R}^{n}\), with each \(l_{i} \in [L_{min}, L_{max}]\) representing the ith measured distance, where \(L_{min}\) and \(L_{max}\) are the minimum and maximum measurable distances respectively.
To address the high dimensionality of LiDAR data, a common approach involves reducing these measurements to m values using minimum pooling [11, 34]:
where k is the sampling window size. Subsequent dimensionality reduction of laser scans often employs linear or nonlinear normalization methods for preprocessing. Linear normalization is typically performed by dividing each measured distance by \(L_{max}\), denoted as \(LN(y_{i}) = y_{i} / L_{max}\) [35, 36]. Nonlinear normalization transforms the compressed laser scans using the reciprocal of the distance [9, 37]:
where \(j_{i} \in (\infty , L_{min})\) is a parameter of the input preprocessing function. In this article, the linear normalization is denoted as \(LN(\varvec{y}) = [LN(y_{1}), LN(y_{2}),..., LN(y_{m})] \in \mathbb {R}^{m}\) and the reciprocal version as \(Rec(\varvec{y}) = [Rec(y_{1}), Rec(y_{2}),..., Rec(y_{m})] \in \mathbb {R}^{m}\).
3.3 Soft actor critic
The Soft Actor Critic (SAC) algorithm [38] is a deep reinforcement learning method based on the ActorCritic architecture, known for its effectiveness in handling continuous action spaces and robust exploration capabilities. SAC has demonstrated superior performance in mapless navigation tasks [39, 40] due to its incorporation of entropy as a regularization term in the objective function, which balances the expected cumulative reward with an entropy component to encourage policy exploration:
where \(\rho _{\pi }\) is the distribution of stateaction pairs under policy \(\pi \), and \(\alpha \) is the temperature coefficient that adjusts the weight of the entropy term \(\mathcal {H}(\pi (\cdot  s_{t}))\). This entropy term helps prevent premature convergence to suboptimal policies by promoting exploration of diverse actions. SAC employs dual critic networks (parameterized by \(\xi _{1}, \xi _{2}\)) to mitigate overestimation bias. The loss function for the Q network is defined as:
where \(\mathcal {T}=(s,a,r,s',d)\) is state transition sampled from minibatch \(\mathcal {B}\) and \(\overline{Q}(s, a)\) is the target Q value calculated as:
with \(\tilde{a}'\) sampled from the policy network \(\pi _{\zeta }(\cdot  s')\) (parameterized by \(\zeta \)). The loss function for the policy network is:
4 Method
The objective of mapless navigation is to navigate to a destination safely and efficiently based on observed sensor information, without relying on a global map. In our approach, a neural network guides the robot through unstructured environments by controlling its movements based on processed observational states. The framework of our deep reinforcement learningbased navigation scheme is depicted in Fig. 1. The observation values from laser scans undergo preprocessing through Potential Risk State Augmentation and dimensionality reduction before being used as inputs to the policy network for decisionmaking.
4.1 Potential risk state augmentation
We utilize LiDAR technology to perceive surrounding obstacles, a crucial component of autonomous navigation. To address the limitations of traditional LiDAR state representations, our method enhances the representation by weighting the potential risk of obstacles based on their direction. This is achieved by combining raw laser scans \(\varvec{l} = [l_{1}, l_{2},..., l_{n}] \in \mathbb {R}^{n}\) with corresponding angle information at time t, resulting in an augmented state representation. The exponential function is applied to process distance information, expanding the distribution range of shortdistance values, as demonstrated in [9]. Hence, the augmented LiDAR state representation is defined as:
where \(\alpha (\theta _{i})\) is a weight factor that evaluates the threat level of obstacles in various directions relative to the robot’s navigation path. \(\theta _{i}\) represents the direction of the \(i^{th}\) obstacle relative to the robot, and \(\eta \in (0,1)\) is the parameter of the distance processing function.
The weighting factor \(\alpha (\theta ^{i}_{t})\) minimizes redundant decisionmaking information by emphasizing obstacles that pose a direct threat to the navigation path. It is defined as a function of the obstacle’s angular position relative to the robot:
where \(\beta \) is a scaling parameter that adjusts the obstacle’s angular position relative to the robot into an appropriate range for the sine function. Specifically, \(\beta \) is inversely proportional to the field of view \(\varphi _{f}\) of the LiDAR, calculated as \(\beta =180/\varphi _{f}\). This scaling ensures that the entire field of view from 0 to \(\varphi _{f}\) degrees maps to the range of 0 to 180 degrees, aligning with the valid range of the sine function. This alignment causes the threat level to peak directly in front of the robot and symmetrically decrease on both sides. It ensures that the weight assigned to the obstacle directly in front of the differential drive robot should be prioritized, while secondary obstacle information from the sides can be appropriately disregarded. By adjusting \(\beta \), the robot’s perception system is finely tuned to enhance sensitivity to direct threats in front while filtering out less relevant obstacles on the periphery, thereby optimizing navigation efficiency and safety.
To further optimize processing and reduce input dimensionality, the augmented state is compressed into m values, representing the threat level posed by obstacles within these m regions. A higher state value indicates a greater level of threat from obstacles to the robot. The dimensionality reduction function is defined as follows:
where k is the sampling window size. The state of the LiDAR information at time t is then expressed as \(\varvec{z}=\left\{ z_{1}, z_{2},..., z_{m}\right\} \).
Our comparative analysis, illustrated in Fig. 2(d), demonstrates the effectiveness of our method compared to traditional preprocessing techniques, using eight evenly spaced laser scan values as an example. While linear normalization and reciprocal mapping highlight highthreat obstacles (represented as dark color areas) at close range, they fail to effectively differentiate varying threat levels across different directions (see Fig. 2(b) and (c)). In contrast, the Potential Risk State Augmentation method integrates both angle and distance information into the state, enhancing the agent’s environmental understanding. When an obstacle is both close and directly ahead, it is considered more threatening, prompting a heightened response from the agent (see Fig. 2(a)), while less threatening obstacles exert a lesser influence on decisionmaking.
4.2 Mapless navigation with augmented LiDAR state
4.2.1 State and action representation
In deep reinforcement learning, the importance of state representation is evident, as it translates environmental perception information into a format that is interpretable by the agent, directly impacting its decisionmaking ability and the generalizability of policies. Figure 1 illustrates the state \(s_{t} \in \mathcal {S}\) of the DRL agent at time t, which comprises the LiDAR state, goal state, obstacle state, and current velocity. The specific formulations are described below.

LiDAR State \(\varvec{z}\): PRSA is applied to preprocess the feature vector derived from the raw laser scans \(\varvec{l} = [l_{1}, l_{2},..., l_{n}] \in \mathbb {R}^{n}\), which describes the obstacle information perceived by the robot. In this article, the LiDAR sensor in the simulation has a field of view (FOV) of \(190^{\circ }\), an angular resolution of \(1^{\circ }\) (equivalent to 190 laser beams), and a scanning range of 8m. Given the physical specifications of the sensor on the robot, we set the distance processing parameter \(\eta = e^{1}\) and the sampling window size \(k = 5\) for PRSA. The resulting LiDAR state is \(\varvec{z} = [z_{1}, z_{2},..., z_{38}] \in \mathbb {R}^{38}\).

Goal State \(\varvec{x}_{g}\): This state denotes the relative distance \(d_{g}\) and angle \(\varphi _{g}\) between the robot and the target in the world frame, typically obtained through localization techniques such as simultaneous localization and mapping (SLAM) or global positioning system (GPS). The relative distance is defined as:
$$\begin{aligned} d_{g} = \Vert \varvec{{p}}_{t}  \varvec{{p}}_{g}\Vert , \end{aligned}$$(11)where \(\varvec{p}_{t} = [{p}_{t}^{x}, {p}_{t}^{y}] \in \mathbb {R}^{2}\) denotes the robot’s position at time t, and \(\varvec{{p}}_{g}=[{p}_{g}^{x}, {p}_{g}^{y}] \in \mathbb {R}^{2}\) denotes the target’s position. Correspondingly, the relative angle is calculated using:
$$\begin{aligned} \varphi _{g} = {\textbf {atan2}}(p_{g}^{y}p_{t}^{y}, p_{g}^{x}p_{t}^{x})  p_{t}^{w} \end{aligned}$$(12)where \(p_{t}^{w}\) refers to the heading angle of the robot at time \(t\). After normalization, the goal state is \(\varvec{x}_{g} = [d_{g}, \varphi _{g}] \in \mathbb {R}^{2}\).

Obstacle State \(\varvec{x}_{o}\): This state denotes the distance \(d_{o}\) and angle \(\varphi _{o}\) between the robot and the nearest obstacle, compensating for the absence of distance information caused by PRSA. The Obstacle State is normalized and expressed as \(\varvec{x}_{o} = [d_{o}, \varphi _{o}] \in \mathbb {R}^{2}\).

Current Velocity \(\varvec{u}_{t}\): This state represents the linear velocity \(v_{t}\) and angular velocity \(w_{t}\) of the robot, and is given as \(\varvec{u}_{t} = [v_{t}, w_{t}] \in \mathbb {R}^{2}\).
The comprehensive state representation for the mapless navigation task is thus \(s_{t} = [\varvec{z}, \varvec{x}_{g}, \varvec{x}_{o}, \varvec{u}_{t}] \in \mathbb {R}^{44}\). This configuration accounts for potential risks posed by obstacles during navigation while also including information about the nearest obstacle, the target, and the robot’s speed, ensuring the agent has access to all necessary data for efficient navigation.
We implement continuous control in our DRL model to allow for nuanced and precise navigation, which affords smoother and more accurate movements compared to discrete control schemes. The action \(\varvec{a} = [a_{v}, a_{w}] \in \mathbb {R}^{2}\) is mapped to the robot’s linear velocity \(v_{t} \in [v_{min}, v_{max}]\) and angular velocity \(w_{t} \in [w_{min}, w_{max}]\) using the following transformation:
According to the kinematic characteristics of the industrial autonomous mobile robot, the specific ranges for linear and angular velocities in this article are set to \(v \in [0, 0.5]\) and \(w \in [0.8, 0.8]\), respectively.
4.2.2 Adaptive safety optimization reward shaping
In DRL, addressing the challenge of sparse rewards often involves decomposing complex tasks into multiple subtasks and introducing intermediate rewards for each. During navigation, it is crucial for the robot to concurrently monitor the location of the target and the relative orientation of the nearest obstacles [41]. To effectively balance reaching the goal with avoiding obstacles, we introduce an intermediate reward mechanism with adaptive weighting, contingent on the collision risk of the current state. This method dynamically prioritizes goal attainment and obstacle avoidance based on the current risk level.
The reward function is formulated as follows:
where \(r_{success}\) denotes the positive reward when the agent successfully achieves its goal, and \(r_{collision}\) represents the penalty incurred upon collision. The dense intermediate reward \(r_{dense}\) comprises three components:
Distance reward \(r_{close}\): This incentivizes the agent to minimize the distance to the target, defined as:
where \(d_{t} = \Vert \varvec{{p}}_{t}  \varvec{{p}}_{g}\Vert \) denotes the distance between the robot and the target at time t and \(\kappa _{1}\) is a weight parameter.
Obstacle reward \(r_{penalty}\): This discourages proximity to obstacles, defined as:
where \(\kappa _{2}\) and \(\kappa _{3}\) are constants adjusting the reward magnitude, \(d_{min}\) is the minimum distance for laser scan, and \(\kappa _{o} \in (0,\kappa _{3}]\) represents the threshold distance at which behavior in close proximity to an obstacle starts to incur penalties.
Goaloriented adaptive reward \(r_{adaptive}\): This encourages the robot to adapt its behavioral strategies in response to varying risk levels, calculated as:
where \(h_{risk}\) denotes the risk posed by the nearest obstacle (computed from (8)), and \(h_{orient}\) represents the constraint reward associated with the agent’s forward direction, calculated as follows:
with \(\kappa _{4}\) and \(\kappa _{5}\) finetuning the reward scale, and \(\varphi _{g}\) as the robot’s heading angle from (12).
The collision risk factor \(\rho (s)\) is a crucial determinant in the adaptive reward, calculated based on the distance to the nearest obstacle \(d_{o}(s)\). It is calculated as:
where \(D_{safe}\) and \(D_{risk}\) denote the safe and risk distance thresholds respectively. In instances where the value of \(\rho (s)\) is small, signifying that the agent is within a safe range, the angle constraint promotes movement in the desired angular direction. Conversely, when \(\rho (s)\) is large, indicating the agent is in a hazardous zone, the reward encourages the agent to prioritize obstacle avoidance, thereby mitigating the danger level associated with the nearest obstacle.
Combining these rewards, the final dense intermediate reward is defined as \(r_{dense} = r_{close} + r_{penalty} + r_{adaptive}\). The constant coefficients are related to the robot’s size, motion model and risk area settings. They are used to constrain rewards to an appropriate range based on the robot’s physical properties, as shown in Table 1.
4.3 Enhance exploration with intrinsic reward
In mapless navigation tasks, agents often confront highdimensional and complex state spaces, necessitating robust exploration capabilities. Enhanced exploration enables agents to develop a comprehensive understanding of state relationships, which is crucial for effective navigation. To this end, we utilize state novelty estimates as intrinsic rewards to incentivize exploration, thereby acquiring diverse and rich training data to enhance learning efficiency and generalization.
The Intrinsic Curiosity Module (ICM) [42] is adopted for state novelty estimation, as depicted in Fig. 3. The ICM consists of two main components: the forward prediction model and the inverse model, which collaboratively generate explorationdriven learning signals by encouraging the discovery of novel states and behaviors.
The inverse model within the ICM aims to estimate the agent’s actions from consecutive state features, enhancing the agent’s understanding of causeandeffect relationships within the environment. The loss function \(\mathcal {L}_{IM}\) measures the disparity between estimated and actual actions, formulated as:
Conversely, the forward prediction model focuses on predicting the next state’s features \(\phi _{t+1}\) based on the current state and action. This model generates intrinsic rewards by assessing the novelty of state transitions, propelling the agent to explore unfamiliar states. The prediction error from the forward model, which forms the basis of the intrinsic reward, is computed using \(\mathcal {L}_{FM}\):
The intrinsic reward \(r^{in}_{t}\), derived from the prediction error of the forward model, encourages exploration of novel states:
where \(\lambda \) is a scaling factor used to balance intrinsic and extrinsic rewards. The novelty of the current state is evaluated through the prediction error of the forward model. A smaller prediction error implies that the agent is familiar with the current environmental state, indicating lower state novelty. Conversely, a higher prediction error suggests a greater level of state novelty, thereby increasing the intrinsic reward and motivating more proactive exploration.
By introducing state novelty estimation as an auxiliary task, the agent is trained to maximize the reward \(r_{t} = r_{t}^{in} + r_{t}^{ex}\) of intrinsic reward \(r_{t}^{in}\) and extrinsic reward \(r_{t}^{ex}\) defined in (14). The overarching optimization goal integrates maximizing the policy objective (4) with minimizing the prediction model’s loss:
where \(0 \le \sigma \le 1\) adjusts the weight between \(\mathcal {L}_{FM}\) and \(\mathcal {L}_{IM}\).
4.4 Neural network architecture
The actorcritic network architecture is designed to balance computational efficiency with the ability to manage the complexities of mapless navigation, as illustrated in Fig. 4. This network employs fully connected layers to effectively process highdimensional state inputs, thus facilitating rapid and precise decisionmaking for realtime navigation.
Notably, the policy network outputs the mean and variance of the linear and angular velocities, modeled as a Gaussian distribution. During the training phase, actions are sampled from this distribution to ensure a diverse range of exploratory maneuvers. Conversely, during testing and practical applications, the mean of the distribution is used to determine the robot’s actions, emphasizing stability and consistency.
The state novelty estimation model within our network consists of two integral components: the inverse model and the forward prediction model. The state input \(s_t\) initially passes through two fully connected layers, with 512 and 256 units respectively, to derive a 64dimensional feature representation \(\phi _t\). This representation \(\phi _t\) is then utilized in both the inverse and forward models:

In the inverse model, \(\phi _t\) and \(\phi _{t+1}\) are concatenated and processed through a 256unit fully connected layer, which outputs the predicted action \(\hat{a}_t\) via a tanh activation function.

The forward model concatenates the feature \(\phi _t\) and the action \(a_t\), and inputs them into a 128unit fully connected layer. This layer predicts the next state’s feature \(\hat{\phi }_{t+1}\), crucial for calculating the intrinsic reward.
This architecture supported by intrinsic rewarddriven exploration, enables the agent to effectively learn navigation strategies within a simulated environment. The training process is systematically outlined in Algorithm 1, detailing the initialization of network parameters and the iterative learning steps undertaken by the agent.
5 Experiment
In this section, we demonstrate the effectiveness of our proposed method through a series of simulations and realworld experiments. The experimental results highlight significant improvements in navigation performance and generalization capabilities enabled by our approach.
5.1 Training setup
For the training and testing of our models, we employ the Gazebo simulator integrated with the Robot Operating System (ROS) platform. This setup facilitates a seamless simtoreal transition, essential for validating the practical applicability of our methods. To demonstrate the enhanced generalization capability of our approach, we only trained the robot(shown in Fig. 5(a)) in a \(10 \times 10\textrm{m}^{2}\) static environment depicted in Fig. 5(b). This controlled environment ensures that the improvements can be directly attributed to the efficacy of the proposed method. All models in the experiment share the same network architecture, and the hyperparameters specific to the proposed method are detailed in Table 2. These hyperparameters were chosen based on a combination of prior literature insights [6, 38] and finetuning specific to our model, ensuring optimal performance. Additionally, coefficients for state representation and reward functions, as well as network structure parameters, are explained and provided in the previous Section 4.
5.2 Comparative experiment on reward effectiveness
To examine the impact of intrinsic rewards on generalization performance and the role of adaptive weight reward functions, we utilized Intrinsic Reward SAC (IRSAC) and standard SAC to train under three distinct intermediate rewards: (1) R1: Adaptive safety optimization reward, (2) R2: Safety optimization reward under fixed weight, (3) R3: Common intermediate reward including only distance reward [9, 31].
The key indicators for assessing the efficacy of various algorithms and reward structures were the average extrinsic reward (\(R_{Ae}\)) and the total navigation risk:
where \(r_{j}^{ex}\) denotes the extrinsic reward received at each step, and \(N_{s}\) is the total number of steps in the episode. Additionally, the total navigation risk is the sum of the risk index (\(R_{I}\)) for each episode, defined as:
Here, \(d_{o}(s)\) is the distance to the nearest obstacle, and \(D_{coll}\) is the default collision distance. In the simulation we set \(D_{coll}=0.35\).
To ensure robust evaluation, we computed the average reward and risk index values every 12 episodes. Posttraining, the models’ generalization and navigation behaviors were assessed across four different static environments (Fig. 6(a) to (d)). Each model completed the same tasks in each map, with evaluations based on average performance across the maps. Metrics included success rate (SR), collision rate (CR), loss rate (LR), and mean navigation risk (MNR), providing a comprehensive assessment of each model’s navigation capabilities in diverse environments.
Training results illustrated in Fig. 7 show that regardless of the specific reward function, incorporating intrinsic rewards during training enhances the average reward level and accelerates the learning process. The intrinsic rewards facilitate a wider range of exploratory behaviors, which is critical in environments where external rewards are sparse. This expanded exploration allows the agent to encounter and learn from a broader spectrum of states, significantly leveraging the PRSA’s capability to finetune responses based on the perceived risks from nearby obstacles. Test results in Table 3 indicate that models incorporating intrinsic rewards demonstrate improved adaptability to novel environments and tasks, evident in higher SRs and lower LRs compared to models trained without intrinsic rewards. The adaptability improvements are attributable to the richer exploratory experiences during training, allowing the agent to develop more versatile strategies, thus enhancing performance across diverse environments and tasks.
Furthermore, the results affirm that the proposed adaptive safety optimization reward significantly enhances decisionmaking safety and efficiency.This reward strategy, compared to other reward functions, shows reduced navigation risk (see Fig. 7) during training and smaller MNR during testing (see Table 3).Unlike traditional rewards that primarily focus on avoiding obstacles through penalties for collisions and close encounters, the adaptive safety optimization reward offers a detailed feedback that tailores navigation and obstacle avoidance strategies to different environmental areas. When the agent operates within safe areas, rewards incentivize achieving goals quickly. Conversely, the reward system shifts to prioritize risk reduction in hazardous areas. This adaptive reward structure encourages the agent to adopt more effective strategies based on the current obstacle distribution, thereby reducing the possibility of collisions.
5.3 Comparative experiment on generalization of state representation
This subsection investigates the influence of different state representations on the generalization performance in mapless navigation. We compare three standard state representation methods against our proposed Potential Risk State Augmentation (PRSA): (1) Raw dense laser scans, incorporating goal state and velocity (Raw); (2) Compressed linear normalized input, incorporating goal state and velocity (LN); (3) Compressed reciprocal function input, incorporating goal state and velocity (Rec) (Fig. 8).
All models were trained using Algorithm 1 with the same reward function (14), and evaluated across four different static environments as shown in Fig. 6(ad). The test results depicted in Fig. 9 highlight substantial differences in performance among various LiDAR state representations between training and testing environments. Traditional representations (LN and Raw) exhibited the weakest performance, indicating poor generalization. The application of the reciprocal function (Rec) showed significant improvement in average performance, though it exhibited some variance across scenarios. Conversely, the PRSA method consistently achieved high success rates in unfamiliar environments, with minimal variation in performance across different settings. This suggests a superior generalization capability brought about by the detailed state information provided by PRSA.
Furthermore, we assessed the models’ obstacle avoidance capabilities in environments SEnv1, SEnv2, and SEnv3, conducting the same navigation tasks three times for consistency. The performance was quantified using four metrics: success rate (SR), collision rate (CR), lost rate (LR), and mean navigation score (MNS), calculated as follows:
where, \(P_{i}\) represents the distance to navigate, \(T_{max}\) signifies the maximum time steps allowed for each navigation, and \(v_{max}\) is the maximum linear speed of navigation.
The summarized results in Table 4 reveal that agents trained with PRSA not only achieve consistently higher SR and MNS but also exhibit significantly better obstacle avoidance, as evidenced in varied test environments. In addition, it is clear that traditional state representations (Raw and LN) often lead to suboptimal path planning, especially in open or complex environments(see Fig. 8). The primary reason behind this phenomenon is the agent’s inability to comprehend certain environmental features, resulting in the failure of the learned policy in the training environment to generalize effectively to novel environments. In contrast, Rec and PRSA demonstrate better generalization with trajectories closer to optimal paths and higher mean navigation scores. The strong generalization capabilities of Rec and PRSA stem from their efficient preprocessing of highdimensional LiDAR data, utilizing key information for decisionmaking and ignoring information that has little impact on navigation and obstacle avoidance. Notably, Rec sometimes results in collisions in narrower environments, whereas PRSA enhances the agent’s hazard perception capability, ensuring robust performance across all tested scenarios. This underlines the effectiveness of PRSA in providing comprehensive state information, which significantly improves navigation and generalization across diverse environments.
5.4 Comprehensive comparison on navigation performance
The previous experiments comparatively analyzed the influence of reward function and state representation. In this subsection, we evaluate the proposed DRLbased realtime navigation method within a complex environment featuring narrow passages and dynamic obstacles, as depicted in Fig. 6(e). These dynamic obstacles are programmed to move in predetermined directions at specified speeds v.
We benchmarked the proposed method against several representative navigation techniques: (1) APF (Artificial Potential Field), a conventional method often used in mapless navigation, (2) MoveBase, a widely utilized classic navigation baseline in industrial AMRs autonomous navigation, (3) IPAPRECSAC [9], which integrates adaptive parameter reciprocal input preprocessing with the SAC algorithm for enhanced generalization, (4) SACSP [28], an optimized realtime navigation method using an enhanced SAC algorithm with a compound auxiliary reward, modified to meet navigation requirements based on the differential robot kinematics model and observations used in this study, (5) ITD3CLN [30], which utilizes an improved TD3 algorithm and a channelbased convolutional laser network architecture to improve obstacle avoidance and generalization.
Three scenarios were set up to comprehensively assess the navigation performance of each method: (1) S: Static environment where obstacles remain stationary (\(v=0 m/s\)), (2) D1: Dynamic environment where obstacles move at a slow speed (\(v=0.2 m/s\)), (2) D2: Dynamic environment where obstacles move at a faster speed (\(v=0.4 m/s\)). Each scenario performed the same navigation task and performance was assessed using success rate (SR), collision rate (CR), mean navigation risk (MNR) and mean navigation score (MNS).
The results summarized in Figs. 10 and 11 reveal several key insights. Traditional planning methods like APF and MoveBase are often difficult to navigate in narrow environments, exhibiting low SR and high MNR. This challenge arises from their dependency on predetermined trajectories or potential fields, which lack adaptability and often lead to local optima. Consequently, these methods struggle to adapt to narrow spaces and dynamic environments effectively. In contrast, DRLbased methods consistently outperform traditional planners by mitigating local minima issues and demonstrating superior generalization across dynamic conditions.
Moreover, our method shows the highest SR and MNS across all scenarios, indicating robust generalization and efficient navigation even under dynamic obstacle conditions. This performance highlights the effectiveness of integrating PRSA with adaptive safety optimization rewards, balancing target achievement with obstacle avoidance. Furthermore, our method achieves the lowest MNR, attributed to the designed reward function that incentivizes the agent to adopt safer navigation behaviors. It’s noteworthy that with the increase in obstacle speed, the collision rate rises across all methods, underscoring the heightened challenges in perception and decisionmaking. Compared to other navigation methods, our method maintains the lowest CR, showcasing its robustness to adaptively manage risks and preserve navigation integrity under varied environmental dynamics.
5.5 Realworld validation of navigation performance
In this subsection, we verified the performance of the proposed DRLbased realtime navigation model in the real world. Similar to simulation testing, we compared the navigation performance of different state representations using the traditional MoveBase planner as a benchmark.
5.5.1 Experimental setup
The realworld experiments utilize an industrial composite robot equipped with SLAMTEC S1 ToFbased LiDAR, featuring a 190\(^{\circ }\) field of view and an angular resolution of 0.26\(^{\circ }\). Policies trained in simulated environments are directly deployed on this robot to evaluate the transferability of our model. The robot’s maximum linear speed is capped at 0.25 m/s for safety reasons. The GMapping SLAM algorithm is used to create maps of the testing environment, aiding in robot localization via the AMCL algorithm. It is important to note that these maps are solely for localization and not for trajectory planning.
Two realworld scenarios assess navigation performance. In each, the robot starts from an initial point and aims to reach multiple targets sequentially before returning to the starting position. Manual intervention is employed when the robot fails to complete a task. The primary metrics for evaluation included success rate (SR), collision rate (CR), mean navigation score (MNS), mean navigation distance (MND) and mean navigation time (MNT).
5.5.2 Results and analysis
The realworld test results, as shown in Table 5, corroborate our simulation findings, demonstrating that DRLbased navigation methods (especially those utilizing advanced state representations) significantly outperform classic navigation baseline MoveBase in terms of success rates and navigation efficiency. MoveBase’s performance limitations stem from its challenges in executing precise maneuvers and avoiding obstacles in confined spaces, further impacted by the completeness and accuracy of its map data. The slight variance in performance among PRSA, Rec, LN, and Raw highlights the sensitivity of DRL models to different state representations in complex realworld environments. Our proposed method achieves superior performance, showcasing the highest success rates and navigation scores in both environments. Although the mean navigation distance in RE2 is slightly longer, the motion trajectories (see Fig. 12) indicate that the proposed method maintains a longer distance from obstacles, thereby enhancing navigational safety by sacrificing some distance for safety. Moreover, our method completes the navigation task more quickly than Rec. LN and Raw showed slightly inferior performance, with Raw choosing an unnecessarily extended route to reach the goal during the mission in RE1 (\(4 \rightarrow 5\)). This issue may arise from significant disparities in obstacle distribution between the real environment and the training environment, making learned policies difficult to generalize and occasionally leading to misunderstanding of the current state.
In summary, these realworld experimental results confirm the successful transfer of our simulated training to practical applications. This success is primarily attributed to our preprocessing of original sensor data. Unlike utilizing raw highdimensional laser observations, this preprocessing method effectively captures essential navigation and obstacle avoidance information while minimizing the impact of extraneous data. Additionally, our adaptive safety optimization reward function incentivizes the robot to balance navigation and obstacle avoidance requirements based on current threat levels, thereby reducing collision risks during navigation. The use of 3D simulator training models, which simulate sensor noise and accurately model robot kinematics, bridges the gap between simulation and reality. This comprehensive approach enables the successful deployment of our strategy in realworld scenarios.
6 Conclusion
In this article, we presented a DRLbased mapless navigation scheme for industrial AMRs with strong generalization capabilities. To mitigate the negative impact of traditional LiDAR state representations on the generalization performance of DRL agents, we introduced a potential risk state augmentation method, which enhances the generalization performance of the trained policies. Additionally, an adaptive weight was introduced into the reward function to balance navigation efficiency and safety, complemented by an intrinsic reward to boost the exploration efficiency of the DRL agent. Extensive experimental results illustrate a significant advantage of the proposed method in terms of navigation and generalization performance. The proposed PRSA enhances the navigation success rate by about 17% in previously unseen environments. Furthermore, compared with representative navigation techniques, our method not only maintained a success rate exceeding 90% in challenging narrow and dynamic settings but also reduced the mean navigation risk by approximately 44% relative to advanced endtoend methods. In realworld tasks, our method significantly outperformed the classical navigation baseline MoveBase, increasing the success rate by approximately 33%.
However, a critical limitation of our research is the lack of enforced safety constraints. Although our adaptive safety optimization reward function mitigates the risk of collision to a certain extent, it only guides the agent to learn safe navigation behaviors and cannot forcibly restrict risky actions. Navigation tasks in realworld safetycritical scenarios necessitate additional mandatory safety constraints. Future research endeavors will focus on integrating modelbased control techniques to further enhance navigation safety in complex environments. Specifically, we plan to incorporate partial map information and visual observations to augment the agent’s perception capabilities, thereby improving the navigation performance of industrial robots in intricate dynamic environments. Additionally, exploring the integration of realtime adaptive learning mechanisms could further bolster the robustness and reliability of navigation strategies in unpredictable industrial settings.
Data Availability Statement
Data cannot be shared openly but are available on request from authors: Data sets generated during the current study are available from the corresponding author on reasonable request.
References
Cadena C, Carlone L, Carrillo H et al (2016) Past, present, and future of simultaneous localization and mapping: Toward the robustperception age. IEEE Trans Robot 32(6):1309–1332
Yang H, Xu X, Hong J (2022) Automatic parking path planning of tracked vehicle based on improved a* and dwa algorithms. IEEE Trans Transp Electrif 9(1):283–292
Liu J, Ji J, Ren Y et al (2021) Path planning for vehicle active collision avoidance based on virtual flow field. Int J Automot Technol 22:1557–1567
Zhu K, Zhang T (2021) Deep reinforcement learning based mobile robot navigation: a review. Tsinghua Sci Technol 26(5):674–691
Tai L, Paolo G, Liu M (2017) Virtualtoreal deep reinforcement learning: continuous control of mobile robots for mapless navigation. In: 2017 IEEE/RSJ International conference on intelligent robots and systems (IROS), IEEE, pp 31–36
Shi H, Shi L, Xu M et al (2019) Endtoend navigation strategy with deep reinforcement learning for mobile robots. IEEE Trans Ind Inform 16(4):2393–2402
Wu K, Wang H, Esfahani MA et al (2021) Learn to navigate autonomously through deep reinforcement learning. IEEE Trans Ind Electron 69(5):5342–5352
Luong M, Pham C (2021) Incremental learning for autonomous navigation of mobile robots based on deep reinforcement learning. J Intell Robot Syst 101(1):1
Zhang W, Zhang Y, Liu N et al (2022) Ipaprec: A promising tool for learning highperformance mapless navigation skills with deep reinforcement learning. IEEE/ASME Trans Mechatron 27(6):5451–5461
Wang C, Wang J, Shen Y et al (2019) Autonomous navigation of uavs in largescale complex environments: a deep reinforcement learning approach. IEEE Trans Veh Technol 68(3):2124–2136
Xie Z, Dames P (2023) Drlvo: Learning to navigate through crowded dynamic scenes using velocity obstacles. IEEE Trans Robot
De Ryck M, Versteyhe M, Debrouwere F (2020) Automated guided vehicle systems, stateoftheart control algorithms and techniques. J Manuf Syst 54:152–173
Sprunk C, Lau B, Pfaff P et al (2017) An accurate and efficient navigation system for omnidirectional robots in industrial environments. Auton Robots 41:473–493
Liu X, Wang W, Li X et al (2022) Mpcbased highspeed trajectory tracking for 4wis robot. ISA Trans 123:413–424
Rasekhipour Y, Khajepour A, Chen SK et al (2016) A potential fieldbased model predictive pathplanning controller for autonomous road vehicles. IEEE Trans Intell Transp Syst 18(5):1255–1267
Yang H, Wang Z, Xia Y et al (2023) Empc with adaptive apf of obstacle avoidance and trajectory tracking for autonomous electric vehicles. ISA Trans 135:438–448
Xiao X, Liu B, Warnell G et al (2022) Motion planning and control for mobile robot navigation using machine learning: a survey. Auton Robots 46(5):569–597
Zhu Y, Mottaghi R, Kolve E et al (2017) Targetdriven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE international conference on robotics and automation (ICRA), IEEE, pp 3357–3364
Yokoyama K, Morioka K (2020) Autonomous mobile robot with simple navigation system based on deep reinforcement learning and a monocular camera. In: 2020 IEEE/SICE International Symposium on System Integration (SII), IEEE, pp 525–530
Zhou Z, Zhu P, Zeng Z et al (2022) Robot navigation in a crowd by integrating deep reinforcement learning and online planning. Appl Intell 52(13):15600–15616
Chen Y, Liu C, Shi BE et al (2020) Robot navigation in crowds by graph convolutional networks with attention learned from human gaze. IEEE Robot Autom Lett 5(2):2754–2761
Sun X, Zhang Q, Wei Y et al (2023) Riskaware deep reinforcement learning for robot crowd navigation. Electronics 12(23):4744
Liu L, Dugas D, Cesari G, et al (2020) Robot navigation in crowded environments using deep reinforcement learning. In: 2020 IEEE/RSJ International conference on intelligent robots and systems (IROS), IEEE, pp 5671–5677
Pfeiffer M, Schaeuble M, Nieto J et al (2017) From perception to decision: A datadriven approach to endtoend motion planning for autonomous ground robots. In: 2017 IEEE international conference on robotics and automation (icra), IEEE, pp 1527–1533
Francis A, Faust A, Chiang HTL et al (2020) Longrange indoor navigation with prmrl. IEEE Trans Robot 36(4):1115–1134
Pfeiffer M, Shukla S, Turchetta M et al (2018) Reinforced imitation: sample efficient deep reinforcement learning for mapless navigation by leveraging prior demonstrations. IEEE Robot Autom Lett 3(4):4423–4430
Li W, Yue M, Shangguan J et al (2023) Navigation of mobile robots based on deep reinforcement learning: Reward function optimization and knowledge transfer. Int J Control Autom Syst 21(2):563–574
Guo H, Ren Z, Lai J et al (2023) Optimal navigation for agvs: a soft actorcriticbased reinforcement learning approach with composite auxiliary rewards. Eng Appl Artif Intell 124:106613
MartinezBaselga D, Riazuelo L, Montano L (2023) Improving robot navigation in crowded environments using intrinsic rewards. In: 2023 IEEE International Conference on Robotics and Automation (ICRA), IEEE, pp 9428–9434
Jiang H, Esfahani MA, Wu K et al (2022) itd3cln: Learn to navigate in dynamic scene through deep reinforcement learning. Neurocomputing 503:118–128
Jang Y, Baek J, Han S (2021) Hindsight intermediate targets for mapless navigation with deep reinforcement learning. IEEE Trans Ind Electron 69(11):11816–11825
Zhu W, Hayashibe M (2022) A hierarchical deep reinforcement learning framework with high efficiency and generalization for fast and safe navigation. IEEE Trans Ind Electron 70(5):4962–4971
Miranda VR, Neto AA, Freitas GM, et al (2023) Generalization in deep reinforcement learning for robotic navigation by reward shaping. IEEE Trans Ind Electron
Yan C, Qin J, Liu Q et al (2022) Mapless navigation with safetyenhanced imitation learning. IEEE Trans Ind Electron 70(7):7073–7081
Chang L, Shan L, Zhang W et al (2023) Hierarchical multirobot navigation and formation in unknown environments via deep reinforcement learning and distributed optimization. Robot ComputIntegr Manuf 83:102570
Lim J, Ha S, Choi J (2020) Prediction of reward functions for deep reinforcement learning via gaussian process regression. IEEE/ASME Trans Mechatron 25(4):1739–1746. https://doi.org/10.1109/TMECH.2020.2993564
Zhang W, Liu N, Zhang Y (2021) Learn to navigate maplessly with varied lidar configurations: a support pointbased approach. IEEE Robot Autom Lett 6(2):1918–1925. https://doi.org/10.1109/LRA.2021.3061305
Haarnoja T, Zhou A, Abbeel P, et al (2018) Soft actorcritic: Offpolicy maximum entropy deep reinforcement learning with a stochastic actor. In: International conference on machine learning, PMLR, pp 1861–1870
Yang J, Lu S, Han M et al (2023) Mapless navigation for uavs via reinforcement learning from demonstrations. Sci China Technol Sci 66(5):1263–1270
Huang W, Zhou Y, He X, et al (2023) Goalguided transformerenabled reinforcement learning for efficient autonomous navigation. IEEE Trans Intell Transp Syst
Gao X, Yan L, Li Z et al (2023) Improved deep deterministic policy gradient for dynamic obstacle avoidance of mobile robot. IEEE Trans Syst, Man, Cybern Syst 53(6):3675–3682
Pathak D, Agrawal P, Efros AA et al (2017) Curiositydriven exploration by selfsupervised prediction. In: International conference on machine learning, PMLR, pp 2778–2787
Author information
Authors and Affiliations
Corresponding author
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author selfarchiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Xu, D., Chen, P., Zhou, X. et al. Deep reinforcement learning based mapless navigation for industrial AMRs: advancements in generalization via potential risk state augmentation. Appl Intell 54, 9295–9312 (2024). https://doi.org/10.1007/s10489024056795
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10489024056795