1 Introduction

Autonomous Mobile Robots (AMRs) have become pivotal in various industrial sectors, undertaking tasks ranging from material transport in manufacturing facilities to monitoring and inspection in hazardous environments. A critical capability of industrial AMRs is autonomous navigation, which demands high levels of adaptability and precision, especially in complex and dynamic industrial settings. Traditional navigation methods have predominantly utilized map-based navigation techniques, leveraging Simultaneous Localization and Mapping (SLAM) [1] to create detailed environmental maps and path planning algorithms such as A* and the Dynamic Window Approach (DWA) [2, 3] for trajectory optimization. Despite their widespread adoption, these approaches often struggle to cope with the intricacy and unpredictability inherent in such environments [4], where the high computational overhead for map updates and the need for real-time adaptability pose considerable challenges.

Recent advancements in Deep Reinforcement Learning (DRL) have opened new avenues for enhancing the autonomy of AMRs [5,6,7]. DRL’s capability to learn from environmental interactions promises significant improvements over traditional rule-based and path-planning algorithms, particularly in unstructured or dynamic settings. However, deploying DRL in real-world navigation tasks introduces unique challenges, such as processing high-dimensional sensory inputs like LiDAR data efficiently and devising reward functions that capture complex navigation objectives accurately. Traditional approaches to DRL state representation often do not fully leverage this data, resulting in navigation policies that may not optimally balance risk and efficiency [8, 9]. Moreover, designing reward functions in DRL is critical yet challenging due to the inherent sparsity of rewards in complex navigation tasks, often leading to prolonged learning periods or convergence to suboptimal policies [10, 11].

Industrial environments amplify these challenges due to their inherent complexity and the higher stakes involved in navigation tasks. For instance, the presence of moving machinery, temporary obstructions, and ongoing human activity requires AMRs to exhibit a high degree of adaptability and generalization capability. Moreover, stringent safety requirements demand navigation strategies that minimize collision risks and ensure the uninterrupted flow of operations. Additionally, the challenge of sparse reward functions becomes apparent in industrial settings where AMRs often receive feedback only upon the successful completion of a task. This sparsity can delay the development of efficient and reliable navigation solutions by hindering the robot’s ability to learn effective navigation strategies. Furthermore, the discrepancy between simulated training environments and real-world operational settings introduces additional challenges in sim-to-real transfer. These domain differences can significantly impede the direct application of learned policies in real scenarios, necessitating robust and reliable navigation strategies.

In response to these challenges, we propose the Potential Risk State Augmentation (PRSA) method to enhance the AMR’s perception of environmental risks and improve decision-making under uncertainty. Traditional LiDAR-based state representations, which often contain excessive redundant information due to high-dimensional LiDAR observations, make it difficult to generalize strategies across varied environments. Our PRSA approach addresses this by efficiently encoding LiDAR data into a compact and informative state representation, focusing on crucial risk-related information. Additionally, we introduce an adaptive safety optimization reward function, complemented by intrinsic rewards, to mitigate the issue of sparse rewards and encourage exploration of novel states. This innovative reward structure is designed to tackle the inherent challenges in training DRL agents for complex navigation tasks, where traditional reward signals may not provide sufficient guidance for effective learning. The key contributions of this paper are as follows:

  1) Introducing PRSA to efficiently process and represent high-dimensional LiDAR data, thereby reducing state space complexity and enhancing the DRL agent’s generalization capabilities across diverse industrial scenarios.

  2) Developing an adaptive weight reward function shaped by collision risk and augmented with intrinsic rewards for state novelty, thereby enhancing model training efficiency and adaptability in environments characterized by sparse rewards.

  3) Conducting extensive experiments to validate the superior navigation performance and generalization capabilities of our proposed DRL-based mapless navigation approach, including successful policy transfers to real-world scenarios.

The remainder of this paper is organized as follows: Section 2 provides an overview of related works, emphasizing the evolution of DRL-based navigation systems and highlighting the limitations of existing methodologies. Section 3 presents an overview of the preliminary work. Section 4 details the proposed navigation scheme, including PRSA and the adaptive reward function. Section 5 presents our experimental setup, results, and analysis, showcasing the effectiveness of our approach. Finally, Section 6 concludes the paper with a summary of findings and directions for future research.

2 Related work

In this section, we outline current advancements in navigating complex environments and review related works focusing on learning and exploration in robot navigation.

2.1 Robot navigation in complex environments

Autonomous Mobile Robots (AMRs) in industrial settings have traditionally relied on map-based navigation techniques to perform a broad array of operational tasks [12]. Foundational to these techniques are path planning [13], trajectory tracking [14], and obstacle avoidance algorithms [15, 16], which have established the groundwork for precise movement and trajectory optimization within structured environments. Nevertheless, their reliance on pre-existing maps and static obstacle locations limits their applicability in dynamic and unpredictable industrial settings, necessitating more adaptive and responsive solutions [4].

To overcome the limitations of traditional methods, DRL enables AMRs to dynamically learn and adjust navigation policies from environmental interactions [17]. Early DRL efforts explored visual navigation using RGB images [18], depth-estimated images [19], and depth camera inputs [7]. Despite demonstrating the potential of DRL, these vision-based methods faced challenges including high computational requirements and sensitivity to variable conditions such as lighting or occlusions. In addition, to navigate safely through crowded environments, many studies incorporate pedestrian motion information as observations for DRL agents. Zhou et al. [20] introduced a social graph-based double dueling deep Q-network (SG-D3QN) that incorporates a social attention mechanism to extract crowd features as the agent state, significantly enhancing the safety of the navigation process. Chen et al. [21] leveraged a graph convolutional network (GCN) with an attention mechanism to extract crucial pedestrian information for navigation. Similarly, Sun et al. [22] proposed a risk function to evaluate the collision probability of pedestrians and prioritized those with higher collision risk during obstacle avoidance. These agent-level methods share a primary limitation: they rely heavily on pedestrian motion information and mainly address crowd navigation in open environments, overlooking static obstacle structures within the environment. Liu et al. [23] tackled this issue by using a static angle map to represent both pedestrians and static obstacles; however, this method relies on a pre-trained feature extraction network to manage the high-dimensional input.

To address the need for improved generalization of navigation policies, attention has turned to effective state representation techniques, focusing on LiDAR data for its direct relevance and robustness to environmental changes. Pfeiffer et al. [24] employed convolutional neural networks (CNNs) to extract features from 2D laser scans, mapping these features to steering commands. Francis et al. [25] integrated probabilistic roadmaps (PRMs) with RL for navigating complex environments, utilizing direct LiDAR observations to enhance short-range obstacle avoidance and long-range navigation with the help of additional global path planning. To mitigate the degradation of generalization performance caused by high-dimensional LiDAR data, several studies have focused on extracting relevant features from laser scans to facilitate navigation across diverse environments. Pfeiffer et al. [26] introduced a method to compress LiDAR data through minimum pooling, simplifying DRL model state inputs while prioritizing crucial distance information from the nearest obstacles. Zhang et al. [9] proposed IPAPRec, a novel input preprocessing method that uses adaptively parametric reciprocal functions to process LiDAR readings, accentuating short-distance information to improve the generalization performance of DRL agents and accelerate learning in unfamiliar scenarios. However, while sparse laser scanning can enhance generalization, it discards information essential for effective obstacle avoidance in complex environments, thereby constraining the agent’s ability to navigate around obstacles.

Our research introduces the Potential Risk State Augmentation method, optimizing high-dimensional LiDAR data representation by enriching it with risk-related information. This approach assesses threat levels based on the proximity and orientation of obstacles, providing a more nuanced understanding of the environment compared to existing methods.

2.2 Exploration and learning in robot navigation

Since DRL relies on trial and error for learning, some research has focused on improving training efficiency and policy effectiveness through techniques such as reward shaping [27, 28] and innovative data utilization [29, 30]. Jang et al. [31] addressed these challenges by introducing the Hindsight Intermediate Targets (HIT), which circumvents common navigation issues like dead ends and local minima by creating virtual intermediate targets based on recent local observations, showcasing an innovative use of data utilization techniques to guide navigation. Zhu et al. [32] proposed a hierarchical deep reinforcement learning (HDRL) framework that enhances sampling efficiency and sim-to-real transfer capabilities. Miranda et al. [33] introduced a dense reward function that uses map information to encourage exploration and help avoid traps in chaotic environments. Shi et al. [6] utilized curiosity to enhance the agent’s exploration efficiency and employed it as an auxiliary reward to guide the agent in learning navigation tasks within discrete action spaces. Although the above methods can overcome the local optimal problems of exploration and navigation, they still have certain limitations when facing dynamic environments. To tackle this issue, Xie et al. [11] integrated velocity obstacles into the reward function to actively prevent collisions, thereby enhancing navigation safety in crowded pedestrian environments. However, this approach necessitates additional pedestrian location information as observations.

Our work introduces the Adaptive Safety Optimization Reward Shaping method, which stands distinct in its simplicity and effectiveness without complex models or extensive data techniques. It mitigates reward sparsity by shaping rewards and harmonizing goal achievement with obstacle avoidance through adaptive weighting based on collision risk. Additionally, we incorporate an intrinsic curiosity model to provide intrinsic rewards, enhancing the agent’s exploration ability in high-dimensional state space. This method fosters strategy generalization across diverse environments and enhances the training efficiency of the model.

3 Preliminaries

3.1 Problem formulation

Consider a differential-drive mobile robot navigating within a two-dimensional environment, represented by its position \(\textbf{p}_{t}=[{p}_{t}^{x}, {p}_{t}^{y}]^{T} \in \mathbb {R}^{2}\) and orientation \(\vartheta _{t} \in \mathbb {R}\). This work aims to develop a navigation control policy \(\pi \), which effectively controls the robot’s linear velocity \(v_{t}\) and angular velocity \(w_{t}\), enabling it to reach a desired position \(\textbf{p}_{g}=[{p}_{g}^{x}, {p}_{g}^{y}]^{T} \in \mathbb {R}^{2}\) efficiently while avoiding collisions. The mapless navigation problem is modeled as a sequential decision-making process within the framework of reinforcement learning. The agent controlling the robot’s motion operates within a Markov Decision Process (MDP), represented by a tuple \((\mathcal {S}, \mathcal {A}, \mathcal {P}(s'|s,a), r(s, a), \gamma )\) where \(\mathcal {S}\) and \(\mathcal {A}\) represent the state and action spaces respectively. \(\mathcal {P}(s'|s,a)\) describes the probability of transitioning from the current state \(s \in \mathcal {S}\) to the next state \(s' \in \mathcal {S}\) when action \(a \in \mathcal {A}\) is taken. The reward function \(r(s, a)\) and discount factor \(\gamma \in [0,1]\) define the immediate reward and the importance of future rewards respectively. The objective of reinforcement learning is to find an optimal policy \(\pi ^{*}\) that maximizes the expected return:

$$\begin{aligned} \pi ^{*} = \underset{\pi }{\arg \max }\ \underset{a_{t}\sim {\pi }}{\mathbb {E}}[\sum _{t}\gamma ^{t}r(s_{t},a_{t})]. \end{aligned}$$
(1)

3.2 Traditional LiDAR state representation

LiDAR observations are crucial for environmental perception in robotic systems. These observations are typically preprocessed to extract navigation-relevant features, which serve as representational states for decision-making. Traditional LiDAR data representations involve vectors derived from either raw or preprocessed laser scans. The raw LiDAR measurements are denoted as \(\textbf{l}=[l_{1},l_{2},...,l_{n}]\in \mathbb {R}^{n}\), with each \(l_{i} \in [L_{min}, L_{max}]\) representing the ith measured distance, where \(L_{min}\) and \(L_{max}\) are the minimum and maximum measurable distances respectively.

To address the high dimensionality of LiDAR data, a common approach involves reducing these measurements to m values using minimum pooling [11, 34]:

$$\begin{aligned} y_{i}= {\mathrm{{min}}} (l_{(i-1) \cdot k+1}, l_{(i-1) \cdot k+2}, ... , l_{i \cdot k}), \end{aligned}$$
(2)

where k is the sampling window size. Following dimensionality reduction, the compressed laser scans are typically normalized using linear or nonlinear methods. Linear normalization is performed by dividing each measured distance by \(L_{max}\), denoted as \(LN(y_{i}) = y_{i} / L_{max}\) [35, 36]. Nonlinear normalization transforms the compressed laser scans using the reciprocal of the distance [9, 37]:

$$\begin{aligned} Rec(y_{i}) = 1 / (y_{i} - j_{i}), \end{aligned}$$
(3)

where \(j_{i} \in (-\infty , L_{min})\) is a parameter of the input preprocessing function. In this article, the linear normalization is denoted as \(LN(\varvec{y}) = [LN(y_{1}), LN(y_{2}),..., LN(y_{m})] \in \mathbb {R}^{m}\) and the reciprocal version as \(Rec(\varvec{y}) = [Rec(y_{1}), Rec(y_{2}),..., Rec(y_{m})] \in \mathbb {R}^{m}\).
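For illustration, a minimal NumPy sketch of these traditional preprocessing steps is given below; the number of beams, the maximum range, and the choice of \(j_{i}\) are placeholder values rather than the settings used later in this article.

```python
import numpy as np

def min_pool(l, k):
    """Compress n raw LiDAR ranges into n/k values via minimum pooling, cf. (2)."""
    return l.reshape(-1, k).min(axis=1)

def linear_norm(y, l_max):
    """Linear normalization LN(y_i) = y_i / L_max."""
    return y / l_max

def reciprocal_norm(y, j):
    """Reciprocal normalization Rec(y_i) = 1 / (y_i - j_i), with j_i < L_min."""
    return 1.0 / (y - j)

# Illustrative example: 190 beams, window size k = 5, 8 m maximum range.
l = np.random.uniform(0.1, 8.0, size=190)
y = min_pool(l, k=5)                 # 38 compressed readings
s_ln = linear_norm(y, l_max=8.0)
s_rec = reciprocal_norm(y, j=-1.0)   # j chosen below L_min purely as an example
```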

3.3 Soft actor critic

The Soft Actor Critic (SAC) algorithm [38] is a deep reinforcement learning method based on the Actor-Critic architecture, known for its effectiveness in handling continuous action spaces and robust exploration capabilities. SAC has demonstrated superior performance in mapless navigation tasks [39, 40] due to its incorporation of entropy as a regularization term in the objective function, which balances the expected cumulative reward with an entropy component to encourage policy exploration:

$$\begin{aligned} \mathcal {J}(\pi )=\underset{(s_{t},a_{t})\sim \rho _{\pi }}{\mathbb {E}}\left[ \sum _{t}\left( r(s_{t},a_{t})+\alpha \mathcal {H}(\pi (\cdot |s_{t})) \right) \right] , \end{aligned}$$
(4)

where \(\rho _{\pi }\) is the distribution of state-action pairs under policy \(\pi \), and \(\alpha \) is the temperature coefficient that adjusts the weight of the entropy term \(\mathcal {H}(\pi (\cdot | s_{t}))\). This entropy term helps prevent premature convergence to suboptimal policies by promoting exploration of diverse actions. SAC employs dual critic networks (parameterized by \(\xi _{1}, \xi _{2}\)) to mitigate overestimation bias. The loss function for the Q network is defined as:

$$\begin{aligned} \mathcal {L} \left( \xi _{1,2}, \mathcal {B} \right) = \frac{1}{\left| \mathcal {B} \right| } \sum _{\mathcal {T} \in \mathcal {B}}\frac{1}{2} \left( Q(s,a\mid \xi _{1,2})- \overline{Q}(s, a) \right) ^{2} \end{aligned}$$
(5)

where \(\mathcal {T}=(s,a,r,s',d)\) is a state transition sampled from the mini-batch \(\mathcal {B}\), d is the episode termination flag, and \(\overline{Q}(s, a)\) is the target Q value calculated as:

$$\begin{aligned} \overline{Q}(s,a)\!=\!r + (1-d)\gamma \!\left( \underset{i=1,2}{\min }Q(s',\tilde{a}'\mid \overline{\xi }_{i})\!-\!\alpha \log \pi _{\zeta }(\tilde{a}'\!\mid \! s')\!\right) \end{aligned}$$
(6)

with \(\tilde{a}'\) sampled from the policy network \(\pi _{\zeta }(\cdot | s')\) (parameterized by \(\zeta \)). The loss function for the policy network is:

$$\begin{aligned} \mathcal {L} \left( \zeta , \mathcal {B} \right) = \frac{1}{\left| \mathcal {B} \right| } \sum _{s\in \mathcal {B}} \left( \alpha \log \pi _{\zeta }(\tilde{a}\mid s)- \underset{i=1,2}{\min }Q(s,\tilde{a}\mid \xi _{i}) \right) . \end{aligned}$$
(7)
Fig. 1 Overview of the proposed navigation scheme

4 Method

The objective of mapless navigation is to navigate to a destination safely and efficiently based on observed sensor information, without relying on a global map. In our approach, a neural network guides the robot through unstructured environments by controlling its movements based on processed observational states. The framework of our deep reinforcement learning-based navigation scheme is depicted in Fig. 1. The observation values from laser scans undergo preprocessing through Potential Risk State Augmentation and dimensionality reduction before being used as inputs to the policy network for decision-making.

4.1 Potential risk state augmentation

We utilize LiDAR technology to perceive surrounding obstacles, a crucial component of autonomous navigation. To address the limitations of traditional LiDAR state representations, our method enhances the representation by weighting the potential risk of obstacles based on their direction. This is achieved by combining raw laser scans \(\varvec{l} = [l_{1}, l_{2},..., l_{n}] \in \mathbb {R}^{n}\) with corresponding angle information at time t, resulting in an augmented state representation. The exponential function is applied to process distance information, expanding the distribution range of short-distance values, as demonstrated in [9]. Hence, the augmented LiDAR state representation is defined as:

$$\begin{aligned} h_{i} =\alpha (\theta _{i}) \cdot \eta ^{l_{i}}, \end{aligned}$$
(8)

where \(\alpha (\theta _{i})\) is a weight factor that evaluates the threat level of obstacles in various directions relative to the robot’s navigation path. \(\theta _{i}\) represents the direction of the \(i^{th}\) obstacle relative to the robot, and \(\eta \in (0,1)\) is the parameter of the distance processing function.

The weighting factor \(\alpha (\theta _{i})\) reduces redundant decision-making information by emphasizing obstacles that pose a direct threat to the navigation path. It is defined as a function of the obstacle’s angular position relative to the robot:

$$\begin{aligned} \alpha (\theta _{i}) = \sin (\beta \theta _{i}), \end{aligned}$$
(9)

where \(\beta \) is a scaling parameter that maps the obstacle’s angular position relative to the robot into an appropriate range for the sine function. Specifically, \(\beta \) is inversely proportional to the field of view \(\varphi _{f}\) of the LiDAR, calculated as \(\beta =180/\varphi _{f}\). This scaling maps the entire field of view, from 0 to \(\varphi _{f}\) degrees, onto the interval from 0 to 180 degrees, over which the sine function is non-negative. As a result, the threat level peaks directly in front of the robot and decreases symmetrically on both sides, so that obstacles directly ahead of the differential-drive robot receive the highest weight, while secondary obstacle information from the sides is appropriately down-weighted. By adjusting \(\beta \), the robot’s perception system can be tuned to enhance sensitivity to direct threats ahead while filtering out less relevant obstacles on the periphery, thereby improving navigation efficiency and safety.

To further optimize processing and reduce input dimensionality, the augmented state is compressed into m values, representing the threat level posed by obstacles within these m regions. A higher state value indicates a greater level of threat from obstacles to the robot. The dimensionality reduction function is defined as follows:

$$\begin{aligned} z_{i} = \mathrm{{max}} (h_{(i-1) \cdot k+1}, h_{(i-1) \cdot k+2}, ... , h_{i \cdot k}), \end{aligned}$$
(10)

where k is the sampling window size. The state of the LiDAR information at time t is then expressed as \(\varvec{z}=\left\{ z_{1}, z_{2},..., z_{m}\right\} \).
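A minimal NumPy sketch of PRSA as defined by (8)-(10) is given below. It assumes the beams are evenly spaced and that beam angles are measured from the start of the scan, so that the centre of the field of view receives the largest weight; the parameter values shown are the illustrative settings discussed later in Section 4.2.1.

```python
import numpy as np

def prsa(l, fov_deg=190.0, eta=np.exp(-1), k=5):
    """Potential Risk State Augmentation, cf. (8)-(10).

    l        -- raw LiDAR ranges, assumed evenly spaced over the field of view
    fov_deg  -- LiDAR field of view phi_f in degrees
    eta      -- distance processing parameter, eta in (0, 1)
    k        -- sampling window size for max pooling
    """
    n = len(l)
    # Beam angles theta_i measured from the start of the scan, in [0, phi_f].
    theta = np.linspace(0.0, fov_deg, n)
    beta = 180.0 / fov_deg                       # maps [0, phi_f] onto [0, 180] deg
    alpha = np.sin(np.deg2rad(beta * theta))     # direction weight, peaks straight ahead
    h = alpha * eta ** l                         # augmented per-beam risk, cf. (8)
    # Max pooling keeps the highest threat in each of the m = n/k regions, cf. (10).
    return h.reshape(-1, k).max(axis=1)

# Example: 190 beams compressed into a 38-dimensional LiDAR state z.
z = prsa(np.random.uniform(0.1, 8.0, size=190))
```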

Fig. 2 Comparison of risk areas under different LiDAR state representations. (a) Potential risk state augmentation. (b) Linear normalization. (c) Reciprocal mapping normalization. (d) Example of a scenario illustrating mapless navigation

Our comparative analysis, based on the example scenario in Fig. 2(d), demonstrates the effectiveness of our method compared to traditional preprocessing techniques, using eight evenly spaced laser scan values as an example. While linear normalization and reciprocal mapping highlight high-threat obstacles (shown as dark areas) at close range, they fail to differentiate threat levels across different directions (see Fig. 2(b) and (c)). In contrast, the Potential Risk State Augmentation method integrates both angle and distance information into the state, enhancing the agent’s environmental understanding. When an obstacle is both close and directly ahead, it is considered more threatening, prompting a heightened response from the agent (see Fig. 2(a)), while less threatening obstacles exert a smaller influence on decision-making.

4.2 Mapless navigation with augmented LiDAR state

4.2.1 State and action representation

In deep reinforcement learning, the importance of state representation is evident, as it translates environmental perception information into a format that is interpretable by the agent, directly impacting its decision-making ability and the generalizability of policies. Figure 1 illustrates the state \(s_{t} \in \mathcal {S}\) of the DRL agent at time t, which comprises the LiDAR state, goal state, obstacle state, and current velocity. The specific formulations are described below.

  • LiDAR State \(\varvec{z}\): PRSA is applied to preprocess the feature vector derived from the raw laser scans \(\varvec{l} = [l_{1}, l_{2},..., l_{n}] \in \mathbb {R}^{n}\), which describes the obstacle information perceived by the robot. In this article, the LiDAR sensor in the simulation has a field of view (FOV) of \(190^{\circ }\), an angular resolution of \(1^{\circ }\) (equivalent to 190 laser beams), and a scanning range of 8m. Given the physical specifications of the sensor on the robot, we set the distance processing parameter \(\eta = e^{-1}\) and the sampling window size \(k = 5\) for PRSA. The resulting LiDAR state is \(\varvec{z} = [z_{1}, z_{2},..., z_{38}] \in \mathbb {R}^{38}\).

  • Goal State \(\varvec{x}_{g}\): This state denotes the relative distance \(d_{g}\) and angle \(\varphi _{g}\) between the robot and the target in the world frame, typically obtained through localization techniques such as simultaneous localization and mapping (SLAM) or global positioning system (GPS). The relative distance is defined as:

    $$\begin{aligned} d_{g} = \Vert \varvec{{p}}_{t} - \varvec{{p}}_{g}\Vert , \end{aligned}$$
    (11)

    where \(\varvec{p}_{t} = [{p}_{t}^{x}, {p}_{t}^{y}] \in \mathbb {R}^{2}\) denotes the robot’s position at time t, and \(\varvec{{p}}_{g}=[{p}_{g}^{x}, {p}_{g}^{y}] \in \mathbb {R}^{2}\) denotes the target’s position. Correspondingly, the relative angle is calculated using:

    $$\begin{aligned} \varphi _{g} = {\textbf {atan2}}(p_{g}^{y}-p_{t}^{y}, p_{g}^{x}-p_{t}^{x}) - p_{t}^{w} \end{aligned}$$
    (12)

    where \(p_{t}^{w}\) refers to the heading angle of the robot at time \(t\). After normalization, the goal state is \(\varvec{x}_{g} = [d_{g}, \varphi _{g}] \in \mathbb {R}^{2}\).

  • Obstacle State \(\varvec{x}_{o}\): This state denotes the distance \(d_{o}\) and angle \(\varphi _{o}\) between the robot and the nearest obstacle, compensating for the absence of distance information caused by PRSA. The Obstacle State is normalized and expressed as \(\varvec{x}_{o} = [d_{o}, \varphi _{o}] \in \mathbb {R}^{2}\).

  • Current Velocity \(\varvec{u}_{t}\): This state represents the linear velocity \(v_{t}\) and angular velocity \(w_{t}\) of the robot, and is given as \(\varvec{u}_{t} = [v_{t}, w_{t}] \in \mathbb {R}^{2}\).

The comprehensive state representation for the mapless navigation task is thus \(s_{t} = [\varvec{z}, \varvec{x}_{g}, \varvec{x}_{o}, \varvec{u}_{t}] \in \mathbb {R}^{44}\). This configuration accounts for potential risks posed by obstacles during navigation while also including information about the nearest obstacle, the target, and the robot’s speed, ensuring the agent has access to all necessary data for efficient navigation.
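As an illustration of how the state vector is assembled, the following sketch computes the goal state via (11)-(12) and concatenates the four components; the angle wrapping and the normalization constants are assumptions, since the exact normalization is not specified here.

```python
import numpy as np

def goal_state(p_t, p_g, heading, d_max=8.0):
    """Relative distance and angle to the goal, cf. (11)-(12), normalized."""
    d_g = np.linalg.norm(p_t - p_g)
    phi_g = np.arctan2(p_g[1] - p_t[1], p_g[0] - p_t[0]) - heading
    phi_g = np.arctan2(np.sin(phi_g), np.cos(phi_g))   # wrap to (-pi, pi]
    return np.array([d_g / d_max, phi_g / np.pi])

def build_state(z, x_g, x_o, u_t):
    """Concatenate LiDAR, goal, obstacle, and velocity states into s_t in R^44."""
    return np.concatenate([z, x_g, x_o, u_t])
```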

We implement continuous control in our DRL model to allow for nuanced and precise navigation, which affords smoother and more accurate movements compared to discrete control schemes. The action \(\varvec{a} = [a_{v}, a_{w}] \in \mathbb {R}^{2}\) is mapped to the robot’s linear velocity \(v_{t} \in [v_{min}, v_{max}]\) and angular velocity \(w_{t} \in [w_{min}, w_{max}]\) using the following transformation:

$$\begin{aligned} \left[ \begin{matrix} v_{t}\\ w_{t} \end{matrix} \right] = \left[ \begin{matrix} v_{min}+(1 + a_{v}) \cdot (v_{max}-v_{min})/2 \\ w_{min}+(1 + a_{w}) \cdot (w_{max}-w_{min})/2 \end{matrix} \right] . \end{aligned}$$
(13)

According to the kinematic characteristics of the industrial autonomous mobile robot, the specific ranges for linear and angular velocities in this article are set to \(v \in [0, 0.5]\) m/s and \(w \in [-0.8, 0.8]\) rad/s, respectively.
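The mapping (13) from policy outputs in \([-1, 1]\) to velocity commands reduces to a simple affine transform, sketched below with the velocity limits stated above.

```python
def action_to_velocity(a_v, a_w, v_min=0.0, v_max=0.5, w_min=-0.8, w_max=0.8):
    """Map policy outputs in [-1, 1] to linear/angular velocity commands, cf. (13)."""
    v = v_min + (1.0 + a_v) * (v_max - v_min) / 2.0
    w = w_min + (1.0 + a_w) * (w_max - w_min) / 2.0
    return v, w
```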

4.2.2 Adaptive safety optimization reward shaping

In DRL, addressing the challenge of sparse rewards often involves decomposing complex tasks into multiple subtasks and introducing intermediate rewards for each. During navigation, it is crucial for the robot to concurrently monitor the location of the target and the relative orientation of the nearest obstacles [41]. To effectively balance reaching the goal with avoiding obstacles, we introduce an intermediate reward mechanism with adaptive weighting, contingent on the collision risk of the current state. This method dynamically prioritizes goal attainment and obstacle avoidance based on the current risk level.

The reward function is formulated as follows:

$$\begin{aligned} r(s_{t}, a_{t})= {\left\{ \begin{array}{ll} {r_{success},} & \quad \text {if reach goal}\\ {r_{collision},} & \quad \text {if collision}\\ {r_{dense},} & \quad \text {otherwise}\\ \end{array}\right. } \end{aligned}$$
(14)

where \(r_{success}\) denotes the positive reward when the agent successfully achieves its goal, and \(r_{collision}\) represents the penalty incurred upon collision. The dense intermediate reward \(r_{dense}\) comprises three components:

Distance reward \(r_{close}\): This incentivizes the agent to minimize the distance to the target, defined as:

$$\begin{aligned} r_{close} = \kappa _{1} \cdot (d_{t-1} - d_{t}), \end{aligned}$$
(15)

where \(d_{t} = \Vert \varvec{{p}}_{t} - \varvec{{p}}_{g}\Vert \) denotes the distance between the robot and the target at time t and \(\kappa _{1}\) is a weight parameter.

Obstacle reward \(r_{penalty}\): This discourages proximity to obstacles, defined as:

$$\begin{aligned} r_{penalty}= {\left\{ \begin{array}{ll} {\kappa _{2} \cdot (\kappa _{3} - d_{min}),} & \quad \text {if}\, {d_{min} \le \kappa _{o}}\\ {0,} & \quad \text {else,}\\ \end{array}\right. } \end{aligned}$$
(16)

where \(\kappa _{2}\) and \(\kappa _{3}\) are constants adjusting the reward magnitude, \(d_{min}\) is the minimum laser scan reading (i.e., the distance to the nearest obstacle), and \(\kappa _{o} \in (0,\kappa _{3}]\) is the threshold distance below which proximity to an obstacle starts to incur a penalty.

Goal-oriented adaptive reward \(r_{adaptive}\): This encourages the robot to adapt its behavioral strategies in response to varying risk levels, calculated as:

$$\begin{aligned} r_{adaptive} = (1 - \rho (s)) \cdot h_{orient} - \rho (s) \cdot h_{risk}, \end{aligned}$$
(17)

where \(h_{risk}\) denotes the risk posed by the nearest obstacle (computed from (8)), and \(h_{orient}\) represents the constraint reward associated with the agent’s forward direction, calculated as follows:

$$\begin{aligned} h_{orient} = \cos (\kappa _{4} \cdot {\varphi _{g}})-\kappa _{5}, \end{aligned}$$
(18)

with \(\kappa _{4}\) and \(\kappa _{5}\) fine-tuning the reward scale, and \(\varphi _{g}\) the relative angle between the robot’s heading and the goal, as defined in (12).

The collision risk factor \(\rho (s)\) is a crucial determinant of the adaptive reward and is computed from the distance to the nearest obstacle \(d_{o}(s)\):

$$\begin{aligned} \rho (s)= {\left\{ \begin{array}{ll} \text {1,} & \quad {d_{o}(s) \le D_{risk}}\\ {\dfrac{d_{o}(s) - D_{safe}}{D_{risk} - D_{safe}},} & \quad {D_{risk}<d_{o}(s)<D_{safe}}\\ {0,} & \quad {d_{o}(s) \ge D_{safe}}\\ \end{array}\right. } \end{aligned}$$
(19)

where \(D_{safe}\) and \(D_{risk}\) denote the safe and risk distance thresholds respectively. In instances where the value of \(\rho (s)\) is small, signifying that the agent is within a safe range, the angle constraint promotes movement in the desired angular direction. Conversely, when \(\rho (s)\) is large, indicating the agent is in a hazardous zone, the reward encourages the agent to prioritize obstacle avoidance, thereby mitigating the danger level associated with the nearest obstacle.

Table 1 Dense Intermediate Reward Constant Coefficients Setting

Combining these rewards, the final dense intermediate reward is defined as \(r_{dense} = r_{close} + r_{penalty} + r_{adaptive}\). The constant coefficients are related to the robot’s size, motion model and risk area settings. They are used to constrain rewards to an appropriate range based on the robot’s physical properties, as shown in Table 1.
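To summarize the extrinsic reward computation, a sketch of the dense intermediate reward (15)-(19) for non-terminal steps is given below; the \(\kappa \) coefficients and the distance thresholds are placeholders for the settings listed in Table 1.

```python
import numpy as np

def collision_risk(d_o, d_risk, d_safe):
    """Collision risk factor rho(s), cf. (19)."""
    if d_o <= d_risk:
        return 1.0
    if d_o >= d_safe:
        return 0.0
    return (d_o - d_safe) / (d_risk - d_safe)

def dense_reward(d_prev, d_curr, d_min, d_o, phi_g, h_risk,
                 k1, k2, k3, k4, k5, k_o, d_risk, d_safe):
    """Dense intermediate reward r_dense = r_close + r_penalty + r_adaptive.

    d_min and d_o both denote the distance to the nearest obstacle; they are
    kept separate only to mirror the notation of (16) and (19).
    """
    r_close = k1 * (d_prev - d_curr)                          # distance reward (15)
    r_penalty = k2 * (k3 - d_min) if d_min <= k_o else 0.0    # obstacle reward (16)
    rho = collision_risk(d_o, d_risk, d_safe)
    h_orient = np.cos(k4 * phi_g) - k5                        # orientation term (18)
    r_adaptive = (1.0 - rho) * h_orient - rho * h_risk        # adaptive reward (17)
    return r_close + r_penalty + r_adaptive
```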

Fig. 3 Architecture for intrinsic curiosity module

4.3 Enhance exploration with intrinsic reward

In mapless navigation tasks, agents often confront high-dimensional and complex state spaces, necessitating robust exploration capabilities. Enhanced exploration enables agents to develop a comprehensive understanding of state relationships, which is crucial for effective navigation. To this end, we utilize state novelty estimates as intrinsic rewards to incentivize exploration, thereby acquiring diverse and rich training data to enhance learning efficiency and generalization.

The Intrinsic Curiosity Module (ICM) [42] is adopted for state novelty estimation, as depicted in Fig. 3. The ICM consists of two main components: the forward prediction model and the inverse model, which collaboratively generate exploration-driven learning signals by encouraging the discovery of novel states and behaviors.

The inverse model within the ICM aims to estimate the agent’s actions from consecutive state features, enhancing the agent’s understanding of cause-and-effect relationships within the environment. The loss function \(\mathcal {L}_{IM}\) measures the disparity between estimated and actual actions, formulated as:

$$\begin{aligned} \underset{\psi _{I}}{{\textbf {min}}} \mathcal {L}_{IM}(a_{t}, \hat{a}_{t})= \frac{1}{2} {\left\| a_{t} - \hat{a}_{t} \right\| }^{2}_{2}. \end{aligned}$$
(20)

Conversely, the forward prediction model focuses on predicting the next state’s features \(\phi _{t+1}\) based on the current state and action. This model generates intrinsic rewards by assessing the novelty of state transitions, propelling the agent to explore unfamiliar states. The prediction error from the forward model, which forms the basis of the intrinsic reward, is computed using \(\mathcal {L}_{FM}\):

$$\begin{aligned} \underset{\psi _{F}}{{\textbf {min}}} \mathcal {L}_{FM}(\phi _{t+1}, \hat{\phi }_{t+1})= \frac{1}{2} {\left\| {\phi _{t+1}} - \hat{\phi }_{t+1} \right\| }^{2}_{2}. \end{aligned}$$
(21)

The intrinsic reward \(r^{in}_{t}\), derived from the prediction error of the forward model, encourages exploration of novel states:

$$\begin{aligned} r_{t}^{in} = \frac{\lambda }{2} {\left\| {\phi _{t+1}} - \hat{\phi }_{t+1} \right\| }^{2}_{2}, \end{aligned}$$
(22)

where \(\lambda \) is a scaling factor used to balance intrinsic and extrinsic rewards. The novelty of the current state is evaluated through the prediction error of the forward model. A smaller prediction error implies that the agent is familiar with the current environmental state, indicating lower state novelty. Conversely, a higher prediction error suggests a greater level of state novelty, thereby increasing the intrinsic reward and motivating more proactive exploration.

By introducing state novelty estimation as an auxiliary task, the agent is trained to maximize the combined reward \(r_{t} = r_{t}^{in} + r_{t}^{ex}\), i.e., the sum of the intrinsic reward \(r_{t}^{in}\) and the extrinsic reward \(r_{t}^{ex}\) defined in (14). The overarching optimization goal integrates maximizing the policy objective (4) with minimizing the prediction model’s loss:

$$\begin{aligned} \underset{\zeta ,\psi _{F},\psi _{I}}{{\textbf {max}}} \left[ \mathcal {J}(\pi ) - \sigma \mathcal {L}_{FM} - (1 - \sigma ) \mathcal {L}_{IM}\right] , \end{aligned}$$
(23)

where \(0 \le \sigma \le 1\) adjusts the weight between \(\mathcal {L}_{FM}\) and \(\mathcal {L}_{IM}\).
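A compact sketch of the intrinsic reward (22) and the weighted ICM loss in (23) follows; it assumes that the feature vectors and predictions have already been produced by the networks described in Section 4.4.

```python
import torch

def icm_losses(a, a_pred, phi_next, phi_next_pred, lam, sigma):
    """Intrinsic reward and weighted ICM loss, cf. (20)-(23)."""
    # Inverse model loss (20): how well the action is recovered from phi_t, phi_{t+1}.
    l_im = 0.5 * ((a - a_pred) ** 2).sum(dim=-1).mean()
    # Forward model loss (21): prediction error of the next state's features.
    fwd_err = 0.5 * ((phi_next - phi_next_pred) ** 2).sum(dim=-1)
    l_fm = fwd_err.mean()
    # Intrinsic reward (22): scaled per-transition forward prediction error.
    r_in = lam * fwd_err.detach()
    # Weighted ICM loss, minimized jointly with the policy objective, cf. (23).
    icm_loss = sigma * l_fm + (1.0 - sigma) * l_im
    return r_in, icm_loss
```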

Fig. 4 Neural network for mapless navigation

Fig. 5 Simulated robot and world used for training. (a) Simulated robot in Gazebo; (b) Simulated world used for training

4.4 Neural network architecture

The actor-critic network architecture is designed to balance computational efficiency with the ability to manage the complexities of mapless navigation, as illustrated in Fig. 4. This network employs fully connected layers to effectively process high-dimensional state inputs, thus facilitating rapid and precise decision-making for real-time navigation.

Notably, the policy network outputs the mean and variance of the linear and angular velocities, modeled as a Gaussian distribution. During the training phase, actions are sampled from this distribution to ensure a diverse range of exploratory maneuvers. Conversely, during testing and practical applications, the mean of the distribution is used to determine the robot’s actions, emphasizing stability and consistency.

The state novelty estimation model within our network consists of two integral components: the inverse model and the forward prediction model. The state input \(s_t\) initially passes through two fully connected layers, with 512 and 256 units respectively, to derive a 64-dimensional feature representation \(\phi _t\). This representation \(\phi _t\) is then utilized in both the inverse and forward models:

  • In the inverse model, \(\phi _t\) and \(\phi _{t+1}\) are concatenated and processed through a 256-unit fully connected layer, which outputs the predicted action \(\hat{a}_t\) via a tanh activation function.

  • The forward model concatenates the feature \(\phi _t\) and the action \(a_t\), and inputs them into a 128-unit fully connected layer. This layer predicts the next state’s feature \(\hat{\phi }_{t+1}\), crucial for calculating the intrinsic reward.
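The layer sizes above translate into the following PyTorch sketch; the hidden-layer activations and the final linear projection to the 64-dimensional feature are assumptions, since only the layer widths and the output tanh are specified.

```python
import torch
import torch.nn as nn

class ICMNetworks(nn.Module):
    """Feature encoder, inverse model, and forward model with the stated layer sizes."""

    def __init__(self, state_dim=44, action_dim=2, feature_dim=64):
        super().__init__()
        # State encoder: 512- and 256-unit fully connected layers to a 64-d feature phi_t.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, feature_dim),
        )
        # Inverse model: concat(phi_t, phi_{t+1}) -> 256 units -> predicted action (tanh).
        self.inverse = nn.Sequential(
            nn.Linear(2 * feature_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim), nn.Tanh(),
        )
        # Forward model: concat(phi_t, a_t) -> 128 units -> predicted phi_{t+1}.
        self.forward_model = nn.Sequential(
            nn.Linear(feature_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, feature_dim),
        )

    def forward(self, s, a, s_next):
        phi, phi_next = self.encoder(s), self.encoder(s_next)
        a_pred = self.inverse(torch.cat([phi, phi_next], dim=-1))
        phi_next_pred = self.forward_model(torch.cat([phi, a], dim=-1))
        return phi_next, phi_next_pred, a_pred
```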

This architecture, supported by intrinsic reward-driven exploration, enables the agent to effectively learn navigation strategies within a simulated environment. The training process is outlined in Algorithm 1, detailing the initialization of network parameters and the iterative learning steps undertaken by the agent.

Algorithm 1 Learn to navigate with intrinsic reward.

5 Experiment

In this section, we demonstrate the effectiveness of our proposed method through a series of simulations and real-world experiments. The experimental results highlight significant improvements in navigation performance and generalization capabilities enabled by our approach.

Table 2 Hyperparameters setting
Fig. 6 Simulated environments used for testing. (a) SEnv0: \(13\times 13\textrm{m}^{2}\) empty room; (b) SEnv1: \(15\times 15\textrm{m}^{2}\) indoor environment with relatively few obstacles; (c) SEnv2: \(8\times 15\textrm{m}^{2}\) crowded indoor environment with a higher density of obstacles; (d) SEnv3: \(20 \times 20\textrm{m}^{2}\) open environment without wall constraints; (e) DEnv0: Complex scenarios with dynamic obstacles

5.1 Training setup

For the training and testing of our models, we employ the Gazebo simulator integrated with the Robot Operating System (ROS) platform. This setup facilitates a seamless sim-to-real transition, essential for validating the practical applicability of our methods. To demonstrate the enhanced generalization capability of our approach, we trained the robot (shown in Fig. 5(a)) only in the \(10 \times 10\textrm{m}^{2}\) static environment depicted in Fig. 5(b). This controlled setting ensures that the improvements can be directly attributed to the efficacy of the proposed method. All models in the experiment share the same network architecture, and the hyperparameters specific to the proposed method are detailed in Table 2. These hyperparameters were chosen based on insights from prior literature [6, 38] and fine-tuning specific to our model. Additionally, the coefficients for the state representation and reward functions, as well as the network structure parameters, are explained in Section 4.

Fig. 7 Average external reward and total navigation risk during training

5.2 Comparative experiment on reward effectiveness

To examine the impact of intrinsic rewards on generalization performance and the role of the adaptive weight reward function, we trained agents with Intrinsic Reward SAC (IRSAC) and standard SAC under three distinct intermediate reward settings: (1) R1: adaptive safety optimization reward; (2) R2: safety optimization reward with fixed weights; (3) R3: a common intermediate reward including only the distance reward [9, 31].

The key indicators for assessing the efficacy of various algorithms and reward structures were the average extrinsic reward (\(R_{Ae}\)) and the total navigation risk:

$$\begin{aligned} R_{Ae} = \frac{1}{N_{s}} \sum _{j=1}^{N_{s}} r_{j}^{ex} \end{aligned}$$
(24)

where \(r_{j}^{ex}\) denotes the extrinsic reward received at each step, and \(N_{s}\) is the total number of steps in the episode. Additionally, the total navigation risk is the sum of the risk index (\(R_{I}\)) for each episode, defined as:

$$\begin{aligned} R_{I} = {\left\{ \begin{array}{ll} \text {100,} & \quad \text {if collision}\\ {0,} & \quad {d_{o}(s) \ge D_{risk}}\\ {\dfrac{d_{o}(s) - D_{risk}}{D_{coll} - D_{risk}},} & \quad \text {else}.\\ \end{array}\right. } \end{aligned}$$
(25)

Here, \(d_{o}(s)\) is the distance to the nearest obstacle, and \(D_{coll}\) is the default collision distance. In the simulation we set \(D_{coll}=0.35\) m.
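For reference, the per-step risk index (25) can be computed as follows; how collisions are detected is left to the simulator and is represented here only by a boolean flag.

```python
def risk_index(d_o, collided, d_risk, d_coll=0.35):
    """Per-step risk index R_I, cf. (25); summed over an episode for total risk."""
    if collided:
        return 100.0
    if d_o >= d_risk:
        return 0.0
    return (d_o - d_risk) / (d_coll - d_risk)
```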

To ensure robust evaluation, we computed the average reward and risk index values every 12 episodes. Post-training, the models’ generalization and navigation behaviors were assessed across four different static environments (Fig. 6(a) to (d)). Each model completed the same tasks in each map, with evaluations based on average performance across the maps. Metrics included success rate (SR), collision rate (CR), loss rate (LR), and mean navigation risk (MNR), providing a comprehensive assessment of each model’s navigation capabilities in diverse environments.

Training results illustrated in Fig. 7 show that regardless of the specific reward function, incorporating intrinsic rewards during training enhances the average reward level and accelerates the learning process. The intrinsic rewards facilitate a wider range of exploratory behaviors, which is critical in environments where external rewards are sparse. This expanded exploration allows the agent to encounter and learn from a broader spectrum of states, significantly leveraging the PRSA’s capability to fine-tune responses based on the perceived risks from nearby obstacles. Test results in Table 3 indicate that models incorporating intrinsic rewards demonstrate improved adaptability to novel environments and tasks, evident in higher SRs and lower LRs compared to models trained without intrinsic rewards. The adaptability improvements are attributable to the richer exploratory experiences during training, allowing the agent to develop more versatile strategies, thus enhancing performance across diverse environments and tasks.

Furthermore, the results affirm that the proposed adaptive safety optimization reward significantly enhances decision-making safety and efficiency. Compared to the other reward functions, this reward strategy shows reduced navigation risk during training (see Fig. 7) and a smaller MNR during testing (see Table 3). Unlike traditional rewards that primarily focus on avoiding obstacles through penalties for collisions and close encounters, the adaptive safety optimization reward offers detailed feedback that tailors navigation and obstacle avoidance strategies to different environmental areas. When the agent operates within safe areas, rewards incentivize achieving goals quickly. Conversely, the reward system shifts to prioritize risk reduction in hazardous areas. This adaptive reward structure encourages the agent to adopt more effective strategies based on the current obstacle distribution, thereby reducing the possibility of collisions.

Table 3 Test results of different models
Fig. 8 Trajectories of DRL agents trained with various state representations in SEnv1, SEnv2, and SEnv3. (a) LN, (b) Raw, (c) Rec, and (d) Proposed PRSA. The blue triangle denotes where the agent loses its target, while the blue cross indicates a collision at that position. The black trajectory illustrates the path taken when manual intervention is necessary due to exceeding time limits or a collision occurrence

5.3 Comparative experiment on generalization of state representation

This subsection investigates the influence of different state representations on the generalization performance in mapless navigation. We compare three standard state representation methods against our proposed Potential Risk State Augmentation (PRSA): (1) Raw dense laser scans, incorporating goal state and velocity (Raw); (2) Compressed linear normalized input, incorporating goal state and velocity (LN); (3) Compressed reciprocal function input, incorporating goal state and velocity (Rec). Representative trajectories for each representation are shown in Fig. 8.

Fig. 9 Test success rate curves for different state representations across training and testing environments. Solid lines represent the average success rate of four methods, with variance shown by the shaded areas

All models were trained using Algorithm 1 with the same reward function (14), and evaluated across four different static environments as shown in Fig. 6(a-d). The test results depicted in Fig. 9 highlight substantial differences in performance among various LiDAR state representations between training and testing environments. Traditional representations (LN and Raw) exhibited the weakest performance, indicating poor generalization. The application of the reciprocal function (Rec) showed significant improvement in average performance, though it exhibited some variance across scenarios. Conversely, the PRSA method consistently achieved high success rates in unfamiliar environments, with minimal variation in performance across different settings. This suggests a superior generalization capability brought about by the detailed state information provided by PRSA.

Furthermore, we assessed the models’ obstacle avoidance capabilities in environments SEnv1, SEnv2, and SEnv3, conducting the same navigation tasks three times for consistency. The performance was quantified using four metrics: success rate (SR), collision rate (CR), lost rate (LR), and mean navigation score (MNS), calculated as follows:

$$\begin{aligned} S = {\left\{ \begin{array}{ll} {1 - \frac{P_{i}}{T_{max} \cdot v_{max}},} & \quad \text {if success}\\ {-1,} & \quad \text {else.}\\ \end{array}\right. } \end{aligned}$$
(26)

where \(P_{i}\) denotes the navigation distance of task i, \(T_{max}\) is the maximum number of time steps allowed per navigation task, and \(v_{max}\) is the maximum linear speed.
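A one-function sketch of the per-task score (26), which is averaged over tasks to obtain the MNS, is given below.

```python
def navigation_score(success, path_length, t_max, v_max):
    """Per-task navigation score S, cf. (26); averaged over tasks to obtain MNS."""
    if success:
        return 1.0 - path_length / (t_max * v_max)
    return -1.0
```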

Table 4 Test results of DRL agents trained with different state representations in test scenarios

The summarized results in Table 4 reveal that agents trained with PRSA not only achieve consistently higher SR and MNS but also exhibit significantly better obstacle avoidance, as evidenced in varied test environments. In addition, it is clear that traditional state representations (Raw and LN) often lead to suboptimal path planning, especially in open or complex environments (see Fig. 8). The primary reason behind this phenomenon is the agent’s inability to comprehend certain environmental features, resulting in the failure of the learned policy in the training environment to generalize effectively to novel environments. In contrast, Rec and PRSA demonstrate better generalization with trajectories closer to optimal paths and higher mean navigation scores. The strong generalization capabilities of Rec and PRSA stem from their efficient preprocessing of high-dimensional LiDAR data, utilizing key information for decision-making and ignoring information that has little impact on navigation and obstacle avoidance. Notably, Rec sometimes results in collisions in narrower environments, whereas PRSA enhances the agent’s hazard perception capability, ensuring robust performance across all tested scenarios. This underlines the effectiveness of PRSA in providing comprehensive state information, which significantly improves navigation and generalization across diverse environments.

5.4 Comprehensive comparison on navigation performance

The previous experiments comparatively analyzed the influence of reward function and state representation. In this subsection, we evaluate the proposed DRL-based real-time navigation method within a complex environment featuring narrow passages and dynamic obstacles, as depicted in Fig. 6(e). These dynamic obstacles are programmed to move in predetermined directions at specified speeds v.

Fig. 10 Performance testing results for different navigation methods

Fig. 11 Mean navigation risk and score for different navigation methods in the navigation performance test

We benchmarked the proposed method against several representative navigation techniques: (1) APF (Artificial Potential Field), a conventional method often used in mapless navigation, (2) MoveBase, a classic navigation baseline widely used for autonomous navigation of industrial AMRs, (3) IPAPREC-SAC [9], which integrates adaptive parameter reciprocal input preprocessing with the SAC algorithm for enhanced generalization, (4) SAC-SP [28], an optimized real-time navigation method using an enhanced SAC algorithm with a compound auxiliary reward, modified to match the differential-drive kinematics model and observations used in this study, (5) ITD3-CLN [30], which utilizes an improved TD3 algorithm and a channel-based convolutional laser network architecture to improve obstacle avoidance and generalization.

Three scenarios were set up to comprehensively assess the navigation performance of each method: (1) S: static environment where obstacles remain stationary (\(v=0\) m/s); (2) D1: dynamic environment where obstacles move at a slow speed (\(v=0.2\) m/s); (3) D2: dynamic environment where obstacles move at a faster speed (\(v=0.4\) m/s). In each scenario the same navigation task was performed, and performance was assessed using success rate (SR), collision rate (CR), mean navigation risk (MNR), and mean navigation score (MNS).

The results summarized in Figs. 10 and 11 reveal several key insights. Traditional planning methods such as APF and MoveBase often struggle to navigate narrow environments, exhibiting low SR and high MNR. This challenge arises from their dependence on predetermined trajectories or potential fields, which lack adaptability and often lead to local optima, so these methods adapt poorly to narrow spaces and dynamic environments. In contrast, DRL-based methods consistently outperform traditional planners by mitigating local minima issues and demonstrating superior generalization across dynamic conditions.

Moreover, our method shows the highest SR and MNS across all scenarios, indicating robust generalization and efficient navigation even under dynamic obstacle conditions. This performance highlights the effectiveness of integrating PRSA with adaptive safety optimization rewards, balancing target achievement with obstacle avoidance. Furthermore, our method achieves the lowest MNR, attributed to the designed reward function that incentivizes the agent to adopt safer navigation behaviors. It is noteworthy that as obstacle speed increases, the collision rate rises across all methods, underscoring the heightened challenges in perception and decision-making. Compared to other navigation methods, our method maintains the lowest CR, showcasing its ability to adaptively manage risk and preserve navigation integrity under varied environmental dynamics.

Table 5 Real-World Validation of Navigation Performance
Fig. 12 Real-world test scenarios and motion trajectories. The black trajectory represents manual intervention in case of navigation failure. A: RE1, a dense obstacle scenario; B: RE2, a narrow scenario

5.5 Real-world validation of navigation performance

In this subsection, we verify the performance of the proposed DRL-based real-time navigation model in the real world. As in the simulation tests, we compare the navigation performance of different state representations, using the traditional MoveBase planner as a benchmark.

5.5.1 Experimental setup

The real-world experiments utilize an industrial composite robot equipped with SLAMTEC S1 ToF-based LiDAR, featuring a 190\(^{\circ }\) field of view and an angular resolution of 0.26\(^{\circ }\). Policies trained in simulated environments are directly deployed on this robot to evaluate the transferability of our model. The robot’s maximum linear speed is capped at 0.25 m/s for safety reasons. The GMapping SLAM algorithm is used to create maps of the testing environment, aiding in robot localization via the AMCL algorithm. It is important to note that these maps are solely for localization and not for trajectory planning.

Two real-world scenarios assess navigation performance. In each, the robot starts from an initial point and aims to reach multiple targets sequentially before returning to the starting position. Manual intervention is employed when the robot fails to complete a task. The primary metrics for evaluation included success rate (SR), collision rate (CR), mean navigation score (MNS), mean navigation distance (MND) and mean navigation time (MNT).

5.5.2 Results and analysis

The real-world test results, as shown in Table 5, corroborate our simulation findings, demonstrating that DRL-based navigation methods (especially those utilizing advanced state representations) significantly outperform the classic navigation baseline MoveBase in terms of success rates and navigation efficiency. MoveBase’s performance limitations stem from its challenges in executing precise maneuvers and avoiding obstacles in confined spaces, further impacted by the completeness and accuracy of its map data. The slight variance in performance among PRSA, Rec, LN, and Raw highlights the sensitivity of DRL models to different state representations in complex real-world environments. Our proposed method achieves superior performance, showcasing the highest success rates and navigation scores in both environments. Although the mean navigation distance in RE2 is slightly longer, the motion trajectories (see Fig. 12) indicate that the proposed method maintains a larger clearance from obstacles, thereby enhancing navigational safety at the cost of a slightly longer path. Moreover, our method completes the navigation task more quickly than Rec. LN and Raw showed slightly inferior performance, with Raw choosing an unnecessarily extended route to reach the goal during the mission in RE1 (\(4 \rightarrow 5\)). This issue may arise from significant disparities in obstacle distribution between the real environment and the training environment, making the learned policies difficult to generalize and occasionally leading to misinterpretation of the current state.

In summary, these real-world experimental results confirm the successful transfer of our simulated training to practical applications. This success is primarily attributed to our preprocessing of original sensor data. Unlike utilizing raw high-dimensional laser observations, this preprocessing method effectively captures essential navigation and obstacle avoidance information while minimizing the impact of extraneous data. Additionally, our adaptive safety optimization reward function incentivizes the robot to balance navigation and obstacle avoidance requirements based on current threat levels, thereby reducing collision risks during navigation. The use of 3D simulator training models, which simulate sensor noise and accurately model robot kinematics, bridges the gap between simulation and reality. This comprehensive approach enables the successful deployment of our strategy in real-world scenarios.

6 Conclusion

In this article, we presented a DRL-based mapless navigation scheme for industrial AMRs with strong generalization capabilities. To mitigate the negative impact of traditional LiDAR state representations on the generalization performance of DRL agents, we introduced a potential risk state augmentation method, which enhances the generalization performance of the trained policies. Additionally, an adaptive weight was introduced into the reward function to balance navigation efficiency and safety, complemented by an intrinsic reward to boost the exploration efficiency of the DRL agent. Extensive experimental results illustrate a significant advantage of the proposed method in terms of navigation and generalization performance. The proposed PRSA enhances the navigation success rate by about 17% in previously unseen environments. Furthermore, compared with representative navigation techniques, our method not only maintained a success rate exceeding 90% in challenging narrow and dynamic settings but also reduced the mean navigation risk by approximately 44% relative to advanced end-to-end methods. In real-world tasks, our method significantly outperformed the classical navigation baseline MoveBase, increasing the success rate by approximately 33%.

However, a critical limitation of our research is the lack of enforced safety constraints. Although our adaptive safety optimization reward function mitigates the risk of collision to a certain extent, it only guides the agent to learn safe navigation behaviors and cannot forcibly restrict risky actions. Navigation tasks in real-world safety-critical scenarios necessitate additional mandatory safety constraints. Future research endeavors will focus on integrating model-based control techniques to further enhance navigation safety in complex environments. Specifically, we plan to incorporate partial map information and visual observations to augment the agent’s perception capabilities, thereby improving the navigation performance of industrial robots in intricate dynamic environments. Additionally, exploring the integration of real-time adaptive learning mechanisms could further bolster the robustness and reliability of navigation strategies in unpredictable industrial settings.