ChauffeurNet: Learning to Drive by Imitating the Best and Synthesizing the Worst
2. Related Work

Decades-old work on ALVINN (Pomerleau (1989)) showed how a shallow neural network could follow the road by directly consuming camera and laser range data. Learning to drive in an end-to-end manner has seen a resurgence in recent years. Recent work by Chen et al. (2015) demonstrated a convolutional net to estimate affordances, such as the distance to the preceding car, that could be used to program a controller to control the car on the highway. Researchers at NVIDIA (Bojarski et al. (2016, 2017)) showed how to train an end-to-end deep convolutional neural network that steers a car by consuming camera input. Xu et al. (2017) trained a neural network for predicting discrete or continuous actions also based on camera inputs. Codevilla et al. (2018) also train a network using camera inputs and conditioned on high-level commands to output steering and acceleration. Kuefler et al. (2017) use Generative Adversarial Imitation Learning (GAIL) with simple affordance-style features as inputs to overcome cascading errors typically present in behavior-cloned policies so that they are more robust to perturbations. Recent work from Hecker et al. (2018) learns a driving model using 360-degree camera inputs and a desired route planner to predict steering and speed. The CARLA simulator (Dosovitskiy et al. (2017)) has enabled recent work such as Sauer et al. (2018), which estimates several affordances from sensor inputs to drive a car in a simulated urban environment. Using mid-level representations in a spirit similar to our own, Müller et al. (2018) train a system in simulation using CARLA by training a driving policy from a scene segmentation network to output high-level control, thereby enabling transfer learning to the real world using a different segmentation network trained on real data.
Pan et al. (2017) also describe achieving transfer of an agent trained in simulation to the real world using a learned intermediate scene labeling representation. Reinforcement learning may also be used in a simulator to train drivers on difficult interactive tasks such as merging, which require a lot of exploration, as shown in Shalev-Shwartz et al. (2016). A convolutional network operating on a space-time volume of bird's-eye-view representations is also employed by Luo et al. (2018); Djuric et al. (2018); Lee et al. (2017) for tasks like 3D detection, tracking and motion forecasting. Finally, there exists a large volume of work on vehicle motion planning outside the machine learning context, and Paden et al. (2016) present a notable survey.

3. Model Architecture

3.1 Input Output Representation

We begin by describing our top-down input representation that the network will process to output a drivable trajectory. At any time t, our agent (or vehicle) may be represented in a top-down coordinate system by (p_t, θ_t, s_t), where p_t = (x_t, y_t) denotes the agent's location or pose, θ_t denotes the heading or orientation, and s_t denotes the speed. The top-down coordinate system is picked such that our agent's pose p_0 at the current time t = 0 is always at a fixed location (u_0, v_0) within the image. For data augmentation purposes during training, the orientation of the coordinate system is randomly picked for each training example to be within an angular range of θ_0 ± Δ, where θ_0 denotes the heading or orientation of our agent at time t = 0. The top-down view is represented by a set of images of size W × H pixels, at a ground sampling resolution of φ meters/pixel. Note that as the agent
moves, this view of the environment moves with it, so the agent always sees a fixed forward range, R_forward = (H − v_0)φ, of the world, similar to having an agent with sensors that see only up to R_forward meters forward.

Figure 1: Driving model inputs: (a) Roadmap, (b) Traffic Lights, (c) Speed Limit, (d) Route, (e) Current Agent Box, (f) Dynamic Boxes, (g) Past Agent Poses; and output: (h) Future Agent Poses.

As shown in Fig. 1, the input to our model consists of several images of size W × H pixels rendered into this top-down coordinate system. (a) Roadmap: a color (3-channel) image with a rendering of various map features such as lanes, stop signs, cross-walks, curbs, etc. (b) Traffic lights: a temporal sequence of grayscale images where each frame of the sequence represents the known state of the traffic lights at each past timestep. Within each frame, we color each lane center by a gray level, with the brightest level for red lights, an intermediate gray level for yellow lights, and a darker level for green or unknown lights. (c) Speed limit: a single-channel image with lane centers colored in proportion to their known speed limit. (d) Route: the intended route along which we wish to drive, generated by a router (think of a Google Maps-style route). (e) Current agent box: this shows our agent's full bounding box at the current timestep t = 0. (f) Dynamic objects in the environment: a temporal sequence of images showing all the potential dynamic objects (vehicles, cyclists, pedestrians) rendered as oriented boxes. (g) Past agent poses: the past poses of our agent are rendered into a single grayscale image as a trail of points.

We use a fixed-time sampling of δt to sample any past or future temporal information, such as the traffic light state or dynamic object states, in the above inputs. The traffic
lights and dynamic objects are sampled over the past T_scene seconds, while the past agent poses are sampled over a potentially longer interval of T_pose seconds. This simple input representation, particularly the box representation of other dynamic objects, makes it easy to generate input data from simulation or to create it from real-sensor logs using a standard perception system that detects and tracks objects. This enables testing and validation of models in closed-loop simulations before running them on a real car. This also allows the same model to be improved using simulated data to adequately explore rare situations such as collisions, for which real-world data might be difficult to obtain. Using a top-down 2D view also means efficient convolutional inputs, and allows flexibility to represent metadata and spatial relationships in a human-readable format.

Figure 2: Training the driving model. (a) The core ChauffeurNet model with a FeatureNet and an AgentRNN, (b) co-trained road mask prediction net and PerceptionRNN, and (c) training losses shown in blue, with green labels depicting the ground-truth data. The dashed arrows represent the recurrent feedback of predictions from one iteration to the next.

We employ an indexed representation for the roadmap and traffic lights channels to reduce the number of input channels, and to allow extensibility of the input representation to express more roadmap features or more traffic light states without changing the model architecture.
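As a concrete illustration of this rendering scheme, the past agent poses channel (Fig. 1(g)) amounts to drawing a trail of points into a single grayscale image. A minimal sketch, assuming poses have already been transformed into pixel coordinates; the function name, image size, and intensity value are illustrative, not from the paper:

```python
import numpy as np

def render_past_poses(poses_xy, width=400, height=400, value=1.0):
    """Render past agent poses, sampled at a fixed time interval dt,
    as a trail of points in a single-channel top-down image.
    Illustrative sketch; size and intensity scheme are assumptions."""
    img = np.zeros((height, width), dtype=np.float32)
    for x, y in poses_xy:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= xi < width and 0 <= yi < height:  # clip poses outside the view
            img[yi, xi] = value
    return img
```

The other channels (roadmap, route, dynamic boxes) follow the same pattern of rasterizing geometric primitives into aligned W × H images.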
Papers on testing frameworks such as Tian et al. (2018) and Pei et al. (2017) show the brittleness of using raw sensor data (such as camera images or lidar point clouds) for learning to drive, and reinforce the approach of using an intermediate input representation.

If I denotes the set of all the inputs enumerated above, then the ChauffeurNet model recurrently predicts future poses of our agent conditioned on these input images I, as shown by the green dots in Fig. 1(h):

p_{t+δt} = ChauffeurNet(I, p_t)    (1)

Figure 3: (a) Schematic of ChauffeurNet. (b) Memory updates over multiple iterations.

In Eq. (1), the current pose p_0 is a known part of the input, and the ChauffeurNet then performs N iterations and outputs a future trajectory {p_δt, p_2δt, …, p_Nδt} along with other properties such as future speeds. This trajectory can be fed to a controls optimizer that computes detailed driving control (such as steering and braking commands) within the specific constraints imposed by the dynamics of the vehicle to be driven. Different types of vehicles may possibly utilize different control outputs to achieve the same driving trajectory, which argues against training a network to directly output low-level steering and acceleration control. Note, however, that having intermediate representations like ours does not preclude end-to-end optimization from sensors to controls.

3.2 Model Design

Broadly, the driving model is composed of several parts as shown in Fig. 2. The main ChauffeurNet model shown in part (a) of the figure consists of a convolutional feature network (FeatureNet) that consumes the input data to create a digested contextual feature representation that is shared by the other networks. These features are consumed by a recurrent agent network (AgentRNN) that iteratively predicts successive points in the driving trajectory.
Each point at time t in the trajectory is characterized by its location p_t = (x_t, y_t), heading θ_t and speed s_t. The AgentRNN also predicts the bounding box of the vehicle as a spatial heatmap at each future timestep. In part (b) of the figure, we see that two other networks are co-trained using the same feature representation as an input. The Road Mask Network predicts the drivable areas of the field of view (on-road vs. off-road), while the recurrent perception network (PerceptionRNN) iteratively predicts a spatial heatmap for each timestep showing the future location of every other agent in the scene. We believe that doing well on these additional tasks using the same shared features as the main task improves generalization on the main task. Fig. 2(c) shows the various losses used in training the model, which we will discuss in detail below.

Figure 4: Software architecture for the end-to-end driving pipeline.

Fig. 3 illustrates the ChauffeurNet model in more detail. The rendered inputs shown in Fig. 1 are fed to a large-receptive-field convolutional FeatureNet with skip connections, which outputs features F that capture the environmental context and the intent. These features are fed to the AgentRNN, which predicts the next point p_k on the driving trajectory and the agent bounding box heatmap B_k, conditioned on the features F from the FeatureNet, the iteration number k ∈ {1, …, N}, the memory M_{k−1} of past predictions from the AgentRNN, and the agent bounding box heatmap B_{k−1} predicted in the previous iteration:

p_k, B_k = AgentRNN(k, F, M_{k−1}, B_{k−1})    (2)

The memory M_k is an additive memory consisting of a single-channel image. At iteration k of the AgentRNN, the memory is incremented by 1 at the location p_k predicted by the AgentRNN, and this memory is then fed to the next iteration.
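The additive memory update and the unrolled iteration of Eq. (2) can be sketched as follows, with a stub predictor standing in for the learned AgentRNN; the function names, memory shape, and the stub itself are illustrative assumptions:

```python
import numpy as np

def additive_memory_update(memory, pose_xy):
    """One memory step of Sec. 3.2: increment the single-channel
    memory image by 1 at the predicted waypoint location."""
    mem = memory.copy()
    x, y = pose_xy
    mem[int(y), int(x)] += 1.0
    return mem

def unroll(predict_fn, features, n_steps, shape=(64, 64)):
    """Unroll N iterations; predict_fn(k, F, M) stands in for the
    learned AgentRNN(k, F, M_{k-1}, B_{k-1}) of Eq. (2)."""
    memory = np.zeros(shape, dtype=np.float32)
    trajectory = []
    for k in range(1, n_steps + 1):
        pose = predict_fn(k, features, memory)
        memory = additive_memory_update(memory, pose)  # fed to next iteration
        trajectory.append(pose)
    return trajectory, memory
```

Because the memory is explicitly constructed rather than learned, each unrolled iteration sees a simple count image of where the model has already predicted waypoints.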
The AgentRNN outputs a heatmap image over the next pose of the agent, and we use the arg-max operation to obtain the coarse pose prediction p_k from this heatmap. The AgentRNN then employs a shallow convolutional meta-prediction network with a fully-connected layer that predicts a sub-pixel refinement of the pose δp_k and also estimates the heading θ_k and the speed s_k. Note that the AgentRNN is unrolled at training time for a fixed number of iterations, and the losses described below are summed together over the unrolled iterations. This is possible because of the non-traditional RNN design, where we employ an explicitly crafted memory model instead of a learned memory.

3.3 System Architecture

Fig. 4 shows a system-level overview of how the neural net is used within the self-driving system. At each time, the updated state of our agent and the environment is obtained via a perception system that processes sensory output from the real world or from a simulation environment, as the case may be. The intended route is obtained from the router, and is updated dynamically conditioned on whether our agent was able to execute past intents or not. The environment information is rendered into the input images described in Fig. 1 and given to the RNN, which then outputs a future trajectory. This is fed to a controls optimizer that outputs the low-level control signals that drive the vehicle (in the real world or in simulation).

4. Imitating the Expert

In this section, we first show how to train the model above to imitate the expert.

4.1 Imitation Losses

4.1.1 AGENT POSITION,
HEADING AND BOX PREDICTION

The AgentRNN produces three outputs at each iteration k: a probability distribution P_k(x, y) over the spatial coordinates of the predicted waypoint obtained after a spatial softmax, a heatmap of the predicted agent box at that timestep B_k(x, y) obtained after a per-pixel sigmoid activation that represents the probability that the agent occupies a particular pixel, and a regressed box heading output θ_k. Given ground-truth data for the above predicted quantities, we can define the corresponding losses for each iteration as:

L_p = H(P_k, P_k^gt)    (3)

L_B = (1/(W·H)) Σ_x Σ_y H(B_k(x, y), B_k^gt(x, y))    (4)

L_θ = ‖θ_k − θ_k^gt‖_1    (5)

where the superscript gt denotes the corresponding ground-truth values, and H(a, b) is the cross-entropy function. Note that P_k^gt is a binary image with only the pixel at the ground-truth target coordinate ⌊p_k^gt⌋ set to one.

4.1.2 AGENT META PREDICTION

The meta-prediction network performs regression on the features to generate a sub-pixel refinement δp_k of the coarse waypoint prediction as well as a speed estimate s_k at each iteration. We employ an L1 loss for both of these outputs:

L_{p-subpixel} = ‖δp_k − δp_k^gt‖_1    (6)

L_speed = ‖s_k − s_k^gt‖_1    (7)

where δp_k^gt = p_k^gt − ⌊p_k^gt⌋ is the fractional part of the ground-truth pose coordinates.

4.2 Past Motion Dropout

During training, the model is provided the past motion history as one of the inputs (Fig. 1(g)). Since the past motion history during training is from an expert demonstration, the net can learn to "cheat" by simply extrapolating from the past rather than finding the underlying causes of the behavior. During closed-loop inference, this breaks down because the past history is from the net's own past predictions. For example, such a trained net may learn to stop for a stop sign only if it sees a deceleration in the past history, and will therefore never stop for a stop sign during closed-loop inference. To address this, we introduce a dropout on the past pose history, where for 50% of the examples we keep only the current position (u_0, v_0) of the agent in the past agent poses channel of the input data.
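This dropout scheme can be sketched in a few lines, assuming the past poses are stored as a list of (x, y) points; the function name and list representation are illustrative, not from the paper:

```python
import numpy as np

def past_motion_dropout(past_poses, current_pose, rng, drop_prob=0.5):
    """Sketch of the past-motion dropout of Sec. 4.2: for a fraction
    of training examples (50% in the paper), drop the motion history
    and keep only the agent's current position (u_0, v_0)."""
    if rng.random() < drop_prob:
        return [current_pose]  # history dropped; only the current position survives
    return list(past_poses) + [current_pose]
```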
This forces the net to look at other cues in the environment to explain the future motion profile in the training example.

Figure 5: Trajectory perturbation. (a) An original logged training example where the agent is driving along the center of the lane. (b) The perturbed example created by perturbing the current agent location (red point) in the original example away from the lane center and then fitting a new smooth trajectory that brings the agent back to the original target location along the lane center.

5. Beyond Pure Imitation

In this section, we go beyond vanilla cloning of the expert's demonstrations in order to teach the model to arrest drift and avoid bad behavior such as collisions and off-road driving by synthesizing variations of the expert's behavior.

5.1 Synthesizing Perturbations

Running the model as a part of a closed-loop system over time can cause the input data to deviate from the training distribution. To prevent this, we train the model by adding some examples with realistic perturbations to the agent trajectories. The start and end of a trajectory are kept constant, while a perturbation is applied around the midpoint and smoothed across the other points. Quantitatively, we jitter the midpoint pose of the agent uniformly at random in the range [−0.5, 0.5] meters in both axes, and perturb the heading by [−π/3, π/3] radians. We then fit a smooth trajectory to the perturbed point and the original start and end points. Such training examples bring the car back to its original trajectory after a perturbation. Fig. 5 shows an example of perturbing the current agent location (red point) away from the lane center, with the fitted trajectory correctly bringing it back to the original target location along the lane center. We filter out some perturbed trajectories that are impractical by thresholding on maximum curvature.
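The jitter-and-refit scheme above can be sketched as follows. This is a simplified stand-in for the paper's smoothing step: it fits one quadratic per axis through the fixed start, the jittered midpoint, and the fixed end, and it omits the heading perturbation and the curvature filter:

```python
import numpy as np

def perturb_trajectory(points, max_jitter=0.5, rng=None):
    """Sketch of Sec. 5.1: jitter the trajectory midpoint uniformly in
    [-max_jitter, max_jitter] meters on both axes, keep the start and
    end fixed, and fit a smooth curve through the three anchors."""
    rng = rng or np.random.default_rng(0)
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    jitter = rng.uniform(-max_jitter, max_jitter, size=2)
    anchors_u = np.array([0.0, 0.5, 1.0])
    anchors = np.stack([pts[0], pts[n // 2] + jitter, pts[-1]])
    u = np.linspace(0.0, 1.0, n)
    # Fit one quadratic per axis through start, perturbed midpoint, end.
    return np.stack(
        [np.polyval(np.polyfit(anchors_u, anchors[:, d], 2), u) for d in (0, 1)],
        axis=1)
```

Because a quadratic through three points interpolates them exactly, the returned trajectory passes through the original endpoints and the jittered midpoint.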
But we do allow the perturbed trajectories to collide with other agents or drive off-road, because the network can then experience and avoid such behaviors even though real examples of these cases are not present in the training data. In training, we give perturbed examples a weight of 1/10 relative to the real examples, to avoid learning a propensity for perturbed driving.

5.2 Beyond the Imitation Loss

5.2.1 COLLISION LOSS

Since our training data does not have any real collisions, the idea of avoiding collisions is implicit and will not generalize well. To alleviate this issue, we add a specialized loss that directly measures the overlap of the predicted agent box B_k with the ground-truth boxes of all the scene objects at each timestep:

L_collision = (1/(W·H)) Σ_x Σ_y B_k(x, y) · Obj_k^gt(x, y)    (8)

where B_k is the likelihood map for the output agent box prediction, and Obj_k^gt is a binary mask with ones at all pixels occupied by other dynamic objects (other vehicles, pedestrians, etc.) in the scene at timestep k. At any time during training, if the model makes a poor prediction that leads to a collision, the overlap loss would influence the gradients to correct the mistake. However, this loss would be effective only during the initial training rounds, when the model hasn't yet learned to predict close to the ground-truth locations, due to the absence of real collisions in the ground-truth data. This issue is alleviated by the addition of trajectory perturbation data, where artificial collisions within those examples allow this loss to be effective throughout training, without the need for online exploration as in reinforcement learning settings.

5.2.2 ON ROAD LOSS

Trajectory perturbations also create synthetic cases where the car veers off the road or climbs a curb or median because of the perturbation.
To train the network to avoid hitting such hard road edges, we add a specialized loss that measures the overlap of the predicted agent box B_k in each timestep with a binary mask Road^gt denoting the road and non-road regions within the field of view:

L_onroad = (1/(W·H)) Σ_x Σ_y B_k(x, y) · (1 − Road^gt(x, y))    (9)

5.2.3 GEOMETRY LOSS

We would like to explicitly constrain the agent to follow the target geometry independent of the speed profile. We model this target geometry by fitting a smooth curve to the target waypoints and rendering this curve as a binary image in the top-down coordinate system. The thickness of this curve is set to be equal to the width of the agent. We express this loss similarly to the collision loss, by measuring the overlap of the predicted agent box with the binary target geometry image Geom^gt. Any portion of the box that does not overlap with the target geometry curve is added as a penalty to the loss function:

L_geom = (1/(W·H)) Σ_x Σ_y B_k(x, y) · (1 − Geom^gt(x, y))    (10)
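Note that Eqs. (8)-(10) share one algebraic form: the mean over the W × H grid of the predicted agent-box likelihood times a binary penalty mask (other objects' boxes for the collision loss, the non-road region 1 − Road^gt for the on-road loss, and the off-geometry region 1 − Geom^gt for the geometry loss). A minimal sketch of this shared form, with illustrative names:

```python
import numpy as np

def overlap_loss(agent_box_heatmap, penalty_mask):
    """Shared form of Eqs. (8)-(10): mean over the grid of the
    predicted box likelihood B_k times a binary penalty mask."""
    B = np.asarray(agent_box_heatmap, dtype=float)
    M = np.asarray(penalty_mask, dtype=float)
    return float((B * M).mean())
```

The three losses then differ only in which penalty mask is rendered for each timestep.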
