Skip to content

Commit

Permalink
PCI/ERR: Handle fatal error recovery
Browse files Browse the repository at this point in the history
We don't need to be paranoid about the topology changing while handling an
error.  If the device has changed in a hotplug capable slot, we can rely on
the presence detection handling to react to a changing topology.

Restore the fatal error handling behavior that existed before merging DPC
with AER with 7e9084b ("PCI/AER: Handle ERR_FATAL with removal and
re-enumeration of devices").

Signed-off-by: Keith Busch <[email protected]>
Signed-off-by: Bjorn Helgaas <[email protected]>
Reviewed-by: Sinan Kaya <[email protected]>
  • Loading branch information
Keith Busch authored and bjorn-helgaas committed Sep 26, 2018
1 parent c4eed62 commit bdb5ac8
Show file tree
Hide file tree
Showing 5 changed files with 28 additions and 102 deletions.
35 changes: 10 additions & 25 deletions Documentation/PCI/pci-error-recovery.txt
Original file line number Diff line number Diff line change
Expand Up @@ -110,7 +110,7 @@ The actual steps taken by a platform to recover from a PCI error
event will be platform-dependent, but will follow the general
sequence described below.

STEP 0: Error Event: ERR_NONFATAL
STEP 0: Error Event
-------------------
A PCI bus error is detected by the PCI hardware. On powerpc, the slot
is isolated, in that all I/O is blocked: all reads return 0xffffffff,
Expand Down Expand Up @@ -228,7 +228,13 @@ proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
proceeds to STEP 4 (Slot Reset)

STEP 3: Slot Reset
STEP 3: Link Reset
------------------
The platform resets the link. This is a PCI-Express specific step
and is done whenever a fatal error has been detected that can be
"solved" by resetting the link.

STEP 4: Slot Reset
------------------

In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
Expand Down Expand Up @@ -314,7 +320,7 @@ Failure).
>>> However, it probably should.


STEP 4: Resume Operations
STEP 5: Resume Operations
-------------------------
The platform will call the resume() callback on all affected device
drivers if all drivers on the segment have returned
Expand All @@ -326,7 +332,7 @@ a result code.
At this point, if a new error happens, the platform will restart
a new error recovery sequence.

STEP 5: Permanent Failure
STEP 6: Permanent Failure
-------------------------
A "permanent failure" has occurred, and the platform cannot recover
the device. The platform will call error_detected() with a
Expand All @@ -349,27 +355,6 @@ errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
for additional detail on real-life experience of the causes of
software errors.

STEP 0: Error Event: ERR_FATAL
-------------------
PCI bus error is detected by the PCI hardware. On powerpc, the slot is
isolated, in that all I/O is blocked: all reads return 0xffffffff, all
writes are ignored.

STEP 1: Remove devices
--------------------
Platform removes the devices depending on the error agent, it could be
this port for all subordinates or upstream component (likely downstream
port)

STEP 2: Reset link
--------------------
The platform resets the link. This is a PCI-Express specific step and is
done whenever a fatal error has been detected that can be "solved" by
resetting the link.

STEP 3: Re-enumerate the devices
--------------------
Initiates the re-enumeration.

Conclusion; General Remarks
---------------------------
Expand Down
4 changes: 2 additions & 2 deletions drivers/pci/pci.h
Original file line number Diff line number Diff line change
Expand Up @@ -433,8 +433,8 @@ static inline int pci_dev_specific_disable_acs_redir(struct pci_dev *dev)
#endif

/* PCI error reporting and recovery */
void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service);
void pcie_do_nonfatal_recovery(struct pci_dev *dev);
void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
u32 service);

bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
#ifdef CONFIG_PCIEASPM
Expand Down
12 changes: 8 additions & 4 deletions drivers/pci/pcie/aer.c
Original file line number Diff line number Diff line change
Expand Up @@ -1010,9 +1010,11 @@ static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
info->status);
pci_aer_clear_device_status(dev);
} else if (info->severity == AER_NONFATAL)
pcie_do_nonfatal_recovery(dev);
pcie_do_recovery(dev, pci_channel_io_normal,
PCIE_PORT_SERVICE_AER);
else if (info->severity == AER_FATAL)
pcie_do_fatal_recovery(dev, PCIE_PORT_SERVICE_AER);
pcie_do_recovery(dev, pci_channel_io_frozen,
PCIE_PORT_SERVICE_AER);
pci_dev_put(dev);
}

Expand Down Expand Up @@ -1048,9 +1050,11 @@ static void aer_recover_work_func(struct work_struct *work)
}
cper_print_aer(pdev, entry.severity, entry.regs);
if (entry.severity == AER_NONFATAL)
pcie_do_nonfatal_recovery(pdev);
pcie_do_recovery(pdev, pci_channel_io_normal,
PCIE_PORT_SERVICE_AER);
else if (entry.severity == AER_FATAL)
pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_AER);
pcie_do_recovery(pdev, pci_channel_io_frozen,
PCIE_PORT_SERVICE_AER);
pci_dev_put(pdev);
}
}
Expand Down
4 changes: 2 additions & 2 deletions drivers/pci/pcie/dpc.c
Original file line number Diff line number Diff line change
Expand Up @@ -216,7 +216,7 @@ static irqreturn_t dpc_handler(int irq, void *context)

reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) >> 1;
ext_reason = (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) >> 5;
dev_warn(dev, "DPC %s detected, remove downstream devices\n",
dev_warn(dev, "DPC %s detected\n",
(reason == 0) ? "unmasked uncorrectable error" :
(reason == 1) ? "ERR_NONFATAL" :
(reason == 2) ? "ERR_FATAL" :
Expand All @@ -233,7 +233,7 @@ static irqreturn_t dpc_handler(int irq, void *context)
}

/* We configure DPC so it only triggers on ERR_FATAL */
pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC);
pcie_do_recovery(pdev, pci_channel_io_frozen, PCIE_PORT_SERVICE_DPC);

return IRQ_HANDLED;
}
Expand Down
75 changes: 6 additions & 69 deletions drivers/pci/pcie/err.c
Original file line number Diff line number Diff line change
Expand Up @@ -271,83 +271,20 @@ static pci_ers_result_t broadcast_error_message(struct pci_dev *dev,
return result_data.result;
}

/**
* pcie_do_fatal_recovery - handle fatal error recovery process
* @dev: pointer to a pci_dev data structure of agent detecting an error
*
* Invoked when an error is fatal. Once being invoked, removes the devices
* beneath this AER agent, followed by reset link e.g. secondary bus reset
* followed by re-enumeration of devices.
*/
void pcie_do_fatal_recovery(struct pci_dev *dev, u32 service)
{
struct pci_dev *udev;
struct pci_bus *parent;
struct pci_dev *pdev, *temp;
pci_ers_result_t result;

if (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)
udev = dev;
else
udev = dev->bus->self;

parent = udev->subordinate;
pci_walk_bus(parent, pci_dev_set_disconnected, NULL);

pci_lock_rescan_remove();
pci_dev_get(dev);
list_for_each_entry_safe_reverse(pdev, temp, &parent->devices,
bus_list) {
pci_stop_and_remove_bus_device(pdev);
}

result = reset_link(udev, service);

if ((service == PCIE_PORT_SERVICE_AER) &&
(dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
/*
* If the error is reported by a bridge, we think this error
* is related to the downstream link of the bridge, so we
* do error recovery on all subordinates of the bridge instead
* of the bridge and clear the error status of the bridge.
*/
pci_aer_clear_fatal_status(dev);
pci_aer_clear_device_status(dev);
}

if (result == PCI_ERS_RESULT_RECOVERED) {
if (pcie_wait_for_link(udev, true))
pci_rescan_bus(udev->bus);
pci_info(dev, "Device recovery from fatal error successful\n");
} else {
pci_uevent_ers(dev, PCI_ERS_RESULT_DISCONNECT);
pci_info(dev, "Device recovery from fatal error failed\n");
}

pci_dev_put(dev);
pci_unlock_rescan_remove();
}

/**
* pcie_do_nonfatal_recovery - handle nonfatal error recovery process
* @dev: pointer to a pci_dev data structure of agent detecting an error
*
* Invoked when an error is nonfatal/fatal. Once being invoked, broadcast
* error detected message to all downstream drivers within a hierarchy in
* question and return the returned code.
*/
void pcie_do_nonfatal_recovery(struct pci_dev *dev)
void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
u32 service)
{
pci_ers_result_t status;
enum pci_channel_state state;

state = pci_channel_io_normal;

status = broadcast_error_message(dev,
state,
"error_detected",
report_error_detected);

if (state == pci_channel_io_frozen &&
reset_link(dev, service) != PCI_ERS_RESULT_RECOVERED)
goto failed;

if (status == PCI_ERS_RESULT_CAN_RECOVER)
status = broadcast_error_message(dev,
state,
Expand Down

0 comments on commit bdb5ac8

Please sign in to comment.