scripts: Enabled symbol->dwarf mapping via address

We have symbol->addr info and dwarf->addr info (DW_AT_low_pc), so why not use this to map symbols to dwarf entries?

This should hopefully be more reliable than the current name-based heuristic, but it only works for functions (DW_TAG_subprogram).

Note that we still have to fuzzy match due to thumb-bit weirdness (small rant below).

---

Ok. Why in Thumb does the symbol table include the thumb bit, but the dwarf info does not?? Would it really have been that hard to add the thumb bit to DW_AT_low_pc so symbols and dwarf entries match?

So, because of Thumb, we can't expect either the address or the name to match exactly. The best we can do is binary search and expect the symbol to point somewhere _within_ the dwarf entry's DW_AT_low_pc/DW_AT_high_pc range.

Also why does DW_AT_high_pc store the _size_ of the function?? Why isn't it, I dunno, the _high_pc_? I get that the size takes up less space when LEB128 encoding, but surely there could have been a better name?
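
For illustration, here is a minimal standalone sketch of the range lookup described above. It is not the code in scripts/code.py: `Sym`, `Entry`, and `find_by_addr` are hypothetical names, the addresses are made up, and it assumes the entries are already sorted by address with DW_AT_high_pc holding a size.

```python
import bisect
from collections import namedtuple

# hypothetical stand-ins for a symbol table entry and a dwarf entry
Sym = namedtuple('Sym', ['name', 'addr'])
Entry = namedtuple('Entry', ['name', 'addr', 'size'])

def find_by_addr(entries, sym):
    # entries must be sorted by addr
    addrs = [e.addr for e in entries]
    # index of the last entry whose addr <= sym.addr
    i = bisect.bisect_right(addrs, sym.addr) - 1
    # check that the symbol actually falls inside [addr, addr+size)
    if i >= 0 and sym.addr < entries[i].addr + entries[i].size:
        return entries[i]
    return None

# made-up example data
entries = [
    Entry('foo', 0x1000, 0x20),
    Entry('bar', 0x1020, 0x10),
]

# a Thumb symbol has bit 0 set, so it never equals low_pc exactly,
# but it still lands inside the entry's range
match = find_by_addr(entries, Sym('bar', 0x1021))
```

Because the lookup only requires the symbol address to fall inside the range, the thumb bit does not need to be masked off first. A symbol that lands in no range returns None here, which is where a name-based fallback would kick in.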
2025-12-06 23:52:44 +00:00 · 2024-12-05 19:28:07 -06:00
parent eb09865868
commit 02ccbdfed2
4 changed files with 211 additions and 11 deletions
--- a/scripts/code.py
+++ b/scripts/code.py
@@ -393,6 +393,24 @@ class DwarfEntry:
        else:
            return None

+    @ft.cached_property
+    def addr(self):
+        if (self.tag == 'DW_TAG_subprogram'
+                and 'DW_AT_low_pc' in self):
+            return int(self['DW_AT_low_pc'], 0)
+        else:
+            return None
+
+    @ft.cached_property
+    def size(self):
+        if (self.tag == 'DW_TAG_subprogram'
+                and 'DW_AT_high_pc' in self):
+            # this looks wrong, but high_pc does store the size,
+            # for whatever reason
+            return int(self['DW_AT_high_pc'], 0)
+        else:
+            return None
+
    def info(self, tags=None):
        # recursively flatten children
        def flatten(entry):
@@ -412,10 +430,42 @@ class DwarfInfo:
        self.entries = entries

    def get(self, k, d=None):
-        # allow lookup by both offset and dwarf name
-        if not isinstance(k, str):
+        # allow lookup by offset, symbol, or dwarf name
+        if not isinstance(k, str) and not hasattr(k, 'addr'):
            return self.entries.get(k, d)

+        elif hasattr(k, 'addr'):
+            import bisect
+
+            # organize by address
+            if not hasattr(self, '_by_addr'):
+                # sort and keep largest/first when duplicates
+                entries = [entry
+                        for entry in self.entries.values()
+                        if entry.addr is not None
+                            and entry.size is not None]
+                entries.sort(key=lambda x: (x.addr, -x.size))
+
+                by_addr = []
+                for entry in entries:
+                    if (len(by_addr) == 0
+                            or by_addr[-1].addr != entry.addr):
+                        by_addr.append(entry)
+                self._by_addr = by_addr
+
+            # find entry by range
+            i = bisect.bisect(self._by_addr, k.addr,
+                    key=lambda x: x.addr)
+            # check that we're actually in this entry's size
+            if (i > 0
+                    and k.addr
+                        < self._by_addr[i-1].addr
+                            + self._by_addr[i-1].size):
+                return self._by_addr[i-1]
+            else:
+                # fallback to lookup by name
+                return self.get(k.name, d)
+
        else:
            # organize entries by name
            if not hasattr(self, '_by_name'):
@@ -548,7 +598,7 @@ def collect(obj_paths, *,

            # find best matching dwarf entry, this may be slightly different
            # due to optimizations
-            entry = info.get(sym.name)
+            entry = info.get(sym)

            # if we have no file guess from obj path
            if entry is not None and 'DW_AT_decl_file' in entry: