libctf: archive: format v2

This commit does a bunch of things, all tangled together tightly enough
that disentangling them seemed not to be worth doing.

The biggest is a new archive format, v2, identified by a magic number
which is one higher than the v1 format's magic number.  As usual with
libctf we can only write out the new format, but can still read the old
one.  The new format has multiple improvements over the old:

 - It is written native-endian and aggressively endian-swapped at open
   time, just like CTF and BTF dicts; format v1 was little-endian,
   necessitating byteswapping all over the place at read and write time
   rather than localized in one pair of functions at read time.

 - The modent array of name-offset -> archive-offset mappings for the
   CTF archives is explicitly pointed at via a new ctfa_modents header
   member rather than just starting after the end of the header.

 - The length that prepends each archive member actually indicates its
   length rather than always being sizeof (uint64_t) bytes too high
   (this was an outright bug).

 - There is a new shared properties table which in future we may be able
   to use to unify common values from the constituent CTF headers,
   reducing the size overhead of these (repeated, uncompressed)
   entities.  Right now it only contains one value, parent_name, which
   is the parent dict name if one is common across all dicts in the
   archive (always true for any archives derived from ctf_link()).  This
   is used to let ctf_archive_next() et al reliably open dicts in the
   archive even if they are child BTF dicts (which do not contain a
   header name).

   The properties table shares its property names with the CTF members,
   and uses the same format (and shared code) for the property values as
   for CTF archive members: length-prepended.  The archive members and
   name->value table ("modents") use distinct tables for properties and
   CTF dicts, to ensure they are spatially separated in the file, to
   maximize compressibility if we end up with a lot of properties and
   people compress the whole thing.

We can also restrict various old bug-workaround kludges to the v1 path,
since they only apply to dicts found in v1 archives: in particular, we
needed to dig out the preamble of some CTF dicts without opening them to
figure out whether they used the .dynstr or .strtab sections: this whole
bug workaround is now unnecessary for v2 and above.

There are other changes for readability and consistency:

 - The archive wrapper data structure, known outside ctf-archive.c as
   ctf_archive_t, is now consistently referred to inside ctf-archive.c
   as 'struct ctf_archive_internal' and given the parameter name 'arci'
   rather than sometimes using ctf_archive_t and sometimes using
   'wrapper' or 'arc' as parameter names.  The archive itself is always
   called 'struct ctf_archive' to emphasise that it is *not* a
   ctf_archive_t.  ctf_archive_t remains the public typedef: the fact
   that it's not actually the same thing as the archive file format is
   an internal implementation detail.

 - We keep the archive header around in a new ctfi_hdr member, distinct
   from the actual archive itself, to make upgrading from v1 and
   cross-endianness support easier.  The archive itself is now kept as a
   char * and used only to root pointer arithmetic.
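
As a rough illustration of the version and endianness detection described
above, and of the v1 length-prefix fixup, here is a minimal standalone
sketch.  It is not the code in ctf-archive.c: the enum arc_kind type and the
helpers swap64(), read_le64(), read_native64(), classify_magic() and
member_payload_len() are hypothetical names invented for this example.  Only
the two magic numbers are taken from include/ctf.h.

```c
#include <stdint.h>
#include <string.h>

#define CTFA_MAGIC    0x8b47f2a4d7623eecULL	/* v2, as in include/ctf.h.  */
#define CTFA_V1_MAGIC 0x8b47f2a4d7623eebULL	/* v1, as in include/ctf.h.  */

/* Hypothetical classification of the first eight bytes of a file.  */
enum arc_kind { ARC_NOT_ARCHIVE, ARC_V1, ARC_V2_NATIVE, ARC_V2_SWAPPED };

/* Portable 64-bit byteswap, so the sketch does not need <byteswap.h>.  */
static uint64_t
swap64 (uint64_t v)
{
  uint64_t out = 0;
  for (int i = 0; i < 8; i++)
    {
      out = (out << 8) | (v & 0xff);
      v >>= 8;
    }
  return out;
}

/* Read eight bytes as a little-endian integer, whatever the host is.  */
static uint64_t
read_le64 (const unsigned char *p)
{
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--)
    v = (v << 8) | p[i];
  return v;
}

/* Read eight bytes in host byte order.  */
static uint64_t
read_native64 (const unsigned char *p)
{
  uint64_t v;
  memcpy (&v, p, sizeof (v));
  return v;
}

/* v2 archives are native-endian, so a byteswapped v2 magic means the file
   came from a host of the other endianness.  v1 is always little-endian.  */
static enum arc_kind
classify_magic (const unsigned char *p)
{
  uint64_t native = read_native64 (p);

  if (native == CTFA_MAGIC)
    return ARC_V2_NATIVE;
  if (swap64 (native) == CTFA_MAGIC)
    return ARC_V2_SWAPPED;
  if (read_le64 (p) == CTFA_V1_MAGIC)
    return ARC_V1;
  return ARC_NOT_ARCHIVE;
}

/* In v1 the stored length also counted the uint64_t length field itself;
   in v2 it is the payload length.  Returns 0 for a corrupt v1 length.  */
static uint64_t
member_payload_len (uint64_t stored_len, int is_v1)
{
  if (!is_v1)
    return stored_len;
  return stored_len >= sizeof (uint64_t) ? stored_len - sizeof (uint64_t) : 0;
}
```

The checks are ordered so that the common case (a v2 archive written on a
host of the same endianness) is tested first.  A caller would be expected to
byteswap the header once at open time for ARC_V2_SWAPPED, and to apply
member_payload_len() only when the archive was classified as ARC_V1.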
2025-05-28 13:42:11 +01:00
parent 4bdc7aed03
commit 16e0dd9aab
5 changed files with 636 additions and 358 deletions
--- a/include/ctf.h
+++ b/include/ctf.h
@@ -829,17 +829,36 @@ typedef struct ctf_enum64
   greater care taken with integral types.  All CTF files in an archive
   must have the same data model.  (This is not validated.)

-   All integers in this structure are stored in little-endian byte order.
+   All integers in the ctf_archive_v1 structure are stored in little-endian byte
+   order.

-   The code relies on the fact that everything in this header is a uint64_t
-   and thus the header needs no padding (in particular, that no padding is
-   needed between ctfa_ctfs and the unnamed ctfa_archive_modent array
-   that follows it).
+   The generation code relies on the fact that everything in this header is a
+   uint64_t and thus the header needs no padding (in particular, that no padding
+   is needed between ctfa_ctfs and the unnamed ctfa_modent array that follows
+   it.  However, this is only an assumption of the generation code: the
+   read-side code in libctf and the file format do not have any such
+   requirements).
+
+   The shared properties and CTF dict storage have the same (length-prepended)
+   format and identical string/value mapping via struct ctf_archive_modent, but
+   are pointed to by different header fields: ctfa_modents for CTFs,
+   ctfa_propents for properties: their names are intermingled in ctfa_names but
+   the CTF dicts and property values are stashed in distinct tables, ctfa_ctfs
+   and ctfa_prop_values.  Implementations may interpret properties however they
+   wish, and their presence must not be mandatory (though dictionaries may be
+   modified given the presence of a particular property, making use of that
+   property mandatory for reading those dicts: the intent here is to allow
+   optional movement of shared header fields into the shared properties table in
+   the future.  For now, only parent_name=... is present.)
+
+   In format v1, the dict size uint64_t prepended to dictionaries is one
+   uint64_t too long: it includes the length of the size field too.  In format
+   v2, this is corrected (at open time, libctf fixes up v1 dicts too).

   This is *not* the same as the data structure returned by the ctf_arc_*()
   functions:  this is the low-level on-disk representation.  */

-#define CTFA_MAGIC 0x8b47f2a4d7623eeb	/* Random.  */
+#define CTFA_MAGIC 0x8b47f2a4d7623eec	/* V1, below, incremented.  */
 struct ctf_archive
 {
  /* Magic number.  (In loaded files, overwritten with the file size
@@ -852,6 +871,43 @@ struct ctf_archive
  /* Number of CTF dicts in the archive.  */
  uint64_t ctfa_ndicts;

+  /* Number of shared properties.  */
+  uint64_t ctfa_nprops;
+
+  /* Offset of the name table, used for both CTF member names and property
+     names.  */
+  uint64_t ctfa_names;
+
+  /* Offset of the CTF table.  Each element starts with a size (a
+     native-endian uint64_t) then a ctf_dict_t of that size.  */
+  uint64_t ctfa_ctfs;
+
+  /* Offset of the shared properties value table: identical format, except the
+     size is followed by an arbitrary (property-dependent) binary blob.  */
+  uint64_t ctfa_prop_values;
+
+  /* Offset of the modent table mapping names to CTFs.  */
+  uint64_t ctfa_modents;
+
+  /* Offset of the modent table mapping names to properties.  Ignored if
+     nprops is 0.  */
+  uint64_t ctfa_propents;
+};
+
+#define CTFA_V1_MAGIC 0x8b47f2a4d7623eeb /* Random.  */
+
+struct ctf_archive_v1
+{
+  /* Magic number.  (In loaded files, overwritten with the file size
+     so ctf_arc_close() knows how much to munmap()).  */
+  uint64_t ctfa_magic;
+
+  /* CTF data model.  */
+  uint64_t ctfa_model;
+
+  /* Number of CTF dicts in the archive.  */
+  uint64_t ctfa_ndicts;
+
  /* Offset of the name table.  */
  uint64_t ctfa_names;

@@ -860,9 +916,16 @@ struct ctf_archive
  uint64_t ctfa_ctfs;
 };

-/* An array of ctfa_ndicts of this structure lies at
-   ctf_archive[sizeof(struct ctf_archive)] and gives the ctfa_ctfs or
-   ctfa_names-relative offsets of each name or ctf_dict_t.  */
+/* An array of ctfa_ndicts of this structure lies at the offset given by
+   ctfa_modents (or, in v1, at ctf_archive[sizeof(struct ctf_archive)]) and gives
+   the ctfa_ctfs or ctfa_names-relative offsets of each name or ctf_dict_t.
+
+   Another array of ctfa_nprops of this structure lies at the ctfa_propents
+   offset: for this, the ctf_offset is the ctfa_propents-relative offset of
+   property values.
+
+   Both property values and CTFs are prepended by a uint64 giving their length.
+   The names are just a strtab (\0-separated).  */

 typedef struct ctf_archive_modent
 {
--- a/libctf/ctf-archive.c
+++ b/libctf/ctf-archive.c
--- a/libctf/ctf-impl.h
+++ b/libctf/ctf-impl.h
@@ -552,7 +552,10 @@ struct ctf_archive_internal
  int ctfi_is_archive;
  int ctfi_unmap_on_close;
  ctf_dict_t *ctfi_dict;
-  struct ctf_archive *ctfi_archive;
+  unsigned char *ctfi_archive;
+  struct ctf_archive *ctfi_hdr;	    /* Always malloced.  Header only.  */
+  size_t ctfi_hdr_len;
+  int ctfi_archive_v1;		    /* If set, this is a v1 archive.  */
  ctf_dynhash_t *ctfi_dicts;	    /* Dicts we have opened and cached.  */
  ctf_dict_t *ctfi_crossdict_cache; /* Cross-dict caching.  */
  ctf_dict_t **ctfi_symdicts;	    /* Array of index -> ctf_dict_t *.  */
@@ -815,13 +818,12 @@ extern int ctf_preserialize (ctf_dict_t *fp, int force_ctf);
 extern void ctf_depreserialize (ctf_dict_t *fp);

 extern struct ctf_archive_internal *
-ctf_new_archive_internal (int is_archive, int unmap_on_close,
-			  struct ctf_archive *, ctf_dict_t *,
-			  const ctf_sect_t *symsect,
+ctf_new_archive_internal (int is_archive, int is_v1, int unmap_on_close,
+			  struct ctf_archive *, size_t,
+			  ctf_dict_t *, const ctf_sect_t *symsect,
 			  const ctf_sect_t *strsect, int *errp);
-extern struct ctf_archive *ctf_arc_open_internal (const char *, int *);
-extern void ctf_arc_close_internal (struct ctf_archive *);
-extern const ctf_preamble_t *ctf_arc_bufpreamble (const ctf_sect_t *);
+extern struct ctf_archive_internal *ctf_arc_open_internal (const char *, int *);
+extern const ctf_preamble_t *ctf_arc_bufpreamble_v1 (const ctf_sect_t *);
 extern void *ctf_set_open_errno (int *, int);
 extern int ctf_flip_header (void *, int, int);
 extern int ctf_flip (ctf_dict_t *, ctf_header_t *, unsigned char *,
--- a/libctf/ctf-link.c
+++ b/libctf/ctf-link.c
@@ -1177,7 +1177,7 @@ ctf_link_deduplicating_per_cu (ctf_dict_t *fp)
 	 equal to the CU name.  We have to wrap it in an archive wrapper
 	 first.  */

-      if ((in_arc = ctf_new_archive_internal (0, 0, NULL, outputs[0], NULL,
+      if ((in_arc = ctf_new_archive_internal (0, 0, 0, NULL, 0, outputs[0], NULL,
 					      NULL, &err)) == NULL)
 	{
 	  ctf_set_errno (fp, err);
--- a/libctf/ctf-open-bfd.c
+++ b/libctf/ctf-open-bfd.c
@@ -119,9 +119,20 @@ ctf_bfdopen_ctfsect (struct bfd *abfd _libctf_unused_,
      bfderrstr = N_("CTF section is NULL");
      goto err;
    }
-  preamble = ctf_arc_bufpreamble (ctfsect);

-  if (preamble->ctp_flags & CTF_F_DYNSTR)
+  /* v3 dicts may cite the symtab or the dynsymtab, without using sh_link to
+     indicate which: pick the right one.  v4 dicts always use the dynsymtab (for
+     now).  */
+
+  errno = 0;
+  preamble = ctf_arc_bufpreamble_v1 (ctfsect);
+  if (!preamble && errno == EOVERFLOW)
+    {
+      bfderrstr = N_("section too short to be CTF or BTF");
+      goto err;
+    }
+
+  if (!preamble || (preamble && preamble->ctp_flags & CTF_F_DYNSTR))
    {
      symhdr = &elf_tdata (abfd)->dynsymtab_hdr;
      strtab_name = ".dynstr";
@@ -301,21 +312,16 @@ ctf_fdopen (int fd, const char *filename, const char *target, int *errp)
      fp->ctf_data_mmapped = data;
      fp->ctf_data_mmapped_len = (size_t) st.st_size;

-      return ctf_new_archive_internal (0, 1, NULL, fp, NULL, NULL, errp);
+      return ctf_new_archive_internal (0, 0, 1, NULL, 0, fp, NULL, NULL, errp);
    }

  if ((nbytes = ctf_pread (fd, &arc_magic, sizeof (arc_magic), 0)) <= 0)
    return (ctf_set_open_errno (errp, nbytes < 0 ? errno : ECTF_FMT));

-  if ((size_t) nbytes >= sizeof (uint64_t) && le64toh (arc_magic) == CTFA_MAGIC)
-    {
-      struct ctf_archive *arc;
-
-      if ((arc = ctf_arc_open_internal (filename, errp)) == NULL)
-	return NULL;			/* errno is set for us.  */
-
-      return ctf_new_archive_internal (1, 1, arc, NULL, NULL, NULL, errp);
-    }
+  if ((size_t) nbytes >= sizeof (uint64_t)
+      && (arc_magic == CTFA_MAGIC || bswap_64 (arc_magic) == CTFA_MAGIC
+	  || le64toh (arc_magic) == CTFA_V1_MAGIC))
+      return ctf_arc_open_internal (filename, errp);

  /* Attempt to open the file with BFD.  We must dup the fd first, since bfd
     takes ownership of the passed fd.  */