normalize Transform

The normalize transform will attempt to convert any columns of nested data into a flat table by bringing the nested items up to the column level. This will be performed one level down, so if there are nested lists of dictionaries of lists of dictionaries, only the first level of lists will be normalized.

Example:

For a transform configuration like this:

{
    "transform_name": "Normalize Data",
    "transform_type": "normalize",
    "levels": 2,
}

If we want to normalize a dataframe that has a data structure like this for each row:

{
    "data_a": 1,
    "data_b": "test",
    "dsn": {
        "dsn_a": [5, 6],
        "dsn_b": {
            "dsn_b1": {
                "dsn_b1_a": 1,
                "dsn_b1_b": 2,
                "dsn_b1_c": [11, 22, 33],
                "dsn_b1_d": {
                    "dsn_b1_d1": 1,
                    "dsn_b1_d2": 2,
                    "dsn_b1_d3": {
                        "dsn_b1_d3_a": 1,
                        "dsn_b1_d3_b": 2,
                        "dsn_b1_d3_c": [11, 22, 33],
                    }
                }
            },
            "dsn_b2": 2,
            "dsn_b3": [11, 22, 33],
        }
    }
}

The resulting dataframe will have a data structure like this at Level 1:

{
  "data_a": [
    1
  ],
  "data_b": [
    "test"
  ],
  "dsn.dsn_a[0]": [
    5
  ],
  "dsn.dsn_a[1]": [
    6
  ],
  "dsn.dsn_b": [
    "...all data under ['dsn']['dsn_b'] still formatted as a dictionary, not normalized"
  ]
}

The resulting dataframe will have a data structure like this at Level 2:

{
  "data_a": [
    1
  ],
  "data_b": [
    "test"
  ],
  "dsn.dsn_a[0]": [
    5
  ],
  "dsn.dsn_a[1]": [
    6
  ],
  "dsn.dsn_b.dsn_b1": [
    {
      "dsn_b1_a": 1,
      "dsn_b1_b": 2,
      "dsn_b1_c": [
        11,
        22,
        33
      ],
      "dsn_b1_d": {
        "dsn_b1_d1": 1,
        "dsn_b1_d2": 2,
        "dsn_b1_d3": {
          "dsn_b1_d3_a": 1,
          "dsn_b1_d3_b": 2,
          "dsn_b1_d3_c": [
            11,
            22,
            33
          ]
        }
      }
    }
  ],
  "dsn.dsn_b.dsn_b2": [
    2
  ],
  "dsn.dsn_b.dsn_b3[0]": [
    11
  ],
  "dsn.dsn_b.dsn_b3[1]": [
    22
  ],
  "dsn.dsn_b.dsn_b3[2]": [
    33
  ]
}

…etc, etc.

Configuration:

Required and optional properties that can be configured for a normalize transform.

  • transform_name: Unique name for the transform.

  • transform_type: Type of transform to apply. Should be normalize.

  • filter_stream: List of data streams to transform. Each stream can either be * (all) or asset:stream.

  • levels: The number of levels to normalize. Default is 1.

  • parse_json_string: If true, will attempt to convert string data to json and normalize if possible. Default is False.