Handling UUID Data
PyMongo ships with built-in support for dealing with UUID types.
It is straightforward to store native uuid.UUID objects to MongoDB and
retrieve them as native uuid.UUID objects:
from pymongo import MongoClient
from bson.binary import UuidRepresentation
from uuid import uuid4
# use the 'standard' representation for cross-language compatibility.
client = MongoClient(uuidRepresentation='standard')
collection = client.get_database('uuid_db').get_collection('uuid_coll')
# remove all documents from collection
collection.delete_many({})
# create a native uuid object
uuid_obj = uuid4()
# save the native uuid object to MongoDB
collection.insert_one({'uuid': uuid_obj})
# retrieve the stored uuid object from MongoDB
document = collection.find_one({})
# check that the retrieved UUID matches the inserted UUID
assert document['uuid'] == uuid_obj
Native uuid.UUID objects can also be used as part of MongoDB queries:
# find_one() returns a single document (find() would return a cursor)
document = collection.find_one({'uuid': uuid_obj})
assert document['uuid'] == uuid_obj
The above examples illustrate the simplest of use cases: one where the
UUID is generated by, and used in, the same application. However,
the situation can be significantly more complex when dealing with a MongoDB
deployment that contains UUIDs created by other drivers, because the Java and
C# drivers have historically encoded UUIDs using a byte-order that is different
from the one used by PyMongo. Applications that require interoperability across
these drivers must specify the appropriate UuidRepresentation.
In the following sections, we describe how drivers have historically differed
in their encoding of UUIDs, and how applications can use the
UuidRepresentation configuration option to maintain cross-language
compatibility.
Attention
New applications that do not share a MongoDB deployment with
any other application and that have never stored UUIDs in MongoDB
should use the standard UUID representation for cross-language
compatibility. See Configuring a UUID Representation for details
on how to configure the UuidRepresentation.
Legacy Handling of UUID Data
Historically, MongoDB drivers have used different byte-orderings
when serializing UUID types to Binary.
Consider, for instance, a UUID with the following canonical textual
representation:
00112233-4455-6677-8899-aabbccddeeff
This UUID would historically be serialized by the Python driver as:
00112233-4455-6677-8899-aabbccddeeff
The same UUID would historically be serialized by the C# driver as:
33221100-5544-7766-8899-aabbccddeeff
Finally, the same UUID would historically be serialized by the Java driver as:
77665544-3322-1100-ffee-ddccbbaa9988
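The three layouts above can be reproduced with Python's standard uuid module alone. The following is a minimal sketch, not PyMongo's implementation: the helper name java_legacy_bytes is hypothetical, and the C# layout relies on UUID.bytes_le, which stores the first three fields in little-endian order.

```python
from uuid import UUID

u = UUID("00112233-4455-6677-8899-aabbccddeeff")


def java_legacy_bytes(value: UUID) -> bytes:
    """Hypothetical helper: reverse bytes 0-7 and bytes 8-15 independently."""
    raw = value.bytes
    return raw[7::-1] + raw[:7:-1]


python_layout = u.bytes             # big-endian, byte-order unchanged
csharp_layout = u.bytes_le          # first three fields byte-swapped
java_layout = java_legacy_bytes(u)  # both 8-byte halves reversed

for name, layout in [
    ("python", python_layout),
    ("csharp", csharp_layout),
    ("java", java_layout),
]:
    # Re-read each layout as plain big-endian bytes to show
    # what a reader that ignores the reordering would see.
    print(name, UUID(bytes=layout))
```

Reading each layout back as plain big-endian bytes yields the three textual forms listed above.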
Note
For in-depth information about the byte-order historically used by different drivers, see the Handling of Native UUID Types Specification.
This difference in the byte-order of UUIDs encoded by different drivers can result in highly unintuitive behavior in some scenarios. We detail two such scenarios in the next sections.
Scenario 2: Round-Tripping UUIDs
In the following examples, we see how using a misconfigured
UuidRepresentation can cause an application to inadvertently change the
Binary subtype, and in some cases, the bytes of the Binary field itself
when round-tripping documents containing UUIDs.
Consider the following situation:
from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
from bson.binary import Binary, UuidRepresentation
from uuid import uuid4
# Using UuidRepresentation.PYTHON_LEGACY stores a Binary subtype-3 UUID
python_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
input_uuid = uuid4()
collection = client.testdb.get_collection('test', codec_options=python_opts)
collection.insert_one({'_id': 'foo', 'uuid': input_uuid})
assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})['_id'] == 'foo'
# Retrieving this document using UuidRepresentation.STANDARD returns a Binary instance
std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
std_collection = client.testdb.get_collection('test', codec_options=std_opts)
doc = std_collection.find_one({'_id': 'foo'})
assert isinstance(doc['uuid'], Binary)
# Round-tripping the retrieved document yields the exact same document
std_collection.replace_one({'_id': 'foo'}, doc)
# Read back through std_collection so the field is decoded as Binary again
round_tripped_doc = std_collection.find_one({'uuid': Binary(input_uuid.bytes, 3)})
assert doc == round_tripped_doc
In this example, round-tripping the document using the incorrect
UuidRepresentation (STANDARD instead of PYTHON_LEGACY) changes the Binary
subtype as a side-effect. Note that this can also happen when the situation is
reversed, i.e. when the original document is written using the STANDARD
representation and then round-tripped using the PYTHON_LEGACY representation.
In the next example, we see the consequences of incorrectly using a
representation that modifies byte-order (CSHARP_LEGACY or JAVA_LEGACY)
when round-tripping documents:
from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
from bson.binary import Binary, UuidRepresentation
from uuid import uuid4
# Using UuidRepresentation.STANDARD stores a Binary subtype-4 UUID
std_opts = CodecOptions(uuid_representation=UuidRepresentation.STANDARD)
input_uuid = uuid4()
collection = client.testdb.get_collection('test', codec_options=std_opts)
collection.insert_one({'_id': 'baz', 'uuid': input_uuid})
assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)})['_id'] == 'baz'
# Retrieving this document using UuidRepresentation.JAVA_LEGACY returns a native UUID
# without modifying the UUID byte-order
java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)
java_collection = client.testdb.get_collection('test', codec_options=java_opts)
doc = java_collection.find_one({'_id': 'baz'})
assert doc['uuid'] == input_uuid
# Round-tripping the retrieved document silently changes the Binary bytes and subtype
java_collection.replace_one({'_id': 'baz'}, doc)
assert collection.find_one({'uuid': Binary(input_uuid.bytes, 3)}) is None
assert collection.find_one({'uuid': Binary(input_uuid.bytes, 4)}) is None
round_tripped_doc = collection.find_one({'_id': 'baz'})
assert round_tripped_doc['uuid'] == Binary(input_uuid.bytes, 3).as_uuid(UuidRepresentation.JAVA_LEGACY)
In this case, using the incorrect UuidRepresentation (JAVA_LEGACY instead of
STANDARD) changes the Binary bytes and subtype as a side-effect.
Note that this happens when any representation that manipulates byte-order
(CSHARP_LEGACY or JAVA_LEGACY) is incorrectly used to round-trip UUIDs written
with STANDARD. When the situation is reversed, i.e. when the original document
is written using CSHARP_LEGACY or JAVA_LEGACY and then round-tripped using
STANDARD, only the Binary subtype is changed.
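The byte change described above can be modelled without a server. The sketch below uses only the standard library and assumes, as the comments in the example state, that the STANDARD bytes are read back unchanged and then re-encoded with the Java legacy reordering. The helper java_legacy_reorder is hypothetical and is not a PyMongo function.

```python
from uuid import UUID


def java_legacy_reorder(raw: bytes) -> bytes:
    """Hypothetical helper: reverse bytes 0-7 and bytes 8-15 independently."""
    return raw[7::-1] + raw[:7:-1]


input_uuid = UUID("00112233-4455-6677-8899-aabbccddeeff")

# STANDARD write: subtype 4, bytes stored in RFC 4122 order
stored_subtype, stored_bytes = 4, input_uuid.bytes

# Misconfigured read: the bytes are interpreted as-is, so the UUID looks right
decoded = UUID(bytes=stored_bytes)

# Misconfigured write-back: the bytes are reordered and the subtype becomes 3
new_subtype, new_bytes = 3, java_legacy_reorder(decoded.bytes)

print(decoded == input_uuid)      # True: the read looked harmless
print(new_bytes == stored_bytes)  # False: the stored bytes have changed
print(UUID(bytes=new_bytes))      # 77665544-3322-1100-ffee-ddccbbaa9988
```

Because the reordering is its own inverse, applying java_legacy_reorder to the new bytes recovers the original ones, which is why the final assertion in the example above can express the stored value through as_uuid() with JAVA_LEGACY.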
Note
Starting in PyMongo 4.0, these issues are resolved: the STANDARD
representation decodes Binary subtype 3 fields as Binary objects of
subtype 3 (instead of uuid.UUID), and each of the LEGACY_* representations
decodes Binary subtype 4 fields to Binary objects of subtype 4 (instead of
uuid.UUID).
Configuring a UUID Representation
Users can work around the problems described above by configuring their
applications with the appropriate UuidRepresentation.
Configuring the representation modifies PyMongo’s behavior while
encoding uuid.UUID objects to BSON and decoding Binary subtype 3 and 4
fields from BSON.
Applications can set the UUID representation in one of the following ways:
1. At the MongoClient level using the uuidRepresentation URI option, e.g.:
   client = MongoClient("mongodb://a:27107/?uuidRepresentation=standard")
   Valid values are:
   Value | UUID Representation |
   ---|---|
   unspecified | UNSPECIFIED |
   standard | STANDARD |
   pythonLegacy | PYTHON_LEGACY |
   javaLegacy | JAVA_LEGACY |
   csharpLegacy | CSHARP_LEGACY |
2. At the MongoClient level using the uuidRepresentation kwarg option, e.g.:
   from bson.binary import UuidRepresentation
   client = MongoClient(uuidRepresentation=UuidRepresentation.STANDARD)
3. At the Database or Collection level by supplying a suitable CodecOptions instance, e.g.:
   from bson.codec_options import CodecOptions
   csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)
   java_opts = CodecOptions(uuid_representation=UuidRepresentation.JAVA_LEGACY)
   # Get database/collection from client with csharpLegacy UUID representation
   csharp_database = client.get_database('csharp_db', codec_options=csharp_opts)
   csharp_collection = client.testdb.get_collection('csharp_coll', codec_options=csharp_opts)
   # Get database/collection from existing database/collection with javaLegacy UUID representation
   java_database = csharp_database.with_options(codec_options=java_opts)
   java_collection = csharp_collection.with_options(codec_options=java_opts)
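Supported UUID Representations
UUID Representation | Default? | Encode uuid.UUID to | Decode Binary subtype 4 to | Decode Binary subtype 3 to |
---|---|---|---|---|
STANDARD | No | Binary subtype 4 | uuid.UUID | Binary subtype 3 |
UNSPECIFIED | Yes, in PyMongo>=4 | Raise ValueError | Binary subtype 4 | Binary subtype 3 |
PYTHON_LEGACY | No | Binary subtype 3 with standard byte-order | Binary subtype 4 | uuid.UUID |
JAVA_LEGACY | No | Binary subtype 3 with Java legacy byte-order | Binary subtype 4 | uuid.UUID |
CSHARP_LEGACY | No | Binary subtype 3 with C# legacy byte-order | Binary subtype 4 | uuid.UUID |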
We now detail the behavior and use-case for each supported UUID representation.
UNSPECIFIED
Attention
Starting in PyMongo 4.0, UNSPECIFIED is the default UUID representation
used by PyMongo.
The UNSPECIFIED representation prevents the incorrect interpretation of UUID
bytes by stopping short of automatically converting UUID fields in BSON to
native UUID types. Decoding a UUID when using this representation returns a
Binary object instead. If required, users can coerce the decoded Binary
objects into native UUIDs using the as_uuid() method and specifying the
appropriate representation format. The following example shows
what this might look like for a UUID stored by the C# driver:
from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
from bson.binary import Binary, UuidRepresentation
from uuid import uuid4
# Using UuidRepresentation.CSHARP_LEGACY
csharp_opts = CodecOptions(uuid_representation=UuidRepresentation.CSHARP_LEGACY)
# Store a legacy C#-formatted UUID
input_uuid = uuid4()
collection = client.testdb.get_collection('test', codec_options=csharp_opts)
collection.insert_one({'_id': 'foo', 'uuid': input_uuid})
# Using UuidRepresentation.UNSPECIFIED
unspec_opts = CodecOptions(uuid_representation=UuidRepresentation.UNSPECIFIED)
unspec_collection = client.testdb.get_collection('test', codec_options=unspec_opts)
# UUID fields are decoded as Binary when UuidRepresentation.UNSPECIFIED is configured
document = unspec_collection.find_one({'_id': 'foo'})
decoded_field = document['uuid']
assert isinstance(decoded_field, Binary)
# Binary.as_uuid() can be used to coerce the decoded value to a native UUID
decoded_uuid = decoded_field.as_uuid(UuidRepresentation.CSHARP_LEGACY)
assert decoded_uuid == input_uuid
Native uuid.UUID objects cannot directly be encoded to Binary when the UUID
representation is UNSPECIFIED, and attempting to do so will result in an
exception:
unspec_collection.insert_one({'_id': 'bar', 'uuid': uuid4()})
Traceback (most recent call last):
...
ValueError: cannot encode native uuid.UUID with UuidRepresentation.UNSPECIFIED. UUIDs can be manually converted to bson.Binary instances using bson.Binary.from_uuid() or a different UuidRepresentation can be configured. See the documentation for UuidRepresentation for more information.
Instead, applications using UNSPECIFIED must explicitly coerce a native UUID
using the from_uuid() method:
explicit_binary = Binary.from_uuid(uuid4(), UuidRepresentation.STANDARD)
unspec_collection.insert_one({'_id': 'bar', 'uuid': explicit_binary})
STANDARD
Attention
This UUID representation should be used by new applications or applications that are encoding and/or decoding UUIDs in MongoDB for the first time.
The STANDARD representation enables cross-language compatibility by ensuring
the same byte-ordering when encoding UUIDs from all drivers. UUIDs written by
a driver with this representation configured will be handled correctly by
every other driver, provided it is also configured with the STANDARD
representation.
STANDARD encodes native uuid.UUID objects to Binary subtype 4 objects.
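At the BSON level, a binary value is laid out as a little-endian int32 length, a one-byte subtype, and then the payload bytes. The following standard-library sketch builds that layout by hand to show what "subtype 4" means for a UUID. The helper bson_binary_payload is hypothetical and for illustration only; real applications should rely on PyMongo's own encoder.

```python
import struct
from uuid import UUID

UUID_SUBTYPE = 4  # Binary subtype written by the STANDARD representation


def bson_binary_payload(data: bytes, subtype: int) -> bytes:
    """Hypothetical helper: int32 length (little-endian) + subtype byte + data."""
    return struct.pack("<i", len(data)) + bytes([subtype]) + data


u = UUID("00112233-4455-6677-8899-aabbccddeeff")
payload = bson_binary_payload(u.bytes, UUID_SUBTYPE)

# length = 16 -> 10000000, subtype -> 04, then the 16 UUID bytes unchanged
print(payload.hex())
```

The UUID bytes appear in the payload in the same order as in the canonical textual form, which is what makes this representation portable across drivers.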
PYTHON_LEGACY
Attention
This UUID representation should be used when reading UUIDs generated by existing applications that use the Python driver but don’t explicitly set a UUID representation.
Attention
PYTHON_LEGACY was the default UUID representation in PyMongo 3.
The PYTHON_LEGACY representation corresponds to the legacy representation of
UUIDs used by PyMongo. This representation conforms with
RFC 4122 Section 4.1.2.
The following example illustrates the use of this representation:
from bson.codec_options import CodecOptions, DEFAULT_CODEC_OPTIONS
from bson.binary import Binary, UuidRepresentation
from uuid import uuid4
# No configured UUID representation
collection = client.python_legacy.get_collection('test', codec_options=DEFAULT_CODEC_OPTIONS)
# Using UuidRepresentation.PYTHON_LEGACY
pylegacy_opts = CodecOptions(uuid_representation=UuidRepresentation.PYTHON_LEGACY)
pylegacy_collection = client.python_legacy.get_collection('test', codec_options=pylegacy_opts)
# UUIDs written by PyMongo 3 with no UuidRepresentation configured
# (or PyMongo 4.0 with PYTHON_LEGACY) can be queried using PYTHON_LEGACY
uuid_1 = uuid4()
pylegacy_collection.insert_one({'uuid': uuid_1})
document = pylegacy_collection.find_one({'uuid': uuid_1})
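PYTHON_LEGACY encodes native uuid.UUID objects to Binary subtype 3 objects,
preserving the same byte-order as bytes:
from bson.binary import Binary
from uuid import uuid4
# Store a second UUID through the PYTHON_LEGACY collection, then query it by raw bytes
uuid_2 = uuid4()
pylegacy_collection.insert_one({'uuid': uuid_2})
document = pylegacy_collection.find_one({'uuid': Binary(uuid_2.bytes, subtype=3)})
assert document['uuid'] == uuid_2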
JAVA_LEGACY
Attention
This UUID representation should be used when reading UUIDs
written to MongoDB by legacy applications (i.e. applications that don’t
use the STANDARD representation) using the Java driver.
The JAVA_LEGACY representation corresponds to the legacy representation of
UUIDs used by the MongoDB Java Driver.
Note
The JAVA_LEGACY representation reverses the order of bytes 0-7,
and bytes 8-15.
As an example, consider the same UUID described in Legacy Handling of UUID Data.
Let us assume that an application used the Java driver without an explicitly
specified UUID representation to insert the example UUID
00112233-4455-6677-8899-aabbccddeeff into MongoDB. If we try to read this
value using PYTHON_LEGACY, we end up with an entirely different UUID:
UUID('77665544-3322-1100-ffee-ddccbbaa9988')
However, if we explicitly set the representation to JAVA_LEGACY, we get the
correct result:
UUID('00112233-4455-6677-8899-aabbccddeeff')
PyMongo uses the specified UUID representation to reorder the BSON bytes and
load them correctly. JAVA_LEGACY encodes native uuid.UUID objects to Binary
subtype 3 objects, while performing the same byte-reordering as the legacy
Java driver’s UUID to BSON encoder.
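The difference between the two reads can be illustrated with the standard library. This is a sketch under the assumption that the stored value holds the example UUID in the Java legacy layout shown earlier; decode_java_legacy is a hypothetical helper, not part of PyMongo.

```python
from uuid import UUID


def decode_java_legacy(raw: bytes) -> UUID:
    """Hypothetical helper: undo the Java legacy reordering of a 16-byte value."""
    return UUID(bytes=raw[7::-1] + raw[:7:-1])


# Bytes as a legacy Java application would have stored the example UUID
stored = bytes.fromhex("7766554433221100ffeeddccbbaa9988")

print(UUID(bytes=stored))          # read without reordering: a different UUID
print(decode_java_legacy(stored))  # read with reordering: the original UUID
```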
CSHARP_LEGACY
Attention
This UUID representation should be used when reading UUIDs
written to MongoDB by legacy applications (i.e. applications that don’t
use the STANDARD representation) using the C# driver.
The CSHARP_LEGACY representation corresponds to the legacy representation of
UUIDs used by the MongoDB C# Driver.
Note
The CSHARP_LEGACY representation reverses the order of bytes 0-3,
bytes 4-5, and bytes 6-7.
As an example, consider the same UUID described in Legacy Handling of UUID Data.
Let us assume that an application used the C# driver without an explicitly
specified UUID representation to insert the example UUID
00112233-4455-6677-8899-aabbccddeeff into MongoDB. If we try to read this
value using PYTHON_LEGACY, we end up with an entirely different UUID:
UUID('33221100-5544-7766-8899-aabbccddeeff')
However, if we explicitly set the representation to CSHARP_LEGACY, we get the
correct result:
UUID('00112233-4455-6677-8899-aabbccddeeff')
PyMongo uses the specified UUID representation to reorder the BSON bytes and
load them correctly. CSHARP_LEGACY encodes native uuid.UUID objects to Binary
subtype 3 objects, while performing the same byte-reordering as the legacy
C# driver’s UUID to BSON encoder.
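The C# legacy reordering coincides with the little-endian layout exposed by Python's uuid module, so the two reads can be sketched with the standard library. This assumes the stored value holds the example UUID in the C# legacy layout shown earlier and is an illustration rather than PyMongo's decoder.

```python
from uuid import UUID

# Bytes as a legacy C# application would have stored the example UUID
stored = bytes.fromhex("33221100554477668899aabbccddeeff")

# Read without reordering: a different UUID
wrong = UUID(bytes=stored)

# Read with bytes 0-3, 4-5 and 6-7 reversed: the original UUID
right = UUID(bytes_le=stored)

print(wrong)
print(right)
```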