Implementation
Schema
Data Structures
graph TD
%% Main hierarchy
AssetClass["Asset-Class"] --> Asset
Asset --> Object1["Object-1"]
Asset --> Object2["Object-2"]
Asset --> Object3["Object-3"]
Asset --> Object4["Object-4"]
Asset --> Object5["Object-5"]
Asset --> Content1["Content-1"]
Asset --> Content2["Content-2"]
Asset --> Content3["Content-3"]
%% Relationships between objects and content
Object1 -.-> Content1
Object2 -.-> Content1
Object3 -.-> Content2
Object4 -.-> Content2
Object5 -.-> Content3
%% Info boxes with annotations
AssetClass -.- AssetClassInfo["• resource_model<br>• trained_model<br>• nlp_model<br> etc"]
Asset -.- AssetInfo["• resource_models_1/v0.01<br>• resource_models_1/v0.02 etc"]
Content1 -.- ContentInfo["id:<br>gs:md5_gdsxxxxxxxxHiS1==<br>dvBDiGQ==<br><br>type: application/json"]
Object1 -.- ObjectInfo["path: outputs/summary.json<br>content_id:<br>gs:md5xxxxxxxx5SI86qlgm<br>UcLmQ=="]
%% Styling
classDef mainNode fill:white,stroke:#333,stroke-width:1px,rx:4px,ry:4px,font-family:Arial,font-size:14px;
classDef infoBox fill:#f5f5f5,stroke:#ddd,stroke-width:1px,rx:6px,ry:6px,font-family:Arial,font-size:12px,text-align:left;
classDef assetClassNode fill:white,stroke:#333,stroke-width:1px,rx:4px,ry:4px,font-weight:bold;
classDef assetNode fill:white,stroke:#333,stroke-width:2px,rx:4px,ry:4px,font-weight:bold;
%% Apply styles
class AssetClass assetClassNode;
class Asset assetNode;
class Object1,Object2,Object3,Object4,Object5,Content1,Content2,Content3 mainNode;
class AssetClassInfo,AssetInfo,ContentInfo,ObjectInfo infoBox;
%% Layout adjustments
subgraph ObjectGroup [" "]
Object1
Object2
Object3
Object4
Object5
end
subgraph ContentGroup [" "]
Content1
Content2
Content3
end
%% Make subgraph backgrounds transparent
classDef transparent fill:transparent,stroke:transparent;
class ObjectGroup,ContentGroup transparent;
Asset Collection
Conceptually, the Asset Collection is a classification of similar types of assets, grouped by use case. For example, you could define a `text_analysis` asset-class to store assets pertaining to
natural language processing components. Similarly, you might declare a `model_training` or `sequence_analyzer` asset-class to manage their respective
assets. All assets belonging to an asset-class share the same storage in the cloud bucket, which helps optimize storage requirements
since files with the same content need not be uploaded again.
classDiagram
class asset_class {
id uuid
class_name varchar
top_hash varchar
created_by varchar
}
note "example :
n4xxxxxxa-xxxa-xxx0-xxxc-axxxxxxxxf1
resource_model
960exxxxxxxxxxxxxxxxxxxxxxxxc56
bob prat"
Asset
Every time you call `ama init`, it creates an Asset, i.e. a new member of the asset-class. At a high level,
an Asset represents a collection of all digital resources you need for your activity. Assets are automatically version tracked:
any change you make to an asset is diffed and stored separately, so it can be inspected and traced back to its source.
classDiagram
class asset_class {
id uuid
}
class asset {
id uuid
asset_class uuid
seq_id varchar
version varchar
parent_asset uuid
owner varchar
refs varchar
patch text
}
asset_class --|> asset
note "Example:
n4xxxxxxa-xxxa-xxx0-xxxc-axxxxxxxxf1
resource_model
2
0.2.1
resource_model/1/0.11
chris prat
[resource_model/2/0.1]
[diff from parent_asset]"
Content
A Content is an abstraction for stored data. Contents are immutable.
classDiagram
class asset_class {
id uuid
counter int
name varchar
class_type varchar
title varchar
description text
readme text
}
class asset_class_content_rel {
id int
asset_class uuid
content varchar
}
class content {
id varchar
mime_type varchar
hash varchar
size bigint
meta json
}
asset_class -- asset_class_content_rel
asset_class_content_rel -- content
- Contents are stored in the bucket and their metadata is stored in the Contents Table in the database
- Contents can be of many types depending on where they are stored
- file-content
- sql-content
- url-content
- docker-content etc.
Managed-Content vs Proxy-Content
- Managed-Content is content whose underlying storage is controlled by the asset-manager. This is the default content type
that gets created when you add any file / data to an asset.
- Proxy-Content is content whose storage users actively manage themselves, while still using asset-manager for storage, retrieval, sharing
and version control of data. For example, users may want to store data in google buckets and access those directly (without going through asset-manager).
This is especially useful when teams want minimal friction in adopting a data-handling tool like asset-manager.
In other cases, users may want to add pre-existing cloud data without incurring the cost of additional storage. For example, dna_sequence raw data
can often run to many terabytes, and any duplication of it (to create managed-content) would add large cost overheads.
- A key limitation of proxy-content is that asset-manager cannot guarantee availability of the data, since users can modify / delete it
at source. However, asset-manager can still be used to index proxy-contents for lineage tracing in experiments.
Calculating ID for Content
- The id of a managed-content is a combination of the md5 hash of its data and its storage system, e.g.
gs:md5_gDsxxxxxxxx5fH==. The rationale behind the composite key
is to allow the same content to be stored in multiple locations if required. For example, depending on the use case, we may want a copy of an
output stored in a gcs bucket as well as in a database or bigquery.
- On the other hand, the id of a proxy-content has the following form:
gs:proxy_md5_gDsxxxxxxxx5fH==. The proxy flag shows that it's a proxy content.
However, unlike managed-content, the md5 hash here is a hash of the string src_url:content_hash.
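The id construction described above can be sketched in Python; the base64-encoded md5 digest and the `gs` storage prefix are illustrative assumptions about the encoding, not a confirmed spec:

```python
import base64
import hashlib

def managed_content_id(data: bytes, storage: str = "gs") -> str:
    """Managed-content id: storage system + md5 hash of the raw data."""
    digest = base64.b64encode(hashlib.md5(data).digest()).decode()
    return f"{storage}:md5_{digest}"

def proxy_content_id(src_url: str, content_hash: str, storage: str = "gs") -> str:
    """Proxy-content id: md5 hash of the string 'src_url:content_hash'."""
    digest = base64.b64encode(
        hashlib.md5(f"{src_url}:{content_hash}".encode()).digest()
    ).decode()
    return f"{storage}:proxy_md5_{digest}"
```

The composite key lets the same bytes live under different storage prefixes without id collisions.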
Object
Asset holds a collection of objects. An Object is a relation between Content and a file path inside the asset-repo dir of the user.
classDiagram
class asset_class {
id: uuid
}
class asset {
id: uuid
asset_class: uuid
}
class asset_object_relations {
id: uuid
asset: uuid
object: uuid
}
class object {
id: uuid
path: varchar
content: uuid
created_by: varchar
created_at: timestamp
}
class content {
id: uuid
asset_class: uuid
}
asset --> asset_class
asset --> asset_object_relations
asset_object_relations --> object
object <-- content
content <-- asset_class
note "Example:
afxxxxxxxf1
gs:md55xxxxxxxxxxxxxxxxe1Q==
bob smith
2021-11-09T11-00-17-PST"
- Object to Content is a many-to-one relationship i.e. multiple Objects can point to the same content
- Each record in the Objects Table has a foreign key to the Contents Table.
- Object to Asset is a many-to-many relationship i.e. an Asset holds multiple Objects and the same object can be shared by multiple Assets.
- The AssetObjects Join table maintains the relationship between Objects and Assets
- Objects are stored in the Objects Table in the database and in the asset-manifest.yaml file in the bucket
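The Object/Content/Asset relationships above can be sketched with dataclasses; the field sets are abridged from the schema, and the hashing choices follow the descriptions in this document:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Content:
    id: str            # e.g. "gs:md5_<hash>"; Contents are immutable
    mime_type: str
    size: int

@dataclass(frozen=True)
class Object:
    path: str          # file path inside the asset-repo dir
    content_id: str    # many Objects may point to the same Content

@dataclass
class Asset:
    objects: set = field(default_factory=set)  # an Asset holds many Objects
```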
Creating Asset
Initializing an Asset
cd your_dir
ama init class_name
The following steps are performed.
flowchart TD
A[Init command] --> B[Init Repo class]
B --> C[Init Asset Collection]
subgraph RepoInfo
direction TB
D1["Ensures that operations are performed inside an asset-repo"]
end
subgraph AssetInfo
direction TB
E1["Handles all operations related to an Asset"]
end
B -.- RepoInfo
C -.- AssetInfo
classDef command fill:#F9D949,stroke:#333,stroke-width:1px
classDef info fill:#f5f5f5,stroke:#ddd,stroke-width:1px,stroke-dasharray: 5 5
class A,B,C command
class D1,E1 info
Create Repo instance
- The Repo class checks if we are currently at the root of, or inside, an already existing asset-repo
- If no asset-repo is found, the Repo class initializes one by creating a .assets directory at the current directory
Create Asset instance
- We create a new asset-manifest.yaml file at the root of the asset-repo; this file is a serialized representation of the Asset instance
- We create a cache directory inside the .assets dir (if it does not already exist). This is used to cache all files added to an asset
If there is an existing asset-manifest.yaml file with uncommitted changes and the user calls `ama init`, we ask the user to either commit
the changes first or lose them. If the user decides to commit the changes, we upload the asset.
If the user decides to lose the changes, we revert the asset to its parent.
How do we find if there are uncommitted changes?
- When a user adds a new file to an asset, if the asset_id is not null, we move asset_id to parent_asset_id and set asset_id to null.
- We retain the seq_id, since this asset will be a new version of the parent.
- When the user removes a file and the parent_id is not null, we compare the asset's hash to the parent's hash; if they are the same,
we make parent_id the asset_id and parent.parent_id the asset's parent_id.
- How do we compare two assets? We compare them by their hash, which is a hash of all objects in the asset.
- What's a commit-hash? A commit-hash is the hash of an asset that has been successfully committed to the database.
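The asset-hash comparison above might be implemented as follows; representing each object as a `path:content_id` pair is an illustrative assumption:

```python
import hashlib

def asset_hash(objects) -> str:
    """Hash of an asset: a digest over all of its objects.

    Objects are (path, content_id) pairs; sorting makes the hash
    independent of insertion order.
    """
    h = hashlib.md5()
    for entry in sorted(f"{path}:{cid}" for path, cid in objects):
        h.update(entry.encode())
    return h.hexdigest()

def has_uncommitted_changes(asset_objects, parent_objects) -> bool:
    return asset_hash(asset_objects) != asset_hash(parent_objects)
```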
If the user calls ama init:
- set asset_id to null (the server will assign asset_id)
- set parent to null (the server will assign the parent id)
- set seq_id to null (the server will assign the seq_id)
Adding files to an asset
ama add output/logs/annotation.log
The following actions are performed to add a file to an asset.
flowchart TD
A[add_command] --> B[init Content]
B --> C[init Object]
C --> D[add Object to asset]
subgraph Content
direction TB
E1["• Handles the actual file content and is responsible for integrity of data
• Stores meta information to detect changes to underlying data, renaming or file deletion etc
• Hashable based on id i.e. storage:hash"]
end
subgraph Object
direction TB
F1["• This is the primary interface object for the user
• Holds a reference to the underlying Content object
• Creates links between Content and the added file_path
• Hashable based on the file_path"]
end
subgraph Asset
direction TB
G1["• Holds a list of objects
• Holds a reference to parent asset if any"]
end
B -.- Content
C -.- Object
D -.- Asset
classDef command fill:#F9D949,stroke:#333,stroke-width:1px
classDef info fill:#f5f5f5,stroke:#ddd,stroke-width:1px,stroke-dasharray: 5 5
class A,B,C,D command
class E1,F1,G1 info
- Verify that the file exists at the given path
- Calculate the hash of the file and check if a content with that hash is already registered.
- If the content exists, reuse the stored content; otherwise create a new content and add it to the contents (set) variable in the asset.
- Cache the added file by creating a hard-link from the file into the cache path. Given the large size of files to be handled by AMA, we opted for hard-links instead of copying files.
- Contents can be of many types depending on where they are stored, i.e. file-content, sql-content, url-content, docker-content etc.
Ensuring atomicity of asset
Consider a scenario: user A adds a few files to an asset and commits the asset, but the commit flow is interrupted. User A
then adds a few more files to the asset and commits the asset again.
Therefore, for every commit, ama should be able to detect:
- If the asset is inheriting from another asset, whether that asset was committed successfully.
- If the parent asset was committed, we can proceed with the regular flow.
- However, if the previous commit was not successful.
- We should ask the user, if the changes are part of the previous interrupted transaction
- If user answers - yes, then we can proceed with committing the existing asset
- If the user answers - No, then we need to initialize a new-asset, set parent to null and commit that asset
Ensuring content integrity
A core principle of AMA is that the file-name is the md5 hash of its contents.
Data files can be very large, so in the interest of storage efficiency, when a user adds files to an asset we don't
make a copy of the file. Instead, we create a hardlink to the file inside the cache directory.
The hardlink is a pointer to the same inode, so whenever the user makes any changes to the files, the cached file
would be changed as well. So when a user alters the file after adding it to the asset, the file-name no longer reflects
the hash of its contents. This causes data integrity issues.
To avoid this, we need to make sure that all file-names match their content hash before they are uploaded.
We could recompute the file hashes and verify the filenames, but this presents the following problems:
- Computing md5 hashes is an expensive operation, especially for large files.
- The user may have deleted the file after adding it, which this approach does not handle.
- The user may have renamed the file after adding it.
In short, we need to detect if a file has been altered, renamed or deleted before committing to the cloud. To handle this,
we take the following approach.
When a user adds a file, we compute the file stat and store the following information.
md5 hash
st_mtime : last modified time of the file content
st_nlink : number of pointers (hard links) to the inode of the file
st_ctime : last change time of the file metadata; changes when a file is renamed
st_ino : inode number of the file
st_size : file size in bytes
Detecting if a file was deleted
This is very straightforward. Once the user adds the file, we create a hardlink to it inside the cache. Therefore,
the `st_nlink` of the file should be at least 2. If it is less than 2, the file was deleted after adding.
If we detect that the file is deleted, we ask the user to re-add the file to the asset or remove the file from the asset before it can be
uploaded.
Detecting if the contents were modified
If a file has not been deleted, we need to check if it's been modified.
We check if the st_ctime has changed. If it has changed, then we check if the st_size has changed. If st_size has changed, then
file has been altered - we don't need to compute the md5 for that. We take this approach, so that we can defer computing md5 hash,
which is expensive - until we must. If st_ctime has changed but size is same, then we compute the md5 and compare with the stored value.
Detecting if a file has been renamed
This is tricky. Any change in name will change the `st_mtime`, but there are other things that can change the `st_time` as well.
For example, the user might change the name and revert it back, which will also change the `st_mtime`. So detecting renaming
involves a few workarounds.
If a file has neither been deleted or altered. We check if it has been renamed.
We first check if `st_mtime` has changed. If not - then we can be sure, it's not been renamed. If the `st_mtime` has changed,
we find the object(s) that refer to the same file and compute the inode numbers of the path they point to. If the inode numbers are same,
then the link is valid and there is no renaming. If the inode numbers are different, then it's a different file with the same name.
The linked file has been renamed. So we prompt the user to remove the old file.
Uploading the Asset
Check if the Asset can be uploaded.
In order to be eligible for upload, an Asset must have a designated storage location which it inherits from the asset collection it belongs to.
- Check if the asset has a class_id; if not, request the id of the asset collection from the asset_registry
- The request to the asset_registry must include the class_name
- The asset_registry receives the request and checks whether an asset collection exists for the given name; if not, it creates one.
- The client receives the class_id and top_hash from the asset_registry and updates the asset-manifest
- The asset is now eligible for upload
Removing files from an asset
cd your_dir
ama remove file
The following steps are performed.
flowchart TD
A{{remove command}} --> B[Asset]
B --> C{{Find objects matching file_path}}
%% Object stack connected to Asset and Find objects
J[Objects] -.- B
J -.-> C
C --> D[Objects]
D --> E{{If multiple, remove last added Object}}
E --> F{{Remove Object Reference from Content}}
E --> G{{Link the previous Object to path, if any}}
G --> F
%% Single Contents connected to Asset, Remove Object Reference, and Remove file
K[Contents] -.- B
K -.-> F
F --> K
K --> I{{Remove file. If content-ref counter is 0, remove cached-file}}
- Normalize the user-input path to asset_repo dir
- Find all objects that match the path
[obj for obj in asset.objects if obj.path == path]
- Remove the object from the asset; if multiple objects match, remove the last added object
- Remove the object reference from the linked content; if the ref count of the content drops to 0, remove the cached file pointed to by the content
- If multiple Objects were found earlier and we removed the last one, link the previous object to the file-path
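A sketch of the removal steps, using plain dicts in place of the real Object and Content classes:

```python
def remove_file(asset: dict, path: str) -> dict:
    """Sketch of 'ama remove': drop the last-added Object for a path and
    release its Content reference."""
    matches = [o for o in asset["objects"] if o["path"] == path]
    if not matches:
        raise ValueError(f"no object found for {path}")
    obj = matches[-1]                              # last added object wins
    asset["objects"].remove(obj)
    asset["content_refs"][obj["content_id"]] -= 1
    if asset["content_refs"][obj["content_id"]] == 0:
        asset["cached_files"].pop(obj["content_id"], None)  # drop cached file
    return obj
```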
Listing Files in Asset
ama list
The following steps are performed.
flowchart LR
A[list command] --> B[Asset]
B --> C[list objects]
C --> D[get linked contents]
B -.- E[Objects]
B -.- F[Contents]
F -.-> D
style A fill:#f9d54b,stroke:#333,polygon
style C fill:#f9d54b,stroke:#333,polygon
style D fill:#f9d54b,stroke:#333,polygon
- get the list of all objects in the asset
list(asset.objects)
- each object holds a reference to its content, i.e. object.content; for each object, get the content_hash and content_type
- list the results in table format
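The listing steps can be sketched as follows (dict stand-ins for the real classes):

```python
def list_asset(asset_objects, contents):
    """Sketch of 'ama list': one row per object with its content hash and type."""
    rows = []
    for obj in asset_objects:
        content = contents[obj["content_id"]]
        rows.append((obj["path"], content["hash"], content["mime_type"]))
    return rows
```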
Uploading Asset
flowchart LR
subgraph CLIENT [CLIENT]
C1[Verify Asset-class]
C2[Contents to Upload]
C3[Upload to Staging]
C4[Staged content - linked objects]
C5[Commit asset]
end
subgraph SERVER [SERVER]
S1[Update seq_id/ver_id]
S2[Transfer from staging to remote]
S3[Create records in db]
S4[Write manifest.yml to bucket]
S5[Responds to Client commit-hash]
end
C1 --> C2
C2 --> C3
C3 --> C4
C4 --> C5
C5 --> S1
S1 --> S2
S2 --> S3
S3 --> S4
S4 --> S5
SERVER -. Response from server .-> CLIENT
%% Styling
classDef clientBox fill:#8BC34A,stroke:#333,color:black
classDef serverBox fill:#2196F3,stroke:#333,color:black
classDef clientNode fill:#FFC107,stroke:#333,color:black,shape:diamond
classDef serverNode fill:#FFC107,stroke:#333,color:black,shape:diamond
class CLIENT clientBox
class SERVER serverBox
class C1,C2,C3,C4,C5 clientNode
class S1,S2,S3,S4,S5 serverNode
Create Asset Collection
It's the asset collection that owns the storage location. So in order to be eligible for upload, an asset must have a valid
class_id
- Check if the asset has a class_id; if not, request the id of the asset collection from the asset_registry
- The request to the asset_registry must include the class_name
- The asset_registry receives the request and checks whether an asset collection exists for the given name; if not, it creates one.
- The client receives the class_id and top_hash from the asset_registry and updates the asset-manifest
- The asset is now eligible for upload
flowchart TD
AssetClient["Ama"] --> RequestServer
RequestServer{"Request Server"} -->|class_name| AssetServer["Ama-Server"]
AssetServer --> CreateClass{"Create class record,
if not exists"}
CreateClass --> DB[("Database")]
CreateClass -. class_id, top_hash .-> AssetClient
DB -.- AssetServer
%% Styling
classDef client fill:white,stroke:#333,stroke-width:1px,color:black
classDef server fill:white,stroke:#333,stroke-width:1px,color:black
classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef database fill:#f5f5f5,stroke:#333,stroke-width:1px,color:black
class AssetClient client
class AssetServer server
class RequestServer,CreateClass decision
class DB database
Find Files to be Uploaded
Once we have the class_id and top_hash, the asset-contents can be uploaded to the staging area. We now query the asset for the list of files that
need to be uploaded. To maintain data integrity, a file must meet the following conditions in order to be a
candidate for upload:
- it was added by the user
- it has not been deleted, modified or renamed
flowchart TD
Content["Content"] --> File["File"]
File --> AddedByUser["Added by user"]
AddedByUser -->|No| Ignore["ignore"]
AddedByUser -->|Yes| IsDeleted["Is Deleted"]
IsDeleted -->|Yes| AskRemove["Ask user to remove.
'ama remove file'"]
IsDeleted -->|No| IsModified["Is Modified"]
IsModified -->|Yes| AskReAdd["Ask user to re-add.
'ama add file'"]
IsModified -->|No| AddToUpload["Add to upload list"]
AddToUpload --> UploadList["Upload list"]
%% Styling
classDef default fill:white,stroke:#333,stroke-width:1px,color:black
classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef highlight fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef command fill:white,stroke:none,color:black
classDef list fill:white,stroke:#333,stroke-width:1px,color:black
class Content,File,Ignore,UploadList default
class AddedByUser,IsDeleted,IsModified,AddToUpload decision
class AskRemove,AskReAdd highlight
class Command1,Command2 command
class UploadList list
Since file sizes can be very large, AMA takes the approach of not copying files. Therefore,
in order to meet the above requirements, when a file is added to an asset we compute and store metadata about the file:
st_mtime: timestamp (last modified time of the file content)
st_nlink: int (number of hard links to the file)
st_ctime: timestamp (last change time of the file metadata)
st_ino: int (inode number, i.e. disk storage pointer for the file)
Before uploading, we recompute the file metadata and compare it with the stored values to identify whether a file has been
altered by the user after adding it to the asset.
Exclude content already available in cloud
In AMA, we use the md5 hash of files for storage and indexing. It's possible that a user is adding
a file which already exists in the cloud and is indexed, in which case the file need not be uploaded again. Therefore,
after determining the list of files to be uploaded, we check whether any of them exist in remote storage.
If a file is already indexed, it will exist in the remote repo; if another user is in the process of uploading the file
and it is yet to be indexed, it will exist in the staging area. Therefore, we check both the remote repo and the staging repo.
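The exclusion step reduces to a set-membership filter; here the two sets stand in for the actual existence (and checksum) checks against the remote and staging buckets:

```python
def filter_upload_list(upload_ids, remote_ids, staging_ids):
    """Drop content ids already present in the remote repo or staging area."""
    return [cid for cid in upload_ids
            if cid not in remote_ids and cid not in staging_ids]
```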
graph TD
Content[("Content")]
CheckRemote["Check if exists"]
RemoteRepo[("Remote repo")] --> CheckRemote
CheckRemote -->|Yes| RemoveUpload["Remove from upload"]
RemoveUpload -->Content
CheckRemote --> CheckStaging["Check if exists + verify hash"]
StagingRepo[("Staging repo")] --> CheckStaging
CheckStaging -->|No| UploadList[("Upload list")]
CheckStaging -->|Yes| RemoveUpload
%% Styling
classDef default fill:white,stroke:#333,stroke-width:1px,color:black
classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef highlight fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef stacked fill:white,stroke:#333,stroke-width:1px,color:black
class Content,RemoteRepo,StagingRepo,UploadList stacked
class CheckRemote,CheckStaging,RemoveUpload decision
User access
Staging and remote buckets are separated by design to maintain data integrity. For staging buckets, individual users have
write access, whereas for remote buckets only the asset-server has write access. Individual users have read-only access
to the remote bucket.
graph TD
AssetServer["ama-Server"] --> ReadWrite1["Read-Write"]
ReadWrite1 --> RemoteRepo["Remote
repo"]
ReadWrite1 --> StagingRepo["Staging
repo"]
RemoteRepo --> ReadOnly["Read-only"]
StagingRepo --> ReadWrite2["Read-Write"]
ReadOnly --> User
ReadWrite2 --> User
%% Styling
classDef server fill:white,stroke:#666,stroke-width:3px,color:black
classDef access fill:white,stroke:#333,stroke-width:1px,color:black
classDef repo fill:#f5f5f5,stroke:#333,stroke-width:1px,color:black
classDef user fill:#66B2A0,stroke:none,color:black
class AssetServer server
class ReadWrite1,ReadOnly,ReadWrite2 access
class RemoteRepo,StagingRepo repo
class User user
Upload Files to Staging
graph TD
Content[("Content")] --> StartUpload["Start Upload"]
StartUpload --> StagingRepo[("Staging repo")]
StartUpload --> Staging["Staging"]
ChecksumValid["checksum-valid"]
StagingRepo-- Finish Upload -->ChecksumValid -->|Yes| Staged["Staged"]
Content --> Pending["Pending"]
Pending --> Staging
Staging --> Staged
StartUpload --> ChecksumValid
%% States subgraph
subgraph States
Pending
Staging
Staged
end
%% Styling
classDef default fill:white,stroke:#333,stroke-width:1px,color:black
classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef stacked fill:white,stroke:#333,stroke-width:1px,color:black
classDef state fill:white,stroke:#333,stroke-width:1px,color:black
class Content stacked
class StagingRepo stacked
class StartUpload,ChecksumValid decision
class Pending,Staging,Staged state
We use asyncio for fast concurrent network calls. The staging area is namespaced to the top-hash of the asset-class.
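A sketch of the concurrent staging upload; `upload_one` is a placeholder for the real network call, and the `staging/<top_hash>` prefix merely illustrates the namespacing:

```python
import asyncio

async def upload_one(content_id: str, staging_prefix: str) -> str:
    """Placeholder for a single content upload (a network call in practice)."""
    await asyncio.sleep(0)                 # yield to the event loop
    return f"{staging_prefix}/{content_id}"

async def upload_all(content_ids, top_hash: str):
    """Stage all contents concurrently; staging is namespaced by top_hash."""
    staging_prefix = f"staging/{top_hash}"
    tasks = [upload_one(cid, staging_prefix) for cid in content_ids]
    return await asyncio.gather(*tasks)    # results keep input order
```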
Commit Asset
After the files are successfully uploaded to the staging area, we identify the objects linked to those files. These objects
can now be committed to the asset. The asset and its objects are committed together since they are inter-linked.
The Asset-commit process is a handoff to the asset-server, which performs the steps below. The asset-server receives the request from the asset-client. The payload has the following format:
{
  "class_id": "00000000-0000-0000-0000-000000000001",
  "parent_asset_id": "00000000-0000-0000-0000-000000000011",
  "alias": "my_asset",
  "objects": [...],
  "patch": "..."
}
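Assembling that payload on the client might look like this sketch (field values are illustrative):

```python
import json

def build_commit_payload(class_id, parent_asset_id, alias, objects, patch):
    """Serialize the commit request sent to the asset-server."""
    return json.dumps({
        "class_id": class_id,
        "parent_asset_id": parent_asset_id,
        "alias": alias,
        "objects": objects,
        "patch": patch,
    })
```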
Ensuring Atomicity of the Asset-Commit process
It's possible that the previous commit was interrupted and didn't go through. A few possible reasons:
- Network failure at the user: content staging was successful, but a network error occurred at the user end during the commit process. In such a case
the asset-client will first try to complete the previous commit before initiating a new transaction. The server should
expect to receive the same payload next time since the transaction was not completed.
- Network failure at the server: similar to the above, but this time the error occurred on the server side, in either gcs access or database access.
- User interruption, i.e. ctrl + c.
To ensure atomicity, we need to make sure the previous transaction was committed successfully. In order to achieve this we break down the
asset-commit flow into 2 different stages.
Asset Commit flow - Stage 1
flowchart TD
AssetClient["Client"] --> AskForId["Ask for Id"]
AssetServer["ama-Server"]
CreateRecord["Create record if not exists"]
AskForId -->|"1 class_name, parent_id"| CreateRecord
CreateRecord -->|2| AssetServer
CreateRecord -->|"3 asset_id"| AskForId
AskForId -->|4| SaveAssetId["Save asset_id"]
SaveAssetId --> Asset["Asset"]
%% Styling
classDef client fill:#FFC107,stroke:#333,stroke-width:2px,color:black
classDef server fill:#FFC107,stroke:#333,stroke-width:2px,color:black
classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond
classDef asset fill:white,stroke:#333,stroke-width:1px,color:black
class AssetClient client
class AssetServer server
class AskForId,CreateRecord,SaveAssetId decision
class Asset asset
In stage-1, the asset-client checks whether the asset to be committed has an id; if not, it requests an id from the asset-server. The request payload
contains the class_name and the parent_asset_id (if the asset inherits from another asset). The asset-server receives the request, creates a record
in the asset table (with seq_id and version) and returns the asset_id to the client.
Asset Commit flow - Stage 2
In stage-2, the asset-client requests that the asset be committed. The commit process involves the following steps.
- Transfer contents from staging area in to remote-repo, do checksum validation.
- Create records in Content table, if not exist already
- Create records in Objects table, if not exist already
- Create records in Asset-Objects join table
- Update asset-record with commit-hash
- Respond to client with commit-hash
If the commit process is successful, the asset-server updates the asset record with a commit-hash
and returns the commit-hash to the client. Any time the asset-client initiates a commit, it first checks whether the existing asset has a commit-hash; if it finds one,
the previous commit was successful, so the client follows the regular two-stage process.
If the previous commit was unsuccessful, the client initiates stage-2 again, i.e. commits the previous asset along with any new updates.
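The client-side decision before any commit can be sketched as:

```python
def next_commit_action(asset: dict) -> str:
    """Decide how a commit should proceed (sketch of the two-stage flow).

    - no asset id yet            -> stage-1: ask the server for an id
    - id present, no commit-hash -> previous commit was interrupted: retry stage-2
    - id and commit-hash present -> previous commit succeeded: start a new commit
    """
    if asset.get("id") is None:
        return "stage-1"
    if asset.get("commit_hash") is None:
        return "retry-stage-2"
    return "new-commit"
```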
Suppose the previous transaction was interrupted and the user now wants to add more files to the asset.
To verify, we check whether the previous record created by the same user (with the same node_type and node_name)
has a commit_hash. If the commit_hash is missing, we presume the current updates are part of the
same commit, and we try to recommit.
Downloading Asset
State Management
Adding references to an asset
You can add existing assets as a reference before committing a created asset. A typical flow would be
ama init <class_name> : creates the asset
ama add refs --type input --asset <name>
It's important to note that references are allowed only between root nodes.
Storing an Asset
Asset Storage Structure in Bucket
flowchart LR
Bucket["Bucket/"] --- assets["assets/"]
Bucket --- contents["contents/"]
assets --- asset_class["asset_class_id/"]
asset_class --- asset_id["asset_id/"]
asset_id --- yaml1["0.0.0.yaml"]
asset_id --- yaml2["0.0.1.yaml"]
asset_id --- yaml3["0.0.2.yaml"]
asset_id --- objects["objects.yaml"]
contents --- dots["..."]
%% Styling
classDef directory fill:white,stroke:#333,stroke-width:1px,color:black,rx:4,ry:4
classDef file fill:none,stroke:none,color:black
class Bucket,assets,asset_class,asset_id,contents,dots directory
class yaml1,yaml2,yaml3,objects file
assets is a directory inside the bucket. It holds a list of directories, one for each
asset collection id.
asset_class_id is a directory inside assets. It holds a list of directories, one for each
asset id in that class.
asset_id is a directory inside asset_class_id. It holds a list of files:
- objects.yaml: holds the list of all objects that the asset refers to
- asset.yaml: holds the root information of the asset; this is a yaml representation of the asset record from the db
- version.yaml: holds the changes relevant to that version; this is a yaml representation of the version record from the db
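A small helper sketching how these bucket paths compose (layout per the diagram above; names are illustrative):

```python
def manifest_paths(asset_class_id: str, asset_id: str, version: str) -> dict:
    """Bucket paths for an asset's yaml files."""
    base = f"assets/{asset_class_id}/{asset_id}"
    return {
        "objects": f"{base}/objects.yaml",    # all objects the asset refers to
        "asset": f"{base}/asset.yaml",        # root information of the asset
        "version": f"{base}/{version}.yaml",  # per-version changes
    }
```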