Implementation

Schema


Data Structures

graph TD %% Main hierarchy AssetClass["Asset-Class"] --> Asset Asset --> Object1["Object-1"] Asset --> Object2["Object-2"] Asset --> Object3["Object-3"] Asset --> Object4["Object-4"] Asset --> Object5["Object-5"] Asset --> Content1["Content-1"] Asset --> Content2["Content-2"] Asset --> Content3["Content-3"] %% Relationships between objects and content Object1 -.-> Content1 Object2 -.-> Content1 Object3 -.-> Content2 Object4 -.-> Content2 Object5 -.-> Content3 %% Info boxes with annotations AssetClass -.- AssetClassInfo["• resource_model<br>• trained_model<br>• nlp_model<br> etc"] Asset -.- AssetInfo["• resource_models_1/v0.01<br>• resource_models_1/v0.02 etc"] Content1 -.- ContentInfo["id:<br>gs:md5_gdsxxxxxxxxHiS1==<br>dvBDiGQ==<br><br>type: application/json"] Object1 -.- ObjectInfo["path: outputs/summary.json<br>content_id:<br>gs:md5xxxxxxxx5SI86qlgm<br>UcLmQ=="] %% Styling classDef mainNode fill:white,stroke:#333,stroke-width:1px,rx:4px,ry:4px,font-family:Arial,font-size:14px; classDef infoBox fill:#f5f5f5,stroke:#ddd,stroke-width:1px,rx:6px,ry:6px,font-family:Arial,font-size:12px,text-align:left; classDef assetClassNode fill:white,stroke:#333,stroke-width:1px,rx:4px,ry:4px,font-weight:bold; classDef assetNode fill:white,stroke:#333,stroke-width:2px,rx:4px,ry:4px,font-weight:bold; %% Apply styles class AssetClass assetClassNode; class Asset assetNode; class Object1,Object2,Object3,Object4,Object5,Content1,Content2,Content3 mainNode; class AssetClassInfo,AssetInfo,ContentInfo,ObjectInfo infoBox; %% Layout adjustments subgraph ObjectGroup [" "] Object1 Object2 Object3 Object4 Object5 end subgraph ContentGroup [" "] Content1 Content2 Content3 end %% Make subgraph backgrounds transparent classDef transparent fill:transparent,stroke:transparent; class ObjectGroup,ContentGroup transparent;

Asset Collection

Conceptually, the Asset Collection is a classification of similar types of assets, grouped by use case. You could define a `text_analysis` asset-class to store assets pertaining to natural language processing components. Similarly, you might declare a `model_training` or `sequence_analyzer` asset-class to manage the assets relevant to those workflows. All assets belonging to an asset-class share the same storage in the cloud bucket, which helps optimize storage requirements since files with the same content need not be uploaded again.
classDiagram class asset_class { id uuid class_name varchar top_hash varchar created_by varchar } note "example : n4xxxxxxa-xxxa-xxx0-xxxc-axxxxxxxxf1 resource_model 960exxxxxxxxxxxxxxxxxxxxxxxxc56 bob prat"

Asset

Every time you call `ama init`, it creates an Asset, i.e. a new member of the asset-class. At a high level, an Asset represents a collection of all the digital resources you need for your activity. Assets are automatically version-tracked: any change you make to an asset is diffed and stored separately, and can be inspected and traced back to its source.
classDiagram class asset_class { id uuid } class asset { id uuid asset_class uuid seq_id varchar version varchar parent_asset uuid owner varchar refs varchar patch text } asset_class --|> asset note "Example: n4xxxxxxa-xxxa-xxx0-xxxc-axxxxxxxxf1 resource_model 2 0.2.1 resource_model/1/0.11 chris prat [resource_model/2/0.1] [diff from parent_asset]"

Content

A Content is an abstraction for stored data. Contents are immutable.

classDiagram class asset_class { id uuid counter int name varchar class_type varchar title varchar description text readme text } class asset_class_content_rel { id int asset_class uuid content varchar } class content { id varchar mime_type varchar hash varchar size bigint meta json } asset_class -- asset_class_content_rel asset_class_content_rel -- content
  • Contents are stored in the bucket and their metadata is stored in the Contents Table in the database
  • Contents can be of many types depending on where they are stored
    • file-content
    • sql-content
    • url-content
    • docker-content etc.
Managed-Content vs Proxy-Content
  • Managed-Content is where the storage of the underlying data is under the control of the asset-manager. This is the default content type created when you add any file / data to an asset.
  • Proxy-Content is where users actively manage where their data is stored but still want to use asset-manager for storage, retrieval, sharing and version control of data. For example, users may want to store data in google buckets and access it directly (without going through asset-manager). This is especially useful where teams want minimal friction in adopting a data-handling tool like asset-manager. In other cases, users may want to add pre-existing cloud data without incurring the cost of additional storage. For example, dna_sequence raw data can often run to many terabytes, and any duplication of it (by converting it to managed-content) can add large cost overheads.
  • A key limitation of using proxy-content is that asset-manager cannot guarantee availability of data, since users can modify / delete the data at source. However, asset-manager can still be used to index the proxy-contents for lineage tracing in experiments.
Calculating ID for Content
  • The id of a managed-content is a combination of the md5 hash of its data and its storage system, i.e. gs:md5_gDsxxxxxxxx5fH==. The rationale behind the composite key is to allow the same content to be stored in multiple locations if required. For example, depending on the use case, we may want a copy of the output stored in a gcs bucket and in a database or bigquery.
  • On the other hand, the id of proxy-content has the following form gs:proxy_md5_gDsxxxxxxxx5fH==. The proxy flag shows that it's a proxy content. However, unlike the managed-content, the md5 hash here is a hash of the string src_url:content_hash.
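The two id constructions above can be sketched as follows. This is a minimal sketch: the `gs` prefix and the base64-encoded md5 digest mirror the examples in this section, but the exact encoding and function names are assumptions, not the actual AMA implementation.

```python
import base64
import hashlib

def managed_content_id(data: bytes, storage: str = "gs") -> str:
    # Managed-content: md5 of the raw bytes, combined with the storage system.
    digest = base64.b64encode(hashlib.md5(data).digest()).decode()
    return f"{storage}:md5_{digest}"

def proxy_content_id(src_url: str, content_hash: str, storage: str = "gs") -> str:
    # Proxy-content: the "proxy" flag plus an md5 of "src_url:content_hash".
    key = f"{src_url}:{content_hash}".encode()
    digest = base64.b64encode(hashlib.md5(key).digest()).decode()
    return f"{storage}:proxy_md5_{digest}"
```

An md5 digest is 16 bytes, so its base64 form is 24 characters ending in `==`, matching the example ids above.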

Object

An Asset holds a collection of objects. An Object is a relation between a Content and a file path inside the user's asset-repo dir.

classDiagram class asset_class { id: uuid } class asset { id: uuid asset_class: uuid } class asset_object_relations { id: uuid asset: uuid object: uuid } class object { id: uuid path: varchar content: uuid created_by: varchar created_at: timestamp } class content { id: uuid asset_class: uuid } asset --> asset_class asset --> asset_object_relations asset_object_relations --> object object <-- content content <-- asset_class note "Example: afxxxxxxxf1 gs:md55xxxxxxxxxxxxxxxxe1Q== bob smith 2021-11-09T11-00-17-PST"
  • Object to Content is a many-to-one relationship i.e. multiple Objects can point to the same content
  • Each record in the Objects Table has a foreign key to the Contents Table.
  • Object to Asset is a many-to-many relationship i.e. an Asset holds multiple Objects and the same object can be shared by multiple Assets.
  • The AssetObjects join table maintains the relationship between Objects and Assets
  • Objects are stored in the Objects Table in the database and in the asset-manifest.yaml file in the bucket

Creating Asset

Initializing an Asset

cd your_dir
ama init class_name

The following steps are performed.

flowchart TD A[Init command] --> B[Init Repo class] B --> C[Init Asset Collection] subgraph RepoInfo direction TB D1["Ensures that operations are performed inside an asset-repo"] end subgraph AssetInfo direction TB E1["Handles all operations related to an Asset"] end B -.- RepoInfo C -.- AssetInfo classDef command fill:#F9D949,stroke:#333,stroke-width:1px classDef info fill:#f5f5f5,stroke:#ddd,stroke-width:1px,stroke-dasharray: 5 5 class A,B,C command class D1,E1 info

Create Repo instance

  • The Repo class checks whether we are currently at the root of, or inside, an already existing asset-repo
  • If no asset-repo is found, the Repo class initializes one by creating a .assets directory at the current location

Create Asset instance

  • We create a new asset-manifest.yaml file at the root of the asset-repo; this file is a serialized representation of the Asset instance
  • We create a cache directory inside the .assets dir (if it does not already exist). This is used to cache all files added to an asset
If there is an existing asset-manifest.yaml file with uncommitted changes and the user calls `ama init`, we ask the user to either commit the changes first or lose them. If the user decides to commit, we upload the asset. If the user decides to discard the changes, we revert the asset to its parent.

How do we find if there are uncommitted changes?

  • When a user adds a new file to an asset, if the asset_id is not null, we move asset_id to parent_asset_id and set asset_id to null.
  • We retain the seq_id, since this asset will be a new version of the parent.
  • When the user removes a file and the parent_id is not null, we compare the asset's hash to the parent's hash; if they are the same, we make parent_id the asset_id and parent.parent_id the asset.parent_id.
  • How do we compare two assets? By their hash, which is a hash of all objects in the asset.
  • What's a commit-hash? A commit-hash is the hash of an asset that has been successfully committed to the database.
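The hash comparison above might look like the following sketch. The exact fields hashed and the sorted ordering are assumptions; the point is that the asset hash is derived from all of its objects, so two assets with the same objects compare equal.

```python
import hashlib

def asset_hash(objects) -> str:
    # The asset hash is a hash over all (path, content_id) pairs,
    # sorted so that insertion order does not matter.
    parts = sorted(f"{path}:{content_id}" for path, content_id in objects)
    return hashlib.md5("\n".join(parts).encode()).hexdigest()

def is_same_as_parent(asset_objects, parent_objects) -> bool:
    # Two assets are considered equal when their object hashes match.
    return asset_hash(asset_objects) == asset_hash(parent_objects)
```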

If the user calls `ama init`

  • set asset_id to null - the server will assign asset_id
  • set parent to null - the server will assign parent_id
  • set seq_id to null - the server will assign seq_id

Adding files to an asset

ama add output/logs/annotation.log

The following actions are performed to add a file to an asset.

flowchart TD A[add_command] --> B[init Content] B --> C[init Object] C --> D[add Object to asset] subgraph Content direction TB E1["• Handles the actual file content and is responsible for integrity of data • Stores meta information to detect changes to underlying data, renaming or file deletion etc • Hashable based on id i.e. storage:hash"] end subgraph Object direction TB F1["• This the primary interface object for the user • Holds a reference to the underlying Content object • Creates links between Content and the added file_path • Hashable based on the file_path"] end subgraph Asset direction TB G1["• Holds a list of objects • Holds a reference to parent asset if any"] end B -.- Content C -.- Object D -.- Asset classDef command fill:#F9D949,stroke:#333,stroke-width:1px classDef info fill:#f5f5f5,stroke:#ddd,stroke-width:1px,stroke-dasharray: 5 5 class A,B,C,D command class E1,F1,G1 info
  • Verifies that the file exists at the given path
  • Calculates the hash of the file and checks if a content with that hash is already registered
  • If the content exists, reuses the stored content; otherwise creates a new content and adds it to the contents (set) variable in the asset
  • Caches the added file by creating a hard-link from the file into the cache path. Given the large size of files handled by AMA, we opted for hard-links instead of copying files
  • Contents can be of many types depending on where they are stored, i.e. file-content, sql-content, url-content, docker-content etc.
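Putting the steps above together, a minimal sketch of the add flow. The function name `add_file` and the flat cache layout (one file per hash under `.assets/cache`) are assumptions for illustration.

```python
import hashlib
import os

def add_file(repo_root: str, path: str, contents: set, objects: list) -> str:
    """Hash the file, dedupe against known contents, hard-link into the cache."""
    if not os.path.isfile(path):
        raise FileNotFoundError(path)
    with open(path, "rb") as f:
        digest = hashlib.md5(f.read()).hexdigest()
    content_id = f"gs:md5_{digest}"
    if content_id not in contents:
        contents.add(content_id)
        cache_dir = os.path.join(repo_root, ".assets", "cache")
        os.makedirs(cache_dir, exist_ok=True)
        # Hard-link instead of copying: cheap even for very large files.
        os.link(path, os.path.join(cache_dir, digest))
    objects.append({"path": os.path.relpath(path, repo_root), "content": content_id})
    return content_id
```

Note that adding a second file with identical bytes reuses the existing content and creates no extra copy in the cache.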


Ensuring atomicity of asset

Consider a scenario: user A adds a few files to an asset and commits it, but there is an interruption in the commit flow. User A then adds a few more files to the asset and commits the asset again.

Therefore, for every commit, ama should be able to detect:

  • If the asset is inheriting from another asset, whether that asset was committed successfully.
  • If the parent asset was committed, we can proceed with the regular flow.
  • However, if the previous commit was not successful:
    • We should ask the user whether the changes are part of the previous interrupted transaction
    • If the user answers yes, we can proceed with committing the existing asset
    • If the user answers no, we need to initialize a new asset, set parent to null and commit that asset


Ensuring content integrity

A core principle of AMA is that the file-name is the md5 hash of its contents.

Data files can be very large, so in the interest of storage efficiency, when a user adds files to an asset we don't make a copy of the file. Instead, we create a hardlink to the file inside the cache directory.

The hardlink is a pointer to the same inode, so whenever the user makes any changes to the files, the cached file would be changed as well. So when a user alters the file after adding it to the asset, the file-name no longer reflects the hash of its contents. This causes data integrity issues.

To avoid this, we need to make sure that all file-names match their content hash before they are uploaded. We could recompute the file hashes and verify the filenames match, but this presents the following problems:

  • Computing md5 hashes is an expensive operation, especially for large files
  • The user may have deleted the file after adding it, which this approach does not handle
  • The user may have renamed the file after adding it

In short, we need to detect if a file has been altered, renamed or deleted before committing to the cloud. To handle this, we take the following approach.

When a user adds a file, we compute the file stat and store the following information.

  • md5 hash
  • st_mtime : last modified time of the file content
  • st_nlink : number of hard links to the inode of the file
  • st_ctime : last change time of the file metadata; changes when a file is renamed
  • st_ino : inode number of the file
  • st_size : file size in bytes
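Capturing this snapshot at add time is a thin wrapper over `os.stat`. This is a sketch; the dict layout and the helper's name are assumptions.

```python
import hashlib
import os

def file_snapshot(path: str) -> dict:
    """Record the stat fields used later to detect deletes, edits and renames."""
    st = os.stat(path)
    with open(path, "rb") as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    return {
        "md5": md5,
        "st_mtime": st.st_mtime,  # content modification time
        "st_nlink": st.st_nlink,  # number of hard links to the inode
        "st_ctime": st.st_ctime,  # metadata change time (e.g. rename)
        "st_ino": st.st_ino,      # inode number
        "st_size": st.st_size,    # size in bytes
    }
```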


Detecting if a file was deleted

This is straightforward. Once the user adds the file, we create a hardlink to it inside the cache. Therefore, the `st_nlink` of the file should be at least 2. If it is less than 2, the file was deleted after adding. If we detect that the file was deleted, we ask the user to either re-add the file to the asset or remove it from the asset before the asset can be uploaded.


Detecting if the contents were modified

If a file has not been deleted, we need to check if it has been modified. We check if the st_mtime has changed. If it has, we check whether the st_size has changed. If st_size has changed, the file has been altered and we don't need to compute the md5 at all. We take this approach so that we can defer computing the md5 hash, which is expensive, until we must. If st_mtime has changed but the size is the same, we compute the md5 and compare it with the stored value.


Detecting if a file has been renamed

This is tricky. Any rename changes the `st_ctime`, but other things change the `st_ctime` as well. For example, the user might change the name and revert it, which also changes the `st_ctime`. So detecting renames involves a few workarounds. If a file has been neither deleted nor altered, we check whether it has been renamed. We first check if `st_ctime` has changed. If not, we can be sure it has not been renamed. If the `st_ctime` has changed, we find the object(s) that refer to the same file and compute the inode numbers of the paths they point to. If the inode numbers are the same, the link is valid and there was no rename. If the inode numbers differ, it's a different file with the same name: the linked file has been renamed, so we prompt the user to remove the old file.
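The three checks can be combined into a single classifier. This is a sketch under assumptions: it stats the cached hard-link for the delete check and the user's path for the modify/rename checks, which is one plausible arrangement of the checks described above, not the actual AMA code.

```python
import hashlib
import os

def detect_change(path: str, cached_path: str, snap: dict) -> str:
    """Classify what happened to a file since it was added to the asset."""
    if os.stat(cached_path).st_nlink < 2:
        return "deleted"              # hard-link count dropped: original removed
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "renamed"              # content is still cached, but the path is gone
    if st.st_ino != snap["st_ino"]:
        return "renamed"              # a different file now occupies the path
    if st.st_mtime != snap["st_mtime"]:
        if st.st_size != snap["st_size"]:
            return "modified"         # size differs: no need to hash
        with open(path, "rb") as f:   # same size: fall back to the md5 check
            if hashlib.md5(f.read()).hexdigest() != snap["md5"]:
                return "modified"
    return "unchanged"
```

The md5 is only computed in the one ambiguous case (mtime changed, size unchanged), which is exactly the deferral described above.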


Uploading the Asset

Check if the Asset can be uploaded.

In order to be eligible for upload, an Asset must have a designated storage location which it inherits from the asset collection it belongs to.

  • Check if the asset has a class_id; if not, request the id of the asset collection from the asset_registry
  • The request to asset_registry must include class_name
  • asset_registry receives the request, checks if an asset collection exists for the given name. If not, the asset_registry creates an asset_class.
  • client receives the class_id and top_hash from the asset_registry and updates the asset-manifest
  • Asset is now eligible for upload
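The eligibility check above can be sketched as follows. The `registry` callable stands in for the asset_registry request and is an assumption; in practice it would be a network call carrying the class_name.

```python
def ensure_uploadable(asset: dict, registry) -> dict:
    """If the asset lacks a class_id, fetch (or create) the collection id."""
    if asset.get("class_id") is None:
        # The registry resolves a class_name to (class_id, top_hash),
        # creating the asset_class server-side if it does not exist.
        class_id, top_hash = registry(asset["class_name"])
        asset["class_id"] = class_id
        asset["top_hash"] = top_hash
    return asset
```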

Removing files from an asset

cd your_dir
ama remove file

The following steps are performed.

flowchart TD A{{remove command}} --> B[Asset] B --> C{{Find objects matching file_path}} %% Object stack connected to Asset and Find objects J[Objects] -.- B J -.-> C C --> D[Objects] D --> E{{If multiple, remove last added Object}} E --> F{{Remove Object Reference from Content}} E --> G{{Link the previous Object to path, if any}} G --> F %% Single Contents connected to Asset, Remove Object Reference, and Remove file K[Contents] -.- B K -.-> F F --> K K --> I{{Remove file. If content-ref counter is 0, remove cached-file}}
  • Normalize the user-input path relative to the asset_repo dir
  • Find all objects that match the path: asset.objects.filter(lambda x: x.path == path)
  • Remove the object from the asset; if there are multiple matching objects, remove the last added one
  • Remove the object reference from the linked content; if the ref count of the content drops to 0, remove the cached file pointed to by the content
  • If multiple Objects were found earlier and we removed the last one, link the previous object to the file-path
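The removal steps above, sketched with a plain-dict asset. The field names (`objects`, `content_refs`) are assumptions for illustration.

```python
def remove_file(asset: dict, path: str) -> dict:
    """Remove the last-added object for a path and drop unreferenced content."""
    matches = [o for o in asset["objects"] if o["path"] == path]
    if not matches:
        raise ValueError(f"no object found for {path}")
    removed = matches[-1]                   # if multiple, remove the last added
    asset["objects"].remove(removed)
    refs = asset["content_refs"]
    refs[removed["content"]] -= 1
    if refs[removed["content"]] == 0:       # nothing points at this content
        del refs[removed["content"]]        # so its cached file can be removed
    return removed
```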

Listing Files in Asset

ama list

The following steps are performed.

flowchart LR A[list command] --> B[Asset] B --> C[list objects] C --> D[get linked contents] B -.- E[Objects] B -.- F[Contents] F -.-> D style A fill:#f9d54b,stroke:#333,polygon style C fill:#f9d54b,stroke:#333,polygon style D fill:#f9d54b,stroke:#333,polygon
  • Get the list of all objects in the asset: list(asset.objects)
  • Each object has a ref to its content, i.e. object.content; for each object, get the content_hash and content_type
  • List in table format
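A sketch of the table rendering. Column widths and the nested field names are assumptions.

```python
def list_objects(asset: dict) -> str:
    """Render one row per object: path, content hash and content type."""
    header = f"{'path':<32}{'content_hash':<28}{'content_type'}"
    rows = [f"{o['path']:<32}{o['content']['hash']:<28}{o['content']['mime_type']}"
            for o in asset["objects"]]
    return "\n".join([header] + rows)
```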

Uploading Asset

flowchart LR subgraph CLIENT [CLIENT] C1[Verify Asset-class] C2[Contents to Upload] C3[Upload to Staging] C4[Staged content - linked objects] C5[Commit asset] end subgraph SERVER [SERVER] S1[Update seq_id/ver_id] S2[Transfer from staging to remote] S3[Create records in db] S4[Write manifest.yml to bucket] S5[Responds to Client commit-hash] end C1 --> C2 C2 --> C3 C3 --> C4 C4 --> C5 C5 --> S1 S1 --> S2 S2 --> S3 S3 --> S4 S4 --> S5 SERVER -. Response from server .-> CLIENT %% Styling classDef clientBox fill:#8BC34A,stroke:#333,color:black classDef serverBox fill:#2196F3,stroke:#333,color:black classDef clientNode fill:#FFC107,stroke:#333,color:black,shape:diamond classDef serverNode fill:#FFC107,stroke:#333,color:black,shape:diamond class CLIENT clientBox class SERVER serverBox class C1,C2,C3,C4,C5 clientNode class S1,S2,S3,S4,S5 serverNode

Create Asset Collection

It's the asset collection that owns the storage location, so in order to be eligible for upload an asset must have a valid class_id.

  • Check if the asset has a class_id; if not, request the id of the asset collection from the asset_registry
  • The request to asset_registry must include class_name
  • asset_registry receives the request, checks if an asset collection exists for the given name. If not, the asset_registry creates an asset_class.
  • client receives the class_id and top_hash from the asset_registry and updates the asset-manifest
  • Asset is now eligible for upload
flowchart TD AssetClient["Ama"] --> RequestServer RequestServer{"Request Server"} -->|class_name| AssetServer["Ama-Server"] AssetServer --> CreateClass{"Create class record, if not exists"} CreateClass --> DB[("Database")] CreateClass -. class_id, top_hash .-> AssetClient DB -.- AssetServer %% Styling classDef client fill:white,stroke:#333,stroke-width:1px,color:black classDef server fill:white,stroke:#333,stroke-width:1px,color:black classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef database fill:#f5f5f5,stroke:#333,stroke-width:1px,color:black class AssetClient client class AssetServer server class RequestServer,CreateClass decision class DB database


Find Files to be Uploaded

Once we have the class_id and top_hash, the asset-contents can be uploaded to the staging area. We then query the asset for the list of files that need to be uploaded. To maintain data integrity, a file must meet the following conditions to be a candidate for upload.

  • added by the user
  • have not been deleted, modified or renamed
flowchart TD Content["Content"] --> File["File"] File --> AddedByUser["Added by user"] AddedByUser -->|No| Ignore["ignore"] AddedByUser -->|Yes| IsDeleted["Is Deleted"] IsDeleted -->|Yes| AskRemove["Ask user to remove. 'ama remove file'"] IsDeleted -->|No| IsModified["Is Modified"] IsModified -->|Yes| AskReAdd["Ask user to re-add. 'ama add file'"] IsModified -->|No| AddToUpload["Add to upload list"] AddToUpload --> UploadList["Upload list"] %% Styling classDef default fill:white,stroke:#333,stroke-width:1px,color:black classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef highlight fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef command fill:white,stroke:none,color:black classDef list fill:white,stroke:#333,stroke-width:1px,color:black class Content,File,Ignore,UploadList default class AddedByUser,IsDeleted,IsModified,AddToUpload decision class AskRemove,AskReAdd highlight class Command1,Command2 command class UploadList list

Since file sizes can be very large, AMA takes the approach of not copying files. Therefore, to meet the above requirements, when a file is added to an asset we compute and store metadata about the file:

st_mtime: timestamp (last modified time of the file content) 
st_nlink: int (# of hardlinks to the file)
st_ctime: timestamp (last change time of file metadata)
st_ino: int (inode number i.e. disk storage pointer for the file)

Before uploading, we recompute the file metadata and compare with the stored values to identify if a file has been altered by the user after adding it to asset.

Exclude content already available in cloud

In AMA, we use the md5 hash of the files for storage and indexing. It's possible that a user adds a file which already exists in the cloud and is indexed, in which case the file need not be uploaded again. Therefore, after determining the list of files to be uploaded, we check whether any of them already exist in remote storage.

If the file is already indexed, it will exist in the remote repo; if another user is in the process of uploading the file and it is yet to be indexed, it will exist in the staging area. Therefore, we check both the remote-repo and the staging-repos.
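The check can be sketched as a filter over the candidate contents. Here `remote_ids` is assumed to be the set of already-indexed content ids and `staged` a map of staged ids to their verified hashes; both names are assumptions.

```python
def filter_upload_list(candidates, remote_ids, staged):
    """Drop contents already in the remote repo or already staged (hash-verified)."""
    to_upload = []
    for content_id, local_hash in candidates:
        if content_id in remote_ids:
            continue                          # already indexed remotely
        if staged.get(content_id) == local_hash:
            continue                          # staged by another user, hash matches
        to_upload.append(content_id)
    return to_upload
```

Note the staged entry is only trusted when its hash matches: a half-uploaded staging file with a stale hash is still re-uploaded.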

graph TD Content[("Content")] CheckRemote["Check if exists"] RemoteRepo[("Remote repo")] --> CheckRemote CheckRemote -->|Yes| RemoveUpload["Remove from upload"] RemoveUpload -->Content CheckRemote --> CheckStaging["Check if exists + verify hash"] StagingRepo[("Staging repo")] --> CheckStaging CheckStaging -->|No| UploadList[("Upload list")] CheckStaging -->|Yes| RemoveUpload %% Styling classDef default fill:white,stroke:#333,stroke-width:1px,color:black classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef highlight fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef stacked fill:white,stroke:#333,stroke-width:1px,color:black class Content,RemoteRepo,StagingRepo,UploadList stacked class CheckRemote,CheckStaging,RemoveUpload decision

User access

Staging and remote buckets are separate by design to maintain data integrity. For staging buckets, individual users have write access, whereas for remote buckets only the asset-server has write access. Individual users have readonly access to the remote bucket.

graph TD AssetServer["ama-Server"] --> ReadWrite1["Read-Write"] ReadWrite1 --> RemoteRepo["Remote repo"] ReadWrite1 --> StagingRepo["Staging repo"] RemoteRepo --> ReadOnly["Read-only"] StagingRepo --> ReadWrite2["Read-Write"] ReadOnly --> User ReadWrite2 --> User %% Styling classDef server fill:white,stroke:#666,stroke-width:3px,color:black classDef access fill:white,stroke:#333,stroke-width:1px,color:black classDef repo fill:#f5f5f5,stroke:#333,stroke-width:1px,color:black classDef user fill:#66B2A0,stroke:none,color:black class AssetServer server class ReadWrite1,ReadOnly,ReadWrite2 access class RemoteRepo,StagingRepo repo class User user

Upload Files to Staging

graph TD Content[("Content")] --> StartUpload["Start Upload"] StartUpload --> StagingRepo[("Staging repo")] StartUpload --> Staging["Staging"] ChecksumValid["checksum-valid"] StagingRepo-- Finish Upload -->ChecksumValid -->|Yes| Staged["Staged"] Content --> Pending["Pending"] Pending --> Staging Staging --> Staged StartUpload --> ChecksumValid %% States subgraph subgraph States Pending Staging Staged end %% Styling classDef default fill:white,stroke:#333,stroke-width:1px,color:black classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef stacked fill:white,stroke:#333,stroke-width:1px,color:black classDef state fill:white,stroke:#333,stroke-width:1px,color:black class Content stacked class StagingRepo stacked class StartUpload,ChecksumValid decision class Pending,Staging,Staged state

We use asyncio for fast concurrent network calls. The staging area is namespaced to the top-hash of the asset-class.
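A sketch of the concurrent staging upload with asyncio. The upload call itself is a placeholder (the real code would perform a network transfer to the staging bucket); only the concurrency shape and the top_hash namespacing come from the text above.

```python
import asyncio

async def upload_one(content_id: str, prefix: str) -> str:
    # Placeholder for the real network call (e.g. an upload to a gcs bucket).
    await asyncio.sleep(0)
    return f"{prefix}/{content_id}"

async def upload_to_staging(content_ids, top_hash: str):
    # The staging area is namespaced by the asset-class top_hash;
    # asyncio.gather runs the uploads concurrently.
    prefix = f"staging/{top_hash}"
    return await asyncio.gather(*(upload_one(c, prefix) for c in content_ids))
```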

Commit Asset

After the files are successfully uploaded to the staging area, we identify the objects linked to those files. These objects can now be committed to the asset. The asset and objects are committed together since they are inter-linked. The asset-commit process is a handoff to the asset-server, which performs the following steps. The asset-server receives the request from the asset-client. The payload has the following format:
{
  "class_id": "00000000-0000-0000-0000-000000000001",
  "parent_asset_id": "00000000-0000-0000-0000-000000000011",
  "alias": "my_asset",
  "objects": [...],
  "patch": "..."
}


Ensuring Atomicity of the Asset-Commit process

It's possible that the previous commit was interrupted and didn't go through. A few possible reasons:

  • Network failure on the user side: content staging was successful, but a network error occurred at the user end during the commit process. In such a case, the asset-client will first try to complete the previous commit before initiating a new transaction. The server should expect to receive the same payload next time, since the transaction was not completed.

  • Network failure on the server side: similar to the above, but this time there was an error on the server side in either gcs access or database access

  • User Interruption: i.e. ctrl + c

To ensure atomicity, we need to make sure the previous transaction was committed successfully. To achieve this, we break the asset-commit flow into 2 stages.


Asset Commit flow - Stage 1

%% graph TD %% AssetClient["A"] --> AskForId["Ask for Id"] %% AmaServer["ama-Server"] %% AskForId -->|"1 class_name, parent_id"| CreateRecord %% CreateRecord -->|2| AmaServer %% CreateRecord -->|" 3 asset_id"| AskForId %% AskForId -->|4| SaveAssetId["Save asset_id"] %% SaveAssetId --> Asset["Asset"] %% class AssetClient client %% class AssetServer server %% class AskForId,CreateRecord,SaveAssetId decision %% class Asset asset flowchart TD AssetClient["Client"] --> AskForId["Ask for Id"] AssetServer["ama-Server"] CreateRecord["Create record if not exists"] AskForId -->|"1 class_name, parent_id"| CreateRecord CreateRecord -->|2| AssetServer CreateRecord -->|"3 asset_id"| AskForId AskForId -->|4| SaveAssetId["Save asset_id"] SaveAssetId --> Asset["Asset"] %% Styling classDef client fill:#FFC107,stroke:#333,stroke-width:2px,color:black classDef server fill:#FFC107,stroke:#333,stroke-width:2px,color:black classDef decision fill:#FFC107,stroke:#333,stroke-width:1px,color:black,shape:diamond classDef asset fill:white,stroke:#333,stroke-width:1px,color:black class AssetClient client class AssetServer server class AskForId,CreateRecord,SaveAssetId decision class Asset asset

In stage-1, the asset-client checks if the asset to be committed has an id; if not, it requests an id from the asset-server. The request payload contains the class_name and the parent_asset_id (if the asset inherits from another asset). The asset-server receives the request, creates a record in the asset table (with seq_id and version) and returns the asset_id to the client.


Asset Commit flow - Stage 2

In stage-2, the asset-client requests that the asset be committed. The commit process involves the following steps.

  • Transfer contents from staging area in to remote-repo, do checksum validation.
  • Create records in Content table, if not exist already
  • Create records in Objects table, if not exist already
  • Create records in Asset-Objects join table
  • Update asset-record with commit-hash
  • Respond to client with commit-hash

If the commit process is successful, the asset-server updates the asset record with a commit-hash and returns it to the client. Any time the asset-client initiates a commit, it first checks whether the existing asset has a commit-hash. If it finds one, the previous commit was successful, so the client follows the 2-stage process. If the previous commit was unsuccessful, the client initiates stage-2 again, i.e. commits the previous asset along with any new updates.
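The client-side logic can be sketched as follows. Here `server` is a stand-in for the ama-server API with hypothetical `create_record` and `commit` methods; the dict fields are assumptions as well.

```python
def commit_asset(asset: dict, server) -> dict:
    """Two-stage commit: obtain an id if needed, then commit (retryable)."""
    if asset.get("commit_hash"):
        # Previous commit succeeded: start a new version for the new changes.
        asset = {**asset, "parent_asset_id": asset["id"],
                 "id": None, "commit_hash": None}
    if asset.get("id") is None:
        # Stage 1: the server creates the asset record and returns its id.
        asset["id"] = server.create_record(asset["class_name"],
                                           asset.get("parent_asset_id"))
    # Stage 2: commit. If the previous attempt was interrupted, the asset
    # already has an id but no commit_hash, so we land directly here again.
    asset["commit_hash"] = server.commit(asset)
    return asset
```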


The previous transaction was interrupted and the user now wants to add more files to the asset

To verify, we check whether the previous record created by the same user with node_type and node_name has a commit_hash. If the commit_hash is missing, we presume the current updates are part of the same commit, and we try to recommit.


Downloading Asset


State Management

Adding references to an asset

You can add existing assets as a reference before committing a created asset. A typical flow would be

ama init <class_name>                      # creates asset
ama add refs --type input --asset <name>

It's important to note that references are allowed only between root nodes.


Storing an Asset

Asset Storage Structure in Bucket

flowchart LR Bucket["Bucket/"] --- assets["assets/"] Bucket --- contents["contents/"] assets --- asset_class["asset_class_id/"] asset_class --- asset_id["asset_id/"] asset_id --- yaml1["0.0.0.yaml"] asset_id --- yaml2["0.0.1.yaml"] asset_id --- yaml3["0.0.2.yaml"] asset_id --- objects["objects.yaml"] contents --- dots["..."] %% Styling classDef directory fill:white,stroke:#333,stroke-width:1px,color:black,rx:4,ry:4 classDef file fill:none,stroke:none,color:black class Bucket,assets,asset_class,asset_id,contents,dots directory class yaml1,yaml2,yaml3,objects file

assets is a directory inside the bucket. This directory holds a list of directories, which are the ids of all the asset collections.

asset_class_id is a directory inside assets. This directory holds a list of directories, each of which is the id of an asset in that class.

asset_id is a directory inside asset_class_id. This holds a list of files:

  • objects.yaml: holds the list of all objects that the asset refers to
  • asset.yaml: holds the root information of the asset, this is a yaml representation of the asset record from db
  • version.yaml: holds the changes relevant to that version, this is a yaml representation of the version record from db
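A small helper that mirrors the layout above. This is a sketch; the path conventions are read off the diagram and bullets, and the function name is an assumption.

```python
def asset_paths(bucket: str, class_id: str, asset_id: str, version: str) -> dict:
    """Build the bucket paths for an asset's files, per the layout above."""
    base = f"{bucket}/assets/{class_id}/{asset_id}"
    return {
        "objects": f"{base}/objects.yaml",    # all objects the asset refers to
        "asset":   f"{base}/asset.yaml",      # root information of the asset
        "version": f"{base}/{version}.yaml",  # per-version changes
    }
```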