Title: | Identifying Infection with Machine Intelligence |
---|---|
Description: | A novel machine learning method for plant virus diagnostics using genome sequencing data. The package includes three machine learning models (random forest, XGBoost, and elastic net) for training on and predicting from mapped genome samples. The algorithm incorporates mappability profiles and unreliable regions, and users can build a mappability profile from scratch with functions included in the package. Functions for plotting the coverage of mapped samples are also provided. |
Authors: | Haochen Ning [aut], Ian Boyes [aut], Ibrahim Numanagić [aut] |
Maintainer: | Xuekui Zhang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.2.1 |
Built: | 2025-01-31 06:12:24 UTC |
Source: | https://github.com/cran/iimi |
Converts one or more indexed and sorted BAM files (ending in *.sorted.bam and *.bai) into a list of run-length encodings (RLEs).
convert_bam_to_rle(bam_file, paired = FALSE)
bam_file |
path to BAM file(s). |
paired |
Indicates whether the sequencing reads are single-end or paired-end. |
A list of coverage profile(s) in RLE format with one or more samples.
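To make the RLE format concrete, the sketch below applies base R's `rle()` to a made-up per-position coverage vector. It does not call the package: the `coverage` vector and its values are illustrative assumptions, and the objects returned by `convert_bam_to_rle()` may be of a different RLE class than base R's.

```r
# Hypothetical per-position read depth along a 12-nt virus segment
coverage <- c(0, 0, 0, 5, 5, 5, 5, 2, 2, 0, 0, 0)

# Run-length encoding stores each run of equal values only once
cov_rle <- rle(coverage)
cov_rle$lengths  # length of each run
cov_rle$values   # read depth within each run

# The encoding is lossless: decoding restores the original vector
restored <- inverse.rle(cov_rle)
identical(restored, coverage)
```

Because coverage along a genome tends to contain long runs of identical depth (especially zeros), this representation is usually far more compact than the full vector.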
## Not run:
## Please change the path to the folder where you
## store sorted and indexed BAM files of mapped samples
rles <- convert_bam_to_rle("path/to/bam/file")
## End(Not run)
Converts a list of run-length encodings (RLEs) into a data frame with 16 features after mappability profiling and nucleotide filtering.
convert_rle_to_df( covs, unreliable_region_version = "1_4_0", unreliable_region_enabled = TRUE, additional_nucleotide_info = data.frame() )
covs |
A list of coverage profile(s) in RLE format. Can be one or more samples. |
unreliable_region_version |
The version number (character string) of unreliable regions of the virus segments.
Default is |
unreliable_region_enabled |
Default is |
additional_nucleotide_info |
Additional nucleotide information for virus
segments that are not included in |
Converts a list of run-length encodings (RLEs) into a data frame.
After mappability profiling and nucleotide filtering, the returned data frame contains 16 features for training a machine learning model.
A data frame object that contains the mapping result for each virus segment that the plant sample reads are aligned to, and an RLE list of coverage information.
## Not run:
df <- convert_rle_to_df(example_cov)
## End(Not run)
Creates a data frame of the start and end positions of the regions that are considered high in A% and GC%.
create_high_nucleotide_content(gc = 0.6, a = 0.45, window = 75, virus_info)
gc |
The threshold for GC content. It is the proportion of G and C nucleotides in a sliding window. Default is 0.6. |
a |
The threshold for A nucleotide. It is the proportion of A nucleotide in a sliding window. Default is 0.45. |
window |
The sliding window size of your choice. Default is 75. |
virus_info |
A DNAStringSet of virus segments. The format should be similar to |
A data frame of the start and end positions of the regions that are considered high in A% and GC%.
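The sliding-window thresholds described above can be sketched in base R on a toy sequence. This is only an illustration under stated assumptions: the sequence, the small window of 5, and the per-window output are made up for the example, and the package itself takes a DNAStringSet and may merge or report regions differently.

```r
# Toy sequence; a real input would be a DNAStringSet of virus segments
seq_chr <- strsplit("ATGCGCGCGCAAAAAAAATT", "")[[1]]
window <- 5      # small window for the toy example (the documented default is 75)
gc_cut <- 0.6    # GC threshold, matching the documented default
a_cut  <- 0.45   # A threshold, matching the documented default

# One window per possible start position
starts <- seq_len(length(seq_chr) - window + 1)

# Proportion of G/C and of A in the window starting at position i
gc_prop <- vapply(starts, function(i) {
  mean(seq_chr[i:(i + window - 1)] %in% c("G", "C"))
}, numeric(1))
a_prop <- vapply(starts, function(i) {
  mean(seq_chr[i:(i + window - 1)] == "A")
}, numeric(1))

# Flag windows whose proportion exceeds the threshold
windows <- data.frame(start = starts, end = starts + window - 1,
                      high_gc = gc_prop > gc_cut, high_a = a_prop > a_cut)
windows[windows$high_gc | windows$high_a, ]
```

In this toy sequence the GC-rich stretch near the start and the poly-A stretch near the end are the windows that get flagged.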
## Not run:
high_nucleotides_regions <- create_high_nucleotide_content()
## End(Not run)
Creates a data frame of start and end positions of the regions that are considered unmappable. A region is unmappable if it can also be mapped to another virus segment or to a host genome. Note that Arabidopsis thaliana is currently the only host.
create_mappability_profile( path_to_bam_files, category, window = 75, virus_info )
path_to_bam_files |
Path to the folder that stores the indexed and
sorted BAM file(s) (ending in |
category |
Type of unreliable region you are creating. You can use
categories in the provided |
window |
The sliding window size of your choice. Default is 75. |
virus_info |
A DNAStringSet of virus segments. The format should be similar to |
A data frame of start and end positions of the regions that are considered unmappable.
## Not run:
## Please change the path to the folder where you store the mapped viruses
mappability_profile <- create_mappability_profile("path/to/folder", category = "Unmappable regions")
## End(Not run)
A list of coverage profiles for three plant samples. This is only a toy sample that you can use to run the examples in the vignette. We recommend training the model on more data; the more, the better.
example_cov
A list of 3 run-length encoding (RLE) lists for 3 plant samples. Each RLE list holds one RLE vector per virus segment.
A matrix containing the known truth about the diagnostic results (using virus database version 1.4.0) for each plant sample in the example data. It records whether each sample is infected with each virus segment. Each column is a sample, and each row gives a virus segment's diagnostic status across the three samples.
example_diag
A matrix with 3 columns:
Sample one
Sample two
Sample three
A data set containing the GC content and other information about the virus segments from the official Virtool virus database (version 1.4.0). The variables are as follows:
nucleotide_info
A data frame with 7 variables:
The virus name
The virus isolate ID
The virus segment ID
The percentage of A nucleotides in the virus segment
The percentage of C nucleotides in the virus segment
The percentage of T nucleotides in the virus segment
The percentage of G and C nucleotides in the virus segment (GC content)
The length of the virus segment
The version number of the virus database
Plots the coverage profile of the mapped plant sample.
plot_cov( covs, legend_status = TRUE, nucleotide_status = TRUE, window = 75, nucleotide_info_version = "1_4_0", virus_info, ... )
covs |
An RLE list of coverage information of one or more plant samples. |
legend_status |
Whether to display the legend. Default is |
nucleotide_status |
Whether to display a sliding window of A percentage and
GC content. Default is |
window |
The sliding window size. Default is 75. |
nucleotide_info_version |
The version number (character string) of the
nucleotide information of the virus segments. Default is |
virus_info |
A DNAStringSet of virus segments. The format should be similar to |
... |
Other arguments that can be passed to |
The coverage profile of the mapped plant sample.
plot_cov(example_cov$S1)
Uses a machine learning model to predict the infection status of the plant sample(s). Users can supply their own model if needed.
predict_iimi(newdata, method = "xgb", trained_model, report_virus_level = TRUE)
newdata |
A matrix or data frame that contains the features extracted
from the coverage profile using |
method |
The machine learning method of choice, |
trained_model |
The trained model. If not provided, default model is used. |
report_virus_level |
If |
A data frame of diagnostic results for each sample.
## Not run:
df <- convert_rle_to_df(example_cov)
predictions <- predict_iimi(df)
## End(Not run)
Trains an XGBoost (default), Random Forest, or Elastic Net model using user-provided data.
train_iimi( train_x, train_y, method = "xgb", nrounds = 100, min_child_weight = 10, gamma = 20, ntree = 200, mtry = 10, k = 5, ... )
train_x |
A data frame or a matrix of predictors. |
train_y |
A response vector of labels (needs to be a factor). |
method |
The machine learning method of choice, |
nrounds |
Max number of boosting iterations for |
min_child_weight |
Default is 10. |
gamma |
Minimum loss reduction required in |
ntree |
Number of trees in |
mtry |
Default is 10. |
k |
Number of folds. Default is 5. |
... |
Other arguments that can be passed to |
A Random Forest, XGBoost, or Elastic Net model.
## Not run:
df <- convert_rle_to_df(example_cov)
train_x <- df[, -c(1:4)]
train_y <- c()
for (ii in seq_len(nrow(df))) {
  seg_id <- df$seg_id[ii]
  sample_id <- df$sample_id[ii]
  train_y <- c(train_y, example_diag[seg_id, sample_id])
}
trained_model <- train_iimi(train_x = train_x, train_y = train_y)
## End(Not run)
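The label lookup in the example can also be written without a loop by indexing the truth matrix with a two-column character matrix. The sketch below uses toy stand-ins (`diag_toy`, `df_toy`, and their IDs are hypothetical, not package data) and converts the result to a factor, since the `train_y` argument is documented as needing to be a factor.

```r
# Toy truth matrix: rows are segment IDs, columns are sample IDs
diag_toy <- matrix(c(TRUE, FALSE, FALSE, TRUE), nrow = 2,
                   dimnames = list(c("segA", "segB"), c("S1", "S2")))

# Toy feature table with one row per (sample, segment) pair
df_toy <- data.frame(seg_id = c("segA", "segB", "segA"),
                     sample_id = c("S1", "S1", "S2"),
                     stringsAsFactors = FALSE)

# A two-column character matrix selects one cell per row,
# matching the first column to row names and the second to column names
labels <- diag_toy[cbind(df_toy$seg_id, df_toy$sample_id)]

# Convert the logical labels to a factor for training
train_y_toy <- factor(labels)
```

This avoids growing a vector inside a loop, which matters little for three samples but can be noticeably faster on larger training sets.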
A trained model using the default Elastic Net settings
trained_en
An object of class cv.glmnet of length 13.
A trained model using the default Random Forest settings
trained_rf
An object of class randomForest.formula (inherits from randomForest) of length 19.
A trained model using the default XGBoost settings
trained_xgb
An object of class raw of length 130645.
A data frame of unmappable regions and of regions where GC content exceeds 60% or A content exceeds 45% for the virus segments. Note that a virus segment without any unreliable regions does not appear in this data frame.
unreliable_regions
A data frame of unreliable regions in the run-length encoding format for virus segments.
The start position of the region that is considered unreliable
The end position of the region that is considered unreliable
The virus segment ID
The category that this unreliable region belongs to, which is one of: Unmappable regions (host), Unmappable regions (virus), CG% > 60%, A% > 45%
The version number of the virus database
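As an illustration of how start and end positions like these might be applied, the sketch below masks the flagged positions of a coverage vector before summarizing it. This is an assumption-laden toy: the `regions` and `coverage` values are invented, and how the package actually uses unreliable regions internally is not described here.

```r
# Hypothetical unreliable regions for one segment (1-based, inclusive)
regions <- data.frame(start = c(3, 9), end = c(5, 10))

# Hypothetical per-position coverage for the same 12-nt segment
coverage <- c(4, 4, 9, 9, 9, 3, 3, 3, 7, 7, 2, 2)

# Mark every position that falls inside any unreliable region
unreliable <- rep(FALSE, length(coverage))
for (i in seq_len(nrow(regions))) {
  unreliable[regions$start[i]:regions$end[i]] <- TRUE
}

# Summarize coverage over the reliable positions only
reliable_cov <- coverage[!unreliable]
mean(reliable_cov)
```

Excluding these positions keeps reads that could equally have come from a host genome or another virus from inflating the summary.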