ExportToDatabase output file location error


#1

Hello,

Here is my pipeline. I am trying to export database files to a folder location specified through extracted metadata, but it keeps exporting to the folder containing my pipeline. Has anyone else run into this problem?

This option works fine in ExportToSpreadsheet.

Thanks, Lee

170207.fura2.analysis.sqlite.cpproj (1.0 MB)


#2

Your pipeline is set to export the database to FileLocation, but the piece of metadata you extracted is called folder_location, so I suspect that’s your issue. Can you try switching that and seeing if it helps?


#3

Hey Beth. I’ve tried both, and they do the same thing.


#4

Hmm, two things to try then:
-Does it behave as expected if there aren’t spaces in the path of the folder you’re trying to write to? That can sometimes cause issues; in general, it’s better to use _ rather than spaces.
-If, rather than trying to set the output location from metadata, you explicitly set your Default Output Folder and tell ExportToDatabase to write to the Default Output Folder, does it behave as expected?


#5

I tried the first suggestion, and it still saves to the wrong location.

The second works, but I am working with groups of images, so I need the files to save to specific locations.


#6

ExportToDatabase won’t make multiple databases from one pipeline; it’s only going to make a single database. If you want an easy way to make a database for each group, though, you may want to consider the ‘Run multiple pipelines’ option: rather than running your pipeline on many folders simultaneously, you can run it on each folder sequentially (without having to sit there and queue up each one), and you can specify the correct input/output folder for each batch.

Nevertheless, it’s definitely confusing that ExportToDatabase allows you to specify a location by metadata but doesn’t respect that location; I’ll file a bug report on that issue.


#7

Thank you for reporting the bug.

The next bug I am worried about is with the Run Multiple Pipelines option: since I am so dependent on the metadata for my pipeline, I have to click ‘Update metadata’ under the metadata extraction method ‘Extract from image file headers’, or else I get an error.

The other thing I am worried about is the overwriting that will happen. I am fine with it writing to a specified location, but there isn’t an option for me to name the file from the metadata extraction method, right?


#8

I have to click ‘Update metadata’ under the metadata extraction method ‘Extract from image file headers’, or else I get an error.

I’m not sure precisely what will happen with this; I don’t have a suitable pipeline on hand to try it. I would hope that CP would be smart enough to just know to do the extraction when being run by ‘Run multiple pipelines’, but maybe not. I’d be interested in the answer!

If it’s not smart enough to do that, the workaround would be to open your images in your pipeline as it currently exists; once you have all the metadata extracted, use the ‘Export -> Image set listing’ option to export a CSV of all of your image names with their corresponding metadata, then create a duplicate of your pipeline that uses ‘LoadData’ and point it at that CSV. Inelegant, but it would get the job done and allow you to batch process your stuff downstream.
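
For reference, the exported image set listing is just a plain CSV that LoadData can read back in. A hypothetical single-channel version might look roughly like this (your actual column names will depend on your channel and metadata names):

Image_FileName_Fura2,Image_PathName_Fura2,Metadata_folder_location
img_t001.tif,/data/experiment1,/data/experiment1/output
img_t002.tif,/data/experiment1,/data/experiment1/output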

The other thing I am worried about is the overwriting that will happen. I am fine with it writing to a specified location, but there isn’t an option for me to name the file from the metadata extraction method, right?

You can’t specify the name of the file, but you can specify the folder that it goes into (since ‘Run multiple pipelines’ allows you to specify the Default Output Folder), so there shouldn’t be any overwriting as long as each batch is written to a different folder.


#9

Okay,

I am attempting Run Multiple Pipelines. Each pipeline is now set up so that if I open it and click Analyze Images, it proceeds without qualms.

But when I fill out the dialog as follows, in a blank CellProfiler session, this is what I am getting.

What am I missing here?


#10

You have to use pipelines, not project files; there’s documentation in Help->Other Features->Running Multiple Pipelines.


#11

Okay, I saved both as pipelines and did the same as above; same error.

I tried the LoadData module and it works… Well, it says it completes, but I have no clue where it is saving the file…

But, man… This is a major pain!

I really like the ExportToSpreadsheet module (I am still hoping I can export something larger than 3000 objects), and I was hoping ExportToDatabase was going to be just as simple.

I guess I would like to know which database structure is going to save the most space for me, since I will most likely be making some very large databases with multiple experiments.

Lee


#12

I tried the LoadData module and it works… Well, it says it completes, but I have no clue where it is saving the file…

Assuming you have your ExportToDatabase module set to export to the Default Output Folder, it should be wherever you specified your Default Output Folder to be.

I guess I would like to know which database structure is going to save the most space for me, since I will most likely be making some very large databases with multiple experiments.

If it’s truly going to grow to that scale, it might be worth working with your university/company IT department to set up a MySQL database or something like that, and/or looking into whether there’s a cluster you can access and install CP on.

My honest opinion is that your best option is to write a single database per ‘batch’ (whatever that ends up being for you) and break the data down into groups later. With this many ROIs and this many images, you presumably are doing some sort of scripting downstream to meta-analyze the data, and compared to that, it should be little additional effort to stratify the data by plate/well/etc.
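
As an entirely hypothetical sketch of that downstream step: assuming the default SQLite export (which writes its measurements to Per_Image and Per_Object tables), stratifying by a metadata column could look something like this in Python:

import sqlite3

# File, table, and column names here are placeholders; match them to
# whatever your ExportToDatabase settings actually produce.
conn = sqlite3.connect('DefaultDB.db')
rows = conn.execute(
    'SELECT * FROM Per_Image WHERE Image_Metadata_folder_location = ?',
    ('/data/experiment1',),
).fetchall()
conn.close()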


#13

Thanks for your help, Beth!

I don’t think an issue needs to be raised on this, unless you add the functionality to save multiple databases per group to ExportToDatabase.

It would actually make sense to remove this option if the module is going to remain the same.


#14

The more I think about it, the more I really want databases to be produced per group. I can merge the databases later.
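
Merging seems scriptable, at least; a rough sketch of what I have in mind (the file names are placeholders, and I’m assuming the default Per_Image/Per_Object tables with identical schemas across groups):

import shutil, sqlite3

group_dbs = ['group1.db', 'group2.db', 'group3.db']  # placeholder file names

# Seed the merged file with the first group's database, then append the rest.
shutil.copy(group_dbs[0], 'merged.db')
conn = sqlite3.connect('merged.db')
for path in group_dbs[1:]:
    conn.execute('ATTACH DATABASE ? AS src', (path,))
    # Assumes no ImageNumber collisions between groups
    # (renumber before inserting if there are any).
    conn.execute('INSERT INTO Per_Image SELECT * FROM src.Per_Image')
    conn.execute('INSERT INTO Per_Object SELECT * FROM src.Per_Object')
    conn.commit()
    conn.execute('DETACH DATABASE src')
conn.close()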

I am also going to need to limit the measurements in this database export (like in ExportToSpreadsheet), simply because functional imaging is going to take up a lot of space, especially as I capture more cells at greater detail.

Because of the issues I am having in ExportToSpreadsheet, I am limiting myself to a single value/view and a reduced number of objects. If those two modules had a baby, I would be very, very happy.

Thank you for your consideration.


#15

The more I think about it, the more I really want databases to be produced per group.

You’re welcome to put in a formal request to our software engineers on our GitHub, but I discussed the issue with one of them last week after the initial conversation here, and for technical reasons I don’t think it’s very likely.


Given the large number and size of your data sets and the fact that you want the output grouped, I think you’re down to options that truly run one group at a time: running each group one at a time from the GUI, or working on some sort of cluster solution so that each group is processed truly in parallel. If your institution has a good IT department with someone who can help you get CP installed on a cluster, I think that’d be a good option. If not, we recently released a tool to run large data sets on clusters on Amazon Web Services. It involves some use of the command line, but given the sizes of your data sets, I’m assuming you or someone in your lab is reasonably comfortable with scripting and can get underway without too much difficulty.

Let me know if I can be of any other help.


#16

Thanks for the reply, Beth.

I love to program, so I am not afraid to delve into some of these solutions (I’m excited to start learning Python). That is why I am a bit confused as to why groups cannot be run separately. I figured ‘group’ would mean running the list of groups in a for loop, which is how I prepare each experiment in post-processing.

Are there other options? I’m hoping for something simple that I can script. I would like to note that I really like the input modules, since they produce groups very easily for me, and I can see myself pushing these groups through the output modules in a for-loop-like fashion. The LIMS solution and the cloud solution are a bit over my head right now, and we don’t have a very robust IT department.

I should note that I am a novice when it comes to computing, so what I think is much different from what I know.

Thanks, Lee


#17

Group in CellProfiler does mean that every group is treated separately, but CellProfiler doesn’t stop and create the output CSVs at the end of each group; it either writes the info for all groups at the end (if you’re using ExportToSpreadsheet) or writes it for each image as it goes along (in ExportToDatabase). There’s no in-between behavior from the GUI.

On the command line, you can run one group at a time. Here’s a simplified version of a CellProfiler command I ran today:
cellprofiler -c -r -b \
  -p path_to_pipeline/segment_tracking_masking_centAWS.cppipe \
  -i inputpath/input/ \
  -o outputpath/Movie1-201 \
  -d /outputpath/Movie1-201/cp.is.done \
  --data-file=path_to_csv/load_data_csv_Movie1.csv \
  -g Metadata_MovieName=Movie1,Metadata_Timepoint=201
This command ran my pipeline on only Movie1, Timepoint 201 (even though the CSV had the images for the whole of Movie1 in it). While it uses LoadData rather than the first 4 input modules, you can make the CSV (as I did for this example) by first using the 4 input modules in a separate pipeline, telling CellProfiler to export a CSV with all your channels, metadata, groupings, etc. configured, then just using that CSV in LoadData in the main analysis pipeline that you call from the command line.

So, with all that said, you could definitely write yourself a script that iterates over your metadata and calls CellProfiler on the command line a certain number of times and waits for the output. It will likely be slower than running it from the GUI, since running from the command line only utilizes one CPU at a time (whereas the GUI knows to distribute the work across as many CPUs as you allow it), but it’s reasonably straightforward.
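
A minimal sketch of such a driver script in Python, assuming placeholder pipeline/CSV paths and group names, and using a subset of the flags from the command above:

import subprocess

movies = ['Movie1', 'Movie2', 'Movie3']  # placeholder group values from your metadata

for movie in movies:
    # Run CellProfiler headless on one group and wait for it to finish
    # before starting the next.
    subprocess.check_call([
        'cellprofiler', '-c', '-r',
        '-p', 'path_to_pipeline/analysis.cppipe',
        '-o', 'outputpath/%s' % movie,
        '--data-file=path_to_csv/load_data.csv',
        '-g', 'Metadata_MovieName=%s' % movie,
    ])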

I’m likely biased because I helped a lot with the cloud-based solution, but if you’re at the level where you can write a script to call CP from the command line, you’re likely at a level where you CAN use that too if you want. I definitely don’t come from a CS background AT ALL, nor does anyone on the biology sub-team I work on at CP, and so we’ve very intentionally written the tool and its wiki so it’s aimed at the ‘biologist who has too many images to process but doesn’t necessarily know much about computers’. By all means try doing it locally first, but if that becomes onerous, that’s my suggestion for a backup plan.