SMP Release #30
Hi Ryan, I'm getting the compile errors below:

```
Error: Unclassifiable statement at (1)
Warning: Nonconforming tab character at (1)
Warning: Nonconforming tab character at (1)
    surf_rough = soil_rough * (1.0 - snowfac_can)
Error: Unclassifiable statement at (1)
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Unclassifiable statement at (1)
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: 'cdrag' at (1) is not a variable
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: Unexpected STATEMENT FUNCTION statement at (1)
Error: Expecting END SUBROUTINE statement at (1)
Error: Unclassifiable statement at (1)
Error: Unclassifiable statement at (1)
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
Error: Statement function at (1) is recursive
```
Thanks Afshin, looking into this now.
Thanks Ryan. I did some research on "Error: Unclassifiable statement at (1)" [...]
Is it possible that your version of Fortran does not like the associate statement? This is a new type of statement we have not had in the code as of yet. I think this might be part of a more recent Fortran standard. I only put it in there because it helps with readability, but it might be problematic when it comes to portability.
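For anyone who hasn't seen it before, `associate` is a Fortran 2003 construct that creates short aliases for longer derived-type references. Here is a minimal, self-contained sketch of the construct and its portable pre-2003 expansion; the variable names and constants are hypothetical, not the actual ED2 code:

```fortran
! Minimal sketch of the Fortran 2003 ASSOCIATE construct versus the
! equivalent fully spelled-out form (hypothetical names, not ED2 code).
program associate_demo
   implicit none
   type :: cohort_t
      real :: lai      = 2.5    ! leaf area index (placeholder value)
      real :: veg_temp = 290.0  ! vegetation temperature [K] (placeholder)
   end type cohort_t
   type(cohort_t) :: co
   real :: energy

   ! Fortran 2003 form: short aliases improve readability, but compilers
   ! without F2003 support reject the whole block.
   associate (lai => co%lai, tveg => co%veg_temp)
      energy = lai * 3103.0 * tveg   ! 3103.0 is a placeholder heat capacity
   end associate
   print *, 'with associate:    ', energy

   ! Equivalent pre-2003 form: use the full references directly.
   energy = co%lai * 3103.0 * co%veg_temp
   print *, 'without associate: ', energy
end program associate_demo
```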
I thought that might be the case, but surprisingly, when I pull the SMP in to [...]
Maybe different compile flags in your branch? Regarding the crash with the hybrid, I think this would be a good new [...]
Hi Ryan, but now I get a new error (see below). Any idea?

```
Error: Dummy argument 'cgrid' of procedure 'soil_default_fill' at (1) has [...]
Error: Dummy argument 'cgrid' of procedure 'print_soil_info' at (1) has an [...]
Fatal Error: Cannot read module file 'hdf5.mod' opened at (1), because it [...]
```

Thanks,
I went ahead and removed the associate statements, replacing the aliases with their original variables. This change should make the code compliant with your original compilers. Perhaps we can discuss as a community during our next get-together whether we want to embrace the more recent Fortran standards for future releases.
Thanks Ryan. It is working now.
I just tried running the SMP version with the PalEON stuff and 5 out of 6 runs have crashed between 15 and 30 years into the simulations. The error I'm getting is pasted below. Sometimes the top function is [...]mmean_vars instead of dmean, but the rest is the same. I haven't tried digging into it yet and figured I'd ask you to see if you know what's going on first. I'm running things with the hybrid integrator and the new CBR_SCHEME = 0.

```
Program received signal 8 (SIGFPE): Floating-point exception.
Backtrace for this error:
```
Quick update @rgknox: I don't know if this helps at all, but all 6 models have now crashed with the SIGFPE error. 4 of the 6 reference a `* frqsum_o_daysec` line in average_utils. The other 2 are `* ndaysi`. Thoughts?
I can confirm similar problems, although I can get stable results using my [...]
Interesting that you mention snowfac... I just had my non-SMP ED (with the CBR changes) crash (SIGFPE error 8), with it tracing back to snowfac in the radiate driver (line 757). That's the first time in about 2,000 years of runs with the normal version, but maybe I'm the one to blame... (sorry!)
Sorry to flood the comments, but it looks like the error being tied to snowfac is likely. All of the SMP errors were tied to par_level variables. At least this time it doesn't seem to be a snow issue, as most of my errors are being thrown in non-winter months.
@crollinson did you change some part of the code in radiate_driver as part of your snow fix?
@apourmok Nope. I steered clear of that one.
@crollinson I remember seeing problems with frqsum_o_daysec and ndaysi, and I think it was related to -Q- files (or -Q- files turned off) that would cause division by 0. I thought we had fixed it, but maybe we didn't fix everything... Could you share the ED2IN that caused the problem so I can check the configurations that created it? Did the problem occur right at the beginning, or at the beginning of a new month?
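For context, here is a minimal sketch of the kind of division-by-zero failure described above and the guard that avoids it. The names (frqsum, norm_factor) and the logic are hypothetical stand-ins, not the actual average_utils code:

```fortran
! Hypothetical sketch of the failure mode: if the averaging interval is
! zero (e.g. because a given output type is switched off), the
! normalisation factor becomes Inf and later multiplications raise SIGFPE.
program normalisation_guard_demo
   implicit none
   real :: frqsum            ! accumulation interval in seconds (hypothetical)
   real :: norm_factor       ! factor that turns accumulated sums into means
   real :: accumulated_flux, mean_flux

   accumulated_flux = 1.0e3
   frqsum = 0.0              ! interval left at zero when the file type is off

   if (frqsum > 0.0) then
      norm_factor = 1.0 / frqsum
      mean_flux   = accumulated_flux * norm_factor
   else
      mean_flux   = 0.0      ! skip normalisation; there is nothing to average
   end if
   print *, 'mean_flux = ', mean_flux
end program normalisation_guard_demo
```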
@mpaiao It didn't crash right at the beginning. On the BU server there's a lag in what gets written to the out file, so I'm not exactly sure if it crashed at the beginning of a new month. Q files are turned off. My ED2IN files as well as the crash logs can be found in one of my github repositories: https://github.com/crollinson/ED_Processing/tree/master/spin_finish_smp Keep in mind that the last date in the log is not necessarily the date of the crash. Restarting from a history file gets me past the crash point, so maybe it's at least partially a problem with an uninitialized variable?
@crollinson It seems the crash is always happening when it's integrating these par_level_diffu/par_level_diffd variables. The radiation code has some substantial differences from the version I updated last time (there used to be par_level_diff only), but I checked the usual places where variables should be initialised and nothing stood out. Maybe the value is becoming too large and eventually overflows when average_utils accumulates it over the month? It may be worth checking the values of these variables in the -E- files the code generated before it crashed; I think they should always be between 0 and 1, unless their definition has changed.
I was able to remove the crash by reverting to the previous %snowfac formulation. The trouble may specifically involve line 489 in rk4_derivs.f90:

```
avg_th_cond & [...]
```

It was when I changed this line back to the original that things started working again.
Thanks @rgknox. That's unfortunate, but not surprising, that that's where the problem is coming from. I'll take a closer look this afternoon, but the way the soil-snow-air interactions were handled was causing major problems in the northeast. I think I'd tried reverting this spot back to the original and it was one of the key spots that made snow break. I'll admit, though, I got turned around as to where the problem was coming from. I can spend some time on it probably tomorrow morning (maybe this afternoon), but maybe @mpaiao could take a look, double check the places I've changed, and argue the case for reverting them?
@rgknox I just tried setting this line back to the old version where h_flux_g(mzg+1) is scaled by snowfac and it made things worse, not better, in my branch. @mpaiao this is a spot where I could follow your logic and it makes sense, but it causes really weird fluxes in my snow layers, and getting rid of snowfac in that statement made the hflux in the first snow layer sensible.
@crollinson this is rarely used in the tropics, just as short-lived puddles, so if removing it improves results in snowy areas, then I'm totally fine with getting rid of snowfac. I don't think it violates any energy conservation either, which would be my only concern.
I've done a couple more tests with things that have made snow more stable in the past, and I really don't think the problem is rooted in snowfac. In the snowfac tweaking, I've gotten a ton of other SIGFPE errors (with no change in frequency), and a couple were not in average_utils. Every time it ties back to a line group with par_level_diffu, par_level_diffd, or par_level_beam. I'm not sure when/why these came into the mainline, but they weren't in the version I was using for the CBR fixes, so I'm having a hard time tracking down what's going on with them. Any thoughts?
And yet another update: I was going through all of the output printed during compiling and came up with 5 flags for potentially uninitialized variables. I haven't tracked each of them down thoroughly and would appreciate it if anybody who knows about these sections could chime in. The flags are (in order of what I think are potential breaking points):

1) rk4_misc.f90: In function 'adjust_sfcw_properties': [...]
I've tracked down the source of the depth_available uninitialization (#1 above). It actually dates back to @mpaiao in Jan 2012 (Jan 5). I've tried to adapt things to how things work now based on my best guess of what was going on in the version before that commit. What I have now is: [...] I'm currently going to let energy_available be overwritten by what is currently in the code (energy_available = wmass_available * energy_needed / wmass_needed). @mpaiao, since you're the one that made this change, could you double check the new depth_avail initialization and see if it makes sense? The old version was: [...]
While the uninitialized variables still need to be sorted out, fixing the snow_depth issue alone has not fixed the SMP crashes. Everything continues to point back to whatever was done to introduce par_level_diffu/par_level_diffd.
I will look into that; those are my diagnostics [...]
Christy, what radiation scheme are you using? It will help me track this down. ICANRAD = ?
I'm running icanrad=0. All of my ED2INs with my settings can be found in one of my github repos: https://github.com/crollinson/ED_Processing/tree/master/spin_finish_smp
I am having trouble reproducing the errors regarding the par_level variables; is there any crash report info you could provide, like tracebacks etc.?
I am thinking we should really deprecate ICANRAD=0 anyway; Marcos and I have both gone through the code and theory for two-stream with a fine-tooth comb, and we get more sensible and consistent answers with the updated two-stream (ICANRAD=2).
There are examples of the first couple of crashes in the folder I linked to above. Ryan, I'll start an ICANRAD=2 run now. If you think it might be a problem with my starting conditions, I just uploaded my .pss & .css files that you could try running: https://github.com/crollinson/ED_Processing/tree/master/phase1a_spinup.v2 These were created from an SAS solution after 150 years of a non-SMP run with disturbance off (those ED2INs are also on my github), which shouldn't affect how things run from an initialization, but I suppose it's possible.
I found and fixed a potential bug that may be your problem with par_level [...]
Fantastic! Thanks Ryan!
For some reason I could not branch off your master, so I put the changes into the mainline. Could you try merging the changes into your local branch, Christy?
I went back to archives from 2012 and then 2011, when the Heun integrator implementation was in its infancy, and never found any instance where the local variable "combh" was initialized before it was used. I personally have no experience using the Heun integrator, and unless someone is invested in this option, I propose we just disable it until that person steps up.
par_level fixes per @rgknox, issue EDmodel#30 (git ED2 master pull request EDmodel#38)
Thanks Ryan! I was able to pull the mainline into my branch and have things running now. I'm about 15 years in at 3 sites and so far so good. I'll let you know how things turn out.
OK, I will pull a clone and test as well.
@rgknox SMP is still a no-go for me. Currently ICANRAD = 1; one run only got 4 years, with the backtrace as follows:

```
Program received signal 8 (SIGFPE): Floating-point exception.
Backtrace for this error:
```

Another made it 30 years, but still got SIGFPE fails with the par_level vars:

```
Program received signal 8 (SIGFPE): Floating-point exception.
Backtrace for this error:
```

Is there maybe some sort of min/max bound to keep the number from getting too small, or something of the sort?
The other items: [...]
Cohort fusion is not including the qmean, mmean and dmean averages of the [...]
@rgknox I have an idea on why it's bonking and why it's a stochastic thing. I keep coming back to lines like this (889-899 in multiple_scatter): [...] Could it have to do with something being off with i+1 or i-1? It looks like swd & swu are okay, but changing the numbers of those and not initializing the values right would explain the random nature of the crashes I'm seeing.
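To illustrate the hypothesis, here is a minimal sketch of why an i+1/i-1 style sweep over layers can crash stochastically when boundary values are left uninitialized. The array name and loop are hypothetical, not the actual multiple_scatter code:

```fortran
! Hypothetical sketch: the interior sweep reads neighbouring layers, so the
! boundary entries must be set explicitly before the loop; otherwise they
! hold whatever was left in memory and failures look random.
program layer_sweep_demo
   implicit none
   integer, parameter :: nlayers = 5
   real    :: par_level(0:nlayers+1)  ! hypothetical layer-by-layer light level
   integer :: i, im1, ip1

   par_level(0)         = 0.0         ! bottom boundary: must be initialised
   par_level(nlayers+1) = 1.0         ! top boundary (normalised incoming light)
   par_level(1:nlayers) = 0.5         ! interior first guess

   ! Interior sweep that mixes each layer with its neighbours.
   do i = 1, nlayers
      im1 = i - 1
      ip1 = i + 1
      par_level(i) = 0.5 * (par_level(im1) + par_level(ip1))
   end do

   print *, par_level(1:nlayers)
end program layer_sweep_demo
```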
FYI, I'm now running a test with the ED2 mainline version (no changes to CBR) to make sure it's something there and not a weird artifact in my branch that's causing all of these issues.
The ip1 and im1 thing looks like it should be OK; I did a double check and [...] Do you get the same issues with ICANRAD=2? We are close! I'm sorry the par_level variables are being such a pain.
I guess one other thing to note is that my PFT settings are quite dramatically different from the ED defaults, and that could definitely be impacting PAR things. It doesn't explain the randomness of the errors, though. If you want to check out those settings anyway, they're also on github: https://github.com/crollinson/ED_Processing/blob/master/PalEON_Phase1a.v2.xml Thanks for being so responsive and helping me figure out what's going wrong! Once we get SMP fully working, you'll be the hero of the PalEON team for speeding up our millennial runs so much.
I took a quick look at one of your ED2INs, Christy: https://raw.githubusercontent.com/crollinson/ED_Processing/master/spin_finish_ED2IN/ED2IN.PBL One thing I noticed is that you have a relatively large timestep set for [...]
I just confirmed that I get the same errors with the github mainline branch with ICANRAD = 2 and ICANRAD = 1 as well. I haven't been having stability issues and my gh24 pre-SMP branch is working fine, but I'll keep in mind bumping the timestep down if I start encountering issues.
Pretty sure I just found the problem!!! The par_level variables were missing from ed_type_init.f90. Made the changes and am pulling them into my line for testing, but this would explain everything.
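In case it helps anyone hitting the same thing, here is a minimal sketch of the kind of fix being described: newly added state arrays have to be zeroed during initialization. The routine and variable shapes are hypothetical, not the actual ed_type_init.f90 code:

```fortran
! Hypothetical illustration: arrays the radiation driver and average_utils
! accumulate into need a defined starting value, otherwise they begin from
! garbage and the averages eventually produce a SIGFPE.
program par_level_init_demo
   implicit none
   integer, parameter :: ncanlayers = 8
   real, allocatable  :: par_level_beam(:), par_level_diffu(:), par_level_diffd(:)

   allocate(par_level_beam(ncanlayers), par_level_diffu(ncanlayers), par_level_diffd(ncanlayers))
   call init_par_levels()
   print *, 'initialised par_level arrays, first element = ', par_level_beam(1)

contains

   subroutine init_par_levels()
      ! Explicit zeroing of every newly added state array.
      par_level_beam (:) = 0.0
      par_level_diffu(:) = 0.0
      par_level_diffd(:) = 0.0
   end subroutine init_par_levels

end program par_level_init_demo
```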
Hey Christy, did it turn out that was the issue? Are your SMP runs stable now with the hybrid integrator? If so, can you run w/ DTLSM > 180?
Hey @DanielNScott. SMP with the Hybrid Integrator has been working fantastically for me. I completed a set of my 12000 yr runs at 6 northeastern sites over the weekend (~48 hrs at DTLSM = 900) and had one stability interruption that then went through just fine when I dropped DTLSM from 900 to 600 (15 min to 10). I'm redoing them with the version from yesterday's updated pull request just to double check, but the numbers I got from the first run and the spin I just finished look completely reasonable based on quick glances at snapshots (no thorough analysis yet).
The current EDModel master branch has not shown any signs of instability [...]
Just want to report an issue with SMP. I can compile it with "-fopenmp", but when I try to run it, I get the error below; if I remove -fopenmp and compile, the model runs. Any thoughts on this?

```
+--- Parallel info: -------------------------------------+
```
I think I just realized what's going on! Before you try running the SMP ED, you need to set it up for multi-threading, which means logging on to the cluster with parallelization on. To do this on the BU server: [...]
These are my compilation flags:

```
Compile flags ------------------------------------------------
CMACH=PC_LINUX1
-debug inline_debug_info -debug-parameters all -traceback -ftrapuv
#F_Opts= -O3
MPI Flags ----------------------------------------------------
MPI_PATH=
```
Did you set your stack limit to unlimited?

```
ulimit -s unlimited
```
Thanks @crollinson, that was the issue and now it works.
Hi All,
I put the Shared Memory Parallelism commits on the master. This will allow for the splitting of radiation scattering, photosynthesis and thermodynamics of different patches to different CPU cores.
- This has been tested using RK4 and Hybrid integration.
- This has had limited testing on gridded runs.
- This has had no testing on coupled runs (but I don't suspect any breakage).
If you don't want to use shared memory, just keep doing what you have done in the past and nothing should change.
If you do want to use it, follow these steps for a single polygon run:
"export OMP_NUM_THREADS=X" where X is the number of cores you wish to use. REMEMBER: These cores must share RAM, so you are limited by the number of cores that are on one node.
This release is experimental for the time being. If you have trouble, crashes, or poor reproducibility of previous work, revert to commit 2a5d68e, i.e.:
git checkout 2a5d68e