Replies: 1 comment · 11 replies
Can you try running your workload with EDMM enabled in the Gramine manifest? At least your forks will be much faster, as it won't try to allocate all EPC memory at startup.
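For reference, EDMM is switched on through the `sgx.edmm_enable` manifest key. The sketch below is a hypothetical helper (the function name and the sample manifest text are illustrative, not taken from this thread) that sets the key in a manifest template. It assumes the manifest uses only top-level dotted keys, and that the Gramine version, CPU and host kernel in use support EDMM.

```python
def enable_edmm(manifest_text: str) -> str:
    """Return manifest text with `sgx.edmm_enable = true` set.

    Hypothetical helper: it edits plain text lines, so Jinja placeholders
    such as {{ execdir }} in a manifest template are left untouched.
    Assumes the manifest has no [table] headers (top-level dotted keys only).
    """
    key = "sgx.edmm_enable"
    lines = manifest_text.splitlines()
    for i, line in enumerate(lines):
        stripped = line.strip()
        # Ignore commented-out settings.
        if stripped.startswith("#"):
            continue
        # Overwrite an existing assignment of the key, if there is one.
        if stripped.split("=")[0].strip() == key:
            lines[i] = f"{key} = true"
            break
    else:
        # The key was not assigned anywhere: append it.
        lines.append(f"{key} = true")
    return "\n".join(lines) + "\n"


sample = 'sgx.enclave_size = "16G"\nsgx.max_threads = 1024\n'
print(enable_edmm(sample))
```

As far as I understand the Gramine documentation, with EDMM enabled `sgx.enclave_size` acts as an upper bound rather than the amount of EPC committed at enclave creation, which is why forks should get cheaper.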
Quoting my message to you from a few days ago:
I'm sorry, but if you don't want to take the time to write your bug reports well and ensure they are high-quality, then don't expect us to spend time analyzing them. I won't be typing all the errors from the log manually into code search... Anyways, this is really old code, and we don't provide any support for it. You should never use it in production, because there are known security bugs in it.
Environmental Information
Problem Description
Hello Gramine team, I'm very sorry for my previous behavior. I'm working with Gramine on a project where we run Spark tasks in standalone mode (on k8s). When executing a PySpark task, Spark forks Python subprocesses to run the Python code and communicates with them over sockets. However, we have run into some problems in the production environment.
For complete logs, see the compressed package, which contains the complete Spark executor logs (both successful and failed runs). Thanks for your help :)
I'm sorry to be blunt, but @mkow's point apparently did not get across, so at the risk of repeating ourselves, I'll state it once more: please don't use old insecure versions. The correct solution is to try to reproduce on the latest version, and if the bug persists, then we'll take a second look. I'm not willing to even look at the logs, it's a pure waste of our time, not least because even if we found a problem there, we wouldn't release 1.3.2 at this point; we don't have the resources to maintain stable branches. Any fix would be included only in the latest version (I hope we'll release 1.8 soon), so to get the fix you'd have to update anyway.
Please consider notifying your customer that the deployment is currently vulnerable to all kinds of data leaks and RCE because of this bug (among others): #1796. TL;DR is, after fork,
This was fixed in 1.6.2 BTW: https://github.com/gramineproject/gramine/releases/tag/v1.6.2
I hope that information will help you persuade whoever is responsible for the decision to update.
Thank you very much for taking the time to reply despite your busy schedule. We will evaluate and then decide whether to try the latest version of Gramine. If the problem still occurs, I will report back. Thank you very much for your support.
Gramine v1.7 with Spark problems
Gramine Version: v1.7
Gramine Manifest:
```toml
libos.entrypoint = "{{ execdir }}/bash"
loader.entrypoint = "file:{{ gramine.libos }}"
#loader.pal_internal_mem_size = "512M"
loader.log_level = "{{ log_level }}"
loader.insecure__use_host_env = true
loader.env.LD_PRELOAD = ""
sys.enable_extra_runtime_domain_names_conf = true
sys.insecure__allow_eventfd = true
#loader.insecure__use_cmdline_argv = true
loader.insecure_disable_aslr = true
#loader.argv_src_file = "file:/ppml/trusted-big-data-ml/secured_argvs"
loader.argv_src_file = "file:/ppml/secured_argvs"
sgx.remote_attestation = "dcap"
sgx.ra_client_spid = ""
sgx.allow_file_creation = true
sgx.debug = false
#sgx.nonpie_binary = true
sgx.enclave_size = "16G"
sgx.max_threads = 1024
sgx.file_check_policy = "allow_all_but_log"
#sgx.static_address = 1
sgx.isvprodid = 1
sgx.isvsvn = 3
sys.stack.size = "64M"
# https://github.com/gramineproject/examples/blob/v1.7/openjdk/java.manifest.template
# https://github.com/gramineproject/gramine/discussions/1704
sgx.use_exinfo = true
loader.env.LD_LIBRARY_PATH = "/lib:{{ arch_libdir }}:/usr{{ arch_libdir }}:/usr/lib/python3.9/lib:/usr/lib:{{ jdk_home }}:{{ jdk_home }}/lib/amd64/jli:/ppml/lib"
loader.env.PATH = "{{ execdir }}:/usr/sbin:/usr/bin:/:/sbin:/bin:{{ jdk_home }}/bin"
#loader.env.PYTHONHOME = "/usr/lib/python3.9"
loader.env.PYTHONPATH = "/usr/lib/python3.9:/usr/lib/python3.9/lib-dynload:/usr/local/lib/python3.9/dist-packages:/usr/lib/python3/dist-packages:/ppml/bigdl-ppml/src"
loader.env.JAVA_HOME = "{{ jdk_home }}"
loader.env.JAVA_OPTS = "'-Djava.library.path={{ jdk_home }}/lib -Dsun.boot.library.path={{ jdk_home }}/lib'"
loader.env.SPARK_USER = "{{ spark_user }}"
loader.env.SPARK_SCALA_VERSION = "2.12"
loader.env.SPARK_HOME = "/opt/spark"
loader.env.SPARK_CONF_DIR = "/opt/spark/conf"
loader.env.SPARK_JARS_DIR = "/opt/spark/jars"
loader.env.PYSPARK_PYTHON = "/usr/bin/python3.9"
# Python's NumPy spawns as many threads as there are CPU cores, and each thread
# consumes a chunk of memory, so on large machines 1G enclave size may be not enough.
# We limit the number of spawned threads via OMP_NUM_THREADS env variable.
loader.env.OMP_NUM_THREADS = "4"
fs.mounts = [
{ path = "{{ arch_libdir }}", uri = "file:{{ arch_libdir }}" },
{ path = "/usr{{ arch_libdir }}", uri = "file:/usr{{ arch_libdir }}" },
{ path = "{{ execdir }}", uri = "file:{{ execdir }}" },
{ path = "/usr/lib", uri = "file:/usr/lib" },
{ path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
{ path = "/usr/local", uri = "file:/usr/local" },
{ path = "/etc", uri = "file:/etc" },
{ path = "/usr/local/etc", uri = "file:/etc" },
{ path = "/opt", uri = "file:/opt" },
{ path = "/bin", uri = "file:/bin" },
{ path = "/tmp", uri = "file:/tmp" },
{ path = "/usr/lib/python3.9", uri = "file:/usr/lib/python3.9" },
{ path = "/usr/lib/python3/dist-packages", uri = "file:/usr/lib/python3/dist-packages" },
{ path = "/root/.kube/", uri = "file:/root/.kube/" },
{ path = "/root/.keras", uri = "file:/root/.keras" },
{ path = "/root/.m2", uri = "file:/root/.m2" },
{ path = "/root/.zinc", uri = "file:/root/.zinc" },
{ path = "/root/.cache", uri = "file:/root/.cache" },
{ path = "/usr/lib/gcc", uri = "file:/usr/lib/gcc" },
{ path = "/ppml", uri = "file:/ppml" },
{ path = "/root/.jupyter", uri = "file:/root/.jupyter" },
{ type = "encrypted", path = "/ppml/encrypted-fs", uri = "file:/ppml/encrypted-fs", key_name = "_sgx_mrsigner" },
{ type = "encrypted", path = "/ppml/encrypted-fsd", uri = "file:/ppml/encrypted-fsd", key_name = "sgx_data_key" },
{ type = "encrypted", path = "/ppml/data/keys/", uri = "file:/ppml/data/keys/", key_name = "_sgx_mrsigner" },
{ path = "/opt/spark/conf", uri = "file:/opt/spark/conf-copy" },
{ path = "/opt/spark/logs-conf", uri = "file:/opt/spark/logs-conf" },
{ path = "/opt/spark/pod-template", uri = "file:/opt/spark/pod-template-copy" },
{ path = "/opt/spark/work-dir", uri = "file:/opt/spark/work-dir" },
{ path = "/app/log", uri = "file:/app/log" },
]
# { path = "{{ gramine.runtimedir() }}/etc/localtime", uri = "file:/etc" },
sgx.trusted_files = [
"file:{{ gramine.libos }}",
"file:{{ gramine.runtimedir() }}/",
"file:{{ arch_libdir }}/",
"file:/usr/{{ arch_libdir }}/",
"file:{{ execdir }}/",
#"file:/ppml/trusted-big-data-ml/secured_argvs",
"file:/ppml/secured_argvs",
"file:/ppml/scripts/ailand-kms/",
]
sgx.allowed_files = [
"file:scripts/",
"file:/etc",
"file:/tmp",
"file:{{ jdk_home }}",
"file:/ppml",
"file:{{ python_home }}",
"file:/usr/lib/python3",
"file:/usr/local/lib/python3.9/dist-packages",
"file:/root/.keras",
"file:/root/.m2",
"file:/root/.zinc",
"file:/root/.cache",
"file:/usr/lib/gcc",
"file:/root/.kube/config",
"file:/etc/localtime",
"file:/opt/spark",
"file:/usr/bin",
"file:/app/log",
]
sys.ioctl_structs.ifreq = [
{ size = 16, direction = "out" }, # ifr_name
{ size = 2, direction = "in" }, # ifr_flags
]
# below IOCTL is for socket ioctl tests (e.g. `sockioctl01`); note that there is no additional
# sanitization of these IOCTLs but this is only for testing anyway
sys.ioctl_structs.ifconf = [
# When ifc_req is NULL, direction of ifc_len is out. Otherwise, direction is in.
{ size = 4, direction = "inout", name = "ifc_len" }, # ifc_len
{ size = 4, direction = "none" }, # padding
{ ptr = [ { size = "ifc_len", direction = "in" } ] }, # ifc_req
]
sys.allowed_ioctls = [
{ request_code = 0x8912, struct = "ifconf" }, # SIOCGIFCONF
{ request_code = 0x8913, struct = "ifreq" }, # SIOCGIFFLAGS
]
```
When we switched Gramine to v1.7, many errors occurred when executing Spark tasks. We suspect that the problems are most likely caused by an improper Gramine configuration, but because we are not familiar with the latest version, we don't know how to configure it. Please give us some guidance or suggest test methods based on our scenario and the error logs below. We are stuck on these problems right now.
```
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/ppml/spark-3.1.3/jars/slf4j-reload4j-1.7.35.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/ppml/spark-3.1.3/jars/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Reload4jLoggerFactory]
INFO [2024-10-11 09:02:46,247] ({main} Utils.scala[initDaemon]:2648) - Started daemon with process name: 3@spark-pi-d11852927ad74288-exec-3
INFO [2024-10-11 09:02:51,291] ({main} Logging.scala[logInfo]:57) - Registering signal handler for TERM
INFO [2024-10-11 09:02:51,714] ({main} Logging.scala[logInfo]:57) - Registering signal handler for HUP
INFO [2024-10-11 09:02:51,714] ({main} Logging.scala[logInfo]:57) - Registering signal handler for INT
INFO [2024-10-11 09:05:07,213] ({main} Logging.scala[logInfo]:57) - Changing view acls to: root
INFO [2024-10-11 09:05:07,398] ({main} Logging.scala[logInfo]:57) - Changing modify acls to: root
INFO [2024-10-11 09:05:07,580] ({main} Logging.scala[logInfo]:57) - Changing view acls groups to:
INFO [2024-10-11 09:05:07,707] ({main} Logging.scala[logInfo]:57) - Changing modify acls groups to:
INFO [2024-10-11 09:05:07,988] ({main} Logging.scala[logInfo]:57) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
WARN [2024-10-11 09:06:31,146] ({netty-rpc-connection-0} Logging.scala[logWarning]:69) - NettyRpcEnv.createClient address: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
INFO [2024-10-11 09:06:31,577] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:233) - TransportClientFactory.createClient2 remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
INFO [2024-10-11 09:06:31,668] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:154) - TransportClientFactory.createClient remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
INFO [2024-10-11 09:06:38,883] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:190) - TransportClientFactory.createClient resolvedAddress: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
INFO [2024-10-11 09:06:39,654] ({netty-rpc-connection-0} TransportClientFactory.java[createClient]:194) - DNS resolution failed for yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078 took 6619 ms
WARN [2024-10-11 09:06:49,632] ({netty-rpc-connection-0} MacAddressUtil.java[defaultMachineId]:142) - Failed to find a usable hardware address from the network interfaces; using random bytes: 63:76:62:17:d6:9c:03:0a
WARN [2024-10-11 09:07:19,477] ({netty-rpc-connection-1} Logging.scala[logWarning]:69) - NettyRpcEnv.createClient address: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
INFO [2024-10-11 09:07:19,477] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:233) - TransportClientFactory.createClient2 remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
INFO [2024-10-11 09:07:19,555] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:154) - TransportClientFactory.createClient remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
INFO [2024-10-11 09:07:19,664] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:190) - TransportClientFactory.createClient resolvedAddress: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
INFO [2024-10-11 09:07:19,706] ({netty-rpc-connection-1} TransportClientFactory.java[createClient]:197) - DNS resolution failed for yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078 took 73 ms
WARN [2024-10-11 09:07:22,097] ({netty-rpc-connection-2} Logging.scala[logWarning]:69) - NettyRpcEnv.createClient address: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
INFO [2024-10-11 09:07:22,162] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:233) - TransportClientFactory.createClient2 remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
INFO [2024-10-11 09:07:22,178] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:154) - TransportClientFactory.createClient remoteHost: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svcremotePort:7078
INFO [2024-10-11 09:07:22,312] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:190) - TransportClientFactory.createClient resolvedAddress: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
INFO [2024-10-11 09:07:22,312] ({netty-rpc-connection-2} TransportClientFactory.java[createClient]:197) - DNS resolution failed for yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078 took 45 ms
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1748)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:402)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:391)
at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:422)
at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
at scala.collection.immutable.Range.foreach(Range.scala:158)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:420)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
... 4 more
Caused by: java.io.IOException: Failed to connect to yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc:7078
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:294)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:221)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:235)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:205)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: yeqc-pyspark-7edbb7927ac621f5-driver-svc.dios-task.svc
at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:156)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:153)
at java.security.AccessController.doPrivileged(Native Method)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:153)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:41)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:61)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:53)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:55)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:31)
at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:106)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:206)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:46)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:180)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:166)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:551)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setSuccess0(DefaultPromise.java:604)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:984)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:504)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:417)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:474)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
... 1 more
```
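The `UnknownHostException` at the bottom of the trace suggests that name resolution is what fails inside the enclave. As one possible test method, the sketch below performs the same kind of lookup from Python, so the check can be run natively and under Gramine and the results compared. The function name `check_resolution` is hypothetical, and the script only reports the symptom; it does not identify the cause.

```python
import socket
import time


def check_resolution(host: str, port: int):
    """Try to resolve host:port; return (ok, detail, elapsed_seconds)."""
    start = time.monotonic()
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        # Each entry is (family, type, proto, canonname, sockaddr);
        # sockaddr[0] is the resolved IP address.
        detail = sorted({info[4][0] for info in infos})
        ok = True
    except socket.gaierror as exc:
        # Resolution failed; keep the resolver's error message.
        detail = str(exc)
        ok = False
    return ok, detail, time.monotonic() - start


# Placeholder target: replace with the driver service name and port
# from the executor log when running the check for real.
HOST = "localhost"
PORT = 7078

ok, detail, elapsed = check_resolution(HOST, PORT)
print(f"host={HOST} resolved={ok} detail={detail} elapsed={elapsed:.3f}s")
```

If the lookup succeeds natively but fails under Gramine with the same host name, that would point at the enclave's view of the resolver configuration rather than at the cluster DNS.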
A very frequent error is:
```bash
WARN [2024-10-11 09:06:49,632] ({netty-rpc-connection-0} MacAddressUtil.java[defaultMachineId]:142) - Failed to find a usable hardware address from the network interfaces; using random bytes: 63:76:62:17:d6:9c:03:0a
```
Thanks.
I am now running Spark tasks on the BigDL Gramine k8s cluster.
Task configuration: driver with 2 cores and 8 GB EPC; 15 executors, each with 2 cores and 8 GB EPC.
Cluster configuration: 4 virtual machines with 128 GB EPC.
Task type: PySpark.
When Spark executes PySpark, it needs to fork child processes to run the Python workers, and a Gramine fork creates an enclave instance of the same size as the parent process. As a result, fork failures appear intermittently while the task runs, and the process cannot be created. Spark retries 4 times and then the job fails. However, the current task configuration does not fully load the cluster (according to my current understanding, 384 GB of EPC is consumed), so I don't know how to tune it. Please give me some advice. Is the child process's EPC not reclaimed immediately after the Gramine fork completes?
I also have another question. The Spark configuration provided by Intel sets spark.python.worker.reuse=false. My current understanding is that if workers are reused, there is no need to fork repeatedly, and skipping the fork should be faster. Please help with these questions.
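One way to arrive at the 384 GB figure is sketched below. It assumes (this is a guess, not something stated in the post) that the driver and every executor fork up to two Python workers, one per core, and that each fork gets an enclave as large as its parent. The function and parameter names are illustrative only.

```python
def estimate_epc_gb(num_executors, enclave_gb, workers_per_process, include_driver=True):
    """Rough EPC demand if every forked worker gets a full-size enclave."""
    processes = num_executors + (1 if include_driver else 0)
    # EPC held by the JVM enclaves themselves (driver + executors).
    jvm_epc = processes * enclave_gb
    # Assumption: each forked Python worker's enclave is as large as its parent's.
    worker_epc = processes * workers_per_process * enclave_gb
    return jvm_epc + worker_epc


cluster_epc_gb = 4 * 128  # 4 virtual machines with 128 GB EPC each
demand_gb = estimate_epc_gb(num_executors=15, enclave_gb=8, workers_per_process=2)
print(demand_gb, cluster_epc_gb)  # prints: 384 512
```

Under these assumptions the steady-state demand stays below the cluster total, so the estimate alone does not explain the fork failures; how pods are packed onto the individual 128 GB machines is not captured here.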